Design a scalable system that automatically classifies user-uploaded content across multiple modalities -- text documents, images, and videos -- using machine learning models. The platform accepts content via an API or web interface, routes it through modality-specific preprocessing and inference pipelines, and returns category labels with confidence scores. Users can track processing status for asynchronous jobs, retrieve historical classification results, and provide feedback on incorrect labels to improve future model accuracy.
The system must handle millions of uploads daily from individual users and enterprise clients, with content sizes ranging from kilobytes of text to gigabytes of video. Classification accuracy directly feeds downstream use cases like content moderation, search indexing, and personalized recommendations, so model versioning, A/B testing, and safe rollout are essential operational concerns. Interviewers use this problem to assess how you separate hot synchronous paths from heavy batch processing, manage GPU and CPU inference fleets, enforce idempotency, and build an evolvable platform that can adopt new models without disrupting users.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to understand how you serve inference at scale while maintaining low latency and high availability, including how you separate model serving from business logic and handle different model types per modality.
Each content type has different preprocessing requirements and resource profiles. Interviewers look for a thoughtful pipeline design that balances throughput and resource utilization.
Storing raw uploads, extracted features, classification results, and model artifacts requires careful separation of access patterns and cost tiers.
Managing multiple model versions in production, running controlled experiments, and rolling out improvements safely are critical operational concerns. Interviewers expect discussion of the full model lifecycle.
ML inference is expensive, especially for video. Interviewers want to see you balance performance with cost efficiency.
Confirm the expected distribution of uploads across text, images, and videos. Ask how many categories exist in the taxonomy and whether it is flat or hierarchical. Clarify latency expectations for different user tiers (free versus paid). Determine whether real-time streaming classification is needed or only file uploads. Ask about typical file sizes and video durations, and whether the system must detect harmful content as a special high-priority category.
Sketch the major components: an API gateway handling authentication, rate limiting, and routing to synchronous or asynchronous paths based on content type and size. Object storage for raw uploads. A message queue feeding modality-specific worker pools for preprocessing and feature extraction. A model serving layer running multiple versions behind load balancers, with GPU nodes for images and videos and CPU nodes for text. A results database storing classification metadata and confidence scores. A caching layer for content-hash-based deduplication. A model management service tracking versions, experiments, and deployment status.
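The gateway's choice between the synchronous and asynchronous path reduces to a check on modality and content size. A minimal sketch, where the size thresholds are illustrative assumptions rather than prescribed values:

```python
# Per-modality cutoffs (bytes) for the synchronous path; values are
# assumptions and would be tuned against the real latency budget.
SYNC_SIZE_LIMIT = {
    "text": 256 * 1024,        # small documents classify inline
    "image": 4 * 1024 * 1024,  # small images fit the latency budget
    "video": 0,                # video is always asynchronous
}

def choose_path(modality: str, size_bytes: int) -> str:
    """Return 'sync' if the upload fits the inline latency budget, else 'async'."""
    limit = SYNC_SIZE_LIMIT.get(modality, 0)  # unknown modalities go async
    return "sync" if size_bytes <= limit else "async"
```

In practice the decision might also weigh user tier and current queue depth, but keeping the rule simple at the gateway makes its behavior easy to reason about and test.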
Walk through the critical path for an image classification request. The API computes a SHA-256 hash of the uploaded content and checks the cache. On a cache miss, it stores the image in object storage and publishes a job message to the image processing queue. A worker picks up the job, downloads the image, applies preprocessing (resize, normalize), and sends the prepared tensor to the model serving cluster. The model server returns top-K labels with confidence scores. The worker writes the results to the database, updates the cache, and marks the job complete. For synchronous requests (small images), the API holds the connection open and polls for completion within the latency budget. For asynchronous requests, the API returns the job ID immediately. Discuss circuit breakers: if the model serving cluster is overloaded, fall back to cached results for similar content or return a graceful degradation response.
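The worker's critical path (hash, dedup check, preprocess, infer, cache) can be sketched as follows. `FakeModelServer` and the dict-backed cache are stand-ins for the real serving cluster and Redis; the preprocessing stub and label values are illustrative:

```python
import hashlib

def preprocess(payload: bytes) -> bytes:
    # Stand-in for resize/normalize; a real worker would emit a tensor.
    return payload

class FakeModelServer:
    """Placeholder for the model serving cluster."""
    def predict(self, tensor):
        return {"labels": [("cat", 0.91), ("dog", 0.06)]}  # top-K with scores

def classify_image(payload: bytes, cache: dict, model_server) -> dict:
    """Content-hash dedup, then preprocess and infer on a miss."""
    key = hashlib.sha256(payload).hexdigest()
    if key in cache:  # identical bytes were already classified
        return cache[key]
    result = model_server.predict(preprocess(payload))
    cache[key] = result  # a real system would also persist to the results DB
    return result
```

Hashing the raw bytes before any preprocessing is the design choice that makes retries idempotent: a duplicate upload or a redelivered queue message hits the cache instead of paying for inference twice.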
Cover the feedback loop: when users mark classifications as incorrect, store the correction and periodically retrain or fine-tune models using accumulated feedback. Discuss monitoring: track latency percentiles per modality, model confidence distributions to detect data drift, GPU utilization, queue depths, and cache hit rates. Address scalability: horizontally scale stateless workers based on queue depth, partition the results database by content type or time, and use Redis Cluster for distributed caching. Mention security: encrypt uploads in transit and at rest, scan for malicious content before processing, and enforce per-user upload quotas. Briefly discuss cost: use spot instances for batch video processing, implement intelligent routing that sends low-value requests to cheaper model tiers, and monitor per-request inference cost.