Design a scalable system that automatically classifies user-uploaded content across multiple modalities -- text documents, images, and videos -- using machine learning models. The platform accepts content via an API or web interface, routes it through modality-specific preprocessing and inference pipelines, and returns category labels with confidence scores. Users can track processing status for asynchronous jobs, retrieve historical classification results, and provide feedback on incorrect labels to improve future model accuracy.
The system must handle millions of uploads daily from individual users and enterprise clients, with content sizes ranging from kilobytes of text to gigabytes of video. Classification accuracy directly feeds downstream use cases like content moderation, search indexing, and personalized recommendations, so model versioning, A/B testing, and safe rollout are essential operational concerns. Interviewers use this problem to assess how you separate hot synchronous paths from heavy batch processing, manage GPU and CPU inference fleets, enforce idempotency, and build an evolvable platform that can adopt new models without disrupting users.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to understand how you serve inference at scale while maintaining low latency and high availability, including how you separate model serving from business logic and handle different model types per modality.
Each content type has different preprocessing requirements and resource profiles. Interviewers look for a thoughtful pipeline design that balances throughput and resource utilization.
Storing raw uploads, extracted features, classification results, and model artifacts requires careful separation of access patterns and cost tiers.
Managing multiple model versions in production, running controlled experiments, and rolling out improvements safely are critical operational concerns. Interviewers expect discussion of the full model lifecycle.
ML inference is expensive, especially for video. Interviewers want to see you balance performance with cost efficiency.
Confirm the expected distribution of uploads across text, images, and videos. Ask how many categories exist in the taxonomy and whether it is flat or hierarchical. Clarify latency expectations for different user tiers (free versus paid). Determine whether real-time streaming classification is needed or only file uploads. Ask about typical file sizes and video durations, and whether the system must detect harmful content as a special high-priority category.
Sketch the major components: an API gateway handling authentication, rate limiting, and routing to synchronous or asynchronous paths based on content type and size. Object storage for raw uploads. A message queue feeding modality-specific worker pools for preprocessing and feature extraction. A model serving layer running multiple versions behind load balancers, with GPU nodes for images and videos and CPU nodes for text. A results database storing classification metadata and confidence scores. A caching layer for content-hash-based deduplication. A model management service tracking versions, experiments, and deployment status.
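The gateway's choice between the synchronous and asynchronous path reduces to a check on modality and content size. A minimal sketch, where the size thresholds are illustrative assumptions rather than prescribed values:

```python
# Per-modality cutoffs (bytes) for the synchronous path; values are
# assumptions and would be tuned against the real latency budget.
SYNC_SIZE_LIMIT = {
    "text": 256 * 1024,        # small documents classify inline
    "image": 4 * 1024 * 1024,  # small images fit the latency budget
    "video": 0,                # video is always asynchronous
}

def choose_path(modality: str, size_bytes: int) -> str:
    """Return 'sync' if the upload fits the inline latency budget, else 'async'."""
    limit = SYNC_SIZE_LIMIT.get(modality, 0)  # unknown modalities go async
    return "sync" if size_bytes <= limit else "async"
```

In practice the decision might also weigh user tier and current queue depth, but keeping the rule simple at the gateway makes its behavior easy to reason about and test.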
Walk through the critical path for an image classification request. The API computes a SHA-256 hash of the uploaded content and checks the cache. On a cache miss, it stores the image in object storage and publishes a job message to the image processing queue. A worker picks up the job, downloads the image, applies preprocessing (resize, normalize), and sends the prepared tensor to the model serving cluster. The model server returns top-K labels with confidence scores. The worker writes the results to the database, updates the cache, and marks the job complete. For synchronous requests (small images), the API holds the connection open and polls for completion within the latency budget. For asynchronous requests, the API returns the job ID immediately. Discuss circuit breakers: if the model serving cluster is overloaded, fall back to cached results for similar content or return a graceful degradation response.
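The worker's critical path (hash, dedup check, preprocess, infer, cache) can be sketched as follows. `FakeModelServer` and the dict-backed cache are stand-ins for the real serving cluster and Redis; the preprocessing stub and label values are illustrative:

```python
import hashlib

def preprocess(payload: bytes) -> bytes:
    # Stand-in for resize/normalize; a real worker would emit a tensor.
    return payload

class FakeModelServer:
    """Placeholder for the model serving cluster."""
    def predict(self, tensor):
        return {"labels": [("cat", 0.91), ("dog", 0.06)]}  # top-K with scores

def classify_image(payload: bytes, cache: dict, model_server) -> dict:
    """Content-hash dedup, then preprocess and infer on a miss."""
    key = hashlib.sha256(payload).hexdigest()
    if key in cache:  # identical bytes were already classified
        return cache[key]
    result = model_server.predict(preprocess(payload))
    cache[key] = result  # a real system would also persist to the results DB
    return result
```

Hashing the raw bytes before any preprocessing is the design choice that makes retries idempotent: a duplicate upload or a redelivered queue message hits the cache instead of paying for inference twice.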
Cover the feedback loop: when users mark classifications as incorrect, store the correction and periodically retrain or fine-tune models using accumulated feedback. Discuss monitoring: track latency percentiles per modality, model confidence distributions to detect data drift, GPU utilization, queue depths, and cache hit rates. Address scalability: horizontally scale stateless workers based on queue depth, partition the results database by content type or time, and use Redis Cluster for distributed caching. Mention security: encrypt uploads in transit and at rest, scan for malicious content before processing, and enforce per-user upload quotas. Briefly discuss cost: use spot instances for batch video processing, implement intelligent routing that sends low-value requests to cheaper model tiers, and monitor per-request inference cost.