Problem Statement
Design a recommendation engine that analyzes user preferences and behavior to suggest relevant content, products, or services with personalized rankings. Think of the systems behind Netflix, Spotify, or Amazon that learn from what you watch, listen to, or buy and surface items you are likely to engage with next. Users browse personalized lists that update as they interact, and they can filter or refine results by categories, genres, or context.
The core engineering challenges are combining offline model training with online serving under strict latency budgets, ingesting high-volume clickstream data for feature computation, handling the cold-start problem for new users and items, and balancing relevance with diversity and exploration. Your architecture must support hundreds of millions of users, maintain sub-100ms serving latency for recommendation requests, and evolve through A/B testing and feedback loops without disrupting the production experience.
Key Requirements
Functional
- Personalized ranked lists -- users see a tailored, scored list of recommended items based on their viewing history, ratings, and contextual signals
- Filtering and refinement -- users can narrow recommendations by genre, category, price range, or other attributes and see personalized results within those constraints
- Explicit feedback -- users can like, dislike, save, or hide items, and these signals immediately influence subsequent recommendations
- Freshness -- recommendations refresh as users interact, incorporating recent behavior without requiring a manual page reload
Non-Functional
- Scalability -- support 200 million daily active users generating billions of interaction events per day, with 500,000+ recommendation requests per second at peak
- Reliability -- 99.9% availability for the serving path with graceful fallback to popularity-based recommendations if personalization is degraded
- Latency -- end-to-end recommendation serving within 100ms at p95, including candidate retrieval, scoring, and re-ranking
- Freshness -- user actions reflected in recommendation features within 5 minutes via near-real-time feature updates
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Multi-Stage Recommendation Pipeline
Interviewers expect you to decompose the recommendation flow into distinct stages rather than treating it as a single model call. The pipeline structure reveals how you balance quality, latency, and cost.
Hints to consider:
- Stage 1 (Candidate Generation): retrieve a few hundred candidates from multiple sources -- collaborative filtering, content-based similarity, trending/popular items -- using lightweight approximate methods
- Stage 2 (Scoring/Ranking): apply a heavier ML model (e.g., gradient-boosted trees or a neural ranker) to score each candidate using user features, item features, and contextual signals
- Stage 3 (Re-ranking): apply business rules -- diversity constraints, deduplication, content policy filtering, and promotional boosts -- to produce the final ordered list
- Each stage narrows the candidate set so expensive computation is only applied to the most promising items
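The funnel described above can be sketched in a few lines. This is a toy illustration, not a production implementation: the retrieval sources return hard-coded ID lists, and `_stub_score` is a hash-based stand-in for a real ranking model.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Candidate:
    item_id: str
    source: str       # which retriever produced this candidate
    score: float = 0.0

def _stub_score(user_id: str, item_id: str) -> float:
    # Stand-in for a heavy ranker (GBT or neural): deterministic pseudo-score.
    digest = hashlib.md5(f"{user_id}:{item_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

def retrieve(user_id: str) -> list[Candidate]:
    # Stage 1: merge cheap candidates from several sources, dedup by item_id.
    sources = {
        "collaborative": [f"cf-{i}" for i in range(200)],
        "content":       [f"cb-{i}" for i in range(200)],
        "trending":      [f"tr-{i}" for i in range(100)],
    }
    pool: dict[str, Candidate] = {}
    for source, items in sources.items():
        for item_id in items:
            pool.setdefault(item_id, Candidate(item_id, source))
    return list(pool.values())

def recommend(user_id: str, k: int = 10) -> list[Candidate]:
    candidates = retrieve(user_id)            # Stage 1: ~500 cheap candidates
    for c in candidates:                      # Stage 2: score only this pool
        c.score = _stub_score(user_id, c.item_id)
    candidates.sort(key=lambda c: c.score, reverse=True)
    # Stage 3 (business-rule re-ranking) would run here; truncation stands in.
    return candidates[:k]
```

The point of the structure is visible in the call counts: the expensive scorer runs ~500 times per request, never once per catalog item.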
2. Feature Store and Real-Time Feature Computation
Recommendations are only as good as their features. Interviewers want to see how you maintain both batch-computed features (historical preferences, item embeddings) and streaming features (session clicks, recent watches).
Hints to consider:
- Use a feature store that serves precomputed feature vectors at low latency (Redis or a dedicated feature serving system) alongside near-real-time aggregations from a stream processor like Flink
- Batch features (user genre affinity, item popularity over 30 days) are computed in periodic jobs and written to the store
- Streaming features (items viewed in the current session, clicks in the last 5 minutes) are computed in Flink consuming from Kafka and written to Redis with short TTLs
- The scoring model reads both batch and streaming features at request time, blending long-term preferences with immediate context
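A minimal sketch of the request-time blend, with plain dicts standing in for the two Redis keyspaces (batch-written vs. Flink-written with a short TTL); the feature names and key format are invented for illustration.

```python
import time

# Stand-in for features written by periodic batch jobs (long-lived keys).
BATCH_FEATURES = {"user:42": {"genre_affinity_drama": 0.8, "avg_watch_min": 34.0}}
# Stand-in for Flink-written session features; the second element mimics
# a Redis TTL as an absolute expiry timestamp.
STREAM_FEATURES = {"user:42:session": ({"clicks_5m": 3.0}, time.time() + 300)}

DEFAULTS = {"genre_affinity_drama": 0.0, "avg_watch_min": 0.0, "clicks_5m": 0.0}

def get_features(user_id: int) -> dict[str, float]:
    feats = dict(DEFAULTS)
    # Long-term preferences first...
    feats.update(BATCH_FEATURES.get(f"user:{user_id}", {}))
    # ...then overlay immediate session context if it hasn't expired.
    entry = STREAM_FEATURES.get(f"user:{user_id}:session")
    if entry is not None:
        payload, expires_at = entry
        if time.time() < expires_at:
            feats.update(payload)
    return feats
```

Defaulting missing features to neutral values matters: a brand-new session or an expired TTL should degrade to batch-only features, not fail the request.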
3. Cold Start and Exploration
New users have no interaction history, and new items have no engagement data. Interviewers probe your strategies for bootstrapping personalization and avoiding feedback loops that entrench popular items.
Hints to consider:
- For new users, start with popularity-based recommendations, explicit preference surveys (onboarding quiz), or demographic/contextual signals until sufficient interaction data accumulates
- For new items, use content-based features (genre, description embeddings, creator metadata) to place them in the candidate pool alongside established items
- Implement epsilon-greedy or Thompson sampling exploration to inject a controlled fraction of non-personalized candidates, ensuring new items get exposure
- Monitor diversity metrics (category coverage, novelty) alongside engagement metrics to detect and correct filter bubbles
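The epsilon-greedy idea can be shown as a slate-filling sketch: each slot draws from the exploration pool with probability epsilon, otherwise from the personalized ranking. The function name and signature are illustrative, not from any library.

```python
import random

def epsilon_greedy_slate(personalized, exploration_pool, k=10,
                         epsilon=0.1, seed=None):
    # With probability epsilon, a slot is filled from the exploration pool
    # (e.g. fresh items needing exposure); otherwise from the ranked
    # personalized list. A fixed seed makes the draw reproducible in tests.
    rng = random.Random(seed)
    slate, seen = [], set()
    p_iter, e_iter = iter(personalized), iter(exploration_pool)
    while len(slate) < k:
        source = e_iter if rng.random() < epsilon else p_iter
        # Fall back to the other pool if the chosen one is exhausted.
        item = next(source, None) or next(p_iter, None) or next(e_iter, None)
        if item is None:
            break                       # both pools exhausted
        if item not in seen:
            seen.add(item)
            slate.append(item)
    return slate
```

Thompson sampling replaces the fixed epsilon with a posterior draw per item, concentrating exploration on items whose engagement rate is still uncertain; the slate-assembly shape stays the same.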
4. Offline Training and Online Serving Separation
The system has fundamentally different workloads for training models and serving predictions. Interviewers want to see clean separation so training jobs do not impact serving latency and new models can be deployed safely.
Hints to consider:
- Train models offline on historical interaction data stored in a data warehouse or data lake, producing model artifacts that are versioned and stored in a model registry
- The serving layer loads model artifacts and feature vectors into memory for low-latency inference, independent of the training infrastructure
- Use shadow scoring (run new models in parallel with production models, compare outputs, but serve only the production model's results) before promoting a new model
- A/B test new models by routing a fraction of traffic to the challenger model and measuring engagement and business metrics over a defined window
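Shadow scoring can be sketched as below: both models see identical inputs, the challenger's output is only logged, and the served ranking depends solely on the production scores. The models are passed in as plain callables here for illustration.

```python
import logging
import statistics

logger = logging.getLogger("shadow_scoring")

def serve_with_shadow(candidates, prod_model, challenger_model, user_features):
    # Production model alone decides what the user sees.
    prod_scores = [prod_model(user_features, c) for c in candidates]
    # Challenger scores the identical inputs; its output is logged for
    # offline comparison but never served.
    challenger_scores = [challenger_model(user_features, c) for c in candidates]
    delta = statistics.mean(
        abs(p, ) if False else abs(p - s)
        for p, s in zip(prod_scores, challenger_scores))
    logger.info("mean |prod - challenger| score delta: %.4f", delta)
    # Rank by production scores only.
    return [c for _, c in sorted(zip(prod_scores, candidates),
                                 key=lambda pair: pair[0], reverse=True)]
```

Comparing score distributions and rank overlap in shadow mode catches feature-pipeline bugs and training/serving skew before any user is exposed to the challenger.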
5. Event Ingestion and Feedback Loop
Every user interaction (impression, click, watch duration, purchase) feeds back into the system to improve future recommendations. The ingestion pipeline must handle massive volume without losing events or introducing unacceptable delays.
Hints to consider:
- Ingest events through Kafka topics partitioned by user ID for ordered processing, with consumers writing to both the streaming feature pipeline and a data lake for batch training
- Deduplicate events using client-generated event IDs to prevent inflated counts from network retries
- Aggregate events at multiple time windows (5 minutes, 1 hour, 1 day) in the stream processor for real-time features
- Feed impression and engagement data back to the training pipeline with proper attribution (which model version, which position) to measure model quality accurately
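The dedup step can be illustrated with an in-memory TTL set keyed by client-generated event IDs. In a real pipeline this state would live in Redis or the stream processor's keyed state store rather than a Python dict; the class below is a sketch.

```python
import time

class EventDeduplicator:
    # Remembers recently seen client-generated event IDs so network retries
    # don't inflate impression or click counts. The TTL bounds memory: an ID
    # only needs to be remembered for as long as a retry could plausibly arrive.
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._seen = {}   # event_id -> first-seen timestamp

    def accept(self, event_id: str, now=None) -> bool:
        now = time.time() if now is None else now
        # Lazily evict expired IDs on each call.
        self._seen = {eid: t for eid, t in self._seen.items()
                      if now - t < self.ttl}
        if event_id in self._seen:
            return False          # duplicate: drop the event
        self._seen[event_id] = now
        return True               # first sighting: process the event
```

Because Kafka partitions are keyed by user ID, each consumer only needs dedup state for its own partition's users, which keeps this state shardable.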
Suggested Approach
Step 1: Clarify Requirements
Ask about the item domain (videos, products, articles, music) since it affects feature engineering and candidate retrieval strategies. Clarify scale: number of users, items in the catalog, and expected request rate. Determine latency and freshness targets. Ask whether the system should handle multiple recommendation surfaces (home feed, "more like this," search results) or just one. Confirm whether A/B testing infrastructure is in scope.
Step 2: High-Level Architecture
Sketch the dual-path architecture: an offline path (training pipeline in the data warehouse producing model artifacts) and an online path (serving layer that retrieves candidates, fetches features, scores, and re-ranks). Show Kafka as the event ingestion backbone feeding both a stream processor (Flink) for near-real-time features and a data lake for batch training data. Place a feature store (Redis cluster) in the online path for low-latency feature retrieval. Include a model registry for versioned model artifacts, a serving tier that loads models and handles inference, and an API gateway for routing, A/B allocation, and rate limiting.
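The gateway's A/B allocation is usually a deterministic hash so a user sees a consistent variant across sessions without any lookup table. A minimal sketch, with an invented function name and weight format:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, weights: dict) -> str:
    # Hash (experiment, user) to a point in [0, 1): the same user always
    # lands in the same variant, and salting with the experiment name keeps
    # different experiments statistically independent.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest[:15], 16) / 16**15
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # guard against float rounding; last variant wins
```

The serving tier then loads the model version registered for the assigned variant, and every logged event carries the variant label so the analytics pipeline can compute per-group metrics.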
Step 3: Deep Dive on Serving Path
Walk through a recommendation request. The API gateway receives a request with user ID and context (device, time of day, current page). The candidate generator retrieves roughly 500 candidates from multiple sources: the user's collaborative filtering neighborhood (precomputed and cached), content-based similar items, trending items, and exploration candidates. The scoring service loads the user's feature vector and each candidate's feature vector from the feature store, applies the ranking model, and produces scores. The re-ranker applies diversity rules (no more than 3 items from the same genre in the top 10), promotional boosts, and content policy filters. The final ranked list is returned to the client, and impression events are logged to Kafka for feedback.
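The diversity rule mentioned above ("no more than 3 items from the same genre in the top 10") can be enforced with a greedy pass over the scored list. This is a simplified sketch; a production re-ranker would relax the cap when too few genres are available rather than leave slots short.

```python
from collections import Counter

def rerank_with_diversity(scored_items, genre_of, max_per_genre=3, window=10):
    # Fill the top `window` slots greedily in score order, skipping any item
    # whose genre has already hit its cap; skipped and overflow items follow
    # after the window, still in their original score order.
    top, rest, counts = [], [], Counter()
    for item in scored_items:
        genre = genre_of[item]
        if len(top) < window and counts[genre] < max_per_genre:
            top.append(item)
            counts[genre] += 1
        else:
            rest.append(item)
    return top + rest
```

Note the trade-off: every demotion costs expected engagement on this request in exchange for a less monotonous slate, which is why diversity metrics are tracked alongside CTR rather than optimized blindly.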
Step 4: Address Secondary Concerns
Cover cold start by describing the onboarding flow and content-based fallback for new users and items. Discuss the training pipeline: periodic batch jobs process interaction logs, train updated models, validate on hold-out data, and register new artifacts. Address A/B testing: the API gateway assigns users to experiment groups, the serving layer loads the appropriate model version, and an analytics pipeline computes per-group metrics. Touch on monitoring: track serving latency percentiles, feature store hit rates, model prediction distributions, recommendation diversity scores, and engagement metrics (CTR, watch time). Discuss scaling: horizontally partition the feature store by user ID, run stateless scoring pods behind a load balancer, and add Kafka partitions and Flink parallelism for event throughput.
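Partitioning the feature store by user ID comes down to a stable hash that readers (scoring pods) and writers (Flink jobs, batch loaders) both apply. A minimal sketch, assuming simple modulo placement:

```python
import hashlib

def feature_shard(user_id: str, num_shards: int = 64) -> int:
    # Stable hash routes a user's feature keys to one shard, so the scoring
    # service and the feature writers agree on placement without coordination.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo remaps most keys when `num_shards` changes, so clusters that reshard frequently use consistent hashing or a managed cluster mode (e.g. Redis Cluster's hash slots) instead; the invariant that matters is that all readers and writers share the same placement function.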
Related Learning
Deepen your understanding of the patterns used in this problem:
- Ad Click Aggregator -- high-throughput event ingestion and real-time aggregation patterns for clickstream processing
- Top-K Videos -- aggregating and serving ranked content feeds efficiently at scale
- Caching -- low-latency feature serving and candidate caching with Redis
- Message Queues -- Kafka for event ingestion, feature pipeline input, and training data collection
- Databases -- storage strategies for user profiles, item metadata, and interaction history