Design a recommendation engine that analyzes user behavior and preferences to surface personalized lists of items such as products, articles, videos, or jobs. Users see ranked results tailored to their interests, and those results update as their activity changes. The system must combine offline model training with online serving to deliver fresh, relevant recommendations under strict latency budgets while supporting continuous experimentation through A/B tests.
Building a production recommender involves a multi-stage pipeline: ingesting behavioral signals at high volume, computing and storing features, training models offline, serving candidates in real time, and closing the feedback loop to measure and improve quality. Interviewers use this problem to assess your ability to balance freshness against cost, design layered retrieval and ranking architectures, handle cold-start scenarios for new users and items, and build systems that evolve through controlled experimentation without disrupting the user experience.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Scoring every item in a 100-million-item catalog at request time is infeasible. Interviewers expect a funnel-shaped pipeline that narrows candidates efficiently before applying expensive ranking models.
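The funnel shape can be sketched as two stages: a cheap retrieval step narrows the full catalog to roughly a thousand candidates, and an expensive ranking model scores only those survivors. A minimal sketch, assuming embedding-based retrieval; the brute-force dot product here stands in for a real ANN index (FAISS, ScaNN, etc.), and all function names are illustrative:

```python
import numpy as np

def generate_candidates(user_vec, item_matrix, item_ids, k=1000):
    """Stage 1: cheap retrieval -- narrow the whole catalog to k candidates.
    A production system would query an ANN index instead of computing
    every dot product, but the funnel shape is the same."""
    scores = item_matrix @ user_vec
    top = np.argpartition(scores, -k)[-k:]           # unordered top-k
    ordered = top[np.argsort(scores[top])[::-1]]     # sort just the top-k
    return [item_ids[i] for i in ordered]

def rank(candidates, score_fn, n=50):
    """Stage 2: expensive ranking -- the heavy model scores only the
    candidates that survived stage 1."""
    return sorted(candidates, key=score_fn, reverse=True)[:n]
```

The key property is that stage 2's cost scales with the candidate count (~1,000), not the catalog size (~100 million).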
Recommendations depend on features that vary in freshness and cost to compute. Interviewers want to see a clear separation between batch, near-real-time, and request-time features.
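One way to make the three-tier separation concrete is to assemble the feature vector from distinct stores at request time, with missing tiers degrading to defaults instead of failing the request. A hypothetical sketch; the store interfaces and feature names are assumptions, not a prescribed API:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureVector:
    batch: dict = field(default_factory=dict)      # daily job, e.g. 30-day category affinity
    streaming: dict = field(default_factory=dict)  # near-real-time, e.g. clicks in last 10 min
    request: dict = field(default_factory=dict)    # computed per request, e.g. device, hour

def assemble_features(user_id, batch_store, streaming_store, request_ctx):
    """Merge the three freshness tiers for one request. A missing tier
    yields an empty dict so the ranker can fall back to defaults rather
    than erroring out."""
    return FeatureVector(
        batch=batch_store.get(user_id, {}),
        streaming=streaming_store.get(user_id, {}),
        request=request_ctx,
    )
```

Keeping the tiers separate also makes staleness observable: each tier can carry its own last-updated timestamp and alerting threshold.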
A recommender that only works for established users and popular items provides a poor experience for newcomers. Interviewers look for practical fallback strategies.
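A common fallback pattern is to blend a global popularity score with the personalized score, shifting weight toward personalization as the user accumulates interactions. A minimal sketch; the linear ramp and the `ramp` constant are illustrative choices, not recommended values:

```python
def blend_scores(personal_score, popularity_score, n_interactions, ramp=20):
    """Cold-start blending: brand-new users (n_interactions == 0) get pure
    popularity; once a user has `ramp` interactions the personalized score
    dominates entirely. The ramp is a tuning knob."""
    w = min(n_interactions / ramp, 1.0)
    return w * personal_score + (1 - w) * popularity_score
```

The same idea applies on the item side: a new item can borrow scores from its category or content embedding until it has enough impressions of its own.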
Improving a recommender requires continuous experimentation. Interviewers expect a robust framework for running controlled tests without degrading the user experience.
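The usual building block for controlled tests is deterministic, sticky bucketing: hash the user ID together with the experiment name so each user always lands in the same arm of a given experiment, while assignments across different experiments stay decorrelated. A minimal sketch (names illustrative):

```python
import hashlib

def assign_arm(user_id: str, experiment: str, arms: list) -> str:
    """Deterministically assign a user to an experiment arm. Salting the
    hash with the experiment name prevents the same users from always
    being 'arm 0' across every experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]
```

Because assignment is a pure function of (user, experiment), no assignment storage is needed, and serving logs can be joined to arms after the fact for metric analysis.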
Recommendation serving is read-heavy and latency-sensitive. Interviewers want to see how you keep response times low under peak load without excessive infrastructure cost.
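The workhorse for read-heavy, latency-sensitive serving is a read-through cache with a short TTL: hot users are served a precomputed list, and the expensive ranking path only runs on cache misses. A minimal in-process sketch (a production system would use Redis or similar; the class and method names are illustrative):

```python
import time

class TtlCache:
    """Read-through cache with a short TTL. Entries expire so that
    recommendations refresh as user activity changes, bounding staleness
    by ttl_seconds."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]              # fresh hit: skip the ranking path
        value = compute()              # miss or expired: run full pipeline
        self.store[key] = (now + self.ttl, value)
        return value
```

The TTL is the freshness/cost dial: shortening it increases ranking load, lengthening it risks serving stale lists after a burst of user activity.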
Confirm the item catalog size, user base, and expected request volume. Ask whether the recommender targets a single surface (homepage feed) or multiple surfaces (search, email, notifications) with different latency budgets. Clarify which engagement metrics matter most (clicks, purchases, watch time, long-term retention). Determine whether real-time personalization based on the current session is required or if daily batch updates suffice. Ask about cold-start expectations and whether new users arrive in large bursts (marketing campaigns) or gradually.
Sketch the major layers: an event ingestion pipeline (Kafka) that captures impressions, clicks, and conversions from clients. A streaming feature processor (Flink) that maintains real-time aggregates and writes to a low-latency feature store (Redis). A batch training pipeline (Spark) that reads historical data, trains collaborative filtering and ranking models, and publishes artifacts to a model registry. An online serving layer that receives a recommendation request, retrieves candidate items from a candidate generation service, assembles feature vectors from the feature store, scores candidates using the deployed ranking model, applies re-ranking rules, and returns the final list. Include an experiment configuration service that controls traffic routing across model versions.
Walk through a single recommendation request in detail. The API receives a request with user_id and optional filters. It first checks a Redis cache for a precomputed list. On a cache miss, it calls the candidate generation service, which performs an approximate nearest-neighbor lookup of the user's embedding against item embeddings to retrieve the top 1,000 candidates. It then loads user features (batch preferences, streaming session signals) and item features (popularity, freshness, category) from the feature store. The ranking model scores all candidates and produces a sorted list. The re-ranker applies diversity constraints (no more than three items from the same category in the top ten), filters out previously seen items, and enforces any business boosting rules. The final top-50 list is returned and cached with a short TTL. The API also logs the served items to Kafka for downstream impression tracking and model training.
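The re-ranking step above can be sketched directly: filter out seen items, then cap how many items per category appear within the top window, demoting the overflow below it. A hedged sketch, assuming the constraints stated in the walkthrough (three per category in the top ten); the function signature is illustrative:

```python
from collections import Counter

def rerank_with_diversity(ranked, category_of, seen,
                          max_per_category=3, window=10, n=50):
    """Re-rank a model-sorted list: drop previously seen items, then fill
    the top `window` slots while capping items per category. Items blocked
    by the cap keep their relative order but land after the window."""
    head, tail, counts = [], [], Counter()
    for item in ranked:
        if item in seen:
            continue  # never re-show an item the user already saw
        cat = category_of(item)
        if len(head) < window and counts[cat] < max_per_category:
            head.append(item)
            counts[cat] += 1
        else:
            tail.append(item)
    return (head + tail)[:n]
```

Note the cap is best-effort: if too few diverse candidates exist to fill the window, demoted items backfill it, which is usually preferable to returning a short list.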
Cover the feedback loop: impression and click events flow through Kafka into the training pipeline, closing the loop between serving and model improvement. Discuss cold start: new users receive popularity-based recommendations enriched with contextual signals; as clicks accumulate, the system transitions to personalized results within a few sessions. Address monitoring: track recommendation latency, cache hit rate, CTR by experiment arm, model serving errors, and feature staleness. Mention cost optimization: precompute lists for the most active users to reduce real-time inference load, use tiered model complexity (lightweight model for low-value surfaces, full model for high-engagement surfaces), and auto-scale inference pods based on traffic patterns. Briefly touch on reliability: if the ranking service is unavailable, serve the cached or popularity-based fallback list so the user experience is never blank.
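The reliability point translates into a simple fallback chain at serve time: personalized ranking, then cached list, then global popularity, so the response is never empty. A minimal sketch; the client and cache interfaces are assumptions for illustration:

```python
def serve_recommendations(user_id, ranking_client, cache, popularity_list):
    """Degrade gracefully when the ranking tier is unavailable: fall back
    first to the user's cached list, then to the global popularity list.
    The user always gets a non-empty response."""
    try:
        return ranking_client.rank_for(user_id)
    except Exception:
        cached = cache.get(user_id)
        if cached:
            return cached
        return popularity_list
```

In practice the fallback path should also emit a metric, since a rising fallback rate is often the first visible symptom of a ranking-tier outage.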