For a walkthrough of building an AI-powered answer system with retrieval pipelines, see our Design ChatGPT guide. While the ChatGPT guide focuses on conversational AI, many of the same patterns around content ingestion, ranking, and low-latency serving apply to a recommendation feed.
Also review the Databases, Caching, and Search building blocks.
Design a recommender system for Perplexity's Discover feed. The system collects articles from thousands of publishers and serves personalized feeds to millions of users. Logged-in users see content tailored to their interests, subscribed topics, and reading history, while guest users see a trending or global feed based on popularity and recency.
This problem combines content aggregation, personalization, and ranking at scale. You should be prepared to discuss recommendation approaches -- content-based filtering, collaborative filtering, and learned ranking models -- as well as the infrastructure required for real-time ingestion from diverse sources, efficient indexing, and low-latency feed generation. Interviewers care as much about how the pieces fit together architecturally as they do about the recommendation algorithms themselves.
Content ingestion pipeline -- Collect articles from thousands of publishers via RSS feeds, APIs, and web scraping, with deduplication and quality filtering
Personalized feed generation -- Rank articles based on user interests, reading history, subscribed topics, and engagement signals
Trending feed for guests -- Serve a non-personalized feed to anonymous users based on recency, popularity, and editorial curation
Topic modeling and tagging -- Automatically classify articles into topics and extract entities for filtering and recommendation
User interaction tracking -- Capture clicks, reads, time-on-page, and shares to refine future recommendations
Low latency -- Feed generation completes within a few hundred milliseconds for a responsive user experience
High throughput -- Handle millions of feed requests per day from a growing user base
Freshness -- New articles appear in feeds within minutes of publication
Scalability -- All components (ingestion, ranking, serving) scale horizontally with traffic and content volume
Content diversity -- Avoid filter bubbles by balancing relevance with serendipity and topic variety
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers expect a complete system diagram showing how articles flow from publishers through ingestion, processing, indexing, and ranking to the user-facing feed. They want to see storage systems, caching layers, and how real-time and batch components interact.
Interviewers probe how you handle diverse content sources and prevent showing the same story from multiple publishers.
Scheduled crawlers poll RSS feeds and APIs at regular intervals; web scraping with headless browsers handles publishers without structured feeds
Content fingerprinting using MinHash or simhash detects near-duplicates even when publishers rephrase the same story
Quality filtering removes spam, low-quality content, and irrelevant articles before they enter the index
Incremental processing handles high-volume publishers without overwhelming downstream systems
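The MinHash fingerprinting idea above can be sketched in a few lines: shingle each article into word n-grams, compress the shingle set into a fixed-size signature, and compare signatures instead of full texts. This is a minimal illustration (seeded MD5 hashes standing in for a proper hash family; production systems typically pair MinHash with LSH banding to avoid pairwise comparisons):

```python
import hashlib
import re

def shingles(text, k=3):
    """Split text into overlapping word k-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash over all shingles. Two texts agree at a given position with
    probability equal to their Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two publishers rephrasing the same story share most shingles and thus most signature positions, while unrelated articles share almost none, which is what lets the pipeline flag near-duplicates cheaply.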
Interviewers want to see that you understand how to build a recommendation system that balances relevance, freshness, and diversity.
Build user profiles from reading history, explicit topic subscriptions, and implicit signals (clicks, time-on-page, shares)
Use content-based filtering by matching article embeddings (for example, from a sentence transformer model) against user interest vectors
Add collaborative filtering to surface articles read by users with similar profiles
Apply recency boosting to prioritize fresh content and diversity injection to prevent echo chambers
Design an A/B testing framework to evaluate ranking algorithm changes on real traffic
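One common way to realize the "build user profiles from implicit signals" step above is an exponential moving average over the embeddings of articles the user engages with. The sketch below assumes that article embeddings and user vectors share one vector space; the per-event weights are illustrative, not tuned values:

```python
def update_interest_vector(user_vec, article_vec, alpha=0.1):
    """Blend a read article's embedding into the user's interest vector
    as an exponential moving average: recent reads weigh more, and
    stale interests decay gradually."""
    return [(1 - alpha) * u + alpha * a for u, a in zip(user_vec, article_vec)]

# Weight the update by engagement strength: a share signals stronger
# interest than a click. These weights are hypothetical examples.
ENGAGEMENT_ALPHA = {"click": 0.05, "read": 0.1, "share": 0.2}

def apply_event(user_vec, article_vec, event_type):
    """Apply one engagement event to the profile with a signal-dependent rate."""
    return update_interest_vector(user_vec, article_vec,
                                  ENGAGEMENT_ALPHA[event_type])
```

Because each update is a cheap per-event vector blend, profiles can be refreshed incrementally from the event stream rather than recomputed in batch.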
Without personalization data, the guest feed must still be compelling. Interviewers push on how you generate it.
Interviewers probe how you handle growing content volume and user traffic without degrading latency or freshness.
Scale ingestion workers horizontally, each responsible for a subset of publishers
Shard the user profile store by user ID and the article index by article ID
Use multi-tiered caching: CDN for the guest trending feed, application cache for personalized feeds with short TTL
Pre-compute user interest vectors and article embeddings in batch during off-peak hours
Separate read and write paths so ingestion load does not affect serving latency
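The sharding points above reduce to a deterministic key-to-shard mapping. A minimal sketch, using hash-mod placement (note the caveat in the comment -- production systems usually prefer consistent hashing so that resizing the cluster moves only a fraction of keys):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a user ID (or article ID) to a shard.
    Hashing first avoids hotspots from sequential IDs. Plain modulo
    placement reshuffles most keys when num_shards changes, which is
    why consistent hashing is the usual production choice."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

The same routing function serves both the user profile store (keyed by user ID) and the article index (keyed by article ID).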
Ask about the scale (number of publishers, articles per day, active users), latency budget for feed generation, acceptable staleness for personalization updates, and whether the interviewer wants a deep dive on recommendation algorithms or infrastructure.
Discuss how you handle diverse publisher formats, deduplication strategies, and quality filtering. Walk through the pipeline from raw HTML or RSS to structured article records in the index. Explain how the content processor generates embeddings for each article using a pre-trained language model, extracts named entities for topic tagging, and writes the enriched record to both the article index and a Kafka topic for downstream consumers.
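The enrichment step described above can be sketched as a small processing function. Everything here is a stand-in: `embed_fn` and `extract_entities_fn` represent the real model calls, and a dict plus a list represent the article index and the Kafka producer:

```python
from dataclasses import dataclass, field

@dataclass
class ArticleRecord:
    url: str
    title: str
    body: str
    topics: list = field(default_factory=list)
    embedding: list = field(default_factory=list)

def enrich_article(record, embed_fn, extract_entities_fn):
    """Attach an embedding and topic tags to a raw article record
    before it is written anywhere downstream."""
    record.embedding = embed_fn(record.title + " " + record.body)
    record.topics = extract_entities_fn(record.body)
    return record

def process(record, embed_fn, extract_entities_fn, index, stream):
    """Write the enriched record to the article index and publish it
    to the downstream topic for ranking and trending consumers."""
    enriched = enrich_article(record, embed_fn, extract_entities_fn)
    index[enriched.url] = enriched   # stand-in for the search index write
    stream.append(enriched.url)      # stand-in for the Kafka producer
    return enriched
```

Keeping the index write and the stream publish in one step ensures every downstream consumer sees the same enriched record.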
This is where Perplexity interviews focus. Explain how user profiles are built incrementally from engagement events. Describe the ranking formula that combines content similarity (cosine similarity between article and user interest vectors), collaborative signal (what similar users engaged with), recency score (exponential decay by article age), and diversity penalty (reduce score for articles too similar to recently shown items). Discuss the two-phase retrieval approach: a fast candidate generation step that retrieves hundreds of articles from the index using topic filters and embedding similarity, followed by a scoring and re-ranking step that applies the full ranking model. Cover caching: store pre-computed personalized feeds with a short TTL and invalidate on new high-relevance articles.
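The ranking formula described above combines four signals into one score. A minimal sketch follows; the weights and the six-hour recency half-life are illustrative defaults, not values from any real system:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_score(article_emb, user_vec, collab_signal, age_hours, shown_embs,
               weights=(0.5, 0.2, 0.2, 0.1), half_life_hours=6.0):
    """Score = weighted content similarity + collaborative signal (0..1)
    + exponential recency decay - diversity penalty against the most
    similar recently shown item."""
    w_content, w_collab, w_recency, w_diversity = weights
    content = cosine(article_emb, user_vec)
    recency = math.exp(-math.log(2) * age_hours / half_life_hours)
    penalty = max((cosine(article_emb, e) for e in shown_embs), default=0.0)
    return (w_content * content + w_collab * collab_signal
            + w_recency * recency - w_diversity * penalty)
```

In the two-phase design, candidate generation retrieves a few hundred articles cheaply, and this full scoring function runs only on that shortlist, which is what keeps feed generation within the latency budget.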
Discuss caching strategies at multiple layers: the guest trending feed is pre-computed every few minutes and served from CDN; personalized feeds are cached per user with a TTL of one to five minutes; user interest vectors are materialized in a fast key-value store; and article embeddings are pre-computed during ingestion. Explain how you balance freshness (shorter TTL, more recomputation) with latency (longer TTL, stale but fast).
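The per-user feed cache with a short TTL can be sketched as a small in-memory structure (production would use Redis or similar; the injectable `now` parameter here just makes expiry testable):

```python
import time

class TTLCache:
    """Minimal per-user feed cache with a short TTL and explicit
    invalidation for new high-relevance articles."""

    def __init__(self, ttl_seconds=120):
        self.ttl = ttl_seconds
        self.store = {}  # user_id -> (expires_at, feed)

    def get(self, user_id, now=None):
        """Return the cached feed, or None if absent or expired."""
        now = now if now is not None else time.time()
        entry = self.store.get(user_id)
        if entry and entry[0] > now:
            return entry[1]
        return None

    def put(self, user_id, feed, now=None):
        """Cache a freshly generated feed until now + TTL."""
        now = now if now is not None else time.time()
        self.store[user_id] = (now + self.ttl, feed)

    def invalidate(self, user_id):
        """Drop a user's cached feed, forcing regeneration on next request."""
        self.store.pop(user_id, None)
```

The TTL directly encodes the freshness-versus-latency trade-off from the text: a 1-minute TTL means more recomputation but fresher feeds, a 5-minute TTL means cheaper serving at the cost of staleness.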
Candidates report that Perplexity interviewers want a full architecture diagram with every component -- ingestion, processing, storage, ranking, and serving -- and care deeply about how the pieces fit together and where bottlenecks might appear. Deduplication was a major topic: how do you detect when multiple outlets cover the same story? Personalization was the core discussion: how do you build user profiles from sparse data, especially for new users? Scalability was probed hard: what happens when you grow from 1 million to 100 million users? The guest trending feed came up as well: how do you make it compelling without any personalization data?