Design a News Aggregator System
Problem Statement
Design a news aggregation system like Google News: users subscribe to news sources, categories, and topics, and the system collects articles from many publishers, normalizes and deduplicates them, and delivers a personalized, near real-time feed reflecting those interests.
Interviewers ask this because it blends streaming ingestion, content normalization, feed serving, and real-time delivery. It tests whether you can design multi-stage pipelines, pick the right serving strategy (fanout-on-write vs fanout-on-read), manage deduplication and idempotency, and hit latency targets under a read-heavy workload. The interview often rewards candidates who scope ranking simply and focus on correctness, scale, and timely updates.
Key Requirements
Functional
- Subscription management -- users subscribe to news sources, categories, and keyword topics to personalize their feed
- Personalized feed -- users view a feed that updates in near real time with new and relevant stories based on their subscriptions
- Preference management -- users mute/unfollow sources or topics, set language/region preferences, and see changes reflected quickly
- Breaking news -- users receive optional breaking-news notifications aligned with their interests
Non-Functional
- Scalability -- ingest articles from tens of thousands of publishers with millions of articles per day; serve feeds to hundreds of millions of users
- Reliability -- no article loss in the ingestion pipeline; feeds remain available even during partial system failures
- Latency -- new articles appear in relevant feeds within 2-5 minutes of publication; feed page loads in under 300ms
- Consistency -- eventual consistency for feed updates; strong consistency for subscription changes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Deduplication
Articles arrive from thousands of sources with duplicates, syndication, and varying formats. Interviewers expect a robust multi-stage pipeline.
Hints to consider:
- Build a crawler/fetcher service that polls RSS feeds and publisher APIs on configurable schedules
- Use Kafka to decouple fetching from processing, providing durability and backpressure handling
- Implement deduplication using content fingerprinting (SimHash or MinHash) to detect near-duplicate articles from syndication
- Normalize article metadata (title, summary, author, category, publication date) into a consistent schema
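As a sketch of the fingerprinting hint above, here is a minimal SimHash in Python. Hashing whole-text tokens with MD5 is an illustrative choice; real pipelines typically use weighted shingles and an index that supports fast Hamming-distance lookups rather than pairwise comparison.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint: every token votes on every bit."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Syndicated copies with minor edits land within a few bits."""
    return hamming(simhash(a), simhash(b)) <= threshold
```

Because near-duplicates differ in only a few bits, dedup at scale is usually done by splitting the 64-bit fingerprint into bands and looking up band matches, rather than comparing every pair.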
2. Feed Generation Strategy
Interviewers want to see how you build personalized feeds at scale without expensive per-request computation.
Hints to consider:
- Use a hybrid approach: fanout-on-write for users with few subscriptions, fanout-on-read for users following broad categories
- Precompute feed candidate sets per topic/source and merge them at read time based on user subscriptions
- Store precomputed feeds or feed pointers in Redis sorted sets for low-latency retrieval
- Apply simple rule-based ranking (recency + source credibility + category match) rather than complex ML models
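A minimal sketch of the read-time merge plus rule-based ranking described above. The candidate-set layout, the credibility table, and the 0.6/0.4 weighting are illustrative assumptions, not a prescribed design; in production the candidate sets would come from Redis sorted sets rather than in-memory dicts.

```python
import heapq

def merge_feed(candidate_sets, source_credibility, now, limit=20):
    """Merge per-topic candidate sets into one ranked feed.

    candidate_sets: topic -> list of (article_id, source, published_ts)
    source_credibility: source -> score in [0, 1]
    """
    scored = {}
    for topic, articles in candidate_sets.items():
        for article_id, source, published_ts in articles:
            age_hours = max((now - published_ts) / 3600, 0.0)
            recency = 1.0 / (1.0 + age_hours)        # decays with article age
            credibility = source_credibility.get(source, 0.5)
            score = 0.6 * recency + 0.4 * credibility
            # An article syndicated into several topics keeps its best score,
            # which also dedupes it across candidate sets.
            if article_id not in scored or score > scored[article_id]:
                scored[article_id] = score
    return heapq.nlargest(limit, scored.items(), key=lambda kv: kv[1])
```

The merge touches only the user's subscribed topics, so per-request cost scales with subscription count, not with total article volume.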
3. Real-Time Updates and Notifications
Users expect their feed to stay current. Interviewers probe your strategy for timely delivery of new articles.
Hints to consider:
- Use server-sent events or WebSocket connections for real-time feed updates to active users
- For inactive users, rely on pull-based feed generation on next visit to avoid wasted fan-out
- Implement a breaking-news pipeline as a separate high-priority lane that bypasses normal batch processing and pushes directly to matching users
- Use topic-based pub/sub to efficiently route new articles to interested subscriber groups
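To make the topic-based routing concrete, a toy in-memory version of the subscriber-matching step. In practice this sits on a real pub/sub layer (e.g. Kafka topics or Redis channels); `TopicRouter` and its fields are hypothetical names for illustration.

```python
from collections import defaultdict

class TopicRouter:
    """Route newly ingested articles to the users subscribed to their topics."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # topic -> set of user ids

    def subscribe(self, user_id: str, topic: str) -> None:
        self.subscribers[topic].add(user_id)

    def unsubscribe(self, user_id: str, topic: str) -> None:
        self.subscribers[topic].discard(user_id)

    def route(self, article: dict) -> set:
        """Union of subscriber groups across the article's tagged topics."""
        matched = set()
        for topic in article.get("topics", []):
            matched |= self.subscribers[topic]
        return matched
```

The returned set would then be split by connection state: push over SSE/WebSocket to active users, and simply update stored feed pointers for everyone else.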
4. Content Classification and Enrichment
Raw articles need to be classified into categories and topics for matching against user interests.
Hints to consider:
- Apply NLP-based classification during ingestion to tag articles with categories, entities, and sentiment
- Use a lightweight taxonomy of categories and map publisher-provided categories to your canonical taxonomy
- Index enriched articles in Elasticsearch for full-text search, faceted filtering, and "more like this" queries
- Maintain a topic trending service that identifies emerging stories by detecting article clusters
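To illustrate the taxonomy-mapping hint, a tiny sketch that normalizes publisher-provided categories onto a canonical set. The taxonomy and alias table below are made-up examples; a real system would version this mapping and handle unmapped categories with review tooling rather than a silent fallback.

```python
# Illustrative canonical taxonomy and publisher-category aliases (assumptions)
CANONICAL = {"technology", "business", "sports", "politics", "science"}
ALIASES = {
    "tech": "technology", "sci-tech": "technology",
    "finance": "business", "markets": "business",
    "soccer": "sports", "nba": "sports",
}

def canonicalize(publisher_category: str) -> str:
    """Map a publisher's category label to the canonical taxonomy."""
    cat = publisher_category.strip().lower()
    if cat in CANONICAL:
        return cat
    # Unknown labels fall into a catch-all bucket instead of being dropped
    return ALIASES.get(cat, "general")
```

Keeping the taxonomy small and doing the mapping at ingestion time means feed matching downstream only ever sees canonical labels.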