Design a News Aggregator System
Problem Statement
Design a news aggregation system like Google News: users subscribe to news sources, categories, and topics, and the system collects articles from many publishers, normalizes and deduplicates them, and delivers a personalized, near real-time feed reflecting those interests.
Interviewers ask this because it blends streaming ingestion, content normalization, feed serving, and real-time delivery. It tests whether you can design multi-stage pipelines, pick the right serving strategy (fanout-on-write vs fanout-on-read), manage deduplication and idempotency, and hit latency targets under a read-heavy workload. The interview often rewards candidates who scope ranking simply and focus on correctness, scale, and timely updates.
Key Requirements
Functional
- Subscription management -- users subscribe to news sources, categories, and keyword topics to personalize their feed
- Personalized feed -- users view a feed that updates in near real time with new and relevant stories based on their subscriptions
- Preference management -- users mute/unfollow sources or topics, set language/region preferences, and see changes reflected quickly
- Breaking news -- users receive optional breaking-news notifications aligned with their interests
Non-Functional
- Scalability -- ingest articles from tens of thousands of publishers with millions of articles per day; serve feeds to hundreds of millions of users
- Reliability -- no article loss in the ingestion pipeline; feeds remain available even during partial system failures
- Latency -- new articles appear in relevant feeds within 2-5 minutes of publication; feed page loads in under 300ms
- Consistency -- eventual consistency for feed updates; strong consistency for subscription changes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Deduplication
Articles arrive from thousands of sources with duplicates, syndication, and varying formats. Interviewers expect a robust multi-stage pipeline.
Hints to consider:
- Build a crawler/fetcher service that polls RSS feeds and publisher APIs on configurable schedules
- Use Kafka to decouple fetching from processing, providing durability and backpressure handling
- Implement deduplication using content fingerprinting (SimHash or MinHash) to detect near-duplicate articles from syndication
- Normalize article metadata (title, summary, author, category, publication date) into a consistent schema
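As a sketch of the fingerprinting hint above, here is a minimal SimHash in Python. Hashing whole-text tokens with MD5 is an illustrative choice; real pipelines typically use weighted shingles and an index that supports fast Hamming-distance lookups rather than pairwise comparison.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint: every token votes on every bit."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Syndicated copies with minor edits land within a few bits."""
    return hamming(simhash(a), simhash(b)) <= threshold
```

Because near-duplicates differ in only a few bits, dedup at scale is usually done by splitting the 64-bit fingerprint into bands and looking up band matches, rather than comparing every pair.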
2. Feed Generation Strategy
Interviewers want to see how you build personalized feeds at scale without expensive per-request computation.
Hints to consider:
- Use a hybrid approach: fanout-on-write for users with few subscriptions, fanout-on-read for users following broad categories
- Precompute feed candidate sets per topic/source and merge them at read time based on user subscriptions
- Store precomputed feeds or feed pointers in Redis sorted sets for low-latency retrieval
- Apply simple rule-based ranking (recency + source credibility + category match) rather than complex ML models
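A minimal sketch of the read-time merge plus rule-based ranking described above. The candidate-set layout, the credibility table, and the 0.6/0.4 weighting are illustrative assumptions, not a prescribed design; in production the candidate sets would come from Redis sorted sets rather than in-memory dicts.

```python
import heapq

def merge_feed(candidate_sets, source_credibility, now, limit=20):
    """Merge per-topic candidate sets into one ranked feed.

    candidate_sets: topic -> list of (article_id, source, published_ts)
    source_credibility: source -> score in [0, 1]
    """
    scored = {}
    for topic, articles in candidate_sets.items():
        for article_id, source, published_ts in articles:
            age_hours = max((now - published_ts) / 3600, 0.0)
            recency = 1.0 / (1.0 + age_hours)        # decays with article age
            credibility = source_credibility.get(source, 0.5)
            score = 0.6 * recency + 0.4 * credibility
            # An article syndicated into several topics keeps its best score,
            # which also dedupes it across candidate sets.
            if article_id not in scored or score > scored[article_id]:
                scored[article_id] = score
    return heapq.nlargest(limit, scored.items(), key=lambda kv: kv[1])
```

The merge touches only the user's subscribed topics, so per-request cost scales with subscription count, not with total article volume.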
3. Real-Time Updates and Notifications
Users expect their feed to stay current. Interviewers probe your strategy for timely delivery of new articles.
Hints to consider:
- Use server-sent events or WebSocket connections for real-time feed updates to active users
- For inactive users, rely on pull-based feed generation on next visit to avoid wasted fan-out
- Implement a breaking-news pipeline as a separate high-priority lane that bypasses normal batch processing and pushes directly to matching users
- Use topic-based pub/sub to efficiently route new articles to interested subscriber groups
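To make the topic-based routing concrete, a toy in-memory version of the subscriber-matching step. In practice this sits on a real pub/sub layer (e.g. Kafka topics or Redis channels); `TopicRouter` and its fields are hypothetical names for illustration.

```python
from collections import defaultdict

class TopicRouter:
    """Route newly ingested articles to the users subscribed to their topics."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # topic -> set of user ids

    def subscribe(self, user_id: str, topic: str) -> None:
        self.subscribers[topic].add(user_id)

    def unsubscribe(self, user_id: str, topic: str) -> None:
        self.subscribers[topic].discard(user_id)

    def route(self, article: dict) -> set:
        """Union of subscriber groups across the article's tagged topics."""
        matched = set()
        for topic in article.get("topics", []):
            matched |= self.subscribers[topic]
        return matched
```

The returned set would then be split by connection state: push over SSE/WebSocket to active users, and simply update stored feed pointers for everyone else.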
4. Content Classification and Enrichment
Raw articles need to be classified into categories and topics for matching against user interests.
Hints to consider:
- Apply NLP-based classification during ingestion to tag articles with categories, entities, and sentiment
- Use a lightweight taxonomy of categories and map publisher-provided categories to your canonical taxonomy
- Index enriched articles in Elasticsearch for full-text search, faceted filtering, and "more like this" queries
- Maintain a topic trending service that identifies emerging stories by detecting article clusters
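To illustrate the taxonomy-mapping hint, a tiny sketch that normalizes publisher-provided categories onto a canonical set. The taxonomy and alias table below are made-up examples; a real system would version this mapping and handle unmapped categories with review tooling rather than a silent fallback.

```python
# Illustrative canonical taxonomy and publisher-category aliases (assumptions)
CANONICAL = {"technology", "business", "sports", "politics", "science"}
ALIASES = {
    "tech": "technology", "sci-tech": "technology",
    "finance": "business", "markets": "business",
    "soccer": "sports", "nba": "sports",
}

def canonicalize(publisher_category: str) -> str:
    """Map a publisher's category label to the canonical taxonomy."""
    cat = publisher_category.strip().lower()
    if cat in CANONICAL:
        return cat
    # Unknown labels fall into a catch-all bucket instead of being dropped
    return ALIASES.get(cat, "general")
```

Keeping the taxonomy small and doing the mapping at ingestion time means feed matching downstream only ever sees canonical labels.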