Design a News Aggregator System
Problem Statement
Design a news aggregation platform similar to Google News that collects articles from thousands of publishers, organizes them by topic and category, and delivers a personalized, continuously updating feed to each user based on their subscriptions and interests. Users follow specific news sources, categories like "Technology" or "Sports," and keyword topics like "AI" or "Climate." The system presents them with a stream of headlines, summaries, and links that refreshes in near real time as new stories break.
The core technical challenge spans multiple domains: building a reliable ingestion pipeline that fetches and normalizes content from heterogeneous publisher feeds, deduplicating stories that appear across syndicated sources, generating personalized feeds at scale for millions of users, and pushing updates with low latency when breaking news arrives. You need to decide between fanout-on-write and fanout-on-read strategies, handle bursty ingestion from major events, and keep feed quality high through deduplication and relevance filtering -- all while meeting tight latency targets on a read-heavy workload.
Key Requirements
Functional
- Source subscriptions -- Users subscribe to news sources, categories, and keyword topics to shape their personalized feed
- Personalized feed -- Display a continuously updating feed of articles matching user interests, sorted by recency and relevance
- Preference management -- Users mute or unfollow sources and topics, set language and region filters, and see changes reflected within seconds
- Breaking news alerts -- Deliver optional push notifications for high-priority stories aligned with user interests
Non-Functional
- Scalability -- Ingest articles from 50,000+ publishers, serve personalized feeds to 100M+ users, handle 200K feed requests per second at peak
- Reliability -- Zero article loss in the ingestion pipeline; 99.9% uptime for feed serving; graceful degradation if a publisher feed goes down
- Latency -- New articles appear in user feeds within 60 seconds of publication; feed page loads complete in under 200ms at p95
- Consistency -- Users may see slightly stale feeds (seconds, not minutes), but they should never see duplicate articles, and stories that were previously visible should not disappear
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Content Normalization
How you reliably fetch content from thousands of publishers with different formats (RSS, Atom, APIs, webhooks), normalize it into a consistent schema, and handle failures without losing articles. Publishers vary wildly in reliability, update frequency, and payload format.
Hints to consider:
- Deploy a fleet of fetcher workers that poll publisher feeds on configurable schedules, with per-source rate limiting to avoid being blocked
- Normalize all content into a canonical article schema (title, body, summary, author, publish_time, source, categories, media) immediately after ingestion
- Publish normalized articles to Kafka, partitioned by source ID, to decouple fetching from downstream processing
- Implement idempotency using a hash of (source_url, publish_time) to prevent duplicate articles from re-fetched feeds
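The idempotency hint above can be sketched in a few lines. This is a minimal in-memory illustration, not production code: the `Article` fields stand in for the canonical schema, and the `seen` set would in practice be a Redis set or a unique-key constraint in the article store.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Article:
    # Illustrative subset of the canonical article schema.
    title: str
    body: str
    source_url: str
    publish_time: str  # ISO-8601 string from the publisher feed

def idempotency_key(article: Article) -> str:
    # Hash of (source_url, publish_time): re-fetching the same feed yields
    # the same key, so downstream consumers can drop the duplicate.
    raw = f"{article.source_url}|{article.publish_time}"
    return hashlib.sha256(raw.encode()).hexdigest()

seen: set[str] = set()  # stand-in for a durable dedup store

def ingest(article: Article) -> bool:
    """Return True if the article is new, False if it is a re-fetch."""
    key = idempotency_key(article)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Re-running the fetcher against an unchanged feed then becomes a no-op: the second `ingest` call for the same article returns `False` instead of emitting a duplicate to Kafka.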
2. Deduplication and Story Clustering
The same story often appears from dozens of syndicated sources. Without deduplication, user feeds become repetitive and trust in the platform erodes. Interviewers want to see you think about near-duplicate detection at scale.
Hints to consider:
- Use MinHash or SimHash on article text to compute fingerprints, then cluster articles with similarity above a threshold
- Maintain a sliding window of recent fingerprints in Redis for fast lookup during ingestion
- Select a "canonical" article per cluster (e.g., the earliest or most authoritative source) and link duplicates to it
- Use Elasticsearch "more like this" queries as a secondary check during indexing for near-duplicates that fingerprinting missed
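To make the fingerprinting idea concrete, here is a small SimHash sketch under simplifying assumptions: whitespace tokenization, uniform token weights, and an MD5-based token hash chosen only for reproducibility. A real pipeline would use weighted shingles and banded indexing for sub-linear lookup.

```python
import hashlib

def _token_hash(token: str, bits: int = 64) -> int:
    # Stable per-token hash (truncated MD5) so fingerprints are reproducible.
    return int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)

def simhash(text: str, bits: int = 64) -> int:
    # Each token votes +1/-1 on every bit position; the sign of the running
    # total becomes that bit of the fingerprint.
    counts = [0] * bits
    for token in text.lower().split():
        h = _token_hash(token, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    # Similar texts produce fingerprints with a small Hamming distance.
    return bin(a ^ b).count("1")
```

Two syndicated copies of the same story, which differ only by a few words, land within a small Hamming distance of each other, while unrelated articles sit near the expected random distance of 32 bits; clustering then reduces to a threshold on that distance.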
3. Feed Generation Strategy
The central architectural decision is how you build each user's personalized feed. Interviewers probe whether you choose fanout-on-write (precompute feeds), fanout-on-read (assemble at request time), or a hybrid approach, and how you justify the tradeoff.
Hints to consider:
- Fanout-on-write works well for users following a small number of topics: when an article arrives, append it to the feed lists of all matching subscribers
- Fanout-on-read suits users with broad interests or when the article volume is very high: at request time, query the user's subscriptions and merge recent articles
- A hybrid approach precomputes feeds for active users and falls back to on-read assembly for dormant accounts
- Store precomputed feeds as sorted sets in Redis keyed by user ID, with article IDs scored by publish time for efficient pagination
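The fanout-on-write path above can be sketched with an in-memory stand-in for Redis sorted sets (`ZADD`, `ZREVRANGE`, `ZREMRANGEBYRANK`). Class and method names here are illustrative; the point is that scoring by publish time makes newest-first pagination a simple range read.

```python
import bisect

class FeedStore:
    """In-memory stand-in for per-user Redis sorted sets."""

    def __init__(self, max_len: int = 500):
        # user_id -> list of (publish_ts, article_id), kept sorted by score
        self.feeds: dict[str, list[tuple[float, str]]] = {}
        self.max_len = max_len  # cap feed length, like ZREMRANGEBYRANK

    def fanout(self, article_id: str, publish_ts: float,
               subscriber_ids: list[str]) -> None:
        # Fanout-on-write: append the article to every matching subscriber's
        # feed, scored by publish time, and trim the oldest entries.
        for uid in subscriber_ids:
            feed = self.feeds.setdefault(uid, [])
            bisect.insort(feed, (publish_ts, article_id))
            if len(feed) > self.max_len:
                del feed[0]

    def page(self, user_id: str, offset: int = 0, limit: int = 20) -> list[str]:
        # Newest-first pagination, analogous to ZREVRANGE with offset/limit.
        feed = self.feeds.get(user_id, [])
        newest_first = [aid for _, aid in reversed(feed)]
        return newest_first[offset:offset + limit]
```

A dormant user whose key has expired simply gets an empty result here, which is the signal to fall back to on-read assembly in the hybrid approach.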
4. Database Write Scaling Under Bursty Ingestion
When a major news event breaks, thousands of articles arrive within minutes and ingestion write throughput spikes dramatically. Interviewers ask how you prevent database write bottlenecks, especially when one category is disproportionately hot.
Hints to consider:
- Buffer incoming articles in Kafka and drain to the database at a controlled rate using consumer group parallelism
- Shard the article store by category or source region so that a hot category does not block writes in other partitions
- Use batch inserts rather than individual row writes to reduce transaction overhead during spikes
- Monitor per-partition lag in Kafka to detect and alert on ingestion backlogs before they affect feed freshness
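The batch-insert hint amounts to a buffer that flushes on whichever comes first, a size cap or a deadline. A minimal sketch, assuming a `db_insert` callable that performs one multi-row insert per batch (the Kafka consumer loop and the actual SQL are omitted):

```python
import time

class BatchWriter:
    """Buffer articles and flush them as one batch insert when the buffer
    fills or a deadline passes, trading a little latency for far fewer
    transactions during ingestion spikes."""

    def __init__(self, db_insert, max_batch: int = 100, max_wait_s: float = 1.0):
        self.db_insert = db_insert      # callable taking a list of articles
        self.max_batch = max_batch      # flush when the buffer reaches this size
        self.max_wait_s = max_wait_s    # ...or when the oldest item is this stale
        self.buffer: list = []
        self.first_buffered_at: float | None = None

    def add(self, article) -> None:
        if not self.buffer:
            self.first_buffered_at = time.monotonic()
        self.buffer.append(article)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.first_buffered_at >= self.max_wait_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.db_insert(self.buffer)  # one transaction per batch
            self.buffer = []
            self.first_buffered_at = None
```

During a breaking-news burst the size cap dominates and writes happen in full batches; in quiet periods the deadline keeps feed freshness within a bounded delay.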
5. Fault Tolerance in the Article Processing Path
If the database write fails partway through processing an article, you may have already uploaded thumbnails to blob storage or updated secondary indexes. Interviewers probe how you ensure consistency across these side effects.
Hints to consider:
- Process each article as a transactional unit: write metadata to the database first, then upload media, then update the search index
- If the database write fails, do not proceed with downstream steps; rely on Kafka retries to reprocess the article
- Run a periodic reconciliation job that scans blob storage for orphaned assets (thumbnails without a corresponding article row) and deletes them
- Use outbox pattern: write article metadata and an event record in the same database transaction, then publish the event asynchronously to trigger media upload and indexing
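The outbox pattern in the last bullet can be shown with SQLite standing in for the article store and a plain callable standing in for the message broker; table and event names are illustrative. The key property is that the article row and the event record commit in the same transaction.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "event TEXT, published INTEGER DEFAULT 0)")

def save_article(article_id: str, title: str) -> None:
    # Article row and outbox event commit atomically: either both exist or
    # neither does, so no downstream event is ever lost or orphaned.
    with conn:  # one transaction; rolls back on any exception
        conn.execute("INSERT INTO articles VALUES (?, ?)", (article_id, title))
        conn.execute(
            "INSERT INTO outbox (event) VALUES (?)",
            (json.dumps({"type": "article_created", "id": article_id}),))

def publish_pending(publish) -> None:
    # A relay polls unpublished events, hands each to the broker (here just
    # a callable), and marks it published; media upload and search indexing
    # are triggered by these events rather than inline with the DB write.
    rows = conn.execute(
        "SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(json.loads(event))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

If the relay crashes after publishing but before marking the row, the event is re-delivered on the next poll, so downstream consumers of the indexing and media-upload events must be idempotent, which ties back to the Kafka-retry guidance above.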