Design a News Aggregator System
Problem Statement
Design a news aggregation platform similar to Google News that collects articles from thousands of publishers, organizes them by topic and category, and delivers a personalized, continuously updating feed to each user based on their subscriptions and interests. Users follow specific news sources, categories like "Technology" or "Sports," and keyword topics like "AI" or "Climate." The system presents them with a stream of headlines, summaries, and links that refreshes in near real time as new stories break.
The core technical challenge spans multiple domains: building a reliable ingestion pipeline that fetches and normalizes content from heterogeneous publisher feeds, deduplicating stories that appear across syndicated sources, generating personalized feeds at scale for millions of users, and pushing updates with low latency when breaking news arrives. You need to decide between fanout-on-write and fanout-on-read strategies, handle bursty ingestion from major events, and keep feed quality high through deduplication and relevance filtering -- all while meeting tight latency targets on a read-heavy workload.
Key Requirements
Functional
- Source subscriptions -- Users subscribe to news sources, categories, and keyword topics to shape their personalized feed
- Personalized feed -- Display a continuously updating feed of articles matching user interests, sorted by recency and relevance
- Preference management -- Users mute or unfollow sources and topics, set language and region filters, and see changes reflected within seconds
- Breaking news alerts -- Deliver optional push notifications for high-priority stories aligned with user interests
Non-Functional
- Scalability -- Ingest articles from 50,000+ publishers, serve personalized feeds to 100M+ users, handle 200K feed requests per second at peak
- Reliability -- Zero article loss in the ingestion pipeline; 99.9% uptime for feed serving; graceful degradation if a publisher feed goes down
- Latency -- New articles appear in user feeds within 60 seconds of publication; feed page loads complete in under 200ms at p95
- Consistency -- Users may see slightly stale feeds (seconds, not minutes), but they should never see duplicate articles, and stories that were previously visible should not disappear
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Content Normalization
How you reliably fetch content from thousands of publishers with different formats (RSS, Atom, APIs, webhooks), normalize it into a consistent schema, and handle failures without losing articles. Publishers vary wildly in reliability, update frequency, and payload format.
Hints to consider:
- Deploy a fleet of fetcher workers that poll publisher feeds on configurable schedules, with per-source rate limiting to avoid being blocked
- Normalize all content into a canonical article schema (title, body, summary, author, publish_time, source, categories, media) immediately after ingestion
- Publish normalized articles to Kafka, partitioned by source ID, to decouple fetching from downstream processing
- Implement idempotency using a hash of (source_url, publish_time) to prevent duplicate articles from re-fetched feeds
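The idempotency hint above can be sketched in a few lines. This is a minimal in-memory illustration, not production code: the `Article` fields stand in for the canonical schema, and the `seen` set would in practice be a Redis set or a unique-key constraint in the article store.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Article:
    # Illustrative subset of the canonical article schema.
    title: str
    body: str
    source_url: str
    publish_time: str  # ISO-8601 string from the publisher feed

def idempotency_key(article: Article) -> str:
    # Hash of (source_url, publish_time): re-fetching the same feed yields
    # the same key, so downstream consumers can drop the duplicate.
    raw = f"{article.source_url}|{article.publish_time}"
    return hashlib.sha256(raw.encode()).hexdigest()

seen: set[str] = set()  # stand-in for a durable dedup store

def ingest(article: Article) -> bool:
    """Return True if the article is new, False if it is a re-fetch."""
    key = idempotency_key(article)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Re-running the fetcher against an unchanged feed then becomes a no-op: the second `ingest` call for the same article returns `False` instead of emitting a duplicate to Kafka.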
2. Deduplication and Story Clustering
The same story often appears from dozens of syndicated sources. Without deduplication, user feeds become repetitive and trust in the platform erodes. Interviewers want to see you think about near-duplicate detection at scale.
Hints to consider:
- Use MinHash or SimHash on article text to compute fingerprints, then cluster articles with similarity above a threshold
- Maintain a sliding window of recent fingerprints in Redis for fast lookup during ingestion
- Select a "canonical" article per cluster (e.g., the earliest or most authoritative source) and link duplicates to it
- Use Elasticsearch "more like this" queries as a secondary check during indexing for near-duplicates that fingerprinting missed
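To make the fingerprinting idea concrete, here is a small SimHash sketch under simplifying assumptions: whitespace tokenization, uniform token weights, and an MD5-based token hash chosen only for reproducibility. A real pipeline would use weighted shingles and banded indexing for sub-linear lookup.

```python
import hashlib

def _token_hash(token: str, bits: int = 64) -> int:
    # Stable per-token hash (truncated MD5) so fingerprints are reproducible.
    return int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)

def simhash(text: str, bits: int = 64) -> int:
    # Each token votes +1/-1 on every bit position; the sign of the running
    # total becomes that bit of the fingerprint.
    counts = [0] * bits
    for token in text.lower().split():
        h = _token_hash(token, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    # Similar texts produce fingerprints with a small Hamming distance.
    return bin(a ^ b).count("1")
```

Two syndicated copies of the same story, which differ only by a few words, land within a small Hamming distance of each other, while unrelated articles sit near the expected random distance of 32 bits; clustering then reduces to a threshold on that distance.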
3. Feed Generation Strategy
The central architectural decision is how you build each user's personalized feed. Interviewers probe whether you choose fanout-on-write (precompute feeds), fanout-on-read (assemble at request time), or a hybrid approach, and how you justify the tradeoff.
Hints to consider:
- Fanout-on-write works well for users following a small number of topics: when an article arrives, append it to the feed lists of all matching subscribers
- Fanout-on-read suits users with broad interests or when the article volume is very high: at request time, query the user's subscriptions and merge recent articles
- A hybrid approach precomputes feeds for active users and falls back to on-read assembly for dormant accounts
- Store precomputed feeds as sorted sets in Redis keyed by user ID, with article IDs scored by publish time for efficient pagination
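The fanout-on-write path above can be sketched with an in-memory stand-in for Redis sorted sets (`ZADD`, `ZREVRANGE`, `ZREMRANGEBYRANK`). Class and method names here are illustrative; the point is that scoring by publish time makes newest-first pagination a simple range read.

```python
import bisect

class FeedStore:
    """In-memory stand-in for per-user Redis sorted sets."""

    def __init__(self, max_len: int = 500):
        # user_id -> list of (publish_ts, article_id), kept sorted by score
        self.feeds: dict[str, list[tuple[float, str]]] = {}
        self.max_len = max_len  # cap feed length, like ZREMRANGEBYRANK

    def fanout(self, article_id: str, publish_ts: float,
               subscriber_ids: list[str]) -> None:
        # Fanout-on-write: append the article to every matching subscriber's
        # feed, scored by publish time, and trim the oldest entries.
        for uid in subscriber_ids:
            feed = self.feeds.setdefault(uid, [])
            bisect.insort(feed, (publish_ts, article_id))
            if len(feed) > self.max_len:
                del feed[0]

    def page(self, user_id: str, offset: int = 0, limit: int = 20) -> list[str]:
        # Newest-first pagination, analogous to ZREVRANGE with offset/limit.
        feed = self.feeds.get(user_id, [])
        newest_first = [aid for _, aid in reversed(feed)]
        return newest_first[offset:offset + limit]
```

A dormant user whose key has expired simply gets an empty result here, which is the signal to fall back to on-read assembly in the hybrid approach.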
4. Database Write Scaling Under Bursty Ingestion
When a major news event breaks, thousands of articles arrive within minutes and ingestion write throughput spikes dramatically. Interviewers ask how you prevent database write bottlenecks, especially when one category is disproportionately hot.
Hints to consider:
- Buffer incoming articles in Kafka and drain to the database at a controlled rate using consumer group parallelism
- Shard the article store by category or source region so that a hot category does not block writes in other partitions
- Use batch inserts rather than individual row writes to reduce transaction overhead during spikes
- Monitor per-partition lag in Kafka to detect and alert on ingestion backlogs before they affect feed freshness
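The batch-insert hint amounts to a buffer that flushes on whichever comes first, a size cap or a deadline. A minimal sketch, assuming a `db_insert` callable that performs one multi-row insert per batch (the Kafka consumer loop and the actual SQL are omitted):

```python
import time

class BatchWriter:
    """Buffer articles and flush them as one batch insert when the buffer
    fills or a deadline passes, trading a little latency for far fewer
    transactions during ingestion spikes."""

    def __init__(self, db_insert, max_batch: int = 100, max_wait_s: float = 1.0):
        self.db_insert = db_insert      # callable taking a list of articles
        self.max_batch = max_batch      # flush when the buffer reaches this size
        self.max_wait_s = max_wait_s    # ...or when the oldest item is this stale
        self.buffer: list = []
        self.first_buffered_at: float | None = None

    def add(self, article) -> None:
        if not self.buffer:
            self.first_buffered_at = time.monotonic()
        self.buffer.append(article)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.first_buffered_at >= self.max_wait_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.db_insert(self.buffer)  # one transaction per batch
            self.buffer = []
            self.first_buffered_at = None
```

During a breaking-news burst the size cap dominates and writes happen in full batches; in quiet periods the deadline keeps feed freshness within a bounded delay.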
5. Fault Tolerance in the Article Processing Path
If the database write fails partway through processing an article, you may have already uploaded thumbnails to blob storage or updated secondary indexes. Interviewers probe how you ensure consistency across these side effects.
Hints to consider:
- Process each article as a transactional unit: write metadata to the database first, then upload media, then update the search index
- If the database write fails, do not proceed with downstream steps; rely on Kafka retries to reprocess the article
- Run a periodic reconciliation job that scans blob storage for orphaned assets (thumbnails without a corresponding article row) and deletes them
- Use outbox pattern: write article metadata and an event record in the same database transaction, then publish the event asynchronously to trigger media upload and indexing
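The outbox pattern in the last bullet can be shown with SQLite standing in for the article store and a plain callable standing in for the message broker; table and event names are illustrative. The key property is that the article row and the event record commit in the same transaction.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "event TEXT, published INTEGER DEFAULT 0)")

def save_article(article_id: str, title: str) -> None:
    # Article row and outbox event commit atomically: either both exist or
    # neither does, so no downstream event is ever lost or orphaned.
    with conn:  # one transaction; rolls back on any exception
        conn.execute("INSERT INTO articles VALUES (?, ?)", (article_id, title))
        conn.execute(
            "INSERT INTO outbox (event) VALUES (?)",
            (json.dumps({"type": "article_created", "id": article_id}),))

def publish_pending(publish) -> None:
    # A relay polls unpublished events, hands each to the broker (here just
    # a callable), and marks it published; media upload and search indexing
    # are triggered by these events rather than inline with the DB write.
    rows = conn.execute(
        "SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(json.loads(event))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

If the relay crashes after publishing but before marking the row, the event is re-delivered on the next poll, so downstream consumers of the indexing and media-upload events must be idempotent, which ties back to the Kafka-retry guidance above.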