Design a News Aggregator System
System Design · Must
Problem Statement
Design a news aggregation system like Google News that collects articles from many publishers, allows users to subscribe to different news sources, categories, and topics, and delivers a personalized, near real-time feed based on their interests. Users should see a continuously updating stream of headlines, summaries, and links from sources they follow.
At Atlassian, this maps to aggregating updates from multiple product feeds -- Jira activity, Confluence page edits, Bitbucket commits -- into a unified activity stream. The interview focuses on designing multi-stage ingestion pipelines, choosing between fanout-on-write and fanout-on-read for feed assembly, handling deduplication of syndicated content, and meeting latency targets for a read-heavy workload. Interviewers reward candidates who scope ranking simply (rule-based by recency and relevance) and focus on correctness, scale, and timely updates.
Key Requirements
Functional
- Source subscription -- users subscribe to news sources, categories, and keyword topics to personalize their feed
- Personalized feed -- deliver a feed that updates in near real-time with new and relevant stories matching user interests
- Preference management -- users can mute or unfollow sources and topics, set language and region filters, and see changes reflected quickly
- Breaking news notifications -- optionally push alerts for high-priority stories aligned with user interests
Non-Functional
- Scalability -- support millions of active users with hundreds of publishers producing thousands of articles per hour
- Latency -- p95 feed load time under 200ms; new articles visible in feeds within 30 seconds of ingestion
- Reliability -- no article loss during publisher outages or internal failures; graceful degradation to cached feeds
- Consistency -- eventual consistency for feed contents; strong consistency for subscription changes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Deduplication
News articles are syndicated across many publishers, so the same story appears from multiple sources. Interviewers probe how you normalize, deduplicate, and enrich articles reliably.
Hints to consider:
- Use Kafka to decouple fetchers from processors, with per-source partitioning for ordering and rate isolation
- Implement a multi-stage pipeline: fetch via RSS or webhooks, parse and normalize to a canonical schema, deduplicate using content fingerprints (SimHash or MinHash), enrich with categories and entities, then index
- Assign stable article IDs based on content hashes to make the entire pipeline idempotent
- Use dead-letter queues for articles that fail parsing, allowing manual inspection without blocking the pipeline
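The fingerprinting and stable-ID hints above can be sketched in a few lines. This is a minimal, illustrative Python version: the 3-word shingles, 64-bit width, and Hamming threshold of 3 are assumed tuning choices, not prescribed values, and MD5 stands in for any fast non-cryptographic hash.

```python
import hashlib


def stable_article_id(canonical_text: str) -> str:
    """Derive a stable ID from normalized content, so replays are idempotent."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]


def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over 3-word shingles: near-duplicate articles land close."""
    weights = [0] * bits
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):
        shingle = " ".join(tokens[i:i + 3])
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Syndicated copies differ slightly, so compare fingerprints, not hashes."""
    return hamming(simhash(a), simhash(b)) <= threshold
```

In the pipeline, the fingerprint is checked against an index of recent fingerprints before enrichment; the SHA-256 ID keys every downstream write so retries never create duplicates.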
2. Feed Assembly Strategy
The core design decision is whether to precompute feeds on write or assemble them on read. Interviewers test whether you understand the tradeoffs for different user and publisher distributions.
Hints to consider:
- Fanout-on-write works well when the number of subscribers per source is moderate: when an article arrives, push its ID into each subscriber's feed list in Redis
- Fanout-on-read is better for users who follow many sources or for publishers with millions of followers, avoiding write amplification
- A hybrid approach uses fanout-on-write for most publishers and fanout-on-read for high-follower sources
- Pre-compute and cache the top 50 feed items per user with a short TTL, invalidating on new article arrival
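The hybrid approach can be sketched with in-memory structures standing in for Redis. This is an assumed minimal model, not a production layout: the follower threshold is tiny so the demo triggers both paths, and timestamps are plain integers.

```python
import heapq
from collections import defaultdict

HOT_FOLLOWER_THRESHOLD = 2          # illustrative; millions in practice

followers = defaultdict(set)        # source -> subscriber user ids
inbox = defaultdict(list)           # user -> [(ts, article_id)], precomputed on write
hot_source_articles = defaultdict(list)  # hot source -> [(ts, article_id)]


def publish(source: str, article_id: str, ts: int) -> None:
    """Fanout-on-write for normal sources; hot sources are written once."""
    if len(followers[source]) > HOT_FOLLOWER_THRESHOLD:
        hot_source_articles[source].append((ts, article_id))
    else:
        for user in followers[source]:
            inbox[user].append((ts, article_id))


def read_feed(user: str, subscriptions: list, limit: int = 50) -> list:
    """Fanout-on-read: merge the precomputed inbox with hot sources by recency."""
    candidates = list(inbox[user])
    for source in subscriptions:
        candidates.extend(hot_source_articles.get(source, []))
    return [aid for ts, aid in heapq.nlargest(limit, candidates)]
```

The key property is that a publisher with millions of followers costs one write per article, while the common case stays a cheap list read per user.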
3. Real-Time Update Delivery
Users expect their feed to update without manual refresh. Interviewers assess your push mechanism and how it handles millions of concurrent connections.
Hints to consider:
- Use WebSocket or Server-Sent Events (SSE) for connected users, with a lightweight notification payload (article ID, headline) rather than full content
- Implement a pub/sub layer (Redis Pub/Sub or Kafka consumer groups) to fan out notifications to WebSocket servers
- Fall back to client-side polling for unreliable connections with a reasonable interval (15-30 seconds)
- Use a CDN to serve the initial feed page and static assets, reducing origin load during traffic spikes
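The notification path above can be sketched with in-process queues standing in for WebSocket/SSE connections and for the Redis Pub/Sub layer. The payload shape (`id` plus `headline`) follows the hint of pushing a pointer, not full content; everything else here is an assumed simplification.

```python
import json
import queue
from collections import defaultdict

connections = {}                 # conn_id -> outbound queue (stand-in for a socket)
topic_subs = defaultdict(set)    # topic (source or category) -> connection ids


def connect(conn_id: str, topics: list) -> None:
    """Register a client connection and its topic subscriptions."""
    connections[conn_id] = queue.Queue()
    for t in topics:
        topic_subs[t].add(conn_id)


def notify(topic: str, article_id: str, headline: str) -> None:
    """Fan a lightweight payload out to every subscribed connection."""
    payload = json.dumps({"id": article_id, "headline": headline})
    for conn_id in topic_subs.get(topic, ()):
        connections[conn_id].put(payload)
```

On receipt, the client renders the headline immediately and fetches the full article body (via CDN) only when the user opens it, keeping per-notification bandwidth small even at millions of connections.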
4. Handling Write Spikes and Hot Categories
When major news breaks, a single category (politics, sports) can see a burst of articles that overwhelms the ingestion pipeline. Interviewers want to see how you absorb spikes without degrading the user experience.
Hints to consider:
- Kafka absorbs bursts naturally through its log-based architecture; add consumer instances to increase processing throughput
- Implement circuit breakers on external publisher feeds to prevent a misbehaving source from starving others
- Use rate limiting per category during fanout to prevent a single hot topic from consuming all write capacity
- Serve slightly stale cached feeds during spikes rather than letting the system degrade
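The per-category rate limiting hint can be made concrete as a token bucket keyed by category. This is a single-node sketch under assumed parameters (in a distributed fanout tier the counters would live in something like Redis); `now` is injectable for testing.

```python
import time


class CategoryRateLimiter:
    """Token bucket per category: one hot topic cannot starve fanout capacity."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}  # category -> (tokens, last_seen_ts)

    def allow(self, category: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(category, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1:
            self.buckets[category] = (tokens - 1, now)
            return True
        self.buckets[category] = (tokens, now)
        return False
```

Articles that exceed a category's budget are not dropped; they stay in Kafka and are fanned out once the bucket refills, which is exactly the "slightly stale during spikes" tradeoff described above.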