Design a News Aggregator System
System Design · Must
Problem Statement
Design a news aggregation system like Google News that collects articles from many publishers, allows users to subscribe to different news sources, categories, and topics, and delivers a personalized, near real-time feed based on their interests. Users should see a continuously updating stream of headlines, summaries, and links from sources they follow.
At Atlassian, this maps to aggregating updates from multiple product feeds -- Jira activity, Confluence page edits, Bitbucket commits -- into a unified activity stream. The interview focuses on designing multi-stage ingestion pipelines, choosing between fanout-on-write and fanout-on-read for feed assembly, handling deduplication of syndicated content, and meeting latency targets for a read-heavy workload. Interviewers reward candidates who scope ranking simply (rule-based by recency and relevance) and focus on correctness, scale, and timely updates.
Key Requirements
Functional
- Source subscription -- users subscribe to news sources, categories, and keyword topics to personalize their feed
- Personalized feed -- deliver a feed that updates in near real-time with new and relevant stories matching user interests
- Preference management -- users can mute or unfollow sources and topics, set language and region filters, and see changes reflected quickly
- Breaking news notifications -- optionally push alerts for high-priority stories aligned with user interests
Non-Functional
- Scalability -- support millions of active users with hundreds of publishers producing thousands of articles per hour
- Latency -- p95 feed load time under 200ms; new articles visible in feeds within 30 seconds of ingestion
- Reliability -- no article loss during publisher outages or internal failures; graceful degradation to cached feeds
- Consistency -- eventual consistency for feed contents; strong consistency for subscription changes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Deduplication
News articles are syndicated across many publishers, so the same story appears from multiple sources. Interviewers probe how you normalize, deduplicate, and enrich articles reliably.
Hints to consider:
- Use Kafka to decouple fetchers from processors, with per-source partitioning for ordering and rate isolation
- Implement a multi-stage pipeline: fetch via RSS or webhooks, parse and normalize to a canonical schema, deduplicate using content fingerprints (SimHash or MinHash), enrich with categories and entities, then index
- Assign stable article IDs based on content hashes to make the entire pipeline idempotent
- Use dead-letter queues for articles that fail parsing, allowing manual inspection without blocking the pipeline
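The fingerprinting and stable-ID hints above can be sketched in a few lines. This is a minimal, illustrative Python version: the 3-word shingles, 64-bit width, and Hamming threshold of 3 are assumed tuning choices, not prescribed values, and MD5 stands in for any fast non-cryptographic hash.

```python
import hashlib


def stable_article_id(canonical_text: str) -> str:
    """Derive a stable ID from normalized content, so replays are idempotent."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]


def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over 3-word shingles: near-duplicate articles land close."""
    weights = [0] * bits
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):
        shingle = " ".join(tokens[i:i + 3])
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Syndicated copies differ slightly, so compare fingerprints, not hashes."""
    return hamming(simhash(a), simhash(b)) <= threshold
```

In the pipeline, the fingerprint is checked against an index of recent fingerprints before enrichment; the SHA-256 ID keys every downstream write so retries never create duplicates.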
2. Feed Assembly Strategy
The core design decision is whether to precompute feeds on write or assemble them on read. Interviewers test whether you understand the tradeoffs for different user and publisher distributions.
Hints to consider:
- Fanout-on-write works well when the number of subscribers per source is moderate: when an article arrives, push its ID into each subscriber's feed list in Redis
- Fanout-on-read is better for users who follow many sources or for publishers with millions of followers, avoiding write amplification
- A hybrid approach uses fanout-on-write for most publishers and fanout-on-read for high-follower sources
- Pre-compute and cache the top 50 feed items per user with a short TTL, invalidating on new article arrival
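The hybrid approach can be sketched with in-memory structures standing in for Redis. This is an assumed minimal model, not a production layout: the follower threshold is tiny so the demo triggers both paths, and timestamps are plain integers.

```python
import heapq
from collections import defaultdict

HOT_FOLLOWER_THRESHOLD = 2          # illustrative; millions in practice

followers = defaultdict(set)        # source -> subscriber user ids
inbox = defaultdict(list)           # user -> [(ts, article_id)], precomputed on write
hot_source_articles = defaultdict(list)  # hot source -> [(ts, article_id)]


def publish(source: str, article_id: str, ts: int) -> None:
    """Fanout-on-write for normal sources; hot sources are written once."""
    if len(followers[source]) > HOT_FOLLOWER_THRESHOLD:
        hot_source_articles[source].append((ts, article_id))
    else:
        for user in followers[source]:
            inbox[user].append((ts, article_id))


def read_feed(user: str, subscriptions: list, limit: int = 50) -> list:
    """Fanout-on-read: merge the precomputed inbox with hot sources by recency."""
    candidates = list(inbox[user])
    for source in subscriptions:
        candidates.extend(hot_source_articles.get(source, []))
    return [aid for ts, aid in heapq.nlargest(limit, candidates)]
```

The key property is that a publisher with millions of followers costs one write per article, while the common case stays a cheap list read per user.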
3. Real-Time Update Delivery
Users expect their feed to update without manual refresh. Interviewers assess your push mechanism and how it handles millions of concurrent connections.
Hints to consider:
- Use WebSocket or Server-Sent Events (SSE) for connected users, with a lightweight notification payload (article ID, headline) rather than full content
- Implement a pub/sub layer (Redis Pub/Sub or Kafka consumer groups) to fan out notifications to WebSocket servers
- Fall back to client-side polling for unreliable connections with a reasonable interval (15-30 seconds)
- Use a CDN to serve the initial feed page and static assets, reducing origin load during traffic spikes
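The notification path above can be sketched with in-process queues standing in for WebSocket/SSE connections and for the Redis Pub/Sub layer. The payload shape (`id` plus `headline`) follows the hint of pushing a pointer, not full content; everything else here is an assumed simplification.

```python
import json
import queue
from collections import defaultdict

connections = {}                 # conn_id -> outbound queue (stand-in for a socket)
topic_subs = defaultdict(set)    # topic (source or category) -> connection ids


def connect(conn_id: str, topics: list) -> None:
    """Register a client connection and its topic subscriptions."""
    connections[conn_id] = queue.Queue()
    for t in topics:
        topic_subs[t].add(conn_id)


def notify(topic: str, article_id: str, headline: str) -> None:
    """Fan a lightweight payload out to every subscribed connection."""
    payload = json.dumps({"id": article_id, "headline": headline})
    for conn_id in topic_subs.get(topic, ()):
        connections[conn_id].put(payload)
```

On receipt, the client renders the headline immediately and fetches the full article body (via CDN) only when the user opens it, keeping per-notification bandwidth small even at millions of connections.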
4. Handling Write Spikes and Hot Categories
When major news breaks, a single category (politics, sports) can see a burst of articles that overwhelms the ingestion pipeline. Interviewers want to see how you absorb spikes without degrading the user experience.
Hints to consider:
- Kafka absorbs bursts naturally through its log-based architecture; add consumer instances to increase processing throughput
- Implement circuit breakers on external publisher feeds to prevent a misbehaving source from starving others
- Use rate limiting per category during fanout to prevent a single hot topic from consuming all write capacity
- Serve slightly stale cached feeds during spikes rather than letting the system degrade
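The per-category rate limiting hint can be made concrete as a token bucket keyed by category. This is a single-node sketch under assumed parameters (in a distributed fanout tier the counters would live in something like Redis); `now` is injectable for testing.

```python
import time


class CategoryRateLimiter:
    """Token bucket per category: one hot topic cannot starve fanout capacity."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}  # category -> (tokens, last_seen_ts)

    def allow(self, category: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(category, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1:
            self.buckets[category] = (tokens - 1, now)
            return True
        self.buckets[category] = (tokens, now)
        return False
```

Articles that exceed a category's budget are not dropped; they stay in Kafka and are fanned out once the bucket refills, which is exactly the "slightly stale during spikes" tradeoff described above.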