Design a News Aggregator System
Problem Statement
Design a live sports scores and updates platform similar to ESPN ScoreCenter or theScore that delivers real-time game data, play-by-play updates, and statistics to millions of concurrent users across multiple sports leagues. The system must ingest feeds from official data providers, normalize heterogeneous formats, compute derived statistics, and push updates to users following specific teams, leagues, or games with sub-second latency.
The platform serves fans who expect instant score updates, live commentary, and rich game context (player stats, league standings, game highlights) without refresh delays. During peak events like playoff games or championship matches, millions of users may follow the same game simultaneously, creating massive read spikes and fanout challenges. Your design must balance fresh data delivery, efficient resource utilization, and graceful degradation under load while maintaining data accuracy and consistency across all client views.
Key Requirements
Functional
- Live score updates -- Users receive scores, play-by-play events, and game state changes within 1 second of occurrence
- Flexible subscriptions -- Users follow specific teams, leagues, players, or individual games and receive relevant updates only
- Rich game context -- System provides cumulative stats, historical comparisons, league standings, and player performance metrics alongside live scores
- Multi-sport support -- Platform handles diverse sports with different scoring systems, game structures, and update frequencies (soccer vs. basketball vs. baseball)
- Historical data access -- Users query past games, season statistics, and head-to-head records with low latency
Non-Functional
- Scalability -- Support 50M concurrent users during major events and handle 100K+ games per day across all sports
- Reliability -- 99.95% uptime during live games, zero data loss for official score updates, graceful degradation when upstream feeds fail
- Latency -- p99 ingestion-to-delivery latency under 800ms, API response times under 100ms for live game queries
- Consistency -- All users converge on identical scores and game state within the eventual consistency window; no client ever sees a score roll back or an update that was never official
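A quick back-of-envelope pass over these targets helps size the design. The sketch below takes the 50M concurrent users and 99.95% uptime figures from the requirements; the per-gateway connection capacity and the headroom percentage are assumptions chosen only for illustration, and the downtime budget is computed over a full year as an upper bound on live-game hours.

```python
CONCURRENT_USERS = 50_000_000   # from the requirements above
UPTIME_TARGET = 0.9995          # from the requirements above
CONNS_PER_GATEWAY = 100_000     # assumption: open connections one gateway node can hold
HEADROOM_PCT = 30               # assumption: capacity kept free for reconnect storms


def gateways_needed(users: int, conns_per_gateway: int, headroom_pct: int) -> int:
    """Number of gateway nodes needed when each node keeps headroom_pct% free."""
    usable = conns_per_gateway * (100 - headroom_pct) // 100
    return -(-users // usable)  # ceiling division with integers


def downtime_budget_hours(uptime: float, period_hours: float = 365 * 24) -> float:
    """Allowed downtime over the period for a given availability target."""
    return (1 - uptime) * period_hours


if __name__ == "__main__":
    print(gateways_needed(CONCURRENT_USERS, CONNS_PER_GATEWAY, HEADROOM_PCT))
    print(round(downtime_budget_hours(UPTIME_TARGET), 2))
```

Under these assumed numbers the connection tier alone runs to several hundred nodes, which is why the distribution layer gets so much attention in the discussion.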
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline Design and Data Normalization
How you handle heterogeneous feeds from multiple providers, each with unique formats, update frequencies, and reliability characteristics. A soccer API might send continuous possession updates while baseball sends discrete pitch-by-pitch events.
Hints to consider:
- Use schema adapters per provider to normalize into a canonical event format before processing
- Implement idempotency keys and sequence numbers to handle duplicate feeds and out-of-order delivery
- Apply backpressure and rate limiting per provider to prevent a single misbehaving feed from overwhelming the system
- Design a conflict resolution strategy when multiple feeds disagree on the same event (e.g., timestamp discrepancies, scoring disputes)
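The first two hints can be sketched as follows. This is a minimal illustration, not a reference implementation: the two provider payload shapes, the field names, and the `Ingestor` class are all hypothetical, and it assumes both providers expose the same per-game sequence number so that `(game_id, seq)` can serve as the idempotency key.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple


@dataclass(frozen=True)
class CanonicalEvent:
    provider: str
    game_id: str
    seq: int          # per-game sequence number (assumed shared across providers)
    kind: str
    home_score: int
    away_score: int


def adapt_provider_a(raw: dict) -> CanonicalEvent:
    # Hypothetical provider A: nested score object, camelCase keys.
    return CanonicalEvent("A", raw["gameId"], raw["sequence"], raw["type"],
                          raw["score"]["home"], raw["score"]["away"])


def adapt_provider_b(raw: dict) -> CanonicalEvent:
    # Hypothetical provider B: score encoded as a "home-away" string.
    home, away = (int(part) for part in raw["score"].split("-"))
    return CanonicalEvent("B", raw["match"], raw["n"], raw["evt"], home, away)


ADAPTERS: Dict[str, Callable[[dict], CanonicalEvent]] = {
    "A": adapt_provider_a,
    "B": adapt_provider_b,
}


class Ingestor:
    def __init__(self) -> None:
        self._seen: Set[Tuple[str, int]] = set()   # idempotency keys already applied
        self._last_seq: Dict[str, int] = {}        # highest sequence applied per game
        self.state: Dict[str, CanonicalEvent] = {}  # latest accepted event per game

    def ingest(self, provider: str, raw: dict) -> bool:
        """Normalize one raw message; return True if it changed game state."""
        event = ADAPTERS[provider](raw)
        key = (event.game_id, event.seq)
        if key in self._seen:
            return False                 # duplicate delivery or redundant feed
        self._seen.add(key)
        if event.seq < self._last_seq.get(event.game_id, -1):
            return False                 # stale: a newer event was already applied
        self._last_seq[event.game_id] = event.seq
        self.state[event.game_id] = event
        return True
```

Dropping stale events is acceptable here only because each event carries the full score; a play-by-play timeline would instead buffer and reorder late arrivals.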
2. Real-Time Update Distribution at Scale
How you deliver score changes to millions of subscribers without write amplification or hot partition problems. A single touchdown might need to reach 5M users following that team or game.
Hints to consider:
- Choose between fanout-on-write (precompute user feeds) vs. fanout-on-read (compute on query) based on follower distribution and update frequency
- Use WebSocket connections with connection pooling and sticky sessions to maintain stateful push channels efficiently
- Implement topic-based pub/sub where each game/team is a topic, allowing subscribers to self-select without coordinator bottlenecks
- Cache active game states in memory tiers and use incremental updates rather than full snapshots to reduce bandwidth
- Design heartbeat and reconnection logic to handle mobile clients with flaky network conditions
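Topic-based pub/sub combined with incremental updates might look like the in-process sketch below. `TopicBroker` and the topic naming are hypothetical; a production system would put a message broker and a WebSocket gateway tier behind the same interface. New subscribers receive one full snapshot, then only the fields that changed.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Callback = Callable[[str, dict], None]


class TopicBroker:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)
        self._snapshots: Dict[str, dict] = {}   # latest full state per topic

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)
        # A late joiner gets the current snapshot once, then deltas only.
        if topic in self._snapshots:
            callback(topic, dict(self._snapshots[topic]))

    def publish(self, topic: str, new_state: dict) -> dict:
        """Merge new_state into the snapshot and push only the changed fields."""
        old = self._snapshots.get(topic, {})
        delta = {k: v for k, v in new_state.items()
                 if k not in old or old[k] != v}
        self._snapshots[topic] = {**old, **new_state}
        if delta:
            for callback in self._subscribers[topic]:
                callback(topic, delta)
        return delta


if __name__ == "__main__":
    broker = TopicBroker()
    broker.subscribe("game:42", lambda topic, delta: print(topic, delta))
    broker.publish("game:42", {"home": 0, "away": 0, "clock": "12:00"})
    broker.publish("game:42", {"home": 3, "away": 0, "clock": "11:40"})
```

The same snapshot-then-delta handshake is what a reconnecting mobile client would use after a dropped connection.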
3. Derived Metrics and Aggregation Strategy
How you compute and maintain real-time statistics like player averages, team rankings, and league standings that depend on live game outcomes updating continuously.
Hints to consider:
- Use stream processing frameworks to compute rolling aggregations incrementally rather than recomputing from scratch
- Separate critical path (live scores) from analytical path (season stats) to prevent complex aggregations from blocking time-sensitive updates
- Precompute common queries (top scorers, standings) on write and invalidate caches selectively when dependencies change
- Store snapshots at key moments (end of quarter, end of game) to enable fast historical queries without replaying entire event streams
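Incremental aggregation, as opposed to recomputing from the full event history, can be illustrated with running totals that are updated once per scoring event. The `SeasonStats` class and its method names are hypothetical, and a real deployment would run equivalent logic inside a stream processor with durable state.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class PlayerTotals:
    games: int = 0
    points: int = 0

    @property
    def points_per_game(self) -> float:
        return self.points / self.games if self.games else 0.0


class SeasonStats:
    def __init__(self) -> None:
        self.totals: Dict[str, PlayerTotals] = defaultdict(PlayerTotals)
        self._appearances: Set[Tuple[str, str]] = set()  # (player, game_id) already counted

    def on_score(self, player: str, game_id: str, points: int) -> None:
        """Apply one scoring event in O(1) without replaying earlier events."""
        totals = self.totals[player]
        if (player, game_id) not in self._appearances:
            self._appearances.add((player, game_id))
            totals.games += 1
        totals.points += points

    def top_scorers(self, n: int) -> List[str]:
        # Sorted on demand for brevity; a cached leaderboard could be
        # refreshed on write instead.
        ranked = sorted(self.totals, key=lambda p: self.totals[p].points, reverse=True)
        return ranked[:n]
```

Because this state lives off the live-score path, a slow leaderboard refresh never delays a score push.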
4. Handling Spiky Traffic and Thundering Herds
How you manage load when millions of users tune in simultaneously at game start or refresh frantically during critical moments like overtime or penalty kicks.
Hints to consider:
- Implement read-through caching with TTLs aligned to update frequency (1 s for live scores, 1 min for standings)
- Use request coalescing to deduplicate identical concurrent queries and serve from a single backend call
- Apply circuit breakers and fallback to slightly stale data when live query paths are overloaded
- Design autoscaling policies that anticipate game schedules rather than reacting after traffic arrives
- Serve static or slowly-changing data (team rosters, league schedules) from CDN edges to reduce origin load
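Read-through caching and request coalescing fit together naturally: on a cache miss, one caller loads the value while concurrent callers for the same key wait for that result instead of hitting the backend. The sketch below is a single-process illustration using threads; `CoalescingCache` and its `loader` parameter are hypothetical names, and a distributed deployment would need the same idea at the cache or proxy tier.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class CoalescingCache:
    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._values: Dict[str, Tuple[float, Any]] = {}     # key -> (expires_at, value)
        self._inflight: Dict[str, threading.Event] = {}     # keys being loaded right now

    def get(self, key: str) -> Any:
        while True:
            with self._lock:
                entry = self._values.get(key)
                if entry is not None and entry[0] > self._clock():
                    return entry[1]                  # fresh cache hit
                waiter = self._inflight.get(key)
                if waiter is None:
                    done = threading.Event()         # this caller becomes the leader
                    self._inflight[key] = done
                    break
            waiter.wait()                            # follower: wait, then re-check the cache
        try:
            value = self._loader(key)                # the single backend call
            with self._lock:
                self._values[key] = (self._clock() + self._ttl, value)
            return value
        finally:
            with self._lock:
                del self._inflight[key]
            done.set()                               # wake followers even if the load failed
```

If the loader raises, waiting callers wake up, find no fresh entry, and one of them retries as the new leader; falling back to the last known value at that point would be one way to apply the stale-data hint.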