Design Meta Live Comments
Problem Statement
Design a real-time commentary system for live sports events where millions of viewers can follow professional commentators' play-by-play analysis and insights as games unfold. The platform must deliver commentary updates to viewers within a few hundred milliseconds of publication, support multiple simultaneous sporting events (from major leagues to regional games), and handle massive traffic spikes during critical moments like goals, touchdowns, or game-winning plays.
Your system should support approximately 500 million daily active users during peak sports seasons, with individual marquee events attracting 50+ million concurrent viewers. The platform needs to gracefully handle scenarios where viewers join mid-game, switch between multiple live events, lose network connectivity, or background their mobile apps. Commentary should feel instantaneous, ordered, and complete -- viewers should never see duplicate updates or miss critical play-by-play moments even under degraded network conditions.
Key Requirements
Functional
- Real-time commentary delivery -- Professional commentators publish play-by-play updates that appear on viewer screens within 200-500ms globally
- Historical context on join -- New viewers joining mid-game immediately see the last 50-100 commentary updates to understand game flow and current situation
- Multi-event switching -- Users can switch between different live games instantly and see only relevant commentary for each event without delays or cross-contamination
- Persistent ordering -- Commentary updates appear in strict chronological order for all viewers regardless of when they joined or reconnected
- Engagement features -- Viewers can react to individual commentary updates with emotions (excitement, disappointment, etc.) and see aggregate reaction counts in real-time
Non-Functional
- Scalability -- Support 50M+ concurrent viewers on a single major event, with 100+ simultaneous events running across the platform at peak times
- Reliability -- Maintain 99.95% uptime during live events with graceful degradation rather than complete failures under extreme load
- Latency -- P95 end-to-end latency under 500ms from commentator publish to viewer display, P99 under 1 second
- Consistency -- Guarantee strict ordering of commentary within a single event; eventual consistency acceptable for reaction counts and non-critical metadata
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Connection Management and State Recovery
The mobile-first reality means viewers constantly disconnect, background apps, switch networks, and rejoin. Interviewers want to see how you prevent gaps and duplicates while minimizing client and server overhead.
Hints to consider:
- Design a cursor-based subscription model where clients track the last commentary ID they received and can request catch-up efficiently
- Implement a bounded replay buffer (last N minutes) in fast storage so reconnecting clients don't trigger database queries
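- Use monotonically increasing sequence numbers per event to enable client-side deduplication and detect missing updates (vector clocks only become relevant if a single event can have multiple concurrent writers)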
- Consider heartbeat mechanisms to detect stale connections and clean up server-side state proactively
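The cursor, replay-buffer, and sequence-number hints above can be sketched together. This is a minimal illustration rather than a production design: `ReplayBuffer`, `ClientCursor`, and the capacity value are hypothetical names chosen for the example, and the buffer is held in process memory instead of a shared fast store.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    seq: int   # per-event, monotonically increasing sequence number
    text: str


class ReplayBuffer:
    """Bounded buffer of the most recent updates for ONE event (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self._updates = deque(maxlen=capacity)  # oldest entry is evicted when full
        self._next_seq = 1

    def publish(self, text: str) -> Update:
        update = Update(self._next_seq, text)
        self._next_seq += 1
        self._updates.append(update)
        return update

    def catch_up(self, last_seen_seq: int):
        """Return (missed_updates, gap).

        gap is True when the cursor is older than the buffer, meaning the
        client must fall back to slower historical storage for the rest.
        """
        missed = [u for u in self._updates if u.seq > last_seen_seq]
        oldest = self._updates[0].seq if self._updates else self._next_seq
        gap = last_seen_seq + 1 < oldest
        return missed, gap


class ClientCursor:
    """Client-side state: remembers the last sequence number applied."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.rendered = []

    def receive(self, update: Update) -> str:
        if update.seq <= self.last_seq:
            return "duplicate"          # already shown, drop silently
        if update.seq > self.last_seq + 1:
            return "gap"                # something was missed, request catch-up
        self.last_seq = update.seq
        self.rendered.append(update.text)
        return "applied"
```

On reconnect, a client would send `last_seq` to the server, apply whatever `catch_up` returns, and only query historical storage when `gap` is true.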
2. Hot Event Handling and Write Amplification
A championship game finale with 50M viewers means each commentary update must fan out to millions of connections. Naive broadcasting collapses under this load and creates massive write amplification.
Hints to consider:
- Partition subscribers across multiple WebSocket gateway servers and build a hierarchical fanout tree to distribute load
- Use consistent hashing to concentrate viewers of the same event on a bounded subset of gateway servers, reducing cross-server messaging overhead
- Implement backpressure and rate limiting at the commentator ingestion layer to prevent system overload during frenzied play-by-play moments
- Design admission control and tiered service levels so degradation affects less critical features (like reaction count accuracy) before core commentary delivery
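To get a feel for the hierarchical fanout idea, it helps to size the tree. The sketch below assumes 100,000 connections per gateway and a branching factor of 25 per relay node; both numbers are illustrative assumptions, not figures from the problem statement, and the function names are hypothetical.

```python
def gateways_needed(viewers: int, conns_per_gateway: int) -> int:
    """Number of gateway servers required to hold all viewer connections."""
    return (viewers + conns_per_gateway - 1) // conns_per_gateway  # integer ceiling


def fanout_tiers(leaves: int, branching: int) -> list:
    """Node counts for each relay tier above the gateways, ending at the root.

    Each relay node forwards one incoming message to at most `branching`
    children, so no single node ever writes to millions of sockets.
    """
    if branching < 2:
        raise ValueError("branching factor must be at least 2")
    tiers = []
    nodes = leaves
    while nodes > 1:
        nodes = (nodes + branching - 1) // branching
        tiers.append(nodes)
    return tiers


# Assumed numbers: 50M viewers, 100k sockets per gateway, 25 children per relay.
gateways = gateways_needed(50_000_000, 100_000)
tiers = fanout_tiers(gateways, 25)
print(gateways, tiers)
```

Under these assumptions a single commentary update passes through two relay tiers (one root, then 20 relays) before reaching 500 gateways, each of which writes to its local sockets.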
3. Storage Architecture for Dual Access Patterns
The system has conflicting needs: real-time append-only writes for new commentary, sequential reads for viewer catch-up, and random access for historical lookup. No single database excels at all three patterns.
Hints to consider:
- Use an append-only log (like Kafka) as the source of truth for commentary events, supporting both real-time tailing and replay from arbitrary offsets
- Maintain a sliding window cache of recent commentary in Redis with TTL-based eviction for fast reconnect scenarios
- Asynchronously write commentary to a document store or wide-column database for historical access and analytics, accepting eventual consistency
- Consider partitioning by eventId rather than time to keep all commentary for a game co-located and simplify queries
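The append-only log and partition-by-eventId hints can be illustrated with an in-memory stand-in. This is a sketch only: `PartitionedLog` and `partition_for` are hypothetical names, and the CRC32-based partitioner is an assumption made for the example, not the partitioner any particular log system uses.

```python
import zlib


def partition_for(event_id: str, num_partitions: int) -> int:
    """Stable mapping from eventId to a partition (CRC32 is an assumption)."""
    return zlib.crc32(event_id.encode("utf-8")) % num_partitions


class PartitionedLog:
    """In-memory stand-in for an append-only, partitioned commentary log."""

    def __init__(self, num_partitions: int) -> None:
        self._partitions = [[] for _ in range(num_partitions)]

    def append(self, event_id: str, text: str):
        """Append a record and return (partition, offset)."""
        p = partition_for(event_id, len(self._partitions))
        self._partitions[p].append((event_id, text))
        return p, len(self._partitions[p]) - 1

    def replay(self, event_id: str, from_offset: int = 0) -> list:
        """Read one event's commentary in order, starting at a partition offset."""
        p = partition_for(event_id, len(self._partitions))
        return [
            text
            for eid, text in self._partitions[p][from_offset:]
            if eid == event_id
        ]


log = PartitionedLog(num_partitions=4)
log.append("game-1", "kickoff")
log.append("game-2", "tip-off")
_, goal_offset = log.append("game-1", "goal")
print(log.replay("game-1"))               # full history for the event
print(log.replay("game-1", goal_offset))  # tail from a known offset
```

Because every record for an event lands in one partition, ordering within the event follows directly from the append order, and both catch-up and real-time tailing reduce to reading from an offset.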
4. Geographic Distribution and CDN Integration
Sports audiences are globally distributed, but latency requirements are strict. Simply replicating to multiple regions creates consistency and coordination challenges.
Hints to consider:
- Deploy WebSocket gateway clusters in each major region but maintain a single source-of-truth write path to avoid split-brain scenarios
- Use a global message bus (multi-region Kafka or Pub/Sub) to replicate commentary events to regional fanout tiers asynchronously
- Accept that some viewers see updates 100-200ms later than others due to geographic distribution; optimize for within-region latency
- Design the system so regional failures degrade gracefully by falling back to cross-region WebSocket connections rather than complete outages
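The regional fallback hint can be made concrete with a small selection routine. This is a simplified sketch under stated assumptions: the `Region` type, the region names, and the RTT figures are all hypothetical, and a real client would obtain health and latency data from probes or a discovery service.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str       # hypothetical region identifier
    rtt_ms: float   # round-trip time measured from this client
    healthy: bool   # result of the latest health check


def pick_region(regions: Sequence[Region]) -> Optional[Region]:
    """Prefer the lowest-latency healthy region.

    If the nearest region is down, the client falls back to a cross-region
    connection instead of failing outright; None means nothing is reachable.
    """
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.rtt_ms)


regions = [
    Region("eu-west", 20.0, healthy=False),   # nearest region is unavailable
    Region("us-east", 95.0, healthy=True),
    Region("ap-south", 180.0, healthy=True),
]
choice = pick_region(regions)
print(choice.name if choice else "no region available")
```

The viewer in this example pays extra latency by connecting to a farther region, which matches the stated preference for degraded service over a complete outage.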