Design Status Search
Problem Statement
Design a system that aggregates and displays live updates from millions of concurrent events happening around the world. Think of a platform where sports matches, concerts, conferences, and breaking news events all stream real-time commentary, scores, and updates. Users should be able to browse active events, subscribe to specific ones, and see updates flow in with sub-second latency. The system must handle peak loads during major global events (like World Cup finals or breaking news) where tens of millions of users simultaneously follow the same event, while also serving thousands of smaller niche events with just a handful of followers each.
Your design needs to support event creators posting updates at varying rates (from one update per minute to hundreds per second during intense moments), handle massive fan-out when popular events broadcast to millions, maintain ordering guarantees so users see updates in the correct sequence, and scale out as the number of events grows while absorbing sudden load spikes when a single event goes viral. The challenge is balancing write throughput, read fan-out efficiency, storage costs for billions of historical updates, and the real-time delivery expectations of modern users.
Key Requirements
Functional
- Event creation and management -- users can create events, post updates to them, and close/archive events when finished
- Real-time update delivery -- followers receive new updates within 1 second of posting with correct chronological ordering
- Event discovery and subscription -- users can browse trending/active events and subscribe to receive live updates
- Historical playback -- users can view past updates from completed events in chronological order with pagination
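The historical-playback requirement above is typically served with cursor-based pagination keyed on a per-event sequence number, so pages stay stable even as new updates arrive. A minimal sketch, assuming an in-memory store; all names (`UpdateStore`, `page`, `after_seq`) are illustrative:

```python
# Cursor-based pagination over per-event updates. The cursor is simply the
# last sequence number the client has seen, so pages are stable and cheap.
from dataclasses import dataclass

@dataclass
class Update:
    event_id: str
    seq: int        # monotonically increasing per event
    body: str

class UpdateStore:
    def __init__(self) -> None:
        self._by_event: dict[str, list[Update]] = {}

    def append(self, u: Update) -> None:
        self._by_event.setdefault(u.event_id, []).append(u)

    def page(self, event_id: str, after_seq: int = 0, limit: int = 50):
        """Return up to `limit` updates with seq > after_seq, plus the next cursor."""
        items = sorted(
            (u for u in self._by_event.get(event_id, []) if u.seq > after_seq),
            key=lambda u: u.seq,
        )
        page = items[:limit]
        next_cursor = page[-1].seq if page else after_seq
        return page, next_cursor

store = UpdateStore()
for i in range(1, 6):
    store.append(Update("match-1", i, f"update {i}"))

page1, cursor = store.page("match-1", after_seq=0, limit=2)      # seq 1, 2
page2, cursor = store.page("match-1", after_seq=cursor, limit=2)  # seq 3, 4
```

In a real deployment the same cursor shape works against a partitioned database index on `(event_id, seq)`, unlike offset pagination, which degrades as offsets grow.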
Non-Functional
- Scalability -- support 100K active events simultaneously with 1M updates per second globally during peak hours
- Reliability -- no update loss; system should gracefully degrade under extreme load rather than fail completely
- Latency -- median update delivery under 500ms; p99 under 2 seconds even for events with millions of followers
- Consistency -- updates must appear in the same order for all subscribers; no duplicate delivery
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Write Path and Fan-Out Architecture
Interviewers want to see how you handle the fan-out problem when a single update must reach millions of subscribers. The naive approach of writing to millions of user timelines synchronously will collapse under load.
Hints to consider:
- Discuss the tradeoff between fan-out on write (pre-compute delivery lists) versus fan-out on read (compute on demand)
- Consider hybrid approaches where celebrity events use different delivery mechanisms than small events
- Think about batching, buffering, and backpressure mechanisms when delivery queues get overwhelmed
- Address how you detect and handle "hot" events that suddenly go viral
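The hybrid approach in the hints above can be sketched as a write-path branch on follower count: small events fan out on write into per-follower inboxes, while large events write once to a shared log that readers pull (usually through a cache or CDN). Threshold, store names, and functions here are illustrative assumptions, not a definitive implementation:

```python
# Hybrid fan-out sketch: push for small events, pull for large ones.
FANOUT_THRESHOLD = 10_000  # illustrative cutoff for "celebrity" events

event_log: dict[str, list[str]] = {}   # event_id -> all updates (write once)
inboxes: dict[str, list[str]] = {}     # user_id -> updates pushed at write time
followers: dict[str, list[str]] = {}   # event_id -> subscribed user_ids

def post_update(event_id: str, update: str) -> str:
    event_log.setdefault(event_id, []).append(update)
    subs = followers.get(event_id, [])
    if len(subs) <= FANOUT_THRESHOLD:
        # Fan-out on write: cheap when follower counts are small.
        for user in subs:
            inboxes.setdefault(user, []).append(update)
        return "pushed"
    # Fan-out on read: millions of followers fetch from the shared log instead.
    return "pull"

def read_timeline(user_id: str, subscribed: list[str]) -> list[str]:
    # Merge pushed inbox entries with pulled entries from large events.
    pulled = [u for ev in subscribed
              if len(followers.get(ev, [])) > FANOUT_THRESHOLD
              for u in event_log.get(ev, [])]
    return inboxes.get(user_id, []) + pulled
```

The key design point is that the expensive loop runs only when it is bounded; for viral events the cost moves to the read path, where it can be amortized across caches.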
2. Real-Time Delivery Mechanism
The interviewer will press on how you actually deliver updates to connected clients with sub-second latency while maintaining ordering guarantees and handling network failures.
Hints to consider:
- Compare WebSocket connections versus Server-Sent Events versus long polling for different client types
- Discuss how you maintain persistent connections at scale and which layer handles connection state
- Think about how to partition connections across servers and route updates to the correct connection
- Consider sequence numbers, acknowledgments, and client-side buffering for reliability
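The sequence-number and client-side-buffering hints above can be made concrete with a small receiver that delivers updates in order, holds out-of-order arrivals, drops duplicates, and reports gaps for re-fetch over the persistent connection. A minimal sketch; class and method names are assumptions for illustration:

```python
# Client-side ordering via per-event sequence numbers: deliver contiguous
# runs, buffer out-of-order arrivals, suppress duplicates, surface gaps.
class OrderedReceiver:
    def __init__(self) -> None:
        self.next_seq = 1                  # next sequence number to display
        self.buffer: dict[int, str] = {}   # out-of-order holds
        self.delivered: list[str] = []     # what the user actually sees

    def on_update(self, seq: int, body: str) -> None:
        if seq < self.next_seq or seq in self.buffer:
            return  # duplicate: drop, so each update is displayed at most once
        self.buffer[seq] = body
        # Flush any run that is now contiguous with what was delivered.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

    def missing(self) -> list[int]:
        """Sequence numbers to re-request from the server over the connection."""
        if not self.buffer:
            return []
        return [s for s in range(self.next_seq, max(self.buffer))
                if s not in self.buffer]
```

The same logic works regardless of transport (WebSocket, SSE, or long polling); the transport only changes how `on_update` gets invoked and how re-requests are sent.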
3. Storage Strategy for Hot and Cold Data
With billions of updates accumulating daily, interviewers expect you to articulate a tiered storage approach that balances access patterns, retention requirements, and cost.
Hints to consider:
- Recent/active event data needs low-latency access while historical data can tolerate higher latency
- Discuss time-based partitioning and automated archival policies
- Consider different storage technologies for the write-heavy hot path versus bulk cold storage
- Think about how query patterns differ between live events (sequential reads) and historical browsing (random access)
4. Handling Skewed Load Distribution
Real-world event popularity follows a power law distribution. Interviewers want to see if you recognize that 1% of events will consume 99% of resources and design accordingly.
Hints to consider:
- Discuss detecting trending events early through metrics and adaptive rate limiting
- Consider read replicas, caching layers, and CDN strategies for viral content
- Think about dedicated infrastructure pools or priority queues for high-profile events
- Address fairness and quality-of-service guarantees for smaller events during global traffic spikes
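Detecting a trending event early, as the hints above suggest, can be done with an exponentially weighted moving average of per-event request rate; crossing a threshold promotes the event to a dedicated delivery pool before it overwhelms shared infrastructure. A sketch with illustrative smoothing and threshold values:

```python
# Hot-event detection via an exponentially weighted moving average (EWMA)
# of request rate per event. Alpha and threshold are illustrative.
ALPHA = 0.3            # weight on the newest sample
HOT_THRESHOLD = 1000.0  # smoothed requests-per-tick that marks an event "hot"

class HotDetector:
    def __init__(self) -> None:
        self.ewma: dict[str, float] = {}

    def record(self, event_id: str, requests_this_tick: float) -> bool:
        """Fold in one tick's request count; True means route to the hot pool."""
        prev = self.ewma.get(event_id, 0.0)
        cur = ALPHA * requests_this_tick + (1 - ALPHA) * prev
        self.ewma[event_id] = cur
        return cur >= HOT_THRESHOLD
```

The EWMA smooths momentary blips while still reacting within a few ticks to a genuine spike, and the per-event state is one float, so tracking 100K events is trivial. The same signal can drive demotion back to the shared pool once the rate decays, preserving capacity for smaller events.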