Design YouTube
Problem Statement
Design a platform that delivers live sports events to millions of concurrent viewers with minimal delay, while also providing on-demand replays, real-time score updates, and synchronized commentary. The system must handle sudden traffic spikes when major games start, maintain consistent quality across varying network conditions, and support features like live chat and instant highlights.
Unlike on-demand video platforms, live streaming introduces unique challenges: content must be ingested in real-time from stadiums or broadcast centers, encoded and distributed with sub-second latency targets, and served to millions of viewers simultaneously watching the same event. The system must gracefully handle broadcaster interruptions, adapt to viewer bandwidth constraints, and provide frame-accurate DVR functionality. Interviewers want to see how you balance latency, scale, and cost while designing for both the live path (ingest to viewer) and the on-demand path (replay storage and delivery).
Key Requirements
Functional
- Live event streaming -- ingest video feeds from multiple sources, transcode in real-time, and deliver to millions of concurrent viewers with minimal delay
- Adaptive bitrate playback -- automatically adjust video quality based on viewer bandwidth and device capabilities without interrupting the stream
- DVR functionality -- allow viewers to pause, rewind, and catch up during live events while maintaining sync with the live broadcast
- Real-time metadata -- display synchronized scores, statistics, and commentary overlays that update as events unfold
- Instant replay generation -- automatically detect and create shareable highlight clips from live streams within seconds of key moments
Non-Functional
- Scalability -- support 10M+ concurrent viewers for major events with the ability to scale up within minutes of game start
- Reliability -- maintain 99.9% uptime during scheduled events with automatic failover for encoder and origin failures
- Latency -- deliver live video within 3-5 seconds of real-time (glass-to-glass latency) while maintaining quality
- Consistency -- ensure all viewers see the same content within acceptable time windows, with synchronized metadata updates
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Real-Time Ingest and Transcoding Pipeline
The live path is fundamentally different from on-demand uploads. You must ingest a continuous stream, transcode it in real-time into multiple bitrates and formats, and package segments as they arrive -- all while maintaining sub-second processing delays.
Hints to consider:
- Use protocols like RTMP or SRT for reliable ingest from broadcast centers, with primary and backup ingestion endpoints
- Deploy stateful transcoding clusters that maintain encoding context across segments, with hot standbys ready to take over
- Generate short segment durations (2-3 seconds) for low latency, but discuss the tradeoff with compression efficiency and CDN caching
- Implement health checks that detect stalled or corrupted streams and automatically switch to backup feeds
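The stalled-stream detection in the last hint can be sketched as a small monitor: it records the arrival time of the newest segment on each ingest endpoint and fails over to the backup feed when the primary goes quiet for longer than a couple of segment durations. This is an illustrative sketch (the class and threshold are assumptions, not a fixed design):

```python
# Illustrative health-check/failover sketch: feeds are ordered by priority,
# and a feed is considered stalled when no segment has arrived within the
# threshold (~2 segment durations at 2-second segments).

STALL_THRESHOLD_S = 4.0

class FeedMonitor:
    def __init__(self, feeds):
        self.feeds = feeds                            # e.g. ["primary", "backup"]
        self.last_segment = {f: 0.0 for f in feeds}   # feed -> last arrival time
        self.active = feeds[0]

    def on_segment(self, feed, now):
        """Called whenever a segment arrives on any ingest endpoint."""
        self.last_segment[feed] = now

    def select_feed(self, now):
        """Return the highest-priority feed that is not stalled."""
        for feed in self.feeds:
            if now - self.last_segment[feed] < STALL_THRESHOLD_S:
                self.active = feed   # failover, or fail-back to primary
                return feed
        return self.active           # all feeds stalled: keep the last known feed
```

Keeping the check priority-ordered means the system automatically fails back to the primary feed once it recovers, rather than staying on the backup indefinitely.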
2. CDN Strategy and Cache Warming
Live content creates unique caching challenges because everyone watches the same segments simultaneously, but those segments are constantly being created. A cache miss during a critical game moment can overwhelm your origin.
Hints to consider:
- Use origin shielding to protect transcoders from direct viewer requests, with dedicated shield POPs that aggregate CDN traffic
- Implement predictive cache warming that pushes new segments to CDN edges before viewers request them
- Design cache keys that include stream ID and segment sequence number, allowing parallel delivery of multiple bitrates
- Handle thundering herd problems when millions request the same new segment by using request coalescing at the CDN level
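The request-coalescing idea in the last hint can be shown in a few lines: when many viewers ask for the same freshly-created segment, only the first request triggers an origin fetch and the rest await the same in-flight result. A minimal asyncio sketch, assuming an async `origin_fetch` callable (the names are illustrative):

```python
import asyncio

class Coalescer:
    """Collapse concurrent requests for the same segment key into one origin fetch."""

    def __init__(self, origin_fetch):
        self.origin_fetch = origin_fetch   # async fn: segment key -> segment bytes
        self.inflight = {}                 # key -> asyncio.Task

    async def get(self, key):
        task = self.inflight.get(key)
        if task is None:
            # First requester for this segment: start the origin fetch and
            # register it so concurrent requesters can share the result.
            task = asyncio.create_task(self.origin_fetch(key))
            self.inflight[key] = task
            task.add_done_callback(lambda _t: self.inflight.pop(key, None))
        return await task
```

Real CDNs implement this at the edge (often called request collapsing), but the core invariant is the same: at most one origin request per segment key is in flight at a time.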
3. Time-Shifting and DVR Storage
Allowing viewers to pause and rewind live streams requires maintaining a sliding window of recent segments while managing storage costs and ensuring seekability.
Hints to consider:
- Store recent segments (last 2-4 hours) in a distributed cache tier with TTL-based eviction, separate from long-term replay storage
- Maintain parallel timelines for each viewer's playback position, with server-side tracking or client-side offsets
- Use a manifest manipulation service that generates personalized playlists showing available DVR windows per viewer
- Implement catch-up acceleration that temporarily speeds up playback (e.g. 1.05x) when viewers move back toward the live edge
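The manifest manipulation service mentioned above can be sketched as a function that, given the newest live segment sequence number, emits an HLS-style media playlist exposing only the viewer's allowed DVR window. The playlist tags follow the HLS format; the windowing policy and URL template are assumptions for illustration:

```python
SEGMENT_DURATION_S = 2  # short segments for low latency, per the ingest hints

def dvr_playlist(latest_seq, dvr_window_segments, segment_url_template):
    """Build a live HLS media playlist covering the last N segments."""
    first_seq = max(0, latest_seq - dvr_window_segments + 1)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{SEGMENT_DURATION_S}",
        f"#EXT-X-MEDIA-SEQUENCE:{first_seq}",
    ]
    for seq in range(first_seq, latest_seq + 1):
        lines.append(f"#EXTINF:{SEGMENT_DURATION_S}.0,")
        lines.append(segment_url_template.format(seq=seq))
    # No #EXT-X-ENDLIST tag: the stream is live, so players keep re-polling
    # the playlist and the window slides forward as new segments arrive.
    return "\n".join(lines)
```

Personalizing the `dvr_window_segments` value per viewer (or per entitlement tier) is what turns this into the "personalized playlists" the hint describes.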
4. Metadata Synchronization and Fan Engagement
Scores, stats, and chat messages must appear synchronized with the video stream despite variable viewer latency, network delays, and different playback positions.
Hints to consider:
- Embed presentation timestamps (PTS) in metadata events that reference specific video frames, allowing client-side alignment
- Use WebSocket or Server-Sent Events for pushing real-time updates, with fallback polling for clients behind restrictive networks
- Implement eventual consistency for non-critical metadata like viewer counts, but provide strong consistency for scores and game state
- Consider dedicated infrastructure for chat messages separate from video delivery to isolate scaling and failure domains
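The PTS-based alignment from the first hint can be sketched client-side: pushed metadata events carry the presentation timestamp of the frame they refer to, and the client buffers them in a min-heap, releasing each event only when its own playback position reaches that PTS. That way a viewer 20 seconds behind live sees the score change when *their* stream shows it. A hypothetical sketch:

```python
import heapq

class MetadataAligner:
    """Buffer pushed metadata events and release them in playback order."""

    def __init__(self):
        self._pending = []   # min-heap of (pts_seconds, event)

    def on_event(self, pts, event):
        """Called when the server pushes an event tagged with a video PTS."""
        heapq.heappush(self._pending, (pts, event))

    def due_events(self, playback_pts):
        """Pop all events whose PTS is at or before the current playback position."""
        due = []
        while self._pending and self._pending[0][0] <= playback_pts:
            due.append(heapq.heappop(self._pending)[1])
        return due
```

The heap also handles out-of-order delivery for free: an event that arrives late but references an earlier frame is still released in PTS order.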
5. Capacity Planning and Cost Optimization
Live events create predictable traffic patterns (sudden spikes at game start, gradual decline) but with massive peak-to-trough ratios that make static provisioning expensive.
Hints to consider:
- Pre-scale CDN and transcoding capacity based on event schedules, with automated ramp-up 15-30 minutes before scheduled start times
- Use spot instances or preemptible VMs for transcoding workloads, with reservation-based capacity for critical backup encoders
- Implement quality-based admission control that reduces bitrate options or redirects overflow traffic during extreme spikes
- Archive live streams to cheaper cold storage after 24-48 hours, with separate on-demand encoding for popular replays
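The schedule-driven pre-scaling from the first hint can be expressed as a target-capacity function: capacity ramps linearly from baseline to peak during the lead window before each scheduled start, holds at peak through the event, then drops back. The lead time and instance counts below are assumptions for illustration, not recommended values:

```python
from datetime import datetime, timedelta

RAMP_LEAD = timedelta(minutes=30)   # start scaling 30 min before kickoff
BASELINE = 10                       # always-on transcoder instances
PEAK = 200                          # instances provisioned for a major event

def target_capacity(now, events):
    """events: list of (start, end) datetimes; returns the desired instance count."""
    target = BASELINE
    for start, end in events:
        ramp_start = start - RAMP_LEAD
        if ramp_start <= now < start:
            # Linear ramp from baseline to peak during the lead window.
            frac = (now - ramp_start) / RAMP_LEAD
            target = max(target, int(BASELINE + frac * (PEAK - BASELINE)))
        elif start <= now <= end:
            target = max(target, PEAK)
    return target
```

An autoscaler evaluating this function every minute gives the "automated ramp-up before scheduled start times" behavior; reactive autoscaling on live metrics then handles any demand the schedule did not predict.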