Design YouTube
Problem Statement
Design a video streaming platform like YouTube where creators can upload and share videos, viewers can stream content on demand, and users receive personalized recommendations based on their viewing history. The platform must handle the complete lifecycle from video upload through processing to global delivery at scale.
YouTube is an on-demand video streaming platform with two distinct paths: the creator path (upload, processing, management) and the viewer path (low-latency playback at scale). Interviewers ask this to assess whether you can design for both paths while handling large blobs, asynchronous workflows, and viral read traffic. Expect to discuss uploads, transcoding pipelines (HLS/DASH), CDN strategy, metadata stores, and operational concerns like hotspots and backfills.
Key Requirements
Functional
- Video upload -- creators reliably upload large video files with pause and resume support for interrupted uploads and progress visibility
- On-demand playback -- viewers watch videos with smooth playback, adaptive quality based on bandwidth, and the ability to seek to any position
- Content management -- creators manage metadata (title, description, thumbnail), see processing status, and control visibility (public, unlisted, private)
- Sharing and discovery -- users share video links, browse recommendations, and search for content across the platform
Non-Functional
- Scalability -- support billions of video views per day with millions of uploads, handling viral content that generates massive concurrent viewership
- Reliability -- maintain 99.9% uptime for playback with graceful degradation during partial outages; zero data loss for uploaded content
- Latency -- video playback start under 2 seconds, adaptive bitrate switching within one segment, upload processing completion within minutes
- Consistency -- eventual consistency acceptable for view counts and recommendations; strong consistency for upload state and access controls
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Upload and Transcoding Pipeline
Video files are massive binary objects requiring specialized handling. The upload path must be resilient to network interruptions, and the processing pipeline must transcode videos into multiple formats and resolutions asynchronously.
Hints to consider:
- Use pre-signed URLs for direct-to-object-storage uploads, bypassing application servers to avoid bandwidth bottlenecks
- Implement multipart/chunked uploads with resumability so users can recover from network failures without re-uploading
- Design the transcoding pipeline as an asynchronous workflow with retries, idempotency, and progress tracking across multiple stages (decode, encode at multiple resolutions, package into HLS/DASH segments, generate thumbnails)
- Use a message queue to decouple upload completion from transcoding, enabling backpressure and independent scaling
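The resumable-upload bookkeeping behind these hints can be sketched in a few lines. This is an illustrative model, not a real storage client: the `ResumableUpload` class, part numbering, and MD5-as-ETag convention are assumptions standing in for a multipart upload API (real systems use 5-100 MB parts and server-issued ETags).

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny for the sketch -- real multipart uploads use 5-100 MB parts

def split_into_parts(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split an upload into numbered parts, as a multipart upload would."""
    return {
        i + 1: data[off:off + chunk_size]
        for i, off in enumerate(range(0, len(data), chunk_size))
    }

class ResumableUpload:
    """Tracks which parts the server has acknowledged so a client can resume."""
    def __init__(self, parts):
        self.parts = parts
        self.completed = {}          # part number -> etag (content hash)

    def upload_part(self, n):
        if n in self.completed:      # idempotent: re-sending a finished part is a no-op
            return self.completed[n]
        etag = hashlib.md5(self.parts[n]).hexdigest()
        self.completed[n] = etag
        return etag

    def remaining(self):
        return sorted(set(self.parts) - set(self.completed))

# Simulate an interrupted upload: parts 1-2 succeed, then the network drops.
upload = ResumableUpload(split_into_parts(b"0123456789abcdef"))
upload.upload_part(1)
upload.upload_part(2)
# On resume, the client asks which parts are missing and sends only those.
for n in upload.remaining():
    upload.upload_part(n)
assert upload.remaining() == []
```

The key design point is that resume state lives server-side (the set of acknowledged parts), so a client that crashes can reconstruct where it left off with one query instead of re-uploading gigabytes.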
2. CDN Strategy and Video Delivery
Serving video segments to millions of concurrent viewers requires intelligent content distribution. Cache misses during viral moments can overwhelm origin servers.
Hints to consider:
- Use origin shielding to protect storage from direct viewer requests, with dedicated shield POPs that aggregate CDN traffic
- Design cache keys that include video ID, resolution, and segment number for efficient parallel delivery of multiple bitrates
- Implement adaptive bitrate streaming (HLS or DASH) so clients automatically switch quality based on network conditions
- Handle thundering herd problems when a viral video generates millions of simultaneous requests for the same segments
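A common answer to the thundering-herd hint is request coalescing ("singleflight"): when many viewers miss cache on the same segment at once, one request goes to origin and the rest wait for its result. A minimal sketch, with the cache-key scheme from the hints above; the `Singleflight` class and `fetch_from_origin` are illustrative names, not a real CDN API.

```python
import threading

def cache_key(video_id: str, resolution: str, segment: int) -> str:
    # Key scheme from the hints: video ID + resolution + segment number.
    return f"{video_id}/{resolution}/seg-{segment:05d}"

class Singleflight:
    """Collapse concurrent cache misses for the same key into one origin fetch."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event guarding an in-progress fetch
        self._results = {}

    def get(self, key, fetch):
        with self._lock:
            if key in self._results:
                return self._results[key]      # cache hit
            ev = self._inflight.get(key)
            if ev is None:                     # we are the leader for this key
                ev = threading.Event()
                self._inflight[key] = ev
                leader = True
            else:
                leader = False
        if leader:
            self._results[key] = fetch(key)    # exactly one origin request
            ev.set()
        else:
            ev.wait()                          # followers wait instead of hitting origin
        return self._results[key]

origin_calls = []
def fetch_from_origin(key):
    origin_calls.append(key)
    return f"bytes-for-{key}"

sf = Singleflight()
key = cache_key("dQw4w9WgXcQ", "720p", 42)
threads = [threading.Thread(target=sf.get, args=(key, fetch_from_origin))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(origin_calls) == 1  # eight concurrent viewers, one origin request
```

Real CDNs implement this inside the edge/shield POPs, but being able to sketch the mechanism is what interviewers look for when they probe viral-traffic handling.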
3. Metadata Storage and Hot Content Handling
Video metadata, view counts, and engagement data face extreme read/write patterns, especially for trending content that creates hot keys in storage.
Hints to consider:
- Separate the metadata store (video details, creator info) from engagement counters (views, likes) since they have very different access patterns
- Use sharded counters for view counts on popular videos to avoid single-key contention, with periodic aggregation
- Cache hot video metadata in Redis with short TTLs to absorb read spikes without overwhelming the primary database
- Design the recommendation system to precompute candidate sets offline and do lightweight blending at serve time
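The sharded-counter hint can be sketched directly. Shard count and the random-shard write policy are illustrative choices; in practice each shard is a separate row or key in the datastore, and a periodic job (or the read path) sums them.

```python
import random

NUM_SHARDS = 16  # pick more shards for hotter keys; 16 is an illustrative choice

class ShardedCounter:
    """Spread writes for one hot key (a viral video's view count) across shards."""
    def __init__(self, shards: int = NUM_SHARDS):
        self.shards = [0] * shards   # in production: N separate rows/keys

    def increment(self):
        # Each write lands on a random shard, so no single row is contended.
        self.shards[random.randrange(len(self.shards))] += 1

    def total(self) -> int:
        # The periodic aggregation job sums the shards into the displayed count.
        return sum(self.shards)

views = ShardedCounter()
for _ in range(10_000):
    views.increment()
assert views.total() == 10_000
```

The trade-off to mention in the interview: reads now cost N lookups (or read a slightly stale aggregated value), which is acceptable because view counts only need eventual consistency per the non-functional requirements above.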
4. Content Processing Workflow Orchestration
Transcoding, thumbnail generation, content moderation, and metadata extraction are CPU-intensive multi-minute workflows that must be orchestrated reliably.
Hints to consider:
- Model the processing pipeline as a directed acyclic graph where each stage produces outputs consumed by downstream stages
- Implement checkpointing so partially completed workflows can resume after worker failures without restarting
- Use dead-letter queues for videos that repeatedly fail processing, with alerting for manual investigation
- Design for priority levels so that popular creators or time-sensitive content is processed ahead of the standard queue
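The DAG-plus-checkpointing hints above can be combined into a small scheduler sketch. The stage names, `MAX_ATTEMPTS`, and the `run_pipeline` function are assumptions for illustration; real systems persist the checkpoint set and dead-letter queue in durable storage rather than in-memory Python objects.

```python
# Pipeline stages as a DAG: each stage lists the stages it depends on.
DAG = {
    "decode":       [],
    "encode_720p":  ["decode"],
    "encode_1080p": ["decode"],
    "thumbnails":   ["decode"],
    "package_hls":  ["encode_720p", "encode_1080p"],
}

MAX_ATTEMPTS = 3

def run_pipeline(video_id, stage_fn, checkpoint, dead_letter):
    """Run stages whose dependencies are done; skip checkpointed stages on resume."""
    progress = True
    while progress:
        progress = False
        for stage, deps in DAG.items():
            if stage in checkpoint or not all(d in checkpoint for d in deps):
                continue
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    stage_fn(video_id, stage)
                    checkpoint.add(stage)   # persisted, so a restarted worker resumes here
                    progress = True
                    break
                except RuntimeError:
                    if attempt == MAX_ATTEMPTS:
                        dead_letter.append((video_id, stage))  # alert for manual triage
                        return checkpoint
    return checkpoint

# Resume after a crash: "decode" is already checkpointed, so it is not redone.
done = {"decode"}
calls = []
run_pipeline("v1", lambda vid, stage: calls.append(stage), done, dead_letter=[])
assert "decode" not in calls       # completed work is skipped on resume
assert done == set(DAG)            # all remaining stages finished
```

This also shows why idempotency matters: a stage may run, succeed, and crash before the checkpoint write, so every stage must tolerate being re-executed.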
Suggested Approach
Step 1: Clarify Requirements
Start by confirming scope with the interviewer. Ask about target scale (uploads per day, concurrent viewers), whether live streaming is in scope or just on-demand, maximum video length and file size, and supported devices. Clarify whether the recommendation engine is in scope or treated as a black box. Confirm whether content moderation is required before publishing. Establish SLAs for upload processing time and playback start latency.