Design a video streaming platform where creators can upload large video files, the system processes and transcodes them into multiple quality levels, and viewers can stream content on demand with smooth playback across devices. The platform must support features like adaptive bitrate streaming, video search and discovery, and shareable links with visibility controls.
YouTube-style platforms involve two fundamentally different data paths: the creator path (upload, transcode, publish) and the viewer path (discover, stream, interact). Interviewers use this question to evaluate whether you can design for massive blob storage, asynchronous processing pipelines, and CDN-backed delivery at global scale. They also probe your ability to handle viral traffic spikes, optimize costs, and make pragmatic trade-offs around consistency, caching, and storage tiers.
Based on real interview experiences at Datadog, these are the areas interviewers probe most deeply:
Large video files cannot flow through application servers without creating bottlenecks. Interviewers want to see how you handle direct-to-storage uploads, integrity verification, and graceful recovery from network interruptions.
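One recurring follow-up is how a client resumes an interrupted upload. A minimal sketch of the client-side planning step for a multipart upload, assuming a hypothetical 100 MB part size (the function name and tuple layout are illustrative, not any specific SDK's API):

```python
def plan_parts(file_size: int, part_size: int = 100 * 1024 * 1024):
    """Split a file into (part_number, start, end) byte ranges for a
    resumable multipart upload. Each part can be uploaded and retried
    independently; the last part may be smaller than part_size."""
    parts = []
    start, number = 0, 1
    while start < file_size:
        end = min(start + part_size, file_size)
        parts.append((number, start, end))
        start, number = end, number + 1
    return parts
```

On interruption, the client asks the server which part numbers were already received and re-uploads only the missing ranges; a per-part checksum (e.g. an MD5 sent with each part) covers integrity verification.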
Transcoding is CPU-intensive and can take minutes per video. Interviewers expect an asynchronous, durable workflow with clear status tracking and failure recovery, rather than a design that blocks the upload request while processing runs.
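A common way to make such a workflow auditable is an explicit status state machine persisted with the video record. The statuses and transitions below are illustrative assumptions, not a fixed standard; note the FAILED → QUEUED edge, which is what makes retries explicit:

```python
from enum import Enum

class JobStatus(Enum):
    UPLOADED = "uploaded"        # raw file durably stored
    QUEUED = "queued"            # message published for workers
    TRANSCODING = "transcoding"  # a worker claimed the job
    READY = "ready"              # all renditions and manifests written
    FAILED = "failed"            # eligible for retry or dead-letter

ALLOWED = {
    JobStatus.UPLOADED: {JobStatus.QUEUED},
    JobStatus.QUEUED: {JobStatus.TRANSCODING},
    JobStatus.TRANSCODING: {JobStatus.READY, JobStatus.FAILED},
    JobStatus.FAILED: {JobStatus.QUEUED},  # retry path
    JobStatus.READY: set(),                # terminal
}

def transition(current: JobStatus, nxt: JobStatus) -> JobStatus:
    """Reject illegal transitions so stale or duplicate worker
    updates cannot corrupt the job's visible status."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Guarding transitions this way also makes at-least-once queue delivery safe: a duplicate completion event against an already-READY record is rejected instead of silently applied.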
View traffic massively outweighs upload traffic and is highly skewed toward popular content. Interviewers want to see how you protect your origin servers from being overwhelmed during viral moments.
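One origin-protection technique worth naming explicitly is request coalescing (single-flight): when many edge misses for the same segment arrive at once, only one request goes to origin and the rest wait for its result. A minimal sketch, assuming a threaded server (class and method names are illustrative):

```python
import threading

class SingleFlight:
    """Collapse concurrent fetches for the same key into one origin
    request; late arrivals block until the in-flight result is ready."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> (done_event, result_box)

    def do(self, key, fetch):
        with self._lock:
            entry = self._calls.get(key)
            if entry is None:
                ev, box = threading.Event(), []
                self._calls[key] = (ev, box)
                leader = True
            else:
                ev, box = entry
                leader = False
        if not leader:
            ev.wait()       # wait for the leader's fetch
            return box[0]
        try:
            box.append(fetch(key))  # the only call that hits origin
        finally:
            with self._lock:
                self._calls.pop(key, None)
            ev.set()
        return box[0]
```

This is the same idea a CDN shield node applies between the edge tier and origin storage; without it, a viral video's first seconds can translate into thousands of identical origin reads.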
Interviewers at Datadog have shown particular interest in how you model video metadata and chunk data, especially the relationship between a video record, its transcoded variants, and individual segments.
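A plausible relational sketch of the three-level model (video → rendition → segment), exercised here against SQLite for concreteness; the table and column names are assumptions for illustration, not a prescribed schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE videos (
  video_id   TEXT PRIMARY KEY,
  creator_id TEXT NOT NULL,
  title      TEXT,
  status     TEXT NOT NULL DEFAULT 'uploading',  -- uploading|processing|ready|failed
  visibility TEXT NOT NULL DEFAULT 'private'     -- public|unlisted|private
);
CREATE TABLE renditions (
  rendition_id TEXT PRIMARY KEY,
  video_id     TEXT NOT NULL REFERENCES videos(video_id),
  resolution   TEXT NOT NULL,         -- e.g. '1080p'
  bitrate_kbps INTEGER NOT NULL,
  playlist_url TEXT                   -- media playlist for this rendition
);
CREATE TABLE segments (
  rendition_id TEXT NOT NULL REFERENCES renditions(rendition_id),
  seq          INTEGER NOT NULL,      -- position within the rendition
  duration_s   REAL NOT NULL,         -- ~4s target duration
  object_key   TEXT NOT NULL,         -- location in object storage
  PRIMARY KEY (rendition_id, seq)
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The composite key on `(rendition_id, seq)` matches the access pattern: the player always asks for the next N segments of one rendition in order, which is also the natural partition key if segments move to a wide-column store at scale.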
Beyond the core upload-and-play loop, interviewers often ask how users find content and how you handle abuse, backfills, and operational visibility.
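On the abuse side, per-creator upload limits usually come down to a token bucket. A minimal sketch (the class name and parameters are illustrative choices, not a specific library's API):

```python
import time

class TokenBucket:
    """Per-creator rate limiter: refills `rate` tokens/second up to
    `capacity`, so short bursts are allowed but sustained abuse is not."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice the bucket state would live in Redis keyed by creator ID so limits hold across Upload Service instances, but the refill arithmetic is the same.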
Confirm scope and priorities with your interviewer. Ask about expected upload volume versus view volume, maximum video file sizes, supported devices and formats, and whether live streaming is in scope. Clarify the content model -- are we supporting only video-on-demand or also shorts/clips? Ask about geographic distribution of users and whether multi-region deployment is needed. Establish latency targets for playback start time and seek operations, and confirm whether features like comments, likes, and recommendations are in scope for this discussion.
Sketch the two main data flows. For the upload path: clients obtain pre-signed URLs from an Upload Service, upload directly to object storage, and an event notification triggers the Transcoding Pipeline. The pipeline reads the raw file, produces segmented renditions, writes outputs to object storage, and updates the Video Metadata Service with manifest locations. For the playback path: clients request a video page from the API, receive metadata and a manifest URL, and the video player fetches segments from the CDN. Include a Search Service backed by Elasticsearch for discovery, a Thumbnail Service for generating preview images, and a Moderation Service for content safety. Show Redis caching hot metadata and the CDN caching video segments at the edge.
Walk through the transcoding pipeline in detail. When object storage emits an upload-complete event, a message is published to Kafka. A pool of transcoding workers consumes messages, each processing one video at a time. A worker downloads the raw file, runs FFmpeg to produce multiple bitrate renditions, packages each into 4-second HLS segments with corresponding playlist files, and uploads all outputs back to object storage. A master manifest (m3u8) lists available renditions so the player can switch quality adaptively. After transcoding completes, the worker updates the metadata database and publishes a completion event that triggers thumbnail generation and search indexing.
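The master manifest the worker writes can be sketched as a small string builder over the rendition list. The `#EXT-X-STREAM-INF` tag and its `BANDWIDTH`/`RESOLUTION` attributes follow the HLS playlist format; the function name and input shape are assumptions for illustration:

```python
def master_manifest(renditions):
    """Build an HLS master playlist.

    renditions: list of (bandwidth_bps, 'WIDTHxHEIGHT', playlist_path)
    tuples, one per transcoded quality level. Listing variants lowest
    bandwidth first lets players start conservatively and switch up."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for bandwidth, resolution, path in sorted(renditions):
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(path)
    return "\n".join(lines) + "\n"
```

The player picks a variant from this file, fetches that rendition's media playlist, and streams its 4-second segments, re-evaluating bandwidth between segments to switch quality adaptively.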
For delivery, explain the CDN topology: video players resolve a CDN hostname, request the manifest from the nearest edge, and fetch segments sequentially during playback. Edge servers cache segments with long TTLs since video content is immutable once transcoded. On a cache miss, the request routes to a shield node that aggregates requests before fetching from origin storage. Discuss how you handle cache warming for content predicted to go viral based on early engagement metrics.
Cover storage cost optimization: use tiered storage with hot video segments in standard S3 and cold content (old, rarely-viewed videos) in cheaper archival tiers, with lazy promotion back to hot storage on access. Discuss reliability: replicate raw uploads across availability zones before acknowledging success, use dead-letter queues for failed transcoding jobs, and implement circuit breakers between services. Address abuse prevention: rate-limit uploads per creator, scan content during transcoding, and implement copyright detection. Touch on monitoring: track end-to-end pipeline latency from upload to playable status, alert on transcoding failure rates, and measure viewer experience metrics like rebuffer ratio and startup time across regions.
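Lazy promotion can be illustrated with a two-tier map sketch; in a real system the tiers would be S3 storage classes and promotion an asynchronous copy, so everything here (names, synchronous reads) is simplified for illustration:

```python
class TieredStore:
    """Two-tier storage sketch: new objects land in the hot tier, a
    background job demotes cold ones, and any read of a demoted object
    lazily promotes it back to hot so repeat views stay cheap."""
    def __init__(self):
        self.hot, self.cold = {}, {}

    def put(self, key, data):
        self.hot[key] = data            # fresh uploads are hot

    def demote(self, key):
        # run by a background job for old, rarely-viewed content
        self.cold[key] = self.hot.pop(key)

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        data = self.cold.pop(key)       # slow archival read; KeyError if absent
        self.hot[key] = data            # lazy promotion on access
        return data
```

The trade-off to call out: archival reads are slow and often billed per retrieval, so demotion policy should key off view recency and the promotion path must tolerate a one-time latency hit (or serve the first request directly from archive while copying in the background).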