Design a video streaming platform where creators can upload large video files, the system processes and transcodes them into multiple quality levels, and viewers can stream content on demand with smooth playback across devices. The platform must support features like adaptive bitrate streaming, video search and discovery, and shareable links with visibility controls.
YouTube-style platforms involve two fundamentally different data paths: the creator path (upload, transcode, publish) and the viewer path (discover, stream, interact). Interviewers use this question to evaluate whether you can design for massive blob storage, asynchronous processing pipelines, and CDN-backed delivery at global scale. They also probe your ability to handle viral traffic spikes, optimize costs, and make pragmatic trade-offs around consistency, caching, and storage tiers.
Based on real interview experiences at Datadog, these are the areas interviewers probe most deeply:
Large video files cannot flow through application servers without creating bottlenecks. Interviewers want to see how you handle direct-to-storage uploads, integrity verification, and graceful recovery from network interruptions.
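One recurring follow-up is how a client resumes an interrupted upload. A minimal sketch of the client-side planning step for a multipart upload, assuming a hypothetical 100 MB part size (the function name and tuple layout are illustrative, not any specific SDK's API):

```python
def plan_parts(file_size: int, part_size: int = 100 * 1024 * 1024):
    """Split a file into (part_number, start, end) byte ranges for a
    resumable multipart upload. Each part can be uploaded and retried
    independently; the last part may be smaller than part_size."""
    parts = []
    start, number = 0, 1
    while start < file_size:
        end = min(start + part_size, file_size)
        parts.append((number, start, end))
        start, number = end, number + 1
    return parts
```

On interruption, the client asks the server which part numbers were already received and re-uploads only the missing ranges; a per-part checksum (e.g. an MD5 sent with each part) covers integrity verification.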
Transcoding is CPU-intensive and can take minutes per video. Interviewers expect an asynchronous, durable workflow with clear status tracking and failure recovery, rather than a design that blocks the upload request while processing runs.
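A common way to make such a workflow auditable is an explicit status state machine persisted with the video record. The statuses and transitions below are illustrative assumptions, not a fixed standard; note the FAILED → QUEUED edge, which is what makes retries explicit:

```python
from enum import Enum

class JobStatus(Enum):
    UPLOADED = "uploaded"        # raw file durably stored
    QUEUED = "queued"            # message published for workers
    TRANSCODING = "transcoding"  # a worker claimed the job
    READY = "ready"              # all renditions and manifests written
    FAILED = "failed"            # eligible for retry or dead-letter

ALLOWED = {
    JobStatus.UPLOADED: {JobStatus.QUEUED},
    JobStatus.QUEUED: {JobStatus.TRANSCODING},
    JobStatus.TRANSCODING: {JobStatus.READY, JobStatus.FAILED},
    JobStatus.FAILED: {JobStatus.QUEUED},  # retry path
    JobStatus.READY: set(),                # terminal
}

def transition(current: JobStatus, nxt: JobStatus) -> JobStatus:
    """Reject illegal transitions so stale or duplicate worker
    updates cannot corrupt the job's visible status."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Guarding transitions this way also makes at-least-once queue delivery safe: a duplicate completion event against an already-READY record is rejected instead of silently applied.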
View traffic massively outweighs upload traffic and is highly skewed toward popular content. Interviewers want to see how you protect your origin servers from being overwhelmed during viral moments.
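One origin-protection technique worth naming explicitly is request coalescing (single-flight): when many edge misses for the same segment arrive at once, only one request goes to origin and the rest wait for its result. A minimal sketch, assuming a threaded server (class and method names are illustrative):

```python
import threading

class SingleFlight:
    """Collapse concurrent fetches for the same key into one origin
    request; late arrivals block until the in-flight result is ready."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> (done_event, result_box)

    def do(self, key, fetch):
        with self._lock:
            entry = self._calls.get(key)
            if entry is None:
                ev, box = threading.Event(), []
                self._calls[key] = (ev, box)
                leader = True
            else:
                ev, box = entry
                leader = False
        if not leader:
            ev.wait()       # wait for the leader's fetch
            return box[0]
        try:
            box.append(fetch(key))  # the only call that hits origin
        finally:
            with self._lock:
                self._calls.pop(key, None)
            ev.set()
        return box[0]
```

This is the same idea a CDN shield node applies between the edge tier and origin storage; without it, a viral video's first seconds can translate into thousands of identical origin reads.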
Interviewers at Datadog have shown particular interest in how you model video metadata and chunk data, especially the relationship between a video record, its transcoded variants, and individual segments.
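A plausible relational sketch of the three-level model (video → rendition → segment), exercised here against SQLite for concreteness; the table and column names are assumptions for illustration, not a prescribed schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE videos (
  video_id   TEXT PRIMARY KEY,
  creator_id TEXT NOT NULL,
  title      TEXT,
  status     TEXT NOT NULL DEFAULT 'uploading',  -- uploading|processing|ready|failed
  visibility TEXT NOT NULL DEFAULT 'private'     -- public|unlisted|private
);
CREATE TABLE renditions (
  rendition_id TEXT PRIMARY KEY,
  video_id     TEXT NOT NULL REFERENCES videos(video_id),
  resolution   TEXT NOT NULL,         -- e.g. '1080p'
  bitrate_kbps INTEGER NOT NULL,
  playlist_url TEXT                   -- media playlist for this rendition
);
CREATE TABLE segments (
  rendition_id TEXT NOT NULL REFERENCES renditions(rendition_id),
  seq          INTEGER NOT NULL,      -- position within the rendition
  duration_s   REAL NOT NULL,         -- ~4s target duration
  object_key   TEXT NOT NULL,         -- location in object storage
  PRIMARY KEY (rendition_id, seq)
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The composite key on `(rendition_id, seq)` matches the access pattern: the player always asks for the next N segments of one rendition in order, which is also the natural partition key if segments move to a wide-column store at scale.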
Beyond the core upload-and-play loop, interviewers often ask how users find content and how you handle abuse, backfills, and operational visibility.
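On the abuse side, per-creator upload limits usually come down to a token bucket. A minimal sketch (the class name and parameters are illustrative choices, not a specific library's API):

```python
import time

class TokenBucket:
    """Per-creator rate limiter: refills `rate` tokens/second up to
    `capacity`, so short bursts are allowed but sustained abuse is not."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice the bucket state would live in Redis keyed by creator ID so limits hold across Upload Service instances, but the refill arithmetic is the same.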
Confirm scope and priorities with your interviewer. Ask about expected upload volume versus view volume, maximum video file sizes, supported devices and formats, and whether live streaming is in scope. Clarify the content model -- are we supporting only video-on-demand or also shorts/clips? Ask about geographic distribution of users and whether multi-region deployment is needed. Establish latency targets for playback start time and seek operations, and confirm whether features like comments, likes, and recommendations are in scope for this discussion.
Sketch the two main data flows. For the upload path: clients obtain pre-signed URLs from an Upload Service, upload directly to object storage, and an event notification triggers the Transcoding Pipeline. The pipeline reads the raw file, produces segmented renditions, writes outputs to object storage, and updates the Video Metadata Service with manifest locations. For the playback path: clients request a video page from the API, receive metadata and a manifest URL, and the video player fetches segments from the CDN. Include a Search Service backed by Elasticsearch for discovery, a Thumbnail Service for generating preview images, and a Moderation Service for content safety. Show Redis caching hot metadata and the CDN caching video segments at the edge.
Walk through the transcoding pipeline in detail. When object storage emits an upload-complete event, a message is published to Kafka. A pool of transcoding workers consumes messages, each processing one video at a time. A worker downloads the raw file, runs FFmpeg to produce multiple bitrate renditions, packages each into 4-second HLS segments with corresponding playlist files, and uploads all outputs back to object storage. A master manifest (m3u8) lists available renditions so the player can switch quality adaptively. After transcoding completes, the worker updates the metadata database and publishes a completion event that triggers thumbnail generation and search indexing.
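The master manifest the worker writes can be sketched as a small string builder over the rendition list. The `#EXT-X-STREAM-INF` tag and its `BANDWIDTH`/`RESOLUTION` attributes follow the HLS playlist format; the function name and input shape are assumptions for illustration:

```python
def master_manifest(renditions):
    """Build an HLS master playlist.

    renditions: list of (bandwidth_bps, 'WIDTHxHEIGHT', playlist_path)
    tuples, one per transcoded quality level. Listing variants lowest
    bandwidth first lets players start conservatively and switch up."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for bandwidth, resolution, path in sorted(renditions):
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(path)
    return "\n".join(lines) + "\n"
```

The player picks a variant from this file, fetches that rendition's media playlist, and streams its 4-second segments, re-evaluating bandwidth between segments to switch quality adaptively.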
For delivery, explain the CDN topology: video players resolve a CDN hostname, request the manifest from the nearest edge, and fetch segments sequentially during playback. Edge servers cache segments with long TTLs since video content is immutable once transcoded. On a cache miss, the request routes to a shield node that aggregates requests before fetching from origin storage. Discuss how you handle cache warming for content predicted to go viral based on early engagement metrics.
Cover storage cost optimization: use tiered storage with hot video segments in standard S3 and cold content (old, rarely-viewed videos) in cheaper archival tiers, with lazy promotion back to hot storage on access. Discuss reliability: replicate raw uploads across availability zones before acknowledging success, use dead-letter queues for failed transcoding jobs, and implement circuit breakers between services. Address abuse prevention: rate-limit uploads per creator, scan content during transcoding, and implement copyright detection. Touch on monitoring: track end-to-end pipeline latency from upload to playable status, alert on transcoding failure rates, and measure viewer experience metrics like rebuffer ratio and startup time across regions.
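Lazy promotion can be illustrated with a two-tier map sketch; in a real system the tiers would be S3 storage classes and promotion an asynchronous copy, so everything here (names, synchronous reads) is simplified for illustration:

```python
class TieredStore:
    """Two-tier storage sketch: new objects land in the hot tier, a
    background job demotes cold ones, and any read of a demoted object
    lazily promotes it back to hot so repeat views stay cheap."""
    def __init__(self):
        self.hot, self.cold = {}, {}

    def put(self, key, data):
        self.hot[key] = data            # fresh uploads are hot

    def demote(self, key):
        # run by a background job for old, rarely-viewed content
        self.cold[key] = self.hot.pop(key)

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        data = self.cold.pop(key)       # slow archival read; KeyError if absent
        self.hot[key] = data            # lazy promotion on access
        return data
```

The trade-off to call out: archival reads are slow and often billed per retrieval, so demotion policy should key off view recency and the promotion path must tolerate a one-time latency hit (or serve the first request directly from archive while copying in the background).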