Design a Video Transcoding Pipeline
System Design · Must
Problem Statement
Design a large-scale distributed system that ingests video files uploaded by content creators and transcodes them into multiple formats and resolutions for streaming playback. Your system must handle thousands of uploads per minute, process videos ranging from seconds to hours in length, and output adaptive bitrate (ABR) sets optimized for different devices and network conditions. The pipeline should minimize end-to-end latency from upload to availability, gracefully handle worker failures mid-transcode, and provide creators with real-time progress updates. Consider that transcoding is CPU-intensive, daily input may total terabytes of video, and output files must be durably stored and indexed for low-latency retrieval by CDN edge nodes.
This problem tests your ability to decompose a compute-heavy workflow into distributed stages, coordinate long-running tasks across ephemeral workers, manage state and retries without losing partial progress, and balance cost efficiency with quality-of-service guarantees. Interviewers want to see how you reason about resource scheduling, backpressure, idempotency, and monitoring in a pipeline that blends storage I/O, network transfers, and heavyweight computation.
Key Requirements
Functional
- Video ingestion -- accept uploads from creators, validate format/size, and persist raw source files durably before transcoding begins
- Multi-format output -- generate multiple resolutions (1080p, 720p, 480p, 360p) and codecs (H.264, VP9, AV1) for adaptive streaming manifests
- Progress tracking -- provide creators with real-time job status, estimated completion time, and error details for failed segments
- Retry and resume -- recover from worker crashes or transient failures by resuming partial work without reprocessing completed segments
Non-Functional
- Scalability -- process 5,000 concurrent video uploads with total input volume of 50 TB per day
- Reliability -- ensure at least 99.9% of jobs complete successfully within SLA; automatically retry failed tasks up to three times
- Latency -- 95th percentile end-to-end processing time under 10 minutes for a 5-minute 1080p video
- Cost efficiency -- minimize idle compute by auto-scaling workers based on queue depth and prioritize cheaper spot instances where appropriate
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Task Decomposition and Workflow Orchestration
Transcoding is not a single atomic operation. Interviewers expect you to break the pipeline into discrete stages (upload, chunking, transcode per resolution, manifest generation, storage) and explain how tasks flow between stages. They want to see you model this as a durable workflow that survives worker crashes and can checkpoint progress.
Hints to consider:
- Use a workflow engine or state machine to track job lifecycle (pending, processing, complete, failed) with transitions persisted in a database
- Split large videos into independent chunks or segments that can be transcoded in parallel by different workers to reduce overall latency
- Ensure idempotency by tagging tasks with unique IDs and tracking completion status so retries do not duplicate output artifacts
- Discuss tradeoffs between sequential pipelines (simpler coordination) and parallel DAGs (faster but more complex error handling)
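The state-machine and idempotency hints above can be sketched together. This is a minimal illustration, not a production design: the `JobStore` class, its in-memory dict (standing in for a real database), and the deterministic per-resolution task IDs are all hypothetical names chosen for this example.

```python
import uuid
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETE = "complete"
    FAILED = "failed"

# Allowed transitions; anything else is rejected, so a stale worker
# cannot move a finished job back to "processing".
TRANSITIONS = {
    JobState.PENDING: {JobState.PROCESSING, JobState.FAILED},
    JobState.PROCESSING: {JobState.COMPLETE, JobState.FAILED},
    JobState.FAILED: {JobState.PENDING},  # retry path
    JobState.COMPLETE: set(),
}

class JobStore:
    """In-memory stand-in for the database that persists job state."""
    def __init__(self):
        self.jobs = {}

    def create_job(self, video_id, resolutions):
        job_id = str(uuid.uuid4())
        # One deterministic task ID per (job, resolution): a retried
        # transcode overwrites its own output instead of duplicating it.
        tasks = {f"{job_id}:{r}": "pending" for r in resolutions}
        self.jobs[job_id] = {"video": video_id,
                             "state": JobState.PENDING,
                             "tasks": tasks}
        return job_id

    def transition(self, job_id, new_state):
        job = self.jobs[job_id]
        if new_state not in TRANSITIONS[job["state"]]:
            raise ValueError(f"illegal {job['state']} -> {new_state}")
        job["state"] = new_state

    def complete_task(self, job_id, task_id):
        job = self.jobs[job_id]
        if job["tasks"][task_id] == "done":
            return  # idempotent: a retried completion is a no-op
        job["tasks"][task_id] = "done"
        if all(s == "done" for s in job["tasks"].values()):
            self.transition(job_id, JobState.COMPLETE)
```

In a real system the transition and task-completion writes would be conditional updates in a durable store, so two workers racing on the same task cannot both "win".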
2. Worker Coordination and Fault Tolerance
Workers are ephemeral and may crash mid-transcode. Interviewers want to know how you prevent lost work, avoid duplicate processing, and redistribute failed tasks without manual intervention. This probes your understanding of leases, heartbeats, dead-letter queues, and exactly-once vs. at-least-once semantics.
Hints to consider:
- Implement task leases with TTLs so that if a worker dies, the task becomes available again after lease expiration for another worker to claim
- Use a durable message queue (SQS, Kafka) where workers pull tasks and acknowledge completion only after writing outputs, ensuring at-least-once delivery
- Store intermediate outputs (partially transcoded chunks) in object storage with versioning so resumed jobs can skip already-completed segments
- Monitor worker health with periodic heartbeats and trigger failover if a worker stops reporting progress within a threshold
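The lease, heartbeat, and ack-after-write hints can be combined into one small sketch. The `LeaseQueue` class and its API are illustrative assumptions; a real deployment would get this behavior from SQS visibility timeouts or a similar broker feature rather than hand-rolling it.

```python
import time

class LeaseQueue:
    """Minimal at-least-once task queue: a claimed task is invisible
    until its lease expires, then becomes claimable again."""
    def __init__(self, lease_ttl=30.0):
        self.lease_ttl = lease_ttl
        self.tasks = {}   # task_id -> lease expiry (0.0 = unclaimed)
        self.done = set()

    def submit(self, task_id):
        self.tasks.setdefault(task_id, 0.0)

    def claim(self, now=None):
        now = time.monotonic() if now is None else now
        for task_id, expiry in self.tasks.items():
            if task_id not in self.done and expiry <= now:
                self.tasks[task_id] = now + self.lease_ttl  # take the lease
                return task_id
        return None  # nothing claimable right now

    def heartbeat(self, task_id, now=None):
        """Workers extend the lease while still making progress."""
        now = time.monotonic() if now is None else now
        self.tasks[task_id] = now + self.lease_ttl

    def ack(self, task_id):
        """Acknowledge only after outputs are durably written."""
        self.done.add(task_id)
```

If a worker dies mid-transcode, it simply stops heartbeating; the lease expires and the next `claim` hands the task to a healthy worker, which is why per-task idempotency (previous section) matters.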
3. Resource Management and Backpressure
Transcoding is CPU and I/O intensive. Interviewers expect you to discuss auto-scaling worker pools, prioritizing jobs, and preventing queue buildup when ingest rate exceeds processing capacity. They also care about cost optimization via spot instances or GPU acceleration.
Hints to consider:
- Auto-scale worker fleets based on queue depth metrics (scale out when backlog exceeds threshold, scale in when queues drain)
- Use priority queues to fast-track short videos or premium creator uploads while delaying lower-priority batch jobs during peak load
- Apply backpressure by throttling upload acceptance or returning HTTP 503 if queue depth exceeds safe limits to prevent cascading failures
- Leverage spot instances for non-urgent transcodes with checkpointing to resume on preemption, reserving on-demand capacity for latency-sensitive jobs
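The scaling and backpressure hints reduce to two small policy functions. This is a sketch under assumed parameters (`jobs_per_worker_per_min`, `drain_target_min`, `max_safe_depth` are all hypothetical names); a real autoscaler would also smooth the signal to avoid flapping.

```python
import math

def desired_workers(queue_depth, jobs_per_worker_per_min, drain_target_min,
                    min_workers=2, max_workers=200):
    """Size the fleet so the current backlog drains within the target
    window, clamped to fleet limits."""
    capacity_per_worker = max(1, jobs_per_worker_per_min * drain_target_min)
    needed = math.ceil(queue_depth / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))

def admit_upload(queue_depth, max_safe_depth):
    """Backpressure gate at the ingest edge: (accepted, http_status)."""
    if queue_depth >= max_safe_depth:
        return False, 503  # shed load before the pipeline tips over
    return True, 202       # accepted for asynchronous processing
```

For example, a backlog of 1,000 jobs, workers that each finish 2 jobs/minute, and a 10-minute drain target yield 50 workers; an empty queue falls back to the warm minimum, and a huge backlog is clamped at the fleet cap.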
4. Deduplication and Storage Efficiency
Many creators re-upload the same video or make minor edits. Interviewers want to see you avoid redundant transcoding and storage by fingerprinting content, detecting duplicates, and reusing existing outputs when safe to do so.
Hints to consider:
- Compute a content hash (MD5, SHA-256) of the raw input file and check a lookup table before starting transcode; return cached outputs if hash matches
- Use perceptual hashing or video fingerprinting to detect near-duplicates (re-encoded versions, cropped frames) and prompt users to reuse existing transcodes
- Implement tiered storage with hot/cold policies, moving older or rarely-accessed transcode outputs to cheaper archival storage (S3 Glacier, Azure Cool Blob)
- Apply content-aware compression or variable bitrate encoding to reduce output sizes without sacrificing perceived quality
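The exact-duplicate path from the first hint can be sketched as a streaming hash plus a lookup table. The `TranscodeCache` class and its dict are illustrative stand-ins for a real key-value store; perceptual near-duplicate detection is a separate, harder problem not shown here.

```python
import hashlib

CHUNK = 1 << 20  # hash 1 MiB at a time so huge files never load into memory

def fingerprint(path):
    """Streaming SHA-256 of the raw upload."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

class TranscodeCache:
    """Maps content hash -> existing output manifest; the dict stands in
    for a real lookup table (e.g. Postgres or DynamoDB)."""
    def __init__(self):
        self.by_hash = {}

    def get_or_schedule(self, digest, schedule_fn):
        """Return (manifest, cache_hit); schedule a transcode on miss."""
        if digest in self.by_hash:
            return self.by_hash[digest], True  # skip redundant transcode
        manifest = schedule_fn(digest)
        self.by_hash[digest] = manifest
        return manifest, False
```

Byte-identical re-uploads hit the cache and return existing outputs immediately; anything re-encoded or trimmed changes the hash and falls through to the fingerprinting path from the second hint.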
5. Monitoring, Observability, and SLA Management
Production transcoding pipelines must surface real-time metrics on throughput, error rates, per-stage latency, and cost per job. Interviewers want you to design telemetry and alerting that helps operators identify bottlenecks, budget overruns, or degraded quality before customers complain.
Hints to consider:
- Emit structured logs and metrics (processing time per stage, queue depth, worker utilization) to a centralized observability platform (Datadog, Prometheus, CloudWatch)
- Track end-to-end SLA compliance per job and alert when P95 latency exceeds target or error rate spikes above threshold
- Provide creators with a webhook or WebSocket stream for real-time job updates (progress percentage, current stage, ETA) and failure notifications
- Implement distributed tracing (OpenTelemetry) to correlate spans across upload, queue, worker, and storage layers for root-cause analysis of slow jobs
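The SLA-compliance hint can be illustrated with a small latency tracker. The `SlaTracker` class, its sliding window, and the nearest-rank P95 computation are assumptions for this sketch; production systems typically use a metrics backend's quantile functions instead of computing percentiles in-process.

```python
import math

class SlaTracker:
    """Keeps a sliding window of end-to-end job latencies and flags
    breaches of the P95 target."""
    def __init__(self, p95_target_s, window=1000):
        self.p95_target_s = p95_target_s
        self.window = window
        self.samples = []

    def record(self, latency_s):
        self.samples.append(latency_s)
        if len(self.samples) > self.window:
            self.samples.pop(0)  # drop the oldest sample

    def p95(self):
        """Nearest-rank 95th percentile of the current window."""
        ordered = sorted(self.samples)
        idx = math.ceil(0.95 * len(ordered)) - 1
        return ordered[max(idx, 0)]

    def breached(self):
        return bool(self.samples) and self.p95() > self.p95_target_s
```

An operator would wire `breached()` into an alerting rule (with the 10-minute target from the requirements, `p95_target_s=600`), and emit the same samples as per-stage metrics so a breach can be traced to upload, queueing, transcode, or storage.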