Design a Video Transcoding Pipeline
System Design · Must
Problem Statement
Design a large-scale distributed system that ingests video files uploaded by content creators and transcodes them into multiple formats and resolutions for streaming playback. Your system must handle thousands of uploads per minute, process videos ranging from seconds to hours in length, and output adaptive bitrate (ABR) sets optimized for different devices and network conditions. The pipeline should minimize end-to-end latency from upload to availability, gracefully handle worker failures mid-transcode, and provide creators with real-time progress updates. Consider that transcoding is CPU-intensive, daily input may total terabytes of video, and output files must be durably stored and indexed for low-latency retrieval by CDN edge nodes.
This problem tests your ability to decompose a compute-heavy workflow into distributed stages, coordinate long-running tasks across ephemeral workers, manage state and retries without losing partial progress, and balance cost efficiency with quality-of-service guarantees. Interviewers want to see how you reason about resource scheduling, backpressure, idempotency, and monitoring in a pipeline that blends storage I/O, network transfers, and heavyweight computation.
Key Requirements
Functional
- Video ingestion -- accept uploads from creators, validate format/size, and persist raw source files durably before transcoding begins
- Multi-format output -- generate multiple resolutions (1080p, 720p, 480p, 360p) and codecs (H.264, VP9, AV1) for adaptive streaming manifests
- Progress tracking -- provide creators with real-time job status, estimated completion time, and error details for failed segments
- Retry and resume -- recover from worker crashes or transient failures by resuming partial work without reprocessing completed segments
Non-Functional
- Scalability -- process 5,000 concurrent video uploads with total input volume of 50 TB per day
- Reliability -- ensure at least 99.9% of jobs complete successfully within SLA; automatically retry failed tasks up to three times
- Latency -- 95th percentile end-to-end processing time under 10 minutes for a 5-minute 1080p video
- Cost efficiency -- minimize idle compute by auto-scaling workers based on queue depth and prioritize cheaper spot instances where appropriate
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Task Decomposition and Workflow Orchestration
Transcoding is not a single atomic operation. Interviewers expect you to break the pipeline into discrete stages (upload, chunking, transcode per resolution, manifest generation, storage) and explain how tasks flow between stages. They want to see you model this as a durable workflow that survives worker crashes and can checkpoint progress.
Hints to consider:
- Use a workflow engine or state machine to track job lifecycle (pending, processing, complete, failed) with transitions persisted in a database
- Split large videos into independent chunks or segments that can be transcoded in parallel by different workers to reduce overall latency
- Ensure idempotency by tagging tasks with unique IDs and tracking completion status so retries do not duplicate output artifacts
- Discuss tradeoffs between sequential pipelines (simpler coordination) and parallel DAGs (faster but more complex error handling)
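The state-machine and idempotency hints above can be sketched together. This is a minimal illustration, not a production design: the `JobStore` class, its in-memory dict (standing in for a real database), and the deterministic per-resolution task IDs are all hypothetical names chosen for this example.

```python
import uuid
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETE = "complete"
    FAILED = "failed"

# Allowed transitions; anything else is rejected, so a stale worker
# cannot move a finished job back to "processing".
TRANSITIONS = {
    JobState.PENDING: {JobState.PROCESSING, JobState.FAILED},
    JobState.PROCESSING: {JobState.COMPLETE, JobState.FAILED},
    JobState.FAILED: {JobState.PENDING},  # retry path
    JobState.COMPLETE: set(),
}

class JobStore:
    """In-memory stand-in for the database that persists job state."""
    def __init__(self):
        self.jobs = {}

    def create_job(self, video_id, resolutions):
        job_id = str(uuid.uuid4())
        # One deterministic task ID per (job, resolution): a retried
        # transcode overwrites its own output instead of duplicating it.
        tasks = {f"{job_id}:{r}": "pending" for r in resolutions}
        self.jobs[job_id] = {"video": video_id,
                             "state": JobState.PENDING,
                             "tasks": tasks}
        return job_id

    def transition(self, job_id, new_state):
        job = self.jobs[job_id]
        if new_state not in TRANSITIONS[job["state"]]:
            raise ValueError(f"illegal {job['state']} -> {new_state}")
        job["state"] = new_state

    def complete_task(self, job_id, task_id):
        job = self.jobs[job_id]
        if job["tasks"][task_id] == "done":
            return  # idempotent: a retried completion is a no-op
        job["tasks"][task_id] = "done"
        if all(s == "done" for s in job["tasks"].values()):
            self.transition(job_id, JobState.COMPLETE)
```

In a real system the transition and task-completion writes would be conditional updates in a durable store, so two workers racing on the same task cannot both "win".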
2. Worker Coordination and Fault Tolerance
Workers are ephemeral and may crash mid-transcode. Interviewers want to know how you prevent lost work, avoid duplicate processing, and redistribute failed tasks without manual intervention. This probes your understanding of leases, heartbeats, dead-letter queues, and exactly-once vs. at-least-once semantics.
Hints to consider:
- Implement task leases with TTLs so that if a worker dies, the task becomes available again after lease expiration for another worker to claim
- Use a durable message queue (SQS, Kafka) where workers pull tasks and acknowledge completion only after writing outputs, ensuring at-least-once delivery
- Store intermediate outputs (partially transcoded chunks) in object storage with versioning so resumed jobs can skip already-completed segments
- Monitor worker health with periodic heartbeats and trigger failover if a worker stops reporting progress within a threshold
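The lease, heartbeat, and ack-after-write hints can be combined into one small sketch. The `LeaseQueue` class and its API are illustrative assumptions; a real deployment would get this behavior from SQS visibility timeouts or a similar broker feature rather than hand-rolling it.

```python
import time

class LeaseQueue:
    """Minimal at-least-once task queue: a claimed task is invisible
    until its lease expires, then becomes claimable again."""
    def __init__(self, lease_ttl=30.0):
        self.lease_ttl = lease_ttl
        self.tasks = {}   # task_id -> lease expiry (0.0 = unclaimed)
        self.done = set()

    def submit(self, task_id):
        self.tasks.setdefault(task_id, 0.0)

    def claim(self, now=None):
        now = time.monotonic() if now is None else now
        for task_id, expiry in self.tasks.items():
            if task_id not in self.done and expiry <= now:
                self.tasks[task_id] = now + self.lease_ttl  # take the lease
                return task_id
        return None  # nothing claimable right now

    def heartbeat(self, task_id, now=None):
        """Workers extend the lease while still making progress."""
        now = time.monotonic() if now is None else now
        self.tasks[task_id] = now + self.lease_ttl

    def ack(self, task_id):
        """Acknowledge only after outputs are durably written."""
        self.done.add(task_id)
```

If a worker dies mid-transcode, it simply stops heartbeating; the lease expires and the next `claim` hands the task to a healthy worker, which is why per-task idempotency (previous section) matters.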
3. Resource Management and Backpressure
Transcoding is CPU and I/O intensive. Interviewers expect you to discuss auto-scaling worker pools, prioritizing jobs, and preventing queue buildup when ingest rate exceeds processing capacity. They also care about cost optimization via spot instances or GPU acceleration.
Hints to consider:
- Auto-scale worker fleets based on queue depth metrics (scale out when backlog exceeds threshold, scale in when queues drain)
- Use priority queues to fast-track short videos or premium creator uploads while delaying lower-priority batch jobs during peak load
- Apply backpressure by throttling upload acceptance or returning HTTP 503 if queue depth exceeds safe limits to prevent cascading failures
- Leverage spot instances for non-urgent transcodes with checkpointing to resume on preemption, reserving on-demand capacity for latency-sensitive jobs
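The scaling and backpressure hints reduce to two small policy functions. This is a sketch under assumed parameters (`jobs_per_worker_per_min`, `drain_target_min`, `max_safe_depth` are all hypothetical names); a real autoscaler would also smooth the signal to avoid flapping.

```python
import math

def desired_workers(queue_depth, jobs_per_worker_per_min, drain_target_min,
                    min_workers=2, max_workers=200):
    """Size the fleet so the current backlog drains within the target
    window, clamped to fleet limits."""
    capacity_per_worker = max(1, jobs_per_worker_per_min * drain_target_min)
    needed = math.ceil(queue_depth / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))

def admit_upload(queue_depth, max_safe_depth):
    """Backpressure gate at the ingest edge: (accepted, http_status)."""
    if queue_depth >= max_safe_depth:
        return False, 503  # shed load before the pipeline tips over
    return True, 202       # accepted for asynchronous processing
```

For example, a backlog of 1,000 jobs, workers that each finish 2 jobs/minute, and a 10-minute drain target yield 50 workers; an empty queue falls back to the warm minimum, and a huge backlog is clamped at the fleet cap.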
4. Deduplication and Storage Efficiency
Many creators re-upload the same video or make minor edits. Interviewers want to see you avoid redundant transcoding and storage by fingerprinting content, detecting duplicates, and reusing existing outputs when safe to do so.
Hints to consider:
- Compute a content hash (MD5, SHA-256) of the raw input file and check a lookup table before starting transcode; return cached outputs if hash matches
- Use perceptual hashing or video fingerprinting to detect near-duplicates (re-encoded versions, cropped frames) and prompt users to reuse existing transcodes
- Implement tiered storage with hot/cold policies, moving older or rarely-accessed transcode outputs to cheaper archival storage (S3 Glacier, Azure Cool Blob)
- Apply content-aware compression or variable bitrate encoding to reduce output sizes without sacrificing perceived quality
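The exact-duplicate path from the first hint can be sketched as a streaming hash plus a lookup table. The `TranscodeCache` class and its dict are illustrative stand-ins for a real key-value store; perceptual near-duplicate detection is a separate, harder problem not shown here.

```python
import hashlib

CHUNK = 1 << 20  # hash 1 MiB at a time so huge files never load into memory

def fingerprint(path):
    """Streaming SHA-256 of the raw upload."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

class TranscodeCache:
    """Maps content hash -> existing output manifest; the dict stands in
    for a real lookup table (e.g. Postgres or DynamoDB)."""
    def __init__(self):
        self.by_hash = {}

    def get_or_schedule(self, digest, schedule_fn):
        """Return (manifest, cache_hit); schedule a transcode on miss."""
        if digest in self.by_hash:
            return self.by_hash[digest], True  # skip redundant transcode
        manifest = schedule_fn(digest)
        self.by_hash[digest] = manifest
        return manifest, False
```

Byte-identical re-uploads hit the cache and return existing outputs immediately; anything re-encoded or trimmed changes the hash and falls through to the fingerprinting path from the second hint.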
5. Monitoring, Observability, and SLA Management
Production transcoding pipelines must surface real-time metrics on throughput, error rates, per-stage latency, and cost per job. Interviewers want you to design telemetry and alerting that helps operators identify bottlenecks, budget overruns, or degraded quality before customers complain.
Hints to consider:
- Emit structured logs and metrics (processing time per stage, queue depth, worker utilization) to a centralized observability platform (Datadog, Prometheus, CloudWatch)
- Track end-to-end SLA compliance per job and alert when P95 latency exceeds target or error rate spikes above threshold
- Provide creators with a webhook or WebSocket stream for real-time job updates (progress percentage, current stage, ETA) and failure notifications
- Implement distributed tracing (OpenTelemetry) to correlate spans across upload, queue, worker, and storage layers for root-cause analysis of slow jobs
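The SLA-compliance hint can be illustrated with a small latency tracker. The `SlaTracker` class, its sliding window, and the nearest-rank P95 computation are assumptions for this sketch; production systems typically use a metrics backend's quantile functions instead of computing percentiles in-process.

```python
import math

class SlaTracker:
    """Keeps a sliding window of end-to-end job latencies and flags
    breaches of the P95 target."""
    def __init__(self, p95_target_s, window=1000):
        self.p95_target_s = p95_target_s
        self.window = window
        self.samples = []

    def record(self, latency_s):
        self.samples.append(latency_s)
        if len(self.samples) > self.window:
            self.samples.pop(0)  # drop the oldest sample

    def p95(self):
        """Nearest-rank 95th percentile of the current window."""
        ordered = sorted(self.samples)
        idx = math.ceil(0.95 * len(ordered)) - 1
        return ordered[max(idx, 0)]

    def breached(self):
        return bool(self.samples) and self.p95() > self.p95_target_s
```

An operator would wire `breached()` into an alerting rule (with the 10-minute target from the requirements, `p95_target_s=600`), and emit the same samples as per-stage metrics so a breach can be traced to upload, queueing, transcode, or storage.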