Design Task Scheduling System (DAG)

[ OK ] 312 — full content available

[ INFO ] category: System Design difficulty: hard freq: medium first seen: 2026-01-13

[HARD][SYSTEM DESIGN][MEDIUM]SchedulingDistributed SystemsDAGwebmachine_learningSystem Designbackendinfrastructure

$ cat problem.md

Design a distributed task-scheduler that can execute a workflow expressed as a Directed Acyclic Graph (DAG). Each node is a task; an edge A → B means “task B can start only after task A finishes”. The system must support tens of thousands of concurrent DAGs, each DAG having up to 100 000 tasks and 1 000 000 edges. Tasks are heterogeneous: some are 100 ms CPU-only, others are multi-hour GPU training jobs. DAGs are submitted through a REST/gRPC gateway and are executed by a pool of stateless workers that can be added or removed at any time. The scheduler itself must be horizontally scalable and fault-tolerant (no single point of failure). You must guarantee that every task is executed exactly once even if workers or scheduler nodes crash, network partitions occur, or tasks are retried. While a DAG is running users may add new tasks or new edges (dynamic DAG mutation). The system should expose APIs to: submit a DAG, query the status of any task, cancel a DAG, and stream real-time progress. Provide high-level design, core data structures, consistency protocols, and a rough capacity plan for 1 M DAGs/day with 99.9 % availability.

user@intervues:~/tiktok$