Extend a Distributed Task Queue with Sub-Tasks


[ INFO ] category: System Design | difficulty: hard | freq: medium | first seen: 2026-05-16

tags: System Design, Distributed Systems, Task Queue, Workflows, Durable Execution

$ cat problem.md

Design and extend a distributed task-queue system so that a single parent task can spawn many sub-tasks, wait for their results (with either all-succeed or all-settled semantics), and then continue execution, all without ever blocking a worker thread on I/O. The system must support millions of concurrent tasks and tens of thousands of workers, and it must guarantee at-least-once execution, exactly-once result persistence, and cascading cancellation. Your design should cover:

(1) how a worker that reaches an “awaitAll” call can serialize its continuation point, ACK the message, and immediately pick up new work;
(2) how the queue atomically resumes the parent only after the last child finishes;
(3) how child failures are propagated under pluggable policies (fail-fast, best-effort, retry-children);
(4) how cancellation of a parent reliably aborts every descendant;
(5) how deadlines are enforced across the entire tree; and
(6) the schema and state machine for the task row, the queue protocol, and the transactional patterns required for correctness under high concurrency.

You do not need to implement the handlers themselves, only the infrastructure that turns an ordinary function into a durable, resumable, distributed workflow.
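Point (1) is the crux: a worker cannot pickle a live call stack, but it can persist enough to rebuild one. A minimal sketch of one common technique, deterministic replay, is shown here. It assumes handlers are written as Python generators that `yield` a hypothetical `AwaitAll` effect; `run_step`, `Suspended`, and `Completed` are likewise illustrative names, not an existing library. The worker stores only the list of child results per await, replays the handler from the top on every resume, and treats the first await with no recorded results as the continuation point.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Generator


@dataclass(frozen=True)
class AwaitAll:
    """Effect a handler yields to request a batch of child tasks (hypothetical API)."""
    child_specs: tuple


@dataclass(frozen=True)
class Suspended:
    spawn: tuple       # child specs to persist and enqueue before ACKing the parent
    await_index: int   # continuation point: which awaitAll the handler is parked on


@dataclass(frozen=True)
class Completed:
    value: Any


Handler = Callable[[Any], Generator[AwaitAll, list, Any]]


def run_step(handler: Handler, arg: Any, history: list) -> Suspended | Completed:
    """Re-run `handler` from the top, feeding it the recorded child results.

    history[i] holds the results of the i-th awaitAll. Code between awaits must be
    deterministic so that every replay reaches the same await in the same order.
    """
    gen = handler(arg)
    try:
        effect = next(gen)                      # run until the first awaitAll
        index = 0
        while index < len(history):
            effect = gen.send(history[index])   # replay: resume with recorded results
            index += 1
        # Reached an awaitAll whose children have not run yet: park here.
        # Only this small record is persisted; the generator object is discarded.
        return Suspended(spawn=effect.child_specs, await_index=index)
    except StopIteration as stop:
        return Completed(stop.value)


def thumbnail_job(images):
    """Example handler: probe every image, then resize each one to its probed size."""
    sizes = yield AwaitAll(tuple(f"probe:{i}" for i in images))
    thumbs = yield AwaitAll(tuple(f"resize:{i}:{s}" for i, s in zip(images, sizes)))
    return len(thumbs)
```

After `run_step` returns `Suspended`, the worker would write the child rows and the parent's new state in one transaction, ACK the message, and fetch new work. Replay cost grows with the number of awaits, so a handler with very many awaits may warrant periodic checkpoints, which this sketch omits.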
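For point (2), one widely used pattern is a per-parent counter of outstanding children that is decremented in the same transaction that stores the child's result, so exactly one child observes the transition to zero and re-enqueues the parent. The sketch below uses SQLite through Python's `sqlite3` module only so that it runs as-is; the table and column names (`tasks`, `pending_kids`, `child_results`, `ready_queue`) and both function names are hypothetical. The primary key on `child_results` turns a redelivered completion (allowed under at-least-once execution) into a no-op, which is what yields exactly-once result persistence, and `ready_queue` stands in for a transactional outbox that a relay would forward to the real broker.

```python
from __future__ import annotations

import sqlite3

SCHEMA = """
CREATE TABLE tasks (
    id           TEXT PRIMARY KEY,
    parent_id    TEXT,
    state        TEXT NOT NULL,
    pending_kids INTEGER NOT NULL DEFAULT 0,  -- children not yet settled
    result       TEXT
);
CREATE TABLE child_results (
    parent_id TEXT NOT NULL,
    child_id  TEXT NOT NULL,
    state     TEXT NOT NULL,
    result    TEXT,
    PRIMARY KEY (parent_id, child_id)         -- makes redelivered completions no-ops
);
CREATE TABLE ready_queue (task_id TEXT PRIMARY KEY);  -- outbox stand-in for the broker
"""


def suspend_parent(db: sqlite3.Connection, parent_id: str, child_ids: list[str]) -> None:
    """Persist the continuation: park the parent and create its children atomically."""
    with db:  # commits on success, rolls back if anything below raises
        cur = db.execute("UPDATE tasks SET state = 'WAITING', pending_kids = ? "
                         "WHERE id = ? AND state = 'RUNNING'", (len(child_ids), parent_id))
        if cur.rowcount != 1:
            raise RuntimeError(f"parent {parent_id} is no longer RUNNING")
        db.executemany("INSERT INTO tasks (id, parent_id, state) VALUES (?, ?, 'PENDING')",
                       [(c, parent_id) for c in child_ids])
        db.executemany("INSERT INTO ready_queue (task_id) VALUES (?)", [(c,) for c in child_ids])
    # Only after this commit does the worker ACK the parent's message.


def complete_child(db: sqlite3.Connection, child_id: str, state: str, result: str) -> bool:
    """Record a child's outcome exactly once. Returns True if this call resumed the parent."""
    with db:
        (parent_id,) = db.execute("SELECT parent_id FROM tasks WHERE id = ?",
                                  (child_id,)).fetchone()
        cur = db.execute("INSERT OR IGNORE INTO child_results VALUES (?, ?, ?, ?)",
                         (parent_id, child_id, state, result))
        if cur.rowcount == 0:
            return False  # duplicate delivery: already counted, change nothing
        db.execute("UPDATE tasks SET state = ?, result = ? WHERE id = ?", (state, result, child_id))
        db.execute("UPDATE tasks SET pending_kids = pending_kids - 1 WHERE id = ?", (parent_id,))
        (remaining,) = db.execute("SELECT pending_kids FROM tasks WHERE id = ?",
                                  (parent_id,)).fetchone()
        if remaining != 0:
            return False
        # This child was the last one: re-enqueue the parent in the same transaction.
        db.execute("UPDATE tasks SET state = 'PENDING' WHERE id = ? AND state = 'WAITING'",
                   (parent_id,))
        db.execute("INSERT OR IGNORE INTO ready_queue (task_id) VALUES (?)", (parent_id,))
        return True
```

On PostgreSQL the decrement would more naturally be a single `UPDATE ... RETURNING pending_kids`, whose row lock serializes concurrent children of the same parent without a separate read.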
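Point (3) can be modeled as a pure decision function that the completion transaction consults before touching the fan-in counter. The three policy names come from the problem statement; the `Action` values, the `max_attempts` default, and the choice to escalate to fail-fast once retries are exhausted are assumptions of this sketch (escalating to best-effort would be an equally reasonable choice).

```python
from enum import Enum


class Policy(Enum):
    FAIL_FAST = "fail-fast"
    BEST_EFFORT = "best-effort"
    RETRY_CHILDREN = "retry-children"


class Action(Enum):
    RECORD = "record"            # count the outcome toward the fan-in counter as usual
    RETRY_CHILD = "retry-child"  # re-enqueue the same child; leave the counter alone
    FAIL_PARENT = "fail-parent"  # resume the parent with an error and cancel its siblings


def on_child_settled(policy: Policy, succeeded: bool, attempt: int,
                     max_attempts: int = 3) -> Action:
    """Decide what a finished child attempt means for its parent.

    `attempt` is the 1-based number of the attempt that just finished.
    """
    if succeeded:
        return Action.RECORD
    if policy is Policy.FAIL_FAST:
        return Action.FAIL_PARENT
    if policy is Policy.BEST_EFFORT:
        return Action.RECORD  # the parent later receives every outcome, errors included
    # RETRY_CHILDREN: retry until the budget is spent, then fall back to fail-fast.
    if attempt < max_attempts:
        return Action.RETRY_CHILD
    return Action.FAIL_PARENT
```

Because completions arrive at least once, `FAIL_PARENT` should itself be a conditional update (for example, applied only while the parent is still `WAITING`) so that a second failing child cannot resume the parent twice.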
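For point (4), cancellation can be made reliable by treating the database row, not an in-flight message, as the source of truth: a single statement marks the whole non-terminal subtree `CANCELLED`, and running workers discover this on their next heartbeat. This sketch assumes a `tasks` table with `id`, `parent_id`, and `state` columns and uses a recursive common table expression, which both SQLite and PostgreSQL support; `cancel_tree` and `heartbeat` are hypothetical names. The spawn path must also check the parent's state inside its own transaction, otherwise a child inserted concurrently with the cancel could escape it.

```python
import sqlite3

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def cancel_tree(db: sqlite3.Connection, root_id: str) -> int:
    """Mark root_id and every non-terminal descendant CANCELLED.

    A single UPDATE is atomic, so the whole subtree flips or nothing does.
    Returns the number of rows that changed state.
    """
    with db:
        cur = db.execute(
            """
            UPDATE tasks SET state = 'CANCELLED'
            WHERE id IN (
                WITH RECURSIVE subtree(id) AS (
                    SELECT id FROM tasks WHERE id = ?
                    UNION
                    SELECT t.id FROM tasks AS t JOIN subtree AS s ON t.parent_id = s.id
                )
                SELECT id FROM subtree
            )
            AND state NOT IN (?, ?, ?)
            """,
            (root_id, *TERMINAL_STATES),
        )
    return cur.rowcount


def heartbeat(db: sqlite3.Connection, task_id: str) -> bool:
    """Called periodically by the worker holding task_id.

    False means the task was cancelled (or otherwise moved on): stop work and
    discard any result instead of committing it.
    """
    row = db.execute("SELECT state FROM tasks WHERE id = ?", (task_id,)).fetchone()
    return row is not None and row[0] == "RUNNING"
```

For very large trees, storing a `root_id` on every row would let whole-workflow cancellation become a single indexed `UPDATE` instead of a recursive walk.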
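For point (5), one simple rule keeps deadlines consistent across the tree: a child's absolute deadline is the earlier of its parent's deadline and its own requested timeout, so no descendant can outlive its ancestor. A sweeper then pops expired tasks and hands them to the cancellation path. `child_deadline` and `DeadlineSweeper` are hypothetical names, and the in-memory heap stands in for what would more likely be an indexed `deadline` column scanned by a periodic job. Storing absolute timestamps assumes every deadline is computed from one clock (for example, the database's) so that skew between workers does not matter.

```python
from __future__ import annotations

import heapq


def child_deadline(parent_deadline: float | None, timeout: float | None,
                   now: float) -> float | None:
    """Absolute deadline for a new child: never later than its parent's."""
    own = None if timeout is None else now + timeout
    bounds = [d for d in (parent_deadline, own) if d is not None]
    return min(bounds) if bounds else None


class DeadlineSweeper:
    """Min-heap of (deadline, task_id); pops every task whose deadline has passed."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def track(self, task_id: str, deadline: float | None) -> None:
        if deadline is not None:  # tasks without a deadline are never swept
            heapq.heappush(self._heap, (deadline, task_id))

    def expired(self, now: float) -> list[str]:
        # Callers are expected to pass each id to the cancellation path,
        # which should leave already-terminal tasks untouched.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```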
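For point (6), the task row's lifecycle can be written as an explicit transition table, with every transition applied as a compare-and-set `UPDATE ... WHERE state = ?` so that concurrent actors (a worker finishing, a canceller, a lease reaper) cannot silently overwrite each other. The six states and the `RUNNING -> PENDING` edge for expired leases and retries are assumptions of this sketch rather than a prescribed schema, and `apply_transition` is a hypothetical helper; SQLite is used again only so the example runs as-is.

```python
import sqlite3
from enum import Enum


class State(Enum):
    PENDING = "PENDING"      # enqueued, waiting for a worker lease
    RUNNING = "RUNNING"      # leased by exactly one worker
    WAITING = "WAITING"      # parked at awaitAll; message ACKed, no worker held
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


# Legal edges; terminal states have no outgoing transitions.
TRANSITIONS = {
    State.PENDING: {State.RUNNING, State.CANCELLED},
    State.RUNNING: {State.WAITING, State.SUCCEEDED, State.FAILED, State.CANCELLED,
                    State.PENDING},  # lease expired or retry scheduled: back to the queue
    State.WAITING: {State.PENDING, State.CANCELLED},  # children settled: re-enqueue
}


def can_transition(src: State, dst: State) -> bool:
    return dst in TRANSITIONS.get(src, set())


def apply_transition(db: sqlite3.Connection, task_id: str, src: State, dst: State) -> bool:
    """Compare-and-set: move task_id from src to dst only if it is still in src.

    Returns False when another actor changed the row first; the caller must
    re-read the row instead of overwriting it.
    """
    if not can_transition(src, dst):
        raise ValueError(f"illegal transition {src.value} -> {dst.value}")
    with db:
        cur = db.execute("UPDATE tasks SET state = ? WHERE id = ? AND state = ?",
                         (dst.value, task_id, src.value))
    return cur.rowcount == 1
```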
