Design a Scheduler System
Problem Statement
Design a system that allows users to schedule events and processes them at specified times. The system should support both one-time tasks (send an email at 3pm tomorrow) and recurring schedules (run a data pipeline every hour). Applications register tasks with a target execution time and payload, and the scheduler guarantees reliable, timely execution even under failures or high load.
The core engineering challenge is building a correct and scalable event staging pipeline. Tasks must be durably stored for future execution, promoted into a ready queue as their scheduled time approaches, safely handed off to workers, and reliably completed or retried. Real-world scheduling patterns create severe traffic spikes at round times like the top of each minute or midnight UTC, and your design must handle these bursts without dropping or duplicating work.
Interviewers ask this question to test whether you can separate the concerns of time-indexed storage from execution dispatch, reason about delivery guarantees (at-least-once with idempotency), design lease-based worker coordination, and prevent thundering-herd problems. Expect probing questions on clock skew across distributed nodes, visibility timeouts, and how to gracefully handle worker failures mid-execution.
Key Requirements
Functional
- Task registration -- Applications can schedule one-time tasks at a specific future time or recurring tasks using cron-like expressions, with an associated payload and callback target
- Task management -- Clients can query task status, update scheduled times before execution, and cancel pending tasks
- Retry and backoff -- Failed executions are retried with configurable exponential backoff, maximum attempt counts, and dead-letter handling for permanently failed tasks
- Execution history -- The system tracks every execution attempt with timestamps, status transitions, error messages, and outcome logs
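The registration and retry requirements above can be sketched as a single task record. This is a minimal, illustrative model (the names `Task`, `execution_id`, and `next_backoff` are assumptions, not from the problem statement): it carries the payload and schedule, derives a unique execution ID per scheduled instance for deduplication, and computes exponential backoff with a dead-letter cutoff.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    task_id: str
    run_at: float               # epoch seconds of the next scheduled run
    payload: dict = field(default_factory=dict)
    cron: Optional[str] = None  # None for one-time tasks
    max_attempts: int = 5
    base_backoff_s: float = 2.0
    attempts: int = 0           # failed attempts so far for this instance

    def execution_id(self) -> str:
        """Unique per (task, scheduled instance) -- used for downstream dedup."""
        return f"{self.task_id}:{int(self.run_at)}"

    def next_backoff(self) -> float:
        """Exponential backoff: base * 2^attempts (2s, 4s, 8s, ...)."""
        return self.base_backoff_s * (2 ** self.attempts)

    def should_dead_letter(self) -> bool:
        """Permanently failed once the attempt budget is exhausted."""
        return self.attempts >= self.max_attempts
```

Keying the execution ID on the scheduled timestamp (rather than the wall-clock run time) means retries of the same instance share one ID, so a duplicate delivery can be detected no matter which attempt produced it.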
Non-Functional
- Scalability -- Handle tens of millions of concurrent scheduled tasks with peak execution loads of hundreds of thousands of tasks per minute
- Reliability -- Guarantee at-least-once execution with no lost tasks, even during node failures or availability zone outages
- Latency -- Execute tasks within 30 seconds of their scheduled time under normal load; task registration should complete in under 100 milliseconds
- Consistency -- Maintain strong consistency for task metadata and prevent duplicate execution of the same scheduled instance
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Time-Based Event Staging Architecture
The central challenge is efficiently moving tasks from long-term durable storage into an execution-ready state as their scheduled time approaches, without constantly scanning the entire dataset.
Hints to consider:
- Use time-bucketed partitioning (by hour or minute) so promoter processes can target only the relevant partition
- Design a two-tier model: a persistent store (DynamoDB, PostgreSQL) for all future tasks and a fast ready queue (Redis sorted set) for tasks due in the next few minutes
- Run multiple promoter instances that divide responsibility using consistent hashing or range sharding to avoid single points of failure
- Use conditional writes or version numbers to prevent multiple promoters from racing to promote the same task
2. Worker Claim and Lease Mechanism
Once tasks are in the ready queue, you need a safe handoff protocol that prevents both lost work and duplicate execution.
Hints to consider:
- Implement visibility timeouts where a claimed task becomes re-claimable if the worker does not acknowledge completion within the lease window
- Use atomic operations (ZPOPMIN in Redis or conditional deletes) so only one worker can claim a given task
- Include unique execution IDs combining task ID and scheduled timestamp for downstream deduplication
- Support heartbeat extensions for long-running tasks so workers can renew their lease before it expires
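The claim-and-lease protocol above can be sketched with an in-memory queue; `claim`, `heartbeat`, and `ack` are illustrative names, and the heap plays the role of the Redis sorted set popped with ZPOPMIN. A claimed task holds a lease; if the worker neither heartbeats nor acks before the lease expires, the task silently returns to the queue for redelivery.

```python
import heapq

class ReadyQueue:
    """Ready queue with visibility timeouts and lease renewal (sketch)."""

    def __init__(self, lease_s: float = 30.0):
        self.due = []       # min-heap of (run_at, execution_id)
        self.leases = {}    # execution_id -> lease expiry time
        self.lease_s = lease_s

    def push(self, run_at: float, execution_id: str) -> None:
        heapq.heappush(self.due, (run_at, execution_id))

    def claim(self, now: float):
        """Atomically hand one due task to a single worker (cf. ZPOPMIN)."""
        self._requeue_expired(now)
        if self.due and self.due[0][0] <= now:
            _, exec_id = heapq.heappop(self.due)
            self.leases[exec_id] = now + self.lease_s
            return exec_id
        return None

    def heartbeat(self, exec_id: str, now: float) -> None:
        """Long-running workers renew their lease before it expires."""
        if exec_id in self.leases:
            self.leases[exec_id] = now + self.lease_s

    def ack(self, exec_id: str) -> None:
        """Completion: drop the lease so the task is never redelivered."""
        self.leases.pop(exec_id, None)

    def _requeue_expired(self, now: float) -> None:
        for exec_id, expiry in list(self.leases.items()):
            if expiry <= now:   # worker died or stalled: make re-claimable
                del self.leases[exec_id]
                heapq.heappush(self.due, (now, exec_id))
```

Note this gives at-least-once delivery, not exactly-once: a slow worker whose lease lapses may finish alongside a second claimant, which is why the execution ID must drive idempotent handling downstream.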
3. Thundering Herd and Time Clustering
Cron-based workloads create massive spikes at minute and hour boundaries. Interviewers expect you to design for this predictable burst pattern.
Hints to consider:
- Apply jitter when promoting tasks to spread identical cron schedules across a 5-10 second window
- Use backpressure mechanisms like queue depth monitoring to prevent overwhelming downstream workers
- Pre-compute upcoming tasks and shard them across multiple promotion workers before the spike hits
- Design rate-limiting between the ready queue and workers to smooth execution throughput
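One way to apply the jitter hint is to hash each task ID into the spread window, rather than drawing a random offset. A sketch under that assumption (the function name and default window are illustrative): hash-based jitter is deterministic per task, so every promoter replica computes the same adjusted time and re-promotion never shifts a task.

```python
import hashlib

def jittered_run_at(task_id: str, scheduled_at: float, window_s: float = 10.0) -> float:
    """Spread tasks sharing a fire time across [scheduled_at, scheduled_at + window_s).

    Tasks with identical cron schedules (e.g. "every hour on the hour") land
    at distinct, stable offsets instead of all firing in the same instant.
    """
    digest = hashlib.sha256(task_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return scheduled_at + fraction * window_s
```

Jitter flattens the predictable minute-boundary spike at the source; queue-depth backpressure and rate limiting between the ready queue and workers then absorb whatever burst remains.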