Design a Scheduler System
Problem Statement
Design a system that allows users to schedule events and processes them at specified times. The system should support both one-time tasks (send an email at 3pm tomorrow) and recurring schedules (run a data pipeline every hour). Applications register tasks with a target execution time and payload, and the scheduler guarantees reliable, timely execution even under failures or high load.
The core engineering challenge is building a correct and scalable event staging pipeline. Tasks must be durably stored for future execution, promoted into a ready queue as their scheduled time approaches, safely handed off to workers, and reliably completed or retried. Real-world scheduling patterns create severe traffic spikes at round times like the top of each minute or midnight UTC, and your design must handle these bursts without dropping or duplicating work.
Interviewers ask this question to test whether you can separate the concerns of time-indexed storage from execution dispatch, reason about delivery guarantees (at-least-once with idempotency), design lease-based worker coordination, and prevent thundering-herd problems. Expect probing questions on clock skew across distributed nodes, visibility timeouts, and how to gracefully handle worker failures mid-execution.
Key Requirements
Functional
- Task registration -- Applications can schedule one-time tasks at a specific future time or recurring tasks using cron-like expressions, with an associated payload and callback target
- Task management -- Clients can query task status, update scheduled times before execution, and cancel pending tasks
- Retry and backoff -- Failed executions are retried with configurable exponential backoff, maximum attempt counts, and dead-letter handling for permanently failed tasks
- Execution history -- The system tracks every execution attempt with timestamps, status transitions, error messages, and outcome logs
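The registration and retry requirements above can be sketched as a single task record. This is a minimal, illustrative model (the names `Task`, `execution_id`, and `next_backoff` are assumptions, not from the problem statement): it carries the payload and schedule, derives a unique execution ID per scheduled instance for deduplication, and computes exponential backoff with a dead-letter cutoff.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    task_id: str
    run_at: float               # epoch seconds of the next scheduled run
    payload: dict = field(default_factory=dict)
    cron: Optional[str] = None  # None for one-time tasks
    max_attempts: int = 5
    base_backoff_s: float = 2.0
    attempts: int = 0           # failed attempts so far for this instance

    def execution_id(self) -> str:
        """Unique per (task, scheduled instance) -- used for downstream dedup."""
        return f"{self.task_id}:{int(self.run_at)}"

    def next_backoff(self) -> float:
        """Exponential backoff: base * 2^attempts (2s, 4s, 8s, ...)."""
        return self.base_backoff_s * (2 ** self.attempts)

    def should_dead_letter(self) -> bool:
        """Permanently failed once the attempt budget is exhausted."""
        return self.attempts >= self.max_attempts
```

Keying the execution ID on the scheduled timestamp (rather than the wall-clock run time) means retries of the same instance share one ID, so a duplicate delivery can be detected no matter which attempt produced it.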
Non-Functional
- Scalability -- Handle tens of millions of concurrent scheduled tasks with peak execution loads of hundreds of thousands of tasks per minute
- Reliability -- Guarantee at-least-once execution with no lost tasks, even during node failures or availability zone outages
- Latency -- Execute tasks within 30 seconds of their scheduled time under normal load; task registration should complete in under 100 milliseconds
- Consistency -- Maintain strong consistency for task metadata and prevent duplicate execution of the same scheduled instance
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Time-Based Event Staging Architecture
The central challenge is efficiently moving tasks from long-term durable storage into an execution-ready state as their scheduled time approaches, without constantly scanning the entire dataset.
Hints to consider:
- Use time-bucketed partitioning (by hour or minute) so promoter processes can target only the relevant partition
- Design a two-tier model: a persistent store (DynamoDB, PostgreSQL) for all future tasks and a fast ready queue (Redis sorted set) for tasks due in the next few minutes
- Run multiple promoter instances that divide responsibility using consistent hashing or range sharding to avoid single points of failure
- Use conditional writes or version numbers to prevent multiple promoters from racing to promote the same task
2. Worker Claim and Lease Mechanism
Once tasks are in the ready queue, you need a safe handoff protocol that prevents both lost work and duplicate execution.
Hints to consider:
- Implement visibility timeouts where a claimed task becomes re-claimable if the worker does not acknowledge completion within the lease window
- Use atomic operations (ZPOPMIN in Redis or conditional deletes) so only one worker can claim a given task
- Include unique execution IDs combining task ID and scheduled timestamp for downstream deduplication
- Support heartbeat extensions for long-running tasks so workers can renew their lease before it expires
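The claim-and-lease protocol above can be sketched with an in-memory queue; `claim`, `heartbeat`, and `ack` are illustrative names, and the heap plays the role of the Redis sorted set popped with ZPOPMIN. A claimed task holds a lease; if the worker neither heartbeats nor acks before the lease expires, the task silently returns to the queue for redelivery.

```python
import heapq

class ReadyQueue:
    """Ready queue with visibility timeouts and lease renewal (sketch)."""

    def __init__(self, lease_s: float = 30.0):
        self.due = []       # min-heap of (run_at, execution_id)
        self.leases = {}    # execution_id -> lease expiry time
        self.lease_s = lease_s

    def push(self, run_at: float, execution_id: str) -> None:
        heapq.heappush(self.due, (run_at, execution_id))

    def claim(self, now: float):
        """Atomically hand one due task to a single worker (cf. ZPOPMIN)."""
        self._requeue_expired(now)
        if self.due and self.due[0][0] <= now:
            _, exec_id = heapq.heappop(self.due)
            self.leases[exec_id] = now + self.lease_s
            return exec_id
        return None

    def heartbeat(self, exec_id: str, now: float) -> None:
        """Long-running workers renew their lease before it expires."""
        if exec_id in self.leases:
            self.leases[exec_id] = now + self.lease_s

    def ack(self, exec_id: str) -> None:
        """Completion: drop the lease so the task is never redelivered."""
        self.leases.pop(exec_id, None)

    def _requeue_expired(self, now: float) -> None:
        for exec_id, expiry in list(self.leases.items()):
            if expiry <= now:   # worker died or stalled: make re-claimable
                del self.leases[exec_id]
                heapq.heappush(self.due, (now, exec_id))
```

Note this gives at-least-once delivery, not exactly-once: a slow worker whose lease lapses may finish alongside a second claimant, which is why the execution ID must drive idempotent handling downstream.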
3. Thundering Herd and Time Clustering
Cron-based workloads create massive spikes at minute and hour boundaries. Interviewers expect you to design for this predictable burst pattern.
Hints to consider:
- Apply jitter when promoting tasks to spread identical cron schedules across a 5-10 second window
- Use backpressure mechanisms like queue depth monitoring to prevent overwhelming downstream workers
- Pre-compute upcoming tasks and shard them across multiple promotion workers before the spike hits
- Design rate-limiting between the ready queue and workers to smooth execution throughput
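One way to apply the jitter hint is to hash each task ID into the spread window, rather than drawing a random offset. A sketch under that assumption (the function name and default window are illustrative): hash-based jitter is deterministic per task, so every promoter replica computes the same adjusted time and re-promotion never shifts a task.

```python
import hashlib

def jittered_run_at(task_id: str, scheduled_at: float, window_s: float = 10.0) -> float:
    """Spread tasks sharing a fire time across [scheduled_at, scheduled_at + window_s).

    Tasks with identical cron schedules (e.g. "every hour on the hour") land
    at distinct, stable offsets instead of all firing in the same instant.
    """
    digest = hashlib.sha256(task_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return scheduled_at + fraction * window_s
```

Jitter flattens the predictable minute-boundary spike at the source; queue-depth backpressure and rate limiting between the ready queue and workers then absorb whatever burst remains.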