Practice/Netflix/Design a Scheduler System
Design a Scheduler System
System DesignMust
Problem Statement
Build a distributed system that enables applications to register tasks for execution at a future time and guarantees those tasks execute reliably when their scheduled time arrives. Your service must handle both one-time scheduled tasks (run a report at 3pm tomorrow) and recurring tasks (charge subscriptions every month). Applications should be able to create, update, cancel, and query the status of scheduled tasks. The system must support millions of concurrent scheduled tasks, with peak execution loads of hundreds of thousands of tasks per minute during common scheduling windows like midnight UTC or the top of each hour.
Your design should handle clock skew across distributed nodes, prevent duplicate execution when components fail and restart, support configurable retry policies for failed tasks, and ensure tasks are not lost even during partial system outages. Consider how tasks transition through the lifecycle: created and stored for the future, promoted to a ready-to-execute state when their time arrives, claimed by a worker, executed, and finally marked complete or retried on failure.
Key Requirements
Functional
- Task Registration -- Applications must be able to schedule tasks with a specific execution time, payload data, and callback information (webhook URL or message queue destination)
- Recurring Schedules -- Support cron-like recurring tasks that execute on a repeating schedule (every hour, daily at 2am, monthly on the 15th, etc.)
- Task Management -- Allow clients to query task status, update scheduled times before execution, and cancel pending tasks
- Execution Guarantees -- Ensure each task executes at least once after its scheduled time, with configurable retry behavior for failures
- Result Tracking -- Record execution outcomes (success, failure, retry count) and make this history queryable by clients
Non-Functional
- Scalability -- Handle 100 million scheduled tasks in the system at any time, with 500,000 tasks executing per minute during peak windows
- Reliability -- Tolerate multiple node failures and availability zone outages without losing scheduled tasks or creating duplicates
- Latency -- Execute tasks within 30 seconds of their scheduled time under normal load (p99); task registration and cancellation should complete in under 100ms
- Consistency -- Guarantee exactly-once semantics for task execution from the client's perspective (at-most-once delivery with idempotency support)
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Time-Based Event Promotion Architecture
The core challenge is efficiently moving tasks from long-term storage into an execution-ready state as their scheduled time approaches. Interviewers want to see how you handle millions of future tasks without constantly scanning the entire dataset.
Hints to consider:
- Consider time-bucketed partitioning where tasks are organized by their execution hour or minute to enable targeted queries
- Explore using a two-tier storage model: persistent store for all future tasks, with a separate ready queue for tasks due in the next few minutes
- Think about how background promoter processes discover due tasks without creating hot partitions or missing work during failures
- Address how you prevent multiple promoters from racing to promote the same tasks and causing duplicates
2. Worker Claim and Lease Mechanism
Once tasks are ready to execute, you need a safe handoff to worker processes that prevents both lost work (if a worker crashes) and duplicate execution (if multiple workers claim the same task).
Hints to consider:
- Implement visibility timeouts or leases so tasks become re-claimable if a worker dies mid-execution
- Use conditional writes or fencing tokens to ensure only one worker can claim a task at a time
- Consider how idempotency keys in the task payload allow downstream services to safely deduplicate retries
- Design acknowledgment flows where workers must explicitly mark tasks complete before the lease expires
3. Handling Time Clustering and Thundering Herds
Real-world scheduling patterns cluster heavily at round times: midnight, top of the hour, or synchronized batch windows. This creates massive execution spikes.
Hints to consider:
- Apply jitter when promoting or executing tasks to spread load across a time window rather than a single instant
- Use rate limiting and backpressure mechanisms so the execution layer doesn't overwhelm downstream services
- Consider pre-sharding the ready queue by time slice and tenant to distribute load across multiple workers
- Think about how recurring tasks can drift slightly in their next execution time to avoid reconverging on round boundaries
4. State Transitions and Failure Recovery
Tasks flow through a clear state machine (pending → ready → claimed → running → complete/failed), and failures can occur at any transition.
Hints to consider:
- Design state transitions to be idempotent and atomic, using conditional updates based on current state and version
- Implement dead-letter queues or poison message handling for tasks that repeatedly fail after maximum retries
- Consider how you'll recover from partial failures where a task executed successfully but the acknowledgment was lost
- Think about supporting task cancellation and updates even after a task has been promoted to the ready queue