Build a distributed system that enables applications to register tasks for execution at a future time and guarantees those tasks execute reliably when their scheduled time arrives. Your service must handle both one-time scheduled tasks (run a report at 3pm tomorrow) and recurring tasks (charge subscriptions every month). Applications should be able to create, update, cancel, and query the status of scheduled tasks. The system must support millions of concurrent scheduled tasks, with peak execution loads of hundreds of thousands of tasks per minute during common scheduling windows like midnight UTC or the top of each hour.
Your design should handle clock skew across distributed nodes, prevent duplicate execution when components fail and restart, support configurable retry policies for failed tasks, and ensure tasks are not lost even during partial system outages. Consider how tasks transition through the lifecycle: created and stored for the future, promoted to a ready-to-execute state when their time arrives, claimed by a worker, executed, and finally marked complete or retried on failure.
Task Registration -- Applications must be able to schedule tasks with a specific execution time, payload data, and callback information (webhook URL or message queue destination)
Recurring Schedules -- Support cron-like recurring tasks that execute on a repeating schedule (every hour, daily at 2am, monthly on the 15th, etc.)
Task Management -- Allow clients to query task status, update scheduled times before execution, and cancel pending tasks
Execution Guarantees -- Ensure each task executes at least once after its scheduled time, with configurable retry behavior for failures
Result Tracking -- Record execution outcomes (success, failure, retry count) and make this history queryable by clients
Reliability -- Tolerate multiple node failures and availability zone outages without losing scheduled tasks or creating duplicates
Latency -- Execute tasks within 30 seconds of their scheduled time under normal load (p99); task registration and cancellation should complete in under 100ms
Consistency -- Guarantee exactly-once semantics for task execution from the client's perspective (at-most-once delivery with idempotency support)
Based on real interview experiences, these are the areas interviewers probe most deeply:
The core challenge is efficiently moving tasks from long-term storage into an execution-ready state as their scheduled time approaches. Interviewers want to see how you handle millions of future tasks without constantly scanning the entire dataset.
Consider time-bucketed partitioning where tasks are organized by their execution hour or minute to enable targeted queries
Explore using a two-tier storage model: persistent store for all future tasks, with a separate ready queue for tasks due in the next few minutes
Think about how background promoter processes discover due tasks without creating hot partitions or missing work during failures
Address how you prevent multiple promoters from racing to promote the same tasks and causing duplicates
Once tasks are ready to execute, you need a safe handoff to worker processes that prevents both lost work (if a worker crashes) and duplicate execution (if multiple workers claim the same task).
Real-world scheduling patterns cluster heavily at round times: midnight, top of the hour, or synchronized batch windows. This creates massive execution spikes.
Apply jitter when promoting or executing tasks to spread load across a time window rather than a single instant
Use rate limiting and backpressure mechanisms so the execution layer doesn't overwhelm downstream services
Consider pre-sharding the ready queue by time slice and tenant to distribute load across multiple workers
Think about how recurring tasks can drift slightly in their next execution time to avoid reconverging on round boundaries
Tasks flow through a clear state machine (pending → ready → claimed → running → complete/failed), and failures can occur at any transition.
Design state transitions to be idempotent and atomic, using conditional updates based on current state and version
Implement dead-letter queues or poison message handling for tasks that repeatedly fail after maximum retries
Consider how you'll recover from partial failures where a task executed successfully but the acknowledgment was lost
Think about supporting task cancellation and updates even after a task has been promoted to the ready queue