For a full example answer with detailed architecture diagrams and deep dives, see our Design a Job Scheduler guide.
Design a distributed job scheduler that can handle high-throughput job processing (10,000+ jobs per second), support both scheduled (cron-based) and ad-hoc job execution, include retry mechanisms for failed jobs, and maintain execution history for up to one year.
Think of Airflow, Quartz, or AWS EventBridge/CloudWatch Events -- a system that can trigger and run 10,000 jobs per second with reliable execution semantics and a year of searchable history. Users submit jobs that run immediately, at a specific future time, or on a recurring cron schedule. Each job carries a payload, configurable retries with exponential backoff, timeout policies, and concurrency limits.
The core challenge is separating scheduling from execution so each can scale independently, designing for the predictable traffic spikes that cron workloads create at minute and hour boundaries, and reasoning about delivery guarantees (at-least-once with idempotency). Interviewers also probe how you model long-term execution history across hot and cold storage tiers and how you manage a fleet of heterogeneous workers without single points of failure.
Based on real interview experiences at Robinhood (72 candidate reports), these are the areas interviewers probe most deeply:
Interviewers want to see you architect separate components for determining when jobs should run versus actually executing them. Mixing these concerns creates scaling bottlenecks and makes failure isolation impossible.
Cron-based workloads create massive spikes at minute and hour boundaries when thousands of jobs become eligible simultaneously. Interviewers expect you to design for this predictable burst pattern.
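One standard mitigation worth mentioning is deterministic per-job jitter: jobs nominally due at the same minute boundary are spread across a short window, and because the offset is hash-derived rather than random, the same job always fires at the same offset. A minimal sketch (the `jittered_fire_time` helper and the 30-second window are illustrative assumptions, not part of the prompt):

```python
import hashlib

def jittered_fire_time(job_id: str, nominal_ts: int, window_s: int = 30) -> int:
    """Spread jobs due at the same cron boundary across a small window.

    The offset is derived from a hash of the job ID, so it is stable
    across scheduler restarts: the same job always fires at the same
    offset within its window.
    """
    digest = hashlib.sha256(job_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % window_s
    return nominal_ts + offset

# Jobs all due at the top of the minute land at different seconds:
boundary = 1_700_000_000
fire_times = {jittered_fire_time(f"job-{i}", boundary) for i in range(1000)}
```

Jitter trades a small, bounded amount of scheduling precision for a much flatter load curve on the ready queue and workers.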
The system must handle crashes, network partitions, and timeouts without losing or duplicating work. Interviewers probe deeply on how you achieve reliable execution semantics.
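At-least-once delivery means the same job attempt can be dispatched twice, so workers typically dedupe on an idempotency key such as (job_id, scheduled_ts). A minimal in-memory sketch; in production this check would be a conditional put-if-absent write on the execution-history row rather than a Python set (the class name and key shape here are assumptions for illustration):

```python
class IdempotentExecutor:
    """Skip re-execution when the same (job, scheduled run) already completed.

    In-memory stand-in for a conditional write: in a real system the
    'completed' check would be a put-if-absent on the history record
    keyed by (job_id, scheduled_ts).
    """
    def __init__(self):
        self._completed = set()

    def execute(self, job_id: str, scheduled_ts: int, action) -> bool:
        key = (job_id, scheduled_ts)
        if key in self._completed:
            return False          # duplicate delivery: acknowledge, do nothing
        action()                  # may still run twice if we crash right here,
        self._completed.add(key)  # so side effects should themselves be idempotent
        return True
```

Note the crash window between running the action and recording completion: that gap is exactly why the delivery guarantee stays at-least-once and why payload side effects need to tolerate replays.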
Storing execution records for every job over a year at 10k/sec creates billions of records. Interviewers want to see you reason about hot versus cold storage tiers and query patterns.
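A quick back-of-envelope calculation motivates the split (assuming ~10,000 executions per second sustained and roughly 1 KB per execution record, both assumptions):

```python
RATE = 10_000                 # executions per second (from the prompt)
SECONDS_PER_DAY = 86_400
RECORD_BYTES = 1_024          # assumed average execution-record size

records_per_day = RATE * SECONDS_PER_DAY      # 864 million per day
records_per_year = records_per_day * 365      # ~315 billion per year
hot_tier_records = records_per_day * 30       # ~26 billion in a 30-day hot tier
cold_tier_tb = records_per_year * RECORD_BYTES / 1024**4  # ~300 TB/year cold
```

Numbers at this scale are the argument for tiering: the hot tier serves point lookups and recent-failure queries, while the year of cold data lives in compressed columnar files queried analytically.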
The system needs a fleet of workers that can scale up and down while handling in-flight jobs gracefully. Interviewers expect you to reason about worker registration, health checking, and shutdown procedures.
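The graceful-shutdown piece can be sketched as a drain flag: once set, the worker stops claiming new jobs but finishes whatever it already claimed. A minimal sketch under assumed names (`Worker`, `draining`); a real worker would also deregister from the coordination service and stop its heartbeat before exiting:

```python
import queue
import threading

class Worker:
    """Pulls jobs until asked to drain; claimed work completes before exit."""
    def __init__(self, jobs: "queue.Queue"):
        self.jobs = jobs
        self.draining = threading.Event()
        self.done = []

    def run(self):
        while not self.draining.is_set():
            try:
                job = self.jobs.get(timeout=0.1)  # stop claiming once draining
            except queue.Empty:
                continue
            self.done.append(job())               # finish the job we claimed

    def shutdown(self):
        self.draining.set()  # in a real worker: also deregister and stop heartbeats
```

The same flag pattern supports scale-down: the autoscaler drains a worker, waits for in-flight jobs to finish (or for their leases to expire), then terminates the node.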
Start by confirming the scope and constraints. Ask about the types of jobs being scheduled -- are they lightweight webhooks, heavy data processing, or a mix? Clarify whether jobs pass large payloads or just small trigger messages. Understand the acceptable staleness for history queries and whether users need real-time dashboards or can tolerate eventual consistency. Confirm the retry semantics -- should retries happen immediately or follow their own exponential backoff schedule? Ask if jobs can have dependencies on other jobs or are independent. Verify whether the system needs to support job cancellation and pausing/resuming of schedules.
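On the retry-semantics question, the usual answer is capped exponential backoff with jitter. A sketch under assumed parameters (1-second base, 5-minute cap; the function name is illustrative):

```python
import random

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Capped exponential backoff with full jitter.

    attempt 0 draws from [0, 1s], attempt 1 from [0, 2s], and so on,
    capped at 5 minutes.  Full jitter avoids retry storms where every
    job that failed together comes back at the same instant.
    """
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))
```

Jitter matters here for the same reason as at cron boundaries: correlated failures (say, a downstream outage) would otherwise produce correlated, synchronized retries.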
Sketch a component diagram with these layers: a Job Registry (DynamoDB storing job definitions, schedules, and metadata), a Scheduler Cluster (multiple instances that scan for jobs becoming ready and promote them to execution queues), a Ready Queue (Redis sorted sets holding jobs ordered by execution time), a Worker Pool (stateless compute nodes pulling jobs and executing them), and a History Store (time-partitioned database for execution records). Include separate write paths for ad-hoc submissions that bypass scheduling and go directly to the ready queue. Add an API layer that handles job CRUD operations and history queries, and a coordination service (ZooKeeper) that manages scheduler leader election and shard assignment.
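Shard assignment can be sketched compactly. This example uses rendezvous (highest-random-weight) hashing, a close cousin of the consistent hashing named in the walkthrough, because it fits in a few lines while showing the key property: removing a scheduler only reassigns the jobs it owned. The function name and scheduler IDs are illustrative:

```python
import hashlib

def assign_scheduler(job_id: str, schedulers: list) -> str:
    """Rendezvous hashing: each job deterministically maps to one scheduler.

    Every (scheduler, job) pair gets a hash-derived score; the job belongs
    to the highest-scoring scheduler.  Removing a scheduler leaves every
    other pair's score unchanged, so only that scheduler's jobs move.
    """
    def score(s: str) -> int:
        h = hashlib.sha256(f"{s}:{job_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(schedulers, key=score)
```

The coordination service's role is then just membership: it tells every scheduler the current instance list, and the hash makes ownership deterministic from there.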
Walk through how the scheduler works. Multiple scheduler instances run concurrently, each responsible for a shard of jobs based on consistent hashing of job IDs. Each scheduler maintains an in-memory min-heap of the next N jobs in its shard, continuously polling its DynamoDB partition for jobs whose next execution time falls within the next few minutes. When a job becomes ready, the scheduler atomically adds it to the Redis sorted set with its execution timestamp as the score. Workers claim jobs in time order by atomically moving them from the ready set to an in-flight set whose score is a lease deadline -- a Lua script combining ZPOPMIN and ZADD keeps the move atomic, since ZPOPMIN alone deletes the entry and would lose the job if the worker crashed. On success, the worker writes to the history store and removes the in-flight entry. If the worker fails or the lease expires, a reaper process moves the job back to the ready set for retry. Discuss how this design avoids lock contention and enables horizontal scaling of both schedulers and workers.
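The claim/lease/requeue cycle can be sketched in memory, with a heap and a dict standing in for the two Redis sorted sets (the `ReadyQueue` class and 30-second lease are illustrative assumptions):

```python
import heapq

class ReadyQueue:
    """In-memory stand-in for the Redis ready/in-flight sorted sets.

    claim() pops the earliest-due job and records a lease deadline;
    ack() finishes it; reap() requeues jobs whose lease expired because
    the worker crashed or timed out.  In Redis, claim() would be a Lua
    script moving a member from the 'ready' set to the 'in-flight' set
    in one atomic step.
    """
    def __init__(self, lease_s: float = 30.0):
        self.ready = []          # min-heap of (fire_ts, job_id)
        self.in_flight = {}      # job_id -> lease deadline
        self.lease_s = lease_s

    def add(self, job_id: str, fire_ts: float) -> None:
        heapq.heappush(self.ready, (fire_ts, job_id))

    def claim(self, now: float):
        if not self.ready or self.ready[0][0] > now:
            return None                       # nothing due yet
        _, job_id = heapq.heappop(self.ready)
        self.in_flight[job_id] = now + self.lease_s
        return job_id

    def ack(self, job_id: str) -> None:
        self.in_flight.pop(job_id, None)      # success: history write happens elsewhere

    def reap(self, now: float) -> None:
        for job_id, deadline in list(self.in_flight.items()):
            if deadline <= now:               # lease expired: make visible again
                del self.in_flight[job_id]
                heapq.heappush(self.ready, (now, job_id))
```

Combined with idempotency keys on the worker side, the expired-lease requeue is what turns crashes and timeouts into retries instead of lost or duplicated work.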
Cover history storage: explain the partitioning strategy using composite keys like (job_id, execution_timestamp) and GSIs on (status, timestamp) for failure queries. Describe the lifecycle policy: DynamoDB TTL expires records after 30 days, and a consumer on DynamoDB Streams captures those TTL deletions and archives them to S3 in Parquet format, with metadata in a catalog for querying -- TTL alone deletes data; it does not move it. For observability, add a metrics pipeline that aggregates execution counts, latencies, and failure rates per job. Discuss autoscaling workers based on queue depth and execution-rate metrics. Address failure scenarios: if Redis fails, fall back to a database-backed queue at lower throughput; if DynamoDB has an outage, schedulers buffer ready jobs in memory; under a network partition, workers fail open and retry, relying on idempotency to absorb duplicates. Touch on security concerns such as authentication for job submission and encryption of sensitive payloads.
Deepen your understanding of the patterns used in this problem: