For a full example answer with detailed architecture diagrams and deep dives, see our Design a Job Scheduler guide.
Design a distributed job scheduler that can handle high-throughput job processing (10,000+ jobs per second), support both scheduled (cron-based) and ad-hoc job execution, include retry mechanisms for failed jobs, and maintain execution history for up to one year.
Think of Airflow, Quartz, or AWS EventBridge/CloudWatch Events -- a system that can trigger and run 10,000 jobs per second with reliable execution semantics and a year of searchable history. Users submit jobs that run immediately, at a specific future time, or on a recurring cron schedule. Each job carries a payload, configurable retries with exponential backoff, timeout policies, and concurrency limits.
The core challenge is separating scheduling from execution so each can scale independently, designing for the predictable traffic spikes that cron workloads create at minute and hour boundaries, and reasoning about delivery guarantees (at-least-once with idempotency). Interviewers also probe how you model long-term execution history across hot and cold storage tiers and how you manage a fleet of heterogeneous workers without single points of failure.
Based on real interview experiences at Robinhood (72 candidate reports), these are the areas interviewers probe most deeply:
Interviewers want to see you architect separate components for determining when jobs should run versus actually executing them. Mixing these concerns creates scaling bottlenecks and makes failure isolation impossible.
Cron-based workloads create massive spikes at minute and hour boundaries when thousands of jobs become eligible simultaneously. Interviewers expect you to design for this predictable burst pattern.
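One standard mitigation worth mentioning is deterministic per-job jitter: jobs nominally due at the same minute boundary are spread across a short window, and because the offset is hash-derived rather than random, the same job always fires at the same offset. A minimal sketch (the `jittered_fire_time` helper and the 30-second window are illustrative assumptions, not part of the prompt):

```python
import hashlib

def jittered_fire_time(job_id: str, nominal_ts: int, window_s: int = 30) -> int:
    """Spread jobs due at the same cron boundary across a small window.

    The offset is derived from a hash of the job ID, so it is stable
    across scheduler restarts: the same job always fires at the same
    offset within its window.
    """
    digest = hashlib.sha256(job_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % window_s
    return nominal_ts + offset

# Jobs all due at the top of the minute land at different seconds:
boundary = 1_700_000_000
fire_times = {jittered_fire_time(f"job-{i}", boundary) for i in range(1000)}
```

Jitter trades a small, bounded amount of scheduling precision for a much flatter load curve on the ready queue and workers.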
The system must handle crashes, network partitions, and timeouts without losing or duplicating work. Interviewers probe deeply on how you achieve reliable execution semantics.
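At-least-once delivery means the same job attempt can be dispatched twice, so workers typically dedupe on an idempotency key such as (job_id, scheduled_ts). A minimal in-memory sketch; in production this check would be a conditional put-if-absent write on the execution-history row rather than a Python set (the class name and key shape here are assumptions for illustration):

```python
class IdempotentExecutor:
    """Skip re-execution when the same (job, scheduled run) already completed.

    In-memory stand-in for a conditional write: in a real system the
    'completed' check would be a put-if-absent on the history record
    keyed by (job_id, scheduled_ts).
    """
    def __init__(self):
        self._completed = set()

    def execute(self, job_id: str, scheduled_ts: int, action) -> bool:
        key = (job_id, scheduled_ts)
        if key in self._completed:
            return False          # duplicate delivery: acknowledge, do nothing
        action()                  # may still run twice if we crash right here,
        self._completed.add(key)  # so side effects should themselves be idempotent
        return True
```

Note the crash window between running the action and recording completion: that gap is exactly why the delivery guarantee stays at-least-once and why payload side effects need to tolerate replays.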
Storing execution records for every job over a year at 10k/sec creates billions of records. Interviewers want to see you reason about hot versus cold storage tiers and query patterns.
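A quick back-of-envelope calculation motivates the split (assuming ~10,000 executions per second sustained and roughly 1 KB per execution record, both assumptions):

```python
RATE = 10_000                 # executions per second (from the prompt)
SECONDS_PER_DAY = 86_400
RECORD_BYTES = 1_024          # assumed average execution-record size

records_per_day = RATE * SECONDS_PER_DAY      # 864 million per day
records_per_year = records_per_day * 365      # ~315 billion per year
hot_tier_records = records_per_day * 30       # ~26 billion in a 30-day hot tier
cold_tier_tb = records_per_year * RECORD_BYTES / 1024**4  # ~300 TB/year cold
```

Numbers at this scale are the argument for tiering: the hot tier serves point lookups and recent-failure queries, while the year of cold data lives in compressed columnar files queried analytically.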
The system needs a fleet of workers that can scale up and down while handling in-flight jobs gracefully. Interviewers expect you to reason about worker registration, health checking, and shutdown procedures.
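The graceful-shutdown piece can be sketched as a drain flag: once set, the worker stops claiming new jobs but finishes whatever it already claimed. A minimal sketch under assumed names (`Worker`, `draining`); a real worker would also deregister from the coordination service and stop its heartbeat before exiting:

```python
import queue
import threading

class Worker:
    """Pulls jobs until asked to drain; claimed work completes before exit."""
    def __init__(self, jobs: "queue.Queue"):
        self.jobs = jobs
        self.draining = threading.Event()
        self.done = []

    def run(self):
        while not self.draining.is_set():
            try:
                job = self.jobs.get(timeout=0.1)  # stop claiming once draining
            except queue.Empty:
                continue
            self.done.append(job())               # finish the job we claimed

    def shutdown(self):
        self.draining.set()  # in a real worker: also deregister and stop heartbeats
```

The same flag pattern supports scale-down: the autoscaler drains a worker, waits for in-flight jobs to finish (or for their leases to expire), then terminates the node.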
Start by confirming the scope and constraints. Ask about the types of jobs being scheduled -- are they lightweight webhooks, heavy data processing, or a mix? Clarify whether jobs pass large payloads or just small trigger messages. Understand the acceptable staleness for history queries and whether users need real-time dashboards or can tolerate eventual consistency. Confirm the retry semantics -- should retries happen immediately or follow their own exponential backoff schedule? Ask if jobs can have dependencies on other jobs or are independent. Verify whether the system needs to support job cancellation and pausing/resuming of schedules.
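On the retry-semantics question, the usual answer is capped exponential backoff with jitter. A sketch under assumed parameters (1-second base, 5-minute cap; the function name is illustrative):

```python
import random

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Capped exponential backoff with full jitter.

    attempt 0 draws from [0, 1s], attempt 1 from [0, 2s], and so on,
    capped at 5 minutes.  Full jitter avoids retry storms where every
    job that failed together comes back at the same instant.
    """
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))
```

Jitter matters here for the same reason as at cron boundaries: correlated failures (say, a downstream outage) would otherwise produce correlated, synchronized retries.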
Sketch a component diagram with these layers: a Job Registry (DynamoDB storing job definitions, schedules, and metadata), a Scheduler Cluster (multiple instances that scan for jobs becoming ready and promote them to execution queues), a Ready Queue (Redis sorted sets holding jobs ordered by execution time), a Worker Pool (stateless compute nodes pulling jobs and executing them), and a History Store (time-partitioned database for execution records). Include separate write paths for ad-hoc submissions that bypass scheduling and go directly to the ready queue. Add an API layer that handles job CRUD operations and history queries, and a coordination service (ZooKeeper) that manages scheduler leader election and shard assignment.
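Shard assignment can be sketched compactly. This example uses rendezvous (highest-random-weight) hashing, a close cousin of the consistent hashing named in the walkthrough, because it fits in a few lines while showing the key property: removing a scheduler only reassigns the jobs it owned. The function name and scheduler IDs are illustrative:

```python
import hashlib

def assign_scheduler(job_id: str, schedulers: list) -> str:
    """Rendezvous hashing: each job deterministically maps to one scheduler.

    Every (scheduler, job) pair gets a hash-derived score; the job belongs
    to the highest-scoring scheduler.  Removing a scheduler leaves every
    other pair's score unchanged, so only that scheduler's jobs move.
    """
    def score(s: str) -> int:
        h = hashlib.sha256(f"{s}:{job_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(schedulers, key=score)
```

The coordination service's role is then just membership: it tells every scheduler the current instance list, and the hash makes ownership deterministic from there.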
Walk through how the scheduler works. Multiple scheduler instances run concurrently, each responsible for a shard of jobs based on consistent hashing of job IDs. Each scheduler maintains an in-memory min-heap of the next N jobs in its shard, continuously polling its DynamoDB partition for jobs whose next execution time falls within the next few minutes. When a job becomes ready, the scheduler atomically adds it to the Redis sorted set with its execution timestamp as the score. Workers claim jobs in time order by atomically moving them from the ready set to an in-flight set whose score is a lease deadline -- a Lua script combining ZPOPMIN and ZADD keeps the move atomic, since ZPOPMIN alone deletes the entry and would lose the job if the worker crashed. On success, the worker writes to the history store and removes the in-flight entry. If the worker fails or the lease expires, a reaper process moves the job back to the ready set for retry. Discuss how this design avoids lock contention and enables horizontal scaling of both schedulers and workers.
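The claim/lease/requeue cycle can be sketched in memory, with a heap and a dict standing in for the two Redis sorted sets (the `ReadyQueue` class and 30-second lease are illustrative assumptions):

```python
import heapq

class ReadyQueue:
    """In-memory stand-in for the Redis ready/in-flight sorted sets.

    claim() pops the earliest-due job and records a lease deadline;
    ack() finishes it; reap() requeues jobs whose lease expired because
    the worker crashed or timed out.  In Redis, claim() would be a Lua
    script moving a member from the 'ready' set to the 'in-flight' set
    in one atomic step.
    """
    def __init__(self, lease_s: float = 30.0):
        self.ready = []          # min-heap of (fire_ts, job_id)
        self.in_flight = {}      # job_id -> lease deadline
        self.lease_s = lease_s

    def add(self, job_id: str, fire_ts: float) -> None:
        heapq.heappush(self.ready, (fire_ts, job_id))

    def claim(self, now: float):
        if not self.ready or self.ready[0][0] > now:
            return None                       # nothing due yet
        _, job_id = heapq.heappop(self.ready)
        self.in_flight[job_id] = now + self.lease_s
        return job_id

    def ack(self, job_id: str) -> None:
        self.in_flight.pop(job_id, None)      # success: history write happens elsewhere

    def reap(self, now: float) -> None:
        for job_id, deadline in list(self.in_flight.items()):
            if deadline <= now:               # lease expired: make visible again
                del self.in_flight[job_id]
                heapq.heappush(self.ready, (now, job_id))
```

Combined with idempotency keys on the worker side, the expired-lease requeue is what turns crashes and timeouts into retries instead of lost or duplicated work.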
Cover history storage: explain the partitioning strategy using composite keys like (job_id, execution_timestamp) and GSIs on (status, timestamp) for failure queries. Describe the lifecycle policy: DynamoDB TTL expires records after 30 days, and a consumer on DynamoDB Streams captures those TTL deletions and archives them to S3 in Parquet format, with metadata in a catalog for querying -- TTL alone deletes data; it does not move it. For observability, add a metrics pipeline that aggregates execution counts, latencies, and failure rates per job. Discuss autoscaling workers based on queue depth and execution-rate metrics. Address failure scenarios: if Redis fails, fall back to a database-backed queue at lower throughput; if DynamoDB has an outage, schedulers buffer ready jobs in memory; under a network partition, workers fail open and retry, relying on idempotency to absorb duplicates. Touch on security concerns such as authentication for job submission and encryption of sensitive payloads.
Deepen your understanding of the patterns used in this problem: