For a full example answer with detailed architecture diagrams and deep dives, see our Job Scheduler guide.
Design a distributed job scheduler that can handle high-throughput job processing (10,000+ jobs per second), support both scheduled (cron-based) and ad-hoc job execution, include retry mechanisms for failed jobs, and maintain execution history for up to one year.
Think of systems like Apache Airflow, AWS EventBridge, or Quartz Scheduler that trigger and orchestrate work at massive scale. Users define jobs with execution parameters, set schedules using cron expressions or one-time triggers, and expect the system to execute them reliably with detailed observability into every attempt. The scheduler must handle bursty workloads (thousands of cron jobs firing at the top of each minute), maintain at-least-once execution guarantees, and store a searchable audit trail of billions of execution records over a year.
Interviewers use this problem to test whether you can cleanly separate scheduling concerns from execution, design for high throughput under predictable burst patterns, reason about delivery guarantees and idempotency, and architect tiered storage for long-term history. Expect deep questions on worker lifecycle management, distributed coordination, and strategies to avoid single points of failure.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see distinct components for determining when jobs should run versus actually running them. Mixing these concerns creates scaling bottlenecks and prevents the two from scaling and failing independently.
Cron workloads create massive spikes at minute and hour boundaries when thousands of jobs become eligible simultaneously. This is a predictable burst pattern that your design must accommodate.
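One common mitigation is to smear firings across a short jitter window so that thousands of jobs due at :00 do not all become eligible in the same second. A minimal sketch, assuming a deterministic per-job offset derived from a hash of the job ID (the function name and window size are illustrative):

```python
import hashlib

def smeared_fire_time(job_id: str, scheduled_ts: int, window_s: int = 30) -> int:
    """Spread cron firings across a jitter window to flatten top-of-minute
    spikes. The offset is derived from a hash of the job ID, so a given job
    always fires at the same point within its window."""
    digest = hashlib.sha256(job_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % window_s
    return scheduled_ts + offset
```

Because the offset is deterministic rather than random, retries and reschedules of the same job land at the same smeared time, keeping execution history consistent.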
The system must handle crashes, network partitions, and timeouts without losing or duplicating work. This is where interviewers probe most deeply.
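At-least-once delivery means workers must tolerate duplicate claims of the same execution. A minimal in-memory sketch of idempotency-key deduplication; in a real system the completed-key set would live in Redis or the history table, and the names here are illustrative:

```python
import threading

class IdempotentExecutor:
    """Deduplicate at-least-once deliveries: an execution key (e.g.
    job_id plus scheduled fire time) runs its handler at most once
    per process."""
    def __init__(self):
        self._done: set[str] = set()
        self._lock = threading.Lock()

    def run(self, execution_key: str, handler) -> bool:
        with self._lock:
            if execution_key in self._done:
                return False          # duplicate delivery: skip side effects
        handler()                     # a crash here re-runs the handler,
        with self._lock:              # so handlers must still be idempotent
            self._done.add(execution_key)
        return True
```

Note that the key is recorded only after the handler succeeds, so a crash mid-execution triggers a retry; this is exactly why the handler itself must be safe to run twice.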
Storing execution records at 10,000 per second for a year creates billions of records. Interviewers want to see you reason about hot versus cold storage tiers.
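The back-of-envelope arithmetic is worth doing explicitly in the interview; a quick sketch, assuming roughly 500 bytes per execution record (the record size is an assumption, not given in the problem):

```python
def yearly_records(rate_per_s: int = 10_000) -> int:
    """Records written in one year at a sustained rate."""
    return rate_per_s * 60 * 60 * 24 * 365

total = yearly_records()          # 315,360,000,000 records
# At an assumed ~500 bytes per record this is roughly 158 TB before
# compression -- far too much for a single hot database, which is what
# motivates the hot/cold tiering discussion.
approx_tb = total * 500 / 1e12
```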
The system needs a fleet of workers that can scale up and down while handling in-flight jobs gracefully. Interviewers expect you to reason about worker registration, health checking, and shutdown procedures.
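The shutdown half of the lifecycle can be sketched in a few lines: on SIGTERM the worker stops claiming new jobs but finishes the one in flight before exiting. The class and callback names below are illustrative, and heartbeating (a background thread pinging the coordination service) is omitted:

```python
import signal
import threading

class Worker:
    """Graceful-shutdown sketch: a stop request prevents new claims but
    lets the in-flight job run to completion before the loop exits."""
    def __init__(self, claim_job, execute):
        self._claim = claim_job
        self._execute = execute
        self._stopping = threading.Event()

    def request_stop(self, *_):
        self._stopping.set()

    def run(self):
        signal.signal(signal.SIGTERM, self.request_stop)
        while not self._stopping.is_set():
            job = self._claim()
            if job is None:
                self._stopping.wait(0.1)   # idle poll before retrying
                continue
            self._execute(job)             # finish in-flight work even if
                                           # a stop request arrives mid-job
```

Pairing this with a deregistration call to the coordination service lets the scheduler stop routing work to a draining node before its process exits.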
Start by confirming the scope and constraints. Ask about the types of jobs being scheduled (lightweight webhooks, heavy data processing, or a mix). Clarify whether jobs can have dependencies on other jobs or are independent. Understand the acceptable staleness for history queries and whether users need real-time dashboards or can tolerate eventual consistency. Confirm retry semantics and whether the system needs to support job cancellation and pausing of schedules.
Sketch a component diagram with five layers: a Job Registry (database storing job definitions, schedules, and metadata), a Scheduler Cluster (multiple instances scanning for jobs becoming ready), a Ready Queue (Redis sorted sets holding jobs ordered by execution time), a Worker Pool (stateless compute nodes pulling and executing jobs), and a History Store (time-partitioned database for execution records). Include an API layer for job CRUD and history queries, and a coordination service (ZooKeeper or etcd) for scheduler leader election and shard assignment. Show separate paths for ad-hoc submissions that bypass scheduling and go directly to the ready queue.
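The ready queue's ordering semantics, and why ad-hoc submissions can bypass the scheduler, can be modeled in a few lines. Below is an in-memory stand-in for the Redis sorted set, where a min-heap plays the role of ZADD/ZPOPMIN and the function names are illustrative:

```python
import heapq
import time

# (execution_ts, job_id): the score orders jobs by when they are due,
# exactly as the Redis sorted set would.
ready_queue: list[tuple[float, str]] = []

def schedule(job_id: str, execution_ts: float) -> None:
    """Scheduler path: promote a cron job due at execution_ts."""
    heapq.heappush(ready_queue, (execution_ts, job_id))

def submit_adhoc(job_id: str) -> None:
    """Ad-hoc path: bypass the scheduler entirely and score the job with
    the current time, so it is immediately claimable by workers."""
    heapq.heappush(ready_queue, (time.time(), job_id))
```

Scoring ad-hoc jobs with "now" is what makes the bypass safe: both paths converge on one queue with one ordering rule, so workers need no special casing.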
Walk through the detailed scheduling flow. Multiple scheduler instances run concurrently, each responsible for a shard of jobs via consistent hashing. Each scheduler maintains an in-memory min-heap of the next N jobs in its shard, polling its database partition for jobs whose next execution time falls within the upcoming promotion window. When a job becomes ready, the scheduler atomically adds it to the Redis sorted set with the execution timestamp as the score. Workers claim the next due job with ZPOPMIN; because ZPOPMIN removes the entry, the claim also records the job in an in-flight set with a visibility-timeout lease. On success, the worker writes to the history store and acknowledges, clearing the lease. On failure or worker crash, the lease expires and a reaper returns the job to the ready queue for retry. Discuss how this avoids lock contention and enables horizontal scaling of both schedulers and workers.
Cover history storage with a partitioning strategy using composite keys like (job_id, execution_timestamp) and global secondary indexes on (status, timestamp) for failure queries. Describe lifecycle policies that move records older than 30 days to S3 in Parquet format, with metadata in a catalog for querying via Athena or BigQuery. For observability, add a metrics pipeline tracking execution counts, latencies, and failure rates per job, and discuss autoscaling workers based on queue depth. Address failure scenarios: Redis failure (fall back to a database-backed queue), database outage (schedulers buffer ready jobs in memory), network partition (workers fail open and retry). Touch on security (authentication for job submission, encryption of sensitive payloads) and graceful shutdown procedures.
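The hot/cold split implies a routing decision at query time. A minimal sketch, assuming the 30-day lifecycle boundary above (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=30)   # matches the 30-day lifecycle policy

def query_tiers(start: datetime, end: datetime, now: datetime) -> set[str]:
    """Route a history query: the hot store (partitioned DB) serves the
    last 30 days; anything older is read from Parquet on S3 via the
    query engine. A range straddling the boundary fans out to both."""
    cutoff = now - HOT_RETENTION
    tiers = set()
    if end > cutoff:
        tiers.add("hot")
    if start < cutoff:
        tiers.add("cold")
    return tiers
```

Fanning out and merging at the API layer keeps the tiering invisible to callers, which is what lets the one-year retention requirement coexist with fast recent-history dashboards.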