Design a Job Scheduler
System Design
Problem Statement
Design a distributed job scheduler that can handle high-throughput job processing (10,000+ jobs per second), support both scheduled (cron-based) and ad-hoc job execution, include retry mechanisms for failed jobs, and maintain execution history for up to one year.
This tests whether you can separate scheduling from execution, design for high throughput and bursty "top-of-minute" load, and reason about delivery guarantees (at-least-once, idempotency, retries). Also consider pragmatic data modeling for long-term history (hot vs. cold), worker lifecycle management, and strategies to avoid single points of failure.
Key Requirements
Functional
- Flexible scheduling -- users create jobs that run immediately, at a specific future time, or on a recurring schedule using cron expressions
- Ad-hoc execution -- users submit one-off job runs in addition to scheduled executions
- Execution tracking -- users view live job status and detailed execution history (timestamps, attempts, outcome, logs) for up to one year
- Configurable policies -- users set retries with backoff, timeouts, and optional concurrency limits per job
Non-Functional
- Scalability -- handle 10,000+ job executions per second with headroom for peak traffic spikes at schedule boundaries
- Reliability -- guarantee at-least-once execution semantics with no lost jobs even during infrastructure failures
- Latency -- trigger scheduled jobs within 1-2 seconds of their target time; ad-hoc jobs start within 500ms
- Consistency -- maintain strong consistency for job metadata; no duplicate executions for the same scheduled instance
Interview Reports from Hello Interview
72 reports from candidates. Most recently asked at LinkedIn in early February 2026.
Also commonly asked at: Robinhood, DoorDash, Meta, Microsoft, Lyft, Airbnb, Netflix.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Separation of Scheduling and Execution Planes
Interviewers want to see separate components for determining when jobs should run versus actually executing them. Mixing these concerns creates scaling bottlenecks.
Hints to consider:
- Use a time-based priority queue or sorted set structure to maintain jobs ordered by their next execution time
- Design worker pools that pull ready jobs rather than having a central dispatcher push work
- Partition scheduling responsibility across multiple scheduler instances using consistent hashing or range sharding
- Consider how scheduler components discover and coordinate without creating single points of failure
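The separation above can be sketched in a few lines. This is a minimal in-process model, not a production design: the `deque` stands in for a durable queue (Kafka, SQS), and the class and method names (`Scheduler`, `promote_due`, `Worker.poll_once`) are illustrative, not from any particular framework. The key point it demonstrates is that the scheduling plane only decides *when* a job is due (min-heap ordered by next run time), while workers *pull* ready work rather than being pushed to.

```python
import heapq
import time
from collections import deque

class Scheduler:
    """Scheduling plane: jobs ordered by next run time in a min-heap.
    Due jobs are promoted onto a ready queue that workers pull from."""

    def __init__(self):
        self._heap = []       # entries are (next_run_ts, job_id)
        self.ready = deque()  # stand-in for a durable queue (Kafka/SQS/etc.)

    def add(self, job_id, run_at):
        heapq.heappush(self._heap, (run_at, job_id))

    def promote_due(self, now):
        """Move every job whose run time has arrived onto the ready queue."""
        promoted = 0
        while self._heap and self._heap[0][0] <= now:
            _, job_id = heapq.heappop(self._heap)
            self.ready.append(job_id)
            promoted += 1
        return promoted

class Worker:
    """Execution plane: pulls ready jobs instead of receiving pushed work."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.completed = []

    def poll_once(self):
        if not self.scheduler.ready:
            return None
        job_id = self.scheduler.ready.popleft()
        self.completed.append(job_id)  # real handler logic would run here
        return job_id

# usage: only the already-due job is promoted and picked up
s = Scheduler()
now = time.time()
s.add("backup", now - 1)     # already due
s.add("report", now + 3600)  # due in an hour
s.promote_due(now)
w = Worker(s)
print(w.poll_once())  # -> backup
```

In a real deployment the heap would live in a sharded store (e.g. a Redis sorted set keyed by timestamp) so multiple scheduler instances can each own a partition of jobs.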
2. Handling Time-Based Traffic Spikes
Cron-based workloads create massive spikes at minute and hour boundaries when thousands of jobs become eligible simultaneously.
Hints to consider:
- Implement jitter in job promotion to spread identical cron schedules across a 5-10 second window
- Use backpressure mechanisms like queue depth monitoring to prevent overwhelming downstream workers
- Design a staged queue architecture with rate-limiting between stages to smooth traffic
- Pre-compute upcoming jobs across multiple promotion workers before the spike hits
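One way to implement the jitter hint is deterministically, by hashing the job identity rather than calling a random generator. This is a sketch under that assumption (the function name `jittered_fire_time` and the 10-second default window are illustrative): hashing `(job_id, scheduled_ts)` keeps each job's offset stable across scheduler restarts, so a retried promotion lands at the same adjusted time and deduplication keys still line up.

```python
import hashlib

def jittered_fire_time(job_id: str, scheduled_ts: int, window_s: int = 10) -> float:
    """Spread jobs that share a cron boundary across a window_s-second window.

    The offset is derived from a hash of (job_id, scheduled_ts), so it is
    deterministic: the same job instance always fires at the same adjusted
    time, even if the scheduler restarts and re-promotes it.
    """
    digest = hashlib.sha256(f"{job_id}:{scheduled_ts}".encode()).digest()
    fraction = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return scheduled_ts + fraction * window_s
```

Two jobs scheduled for the same minute boundary get different (but stable) offsets, flattening the top-of-minute spike into a 10-second ramp.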
3. Delivery Guarantees and Idempotency
The system must handle crashes, network partitions, and timeouts without losing or duplicating work.
Hints to consider:
- Use visibility timeouts or lease-based patterns where jobs become visible again if not completed within the timeout window
- Generate unique execution IDs combining job ID, scheduled timestamp, and attempt number for deduplication
- Design job handlers to be idempotent or provide deduplication keys that workers can check before execution
- Implement heartbeat mechanisms for long-running jobs to extend their lease and prevent premature re-execution
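The execution-ID and lease hints combine naturally. Below is a minimal in-memory sketch (the names `execution_id`, `LeaseQueue`, `claim`, `heartbeat`, and `ack` are assumptions for illustration; a real system would use SQS visibility timeouts, Redis leases, or a database row lock): a claimed job becomes visible again if the worker does not ack or heartbeat within the lease window, which is what gives at-least-once delivery after a worker crash.

```python
import time

def execution_id(job_id: str, scheduled_ts: int, attempt: int) -> str:
    """Unique per (job, scheduled instance, attempt) -- lets workers and the
    history store deduplicate redeliveries of the same instance."""
    return f"{job_id}:{scheduled_ts}:{attempt}"

class LeaseQueue:
    """Visibility-timeout queue: leased jobs reappear if not acked in time."""

    def __init__(self, lease_s: float = 30):
        self.lease_s = lease_s
        self.pending = {}  # exec_id -> payload
        self.leased = {}   # exec_id -> (payload, lease_expiry)

    def enqueue(self, exec_id, payload):
        self.pending[exec_id] = payload

    def claim(self, now=None):
        now = time.time() if now is None else now
        # Reclaim expired leases first, so crashed workers' jobs are retried.
        for eid, (payload, expiry) in list(self.leased.items()):
            if expiry <= now:
                del self.leased[eid]
                self.pending[eid] = payload
        if not self.pending:
            return None
        eid, payload = next(iter(self.pending.items()))
        del self.pending[eid]
        self.leased[eid] = (payload, now + self.lease_s)
        return eid, payload

    def heartbeat(self, exec_id, now=None):
        """Long-running jobs extend their lease to avoid premature retry."""
        now = time.time() if now is None else now
        payload, _ = self.leased[exec_id]
        self.leased[exec_id] = (payload, now + self.lease_s)

    def ack(self, exec_id):
        self.leased.pop(exec_id, None)
```

Because `execution_id` is stable across redeliveries of the same attempt, an idempotent handler (or a dedup table keyed on it) makes the at-least-once redelivery harmless.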
4. Long-Term History Storage and Queryability
Storing an execution record for every run over a year at 10k/sec produces on the order of 300 billion records (10,000/sec x ~31.5M seconds/year), far too many for a single hot table.
Hints to consider:
- Partition history by time buckets (daily or monthly) and job ID to enable efficient range queries
- Use a hot tier (database) for recent history (30-90 days) and cold tier (object storage) for older data
- Design secondary indexes for common query patterns like "find all failed executions for job X in the last week"
- Implement TTL-based lifecycle policies to migrate data between tiers automatically
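A partitioning scheme along these lines can be sketched as two small functions (names, the 64-shard count, and the 90-day hot window are illustrative assumptions, not prescribed values): a daily time bucket plus a hashed job shard forms the partition key, so a "job X in the last week" query touches only seven daily partitions, and a simple age check decides which tier a record belongs in.

```python
import zlib
from datetime import datetime, timezone

def history_partition(job_id: str, finished_at: float, buckets: int = 64) -> str:
    """Partition key: UTC day bucket + stable job shard.

    crc32 is used (rather than Python's salted hash()) so the same job
    always maps to the same shard across processes and restarts.
    """
    day = datetime.fromtimestamp(finished_at, tz=timezone.utc).strftime("%Y%m%d")
    shard = zlib.crc32(job_id.encode()) % buckets
    return f"{day}:{shard:02d}"

def tier_for(finished_at: float, now: float, hot_days: int = 90) -> str:
    """Lifecycle policy: recent records stay in the hot database tier;
    older ones are migrated to cold object storage."""
    age_days = (now - finished_at) / 86400
    return "hot" if age_days <= hot_days else "cold"
```

A background TTL job would walk day buckets older than `hot_days`, bulk-export them to object storage, and drop the hot-tier partition in one operation instead of deleting rows individually.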
5. Worker Lifecycle and Graceful Degradation
Workers must scale up and down while handling in-flight jobs gracefully.
Hints to consider:
- Use a service discovery mechanism for workers to register availability and advertise capacity
- Implement graceful shutdown where workers stop accepting new jobs but complete in-flight work
- Design health checks that detect stuck workers and reclaim their leases for retry
- Support heterogeneous workers with different capabilities for CPU-intensive versus I/O-bound jobs
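The graceful-shutdown hint can be shown with a small threaded sketch (the `GracefulWorker` class and its `draining` flag are hypothetical names for illustration): on shutdown the worker stops waiting for new jobs but finishes everything already queued, which is exactly the "stop accepting, complete in-flight" behavior described above.

```python
import queue
import threading

class GracefulWorker:
    """Worker that drains in-flight work on shutdown instead of dropping it."""

    def __init__(self, jobs: "queue.Queue"):
        self.jobs = jobs
        self.draining = threading.Event()
        self.completed = []

    def run(self):
        while True:
            try:
                job = self.jobs.get(timeout=0.1)
            except queue.Empty:
                if self.draining.is_set():
                    return  # queue drained and shutdown requested: exit cleanly
                continue    # no work yet, keep polling
            self.completed.append(job)  # real handler would execute here
            self.jobs.task_done()

    def shutdown(self):
        # Signal drain mode: the loop exits only once the queue is empty,
        # so any job already claimed or queued still completes.
        self.draining.set()

# usage: three queued jobs all finish despite an immediate shutdown signal
q = queue.Queue()
for j in ("a", "b", "c"):
    q.put(j)
w = GracefulWorker(q)
t = threading.Thread(target=w.run)
t.start()
w.shutdown()
t.join(timeout=5)
print(w.completed)  # -> ['a', 'b', 'c']
```

In a distributed setup the same pattern pairs with lease release: a worker that cannot finish before a deadline stops heartbeating, letting the scheduler reclaim the lease and retry elsewhere.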