Design a Job Scheduler
System Design
Problem Statement
Design a distributed job scheduler that can handle high-throughput job processing (10,000+ jobs per second), support both scheduled (cron-based) and ad-hoc job execution, include retry mechanisms for failed jobs, and maintain execution history for up to one year.
This tests whether you can separate scheduling from execution, design for high throughput and bursty "top-of-minute" load, and reason about delivery guarantees (at-least-once, idempotency, retries). Also consider pragmatic data modeling for long-term history (hot vs. cold), worker lifecycle management, and strategies to avoid single points of failure.
Key Requirements
Functional
- Flexible scheduling -- users create jobs that run immediately, at a specific future time, or on a recurring schedule using cron expressions
- Ad-hoc execution -- users submit one-off job runs in addition to scheduled executions
- Execution tracking -- users view live job status and detailed execution history (timestamps, attempts, outcome, logs) for up to one year
- Configurable policies -- users set retries with backoff, timeouts, and optional concurrency limits per job
Non-Functional
- Scalability -- handle 10,000+ job executions per second with headroom for peak traffic spikes at schedule boundaries
- Reliability -- guarantee at-least-once execution semantics with no lost jobs even during infrastructure failures
- Latency -- trigger scheduled jobs within 1-2 seconds of their target time; ad-hoc jobs start within 500ms
- Consistency -- maintain strong consistency for job metadata; no duplicate executions for the same scheduled instance
Interview Reports from Hello Interview
72 reports from candidates. Most recently asked at LinkedIn in early February 2026.
Also commonly asked at: Robinhood, DoorDash, Meta, Microsoft, Lyft, Airbnb, Netflix.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Separation of Scheduling and Execution Planes
Interviewers want to see separate components for determining when jobs should run versus actually executing them. Mixing these concerns creates scaling bottlenecks.
Hints to consider:
- Use a time-based priority queue or sorted set structure to maintain jobs ordered by their next execution time
- Design worker pools that pull ready jobs rather than having a central dispatcher push work
- Partition scheduling responsibility across multiple scheduler instances using consistent hashing or range sharding
- Consider how scheduler components discover and coordinate without creating single points of failure
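The separation above can be sketched in a few lines. This is a minimal in-process model, not a production design: the `deque` stands in for a durable queue (Kafka, SQS), and the class and method names (`Scheduler`, `promote_due`, `Worker.poll_once`) are illustrative, not from any particular framework. The key point it demonstrates is that the scheduling plane only decides *when* a job is due (min-heap ordered by next run time), while workers *pull* ready work rather than being pushed to.

```python
import heapq
import time
from collections import deque

class Scheduler:
    """Scheduling plane: jobs ordered by next run time in a min-heap.
    Due jobs are promoted onto a ready queue that workers pull from."""

    def __init__(self):
        self._heap = []       # entries are (next_run_ts, job_id)
        self.ready = deque()  # stand-in for a durable queue (Kafka/SQS/etc.)

    def add(self, job_id, run_at):
        heapq.heappush(self._heap, (run_at, job_id))

    def promote_due(self, now):
        """Move every job whose run time has arrived onto the ready queue."""
        promoted = 0
        while self._heap and self._heap[0][0] <= now:
            _, job_id = heapq.heappop(self._heap)
            self.ready.append(job_id)
            promoted += 1
        return promoted

class Worker:
    """Execution plane: pulls ready jobs instead of receiving pushed work."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.completed = []

    def poll_once(self):
        if not self.scheduler.ready:
            return None
        job_id = self.scheduler.ready.popleft()
        self.completed.append(job_id)  # real handler logic would run here
        return job_id

# usage: only the already-due job is promoted and picked up
s = Scheduler()
now = time.time()
s.add("backup", now - 1)     # already due
s.add("report", now + 3600)  # due in an hour
s.promote_due(now)
w = Worker(s)
print(w.poll_once())  # -> backup
```

In a real deployment the heap would live in a sharded store (e.g. a Redis sorted set keyed by timestamp) so multiple scheduler instances can each own a partition of jobs.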
2. Handling Time-Based Traffic Spikes
Cron-based workloads create massive spikes at minute and hour boundaries when thousands of jobs become eligible simultaneously.
Hints to consider:
- Implement jitter in job promotion to spread identical cron schedules across a 5-10 second window
- Use backpressure mechanisms like queue depth monitoring to prevent overwhelming downstream workers
- Design a staged queue architecture with rate-limiting between stages to smooth traffic
- Pre-compute upcoming jobs across multiple promotion workers before the spike hits
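One way to implement the jitter hint is deterministically, by hashing the job identity rather than calling a random generator. This is a sketch under that assumption (the function name `jittered_fire_time` and the 10-second default window are illustrative): hashing `(job_id, scheduled_ts)` keeps each job's offset stable across scheduler restarts, so a retried promotion lands at the same adjusted time and deduplication keys still line up.

```python
import hashlib

def jittered_fire_time(job_id: str, scheduled_ts: int, window_s: int = 10) -> float:
    """Spread jobs that share a cron boundary across a window_s-second window.

    The offset is derived from a hash of (job_id, scheduled_ts), so it is
    deterministic: the same job instance always fires at the same adjusted
    time, even if the scheduler restarts and re-promotes it.
    """
    digest = hashlib.sha256(f"{job_id}:{scheduled_ts}".encode()).digest()
    fraction = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return scheduled_ts + fraction * window_s
```

Two jobs scheduled for the same minute boundary get different (but stable) offsets, flattening the top-of-minute spike into a 10-second ramp.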
3. Delivery Guarantees and Idempotency
The system must handle crashes, network partitions, and timeouts without losing or duplicating work.
Hints to consider:
- Use visibility timeouts or lease-based patterns where jobs become visible again if not completed within the timeout window
- Generate unique execution IDs combining job ID, scheduled timestamp, and attempt number for deduplication
- Design job handlers to be idempotent or provide deduplication keys that workers can check before execution
- Implement heartbeat mechanisms for long-running jobs to extend their lease and prevent premature re-execution
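The execution-ID and lease hints combine naturally. Below is a minimal in-memory sketch (the names `execution_id`, `LeaseQueue`, `claim`, `heartbeat`, and `ack` are assumptions for illustration; a real system would use SQS visibility timeouts, Redis leases, or a database row lock): a claimed job becomes visible again if the worker does not ack or heartbeat within the lease window, which is what gives at-least-once delivery after a worker crash.

```python
import time

def execution_id(job_id: str, scheduled_ts: int, attempt: int) -> str:
    """Unique per (job, scheduled instance, attempt) -- lets workers and the
    history store deduplicate redeliveries of the same instance."""
    return f"{job_id}:{scheduled_ts}:{attempt}"

class LeaseQueue:
    """Visibility-timeout queue: leased jobs reappear if not acked in time."""

    def __init__(self, lease_s: float = 30):
        self.lease_s = lease_s
        self.pending = {}  # exec_id -> payload
        self.leased = {}   # exec_id -> (payload, lease_expiry)

    def enqueue(self, exec_id, payload):
        self.pending[exec_id] = payload

    def claim(self, now=None):
        now = time.time() if now is None else now
        # Reclaim expired leases first, so crashed workers' jobs are retried.
        for eid, (payload, expiry) in list(self.leased.items()):
            if expiry <= now:
                del self.leased[eid]
                self.pending[eid] = payload
        if not self.pending:
            return None
        eid, payload = next(iter(self.pending.items()))
        del self.pending[eid]
        self.leased[eid] = (payload, now + self.lease_s)
        return eid, payload

    def heartbeat(self, exec_id, now=None):
        """Long-running jobs extend their lease to avoid premature retry."""
        now = time.time() if now is None else now
        payload, _ = self.leased[exec_id]
        self.leased[exec_id] = (payload, now + self.lease_s)

    def ack(self, exec_id):
        self.leased.pop(exec_id, None)
```

Because `execution_id` is stable across redeliveries of the same attempt, an idempotent handler (or a dedup table keyed on it) makes the at-least-once redelivery harmless.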
4. Long-Term History Storage and Queryability
Storing an execution record for every run over a year at 10k/sec produces on the order of 300 billion records (10,000/sec x ~31.5M seconds/year), far too many for a single hot table.
Hints to consider:
- Partition history by time buckets (daily or monthly) and job ID to enable efficient range queries
- Use a hot tier (database) for recent history (30-90 days) and cold tier (object storage) for older data
- Design secondary indexes for common query patterns like "find all failed executions for job X in the last week"
- Implement TTL-based lifecycle policies to migrate data between tiers automatically
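A partitioning scheme along these lines can be sketched as two small functions (names, the 64-shard count, and the 90-day hot window are illustrative assumptions, not prescribed values): a daily time bucket plus a hashed job shard forms the partition key, so a "job X in the last week" query touches only seven daily partitions, and a simple age check decides which tier a record belongs in.

```python
import zlib
from datetime import datetime, timezone

def history_partition(job_id: str, finished_at: float, buckets: int = 64) -> str:
    """Partition key: UTC day bucket + stable job shard.

    crc32 is used (rather than Python's salted hash()) so the same job
    always maps to the same shard across processes and restarts.
    """
    day = datetime.fromtimestamp(finished_at, tz=timezone.utc).strftime("%Y%m%d")
    shard = zlib.crc32(job_id.encode()) % buckets
    return f"{day}:{shard:02d}"

def tier_for(finished_at: float, now: float, hot_days: int = 90) -> str:
    """Lifecycle policy: recent records stay in the hot database tier;
    older ones are migrated to cold object storage."""
    age_days = (now - finished_at) / 86400
    return "hot" if age_days <= hot_days else "cold"
```

A background TTL job would walk day buckets older than `hot_days`, bulk-export them to object storage, and drop the hot-tier partition in one operation instead of deleting rows individually.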
5. Worker Lifecycle and Graceful Degradation
Workers must scale up and down while handling in-flight jobs gracefully.
Hints to consider:
- Use a service discovery mechanism for workers to register availability and advertise capacity
- Implement graceful shutdown where workers stop accepting new jobs but complete in-flight work
- Design health checks that detect stuck workers and reclaim their leases for retry
- Support heterogeneous workers with different capabilities for CPU-intensive versus I/O-bound jobs
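The graceful-shutdown hint can be shown with a small threaded sketch (the `GracefulWorker` class and its `draining` flag are hypothetical names for illustration): on shutdown the worker stops waiting for new jobs but finishes everything already queued, which is exactly the "stop accepting, complete in-flight" behavior described above.

```python
import queue
import threading

class GracefulWorker:
    """Worker that drains in-flight work on shutdown instead of dropping it."""

    def __init__(self, jobs: "queue.Queue"):
        self.jobs = jobs
        self.draining = threading.Event()
        self.completed = []

    def run(self):
        while True:
            try:
                job = self.jobs.get(timeout=0.1)
            except queue.Empty:
                if self.draining.is_set():
                    return  # queue drained and shutdown requested: exit cleanly
                continue    # no work yet, keep polling
            self.completed.append(job)  # real handler would execute here
            self.jobs.task_done()

    def shutdown(self):
        # Signal drain mode: the loop exits only once the queue is empty,
        # so any job already claimed or queued still completes.
        self.draining.set()

# usage: three queued jobs all finish despite an immediate shutdown signal
q = queue.Queue()
for j in ("a", "b", "c"):
    q.put(j)
w = GracefulWorker(q)
t = threading.Thread(target=w.run)
t.start()
w.shutdown()
t.join(timeout=5)
print(w.completed)  # -> ['a', 'b', 'c']
```

In a distributed setup the same pattern pairs with lease release: a worker that cannot finish before a deadline stops heartbeating, letting the scheduler reclaim the lease and retry elsewhere.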