Design a Job Scheduler
Problem Statement
Design a large-scale distributed task scheduling system capable of executing over 10,000 jobs per second. The system must support recurring schedules (using cron-like expressions), one-time future execution, and immediate ad-hoc runs. Jobs should include sophisticated retry logic with exponential backoff, configurable timeout policies, and detailed execution tracking. The system needs to maintain a complete audit trail of all job executions for regulatory compliance, storing execution history for at least one year while keeping it queryable for debugging and analytics.
This problem tests your ability to build a resilient distributed system that handles high write throughput, manages worker capacity dynamically, and provides strong delivery guarantees. Consider how you would prevent duplicate execution during failures, handle thundering herd problems when thousands of jobs trigger simultaneously at the top of each minute, and architect for both high availability and horizontal scalability as job volume grows.
Key Requirements
Functional
- Job Definition and Registration -- Users must be able to define jobs with cron schedules, one-time future execution, or immediate trigger, along with payload data and execution parameters
- Retry and Timeout Configuration -- Each job should support configurable retry attempts, exponential backoff strategies, maximum execution timeouts, and dead-letter handling for permanent failures
- Execution History and Observability -- The system must track every execution attempt with timestamps, status transitions, error messages, and execution logs, queryable for up to one year
- Concurrency Controls -- Support per-job concurrency limits to prevent resource exhaustion and allow serial execution when order matters
Non-Functional
- Scalability -- Handle 10,000+ job executions per second with headroom for peak traffic spikes at schedule boundaries
- Reliability -- Guarantee at-least-once execution semantics with no lost jobs even during infrastructure failures or deployments
- Latency -- Trigger scheduled jobs within 1-2 seconds of their target time under normal load; ad-hoc jobs should start within 500ms
- Consistency -- Maintain strong consistency for job metadata and eventual consistency for execution history; deduplicate so the same scheduled instance never runs twice, even though delivery is at-least-once
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Separation of Scheduling and Execution Planes
Interviewers want to see you architect separate components for determining when jobs should run versus actually executing them. Mixing these concerns creates scaling bottlenecks and makes failure isolation much harder.
Hints to consider:
- Use a time-based priority queue or sorted set structure to maintain jobs ordered by their next execution time (a minimal sketch follows this list)
- Design worker pools that pull ready jobs rather than having a central dispatcher push work
- Consider how scheduler components discover and coordinate without creating single points of failure
- Think about how to partition scheduling responsibility across multiple scheduler instances using consistent hashing or range sharding
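One concrete way to apply the sorted-set hint is a Redis ZSET scored by each job's next execution time, which multiple scheduler instances can poll safely because ZPOPMIN is atomic. The sketch below is a minimal illustration under stated assumptions, not a full design: the key name `jobs:due`, the batch size, and the redis-py client are all assumptions.

```python
import time

import redis  # assumption: redis-py client; any sorted-set store works similarly

r = redis.Redis()

def schedule(job_id: str, run_at: float) -> None:
    # Score the job by its next execution time (epoch seconds).
    r.zadd("jobs:due", {job_id: run_at})

def poll_once(batch: int = 100) -> list[str]:
    """Atomically claim jobs whose due time has passed.

    ZPOPMIN is atomic, so several scheduler instances can poll the same
    set without handing the same job to two workers.
    """
    ready = []
    for _ in range(batch):
        popped = r.zpopmin("jobs:due")
        if not popped:
            break
        job_id, due_at = popped[0]
        if due_at > time.time():
            # Not due yet: put it back and stop scanning this round.
            r.zadd("jobs:due", {job_id: due_at})
            break
        ready.append(job_id.decode())
    return ready
```

For recurring jobs, the scheduler would re-insert the job with its next cron occurrence immediately after claiming the current one, so the schedule never depends on the execution outcome.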
2. Handling Time-Based Traffic Spikes
Cron-based workloads create massive spikes at minute and hour boundaries when thousands of jobs become eligible simultaneously. Interviewers expect you to design for this predictable burst pattern.
Hints to consider:
- Implement jitter in job promotion to spread identical cron schedules across a 5-10 second window (sketched after this list)
- Use backpressure mechanisms like queue depth monitoring to prevent overwhelming downstream workers
- Design a staged queue architecture with rate-limiting between stages to smooth traffic
- Consider pre-computing and sharding upcoming jobs across multiple promotion workers before the spike hits
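A small example of the jitter hint: derive a stable per-job offset by hashing the job ID, so the same job always lands at the same point in the window and the spread survives scheduler restarts. The 10-second window is an illustrative assumption.

```python
import hashlib

JITTER_WINDOW_SECONDS = 10  # assumption: spread identical cron fires over 10s

def jittered_fire_time(job_id: str, scheduled_ts: int) -> float:
    """Shift a job's nominal fire time by a stable, per-job offset.

    Hashing the job ID (rather than drawing a fresh random number each
    tick) keeps the offset consistent across restarts, so a given job
    always fires at the same point inside the window.
    """
    digest = hashlib.sha256(job_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return scheduled_ts + offset * JITTER_WINDOW_SECONDS
```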
3. Delivery Guarantees and Idempotency
The system must handle crashes, network partitions, and timeouts without losing or duplicating work. Interviewers probe deeply on how you achieve reliable execution semantics.
Hints to consider:
- Use visibility timeouts or lease-based patterns where jobs become visible again if not completed within the timeout window
- Generate unique execution IDs combining job ID, scheduled timestamp, and attempt number for deduplication (see the sketch after this list)
- Design job handlers to be idempotent or provide deduplication keys that workers can check before execution
- Implement heartbeat mechanisms for long-running jobs to extend their lease and prevent premature re-execution
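To make these hints concrete, here is a minimal sketch of deterministic execution IDs plus a lease claimed with Redis `SET NX EX`; the key naming, the 60-second lease, and the redis-py usage are assumptions for illustration, not the only way to build this.

```python
import redis  # assumption: redis-py client

r = redis.Redis()

def execution_id(job_id: str, scheduled_ts: int, attempt: int) -> str:
    # Deterministic: any component retrying the same attempt derives the
    # same ID, which is what makes deduplication possible at all.
    return f"{job_id}:{scheduled_ts}:{attempt}"

def try_claim(exec_id: str, lease_seconds: int = 60) -> bool:
    """Claim an execution with a visibility-timeout-style lease.

    SET NX EX is atomic: exactly one worker wins the key, and the TTL
    acts as the lease, so a crashed worker's claim expires on its own
    and the job becomes visible for retry.
    """
    return bool(r.set(f"lease:{exec_id}", "claimed", nx=True, ex=lease_seconds))

def extend_lease(exec_id: str, lease_seconds: int = 60) -> None:
    # Heartbeat for long-running jobs: refresh the TTL before it lapses.
    r.expire(f"lease:{exec_id}", lease_seconds)
```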
4. Long-Term History Storage and Queryability
Storing execution records for every job over a year at 10k/sec produces on the order of 300 billion records (10,000/sec × ~31.5 million seconds per year). Interviewers want to see you reason about hot versus cold storage tiers and query patterns.
Hints to consider:
- Partition history by time buckets (daily or monthly) and job ID to enable efficient range queries (see the sketch after this list)
- Use a hot tier (database) for recent history (30-90 days) and cold tier (object storage) for older data with different query latency SLAs
- Design secondary indexes carefully for common query patterns like "find all failed executions for job X in the last week"
- Implement TTL-based lifecycle policies to automatically migrate data between tiers and eventually archive to compressed formats
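A short sketch of how the partitioning and tiering hints might look in code; the 90-day hot retention, 64 shards, and key format are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timedelta, timezone

HOT_RETENTION_DAYS = 90  # assumption: hot tier holds 90 days, per the hint above

def history_partition(job_id: str, executed_at: datetime, shards: int = 64) -> str:
    """Composite partition key: daily time bucket plus a job-ID shard.

    Daily buckets keep range queries like "job X, last week" confined to
    a handful of partitions; sharding on job ID stops one noisy job from
    hot-spotting a single day's partition.
    """
    day = executed_at.strftime("%Y-%m-%d")
    # Stable hash (unlike Python's salted built-in hash), so the same job
    # maps to the same shard across processes and restarts.
    shard = int.from_bytes(hashlib.sha256(job_id.encode()).digest()[:2], "big") % shards
    return f"{day}:{shard:02d}"

def storage_tier(executed_at: datetime) -> str:
    # Route reads: recent history from the hot database, older history
    # from object storage. executed_at is assumed timezone-aware.
    age = datetime.now(timezone.utc) - executed_at
    return "hot" if age <= timedelta(days=HOT_RETENTION_DAYS) else "cold"
```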
5. Worker Lifecycle and Graceful Degradation
The system needs a fleet of workers that can scale up and down while handling in-flight jobs gracefully. Interviewers expect you to reason about worker registration, health checking, and shutdown procedures.
Hints to consider:
- Use a service discovery mechanism (like ZooKeeper or etcd) for workers to register availability and advertise capacity
- Implement graceful shutdown where workers stop accepting new jobs but complete in-flight work or checkpoint progress (sketched after this list)
- Design health checks that detect stuck workers and reclaim their leases to make jobs available for retry
- Consider supporting heterogeneous workers with different capabilities or resource profiles for CPU-intensive versus I/O-bound jobs
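A minimal sketch of the graceful-shutdown hint: a SIGTERM handler flips a draining flag, the poll loop stops taking new work, and in-flight jobs run to completion before the process deregisters. The in-memory `queue.Queue` stands in for whatever broker the real system uses.

```python
import queue
import signal
import threading
import time

job_queue: "queue.Queue[str]" = queue.Queue()
draining = threading.Event()

def handle_sigterm(signum, frame):
    draining.set()  # phase 1: refuse new work, but do not kill anything

signal.signal(signal.SIGTERM, handle_sigterm)

def worker_loop() -> None:
    while not draining.is_set():
        try:
            # Short poll timeout so the loop notices the draining flag quickly.
            job = job_queue.get(timeout=1)
        except queue.Empty:
            continue
        try:
            print(f"executing {job}")  # stand-in for real job execution
            time.sleep(0.1)
        finally:
            job_queue.task_done()  # ack only after the job actually finished
    # phase 2: in-flight work is done; deregister from service discovery
    # (e.g. remove the worker's ZooKeeper/etcd entry) before exiting.
```

Acknowledging only after completion is what lets the lease/visibility-timeout machinery from section 3 recover jobs from workers that die mid-execution rather than mid-drain.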