Internal Job Scheduler
Product Design
Problem Statement
Your team needs to build a distributed task scheduling platform that enables internal engineering teams to run automated workloads at specified times or intervals. Teams should be able to submit tasks (such as data processing pipelines, report generation, or system maintenance scripts), define when they should execute, monitor their progress, and receive notifications about execution results. The system must handle thousands of concurrent tasks while ensuring reliable execution even when individual components fail.
The platform should support both one-time scheduled tasks and recurring jobs with cron-like expressions. Scale expectations include supporting up to 100,000 active scheduled tasks with peak execution rates of 10,000 tasks per minute. Tasks may have dependencies on other tasks, and the system must handle scenarios where worker nodes crash during task execution.
Key Requirements
Functional
- Task submission and configuration -- Users can create tasks with execution schedules (one-time or recurring), retry policies, timeouts, and priority levels
- Real-time status tracking -- Users can query the current state of any task (pending, running, completed, failed) and view execution history
- Task lifecycle management -- Users can pause, resume, or permanently cancel scheduled tasks before or during execution
- Notification delivery -- System sends alerts via email or webhook when tasks complete successfully or encounter failures
- Dependency management -- Tasks can specify dependencies on other tasks, ensuring proper execution ordering
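The dependency-management requirement reduces to a topological sort of the task dependency graph: a task becomes runnable only once all of its prerequisites have completed. The task names below are hypothetical, and this is a minimal in-memory sketch rather than a distributed implementation:

```python
from collections import deque

def execution_order(deps: dict[str, list[str]]) -> list[str]:
    """Return a valid run order for tasks, where each value lists that
    task's prerequisites. Assumes every task appears as a key.
    Raises ValueError if the dependency graph contains a cycle.
    """
    # Count unmet prerequisites and record reverse edges.
    indegree = {task: 0 for task in deps}
    dependents: dict[str, list[str]] = {task: [] for task in deps}
    for task, prereqs in deps.items():
        for p in prereqs:
            indegree[task] += 1
            dependents[p].append(task)

    # Kahn's algorithm: repeatedly run whatever has no unmet prerequisites.
    ready = deque(t for t, d in indegree.items() if d == 0)
    order: list[str] = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for child in dependents[t]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

# Hypothetical pipeline: the report depends on two upstream jobs.
deps = {"extract": [], "transform": ["extract"], "report": ["transform", "extract"]}
print(execution_order(deps))  # ['extract', 'transform', 'report']
```

The cycle check matters in an interview setting: a scheduler that silently accepts circular dependencies will deadlock at execution time.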
Non-Functional
- Scalability -- Support 100,000 active schedules with 10,000 task executions per minute at peak load
- Reliability -- Guarantee at-least-once execution for all scheduled tasks with 99.9% availability
- Latency -- Trigger tasks within 1 second of their scheduled time for 99th percentile cases
- Consistency -- Ensure no duplicate task executions for one-time jobs while allowing controlled retries for failures
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Task Scheduling and Time Precision
The core challenge involves efficiently tracking up to 100,000 scheduled tasks and triggering each one within a second of its due time without scanning the entire schedule store on every tick. Interviewers want to see how you bridge the gap between storing schedules durably and executing them at the right moment.
Hints to consider:
- Explore time-bucketing strategies where tasks are organized into time windows (e.g., next minute, next hour)
- Consider using priority queues or sorted sets with task execution times as scores
- Discuss tradeoffs between polling approaches versus event-driven architectures
- Address clock skew and time zone handling across distributed nodes
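The priority-queue hint above can be sketched with an in-process min-heap keyed by execution time. A production system would back the same ordering with durable storage, for example a Redis sorted set with the run time as the score (an assumption, not something the problem prescribes), but the dispatch logic is identical:

```python
import heapq

class ScheduleQueue:
    """In-memory sketch of a schedule ordered by execution time."""

    def __init__(self) -> None:
        # Heap entries are (run_at_epoch_seconds, task_id).
        self._heap: list[tuple[float, str]] = []

    def schedule(self, task_id: str, run_at: float) -> None:
        heapq.heappush(self._heap, (run_at, task_id))

    def due_tasks(self, now: float) -> list[str]:
        # Pop only tasks whose run time has arrived: O(log n) per pop,
        # never a full scan of the schedule store.
        due: list[str] = []
        while self._heap and self._heap[0][0] <= now:
            _, task_id = heapq.heappop(self._heap)
            due.append(task_id)
        return due

q = ScheduleQueue()
q.schedule("nightly-report", run_at=100.0)
q.schedule("cache-warm", run_at=50.0)
print(q.due_tasks(now=60.0))   # ['cache-warm']
print(q.due_tasks(now=120.0))  # ['nightly-report']
```

A dispatcher loop would call `due_tasks` once per tick and hand the results to an execution queue, which is where the polling-versus-event-driven tradeoff from the hints comes in.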
2. Distributed Execution and Fault Tolerance
With worker nodes that can fail mid-execution, the system must prevent both task loss and duplicate executions. Interviewers look for strategies to track task ownership and recover from failures gracefully.
Hints to consider:
- Design a heartbeat mechanism where workers periodically signal they're still processing tasks
- Implement task leasing with timeouts so stalled tasks can be reassigned
- Discuss idempotency requirements and how to make tasks safely retryable
- Consider write-ahead logs or task state machines to track execution phases
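The leasing and heartbeat hints combine naturally: a worker acquires a time-bounded lease before running a task, renews it via heartbeats while working, and loses it if it stalls. This is a minimal in-memory sketch, with a plain dict standing in for a database row updated with atomic compare-and-set:

```python
class LeaseTable:
    """Sketch of task leasing with expiry; not a real coordination service."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        # task_id -> (worker_id, lease_expires_at)
        self._leases: dict[str, tuple[str, float]] = {}

    def try_acquire(self, task_id: str, worker_id: str, now: float) -> bool:
        holder = self._leases.get(task_id)
        if holder is not None and holder[1] > now:
            return False  # another worker still holds a live lease
        self._leases[task_id] = (worker_id, now + self.lease_seconds)
        return True

    def heartbeat(self, task_id: str, worker_id: str, now: float) -> bool:
        holder = self._leases.get(task_id)
        if holder is None or holder[0] != worker_id:
            return False  # lease was lost; worker must abandon the task
        self._leases[task_id] = (worker_id, now + self.lease_seconds)
        return True

leases = LeaseTable(lease_seconds=30)
assert leases.try_acquire("job-42", "worker-a", now=0)
assert not leases.try_acquire("job-42", "worker-b", now=10)  # lease still live
assert leases.try_acquire("job-42", "worker-b", now=45)      # worker-a presumed dead
```

Note the failure mode this exposes: worker-a may still be alive but slow, so reassignment after lease expiry gives at-least-once rather than exactly-once execution, which is why the idempotency hint above is essential.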
3. Scale and Load Distribution
As task volumes grow, a single scheduler becomes a bottleneck. Interviewers expect you to partition the scheduling workload and balance execution across worker pools.
Hints to consider:
- Propose sharding strategies for the schedule storage (by task ID, time range, or customer)
- Design a distributed locking mechanism to prevent multiple schedulers from triggering the same task
- Discuss how to handle hotspots when many tasks are scheduled for the same time
- Consider separate queues or worker pools for different task priorities or types
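One way to realize the sharding hint is to derive a stable shard number from a hash of the task ID, so every scheduler instance independently computes the same ownership without any coordination. The shard count and task IDs below are illustrative:

```python
import hashlib

def shard_for(task_id: str, num_shards: int) -> int:
    """Stable shard assignment: every scheduler instance computes the
    same answer, so exactly one instance polls each task's schedule."""
    digest = hashlib.sha256(task_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# A scheduler instance loads only the schedules belonging to its shard.
my_shard = 2
all_tasks = ["job-1", "job-2", "job-3", "job-4"]
owned = [t for t in all_tasks if shard_for(t, 4) == my_shard]
```

Plain modulo hashing reshuffles most assignments when `num_shards` changes, so an interviewer follow-up is to swap in consistent hashing; hashing by task ID also spreads a burst of same-time tasks across shards, which addresses the hotspot hint.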
4. Monitoring and Observability
Production systems need comprehensive visibility into task health, execution patterns, and system performance. Interviewers want to see how you enable operations teams to debug issues.
Hints to consider:
- Store detailed execution logs including start time, end time, exit codes, and error messages
- Provide metrics on queue depths, execution latencies, and failure rates
- Design alerting for stuck tasks, rising error rates, or scheduling lag
- Consider retention policies for historical execution data, which otherwise grows without bound