Design a large-scale distributed task scheduling system capable of executing over 10,000 jobs per second. The system must support recurring tasks defined with cron-like expressions, one-time tasks scheduled for future execution, and immediate ad-hoc execution. Jobs should include sophisticated retry logic with exponential backoff, configurable timeout policies, and detailed execution tracking. The system must also maintain a complete audit trail of all job executions for regulatory compliance, storing execution history for at least one year while keeping it queryable for debugging and analytics.
This problem tests your ability to build a resilient distributed system that handles high write throughput, manages worker capacity dynamically, and provides strong delivery guarantees. Consider how you would prevent duplicate execution during failures, handle thundering herd problems when thousands of jobs trigger simultaneously at the top of each minute, and architect for both high availability and horizontal scalability as job volume grows.
Job Definition and Registration -- Users must be able to define jobs with cron schedules, one-time future execution, or immediate trigger, along with payload data and execution parameters
Retry and Timeout Configuration -- Each job should support configurable retry attempts, exponential backoff strategies, maximum execution timeouts, and dead-letter handling for permanent failures
Execution History and Observability -- The system must track every execution attempt with timestamps, status transitions, error messages, and execution logs, queryable for up to one year
Concurrency Controls -- Support per-job concurrency limits to prevent resource exhaustion and allow serial execution when order matters
Scalability -- Handle 10,000+ job executions per second with headroom for peak traffic spikes at schedule boundaries
Latency -- Trigger scheduled jobs within 1-2 seconds of their target time under normal load; ad-hoc jobs should start within 500ms
Consistency -- Maintain strong consistency for job metadata and eventual consistency for execution history, with no duplicate executions for the same scheduled instance
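The functional requirements above translate naturally into a job definition model. Below is a minimal sketch in Python; the field names (`cron`, `run_at`, `dead_letter_queue`, and so on) are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RetryPolicy:
    """Configurable retries with capped exponential backoff."""
    max_attempts: int = 3
    base_delay_s: float = 1.0
    backoff_multiplier: float = 2.0
    max_delay_s: float = 300.0

    def delay_for_attempt(self, attempt: int) -> float:
        # base * multiplier^(attempt - 1), capped at max_delay_s.
        return min(
            self.base_delay_s * self.backoff_multiplier ** (attempt - 1),
            self.max_delay_s,
        )


@dataclass
class JobDefinition:
    job_id: str
    payload: dict
    cron: Optional[str] = None       # recurring schedule, e.g. "*/5 * * * *"
    run_at: Optional[float] = None   # one-time future execution (epoch seconds);
                                     # both None means run immediately
    timeout_s: float = 60.0          # maximum execution time before the job is killed
    retry: RetryPolicy = field(default_factory=RetryPolicy)
    dead_letter_queue: str = "dlq"   # destination for permanently failed jobs
    max_concurrency: int = 1         # per-job concurrency limit; 1 forces serial execution
```

Keeping the retry policy as its own object makes it easy to share defaults across jobs while still allowing per-job overrides.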
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see you architect separate components for determining when jobs should run (the scheduler) versus actually executing them (the workers). Mixing these concerns creates scaling bottlenecks and makes failure isolation impossible.
Use a time-based priority queue or sorted set structure to maintain jobs ordered by their next execution time
Design worker pools that pull ready jobs rather than having a central dispatcher push work
Consider how scheduler components discover and coordinate without creating single points of failure
Think about how to partition scheduling responsibility across multiple scheduler instances using consistent hashing or range sharding
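The first two points can be sketched with an in-memory time-ordered queue; in production this would live in something like a Redis sorted set or a database table, but the shape is the same. The class and method names here are illustrative assumptions:

```python
import heapq
import time


class DueJobQueue:
    """Time-based priority queue: jobs ordered by next execution time.

    The scheduler pushes (run_at, job_id) entries; worker pools pull
    only jobs whose run_at has passed, so work is pulled by workers
    rather than pushed by a central dispatcher.
    """

    def __init__(self):
        self._heap = []  # min-heap of (run_at, job_id)

    def schedule(self, job_id: str, run_at: float) -> None:
        heapq.heappush(self._heap, (run_at, job_id))

    def pull_due(self, now: float = None, limit: int = 100) -> list:
        # Pop up to `limit` jobs whose run time has passed, earliest first.
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now and len(due) < limit:
            _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due
```

The `limit` parameter doubles as a crude batch-size control, which matters once multiple workers poll the same queue.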
Cron-based workloads create massive spikes at minute and hour boundaries when thousands of jobs become eligible simultaneously. Interviewers expect you to design for this predictable burst pattern.
Implement jitter in job promotion to spread identical cron schedules across a 5-10 second window
Use backpressure mechanisms like queue depth monitoring to prevent overwhelming downstream workers
Design a staged queue architecture with rate-limiting between stages to smooth traffic
Consider pre-computing and sharding upcoming jobs across multiple promotion workers before the spike hits
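One simple way to implement the jitter idea is to derive a deterministic per-job offset from a hash of the job ID, so the same job always fires at the same point in the window while the population spreads evenly. A minimal sketch, with the function name as an assumption:

```python
import hashlib


def promotion_jitter(job_id: str, window_s: float = 10.0) -> float:
    """Deterministic per-job jitter in [0, window_s).

    Hashing the job ID yields a stable offset, so thousands of jobs
    sharing the same cron boundary are spread across the window
    instead of all firing at the top of the minute.
    """
    digest = hashlib.sha256(job_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return fraction * window_s


# A job due at the top of the minute is promoted at
# scheduled_time + promotion_jitter(job_id) rather than exactly on time.
```

Deterministic jitter has an observability advantage over random jitter: a given job's effective fire time is predictable, which simplifies debugging "why did my job run 7 seconds late?" questions.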
The system must handle crashes, network partitions, and timeouts without losing or duplicating work. Interviewers probe deeply on how you achieve reliable execution semantics.
Use visibility timeouts or lease-based patterns where jobs become visible again if not completed within the timeout window
Generate unique execution IDs combining job ID, scheduled timestamp, and attempt number for deduplication
Design job handlers to be idempotent or provide deduplication keys that workers can check before execution
Implement heartbeat mechanisms for long-running jobs to extend their lease and prevent premature re-execution
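The lease and deduplication ideas above fit together as shown in this in-memory sketch; a real system would store leases in a shared datastore with atomic compare-and-set, and the names here are illustrative assumptions:

```python
import time


def execution_id(job_id: str, scheduled_ts: int, attempt: int) -> str:
    # Unique per scheduled instance and attempt, used for deduplication.
    return f"{job_id}:{scheduled_ts}:{attempt}"


class LeaseTable:
    """Lease-based visibility timeouts: a job claimed by a worker becomes
    invisible to others until its lease expires or is released."""

    def __init__(self, visibility_timeout_s: float = 30.0):
        self.timeout = visibility_timeout_s
        self._leases = {}  # execution_id -> lease expiry (epoch seconds)

    def try_acquire(self, exec_id: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        expiry = self._leases.get(exec_id)
        if expiry is not None and expiry > now:
            return False  # another worker holds a live lease
        self._leases[exec_id] = now + self.timeout
        return True

    def heartbeat(self, exec_id: str, now: float = None) -> None:
        # Long-running jobs extend their lease to prevent re-delivery.
        now = time.time() if now is None else now
        self._leases[exec_id] = now + self.timeout

    def complete(self, exec_id: str) -> None:
        self._leases.pop(exec_id, None)
```

If a worker crashes mid-job, its lease simply expires and the job becomes visible to another worker; combined with idempotent handlers keyed on the execution ID, this yields at-least-once delivery without duplicate effects.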
Storing an execution record for every job at 10,000/sec over a year produces on the order of 300 billion records (10,000/sec × roughly 31.5 million seconds per year). Interviewers want to see you reason about hot versus cold storage tiers and query patterns.
Partition history by time buckets (daily or monthly) and job ID to enable efficient range queries
Use a hot tier (database) for recent history (30-90 days) and a cold tier (object storage) for older data, with a different query-latency SLA for each tier
Design secondary indexes carefully for common query patterns like "find all failed executions for job X in the last week"
Implement TTL-based lifecycle policies to automatically migrate data between tiers and eventually archive to compressed formats
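The time-bucket partitioning and tier routing above can be sketched as two small functions. The key format (`bucket#job_id`) and the 90-day hot window are illustrative assumptions:

```python
from datetime import datetime, timezone


def history_partition(job_id: str, executed_at: float, daily: bool = True) -> str:
    """Partition key combining a time bucket with the job ID.

    Queries like "failed executions for job X in the last week" then
    touch only the handful of daily partitions in that range.
    """
    dt = datetime.fromtimestamp(executed_at, tz=timezone.utc)
    bucket = dt.strftime("%Y-%m-%d") if daily else dt.strftime("%Y-%m")
    return f"{bucket}#{job_id}"


def storage_tier(executed_at: float, now: float, hot_days: int = 90) -> str:
    """Route reads: recent history from the hot database, older
    history from object storage."""
    age_days = (now - executed_at) / 86400
    return "hot" if age_days <= hot_days else "cold"
```

Because the partition key is derived purely from the record itself, the same function can drive both write-time placement and read-time routing, and a TTL job can migrate whole daily partitions to the cold tier at once.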
The system needs a fleet of workers that can scale up and down while handling in-flight jobs gracefully. Interviewers expect you to reason about worker registration, health checking, and shutdown procedures.