Design a distributed job scheduler that manages a pool of GPU resources for model training workloads. Users submit jobs with different priority levels, and the system must efficiently allocate GPUs, schedule execution, handle failures, and ensure fair resource distribution while preventing starvation.
This problem combines elements of resource management, distributed scheduling, fault tolerance, and container orchestration. The focus is on understanding how to build production ML infrastructure that manages expensive GPU resources efficiently and copes reliably with the challenges of long-running training jobs.
Job submission -- users submit training jobs with resource requirements (GPU count, memory, time limit) and priority levels
GPU pool management -- track available GPUs, allocate to jobs, reclaim from completed or failed jobs
Priority-based scheduling -- higher priority jobs get resources first, but prevent complete starvation of low-priority jobs
Job lifecycle management -- track job states (queued, running, completed, failed, retrying)
Checkpoint and recovery -- support checkpointing for long-running jobs and resumption after failures
Resource cleanup -- properly release GPU resources when jobs complete or fail
Fault tolerance -- handle GPU node failures, network partitions, and training script crashes gracefully
Scalability -- manage hundreds of GPU nodes and thousands of queued jobs
Fairness -- prevent high-priority jobs from completely starving low-priority jobs
Observability -- users should be able to monitor job progress, resource utilization, and error logs
Resource efficiency -- maximize GPU utilization while minimizing idle time between jobs
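As a concrete sketch of what a job submission carrying these requirements might look like (all field names here are illustrative assumptions, not a prescribed API):

```python
from dataclasses import dataclass, field
from enum import Enum
import time
import uuid

class Priority(Enum):
    LOW = 0
    NORMAL = 1
    HIGH = 2

@dataclass
class JobRequest:
    """Hypothetical job-submission payload; field names are assumptions."""
    user_id: str
    docker_image: str       # container image with the user's training code
    gpu_count: int          # GPUs required (gang-scheduled atomically)
    memory_gb: int
    time_limit_s: int       # hard timeout for stuck or runaway jobs
    priority: Priority = Priority.NORMAL
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    submitted_at: float = field(default_factory=time.time)
```

The submission timestamp matters later: it is the input to age-based priority boosting and to FIFO tie-breaking within a priority level.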
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see comprehensive thinking about failure modes specific to long-running ML jobs.
Hardware failures: GPU crashes, node failures, network issues
Software errors: out-of-memory, training script bugs, dependency conflicts
Checkpoint strategy: how frequently to save, where to store checkpoints (S3, distributed file system)
Retry logic: exponential backoff, maximum retry attempts, distinguishing transient vs permanent failures
Timeout handling: jobs that run longer than expected, stuck training loops
Partial failure in distributed training: one GPU fails in multi-GPU job
Resource cleanup: ensure GPU memory is released even when jobs crash
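The retry-logic bullet above can be sketched as a small policy function. The failure taxonomy and backoff constants are assumptions; the key ideas are distinguishing transient from permanent failures and using jittered exponential backoff:

```python
import random

# Assumed failure taxonomy: transient failures are worth retrying,
# permanent ones should surface to the user immediately.
TRANSIENT = {"node_lost", "network_timeout", "gpu_xid_error"}
PERMANENT = {"oom", "bad_image", "script_error"}

def next_action(failure_kind, attempt, max_retries=3,
                base_delay_s=30, cap_s=3600):
    """Return ("retry", delay_seconds) or ("fail", None) for a failed job."""
    if failure_kind in PERMANENT or attempt >= max_retries:
        return ("fail", None)
    # Full-jitter exponential backoff, capped to avoid unbounded waits.
    delay = random.uniform(0, min(cap_s, base_delay_s * 2 ** attempt))
    return ("retry", delay)
```

Retrying an OOM without changing the batch size or memory request would just fail again, which is why it is classified as permanent here.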
How to fairly allocate GPUs among competing jobs with different priorities.
Job queue as a priority queue: higher-priority jobs dequeue first
Starvation prevention: age-based priority boost (waiting jobs gradually increase priority)
Preemption decision: should high-priority jobs preempt running low-priority jobs?
Fair share scheduling: each user/team gets quota, prevent single user monopolizing all GPUs
Backfill scheduling: small jobs can run on temporarily available GPUs without delaying larger jobs
Resource fragmentation: how to pack jobs efficiently when GPUs are partially allocated
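A minimal sketch of age-based priority boosting, assuming a linear boost per hour of waiting (a production scheduler would re-key a heap periodically rather than scan linearly, but the effect is the same):

```python
import time

class AgingQueue:
    """Priority queue whose effective priority rises with waiting time.
    Sketch: boost rate and linear aging are assumptions."""

    def __init__(self, boost_per_hour=1.0):
        self.boost_per_hour = boost_per_hour
        self._items = []  # list of (base_priority, enqueued_at, job_id)

    def push(self, priority, job_id, now=None):
        self._items.append((priority, now or time.time(), job_id))

    def pop(self, now=None):
        """Return the job with the highest effective (aged) priority."""
        now = now or time.time()

        def effective(item):
            prio, enqueued_at, _ = item
            return prio + self.boost_per_hour * (now - enqueued_at) / 3600.0

        best = max(self._items, key=effective)  # raises on empty queue
        self._items.remove(best)
        return best[2]
```

With this policy a low-priority job that waits long enough eventually outranks freshly submitted high-priority jobs, which is exactly the starvation-prevention property required.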
How to package jobs and ensure isolation between different users and workloads.
Docker images: users specify dependencies, system manages image registry
Container orchestration: Kubernetes for managing containers across GPU nodes, or custom scheduler
Job isolation: prevent one job from affecting others (resource limits, network isolation)
GPU allocation: NVIDIA Docker runtime for GPU access, device isolation
Security: user authentication, job sandboxing, prevent malicious code from accessing other users' data
How to track available GPUs and allocate them to jobs efficiently.
GPU inventory tracking: which nodes have GPUs, current allocation status, health status
Allocation strategy: pack jobs tightly on fewer nodes vs spread across many nodes
Multi-GPU jobs: ensure all required GPUs are allocated before starting job (gang scheduling)
GPU affinity: place job on node with cached model data to avoid transfer overhead
Oversubscription: accept more queued jobs than there are GPUs available, so the scheduler always has work ready to place as resources free up
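Gang scheduling for multi-GPU jobs can be sketched as an all-or-nothing placement check, where the node ordering implements the pack-vs-spread trade-off mentioned above:

```python
def gang_allocate(free_gpus, gpu_count, pack=True):
    """Atomically place a multi-GPU job (gang scheduling): either all GPUs
    are reserved or none are. free_gpus maps node -> free GPU count.
    pack=True fills the emptiest nodes first (fewest nodes used);
    pack=False spreads across nodes. Sketch only."""
    order = sorted(free_gpus, key=free_gpus.get, reverse=pack)
    plan, remaining = {}, gpu_count
    for node in order:
        if remaining == 0:
            break
        take = min(free_gpus[node], remaining)
        if take:
            plan[node] = take
            remaining -= take
    # None => job cannot be fully placed; caller leaves all GPUs untouched.
    return plan if remaining == 0 else None
```

Returning `None` rather than a partial plan is the essential property: holding some GPUs while waiting for the rest would deadlock against other partially allocated multi-GPU jobs.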
How users interact with the system and monitor their jobs.
Ask about expected scale (number of GPUs, jobs per day), whether to use an existing orchestration platform like Kubernetes or build a custom scheduler, acceptable job startup latency, and how much emphasis to place on error handling versus the happy path. Confirm whether to discuss Docker/container details or assume that infrastructure already exists.
Job Submission API -- REST endpoint for users to submit jobs with resource requirements and priority
Job Queue -- priority queue storing pending jobs, ordered by priority and submission time
Scheduler -- pulls jobs from queue, finds available GPU resources, allocates and starts jobs
GPU Resource Manager -- tracks GPU inventory, allocation state, health monitoring
Container Orchestration Layer -- Kubernetes or custom scheduler managing Docker containers on GPU nodes
Checkpoint Service -- periodic checkpoint storage to S3 or distributed file system
Monitoring Service -- collects metrics, logs, exposes user-facing status API
Job State Store -- database tracking job metadata, state transitions, resource allocation
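The Job State Store's transitions can be guarded with a simple transition table, using the lifecycle states listed earlier (the exact set of legal edges is an assumption):

```python
# Allowed job state transitions; mirrors the lifecycle states above.
TRANSITIONS = {
    "queued":    {"running"},
    "running":   {"completed", "failed"},
    "failed":    {"retrying"},
    "retrying":  {"queued"},      # retried jobs re-enter the queue
    "completed": set(),           # terminal state
}

def advance(state, new_state):
    """Validate a transition before persisting it to the job state store."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Rejecting illegal transitions at write time catches bugs like a crashed worker reporting "running" for a job the scheduler already marked failed and requeued.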
Job arrives in priority queue with required GPU count
Scheduler periodically scans queue and available resources
For highest priority job, check if enough GPUs available
If yes, allocate GPUs (mark as in-use), start container on appropriate node
If no, check if lower-priority jobs can backfill (smaller GPU requirements)
Handle starvation: boost priority of jobs waiting beyond threshold time
Discuss trade-offs between immediate allocation vs batch scheduling, and preemption policies.
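A single pass of the loop above, simplified to a global free-GPU count and a greedy skip-and-backfill policy (a real backfill scheduler would also reserve a future start time for the blocked head-of-queue job so backfill can never delay it):

```python
def schedule_once(pending, free_gpu_total):
    """One scheduling pass. pending is a list of (priority, gpu_count,
    job_id); returns job_ids to start, highest priority first. Jobs that
    do not fit are skipped so smaller jobs can backfill. Sketch only."""
    started, free = [], free_gpu_total
    for prio, gpus, job in sorted(pending, key=lambda x: -x[0]):
        if gpus <= free:
            started.append(job)
            free -= gpus
        # else: skip; leftover GPUs remain available for smaller jobs
    return started
```

Starvation control from the previous step plugs in before this pass: boost the priority of any job that has waited past the threshold, then schedule as usual.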
This is often the focus. Discuss multiple failure scenarios:
GPU node failure: detect via heartbeat, mark GPUs unavailable, requeue job from last checkpoint
Training script crash: retry job with exponential backoff, limit retry attempts
Out of memory: job killed by the OS OOM killer (or the CUDA runtime raises an allocation error); user notified to reduce batch size or request more memory
Timeout: job exceeds time limit, gracefully terminate and notify user
Checkpoint corruption: verify checkpoint integrity with hashing, fallback to earlier checkpoint
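Checkpoint verification with fallback can be sketched as follows; the newest-first manifest of (path, expected digest) pairs is an assumed format:

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def latest_valid_checkpoint(checkpoints):
    """Return the newest checkpoint whose recorded digest matches its bytes,
    falling back to older ones. checkpoints is a newest-first list of
    (path, expected_sha256) pairs. Sketch; manifest format is assumed."""
    for path, expected in checkpoints:
        try:
            if sha256_of(path) == expected:
                return path
        except OSError:
            continue  # missing or unreadable file: try an older checkpoint
    return None  # no valid checkpoint: job must restart from scratch
```

The digests would be written alongside each checkpoint at save time (e.g. in the S3 object metadata), so verification on restore needs no recomputation of the original.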
Use Kubernetes for container orchestration, pod scheduling, resource limits
Custom controller watches job queue and creates pods with GPU requests
Use node selectors for GPU nodes, resource limits for memory/CPU
Be honest: "I have limited Kubernetes experience, but I understand the concepts of job queues, Docker containers, resource allocation, and fault tolerance. Let me explain how I'd design these components."
Focus on fundamental distributed systems principles rather than specific technology
"The interviewer seemed to expect discussion around Kubernetes and container orchestration. I had limited experience with it and chose to be honest rather than speculate. I focused on general distributed systems principles like queuing, containerization, and fault tolerance."
"This variant heavily emphasized error handling over the happy path. The interviewer wanted thoughtful consideration of edge cases and failure modes specific to long-running ML training jobs, which can run for hours or days and consume expensive GPU resources."
"Topics discussed included job queue design, Docker repository for managing container images, fault tolerance strategies for both job and infrastructure failures, and how to allocate GPU resources based on requirements and availability."
"Primary focus areas were error handling scenarios: hardware failures (GPU crashes, node failures), software errors (OOM, training script bugs), network partitions, checkpoint corruption, and timeout handling for stuck jobs."
"Common themes across interviews: understanding of queues and workers, basic container technology even if not Kubernetes-specific, comprehensive failure scenarios and recovery strategies, smart allocation of expensive GPU resources, and honesty over speculation about unfamiliar technologies."
"The question tests both high-level system design skills and practical awareness of the challenges in building production ML infrastructure. It's acceptable to acknowledge knowledge gaps and focus on what you know deeply rather than making up details."