Design an Offline Batch Inference Service
ML System Design · Must
Problem Statement
Design a distributed batch inference system that allows data scientists and ML engineers to run large-scale model predictions on terabytes of data. The system should accept user-submitted inference jobs that specify a trained model, input data location, and compute requirements. Users need visibility into job progress, automatic recovery from failures, and efficient resource utilization across a cluster of machines.
The platform must handle diverse workloads -- from small experimental runs processing gigabytes to production jobs processing petabytes across thousands of data partitions. Jobs may take hours or days to complete, and the system should gracefully handle machine failures, stragglers, and transient errors without losing progress or requiring full job restarts.
Key Requirements
Functional
- Job Submission API -- users submit inference requests with model artifacts, input data paths, output destinations, and resource specs
- Data Partitioning -- automatically shard large datasets into processable chunks that can be distributed across workers
- Progress Tracking -- provide real-time visibility into job status, completed partitions, and estimated completion time
- Fault Tolerance -- detect and retry failed tasks without reprocessing successfully completed work
- Resource Management -- allocate compute resources based on job requirements and cluster availability
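To make the submission API concrete, here is a minimal sketch of a job spec with server-side validation. All field names (`model_uri`, `max_workers`, etc.) are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass


@dataclass
class InferenceJobSpec:
    """Hypothetical job-submission payload; fields are illustrative."""
    job_id: str
    model_uri: str            # e.g. an object-store path to model artifacts
    input_path: str           # location of the input dataset
    output_path: str          # destination for predictions
    cpu_per_worker: int = 4
    memory_gb_per_worker: int = 16
    gpus_per_worker: int = 0
    max_workers: int = 100

    def validate(self) -> list:
        """Return human-readable validation errors (empty list if valid)."""
        errors = []
        if not self.model_uri:
            errors.append("model_uri is required")
        if not self.input_path:
            errors.append("input_path is required")
        if not self.output_path:
            errors.append("output_path is required")
        if self.max_workers < 1:
            errors.append("max_workers must be >= 1")
        return errors
```

Rejecting malformed specs at submission time keeps bad jobs out of the scheduler queue entirely, which is far cheaper than failing them after workers spin up.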
Non-Functional
- Scalability -- support concurrent execution of hundreds of jobs processing petabytes of data daily
- Reliability -- achieve 99.9% job completion rate with automatic recovery from transient failures
- Efficiency -- maintain >85% cluster utilization while minimizing waste from failed or straggler tasks
- Latency -- complete small jobs (under 100GB) within 30 minutes; large jobs should show measurable progress within 5 minutes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Data Sharding Strategy
How you partition input data determines parallelism, fault tolerance, and recovery efficiency. The system must balance partition size with the overhead of task scheduling and checkpointing.
Hints to consider:
- Consider file-based vs record-based sharding approaches and their impact on skew
- Think about how partition metadata is stored and accessed by workers
- Discuss adaptive partitioning for datasets with non-uniform distribution
- Address how the design changes when input arrives as a stream rather than from batch storage
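As one possible starting point for file-based sharding, the sketch below greedily packs files into partitions of roughly a target byte size, placing largest files first to reduce skew. Oversized files become their own partition; a real system would split those by byte range or row group:

```python
def shard_files(files, target_partition_bytes):
    """Greedily pack (path, size_bytes) pairs into partitions near a
    target size. Files larger than the target become singleton
    partitions; a production system would split them further."""
    partitions, current, current_bytes = [], [], 0
    # Largest-first ordering keeps partition sizes more uniform.
    for path, size in sorted(files, key=lambda f: -f[1]):
        if size >= target_partition_bytes:
            partitions.append([path])
            continue
        if current and current_bytes + size > target_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions
```

Note the trade-off this encodes: smaller partitions mean finer-grained retries and checkpoints, but more scheduling and metadata overhead per partition.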
2. Job Orchestration and Scheduling
The system needs a control plane that assigns work to compute nodes, monitors health, and coordinates retries. This is the brain of the platform.
Hints to consider:
- Compare centralized scheduler vs distributed consensus approaches
- Design task assignment protocol including heartbeats and lease management
- Handle priority queuing for interactive vs batch workloads
- Consider how to prevent resource starvation and ensure fair sharing
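The lease-and-heartbeat protocol above can be sketched as a toy centralized scheduler. This is an in-memory illustration only; the class name and lease semantics are assumptions, and a real control plane would persist leases durably and handle fencing on reassignment:

```python
import time


class LeaseScheduler:
    """Toy centralized scheduler: workers lease tasks and renew via
    heartbeat; tasks whose lease expires return to the pending queue."""

    def __init__(self, tasks, lease_seconds=30):
        self.pending = list(tasks)
        self.leases = {}  # task -> (worker_id, lease_expiry_time)
        self.lease_seconds = lease_seconds

    def acquire(self, worker_id, now=None):
        now = time.time() if now is None else now
        self._reclaim_expired(now)
        if not self.pending:
            return None
        task = self.pending.pop(0)
        self.leases[task] = (worker_id, now + self.lease_seconds)
        return task

    def heartbeat(self, worker_id, task, now=None):
        now = time.time() if now is None else now
        holder, _ = self.leases.get(task, (None, 0))
        if holder != worker_id:
            return False  # lease lost; worker must abandon the task
        self.leases[task] = (worker_id, now + self.lease_seconds)
        return True

    def complete(self, task):
        self.leases.pop(task, None)

    def _reclaim_expired(self, now):
        for task, (_, expiry) in list(self.leases.items()):
            if expiry < now:
                del self.leases[task]
                self.pending.append(task)
```

The key property is that a crashed or partitioned worker's task is reclaimed automatically once its lease lapses, and the stale worker discovers this on its next heartbeat rather than writing duplicate output unchecked.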
3. Progress Tracking and Checkpointing
Users need visibility into long-running jobs, and the system must avoid reprocessing completed work after failures. This requires careful state management.
Hints to consider:
- Design a checkpointing scheme that balances granularity with overhead
- Store partition-level completion status in a durable metadata store
- Provide APIs for querying job progress at different granularities
- Think about tombstones for idempotent writes to output storage
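A partition-level checkpoint table is the core of this design. The sketch below uses SQLite purely as a stand-in for a durable, replicated metadata store; the schema and method names are illustrative assumptions:

```python
import sqlite3


class CheckpointStore:
    """Partition-level completion tracking. SQLite here is a stand-in
    for a durable metadata store; only DONE partitions are skipped on
    job restart."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS partitions ("
            " job_id TEXT, partition_id INTEGER, status TEXT,"
            " PRIMARY KEY (job_id, partition_id))"
        )

    def register(self, job_id, num_partitions):
        # INSERT OR IGNORE makes registration idempotent across restarts.
        self.db.executemany(
            "INSERT OR IGNORE INTO partitions VALUES (?, ?, 'PENDING')",
            [(job_id, i) for i in range(num_partitions)],
        )
        self.db.commit()

    def mark_done(self, job_id, partition_id):
        self.db.execute(
            "UPDATE partitions SET status='DONE'"
            " WHERE job_id=? AND partition_id=?",
            (job_id, partition_id),
        )
        self.db.commit()

    def progress(self, job_id):
        """Return (completed, total) partition counts for a job."""
        done, total = self.db.execute(
            "SELECT SUM(status='DONE'), COUNT(*) FROM partitions"
            " WHERE job_id=?", (job_id,),
        ).fetchone()
        return (done or 0), total
```

Marking a partition DONE only after its output is durably written, combined with idempotent (overwrite-safe) output paths, is what lets a restarted job skip completed work without producing duplicates.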
4. Failure Handling and Retry Logic
The system must distinguish between retryable failures (network timeouts, machine crashes) and permanent failures (malformed data, OOM errors) to avoid infinite retry loops.
Hints to consider:
- Implement exponential backoff with jitter for transient failures
- Set per-task and per-job retry limits with different policies
- Design dead letter queues for permanently failed partitions
- Handle stragglers through speculative execution or timeout policies
- Consider how model loading failures differ from inference failures
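The retryable-vs-permanent distinction and backoff-with-jitter policy can be sketched as follows. `PermanentError` is a hypothetical marker type; in practice the classification would inspect error codes from the data reader, model loader, and inference runtime:

```python
import random
import time


class PermanentError(Exception):
    """Non-retryable failure: malformed input, incompatible model, etc.
    The enclosing partition goes to a dead letter queue instead of
    being retried."""


def run_with_retries(task_fn, max_attempts=5, base_delay=1.0,
                     max_delay=60.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff plus full
    jitter; re-raise PermanentError immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task_fn()
        except PermanentError:
            raise  # caller routes the partition to the dead letter queue
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: uniform over [0, capped exponential delay].
            delay = random.uniform(
                0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            sleep(delay)
```

Jitter matters here because a machine failure kills many tasks at once; without it, their retries arrive in synchronized waves and can overload the metadata store or input storage.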
5. Resource Allocation and Isolation
Efficiently packing jobs onto shared infrastructure while preventing noisy neighbors and resource contention is critical for cost and performance.
Hints to consider:
- Containerization strategies for model and dependency isolation
- CPU, memory, and GPU allocation policies based on job specifications
- Auto-scaling workers based on queue depth and resource utilization
- Bin-packing algorithms for heterogeneous job resource profiles
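As a simplified illustration of the bin-packing point, here is first-fit decreasing over a single resource dimension (e.g. CPU cores). Real schedulers pack across CPU, memory, and GPU jointly, but the one-dimensional version shows the core idea:

```python
def first_fit_decreasing(demands, capacity):
    """Assign 1-D resource demands to nodes of fixed capacity using
    first-fit decreasing; returns the resulting per-node loads."""
    if any(d > capacity for d in demands):
        raise ValueError("a demand exceeds single-node capacity")
    nodes = []  # current load of each allocated node
    for d in sorted(demands, reverse=True):
        for i, load in enumerate(nodes):
            if load + d <= capacity:   # first node with room wins
                nodes[i] = load + d
                break
        else:
            nodes.append(d)            # open a new node
    return nodes
```

Sorting demands largest-first before placement is what keeps fragmentation low: small jobs fill the gaps left by large ones, which is how the cluster sustains high utilization with heterogeneous job profiles.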