Design an Offline Batch Inference Service
ML System Design · Must
Problem Statement
Design a distributed batch inference system that allows data scientists and ML engineers to run large-scale model predictions on terabytes of data. The system should accept user-submitted inference jobs that specify a trained model, input data location, and compute requirements. Users need visibility into job progress, automatic recovery from failures, and efficient resource utilization across a cluster of machines.
The platform must handle diverse workloads -- from small experimental runs processing gigabytes to production jobs processing petabytes across thousands of data partitions. Jobs may take hours or days to complete, and the system should gracefully handle machine failures, stragglers, and transient errors without losing progress or requiring full job restarts.
Key Requirements
Functional
- Job Submission API -- users submit inference requests with model artifacts, input data paths, output destinations, and resource specs
- Data Partitioning -- automatically shard large datasets into processable chunks that can be distributed across workers
- Progress Tracking -- provide real-time visibility into job status, completed partitions, and estimated completion time
- Fault Tolerance -- detect and retry failed tasks without reprocessing successfully completed work
- Resource Management -- allocate compute resources based on job requirements and cluster availability
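To make the submission API concrete, here is a minimal sketch of a job spec with server-side validation. All field names (`model_uri`, `max_workers`, etc.) are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass


@dataclass
class InferenceJobSpec:
    """Hypothetical job-submission payload; fields are illustrative."""
    job_id: str
    model_uri: str            # e.g. an object-store path to model artifacts
    input_path: str           # location of the input dataset
    output_path: str          # destination for predictions
    cpu_per_worker: int = 4
    memory_gb_per_worker: int = 16
    gpus_per_worker: int = 0
    max_workers: int = 100

    def validate(self) -> list:
        """Return human-readable validation errors (empty list if valid)."""
        errors = []
        if not self.model_uri:
            errors.append("model_uri is required")
        if not self.input_path:
            errors.append("input_path is required")
        if not self.output_path:
            errors.append("output_path is required")
        if self.max_workers < 1:
            errors.append("max_workers must be >= 1")
        return errors
```

Rejecting malformed specs at submission time keeps bad jobs out of the scheduler queue entirely, which is far cheaper than failing them after workers spin up.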
Non-Functional
- Scalability -- support concurrent execution of hundreds of jobs processing petabytes of data daily
- Reliability -- achieve 99.9% job completion rate with automatic recovery from transient failures
- Efficiency -- maintain >85% cluster utilization while minimizing waste from failed or straggler tasks
- Latency -- complete small jobs (under 100GB) within 30 minutes; large jobs should show measurable progress within 5 minutes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Data Sharding Strategy
How you partition input data determines parallelism, fault tolerance, and recovery efficiency. The system must balance partition size with the overhead of task scheduling and checkpointing.
Hints to consider:
- Consider file-based vs record-based sharding approaches and their impact on skew
- Think about how partition metadata is stored and accessed by workers
- Discuss adaptive partitioning for datasets with non-uniform distribution
- Address how the design changes when input arrives as a stream rather than from batch storage
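As one possible starting point for file-based sharding, the sketch below greedily packs files into partitions of roughly a target byte size, placing largest files first to reduce skew. Oversized files become their own partition; a real system would split those by byte range or row group:

```python
def shard_files(files, target_partition_bytes):
    """Greedily pack (path, size_bytes) pairs into partitions near a
    target size. Files larger than the target become singleton
    partitions; a production system would split them further."""
    partitions, current, current_bytes = [], [], 0
    # Largest-first ordering keeps partition sizes more uniform.
    for path, size in sorted(files, key=lambda f: -f[1]):
        if size >= target_partition_bytes:
            partitions.append([path])
            continue
        if current and current_bytes + size > target_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions
```

Note the trade-off this encodes: smaller partitions mean finer-grained retries and checkpoints, but more scheduling and metadata overhead per partition.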
2. Job Orchestration and Scheduling
The system needs a control plane that assigns work to compute nodes, monitors health, and coordinates retries. This is the brain of the platform.
Hints to consider:
- Compare centralized scheduler vs distributed consensus approaches
- Design task assignment protocol including heartbeats and lease management
- Handle priority queuing for interactive vs batch workloads
- Consider how to prevent resource starvation and ensure fair sharing
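The lease-and-heartbeat protocol above can be sketched as a toy centralized scheduler. This is an in-memory illustration only; the class name and lease semantics are assumptions, and a real control plane would persist leases durably and handle fencing on reassignment:

```python
import time


class LeaseScheduler:
    """Toy centralized scheduler: workers lease tasks and renew via
    heartbeat; tasks whose lease expires return to the pending queue."""

    def __init__(self, tasks, lease_seconds=30):
        self.pending = list(tasks)
        self.leases = {}  # task -> (worker_id, lease_expiry_time)
        self.lease_seconds = lease_seconds

    def acquire(self, worker_id, now=None):
        now = time.time() if now is None else now
        self._reclaim_expired(now)
        if not self.pending:
            return None
        task = self.pending.pop(0)
        self.leases[task] = (worker_id, now + self.lease_seconds)
        return task

    def heartbeat(self, worker_id, task, now=None):
        now = time.time() if now is None else now
        holder, _ = self.leases.get(task, (None, 0))
        if holder != worker_id:
            return False  # lease lost; worker must abandon the task
        self.leases[task] = (worker_id, now + self.lease_seconds)
        return True

    def complete(self, task):
        self.leases.pop(task, None)

    def _reclaim_expired(self, now):
        for task, (_, expiry) in list(self.leases.items()):
            if expiry < now:
                del self.leases[task]
                self.pending.append(task)
```

The key property is that a crashed or partitioned worker's task is reclaimed automatically once its lease lapses, and the stale worker discovers this on its next heartbeat rather than writing duplicate output unchecked.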
3. Progress Tracking and Checkpointing
Users need visibility into long-running jobs, and the system must avoid reprocessing completed work after failures. This requires careful state management.
Hints to consider:
- Design a checkpointing scheme that balances granularity with overhead
- Store partition-level completion status in a durable metadata store
- Provide APIs for querying job progress at different granularities
- Think about tombstones for idempotent writes to output storage
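A partition-level checkpoint table is the core of this design. The sketch below uses SQLite purely as a stand-in for a durable, replicated metadata store; the schema and method names are illustrative assumptions:

```python
import sqlite3


class CheckpointStore:
    """Partition-level completion tracking. SQLite here is a stand-in
    for a durable metadata store; only DONE partitions are skipped on
    job restart."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS partitions ("
            " job_id TEXT, partition_id INTEGER, status TEXT,"
            " PRIMARY KEY (job_id, partition_id))"
        )

    def register(self, job_id, num_partitions):
        # INSERT OR IGNORE makes registration idempotent across restarts.
        self.db.executemany(
            "INSERT OR IGNORE INTO partitions VALUES (?, ?, 'PENDING')",
            [(job_id, i) for i in range(num_partitions)],
        )
        self.db.commit()

    def mark_done(self, job_id, partition_id):
        self.db.execute(
            "UPDATE partitions SET status='DONE'"
            " WHERE job_id=? AND partition_id=?",
            (job_id, partition_id),
        )
        self.db.commit()

    def progress(self, job_id):
        """Return (completed, total) partition counts for a job."""
        done, total = self.db.execute(
            "SELECT SUM(status='DONE'), COUNT(*) FROM partitions"
            " WHERE job_id=?", (job_id,),
        ).fetchone()
        return (done or 0), total
```

Marking a partition DONE only after its output is durably written, combined with idempotent (overwrite-safe) output paths, is what lets a restarted job skip completed work without producing duplicates.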
4. Failure Handling and Retry Logic
The system must distinguish between retryable failures (network timeouts, machine crashes) and permanent failures (malformed data, OOM errors) to avoid infinite retry loops.
Hints to consider:
- Implement exponential backoff with jitter for transient failures
- Set per-task and per-job retry limits with different policies
- Design dead letter queues for permanently failed partitions
- Handle stragglers through speculative execution or timeout policies
- Consider how model loading failures differ from inference failures
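The retryable-vs-permanent distinction and backoff-with-jitter policy can be sketched as follows. `PermanentError` is a hypothetical marker type; in practice the classification would inspect error codes from the data reader, model loader, and inference runtime:

```python
import random
import time


class PermanentError(Exception):
    """Non-retryable failure: malformed input, incompatible model, etc.
    The enclosing partition goes to a dead letter queue instead of
    being retried."""


def run_with_retries(task_fn, max_attempts=5, base_delay=1.0,
                     max_delay=60.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff plus full
    jitter; re-raise PermanentError immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task_fn()
        except PermanentError:
            raise  # caller routes the partition to the dead letter queue
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: uniform over [0, capped exponential delay].
            delay = random.uniform(
                0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            sleep(delay)
```

Jitter matters here because a machine failure kills many tasks at once; without it, their retries arrive in synchronized waves and can overload the metadata store or input storage.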
5. Resource Allocation and Isolation
Efficiently packing jobs onto shared infrastructure while preventing noisy neighbors and resource contention is critical for cost and performance.
Hints to consider:
- Containerization strategies for model and dependency isolation
- CPU, memory, and GPU allocation policies based on job specifications
- Auto-scaling workers based on queue depth and resource utilization
- Bin-packing algorithms for heterogeneous job resource profiles
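As a simplified illustration of the bin-packing point, here is first-fit decreasing over a single resource dimension (e.g. CPU cores). Real schedulers pack across CPU, memory, and GPU jointly, but the one-dimensional version shows the core idea:

```python
def first_fit_decreasing(demands, capacity):
    """Assign 1-D resource demands to nodes of fixed capacity using
    first-fit decreasing; returns the resulting per-node loads."""
    if any(d > capacity for d in demands):
        raise ValueError("a demand exceeds single-node capacity")
    nodes = []  # current load of each allocated node
    for d in sorted(demands, reverse=True):
        for i, load in enumerate(nodes):
            if load + d <= capacity:   # first node with room wins
                nodes[i] = load + d
                break
        else:
            nodes.append(d)            # open a new node
    return nodes
```

Sorting demands largest-first before placement is what keeps fragmentation low: small jobs fill the gaps left by large ones, which is how the cluster sustains high utilization with heterogeneous job profiles.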