Design an LLM request batching API
Design an HTTP API that exposes a batch processing function for large language model inference. Individual users make single synchronous requests, but internally the system must batch these requests together for efficient GPU processing.
You are provided with a fixed backend function that you cannot modify:
```python
def batchstring(inputs: list[str]) -> list[str]:
    """
    Processes a batch of string inputs and returns string outputs.

    Constraints:
    - Input size: 1-100 strings per batch
    - Output size: 1-100 strings (one per input)
    - Latency: ~100ms per batch (fixed, regardless of batch size within limits)
    - Concurrency: Each GPU instance can only process ONE batch at a time
    """
    pass
```
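Since the real backend is fixed and unavailable locally, a stub that mimics its contract is useful for development. This mock is an assumption (the uppercase transform is a placeholder), not the provided function:

```python
import time

def batchstring(inputs: list[str]) -> list[str]:
    """Mock of the fixed backend: one output per input, ~100ms per batch."""
    if not 1 <= len(inputs) <= 100:
        raise ValueError("batch size must be 1-100")
    time.sleep(0.1)  # simulate the fixed per-batch GPU latency
    return [s.upper() for s in inputs]  # placeholder transformation
```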
The core challenge is: how do you design a service that accepts individual synchronous HTTP requests from users, aggregates them into batches internally, routes batches to available GPU workers, maps responses back to the original requesters, and maintains low latency while maximizing throughput?
Related question: Inference API System Design. That question gives you an existing API and focuses on operational infrastructure — priority queues, rate limiting, and auto-scaling. This question gives you only a bare function and focuses on the core mechanics — how to collect individual HTTP requests into batches and route GPU responses back to the correct waiting connections.
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
The complexity lies in the internal batching mechanism, not user-facing features. Keep functional requirements minimal.
Capacity Estimation:
```
Given:
- Target load: 1,000 RPS
- Batch size: 32 requests per batch
- GPU processing: ~100ms per batch -> 10 batches/sec per GPU

Throughput per GPU = 10 batches/sec x 32 requests/batch = 320 RPS
Raw GPUs needed = 1,000 / 320 = 3.125 -> 4 GPUs
With 70% utilization headroom = 4 / 0.7 ~ 6 GPUs

Latency breakdown:
- Average batching delay: ~16ms (half of the 32ms needed to fill a batch at 1,000 RPS)
- Network overhead: ~10ms
- GPU processing: 100ms
- Total average: ~126ms (under the 200ms target)

Concurrent connections = RPS x avg latency = 1,000 x 0.126s = 126
Well within default OS limits (can support 10K+ with tuning)
```
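The estimate's arithmetic can be double-checked in a few lines (all numbers come from the assumptions above, not measurements):

```python
import math

target_rps = 1_000
batch_size = 32
batches_per_sec_per_gpu = 10  # 1 batch / 100ms GPU latency

gpu_rps = batches_per_sec_per_gpu * batch_size       # 320 RPS per GPU
raw_gpus = math.ceil(target_rps / gpu_rps)           # 4 GPUs
gpus_with_headroom = math.ceil(raw_gpus / 0.7)       # 6 GPUs at 70% utilization

avg_latency_s = 0.016 + 0.010 + 0.100                # batching + network + GPU
concurrent_connections = target_rps * avg_latency_s  # 126 open connections
```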
Based on real interview experiences, these are the areas interviewers probe most deeply:
Response routing is where most candidates fail. The HTTP connection exists between the user and an API server. The GPU worker that processes the batch is a different process entirely. How does the result get back to the correct waiting HTTP connection?
Hints to consider:
- Map each request_id to the pending HTTP connection in memory
- Tag each request with a return_to field identifying the originating API server instance
- Use per-instance Pub/Sub response channels (e.g., responses:api-1, responses:api-2)

Pure size-based batching causes unacceptable latency during low-traffic periods. Interviewers want to see if you understand timeout-based batching and can reason about the timeout value.
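One concrete way to hold the connection open is a per-server map of request_id to a Future: the HTTP handler awaits the future, and the Pub/Sub listener resolves it when the result arrives. A minimal asyncio sketch, with Redis stubbed out by an in-process echo (names like `pending_requests` and `enqueue` are illustrative):

```python
import asyncio
import uuid

pending_requests: dict[str, asyncio.Future] = {}  # request_id -> waiting handler

async def handle_request(input_text: str) -> str:
    """Runs in the HTTP handler: register a future, enqueue, await the result."""
    request_id = str(uuid.uuid4())
    future = asyncio.get_running_loop().create_future()
    pending_requests[request_id] = future
    # In the real system this would LPUSH to the Redis request queue,
    # tagged with return_to = this server's instance id.
    await enqueue({"request_id": request_id, "input": input_text})
    try:
        return await asyncio.wait_for(future, timeout=2.0)  # expiry -> 504
    finally:
        pending_requests.pop(request_id, None)

def on_response(request_id: str, output: str) -> None:
    """Runs in the Pub/Sub listener: route the result back to the waiter."""
    future = pending_requests.get(request_id)
    if future and not future.done():
        future.set_result(output)

async def enqueue(msg: dict) -> None:
    # Stand-in for LPUSH: a background task echoes a result back,
    # playing the role of GPU worker + Pub/Sub channel.
    async def fake_worker():
        await asyncio.sleep(0.01)
        on_response(msg["request_id"], msg["input"].upper())
    asyncio.get_running_loop().create_task(fake_worker())
```

The key property: only the server that owns the connection holds the future, which is why every request must carry a return_to identifier.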
Hints to consider:
- Use a dual trigger: send the batch when it reaches the size limit OR when a timeout elapses, whichever fires first
- Size the timeout against your latency budget (e.g., roughly half of what remains after GPU processing)
Interviewers probe whether you push batches to GPUs or let GPUs pull work, and how you handle GPU failures.
Hints to consider:
- Let GPU workers pull work via BLPOP from a batch queue, which provides atomic claiming with no race conditions
- Scaling out workers is safe because BLPOP is atomic; if one crashes, the others continue

This is a key architectural trade-off. Interviewers want to see you reason about when to use each approach.
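The pull model can be sketched with a blocking in-process queue standing in for Redis BLPOP (a real worker would use redis-py's `blpop`; the substitution keeps the example self-contained, and the uppercase transform is a placeholder for `batchstring`):

```python
import queue
import threading

batch_queue: "queue.Queue[list[str]]" = queue.Queue()  # stand-in for the Redis batch queue
results: list[list[str]] = []

def gpu_worker(stop: threading.Event) -> None:
    """Each worker claims one batch at a time; Queue.get, like BLPOP, is atomic."""
    while not stop.is_set():
        try:
            inputs = batch_queue.get(timeout=0.1)  # blocks like BLPOP with a timeout
        except queue.Empty:
            continue  # no work yet; loop and check the stop flag again
        outputs = [s.upper() for s in inputs]  # placeholder for batchstring(inputs)
        results.append(outputs)                # real code: publish to responses:<api-id>
        batch_queue.task_done()
```

Because each `get` removes the batch atomically, two workers can never claim the same batch, and adding workers requires no coordination.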
Hints to consider:
- Collocated batching (inside each API server) is simpler and needs no cross-process response routing, but each instance batches independently, lowering GPU efficiency
- A separate batching service batches globally across all API servers, improving GPU utilization at the cost of extra hops and operational complexity
Confirm the following with the interviewer before designing:
- Target throughput and latency budget (this solution assumes ~1,000 RPS and a ~200ms end-to-end target)
- Whether the client API must stay synchronous (single request in, single response out)
- How GPU failures should surface to users (retry internally vs. return an error)
This system is largely stateless — no durable database needed. All state is transient and tied to in-flight requests.
```
InferenceRequest:
  request_id: UUID (unique identifier)
  input: string (user's input text)
  timestamp: datetime (arrival time)
  return_to: string (API instance identifier)
  status: enum (pending, processing, completed, failed)

Batch:
  batch_id: UUID
  requests: list of InferenceRequest (1-100 items)
  created_at: datetime
  gpu_id: string (assigned GPU)

InferenceResponse:
  request_id: UUID (maps to original request)
  output: string (processed result)
  latency_ms: int (total processing time)
```
Data locality:
| Data | Storage | Lifetime |
| --- | --- | --- |
| Pending requests | In-memory (API server) | Until response received |
| Request queue | Redis List | Until batched |
| Batch queue | Redis List | Until GPU claims it |
| Response routing | Redis Pub/Sub | Ephemeral |
A single REST endpoint with clear error contracts:
```
POST /api/inference
Content-Type: application/json

Request:
  input: string

Response (200):
  output: string
  request_id: string
  latency_ms: integer

Error Responses:
  400 Bad Request — Invalid input format
  429 Rate Limited — User exceeded quota
  503 Overloaded — Queue full, retry after delay
  504 Timeout — Request exceeded deadline
```
Sketch the core components and walk through the complete request flow:
- API Servers: generate a request_id, store the pending connection in an in-memory map, enqueue the request to Redis, and hold the connection open until a response arrives via Pub/Sub
- Batching Service: accumulates queued requests into batches by size or timeout and pushes formed batches to the batch queue
- GPU Workers: claim batches via BLPOP, execute batchstring(), and publish results to per-API-instance Redis Pub/Sub channels

Request flow in four steps:
Step A — Request Arrival: User sends POST to API Server. The server generates a request_id, stores the HTTP connection in pending_requests[request_id], pushes the request to Redis with a return_to: "api-2" field, and waits.
Step B — Batch Formation: The Batching Service pulls requests from the queue via BLPOP, accumulates them into a batch, triggers when batch.size == 32 OR elapsed_time > 50ms, and pushes the formed batch to the batch queue.
Step C — GPU Processing (Pull-Based): A GPU worker pulls a batch from the queue via BLPOP (atomic, no race conditions), executes batchstring(inputs) (~100ms), and publishes each result to the correct API server's Pub/Sub channel.
Step D — Response Routing: API Server 2 (subscribed to responses:api-2) receives the message, looks up pending_requests[request_id], sends the HTTP response through the stored connection, and cleans up.
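Step B's accumulate-then-flush loop deserves a closer look, since it must watch both triggers at once. A sketch with in-process queues standing in for Redis (`max_batches` exists only to make the example terminate):

```python
import queue
import time

def batching_loop(request_queue, batch_queue,
                  batch_size=32, timeout_ms=50, max_batches=None):
    """Accumulate requests; flush on size OR timeout, whichever fires first."""
    sent = 0
    batch, start = [], None
    while max_batches is None or sent < max_batches:
        # Wait at most until the current batch's timeout deadline.
        wait_s = timeout_ms / 1000
        if start is not None:
            wait_s = max(0.0, start + timeout_ms / 1000 - time.monotonic())
        try:
            item = request_queue.get(timeout=wait_s or 0.001)
            if not batch:
                start = time.monotonic()  # timer starts with the first request
            batch.append(item)
        except queue.Empty:
            pass  # no new request; fall through to check the timeout trigger
        if batch and (len(batch) >= batch_size
                      or time.monotonic() - start >= timeout_ms / 1000):
            batch_queue.put(batch)  # real code: RPUSH to the Redis batch queue
            sent += 1
            batch, start = [], None
```

Note the timer starts at the first request of a batch, not at loop start; otherwise an idle period would cause the next batch to flush immediately.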
Batching strategy:
```python
import time

class BatchingService:
    def __init__(self, batch_size=32, timeout_ms=50):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.current_batch = []
        self.batch_start_time = None  # set when the first request joins the batch

    def elapsed_ms(self):
        return (time.monotonic() - self.batch_start_time) * 1000

    def should_send_batch(self):
        if len(self.current_batch) >= self.batch_size:
            return True  # Size trigger
        if self.batch_start_time and self.elapsed_ms() > self.timeout_ms:
            return True  # Timeout trigger
        return False
```
The timeout value matters most during low traffic. At 1,000 RPS, the size trigger fires at ~32ms. At 100 RPS, the timeout fires at 50ms with ~5 requests. At 10 RPS, the timeout fires at 50ms with 0-1 requests. Set the timeout to ~50% of your remaining latency budget after GPU processing.
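Which trigger fires first at a given traffic level follows directly from the arithmetic; a small helper makes the crossover explicit (the 32/50 defaults match the design above):

```python
timeout_ms = 50
batch_size = 32

def trigger(rps: int) -> tuple[str, float]:
    """Return which trigger fires first at a given request rate, and after how many ms."""
    ms_to_fill = batch_size / rps * 1000  # time to accumulate a full batch
    if ms_to_fill <= timeout_ms:
        return "size", ms_to_fill
    return "timeout", timeout_ms
```

At 1,000 RPS the size trigger wins at 32ms; below ~640 RPS the 50ms timeout always fires first.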
Collocated alternative: For simpler deployments, run the batcher inside each API server. The batcher holds direct references to HTTP connections, eliminating the need for Pub/Sub routing. Trade-off: lower GPU efficiency (independent batching per instance) but significantly simpler architecture. Start here and migrate to a separate batching service when GPU under-utilization becomes a problem.
Collocated vs separate service:

| Approach | Complexity | Latency | GPU Efficiency | Scale Limit |
| --- | --- | --- | --- | --- |
| Collocated | Low | ~120ms | Medium (independent batching) | ~5K RPS |
| Separate Service | High | ~140ms | High (global batching) | 50K+ RPS |