Design an LLM request batching API
Design an HTTP API that exposes a batch processing function for large language model inference. Individual users make single synchronous requests, but internally the system must batch these requests together for efficient GPU processing.
You are provided with a fixed backend function that you cannot modify:
```python
def batchstring(inputs: list[str]) -> list[str]:
    """
    Processes a batch of string inputs and returns string outputs.

    Constraints:
    - Input size: 1-100 strings per batch
    - Output size: 1-100 strings (one per input)
    - Latency: ~100ms per batch (fixed, regardless of batch size within limits)
    - Concurrency: Each GPU instance can only process ONE batch at a time
    """
    pass
```
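Since the real backend is fixed and unavailable locally, a stub that mimics its contract is useful for development. This mock is an assumption (the uppercase transform is a placeholder), not the provided function:

```python
import time

def batchstring(inputs: list[str]) -> list[str]:
    """Mock of the fixed backend: one output per input, ~100ms per batch."""
    if not 1 <= len(inputs) <= 100:
        raise ValueError("batch size must be 1-100")
    time.sleep(0.1)  # simulate the fixed per-batch GPU latency
    return [s.upper() for s in inputs]  # placeholder transformation
```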
The core challenge is: how do you design a service that accepts individual synchronous HTTP requests from users, aggregates them into batches internally, routes batches to available GPU workers, maps responses back to the original requesters, and maintains low latency while maximizing throughput?
Related question: Inference API System Design. That question gives you an existing API and focuses on operational infrastructure — priority queues, rate limiting, and auto-scaling. This question gives you only a bare function and focuses on the core mechanics — how to collect individual HTTP requests into batches and route GPU responses back to the correct waiting connections.
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
The complexity lies in the internal batching mechanism, not user-facing features. Keep functional requirements minimal.
Capacity Estimation:
```
Given:
- Target load: 1,000 RPS
- Batch size: 32 requests per batch
- GPU processing: ~100ms per batch -> 10 batches/sec per GPU

Throughput per GPU = 10 batches/sec x 32 requests/batch = 320 RPS
Raw GPUs needed = 1,000 / 320 = 3.125 -> 4 GPUs
With 70% utilization headroom = 4 / 0.7 ~ 6 GPUs

Latency breakdown:
- Average batching delay: ~16ms (half of the 32ms needed to fill a batch at 1,000 RPS)
- Network overhead: ~10ms
- GPU processing: 100ms
- Total average: ~126ms (under the 200ms target)

Concurrent connections = RPS x avg latency = 1,000 x 0.126s = 126
Well within default OS limits (can support 10K+ with tuning)
```
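The estimate's arithmetic can be double-checked in a few lines (all numbers come from the assumptions above, not measurements):

```python
import math

target_rps = 1_000
batch_size = 32
batches_per_sec_per_gpu = 10  # 1 batch / 100ms GPU latency

gpu_rps = batches_per_sec_per_gpu * batch_size       # 320 RPS per GPU
raw_gpus = math.ceil(target_rps / gpu_rps)           # 4 GPUs
gpus_with_headroom = math.ceil(raw_gpus / 0.7)       # 6 GPUs at 70% utilization

avg_latency_s = 0.016 + 0.010 + 0.100                # batching + network + GPU
concurrent_connections = target_rps * avg_latency_s  # 126 open connections
```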
Based on real interview experiences, these are the areas interviewers probe most deeply:
Response routing is where most candidates fail. The HTTP connection exists between the user and an API server. The GPU worker that processes the batch is a different process entirely. How does the result get back to the correct waiting HTTP connection?
Hints to consider:
- Map each request_id to the pending HTTP connection in memory
- Tag each request with a return_to field identifying the originating API server instance
- Use per-instance Pub/Sub response channels (e.g., responses:api-1, responses:api-2)

Pure size-based batching causes unacceptable latency during low-traffic periods. Interviewers want to see if you understand timeout-based batching and can reason about the timeout value.
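One concrete way to hold the connection open is a per-server map of request_id to a Future: the HTTP handler awaits the future, and the Pub/Sub listener resolves it when the result arrives. A minimal asyncio sketch, with Redis stubbed out by an in-process echo (names like `pending_requests` and `enqueue` are illustrative):

```python
import asyncio
import uuid

pending_requests: dict[str, asyncio.Future] = {}  # request_id -> waiting handler

async def handle_request(input_text: str) -> str:
    """Runs in the HTTP handler: register a future, enqueue, await the result."""
    request_id = str(uuid.uuid4())
    future = asyncio.get_running_loop().create_future()
    pending_requests[request_id] = future
    # In the real system this would LPUSH to the Redis request queue,
    # tagged with return_to = this server's instance id.
    await enqueue({"request_id": request_id, "input": input_text})
    try:
        return await asyncio.wait_for(future, timeout=2.0)  # expiry -> 504
    finally:
        pending_requests.pop(request_id, None)

def on_response(request_id: str, output: str) -> None:
    """Runs in the Pub/Sub listener: route the result back to the waiter."""
    future = pending_requests.get(request_id)
    if future and not future.done():
        future.set_result(output)

async def enqueue(msg: dict) -> None:
    # Stand-in for LPUSH: a background task echoes a result back,
    # playing the role of GPU worker + Pub/Sub channel.
    async def fake_worker():
        await asyncio.sleep(0.01)
        on_response(msg["request_id"], msg["input"].upper())
    asyncio.get_running_loop().create_task(fake_worker())
```

The key property: only the server that owns the connection holds the future, which is why every request must carry a return_to identifier.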
Hints to consider:
- Use a dual trigger: send the batch when it reaches the size limit OR when a timeout elapses, whichever fires first
- Size the timeout against your latency budget (e.g., roughly half of what remains after GPU processing)
Interviewers probe whether you push batches to GPUs or let GPUs pull work, and how you handle GPU failures.
Hints to consider:
- Let GPU workers pull work via BLPOP from a batch queue, which provides atomic claiming with no race conditions
- Scaling out workers is safe because BLPOP is atomic; if one crashes, the others continue

This is a key architectural trade-off. Interviewers want to see you reason about when to use each approach.
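The pull model can be sketched with a blocking in-process queue standing in for Redis BLPOP (a real worker would use redis-py's `blpop`; the substitution keeps the example self-contained, and the uppercase transform is a placeholder for `batchstring`):

```python
import queue
import threading

batch_queue: "queue.Queue[list[str]]" = queue.Queue()  # stand-in for the Redis batch queue
results: list[list[str]] = []

def gpu_worker(stop: threading.Event) -> None:
    """Each worker claims one batch at a time; Queue.get, like BLPOP, is atomic."""
    while not stop.is_set():
        try:
            inputs = batch_queue.get(timeout=0.1)  # blocks like BLPOP with a timeout
        except queue.Empty:
            continue  # no work yet; loop and check the stop flag again
        outputs = [s.upper() for s in inputs]  # placeholder for batchstring(inputs)
        results.append(outputs)                # real code: publish to responses:<api-id>
        batch_queue.task_done()
```

Because each `get` removes the batch atomically, two workers can never claim the same batch, and adding workers requires no coordination.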
Hints to consider:
- Collocated batching (inside each API server) is simpler and needs no cross-process response routing, but each instance batches independently, lowering GPU efficiency
- A separate batching service batches globally across all API servers, improving GPU utilization at the cost of extra hops and operational complexity
Confirm the following with the interviewer before designing:
- Target throughput and latency budget (this solution assumes ~1,000 RPS and a ~200ms end-to-end target)
- Whether the client API must stay synchronous (single request in, single response out)
- How GPU failures should surface to users (retry internally vs. return an error)
This system is largely stateless — no durable database needed. All state is transient and tied to in-flight requests.
```
InferenceRequest:
  request_id: UUID (unique identifier)
  input: string (user's input text)
  timestamp: datetime (arrival time)
  return_to: string (API instance identifier)
  status: enum (pending, processing, completed, failed)

Batch:
  batch_id: UUID
  requests: list of InferenceRequest (1-100 items)
  created_at: datetime
  gpu_id: string (assigned GPU)

InferenceResponse:
  request_id: UUID (maps to original request)
  output: string (processed result)
  latency_ms: int (total processing time)
```
Data locality:
| Data | Storage | Lifetime |
| --- | --- | --- |
| Pending requests | In-memory (API server) | Until response received |
| Request queue | Redis List | Until batched |
| Batch queue | Redis List | Until GPU claims it |
| Response routing | Redis Pub/Sub | Ephemeral |
A single REST endpoint with clear error contracts:
```
POST /api/inference
Content-Type: application/json

Request:
  input: string

Response (200):
  output: string
  request_id: string
  latency_ms: integer

Error Responses:
  400 Bad Request — Invalid input format
  429 Rate Limited — User exceeded quota
  503 Overloaded — Queue full, retry after delay
  504 Timeout — Request exceeded deadline
```
Sketch the core components and walk through the complete request flow:
- API Servers: generate a request_id, store the pending connection in an in-memory map, enqueue the request to Redis, and hold the connection open until a response arrives via Pub/Sub
- Batching Service: accumulates queued requests into batches by size or timeout and pushes formed batches to the batch queue
- GPU Workers: claim batches via BLPOP, execute batchstring(), and publish results to per-API-instance Redis Pub/Sub channels

Request flow in four steps:
Step A — Request Arrival: User sends POST to API Server. The server generates a request_id, stores the HTTP connection in pending_requests[request_id], pushes the request to Redis with a return_to: "api-2" field, and waits.
Step B — Batch Formation: The Batching Service pulls requests from the queue via BLPOP, accumulates them into a batch, triggers when batch.size == 32 OR elapsed_time > 50ms, and pushes the formed batch to the batch queue.
Step C — GPU Processing (Pull-Based): A GPU worker pulls a batch from the queue via BLPOP (atomic, no race conditions), executes batchstring(inputs) (~100ms), and publishes each result to the correct API server's Pub/Sub channel.
Step D — Response Routing: API Server 2 (subscribed to responses:api-2) receives the message, looks up pending_requests[request_id], sends the HTTP response through the stored connection, and cleans up.
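Step B's accumulate-then-flush loop deserves a closer look, since it must watch both triggers at once. A sketch with in-process queues standing in for Redis (`max_batches` exists only to make the example terminate):

```python
import queue
import time

def batching_loop(request_queue, batch_queue,
                  batch_size=32, timeout_ms=50, max_batches=None):
    """Accumulate requests; flush on size OR timeout, whichever fires first."""
    sent = 0
    batch, start = [], None
    while max_batches is None or sent < max_batches:
        # Wait at most until the current batch's timeout deadline.
        wait_s = timeout_ms / 1000
        if start is not None:
            wait_s = max(0.0, start + timeout_ms / 1000 - time.monotonic())
        try:
            item = request_queue.get(timeout=wait_s or 0.001)
            if not batch:
                start = time.monotonic()  # timer starts with the first request
            batch.append(item)
        except queue.Empty:
            pass  # no new request; fall through to check the timeout trigger
        if batch and (len(batch) >= batch_size
                      or time.monotonic() - start >= timeout_ms / 1000):
            batch_queue.put(batch)  # real code: RPUSH to the Redis batch queue
            sent += 1
            batch, start = [], None
```

Note the timer starts at the first request of a batch, not at loop start; otherwise an idle period would cause the next batch to flush immediately.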
Batching strategy:
```python
import time

class BatchingService:
    def __init__(self, batch_size=32, timeout_ms=50):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.current_batch = []
        self.batch_start_time = None  # set when the first request joins the batch

    def elapsed_ms(self):
        return (time.monotonic() - self.batch_start_time) * 1000

    def should_send_batch(self):
        if len(self.current_batch) >= self.batch_size:
            return True  # Size trigger
        if self.batch_start_time and self.elapsed_ms() > self.timeout_ms:
            return True  # Timeout trigger
        return False
```
The timeout value matters most during low traffic. At 1,000 RPS, the size trigger fires at ~32ms. At 100 RPS, the timeout fires at 50ms with ~5 requests. At 10 RPS, the timeout fires at 50ms with 0-1 requests. Set the timeout to ~50% of your remaining latency budget after GPU processing.
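Which trigger fires first at a given traffic level follows directly from the arithmetic; a small helper makes the crossover explicit (the 32/50 defaults match the design above):

```python
timeout_ms = 50
batch_size = 32

def trigger(rps: int) -> tuple[str, float]:
    """Return which trigger fires first at a given request rate, and after how many ms."""
    ms_to_fill = batch_size / rps * 1000  # time to accumulate a full batch
    if ms_to_fill <= timeout_ms:
        return "size", ms_to_fill
    return "timeout", timeout_ms
```

At 1,000 RPS the size trigger wins at 32ms; below ~640 RPS the 50ms timeout always fires first.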
Collocated alternative: For simpler deployments, run the batcher inside each API server. The batcher holds direct references to HTTP connections, eliminating the need for Pub/Sub routing. Trade-off: lower GPU efficiency (independent batching per instance) but significantly simpler architecture. Start here and migrate to a separate batching service when GPU under-utilization becomes a problem.
Collocated vs separate service:

| Approach | Complexity | Latency | GPU Efficiency | Scale Limit |
| --- | --- | --- | --- | --- |
| Collocated | Low | ~120ms | Medium (independent batching) | ~5K RPS |
| Separate Service | High | ~140ms | High (global batching) | 50K+ RPS |