Design a high-concurrency inference API
Problem Statement
Design a high-concurrency inference API system that can handle massive concurrent requests efficiently. The inference API endpoint is provided and cannot be modified — your focus is on designing the surrounding infrastructure, particularly the batch service that manages requests to GPU workers.
The system must accept synchronous HTTP requests from clients who wait for responses, but internal processing can be asynchronous through queues. Despite this internal asynchrony, responses must feel near real-time (typically under 500ms to 1s). The core challenge is maximizing GPU utilization through intelligent request batching while maintaining predictable, low latency under high concurrency.
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
Given Constraints:
- Fixed inference API: You are given an inference API that you cannot modify
- Client-facing synchronous requests: Clients make synchronous HTTP requests and wait for responses, but internal processing can be asynchronous through queues
- Near real-time latency requirement: Despite internal asynchronous processing, responses must be fast enough to feel near real-time (typically under 500ms to 1s)
- Traffic requirements: Must handle high concurrent requests with predictable latency
- GPU resources: Limited GPU resources that need efficient utilization
Key Requirements
Functional
- Submit inference requests -- Users can submit prompts and receive model responses through a synchronous HTTP API
- Request batching -- Aggregate individual requests into batches for efficient GPU processing, balancing throughput against latency
- Multi-tier support -- Handle requests from different user tiers (free, paid, enterprise) with different priorities and rate limits
- Rate limiting -- Protect the system from overload with tiered rate limits that adjust dynamically based on available capacity
Non-Functional
- Scalability -- Handle 1,000 RPS initially, plan for 10,000+ RPS with 3x peak bursts
- Latency -- P95 under 500ms end-to-end (including queue wait and inference time)
- Availability -- 99.9% uptime with graceful degradation under load; shed lower-priority traffic before failing higher-priority requests
- Cost efficiency -- Maximize GPU utilization (target 70-80%) while maintaining the latency SLA; avoid over-provisioning while retaining buffer for spikes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Request Batching Strategy
The batcher is the most critical component — it determines the fundamental tradeoff between GPU throughput and request latency. Interviewers want to see if you understand timeout-based batching and can reason about the tradeoffs.
Hints to consider:
- Use a dual-trigger batching strategy: send a batch when it reaches the target size (e.g., 32 requests) OR when a timeout expires (e.g., 40ms), whichever comes first
- During high traffic, batches fill quickly and GPU utilization is maximized; during low traffic, the timeout ensures requests are not delayed indefinitely
- Pull from priority queues in order (enterprise first, then paid, then free) so high-priority requests get batched first
- Use atomic queue operations (RPOPLPUSH) to move requests into an in-flight list for crash recovery; if the batcher dies, a reaper process can return stale items to the queue
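The dual-trigger loop above can be sketched in a few lines of Python. This is a simplified stand-in, not the real batcher: plain deques replace the per-tier Redis lists, and the batch size and timeout are the illustrative numbers from the hints:

```python
import time
from collections import deque

MAX_BATCH = 32       # target batch size from the hints above
MAX_WAIT_S = 0.040   # 40 ms dual-trigger timeout

# Plain deques stand in for the per-tier Redis lists.
queues = {"enterprise": deque(), "paid": deque(), "free": deque()}

def collect_batch(queues, max_batch=MAX_BATCH, max_wait_s=MAX_WAIT_S):
    """Dual-trigger batching: return when the batch is full OR the timeout fires."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch and time.monotonic() < deadline:
        for tier in ("enterprise", "paid", "free"):   # strict priority order
            while queues[tier] and len(batch) < max_batch:
                # In Redis this pop would be RPOPLPUSH into an in-flight list.
                batch.append(queues[tier].popleft())
        if len(batch) < max_batch:
            time.sleep(0.001)   # queues drained; wait briefly instead of busy-spinning
    return batch
```

Under load the function returns immediately with a full batch of 32; when traffic is sparse it returns a partial batch after 40 ms, which is exactly the throughput-versus-latency tradeoff the interviewer wants you to articulate.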
2. Response Routing Architecture
Clients hold open HTTP connections waiting for responses, but the batcher and GPU workers process requests asynchronously. Interviewers probe how you route responses back to the correct waiting client connection.
Hints to consider:
- The API Gateway keeps the HTTP connection open using async I/O while storing a mapping of request_id to the gateway instance and response channel in Redis, with a TTL matching the request timeout
- After GPU processing, the batcher publishes results to Redis Pub/Sub; only the owning gateway instance picks up the response and resolves the pending connection
- This decouples request ingestion from response delivery and allows horizontal scaling of gateway instances
- Consider what happens when a gateway instance crashes while holding open connections — the TTL on the Redis mapping provides automatic cleanup
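A minimal sketch of the gateway side, assuming asyncio: each pending connection parks on a Future keyed by request_id, and the Pub/Sub subscriber resolves only the futures this instance owns. The Redis mapping and actual Pub/Sub wiring are elided; `demo` just simulates the round trip in-process:

```python
import asyncio

# request_id -> Future; each gateway instance tracks only its own pending connections.
# In production the request_id -> gateway-instance mapping also goes to Redis with a TTL.
pending: dict[str, asyncio.Future] = {}

async def handle_request(request_id: str, timeout_s: float = 1.0) -> str:
    """Enqueue the request, then park the open HTTP connection on a Future."""
    fut = asyncio.get_running_loop().create_future()
    pending[request_id] = fut
    # ... LPUSH the request onto its tier queue here ...
    try:
        return await asyncio.wait_for(fut, timeout=timeout_s)
    finally:
        pending.pop(request_id, None)   # mirrors the Redis TTL cleanup on crash/timeout

def on_result(request_id: str, result: str) -> None:
    """Pub/Sub callback: runs when the batcher publishes a finished result."""
    fut = pending.get(request_id)
    if fut is not None and not fut.done():   # ignore results owned by other gateways
        fut.set_result(result)

async def demo() -> str:
    task = asyncio.create_task(handle_request("req_abc123"))
    await asyncio.sleep(0.01)         # batching + inference happen elsewhere
    on_result("req_abc123", "model output")
    return await task
```

The `finally` cleanup is the key detail: whether the request completes, times out, or the client disconnects, the mapping disappears, which is the in-process mirror of the Redis TTL behavior described above.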
3. Dynamic Rate Limiting and Capacity Management
Static rate limits break when GPU capacity changes (e.g., hardware failures, scaling events). Interviewers want to see if you can design a feedback loop between GPU health and rate limiting.
Hints to consider:
- Calculate available capacity dynamically: active_gpus * rps_per_gpu * target_utilization
- Implement tiered shedding based on queue depth: under normal load accept all traffic; under moderate load shed free tier; under critical load accept only enterprise
- GPU capacity changes must propagate to rate limiters within seconds, not minutes — use Redis Pub/Sub or a metrics push system
- Consider the 70% utilization target as a deliberate tradeoff: higher cost but absorbs traffic spikes while auto-scaling provisions new GPUs
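The capacity formula and the tiered shedding policy fit in a small sketch. The function names and queue-depth thresholds here are illustrative choices, not prescribed values:

```python
def available_capacity(active_gpus: int, rps_per_gpu: int = 426,
                       target_utilization: float = 0.7) -> float:
    """Capacity the rate limiters should advertise; recompute whenever a GPU
    joins or leaves the pool and push the new value out via Pub/Sub."""
    return active_gpus * rps_per_gpu * target_utilization

def accepted_tiers(queue_depth: int, normal: int = 1000, critical: int = 5000) -> set:
    """Tiered shedding driven by total queue depth (thresholds illustrative)."""
    if queue_depth < normal:
        return {"enterprise", "paid", "free"}   # normal load: accept everything
    if queue_depth < critical:
        return {"enterprise", "paid"}           # moderate load: shed free tier
    return {"enterprise"}                       # critical load: enterprise only
```

Because `available_capacity` is a pure function of GPU pool state, any component that subscribes to pool-change events can recompute it locally within seconds of a failure or scaling event.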
4. GPU Utilization and Auto-Scaling
GPUs are expensive and slow to provision. Interviewers test whether you can balance cost efficiency against latency guarantees and handle the provisioning gap during traffic spikes.
Hints to consider:
- Running at 95% utilization leaves no buffer for spikes; 70-80% provides a 20-30% absorption buffer while auto-scaling triggers
- Scale up aggressively (2 minutes of high utilization), scale down conservatively (10 minutes of low utilization) to avoid flapping
- GPU provisioning takes 1-5 minutes (instance launch plus model loading); your buffer must absorb traffic during this window
- Use multiple scaling signals: queue depth, GPU utilization, and P95 latency — any single metric can miss certain failure modes
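One possible shape for the scaling controller, combining the three signals with asymmetric up/down thresholds; all constants are the illustrative values from the hints, and the function name is hypothetical:

```python
def scaling_decision(high_util_s: int, low_util_s: int,
                     queue_depth: int, p95_latency_ms: float,
                     scale_up_after_s: int = 120,    # 2 min of sustained high utilization
                     scale_down_after_s: int = 600,  # 10 min of sustained low utilization
                     queue_limit: int = 1000,
                     latency_sla_ms: float = 500.0) -> str:
    """Asymmetric hysteresis: scale up aggressively, scale down conservatively.
    Any one of the three signals can force a scale-up on its own."""
    if (high_util_s >= scale_up_after_s
            or queue_depth > queue_limit
            or p95_latency_ms > latency_sla_ms):
        return "scale_up"
    if low_util_s >= scale_down_after_s:
        return "scale_down"
    return "hold"
```

Note that scale-up is an OR over the signals while scale-down requires sustained low utilization: a growing queue or an SLA breach should never be masked by a healthy-looking utilization number.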
Suggested Approach
Step 1: Clarify Requirements and Estimate Capacity
Confirm the following with the interviewer before designing:
- What is the expected request rate, and how spiky is the traffic?
- What are the GPU specs — how long does inference take per batch, and what batch sizes are supported?
- Are there different user tiers, and should they receive different quality of service?
- Is the inference API truly fixed, or can we influence batch sizes and parameters?
- What is the acceptable end-to-end latency, and is it the same for all tiers?
Then perform capacity estimation to ground your design in concrete numbers:
```
Traffic:
- Target: 1,000 RPS (plan for 10x growth to 10,000 RPS)
- Peak: 3x average (3,000 RPS at peak)
- Distribution: 70% free tier, 25% paid, 5% enterprise
GPU Throughput (given):
- Batch size: 32 requests
- Inference time per batch: 50ms
- Batching delay (average): ~20ms
- Data transfer overhead: ~5ms
Total time per batch: 50ms + 20ms + 5ms = 75ms
Batches per GPU per second: 1000ms / 75ms = 13.3
Requests per GPU per second: 13.3 x 32 = 426 RPS
GPUs needed (raw): 1000 / 426 = 2.35
GPUs needed (at 70% utilization): 2.35 / 0.7 = 3.4, round up to 4 GPUs
For 10,000 RPS: 10,000 / 426 = 23.5 raw; 23.5 / 0.7 = 33.6, round up to 34 GPUs
For 3x peak on that target (30,000 RPS): ~100 GPUs
```
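The same arithmetic as a quick script, handy for re-running the estimate with different batch sizes or utilization targets (rounding up with ceil, so the 3x-peak figure lands at 101 rather than the rounded ~100 above):

```python
import math

BATCH_SIZE = 32
BATCH_TIME_MS = 50 + 20 + 5                        # inference + batching delay + transfer
RPS_PER_GPU = 1000 / BATCH_TIME_MS * BATCH_SIZE    # ~426 requests/s per GPU

def gpus_needed(target_rps: float, utilization: float = 0.7) -> int:
    """Round up: you cannot provision a fraction of a GPU."""
    return math.ceil(target_rps / RPS_PER_GPU / utilization)
```

`gpus_needed(1_000)` gives 4 and `gpus_needed(10_000)` gives 34, matching the hand estimate.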
Always calculate GPU needs at your target utilization (70-80%), not 100%. This provides buffer for traffic spikes and prevents latency from exploding as utilization approaches capacity.
Step 2: Data Model
Define the core entities for request tracking, GPU management, and batching:
```
-- Request tracking (PostgreSQL for durability)
Request:
id: UUID (PK)
user_id: UUID (FK)
tier: Enum (free, paid, enterprise)
model: String
status: Enum (queued, processing, completed, failed)
created_at: Timestamp
completed_at: Timestamp
latency_ms: Integer
-- GPU worker registry (Redis for real-time status)
GPUWorker:
worker_id: String
status: Enum (healthy, degraded, offline)
current_utilization: Float
queue_depth: Integer
last_heartbeat: Timestamp
-- Batch metadata (Batcher internal state)
Batch:
batch_id: UUID
request_ids: List of UUID
created_at: Timestamp
sent_to_gpu_at: Timestamp
gpu_worker_id: String
```
For the queue layer, use Redis lists with separate queues per tier to enable priority processing. The batcher pulls from enterprise queue first, then paid, then free:
```
Key: queue:<tier> (e.g., queue:free, queue:paid, queue:enterprise)
Type: List (FIFO)
Key: inflight:<tier>:<batcher_id>
Type: List (in-flight requests for crash recovery)
Value per entry:
request_id: "req_abc123"
user_id: "user_456"
input: "..."
parameters: ...
enqueued_at: timestamp
timeout_at: timestamp
```
Use LPUSH + RPOPLPUSH (or LMOVE on Redis 6.2+, which deprecates RPOPLPUSH) to atomically move items into an in-flight list. A reaper process periodically scans the in-flight lists and returns stale items (those past their timeout_at) to the main queue.
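A sketch of the reaper logic, with plain Python lists standing in for the Redis queue and in-flight lists; in Redis the requeue itself would be an LREM plus LPUSH, wrapped in a Lua script to keep it atomic:

```python
import time

def reap_stale(inflight: list, main_queue: list, now: float = None) -> int:
    """Return timed-out in-flight requests to the main queue; the batcher that
    claimed them presumably crashed before acknowledging. Returns the number
    of requests requeued."""
    if now is None:
        now = time.time()
    requeued = 0
    still_inflight = []
    for item in inflight:
        if item["timeout_at"] <= now:
            main_queue.append(item)       # in Redis: LREM + LPUSH, ideally via Lua
            requeued += 1
        else:
            still_inflight.append(item)
    inflight[:] = still_inflight          # keep only items still within their deadline
    return requeued
```

Requeued items may be processed twice if the original batcher was merely slow rather than dead, so downstream consumers should treat delivery as at-least-once and deduplicate on request_id.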
Step 3: High-Level Architecture
Sketch the following core components:
- Load Balancer: Distributes incoming requests across API Gateway instances
- API Gateway / Rate Limiter: Authenticates API keys, checks rate limits (Redis lookup), determines user tier, enqueues requests to the appropriate priority queue, and holds the HTTP connection open with async I/O
- Priority Queues (Redis): Separate FIFO queues per tier (enterprise, paid, free) enabling priority-based consumption
- Request Batcher: Polls queues in priority order, collects requests into batches using timeout-based batching, and dispatches batches to the least-loaded GPU worker
- GPU Workers: Receive batches, run inference using the fixed API, and return results
- Redis (Pub/Sub + State): Stores request-to-gateway mappings, publishes inference results back to the owning gateway, and maintains GPU health state
- PostgreSQL: Persists request logs for auditing, analytics, and debugging
Walk through the request flow with latency breakdown:
- Client to Load Balancer (2ms): Client sends POST /v1/inference
- Load Balancer to API Gateway (1ms): Routes to a healthy gateway instance
- API Gateway (5ms): Authenticates, checks rate limit, determines tier, enqueues request, stores the request_id-to-gateway-instance mapping in Redis, and holds the connection open
- Queue wait (0-40ms): Request waits in tier queue; average ~20ms for uniform traffic
- Batcher to GPU Worker (55ms): Batcher polls queues, collects batch, sends to least-loaded GPU (~5ms transfer), GPU inference (~50ms)
- Response path (8ms): GPU returns results to batcher, batcher publishes to Redis Pub/Sub, gateway receives notification and resolves the pending connection
- Total: ~90-130ms (well under 500ms SLA)
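Summing the steps confirms the budget: the average case lands around 91 ms, and even if the queue wait hits its 40 ms worst case (the batching timeout), the total stays far inside the 500 ms SLA:

```python
# Average-case latency budget from the walkthrough above, in milliseconds.
budget_ms = {
    "client_to_lb": 2,
    "lb_to_gateway": 1,
    "gateway_processing": 5,
    "queue_wait_avg": 20,        # worst case: 40 ms (the batching timeout)
    "batch_and_inference": 55,   # ~5 ms transfer + ~50 ms GPU inference
    "response_path": 8,
}
total_ms = sum(budget_ms.values())
worst_case_ms = total_ms - budget_ms["queue_wait_avg"] + 40
```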