Design a high-concurrency inference API
Problem Statement
Design a high-concurrency inference API system that can handle massive concurrent requests efficiently. The inference API endpoint is provided and cannot be modified — your focus is on designing the surrounding infrastructure, particularly the batch service that manages requests to GPU workers.
The system must accept synchronous HTTP requests from clients who wait for responses, but internal processing can be asynchronous through queues. Despite this internal asynchrony, responses must feel near real-time (typically under 500ms to 1s). The core challenge is maximizing GPU utilization through intelligent request batching while maintaining predictable, low latency under high concurrency.
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
Given Constraints:
- Fixed inference API: You are given an inference API that you cannot modify
- Client-facing synchronous requests: Clients make synchronous HTTP requests and wait for responses, but internal processing can be asynchronous through queues
- Near real-time latency requirement: Despite internal asynchronous processing, responses must be fast enough to feel near real-time (typically under 500ms to 1s)
- Traffic requirements: Must handle high concurrent requests with predictable latency
- GPU resources: Limited GPU resources that need efficient utilization
Key Requirements
Functional
- Submit inference requests -- Users can submit prompts and receive model responses through a synchronous HTTP API
- Request batching -- Aggregate individual requests into batches for efficient GPU processing, balancing throughput against latency
- Multi-tier support -- Handle requests from different user tiers (free, paid, enterprise) with different priorities and rate limits
- Rate limiting -- Protect the system from overload with tiered rate limits that adjust dynamically based on available capacity
Non-Functional
- Scalability -- Handle 1,000 RPS initially, plan for 10,000+ RPS with 3x peak bursts
- Latency -- P95 under 500ms end-to-end (including queue wait and inference time)
- Availability -- 99.9% uptime with graceful degradation under load; shed lower-priority traffic before failing higher-priority requests
- Cost efficiency -- Maximize GPU utilization (target 70-80%) while maintaining the latency SLA; avoid over-provisioning while retaining buffer for spikes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Request Batching Strategy
The batcher is the most critical component — it determines the fundamental tradeoff between GPU throughput and request latency. Interviewers want to see if you understand timeout-based batching and can reason about the tradeoffs.
Hints to consider:
- Use a dual-trigger batching strategy: send a batch when it reaches the target size (e.g., 32 requests) OR when a timeout expires (e.g., 40ms), whichever comes first
- During high traffic, batches fill quickly and GPU utilization is maximized; during low traffic, the timeout ensures requests are not delayed indefinitely
- Pull from priority queues in order (enterprise first, then paid, then free) so high-priority requests get batched first
- Use atomic queue operations (RPOPLPUSH) to move requests into an in-flight list for crash recovery; if the batcher dies, a reaper process can return stale items to the queue
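The dual-trigger loop above can be sketched in a few lines of Python. This is a simplified stand-in, not the real batcher: plain deques replace the per-tier Redis lists, and the batch size and timeout are the illustrative numbers from the hints:

```python
import time
from collections import deque

MAX_BATCH = 32       # target batch size from the hints above
MAX_WAIT_S = 0.040   # 40 ms dual-trigger timeout

# Plain deques stand in for the per-tier Redis lists.
queues = {"enterprise": deque(), "paid": deque(), "free": deque()}

def collect_batch(queues, max_batch=MAX_BATCH, max_wait_s=MAX_WAIT_S):
    """Dual-trigger batching: return when the batch is full OR the timeout fires."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch and time.monotonic() < deadline:
        for tier in ("enterprise", "paid", "free"):   # strict priority order
            while queues[tier] and len(batch) < max_batch:
                # In Redis this pop would be RPOPLPUSH into an in-flight list.
                batch.append(queues[tier].popleft())
        if len(batch) < max_batch:
            time.sleep(0.001)   # queues drained; wait briefly instead of busy-spinning
    return batch
```

Under load the function returns immediately with a full batch of 32; when traffic is sparse it returns a partial batch after 40 ms, which is exactly the throughput-versus-latency tradeoff the interviewer wants you to articulate.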
2. Response Routing Architecture
Clients hold open HTTP connections waiting for responses, but the batcher and GPU workers process requests asynchronously. Interviewers probe how you route responses back to the correct waiting client connection.
Hints to consider:
- The API Gateway keeps the HTTP connection open using async I/O while storing a mapping of request_id to the gateway instance and response channel in Redis, with a TTL matching the request timeout
- After GPU processing, the batcher publishes results to Redis Pub/Sub; only the owning gateway instance picks up the response and resolves the pending connection
- This decouples request ingestion from response delivery and allows horizontal scaling of gateway instances
- Consider what happens when a gateway instance crashes while holding open connections — the TTL on the Redis mapping provides automatic cleanup
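A minimal sketch of the gateway side, assuming asyncio: each pending connection parks on a Future keyed by request_id, and the Pub/Sub subscriber resolves only the futures this instance owns. The Redis mapping and actual Pub/Sub wiring are elided; `demo` just simulates the round trip in-process:

```python
import asyncio

# request_id -> Future; each gateway instance tracks only its own pending connections.
# In production the request_id -> gateway-instance mapping also goes to Redis with a TTL.
pending: dict[str, asyncio.Future] = {}

async def handle_request(request_id: str, timeout_s: float = 1.0) -> str:
    """Enqueue the request, then park the open HTTP connection on a Future."""
    fut = asyncio.get_running_loop().create_future()
    pending[request_id] = fut
    # ... LPUSH the request onto its tier queue here ...
    try:
        return await asyncio.wait_for(fut, timeout=timeout_s)
    finally:
        pending.pop(request_id, None)   # mirrors the Redis TTL cleanup on crash/timeout

def on_result(request_id: str, result: str) -> None:
    """Pub/Sub callback: runs when the batcher publishes a finished result."""
    fut = pending.get(request_id)
    if fut is not None and not fut.done():   # ignore results owned by other gateways
        fut.set_result(result)

async def demo() -> str:
    task = asyncio.create_task(handle_request("req_abc123"))
    await asyncio.sleep(0.01)         # batching + inference happen elsewhere
    on_result("req_abc123", "model output")
    return await task
```

The `finally` cleanup is the key detail: whether the request completes, times out, or the client disconnects, the mapping disappears, which is the in-process mirror of the Redis TTL behavior described above.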
3. Dynamic Rate Limiting and Capacity Management
Static rate limits break when GPU capacity changes (e.g., hardware failures, scaling events). Interviewers want to see if you can design a feedback loop between GPU health and rate limiting.
Hints to consider:
- Calculate available capacity dynamically: active_gpus * rps_per_gpu * target_utilization
- Implement tiered shedding based on queue depth: under normal load accept all traffic; under moderate load shed free tier; under critical load accept only enterprise
- GPU capacity changes must propagate to rate limiters within seconds, not minutes — use Redis Pub/Sub or a metrics push system
- Consider the 70% utilization target as a deliberate tradeoff: higher cost but absorbs traffic spikes while auto-scaling provisions new GPUs
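The capacity formula and the tiered shedding policy fit in a small sketch. The function names and queue-depth thresholds here are illustrative choices, not prescribed values:

```python
def available_capacity(active_gpus: int, rps_per_gpu: int = 426,
                       target_utilization: float = 0.7) -> float:
    """Capacity the rate limiters should advertise; recompute whenever a GPU
    joins or leaves the pool and push the new value out via Pub/Sub."""
    return active_gpus * rps_per_gpu * target_utilization

def accepted_tiers(queue_depth: int, normal: int = 1000, critical: int = 5000) -> set:
    """Tiered shedding driven by total queue depth (thresholds illustrative)."""
    if queue_depth < normal:
        return {"enterprise", "paid", "free"}   # normal load: accept everything
    if queue_depth < critical:
        return {"enterprise", "paid"}           # moderate load: shed free tier
    return {"enterprise"}                       # critical load: enterprise only
```

Because `available_capacity` is a pure function of GPU pool state, any component that subscribes to pool-change events can recompute it locally within seconds of a failure or scaling event.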
4. GPU Utilization and Auto-Scaling
GPUs are expensive and slow to provision. Interviewers test whether you can balance cost efficiency against latency guarantees and handle the provisioning gap during traffic spikes.
Hints to consider:
- Running at 95% utilization leaves no buffer for spikes; 70-80% provides a 20-30% absorption buffer while auto-scaling triggers
- Scale up aggressively (2 minutes of high utilization), scale down conservatively (10 minutes of low utilization) to avoid flapping
- GPU provisioning takes 1-5 minutes (instance launch plus model loading); your buffer must absorb traffic during this window
- Use multiple scaling signals: queue depth, GPU utilization, and P95 latency — any single metric can miss certain failure modes
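One possible shape for the scaling controller, combining the three signals with asymmetric up/down thresholds; all constants are the illustrative values from the hints, and the function name is hypothetical:

```python
def scaling_decision(high_util_s: int, low_util_s: int,
                     queue_depth: int, p95_latency_ms: float,
                     scale_up_after_s: int = 120,    # 2 min of sustained high utilization
                     scale_down_after_s: int = 600,  # 10 min of sustained low utilization
                     queue_limit: int = 1000,
                     latency_sla_ms: float = 500.0) -> str:
    """Asymmetric hysteresis: scale up aggressively, scale down conservatively.
    Any one of the three signals can force a scale-up on its own."""
    if (high_util_s >= scale_up_after_s
            or queue_depth > queue_limit
            or p95_latency_ms > latency_sla_ms):
        return "scale_up"
    if low_util_s >= scale_down_after_s:
        return "scale_down"
    return "hold"
```

Note that scale-up is an OR over the signals while scale-down requires sustained low utilization: a growing queue or an SLA breach should never be masked by a healthy-looking utilization number.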
Suggested Approach
Step 1: Clarify Requirements and Estimate Capacity
Confirm the following with the interviewer before designing:
- What is the expected request rate, and how spiky is the traffic?
- What are the GPU specs — how long does inference take per batch, and what batch sizes are supported?
- Are there different user tiers, and should they receive different quality of service?
- Is the inference API truly fixed, or can we influence batch sizes and parameters?
- What is the acceptable end-to-end latency, and is it the same for all tiers?
Then perform capacity estimation to ground your design in concrete numbers:
```
Traffic:
- Target: 1,000 RPS (plan for 10x growth to 10,000 RPS)
- Peak: 3x average (3,000 RPS at peak)
- Distribution: 70% free tier, 25% paid, 5% enterprise
GPU Throughput (given):
- Batch size: 32 requests
- Inference time per batch: 50ms
- Batching delay (average): ~20ms
- Data transfer overhead: ~5ms
Total time per batch: 50ms + 20ms + 5ms = 75ms
Batches per GPU per second: 1000ms / 75ms = 13.3
Requests per GPU per second: 13.3 x 32 = 426 RPS
GPUs needed (raw): 1000 / 426 = 2.35
GPUs needed (at 70% utilization): 2.35 / 0.7 = 3.4, round up to 4 GPUs
For 10,000 RPS: 10,000 / 426 = 23.5 raw; 23.5 / 0.7 = 33.6, round up to 34 GPUs
For 3x peak on that target (30,000 RPS): ~100 GPUs
```
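The same arithmetic as a quick script, handy for re-running the estimate with different batch sizes or utilization targets (rounding up with ceil, so the 3x-peak figure lands at 101 rather than the rounded ~100 above):

```python
import math

BATCH_SIZE = 32
BATCH_TIME_MS = 50 + 20 + 5                        # inference + batching delay + transfer
RPS_PER_GPU = 1000 / BATCH_TIME_MS * BATCH_SIZE    # ~426 requests/s per GPU

def gpus_needed(target_rps: float, utilization: float = 0.7) -> int:
    """Round up: you cannot provision a fraction of a GPU."""
    return math.ceil(target_rps / RPS_PER_GPU / utilization)
```

`gpus_needed(1_000)` gives 4 and `gpus_needed(10_000)` gives 34, matching the hand estimate.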
Always calculate GPU needs at your target utilization (70-80%), not 100%. This provides buffer for traffic spikes and prevents latency from exploding as utilization approaches capacity.
Step 2: Data Model
Define the core entities for request tracking, GPU management, and batching:
```
-- Request tracking (PostgreSQL for durability)
Request:
id: UUID (PK)
user_id: UUID (FK)
tier: Enum (free, paid, enterprise)
model: String
status: Enum (queued, processing, completed, failed)
created_at: Timestamp
completed_at: Timestamp
latency_ms: Integer
-- GPU worker registry (Redis for real-time status)
GPUWorker:
worker_id: String
status: Enum (healthy, degraded, offline)
current_utilization: Float
queue_depth: Integer
last_heartbeat: Timestamp
-- Batch metadata (Batcher internal state)
Batch:
batch_id: UUID
request_ids: List of UUID
created_at: Timestamp
sent_to_gpu_at: Timestamp
gpu_worker_id: String
```
For the queue layer, use Redis lists with separate queues per tier to enable priority processing. The batcher pulls from enterprise queue first, then paid, then free:
```
Key: queue:<tier> (e.g., queue:free, queue:paid, queue:enterprise)
Type: List (FIFO)
Key: inflight:<tier>:<batcher_id>
Type: List (in-flight requests for crash recovery)
Value per entry:
request_id: "req_abc123"
user_id: "user_456"
input: "..."
parameters: ...
enqueued_at: timestamp
timeout_at: timestamp
```
Use LPUSH + RPOPLPUSH (or LMOVE on Redis 6.2+, which deprecates RPOPLPUSH) to atomically move items into an in-flight list. A reaper process periodically scans the in-flight lists and returns stale items (those past their timeout_at) to the main queue.
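A sketch of the reaper logic, with plain Python lists standing in for the Redis queue and in-flight lists; in Redis the requeue itself would be an LREM plus LPUSH, wrapped in a Lua script to keep it atomic:

```python
import time

def reap_stale(inflight: list, main_queue: list, now: float = None) -> int:
    """Return timed-out in-flight requests to the main queue; the batcher that
    claimed them presumably crashed before acknowledging. Returns the number
    of requests requeued."""
    if now is None:
        now = time.time()
    requeued = 0
    still_inflight = []
    for item in inflight:
        if item["timeout_at"] <= now:
            main_queue.append(item)       # in Redis: LREM + LPUSH, ideally via Lua
            requeued += 1
        else:
            still_inflight.append(item)
    inflight[:] = still_inflight          # keep only items still within their deadline
    return requeued
```

Requeued items may be processed twice if the original batcher was merely slow rather than dead, so downstream consumers should treat delivery as at-least-once and deduplicate on request_id.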
Step 3: High-Level Architecture
Sketch the following core components:
- Load Balancer: Distributes incoming requests across API Gateway instances
- API Gateway / Rate Limiter: Authenticates API keys, checks rate limits (Redis lookup), determines user tier, enqueues requests to the appropriate priority queue, and holds the HTTP connection open with async I/O
- Priority Queues (Redis): Separate FIFO queues per tier (enterprise, paid, free) enabling priority-based consumption
- Request Batcher: Polls queues in priority order, collects requests into batches using timeout-based batching, and dispatches batches to the least-loaded GPU worker
- GPU Workers: Receive batches, run inference using the fixed API, and return results
- Redis (Pub/Sub + State): Stores request-to-gateway mappings, publishes inference results back to the owning gateway, and maintains GPU health state
- PostgreSQL: Persists request logs for auditing, analytics, and debugging
Walk through the request flow with latency breakdown:
- Client to Load Balancer (2ms): Client sends POST /v1/inference
- Load Balancer to API Gateway (1ms): Routes to a healthy gateway instance
- API Gateway (5ms): Authenticates, checks rate limit, determines tier, enqueues request, stores the request_id-to-gateway-instance mapping in Redis, and holds the connection open
- Queue wait (0-40ms): Request waits in tier queue; average ~20ms for uniform traffic
- Batcher to GPU Worker (55ms): Batcher polls queues, collects batch, sends to least-loaded GPU (~5ms transfer), GPU inference (~50ms)
- Response path (8ms): GPU returns results to batcher, batcher publishes to Redis Pub/Sub, gateway receives notification and resolves the pending connection
- Total: ~90-130ms (well under 500ms SLA)
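Summing the steps confirms the budget: the average case lands around 91 ms, and even if the queue wait hits its 40 ms worst case (the batching timeout), the total stays far inside the 500 ms SLA:

```python
# Average-case latency budget from the walkthrough above, in milliseconds.
budget_ms = {
    "client_to_lb": 2,
    "lb_to_gateway": 1,
    "gateway_processing": 5,
    "queue_wait_avg": 20,        # worst case: 40 ms (the batching timeout)
    "batch_and_inference": 55,   # ~5 ms transfer + ~50 ms GPU inference
    "response_path": 8,
}
total_ms = sum(budget_ms.values())
worst_case_ms = total_ms - budget_ms["queue_wait_avg"] + 40
```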