Challenge: What to Design?
Design a large-scale LLM serving system that efficiently serves many models (base models, fine-tunes, adapters) and heterogeneous requests (chat, completion, embedding, multi-turn agentic calls) from many tenants on the same GPU fleet.
The system must answer four concrete questions:
- How do you serve 10,000+ distinct models on ~5,000 GPUs without dedicating hardware per model?
- How do you propagate weight updates in seconds to minutes, safely, without killing in-flight requests?
- How do you route requests for both load balance and KV-cache locality?
- How do you detect failures, contain their blast radius, and scale despite multi-minute GPU cold starts?
Before diving in, sit with those questions. You don't need to answer all of them right away - a Senior/Staff+ engineer doesn't stop at basic functionality; they anticipate edge cases, design for resilience, and push for production-grade reliability.
But great systems start with great questions. What would you ask next?
Design the backend of a large-scale LLM serving platform. Tenants upload or select pretrained models (e.g., Llama, Mistral, proprietary fine-tunes, or LoRA adapters on top of a shared base) and call an OpenAI-compatible API to run inference against them. The platform's obligations are pinned down by the functional requirements and scale targets below.
Scope & Assumptions

In Scope
- Request ingress, authentication, tenant identification, and rate limiting.
- Model registry and weight distribution to GPU workers.
- Routing from API Gateway to the correct model replica.
- Inference runtime responsibilities (continuous batching, KV cache, adapter hot-loading) at the interface level - we design around these, not inside them.
- Fleet-level auto-scaling, health checks, and failover.
- Monitoring and alerting surface area.
Out of Scope
- Training, fine-tuning, RLHF, or evaluation pipelines.
- The internal math of attention, speculative decoding, or quantization kernels.
- Billing, quota accounting, and subscription management.
- Frontend UX and playground UI.
System Scale

Scale Metrics

| Metric | Target |
|--------|--------|
| Distinct models served | 10,000+ (mostly LoRA adapters on ~20 base models) |
| Active tenants | 50,000 |
| Requests per second | 100,000 RPS peak across the fleet |
| Concurrent streams | 1,000,000 open SSE connections |
| GPU fleet | 5,000+ accelerators across multiple regions |
| P50 time-to-first-token | under 300 ms |
| P99 time-to-first-token | under 1.5 s |
| Weight-update propagation | under 60 s for a new adapter, under 10 min for a new base model |
| Availability target | 99.95% |
FR1 - Serve many models from a shared GPU fleet
Clients specify model_id on every request. The system routes the request to a replica that currently has that model's weights loaded, or triggers a just-in-time load if the model is cold. A single GPU replica may host one base model plus many LoRA adapters swapped in per-batch.
FR2 - Support both streaming and non-streaming requests
Chat and completion endpoints stream tokens via SSE. Embedding and classification endpoints return a single response synchronously. The routing layer must handle both without branching the entire stack.
FR3 - Roll out model weight updates quickly and safely
Uploading a new adapter version, promoting a fine-tune to "stable", or patching a base model must propagate to the serving fleet in seconds to minutes, never hours. In-flight requests on the old version must complete successfully.
FR4 - Route requests for load balance and cache locality
The router must pick a replica that is (a) healthy, (b) has the target model loaded, (c) has low queue depth, and (d) is likely to have the request's KV-cache prefix already in memory for multi-turn conversations.
NFR1 - Latency: P50 time-to-first-token under 300 ms, P99 under 1.5 s.
NFR2 - Throughput: sustain 100k RPS at 70-80% GPU utilization.
NFR3 - Availability: 99.95% fleet-wide.
NFR4 - Fault isolation: failures contained per tenant and per model.
NFR5 - Observability: deep, per-request tracing across every hop.
NFR6 - Elasticity: auto-scale ahead of traffic spikes.
Requirement Summary

| ID | Requirement | Type |
|----|-------------|------|
| FR1 | Serve many models from shared GPUs | Functional |
| FR2 | Streaming + non-streaming inference | Functional |
| FR3 | Fast, safe weight rollouts | Functional |
| FR4 | Cache-aware load balancing | Functional |
| NFR1 | P50 TTFT under 300 ms, P99 under 1.5 s | Non-functional |
| NFR2 | 100k RPS, 70-80% GPU utilization | Non-functional |
| NFR3 | 99.95% availability | Non-functional |
| NFR4 | Per-tenant and per-model fault isolation | Non-functional |
| NFR5 | Deep, per-request observability | Non-functional |
| NFR6 | Auto-scale ahead of traffic spikes | Non-functional |
Four hard problems define this system - multiplexing thousands of models onto a shared fleet, rolling out weights fast and safely, routing for cache locality without hot spots, and staying available under GPU cold-start economics. Strong candidates anticipate every one of them before being asked.
Secondary signals interviewers listen for: naming the prefill/decode split, keeping the data, control, and observability planes separate, and reasoning from GPU economics rather than generic web-service patterns.
Before drawing any boxes, derive the architecture from the three properties that make LLM serving different from stateless web serving: the replica has state you want to reuse (KV-cache), capacity takes minutes to provision and seconds to exhaust, and models are not monolithic. Each property pushes a specific design lever.
Core Entities

| Entity | Description |
|--------|-------------|
| Tenant | A customer account with API keys, quota, and model ownership |
| Model | Either a base model (e.g., llama-3-70b) or an adapter (LoRA) keyed by model_id + version |
| ModelArtifact | Immutable blob in object storage: weights, tokenizer, config |
| Replica | A GPU process running a base model; may host multiple adapters simultaneously |
| Request | Tenant ID, model ID, prompt, generation params, streaming flag, session ID |
| Session | Logical conversation with a stable KV-cache prefix and a sticky replica affinity |
| Batch | Set of tokens from different requests being processed by one forward pass on a replica |
```
POST /v1/chat/completions
Authorization: Bearer <api_key>

{
  "model": "tenant-42/support-bot-v7",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I reset my password?"}
  ],
  "stream": true,
  "max_tokens": 512,
  "temperature": 0.7,
  "session_id": "sess_abc123"
}

Response (stream=true): text/event-stream

data: {"choices":[{"delta":{"content":"To"}}]}
data: {"choices":[{"delta":{"content":" reset"}}]}
...
data: [DONE]
```
The session_id is the key to KV-cache affinity. If the client passes a stable ID across turns, the router pins that session to the same replica so the KV-cache prefix for prior turns stays hot.
```
POST /internal/generate

{
  "request_id": "req_xyz",
  "model_id": "tenant-42/support-bot-v7",
  "adapter_id": "support-bot-v7-lora",
  "prompt_tokens": [1, 2, 3, ...],
  "params": { "max_tokens": 512, "temperature": 0.7 },
  "session_id": "sess_abc123"
}

Response: streaming token deltas over gRPC or HTTP/2
```
```
PUT /v1/models/{model_id}/versions/{version}

{
  "artifact_uri": "s3://fw-models/tenant-42/support-bot-v7.safetensors",
  "base_model": "llama-3-70b",
  "adapter_type": "lora",
  "promote_to": "stable"
}
```
Promoting a version to stable triggers the weight propagation flow described in DD2.
Tip The system separates cleanly into three planes:
- Data plane: the request hot path - gateway, router, replicas - served entirely from memory.
- Control plane: the durable source of truth for the model registry, rollouts, and pool scaling.
- Observability plane: the async telemetry pipeline that records what the other two did.
Mixing these is the most common architectural mistake in LLM serving systems.
1. Client sends request, Load Balancer terminates TLS
A regional L7 load balancer (e.g., Envoy) terminates TLS and routes to the API Gateway fleet. Long-running SSE streams need connection draining on deploys - the LB must hold connections for up to max_tokens x 50ms worst case.
2. API Gateway: auth, tenant resolution, rate limiting
The Gateway validates the API key, resolves the tenant, enforces per-tenant rate limits (tokens/second, RPS), and attaches a trace ID. If the tenant is over quota, it returns HTTP 429 before any GPU resource is touched.
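As a concrete sketch, the gateway-side limiter can be a per-tenant token bucket consulted before any GPU work is scheduled. The class below is illustrative (the name TokenBucket and all parameters are invented for this example), not a prescribed implementation:

```
import time

class TokenBucket:
    """Per-tenant rate limiter; refills continuously at rate_per_sec."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller maps this to HTTP 429, before any GPU is touched
```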
3. Model Router resolves model_id to replica candidates
The Model Router reads from a Model Registry cache (sourced from a durable control plane DB) to find which replicas currently have the target model loaded. For a LoRA adapter, this means any replica running the correct base model - adapters load in ~1 second from a local SSD cache.
4. Load Balancer picks a replica
Among the candidates, the router picks a replica using a composite score (see DD3): if the session_id is already pinned, go there; otherwise prefer a replica with the prompt prefix warm in cache, breaking ties on queue depth.
5. Replica enqueues into a continuous batch
The replica's inference runtime (vLLM / TensorRT-LLM / TGI-style) merges the new request into the continuous batch currently running on the GPU. New tokens from this request interleave with decode steps of other in-flight requests every forward pass.
6. Tokens stream back
Each decoded token is written to the gRPC/HTTP/2 response stream. The Gateway forwards tokens to the client over SSE. When the request hits max_tokens or emits an EOS token, the connection closes cleanly and the request is evicted from the batch.
7. Async observability
Every hop - gateway, router, replica prefill, replica decode - emits structured events to a fire-and-forget telemetry pipeline (Kafka -> ClickHouse for traces, Prometheus for metrics). None of this sits on the critical path.
```
Client
  |
  v
[Regional Load Balancer]   <- TLS termination, connection draining
  |
  v
[API Gateway fleet]        <- auth, rate limit, tenant routing
  |
  v
[Model Router]             <- model_id -> candidate replicas
  |   (reads Model Registry cache)
  v
[Replica LB]               <- session affinity + cache + queue score
  |
  v
[GPU Replica]              <- continuous batching, KV cache, adapters
  |        ^
  |        |--- weight pull ---- [Weight Distribution Layer]
  |                                       ^
  |                                       |
  |                            [Object Store: S3 / GCS]
  v
[Token Stream] -> Gateway -> Client
```
Three asynchronous side-flows sit next to the hot path:
- Weight distribution: replicas pull new artifacts from the object store via the distribution layer (DD2).
- Telemetry: every hop emits events into the fire-and-forget pipeline (Kafka -> ClickHouse, Prometheus).
- Control loops: the registry, rollout engine, and pool auto-scalers reconcile fleet state without touching requests.
Tip Four hard problems define this system. Each maps to one of the questions in the problem statement:
- DD1 - Multi-model multiplexing: serving 10,000+ models on ~5,000 GPUs.
- DD2 - Weight rollout: propagating updates in seconds to minutes, safely.
- DD3 - Cache-aware routing: balancing load without destroying KV-cache locality.
- DD4 - Fault tolerance and scaling: surviving failures under minute-long GPU cold starts.
Tip Where the problem lies
A naive design assigns one model per GPU. With 10,000 models and ~5,000 GPUs, that's already impossible. Even if it weren't, long-tail models would idle 99.9% of the time while head models starved for capacity. The fleet collapses at both ends.
Observation: 99% of "distinct models" are LoRA adapters on ~20 base models.
LoRA (Low-Rank Adaptation) adapters are tiny - typically 10-100 MB - compared to a 140 GB base model. The insight that unlocks multi-tenant LLM serving is: one GPU can hold one base model resident and swap in dozens of LoRA adapters on-the-fly, each adapter costing milliseconds to activate.
The fleet structure follows:
Replica Composition

| Layer | Lifetime | Size | Cost to swap |
|-------|----------|------|--------------|
| Base model weights (in HBM) | Hours-days | 20-500 GB | Cold start: 3-5 min from object store |
| LoRA adapters (in HBM, many) | Seconds-minutes | 10-100 MB each | ~1 s from local SSD, ~50 ms if already in DRAM |
| KV-cache for active sessions | Seconds | Variable (prompt + generation) | Free - already in memory |
Pooling base models. The fleet is partitioned into replica pools - one pool per base model. Each pool is auto-scaled independently based on its traffic. A pool for llama-3-70b might have 200 replicas; a pool for a niche qwen-2-7b might have 4. This is the first and biggest utilization lever.
Adapter multiplexing inside a pool. Within the llama-3-70b pool, every replica can serve every adapter that targets llama-3-70b. The router doesn't care which replica - it just needs one that has the base model. Adapters are cached with an LRU policy on each replica's local NVMe; a cold adapter hit costs a 1-second first-token penalty, which the router tracks and avoids during load spikes.
Per-request adapter selection in the batch. Modern inference runtimes (e.g., vLLM with multi-LoRA, S-LoRA) support running a single forward pass where different tokens in the batch use different adapters. This is what makes the economics work: you pay one GPU compute cost for a batch that serves many tenants on many adapters.
Why not keep adapters only on disk and load on demand? You can, but you pay the 1-second load penalty on every cold adapter hit. At 100k RPS with a long tail of adapters, that's a latency disaster.
The right design is a 3-tier cache:
- HBM (GPU memory) - tens of hot adapters, zero-cost activation
- Host DRAM - hundreds of warm adapters, ~50 ms promote to HBM
- Local NVMe - thousands of adapters, ~1 s promote to DRAM then HBM
- Object store (S3) - ground truth, used only on tier miss
The router keeps a rough bloom-filter-per-replica summary of which adapters are warm and avoids routing cold requests during peak load.
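A minimal sketch of the tiering logic, assuming illustrative slot counts and a hypothetical fetch_from_nvme_or_s3 helper standing in for the bottom tiers:

```
from collections import OrderedDict

def fetch_from_nvme_or_s3(adapter_id: str):
    """Hypothetical stub: pulls adapter weights from NVMe, else object store."""
    return object()

class TieredAdapterCache:
    def __init__(self, hbm_slots: int = 32, dram_slots: int = 512):
        self.hbm = OrderedDict()    # tier 1: zero-cost activation
        self.dram = OrderedDict()   # tier 2: ~50 ms promote to HBM
        self.hbm_slots = hbm_slots
        self.dram_slots = dram_slots

    def activate(self, adapter_id: str):
        if adapter_id in self.hbm:                      # hot hit
            self.hbm.move_to_end(adapter_id)
            return self.hbm[adapter_id]
        if adapter_id in self.dram:                     # warm hit: promote
            weights = self.dram.pop(adapter_id)
        else:                                           # cold: ~1 s penalty
            weights = fetch_from_nvme_or_s3(adapter_id)
        self._admit_to_hbm(adapter_id, weights)
        return weights

    def _admit_to_hbm(self, adapter_id: str, weights) -> None:
        if len(self.hbm) >= self.hbm_slots:             # demote LRU HBM -> DRAM
            old_id, old_weights = self.hbm.popitem(last=False)
            self.dram[old_id] = old_weights
            while len(self.dram) > self.dram_slots:     # NVMe still holds a copy
                self.dram.popitem(last=False)
        self.hbm[adapter_id] = weights
```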
Tip Where the problem lies
"Just push the new weights" hides three failure modes:
The design must solve propagation speed, safety, and in-flight request survival together.
Artifact structure: content-addressed and immutable
Every model artifact is keyed by sha256(weights) in object storage. The Model Registry maps human-readable identifiers (tenant-42/support-bot-v7:stable) to artifact hashes. Updates are just new (hash, version) tuples - never mutations of existing ones. This is what makes the rest of the design safe.
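In code, the invariant is tiny. This sketch (all names illustrative) shows that an "update" is only ever a new pointer, never a mutated blob:

```
import hashlib

def artifact_key(weights: bytes) -> str:
    """Content address: the blob's identity is its hash, so it can never mutate."""
    return "sha256:" + hashlib.sha256(weights).hexdigest()

# Registry: human-readable alias -> immutable artifact hash.
registry: dict[str, str] = {}

def promote(alias: str, key: str) -> None:
    registry[alias] = key            # "tenant-42/support-bot-v7:stable" -> "sha256:..."

def rollback(alias: str, previous_key: str) -> None:
    registry[alias] = previous_key   # pointer flip; zero new data movement
```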
Two propagation paths: adapter vs base model
Weight Propagation by Size

| Artifact type | Typical size | Propagation strategy | Target SLO |
|---------------|--------------|----------------------|------------|
| LoRA adapter | 10-100 MB | Direct S3 + per-region cache (CDN) | under 60 s |
| Full fine-tune or new base model | 20-500 GB | P2P / BitTorrent-style fan-out across replicas in the pool | under 10 min |
The P2P approach (see the separate ML Model Distribution design) is critical for base-model-sized artifacts: the first replica pulls from S3, then other replicas pull chunks from each other in parallel. This makes propagation time roughly O(log N) in the pool size instead of O(N).
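For intuition, a back-of-envelope under an idealized doubling model (each round, every replica that already holds the artifact seeds one more; numbers illustrative):

```
import math

pool_size = 200                            # replicas in the base-model pool
rounds = math.ceil(math.log2(pool_size))   # seeders double each round
print(rounds)                              # 8 parallel rounds vs 200 serial S3 pulls
```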
Progressive rollout (the safety story)
A new version is never deployed to 100% of the fleet at once. The control plane executes a staged rollout:
```
stages = [
  { "percent": 1,   "duration": "5 min",  "auto_halt": ["error_rate > 2%", "p99_latency > 2x"] },
  { "percent": 10,  "duration": "15 min", "auto_halt": [...] },
  { "percent": 50,  "duration": "30 min", "auto_halt": [...] },
  { "percent": 100, "duration": "steady state" },
]
```
At each stage the control plane samples metrics from replicas running the new version and compares them against a control group still running the old version. Any SLO breach triggers automatic rollback: the registry pointer flips back and replicas re-activate the previous version from cache (which they still hold - the artifact is immutable and cache-resident).
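The per-stage gate reduces to a comparison between the canary and control cohorts. A sketch with the thresholds from the stages config above (the cohort objects and their fields are assumed):

```
def should_halt(canary, control) -> bool:
    """True if the canary cohort breaches the stage's auto_halt thresholds."""
    return (
        canary.error_rate > 0.02                          # "error_rate > 2%"
        or canary.p99_latency > 2 * control.p99_latency   # "p99_latency > 2x"
    )
```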
In-flight request survival
When a replica is told to swap in a new version, it does not kill in-flight streams. The runtime uses a double-buffer pattern: load the new version alongside the old, bind all new requests to it immediately, and keep the old version resident until its last in-flight stream completes, then free it.
For adapters, this is near-instantaneous (a pointer swap in the adapter table). For base-model swaps, you accept a brief period of 2x HBM usage per replica - which is why the base-model pool must be provisioned with headroom for a rolling update.
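A sketch of that swap under assumed runtime hooks (the class and method names are invented); the point is that old weights stay referenced until the last in-flight stream drains:

```
import threading

class VersionedSlot:
    """Double-buffered weight slot: new requests bind to `current`,
    old in-flight streams keep a reference to `draining`."""
    def __init__(self, weights):
        self._lock = threading.Lock()
        self.current = weights
        self.draining = None

    def swap_in(self, new_weights) -> None:
        with self._lock:
            self.draining = self.current   # old version kept for in-flight work
            self.current = new_weights     # all new requests see this at once

    def release_old(self) -> None:
        """Called when the last stream on the old version finishes."""
        with self._lock:
            self.draining = None           # frees the temporary 2x HBM overhead
```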
Why content-addressing (sha256 keys) matters so much here. Three things fall out of content-addressing for free:
- Idempotent downloads. If a replica is interrupted mid-pull, it can resume by asking any peer for missing chunks. Chunks verify by hash.
- Instant rollback. The old artifact's hash is still pinned in the fleet. "Roll back" is just a registry pointer flip - zero new data movement.
- Deterministic debugging. A production trace carries the artifact hash. Reproducing a bug means loading exactly that artifact anywhere, forever. No "which commit was deployed at 3am?" confusion.
This is the same principle behind Docker image layers and Git objects. Use it unchanged for model artifacts.
Tip Where the problem lies
Round-robin routing looks appealing for its simplicity. It destroys the KV-cache. In a chat workload where a user's next turn shares the first 2,000 tokens with the previous turn, a cache miss means the replica must re-run prefill for all 2,000 tokens - hundreds of milliseconds of GPU time thrown away. At 100k RPS this alone can tank the fleet.
Load balancing in LLM serving is fundamentally different from stateless web serving. The replica has state you want to reuse.
The router scores each candidate replica on three axes and picks the highest-scoring healthy replica.
Signal 1 - Session affinity (hard preference)
If the request carries a session_id that is already pinned to a replica, go there. The pin is written on first use and expires ~5 minutes after the last token. This is a consistent-hashing-style soft pin - if the pinned replica dies or is drained, the session transparently rebalances with a cache miss (expected, acceptable failure mode).
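A sketch of the pin table, assuming an in-memory map with the ~5-minute idle TTL described above (names illustrative):

```
import time

PIN_TTL_SECONDS = 300   # expire ~5 min after the last token

# session_id -> (replica_id, last_token_time)
pins: dict[str, tuple[str, float]] = {}

def pinned_replica(session_id: str) -> str | None:
    entry = pins.get(session_id)
    if entry and time.time() - entry[1] < PIN_TTL_SECONDS:
        return entry[0]
    return None   # expired or never pinned: fall through to cache/load signals

def pin(session_id: str, replica_id: str) -> None:
    pins[session_id] = (replica_id, time.time())
```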
Signal 2 - Prefix cache hint (soft preference)
For requests without a stable session (or the first turn of a session), the router checks which replicas have recently served a request with the same prompt prefix. Replicas publish a rolling bloom filter of their top-N recent prefix hashes to the router every few seconds. This is cheap, noisy, and good enough.
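A toy version of the replica-side summary; a real deployment would use a rolling or counting filter, and the sizes here are illustrative:

```
import hashlib

class PrefixBloom:
    def __init__(self, bits: int = 1 << 16, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.bitset = 0   # published to the router every few seconds

    def _positions(self, prefix_hash: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{prefix_hash}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, prefix_hash: str) -> None:
        for pos in self._positions(prefix_hash):
            self.bitset |= 1 << pos

    def might_contain(self, prefix_hash: str) -> bool:
        # False positives possible (noisy), false negatives impossible.
        return all((self.bitset >> pos) & 1 for pos in self._positions(prefix_hash))
```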
Signal 3 - Queue depth + utilization (tiebreaker)
Once the candidate set is narrowed by model membership and cache signals, break ties on the replica with the lowest pending_tokens / tokens_per_second ratio. This is closer to "time until free" than raw queue length and maps directly to expected queue time.
```
score = (
    session_match * 100         # Signal 1: dominates if present
    + prefix_cache_hit * 10     # Signal 2: warm-prefix bonus (weight illustrative)
    - expected_queue_seconds    # Signal 3: pending_tokens / tokens_per_second
)
```
Avoiding hot spots
A single tenant with a very popular prompt prefix would hammer one replica forever. The router adds jitter above a utilization threshold: once a replica crosses 85% utilization, the router shifts some of its incoming cache-hit requests to cold replicas, accepting a cache miss to preserve P99 latency.
Why not just use consistent hashing on (session_id)? Consistent hashing solves affinity but ignores load. If one replica happens to get the popular shard, it melts while others idle. Real production systems layer "consistent hashing for affinity" with "bounded loads" on top - route to the preferred replica unless it is more than X% above the mean load, in which case route to the next-best. This is the Power-of-Two-Choices + Bounded-Loads pattern.
That layered approach is what the scoring function above is approximating: prefer affinity, but let load signals override when the cost of stickiness exceeds the benefit.
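A sketch of that layering, with an assumed Replica type exposing a load metric and an illustrative overload factor:

```
import random
from dataclasses import dataclass

@dataclass
class Replica:
    replica_id: str
    load: float   # e.g. pending_tokens / tokens_per_second

OVERLOAD_FACTOR = 1.25   # illustrative: how far above mean load affinity may stretch

def route(pinned: Replica | None, candidates: list[Replica]) -> Replica:
    mean_load = sum(r.load for r in candidates) / len(candidates)
    if pinned is not None and pinned.load <= OVERLOAD_FACTOR * mean_load:
        return pinned                        # affinity wins while load allows
    if len(candidates) == 1:
        return candidates[0]
    a, b = random.sample(candidates, 2)      # power of two choices
    return a if a.load <= b.load else b      # take the less loaded of the pair
```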
What gets streamed back during prefill vs decode? LLM inference has two distinct phases with very different cost profiles:
- Prefill: process the entire prompt in one big GPU-parallel pass. Latency scales with prompt length but is compute-bound and highly parallelizable. Dominates time-to-first-token.
- Decode: generate one token at a time, each step requiring a full forward pass on the growing KV-cache. Memory-bandwidth-bound. Dominates total response time for long generations.
This matters for the router because prefill cost is predictable from prompt length and decode cost is predictable from max_tokens. The router can estimate total replica-time for a request before dispatching, and use that to pick the replica whose queue will drain fastest.
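A sketch of that estimate, with made-up throughput constants standing in for per-model profiling data:

```
PREFILL_TOKENS_PER_SEC = 20_000   # compute-bound; batch-parallel (illustrative)
DECODE_TOKENS_PER_SEC = 40        # memory-bandwidth-bound; per stream (illustrative)

def estimated_replica_seconds(prompt_tokens: int, max_tokens: int) -> float:
    prefill = prompt_tokens / PREFILL_TOKENS_PER_SEC   # drives time-to-first-token
    decode = max_tokens / DECODE_TOKENS_PER_SEC        # drives total response time
    return prefill + decode
```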
Tip Where the problem lies
LLM serving has a uniquely cruel combination: GPU cold starts take minutes, traffic can spike in seconds, individual requests can hold a GPU for tens of seconds, and a single OOM or CUDA error can kill a replica mid-batch and take 50 in-flight streams with it.
You need to (1) detect failure fast, (2) contain its blast radius, (3) scale in a way that respects cold-start latency, and (4) observe the system deeply enough to know any of this is working.
Part 1 - Fault tolerance
Failure Modes and Responses

| Failure | Detection | Response |
|---------|-----------|----------|
| Replica OOM / crash | Missed heartbeats (5 s interval, 15 s timeout) | Router stops sending traffic; orchestrator reschedules pod; in-flight requests fail fast with retriable 503 |
| Slow / stuck replica (noisy neighbor) | Rolling P99 queue time per replica exceeds 3x fleet median | Drain replica: stop new traffic, let in-flight finish, then restart |
| GPU hardware error (ECC, XID) | nvidia-smi + kernel logs via node agent | Cordon node, migrate replicas to healthy nodes, page on-call for hardware replacement |
| Node-level failure | Kubernetes node NotReady | Pods rescheduled; pool auto-scaler backfills |
| Zone outage | Multiple nodes + control-plane health signal lost | Regional LB shifts traffic to other zones; pools in other zones scale up |
| Bad model version rollout | Progressive rollout SLO breach (DD2) | Automatic rollback, incident page |
Retry semantics (critical). Only requests that failed before any token was emitted are safe to retry on a different replica. Once tokens have streamed to the client, a retry would duplicate content. The gateway tags each request with tokens_emitted; if 0 when the replica fails, retry transparently, otherwise surface the error.
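A sketch of the gateway guard, with hypothetical stream_from_replica, ReplicaFailure, and ServiceUnavailable names standing in for the real client plumbing:

```
class ReplicaFailure(Exception): ...
class ServiceUnavailable(Exception): ...

def stream_from_replica(replica, request):
    """Hypothetical stub: yields token deltas from one replica."""
    yield from ()

def serve_with_retry(request, candidates):
    tokens_emitted = 0
    for replica in candidates:
        try:
            for token in stream_from_replica(replica, request):
                tokens_emitted += 1
                yield token               # forwarded to the client over SSE
            return
        except ReplicaFailure:
            if tokens_emitted > 0:
                raise                     # retrying would duplicate streamed content
            continue                      # nothing streamed yet: safe to retry
    raise ServiceUnavailable()            # surfaced as a retriable 503
```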
Part 2 - Monitoring (the metrics that matter)

The Metrics That Actually Page

| Metric | Why it matters | Typical threshold |
|--------|----------------|-------------------|
| Time-to-first-token P99 by model | User-visible latency; breaches SLO before utilization does | under 1.5 s |
| Inter-token latency P99 | Detects slow decode / batch starvation | under 50 ms |
| Queue depth (pending tokens) per pool | Leading indicator of scale-up need | alert above 0.5 s backlog |
| GPU utilization per pool | Cost vs headroom; secondary scale signal | target 70-80% |
| KV-cache hit rate | Routing health; drops imply router misconfig | above 60% for chat workloads |
| Tokens per second per GPU | Capacity planning and regression detection | baseline per model |
Traces are per-request and carry {tenant_id, model_id, adapter_id, replica_id, session_id, prefill_ms, queue_ms, first_token_ms, decode_tokens, error}. ClickHouse is the right store: cheap, fast aggregate queries over billions of events, easy cohort analysis.
Part 3 - Auto-scaling
The scaler works per base-model pool, because that is the unit of capacity. The adapter layer is already multiplexed for free.
```
# Scale up on leading indicators:
if p99_ttft > 1.2 * SLO for 60s:
    provision_replicas(+10%)
elif queue_depth > 0.5s of backlog:
    provision_replicas(+5%)
elif gpu_util > 0.85 for 3 min:
    provision_replicas(+5%)

# Scale down on a trailing indicator:
if gpu_util < 0.40 for 15 min:
    drain_and_remove(1)   # one replica at a time, never all at once
```
The asymmetry is the same lesson as any autoscaler, amplified by the 3-5 minute cold start: scale up on a leading indicator (queue depth, latency P99), scale down on a trailing indicator (sustained low utilization).
Pre-warmed replicas. For head models with spiky traffic, keep N replicas permanently in a "warm pool" - base model loaded, zero traffic, ready to accept routing in under a second. This pays ~5% of fleet cost to buy a 5-minute head start on the next spike.
Challenge Yourself - traffic for model X just 5x'd in 30 seconds. New GPUs take 5 minutes. What do you do? The answer is not "provision more GPUs" - that's too slow to matter for the next 5 minutes. The answer is a layered response:
- Shed load at the edge. The gateway's per-tenant rate limiter tightens automatically when the pool crosses a queue-depth threshold. Free-tier and bulk traffic returns 429 within milliseconds.
- Shift cache-miss tolerant traffic. Non-chat requests (embeddings, completions without session_id) can be routed to adjacent pools if the target pool is saturated - accepting a higher latency but not dropping.
- Borrow capacity from warm pool. Promote 2-3 warm replicas immediately into the hot rotation.
- Now start provisioning cold replicas. They'll arrive in 5 minutes as backup, not as the primary response.
The candidate who says "increase HPA replica count" has not operated this kind of system. The candidate who says "shed first, shift second, provision third" has.
A large-scale LLM serving platform looks like a web service on the surface - requests in, responses out - but every meaningful engineering decision comes from three properties that web services don't have: replicas hold state you want to reuse (the KV-cache), capacity takes minutes to provision and seconds to exhaust, and models are not monolithic but a small set of bases plus thousands of adapters.
When you put all four deep dives together, the shape of the system is inevitable: a small number of base-model pools (auto-scaled independently), each running replicas that multiplex adapters dynamically, fronted by a cache-aware router that blends session affinity with queue depth, updated via content-addressed progressive rollouts, and observed through per-request traces that make every hop accountable.
Tip Lead with the economics, not the architecture.
The first question to answer in this design is not "what services do we need?" - it is "why does a naive per-model deployment fail?" Candidates who can derive the adapter multiplexing requirement from first principles (10k models, 5k GPUs, long-tail traffic) separate themselves immediately.
Tip Name the two latency phases: prefill and decode.
Most candidates treat inference as one number. Senior candidates know that prefill dominates TTFT, decode dominates total latency, and the router needs both to pick a replica well. This one distinction opens up the entire load-balancing discussion.
Tip Pre-empt the rollout question.
An interviewer will ask "how do you update a model." The strongest answer doesn't just describe a staged rollout - it names content-addressed artifacts, explains why immutability enables instant rollback, and distinguishes adapter propagation from base-model propagation. That level of depth in a single answer is rare.
Tip Know what you don't need.
You don't need a database on the hot path. You don't need Kafka between the gateway and the router. You don't need a distributed lock. You need a fast in-memory routing layer, an async telemetry pipeline, and a durable control plane - and the discipline to keep those three planes from bleeding into each other.
Tip How ShowOffer can help.
We've included callouts and open-ended design prompts throughout this write-up - perfect for self-practice or interview discussion. If you'd like feedback or want to walk through this system design in a mock interview setting, book a session with us at ShowOffer.io. We're here to help you sharpen your skills, gain confidence, and land your next big offer.