Challenge: What to Design?
Design a large-scale LLM serving system that efficiently serves many models (base models, fine-tunes, adapters) and heterogeneous requests (chat, completion, embedding, multi-turn agentic calls) from many tenants on the same GPU fleet.
The system must answer four concrete questions:
- How do you serve 10,000+ distinct models on ~5,000 GPUs without dedicating hardware per model?
- How do you propagate weight updates in seconds to minutes, safely, without killing in-flight requests?
- How do you route requests for both load balance and KV-cache locality?
- How do you detect failures, contain their blast radius, and scale despite multi-minute GPU cold starts?
Before diving in, sit with those questions. You don't need to answer all of them right away - a Senior/Staff+ engineer doesn't stop at basic functionality; they anticipate edge cases, design for resilience, and push for production-grade reliability.
But great systems start with great questions. What would you ask next?
Design the backend of a large-scale LLM serving platform. Tenants upload or select pretrained models (e.g., Llama, Mistral, proprietary fine-tunes, or LoRA adapters on top of a shared base) and call an OpenAI-compatible API to run inference against them. The platform's obligations are pinned down by the functional requirements and scale targets below.
Scope & Assumptions

In Scope
- Request ingress, authentication, tenant identification, and rate limiting.
- Model registry and weight distribution to GPU workers.
- Routing from API Gateway to the correct model replica.
- Inference runtime responsibilities (continuous batching, KV cache, adapter hot-loading) at the interface level - we design around these, not inside them.
- Fleet-level auto-scaling, health checks, and failover.
- Monitoring and alerting surface area.
Out of Scope
- Training, fine-tuning, RLHF, or evaluation pipelines.
- The internal math of attention, speculative decoding, or quantization kernels.
- Billing, quota accounting, and subscription management.
- Frontend UX and playground UI.
System Scale

Scale Metrics

| Metric | Target |
|--------|--------|
| Distinct models served | 10,000+ (mostly LoRA adapters on ~20 base models) |
| Active tenants | 50,000 |
| Requests per second | 100,000 RPS peak across the fleet |
| Concurrent streams | 1,000,000 open SSE connections |
| GPU fleet | 5,000+ accelerators across multiple regions |
| P50 time-to-first-token | under 300 ms |
| P99 time-to-first-token | under 1.5 s |
| Weight-update propagation | under 60 s for a new adapter, under 10 min for a new base model |
| Availability target | 99.95% |
FR1 - Serve many models from a shared GPU fleet
Clients specify model_id on every request. The system routes the request to a replica that currently has that model's weights loaded, or triggers a just-in-time load if the model is cold. A single GPU replica may host one base model plus many LoRA adapters swapped in per-batch.
FR2 - Support both streaming and non-streaming requests
Chat and completion endpoints stream tokens via SSE. Embedding and classification endpoints return a single response synchronously. The routing layer must handle both without branching the entire stack.
FR3 - Roll out model weight updates quickly and safely
Uploading a new adapter version, promoting a fine-tune to "stable", or patching a base model must propagate to the serving fleet in seconds to minutes, never hours. In-flight requests on the old version must complete successfully.
FR4 - Route requests for load balance and cache locality
The router must pick a replica that is (a) healthy, (b) has the target model loaded, (c) has low queue depth, and (d) is likely to have the request's KV-cache prefix already in memory for multi-turn conversations.
NFR1 - Latency: P50 time-to-first-token under 300 ms, P99 under 1.5 s.
NFR2 - Throughput: sustain 100k RPS at 70-80% GPU utilization.
NFR3 - Availability: 99.95% fleet-wide.
NFR4 - Fault isolation: failures contained per tenant and per model.
NFR5 - Observability: deep, per-request tracing across every hop.
NFR6 - Elasticity: auto-scale ahead of traffic spikes.
Requirement Summary

| ID | Requirement | Type |
|----|-------------|------|
| FR1 | Serve many models from shared GPUs | Functional |
| FR2 | Streaming + non-streaming inference | Functional |
| FR3 | Fast, safe weight rollouts | Functional |
| FR4 | Cache-aware load balancing | Functional |
| NFR1 | P50 TTFT under 300 ms, P99 under 1.5 s | Non-functional |
| NFR2 | 100k RPS, 70-80% GPU utilization | Non-functional |
| NFR3 | 99.95% availability | Non-functional |
| NFR4 | Per-tenant and per-model fault isolation | Non-functional |
| NFR5 | Deep, per-request observability | Non-functional |
| NFR6 | Auto-scale ahead of traffic spikes | Non-functional |
Four hard problems define this system - multiplexing thousands of models onto a shared fleet, rolling out weights fast and safely, routing for cache locality without hot spots, and staying available under GPU cold-start economics. Strong candidates anticipate every one of them before being asked.
Secondary signals interviewers listen for: naming the prefill/decode split, keeping the data, control, and observability planes separate, and reasoning from GPU economics rather than generic web-service patterns.
Before drawing any boxes, derive the architecture from the three properties that make LLM serving different from stateless web serving: the replica has state you want to reuse (KV-cache), capacity takes minutes to provision and seconds to exhaust, and models are not monolithic. Each property pushes a specific design lever.
Core Entities

| Entity | Description |
|--------|-------------|
| Tenant | A customer account with API keys, quota, and model ownership |
| Model | Either a base model (e.g., llama-3-70b) or an adapter (LoRA) keyed by model_id + version |
| ModelArtifact | Immutable blob in object storage: weights, tokenizer, config |
| Replica | A GPU process running a base model; may host multiple adapters simultaneously |
| Request | Tenant ID, model ID, prompt, generation params, streaming flag, session ID |
| Session | Logical conversation with a stable KV-cache prefix and a sticky replica affinity |
| Batch | Set of tokens from different requests being processed by one forward pass on a replica |
```
POST /v1/chat/completions
Authorization: Bearer <api_key>

{
  "model": "tenant-42/support-bot-v7",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I reset my password?"}
  ],
  "stream": true,
  "max_tokens": 512,
  "temperature": 0.7,
  "session_id": "sess_abc123"
}

Response (stream=true): text/event-stream

data: {"choices":[{"delta":{"content":"To"}}]}
data: {"choices":[{"delta":{"content":" reset"}}]}
...
data: [DONE]
```
The session_id is the key to KV-cache affinity. If the client passes a stable ID across turns, the router pins that session to the same replica so the KV-cache prefix for prior turns stays hot.
```
POST /internal/generate

{
  "request_id": "req_xyz",
  "model_id": "tenant-42/support-bot-v7",
  "adapter_id": "support-bot-v7-lora",
  "prompt_tokens": [1, 2, 3, ...],
  "params": { "max_tokens": 512, "temperature": 0.7 },
  "session_id": "sess_abc123"
}

Response: streaming token deltas over gRPC or HTTP/2
```
```
PUT /v1/models/{model_id}/versions/{version}

{
  "artifact_uri": "s3://fw-models/tenant-42/support-bot-v7.safetensors",
  "base_model": "llama-3-70b",
  "adapter_type": "lora",
  "promote_to": "stable"
}
```
Promoting a version to stable triggers the weight propagation flow described in DD2.
Tip The system separates cleanly into three planes:
- Data plane: the request hot path - gateway, router, replicas - served entirely from memory.
- Control plane: the durable source of truth for the model registry, rollouts, and pool scaling.
- Observability plane: the async telemetry pipeline that records what the other two did.
Mixing these is the most common architectural mistake in LLM serving systems.
1. Client sends request, Load Balancer terminates TLS
A regional L7 load balancer (e.g., Envoy) terminates TLS and routes to the API Gateway fleet. Long-running SSE streams need connection draining on deploys - the LB must hold connections for up to max_tokens x 50ms worst case.
2. API Gateway: auth, tenant resolution, rate limiting
The Gateway validates the API key, resolves the tenant, enforces per-tenant rate limits (tokens/second, RPS), and attaches a trace ID. If the tenant is over quota, it returns HTTP 429 before any GPU resource is touched.
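As a concrete sketch, the gateway-side limiter can be a per-tenant token bucket consulted before any GPU work is scheduled. The class below is illustrative (the name TokenBucket and all parameters are invented for this example), not a prescribed implementation:

```
import time

class TokenBucket:
    """Per-tenant rate limiter; refills continuously at rate_per_sec."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller maps this to HTTP 429, before any GPU is touched
```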
3. Model Router resolves model_id to replica candidates
The Model Router reads from a Model Registry cache (sourced from a durable control plane DB) to find which replicas currently have the target model loaded. For a LoRA adapter, this means any replica running the correct base model - adapters load in ~1 second from a local SSD cache.
4. Load Balancer picks a replica
Among the candidates, the router picks a replica using a composite score (see DD3): if the session_id is already pinned, go there; otherwise prefer a replica with the prompt prefix warm in cache, breaking ties on queue depth.
5. Replica enqueues into a continuous batch
The replica's inference runtime (vLLM / TensorRT-LLM / TGI-style) merges the new request into the continuous batch currently running on the GPU. New tokens from this request interleave with decode steps of other in-flight requests every forward pass.
6. Tokens stream back
Each decoded token is written to the gRPC/HTTP/2 response stream. The Gateway forwards tokens to the client over SSE. When the request hits max_tokens or emits an EOS token, the connection closes cleanly and the request is evicted from the batch.
7. Async observability
Every hop - gateway, router, replica prefill, replica decode - emits structured events to a fire-and-forget telemetry pipeline (Kafka -> ClickHouse for traces, Prometheus for metrics). None of this sits on the critical path.
```
Client
  |
  v
[Regional Load Balancer]   <- TLS termination, connection draining
  |
  v
[API Gateway fleet]        <- auth, rate limit, tenant routing
  |
  v
[Model Router]             <- model_id -> candidate replicas
  |   (reads Model Registry cache)
  v
[Replica LB]               <- session affinity + cache + queue score
  |
  v
[GPU Replica]              <- continuous batching, KV cache, adapters
  |        ^
  |        |--- weight pull ---- [Weight Distribution Layer]
  |                                       ^
  |                                       |
  |                            [Object Store: S3 / GCS]
  v
[Token Stream] -> Gateway -> Client
```
Three asynchronous side-flows sit next to the hot path:
- Weight distribution: replicas pull new artifacts from the object store via the distribution layer (DD2).
- Telemetry: every hop emits events into the fire-and-forget pipeline (Kafka -> ClickHouse, Prometheus).
- Control loops: the registry, rollout engine, and pool auto-scalers reconcile fleet state without touching requests.
Tip Four hard problems define this system. Each maps to one of the questions in the problem statement:
- DD1 - Multi-model multiplexing: serving 10,000+ models on ~5,000 GPUs.
- DD2 - Weight rollout: propagating updates in seconds to minutes, safely.
- DD3 - Cache-aware routing: balancing load without destroying KV-cache locality.
- DD4 - Fault tolerance and scaling: surviving failures under minute-long GPU cold starts.
Tip Where the problem lies
A naive design assigns one model per GPU. With 10,000 models and ~5,000 GPUs, that's already impossible. Even if it weren't, long-tail models would idle 99.9% of the time while head models starved for capacity. The fleet collapses at both ends.
Observation: 99% of "distinct models" are LoRA adapters on ~20 base models.
LoRA (Low-Rank Adaptation) adapters are tiny - typically 10-100 MB - compared to a 140 GB base model. The insight that unlocks multi-tenant LLM serving is: one GPU can hold one base model resident and swap in dozens of LoRA adapters on-the-fly, each adapter costing milliseconds to activate.
The fleet structure follows:
Replica Composition

| Layer | Lifetime | Size | Cost to swap |
|-------|----------|------|--------------|
| Base model weights (in HBM) | Hours-days | 20-500 GB | Cold start: 3-5 min from object store |
| LoRA adapters (in HBM, many) | Seconds-minutes | 10-100 MB each | ~1 s from local SSD, ~50 ms if already in DRAM |
| KV-cache for active sessions | Seconds | Variable (prompt + generation) | Free - already in memory |
Pooling base models. The fleet is partitioned into replica pools - one pool per base model. Each pool is auto-scaled independently based on its traffic. A pool for llama-3-70b might have 200 replicas; a pool for a niche qwen-2-7b might have 4. This is the first and biggest utilization lever.
Adapter multiplexing inside a pool. Within the llama-3-70b pool, every replica can serve every adapter that targets llama-3-70b. The router doesn't care which replica - it just needs one that has the base model. Adapters are cached with an LRU policy on each replica's local NVMe; a cold adapter hit costs a 1-second first-token penalty, which the router tracks and avoids during load spikes.
Per-request adapter selection in the batch. Modern inference runtimes (e.g., vLLM with multi-LoRA, S-LoRA) support running a single forward pass where different tokens in the batch use different adapters. This is what makes the economics work: you pay one GPU compute cost for a batch that serves many tenants on many adapters.
Why not keep adapters only on disk and load on demand? You can, but you pay the 1-second load penalty on every cold adapter hit. At 100k RPS with a long tail of adapters, that's a latency disaster.
The right design is a 3-tier cache:
- HBM (GPU memory) - tens of hot adapters, zero-cost activation
- Host DRAM - hundreds of warm adapters, ~50 ms promote to HBM
- Local NVMe - thousands of adapters, ~1 s promote to DRAM then HBM
- Object store (S3) - ground truth, used only on tier miss
The router keeps a rough bloom-filter-per-replica summary of which adapters are warm and avoids routing cold requests during peak load.
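A minimal sketch of the tiering logic, assuming illustrative slot counts and a hypothetical fetch_from_nvme_or_s3 helper standing in for the bottom tiers:

```
from collections import OrderedDict

def fetch_from_nvme_or_s3(adapter_id: str):
    """Hypothetical stub: pulls adapter weights from NVMe, else object store."""
    return object()

class TieredAdapterCache:
    def __init__(self, hbm_slots: int = 32, dram_slots: int = 512):
        self.hbm = OrderedDict()    # tier 1: zero-cost activation
        self.dram = OrderedDict()   # tier 2: ~50 ms promote to HBM
        self.hbm_slots = hbm_slots
        self.dram_slots = dram_slots

    def activate(self, adapter_id: str):
        if adapter_id in self.hbm:                      # hot hit
            self.hbm.move_to_end(adapter_id)
            return self.hbm[adapter_id]
        if adapter_id in self.dram:                     # warm hit: promote
            weights = self.dram.pop(adapter_id)
        else:                                           # cold: ~1 s penalty
            weights = fetch_from_nvme_or_s3(adapter_id)
        self._admit_to_hbm(adapter_id, weights)
        return weights

    def _admit_to_hbm(self, adapter_id: str, weights) -> None:
        if len(self.hbm) >= self.hbm_slots:             # demote LRU HBM -> DRAM
            old_id, old_weights = self.hbm.popitem(last=False)
            self.dram[old_id] = old_weights
            while len(self.dram) > self.dram_slots:     # NVMe still holds a copy
                self.dram.popitem(last=False)
        self.hbm[adapter_id] = weights
```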
Tip Where the problem lies
"Just push the new weights" hides three failure modes:
The design must solve propagation speed, safety, and in-flight request survival together.
Artifact structure: content-addressed and immutable
Every model artifact is keyed by sha256(weights) in object storage. The Model Registry maps human-readable identifiers (tenant-42/support-bot-v7:stable) to artifact hashes. Updates are just new (hash, version) tuples - never mutations of existing ones. This is what makes the rest of the design safe.
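In code, the invariant is tiny. This sketch (all names illustrative) shows that an "update" is only ever a new pointer, never a mutated blob:

```
import hashlib

def artifact_key(weights: bytes) -> str:
    """Content address: the blob's identity is its hash, so it can never mutate."""
    return "sha256:" + hashlib.sha256(weights).hexdigest()

# Registry: human-readable alias -> immutable artifact hash.
registry: dict[str, str] = {}

def promote(alias: str, key: str) -> None:
    registry[alias] = key            # "tenant-42/support-bot-v7:stable" -> "sha256:..."

def rollback(alias: str, previous_key: str) -> None:
    registry[alias] = previous_key   # pointer flip; zero new data movement
```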
Two propagation paths: adapter vs base model
Weight Propagation by Size

| Artifact type | Typical size | Propagation strategy | Target SLO |
|---------------|--------------|----------------------|------------|
| LoRA adapter | 10-100 MB | Direct S3 + per-region cache (CDN) | under 60 s |
| Full fine-tune or new base model | 20-500 GB | P2P / BitTorrent-style fan-out across replicas in the pool | under 10 min |
The P2P approach (see the separate ML Model Distribution design) is critical for base-model-sized artifacts: the first replica pulls from S3, then other replicas pull chunks from each other in parallel. This makes propagation time roughly O(log N) in the pool size instead of O(N).
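For intuition, a back-of-envelope under an idealized doubling model (each round, every replica that already holds the artifact seeds one more; numbers illustrative):

```
import math

pool_size = 200                            # replicas in the base-model pool
rounds = math.ceil(math.log2(pool_size))   # seeders double each round
print(rounds)                              # 8 parallel rounds vs 200 serial S3 pulls
```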
Progressive rollout (the safety story)
A new version is never deployed to 100% of the fleet at once. The control plane executes a staged rollout:
```
stages = [
  { "percent": 1,   "duration": "5 min",  "auto_halt": ["error_rate > 2%", "p99_latency > 2x"] },
  { "percent": 10,  "duration": "15 min", "auto_halt": [...] },
  { "percent": 50,  "duration": "30 min", "auto_halt": [...] },
  { "percent": 100, "duration": "steady state" },
]
```
At each stage the control plane samples metrics from replicas running the new version and compares them against a control group still running the old version. Any SLO breach triggers automatic rollback: the registry pointer flips back and replicas re-activate the previous version from cache (which they still hold - the artifact is immutable and cache-resident).
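The per-stage gate reduces to a comparison between the canary and control cohorts. A sketch with the thresholds from the stages config above (the cohort objects and their fields are assumed):

```
def should_halt(canary, control) -> bool:
    """True if the canary cohort breaches the stage's auto_halt thresholds."""
    return (
        canary.error_rate > 0.02                          # "error_rate > 2%"
        or canary.p99_latency > 2 * control.p99_latency   # "p99_latency > 2x"
    )
```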
In-flight request survival
When a replica is told to swap in a new version, it does not kill in-flight streams. The runtime uses a double-buffer pattern: load the new version alongside the old, bind all new requests to it immediately, and keep the old version resident until its last in-flight stream completes, then free it.
For adapters, this is near-instantaneous (a pointer swap in the adapter table). For base-model swaps, you accept a brief period of 2x HBM usage per replica - which is why the base-model pool must be provisioned with headroom for a rolling update.
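A sketch of that swap under assumed runtime hooks (the class and method names are invented); the point is that old weights stay referenced until the last in-flight stream drains:

```
import threading

class VersionedSlot:
    """Double-buffered weight slot: new requests bind to `current`,
    old in-flight streams keep a reference to `draining`."""
    def __init__(self, weights):
        self._lock = threading.Lock()
        self.current = weights
        self.draining = None

    def swap_in(self, new_weights) -> None:
        with self._lock:
            self.draining = self.current   # old version kept for in-flight work
            self.current = new_weights     # all new requests see this at once

    def release_old(self) -> None:
        """Called when the last stream on the old version finishes."""
        with self._lock:
            self.draining = None           # frees the temporary 2x HBM overhead
```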
Why content-addressing (sha256 keys) matters so much here. Three things fall out of content-addressing for free:
- Idempotent downloads. If a replica is interrupted mid-pull, it can resume by asking any peer for missing chunks. Chunks verify by hash.
- Instant rollback. The old artifact's hash is still pinned in the fleet. "Roll back" is just a registry pointer flip - zero new data movement.
- Deterministic debugging. A production trace carries the artifact hash. Reproducing a bug means loading exactly that artifact anywhere, forever. No "which commit was deployed at 3am?" confusion.
This is the same principle behind Docker image layers and Git objects. Use it unchanged for model artifacts.
Tip Where the problem lies
Round-robin routing looks appealing for its simplicity. It destroys the KV-cache. In a chat workload where a user's next turn shares the first 2,000 tokens with the previous turn, a cache miss means the replica must re-run prefill for all 2,000 tokens - hundreds of milliseconds of GPU time thrown away. At 100k RPS this alone can tank the fleet.
Load balancing in LLM serving is fundamentally different from stateless web serving. The replica has state you want to reuse.
The router scores each candidate replica on three axes and picks the highest-scoring healthy replica.
Signal 1 - Session affinity (hard preference)
If the request carries a session_id that is already pinned to a replica, go there. The pin is written on first use and expires ~5 minutes after the last token. This is a consistent-hashing-style soft pin - if the pinned replica dies or is drained, the session transparently rebalances with a cache miss (expected, acceptable failure mode).
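A sketch of the pin table, assuming an in-memory map with the ~5-minute idle TTL described above (names illustrative):

```
import time

PIN_TTL_SECONDS = 300   # expire ~5 min after the last token

# session_id -> (replica_id, last_token_time)
pins: dict[str, tuple[str, float]] = {}

def pinned_replica(session_id: str) -> str | None:
    entry = pins.get(session_id)
    if entry and time.time() - entry[1] < PIN_TTL_SECONDS:
        return entry[0]
    return None   # expired or never pinned: fall through to cache/load signals

def pin(session_id: str, replica_id: str) -> None:
    pins[session_id] = (replica_id, time.time())
```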
Signal 2 - Prefix cache hint (soft preference)
For requests without a stable session (or the first turn of a session), the router checks which replicas have recently served a request with the same prompt prefix. Replicas publish a rolling bloom filter of their top-N recent prefix hashes to the router every few seconds. This is cheap, noisy, and good enough.
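A toy version of the replica-side summary; a real deployment would use a rolling or counting filter, and the sizes here are illustrative:

```
import hashlib

class PrefixBloom:
    def __init__(self, bits: int = 1 << 16, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.bitset = 0   # published to the router every few seconds

    def _positions(self, prefix_hash: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{prefix_hash}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, prefix_hash: str) -> None:
        for pos in self._positions(prefix_hash):
            self.bitset |= 1 << pos

    def might_contain(self, prefix_hash: str) -> bool:
        # False positives possible (noisy), false negatives impossible.
        return all((self.bitset >> pos) & 1 for pos in self._positions(prefix_hash))
```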
Signal 3 - Queue depth + utilization (tiebreaker)
Once the candidate set is narrowed by model membership and cache signals, break ties on the replica with the lowest pending_tokens / tokens_per_second ratio. This is closer to "time until free" than raw queue length and maps directly to expected queue time.
```
score = (
    session_match * 100         # Signal 1: dominates if present
    + prefix_cache_hit * 10     # Signal 2: warm-prefix bonus (weight illustrative)
    - expected_queue_seconds    # Signal 3: pending_tokens / tokens_per_second
)
```
Avoiding hot spots
A single tenant with a very popular prompt prefix would hammer one replica forever. The router adds jitter above a utilization threshold: once a replica crosses 85% utilization, the router shifts some of its incoming cache-hit requests to cold replicas, accepting a cache miss to preserve P99 latency.
Why not just use consistent hashing on (session_id)? Consistent hashing solves affinity but ignores load. If one replica happens to get the popular shard, it melts while others idle. Real production systems layer "consistent hashing for affinity" with "bounded loads" on top - route to the preferred replica unless it is more than X% above the mean load, in which case route to the next-best. This is the Power-of-Two-Choices + Bounded-Loads pattern.
That layered approach is what the scoring function above is approximating: prefer affinity, but let load signals override when the cost of stickiness exceeds the benefit.
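A sketch of that layering, with an assumed Replica type exposing a load metric and an illustrative overload factor:

```
import random
from dataclasses import dataclass

@dataclass
class Replica:
    replica_id: str
    load: float   # e.g. pending_tokens / tokens_per_second

OVERLOAD_FACTOR = 1.25   # illustrative: how far above mean load affinity may stretch

def route(pinned: Replica | None, candidates: list[Replica]) -> Replica:
    mean_load = sum(r.load for r in candidates) / len(candidates)
    if pinned is not None and pinned.load <= OVERLOAD_FACTOR * mean_load:
        return pinned                        # affinity wins while load allows
    if len(candidates) == 1:
        return candidates[0]
    a, b = random.sample(candidates, 2)      # power of two choices
    return a if a.load <= b.load else b      # take the less loaded of the pair
```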
What gets streamed back during prefill vs decode? LLM inference has two distinct phases with very different cost profiles:
- Prefill: process the entire prompt in one big GPU-parallel pass. Latency scales with prompt length but is compute-bound and highly parallelizable. Dominates time-to-first-token.
- Decode: generate one token at a time, each step requiring a full forward pass on the growing KV-cache. Memory-bandwidth-bound. Dominates total response time for long generations.
This matters for the router because prefill cost is predictable from prompt length and decode cost is predictable from max_tokens. The router can estimate total replica-time for a request before dispatching, and use that to pick the replica whose queue will drain fastest.
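A sketch of that estimate, with made-up throughput constants standing in for per-model profiling data:

```
PREFILL_TOKENS_PER_SEC = 20_000   # compute-bound; batch-parallel (illustrative)
DECODE_TOKENS_PER_SEC = 40        # memory-bandwidth-bound; per stream (illustrative)

def estimated_replica_seconds(prompt_tokens: int, max_tokens: int) -> float:
    prefill = prompt_tokens / PREFILL_TOKENS_PER_SEC   # drives time-to-first-token
    decode = max_tokens / DECODE_TOKENS_PER_SEC        # drives total response time
    return prefill + decode
```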
Tip Where the problem lies
LLM serving has a uniquely cruel combination: GPU cold starts take minutes, traffic can spike in seconds, individual requests can hold a GPU for tens of seconds, and a single OOM or CUDA error can kill a replica mid-batch and take 50 in-flight streams with it.
You need to (1) detect failure fast, (2) contain its blast radius, (3) scale in a way that respects cold-start latency, and (4) observe the system deeply enough to know any of this is working.
Part 1 - Fault tolerance
Failure Modes and Responses

| Failure | Detection | Response |
|---------|-----------|----------|
| Replica OOM / crash | Missed heartbeats (5 s interval, 15 s timeout) | Router stops sending traffic; orchestrator reschedules pod; in-flight requests fail fast with retriable 503 |
| Slow / stuck replica (noisy neighbor) | Rolling P99 queue time per replica exceeds 3x fleet median | Drain replica: stop new traffic, let in-flight finish, then restart |
| GPU hardware error (ECC, XID) | nvidia-smi + kernel logs via node agent | Cordon node, migrate replicas to healthy nodes, page on-call for hardware replacement |
| Node-level failure | Kubernetes node NotReady | Pods rescheduled; pool auto-scaler backfills |
| Zone outage | Multiple nodes + control-plane health signal lost | Regional LB shifts traffic to other zones; pools in other zones scale up |
| Bad model version rollout | Progressive rollout SLO breach (DD2) | Automatic rollback, incident page |
Retry semantics (critical). Only requests that failed before any token was emitted are safe to retry on a different replica. Once tokens have streamed to the client, a retry would duplicate content. The gateway tags each request with tokens_emitted; if 0 when the replica fails, retry transparently, otherwise surface the error.
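A sketch of the gateway guard, with hypothetical stream_from_replica, ReplicaFailure, and ServiceUnavailable names standing in for the real client plumbing:

```
class ReplicaFailure(Exception): ...
class ServiceUnavailable(Exception): ...

def stream_from_replica(replica, request):
    """Hypothetical stub: yields token deltas from one replica."""
    yield from ()

def serve_with_retry(request, candidates):
    tokens_emitted = 0
    for replica in candidates:
        try:
            for token in stream_from_replica(replica, request):
                tokens_emitted += 1
                yield token               # forwarded to the client over SSE
            return
        except ReplicaFailure:
            if tokens_emitted > 0:
                raise                     # retrying would duplicate streamed content
            continue                      # nothing streamed yet: safe to retry
    raise ServiceUnavailable()            # surfaced as a retriable 503
```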
Part 2 - Monitoring (the metrics that matter)

The Metrics That Actually Page

| Metric | Why it matters | Typical threshold |
|--------|----------------|-------------------|
| Time-to-first-token P99 by model | User-visible latency; breaches SLO before utilization does | under 1.5 s |
| Inter-token latency P99 | Detects slow decode / batch starvation | under 50 ms |
| Queue depth (pending tokens) per pool | Leading indicator of scale-up need | alert above 0.5 s backlog |
| GPU utilization per pool | Cost vs headroom; secondary scale signal | target 70-80% |
| KV-cache hit rate | Routing health; drops imply router misconfig | above 60% for chat workloads |
| Tokens per second per GPU | Capacity planning and regression detection | baseline per model |
Traces are per-request and carry {tenant_id, model_id, adapter_id, replica_id, session_id, prefill_ms, queue_ms, first_token_ms, decode_tokens, error}. ClickHouse is the right store: cheap, fast aggregate queries over billions of events, easy cohort analysis.
Part 3 - Auto-scaling
The scaler works per base-model pool, because that is the unit of capacity. The adapter layer is already multiplexed for free.
```
# Scale up on leading indicators:
if p99_ttft > 1.2 * SLO for 60s:
    provision_replicas(+10%)
elif queue_depth > 0.5s of backlog:
    provision_replicas(+5%)
elif gpu_util > 0.85 for 3 min:
    provision_replicas(+5%)

# Scale down on a trailing indicator:
if gpu_util < 0.40 for 15 min:
    drain_and_remove(1)   # one replica at a time, never all at once
```
The asymmetry is the same lesson as any autoscaler, amplified by the 3-5 minute cold start: scale up on a leading indicator (queue depth, latency P99), scale down on a trailing indicator (sustained low utilization).
Pre-warmed replicas. For head models with spiky traffic, keep N replicas permanently in a "warm pool" - base model loaded, zero traffic, ready to accept routing in under a second. This pays ~5% of fleet cost to buy a 5-minute head start on the next spike.
Challenge Yourself - traffic for model X just 5x'd in 30 seconds. New GPUs take 5 minutes. What do you do? The answer is not "provision more GPUs" - that's too slow to matter for the next 5 minutes. The answer is a layered response:
- Shed load at the edge. The gateway's per-tenant rate limiter tightens automatically when the pool crosses a queue-depth threshold. Free-tier and bulk traffic returns 429 within milliseconds.
- Shift cache-miss tolerant traffic. Non-chat requests (embeddings, completions without session_id) can be routed to adjacent pools if the target pool is saturated - accepting a higher latency but not dropping.
- Borrow capacity from warm pool. Promote 2-3 warm replicas immediately into the hot rotation.
- Now start provisioning cold replicas. They'll arrive in 5 minutes as backup, not as the primary response.
The candidate who says "increase HPA replica count" has not operated this kind of system. The candidate who says "shed first, shift second, provision third" has.
A large-scale LLM serving platform looks like a web service on the surface - requests in, responses out - but every meaningful engineering decision comes from three properties that web services don't have: replicas hold state you want to reuse (the KV-cache), capacity takes minutes to provision and seconds to exhaust, and models are not monolithic but a small set of bases plus thousands of adapters.
When you put all four deep dives together, the shape of the system is inevitable: a small number of base-model pools (auto-scaled independently), each running replicas that multiplex adapters dynamically, fronted by a cache-aware router that blends session affinity with queue depth, updated via content-addressed progressive rollouts, and observed through per-request traces that make every hop accountable.
Tip Lead with the economics, not the architecture.
The first question to answer in this design is not "what services do we need?" - it is "why does a naive per-model deployment fail?" Candidates who can derive the adapter multiplexing requirement from first principles (10k models, 5k GPUs, long-tail traffic) separate themselves immediately.
Tip Name the two latency phases: prefill and decode.
Most candidates treat inference as one number. Senior candidates know that prefill dominates TTFT, decode dominates total latency, and the router needs both to pick a replica well. This one distinction opens up the entire load-balancing discussion.
Tip Pre-empt the rollout question.
An interviewer will ask "how do you update a model." The strongest answer doesn't just describe a staged rollout - it names content-addressed artifacts, explains why immutability enables instant rollback, and distinguishes adapter propagation from base-model propagation. That level of depth in a single answer is rare.
Tip Know what you don't need.
You don't need a database on the hot path. You don't need Kafka between the gateway and the router. You don't need a distributed lock. You need a fast in-memory routing layer, an async telemetry pipeline, and a durable control plane - and the discipline to keep those three planes from bleeding into each other.
Tip How ShowOffer can help.
We've included callouts and open-ended design prompts throughout this write-up - perfect for self-practice or interview discussion. If you'd like feedback or want to walk through this system design in a mock interview setting, book a session with us at ShowOffer.io. We're here to help you sharpen your skills, gain confidence, and land your next big offer.