Design a distributed rate limiting system that can track API requests across multiple servers and enforce limits per user, client, or API key to prevent abuse and maintain system performance. The system must operate at the edge of a large-scale API platform, making sub-millisecond allow/deny decisions for every inbound request while staying accurate across a fleet of hundreds of gateway instances spanning multiple data centers.
Think of the guardrails behind API gateways at companies like Meta, Stripe, or OpenAI that prevent abusive bursts while keeping legitimate traffic fast. A single misbehaving client should not be able to degrade service for everyone else, and the system must handle millions of requests per second without becoming a bottleneck itself. Interviewers ask this to see whether you can design low-latency, highly concurrent systems that remain correct under contention and failures: whether you can reason about rate-limiting algorithms (token bucket, sliding window, leaky bucket), mitigate hot keys, coordinate across regions, and signal limits clearly to clients with 429 status codes and Retry-After headers.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers expect you to compare algorithms and justify your choice based on the requirements. Each algorithm has distinct trade-offs around burst tolerance, memory usage, and implementation complexity.
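For concreteness, here is a minimal single-process token bucket sketch (illustrative only; the distributed version discussed later moves this logic into the counter store). Note the memory trade-off it illustrates: two floats per key, versus a sliding window log that stores one timestamp per request.

```python
import time

class TokenBucket:
    """Minimal single-process token bucket: admits bursts up to `capacity`,
    then sustains `refill_rate` requests per second."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = capacity              # start full so bursts are allowed
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Continuous refill, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A fixed window counter is even cheaper (one integer per key) but allows up to 2x the limit at window boundaries, which is the usual argument for preferring token bucket or sliding window variants.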
Every request is a write (counter increment), and popular tenants or shared endpoints create severe contention on a small number of keys. Interviewers want to see how you keep writes fast under skew.
Rate limiting spans data centers but synchronous cross-region coordination would add unacceptable latency. Interviewers probe how you balance accuracy with availability.
Rate limit rules must be changeable without code deployment. Interviewers want to see how policies are stored, versioned, and distributed to all gateway instances quickly.
Rate limiting is only useful if clients can react to it properly. Poor signaling leads to retry storms and wasted capacity.
Hints to consider:
- X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers on every response so clients can self-throttle before hitting limits
- Retry-After with 429 responses, computed from the token bucket or window reset time, to prevent immediate retries

Start by confirming scope with your interviewer. Ask about the expected request volume, number of distinct rate limit keys (users, API keys, IPs), and whether limits must be enforced globally or per-region. Clarify the tolerance for slight over-admission versus the cost of cross-region latency. Understand whether policies need to support burst allowances, tiered rate plans, or per-endpoint granularity. Confirm whether you must design the policy management system or focus on the enforcement hot path.
Sketch the system as a library or sidecar embedded in each API gateway instance. Requests flow through the gateway, which extracts the rate limit key (user ID, API key, or IP), looks up the applicable policy from a local cache, and calls the rate limiter module. The limiter performs an atomic check-and-increment against a Redis cluster sharded by key hash. Redis Lua scripts implement the token bucket algorithm in a single round trip: read the current token count and last refill timestamp, compute tokens to add based on elapsed time, deduct one token if available, and return allow/deny with remaining quota. Policies are stored in DynamoDB and pushed to gateway caches via a Kafka configuration topic. A metrics pipeline streams throttle events to a time-series database for dashboards and alerting.
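The gateway hot path above can be sketched as follows. The `Decision` type, the `limiter.check` call (which hides the Redis Lua round trip), and the policy-cache shape are illustrative stand-ins, not a specific gateway's API:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    remaining: int      # tokens left, for X-RateLimit-Remaining
    retry_after: float  # seconds until the next token, for Retry-After

def extract_key(headers, user_id=None, client_ip=None):
    # Precedence: explicit API key, then authenticated user, then client IP.
    return headers.get("X-Api-Key") or user_id or client_ip

def handle(headers, user_id, client_ip, policies, limiter):
    key = extract_key(headers, user_id, client_ip)
    policy = policies.get(key, policies["default"])  # from the local policy cache
    d = limiter.check(key, policy)                   # one atomic Redis round trip
    if d.allowed:
        return 200, {"X-RateLimit-Remaining": str(d.remaining)}
    # Round Retry-After up so clients never retry before a token is available.
    return 429, {"Retry-After": str(int(d.retry_after) + 1)}
```

The important property is that the hot path does exactly one remote call per request; everything else (policy lookup, key extraction) is local.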
Walk through the critical path in detail. Each rate limit key maps to a Redis hash with fields: tokens (current count), last_refill (timestamp), and policy_version. A Lua script executes atomically: it reads the hash, calculates elapsed = now - last_refill, computes new_tokens = min(burst_capacity, tokens + elapsed * refill_rate), and if new_tokens >= 1, decrements and returns ALLOWED with remaining count. If not, it returns DENIED with the time until the next token arrives (used for Retry-After). The key has a TTL equal to burst_capacity / refill_rate plus a buffer, so inactive clients' keys expire automatically. For hot keys, gateways maintain a local token bucket that handles the first N requests per window locally, only synchronizing with Redis when the local bucket is exhausted or periodically to reconcile counts.
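The script's logic, transcribed into Python for readability (in production this body runs atomically inside Redis as a Lua script; here `bucket` is a plain dict standing in for the Redis hash):

```python
def token_bucket_check(bucket, now, burst_capacity, refill_rate):
    """One atomic check step over the hash fields {tokens, last_refill}."""
    elapsed = now - bucket["last_refill"]
    # Refill based on elapsed time, capped at burst capacity.
    new_tokens = min(burst_capacity, bucket["tokens"] + elapsed * refill_rate)
    bucket["last_refill"] = now
    if new_tokens >= 1:
        bucket["tokens"] = new_tokens - 1
        # Remaining quota, reported via X-RateLimit-Remaining.
        return {"allowed": True, "remaining": int(bucket["tokens"])}
    bucket["tokens"] = new_tokens
    # Seconds until the next whole token arrives, used for Retry-After.
    return {"allowed": False, "retry_after": (1 - new_tokens) / refill_rate}
```

The matching key TTL (burst_capacity / refill_rate plus a buffer, as described above) would be set in the same script call so inactive keys expire without a separate cleanup job.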
Cover multi-region deployment: each region has its own Redis cluster with the global limit split proportionally. An asynchronous reconciliation job runs every few seconds, summing cross-region usage and adjusting regional allocations. If one region sees a traffic spike, the reconciler tightens its allocation and loosens others.

For failure modes, implement a three-tier fallback: primary Redis, replica Redis, then a local in-memory bucket with conservative limits. Monitor Redis latency and error rates with circuit breakers that trip after 3 consecutive failures within 1 second.

For observability, emit structured events (key, policy, decision, latency) to Kafka and aggregate in ClickHouse for dashboards showing throttle rates by tenant, endpoint, and region. Alert on sudden spikes in 429 rates or on counter-store latency degradation.

Finally, discuss abuse detection: if a single client consistently hits limits across many endpoints, escalate it to a block list evaluated before the rate limiter to save counter store capacity.
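The reconciler's proportional-split step might be sketched like this. The per-region floor and the even-split fallback are illustrative choices (the text only specifies proportional adjustment); a floor lets a quiet region absorb a sudden local spike before the next reconciliation cycle:

```python
def reallocate(global_limit, regional_usage, floor_fraction=0.05):
    """Split a global limit across regions in proportion to recent usage,
    guaranteeing each region a small floor allocation."""
    total = sum(regional_usage.values())
    if total == 0:
        # No recent traffic anywhere: split evenly.
        share = global_limit / len(regional_usage)
        return {region: share for region in regional_usage}
    floor = global_limit * floor_fraction
    spendable = global_limit - floor * len(regional_usage)
    return {region: floor + spendable * (used / total)
            for region, used in regional_usage.items()}
```

Real deployments would also smooth these adjustments over several cycles to avoid oscillation when traffic shifts between regions.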