Design a Rate Limiter — JPMorgan Chase

Problem Statement

Design a distributed rate limiting system that can track API requests across multiple servers and enforce limits per user, client, or API key to prevent abuse and maintain system performance. The system must operate at the edge of a large-scale API platform, making sub-millisecond allow/deny decisions for every inbound request while staying accurate across a fleet of hundreds of gateway instances spanning multiple data centers.

Think of the guardrails behind API gateways at companies like Meta, Stripe, or OpenAI that prevent abusive bursts while keeping legitimate traffic fast. A single misbehaving client should not be able to degrade service for everyone else, and the system must handle millions of requests per second without becoming a bottleneck itself. Interviewers ask this to see if you can design low-latency, highly concurrent systems that remain correct under contention and failures, reason about rate limiting algorithms (token bucket, sliding window, leaky bucket), handle hot-key mitigation, multi-region coordination, and clear client signaling with 429 status codes and Retry-After headers.

Key Requirements

Functional

Distributed enforcement -- Enforce request limits consistently across multiple servers and regions for identities like user ID, API key, IP address, or tenant
Flexible policies -- Support different limit configurations per endpoint, burst versus sustained rates, and per-tenant quotas, updatable without downtime or redeployment
Client feedback -- Return appropriate 429 status codes with Retry-After headers and remaining quota information so clients can implement intelligent backoff
Real-time observability -- Expose usage and throttling metrics in near real time so operators can understand policy effects, detect abuse, and tune limits

Non-Functional

Scalability -- Handle 5 million rate limit checks per second across all gateway instances with horizontal scaling as traffic grows
Latency -- p99 decision latency under 5 milliseconds including the counter lookup, to avoid adding meaningful overhead to every API request
Reliability -- Degrade gracefully if the counter store is unavailable, falling back to local limiting rather than blocking all traffic or allowing unlimited requests
Consistency -- Accept approximate enforcement (slight over-admission) to avoid cross-datacenter synchronous coordination, with per-region counters reconciled asynchronously

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Rate Limiting Algorithm Selection

Interviewers expect you to compare algorithms and justify your choice based on the requirements. Each algorithm has distinct trade-offs around burst tolerance, memory usage, and implementation complexity.

Hints to consider:

Token bucket allows configurable burst capacity separate from sustained rate, making it popular for APIs that tolerate short spikes but cap long-term throughput
Sliding window log tracks exact timestamps per request for precise enforcement but consumes more memory per client than counter-based approaches
Sliding window counter approximates the log approach using two fixed windows and a weighted average, balancing accuracy with constant memory per key
Fixed window is simplest but allows double-burst at window boundaries; discuss this limitation and how sliding approaches mitigate it

2. Distributed Counter Design and Hot Keys

Every request is a write (counter increment), and popular tenants or shared endpoints create severe contention on a small number of keys. Interviewers want to see how you keep writes fast under skew.

Hints to consider:

Use Redis with Lua scripting for atomic check-and-increment operations, keeping the entire decision in a single round trip to avoid race conditions
Shard counters across multiple Redis instances by hashing the rate limit key, ensuring no single instance becomes a bottleneck
For extremely hot keys (a single tenant doing millions of requests per second), use local per-instance counters with periodic synchronization to a central store
Set short TTLs on counter keys matching the rate window so expired counters are garbage collected automatically without background cleanup

3. Multi-Region and Failure Handling

Rate limiting spans data centers but synchronous cross-region coordination would add unacceptable latency. Interviewers probe how you balance accuracy with availability.

Hints to consider:

Use per-region counters with asynchronous replication, accepting that a client could slightly exceed global limits during the replication lag
Split the global quota across regions proportionally to expected traffic and rebalance periodically based on observed load
When the central counter store is unreachable, fall back to local in-memory token buckets with conservative limits to prevent complete bypass
Implement circuit breakers around Redis calls so a counter store outage does not cascade into gateway-level failures

4. Policy Configuration and Propagation

Rate limit rules must be changeable without code deployment. Interviewers want to see how policies are stored, versioned, and distributed to all gateway instances quickly.

Hints to consider:

Store policies in a durable configuration store (DynamoDB or PostgreSQL) and cache them in each gateway instance with short TTLs
Use a push-based propagation mechanism (Kafka topic or configuration bus) to distribute policy changes to all gateways within seconds
Version policies so gateways can detect stale configurations and self-heal by re-fetching from the source of truth
Support policy hierarchies: global defaults, per-tier overrides, per-tenant exceptions, evaluated in priority order

5. Client Communication and Developer Experience

Rate limiting is only useful if clients can react to it properly. Poor signaling leads to retry storms and wasted capacity.

Hints to consider:

Include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers on every response so clients can self-throttle before hitting limits
Return Retry-After with 429 responses, computed from the token bucket or window reset time, to prevent immediate retries
Provide a quota inquiry endpoint so clients can check their current usage without consuming quota
Log throttled requests with client identity, endpoint, and policy details for debugging and abuse investigation

Suggested Approach

Step 1: Clarify Requirements

Start by confirming scope with your interviewer. Ask about the expected request volume, number of distinct rate limit keys (users, API keys, IPs), and whether limits must be enforced globally or per-region. Clarify the tolerance for slight over-admission versus the cost of cross-region latency. Understand whether policies need to support burst allowances, tiered rate plans, or per-endpoint granularity. Confirm whether you must design the policy management system or focus on the enforcement hot path.

Step 2: High-Level Architecture

Sketch the system as a library or sidecar embedded in each API gateway instance. Requests flow through the gateway, which extracts the rate limit key (user ID, API key, or IP), looks up the applicable policy from a local cache, and calls the rate limiter module. The limiter performs an atomic check-and-increment against a Redis cluster sharded by key hash. Redis Lua scripts implement the token bucket algorithm in a single round trip: read the current token count and last refill timestamp, compute tokens to add based on elapsed time, deduct one token if available, and return allow/deny with remaining quota. Policies are stored in DynamoDB and pushed to gateway caches via a Kafka configuration topic. A metrics pipeline streams throttle events to a time-series database for dashboards and alerting.

Step 3: Deep Dive on the Token Bucket in Redis

Walk through the critical path in detail. Each rate limit key maps to a Redis hash with fields: tokens (current count), last_refill (timestamp), and policy_version. A Lua script executes atomically: it reads the hash, calculates elapsed = now - last_refill, computes new_tokens = min(burst_capacity, tokens + elapsed * refill_rate), and if new_tokens >= 1, decrements and returns ALLOWED with remaining count. If not, it returns DENIED with the time until the next token arrives (used for Retry-After). The key has a TTL equal to burst_capacity / refill_rate plus a buffer, so inactive clients' keys expire automatically. For hot keys, gateways maintain a local token bucket that handles the first N requests per window locally, only synchronizing with Redis when the local bucket is exhausted or periodically to reconcile counts.

Step 4: Address Secondary Concerns

Related Learning

Deepen your understanding of the patterns used in this problem:

Rate Limiters -- algorithm comparison, token bucket mechanics, and distributed enforcement patterns
API Gateways -- the enforcement point for rate limiting, authentication, and request routing
Caching -- Redis as a low-latency counter store and policy cache layer
Ad Click Aggregator -- high-throughput counter aggregation and hot-key sharding patterns
Databases -- DynamoDB for durable policy and quota configuration storage

Problem Statement

Key Requirements

Functional

Distributed enforcement -- Enforce request limits consistently across multiple servers and regions for identities like user ID, API key, IP address, or tenant
Flexible policies -- Support different limit configurations per endpoint, burst versus sustained rates, and per-tenant quotas, updatable without downtime or redeployment
Client feedback -- Return appropriate 429 status codes with Retry-After headers and remaining quota information so clients can implement intelligent backoff
Real-time observability -- Expose usage and throttling metrics in near real time so operators can understand policy effects, detect abuse, and tune limits

Non-Functional

Scalability -- Handle 5 million rate limit checks per second across all gateway instances with horizontal scaling as traffic grows
Latency -- p99 decision latency under 5 milliseconds including the counter lookup, to avoid adding meaningful overhead to every API request
Reliability -- Degrade gracefully if the counter store is unavailable, falling back to local limiting rather than blocking all traffic or allowing unlimited requests
Consistency -- Accept approximate enforcement (slight over-admission) to avoid cross-datacenter synchronous coordination, with per-region counters reconciled asynchronously

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Rate Limiting Algorithm Selection

Hints to consider:

Token bucket allows configurable burst capacity separate from sustained rate, making it popular for APIs that tolerate short spikes but cap long-term throughput
Sliding window log tracks exact timestamps per request for precise enforcement but consumes more memory per client than counter-based approaches
Sliding window counter approximates the log approach using two fixed windows and a weighted average, balancing accuracy with constant memory per key
Fixed window is simplest but allows double-burst at window boundaries; discuss this limitation and how sliding approaches mitigate it

2. Distributed Counter Design and Hot Keys

Every request is a write (counter increment), and popular tenants or shared endpoints create severe contention on a small number of keys. Interviewers want to see how you keep writes fast under skew.

Hints to consider:

Use Redis with Lua scripting for atomic check-and-increment operations, keeping the entire decision in a single round trip to avoid race conditions
Shard counters across multiple Redis instances by hashing the rate limit key, ensuring no single instance becomes a bottleneck
For extremely hot keys (a single tenant doing millions of requests per second), use local per-instance counters with periodic synchronization to a central store
Set short TTLs on counter keys matching the rate window so expired counters are garbage collected automatically without background cleanup

3. Multi-Region and Failure Handling

Rate limiting spans data centers but synchronous cross-region coordination would add unacceptable latency. Interviewers probe how you balance accuracy with availability.

Hints to consider:

Use per-region counters with asynchronous replication, accepting that a client could slightly exceed global limits during the replication lag
Split the global quota across regions proportionally to expected traffic and rebalance periodically based on observed load
When the central counter store is unreachable, fall back to local in-memory token buckets with conservative limits to prevent complete bypass
Implement circuit breakers around Redis calls so a counter store outage does not cascade into gateway-level failures

4. Policy Configuration and Propagation

Rate limit rules must be changeable without code deployment. Interviewers want to see how policies are stored, versioned, and distributed to all gateway instances quickly.

Hints to consider:

Store policies in a durable configuration store (DynamoDB or PostgreSQL) and cache them in each gateway instance with short TTLs
Use a push-based propagation mechanism (Kafka topic or configuration bus) to distribute policy changes to all gateways within seconds
Version policies so gateways can detect stale configurations and self-heal by re-fetching from the source of truth
Support policy hierarchies: global defaults, per-tier overrides, per-tenant exceptions, evaluated in priority order

5. Client Communication and Developer Experience

Rate limiting is only useful if clients can react to it properly. Poor signaling leads to retry storms and wasted capacity.

Hints to consider:

Include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers on every response so clients can self-throttle before hitting limits
Return Retry-After with 429 responses, computed from the token bucket or window reset time, to prevent immediate retries
Provide a quota inquiry endpoint so clients can check their current usage without consuming quota
Log throttled requests with client identity, endpoint, and policy details for debugging and abuse investigation

Suggested Approach

Step 1: Clarify Requirements

Step 2: High-Level Architecture

Step 3: Deep Dive on the Token Bucket in Redis

Step 4: Address Secondary Concerns

Related Learning

Deepen your understanding of the patterns used in this problem:

Rate Limiters -- algorithm comparison, token bucket mechanics, and distributed enforcement patterns
API Gateways -- the enforcement point for rate limiting, authentication, and request routing
Caching -- Redis as a low-latency counter store and policy cache layer
Ad Click Aggregator -- high-throughput counter aggregation and hot-key sharding patterns
Databases -- DynamoDB for durable policy and quota configuration storage