Review the Rate Limiters building block for background on token bucket, sliding window, and distributed counter patterns that are central to this problem.
Also review Caching and API Gateways for context on low-latency counter stores and centralized enforcement points.
Design a distributed rate limiting system that can track API requests across multiple servers and enforce limits per user or client to prevent abuse and maintain system performance. The system must support configurable policies, handle high concurrency, and provide clear feedback when requests are throttled.
A distributed rate limiter is a shared enforcement layer that tracks and limits how many API requests a client (user, IP, API key, or service) can make across many servers within a time window. Think of the guardrails behind API gateways at companies like Visa or Stripe that absorb abusive bursts while keeping legitimate traffic fast.
Interviewers ask this to see if you can design low-latency, highly concurrent systems that remain correct under contention and failures. They are looking for your ability to reason about algorithms (token bucket, leaky bucket, sliding window), atomic updates across a fleet, multi-region trade-offs, hot-key mitigation, and clear client signaling (429, Retry-After).
Based on real interview experiences, these are the areas interviewers probe most deeply:
Different algorithms offer different trade-offs between accuracy, memory, and burst handling. Interviewers want to see that you understand the options and can justify your choice.
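To make the trade-offs concrete, here is a minimal single-node sketch of a token bucket, the algorithm most often chosen when burst tolerance matters. All names and parameters are illustrative, not from the text; a production limiter would keep this state in a shared store rather than in process memory.

```python
import time

class TokenBucket:
    """Single-node token bucket: refills at `rate` tokens/sec up to `capacity`.

    Allows short bursts up to `capacity` while bounding the long-run rate.
    """

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full so an initial burst is allowed
        self.last = time.monotonic()

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        # Lazily refill based on elapsed time instead of a background timer.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A sliding window counter trades this burst-friendliness for smoother accuracy; a plain fixed window is cheapest in memory but admits up to 2x the limit at window boundaries.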
Rate limiters concentrate writes on a few hot keys (popular tenants, shared endpoints). You must design for atomic increments, hot-key sharding, and backpressure to avoid lock contention and store meltdown.
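One common hot-key mitigation is to split a single logical counter into several physical shards so concurrent writers rarely contend on the same key. The sketch below uses a plain dict standing in for a Redis cluster, and the shard count and key format are assumptions for illustration.

```python
import random
from collections import defaultdict

SHARDS = 8  # more shards = less contention per key, slightly costlier reads

store = defaultdict(int)  # stands in for a distributed counter store


def incr(key):
    # Spread writes for a hot logical key across SHARDS physical keys.
    shard = random.randrange(SHARDS)
    store[f"{key}:{shard}"] += 1


def count(key):
    # Reads aggregate all shards; fine if reads are far rarer than writes,
    # or if the limit check tolerates a slightly stale sum.
    return sum(store[f"{key}:{s}"] for s in range(SHARDS))
```

The trade-off: each write touches one shard, but an exact read touches all of them, so sharding suits workloads where increments dominate.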
A small number of clients or endpoints often generate disproportionate traffic. Without mitigation, their counters become contention bottlenecks that degrade performance for everyone.
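A complementary mitigation for heavy hitters is a gateway-local negative cache: once a client has been throttled centrally, remember that decision locally until its window resets, so its retries stop hammering the shared store. This is a sketch under assumed names; real deployments would bound the cache size as well.

```python
import time

class DenyCache:
    """Gateway-local negative cache for already-throttled clients."""

    def __init__(self):
        self._until = {}  # client_id -> deadline after which to re-check centrally

    def is_denied(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        deadline = self._until.get(client_id)
        if deadline is None:
            return False
        if now >= deadline:
            del self._until[client_id]  # window passed; fall through to central check
            return False
        return True

    def deny_for(self, client_id, retry_after, now=None):
        # Record the throttle decision locally for the Retry-After duration.
        now = time.monotonic() if now is None else now
        self._until[client_id] = now + retry_after
```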
If the central counter store goes down, you cannot simply block all traffic. Interviewers look for resilience patterns that maintain protection without causing outages.
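The usual resilience pattern is to fail degraded rather than fail closed: when the central store errors, fall back to a conservative per-instance limit instead of rejecting (or blindly admitting) all traffic. A hedged sketch, with the callables and error type as assumptions:

```python
class FallbackLimiter:
    """Try the central store; on failure, enforce a conservative local limit."""

    def __init__(self, central_check, local_limiter):
        self.central_check = central_check   # callable(key) -> bool, may raise
        self.local_limiter = local_limiter   # per-instance check, e.g. limit / N

    def allow(self, key):
        try:
            return self.central_check(key)
        except ConnectionError:
            # Central store unreachable: a local bucket sized at
            # global_limit / instance_count keeps aggregate admission
            # roughly bounded, at the cost of some over- or under-admission.
            return self.local_limiter(key)
```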
For global services, enforcing a single global rate requires cross-region coordination that conflicts with low-latency goals. Interviewers want to see how you handle this tension.
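A common resolution is to enforce locally in each region against a regional quota and reconcile asynchronously, apportioning the global limit by each region's recent traffic share. The function below is an illustrative sketch of that split (the reconciliation feed supplying `recent_share` is assumed, not specified here); note the per-region floor means the quotas can sum slightly above the global limit, a deliberate over-admission margin.

```python
def apportion(global_limit, recent_share, floor=1):
    """Split a global per-window limit across regions proportionally to each
    region's recent traffic share, with a small floor so no region starves."""
    return {region: max(floor, round(global_limit * share))
            for region, share in recent_share.items()}
```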
Start by confirming the scope and constraints with your interviewer. Ask about the number of unique clients, expected QPS, and which identifiers are used for limiting (user ID, API key, IP). Clarify whether limits are per-endpoint or global per client, whether burst tolerance is needed, and what the acceptable over-admission margin is. Confirm latency targets for the rate-check path and whether multi-region enforcement is required.
Sketch the core components: an API gateway that intercepts every request and performs the rate-limit check before forwarding to backend services. A Rate Limiter Service runs the chosen algorithm (token bucket or sliding window counter) against counters stored in a Redis cluster. The gateway queries this service synchronously on the hot path. A Policy Configuration Service backed by a database (DynamoDB or PostgreSQL) stores rate limit rules per client, endpoint, and tier, and pushes updates to all gateway instances via a pub/sub channel. A Metrics Pipeline streams throttle events to Kafka for aggregation in a time-series database, powering dashboards and alerts.
Walk through the critical path: a request arrives at the API gateway, which extracts the client identifier and endpoint. It constructs a composite key (e.g., client_id:endpoint) and sends it to Redis along with a Lua script that atomically reads the current counter, checks it against the configured limit, increments if allowed, and sets a TTL matching the rate window. If the counter exceeds the limit, the gateway returns 429 with a Retry-After header computed from the window expiration time. If Redis is unreachable, the gateway falls back to a local in-memory token bucket for that client, using a conservative limit divided by the number of instances. Discuss how to handle counter key migrations when policies change mid-window without resetting legitimate usage.
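The atomic check-and-increment step can be modeled as follows. In production the logic runs inside Redis as a Lua script (via EVAL) so read, check, increment, and TTL are one atomic operation; the Lua string and the pure-Python model below are illustrative sketches of that semantics, with key names and signatures assumed.

```python
# Illustrative Lua for Redis EVAL: one atomic read-check-increment-expire.
FIXED_WINDOW_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current >= tonumber(ARGV[1]) then
  return 0
end
redis.call('INCR', KEYS[1])
if current == 0 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
return 1
"""


def fixed_window_allow(store, key, limit, window, now):
    """Pure-Python model of the script's semantics; `store` is a dict
    mapping key -> (count, window_expiry_timestamp)."""
    count, expiry = store.get(key, (0, 0.0))
    if now >= expiry:                     # window elapsed: counter resets
        count, expiry = 0, now + window   # the TTL does this in Redis
    if count >= limit:
        retry_after = max(0.0, expiry - now)
        return False, retry_after         # -> 429 with Retry-After header
    store[key] = (count + 1, expiry)
    return True, 0.0
```

The returned `retry_after` is exactly the value the gateway would place in the Retry-After header, derived from the window's expiry rather than a fixed backoff.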
Cover monitoring by tracking per-client throttle rates, p99 rate-check latency, Redis hit rates, and fallback activation frequency. Discuss how to handle policy updates: the configuration service writes new rules to the database and publishes a change event; gateways subscribe and hot-reload rules without restart. Address cost optimization by batching counter increments for low-priority endpoints and using probabilistic counting for analytics-only rate tracking. Mention security concerns: rate limiting itself must be protected against enumeration attacks where adversaries probe to discover exact limits.
"FR and NFR were given, and were also given an interface of a class for implementing the rate Limiter."
Deepen your understanding of the patterns used in this problem: