Design a Distributed Counter
System Design
Build a centralized rate limiting service that protects backend APIs from abuse and overload by enforcing per-user, per-IP, or per-API-key request quotas across a fleet of distributed web servers. The service must handle millions of requests per second from thousands of microservices, returning allow/deny decisions in single-digit milliseconds while accurately tracking usage windows (e.g., 1000 requests per minute) and gracefully handling partial failures.
Unlike a simple in-memory throttle on one machine, this system must coordinate state across many nodes to prevent a malicious client from bypassing limits by hitting different servers. It also needs to support multiple rate limit policies simultaneously—some users get higher quotas, some APIs have burst allowances, and rules can change dynamically without restarts. Interviewers use this question to assess your understanding of distributed counting, cache coherence, consistency tradeoffs, and how to scale a write-heavy, latency-critical service without creating bottlenecks.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Every rate limit check is a read-modify-write on a counter. If all requests for a popular user funnel through one machine or one database row, you create a serialization bottleneck that caps throughput and latency.
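One standard way to relieve a hot counter is to shard it: split the single logical counter into N sub-counters so concurrent increments contend on different keys, at the cost of summing shards on read. A minimal in-memory sketch (the class and names are illustrative, not part of the original design):

```python
import random

class ShardedCounter:
    """One logical counter split into N shards. Writers pick a random
    shard, spreading increment contention; readers sum all shards."""

    def __init__(self, num_shards: int = 8):
        self.shards = [0] * num_shards

    def increment(self) -> None:
        # Each writer lands on a random shard instead of one hot key.
        idx = random.randrange(len(self.shards))
        self.shards[idx] += 1

    def value(self) -> int:
        # Reads pay the cost of aggregating every shard.
        return sum(self.shards)
```

The same idea applies to shared storage: key each shard as `{user_id}:{shard_idx}` and sum on read, trading slower reads for parallel writes.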
Strict global consistency would require distributed locks or consensus on every check, killing latency. But pure eventual consistency might let a user burst past limits by hitting multiple servers before counters converge.
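A common middle ground is bounded staleness: each limiter node borrows a small batch of tokens from the shared store and serves checks locally, touching shared state only when the batch runs out. A client can then overshoot by at most (nodes × batch size) requests, which is usually acceptable. A sketch under assumed names (`SharedPool` stands in for Redis or DynamoDB):

```python
class SharedPool:
    """Stand-in for a shared store; grants tokens atomically."""
    def __init__(self, limit: int):
        self.remaining = limit

    def borrow(self, batch: int) -> int:
        grant = min(batch, self.remaining)
        self.remaining -= grant
        return grant

class NodeLimiter:
    """Serves checks from a local batch; refills from the pool only
    when empty, so the overshoot is bounded by nodes * batch."""
    def __init__(self, pool: SharedPool, batch: int = 10):
        self.pool, self.batch, self.local = pool, batch, 0

    def allow(self) -> bool:
        if self.local == 0:
            self.local = self.pool.borrow(self.batch)
        if self.local == 0:
            return False  # shared quota exhausted
        self.local -= 1
        return True
```

Tuning the batch size moves you along the spectrum: batch = 1 is strict (every check hits shared storage), large batches are cheap but loose.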
Rate limits are time-bound (requests per minute), so accurate timekeeping across distributed nodes is critical. Clock drift can cause some servers to reset windows early or late, creating gaps or false denials.
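One mitigation for hard window edges is the sliding-window counter: instead of resetting abruptly at a boundary, weight the previous fixed window's count by how much of it still overlaps the sliding window. A sketch of the estimate (assumes roughly uniform arrivals within a window):

```python
def sliding_window_count(prev_count: int, curr_count: int,
                         window_secs: float, elapsed_in_curr: float) -> float:
    """Estimate requests in the trailing `window_secs` by blending the
    previous fixed window's count (weighted by its remaining overlap
    with the sliding window) with the current window's count."""
    overlap = max(0.0, (window_secs - elapsed_in_curr) / window_secs)
    return prev_count * overlap + curr_count
```

For example, 15 s into a 60 s window with 100 requests last window and 20 so far, the estimate is 100 × 0.75 + 20 = 95. This smooths boundary bursts and also makes the limiter less sensitive to small clock disagreements about exactly when a window rolls over.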
If the central rate limiter becomes unavailable, should all traffic be blocked (fail closed) or allowed (fail open)? Both have risks—fail closed causes outages, fail open invites abuse.
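A pragmatic answer is a degraded mode between the two extremes: when the shared limiter is unreachable, fall back to a conservative per-node cap rather than blanket allow or deny. A sketch with hypothetical names (`check_remote` stands in for the call to the limiter cluster):

```python
class FallbackLimiter:
    """Wraps a remote rate-limit check. On connection failure, allows
    up to a small per-node budget instead of failing fully open or
    fully closed (illustrative sketch; the cap is an assumption)."""

    def __init__(self, check_remote, local_cap: int):
        self.check_remote = check_remote
        self.local_cap = local_cap
        self.local_used = 0

    def allow(self, key: str) -> bool:
        try:
            return self.check_remote(key)
        except ConnectionError:
            # Degraded mode: bounded fail-open.
            if self.local_used < self.local_cap:
                self.local_used += 1
                return True
            return False
```

In an interview, stating the blast radius explicitly (worst case: nodes × local cap extra requests during an outage) shows you have reasoned about the tradeoff rather than picked a side.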
Start by confirming scale, accuracy expectations, and failure behavior. Clarify whether limits are per-user, per-IP, per-API-key, or a combination, and whether rules are static or dynamically configurable by operators.
Sketch a layered design with these core components:
Client-facing edge: Web servers or API gateways that intercept requests and call the rate limiter before forwarding to backend services. These nodes maintain local in-memory caches of recent counters to serve hot paths without network hops.
Rate limiter cluster: A fleet of stateless service nodes that receive check requests, look up current usage from shared storage, apply the algorithm (token bucket, sliding window), and return allow/deny. These nodes are horizontally scaled and sit behind a load balancer.
Shared storage: Redis or DynamoDB holding current counter values keyed by {user_id, window_start_time}. Redis supports atomic increments and Lua scripts for complex logic; DynamoDB offers global replication and conditional writes for multi-region setups.
Configuration service: A control plane (backed by a database or ZooKeeper) that stores rate limit policies. Changes propagate to rate limiter nodes via polling or push notifications, allowing dynamic updates without redeploy.
Monitoring and logging: Emit metrics on allowed/denied requests, latency percentiles, and cache hit rates. Log violations for security analysis and capacity planning.