Design a Distributed Cache System
Problem Statement
Design a distributed caching system that can serve data with low latency across multiple geographic locations, handling cache misses, replication, and scaling to support billions of records. Think of Redis or Memcached deployed as a managed service that sits in front of databases and APIs to deliver microsecond-to-millisecond latency reads. Users read and write key-value entries with optional TTLs, and the cache offloads backend systems while keeping latency predictable.
Interviewers ask this because it bundles core distributed systems skills: sharding and client routing, consistency versus freshness, replication and failover, multi-region design, hot-key contention, and cache invalidation. You are expected to reason about tradeoffs (cache-aside vs write-through), manage thundering herds, and describe concrete mechanisms that keep the system fast and correct under failures at global scale.
Key Requirements
Functional
- Key-value operations -- clients can set, get, and delete entries with optional per-key TTLs
- Low-latency reads -- data is served from the nearest available cache node or region with sub-millisecond to single-digit millisecond latency
- Eviction policies -- configurable eviction behavior (LRU, LFU) to stay within memory limits
- Cache invalidation -- precise invalidation (single key) or broad invalidation (namespace/tag) when underlying data changes
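To make the eviction requirement concrete, here is a minimal LRU sketch built on Python's `OrderedDict`; a production cache would use an intrusive doubly linked list with per-shard locks, but the eviction logic is the same idea:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: at capacity, evicts the least-recently-used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.store:
            return None  # cache miss
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def set(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
```

An LFU variant would additionally track access counts per key; the interface stays the same.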
Non-Functional
- Scalability -- support millions of operations per second with petabyte-scale data distributed across thousands of nodes
- Reliability -- tolerate individual node failures and regional outages without data loss or extended downtime; automatic failover
- Latency -- p99 read latency under 1ms for cache hits; p99 write latency under 5ms
- Consistency -- eventual consistency for cross-region replication; read-after-write consistency within a region
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Sharding and Data Distribution
How you partition data across cache nodes determines scalability, load balance, and resilience to node changes. Naive approaches such as modulo hashing remap most keys whenever the node count changes, causing massive cache churn during scale events.
Hints to consider:
- Use consistent hashing with virtual nodes to distribute keys evenly and minimize remapping when nodes join or leave
- Assign each key to a primary node and one or more replicas for fault tolerance
- Implement client-side or proxy-side routing that maintains a ring topology map, updated via a coordination service
- Handle key migration gracefully: during rebalancing, route reads to both old and new owners to avoid cache misses
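The first two hints can be sketched with a simple consistent-hash ring. This is a hypothetical illustration (class and method names are invented, and MD5 stands in for whatever hash the real system uses); each physical node is placed on the ring at many virtual points so that removing a node only remaps the keys it owned:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node gets `vnodes` points for even key distribution.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]
```

Replicas fall out naturally: the next N distinct nodes clockwise from the primary hold the copies. The key property is that removing a node leaves every other key's mapping untouched.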
2. Cache Coherence and Invalidation
Stale data is the most common problem with caching. Interviewers want concrete mechanisms for keeping cache entries fresh.
Hints to consider:
- Implement cache-aside (lazy loading) as the default pattern: read from cache, on miss read from DB and populate cache
- Support write-through or write-behind for workloads that require immediate consistency
- Use TTL jitter (randomize expiration times) to prevent synchronized mass expiration that causes thundering herds
- Propagate invalidations across regions using a pub/sub message bus (e.g., Kafka) for near-real-time coherence
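The cache-aside pattern plus TTL jitter can be sketched in a few lines. This assumes hypothetical dict-like `cache` and `db` interfaces; a real client would talk to Redis/Memcached and a database driver instead:

```python
import random
import time

def get_with_cache_aside(cache, db, key, base_ttl=300):
    """Cache-aside read: serve from cache; on miss, load from the backing
    store and populate the cache. TTL jitter (±10%) spreads expirations so
    entries written together don't all expire in the same instant."""
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.time() < expires_at:
            return value  # fresh cache hit
    value = db[key]  # cache miss (or expired): read from the backing store
    ttl = base_ttl * random.uniform(0.9, 1.1)  # jittered expiration
    cache[key] = (value, time.time() + ttl)
    return value
```

Write-through would update `db` and `cache` together on the write path; write-behind would acknowledge the cache write and flush to `db` asynchronously.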
3. Handling Hot Keys and Thundering Herds
Popular keys and synchronized expirations create load spikes that can cascade to the backend. Interviewers look for proactive mitigation strategies.
Hints to consider:
- Implement single-flight (request coalescing) on cache misses: only one request fetches from the backend while others wait
- Replicate extremely hot keys across multiple nodes so reads are distributed
- Use negative caching (cache misses with short TTLs) to prevent repeated expensive lookups for non-existent keys
- Apply circuit breakers to protect the backend when cache miss rates spike
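Single-flight coalescing from the first hint can be sketched like this (a hypothetical in-process version; Go's `golang.org/x/sync/singleflight` is the well-known library form of the same idea):

```python
import threading

class SingleFlight:
    """Request coalescing: concurrent misses for the same key share one
    backend fetch instead of stampeding the database (illustrative sketch;
    results are kept indefinitely for simplicity)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Event guarding an in-progress fetch
        self._results = {}

    def do(self, key, fetch):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                # No fetch in progress: this caller becomes the leader.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            try:
                self._results[key] = fetch()
            finally:
                event.set()  # wake waiting followers
                with self._lock:
                    del self._inflight[key]
        else:
            event.wait()  # follower: wait for the leader's result
        return self._results[key]
```

Under a thundering herd, one request hits the backend per key while the rest wait on the event, turning N backend calls into one.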
4. Replication and Failover
Single-instance shards are single points of failure. Interviewers expect a replication strategy with automated recovery.
Hints to consider:
- Deploy primary-replica pairs for each shard, with synchronous replication within a region and asynchronous cross-region
- Use health checks and leader election (via ZooKeeper or etcd) for automatic failover when a primary fails
- Handle split-brain scenarios with fencing tokens to prevent stale primaries from accepting writes
- Design replica catch-up after failover to minimize data loss and restore full redundancy quickly
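The fencing-token hint can be made concrete with a tiny sketch. Assume the coordination service (ZooKeeper/etcd) hands each newly elected primary a monotonically increasing epoch number; the storage side then rejects writes carrying a stale token, so a deposed primary that still thinks it is the leader cannot corrupt data (interfaces here are invented for illustration):

```python
class FencedStore:
    """Replica-side write path that enforces fencing tokens.
    `token` is the epoch issued by the coordination service at election."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            # A newer primary has been elected: reject the stale writer.
            raise RuntimeError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value
```

Once any write with epoch N has been accepted, the old primary's epoch N-1 writes are refused everywhere, which is exactly the split-brain protection the hint describes.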
Suggested Approach
Step 1: Clarify Requirements
Confirm scope with your interviewer. Ask about expected data volume (total keys, average value size), operation rates (reads/writes per second), geographic distribution requirements, and consistency expectations. Determine if the cache is for a specific use case (session store, database cache, API response cache) or a general-purpose system. Clarify whether multi-tenancy is required and what eviction behavior is expected.