Practice/Meta/Design Redis
Design Redis
System DesignMust
Problem Statement
Design a global leaderboard system that tracks and ranks millions of players in real-time across competitive online games. The system must handle frequent score updates from active players, support instant retrieval of top rankings, and allow users to query their current rank and nearby competitors with minimal latency.
Consider a multiplayer gaming platform where millions of concurrent users compete across different game modes and tournaments. Each match completion triggers score updates, and players expect to see their new rank within milliseconds. The system must balance write-heavy workloads during peak hours with low-latency read requirements while maintaining consistent rankings across all queries. Your design should handle both the hot path for active tournaments and the long tail of historical leaderboards that still need occasional queries.
Key Requirements
Functional
- Score updates -- Accept score changes for millions of players with automatic rank recalculation
- Top-N queries -- Retrieve the top 100 players with sub-50ms latency
- Rank lookup -- Return any player's current rank and score in under 100ms
- Nearby players -- Show 10 players above and below a given player's position
- Multiple leaderboards -- Support separate rankings per game mode, region, and time period
Non-Functional
- Scalability -- Handle 100 million active players with 50,000 score updates per second during peak hours
- Reliability -- Maintain 99.95% availability with no data loss on infrastructure failures
- Latency -- P95 read latency under 50ms and write acknowledgment within 200ms
- Consistency -- Eventual consistency is acceptable with rank convergence within 5 seconds globally
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Data Structure Selection for Ranking
The core challenge is choosing a data structure that efficiently maintains sorted order while supporting frequent updates and range queries. Naive approaches like sorting on every read or maintaining a single global sorted list create unacceptable bottlenecks.
Hints to consider:
- Sorted sets with skip lists provide O(log N) updates and range queries, similar to Redis ZADD/ZRANGE operations
- Consider sharding strategies that preserve ranking semantics, such as range-based partitioning by score
- Trade off exact ranks versus approximate ranks using probabilistic data structures for massive scale
- Evaluate whether to materialize ranks eagerly on writes or compute them lazily on reads
2. Handling Write Contention and Hot Keys
During major tournaments, popular games create extreme write hotspots where thousands of players update scores simultaneously. A single-threaded bottleneck or coarse-grained locking will create tail latency disasters.
Hints to consider:
- Partition leaderboards by game mode and tournament ID to distribute write load across shards
- Use buffering and micro-batching to aggregate rapid-fire updates from the same player
- Implement optimistic concurrency control or compare-and-swap operations to handle concurrent updates
- Consider write-behind caching patterns where updates are acknowledged quickly and applied asynchronously
3. Efficient Rank Computation
Computing a player's exact rank (how many players have higher scores) requires counting all better-performing players, which becomes prohibitively expensive at scale. The system needs clever indexing or approximation techniques.
Hints to consider:
- Precompute and cache ranks periodically, then interpolate for players whose scores haven't changed
- Use augmented trees or B-trees with subtree counts to enable O(log N) rank queries
- For top-heavy distributions, maintain exact counts for top tiers and approximate counts for lower tiers
- Employ consistent hashing to ensure players always query the same shard for cache locality
4. Scaling Across Geographic Regions
Global games need leaderboards accessible from multiple continents with minimal latency, but maintaining perfect consistency across regions creates coordination overhead that kills performance.
Hints to consider:
- Deploy regional read replicas with eventual consistency for queries, routing users to their nearest datacenter
- Funnel all writes to authoritative shards with async replication to read replicas
- Use multi-level leaderboards where regional boards merge periodically into global rankings
- Accept stale reads with bounded staleness guarantees, displaying "as of N seconds ago" timestamps