For a full example answer with detailed architecture diagrams and deep dives, see our Design Top-K guide. The guide covers streaming aggregation, windowed counting, and serving precomputed rankings at scale.
Also review the Caching and Message Queues building blocks for background on in-memory ranking stores and durable event pipelines.
Design a system that efficiently retrieves the top-K items -- songs, videos, hashtags, restaurant listings, or any entity -- ranked by user activity or engagement metrics within configurable time windows. Users and downstream services query endpoints like "top 10 songs in the last 7 days" or "trending hashtags in the past 24 hours," and expect fresh, deterministic results returned in milliseconds.
The system ingests millions of scoring events per second from diverse sources (mobile apps, web clients, backend services), aggregates them across multiple rolling windows (hourly, daily, weekly, all-time), and serves low-latency ranked lists. The primary challenges are handling extreme write skew when content goes viral, maintaining accurate rankings despite out-of-order and late-arriving events, supporting segmentation by geography or category, and keeping infrastructure costs reasonable while meeting sub-100ms read latency targets.
Bloomberg relies heavily on real-time data aggregation for financial terminals, making this a natural fit for interviews that test streaming pipeline design, hot key mitigation, and read-path optimization.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see how you handle millions of incoming score events without creating database hotspots or losing data. They test whether you understand stream processing, event deduplication, and the difference between raw events and aggregated counts.
Hints to consider:
Viral content creates enormous skew where a single item receives thousands of times more events than average. Interviewers assess whether you recognize this problem and can propose solutions beyond naive sharding.
Hints to consider:
Split a hot item's counter across multiple sub-keys (for example, item_123_shard_0 through item_123_shard_7) and merge them periodically with a combiner job.

Computing top-K by scanning all items on every query is a critical mistake. Interviewers look for pre-computation strategies, appropriate data structures, and cache-friendly designs.
Hints to consider:
Maintain a bounded min-heap of size K so selecting the top items costs O(N log K) instead of an O(N log N) full sort.

Different time windows have different update patterns and expiration logic. Interviewers probe whether you understand sliding versus tumbling windows, retention policies, and how to balance freshness with cost.
Hints to consider:
Start by confirming scope and constraints with your interviewer. Ask about the types of events being tracked (views, plays, likes, purchases) and expected volume per second. Clarify how many time windows must be supported simultaneously and whether windows can overlap. Determine the number of unique segments (geography, category, user cohorts) and how many distinct segment combinations exist. Confirm acceptable staleness for results -- can leaderboards be 10 seconds behind, or must they update in near real-time? Ask about tie-breaking rules and whether clients need exact counts alongside rankings.
Sketch three main pipelines:
Ingestion Layer: Event producers (app servers, mobile clients, partner feeds) write scoring events to Kafka topics partitioned by item ID. Kafka provides durability, replay capability, and backpressure handling. Each event includes item_id, event_type, score_delta, timestamp, and segment metadata.
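The event schema and item-keyed partitioning can be sketched as follows. Field names follow the schema above; the `partition_for` helper and the CRC32 hash choice are illustrative assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass, field
import uuid
import zlib

@dataclass
class ScoringEvent:
    """One scoring event; event_id is included to support deduplication."""
    item_id: str
    event_type: str      # e.g. "play", "like", "purchase"
    score_delta: int
    timestamp: float     # event time, in epoch seconds
    segment: dict        # e.g. {"geo": "us", "category": "music"}
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def partition_for(event: ScoringEvent, num_partitions: int) -> int:
    """Partition by item_id so every event for an item lands on the same
    Kafka partition, giving the aggregator per-item ordering."""
    return zlib.crc32(event.item_id.encode()) % num_partitions
```

Keying the partition on item_id (rather than round-robin) is what makes per-item counting in the aggregation layer straightforward; it is also what creates the hot-key problem addressed later.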
Aggregation Layer: Apache Flink jobs consume from Kafka, performing stateful stream processing. For each window (1 hour, 24 hours, 7 days), maintain in-memory counters per item using keyed state. Apply tumbling or sliding window logic with watermarks for late data handling. On window close (or at regular intervals for sliding windows), emit aggregated scores and top-K candidates to output topics and directly to the serving layer.
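A minimal pure-Python sketch of the keyed, event-time counting described above. Flink's keyed state and watermark machinery are far richer than this; the window size and lateness constants are illustrative.

```python
from collections import defaultdict

WINDOW = 3600            # 1-hour tumbling windows, in seconds
ALLOWED_LATENESS = 300   # watermark trails max seen timestamp by 5 minutes

class TumblingAggregator:
    """Event-time tumbling-window counting keyed by item."""
    def __init__(self):
        # window_start -> item_id -> accumulated score
        self.windows = defaultdict(lambda: defaultdict(int))
        self.max_ts = 0.0

    def watermark(self) -> float:
        return self.max_ts - ALLOWED_LATENESS

    def on_event(self, item_id: str, score_delta: int, ts: float) -> bool:
        """Fold one event into its window; drop events behind the watermark."""
        self.max_ts = max(self.max_ts, ts)
        if ts < self.watermark():
            return False  # too late: its window may already have closed
        window_start = int(ts // WINDOW) * WINDOW
        self.windows[window_start][item_id] += score_delta
        return True
```

Sliding windows can be layered on top of this by summing several adjacent tumbling windows at emit time, which is a common way to avoid storing overlapping state.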
Serving Layer: A Redis cluster stores pre-computed top-K sorted sets per window and segment combination, using key patterns like leaderboard:daily:us:music. API servers query Redis with sub-10ms latency, apply an application-level cache with 30-second TTL, and handle pagination using cursor tokens. A separate batch job recomputes leaderboards every 5 minutes from the data warehouse as a consistency fallback for long windows.
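The cursor tokens for pagination might look like this sketch. The token format is an assumption: it simply encodes the next offset, while a production cursor would typically also pin the leaderboard version so a client's pages stay consistent across recomputes.

```python
import base64
import json

def paginate(ranking, cursor=None, page_size=10):
    """Cursor-based pagination over a precomputed ranked list.
    The opaque cursor encodes where the next page starts, so clients
    resume without re-sending any rank state."""
    offset = json.loads(base64.b64decode(cursor))["o"] if cursor else 0
    page = ranking[offset:offset + page_size]
    next_cursor = None
    if offset + page_size < len(ranking):
        next_cursor = base64.b64encode(
            json.dumps({"o": offset + page_size}).encode()).decode()
    return page, next_cursor  # next_cursor is None on the last page
```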
Walk through the write path in detail. Describe how Flink's keyed state maintains a counter per (window, segment, item_id) tuple. Use event-time processing with watermarks (for example, max_timestamp minus 5 minutes) to handle out-of-order events while allowing timely window closure. When a window closes, use a process function to extract top-K items via a min-heap (priority queue of size K) during the reduce phase.
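The size-K min-heap extraction can be sketched standalone. `heapq.nlargest` does the same job in one call; the explicit loop mirrors what a window-close process function does incrementally.

```python
import heapq

def top_k(scores: dict, k: int):
    """Scan item scores keeping a min-heap of size K: O(N log K)
    rather than an O(N log N) full sort of all items."""
    heap = []  # (score, item_id) pairs; the smallest kept score sits at heap[0]
    for item_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            # New score beats the weakest of the current K: swap it in.
            heapq.heapreplace(heap, (score, item_id))
    return sorted(heap, reverse=True)  # highest score first
```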
Address hot keys explicitly: for items exceeding a threshold (detected via per-partition metrics), split their events across multiple sub-keys using consistent hashing. A separate combiner job merges these split counters every 10 seconds. For extremely viral items (top 0.01 percent), use approximate counting with Count-Min Sketch and accept small error margins to reduce state size.
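A minimal Count-Min Sketch illustrating the approximate-counting fallback. The width and depth here are illustrative rather than tuned; the key property is fixed memory regardless of item count, with estimates that can only overestimate.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: depth rows of width buckets, one hash per row."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item: str):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item: str, count: int = 1):
        for row, idx in self._indexes(item):
            self.table[row][idx] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counts, so the minimum across rows
        # is the tightest (still upper-bound) estimate.
        return min(self.table[row][idx] for row, idx in self._indexes(item))
```

Because errors are one-sided, a sketch can safely gate "is this item a top-K candidate?" checks: a true heavy hitter is never missed, and false positives are filtered when exact counts are reconciled downstream.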
Explain the output: emit top-K results to a Kafka topic (for downstream consumers) and directly upsert into Redis sorted sets using pipelining (batches of 100 commands). Use versioned keys to handle concurrent updates and enable atomic swaps when recomputing full rankings.
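The versioned-key swap can be illustrated with an in-memory stand-in for Redis. The key names and pointer scheme are assumptions; in Redis the flip would be a SET on the pointer key (or a RENAME) after the pipelined ZADDs complete.

```python
class LeaderboardStore:
    """In-memory sketch of versioned leaderboard keys: recompute into a
    fresh versioned key, then atomically flip a pointer so readers never
    observe a half-written ranking."""
    def __init__(self):
        self.kv = {}

    def publish(self, base_key: str, ranking: list):
        version = self.kv.get(f"{base_key}:version", 0) + 1
        self.kv[f"{base_key}:v{version}"] = ranking      # build new ranking
        self.kv[f"{base_key}:version"] = version         # atomic pointer flip
        self.kv.pop(f"{base_key}:v{version - 1}", None)  # drop old version

    def read(self, base_key: str) -> list:
        version = self.kv[f"{base_key}:version"]
        return self.kv[f"{base_key}:v{version}"]
```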
Discuss fault tolerance, consistency, and operational considerations. Flink checkpoints every 60 seconds to S3, enabling recovery with at most 60 seconds of reprocessing. Kafka's replication factor of 3 prevents data loss. Redis uses AOF persistence and replication with automatic failover via Redis Cluster.
For deduplication, include event_id in messages and use Flink's state to deduplicate within the allowed lateness window. Accept that exactly-once delivery across system boundaries is hard; leaderboards are inherently approximate, and a variance of a few events is acceptable.
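The lateness-bounded deduplication can be sketched as follows (the eviction policy is an assumption drawn from the watermark logic above): remembered event_ids are evicted once the watermark passes them, so dedup state grows with the lateness window rather than with total history.

```python
class Deduplicator:
    """Drop duplicate event_ids seen within the allowed-lateness window."""
    def __init__(self, lateness: float = 300.0):
        self.lateness = lateness
        self.seen = {}     # event_id -> event timestamp
        self.max_ts = 0.0

    def accept(self, event_id: str, ts: float) -> bool:
        self.max_ts = max(self.max_ts, ts)
        # Evict ids older than the watermark; a duplicate arriving past
        # this point would be discarded as a late event anyway.
        watermark = self.max_ts - self.lateness
        self.seen = {e: t for e, t in self.seen.items() if t >= watermark}
        if event_id in self.seen:
            return False   # duplicate within the window
        self.seen[event_id] = ts
        return True
```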
For monitoring, track write lag (event timestamp versus processing timestamp), p99 read latency per segment, cache hit rates, and hot key detection metrics. Alert when freshness exceeds 30 seconds or any segment's query latency crosses 100ms. For cost optimization, long windows (30 days, all-time) update infrequently, so recompute them hourly via batch jobs rather than continuous streaming, and archive old window data to cheaper storage after retention expires.
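The alerting thresholds above reduce to a few small checks; a sketch, with the nearest-rank p99 helper being an assumption about how latency samples are summarized:

```python
FRESHNESS_SLO_S = 30.0   # alert when freshness exceeds 30 seconds
LATENCY_SLO_MS = 100.0   # alert when query latency crosses 100ms

def write_lag(event_ts: float, processing_ts: float) -> float:
    """Write lag: processing timestamp minus event timestamp, seconds."""
    return processing_ts - event_ts

def p99(samples):
    """Nearest-rank 99th percentile of a list of latency samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def alerts(lags_s, latencies_ms):
    """Return which SLOs are currently breached for a segment."""
    fired = []
    if max(lags_s) > FRESHNESS_SLO_S:
        fired.append("freshness")
    if p99(latencies_ms) > LATENCY_SLO_MS:
        fired.append("read_latency")
    return fired
```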
Deepen your understanding of the patterns used in this problem: