For a full example answer with detailed architecture diagrams and deep dives, see our Design Top-K guide. The guide covers streaming aggregation, windowed counting, and serving precomputed rankings at scale.
Also review the Caching and Message Queues building blocks for background on in-memory ranking stores and durable event pipelines.
Design a system that efficiently retrieves the top-K items -- songs, videos, hashtags, restaurant listings, or any entity -- ranked by user activity or engagement metrics within configurable time windows. Users and downstream services query endpoints like "top 10 songs in the last 7 days" or "trending hashtags in the past 24 hours," and expect fresh, deterministic results returned in milliseconds.
The system ingests millions of scoring events per second from diverse sources (mobile apps, web clients, backend services), aggregates them across multiple rolling windows (hourly, daily, weekly, all-time), and serves low-latency ranked lists. The primary challenges are handling extreme write skew when content goes viral, maintaining accurate rankings despite out-of-order and late-arriving events, supporting segmentation by geography or category, and keeping infrastructure costs reasonable while meeting sub-100ms read latency targets.
Bloomberg relies heavily on real-time data aggregation for financial terminals, making this a natural fit for interviews that test streaming pipeline design, hot key mitigation, and read-path optimization.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see how you handle millions of incoming score events without creating database hotspots or losing data. They test whether you understand stream processing, event deduplication, and the difference between raw events and aggregated counts.
Hints to consider:
Viral content creates enormous skew where a single item receives thousands of times more events than average. Interviewers assess whether you recognize this problem and can propose solutions beyond naive sharding.
Hints to consider:
Split a hot item's counter across multiple sub-keys (for example, item_123_shard_0 through item_123_shard_7) and merge them periodically with a combiner job.

Computing top-K by scanning all items on every query is a critical mistake. Interviewers look for pre-computation strategies, appropriate data structures, and cache-friendly designs.
Hints to consider:
Maintain a bounded min-heap of size K so selecting the top items costs O(N log K) instead of an O(N log N) full sort.

Different time windows have different update patterns and expiration logic. Interviewers probe whether you understand sliding versus tumbling windows, retention policies, and how to balance freshness with cost.
Hints to consider:
Start by confirming scope and constraints with your interviewer. Ask about the types of events being tracked (views, plays, likes, purchases) and expected volume per second. Clarify how many time windows must be supported simultaneously and whether windows can overlap. Determine the number of unique segments (geography, category, user cohorts) and how many distinct segment combinations exist. Confirm acceptable staleness for results -- can leaderboards be 10 seconds behind, or must they update in near real-time? Ask about tie-breaking rules and whether clients need exact counts alongside rankings.
Sketch three main pipelines:
Ingestion Layer: Event producers (app servers, mobile clients, partner feeds) write scoring events to Kafka topics partitioned by item ID. Kafka provides durability, replay capability, and backpressure handling. Each event includes item_id, event_type, score_delta, timestamp, and segment metadata.
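The event schema and item-keyed partitioning can be sketched as follows. Field names follow the schema above; the `partition_for` helper and the CRC32 hash choice are illustrative assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass, field
import uuid
import zlib

@dataclass
class ScoringEvent:
    """One scoring event; event_id is included to support deduplication."""
    item_id: str
    event_type: str      # e.g. "play", "like", "purchase"
    score_delta: int
    timestamp: float     # event time, in epoch seconds
    segment: dict        # e.g. {"geo": "us", "category": "music"}
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def partition_for(event: ScoringEvent, num_partitions: int) -> int:
    """Partition by item_id so every event for an item lands on the same
    Kafka partition, giving the aggregator per-item ordering."""
    return zlib.crc32(event.item_id.encode()) % num_partitions
```

Keying the partition on item_id (rather than round-robin) is what makes per-item counting in the aggregation layer straightforward; it is also what creates the hot-key problem addressed later.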
Aggregation Layer: Apache Flink jobs consume from Kafka, performing stateful stream processing. For each window (1 hour, 24 hours, 7 days), maintain in-memory counters per item using keyed state. Apply tumbling or sliding window logic with watermarks for late data handling. On window close (or at regular intervals for sliding windows), emit aggregated scores and top-K candidates to output topics and directly to the serving layer.
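A minimal pure-Python sketch of the keyed, event-time counting described above. Flink's keyed state and watermark machinery are far richer than this; the window size and lateness constants are illustrative.

```python
from collections import defaultdict

WINDOW = 3600            # 1-hour tumbling windows, in seconds
ALLOWED_LATENESS = 300   # watermark trails max seen timestamp by 5 minutes

class TumblingAggregator:
    """Event-time tumbling-window counting keyed by item."""
    def __init__(self):
        # window_start -> item_id -> accumulated score
        self.windows = defaultdict(lambda: defaultdict(int))
        self.max_ts = 0.0

    def watermark(self) -> float:
        return self.max_ts - ALLOWED_LATENESS

    def on_event(self, item_id: str, score_delta: int, ts: float) -> bool:
        """Fold one event into its window; drop events behind the watermark."""
        self.max_ts = max(self.max_ts, ts)
        if ts < self.watermark():
            return False  # too late: its window may already have closed
        window_start = int(ts // WINDOW) * WINDOW
        self.windows[window_start][item_id] += score_delta
        return True
```

Sliding windows can be layered on top of this by summing several adjacent tumbling windows at emit time, which is a common way to avoid storing overlapping state.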
Serving Layer: A Redis cluster stores pre-computed top-K sorted sets per window and segment combination, using key patterns like leaderboard:daily:us:music. API servers query Redis with sub-10ms latency, apply an application-level cache with 30-second TTL, and handle pagination using cursor tokens. A separate batch job recomputes leaderboards every 5 minutes from the data warehouse as a consistency fallback for long windows.
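The cursor tokens for pagination might look like this sketch. The token format is an assumption: it simply encodes the next offset, while a production cursor would typically also pin the leaderboard version so a client's pages stay consistent across recomputes.

```python
import base64
import json

def paginate(ranking, cursor=None, page_size=10):
    """Cursor-based pagination over a precomputed ranked list.
    The opaque cursor encodes where the next page starts, so clients
    resume without re-sending any rank state."""
    offset = json.loads(base64.b64decode(cursor))["o"] if cursor else 0
    page = ranking[offset:offset + page_size]
    next_cursor = None
    if offset + page_size < len(ranking):
        next_cursor = base64.b64encode(
            json.dumps({"o": offset + page_size}).encode()).decode()
    return page, next_cursor  # next_cursor is None on the last page
```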
Walk through the write path in detail. Describe how Flink's keyed state maintains a counter per (window, segment, item_id) tuple. Use event-time processing with watermarks (for example, max_timestamp minus 5 minutes) to handle out-of-order events while allowing timely window closure. When a window closes, use a process function to extract top-K items via a min-heap (priority queue of size K) during the reduce phase.
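The size-K min-heap extraction can be sketched standalone. `heapq.nlargest` does the same job in one call; the explicit loop mirrors what a window-close process function does incrementally.

```python
import heapq

def top_k(scores: dict, k: int):
    """Scan item scores keeping a min-heap of size K: O(N log K)
    rather than an O(N log N) full sort of all items."""
    heap = []  # (score, item_id) pairs; the smallest kept score sits at heap[0]
    for item_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            # New score beats the weakest of the current K: swap it in.
            heapq.heapreplace(heap, (score, item_id))
    return sorted(heap, reverse=True)  # highest score first
```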
Address hot keys explicitly: for items exceeding a threshold (detected via per-partition metrics), split their events across multiple sub-keys using consistent hashing. A separate combiner job merges these split counters every 10 seconds. For extremely viral items (top 0.01 percent), use approximate counting with Count-Min Sketch and accept small error margins to reduce state size.
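A minimal Count-Min Sketch illustrating the approximate-counting fallback. The width and depth here are illustrative rather than tuned; the key property is fixed memory regardless of item count, with estimates that can only overestimate.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: depth rows of width buckets, one hash per row."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item: str):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item: str, count: int = 1):
        for row, idx in self._indexes(item):
            self.table[row][idx] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counts, so the minimum across rows
        # is the tightest (still upper-bound) estimate.
        return min(self.table[row][idx] for row, idx in self._indexes(item))
```

Because errors are one-sided, a sketch can safely gate "is this item a top-K candidate?" checks: a true heavy hitter is never missed, and false positives are filtered when exact counts are reconciled downstream.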
Explain the output: emit top-K results to a Kafka topic (for downstream consumers) and directly upsert into Redis sorted sets using pipelining (batches of 100 commands). Use versioned keys to handle concurrent updates and enable atomic swaps when recomputing full rankings.
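The versioned-key swap can be illustrated with an in-memory stand-in for Redis. The key names and pointer scheme are assumptions; in Redis the flip would be a SET on the pointer key (or a RENAME) after the pipelined ZADDs complete.

```python
class LeaderboardStore:
    """In-memory sketch of versioned leaderboard keys: recompute into a
    fresh versioned key, then atomically flip a pointer so readers never
    observe a half-written ranking."""
    def __init__(self):
        self.kv = {}

    def publish(self, base_key: str, ranking: list):
        version = self.kv.get(f"{base_key}:version", 0) + 1
        self.kv[f"{base_key}:v{version}"] = ranking      # build new ranking
        self.kv[f"{base_key}:version"] = version         # atomic pointer flip
        self.kv.pop(f"{base_key}:v{version - 1}", None)  # drop old version

    def read(self, base_key: str) -> list:
        version = self.kv[f"{base_key}:version"]
        return self.kv[f"{base_key}:v{version}"]
```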
Discuss fault tolerance, consistency, and operational considerations. Flink checkpoints every 60 seconds to S3, enabling recovery with at most 60 seconds of reprocessing. Kafka's replication factor of 3 prevents data loss. Redis uses AOF persistence and replication with automatic failover via Redis Cluster.
For deduplication, include event_id in messages and use Flink's state to deduplicate within the allowed lateness window. Accept that exactly-once delivery across system boundaries is hard; leaderboards are inherently approximate, and a variance of a few events is acceptable.
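The lateness-bounded deduplication can be sketched as follows (the eviction policy is an assumption drawn from the watermark logic above): remembered event_ids are evicted once the watermark passes them, so dedup state grows with the lateness window rather than with total history.

```python
class Deduplicator:
    """Drop duplicate event_ids seen within the allowed-lateness window."""
    def __init__(self, lateness: float = 300.0):
        self.lateness = lateness
        self.seen = {}     # event_id -> event timestamp
        self.max_ts = 0.0

    def accept(self, event_id: str, ts: float) -> bool:
        self.max_ts = max(self.max_ts, ts)
        # Evict ids older than the watermark; a duplicate arriving past
        # this point would be discarded as a late event anyway.
        watermark = self.max_ts - self.lateness
        self.seen = {e: t for e, t in self.seen.items() if t >= watermark}
        if event_id in self.seen:
            return False   # duplicate within the window
        self.seen[event_id] = ts
        return True
```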
For monitoring, track write lag (event timestamp versus processing timestamp), p99 read latency per segment, cache hit rates, and hot key detection metrics. Alert when freshness exceeds 30 seconds or any segment's query latency crosses 100ms. For cost optimization, long windows (30 days, all-time) update infrequently, so recompute them hourly via batch jobs rather than continuous streaming, and archive old window data to cheaper storage after retention expires.
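The alerting thresholds above reduce to a few small checks; a sketch, with the nearest-rank p99 helper being an assumption about how latency samples are summarized:

```python
FRESHNESS_SLO_S = 30.0   # alert when freshness exceeds 30 seconds
LATENCY_SLO_MS = 100.0   # alert when query latency crosses 100ms

def write_lag(event_ts: float, processing_ts: float) -> float:
    """Write lag: processing timestamp minus event timestamp, seconds."""
    return processing_ts - event_ts

def p99(samples):
    """Nearest-rank 99th percentile of a list of latency samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def alerts(lags_s, latencies_ms):
    """Return which SLOs are currently breached for a segment."""
    fired = []
    if max(lags_s) > FRESHNESS_SLO_S:
        fired.append("freshness")
    if p99(latencies_ms) > LATENCY_SLO_MS:
        fired.append("read_latency")
    return fired
```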
Deepen your understanding of the patterns used in this problem: