For a full example answer with detailed architecture diagrams and deep dives, see our Design Top K guide.
Design a system that efficiently retrieves the top-K items -- songs, videos, hashtags, products, or any entity -- based on user activity or engagement metrics within configurable time windows. The system must handle real-time data aggregation at massive scale and support queries like "top 10 songs in the last 7 days" or "trending hashtags in the past 24 hours."
At Atlassian, this manifests as surfacing the top-K Confluence pages, Jira issues, or Bitbucket repositories by popularity or engagement. The core challenge is ingesting millions of events per second, maintaining accurate windowed counts despite late-arriving or out-of-order data, and serving ranked results with sub-100ms latency. Interviewers probe whether you can separate the write aggregation path from the read serving path, handle hot keys caused by viral content, and make sensible tradeoffs between freshness, accuracy, and infrastructure cost.
Based on real interview experiences at Atlassian, these are the areas interviewers probe most deeply:
Interviewers want to see how you handle millions of incoming events without creating database hotspots. They test whether you understand stream processing, event deduplication, and the distinction between raw events and aggregated counts.
Viral content creates massive skew where a single item receives orders of magnitude more events than average. Interviewers assess whether you recognize this and can propose solutions beyond naive sharding.
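One common remedy is key splitting: fan a hot item's increments out across several sub-keys, then merge them on read or on a periodic schedule. A minimal sketch, where the fanout constant, the `#` sub-key convention, and the hot-item set are all illustrative assumptions:

```python
import random
from collections import defaultdict

HOT_FANOUT = 8  # sub-keys a hot item is spread across (illustrative)

def shard_key(item_id: str, is_hot: bool) -> str:
    """Route a hot item's increments to one of several sub-keys so no
    single partition absorbs all of its traffic."""
    if not is_hot:
        return item_id
    return f"{item_id}#{random.randrange(HOT_FANOUT)}"

counters: dict[str, int] = defaultdict(int)

# A viral item lands on up to HOT_FANOUT partitions instead of one.
for _ in range(1000):
    counters[shard_key("viral_song", is_hot=True)] += 1
counters[shard_key("quiet_song", is_hot=False)] += 1

def merged_count(item_id: str) -> int:
    """Periodic merge step: sum an item's sub-key counters back together."""
    return sum(v for k, v in counters.items()
               if k == item_id or k.startswith(item_id + "#"))

# merged_count("viral_song") == 1000
```

Detecting which items are hot (e.g. via a count threshold or a heavy-hitters sketch) is its own discussion; the split-and-merge mechanic is the part interviewers want to see.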
Computing top-K by scanning all items on every query is a critical mistake. Interviewers look for pre-computation strategies and cache-friendly designs.
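The standard pre-computation primitive here is a size-k min-heap: one pass over the aggregated counts in O(n log k) rather than a full O(n log n) sort. A minimal sketch:

```python
import heapq

def top_k(counts: dict[str, int], k: int) -> list[tuple[str, int]]:
    """Keep a min-heap of the k largest counts while scanning once.
    The heap root is the smallest of the current top k, so any count
    that beats it displaces it."""
    heap: list[tuple[int, str]] = []
    for item, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, item))
    # Sort the k survivors largest-first for serving.
    return [(item, count) for count, item in sorted(heap, reverse=True)]

counts = {"song_a": 50, "song_b": 900, "song_c": 120, "song_d": 7}
# top_k(counts, 2) == [("song_b", 900), ("song_c", 120)]
```

Running this when a window closes, and caching the result, is what lets the read path avoid scanning items at query time.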
Different windows have different update cadences and cost profiles. Interviewers probe whether you understand sliding versus tumbling windows and how to balance freshness with infrastructure cost.
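A sketch of the tumbling-versus-sliding distinction, assuming one-hour buckets: tumbling windows are fixed, non-overlapping buckets that are cheap to maintain, and a sliding window can be approximated by summing the tumbling buckets it covers, trading some edge accuracy for cheap incremental updates.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # one-hour tumbling buckets (illustrative)

def window_start(event_ts: int) -> int:
    """Tumbling windows partition time into fixed, non-overlapping buckets."""
    return event_ts - (event_ts % WINDOW_SECONDS)

# (window_start, item_id) -> count
counts: dict[tuple[int, str], int] = defaultdict(int)

events = [(10, "song_a"), (3599, "song_a"), (3600, "song_a"), (7300, "song_b")]
for ts, item in events:
    counts[(window_start(ts), item)] += 1
# Buckets: hour 0 has song_a twice; hour 1 has it once.

def sliding_count(item: str, now: int, window_seconds: int) -> int:
    """Approximate a sliding window by summing the tumbling buckets it
    covers; the oldest bucket may include slightly stale events."""
    lo = window_start(now - window_seconds)
    return sum(c for (ws, i), c in counts.items() if i == item and lo <= ws <= now)

# sliding_count("song_a", now=7300, window_seconds=7200) == 3
```

This bucket-summing trick is why long windows (7 days, 30 days) are usually kept at coarse granularity and recomputed less often, while short windows use fine-grained buckets.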
Start by confirming the scope. Ask what types of events are tracked (views, likes, clicks), what the expected event volume is, how many windows must be supported simultaneously, and what freshness guarantee matters. Clarify whether users need exact counts or approximate rankings. Establish the query pattern: mostly top-10/top-100, or do users frequently request deeper rankings?
Sketch three pipelines: an Ingestion Layer where event producers write to Kafka topics partitioned by item ID; an Aggregation Layer where Flink jobs consume events, maintain per-window counters using stateful processing, and emit top-K candidates; and a Serving Layer where Redis sorted sets store pre-computed rankings per window and segment, served by API servers with application-level caching and pagination support.
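The serving layer above can be sketched with sorted-set semantics. This is an in-memory stand-in mirroring Redis `ZADD`/`ZREVRANGE`; the `topk:{window}:{segment}` key scheme is an assumption for illustration, not a prescribed convention.

```python
from collections import defaultdict

# In-memory stand-in for Redis sorted sets: key -> {member: score}.
store: dict[str, dict[str, float]] = defaultdict(dict)

def zadd(key: str, mapping: dict[str, float]) -> None:
    """Upsert members with scores, like Redis ZADD."""
    store[key].update(mapping)

def zrevrange(key: str, start: int, stop: int) -> list[str]:
    """Members ordered by descending score (ties by member), like ZREVRANGE."""
    ranked = sorted(store[key].items(), key=lambda kv: (-kv[1], kv[0]))
    return [member for member, _ in ranked[start : stop + 1]]

# The aggregation layer publishes a pre-computed ranking per window/segment...
zadd("topk:7d:global", {"song_a": 1200, "song_b": 4500, "song_c": 300})

# ...and the API layer answers "top 10 in the last 7 days" with one range read.
# zrevrange("topk:7d:global", 0, 9) == ["song_b", "song_a", "song_c"]
```

The point of the read path is that a query is a single O(log n + k) range read over a pre-ranked structure, with pagination falling out of the `start`/`stop` offsets.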
Walk through the write path in detail. Describe how Flink keyed state maintains a counter per (window, segment, item_id) tuple using event-time processing with watermarks. When a window closes, extract top-K items using a min-heap during the reduce phase. Address hot keys explicitly: for items exceeding a threshold, split events across sub-keys and merge periodically. Explain how results flow to Redis sorted sets using pipelined batch writes with versioned keys for atomic updates.
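The versioned-key update at the end of that flow can be sketched as follows. This is an in-memory model of the pattern, assuming a pointer key that names the live version so readers never observe a half-written ranking; the `:v{n}` and `:current` naming is illustrative.

```python
# key -> either a ranking list or a pointer string naming the live version
store: dict[str, object] = {}

def publish_ranking(base_key: str, version: int,
                    ranking: list[tuple[str, int]]) -> None:
    # 1. Write the new ranking under a versioned key
    #    (a pipelined batch write in the real Redis deployment).
    store[f"{base_key}:v{version}"] = ranking
    # 2. Flip the pointer; readers switch to the new version atomically.
    store[f"{base_key}:current"] = f"{base_key}:v{version}"
    # 3. Retire the previous version (a TTL would handle this in Redis).
    store.pop(f"{base_key}:v{version - 1}", None)

def read_ranking(base_key: str) -> list[tuple[str, int]]:
    """Follow the pointer to whichever version is currently live."""
    return store[store[f"{base_key}:current"]]  # type: ignore[index]

publish_ranking("topk:24h:global", 1, [("song_b", 900), ("song_a", 500)])
publish_ranking("topk:24h:global", 2, [("song_a", 1100), ("song_b", 950)])
# read_ranking("topk:24h:global") == [("song_a", 1100), ("song_b", 950)]
```

Because the pointer flip is a single write, a reader sees either the old ranking or the new one in full, never a mix of the two.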
Discuss fault tolerance through Flink checkpoints to S3 and Kafka replication. Cover deduplication using event IDs and Flink state. Explain monitoring: track write lag, p99 read latency, cache hit rates, and hot key detection. Address cost optimization by recomputing long windows (30 days, all-time) hourly via batch Spark jobs rather than continuous streaming.
"Asked in Atlassian: give top K Confluence pages. If a user has seen one, do not show it again."
"Design popular K feeds in Confluence. Follow-ups included popularity score calculation strategies, support for querying in windows, and updating dashboards of users in real time."