Design a distributed system for getting the slowest query from Google search
Problem Statement
Google processes billions of search queries per day across thousands of servers distributed worldwide. Understanding performance characteristics — particularly identifying the slowest queries — is essential for maintaining search quality, debugging latency regressions, and prioritizing optimization efforts. The slowest query in a given time window might indicate a pathological query pattern, a degraded backend shard, or an infrastructure issue in a specific data center.
The fundamental challenge is that no single machine sees all queries. Each query is handled by a specific set of servers, and latency measurements are distributed across the entire fleet. Finding the global maximum (or top-K slowest queries) requires aggregating telemetry from every server in a way that is both accurate and efficient, without overwhelming the network or storage with raw per-query data.
You need to design a distributed telemetry system that captures query latency measurements from every search server, aggregates them in near-real-time to identify the slowest queries within configurable time windows, and presents the results through a dashboard and alerting system — handling both real-time streaming analysis and historical batch queries.
Key Requirements
Functional
- Telemetry Ingestion -- Every search server emits a latency record for each query it processes, including the query text, latency, timestamp, server identity, and data center location.
- Windowed Aggregation -- Compute the slowest query (and top-K slowest queries) within configurable time windows (1 minute, 5 minutes, 1 hour, 1 day).
- Real-Time Dashboard -- Display the current slowest queries in a live-updating dashboard with drill-down by data center, server pool, and query type.
- Historical Analysis -- Support querying historical data to find the slowest queries on a specific past day or time range for post-mortem analysis.
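The latency record described above can be sketched as a simple schema. This is a minimal illustration; the field names and types are assumptions, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencyRecord:
    """One telemetry record per processed query (field names are illustrative)."""
    query_text: str
    latency_ms: float
    event_time_s: float  # when the query was processed (event time, not arrival time)
    server_id: str
    datacenter: str

# Example record as a search server might emit it.
record = LatencyRecord(
    query_text="weather in zurich",
    latency_ms=1840.0,
    event_time_s=1_700_000_000.0,
    server_id="search-node-1234",
    datacenter="eu-west-1",
)
```

Keeping the record small matters: at tens of billions of queries per day, even a few extra bytes per record is terabytes of telemetry.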
Non-Functional
- Scalability -- Ingest latency records from hundreds of thousands of servers processing tens of billions of queries per day.
- Latency -- Real-time top-K results available within 10 seconds of the query completing.
- Accuracy -- The reported slowest query must be the true global maximum, not an approximation — this is a max aggregation, not a percentile estimate.
- Fault Tolerance -- The system continues producing results even if individual servers or data centers fail to report, with clear visibility into data completeness.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Hierarchical Aggregation Architecture
Finding a global max across hundreds of thousands of servers requires a scalable aggregation topology. Interviewers want to see how you avoid making every server report to a single aggregator.
Hints to consider:
- Think about a tree-structured aggregation: each server computes its local top-K per window, reports to a per-rack or per-cluster aggregator, which merges and forwards to a per-datacenter aggregator, and finally to a global aggregator.
- Consider why max (or top-K) is a particularly friendly aggregation — it is associative and commutative, meaning partial results can be merged in any order without affecting correctness.
- Evaluate the trade-off between adding aggregation levels (less data to transmit at each hop) and the extra pipeline latency each level introduces.
- Think about how you size K at each level — you only need top-1 for the slowest query, but top-K at intermediate levels provides robustness if late-arriving data shifts the ranking.
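The associativity and commutativity of top-K is what makes the tree topology work: each level can merge its children's partial results in any order and still produce the correct answer. A minimal sketch, with records as illustrative `(query, latency_ms)` tuples:

```python
import heapq

def local_top_k(records, k):
    """Each server keeps only its k slowest queries for the window."""
    return heapq.nlargest(k, records, key=lambda r: r[1])

def merge_top_k(partials, k):
    """Merge partial top-K lists from children. Because top-K is associative
    and commutative, merge order does not affect correctness."""
    merged = [r for partial in partials for r in partial]
    return heapq.nlargest(k, merged, key=lambda r: r[1])

# Two racks report their local top-2; a parent aggregator merges them.
rack_a = local_top_k([("q1", 120.0), ("q2", 950.0), ("q3", 400.0)], k=2)
rack_b = local_top_k([("q4", 700.0), ("q5", 1840.0)], k=2)
print(merge_top_k([rack_a, rack_b], k=2))
# → [('q5', 1840.0), ('q2', 950.0)]
```

Note how little data crosses each hop: a server processing millions of queries per window forwards only K tuples, and the same bound holds at every level of the tree.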
2. Event-Time Windowing and Late Data
Queries arrive at aggregators with varying delays. Interviewers probe how you handle the gap between when a query executes and when its telemetry reaches the aggregator.
Hints to consider:
- Consider using event-time (the timestamp when the query was processed) rather than processing-time (when the aggregator receives the record) for window assignment.
- Think about watermarks — a signal that indicates "all data before time T has likely arrived" — and how you advance them in a multi-source pipeline.
- Evaluate what you do when a late record arrives after the window has closed — do you recompute, issue a correction, or drop it with a staleness counter?
- Consider how clock skew between servers affects window boundaries and what level of NTP synchronization you can assume.
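The ideas above can be combined into a small sketch: tumbling windows assigned by event time, a watermark that closes windows after an allowed-lateness grace period, and a staleness counter for records that arrive too late. The class name, window sizes, and lateness bound are illustrative assumptions, not a reference design.

```python
def window_start(event_time_s, window_s=60):
    """Assign a record to a tumbling window by event time, not arrival time."""
    return event_time_s - (event_time_s % window_s)

class WindowedMax:
    """Tracks the slowest query per event-time window. A window is closed once
    the watermark (max seen event time minus allowed lateness) passes its end;
    later records are counted as dropped rather than recomputed (one of the
    late-data policies discussed above)."""

    def __init__(self, window_s=60, allowed_lateness_s=5):
        self.window_s = window_s
        self.lateness = allowed_lateness_s
        self.windows = {}      # window start -> (query, latency_ms)
        self.late_dropped = 0  # staleness counter for records past the watermark
        self.watermark = 0.0

    def observe(self, query, latency_ms, event_time_s):
        start = window_start(event_time_s, self.window_s)
        if start + self.window_s <= self.watermark:
            self.late_dropped += 1  # window already closed: count it, don't merge
            return
        best = self.windows.get(start)
        if best is None or latency_ms > best[1]:
            self.windows[start] = (query, latency_ms)

    def advance_watermark(self, max_seen_event_time_s):
        self.watermark = max_seen_event_time_s - self.lateness

# Usage: the 1-minute window [0, 60) closes once the watermark passes 60.
agg = WindowedMax(window_s=60, allowed_lateness_s=5)
agg.observe("q1", 300.0, event_time_s=10.0)
agg.observe("q2", 900.0, event_time_s=30.0)
agg.advance_watermark(130.0)                   # watermark = 125, window closed
agg.observe("q3", 2000.0, event_time_s=20.0)   # late record: counted, not merged
```

In a multi-source pipeline the watermark would be the minimum across all reporting children, so one lagging data center holds windows open rather than silently losing its data, which is the completeness-visibility trade-off the fault-tolerance requirement asks about.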