Design a system to display aggregation of events where multiple robots are continuously generating events in realtime
System Design · Must
Problem Statement
A fleet of autonomous robots operating in warehouses, factories, or delivery networks continuously generates events — task completions, sensor readings, error conditions, movement updates, and maintenance alerts. Operations teams need a live dashboard that displays aggregated metrics over rolling time windows: how many tasks were completed fleet-wide in the last minute, the average battery level across all robots, the count of error events per robot type, and similar summaries.
The core difficulty is that events arrive from hundreds or thousands of robots simultaneously, each with its own clock and network latency characteristics. Events may arrive out of order — a robot in a low-connectivity zone might batch and send events minutes after they occurred. The aggregation system must produce accurate, timely results despite this disorder, updating dashboard visualizations within seconds of the aggregation window closing.
You need to design a system that ingests high-throughput event streams from a robot fleet, performs windowed aggregations (counts, averages, maximums, percentiles) across multiple dimensions (per-robot, per-type, fleet-wide), handles late and out-of-order data gracefully, and pushes updated results to a real-time dashboard consumed by operations teams.
Key Requirements
Functional
- Event Ingestion -- Accept structured events from thousands of robots, each event containing a robot ID, event type, timestamp, and a payload of key-value metrics.
- Windowed Aggregation -- Compute aggregations (count, sum, average, max, percentiles) over configurable tumbling and sliding windows (e.g., 1 minute, 5 minutes) across multiple grouping dimensions.
- Late Data Handling -- Accept and correctly incorporate events that arrive after their window has nominally closed, issuing corrected aggregation results when late data materially changes a window's output.
- Live Dashboard -- Push aggregation results to a web-based dashboard in near-real-time, with visualizations updating within seconds of window closure.
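To make the ingestion requirement concrete, here is one plausible shape for a robot event; the field names (`robot_id`, `event_type`, `event_time_ms`, `metrics`) are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
import time

@dataclass
class RobotEvent:
    robot_id: str        # unique fleet identifier, e.g. "r-0421"
    event_type: str      # e.g. "task_completed", "battery_reading", "error"
    event_time_ms: int   # timestamp from the robot's own clock (event-time)
    metrics: dict = field(default_factory=dict)  # key-value metric payload

# Example: a battery reading stamped with the robot's clock at send time.
evt = RobotEvent(
    robot_id="r-0421",
    event_type="battery_reading",
    event_time_ms=int(time.time() * 1000),
    metrics={"battery_pct": 87.5},
)
```

Keeping the event-time stamp separate from any server-side receive time is what later lets the aggregation layer assign events to windows by when they actually happened.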
Non-Functional
- Scalability -- Handle tens of thousands of events per second from a fleet of thousands of robots, scaling horizontally as the fleet grows.
- Latency -- Aggregated results for a 1-minute window must be available on the dashboard within 5 seconds of the window closing.
- Accuracy -- Aggregations must correctly reflect all events assigned to a window, including late arrivals processed within the allowed lateness threshold.
- Fault Tolerance -- The system continues producing results through individual component failures, with exactly-once processing semantics to prevent double-counting.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Event-Time vs. Processing-Time Semantics
Robots generate events at one time but the system receives them at another. Interviewers want to see that you understand this distinction and its implications for correctness.
Hints to consider:
- Think about why using event-time (the timestamp from the robot's clock) rather than processing-time (when the server receives the event) is critical for accurate aggregations, especially when robots batch events during connectivity gaps.
- Consider how you assign events to windows based on event-time and what happens when an event's timestamp falls into a window that is already "closed" from a processing-time perspective.
- Evaluate how watermarks work in stream processing frameworks like Flink — a watermark is a declaration that no events with timestamps earlier than the watermark value are expected to arrive.
- Think about the trade-off between watermark aggressiveness (advancing quickly means lower latency but risks dropping late events) and conservatism (waiting longer for stragglers means higher latency but better accuracy).
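The bounded-out-of-orderness strategy described above can be sketched in a few lines; this is a simplified stand-in for what frameworks like Flink provide, with the lateness bound as an assumed tuning parameter:

```python
class BoundedOutOfOrdernessWatermark:
    """Emits a watermark that lags the maximum event-time seen by a fixed bound.

    A larger bound waits longer for stragglers (better accuracy, higher
    latency); a smaller bound advances faster but risks dropping late events.
    """

    def __init__(self, max_out_of_orderness_ms: int):
        self.max_out_of_orderness_ms = max_out_of_orderness_ms
        self.max_event_time_ms = 0

    def on_event(self, event_time_ms: int) -> None:
        # Watermarks only move forward: an out-of-order event never regresses them.
        self.max_event_time_ms = max(self.max_event_time_ms, event_time_ms)

    def watermark(self) -> int:
        # Declares: "no events with timestamps earlier than this are expected."
        return self.max_event_time_ms - self.max_out_of_orderness_ms


wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=10_000)
wm.on_event(60_000)   # watermark advances to 50_000
wm.on_event(55_000)   # out-of-order event: watermark stays at 50_000
```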
2. Handling Late and Out-of-Order Data
Real-world robot fleets produce messy data streams. Interviewers probe your strategy for dealing with this messiness without sacrificing correctness.
Hints to consider:
- Consider an allowed-lateness configuration that keeps window state open for a configurable period after the watermark passes the window boundary, allowing late events to trigger updated results.
- Think about how you emit results — an initial result when the watermark passes the window end, and updated (retracted and re-emitted) results when late data arrives.
- Evaluate what happens when data arrives after even the allowed-lateness period — do you drop it, log it, or route it to a separate "late data" pipeline for batch reconciliation?
- Consider how you handle robot clock drift — if a robot's clock is significantly wrong, all its events land in the wrong windows. Think about server-side timestamp validation.
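A minimal sketch of the initial-then-corrected emission pattern for a tumbling count window, assuming a 1-minute window and a 30-second allowed lateness (both illustrative values):

```python
WINDOW_MS = 60_000            # tumbling window size (assumed: 1 minute)
ALLOWED_LATENESS_MS = 30_000  # keep window state open this long past the watermark

class TumblingCountWindow:
    def __init__(self):
        self.counts = {}      # window_start_ms -> event count
        self.emitted = set()  # windows whose initial result has been emitted

    def on_event(self, event_time_ms: int, watermark_ms: int):
        start = event_time_ms - event_time_ms % WINDOW_MS  # event-time assignment
        if start + WINDOW_MS + ALLOWED_LATENESS_MS <= watermark_ms:
            # Too late even for allowed lateness: in practice, route to a
            # separate late-data pipeline for batch reconciliation.
            return ("dropped", start, None)
        self.counts[start] = self.counts.get(start, 0) + 1
        if start in self.emitted:
            # Late arrival after the initial emission: issue a correction.
            return ("correction", start, self.counts[start])
        return ("accumulating", start, self.counts[start])

    def on_watermark(self, watermark_ms: int):
        # Emit the initial result for each window the watermark has passed.
        results = []
        for start in sorted(self.counts):
            if start not in self.emitted and start + WINDOW_MS <= watermark_ms:
                self.emitted.add(start)
                results.append(("initial", start, self.counts[start]))
        return results
```

The three outcome tags make the dashboard contract explicit: an `initial` result when the watermark passes the window end, a `correction` when late data changes an already-emitted window, and a `dropped` signal for data beyond the allowed-lateness horizon.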
3. Dashboard Update Mechanism
Pushing aggregation results to a live dashboard efficiently is its own design challenge.
Hints to consider:
- Think about using WebSockets or Server-Sent Events (SSE) to push updated aggregations to connected dashboard clients, rather than having clients poll.
- Consider how you structure the data model for the dashboard — pre-compute all dimension combinations (robot ID, robot type, fleet-wide) at the streaming layer, or let the dashboard query a materialized view?
- Evaluate how you handle dashboard clients that connect mid-window — do you show partial results, wait for the next window, or serve a snapshot from the last completed window?
- Think about how you visualize corrected results when late data changes a previously displayed aggregation — do you silently update, or show a correction indicator?
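One way to tie the last two hints together is to version each pushed result: every message carries a per-window revision number, so a client can distinguish a fresh result from a correction and render an indicator. The message fields below are an assumed wire format, not a standard:

```python
import json

def dashboard_message(window_start_ms: int, dimension: str,
                      value: float, revision: int) -> str:
    """Builds one JSON payload to push over WebSockets or SSE.

    revision == 1 is the initial result for this (window, dimension) pair;
    revision > 1 marks a correction caused by late-arriving events, letting
    the client replace the displayed value and show a correction indicator.
    """
    return json.dumps({
        "window_start_ms": window_start_ms,
        "dimension": dimension,   # e.g. "fleet", "robot:r-0421", "type:error"
        "value": value,
        "revision": revision,
        "corrected": revision > 1,
    })

msg = json.loads(dashboard_message(0, "fleet", 42, revision=2))
```

A client connecting mid-window can first fetch a snapshot of the latest revision per (window, dimension) pair from a materialized view, then apply streamed messages whose revision exceeds the snapshot's.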