Design a real-time event aggregation system that ingests continuous telemetry streams from a fleet of robots and presents aggregated metrics on a live dashboard with one-minute granularity. Each robot emits status events, error codes, and performance counters every few seconds. The dashboard shows trends, anomalies, and health indicators that plant operators rely on for operational decisions.
The core engineering challenge is processing millions of events per minute with minimal lag, computing accurate aggregates even when events arrive delayed or out of order, and serving real-time queries to dashboards that need to refresh every few seconds. The system must scale horizontally to accommodate growing robot fleets, handle network disruptions that cause event bursts on reconnection, and clearly distinguish between provisional results (still accepting late data) and finalized metrics.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see whether you understand the difference between when an event occurred at the robot (event time) and when your system processes it (processing time). Network delays, robot-side buffering, and outages cause events to arrive late, making processing-time windows inaccurate.
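The distinction is easiest to show concretely. A minimal sketch (timestamps and constants are illustrative): windows are keyed by the event-time timestamp the robot stamped on the event, so a late arrival still lands in the minute in which it actually occurred.

```python
WINDOW_MS = 60_000  # one-minute tumbling windows

def window_start(event_time_ms: int) -> int:
    """Align an event-time timestamp (ms since epoch) to its one-minute window."""
    return event_time_ms - event_time_ms % WINDOW_MS

# A delayed event is assigned to the window in which it occurred,
# not the window in which it happened to arrive.
occurred_at = 1_700_000_125_000   # timestamp stamped by the robot
arrived_at = 1_700_000_310_000    # received by the pipeline 3+ minutes later
assert window_start(occurred_at) == 1_700_000_100_000
assert window_start(arrived_at) != window_start(occurred_at)
```

A processing-time system would have used `arrived_at` to pick the bucket, silently shifting the late event into the wrong minute.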
With millions of events per minute, a single global aggregator becomes a bottleneck. Interviewers expect a partitioned, parallel architecture that avoids hotspots on popular time buckets.
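One way to sketch the idea (partition count and IDs are illustrative): hash the robot ID to pick a partition, keep per-partition partial aggregates, and merge partials at read time. No single writer owns a popular time bucket, so a hot minute does not serialize on one node.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8

def partition_for(robot_id: str) -> int:
    # Stable hash so the same robot always lands on the same partition.
    digest = hashlib.sha256(robot_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Each partition holds its own partial counts per window start.
partials = [defaultdict(int) for _ in range(NUM_PARTITIONS)]

def ingest(robot_id: str, window_start_ms: int) -> None:
    partials[partition_for(robot_id)][window_start_ms] += 1

def merge(window_start_ms: int) -> int:
    # Read-time merge of per-partition partials into the final count.
    return sum(p[window_start_ms] for p in partials)

for rid in ["robot-1", "robot-2", "robot-3", "robot-1"]:
    ingest(rid, 1_700_000_100_000)
assert merge(1_700_000_100_000) == 4
```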
The database layer must efficiently store and query billions of minute-level aggregate records while supporting both fast writes during ingestion and low-latency reads for dashboard queries.
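A common key-design pattern here is time bucketing, sketched below with hypothetical names: the partition key pairs the robot with the UTC day so no partition grows beyond roughly 1,440 minute rows, while the minute timestamp acts as the clustering column for ordered range scans.

```python
MS_PER_DAY = 86_400_000

def aggregate_row_key(robot_id: str, minute_ts_ms: int) -> tuple:
    """Composite key sketch for minute-level aggregates: partition by
    (robot, day) to bound partition size; cluster by minute timestamp
    so a dashboard's time-range query is a single ordered scan."""
    day = minute_ts_ms // MS_PER_DAY
    partition_key = (robot_id, day)
    clustering_key = minute_ts_ms
    return (partition_key, clustering_key)

# Two minutes of the same UTC day share a partition; writes spread
# across robots, and a day's dashboard query hits one partition per robot.
k1 = aggregate_row_key("robot-7", 1_700_000_100_000)
k2 = aggregate_row_key("robot-7", 1_700_000_160_000)
assert k1[0] == k2[0]
```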
Dashboards need to reflect new data within seconds. Constant polling wastes resources and adds latency. Interviewers want a scalable push mechanism.
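The push path can be explained with a topic-based fan-out: dashboards subscribe to the topics they display, and the pipeline publishes each finished window once, reaching all subscribers. A toy in-process stand-in for the WebSocket service (class and topic names are illustrative):

```python
from collections import defaultdict

class WindowPublisher:
    """Minimal stand-in for a WebSocket fan-out service: dashboards
    register a callback per topic; new window results are pushed once
    per subscriber instead of being polled for."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic: str, callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, result: dict) -> None:
        for callback in self._subscribers[topic]:
            callback(result)

received = []
publisher = WindowPublisher()
publisher.subscribe("fleet/errors", received.append)
publisher.publish("fleet/errors", {"window": 1_700_000_100_000, "count": 7})
assert received == [{"window": 1_700_000_100_000, "count": 7}]
```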
Confirm how many robots exist today and the expected growth rate. Clarify whether one-minute windows are aligned to clock minutes or sliding. Understand acceptable latency for dashboard freshness and how long to wait for late events. Verify which metrics matter most (simple counts, averages, percentiles, distinct counts). Establish whether operators need drill-down capabilities or only fleet-wide views. Ask about exactly-once guarantees versus at-least-once with idempotency.
Sketch a streaming pipeline with five layers: (1) an ingestion tier using Kafka to buffer incoming robot events partitioned by robot ID; (2) a stream processing tier running Flink to perform windowed aggregations with event-time semantics and watermarks; (3) a storage tier using Cassandra to persist per-minute aggregates keyed by timestamp and robot dimensions; (4) an API tier providing REST endpoints for historical queries with Redis caching for recent windows; (5) a push notification service using WebSockets to stream new window results to connected dashboards. Draw data-flow arrows showing events moving from robots through Kafka, being processed into aggregates by Flink, being persisted to Cassandra, and being pushed to dashboards.
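The layered data flow can be narrated over a toy end-to-end pass. This is not an implementation of Kafka, Flink, or Cassandra; a list stands in for the ingestion buffer, a dict fold for windowed aggregation, a dict for the minute-row store, and a callback for the push service (the API/caching tier is elided).

```python
from collections import defaultdict

def run_pipeline(events, on_push):
    """Toy single-pass version of the five-layer flow. Each event is
    assumed to be a dict with 'robot_id' and an event-time 'ts' in ms."""
    buffer = list(events)                       # (1) ingestion buffer
    aggregates = defaultdict(int)               # (2) windowed aggregation
    for event in buffer:
        window = event["ts"] - event["ts"] % 60_000
        aggregates[(event["robot_id"], window)] += 1
    store = dict(aggregates)                    # (3) persisted minute rows
    for key, count in store.items():            # (5) push to dashboards
        on_push(key, count)
    return store

pushed = []
store = run_pipeline(
    [{"robot_id": "r1", "ts": 10_000}, {"robot_id": "r1", "ts": 20_000}],
    lambda key, count: pushed.append((key, count)),
)
assert store == {("r1", 0): 2}
assert pushed == [(("r1", 0), 2)]
```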
Walk through how Flink handles one-minute tumbling windows with event time. Each event carries a timestamp indicating when it occurred at the robot. The processor assigns events to windows (e.g., 10:00:00 inclusive to 10:01:00 exclusive). Watermarks track event-time progress across partitions; when the watermark passes the end of a window plus allowed lateness (say, 2 minutes), the window closes and its result is emitted. Discuss trigger strategies: emit provisional results as data arrives, then a final result when the watermark advances. Address idempotency using unique event IDs and deduplication to prevent double-counting if failures cause reprocessing. Late events that arrive after window close update the stored aggregate in Cassandra and invalidate the corresponding cache entry.
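The mechanics above can be condensed into a small sketch, assuming a simplified Flink-style model (constants and class names are illustrative): events are deduplicated by ID and folded into their event-time window; advancing the watermark finalizes any window whose end plus allowed lateness is behind it.

```python
WINDOW_MS = 60_000
ALLOWED_LATENESS_MS = 120_000  # the "2 minutes" of lateness from the walkthrough

class MinuteAggregator:
    """Event-time tumbling windows with watermark-driven finalization
    and event-ID deduplication."""

    def __init__(self):
        self.windows = {}       # window_start -> count
        self.seen_ids = set()   # event IDs, for idempotent replays
        self.finalized = set()  # windows already emitted as final
        self.watermark = 0

    def on_event(self, event_id: str, event_ts_ms: int) -> None:
        if event_id in self.seen_ids:
            return              # duplicate from a retry/replay: ignore
        self.seen_ids.add(event_id)
        start = event_ts_ms - event_ts_ms % WINDOW_MS
        if start in self.finalized:
            return              # too late: would update the store directly
        self.windows[start] = self.windows.get(start, 0) + 1

    def advance_watermark(self, wm_ms: int):
        """Finalize every window whose end + allowed lateness <= watermark."""
        self.watermark = max(self.watermark, wm_ms)
        emitted = []
        for start in sorted(self.windows):
            if start + WINDOW_MS + ALLOWED_LATENESS_MS <= self.watermark:
                emitted.append((start, self.windows.pop(start)))
                self.finalized.add(start)
        return emitted

agg = MinuteAggregator()
agg.on_event("e1", 1_000)
agg.on_event("e1", 1_000)        # duplicate, not double-counted
agg.on_event("e2", 30_000)
assert agg.advance_watermark(180_000) == [(0, 2)]
```

Note the unbounded `seen_ids` set is a simplification; a real deduplicator would scope IDs to a retention window or use keyed state with a TTL.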
Cover scalability: partition Kafka topics by robot ID hash, scale Flink by adding task slots (since partitioned input enables parallel computation), and shard Cassandra by time range and robot ID prefix. Discuss reliability: Kafka replicates events, Flink checkpoints state for crash recovery, and Cassandra replication handles node failures. Address query performance: pre-aggregate at multiple granularities (per-robot and per-group) and use materialized rollup tables. Mention monitoring and alerting: emit metrics on pipeline lag (the difference between event time and processing time), track late-event percentages, and alert on excessive backpressure. Touch on cost optimization: archive raw events to cold storage after aggregates are computed and downsample historical data into hourly and daily granularities.
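The downsampling step is mechanically simple and worth being able to state precisely. A sketch (function name and row shape are illustrative): finalized one-minute rows of `(window_start_ms, count)` roll up into hourly totals for cheaper long-range queries.

```python
MS_PER_HOUR = 3_600_000

def downsample_hourly(minute_rows):
    """Roll finalized one-minute (window_start_ms, count) rows up into
    hourly totals; sums compose exactly, unlike percentiles, which would
    need sketches (e.g., t-digests) to merge correctly."""
    hourly = {}
    for window_start_ms, count in minute_rows:
        hour = window_start_ms - window_start_ms % MS_PER_HOUR
        hourly[hour] = hourly.get(hour, 0) + count
    return hourly

rows = [(0, 5), (60_000, 7), (3_600_000, 2)]
assert downsample_hourly(rows) == {0: 12, 3_600_000: 2}
```

The comment hints at a point interviewers often probe: counts and sums re-aggregate losslessly, so downsampling is safe for them, while percentile and distinct-count metrics require mergeable sketches rather than naive rollups.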