Design a real-time event aggregation system that ingests continuous telemetry streams from a fleet of robots and presents aggregated metrics on a live dashboard with one-minute granularity. Each robot emits status events, error codes, and performance counters every few seconds. The dashboard shows trends, anomalies, and health indicators that plant operators rely on for operational decisions.
The core engineering challenge is processing millions of events per minute with minimal lag, computing accurate aggregates even when events arrive delayed or out of order, and serving real-time queries to dashboards that need to refresh every few seconds. The system must scale horizontally to accommodate growing robot fleets, handle network disruptions that cause event bursts on reconnection, and clearly distinguish between provisional results (still accepting late data) and finalized metrics.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see whether you understand the difference between when an event occurred at the robot (event time) and when your system processes it (processing time). Network delays, robot-side buffering, and outages cause events to arrive late, making processing-time windows inaccurate.
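The distinction is easiest to show concretely. A minimal sketch (timestamps and constants are illustrative): windows are keyed by the event-time timestamp the robot stamped on the event, so a late arrival still lands in the minute in which it actually occurred.

```python
WINDOW_MS = 60_000  # one-minute tumbling windows

def window_start(event_time_ms: int) -> int:
    """Align an event-time timestamp (ms since epoch) to its one-minute window."""
    return event_time_ms - event_time_ms % WINDOW_MS

# A delayed event is assigned to the window in which it occurred,
# not the window in which it happened to arrive.
occurred_at = 1_700_000_125_000   # timestamp stamped by the robot
arrived_at = 1_700_000_310_000    # received by the pipeline 3+ minutes later
assert window_start(occurred_at) == 1_700_000_100_000
assert window_start(arrived_at) != window_start(occurred_at)
```

A processing-time system would have used `arrived_at` to pick the bucket, silently shifting the late event into the wrong minute.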
With millions of events per minute, a single global aggregator becomes a bottleneck. Interviewers expect a partitioned, parallel architecture that avoids hotspots on popular time buckets.
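One way to sketch the idea (partition count and IDs are illustrative): hash the robot ID to pick a partition, keep per-partition partial aggregates, and merge partials at read time. No single writer owns a popular time bucket, so a hot minute does not serialize on one node.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8

def partition_for(robot_id: str) -> int:
    # Stable hash so the same robot always lands on the same partition.
    digest = hashlib.sha256(robot_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Each partition holds its own partial counts per window start.
partials = [defaultdict(int) for _ in range(NUM_PARTITIONS)]

def ingest(robot_id: str, window_start_ms: int) -> None:
    partials[partition_for(robot_id)][window_start_ms] += 1

def merge(window_start_ms: int) -> int:
    # Read-time merge of per-partition partials into the final count.
    return sum(p[window_start_ms] for p in partials)

for rid in ["robot-1", "robot-2", "robot-3", "robot-1"]:
    ingest(rid, 1_700_000_100_000)
assert merge(1_700_000_100_000) == 4
```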
The database layer must efficiently store and query billions of minute-level aggregate records while supporting both fast writes during ingestion and low-latency reads for dashboard queries.
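A common key-design pattern here is time bucketing, sketched below with hypothetical names: the partition key pairs the robot with the UTC day so no partition grows beyond roughly 1,440 minute rows, while the minute timestamp acts as the clustering column for ordered range scans.

```python
MS_PER_DAY = 86_400_000

def aggregate_row_key(robot_id: str, minute_ts_ms: int) -> tuple:
    """Composite key sketch for minute-level aggregates: partition by
    (robot, day) to bound partition size; cluster by minute timestamp
    so a dashboard's time-range query is a single ordered scan."""
    day = minute_ts_ms // MS_PER_DAY
    partition_key = (robot_id, day)
    clustering_key = minute_ts_ms
    return (partition_key, clustering_key)

# Two minutes of the same UTC day share a partition; writes spread
# across robots, and a day's dashboard query hits one partition per robot.
k1 = aggregate_row_key("robot-7", 1_700_000_100_000)
k2 = aggregate_row_key("robot-7", 1_700_000_160_000)
assert k1[0] == k2[0]
```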
Dashboards need to reflect new data within seconds. Constant polling wastes resources and adds latency. Interviewers want a scalable push mechanism.
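The push path can be explained with a topic-based fan-out: dashboards subscribe to the topics they display, and the pipeline publishes each finished window once, reaching all subscribers. A toy in-process stand-in for the WebSocket service (class and topic names are illustrative):

```python
from collections import defaultdict

class WindowPublisher:
    """Minimal stand-in for a WebSocket fan-out service: dashboards
    register a callback per topic; new window results are pushed once
    per subscriber instead of being polled for."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic: str, callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, result: dict) -> None:
        for callback in self._subscribers[topic]:
            callback(result)

received = []
publisher = WindowPublisher()
publisher.subscribe("fleet/errors", received.append)
publisher.publish("fleet/errors", {"window": 1_700_000_100_000, "count": 7})
assert received == [{"window": 1_700_000_100_000, "count": 7}]
```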
Confirm how many robots exist today and the expected growth rate. Clarify whether one-minute windows are aligned to clock minutes or sliding. Understand acceptable latency for dashboard freshness and how long to wait for late events. Verify which metrics matter most (simple counts, averages, percentiles, distinct counts). Establish whether operators need drill-down capabilities or only fleet-wide views. Ask about exactly-once guarantees versus at-least-once with idempotency.
Sketch a streaming pipeline with five layers: (1) an ingestion tier using Kafka to buffer incoming robot events partitioned by robot ID; (2) a stream processing tier running Flink to perform windowed aggregations with event-time semantics and watermarks; (3) a storage tier using Cassandra to persist per-minute aggregates keyed by timestamp and robot dimensions; (4) an API tier providing REST endpoints for historical queries with Redis caching for recent windows; (5) a push notification service using WebSockets to stream new window results to connected dashboards. Draw data-flow arrows showing events moving from robots through Kafka, being processed into aggregates by Flink, being persisted to Cassandra, and being pushed to dashboards.
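The layered data flow can be narrated over a toy end-to-end pass. This is not an implementation of Kafka, Flink, or Cassandra; a list stands in for the ingestion buffer, a dict fold for windowed aggregation, a dict for the minute-row store, and a callback for the push service (the API/caching tier is elided).

```python
from collections import defaultdict

def run_pipeline(events, on_push):
    """Toy single-pass version of the five-layer flow. Each event is
    assumed to be a dict with 'robot_id' and an event-time 'ts' in ms."""
    buffer = list(events)                       # (1) ingestion buffer
    aggregates = defaultdict(int)               # (2) windowed aggregation
    for event in buffer:
        window = event["ts"] - event["ts"] % 60_000
        aggregates[(event["robot_id"], window)] += 1
    store = dict(aggregates)                    # (3) persisted minute rows
    for key, count in store.items():            # (5) push to dashboards
        on_push(key, count)
    return store

pushed = []
store = run_pipeline(
    [{"robot_id": "r1", "ts": 10_000}, {"robot_id": "r1", "ts": 20_000}],
    lambda key, count: pushed.append((key, count)),
)
assert store == {("r1", 0): 2}
assert pushed == [(("r1", 0), 2)]
```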
Walk through how Flink handles one-minute tumbling windows with event time. Each event carries a timestamp indicating when it occurred at the robot. The processor assigns events to windows (e.g., 10:00:00 inclusive to 10:01:00 exclusive). Watermarks track event-time progress across partitions; when the watermark passes the end of a window plus allowed lateness (say, 2 minutes), the window closes and its result is emitted. Discuss trigger strategies: emit provisional results as data arrives, then a final result when the watermark advances. Address idempotency using unique event IDs and deduplication to prevent double-counting if failures cause reprocessing. Late events that arrive after window close update the stored aggregate in Cassandra and invalidate the corresponding cache entry.
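The mechanics above can be condensed into a small sketch, assuming a simplified Flink-style model (constants and class names are illustrative): events are deduplicated by ID and folded into their event-time window; advancing the watermark finalizes any window whose end plus allowed lateness is behind it.

```python
WINDOW_MS = 60_000
ALLOWED_LATENESS_MS = 120_000  # the "2 minutes" of lateness from the walkthrough

class MinuteAggregator:
    """Event-time tumbling windows with watermark-driven finalization
    and event-ID deduplication."""

    def __init__(self):
        self.windows = {}       # window_start -> count
        self.seen_ids = set()   # event IDs, for idempotent replays
        self.finalized = set()  # windows already emitted as final
        self.watermark = 0

    def on_event(self, event_id: str, event_ts_ms: int) -> None:
        if event_id in self.seen_ids:
            return              # duplicate from a retry/replay: ignore
        self.seen_ids.add(event_id)
        start = event_ts_ms - event_ts_ms % WINDOW_MS
        if start in self.finalized:
            return              # too late: would update the store directly
        self.windows[start] = self.windows.get(start, 0) + 1

    def advance_watermark(self, wm_ms: int):
        """Finalize every window whose end + allowed lateness <= watermark."""
        self.watermark = max(self.watermark, wm_ms)
        emitted = []
        for start in sorted(self.windows):
            if start + WINDOW_MS + ALLOWED_LATENESS_MS <= self.watermark:
                emitted.append((start, self.windows.pop(start)))
                self.finalized.add(start)
        return emitted

agg = MinuteAggregator()
agg.on_event("e1", 1_000)
agg.on_event("e1", 1_000)        # duplicate, not double-counted
agg.on_event("e2", 30_000)
assert agg.advance_watermark(180_000) == [(0, 2)]
```

Note the unbounded `seen_ids` set is a simplification; a real deduplicator would scope IDs to a retention window or use keyed state with a TTL.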
Cover scalability: partition Kafka topics by robot ID hash, scale Flink by adding task slots (since partitioned input enables parallel computation), and shard Cassandra by time range and robot ID prefix. Discuss reliability: Kafka replicates events, Flink checkpoints state for crash recovery, and Cassandra replication handles node failures. Address query performance: pre-aggregate at multiple granularities (per-robot and per-group) and use materialized rollup tables. Mention monitoring and alerting: emit metrics on pipeline lag (the difference between event time and processing time), track late-event percentages, and alert on excessive backpressure. Touch on cost optimization: archive raw events to cold storage after aggregates are computed and downsample historical data into hourly and daily granularities.
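The downsampling step is mechanically simple and worth being able to state precisely. A sketch (function name and row shape are illustrative): finalized one-minute rows of `(window_start_ms, count)` roll up into hourly totals for cheaper long-range queries.

```python
MS_PER_HOUR = 3_600_000

def downsample_hourly(minute_rows):
    """Roll finalized one-minute (window_start_ms, count) rows up into
    hourly totals; sums compose exactly, unlike percentiles, which would
    need sketches (e.g., t-digests) to merge correctly."""
    hourly = {}
    for window_start_ms, count in minute_rows:
        hour = window_start_ms - window_start_ms % MS_PER_HOUR
        hourly[hour] = hourly.get(hour, 0) + count
    return hourly

rows = [(0, 5), (60_000, 7), (3_600_000, 2)]
assert downsample_hourly(rows) == {0: 12, 3_600_000: 2}
```

The comment hints at a point interviewers often probe: counts and sums re-aggregate losslessly, so downsampling is safe for them, while percentile and distinct-count metrics require mergeable sketches rather than naive rollups.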