Design a metrics aggregation system that collects count-based events from application services (such as user signups, API errors, and ad clicks), aggregates them into time buckets, and displays the results as histograms and time-series charts on a monitoring dashboard. Services emit events through a lightweight client library, and the dashboard allows engineers to query aggregated counts within specific time frames, filtered by metric name and tags.
Think of platforms like Datadog, Grafana, or an internal StatsD pipeline. The system ingests a high-volume firehose of metric events -- potentially millions per second across thousands of services -- pre-aggregates them into configurable time windows (1 second, 10 seconds, 1 minute), and serves interactive queries with sub-second response times. The core challenges are sustaining write throughput under bursty traffic, avoiding hot-key contention on popular metrics, choosing the right storage layout for efficient time-range queries, and balancing real-time freshness against aggregation correctness.
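The window math underlying all of this is just flooring a timestamp to the start of its aggregation window. A minimal sketch (function name is illustrative):

```python
def bucket_start(ts: float, window_s: int) -> int:
    """Floor a Unix timestamp to the start of its aggregation window."""
    return int(ts) - int(ts) % window_s

# With a 10-second window, events at t=1700000003 and t=1700000009
# both land in the bucket starting at 1700000000.
assert bucket_start(1700000003, 10) == 1700000000
assert bucket_start(1700000009, 10) == 1700000000
assert bucket_start(1700000010, 10) == 1700000010
```

Every layer of the pipeline (SDK, stream processor, rollup jobs) uses this same bucketing, just with different window sizes.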
Based on real interview experiences, these are the areas interviewers probe most deeply:
With millions of events per second, naive approaches that write each event individually to a database will fail immediately. Interviewers want to see how you decouple producers from consumers and smooth out traffic bursts.
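One common answer is to pre-aggregate inside the client library itself, so each service sends one small batch per flush interval instead of millions of individual events. A sketch of such an SDK, assuming hypothetical class and method names (not a real library):

```python
import threading
from collections import defaultdict

class MetricsClient:
    """Hypothetical SDK sketch: buffer counts in memory, flush in batches.

    Batching collapses many individual increments into one aggregated
    payload per flush interval, decoupling producers from the pipeline
    and smoothing out bursts."""

    def __init__(self, flush_interval_s: float = 5.0, transport=None):
        self._counts = defaultdict(int)      # (metric, sorted tags) -> count
        self._lock = threading.Lock()
        # A real SDK would flush on a background timer every flush_interval_s.
        self._flush_interval_s = flush_interval_s
        self._transport = transport or (lambda batch: None)  # e.g. HTTP POST

    def incr(self, metric: str, value: int = 1, **tags):
        key = (metric, tuple(sorted(tags.items())))
        with self._lock:
            self._counts[key] += value       # pre-aggregate locally

    def flush(self):
        with self._lock:
            batch, self._counts = dict(self._counts), defaultdict(int)
        if batch:
            self._transport(batch)           # one network call per interval
        return batch
```

A thousand `incr("signups")` calls in one interval become a single entry in a single payload, so the ingestion tier sees batches, not the raw firehose.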
Hints to consider:
Popular metrics like "api_requests" may receive orders of magnitude more events than others. A single counter per metric becomes a write bottleneck that limits horizontal scaling.
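A standard mitigation is to split a hot counter into N sub-counters: each write picks a random shard, and reads sum all shards to reconstruct the total. A sketch using a plain dict as a stand-in for Redis (the names and shard count are illustrative):

```python
import random

NUM_SHARDS = 16  # assumption: tune to the hot metric's write rate

def shard_key(metric: str, bucket_ts: int) -> str:
    """Spread writes for one hot counter across NUM_SHARDS sub-keys."""
    shard = random.randrange(NUM_SHARDS)
    return f"{metric}:{bucket_ts}:shard{shard}"

def read_total(store: dict, metric: str, bucket_ts: int) -> int:
    """Reads pay the cost: sum every shard to recover the true count."""
    return sum(store.get(f"{metric}:{bucket_ts}:shard{i}", 0)
               for i in range(NUM_SHARDS))

# Each write hits a random shard, so no single key serializes
# all increments for "api_requests" in this bucket.
store = {}
for _ in range(1000):
    k = shard_key("api_requests", 1700000000)
    store[k] = store.get(k, 0) + 1
assert read_total(store, "api_requests", 1700000000) == 1000
```

The trade-off is deliberate: writes stay contention-free at the cost of a fan-out on read, which is acceptable because reads are far less frequent than writes in this workload.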
Hints to consider:
Dashboards execute range queries across time windows with dimension filters. The storage schema must support efficient scans without full table traversal.
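One layout that supports this in a Cassandra-style store is to partition by (metric, tags, day) and cluster rows by bucket timestamp, so a dashboard query scans one contiguous slice per partition rather than the whole table. A sketch with a dict standing in for the table (the key shape is an assumption, not a prescribed schema):

```python
from datetime import datetime, timezone

def row_key(metric: str, tags: dict, bucket_ts: int) -> tuple:
    """Partition by (metric, serialized tags, UTC day); within a
    partition, rows are keyed by bucket timestamp for range scans."""
    day = datetime.fromtimestamp(bucket_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return (metric, tag_str, day)

def range_query(table: dict, metric: str, tags: dict,
                start_ts: int, end_ts: int, window_s: int = 60) -> list:
    """Emulate a clustering-key scan: visit only buckets in [start, end)."""
    out = []
    for ts in range(start_ts, end_ts, window_s):
        count = table.get(row_key(metric, tags, ts), {}).get(ts)
        if count is not None:
            out.append((ts, count))
    return out
```

Bounding the partition by day keeps any single partition from growing without limit as data accumulates, while still letting a multi-day query touch a predictable, small number of partitions.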
Hints to consider:
At-least-once delivery means events may arrive more than once, and network delays cause late arrivals. Without a clear correctness model, counts drift and dashboards become unreliable.
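A workable correctness model is to make increments idempotent: key each ingested batch with a unique ID and discard replays. A minimal sketch (stamping a `batch_id` at the ingestion API is an assumption of this sketch):

```python
class IdempotentCounter:
    """Sketch: deduplicate at-least-once deliveries via batch IDs.

    The ingestion tier stamps each batch with a unique batch_id
    (an assumption); the consumer remembers recently seen IDs and
    skips replays, so retried batches never double-count."""

    def __init__(self):
        # In production: a TTL'd set sized to the redelivery window,
        # not an unbounded in-memory set.
        self.seen = set()
        self.counts = {}   # (metric, bucket_ts) -> count

    def apply(self, batch_id: str, metric: str, bucket_ts: int, value: int) -> bool:
        if batch_id in self.seen:
            return False   # duplicate delivery: ignore
        self.seen.add(batch_id)
        key = (metric, bucket_ts)
        self.counts[key] = self.counts.get(key, 0) + value
        return True
```

This converts at-least-once delivery into effectively-once counting; late arrivals are a separate concern, handled by keeping windows open for a grace period before finalizing them.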
Hints to consider:
Begin by confirming scale parameters: how many services emit metrics, the expected events per second at average and peak load, and how many concurrent dashboard users query the system. Clarify which metric types are in scope (counts only, or also gauges, histograms, and percentiles). Establish acceptable latency targets for both ingestion and queries. Ask about data retention policies, cardinality limits on tag values, and whether threshold-based alerting is in scope.
Sketch a pipeline with three layers. The ingestion layer consists of lightweight SDKs that batch events client-side and send them to a stateless API tier, which writes to Kafka partitioned by metric name and tag hash. The processing layer runs stream processors (Flink or Kafka Streams) that maintain in-memory counters per time bucket, flush pre-aggregated results to Redis for the most recent windows, and write finalized buckets to Cassandra for long-term storage. The serving layer provides a query API that reads from Redis for near-real-time data and from Cassandra for historical ranges, with a caching layer for popular dashboard queries.
Walk through an event lifecycle. A service calls the SDK with a metric name, value, and tags. The SDK buffers events locally and flushes a batch every 5 seconds to the ingestion API. The API validates, enriches with a receive timestamp, and publishes to a Kafka topic partitioned by a hash of metric name plus tags. A Flink consumer maintains in-memory counters for each open time window (1-second tumbling windows). When a window closes, the consumer atomically increments the corresponding Redis key (metric:tags:timestamp) and also writes the finalized count to Cassandra. For late events arriving after window close, the consumer updates the Cassandra record and invalidates the Redis cache entry for that bucket. Discuss how partitioning by metric plus tags ensures all events for one counter land on the same worker, eliminating distributed contention.
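The consumer's window logic can be sketched as a watermark-driven tumbling-window counter. This is a simplified stand-in for what Flink's windowing API provides; the class name and the 5-second grace period are assumptions:

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Sketch of the consumer's in-memory aggregation.

    Counters accumulate per bucket; once the watermark passes a
    bucket's close time plus an allowed-lateness grace period, the
    bucket is finalized and flushed downstream (Redis + Cassandra
    in the design above)."""

    def __init__(self, window_s: int = 1, allowed_lateness_s: int = 5, flush=None):
        self.window_s = window_s
        self.lateness = allowed_lateness_s
        self.flush = flush or (lambda bucket, count: None)  # finalization callback
        self.open = defaultdict(int)   # bucket_ts -> running count

    def on_event(self, event_ts: int, value: int = 1):
        bucket = event_ts - event_ts % self.window_s
        self.open[bucket] += value     # late events within grace still land here

    def advance_watermark(self, now_ts: int):
        cutoff = now_ts - self.lateness
        for bucket in sorted(b for b in self.open if b + self.window_s <= cutoff):
            self.flush(bucket, self.open.pop(bucket))
```

The grace period is the freshness/correctness dial from the requirements: a longer allowed lateness absorbs more stragglers before finalization but delays when dashboards see closed buckets.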
Discuss rollup jobs that periodically aggregate 1-second buckets into 1-minute and 1-hour summaries in Cassandra, reducing query scan ranges for long time windows. Cover caching: popular dashboard queries (e.g., "api_requests for the last hour") are served from pre-computed rollups cached in Redis. Address cardinality control: reject or sample tag values that exceed a cardinality threshold to prevent storage explosion. Explain monitoring of the pipeline itself: track Kafka consumer lag, Flink checkpoint intervals, and query latency percentiles. Discuss fault tolerance: Kafka provides durable replay if a Flink worker crashes, and Cassandra replication handles node failures.
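The rollup step itself is a straightforward re-bucketing of already-finalized fine-grained counts; a sketch (timestamps and window sizes are illustrative):

```python
from collections import defaultdict

def rollup(fine_buckets: dict, coarse_window_s: int = 60) -> dict:
    """Aggregate finalized fine-grained buckets (e.g. 1s) into coarser
    summaries (e.g. 1m), so long-range queries scan ~60x fewer rows."""
    coarse = defaultdict(int)
    for bucket_ts, count in fine_buckets.items():
        coarse[bucket_ts - bucket_ts % coarse_window_s] += count
    return dict(coarse)

# 120 one-second buckets of 1 event each collapse into two 1-minute rows.
fine = {i: 1 for i in range(120)}
assert rollup(fine) == {0: 60, 60: 60}
```

Because rollups operate only on finalized buckets, they can be re-run safely after late-event corrections; counts are summable, so the rollup is exact rather than approximate (which would not hold for percentiles).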