Design a Metrics Aggregator
Problem Statement
Design a metrics collection and aggregation platform — similar to StatsD, Datadog, or Prometheus — that receives high-throughput counter, gauge, and timer metrics from thousands of services, aggregates them into time-bucketed rollups, and serves dashboard queries over arbitrary time ranges.
The fundamental challenge is handling massive write throughput (potentially millions of data points per second) while providing low-latency reads for dashboard queries that span hours or days. You need to aggregate raw data points into meaningful summaries (sums, averages, percentiles) at multiple time granularities (1-second, 1-minute, 1-hour) without losing accuracy.
Hot key contention is a critical concern: a single popular metric (like total HTTP requests) can receive updates from every service instance simultaneously, creating a bottleneck if all writes target the same counter.
Key Requirements
Functional
- Metric ingestion -- Accept counter increments, gauge values, and timer observations from service agents via UDP or HTTP.
- Time-bucketed aggregation -- Aggregate raw data points into configurable time buckets (1s, 1m, 1h) with rollups for sum, count, average, min, max, and percentiles.
- Dashboard queries -- Serve queries that retrieve aggregated metric values over a time range, optionally grouped by tags (service, region, host).
- Alerting rules -- Evaluate threshold-based alert conditions on aggregated metrics and trigger notifications when breached.
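The ingestion requirement above is easiest to picture with a concrete wire format. The sketch below parses a StatsD-style line protocol (`name:value|type`); the exact format is an assumption for illustration, since real agents often also carry tags and sample rates.

```python
# Minimal parser for a StatsD-style line protocol, e.g. "http.requests:1|c".
# The format and type codes (c = counter, g = gauge, ms = timer) follow the
# common StatsD convention; tags and sample rates are omitted for brevity.

def parse_metric(line: str):
    """Parse 'name:value|type' into (name, value, kind)."""
    name, rest = line.split(":", 1)
    value, mtype = rest.split("|", 1)
    kinds = {"c": "counter", "g": "gauge", "ms": "timer"}
    if mtype not in kinds:
        raise ValueError(f"unknown metric type: {mtype}")
    return name, float(value), kinds[mtype]

print(parse_metric("http.requests:1|c"))   # ('http.requests', 1.0, 'counter')
print(parse_metric("db.latency:320|ms"))   # ('db.latency', 320.0, 'timer')
```

Keeping the wire format this small is what makes UDP ingestion viable: a lost datagram costs one data point, not a connection.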
Non-Functional
- Scalability -- Ingest 5 million+ data points per second across all tenants.
- Latency -- Dashboard queries over a 24-hour range return within 500ms; recent metrics (last 5 minutes) visible within 10 seconds.
- Durability -- No metric data loss during normal operation; at-least-once delivery from agents to the pipeline.
- Storage efficiency -- Retain raw data for 24 hours, 1-minute rollups for 30 days, and 1-hour rollups for 1 year.
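A quick back-of-envelope calculation shows why the tiered retention above matters. The 16-bytes-per-point figure (8-byte timestamp plus 8-byte value) is an assumption; real time-series encodings such as Gorilla-style compression do considerably better.

```python
# Back-of-envelope storage estimate for the raw-data retention target.
# Assumes a naive 16 bytes per stored point (8-byte timestamp + 8-byte
# value), ignoring indexes, replication, and compression.
POINTS_PER_SEC = 5_000_000
BYTES_PER_POINT = 16
SECONDS_PER_DAY = 86_400

raw_per_day = POINTS_PER_SEC * SECONDS_PER_DAY * BYTES_PER_POINT
print(f"raw 24h retention: ~{raw_per_day / 1e12:.1f} TB")  # ~6.9 TB
```

Roughly 7 TB of raw data per day (before replication) is why raw points age out after 24 hours while only compact rollups are kept long-term.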
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Hot Key Contention
When thousands of hosts increment the same metric simultaneously, naive designs create a write bottleneck.
Hints to consider:
- Consider local pre-aggregation on each agent — flush accumulated counts every 10 seconds instead of sending every increment
- How do you merge pre-aggregated partial sums from multiple agents into the final rollup?
- Think about sharding writes by time bucket so that concurrent flushes target different partitions
- What happens if an agent crashes before flushing — how do you handle data loss vs. double-counting?
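The first hint above, agent-side pre-aggregation, can be sketched as follows. Class and method names are hypothetical; a real agent would flush on a timer thread and retry failed sends.

```python
# Sketch of agent-side pre-aggregation: increments accumulate in a local
# dict and are flushed periodically as partial sums, so a hot counter
# costs one network send per flush interval instead of one per increment.
from collections import defaultdict

class PreAggregatingAgent:
    def __init__(self, flush_interval_s: float = 10.0):
        self.flush_interval_s = flush_interval_s
        self.counters = defaultdict(float)

    def increment(self, metric: str, value: float = 1.0) -> None:
        # O(1) in-memory update -- no network call per increment.
        self.counters[metric] += value

    def flush(self) -> dict:
        # Emit partial sums and reset. The backend merges partials from
        # all agents by simple addition, since counter sums are associative.
        partials, self.counters = dict(self.counters), defaultdict(float)
        return partials

agent = PreAggregatingAgent()
for _ in range(1000):
    agent.increment("http.requests")
print(agent.flush())  # {'http.requests': 1000.0}
```

The trade-off the last hint asks about is visible here: counts held in `self.counters` between flushes are lost if the agent crashes, while retrying a flush whose acknowledgment was lost risks double-counting.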
2. Time-Bucketed Aggregation Pipeline
Converting a continuous stream of raw data points into discrete time buckets is the core data processing challenge.
Hints to consider:
- Use Kafka as the ingestion buffer, partitioned by metric name hash, to distribute aggregation load
- How do you handle late-arriving data points that belong to an already-closed time bucket?
- Consider a two-phase approach: stream processors aggregate into 1-second buckets, then a background job rolls up into 1-minute and 1-hour buckets
- What data structure efficiently tracks percentiles (p50, p95, p99) without storing every raw value?
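One answer to the percentile question is to keep a bucketed histogram per time bucket instead of raw values. Production systems typically use t-digest or HDRHistogram; the fixed-bound sketch below (bucket edges are assumptions) just shows the idea, including why percentiles become approximate.

```python
# Bucketed histogram sketch: O(buckets) memory regardless of how many
# values are observed. Percentiles resolve only to bucket boundaries,
# and two sketches merge by adding their counts element-wise.
import bisect

class HistogramSketch:
    def __init__(self, bounds):
        self.bounds = bounds                   # upper edge of each bucket (ms)
        self.counts = [0] * (len(bounds) + 1)  # final slot = overflow
        self.total = 0

    def observe(self, value: float) -> None:
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def percentile(self, p: float) -> float:
        # Return the upper bound of the bucket holding the p-th percentile.
        rank, seen = p * self.total, 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self.bounds[min(i, len(self.bounds) - 1)]
        return self.bounds[-1]

h = HistogramSketch(bounds=[10, 50, 100, 250, 500, 1000])
for v in [5, 20, 30, 80, 120, 400, 90, 60, 15, 700]:
    h.observe(v)
print(h.percentile(0.95))  # 1000 (p95 falls in the 500-1000ms bucket)
```

Because the buckets merge by addition, pre-aggregated partial histograms from many stream processors combine into a rollup without re-reading raw values, which is exactly what percentiles computed from stored averages cannot do.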
3. Query Performance Over Time Ranges
Dashboard queries spanning days of data must remain fast despite the data volume.
Hints to consider:
- Use Cassandra with a partition key of (metric_id, time_bucket_granularity) and a clustering key of timestamp for efficient range scans
- How do you automatically select the right granularity — 1-second vs. 1-minute vs. 1-hour — based on the query's time range?
- Consider pre-computing common dashboard queries to avoid aggregating on the fly
- What caching strategy works for metrics dashboards where data at the edges changes but historical data is immutable?
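The granularity-selection hint above can be made concrete with a simple rule: pick the coarsest rollup that still yields enough points to draw the chart. The thresholds below are assumptions chosen for illustration.

```python
# Possible granularity-selection rule for dashboard queries: return the
# first granularity whose point count fits within the chart's resolution.
GRANULARITIES_S = [1, 60, 3600]   # 1s, 1m, 1h buckets
MAX_POINTS = 1500                 # roughly the pixels available on a chart

def pick_granularity(range_seconds: int) -> int:
    for g in GRANULARITIES_S:
        if range_seconds // g <= MAX_POINTS:
            return g
    return GRANULARITIES_S[-1]

print(pick_granularity(10 * 60))         # 1    (10 min -> 600 1s points)
print(pick_granularity(24 * 3600))       # 60   (24h -> 1440 1m points)
print(pick_granularity(30 * 24 * 3600))  # 3600 (30d -> 720 1h points)
```

This rule also dovetails with the caching hint: queries answered from the 1-minute or 1-hour tiers touch mostly immutable historical buckets, so only the newest bucket at the range's trailing edge needs to bypass the cache.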