Design a Metrics Aggregator
Problem Statement
Design a metrics aggregation system that collects count-based metrics (like user signups, system errors, ad clicks) from services via a client library and displays them as histogram data on a dashboard. The system should support querying aggregated metrics within specific time frames for monitoring and analytics.
This is a StatsD/Datadog-style platform that lets services emit count-based events via a lightweight client library. The backend ingests a high-volume firehose, aggregates counts into time buckets, and powers dashboards that show histograms and time-series charts with filters by tags and time ranges. Interviewers ask this to test your ability to design high-throughput, low-latency ingestion and aggregation pipelines, reason about windowed counters and hot-key contention, and choose storage layouts suited to time-series data.
Key Requirements
Functional
- Metric emission -- services instrument with a client library to emit count metrics with optional tags (e.g., env=prod, region=us-east)
- Dashboard visualization -- users view near-real-time histograms and time-series graphs for selected metrics over a chosen time range and resolution (1s, 10s, 1m)
- Filtered queries -- users query aggregated counts filtered by metric name and tags to support monitoring and analytics use cases
- Predictable read latency -- users retrieve aggregated metrics within specific time frames with consistent latency even under high write load
Non-Functional
- Scalability -- sustain millions of metric events per second from thousands of services with bursty traffic patterns
- Reliability -- maintain 99.9% uptime for ingestion with no silent data loss; dashboards remain available during backend maintenance
- Latency -- metric events reflected in dashboards within 10-30 seconds; query responses under 500ms for common time ranges
- Consistency -- eventual consistency acceptable for counter aggregates with convergence within one minute; at-least-once delivery with bucket-level idempotence
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Metrics ingestion is write-heavy and bursty. The system must sustain millions of events per second without dropping data or creating backpressure on emitting services.
Hints to consider:
- Use a client library that batches and buffers metrics locally before sending to reduce network overhead and smooth bursts
- Place Kafka as a durable buffer between client libraries and aggregation workers, partitioned by metric name or tag hash
- Design the ingestion API to be stateless and horizontally scalable, accepting fire-and-forget UDP or batched HTTP payloads
- Implement backpressure signals so producers can shed low-priority metrics during sustained overload
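The first hint above — client-side batching and buffering — can be sketched as a small client class. This is a hypothetical illustration, not a real StatsD client: `MetricsClient`, `incr`, and the `transport` callable are assumed names, and a production library would flush on a background timer and over UDP or batched HTTP rather than via a manual `flush()` call.

```python
import threading
from collections import defaultdict

class MetricsClient:
    """Hypothetical sketch of a batching metrics client: increments are
    merged locally per (metric, tags) series, so a flush sends one
    aggregated entry per series instead of one packet per event."""

    def __init__(self, transport):
        self._transport = transport      # callable that ships one batch dict
        self._lock = threading.Lock()
        self._buffer = defaultdict(int)  # (metric, sorted-tags) -> count

    def incr(self, metric, count=1, **tags):
        # Freeze tags into a hashable key so equal series merge locally.
        key = (metric, tuple(sorted(tags.items())))
        with self._lock:
            self._buffer[key] += count

    def flush(self):
        # Swap the buffer out under the lock; send outside the lock so a
        # slow transport never blocks instrumented application threads.
        with self._lock:
            batch, self._buffer = self._buffer, defaultdict(int)
        if batch:
            self._transport(dict(batch))

sent = []
client = MetricsClient(transport=sent.append)
for _ in range(1000):
    client.incr("user.signups", env="prod")
client.incr("ad.clicks", count=5, region="us-east")
client.flush()
# 1000 signup events collapse into a single batch entry, smoothing bursts
# and cutting network overhead by orders of magnitude for hot series.
```

The key design point is the buffer swap: the application thread only ever touches an in-memory dict, so emission stays fire-and-forget even when the network is slow.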
2. Hot Key Contention
Popular metrics create hot keys where millions of increments target the same counter. A single counter becomes a bottleneck that limits horizontal scale.
Hints to consider:
- Use sharded counters where increments are distributed across N sub-keys, with periodic merging for query time
- Implement per-bucket aggregation in stream processors so raw events are reduced to partial sums before hitting storage
- Consider striped Redis keys or in-memory aggregation in Flink workers to absorb skewed traffic
- Design the aggregation layer to handle hot-key detection and automatic re-partitioning
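The sharded-counter idea from the first hint can be shown in a few lines. This is a minimal in-memory sketch: the `store` dict stands in for striped Redis keys or Cassandra cells, and `N_SHARDS` is an assumed tuning knob sized to the observed write skew.

```python
import random
from collections import defaultdict

N_SHARDS = 8  # assumed shard count; raise it for hotter keys

store = defaultdict(int)  # stands in for Redis/Cassandra counter cells

def incr_sharded(metric, bucket, count=1):
    # Writes fan out across N sub-keys, so no single key serializes
    # all increments for a popular metric in one time bucket.
    shard = random.randrange(N_SHARDS)
    store[(metric, bucket, shard)] += count

def read_sharded(metric, bucket):
    # Reads pay the merge cost: sum the sub-keys back into one counter.
    return sum(store[(metric, bucket, s)] for s in range(N_SHARDS))

for _ in range(10_000):
    incr_sharded("errors.total", bucket=1_700_000_000)
total = read_sharded("errors.total", 1_700_000_000)
```

The trade-off is deliberate: writes are the bottleneck under hot-key skew, so the design makes writes contention-free and pushes an N-way fan-in onto the much rarer read path.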
3. Time-Bucketed Storage and Query Optimization
Dashboards execute range queries across time windows, requiring a storage schema optimized for time-series access patterns.
Hints to consider:
- Store pre-aggregated counts in a wide-column or time-series store (e.g., Cassandra) with partition keys combining metric name, tags, and time bucket
- Maintain multiple resolutions: raw 1-second buckets for recent data, rolled up to 1-minute and 1-hour buckets for historical queries
- Implement TTL-based retention per resolution tier to control storage costs automatically
- Cache frequently queried time ranges (last hour, last 24 hours) in Redis to reduce load on the primary store
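The bucket alignment and multi-resolution rollup described above reduce to simple integer arithmetic. A minimal sketch, with assumed resolution names matching the 1s/1m/1h tiers in the hints:

```python
# Resolution tiers, seconds per bucket (names assumed from the tiers above).
RESOLUTIONS = {"1s": 1, "1m": 60, "1h": 3600}

def bucket_start(ts, resolution_s):
    # Align an event's epoch timestamp to the start of its bucket,
    # e.g. ts=125 at 1m resolution lands in the bucket starting at 120.
    return ts - (ts % resolution_s)

def rollup(fine_buckets, coarse_res_s):
    # Fold fine-grained buckets ({bucket_start: count}) into coarser ones.
    # Counts are summable, so rollups lose no information for this metric type.
    coarse = {}
    for start, count in fine_buckets.items():
        cs = bucket_start(start, coarse_res_s)
        coarse[cs] = coarse.get(cs, 0) + count
    return coarse

# Four 1-second buckets roll up into two 1-minute buckets.
per_second = {0: 1, 59: 2, 60: 3, 119: 4}
per_minute = rollup(per_second, RESOLUTIONS["1m"])
```

Because `bucket_start` is deterministic, every worker and every query computes the same bucket boundaries independently, which is what makes the partition key `(metric, tags, bucket)` stable across the ingestion and query paths.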
4. Correctness Under Failures
Without a clear correctness model and deduplication strategy, counts will drift, dashboards will be misleading, and incident response will suffer.
Hints to consider:
- Design for at-least-once delivery with idempotent aggregation at the bucket level using sequence numbers or dedupe windows
- Handle late-arriving events by allowing aggregation windows to accept updates within a grace period before finalizing
- Implement periodic reconciliation between streaming aggregates and batch recomputation from raw event logs
- Provide data quality dashboards showing ingestion lag, event duplication rate, and bucket completion status
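The first two hints — at-least-once delivery made safe by bucket-level idempotence, plus a grace period for late events — can be sketched together. This is an illustrative model, not a real stream-processor API: `producer_id`/`seq` dedupe keys and the unbounded `seen` set are assumptions (a real system would bound the dedupe window and persist offsets).

```python
class BucketAggregator:
    """Sketch of idempotent, grace-period-aware count aggregation.
    Each update carries a (producer_id, seq) pair so redelivered
    batches are dropped instead of double-counted."""

    def __init__(self, grace_period_s=30):
        self.grace_period_s = grace_period_s
        self.buckets = {}       # (metric, bucket_start) -> count
        self.seen = set()       # dedupe window; bounded in a real system
        self.finalized = set()  # buckets closed to further updates

    def apply(self, producer_id, seq, metric, bucket_start, count):
        if (producer_id, seq) in self.seen:
            return False  # duplicate delivery: at-least-once made idempotent
        if (metric, bucket_start) in self.finalized:
            return False  # arrived after the grace period; bucket is sealed
        self.seen.add((producer_id, seq))
        key = (metric, bucket_start)
        self.buckets[key] = self.buckets.get(key, 0) + count
        return True

    def finalize_expired(self, now, bucket_width_s=1):
        # Seal buckets whose window plus grace period has fully elapsed;
        # only finalized buckets should be served as authoritative.
        for (metric, start) in list(self.buckets):
            if start + bucket_width_s + self.grace_period_s <= now:
                self.finalized.add((metric, start))

agg = BucketAggregator(grace_period_s=30)
agg.apply("svc-a", seq=1, metric="signups", bucket_start=100, count=5)
agg.apply("svc-a", seq=1, metric="signups", bucket_start=100, count=5)  # retry
agg.finalize_expired(now=200)
late = agg.apply("svc-a", seq=2, metric="signups", bucket_start=100, count=1)
```

The retry is absorbed and the late event is rejected once the bucket is sealed, which is exactly the drift the periodic batch reconciliation in the third hint would then measure and correct.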