Design a Metrics Aggregator
Problem Statement
Design a metrics aggregation system that collects count-based metrics (like user signups, system errors, ad clicks) from services via a client library and displays them as histogram data on a dashboard. The system should support querying aggregated metrics within specific time frames for monitoring and analytics.
This is a StatsD/Datadog-style platform that lets services emit count-based events via a lightweight client library. The backend ingests a high-volume firehose, aggregates counts into time buckets, and powers dashboards that show histograms and time-series charts with filters by tags and time ranges. Interviewers ask this to test your ability to design high-throughput, low-latency ingestion and aggregation pipelines, reason about windowed counters and hot-key contention, and choose storage layouts suited to time-series data.
Key Requirements
Functional
- Metric emission -- services instrument with a client library to emit count metrics with optional tags (e.g., env=prod, region=us-east)
- Dashboard visualization -- users view near-real-time histograms and time-series graphs for selected metrics over a chosen time range and resolution (1s, 10s, 1m)
- Filtered queries -- users query aggregated counts filtered by metric name and tags to support monitoring and analytics use cases
- Predictable read latency -- users retrieve aggregated metrics within specific time frames with consistent latency even under high write load
Non-Functional
- Scalability -- sustain millions of metric events per second from thousands of services with bursty traffic patterns
- Reliability -- maintain 99.9% uptime for ingestion with no silent data loss; dashboards remain available during backend maintenance
- Latency -- metric events reflected in dashboards within 10-30 seconds; query responses under 500ms for common time ranges
- Consistency -- eventual consistency acceptable for counter aggregates with convergence within one minute; at-least-once delivery with bucket-level idempotence
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Metrics ingestion is write-heavy and bursty. The system must sustain millions of events per second without dropping data or creating backpressure on emitting services.
Hints to consider:
- Use a client library that batches and buffers metrics locally before sending to reduce network overhead and smooth bursts
- Place Kafka as a durable buffer between client libraries and aggregation workers, partitioned by metric name or tag hash
- Design the ingestion API to be stateless and horizontally scalable, accepting fire-and-forget UDP or batched HTTP payloads
- Implement backpressure signals so producers can shed low-priority metrics during sustained overload
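The first hint above — client-side batching and buffering — can be sketched as a small client class. This is a hypothetical illustration, not a real StatsD client: `MetricsClient`, `incr`, and the `transport` callable are assumed names, and a production library would flush on a background timer and over UDP or batched HTTP rather than via a manual `flush()` call.

```python
import threading
from collections import defaultdict

class MetricsClient:
    """Hypothetical sketch of a batching metrics client: increments are
    merged locally per (metric, tags) series, so a flush sends one
    aggregated entry per series instead of one packet per event."""

    def __init__(self, transport):
        self._transport = transport      # callable that ships one batch dict
        self._lock = threading.Lock()
        self._buffer = defaultdict(int)  # (metric, sorted-tags) -> count

    def incr(self, metric, count=1, **tags):
        # Freeze tags into a hashable key so equal series merge locally.
        key = (metric, tuple(sorted(tags.items())))
        with self._lock:
            self._buffer[key] += count

    def flush(self):
        # Swap the buffer out under the lock; send outside the lock so a
        # slow transport never blocks instrumented application threads.
        with self._lock:
            batch, self._buffer = self._buffer, defaultdict(int)
        if batch:
            self._transport(dict(batch))

sent = []
client = MetricsClient(transport=sent.append)
for _ in range(1000):
    client.incr("user.signups", env="prod")
client.incr("ad.clicks", count=5, region="us-east")
client.flush()
# 1000 signup events collapse into a single batch entry, smoothing bursts
# and cutting network overhead by orders of magnitude for hot series.
```

The key design point is the buffer swap: the application thread only ever touches an in-memory dict, so emission stays fire-and-forget even when the network is slow.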
2. Hot Key Contention
Popular metrics create hot keys where millions of increments target the same counter. A single counter becomes a bottleneck that limits horizontal scale.
Hints to consider:
- Use sharded counters where increments are distributed across N sub-keys, with periodic merging for query time
- Implement per-bucket aggregation in stream processors so raw events are reduced to partial sums before hitting storage
- Consider striped Redis keys or in-memory aggregation in Flink workers to absorb skewed traffic
- Design the aggregation layer to handle hot-key detection and automatic re-partitioning
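The sharded-counter idea from the first hint can be shown in a few lines. This is a minimal in-memory sketch: the `store` dict stands in for striped Redis keys or Cassandra cells, and `N_SHARDS` is an assumed tuning knob sized to the observed write skew.

```python
import random
from collections import defaultdict

N_SHARDS = 8  # assumed shard count; raise it for hotter keys

store = defaultdict(int)  # stands in for Redis/Cassandra counter cells

def incr_sharded(metric, bucket, count=1):
    # Writes fan out across N sub-keys, so no single key serializes
    # all increments for a popular metric in one time bucket.
    shard = random.randrange(N_SHARDS)
    store[(metric, bucket, shard)] += count

def read_sharded(metric, bucket):
    # Reads pay the merge cost: sum the sub-keys back into one counter.
    return sum(store[(metric, bucket, s)] for s in range(N_SHARDS))

for _ in range(10_000):
    incr_sharded("errors.total", bucket=1_700_000_000)
total = read_sharded("errors.total", 1_700_000_000)
```

The trade-off is deliberate: writes are the bottleneck under hot-key skew, so the design makes writes contention-free and pushes an N-way fan-in onto the much rarer read path.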
3. Time-Bucketed Storage and Query Optimization
Dashboards execute range queries across time windows, requiring a storage schema optimized for time-series access patterns.
Hints to consider:
- Store pre-aggregated counts in a wide-column or time-series store (e.g., Cassandra) with partition keys combining metric name, tags, and time bucket
- Maintain multiple resolutions: raw 1-second buckets for recent data, rolled up to 1-minute and 1-hour buckets for historical queries
- Implement TTL-based retention per resolution tier to control storage costs automatically
- Cache frequently queried time ranges (last hour, last 24 hours) in Redis to reduce load on the primary store
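The bucket alignment and multi-resolution rollup described above reduce to simple integer arithmetic. A minimal sketch, with assumed resolution names matching the 1s/1m/1h tiers in the hints:

```python
# Resolution tiers, seconds per bucket (names assumed from the tiers above).
RESOLUTIONS = {"1s": 1, "1m": 60, "1h": 3600}

def bucket_start(ts, resolution_s):
    # Align an event's epoch timestamp to the start of its bucket,
    # e.g. ts=125 at 1m resolution lands in the bucket starting at 120.
    return ts - (ts % resolution_s)

def rollup(fine_buckets, coarse_res_s):
    # Fold fine-grained buckets ({bucket_start: count}) into coarser ones.
    # Counts are summable, so rollups lose no information for this metric type.
    coarse = {}
    for start, count in fine_buckets.items():
        cs = bucket_start(start, coarse_res_s)
        coarse[cs] = coarse.get(cs, 0) + count
    return coarse

# Four 1-second buckets roll up into two 1-minute buckets.
per_second = {0: 1, 59: 2, 60: 3, 119: 4}
per_minute = rollup(per_second, RESOLUTIONS["1m"])
```

Because `bucket_start` is deterministic, every worker and every query computes the same bucket boundaries independently, which is what makes the partition key `(metric, tags, bucket)` stable across the ingestion and query paths.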
4. Correctness Under Failures
Without a clear correctness model and deduplication strategy, counts will drift, dashboards will be misleading, and incident response will suffer.
Hints to consider:
- Design for at-least-once delivery with idempotent aggregation at the bucket level using sequence numbers or dedupe windows
- Handle late-arriving events by allowing aggregation windows to accept updates within a grace period before finalizing
- Implement periodic reconciliation between streaming aggregates and batch recomputation from raw event logs
- Provide data quality dashboards showing ingestion lag, event duplication rate, and bucket completion status
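The first two hints — at-least-once delivery made safe by bucket-level idempotence, plus a grace period for late events — can be sketched together. This is an illustrative model, not a real stream-processor API: `producer_id`/`seq` dedupe keys and the unbounded `seen` set are assumptions (a real system would bound the dedupe window and persist offsets).

```python
class BucketAggregator:
    """Sketch of idempotent, grace-period-aware count aggregation.
    Each update carries a (producer_id, seq) pair so redelivered
    batches are dropped instead of double-counted."""

    def __init__(self, grace_period_s=30):
        self.grace_period_s = grace_period_s
        self.buckets = {}       # (metric, bucket_start) -> count
        self.seen = set()       # dedupe window; bounded in a real system
        self.finalized = set()  # buckets closed to further updates

    def apply(self, producer_id, seq, metric, bucket_start, count):
        if (producer_id, seq) in self.seen:
            return False  # duplicate delivery: at-least-once made idempotent
        if (metric, bucket_start) in self.finalized:
            return False  # arrived after the grace period; bucket is sealed
        self.seen.add((producer_id, seq))
        key = (metric, bucket_start)
        self.buckets[key] = self.buckets.get(key, 0) + count
        return True

    def finalize_expired(self, now, bucket_width_s=1):
        # Seal buckets whose window plus grace period has fully elapsed;
        # only finalized buckets should be served as authoritative.
        for (metric, start) in list(self.buckets):
            if start + bucket_width_s + self.grace_period_s <= now:
                self.finalized.add((metric, start))

agg = BucketAggregator(grace_period_s=30)
agg.apply("svc-a", seq=1, metric="signups", bucket_start=100, count=5)
agg.apply("svc-a", seq=1, metric="signups", bucket_start=100, count=5)  # retry
agg.finalize_expired(now=200)
late = agg.apply("svc-a", seq=2, metric="signups", bucket_start=100, count=1)
```

The retry is absorbed and the late event is rejected once the bucket is sealed, which is exactly the drift the periodic batch reconciliation in the third hint would then measure and correct.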