Design a Metrics Aggregator
Problem Statement
Design a metrics aggregation system that collects count-based metrics (like user signups, system errors, ad clicks) from services via a client library and displays them as histogram data on a dashboard. The system should support querying aggregated metrics within specific time frames for monitoring and analytics purposes.
Think of a StatsD or Datadog-style platform where engineering teams instrument their applications with a lightweight SDK that emits count events tagged with metadata like environment, region, or service name. The backend ingests this high-volume firehose, aggregates counts into configurable time buckets, and powers dashboards that show time-series charts, histograms, and breakdowns by tag with sub-second query response times.
This problem tests your ability to design a high-throughput, write-heavy ingestion pipeline while simultaneously serving fast analytical queries. You will need to reason about streaming versus batch aggregation, hot-key contention for popular metrics, time-series storage layouts, cardinality management for tags, and the trade-offs between real-time freshness and query performance. Interviewers expect you to address how duplicate events, late arrivals, and bursty traffic are handled without corrupting dashboard accuracy.
Key Requirements
Functional
- Client-side instrumentation -- Services emit count metrics with optional tags (e.g., env=prod, region=us-east) through a lightweight client library that batches and ships events
- Time-window aggregation -- The system computes aggregated counts over configurable intervals (1 second, 10 seconds, 1 minute, 1 hour) with near-real-time freshness
- Dashboard queries -- Users view time-series graphs and histograms for selected metrics, filtered by metric name, tags, and time range, with response times under 500 milliseconds
- Multi-resolution rollups -- Fine-grained data (1-second buckets) is retained for recent periods while coarser rollups (minute, hour, day) are kept for longer-term trend analysis
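Both the query path and the rollup path depend on aligning event timestamps to bucket boundaries at each resolution. A minimal sketch of that alignment, with the resolution names and helper functions being illustrative rather than taken from any real system:

```python
# Bucket sizes mirror the resolutions listed above (1 s, 10 s, 1 m, 1 h).
RESOLUTIONS_SEC = {"1s": 1, "10s": 10, "1m": 60, "1h": 3600}

def bucket_start(ts_epoch_sec: int, resolution: str) -> int:
    """Floor a Unix timestamp to the start of its bucket at the given resolution."""
    size = RESOLUTIONS_SEC[resolution]
    return (ts_epoch_sec // size) * size

def bucket_range(start: int, end: int, resolution: str) -> list:
    """All bucket boundaries a dashboard query for [start, end) must scan."""
    size = RESOLUTIONS_SEC[resolution]
    return list(range(bucket_start(start, resolution), end, size))
```

Every event increments exactly one bucket per resolution, and a query for a time range reduces to fetching a contiguous run of bucket rows.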
Non-Functional
- Scalability -- Sustain ingestion of 1 million events per second across thousands of producing services with headroom for traffic spikes
- Reliability -- Tolerate individual component failures without losing metric data; provide at-least-once delivery with eventual accuracy
- Latency -- Reflect new events in dashboards within 10 seconds for the most recent time windows
- Availability -- Maintain 99.9% uptime for both the ingestion endpoint and the query API
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Buffering
High-throughput event ingestion requires careful decoupling between producers and aggregation logic. Interviewers want to see how you handle bursty traffic without dropping data or overwhelming downstream components.
Hints to consider:
- Use Kafka as a durable buffer between client libraries and aggregation workers, partitioned by metric name hash for parallel processing
- Implement client-side batching in the SDK to reduce network overhead and server-side request rates
- Design the ingestion API to be stateless and horizontally scalable behind a load balancer
- Address backpressure: monitor consumer lag continuously and scale aggregation workers independently when Kafka consumers fall behind
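The client-side batching hint can be sketched as a small SDK wrapper. The class name, `max_batch` parameter, and `transport` callable are illustrative assumptions, not an actual StatsD or Datadog API; a real SDK would also run a background timer that flushes on an interval:

```python
import threading
import time

class MetricsClient:
    """Count-metric SDK sketch: buffers increments locally and ships them
    in batches to cut network overhead and server-side request rates."""

    def __init__(self, transport, max_batch=500):
        self._transport = transport  # callable taking a list of events, e.g. one HTTP POST
        self._max_batch = max_batch
        self._buffer = []
        self._lock = threading.Lock()

    def incr(self, metric, count=1, **tags):
        event = {"metric": metric, "count": count,
                 "tags": tags, "ts": int(time.time())}
        with self._lock:
            self._buffer.append(event)
            if len(self._buffer) >= self._max_batch:
                self._flush_locked()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._transport(batch)
```

Batching like this is what lets thousands of services emit millions of events per second without the ingestion API seeing a request per event.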
2. Streaming Aggregation and Window Management
Computing accurate counts from a firehose of events involves trade-offs between latency, accuracy, and resource cost. Interviewers probe your understanding of windowed aggregation and late-event handling.
Hints to consider:
- Use stream processing (Apache Flink, Kafka Streams) to maintain running counters in tumbling time windows
- Handle late-arriving events with grace periods or correction windows that update previously closed buckets
- Pre-aggregate in-memory using sharded counters in Redis before flushing to durable storage
- Discuss the trade-off between per-event processing and micro-batch aggregation for throughput versus latency
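One way to picture tumbling windows with a grace period is a small in-memory aggregator. This is a sketch under the assumption of per-process counters and a watermark tracking the highest event time seen, not actual Flink or Kafka Streams code:

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Counts events into fixed windows; a late event within the grace
    period still corrects its (already closed) window's bucket."""

    def __init__(self, window_sec=10, grace_sec=30):
        self.window_sec = window_sec
        self.grace_sec = grace_sec
        self.buckets = defaultdict(int)  # (metric, window_start) -> count
        self.watermark = 0               # highest event timestamp seen

    def add(self, metric, ts, count=1):
        self.watermark = max(self.watermark, ts)
        # Events older than the grace period are dropped (or dead-lettered).
        if ts < self.watermark - self.grace_sec:
            return False
        window_start = (ts // self.window_sec) * self.window_sec
        self.buckets[(metric, window_start)] += count
        return True
```

The grace period bounds how long closed buckets stay mutable, which in turn bounds how long the dashboard's most recent points can shift after first being rendered.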
3. Hot-Key Contention for Popular Metrics
A small number of metrics (like API error counts for a major service) will receive disproportionate write traffic, creating hotspots. Interviewers expect you to prevent single-point contention.
Hints to consider:
- Use sharded counters where increments are distributed across multiple Redis keys or database rows per metric per time bucket
- Aggregate shard totals in periodic background jobs and cache the summed result for read queries
- Partition Kafka topics by metric hash so hot metrics are spread across multiple consumer instances
- Apply rate limiting at the client library level to prevent a single service from overwhelming the pipeline
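The sharded-counter hint can be sketched with a plain dictionary standing in for Redis; in practice each shard would be a separate Redis key incremented with INCRBY, and the shard fan-out of 16 is a tuning assumption, not a prescription:

```python
import random
from collections import defaultdict

NUM_SHARDS = 16  # assumed fan-out per (metric, bucket); tune to write volume

store = defaultdict(int)  # stand-in for Redis: key -> integer counter

def incr(metric, bucket_start, count=1):
    """Spread increments across shards so no single key becomes a hotspot."""
    shard = random.randrange(NUM_SHARDS)
    store["%s:%d:%d" % (metric, bucket_start, shard)] += count

def read_total(metric, bucket_start):
    """Sum all shards; a background job would do this and cache the result."""
    return sum(store["%s:%d:%d" % (metric, bucket_start, s)]
               for s in range(NUM_SHARDS))
```

Writes stay contention-free because concurrent increments land on different keys; the cost is a fan-in of NUM_SHARDS reads at query time, which the background summing job amortizes.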
4. Time-Series Storage and Query Optimization
Analytics queries scan large time ranges and filter by multiple dimensions, making storage schema design critical for performance.
Hints to consider:
- Use a time-series-optimized store such as ClickHouse or TimescaleDB, or a wide-column store like Apache Cassandra, with partition keys on metric name and time bucket
- Design multi-resolution rollup tables: raw 1-second data retained for 24 hours, 1-minute rollups for 30 days, hourly rollups for 1 year
- Use TTLs for automatic data lifecycle management across storage tiers
- Cache frequently queried time ranges (last 1 hour, last 24 hours) in Redis to reduce storage layer load
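The rollup tables above are populated by periodically collapsing fine buckets into coarser ones. A minimal sketch of that rollup step, with the row shape (metric, bucket_start, count) assumed for illustration:

```python
from collections import defaultdict

def rollup(fine_rows, coarse_sec=60):
    """Collapse fine-grained bucket rows into coarser buckets.

    fine_rows: iterable of (metric, bucket_start, count) tuples, e.g. rows
    from the raw 1-second table; returns rows for the 1-minute rollup table.
    """
    coarse = defaultdict(int)
    for metric, bucket_start, count in fine_rows:
        coarse_start = (bucket_start // coarse_sec) * coarse_sec
        coarse[(metric, coarse_start)] += count
    return sorted((m, b, c) for (m, b), c in coarse.items())
```

Because counts are additive, rollups lose no accuracy; once a minute's raw buckets are final (past the grace period), the 1-second rows can expire via TTL while the 1-minute rows carry the trend forward.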