Design a Metrics Aggregator
Problem Statement
Design a metrics aggregation system that collects count-based metrics (like user signups, system errors, ad clicks) from services via a client library and displays them as histogram data on a dashboard. The system should support querying aggregated metrics within specific time frames for monitoring and analytics purposes.
Think of a StatsD or Datadog-style platform where engineering teams instrument their applications with a lightweight SDK that emits count events tagged with metadata like environment, region, or service name. The backend ingests this high-volume firehose, aggregates counts into configurable time buckets, and powers dashboards that show time-series charts, histograms, and breakdowns by tag with sub-second query response times.
This problem tests your ability to design a high-throughput, write-heavy ingestion pipeline while simultaneously serving fast analytical queries. You will need to reason about streaming versus batch aggregation, hot-key contention for popular metrics, time-series storage layouts, cardinality management for tags, and the trade-offs between real-time freshness and query performance. Interviewers expect you to address how duplicate events, late arrivals, and bursty traffic are handled without corrupting dashboard accuracy.
Key Requirements
Functional
- Client-side instrumentation -- Services emit count metrics with optional tags (e.g., env=prod, region=us-east) through a lightweight client library that batches and ships events
- Time-window aggregation -- The system computes aggregated counts over configurable intervals (1 second, 10 seconds, 1 minute, 1 hour) with near-real-time freshness
- Dashboard queries -- Users view time-series graphs and histograms for selected metrics, filtered by metric name, tags, and time range, with response times under 500 milliseconds
- Multi-resolution rollups -- Fine-grained data (1-second buckets) is retained for recent periods while coarser rollups (minute, hour, day) are kept for longer-term trend analysis
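Both the query path and the rollup path depend on aligning event timestamps to bucket boundaries at each resolution. A minimal sketch of that alignment, with the resolution names and helper functions being illustrative rather than taken from any real system:

```python
# Bucket sizes mirror the resolutions listed above (1 s, 10 s, 1 m, 1 h).
RESOLUTIONS_SEC = {"1s": 1, "10s": 10, "1m": 60, "1h": 3600}

def bucket_start(ts_epoch_sec: int, resolution: str) -> int:
    """Floor a Unix timestamp to the start of its bucket at the given resolution."""
    size = RESOLUTIONS_SEC[resolution]
    return (ts_epoch_sec // size) * size

def bucket_range(start: int, end: int, resolution: str) -> list:
    """All bucket boundaries a dashboard query for [start, end) must scan."""
    size = RESOLUTIONS_SEC[resolution]
    return list(range(bucket_start(start, resolution), end, size))
```

Every event increments exactly one bucket per resolution, and a query for a time range reduces to fetching a contiguous run of bucket rows.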
Non-Functional
- Scalability -- Sustain ingestion of 1 million events per second across thousands of producing services with headroom for traffic spikes
- Reliability -- Tolerate individual component failures without losing metric data; provide at-least-once delivery with eventual accuracy
- Latency -- Reflect new events in dashboards within 10 seconds for the most recent time windows
- Availability -- Maintain 99.9% uptime for both the ingestion endpoint and the query API
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Buffering
High-throughput event ingestion requires careful decoupling between producers and aggregation logic. Interviewers want to see how you handle bursty traffic without dropping data or overwhelming downstream components.
Hints to consider:
- Use Kafka as a durable buffer between client libraries and aggregation workers, partitioned by metric name hash for parallel processing
- Implement client-side batching in the SDK to reduce network overhead and server-side request rates
- Design the ingestion API to be stateless and horizontally scalable behind a load balancer
- Address backpressure: monitor consumer lag continuously and scale aggregation workers independently when Kafka consumers fall behind
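The client-side batching hint can be sketched as a small SDK wrapper. The class name, `max_batch` parameter, and `transport` callable are illustrative assumptions, not an actual StatsD or Datadog API; a real SDK would also run a background timer that flushes on an interval:

```python
import threading
import time

class MetricsClient:
    """Count-metric SDK sketch: buffers increments locally and ships them
    in batches to cut network overhead and server-side request rates."""

    def __init__(self, transport, max_batch=500):
        self._transport = transport  # callable taking a list of events, e.g. one HTTP POST
        self._max_batch = max_batch
        self._buffer = []
        self._lock = threading.Lock()

    def incr(self, metric, count=1, **tags):
        event = {"metric": metric, "count": count,
                 "tags": tags, "ts": int(time.time())}
        with self._lock:
            self._buffer.append(event)
            if len(self._buffer) >= self._max_batch:
                self._flush_locked()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._transport(batch)
```

Batching like this is what lets thousands of services emit millions of events per second without the ingestion API seeing a request per event.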
2. Streaming Aggregation and Window Management
Computing accurate counts from a firehose of events involves trade-offs between latency, accuracy, and resource cost. Interviewers probe your understanding of windowed aggregation and late-event handling.
Hints to consider:
- Use stream processing (Apache Flink, Kafka Streams) to maintain running counters in tumbling time windows
- Handle late-arriving events with grace periods or correction windows that update previously closed buckets
- Pre-aggregate in-memory using sharded counters in Redis before flushing to durable storage
- Discuss the trade-off between per-event processing and micro-batch aggregation for throughput versus latency
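One way to picture tumbling windows with a grace period is a small in-memory aggregator. This is a sketch under the assumption of per-process counters and a watermark tracking the highest event time seen, not actual Flink or Kafka Streams code:

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Counts events into fixed windows; a late event within the grace
    period still corrects its (already closed) window's bucket."""

    def __init__(self, window_sec=10, grace_sec=30):
        self.window_sec = window_sec
        self.grace_sec = grace_sec
        self.buckets = defaultdict(int)  # (metric, window_start) -> count
        self.watermark = 0               # highest event timestamp seen

    def add(self, metric, ts, count=1):
        self.watermark = max(self.watermark, ts)
        # Events older than the grace period are dropped (or dead-lettered).
        if ts < self.watermark - self.grace_sec:
            return False
        window_start = (ts // self.window_sec) * self.window_sec
        self.buckets[(metric, window_start)] += count
        return True
```

The grace period bounds how long closed buckets stay mutable, which in turn bounds how long the dashboard's most recent points can shift after first being rendered.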
3. Hot-Key Contention for Popular Metrics
A small number of metrics (like API error counts for a major service) will receive disproportionate write traffic, creating hotspots. Interviewers expect you to prevent single-point contention.
Hints to consider:
- Use sharded counters where increments are distributed across multiple Redis keys or database rows per metric per time bucket
- Aggregate shard totals in periodic background jobs and cache the summed result for read queries
- Partition Kafka topics by metric hash so hot metrics are spread across multiple consumer instances
- Apply rate limiting at the client library level to prevent a single service from overwhelming the pipeline
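The sharded-counter hint can be sketched with a plain dictionary standing in for Redis; in practice each shard would be a separate Redis key incremented with INCRBY, and the shard fan-out of 16 is a tuning assumption, not a prescription:

```python
import random
from collections import defaultdict

NUM_SHARDS = 16  # assumed fan-out per (metric, bucket); tune to write volume

store = defaultdict(int)  # stand-in for Redis: key -> integer counter

def incr(metric, bucket_start, count=1):
    """Spread increments across shards so no single key becomes a hotspot."""
    shard = random.randrange(NUM_SHARDS)
    store["%s:%d:%d" % (metric, bucket_start, shard)] += count

def read_total(metric, bucket_start):
    """Sum all shards; a background job would do this and cache the result."""
    return sum(store["%s:%d:%d" % (metric, bucket_start, s)]
               for s in range(NUM_SHARDS))
```

Writes stay contention-free because concurrent increments land on different keys; the cost is a fan-in of NUM_SHARDS reads at query time, which the background summing job amortizes.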
4. Time-Series Storage and Query Optimization
Analytics queries scan large time ranges and filter by multiple dimensions, making storage schema design critical for performance.
Hints to consider:
- Use a time-series-optimized store such as ClickHouse or TimescaleDB, or a wide-column store like Apache Cassandra, with partition keys on metric name and time bucket
- Design multi-resolution rollup tables: raw 1-second data retained for 24 hours, 1-minute rollups for 30 days, hourly rollups for 1 year
- Use TTLs for automatic data lifecycle management across storage tiers
- Cache frequently queried time ranges (last 1 hour, last 24 hours) in Redis to reduce storage layer load
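The rollup tables above are populated by periodically collapsing fine buckets into coarser ones. A minimal sketch of that rollup step, with the row shape (metric, bucket_start, count) assumed for illustration:

```python
from collections import defaultdict

def rollup(fine_rows, coarse_sec=60):
    """Collapse fine-grained bucket rows into coarser buckets.

    fine_rows: iterable of (metric, bucket_start, count) tuples, e.g. rows
    from the raw 1-second table; returns rows for the 1-minute rollup table.
    """
    coarse = defaultdict(int)
    for metric, bucket_start, count in fine_rows:
        coarse_start = (bucket_start // coarse_sec) * coarse_sec
        coarse[(metric, coarse_start)] += count
    return sorted((m, b, c) for (m, b), c in coarse.items())
```

Because counts are additive, rollups lose no accuracy; once a minute's raw buckets are final (past the grace period), the 1-second rows can expire via TTL while the 1-minute rows carry the trend forward.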