Design a Metrics Aggregator
Problem Statement
Design a metrics aggregation system that collects count-based metrics (like user signups, system errors, ad clicks) from services via a client library and displays them as histogram data on a dashboard. The system should support querying aggregated metrics within specific time frames for monitoring and analytics purposes.
The backend ingests a high-volume firehose, aggregates counts into time buckets, and powers dashboards that show histograms and time-series charts with filters by tags and time ranges. The challenge lies in designing high-throughput, low-latency ingestion and aggregation pipelines, and in reasoning about windowed counters, hot-key contention, and storage layouts for time-series data.
Key Requirements
Functional
- Client instrumentation -- services emit count metrics with optional tags (e.g., env=prod, region=us-east) using a lightweight client library
- Near-real-time dashboards -- users view histograms and time-series graphs for selected metrics over a chosen time range and resolution (1s, 10s, 1m)
- Tag-based filtering -- users query aggregated counts filtered by metric name and tags to support monitoring and analytics
- Predictable query latency -- users retrieve aggregated metrics within specific time frames with consistent latency, even under high write load
Non-Functional
- Scalability -- handle millions of metric events per second with horizontal scaling of both ingestion and query paths
- Reliability -- tolerate individual component failures without data loss; at-least-once delivery with bucket-level reconciliation
- Latency -- metric data visible in dashboards within 10 seconds of emission; query response under 500ms for recent time windows
- Consistency -- eventual consistency is acceptable for dashboards; counts should converge within seconds
Interview Reports from Hello Interview
29 reports from candidates. Most recently asked at LinkedIn in Early January 2026.
Also commonly asked at: Stripe, DoorDash, OpenAI, Meta, Google, Twitch.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Pre-Aggregation and Buffering Strategy
Writing every event directly to a database without pre-aggregation is a critical red flag. Interviewers expect a multi-layer aggregation approach.
Hints to consider:
- Aggregate counts in the client library (flush every 5-10 seconds) to reduce network traffic by orders of magnitude
- Use Kafka as a durable buffer between client libraries and aggregation workers for backpressure and replay
- Implement server-side pre-aggregation in stream processors that maintain per-bucket counters before writing to storage
- Batch writes to the time-series store to reduce write amplification and improve throughput
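The first hint above can be sketched in a few lines. This is a minimal illustration, not a real client library: the `MetricsClient` name, the `incr`/`flush` methods, and the injected `send` transport are all assumptions; in practice `send` would wrap an HTTP call or a Kafka producer.

```python
import threading
from collections import defaultdict

class MetricsClient:
    """Sketch of a client library that pre-aggregates counts in memory
    and flushes them on an interval instead of emitting one event per call."""

    def __init__(self, flush_interval_s=5, send=print):
        self.flush_interval_s = flush_interval_s
        self.send = send            # assumption: caller supplies the transport
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def incr(self, metric, tags=(), n=1):
        # Key by metric name plus sorted tags so (env=prod, region=us-east)
        # and (region=us-east, env=prod) aggregate into the same counter.
        key = (metric, tuple(sorted(tags)))
        with self._lock:
            self._counts[key] += n

    def flush(self):
        # Swap the buffer under the lock, then send outside it so a slow
        # network call never blocks instrumentation on the hot path.
        with self._lock:
            batch, self._counts = self._counts, defaultdict(int)
        if batch:
            self.send([(m, list(t), c) for (m, t), c in batch.items()])
```

A burst of 1,000 `incr("signups")` calls in one flush interval collapses into a single batched row on the wire, which is where the orders-of-magnitude traffic reduction comes from.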
2. Hot Key Contention
Popular metrics create hot keys that bottleneck on single counters. Interviewers assess whether you recognize and solve this.
Hints to consider:
- Use sharded counters where hot metrics are split across N sub-counters and merged periodically
- Apply per-partition aggregation in the stream processor to avoid global locks on popular metrics
- Implement striped keys in Redis (metric_name:shard_0, metric_name:shard_1) with periodic merging
- Monitor per-key write rates and dynamically increase shard count for detected hot keys
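The striped-key idea can be sketched with an in-process counter; in Redis each shard would be its own key (`metric_name:shard_0`, …) hit with `INCRBY`, and the merge would sum the keys. The `ShardedCounter` name and its methods are illustrative assumptions, not a library API.

```python
import random
from collections import defaultdict

class ShardedCounter:
    """Sketch of a sharded (striped) counter: writes for one hot metric
    are spread across N sub-counters, and reads merge them."""

    def __init__(self, num_shards=8):
        self.num_shards = num_shards
        self.shards = defaultdict(int)   # shard_id -> partial count

    def incr(self, n=1):
        # A random shard choice keeps concurrent writers from all
        # contending on a single key; with Redis striping, each shard
        # can also land on a different node.
        self.shards[random.randrange(self.num_shards)] += n

    def total(self):
        # Merge step: sum the sub-counters. In production this merge
        # runs periodically and the merged value is what gets stored.
        return sum(self.shards.values())
```

The trade-off is that reads now touch N keys instead of one, which is why dynamically raising the shard count only for detected hot keys beats sharding everything.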
3. Time-Series Storage Layout
Dashboards execute range queries across time windows. Interviewers want storage designs that make these queries efficient.
Hints to consider:
- Use wide-row designs in Cassandra partitioned by (metric_name, tags, time_bucket) for efficient time-range scans
- Implement multi-resolution rollups: keep raw 1-second data for 24 hours, 1-minute aggregates for 30 days, 5-minute aggregates for 1 year
- Apply TTLs on storage partitions for automatic data lifecycle management
- Consider columnar formats for analytical queries that aggregate across many time buckets
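The multi-resolution rollup in the second hint is just a re-bucketing sum. A minimal sketch, assuming buckets are keyed by their aligned start timestamp (the `rollup` function name is an assumption):

```python
from collections import defaultdict

def rollup(points, from_res_s, to_res_s):
    """Roll fine-grained count buckets up into coarser ones by summing.
    `points` maps bucket_start_ts (aligned to from_res_s) -> count."""
    assert to_res_s % from_res_s == 0, "coarse resolution must be a multiple"
    coarse = defaultdict(int)
    for ts, count in points.items():
        # Align each fine bucket down to the start of its coarse bucket.
        coarse[ts - ts % to_res_s] += count
    return dict(coarse)

# 1-second counters for one (metric, tags) series...
raw = {100: 3, 101: 5, 130: 2, 161: 7}
# ...rolled into 1-minute (60 s) buckets:
minute = rollup(raw, from_res_s=1, to_res_s=60)  # {60: 8, 120: 9}
```

Because counts are additive, rollups lose no information for sum queries; the 1-second data can then expire via TTL while the 1-minute and 5-minute tiers serve older time ranges.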
4. Handling Duplicates and Late Events
Without a clear correctness model and deduplication strategy, counts will drift and dashboards will be misleading.
Hints to consider:
- Use at-least-once delivery with bucket-level idempotence: aggregate into fixed time windows and accept that a window's count is finalized after the allowed lateness
- Include event timestamps for event-time processing rather than processing-time to handle late-arriving data correctly
- Implement periodic reconciliation between streaming and batch counts to detect and correct drift
- Design the client library to include sequence numbers for detecting and handling retransmissions
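The bucket-level correctness model above can be sketched as event-time windowing with allowed lateness. This is a toy illustration of the idea, not a stream-processor API; the `WindowedAggregator` name, the explicit `advance_watermark` call, and the drop-counter are all assumptions.

```python
from collections import defaultdict

class WindowedAggregator:
    """Sketch of event-time bucketing: events land in fixed windows by
    their *event* timestamp, and a window is finalized (made immutable)
    once the watermark passes its end plus the allowed lateness."""

    def __init__(self, window_s=10, allowed_lateness_s=30):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.open = defaultdict(int)     # window_start -> running count
        self.finalized = {}              # closed, queryable results
        self.dropped_late = 0            # drift made visible, not silent

    def add(self, event_ts, n=1):
        start = event_ts - event_ts % self.window_s
        if start in self.finalized:
            # Arrived after the lateness budget: the window's count is
            # already final, so we count the drop instead of mutating it.
            self.dropped_late += n
        else:
            self.open[start] += n

    def advance_watermark(self, watermark_ts):
        # Close every window whose end + lateness is behind the watermark.
        cutoff = watermark_ts - self.allowed_lateness_s
        for start in [s for s in self.open if s + self.window_s <= cutoff]:
            self.finalized[start] = self.open.pop(start)
```

Events that arrive within the lateness budget still land in the correct event-time bucket; anything later is counted as drift, which is exactly what the periodic batch reconciliation pass would detect and correct.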