Design a Metrics Aggregator
Problem Statement
Design a metrics aggregation system that collects count-based metrics (like user signups, system errors, ad clicks) from services via a client library and displays them as histogram data on a dashboard. The system should support querying aggregated metrics within specific time frames for monitoring and analytics purposes.
The backend ingests a high-volume firehose, aggregates counts into time buckets, and powers dashboards that show histograms and time-series charts with filters by tags and time ranges. The challenge lies in designing high-throughput, low-latency ingestion and aggregation pipelines, and in reasoning about windowed counters, hot-key contention, and storage layouts for time-series data.
Key Requirements
Functional
- Client instrumentation -- services emit count metrics with optional tags (e.g., env=prod, region=us-east) using a lightweight client library
- Near-real-time dashboards -- users view histograms and time-series graphs for selected metrics over a chosen time range and resolution (1s, 10s, 1m)
- Tag-based filtering -- users query aggregated counts filtered by metric name and tags to support monitoring and analytics
- Predictable query latency -- users retrieve aggregated metrics within specific time frames with consistent latency, even under high write load
Non-Functional
- Scalability -- handle millions of metric events per second with horizontal scaling of both ingestion and query paths
- Reliability -- tolerate individual component failures without data loss; at-least-once delivery with bucket-level reconciliation
- Latency -- metric data visible in dashboards within 10 seconds of emission; query response under 500ms for recent time windows
- Consistency -- eventual consistency is acceptable for dashboards; counts should converge within seconds
Interview Reports from Hello Interview
29 reports from candidates. Most recently asked at LinkedIn in Early January 2026.
Also commonly asked at: Stripe, DoorDash, OpenAI, Meta, Google, Twitch.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Pre-Aggregation and Buffering Strategy
Writing every event directly to a database without pre-aggregation is a critical red flag. Interviewers expect a multi-layer aggregation approach.
Hints to consider:
- Aggregate counts in the client library (flush every 5-10 seconds) to reduce network traffic by orders of magnitude
- Use Kafka as a durable buffer between client libraries and aggregation workers for backpressure and replay
- Implement server-side pre-aggregation in stream processors that maintain per-bucket counters before writing to storage
- Batch writes to the time-series store to reduce write amplification and improve throughput
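The first hint above can be sketched in a few lines. This is a minimal illustration, not a real client library: the `MetricsClient` name, the `incr`/`flush` methods, and the injected `send` transport are all assumptions; in practice `send` would wrap an HTTP call or a Kafka producer.

```python
import threading
from collections import defaultdict

class MetricsClient:
    """Sketch of a client library that pre-aggregates counts in memory
    and flushes them on an interval instead of emitting one event per call."""

    def __init__(self, flush_interval_s=5, send=print):
        self.flush_interval_s = flush_interval_s
        self.send = send            # assumption: caller supplies the transport
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def incr(self, metric, tags=(), n=1):
        # Key by metric name plus sorted tags so (env=prod, region=us-east)
        # and (region=us-east, env=prod) aggregate into the same counter.
        key = (metric, tuple(sorted(tags)))
        with self._lock:
            self._counts[key] += n

    def flush(self):
        # Swap the buffer under the lock, then send outside it so a slow
        # network call never blocks instrumentation on the hot path.
        with self._lock:
            batch, self._counts = self._counts, defaultdict(int)
        if batch:
            self.send([(m, list(t), c) for (m, t), c in batch.items()])
```

A burst of 1,000 `incr("signups")` calls in one flush interval collapses into a single batched row on the wire, which is where the orders-of-magnitude traffic reduction comes from.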
2. Hot Key Contention
Popular metrics create hot keys that bottleneck on single counters. Interviewers assess whether you recognize and solve this.
Hints to consider:
- Use sharded counters where hot metrics are split across N sub-counters and merged periodically
- Apply per-partition aggregation in the stream processor to avoid global locks on popular metrics
- Implement striped keys in Redis (metric_name:shard_0, metric_name:shard_1) with periodic merging
- Monitor per-key write rates and dynamically increase shard count for detected hot keys
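The striped-key idea can be sketched with an in-process counter; in Redis each shard would be its own key (`metric_name:shard_0`, …) hit with `INCRBY`, and the merge would sum the keys. The `ShardedCounter` name and its methods are illustrative assumptions, not a library API.

```python
import random
from collections import defaultdict

class ShardedCounter:
    """Sketch of a sharded (striped) counter: writes for one hot metric
    are spread across N sub-counters, and reads merge them."""

    def __init__(self, num_shards=8):
        self.num_shards = num_shards
        self.shards = defaultdict(int)   # shard_id -> partial count

    def incr(self, n=1):
        # A random shard choice keeps concurrent writers from all
        # contending on a single key; with Redis striping, each shard
        # can also land on a different node.
        self.shards[random.randrange(self.num_shards)] += n

    def total(self):
        # Merge step: sum the sub-counters. In production this merge
        # runs periodically and the merged value is what gets stored.
        return sum(self.shards.values())
```

The trade-off is that reads now touch N keys instead of one, which is why dynamically raising the shard count only for detected hot keys beats sharding everything.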
3. Time-Series Storage Layout
Dashboards execute range queries across time windows. Interviewers want storage designs that make these queries efficient.
Hints to consider:
- Use wide-row designs in Cassandra partitioned by (metric_name, tags, time_bucket) for efficient time-range scans
- Implement multi-resolution rollups: keep raw 1-second data for 24 hours, 1-minute aggregates for 30 days, 5-minute aggregates for 1 year
- Apply TTLs on storage partitions for automatic data lifecycle management
- Consider columnar formats for analytical queries that aggregate across many time buckets
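The multi-resolution rollup in the second hint is just a re-bucketing sum. A minimal sketch, assuming buckets are keyed by their aligned start timestamp (the `rollup` function name is an assumption):

```python
from collections import defaultdict

def rollup(points, from_res_s, to_res_s):
    """Roll fine-grained count buckets up into coarser ones by summing.
    `points` maps bucket_start_ts (aligned to from_res_s) -> count."""
    assert to_res_s % from_res_s == 0, "coarse resolution must be a multiple"
    coarse = defaultdict(int)
    for ts, count in points.items():
        # Align each fine bucket down to the start of its coarse bucket.
        coarse[ts - ts % to_res_s] += count
    return dict(coarse)

# 1-second counters for one (metric, tags) series...
raw = {100: 3, 101: 5, 130: 2, 161: 7}
# ...rolled into 1-minute (60 s) buckets:
minute = rollup(raw, from_res_s=1, to_res_s=60)  # {60: 8, 120: 9}
```

Because counts are additive, rollups lose no information for sum queries; the 1-second data can then expire via TTL while the 1-minute and 5-minute tiers serve older time ranges.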
4. Handling Duplicates and Late Events
Without a clear correctness model and deduplication strategy, counts will drift and dashboards will be misleading.
Hints to consider:
- Use at-least-once delivery with bucket-level idempotence: aggregate into fixed time windows and accept that a window's count is finalized after the allowed lateness
- Include event timestamps for event-time processing rather than processing-time to handle late-arriving data correctly
- Implement periodic reconciliation between streaming and batch counts to detect and correct drift
- Design the client library to include sequence numbers for detecting and handling retransmissions
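The bucket-level correctness model above can be sketched as event-time windowing with allowed lateness. This is a toy illustration of the idea, not a stream-processor API; the `WindowedAggregator` name, the explicit `advance_watermark` call, and the drop-counter are all assumptions.

```python
from collections import defaultdict

class WindowedAggregator:
    """Sketch of event-time bucketing: events land in fixed windows by
    their *event* timestamp, and a window is finalized (made immutable)
    once the watermark passes its end plus the allowed lateness."""

    def __init__(self, window_s=10, allowed_lateness_s=30):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.open = defaultdict(int)     # window_start -> running count
        self.finalized = {}              # closed, queryable results
        self.dropped_late = 0            # drift made visible, not silent

    def add(self, event_ts, n=1):
        start = event_ts - event_ts % self.window_s
        if start in self.finalized:
            # Arrived after the lateness budget: the window's count is
            # already final, so we count the drop instead of mutating it.
            self.dropped_late += n
        else:
            self.open[start] += n

    def advance_watermark(self, watermark_ts):
        # Close every window whose end + lateness is behind the watermark.
        cutoff = watermark_ts - self.allowed_lateness_s
        for start in [s for s in self.open if s + self.window_s <= cutoff]:
            self.finalized[start] = self.open.pop(start)
```

Events that arrive within the lateness budget still land in the correct event-time bucket; anything later is counted as drift, which is exactly what the periodic batch reconciliation pass would detect and correct.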