Design a metrics aggregation system that collects count-based events from application services (such as user signups, API errors, and ad clicks), aggregates them into time buckets, and displays the results as histograms and time-series charts on a monitoring dashboard. Services emit events through a lightweight client library, and the dashboard allows engineers to query aggregated counts within specific time frames, filtered by metric name and tags.
Think of platforms like Datadog, Grafana, or an internal StatsD pipeline. The system ingests a high-volume firehose of metric events -- potentially millions per second across thousands of services -- pre-aggregates them into configurable time windows (1 second, 10 seconds, 1 minute), and serves interactive queries with sub-second response times. The core challenges are sustaining write throughput under bursty traffic, avoiding hot-key contention on popular metrics, choosing the right storage layout for efficient time-range queries, and balancing real-time freshness against aggregation correctness.
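The window math underlying all of this is just flooring a timestamp to the start of its aggregation window. A minimal sketch (function name is illustrative):

```python
def bucket_start(ts: float, window_s: int) -> int:
    """Floor a Unix timestamp to the start of its aggregation window."""
    return int(ts) - int(ts) % window_s

# With a 10-second window, events at t=1700000003 and t=1700000009
# both land in the bucket starting at 1700000000.
assert bucket_start(1700000003, 10) == 1700000000
assert bucket_start(1700000009, 10) == 1700000000
assert bucket_start(1700000010, 10) == 1700000010
```

Every layer of the pipeline (SDK, stream processor, rollup jobs) uses this same bucketing, just with different window sizes.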
Based on real interview experiences, these are the areas interviewers probe most deeply:
With millions of events per second, naive approaches that write each event individually to a database will fail immediately. Interviewers want to see how you decouple producers from consumers and smooth out traffic bursts.
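One common answer is to pre-aggregate inside the client library itself, so each service sends one small batch per flush interval instead of millions of individual events. A sketch of such an SDK, assuming hypothetical class and method names (not a real library):

```python
import threading
from collections import defaultdict

class MetricsClient:
    """Hypothetical SDK sketch: buffer counts in memory, flush in batches.

    Batching collapses many individual increments into one aggregated
    payload per flush interval, decoupling producers from the pipeline
    and smoothing out bursts."""

    def __init__(self, flush_interval_s: float = 5.0, transport=None):
        self._counts = defaultdict(int)      # (metric, sorted tags) -> count
        self._lock = threading.Lock()
        # A real SDK would flush on a background timer every flush_interval_s.
        self._flush_interval_s = flush_interval_s
        self._transport = transport or (lambda batch: None)  # e.g. HTTP POST

    def incr(self, metric: str, value: int = 1, **tags):
        key = (metric, tuple(sorted(tags.items())))
        with self._lock:
            self._counts[key] += value       # pre-aggregate locally

    def flush(self):
        with self._lock:
            batch, self._counts = dict(self._counts), defaultdict(int)
        if batch:
            self._transport(batch)           # one network call per interval
        return batch
```

A thousand `incr("signups")` calls in one interval become a single entry in a single payload, so the ingestion tier sees batches, not the raw firehose.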
Hints to consider:
Popular metrics like "api_requests" may receive orders of magnitude more events than others. A single counter per metric becomes a write bottleneck that limits horizontal scaling.
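A standard mitigation is to split a hot counter into N sub-counters: each write picks a random shard, and reads sum all shards to reconstruct the total. A sketch using a plain dict as a stand-in for Redis (the names and shard count are illustrative):

```python
import random

NUM_SHARDS = 16  # assumption: tune to the hot metric's write rate

def shard_key(metric: str, bucket_ts: int) -> str:
    """Spread writes for one hot counter across NUM_SHARDS sub-keys."""
    shard = random.randrange(NUM_SHARDS)
    return f"{metric}:{bucket_ts}:shard{shard}"

def read_total(store: dict, metric: str, bucket_ts: int) -> int:
    """Reads pay the cost: sum every shard to recover the true count."""
    return sum(store.get(f"{metric}:{bucket_ts}:shard{i}", 0)
               for i in range(NUM_SHARDS))

# Each write hits a random shard, so no single key serializes
# all increments for "api_requests" in this bucket.
store = {}
for _ in range(1000):
    k = shard_key("api_requests", 1700000000)
    store[k] = store.get(k, 0) + 1
assert read_total(store, "api_requests", 1700000000) == 1000
```

The trade-off is deliberate: writes stay contention-free at the cost of a fan-out on read, which is acceptable because reads are far less frequent than writes in this workload.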
Hints to consider:
Dashboards execute range queries across time windows with dimension filters. The storage schema must support efficient scans without full table traversal.
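One layout that supports this in a Cassandra-style store is to partition by (metric, tags, day) and cluster rows by bucket timestamp, so a dashboard query scans one contiguous slice per partition rather than the whole table. A sketch with a dict standing in for the table (the key shape is an assumption, not a prescribed schema):

```python
from datetime import datetime, timezone

def row_key(metric: str, tags: dict, bucket_ts: int) -> tuple:
    """Partition by (metric, serialized tags, UTC day); within a
    partition, rows are keyed by bucket timestamp for range scans."""
    day = datetime.fromtimestamp(bucket_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return (metric, tag_str, day)

def range_query(table: dict, metric: str, tags: dict,
                start_ts: int, end_ts: int, window_s: int = 60) -> list:
    """Emulate a clustering-key scan: visit only buckets in [start, end)."""
    out = []
    for ts in range(start_ts, end_ts, window_s):
        count = table.get(row_key(metric, tags, ts), {}).get(ts)
        if count is not None:
            out.append((ts, count))
    return out
```

Bounding the partition by day keeps any single partition from growing without limit as data accumulates, while still letting a multi-day query touch a predictable, small number of partitions.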
Hints to consider:
At-least-once delivery means events may arrive more than once, and network delays cause late arrivals. Without a clear correctness model, counts drift and dashboards become unreliable.
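A workable correctness model is to make increments idempotent: key each ingested batch with a unique ID and discard replays. A minimal sketch (stamping a `batch_id` at the ingestion API is an assumption of this sketch):

```python
class IdempotentCounter:
    """Sketch: deduplicate at-least-once deliveries via batch IDs.

    The ingestion tier stamps each batch with a unique batch_id
    (an assumption); the consumer remembers recently seen IDs and
    skips replays, so retried batches never double-count."""

    def __init__(self):
        # In production: a TTL'd set sized to the redelivery window,
        # not an unbounded in-memory set.
        self.seen = set()
        self.counts = {}   # (metric, bucket_ts) -> count

    def apply(self, batch_id: str, metric: str, bucket_ts: int, value: int) -> bool:
        if batch_id in self.seen:
            return False   # duplicate delivery: ignore
        self.seen.add(batch_id)
        key = (metric, bucket_ts)
        self.counts[key] = self.counts.get(key, 0) + value
        return True
```

This converts at-least-once delivery into effectively-once counting; late arrivals are a separate concern, handled by keeping windows open for a grace period before finalizing them.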
Hints to consider:
Begin by confirming scale parameters: how many services emit metrics, the expected events per second at average and peak load, and how many concurrent dashboard users query the system. Clarify which metric types are in scope (counts only, or also gauges, histograms, and percentiles). Establish acceptable latency targets for both ingestion and queries. Ask about data retention policies, cardinality limits on tag values, and whether threshold-based alerting is in scope.
Sketch a pipeline with three layers. The ingestion layer consists of lightweight SDKs that batch events client-side and send them to a stateless API tier, which writes to Kafka partitioned by metric name and tag hash. The processing layer runs stream processors (Flink or Kafka Streams) that maintain in-memory counters per time bucket, flush pre-aggregated results to Redis for the most recent windows, and write finalized buckets to Cassandra for long-term storage. The serving layer provides a query API that reads from Redis for near-real-time data and from Cassandra for historical ranges, with a caching layer for popular dashboard queries.
Walk through an event lifecycle. A service calls the SDK with a metric name, value, and tags. The SDK buffers events locally and flushes a batch every 5 seconds to the ingestion API. The API validates, enriches with a receive timestamp, and publishes to a Kafka topic partitioned by a hash of metric name plus tags. A Flink consumer maintains in-memory counters for each open time window (1-second tumbling windows). When a window closes, the consumer atomically increments the corresponding Redis key (metric:tags:timestamp) and also writes the finalized count to Cassandra. For late events arriving after window close, the consumer updates the Cassandra record and invalidates the Redis cache entry for that bucket. Discuss how partitioning by metric plus tags ensures all events for one counter land on the same worker, eliminating distributed contention.
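The consumer's window logic can be sketched as a watermark-driven tumbling-window counter. This is a simplified stand-in for what Flink's windowing API provides; the class name and the 5-second grace period are assumptions:

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Sketch of the consumer's in-memory aggregation.

    Counters accumulate per bucket; once the watermark passes a
    bucket's close time plus an allowed-lateness grace period, the
    bucket is finalized and flushed downstream (Redis + Cassandra
    in the design above)."""

    def __init__(self, window_s: int = 1, allowed_lateness_s: int = 5, flush=None):
        self.window_s = window_s
        self.lateness = allowed_lateness_s
        self.flush = flush or (lambda bucket, count: None)  # finalization callback
        self.open = defaultdict(int)   # bucket_ts -> running count

    def on_event(self, event_ts: int, value: int = 1):
        bucket = event_ts - event_ts % self.window_s
        self.open[bucket] += value     # late events within grace still land here

    def advance_watermark(self, now_ts: int):
        cutoff = now_ts - self.lateness
        for bucket in sorted(b for b in self.open if b + self.window_s <= cutoff):
            self.flush(bucket, self.open.pop(bucket))
```

The grace period is the freshness/correctness dial from the requirements: a longer allowed lateness absorbs more stragglers before finalization but delays when dashboards see closed buckets.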
Discuss rollup jobs that periodically aggregate 1-second buckets into 1-minute and 1-hour summaries in Cassandra, reducing query scan ranges for long time windows. Cover caching: popular dashboard queries (e.g., "api_requests for the last hour") are served from pre-computed rollups cached in Redis. Address cardinality control: reject or sample tag values that exceed a cardinality threshold to prevent storage explosion. Explain monitoring of the pipeline itself: track Kafka consumer lag, Flink checkpoint intervals, and query latency percentiles. Discuss fault tolerance: Kafka provides durable replay if a Flink worker crashes, and Cassandra replication handles node failures.
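The rollup step itself is a straightforward re-bucketing of already-finalized fine-grained counts; a sketch (timestamps and window sizes are illustrative):

```python
from collections import defaultdict

def rollup(fine_buckets: dict, coarse_window_s: int = 60) -> dict:
    """Aggregate finalized fine-grained buckets (e.g. 1s) into coarser
    summaries (e.g. 1m), so long-range queries scan ~60x fewer rows."""
    coarse = defaultdict(int)
    for bucket_ts, count in fine_buckets.items():
        coarse[bucket_ts - bucket_ts % coarse_window_s] += count
    return dict(coarse)

# 120 one-second buckets of 1 event each collapse into two 1-minute rows.
fine = {i: 1 for i in range(120)}
assert rollup(fine) == {0: 60, 60: 60}
```

Because rollups operate only on finalized buckets, they can be re-run safely after late-event corrections; counts are summable, so the rollup is exact rather than approximate (which would not hold for percentiles).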