Design Metrics Aggregation / Monitoring System


Category: System Design | Difficulty: hard | Frequency: high | First seen: 2026-01-12

Tags: data_engineering, Monitoring, web, backend, infrastructure, Time Series, System Design


"Design Metrics Aggregation / Monitoring System" is reported as a LinkedIn system design interview question. Questions like this circulate in posts, repositories, and prep sites, but usually without a single canonical statement that includes explicit input/output examples and constraints. The version below is synthesized from the closest available sources and from common variants asked at companies such as Meta, Stripe, TikTok, and Amazon.

Problem Statement

Design a scalable metrics aggregation and monitoring system that collects time-series metrics (e.g., CPU usage, memory, error rates, request latency, user signups, ad clicks) from hundreds of thousands of servers or services via client libraries or agents. The system must aggregate count-based or gauge metrics into histograms, support low-latency queries for dashboards, enable threshold-based alerting with notifications (e.g., email, Slack), handle high write throughput (millions of datapoints/sec), provide multi-tenant isolation, and store data with tiered retention (hot/cold storage).[1][3][4]

Key Requirements

Functional: Ingest via pull (scrapers polling /metrics endpoints) or push gateways; aggregate metrics client-side or server-side (e.g., counters summed per minute); query aggregated data by time range, labels, or filters; visualize via dashboards; alert on anomalies/thresholds.[2][9]
Non-functional:
- Scale to 10M+ metrics/day, or millions of datapoints/sec.
- Low latency (<1s queries, <10s writes).
- High availability/reliability (no data loss, exactly-once semantics).
- Retention: Hours/days hot, months/years cold.
- Flexibility for new metric types/tech.[1][2]
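The "counters summed per minute" requirement can be illustrated with a small server-side rollup. This is a minimal sketch, not a reference design: the function name `aggregate_per_minute`, the `(series_key, unix_seconds, increment)` event shape, and the sample series key are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_per_minute(events):
    """Sum counter increments into one-minute buckets per series.

    `events` is an iterable of (series_key, unix_seconds, increment).
    Returns {(series_key, minute_start): total}.
    """
    buckets = defaultdict(float)
    for series_key, ts, increment in events:
        minute_start = ts - (ts % 60)  # floor to the minute boundary
        buckets[(series_key, minute_start)] += increment
    return dict(buckets)


# Hypothetical increments for one error counter.
events = [
    ("http_errors{job=api}", 1640995201, 1),
    ("http_errors{job=api}", 1640995230, 2),
    ("http_errors{job=api}", 1640995261, 5),
]
totals = aggregate_per_minute(events)
```

The same function could run client-side in an agent before flushing, which trades some timestamp precision for far fewer writes.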

Input Examples

Metrics arrive as time-series data points, often in Prometheus-style format:

```
# Counter: cumulative errors (value 120 at Unix timestamp 1640995200)
http_requests_total{job="api", instance="server1", status="500"} 120 1640995200

# Gauge: current memory usage
memory_usage_bytes{instance="server1"} 8589934592 1640995200

# Histogram: request durations
http_request_duration_seconds_bucket{le="0.1"} 50 1640995200
```
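An ingest path has to turn lines like these into structured samples. The sketch below parses the simplified `name{labels} value timestamp` shape shown above; `parse_sample` and both regular expressions are illustrative assumptions, not a full implementation of the Prometheus exposition format (for example, a `}` inside a label value is not handled, and anything after the optional timestamp is ignored).

```python
import re

# name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value, timestamp) or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"unparseable sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    ts = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), ts


sample = parse_sample(
    'http_requests_total{job="api", instance="server1", status="500"} 120 1640995200'
)
```

The sorted label set plus the metric name is what typically identifies a series, so a parser like this is also where label cardinality can be checked before data reaches storage.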

Push: Services send via HTTP/UDP to gateway.
Pull: Collectors query endpoints every 10-60s, using service discovery and consistent hashing to shard load.[9][2][1]
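To make the consistent-hashing idea concrete, here is a minimal sketch of a hash ring that assigns scrape targets to collectors. The class name `HashRing`, the collector and target names, and the choice of 64 virtual nodes per collector are assumptions for illustration; a production system would also need membership changes driven by service discovery.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping scrape targets to collectors."""

    def __init__(self, nodes, vnodes=64):
        # Each collector gets `vnodes` points on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, target):
        # First ring point clockwise from the target's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(target)) % len(self._keys)
        return self._ring[idx][1]


targets = [f"server{i}:9100" for i in range(100)]  # hypothetical endpoints
ring = HashRing(["collector-a", "collector-b", "collector-c"])
before = {t: ring.owner(t) for t in targets}

# Removing one collector only reassigns the targets it used to own.
smaller = HashRing(["collector-a", "collector-b"])
after = {t: smaller.owner(t) for t in targets}
moved = [t for t in targets if before[t] != after[t]]
```

The point of the ring over plain `hash(target) % n` is that a collector joining or leaving moves only a fraction of targets, so most scrape state stays where it is.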

Output Examples

Aggregated Query: For "sum(errors) by job over last hour":

| Time Range | job=api | job=web |
|------------|---------|---------|
| 10:00-10:15 | 50 | 20 |
| 10:15-10:30 | 60 | 25 |

Returned as JSON/Protobuf for a dashboard (e.g., Grafana).[4]
Alert: If avg(latency) > 500ms: "Alert: High latency on server1, value=650ms".[3]
Dashboard: Time-series graphs with downsampled data (e.g., 1min/5min/1hr aggregates).[1]
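The aggregated query and the threshold alert described in this section can be sketched in a few lines. This is an illustrative sketch only: `sum_by_job`, `latency_alert`, the 15-minute window, and the sample points are assumptions chosen to mirror the numbers in the examples, not part of any source.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 15 * 60  # 15-minute buckets, as in the example table


def sum_by_job(points, window=WINDOW_SECONDS):
    """Group (unix_seconds, job, value) points into windows and sum per job."""
    table = defaultdict(lambda: defaultdict(float))
    for ts, job, value in points:
        table[ts - ts % window][job] += value
    return {start: dict(row) for start, row in sorted(table.items())}


def latency_alert(instance, latencies_ms, threshold_ms=500):
    """Return an alert message if the average latency exceeds the threshold."""
    avg = mean(latencies_ms)
    if avg > threshold_ms:
        return f"Alert: High latency on {instance}, value={avg:.0f}ms"
    return None


base = 1640995200  # start of a 15-minute window
points = [
    (base + 10, "api", 30),
    (base + 600, "api", 20),
    (base + 100, "web", 20),
    (base + 900, "api", 60),
    (base + 1000, "web", 25),
]
rollup = sum_by_job(points)
message = latency_alert("server1", [600, 700])
```

Running the same rollup with a larger `window` (for example 3600) gives the coarser aggregates a dashboard would read for long time ranges.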

Constraints

No explicit bounded constraints in sources, but derived from scale discussions:

Throughput: 1M+ writes/sec, 10k QPS queries.[2]
Data Volume: 10M metrics/day initially, growing; reduce volume via aggregation/downsampling (e.g., roll 10s samples up into 1min buckets).[1]
Latency: Writes <5s E2E, queries <1s p99.[2]
Storage: TBs-PBs; hot tier (SSD) 7 days, cold (S3) 1+ year.[2]
Other: Per-tenant quotas, deduplication of repeated datapoints, a 99.99% uptime target, and support for high-cardinality labels (mitigated via hashing).[4][1][2]
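A quick back-of-envelope calculation shows how these numbers reach the "TBs-PBs" range. The sketch assumes 16 bytes per raw datapoint (an 8-byte timestamp plus an 8-byte value), with no compression, labels, or index overhead; that per-point size is an assumption for illustration, not a figure from the sources.

```python
WRITES_PER_SEC = 1_000_000   # from the throughput constraint
BYTES_PER_POINT = 16         # assumption: 8-byte timestamp + 8-byte value
SECONDS_PER_DAY = 86_400
HOT_DAYS = 7                 # hot tier retention
COLD_DAYS = 365              # one year of cold retention
TB = 10**12


def raw_bytes(days):
    """Uncompressed bytes written over `days` at the sustained write rate."""
    return WRITES_PER_SEC * BYTES_PER_POINT * SECONDS_PER_DAY * days


daily_tb = raw_bytes(1) / TB         # about 1.38 TB/day
hot_tb = raw_bytes(HOT_DAYS) / TB    # about 9.7 TB in the hot tier
cold_tb = raw_bytes(COLD_DAYS) / TB  # about 505 TB per year at full resolution
```

At full resolution a year of data is already about half a petabyte under these assumptions, which is why the cold tier usually holds downsampled rather than raw data.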
