LinkedIn system design interview questions, such as "Design Metrics Aggregation / Monitoring System," are typically shared in posts, repositories, or prep sites, but these rarely give a single canonical statement with explicit input/output examples and constraints. The closest matches among the available sources synthesize a problem from common variants asked at companies like Meta, Stripe, TikTok, and Amazon.
Design a scalable metrics aggregation and monitoring system that collects time-series metrics (e.g., CPU usage, memory, error rates, request latency, user signups, ad clicks) from hundreds of thousands of servers or services via client libraries or agents. The system must aggregate count-based or gauge metrics into histograms, support low-latency queries for dashboards, enable threshold-based alerting with notifications (e.g., email, Slack), handle high write throughput (millions of datapoints/sec), provide multi-tenant isolation, and store data with tiered retention (hot/cold storage).[1][3][4]
Metrics arrive as time-series data points, often in Prometheus-style format:

```
# value 120 at Unix timestamp 1640995200
http_requests_total{job="api", instance="server1", status="500"} 120 1640995200
memory_usage_bytes{instance="server1"} 8589934592 1640995200
http_request_duration_seconds_bucket{le="0.1"} 50 1640995200
```
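As a rough illustration of how an ingestion agent might turn such lines into structured samples, here is a minimal Python sketch. The names `Sample` and `parse_sample` are hypothetical, not part of any real client library, and the parser makes simplifying assumptions: label values contain no `}` character, escape sequences are not decoded, and anything after the timestamp is ignored.

```python
import re
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    """One parsed data point (hypothetical structure for illustration)."""
    name: str
    labels: dict
    value: float
    timestamp: Optional[int]


# metric name, optional {labels}, value, optional integer timestamp.
# The pattern is not anchored at the end, so trailing text is ignored.
_LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?'
)
# key="value" pairs inside the braces.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Optional[Sample]:
    """Parse one Prometheus-style line; return None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unparseable sample line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    ts = int(match.group("ts")) if match.group("ts") else None
    return Sample(match.group("name"), labels, float(match.group("value")), ts)


sample = parse_sample(
    'http_requests_total{job="api", instance="server1", status="500"} 120 1640995200'
)
print(sample)
```

A production parser would also need to handle escaped characters, `# HELP` and `# TYPE` metadata, and malformed input at high volume; this sketch only covers the sample lines shown above.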
| Time Range | job=api | job=web |
|------------|---------|---------|
| 10:00-10:15 | 50 | 20 |
| 10:15-10:30 | 60 | 25 |
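A table like this can be produced by grouping data points into fixed 15-minute (tumbling) windows and summing per label value. The Python sketch below is one way to do that under stated assumptions: each point is a per-interval count (a delta, not a cumulative counter, which would need differencing first), timestamps are Unix seconds in UTC, and the names `aggregate` and `to_rows` and the sample data are made up for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 15 * 60  # 15-minute tumbling windows


def window_start(ts: int, width: int = WINDOW_SECONDS) -> int:
    """Round a Unix timestamp down to the start of its window."""
    return ts - ts % width


def aggregate(points, label="job", width=WINDOW_SECONDS):
    """Sum values per (window start, label value).

    points: iterable of (timestamp_seconds, labels_dict, value),
    where value is assumed to be a per-interval delta.
    """
    totals = defaultdict(float)
    for ts, labels, value in points:
        totals[(window_start(ts, width), labels.get(label, ""))] += value
    return dict(totals)


def _hhmm(ts: int) -> str:
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%H:%M")


def to_rows(totals, width=WINDOW_SECONDS):
    """Pivot the totals into JSON-serializable rows, one per window."""
    by_window = defaultdict(dict)
    for (start, group), total in totals.items():
        by_window[start][group] = total
    return [
        {"time_range": f"{_hhmm(start)}-{_hhmm(start + width)}",
         "values": by_window[start]}
        for start in sorted(by_window)
    ]


# Illustrative data: 1641031200 is 2022-01-01 10:00:00 UTC.
BASE = 1641031200
points = [
    (BASE + 60, {"job": "api"}, 30),
    (BASE + 600, {"job": "api"}, 20),
    (BASE + 120, {"job": "web"}, 20),
    (BASE + 900, {"job": "api"}, 60),
    (BASE + 1000, {"job": "web"}, 25),
]
rows = to_rows(aggregate(points))
print(json.dumps(rows, indent=2))
```

In a real system this roll-up would typically run in a stream processor or inside the time-series database rather than in application code, and late-arriving points would need an explicit policy.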
Query results are returned as JSON or Protobuf for dashboards (e.g., Grafana).[4]

The sources state no explicit bounded constraints, but the following can be derived from their scale discussions: