Design a cloud-scale metrics collection and monitoring system that can ingest 5 million metric data points per second from 500,000 servers, store them cost-effectively, and let users build real-time dashboards and alerts. Each metric is a tiny time-series record: a metric name, a set of key-value tags, a Unix timestamp, and a numeric value. Each server emits roughly 100 such points every 10–30 s.

The system must keep the last 24 h of data at full 10 s resolution for live dashboards, down-sample older data to 1 min, 5 min, and 1 h aggregates to save space, and answer ad-hoc queries in seconds even when they span weeks. Users should be able to define alert rules (e.g., "if average CPU > 80% for 5 min") and receive email/SMS/push notifications within 1 min of the condition becoming true.

The service must be highly available (alerting accuracy matters more than dashboard freshness) and must gracefully accept late or out-of-order data caused by network delays. You do not need to design multi-region replication or a custom query language; focus on ingestion, storage, query, and alerting pipelines that scale to the stated volume while keeping cloud costs under control.
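To ground the stated numbers, a quick back-of-envelope sizing is a useful first step. The sketch below uses the server count, per-server metric count, and fastest emit interval from the prompt; the ~100-byte raw point size is an illustrative assumption, not part of the problem statement.

```python
# Back-of-envelope sizing for the ingestion pipeline described above.
# Server count, metrics per server, and the 10 s interval come from the
# prompt; the raw per-point size is an assumed figure for illustration.

SERVERS = 500_000
METRICS_PER_SERVER = 100
EMIT_INTERVAL_S = 10        # fastest cadence in the 10-30 s range
RAW_POINT_BYTES = 100       # name + tags + timestamp + value, uncompressed (assumed)

points_per_second = SERVERS * METRICS_PER_SERVER // EMIT_INTERVAL_S
raw_bytes_per_day = points_per_second * RAW_POINT_BYTES * 86_400

print(f"{points_per_second:,} points/s")              # 5,000,000 points/s
print(f"{raw_bytes_per_day / 1e12:.1f} TB/day raw")   # 43.2 TB/day
```

The peak of 5 million points/s matches the prompt's figure (it corresponds to every server emitting on the fastest 10 s cadence), and tens of terabytes per day of raw data makes clear why the prompt asks for down-sampling and cost-aware storage.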