Design Metrics Collection and Alerting


[ INFO ] category: System Design | difficulty: hard | freq: high | first seen: 2026-01-13

[HARD][SYSTEM DESIGN][HIGH] data_engineering, web, Time Series, backend, Alerting, Monitoring, Data Pipeline, infrastructure

$ cat problem.md

Design a metrics monitoring and alerting system that can collect, store, query, visualize, and generate alerts based on system metrics (e.g., CPU usage, memory, request counts). The system should support both real-time monitoring and historical analysis. It must ingest continuous streams of metric data from servers, applications, and services, then make time-series metrics available for low-latency queries, power dashboards, and trigger alerts via email/SMS/Slack when user-defined thresholds or rules are met. The interviewer expects you to walk through the end-to-end pipeline: collection (push vs pull), streaming ingestion, time-series storage with retention tiers, query layer, alerting engine, and notification routing. You should also address how to scale to millions of metrics per second, maintain high availability and durability, and provide multi-tenant isolation and security.
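The alerting-engine part of the pipeline can be illustrated with a small sketch. The problem statement only says that alerts trigger when "user-defined thresholds or rules are met"; everything below is an assumption made for illustration: the names `AlertRule`, `RuleState`, and `evaluate`, the `cpu_usage` metric, and the idea that a threshold must be breached continuously for `for_seconds` before the alert fires (a common way to avoid paging on a single spike). It is one possible shape for a rule evaluator, not a prescribed answer.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# All names here are hypothetical; they do not come from the problem statement.
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


@dataclass
class AlertRule:
    metric: str        # metric name the rule watches, e.g. "cpu_usage"
    comparator: str    # one of the keys in COMPARATORS
    threshold: float   # value the sample is compared against
    for_seconds: int   # how long the breach must last before firing


@dataclass
class RuleState:
    breach_started_at: Optional[int] = None  # timestamp of first breaching sample
    firing: bool = False                     # whether the alert is currently active


def evaluate(rule: AlertRule, state: RuleState, timestamp: int, value: float) -> Optional[str]:
    """Feed one sample; return "firing" or "resolved" on a transition, else None."""
    breached = COMPARATORS[rule.comparator](value, rule.threshold)

    if not breached:
        # Any healthy sample resets the breach window.
        state.breach_started_at = None
        if state.firing:
            state.firing = False
            return "resolved"
        return None

    if state.breach_started_at is None:
        state.breach_started_at = timestamp

    # Fire only once the breach has lasted at least for_seconds.
    if not state.firing and timestamp - state.breach_started_at >= rule.for_seconds:
        state.firing = True
        return "firing"
    return None


# Usage: CPU above 90% for two minutes fires, then a healthy sample resolves it.
rule = AlertRule(metric="cpu_usage", comparator=">", threshold=90.0, for_seconds=120)
state = RuleState()
samples = [(0, 95.0), (60, 97.0), (120, 96.0), (180, 50.0)]
events = [(ts, evaluate(rule, state, ts, v)) for ts, v in samples]
```

In a full design, the "firing" and "resolved" transitions would be handed to the notification router (email/SMS/Slack), and the per-rule state would need to live somewhere durable so that an evaluator restart does not silently drop an active alert. Emitting only on transitions, rather than on every breaching sample, is one simple way to keep notification volume down.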

user@intervues:~/snapchat$