Design a System Metrics Monitoring and Alerting Platform
System Design · Must
Problem Statement
Design a metrics monitoring platform that collects performance data (CPU, memory, throughput) from hundreds of thousands of servers, allows users to set threshold-based alerts, and provides dashboards for visualization and notifications when alerts are triggered. The core workflow is collect, store, query, alert, and notify, all at very high scale and in near real time.
Your architecture must handle a high-throughput telemetry pipeline, a cost-effective time-series store, and a reliable low-latency alerting engine across hundreds of thousands of nodes. Balance ingestion throughput, cardinality control, retention/rollups, multi-tenant isolation, backpressure, real-time updates, and failure handling while keeping the design operable and cost-aware.
Key Requirements
Functional
- Dashboard visualization -- users view near real-time and historical metrics (CPU, memory, throughput) from hundreds of thousands of servers with filters and aggregation
- Alert rule management -- users define, edit, and disable threshold-based alert rules scoped by metric, labels/tags, and time windows
- Multi-channel notifications -- users receive notifications (email, SMS, webhooks, chat) when alerts trigger, with deduplication and severity routing
- Data retention control -- users control data retention and resolution (raw vs. downsampled) to balance cost and fidelity
Non-Functional
- Scalability -- ingest metrics from 500K+ servers at millions of data points per second with linear write scalability
- Reliability -- tolerate component failures without data loss; alerting must continue during partial outages
- Latency -- dashboard queries return within 2 seconds for recent data; alerts fire within 30 seconds of threshold breach
- Consistency -- accept eventual consistency for dashboards; alert evaluation requires at-least-once semantics to avoid missed alerts
Interview Reports from Hello Interview
22 reports from candidates. Most recently asked at LinkedIn in Early January 2026.
Also commonly asked at: Meta, Amazon, TikTok.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Interviewers want to see how you handle millions of metric data points per second from hundreds of thousands of heterogeneous sources without dropping data or overwhelming downstream systems.
Hints to consider:
- Use Kafka as a durable buffer between metric producers and consumers to decouple ingestion from processing and provide backpressure
- Implement client-side batching and compression in metric agents to reduce network overhead
- Partition ingestion by metric name or host to enable parallel processing and avoid hot partitions
- Apply per-tenant rate limiting to prevent noisy neighbors from starving others in multi-tenant deployments
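The batching and partitioning hints above can be sketched together. This is a minimal illustration, not a production agent: `NUM_PARTITIONS`, the `MetricBatcher` class, and the zlib-over-JSON encoding are all assumptions for the sketch; a real agent would hand each compressed batch to a Kafka producer rather than collect it in a list.

```python
import hashlib
import json
import zlib

NUM_PARTITIONS = 64  # assumed partition count for the metrics topic

def partition_for(metric: str, host: str) -> int:
    """Stable hash of (metric, host) so each series always lands on one
    partition, enabling ordered, parallel consumption downstream."""
    key = f"{metric}:{host}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PARTITIONS

class MetricBatcher:
    """Client-side batching and compression: buffer points per partition and
    flush a compressed batch once it reaches max_batch points."""
    def __init__(self, max_batch=500):
        self.max_batch = max_batch
        self.buffers = {}   # partition -> list of pending points
        self.flushed = []   # stand-in for producer.send(partition, payload)

    def add(self, metric, host, ts, value):
        p = partition_for(metric, host)
        buf = self.buffers.setdefault(p, [])
        buf.append({"metric": metric, "host": host, "ts": ts, "value": value})
        if len(buf) >= self.max_batch:
            self.flush(p)

    def flush(self, partition):
        points = self.buffers.pop(partition, [])
        if points:
            payload = zlib.compress(json.dumps(points).encode())
            self.flushed.append((partition, payload))
```

Hashing on metric+host spreads hosts of a hot metric across partitions; hashing on metric alone would concentrate a popular metric on one partition.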
2. Time-Series Storage and Cardinality Control
Uncontrolled label cardinality can explode storage costs and make queries impossibly slow. Interviewers expect you to address this directly.
Hints to consider:
- Use a time-series optimized store (Cassandra with wide rows, or a purpose-built TSDB) with time-bucketed partitioning for efficient writes and range scans
- Implement cardinality quotas and allow/deny lists to prevent unbounded series creation from high-cardinality labels (user IDs, request IDs)
- Apply downsampling rollups: keep raw data for 7 days, 1-minute aggregates for 30 days, 5-minute aggregates for 1 year
- Use TTLs on storage partitions for automatic data lifecycle management
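Time-bucketed partitioning and downsampling can be sketched in a few lines. The hour-long bucket, the `partition_key` shape, and the min/max/avg/count rollup schema are illustrative assumptions, not a specific TSDB's layout.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumed: one storage partition per series per hour

def partition_key(metric, host, ts):
    """(series, time bucket): writes for a series go to one partition per
    hour, and a range scan touches only the buckets in the query window."""
    return (metric, host, ts - ts % BUCKET_SECONDS)

def rollup(points, window=60):
    """Downsample raw (ts, value) points into per-window aggregates.
    Keeping min/max/count alongside avg lets later stages re-aggregate
    correctly (e.g. 1-minute rollups into 5-minute rollups)."""
    groups = defaultdict(list)
    for ts, value in points:
        groups[ts - ts % window].append(value)
    return {
        w: {"min": min(v), "max": max(v), "avg": sum(v) / len(v), "count": len(v)}
        for w, v in sorted(groups.items())
    }
```

With the retention tiers above, a job like this would run continuously: raw partitions expire via TTL after 7 days while their 1-minute rollups live on for 30 days.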
3. Alert Evaluation as a Separate Stream
Coupling alert evaluation to the write path or dashboard query path risks cascading failures. Interviewers look for a decoupled alerting architecture.
Hints to consider:
- Run alert evaluation as a separate Flink/streaming job that consumes metric events and evaluates rules with windowed aggregations
- Support multiple alert types: threshold (CPU > 90% for 5 minutes), rate-of-change (error rate increased 3x), and absence (no data for 10 minutes)
- Implement deduplication and cooldown periods to prevent alert storms during widespread incidents
- Use a separate notification service with retry logic and provider abstraction for email, SMS, and webhook delivery
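A sustained-threshold rule with a cooldown can be sketched as a small state machine per series. This is one assumed implementation of "CPU > 90% for 5 minutes" with dedup; a streaming job (e.g. Flink) would keep equivalent state per (rule, series) key.

```python
class ThresholdRule:
    """Fire when value stays above `threshold` for `duration` seconds,
    then suppress repeat fires for `cooldown` seconds (dedup)."""
    def __init__(self, threshold, duration, cooldown):
        self.threshold = threshold
        self.duration = duration
        self.cooldown = cooldown
        self.breach_start = None  # when the current breach began
        self.last_fired = None    # last time a notification was sent

    def evaluate(self, ts, value):
        """Return True if a notification should be sent for this point."""
        if value <= self.threshold:
            self.breach_start = None  # breach ended; reset the window
            return False
        if self.breach_start is None:
            self.breach_start = ts    # breach begins
        if ts - self.breach_start < self.duration:
            return False              # not yet sustained long enough
        if self.last_fired is not None and ts - self.last_fired < self.cooldown:
            return False              # still in cooldown: dedup
        self.last_fired = ts
        return True
```

Rate-of-change and absence rules follow the same shape with different state: a baseline window for the former, a last-seen timestamp checked by a timer for the latter.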
4. Query Optimization for Dashboards
Dashboard queries span time ranges and aggregate across many series. Interviewers want to see how you keep query latency low as data grows.
Hints to consider:
- Pre-compute common aggregations (per-service error rates, cluster-level CPU averages) and cache them for fast dashboard rendering
- Use columnar storage or materialized views for common query patterns to avoid scanning raw data
- Implement query routing that directs recent queries to hot storage and historical queries to cold/aggregated storage
- Cache frequently accessed dashboard panels with short TTLs and invalidate on new data arrival
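Query routing and panel caching can be sketched as follows. The 7-day hot window mirrors the retention policy above; the route function and `PanelCache` are illustrative assumptions, not a particular product's API.

```python
import time

HOT_WINDOW = 7 * 24 * 3600  # raw data retained 7 days, matching the rollup tiers

def route_query(start_ts, now=None):
    """Recent ranges hit the hot raw store; older ranges hit rollups."""
    now = time.time() if now is None else now
    return "hot" if now - start_ts <= HOT_WINDOW else "cold"

class PanelCache:
    """Tiny TTL cache for rendered dashboard panels. A short TTL keeps
    panels fresh; ingestion can also delete keys on new data arrival."""
    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.entries = {}  # key -> (stored_at, value)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(key)
        if entry and now - entry[0] <= self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, key, value, now=None):
        self.entries[key] = (time.time() if now is None else now, value)
```

A query spanning both tiers would be split at the hot/cold boundary and the two partial results merged, which is why rollups must carry re-aggregatable fields (count, sum) rather than only averages.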