Design a System Metrics Monitoring and Alerting Platform
System Design · Must
Problem Statement
Design a metrics monitoring platform that collects performance data (CPU, memory, throughput) from hundreds of thousands of servers, allows users to set threshold-based alerts, and provides dashboards for visualization and notifications when alerts are triggered. The core workflow is collect, store, query, alert, and notify, all at very high scale and in near real time.
Your architecture must handle a high-throughput telemetry pipeline, a cost-effective time-series store, and a reliable low-latency alerting engine across hundreds of thousands of nodes. Balance ingestion throughput, cardinality control, retention/rollups, multi-tenant isolation, backpressure, real-time updates, and failure handling while keeping the design operable and cost-aware.
Key Requirements
Functional
- Dashboard visualization -- users view near real-time and historical metrics (CPU, memory, throughput) from hundreds of thousands of servers with filters and aggregation
- Alert rule management -- users define, edit, and disable threshold-based alert rules scoped by metric, labels/tags, and time windows
- Multi-channel notifications -- users receive notifications (email, SMS, webhooks, chat) when alerts trigger, with deduplication and severity routing
- Data retention control -- users control data retention and resolution (raw vs. downsampled) to balance cost and fidelity
Non-Functional
- Scalability -- ingest metrics from 500K+ servers at millions of data points per second with linear write scalability
- Reliability -- tolerate component failures without data loss; alerting must continue during partial outages
- Latency -- dashboard queries return within 2 seconds for recent data; alerts fire within 30 seconds of threshold breach
- Consistency -- accept eventual consistency for dashboards; alert evaluation requires at-least-once semantics to avoid missed alerts
Interview Reports from Hello Interview
22 reports from candidates. Most recently asked at LinkedIn in Early January 2026.
Also commonly asked at: Meta, Amazon, TikTok.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Interviewers want to see how you handle millions of metric data points per second from hundreds of thousands of heterogeneous sources without dropping data or overwhelming downstream systems.
Hints to consider:
- Use Kafka as a durable buffer between metric producers and consumers to decouple ingestion from processing and provide backpressure
- Implement client-side batching and compression in metric agents to reduce network overhead
- Partition ingestion by metric name or host to enable parallel processing and avoid hot partitions
- Apply per-tenant rate limiting to prevent noisy neighbors from starving others in multi-tenant deployments
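The batching and partitioning hints above can be sketched together. This is a minimal illustration, not a production agent: `NUM_PARTITIONS`, the `MetricBatcher` class, and the zlib-over-JSON encoding are all assumptions for the sketch; a real agent would hand each compressed batch to a Kafka producer rather than collect it in a list.

```python
import hashlib
import json
import zlib

NUM_PARTITIONS = 64  # assumed partition count for the metrics topic

def partition_for(metric: str, host: str) -> int:
    """Stable hash of (metric, host) so each series always lands on one
    partition, enabling ordered, parallel consumption downstream."""
    key = f"{metric}:{host}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PARTITIONS

class MetricBatcher:
    """Client-side batching and compression: buffer points per partition and
    flush a compressed batch once it reaches max_batch points."""
    def __init__(self, max_batch=500):
        self.max_batch = max_batch
        self.buffers = {}   # partition -> list of pending points
        self.flushed = []   # stand-in for producer.send(partition, payload)

    def add(self, metric, host, ts, value):
        p = partition_for(metric, host)
        buf = self.buffers.setdefault(p, [])
        buf.append({"metric": metric, "host": host, "ts": ts, "value": value})
        if len(buf) >= self.max_batch:
            self.flush(p)

    def flush(self, partition):
        points = self.buffers.pop(partition, [])
        if points:
            payload = zlib.compress(json.dumps(points).encode())
            self.flushed.append((partition, payload))
```

Hashing on metric+host spreads hosts of a hot metric across partitions; hashing on metric alone would concentrate a popular metric on one partition.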
2. Time-Series Storage and Cardinality Control
Uncontrolled label cardinality can explode storage costs and make queries impossibly slow. Interviewers expect you to address this directly.
Hints to consider:
- Use a time-series optimized store (Cassandra with wide rows, or a purpose-built TSDB) with time-bucketed partitioning for efficient writes and range scans
- Implement cardinality quotas and allow/deny lists to prevent unbounded series creation from high-cardinality labels (user IDs, request IDs)
- Apply downsampling rollups: keep raw data for 7 days, 1-minute aggregates for 30 days, 5-minute aggregates for 1 year
- Use TTLs on storage partitions for automatic data lifecycle management
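Time-bucketed partitioning and downsampling can be sketched in a few lines. The hour-long bucket, the `partition_key` shape, and the min/max/avg/count rollup schema are illustrative assumptions, not a specific TSDB's layout.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumed: one storage partition per series per hour

def partition_key(metric, host, ts):
    """(series, time bucket): writes for a series go to one partition per
    hour, and a range scan touches only the buckets in the query window."""
    return (metric, host, ts - ts % BUCKET_SECONDS)

def rollup(points, window=60):
    """Downsample raw (ts, value) points into per-window aggregates.
    Keeping min/max/count alongside avg lets later stages re-aggregate
    correctly (e.g. 1-minute rollups into 5-minute rollups)."""
    groups = defaultdict(list)
    for ts, value in points:
        groups[ts - ts % window].append(value)
    return {
        w: {"min": min(v), "max": max(v), "avg": sum(v) / len(v), "count": len(v)}
        for w, v in sorted(groups.items())
    }
```

With the retention tiers above, a job like this would run continuously: raw partitions expire via TTL after 7 days while their 1-minute rollups live on for 30 days.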
3. Alert Evaluation as a Separate Stream
Coupling alert evaluation to the write path or dashboard query path risks cascading failures. Interviewers look for a decoupled alerting architecture.
Hints to consider:
- Run alert evaluation as a separate Flink/streaming job that consumes metric events and evaluates rules with windowed aggregations
- Support multiple alert types: threshold (CPU > 90% for 5 minutes), rate-of-change (error rate increased 3x), and absence (no data for 10 minutes)
- Implement deduplication and cooldown periods to prevent alert storms during widespread incidents
- Use a separate notification service with retry logic and provider abstraction for email, SMS, and webhook delivery
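A sustained-threshold rule with a cooldown can be sketched as a small state machine per series. This is one assumed implementation of "CPU > 90% for 5 minutes" with dedup; a streaming job (e.g. Flink) would keep equivalent state per (rule, series) key.

```python
class ThresholdRule:
    """Fire when value stays above `threshold` for `duration` seconds,
    then suppress repeat fires for `cooldown` seconds (dedup)."""
    def __init__(self, threshold, duration, cooldown):
        self.threshold = threshold
        self.duration = duration
        self.cooldown = cooldown
        self.breach_start = None  # when the current breach began
        self.last_fired = None    # last time a notification was sent

    def evaluate(self, ts, value):
        """Return True if a notification should be sent for this point."""
        if value <= self.threshold:
            self.breach_start = None  # breach ended; reset the window
            return False
        if self.breach_start is None:
            self.breach_start = ts    # breach begins
        if ts - self.breach_start < self.duration:
            return False              # not yet sustained long enough
        if self.last_fired is not None and ts - self.last_fired < self.cooldown:
            return False              # still in cooldown: dedup
        self.last_fired = ts
        return True
```

Rate-of-change and absence rules follow the same shape with different state: a baseline window for the former, a last-seen timestamp checked by a timer for the latter.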
4. Query Optimization for Dashboards
Dashboard queries span time ranges and aggregate across many series. Interviewers want to see how you keep query latency low as data grows.
Hints to consider:
- Pre-compute common aggregations (per-service error rates, cluster-level CPU averages) and cache them for fast dashboard rendering
- Use columnar storage or materialized views for common query patterns to avoid scanning raw data
- Implement query routing that directs recent queries to hot storage and historical queries to cold/aggregated storage
- Cache frequently accessed dashboard panels with short TTLs and invalidate on new data arrival
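Query routing and panel caching can be sketched as follows. The 7-day hot window mirrors the retention policy above; the route function and `PanelCache` are illustrative assumptions, not a particular product's API.

```python
import time

HOT_WINDOW = 7 * 24 * 3600  # raw data retained 7 days, matching the rollup tiers

def route_query(start_ts, now=None):
    """Recent ranges hit the hot raw store; older ranges hit rollups."""
    now = time.time() if now is None else now
    return "hot" if now - start_ts <= HOT_WINDOW else "cold"

class PanelCache:
    """Tiny TTL cache for rendered dashboard panels. A short TTL keeps
    panels fresh; ingestion can also delete keys on new data arrival."""
    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.entries = {}  # key -> (stored_at, value)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(key)
        if entry and now - entry[0] <= self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, key, value, now=None):
        self.entries[key] = (time.time() if now is None else now, value)
```

A query spanning both tiers would be split at the hot/cold boundary and the two partial results merged, which is why rollups must carry re-aggregatable fields (count, sum) rather than only averages.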