Design a System Metrics Monitoring and Alerting Platform
System Design · Must
Problem Statement
Design a metrics monitoring platform that collects performance data (CPU, memory, throughput, error rates, custom business metrics) from hundreds of thousands of services across a cloud infrastructure, provides real-time dashboards and historical trend analysis, and triggers alerts when metrics indicate problems. Think of systems like Datadog, Prometheus/Grafana, or Amazon CloudWatch.
At Amazon, interviewers ask this to evaluate your ability to design high-throughput data ingestion, time-series storage with multi-resolution queries, real-time stream processing for alerting, and efficient dashboard serving. The challenge is handling the cardinality explosion from metric labels (service, host, region, endpoint) while maintaining query performance and cost efficiency.
Key Requirements
Functional
- Metric ingestion -- accept counter, gauge, and histogram metrics from services via push (agents) or pull (scraping) with configurable intervals
- Dashboard visualization -- operators build custom dashboards with charts, graphs, and heatmaps showing metrics across configurable time windows and label filters
- Alert rule management -- operators define alert rules with conditions (threshold, rate of change, absence), evaluation windows, and notification channels
- Metric exploration -- operators query and explore metrics ad-hoc using a query language (e.g., "average CPU by service where region=us-east over last 2 hours")
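The three metric types and label-based series identity above can be sketched as a minimal data model. All names here are illustrative assumptions, not part of the requirements:

```python
from dataclasses import dataclass
from enum import Enum

class MetricType(Enum):
    COUNTER = "counter"      # monotonically increasing (e.g., request count)
    GAUGE = "gauge"          # point-in-time value (e.g., memory usage)
    HISTOGRAM = "histogram"  # bucketed distribution (e.g., request latency)

@dataclass(frozen=True)
class Sample:
    name: str          # e.g., "cpu.utilization"
    type: MetricType
    labels: tuple      # sorted (key, value) pairs -- the identity of the series
    timestamp: int     # epoch seconds
    value: float

def series_id(name: str, labels: dict) -> str:
    """A unique time series is the metric name plus its full label set."""
    body = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{name}{{{body}}}"

print(series_id("http.errors", {"service": "checkout", "region": "us-east"}))
# -> http.errors{region=us-east,service=checkout}
```

Sorting the labels before building the identifier ensures the same series always maps to the same ID regardless of the order an agent sends labels in.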
Non-Functional
- Scalability -- ingest 5+ million unique metric time series with 10-second resolution, producing 500K+ data points per second
- Reliability -- no metric data loss during ingestion; alerting pipeline maintains 99.9% uptime with redundant evaluation
- Latency -- recent metric queries (last 1 hour) return within 1 second; alert evaluation delay under 30 seconds
- Consistency -- eventual consistency for dashboards; alerts evaluated on freshest available data with bounded staleness
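The scalability numbers above are internally consistent: 5M series scraped every 10 seconds yields 500K points/sec. A quick back-of-envelope check, with an assumed 16 bytes per raw sample and a Gorilla-style compression ratio:

```python
series = 5_000_000
interval_s = 10
points_per_sec = series / interval_s
print(points_per_sec)  # 500000.0

# Storage: assume ~16 bytes raw per point (8B timestamp + 8B value) and
# ~1.5 bytes per point after Gorilla-style encoding (both are assumptions).
raw_per_day = points_per_sec * 86_400 * 16
compressed_per_day = points_per_sec * 86_400 * 1.5
print(f"raw/day: {raw_per_day/1e12:.1f} TB, compressed/day: {compressed_per_day/1e9:.0f} GB")
# -> raw/day: 0.7 TB, compressed/day: 65 GB
```

Numbers like these are worth stating in the interview: they justify both the compression investment and the multi-resolution retention tiers discussed later.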
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Cardinality Metric Handling
The combination of metric name and label dimensions (service, host, region, method, status_code) creates millions of unique time series. Interviewers probe whether you understand the cardinality challenge and its impact on storage and query performance.
Hints to consider:
- Impose cardinality limits per metric namespace to prevent label explosions from crashing the system
- Use a metric registry that tracks active time series and drops or aggregates series exceeding limits
- Store label indexes separately from time-series data to enable efficient label-based filtering without scanning raw data
- Pre-aggregate across high-cardinality dimensions (e.g., per-host to per-service) at ingestion time for common dashboard queries
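The registry-with-limits idea in the hints can be sketched as follows. The class name, cap, and drop policy (reject new series, keep existing ones flowing) are illustrative assumptions:

```python
from collections import defaultdict

class SeriesRegistry:
    """Tracks active series per metric and enforces a cardinality cap.
    Sketch only: a real registry would also expire idle series and
    export the drop counts as metrics themselves."""

    def __init__(self, max_series_per_metric: int = 10_000):
        self.max_series = max_series_per_metric
        self.active = defaultdict(set)   # metric name -> set of label sets
        self.dropped = defaultdict(int)  # metric name -> rejected-series count

    def admit(self, metric: str, labels: dict) -> bool:
        key = frozenset(labels.items())
        known = self.active[metric]
        if key in known:
            return True                  # existing series always pass
        if len(known) >= self.max_series:
            self.dropped[metric] += 1    # new series rejected, not silently lost
            return False
        known.add(key)
        return True

reg = SeriesRegistry(max_series_per_metric=2)
print(reg.admit("cpu", {"host": "a"}))  # True
print(reg.admit("cpu", {"host": "b"}))  # True
print(reg.admit("cpu", {"host": "c"}))  # False -- over the cap
print(reg.admit("cpu", {"host": "a"}))  # True  -- existing series unaffected
```

Rejecting only *new* series is the key design choice: a buggy deployment that emits a unique label per request degrades into dropped new series and a visible counter, rather than taking down storage for everyone.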
2. Time-Series Storage Architecture
Metrics at 10-second resolution with millions of series create enormous storage volume. Interviewers look for tiered storage with efficient encoding.
Hints to consider:
- Use a time-series-optimized storage engine with column-oriented layout for efficient compression and range scans
- Apply delta-of-delta encoding for timestamps and XOR encoding for values (Gorilla compression), typically achieving 10x or better compression on regularly scraped metrics
- Implement multi-resolution storage: raw data (10s) for 2 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year
- Partition by time and metric namespace to enable parallel queries and efficient data lifecycle management
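The delta-of-delta idea is worth being able to demonstrate: with a fixed scrape interval, nearly every encoded timestamp becomes zero, which a bit-level encoder (as in Gorilla) stores in a single bit. This sketch shows the transform only, not the bit packing:

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Encode timestamps as the change in consecutive deltas.
    Regular intervals produce mostly zeros, which compress extremely well."""
    out = [timestamps[0]]                 # first timestamp stored raw
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)    # ~0 for a steady scrape interval
        prev_delta = delta
    return out

# A 10-second scrape with one second of jitter in the middle:
ts = [1000, 1010, 1020, 1031, 1041]
print(delta_of_delta(ts))  # [1000, 10, 0, 1, -1]
```

Only the jittered samples produce nonzero values, and even those are tiny integers needing few bits. The same intuition motivates XOR encoding for values: consecutive readings of a slowly changing gauge share most of their bits.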
3. Alert Evaluation Engine
Evaluating thousands of alert rules against millions of time series continuously is computationally expensive. Interviewers want to see an efficient evaluation architecture.
Hints to consider:
- Group alert rules by the metrics they reference and evaluate in batches: fetch data once, evaluate all rules against it
- Implement multi-stage evaluation: fast filters (fixed thresholds) discard most data, complex evaluations (rate of change, percentiles) run only on candidates
- Support evaluation at different frequencies based on alert priority: critical alerts every 10 seconds, warnings every minute
- Handle rule evaluation failures gracefully: retry with backoff, never silently drop alerts, and track evaluation success rate
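The first two hints (batch rules by metric, cheap filters before expensive checks) can be combined into one evaluation loop. The rule shape and `fetch_window` callback are illustrative assumptions:

```python
from collections import defaultdict

# Illustrative rule shape: each rule names the metric it reads, a cheap
# threshold pre-filter, and an optional expensive check over the window.
rules = [
    {"id": "cpu-high",  "metric": "cpu", "threshold": 0.9, "check": None},
    {"id": "cpu-spike", "metric": "cpu", "threshold": 0.5,
     "check": lambda window: window[-1] - window[0] > 0.3},  # rate of change
]

def evaluate(rules, fetch_window):
    """Group rules by metric so each metric's window is fetched once,
    then run cheap threshold filters before any expensive checks."""
    by_metric = defaultdict(list)
    for r in rules:
        by_metric[r["metric"]].append(r)

    firing = []
    for metric, metric_rules in by_metric.items():
        window = fetch_window(metric)      # one fetch per metric, not per rule
        latest = window[-1]
        for r in metric_rules:
            if latest < r["threshold"]:    # stage 1: cheap threshold filter
                continue
            if r["check"] is None or r["check"](window):  # stage 2: expensive
                firing.append(r["id"])
    return firing

print(evaluate(rules, lambda m: [0.55, 0.70, 0.95]))
# -> ['cpu-high', 'cpu-spike']
```

With thousands of rules referencing a much smaller set of metrics, the one-fetch-per-metric grouping is usually the larger win; the staged evaluation then keeps the per-rule cost near zero for healthy series.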
4. Query Engine and Dashboard Performance
Ad-hoc metric queries require a flexible query language and efficient execution. Interviewers probe your query planning and optimization strategies.
Hints to consider:
- Implement a query planner that selects the appropriate storage resolution based on the requested time range
- Push label filters into the storage layer to reduce data scanned (predicate pushdown)
- Cache frequently accessed dashboard queries in Redis with TTLs that match the metric resolution (e.g., 10-second TTL for live dashboards)
- Support query fan-out across multiple storage partitions with parallel execution and result merging
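The resolution-selecting planner from the first hint can be sketched against the retention tiers described earlier (raw 10s for 2 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year). The per-query point budget is an assumption, loosely sized to dashboard chart widths:

```python
from datetime import timedelta

# Retention tiers: (how far back this tier reaches, resolution in seconds).
TIERS = [
    (timedelta(days=2),   10),
    (timedelta(days=30),  60),
    (timedelta(days=365), 300),
]

def pick_resolution(time_range: timedelta, max_points: int = 11_000) -> int:
    """Choose the finest resolution whose retention covers the requested
    range and whose result stays under the point budget (assumption)."""
    for retention, resolution_s in TIERS:
        if time_range <= retention and \
                time_range.total_seconds() / resolution_s <= max_points:
            return resolution_s
    return TIERS[-1][1]  # fall back to the coarsest tier

print(pick_resolution(timedelta(hours=1)))   # 10  -- raw data, 360 points
print(pick_resolution(timedelta(days=7)))    # 60  -- 1-minute rollups
print(pick_resolution(timedelta(days=365)))  # 300 -- 5-minute rollups
```

The same routing decision tells the fan-out layer which partitions to hit, since each resolution tier lives in its own time-partitioned storage.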