Design a System Metrics Monitoring and Alerting Platform
System Design · Must
Problem Statement
Design a metrics monitoring platform that collects performance data (CPU, memory, throughput, error rates, custom business metrics) from hundreds of thousands of services across a cloud infrastructure, provides real-time dashboards and historical trend analysis, and triggers alerts when metrics indicate problems. Think of systems like Datadog, Prometheus/Grafana, or Amazon CloudWatch.
At Amazon, interviewers ask this to evaluate your ability to design high-throughput data ingestion, time-series storage with multi-resolution queries, real-time stream processing for alerting, and efficient dashboard serving. The challenge is handling the cardinality explosion from metric labels (service, host, region, endpoint) while maintaining query performance and cost efficiency.
Key Requirements
Functional
- Metric ingestion -- accept counter, gauge, and histogram metrics from services via push (agents) or pull (scraping) with configurable intervals
- Dashboard visualization -- operators build custom dashboards with charts, graphs, and heatmaps showing metrics across configurable time windows and label filters
- Alert rule management -- operators define alert rules with conditions (threshold, rate of change, absence), evaluation windows, and notification channels
- Metric exploration -- operators query and explore metrics ad-hoc using a query language (e.g., "average CPU by service where region=us-east over last 2 hours")
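The three metric types and label-based series identity above can be sketched as a minimal data model. All names here are illustrative assumptions, not part of the requirements:

```python
from dataclasses import dataclass
from enum import Enum

class MetricType(Enum):
    COUNTER = "counter"      # monotonically increasing (e.g., request count)
    GAUGE = "gauge"          # point-in-time value (e.g., memory usage)
    HISTOGRAM = "histogram"  # bucketed distribution (e.g., request latency)

@dataclass(frozen=True)
class Sample:
    name: str          # e.g., "cpu.utilization"
    type: MetricType
    labels: tuple      # sorted (key, value) pairs -- the identity of the series
    timestamp: int     # epoch seconds
    value: float

def series_id(name: str, labels: dict) -> str:
    """A unique time series is the metric name plus its full label set."""
    body = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{name}{{{body}}}"

print(series_id("http.errors", {"service": "checkout", "region": "us-east"}))
# -> http.errors{region=us-east,service=checkout}
```

Sorting the labels before building the identifier ensures the same series always maps to the same ID regardless of the order an agent sends labels in.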
Non-Functional
- Scalability -- ingest 5+ million unique metric time series with 10-second resolution, producing 500K+ data points per second
- Reliability -- no metric data loss during ingestion; alerting pipeline maintains 99.9% uptime with redundant evaluation
- Latency -- recent metric queries (last 1 hour) return within 1 second; alert evaluation delay under 30 seconds
- Consistency -- eventual consistency for dashboards; alerts evaluated on freshest available data with bounded staleness
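The scalability numbers above are internally consistent: 5M series scraped every 10 seconds yields 500K points/sec. A quick back-of-envelope check, with an assumed 16 bytes per raw sample and a Gorilla-style compression ratio:

```python
series = 5_000_000
interval_s = 10
points_per_sec = series / interval_s
print(points_per_sec)  # 500000.0

# Storage: assume ~16 bytes raw per point (8B timestamp + 8B value) and
# ~1.5 bytes per point after Gorilla-style encoding (both are assumptions).
raw_per_day = points_per_sec * 86_400 * 16
compressed_per_day = points_per_sec * 86_400 * 1.5
print(f"raw/day: {raw_per_day/1e12:.1f} TB, compressed/day: {compressed_per_day/1e9:.0f} GB")
# -> raw/day: 0.7 TB, compressed/day: 65 GB
```

Numbers like these are worth stating in the interview: they justify both the compression investment and the multi-resolution retention tiers discussed later.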
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Cardinality Metric Handling
The combination of metric name and label dimensions (service, host, region, method, status_code) creates millions of unique time series. Interviewers probe whether you understand the cardinality challenge and its impact on storage and query performance.
Hints to consider:
- Impose cardinality limits per metric namespace to prevent label explosions from crashing the system
- Use a metric registry that tracks active time series and drops or aggregates series exceeding limits
- Store label indexes separately from time-series data to enable efficient label-based filtering without scanning raw data
- Pre-aggregate across high-cardinality dimensions (e.g., per-host to per-service) at ingestion time for common dashboard queries
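The registry-with-limits idea in the hints can be sketched as follows. The class name, cap, and drop policy (reject new series, keep existing ones flowing) are illustrative assumptions:

```python
from collections import defaultdict

class SeriesRegistry:
    """Tracks active series per metric and enforces a cardinality cap.
    Sketch only: a real registry would also expire idle series and
    export the drop counts as metrics themselves."""

    def __init__(self, max_series_per_metric: int = 10_000):
        self.max_series = max_series_per_metric
        self.active = defaultdict(set)   # metric name -> set of label sets
        self.dropped = defaultdict(int)  # metric name -> rejected-series count

    def admit(self, metric: str, labels: dict) -> bool:
        key = frozenset(labels.items())
        known = self.active[metric]
        if key in known:
            return True                  # existing series always pass
        if len(known) >= self.max_series:
            self.dropped[metric] += 1    # new series rejected, not silently lost
            return False
        known.add(key)
        return True

reg = SeriesRegistry(max_series_per_metric=2)
print(reg.admit("cpu", {"host": "a"}))  # True
print(reg.admit("cpu", {"host": "b"}))  # True
print(reg.admit("cpu", {"host": "c"}))  # False -- over the cap
print(reg.admit("cpu", {"host": "a"}))  # True  -- existing series unaffected
```

Rejecting only *new* series is the key design choice: a buggy deployment that emits a unique label per request degrades into dropped new series and a visible counter, rather than taking down storage for everyone.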
2. Time-Series Storage Architecture
Metrics at 10-second resolution with millions of series create enormous storage volume. Interviewers look for tiered storage with efficient encoding.
Hints to consider:
- Use a time-series-optimized storage engine with column-oriented layout for efficient compression and range scans
- Apply delta-of-delta encoding for timestamps and XOR encoding for values (Gorilla compression), typically achieving 10x or better compression on regularly scraped metrics
- Implement multi-resolution storage: raw data (10s) for 2 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year
- Partition by time and metric namespace to enable parallel queries and efficient data lifecycle management
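The delta-of-delta idea is worth being able to demonstrate: with a fixed scrape interval, nearly every encoded timestamp becomes zero, which a bit-level encoder (as in Gorilla) stores in a single bit. This sketch shows the transform only, not the bit packing:

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Encode timestamps as the change in consecutive deltas.
    Regular intervals produce mostly zeros, which compress extremely well."""
    out = [timestamps[0]]                 # first timestamp stored raw
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)    # ~0 for a steady scrape interval
        prev_delta = delta
    return out

# A 10-second scrape with one second of jitter in the middle:
ts = [1000, 1010, 1020, 1031, 1041]
print(delta_of_delta(ts))  # [1000, 10, 0, 1, -1]
```

Only the jittered samples produce nonzero values, and even those are tiny integers needing few bits. The same intuition motivates XOR encoding for values: consecutive readings of a slowly changing gauge share most of their bits.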
3. Alert Evaluation Engine
Evaluating thousands of alert rules against millions of time series continuously is computationally expensive. Interviewers want to see an efficient evaluation architecture.
Hints to consider:
- Group alert rules by the metrics they reference and evaluate in batches: fetch data once, evaluate all rules against it
- Implement multi-stage evaluation: fast filters (fixed thresholds) discard most data, complex evaluations (rate of change, percentiles) run only on candidates
- Support evaluation at different frequencies based on alert priority: critical alerts every 10 seconds, warnings every minute
- Handle rule evaluation failures gracefully: retry with backoff, never silently drop alerts, and track evaluation success rate
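The first two hints (batch rules by metric, cheap filters before expensive checks) can be combined into one evaluation loop. The rule shape and `fetch_window` callback are illustrative assumptions:

```python
from collections import defaultdict

# Illustrative rule shape: each rule names the metric it reads, a cheap
# threshold pre-filter, and an optional expensive check over the window.
rules = [
    {"id": "cpu-high",  "metric": "cpu", "threshold": 0.9, "check": None},
    {"id": "cpu-spike", "metric": "cpu", "threshold": 0.5,
     "check": lambda window: window[-1] - window[0] > 0.3},  # rate of change
]

def evaluate(rules, fetch_window):
    """Group rules by metric so each metric's window is fetched once,
    then run cheap threshold filters before any expensive checks."""
    by_metric = defaultdict(list)
    for r in rules:
        by_metric[r["metric"]].append(r)

    firing = []
    for metric, metric_rules in by_metric.items():
        window = fetch_window(metric)      # one fetch per metric, not per rule
        latest = window[-1]
        for r in metric_rules:
            if latest < r["threshold"]:    # stage 1: cheap threshold filter
                continue
            if r["check"] is None or r["check"](window):  # stage 2: expensive
                firing.append(r["id"])
    return firing

print(evaluate(rules, lambda m: [0.55, 0.70, 0.95]))
# -> ['cpu-high', 'cpu-spike']
```

With thousands of rules referencing a much smaller set of metrics, the one-fetch-per-metric grouping is usually the larger win; the staged evaluation then keeps the per-rule cost near zero for healthy series.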
4. Query Engine and Dashboard Performance
Ad-hoc metric queries require a flexible query language and efficient execution. Interviewers probe your query planning and optimization strategies.
Hints to consider:
- Implement a query planner that selects the appropriate storage resolution based on the requested time range
- Push label filters into the storage layer to reduce data scanned (predicate pushdown)
- Cache frequently accessed dashboard queries in Redis with TTLs that match the metric resolution (e.g., 10-second TTL for live dashboards)
- Support query fan-out across multiple storage partitions with parallel execution and result merging
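The resolution-selecting planner from the first hint can be sketched against the retention tiers described earlier (raw 10s for 2 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year). The per-query point budget is an assumption, loosely sized to dashboard chart widths:

```python
from datetime import timedelta

# Retention tiers: (how far back this tier reaches, resolution in seconds).
TIERS = [
    (timedelta(days=2),   10),
    (timedelta(days=30),  60),
    (timedelta(days=365), 300),
]

def pick_resolution(time_range: timedelta, max_points: int = 11_000) -> int:
    """Choose the finest resolution whose retention covers the requested
    range and whose result stays under the point budget (assumption)."""
    for retention, resolution_s in TIERS:
        if time_range <= retention and \
                time_range.total_seconds() / resolution_s <= max_points:
            return resolution_s
    return TIERS[-1][1]  # fall back to the coarsest tier

print(pick_resolution(timedelta(hours=1)))   # 10  -- raw data, 360 points
print(pick_resolution(timedelta(days=7)))    # 60  -- 1-minute rollups
print(pick_resolution(timedelta(days=365)))  # 300 -- 5-minute rollups
```

The same routing decision tells the fan-out layer which partitions to hit, since each resolution tier lives in its own time-partitioned storage.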