Design a Health Monitoring System for Infrastructure
System Design · Must
Problem Statement
Design a system that monitors the health of machines running various applications by tracking key metrics like CPU utilization, memory usage, disk I/O, and network throughput. The system ingests millions of metric data points per second from agents running on each machine, evaluates alerting rules in near real-time, and notifies on-call engineers when thresholds are breached.
Interviewers ask this to test your ability to build a high-throughput data ingestion pipeline, design efficient time-series storage, implement streaming rule evaluation, and build reliable alerting workflows with deduplication and suppression. Expect to discuss how you handle metric cardinality, alert storms during correlated failures, and the tradeoff between evaluation latency and system cost.
Key Requirements
Functional
- Metric ingestion -- collect CPU, memory, disk, and network metrics from agents on every monitored machine at configurable intervals (e.g., every 10 seconds)
- Alerting rules -- define threshold-based and anomaly-based alerting rules on individual metrics or aggregates, with configurable severity levels
- Alert routing -- route triggered alerts to the correct on-call team via PagerDuty, email, or Slack, with escalation policies for unacknowledged alerts
- Dashboards and queries -- query historical metrics for troubleshooting and display real-time dashboards showing fleet health at multiple aggregation levels
Non-Functional
- Scalability -- support millions of monitored machines producing millions of metric data points per second across the fleet
- Reliability -- no missed alerts for critical threshold breaches; tolerate partial infrastructure failures without losing metric data
- Latency -- alerts fire within 60 seconds of a threshold breach; dashboard queries return within 2 seconds for common time ranges
- Consistency -- at-least-once delivery of alerts with deduplication; eventual consistency for dashboard aggregations
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Metric Ingestion
Millions of machines each emit multiple metrics every few seconds, creating a firehose of data that must be reliably ingested and stored.
Hints to consider:
- Use Kafka as the ingestion buffer between metric agents and storage, partitioned by machine ID or metric type for parallelism
- Deploy a stateless ingestion gateway that validates, normalizes, and batches incoming metrics before writing to Kafka
- Use a time-series database (InfluxDB, Cassandra with time-series schema, or a custom solution) optimized for append-heavy writes and time-range queries
- Implement tiered storage with recent data at full resolution and older data downsampled to reduce storage costs
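The gateway-plus-Kafka hints above can be sketched as a small batcher. This is a minimal illustration, not a production client: `IngestionGateway`, `MetricPoint`, the partition count, and the batch size are all assumptions, and the `flushed` list stands in for an actual Kafka produce call. The key idea is that hashing on machine ID keeps every metric from one machine on one partition, which is what later makes per-machine rule evaluation parallelizable.

```python
import hashlib
from dataclasses import dataclass

NUM_PARTITIONS = 64  # assumed Kafka partition count for this sketch


@dataclass
class MetricPoint:
    machine_id: str
    name: str      # e.g. "cpu.utilization"
    value: float
    ts: float      # epoch seconds


def partition_for(machine_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash so all metrics from one machine land on one partition."""
    digest = hashlib.md5(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class IngestionGateway:
    """Stateless gateway: validate, normalize, and batch points per partition."""

    def __init__(self, batch_size: int = 500):
        self.batch_size = batch_size
        self.batches: dict[int, list[MetricPoint]] = {}
        self.flushed: list[tuple[int, list[MetricPoint]]] = []  # stand-in for Kafka produce

    def ingest(self, point: MetricPoint) -> bool:
        # Validate: reject NaN values and non-positive timestamps.
        if point.value != point.value or point.ts <= 0:
            return False
        point.name = point.name.lower().strip()  # normalize metric names
        part = partition_for(point.machine_id)
        batch = self.batches.setdefault(part, [])
        batch.append(point)
        if len(batch) >= self.batch_size:
            self.flushed.append((part, self.batches.pop(part)))
        return True
```

Because the gateway holds no durable state, any instance can serve any agent, so it scales horizontally behind a load balancer; Kafka provides the durability.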
2. Streaming Alert Evaluation
Alerting rules must be evaluated against the metric stream in near real-time without scanning stored data, which would be too slow.
Hints to consider:
- Use a stream processing engine (Flink or Kafka Streams) to evaluate alerting rules against the live metric stream with windowed aggregations
- Partition rule evaluation by metric key so rules for different machines are evaluated independently and in parallel
- Support both simple threshold rules (CPU > 90% for 5 minutes) and composite rules (CPU > 80% AND memory > 90%) using stateful stream operators
- Handle late-arriving metrics with watermarks and configurable grace periods to avoid false negatives from delayed data
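A "CPU > 90% for 5 minutes" rule reduces to a small piece of per-key state, mirroring what a keyed operator in Flink or Kafka Streams would hold. The sketch below is a simplification under assumed names (`ThresholdRule`, `on_point`): it tracks only the start of the current breach per metric key and ignores watermarks and late data, which a real stream job would layer on top.

```python
class ThresholdRule:
    """Fires when a metric stays above `threshold` for at least `duration_s`.

    State is kept per metric key, mirroring a keyed stateful stream operator:
    each key only remembers the timestamp of its first breaching point.
    """

    def __init__(self, threshold: float, duration_s: float):
        self.threshold = threshold
        self.duration_s = duration_s
        self.breach_start: dict[str, float] = {}  # key -> ts of first breaching point

    def on_point(self, key: str, ts: float, value: float) -> bool:
        if value <= self.threshold:
            self.breach_start.pop(key, None)  # breach ended; reset state
            return False
        start = self.breach_start.setdefault(key, ts)
        return ts - start >= self.duration_s  # sustained long enough -> fire
```

Partitioning the metric stream by key means each operator instance evaluates a disjoint set of machines independently, so rule evaluation scales with partition count.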
3. Alert Deduplication, Suppression, and Routing
A correlated failure (e.g., a network switch failure) can trigger thousands of alerts simultaneously, overwhelming on-call teams.
Hints to consider:
- Deduplicate alerts by grouping on (rule_id, affected_entity) so repeated threshold breaches do not generate redundant notifications
- Implement alert suppression that collapses correlated alerts (e.g., all machines on the same rack) into a single grouped notification
- Route alerts through an escalation pipeline: notify the primary on-call, escalate to secondary after a timeout, then to the team lead
- Track alert lifecycle states (firing, acknowledged, resolved) with timestamps to support audit trails and post-incident review
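The dedup-plus-lifecycle hints can be combined into one small state machine. This is an illustrative sketch with assumed names (`AlertManager`, `pages_sent` standing in for a PagerDuty/Slack call); escalation timers and rack-level grouping are omitted. The invariant it demonstrates: while an alert for a given (rule_id, entity) pair is open, whether firing or acknowledged, repeated breaches produce no new notification.

```python
import enum


class AlertState(enum.Enum):
    FIRING = "firing"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


class AlertManager:
    """Deduplicates on (rule_id, entity) and records lifecycle timestamps
    for audit trails and post-incident review."""

    def __init__(self):
        self.alerts: dict[tuple[str, str], dict] = {}
        self.pages_sent: list[tuple[str, str]] = []  # stand-in for notification calls

    def fire(self, rule_id: str, entity: str, ts: float) -> bool:
        key = (rule_id, entity)
        existing = self.alerts.get(key)
        if existing and existing["state"] is not AlertState.RESOLVED:
            return False  # already open: suppress the duplicate notification
        self.alerts[key] = {"state": AlertState.FIRING, "fired_at": ts}
        self.pages_sent.append(key)
        return True

    def acknowledge(self, rule_id: str, entity: str, ts: float) -> None:
        alert = self.alerts[(rule_id, entity)]
        alert["state"] = AlertState.ACKNOWLEDGED
        alert["acked_at"] = ts

    def resolve(self, rule_id: str, entity: str, ts: float) -> None:
        alert = self.alerts[(rule_id, entity)]
        alert["state"] = AlertState.RESOLVED
        alert["resolved_at"] = ts
```

Suppression of correlated alerts would sit one layer above this, grouping open alerts that share a topology attribute (same rack, same switch) into a single notification.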
4. Metric Cardinality and Query Performance
High-cardinality labels (per-container, per-request-ID) can explode the metric space, degrading query performance and increasing costs.
Hints to consider:
- Enforce cardinality limits on metric labels at ingestion time, rejecting or aggregating metrics that exceed thresholds
- Pre-compute common aggregations (fleet-wide averages, per-region rollups) and store them as separate time series for fast dashboard queries
- Use columnar storage formats or inverted indexes on labels to support efficient filtering and grouping in ad-hoc queries
- Implement query caching for popular dashboard queries with short TTLs, invalidated by new data arrival
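One way to enforce a cardinality limit at ingestion, sketched under assumed names (`CardinalityLimiter`, the `__overflow__` sentinel, the per-metric budget): rather than rejecting points outright, series beyond the budget are collapsed into a single overflow series per metric, so the data is still counted in aggregates without creating unbounded new time series.

```python
class CardinalityLimiter:
    """Caps the number of unique label sets per metric at ingestion time.

    Points for series beyond the budget keep their metric name but have their
    labels collapsed into a sentinel value, bounding total series count.
    """

    def __init__(self, max_series: int = 1000):
        self.max_series = max_series
        self.seen: dict[str, set[frozenset]] = {}  # metric -> known label sets

    def admit(self, metric: str, labels: dict[str, str]) -> dict[str, str]:
        key = frozenset(labels.items())
        series = self.seen.setdefault(metric, set())
        if key in series or len(series) < self.max_series:
            series.add(key)
            return labels
        # Over budget: collapse into a per-metric overflow bucket.
        return {k: "__overflow__" for k in labels}
```

A production limiter would also need to be distributed-safe (the `seen` map shared or sharded across gateways) and to age out stale series; this sketch shows only the admission decision.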
Suggested Approach
Step 1: Clarify Requirements
Confirm scope with the interviewer. Ask about the number of monitored machines, metric emission frequency, and the expected metric cardinality (how many unique time series). Clarify whether anomaly detection (ML-based) is in scope or only threshold-based rules. Establish alerting SLA: how quickly must alerts fire after a threshold is breached? Determine whether the system must support custom metrics from applications in addition to infrastructure metrics.
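Once the numbers are confirmed, a quick back-of-envelope estimate grounds the rest of the design. The figures below are illustrative assumptions (1M machines, 20 metrics each, 10-second interval, ~50 bytes per encoded point), not values given in the problem statement:

```python
# Illustrative capacity estimate; all inputs are assumptions for the sketch.
machines = 1_000_000
metrics_per_machine = 20
interval_s = 10
bytes_per_point = 50  # name + labels + value + timestamp, loosely encoded

points_per_sec = machines * metrics_per_machine // interval_s        # 2,000,000/s
ingest_mb_per_sec = points_per_sec * bytes_per_point / 1e6           # ~100 MB/s
raw_tb_per_day = points_per_sec * bytes_per_point * 86_400 / 1e12    # ~8.6 TB/day
```

Numbers at this scale justify the earlier hints: a Kafka buffer to absorb ~2M points/s, and tiered storage with downsampling, since retaining full-resolution data for months would cost petabytes.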