Design a Health Monitoring System for Infrastructure
System Design · Must
Problem Statement
Design a system that monitors the health of machines running various applications by tracking key metrics like CPU utilization, memory usage, disk I/O, and network throughput. The system ingests millions of metric data points per second from agents running on each machine, evaluates alerting rules in near real-time, and notifies on-call engineers when thresholds are breached.
Interviewers ask this to test your ability to build a high-throughput data ingestion pipeline, design efficient time-series storage, implement streaming rule evaluation, and build reliable alerting workflows with deduplication and suppression. Expect to discuss how you handle metric cardinality, alert storms during correlated failures, and the tradeoff between evaluation latency and system cost.
Key Requirements
Functional
- Metric ingestion -- collect CPU, memory, disk, and network metrics from agents on every monitored machine at configurable intervals (e.g., every 10 seconds)
- Alerting rules -- define threshold-based and anomaly-based alerting rules on individual metrics or aggregates, with configurable severity levels
- Alert routing -- route triggered alerts to the correct on-call team via PagerDuty, email, or Slack, with escalation policies for unacknowledged alerts
- Dashboards and queries -- query historical metrics for troubleshooting and display real-time dashboards showing fleet health at multiple aggregation levels
Non-Functional
- Scalability -- support millions of monitored machines producing millions of metric data points per second across the fleet
- Reliability -- no missed alerts for critical threshold breaches; tolerate partial infrastructure failures without losing metric data
- Latency -- alerts fire within 60 seconds of a threshold breach; dashboard queries return within 2 seconds for common time ranges
- Consistency -- at-least-once delivery of alerts with deduplication; eventual consistency for dashboard aggregations
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Metric Ingestion
Millions of machines each emit multiple metrics every few seconds, creating a firehose of data that must be reliably ingested and stored.
Hints to consider:
- Use Kafka as the ingestion buffer between metric agents and storage, partitioned by machine ID or metric type for parallelism
- Deploy a stateless ingestion gateway that validates, normalizes, and batches incoming metrics before writing to Kafka
- Use a time-series database (InfluxDB, Cassandra with time-series schema, or a custom solution) optimized for append-heavy writes and time-range queries
- Implement tiered storage with recent data at full resolution and older data downsampled to reduce storage costs
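The gateway-plus-Kafka hints above can be sketched as a small batcher. This is a minimal illustration, not a production client: `IngestionGateway`, `MetricPoint`, the partition count, and the batch size are all assumptions, and the `flushed` list stands in for an actual Kafka produce call. The key idea is that hashing on machine ID keeps every metric from one machine on one partition, which is what later makes per-machine rule evaluation parallelizable.

```python
import hashlib
from dataclasses import dataclass

NUM_PARTITIONS = 64  # assumed Kafka partition count for this sketch


@dataclass
class MetricPoint:
    machine_id: str
    name: str      # e.g. "cpu.utilization"
    value: float
    ts: float      # epoch seconds


def partition_for(machine_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash so all metrics from one machine land on one partition."""
    digest = hashlib.md5(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class IngestionGateway:
    """Stateless gateway: validate, normalize, and batch points per partition."""

    def __init__(self, batch_size: int = 500):
        self.batch_size = batch_size
        self.batches: dict[int, list[MetricPoint]] = {}
        self.flushed: list[tuple[int, list[MetricPoint]]] = []  # stand-in for Kafka produce

    def ingest(self, point: MetricPoint) -> bool:
        # Validate: reject NaN values and non-positive timestamps.
        if point.value != point.value or point.ts <= 0:
            return False
        point.name = point.name.lower().strip()  # normalize metric names
        part = partition_for(point.machine_id)
        batch = self.batches.setdefault(part, [])
        batch.append(point)
        if len(batch) >= self.batch_size:
            self.flushed.append((part, self.batches.pop(part)))
        return True
```

Because the gateway holds no durable state, any instance can serve any agent, so it scales horizontally behind a load balancer; Kafka provides the durability.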
2. Streaming Alert Evaluation
Alerting rules must be evaluated against the metric stream in near real-time without scanning stored data, which would be too slow.
Hints to consider:
- Use a stream processing engine (Flink or Kafka Streams) to evaluate alerting rules against the live metric stream with windowed aggregations
- Partition rule evaluation by metric key so rules for different machines are evaluated independently and in parallel
- Support both simple threshold rules (CPU > 90% for 5 minutes) and composite rules (CPU > 80% AND memory > 90%) using stateful stream operators
- Handle late-arriving metrics with watermarks and configurable grace periods to avoid false negatives from delayed data
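A "CPU > 90% for 5 minutes" rule reduces to a small piece of per-key state, mirroring what a keyed operator in Flink or Kafka Streams would hold. The sketch below is a simplification under assumed names (`ThresholdRule`, `on_point`): it tracks only the start of the current breach per metric key and ignores watermarks and late data, which a real stream job would layer on top.

```python
class ThresholdRule:
    """Fires when a metric stays above `threshold` for at least `duration_s`.

    State is kept per metric key, mirroring a keyed stateful stream operator:
    each key only remembers the timestamp of its first breaching point.
    """

    def __init__(self, threshold: float, duration_s: float):
        self.threshold = threshold
        self.duration_s = duration_s
        self.breach_start: dict[str, float] = {}  # key -> ts of first breaching point

    def on_point(self, key: str, ts: float, value: float) -> bool:
        if value <= self.threshold:
            self.breach_start.pop(key, None)  # breach ended; reset state
            return False
        start = self.breach_start.setdefault(key, ts)
        return ts - start >= self.duration_s  # sustained long enough -> fire
```

Partitioning the metric stream by key means each operator instance evaluates a disjoint set of machines independently, so rule evaluation scales with partition count.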
3. Alert Deduplication, Suppression, and Routing
A correlated failure (e.g., a network switch failure) can trigger thousands of alerts simultaneously, overwhelming on-call teams.
Hints to consider:
- Deduplicate alerts by grouping on (rule_id, affected_entity) so repeated threshold breaches do not generate redundant notifications
- Implement alert suppression that collapses correlated alerts (e.g., all machines on the same rack) into a single grouped notification
- Route alerts through an escalation pipeline: notify the primary on-call, escalate to secondary after a timeout, then to the team lead
- Track alert lifecycle states (firing, acknowledged, resolved) with timestamps to support audit trails and post-incident review
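The dedup-plus-lifecycle hints can be combined into one small state machine. This is an illustrative sketch with assumed names (`AlertManager`, `pages_sent` standing in for a PagerDuty/Slack call); escalation timers and rack-level grouping are omitted. The invariant it demonstrates: while an alert for a given (rule_id, entity) pair is open, whether firing or acknowledged, repeated breaches produce no new notification.

```python
import enum


class AlertState(enum.Enum):
    FIRING = "firing"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


class AlertManager:
    """Deduplicates on (rule_id, entity) and records lifecycle timestamps
    for audit trails and post-incident review."""

    def __init__(self):
        self.alerts: dict[tuple[str, str], dict] = {}
        self.pages_sent: list[tuple[str, str]] = []  # stand-in for notification calls

    def fire(self, rule_id: str, entity: str, ts: float) -> bool:
        key = (rule_id, entity)
        existing = self.alerts.get(key)
        if existing and existing["state"] is not AlertState.RESOLVED:
            return False  # already open: suppress the duplicate notification
        self.alerts[key] = {"state": AlertState.FIRING, "fired_at": ts}
        self.pages_sent.append(key)
        return True

    def acknowledge(self, rule_id: str, entity: str, ts: float) -> None:
        alert = self.alerts[(rule_id, entity)]
        alert["state"] = AlertState.ACKNOWLEDGED
        alert["acked_at"] = ts

    def resolve(self, rule_id: str, entity: str, ts: float) -> None:
        alert = self.alerts[(rule_id, entity)]
        alert["state"] = AlertState.RESOLVED
        alert["resolved_at"] = ts
```

Suppression of correlated alerts would sit one layer above this, grouping open alerts that share a topology attribute (same rack, same switch) into a single notification.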
4. Metric Cardinality and Query Performance
High-cardinality labels (per-container, per-request-ID) can explode the metric space, degrading query performance and increasing costs.
Hints to consider:
- Enforce cardinality limits on metric labels at ingestion time, rejecting or aggregating metrics that exceed thresholds
- Pre-compute common aggregations (fleet-wide averages, per-region rollups) and store them as separate time series for fast dashboard queries
- Use columnar storage formats or inverted indexes on labels to support efficient filtering and grouping in ad-hoc queries
- Implement query caching for popular dashboard queries with short TTLs, invalidated by new data arrival
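One way to enforce a cardinality limit at ingestion, sketched under assumed names (`CardinalityLimiter`, the `__overflow__` sentinel, the per-metric budget): rather than rejecting points outright, series beyond the budget are collapsed into a single overflow series per metric, so the data is still counted in aggregates without creating unbounded new time series.

```python
class CardinalityLimiter:
    """Caps the number of unique label sets per metric at ingestion time.

    Points for series beyond the budget keep their metric name but have their
    labels collapsed into a sentinel value, bounding total series count.
    """

    def __init__(self, max_series: int = 1000):
        self.max_series = max_series
        self.seen: dict[str, set[frozenset]] = {}  # metric -> known label sets

    def admit(self, metric: str, labels: dict[str, str]) -> dict[str, str]:
        key = frozenset(labels.items())
        series = self.seen.setdefault(metric, set())
        if key in series or len(series) < self.max_series:
            series.add(key)
            return labels
        # Over budget: collapse into a per-metric overflow bucket.
        return {k: "__overflow__" for k in labels}
```

A production limiter would also need to be distributed-safe (the `seen` map shared or sharded across gateways) and to age out stale series; this sketch shows only the admission decision.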
Suggested Approach
Step 1: Clarify Requirements
Confirm scope with the interviewer. Ask about the number of monitored machines, metric emission frequency, and the expected metric cardinality (how many unique time series). Clarify whether anomaly detection (ML-based) is in scope or only threshold-based rules. Establish alerting SLA: how quickly must alerts fire after a threshold is breached? Determine whether the system must support custom metrics from applications in addition to infrastructure metrics.
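Once the numbers are confirmed, a quick back-of-envelope estimate grounds the rest of the design. The figures below are illustrative assumptions (1M machines, 20 metrics each, 10-second interval, ~50 bytes per encoded point), not values given in the problem statement:

```python
# Illustrative capacity estimate; all inputs are assumptions for the sketch.
machines = 1_000_000
metrics_per_machine = 20
interval_s = 10
bytes_per_point = 50  # name + labels + value + timestamp, loosely encoded

points_per_sec = machines * metrics_per_machine // interval_s        # 2,000,000/s
ingest_mb_per_sec = points_per_sec * bytes_per_point / 1e6           # ~100 MB/s
raw_tb_per_day = points_per_sec * bytes_per_point * 86_400 / 1e12    # ~8.6 TB/day
```

Numbers at this scale justify the earlier hints: a Kafka buffer to absorb ~2M points/s, and tiered storage with downsampling, since retaining full-resolution data for months would cost petabytes.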