Design a Server Health Monitoring System
System Design · Must
Problem Statement
Design a distributed health monitoring system that can track the status and performance metrics of millions of servers across a cloud infrastructure. The system must collect CPU usage, memory utilization, disk I/O, network throughput, and custom application metrics from every server, aggregate them into dashboards with configurable time windows, and trigger alerts when metrics exceed defined thresholds or exhibit anomalous patterns.
At Amazon, interviewers ask this to evaluate your ability to build write-heavy telemetry pipelines, time-series storage at scale, real-time alerting with low latency, and efficient dashboard serving. The challenge lies in ingesting millions of metric data points per second while maintaining sub-minute freshness for dashboards and seconds-level responsiveness for critical alerts.
Key Requirements
Functional
- Metric collection -- agents on each server report CPU, memory, disk, network, and custom metrics every 10-30 seconds
- Dashboard visualization -- operators view real-time and historical metric charts with drill-down by server, cluster, service, and time range
- Threshold-based alerting -- operators define alert rules (e.g., CPU > 90% for 5 minutes) that trigger notifications via email, PagerDuty, or Slack
- Anomaly detection -- the system identifies unusual patterns (sudden traffic drops, memory leaks) beyond simple threshold breaches
Non-Functional
- Scalability -- handle 1 million+ servers each reporting 50+ metrics every 15 seconds, producing 3+ million metric writes per second
- Reliability -- no metric data loss during ingestion spikes; alerting pipeline maintains 99.9% uptime with redundant evaluation
- Latency -- dashboard queries return within 2 seconds for recent data (last hour) and 5 seconds for historical ranges (last 30 days)
- Consistency -- eventual consistency acceptable for dashboards; alerts must fire within 30 seconds of threshold breach
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Metric Ingestion at Scale
Collecting millions of metric data points per second from heterogeneous sources creates massive write pressure. Interviewers want to see a pipeline that absorbs bursts, handles agent retries, and feeds multiple consumers.
Hints to consider:
- Deploy collection agents that batch and compress metrics before sending, reducing network overhead
- Use Kafka as a durable buffer between agents and consumers, partitioned by server_id or metric namespace
- Support multiple metric formats and normalize them at the ingestion layer
- Implement backpressure mechanisms so agents slow down rather than crash when the pipeline is congested
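The batching and backpressure hints above can be sketched as a small agent-side sender. This is a minimal illustration, not a production agent: the `send_batch` callback and class name are hypothetical, standing in for whatever transport (e.g., a Kafka producer) actually ships the batch.

```python
import gzip
import json
import time

class MetricAgent:
    """Hypothetical agent sketch: buffers metrics, compresses batches,
    and backs off (rather than crashing or dropping data) when the
    downstream pipeline signals congestion."""

    def __init__(self, batch_size=100, base_interval=15.0):
        self.batch_size = batch_size
        self.base_interval = base_interval  # seconds between flushes
        self.backoff = 1.0                  # multiplier grown on congestion
        self.buffer = []

    def record(self, name, value, ts=None):
        self.buffer.append({"metric": name, "value": value,
                            "ts": ts or time.time()})

    def flush(self, send_batch):
        """Compress the buffered batch and hand it to `send_batch`
        (assumed to return False when the pipeline is congested).
        On rejection the batch is retained and the flush interval
        doubles, implementing agent-side backpressure."""
        if not self.buffer:
            return True
        payload = gzip.compress(json.dumps(self.buffer).encode())
        accepted = send_batch(payload)
        if accepted:
            self.buffer.clear()
            self.backoff = 1.0
        else:
            self.backoff = min(self.backoff * 2, 16.0)  # cap the slowdown
        return accepted

    def next_flush_delay(self):
        return self.base_interval * self.backoff
```

Because rejected batches stay buffered and the interval stretches, a congested pipeline sees fewer, larger batches instead of a retry storm.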
2. Time-Series Storage Design
With billions of data points per day, storage architecture determines query performance and cost. Interviewers probe your approach to partitioning, retention, and query optimization.
Hints to consider:
- Use a time-series database optimized for append-heavy workloads with time-range scans (TimescaleDB, InfluxDB, or custom on Cassandra)
- Partition data by (metric_namespace, time_bucket) to enable efficient time-range queries and parallel writes
- Pre-compute rollups at different granularities: raw data (10-second resolution) for 7 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year
- Apply compression (delta encoding, Gorilla-style XOR compression) to reduce storage costs for numeric time-series
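To make the compression hint concrete, here is a minimal delta-encoding round trip (a simplified sketch of the idea, not the Gorilla bit-packing itself): regularly sampled gauges produce small successive differences that pack far better than raw 64-bit values.

```python
def delta_encode(values):
    """Store the first value followed by successive differences.
    For a CPU gauge sampled every 15 seconds, deltas cluster near
    zero and compress well with a downstream entropy coder."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    acc = 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        out.append(acc)
    return out
```

Real time-series stores typically apply the same trick twice (delta-of-delta for timestamps) and then bit-pack the residuals.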
3. Alerting Pipeline Architecture
Alert rules must be evaluated continuously against streaming data without introducing processing delays. Interviewers look for a clean separation between the alerting path and dashboard serving.
Hints to consider:
- Build a dedicated alerting stream processor that evaluates rules against incoming metrics in near real-time
- Maintain per-rule state with sliding windows to detect sustained threshold breaches (e.g., CPU > 90% for 5 consecutive minutes)
- Implement alert deduplication and rate limiting to prevent notification storms during widespread incidents
- Support alert routing with escalation policies: initial page to on-call, escalate if unacknowledged within 10 minutes
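The per-rule state for sustained-breach detection can be sketched as follows. This is a hedged illustration (the class and field names are invented for this example): it tracks when a breach window opened and fires only once the breach has persisted for the configured duration.

```python
class ThresholdRule:
    """Hypothetical sketch of per-rule sliding-window state: fires only
    when a metric stays above `threshold` for `duration` seconds --
    e.g., CPU > 90 for 300 seconds. Any sample at or below the
    threshold resets the breach window."""

    def __init__(self, threshold, duration):
        self.threshold = threshold
        self.duration = duration
        self.breach_start = None  # timestamp when the current breach began

    def evaluate(self, ts, value):
        """Feed one (timestamp, value) sample; return True when the
        sustained-breach condition is met."""
        if value > self.threshold:
            if self.breach_start is None:
                self.breach_start = ts
            return ts - self.breach_start >= self.duration
        self.breach_start = None  # breach broken; restart the window
        return False
```

In a real stream processor this state lives keyed by (rule_id, server_id), and a deduplication layer downstream suppresses repeat firings during an ongoing incident.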
4. Dashboard Query Performance
Operators expect fast, interactive dashboards that aggregate millions of data points into charts. Interviewers want to see how you serve these queries without scanning raw data.
Hints to consider:
- Serve recent data from the high-resolution store and historical data from pre-computed rollups automatically based on time range
- Cache frequently accessed dashboard queries in Redis with short TTLs, invalidated by new data arrival
- Use materialized views for common aggregations (per-service average CPU, per-cluster memory utilization)
- Implement query pushdown: if the user zooms out to a weekly view, serve from 5-minute rollups rather than raw data
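The query-pushdown hint can be sketched as a tier-selection function over the retention scheme from the storage section (raw 10-second data for 7 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year). The tier names and the point budget are illustrative assumptions, not a fixed API.

```python
# Retention tiers mirroring the storage section; names are hypothetical.
TIERS = [
    (7 * 86400, "raw_10s"),       # retention seconds, tier name
    (30 * 86400, "rollup_1m"),
    (365 * 86400, "rollup_5m"),
]

RESOLUTION = {"raw_10s": 10, "rollup_1m": 60, "rollup_5m": 300}

MAX_POINTS = 2500  # roughly one point per pixel on a wide chart

def choose_tier(range_seconds, age_seconds):
    """Pick the finest tier that still retains data as old as
    `age_seconds` AND keeps the result under MAX_POINTS, so a weekly
    zoom-out is served from 5-minute rollups rather than raw samples."""
    for retention, tier in TIERS:
        if age_seconds <= retention and \
                range_seconds / RESOLUTION[tier] <= MAX_POINTS:
            return tier
    return TIERS[-1][1]  # fall back to the coarsest tier
```

With this budget, a last-hour chart is served from raw 10-second data (360 points), while a one-week chart drops to 5-minute rollups (~2,016 points) without the caller having to know about tiers.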