Design a Server Health Monitoring System
System Design · Must
Problem Statement
Design a distributed health monitoring system that can track the status and performance metrics of millions of servers across a cloud infrastructure. The system must collect CPU usage, memory utilization, disk I/O, network throughput, and custom application metrics from every server, aggregate them into dashboards with configurable time windows, and trigger alerts when metrics exceed defined thresholds or exhibit anomalous patterns.
At Amazon, interviewers ask this to evaluate your ability to build write-heavy telemetry pipelines, time-series storage at scale, real-time alerting with low latency, and efficient dashboard serving. The challenge lies in ingesting millions of metric data points per second while maintaining sub-minute freshness for dashboards and seconds-level responsiveness for critical alerts.
Key Requirements
Functional
- Metric collection -- agents on each server report CPU, memory, disk, network, and custom metrics every 10-30 seconds
- Dashboard visualization -- operators view real-time and historical metric charts with drill-down by server, cluster, service, and time range
- Threshold-based alerting -- operators define alert rules (e.g., CPU > 90% for 5 minutes) that trigger notifications via email, PagerDuty, or Slack
- Anomaly detection -- the system identifies unusual patterns (sudden traffic drops, memory leaks) beyond simple threshold breaches
Non-Functional
- Scalability -- handle 1 million+ servers each reporting 50+ metrics every 15 seconds, producing 3+ million metric writes per second
- Reliability -- no metric data loss during ingestion spikes; alerting pipeline maintains 99.9% uptime with redundant evaluation
- Latency -- dashboard queries return within 2 seconds for recent data (last hour) and 5 seconds for historical ranges (last 30 days)
- Consistency -- eventual consistency acceptable for dashboards; alerts must fire within 30 seconds of threshold breach
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Metric Ingestion at Scale
Collecting millions of metric data points per second from heterogeneous sources creates massive write pressure. Interviewers want to see a pipeline that absorbs bursts, handles agent retries, and feeds multiple consumers.
Hints to consider:
- Deploy collection agents that batch and compress metrics before sending, reducing network overhead
- Use Kafka as a durable buffer between agents and consumers, partitioned by server_id or metric namespace
- Support multiple metric formats and normalize them at the ingestion layer
- Implement backpressure mechanisms so agents slow down rather than crash when the pipeline is congested
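The batching and backpressure hints above can be sketched as a small agent-side sender. This is a minimal illustration, not a production agent: the `send_batch` callback and class name are hypothetical, standing in for whatever transport (e.g., a Kafka producer) actually ships the batch.

```python
import gzip
import json
import time

class MetricAgent:
    """Hypothetical agent sketch: buffers metrics, compresses batches,
    and backs off (rather than crashing or dropping data) when the
    downstream pipeline signals congestion."""

    def __init__(self, batch_size=100, base_interval=15.0):
        self.batch_size = batch_size
        self.base_interval = base_interval  # seconds between flushes
        self.backoff = 1.0                  # multiplier grown on congestion
        self.buffer = []

    def record(self, name, value, ts=None):
        self.buffer.append({"metric": name, "value": value,
                            "ts": ts or time.time()})

    def flush(self, send_batch):
        """Compress the buffered batch and hand it to `send_batch`
        (assumed to return False when the pipeline is congested).
        On rejection the batch is retained and the flush interval
        doubles, implementing agent-side backpressure."""
        if not self.buffer:
            return True
        payload = gzip.compress(json.dumps(self.buffer).encode())
        accepted = send_batch(payload)
        if accepted:
            self.buffer.clear()
            self.backoff = 1.0
        else:
            self.backoff = min(self.backoff * 2, 16.0)  # cap the slowdown
        return accepted

    def next_flush_delay(self):
        return self.base_interval * self.backoff
```

Because rejected batches stay buffered and the interval stretches, a congested pipeline sees fewer, larger batches instead of a retry storm.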
2. Time-Series Storage Design
With billions of data points per day, storage architecture determines query performance and cost. Interviewers probe your approach to partitioning, retention, and query optimization.
Hints to consider:
- Use a time-series database optimized for append-heavy workloads with time-range scans (TimescaleDB, InfluxDB, or custom on Cassandra)
- Partition data by (metric_namespace, time_bucket) to enable efficient time-range queries and parallel writes
- Pre-compute rollups at different granularities: raw data (10-second resolution) for 7 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year
- Apply compression (delta encoding, Gorilla-style XOR compression) to reduce storage costs for numeric time-series
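To make the compression hint concrete, here is a minimal delta-encoding round trip (a simplified sketch of the idea, not the Gorilla bit-packing itself): regularly sampled gauges produce small successive differences that pack far better than raw 64-bit values.

```python
def delta_encode(values):
    """Store the first value followed by successive differences.
    For a CPU gauge sampled every 15 seconds, deltas cluster near
    zero and compress well with a downstream entropy coder."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    acc = 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        out.append(acc)
    return out
```

Real time-series stores typically apply the same trick twice (delta-of-delta for timestamps) and then bit-pack the residuals.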
3. Alerting Pipeline Architecture
Alert rules must be evaluated continuously against streaming data without introducing processing delays. Interviewers look for a clean separation between the alerting path and dashboard serving.
Hints to consider:
- Build a dedicated alerting stream processor that evaluates rules against incoming metrics in near real-time
- Maintain per-rule state with sliding windows to detect sustained threshold breaches (e.g., CPU > 90% for 5 consecutive minutes)
- Implement alert deduplication and rate limiting to prevent notification storms during widespread incidents
- Support alert routing with escalation policies: initial page to on-call, escalate if unacknowledged within 10 minutes
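The per-rule state for sustained-breach detection can be sketched as follows. This is a hedged illustration (the class and field names are invented for this example): it tracks when a breach window opened and fires only once the breach has persisted for the configured duration.

```python
class ThresholdRule:
    """Hypothetical sketch of per-rule sliding-window state: fires only
    when a metric stays above `threshold` for `duration` seconds --
    e.g., CPU > 90 for 300 seconds. Any sample at or below the
    threshold resets the breach window."""

    def __init__(self, threshold, duration):
        self.threshold = threshold
        self.duration = duration
        self.breach_start = None  # timestamp when the current breach began

    def evaluate(self, ts, value):
        """Feed one (timestamp, value) sample; return True when the
        sustained-breach condition is met."""
        if value > self.threshold:
            if self.breach_start is None:
                self.breach_start = ts
            return ts - self.breach_start >= self.duration
        self.breach_start = None  # breach broken; restart the window
        return False
```

In a real stream processor this state lives keyed by (rule_id, server_id), and a deduplication layer downstream suppresses repeat firings during an ongoing incident.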
4. Dashboard Query Performance
Operators expect fast, interactive dashboards that aggregate millions of data points into charts. Interviewers want to see how you serve these queries without scanning raw data.
Hints to consider:
- Serve recent data from the high-resolution store and historical data from pre-computed rollups automatically based on time range
- Cache frequently accessed dashboard queries in Redis with short TTLs, invalidated by new data arrival
- Use materialized views for common aggregations (per-service average CPU, per-cluster memory utilization)
- Implement query pushdown: if the user zooms out to a weekly view, serve from 5-minute rollups rather than raw data
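The query-pushdown hint can be sketched as a tier-selection function over the retention scheme from the storage section (raw 10-second data for 7 days, 1-minute rollups for 30 days, 5-minute rollups for 1 year). The tier names and the point budget are illustrative assumptions, not a fixed API.

```python
# Retention tiers mirroring the storage section; names are hypothetical.
TIERS = [
    (7 * 86400, "raw_10s"),       # retention seconds, tier name
    (30 * 86400, "rollup_1m"),
    (365 * 86400, "rollup_5m"),
]

RESOLUTION = {"raw_10s": 10, "rollup_1m": 60, "rollup_5m": 300}

MAX_POINTS = 2500  # roughly one point per pixel on a wide chart

def choose_tier(range_seconds, age_seconds):
    """Pick the finest tier that still retains data as old as
    `age_seconds` AND keeps the result under MAX_POINTS, so a weekly
    zoom-out is served from 5-minute rollups rather than raw samples."""
    for retention, tier in TIERS:
        if age_seconds <= retention and \
                range_seconds / RESOLUTION[tier] <= MAX_POINTS:
            return tier
    return TIERS[-1][1]  # fall back to the coarsest tier
```

With this budget, a last-hour chart is served from raw 10-second data (360 points), while a one-week chart drops to 5-minute rollups (~2,016 points) without the caller having to know about tiers.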