Design a Server Health Monitoring System
Problem Statement
Design an observability platform that monitors the health of millions of services running across a large-scale distributed infrastructure. Each service emits periodic heartbeats and metrics (CPU usage, memory consumption, request latency, error rates), and the platform must ingest this telemetry, evaluate alerting rules in near real-time, and provide operators with dashboards for investigation and trend analysis.
The primary challenge is the sheer volume and velocity of telemetry data. With millions of services each reporting dozens of metrics every 10-15 seconds, the system must handle tens of millions of data points per second. Each metric carries labels (service name, region, instance ID, version) that create a high-cardinality space, making naive indexing approaches prohibitively expensive in both storage and query cost.
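The throughput claim is worth sanity-checking with back-of-envelope arithmetic. The fleet size, metrics-per-service count, and report interval below are illustrative assumptions chosen within the ranges the problem statement gives, not exact figures from it:

```python
# Back-of-envelope ingest estimate (all inputs are assumed, not specified).
services = 5_000_000          # assumed fleet size ("millions of services")
metrics_per_service = 30      # "dozens" of metrics each
interval_s = 10               # reporting every 10-15 seconds (lower bound)

points_per_second = services * metrics_per_service / interval_s
print(f"{points_per_second:,.0f} data points/sec")  # 15,000,000
```

Even with conservative inputs the pipeline lands in the tens of millions of points per second, which is why every downstream component must scale horizontally.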
The platform must also separate hot-path processing (real-time alerting on the latest data) from cold-path storage (retaining weeks or months of historical metrics for capacity planning and post-incident analysis). Alert evaluation must be fast and reliable — a delayed alert on a cascading failure can turn a minor incident into a major outage.
Key Requirements
Functional
- Metric ingestion -- Accept structured time-series data points from millions of service instances, each with a metric name, value, timestamp, and set of labels
- Real-time alerting -- Evaluate user-defined alerting rules (threshold, rate-of-change, anomaly detection) against incoming metrics and fire notifications within 30 seconds of a condition being met
- Dashboarding and queries -- Support ad-hoc queries over historical metrics with filtering, grouping, and aggregation by arbitrary label combinations
- Heartbeat monitoring -- Detect services that stop sending heartbeats within a configurable timeout and generate liveness alerts
Non-Functional
- Scalability -- Ingest tens of millions of metrics per second with horizontal scaling of all pipeline components
- Latency -- Alert evaluation should process each incoming data point within seconds, not minutes
- Storage efficiency -- Retain raw metrics for days and downsampled aggregates for months without unbounded storage growth
- Availability -- The alerting pipeline must be highly available — a monitoring system that goes down during an incident defeats its purpose
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Receiving tens of millions of data points per second without dropping data or pushing backpressure onto the emitting services requires careful pipeline design.
Hints to consider:
- How do you partition incoming metrics in Kafka — by service name, by metric name, or by a hash of the label set?
- What serialization format balances compactness with schema evolution (Protobuf, Avro, or a custom binary format)?
- How do you handle bursts when thousands of services restart simultaneously and flood the pipeline with initial metric reports?
- What backpressure mechanisms prevent a slow consumer from causing Kafka partition lag to grow unboundedly?
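One common answer to the partitioning hint is to key each point by the metric name plus its sorted label set, so every point of a given time series lands in the same partition and downstream consumers can evaluate rules without shuffling. A sketch of that key derivation, assuming a stable hash-based partitioner (function names are illustrative):

```python
import hashlib

def series_key(metric: str, labels: dict[str, str]) -> bytes:
    """Build a stable partition key from the metric name plus the sorted
    label set, so all points of one time series share a partition."""
    canonical = metric + "|" + "|".join(
        f"{k}={labels[k]}" for k in sorted(labels))
    return canonical.encode()

def partition_for(key: bytes, num_partitions: int) -> int:
    # Use a stable digest (not Python's randomized hash()) so series
    # placement survives process restarts and stays consistent across hosts.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Sorting the labels makes the key order-independent; the trade-off versus partitioning by service name alone is better load spreading at the cost of scattering one service's metrics across partitions.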
2. Label Cardinality and Storage Design
High-cardinality labels (like instance ID or request path) can explode the number of unique time series, overwhelming indexing and storage.
Hints to consider:
- How do you model time-series storage — one row per data point, or columnar storage with time-series-aware compression?
- What strategies limit cardinality — rejecting labels above a threshold, aggregating high-cardinality labels on ingest, or using an allowlist?
- How does a wide-column store like Cassandra, or a purpose-built TSDB (InfluxDB, VictoriaMetrics), handle compaction and downsampling?
- How do you implement a hot/cold path — keeping recent data in fast storage (SSD) and moving older data to cheaper cold storage (S3)?
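The cardinality-limiting strategies above can be made concrete with a per-label budget: once a label key has produced more distinct values than the budget allows, further values are collapsed into a single overflow bucket so no new series are created. This is an illustrative sketch (class name, budget, and the `__overflow__` sentinel are all assumptions), not a description of any particular TSDB's behavior:

```python
from collections import defaultdict

class CardinalityGuard:
    """Enforce a distinct-value budget per label key on ingest. Values seen
    before the budget is exhausted pass through; later novel values are
    collapsed into one overflow series instead of creating new ones."""

    def __init__(self, max_values_per_label: int):
        self.budget = max_values_per_label
        self.seen: defaultdict[str, set] = defaultdict(set)  # key -> values

    def enforce(self, labels: dict[str, str]) -> dict[str, str]:
        out = {}
        for key, value in labels.items():
            values = self.seen[key]
            if value in values or len(values) < self.budget:
                values.add(value)
                out[key] = value
            else:
                out[key] = "__overflow__"  # aggregate the long tail into one series
        return out
```

Known values keep passing through even after the budget is hit, so established series stay intact; only the long tail is aggregated. A real system would also bound the memory of the `seen` sets, e.g., with a sketch such as HyperLogLog.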
3. Streaming Alert Evaluation
Alert rules must be evaluated continuously against the latest metrics with low latency and high reliability.
Hints to consider:
- How do you model alerting rules as streaming computations in Flink — each rule as a keyed window aggregation?
- How do you handle multi-metric alerts that require joining data from different metric streams (e.g., alert when error rate rises AND latency increases)?
- What happens when the alert evaluation pipeline lags — do you evaluate on stale data, skip data points, or catch up?
- How do you prevent alert storms during widespread outages — through deduplication, grouping, or rate-limiting notifications?
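A single threshold rule from the list above can be sketched as a keyed sliding-window aggregation with a cooldown that suppresses repeat notifications, which is a crude form of the deduplication used against alert storms. All names and parameters here are illustrative assumptions; in a real deployment this logic would live inside a stream processor's keyed windowed operator rather than a standalone class:

```python
from collections import deque

class ThresholdRule:
    """Fire when the windowed average of a metric exceeds a threshold,
    with a cooldown that deduplicates repeated firings. A sketch of one
    rule instance, keyed to a single time series."""

    def __init__(self, threshold: float, window: int, cooldown_s: float):
        self.threshold = threshold
        self.window = deque(maxlen=window)   # last N observed values
        self.cooldown_s = cooldown_s
        self.last_fired: float | None = None

    def observe(self, value: float, now: float) -> bool:
        """Ingest one data point; return True iff the alert should fire."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False                     # not enough data yet
        avg = sum(self.window) / len(self.window)
        if avg <= self.threshold:
            return False
        if self.last_fired is not None and now - self.last_fired < self.cooldown_s:
            return False                     # suppress duplicate within cooldown
        self.last_fired = now
        return True
```

Multi-metric rules (error rate AND latency) would join two such keyed streams on the shared label set before evaluating the combined condition.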