Design a Server Health Monitoring System
Problem Statement
Design an observability platform that monitors the health of millions of services running across a large-scale distributed infrastructure. Each service emits periodic heartbeats and metrics (CPU usage, memory consumption, request latency, error rates), and the platform must ingest this telemetry, evaluate alerting rules in near real-time, and provide operators with dashboards for investigation and trend analysis.
The primary challenge is the sheer volume and velocity of telemetry data. With millions of services each reporting dozens of metrics every 10-15 seconds, the system must handle tens of millions of data points per second. Each metric carries labels (service name, region, instance ID, version) that create a high-cardinality space, making naive indexing approaches prohibitively expensive in both storage and query cost.
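The throughput claim is worth sanity-checking with back-of-envelope arithmetic. The fleet size, metrics-per-service count, and report interval below are illustrative assumptions chosen within the ranges the problem statement gives, not exact figures from it:

```python
# Back-of-envelope ingest estimate (all inputs are assumed, not specified).
services = 5_000_000          # assumed fleet size ("millions of services")
metrics_per_service = 30      # "dozens" of metrics each
interval_s = 10               # reporting every 10-15 seconds (lower bound)

points_per_second = services * metrics_per_service / interval_s
print(f"{points_per_second:,.0f} data points/sec")  # 15,000,000
```

Even with conservative inputs the pipeline lands in the tens of millions of points per second, which is why every downstream component must scale horizontally.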
The platform must also separate hot-path processing (real-time alerting on the latest data) from cold-path storage (retaining weeks or months of historical metrics for capacity planning and post-incident analysis). Alert evaluation must be fast and reliable — a delayed alert on a cascading failure can turn a minor incident into a major outage.
Key Requirements
Functional
- Metric ingestion -- Accept structured time-series data points from millions of service instances, each with a metric name, value, timestamp, and set of labels
- Real-time alerting -- Evaluate user-defined alerting rules (threshold, rate-of-change, anomaly detection) against incoming metrics and fire notifications within 30 seconds of a condition being met
- Dashboarding and queries -- Support ad-hoc queries over historical metrics with filtering, grouping, and aggregation by arbitrary label combinations
- Heartbeat monitoring -- Detect services that stop sending heartbeats within a configurable timeout and generate liveness alerts
Non-Functional
- Scalability -- Ingest tens of millions of metrics per second with horizontal scaling of all pipeline components
- Latency -- Alert evaluation should process each incoming data point within seconds, not minutes
- Storage efficiency -- Retain raw metrics for days and downsampled aggregates for months without unbounded storage growth
- Availability -- The alerting pipeline must be highly available — a monitoring system that goes down during an incident defeats its purpose
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Receiving tens of millions of data points per second without dropping data or pushing backpressure onto the emitting services requires careful pipeline design.
Hints to consider:
- How do you partition incoming metrics in Kafka — by service name, by metric name, or by a hash of the label set?
- What serialization format balances compactness with schema evolution (Protobuf, Avro, or a custom binary format)?
- How do you handle bursts when thousands of services restart simultaneously and flood the pipeline with initial metric reports?
- What backpressure mechanisms prevent a slow consumer from causing Kafka partition lag to grow unboundedly?
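One common answer to the partitioning hint is to key each point by the metric name plus its sorted label set, so every point of a given time series lands in the same partition and downstream consumers can evaluate rules without shuffling. A sketch of that key derivation, assuming a stable hash-based partitioner (function names are illustrative):

```python
import hashlib

def series_key(metric: str, labels: dict[str, str]) -> bytes:
    """Build a stable partition key from the metric name plus the sorted
    label set, so all points of one time series share a partition."""
    canonical = metric + "|" + "|".join(
        f"{k}={labels[k]}" for k in sorted(labels))
    return canonical.encode()

def partition_for(key: bytes, num_partitions: int) -> int:
    # Use a stable digest (not Python's randomized hash()) so series
    # placement survives process restarts and stays consistent across hosts.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Sorting the labels makes the key order-independent; the trade-off versus partitioning by service name alone is better load spreading at the cost of scattering one service's metrics across partitions.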
2. Label Cardinality and Storage Design
High-cardinality labels (like instance ID or request path) can explode the number of unique time series, overwhelming indexing and storage.
Hints to consider:
- How do you model time-series storage — one row per data point, or columnar storage with time-series-aware compression?
- What strategies limit cardinality — rejecting labels above a threshold, aggregating high-cardinality labels on ingest, or using an allowlist?
- How does a wide-column store like Cassandra, or a purpose-built TSDB (InfluxDB, VictoriaMetrics), handle compaction and downsampling?
- How do you implement a hot/cold path — keeping recent data in fast storage (SSD) and moving older data to cheaper cold storage (S3)?
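The cardinality-limiting strategies above can be made concrete with a per-label budget: once a label key has produced more distinct values than the budget allows, further values are collapsed into a single overflow bucket so no new series are created. This is an illustrative sketch (class name, budget, and the `__overflow__` sentinel are all assumptions), not a description of any particular TSDB's behavior:

```python
from collections import defaultdict

class CardinalityGuard:
    """Enforce a distinct-value budget per label key on ingest. Values seen
    before the budget is exhausted pass through; later novel values are
    collapsed into one overflow series instead of creating new ones."""

    def __init__(self, max_values_per_label: int):
        self.budget = max_values_per_label
        self.seen: defaultdict[str, set] = defaultdict(set)  # key -> values

    def enforce(self, labels: dict[str, str]) -> dict[str, str]:
        out = {}
        for key, value in labels.items():
            values = self.seen[key]
            if value in values or len(values) < self.budget:
                values.add(value)
                out[key] = value
            else:
                out[key] = "__overflow__"  # aggregate the long tail into one series
        return out
```

Known values keep passing through even after the budget is hit, so established series stay intact; only the long tail is aggregated. A real system would also bound the memory of the `seen` sets, e.g., with a sketch such as HyperLogLog.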
3. Streaming Alert Evaluation
Alert rules must be evaluated continuously against the latest metrics with low latency and high reliability.
Hints to consider:
- How do you model alerting rules as streaming computations in Flink — each rule as a keyed window aggregation?
- How do you handle multi-metric alerts that require joining data from different metric streams (e.g., alert when error rate rises AND latency increases)?
- What happens when the alert evaluation pipeline lags — do you evaluate on stale data, skip data points, or catch up?
- How do you prevent alert storms during widespread outages — through deduplication, grouping, or rate-limiting notifications?
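A single threshold rule from the list above can be sketched as a keyed sliding-window aggregation with a cooldown that suppresses repeat notifications, which is a crude form of the deduplication used against alert storms. All names and parameters here are illustrative assumptions; in a real deployment this logic would live inside a stream processor's keyed windowed operator rather than a standalone class:

```python
from collections import deque

class ThresholdRule:
    """Fire when the windowed average of a metric exceeds a threshold,
    with a cooldown that deduplicates repeated firings. A sketch of one
    rule instance, keyed to a single time series."""

    def __init__(self, threshold: float, window: int, cooldown_s: float):
        self.threshold = threshold
        self.window = deque(maxlen=window)   # last N observed values
        self.cooldown_s = cooldown_s
        self.last_fired: float | None = None

    def observe(self, value: float, now: float) -> bool:
        """Ingest one data point; return True iff the alert should fire."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False                     # not enough data yet
        avg = sum(self.window) / len(self.window)
        if avg <= self.threshold:
            return False
        if self.last_fired is not None and now - self.last_fired < self.cooldown_s:
            return False                     # suppress duplicate within cooldown
        self.last_fired = now
        return True
```

Multi-metric rules (error rate AND latency) would join two such keyed streams on the shared label set before evaluating the combined condition.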