Design a global VM monitoring system
Problem Statement
You are asked to design a monitoring system that provides a real-time dashboard for all virtual machines running across multiple global regions. The dashboard displays each VM's name, region, and current health status, and supports filtering, sorting, and free-text search so operators can quickly locate and investigate issues.
The system ingests heartbeat signals from millions of VMs distributed across dozens of regions. Each VM periodically reports its status, and the monitoring system must compute health based on heartbeat recency — a VM that has not sent a heartbeat within a configurable TTL is marked unhealthy. The dashboard must update in near real-time as health statuses change.
A critical design challenge is failure isolation. If a single region's network partitions or a heartbeat ingestion pipeline goes down, the system should degrade gracefully — clearly indicating that data from that region is stale rather than incorrectly marking all its VMs as unhealthy.
Key Requirements
Functional
- Heartbeat ingestion -- Accept periodic heartbeat messages from millions of VMs, each containing VM ID, region, timestamp, and optional metadata.
- Health computation -- Determine each VM's health status based on whether its most recent heartbeat falls within the configured TTL window.
- Real-time dashboard -- Push health status updates to connected dashboard clients within seconds of a state change, supporting thousands of concurrent operator sessions.
- Filter, sort, and search -- Allow operators to filter VMs by region, health status, or tags, sort by any column, and perform free-text search across VM names and metadata.
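The heartbeat message described above (VM ID, region, timestamp, optional metadata) can be sketched as a small schema; the field names below are illustrative, not a fixed wire format:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Heartbeat:
    """One heartbeat message. Field names are an assumption for illustration."""
    vm_id: str
    region: str
    sent_at: float  # sender's wall-clock time, epoch seconds
    metadata: dict = field(default_factory=dict)  # optional tags, e.g. {"env": "prod"}

    def to_json(self) -> str:
        # Serialize for transport (e.g., onto a regional message queue)
        return json.dumps(asdict(self))

hb = Heartbeat(vm_id="vm-1234", region="us-east1",
               sent_at=time.time(), metadata={"env": "prod"})
payload = hb.to_json()
```

Keeping the payload this small matters at the stated scale: every extra byte is multiplied by hundreds of thousands of events per second.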
Non-Functional
- Scalability -- Handle 10 million VMs sending heartbeats every 30 seconds, producing roughly 330,000 heartbeat events per second globally.
- Latency -- Dashboard updates should reflect health changes within 5 seconds of the triggering heartbeat (or heartbeat absence).
- Availability -- 99.99% uptime for the dashboard; operators must always have visibility into fleet health even during partial outages.
- Failure isolation -- A failure in one region's ingestion pipeline must not cascade to other regions or produce misleading global health data.
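The throughput figure in the scalability requirement follows directly from the fleet size and heartbeat interval:

```python
vms = 10_000_000
interval_s = 30
events_per_s = vms / interval_s  # ~333,333 events/s, i.e. roughly 330k globally
```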
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Multi-Region Heartbeat Ingestion
Interviewers probe how you design an ingestion pipeline that handles hundreds of thousands of heartbeats per second across globally distributed regions without creating a single bottleneck.
Hints to consider:
- Think about regional Kafka clusters that ingest heartbeats locally, avoiding cross-region latency on the write path
- Consider how you aggregate regional heartbeat data into a global view — push from regions to a central store, or pull from a global query layer
- Explore using DynamoDB Global Tables or a similar multi-region datastore to replicate VM state without custom sync logic
- Discuss how you handle clock skew between VM clocks and the monitoring system's clock when evaluating TTL
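One common answer to the clock-skew hint is to evaluate TTL against the *receiver's* arrival time rather than the VM's self-reported timestamp. A minimal sketch, where `store` stands in for a regional datastore and the field names are assumptions:

```python
import time

def ingest_heartbeat(raw: dict, store: dict) -> None:
    """Stamp each heartbeat with the monitoring system's own clock on arrival,
    so TTL evaluation never depends on the VM's (possibly skewed) clock."""
    received_at = time.time()             # receiver's clock, used for TTL
    skew = received_at - raw["sent_at"]   # observed skew; useful for alerting
    store[raw["vm_id"]] = {
        "region": raw["region"],
        "received_at": received_at,
        "reported_skew_s": skew,
    }

store = {}
# Simulate a VM whose clock runs ~2 seconds behind the receiver's
ingest_heartbeat({"vm_id": "vm-1", "region": "eu-west1",
                  "sent_at": time.time() - 2.0}, store)
```

The VM's own timestamp is still worth keeping: a large, persistent `reported_skew_s` is itself a signal worth surfacing to operators.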
2. Health Computation with TTL
The health model seems simple (heartbeat within TTL equals healthy), but interviewers look for nuance around edge cases, mass state transitions (for example, an entire region's VMs flipping at once), and the computational cost of evaluating TTLs across millions of VMs.
Hints to consider:
- Think about whether you compute health on-read (check TTL when the dashboard queries) or on-write (update health status when a heartbeat arrives or expires)
- Consider using Redis sorted sets with heartbeat timestamps as scores, allowing efficient range queries to find expired VMs
- Explore how you handle the "thundering herd" when a region comes back online and millions of VMs send heartbeats simultaneously
- Discuss how TTL expiration events are generated — polling vs. event-driven expiry notifications
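The sorted-set hint above can be sketched with a pure-Python analog so it runs self-contained. With real Redis this would be `ZADD` on each heartbeat and `ZRANGEBYSCORE key -inf (now - ttl)` to find expired VMs in one range query; the class and method names here are assumptions:

```python
class HeartbeatIndex:
    """In-memory analog of a Redis sorted set keyed by last-heartbeat time."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.last_seen: dict = {}  # vm_id -> score (last heartbeat timestamp)

    def record(self, vm_id: str, ts: float) -> None:
        # ZADD semantics: a newer heartbeat overwrites the old score
        self.last_seen[vm_id] = ts

    def expired(self, now: float) -> list:
        # Equivalent to ZRANGEBYSCORE key -inf (now - ttl): every VM whose
        # last heartbeat is older than the TTL cutoff is unhealthy
        cutoff = now - self.ttl_s
        return [vm for vm, ts in self.last_seen.items() if ts < cutoff]

idx = HeartbeatIndex(ttl_s=90)
idx.record("vm-a", ts=100.0)
idx.record("vm-b", ts=10.0)
print(idx.expired(now=150.0))  # → ['vm-b']
```

The Redis version keeps the scan cheap because the range query touches only expired entries rather than iterating the whole fleet, which is the point of the sorted-set layout.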
3. Real-Time Dashboard Push
Interviewers want to see how you push updates to thousands of connected dashboard clients without overwhelming them with the full firehose of heartbeat events.
Hints to consider:
- Think about WebSocket connections from dashboard clients to a push gateway that streams only state-change events (healthy to unhealthy or vice versa)
- Consider how you handle dashboard clients that subscribe to filtered views — pushing only events matching their active filters
- Explore using a pub/sub system where each region or health-status transition is a channel that clients selectively subscribe to
- Discuss how you handle client reconnection and backfill so operators see accurate state after a brief disconnection
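The channel-per-filter idea above can be sketched as a toy pub/sub where dashboard clients subscribe to a `(region, transition)` channel and receive only matching state-change events. A real system would put this behind a WebSocket gateway (e.g., backed by Redis Pub/Sub); the channel scheme and names here are assumptions, with `"*"` as a wildcard:

```python
from collections import defaultdict

class StatusBus:
    """Toy pub/sub routing state-change events to filter-matched subscribers."""

    def __init__(self):
        self.subs = defaultdict(list)  # (region, transition) -> [callbacks]

    def subscribe(self, region, transition, callback):
        self.subs[(region, transition)].append(callback)

    def publish(self, event):
        # Fan out to exact-match channels and wildcard channels only;
        # clients never see the raw heartbeat firehose, just transitions
        for key in [(event["region"], event["transition"]),
                    (event["region"], "*"),
                    ("*", event["transition"]),
                    ("*", "*")]:
            for callback in self.subs.get(key, []):
                callback(event)

received = []
bus = StatusBus()
bus.subscribe("us-east1", "unhealthy", received.append)  # operator's active filter
bus.publish({"vm_id": "vm-9", "region": "us-east1", "transition": "unhealthy"})
bus.publish({"vm_id": "vm-3", "region": "eu-west1", "transition": "unhealthy"})
print(len(received))  # → 1: only the us-east1 event matched the filter
```

For reconnection backfill, a client would typically replay from a per-channel sequence number or fetch a fresh snapshot before resubscribing, so a brief disconnect does not leave the dashboard showing stale state.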