Design a Health Monitoring System for Database Servers

Problem Statement

Design a system that monitors database server health in real time and sends alerts when servers go down or become unresponsive. Operators install lightweight agents or configure health checks that send heartbeats and run probe queries. The system evaluates liveness and performance signals, shows status on a live dashboard, and routes alerts to on-call teams.

Interviewers ask this to see how you design reliable, low-latency detection under partial failures, high event volume, and noisy signals. They want to hear how you choose push vs pull health checks, debounce flapping, deduplicate alerts, and make the monitoring system itself highly available without overcomplicating the core design.

Key Requirements

Functional

Health check configuration -- users register database servers and define health checks (heartbeat, query latency, replication lag) with thresholds and alerting windows
Real-time dashboard -- users view a live dashboard showing server status, recent incidents, and key health metrics
Alert policies -- users configure alert policies (channels, severity, escalation, maintenance windows) and receive actionable alerts
Incident lifecycle -- users acknowledge, silence, and resolve incidents with an audit trail and notification history

Non-Functional

Scalability -- monitor thousands of database servers across multiple data centers with agents reporting every 10-30 seconds
Reliability -- the monitoring system itself must be more reliable than what it monitors, targeting 99.99% uptime for alert delivery
Latency -- detect server failures within 30-60 seconds of occurrence, deliver alerts within 10 seconds of detection
Consistency -- strong consistency for incident state and alert delivery; eventual consistency for dashboard metrics and historical views
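
The scalability and latency targets above constrain each other, and a quick calculation makes that visible. The sketch below is a back-of-the-envelope model, not a specification: the function names, the 5-second evaluator tick, the three-missed-heartbeats rule, and the fleet size of 5,000 servers are all illustrative assumptions.

```python
def worst_case_detection_seconds(interval_s: float, missed_needed: int,
                                 eval_period_s: float) -> float:
    """Upper bound on the time from failure to detection.

    Assumes the server fails right after sending a heartbeat, so
    `missed_needed` full intervals must pass before the last expected
    heartbeat is overdue, plus up to one evaluator tick to notice.
    """
    return interval_s * missed_needed + eval_period_s


def heartbeats_per_second(servers: int, interval_s: float) -> float:
    """Average ingest rate when every server reports once per interval."""
    return servers / interval_s


# Hypothetical numbers: 3 missed heartbeats, evaluator runs every 5 s.
print(worst_case_detection_seconds(10, 3, 5))  # 35
print(worst_case_detection_seconds(30, 3, 5))  # 95
print(heartbeats_per_second(5000, 10))         # 500.0
```

Under these assumptions a 10-second interval fits the 30-60 second detection target, while a 30-second interval does not unless fewer misses are required or an active probe confirms the failure sooner.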

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Health Check Design and Failure Detection

Choosing between push-based (agents send heartbeats) and pull-based (central system probes servers) health checks, and handling the inherent unreliability of both approaches.

Hints to consider:

Prefer push-based heartbeats for scalability: agents on database servers send periodic health reports, and absence of reports indicates potential failure
Use TTL-based keys in Redis where each heartbeat refreshes the TTL; an expired key triggers investigation
Implement multi-signal verification before declaring a server down: combine heartbeat absence with an active probe failure and a network reachability check
Handle network partitions where the monitoring system cannot reach a healthy server by using multiple monitoring paths
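
The hints above can be sketched as follows. This is a minimal illustration under stated assumptions: `HeartbeatTracker` is an in-memory stand-in for the Redis TTL keys (it is not Redis code), and `classify`, the state names, and the injectable `clock` are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict, List


class HeartbeatTracker:
    """In-memory stand-in for TTL keys: each heartbeat refreshes the expiry."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._expires_at: Dict[str, float] = {}

    def heartbeat(self, server_id: str) -> None:
        # Equivalent in spirit to setting a key with a fresh TTL.
        self._expires_at[server_id] = self.clock() + self.ttl_s

    def expired(self) -> List[str]:
        """Servers whose heartbeat has not been refreshed within the TTL."""
        now = self.clock()
        return sorted(s for s, t in self._expires_at.items() if t <= now)


def classify(heartbeat_expired: bool, probe_results: List[bool]) -> str:
    """Combine heartbeat absence with active probes from several paths.

    `probe_results` holds one boolean per monitoring path (True = reachable).
    """
    if not heartbeat_expired:
        return "healthy"
    if probe_results and not any(probe_results):
        return "down"  # silent agent and unreachable from every path
    # Missing heartbeat but no confirming probe failure from all paths:
    # could be an agent crash or a partition on one path.
    return "suspect"
```

Requiring every monitoring path to fail before declaring a server down is one way to avoid paging on a partition that affects only a single vantage point; a majority rule would be a reasonable alternative.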

2. Alert Deduplication and Flapping Prevention

A single missed heartbeat should not trigger an alert, and servers oscillating between healthy and unhealthy states should not flood on-call engineers.

Hints to consider:

Use sliding window evaluation: trigger alerts only when a threshold is breached for N consecutive checks or for X% of checks in a window
Implement dampening with configurable cooldown periods before re-alerting for the same server and condition
Group related alerts (multiple replicas of the same database cluster) into a single incident to reduce noise
Design suppression rules for planned maintenance windows where known downtime should not generate alerts
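
A minimal sketch of windowed evaluation with a cooldown and maintenance suppression, for one server and one condition. The class name `AlertEvaluator`, its parameters, and the representation of maintenance windows as `(start, end)` timestamp pairs are assumptions made for illustration; grouping replicas into a single incident is not shown.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class AlertEvaluator:
    """Fires when at least `min_failures` of the last `window` checks failed."""

    def __init__(self, window: int, min_failures: int, cooldown_s: float):
        self.min_failures = min_failures
        self.cooldown_s = cooldown_s
        self._results: Deque[bool] = deque(maxlen=window)  # oldest result drops off
        self._last_alert_at: Optional[float] = None
        self.maintenance: List[Tuple[float, float]] = []   # [start, end) pairs

    def _in_maintenance(self, now: float) -> bool:
        return any(start <= now < end for start, end in self.maintenance)

    def observe(self, now: float, check_ok: bool) -> bool:
        """Record one check result; return True if an alert should fire now."""
        self._results.append(check_ok)
        failures = sum(1 for ok in self._results if not ok)
        if failures < self.min_failures:
            return False  # a single missed check is not enough
        if self._in_maintenance(now):
            return False  # planned downtime is suppressed
        if (self._last_alert_at is not None
                and now - self._last_alert_at < self.cooldown_s):
            return False  # still cooling down from the previous alert
        self._last_alert_at = now
        return True
```

With `AlertEvaluator(window=5, min_failures=3, cooldown_s=300)`, the third failure inside the window fires an alert, and further failures within the next 300 seconds stay silent.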

3. Incident Lifecycle and Notification Workflow

Alerting is a multi-step workflow that must be reliable, auditable, and support escalation when initial responders do not acknowledge.

Hints to consider:

Model the incident as a state machine: open, acknowledged, investigating, resolved, with timestamps and actor attribution at each transition
Fan out notifications through multiple channels (PagerDuty, Slack, email) with retry logic and delivery confirmation
Implement escalation policies: if no acknowledgment within 5 minutes, notify the next person in the on-call rotation
Store all incident state transitions as immutable audit events for post-incident review and compliance
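
One way to sketch the state machine, audit trail, and escalation check is shown below. The four states come from the hint above; the allowed-transition table, the `Incident` and `AuditEvent` classes, and the `needs_escalation` helper are assumptions for illustration, and a real design might permit other transitions (for example, reopening a resolved incident).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


# Assumed transition rules; adjust to the team's actual workflow.
ALLOWED = {
    State.OPEN: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.INVESTIGATING, State.RESOLVED},
    State.INVESTIGATING: {State.RESOLVED},
    State.RESOLVED: set(),
}


@dataclass(frozen=True)  # frozen: audit events cannot be modified once written
class AuditEvent:
    at: float
    actor: str
    from_state: State
    to_state: State


@dataclass
class Incident:
    server_id: str
    opened_at: float
    state: State = State.OPEN
    audit: List[AuditEvent] = field(default_factory=list)

    def transition(self, to_state: State, actor: str, at: float) -> None:
        if to_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {to_state.value} not allowed")
        # Append-only log: every change records who did it and when.
        self.audit.append(AuditEvent(at, actor, self.state, to_state))
        self.state = to_state

    def needs_escalation(self, now: float, ack_timeout_s: float = 300.0) -> bool:
        """True if nobody acknowledged within the timeout (5 minutes by default)."""
        return self.state is State.OPEN and now - self.opened_at >= ack_timeout_s
```

A scheduler could call `needs_escalation` periodically and notify the next person in the on-call rotation whenever it returns True.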

4. Monitoring System Reliability

The monitoring system must be more reliable than the systems it monitors, requiring careful attention to its own failure modes.

Hints to consider:

Deploy the monitoring system across multiple availability zones with no single point of failure
Keep critical state (incident records, alert configuration) in a consensus-replicated store rather than in a single database instance, so that losing one node does not lose or block incident updates
Implement watchdog processes that monitor the monitoring system itself and alert through an independent channel if it becomes unresponsive
Design for graceful degradation: if the dashboard is down, alerts still flow; if one notification channel fails, others continue
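
The graceful-degradation hint for notification channels can be illustrated with a small fan-out routine in which each channel is tried independently. This is a sketch under assumptions: `fan_out` and the `Sender` callable type are hypothetical, senders are assumed to raise an exception on failure, and a production version would likely add backoff between attempts and send to channels concurrently.

```python
from typing import Callable, Dict

# A sender delivers one message and raises an exception on failure (assumed contract).
Sender = Callable[[str], None]


def fan_out(message: str, channels: Dict[str, Sender],
            max_attempts: int = 3) -> Dict[str, bool]:
    """Try every channel independently; return per-channel delivery status."""
    results: Dict[str, bool] = {}
    for name, send in channels.items():
        delivered = False
        for _ in range(max_attempts):
            try:
                send(message)
                delivered = True
                break
            except Exception:
                # One failing channel must not stop the others; retry, then move on.
                continue
        results[name] = delivered
    return results
```

The returned map doubles as a delivery confirmation record: any channel marked False can be logged to the incident's notification history and used to trigger a fallback.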
