Design a Health Monitoring System for Database Servers
Problem Statement
Design a system that monitors database server health in real time and sends alerts when servers go down or become unresponsive. Operators install lightweight agents or configure health checks that send heartbeats and run probe queries. The system evaluates liveness and performance signals, shows status on a live dashboard, and routes alerts to on-call teams.
Interviewers ask this to see how you design reliable, low-latency detection under partial failures, high event volume, and noisy signals. They want to hear how you choose push vs pull health checks, debounce flapping, deduplicate alerts, and make the monitoring system itself highly available without overcomplicating the core design.
Key Requirements
Functional
- Health check configuration -- users register database servers and define health checks (heartbeat, query latency, replication lag) with thresholds and alerting windows
- Real-time dashboard -- users view a live dashboard showing server status, recent incidents, and key health metrics
- Alert policies -- users configure alert policies (channels, severity, escalation, maintenance windows) and receive actionable alerts
- Incident lifecycle -- users acknowledge, silence, and resolve incidents with an audit trail and notification history
Non-Functional
- Scalability -- monitor thousands of database servers across multiple data centers with agents reporting every 10-30 seconds
- Reliability -- the monitoring system itself must be more reliable than what it monitors, targeting 99.99% uptime for alert delivery
- Latency -- detect server failures within 30-60 seconds of occurrence and deliver alerts within 10 seconds of detection
- Consistency -- strong consistency for incident state and alert delivery; eventual consistency for dashboard metrics and historical views
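The scalability and latency targets above can be sanity-checked with a little arithmetic. The sketch below is a rough estimate, not a sizing guide: the fleet size of 5,000 servers, the number of missed heartbeats required before declaring a failure, and the 5-second sweep period of the evaluator are all illustrative assumptions, as are the function names.

```python
def heartbeat_rate(servers: int, interval_s: float) -> float:
    """Average heartbeats per second arriving at the ingest tier."""
    return servers / interval_s


def worst_case_detection_s(interval_s: float, missed_needed: int, sweep_s: float) -> float:
    """Upper bound on the time from failure to detection.

    Worst case: the server fails right after sending a heartbeat, so the
    evaluator waits missed_needed full intervals, plus up to one sweep
    period before it notices the silence.
    """
    return interval_s * missed_needed + sweep_s


# Assumed fleet of 5,000 servers reporting every 10 seconds.
print(heartbeat_rate(5000, 10))           # 500.0 heartbeats per second
# 10 s interval, 3 missed heartbeats, 5 s sweep (all assumed values).
print(worst_case_detection_s(10, 3, 5))   # 35 seconds
# 30 s interval: requiring 2 missed heartbeats already exceeds 60 s.
print(worst_case_detection_s(30, 2, 5))   # 65 seconds
```

Under these assumptions, a 30-second reporting interval leaves room for only one missed heartbeat inside the 60-second detection budget, which is a useful trade-off to raise when discussing debouncing.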
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Health Check Design and Failure Detection
Choosing between push-based (agents send heartbeats) and pull-based (central system probes servers) health checks, and handling the inherent unreliability of both approaches.
Hints to consider:
- Prefer push-based heartbeats for scalability: agents on database servers send periodic health reports, and absence of reports indicates potential failure
- Store a per-server key in Redis and have each heartbeat refresh its TTL; an expired key means heartbeats have stopped and should trigger further investigation rather than an immediate alert
- Implement multi-signal verification before declaring a server down: combine heartbeat absence with an active probe failure and a network reachability check
- Handle network partitions where the monitoring system cannot reach a healthy server by using multiple monitoring paths
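The hints above can be sketched as follows. This is a minimal illustration, not a production design: it uses an in-memory dictionary as a stand-in for TTL keys in Redis, and the names `HeartbeatTracker`, `classify`, and the `"suspect"` status are hypothetical. Each probe is assumed to be a callable that returns `True` when the server is reachable over one independent monitoring path.

```python
import time
from typing import Callable, Dict, List


class HeartbeatTracker:
    """In-memory stand-in for TTL keys: each heartbeat pushes the expiry forward."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._expires_at: Dict[str, float] = {}

    def record_heartbeat(self, server_id: str) -> None:
        # Equivalent in spirit to refreshing a key's TTL on every heartbeat.
        self._expires_at[server_id] = self.clock() + self.ttl_s

    def expired(self) -> List[str]:
        """Servers whose last heartbeat is older than the TTL."""
        now = self.clock()
        return sorted(s for s, t in self._expires_at.items() if t <= now)


def classify(server_id: str, tracker: HeartbeatTracker,
             probes: List[Callable[[str], bool]]) -> str:
    """Combine heartbeat absence with independent active probes."""
    if server_id not in tracker.expired():
        return "healthy"
    results = [probe(server_id) for probe in probes]
    if not any(results):
        # No heartbeat and no path can reach the server.
        return "down"
    # Heartbeat missing but at least one path reaches the server:
    # more likely an agent crash or a partial network partition.
    return "suspect"
```

Separating "suspect" from "down" is one way to avoid paging on-call when only the monitoring path, not the database, has failed.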
2. Alert Deduplication and Flapping Prevention
A single missed heartbeat should not trigger an alert, and servers oscillating between healthy and unhealthy states should not flood on-call engineers.
Hints to consider:
- Use sliding window evaluation: trigger alerts only when a threshold is breached for N consecutive checks or for X% of checks in a window
- Implement dampening with configurable cooldown periods before re-alerting for the same server and condition
- Group related alerts (multiple replicas of the same database cluster) into a single incident to reduce noise
- Design suppression rules for planned maintenance windows where known downtime should not generate alerts
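A minimal sketch of sliding window evaluation with a cooldown and a maintenance window, for a single server and condition. The class name `AlertEvaluator`, its parameters, and the choice to count failures within the last N checks (rather than requiring them to be consecutive) are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class AlertEvaluator:
    """Debounces check results for one (server, condition) pair."""

    def __init__(self, window: int, min_failures: int, cooldown_s: float):
        self.min_failures = min_failures
        self.cooldown_s = cooldown_s
        self._results: Deque[bool] = deque(maxlen=window)  # most recent checks
        self._last_alert_at: Optional[float] = None
        self.maintenance: Optional[Tuple[float, float]] = None  # (start, end)

    def observe(self, ok: bool, now: float) -> bool:
        """Record one check result; return True if an alert should fire now."""
        self._results.append(ok)
        failures = sum(1 for r in self._results if not r)
        if failures < self.min_failures:
            return False  # a single missed check is not enough
        if self.maintenance and self.maintenance[0] <= now < self.maintenance[1]:
            return False  # planned downtime: suppress
        if self._last_alert_at is not None and now - self._last_alert_at < self.cooldown_s:
            return False  # already alerted recently: dampen
        self._last_alert_at = now
        return True


ev = AlertEvaluator(window=5, min_failures=3, cooldown_s=300)
fired = [ev.observe(ok, t) for ok, t in
         [(False, 0), (True, 10), (False, 20), (False, 30), (False, 40)]]
print(fired)  # [False, False, False, True, False]
```

The third failure inside the window fires the alert, and the fourth is swallowed by the cooldown, so a flapping server produces one page instead of one per check.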
3. Incident Lifecycle and Notification Workflow
Alerting is a multi-step workflow that must be reliable, auditable, and support escalation when initial responders do not acknowledge.
Hints to consider:
- Model the incident as a state machine: open, acknowledged, investigating, resolved, with timestamps and actor attribution at each transition
- Fan out notifications through multiple channels (PagerDuty, Slack, email) with retry logic and delivery confirmation
- Implement escalation policies: if no acknowledgment within 5 minutes, notify the next person in the on-call rotation
- Store all incident state transitions as immutable audit events for post-incident review and compliance
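One way to sketch the incident state machine with an append-only audit trail and a time-based escalation level. The states come from the hints above; the set of allowed transitions, the names `Incident`, `AuditEvent`, and `escalation_level`, and the 300-second default (matching the 5-minute example) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Assumed transition graph; a real policy may allow reopening, for example.
ALLOWED: Dict[str, Set[str]] = {
    "open": {"acknowledged", "resolved"},
    "acknowledged": {"investigating", "resolved"},
    "investigating": {"resolved"},
    "resolved": set(),
}


@dataclass(frozen=True)
class AuditEvent:
    """Immutable record of one state transition."""
    incident_id: str
    from_state: str
    to_state: str
    actor: str
    at: float


@dataclass
class Incident:
    incident_id: str
    opened_at: float
    state: str = "open"
    audit: List[AuditEvent] = field(default_factory=list)

    def transition(self, to_state: str, actor: str, at: float) -> None:
        if to_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {to_state}")
        # Append the audit event before mutating state; events are never edited.
        self.audit.append(AuditEvent(self.incident_id, self.state, to_state, actor, at))
        self.state = to_state

    def escalation_level(self, now: float, ack_timeout_s: float = 300.0) -> int:
        """Number of on-call steps past the primary to notify; 0 once acknowledged."""
        if self.state != "open":
            return 0
        return int((now - self.opened_at) // ack_timeout_s)
```

For example, an incident opened at `t=0` and still unacknowledged at `t=300` yields `escalation_level(300) == 1`, meaning the next person in the rotation should be notified.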
4. Monitoring System Reliability
The monitoring system must be more reliable than the systems it monitors, requiring careful attention to its own failure modes.
Hints to consider:
- Deploy the monitoring system across multiple availability zones with no single point of failure
- Keep critical state (incident records, alert configuration) in a consensus-replicated store rather than a single database instance, so that losing one node does not lose or stall incident handling
- Implement watchdog processes that monitor the monitoring system itself and alert through an independent channel if it becomes unresponsive
- Design for graceful degradation: if the dashboard is down, alerts still flow; if one notification channel fails, others continue
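The last hint, that one failing notification channel must not block the others, can be sketched as an isolated fan-out with per-channel retries. The function name `fan_out` and the channel callables are hypothetical; a real sender would wrap the PagerDuty, Slack, or email client, add backoff between attempts, and likely run channels concurrently.

```python
from typing import Callable, Dict


def fan_out(message: str,
            channels: Dict[str, Callable[[str], None]],
            max_attempts: int = 3) -> Dict[str, bool]:
    """Try every channel independently; return per-channel delivery status."""
    delivered: Dict[str, bool] = {}
    for name, send in channels.items():
        delivered[name] = False
        for _ in range(max_attempts):
            try:
                send(message)
                delivered[name] = True
                break
            except Exception:
                # Isolate the failure: retry this channel, never abort the others.
                continue
    return delivered
```

The returned status map can feed the notification history on the incident, and a result where every channel is `False` is a natural trigger for the independent watchdog channel.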