Design a Health Monitoring System for Database Servers
System Design · Must
Problem Statement
Design a distributed monitoring platform that tracks the operational health of thousands of database servers across multiple data centers. The system must detect failures within seconds, distinguish between transient glitches and genuine outages, and route alerts to the appropriate on-call teams through channels like email, Slack, and PagerDuty. Operators need a live dashboard showing real-time fleet status, the ability to define custom health checks beyond simple connectivity (query response times, replication lag, disk usage, connection pool saturation), and tools to manage alert fatigue through intelligent grouping, suppression, and maintenance windows.
This problem is directly relevant to Bloomberg's infrastructure, where thousands of database instances power the Bloomberg Terminal's real-time financial data delivery. The monitoring platform must handle 10,000+ servers generating health signals every 10 to 30 seconds, support sub-minute detection for critical failures, and remain operational even when parts of the monitoring infrastructure itself experience outages. The system must avoid both false positives that erode operator trust and missed alerts that lead to extended, undetected downtime.
Key Requirements
Functional
- Health check configuration -- operators register servers and define checks (heartbeat ping, query execution latency, replication lag, disk usage) with configurable intervals, thresholds, and evaluation windows
- Real-time status dashboard -- display current server states, historical uptime percentages, active incidents, and metric trends with sub-second updates via WebSocket or Server-Sent Events
- Alert routing and escalation -- configurable notification policies that route alerts based on severity, time of day, team ownership, and escalate to management if unacknowledged within a time limit
- Incident lifecycle management -- operators can acknowledge alerts, add investigation notes, silence noisy servers during planned maintenance, and resolve incidents with a full audit trail
Non-Functional
- Scalability -- support 10,000 servers with 30-second check intervals (330+ health events per second baseline), scaling to 100,000+ servers without architectural changes
- Reliability -- no single point of failure in the monitoring pipeline; continue collecting health signals and evaluating rules even if the dashboard or alerting components fail independently
- Latency -- detect critical failures within 60 seconds of occurrence; update the dashboard within 2 seconds of a state change
- Consistency -- eventual consistency acceptable for metric storage and dashboard rendering; strong consistency required for alert deduplication and incident state transitions
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Health Check Architecture and Failure Detection
Interviewers want to understand your strategy for determining whether a database is truly down versus experiencing a transient hiccup. This involves choosing between push-based heartbeats and pull-based polling, handling network partitions where the monitor loses connectivity but the database remains healthy, and minimizing both false positives and false negatives.
Hints to consider:
- Compare agent-based heartbeats (servers push status to the monitoring platform) versus centralized probing (monitoring workers poll each server); discuss failure scenarios where each approach gives incorrect results
- Implement sliding window evaluation: require N failures within M consecutive attempts before declaring an outage, not just a single missed check
- When a server appears down, trigger verification checks from multiple monitoring workers in different network zones before creating an alert, reducing false positives from localized network issues
- Handle "split brain" scenarios where different monitoring regions see conflicting states for the same database by requiring quorum agreement before state transitions
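The sliding-window rule above can be sketched in a few lines. This is a minimal illustration, not a production detector; the class and parameter names (`window_size`, `fail_threshold`) are hypothetical:

```python
from collections import deque

class SlidingWindowDetector:
    """Declare an outage only after `fail_threshold` failures occur within
    the last `window_size` check attempts, instead of on a single miss."""

    def __init__(self, window_size: int = 5, fail_threshold: int = 3):
        self.window = deque(maxlen=window_size)  # rolling record of pass/fail
        self.fail_threshold = fail_threshold

    def record(self, check_passed: bool) -> str:
        self.window.append(check_passed)
        failures = sum(1 for ok in self.window if not ok)
        return "down" if failures >= self.fail_threshold else "up"

# A single missed check does not flip the state:
d = SlidingWindowDetector(window_size=5, fail_threshold=3)
assert d.record(False) == "up"    # 1 failure -- could be a transient glitch
assert d.record(True) == "up"
assert d.record(False) == "up"    # 2 failures in the window
assert d.record(False) == "down"  # 3rd failure within 5 attempts -> outage
```

In a real deployment the "down" verdict would trigger the cross-zone verification step rather than an alert directly.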
2. Alert Pipeline and Deduplication
A naive implementation sends an alert for every failed check, flooding notification channels with duplicates. Interviewers expect you to design an incident lifecycle that groups related failures and prevents alert storms.
Hints to consider:
- Create incident records as first-class objects with unique IDs and state machine transitions (open, acknowledged, resolved); individual check failures link to incidents rather than generating independent alerts
- Use deduplication windows where multiple failures for the same server within a configurable period (e.g., 5 minutes) map to a single incident
- Design alert batching rules that aggregate related failures (e.g., "3 servers in the us-east database cluster are down") into a single notification rather than 3 separate pages
- Implement idempotency keys for notification dispatch so retries and multiple workers processing the same event never send duplicate emails or pages
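The dedup-window and idempotency ideas above can be combined in one small sketch. Everything here is illustrative (in-memory dicts standing in for a strongly consistent store; `DEDUP_WINDOW_SECONDS` is an assumed config value):

```python
import uuid

DEDUP_WINDOW_SECONDS = 300  # hypothetical 5-minute deduplication window

class IncidentStore:
    """Fold repeated failures for a server into one open incident, and make
    notification dispatch idempotent via a (incident, channel) key."""

    def __init__(self):
        self.incidents = {}    # server_id -> incident record
        self.sent_keys = set() # idempotency keys already dispatched

    def on_failure(self, server_id: str, now: float) -> dict:
        inc = self.incidents.get(server_id)
        if inc and inc["state"] == "open" and now - inc["last_seen"] < DEDUP_WINDOW_SECONDS:
            inc["count"] += 1          # folded into the existing incident
            inc["last_seen"] = now
            return inc
        inc = {"id": str(uuid.uuid4()), "server": server_id,
               "state": "open", "count": 1, "last_seen": now}
        self.incidents[server_id] = inc
        return inc

    def notify(self, incident: dict, channel: str) -> bool:
        key = (incident["id"], channel)  # idempotency key
        if key in self.sent_keys:
            return False                 # retry or duplicate worker: skip send
        self.sent_keys.add(key)
        return True                      # first dispatch for this pair

store = IncidentStore()
a = store.on_failure("db-42", now=0)
b = store.on_failure("db-42", now=60)          # within window -> same incident
assert a["id"] == b["id"] and b["count"] == 2
assert store.notify(a, "pagerduty") is True
assert store.notify(a, "pagerduty") is False   # duplicate page suppressed
```

Note that `sent_keys` is exactly where the strong-consistency requirement from the non-functional section bites: in a multi-worker deployment this set would live in a transactional store, not process memory.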
3. Data Model and Storage Strategy
With hundreds of health events per second generating time-series data, interviewers want to see appropriate storage choices for different access patterns: fast reads for dashboard queries, efficient writes for incoming metrics, and cost-effective retention for historical data.
Hints to consider:
- Use Redis with TTL-based keys for current server state: if a heartbeat key expires (no renewal within the expected interval), the server is considered potentially down
- Store historical health metrics in a time-series database like InfluxDB or TimescaleDB, or a wide-column store like Cassandra, with aggressive TTLs to limit unbounded growth
- Partition health check results by server ID and time to enable efficient writes (append-only) and time-range queries (dashboard charts)
- Separate the hot path (real-time evaluation of health signals) from the cold path (historical analysis and reporting) and optimize each independently
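The TTL-key pattern for current state can be illustrated with an in-memory stand-in for Redis (in production this would be roughly `SET server:<id> alive EX <ttl>` on each heartbeat; the class and the 45-second TTL here are assumptions for the sketch):

```python
class HeartbeatRegistry:
    """In-memory sketch of the Redis TTL pattern: each heartbeat renews a
    key with an expiry, and a missing (expired) key marks the server as
    suspect, pending verification from other monitoring workers."""

    def __init__(self, ttl_seconds: float = 45.0):
        self.ttl = ttl_seconds
        self.expiry = {}  # server_id -> absolute expiry timestamp

    def heartbeat(self, server_id: str, now: float) -> None:
        self.expiry[server_id] = now + self.ttl  # renew the key's TTL

    def status(self, server_id: str, now: float) -> str:
        exp = self.expiry.get(server_id)
        return "up" if exp is not None and now < exp else "suspect"

reg = HeartbeatRegistry(ttl_seconds=45)
reg.heartbeat("db-7", now=0)
assert reg.status("db-7", now=30) == "up"       # renewed within the TTL
assert reg.status("db-7", now=50) == "suspect"  # TTL elapsed -> verify, then alert
```

The TTL is deliberately set longer than the check interval (here 45s for a 30s interval) so one delayed heartbeat does not expire the key.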
4. System Reliability and Self-Monitoring
A monitoring system that fails silently is worse than no monitoring at all. Interviewers expect you to address how the monitoring infrastructure itself stays healthy and avoids becoming its own single point of failure.
Hints to consider:
- Deploy monitoring workers across multiple availability zones so a regional network outage does not stop all health checks
- Implement a "dead man's switch" pattern where the monitoring system itself sends periodic heartbeats to an external watchdog; if the heartbeats stop, the watchdog alerts through an independent channel
- Design all processing workers to be stateless and horizontally scalable, with health check assignments distributed via consistent hashing on server ID
- Add backpressure and rate limiting to prevent alert storms from overwhelming notification services during cascading failures
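The consistent-hashing assignment mentioned above can be sketched with a standard hash ring; worker names and the virtual-node count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Distribute health-check assignments across stateless workers via
    consistent hashing: adding or removing a worker remaps only a small
    fraction of servers, so most checks continue uninterrupted."""

    def __init__(self, workers, vnodes: int = 100):
        # Each worker appears `vnodes` times on the ring for even spread.
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, server_id: str) -> str:
        # First ring position at or after the server's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(server_id)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-a", "worker-b", "worker-c"])
owner = ring.worker_for("db-123")
assert owner in {"worker-a", "worker-b", "worker-c"}
```

Because workers hold no per-server state, a failed worker's assignments simply fall to its ring neighbors on the next rebalance.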
Suggested Approach
Step 1: Clarify Requirements
Confirm the scale and constraints that shape the design. Ask about the number of servers to monitor, typical check frequencies, and acceptable detection latency. Clarify what types of health checks are needed beyond simple connectivity -- query latency percentiles, replication lag, disk and memory thresholds, connection pool metrics. Understand alert routing needs: how many notification channels, what escalation policies, and whether integration with existing incident management tools (PagerDuty, Opsgenie) is required. Confirm availability requirements for the monitoring platform itself and whether brief monitoring gaps during failover are acceptable.