Design a Health Monitoring System for Database Servers
System Design · Must
Problem Statement
Design a distributed monitoring platform that tracks the operational health of thousands of database servers across multiple data centers. The system must detect failures within seconds, distinguish between transient glitches and genuine outages, and route alerts to the appropriate on-call teams through channels like email, Slack, and PagerDuty. Operators need a live dashboard showing real-time fleet status, the ability to define custom health checks beyond simple connectivity (query response times, replication lag, disk usage, connection pool saturation), and tools to manage alert fatigue through intelligent grouping, suppression, and maintenance windows.
This problem is directly relevant to Bloomberg's infrastructure, where thousands of database instances power the Bloomberg Terminal's real-time financial data delivery. The monitoring platform must handle 10,000+ servers generating health signals every 10 to 30 seconds, support sub-minute detection for critical failures, and remain operational even when parts of the monitoring infrastructure itself experience outages. The system must avoid both false positives that erode operator trust and missed alerts that lead to extended, undetected downtime.
Key Requirements
Functional
- Health check configuration -- operators register servers and define checks (heartbeat ping, query execution latency, replication lag, disk usage) with configurable intervals, thresholds, and evaluation windows
- Real-time status dashboard -- display current server states, historical uptime percentages, active incidents, and metric trends with sub-second updates via WebSocket or Server-Sent Events
- Alert routing and escalation -- configurable notification policies that route alerts based on severity, time of day, team ownership, and escalate to management if unacknowledged within a time limit
- Incident lifecycle management -- operators can acknowledge alerts, add investigation notes, silence noisy servers during planned maintenance, and resolve incidents with a full audit trail
Non-Functional
- Scalability -- support 10,000 servers with 30-second check intervals (330+ health events per second baseline), scaling to 100,000+ servers without architectural changes
- Reliability -- no single point of failure in the monitoring pipeline; continue collecting health signals and evaluating rules even if the dashboard or alerting components fail independently
- Latency -- detect critical failures within 60 seconds of occurrence; update the dashboard within 2 seconds of a state change
- Consistency -- eventual consistency acceptable for metric storage and dashboard rendering; strong consistency required for alert deduplication and incident state transitions
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Health Check Architecture and Failure Detection
Interviewers want to understand your strategy for determining whether a database is truly down versus experiencing a transient hiccup. This involves choosing between push-based heartbeats and pull-based polling, handling network partitions where the monitor loses connectivity but the database remains healthy, and minimizing both false positives and false negatives.
Hints to consider:
- Compare agent-based heartbeats (servers push status to the monitoring platform) versus centralized probing (monitoring workers poll each server); discuss failure scenarios where each approach gives incorrect results
- Implement sliding window evaluation: require N failures within M consecutive attempts before declaring an outage, not just a single missed check
- When a server appears down, trigger verification checks from multiple monitoring workers in different network zones before creating an alert, reducing false positives from localized network issues
- Handle "split brain" scenarios where different monitoring regions see conflicting states for the same database by requiring quorum agreement before state transitions
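The sliding-window rule above can be sketched in a few lines. This is a minimal illustration, not a production detector; the class and parameter names (`window_size`, `fail_threshold`) are hypothetical:

```python
from collections import deque

class SlidingWindowDetector:
    """Declare an outage only after `fail_threshold` failures occur within
    the last `window_size` check attempts, instead of on a single miss."""

    def __init__(self, window_size: int = 5, fail_threshold: int = 3):
        self.window = deque(maxlen=window_size)  # rolling record of pass/fail
        self.fail_threshold = fail_threshold

    def record(self, check_passed: bool) -> str:
        self.window.append(check_passed)
        failures = sum(1 for ok in self.window if not ok)
        return "down" if failures >= self.fail_threshold else "up"

# A single missed check does not flip the state:
d = SlidingWindowDetector(window_size=5, fail_threshold=3)
assert d.record(False) == "up"    # 1 failure -- could be a transient glitch
assert d.record(True) == "up"
assert d.record(False) == "up"    # 2 failures in the window
assert d.record(False) == "down"  # 3rd failure within 5 attempts -> outage
```

In a real deployment the "down" verdict would trigger the cross-zone verification step rather than an alert directly.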
2. Alert Pipeline and Deduplication
A naive implementation sends an alert for every failed check, flooding notification channels with duplicates. Interviewers expect you to design an incident lifecycle that groups related failures and prevents alert storms.
Hints to consider:
- Create incident records as first-class objects with unique IDs and state machine transitions (open, acknowledged, resolved); individual check failures link to incidents rather than generating independent alerts
- Use deduplication windows where multiple failures for the same server within a configurable period (e.g., 5 minutes) map to a single incident
- Design alert batching rules that aggregate related failures (e.g., "3 servers in the us-east database cluster are down") into a single notification rather than 3 separate pages
- Implement idempotency keys for notification dispatch so retries and multiple workers processing the same event never send duplicate emails or pages
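The dedup-window and idempotency ideas above can be combined in one small sketch. Everything here is illustrative (in-memory dicts standing in for a strongly consistent store; `DEDUP_WINDOW_SECONDS` is an assumed config value):

```python
import uuid

DEDUP_WINDOW_SECONDS = 300  # hypothetical 5-minute deduplication window

class IncidentStore:
    """Fold repeated failures for a server into one open incident, and make
    notification dispatch idempotent via a (incident, channel) key."""

    def __init__(self):
        self.incidents = {}    # server_id -> incident record
        self.sent_keys = set() # idempotency keys already dispatched

    def on_failure(self, server_id: str, now: float) -> dict:
        inc = self.incidents.get(server_id)
        if inc and inc["state"] == "open" and now - inc["last_seen"] < DEDUP_WINDOW_SECONDS:
            inc["count"] += 1          # folded into the existing incident
            inc["last_seen"] = now
            return inc
        inc = {"id": str(uuid.uuid4()), "server": server_id,
               "state": "open", "count": 1, "last_seen": now}
        self.incidents[server_id] = inc
        return inc

    def notify(self, incident: dict, channel: str) -> bool:
        key = (incident["id"], channel)  # idempotency key
        if key in self.sent_keys:
            return False                 # retry or duplicate worker: skip send
        self.sent_keys.add(key)
        return True                      # first dispatch for this pair

store = IncidentStore()
a = store.on_failure("db-42", now=0)
b = store.on_failure("db-42", now=60)          # within window -> same incident
assert a["id"] == b["id"] and b["count"] == 2
assert store.notify(a, "pagerduty") is True
assert store.notify(a, "pagerduty") is False   # duplicate page suppressed
```

Note that `sent_keys` is exactly where the strong-consistency requirement from the non-functional section bites: in a multi-worker deployment this set would live in a transactional store, not process memory.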
3. Data Model and Storage Strategy
With hundreds of health events per second generating time-series data, interviewers want to see appropriate storage choices for different access patterns: fast reads for dashboard queries, efficient writes for incoming metrics, and cost-effective retention for historical data.
Hints to consider:
- Use Redis with TTL-based keys for current server state: if a heartbeat key expires (no renewal within the expected interval), the server is considered potentially down
- Store historical health metrics in a time-series database like InfluxDB or TimescaleDB, or a wide-column store like Cassandra, with aggressive TTLs to limit unbounded growth
- Partition health check results by server ID and time to enable efficient writes (append-only) and time-range queries (dashboard charts)
- Separate the hot path (real-time evaluation of health signals) from the cold path (historical analysis and reporting) and optimize each independently
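The TTL-key pattern for current state can be illustrated with an in-memory stand-in for Redis (in production this would be roughly `SET server:<id> alive EX <ttl>` on each heartbeat; the class and the 45-second TTL here are assumptions for the sketch):

```python
class HeartbeatRegistry:
    """In-memory sketch of the Redis TTL pattern: each heartbeat renews a
    key with an expiry, and a missing (expired) key marks the server as
    suspect, pending verification from other monitoring workers."""

    def __init__(self, ttl_seconds: float = 45.0):
        self.ttl = ttl_seconds
        self.expiry = {}  # server_id -> absolute expiry timestamp

    def heartbeat(self, server_id: str, now: float) -> None:
        self.expiry[server_id] = now + self.ttl  # renew the key's TTL

    def status(self, server_id: str, now: float) -> str:
        exp = self.expiry.get(server_id)
        return "up" if exp is not None and now < exp else "suspect"

reg = HeartbeatRegistry(ttl_seconds=45)
reg.heartbeat("db-7", now=0)
assert reg.status("db-7", now=30) == "up"       # renewed within the TTL
assert reg.status("db-7", now=50) == "suspect"  # TTL elapsed -> verify, then alert
```

The TTL is deliberately set longer than the check interval (here 45s for a 30s interval) so one delayed heartbeat does not expire the key.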
4. System Reliability and Self-Monitoring
A monitoring system that fails silently is worse than no monitoring at all. Interviewers expect you to address how the monitoring infrastructure itself stays healthy and avoids becoming its own single point of failure.
Hints to consider:
- Deploy monitoring workers across multiple availability zones so a regional network outage does not stop all health checks
- Implement a "dead man's switch" pattern where the monitoring system itself sends periodic heartbeats to an external watchdog; if the heartbeats stop, the watchdog alerts through an independent channel
- Design all processing workers to be stateless and horizontally scalable, with health check assignments distributed via consistent hashing on server ID
- Add backpressure and rate limiting to prevent alert storms from overwhelming notification services during cascading failures
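The consistent-hashing assignment mentioned above can be sketched with a standard hash ring; worker names and the virtual-node count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Distribute health-check assignments across stateless workers via
    consistent hashing: adding or removing a worker remaps only a small
    fraction of servers, so most checks continue uninterrupted."""

    def __init__(self, workers, vnodes: int = 100):
        # Each worker appears `vnodes` times on the ring for even spread.
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, server_id: str) -> str:
        # First ring position at or after the server's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(server_id)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-a", "worker-b", "worker-c"])
owner = ring.worker_for("db-123")
assert owner in {"worker-a", "worker-b", "worker-c"}
```

Because workers hold no per-server state, a failed worker's assignments simply fall to its ring neighbors on the next rebalance.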
Suggested Approach
Step 1: Clarify Requirements
Confirm the scale and constraints that shape the design. Ask about the number of servers to monitor, typical check frequencies, and acceptable detection latency. Clarify what types of health checks are needed beyond simple connectivity -- query latency percentiles, replication lag, disk and memory thresholds, connection pool metrics. Understand alert routing needs: how many notification channels, what escalation policies, and whether integration with existing incident management tools (PagerDuty, Opsgenie) is required. Confirm availability requirements for the monitoring platform itself and whether brief monitoring gaps during failover are acceptable.