Design a Health Monitoring System for Infrastructure

System Design | Must

Problem Statement

Design a platform that monitors real-time financial market data streams and delivers personalized alerts to traders when user-defined conditions are met. The system must ingest price feeds, order book updates, and trade executions from multiple exchanges, evaluate thousands of user-specified rules against this data in real time, and deliver notifications through multiple channels (mobile push, SMS, email) with minimal latency.

Your system will serve both retail traders who monitor a handful of stocks and institutional clients who track thousands of instruments simultaneously. The platform must handle market-open spikes, when trading volume surges 10x, provide sub-second alert delivery for time-sensitive opportunities, and prevent alert fatigue through intelligent deduplication and throttling. Expect to discuss how you'll balance low latency with high reliability, manage the cost of storing historical market data for backtesting, and keep alerts accurate even during exchange outages or data feed irregularities.

Key Requirements

Functional

Market data ingestion -- continuously ingest price ticks, volume data, and trade executions from multiple exchanges and data providers
Custom alert rules -- allow users to define complex conditions combining price movements, technical indicators, volume thresholds, and cross-asset correlations
Multi-channel delivery -- send alerts via mobile push notifications, SMS, email, and webhooks with configurable priority and routing
Alert management -- provide users with alert history, statistics on trigger frequency, and the ability to pause, modify, or delete rules
Backtesting capability -- enable users to test alert rules against historical data to validate effectiveness before deploying

Non-Functional

Scalability -- support 50 million rule evaluations per second during market hours; handle 100,000+ concurrent users with varying portfolio sizes
Reliability -- ensure 99.9% alert delivery success rate; gracefully degrade rather than fail completely during partial outages
Latency -- deliver alerts within 500ms of condition trigger for critical price movements; dashboard queries should return in under 2 seconds
Consistency -- guarantee at-least-once alert delivery; accept eventual consistency for historical data views and analytics

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Stream Processing Architecture for Rule Evaluation

The core challenge is continuously evaluating thousands of user-defined rules against high-velocity market data without missing triggers or introducing unacceptable latency. Interviewers want to see how you partition the workload, maintain stateful computations (like moving averages), and handle rule updates without dropping events.

Hints to consider:

Partition incoming market data by instrument symbol to enable parallel processing while maintaining ordering guarantees
Consider how you'll maintain windows of recent data for technical indicators (e.g., 50-day moving average) and whether to precompute common indicators
Discuss trade-offs between pull-based polling (periodic evaluation) versus push-based reactive evaluation when data arrives
Address how rule changes propagate to processing nodes and whether in-flight evaluations should use old or new rule versions
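
The first two hints can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: the partition count, the `PartitionWorker` class, and the "moving average above threshold" rule shape are all assumptions made for the example. It shows a stable hash routing every tick for a symbol to the same worker, and a rolling mean updated in constant time per tick.

```python
import zlib
from collections import defaultdict, deque
from typing import List

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_for(symbol: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash: every tick for a symbol maps to the same partition."""
    return zlib.crc32(symbol.encode("utf-8")) % num_partitions


class RollingMean:
    """Mean over the last `size` prices, updated in O(1) per tick."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, price: float) -> float:
        self.window.append(price)
        self.total += price
        if len(self.window) > self.size:
            # Evict the oldest price so the window never exceeds `size`.
            self.total -= self.window.popleft()
        return self.total / len(self.window)


class PartitionWorker:
    """Owns per-symbol state and the rules for the symbols routed to it."""

    def __init__(self, window_size: int = 3) -> None:
        self.means = defaultdict(lambda: RollingMean(window_size))
        self.rules = defaultdict(list)  # symbol -> [(rule_id, threshold)]

    def add_rule(self, rule_id: str, symbol: str, threshold: float) -> None:
        self.rules[symbol].append((rule_id, threshold))

    def on_tick(self, symbol: str, price: float) -> List[str]:
        """Push-based evaluation: return ids of rules whose condition holds."""
        mean = self.means[symbol].update(price)
        return [rid for rid, thr in self.rules[symbol] if mean > thr]
```

Because state is keyed by symbol and a symbol never changes partition, each worker can keep its windows in local memory without cross-node coordination. A 50-day average would more plausibly be built from daily closes than from raw ticks; the small tick window here only keeps the example short.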

2. Alert Deduplication and Throttling Strategy

Naive alert systems flood users with redundant notifications when prices oscillate near threshold boundaries or when correlated assets all trigger simultaneously. Interviewers expect you to design intelligent suppression that prevents alert fatigue while ensuring critical notifications still get through.

Hints to consider:

Implement time-based suppression windows where subsequent triggers of the same rule are batched together
Consider hysteresis thresholds (e.g., alert when price crosses $100, but don't re-alert until it drops to $98 and crosses $100 again)
Design priority levels where critical alerts bypass throttling while informational ones may be delayed or bundled
Discuss how to handle "all clear" or resolution notifications when conditions return to normal
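
One way to combine the hysteresis and suppression-window hints is sketched below. The class name `HysteresisRule` and its fields are hypothetical, and the choice to drop (rather than batch) a trigger that lands inside the cooldown is an assumption made to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class HysteresisRule:
    trigger: float      # fire when price rises to or above this level
    reset: float        # re-arm only after price falls to or below this level
    cooldown_s: float   # minimum seconds between two notifications
    armed: bool = True
    last_fired: float = float("-inf")

    def on_price(self, price: float, now: float) -> bool:
        """Return True when a notification should be sent for this price."""
        if not self.armed:
            # Disarmed: stay silent until price retreats to the reset level.
            if price <= self.reset:
                self.armed = True
            return False
        if price >= self.trigger:
            self.armed = False
            # Suppress the alert if the previous one was too recent.
            if now - self.last_fired >= self.cooldown_s:
                self.last_fired = now
                return True
        return False
```

With `trigger=100`, `reset=98`, and a 60-second cooldown, a price oscillating between 99.9 and 100.3 produces one alert instead of one per crossing. Critical-priority rules could be modeled by setting `cooldown_s` to zero.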

3. Market Data Ingestion and Normalization

Different exchanges provide data in varying formats, update frequencies, and quality levels. You must ingest feeds reliably, normalize across sources, handle missing or delayed data, and detect anomalies that could trigger false alerts.

Hints to consider:

Use a message queue to buffer incoming feeds and absorb traffic spikes during market events
Implement schema normalization to unify data from different exchanges into a consistent internal format
Detect and filter outlier data points (e.g., a stock suddenly showing $0.01 due to feed errors)
Consider using multiple redundant data providers and implementing fallback logic when primary feeds fail
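
The normalization and outlier hints might look like the sketch below. Both exchange message formats are invented for illustration, as are the `Tick` type and the 20% jump limit; real feeds and real thresholds will differ.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Tick:
    """Internal format that every feed is normalized into."""
    symbol: str
    price: float
    ts_ms: int
    source: str


def normalize_exchange_a(msg: dict) -> Tick:
    # Hypothetical format: {"sym": "AAPL", "px": "189.50", "t": 1700000000000}
    return Tick(msg["sym"].upper(), float(msg["px"]), int(msg["t"]), "A")


def normalize_exchange_b(msg: dict) -> Tick:
    # Hypothetical format: {"ticker": "aapl", "price_cents": 18950, "time_s": 1700000000}
    return Tick(
        msg["ticker"].upper(),
        msg["price_cents"] / 100.0,
        int(msg["time_s"]) * 1000,
        "B",
    )


class OutlierFilter:
    """Reject ticks that jump too far from the last accepted price."""

    def __init__(self, max_jump: float = 0.20) -> None:
        self.max_jump = max_jump          # assumed limit: 20% per tick
        self.last_good: Dict[str, float] = {}

    def accept(self, tick: Tick) -> bool:
        if tick.price <= 0:
            return False
        prev = self.last_good.get(tick.symbol)
        if prev is not None and abs(tick.price - prev) / prev > self.max_jump:
            return False  # likely a feed error; keep the previous reference
        self.last_good[tick.symbol] = tick.price
        return True
```

A filter this simple would also reject a genuine large gap indefinitely, so a fuller design would probably confirm suspicious prices against a second provider or accept them after several consistent ticks.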

4. Cost-Efficient Storage for Historical Market Data

Users want to backtest alert rules and view historical charts, but storing tick-level data for thousands of instruments across years is prohibitively expensive. Interviewers want to see how you'll tier storage and balance query performance with cost.

Hints to consider:

Store recent high-resolution data (tick-level) in hot storage for real-time evaluation and near-term backtesting
Downsample older data to minute or hourly aggregates and move to cheaper cold storage after a retention period
Separate storage for raw market data versus derived metrics like computed indicators
Consider time-series databases optimized for compression and range queries over chronological data
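
Downsampling is easy to show concretely. The sketch below rolls ticks up into one-minute open/high/low/close/volume bars; the tuple layout `(ts_ms, price, volume)` and the dictionary keys are assumptions for the example, and the input is assumed to arrive in timestamp order.

```python
from typing import Dict, Iterable, List, Tuple

MINUTE_MS = 60_000


def downsample_to_minute_bars(
    ticks: Iterable[Tuple[int, float, int]]
) -> List[dict]:
    """Aggregate (ts_ms, price, volume) ticks into one bar per minute."""
    bars: Dict[int, dict] = {}
    for ts_ms, price, volume in ticks:
        start = ts_ms // MINUTE_MS * MINUTE_MS  # floor to the minute boundary
        bar = bars.get(start)
        if bar is None:
            bars[start] = {
                "start_ms": start,
                "open": price,
                "high": price,
                "low": price,
                "close": price,
                "volume": volume,
            }
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last tick seen in the minute
            bar["volume"] += volume
    return [bars[k] for k in sorted(bars)]
```

Each bar has a fixed size no matter how many ticks fell in the minute, which is where the storage saving comes from. The cost is that backtests on downsampled ranges can no longer see intra-minute ordering.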

5. Delivery Guarantees and Channel Reliability

Alert delivery crosses system boundaries to third-party services (SMS gateways, push notification providers) that may be unavailable or rate-limited. You need to ensure critical alerts get through while managing costs and avoiding overwhelming external APIs.

Hints to consider:

Implement a retry queue with exponential backoff for failed delivery attempts
Track delivery status and provide users with confirmation that alerts were sent successfully
Consider circuit breakers for delivery channels that are consistently failing to prevent cascading failures
Design fallback channels (e.g., if push fails, try SMS) for high-priority alerts
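
The retry, circuit-breaker, and fallback hints fit together roughly as follows. Everything here is a simplified sketch under stated assumptions: the `CircuitBreaker` resets fully after a timeout instead of modeling a separate half-open state, channel senders are plain callables returning `True` on success, and the thresholds are placeholders.

```python
import time
from typing import Callable, List, Optional, Tuple


class CircuitBreaker:
    """Opens after consecutive failures; allows traffic again after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Simplification: close fully instead of tracking a half-open state.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def backoff_delays(base_s: float, count: int, cap_s: float) -> List[float]:
    """Exponential backoff: base, 2*base, 4*base, ... capped at cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(count)]


Channel = Tuple[str, Callable[[str], bool], CircuitBreaker]


def deliver(alert: str, channels: List[Channel], attempts: int = 3,
            base_s: float = 0.1, cap_s: float = 2.0,
            sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Try channels in priority order; return the name of the one that worked."""
    delays = backoff_delays(base_s, attempts - 1, cap_s)
    for name, send, breaker in channels:
        for attempt in range(attempts):
            if not breaker.allow():
                break  # channel is failing; fall through to the next one
            if attempt > 0:
                sleep(delays[attempt - 1])
            ok = send(alert)
            breaker.record(ok)
            if ok:
                return name
    return None
```

A call such as `deliver("AAPL crossed 100", [("push", send_push, push_breaker), ("sms", send_sms, sms_breaker)])` would try push with backoff and then fall back to SMS, where `send_push` and `send_sms` stand in for real provider clients. Since a retry can follow a send that actually succeeded but timed out, this gives at-least-once behavior, and recipients may occasionally see a duplicate.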
