Design a Health Monitoring System for Infrastructure
Problem Statement
Design a platform that monitors real-time financial market data streams and delivers personalized alerts to traders when user-defined conditions are met. The system must ingest price feeds, order book updates, and trade executions from multiple exchanges, evaluate thousands of user-specified rules against this data in real time, and deliver notifications through multiple channels (mobile push, SMS, email) with minimal latency.
Your system will serve both retail traders who monitor a handful of stocks and institutional clients who track thousands of instruments simultaneously. The platform must handle market open spikes when trading volume surges 10x, provide sub-second alert delivery for time-sensitive opportunities, and prevent alert fatigue through intelligent deduplication and throttling. Expect to discuss how you'll balance low latency with high reliability, manage the cost of storing historical market data for backtesting, and ensure alerts remain accurate even during exchange outages or data feed irregularities.
Key Requirements
Functional
- Market data ingestion -- continuously ingest price ticks, volume data, and trade executions from multiple exchanges and data providers
- Custom alert rules -- allow users to define complex conditions combining price movements, technical indicators, volume thresholds, and cross-asset correlations
- Multi-channel delivery -- send alerts via mobile push notifications, SMS, email, and webhooks with configurable priority and routing
- Alert management -- provide users with alert history, statistics on trigger frequency, and the ability to pause, modify, or delete rules
- Backtesting capability -- enable users to test alert rules against historical data to validate effectiveness before deploying
Non-Functional
- Scalability -- sustain 50 million rule evaluations per second during market hours; support 100,000+ concurrent users with varying portfolio sizes
- Reliability -- ensure 99.9% alert delivery success rate; gracefully degrade rather than fail completely during partial outages
- Latency -- deliver alerts within 500ms of condition trigger for critical price movements; dashboard queries should return in under 2 seconds
- Consistency -- guarantee at-least-once alert delivery; accept eventual consistency for historical data views and analytics
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Stream Processing Architecture for Rule Evaluation
The core challenge is continuously evaluating thousands of user-defined rules against high-velocity market data without missing triggers or introducing unacceptable latency. Interviewers want to see how you partition the workload, maintain stateful computations (like moving averages), and handle rule updates without dropping events.
Hints to consider:
- Partition incoming market data by instrument symbol to enable parallel processing while maintaining ordering guarantees
- Consider how you'll maintain windows of recent data for technical indicators (e.g., 50-day moving average) and whether to precompute common indicators
- Discuss trade-offs between pull-based polling (periodic evaluation) versus push-based reactive evaluation when data arrives
- Address how rule changes propagate to processing nodes and whether in-flight evaluations should use old or new rule versions
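As an illustration of the first two hints, here is a minimal sketch of symbol-based partitioning and an incrementally maintained moving average. The names `partition_for` and `RollingMean` are hypothetical, and the choice of CRC32 as the stable hash is an assumption; a real deployment would more likely rely on the partitioner of its message queue.

```python
import zlib
from collections import deque
from typing import Optional


def partition_for(symbol: str, num_partitions: int) -> int:
    """Map a symbol to a partition using a hash that is stable across processes.

    All ticks for one symbol land on the same partition, so per-symbol
    ordering is preserved while different symbols are processed in parallel.
    """
    return zlib.crc32(symbol.encode("utf-8")) % num_partitions


class RollingMean:
    """Mean over the last `size` prices, updated in O(1) per tick."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, price: float) -> Optional[float]:
        """Add a price; return the mean, or None until the window is full."""
        self.window.append(price)
        self.total += price
        if len(self.window) > self.size:
            # Evict the oldest price so the running total covers `size` items.
            self.total -= self.window.popleft()
        if len(self.window) < self.size:
            return None
        return self.total / self.size


if __name__ == "__main__":
    print(partition_for("AAPL", 8))
    mean = RollingMean(3)
    for p in (1.0, 2.0, 3.0, 4.0):
        print(mean.update(p))
```

Keeping a running total avoids rescanning the window on every tick, at the cost of slow floating-point drift over very long runs; periodically recomputing the sum from the window is one way to bound that drift.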
2. Alert Deduplication and Throttling Strategy
Naive alert systems flood users with redundant notifications when prices oscillate near threshold boundaries or when correlated assets all trigger simultaneously. Interviewers expect you to design intelligent suppression that prevents alert fatigue while ensuring critical notifications still get through.
Hints to consider:
- Implement time-based suppression windows where subsequent triggers of the same rule are batched together
- Consider hysteresis thresholds (e.g., alert when price crosses $100, but don't re-alert until it drops to $98 and crosses $100 again)
- Design priority levels where critical alerts bypass throttling while informational ones may be delayed or bundled
- Discuss how to handle "all clear" or resolution notifications when conditions return to normal
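The hysteresis and suppression-window hints can be combined in a small per-rule state machine. The sketch below is one possible shape, not a prescribed design; `HysteresisRule` and its field names are hypothetical, and the $100/$98 levels come from the example above.

```python
from dataclasses import dataclass


@dataclass
class HysteresisRule:
    trigger: float       # alert when price rises to or above this level
    rearm: float         # re-arm only after price falls to or below this level
    cooldown_s: float    # minimum seconds between alerts for this rule
    armed: bool = True
    last_fired: float = float("-inf")

    def on_price(self, price: float, now: float) -> bool:
        """Return True if this tick should produce an alert."""
        if not self.armed:
            # Disarmed: wait for the price to fall back to the re-arm level.
            if price <= self.rearm:
                self.armed = True
            return False
        if price >= self.trigger:
            self.armed = False
            # Suppress the alert if the previous one was too recent.
            if now - self.last_fired >= self.cooldown_s:
                self.last_fired = now
                return True
        return False


if __name__ == "__main__":
    rule = HysteresisRule(trigger=100.0, rearm=98.0, cooldown_s=60.0)
    ticks = [(99.5, 0), (100.2, 1), (99.8, 2), (100.1, 3), (97.9, 4), (100.3, 5)]
    for price, now in ticks:
        print(price, rule.on_price(price, now))
```

In this sketch only the 100.2 tick fires: the oscillation around $100 is ignored because the rule is disarmed, and the later cross at 100.3 falls inside the 60-second cooldown. A critical-priority rule could set `cooldown_s` to 0 so that it bypasses throttling.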
3. Market Data Ingestion and Normalization
Different exchanges provide data in varying formats, update frequencies, and quality levels. You must ingest feeds reliably, normalize across sources, handle missing or delayed data, and detect anomalies that could trigger false alerts.
Hints to consider:
- Use a message queue to buffer incoming feeds and absorb traffic spikes during market events
- Implement schema normalization to unify data from different exchanges into a consistent internal format
- Detect and filter outlier data points (e.g., a stock suddenly showing $0.01 due to feed errors)
- Consider using multiple redundant data providers and implementing fallback logic when primary feeds fail
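To make the normalization and outlier-filtering hints concrete, here is a rough sketch. Both wire formats (`normalize_exchange_a`, `normalize_exchange_b`) are invented for illustration and do not correspond to any real exchange, and the 20% jump limit is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tick:
    """Internal, exchange-independent representation of a trade tick."""
    symbol: str
    price: float
    size: int
    ts_ms: int
    source: str


def normalize_exchange_a(msg: dict) -> Tick:
    # Hypothetical format: string price, millisecond timestamp.
    return Tick(msg["sym"], float(msg["px"]), int(msg["qty"]), int(msg["t"]), "A")


def normalize_exchange_b(msg: dict) -> Tick:
    # Hypothetical format: price in integer cents, timestamp in seconds.
    return Tick(msg["ticker"].upper(), msg["price_cents"] / 100,
                int(msg["volume"]), int(msg["ts"] * 1000), "B")


class OutlierFilter:
    """Reject ticks that are non-positive or jump too far from the last good price."""

    def __init__(self, max_jump: float = 0.2) -> None:
        self.max_jump = max_jump   # maximum allowed relative change (assumed 20%)
        self.last_good = {}        # symbol -> last accepted price

    def accept(self, tick: Tick) -> bool:
        if tick.price <= 0:
            return False
        prev = self.last_good.get(tick.symbol)
        if prev is not None and abs(tick.price - prev) / prev > self.max_jump:
            return False
        self.last_good[tick.symbol] = tick.price
        return True
```

A filter this simple would also reject a genuine gap larger than the limit and keep rejecting until prices return, so a fuller design would likely confirm suspicious ticks against a second data provider before discarding them.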
4. Cost-Efficient Storage for Historical Market Data
Users want to backtest alert rules and view historical charts, but storing tick-level data for thousands of instruments across years is prohibitively expensive. Interviewers want to see how you'll tier storage and balance query performance with cost.
Hints to consider:
- Store recent high-resolution data (tick-level) in hot storage for real-time evaluation and near-term backtesting
- Downsample older data to minute or hourly aggregates and move to cheaper cold storage after a retention period
- Store raw market data separately from derived metrics such as computed indicators
- Consider time-series databases optimized for compression and range queries over chronological data
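The downsampling hint can be sketched as a single pass that folds time-ordered ticks into open/high/low/close/volume bars. The `Bar` type and `downsample` function are hypothetical names, and the sketch assumes ticks for one instrument arrive sorted by timestamp.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Bar:
    start_ms: int   # start of the bucket, in epoch milliseconds
    open: float
    high: float
    low: float
    close: float
    volume: int


def downsample(ticks: Iterable[Tuple[int, float, int]],
               bucket_ms: int = 60_000) -> List[Bar]:
    """Aggregate time-ordered (ts_ms, price, size) ticks into OHLCV bars."""
    bars: List[Bar] = []
    for ts_ms, price, size in ticks:
        start = ts_ms - ts_ms % bucket_ms  # floor timestamp to bucket boundary
        if bars and bars[-1].start_ms == start:
            bar = bars[-1]
            bar.high = max(bar.high, price)
            bar.low = min(bar.low, price)
            bar.close = price
            bar.volume += size
        else:
            bars.append(Bar(start, price, price, price, price, size))
    return bars


if __name__ == "__main__":
    ticks = [(0, 10.0, 1), (30_000, 12.0, 2), (59_999, 9.0, 3), (60_000, 11.0, 4)]
    for bar in downsample(ticks):
        print(bar)
```

Four ticks collapse into two one-minute bars here. The same function with `bucket_ms=3_600_000` would produce hourly bars for older data, at the cost of no longer being able to backtest rules that depend on intra-bar price paths.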
5. Delivery Guarantees and Channel Reliability
Alert delivery crosses system boundaries to third-party services (SMS gateways, push notification providers) that may be unavailable or rate-limited. You need to ensure critical alerts get through while managing costs and avoiding overwhelming external APIs.
Hints to consider:
- Implement a retry queue with exponential backoff for failed delivery attempts
- Track delivery status and provide users with confirmation that alerts were sent successfully
- Consider circuit breakers for delivery channels that are consistently failing to prevent cascading failures
- Design fallback channels (e.g., if push fails, try SMS) for high-priority alerts
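A minimal sketch of the retry and fallback hints follows. The `deliver` function and the channel callables are hypothetical stand-ins for real provider clients, and the retry counts and delays are illustrative assumptions. The `sleep` and `jitter` parameters are injectable only so the logic can be exercised without real waiting.

```python
import random
import time
from typing import Callable, List, Optional, Sequence, Tuple

Sender = Callable[[str], bool]  # returns True when the provider accepted the message


def backoff_delays(attempts: int, base_s: float = 0.1, cap_s: float = 5.0) -> List[float]:
    """Exponential schedule: base, 2*base, 4*base, ... capped at cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


def deliver(message: str,
            channels: Sequence[Tuple[str, Sender]],
            attempts: int = 3,
            sleep: Callable[[float], None] = time.sleep,
            jitter: Callable[[], float] = random.random) -> Optional[str]:
    """Try channels in priority order; retry each with backoff before falling back.

    Returns the name of the channel that succeeded, or None if all failed.
    """
    delays = backoff_delays(attempts - 1)
    for name, send in channels:
        for attempt in range(attempts):
            try:
                ok = send(message)
            except Exception:
                ok = False  # treat provider errors as a failed attempt
            if ok:
                return name
            if attempt < attempts - 1:
                # Jitter spreads retries out so clients do not retry in lockstep.
                sleep(delays[attempt] * (0.5 + jitter() / 2))
    return None
```

Because a provider may accept a message and then time out before acknowledging it, retries like these can produce duplicates; that is consistent with the at-least-once guarantee in the requirements, and an idempotency key per alert is one common way to let downstream consumers discard repeats. A circuit breaker would sit in front of `send` and skip a channel entirely while it is known to be failing.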