Design a Distributed Log Search and Analytics Platform
System Design · Must
Problem Statement
Design a distributed platform that ingests, indexes, and enables real-time search across log data generated by hundreds of thousands of application servers and services. The system must handle tens of millions of log events per second, support full-text search queries with sub-second latency, provide aggregation and analytics capabilities, and allow users to set up alerts based on log patterns or anomalies. Think of systems like Splunk, Elasticsearch/ELK Stack, or Sumo Logic that power operational intelligence for modern cloud infrastructure.
The core challenge is building an ingestion pipeline that can absorb massive write volumes without data loss, an indexing layer that makes unstructured or semi-structured logs queryable in near real-time, and a query engine that delivers fast results across petabytes of data while maintaining cost efficiency. Your design must handle heterogeneous log formats, support multi-tenancy, manage data retention policies, and provide horizontal scalability as log volume grows exponentially.
Key Requirements
Functional
- Log Ingestion -- accept log streams from hundreds of thousands of sources via multiple protocols (HTTP, syslog, agent-based collectors) with guaranteed delivery
- Full-Text Search -- enable users to search logs using keywords, phrases, regular expressions, and field-based filters across time ranges with sub-second response times
- Aggregation and Analytics -- support count, sum, average, percentile calculations grouped by fields, time buckets, or custom dimensions for dashboards and reporting
- Pattern-Based Alerting -- allow users to define alert rules that trigger when specific log patterns appear, thresholds are exceeded, or anomalies are detected
- Multi-Tenancy -- isolate log data, queries, and quotas across different organizations or teams sharing the same infrastructure
Non-Functional
- Scalability -- handle 50 million log events per second ingestion rate with ability to scale linearly as traffic grows
- Reliability -- ensure no data loss during ingestion spikes, component failures, or network partitions with at-least-once delivery guarantees
- Latency -- achieve end-to-end latency from log generation to searchability under 10 seconds; query response time under 2 seconds at the 95th percentile
- Consistency -- accept eventual consistency for search results but maintain strict ordering within each log source stream
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline Architecture and Backpressure Handling
Interviewers want to see how you design a write-heavy pipeline that can absorb variable traffic patterns without dropping data or overwhelming downstream components. Log traffic often exhibits spikes during deployments, incidents, or batch job runs, so your architecture must gracefully handle 10x normal load.
Hints to consider:
- Use a message queue or streaming platform to decouple producers from consumers and provide durable buffering during traffic spikes
- Implement client-side batching and compression to reduce network overhead and improve throughput
- Design a load balancer with health checks that routes traffic away from struggling ingestion nodes
- Consider rate limiting and admission control per tenant to prevent noisy neighbors from impacting others
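The per-tenant admission-control hint above can be sketched with a token bucket. This is a minimal illustrative sketch, not a production implementation: the class name, rates, and the injectable `now` clock are assumptions made for clarity.

```python
import time


class TenantRateLimiter:
    """Token-bucket admission control per tenant (illustrative sketch).

    Each tenant gets `rate` events/sec of sustained capacity and may
    burst up to `burst` events at once; events beyond that are rejected
    so one noisy tenant cannot starve the shared ingestion pipeline.
    """

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}  # tenant_id -> (tokens, last_refill_ts)

    def allow(self, tenant_id, cost=1.0, now=None):
        # `now` is injectable for deterministic testing.
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(tenant_id, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= cost:
            self.buckets[tenant_id] = (tokens - cost, now)
            return True
        self.buckets[tenant_id] = (tokens, now)
        return False
```

In a real deployment the rejected events would typically get a 429-style response so well-behaved agents back off and retry, rather than being silently dropped.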
2. Indexing Strategy for Fast Full-Text Search
The crux of this system is making unstructured text searchable efficiently. Interviewers expect you to discuss inverted indexes, sharding strategies, and how to balance write throughput with query performance as data volume grows.
Hints to consider:
- Partition data by time ranges (e.g., daily or hourly indices) to enable efficient pruning of old data and faster queries on recent logs
- Build inverted indices that map terms to document IDs, with positional information for phrase queries
- Use bloom filters or other probabilistic structures to quickly skip shards that definitely don't contain search terms
- Consider field extraction and schema-on-read vs. schema-on-write tradeoffs for semi-structured logs like JSON
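The positional inverted index from the hints above can be sketched in a few lines. This is a toy in-memory version for illustration only; real systems add tokenization rules, compression of posting lists, and segment merging.

```python
from collections import defaultdict


class InvertedIndex:
    """Minimal positional inverted index (illustrative sketch).

    Maps each term to {doc_id: [positions]}, which supports both
    keyword lookups and phrase queries via adjacent-position checks.
    """

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: [positions]}

    def add(self, doc_id, text):
        # Naive whitespace tokenizer; real indexers normalize much more.
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def search(self, term):
        """Doc IDs containing the term."""
        return set(self.postings.get(term.lower(), {}))

    def phrase(self, terms):
        """Doc IDs where the terms appear consecutively, in order."""
        terms = [t.lower() for t in terms]
        candidates = set.intersection(*(self.search(t) for t in terms))
        hits = set()
        for doc in candidates:
            starts = self.postings[terms[0]][doc]
            if any(all(p + i in self.postings[t][doc]
                       for i, t in enumerate(terms))
                   for p in starts):
                hits.add(doc)
        return hits
```

The positional lists are what make phrase queries possible without re-reading raw documents, at the cost of a larger index; many engines make positions optional per field for exactly this tradeoff.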
3. Query Execution and Result Aggregation at Scale
When a user searches across billions of logs, the query must fan out to hundreds of shards, execute in parallel, and aggregate results efficiently. Interviewers look for understanding of distributed query planning, result merging, and optimizations that avoid scanning unnecessary data.
Hints to consider:
- Implement a scatter-gather pattern where a coordinator fans out queries to shards based on time range and other filters
- Use columnar storage or field indices to avoid full document scans for aggregation queries
- Cache frequently accessed results or precompute common aggregations for dashboards
- Discuss pagination, early termination, and approximate algorithms (like HyperLogLog for cardinality) to handle very large result sets
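The scatter-gather pattern with time-range pruning can be sketched as below. The `Shard` stand-in and its `search` signature are assumptions for illustration; a real coordinator would also handle shard failures, partial results, and per-shard timeouts.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Shard:
    """Stand-in for a time-partitioned index shard (hypothetical)."""
    min_ts: int
    max_ts: int
    events: list  # (timestamp, message) tuples

    def search(self, query, start_ts, end_ts):
        return [(ts, msg) for ts, msg in self.events
                if start_ts <= ts <= end_ts and query in msg]


def scatter_gather(shards, query, start_ts, end_ts, limit=10):
    """Fan a query out to relevant shards and merge results newest-first."""
    # Time-range pruning: skip shards that cannot overlap the query window.
    relevant = [s for s in shards
                if s.max_ts >= start_ts and s.min_ts <= end_ts]
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(
            lambda s: s.search(query, start_ts, end_ts), relevant))
    # Merge partial results, keeping only the newest `limit` hits.
    return heapq.nlargest(limit,
                          (hit for part in partials for hit in part),
                          key=lambda hit: hit[0])
```

Because shards are partitioned by time, the pruning step alone often eliminates most of the cluster for queries over recent data, which is why time-based indices pair so well with this pattern.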
4. Data Retention, Tiering, and Cost Management
With logs growing at terabytes per day, raw storage costs become prohibitive. Interviewers want to see lifecycle management strategies that balance accessibility, cost, and compliance requirements.
Hints to consider:
- Implement hot-warm-cold tiering where recent logs stay on fast SSDs and older data moves to cheaper object storage
- Apply downsampling or rollup strategies that keep summary statistics while discarding detailed logs after a retention period
- Use compression aggressively since logs are write-once and often highly compressible
- Support tenant-specific retention policies and quotas to control costs in multi-tenant deployments
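The hot-warm-cold lifecycle above reduces to a small policy decision per index. The tier names and day thresholds below are illustrative assumptions, not any specific product's defaults.

```python
from dataclasses import dataclass


@dataclass
class RetentionPolicy:
    """Per-tenant index lifecycle policy (hypothetical thresholds)."""
    hot_days: int = 7      # SSD-backed, fully indexed, fastest queries
    warm_days: int = 30    # cheaper disks, searchable but slower
    cold_days: int = 365   # object storage, rehydrated on demand

    def tier_for(self, age_days):
        """Map an index's age to its storage tier."""
        if age_days < self.hot_days:
            return "hot"
        if age_days < self.warm_days:
            return "warm"
        if age_days < self.cold_days:
            return "cold"
        return "delete"
```

A background job would evaluate this policy per tenant (each tenant can carry its own thresholds) and move or expire indices accordingly, which is also where rollups would be written before detailed data is discarded.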
5. Alerting Pipeline and Anomaly Detection
Real-time alerting on log patterns requires continuous evaluation of streaming data without adding significant latency or load. Interviewers look for separation of concerns between indexing and alert evaluation.
Hints to consider:
- Build a separate stream processing path that evaluates alert rules on incoming logs before or during indexing
- Implement stateful windowing to detect threshold breaches over time windows (e.g., error rate exceeds 5% in last 5 minutes)
- Use deduplication and rate limiting to prevent alert storms when the same issue generates thousands of matching logs
- Consider pluggable notification channels with retry logic and backoff for delivery failures
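The stateful-windowing and alert-suppression hints combine naturally into one rule evaluator. This sketch keeps the window in memory and uses an injectable clock for testing; a real stream processor would checkpoint this state and shard rules across workers.

```python
import time
from collections import deque


class ErrorRateAlert:
    """Sliding-window threshold rule with suppression (illustrative sketch).

    Fires when the error fraction over the last `window_s` seconds
    exceeds `threshold`, then suppresses re-firing for `cooldown_s`
    seconds so a sustained incident yields one alert, not thousands.
    """

    def __init__(self, window_s=300, threshold=0.05, cooldown_s=600):
        self.window_s = window_s
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.events = deque()  # (timestamp, is_error)
        self.last_fired = float("-inf")

    def observe(self, is_error, now=None):
        now = time.time() if now is None else now
        self.events.append((now, is_error))
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        rate = errors / len(self.events)
        if rate > self.threshold and now - self.last_fired >= self.cooldown_s:
            self.last_fired = now
            return True  # caller dispatches one notification per cooldown
        return False
```

Running this on the ingestion stream, before or alongside indexing, keeps alert latency independent of how long documents take to become searchable.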