Design a System Metrics Monitoring and Alerting Platform
Problem Statement
Design a metrics monitoring and alerting platform that collects performance telemetry (CPU utilization, memory consumption, request throughput, error rates) from hundreds of thousands of servers, stores this data efficiently as time series, enables engineers to visualize it on dashboards, and triggers notifications when user-defined thresholds are breached. Think of products like Datadog, Prometheus with Grafana, or Amazon CloudWatch.
The core workflow spans five stages: lightweight agents on servers emit metric data points, an ingestion pipeline absorbs millions of data points per second, a time-series store retains data at varying resolutions, a query engine powers dashboards and ad-hoc exploration, and an alerting engine continuously evaluates rules and dispatches notifications. The system must operate reliably even when the infrastructure it monitors is degrading.
Interviewers ask this question to test whether you can design a high-throughput write pipeline, choose an appropriate time-series storage model, build a cost-effective retention strategy with rollups and tiering, and architect a low-latency alerting engine that evaluates thousands of rules without coupling to the ingestion hot path.
Key Requirements
Functional
- Metric collection -- ingest structured metric data points (timestamp, metric name, value, labels/tags) from agents running on hundreds of thousands of servers via push or pull protocols
- Dashboard visualization -- allow engineers to build dashboards that query and graph metrics across arbitrary time ranges, with filtering by labels, grouping, and aggregation functions (avg, sum, p99)
- Threshold-based alerting -- enable users to define alert rules specifying a metric, label filters, aggregation function, time window, threshold, and severity, with notifications sent via email, Slack, PagerDuty, or webhooks
- Retention and resolution management -- support configurable retention policies that keep raw-resolution data for recent periods and automatically downsample older data to reduce storage costs
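The functional requirements above imply a common shape for a metric data point: a name, a timestamp, a numeric value, and a set of labels that together identify a unique time series. A minimal sketch (the `MetricPoint` class and `series_key` format are illustrative, not from any particular product):

```python
from dataclasses import dataclass, field

@dataclass
class MetricPoint:
    """One metric data point: the unit the agents emit and the pipeline ingests."""
    name: str                  # e.g. "cpu.utilization"
    timestamp: int             # Unix epoch seconds
    value: float
    labels: dict = field(default_factory=dict)  # e.g. {"host": "web-42"}

    def series_key(self) -> str:
        """A time series is identified by the metric name plus its full,
        order-independent label set."""
        tags = ",".join(f"{k}={v}" for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{tags}}}"
```

Sorting labels before building the key matters: agents may serialize labels in any order, and the same logical series must always map to the same key.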
Non-Functional
- Scalability -- sustain 10 million metric data points per second ingestion rate with linear horizontal scaling as new servers and services are added
- Reliability -- tolerate node failures, network partitions, and ingestion spikes without losing data or missing alerts; the monitoring system must be more reliable than the systems it monitors
- Latency -- end-to-end latency from metric emission to dashboard visibility under 30 seconds; alert evaluation and notification dispatch within 60 seconds of threshold breach
- Cost efficiency -- store months to years of metric history without unbounded storage growth by applying rollup, compression, and tiered storage strategies
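A quick back-of-envelope estimate grounds the cost-efficiency requirement. Assuming roughly 16 bytes per raw point (8-byte timestamp plus 8-byte float, with labels stored once per series in an index rather than per point) and a ~10x compression ratio from Gorilla-style delta/XOR encoding, both of which are assumptions for illustration:

```python
points_per_sec = 10_000_000
bytes_per_point = 16        # 8-byte timestamp + 8-byte float value
seconds_per_day = 86_400

raw_per_day = points_per_sec * bytes_per_point * seconds_per_day
compressed_per_day = raw_per_day / 10   # assumed ~10x compression

print(f"raw: {raw_per_day / 1e12:.1f} TB/day, "
      f"compressed: {compressed_per_day / 1e12:.2f} TB/day")
```

This works out to roughly 13.8 TB/day raw, or about 1.4 TB/day compressed, which is why downsampling and tiering to object storage are not optional at this scale.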
What Interviewers Focus On
Based on real interview experiences at TikTok and similar companies, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Metrics pipelines are dominated by writes. Hundreds of thousands of agents emit data continuously, and bursts occur during deployments, autoscaling events, or incidents. Interviewers want to see how you absorb this volume without dropping data or overwhelming downstream stores.
Hints to consider:
- Place Kafka (or a similar durable streaming platform) between agents and the storage layer to decouple ingestion rate from write throughput capacity
- Partition Kafka topics by metric name or label hash to distribute load and preserve per-series ordering
- Implement client-side batching and compression in the agents to reduce network calls and bandwidth
- Add per-tenant rate limiting and admission control at the ingestion gateway to prevent cardinality explosions from mislabeled metrics
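The partitioning hint above hinges on a deterministic mapping from series identity to Kafka partition. A minimal sketch, assuming partitioning by a hash of metric name plus the sorted label set (the function name and separator format are illustrative):

```python
import hashlib

def partition_for(metric_name: str, labels: dict, num_partitions: int) -> int:
    """Hash the full series identity so every point of a given series
    lands on the same partition, preserving per-series ordering."""
    series = metric_name + "|" + ",".join(
        f"{k}={v}" for k, v in sorted(labels.items())
    )
    digest = hashlib.sha1(series.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Hashing on the full label set (rather than just the metric name) spreads high-volume metrics like `http.requests` across partitions, while still keeping each individual series ordered.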
2. Time-Series Storage and Retention Strategy
Storing raw-resolution data indefinitely is prohibitively expensive at scale. Interviewers expect you to discuss storage engine choices, data modeling for time-series workloads, and a rollup/tiering strategy that balances query fidelity with cost.
Hints to consider:
- Use a wide-column store like Cassandra with wide-row partitioning by metric name and time bucket, or a purpose-built time-series database like InfluxDB or VictoriaMetrics
- Organize data into time-bucketed segments (hourly or daily) so retention policies can drop or archive entire segments efficiently
- Run periodic rollup jobs that compute pre-aggregated summaries (1-minute, 5-minute, 1-hour granularity) and store them in separate tables
- Tier cold data to object storage (S3) with a query adapter that transparently fetches from the appropriate tier based on the requested time range
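The rollup jobs described above reduce storage by collapsing raw points into fixed-width buckets while preserving the statistics dashboards typically query. A minimal sketch of one rollup pass (bucket width and the choice of retained stats are illustrative):

```python
from collections import defaultdict

def rollup(points, bucket_seconds=60):
    """Downsample raw (timestamp, value) pairs into fixed-width buckets,
    keeping avg, min, max, and count per bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }
```

Keeping min and max alongside the average matters: an average alone hides the spikes that alerts and incident investigations care about.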
3. Alert Evaluation Architecture
Alerting must be decoupled from the ingestion and query paths to avoid cascading failures. When the monitoring system is under stress (often exactly when alerts matter most), evaluation must continue independently. Interviewers look for a stream-based or poll-based design with clear failure boundaries.
Hints to consider:
- Tap a copy of the ingested metric stream into a dedicated alert evaluation path using Kafka consumer groups or Apache Flink for windowed aggregation
- Evaluate alert rules using stateful stream processing with tumbling or sliding windows, emitting alert events when thresholds are crossed
- Deduplicate alert notifications using state tracking (firing, pending, resolved) to prevent alert storms during flapping conditions
- Implement a separate notification dispatcher with retry logic, backoff, and escalation policies per severity level
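The deduplication hint above amounts to a small state machine per rule: a rule moves from resolved to pending on a first breach, fires only after the breach persists, and notifies only on state transitions. A minimal sketch (the class, threshold semantics, and `for_periods` parameter are illustrative):

```python
class AlertRule:
    """Tracks one rule through resolved -> pending -> firing so that a
    flapping metric does not trigger a notification storm."""

    def __init__(self, threshold: float, for_periods: int = 3):
        self.threshold = threshold
        self.for_periods = for_periods  # consecutive breaches required to fire
        self.breach_count = 0
        self.state = "resolved"

    def evaluate(self, value: float):
        """Returns a notification event only on state transitions, else None."""
        if value > self.threshold:
            self.breach_count += 1
            if self.state == "resolved":
                self.state = "pending"
            if self.state == "pending" and self.breach_count >= self.for_periods:
                self.state = "firing"
                return "FIRING"
        else:
            self.breach_count = 0
            if self.state == "firing":
                self.state = "resolved"
                return "RESOLVED"
            self.state = "resolved"
        return None
```

The pending state is what suppresses flapping: a single-sample spike never reaches the notification dispatcher.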
4. Cardinality Control and Multi-Tenancy
Unbounded label cardinality (for example, tagging metrics with request IDs or user IDs) can explode the number of unique time series and collapse query performance. In multi-tenant environments, one team's misconfigured agent can degrade the platform for everyone.
Hints to consider:
- Enforce cardinality limits per metric and per tenant at the ingestion gateway, rejecting or dropping data points that exceed thresholds
- Maintain an allow-list of label keys, and alert operators when new labels appear that violate naming conventions
- Isolate tenant data at the storage level using tenant-prefixed partitions or separate keyspaces
- Monitor the monitoring system itself: track active series count, ingestion rate per tenant, and query latency percentiles to detect problems early
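The per-tenant cardinality limit described above can be enforced at the gateway by tracking the set of active series per tenant and rejecting points that would open a new series past the budget. A minimal in-memory sketch (the class name, budget, and admit/reject policy are illustrative; a production gateway would likely use a probabilistic counter such as HyperLogLog to bound memory):

```python
from collections import defaultdict

class CardinalityGuard:
    """Gateway-side admission control: caps the number of unique series
    each tenant may create."""

    def __init__(self, max_series_per_tenant: int = 100_000):
        self.max_series = max_series_per_tenant
        self.series = defaultdict(set)  # tenant -> set of series keys

    def admit(self, tenant: str, series_key: str) -> bool:
        seen = self.series[tenant]
        if series_key in seen:
            return True                 # points for existing series always pass
        if len(seen) >= self.max_series:
            return False                # a new series over budget is rejected
        seen.add(series_key)
        return True
```

Rejecting only new series (rather than all of a tenant's traffic) keeps established dashboards and alerts working while the offending label, such as a request ID leaking into tags, is fixed.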