Design a System Metrics Monitoring and Alerting Platform
Problem Statement
Design a metrics monitoring and alerting platform that collects performance telemetry (CPU utilization, memory consumption, request throughput, error rates) from hundreds of thousands of servers, stores this data efficiently as time series, enables engineers to visualize it on dashboards, and triggers notifications when user-defined thresholds are breached. Think of products like Datadog, Prometheus with Grafana, or Amazon CloudWatch.
The core workflow spans five stages: lightweight agents on servers emit metric data points, an ingestion pipeline absorbs millions of data points per second, a time-series store retains data at varying resolutions, a query engine powers dashboards and ad-hoc exploration, and an alerting engine continuously evaluates rules and dispatches notifications. The system must operate reliably even when the infrastructure it monitors is degrading.
Interviewers ask this question to test whether you can design a high-throughput write pipeline, choose an appropriate time-series storage model, build a cost-effective retention strategy with rollups and tiering, and architect a low-latency alerting engine that evaluates thousands of rules without coupling to the ingestion hot path.
Key Requirements
Functional
- Metric collection -- ingest structured metric data points (timestamp, metric name, value, labels/tags) from agents running on hundreds of thousands of servers via push or pull protocols
- Dashboard visualization -- allow engineers to build dashboards that query and graph metrics across arbitrary time ranges, with filtering by labels, grouping, and aggregation functions (avg, sum, p99)
- Threshold-based alerting -- enable users to define alert rules specifying a metric, label filters, aggregation function, time window, threshold, and severity, with notifications sent via email, Slack, PagerDuty, or webhooks
- Retention and resolution management -- support configurable retention policies that keep raw-resolution data for recent periods and automatically downsample older data to reduce storage costs
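The functional requirements above imply a common shape for a metric data point: a name, a timestamp, a numeric value, and a set of labels that together identify a unique time series. A minimal sketch (the `MetricPoint` class and `series_key` format are illustrative, not from any particular product):

```python
from dataclasses import dataclass, field

@dataclass
class MetricPoint:
    """One metric data point: the unit the agents emit and the pipeline ingests."""
    name: str                  # e.g. "cpu.utilization"
    timestamp: int             # Unix epoch seconds
    value: float
    labels: dict = field(default_factory=dict)  # e.g. {"host": "web-42"}

    def series_key(self) -> str:
        """A time series is identified by the metric name plus its full,
        order-independent label set."""
        tags = ",".join(f"{k}={v}" for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{tags}}}"
```

Sorting labels before building the key matters: agents may serialize labels in any order, and the same logical series must always map to the same key.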
Non-Functional
- Scalability -- sustain 10 million metric data points per second ingestion rate with linear horizontal scaling as new servers and services are added
- Reliability -- tolerate node failures, network partitions, and ingestion spikes without losing data or missing alerts; the monitoring system must be more reliable than the systems it monitors
- Latency -- end-to-end latency from metric emission to dashboard visibility under 30 seconds; alert evaluation and notification dispatch within 60 seconds of threshold breach
- Cost efficiency -- store months to years of metric history without unbounded storage growth by applying rollup, compression, and tiered storage strategies
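A quick back-of-envelope estimate grounds the cost-efficiency requirement. Assuming roughly 16 bytes per raw point (8-byte timestamp plus 8-byte float, with labels stored once per series in an index rather than per point) and a ~10x compression ratio from Gorilla-style delta/XOR encoding, both of which are assumptions for illustration:

```python
points_per_sec = 10_000_000
bytes_per_point = 16        # 8-byte timestamp + 8-byte float value
seconds_per_day = 86_400

raw_per_day = points_per_sec * bytes_per_point * seconds_per_day
compressed_per_day = raw_per_day / 10   # assumed ~10x compression

print(f"raw: {raw_per_day / 1e12:.1f} TB/day, "
      f"compressed: {compressed_per_day / 1e12:.2f} TB/day")
```

This works out to roughly 13.8 TB/day raw, or about 1.4 TB/day compressed, which is why downsampling and tiering to object storage are not optional at this scale.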
What Interviewers Focus On
Based on real interview experiences at TikTok and similar companies, these are the areas interviewers probe most deeply:
1. High-Throughput Ingestion Pipeline
Metrics pipelines are dominated by writes. Hundreds of thousands of agents emit data continuously, and bursts occur during deployments, autoscaling events, or incidents. Interviewers want to see how you absorb this volume without dropping data or overwhelming downstream stores.
Hints to consider:
- Place Kafka (or a similar durable streaming platform) between agents and the storage layer to decouple ingestion rate from write throughput capacity
- Partition Kafka topics by metric name or label hash to distribute load and preserve per-series ordering
- Implement client-side batching and compression in the agents to reduce network calls and bandwidth
- Add per-tenant rate limiting and admission control at the ingestion gateway to prevent cardinality explosions from mislabeled metrics
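The partitioning hint above hinges on a deterministic mapping from series identity to Kafka partition. A minimal sketch, assuming partitioning by a hash of metric name plus the sorted label set (the function name and separator format are illustrative):

```python
import hashlib

def partition_for(metric_name: str, labels: dict, num_partitions: int) -> int:
    """Hash the full series identity so every point of a given series
    lands on the same partition, preserving per-series ordering."""
    series = metric_name + "|" + ",".join(
        f"{k}={v}" for k, v in sorted(labels.items())
    )
    digest = hashlib.sha1(series.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Hashing on the full label set (rather than just the metric name) spreads high-volume metrics like `http.requests` across partitions, while still keeping each individual series ordered.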
2. Time-Series Storage and Retention Strategy
Storing raw-resolution data indefinitely is prohibitively expensive at scale. Interviewers expect you to discuss storage engine choices, data modeling for time-series workloads, and a rollup/tiering strategy that balances query fidelity with cost.
Hints to consider:
- Use a wide-column store like Cassandra with wide-row partitioning by metric name and time bucket, or a purpose-built time-series database like InfluxDB or VictoriaMetrics
- Organize data into time-bucketed segments (hourly or daily) so retention policies can drop or archive entire segments efficiently
- Run periodic rollup jobs that compute pre-aggregated summaries (1-minute, 5-minute, 1-hour granularity) and store them in separate tables
- Tier cold data to object storage (S3) with a query adapter that transparently fetches from the appropriate tier based on the requested time range
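The rollup jobs described above reduce storage by collapsing raw points into fixed-width buckets while preserving the statistics dashboards typically query. A minimal sketch of one rollup pass (bucket width and the choice of retained stats are illustrative):

```python
from collections import defaultdict

def rollup(points, bucket_seconds=60):
    """Downsample raw (timestamp, value) pairs into fixed-width buckets,
    keeping avg, min, max, and count per bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }
```

Keeping min and max alongside the average matters: an average alone hides the spikes that alerts and incident investigations care about.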
3. Alert Evaluation Architecture
Alerting must be decoupled from the ingestion and query paths to avoid cascading failures. When the monitoring system is under stress (often exactly when alerts matter most), evaluation must continue independently. Interviewers look for a stream-based or poll-based design with clear failure boundaries.
Hints to consider:
- Tap a copy of the ingested metric stream into a dedicated alert evaluation path using Kafka consumer groups or Apache Flink for windowed aggregation
- Evaluate alert rules using stateful stream processing with tumbling or sliding windows, emitting alert events when thresholds are crossed
- Deduplicate alert notifications using state tracking (firing, pending, resolved) to prevent alert storms during flapping conditions
- Implement a separate notification dispatcher with retry logic, backoff, and escalation policies per severity level
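The deduplication hint above amounts to a small state machine per rule: a rule moves from resolved to pending on a first breach, fires only after the breach persists, and notifies only on state transitions. A minimal sketch (the class, threshold semantics, and `for_periods` parameter are illustrative):

```python
class AlertRule:
    """Tracks one rule through resolved -> pending -> firing so that a
    flapping metric does not trigger a notification storm."""

    def __init__(self, threshold: float, for_periods: int = 3):
        self.threshold = threshold
        self.for_periods = for_periods  # consecutive breaches required to fire
        self.breach_count = 0
        self.state = "resolved"

    def evaluate(self, value: float):
        """Returns a notification event only on state transitions, else None."""
        if value > self.threshold:
            self.breach_count += 1
            if self.state == "resolved":
                self.state = "pending"
            if self.state == "pending" and self.breach_count >= self.for_periods:
                self.state = "firing"
                return "FIRING"
        else:
            self.breach_count = 0
            if self.state == "firing":
                self.state = "resolved"
                return "RESOLVED"
            self.state = "resolved"
        return None
```

The pending state is what suppresses flapping: a single-sample spike never reaches the notification dispatcher.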
4. Cardinality Control and Multi-Tenancy
Unbounded label cardinality (for example, tagging metrics with request IDs or user IDs) can explode the number of unique time series and collapse query performance. In multi-tenant environments, one team's misconfigured agent can degrade the platform for everyone.
Hints to consider:
- Enforce cardinality limits per metric and per tenant at the ingestion gateway, rejecting or dropping data points that exceed thresholds
- Maintain an allow-list of label keys, and alert operators when new labels appear that violate naming conventions
- Isolate tenant data at the storage level using tenant-prefixed partitions or separate keyspaces
- Monitor the monitoring system itself: track active series count, ingestion rate per tenant, and query latency percentiles to detect problems early
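The per-tenant cardinality limit described above can be enforced at the gateway by tracking the set of active series per tenant and rejecting points that would open a new series past the budget. A minimal in-memory sketch (the class name, budget, and admit/reject policy are illustrative; a production gateway would likely use a probabilistic counter such as HyperLogLog to bound memory):

```python
from collections import defaultdict

class CardinalityGuard:
    """Gateway-side admission control: caps the number of unique series
    each tenant may create."""

    def __init__(self, max_series_per_tenant: int = 100_000):
        self.max_series = max_series_per_tenant
        self.series = defaultdict(set)  # tenant -> set of series keys

    def admit(self, tenant: str, series_key: str) -> bool:
        seen = self.series[tenant]
        if series_key in seen:
            return True                 # points for existing series always pass
        if len(seen) >= self.max_series:
            return False                # a new series over budget is rejected
        seen.add(series_key)
        return True
```

Rejecting only new series (rather than all of a tenant's traffic) keeps established dashboards and alerts working while the offending label, such as a request ID leaking into tags, is fixed.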