Design a Metrics Aggregator
Problem Statement
Design a system that enables product teams to track user behavior and business events in real time across web and mobile applications. The platform should collect events such as page views, button clicks, purchases, and API calls from millions of users, aggregate them into meaningful metrics, and display interactive dashboards showing event volumes, funnels, and trends over various time windows.
The system must handle bursts of traffic during peak hours, support flexible grouping by custom properties (user segment, country, device type), and provide sub-second query responses for common analytics queries. Consider a scale where 10,000 events per second arrive during normal operation, spiking to 100,000 events per second during product launches or marketing campaigns, with retention requirements of 90 days for detailed data and two years for rolled-up summaries.
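A quick back-of-envelope check makes these numbers concrete. The sustained rate and retention come from the prompt; the average event size of roughly 1 KB is an assumption for illustration:

```python
# Back-of-envelope sizing for the stated scale.
# Assumption (not in the prompt): ~1 KB average serialized event.
SUSTAINED_EPS = 10_000
SECONDS_PER_MONTH = 30 * 24 * 3600          # 2,592,000
AVG_EVENT_BYTES = 1_000

events_per_month = SUSTAINED_EPS * SECONDS_PER_MONTH
raw_bytes_per_month = events_per_month * AVG_EVENT_BYTES
raw_bytes_90d = 3 * raw_bytes_per_month     # detailed-data retention window

print(f"{events_per_month / 1e9:.1f}B events/month")      # ~25.9B
print(f"{raw_bytes_per_month / 1e12:.1f} TB/month raw")   # ~25.9 TB
print(f"{raw_bytes_90d / 1e12:.1f} TB for 90 days")       # ~77.8 TB
```

Sustained traffic alone lands near the quoted 30 billion events per month, and 90 days of detailed data is tens of terabytes before compression, which motivates columnar formats and rollups later in the design.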
Key Requirements
Functional
- Event ingestion -- Accept events from web browsers, mobile apps, and backend services through SDKs and HTTP APIs with minimal latency impact on client applications
- Custom dimensions -- Support arbitrary key-value properties on each event to enable filtering and grouping (e.g., user_tier=premium, campaign_id=summer2024)
- Time-window aggregation -- Compute event counts and unique user counts over configurable intervals (1 minute, 5 minutes, 1 hour, 1 day) with near real-time freshness
- Interactive dashboards -- Provide query APIs that power graphs and tables showing event volumes, breakdowns by dimension, and time-series trends with sub-second response times
Non-Functional
- Scalability -- Handle sustained ingestion of 10,000 events/second with burst capacity to 100,000 events/second, storing 30 billion events per month
- Reliability -- Ensure 99.9% availability for ingestion endpoints and tolerate individual component failures without data loss
- Latency -- Reflect new events in dashboards within 10 seconds for most recent time windows, with query response times under 500ms for typical dashboard loads
- Consistency -- Accept eventual consistency for aggregated counts with convergence within one minute, allowing graceful handling of duplicate events and out-of-order arrival
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline Design
High-throughput event ingestion requires careful buffering and partitioning strategies. Interviewers want to see how you decouple the write path from aggregation processing to avoid overwhelming downstream components and maintain low client-side latency.
Hints to consider:
- Consider a message queue to buffer incoming events and smooth out traffic spikes
- Think about how to partition events for parallel processing (by event type, user ID hash, or timestamp)
- Discuss client-side batching in SDKs to reduce network overhead and server load
- Address how to validate and filter malformed events early to protect downstream systems
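The batching and partitioning hints above can be sketched together. This is a minimal illustration, not a production SDK: the partition count, batch limits, and `send` callback are all assumed names, and a real client would add retries and persistence:

```python
import hashlib
import time

NUM_PARTITIONS = 32  # assumed partition count of the downstream message queue

def partition_for(user_id: str) -> int:
    """Stable hash of user ID -> partition, so one user's events stay ordered
    and can be processed by a single consumer."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

class EventBuffer:
    """Client-side batcher: flush when the batch is full or the oldest
    buffered event exceeds max_age_s, trading a little latency for far
    fewer network requests."""
    def __init__(self, max_batch=50, max_age_s=5.0, send=None):
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.send = send or (lambda batch: None)  # the network call in a real SDK
        self.batch = []
        self.first_ts = None

    def track(self, event: dict, now=None):
        now = time.monotonic() if now is None else now
        if not self.batch:
            self.first_ts = now
        self.batch.append(event)
        if len(self.batch) >= self.max_batch or now - self.first_ts >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.batch:
            self.send(self.batch)
            self.batch = []
```

Hashing on user ID (rather than timestamp) avoids the hot-partition problem where all current traffic lands on one time-based partition, at the cost of losing global time ordering within a partition.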
2. Real-Time Aggregation Strategy
Computing accurate counts and unique user metrics from a firehose of events involves tradeoffs between accuracy, latency, and resource costs. Interviewers probe whether you understand streaming aggregation, approximate algorithms, and when to pre-aggregate versus compute on demand.
Hints to consider:
- Explore using stream processing frameworks to maintain running counts in tumbling or sliding time windows
- Discuss approximate algorithms like HyperLogLog for unique user counts when exact counts are not required
- Consider a lambda architecture with both real-time and batch paths to balance speed and correctness
- Address how to handle late-arriving events that fall outside current time windows
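The tumbling-window and late-event hints can be combined in a small sketch. The window size and lateness bound are assumed parameters; a real deployment would use a stream processor's watermarking rather than this hand-rolled version:

```python
from collections import defaultdict

WINDOW_S = 60            # tumbling window size (assumed: 1 minute)
ALLOWED_LATENESS_S = 120 # events later than this fall back to a batch/correction path

class TumblingCounter:
    """Per-window event counts with a watermark-based lateness bound."""
    def __init__(self):
        self.counts = defaultdict(int)  # window_start -> event count
        self.watermark = 0              # highest event time seen so far

    def add(self, event_ts: int) -> bool:
        """Count the event in its window; return False if it is too late
        to update streaming state and must be handled elsewhere."""
        self.watermark = max(self.watermark, event_ts)
        if event_ts < self.watermark - ALLOWED_LATENESS_S:
            return False
        window = event_ts - event_ts % WINDOW_S
        self.counts[window] += 1
        return True
```

For unique-user counts the same window state would hold a HyperLogLog sketch instead of an integer, since storing exact user sets per window and dimension combination is prohibitively expensive at this scale.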
3. Storage Layout for Time-Series Queries
Analytics queries scan large time ranges and filter by multiple dimensions, making storage schema design critical for performance. Interviewers expect you to reason about partitioning schemes, compression, and index strategies that enable fast range scans without full table scans.
Hints to consider:
- Design a schema that co-locates data by time ranges to minimize disk seeks for time-series queries
- Consider columnar storage formats to enable efficient filtering and aggregation on specific dimensions
- Discuss partitioning by day or hour to enable fast pruning of irrelevant data and support TTL-based retention
- Address cardinality explosion when users create high-cardinality custom properties and how to limit or sample
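The rollup-row layout and the cardinality cap from the last hint can be sketched together. The key shape `(event_name, sorted dimensions, window_start)` and the `__other__` overflow bucket are illustrative choices, not a prescribed schema:

```python
from collections import defaultdict

class RollupTable:
    """Minute-granularity rollup keyed by (event, dimensions, window_start),
    with a per-dimension cap on distinct values to contain cardinality."""
    def __init__(self, max_dim_values=1000):
        self.max_dim_values = max_dim_values
        self.rows = {}                     # rollup key -> count
        self.seen_values = defaultdict(set)  # dimension name -> distinct values

    def _clamp(self, dim: str, value: str) -> str:
        """Admit new dimension values until the cap, then bucket the rest."""
        seen = self.seen_values[dim]
        if value in seen or len(seen) < self.max_dim_values:
            seen.add(value)
            return value
        return "__other__"

    def record(self, event_name: str, dims: dict, ts: int):
        minute = ts - ts % 60
        clamped = tuple(sorted((k, self._clamp(k, v)) for k, v in dims.items()))
        key = (event_name, clamped, minute)
        self.rows[key] = self.rows.get(key, 0) + 1
```

Because the window start leads each physical partition in a real store, a dashboard query for "last hour" touches only the partitions for those minutes, and day-level partitions can be dropped wholesale when the 90-day TTL expires.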
4. Hot Path Optimization and Caching
Dashboards repeatedly query recent time windows (last hour, last 24 hours), creating predictable read patterns that can be optimized. Interviewers want to see how you identify hot paths and use caching or materialized views to reduce load on the storage layer.
Hints to consider:
- Cache aggregated results for popular queries (recent time windows, top events) in a fast key-value store
- Use materialized rollup tables that pre-compute aggregates at multiple granularities (minute, hour, day)
- Discuss read replicas or read-through caches to distribute query load away from the primary storage
- Consider how cache invalidation works when late events arrive or corrections are needed
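The TTL tension in the last hint — recent windows can still change while closed windows are stable — can be sketched as a read-through cache with two TTLs. The TTL values and lateness bound are assumptions for illustration:

```python
import time

class ReadThroughCache:
    """Read-through cache for dashboard queries: a short TTL while a window
    can still receive late events, a long TTL once it is safely closed."""
    def __init__(self, loader, recent_ttl_s=10, closed_ttl_s=3600, lateness_s=120):
        self.loader = loader              # falls through to the rollup store
        self.recent_ttl_s = recent_ttl_s
        self.closed_ttl_s = closed_ttl_s
        self.lateness_s = lateness_s
        self.store = {}                   # key -> (value, expires_at)

    def get(self, key, window_end, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(key)
        if hit and hit[1] > now:
            return hit[0]
        value = self.loader(key)
        # A window is "closed" once no event within the lateness bound can reach it.
        closed = window_end + self.lateness_s < now
        ttl = self.closed_ttl_s if closed else self.recent_ttl_s
        self.store[key] = (value, now + ttl)
        return value
```

Keeping the recent-window TTL short sidesteps explicit invalidation for most late events: the stale entry simply expires within seconds, while corrections to old data can be handled by deleting the affected keys.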