Design a Metrics Aggregator
Problem Statement
Design a system that enables product teams to track user behavior and business events in real time across web and mobile applications. The platform should collect events such as page views, button clicks, purchases, and API calls from millions of users, aggregate them into meaningful metrics, and display interactive dashboards showing event volumes, funnels, and trends over various time windows.
The system must handle bursts of traffic during peak hours, support flexible grouping by custom properties (user segment, country, device type), and provide sub-second query responses for common analytics queries. Consider a scale where 10,000 events per second arrive during normal operation, spiking to 100,000 events per second during product launches or marketing campaigns, with retention requirements of 90 days for detailed data and two years for rolled-up summaries.
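A quick back-of-envelope check makes these numbers concrete. The sustained rate and retention come from the prompt; the average event size of roughly 1 KB is an assumption for illustration:

```python
# Back-of-envelope sizing for the stated scale.
# Assumption (not in the prompt): ~1 KB average serialized event.
SUSTAINED_EPS = 10_000
SECONDS_PER_MONTH = 30 * 24 * 3600          # 2,592,000
AVG_EVENT_BYTES = 1_000

events_per_month = SUSTAINED_EPS * SECONDS_PER_MONTH
raw_bytes_per_month = events_per_month * AVG_EVENT_BYTES
raw_bytes_90d = 3 * raw_bytes_per_month     # detailed-data retention window

print(f"{events_per_month / 1e9:.1f}B events/month")      # ~25.9B
print(f"{raw_bytes_per_month / 1e12:.1f} TB/month raw")   # ~25.9 TB
print(f"{raw_bytes_90d / 1e12:.1f} TB for 90 days")       # ~77.8 TB
```

Sustained traffic alone lands near the quoted 30 billion events per month, and 90 days of detailed data is tens of terabytes before compression, which motivates columnar formats and rollups later in the design.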
Key Requirements
Functional
- Event ingestion -- Accept events from web browsers, mobile apps, and backend services through SDKs and HTTP APIs with minimal latency impact on client applications
- Custom dimensions -- Support arbitrary key-value properties on each event to enable filtering and grouping (e.g., user_tier=premium, campaign_id=summer2024)
- Time-window aggregation -- Compute event counts and unique user counts over configurable intervals (1 minute, 5 minutes, 1 hour, 1 day) with near real-time freshness
- Interactive dashboards -- Provide query APIs that power graphs and tables showing event volumes, breakdowns by dimension, and time-series trends with sub-second response times
Non-Functional
- Scalability -- Handle sustained ingestion of 10,000 events/second with burst capacity to 100,000 events/second, storing 30 billion events per month
- Reliability -- Ensure 99.9% availability for ingestion endpoints and tolerate individual component failures without data loss
- Latency -- Reflect new events in dashboards within 10 seconds for most recent time windows, with query response times under 500ms for typical dashboard loads
- Consistency -- Accept eventual consistency for aggregated counts with convergence within one minute, allowing graceful handling of duplicate events and out-of-order arrival
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline Design
High-throughput event ingestion requires careful buffering and partitioning strategies. Interviewers want to see how you decouple the write path from aggregation processing to avoid overwhelming downstream components and maintain low client-side latency.
Hints to consider:
- Consider a message queue to buffer incoming events and smooth out traffic spikes
- Think about how to partition events for parallel processing (by event type, user ID hash, or timestamp)
- Discuss client-side batching in SDKs to reduce network overhead and server load
- Address how to validate and filter malformed events early to protect downstream systems
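The batching and partitioning hints above can be sketched together. This is a minimal illustration, not a production SDK: the partition count, batch limits, and `send` callback are all assumed names, and a real client would add retries and persistence:

```python
import hashlib
import time

NUM_PARTITIONS = 32  # assumed partition count of the downstream message queue

def partition_for(user_id: str) -> int:
    """Stable hash of user ID -> partition, so one user's events stay ordered
    and can be processed by a single consumer."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

class EventBuffer:
    """Client-side batcher: flush when the batch is full or the oldest
    buffered event exceeds max_age_s, trading a little latency for far
    fewer network requests."""
    def __init__(self, max_batch=50, max_age_s=5.0, send=None):
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.send = send or (lambda batch: None)  # the network call in a real SDK
        self.batch = []
        self.first_ts = None

    def track(self, event: dict, now=None):
        now = time.monotonic() if now is None else now
        if not self.batch:
            self.first_ts = now
        self.batch.append(event)
        if len(self.batch) >= self.max_batch or now - self.first_ts >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.batch:
            self.send(self.batch)
            self.batch = []
```

Hashing on user ID (rather than timestamp) avoids the hot-partition problem where all current traffic lands on one time-based partition, at the cost of losing global time ordering within a partition.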
2. Real-Time Aggregation Strategy
Computing accurate counts and unique user metrics from a firehose of events involves tradeoffs between accuracy, latency, and resource costs. Interviewers probe whether you understand streaming aggregation, approximate algorithms, and when to pre-aggregate versus compute on demand.
Hints to consider:
- Explore using stream processing frameworks to maintain running counts in tumbling or sliding time windows
- Discuss approximate algorithms like HyperLogLog for unique user counts when exact counts are not required
- Consider a lambda architecture with both real-time and batch paths to balance speed and correctness
- Address how to handle late-arriving events that fall outside current time windows
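The tumbling-window and late-event hints can be combined in a small sketch. The window size and lateness bound are assumed parameters; a real deployment would use a stream processor's watermarking rather than this hand-rolled version:

```python
from collections import defaultdict

WINDOW_S = 60            # tumbling window size (assumed: 1 minute)
ALLOWED_LATENESS_S = 120 # events later than this fall back to a batch/correction path

class TumblingCounter:
    """Per-window event counts with a watermark-based lateness bound."""
    def __init__(self):
        self.counts = defaultdict(int)  # window_start -> event count
        self.watermark = 0              # highest event time seen so far

    def add(self, event_ts: int) -> bool:
        """Count the event in its window; return False if it is too late
        to update streaming state and must be handled elsewhere."""
        self.watermark = max(self.watermark, event_ts)
        if event_ts < self.watermark - ALLOWED_LATENESS_S:
            return False
        window = event_ts - event_ts % WINDOW_S
        self.counts[window] += 1
        return True
```

For unique-user counts the same window state would hold a HyperLogLog sketch instead of an integer, since storing exact user sets per window and dimension combination is prohibitively expensive at this scale.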
3. Storage Layout for Time-Series Queries
Analytics queries scan large time ranges and filter by multiple dimensions, making storage schema design critical for performance. Interviewers expect you to reason about partitioning schemes, compression, and index strategies that enable fast range scans without full table scans.
Hints to consider:
- Design a schema that co-locates data by time ranges to minimize disk seeks for time-series queries
- Consider columnar storage formats to enable efficient filtering and aggregation on specific dimensions
- Discuss partitioning by day or hour to enable fast pruning of irrelevant data and support TTL-based retention
- Address cardinality explosion when users create high-cardinality custom properties and how to limit or sample
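The rollup-row layout and the cardinality cap from the last hint can be sketched together. The key shape `(event_name, sorted dimensions, window_start)` and the `__other__` overflow bucket are illustrative choices, not a prescribed schema:

```python
from collections import defaultdict

class RollupTable:
    """Minute-granularity rollup keyed by (event, dimensions, window_start),
    with a per-dimension cap on distinct values to contain cardinality."""
    def __init__(self, max_dim_values=1000):
        self.max_dim_values = max_dim_values
        self.rows = {}                     # rollup key -> count
        self.seen_values = defaultdict(set)  # dimension name -> distinct values

    def _clamp(self, dim: str, value: str) -> str:
        """Admit new dimension values until the cap, then bucket the rest."""
        seen = self.seen_values[dim]
        if value in seen or len(seen) < self.max_dim_values:
            seen.add(value)
            return value
        return "__other__"

    def record(self, event_name: str, dims: dict, ts: int):
        minute = ts - ts % 60
        clamped = tuple(sorted((k, self._clamp(k, v)) for k, v in dims.items()))
        key = (event_name, clamped, minute)
        self.rows[key] = self.rows.get(key, 0) + 1
```

Because the window start leads each physical partition in a real store, a dashboard query for "last hour" touches only the partitions for those minutes, and day-level partitions can be dropped wholesale when the 90-day TTL expires.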
4. Hot Path Optimization and Caching
Dashboards repeatedly query recent time windows (last hour, last 24 hours), creating predictable read patterns that can be optimized. Interviewers want to see how you identify hot paths and use caching or materialized views to reduce load on the storage layer.
Hints to consider:
- Cache aggregated results for popular queries (recent time windows, top events) in a fast key-value store
- Use materialized rollup tables that pre-compute aggregates at multiple granularities (minute, hour, day)
- Discuss read replicas or read-through caches to distribute query load away from the primary storage
- Consider how cache invalidation works when late events arrive or corrections are needed
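The TTL tension in the last hint — recent windows can still change while closed windows are stable — can be sketched as a read-through cache with two TTLs. The TTL values and lateness bound are assumptions for illustration:

```python
import time

class ReadThroughCache:
    """Read-through cache for dashboard queries: a short TTL while a window
    can still receive late events, a long TTL once it is safely closed."""
    def __init__(self, loader, recent_ttl_s=10, closed_ttl_s=3600, lateness_s=120):
        self.loader = loader              # falls through to the rollup store
        self.recent_ttl_s = recent_ttl_s
        self.closed_ttl_s = closed_ttl_s
        self.lateness_s = lateness_s
        self.store = {}                   # key -> (value, expires_at)

    def get(self, key, window_end, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(key)
        if hit and hit[1] > now:
            return hit[0]
        value = self.loader(key)
        # A window is "closed" once no event within the lateness bound can reach it.
        closed = window_end + self.lateness_s < now
        ttl = self.closed_ttl_s if closed else self.recent_ttl_s
        self.store[key] = (value, now + ttl)
        return value
```

Keeping the recent-window TTL short sidesteps explicit invalidation for most late events: the stale entry simply expires within seconds, while corrections to old data can be handled by deleting the affected keys.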