Design a distributed system that tracks and aggregates video view counts for a platform serving hundreds of millions of users worldwide. The system must handle peak traffic of 5 million concurrent video views, increment counters in near real-time (under 5 seconds), and provide accurate historical analytics for content creators spanning the past 3 years. Views should be deduplicated per user session, survive partial network failures, and support queries by video, channel, category, geography, and device type.
This problem tests your ability to architect high-throughput write systems that balance eventual consistency with user expectations of "live" counts, handle highly skewed traffic patterns where viral videos create extreme hotspots, and efficiently store time-series data at multiple granularities while keeping storage costs manageable.
View event capture -- Record every video view with metadata (video ID, user ID, timestamp, device, location) without blocking video playback
Real-time counter updates -- Display view counts to users with maximum 5-second staleness
Deduplication -- Count each view once per user session, handling page refreshes and network retries
Multi-dimensional analytics -- Support creator dashboards querying views by time range, geography, device type, and traffic source
Historical retention -- Preserve view data for 3 years with minute-level granularity for recent data and hourly rollups for older data
Scalability -- Handle 5M concurrent views (peak ~50K view events/second), with ability to scale to 10x during viral events
Reliability -- Tolerate datacenter failures without losing view counts; target 99.9% accuracy
Latency -- Public-facing view counts update within 5 seconds; analytics queries return within 2 seconds for standard time ranges
Consistency -- Eventual consistency acceptable for view counts; strong consistency not required but monotonically increasing counts preferred
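The stated throughput numbers can be sanity-checked with a quick back-of-envelope calculation. The sketch below assumes an average watch duration of ~100 seconds and a ~200-byte serialized view event; both are illustrative assumptions, not figures from the requirements.

```python
# Back-of-envelope capacity estimate (assumed inputs marked below).
CONCURRENT_VIEWERS = 5_000_000
AVG_VIEW_SECONDS = 100   # assumption: average watch duration
EVENT_BYTES = 200        # assumption: serialized view-event size

# Each concurrent viewer produces roughly one view event per AVG_VIEW_SECONDS.
events_per_second = CONCURRENT_VIEWERS / AVG_VIEW_SECONDS      # ~50,000/s
ingest_mb_per_second = events_per_second * EVENT_BYTES / 1e6   # ~10 MB/s
viral_peak_events = events_per_second * 10                     # 10x headroom

print(events_per_second, ingest_mb_per_second, viral_peak_events)
```

This is how the "peak ~50K view events/second" figure in the requirements can be derived, and it shows the raw ingest bandwidth is modest; the hard problems are hotspots, deduplication, and storage, not wire throughput.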
Based on real interview experiences, these are the areas interviewers probe most deeply:
When a video goes viral, it can receive millions of views within minutes, creating an extreme write hotspot that can overwhelm a single partition or counter. Interviewers want to see if you recognize this challenge and design accordingly.
Consider splitting hot keys across multiple shards using hash or random suffixes, then summing the shards at read time
Discuss buffering and batching writes in the application tier to reduce database load during traffic spikes
Explore probabilistic structures (e.g., HyperLogLog for approximate unique-viewer counts) when exact counting on extremely high-traffic videos becomes too expensive
Consider separate code paths for "popular" vs "normal" videos detected via metrics or prediction
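The shard-splitting idea above can be sketched in a few lines. This is a minimal illustration using an in-memory dict as a stand-in for a distributed key-value store; the shard count and key format (`video_id#shard`) are illustrative choices, not part of the problem statement.

```python
import random
from collections import defaultdict

NUM_SHARDS = 16  # assumption: shards per hot video; tune per traffic

store = defaultdict(int)  # stand-in for a distributed KV store / counter table

def increment_view(video_id: str) -> None:
    # Writes land on a random shard, so no single key absorbs all traffic.
    shard = random.randrange(NUM_SHARDS)
    store[f"{video_id}#{shard}"] += 1

def read_views(video_id: str) -> int:
    # Reads fan out across all shards and sum; cache this for hot videos.
    return sum(store[f"{video_id}#{s}"] for s in range(NUM_SHARDS))
```

The tradeoff is classic: writes scale linearly with the shard count, while reads pay a fan-out cost, which is why aggregated totals for popular videos are usually cached rather than recomputed per request.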
Network retries, browser refreshes, and client bugs can cause duplicate view events. Without proper deduplication, view counts become inflated and creators lose trust in analytics. The challenge is deduplicating efficiently at massive scale.
Generate client-side unique view IDs combining user session, video ID, and timestamp window
Use distributed caching (Redis) with TTL for recent view IDs to catch duplicates before writing
Design idempotent write paths where reprocessing the same event ID doesn't increment counters twice
Discuss tradeoffs between exact deduplication (expensive) and approximate deduplication (bloom filters)
Raw view events must be rolled up into time-bucketed aggregates across many dimensions. Key building blocks to discuss:
Stream processing frameworks like Apache Flink or Apache Spark Streaming with tumbling/sliding windows
Separate pipelines for real-time (second/minute aggregates) versus batch (hourly/daily rollups)
Watermarking and late-arrival handling for events delayed by network issues or mobile offline scenarios
Discuss exactly-once processing semantics and checkpointing to prevent count loss during worker failures
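The windowing and watermarking concepts above can be made concrete with a toy tumbling-window counter. This is a simplified single-process sketch of what Flink does for you (window assignment, watermark tracking, allowed lateness); the 60-second window and 120-second lateness bound are assumed values.

```python
from collections import defaultdict

WINDOW_S = 60  # assumption: one-minute tumbling windows

class TumblingWindowCounter:
    def __init__(self, allowed_lateness_s: int = 120):
        self.allowed_lateness_s = allowed_lateness_s
        # window_start -> video_id -> count, for windows not yet finalized
        self.open = defaultdict(lambda: defaultdict(int))
        self.emitted = []   # finalized (window_start, {video: count}) pairs
        self.watermark = 0  # highest event time seen so far

    def on_event(self, video_id: str, event_ts: int) -> None:
        if event_ts < self.watermark - self.allowed_lateness_s:
            return  # too late: drop (or route to a correction pipeline)
        window = event_ts - event_ts % WINDOW_S
        self.open[window][video_id] += 1
        self.watermark = max(self.watermark, event_ts)
        self._flush_closed_windows()

    def _flush_closed_windows(self) -> None:
        # A window is final once the watermark has passed its end
        # plus the allowed lateness.
        ready = [w for w in self.open
                 if w + WINDOW_S + self.allowed_lateness_s <= self.watermark]
        for w in sorted(ready):
            self.emitted.append((w, dict(self.open.pop(w))))
```

A real pipeline would also checkpoint `open` and `watermark` so a worker restart replays from the last checkpoint instead of losing in-flight windows; that is the checkpointing/exactly-once discussion the last point asks for.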
Storing 3 years of minute-level view data for hundreds of millions of videos is prohibitively expensive and slow to query. You must design a multi-tiered storage strategy that balances cost, query performance, and data retention.
Hot storage (last 7 days) in time-series databases with minute granularity for real-time dashboards
Warm storage (last 3 months) with hourly pre-aggregates in columnar databases like ClickHouse or Druid
Cold storage (3 years) in data lakes (S3/Parquet) with daily rollups, accessed via batch queries
Use materialized views or pre-computed aggregates for common query patterns (views by day, by country)
Start by confirming scale, accuracy expectations, and query patterns: