Design an Ad Click Aggregator
Problem Statement
Design a distributed system that tracks and aggregates video view counts for a platform serving hundreds of millions of users worldwide. The system must handle peak traffic of 5 million concurrent video views, increment counters in near real-time (under 5 seconds), and provide accurate historical analytics for content creators spanning the past 3 years. Views should be deduplicated per user session, survive partial network failures, and support queries by video, channel, category, geography, and device type.
This problem tests your ability to architect high-throughput write systems that balance eventual consistency with user expectations of "live" counts, handle highly skewed traffic patterns where viral videos create extreme hotspots, and efficiently store time-series data at multiple granularities while keeping storage costs manageable.
Key Requirements
Functional
- View event capture -- Record every video view with metadata (video ID, user ID, timestamp, device, location) without blocking video playback
- Real-time counter updates -- Display view counts to users with maximum 5-second staleness
- Deduplication -- Count each view once per user session, handling page refreshes and network retries
- Multi-dimensional analytics -- Support creator dashboards querying views by time range, geography, device type, and traffic source
- Historical retention -- Preserve view data for 3 years with minute-level granularity for recent data and hourly rollups for older data
Non-Functional
- Scalability -- Handle 5M concurrent views (peak ~50K view events/second), with ability to scale to 10x during viral events
- Reliability -- Tolerate datacenter failures without losing view counts; target 99.9% accuracy
- Latency -- Public-facing view counts update within 5 seconds; analytics queries return within 2 seconds for standard time ranges
- Consistency -- Eventual consistency acceptable for view counts; strong consistency not required but monotonically increasing counts preferred
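These targets translate directly into capacity numbers worth stating up front in an interview. A quick back-of-envelope sketch (the ~200-byte event size is an assumed figure, not given in the requirements):

```python
# Back-of-envelope capacity estimation from the stated requirements.
PEAK_EVENTS_PER_SEC = 50_000   # stated peak write rate
VIRAL_MULTIPLIER = 10          # stated 10x headroom for viral events
EVENT_SIZE_BYTES = 200         # assumed raw event payload size

viral_peak = PEAK_EVENTS_PER_SEC * VIRAL_MULTIPLIER    # 500,000 events/s
events_per_day = PEAK_EVENTS_PER_SEC * 86_400          # ~4.3B events/day at sustained peak
raw_gb_per_day = events_per_day * EVENT_SIZE_BYTES / 1e9

print(viral_peak)       # 500000
print(events_per_day)   # 4320000000
print(raw_gb_per_day)   # 864.0 -- ~864 GB/day of raw events
```

Numbers like these justify the later design choices: the raw event firehose is too large to store forever at full granularity, which motivates the tiered rollup strategy discussed below.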
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Handling Write Hotspots and Skew
When a video goes viral, it can receive millions of views within minutes, creating an extreme write hotspot that can overwhelm a single partition or counter. Interviewers want to see if you recognize this challenge and design accordingly.
Hints to consider:
- Consider splitting hot keys across multiple shards using hash suffixes or random distribution, then aggregating during reads
- Discuss buffering and batching writes in the application tier to reduce database load during traffic spikes
- Explore probabilistic counting techniques (HyperLogLog) for approximate counts on extremely high-traffic videos
- Consider separate code paths for "popular" vs "normal" videos detected via metrics or prediction
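The hot-key splitting idea in the first hint can be sketched in a few lines. The in-memory dict below stands in for whatever counter store (e.g. Redis) you would actually use, and the shard count of 16 is an illustrative choice, not a recommendation:

```python
import random

NUM_SHARDS = 16  # sub-counters per hot video (tune to spread the write load)
counters = {}    # stand-in for a distributed counter store such as Redis

def increment_view(video_id: str) -> None:
    """Write path: spread increments for a hot key across random sub-shards."""
    shard_key = f"{video_id}:shard:{random.randrange(NUM_SHARDS)}"
    counters[shard_key] = counters.get(shard_key, 0) + 1

def read_views(video_id: str) -> int:
    """Read path: aggregate all sub-shards back into one logical count."""
    return sum(counters.get(f"{video_id}:shard:{i}", 0) for i in range(NUM_SHARDS))

for _ in range(1000):
    increment_view("viral-video-42")
print(read_views("viral-video-42"))  # 1000
```

The tradeoff is explicit: writes scale out across shards, but every read now fans out to N keys, which is why this treatment is usually reserved for videos detected as "hot".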
2. Deduplication Strategy Under Failures
Network retries, browser refreshes, and client bugs can cause duplicate view events. Without proper deduplication, view counts become inflated and creators lose trust in analytics. The challenge is deduplicating efficiently at massive scale.
Hints to consider:
- Generate client-side unique view IDs combining user session, video ID, and timestamp window
- Use distributed caching (Redis) with TTL for recent view IDs to catch duplicates before writing
- Design idempotent write paths where reprocessing the same event ID doesn't increment counters twice
- Discuss tradeoffs between exact deduplication (expensive) and approximate deduplication (bloom filters)
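A minimal sketch combining the first and third hints, assuming a 30-minute session window (an assumed value) and using a plain dict as a stand-in for a Redis-style TTL cache:

```python
DEDUP_TTL_SECONDS = 30 * 60  # session window; assumed value, not from the spec
seen = {}                    # view_id -> expiry time (stand-in for Redis SET NX + TTL)
view_counts = {}             # video_id -> deduplicated view count

def make_view_id(session_id: str, video_id: str, now: float) -> str:
    # Bucket the timestamp so retries inside one window collapse to a single ID.
    bucket = int(now // DEDUP_TTL_SECONDS)
    return f"{session_id}:{video_id}:{bucket}"

def record_view(session_id: str, video_id: str, now: float) -> bool:
    """Idempotent write path: the same view ID never increments twice."""
    view_id = make_view_id(session_id, video_id, now)
    expiry = seen.get(view_id)
    if expiry is not None and expiry > now:
        return False  # duplicate (refresh or network retry) within the window
    seen[view_id] = now + DEDUP_TTL_SECONDS
    view_counts[video_id] = view_counts.get(video_id, 0) + 1
    return True

print(record_view("sess-1", "vid-9", now=1000.0))  # True  (counted)
print(record_view("sess-1", "vid-9", now=1001.0))  # False (deduplicated)
```

In production the check-and-set must be atomic (e.g. a single Redis `SET key value NX EX ttl` call) so two concurrent retries cannot both pass the duplicate check.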
3. Real-Time Aggregation Pipeline Architecture
You need to transform millions of individual view events into aggregated counts across multiple time windows and dimensions, all while maintaining the 5-second freshness SLA. This requires careful pipeline design.
Hints to consider:
- Use stream processing frameworks such as Apache Flink or Apache Spark Streaming with tumbling/sliding windows
- Run separate pipelines for real-time (second/minute aggregates) versus batch (hourly/daily rollups)
- Apply watermarking and late-arrival handling for events delayed by network issues or mobile offline scenarios
- Discuss exactly-once processing semantics and checkpointing to prevent count loss during worker failures
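The windowing and watermark hints can be illustrated with a toy tumbling-window aggregator. The 60-second windows and 2-minute lateness allowance are assumed values; a production system would delegate this logic to Flink or Spark Streaming:

```python
WINDOW_SECONDS = 60      # tumbling minute windows
ALLOWED_LATENESS = 120   # accept events up to 2 minutes behind the watermark (assumed)

windows = {}             # (video_id, window_start) -> event count
watermark = 0.0          # event-time watermark: max event time seen minus lateness
late_events = 0          # events too old even for the lateness allowance

def process_event(video_id: str, event_time: float) -> None:
    global watermark, late_events
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    if window_start + WINDOW_SECONDS <= watermark:
        late_events += 1  # window already closed: route to a batch correction path
        return
    key = (video_id, window_start)
    windows[key] = windows.get(key, 0) + 1

process_event("vid-1", 100.0)  # lands in window [60, 120)
process_event("vid-1", 110.0)  # same window
process_event("vid-1", 500.0)  # advances the watermark to 380
process_event("vid-1", 100.0)  # window [60, 120) closed by now: counted as late
print(windows[("vid-1", 60)], late_events)  # 2 1
```

Note how the watermark trades freshness against completeness: a larger lateness allowance captures more delayed mobile events but holds windows open longer, pressuring the 5-second SLA.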
4. Storage Tiering and Query Performance
Storing 3 years of minute-level view data for hundreds of millions of videos is prohibitively expensive and slow to query. You must design a multi-tiered storage strategy that balances cost, query performance, and data retention.
Hints to consider:
- Hot storage (last 7 days) in time-series databases with minute granularity for real-time dashboards
- Warm storage (last 3 months) with hourly pre-aggregates in columnar databases like ClickHouse or Druid
- Cold storage (3 years) in data lakes (S3/Parquet) with daily rollups, accessed via batch queries
- Use materialized views or pre-computed aggregates for common query patterns (views by day, by country)
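The tiering scheme above implies a simple query-routing rule: pick the tier by the age of the oldest data the query touches. A sketch using the 7-day and 3-month boundaries from the hints:

```python
from datetime import datetime, timedelta

def choose_tier(query_start: datetime, now: datetime) -> str:
    """Route an analytics query to the cheapest tier that covers its range."""
    age = now - query_start
    if age <= timedelta(days=7):
        return "hot"    # time-series DB, minute granularity
    if age <= timedelta(days=90):
        return "warm"   # columnar store (ClickHouse/Druid), hourly pre-aggregates
    return "cold"       # data lake (S3/Parquet), daily rollups via batch queries

now = datetime(2024, 6, 1)
print(choose_tier(now - timedelta(days=2), now))    # hot
print(choose_tier(now - timedelta(days=30), now))   # warm
print(choose_tier(now - timedelta(days=400), now))  # cold
```

A query spanning tier boundaries would need to fan out and stitch results together, which is one reason to keep rollup granularities aligned across tiers.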
Suggested Approach
Step 1: Clarify Requirements
Start by confirming scale, accuracy expectations, and query patterns:
- What is peak concurrent viewership and expected growth? (Determines write capacity planning)
- How do we define a "view"? Is it starting playback, watching X seconds, or completing the video? (Affects event generation logic)
- What level of count accuracy is acceptable? Can we tolerate 1% error for approximate counting? (Opens optimization opportunities)
- What are the most common analytics queries creators run? (Guides indexing and pre-aggregation strategy)
- Do we need to support real-time anomaly detection or only historical reporting? (Affects streaming vs batch architecture)