Design an Ad Click Aggregator
Problem Statement
Design a distributed system that tracks and aggregates video view counts for a platform serving hundreds of millions of users worldwide. The system must handle peak traffic of 5 million concurrent video views, increment counters in near real-time (under 5 seconds), and provide accurate historical analytics for content creators spanning the past 3 years. Views should be deduplicated per user session, survive partial network failures, and support queries by video, channel, category, geography, and device type.
This problem tests your ability to architect high-throughput write systems that balance eventual consistency with user expectations of "live" counts, handle highly skewed traffic patterns where viral videos create extreme hotspots, and efficiently store time-series data at multiple granularities while keeping storage costs manageable.
Key Requirements
Functional
- View event capture -- Record every video view with metadata (video ID, user ID, timestamp, device, location) without blocking video playback
- Real-time counter updates -- Display view counts to users with maximum 5-second staleness
- Deduplication -- Count each view once per user session, handling page refreshes and network retries
- Multi-dimensional analytics -- Support creator dashboards querying views by time range, geography, device type, and traffic source
- Historical retention -- Preserve view data for 3 years with minute-level granularity for recent data and hourly rollups for older data
Non-Functional
- Scalability -- Handle 5M concurrent views (peak ~50K view events/second), with ability to scale to 10x during viral events
- Reliability -- Tolerate datacenter failures without losing view counts; target 99.9% accuracy
- Latency -- Public-facing view counts update within 5 seconds; analytics queries return within 2 seconds for standard time ranges
- Consistency -- Eventual consistency acceptable for view counts; strong consistency not required but monotonically increasing counts preferred
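These targets translate directly into capacity numbers worth stating up front in an interview. A quick back-of-envelope sketch (the ~200-byte event size is an assumed figure, not given in the requirements):

```python
# Back-of-envelope capacity estimation from the stated requirements.
PEAK_EVENTS_PER_SEC = 50_000   # stated peak write rate
VIRAL_MULTIPLIER = 10          # stated 10x headroom for viral events
EVENT_SIZE_BYTES = 200         # assumed raw event payload size

viral_peak = PEAK_EVENTS_PER_SEC * VIRAL_MULTIPLIER    # 500,000 events/s
events_per_day = PEAK_EVENTS_PER_SEC * 86_400          # ~4.3B events/day at sustained peak
raw_gb_per_day = events_per_day * EVENT_SIZE_BYTES / 1e9

print(viral_peak)       # 500000
print(events_per_day)   # 4320000000
print(raw_gb_per_day)   # 864.0 -- ~864 GB/day of raw events
```

Numbers like these justify the later design choices: the raw event firehose is too large to store forever at full granularity, which motivates the tiered rollup strategy discussed below.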
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Handling Write Hotspots and Skew
When a video goes viral, it can receive millions of views within minutes, creating an extreme write hotspot that can overwhelm a single partition or counter. Interviewers want to see if you recognize this challenge and design accordingly.
Hints to consider:
- Consider splitting hot keys across multiple shards using hash suffixes or random distribution, then aggregating during reads
- Discuss buffering and batching writes in the application tier to reduce database load during traffic spikes
- Explore probabilistic counting techniques (HyperLogLog) for approximate counts on extremely high-traffic videos
- Consider separate code paths for "popular" vs "normal" videos detected via metrics or prediction
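The hot-key splitting idea in the first hint can be sketched in a few lines. The in-memory dict below stands in for whatever counter store (e.g. Redis) you would actually use, and the shard count of 16 is an illustrative choice, not a recommendation:

```python
import random

NUM_SHARDS = 16  # sub-counters per hot video (tune to spread the write load)
counters = {}    # stand-in for a distributed counter store such as Redis

def increment_view(video_id: str) -> None:
    """Write path: spread increments for a hot key across random sub-shards."""
    shard_key = f"{video_id}:shard:{random.randrange(NUM_SHARDS)}"
    counters[shard_key] = counters.get(shard_key, 0) + 1

def read_views(video_id: str) -> int:
    """Read path: aggregate all sub-shards back into one logical count."""
    return sum(counters.get(f"{video_id}:shard:{i}", 0) for i in range(NUM_SHARDS))

for _ in range(1000):
    increment_view("viral-video-42")
print(read_views("viral-video-42"))  # 1000
```

The tradeoff is explicit: writes scale out across shards, but every read now fans out to N keys, which is why this treatment is usually reserved for videos detected as "hot".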
2. Deduplication Strategy Under Failures
Network retries, browser refreshes, and client bugs can cause duplicate view events. Without proper deduplication, view counts become inflated and creators lose trust in analytics. The challenge is deduplicating efficiently at massive scale.
Hints to consider:
- Generate client-side unique view IDs combining user session, video ID, and timestamp window
- Use distributed caching (Redis) with TTL for recent view IDs to catch duplicates before writing
- Design idempotent write paths where reprocessing the same event ID doesn't increment counters twice
- Discuss tradeoffs between exact deduplication (expensive) and approximate deduplication (bloom filters)
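A minimal sketch combining the first and third hints, assuming a 30-minute session window (an assumed value) and using a plain dict as a stand-in for a Redis-style TTL cache:

```python
DEDUP_TTL_SECONDS = 30 * 60  # session window; assumed value, not from the spec
seen = {}                    # view_id -> expiry time (stand-in for Redis SET NX + TTL)
view_counts = {}             # video_id -> deduplicated view count

def make_view_id(session_id: str, video_id: str, now: float) -> str:
    # Bucket the timestamp so retries inside one window collapse to a single ID.
    bucket = int(now // DEDUP_TTL_SECONDS)
    return f"{session_id}:{video_id}:{bucket}"

def record_view(session_id: str, video_id: str, now: float) -> bool:
    """Idempotent write path: the same view ID never increments twice."""
    view_id = make_view_id(session_id, video_id, now)
    expiry = seen.get(view_id)
    if expiry is not None and expiry > now:
        return False  # duplicate (refresh or network retry) within the window
    seen[view_id] = now + DEDUP_TTL_SECONDS
    view_counts[video_id] = view_counts.get(video_id, 0) + 1
    return True

print(record_view("sess-1", "vid-9", now=1000.0))  # True  (counted)
print(record_view("sess-1", "vid-9", now=1001.0))  # False (deduplicated)
```

In production the check-and-set must be atomic (e.g. a single Redis `SET key value NX EX ttl` call) so two concurrent retries cannot both pass the duplicate check.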
3. Real-Time Aggregation Pipeline Architecture
You need to transform millions of individual view events into aggregated counts across multiple time windows and dimensions, all while maintaining the 5-second freshness SLA. This requires careful pipeline design.
Hints to consider:
- Use stream processing frameworks such as Apache Flink or Apache Spark Streaming with tumbling/sliding windows
- Run separate pipelines for real-time (second/minute aggregates) versus batch (hourly/daily rollups)
- Apply watermarking and late-arrival handling for events delayed by network issues or mobile offline scenarios
- Discuss exactly-once processing semantics and checkpointing to prevent count loss during worker failures
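The windowing and watermark hints can be illustrated with a toy tumbling-window aggregator. The 60-second windows and 2-minute lateness allowance are assumed values; a production system would delegate this logic to Flink or Spark Streaming:

```python
WINDOW_SECONDS = 60      # tumbling minute windows
ALLOWED_LATENESS = 120   # accept events up to 2 minutes behind the watermark (assumed)

windows = {}             # (video_id, window_start) -> event count
watermark = 0.0          # event-time watermark: max event time seen minus lateness
late_events = 0          # events too old even for the lateness allowance

def process_event(video_id: str, event_time: float) -> None:
    global watermark, late_events
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    if window_start + WINDOW_SECONDS <= watermark:
        late_events += 1  # window already closed: route to a batch correction path
        return
    key = (video_id, window_start)
    windows[key] = windows.get(key, 0) + 1

process_event("vid-1", 100.0)  # lands in window [60, 120)
process_event("vid-1", 110.0)  # same window
process_event("vid-1", 500.0)  # advances the watermark to 380
process_event("vid-1", 100.0)  # window [60, 120) closed by now: counted as late
print(windows[("vid-1", 60)], late_events)  # 2 1
```

Note how the watermark trades freshness against completeness: a larger lateness allowance captures more delayed mobile events but holds windows open longer, pressuring the 5-second SLA.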
4. Storage Tiering and Query Performance
Storing 3 years of minute-level view data for hundreds of millions of videos is prohibitively expensive and slow to query. You must design a multi-tiered storage strategy that balances cost, query performance, and data retention.
Hints to consider:
- Hot storage (last 7 days) in time-series databases with minute granularity for real-time dashboards
- Warm storage (last 3 months) with hourly pre-aggregates in columnar databases like ClickHouse or Druid
- Cold storage (3 years) in data lakes (S3/Parquet) with daily rollups, accessed via batch queries
- Use materialized views or pre-computed aggregates for common query patterns (views by day, by country)
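The tiering scheme above implies a simple query-routing rule: pick the tier by the age of the oldest data the query touches. A sketch using the 7-day and 3-month boundaries from the hints:

```python
from datetime import datetime, timedelta

def choose_tier(query_start: datetime, now: datetime) -> str:
    """Route an analytics query to the cheapest tier that covers its range."""
    age = now - query_start
    if age <= timedelta(days=7):
        return "hot"    # time-series DB, minute granularity
    if age <= timedelta(days=90):
        return "warm"   # columnar store (ClickHouse/Druid), hourly pre-aggregates
    return "cold"       # data lake (S3/Parquet), daily rollups via batch queries

now = datetime(2024, 6, 1)
print(choose_tier(now - timedelta(days=2), now))    # hot
print(choose_tier(now - timedelta(days=30), now))   # warm
print(choose_tier(now - timedelta(days=400), now))  # cold
```

A query spanning tier boundaries would need to fan out and stitch results together, which is one reason to keep rollup granularities aligned across tiers.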
Suggested Approach
Step 1: Clarify Requirements
Start by confirming scale, accuracy expectations, and query patterns:
- What is peak concurrent viewership and expected growth? (Determines write capacity planning)
- How do we define a "view"? Is it starting playback, watching X seconds, or completing the video? (Affects event generation logic)
- What level of count accuracy is acceptable? Can we tolerate 1% error for approximate counting? (Opens optimization opportunities)
- What are the most common analytics queries creators run? (Guides indexing and pre-aggregation strategy)
- Do we need to support real-time anomaly detection or only historical reporting? (Affects streaming vs batch architecture)