Design YouTube Analytics
System Design (Optional)
Problem Statement
Design a large-scale analytics platform for a video streaming service that processes billions of user engagement events daily. The system must track metrics including video views, watch duration, likes, shares, and comments, then aggregate them into meaningful insights for content creators and business intelligence teams. Content creators need near real-time visibility into how their videos are performing across different dimensions such as geography, device type, and referral source. The platform should support both live dashboards showing metrics from the last few minutes and historical trend analysis spanning months or years.
The core challenge lies in building a pipeline that can ingest massive event streams with high throughput, deduplicate events reliably, handle late-arriving data gracefully, and serve low-latency queries across multiple dimensions. Your design must balance the freshness requirements of real-time dashboards with the accuracy demands of revenue reporting and creator payouts.
Key Requirements
Functional
- Multi-dimensional metric tracking -- capture and aggregate view counts, total watch time, engagement actions (likes, comments, shares), and revenue metrics across dimensions including video ID, creator channel, geography, device type, and time windows
- Flexible querying -- support dashboard queries filtered by single video, entire channel, country, device platform, or custom date ranges with sub-second response times
- Near real-time updates -- display provisional metrics updated within 5-10 minutes for creator dashboards and internal monitoring
- Historical analysis -- provide accurate aggregated reports for any historical time period with drill-down capability from yearly trends to minute-by-minute breakdowns
- Top-N ranking -- identify trending videos, fastest-growing channels, and most-engaged content across the platform or within specific categories
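The Top-N requirement above can usually be served from pre-aggregated counts rather than raw events. A minimal sketch with a heap (the video IDs and counts here are made up for illustration):

```python
import heapq
from collections import Counter

# Pre-aggregated view counts per video for some window (illustrative data).
view_counts = Counter({"v1": 980, "v2": 1520, "v3": 440, "v4": 1100})

# Top-N by view count. At platform scale this would run over per-window
# rollups (or an approximate structure like count-min sketch + heap),
# never over the raw event stream.
top_2 = heapq.nlargest(2, view_counts.items(), key=lambda kv: kv[1])
print(top_2)  # [('v2', 1520), ('v4', 1100)]
```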
Non-Functional
- Scalability -- handle 10 billion events per day (100K+ events/second sustained, 500K+ during peaks) with ability to scale horizontally as platform grows
- Reliability -- ensure exactly-once counting semantics for billing-critical metrics; tolerate datacenter failures without data loss; support event replay for corrections
- Latency -- serve dashboard queries in under 500ms at p99; ingest events with end-to-end latency under 10 minutes for real-time views
- Consistency -- accept eventual consistency for real-time dashboards but guarantee strong consistency for finalized daily/monthly reports used in financial reconciliation
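A quick back-of-envelope check of the sustained rate implied by these numbers:

```python
# 10 billion events/day spread over 86,400 seconds.
EVENTS_PER_DAY = 10_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

sustained_rate = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{sustained_rate:,.0f} events/second sustained")  # ~115,741/s

# A peak factor of roughly 4-5x over the average is a common planning
# assumption, consistent with the 500K+/s peak figure above.
peak_rate = sustained_rate * 4.3
print(f"~{peak_rate:,.0f} events/second at peak")
```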
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Event Ingestion and Deduplication
Managing billions of events per day requires careful attention to idempotency and exactly-once semantics. Interviewers want to see how you prevent duplicate counting when clients retry requests, network partitions cause message redelivery, or processing failures trigger replays.
Hints to consider:
- Assign unique event IDs at the client and use them for deduplication in the stream processor
- Partition events by video ID or user ID to maintain ordering while enabling horizontal scaling
- Consider a distributed commit log like Kafka for durability, replayability, and decoupling producers from consumers
- Design your event schema to include client-side timestamps for event-time processing and handling out-of-order arrivals
2. Stream Processing and Aggregation Strategy
The heart of the system is transforming raw events into queryable aggregates. Interviewers expect you to discuss windowing strategies, handling late data, and the tradeoff between speed and accuracy.
Hints to consider:
- Use tumbling or sliding windows to compute aggregates incrementally rather than scanning all historical data
- Implement watermarks to balance waiting for late events against freshness requirements
- Pre-aggregate data at multiple granularities (minute, hour, day) to serve different query patterns efficiently
- Maintain provisional results that update in real-time and finalized results that incorporate late-arriving corrections
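The tumbling-window and watermark hints can be combined into one sketch. The window size and allowed lateness below are illustrative choices, and real stream processors (Flink, Kafka Streams, etc.) provide this machinery; this only shows the mechanics:

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # tumbling 1-minute windows
ALLOWED_LATENESS = 300   # accept events up to 5 minutes late (illustrative)


class TumblingWindowCounter:
    """Counts views per (video_id, window_start) in event time.

    The watermark trails the maximum event time seen by ALLOWED_LATENESS;
    events landing in windows that closed before the watermark become
    delta corrections instead of live-count updates.
    """

    def __init__(self):
        self.counts = defaultdict(int)  # (video_id, window_start) -> count
        self.corrections = []           # (video_id, window_start, delta)
        self.max_event_time = 0.0

    def process(self, video_id: str, event_time: float) -> None:
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - ALLOWED_LATENESS
        window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            # Window already finalized: emit a correction downstream.
            self.corrections.append((video_id, window_start, 1))
        else:
            self.counts[(video_id, window_start)] += 1


agg = TumblingWindowCounter()
agg.process("v1", 1000.0)  # window [960, 1020)
agg.process("v1", 1010.0)  # same window
agg.process("v1", 2000.0)  # advances watermark to 1700
agg.process("v1", 500.0)   # window [480, 540) closed -> correction
```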
3. Storage Model for Multi-Dimensional Analytics
Your storage choices directly impact query performance. Interviewers look for understanding of time-series databases, columnar stores, and how to model data for efficient filtering across multiple dimensions.
Hints to consider:
- Use a time-series database or wide-column store optimized for time-range queries and high write throughput
- Denormalize data to avoid joins; store pre-aggregated metrics keyed by composite keys like (video_id, country, device, hour)
- Consider separate hot and cold storage tiers: recent data in memory-optimized stores, historical data in columnar format
- Implement rollup tables at different time granularities to avoid scanning minute-level data for yearly trends
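The composite-key and rollup hints together look like this in miniature. The key shape `(video_id, country, device, bucket_start)` follows the hint above; the functions and sample data are made up for illustration:

```python
from collections import defaultdict

# Pre-aggregated counts keyed by the composite key from the hint above.
minute_counts = defaultdict(int)  # (video_id, country, device, minute_bucket)


def record_view(video_id: str, country: str, device: str, event_time: float) -> None:
    minute = int(event_time // 60) * 60
    minute_counts[(video_id, country, device, minute)] += 1


def roll_up_to_hour() -> dict:
    """Collapse minute buckets into hour buckets so long-range queries
    read rollups instead of scanning minute-level rows."""
    hour_counts = defaultdict(int)
    for (vid, country, device, minute), n in minute_counts.items():
        hour_counts[(vid, country, device, minute - minute % 3600)] += n
    return dict(hour_counts)


record_view("v1", "US", "ios", 100.0)   # minute bucket 60, hour bucket 0
record_view("v1", "US", "ios", 130.0)   # minute bucket 120, hour bucket 0
record_view("v1", "DE", "web", 4000.0)  # minute bucket 3960, hour bucket 3600
hourly = roll_up_to_hour()
```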
4. Handling Late and Out-of-Order Events
Mobile clients on poor networks, offline viewing, and clock skew cause events to arrive late. Your design must account for this without constantly rewriting historical aggregates.
Hints to consider:
- Process events based on event time (when the view occurred), not processing time (when it arrived)
- Define acceptable lateness windows (e.g., accept events up to 24 hours late) and emit corrections for affected time buckets
- Use delta updates to increment already-published aggregates when late events arrive
- Maintain separate real-time and batch processing paths that eventually reconcile
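The delta-update hint in miniature: fold a late event into an already-published bucket as an increment rather than recomputing the window from raw events. A minimal sketch with made-up names and data:

```python
# Already-published hourly aggregates being served to dashboards.
published = {("v1", 0): 100}  # (video_id, hour_bucket) -> count


def apply_late_event(video_id: str, event_time: float) -> tuple:
    """Apply a late event as a delta to its published bucket.
    A real system would also log the correction so downstream
    consumers (e.g. finalized reports) can reconcile."""
    hour = int(event_time // 3600) * 3600
    key = (video_id, hour)
    published[key] = published.get(key, 0) + 1
    return key, published[key]


key, new_count = apply_late_event("v1", 1800.0)  # lands in hour bucket 0
print(key, new_count)  # ('v1', 0) 101
```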
5. Query Optimization and Serving Layer
Interactive dashboards require sub-second query responses across billions of records. Interviewers want to see how you optimize read paths and cache intelligently.
Hints to consider:
- Cache frequently accessed aggregates (popular videos, large channels) in a distributed cache with TTL-based invalidation
- Use materialized views or OLAP cubes for common query patterns to avoid computing aggregates on the fly
- Implement query result pagination and sampling for exploratory queries over large datasets
- Consider a serving layer that unions recent streaming results with stable batch results for the best of both worlds
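The last hint, a serving layer that unions batch and streaming results, reduces to one careful read path: finalized batch counts up to a high-watermark plus provisional streaming windows after it, with no overlap. A sketch under those assumptions (stores and watermark are made up):

```python
# Finalized batch results, complete for event time < BATCH_HIGH_WATERMARK.
batch_counts = {"v1": 10_000}
BATCH_HIGH_WATERMARK = 3600  # assumption: batch covers event time < 3600

# Provisional streaming results keyed by (video_id, window_start).
streaming_counts = {("v1", 3600): 40, ("v1", 3660): 25, ("v1", 300): 7}


def total_views(video_id: str) -> int:
    """Union finalized batch counts with streaming windows that start at
    or after the batch high-watermark; earlier streaming windows are
    ignored because the batch result already covers them (no double
    counting)."""
    finalized = batch_counts.get(video_id, 0)
    provisional = sum(
        n for (vid, window), n in streaming_counts.items()
        if vid == video_id and window >= BATCH_HIGH_WATERMARK
    )
    return finalized + provisional


print(total_views("v1"))  # -> 10065 (10,000 finalized + 40 + 25 provisional)
```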