Design a Real-Time Analytics Dashboard
System Design (Optional)
Problem Statement
Design a real-time analytics dashboard system that ingests clickstream events from millions of active users, computes aggregated metrics (page views, unique visitors, conversion funnels), and displays live visualizations with sub-second refresh rates. The system must handle peak traffic spikes during product launches or marketing campaigns while maintaining query responsiveness and data accuracy.
Your solution should support multiple concurrent dashboards with different time windows (last 5 minutes, last hour, last 24 hours) and dimensional breakdowns (by country, device type, campaign ID). Users expect metrics to update continuously without manual refresh, and historical data should remain queryable for trend analysis. The architecture must balance the competing demands of ingestion throughput, query latency, and cost efficiency at scale.
Key Requirements
Functional
- Event ingestion -- accept and process millions of clickstream events per second from web and mobile clients with guaranteed delivery
- Real-time aggregation -- compute rolling metrics (counts, sums, averages, percentiles) over sliding time windows with sub-minute freshness
- Interactive querying -- support ad-hoc queries with dimensional filters and drill-downs that return results in under 500ms
- Alerting and anomaly detection -- trigger notifications when metrics breach thresholds or deviate from expected patterns
Non-Functional
- Scalability -- handle 10 million events/second during peak hours with the ability to scale horizontally
- Reliability -- ensure 99.9% data delivery with exactly-once processing semantics and automatic failure recovery
- Latency -- display updated metrics within 1 second of event occurrence for real-time dashboards
- Consistency -- provide eventually consistent aggregates with bounded staleness for historical queries
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline Design and Backpressure Management
Handling bursty traffic patterns without data loss or degraded performance is critical. Interviewers want to see how you buffer incoming events, apply backpressure when downstream systems are overloaded, and ensure durability during failures.
Hints to consider:
- Use a distributed message queue (Kafka, Pulsar) as a shock absorber between producers and consumers to decouple ingestion from processing
- Implement partitioning strategies that distribute load evenly while preserving ordering guarantees where needed (e.g., partition by user_id for session analysis)
- Design retry logic with exponential backoff and dead-letter queues for events that consistently fail validation or processing
- Consider pre-aggregation at the edge (client-side batching) to reduce network overhead and downstream processing load
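The retry-with-backoff and dead-letter ideas above can be sketched in a few lines. This is a minimal illustration, not a production client: `publish` and `dead_letter` are hypothetical callables standing in for a real message-queue producer (e.g. a Kafka client) and a DLQ topic, and full jitter is assumed for the backoff.

```python
import random
import time

def send_with_backoff(event, publish, dead_letter, max_retries=5, base_delay=0.1):
    """Attempt to publish an event, backing off exponentially on failure.

    Events that exhaust all retries are routed to a dead-letter queue
    for later inspection instead of being silently dropped.
    """
    for attempt in range(max_retries):
        try:
            publish(event)
            return True
        except ConnectionError:
            # Full jitter prevents synchronized retry storms across producers.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)
    dead_letter(event)  # durability: the event is preserved, just quarantined
    return False
```

In an interview, the key point to call out is that the dead-letter path trades latency for durability: a poison event stops blocking the pipeline but is never lost.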
2. Stream Processing Architecture for Real-Time Aggregation
Computing accurate metrics over sliding time windows while handling late-arriving events and out-of-order data requires careful state management. Strong candidates articulate watermarking strategies, windowing semantics, and stateful operator design.
Hints to consider:
- Choose between tumbling windows (fixed, non-overlapping) and sliding windows (overlapping) based on query patterns and acceptable latency
- Implement watermarks to handle late arrivals while bounding state retention (e.g., allow 5-minute grace period then finalize window)
- Use incremental aggregation with combine functions that are associative and commutative to enable parallel processing and exactly-once semantics
- Discuss tradeoffs between stateful stream processing (Flink, Kafka Streams) versus micro-batching (Spark Streaming) for your use case
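The watermark and windowing semantics above can be made concrete with a toy tumbling-window counter. This is a simplified single-process sketch, assuming event-time seconds and a watermark defined as the maximum event time seen minus the grace period; a real engine like Flink manages this state in a distributed, fault-tolerant way.

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Count events per fixed (tumbling) window, tolerating late arrivals.

    Windows are keyed by their start timestamp. A window is finalized only
    once the watermark (max event time seen minus the grace period) passes
    its end; events arriving after that point are counted as dropped.
    """

    def __init__(self, window_size_s=60, grace_s=300):
        self.window_size = window_size_s
        self.grace = grace_s
        self.open_windows = defaultdict(int)   # window_start -> running count
        self.finalized = {}                    # window_start -> final count
        self.max_event_time = 0
        self.late_dropped = 0

    def add(self, event_time_s):
        window_start = event_time_s - (event_time_s % self.window_size)
        if window_start in self.finalized:
            self.late_dropped += 1             # too late: window already closed
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time_s)
        self._advance_watermark()

    def _advance_watermark(self):
        # Finalize every open window whose end is at or behind the watermark,
        # bounding how much state we retain for out-of-order data.
        watermark = self.max_event_time - self.grace
        closable = [s for s in self.open_windows if s + self.window_size <= watermark]
        for start in closable:
            self.finalized[start] = self.open_windows.pop(start)
```

Note that counting is associative and commutative, so partitioned instances of this operator could be merged, which is what makes parallel, exactly-once aggregation feasible.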
3. Storage Layer for Fast Aggregated Reads
Supporting sub-second queries across arbitrary time ranges and dimensions demands specialized storage optimizations. Interviewers expect you to reason about columnar formats, indexing strategies, and tiered storage architectures.
Hints to consider:
- Store pre-aggregated rollups at multiple granularities (per-minute, per-hour, per-day) to accelerate common queries without scanning raw events
- Use columnar databases (ClickHouse, Druid) optimized for analytical workloads with compression, vectorized execution, and distributed query planning
- Implement hot/warm/cold tiering where recent data lives in memory or SSD for low-latency access while older data moves to cheaper object storage
- Materialize common dimensional views (by country, by device) to avoid expensive runtime joins and filters
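The multi-granularity rollup idea can be illustrated with a tiny helper. This is a sketch under the assumption that buckets are epoch-minute integers and the metric is additive (counts, sums); non-additive metrics like distinct counts need sketches such as HyperLogLog instead.

```python
from collections import defaultdict

def roll_up(minute_counts, factor=60):
    """Collapse fine-grained counts into a coarser granularity.

    minute_counts maps an epoch-minute bucket to a count; the result maps
    each epoch-hour bucket (factor=60 minutes) to the summed count. Because
    counts are additive, coarser rollups can always be derived from finer
    ones without re-scanning raw events.
    """
    coarse = defaultdict(int)
    for bucket, count in minute_counts.items():
        coarse[bucket // factor] += count
    return dict(coarse)
```

A query for "last 24 hours" then reads 24 hourly rows instead of 1,440 minutely rows or billions of raw events, which is the core reason pre-aggregated rollups keep sub-second latency achievable.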
4. Query Serving and Dashboard Update Mechanisms
Delivering live updates to browser clients efficiently while maintaining system stability under high concurrency requires thoughtful API design and caching strategies.
Hints to consider:
- Use WebSockets or Server-Sent Events (SSE) for pushing incremental updates rather than polling, reducing network overhead and perceived latency
- Implement query result caching with short TTLs (10-30 seconds) to serve dashboards displaying the same metrics without re-executing expensive aggregations
- Design rate limiting and query complexity guards to prevent a single user from overwhelming the system with expensive ad-hoc queries
- Consider serving recently computed results from a distributed cache (Redis) with fallback to the analytical database for cache misses
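The short-TTL read-through caching pattern above can be sketched as follows. A plain dict stands in for a distributed cache such as Redis, and `loader` is a hypothetical callable representing the fallback query against the analytical database; an injectable clock makes the expiry logic testable.

```python
import time

class TtlCache:
    """Short-TTL read-through cache for dashboard query results.

    On a miss or an expired entry, the loader function (e.g. a query
    against the analytical database) is invoked and its result cached,
    so N concurrent dashboards showing the same metric cost roughly one
    expensive aggregation per TTL interval instead of N.
    """

    def __init__(self, ttl_s=15, clock=time.monotonic):
        self.ttl = ttl_s
        self.clock = clock
        self._store = {}   # key -> (expires_at, value)

    def get(self, key, loader):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]                      # fresh hit
        value = loader()                         # miss: fall back to the DB
        self._store[key] = (self.clock() + self.ttl, value)
        return value
```

With Redis the same pattern maps to `GET` followed by `SET` with an expiry on a miss; the TTL directly sets the staleness bound users can observe on the dashboard.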