Design a Real-Time Event Message System
Problem Statement
Design a system that ingests time-sensitive events from many sources — movie releases, sports scores, ticket drops, breaking news — and serves low-latency queries and live dashboards. Users see counts, trending lists, and aggregated stats update in near real time.
The challenge is handling bursty traffic (millions of events during a major sporting event), computing aggregations on the fly without scanning raw data at query time, and keeping dashboards responsive. You need to reason about streaming pipelines, stateful windowed aggregations, and how to serve pre-computed results with sub-second latency.
Key Requirements
Functional
- Event ingestion -- accept events from multiple external sources in near real time with metadata (category, timestamp, location)
- Real-time aggregation -- compute metrics like counts, rates, and top-K over recent time windows (last 5 min, 1 hour)
- Live dashboards -- users subscribe to queries and see results update automatically with low latency
- Configurable metrics -- users can define new aggregation rules without system downtime
Non-Functional
- Scalability -- handle 100K+ events per second during peak periods
- Latency -- aggregated metrics should reflect new events within 5-10 seconds; dashboard queries return in under 500ms
- Reliability -- no event loss even during infrastructure failures; aggregations should eventually be correct
- Ordering -- handle late-arriving and out-of-order events gracefully
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Stream Processing and Windowed Aggregations
The core of the system is a streaming pipeline that computes aggregations over sliding or tumbling time windows. Interviewers want to see you reason about state management, windowing semantics, and handling late data.
Hints to consider:
- Use tumbling windows (e.g., 10-second buckets) for simple counters and sliding windows for smoother trend lines
- Discuss watermarks: how do you know when a window is "complete" enough to emit results?
- Late events can be handled by allowing a grace period before finalizing a window, or by retracting/correcting previous outputs
- Keep windowed state in memory for speed, but checkpoint periodically for fault tolerance
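The hints above can be combined into a minimal tumbling-window counter with a watermark and a grace period for late events. This is an illustrative sketch, not any particular framework's API; the window size, grace period, and class name are all assumptions chosen for the example:

```python
from collections import defaultdict

WINDOW_SIZE = 10   # seconds per tumbling window (hint: 10-second buckets)
GRACE_PERIOD = 5   # accept events up to 5s behind the watermark

class TumblingWindowCounter:
    """Counts events per (window, key); a window is finalized only after
    the watermark has passed its end plus the grace period."""

    def __init__(self):
        self.windows = defaultdict(int)   # (window_start, key) -> count
        self.watermark = 0                # max event time seen so far

    def _window_start(self, event_time):
        return (event_time // WINDOW_SIZE) * WINDOW_SIZE

    def ingest(self, key, event_time):
        self.watermark = max(self.watermark, event_time)
        start = self._window_start(event_time)
        # Reject events for windows already finalized; a real system might
        # route these to a correction/retraction path instead of dropping.
        if start + WINDOW_SIZE + GRACE_PERIOD <= self.watermark:
            return False
        self.windows[(start, key)] += 1
        return True

    def emit_closed(self):
        """Finalize and return windows whose grace period has elapsed."""
        closed = {k: v for k, v in self.windows.items()
                  if k[0] + WINDOW_SIZE + GRACE_PERIOD <= self.watermark}
        for k in closed:
            del self.windows[k]
        return closed
```

In production this state would live in a stream processor's managed state with periodic checkpoints, so a restarted worker can rebuild its windows instead of losing them.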
2. Handling Hot Partitions and Viral Events
A single viral event (World Cup goal, celebrity tweet) can generate a massive spike on one partition key, overwhelming a single worker.
Hints to consider:
- Partition events by category or event ID, but add sub-partitioning (salt keys) for known hot keys
- Pre-aggregate at multiple tiers: per-second at the edge, per-minute in the stream processor, per-hour in batch
- Use approximate data structures (Count-Min Sketch, HyperLogLog) for top-K and cardinality to reduce memory
- Discuss backpressure: what happens when ingestion outpaces processing?
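Key salting from the first hint can be sketched as a small routing function. The hot-key list, salt count, and partition count below are hypothetical values for illustration; the important idea is that a known hot key fans out across several sub-partitions, and a downstream stage must then merge the per-salt partial aggregates back together:

```python
import hashlib
import random

HOT_KEYS = {"worldcup-goal"}   # hypothetical set of known hot keys
NUM_SALTS = 8                  # fan each hot key out over 8 sub-partitions
NUM_PARTITIONS = 32

def partition_for(key):
    """Route a key to a partition; salt known hot keys so their traffic
    spreads across NUM_SALTS workers instead of overwhelming one."""
    if key in HOT_KEYS:
        # Random salt per event; a merge stage sums the partials per key.
        salted = f"{key}#{random.randrange(NUM_SALTS)}"
    else:
        salted = key
    digest = hashlib.sha256(salted.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

Note the trade-off: salting buys parallelism during the spike at the cost of an extra merge step, which is why it is usually reserved for keys known (or detected) to be hot.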
3. Serving Pre-Computed Results
Dashboards need fast reads, but aggregations are expensive to compute. The serving layer should return pre-computed results rather than computing them on demand.
Hints to consider:
- Write aggregated results to a fast read store (Redis sorted sets for top-K, hash maps for counters)
- Use TTLs on cached results to automatically expire stale data
- For live subscriptions, push updates via WebSockets when the underlying aggregate changes
- Discuss how to handle "drill-down" queries that need more granularity than the pre-computed data offers
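The serving-layer hints can be sketched with a small in-memory stand-in for the read store. In production this role would be played by something like Redis (sorted sets for top-K, hashes for counters, native TTLs); the class and method names here are illustrative assumptions, not from the original text:

```python
import heapq
import time

class ServingStore:
    """In-memory stand-in for the read-optimized store: the stream
    processor writes finished aggregates here, dashboards only read."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.counters = {}   # metric name -> (value, expires_at)

    def put(self, metric, value, now=None):
        now = time.time() if now is None else now
        self.counters[metric] = (value, now + self.ttl)

    def get(self, metric, now=None):
        now = time.time() if now is None else now
        entry = self.counters.get(metric)
        if entry is None or entry[1] <= now:
            self.counters.pop(metric, None)   # expired -> stale, drop it
            return None
        return entry[0]

    def top_k(self, prefix, k, now=None):
        """Top-k live metrics under a prefix, e.g. prefix='sports:'."""
        now = time.time() if now is None else now
        live = [(value, metric) for metric, (value, exp)
                in self.counters.items()
                if exp > now and metric.startswith(prefix)]
        return heapq.nlargest(k, live)
```

Because every read is a lookup over already-aggregated values, dashboard queries stay well under the 500ms budget; a push channel (e.g. WebSockets) can then notify subscribers whenever `put` changes an aggregate they watch.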