Design a Real-Time Event Message System
Problem Statement
Design a system that ingests time-sensitive events from many sources — movie releases, sports scores, ticket drops, breaking news — and serves low-latency queries and live dashboards. Users see counts, trending lists, and aggregated stats update in near real time.
The challenge is handling bursty traffic (millions of events during a major sporting event), computing aggregations on the fly without scanning raw data at query time, and keeping dashboards responsive. You need to reason about streaming pipelines, stateful windowed aggregations, and how to serve pre-computed results with sub-second latency.
Key Requirements
Functional
- Event ingestion -- accept events from multiple external sources in near real time with metadata (category, timestamp, location)
- Real-time aggregation -- compute metrics like counts, rates, and top-K over recent time windows (last 5 min, 1 hour)
- Live dashboards -- users subscribe to queries and see results update automatically with low latency
- Configurable metrics -- users can define new aggregation rules without system downtime
Non-Functional
- Scalability -- handle 100K+ events per second during peak periods
- Latency -- aggregated metrics should reflect new events within 5-10 seconds; dashboard queries return in under 500ms
- Reliability -- no event loss even during infrastructure failures; aggregations should eventually be correct
- Ordering -- handle late-arriving and out-of-order events gracefully
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Stream Processing and Windowed Aggregations
The core of the system is a streaming pipeline that computes aggregations over sliding or tumbling time windows. Interviewers want to see you reason about state management, windowing semantics, and handling late data.
Hints to consider:
- Use tumbling windows (e.g., 10-second buckets) for simple counters and sliding windows for smoother trend lines
- Discuss watermarks: how do you know when a window is "complete" enough to emit results?
- Late events can be handled by allowing a grace period before finalizing a window, or by retracting/correcting previous outputs
- Keep windowed state in memory for speed, but checkpoint periodically for fault tolerance
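The hints above can be combined into a minimal tumbling-window counter with a watermark and a grace period for late events. This is an illustrative sketch, not any particular framework's API; the window size, grace period, and class name are all assumptions chosen for the example:

```python
from collections import defaultdict

WINDOW_SIZE = 10   # seconds per tumbling window (hint: 10-second buckets)
GRACE_PERIOD = 5   # accept events up to 5s behind the watermark

class TumblingWindowCounter:
    """Counts events per (window, key); a window is finalized only after
    the watermark has passed its end plus the grace period."""

    def __init__(self):
        self.windows = defaultdict(int)   # (window_start, key) -> count
        self.watermark = 0                # max event time seen so far

    def _window_start(self, event_time):
        return (event_time // WINDOW_SIZE) * WINDOW_SIZE

    def ingest(self, key, event_time):
        self.watermark = max(self.watermark, event_time)
        start = self._window_start(event_time)
        # Reject events for windows already finalized; a real system might
        # route these to a correction/retraction path instead of dropping.
        if start + WINDOW_SIZE + GRACE_PERIOD <= self.watermark:
            return False
        self.windows[(start, key)] += 1
        return True

    def emit_closed(self):
        """Finalize and return windows whose grace period has elapsed."""
        closed = {k: v for k, v in self.windows.items()
                  if k[0] + WINDOW_SIZE + GRACE_PERIOD <= self.watermark}
        for k in closed:
            del self.windows[k]
        return closed
```

In production this state would live in a stream processor's managed state with periodic checkpoints, so a restarted worker can rebuild its windows instead of losing them.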
2. Handling Hot Partitions and Viral Events
A single viral event (World Cup goal, celebrity tweet) can generate a massive spike on one partition key, overwhelming a single worker.
Hints to consider:
- Partition events by category or event ID, but add sub-partitioning (salt keys) for known hot keys
- Pre-aggregate at multiple tiers: per-second at the edge, per-minute in the stream processor, per-hour in batch
- Use approximate data structures (Count-Min Sketch, HyperLogLog) for top-K and cardinality to reduce memory
- Discuss backpressure: what happens when ingestion outpaces processing?
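Key salting from the first hint can be sketched as a small routing function. The hot-key list, salt count, and partition count below are hypothetical values for illustration; the important idea is that a known hot key fans out across several sub-partitions, and a downstream stage must then merge the per-salt partial aggregates back together:

```python
import hashlib
import random

HOT_KEYS = {"worldcup-goal"}   # hypothetical set of known hot keys
NUM_SALTS = 8                  # fan each hot key out over 8 sub-partitions
NUM_PARTITIONS = 32

def partition_for(key):
    """Route a key to a partition; salt known hot keys so their traffic
    spreads across NUM_SALTS workers instead of overwhelming one."""
    if key in HOT_KEYS:
        # Random salt per event; a merge stage sums the partials per key.
        salted = f"{key}#{random.randrange(NUM_SALTS)}"
    else:
        salted = key
    digest = hashlib.sha256(salted.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

Note the trade-off: salting buys parallelism during the spike at the cost of an extra merge step, which is why it is usually reserved for keys known (or detected) to be hot.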
3. Serving Pre-Computed Results
Dashboards need fast reads, but aggregations are expensive to compute. The serving layer should return pre-computed results rather than computing them on demand.
Hints to consider:
- Write aggregated results to a fast read store (Redis sorted sets for top-K, hash maps for counters)
- Use TTLs on cached results to automatically expire stale data
- For live subscriptions, push updates via WebSockets when the underlying aggregate changes
- Discuss how to handle "drill-down" queries that need more granularity than the pre-computed data offers
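The serving-layer hints can be sketched with a small in-memory stand-in for the read store. In production this role would be played by something like Redis (sorted sets for top-K, hashes for counters, native TTLs); the class and method names here are illustrative assumptions, not from the original text:

```python
import heapq
import time

class ServingStore:
    """In-memory stand-in for the read-optimized store: the stream
    processor writes finished aggregates here, dashboards only read."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.counters = {}   # metric name -> (value, expires_at)

    def put(self, metric, value, now=None):
        now = time.time() if now is None else now
        self.counters[metric] = (value, now + self.ttl)

    def get(self, metric, now=None):
        now = time.time() if now is None else now
        entry = self.counters.get(metric)
        if entry is None or entry[1] <= now:
            self.counters.pop(metric, None)   # expired -> stale, drop it
            return None
        return entry[0]

    def top_k(self, prefix, k, now=None):
        """Top-k live metrics under a prefix, e.g. prefix='sports:'."""
        now = time.time() if now is None else now
        live = [(value, metric) for metric, (value, exp)
                in self.counters.items()
                if exp > now and metric.startswith(prefix)]
        return heapq.nlargest(k, live)
```

Because every read is a lookup over already-aggregated values, dashboard queries stay well under the 500ms budget; a push channel (e.g. WebSockets) can then notify subscribers whenever `put` changes an aggregate they watch.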