Design a Real-Time Event Message System
System Design · Must
Problem Statement
Design a real-time event messaging system that can ingest event data (movie releases, sports events, etc.) from various sources and provide fast in-memory storage for real-time queries and aggregations.
A real-time event messaging system is a platform that ingests time-sensitive signals (movie releases, sports scores, ticket drops) from many sources and serves low-latency queries and aggregations from in-memory storage. Think of live score tickers, trending dashboards, and real-time leaderboards where users see counts, top-K lists, or rates update instantly.
Interviewers ask this to test your grasp of streaming ingestion, stateful aggregations, and OLAP-style read patterns under tight latency budgets. They want to see how you handle ordering, backpressure, hot keys, idempotency, and fault tolerance, while balancing freshness, availability, and cost. Expect to discuss event pipelines, windowing, in-memory data models, and recovery strategies that keep the system accurate and fast at scale.
Key Requirements
Functional
- Event ingestion -- ingest event data from multiple external sources in near real time with guaranteed delivery
- Real-time aggregations -- query fresh results and aggregations (counts, rates, top-K) over recent time windows by filters like category, time, or geography
- Live subscriptions -- subscribe to live updates for saved queries or dashboards and receive changes with low latency
- Dynamic aggregation rules -- define and update aggregation rules or metrics and see results reflected quickly without full system downtime
Non-Functional
- Scalability -- handle 500,000 events per second across thousands of event categories with horizontal scaling
- Reliability -- maintain 99.9% uptime with no data loss; recover in-memory state from durable logs after failures
- Latency -- process incoming events within 200ms; serve dashboard queries in under 100ms at P99
- Consistency -- ensure users see monotonically increasing counters; handle out-of-order and duplicate events correctly
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Stream Processing Architecture
The core challenge is building a pipeline that handles high-velocity event streams with ordering guarantees and exactly-once semantics while computing stateful aggregations in real time.
Hints to consider:
- Partition event streams by category or event type to maintain ordering within each partition while enabling parallel processing
- Use watermarking strategies to handle late-arriving events without blocking real-time updates
- Design idempotent update logic using event IDs or sequence numbers to prevent double-counting when reprocessing
- Consider micro-batching to reduce per-event overhead while keeping latency within the 200ms budget
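The dedup and micro-batching hints above can be sketched as a per-partition aggregator that applies a batch of events and skips any event ID it has already seen. This is an illustrative sketch, not a production design: the names (`PartitionAggregator`, `apply_batch`) are assumptions, and a real system would bound the dedup set with a TTL structure or Bloom filter rather than clearing it.

```python
from collections import defaultdict

class PartitionAggregator:
    """Per-partition counter that ignores duplicate event IDs (illustrative)."""

    def __init__(self, dedup_capacity=100_000):
        self.counts = defaultdict(int)   # category -> running count
        self.seen_ids = set()            # recent event IDs, for idempotency
        self.dedup_capacity = dedup_capacity

    def apply_batch(self, events):
        """Apply a micro-batch; each event is (event_id, category)."""
        applied = 0
        for event_id, category in events:
            if event_id in self.seen_ids:
                continue                 # duplicate: already counted once
            self.seen_ids.add(event_id)
            self.counts[category] += 1
            applied += 1
        # Crude memory bound; a real system would expire IDs by time instead
        if len(self.seen_ids) > self.dedup_capacity:
            self.seen_ids.clear()
        return applied

agg = PartitionAggregator()
batch = [("e1", "sports"), ("e2", "movies"), ("e1", "sports")]  # e1 duplicated
agg.apply_batch(batch)
print(agg.counts["sports"])  # 1, not 2: the duplicate was dropped
```

Because each partition owns its keys, this logic can run on many instances in parallel while ordering holds within a partition.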
2. In-Memory State Management
Serving low-latency aggregation queries requires carefully designed in-memory data structures that balance query performance, memory usage, and update efficiency.
Hints to consider:
- Choose appropriate structures for different query patterns: sorted sets for leaderboards and top-K, hash maps for counters, time-series buckets for windowed metrics
- Implement tiered storage where active time windows live in memory, recent windows in fast SSD-backed stores, and historical data in columnar databases
- Design snapshot and compaction strategies to bound memory usage as events accumulate
- Plan cache invalidation or update propagation when computed aggregates change
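As a concrete illustration of matching structures to query patterns, the sketch below combines time-bucketed counters (for windowed metrics) with a heap-based top-K read. The class and parameter names are assumptions for this example; a production store would also handle snapshots and concurrent access.

```python
import heapq
import time
from collections import defaultdict

class WindowedMetrics:
    """Tumbling 1-second buckets plus a top-K view over the retained window."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        # bucket_ts -> category -> count
        self.buckets = defaultdict(lambda: defaultdict(int))

    def record(self, category, event_ts):
        bucket = int(event_ts)           # assign to a 1-second bucket
        self.buckets[bucket][category] += 1

    def evict(self, now):
        """Drop buckets older than the window to bound memory."""
        cutoff = int(now) - self.window_seconds
        for ts in [ts for ts in self.buckets if ts < cutoff]:
            del self.buckets[ts]

    def top_k(self, k, now):
        self.evict(now)
        totals = defaultdict(int)
        for per_category in self.buckets.values():
            for cat, n in per_category.items():
                totals[cat] += n
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

m = WindowedMetrics(window_seconds=60)
now = time.time()
for _ in range(3):
    m.record("sports", now)
m.record("movies", now)
print(m.top_k(2, now))  # [('sports', 3), ('movies', 1)]
```

Eviction on read keeps memory proportional to the window size, which is the simplest form of the compaction strategy mentioned above.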
3. Handling Hot Partitions and Traffic Spikes
Popular events like championship games or major releases create massive imbalances where a single category generates far more load than others, requiring dynamic resource allocation.
Hints to consider:
- Detect hot keys and apply adaptive strategies like read replicas, client-side caching with short TTLs, or fan-out to multiple processing instances
- Use rate limiting and request coalescing at the API gateway to protect downstream services from thundering herds
- Pre-compute and cache common aggregations to shift work from the read path to the write path
- Design autoscaling policies that respond to both sustained load and sudden spikes within seconds
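One common mitigation for write-hot keys, implied by the fan-out hint above, is counter sharding: split a hot key's counter across N shards so writes spread out, and sum the shards on read. A minimal sketch, with illustrative names:

```python
import random
from collections import defaultdict

class ShardedCounter:
    """Spread a hot key's increments across N shards.

    Writes pick a random shard (parallelizable, no single contended cell);
    reads sum all shards, trading a small read cost for write throughput.
    """

    def __init__(self, num_shards=8):
        self.num_shards = num_shards
        self.shards = defaultdict(lambda: [0] * num_shards)

    def increment(self, key, amount=1):
        shard = random.randrange(self.num_shards)
        self.shards[key][shard] += amount

    def get(self, key):
        return sum(self.shards[key])

c = ShardedCounter(num_shards=4)
for _ in range(1000):
    c.increment("finals-game")   # simulated hot key during a championship
print(c.get("finals-game"))      # 1000
```

In a distributed deployment the shards would live on different nodes; the same idea underlies pre-computing and caching hot aggregates so reads never touch the contended write path.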
4. Data Correctness and Recovery
Out-of-order events, duplicate feeds, and system failures can corrupt aggregations if not handled properly, requiring robust reconciliation and recovery mechanisms.
Hints to consider:
- Implement event-time processing with configurable watermarks rather than processing-time to handle network delays
- Design deduplication using sliding windows of event IDs or use exactly-once processing guarantees from stream processors
- Maintain durable event logs that can be replayed to rebuild in-memory state after failures
- Create reconciliation jobs that periodically compare aggregates with authoritative sources to detect drift
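The replay hint above depends on recovery being idempotent: rebuilding from the durable log must not double-count events already applied. A sketch using log offsets as the idempotency token (class and method names are assumptions):

```python
class ReplayableAggregator:
    """Rebuild in-memory counts by replaying an append-only log (sketch)."""

    def __init__(self):
        self.counts = {}
        self.last_offset = -1   # highest log offset applied, for safe resume

    def apply(self, offset, category):
        if offset <= self.last_offset:
            return              # already applied: replay is a no-op
        self.counts[category] = self.counts.get(category, 0) + 1
        self.last_offset = offset

    @classmethod
    def rebuild(cls, log):
        """Recover state after a crash by replaying the durable log."""
        agg = cls()
        for offset, category in log:
            agg.apply(offset, category)
        return agg

log = [(0, "sports"), (1, "movies"), (2, "sports")]
agg = ReplayableAggregator.rebuild(log)
agg.apply(1, "movies")   # redelivered record: ignored by the offset check
print(agg.counts)        # {'sports': 2, 'movies': 1}
```

Periodically checkpointing `counts` plus `last_offset` lets recovery replay only the log suffix instead of the whole history, which keeps restart time bounded.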
Suggested Approach
Step 1: Clarify Requirements
Start by confirming the types of events and query patterns. Ask about the expected event rate, number of distinct categories, and peak-to-average traffic ratio. Determine the required aggregation types: simple counts, rates, top-K, or more complex multi-dimensional rollups. Clarify freshness requirements: must dashboards reflect events within milliseconds, or is a few seconds of lag acceptable? Ask about retention: how long must aggregations stay queryable, and do users need access to raw events? Confirm whether the system must support ad-hoc queries or only pre-defined aggregations.
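After clarifying requirements, a quick back-of-envelope calculation grounds the rest of the design. The numbers below take the stated 500K events/s; the payload size, category count, and counter size are illustrative assumptions, not figures from the prompt.

```python
# Back-of-envelope sizing (assumed values marked below).
events_per_sec = 500_000
avg_event_bytes = 200            # assumed average payload size
ingest_mb_per_sec = events_per_sec * avg_event_bytes / 1e6
print(f"Ingest bandwidth: {ingest_mb_per_sec:.0f} MB/s")   # 100 MB/s

categories = 5_000               # "thousands of categories", assumed midpoint
window_buckets = 3_600           # one hour of 1-second buckets, assumed
bytes_per_counter = 16           # key hash + 64-bit count, assumed
memory_gb = categories * window_buckets * bytes_per_counter / 1e9
print(f"Hot-window counter memory: {memory_gb:.2f} GB")    # ~0.29 GB
```

The takeaway: raw ingest bandwidth is modest for a horizontally scaled pipeline, and the hot aggregation window fits comfortably in memory on a single node, so the hard problems are skew, ordering, and recovery rather than raw capacity.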