Design a Real-Time Event Message System
System Design · Must
Problem Statement
Design a real-time event messaging system that can ingest event data (movie releases, sports events, etc.) from various sources and provide fast in-memory storage for real-time queries and aggregations.
A real-time event messaging system is a platform that ingests time-sensitive signals (movie releases, sports scores, ticket drops) from many sources and serves low-latency queries and aggregations from in-memory storage. Think of live score tickers, trending dashboards, and real-time leaderboards where users see counts, top-K lists, or rates update instantly.
Interviewers ask this to test your grasp of streaming ingestion, stateful aggregations, and OLAP-style read patterns under tight latency budgets. They want to see how you handle ordering, backpressure, hot keys, idempotency, and fault tolerance, while balancing freshness, availability, and cost. Expect to discuss event pipelines, windowing, in-memory data models, and recovery strategies that keep the system accurate and fast at scale.
Key Requirements
Functional
- Event ingestion -- ingest event data from multiple external sources in near real time with guaranteed delivery
- Real-time aggregations -- query fresh results and aggregations (counts, rates, top-K) over recent time windows by filters like category, time, or geography
- Live subscriptions -- subscribe to live updates for saved queries or dashboards and receive changes with low latency
- Dynamic aggregation rules -- define and update aggregation rules or metrics and see results reflected quickly without full system downtime
Non-Functional
- Scalability -- handle 500,000 events per second across thousands of event categories with horizontal scaling
- Reliability -- maintain 99.9% uptime with no data loss; recover in-memory state from durable logs after failures
- Latency -- process incoming events within 200ms; serve dashboard queries in under 100ms at P99
- Consistency -- ensure users see monotonically increasing counters; handle out-of-order and duplicate events correctly
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Stream Processing Architecture
The core challenge is building a pipeline that handles high-velocity event streams with ordering guarantees and exactly-once semantics while computing stateful aggregations in real time.
Hints to consider:
- Partition event streams by category or event type to maintain ordering within each partition while enabling parallel processing
- Use watermarking strategies to handle late-arriving events without blocking real-time updates
- Design idempotent update logic using event IDs or sequence numbers to prevent double-counting when reprocessing
- Consider micro-batching to reduce per-event overhead while keeping latency within the 200ms budget
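The dedup and micro-batching hints above can be sketched as a per-partition aggregator that applies a batch of events and skips any event ID it has already seen. This is an illustrative sketch, not a production design: the names (`PartitionAggregator`, `apply_batch`) are assumptions, and a real system would bound the dedup set with a TTL structure or Bloom filter rather than clearing it.

```python
from collections import defaultdict

class PartitionAggregator:
    """Per-partition counter that ignores duplicate event IDs (illustrative)."""

    def __init__(self, dedup_capacity=100_000):
        self.counts = defaultdict(int)   # category -> running count
        self.seen_ids = set()            # recent event IDs, for idempotency
        self.dedup_capacity = dedup_capacity

    def apply_batch(self, events):
        """Apply a micro-batch; each event is (event_id, category)."""
        applied = 0
        for event_id, category in events:
            if event_id in self.seen_ids:
                continue                 # duplicate: already counted once
            self.seen_ids.add(event_id)
            self.counts[category] += 1
            applied += 1
        # Crude memory bound; a real system would expire IDs by time instead
        if len(self.seen_ids) > self.dedup_capacity:
            self.seen_ids.clear()
        return applied

agg = PartitionAggregator()
batch = [("e1", "sports"), ("e2", "movies"), ("e1", "sports")]  # e1 duplicated
agg.apply_batch(batch)
print(agg.counts["sports"])  # 1, not 2: the duplicate was dropped
```

Because each partition owns its keys, this logic can run on many instances in parallel while ordering holds within a partition.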
2. In-Memory State Management
Serving low-latency aggregation queries requires carefully designed in-memory data structures that balance query performance, memory usage, and update efficiency.
Hints to consider:
- Choose appropriate structures for different query patterns: sorted sets for leaderboards and top-K, hash maps for counters, time-series buckets for windowed metrics
- Implement tiered storage where active time windows live in memory, recent windows in fast SSD-backed stores, and historical data in columnar databases
- Design snapshot and compaction strategies to bound memory usage as events accumulate
- Plan cache invalidation or update propagation when computed aggregates change
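As a concrete illustration of matching structures to query patterns, the sketch below combines time-bucketed counters (for windowed metrics) with a heap-based top-K read. The class and parameter names are assumptions for this example; a production store would also handle snapshots and concurrent access.

```python
import heapq
import time
from collections import defaultdict

class WindowedMetrics:
    """Tumbling 1-second buckets plus a top-K view over the retained window."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        # bucket_ts -> category -> count
        self.buckets = defaultdict(lambda: defaultdict(int))

    def record(self, category, event_ts):
        bucket = int(event_ts)           # assign to a 1-second bucket
        self.buckets[bucket][category] += 1

    def evict(self, now):
        """Drop buckets older than the window to bound memory."""
        cutoff = int(now) - self.window_seconds
        for ts in [ts for ts in self.buckets if ts < cutoff]:
            del self.buckets[ts]

    def top_k(self, k, now):
        self.evict(now)
        totals = defaultdict(int)
        for per_category in self.buckets.values():
            for cat, n in per_category.items():
                totals[cat] += n
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

m = WindowedMetrics(window_seconds=60)
now = time.time()
for _ in range(3):
    m.record("sports", now)
m.record("movies", now)
print(m.top_k(2, now))  # [('sports', 3), ('movies', 1)]
```

Eviction on read keeps memory proportional to the window size, which is the simplest form of the compaction strategy mentioned above.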
3. Handling Hot Partitions and Traffic Spikes
Popular events like championship games or major releases create massive imbalances where a single category generates far more load than others, requiring dynamic resource allocation.
Hints to consider:
- Detect hot keys and apply adaptive strategies like read replicas, client-side caching with short TTLs, or fan-out to multiple processing instances
- Use rate limiting and request coalescing at the API gateway to protect downstream services from thundering herds
- Pre-compute and cache common aggregations to shift work from the read path to the write path
- Design autoscaling policies that respond to both sustained load and sudden spikes within seconds
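One common mitigation for write-hot keys, implied by the fan-out hint above, is counter sharding: split a hot key's counter across N shards so writes spread out, and sum the shards on read. A minimal sketch, with illustrative names:

```python
import random
from collections import defaultdict

class ShardedCounter:
    """Spread a hot key's increments across N shards.

    Writes pick a random shard (parallelizable, no single contended cell);
    reads sum all shards, trading a small read cost for write throughput.
    """

    def __init__(self, num_shards=8):
        self.num_shards = num_shards
        self.shards = defaultdict(lambda: [0] * num_shards)

    def increment(self, key, amount=1):
        shard = random.randrange(self.num_shards)
        self.shards[key][shard] += amount

    def get(self, key):
        return sum(self.shards[key])

c = ShardedCounter(num_shards=4)
for _ in range(1000):
    c.increment("finals-game")   # simulated hot key during a championship
print(c.get("finals-game"))      # 1000
```

In a distributed deployment the shards would live on different nodes; the same idea underlies pre-computing and caching hot aggregates so reads never touch the contended write path.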
4. Data Correctness and Recovery
Out-of-order events, duplicate feeds, and system failures can corrupt aggregations if not handled properly, requiring robust reconciliation and recovery mechanisms.
Hints to consider:
- Implement event-time processing with configurable watermarks rather than processing-time to handle network delays
- Design deduplication using sliding windows of event IDs or use exactly-once processing guarantees from stream processors
- Maintain durable event logs that can be replayed to rebuild in-memory state after failures
- Create reconciliation jobs that periodically compare aggregates with authoritative sources to detect drift
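The replay hint above depends on recovery being idempotent: rebuilding from the durable log must not double-count events already applied. A sketch using log offsets as the idempotency token (class and method names are assumptions):

```python
class ReplayableAggregator:
    """Rebuild in-memory counts by replaying an append-only log (sketch)."""

    def __init__(self):
        self.counts = {}
        self.last_offset = -1   # highest log offset applied, for safe resume

    def apply(self, offset, category):
        if offset <= self.last_offset:
            return              # already applied: replay is a no-op
        self.counts[category] = self.counts.get(category, 0) + 1
        self.last_offset = offset

    @classmethod
    def rebuild(cls, log):
        """Recover state after a crash by replaying the durable log."""
        agg = cls()
        for offset, category in log:
            agg.apply(offset, category)
        return agg

log = [(0, "sports"), (1, "movies"), (2, "sports")]
agg = ReplayableAggregator.rebuild(log)
agg.apply(1, "movies")   # redelivered record: ignored by the offset check
print(agg.counts)        # {'sports': 2, 'movies': 1}
```

Periodically checkpointing `counts` plus `last_offset` lets recovery replay only the log suffix instead of the whole history, which keeps restart time bounded.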
Suggested Approach
Step 1: Clarify Requirements
Start by confirming the types of events and query patterns. Ask about the expected event rate, number of distinct categories, and peak-to-average traffic ratio. Determine the required aggregation types: simple counts, rates, top-K, or more complex multi-dimensional rollups. Clarify freshness requirements: must dashboards reflect events within milliseconds, or is a few seconds of lag acceptable? Ask about retention: how long must aggregations stay queryable, and do users need access to raw events? Confirm whether the system must support ad-hoc queries or only pre-defined aggregations.
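After clarifying requirements, a quick back-of-envelope calculation grounds the rest of the design. The numbers below take the stated 500K events/s; the payload size, category count, and counter size are illustrative assumptions, not figures from the prompt.

```python
# Back-of-envelope sizing (assumed values marked below).
events_per_sec = 500_000
avg_event_bytes = 200            # assumed average payload size
ingest_mb_per_sec = events_per_sec * avg_event_bytes / 1e6
print(f"Ingest bandwidth: {ingest_mb_per_sec:.0f} MB/s")   # 100 MB/s

categories = 5_000               # "thousands of categories", assumed midpoint
window_buckets = 3_600           # one hour of 1-second buckets, assumed
bytes_per_counter = 16           # key hash + 64-bit count, assumed
memory_gb = categories * window_buckets * bytes_per_counter / 1e9
print(f"Hot-window counter memory: {memory_gb:.2f} GB")    # ~0.29 GB
```

The takeaway: raw ingest bandwidth is modest for a horizontally scaled pipeline, and the hot aggregation window fits comfortably in memory on a single node, so the hard problems are skew, ordering, and recovery rather than raw capacity.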