Airbnb's "Data Analysis System (Streaming)" interview question focuses on designing a scalable streaming data pipeline for real-time analytics, typically involving Apache Kafka for ingestion and Apache Flink for processing.
Design a streaming system that processes high-volume event data (e.g., user interactions such as views, clicks, and bookings) from Airbnb's platform. The system must compute real-time windowed aggregations, such as view counts over the last 5 hours per (user, item) pair and average ratings over the last 90 days, and must support both online serving for predictions and offline backfills for training. It should provide exactly-once semantics, low-latency reads (<100 ms), and fault tolerance at millions of events per second, with data sources including Kafka topics for streams and databases for historical data. Point-in-time correct (PITC) joins are needed to enrich streaming features with batch features.[1]
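As one possible sketch of the ingestion side (not Airbnb's published code), the snippet below wires a Kafka topic into a Flink job with exactly-once checkpointing and bounded-out-of-orderness watermarks. The topic name `user-interactions`, the broker address, and the 30-second checkpoint interval are illustrative assumptions.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

public class IngestionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 30s in EXACTLY_ONCE mode so window state survives failures.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Hypothetical topic and broker; a real deployment would read these from config.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("user-interactions")
                .setGroupId("feature-pipeline")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Event-time watermarks that tolerate a few minutes of out-of-order events.
        DataStream<String> rawEvents = env.fromSource(
                source,
                WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofMinutes(5)),
                "interactions-source");

        rawEvents.print();  // placeholder; parsing and windowed aggregation would chain here
        env.execute("streaming-feature-ingestion");
    }
}
```

Parsing into typed events and the windowed aggregations from the table below would be chained onto `rawEvents` in place of the `print()` placeholder.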
No official full examples are publicly documented, but typical cases from Airbnb's feature engineering talks include:
| Scenario | Input Events | Aggregation Key | Window | Output Feature |
|----------|--------------|-----------------|--------|---------------|
| User views | {user_id: 123, item_id: 456, timestamp: t1, event: view} (stream) | (user_id, item_id) | Last 5h, tumbling | view_count_5h: 42 [1] |
| Item ratings | {item_id: 456, rating: 4.2, timestamp: t2} (DB + stream) | item_id | Last 90d, sliding | avg_rating_90d: 3.8 [1] |
Example Flink DataStream (Java) pseudocode for the 5-hour view count:

```java
stream
    .keyBy(e -> Tuple2.of(e.getUserId(), e.getItemId()))   // key by (user_id, item_id)
    .window(TumblingEventTimeWindows.of(Time.hours(5)))    // 5-hour event-time window
    .aggregate(new CountAggregator())                       // incremental count per key and window
    .addSink(onlineStoreSink);                               // publish the feature to the online store
```
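`CountAggregator` is not defined in the original pseudocode; a minimal sketch, assuming it only needs to count elements per key and window, is a generic Flink `AggregateFunction`:

```java
import org.apache.flink.api.common.functions.AggregateFunction;

// Minimal count aggregator: the accumulator is a running Long, the result is the
// window's element count (e.g., view_count_5h for a (user_id, item_id) key).
public class CountAggregator<T> implements AggregateFunction<T, Long, Long> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(T event, Long acc) {
        return acc + 1;              // one view event -> increment the count
    }

    @Override
    public Long getResult(Long acc) {
        return acc;                  // emitted once per key when the window fires
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;                // combine partial counts when windows are merged
    }
}
```

Because the aggregation is incremental, Flink keeps only a single `Long` per key and window rather than buffering every event, which matters at millions of events per second.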
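For the second table row (90-day average rating), a comparable sketch could key the rating stream by `item_id`, window it with `SlidingEventTimeWindows.of(Time.days(90), Time.days(1))`, and aggregate with a (sum, count) accumulator. The class below is an illustrative assumption rather than Airbnb's code; in practice such a long window is typically served by combining a batch-computed history from the database with the streaming tail, as the table's "DB + stream" input suggests.

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Running-average aggregator: the accumulator is (sum of ratings, number of ratings),
// and the result is their ratio (e.g., avg_rating_90d per item_id).
public class AvgRatingAggregator implements AggregateFunction<Double, Tuple2<Double, Long>, Double> {
    @Override
    public Tuple2<Double, Long> createAccumulator() {
        return Tuple2.of(0.0, 0L);
    }

    @Override
    public Tuple2<Double, Long> add(Double rating, Tuple2<Double, Long> acc) {
        return Tuple2.of(acc.f0 + rating, acc.f1 + 1);
    }

    @Override
    public Double getResult(Tuple2<Double, Long> acc) {
        return acc.f1 == 0 ? 0.0 : acc.f0 / acc.f1;   // guard against empty windows
    }

    @Override
    public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}
```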