Airbnb's "Data Analysis System (Streaming)" interview question focuses on designing a scalable streaming data pipeline for real-time analytics, typically involving Apache Kafka for ingestion and Apache Flink for processing.
Design a streaming system that processes high-volume event data (e.g., user interactions such as views, clicks, and bookings) from Airbnb's platform. The system must compute real-time windowed aggregations, such as view counts over the last 5 hours per (user, item) pair and average ratings over the last 90 days, and must support both online serving for predictions and offline backfills for training. It should provide exactly-once semantics, low-latency reads (<100 ms), and fault tolerance at millions of events per second, with data sources including Kafka topics for streams and databases for historical data. Point-in-time correct (PITC) joins are needed to enrich streaming features with batch features.[1]
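As one possible sketch of the ingestion side (not Airbnb's published code), the snippet below wires a Kafka topic into a Flink job with exactly-once checkpointing and bounded-out-of-orderness watermarks. The topic name `user-interactions`, the broker address, and the 30-second checkpoint interval are illustrative assumptions.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

public class IngestionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 30s in EXACTLY_ONCE mode so window state survives failures.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Hypothetical topic and broker; a real deployment would read these from config.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("user-interactions")
                .setGroupId("feature-pipeline")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Event-time watermarks that tolerate a few minutes of out-of-order events.
        DataStream<String> rawEvents = env.fromSource(
                source,
                WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofMinutes(5)),
                "interactions-source");

        rawEvents.print();  // placeholder; parsing and windowed aggregation would chain here
        env.execute("streaming-feature-ingestion");
    }
}
```

Parsing into typed events and the windowed aggregations from the table below would be chained onto `rawEvents` in place of the `print()` placeholder.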
No official full examples are publicly documented, but typical cases from Airbnb's feature engineering talks include:
| Scenario | Input Events | Aggregation Key | Window | Output Feature |
|----------|--------------|-----------------|--------|---------------|
| User views | {user_id: 123, item_id: 456, timestamp: t1, event: view} (stream) | (user_id, item_id) | Last 5h, tumbling | view_count_5h: 42 [1] |
| Item ratings | {item_id: 456, rating: 4.2, timestamp: t2} (DB + stream) | item_id | Last 90d, sliding | avg_rating_90d: 3.8 [1] |
Example Flink DataStream (Java) pseudocode for the 5-hour view count:

```java
stream
    .keyBy(e -> Tuple2.of(e.getUserId(), e.getItemId()))   // key by (user_id, item_id)
    .window(TumblingEventTimeWindows.of(Time.hours(5)))    // 5-hour event-time window
    .aggregate(new CountAggregator())                       // incremental count per key and window
    .addSink(onlineStoreSink);                               // publish the feature to the online store
```
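`CountAggregator` is not defined in the original pseudocode; a minimal sketch, assuming it only needs to count elements per key and window, is a generic Flink `AggregateFunction`:

```java
import org.apache.flink.api.common.functions.AggregateFunction;

// Minimal count aggregator: the accumulator is a running Long, the result is the
// window's element count (e.g., view_count_5h for a (user_id, item_id) key).
public class CountAggregator<T> implements AggregateFunction<T, Long, Long> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(T event, Long acc) {
        return acc + 1;              // one view event -> increment the count
    }

    @Override
    public Long getResult(Long acc) {
        return acc;                  // emitted once per key when the window fires
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;                // combine partial counts when windows are merged
    }
}
```

Because the aggregation is incremental, Flink keeps only a single `Long` per key and window rather than buffering every event, which matters at millions of events per second.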
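For the second table row (90-day average rating), a comparable sketch could key the rating stream by `item_id`, window it with `SlidingEventTimeWindows.of(Time.days(90), Time.days(1))`, and aggregate with a (sum, count) accumulator. The class below is an illustrative assumption rather than Airbnb's code; in practice such a long window is typically served by combining a batch-computed history from the database with the streaming tail, as the table's "DB + stream" input suggests.

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Running-average aggregator: the accumulator is (sum of ratings, number of ratings),
// and the result is their ratio (e.g., avg_rating_90d per item_id).
public class AvgRatingAggregator implements AggregateFunction<Double, Tuple2<Double, Long>, Double> {
    @Override
    public Tuple2<Double, Long> createAccumulator() {
        return Tuple2.of(0.0, 0L);
    }

    @Override
    public Tuple2<Double, Long> add(Double rating, Tuple2<Double, Long> acc) {
        return Tuple2.of(acc.f0 + rating, acc.f1 + 1);
    }

    @Override
    public Double getResult(Tuple2<Double, Long> acc) {
        return acc.f1 == 0 ? 0.0 : acc.f0 / acc.f1;   // guard against empty windows
    }

    @Override
    public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}
```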