Design a Data Warehouse for Workspace Analytics

[ OK ] 952 — full content available

[ INFO ] category: System Design difficulty: hard freq: medium first seen: 2026-05-16

[HARD][SYSTEM DESIGN][MEDIUM]System DesignData WarehouseETLAnalyticsSnowflake

$ cat problem.md

Design a data-warehouse and ETL pipeline that lets Notion’s product and growth teams answer workspace-level analytics questions in (near) real-time.

Notion has two high-volume data sources:

OLTP Postgres cluster that stores workspace metadata, user memberships, plan tiers, permissions, etc. ~50 k row-changes/sec at peak, tables are 3–10 TB each, updates/deletes happen.
Real-time collaboration event firehose (Kafka) that emits every block edit, page view, comment, etc. ~1 M events/sec at peak, append-only, JSON schema evolves weekly.

Goals

Provide sub-second dashboards that show daily active workspaces, edits per plan tier, seat utilization, retention curves, etc.
Support ad-hoc analysis (SQL) by analysts without engineering help.
Keep raw data for 2 years, aggregated data for 7 years.
Comply with GDPR: user-delete request must be reflected in dashboards within 24 h and permanently erased within 30 days.
99.9 % pipeline availability; new event schema fields usable in marts within 1 day.

Deliverables to cover in the interview a. End-to-end data flow: ingestion → storage → modeling → serving. b. Choice of warehouse (Snowflake vs BigQuery vs Redshift) and justification. c. Schema design: fact/dimension tables, partitioning, clustering, slowly-changing dimensions. d. ETL orchestration: how code is deployed, tested, back-filled, and rolled back. e. Scaling to 10× volume and 3× new data sources with minimal rework. f. Cost and performance guard-rails: how to prevent a single analyst query from blowing the budget. g. GDPR delete implementation that avoids full-partition rewrites every day.

user@intervues:~/notion$