Design a data-warehouse and ETL pipeline that lets Notion’s product and growth teams answer workspace-level analytics questions in (near) real-time.
Notion has two high-volume data sources:
OLTP Postgres cluster that stores workspace metadata, user memberships, plan tiers, permissions, etc. ~50 k row-changes/sec at peak, tables are 3–10 TB each, updates/deletes happen.
Real-time collaboration event firehose (Kafka) that emits every block edit, page view, comment, etc. ~1 M events/sec at peak, append-only, JSON schema evolves weekly.
Goals
Provide sub-second dashboards that show daily active workspaces, edits per plan tier, seat utilization, retention curves, etc.
Support ad-hoc analysis (SQL) by analysts without engineering help.
Keep raw data for 2 years, aggregated data for 7 years.
Comply with GDPR: user-delete request must be reflected in dashboards within 24 h and permanently erased within 30 days.
99.9 % pipeline availability; new event schema fields usable in marts within 1 day.
Deliverables to cover in the interview
a. End-to-end data flow: ingestion → storage → modeling → serving.
b. Choice of warehouse (Snowflake vs BigQuery vs Redshift) and justification.
c. Schema design: fact/dimension tables, partitioning, clustering, slowly-changing dimensions.
d. ETL orchestration: how code is deployed, tested, back-filled, and rolled back.
e. Scaling to 10× volume and 3× new data sources with minimal rework.
f. Cost and performance guard-rails: how to prevent a single analyst query from blowing the budget.
g. GDPR delete implementation that avoids full-partition rewrites every day.