Design a Notification System

System Design · Must

Problem Statement

Design a notification system that handles critical time-sensitive notifications (like 1:1 chat messages and OTPs) and promotional notifications (like content recommendations) with expiration logic. The system should scale to 1M notifications per second with an 80/20 critical/promotional distribution, where promotional notifications can target thousands of users simultaneously.

A notifications platform lets product teams send critical alerts and promotional campaigns to users across channels like push, SMS, and email. Users expect critical messages in near real time, and they should never receive a promotional notification after the offer expires.

Interviewers at Uber ask this to evaluate whether you can design a high-throughput, priority-aware pipeline that isolates critical traffic from bulk campaigns, handles fan-out to large audiences, respects provider rate limits, and enforces time windows. They also probe your grasp of durability, idempotency, per-user ordering, and resiliency when downstream providers degrade.

Key Requirements

Functional

Priority-based delivery -- critical notifications are delivered in near real time with high reliability; promotional notifications are delivered only within their validity windows
Multi-channel routing -- route notifications to push, SMS, email, and webhook endpoints based on user preferences and message type
User preference management -- honor opt-out settings, channel preferences, quiet hours, and frequency caps per user
Campaign fan-out -- support targeted campaigns that fan out to thousands or millions of users without degrading critical notification latency

Non-Functional

Scalability -- handle 1M notifications per second with peaks during promotional campaigns
Reliability -- achieve 99.95% delivery success rate with automatic retries, circuit breaking, and fallback channel routing
Latency -- keep P95 latency for critical notifications under 2 seconds, measured from API submission to channel handoff
Consistency -- guarantee at-least-once delivery with idempotency keys to prevent duplicate sends
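
A quick back-of-envelope pass helps turn these targets into sizing numbers. The sketch below splits the stated 1M notifications per second by the 80/20 ratio from the problem statement. The average payload size and the per-partition throughput are assumptions chosen only for illustration, not figures from the problem; substitute measured values.

```python
import math

TOTAL_PER_S = 1_000_000          # from the problem statement
CRITICAL_PERCENT = 80            # from the problem statement (80/20 split)

# Assumptions for illustration only:
AVG_PAYLOAD_BYTES = 1_000        # hypothetical average message size
PER_PARTITION_MSGS_PER_S = 10_000  # hypothetical sustained rate per partition

critical_per_s = TOTAL_PER_S * CRITICAL_PERCENT // 100
promotional_per_s = TOTAL_PER_S - critical_per_s

# Ingest bandwidth before replication or compression.
ingest_bytes_per_s = TOTAL_PER_S * AVG_PAYLOAD_BYTES

# Minimum partitions per tier so that no partition exceeds the assumed rate.
critical_partitions = math.ceil(critical_per_s / PER_PARTITION_MSGS_PER_S)
promotional_partitions = math.ceil(promotional_per_s / PER_PARTITION_MSGS_PER_S)

print(f"critical: {critical_per_s}/s across >= {critical_partitions} partitions")
print(f"promotional: {promotional_per_s}/s across >= {promotional_partitions} partitions")
print(f"ingest: {ingest_bytes_per_s / 1e9:.1f} GB/s")
```

These are steady-state floors. Campaign peaks and retry traffic argue for headroom on top of them.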

What Interviewers Focus On

Based on real interview experiences at Uber, Microsoft, and Airbnb, these are the areas interviewers probe most deeply:

1. Priority Isolation and Queue Architecture

A monolithic queue causes head-of-line blocking when promotional blasts flood the system, delaying critical notifications. Interviewers want to see explicit traffic separation.

Hints to consider:

Use dedicated Kafka topics and consumer pools for critical vs promotional notifications to prevent interference
Implement priority queues with separate worker fleets so urgent alerts bypass bulk marketing messages
Partition by user ID within each priority tier to maintain per-user ordering while enabling horizontal scaling
Retry transient failures with exponential backoff, and route messages that exhaust their retries to a dead letter queue
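
The routing hints above can be sketched in a few lines. This is a minimal illustration, not a prescribed design: the topic names, the partition count, and the retry parameters are all hypothetical. It uses its own stable hash rather than any particular Kafka client's partitioner, so the same user always maps to the same partition within a tier.

```python
import hashlib
from typing import Optional, Tuple

# Hypothetical topic names: one topic (and one consumer pool) per priority tier.
TOPICS = {
    "critical": "notifications.critical",
    "promotional": "notifications.promotional",
}
NUM_PARTITIONS = 64  # hypothetical; size from measured throughput


def route(priority: str, user_id: str,
          num_partitions: int = NUM_PARTITIONS) -> Tuple[str, int]:
    """Pick the topic by priority and the partition by a stable hash of user ID.

    Keeping one user on one partition preserves per-user ordering inside a tier.
    """
    topic = TOPICS[priority]
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    partition = int.from_bytes(digest[:8], "big") % num_partitions
    return topic, partition


def next_retry_delay(attempt: int, base_s: float = 0.5, cap_s: float = 60.0,
                     max_attempts: int = 5) -> Optional[float]:
    """Return the backoff delay in seconds for a zero-based attempt number.

    Returns None once retries are exhausted, meaning the caller should move
    the message to the dead letter queue.
    """
    if attempt >= max_attempts:
        return None
    return min(cap_s, base_s * (2 ** attempt))


topic, partition = route("critical", "user-42")
print(topic, partition)
print([next_retry_delay(a) for a in range(6)])
```

In practice you would likely add random jitter to the delay so that retries from many workers do not synchronize.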

2. Promotional Expiration and Time Windows

Promotional notifications must never be delivered after their offer expires. Interviewers look for end-to-end enforcement of time constraints.

Hints to consider:

Attach TTL or expiration timestamp to every promotional message and check it at every processing stage
Drop expired messages in the queue consumer before attempting delivery to avoid wasting external API calls
Use Redis sorted sets with score as expiration time for efficient batch scheduling and expiry checking
Log dropped-due-to-expiry events for analytics and reconciliation
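
As a rough sketch of the consumer-side check described above, the code below drops expired messages before calling the sender and records each drop. The `Notification` type, its field names, and the log record shape are hypothetical. The Redis sorted-set scheduling is not shown, and the clock is passed in explicitly so the logic stays testable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Notification:
    notification_id: str
    priority: str
    expires_at: Optional[float]  # epoch seconds; None means no expiry


def is_expired(n: Notification, now: float) -> bool:
    """A message is expired once the current time reaches its expiry."""
    return n.expires_at is not None and now >= n.expires_at


def consume(batch: List[Notification], now: float,
            send: Callable[[Notification], None],
            dropped_log: List[Dict]) -> int:
    """Send every unexpired message; log and skip the expired ones.

    Returns the number of messages handed to `send`.
    """
    sent = 0
    for n in batch:
        if is_expired(n, now):
            # Record the drop for analytics and reconciliation.
            dropped_log.append({
                "id": n.notification_id,
                "reason": "expired",
                "expires_at": n.expires_at,
                "checked_at": now,
            })
            continue
        send(n)
        sent += 1
    return sent


delivered: List[Notification] = []
drops: List[Dict] = []
batch = [
    Notification("promo-1", "promotional", expires_at=100.0),
    Notification("promo-2", "promotional", expires_at=200.0),
    Notification("otp-1", "critical", expires_at=None),
]
handed_off = consume(batch, now=150.0, send=delivered.append, dropped_log=drops)
print(handed_off, [n.notification_id for n in delivered], drops)
```

A check in the consumer alone does not cover time spent waiting in a provider's own queue, which is why the hints call for checking at every stage, including immediately before the external call.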

3. Idempotency and Exactly-Once Processing

At-least-once retries without deduplication yield duplicates. Interviewers want explicit idempotency handling across the delivery pipeline.

Hints to consider:

Accept client-provided idempotency keys and check them in Redis before processing
Make notification IDs deterministic based on content hash plus recipient to catch duplicates from different sources
Use conditional database writes to ensure a notification record is created exactly once
Check delivery status before sending to external channels to handle races between workers
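
One way to illustrate the deduplication hints above: derive a deterministic ID from recipient, channel, and canonicalized content, then atomically claim that ID before sending. The in-memory store below is a stand-in for the Redis check and is an assumption made for runnability; in a real deployment the claim would need to be a single atomic operation in the shared store (for example, Redis `SET` with the `NX` and `EX` options). All function and class names are hypothetical.

```python
import hashlib
import json
from typing import Dict


def notification_id(recipient_id: str, channel: str, payload: dict) -> str:
    """Deterministic ID: same recipient, channel, and content give the same ID.

    The payload is serialized with sorted keys so key order does not matter.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    material = f"{recipient_id}|{channel}|{canonical}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class InMemoryDedupStore:
    """Single-process stand-in for a shared store with set-if-absent plus TTL."""

    def __init__(self) -> None:
        self._expiry_by_key: Dict[str, float] = {}

    def claim(self, key: str, now: float, ttl_s: float) -> bool:
        """Return True if this caller claimed the key, False if a live claim exists."""
        expiry = self._expiry_by_key.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expiry_by_key[key] = now + ttl_s
        return True


store = InMemoryDedupStore()
nid = notification_id("user-42", "push", {"title": "Sale", "body": "20% off"})
first = store.claim(nid, now=0.0, ttl_s=3600.0)
second = store.claim(nid, now=10.0, ttl_s=3600.0)   # retry or duplicate producer
print(first, second)
```

Content-hash IDs also suppress legitimate repeats of identical content, such as the same short chat message sent twice, so the TTL (or a client-provided idempotency key for such traffic) has to be chosen with that in mind.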

4. Third-Party Channel Integration and Failure Handling

External SMS, email, and push providers have varying SLAs and failure modes. Interviewers want resilience patterns for vendor health management.

Hints to consider:

Implement circuit breakers per vendor to fail fast and route traffic to backup providers when error rates spike
Use per-vendor rate limiters with token buckets to respect API quotas
Design channel fallback logic: if push fails, try SMS; if SMS quota exhausted, queue for email
Track per-channel success rates and latency to automatically deprioritize failing providers
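
The per-vendor patterns above can be sketched as a token bucket, a simple consecutive-failure circuit breaker, and a selector that walks providers in preference order. This is a simplified illustration under stated assumptions: thresholds, cooldowns, and provider names are hypothetical, the breaker trips on consecutive failures rather than an error-rate window, and time is passed in explicitly instead of read from a clock.

```python
from typing import Dict, List, Optional


class TokenBucket:
    """Refills at `rate_per_s` up to `capacity`; each send spends one token."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe after the cooldown."""

    def __init__(self, failure_threshold: int, cooldown_s: float) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown_s  # half-open probe

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now


def pick_provider(preference: List[str], breakers: Dict[str, CircuitBreaker],
                  buckets: Dict[str, TokenBucket], now: float) -> Optional[str]:
    """Return the first provider whose breaker is closed and has quota left."""
    for name in preference:
        if breakers[name].allow(now) and buckets[name].try_acquire(now):
            return name
    return None  # caller can queue the message or fall back to another channel


breakers = {"sms-primary": CircuitBreaker(2, 30.0), "sms-backup": CircuitBreaker(2, 30.0)}
buckets = {"sms-primary": TokenBucket(100.0, 100.0), "sms-backup": TokenBucket(50.0, 50.0)}
breakers["sms-primary"].record(False, now=0.0)
breakers["sms-primary"].record(False, now=1.0)   # second failure opens the breaker
print(pick_provider(["sms-primary", "sms-backup"], breakers, buckets, now=2.0))
```

Checking the breaker before the bucket means an unhealthy provider does not consume quota it cannot use.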
