Design a Notification System
Problem Statement
Design a notification system that handles critical time-sensitive notifications (like 1:1 chat messages and OTPs) and promotional notifications (like content recommendations) with expiration logic. The system should scale to 1M notifications per second with an 80/20 critical/promotional distribution, where promotional notifications can target thousands of users simultaneously.
A notifications platform lets product teams send critical alerts and promotional campaigns to users across channels like push, SMS, and email. Users expect critical messages in near real time, and they should never receive a promotional notification after the offer expires.
Interviewers at Uber ask this to evaluate whether you can design a high-throughput, priority-aware pipeline that isolates critical traffic from bulk campaigns, handles fan-out to large audiences, respects provider rate limits, and enforces time windows. They also probe your grasp of durability, idempotency, per-user ordering, and resiliency when downstream providers degrade.
Key Requirements
Functional
- Priority-based delivery -- critical notifications are delivered in near real time with high reliability; promotional notifications are delivered only within their validity window and never after expiry
- Multi-channel routing -- route notifications to push, SMS, email, and webhook endpoints based on user preferences and message type
- User preference management -- honor opt-out settings, channel preferences, quiet hours, and frequency caps per user
- Campaign fan-out -- support targeted campaigns that fan out to thousands or millions of users without degrading critical notification latency
Non-Functional
- Scalability -- handle 1M notifications per second with peaks during promotional campaigns
- Reliability -- achieve 99.95% delivery success rate with automatic retries, circuit breaking, and fallback channel routing
- Latency -- keep P95 latency for critical notifications under 2 seconds, measured from API submission to channel handoff
- Consistency -- guarantee at-least-once delivery with idempotency keys to prevent duplicate sends
What Interviewers Focus On
Based on real interview experiences at Uber, Microsoft, and Airbnb, these are the areas interviewers probe most deeply:
1. Priority Isolation and Queue Architecture
A monolithic queue causes head-of-line blocking when promotional blasts flood the system, delaying critical notifications. Interviewers want to see explicit traffic separation.
Hints to consider:
- Use dedicated Kafka topics and consumer pools for critical vs promotional notifications to prevent interference
- Implement priority queues with separate worker fleets so urgent alerts bypass bulk marketing messages
- Partition by user ID within each priority tier to maintain per-user ordering while enabling horizontal scaling
- Retry transient failures with exponential backoff, and route messages that exhaust their retries to a dead letter queue
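The routing hints above can be sketched in a few lines. This is a minimal illustration, not a production design: the topic names, partition counts, and the `route` and `retry_delay_seconds` helpers are all hypothetical, and a real deployment would hand the chosen topic and key to a Kafka producer rather than compute the partition itself.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple

# Hypothetical topic names and partition counts, chosen for illustration only.
TOPICS = {
    "critical": ("notifications.critical", 64),
    "promotional": ("notifications.promotional", 16),
}


@dataclass(frozen=True)
class Notification:
    user_id: str
    priority: str  # "critical" or "promotional"
    body: str


def route(n: Notification) -> Tuple[str, int]:
    """Return the (topic, partition) a notification should be published to."""
    topic, num_partitions = TOPICS[n.priority]
    # A stable hash (not Python's per-process randomized hash()) keeps every
    # message for one user on the same partition, which preserves per-user order.
    digest = hashlib.sha256(n.user_id.encode("utf-8")).digest()
    partition = int.from_bytes(digest[:8], "big") % num_partitions
    return topic, partition


def retry_delay_seconds(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))
```

Because each priority tier has its own topic, a promotional blast can only back up the promotional consumers; the critical worker pool reads from a separate log and is unaffected.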
2. Promotional Expiration and Time Windows
Promotional notifications must never be delivered after their offer expires. Interviewers look for end-to-end enforcement of time constraints.
Hints to consider:
- Attach a TTL or an absolute expiration timestamp to every promotional message and check it at every processing stage
- Drop expired messages in the queue consumer before attempting delivery to avoid wasting external API calls
- Use Redis sorted sets with the expiration time as the score for efficient batch scheduling and expiry checking
- Log dropped-due-to-expiry events for analytics and reconciliation
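As a rough sketch of the expiry hints above, the consumer-side guard and the batch expiry sweep might look like this. An in-memory heap stands in for the Redis sorted set here (with Redis, the equivalent would be `ZADD` with the expiry timestamp as the score and `ZRANGEBYSCORE` to fetch everything at or below the current time). The `Promo` type and both function names are assumptions made for the example.

```python
import heapq
import logging
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger("notifications.expiry")


@dataclass(order=True)
class Promo:
    expires_at: float  # Unix timestamp; the only field used for ordering
    notification_id: str = field(compare=False)
    user_id: str = field(compare=False)


def deliver_if_fresh(promo: Promo, now: float, send: Callable[[Promo], None]) -> bool:
    """Consumer-side guard: drop an expired promo before calling the provider."""
    if now >= promo.expires_at:
        # Logged so analytics can reconcile scheduled vs. delivered counts.
        logger.info("dropped_expired id=%s user=%s", promo.notification_id, promo.user_id)
        return False
    send(promo)
    return True


def purge_expired(heap: List[Promo], now: float) -> List[Promo]:
    """Pop every promo whose expiry is at or before `now` (earliest first)."""
    expired = []
    while heap and heap[0].expires_at <= now:
        expired.append(heapq.heappop(heap))
    return expired
```

Checking expiry immediately before the provider call matters most, because a message can sit in a backlog long enough to expire after it passed an earlier check.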
3. Idempotency and Exactly-Once Processing
At-least-once retries without deduplication yield duplicates. Interviewers want explicit idempotency handling across the delivery pipeline.
Hints to consider:
- Accept client-provided idempotency keys and check them in Redis before processing
- Make notification IDs deterministic based on content hash plus recipient to catch duplicates from different sources
- Use conditional database writes to ensure a notification record is created exactly once
- Check delivery status before sending to external channels to handle races between workers
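The deduplication hints above could be sketched as follows. The `InMemoryDedupStore` class is a hypothetical stand-in for Redis (where the same claim would be a single `SET key value NX EX ttl`), and the 24-hour TTL and the choice of fields in the hash are assumptions for illustration.

```python
import hashlib
from typing import Callable, Dict


def notification_id(recipient_id: str, channel: str, content: str) -> str:
    """Deterministic ID derived from recipient, channel, and content."""
    h = hashlib.sha256()
    for part in (recipient_id, channel, content):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


class InMemoryDedupStore:
    """Stand-in for a shared key-value store with set-if-absent and TTL."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def claim(self, key: str, ttl_seconds: float, now: float) -> bool:
        """Return True only for the first caller to claim `key` within the TTL."""
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True


def send_once(store: InMemoryDedupStore, recipient_id: str, channel: str,
              content: str, now: float,
              send: Callable[[str, str, str], None]) -> bool:
    """Send unless an identical notification was already claimed."""
    key = notification_id(recipient_id, channel, content)
    if not store.claim(key, ttl_seconds=86400, now=now):
        return False  # duplicate: another worker or an earlier retry owns it
    send(recipient_id, channel, content)
    return True
```

This sketch claims the key before sending, so a failed send would need to release the claim or record a delivery status that retries can consult. A pure content hash also suppresses legitimately repeated messages, which is one reason to prefer a client-provided idempotency key when one is available.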
4. Third-Party Channel Integration and Failure Handling
External SMS, email, and push providers have varying SLAs and failure modes. Interviewers want resilience patterns for vendor health management.
Hints to consider:
- Implement circuit breakers per vendor to fail fast and route traffic to backup providers when error rates spike
- Use per-vendor rate limiters with token buckets to respect API quotas
- Design channel fallback logic: if push fails, try SMS; if the SMS quota is exhausted, queue the message for email
- Track per-channel success rates and latency to automatically deprioritize failing providers
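One way to combine the resilience hints above is a per-channel circuit breaker and token bucket consulted by a fallback loop. The sketch below is simplified and single-threaded; the class names, thresholds, and the `send_with_fallback` helper are assumptions, and time is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_after: float = 30.0  # seconds to stay open before a trial request
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return now - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit


@dataclass
class TokenBucket:
    rate: float      # tokens added per second
    capacity: float  # maximum burst size
    tokens: float
    updated_at: float

    def try_acquire(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def send_with_fallback(channels: List[str],
                       breakers: Dict[str, CircuitBreaker],
                       buckets: Dict[str, TokenBucket],
                       senders: Dict[str, Callable[[str], None]],
                       message: str, now: float) -> Optional[str]:
    """Try channels in preference order; return the one that accepted the message."""
    for channel in channels:
        breaker = breakers[channel]
        # Skip channels whose circuit is open or whose quota is exhausted.
        if not breaker.allow(now) or not buckets[channel].try_acquire(now):
            continue
        try:
            senders[channel](message)
        except Exception:
            breaker.record_failure(now)
            continue
        breaker.record_success()
        return channel
    return None  # caller can queue the message for a later retry
```

Returning `None` rather than raising leaves the retry decision to the caller, which fits the priority split: a critical message might be re-queued immediately, while a promotional one could simply be dropped if its window is about to close.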