Design a Notification System
Problem Statement
Design a notification system that handles critical, time-sensitive notifications (such as 1:1 chat messages and OTPs) alongside promotional notifications (such as system-generated content recommendations) that expire after a validity window. The system should scale to 1M notifications per second with an 80/20 critical/promotional split, where a single promotional notification can target thousands of users simultaneously.
Interviewers ask this to evaluate whether you can design a high-throughput, multi-tenant, priority-aware pipeline that isolates critical traffic from bulk campaigns, handles fan-out to large audiences, respects provider rate limits, and enforces expiry windows. They also probe your grasp of durability, idempotency, per-user ordering, retries, and resiliency when downstream providers degrade.
Key Requirements
Functional
- Critical notifications -- users receive critical notifications (OTPs, security alerts, chat messages) in near real time with high delivery reliability and per-user ordering
- Promotional notifications -- users receive promotional notifications only within the valid promotion window and never after expiration
- User preferences -- users manage notification preferences (opt-in/out, channels, quiet hours) that are respected across all message types
- Campaign targeting -- targeted campaign notifications reach large audiences without degrading critical notification latency or reliability
Non-Functional
- Scalability -- handle 1M notifications per second with high fan-out for campaigns that target thousands to millions of users
- Reliability -- achieve 99.99% delivery rate for critical notifications; promotional can tolerate occasional drops
- Latency -- deliver critical notifications within 2 seconds; promotional within 5 minutes of campaign trigger
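- Consistency -- effectively exactly-once delivery for critical notifications (at-least-once transport plus idempotent deduplication, since true exactly-once delivery to external providers cannot be guaranteed); at-least-once is acceptable for promotional notifications, with deduplication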
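The targets above translate into a quick back-of-envelope budget. The sketch below derives the per-class rates and the critical failure budget from the stated numbers. It reads the 1M/s figure as delivered notifications (an interpretation, not something the problem statement spells out), and the per-partition throughput is a purely illustrative assumption.

```python
# Figures taken from the problem statement.
TOTAL_PER_SEC = 1_000_000
CRITICAL_PERCENT = 80

critical_per_sec = TOTAL_PER_SEC * CRITICAL_PERCENT // 100
promotional_per_sec = TOTAL_PER_SEC - critical_per_sec

# A 99.99% delivery target allows at most 1 failure per 10,000 critical sends.
critical_failure_budget_per_sec = critical_per_sec // 10_000

# ASSUMPTION (not from the source): one partition/consumer sustains 10k msg/s.
ASSUMED_PER_PARTITION_PER_SEC = 10_000


def partitions_needed(rate_per_sec: int,
                      per_partition: int = ASSUMED_PER_PARTITION_PER_SEC) -> int:
    """Ceiling division: minimum partitions to absorb the given rate."""
    return -(-rate_per_sec // per_partition)


if __name__ == "__main__":
    print("critical/s:", critical_per_sec)
    print("promotional/s:", promotional_per_sec)
    print("critical failures allowed/s:", critical_failure_budget_per_sec)
    print("critical partitions:", partitions_needed(critical_per_sec))
    print("promotional partitions:", partitions_needed(promotional_per_sec))
```

Stating these numbers early makes later sizing decisions (partition counts, worker pools, provider quotas) easier to defend, even if the assumed per-partition rate changes.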
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Priority Isolation Between Critical and Promotional Traffic
Interviewers want to see how you prevent promotional blast traffic from starving critical notifications.
Hints to consider:
- Use separate Kafka topics or queue systems for critical and promotional traffic with independent consumer groups
- Implement dedicated worker pools for each priority level with separate scaling policies
- Apply rate limiting on promotional dispatch to protect shared downstream resources (push notification providers, SMS gateways)
- Design circuit breakers that shed promotional load before impacting critical delivery paths
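One way to picture the last two hints is a small gate in front of a shared provider client. This is a minimal in-memory sketch, not a production design: `PriorityDispatcher`, `TokenBucket`, and `shed_threshold` are hypothetical names, and a real system would keep the two classes in separate topics and worker pools as described above. Critical messages are always drained first; promotional messages pass only when a token bucket allows it and the critical backlog is below the shed threshold.

```python
from collections import deque
from typing import Deque, List


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def try_take(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PriorityDispatcher:
    """Hypothetical gate: critical first, promotional rate-limited and sheddable."""

    def __init__(self, promo_rate: float, promo_burst: float,
                 shed_threshold: int) -> None:
        self.critical: Deque[str] = deque()
        self.promotional: Deque[str] = deque()
        self.promo_bucket = TokenBucket(promo_rate, promo_burst)
        self.shed_threshold = shed_threshold

    def submit(self, priority: str, message: str) -> None:
        queue = self.critical if priority == "critical" else self.promotional
        queue.append(message)

    def next_batch(self, now: float, max_size: int) -> List[str]:
        batch: List[str] = []
        # Critical traffic is never rate limited by this gate.
        while self.critical and len(batch) < max_size:
            batch.append(self.critical.popleft())
        # Shed promotional load while the critical backlog is still too deep.
        if len(self.critical) > self.shed_threshold:
            return batch
        # The token is only consumed when a promotional message will be sent.
        while (self.promotional and len(batch) < max_size
               and self.promo_bucket.try_take(now)):
            batch.append(self.promotional.popleft())
        return batch
```

Passing `now` explicitly keeps the sketch deterministic; a worker loop would typically supply `time.monotonic()`.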
2. Fan-Out for Campaign Notifications
A single campaign can target millions of users. Interviewers probe how you expand audiences and dispatch at scale.
Hints to consider:
- Expand audience segments into individual notification tasks asynchronously using a separate expansion service
- Partition expanded tasks by user ID for balanced distribution across worker pools
- Implement incremental expansion with progress checkpoints to handle failures mid-campaign without restarting
- Check promotion expiry at every stage of the pipeline (expansion, dispatch, delivery) to drop stale messages
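The hints above can be sketched as a paged expansion loop that saves a checkpoint after each page and checks expiry before each one. Everything here is illustrative: `Campaign`, `Task`, `expand_campaign`, the in-memory `checkpoints` dict, and the partition count of 64 are assumptions, and a real audience would be streamed from a segment store rather than held in a list.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Campaign:
    campaign_id: str
    expires_at: float  # epoch seconds


@dataclass(frozen=True)
class Task:
    notification_id: str
    user_id: str
    partition: int
    expires_at: float


def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable hash so a given user always maps to the same partition."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


def is_expired(task: Task, now: float) -> bool:
    """Dispatch and delivery stages can reuse this check before sending."""
    return now >= task.expires_at


def expand_campaign(campaign: Campaign, audience: Sequence[str],
                    checkpoints: Dict[str, int],
                    emit: Callable[[Task], None],
                    now: Callable[[], float],
                    page_size: int = 1000,
                    num_partitions: int = 64) -> int:
    """Emit one task per user, resuming from the last saved offset."""
    offset = checkpoints.get(campaign.campaign_id, 0)
    emitted = 0
    while offset < len(audience):
        if now() >= campaign.expires_at:
            break  # campaign expired mid-expansion: stop producing work
        page = audience[offset:offset + page_size]
        for user_id in page:
            emit(Task(
                # Deterministic ID: a re-emitted page can be deduplicated.
                notification_id=f"{campaign.campaign_id}:{user_id}",
                user_id=user_id,
                partition=partition_for(user_id, num_partitions),
                expires_at=campaign.expires_at,
            ))
        emitted += len(page)
        offset += len(page)
        checkpoints[campaign.campaign_id] = offset  # progress checkpoint
    return emitted
```

If the expander crashes mid-page, the page is emitted again on resume, which is why the notification ID is derived deterministically from the campaign and user rather than generated randomly.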
3. Delivery Reliability and Idempotency
At-least-once retries without deduplication yield duplicates. Interviewers expect concrete mechanisms for reliable delivery.
Hints to consider:
- Assign globally unique notification IDs and use them as idempotency keys with downstream providers
- Partition by user ID to maintain per-user ordering within a priority class
- Implement exponential backoff with jitter for retries, with a dead letter queue for permanently failed notifications
- Track delivery status (pending, sent, delivered, failed) in a durable store for reconciliation and debugging
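A minimal sketch of how these pieces might fit together: the notification ID doubles as the idempotency key, retries use exponential backoff with full jitter, and exhausted notifications land in a dead letter queue. The names (`deliver`, `TransientError`, `status_store`, `dead_letters`) are hypothetical, the stores are plain in-memory containers standing in for durable ones, and the `send` callable stands in for a provider client that is assumed to honor the key.

```python
import random
import time
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    PENDING = "pending"
    SENT = "sent"
    DELIVERED = "delivered"  # set later from a provider receipt (not modeled)
    FAILED = "failed"


class TransientError(Exception):
    """Raised by `send` for retryable failures such as timeouts."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def deliver(notification_id: str, payload: str,
            send: Callable[[str, str], None],
            status_store: Dict[str, Status],
            dead_letters: List[str],
            max_attempts: int = 5,
            sleep: Callable[[float], None] = time.sleep) -> Status:
    # Dedupe: a redelivered queue message must not trigger a second send.
    if status_store.get(notification_id) == Status.SENT:
        return Status.SENT
    status_store[notification_id] = Status.PENDING
    for attempt in range(max_attempts):
        try:
            # The notification ID is passed along as the idempotency key.
            send(notification_id, payload)
        except TransientError:
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
            continue
        status_store[notification_id] = Status.SENT
        return Status.SENT
    # Retries exhausted: park the notification for inspection or replay.
    status_store[notification_id] = Status.FAILED
    dead_letters.append(notification_id)
    return Status.FAILED
```

The status check alone does not close the window where a send succeeds but the status write is lost, which is why the provider-side idempotency key still matters.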
4. Multi-Channel Delivery
Users receive notifications via push, SMS, email, and in-app channels. Interviewers probe your channel routing and fallback strategies.
Hints to consider:
- Implement a channel router that selects delivery channels based on notification type, user preferences, and device availability
- Support fallback chains: try push first, fall back to SMS if push fails or the device is unreachable
- Respect per-channel rate limits imposed by providers (APNs, FCM, SMS gateways) using token bucket limiters
- Batch email notifications to avoid spamming users with individual messages (digest mode for low-priority content)
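The routing and fallback hints can be sketched as a two-step function: build the candidate chain from the notification type and user preferences, then try each channel in order. The chains in `DEFAULT_CHAINS`, the `UserPrefs` fields, and the sender callables are illustrative assumptions; per-channel rate limiting and email digests are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

# Hypothetical fallback chains per notification type, in order of preference.
DEFAULT_CHAINS: Dict[str, List[str]] = {
    "otp": ["push", "sms"],
    "chat": ["push", "in_app"],
    "promo": ["push", "email"],
}


@dataclass
class UserPrefs:
    opted_out: Set[str] = field(default_factory=set)  # channels the user disabled
    has_push_device: bool = True


def plan_channels(notification_type: str, prefs: UserPrefs) -> List[str]:
    """Filter the type's chain by opt-outs and device availability."""
    chain = DEFAULT_CHAINS.get(notification_type, ["in_app"])
    return [
        channel for channel in chain
        if channel not in prefs.opted_out
        and (channel != "push" or prefs.has_push_device)
    ]


def route(notification_type: str, prefs: UserPrefs,
          senders: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Try each planned channel; return the first that reports success."""
    for channel in plan_channels(notification_type, prefs):
        sender = senders.get(channel)
        if sender is not None and sender():
            return channel
    return None  # nothing allowed or everything failed


if __name__ == "__main__":
    prefs = UserPrefs()
    # Push fails (device unreachable), so the OTP falls back to SMS.
    used = route("otp", prefs, {"push": lambda: False, "sms": lambda: True})
    print(used)
```

Keeping the planning step separate from the sending step makes the preference logic easy to test without touching any provider.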