Design a Notification System
Problem Statement
Build a scalable alert delivery platform that sends messages to users across multiple channels such as push notifications, SMS, and email. The platform must handle two distinct traffic patterns: high-priority alerts (authentication codes, security warnings, direct messages) that demand sub-second delivery, and batch campaign messages (marketing announcements, feature updates, content recommendations) that can be scheduled and target large user segments simultaneously.
Your system needs to process one million notifications per second at peak load, with roughly 80% being time-critical alerts and 20% being campaign messages. Campaign notifications often fan out to thousands or millions of recipients but must never be delivered after their expiration timestamp. The architecture must prevent bulk campaigns from degrading the delivery speed of critical alerts, handle provider rate limits gracefully, provide effectively exactly-once delivery per user (at-least-once delivery combined with deduplication), and maintain message ordering for each recipient. You are designing the server-side infrastructure only, not the client SDK or device-level delivery confirmation.
Key Requirements
Functional
- Priority-based routing -- Critical alerts must be isolated from promotional traffic to guarantee sub-second delivery regardless of campaign volume
- Time-windowed delivery -- Campaign messages must respect start and end timestamps, with automatic cancellation once the promotion window closes
- Multi-channel support -- Route messages to push, SMS, and email providers based on user preferences and message type
- User preference enforcement -- Honor opt-out settings, channel preferences, and quiet hours before dispatching any notification
- Massive fan-out -- A single campaign request must efficiently expand to millions of individual recipient messages without blocking the ingestion pipeline
Non-Functional
- Scalability -- Support sustained throughput of 1M notifications/second with headroom for 3x traffic spikes during major campaigns
- Reliability -- Guarantee at-least-once delivery with idempotency to prevent duplicates; tolerate partial provider outages without data loss
- Latency -- P99 delivery latency under 500ms for critical alerts; batch messages can tolerate multi-second queueing as long as they arrive before expiration
- Consistency -- Maintain per-user message ordering within each priority class; eventually consistent preference updates are acceptable with a propagation window under 10 seconds
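A quick back-of-envelope calculation makes these targets concrete. The sketch below uses only the figures stated above (1M/second peak, an 80/20 split, 3x spike headroom). It assumes the 3x multiplier applies to total traffic, and the 1M-recipient campaign is an illustrative size rather than a stated requirement.

```python
# Figures taken from the problem statement and non-functional requirements.
PEAK_PER_SECOND = 1_000_000
CRITICAL_PERCENT = 80
SPIKE_MULTIPLIER = 3  # assumption: the 3x headroom applies to total traffic

critical_per_second = PEAK_PER_SECOND * CRITICAL_PERCENT // 100
campaign_per_second = PEAK_PER_SECOND - critical_per_second
spike_per_second = PEAK_PER_SECOND * SPIKE_MULTIPLIER

# Illustrative only: time to drain a 1M-recipient campaign if it could use
# the entire campaign share of peak throughput.
campaign_recipients = 1_000_000
campaign_drain_seconds = campaign_recipients / campaign_per_second

print(f"critical: {critical_per_second:,}/s")
print(f"campaign: {campaign_per_second:,}/s")
print(f"spike capacity to provision: {spike_per_second:,}/s")
print(f"1M-recipient campaign drains in about {campaign_drain_seconds:.0f} s")
```

Under these assumptions the critical path alone carries 800,000 messages per second, which is why it needs its own capacity rather than a share of a common queue.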
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Traffic Isolation and Priority Management
Interviewers want to see how you prevent promotional blasts from starving critical alerts. A naive single-queue design will fail under load when a million-user campaign floods the system.
Hints to consider:
- Separate logical or physical queues for each priority tier with dedicated consumer pools
- Apply weighted fair queuing or strict priority scheduling to guarantee capacity for critical traffic
- Use backpressure mechanisms and queue depth limits to shed low-priority load during overload
- Consider per-tenant quotas to prevent one customer's campaign from affecting others in a multi-tenant platform
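The queue-separation and load-shedding hints above can be sketched as a small in-process scheduler. This is a minimal illustration, not a production design: `PriorityDispatcher`, `bulk_capacity`, and `bulk_share` are hypothetical names, and a real system would more likely use separate broker topics with dedicated consumer pools.

```python
from collections import deque
from typing import Optional


class PriorityDispatcher:
    """Two queues: critical traffic is favored, bulk traffic is bounded."""

    def __init__(self, bulk_capacity: int, bulk_share: int = 4):
        self.critical: deque = deque()
        self.bulk: deque = deque()
        self.bulk_capacity = bulk_capacity
        # Serve one bulk message after this many consecutive critical ones,
        # so bulk traffic is slowed under load but never starved entirely.
        self.bulk_share = bulk_share
        self._critical_streak = 0

    def submit(self, message: str, critical: bool) -> bool:
        """Enqueue a message; returns False when bulk load is shed."""
        if critical:
            self.critical.append(message)
            return True
        if len(self.bulk) >= self.bulk_capacity:
            return False  # queue depth limit reached: shed low-priority load
        self.bulk.append(message)
        return True

    def next_message(self) -> Optional[str]:
        """Pick the next message to dispatch, or None if both queues are empty."""
        bulk_turn = not self.critical or self._critical_streak >= self.bulk_share
        if self.bulk and bulk_turn:
            self._critical_streak = 0
            return self.bulk.popleft()
        if self.critical:
            self._critical_streak += 1
            return self.critical.popleft()
        return None
```

Setting `bulk_share` very high approximates strict priority scheduling; a small value trades some critical-path capacity for steadier campaign progress.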
2. Expiration and Time-Window Enforcement
Delivering an expired promotion damages user trust and may violate business rules. Interviewers look for end-to-end TTL checks, not just a filter at ingestion.
Hints to consider:
- Attach expiration metadata to every message and validate it at each processing stage (queue consumption, provider dispatch, retry)
- Use time-indexed data structures like sorted sets or scheduled queues to automatically discard expired items
- Implement cancellation workflows so in-flight messages can be aborted when a campaign is manually stopped or reaches its deadline
- Monitor drift between message creation time and dispatch time to detect processing delays that might cause late deliveries
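One way to picture the hints above is a time-ordered queue that drops expired or cancelled items when they come due, plus a check that can be repeated just before provider dispatch and on every retry. The sketch below is an in-memory illustration using a heap; `ScheduledQueue`, `Scheduled`, and their fields are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Scheduled:
    start_at: float                           # earliest delivery time (sort key)
    expires_at: float = field(compare=False)  # must not be delivered at or after this
    campaign_id: str = field(compare=False)
    payload: str = field(compare=False)


class ScheduledQueue:
    def __init__(self):
        self._heap = []
        self._cancelled = set()
        self.dropped = 0  # count of expired or cancelled items discarded

    def push(self, item: Scheduled) -> None:
        heapq.heappush(self._heap, item)

    def cancel_campaign(self, campaign_id: str) -> None:
        self._cancelled.add(campaign_id)

    def is_deliverable(self, item: Scheduled, now: float) -> bool:
        """Call this at every stage: dequeue, provider dispatch, and retry."""
        return now < item.expires_at and item.campaign_id not in self._cancelled

    def pop_due(self, now: float) -> list:
        """Return items whose window has opened, discarding dead ones."""
        due = []
        while self._heap and self._heap[0].start_at <= now:
            item = heapq.heappop(self._heap)
            if self.is_deliverable(item, now):
                due.append(item)
            else:
                self.dropped += 1
        return due
```

Because `is_deliverable` is a separate method, a sender worker can call it again immediately before handing the message to a provider, which covers the case where a message was valid at dequeue time but expired while waiting on a rate limit.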
3. Fan-Out and Bulk Expansion Efficiency
A single campaign API call targeting a million users must not block ingestion or overload the database with synchronous lookups. Interviewers expect an asynchronous expansion strategy.
Hints to consider:
- Decouple campaign submission from recipient resolution using a two-phase pipeline: accept and acknowledge the campaign, then expand audience in the background
- Stream audience segments from precomputed user lists or query results rather than loading everything into memory
- Batch audience expansion into chunks (e.g., 10k users per task) and parallelize across workers
- Cache user preferences and device tokens in a low-latency store to avoid round-trips to the primary database during expansion
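The chunked, streaming expansion described above might look like the following sketch. It assumes the audience is available as any iterable of user IDs (for example a database cursor or a precomputed segment file); `expand_campaign` and the task dictionary shape are hypothetical, and in practice each task would be published to a work queue instead of yielded.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(user_ids: Iterable[int], size: int) -> Iterator[list]:
    """Yield lists of at most `size` IDs without materializing the audience."""
    iterator = iter(user_ids)
    while chunk := list(islice(iterator, size)):
        yield chunk


def expand_campaign(campaign_id: str, audience: Iterable[int],
                    chunk_size: int = 10_000) -> Iterator[dict]:
    """Turn one accepted campaign into independent fan-out tasks."""
    for index, chunk in enumerate(chunked(audience, chunk_size)):
        # Each task is self-describing, so workers can process and retry
        # chunks independently and in parallel.
        yield {"campaign_id": campaign_id, "chunk_index": index, "user_ids": chunk}
```

Only one chunk is held in memory at a time, so the ingestion API can acknowledge the campaign immediately while workers consume tasks at their own pace.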
4. Rate Limiting and Provider Integration
Third-party notification providers impose per-second and per-minute limits. Interviewers want to see adaptive rate limiting that respects quotas without dropping messages.
Hints to consider:
- Implement token bucket or leaky bucket rate limiters per provider and per tenant, stored in a fast shared cache
- Use exponential backoff and circuit breakers when a provider returns rate-limit errors or degraded responses
- Queue messages that exceed rate limits in a separate holding queue and retry them after the window resets
- Monitor provider health and dynamically shift traffic to backup channels (e.g., SMS fallback if push quota is exhausted)
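As a concrete reference for the first hint, here is a minimal single-process token bucket. It is a sketch under simplifying assumptions: the class and parameter names are hypothetical, the clock is passed in explicitly for testability, and a multi-node deployment would keep the bucket state in a shared cache and update it atomically.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start full so an idle provider can burst
        self.updated_at = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Return True if the send may proceed, False if it should be held."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical wiring: one bucket per (provider, tenant) pair.
buckets = {("push-provider", "tenant-a"): TokenBucket(rate=2.0, capacity=2.0)}
```

A message that fails `try_acquire` is not dropped; it goes to the holding queue and is retried once enough tokens have accumulated, subject to the same expiration checks as any other dispatch.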
5. Idempotency and Exactly-Once Semantics
At-least-once retry semantics can cause duplicate sends. Interviewers expect deduplication logic to prevent users from receiving the same alert twice.
Hints to consider:
- Generate a deterministic message ID (hash of campaign ID, user ID, and content) and store it in a fast cache with a TTL at least as long as the retry window, so that a late retry still finds the key
- Check the idempotency key before dispatching to the provider; skip if already seen
- Use database constraints or conditional writes to ensure notification records are written exactly once
- Distinguish between delivery retries (safe to dedupe) and user-initiated resends (new message ID)
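The deduplication hints above can be sketched with a deterministic key and a claim-before-send check. This is an in-memory illustration with hypothetical names (`message_id`, `DedupeCache`, `claim`); in a shared cache the claim would need to be a single atomic set-if-absent operation with a TTL, otherwise two workers could both pass the check.

```python
import hashlib


def message_id(campaign_id: str, user_id: str, content: str) -> str:
    """Deterministic ID: the same inputs always hash to the same key."""
    # A separator byte prevents ("ab", "c") and ("a", "bc") from colliding.
    raw = "\x1f".join((campaign_id, user_id, content)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class DedupeCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._expiry_by_key = {}  # key -> time at which the claim lapses

    def claim(self, key: str, now: float) -> bool:
        """Return True only for the first caller within the TTL window."""
        expiry = self._expiry_by_key.get(key)
        if expiry is not None and now < expiry:
            return False  # already claimed: skip the provider call
        self._expiry_by_key[key] = now + self.ttl
        return True
```

A delivery retry recomputes the same ID and is rejected by `claim`, while a user-initiated resend would be given a fresh ID (for example by including a resend counter in the hashed fields) so that it passes.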