Design a Notification System

System Design | Must

Problem Statement

Build a scalable alert delivery platform that sends messages to users across multiple channels such as push notifications, SMS, and email. The platform must handle two distinct traffic patterns: high-priority alerts (authentication codes, security warnings, direct messages) that demand sub-second delivery, and batch campaign messages (marketing announcements, feature updates, content recommendations) that can be scheduled and target large user segments simultaneously.

Your system needs to process one million notifications per second at peak load, with roughly 80% being time-critical alerts and 20% being campaign messages. Campaign notifications often fan out to thousands or millions of recipients but must never be delivered after their expiration timestamp. The architecture must prevent bulk campaigns from degrading the delivery speed of critical alerts, handle provider rate limits gracefully, deliver each notification to a user effectively once (at-least-once delivery combined with deduplication), and maintain message ordering for each recipient. You are designing the server-side infrastructure only, not the client SDK or device-level delivery confirmation.

Key Requirements

Functional

Priority-based routing -- Critical alerts must be isolated from promotional traffic to guarantee sub-second delivery regardless of campaign volume
Time-windowed delivery -- Campaign messages must respect start and end timestamps, with automatic cancellation once the promotion window closes
Multi-channel support -- Route messages to push, SMS, and email providers based on user preferences and message type
User preference enforcement -- Honor opt-out settings, channel preferences, and quiet hours before dispatching any notification
Massive fan-out -- A single campaign request must efficiently expand to millions of individual recipient messages without blocking the ingestion pipeline
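
The preference-enforcement requirement can be made concrete with a small gate that runs before dispatch. The sketch below is only illustrative: the `UserPrefs` record, its field names, and the `pick_channel` helper are assumptions, not part of the problem statement. It shows an opt-out check, a quiet-hours check that handles windows wrapping past midnight, and channel selection by user preference.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional, Set, Tuple


@dataclass
class UserPrefs:
    # Hypothetical preference record; field names are illustrative only.
    opted_out: bool = False
    channels: Tuple[str, ...] = ("push", "email", "sms")  # most preferred first
    quiet_start: Optional[time] = None  # user-local time, inclusive
    quiet_end: Optional[time] = None    # user-local time, exclusive


def in_quiet_hours(prefs: UserPrefs, local_now: time) -> bool:
    """Return True if local_now falls inside the user's quiet window."""
    if prefs.quiet_start is None or prefs.quiet_end is None:
        return False
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local_now < prefs.quiet_end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return local_now >= prefs.quiet_start or local_now < prefs.quiet_end


def pick_channel(prefs: UserPrefs, allowed: Set[str], local_now: time) -> Optional[str]:
    """Return the first preferred channel the message type allows, or None to suppress."""
    if prefs.opted_out or in_quiet_hours(prefs, local_now):
        return None
    for channel in prefs.channels:
        if channel in allowed:
            return channel
    return None
```

A caller that receives `None` during quiet hours could defer the message rather than drop it, provided the deferred send still lands before the expiration timestamp.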

Non-Functional

Scalability -- Support sustained throughput of 1M notifications/second with headroom for 3x traffic spikes during major campaigns
Reliability -- Guarantee at-least-once delivery with idempotency to prevent duplicates; tolerate partial provider outages without data loss
Latency -- P99 delivery latency under 500ms for critical alerts; batch messages can tolerate multi-second queueing as long as they arrive before expiration
Consistency -- Maintain per-user message ordering within each priority class; eventually consistent preference updates are acceptable with a propagation window under 10 seconds
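
A quick back-of-the-envelope calculation helps turn these targets into sizing numbers. The sketch below uses the throughput, traffic split, and spike headroom stated above; the per-worker dispatch rate of 5,000 messages per second is purely an assumed figure for illustration and should be replaced with a measured value.

```python
def capacity_plan(
    peak_per_sec: int = 1_000_000,   # from the problem statement
    critical_percent: int = 80,      # from the problem statement
    spike_factor: int = 3,           # headroom from the scalability requirement
    per_worker_per_sec: int = 5_000, # ASSUMPTION: illustrative worker throughput
) -> dict:
    """Derive per-class rates and a rough worker count from the stated targets."""
    critical = peak_per_sec * critical_percent // 100
    campaign = peak_per_sec - critical
    spike_total = peak_per_sec * spike_factor
    # Ceiling division: a partial worker still means one more worker.
    workers_at_spike = -(-spike_total // per_worker_per_sec)
    return {
        "critical_per_sec": critical,
        "campaign_per_sec": campaign,
        "spike_total_per_sec": spike_total,
        "workers_at_spike": workers_at_spike,
    }


if __name__ == "__main__":
    for name, value in capacity_plan().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the plan works out to 800,000 critical and 200,000 campaign notifications per second at baseline, and 3,000,000 per second in total during a 3x spike.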

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Traffic Isolation and Priority Management

Interviewers want to see how you prevent promotional blasts from starving critical alerts. A naive single-queue design will fail under load when a million-user campaign floods the system.

Hints to consider:

Separate logical or physical queues for each priority tier with dedicated consumer pools
Apply weighted fair queuing or strict priority scheduling to guarantee capacity for critical traffic
Use backpressure mechanisms and queue depth limits to shed low-priority load during overload
Consider per-tenant quotas to prevent one customer's campaign from affecting others in a multi-tenant platform
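
The isolation hints above can be sketched as a minimal in-process model with two queues, strict priority scheduling, and a depth limit that sheds campaign load. The `PriorityDispatcher` class and its method names are hypothetical; a real deployment would more likely use separate topics or queues in a message broker with dedicated consumer pools.

```python
from collections import deque
from typing import Any, Optional


class PriorityDispatcher:
    """Two-tier queue: critical traffic is always served before campaign traffic."""

    def __init__(self, max_campaign_depth: int) -> None:
        self.critical: deque = deque()
        self.campaign: deque = deque()
        self.max_campaign_depth = max_campaign_depth

    def enqueue(self, msg: Any, critical: bool) -> bool:
        """Accept a message; return False when campaign load is shed."""
        if critical:
            self.critical.append(msg)
            return True
        if len(self.campaign) >= self.max_campaign_depth:
            # Backpressure: the producer should slow down or retry later.
            return False
        self.campaign.append(msg)
        return True

    def next_message(self) -> Optional[Any]:
        """Strict priority: drain critical first, then campaign, else None."""
        if self.critical:
            return self.critical.popleft()
        if self.campaign:
            return self.campaign.popleft()
        return None
```

Strict priority can starve campaign traffic while critical load stays high. Weighted fair queuing trades a little critical capacity for guaranteed campaign progress, which matters when campaign messages carry expiration deadlines.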

2. Expiration and Time-Window Enforcement

Delivering an expired promotion damages user trust and may violate business rules. Interviewers look for end-to-end TTL checks, not just a filter at ingestion.

Hints to consider:

Attach expiration metadata to every message and validate it at each processing stage (queue consumption, provider dispatch, retry)
Use time-indexed data structures like sorted sets or scheduled queues to automatically discard expired items
Implement cancellation workflows so in-flight messages can be aborted when a campaign is manually stopped or reaches its deadline
Monitor drift between message creation time and dispatch time to detect processing delays that might cause late deliveries
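
One way to picture the end-to-end checks above is a single predicate that every stage calls before doing work. This is a sketch under stated assumptions: the `Message` fields, the `cancelled` set of campaign IDs, and the `safety_margin` parameter (an allowance for expected provider latency) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class Message:
    message_id: str
    campaign_id: Optional[str]   # None for one-off alerts (assumption)
    created_at: float            # epoch seconds
    expires_at: Optional[float]  # epoch seconds; None means no expiry (assumption)


def should_dispatch(
    msg: Message, now: float, cancelled: Set[str], safety_margin: float = 0.0
) -> bool:
    """Call at queue consumption, before provider dispatch, and before each retry."""
    if msg.campaign_id is not None and msg.campaign_id in cancelled:
        return False
    if msg.expires_at is not None and now + safety_margin >= msg.expires_at:
        return False
    return True


def live_messages(
    messages: Iterable[Message], now: float, cancelled: Set[str]
) -> Iterator[Message]:
    """Lazily drop expired or cancelled messages from a batch."""
    for msg in messages:
        if should_dispatch(msg, now, cancelled):
            yield msg


def dispatch_lag(msg: Message, now: float) -> float:
    """Seconds between creation and now; export as a metric to spot drift."""
    return now - msg.created_at
```

Passing `now` explicitly keeps the check deterministic in tests; production code would supply the current clock reading at each stage.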

3. Fan-Out and Bulk Expansion Efficiency

A single campaign API call targeting a million users must not block ingestion or overload the database with synchronous lookups. Interviewers expect an asynchronous expansion strategy.

Hints to consider:

Decouple campaign submission from recipient resolution using a two-phase pipeline: accept and acknowledge the campaign, then expand audience in the background
Stream audience segments from precomputed user lists or query results rather than loading everything into memory
Batch audience expansion into chunks (e.g., 10k users per task) and parallelize across workers
Cache user preferences and device tokens in a low-latency store to avoid round-trips to the primary database during expansion
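
The chunked expansion described above might look like the following sketch. The function names and the task dictionary shape are assumptions for illustration. The audience is consumed as a stream, so only one chunk is held in memory at a time, and each emitted task could be placed on a queue for a pool of expansion workers.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `size` items without materializing the input."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def expand_campaign(
    campaign_id: str, audience: Iterable[Any], chunk_size: int = 10_000
) -> Iterator[Dict[str, Any]]:
    """Phase two of the pipeline: turn an audience stream into independent tasks."""
    for index, user_ids in enumerate(chunked(audience, chunk_size)):
        yield {"campaign_id": campaign_id, "chunk": index, "user_ids": user_ids}
```

For example, `expand_campaign("spring-sale", range(25_000))` yields three tasks: two with 10,000 user IDs and one with 5,000. Including the chunk index in each task gives workers a stable identifier for retrying a failed chunk.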

4. Rate Limiting and Provider Integration

Third-party notification providers impose per-second and per-minute limits. Interviewers want to see adaptive rate limiting that respects quotas without dropping messages.

Hints to consider:

Implement token bucket or leaky bucket rate limiters per provider and per tenant, stored in a fast shared cache
Use exponential backoff and circuit breakers when a provider returns rate-limit errors or degraded responses
Queue messages that exceed rate limits in a separate holding queue and retry them after the window resets
Monitor provider health and dynamically shift traffic to backup channels (e.g., SMS fallback if push quota is exhausted)
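
As a rough illustration of the first two hints, here is a single-process token bucket and a capped exponential backoff helper. The class and function names are hypothetical, and the bucket state lives in local memory only; the shared-cache variant mentioned above would need the refill and decrement to happen atomically in the cache. The clock is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.updated = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        if now > self.updated:
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))
```

When `try_acquire` returns `False`, the message would go to the holding queue rather than being dropped. Adding random jitter to the backoff delay is a common refinement that keeps many workers from retrying in lockstep.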

5. Idempotency and Exactly-Once Semantics

At-least-once retry semantics can cause duplicate sends. Interviewers expect deduplication logic to prevent users from receiving the same alert twice.

Hints to consider:

Generate a deterministic message ID (hash of campaign ID, user ID, and content) and store it in a fast cache with a TTL at least as long as the retry window, so a late retry cannot slip past an already-expired key
Check the idempotency key before dispatching to the provider; skip if already seen
Use database constraints or conditional writes to ensure notification records are written exactly once
Distinguish between delivery retries (safe to dedupe) and user-initiated resends (new message ID)
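
The deduplication flow above can be sketched with a deterministic key and a claim-if-absent store. Everything here is an illustrative assumption: `DedupeStore` is an in-memory stand-in for a shared cache that supports an atomic set-if-absent with TTL, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from typing import Dict


def idempotency_key(campaign_id: str, user_id: str, content: str) -> str:
    """Deterministic message ID: the same inputs always give the same key."""
    digest = hashlib.sha256()
    for part in (campaign_id, user_id, content):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()


class DedupeStore:
    """In-memory stand-in for a cache with set-if-absent and per-key TTL."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def claim(self, key: str, now: float, ttl: float) -> bool:
        """Return True if the caller won the right to send; False if already seen."""
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl
        return True

    def release(self, key: str) -> None:
        """Undo a claim after a failed send so a retry can proceed."""
        self._expires_at.pop(key, None)
```

A worker would call `claim` before dispatching and `release` if the provider call fails outright. A user-initiated resend would mix a fresh value (for example a request ID) into the key so it is not treated as a duplicate.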
