For a full example answer with detailed architecture diagrams and deep dives, see our Design a Job Scheduler guide. While the job scheduler guide focuses on task execution, many of the same patterns around priority queuing, durable workflows, and rate limiting apply directly to notification delivery pipelines.
Also review the Message Queues and Rate Limiters building blocks for background on asynchronous processing and throughput control.
Design a notification system that delivers both critical time-sensitive alerts (such as direct messages, OTPs, and security warnings) and promotional campaign messages (such as content recommendations and marketing announcements) across push, SMS, and email channels. The platform must process up to one million notifications per second at peak load, where roughly 80 percent of traffic is critical and 20 percent is promotional. Promotional notifications can target thousands or millions of users simultaneously but must never arrive after their expiration window closes.
Think of platforms like OneSignal, Firebase Cloud Messaging, or the notification infrastructure behind large consumer apps. Product teams submit notification requests through an API, specifying priority, audience, expiration, and channel preferences. The backend classifies traffic, expands audience segments, enforces user preferences and quiet hours, rate-limits against provider quotas, and dispatches messages with at-least-once reliability. Your design must guarantee that bulk promotional campaigns never degrade the delivery latency of critical alerts.
Based on real interview experiences, these are the areas interviewers probe most deeply:
A single shared queue will cause head-of-line blocking when a million-user campaign floods the system, delaying critical alerts and violating latency SLOs. Interviewers want to see explicit isolation between priority classes.
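One way to make the isolation concrete is a dispatcher that can never serve promotional traffic while critical traffic is waiting. The sketch below uses in-memory queues purely for illustration; in the design discussed here, these would be separate Kafka topics with independently scaled consumer groups.

```python
from collections import deque

# Illustrative in-memory stand-in for two isolated streams; in production
# these would be separate Kafka topics with independent consumer groups.
critical_queue = deque()
promotional_queue = deque()

def enqueue(message, priority):
    """Classify at ingestion so bulk traffic never shares a queue with alerts."""
    (critical_queue if priority == "critical" else promotional_queue).append(message)

def next_message():
    """Always drain critical traffic first. A million-message campaign sitting
    in the promotional queue can never delay an alert, because the two classes
    never share a line (no head-of-line blocking)."""
    if critical_queue:
        return critical_queue.popleft()
    if promotional_queue:
        return promotional_queue.popleft()
    return None
```

With real Kafka topics, the same property falls out of giving each class its own topic and consumer group, so promotional consumers can be throttled or paused without touching critical throughput.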
Delivering an expired promotion erodes user trust and may violate compliance rules. A single check at ingestion is insufficient because messages can sit in queues for unpredictable durations.
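The fix is to re-check expiry at every hop, most importantly immediately before dispatch. A minimal sketch (the message shape and `send_fn` callback are illustrative assumptions):

```python
import time

def should_send(message: dict, now=None) -> bool:
    """Re-check expiry immediately before dispatch, not just at ingestion,
    because a message can sit in a queue for an unpredictable time."""
    now = time.time() if now is None else now
    return now < message["expires_at"]

def dispatch(message: dict, send_fn, now=None) -> str:
    """A worker drops expired messages instead of dispatching them, and would
    also increment an expiration-drop metric for observability."""
    if not should_send(message, now):
        return "dropped_expired"
    send_fn(message)
    return "sent"
```

The retry path needs the same guard: every retry attempt should call the expiry check first, so backoff delays cannot push a message past its window.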
A single campaign API call targeting millions of users must not block the ingestion pipeline or overload the database. Interviewers expect an asynchronous, chunked expansion strategy.
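The core of the chunked strategy can be shown with a generator: the audience is streamed in fixed-size pages rather than materialized at once. In the real system each chunk would come from a cursor-based segment query and be emitted as messages to a downstream dispatch topic; the small `chunk_size` here is just for illustration.

```python
from typing import Iterable, Iterator, List

def expand_audience(user_ids: Iterable[str], chunk_size: int = 1000) -> Iterator[List[str]]:
    """Stream the audience in fixed-size chunks so a million-user campaign
    never sits fully in memory and never blocks the ingestion pipeline.
    Each yielded chunk becomes one batch of user-message pairs."""
    chunk: List[str] = []
    for uid in user_ids:
        chunk.append(uid)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk
```

Because expansion is a separate worker stage consuming a single campaign-job record, the API call returns immediately and the database sees a steady stream of bounded queries instead of one giant scan.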
At-least-once retry semantics combined with fan-out can produce duplicates. Users receiving the same OTP or marketing message twice degrades the experience.
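The standard answer is a deterministic idempotency key checked before each send. A plain Python set stands in below for what would be a Redis `SET key NX` with a TTL in production, so all dispatch workers share one view of what has already been sent; the key format is an illustrative assumption.

```python
# In production the seen-set would be Redis SET NX with a TTL; a plain set
# stands in here to show the idempotency-key pattern.
_seen = set()

def idempotency_key(campaign_id: str, user_id: str, channel: str) -> str:
    """Deterministic key: retries of the same logical send collapse to one key."""
    return f"{campaign_id}:{user_id}:{channel}"

def send_once(campaign_id: str, user_id: str, channel: str, send_fn) -> bool:
    """Returns False (and skips the send) for duplicates produced by
    at-least-once retries or fan-out replays."""
    key = idempotency_key(campaign_id, user_id, channel)
    if key in _seen:
        return False
    _seen.add(key)
    send_fn(key)
    return True
```

The TTL matters in the real version: the dedup record only needs to outlive the retry window, so the set stays bounded even at a million messages per second.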
Questions worth asking before committing to a design:

- Confirm the definition of critical versus promotional traffic and the acceptable latency targets for each.
- Ask whether the system is multi-tenant and if different customers need isolated quotas.
- Verify whether multi-region deployment is required or if a single datacenter suffices.
- Clarify retry policies: how many attempts, over what time window, and whether retries should fall back to alternative channels.
- Confirm the size of the largest expected campaign and whether audience expansion happens before or after the campaign is submitted.
- Establish whether you need delivery receipts and engagement tracking or if fire-and-forget is acceptable.
Sketch an ingestion API layer that accepts notification requests and immediately writes them to a partitioned message log. Use separate Kafka topics for critical and promotional streams to enforce traffic isolation at the infrastructure level. Deploy consumer groups that pull from these topics and perform audience expansion, preference lookups, and provider-specific formatting. Place a rate-limiting layer in front of provider gateways that tracks quota per provider and per tenant using Redis token buckets. Include a retry service that requeues failed messages with exponential backoff and checks expiration timestamps before each attempt. Store notification metadata and delivery status in a horizontally scalable database like Cassandra for auditing and analytics.
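The per-provider, per-tenant quota tracking mentioned above is a token bucket. In the actual design this state would live in Redis (updated atomically, e.g. via a Lua script) so every gateway instance shares one view; the in-process class below is a sketch of the refill math only.

```python
class TokenBucket:
    """Quota tracker for one provider (or one tenant). Tokens refill
    continuously at `rate` per second up to a burst ceiling of `capacity`;
    each send costs one token and is rejected when the bucket is empty."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Messages that fail the `allow` check are not dropped; per the design above they are parked for retry once the window refills, with promotional traffic parked first.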
Walk through how a promotional campaign flows end-to-end. The API accepts the campaign, assigns an expiration timestamp, and writes a single campaign job record to the promotional Kafka topic. A fleet of expansion workers picks up the job, queries user segments in batches, and emits individual user-message pairs to a downstream dispatch topic. A second tier of formatting workers reads from this topic, fetches user preferences from a cache-backed store, enriches the message payload per channel, and forwards to the provider gateway. Critical alerts bypass the expansion step entirely and go directly from the ingestion API to the formatting workers via the high-priority topic. Partition both topics by user ID to maintain per-user ordering and enable parallel processing. Emphasize that critical and promotional consumer groups scale independently and that promotional workers can be throttled or paused during overload without affecting critical delivery.
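The partition-by-user-ID point can be shown in a few lines: a stable hash of the user ID sends every message for a given user to the same partition, which preserves per-user ordering while partitions are processed in parallel. Kafka's default partitioner does the equivalent with murmur2 on the message key; a stable MD5-based hash stands in here, and the partition count is illustrative.

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative; configured per topic in Kafka

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the user ID, so the same user always maps to the
    same partition. (Python's built-in hash() is randomized per process,
    so a deterministic digest is used instead.)"""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

The same keying is applied to both the critical and promotional topics, so ordering holds within each priority class even though the two classes are consumed independently.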
Discuss reliability: use Kafka replication and consumer offset commits to prevent data loss, and configure dead-letter queues for messages that fail after maximum retries. Cover rate limiting: implement token bucket counters per provider in Redis with atomic decrement operations, and queue overflow messages in a sorted set for retry after the rate window resets. Address observability: emit metrics for queue depth per priority tier, per-priority latency percentiles, provider error rates, and expiration-drop counts to detect issues before they impact SLOs. Mention multi-region considerations if the user base is global, with regional message logs and provider gateways to reduce cross-region latency.
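The retry policy above combines three ideas: exponential backoff, jitter, and a hard stop at either max attempts or expiry. A minimal sketch of the scheduling math (the parameter defaults are assumptions, not prescriptions):

```python
import random

def next_retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: roughly 1s, 2s, 4s, ... capped
    at 5 minutes. Jitter spreads retries out so a provider outage does not
    produce a synchronized thundering herd when the provider recovers."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(attempt: int, max_attempts: int, now: float, expires_at: float) -> bool:
    """Give up once attempts are exhausted or the message has expired.
    Exhausted messages go to the dead-letter queue rather than being lost."""
    return attempt < max_attempts and now < expires_at
```

Checking `expires_at` here, and not only at ingestion, is what enforces the requirement that promotions never arrive after their window closes, even when backoff delays stack up.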
"Design Notification system for Simple and Bulk notification."
"Required to share the API endpoints, Database design and Contracts for the components."
"Design Notification System for sending otps or emails."