For a walkthrough of scheduling and orchestrating background work at scale, see our Design Job Scheduler guide. Many of the same patterns around durable task queues, retries, and priority isolation apply directly to notification pipelines.
Also review the Message Queues, Databases, and Rate Limiters building blocks for background on asynchronous processing, storage choices, and throttle control.
Design a notification system that delivers messages to users across multiple channels: push notifications, SMS, and email. The system must handle two distinct traffic patterns: high-priority critical alerts (authentication codes, security warnings, direct messages) that demand sub-second delivery, and bulk promotional campaigns (marketing offers, feature announcements) that can target thousands or millions of recipients but must never arrive after their expiration window closes.
The platform needs to sustain one million notifications per second at peak load, with roughly 80 percent being time-critical alerts and 20 percent being campaigns. Promotional fan-outs must never degrade the delivery speed of critical traffic. The architecture must respect third-party provider rate limits, enforce exactly-once delivery per user, maintain per-user message ordering within each priority class, and gracefully handle provider outages. You are designing the server-side pipeline only, not the client SDK or device-level delivery confirmation.
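The throughput targets above lend themselves to a quick back-of-envelope sizing exercise. A sketch, assuming a sustained per-partition consume rate of 10,000 messages/second (an assumption for illustration, not part of the problem statement):

```python
# Back-of-envelope partition sizing for the stated 1M/s peak with an 80/20 split.
PEAK_RPS = 1_000_000
CRITICAL_RPS = int(PEAK_RPS * 0.8)   # 800,000/s time-critical alerts
PROMO_RPS = PEAK_RPS - CRITICAL_RPS  # 200,000/s campaign traffic

# Assumed sustained consume rate per partition; in practice this depends on
# payload size, per-message work, and hardware, and is found via load tests.
PER_PARTITION_RPS = 10_000

# Ceiling division: partitions needed per priority class.
critical_partitions = -(-CRITICAL_RPS // PER_PARTITION_RPS)
promo_partitions = -(-PROMO_RPS // PER_PARTITION_RPS)
```

With these assumptions the critical topic needs on the order of 80 partitions and the promotional topic 20, which also bounds the maximum useful consumer-group size per class.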
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see how you prevent promotional blasts from starving critical alerts. A naive single-queue design will fail under load when a million-user campaign floods the system.
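One way to make the isolation concrete is to give each traffic class its own queue and its own worker budget, so a promotional backlog can never consume critical capacity. A minimal in-memory sketch (the two deques and the budgets are illustrative stand-ins for separate Kafka topics and independently scaled consumer groups):

```python
from collections import deque

critical_q, promo_q = deque(), deque()

def enqueue(msg, priority):
    (critical_q if priority == "critical" else promo_q).append(msg)

def drain_round(critical_budget=8, promo_budget=2):
    """One scheduling round. Each class spends only its own budget, so a
    million-message promotional flood never delays a critical alert."""
    sent = []
    for q, budget in ((critical_q, critical_budget), (promo_q, promo_budget)):
        for _ in range(budget):
            if not q:
                break
            sent.append(q.popleft())
    return sent
```

Even with the promotional queue holding an entire campaign, a newly enqueued authentication code is dispatched in the very next round; the single-queue design fails exactly this test.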
Delivering an expired promotion damages user trust and may violate business rules. Interviewers look for TTL checks at every processing stage, not just at ingestion.
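A cheap way to enforce the window at every stage is a guard that expansion, formatting, retry, and gateway workers all call before doing work. A sketch, assuming each message carries an `expires_at` epoch timestamp (an illustrative field name):

```python
import time

def still_deliverable(msg, now=None):
    """True while the message's expiration window is still open."""
    now = time.time() if now is None else now
    return now < msg["expires_at"]

def process_stage(msg, work, on_expired):
    """Wrapper applied at every pipeline stage, not just ingestion, so a
    message that sat in a backlog is dropped instead of delivered late."""
    if not still_deliverable(msg):
        on_expired(msg)  # count the drop; never lose it silently
        return None
    return work(msg)
```

Counting expiration drops per stage also tells you *where* messages are aging out, which is the metric interviewers expect alongside the check itself.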
A single campaign API call targeting a million users must not block ingestion or overload the database with synchronous lookups. Interviewers expect an asynchronous expansion strategy.
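The asynchronous expansion can be sketched as a cursor-driven loop: the API stores a single campaign job and returns immediately, and a worker pages through the audience, emitting per-user messages downstream. `fetch_segment_page` and `emit` are hypothetical stand-ins for the segment-store query and the producer write:

```python
def expand_campaign(job, fetch_segment_page, emit, batch_size=1000):
    """Expand one campaign job into per-user messages, one page at a time,
    so ingestion never blocks and the segment store sees bounded queries.

    fetch_segment_page(segment, cursor, limit) -> (user_ids, next_cursor);
    a next_cursor of None means the segment is exhausted."""
    cursor = None
    while True:
        user_ids, cursor = fetch_segment_page(job["segment"], cursor, batch_size)
        for uid in user_ids:
            emit({"campaign_id": job["id"],
                  "user_id": uid,
                  "expires_at": job["expires_at"]})
        if cursor is None:
            return
```

Checkpointing the cursor after each emitted batch lets a crashed expansion worker resume where it left off instead of restarting the fan-out.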
Third-party notification providers impose per-second and per-minute limits. Interviewers want to see adaptive rate limiting that respects quotas without dropping messages.
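Per-provider quotas map naturally onto a token bucket: tokens refill at the provider's permitted rate, and a message that finds the bucket empty is requeued rather than dropped. A single-process sketch (a production version would keep this state in a shared store such as Redis so all gateway instances see one quota):

```python
import time

class TokenBucket:
    """Refills at `rate` tokens/sec up to `burst`; one token per send."""

    def __init__(self, rate, burst, now=None):
        self.rate, self.capacity = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def try_acquire(self, now=None):
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller requeues with backoff instead of dropping
```

Keeping one bucket per provider (and per tenant, if quotas are tenant-scoped) lets the gateway absorb a provider's per-second and per-minute limits without losing messages.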
At-least-once retry semantics can cause duplicate sends. Interviewers expect deduplication logic to prevent users from receiving the same alert twice.
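At-least-once retries plus a unique message ID give an exactly-once *effect*: the sender records the ID before delivering and skips any ID it has already seen. A sketch using an in-memory set as a stand-in for a shared TTL'd store (with Redis, `SET key NX EX ttl` performs the membership test and the insert in one atomic step):

```python
def send_once(msg, seen, deliver):
    """Deliver msg at most once per message_id.

    `seen` stands in for a shared store with expiring keys; the TTL only
    needs to cover the retry window, not live forever."""
    key = msg["message_id"]
    if key in seen:      # duplicate from an at-least-once retry -- skip
        return False
    seen.add(key)
    deliver(msg)
    return True
```

Note the ordering trade-off: marking before delivery (as here) prefers a rare lost send over a duplicate if the process crashes in between; marking after delivery prefers the opposite. The right choice can differ per channel.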
Confirm the definition of "critical" versus "promotional" traffic and acceptable latency targets for each. Ask whether the system is multi-tenant and if different customers should have isolated quotas. Verify whether multi-region deployments are needed. Clarify retry policies: how many attempts, over what window, and whether retries fall back to alternative channels. Confirm the size of the largest expected campaign and whether audience expansion happens before or after submission. Establish whether delivery receipts and engagement tracking are required.
Sketch an ingestion API layer that accepts notification requests and writes them to a partitioned message log. Use separate Kafka topics for critical and promotional streams to enforce traffic isolation. Deploy consumer groups that pull from these topics and perform audience expansion, preference lookups, and provider-specific formatting. Place a rate-limiting layer in front of provider gateways that tracks quotas per provider and per tenant using a shared cache. Include a retry service that requeues failed messages with exponential backoff and respects expiration timestamps. Store notification metadata and delivery status in a horizontally scalable database for auditing and analytics.
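The retry service described above can combine its exponential backoff with the expiration timestamp, so a failed message is never requeued past its window. A sketch (the base delay and cap are illustrative defaults, not requirements):

```python
def next_retry_at(now, attempt, expires_at, base=2.0, cap=300.0):
    """Time of the next delivery attempt, or None if the message's
    expiration window closes before that attempt could run.

    attempt 0 -> `base` seconds, then doubling, capped at `cap` seconds."""
    delay = min(base * (2 ** attempt), cap)
    t = now + delay
    return t if t < expires_at else None  # None -> dead-letter or drop
```

Returning `None` gives the retry service a single decision point for routing a message to the dead-letter queue (or counting an expiration drop) instead of scattering that logic across workers.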
Walk through how a promotional campaign flows through the system. The API accepts the campaign, assigns an expiration timestamp, and writes a single "campaign job" record to the promotional topic. Expansion workers pick up the job, query user segments in batches, and emit individual user-message pairs to a downstream topic. Formatting workers read from this topic, fetch user preferences from a cache-backed store, enrich the payload, and forward to the provider gateway. Critical alerts bypass the expansion step and go directly to formatting workers via the high-priority topic. Partition by user ID or device ID to maintain per-user ordering. Emphasize that critical and promotional workers scale independently and that promotional workers can be throttled during overload.
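The per-user ordering guarantee falls out of a stable partition key: every message for a given user hashes to the same partition, and each partition is consumed in order. A sketch of the key function (md5 here is an arbitrary stable-hash choice, not a requirement; Python's built-in `hash()` is randomized per process and unsuitable for this):

```python
import hashlib

def partition_for(user_id, num_partitions):
    """Stable user_id -> partition mapping. All of a user's messages within
    one topic land on the same partition, preserving their relative order."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because critical and promotional traffic live on separate topics, this preserves ordering *within each priority class* only, which matches the stated requirement.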
Discuss reliability through Kafka replication and consumer offset commits to prevent data loss, with dead-letter queues for messages that fail after maximum retries. Cover idempotency by generating a unique message ID and checking a Redis cache before sending. Explain rate limiting with a token bucket per provider in Redis using atomic decrement operations. Address expiration by checking timestamps at every stage and dropping messages whose window has closed. Mention observability: emit metrics for queue depth, per-priority latency, provider error rates, and expiration drops. Optionally sketch multi-region deployment with regional message logs and provider gateways.
Candidates at Metropolis report that interviewers asked them to design both simple one-to-one notification delivery and bulk campaign fan-out in the same session, paying close attention to how the two traffic types are isolated. One interviewer specifically requested API endpoint designs, database schemas, and contracts between components. Be ready to discuss how you would handle a provider outage mid-campaign without losing or duplicating messages.