For a walkthrough of scheduling and orchestrating background work at scale, see our Design Job Scheduler guide. Many of the same patterns around durable task queues, retries, and priority isolation apply directly to notification pipelines.
Also review the Message Queues, Databases, and Rate Limiters building blocks.
Design a notification system that delivers messages to users across multiple channels: push notifications, SMS, and email. The system must handle two distinct traffic patterns: high-priority critical alerts (authentication codes, security warnings, direct messages) that demand sub-second delivery, and bulk promotional campaigns (marketing offers, feature announcements) that target large user segments but must never arrive after their expiration window closes.
The platform needs to sustain one million notifications per second at peak load, with roughly 80 percent being time-critical alerts and 20 percent being campaigns. The architecture must prevent bulk campaigns from degrading the delivery speed of critical alerts, handle third-party provider rate limits gracefully, enforce exactly-once delivery per user, and maintain per-user message ordering within each priority class. You are designing the server-side pipeline, not the client SDK or device-level delivery confirmation.
Based on real interview experiences, these are the areas interviewers probe most deeply:
The central design challenge is preventing a million-user promotional blast from starving time-sensitive alerts like OTPs or security warnings. A single-queue architecture will fail catastrophically under campaign load.
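One way to make the isolation concrete is to split traffic into fully separate streams at ingestion. A minimal in-process sketch follows; the topic names and the `Notification` shape are illustrative assumptions, and in a real deployment each queue would be a distinct Kafka topic with its own consumer group:

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical topic names; a real system would map these to separate
# Kafka topics with independent consumer groups and worker pools.
CRITICAL_TOPIC = "notifications.critical"
PROMO_TOPIC = "notifications.promotional"

@dataclass
class Notification:
    user_id: str
    body: str
    priority: str  # "critical" or "promotional"

class PriorityRouter:
    """Route each message into a fully separate queue so a promotional
    backlog can never sit in front of a critical alert."""

    def __init__(self) -> None:
        self.queues = {CRITICAL_TOPIC: deque(), PROMO_TOPIC: deque()}

    def publish(self, n: Notification) -> str:
        topic = CRITICAL_TOPIC if n.priority == "critical" else PROMO_TOPIC
        self.queues[topic].append(n)
        return topic
```

Because the two queues share no storage or consumers, a million enqueued promotions add zero depth in front of the next OTP.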
An expired promotion that still gets delivered damages user trust and may violate compliance rules. The expiration check must happen at every processing stage, not just at ingestion.
A single campaign targeting millions of users must be expanded into individual messages asynchronously. Synchronous expansion would block the API and overwhelm the database.
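The asynchronous expansion can be sketched as a batching generator: the API call returns as soon as the campaign job is recorded, and expansion workers later stream fixed-size batches of per-user messages to the downstream topic instead of materializing millions of rows at once. The batch size and message shape here are illustrative assumptions:

```python
from typing import Iterable, Iterator

def expand_campaign(campaign_id: str,
                    segment_user_ids: Iterable[str],
                    batch_size: int = 1000) -> Iterator[list[dict]]:
    """Lazily expand one campaign into per-user messages, yielding
    fixed-size batches so memory use and database load stay bounded
    regardless of segment size."""
    batch: list[dict] = []
    for user_id in segment_user_ids:
        batch.append({"campaign_id": campaign_id, "user_id": user_id})
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

In practice `segment_user_ids` would itself be a paginated query against the user-segment store, so neither the segment nor the expanded messages ever sit fully in memory.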
External push, SMS, and email providers impose strict rate limits. Exceeding them results in rejected messages or temporary bans. Interviewers expect adaptive throttling, not just retry loops.
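The standard tool here is a token bucket sized to the provider's published limit. A minimal single-process sketch is below; in production the bucket state would live in shared storage such as Redis so every dispatch worker draws from one budget, and a rejected acquire triggers backoff rather than a drop:

```python
import time

class TokenBucket:
    """In-process token bucket: refills continuously at `rate_per_sec`
    up to `capacity`, and admits a send only when a token is available."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should requeue or back off, never drop
```

Adaptive throttling then means adjusting `rate` downward when the provider starts returning 429s, instead of blindly retrying into a temporary ban.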
At-least-once delivery guarantees mean consumers may process the same message twice during rebalances or retries. Without deduplication, users receive duplicate notifications.
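The deterministic-ID approach can be sketched as follows: derive the message ID purely from identity fields, so a retried or re-consumed message hashes to the same ID and the second attempt is suppressed. The in-memory set is an illustrative stand-in; a shared store with a TTL (for example, a Redis set-if-absent with expiry) is what multiple consumers would actually use:

```python
import hashlib

def message_id(campaign_id: str, user_id: str, channel: str) -> str:
    """Stable ID derived from the message's identity, so replays of
    the same (campaign, user, channel) always collide."""
    key = f"{campaign_id}:{user_id}:{channel}"
    return hashlib.sha256(key.encode()).hexdigest()

class Deduplicator:
    """Consumer-side dedup sketch: deliver only the first occurrence
    of each deterministic message ID."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def should_deliver(self, msg_id: str) -> bool:
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        return True
```

The key property is that dedup requires no coordination at produce time: any consumer that sees the message can recompute the same ID.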
Confirm what qualifies as "critical" versus "promotional" and the latency target for each. Ask whether the system is multi-tenant, whether different tenants need isolated quotas, and whether multi-region deployment is required. Clarify retry policy: how many attempts, over what time window, and whether retries should fall back to alternative channels. Establish whether delivery receipts and engagement tracking are needed or if fire-and-forget is acceptable.
Sketch an ingestion API that accepts notification requests and writes them to a partitioned message log. Use separate Kafka topics for critical and promotional streams. Deploy consumer groups for each stream that handle audience expansion, preference lookups, and provider formatting. Place a rate-limiting layer in front of provider gateways, backed by a shared Redis cache. Include a retry service that requeues failed messages with exponential backoff and checks expiration before each retry. Store notification metadata and delivery status in a scalable database for audit and analytics.
Trace a promotional campaign through the system. The API accepts the campaign, assigns an expiration timestamp, and writes a job record to the promotional topic. Expansion workers consume the job, query user segments in batches, and produce individual user-message pairs on a downstream topic. Formatting workers apply templates, resolve user preferences, and forward messages to provider gateways. Critical alerts skip the expansion phase and flow directly from the ingestion API through the high-priority topic to formatting workers and out to providers. Partition both topics by user ID to maintain per-user ordering. Show that the two pipelines share no worker pools, so promotional load cannot affect critical delivery.
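The partition-by-user-ID step above is just consistent hashing of the message key, which is what Kafka's default key-based partitioning does; a simplified sketch of the assignment (the hash choice here is illustrative, not Kafka's exact algorithm):

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Assign every message keyed by this user to the same partition,
    so a single consumer sees that user's messages in produce order."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because each priority class has its own topic, this yields per-user ordering within a class, not across classes, which matches the requirement stated earlier.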
Cover reliability through Kafka replication and consumer offset management, with dead-letter queues for exhausted retries. Discuss idempotency using the deterministic message ID pattern. Explain provider rate limiting with atomic token-bucket decrements in Redis. Address expiration enforcement at the ingestion, expansion, formatting, and dispatch stages. Mention observability: dashboards for queue depth, per-priority P99 latency, provider success rates, and expiration drop counts. Optionally discuss multi-region deployment with regional Kafka clusters and provider gateways to reduce cross-region latency.
Candidates report that Navan interviewers focus on both the one-to-one notification path and the campaign fan-out path in the same session. The key differentiator is demonstrating clear traffic isolation so that critical alerts never experience latency spikes during large promotional sends. Be prepared to discuss specific queue partitioning strategies and how you would handle a provider going down mid-campaign without losing or duplicating messages.