For a walkthrough of scheduling and orchestrating background work at scale, see our Design Job Scheduler guide. Many of the same patterns around durable task queues, retries, and priority isolation apply directly to notification pipelines.
Also review the Message Queues, Databases, and Rate Limiters building blocks.
Design a notification system that delivers messages to users across multiple channels: push notifications, SMS, and email. The system must handle two distinct traffic patterns: high-priority critical alerts (authentication codes, security warnings, direct messages) that demand sub-second delivery, and bulk promotional campaigns (marketing offers, feature announcements) that target large user segments but must never arrive after their expiration window closes.
The platform needs to sustain one million notifications per second at peak load, with roughly 80 percent being time-critical alerts and 20 percent being campaigns. The architecture must prevent bulk campaigns from degrading the delivery speed of critical alerts, handle third-party provider rate limits gracefully, enforce exactly-once delivery per user, and maintain per-user message ordering within each priority class. You are designing the server-side pipeline, not the client SDK or device-level delivery confirmation.
Based on real interview experiences, these are the areas interviewers probe most deeply:
The central design challenge is preventing a million-user promotional blast from starving time-sensitive alerts like OTPs or security warnings. A single-queue architecture will fail catastrophically under campaign load.
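One way to make the isolation concrete is to split traffic into fully separate streams at ingestion. A minimal in-process sketch follows; the topic names and the `Notification` shape are illustrative assumptions, and in a real deployment each queue would be a distinct Kafka topic with its own consumer group:

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical topic names; a real system would map these to separate
# Kafka topics with independent consumer groups and worker pools.
CRITICAL_TOPIC = "notifications.critical"
PROMO_TOPIC = "notifications.promotional"

@dataclass
class Notification:
    user_id: str
    body: str
    priority: str  # "critical" or "promotional"

class PriorityRouter:
    """Route each message into a fully separate queue so a promotional
    backlog can never sit in front of a critical alert."""

    def __init__(self) -> None:
        self.queues = {CRITICAL_TOPIC: deque(), PROMO_TOPIC: deque()}

    def publish(self, n: Notification) -> str:
        topic = CRITICAL_TOPIC if n.priority == "critical" else PROMO_TOPIC
        self.queues[topic].append(n)
        return topic
```

Because the two queues share no storage or consumers, a million enqueued promotions add zero depth in front of the next OTP.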
An expired promotion that still gets delivered damages user trust and may violate compliance rules. The expiration check must happen at every processing stage, not just at ingestion.
A single campaign targeting millions of users must be expanded into individual messages asynchronously. Synchronous expansion would block the API and overwhelm the database.
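The asynchronous expansion can be sketched as a batching generator: the API call returns as soon as the campaign job is recorded, and expansion workers later stream fixed-size batches of per-user messages to the downstream topic instead of materializing millions of rows at once. The batch size and message shape here are illustrative assumptions:

```python
from typing import Iterable, Iterator

def expand_campaign(campaign_id: str,
                    segment_user_ids: Iterable[str],
                    batch_size: int = 1000) -> Iterator[list[dict]]:
    """Lazily expand one campaign into per-user messages, yielding
    fixed-size batches so memory use and database load stay bounded
    regardless of segment size."""
    batch: list[dict] = []
    for user_id in segment_user_ids:
        batch.append({"campaign_id": campaign_id, "user_id": user_id})
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

In practice `segment_user_ids` would itself be a paginated query against the user-segment store, so neither the segment nor the expanded messages ever sit fully in memory.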
External push, SMS, and email providers impose strict rate limits. Exceeding them results in rejected messages or temporary bans. Interviewers expect adaptive throttling, not just retry loops.
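The standard tool here is a token bucket sized to the provider's published limit. A minimal single-process sketch is below; in production the bucket state would live in shared storage such as Redis so every dispatch worker draws from one budget, and a rejected acquire triggers backoff rather than a drop:

```python
import time

class TokenBucket:
    """In-process token bucket: refills continuously at `rate_per_sec`
    up to `capacity`, and admits a send only when a token is available."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should requeue or back off, never drop
```

Adaptive throttling then means adjusting `rate` downward when the provider starts returning 429s, instead of blindly retrying into a temporary ban.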
At-least-once delivery guarantees mean consumers may process the same message twice during rebalances or retries. Without deduplication, users receive duplicate notifications.
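The deterministic-ID approach can be sketched as follows: derive the message ID purely from identity fields, so a retried or re-consumed message hashes to the same ID and the second attempt is suppressed. The in-memory set is an illustrative stand-in; a shared store with a TTL (for example, a Redis set-if-absent with expiry) is what multiple consumers would actually use:

```python
import hashlib

def message_id(campaign_id: str, user_id: str, channel: str) -> str:
    """Stable ID derived from the message's identity, so replays of
    the same (campaign, user, channel) always collide."""
    key = f"{campaign_id}:{user_id}:{channel}"
    return hashlib.sha256(key.encode()).hexdigest()

class Deduplicator:
    """Consumer-side dedup sketch: deliver only the first occurrence
    of each deterministic message ID."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def should_deliver(self, msg_id: str) -> bool:
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        return True
```

The key property is that dedup requires no coordination at produce time: any consumer that sees the message can recompute the same ID.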
Confirm what qualifies as "critical" versus "promotional" and the latency target for each. Ask whether the system is multi-tenant, whether different tenants need isolated quotas, and whether multi-region deployment is required. Clarify retry policy: how many attempts, over what time window, and whether retries should fall back to alternative channels. Establish whether delivery receipts and engagement tracking are needed or if fire-and-forget is acceptable.
Sketch an ingestion API that accepts notification requests and writes them to a partitioned message log. Use separate Kafka topics for critical and promotional streams. Deploy consumer groups for each stream that handle audience expansion, preference lookups, and provider formatting. Place a rate-limiting layer in front of provider gateways, backed by a shared Redis cache. Include a retry service that requeues failed messages with exponential backoff and checks expiration before each retry. Store notification metadata and delivery status in a scalable database for audit and analytics.
Trace a promotional campaign through the system. The API accepts the campaign, assigns an expiration timestamp, and writes a job record to the promotional topic. Expansion workers consume the job, query user segments in batches, and produce individual user-message pairs on a downstream topic. Formatting workers apply templates, resolve user preferences, and forward messages to provider gateways. Critical alerts skip the expansion phase and flow directly from the ingestion API through the high-priority topic to formatting workers and out to providers. Partition both topics by user ID to maintain per-user ordering. Show that the two pipelines share no worker pools, so promotional load cannot affect critical delivery.
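The partition-by-user-ID step above is just consistent hashing of the message key, which is what Kafka's default key-based partitioning does; a simplified sketch of the assignment (the hash choice here is illustrative, not Kafka's exact algorithm):

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Assign every message keyed by this user to the same partition,
    so a single consumer sees that user's messages in produce order."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because each priority class has its own topic, this yields per-user ordering within a class, not across classes, which matches the requirement stated earlier.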
Cover reliability through Kafka replication and consumer offset management, with dead-letter queues for exhausted retries. Discuss idempotency using the deterministic message ID pattern. Explain provider rate limiting with atomic token-bucket decrements in Redis. Address expiration enforcement at the ingestion, expansion, formatting, and dispatch stages. Mention observability: dashboards for queue depth, per-priority P99 latency, provider success rates, and expiration drop counts. Optionally discuss multi-region deployment with regional Kafka clusters and provider gateways to reduce cross-region latency.
Candidates report that Navan interviewers focus on both the one-to-one notification path and the campaign fan-out path in the same session. The key differentiator is demonstrating clear traffic isolation so that critical alerts never experience latency spikes during large promotional sends. Be prepared to discuss specific queue partitioning strategies and how you would handle a provider going down mid-campaign without losing or duplicating messages.