Design a scalable notification delivery platform that sends messages across multiple channels -- push notifications, SMS, email, and webhooks -- to millions of users worldwide. The system must handle billions of notifications daily, support delivery prioritization, ensure idempotency, and provide tracking for delivery status and user engagement metrics.
Consider the scale of platforms like Stripe, Twilio, or Amazon SNS that route billions of time-sensitive events. The system must gracefully handle third-party channel failures, respect user preferences and quiet hours, deduplicate redundant messages, and provide near real-time delivery status updates to sender applications. Interviewers expect you to reason about message queuing, delivery guarantees, fan-out patterns, channel-specific retry logic, and the observability needed to diagnose delivery failures across a distributed fleet.
Multi-channel delivery -- Route notifications to push, SMS, email, and webhook endpoints based on user preferences and message type
Delivery tracking -- Provide real-time status updates (sent, delivered, failed, opened, clicked) and aggregate engagement analytics for senders
User preference management -- Honor opt-out settings, channel preferences, quiet hours, and notification frequency caps per user
Template and personalization -- Support message templates with variable substitution, localization, and dynamic content rendering
Priority and scheduling -- Allow urgent messages to bypass queues while supporting scheduled delivery and batching for non-critical notifications
Scalability -- Handle 10 billion notifications per day with peaks of 500,000 messages per second during high-traffic events
Reliability -- Achieve 99.95% delivery success rate with automatic retries, circuit breaking for failing channels, and fallback channel routing
Latency -- Hand off high-priority notifications to the delivery channel within 2 seconds of API submission at P95 and within 5 seconds at P99
Consistency -- Guarantee at-least-once delivery semantics with idempotency keys to prevent duplicate sends; eventually consistent delivery status tracking
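A quick back-of-the-envelope check helps turn the scalability and reliability targets above into numbers you can size queues and workers against. The sketch below uses only the figures stated in the requirements; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_notifications = 10_000_000_000  # scalability requirement
peak_per_second = 500_000             # scalability requirement
allowed_failure_fraction = 0.0005     # 100% - 99.95% delivery success target

average_per_second = daily_notifications / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second
failure_budget = round(daily_notifications * allowed_failure_fraction)

print(f"average: {average_per_second:,.0f} msg/s")          # average: 115,741 msg/s
print(f"peak is {peak_to_average:.1f}x the average")        # peak is 4.3x the average
print(f"failure budget: {failure_budget:,} msgs per day")   # failure budget: 5,000,000 msgs per day
```

The peak is roughly four times the average rate, so capacity planned around the daily average alone would fall well short during high-traffic events.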
Based on real interview experiences, these are the areas interviewers probe most deeply:
Notifications must be distributed across workers efficiently while preventing head-of-line blocking when specific channels slow down or fail. Interviewers want to see how you partition work, isolate failures, and maintain ordering guarantees where needed.
Use dedicated queues per channel type to isolate slow SMS providers from fast push notifications
Implement priority queues with separate consumer pools to ensure urgent alerts bypass bulk marketing messages
Consider partitioning by user ID or tenant ID to maintain per-user ordering while enabling horizontal scaling
Retry transient failures with exponential backoff, and route messages that exhaust their retries to a dead letter queue that supports manual replay
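The queueing ideas above can be sketched in a few small functions: one queue per channel and priority, a stable partition derived from the user ID, and a capped exponential backoff that hands off to a dead letter queue. This is a minimal sketch, not a reference design; the queue naming scheme, the partition count of 64, and the retry limits are all assumptions chosen for illustration.

```python
import hashlib
from typing import Optional, Tuple

NUM_PARTITIONS = 64  # assumed partition count, for illustration only


def queue_name(channel: str, priority: str) -> str:
    """One queue per (channel, priority) so a slow channel cannot block others."""
    return f"notifications.{channel}.{priority}"


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user to a stable partition so that user's messages stay ordered.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized per process for strings.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def retry_delay_seconds(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff: base * 2^attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def next_step(attempt: int, max_attempts: int = 5) -> Tuple[str, Optional[float]]:
    """Decide whether a failed message is retried or sent to the dead letter queue."""
    if attempt >= max_attempts:
        return ("dead_letter", None)
    return ("retry", retry_delay_seconds(attempt))


print(queue_name("sms", "high"))  # notifications.sms.high
print(next_step(2))               # ('retry', 4.0)
print(next_step(5))               # ('dead_letter', None)
```

In practice you would usually add random jitter to the delay so that many failed messages do not retry at the same instant.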
Sender applications may retry API calls after timeouts, producing duplicate notification requests. Sending the same alert twice damages user trust and wastes money on SMS or third-party API calls. You must prevent duplicates across client retries and worker crashes, typically by having callers attach an idempotency key to each request.
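One way to reason about idempotency keys is as a "claim if absent" operation with an expiry: the first request with a given key wins, and later requests with the same key get the original notification ID back instead of triggering a second send. The sketch below is a single-process, in-memory illustration; the class name, method name, and 24-hour TTL are assumptions. A real deployment would need an atomic shared store, for example Redis `SET` with the `NX` and `EX` options or a unique constraint in a database.

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """In-memory stand-in for a shared key store (not thread-safe)."""

    def __init__(self, ttl_seconds: float = 86_400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, notification ID of the first request)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def claim(self, key: str, notification_id: str) -> Tuple[bool, str]:
        """Return (True, id) for the first request, (False, original_id) for duplicates."""
        now = self._clock()
        existing = self._entries.get(key)
        if existing is not None and existing[0] > now:
            return (False, existing[1])  # duplicate within the TTL window
        self._entries[key] = (now + self._ttl, notification_id)
        return (True, notification_id)


store = IdempotencyStore()
print(store.claim("order-123-shipped", "notif-1"))  # (True, 'notif-1')
print(store.claim("order-123-shipped", "notif-2"))  # (False, 'notif-1')
```

Returning the original notification ID lets the API respond to a retried request exactly as it did the first time, which is what makes the retry safe for the caller.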
External SMS, email, and push providers have varying SLAs, rate limits, and failure modes. A single provider outage should not block all deliveries. Interviewers want resilience patterns and observability into vendor health.
Implement circuit breakers per vendor to fail fast and route traffic to backup providers when error rates spike
Use per-vendor rate limiters with token buckets to stay within API quotas and avoid HTTP 429 (Too Many Requests) responses, which some providers penalize
Design a webhook retry system with exponential backoff for customer endpoints, capping retries to avoid infinite loops
Track per-channel success rates and latency metrics to automatically deprioritize slow or failing providers
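The vendor resilience patterns above can be combined into a simple provider selection routine: skip any provider whose circuit breaker is open or whose rate limit is exhausted, and fall through to the next one. The following is a minimal sketch under simplifying assumptions: the breaker trips on consecutive failures rather than a measured error rate, the class and function names are hypothetical, and nothing here is thread-safe.

```python
import time
from typing import Callable, Dict, List, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: fail fast until the cool-down has elapsed (then half-open).
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


class TokenBucket:
    """Refills at `rate_per_second` up to `capacity`; each send costs one token."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, n: int = 1) -> bool:
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False


def pick_provider(providers: List[str], breakers: Dict[str, CircuitBreaker],
                  buckets: Dict[str, TokenBucket]) -> Optional[str]:
    """Return the first provider, in preference order, that is healthy and under quota."""
    for name in providers:
        if breakers[name].allow() and buckets[name].try_acquire():
            return name
    return None  # nothing available: caller should requeue the message
```

A worker would call `pick_provider` before each send, then report the outcome with `record_success` or `record_failure` on that provider's breaker. Ordering the `providers` list by recent success rate and latency is one way to deprioritize a degraded vendor before its breaker trips.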
Users must be able to opt out of categories, mute notifications during sleep hours, and have their data deleted. Regulations such as GDPR and the TCPA penalize non-compliance. The system must enforce these rules consistently at high throughput.
Cache user preferences in memory or Redis with short TTLs to avoid database lookups on every notification check
Evaluate quiet hours and frequency caps before enqueuing to reduce wasted processing on messages that will be dropped
Store opt-out lists in low-latency storage and propagate updates within seconds to all notification workers
Design a preference service API that workers query in batch to amortize lookup costs across large fan-out operations
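The pre-enqueue check described above can be expressed as a small pure function that takes already-loaded preferences and returns a decision. The sketch below is illustrative only: the `Preferences` fields, the default quiet hours of 22:00 to 07:00, the daily cap of 10, and the choice to defer (rather than drop) messages during quiet hours are all assumptions. It also assumes the caller has already converted the current time into the user's local time zone.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Set


@dataclass
class Preferences:
    opted_out_categories: Set[str] = field(default_factory=set)
    quiet_start: time = time(22, 0)  # assumed default
    quiet_end: time = time(7, 0)     # assumed default
    daily_cap: int = 10              # assumed frequency cap


def in_quiet_hours(local_now: time, start: time, end: time) -> bool:
    """True if local_now falls in [start, end), including windows that cross midnight."""
    if start == end:
        return False  # treat an empty window as "no quiet hours"
    if start < end:
        return start <= local_now < end
    return local_now >= start or local_now < end  # e.g. 22:00 to 07:00


def decide(prefs: Preferences, category: str, local_now: time, sent_today: int) -> str:
    """Return 'drop', 'defer', or 'send' before the message is enqueued."""
    if category in prefs.opted_out_categories:
        return "drop"   # opt-outs are always honored first
    if sent_today >= prefs.daily_cap:
        return "drop"   # frequency cap reached
    if in_quiet_hours(local_now, prefs.quiet_start, prefs.quiet_end):
        return "defer"  # hold until quiet hours end
    return "send"


prefs = Preferences(opted_out_categories={"marketing"})
print(decide(prefs, "marketing", time(12, 0), 0))  # drop
print(decide(prefs, "security", time(23, 30), 0))  # defer
print(decide(prefs, "security", time(12, 0), 0))   # send
```

Keeping the decision free of I/O makes it cheap to run over a batch of users after a single bulk preference lookup, which fits the batch query approach described above.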