Design Recommender System
Problem Statement
Design a scalable notification delivery system that sends timely alerts to millions of users across multiple channels including push notifications, SMS, email, and in-app messages. The system must handle event-triggered notifications (like a new message or follower), scheduled notifications (daily digests, reminders), and promotional campaigns while respecting user preferences and delivery windows.
Your architecture should support bursts of traffic when viral events occur, guarantee delivery for critical notifications, prevent duplicate sends, and provide real-time tracking of delivery status. The system needs to handle 100 million daily active users with peaks of 50,000 notifications per second during major events, maintain delivery latency under 5 seconds for priority notifications, and support retry logic across different channel providers with varying reliability characteristics.
Key Requirements
Functional
- Multi-channel delivery -- users receive notifications through their preferred channels (push, SMS, email, in-app) based on notification type and urgency
- User preference management -- users can configure notification frequency, quiet hours, channel preferences per category, and opt-out options while maintaining compliance
- Template and personalization -- notifications use dynamic templates with user-specific content, localization, and A/B testing variants for messaging optimization
- Delivery tracking and analytics -- system tracks sent, delivered, opened, and clicked events with real-time dashboards showing delivery rates and user engagement
- Priority and rate limiting -- critical notifications bypass rate limits while marketing messages respect per-user quotas and global throughput constraints
Non-Functional
- Scalability -- handle 100M daily active users, 50K notifications/sec peak throughput, and support 10x growth without architectural changes
- Reliability -- guarantee 99.9% delivery for critical notifications, implement retry logic with exponential backoff, and maintain idempotency across retries
- Latency -- deliver priority notifications within 5 seconds end-to-end, batch non-urgent notifications within 15 minutes, and support real-time status updates
- Consistency -- provide effectively exactly-once delivery per channel (at-least-once delivery combined with idempotent sends), maintain eventual consistency for preference updates, and prevent notification storms through deduplication
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Event Ingestion and Routing Architecture
How you handle massive event volumes from multiple sources while routing to appropriate channels based on user preferences and notification priority levels.
Hints to consider:
- Use a message queue as the primary ingestion point to decouple producers from the notification pipeline and provide backpressure protection
- Implement a routing layer that enriches events with user preferences, resolves channel selection, and applies filtering rules before fan-out
- Consider partitioning strategies that maintain ordering guarantees for user-level notifications while enabling parallel processing
- Design for idempotency by generating deterministic notification IDs and checking for duplicates before processing
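The last two hints can be sketched in a few lines. This is a minimal illustration, not a production design: the field names (`user_id`, `event_type`, `source_event_id`), the partition count, and the `Ingestor` class are assumptions made for the example, and an in-memory set stands in for whatever shared store would hold seen IDs.

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical partition count


def notification_id(user_id: str, event_type: str, source_event_id: str) -> str:
    """Derive the same ID every time the same source event is replayed."""
    raw = f"{user_id}|{event_type}|{source_event_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user to a stable partition so that user's events stay ordered."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class Ingestor:
    """Accepts each event once; the set is a stand-in for a shared store."""

    def __init__(self):
        self.seen = set()
        self.partitions = {}

    def ingest(self, user_id: str, event_type: str, source_event_id: str) -> bool:
        nid = notification_id(user_id, event_type, source_event_id)
        if nid in self.seen:
            return False  # producer re-sent the same event; drop it
        self.seen.add(nid)
        self.partitions.setdefault(partition_for(user_id), []).append(nid)
        return True
```

Hashing the user ID rather than using Python's built-in `hash` keeps the partition assignment stable across processes, which matters when several workers must agree on where a user's events go.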
2. Multi-Channel Delivery Management
Your strategy for abstracting different delivery providers (FCM, APNs, Twilio, SendGrid) while handling their unique constraints, rate limits, and failure modes.
Hints to consider:
- Create a unified delivery abstraction layer with retry policies, circuit breakers, and fallback providers for each channel type
- Implement channel-specific rate limiting and batching (email can batch hundreds, SMS needs careful rate control, push supports device-level batching)
- Handle provider-specific delivery receipts and map them to a unified tracking model for consistent analytics
- Design for graceful degradation when a provider is down by queuing messages and switching to backup providers
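One way to picture the retry-then-fallback behavior described in these hints is the sketch below. It assumes each provider is wrapped as a plain callable that raises on failure; the `Provider` type, `send_with_fallback`, and the default delays are hypothetical names and values chosen for illustration, and a circuit breaker is left out to keep the example short.

```python
import random
import time
from typing import Callable, List, Sequence, Tuple

Provider = Callable[[str, str], None]  # (recipient, body); raises on failure


class DeliveryError(Exception):
    """Raised when every provider has exhausted its attempts."""


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Exponential schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


def send_with_fallback(
    providers: Sequence[Tuple[str, Provider]],
    recipient: str,
    body: str,
    attempts_per_provider: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> str:
    """Try each provider in order; return the name of the one that succeeded."""
    for name, send in providers:
        for delay in backoff_delays(attempts_per_provider):
            try:
                send(recipient, body)
                return name
            except Exception:
                # Scale the delay by a random factor so that many workers
                # do not retry against the same provider in lockstep.
                sleep(delay * jitter())
    raise DeliveryError(f"all providers failed for {recipient}")
```

Passing `sleep` and `jitter` as parameters is only a testing convenience here: it lets the retry schedule be checked without waiting in real time.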
3. Deduplication and Exactly-Once Semantics
How you prevent duplicate notifications when retries occur, multiple events trigger the same notification, or distributed systems experience partial failures.
Hints to consider:
- Use a distributed cache (Redis) with TTL-based deduplication keys derived from user ID, notification type, and content hash
- Implement a delivery log in a fast key-value store that records each notification's unique ID before delivery is attempted, so a retry can detect an earlier attempt
- Design windowing logic to collapse rapid-fire similar events (like 10 likes in 2 minutes) into a single aggregated notification
- Consider using database transactions or distributed locks for critical notifications that must never be duplicated
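The TTL-key and windowing hints above might look roughly like this. The sketch uses an in-memory dictionary as a stand-in for the distributed cache, so the check-then-set in `first_time` is not atomic across processes; with Redis, the usual equivalent is a single `SET` with the `NX` and `EX` options. The class name, key layout, and window rule are assumptions made for the example.

```python
import hashlib
import time
from typing import Callable, List, Tuple


class TtlDeduper:
    """In-memory stand-in for a shared cache with expiring keys."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.expires_at = {}

    @staticmethod
    def key(user_id: str, notification_type: str, content: str) -> str:
        content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
        return f"dedup:{user_id}:{notification_type}:{content_hash}"

    def first_time(self, user_id: str, notification_type: str, content: str) -> bool:
        """Return True only for the first call within the TTL window."""
        now = self.clock()
        k = self.key(user_id, notification_type, content)
        expiry = self.expires_at.get(k)
        if expiry is not None and expiry > now:
            return False
        self.expires_at[k] = now + self.ttl
        return True


def collapse(events: List[Tuple[float, str]], window_seconds: float) -> List[List[str]]:
    """Group (timestamp, actor) events; each window starts at its first event."""
    groups: List[List[str]] = []
    window_start = None
    for ts, actor in sorted(events):
        if window_start is None or ts - window_start >= window_seconds:
            groups.append([])
            window_start = ts
        groups[-1].append(actor)
    return groups
```

Each group returned by `collapse` would become one aggregated notification, for example "a, b, and c liked your post" instead of three separate alerts.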
4. User Preference Resolution and Enforcement
Your approach to storing, caching, and applying complex user preferences including quiet hours, frequency caps, channel priorities, and category-level opt-outs.
Hints to consider:
- Cache user preferences in Redis with hierarchical fallback (user-specific → category default → global default) to minimize database lookups
- Implement timezone-aware quiet hour enforcement at the routing layer before messages enter channel-specific queues
- Design a flexible preference schema that supports per-category rules, channel priority ordering, and frequency capping windows
- Handle preference updates by invalidating caches and letting in-flight notifications complete under the previous preferences, which avoids having to re-evaluate messages that are already queued
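The hierarchical fallback and the timezone-aware quiet-hours check can be illustrated as follows. The preference layout (`channels`, `quiet_hours`), the default tables, and the function names are assumptions for this sketch, and plain dictionaries stand in for the cache. The user's timezone is passed as a `tzinfo` object; in practice it would typically come from `zoneinfo.ZoneInfo` using a stored zone name.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Hypothetical defaults; a real system would load these from configuration.
GLOBAL_DEFAULTS = {"channels": ["push", "email"], "quiet_hours": None}
CATEGORY_DEFAULTS = {"marketing": {"channels": ["email"]}}


def resolve(user_prefs: dict, category: str, field: str):
    """Look up user-specific, then category default, then global default."""
    user_value = user_prefs.get(category, {}).get(field)
    if user_value is not None:
        return user_value
    category_value = CATEGORY_DEFAULTS.get(category, {}).get(field)
    if category_value is not None:
        return category_value
    return GLOBAL_DEFAULTS[field]


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo, start: time, end: time) -> bool:
    """True if the user's local wall-clock time falls in [start, end).

    `now_utc` must be timezone-aware.
    """
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window wraps past midnight
```

A routing layer built this way would call `in_quiet_hours` before enqueueing to a channel and either defer or drop the notification depending on its priority.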
5. Monitoring, Observability, and Delivery Guarantees
How you track notification lifecycle from creation through delivery, measure system health, debug failures, and prove SLA compliance.
Hints to consider:
- Emit structured events at each pipeline stage (received, routed, sent, delivered, opened) and aggregate them in a real-time analytics system
- Implement dead letter queues for failed notifications with separate retry workers and alerting thresholds
- Track delivery rates, latency percentiles, and error rates per channel and provider with dashboards that highlight anomalies
- Design a reconciliation system that compares sent counts against provider delivery receipts to detect silent failures
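The reconciliation idea in the last hint could be sketched as a periodic comparison between what was handed to each provider and what each provider confirmed. The input shape (pairs of provider name and notification ID), the `reconcile` function, and the 1% alert threshold are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def reconcile(
    sent: Iterable[Tuple[str, str]],      # (provider, notification_id) handed off
    receipts: Iterable[Tuple[str, str]],  # (provider, notification_id) confirmed
    max_missing_ratio: float = 0.01,      # hypothetical alert threshold
) -> Dict[str, dict]:
    """Report, per provider, which sent notifications have no receipt."""
    sent_by_provider = {}
    for provider, nid in sent:
        sent_by_provider.setdefault(provider, set()).add(nid)

    confirmed = {}
    for provider, nid in receipts:
        confirmed.setdefault(provider, set()).add(nid)

    report = {}
    for provider, ids in sent_by_provider.items():
        missing = ids - confirmed.get(provider, set())
        report[provider] = {
            "sent": len(ids),
            "missing": sorted(missing),
            # Flag the provider when too many sends never got a receipt.
            "alert": len(missing) / len(ids) > max_missing_ratio,
        }
    return report
```

Because receipts can arrive late, a job like this would normally run over a window that closed some time ago, so that slow but successful deliveries are not reported as silent failures.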