Design Recommender System

Problem Statement

Design a scalable notification delivery system that sends timely alerts to millions of users across multiple channels including push notifications, SMS, email, and in-app messages. The system must handle event-triggered notifications (like a new message or follower), scheduled notifications (daily digests, reminders), and promotional campaigns while respecting user preferences and delivery windows.

Your architecture should support bursts of traffic when viral events occur, guarantee delivery for critical notifications, prevent duplicate sends, and provide real-time tracking of delivery status. The system needs to handle 100 million daily active users with peaks of 50,000 notifications per second during major events, maintain delivery latency under 5 seconds for priority notifications, and support retry logic across different channel providers with varying reliability characteristics.

Key Requirements

Functional

Multi-channel delivery -- users receive notifications through their preferred channels (push, SMS, email, in-app) based on notification type and urgency
User preference management -- users can configure notification frequency, quiet hours, channel preferences per category, and opt-out options while maintaining compliance
Template and personalization -- notifications use dynamic templates with user-specific content, localization, and A/B testing variants for messaging optimization
Delivery tracking and analytics -- system tracks sent, delivered, opened, and clicked events with real-time dashboards showing delivery rates and user engagement
Priority and rate limiting -- critical notifications bypass rate limits while marketing messages respect per-user quotas and global throughput constraints
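The priority and rate-limiting requirement can be illustrated with a token bucket. This is a minimal sketch, not a prescribed design: the `TokenBucket` class, the `admit` helper, and the `"critical"` priority label are hypothetical names introduced here, and a production system would likely keep the bucket state in a shared store rather than in process memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-user quota: refills at rate_per_s, holds at most `capacity` tokens."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self) -> bool:
        # Refill based on elapsed time, then spend one token if available.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def admit(bucket: TokenBucket, priority: str) -> bool:
    # Assumption: critical notifications bypass the quota entirely.
    if priority == "critical":
        return True
    return bucket.try_acquire()
```

Injecting the clock keeps the limiter testable without real sleeps.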

Non-Functional

Scalability -- handle 100M daily active users, 50K notifications/sec peak throughput, and support 10x growth without architectural changes
Reliability -- guarantee 99.9% delivery for critical notifications, implement retry logic with exponential backoff, and maintain idempotency across retries
Latency -- deliver priority notifications within 5 seconds end-to-end, batch non-urgent notifications within 15 minutes, and support real-time status updates
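Consistency -- provide effectively-once delivery per channel (at-least-once delivery combined with idempotent deduplication, since true exactly-once delivery cannot be guaranteed across external providers), maintain eventual consistency for preference updates, and prevent notification storms through deduplication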
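The retry requirement above can be sketched as exponential backoff with full jitter. The names `TransientError`, `backoff_delays`, and `send_with_retry`, and the default base and cap values, are assumptions chosen for illustration. The `send` callable is expected to be idempotent so that a retried attempt cannot produce a second notification.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def backoff_delays(max_attempts: int = 5, base_s: float = 0.5, cap_s: float = 30.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    # Full jitter: pick a random delay in [0, min(cap, base * 2^attempt)).
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


def send_with_retry(send: Callable[[], T], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Callable[[], float] = random.random) -> T:
    delays = backoff_delays(max_attempts, rng=rng)
    for attempt, delay in enumerate(delays):
        try:
            return send()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; caller can route to a dead letter queue
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```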

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Event Ingestion and Routing Architecture

How you handle massive event volumes from multiple sources while routing to appropriate channels based on user preferences and notification priority levels.

Hints to consider:

Use a message queue as the primary ingestion point to decouple producers from the notification pipeline and provide backpressure protection
Implement a routing layer that enriches events with user preferences, resolves channel selection, and applies filtering rules before fan-out
Consider partitioning strategies that maintain ordering guarantees for user-level notifications while enabling parallel processing
Design for idempotency by generating deterministic notification IDs and checking for duplicates before processing
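The partitioning and idempotency hints above can be sketched as follows. This is an illustrative example only: the function names, the choice of SHA-256, the field separator, and the partition count of 64 are assumptions, not part of the original problem.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count for the ingestion queue


def notification_id(user_id: str, event_type: str, source_event_id: str) -> str:
    """Deterministic ID: the same source event always maps to the same ID,
    so a replayed or retried event can be detected before processing."""
    raw = f"{user_id}|{event_type}|{source_event_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the user ID: all of one user's events land on one
    partition (preserving per-user order) while users spread across partitions."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings and would route the same user to different partitions on different workers.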

2. Multi-Channel Delivery Management

Your strategy for abstracting different delivery providers (FCM, APNS, Twilio, SendGrid) while handling their unique constraints, rate limits, and failure modes.

Hints to consider:

Create a unified delivery abstraction layer with retry policies, circuit breakers, and fallback providers for each channel type
Implement channel-specific rate limiting and batching (email can batch hundreds, SMS needs careful rate control, push supports device-level batching)
Handle provider-specific delivery receipts and map them to a unified tracking model for consistent analytics
Design for graceful degradation when a provider is down by queuing messages and switching to backup providers
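One way to picture the delivery abstraction is a list of providers per channel, each guarded by a circuit breaker, tried in priority order. The sketch below is hypothetical throughout: `Provider`, `CircuitBreaker`, `ProviderError`, and `deliver` are invented names, and the `send` callables stand in for real provider clients, which are not modeled here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    reset_after_s: float = 30.0
    clock: Callable[[], float] = time.monotonic
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let one trial request through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


@dataclass
class Provider:
    name: str
    send: Callable[[str, str], None]  # (recipient, body); raises ProviderError
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)


def deliver(providers: List[Provider], recipient: str, body: str) -> str:
    """Try providers in priority order; return the name of the one that sent."""
    for provider in providers:
        if not provider.breaker.allow():
            continue  # circuit open: skip without calling the provider
        try:
            provider.send(recipient, body)
        except ProviderError:
            provider.breaker.record_failure()
            continue
        provider.breaker.record_success()
        return provider.name
    raise ProviderError("all providers unavailable; caller should re-queue")
```

When every provider is unavailable, the sketch raises so the caller can leave the message queued, which matches the graceful-degradation hint.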

3. Deduplication and Exactly-Once Semantics

How you prevent duplicate notifications when retries occur, multiple events trigger the same notification, or distributed systems experience partial failures.

Hints to consider:

Use a distributed cache (Redis) with TTL-based deduplication keys derived from user ID, notification type, and content hash
Implement a delivery log in a fast key-value store that records each notification's unique ID before delivery is attempted, so a retry can detect that a send is already in progress or complete
Design windowing logic to collapse rapid-fire similar events (like 10 likes in 2 minutes) into a single aggregated notification
Consider using database transactions or distributed locks for critical notifications that must never duplicate
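The TTL-based deduplication and windowing hints can be sketched with an in-memory stand-in for the cache. `TTLDedupStore`, `dedup_key`, `should_send`, and `collapse` are hypothetical names. In Redis, the `set_if_absent` step would typically map to a `SET` with the `NX` and `EX` options, which performs the check and the write atomically. The 120-second tumbling window is an assumed value.

```python
import hashlib
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


class TTLDedupStore:
    """In-memory stand-in for a cache that supports set-if-absent with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._expiry: Dict[str, float] = {}
        self._clock = clock

    def set_if_absent(self, key: str, ttl_s: float) -> bool:
        now = self._clock()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False  # key still live: this is a duplicate
        self._expiry[key] = now + ttl_s
        return True


def dedup_key(user_id: str, notif_type: str, content: str) -> str:
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"dedup:{user_id}:{notif_type}:{content_hash}"


def should_send(store: TTLDedupStore, user_id: str, notif_type: str,
                content: str, ttl_s: float = 3600.0) -> bool:
    return store.set_if_absent(dedup_key(user_id, notif_type, content), ttl_s)


Event = Tuple[float, str, str, str]  # (timestamp_s, user_id, notif_type, actor)


def collapse(events: Iterable[Event], window_s: float = 120.0) -> List[Tuple[str, str, int]]:
    """Group similar events per user into tumbling windows; emit one
    (user_id, notif_type, count) per window instead of one per event."""
    buckets = defaultdict(list)
    for ts, user_id, notif_type, actor in events:
        buckets[(user_id, notif_type, int(ts // window_s))].append(actor)
    return [(user_id, notif_type, len(actors))
            for (user_id, notif_type, _), actors in sorted(buckets.items())]
```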

4. User Preference Resolution and Enforcement

Your approach to storing, caching, and applying complex user preferences including quiet hours, frequency caps, channel priorities, and category-level opt-outs.

Hints to consider:

Cache user preferences in Redis with hierarchical fallback (user-specific → category default → global default) to minimize database lookups
Implement timezone-aware quiet hour enforcement at the routing layer before messages enter channel-specific queues
Design a flexible preference schema that supports per-category rules, channel priority ordering, and frequency capping windows
Handle preference updates by invalidating caches, and let in-flight notifications complete with the stale preferences rather than adding coordination between pipeline stages
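The hierarchical fallback and quiet-hour hints can be sketched as a layered merge plus a timezone-aware window check. The preference field names (`channels`, `quiet_start`, `quiet_end`), the `(user_id, category)` key shape, and the default values are assumptions for illustration. Fixed-offset `tzinfo` objects are used here to keep the example self-contained; real deployments would more likely pass a `zoneinfo.ZoneInfo` so daylight saving time is handled.

```python
from datetime import datetime, time as dtime, tzinfo
from typing import Any, Dict, Tuple

# Assumed global default: in-app only, no quiet hours.
GLOBAL_DEFAULT: Dict[str, Any] = {
    "channels": ["in_app"],
    "quiet_start": None,
    "quiet_end": None,
}


def resolve(user_prefs: Dict[Tuple[str, str], Dict[str, Any]],
            category_defaults: Dict[str, Dict[str, Any]],
            user_id: str, category: str) -> Dict[str, Any]:
    """Layer global default, then category default, then user-specific
    settings, so the most specific value for each field wins."""
    merged = dict(GLOBAL_DEFAULT)
    merged.update(category_defaults.get(category, {}))
    merged.update(user_prefs.get((user_id, category), {}))
    return merged


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo,
                   start: dtime, end: dtime) -> bool:
    """True if the user's local time falls in [start, end).
    Handles windows that wrap past midnight, such as 21:00 to 07:00."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end
```

In this sketch a router would call `resolve` first and, if quiet hours are set and `in_quiet_hours` returns true, defer non-critical notifications instead of enqueuing them on a channel queue.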

5. Monitoring, Observability, and Delivery Guarantees

How you track notification lifecycle from creation through delivery, measure system health, debug failures, and prove SLA compliance.

Hints to consider:

Emit structured events at each pipeline stage (received, routed, sent, delivered, opened) and aggregate them in a real-time analytics system
Implement dead letter queues for failed notifications with separate retry workers and alerting thresholds
Track delivery rates, latency percentiles, and error rates per channel and provider with dashboards that highlight anomalies
Design a reconciliation system that compares sent counts against provider delivery receipts to detect silent failures
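The reconciliation hint can be sketched as a set comparison between what the system recorded as sent and what each provider acknowledged. The `reconcile` function, its input shape (notification IDs grouped by provider name), and the 1% alert threshold are assumptions made for this example.

```python
from typing import Any, Dict, Set


def reconcile(sent_ids_by_provider: Dict[str, Set[str]],
              receipt_ids_by_provider: Dict[str, Set[str]],
              max_missing_ratio: float = 0.01) -> Dict[str, Dict[str, Any]]:
    """Compare sent notification IDs against provider delivery receipts and
    flag providers whose share of unacknowledged sends exceeds the threshold."""
    report: Dict[str, Dict[str, Any]] = {}
    for provider, sent in sent_ids_by_provider.items():
        receipts = receipt_ids_by_provider.get(provider, set())
        missing = sent - receipts  # sent but never acknowledged
        ratio = len(missing) / len(sent) if sent else 0.0
        report[provider] = {
            "sent": len(sent),
            "missing": sorted(missing),
            "missing_ratio": ratio,
            "alert": ratio > max_missing_ratio,
        }
    return report
```

Missing IDs are candidates for re-delivery or for a dead letter queue, while the `alert` flag would feed the anomaly dashboards described above.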
