Design a Rate Limiter
System Design | Must
Problem Statement
Design a scalable notification delivery platform that sends messages across multiple channels -- push notifications, SMS, email, and webhooks -- to millions of users worldwide. The system must handle billions of notifications daily, support delivery prioritization, ensure idempotency, and provide tracking for delivery status and user engagement metrics.
Consider the scale of platforms like Stripe, Twilio, or Amazon SNS that route billions of time-sensitive events. The system must gracefully handle third-party channel failures, respect user preferences and quiet hours, deduplicate redundant messages, and provide near real-time delivery status updates to sender applications. Interviewers expect you to reason about message queuing, delivery guarantees, fan-out patterns, channel-specific retry logic, and the observability needed to diagnose delivery failures across a distributed fleet.
Key Requirements
Functional
- Multi-channel delivery -- Route notifications to push, SMS, email, and webhook endpoints based on user preferences and message type
- Delivery tracking -- Provide real-time status updates (sent, delivered, failed, opened, clicked) and aggregate engagement analytics for senders
- User preference management -- Honor opt-out settings, channel preferences, quiet hours, and notification frequency caps per user
- Template and personalization -- Support message templates with variable substitution, localization, and dynamic content rendering
- Priority and scheduling -- Allow urgent messages to bypass queues while supporting scheduled delivery and batching for non-critical notifications
Non-Functional
- Scalability -- Handle 10 billion notifications per day with peaks of 500,000 messages per second during high-traffic events
- Reliability -- Achieve 99.95% delivery success rate with automatic retries, circuit breaking for failing channels, and fallback channel routing
- Latency -- Hand off high-priority notifications to the delivery channel within 2 seconds of API submission at P95, and within 5 seconds at P99
- Consistency -- Guarantee at-least-once delivery semantics with idempotency keys to prevent duplicate sends; eventually consistent delivery status tracking
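The scalability numbers above imply a sustained rate that is worth working out before choosing components. The following is a back-of-envelope sketch only. The daily volume and peak rate come from the requirements, while the payload size and per-worker throughput are hypothetical values chosen purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Taken from the non-functional requirements.
daily_notifications = 10_000_000_000
peak_per_second = 500_000

average_per_second = daily_notifications / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second

# Hypothetical sizing inputs, NOT from the problem statement.
assumed_payload_bytes = 1_000
assumed_worker_throughput = 2_000  # notifications per second per worker

peak_ingress_mb_per_s = peak_per_second * assumed_payload_bytes / 1_000_000
# Ceiling division: number of workers needed to absorb the peak.
workers_at_peak = -(-peak_per_second // assumed_worker_throughput)

print(f"average rate:    {average_per_second:,.0f} per second")
print(f"peak / average:  {peak_to_average:.2f}x")
print(f"peak ingress:    {peak_ingress_mb_per_s:,.0f} MB/s (assumed 1 KB payload)")
print(f"workers at peak: {workers_at_peak} (assumed 2,000/s per worker)")
```

The average works out to roughly 116,000 notifications per second, so the stated peak is a little over four times the average. That ratio is one way to justify queue-based buffering rather than provisioning every downstream stage for peak load.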
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Message Queuing and Fan-Out Architecture
Notifications must be distributed across workers efficiently while preventing head-of-line blocking when specific channels slow down or fail. Interviewers want to see how you partition work, isolate failures, and maintain ordering guarantees where needed.
Hints to consider:
- Use dedicated queues per channel type to isolate slow SMS providers from fast push notifications
- Implement priority queues with separate consumer pools to ensure urgent alerts bypass bulk marketing messages
- Consider partitioning by user ID or tenant ID to maintain per-user ordering while enabling horizontal scaling
- Retry transient failures with exponential backoff, and move messages that exhaust their retries to a dead letter queue that supports manual replay
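Two of the hints above, per-user partitioning and backoff before dead-lettering, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed design: `partition_for`, `backoff_delay`, `next_step`, and `MAX_ATTEMPTS` are hypothetical names, the backoff uses the common "full jitter" variant, and the attempt limit of 5 is arbitrary.

```python
import hashlib
import random

MAX_ATTEMPTS = 5  # hypothetical retry budget before dead-lettering


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a stable partition so that user's messages stay in order.

    A cryptographic digest is used instead of the built-in hash(), which is
    randomized per process for strings and so is not stable across workers.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random.random) -> float:
    """Full-jitter exponential backoff.

    Returns a delay drawn uniformly between 0 and min(cap, base * 2**attempt).
    The rng parameter exists only so the function can be tested deterministically.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def next_step(attempt: int) -> str:
    """Decide whether a failed message is retried or sent to the dead letter queue."""
    return "retry" if attempt < MAX_ATTEMPTS else "dead_letter"
```

Partitioning by user keeps ordering local to one consumer while letting partitions scale out. A user or tenant with unusually high volume can still create a hot partition, which is a trade-off worth raising in the interview.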
2. Idempotency and Exactly-Once Processing
Sender applications may retry API calls after timeouts, producing duplicate notification requests. Sending the same alert twice damages user trust and wastes money on SMS or third-party API calls. You must prevent duplicates across retries and crashes.
Hints to consider:
- Accept client-provided idempotency keys and store them with TTL in Redis or DynamoDB to detect duplicates within a time window
- Make notification IDs deterministic based on content hash plus recipient to catch duplicates from different sources
- Use database constraints or conditional writes to ensure a notification record is created exactly once per idempotency key
- Handle the race where two workers process the same message by checking delivery status before sending to external channels
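The idempotency-key and deterministic-ID hints can be illustrated with an in-memory stand-in for a TTL key store. This is a sketch under assumptions: `IdempotencyStore`, `claim`, and `notification_id` are hypothetical names, and the dictionary is single-process and not thread-safe. In a real deployment the same "claim once" behavior would come from an atomic conditional write, for example Redis `SET` with the `NX` and `EX` options.

```python
import hashlib
import time


class IdempotencyStore:
    """In-memory stand-in for a key store with per-key expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._expiry = {}            # key -> time at which the claim expires

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to claim the key within the TTL."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False             # duplicate inside the deduplication window
        self._expiry[key] = now + self._ttl
        return True


def notification_id(recipient: str, channel: str, content: str) -> str:
    """Derive a deterministic ID from recipient, channel, and content.

    The unit separator character keeps field boundaries unambiguous, so
    ("ab", "c") and ("a", "bc") do not hash to the same value.
    """
    payload = "\x1f".join((recipient, channel, content)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

A worker would call `claim` before handing a message to an external channel and skip the send when it returns False. The TTL bounds storage but also bounds protection: a retry that arrives after the window has expired is treated as a new request.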
3. Third-Party Channel Integration and Failure Handling
External SMS, email, and push providers have varying SLAs, rate limits, and failure modes. A single provider outage should not block all deliveries. Interviewers want to see resilience patterns and observability into vendor health.
Hints to consider:
- Implement circuit breakers per vendor to fail fast and route traffic to backup providers when error rates spike
- Use per-vendor rate limiters with token buckets to respect API quotas and avoid 429 errors that trigger penalties
- Design a webhook retry system with exponential backoff for customer endpoints, capping retries to avoid infinite loops
- Track per-channel success rates and latency metrics to automatically deprioritize slow or failing providers
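The per-vendor token bucket and circuit breaker mentioned above might look like the following. This is a simplified sketch, not a production implementation: the class and method names are hypothetical, neither class is thread-safe, and the breaker trips on a count of consecutive failures rather than the error-rate threshold the hint describes.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class CircuitBreaker:
    """Opens after N consecutive failures, then allows trial calls after a cool-down."""

    def __init__(self, failure_threshold: int, reset_after: float,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let trial requests through.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None       # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()   # open, or re-open after a failed trial
```

A sender would keep one bucket and one breaker per vendor, check `allow()` and `try_acquire()` before each call, and route to a backup provider when either check fails.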
4. User Preferences and Compliance
Users must be able to opt out of categories, mute notifications during sleep hours, and have their data deleted. Regulations such as GDPR and the TCPA impose penalties for non-compliance. The system must enforce rules consistently at high throughput.
Hints to consider:
- Cache user preferences in memory or Redis with short TTLs to avoid database lookups on every notification check
- Evaluate quiet hours and frequency caps before enqueuing to reduce wasted processing on messages that will be dropped or deferred
- Store opt-out lists in low-latency storage and propagate updates within seconds to all notification workers
- Design a preference service API that workers query in batch to amortize lookup costs across large fan-out operations
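A pre-enqueue preference check along the lines of the hints above could be sketched as follows. The names `Preferences`, `in_quiet_hours`, and `decide` are hypothetical, and the policy choices are assumptions made for illustration: quiet-hours messages are deferred rather than dropped, messages over the daily cap are dropped, and the caller supplies the time already converted to the user's local time zone.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Preferences:
    opted_out_channels: frozenset
    quiet_start: time    # start of quiet hours, user's local time
    quiet_end: time      # end of quiet hours (exclusive), user's local time
    max_per_day: int     # frequency cap


def in_quiet_hours(local_time: time, start: time, end: time) -> bool:
    """Return True if local_time falls inside the quiet window [start, end)."""
    if start == end:
        return False                         # treated as "no quiet hours"
    if start < end:
        return start <= local_time < end     # window within a single day
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local_time >= start or local_time < end


def decide(prefs: Preferences, channel: str, local_time: time,
           sent_today: int) -> str:
    """Return 'send', 'defer', or 'drop' for one notification."""
    if channel in prefs.opted_out_channels:
        return "drop"
    if in_quiet_hours(local_time, prefs.quiet_start, prefs.quiet_end):
        return "defer"
    if sent_today >= prefs.max_per_day:
        return "drop"
    return "send"
```

The midnight wrap-around case is a common source of bugs, since a window such as 22:00 to 07:00 has a start that is later than its end.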