Design an e-commerce offer subscription system
Problem Statement
An e-commerce platform allows users to subscribe to upcoming promotional offers — flash sales, limited-time discounts, and product launches — and receive real-time notifications the moment those offers go live. Users browse a catalog of upcoming deals, subscribe to the ones they care about, and choose how they want to be notified: push notification, email, SMS, or in-app alert. When a merchant activates an offer, the system must fan out notifications to all subscribed users within seconds to give them a fair chance at limited-inventory deals.
The scale is significant: popular offers may attract 1 to 5 million subscribers, and the platform runs dozens of concurrent promotions daily. The notification pipeline must handle massive fan-out without duplicating messages or overwhelming downstream providers. Each notification channel has different throughput limits, failure characteristics, and cost profiles — SMS is expensive and rate-limited by carriers, push notifications are cheap but unreliable on some devices, and email has high throughput but slower delivery.
Users also have preferences that the system must respect: quiet hours during which no notifications should be sent, per-channel opt-outs, and frequency caps to prevent notification fatigue. The system needs to guarantee that every eligible subscriber receives exactly one notification per offer activation on each channel they selected, even in the face of worker crashes, provider timeouts, and retry storms.
Key Requirements
Functional
- Offer subscription management -- Users browse upcoming offers, subscribe or unsubscribe, and select preferred notification channels for each subscription
- Real-time fan-out on activation -- When a merchant activates an offer, notify all subscribed users across their preferred channels within seconds
- Multi-channel delivery -- Support push notifications, email, SMS, and in-app alerts with channel-specific retry logic and fallback routing
- User preference enforcement -- Respect quiet hours, frequency caps, and per-channel opt-outs before sending any notification
Non-Functional
- Scalability -- Handle fan-out to 1-5 million subscribers per offer activation, with dozens of concurrent activations daily across the platform
- Latency -- Deliver first notifications within 5 seconds of offer activation; complete full fan-out within 60 seconds for offers with millions of subscribers
- Reliability -- Guarantee at-least-once delivery attempts, combined with deduplication so that each user-offer-channel combination results in at most one delivered notification, surviving worker crashes and provider failures
- Cost efficiency -- Minimize SMS and email costs by deduplicating across channels, batching where possible, and avoiding unnecessary retries
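The scalability and latency targets above imply a minimum sustained send rate, which is worth working out before choosing an architecture. The sketch below does that arithmetic for the worst case stated in the requirements (5 million subscribers, 60-second window). The per-worker rate of 2,000 sends per second is a hypothetical figure chosen only for illustration; it does not come from the problem statement.

```python
import math


def required_send_rate(subscribers: int, window_seconds: float) -> float:
    """Average sends per second needed to finish fan-out inside the window."""
    return subscribers / window_seconds


def workers_needed(rate_per_second: float, per_worker_rate: float) -> int:
    """Smallest number of workers whose combined rate covers the target."""
    return math.ceil(rate_per_second / per_worker_rate)


# Worst case from the requirements: 5 million subscribers in 60 seconds.
rate = required_send_rate(5_000_000, 60)
print(round(rate))                   # 83333 sends per second

# ASSUMPTION: each worker sustains 2,000 sends/second (illustrative only).
print(workers_needed(rate, 2_000))   # 42 workers
```

This counts one send per subscriber. If users select several channels, or several large offers activate at the same moment, the required rate scales up accordingly.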
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Fan-Out Architecture at Scale
The core challenge is distributing millions of notifications quickly when an offer goes live. Interviewers want to see how you partition the work, parallelize delivery, and avoid bottlenecks in the fan-out path.
Hints to consider:
- Think about how you partition subscribers for parallel processing — by user ID hash, geographic region, or notification channel — and how each affects throughput
- Consider a two-stage fan-out: first expand the offer activation into per-user notification tasks, then route each task to channel-specific workers
- Plan for hot offers with millions of subscribers versus long-tail offers with hundreds — your architecture should handle both without over-provisioning
- Decide how you track fan-out progress so you can detect and recover from partial failures without restarting the entire batch
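As a rough illustration of the two-stage fan-out and hash partitioning mentioned in the hints, the sketch below expands an activation into per-user tasks grouped by partition, then splits those tasks into per-channel queues. All function names and the task shape are hypothetical, and plain in-memory lists stand in for whatever message queue a real design would use.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable partition index derived from a hash of the user ID.

    hashlib is used instead of the built-in hash(), which is salted per
    process and would give different partitions on different workers.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def expand_activation(
    offer_id: str,
    subscriptions: Iterable[Tuple[str, List[str]]],
    num_partitions: int,
) -> Dict[int, List[dict]]:
    """Stage 1: turn one activation into per-user tasks, grouped by partition."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for user_id, channels in subscriptions:
        task = {"offer_id": offer_id, "user_id": user_id, "channels": list(channels)}
        partitions[partition_for(user_id, num_partitions)].append(task)
    return partitions


def route_to_channels(tasks: Iterable[dict]) -> Dict[str, List[Tuple[str, str]]]:
    """Stage 2: split each per-user task into one send per selected channel."""
    queues: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for task in tasks:
        for channel in task["channels"]:
            queues[channel].append((task["offer_id"], task["user_id"]))
    return queues
```

Because each partition is an independent unit of work, fan-out progress can be tracked per partition, and a failed partition can be replayed without restarting the whole batch.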
2. Idempotent Delivery and Deduplication
Retries, worker restarts, and at-least-once messaging semantics can cause duplicate notifications. Interviewers expect a robust deduplication strategy that prevents users from receiving the same alert twice.
Hints to consider:
- Generate a deterministic idempotency key from the combination of offer ID, user ID, and channel type before enqueueing each send
- Use Redis SET NX with a TTL or PostgreSQL conditional inserts to enforce send-once semantics at the point of dispatch
- Consider time-windowed deduplication to suppress repeated notifications for the same offer if retries span minutes
- Design the send ledger so it survives worker crashes and supports idempotent replay of incomplete batches
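One possible shape for the deterministic key and the send-once check is sketched below. The `InMemorySendLedger` class is a hypothetical stand-in that mimics the semantics of Redis SET with NX and an expiry (with redis-py this would be roughly `client.set(key, "1", nx=True, ex=ttl)`); it is not durable and only illustrates the claim-then-send flow. The `sent:` key prefix, the one-hour TTL, and the assumption that IDs contain no `:` characters are illustrative choices.

```python
import hashlib
import time
from typing import Callable, Dict


def idempotency_key(offer_id: str, user_id: str, channel: str) -> str:
    """Deterministic key: the same (offer, user, channel) always maps to it."""
    # ASSUMPTION: IDs do not contain ':', so the joined string is unambiguous.
    raw = f"{offer_id}:{user_id}:{channel}"
    return "sent:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemorySendLedger:
    """Hypothetical stand-in for a set-if-absent store with a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expiry: Dict[str, float] = {}
        self._clock = clock

    def claim(self, key: str, ttl_seconds: float) -> bool:
        """Return True only for the first caller within the TTL window."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # someone already claimed this send
        self._expiry[key] = now + ttl_seconds
        return True


def send_once(
    ledger: InMemorySendLedger,
    offer_id: str,
    user_id: str,
    channel: str,
    send: Callable[[str, str], None],
    ttl_seconds: float = 3600.0,
) -> bool:
    """Dispatch via `send` only if this combination has not been claimed."""
    key = idempotency_key(offer_id, user_id, channel)
    if not ledger.claim(key, ttl_seconds):
        return False  # duplicate: a retry or replay reached this point
    send(user_id, channel)
    return True
```

Claiming before sending suppresses duplicates, but a worker that crashes between the claim and the send leaves the notification undelivered. A fuller design would likely record a pending state and mark it complete after the provider acknowledges, so incomplete sends can be replayed.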
3. Provider Rate Limits and Backpressure
Each notification provider (APNs, FCM, Twilio, SendGrid) has its own rate limits, retry policies, and failure modes. Interviewers probe how you manage throughput across heterogeneous providers without dropping messages.
Hints to consider:
- Use per-provider queue partitions with configurable concurrency limits that match each provider's documented rate ceiling
- Implement token bucket or leaky bucket rate limiting at the worker level to smooth bursts and avoid provider throttling
- Design circuit breakers that temporarily halt sends to a failing provider and reroute to fallback channels
- Plan retry strategies with exponential backoff and jitter, distinguishing between transient failures (retry) and permanent failures (dead-letter)
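The token bucket and the backoff-with-jitter hints can be sketched in a few lines. This is a minimal single-process version under stated assumptions: the class and function names are hypothetical, the rates are placeholders rather than any provider's documented limits, and a real deployment would probably need the bucket state shared across workers.

```python
import random
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False so the caller can wait."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(cap, base * 2**attempt)),
    so retries from many workers spread out instead of arriving together.
    """
    return rng() * min(cap, base * (2 ** attempt))
```

A worker would call `try_acquire()` before each provider request and requeue or sleep when it returns False. `backoff_delay` applies only to transient failures; permanent failures such as an invalid device token would go to the dead-letter queue instead of being retried.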
4. Quiet Hours and Preference Filtering
Respecting user preferences adds a filtering step that must execute efficiently at fan-out scale without becoming a bottleneck.
Hints to consider:
- Decide whether to filter preferences eagerly during fan-out expansion or lazily at the channel worker level, and consider the cost trade-offs
- Think about how you handle timezone-aware quiet hours — a user in Tokyo has different quiet hours than one in London for the same offer activation
- Plan for deferred delivery: if a notification is suppressed due to quiet hours, should it be queued for later delivery or dropped entirely?
- Consider caching user preferences in Redis to avoid hitting the database for every notification decision during a large fan-out
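The timezone-aware quiet-hours check can be sketched as a small pure function. The function name and the representation of quiet hours as a local start and end time are assumptions made for illustration. The example uses fixed UTC offsets to stay self-contained; real user timezones would more likely be `zoneinfo.ZoneInfo` objects (for example `ZoneInfo("Asia/Tokyo")`) so that daylight saving changes are handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo, start: time, end: time) -> bool:
    """True if the activation moment falls inside the user's local quiet window.

    `start` and `end` are local wall-clock times. The window may wrap past
    midnight (e.g. 22:00 to 07:00). Equal start and end means no quiet hours.
    """
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    # Window wraps past midnight: late evening OR early morning counts.
    return local >= start or local < end


# One activation instant, evaluated for two users with the same 22:00-07:00 window.
activation = datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)

# ASSUMPTION: fixed offsets stand in for real zones to keep the sketch portable.
tokyo = timezone(timedelta(hours=9))   # 14:00 UTC is 23:00 local
london = timezone.utc                  # 14:00 UTC is 14:00 local in January

print(in_quiet_hours(activation, tokyo, time(22, 0), time(7, 0)))   # True
print(in_quiet_hours(activation, london, time(22, 0), time(7, 0)))  # False
```

Since the check depends only on the activation time and cached preference data, it could run either eagerly during fan-out expansion or lazily in the channel workers without a database round trip.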