Design a Notification System for Marketing Campaigns — Trustpilot

Problem Statement

Design a notification system that can orchestrate large-scale marketing campaigns for enterprise customers. The platform lets businesses define target audiences, compose personalized messages across multiple channels (email, SMS, push notifications), schedule delivery windows, and monitor campaign performance in real time. Think of systems like Mailchimp, Braze, or Salesforce Marketing Cloud.

A single campaign may fan out to tens of millions of recipients, each requiring personalization, consent verification, frequency cap enforcement, and delivery through third-party providers with strict rate limits. Campaigns often run for hours or days, accounting for recipient time zones and provider throttling. The system must be auditable, pausable, and resilient to partial failures -- a provider outage should not cause duplicate sends or lost messages when the campaign resumes.

Key Requirements

Functional

Audience targeting -- define and preview audiences using rules over user profiles and behavioral events, then lock an immutable snapshot for the send
Multi-channel delivery -- compose personalized messages for email, SMS, push, and in-app channels, and schedule delivery immediately or within time-zone-aware windows
Consent and frequency caps -- enforce opt-in preferences, legal compliance (GDPR, CAN-SPAM), and per-user frequency limits across channels and campaigns
Campaign controls -- monitor live campaigns with real-time delivery metrics, pause and resume without duplicates, and run A/B experiments on message variants

Non-Functional

Scalability -- send 100 million messages per campaign across multiple channels, supporting dozens of concurrent campaigns
Reliability -- guarantee at-least-once delivery with no duplicates visible to recipients; survive provider outages without data loss
Latency -- begin delivering messages within 60 seconds of campaign launch; reflect delivery metrics on dashboards within 30 seconds
Consistency -- ensure exactly-once processing in the send pipeline through idempotency; accept eventual consistency for analytics aggregations
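
As a rough sizing aid for the scalability target above, the sketch below converts a campaign size into an average send rate. The delivery windows (1, 4, and 24 hours) are illustrative assumptions, not figures from the requirements, and real pacing would also be bounded by provider limits.

```python
def required_rate(messages: int, window_hours: float) -> float:
    """Average sends per second needed to finish within the window."""
    return messages / (window_hours * 3600)


CAMPAIGN_SIZE = 100_000_000  # from the scalability requirement

# Assumed delivery windows, chosen only to show how the rate scales.
for hours in (1, 4, 24):
    rate = required_rate(CAMPAIGN_SIZE, hours)
    print(f"{hours:>2} h window: {rate:,.0f} msg/s")
```

Even a full-day window needs a sustained rate above a thousand messages per second for a single campaign, which is why partitioned fan-out and per-provider pacing come up early in the discussion.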

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Campaign Workflow Orchestration

Enterprise sends are multi-step, long-running processes. Interviewers want to see how you model the workflow so it survives crashes, supports pause/resume, and avoids duplicating work on restart.

Hints to consider:

Model the campaign as a durable workflow with explicit stages: audience snapshot, segmentation, message rendering, pacing, delivery, and metrics reconciliation
Implement checkpointing so that a resumed campaign picks up exactly where it left off without re-sending to already-processed recipients
Design pause/resume as state transitions that stop worker consumption and persist progress markers
Use a saga or durable execution framework to coordinate compensations if a step fails (e.g., rolling back A/B test routing)
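
The pause/resume and checkpointing hints above can be sketched as a small state machine. This is a minimal in-memory illustration, not a durable implementation: the `Campaign` class, its states, and the list used as a stand-in for the send step are all hypothetical, and a real system would persist the state and checkpoint in a database or a durable execution framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    DRAFT = "draft"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"


# Allowed transitions; anything else is rejected.
TRANSITIONS = {
    State.DRAFT: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.COMPLETED},
    State.PAUSED: {State.RUNNING},
    State.COMPLETED: set(),
}


@dataclass
class Campaign:
    recipients: List[str]       # locked audience snapshot, in a fixed order
    state: State = State.DRAFT
    checkpoint: int = 0         # index of the next unprocessed recipient
    sent: List[str] = field(default_factory=list)

    def transition(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def run_batch(self, batch_size: int) -> None:
        """Process one batch, then advance the checkpoint past it."""
        if self.state is not State.RUNNING:
            return  # paused or finished campaigns consume nothing
        batch = self.recipients[self.checkpoint:self.checkpoint + batch_size]
        self.sent.extend(batch)  # stand-in for the real send step
        self.checkpoint += len(batch)
        if self.checkpoint >= len(self.recipients):
            self.transition(State.COMPLETED)


campaign = Campaign(recipients=[f"user-{i}" for i in range(5)])
campaign.transition(State.RUNNING)
campaign.run_batch(2)               # processes user-0 and user-1
campaign.transition(State.PAUSED)
campaign.run_batch(2)               # ignored while paused
campaign.transition(State.RUNNING)
campaign.run_batch(10)              # resumes at user-2 and completes
```

A crash between the send step and the checkpoint update would replay part of a batch on restart, so the checkpoint alone does not prevent duplicates; it has to be paired with idempotent sends.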

2. Fan-Out and Provider Rate Limiting

Blasting millions of messages at once overwhelms email and SMS providers, causing throttling or dropped sends. Interviewers expect you to design pacing, backpressure, and per-provider rate management.

Hints to consider:

Use a partitioned message queue (Kafka) to distribute outbound messages across worker pools, each rate-limited per provider
Implement token-bucket rate limiters in Redis keyed by provider and tenant to respect API limits
Design time-zone-aware scheduling that spreads delivery across hours rather than sending everything at midnight UTC
Build circuit breakers around provider calls that pause sending and drain to a retry queue on sustained errors
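
To make the token-bucket hint concrete, here is a minimal single-process sketch keyed by provider and tenant. The class names, the `limits` mapping, and the injectable clock are assumptions for illustration; the hints place the limiter in Redis, where the refill-and-take step would need to run atomically (for example inside a server-side script) so that many workers can share one bucket.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Credit tokens for the elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class ProviderLimiter:
    """One bucket per (provider, tenant) pair, created lazily."""

    def __init__(self, limits: Dict[str, Tuple[float, float]],
                 clock: Callable[[], float] = time.monotonic):
        self.limits = limits        # provider -> (rate per second, burst capacity)
        self.clock = clock
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, provider: str, tenant: str) -> bool:
        key = (provider, tenant)
        if key not in self.buckets:
            rate, capacity = self.limits[provider]
            self.buckets[key] = TokenBucket(rate, capacity, self.clock)
        return self.buckets[key].try_acquire()
```

A worker would call `allow(provider, tenant)` before each provider request and, on `False`, leave the message queued rather than dropping it, which is what produces backpressure upstream.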

3. Consent Enforcement and Frequency Caps

Skipping consent checks leads to legal risk and user churn. Interviewers look for fast, authoritative checks in the hot send path rather than batch-only verification.

Hints to consider:

Maintain a per-user preferences store with channel-level opt-in/opt-out flags, checked synchronously before every send
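Implement frequency caps (e.g., max 3 emails per user per week) using Redis counters with TTL-based expiration; a single counter with a TTL behaves as a fixed window, so a truly rolling cap needs per-send timestamps or finer-grained bucketed counters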
Design the pipeline so that consent and frequency violations produce skip records rather than silent drops, preserving auditability
Consider how to handle mid-campaign preference changes (a user unsubscribes while the campaign is in flight)
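
The checks described above can be sketched as a single gate that runs before every send and records a reason whenever it refuses. This is an in-memory illustration under stated assumptions: `SendGate`, `SkipRecord`, and the reason strings are hypothetical names, and the per-user timestamp queue stands in for whatever store (Redis or otherwise) holds the send history.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Set, Tuple

WEEK_SECONDS = 7 * 24 * 3600


@dataclass
class SkipRecord:
    user_id: str
    channel: str
    reason: str  # "no_consent" or "frequency_cap"


class SendGate:
    def __init__(self, max_per_window: int = 3,
                 window_seconds: int = WEEK_SECONDS):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.opted_in: Set[Tuple[str, str]] = set()   # (user_id, channel)
        self.history: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)
        self.skips: List[SkipRecord] = []             # audit trail of refusals

    def check(self, user_id: str, channel: str,
              now: Optional[float] = None) -> bool:
        """Return True and record the send if it is allowed right now."""
        now = time.time() if now is None else now
        key = (user_id, channel)
        if key not in self.opted_in:
            self.skips.append(SkipRecord(user_id, channel, "no_consent"))
            return False
        sends = self.history[key]
        # Drop timestamps that have aged out of the rolling window.
        while sends and sends[0] <= now - self.window_seconds:
            sends.popleft()
        if len(sends) >= self.max_per_window:
            self.skips.append(SkipRecord(user_id, channel, "frequency_cap"))
            return False
        sends.append(now)
        return True
```

Because consent is read at send time rather than at snapshot time, removing a user from `opted_in` mid-campaign takes effect on that user's next message, which is one way to handle in-flight unsubscribes.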

4. Idempotency and Exactly-Once Delivery

Retries after transient failures must not produce duplicate messages visible to recipients. Interviewers expect an outbox pattern or equivalent mechanism for safe retry.

Hints to consider:

Generate a deterministic message ID from campaign ID, recipient ID, and variant to serve as an idempotency key
Write each outbound message to a durable outbox table before calling the provider, marking it as sent only after the provider acknowledges the request (final delivery status arrives later through receipts or webhooks)
Design consumers to be idempotent: if a message ID has already been marked as sent, skip it on retry
Use dead-letter queues for messages that fail repeatedly, with alerting and manual review workflows
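
A minimal sketch of the deterministic ID and outbox hints follows. The dictionary stands in for a durable outbox table, and `Outbox`, `enqueue`, and `deliver` are hypothetical names; the choice of SHA-256 over the joined fields is likewise an assumption, since any stable hash of the same inputs would serve.

```python
import hashlib
from typing import Callable, Dict


def message_id(campaign_id: str, recipient_id: str, variant: str) -> str:
    """Same inputs always yield the same ID, so retries reuse the same key."""
    # A separator that cannot appear in the IDs avoids ambiguous joins.
    raw = "\x1f".join((campaign_id, recipient_id, variant))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Outbox:
    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}  # message_id -> "pending" or "sent"

    def enqueue(self, msg_id: str) -> bool:
        """Insert once; re-enqueueing the same ID is a no-op."""
        if msg_id in self.rows:
            return False
        self.rows[msg_id] = "pending"
        return True

    def deliver(self, msg_id: str, send: Callable[[str], None]) -> bool:
        """Call the provider only for pending rows; skip anything already sent."""
        if self.rows.get(msg_id) != "pending":
            return False
        send(msg_id)  # if this raises, the row stays pending for a later retry
        self.rows[msg_id] = "sent"
        return True
```

One gap remains: a crash after the provider accepts the request but before the row is marked sent leads to a second provider call on retry. Passing the same message ID to the provider as its idempotency key, where the provider supports one, closes that gap.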

Suggested Approach

Step 1: Clarify Requirements

Confirm the expected campaign size (millions vs. billions of recipients) and how many campaigns run concurrently. Ask about channel mix and whether delivery receipts come back synchronously or via webhooks. Clarify time-zone handling and whether campaigns need per-recipient delivery windows. Understand consent and compliance obligations for target markets. Verify whether A/B testing, dynamic content, and triggered (event-based) sends are in scope.
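
To make the per-recipient delivery window question concrete, the sketch below computes the earliest allowed send time for one recipient. It assumes a fixed UTC offset per recipient and a window that starts and ends on the same local day; both are simplifications (daylight saving time is ignored), and the function name and parameters are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone


def next_send_time(now_utc: datetime, utc_offset_hours: float,
                   window_start: time, window_end: time) -> datetime:
    """Earliest UTC instant at or after now_utc that falls inside the
    recipient's local window [window_start, window_end)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local_now = now_utc.astimezone(tz)
    start_today = local_now.replace(hour=window_start.hour,
                                    minute=window_start.minute,
                                    second=0, microsecond=0)
    end_today = local_now.replace(hour=window_end.hour,
                                  minute=window_end.minute,
                                  second=0, microsecond=0)
    if local_now < start_today:
        local_send = start_today                      # wait for today's window
    elif local_now < end_today:
        local_send = local_now                        # already inside the window
    else:
        local_send = start_today + timedelta(days=1)  # missed it, use tomorrow
    return local_send.astimezone(timezone.utc)
```

Grouping recipients by their computed send time naturally spreads one campaign across many hours instead of releasing the whole audience at once.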

Step 2: High-Level Architecture

Sketch these major services: Campaign Service (CRUD for campaigns, audience rules, scheduling), Audience Service (evaluates targeting rules and produces immutable recipient snapshots), Rendering Service (personalizes message templates per recipient and channel), Send Pipeline (Kafka-backed queue with worker pools per channel/provider), Delivery Tracker (consumes provider webhooks and updates message status), and Analytics Service (aggregates delivery, open, click, and bounce metrics). Place Redis in front of hot-path lookups (consent, frequency caps, idempotency). Use a relational database for campaign metadata and a columnar store for analytics.

Step 3: Deep Dive on the Send Pipeline

Walk through a campaign launch. The Campaign Service locks the audience snapshot and publishes a "CampaignStarted" event. The Send Pipeline reads recipient batches from the snapshot, checks consent and frequency caps in Redis, renders personalized content, and writes each outbound message to an outbox table. Workers pull from the outbox, call the appropriate provider API with the message ID as the idempotency key, and mark messages as sent on success. If a provider returns a transient error, the message stays in the outbox and is retried with exponential backoff. A dead-letter queue captures messages that exhaust their retries. Token-bucket rate limiters in Redis throttle workers per provider. On pause, workers drain gracefully and persist their checkpoint offsets; on resume, they pick up from the last checkpoint.
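
The retry path in this walkthrough can be sketched as follows. `TransientError`, `send_with_retry`, and the list used as a dead-letter queue are hypothetical, and the jittered backoff with a 300-second cap is one common choice rather than something the walkthrough prescribes.

```python
import random
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical marker for retryable provider failures (timeouts, throttling)."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with jitter: a random share of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def send_with_retry(msg_id: str,
                    send: Callable[[str], None],
                    dead_letters: List[str],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try the provider up to max_attempts times, then dead-letter the message."""
    for attempt in range(max_attempts):
        try:
            send(msg_id)
            return True
        except TransientError:
            # Wait before the next attempt; no wait is needed after the last one.
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    dead_letters.append(msg_id)
    return False
```

Only errors marked as transient are retried here; any other exception propagates to the caller, so permanent failures such as an invalid address can be recorded without burning retry attempts.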

Step 4: Address Secondary Concerns

Cover analytics: provider webhooks (delivery receipts, opens, clicks, bounces) land in Kafka and feed a real-time aggregation pipeline that updates campaign dashboards. Discuss A/B testing by splitting audiences into variant groups during the snapshot phase and routing each group through different message templates, then comparing delivery and engagement metrics. Address monitoring: track send rate, provider error rates, queue depth, and campaign progress. Explain multi-tenancy by isolating rate limits and quotas per customer to prevent one large campaign from starving others. Touch on compliance by logging every consent check and skip reason for audit trails.
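
For the A/B split described above, one possible approach is to assign variants by hashing, so that the same recipient always lands in the same group without storing an extra lookup table. The function name, the `(name, weight)` format, and the use of SHA-256 are assumptions for illustration.

```python
import hashlib
from typing import Sequence, Tuple


def assign_variant(campaign_id: str, recipient_id: str,
                   variants: Sequence[Tuple[str, float]]) -> str:
    """Pick a variant deterministically.

    `variants` is a sequence of (name, weight) pairs whose weights sum to 1.0.
    """
    digest = hashlib.sha256(f"{campaign_id}:{recipient_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto the unit interval.
    point = int.from_bytes(digest[:8], "big") / 2 ** 64
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if point < cumulative:
            return name
    return variants[-1][0]  # guard against floating-point rounding at the top edge
```

Including the campaign ID in the hash input means a recipient who falls into group A for one campaign is not systematically placed in group A for every other campaign.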

Related Learning Resources

Job Scheduler guide -- durable workflow orchestration, checkpointing, and retry strategies for long-running tasks
Message Queues -- Kafka-based fan-out, backpressure, and exactly-once delivery patterns
Rate Limiters -- token-bucket algorithms for throttling provider API calls and enforcing frequency caps
