Design a notification system that can orchestrate large-scale marketing campaigns for enterprise customers. The platform lets businesses define target audiences, compose personalized messages across multiple channels (email, SMS, push notifications), schedule delivery windows, and monitor campaign performance in real time. Think of systems like Mailchimp, Braze, or Salesforce Marketing Cloud.
A single campaign may fan out to tens of millions of recipients, each requiring personalization, consent verification, frequency cap enforcement, and delivery through third-party providers with strict rate limits. Campaigns often run for hours or days, accounting for recipient time zones and provider throttling. The system must be auditable, pausable, and resilient to partial failures: a provider outage should not cause duplicate sends or lost messages when the campaign resumes.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Enterprise sends are multi-step, long-running processes. Interviewers want to see how you model the workflow so it survives crashes, supports pause/resume, and avoids duplicating work on restart.
Hints to consider:
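One way to make the checkpointing idea concrete is the minimal sketch below. It assumes a hypothetical `CampaignRun` record that would live in a durable store in a real system; here it is an in-memory dataclass, and "sending" is just appending to a list. The point it illustrates is that the checkpoint (`next_offset`) advances only after a batch has been handled, so a pause or crash resumes at the first unprocessed recipient.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CampaignState(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"


@dataclass
class CampaignRun:
    """Hypothetical durable record; assumed to be persisted in a database."""
    recipients: List[str]              # immutable audience snapshot
    batch_size: int = 2
    next_offset: int = 0               # checkpoint: first recipient not yet processed
    state: CampaignState = CampaignState.RUNNING
    sent: List[str] = field(default_factory=list)


def run_until_paused_or_done(run: CampaignRun,
                             pause_after_batches: Optional[int] = None) -> None:
    """Process batches starting at the checkpoint.

    `pause_after_batches` simulates an operator pausing the campaign.
    """
    batches_done = 0
    run.state = CampaignState.RUNNING
    while run.next_offset < len(run.recipients):
        if pause_after_batches is not None and batches_done >= pause_after_batches:
            run.state = CampaignState.PAUSED
            return
        batch = run.recipients[run.next_offset:run.next_offset + run.batch_size]
        run.sent.extend(batch)          # stand-in for enqueueing the sends
        run.next_offset += len(batch)   # advance the checkpoint only after the batch is handled
        batches_done += 1
    run.state = CampaignState.COMPLETED
```

Calling `run_until_paused_or_done(run, pause_after_batches=1)` and then `run_until_paused_or_done(run)` on the same record processes every recipient exactly once. In a real deployment the batch hand-off and the checkpoint update would need to be committed atomically, or the downstream send made idempotent, for that guarantee to hold across crashes.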
Blasting millions of messages at once overwhelms email and SMS providers, causing throttling or dropped sends. Interviewers expect you to design pacing, backpressure, and per-provider rate management.
Hints to consider:
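Per-provider pacing is commonly discussed in terms of a token bucket. The sketch below is an in-process illustration only: the class name, parameters, and injectable `clock` are assumptions made for the example, and a shared limiter across many workers would need the same refill-and-take logic executed atomically in a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket; one instance per provider is assumed."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = capacity          # start full
        self.last_refill = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take `n` tokens if available; return False so the caller can back off."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A worker that gets `False` should stop pulling from the queue rather than spin, which is how the limiter turns into backpressure on the pipeline.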
Skipping consent checks leads to legal risk and user churn. Interviewers look for fast, authoritative checks in the hot send path rather than batch-only verification.
Hints to consider:
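A hot-path eligibility check might look roughly like the following sketch. The `EligibilityStore` class is a hypothetical in-memory stand-in for whatever low-latency store holds consent and frequency-cap state; the key shapes and skip-reason strings are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class EligibilityStore:
    """Hypothetical in-memory stand-in for a low-latency consent/cap store."""
    opted_out: Set[Tuple[str, str]] = field(default_factory=set)          # (recipient_id, channel)
    send_counts: Dict[Tuple[str, str], int] = field(default_factory=dict)  # (recipient_id, window) -> count


def check_and_reserve(store: EligibilityStore, recipient_id: str, channel: str,
                      window: str, cap: int) -> Optional[str]:
    """Return None if the send may proceed (and count it), else a skip reason."""
    if (recipient_id, channel) in store.opted_out:
        return "no_consent"
    key = (recipient_id, window)
    if store.send_counts.get(key, 0) >= cap:
        return "frequency_cap"
    # Reserve a slot against the cap before sending.
    store.send_counts[key] = store.send_counts.get(key, 0) + 1
    return None
```

Returning a skip reason rather than a bare boolean gives the audit log something to record. With a shared store, the cap check and the increment would have to run as one atomic operation, otherwise two concurrent workers could both pass the check.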
Retries after transient failures must not produce duplicate messages visible to recipients. Interviewers expect an outbox pattern or equivalent mechanism for safe retry.
Hints to consider:
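The sketch below illustrates why an outbox plus an idempotency key makes retries safe. `FakeProvider` is a hypothetical provider that deduplicates on the key it is given (real providers differ in whether and how they support this), and the outbox is a plain list rather than a database table.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class OutboxMessage:
    message_id: str          # stable ID, reused as the idempotency key
    recipient: str
    body: str
    status: str = "pending"  # pending -> sent


class FakeProvider:
    """Hypothetical provider that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.delivered: Dict[str, Tuple[str, str]] = {}
        self.fail_next_ack = False   # simulate "delivered, but the ack was lost"

    def send(self, idempotency_key: str, recipient: str, body: str) -> None:
        # The first call with a given key delivers; repeats are no-ops.
        self.delivered.setdefault(idempotency_key, (recipient, body))
        if self.fail_next_ack:
            self.fail_next_ack = False
            raise TimeoutError("ack lost after delivery")


def drain_outbox(outbox: List[OutboxMessage], provider: FakeProvider) -> None:
    """Attempt every pending message; leave it pending if the call fails."""
    for msg in outbox:
        if msg.status != "pending":
            continue
        try:
            provider.send(msg.message_id, msg.recipient, msg.body)
        except TimeoutError:
            continue             # a later pass retries with the same key
        msg.status = "sent"
```

If the acknowledgement is lost after the provider has already delivered, the message stays `pending` and is retried, but the repeated call carries the same key, so the recipient sees only one message.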
Confirm the expected campaign size (millions vs. billions of recipients) and how many campaigns run concurrently. Ask about channel mix and whether delivery receipts come back synchronously or via webhooks. Clarify time-zone handling and whether campaigns need per-recipient delivery windows. Understand consent and compliance obligations for target markets. Verify whether A/B testing, dynamic content, and triggered (event-based) sends are in scope.
Sketch these major services: Campaign Service (CRUD for campaigns, audience rules, scheduling), Audience Service (evaluates targeting rules and produces immutable recipient snapshots), Rendering Service (personalizes message templates per recipient and channel), Send Pipeline (Kafka-backed queue with worker pools per channel/provider), Delivery Tracker (consumes provider webhooks and updates message status), and Analytics Service (aggregates delivery, open, click, and bounce metrics). Place Redis in front of hot-path lookups (consent, frequency caps, idempotency). Use a relational database for campaign metadata and a columnar store for analytics.
Walk through a campaign launch. The Campaign Service locks the audience snapshot and publishes a "CampaignStarted" event. The Send Pipeline reads recipient batches from the snapshot, checks consent and frequency caps in Redis, renders personalized content, and writes each outbound message to an outbox table. Workers pull from the outbox, call the appropriate provider API with the message ID as the idempotency key, and mark messages as sent on success. If a provider returns a transient error, the message stays in the outbox and is retried with exponential backoff. A dead-letter queue captures messages that exhaust their retries. Token-bucket rate limiters in Redis throttle workers per provider. On pause, workers drain gracefully and persist their checkpoint offsets; on resume, they pick up from the last checkpoint.
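The retry and dead-letter steps of this walkthrough can be sketched as follows. The function names, the choice of `ConnectionError` as the transient error, and the full-jitter variant of exponential backoff are assumptions for illustration; the text does not prescribe a specific backoff formula.

```python
import random
from typing import Callable, List


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  jitter: Callable[[], float] = random.random) -> float:
    """Full-jitter exponential backoff: jitter() * min(cap, base * 2**attempt)."""
    return jitter() * min(cap, base * (2 ** attempt))


def process_with_retries(message_id: str,
                         send: Callable[[str], None],
                         max_attempts: int,
                         dead_letters: List[str],
                         sleep: Callable[[float], None] = lambda seconds: None) -> bool:
    """Try to send; on transient errors back off and retry, then dead-letter."""
    for attempt in range(max_attempts):
        try:
            send(message_id)
            return True
        except ConnectionError:          # assumed to represent a transient provider error
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append(message_id)      # retries exhausted
    return False
```

The default `sleep` is a no-op so the sketch runs instantly. A production worker would more likely record a next-attempt timestamp on the outbox row than block a thread while waiting.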
Cover analytics: provider webhooks (delivery receipts, opens, clicks, bounces) land in Kafka and feed a real-time aggregation pipeline that updates campaign dashboards. Discuss A/B testing by splitting audiences into variant groups during the snapshot phase and routing each group through different message templates, then comparing delivery and engagement metrics. Address monitoring: track send rate, provider error rates, queue depth, and campaign progress. Explain multi-tenancy by isolating rate limits and quotas per customer to prevent one large campaign from starving others. Touch on compliance by logging every consent check and skip reason for audit trails.
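For the A/B split during the snapshot phase, one common approach is deterministic hashing, sketched below. The function name, the integer weights, and the use of SHA-256 are assumptions chosen for the example rather than anything the design above mandates.

```python
import hashlib
from typing import List


def assign_variant(campaign_id: str, recipient_id: str,
                   variants: List[str], weights: List[int]) -> str:
    """Deterministically map a recipient to a variant using integer weights."""
    digest = hashlib.sha256(f"{campaign_id}:{recipient_id}".encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable pseudo-random number.
    bucket = int.from_bytes(digest[:8], "big") % sum(weights)
    cumulative = 0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    raise ValueError("weights must be positive integers matching variants")
```

Because the assignment depends only on the campaign and recipient IDs, re-running the snapshot or retrying a batch places each recipient in the same group, which keeps the comparison between variants clean.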