For a full example answer with detailed architecture diagrams and deep dives, see our Design Webhook System guide. The webhook guide covers event fan-out, retry logic, payload signing, and delivery guarantees that are central to this problem.
Also review the Message Queues and Rate Limiters building blocks for background on asynchronous delivery pipelines and per-tenant throttling.
Design a webhook system that allows applications to register callback URLs and automatically receive HTTP POST notifications when specific events occur. Think of how Stripe notifies your backend when a payment succeeds, or how GitHub triggers your CI pipeline when code is pushed -- your system must reliably deliver event payloads to external endpoints registered by third-party developers.
The core challenge is building a multi-tenant event delivery platform that guarantees reliability while protecting both your infrastructure and subscribers from failure cascades. Events arrive continuously from multiple internal services, and each event may fan out to hundreds of registered endpoints. Your design must handle slow or failing receivers gracefully, prevent one misbehaving subscriber from degrading delivery for others, ensure payloads are cryptographically signed so receivers can verify authenticity, and provide full visibility into delivery status. Expect to support 100,000+ registered endpoints processing millions of events per hour with 10x burst capacity during traffic spikes.
Based on real interview experiences at Snowflake and similar companies, these are the areas interviewers probe most deeply:
Strong candidates immediately recognize that synchronous fan-out is unacceptable. If your API blocks while POSTing to external URLs, a single slow subscriber can exhaust request threads and cascade failures across your entire platform.
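A minimal sketch of the decoupling this implies (names hypothetical, using an in-process queue as a stand-in for a durable broker like Kafka): the ingestion path only validates the event and enqueues one delivery task per matching subscription, so its latency is independent of any subscriber's endpoint.

```python
import json
import queue

# Stand-in for a durable queue (Kafka in the full design); in-process
# here so the sketch is self-contained and runnable.
delivery_queue = queue.Queue()

def ingest_event(event: dict, matching_subscriptions: list[dict]) -> int:
    """Fan out an event by enqueueing one delivery task per subscription.

    Returns the number of tasks enqueued. Crucially, this returns
    immediately regardless of how slow any subscriber endpoint is --
    no HTTP call happens on the ingestion path.
    """
    payload = json.dumps(event)
    for sub in matching_subscriptions:
        delivery_queue.put({
            "subscription_id": sub["id"],
            "url": sub["url"],
            "payload": payload,
            "attempt": 0,
        })
    return len(matching_subscriptions)

subs = [{"id": "sub_1", "url": "https://a.example/hook"},
        {"id": "sub_2", "url": "https://b.example/hook"}]
enqueued = ingest_event({"type": "payment.succeeded", "amount": 4200}, subs)
print(enqueued)  # 2
```

Separate worker processes then drain the queue and perform the actual POSTs, which is where timeouts, retries, and rate limits live.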
Networks fail constantly. Interviewers want to see thoughtful retry strategies that balance reliability with protecting both your system and subscribers from runaway behavior.
External endpoints are untrusted. Interviewers expect you to protect against spoofing, replay attacks, and man-in-the-middle tampering without adding excessive complexity for legitimate subscribers.
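Timestamped HMAC signing addresses all three threats at once. The sketch below is similar in spirit to Stripe's scheme, but the exact message format and the five-minute tolerance are illustrative assumptions.

```python
import hashlib
import hmac
import time

def sign(secret: bytes, payload: bytes, timestamp: int) -> str:
    """HMAC-SHA256 over "{timestamp}.{payload}" so the timestamp is bound
    into the signature -- an attacker cannot change it independently."""
    msg = f"{timestamp}.".encode() + payload
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(secret: bytes, payload: bytes, timestamp: int,
           signature: str, tolerance_s: int = 300, now=None) -> bool:
    """Receiver-side check: reject stale timestamps (replay protection),
    then compare signatures in constant time (timing-attack protection)."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # replayed or badly clock-skewed request
    expected = sign(secret, payload, timestamp)
    return hmac.compare_digest(expected, signature)

secret = b"whsec_example"          # shared signing secret (hypothetical)
body = b'{"type": "payment.succeeded"}'
ts = int(time.time())
sig = sign(secret, body, ts)
print(verify(secret, body, ts, sig))         # True
print(verify(secret, body, ts - 3600, sig))  # False: stale timestamp
```

Because the signature covers both the timestamp and the body, tampering with either invalidates it, and HTTPS handles confidentiality in transit.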
A single subscription should never consume all delivery capacity. Interviewers look for multi-tenant isolation and mechanisms to protect the system from both malicious and accidentally misconfigured subscribers.
Begin by confirming scope and priorities with your interviewer. Ask whether the system needs to guarantee ordered delivery for events from the same source, or if parallel out-of-order delivery is acceptable. Clarify expected event rates and payload sizes -- lightweight JSON documents or multi-megabyte attachments. Determine if subscribers need exactly-once guarantees or if at-least-once with idempotency keys is sufficient. Confirm whether the system must support content-based filtering (subscribers receive only events matching certain criteria) or all-or-nothing subscriptions per event type. Finally, confirm security requirements around payload signing, mutual TLS, and IP allowlisting.
Sketch the core components: an Event Ingestion API that accepts events from internal services and immediately enqueues them; a Subscription Registry (PostgreSQL) storing endpoint URLs, event type subscriptions, signing secrets, and retry configuration; a Message Queue (Kafka) that durably buffers events and enables fan-out to multiple subscriptions; a Delivery Worker Pool that consumes from the queue, signs payloads, executes HTTPS POSTs, and records outcomes; a Retry Scheduler that re-enqueues failed deliveries with exponential backoff; and a Delivery History Store (ClickHouse or PostgreSQL with time partitioning) for querying past attempts. Position a Rate Limiter (Redis-backed) in front of the worker pool to enforce per-subscription quotas.
Walk through the end-to-end flow in detail. When an event arrives, the ingestion API queries the subscription registry for matching subscriptions and writes one message per subscription into Kafka, keyed by subscription ID to preserve ordering. Workers pull messages, look up endpoint configuration including signing secrets, construct the HMAC signature over the JSON payload plus a timestamp, and execute the HTTPS POST with a 10-second timeout. On success (HTTP 2xx), record the delivery in history and acknowledge the message. On temporary failure (5xx, timeout, DNS failure), re-enqueue with incremented retry count and exponential delay (1 minute, 5 minutes, 30 minutes, 2 hours, and so on up to 72 hours). On permanent failure (4xx, invalid certificate), log the error and move to a dead-letter queue without retrying. Discuss how workers handle backpressure: if a subscription's queue depth exceeds a threshold, pause new deliveries and alert the subscriber.
Cover observability by proposing that workers emit metrics (delivery latency, success and failure rates, queue depths) to a monitoring system like Prometheus and write structured delivery logs to ClickHouse for fast querying. Discuss security enhancements: support multiple signature algorithms (HMAC-SHA256, HMAC-SHA512), include webhook IDs in signatures to prevent cross-endpoint replay, and provide signing secret rotation APIs. Address scalability by explaining how to horizontally scale worker pools and partition Kafka topics by subscription ID. Talk through operational tools: a testing API that delivers synthetic events without affecting production quotas, a bulk retry API for recovering from outages, and automated circuit breakers that pause subscriptions with sustained failure rates above 50 percent.
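The automated circuit breaker mentioned above can be sketched as a sliding window over recent delivery outcomes; the window size, minimum sample count, and 50 percent threshold here are illustrative parameters.

```python
from collections import deque

class SubscriptionBreaker:
    """Pause a subscription whose failure rate over a sliding window of
    recent deliveries stays above a threshold (50% per the design above)."""

    def __init__(self, window: int = 100, threshold: float = 0.5,
                 min_samples: int = 20):
        self.outcomes = deque(maxlen=window)  # True = successful delivery
        self.threshold = threshold
        self.min_samples = min_samples        # avoid tripping on tiny samples
        self.paused = False

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        if len(self.outcomes) >= self.min_samples:
            failure_rate = 1 - sum(self.outcomes) / len(self.outcomes)
            self.paused = failure_rate > self.threshold

failing = SubscriptionBreaker()
for _ in range(30):
    failing.record(False)    # sustained failures trip the breaker
print(failing.paused)        # True

healthy = SubscriptionBreaker()
for _ in range(30):
    healthy.record(True)
print(healthy.paused)        # False
```

A paused subscription stops consuming worker capacity and retry budget; pairing the breaker with a subscriber notification and the bulk retry API gives tenants a clean path to recover once their endpoint is fixed.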