For a full example answer with detailed architecture diagrams and deep dives, see our Design Webhook System guide. The webhook guide covers event fan-out, retry logic, payload signing, and delivery guarantees that are central to this problem.
Also review the Message Queues and Rate Limiters building blocks for background on asynchronous delivery pipelines and per-tenant throttling.
Design a webhook system that allows applications to register callback URLs and automatically receive HTTP POST notifications when specific events occur. Think of how Stripe notifies your backend when a payment succeeds, or how GitHub triggers your CI pipeline when code is pushed -- your system must reliably deliver event payloads to external endpoints registered by third-party developers.
The core challenge is building a multi-tenant event delivery platform that guarantees reliability while protecting both your infrastructure and subscribers from failure cascades. Events arrive continuously from multiple internal services, and each event may fan out to hundreds of registered endpoints. Your design must handle slow or failing receivers gracefully, prevent one misbehaving subscriber from degrading delivery for others, ensure payloads are cryptographically signed so receivers can verify authenticity, and provide full visibility into delivery status. Expect to support 100,000+ registered endpoints processing millions of events per hour with 10x burst capacity during traffic spikes.
Based on real interview experiences at Snowflake and similar companies, these are the areas interviewers probe most deeply:
Strong candidates immediately recognize that synchronous fan-out is unacceptable. If your API blocks while POSTing to external URLs, a single slow subscriber can exhaust request threads and cascade failures across your entire platform.
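A minimal sketch of the decoupling this implies (names hypothetical, using an in-process queue as a stand-in for a durable broker like Kafka): the ingestion path only validates the event and enqueues one delivery task per matching subscription, so its latency is independent of any subscriber's endpoint.

```python
import json
import queue

# Stand-in for a durable queue (Kafka in the full design); in-process
# here so the sketch is self-contained and runnable.
delivery_queue = queue.Queue()

def ingest_event(event: dict, matching_subscriptions: list[dict]) -> int:
    """Fan out an event by enqueueing one delivery task per subscription.

    Returns the number of tasks enqueued. Crucially, this returns
    immediately regardless of how slow any subscriber endpoint is --
    no HTTP call happens on the ingestion path.
    """
    payload = json.dumps(event)
    for sub in matching_subscriptions:
        delivery_queue.put({
            "subscription_id": sub["id"],
            "url": sub["url"],
            "payload": payload,
            "attempt": 0,
        })
    return len(matching_subscriptions)

subs = [{"id": "sub_1", "url": "https://a.example/hook"},
        {"id": "sub_2", "url": "https://b.example/hook"}]
enqueued = ingest_event({"type": "payment.succeeded", "amount": 4200}, subs)
print(enqueued)  # 2
```

Separate worker processes then drain the queue and perform the actual POSTs, which is where timeouts, retries, and rate limits live.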
Networks fail constantly. Interviewers want to see thoughtful retry strategies that balance reliability with protecting both your system and subscribers from runaway behavior.
External endpoints are untrusted. Interviewers expect you to protect against spoofing, replay attacks, and man-in-the-middle tampering without adding excessive complexity for legitimate subscribers.
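Timestamped HMAC signing addresses all three threats at once. The sketch below is similar in spirit to Stripe's scheme, but the exact message format and the five-minute tolerance are illustrative assumptions.

```python
import hashlib
import hmac
import time

def sign(secret: bytes, payload: bytes, timestamp: int) -> str:
    """HMAC-SHA256 over "{timestamp}.{payload}" so the timestamp is bound
    into the signature -- an attacker cannot change it independently."""
    msg = f"{timestamp}.".encode() + payload
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(secret: bytes, payload: bytes, timestamp: int,
           signature: str, tolerance_s: int = 300, now=None) -> bool:
    """Receiver-side check: reject stale timestamps (replay protection),
    then compare signatures in constant time (timing-attack protection)."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # replayed or badly clock-skewed request
    expected = sign(secret, payload, timestamp)
    return hmac.compare_digest(expected, signature)

secret = b"whsec_example"          # shared signing secret (hypothetical)
body = b'{"type": "payment.succeeded"}'
ts = int(time.time())
sig = sign(secret, body, ts)
print(verify(secret, body, ts, sig))         # True
print(verify(secret, body, ts - 3600, sig))  # False: stale timestamp
```

Because the signature covers both the timestamp and the body, tampering with either invalidates it, and HTTPS handles confidentiality in transit.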
A single subscription should never consume all delivery capacity. Interviewers look for multi-tenant isolation and mechanisms to protect the system from both malicious and accidentally misconfigured subscribers.
Begin by confirming scope and priorities with your interviewer. Ask whether the system needs to guarantee ordered delivery for events from the same source, or if parallel out-of-order delivery is acceptable. Clarify expected event rates and payload sizes -- lightweight JSON documents or multi-megabyte attachments. Determine if subscribers need exactly-once guarantees or if at-least-once with idempotency keys is sufficient. Confirm whether the system must support content-based filtering (subscribers receive only events matching certain criteria) or all-or-nothing subscriptions per event type. Finally, confirm security requirements around payload signing, mutual TLS, and IP allowlisting.
Sketch the core components: an Event Ingestion API that accepts events from internal services and immediately enqueues them; a Subscription Registry (PostgreSQL) storing endpoint URLs, event type subscriptions, signing secrets, and retry configuration; a Message Queue (Kafka) that durably buffers events and enables fan-out to multiple subscriptions; a Delivery Worker Pool that consumes from the queue, signs payloads, executes HTTPS POSTs, and records outcomes; a Retry Scheduler that re-enqueues failed deliveries with exponential backoff; and a Delivery History Store (ClickHouse or PostgreSQL with time partitioning) for querying past attempts. Position a Rate Limiter (Redis-backed) in front of the worker pool to enforce per-subscription quotas.
Walk through the end-to-end flow in detail. When an event arrives, the ingestion API queries the subscription registry for matching subscriptions and writes one message per subscription into Kafka, keyed by subscription ID to preserve ordering. Workers pull messages, look up endpoint configuration including signing secrets, construct the HMAC signature over the JSON payload plus a timestamp, and execute the HTTPS POST with a 10-second timeout. On success (HTTP 2xx), record the delivery in history and acknowledge the message. On temporary failure (5xx, timeout, DNS failure), re-enqueue with incremented retry count and exponential delay (1 minute, 5 minutes, 30 minutes, 2 hours, and so on up to 72 hours). On permanent failure (4xx, invalid certificate), log the error and move to a dead-letter queue without retrying. Discuss how workers handle backpressure: if a subscription's queue depth exceeds a threshold, pause new deliveries and alert the subscriber.
Cover observability by proposing that workers emit metrics (delivery latency, success and failure rates, queue depths) to a monitoring system like Prometheus and write structured delivery logs to ClickHouse for fast querying. Discuss security enhancements: support multiple signature algorithms (HMAC-SHA256, HMAC-SHA512), include webhook IDs in signatures to prevent cross-endpoint replay, and provide signing secret rotation APIs. Address scalability by explaining how to horizontally scale worker pools and partition Kafka topics by subscription ID. Talk through operational tools: a testing API that delivers synthetic events without affecting production quotas, a bulk retry API for recovering from outages, and automated circuit breakers that pause subscriptions with sustained failure rates above 50 percent.
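The automated circuit breaker mentioned above can be sketched as a sliding window over recent delivery outcomes; the window size, minimum sample count, and 50 percent threshold here are illustrative parameters.

```python
from collections import deque

class SubscriptionBreaker:
    """Pause a subscription whose failure rate over a sliding window of
    recent deliveries stays above a threshold (50% per the design above)."""

    def __init__(self, window: int = 100, threshold: float = 0.5,
                 min_samples: int = 20):
        self.outcomes = deque(maxlen=window)  # True = successful delivery
        self.threshold = threshold
        self.min_samples = min_samples        # avoid tripping on tiny samples
        self.paused = False

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        if len(self.outcomes) >= self.min_samples:
            failure_rate = 1 - sum(self.outcomes) / len(self.outcomes)
            self.paused = failure_rate > self.threshold

failing = SubscriptionBreaker()
for _ in range(30):
    failing.record(False)    # sustained failures trip the breaker
print(failing.paused)        # True

healthy = SubscriptionBreaker()
for _ in range(30):
    healthy.record(True)
print(healthy.paused)        # False
```

A paused subscription stops consuming worker capacity and retry budget; pairing the breaker with a subscriber notification and the bulk retry API gives tenants a clean path to recover once their endpoint is fixed.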