Webhook Architecture Patterns: Retry, Idempotency, and Delivery Guarantees (2026)
If you've read our intro to webhooks, you know the basics: an event happens, an HTTP POST fires, a subscriber reacts. Simple enough on a whiteboard. In production, everything that can go wrong will: receivers go down, networks drop packets, and duplicate deliveries corrupt state. This guide covers the architecture patterns that make webhooks actually reliable.
Why exactly-once delivery is impossible
Distributed systems theory (the Two Generals Problem) tells us that exactly-once delivery over an unreliable network is impossible. The sender fires a POST. The receiver processes it and returns 200. But the response is lost in transit. The sender has no way to distinguish "delivered and processed" from "never arrived." It retries, and now the receiver has processed the event twice.
You have two realistic options:
- At-most-once: fire and forget. No retries. Simple, but you lose events.
- At-least-once: retry until you get an acknowledgment. You guarantee delivery but accept the possibility of duplicates.
Almost every production webhook system chooses at-least-once delivery and pushes the duplicate problem to the receiver via idempotency. More on that below.
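To make the lost-acknowledgment scenario concrete, here is a small deterministic simulation. It is only a sketch: makeReceiver, deliverAtLeastOnce, and the lostAcks array are hypothetical names invented for this illustration, and the "network" is just a flag that says whether the 200 response reaches the sender.

```javascript
// Hypothetical in-memory receiver that counts how often each event is processed.
function makeReceiver() {
  const processed = new Map();
  return {
    processed,
    // Processes the event, then responds. When ackLost is true, the 200
    // response is dropped and the sender sees nothing (null).
    handle(eventId, ackLost) {
      processed.set(eventId, (processed.get(eventId) ?? 0) + 1);
      return ackLost ? null : 200;
    },
  };
}

// At-least-once sender: retries until it sees an acknowledgment.
// lostAcks[i] says whether the response to attempt i is lost in transit.
function deliverAtLeastOnce(receiver, eventId, lostAcks) {
  let attempts = 0;
  for (const ackLost of lostAcks) {
    attempts += 1;
    if (receiver.handle(eventId, ackLost) === 200) break;
  }
  return attempts;
}

const receiver = makeReceiver();
// The first response is lost, the second one arrives.
const attempts = deliverAtLeastOnce(receiver, "evt_1", [true, false]);
console.log(attempts, receiver.processed.get("evt_1")); // 2 2
```

The sender made two attempts and the receiver processed the event twice, even though nothing failed on the receiver's side.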
Retry with exponential backoff
When a delivery attempt fails (timeout, 5xx, connection refused), you retry. Naive retries at fixed intervals will hammer a recovering server. Exponential backoff spaces attempts out progressively:
| Attempt | Delay |
|---|---|
| 1 | immediate |
| 2 | 30 seconds |
| 3 | 2 minutes |
| 4 | 8 minutes |
| 5 | 30 minutes |
| 6 | 2 hours |
Add jitter (a small random offset) to each delay so that thousands of failed webhooks don't all retry at the exact same second and create a thundering herd.
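One way to combine the schedule in the table with jitter is sketched below. The base delays come from the table. The function name retryDelaySeconds, the 20% jitter ratio, and the injectable random parameter are assumptions made for this example.

```javascript
// Base delays in seconds, taken from the table above (attempt 1 is immediate).
const BASE_DELAYS_SECONDS = [0, 30, 120, 480, 1800, 7200];

// Returns the delay in seconds before the given attempt (1-based).
// jitterRatio = 0.2 spreads each delay by up to +/-20% (an assumed value).
// `random` is injectable so the function can be tested deterministically.
function retryDelaySeconds(attempt, jitterRatio = 0.2, random = Math.random) {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  // Attempts beyond the table reuse the last (longest) delay.
  const index = Math.min(attempt, BASE_DELAYS_SECONDS.length) - 1;
  const base = BASE_DELAYS_SECONDS[index];
  // Map random() in [0, 1) to a factor in [1 - jitterRatio, 1 + jitterRatio).
  const factor = 1 + (random() * 2 - 1) * jitterRatio;
  return Math.round(base * factor);
}
```

With the defaults, attempt 3 lands somewhere between 96 and 144 seconds instead of exactly 120, which is enough to spread a burst of failures across time.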
A typical policy retries 5 to 8 times over 24 to 72 hours, then gives up and routes the event to a dead letter queue. Always respect the receiver's response: a 410 Gone means the endpoint was deliberately removed, so stop retrying immediately. For other error handling strategies, see our guide to API error handling.
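The decision after each attempt can be sketched as a small pure function. The name nextAction, the returned labels, and the default of 6 attempts are assumptions for illustration. Treating every non-2xx status other than 410 as retryable is also a simplification; a real policy may handle other 4xx codes differently.

```javascript
// Decides what the sender does after one delivery attempt.
// `status` is the HTTP status code, or null for a timeout or connection error.
// `attempt` is the 1-based number of the attempt that just finished.
function nextAction(status, attempt, maxAttempts = 6) {
  if (status !== null && status >= 200 && status < 300) return "done";
  // 410 Gone: the endpoint was deliberately removed, so stop immediately.
  if (status === 410) return "stop";
  // Retries exhausted: hand the event to the dead letter queue.
  if (attempt >= maxAttempts) return "dead-letter";
  return "retry";
}
```

For example, nextAction(503, 2) returns "retry", while nextAction(null, 6) returns "dead-letter".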
HMAC signature verification
Receivers need to verify that an incoming webhook actually came from the expected sender and wasn't tampered with in transit. The standard approach is HMAC-SHA256: the sender signs the payload with a shared secret and includes the signature in a header. The receiver recomputes the signature and compares.
This works alongside HTTPS/TLS: TLS protects the data in transit, while the HMAC shows that the payload was signed by someone holding the shared secret and has not been modified since.
Sender side (Node.js):
import crypto from "node:crypto";
function signPayload(payload, secret) {
  return crypto
    .createHmac("sha256", secret)
    .update(JSON.stringify(payload))
    .digest("hex");
}
// Attach as header: X-Webhook-Signature: sha256=<signature>
Receiver side (Node.js / Express):
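import crypto from "node:crypto";
// req.rawBody is not set by Express by default; capture it yourself,
// for example with the verify option of express.json().
function verifyWebhook(req, secret) {
  const received = req.headers["x-webhook-signature"];
  if (!received) return false;
  const computed =
    "sha256=" +
    crypto.createHmac("sha256", secret).update(req.rawBody).digest("hex");
  const receivedBuf = Buffer.from(received);
  const computedBuf = Buffer.from(computed);
  // timingSafeEqual throws if the lengths differ, so reject those up front.
  if (receivedBuf.length !== computedBuf.length) return false;
  return crypto.timingSafeEqual(receivedBuf, computedBuf);
}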
Key details:
- Use crypto.timingSafeEqual, a constant-time comparison that prevents timing attacks. It throws if the two buffers differ in length, so compare lengths first.
- Compute the signature from the raw request body, not a re-serialized object. JSON key ordering differences will break the check.
- Rotate secrets periodically. During rotation, accept signatures from both the old and new secret for a short overlap window.
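The overlap window for rotation can be handled by checking the signature against every currently valid secret. The sketch below assumes the same sha256=<hex> header format used above. The helper names expectedSignature, safeEqual, and verifyWithRotation are hypothetical.

```javascript
import crypto from "node:crypto";

// Computes the header value the sender would produce for this raw body.
function expectedSignature(rawBody, secret) {
  return (
    "sha256=" +
    crypto.createHmac("sha256", secret).update(rawBody).digest("hex")
  );
}

// Constant-time comparison that tolerates inputs of different lengths
// (crypto.timingSafeEqual throws when the byte lengths differ).
function safeEqual(a, b) {
  const bufA = Buffer.from(a);
  const bufB = Buffer.from(b);
  return bufA.length === bufB.length && crypto.timingSafeEqual(bufA, bufB);
}

// Accepts the signature if it matches any currently valid secret,
// e.g. [newSecret, oldSecret] during the rotation overlap window.
function verifyWithRotation(rawBody, signatureHeader, secrets) {
  if (typeof signatureHeader !== "string") return false;
  return secrets.some((secret) =>
    safeEqual(signatureHeader, expectedSignature(rawBody, secret))
  );
}
```

Once the overlap window closes, dropping the old secret from the array is enough to stop accepting signatures made with it.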
Idempotency on the receiver side
Since at-least-once delivery means duplicates are inevitable, receivers must be idempotent: processing the same event twice should produce the same result as processing it once.
The standard pattern:
- The sender includes a unique X-Webhook-Event-Id (or equivalent) in every delivery.
- The receiver stores processed event IDs (a database table, Redis set, or similar).
- Before processing, check if the ID already exists. If it does, return 200 OK without re-processing.
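async function handleWebhook(req, res) {
  const eventId = req.headers["x-webhook-event-id"];
  // Duplicate delivery: acknowledge without processing again.
  if (await store.has(eventId)) return res.sendStatus(200);
  await processEvent(req.body);
  // Record the ID only after processing succeeds, so a failed attempt can be retried.
  await store.add(eventId);
  res.sendStatus(200);
}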
Set a TTL on stored IDs (e.g., 7 days) so the store doesn't grow unbounded. For a deeper dive, see Idempotency in APIs.
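A minimal in-memory sketch of such a store is shown below. ProcessedEventStore and its claim and sweep methods are hypothetical names, and a production receiver would more likely use a shared store such as a database table or Redis. Unlike a separate has-then-add check, claim tests and records the ID in one step.

```javascript
// Hypothetical in-memory store of processed event IDs with a TTL.
class ProcessedEventStore {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
    this.expiresAt = new Map(); // eventId -> expiry timestamp in ms
  }

  // Returns true if this call claimed the ID (first sighting within the TTL),
  // false if the ID is already recorded and has not expired.
  claim(eventId) {
    const current = this.now();
    const expiry = this.expiresAt.get(eventId);
    if (expiry !== undefined && expiry > current) return false;
    this.expiresAt.set(eventId, current + this.ttlMs);
    return true;
  }

  // Drops expired IDs so the map does not grow without bound.
  sweep() {
    const current = this.now();
    for (const [eventId, expiry] of this.expiresAt) {
      if (expiry <= current) this.expiresAt.delete(eventId);
    }
  }
}

// 7-day TTL, matching the example figure above.
const processedEvents = new ProcessedEventStore(7 * 24 * 60 * 60 * 1000);
```

Claiming before processing closes the window in which two concurrent duplicates both pass the check. The trade-off is that a failed processing step has to release the ID again, otherwise the sender's retry would be treated as a duplicate.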
Dead letter queues
After exhausting all retry attempts, the event has to go somewhere. A dead letter queue (DLQ) captures these permanently failed deliveries so they aren't silently lost.
A DLQ should store:
- The full event payload
- The target URL
- The failure reason and HTTP status from the last attempt
- A timestamp and retry count
This gives your operations team (or the subscriber themselves via a dashboard) the ability to inspect failures and manually replay events once the underlying issue is fixed. Services like AWS SQS, Google Cloud Pub/Sub, and RabbitMQ all have native DLQ support.
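As an illustration, a DLQ entry carrying the fields listed above might be built like this. The function name buildDeadLetterRecord and all field names are assumptions, not a standard schema.

```javascript
// Builds the record stored in the dead letter queue for one failed delivery.
// `lastAttempt` describes the final attempt: { reason, status }, where status
// is undefined when no HTTP response arrived (timeout, connection refused).
function buildDeadLetterRecord(event, targetUrl, lastAttempt, retryCount, now = new Date()) {
  return {
    eventId: event.id,
    payload: event.payload, // full payload, so the event can be replayed later
    targetUrl,
    failureReason: lastAttempt.reason,
    lastStatus: lastAttempt.status ?? null,
    retryCount,
    failedAt: now.toISOString(),
  };
}
```

Keeping the full payload and target URL in the record is what makes a manual replay possible without going back to the original event source.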
Fan-out: one event, multiple subscribers
Many systems need to notify multiple subscribers when a single event occurs. A new order, for example, might trigger an email service, an analytics pipeline, and an inventory system simultaneously.
Two approaches:
Direct fan-out: the webhook sender maintains a list of subscriber URLs per event type and delivers to each independently. Simple, but the sender bears the load of N deliveries and N retry chains.
Broker-mediated fan-out: the sender publishes the event once to a message broker (SNS, Pub/Sub, internal queue). The broker handles delivery to each subscriber. This decouples the sender from subscriber count and isolates failures, so one slow subscriber doesn't block the others.
For systems with more than a handful of subscribers, broker-mediated fan-out is almost always the right call. It also makes it trivial to add or remove subscribers without changing the sender's code, which aligns with good API design principles.
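Whichever component performs the fan-out, the core step is expanding one event into one independent delivery job per subscriber. The sketch below shows that step only. The subscriptions registry, the example.com URLs, and the job shape are all hypothetical.

```javascript
// Hypothetical subscription registry: event type -> subscriber URLs.
const subscriptions = new Map([
  [
    "order.created",
    [
      "https://email.example.com/hooks",
      "https://analytics.example.com/hooks",
      "https://inventory.example.com/hooks",
    ],
  ],
]);

// Expands one event into one delivery job per subscriber. Each job carries
// its own attempt counter, so one subscriber's retries never affect another's.
function fanOut(event, registry) {
  const urls = registry.get(event.type) ?? [];
  return urls.map((targetUrl) => ({
    eventId: event.id,
    targetUrl,
    payload: event.payload,
    attempt: 0,
  }));
}

const jobs = fanOut(
  { id: "evt_1", type: "order.created", payload: { orderId: 42 } },
  subscriptions
);
console.log(jobs.length); // 3
```

Because each job is tracked separately, a subscriber that keeps failing ends up with its own retry chain and its own dead letter entry while the other deliveries complete normally.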
Webhooks vs. event streaming (Kafka)
Webhooks and event streaming platforms like Kafka solve overlapping but different problems:
| Dimension | Webhooks | Kafka / event streaming |
|---|---|---|
| Delivery model | Push (sender → receiver) | Pull (consumer reads from topic) |
| Coupling | Sender knows receiver URL | Producers and consumers decoupled |
| Replay | Not natively supported | Full replay from any offset |
| Ordering | No guarantees | Per-partition ordering |
| Best for | Cross-org integrations, SaaS | Internal microservices, high throughput |
Use webhooks when you're integrating with external systems you don't control. Use event streaming when you own both sides and need ordering, replay, or very high throughput. Many architectures use both: Kafka internally, webhooks at the boundary.
Monitoring webhook health
A webhook system without observability is a webhook system that fails silently. Track these metrics:
- Delivery success rate: percentage of first-attempt 2xx responses. A drop signals receiver issues.
- Retry rate: how many events require retries. A spike means something is degrading.
- DLQ depth: events that exhausted all retries. This should trend toward zero.
- Delivery latency: p50/p95/p99 time from event creation to successful delivery.
- Subscriber response time: slow receivers increase your retry queue depth and resource usage.
Alert on sustained DLQ growth and on delivery success rate dropping below a threshold (e.g., 99%). Expose a health dashboard to subscribers so they can self-diagnose. Stripe, GitHub, and Shopify all do this well.
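Several of these metrics can be derived from per-event delivery records. The following is a rough sketch: the record shape ({ attempts, delivered, latencyMs }) and the function names are assumptions, and the percentile uses the simple nearest-rank method rather than whatever a real metrics backend would apply.

```javascript
// Nearest-rank percentile over a list of numbers, for p in (0, 100].
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes delivery records of the assumed shape
// { attempts: number, delivered: boolean, latencyMs?: number }.
function summarize(records) {
  const total = records.length;
  const firstAttemptOk = records.filter(
    (r) => r.delivered && r.attempts === 1
  ).length;
  const retried = records.filter((r) => r.attempts > 1).length;
  const latencies = records.filter((r) => r.delivered).map((r) => r.latencyMs);
  return {
    firstAttemptSuccessRate: total ? firstAttemptOk / total : null,
    retryRate: total ? retried / total : null,
    p50LatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
  };
}
```

An alert rule would then compare firstAttemptSuccessRate against the chosen threshold over a sliding window rather than on a single batch.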
Putting it all together
A production-grade webhook pipeline looks like this:
- Event occurs → payload created and signed with HMAC
- First delivery attempt to subscriber URL over HTTPS
- On failure → exponential backoff with jitter, up to N retries
- Receiver verifies signature, checks idempotency key, processes event
- After max retries → route to dead letter queue
- For multi-subscriber events → fan-out via message broker
- Monitor everything: success rates, latency, DLQ depth
Webhooks are deceptively simple on the surface. The patterns above (at-least-once delivery, HMAC verification, idempotency, dead letter queues, and fan-out) are what separate a toy implementation from one that handles real traffic without losing or duplicating data.