Mar 30, 2025·8 min read

Webhook reliability checklist: retries, idempotency, replay

A practical checklist for keeping inbound and outbound webhooks reliable when partners fail: retries, idempotency, replay logs, and monitoring.

Why webhooks feel unreliable in real projects

A webhook is a simple deal: one system sends an HTTP request to another system when something happens. "Order shipped", "ticket updated", "device went offline". It’s basically a push notification between apps, delivered over the web.

They feel reliable in demos because the happy path is quick and clean. In real work, webhooks sit between systems you don’t control: CRMs, shipping providers, help desks, marketing tools, IoT platforms, even internal apps owned by another team. Outside of payments, you often lose mature delivery guarantees, stable event schemas, and consistent retry behavior.

The first signs are usually confusing:

  • Duplicate events (the same update arrives twice)
  • Missing events (something changed, but you never heard about it)
  • Delays (an update arrives minutes or hours later)
  • Events out of order (a "closed" update arrives before "opened")

Flaky third-party systems make this feel random because failures aren’t always loud. A provider might time out but still process your request. A load balancer might drop a connection after the sender already retried. Or their system might go down briefly, then send a burst of old events all at once.

Imagine a shipping partner that sends "delivered" webhooks. One day your receiver is slow for 3 seconds, so they retry. You get two deliveries, your customer gets two emails, and support is confused. The next day they have an outage and never retry, so "delivered" never arrives and your dashboard stays stuck.

Webhook reliability is less about one perfect request and more about designing for messy reality: retries, idempotency, and the ability to replay and verify what happened later.

The three building blocks: retries, idempotency, replay

Webhooks come in two directions. Inbound webhooks are calls you receive from someone else (a payment provider, CRM, shipping tool). Outbound webhooks are calls you send to your customer or partner when something changes in your system. Both can fail for reasons that have nothing to do with your code.

Retries are what happens after a failure. A sender may retry because it got a timeout, a 500 error, a dropped connection, or no response fast enough. Good retries are expected behavior, not a rare edge case. The goal is to get the event through without flooding the receiver or creating duplicate side effects.

Idempotency is how you make duplicates safe. It means "do it once, even if received twice". If the same webhook arrives again, you detect it and return a success response without running the business action a second time (for example, don’t create a second invoice).

Replay is your recovery button. It’s the ability to reprocess past events on purpose, in a controlled way, after you fix a bug or after a partner has an outage. Replay is different from retries: retries are automatic and immediate, replay is deliberate and often happens hours or days later.

If you want webhook reliability, set a few simple goals and design around them:

  • No lost events (you can always find what arrived or what you tried to send)
  • Safe duplicates (retries and replays don’t double-charge, double-create, or double-email)
  • Clear audit trail (you can answer "what happened?" quickly)

A practical way to support all three is to store every webhook attempt with a status and a unique idempotency key. Many teams build this as a small "webhook inbox/outbox" table.
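One minimal shape for such an inbox table, sketched here with SQLite for illustration (the column names are illustrative, not a standard). The idempotency key doubles as the primary key, so the database itself blocks duplicate inserts:

```python
import sqlite3

# Minimal "webhook inbox" table: one row per received event.
# Column names are illustrative; adapt them to your own schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE webhook_inbox (
        idempotency_key TEXT PRIMARY KEY,  -- provider event ID or derived hash
        provider        TEXT NOT NULL,
        event_type      TEXT NOT NULL,
        raw_body        TEXT NOT NULL,     -- store exactly what arrived
        received_at     TEXT NOT NULL,
        status          TEXT NOT NULL DEFAULT 'received',  -- received/processed/failed
        last_error      TEXT
    )
""")

def record_event(key, provider, event_type, raw_body, received_at):
    """Returns True if the event is new, False if it's a duplicate."""
    try:
        conn.execute(
            "INSERT INTO webhook_inbox "
            "(idempotency_key, provider, event_type, raw_body, received_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (key, provider, event_type, raw_body, received_at),
        )
        return True
    except sqlite3.IntegrityError:
        # Second delivery of the same event hits the primary key:
        # this is exactly the duplicate protection we want.
        return False

first = record_event("evt_123", "shippingco", "delivered", "{}", "2025-03-30T10:00:00Z")
duplicate = record_event("evt_123", "shippingco", "delivered", "{}", "2025-03-30T10:00:05Z")
```

The same table works as an outbox for outbound webhooks if you add per-endpoint attempt columns.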

Inbound webhooks: a receiver flow you can reuse

Most webhook problems happen because the sender and receiver are running on different clocks. Your job as the receiver is to be predictable: acknowledge quickly, record what arrived, and process it safely.

Separate "accept" from "do work"

Start with a flow that keeps the HTTP request fast and moves real work elsewhere. This reduces timeouts and makes retries much less painful.

  • Acknowledge quickly. Return a 2xx as soon as the request is acceptable.
  • Check the basics. Validate content type, required fields, and parsing. If the webhook is signed, verify the signature here.
  • Persist the raw event. Store the body plus the headers you’ll need later (signature, event ID), along with a received timestamp and a status like "received".
  • Queue the work. Create a job for background processing, then return your 2xx.
  • Process with clear outcomes. Mark the event "processed" only after side effects succeed. If it fails, record why and whether it should be retried.
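The accept path above can be sketched framework-free as one function: validate the basics, persist, enqueue, and return a status code. The in-memory dict and queue stand in for the inbox table and a real job queue:

```python
import json
import queue

work_queue = queue.Queue()   # stand-in for a real job queue
inbox = {}                   # stand-in for the inbox table, keyed by event ID

def accept_webhook(body: bytes, headers: dict) -> int:
    """Fast path: check the basics, persist, enqueue, return an HTTP status."""
    # 1. Cheap validation before doing anything expensive.
    if headers.get("Content-Type") != "application/json":
        return 400
    try:
        event = json.loads(body)
    except ValueError:
        return 400
    event_id = event.get("id")
    if not event_id:
        return 400
    # 2. Persist the raw event with a "received" status.
    if event_id not in inbox:
        inbox[event_id] = {"raw": body, "status": "received"}
        # 3. Queue the real work; this handler never does it inline.
        work_queue.put(event_id)
    # 4. Acknowledge quickly; duplicate deliveries also get a 2xx.
    return 202

status = accept_webhook(
    b'{"id": "evt_42", "type": "customer.created"}',
    {"Content-Type": "application/json"},
)
```

Note that a resent duplicate still gets a 202 but is never enqueued twice, so the sender stops retrying without triggering extra work.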

What "respond fast" looks like

A realistic target is responding in under a second. If the sender expects a specific code, use it (many accept 200, some prefer 202). Only return a 4xx when the sender should not retry (like an invalid signature).

Example: a "customer.created" webhook arrives while your database is under load. With this flow, you still store the raw event, enqueue it, and answer 2xx. Your worker can retry later without needing the sender to resend.

Inbound safety checks that don’t break delivery

Security checks are worth doing, but the goal is simple: block bad traffic without blocking real events. A lot of delivery problems come from receivers being too strict or returning the wrong response.

Start by proving the sender. Prefer signed requests (HMAC signature header) or a shared secret token in a header. Verify it before doing heavy work, and fail fast if it’s missing or wrong.

Be careful with status codes because they control retries:

  • Return 401/403 for auth failures so the sender doesn’t retry forever.
  • Return 400 for malformed JSON or missing required fields.
  • Return 5xx only when your service is temporarily unable to accept or process.

IP allowlists can help, but only when the provider has stable, documented IP ranges. If their IPs change often (or they use a large cloud pool), allowlists can quietly drop real webhooks and you may only notice much later.

If the provider includes a timestamp and a unique event ID, you can add replay protection: reject messages that are too old, and track recent IDs to spot duplicates. Keep the time window small, but allow a grace period so clock drift doesn’t break valid requests.

A receiver-friendly security checklist:

  • Validate signature or shared secret before parsing large payloads.
  • Enforce a maximum body size and a short request timeout.
  • Use 401/403 for auth failures, 400 for malformed JSON, and 2xx for accepted events.
  • If you check timestamps, allow a small grace window (for example, a few minutes).
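A common signature scheme is an HMAC over the timestamp plus the body (exact formats vary by provider; the `<timestamp>.<body>` layout below is one widely used convention, not a standard). A sketch of verification with a grace window:

```python
import hashlib
import hmac
import time

SECRET = b"shared-secret"    # assumed: provisioned out of band by the provider
TOLERANCE_SECONDS = 300      # 5-minute grace window for clock drift

def sign(body: bytes, timestamp: int) -> str:
    # HMAC-SHA256 over "<timestamp>.<body>" -- one common convention.
    msg = str(timestamp).encode() + b"." + body
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify(body: bytes, timestamp: int, signature: str, now: int) -> bool:
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old (or too far ahead): possible replay attack
    expected = sign(body, timestamp)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature)

now = int(time.time())
body = b'{"id": "evt_7", "type": "ticket.updated"}'
good = verify(body, now - 10, sign(body, now - 10), now)
stale = verify(body, now - 3600, sign(body, now - 3600), now)
tampered = verify(b'{"id": "evt_8"}', now, sign(body, now), now)
```

Signing the timestamp together with the body is what makes the freshness check trustworthy: an attacker can't re-stamp an old captured payload.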

For logging, keep an audit trail without keeping sensitive data forever. Store the event ID, sender name, receive time, verification result, and a hash of the raw body. If you must store payloads, set a retention limit and mask fields like emails, tokens, or payment details.

Retries that help, not harm

Retries are good when they turn a brief hiccup into a successful delivery. They’re harmful when they multiply traffic, hide real bugs, or create duplicates. The difference is having a clear rule for what to retry, how to space attempts, and when to stop.

As a baseline, retry only when the receiver is likely to succeed later. A useful mental model is: retry on "temporary" failures, don’t retry on "you sent something wrong".

Practical HTTP outcomes:

  • Retry: network timeouts, connection errors, and HTTP 408, 429, 500, 502, 503, 504
  • Don’t retry: HTTP 400, 401, 403, 404, 422
  • Depends: HTTP 409 (sometimes "duplicate", sometimes a real conflict)

Spacing matters. Use exponential backoff with jitter so you don’t create a retry storm when many events fail at once. For example: wait 5s, 15s, 45s, 2m, 5m, then add a small random offset each time.

Also set a maximum retry window and a clear cutoff. Common choices are "keep trying for up to 24 hours" or "no more than 10 attempts". After that, treat it as a recovery problem, not a delivery problem.
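The classification, spacing, and cutoff rules above fit in two small functions. This is a sketch; the multiplier approximates the 5s/15s/45s schedule mentioned earlier, and the cap and jitter range are arbitrary choices to tune:

```python
import random

RETRYABLE = {408, 429, 500, 502, 503, 504}
MAX_ATTEMPTS = 10

def should_retry(status, attempt):
    """status is None for timeouts and dropped connections."""
    if attempt >= MAX_ATTEMPTS:
        return False  # past the ceiling: dead-letter it for human review
    if status is None:
        return True   # outcome unknown (timeout): safe to retry if idempotent
    return status in RETRYABLE

def backoff_seconds(attempt, base=5.0, cap=300.0):
    """Exponential backoff (5s, 15s, 45s, ...) capped, plus up to 1s of jitter."""
    delay = min(base * (3 ** (attempt - 1)), cap)
    return delay + random.uniform(0, 1.0)
```

The jitter matters most when many events fail at once: without it, every failed delivery wakes up at the same instant and hammers the recovering receiver together.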

To make this work day to day, your event record should capture:

  • Attempt count
  • Last error
  • Next attempt time
  • Final status (including a dead-letter state when you stop retrying)

Dead-letter items should be easy to inspect and safe to replay after you fix the underlying issue.

Idempotency patterns that work in practice

Idempotency means you can safely process the same webhook more than once without creating extra side effects. It’s one of the fastest ways to improve reliability, because retries and timeouts will happen even when nobody is doing anything wrong.

Pick a key that stays stable

If the provider gives you an event ID, use it. That’s the cleanest option.

If there’s no event ID, build your own key from stable fields you do have, such as a hash of:

  • provider name + event type + resource ID + timestamp, or
  • provider name + message ID

Store the key plus a small amount of metadata (received time, provider, event type, and the result).

Rules that usually hold up:

  • Treat the key as required. If you can’t build one, quarantine the event instead of guessing.
  • Store keys with a TTL (for example 7 to 30 days) so the table doesn’t grow forever.
  • Save the processing result too (success, failed, ignored) so duplicates get a consistent response.
  • Put a unique constraint on the key so two parallel requests don’t both run.
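Key derivation plus the "consistent response for duplicates" rule can be sketched like this (the dict stands in for a keyed table with a TTL; field names are illustrative):

```python
import hashlib

processed = {}  # key -> stored result; a real system uses a table with a TTL

def derive_key(provider: str, event: dict) -> str:
    """Prefer the provider's event ID; otherwise hash stable fields."""
    if event.get("id"):
        return f"{provider}:{event['id']}"
    stable = f"{provider}|{event['type']}|{event['resource_id']}|{event['timestamp']}"
    return hashlib.sha256(stable.encode()).hexdigest()

def handle_once(provider: str, event: dict) -> str:
    key = derive_key(provider, event)
    if key in processed:
        return processed[key]  # duplicate: same answer, no new side effects
    result = "created"         # stand-in for the real business action
    processed[key] = result    # save the result so duplicates answer consistently
    return result

evt = {"id": "evt_9", "type": "order.paid", "resource_id": "ord_1", "timestamp": "t1"}
first = handle_once("crm", evt)
second = handle_once("crm", evt)
```

In a real database the `if key in processed` check and the insert must be one atomic step (the unique constraint from the rules above), or two parallel deliveries can both pass the check.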

Make the business action idempotent too

Even with a good key table, your real operations must be safe. Example: a "create order" webhook shouldn’t create a second order if the first attempt timed out after the database insert. Use natural business identifiers (external_order_id, external_user_id) and upsert patterns.

Out-of-order events are common. If you receive "user_updated" before "user_created", decide on a rule like "only apply changes if event_version is newer" or "only update if updated_at is later than what we have".
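The "only apply if newer" rule is a few lines once each record carries a version. A sketch with a local dict standing in for the stored row:

```python
# Local state for one record; a real system reads and writes a database row.
state = {"status": "created", "event_version": 1}

def apply_update(update: dict) -> bool:
    """Apply an update only if its version is newer than what we already have."""
    if update["event_version"] <= state["event_version"]:
        return False  # stale or duplicate: ignore it without failing
    state.update(update)
    return True

late = apply_update({"status": "open", "event_version": 1})     # old event, arrives late
fresh = apply_update({"status": "closed", "event_version": 3})  # newer, applied
```

Returning `False` (rather than raising) for stale events matters: the delivery still gets a 2xx, so the sender doesn't retry an event you've deliberately ignored.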

Duplicates with different payloads are the hardest case. Decide upfront what you do:

  • If the key matches but payload differs, treat it as a provider bug and alert.
  • If the key matches and payload differs only in irrelevant fields, ignore it.
  • If you can’t trust the provider, switch to a derived key based on the full payload hash, and handle conflicts as new events.

The goal is simple: one real-world change should produce one real-world outcome, even if you see the message three times.

Replay tools and audit logs for recovery

When a partner system is flaky, reliability is less about perfect delivery and more about fast recovery. A replay tool turns "we lost some events" into a routine fix instead of a crisis.

Start with an event log that tracks the lifecycle of each webhook: received, processed, failed, or ignored. Keep it searchable by time, event type, and a correlation ID so support can answer, "What happened to order 18432?" quickly.

For each event, store enough context to re-run the same decision later:

  • Raw payload and key headers (signature, event ID, timestamp)
  • Normalized fields you extracted
  • Processing result and error message (if any)
  • The workflow or mapping version used at the time
  • Timestamps for receive, start, finish

With that in place, add a "Replay" action for failed events. The button is less important than the guardrails. A good replay flow shows the previous error, what will happen on replay, and whether the event is safe to re-run.

Guardrails that prevent accidental damage:

  • Require a reason note before replay
  • Restrict replay permissions to a small role
  • Re-run through the same idempotency checks as the first attempt
  • Rate-limit replays to avoid a new spike during incidents
  • Optional dry run mode that validates without writing changes

Incidents often involve more than one event, so support replay by time range (for example, "replay all failed events between 10:05 and 10:40"). Log who replayed what, when, and why.
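A time-range replay with the guardrails above can be sketched as a single function (the in-memory list stands in for the event log; in production the re-run goes through the same handler and idempotency checks as the original delivery):

```python
from datetime import datetime

events = [
    {"id": "e1", "status": "failed",    "received_at": datetime(2025, 3, 30, 10, 10)},
    {"id": "e2", "status": "processed", "received_at": datetime(2025, 3, 30, 10, 20)},
    {"id": "e3", "status": "failed",    "received_at": datetime(2025, 3, 30, 11, 0)},
]
replay_log = []

def replay_range(start, end, reason, actor, dry_run=False):
    """Replay failed events in a window; every replay is attributed and logged."""
    if not reason:
        raise ValueError("a reason note is required before replay")
    picked = [e for e in events
              if e["status"] == "failed" and start <= e["received_at"] <= end]
    if dry_run:
        return [e["id"] for e in picked]  # show what would run, write nothing
    for e in picked:
        e["status"] = "processed"  # stand-in for re-running the real handler
        replay_log.append({"event": e["id"], "by": actor, "why": reason})
    return [e["id"] for e in picked]

preview = replay_range(datetime(2025, 3, 30, 10, 5), datetime(2025, 3, 30, 10, 40),
                       "partner outage", "alice", dry_run=True)
replayed = replay_range(datetime(2025, 3, 30, 10, 5), datetime(2025, 3, 30, 10, 40),
                        "partner outage", "alice")
```

Filtering on status as well as time is what keeps a time-range replay safe: already-processed events in the window are skipped instead of run twice.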

Outbound webhooks: a sender flow you can audit

Outbound webhooks fail for boring reasons: a slow receiver, a brief outage, a DNS hiccup, or a proxy that drops long requests. Reliability comes from treating every send as a tracked, repeatable job, not a one-off HTTP call.

A sender flow that stays predictable

Give every event a stable, unique event ID. That ID should stay the same across retries, replays, and even service restarts. If you generate a new ID per attempt, you make deduplication harder for the receiver and auditing harder for you.

Next, sign each request and include a timestamp. The timestamp helps receivers reject very old requests, and signing proves the payload wasn’t changed in transit. Keep the signature rules simple and consistent so partners can implement them without guesswork.

Track deliveries per endpoint, not just per event. If you send the same event to three customers, each destination needs its own attempt history and final status.

A practical flow most teams can implement:

  • Create an event record with event ID, endpoint ID, payload hash, and initial status.
  • Send the HTTP request with a signature, timestamp, and an idempotency key header.
  • Record every attempt (start time, end time, HTTP status, short error message).
  • Retry only on timeouts and 5xx responses, using exponential backoff with jitter.
  • Stop after a clear limit (max attempts or max age), then mark it failed for review.
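The first two steps can be sketched as a function that builds one delivery attempt. Header names here are illustrative, not a standard; the point is that the event ID (and so the idempotency key) is generated once and reused, while only the timestamp and signature change per attempt:

```python
import hashlib
import hmac
import json
import time
import uuid

SIGNING_SECRET = b"endpoint-secret"  # assumed: one secret per destination endpoint

def build_delivery(event_id: str, payload: dict) -> dict:
    """Build the headers and body for one delivery attempt."""
    # sort_keys makes the body byte-identical across attempts, so the
    # payload hash in the event record stays stable too.
    body = json.dumps(payload, sort_keys=True).encode()
    ts = str(int(time.time()))
    sig = hmac.new(SIGNING_SECRET, ts.encode() + b"." + body,
                   hashlib.sha256).hexdigest()
    return {
        "body": body,
        "headers": {
            "X-Event-Id": event_id,          # illustrative header names
            "X-Idempotency-Key": event_id,   # stable across every retry
            "X-Timestamp": ts,
            "X-Signature": sig,
        },
    }

event_id = str(uuid.uuid4())  # generated once, reused for every attempt
attempt1 = build_delivery(event_id, {"type": "ticket.updated", "ticket": 18432})
attempt2 = build_delivery(event_id, {"type": "ticket.updated", "ticket": 18432})
```

Recording each returned attempt (start/end time, status, error) against the same event ID gives you the per-endpoint history the flow calls for.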

That idempotency key header matters even when you’re the sender. It gives the receiver a clean way to dedupe if they processed the first request but your client never got the 200 response.

Finally, make failures visible. "Failed" shouldn’t mean "lost". It should mean "paused with enough context to safely replay".

Example: a flaky partner system and a clean recovery

Your support app sends ticket updates to a partner system so their agents see the same status. Every time a ticket changes (assigned, priority updated, closed), you post a webhook event like ticket.updated.

One afternoon the partner’s endpoint starts timing out. Your first delivery attempt waits, hits your timeout limit, and you treat it as "unknown" (it might have reached them, it might not). A good retry strategy then retries with backoff instead of firing repeats every second. The event stays in a queue with the same event ID, and each attempt is recorded.

Now the painful part: if you don’t use idempotency, the partner may process duplicates. Attempt #1 might have reached them, but their response never made it back. Attempt #2 arrives later and creates a second "Ticket closed" action, sending two emails or creating two timeline entries.

With idempotency, each delivery includes an idempotency key derived from the event (often just the event ID). The partner stores that key for a period and answers "already processed" for repeats. You stop guessing.

When the partner is finally back, replay is how you fix the one update that truly went missing (say, a priority change during the outage). You pick the event from your audit log and replay it once, with the same payload and idempotency key, so it’s safe even if they already got it.

During the incident, your logs should make the story obvious:

  • Event ID, ticket ID, event type, and payload version
  • Attempt number, timestamps, and next retry time
  • Timeout vs non-2xx response vs success
  • Idempotency key sent, and whether the partner reported "duplicate"
  • A replay record showing who replayed it and the final result

Common mistakes and traps to avoid

Most webhook incidents aren’t caused by one big bug. They come from small choices that quietly break reliability when traffic spikes or a third party gets flaky.

The traps that show up in postmortems:

  • Doing slow work inside the request handler (database writes, API calls, file uploads) until the sender times out and retries
  • Assuming providers never send duplicates, then double-charging, double-creating orders, or sending two emails
  • Returning the wrong status codes (200 even when you didn’t accept the event, or 500 for bad data that will never succeed on retry)
  • Shipping without a correlation ID, event ID, or request ID, then spending hours matching logs to customer reports
  • Retrying forever, which builds a backlog and turns a partner outage into your own outage

A simple rule holds up: acknowledge fast, then process safely. Validate only what you need to decide whether to accept the event, store it, then do the rest asynchronously.

Status codes matter more than people expect:

  • Use 2xx only when you’ve stored the event (or queued it) and you’re confident it will be handled.
  • Use 4xx for invalid input or failed auth so the sender stops retrying.
  • Use 5xx only for temporary problems on your side.

Set a retry ceiling. Stop after a fixed window (like 24 hours) or a fixed number of attempts, then mark the event as "needs review" so a human can decide what to replay.

Quick checklist and next steps

Webhook reliability is mostly about repeatable habits: accept quickly, dedupe aggressively, retry with care, and keep a replay path.

Inbound (receiver) quick checks

  • Return a fast 2xx once the request is safely stored (do slow work async).
  • Store enough of the event to prove what you received (and debug later).
  • Require an idempotency key (or derive one from provider + event ID) and enforce it in the database.
  • Use 4xx for bad signature or invalid schema, and 5xx only for real server problems.
  • Track processing status (received, processed, failed) plus the last error message.

Outbound (sender) quick checks

  • Assign a unique event ID per event, and keep it stable across attempts.
  • Sign every request and include a timestamp.
  • Define a retry policy (backoff, max attempts, and when to stop) and stick to it.
  • Track per-endpoint state: last success, last failure, consecutive failures, next retry time.
  • Log every attempt with enough detail for support and audits.

For ops, decide upfront what you will replay (single event, batch by time range/status, or both), who can do it, and what your dead-letter review routine looks like.

If you want to build these pieces without wiring everything by hand, a no-code platform like AppMaster (appmaster.io) can be a practical fit: you can model webhook inbox/outbox tables in PostgreSQL, implement retry and replay flows in a visual Business Process Editor, and ship an internal admin panel to search and re-run failed events when partners get flaky.

FAQ

Why do webhooks feel reliable in demos but break in real projects?

Webhooks sit between systems you don’t control, so you inherit their timeouts, outages, retries, and schema changes. Even when your code is correct, you can still see duplicates, missing events, delays, and out-of-order delivery.

What’s the simplest way to make inbound webhooks reliable?

Design for retries and duplicates from day one. Store every incoming event, respond with a fast 2xx once it’s safely recorded, and process it asynchronously with an idempotency key so repeated deliveries don’t repeat side effects.

How fast should my webhook endpoint respond?

You should acknowledge quickly after basic validation and storage, usually in under a second. If you do slow work inside the request, senders time out and retry, which increases duplicates and makes incidents harder to untangle.

What does idempotency mean for webhooks in plain terms?

Treat idempotency as “do the business action once, even if the message arrives multiple times.” You enforce it by using a stable idempotency key (often the provider’s event ID), storing it, and returning success for duplicates without running the action again.

What should I use as an idempotency key if the provider doesn’t give an event ID?

Use the provider’s event ID if it exists. If it doesn’t, derive a key from stable fields you trust, and avoid fields that can change between retries. If you can’t build a stable key, quarantine the event for review instead of guessing.

Which HTTP status codes should I return so retries behave correctly?

Return 4xx for problems the sender can’t fix by retrying, such as failed authentication or malformed payloads. Use 5xx only for temporary problems on your side. Be consistent, because the status code often controls whether the sender retries.

What’s a safe retry policy for outbound webhooks?

Retry on timeouts, connection errors, and temporary server responses like 408, 429, and 5xx. Use exponential backoff with jitter and a clear cutoff, such as a max attempt count or a max age, then move the event to a “needs review” state.

What’s the difference between retries and replay?

Replay is a deliberate reprocessing of past events after you fix a bug or recover from an outage. Retries are automatic and immediate. Good replay needs an event log, safe idempotency checks, and guardrails so you don’t accidentally duplicate work.

How do I handle out-of-order webhook events like “closed” arriving before “opened”?

Assume you’ll get out-of-order events and decide a rule that matches your domain. A common approach is to apply updates only when an event version or timestamp is newer than what you’ve already stored, so late arrivals don’t overwrite current state.

How can I implement an audit trail and replay tool without building everything from scratch?

Build a simple webhook inbox/outbox table and a small admin view to search, inspect, and replay failed events. In AppMaster, you can model these tables in PostgreSQL, implement dedupe, retry, and replay flows in the Business Process Editor, and ship an internal panel for support without hand-coding the whole system.
