Webhooks fail in specific, predictable ways. The handler that works fine in development fails at 3am when the marketing team is running a Black Friday campaign and HubSpot is firing 10,000 contact updates per hour at your endpoint. The database transaction that takes 800ms in normal conditions takes 8 seconds under load. The endpoint returns a 504, HubSpot retries, and you process the same event 400 times.
None of this is theoretical. If you have ever maintained a webhook integration through a high-traffic period, you have probably hit most of these failure modes. This guide covers how to build handlers that do not fail silently — or if they do fail, fail safely.
The Core Pattern: Respond Immediately, Process Async
This is the single most important principle in webhook handler design and the one most often violated by first implementations.
When a webhook sender (HubSpot, Stripe, Klaviyo, Shopify) delivers an event to your endpoint, it waits for a response. The timeout window is short, typically 5-10 seconds, after which the sender considers the delivery failed and queues a retry. Your handler must return a 2xx response (200 in the examples here) within that window, regardless of how long the actual processing takes.
The naive implementation:
# DO NOT DO THIS
@app.route('/webhooks/hubspot', methods=['POST'])
def hubspot_webhook():
    events = request.get_json()
    for event in events:
        # This might take 2-30 seconds per event
        process_event_synchronously(event)
    return '', 200
When process_event_synchronously takes longer than the sender’s timeout, you get a retry. You now process the event twice. If the processing involves creating a contact in your database, you now have duplicate contacts.
The correct pattern:
from flask import Flask, request
from redis import Redis
from rq import Queue

app = Flask(__name__)
redis_conn = Redis()
# Use the same queue name the worker listens on
task_queue = Queue('webhooks', connection=redis_conn)

@app.route('/webhooks/hubspot', methods=['POST'])
def hubspot_webhook():
    # Validate signature BEFORE anything else
    if not validate_hubspot_signature(request):
        return '', 401

    events = request.get_json()
    for event in events:
        # Enqueue each event for async processing
        task_queue.enqueue(
            process_hubspot_event,
            event,
            job_timeout=300
        )

    # Acknowledge immediately, before any processing happens
    return '', 200
The async worker (process_hubspot_event) runs in a separate process and can take as long as it needs. The webhook sender gets its 200, considers the delivery successful, and moves on. Your worker processes the events at its own pace.
This pattern requires a queue (Redis with RQ, Celery, BullMQ for Node.js, AWS SQS) and worker processes. It is more infrastructure than a synchronous handler. It is also the only architecture that is reliable at any meaningful scale.
HMAC Signature Validation: Not Optional
Every major webhook sender signs its payloads. Skipping signature validation is a significant security problem — your endpoint will accept and process any HTTP POST that reaches it, including malicious payloads crafted to trigger your automation in unintended ways.
HMAC validation works like this: the sender computes an HMAC-SHA256 of the request body (sometimes including the URL and timestamp) using a shared secret. You recompute the same HMAC using your copy of the secret and compare. If they match, the request is authentic. If they do not, reject it.
Each platform has a slightly different signature scheme. Here is the implementation for three common platforms:
import base64
import hashlib
import hmac
import os
import time

# HubSpot v3 signature validation
def validate_hubspot_signature(request):
    client_secret = os.environ['HUBSPOT_CLIENT_SECRET']
    timestamp = request.headers.get('X-HubSpot-Request-Timestamp')
    signature = request.headers.get('X-HubSpot-Signature-v3')
    # Reject missing or malformed headers instead of raising
    if not signature or not timestamp or not timestamp.isdigit():
        return False
    # Reject if timestamp (in milliseconds) is more than 5 minutes old
    if abs(time.time() * 1000 - int(timestamp)) > 300000:
        return False
    body = request.get_data(as_text=True)
    source_string = f"{request.method}{request.url}{body}{timestamp}"
    # The v3 signature is the base64-encoded HMAC-SHA256 digest, not hex
    expected = base64.b64encode(
        hmac.new(
            client_secret.encode('utf-8'),
            source_string.encode('utf-8'),
            hashlib.sha256
        ).digest()
    ).decode('utf-8')
    # Compare as bytes so a non-ASCII header value cannot raise TypeError
    return hmac.compare_digest(expected.encode('utf-8'), signature.encode('utf-8'))

# Stripe signature validation
def validate_stripe_signature(request):
    stripe_secret = os.environ['STRIPE_WEBHOOK_SECRET']
    signature_header = request.headers.get('Stripe-Signature')
    if not signature_header:
        return False
    body = request.get_data(as_text=True)
    # Parse into (key, value) pairs; a dict would collapse repeated v1 entries
    pairs = [part.split('=', 1) for part in signature_header.split(',') if '=' in part]
    timestamp = next((v for k, v in pairs if k == 't'), None)
    signatures = [v for k, v in pairs if k == 'v1']
    if not timestamp or not timestamp.isdigit() or not signatures:
        return False
    # Reject if timestamp (in seconds) is more than 5 minutes old
    if abs(time.time() - int(timestamp)) > 300:
        return False
    signed_payload = f"{timestamp}.{body}"
    expected = hmac.new(
        stripe_secret.encode('utf-8'),
        signed_payload.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()
    return any(
        hmac.compare_digest(expected.encode('utf-8'), sig.encode('utf-8'))
        for sig in signatures
    )

# Klaviyo signature validation
def validate_klaviyo_signature(request):
    klaviyo_secret = os.environ['KLAVIYO_WEBHOOK_SECRET']
    signature = request.headers.get('X-Klaviyo-Signature')
    if not signature:
        return False
    body = request.get_data(as_text=True)
    expected = base64.b64encode(
        hmac.new(
            klaviyo_secret.encode('utf-8'),
            body.encode('utf-8'),
            hashlib.sha256
        ).digest()
    ).decode('utf-8')
    return hmac.compare_digest(expected.encode('utf-8'), signature.encode('utf-8'))
Two implementation requirements that are frequently missed:
Use hmac.compare_digest instead of == for string comparison. Regular string comparison is vulnerable to timing attacks where an attacker can determine how many characters of the signature are correct based on response time. hmac.compare_digest runs in constant time.
Validate the timestamp and reject stale requests. HubSpot and Stripe both include a timestamp in the signature. Replaying a valid signed request from an hour ago is an attack. Reject requests where the timestamp is more than 5 minutes old.
Idempotency and Deduplication
Webhook senders retry on failure. Your endpoint will receive the same event multiple times. Processing it twice must produce the same result as processing it once — this is idempotency, and it is not optional.
The approach depends on what processing the event triggers.
For database writes, the simplest pattern is an idempotency table:
from datetime import datetime, timezone

def process_hubspot_event(event):
    event_id = event.get('eventId')

    # Check if we have already processed this event
    existing = db.query(
        "SELECT id FROM processed_webhooks WHERE event_id = %s",
        (event_id,)
    ).fetchone()
    if existing:
        # Already processed: skip without error
        logger.info(f"Duplicate event {event_id}, skipping")
        return

    # Process the event
    apply_event(event)

    # Record that we processed it
    db.execute(
        "INSERT INTO processed_webhooks (event_id, processed_at) VALUES (%s, %s)",
        (event_id, datetime.now(timezone.utc))
    )
    db.commit()
The processed_webhooks table needs a unique index on event_id to prevent race conditions where two workers attempt to process the same event simultaneously. In PostgreSQL, use INSERT ... ON CONFLICT DO NOTHING to make the deduplication atomic.
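The atomic variant can be sketched as follows. This is a minimal illustration, not the handler above: it uses the standard library's sqlite3 module (which accepts the same ON CONFLICT ... DO NOTHING clause in SQLite 3.24+) so it runs without a database server, and process_once, SCHEMA, and the apply_event callback signature are hypothetical names introduced for the example.

```python
import sqlite3

# Assumed schema: the primary key provides the unique index on event_id.
SCHEMA = "CREATE TABLE IF NOT EXISTS processed_webhooks (event_id TEXT PRIMARY KEY)"


def process_once(conn, event_id, apply_event):
    """Apply an event at most once; return True if this call did the work."""
    with conn:  # one transaction: commit on success, roll back on exception
        cur = conn.execute(
            "INSERT INTO processed_webhooks (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            # The marker already exists: an earlier delivery or another
            # worker has claimed this event.
            return False
        # Runs in the same transaction as the marker insert, so a failure
        # here rolls the marker back and a later retry can try again.
        apply_event(conn)
    return True
```

Because the marker insert and the event's own writes share one transaction, a crash partway through leaves neither behind. This only holds when apply_event writes to the same database; external side effects such as API calls still need their own idempotency keys.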
For events that trigger external API calls — creating a contact in your CRM, sending an email — use the sender’s event ID as an idempotency key in those API calls if the downstream system supports it.
The event ID scheme varies by platform. HubSpot includes an eventId field in webhook payloads. Stripe uses the event id field. Klaviyo includes a unique event identifier. If a platform does not provide a stable unique event ID, hash the payload — though be aware that hash-based deduplication fails for events that are genuinely identical but distinct (two purchases of the same product by the same user in the same second).
Retry Behavior of Major Platforms
Understanding how each platform retries failed deliveries tells you how long your deduplication window needs to be and what your worst-case duplicate event rate looks like.
HubSpot: Retries failed deliveries up to 10 times with exponential backoff starting at 5 minutes. Maximum total retry window is approximately 48 hours. HubSpot considers a delivery failed if your endpoint returns a non-2xx status or does not respond within 10 seconds.
Stripe: Retries failed webhooks over 3 days with exponential backoff. The retry schedule is roughly: immediately, 1 hour, 2 hours, 4 hours, 8 hours, 16 hours, 24 hours, 48 hours, 72 hours. Stripe considers a delivery failed on non-2xx or no response within 30 seconds — longer than most platforms.
Klaviyo: Retries failed deliveries for up to 72 hours with exponential backoff. Maximum of 20 retry attempts. 10-second timeout for your endpoint’s response.
Shopify: Retries up to 19 times over 48 hours. Shopify will eventually disable a webhook endpoint that consistently fails — you will receive an email warning after a high failure rate and the webhook subscription will be deleted if failures continue.
The practical implication: your deduplication store needs to retain event IDs for at least 72 hours, the longest retry window of any common platform. A Redis key with a 96-hour TTL is a reasonable implementation.
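One way to implement that key is to claim it atomically with a single SET using the NX and EX options, rather than a separate read followed by a write. The sketch below assumes a redis-py style client, whose set(key, value, nx=True, ex=seconds) returns True when the key was created and None when it already existed. claim_event, release_event, and InMemoryStore are hypothetical names; the in-memory class exists only so the example runs without a Redis server.

```python
import time

DEDUP_TTL_SECONDS = 96 * 60 * 60  # 96 hours = 345600 seconds


def dedup_key(event_id):
    return f"webhook:processed:{event_id}"


def claim_event(client, event_id, ttl=DEDUP_TTL_SECONDS):
    """Return True only for the first caller to claim event_id within the TTL."""
    # With nx=True the existence check and the write are a single command,
    # so two workers cannot both see "not processed yet".
    return bool(client.set(dedup_key(event_id), "1", nx=True, ex=ttl))


def release_event(client, event_id):
    """Drop the claim after a failed attempt so a retry can process the event."""
    client.delete(dedup_key(event_id))


class InMemoryStore:
    """Hypothetical stand-in that mimics the two client calls used above."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set(self, key, value, nx=False, ex=None):
        now = self._clock()
        expires_at = self._expiry.get(key)
        if nx and expires_at is not None and expires_at > now:
            return None  # key is still live, so the claim is refused
        self._expiry[key] = now + ex if ex is not None else float("inf")
        return True

    def delete(self, key):
        return 1 if self._expiry.pop(key, None) is not None else 0
```

A worker would call claim_event before doing any work, return early when it gets False, and call release_event in its exception handler before re-raising. Claiming first trades one risk for another: a worker that dies between claim and release leaves the event marked until the TTL expires, so failed jobs still need review.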
Queue-Based Processing and Worker Pattern
The async pattern described above requires workers consuming from a queue. Here is a complete implementation using Python’s RQ library:
# worker.py
import os
import logging
from redis import Redis
from rq import Worker, Queue

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

redis_conn = Redis.from_url(os.environ['REDIS_URL'])
queues = [Queue('webhooks', connection=redis_conn)]

def process_hubspot_event(event):
    event_id = event.get('eventId')

    # Deduplication check
    dedup_key = f"webhook:processed:{event_id}"
    if redis_conn.get(dedup_key):
        logger.info(f"Duplicate event {event_id} skipped")
        return

    try:
        event_type = event.get('subscriptionType')
        if event_type == 'contact.propertyChange':
            handle_contact_property_change(event)
        elif event_type == 'deal.creation':
            handle_deal_creation(event)
        elif event_type == 'contact.creation':
            handle_contact_creation(event)
        else:
            logger.warning(f"Unhandled event type: {event_type}")

        # Mark as processed: 96 hour TTL covers all retry windows
        redis_conn.setex(dedup_key, 345600, '1')
    except Exception as e:
        logger.error(f"Failed to process event {event_id}: {e}")
        raise  # Re-raise so RQ marks the job as failed

if __name__ == '__main__':
    worker = Worker(queues, connection=redis_conn)
    worker.work()
The raise on failure is intentional. When a job raises an exception, RQ moves it to the failed job registry rather than silently discarding it. Failed jobs are visible in the RQ dashboard and can be retried manually or automatically.
Dead Letter Queues and Monitoring
A dead letter queue (DLQ) holds events that failed processing after all retry attempts. It is the safety net that prevents data loss when processing fails for reasons that cannot be automatically recovered.
Every webhook processing pipeline needs a DLQ and a process for reviewing it. Without one, failed events disappear silently and you discover the data loss weeks later when a metric does not add up.
In RQ, the failed job registry serves this function. In AWS SQS, you configure a DLQ per queue. In Kafka, you can route failed messages to a separate topic.
The monitoring requirements for a webhook integration:
Queue depth — how many jobs are waiting to be processed. A growing queue indicates your workers cannot keep up with inbound volume. Alert when queue depth exceeds a threshold that represents meaningful lag.
Job failure rate — what percentage of jobs are failing. A non-zero failure rate is normal during deployments or transient service issues. A sustained failure rate above 1-2% indicates a systematic problem.
Processing latency — how long between webhook receipt and job completion. Important for time-sensitive workflows like triggered emails after user actions.
DLQ size — how many events have exhausted all retries. A growing DLQ requires investigation.
# Example monitoring metrics to emit
import time
import statsd

stats = statsd.StatsClient('localhost', 8125)

def process_hubspot_event(event):
    start_time = time.time()
    try:
        # ... processing logic ...
        stats.incr('webhooks.hubspot.processed')
        stats.timing('webhooks.hubspot.processing_time',
                     (time.time() - start_time) * 1000)
    except Exception:
        stats.incr('webhooks.hubspot.failed')
        raise
Timeout Handling
Your processing code makes external API calls. Those calls can hang. A webhook worker that is waiting for a downstream API that stopped responding holds a worker thread indefinitely and eventually depletes your worker pool.
Set explicit timeouts on all external calls within webhook handlers:
import requests
from requests.exceptions import Timeout, RequestException

def create_crm_contact(contact_data):
    try:
        response = requests.post(
            'https://api.hubapi.com/crm/v3/objects/contacts',
            json=contact_data,
            headers=get_hubspot_headers(),
            timeout=(5, 30)  # 5s connect timeout, 30s read timeout
        )
        response.raise_for_status()
        return response.json()
    except Timeout:
        logger.error("HubSpot API timed out during contact creation")
        raise  # Let the queue retry
    except RequestException as e:
        logger.error(f"HubSpot API error: {e}")
        raise
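The timeout tuple is (connect_timeout, read_timeout). The connect timeout bounds how long the client waits to establish the connection. The read timeout bounds how long it waits for the server to send data once connected; it applies to each wait between bytes rather than to the whole response, so a slow but steady response can still take longer than 30 seconds in total. Without an explicit timeout, requests in Python will wait indefinitely.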
FAQ
How do we handle webhook payloads that are too large to process safely?
Some platforms send bulk webhook payloads — HubSpot can send up to 100 events in a single HTTP request. If processing all events in a single job creates memory pressure or long processing times, split the payload on receipt. Deserialize the payload in the HTTP handler, iterate over events, and enqueue each as an individual job. This also improves failure isolation — if one event in a batch fails processing, it goes to the DLQ without affecting the others.
What is the right way to test webhook handlers locally?
Use a tunneling tool (ngrok, Cloudflare Tunnel, or localtunnel) to expose your local server to the internet, then configure the webhook sender to deliver to your tunnel URL. This is better than mocking because it exercises the actual HTTP request path including headers, signature validation, and timing. For automated testing, write integration tests that generate correctly signed webhook payloads and POST them directly to your handler — you do not need the actual sender involved. Maintain a fixture library of real webhook payloads from each platform, captured during development, for regression testing.
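For the automated-test half of that answer, a small signing helper is usually enough. The sketch below produces a header in the t=...,v1=... format that the Stripe validator earlier in this guide parses, and includes a matching verifier so the round trip can be checked in isolation. sign_stripe_style, verify_stripe_style, and the whsec_test secret are hypothetical names and values for illustration only.

```python
import hashlib
import hmac
import json
import time


def _digest(secret, timestamp, body):
    signed_payload = f"{timestamp}.{body}".encode("utf-8")
    return hmac.new(secret.encode("utf-8"), signed_payload, hashlib.sha256).hexdigest()


def sign_stripe_style(secret, body, timestamp=None):
    """Build a 't=<unix seconds>,v1=<hex digest>' header value for a test payload."""
    ts = str(int(time.time()) if timestamp is None else int(timestamp))
    return f"t={ts},v1={_digest(secret, ts, body)}"


def verify_stripe_style(secret, body, header, now=None, tolerance=300):
    """Check the signature and reject timestamps outside the tolerance window."""
    pairs = [part.split("=", 1) for part in header.split(",") if "=" in part]
    timestamp = next((v for k, v in pairs if k == "t"), None)
    signatures = [v for k, v in pairs if k == "v1"]
    if not timestamp or not timestamp.isdigit() or not signatures:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False
    expected = _digest(secret, timestamp, body).encode("utf-8")
    return any(hmac.compare_digest(expected, sig.encode("utf-8")) for sig in signatures)
```

In an integration test, the header from sign_stripe_style would be attached to a POST made through the web framework's test client, with the handler's secret set to the same test value. Passing an old timestamp is a cheap way to exercise the stale-request branch.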
How do we handle a webhook sender that does not provide event IDs for deduplication?
Fall back to content hashing. Serialize the event payload (excluding any delivery-specific metadata like timestamps that change on retry) and hash it. Use the hash as the deduplication key. This is not perfect — two genuinely identical events with different intentions cannot be distinguished — but it prevents the most common duplication scenario, which is retry-induced duplicates where the payload is byte-for-byte identical. Document the limitation so future engineers understand the deduplication approach and its edge cases.
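A minimal sketch of that fallback, using only the standard library. The field names in VOLATILE_FIELDS are assumptions; replace them with whatever delivery metadata your sender changes between attempts. Only top-level fields are stripped here.

```python
import hashlib
import json

# Assumed names of top-level fields that change between delivery attempts.
VOLATILE_FIELDS = frozenset({"attemptNumber", "deliveredAt"})


def content_dedup_key(event, volatile_fields=VOLATILE_FIELDS):
    """Hash the stable part of an event into a hex deduplication key."""
    stable = {k: v for k, v in event.items() if k not in volatile_fields}
    # Sorted keys and fixed separators make the serialization canonical, so
    # the same logical payload hashes identically regardless of key order.
    canonical = json.dumps(
        stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The resulting key can be stored exactly like a sender-provided event ID. The limitation described above still applies: two distinct events with identical stable content produce the same key.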
When should we validate webhook payload structure, and what should we do when validation fails?
Validate before enqueueing. If the payload is structurally invalid — missing required fields, incorrect data types — there is no point in enqueueing it since processing will fail. Log the invalid payload, return 200 to the sender (returning 4xx causes retries, and retries of a structurally invalid payload will always fail), and alert your engineering team. The 200 response is correct here — you successfully received the webhook; the problem is with the payload content, not the delivery. Distinguish between validation failures (return 200, log for investigation) and authentication failures (return 401, which signals to the sender that there is a credential problem).
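The distinction between the two failure types can be sketched as a small triage step that runs before enqueueing. REQUIRED_FIELDS is an assumed minimal schema for a HubSpot-style event, and validation_errors and triage are hypothetical helpers, not part of any platform SDK.

```python
# Assumed minimal schema; adjust field names and types to the real payload.
REQUIRED_FIELDS = {"eventId": int, "subscriptionType": str, "objectId": int}


def validation_errors(event, required=REQUIRED_FIELDS):
    """Return a list of problems; an empty list means structurally valid."""
    if not isinstance(event, dict):
        return ["event is not a JSON object"]
    errors = []
    for name, expected in required.items():
        if name not in event:
            errors.append(f"missing field: {name}")
            continue
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly
        # (this schema has no boolean fields).
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{name} has type {type(value).__name__}, expected {expected.__name__}"
            )
    return errors


def triage(events, signature_ok):
    """Return (status_code, events_to_enqueue, rejected_events_with_reasons)."""
    if not signature_ok:
        return 401, [], []  # authentication failure: signal a credential problem
    valid, rejected = [], []
    for event in events:
        errors = validation_errors(event)
        if errors:
            rejected.append((event, errors))  # log and alert, do not enqueue
        else:
            valid.append(event)
    return 200, valid, rejected  # delivery succeeded even if some content is bad
```

The handler would enqueue the valid events, log the rejected ones with their reasons, and return the status code. Valid events in a mixed batch are still processed, which keeps one malformed event from blocking the rest.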
How do we handle webhook processing during deployments when workers restart?
Design for in-flight job safety. Jobs that are actively being processed when a worker restarts are either lost (if the worker crashes without acknowledgment) or requeued (if the queue tracks acknowledgment separately from dequeueing). In RQ, jobs are moved to the failed state on worker crash, which means they can be retried. In SQS, unacknowledged messages return to the queue after the visibility timeout expires. The key is that your idempotency implementation handles re-processing correctly: a job that was 80% complete when the worker crashed and gets requeued should re-run cleanly, not produce partial or duplicated results. For operations that are not naturally idempotent, implement checkpointing or use database transactions to ensure partial progress is either committed or rolled back atomically.

