Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice

A Stripe webhook fires. Your server processes it, charges the customer, returns a 200. Stripe never gets the 200 because of a 50ms TCP blip. Stripe retries. You charge the customer again.

That is the entire story behind a class of bug that lives in nearly every Node.js codebase that talks to a queue, a webhook provider, or any system with at-least-once delivery. The fix is not “be more careful.” The fix is idempotency keys, and it is roughly 30 lines of middleware plus one Postgres table. Here is exactly what the code looks like, why each line is there, and how to convince yourself the fix actually works.

What “idempotent” really means here

An endpoint is idempotent when calling it N times with the same input produces the same effect as calling it once. GET is idempotent by definition in HTTP, as long as your handler has no side effects. POST /charges is not, unless you make it that way.

The trick is that “the same input” is doing a lot of work in that sentence. Two webhook deliveries with identical bodies are the same input. Two browser clicks of “Pay” two seconds apart that happen to send identical bodies are also the same input — and you almost certainly do not want to treat them as one charge.

So the contract has to be: the caller picks a key, and that key tells you “this is the same logical operation as the one I sent before.” For webhooks the provider supplies it: Stripe puts the event ID (evt_...) in the id field of the JSON body, while the Stripe-Signature header carries only a timestamp and signatures; Svix sends svix-id; GitHub sends X-GitHub-Delivery. For your own clients, generate a UUID per user action and send it as Idempotency-Key. Stripe’s own API works exactly this way for a reason.
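
On the client side, the important detail is that the key is created once per user action and reused on every retry of that action. The sketch below shows one way to do that; the helper name postWithIdempotencyKey and the FetchLike type are illustrative, not part of any library, and the retry policy (three attempts, no backoff) is an assumption kept short for clarity.

```typescript
import { randomUUID } from 'node:crypto';

// Minimal shape of the fetch function we need, so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ status: number }>;

// Hypothetical helper: one key per user action, reused across every retry.
export async function postWithIdempotencyKey(
  doFetch: FetchLike,
  url: string,
  payload: unknown,
  maxAttempts = 3,
): Promise<{ status: number; key: string }> {
  const key = randomUUID(); // generated once, outside the retry loop
  const body = JSON.stringify(payload);
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await doFetch(url, {
        method: 'POST',
        headers: { 'content-type': 'application/json', 'idempotency-key': key },
        body,
      });
      return { status: res.status, key };
    } catch (err) {
      lastError = err; // network failure: retry with the same key and body
    }
  }
  throw lastError;
}
```

Generating the key inside the loop would defeat the purpose: each retry would look like a brand-new operation to the server.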

The shape of the bug

Before the fix, here is the kind of log you find when a customer support ticket says “I was charged twice”:

12:04:17.812  POST /webhooks/stripe  evt_1NxYz7  charge.succeeded  -> 200 (1840ms)
12:04:19.701  POST /webhooks/stripe  evt_1NxYz7  charge.succeeded  -> 200 (1612ms)

Same event ID. Two successful processings. Two rows in your charges table. Two emails sent. One refund to issue.

The first request’s response was slow enough that Stripe’s HTTP client timed out and retried. Your handler had already committed the side effects when the retry arrived. Without an idempotency check, the retry happily re-runs everything.

The 30 lines

The middleware below does four things in order: hash the request, look up the key in Postgres with SELECT ... FOR UPDATE, return the cached response if we have already processed this key, or insert a placeholder row and run the handler. Every step matters.

// idempotency.ts
import { createHash } from 'node:crypto';
import type { Request, Response, NextFunction } from 'express';
import { pool } from './db';

const TTL_HOURS = 24;

export function idempotency(headerName = 'idempotency-key') {
  return async (req: Request, res: Response, next: NextFunction) => {
    const key = req.header(headerName);
    if (!key) return next(); // opt-in; only enforce when client supplies a key

    const fingerprint = createHash('sha256')
      .update(req.method + req.path + JSON.stringify(req.body))
      .digest('hex');

    const client = await pool.connect();
    try {
      await client.query('BEGIN');

      // FOR UPDATE locks the row only if it already exists.
      const existing = await client.query(
        `SELECT status, response_code, response_body, request_fingerprint
           FROM idempotency_keys
          WHERE key = $1
          FOR UPDATE`,
        [key],
      );

      if (existing.rowCount) {
        const row = existing.rows[0];
        await client.query('COMMIT');

        if (row.request_fingerprint !== fingerprint) {
          return res.status(422).json({ error: 'idempotency_key_reuse' });
        }
        if (row.status === 'in_progress') {
          return res.status(409).json({ error: 'request_in_progress' });
        }
        return res.status(row.response_code).json(row.response_body);
      }

      await client.query(
        `INSERT INTO idempotency_keys (key, request_fingerprint, status, expires_at)
         VALUES ($1, $2, 'in_progress', now() + interval '${TTL_HOURS} hours')`,
        [key, fingerprint],
      );
      await client.query('COMMIT');
    } catch (e) {
      await client.query('ROLLBACK').catch(() => {});
      // 23505 = unique_violation: a concurrent first delivery won the INSERT.
      if ((e as { code?: string }).code === '23505') {
        return res.status(409).json({ error: 'request_in_progress' });
      }
      return next(e);
    } finally {
      client.release(); // runs on every path, including the early returns above
    }

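    // Capture the handler's JSON response and persist it so retries can replay it.
    // The UPDATE is fire-and-forget, so it may land after the response is sent.
    // Only res.json is wrapped; res.send and res.sendStatus are not captured.
    const originalJson = res.json.bind(res);
    res.json = (body: unknown) => {
      pool.query(
        `UPDATE idempotency_keys
            SET status = 'completed',
                response_code = $2,
                response_body = $3
          WHERE key = $1`,
        // Serialize explicitly: pg would send a JS array as a Postgres array, not jsonb.
        [key, res.statusCode, JSON.stringify(body)],
      ).catch((err) => console.error('[idempotency] persist failed', err));
      return originalJson(body);
    };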

    next();
  };
}

Wire it in front of the handlers that mutate things:

import express from 'express';
import { idempotency } from './idempotency';

const app = express();
app.use(express.json());

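// Stripe puts the event id (evt_...) in the JSON body, not in a header, so copy
// it into the header the middleware reads. Providers that send a delivery header
// (svix-id, x-github-delivery) can pass that header name to idempotency() directly.
app.post('/webhooks/stripe',
  (req, _res, next) => { req.headers['idempotency-key'] ??= req.body?.id; next(); },
  idempotency('idempotency-key'),
  stripeWebhookHandler,
);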

// For first-party clients, expect them to send Idempotency-Key.
app.post('/api/charges',
  idempotency('idempotency-key'),
  createChargeHandler,
);

And the schema, because the table design is doing real work:

CREATE TABLE idempotency_keys (
  key                 text PRIMARY KEY,
  request_fingerprint text NOT NULL,
  status              text NOT NULL CHECK (status IN ('in_progress', 'completed')),
  response_code       int,
  response_body       jsonb,
  created_at          timestamptz NOT NULL DEFAULT now(),
  expires_at          timestamptz NOT NULL
);

CREATE INDEX idempotency_keys_expires_at_idx ON idempotency_keys (expires_at);

That is the whole change. Schema, middleware, two route registrations.

Why each piece is there

A few of these decisions look optional but are not.

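SELECT ... FOR UPDATE inside a transaction, backed by the primary key. Two simultaneous deliveries arrive at the same millisecond. FOR UPDATE locks only rows that already exist, so on a first delivery both SELECTs return zero rows and neither takes a lock. What serializes them is the primary key on key: the second INSERT blocks until the first transaction commits, then fails with a unique violation (SQLSTATE 23505) before its handler ever runs. Treat that error as “request in progress” and answer 409 rather than letting it surface as a 500. FOR UPDATE matters once the row exists: it makes a retry wait for any in-flight write to that row instead of reading around it. Delivery is still at-least-once; what you get is a handler that runs at most once per key for as long as the row lives.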
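
If you want the unique constraint to do the claiming in a single statement, one option is INSERT ... ON CONFLICT DO NOTHING and a check on the affected row count. The sketch below is a variation, not the middleware above: claimKey and the Queryable interface are hypothetical names, and Queryable is an assumed minimal shape that a pg Pool or Client would satisfy. It covers only the claim step; a request that loses the claim still has to read the existing row to compare fingerprints and check the status.

```typescript
// Assumed minimal shape of a pg Pool/Client: query(text, params) -> { rowCount }.
interface Queryable {
  query(text: string, params: unknown[]): Promise<{ rowCount: number | null }>;
}

// Hypothetical helper: resolves to true only for the one request that created the row.
export async function claimKey(
  db: Queryable,
  key: string,
  fingerprint: string,
  ttlHours = 24,
): Promise<boolean> {
  const result = await db.query(
    `INSERT INTO idempotency_keys (key, request_fingerprint, status, expires_at)
     VALUES ($1, $2, 'in_progress', now() + make_interval(hours => $3))
     ON CONFLICT (key) DO NOTHING`,
    [key, fingerprint, ttlHours],
  );
  // A conflicting insert affects zero rows instead of raising an error.
  return result.rowCount === 1;
}
```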

Storing a request_fingerprint. Idempotency keys can be reused incorrectly, most often by clients that retry after editing the payload. Your contract should be: same key, same body, or it is a 422. Without the fingerprint check, a client could send an updated charge amount under the same key and silently get the old response. The Stripe API makes the same check: reusing a key with different parameters returns a 400 with error type idempotency_error. (Its idempotency_key_in_use code covers the other case, a concurrent request still running under the same key.)
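
The branch logic is easier to reason about, and to unit test, when it is pulled out as a pure function. This is a sketch of the same decisions the middleware makes inline; the names StoredKey, Decision, and decide are illustrative.

```typescript
type StoredKey = {
  status: 'in_progress' | 'completed';
  request_fingerprint: string;
  response_code: number | null;
  response_body: unknown;
};

type Decision =
  | { action: 'run_handler' }
  | { action: 'reject'; status: 422 | 409; error: string }
  | { action: 'replay'; status: number; body: unknown };

// Hypothetical helper mirroring the middleware's checks, in the same order.
export function decide(row: StoredKey | undefined, fingerprint: string): Decision {
  if (!row) return { action: 'run_handler' }; // first time we see this key
  if (row.request_fingerprint !== fingerprint) {
    // Same key, different request: refuse rather than replay a stale response.
    return { action: 'reject', status: 422, error: 'idempotency_key_reuse' };
  }
  if (row.status === 'in_progress') {
    return { action: 'reject', status: 409, error: 'request_in_progress' };
  }
  return { action: 'replay', status: row.response_code ?? 200, body: row.response_body };
}
```

The fingerprint check comes before the status check on purpose: a mismatched body is a client bug no matter what state the original request is in.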

The in_progress status. A retry that arrives while the original is still running is the messiest case. Returning 409 lets the client back off and retry once the original is done — by which point the row will say completed and the cached response will be served. The alternative (waiting on the lock until the handler finishes) ties up a database connection per concurrent retry; under burst conditions, that exhausts your pool.

Wrapping res.json. The handler runs outside the transaction that locks the row, on purpose. You do not want a long-running handler holding a Postgres lock for its full duration. Instead, persist the result after the handler completes. The trade-off: if the process dies between the handler’s side effects and the UPDATE, the row stays in_progress and every retry gets a 409 until the row expires or someone resolves it by hand. That is a much smaller blast radius than holding the lock the whole time, but it does mean stuck in_progress rows deserve an alert. Note also that only responses sent through res.json are captured; a handler that replies with res.send or res.sendStatus leaves its row in_progress.
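
A row that stays in_progress long after any handler could still be running is worth surfacing. The sketch below only lists such rows; whether to clear them automatically is a judgment call, because the handler may have finished its side effects before the process died. The function name findStuckKeys, the Queryable shape, and the 15-minute threshold are assumptions.

```typescript
// Assumed minimal shape of a pg Pool/Client for this query.
interface Queryable {
  query(text: string, params: unknown[]): Promise<{ rows: Array<{ key: string }> }>;
}

// Hypothetical helper: keys whose placeholder row is older than the threshold.
export async function findStuckKeys(
  db: Queryable,
  olderThanMinutes = 15,
): Promise<string[]> {
  const { rows } = await db.query(
    `SELECT key
       FROM idempotency_keys
      WHERE status = 'in_progress'
        AND created_at < now() - make_interval(mins => $1)
      ORDER BY created_at`,
    [olderThanMinutes],
  );
  return rows.map((r) => r.key);
}
```

Feeding the result into an alert or a reconciliation queue keeps a human, or a purpose-built check, in the loop before anything is re-run.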

expires_at and a periodic cleanup job. Webhook providers retry for hours, sometimes days. Stripe retries for up to three days; an SQS queue can retain and redeliver a message for up to 14 days. A 24-hour TTL is a starting point that covers the bulk of retries, not the tail of the longest schedules, so size it to the providers you actually use. Pair it with a DELETE FROM idempotency_keys WHERE expires_at < now() job that runs hourly. If you need longer (PCI-style audit trail), bump the interval; the index keeps the cleanup cheap.
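
The cleanup itself is one statement. A minimal sketch, assuming a pg-style query function (the Queryable interface and the purgeExpiredKeys name are illustrative); how you schedule it, whether cron, a worker, or a timer in the app, is up to you.

```typescript
// Assumed minimal shape of a pg Pool/Client for this statement.
interface Queryable {
  query(text: string): Promise<{ rowCount: number | null }>;
}

// Hypothetical helper: deletes expired keys and reports how many rows went away.
export async function purgeExpiredKeys(db: Queryable): Promise<number> {
  const result = await db.query(
    `DELETE FROM idempotency_keys WHERE expires_at < now()`,
  );
  return result.rowCount ?? 0;
}
```

Logging the returned count gives you a cheap signal that the job is actually running.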

How to test it (the hammer)

A unit test that says “calling the function twice returns the same value” misses the entire point. The bug only happens under concurrency, so the test has to actually fire concurrent requests.

import { test, expect } from 'vitest';
import request from 'supertest';
import { app } from './app';
import { pool } from './db';

test('parallel duplicate webhooks produce one charge', async () => {
  const key = 'evt_test_' + Date.now();
  // Carry the id in the body too, as Stripe does, so the handler can store it as event_id.
  const body = { id: key, type: 'charge.succeeded', amount: 1500 };

  const sendOnce = () =>
    request(app).post('/webhooks/stripe')
      .set('idempotency-key', key)
      .send(body);

  // Hammer: 25 simultaneous deliveries of the same event.
  const responses = await Promise.all(Array.from({ length: 25 }, sendOnce));

  const succeeded = responses.filter((r) => r.status === 200).length;
  const inProgress = responses.filter((r) => r.status === 409).length;

  expect(succeeded + inProgress).toBe(25);

  // The key bit: charges_table got exactly one row.
  const { rows } = await pool.query(
    `SELECT count(*)::int AS n FROM charges WHERE event_id = $1`,
    [key],
  );
  expect(rows[0].n).toBe(1);
});

Run it ten times in a row. If even one run produces two charges, you have a race. To watch the test fail on purpose, drop the primary key on key in a scratch database: the SELECT-then-INSERT race usually shows up within a few iterations.

For a heavier test, point vegeta at a staging instance:

echo "POST https://staging.example.com/webhooks/stripe
Idempotency-Key: evt_load_test_001
Content-Type: application/json
@./payload.json" \
  | vegeta attack -rate=200 -duration=10s \
  | vegeta report

You should see 200s and 409s only — never two 200s with side effects, never a 5xx. Run a SELECT count(*) against the resulting rows in your charges table; it should be 1.

What still bites you

A few footguns that do not show up in toy examples.

Multi-region or multi-database setups. The lock only works inside a single Postgres cluster. If you run two regional clusters with eventual replication, a simultaneous retry to two regions will both see “no row” and both proceed. The fix is either a single global table for idempotency keys (with the latency cost) or routing the same key to the same region via consistent hashing in your edge layer.
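
Routing by key can be as small as hashing the key and picking a region from a fixed list. The sketch below is plain hash-modulo routing, not a full consistent-hash ring: it is deterministic, but adding or removing a region remaps most keys, so treat it as a starting point. The region names and the regionForKey function are hypothetical.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical region list; the order must be identical on every edge node.
const REGIONS: readonly string[] = ['eu-west', 'us-east'];

// Hypothetical helper: the same key always maps to the same region.
export function regionForKey(key: string, regions: readonly string[] = REGIONS): string {
  if (regions.length === 0) throw new Error('no regions configured');
  const digest = createHash('sha256').update(key).digest();
  // First 4 bytes as an unsigned integer, reduced modulo the region count.
  const bucket = digest.readUInt32BE(0) % regions.length;
  return regions[bucket];
}
```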

Handlers that have non-database side effects. If your handler sends an email and writes to Postgres, the two are not atomic. A crash between them leaves the key in_progress, and once that row expires or is cleared, the retry re-runs the handler and the email goes out twice. Either make the email send itself idempotent (some email APIs accept an idempotency key or a client-supplied ID; check your provider’s documentation), or write the pending side effect to an outbox table in the same transaction as the business data and let a worker deliver it.
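
One common way to bind a side effect to the database write is an outbox table: the handler records the intent to send in the same transaction as the business row, and a separate worker does the sending. The sketch below assumes a hypothetical outbox table and an amount column on charges; TxClient is an assumed minimal shape of a checked-out pg client, and recordChargeWithOutbox is an illustrative name.

```typescript
// Assumed minimal shape of a single checked-out pg client.
interface TxClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Hypothetical helper: the charge row and the pending email commit or roll back together.
export async function recordChargeWithOutbox(
  client: TxClient,
  eventId: string,
  amount: number,
  email: string,
): Promise<void> {
  await client.query('BEGIN');
  try {
    await client.query(
      `INSERT INTO charges (event_id, amount) VALUES ($1, $2)`,
      [eventId, amount],
    );
    // A separate worker reads this table and sends the email, using event_id
    // to skip rows it has already delivered.
    await client.query(
      `INSERT INTO outbox (event_id, kind, payload) VALUES ($1, 'receipt_email', $2)`,
      [eventId, JSON.stringify({ to: email, amount })],
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

The worker can still deliver a message more than once if it crashes between sending and marking the row done, so the outbox narrows the problem rather than removing it.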

The 24-hour window vs. retry horizons. Stripe retries for up to 72 hours. AWS EventBridge for up to 24. If you set the TTL too low, a very late retry sees no row and is processed as new. Match the TTL to the longest provider you talk to, plus a margin. The cleanup job means you pay nothing for the longer window beyond a small storage cost.
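
The arithmetic is simple enough to keep in code next to the TTL constant, so the number does not drift from the providers you integrate. In this sketch the two horizons are the figures quoted above; the 12-hour margin, the table name, and the ttlHours function are assumptions.

```typescript
// Retry horizons in hours, taken from the figures quoted in this article.
// Add your own providers after checking their documentation.
const RETRY_HORIZON_HOURS: Record<string, number> = {
  stripe: 72,
  eventbridge: 24,
};

// Hypothetical helper: longest horizon among the providers in use, plus a margin.
export function ttlHours(providers: string[], marginHours = 12): number {
  if (providers.length === 0) throw new Error('no providers given');
  const horizons = providers.map((p) => {
    const hours = RETRY_HORIZON_HOURS[p];
    if (hours === undefined) throw new Error(`unknown provider: ${p}`);
    return hours;
  });
  return Math.max(...horizons) + marginHours;
}

// ttlHours(['stripe', 'eventbridge']) returns 84 (72 + 12).
```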

Body parsing inconsistencies. The fingerprint hashes JSON.stringify(req.body), which is sensitive to key ordering. Any pre-processing middleware that re-serializes or mutates the body can shift keys around and break fingerprint matching. If your stack does this, hash the raw body bytes instead: capture the buffer with the verify option of express.json(), which runs before parsing, and fingerprint that.
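
A raw-bytes fingerprint might look like the sketch below. The fingerprintRaw function is illustrative, and the rawBody property in the wiring comment is something you attach yourself; Express does not provide it. Hashing bytes is stricter than hashing parsed JSON: two payloads that differ only in whitespace or key order count as different requests, which is usually what you want for provider retries that resend the same bytes.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: fingerprint over method, path, and the exact bytes received.
export function fingerprintRaw(method: string, path: string, rawBody: Buffer): string {
  return createHash('sha256')
    .update(method)
    .update('\n') // separators keep "POST/a" + "b" distinct from "POST" + "/ab"
    .update(path)
    .update('\n')
    .update(rawBody)
    .digest('hex');
}

// Wiring sketch (assumes Express; `rawBody` is a property we add ourselves):
//   app.use(express.json({
//     verify: (req, _res, buf) => { (req as any).rawBody = buf; },
//   }));
//   // later, inside the middleware:
//   // fingerprintRaw(req.method, req.path, (req as any).rawBody)
```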

Bursty providers. Some webhook providers will redeliver a stuck event hundreds of times. If your handler is slow, every delivery sits in the in_progress window and replies 409. The provider keeps retrying. The fix is to make the handler fast (under a second is the right target) or to acknowledge the webhook immediately and process it on a queue, where the queue worker handles its own idempotency.

The metric that proves it shipped

The most useful chart is “duplicate side-effect count per day” — a SELECT event_id, count(*) FROM charges GROUP BY event_id HAVING count(*) > 1, queried daily. Before the fix, this is a non-zero number with a long tail of two or three duplicates a week. After the fix, it is zero, and any non-zero day is a real incident worth paging on.
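
Wrapped as a function, the daily check is easy to call from a scheduled job or a health endpoint. A minimal sketch, assuming a pg-style query function and the event_id column discussed here; findDuplicateCharges and Queryable are illustrative names.

```typescript
type DuplicateRow = { event_id: string; n: number };

// Assumed minimal shape of a pg Pool/Client for this query.
interface Queryable {
  query(text: string): Promise<{ rows: DuplicateRow[] }>;
}

// Hypothetical helper: every event_id that produced more than one charge row.
export async function findDuplicateCharges(db: Queryable): Promise<DuplicateRow[]> {
  const { rows } = await db.query(
    `SELECT event_id, count(*)::int AS n
       FROM charges
      GROUP BY event_id
     HAVING count(*) > 1
      ORDER BY n DESC`,
  );
  return rows;
}
```

An empty array is the healthy state; anything else is the list of events to investigate.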

If you cannot run that query because you do not have an event_id column on your side-effect tables, add one first. Without it, you have no way to tell whether duplicates are happening at all — they show up as “weird customer complaint about being charged twice” with no way to back-trace.

The takeaway

Idempotency is one of those things every team agrees they should do and almost no team actually implements until a customer is double-charged. The reason is that the toy version (a Set of seen IDs) does not survive process restarts, the next version (a bare INSERT ... ON CONFLICT DO NOTHING) stops the duplicate but gives a retry nothing to replay and no way to tell an in-flight request or a reused key from a finished one, and the third version is the one above. By the time you get there, you have already shipped the bug.

Skip the first two iterations. The 30-line version with FOR UPDATE and a fingerprint is the version you want from day one. It costs a single table, one middleware, and an afternoon. The next time a customer’s card glitches and your handler returns a 502 mid-charge, you will sleep through the retry storm.
