Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming
Naive retry loops are how a single sick dependency takes the whole platform down. Here is the retry pattern that actually survives a real outage: exponential backoff with decorrelated jitter, retry budgets, deadline propagation, and the six mistakes that will turn your "self-healing" client into a self-DDoS tool.
A downstream service hiccups for 800ms. A few seconds later your dashboards show a clean recovery. Two minutes later, the same service goes hard down — except this time it stays down for 40 minutes, takes three other services with it, and the post-mortem points the finger at a config change that did nothing wrong.
What happened is the most boring incident in distributed systems: every client retried, every retry hit the same already-overloaded service, every failure caused another retry, and the synchronized wave of req → fail → retry from a few thousand callers produced more load than the service had ever handled in a normal day. The hiccup was a hiccup. The retries were the outage.
This pattern — the retry storm — is in every “self-healing” client people add the week after their first 5xx. It is also entirely avoidable. The fix is not “fewer retries.” It is retries that understand they are part of a fleet: exponential backoff with proper jitter, a retry budget that caps the blast radius, a deadline that travels with the request, and a short list of errors that are worth retrying at all. Together those four things take maybe sixty lines of TypeScript and they are the difference between a 503 that recovers in five seconds and an outage that fills your weekend.
Mistake #1: retrying everything
The first mistake is treating “the call failed” as a single category. It is not. There are at least four classes of failure, and only one of them benefits from retrying.
// Bad: retry everything that throws.
async function call(url: string) {
  for (let i = 0; i < 3; i++) {
    try {
      return await fetch(url);
    } catch (err) {
      // ... retry?
    }
  }
}
400 Bad Request is not retryable — the input is wrong, retrying will just send the same wrong input again. 401 Unauthorized is not retryable until you refresh the token. 404 is definitive. 409 Conflict usually means a domain rule rejected the request and retrying would either no-op or repeat the conflict.
The set of errors actually worth retrying is small:
- Transient network errors: ECONNRESET, ETIMEDOUT, EAI_AGAIN, socket hang up. The connection itself failed.
- 502, 503, 504: the service or its upstream is unavailable or overloaded.
- 429 Too Many Requests, but only after honoring the Retry-After header.
- 408 Request Timeout: server-side timeout.
- Idempotent 5xx: retry only when the operation is safe to repeat (more on idempotency below).
Anything else, fail fast. This sounds aggressive until you remember: a non-retryable error retried six times is just six errors instead of one, plus a longer client latency, plus six wasted requests against a service that already told you “no.”
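type RetryDecision = 'retry' | 'fail';

function classify(err: unknown, res?: Response): RetryDecision {
  // Use the response if one is passed; otherwise fall back to a numeric
  // `status` attached to the thrown error (as chargeCard does below).
  const status = res?.status ?? (err as { status?: number } | null)?.status;
  if (typeof status === 'number') {
    if (status === 429 || status === 408) return 'retry';
    if (status >= 500 && status <= 599) return 'retry';
    return 'fail';
  }
  // No HTTP status: classify by the Node.js network error code.
  const code = (err as NodeJS.ErrnoException)?.code;
  if (code === 'ECONNRESET' || code === 'ETIMEDOUT' ||
      code === 'EAI_AGAIN' || code === 'ECONNREFUSED' ||
      code === 'UND_ERR_SOCKET') return 'retry';
  return 'fail';
}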
This function is the entire policy. Everything else is mechanics.
Mistake #2: retrying without backoff
The naive loop is the one in every tutorial:
for (let i = 0; i < 5; i++) {
  try { return await call(); }
  catch { /* try again immediately */ }
}
If the downstream is overloaded, this sends five requests in 50ms instead of one. Multiply that by ten thousand clients and the downstream never gets a recovery window: every failure immediately produces more load, so the feedback loop is tighter than it was before the retries existed.
The minimum bar is exponential backoff: double the wait between attempts.
const base = 100; // ms
for (let attempt = 0; attempt < 5; attempt++) {
  try { return await call(); }
  catch (err) {
    if (classify(err) === 'fail') throw err;
    await sleep(base * 2 ** attempt); // 100, 200, 400, 800, 1600
  }
}
This helps. It does not solve the real problem.
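To make the cost of that schedule concrete, here is a small sketch that lists the waits and adds them up. The helper name `backoffSchedule` and the `cap` parameter are illustrative assumptions, not part of the loop above.

```typescript
// Exponential backoff schedule: base * 2^attempt, clamped to `cap`.
// `backoffSchedule` and `cap` are hypothetical names used for illustration.
function backoffSchedule(attempts: number, base = 100, cap = 30_000): number[] {
  return Array.from({ length: attempts }, (_, attempt) =>
    Math.min(cap, base * 2 ** attempt),
  );
}

const waits = backoffSchedule(5);
const totalWait = waits.reduce((sum, ms) => sum + ms, 0);

console.log(waits);     // [100, 200, 400, 800, 1600]
console.log(totalWait); // 3100
```

Five failed attempts already spend 3.1 seconds sleeping, before counting the time each attempt takes, which is worth keeping in mind when the caller only has a couple of seconds to spare.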
Mistake #3: backoff without jitter
If every client uses the same backoff schedule, every client retries at the same time. The downstream sees a smooth load curve under steady state and a series of synchronized spikes during a partial outage. The spikes are bigger than steady-state load and they line up perfectly with the moments the service is most fragile.
The fix is jitter: adding randomness to the wait. There are three flavors people argue about; two of them are sound defaults, and the third is rarely worth choosing.
Full jitter — pick a random value between 0 and the current cap.
function fullJitter(attempt: number, base = 100, cap = 30_000) {
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exp;
}
Equal jitter — half deterministic, half random.
function equalJitter(attempt: number, base = 100, cap = 30_000) {
  const exp = Math.min(cap, base * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2);
}
Decorrelated jitter — each wait is computed from the previous wait, not from the attempt number.
function decorrelatedJitter(prev: number, base = 100, cap = 30_000) {
  return Math.min(cap, base + Math.random() * (prev * 3 - base));
}
The AWS Architecture Blog post that introduced these compared them on a simulated overload scenario and the practical result is well-replicated: full jitter and decorrelated jitter both flatten the spike effectively. Decorrelated jitter has slightly better worst-case latency under heavy contention because it does not correlate to the attempt counter when many clients have been retrying for a while. For most services, full jitter is fine; for services where you expect a long tail of retrying clients (auth services, payment processors, anything with a global tail), pick decorrelated.
What you should never do is the deterministic schedule with no randomness, or “jitter” implemented as ± 10% of the deterministic value. The first creates the spike. The second narrows it without flattening it.
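The difference between "narrowing" and "flattening" can be seen in a rough simulation. The sketch below assumes 1,000 clients that all failed at the same instant and counts how many of their retries land in the busiest 10ms window. The seeded generator (`mulberry32`), the client count, and the window size are illustrative choices made so the run is repeatable, not measurements from a real system.

```typescript
// Small seeded PRNG (mulberry32) so the simulation is repeatable.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Largest number of retries that fall into the same `windowMs` slice.
function peakPerWindow(waits: number[], windowMs = 10): number {
  const counts = new Map<number, number>();
  let peak = 0;
  for (const w of waits) {
    const bucket = Math.floor(w / windowMs);
    const n = (counts.get(bucket) ?? 0) + 1;
    counts.set(bucket, n);
    if (n > peak) peak = n;
  }
  return peak;
}

const rand = mulberry32(42);
const clients = 1000;
const exp = 100 * 2 ** 3; // 800 ms: the un-jittered backoff when attempt = 3

const fixed = Array.from({ length: clients }, () => exp);
const plusMinus10 = Array.from({ length: clients }, () => exp * (0.9 + 0.2 * rand()));
const full = Array.from({ length: clients }, () => rand() * exp);

const peaks = {
  fixed: peakPerWindow(fixed),             // every client in one window
  plusMinus10: peakPerWindow(plusMinus10), // spread over only 160 ms
  full: peakPerWindow(full),               // spread over the whole 800 ms
};
console.log(peaks);
```

The fixed schedule puts all 1,000 retries in a single window. The ±10% variant spreads them over a 160ms band, so the peak stays several times higher than with full jitter, which uses the entire 0 to 800ms range.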
Mistake #4: no retry budget
Here is the pattern that turns a small outage into a giant one. Imagine your service makes one call to payment-gateway per inbound request. Normally it succeeds. During a hiccup, every call fails and your client retries five times. Your inbound RPS is unchanged. Your outbound RPS to payment-gateway is now 6x normal: one original attempt plus five retries.
If payment-gateway was already at 70% of capacity, you just put it at 420% of capacity. There is no recovery window because every recovering attempt gets buried by the next wave of retries.
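The arithmetic generalizes to partial failures. As a sketch, assume each attempt fails independently with the same probability (a simplification; real failures are usually correlated). The function name `amplification` is illustrative.

```typescript
// Expected outbound calls per inbound request when each attempt fails
// independently with probability `failureRate` and the client retries
// up to `maxRetries` times: 1 + p + p^2 + ... + p^maxRetries.
function amplification(failureRate: number, maxRetries: number): number {
  let total = 0;
  for (let k = 0; k <= maxRetries; k++) total += failureRate ** k;
  return total;
}

const hardDown = amplification(1, 5);  // every attempt fails
const partial = amplification(0.8, 5); // 80% of attempts fail

console.log(hardDown);                    // 6
console.log((0.7 * hardDown).toFixed(1)); // "4.2", i.e. 420% of capacity
console.log(partial.toFixed(2));          // "3.69"
```

Even when one attempt in five still succeeds, the unbudgeted client sends well over three times its normal load to a dependency that is already struggling.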
A retry budget caps the additional load retries can generate. The classic implementation, variants of which are used in gRPC, Finagle, and several large-scale Envoy deployments, is a token bucket: every successful call deposits a token; every retry withdraws tokens. When the bucket is empty, retries are skipped.
class RetryBudget {
  private tokens: number;

  constructor(
    private readonly capacity = 100,
    private readonly retryRatio = 0.1, // retries can be at most 10% of successful calls
  ) {
    this.tokens = capacity;
  }

  /** Call after a successful underlying request. */
  onSuccess() {
    this.tokens = Math.min(this.capacity, this.tokens + 1);
  }

  /** Call before a retry. Returns false if budget is exhausted. */
  tryConsume(): boolean {
    const cost = 1 / this.retryRatio; // 10 tokens per retry at 10% ratio
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
retryRatio = 0.1 means: across the population of calls, retries can add at most 10% extra load. Under normal operation the bucket sits near full and retries flow freely. Under partial failure, the success rate drops, the bucket drains, and retries automatically start being skipped — which is exactly the moment you want them skipped, because the downstream is already struggling.
This is the single most important pattern in this post. A team that has exponential backoff with jitter but no retry budget will still produce retry storms during a multi-second outage. A team that has only a retry budget — no backoff at all — will not.
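A quick way to convince yourself is to count attempts during a hard-down period. The sketch below restates the budget with explicit fields so it runs without the rest of the post, and assumes the worst case: 1,000 inbound requests while every downstream attempt fails. `outboundWhileDown` is a hypothetical helper written for this illustration.

```typescript
// Same token-bucket logic as the RetryBudget above, restated compactly.
class RetryBudget {
  private tokens: number;
  private readonly capacity: number;
  private readonly retryRatio: number;

  constructor(capacity = 100, retryRatio = 0.1) {
    this.capacity = capacity;
    this.retryRatio = retryRatio;
    this.tokens = capacity;
  }

  onSuccess(): void {
    this.tokens = Math.min(this.capacity, this.tokens + 1);
  }

  tryConsume(): boolean {
    const cost = 1 / this.retryRatio;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// Count outbound attempts for `calls` inbound requests while the downstream
// is hard down (every attempt fails), with up to `maxRetries` retries each.
function outboundWhileDown(calls: number, maxRetries: number, budget?: RetryBudget): number {
  let outbound = 0;
  for (let i = 0; i < calls; i++) {
    outbound++; // first attempt is always sent
    for (let r = 0; r < maxRetries; r++) {
      if (budget && !budget.tryConsume()) break; // budget exhausted: give up
      outbound++; // retry is sent and fails again
    }
  }
  return outbound;
}

const unbudgeted = outboundWhileDown(1000, 5);
const budgeted = outboundWhileDown(1000, 5, new RetryBudget());

console.log(unbudgeted); // 6000
console.log(budgeted);   // 1010
```

With no successes to refill the bucket, the initial 100 tokens pay for ten retries and then retries stop. Changing `maxRetries` to 3 or 8 gives the same 1010 in this scenario, because the budget, not the attempt count, is the limiting factor.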
Mistake #5: ignoring the request deadline
Retries cost time. If your handler has a 2-second SLO and the first attempt times out at 1.8 seconds, retrying is pointless: a second attempt is unlikely to finish in the remaining 200ms, so by the time it returns the client has given up, your inbound load balancer has hung up, and the retry is doing work for nobody. (See AbortController in Node.js for the wider problem.)
Every retry policy should be bounded by a deadline that propagates from the inbound request, not just by a max attempt count. The simplest version: each retry checks how much time is left before scheduling the next attempt.
async function withRetry<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  opts: { signal: AbortSignal; deadlineMs: number; budget: RetryBudget },
): Promise<T> {
  const start = Date.now();
  let lastWait = 100;
  while (true) {
    if (opts.deadlineMs - (Date.now() - start) <= 0 || opts.signal.aborted) {
      throw new Error('deadline exceeded');
    }
    try {
      const result = await fn(opts.signal);
      opts.budget.onSuccess();
      return result;
    } catch (err) {
      if (classify(err) === 'fail') throw err;
      const wait = decorrelatedJitter(lastWait);
      lastWait = wait;
      // Recompute the time left: the failed attempt itself consumed some of it.
      const remaining = opts.deadlineMs - (Date.now() - start);
      // Never sleep past the deadline; keep a 50 ms buffer for the next attempt.
      const sleepFor = Math.min(wait, remaining - 50);
      if (sleepFor <= 0) throw err;
      // Spend budget only once we know the retry can actually happen.
      if (!opts.budget.tryConsume()) throw err;
      await sleep(sleepFor);
    }
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
A few details people get wrong even when they get the rest right:
- The deadline is absolute, not per-attempt. A “5 second timeout per call, 3 attempts” client can take 15 seconds in the worst case, before counting the backoff sleeps. That is not what your inbound caller expects.
- Subtract a small buffer (- 50) before the deadline. Otherwise the last attempt times out the moment it starts, which is just expensive failure.
- Honor Retry-After. A 429 or 503 with Retry-After: 5 is the server explicitly telling you when retrying is welcome. Use that instead of your jittered value if it is larger.
function honorRetryAfter(res: Response, fallback: number): number {
  const header = res.headers.get('Retry-After');
  if (!header) return fallback;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return Math.max(fallback, seconds * 1000);
  const dateMs = Date.parse(header);
  if (!Number.isNaN(dateMs)) return Math.max(fallback, dateMs - Date.now());
  return fallback;
}
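The first two details in the list above, an absolute deadline and a small buffer, can be folded into one helper that decides how long a single attempt may run. This is a minimal sketch: the name `attemptTimeoutMs` and the one-second per-attempt cap are assumptions made for illustration, and the clock is passed in so the behavior is easy to check.

```typescript
// Time budget for the next attempt, derived from one absolute deadline.
// Returns 0 when there is not enough time left to make an attempt worthwhile.
function attemptTimeoutMs(
  deadlineAt: number,      // absolute deadline, epoch ms
  now: number,             // current time, epoch ms
  perAttemptCapMs = 1_000, // hypothetical upper bound for a single attempt
  bufferMs = 50,           // safety margin, matching the "- 50" above
): number {
  const remaining = deadlineAt - now - bufferMs;
  if (remaining <= 0) return 0;
  return Math.min(perAttemptCapMs, remaining);
}

const t0 = 1_000_000;           // fixed fake clock for the example
const deadlineAt = t0 + 2_000;  // caller allows 2 seconds in total

console.log(attemptTimeoutMs(deadlineAt, t0));         // 1000 (capped)
console.log(attemptTimeoutMs(deadlineAt, t0 + 1_800)); // 150  (what is left)
console.log(attemptTimeoutMs(deadlineAt, t0 + 1_960)); // 0    (do not bother)
```

A caller would skip the attempt when the result is 0 and otherwise use it as the timeout for that attempt, so later attempts shrink as the shared deadline approaches.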
Mistake #6: retrying non-idempotent operations
Retrying POST /payments is how you charge a customer twice. Retrying POST /send-email is how customers get five copies of the same notification. Retries are only safe when the operation is idempotent — when “do this twice” produces the same observable result as “do this once.”
Some operations are naturally idempotent (PUT /users/:id, DELETE /orders/:id). Some can be made idempotent with an idempotency key — a client-generated identifier the server uses to deduplicate. The full pattern is in the post on idempotency keys; the short version is:
const key = crypto.randomUUID();
await withRetry(
  (signal) => fetch('/payments', {
    method: 'POST',
    headers: { 'Idempotency-Key': key, 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
    signal,
  }),
  { signal: req.signal, deadlineMs: 5_000, budget: paymentBudget },
);
The same key on every retry. The server stores the key + the response, and a duplicate request returns the cached response instead of re-executing.
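On the server side, the deduplication step can be sketched with an in-memory map. This is an illustration under loud assumptions: `IdempotencyStore` is a hypothetical class, a real service would persist entries and expire them, and dropping failed operations so they can be retried is one possible policy rather than the only correct one.

```typescript
// In-memory idempotency store: the first request with a given key runs the
// operation; later requests with the same key get the stored result.
// Persistence and expiry are deliberately omitted from this sketch.
class IdempotencyStore<T> {
  private readonly results = new Map<string, Promise<T>>();

  run(key: string, operation: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key);
    if (existing) return existing;
    const pending = operation();
    this.results.set(key, pending);
    // Assumed policy: forget failed operations so a retry can run them again.
    pending.catch(() => this.results.delete(key));
    return pending;
  }
}

let charges = 0;
const store = new IdempotencyStore<{ chargeId: number }>();
const charge = async () => ({ chargeId: ++charges });

const first = store.run('key-123', charge);
const retried = store.run('key-123', charge); // same key: not re-executed
const other = store.run('key-456', charge);   // new key: executes

console.log(charges);           // 2
console.log(first === retried); // true
```

Storing the promise rather than the resolved value also covers the case where the retry arrives while the first request is still in flight.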
If an endpoint is not idempotent and you cannot add an idempotency key — set attempts = 1 and stop. A retry that double-charges is worse than the failure it is trying to avoid.
Putting it together
The full client is small:
const paymentBudget = new RetryBudget(100, 0.1);

export async function chargeCard(payload: ChargePayload, parent: AbortSignal) {
  const key = crypto.randomUUID();
  return withRetry(
    async (signal) => {
      const res = await fetch('https://payment-gateway.internal/charge', {
        method: 'POST',
        headers: {
          'Idempotency-Key': key,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(payload),
        signal,
      });
      if (!res.ok) {
        const err = new Error(`payment ${res.status}`);
        (err as any).status = res.status;
        (err as any).response = res;
        throw err;
      }
      return res.json();
    },
    { signal: parent, deadlineMs: 4_000, budget: paymentBudget },
  );
}
Sixty lines, including the budget and the classifier. It will outperform every “smart” retry library that does not implement a budget, because under partial failure the budget is what saves the downstream.
How to test it before production tests it
You will not catch retry-storm behavior in unit tests. The only way to see it is to inject failure under load.
A short k6 scenario that runs your client against a flaky simulator is enough:
// k6 script: 30s normal load, 30s downstream fails 80% of requests
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  scenarios: {
    constant: { executor: 'constant-arrival-rate', rate: 200, timeUnit: '1s', duration: '60s', preAllocatedVUs: 50 },
  },
};

export default function () {
  http.get('http://localhost:3000/api/charge'); // your service that wraps the client
  sleep(0);
}
Run it with the simulated downstream healthy and your outbound RPS to the downstream should hover around inbound RPS. Run it again with the simulated downstream returning 503 for 80% of requests and watch the outbound RPS. With a working budget it climbs by at most roughly 10% and plateaus. Without one it climbs to several times the inbound rate, approaching 6x as the failure rate nears 100%, and stays there.
This is also the test that catches the version of the bug where someone “tightens” the retry policy by setting attempts = 8. The budget should mean attempts = 8 and attempts = 3 produce the same outbound RPS during failure. They will if everything else is wired correctly.
Where retries fit in the bigger picture
Retries are one layer in a stack of resilience patterns. A short map:
- Timeouts stop you from waiting on a hung dependency.
- Retries (this post) recover from transient failures.
- Circuit breakers stop calling a dependency that is clearly broken so the retries do not pile up.
- Bulkheads / pool limits stop a slow dependency from exhausting your concurrency.
- Hedging-style request fanout (out of scope here) shaves p99 latency at the cost of multiplying load — and so should always sit behind a budget.
Retries without a circuit breaker are the configuration that produces a retry storm. Retries with a circuit breaker but no budget produce the storm slightly later. Retries with a budget and a breaker, properly tested under failure, produce a service that recovers gracefully from the kind of 800ms hiccup that does not need to make it into a post-mortem at all.
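For readers who have not seen one, a circuit breaker can be sketched in a few lines. This is a deliberately minimal illustration, not a production design: the class name, the failure threshold, the cooldown, and the injected clock are all assumptions, and real breakers usually track failure rates over a window rather than a simple count.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 5, cooldownMs = 10_000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  /** Ask before each call; false means skip the call and fail fast. */
  allow(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // cooldown elapsed: allow probe traffic
    }
    return this.state !== 'open';
  }

  /** Any success closes the breaker and resets the failure count. */
  onSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  /** Opens after `threshold` failures, or immediately if a probe fails. */
  onFailure(): void {
    this.failures++;
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}

let clock = 0; // fake clock so the example is deterministic
const breaker = new CircuitBreaker(3, 1_000, () => clock);

breaker.onFailure();
breaker.onFailure();
breaker.onFailure();
console.log(breaker.allow()); // false: open, calls are skipped

clock = 1_000;
console.log(breaker.allow()); // true: cooldown elapsed, probing
breaker.onSuccess();
console.log(breaker.allow()); // true: closed again
```

In a client like the one in this post, `allow()` would be checked before each attempt, so a dependency that is clearly down stops receiving both first attempts and retries until the cooldown passes.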
The whole point is that none of this is exotic. It is the plumbing that turns a “self-healing” client from a marketing word into a true claim, and most teams have a 60-line gap between where they are and where this post is.