The Cache Stampede: Why Your "Just Add Redis" Layer Crashes Postgres at 3 a.m.

You add Redis in front of an expensive Postgres query. Latency drops from 380ms to 4ms. The endpoint serves 50,000 req/s without breathing hard. You go to bed proud.

Three weeks later, at 03:14, your database alerting goes off. CPU at 100%. Active connections pegged at the pool limit. p99 latency on every endpoint is 8 seconds. By the time you are at your laptop the spike is gone, the dashboards look normal, and the only evidence is a 90-second smear of red on the Postgres CPU graph.

What you are looking at is a cache stampede. The cache key for that hot endpoint expired. In the same millisecond, every in-flight request that would have hit the cache hit the database instead. Four thousand identical queries, all asking the exact same question, all landing on Postgres at once. The database held connections, queued, replied, and the cache filled itself back up — but in those 90 seconds, every other endpoint sharing that database queued behind the stampede and your service looked like it was down.

This is the single most common reliability failure in any system that “just adds Redis” without thinking about what happens at the moment of expiration. It is also one of the cheapest things to fix. Forty lines of code, two patterns, and the load-test that proves it works. Here it is.

The shape of the problem

A normal cache-aside read in Node.js looks like this:

async function getDashboard(userId: string): Promise<Dashboard> {
  const key = `dashboard:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const data = await db.query<Dashboard>('SELECT … expensive query …');
  await redis.set(key, JSON.stringify(data), 'EX', 60);
  return data;
}

This works. At steady state, roughly one request per minute per user hits the database, and every other read comes from Redis in single-digit milliseconds.

Now imagine the endpoint serves a popular dashboard: say, 4,000 requests per second concentrated on a few hundred user IDs. Every key has a 60-second TTL. At t=60 for some popular key, the cache value disappears. From that millisecond until the first query returns and repopulates the key:

Request A reads the cache, gets null, starts the query.
Request B reads the cache, gets null, starts the same query.
Request C reads the cache, gets null, starts the same query.
… requests D through T all do the same thing.

Twenty identical 380ms queries are now running on Postgres for one cache key. Across the whole system, the same thing is happening for every popular key whose TTL just rolled over. Your database is processing thousands of duplicate queries that exist only because every node in your cluster missed the cache at the same instant.

The TTL is not a deadline you scheduled — it is a starting gun for a stampede.
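
To make that sequence concrete, here is a minimal simulation. The in-memory Map and the 50 ms slowQuery are hypothetical stand-ins for Redis and Postgres, not part of the real setup; the only point is that twenty readers arriving during one miss trigger twenty queries.

```typescript
// Hypothetical in-memory stand-ins for Redis and Postgres, for illustration only.
const fakeCache = new Map<string, string>();
let dbQueryCount = 0;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Stands in for the expensive query; the 50 ms delay is arbitrary.
async function slowQuery(userId: string): Promise<{ userId: string }> {
  dbQueryCount += 1;
  await sleep(50);
  return { userId };
}

// Same shape as the naive cache-aside read: check, miss, query, store.
async function naiveRead(userId: string): Promise<{ userId: string }> {
  const key = `dashboard:${userId}`;
  const cached = fakeCache.get(key);
  if (cached !== undefined) return JSON.parse(cached);

  const data = await slowQuery(userId);
  fakeCache.set(key, JSON.stringify(data));
  return data;
}

// Twenty requests arrive while the key is missing; returns how many queries ran.
async function simulateNaiveStampede(): Promise<number> {
  fakeCache.clear();
  dbQueryCount = 0;
  await Promise.all(Array.from({ length: 20 }, () => naiveRead('42')));
  return dbQueryCount;
}
```

Every reader checks the cache before any of them has had a chance to fill it, so simulateNaiveStampede() resolves to one query per reader.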

Two patterns, one fix

There are two patterns that fix this. They compose, and you want both.

Single-flight (also known as request coalescing). Only one request actually computes the value; every other request that arrives during that computation waits for the same Promise. From the database’s point of view, a thousand concurrent misses on the same key produce exactly one query. From the user’s point of view, the wait is the same — they were going to wait for the slow query anyway.

Probabilistic early refresh (XFetch). Instead of waiting for the cache to expire, refresh it before it expires, with a probability that grows as the TTL shrinks. The popular keys (the ones being read constantly) get refreshed proactively, off the critical path. The cold keys never refresh and just expire normally. This eliminates the cliff at expiration entirely for hot keys.

Combined: hot keys never expire under load (they get refreshed early by exactly one in-flight request); the rare key that does expire under load is protected by single-flight so only one request pays the cost.

The 40 lines

Two pieces. A single-flight wrapper that lives in your process, and a refresh function that uses both single-flight and the XFetch probability.

// single-flight.ts
// Per-process map: any concurrent caller for the same key joins the same Promise.
const inFlight = new Map<string, Promise<unknown>>();

export function singleFlight<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key) as Promise<T> | undefined;
  if (existing) return existing;

  const promise = fn().finally(() => {
    // Critical: clear the entry as soon as the work is done, so the next
    // miss after this one starts a fresh computation (and not a stale
    // resolved Promise).
    inFlight.delete(key);
  });

  inFlight.set(key, promise);
  return promise;
}

// cache.ts
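import Redis from 'ioredis';
import { singleFlight } from './single-flight';

// Assumption: an ioredis client with default connection settings. The
// positional 'EX' / 'PX' arguments used in this post match its set() call.
const redis = new Redis();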

const TTL_SECONDS = 60;
// XFetch tuning knob. Higher = refresh earlier. 1.0 is a sane default.
const BETA = 1.0;

interface CacheEntry<T> {
  value: T;
  // How long the compute took, in seconds. Used by XFetch.
  delta: number;
  // Absolute expiration timestamp (ms since epoch).
  expiresAt: number;
}

export async function cachedRead<T>(
  key: string,
  compute: () => Promise<T>,
): Promise<T> {
  const raw = await redis.get(key);

  if (raw) {
    const entry = JSON.parse(raw) as CacheEntry<T>;
    const now = Date.now();

    // XFetch: probabilistic early refresh. The closer we are to expiry,
    // and the slower the original compute was, the more likely we kick
    // off a refresh now. Callers that decide at the same time are
    // collapsed into one unit of work by singleFlight below.
    // Math.log(Math.random()) is negative, so xfetch is negative and
    // subtracting it moves "now" forward by a random head start.
    const xfetch = entry.delta * BETA * Math.log(Math.random());
    if (now - xfetch * 1000 >= entry.expiresAt) {
      // Fire-and-forget refresh. The current request still returns the
      // (still fresh) cached value — no user waits for the refresh.
      void singleFlight(`refresh:${key}`, async () => {
        await refreshAndStore(key, compute);
      }).catch(() => { /* refresh failed; next read will try again */ });
    }

    return entry.value;
  }

  // True miss. Coalesce: only one caller computes; every other caller
  // for the same key during this window awaits the same Promise.
  return singleFlight(`read:${key}`, () => refreshAndStore(key, compute));
}

async function refreshAndStore<T>(
  key: string,
  compute: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  const value = await compute();
  const delta = (Date.now() - start) / 1000;

  const entry: CacheEntry<T> = {
    value,
    delta,
    expiresAt: Date.now() + TTL_SECONDS * 1000,
  };
  await redis.set(key, JSON.stringify(entry), 'PX', TTL_SECONDS * 1000);
  return value;
}

// handler.ts — the new shape
app.get('/api/dashboard/:userId', async (req, res) => {
  const data = await cachedRead(`dashboard:${req.params.userId}`, () =>
    db.query<Dashboard>('SELECT … expensive query …', [req.params.userId]),
  );
  res.json(data);
});

That is the entire pattern. The handler barely changes: cachedRead slots in where redis.get → query → redis.set used to live. Every existing callsite gets stampede protection by switching one function call.

Why each piece is there

The pieces look small. Each one is the difference between a cache that protects your database and a cache that adds an obscure failure mode to your database.

Per-process Map, not a Redis key, for the in-flight set. Stampede protection has to happen at the layer where requests arrive: in the Node process. By the time you’ve round-tripped to Redis to “claim” the work, hundreds of other requests have already round-tripped and are also trying to claim. A local Map lookup and insert run synchronously within one turn of the JavaScript event loop, so no other request can interleave between the check and the set; no race. Yes, this means the fleet as a whole can still run up to N duplicate queries per cold miss, one per instance (where N = number of instances). For a fleet of 10 nodes that’s 10 queries per cold miss instead of 4,000: a 400× win, achieved with a Map.

inFlight.delete(key) in finally. Without this line, the Map accumulates resolved Promises forever and your “fresh miss” path silently returns yesterday’s stale Promise. The finally runs whether the compute succeeded or threw, which is exactly what you want — a failed compute should not pin every future caller to the same failed Promise.
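
A small demo of both properties, as a sketch. singleFlight is repeated here only so the snippet runs on its own; demoSingleFlight and its 20 ms compute are hypothetical. It checks that a burst of concurrent callers shares one computation, and that a failed computation does not stay pinned in the Map.

```typescript
const inFlight = new Map<string, Promise<unknown>>();

function singleFlight<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key) as Promise<T> | undefined;
  if (existing) return existing;

  // Clear the entry whether fn resolved or rejected.
  const promise = fn().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}

async function demoSingleFlight(): Promise<{
  computeCalls: number;
  failed: string;
  recovered: string;
  mapSize: number;
}> {
  let computeCalls = 0;
  const compute = async (): Promise<string> => {
    computeCalls += 1;
    await new Promise<void>((resolve) => setTimeout(resolve, 20));
    return 'dashboard';
  };

  // 1,000 callers arrive in the same tick; all of them join one Promise.
  await Promise.all(
    Array.from({ length: 1000 }, () => singleFlight('k', compute)),
  );

  // A failing compute rejects its waiters, then finally clears the entry...
  const failed = await singleFlight<string>('k', async () => {
    throw new Error('db down');
  }).catch(() => 'failed');

  // ...so the next caller starts fresh instead of re-reading the failure.
  const recovered = await singleFlight('k', async () => 'fresh');

  return { computeCalls, failed, recovered, mapSize: inFlight.size };
}
```

The Map is empty again at the end, which is the property the finally line exists to guarantee.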

The delta field — measured compute time stored in the cache entry. XFetch uses the time the original compute took to decide how aggressively to refresh early. A query that took 1.2s gets refreshed earlier than a query that took 30ms, because expiring on a 1.2s query is far more painful. You only know delta if you measured it — store it. (The first time a key is computed, delta is the wall-clock time of that compute; subsequent refreshes update it. This is self-tuning.)

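Math.log(Math.random()). This is the XFetch math from the original 2015 paper. Math.random() is uniform on [0, 1), so Math.log(Math.random()) is always negative, and its negation is exponentially distributed. Scaling it by delta * BETA and subtracting the (negative) result from the current time pushes “now” forward by a random head start; a caller refreshes when that shifted time reaches expiresAt. Because each caller draws its own random number, the decisions are spread out instead of all landing on the expiry instant: far from expiry almost nobody refreshes, and the probability climbs toward 1 as the remaining time approaches zero. The randomness does most of the coordination for you, with no locks and no Redis round-trip, and single-flight collapses whatever overlap is left.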
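
The decision rule has a closed form worth seeing once. A caller refreshes when -delta * BETA * ln(U) is at least the remaining time, which for uniform U happens with probability exp(-remaining / (delta * BETA)). The sketch below computes that probability and restates the decision with an injectable random source; both function names are hypothetical helpers, not part of the code above.

```typescript
// Probability that a single read triggers an early refresh:
// P(-delta * beta * ln(U) >= remaining) = exp(-remaining / (delta * beta)).
function earlyRefreshProbability(
  remainingSeconds: number,
  deltaSeconds: number,
  beta: number = 1.0,
): number {
  if (remainingSeconds <= 0) return 1; // already expired
  if (deltaSeconds <= 0 || beta <= 0) return 0; // no head start at all
  return Math.exp(-remainingSeconds / (deltaSeconds * beta));
}

// The same comparison the read path performs, with the random source
// passed in so the behavior can be checked deterministically.
function shouldRefreshEarly(
  nowMs: number,
  expiresAtMs: number,
  deltaSeconds: number,
  beta: number = 1.0,
  rand: () => number = Math.random,
): boolean {
  const xfetch = deltaSeconds * beta * Math.log(rand()); // negative
  return nowMs - xfetch * 1000 >= expiresAtMs;
}
```

For the 380ms query with BETA = 1.0, a read arriving one second before expiry refreshes with probability of roughly 7%, and a read arriving five seconds before expiry almost never does (about two in a million). Hot keys see enough reads in that last second for one of them to fire; cold keys usually see none.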

Fire-and-forget refresh with void. The user-facing request returns the still-fresh cached value immediately. The refresh runs in the background. The user never waits for the slow path even at the moment of expiration. (This is the entire point. Without this, XFetch is just “miss earlier” and you get the cliff one second sooner.)

singleFlight('refresh:${key}', …). Even though XFetch makes “early refresh” rare, two callers can still pick the same moment. Wrapping the refresh in single-flight guarantees that no matter how many callers decide to refresh at once, exactly one query runs.

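PX (milliseconds), not EX. expiresAt is stored in milliseconds, so set Redis’s TTL in the same unit. For a whole-second TTL the two options expire at the same moment, but PX keeps the units aligned and stays exact if the TTL ever stops being a whole number of seconds. If the stored expiresAt and the real Redis expiry drift apart, XFetch’s math gets noisy and you get spurious early refreshes. Cheap precision.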

What the load test actually shows

You don’t trust this kind of pattern until you’ve watched it on a graph. The test is one HTTP endpoint backed by one Postgres query that takes 380ms; one Redis instance; 4,000 concurrent virtual users sustained for 5 minutes; cache TTL 60 seconds. Run with k6 or wrk2.

Before, with the naive cache-aside:

Metric	Baseline (no cache)	Naive cache	Stampede moment
p50 latency	380ms	4ms	6.4s
p99 latency	720ms	22ms	18.1s
Postgres queries / sec	4,000	0.016	~3,800 (spike)
Postgres CPU	95%	3%	100% (90s)

Look at the “Stampede moment” column. Each time the 60-second TTL lapses, the system spends roughly 90 seconds behaving as if there is no cache at all, because for that key there isn’t one, and every concurrent miss is hitting the database. The 0.016 queries-per-second average for the naive cache is misleading because those misses are not spread across time; they bunch into stampede windows where the entire cluster misses simultaneously.

After, with single-flight + XFetch:

Metric	Single-flight + XFetch
p50 latency	4ms
p99 latency	23ms
Postgres queries / sec	0.7 (avg, no spikes)
Postgres CPU	4% (flat)

Two things to notice. First, the spikes are gone: Postgres CPU is a flat line, not a sawtooth. Second, the average query rate rose from 0.016/s to 0.7/s, because XFetch refreshes earlier than it strictly needs to. That’s the deal: you trade a small number of extra refreshes for the elimination of the stampede entirely. On every workload I’ve measured, that trade is worth it by three orders of magnitude.

The four traps that quietly break this

The pattern looks airtight in dev. The traps mostly bite at scale or in failure modes you don’t think to test.

Trap 1: Error fan-out after a node-level Redis failure. If the shared work throws (the query fails, or the redis.set inside refreshAndStore fails because Redis was unreachable), the finally clears the Map correctly, but every concurrent caller awaiting that Promise gets the same throw. That is correct behavior, but it means a single Redis blip can amplify into a thousand simultaneous error responses for one popular key. Fix: in your compute wrapper, classify Redis errors and decide whether to fall back to the database directly (lose stampede protection for that one moment, gain availability) or fail closed (return the error). Most teams want the fall-back. Make it explicit.
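
One way to make the fall-back explicit, as a sketch under assumptions: the BestEffortCache interface, readOrMiss, and computeAndStoreBestEffort are hypothetical names, and onCacheError stands for whatever logging or metrics hook you already have. Cache errors are reported and then treated as a miss (on read) or ignored (on write); database errors still propagate.

```typescript
// Hypothetical minimal cache interface; a Redis client would sit behind it.
interface BestEffortCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
}

// Read path: an unreachable cache is reported and treated as a miss.
async function readOrMiss(
  cache: BestEffortCache,
  key: string,
  onCacheError: (err: unknown) => void,
): Promise<string | null> {
  try {
    return await cache.get(key);
  } catch (err) {
    onCacheError(err);
    return null;
  }
}

// Write path: a failed cache write must not discard a value the
// database already produced. Errors from compute() still propagate.
async function computeAndStoreBestEffort<T>(
  cache: BestEffortCache,
  key: string,
  ttlMs: number,
  compute: () => Promise<T>,
  onCacheError: (err: unknown) => void,
): Promise<T> {
  const value = await compute();
  try {
    await cache.set(key, JSON.stringify(value), ttlMs);
  } catch (err) {
    onCacheError(err);
  }
  return value;
}
```

This is the fail-open choice. Failing closed is the same code with the catch blocks rethrowing; what matters is that the decision is written down rather than left to whichever call happens to throw first.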

Trap 2: The XFetch refresh path also throws, silently. The void singleFlight(...).catch(...) swallows refresh errors so the foreground request can keep returning the cached value. That’s the right call, but if you don’t log the refresh failure with a counter, your cache will silently get older than you think. Always emit a cache.refresh.failed metric with the key prefix as a label. If you ever wake up to “the dashboard hasn’t updated in three hours,” it is this metric, ignored, that will tell you why.
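
A minimal sketch of that counter, assuming no particular metrics library: refreshFailures is a plain in-memory Map standing in for a real counter, and keyPrefixOf and refreshInBackground are hypothetical helpers. The returned Promise never rejects, so it is safe to fire and forget.

```typescript
// Stand-in for a real metrics counter: failures per key prefix.
const refreshFailures = new Map<string, number>();

// "dashboard:42" -> "dashboard". Labeling by prefix keeps cardinality low.
function keyPrefixOf(key: string): string {
  const idx = key.indexOf(':');
  return idx === -1 ? key : key.slice(0, idx);
}

// Runs the refresh, swallows its error so the foreground request is not
// affected, and records the failure instead of dropping it silently.
function refreshInBackground(
  key: string,
  refresh: () => Promise<unknown>,
): Promise<void> {
  return refresh().then(
    () => undefined,
    () => {
      const prefix = keyPrefixOf(key);
      refreshFailures.set(prefix, (refreshFailures.get(prefix) ?? 0) + 1);
    },
  );
}
```

In the read path this would replace the bare .catch(() => {}): the behavior for the user is identical, but the failure now leaves a trace you can alert on.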

Trap 3: BETA tuned to nonsense. Higher BETA = more aggressive early refresh = more queries to the database under steady load. Lower BETA = closer to the expiration cliff. The original paper’s default is 1.0 and that is almost always the right answer for a CRUD endpoint. If you tune BETA to 5.0 because “I want it really fresh,” you’ve reinvented “no cache.” If you tune it to 0.1 because “I want to save queries,” you’ve reinvented the stampede. Trust the default unless you are profiling with real traffic.

Trap 4: Multi-process or multi-instance: the Map is per process. A Node app running 20 workers under the cluster module, or a Kubernetes deployment with 20 pods, has 20 in-flight maps. The single-flight pattern coalesces 4,000 concurrent misses down to 20: a 200× win, but not 4000×. If 20 simultaneous queries on a cold key is unacceptable, escalate to a distributed lock: SET NX with a short TTL on a lock:dashboard:<id> key, the winner does the compute, the losers poll the cache. This is more code, more failure modes, and only worth it for queries so expensive that 20 concurrent ones bring down the database. (For the vast majority of services, “20 concurrent identical queries every minute on the worst key” is fine. Measure first.)
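
For reference, the lock-and-poll escalation can be sketched as follows. Everything here is an assumption for illustration: LockingCache is a hypothetical interface whose setIfAbsent would map to Redis SET with the NX and PX options, InMemoryLockingCache exists only so the sketch runs without a server, and the TTL and polling values are arbitrary.

```typescript
// Hypothetical minimal interface. With Redis, setIfAbsent corresponds to
// SET key value NX PX ttlMs, which succeeds only if the key does not exist.
interface LockingCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
}

// In-memory stand-in shared by all "pods" in the demo.
class InMemoryLockingCache implements LockingCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const hit = this.store.get(key);
    return hit && hit.expiresAt > Date.now() ? hit.value : null;
  }
  async set(key: string, value: string, ttlMs: number): Promise<void> {
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
  async setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean> {
    // Check and write with no await in between, so the pair is atomic.
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > Date.now()) return false;
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
    return true;
  }
}

const pause = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function readWithLock<T>(
  cache: LockingCache,
  key: string,
  compute: () => Promise<T>,
  lockTtlMs: number = 5000,
  valueTtlMs: number = 60000,
  pollMs: number = 25,
): Promise<T> {
  const cached = await cache.get(key);
  if (cached !== null) return JSON.parse(cached) as T;

  if (await cache.setIfAbsent(`lock:${key}`, '1', lockTtlMs)) {
    // Winner: run the query and publish it. The lock simply expires.
    const value = await compute();
    await cache.set(key, JSON.stringify(value), valueTtlMs);
    return value;
  }

  // Losers: poll until the winner publishes or the lock TTL has passed.
  const deadline = Date.now() + lockTtlMs;
  while (Date.now() < deadline) {
    await pause(pollMs);
    const filled = await cache.get(key);
    if (filled !== null) return JSON.parse(filled) as T;
  }
  // The winner crashed or was too slow: compute directly.
  return compute();
}
```

Note the extra failure modes this buys you: a lock TTL that must outlast the slowest query, polling latency for every loser, and a fallback path that reintroduces duplicate queries when the winner dies. That is the cost the paragraph above is warning about.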

What to actually monitor

The pattern only earns its keep if you can see it work. Two metrics, both cheap.

cache.compute.duration histogram, labeled by key prefix. This is the timing of compute() calls, i.e., the actual database work that the cache is meant to skip; the histogram’s sample count doubles as the compute rate. Before the pattern, that rate spikes toward the request rate at every expiration. After, it should stay a flat, tiny fraction of your incoming request rate (typically 1–5% under load). If it spikes to match the request rate, you have a bug: most likely the Map is leaking or you forgot the await somewhere and every “cached” call is actually a miss.

cache.refresh.early counter. Increment this every time XFetch decides to refresh early, and keep a plain cache.miss counter on the true-miss path to compare it against. The ratio of cache.refresh.early to cache.miss should be high under load (most refreshes are proactive) and ~0 when traffic is low. If it’s low under heavy load, your XFetch math isn’t firing, usually because delta is being stored as 0 (you forgot to measure compute time) or your TTL is so long that the early-refresh window almost never opens.

You don’t need a fancy APM for either. A counter and a histogram in your existing metrics library is enough.
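
As a rough sketch of how little is needed, here is an in-memory version. CacheMetrics and timedCompute are hypothetical; in a real service the Maps would be replaced by whatever counter and histogram types your metrics library provides.

```typescript
type CounterName = 'cache.refresh.early' | 'cache.miss';

class CacheMetrics {
  private counters = new Map<string, number>();
  private computeDurationsMs = new Map<string, number[]>();

  increment(metric: CounterName, prefix: string): void {
    const id = `${metric}|${prefix}`;
    this.counters.set(id, (this.counters.get(id) ?? 0) + 1);
  }

  count(metric: CounterName, prefix: string): number {
    return this.counters.get(`${metric}|${prefix}`) ?? 0;
  }

  // One sample per compute() call: the cache.compute.duration histogram.
  observeCompute(prefix: string, durationMs: number): void {
    const samples = this.computeDurationsMs.get(prefix) ?? [];
    samples.push(durationMs);
    this.computeDurationsMs.set(prefix, samples);
  }

  computeCount(prefix: string): number {
    return (this.computeDurationsMs.get(prefix) ?? []).length;
  }

  // High under load when XFetch is working; 0 when nothing has happened.
  earlyToMissRatio(prefix: string): number {
    const early = this.count('cache.refresh.early', prefix);
    const miss = this.count('cache.miss', prefix);
    if (miss === 0) return early > 0 ? Infinity : 0;
    return early / miss;
  }
}

// Wraps a compute function so every call is timed, success or failure.
async function timedCompute<T>(
  metrics: CacheMetrics,
  prefix: string,
  compute: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await compute();
  } finally {
    metrics.observeCompute(prefix, Date.now() - start);
  }
}
```

The increment calls would go on the early-refresh branch and the true-miss branch of the read path, and timedCompute would wrap the compute callback.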

When not to bother

This is not a universal pattern. Skip it when:

Your endpoint isn’t hot enough for the math to matter. If you serve 5 req/s, the cliff at expiration is two duplicate queries every minute — invisible.
The underlying compute is already fast (sub-10ms). The “stampede” is 100 concurrent 8ms queries, which Postgres handles without flinching.
You have a dedicated read replica that exists to absorb this exact kind of burst. (If you do, congratulations — but most services don’t, and adding one is more expensive than 40 lines of single-flight.)
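
A back-of-the-envelope check for the first two cases, as a sketch: the number of readers that miss during one refill is roughly the request rate on that key multiplied by the query time. This assumes evenly spread arrivals and a query time that does not degrade under load, so treat it as a lower bound; expectedDuplicateQueries is a hypothetical helper.

```typescript
// Rough count of requests that arrive on one key while it is being refilled.
function expectedDuplicateQueries(
  requestsPerSecond: number,
  querySeconds: number,
): number {
  return requestsPerSecond * querySeconds;
}

// 5 req/s against the 380ms query: about 2 duplicates per expiry.
const quietEndpoint = expectedDuplicateQueries(5, 0.38);

// 4,000 req/s aimed at one key with the same query: about 1,500.
const hotEndpoint = expectedDuplicateQueries(4000, 0.38);
```

If the number for your hottest key comes out in the single digits, the pattern is probably not worth the extra code yet.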

For everyone else — anything serving thousands of req/s with a meaningful cache and a non-trivial backing query — this is one of the cheapest reliability improvements you can ship in an afternoon.

The takeaway

The “just add Redis” caching layer hides a sharp edge: the moment of expiration is not a graceful handoff, it is a synchronized starting pistol for every concurrent reader. Forty lines of single-flight + probabilistic early refresh turn that pistol shot into a soft, asynchronous, fire-and-forget refresh that the user never sees. The naive cache makes p50 fast at the cost of a periodic database fire; the patched version makes p50 fast and keeps Postgres bored.

Wire it once. Set BETA = 1.0, store the compute time in the entry, log the refresh failures. The next time the cache for your hottest key expires under load, the database CPU graph will not move.
