Finding Node.js Memory Leaks with Heap Snapshots

Your service does not crash during the load test. It crashes six hours later, when traffic is boring and nobody is watching.

The graph is always the same: RSS climbs in a staircase, garbage collection gets louder, p95 latency follows, then Kubernetes restarts the pod with OOMKilled. You increase the memory limit from 512 MB to 1 GB. The crash moves from lunch to dinner. You increase it again. Now the bill is worse and the leak is still there.

This is the moment where guessing gets expensive. Memory leaks in Node.js are rarely fixed by staring at the line where the process died. The useful question is not “why did V8 run out of memory?” It is “what objects survived long enough to become old, who is retaining them, and why did we expect them to be collectible?”

This post walks through a production-friendly workflow: track the right memory signals, trigger safe heap snapshots, compare retained objects in Chrome DevTools, fix the actual retention path, and prove the fix with a repeatable leak test.

No folklore. No “just restart nightly”. Working code.

First: know which memory is growing

Node exposes several memory numbers. They do not mean the same thing.

setInterval(() => {
  const m = process.memoryUsage();

  console.log(JSON.stringify({
    rss: Math.round(m.rss / 1024 / 1024),
    heapUsed: Math.round(m.heapUsed / 1024 / 1024),
    heapTotal: Math.round(m.heapTotal / 1024 / 1024),
    external: Math.round(m.external / 1024 / 1024),
    arrayBuffers: Math.round(m.arrayBuffers / 1024 / 1024),
  }));
}, 30_000).unref();

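Watch heapUsed for JavaScript objects on the V8 heap (live objects plus garbage that has not been collected yet), external for memory attached to JS objects but allocated outside the heap, arrayBuffers for the ArrayBuffer and Buffer share of external, and rss for total resident process memory.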

If heapUsed climbs forever, heap snapshots are the right tool. If rss climbs while heapUsed stays flat, suspect native memory, large buffers, image processing, compression, TLS, database drivers, or allocator fragmentation. Heap snapshots may still help, but they will not show every byte.
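The triage above can be sketched as a small function over two process.memoryUsage() samples. The helper name, the return labels, and the 50 MB noise floor are all assumptions made for illustration; tune the threshold to your own service.

```javascript
// Hypothetical helper: classify growth between two process.memoryUsage() samples.
// thresholdBytes is an arbitrary noise floor, not a recommended value.
function classifyGrowth(before, after, thresholdBytes = 50 * 1024 * 1024) {
  const grew = (key) => after[key] - before[key] > thresholdBytes;

  if (grew('heapUsed')) return 'js-heap';            // heap snapshots are the right tool
  if (grew('external')) return 'external';           // Buffers, native bindings (includes arrayBuffers)
  if (grew('rss')) return 'native-or-fragmentation'; // outside what V8 accounts for
  return 'stable';
}

// Usage: sample once after warmup, once after the workload.
const baseline = process.memoryUsage();
// ... run the workload here ...
console.log(classifyGrowth(baseline, process.memoryUsage()));
```

The order of the checks matters: heap growth also raises rss, so the heap is tested first.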

For services, export these as metrics instead of printing them:

import client from 'prom-client';

export const nodeHeapUsedBytes = new client.Gauge({
  name: 'nodejs_heap_used_bytes',
  help: 'V8 heap currently used by live objects',
});

export const nodeRssBytes = new client.Gauge({
  name: 'nodejs_rss_bytes',
  help: 'Resident set size for the Node.js process',
});

setInterval(() => {
  const m = process.memoryUsage();
  nodeHeapUsedBytes.set(m.heapUsed);
  nodeRssBytes.set(m.rss);
}, 10_000).unref();

You want a graph that answers this before the page fires: is the JavaScript heap leaking, or is the process leaking outside the heap?

The demo leak: a cache that forgot to evict

Here is a small Express service with a realistic leak. It stores per-user responses in a process-local Map and never evicts old entries.

import express from 'express';
import crypto from 'node:crypto';

const app = express();
const userResponseCache = new Map();

app.get('/api/users/:id/dashboard', async (req, res) => {
  const userId = req.params.id;

  if (userResponseCache.has(userId)) {
    return res.json(userResponseCache.get(userId));
  }

  const dashboard = {
    userId,
    generatedAt: new Date().toISOString(),
    widgets: Array.from({ length: 80 }, (_, i) => ({
      id: i,
      title: `Widget ${i}`,
      // Pretend this came from a few joins and service calls.
      data: crypto.randomBytes(2048).toString('hex'),
    })),
  };

  userResponseCache.set(userId, dashboard);
  res.json(dashboard);
});

app.listen(3000, () => {
  console.log('listening on :3000');
});

This is the kind of bug that passes tests. With ten users it looks like an optimization. With hundreds of thousands of user IDs, it becomes a memory leak with a friendly name.

Add safe heap snapshot capture

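Node has a built-in heap snapshot API via node:v8. Writing a snapshot blocks the event loop while V8 walks the heap and serializes it, which can take seconds or longer on a large heap. Node's documentation also warns that generating a snapshot needs roughly twice the heap's size in memory, so a pod that is already close to its limit can be OOM-killed by the snapshot itself. The file can be hundreds of megabytes, and it may include secrets from memory.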

So do not expose this as an unauthenticated public route. Prefer a signal handler in production and an authenticated HTTP endpoint in controlled internal environments.

Option A: signal-triggered snapshots

This is the safest default for Kubernetes or systemd because it does not add a network surface.

import v8 from 'node:v8';
import fs from 'node:fs';
import path from 'node:path';

const snapshotDir = process.env.HEAP_SNAPSHOT_DIR ?? '/tmp';

function writeHeapSnapshot(reason) {
  fs.mkdirSync(snapshotDir, { recursive: true });

  const file = path.join(
    snapshotDir,
    `heap-${process.pid}-${Date.now()}-${reason}.heapsnapshot`,
  );

  const written = v8.writeHeapSnapshot(file);
  console.error(`wrote heap snapshot: ${written}`);
}

process.on('SIGUSR2', () => {
  writeHeapSnapshot('sigusr2');
});

Then trigger it:

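# Signal one specific pod, then copy from that same pod.
# (kubectl exec deploy/api picks a pod for you, which may not be the one you copy from.)
kubectl exec api-pod-name -- kill -USR2 1
kubectl cp default/api-pod-name:/tmp/heap-123-1710000000000-sigusr2.heapsnapshot ./heap-before.heapsnapshot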

If your container runs Node as PID 1, kill -USR2 1 works. If you use an init process, target the Node PID.

Capture two snapshots, not one

A single heap snapshot shows what exists. A pair of snapshots shows what grew.

The workflow: start the service, warm it up, force a garbage collection if you are in a test environment, capture before.heapsnapshot, run the workload, force GC again, capture after.heapsnapshot, then compare the two.

For local leak hunting, run Node with explicit GC enabled:

node --expose-gc server.js

Add a test-only endpoint or script hook:

export function forceGcForTestsOnly() {
  if (process.env.NODE_ENV !== 'test') return;
  if (typeof global.gc === 'function') {
    global.gc();
    global.gc();
  }
}

Do not rely on forced GC in production. It is a diagnostic trick, not a runtime policy.
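For the before/after pair, it helps to make "settle the heap, then snapshot" a single step so both captures are taken the same way. A minimal sketch, assuming you inject global.gc and v8.writeHeapSnapshot yourself; the function name and the label-based file name are hypothetical.

```javascript
// Sketch: force two GC passes, then write a snapshot. `gc` and `writeSnapshot`
// are injected so the caller decides where they come from (for example
// global.gc and writeHeapSnapshot from node:v8).
function captureSettledSnapshot({ gc, writeSnapshot }, label) {
  if (typeof gc !== 'function') {
    throw new Error('gc unavailable: start Node with --expose-gc');
  }
  gc();
  gc();
  return writeSnapshot(`${label}.heapsnapshot`);
}

// Usage in a test environment:
//   import v8 from 'node:v8';
//   const deps = { gc: global.gc, writeSnapshot: v8.writeHeapSnapshot };
//   captureSettledSnapshot(deps, 'before');
//   ... run the workload ...
//   captureSettledSnapshot(deps, 'after');
```

Throwing when gc is missing is deliberate: a snapshot taken without a prior collection includes garbage and makes the comparison noisier.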

Now generate load with many unique users:

// leak-test.mjs
const baseUrl = process.env.BASE_URL ?? 'http://localhost:3000';

for (let i = 0; i < 50_000; i++) {
  const res = await fetch(`${baseUrl}/api/users/user-${i}/dashboard`);

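  if (!res.ok) {
    throw new Error(`request ${i} failed: ${res.status}`);
  }

  // Drain the body so the connection can be reused and the response released.
  await res.arrayBuffer();

  if (i % 1000 === 0) {
    // This script runs in its own process, so process.memoryUsage() here would
    // describe the load generator. Read the server's heap from its own logs or metrics.
    console.log({ i });
  }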
}

Run it:

node --expose-gc server.js
node leak-test.mjs

Read the heap snapshot without drowning

Open Chrome DevTools, go to Memory, load both .heapsnapshot files, select the later snapshot, and switch the view to Comparison against the earlier one. Sort by Size Delta or # Delta.

The first view is noisy. Do not chase the largest class name blindly. Chase retention.

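The useful numbers are # Delta and Size Delta in the Comparison view for object growth, Retained Size (visible in the Summary view) for memory that would become collectible if an object disappeared, and the Retainers pane for the chain of references that keeps an object reachable from a GC root.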

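In the demo leak, the comparison will show growth in Object, Array, and string data. Click one of the dashboard objects and inspect its retainers. DevTools lists them from the object upward; read from the root down, the important path looks roughly like this (node names vary by Node and DevTools version, and there is no Window in Node):

(GC roots)
  -> system / Context
    -> userResponseCache
      -> Map
        -> table
          -> Object
            -> widgets
              -> data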

That path is the bug report. The Map is reachable from module scope, so every value in it is reachable. V8 is doing the right thing. Your program told it to keep everything.

This is why heap snapshots beat guesses. You are not looking for “a lot of objects”. You are looking for “a lot of objects retained by something that should not own them forever.”

Common retention paths in Node services

Unbounded Maps and Sets

Caches, de-dupe tables, idempotency trackers, rate-limit buckets, and in-flight request maps all leak when they do not have a size or time boundary.

Bad:

const seenRequestIds = new Set();

export function dedupe(requestId) {
  if (seenRequestIds.has(requestId)) return false;
  seenRequestIds.add(requestId);
  return true;
}

Better:

import { LRUCache } from 'lru-cache';

const seenRequestIds = new LRUCache({
  max: 50_000,
  ttl: 15 * 60 * 1000,
});

export function dedupe(requestId) {
  if (seenRequestIds.has(requestId)) return false;
  seenRequestIds.set(requestId, true);
  return true;
}

A cache without a limit is a memory leak with better branding.
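If pulling in a dependency is not an option, the same idea fits in a few lines. This is a minimal sketch, not a replacement for lru-cache: it evicts in insertion order (FIFO) rather than true LRU, and the class and option names are made up for illustration.

```javascript
// Minimal sketch of a size- and time-bounded "seen" set.
// A Map iterates in insertion order, so the first key is always the oldest.
class BoundedSeenSet {
  constructor({ max, ttlMs, now = Date.now }) {
    this.max = max;
    this.ttlMs = ttlMs;
    this.now = now;           // injectable clock, handy for tests
    this.entries = new Map(); // key -> insertion timestamp
  }

  // Returns true the first time a key is seen within the TTL window.
  addIfNew(key) {
    const t = this.now();
    const seenAt = this.entries.get(key);
    if (seenAt !== undefined && t - seenAt < this.ttlMs) return false;

    this.entries.delete(key); // re-insert so the key moves to the newest position
    this.entries.set(key, t);

    // Evict the oldest entries until we are back under the size bound.
    while (this.entries.size > this.max) {
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
    return true;
  }

  get size() {
    return this.entries.size;
  }
}

const seen = new BoundedSeenSet({ max: 50_000, ttlMs: 15 * 60 * 1000 });
console.log(seen.addIfNew('req-1'), seen.addIfNew('req-1')); // true false
```

The trade-off is the one every bounded de-dupe table makes: once an ID is evicted, a replay of it is treated as new.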

Event listeners that never unregister

Bad:

app.get('/events', (req, res) => {
  const onUpdate = (event) => {
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  };

  domainEmitter.on('update', onUpdate);
});

When the client disconnects, the listener remains. It retains res, closures, and whatever else the request captured.

Better:

app.get('/events', (req, res) => {
  res.setHeader('content-type', 'text/event-stream');

  const onUpdate = (event) => {
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  };

  domainEmitter.on('update', onUpdate);

  // 'close' on the response fires when the connection ends, including client disconnects.
  res.on('close', () => {
    domainEmitter.off('update', onUpdate);
    res.end();
  });
});

If DevTools shows retained closures under EventEmitter, look for missing cleanup.
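One way to make cleanup hard to forget is to have registration hand back its own undo function. A small sketch, assuming an emitter that exposes on and off as Node's EventEmitter does; the subscribe helper itself is hypothetical.

```javascript
// Hypothetical helper: register a listener and return an idempotent cleanup function.
// Works with anything exposing on/off, such as Node's EventEmitter.
function subscribe(emitter, eventName, handler) {
  emitter.on(eventName, handler);
  let active = true;

  return function unsubscribe() {
    if (!active) return; // safe to call more than once
    active = false;
    emitter.off(eventName, handler);
  };
}

// Usage inside the SSE route:
//   const unsubscribe = subscribe(domainEmitter, 'update', onUpdate);
//   res.on('close', unsubscribe);
```

For a cheap production signal, emitter.listenerCount('update') should track the number of open connections. If it only ever goes up, a cleanup path is missing.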

Promise queues that keep completed work

Batchers and retry queues often retain results by accident.

Bad:

const pending = [];

export function enqueue(job) {
  const promise = runJob(job);
  pending.push({ job, promise });
  return promise;
}

Nothing removes completed work. The array becomes a history table.

Better:

const pending = new Set();

export function enqueue(job) {
  const entry = { job };
  entry.promise = runJob(job).finally(() => {
    pending.delete(entry);
  });

  pending.add(entry);
  return entry.promise;
}

Also consider whether job contains large payloads. If you only need an ID for observability, store the ID.

Metrics with unbounded labels

Prometheus labels are not a place to put user IDs, request IDs, email addresses, or raw paths.

Bad:

httpRequestsTotal.inc({
  method: req.method,
  path: req.path,
  userId: req.user.id,
});

Every unique label combination creates time series state in your process and in Prometheus.

Better:

httpRequestsTotal.inc({
  method: req.method,
  route: req.route?.path ?? 'unknown',
  status: String(res.statusCode),
});

If a heap snapshot points at a metrics registry retaining thousands of label objects, the memory leak is also an observability bill leak.
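Route templates are usually bounded already, but a guard makes the bound explicit. The sketch below caps the number of distinct values one label may take and folds the rest into a single bucket. createLabelGuard, the 200-value cap, and the 'other' bucket are assumptions for illustration, not part of prom-client.

```javascript
// Hypothetical guard: allow at most `maxValues` distinct values for one label
// and collapse everything else into a single overflow bucket.
function createLabelGuard(maxValues, overflow = 'other') {
  const allowed = new Set();

  return function guard(value) {
    const v = String(value);
    if (allowed.has(v)) return v;
    if (allowed.size < maxValues) {
      allowed.add(v);
      return v;
    }
    return overflow;
  };
}

const routeLabel = createLabelGuard(200);

// Usage with the counter from above:
//   httpRequestsTotal.inc({
//     method: req.method,
//     route: routeLabel(req.route?.path ?? 'unknown'),
//     status: String(res.statusCode),
//   });
console.log(routeLabel('/api/users/:id/dashboard'));
```

The guard's own Set is bounded by maxValues, so it cannot become the next leak.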

Fix the demo with a bounded cache

For the dashboard cache, use an LRU with both a size limit and TTL. Size protects the process from high cardinality. TTL protects users from stale data.

npm install lru-cache

import express from 'express';
import crypto from 'node:crypto';
import { LRUCache } from 'lru-cache';

const app = express();

const userResponseCache = new LRUCache({
  max: 10_000,
  ttl: 5 * 60 * 1000,
  updateAgeOnGet: false,
});

app.get('/api/users/:id/dashboard', async (req, res) => {
  const userId = req.params.id;
  const cached = userResponseCache.get(userId);

  if (cached) {
    return res.json(cached);
  }

  const dashboard = {
    userId,
    generatedAt: new Date().toISOString(),
    widgets: Array.from({ length: 80 }, (_, i) => ({
      id: i,
      title: `Widget ${i}`,
      data: crypto.randomBytes(2048).toString('hex'),
    })),
  };

  userResponseCache.set(userId, dashboard);
  res.json(dashboard);
});

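This is not just “use a library.” It is a policy decision written in code: at most 10,000 dashboard responses live in memory, no entry is served once it is more than five minutes old, and because updateAgeOnGet is false an entry does not become immortal just because one user keeps requesting it. Note that lru-cache removes expired entries lazily by default, so a stale entry can occupy memory until it is read or evicted; the max bound is what caps memory.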

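If the response is large, prefer a byte-based maxSize too. In this demo each dashboard holds 80 hex strings of 4,096 characters, roughly 320 KB, so 10,000 entries would be about 3 GB, which is more than most heaps allow. The point is to bound the cache by the thing that can actually kill the process: memory, not vibes.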
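lru-cache supports this through its maxSize and sizeCalculation options. The sketch below builds an options object for it; the 128 MB budget and the approxDashboardBytes estimator are assumptions, and the estimate ignores object overhead, so treat it as a lower bound.

```javascript
// Rough byte estimate for one cached dashboard: the sum of the widget payload strings.
function approxDashboardBytes(dashboard) {
  let bytes = 0;
  for (const widget of dashboard.widgets) {
    bytes += widget.data.length; // hex is ASCII; assume about one byte per character
  }
  return Math.max(1, bytes); // sizeCalculation must return a positive integer
}

// Options to pass to `new LRUCache(...)`. The 128 MB budget is illustrative.
const dashboardCacheOptions = {
  maxSize: 128 * 1024 * 1024,
  sizeCalculation: (value) => approxDashboardBytes(value),
  ttl: 5 * 60 * 1000,
};

// Back-of-envelope check for the demo payload: 80 widgets x 4,096 hex characters.
const perEntryBytes = 80 * 4096;
const entriesThatFit = Math.floor(dashboardCacheOptions.maxSize / perEntryBytes);
console.log({ perEntryBytes, entriesThatFit }); // { perEntryBytes: 327680, entriesThatFit: 409 }
```

With a byte budget, the number of cached users falls out of the payload size instead of being guessed up front.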

Add a regression test for memory growth

Memory tests are noisy. They should not fail because one GC cycle happened later than expected. But they can catch obvious unbounded growth.

Here is a simple integration test that runs the same workload twice and checks that heap growth stabilizes after the cache reaches its maximum size.

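import assert from 'node:assert/strict';

// Importing the server runs it in this process, so heapUsed below measures the
// server's heap rather than a separate client's. Assumes server.js listens on
// :3000 as a side effect of being imported.
import './server.js';

async function hitUsers(start, count) {
  for (let i = start; i < start + count; i++) {
    const res = await fetch(`http://localhost:3000/api/users/user-${i}/dashboard`);
    assert.equal(res.status, 200);
    await res.arrayBuffer(); // drain the body so the response can be released
  }
}

function heapUsedMb() {
  if (typeof global.gc === 'function') {
    global.gc();
    global.gc();
  }

  return process.memoryUsage().heapUsed / 1024 / 1024;
}

// Run with: node --expose-gc memory-regression.test.mjs
// With the demo's ~320 KB payloads, lower the cache's max (or add maxSize)
// so that a saturated cache actually fits in the heap.
await hitUsers(0, 20_000);
const afterFirst = heapUsedMb();

await hitUsers(20_000, 20_000);
const afterSecond = heapUsedMb();

const growth = afterSecond - afterFirst;

assert.ok(
  growth < 80,
  `heap kept growing after cache saturation: ${growth.toFixed(1)} MB`,
);

process.exit(0); // the imported server would otherwise keep the event loop alive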

This is not a perfect proof. It is a tripwire. Keep the threshold coarse enough to avoid flakes, but strict enough to catch someone changing max: 10_000 back to an unbounded Map.

Production guardrails that actually help

Heap snapshots find leaks. Guardrails reduce blast radius while you are finding them.

Set a realistic memory limit. In Kubernetes, do not run Node with a 4 GB container limit if the service normally needs 300 MB.

resources:
  requests:
    memory: "384Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"
    cpu: "1000m"

Set V8’s old-space limit below the container limit so Node aborts with its own “JavaScript heap out of memory” error and GC trace before the kernel kills the process without a useful report:

node --max-old-space-size=512 server.js

Leave headroom for RSS outside the V8 heap. If the container limit is 768 MB, do not set old space to 760 MB. Buffers, native modules, stacks, and the runtime need room.
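The arithmetic is simple enough to keep next to your deploy config. A sketch, where the helper name and the 256 MB default headroom are assumptions to tune per service, not Node.js defaults:

```javascript
// Hypothetical helper: derive --max-old-space-size from a container memory limit.
// headroomMb covers Buffers, native modules, stacks, and the runtime itself.
function oldSpaceFlag(containerLimitMb, headroomMb = 256) {
  const oldSpaceMb = containerLimitMb - headroomMb;
  if (oldSpaceMb <= 0) {
    throw new RangeError('container limit leaves no room for the V8 heap');
  }
  return `--max-old-space-size=${oldSpaceMb}`;
}

console.log(oldSpaceFlag(768)); // --max-old-space-size=512
```

Services that handle large Buffers or lean on native modules need more headroom than this default suggests.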

Add alerts on slope, not just absolute value. “Heap is above 600 MB” is late. “Heap grew 200 MB in 30 minutes while request rate stayed flat” is actionable.

A PromQL sketch:

# delta(), not increase(): heap usage is a gauge, and increase() is meant for counters.
delta(nodejs_heap_used_bytes[30m]) > 200 * 1024 * 1024

Pair it with restart count and latency alerts. A memory leak often shows up as GC pressure before it becomes an OOM.
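The same slope idea works in-process, for example to log a warning or trigger a snapshot before the pod dies. Below is a minimal least-squares sketch over recent samples; the sample shape and function name are assumptions, and deciding what slope counts as a leak is left to you.

```javascript
// Least-squares slope of heap samples, in bytes per second.
// samples: [{ t: epochMillis, heapUsed: bytes }, ...] in chronological order.
function heapSlopeBytesPerSec(samples) {
  const n = samples.length;
  if (n < 2) return 0;

  const t0 = samples[0].t;
  let sumX = 0;
  let sumY = 0;
  let sumXY = 0;
  let sumXX = 0;

  for (const s of samples) {
    const x = (s.t - t0) / 1000; // seconds since the first sample
    sumX += x;
    sumY += s.heapUsed;
    sumXY += x * s.heapUsed;
    sumXX += x * x;
  }

  const denom = n * sumXX - sumX * sumX;
  return denom === 0 ? 0 : (n * sumXY - sumX * sumY) / denom;
}

// Usage: keep a short ring of samples and check the trend periodically.
//   samples.push({ t: Date.now(), heapUsed: process.memoryUsage().heapUsed });
//   if (heapSlopeBytesPerSec(samples) > someThreshold) { /* warn or snapshot */ }
```

A fitted slope is less jumpy than comparing the first and last sample, because a single GC cycle landing at the wrong moment does not dominate the result.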

The practical workflow

When a Node service leaks memory, do this:

Confirm whether heapUsed, external, or only rss is growing.
Reproduce the growth with a script that resembles the production cardinality.
Capture a heap snapshot after warmup.
Run the workload.
Capture another snapshot after GC in a test environment.
Compare snapshots by size delta.
Follow retainers to the owner.
Add a bound, cleanup path, or lifecycle rule.
Add a regression test or metric alert so the leak does not return quietly.

The fix is usually small. The hard part is proving which reference is keeping the object alive.

For the demo service, the working answer is a bounded LRU cache. For your service, it might be emitter.off, deleting completed jobs from a Set, replacing raw URL labels with route templates, clearing a timer, or moving large buffers out of process memory entirely.

Do not normalize periodic OOMs. A process that needs restarts to stay healthy is not healthy; it is hiding state you do not understand yet.

A note from Yojji

The kind of backend reliability work that turns a mysterious OOMKilled graph into a small, verified fix — heap snapshots, retention-path analysis, cache bounds, and production guardrails — is exactly the unglamorous engineering that keeps systems stable after launch.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. Their engineers work heavily in the JavaScript ecosystem, cloud platforms, and microservices architecture, including the performance and operational details that decide whether a Node.js service survives real production traffic.