Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys
Every redeploy your users see a 4–7 second window of 502s. Here is exactly why, the 40 lines of Node code that eliminate it, and how to verify the fix with a real load test.
Every time you deploy, your users see 502s for four to seven seconds. You probably shrug at it. You shouldn’t.
When Kubernetes — or ECS, or Fly, or whatever orchestrator you run — sends SIGTERM to your container, your Node process gets up to 30 seconds before SIGKILL finishes the job. Most apps use zero of those seconds. They die mid-request, the load balancer keeps sending traffic to the dying pod for one more health-check interval, and you ship 502s on every single deploy.
Here is exactly what is happening, the 40 lines of code that fix it, and how to convince yourself the fix actually worked.
The seven-second window
Run this experiment on a vanilla Express app behind any load balancer. Start a vegeta load test at 100 RPS, then trigger a redeploy. In your error dashboard you will see:
- A clean spike of 502s starting at the moment the new pod begins rolling
- A handful of partial responses — connections cut mid-payload
- Background jobs that were halfway through a database write that never committed
This is not a Node problem. It is a “you did not implement the shutdown handshake” problem. The runtime gives you the tools. Almost nobody wires them up.
What is supposed to happen
The handshake has four steps. Almost no codebase implements all four:
1. Receive SIGTERM. Your process gets a signal saying “shut down soon.”
2. Stop accepting new connections. Fail your readiness probe so the load balancer routes traffic elsewhere. Existing in-flight requests keep going.
3. Drain the queue. Finish in-flight requests. Stop background workers at a safe checkpoint. Flush logs.
4. Exit cleanly. When the queue is empty, process.exit(0).
Skip step 2 and you accept new requests right up until SIGKILL — guaranteed 502s.
Skip step 3 and you kill in-flight requests — partial responses, half-committed transactions, the kind of bug that shows up in an “occasional” support ticket six months later.
The 40 lines
// shutdown.js
import { setTimeout as wait } from 'node:timers/promises'
const SHUTDOWN_TIMEOUT_MS = 25_000 // stay under the 30s SIGKILL grace
const READINESS_DRAIN_MS = 5_000 // give the LB time to notice we are unready
let isShuttingDown = false
const cleanups = []
export function isHealthy() {
  return !isShuttingDown
}
export function onShutdown(fn) {
  cleanups.push(fn)
}
export function installGracefulShutdown(server) {
  const shutdown = async (signal) => {
    if (isShuttingDown) return
    isShuttingDown = true // /readyz now returns 503
    console.log(`[shutdown] received ${signal}`)
    // 1. Keep serving while the LB notices the failing readiness probe.
    await wait(READINESS_DRAIN_MS)
    // 2. Stop accepting new connections; resolves once in-flight requests finish.
    const closed = new Promise(resolve => server.close(err => {
      if (err) console.error('[shutdown] server.close error', err)
      resolve()
    }))
    server.closeIdleConnections?.() // Node 18.2+: drop idle keep-alive sockets
    // 3. Only then run cleanups (DB pool, queue workers, log flush).
    const settle = closed.then(() => Promise.allSettled(cleanups.map(fn => fn())))
    const timeout = wait(SHUTDOWN_TIMEOUT_MS - READINESS_DRAIN_MS)
      .then(() => { throw new Error('shutdown timeout') })
    try {
      await Promise.race([settle, timeout])
      process.exit(0)
    } catch (e) {
      console.error('[shutdown] forced exit:', e.message)
      process.exit(1)
    }
  }
  process.on('SIGTERM', () => shutdown('SIGTERM'))
  process.on('SIGINT', () => shutdown('SIGINT'))
}
Wire it in:
// server.js
import express from 'express'
import { installGracefulShutdown, isHealthy, onShutdown } from './shutdown.js'
import { pool } from './db.js'
import { worker } from './queue.js'
const app = express()
app.get('/healthz', (_, res) => res.status(200).send('ok'))
app.get('/readyz', (_, res) => res.status(isHealthy() ? 200 : 503).send())
app.get('/', /* your routes */)
const server = app.listen(3000)
onShutdown(() => pool.end()) // close the Postgres pool
onShutdown(() => worker.close()) // stop pulling jobs off the queue
installGracefulShutdown(server)
Forty lines plus three wiring points. That is the whole change.
Why the five-second drain matters
The most counter-intuitive line is the wait(READINESS_DRAIN_MS) before server.close. Without it, your server stops accepting new connections immediately, but the load balancer does not know yet. Its readiness probe only runs every few seconds. New requests arriving in that gap get connection refused, which the LB turns into a 502.
The drain inverts the order: fail readiness checks for five seconds, then stop the listener. By the time the listener stops, the LB has already routed traffic to other pods. Zero 502s.
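Tune READINESS_DRAIN_MS to be slightly longer than (probe period) × (failure threshold). For default Kubernetes settings (10s period, 3 failures), 5s is too short: bump it to 35s, and raise SHUTDOWN_TIMEOUT_MS and terminationGracePeriodSeconds along with it, because a 35s drain does not fit inside the default 30s grace period. For Cloudflare or AWS ALB defaults, 5–10s is fine.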
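The tuning rule is plain arithmetic, so it can be worth writing down next to the constants. The sketch below is one way to do that. The helper names and the 5s safety margin are assumptions made for illustration; only periodSeconds, failureThreshold, and the grace period correspond to real Kubernetes settings.

```javascript
// Hypothetical helper: derive the readiness drain from the probe settings.
// marginMs is an assumed safety margin, not a value taken from any spec.
function readinessDrainMs({ periodSeconds, failureThreshold, marginMs = 5_000 }) {
  return periodSeconds * failureThreshold * 1_000 + marginMs
}

// Hypothetical check: drain plus cleanup budget must finish before SIGKILL.
function fitsGracePeriod({ drainMs, cleanupBudgetMs, gracePeriodSeconds }) {
  return drainMs + cleanupBudgetMs < gracePeriodSeconds * 1_000
}

// Default Kubernetes probe settings: 10s period, 3 failures.
const drainMs = readinessDrainMs({ periodSeconds: 10, failureThreshold: 3 })
console.log(drainMs) // 35000

// A 35s drain with a 20s cleanup budget does not fit the default 30s grace period...
console.log(fitsGracePeriod({ drainMs, cleanupBudgetMs: 20_000, gracePeriodSeconds: 30 })) // false
// ...but it does fit if the grace period is raised to 60s.
console.log(fitsGracePeriod({ drainMs, cleanupBudgetMs: 20_000, gracePeriodSeconds: 60 })) // true
```

Running a check like this at startup, or in a unit test, would catch the case where someone raises the drain without raising the grace period.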
How to test it
Two checks. The first is local:
node server.js &
PID=$!
sleep 1
curl -s -o /dev/null -w "%{http_code}\n" localhost:3000/readyz # 200
kill -TERM $PID
sleep 0.1
curl -s -o /dev/null -w "%{http_code}\n" localhost:3000/readyz # 503
wait $PID
/readyz flips to 503 the moment SIGTERM lands. The process keeps serving in-flight requests, then exits.
The second is an end-to-end load test against a staging deploy:
echo "GET https://staging.example.com/" | vegeta attack -rate=100 -duration=60s > attack.bin
# in another terminal, trigger a redeploy mid-attack
vegeta report attack.bin
Before the fix:
Requests [total, rate] 6000, 100.07
Success [ratio] 97.42%
Status codes [code:count] 200:5845 502:155
After:
Requests [total, rate] 6000, 100.07
Success [ratio] 100.00%
Status codes [code:count] 200:6000
Zero 502s. Through a deploy. Print the vegeta report and put it in your PR description — it is a stronger argument for merging than any explanation.
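If you want the same check to run in CI rather than by eye, the report can be reduced to a pass/fail value. The following is a minimal sketch, assuming the input is the parsed output of `vegeta report -type=json` with its `requests` and `status_codes` fields. The function name summarizeRun is hypothetical, and treating code 0 as "no response received" is an assumption worth confirming against your vegeta version.

```javascript
// Sketch: reduce a parsed vegeta JSON report to a pass/fail summary.
function summarizeRun(report) {
  const codes = report.status_codes ?? {}
  const failures = Object.entries(codes)
    // Count 5xx responses, plus code 0 (assumed to mean "no response at all").
    .filter(([code]) => Number(code) >= 500 || Number(code) === 0)
    .reduce((sum, [, count]) => sum + count, 0)
  return {
    requests: report.requests,
    failures,
    successPct: ((report.requests - failures) / report.requests * 100).toFixed(2),
    clean: failures === 0,
  }
}

// The "before" numbers from the report above: 5845 OK, 155 bad gateways.
const before = summarizeRun({ requests: 6000, status_codes: { 200: 5845, 502: 155 } })
console.log(before.failures, before.successPct, before.clean) // 155 97.42 false
```

A CI step could exit non-zero whenever `clean` is false, which turns the screenshot in the PR description into a gate.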
What still bites you
A few things this 40-line version does not cover, in rough order of “you will hit this”:
- Long-polling and WebSocket connections. They will not close themselves when the listener stops. Broadcast a “server is shutting down, reconnect” message before server.close, and close the sockets explicitly.
- Streaming responses (SSE). Same problem. Send a final event before tearing down.
- Connection pools that auto-reconnect. Some ORMs (looking at you, older Sequelize) try to re-establish a pool if a query is queued. Drain queries first, then end the pool. Ordering matters.
- Cleanup errors masked by process.exit. If a cleanup function rejects, Promise.allSettled swallows the rejection and the process exits without logging it. Wrap each cleanup in its own try/catch with the cleanup name baked into the error.
- Shared infra without terminationGracePeriodSeconds set. Kubernetes defaults to 30s, but I have seen Helm charts with it set to 5. Check your manifest before you trust the timeout math.
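Two of the gaps above, cleanup ordering and cleanup failures that are hard to attribute, can be addressed with a small registry that runs cleanups one at a time and tags each failure with a name. This is a sketch under assumptions, not a drop-in replacement for onShutdown: createCleanupRegistry and the cleanup names are hypothetical, and the stand-in functions only mimic what pool.end() or worker.close() would do.

```javascript
// Sketch: named cleanups that run sequentially in registration order.
function createCleanupRegistry(log = console.error) {
  const entries = []
  return {
    register(name, fn) {
      entries.push({ name, fn })
    },
    // Runs every cleanup even if an earlier one fails; resolves to the failed names.
    async run() {
      const failed = []
      for (const { name, fn } of entries) {
        try {
          await fn() // sequential: "stop worker" finishes before "end pool" starts
        } catch (err) {
          failed.push(name)
          log(`[shutdown] cleanup "${name}" failed:`, err)
        }
      }
      return failed
    },
  }
}

// Usage with stand-in cleanups; real ones would call worker.close(), pool.end(), etc.
const registry = createCleanupRegistry()
registry.register('queue-worker', async () => {})
registry.register('db-pool', async () => { throw new Error('pool already closed') })
registry.run().then(failed => console.log('failed cleanups:', failed))
```

Sequential execution trades a little shutdown time for a predictable order, which is the property the auto-reconnecting pool case needs.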
The metric that proves it shipped
After this lands, your 5xx rate during deploys should be a flat line. The clearest dashboard: 5xx count per minute, with deploy markers overlaid. Before the change you see a clean spike on every marker. After, the line stays flat right through the marker — sometimes you cannot even tell a deploy happened.
If you do not have a real production chart, the vegeta artifact is the next best evidence. It is also the most convincing thing to drop into a PR review when someone asks “is this really worth shipping?”
It is. Forty lines. One afternoon. Every future deploy stops costing you 1–2% of one minute’s traffic.