Docker HEALTHCHECK for Node.js: The Pattern That Stops Your Orchestrator From Routing Traffic to Broken Containers
Your orchestrator does not know your container is broken until a user hits it. Docker HEALTHCHECK fills that gap with a three-parameter config and a deliberately boring HTTP endpoint that separates startup, liveness, and readiness into distinct states.
Your container boots, registers with the service mesh, and starts receiving traffic. The Node.js process is up. The health endpoint returns 200. Everything looks fine.
But the database connection pool is exhausted because a previous deployment left connections open. The event loop is at 98% utilization because a memory leak has pinned GC into a continuous cycle. The response time has gone from 12ms to 4 seconds, but the health endpoint still returns 200 because it only checks process.uptime() > 0.
This is the gap that Docker HEALTHCHECK fills, and most teams configure it wrong.
What HEALTHCHECK actually does
Docker HEALTHCHECK is a Dockerfile instruction that tells the container runtime to run a command periodically inside the container. The exit code of that command determines the container health state:
- Exit 0: healthy
- Exit 1: unhealthy
- Exit 2: reserved (do not use)
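Docker updates the container’s Status field in docker ps and emits a health_status event. Plain Docker stops there: on its own it neither restarts an unhealthy container nor reroutes traffic. What happens next depends on the layer above. Docker Swarm replaces unhealthy tasks and withholds traffic from tasks that are not yet healthy, and Docker Compose uses the status to gate depends_on. Kubernetes ignores it entirely and runs its own probes.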
The three parameters that matter:
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD node /healthcheck.js
- interval: how often Docker runs the check. Default is 30s. Faster than 10s adds noise.
- timeout: how long Docker waits before declaring the check failed. Must be significantly shorter than interval. If your health check can take 30 seconds, it is doing too much.
- retries: consecutive failures before marking unhealthy. Three matches the default Kubernetes behavior and absorbs transient blips.
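To make the interplay of the three parameters concrete, here is a small sketch that estimates how long a broken container can go unnoticed. It assumes Docker schedules each check one interval after the previous one finishes and that every failing check runs into the timeout; the function and field names are illustrative, not part of any Docker API.

```typescript
// Rough upper bound on detection time for a container that stops responding.
// Assumption: each failed check costs one full interval plus one full timeout.
interface HealthcheckTiming {
  intervalS: number; // --interval
  timeoutS: number;  // --timeout
  retries: number;   // --retries
}

function worstCaseDetectionSeconds(t: HealthcheckTiming): number {
  return t.retries * (t.intervalS + t.timeoutS);
}

// The configuration shown above: 3 * (30 + 5)
console.log(worstCaseDetectionSeconds({ intervalS: 30, timeoutS: 5, retries: 3 })); // 105
```

With the values above, a hung process can keep its healthy status for well over a minute, which is worth keeping in mind before raising interval or retries further.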
The startup period is a fourth dimension that is easy to overlook. Kubernetes handles it with initialDelaySeconds or a startup probe. Docker has --start-period, but it defaults to 0s, so unless you set it every probe that fails during boot counts toward retries, and platforms that ignore the image's HEALTHCHECK ignore the flag as well. We will fix this with a deliberate approach to state.
The one-endpoint-to-rule-them-all trap
The most common mistake is a single /health endpoint that does everything:
import express from 'express';
const app = express();
app.get('/health', async (_req, res) => {
  // Is the DB reachable?
  await db.query('SELECT 1');
  // Is Redis alive?
  await redis.ping();
  // Is the upstream API responding?
  await upstream.healthCheck();
  res.json({ status: 'ok' });
});
This endpoint has three problems when used as a Docker HEALTHCHECK target.
First, it mixes liveness and readiness into one answer. If the database is down, should Docker restart the container? No. The database blip will pass. Restarting the container makes it worse because all replicas restart in a thundering herd and you create the death spiral the K8s probes post describes. Docker HEALTHCHECK is a liveness signal: is this container so broken that it needs to be replaced? It is not a readiness signal: is this container ready to accept traffic right now?
Second, it calls external services inline, inside the health check request. The HEALTHCHECK timeout starts when Docker invokes the command. If your database call takes 6 seconds and the timeout is 5 seconds, the check fails every time the database is slow, even if the container itself is perfectly healthy.
Third, it does not distinguish startup from steady state. With a 10-second interval and three retries, the third consecutive failure lands about 30 seconds after container start. If your app takes 45 seconds to initialize, the container is marked unhealthy, the orchestrator restarts it, and you get a crash loop before the first request arrives.
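If dependency probes stay in the endpoint as informational fields, the second problem can be contained by giving each probe its own deadline that is shorter than the HEALTHCHECK timeout. The helper below is a minimal sketch of that idea; probeWithDeadline and its parameters are hypothetical names, not part of Express or any client library.

```typescript
type CheckResult = 'pass' | 'fail';

// Runs a dependency probe but never waits longer than deadlineMs for it.
// A rejected, throwing, or slow probe is reported as 'fail' instead of
// blocking the health response.
async function probeWithDeadline(
  probe: () => Promise<unknown>,
  deadlineMs: number,
): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<CheckResult>((resolve) => {
    timer = setTimeout(() => resolve('fail'), deadlineMs);
  });
  const attempt = Promise.resolve()
    .then(probe)
    .then((): CheckResult => 'pass', (): CheckResult => 'fail');
  try {
    return await Promise.race([attempt, deadline]);
  } finally {
    clearTimeout(timer); // do not leave the deadline timer pending
  }
}

// Usage: a probe that needs 200 ms against a 50 ms deadline.
const slowProbe = () => new Promise((resolve) => setTimeout(resolve, 200));
probeWithDeadline(slowProbe, 50).then((result) => console.log('db:', result)); // db: fail
```

The deadline only bounds how long the endpoint waits. The underlying call keeps running, so the client's own timeout settings still matter.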
The right approach: three states, one endpoint
Build a single health endpoint that reports three states, and write a HEALTHCHECK script that interprets only the state that matters for liveness:
interface HealthState {
  status: 'starting' | 'healthy' | 'unhealthy';
  checks: {
    self: 'pass' | 'fail';
    db?: 'pass' | 'fail';
    redis?: 'pass' | 'fail';
  };
  startedAt: number;
}
The endpoint returns starting during initialization, healthy when the process is running and the event loop is responsive, and unhealthy only when the container needs replacement:
import express from 'express';
// createPool and createClient stand in for your database and Redis client
// factories; import them from whichever drivers you use.
const app = express();
const startedAt = Date.now();
const STARTUP_GRACE_MS = 30_000;
let dbConnected = false;
let redisConnected = false;
const db = createPool({ /* ... */ });
const redis = createClient({ /* ... */ });
// Track connectivity from client events instead of querying on every probe.
db.on('connect', () => { dbConnected = true; });
db.on('error', () => { dbConnected = false; });
redis.on('connect', () => { redisConnected = true; });
redis.on('end', () => { redisConnected = false; });
app.get('/health', (_req, res) => {
  const uptime = Date.now() - startedAt;
  // Phase 1: Startup
  if (uptime < STARTUP_GRACE_MS) {
    res.set('Retry-After', '5');
    return res.status(503).json({
      status: 'starting',
      checks: { self: 'pass' },
      startedAt,
    });
  }
  // Phase 2: Liveness check (self only)
  // If the event loop is so blocked we cannot respond, we never reach here.
  // The HEALTHCHECK timeout fires, Docker marks us unhealthy. Good.
  const livenessResult = { status: 'healthy', checks: { self: 'pass' }, startedAt };
  // Phase 3: Readiness info (for the orchestrator, not the HEALTHCHECK)
  // Return db/redis status so load balancers can decide, but do NOT
  // fail the HEALTHCHECK on external dependency blips.
  return res.json({
    ...livenessResult,
    checks: {
      self: 'pass',
      db: dbConnected ? 'pass' : 'fail',
      redis: redisConnected ? 'pass' : 'fail',
    },
  });
});
Now the Docker HEALTHCHECK command targets only the liveness signal:
HEALTHCHECK --interval=30s --timeout=5s --start-period=40s --retries=2 \
CMD node /healthcheck.js
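A note on --start-period: it is a real Dockerfile HEALTHCHECK option (docker run exposes it as --health-start-period and Compose as start_period), but it defaults to 0s and it does not pause probing. Checks still run during the start period; failures simply do not count toward retries. Platforms that ignore the image's HEALTHCHECK ignore the flag too. This is the fourth dimension I mentioned. Here is how to handle it.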
The startup period problem and two solutions
Because --start-period defaults to 0s, every probe that fails while the app boots counts toward retries unless you set it, and it only helps on runtimes that honor the image's HEALTHCHECK. If you cannot rely on it, your options:
Option A: Accept the retries budget
Set --retries high enough that the container survives startup:
HEALTHCHECK --interval=10s --timeout=5s --retries=10 \
CMD node /healthcheck.js
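If your app takes 45 seconds to start, 10 retries at a 10-second interval gives 100 seconds of tolerance. The downsides: Docker does not report the container as healthy until the first successful check, so orchestration tools that wait for healthy before routing wait longer, and the same budget applies after startup, so a container that hangs in steady state also takes about 100 seconds to be marked unhealthy.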
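The retries arithmetic is easy to get wrong, so here is a simplified model of the startup race. It assumes probes complete instantly, run every interval, and that failures inside the start period are not counted; unhealthyAt and StartupScenario are illustrative names, and the model ignores --start-interval.

```typescript
interface StartupScenario {
  intervalS: number;    // --interval
  retries: number;      // --retries
  startPeriodS: number; // --start-period, 0 when unset
  appReadyAtS: number;  // seconds after start when the probe first succeeds
}

// Returns the second at which the container would be marked unhealthy,
// or null if a probe succeeds before the retries budget is used up.
function unhealthyAt(s: StartupScenario): number | null {
  let failures = 0;
  for (let t = s.intervalS; ; t += s.intervalS) {
    if (t >= s.appReadyAtS) return null;  // probe succeeds, container is healthy
    if (t < s.startPeriodS) continue;     // failure inside the start period: ignored
    failures += 1;
    if (failures >= s.retries) return t;
  }
}

// Three retries at 10s: unhealthy before a 45-second startup finishes.
console.log(unhealthyAt({ intervalS: 10, retries: 3, startPeriodS: 0, appReadyAtS: 45 })); // 30
// Ten retries at 10s (Option A): the app becomes ready first.
console.log(unhealthyAt({ intervalS: 10, retries: 10, startPeriodS: 0, appReadyAtS: 45 })); // null
```

Plugging in your own startup time shows quickly whether a given interval and retries pair leaves any margin.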
Option B: Use the health endpoint’s startup grace period
Build the grace period into the endpoint itself, as shown above. Return 503 with starting status during the first N seconds. Write the HEALTHCHECK so it only considers the exit code, not the response body:
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
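With curl -f, any HTTP status of 400 or above causes a non-zero exit. The endpoint returns 503 during startup, so each probe in that window counts as a failure, and interval times retries (45 seconds here) must exceed the grace period or the container is marked unhealthy before the endpoint ever returns 200. After N seconds, the endpoint switches to 200 with the health state. Note that curl is not installed in the node Alpine images by default; install it or use a Node script as the check command.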
This is the cleaner approach because it keeps the startup logic in application code where it belongs, not in retry-count guesses.
Docker Compose extension (for local dev)
If you use Docker Compose, the healthcheck block supports a start_period:
services:
  api:
    build: .
    healthcheck:
      test: ["CMD", "node", "/healthcheck.js"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 40s
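During those 40 seconds Compose still runs the probe, but failures do not count toward retries, and the first success marks the container healthy immediately. This is the best option for Compose-based workflows. But the settings in the Compose file do not travel with the image, so they do not apply to production orchestrators that never read that file.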
What your health check should actually measure
A health check is not a monitoring system. It is a binary switch: should this container be kept or replaced? The threshold for “replaced” must be high enough that you do not restart containers unnecessarily.
Measure these (liveness)
- Process responsiveness: If the endpoint responds in time, the event loop is running. That is the strongest liveness signal. If V8 is stuck in a GC cycle that lasts longer than your HEALTHCHECK timeout, you want a restart.
- File descriptor exhaustion: Check the /proc/self/fd count against the container limit. An fd leak will eventually prevent accepting new connections.
- Memory allocation failure: A try/catch around a small Buffer.alloc that checks whether the process can still allocate. If it fails, you are about to be OOM-killed anyway. Signal early so the orchestrator drains you gracefully.
import { readdirSync, readFileSync } from 'node:fs';
function checkFileDescriptors(limit = 0.9): boolean {
  try {
    // /proc/self/fd is a directory with one entry per open descriptor (Linux only).
    const open = readdirSync('/proc/self/fd').length;
    const max = readFileSync('/proc/self/limits', 'utf8')
      .match(/Max open files\s+(\d+)\s+(\d+)/);
    if (!max) return true; // Can't determine (e.g. "unlimited"), assume pass
    const soft = parseInt(max[1], 10);
    return open / soft < limit;
  } catch {
    return true; // /proc not available, assume pass
  }
}
function checkMemory(): boolean {
  try {
    Buffer.alloc(1024 * 1024); // 1MB probe allocation
    return true;
  } catch {
    return false;
  }
}
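One way to fold probes like these into the single self field of the health state is a small aggregator. This is a sketch under assumptions: runSelfChecks and the probe names are hypothetical, and a probe that throws is treated as a failure rather than crashing the endpoint.

```typescript
type Probe = () => boolean;

interface SelfCheckResult {
  self: 'pass' | 'fail';
  failed: string[]; // names of failed probes, useful for logs
}

// Runs every liveness probe and reports 'fail' if any of them fails or throws.
function runSelfChecks(probes: Record<string, Probe>): SelfCheckResult {
  const failed: string[] = [];
  for (const [name, probe] of Object.entries(probes)) {
    let ok = false;
    try {
      ok = probe();
    } catch {
      ok = false; // a throwing probe counts as a failure
    }
    if (!ok) failed.push(name);
  }
  return { self: failed.length === 0 ? 'pass' : 'fail', failed };
}

const result = runSelfChecks({
  fileDescriptors: () => true,
  memory: () => { throw new Error('allocation failed'); },
});
console.log(result); // { self: 'fail', failed: [ 'memory' ] }
```

In the endpoint, a fail result would map to a 503 with status unhealthy, while dependency status stays informational.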
Do not measure these (or measure them as info only)
- Database reachability: The database will have blips. That is what connection pools and retries are for. Restarting the container does not fix the database.
- Upstream API health: You have no control over upstreams. A down upstream does not mean your container should restart.
- Disk space: Unless your container writes to a local volume, disk space is ephemeral and will reset on restart. Alert on disk in your monitoring, do not restart containers for it.
- Queue depth or request latency: These are performance metrics, not health signals. Restarting a container because it has a backlog guarantees the backlog never drains.
The complete HEALTHCHECK script
Here is the production-grade health check script that the Dockerfile calls:
// healthcheck.js
// This script runs inside the container as the HEALTHCHECK command.
// It exits 0 for healthy, 1 for unhealthy. That is all Docker sees.
// It uses ESM syntax, so it needs "type": "module" in package.json
// or an .mjs file extension.
import http from 'node:http';
const HEALTH_URL = process.env.HEALTH_URL || 'http://localhost:3000/health';
// Keep the request timeout below the HEALTHCHECK --timeout (5s).
const req = http.get(HEALTH_URL, { timeout: 4000 }, (res) => {
  let body = '';
  res.on('data', (chunk) => (body += chunk));
  res.on('end', () => {
    try {
      const state = JSON.parse(body);
      // During startup, the endpoint returns 503 with status 'starting'.
      // That exits 1 and is absorbed by the retries budget.
      if (res.statusCode === 200 && state.status === 'healthy') {
        process.exit(0);
      }
      // Anything else (starting, unhealthy, unexpected status) fails.
      process.exit(1);
    } catch {
      // Body was not valid JSON.
      process.exit(1);
    }
  });
});
req.on('error', () => process.exit(1));
req.on('timeout', () => { req.destroy(); process.exit(1); });
And the Dockerfile:
FROM node:22-alpine AS runner
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY healthcheck.js .
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s --start-period=40s --retries=3 \
CMD node healthcheck.js
USER node
CMD ["node", "dist/server.js"]
Note that --start-period is set in this Dockerfile. It has been part of the HEALTHCHECK instruction since Docker 17.05, and Docker Engine 25 added --start-interval for faster probing during startup. The catch is that it only takes effect where the runtime honors the image's HEALTHCHECK: Kubernetes ignores it, and ECS reads health checks from the task definition rather than from the image. If your runtime ignores it, the startup grace logic in the endpoint covers you. This dual protection is the production pattern.
How Kubernetes and HEALTHCHECK interact
If you run on Kubernetes, Docker HEALTHCHECK is redundant with the kubelet’s liveness probe. The kubelet does not read Docker’s health status. It runs its own probes. Most Kubernetes deployments should skip Docker HEALTHCHECK and use kubelet probes directly.
But there are three cases where Docker HEALTHCHECK still matters even if Kubernetes is your main platform:
- Sidecars that need to know the main container’s health: A sidecar proxy (Envoy, Linkerd) cannot see kubelet probe results, and on containerd-based nodes nothing runs the image's HEALTHCHECK either. The /health endpoint built for HEALTHCHECK gives the sidecar something it can poll over localhost to decide whether to drain connections.
- Non-Kubernetes orchestrators: If you deploy to Docker Swarm, or to ECS (whose task-definition health check runs through the same Docker mechanism), the container health check is the primary signal.
- Development and CI environments: Docker Compose depends_on with condition: service_healthy ensures services start in order. Without HEALTHCHECK, Compose waits for the process to start, not for it to be ready.
# docker-compose.yml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
  api:
    build: .
    depends_on:
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "node", "healthcheck.js"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 40s
This pattern replaces the fragile sleep 30 startup hacks that most Compose files use. The API container does not start until Postgres reports healthy. No guessing, no magic numbers.
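In CI scripts that start containers outside Compose, the same health status can be polled directly. The sketch below shells out to docker inspect and waits for a healthy status; waitForHealthy, its defaults, and the container name passed to it are assumptions for illustration, and it requires the docker CLI on the PATH.

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

type DockerHealth = 'starting' | 'healthy' | 'unhealthy' | 'none';

// Maps raw docker inspect output to a known health status.
// Anything unrecognized (including empty output) becomes 'none'.
function parseHealth(raw: string): DockerHealth {
  switch (raw.trim()) {
    case 'starting': return 'starting';
    case 'healthy': return 'healthy';
    case 'unhealthy': return 'unhealthy';
    default: return 'none';
  }
}

async function readHealth(container: string): Promise<DockerHealth> {
  const { stdout } = await execFileAsync('docker', [
    'inspect',
    '--format',
    '{{if .State.Health}}{{.State.Health.Status}}{{end}}',
    container,
  ]);
  return parseHealth(stdout);
}

// Polls until the container is healthy (true), unhealthy (false),
// or the time budget runs out (false).
async function waitForHealthy(
  container: string,
  timeoutMs = 120_000,
  pollMs = 2_000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const health = await readHealth(container);
    if (health === 'healthy') return true;
    if (health === 'unhealthy') return false;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return false;
}
```

A CI step could call waitForHealthy with the name of the API container before running integration tests, and fail the job when it resolves to false.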
A note from Yojji
Designing container health checks that distinguish liveness from readiness, absorb transient infrastructure blips, and fail fast on real process issues is the kind of operational discipline that separates a service that survives a database failover from one that restarts itself into a prolonged outage. Getting the HEALTHCHECK right in your Dockerfile and Compose files means every deployment, scaling event, and infrastructure incident is handled by the orchestrator instead of the on-call engineer.
Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their senior engineering teams specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), Docker and Kubernetes infrastructure, and the full cycle of product delivery from discovery through DevOps and production support. Yojji builds the kind of production-hardened backend systems where health checks, graceful shutdowns, and container lifecycle management are designed in from day one, not bolted on after the first outage.