Blue-Green Deployments for Node.js: Nginx Upstream Switching and Zero-Downtime Rollbacks
The deploy script kills the old process, the load balancer still points at it, and every active connection gets a 502. Here is the blue-green deployment pattern that switches traffic atomically, runs smoke tests in the live slot, and rolls back in under five seconds.
The deploy ran for six minutes, the health checks passed, and the dashboard showed all instances green. Then support tickets started arriving: users who had submitted a payment form during the deploy window got a 502 and a double charge. The old server pool, still registered with the load balancer, had been killed mid-request while draining connections. The new pool was accepting traffic, but the load balancer had routed the payment POST to the dying instance, which terminated the process before the HTTP response flushed.
This is the standard failure mode of a naive deploy. You stop the old service, start the new one, and hope the load balancer figures out the rest. It does not. If the load balancer’s health check interval is five seconds, then for up to five seconds after you kill the old instances it still routes traffic to dead processes. Those requests get connection refused or, worse, a half-flushed response that the client interprets as success before the TCP socket slams shut.
The fix is blue-green deployment: two fully independent server pools, a load balancer that switches between them with zero overlap, and a rollback path that does not require redeploying anything. This post walks through the pattern with nginx, a Node.js health-check contract, and a deployment script that handles the entire lifecycle from a GitHub Actions runner.
Why rolling updates are not enough
Kubernetes users will point out that a rolling update handles this. It does, if you have Kubernetes. Many teams do not. They run a handful of VMs or bare-metal boxes behind nginx, perhaps with some Docker Compose in staging and a hand-rolled deploy script for production. Rolling updates on bare metal require you to orchestrate instance drain, connection migration, and health verification yourself. At that point, you have rebuilt blue-green with extra complexity.
A rolling update on two instances works like this: kill instance A, wait for it to drain, start the replacement, wait for its health check, then repeat on instance B. During the window between A going down and its replacement passing health checks, the system operates at half capacity. If startup takes thirty seconds, you lose half your throughput for that long, and then again while B is replaced. If a replacement fails its health check, you are stuck with one healthy node and a partial rollout.
Blue-green avoids the capacity gap entirely. The old pool runs at full capacity until the new pool is fully verified. Then traffic shifts in a single atomic operation.
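To make the capacity gap concrete, here is a rough back-of-the-envelope model. The function names and the simplifying assumption (equal-capacity instances replaced strictly one at a time) are illustrative, not measurements from a real system.

```typescript
// Rough model of serving capacity during a one-at-a-time rolling update.
// Assumption: every instance contributes equal capacity and exactly one
// instance is out of rotation at any moment during the rollout.
interface RolloutEstimate {
  minCapacityFraction: number; // lowest fraction of normal capacity
  degradedSeconds: number;     // total time spent below full capacity
}

function estimateRollingUpdate(
  instances: number,
  drainSeconds: number,
  startupSeconds: number,
): RolloutEstimate {
  if (instances < 1) throw new Error('need at least one instance');
  return {
    minCapacityFraction: (instances - 1) / instances,
    degradedSeconds: instances * (drainSeconds + startupSeconds),
  };
}

// Blue-green keeps the old pool at full capacity until the switch,
// so under the same model there is no degraded window.
function estimateBlueGreen(): RolloutEstimate {
  return { minCapacityFraction: 1, degradedSeconds: 0 };
}

// Two instances, 10 s drain, 30 s startup:
console.log(estimateRollingUpdate(2, 10, 30));
// minCapacityFraction is 0.5 and degradedSeconds is 80
```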
The architecture
Blue-green requires three things:
- Two identical server pools, labeled “blue” and “green”. Only one is active at any time.
- A load balancer that can switch traffic between pools without dropping connections.
- A health-check contract that the deploy tooling (and the load balancer, where it supports active checks) uses to verify the active pool is healthy and the inactive pool is ready to receive traffic.
Here is the nginx configuration that supports this pattern:
upstream app_blue {
    server 10.0.1.10:3000 max_fails=3 fail_timeout=10s;
    server 10.0.1.11:3000 max_fails=3 fail_timeout=10s;
}

upstream app_green {
    server 10.0.2.10:3000 max_fails=3 fail_timeout=10s;
    server 10.0.2.11:3000 max_fails=3 fail_timeout=10s;
}

# Default: blue is active
map $http_x_deploy_color $backend {
    default app_blue;
    blue    app_blue;
    green   app_green;
}

# Send "Connection: upgrade" upstream only when the client asked for an upgrade
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
    listen 80;
    listen 443 ssl;
    # ssl_certificate and ssl_certificate_key are required for the 443
    # listener; they are omitted here.

    location / {
        proxy_pass http://$backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Value is the upstream name: app_blue or app_green
        proxy_set_header X-Deploy-Color $backend;

        # Timeouts that prevent hung connections from piling up
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 30s;

        # Buffer settings for long-running requests
        proxy_buffering off;
        proxy_request_buffering off;
    }

    # Health check endpoint that load balancers and deploy scripts hit
    location /health {
        proxy_pass http://$backend/health;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
    }
}
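The map block is the switch. It defaults to app_blue. When you deploy, you change the default to app_green and reload nginx. Because the map is keyed on the incoming X-Deploy-Color request header, a request that sends X-Deploy-Color: green reaches app_green regardless of the default; that is handy for testing the inactive pool, but the header should be stripped or restricted for untrusted clients. Reloading nginx does not drop active connections: the master process starts new workers with the new configuration and asks the old workers to shut down gracefully. Each old worker stops accepting new connections and keeps serving its existing ones until they finish. New connections go to the new configuration. This is the atomic switch.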
The X-Deploy-Color header that nginx sets on proxied requests is optional but useful. It tells your application which upstream the request was routed to (app_blue or app_green), which matters for cache keys, database migration gating, and log correlation.
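The switch itself is a one-line text substitution in the map block. As a minimal sketch of that substitution with a safety check, the hypothetical helper below (switchDefaultUpstream is not part of nginx or of the original scripts) rewrites the default entry in a config held as a string and refuses to proceed unless exactly one matching line exists.

```typescript
type Color = 'blue' | 'green';

// Rewrites the `default app_<color>;` entry of the map block.
// Hypothetical helper: it mirrors a sed-style substitution, but throws
// if the expected line is missing or appears more than once.
function switchDefaultUpstream(conf: string, from: Color, to: Color): string {
  const pattern = new RegExp(`default\\s+app_${from};`, 'g');
  const matches = conf.match(pattern) ?? [];
  if (matches.length !== 1) {
    throw new Error(
      `expected exactly one "default app_${from};" line, found ${matches.length}`,
    );
  }
  return conf.replace(pattern, `default app_${to};`);
}

const sample = [
  'map $http_x_deploy_color $backend {',
  '    default app_blue;',
  '    blue    app_blue;',
  '    green   app_green;',
  '}',
].join('\n');

const switched = switchDefaultUpstream(sample, 'blue', 'green');
console.log(switched.includes('default app_green;')); // true
```

Failing loudly when the pattern does not match is the point of the sketch: a substitution that silently matches nothing leaves traffic on the old pool while the rest of the deploy carries on as if the switch happened.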
The Node.js health check contract
The load balancer is only half the pattern. The other half is a health check endpoint that tells the deploy tooling exactly what state the application is in. It must distinguish between “I am running” and “I am ready to accept production traffic.”
import express from 'express';
import { createServer } from 'http';
import { Pool } from 'pg';

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

type HealthStatus = 'starting' | 'ready' | 'draining' | 'failing';
let status: HealthStatus = 'starting';

// Mark the server as ready after startup tasks complete
async function initialize(): Promise<void> {
  // Wait for database connectivity
  await db.query('SELECT 1');
  // Run warmup work (compile templates, load caches)
  await warmup();
  status = 'ready';
}

async function warmup(): Promise<void> {
  // Pre-heat connection pools, compile templates, hydrate caches
  // This prevents the cold-start latency spike after a switch
}

app.get('/health', async (_req, res) => {
  const health = {
    status,
    uptime: process.uptime(),
    database: 'unknown',
    memory: process.memoryUsage().heapUsed,
  };

  if (status === 'draining') {
    // Return 503 so callers know this server is leaving the pool
    return res.status(503).json({
      ...health,
      message: 'Draining connections for deployment',
    });
  }

  if (status === 'starting') {
    // Return 503 until initialization completes
    return res.status(503).json({
      ...health,
      message: 'Still initializing',
    });
  }

  // Check database connectivity
  try {
    await db.query('SELECT 1');
    health.database = 'connected';
  } catch {
    health.database = 'disconnected';
    // This version still answers 200 here, on the assumption that the app
    // can function without the DB for a moment. For strict dependency
    // gating, return 503 from this branch instead.
  }

  res.json(health);
});

// Called by the deploy script before killing this pool.
// This route is reachable through nginx unless you block it there;
// restrict access to it in production.
app.post('/health/drain', (_req, res) => {
  status = 'draining';
  res.json({ status: 'draining' });

  // Give in-flight requests time to complete
  setTimeout(() => {
    server.close(() => {
      process.exit(0);
    });
  }, 30000); // 30-second drain window
});

const server = createServer(app);
server.listen(3000, () => {
  initialize()
    .then(() => {
      console.log('Server ready for traffic');
    })
    .catch((err) => {
      // Exit explicitly so a failed startup is visible to the supervisor
      console.error('Initialization failed', err);
      process.exit(1);
    });
});
Three things matter in this implementation:
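The status field is the source of truth. It transitions through starting -> ready -> draining over the server’s lifecycle, and the deploy script reads it through /health. A starting or draining status returns 503, which tells the deploy script the server is not ready for traffic. Open-source nginx does not poll this endpoint on its own: max_fails and fail_timeout only count failed proxied requests, and a 503 response is not treated as a failure unless proxy_next_upstream includes http_503.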
The POST /health/drain endpoint is what the deploy script calls before shutting down the old pool. It sets the status to draining, waits 30 seconds so in-flight requests can complete, then closes the server and exits. This is the connection-drain step that most deploy scripts skip.
The warmup function is where you preload caches, establish database connections, or compile templates. Without it, the first requests to the new pool experience the cold-start latency that makes blue-green feel worse than a simple restart.
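The lifecycle described above can be written down as an explicit transition table. This is a sketch, not part of the original server: the edges involving failing are an assumption, since the server code declares that state but never sets it.

```typescript
type HealthStatus = 'starting' | 'ready' | 'draining' | 'failing';

// Allowed lifecycle transitions. The 'failing' edges are assumed for
// illustration; only starting -> ready -> draining comes from the text.
const transitions: Record<HealthStatus, HealthStatus[]> = {
  starting: ['ready', 'failing'],
  ready: ['draining', 'failing'],
  draining: [],
  failing: ['ready', 'draining'],
};

function canTransition(from: HealthStatus, to: HealthStatus): boolean {
  return transitions[from].indexOf(to) !== -1;
}

// Only 'ready' answers 200; every other state answers 503 so that
// callers polling /health treat the server as not ready for traffic.
function httpStatusFor(current: HealthStatus): 200 | 503 {
  return current === 'ready' ? 200 : 503;
}

console.log(canTransition('starting', 'ready')); // true
console.log(canTransition('draining', 'ready')); // false
console.log(httpStatusFor('draining'));          // 503
```

Making draining a terminal state encodes the rule that a drained process is replaced rather than revived.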
The deploy script
With nginx configured and the health contract in place, the deploy script becomes a state machine. Here it is in a self-contained bash script that a CI runner can execute:
#!/usr/bin/env bash
set -euo pipefail

ENVIRONMENT="${1:-production}"
# Image tag: pass it as the second argument when the script runs on a host
# without a git checkout (for example, piped over ssh to the nginx server).
IMAGE_TAG="${2:-$(git rev-parse --short HEAD)}"
COLOR_FILE="/etc/nginx/.deploy-color"
NGINX_CONF="/etc/nginx/sites-enabled/app"
NEW_IMAGE="myregistry.com/app:$IMAGE_TAG"

# Read the current active color
if [[ -f "$COLOR_FILE" ]]; then
    CURRENT_COLOR=$(cat "$COLOR_FILE")
else
    CURRENT_COLOR="blue"
fi

# Determine the target color
if [[ "$CURRENT_COLOR" == "blue" ]]; then
    TARGET_COLOR="green"
    TARGET_HOSTS=("10.0.2.10" "10.0.2.11")
    CURRENT_HOSTS=("10.0.1.10" "10.0.1.11")
else
    TARGET_COLOR="blue"
    TARGET_HOSTS=("10.0.1.10" "10.0.1.11")
    CURRENT_HOSTS=("10.0.2.10" "10.0.2.11")
fi

echo "Current: $CURRENT_COLOR, Target: $TARGET_COLOR"

# Step 1: Deploy the new image to the inactive pool
for HOST in "${TARGET_HOSTS[@]}"; do
    echo "Deploying to $HOST..."
    # -n keeps ssh from reading stdin, which matters when this script is
    # itself fed to bash over stdin
    ssh -n "deploy@$HOST" "
        docker pull $NEW_IMAGE
        docker stop app || true
        docker rm app || true
        docker run -d \
            --name app \
            --restart unless-stopped \
            -p 3000:3000 \
            -e NODE_ENV=production \
            $NEW_IMAGE
    "
done

# Step 2: Wait for the new pool to pass health checks
echo "Waiting for $TARGET_COLOR pool to become ready..."
for i in $(seq 1 30); do
    ALL_READY=true
    for HOST in "${TARGET_HOSTS[@]}"; do
        # curl prints 000 itself when the connection fails; "|| true" only
        # stops set -e from aborting the script on a non-zero exit
        STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            --connect-timeout 2 \
            --max-time 3 \
            "http://$HOST:3000/health" 2>/dev/null || true)
        if [[ "$STATUS" != "200" ]]; then
            ALL_READY=false
            echo " $HOST returned $STATUS (attempt $i/30)"
            break
        fi
    done

    if $ALL_READY; then
        echo "All $TARGET_COLOR hosts are healthy"
        break
    fi

    if [[ $i -eq 30 ]]; then
        echo "ERROR: $TARGET_COLOR pool failed health checks after 30 attempts"
        echo "Rolling back: removing $TARGET_COLOR containers"
        for HOST in "${TARGET_HOSTS[@]}"; do
            ssh -n "deploy@$HOST" "docker stop app && docker rm app" || true
        done
        exit 1
    fi
    sleep 2
done
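
# Step 3: Switch nginx traffic to the new pool
echo "Switching traffic to $TARGET_COLOR..."
sed -i "s/default app_$CURRENT_COLOR;/default app_$TARGET_COLOR;/" "$NGINX_CONF"
# Validate the edited config before reloading; revert the edit if it is invalid
if ! nginx -t; then
    sed -i "s/default app_$TARGET_COLOR;/default app_$CURRENT_COLOR;/" "$NGINX_CONF"
    exit 1
fi
nginx -s reload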

# Step 4: Verify the switch via the load balancer
echo "Verifying traffic reaches $TARGET_COLOR..."
for i in $(seq 1 10); do
    LB_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 2 \
        --max-time 3 \
        "http://localhost/health" 2>/dev/null || true)
    if [[ "$LB_STATUS" == "200" ]]; then
        echo "Load balancer returning 200 through $TARGET_COLOR"
        break
    fi

    if [[ $i -eq 10 ]]; then
        echo "ERROR: Load balancer not healthy after switch"
        echo "Rolling back nginx config..."
        sed -i "s/default app_$TARGET_COLOR;/default app_$CURRENT_COLOR;/" "$NGINX_CONF"
        nginx -s reload
        exit 1
    fi
    sleep 2
done

# Step 5: Smoke test via the load balancer (not the backend directly)
echo "Running smoke tests through $TARGET_COLOR..."
for TEST in \
    "GET /api/status -> 200" \
    "POST /api/echo -> 200"; do
    # Split "METHOD PATH -> CODE" into fields. The variable is REQ_PATH
    # because assigning to PATH would break command lookup for the rest
    # of the script.
    read -r METHOD REQ_PATH _ EXPECTED <<< "$TEST"
    ACTUAL=$(curl -s -o /dev/null -w "%{http_code}" \
        -X "$METHOD" \
        --connect-timeout 3 \
        --max-time 5 \
        "http://localhost$REQ_PATH" 2>/dev/null || true)
    if [[ "$ACTUAL" != "$EXPECTED" ]]; then
        echo "FAIL: $METHOD $REQ_PATH expected $EXPECTED got $ACTUAL"
        echo "Rolling back..."
        sed -i "s/default app_$TARGET_COLOR;/default app_$CURRENT_COLOR;/" "$NGINX_CONF"
        nginx -s reload
        exit 1
    fi
    echo "PASS: $METHOD $REQ_PATH -> $ACTUAL"
done

# Step 6: Signal the old pool to drain
echo "Draining $CURRENT_COLOR pool..."
for HOST in "${CURRENT_HOSTS[@]}"; do
    ssh -n "deploy@$HOST" "curl -s -X POST --connect-timeout 3 http://localhost:3000/health/drain" &
done
# Wait for the backgrounded ssh commands to finish sending the drain signal
wait

# Record the new active color
echo "$TARGET_COLOR" > "$COLOR_FILE"
echo "Deploy complete! Active pool: $TARGET_COLOR"
This script does six things in order:
- Determines the target pool by reading the color file on the nginx server.
- Deploys to the inactive pool so there is zero overlap with live traffic.
- Waits for health checks with a fixed retry budget (30 attempts, two seconds apart). If the new pool never becomes healthy, it removes the containers and exits with a failure, without touching the live pool at all.
- Switches nginx via sed and a reload. The reload is atomic: old workers finish existing requests, new workers use the new upstream.
- Smoke tests through the load balancer to verify the full path works, not just the backend health endpoint.
- Drains the old pool by calling the /health/drain endpoint on each host, then records the new active color.
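The steps above, including the verification that follows the switch, can be modeled as a small state machine. This is a sketch with hypothetical step names; the rollback actions mirror what the bash script does on failure, where only a failed health wait removes containers and only a failed verification or smoke test reverts nginx.

```typescript
type Step =
  | 'deploy' | 'await_health' | 'switch' | 'verify'
  | 'smoke' | 'drain' | 'done' | 'failed';
type Outcome = 'ok' | 'error';
type RollbackAction = 'none' | 'remove_new_containers' | 'revert_nginx';

interface StepResult {
  next: Step;
  rollback: RollbackAction;
}

const order: Step[] = [
  'deploy', 'await_health', 'switch', 'verify', 'smoke', 'drain', 'done',
];

// Given the step that just ran and its outcome, return the next step
// and the rollback action (if any) that must run first.
function nextStep(step: Step, outcome: Outcome): StepResult {
  if (outcome === 'ok') {
    const i = order.indexOf(step);
    if (i === -1 || step === 'done') {
      throw new Error(`no successor for step "${step}"`);
    }
    return { next: order[i + 1], rollback: 'none' };
  }
  switch (step) {
    case 'await_health':
      return { next: 'failed', rollback: 'remove_new_containers' };
    case 'verify':
    case 'smoke':
      return { next: 'failed', rollback: 'revert_nginx' };
    default:
      return { next: 'failed', rollback: 'none' };
  }
}

// Walk the happy path from the first step to completion.
function happyPath(): Step[] {
  const visited: Step[] = ['deploy'];
  while (visited[visited.length - 1] !== 'done') {
    visited.push(nextStep(visited[visited.length - 1], 'ok').next);
  }
  return visited;
}

console.log(happyPath().join(' -> '));
// deploy -> await_health -> switch -> verify -> smoke -> drain -> done
```

Writing the transitions out this way makes it easy to see which failures leave live traffic untouched (everything before switch) and which need the nginx config reverted.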
The rollback path
The script already handles two rollback scenarios:
- Deploy failure: The new pool never passes health checks. The script removes the new containers and exits. The old pool never stopped serving traffic.
- Switch failure: The load balancer does not return 200 after the switch, or a smoke test fails. The script reverts the nginx config and reloads.
There is a third scenario worth handling separately: the switch succeeds but a production incident is discovered minutes or hours later. For that, you need a manual rollback command:
#!/usr/bin/env bash
# rollback.sh - Switch traffic back to the previous pool
set -euo pipefail

COLOR_FILE="/etc/nginx/.deploy-color"
NGINX_CONF="/etc/nginx/sites-enabled/app"

if [[ ! -f "$COLOR_FILE" ]]; then
    echo "No deploy color file found. Cannot determine rollback target."
    exit 1
fi

CURRENT_COLOR=$(cat "$COLOR_FILE")
if [[ "$CURRENT_COLOR" == "blue" ]]; then
    ROLLBACK_COLOR="green"
else
    ROLLBACK_COLOR="blue"
fi

echo "Rolling back from $CURRENT_COLOR to $ROLLBACK_COLOR..."
sed -i "s/default app_$CURRENT_COLOR;/default app_$ROLLBACK_COLOR;/" "$NGINX_CONF"
nginx -s reload

# Verify. "|| true" keeps set -e from aborting before the CRITICAL message prints.
sleep 3
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    --connect-timeout 2 \
    --max-time 3 \
    http://localhost/health 2>/dev/null || true)
if [[ "$STATUS" == "200" ]]; then
    echo "$ROLLBACK_COLOR" > "$COLOR_FILE"
    echo "Rollback complete. Active pool: $ROLLBACK_COLOR"
else
    echo "CRITICAL: Rollback health check failed. Manual intervention required."
    exit 1
fi
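The manual rollback takes under five seconds. It does not require building or deploying any code. It just switches nginx back to the pool that was running before the deploy. This is the killer feature of blue-green: rollback is a config change, not a redeploy. It does depend on the previous pool still running the old version: the drain endpoint exits the Node process, so the old containers have to come back up (with --restart unless-stopped, Docker should restart them on the old image) and pass /health before traffic returns to them.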
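Before flipping traffic back, it is worth confirming that the previous pool is actually answering. A minimal sketch of that decision follows, assuming you have already collected the /health status code from each host in both pools; planRollback and its types are hypothetical names, not part of the scripts above.

```typescript
type Color = 'blue' | 'green';

// HTTP status codes returned by each host's /health, grouped by pool.
type PoolHealth = Record<Color, number[]>;

interface RollbackPlan {
  target: Color;
  safe: boolean;
  reason: string;
}

function otherColor(color: Color): Color {
  return color === 'blue' ? 'green' : 'blue';
}

// Decide whether switching back to the previous pool is safe.
function planRollback(active: Color, health: PoolHealth): RollbackPlan {
  const target = otherColor(active);
  const codes = health[target];
  if (codes.length === 0) {
    return { target, safe: false, reason: 'no hosts reported for target pool' };
  }
  const unhealthy = codes.filter((code) => code !== 200).length;
  if (unhealthy > 0) {
    return { target, safe: false, reason: `${unhealthy} target host(s) not ready` };
  }
  return { target, safe: true, reason: 'all target hosts returned 200' };
}

console.log(planRollback('green', { blue: [200, 200], green: [200, 500] }));
// target is 'blue' and safe is true: the pool being rolled back to is healthy
```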
The GitHub Actions workflow
Here is how you wire the deploy script into CI:
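name: Deploy
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push Docker image
        run: |
          docker build -t myregistry.com/app:${GITHUB_SHA::7} .
          docker push myregistry.com/app:${GITHUB_SHA::7}
      - name: Deploy to production
        run: |
          # ssh does not read keys from the environment, so write the key to a file.
          # Registry login and known_hosts setup are not shown.
          mkdir -p ~/.ssh
          printf '%s\n' "$SSH_PRIVATE_KEY" > ~/.ssh/deploy_key
          chmod 600 ~/.ssh/deploy_key
          # Pass the image tag explicitly: the load balancer has no git checkout
          ssh -i ~/.ssh/deploy_key deploy@loadbalancer "bash -s production ${GITHUB_SHA::7}" < deploy.sh
        env:
          SSH_PRIVATE_KEY: ${{ secrets.SSH_PRIVATE_KEY }}
      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST -H "Content-Type: application/json" \
            -d '{"text":"Deploy failed. Traffic remains on previous pool."}' \
            "${{ secrets.SLACK_WEBHOOK }}"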
The key detail: if the deploy step fails, the workflow sends a notification but does not trigger a rollback. The rollback already happened inside the deploy script. The notification just tells the team that the new version failed to deploy and the old version is still serving traffic. No incident, no pager. Just a failed CI run.
What to watch out for
Blue-green is simple in theory, but a few details will bite you in practice:
Database migrations. Both pools share the same database. If the new code expects a migration that the old code does not understand, a rollback breaks. Run backward-compatible migrations before the deploy, or gate the migration behind a feature flag. Never run a destructive migration (DROP COLUMN, ALTER COLUMN TYPE) in the same deploy that switches traffic.
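One way to enforce the rule about destructive migrations is a pre-deploy check on the migration SQL. The sketch below is a naive pattern match, not a SQL parser; the function name is hypothetical, and DROP TABLE is included as an assumption alongside the two statement types named above.

```typescript
// Patterns for statements that the old pool's code is unlikely to survive.
// Naive by design: comments and string literals are not handled.
const destructivePatterns: RegExp[] = [
  /\bDROP\s+(TABLE|COLUMN)\b/i,
  /\bALTER\s+COLUMN\s+\S+\s+(SET\s+DATA\s+)?TYPE\b/i,
];

// Returns the statements in a migration that look destructive.
function findDestructiveStatements(sql: string): string[] {
  return sql
    .split(';')
    .map((statement) => statement.trim())
    .filter((statement) => statement.length > 0)
    .filter((statement) =>
      destructivePatterns.some((pattern) => pattern.test(statement)),
    );
}

const migration = `
  ALTER TABLE users ADD COLUMN nickname text;
  ALTER TABLE users DROP COLUMN legacy_name;
`;

console.log(findDestructiveStatements(migration));
// [ 'ALTER TABLE users DROP COLUMN legacy_name' ]
```

A deploy pipeline could fail the build when this list is non-empty and require that such statements ship in a later release, after the old pool no longer needs the column.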
Session state. If users are logged into a session stored in the old pool’s memory, the switch makes them lose their session. Store sessions in Redis or a database, not in process memory.
WebSocket connections. Nginx can proxy WebSocket connections, but the Upgrade header must be forwarded correctly (the config above handles this). On switch, existing WebSocket connections stay connected to the old pool because the old nginx worker keeps serving them. Those connections eventually break when the old pool drains. If zero WebSocket interruption matters, use a sticky-session load balancer or a shared WebSocket gateway.
Cache warmup. The first request to the new pool for every cache key is a cold miss. If your application depends on a warm cache for acceptable latency, preheat the cache during the warmup() step of the health check lifecycle.
Monitoring gap. Between the switch and the old pool drain (about 30 seconds), metrics come from two pools simultaneously. Make sure your monitoring can distinguish between app_blue and app_green instances, or accept that metrics are noisy during this window. The config above passes X-Deploy-Color to the application as a request header; log it there, or add it to responses with add_header, to correlate requests to the right pool.
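As a small sketch of the log-correlation idea, the helper below turns the X-Deploy-Color value that nginx forwards (the upstream name, such as app_green) into a pool label for structured logs. The function names and the log shape are assumptions for illustration.

```typescript
type PoolLabel = 'blue' | 'green' | 'unknown';

// Maps the forwarded header value ("app_blue" / "app_green") to a label.
// Anything else, including a missing header, becomes 'unknown'.
function poolFromHeader(value: string | undefined): PoolLabel {
  const stripped = (value ?? '').replace(/^app_/, '');
  return stripped === 'blue' || stripped === 'green' ? stripped : 'unknown';
}

// Builds one structured log line tagged with the pool.
function tagLog(pool: PoolLabel, message: string): string {
  return JSON.stringify({ pool, message });
}

console.log(tagLog(poolFromHeader('app_green'), 'GET /api/status 200'));
// {"pool":"green","message":"GET /api/status 200"}
```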
When not to use blue-green
Blue-green is not always the right choice. If you run a single server, you cannot afford the hardware duplication. If your startup time is under two seconds and your traffic is low, a graceful shutdown with a health check drain window is simpler and cheaper. If you are on Kubernetes, its native rolling update with maxSurge and maxUnavailable handles most of this automatically.
But for teams running Node.js on bare metal, VMs, or a handful of cloud instances behind a reverse proxy, blue-green is the pattern that eliminates the “deploy during lunch” risk without adding orchestration complexity. Two nginx upstream blocks, one health check endpoint, and one deploy script. That is the whole thing.
Before your next deploy, run through this checklist:
- The inactive pool deploys before traffic switches.
- The deploy script waits for health checks (200, not just “process is running”).
- Nginx reload happens via nginx -s reload, not systemctl restart nginx.
- The old pool drains connections before shutting down.
- Rollback is a config change, not a code redeploy.
- Database migrations are backward-compatible with the old pool.
- Sessions and caches live outside process memory.
- Monitoring distinguishes between blue and green instances.
Deployments that drop connections are not inevitable. They are a missing pattern. Add this one, and the next deploy is just a CI run that happens to switch traffic with zero user impact.