Multi-Region Deployment for Node.js: DNS Routing, Database Replication, and Production Failover

The alarm went off at 3:14 AM. us-east-1 was having another “event.” Not a full outage. Just enough packet loss that every API call from the Nginx health checker started timing out. The deployment was in a single region. The entire service was dark for 47 minutes while AWS sorted out whatever was happening in their northern Virginia data center.

The postmortem was predictable. “We should have been multi-region.” But when the team looked into it, they found a maze of questions with no clear answers. How do you route traffic to the nearest region without turning DNS into a single point of failure? How do you handle the database when writes must work everywhere? Do you need sticky sessions? What about config and secrets across regions?

Here is the practical answer to each of those questions, with working configuration and the trade-offs you need to make.

The three-layer model

Multi-region deployment solves three distinct problems, and conflating them is where most teams get stuck.

Latency. Users far from your single region wait an extra round trip for every API call. For a Singapore user hitting us-east-1, that is 150-200ms of pure physics you cannot optimize away.

Availability. A single region is a single failure domain. Cloud providers have region-wide outages. They are rare, but they happen, and when they do, your service goes with them.

Throughput. Two regions means two sets of compute. If your service is CPU-bound, multi-region effectively doubles your capacity without needing to scale vertically.

Each problem demands a different architecture. You can solve latency with read replicas without touching your write path. You can solve availability with a cold standby that takes over via DNS change. You only need active-active multi-master if you need all three. Start with the simplest architecture that solves the problem you actually have.

DNS routing: latency-based with health checks

The first question is how to get a user’s request to the nearest healthy region. The answer is DNS routing with health-check gating.

If you are on AWS, Route 53 latency routing policies handle this. If you are on GCP, Cloud DNS routing policies or a global load balancer serve the same purpose. With DNS-based routing the principle is the same everywhere: DNS returns the IP of the region with the lowest latency for the requesting resolver, provided that region passes a health check.
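To make the selection rule concrete, here is a minimal sketch of what "lowest latency among healthy regions" means. It is an illustrative model, not how Route 53 is implemented; the `RegionStatus` shape and `pickRegion` name are made up for this example.

```typescript
// Hypothetical model of latency-based routing with health-check gating.
interface RegionStatus {
  region: string;
  latencyMs: number; // latency from the resolver's network to this region
  healthy: boolean;  // outcome of the region's health check
}

// Returns the healthy region with the lowest latency, or null if none is healthy.
function pickRegion(candidates: RegionStatus[]): string | null {
  const healthy = candidates.filter((c) => c.healthy);
  if (healthy.length === 0) return null;
  return healthy.reduce((best, c) => (c.latencyMs < best.latencyMs ? c : best)).region;
}

// us-east-1 is closer but unhealthy, so the answer is eu-west-1.
console.log(
  pickRegion([
    { region: 'us-east-1', latencyMs: 20, healthy: false },
    { region: 'eu-west-1', latencyMs: 90, healthy: true },
  ]),
);
```

The point of the model is that health gating comes first and latency only ranks what is left.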

Here is a pair of Route 53 latency records gated by health checks, using Terraform:

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = aws_lb.us_east_1.dns_name
    zone_id                = aws_lb.us_east_1.zone_id
    evaluate_target_health = true
  }

  # A record can carry only one routing policy, so there is no
  # failover_routing_policy block here. With latency routing, Route 53
  # skips any record whose health check is failing.
  health_check_id = aws_route53_health_check.us_east_1.id
}

resource "aws_route53_record" "api_eu_west" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    name                   = aws_lb.eu_west_1.dns_name
    zone_id                = aws_lb.eu_west_1.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.eu_west_1.id
}

The critical detail is the health check. If us-east-1 fails its health check, Route 53 stops returning that record and all traffic goes to eu-west-1. A TCP health check is not enough. You need an application-level health check that validates database connectivity, cache connectivity, and internal service dependencies.

resource "aws_route53_health_check" "us_east_1" {
  fqdn              = "api-us-east-1.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "us-east-1-api-health"
  }
}
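The application side of that check, the /health endpoint, might be structured along these lines. This is a sketch under assumptions: the `DependencyCheck` type and `runHealthChecks` function are hypothetical names, and the real checks (a `SELECT 1` against the database, a Redis `PING`, and so on) are left to the caller. The default timeout is kept under two seconds because Route 53 expects an HTTP response within roughly two seconds of connecting.

```typescript
// A dependency check resolves if the dependency is reachable and rejects otherwise.
type DependencyCheck = () => Promise<void>;

interface HealthReport {
  statusCode: 200 | 503;
  checks: Record<string, 'ok' | 'failed'>;
}

// Rejects if the given promise does not settle within timeoutMs.
function withTimeout(promise: Promise<void>, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timed out')), timeoutMs);
    promise.then(
      () => { clearTimeout(timer); resolve(); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Runs all checks in parallel; any failure or timeout turns the result into a 503.
async function runHealthChecks(
  checks: Record<string, DependencyCheck>,
  timeoutMs = 1500,
): Promise<HealthReport> {
  const outcomes = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(check(), timeoutMs);
        return { name, ok: true };
      } catch {
        return { name, ok: false };
      }
    }),
  );

  const report: HealthReport = { statusCode: 200, checks: {} };
  for (const outcome of outcomes) {
    report.checks[outcome.name] = outcome.ok ? 'ok' : 'failed';
    if (!outcome.ok) report.statusCode = 503;
  }
  return report;
}
```

An HTTP handler would call `runHealthChecks({ database: pingDb, cache: pingCache })`, where `pingDb` and `pingCache` are your own functions, and write `report.statusCode` plus the JSON body to the response.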

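Set the health check interval to 10 seconds with a failure threshold of 3. That gives you a detection window of roughly 30 seconds. For records where you control the TTL, set it to 10 seconds. Going much lower buys little, because some resolvers enforce their own minimum cache time. Alias records such as the ones above do not take a TTL setting: Route 53 uses the TTL of the target, which is 60 seconds for a load balancer. A 10-second TTL with a 30-second detection window means failover completes in under 60 seconds from the first failure. With alias records, plan for closer to 90 seconds.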
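The timing arithmetic is simple enough to write down. The sketch below uses hypothetical names and assumes the worst case is the detection window plus one full TTL. It ignores resolvers that cache longer than the TTL says they should. The 60-second figure is the TTL that Route 53 applies to alias records pointing at a load balancer.

```typescript
interface FailoverTiming {
  requestIntervalSeconds: number; // health check interval
  failureThreshold: number;       // consecutive failures before the endpoint is unhealthy
  ttlSeconds: number;             // how long resolvers may keep serving the old answer
}

// Time until the health check flips to unhealthy.
function detectionWindowSeconds(t: FailoverTiming): number {
  return t.requestIntervalSeconds * t.failureThreshold;
}

// Detection window plus one full TTL of cached answers.
function worstCaseFailoverSeconds(t: FailoverTiming): number {
  return detectionWindowSeconds(t) + t.ttlSeconds;
}

const plainRecord: FailoverTiming = { requestIntervalSeconds: 10, failureThreshold: 3, ttlSeconds: 10 };
const aliasToLoadBalancer: FailoverTiming = { ...plainRecord, ttlSeconds: 60 };

console.log(worstCaseFailoverSeconds(plainRecord));         // 40
console.log(worstCaseFailoverSeconds(aliasToLoadBalancer)); // 90
```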

Database replication: active-passive is the default

Do not attempt active-active multi-master on your first pass. It adds conflict resolution, clock skew handling, and a debugging surface area that will consume your engineering team for months. Start with active-passive.

In active-passive, one region handles all writes. The other regions run read replicas connected to the primary via streaming replication. Read traffic is routed to the local replica. Write traffic is forwarded to the primary region.

For PostgreSQL on AWS RDS, this means a cross-region read replica:

resource "aws_db_instance" "primary" {
  identifier = "app-db-us-east-1"
  engine     = "postgres"
  # ... standard config ...

  backup_retention_period = 7
  storage_encrypted       = true
}

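resource "aws_db_instance" "replica_eu" {
  provider = aws.eu_west_1

  identifier = "app-db-eu-west-1"
  engine     = "postgres"

  # A cross-region replica must reference the source by ARN, not by identifier.
  replicate_source_db = aws_db_instance.primary.arn

  backup_retention_period = 7
  storage_encrypted       = true
  # An encrypted cross-region replica needs a KMS key in the destination region.
  kms_key_id              = aws_kms_key.eu.arn

  lifecycle {
    ignore_changes = [
      replicate_source_db,
    ]
  }
}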

Your application code needs to know which database connection to use for reads versus writes. A simple connection manager handles this:

import { Pool } from 'pg';

// Minimal HTTP-aware error; swap in your own error type if you have one.
class ApiError extends Error {
  constructor(public readonly status: number, message: string) {
    super(message);
  }
}

interface RegionConfig {
  isPrimary: boolean;
  readPool: Pool;
  writePool?: Pool; // only set in the primary region
}

const regionConfigs: Record<string, RegionConfig> = {
  'us-east-1': {
    isPrimary: true,
    readPool: new Pool({ connectionString: process.env.DB_READ_URL_US }),
    writePool: new Pool({ connectionString: process.env.DB_WRITE_URL_US }),
  },
  'eu-west-1': {
    isPrimary: false,
    readPool: new Pool({ connectionString: process.env.DB_READ_URL_EU }),
  },
};

const currentRegion = process.env.AWS_REGION || 'us-east-1';
const config = regionConfigs[currentRegion];
if (!config) {
  // Fail at startup rather than on the first query.
  throw new Error(`No database configuration for region ${currentRegion}`);
}

export function getReadPool(): Pool {
  return config.readPool;
}

export function getWritePool(): Pool {
  // Non-primary regions have no write pool. Fail fast so the caller can
  // forward the request to the primary region or tell the client to retry there.
  if (!config.writePool) {
    throw new ApiError(
      503,
      'Writes are not available in this region. ' +
      'Please retry with primary-region affinity.'
    );
  }
  return config.writePool;
}

Notice the error message. When a write request lands in a non-primary region, you have two options: forward the request to the primary region (a transparent proxy), or return a 503 with a hint telling the client to retry against the primary region. The transparent proxy adds a cross-region round trip to every forwarded write but keeps client logic simple. The 503-with-hint approach is cleaner for APIs whose clients already use idempotency keys and retry logic.
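The two options can be sketched as a single planning function that a request handler consults before touching the database. Everything here is illustrative: the `planRequest` name, the `RegionInfo` shape, and the use of the HTTP method as a stand-in for "is this a write" are assumptions, and the `X-Region-Hint` header is only a convention the client would have to understand.

```typescript
type WriteStrategy = 'proxy' | 'hint';

type RequestPlan =
  | { action: 'handle-locally' }
  | { action: 'forward'; targetUrl: string }
  | { action: 'reject'; status: 503; headers: Record<string, string> };

interface RegionInfo {
  currentRegion: string;
  primaryRegion: string;
  primaryBaseUrl: string; // hypothetical regional endpoint of the primary, no trailing slash
}

// Assumption: anything that is not a safe method is treated as a write.
const WRITE_METHODS = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);

function planRequest(
  method: string,
  path: string, // expected to start with "/"
  info: RegionInfo,
  strategy: WriteStrategy,
): RequestPlan {
  const isWrite = WRITE_METHODS.has(method.toUpperCase());
  if (!isWrite || info.currentRegion === info.primaryRegion) {
    return { action: 'handle-locally' };
  }
  if (strategy === 'proxy') {
    // Transparent proxy: the caller forwards the request and relays the response.
    return { action: 'forward', targetUrl: info.primaryBaseUrl + path };
  }
  // Hint: the client is expected to retry against the primary region itself.
  return { action: 'reject', status: 503, headers: { 'X-Region-Hint': info.primaryRegion } };
}
```

Whichever strategy you pick, writes that get retried or forwarded should carry an idempotency key so that a timeout in the middle of a cross-region hop does not turn into a duplicate.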

Promoting a replica to primary

When the primary region goes down, you promote the read replica to a standalone primary. On RDS, this is a single API call, but it breaks the replication chain permanently. You cannot reattach a promoted instance as a replica later.

resource "aws_db_instance" "replica_eu" {
  provider = aws.eu_west_1

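  # Keep the identifier unchanged: renaming changes the endpoint hostname,
  # and RDS identifiers cannot contain two consecutive hyphens.
  identifier = "app-db-eu-west-1"
  engine     = "postgres"

  # Remove replicate_source_db (and the ignore_changes entry for it) to promote
  # replicate_source_db = aws_db_instance.primary.identifier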

  backup_retention_period = 7
  storage_encrypted       = true
}

When the primary region recovers, you have a choice. You can set up a new replica in eu-west-1 that replicates from the now-recovered us-east-1 primary (losing the writes that happened on the promoted instance), or you can reverse the replication direction and make us-east-1 the replica. This is messy and requires careful data reconciliation. Have a runbook for it. Do not try to figure it out during an incident.

Session and state: prefer stateless

Multi-region deployment punishes server-local state. If a user’s session is stored in a Node.js process memory in us-east-1, and the next request is routed to eu-west-1, the user is logged out. The standard fix is a shared session store (Redis, DynamoDB) accessible from both regions. But that creates a cross-region dependency that adds latency to every authenticated request.

The better fix is to eliminate server-side session state entirely. Use JWTs for sessions. The token contains all the information the server needs (user ID, roles, expiration). The server validates the signature using a shared public key. No shared session store required.

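import jwt from 'jsonwebtoken';

// Minimal error type for rejected tokens; swap in your own if you have one.
class AuthenticationError extends Error {}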

const PUBLIC_KEY = `-----BEGIN PUBLIC KEY-----
... shared across regions via a config management system ...
-----END PUBLIC KEY-----`;

interface SessionPayload {
  sub: string;
  roles: string[];
  exp: number;
}

export function verifySession(token: string): SessionPayload {
  try {
    return jwt.verify(token, PUBLIC_KEY, { algorithms: ['RS256'] }) as SessionPayload;
  } catch (err) {
    throw new AuthenticationError('Invalid or expired session token');
  }
}

With RS256, the public key is the only piece of state you need to distribute to every region. The private signing key stays with whichever service issues tokens. That is a much smaller surface area than a Redis cluster stretched across continents.
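To see why only the public key has to travel, here is a self-contained sketch of both sides of an RS256 token using Node's built-in crypto module. It is for illustration, not a replacement for a maintained JWT library: it checks only the signature and `exp`, and the `signToken`, `verifyToken`, and `encodeSegment` names are made up for this example.

```typescript
import { createSign, createVerify, generateKeyPairSync } from 'node:crypto';

interface SessionClaims {
  sub: string;
  roles: string[];
  exp: number; // expiry, in seconds since the epoch
}

// Base64url-encodes a JSON value, as the JWT compact format requires.
function encodeSegment(value: object): string {
  return Buffer.from(JSON.stringify(value)).toString('base64url');
}

// Issuer side: needs the private key. Runs wherever tokens are minted.
function signToken(claims: SessionClaims, privateKeyPem: string): string {
  const signingInput = `${encodeSegment({ alg: 'RS256', typ: 'JWT' })}.${encodeSegment(claims)}`;
  const signature = createSign('RSA-SHA256').update(signingInput).sign(privateKeyPem);
  return `${signingInput}.${signature.toString('base64url')}`;
}

// Verifier side: needs only the public key. Runs in every region.
function verifyToken(token: string, publicKeyPem: string, nowSeconds: number): SessionClaims | null {
  const parts = token.split('.');
  if (parts.length !== 3) return null;
  const [header, payload, signature] = parts;
  // The algorithm is fixed to RSA-SHA256 here instead of being read from the header.
  const valid = createVerify('RSA-SHA256')
    .update(`${header}.${payload}`)
    .verify(publicKeyPem, Buffer.from(signature, 'base64url'));
  if (!valid) return null;
  const claims = JSON.parse(Buffer.from(payload, 'base64url').toString('utf8')) as SessionClaims;
  return claims.exp > nowSeconds ? claims : null;
}

// Throwaway key pair for demonstration only.
const { publicKey, privateKey } = generateKeyPairSync('rsa', {
  modulusLength: 2048,
  publicKeyEncoding: { type: 'spki', format: 'pem' },
  privateKeyEncoding: { type: 'pkcs8', format: 'pem' },
});

const nowSeconds = Math.floor(Date.now() / 1000);
const demoToken = signToken({ sub: 'user-123', roles: ['admin'], exp: nowSeconds + 900 }, privateKey);
console.log(verifyToken(demoToken, publicKey, nowSeconds)?.sub); // user-123
```

A region that holds only `publicKey` can validate sessions but cannot mint them, which limits the damage if one region's configuration leaks.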

If you cannot go fully stateless (rate limiter state, temporary upload tokens, long-poll connections), use a regional session store and route the user to the same region consistently. That leads to the next pattern.

Regional affinity without sticky sessions

Sticky sessions at the load balancer level are fragile. If a load balancer instance dies, all its sticky routing tables disappear and users get reassigned. In a multi-region setup, you want affinity at the region level, not the instance level.

Use a region cookie set by the application:

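import type { IncomingMessage, ServerResponse } from 'node:http';
import { parse as parseCookies } from 'cookie'; // npm package `cookie`

export function setRegionCookie(response: ServerResponse, region: string): void {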
  response.setHeader('Set-Cookie', [
    `region=${region}; Path=/; Max-Age=3600; Secure; HttpOnly; SameSite=Lax`
  ]);
}

export function getRegionFromRequest(request: IncomingMessage): string | null {
  const cookie = parseCookies(request.headers.cookie || '');
  return cookie.region || null;
}

When a request arrives, check the region cookie. If it matches the current region, serve it. If it does not, return a redirect to the matching region’s endpoint. Use a 307 rather than a 302 if the request may be a POST or PUT, since a 307 preserves the method and body. For API clients that do not follow redirects, include an X-Region-Hint response header and let the client decide.

This approach keeps your load balancer configuration simple. It does not need to maintain session tables. It just needs to terminate TLS and forward traffic to healthy instances.
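The cookie check described above reduces to a small decision function. This is a sketch with hypothetical names (`decideAffinity`, `REGION_ENDPOINTS`); the regional hostnames follow the `api-<region>.example.com` pattern used elsewhere in this article. It uses a 307 so that non-GET requests keep their method and body, and it never redirects to a region it does not recognize, because the cookie value comes from the client.

```typescript
type AffinityDecision =
  | { action: 'serve' }
  | { action: 'redirect'; status: 307; location: string; regionHint: string };

// Hypothetical map of region name to regional endpoint.
const REGION_ENDPOINTS: Record<string, string> = {
  'us-east-1': 'https://api-us-east-1.example.com',
  'eu-west-1': 'https://api-eu-west-1.example.com',
};

function decideAffinity(
  cookieRegion: string | null,
  currentRegion: string,
  path: string, // expected to start with "/"
): AffinityDecision {
  // No cookie yet, or the cookie already points here: serve locally.
  if (!cookieRegion || cookieRegion === currentRegion) {
    return { action: 'serve' };
  }
  // Unknown or forged region value: do not redirect anywhere, serve locally.
  const endpoint = REGION_ENDPOINTS[cookieRegion];
  if (!endpoint) {
    return { action: 'serve' };
  }
  return { action: 'redirect', status: 307, location: endpoint + path, regionHint: cookieRegion };
}
```

A handler would send `location` in the `Location` header and `regionHint` in `X-Region-Hint`. One case this sketch leaves out is the target region being unhealthy, where serving locally is probably the better choice.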

Configuration and secrets across regions

Your application config (feature flags, external URLs, tuning parameters) must be consistent across regions. Secrets (database passwords, API keys, JWT signing keys) must be available in every region but never stored in code.

Use AWS Secrets Manager, which supports cross-region replication natively. AWS Systems Manager Parameter Store also works, but it does not replicate across regions on its own, so your pipeline has to write each parameter to every region.

resource "aws_secretsmanager_secret" "app" {
  name = "app-config"

  # Replication is configured with a replica block on the secret itself.
  replica {
    region     = "eu-west-1"
    kms_key_id = aws_kms_key.eu.key_id
  }
}

Your application fetches config at startup and caches it locally. Do not fetch secrets on every request. A startup-time fetch with a 15-minute background refresh is sufficient for most config.

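import {
  SecretsManagerClient,
  GetSecretValueCommand,
} from '@aws-sdk/client-secrets-manager';

// Shape of the config JSON stored in the secret; adjust to your own keys.
type AppConfig = Record<string, unknown>;

const REFRESH_INTERVAL_MS = 15 * 60 * 1000;

// Create the client once and reuse it across refreshes.
const client = new SecretsManagerClient({ region: process.env.AWS_REGION });

let cachedConfig: AppConfig | null = null;
let lastFetch = 0;

// Refreshes lazily: the first call after the interval triggers a new fetch.
export async function getConfig(): Promise<AppConfig> {
  const now = Date.now();
  if (cachedConfig && now - lastFetch < REFRESH_INTERVAL_MS) {
    return cachedConfig;
  }

  try {
    const response = await client.send(
      new GetSecretValueCommand({ SecretId: process.env.CONFIG_SECRET_ARN })
    );
    if (!response.SecretString) {
      throw new Error('Config secret has no SecretString');
    }
    const fresh: AppConfig = JSON.parse(response.SecretString);
    cachedConfig = fresh;
    lastFetch = now;
    return fresh;
  } catch (err) {
    // If a refresh fails, keep serving the last known config.
    if (cachedConfig) return cachedConfig;
    throw err;
  }
}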

Deployment pipeline: sequential rollouts

Do not deploy to every region simultaneously. Deploy to one region, run smoke tests, observe metrics for 5 minutes, then proceed to the next region. A bad deployment that hits all regions at once is worse than a single-region outage.

# .github/workflows/deploy-multi-region.yml
jobs:
  deploy-us-east-1:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy-region.sh us-east-1
      - run: ./scripts/smoke-test.sh https://api-us-east-1.example.com/health

  deploy-eu-west-1:
    needs: [deploy-us-east-1]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy-region.sh eu-west-1
      - run: ./scripts/smoke-test.sh https://api-eu-west-1.example.com/health

  deploy-ap-southeast-1:
    needs: [deploy-eu-west-1]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy-region.sh ap-southeast-1
      - run: ./scripts/smoke-test.sh https://api-ap-southeast-1.example.com/health

Each region’s smoke test should validate the full path: DNS resolution, TLS termination, application response, and database connectivity. If a region fails its smoke test, the pipeline stops and the remaining regions stay on the previous version. If the failed deployment leaves that region unhealthy, its health check fails and DNS routing shifts its traffic to the healthy regions while you roll back or fix it.
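The control flow of that pipeline, including the 5-minute observation step that the workflow above does not show, can be sketched as a small orchestrator. The hook names (`deploy`, `smokeTest`, `observe`) are hypothetical. In practice they would shell out to your deploy scripts and query your metrics system.

```typescript
interface RolloutHooks {
  deploy: (region: string) => Promise<void>;
  smokeTest: (region: string) => Promise<boolean>;
  observe: (region: string) => Promise<boolean>; // e.g. watch error rates for 5 minutes
}

interface RolloutResult {
  deployed: string[];          // regions that passed every gate
  failedRegion: string | null; // first region that failed a gate, if any
  skipped: string[];           // regions never touched, still on the previous version
}

// Deploys one region at a time and stops at the first region that fails a gate.
// An error thrown by a hook also aborts the rollout by propagating to the caller.
async function rolloutSequentially(regions: string[], hooks: RolloutHooks): Promise<RolloutResult> {
  const deployed: string[] = [];
  for (let i = 0; i < regions.length; i++) {
    const region = regions[i];
    await hooks.deploy(region);
    const healthy = (await hooks.smokeTest(region)) && (await hooks.observe(region));
    if (!healthy) {
      return { deployed, failedRegion: region, skipped: regions.slice(i + 1) };
    }
    deployed.push(region);
  }
  return { deployed, failedRegion: null, skipped: [] };
}
```

Note that `failedRegion` has already received the new version by the time its smoke test fails, so a rollback of that one region is still needed. Only the `skipped` regions are untouched.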

The failover playbook

When the primary region goes down, you need a runbook, not a guessing game. Here is the sequence.

Step 1: Confirm the outage. Do not fail over based on a single alert. Verify with an external monitoring system (not hosted in the affected region) that the region is actually impaired. A health check from three different vantage points is the minimum.

Step 2: Promote the database replica. If the primary is unreachable and the replica is fully caught up, promote it. This breaks replication. Accept that. Recovery is a future problem.

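Step 3: Update DNS health check status. If your health checks are configured correctly, DNS routing will automatically shift traffic to the healthy region within 60 seconds. If the region is degraded but its health check still passes, force the shift by inverting the health check status. If you use a failover routing policy instead of latency routing, Route 53 also switches to the secondary record on its own once the primary’s health check fails. Change the primary/secondary designation only if you want the surviving region to stay primary after recovery.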

Step 4: Scale up the surviving region. The surviving region now handles 100% of traffic. If it was provisioned for 50%, it will fail under doubled load. Have a scaling script ready:

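#!/bin/bash
# scale-up-survivor.sh
# Usage: scale-up-survivor.sh <region>
set -euo pipefail

REGION="${1:?usage: scale-up-survivor.sh <region>}"
ASG_NAME="app-${REGION}"

# Read the current desired capacity and max size
read -r CURRENT_DESIRED MAX_SIZE <<< "$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --region "$REGION" \
  --query 'AutoScalingGroups[0].[DesiredCapacity,MaxSize]' \
  --output text)"

# Double the desired capacity
NEW_DESIRED=$((CURRENT_DESIRED * 2))

# Raise the group's max size first if the new target would exceed it
if (( NEW_DESIRED > MAX_SIZE )); then
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$ASG_NAME" \
    --max-size "$NEW_DESIRED" \
    --region "$REGION"
fi

aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$ASG_NAME" \
  --desired-capacity "$NEW_DESIRED" \
  --region "$REGION"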

Step 5: Communicate. The most underrated part of any failover. Post to your status page, update your incident tracker, and send a notification to the on-call channel. State what happened, what you are doing, and the expected timeline.
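Steps 1 and 2 are the ones most likely to be skipped under pressure, so it can help to encode them as an explicit gate. The sketch below is one possible shape, with loud assumptions: the names are hypothetical, "confirmed" is taken to mean that a strict majority of at least three external probes sees the region as down, and "fully caught up" is relaxed to a lag threshold that you choose in advance.

```typescript
interface ProbeResult {
  vantagePoint: string; // where the external check ran, outside the affected region
  healthy: boolean;
}

interface FailoverInputs {
  probes: ProbeResult[];
  replicaLagSeconds: number | null; // null when the lag cannot be measured
  maxAcceptableLagSeconds: number;  // agreed on before the incident, not during it
}

type FailoverDecision =
  | { proceed: false; reason: string }
  | { proceed: true; expectedDataLossSeconds: number };

function decideFailover(input: FailoverInputs): FailoverDecision {
  if (input.probes.length < 3) {
    return { proceed: false, reason: 'need probes from at least three vantage points' };
  }
  const failing = input.probes.filter((p) => !p.healthy).length;
  if (failing * 2 <= input.probes.length) {
    return { proceed: false, reason: 'no majority of vantage points sees the region as down' };
  }
  if (input.replicaLagSeconds === null) {
    return { proceed: false, reason: 'replica lag unknown, needs a human decision' };
  }
  if (input.replicaLagSeconds > input.maxAcceptableLagSeconds) {
    return { proceed: false, reason: 'replica lag exceeds the acceptable data loss window' };
  }
  return { proceed: true, expectedDataLossSeconds: input.replicaLagSeconds };
}
```

A `proceed: false` result is not a verdict that failover is wrong. It is a signal that a person has to make the call and own the data loss that comes with it.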

After the incident, the most important step is the data reconciliation playbook. When the primary region comes back, you need to figure out how to get the writes that went to the promoted replica back into the original primary. This might involve a one-way data sync, a full rebuild of the primary from the promoted replica, or accepting a small window of data loss. The right answer depends on your consistency requirements. Document it before you need it.

When you should not go multi-region

Multi-region is expensive. You pay for compute in every region, cross-region data transfer, and the engineering time to maintain the infrastructure. Here is when you should not do it.

If your availability requirement is 99.9% and your cloud provider’s single-region SLA is 99.95%, you do not need multi-region. The single region already meets your target. The extra 0.05% is not worth the complexity.

If your user base is concentrated in one geographic area, multi-region adds latency for no benefit. A European user base should deploy in eu-west-1 or eu-central-1, not copy everything to ap-southeast-1.

If your database cannot tolerate even seconds of write unavailability during a replica promotion, you need a multi-master database (CockroachDB, YugabyteDB, Spanner), not a multi-region application layer with a single-writer database. Do not try to fake multi-master on top of PostgreSQL streaming replication. You will lose data.
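The SLA comparison is easier to reason about as minutes of allowed downtime. A short sketch, assuming a 30-day month (the function name is made up for this example):

```typescript
const MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60; // 43,200

// Minutes of downtime that still meet the given availability target over the period.
function allowedDowntimeMinutes(
  availabilityPercent: number,
  periodMinutes: number = MINUTES_PER_30_DAY_MONTH,
): number {
  return periodMinutes * (1 - availabilityPercent / 100);
}

console.log(allowedDowntimeMinutes(99.9).toFixed(1));  // 43.2
console.log(allowedDowntimeMinutes(99.95).toFixed(1)); // 21.6
```

Two caveats on reading these numbers. A provider SLA is the point at which you get service credits, not a promise about your own application's uptime. And the 47-minute outage from the opening of this article would, on its own, exceed a 99.9% budget for that month.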

The working setup: three-region active-passive

Here is what a production-ready three-region setup looks like, end to end.

DNS: Route 53 latency records with health checks, 10-second TTL.
Compute: ECS Fargate or EKS in each region, behind a regional ALB.
Database: RDS PostgreSQL primary in us-east-1. Read replicas in eu-west-1 and ap-southeast-1.
Session: Stateless JWTs signed with an RSA key pair. Public key deployed to every region via Secrets Manager replication.
Config: Secrets Manager with cross-region replication. Fetched at startup, cached for 15 minutes.
Cache: Regional ElastiCache Redis clusters. No cross-region replication. Cache is local and disposable.
Deployment: Sequential canary deployments across regions. Each region goes through smoke tests and a 5-minute observation window.
Failover: Automated health check triggers DNS shift. Manual database promotion. Pre-written scaling and communication runbooks.

This setup handles the three most common failure modes: a single AZ failure (the regional load balancer and multi-AZ RDS handle this), a single region failure (DNS shifts traffic, read replica gets promoted), and a gradual degradation (as latency climbs, the /health endpoint starts timing out, the health checks fail, and the region is removed from DNS rotation before it fully fails).

Start with two regions. Add a third only after you have tested failover in production three times. Run a game day where you simulate a region failure and time the recovery. The first time will take an hour. The third time should take under five minutes. That is when you know you have it right.

A note from Yojji

The discipline of multi-region deployment (testing failover paths before you need them, writing runbooks for the 3 AM scenario, thinking about data reconciliation from the start) is the same discipline that makes any production system reliable. Yojji’s teams build and operate distributed systems across AWS, Azure, and Google Cloud, including the cross-region infrastructure, CI/CD pipelines, and observability tooling that keeps services available when a data center goes dark. If your architecture needs to survive region-level failures without the postmortem theatrics, Yojji is worth a conversation.
