Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All

A junior engineer opens a PR titled “rename email to email_address.” It is a three-line migration:

ALTER TABLE users RENAME COLUMN email TO email_address;

Reviewed and approved. Deploy starts. The migration runs. The previous version of the application — still serving traffic during the rolling deploy — does SELECT email FROM users and crashes. Login is broken for 90 seconds while pods cycle. The rollback is even worse, because rolling back the migration only works if no one has written email_address yet.

This is the simplest version of the zero-downtime migration problem. The general pattern that solves all of these — column renames, type changes, table splits, NOT NULL backfills — is called expand-migrate-contract, and it is the only way to do non-trivial schema changes without a maintenance window. Six steps. Two deploys. Always backwards-compatible at every stage.

The pattern

Step 1 (deploy 1):  Add the new shape alongside the old.
Step 2 (deploy 1):  App writes BOTH old and new.
Step 3 (background): Backfill old data into the new shape.
Step 4 (deploy 2):  App reads the NEW shape, still writes both.
Step 5 (deploy 2):  App writes only the new shape.
Step 6 (deploy 2):  Drop the old shape.

At every step, both the previous version of the application and the next can run against the database without errors. Rollback is a re-deploy of the previous version — never a destructive SQL operation.

For trivial migrations (adding a nullable column with no backfill, dropping an unused column nothing reads), this is overkill. For anything else, it is the price of zero downtime.

Worked example: renaming `email` to `email_address`

Six steps, two deploys, around three days of clock time on a real table.

Deploy 1, step 1: add the new column

ALTER TABLE users ADD COLUMN email_address text;

Adding a nullable column with no default is fast and metadata-only on Postgres 11+. No table rewrite. Cost: milliseconds.

Deploy 1, step 2: app writes both, reads old

async function createUser({ email, ... }) {
  await db.query(
    `INSERT INTO users (email, email_address, ...) VALUES ($1, $1, ...)`,
    [email, ...],
  );
}

async function updateEmail(userId, newEmail) {
  await db.query(
    `UPDATE users SET email = $1, email_address = $1 WHERE id = $2`,
    [newEmail, userId],
  );
}

// Reads still use email.
async function getEmail(userId) {
  return (await db.query(`SELECT email FROM users WHERE id = $1`, [userId])).rows[0].email;
}

Now every new write keeps both columns in sync. The previous version of the app (still rolling out / rolling back) writes only email; the new code reads email so it does not depend on email_address yet.

Deploy 1, step 3: backfill in the background

The new column is empty for all pre-existing rows. Run a backfill — chunked, idempotent, off-peak.

-- Backfill in batches of 10k to avoid long locks.
DO $$
DECLARE
  rows_updated int := 1;
BEGIN
  WHILE rows_updated > 0 LOOP
    WITH batch AS (
      SELECT id FROM users
       WHERE email_address IS NULL
       ORDER BY id
       LIMIT 10000
       FOR UPDATE SKIP LOCKED
    )
    UPDATE users SET email_address = email
      WHERE id IN (SELECT id FROM batch);
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    PERFORM pg_sleep(0.1);  -- gentle pacing
  END LOOP;
END $$;

For very large tables (>50M rows), use a tool like pg_repack or run the backfill from the application as a background job. The Postgres anonymous block above works for moderate tables and avoids deployment fuss.

When the backfill is done, verify:

SELECT count(*) FROM users WHERE email_address IS NULL;

Should be zero. (Or only rows that genuinely have no email — handle them per business logic.)

Deploy 2, step 4: read from new column

async function getEmail(userId) {
  return (await db.query(`SELECT email_address FROM users WHERE id = $1`, [userId])).rows[0].email_address;
}

Writes still go to both. If anything is wrong with the data in email_address, you can roll back to deploy 1 and the app reads email again. Do this step on its own and watch metrics for a day before proceeding.

Deploy 2, step 5: stop writing the old column

async function createUser({ emailAddress, ... }) {
  await db.query(
    `INSERT INTO users (email_address, ...) VALUES ($1, ...)`,
    [emailAddress, ...],
  );
}

After this deploys and the previous version is fully rolled out, no code reads or writes email.

Deploy 2, step 6: drop the old column

ALTER TABLE users DROP COLUMN email;

Fast, metadata-only. Done.

What if a step fails?

Rollback is by reverting the application, not the migration:

Step 4 deployed and reads are wrong. Revert app to step 2/3 state. Reads go back to email, which still has data. Investigate.
Step 5 deployed and old code still in flight. No issue — old code will only see existing email data, which is unchanged.
Step 6 deployed and you discover a bug. This is the only step that destroys data. Restore from backup or accept loss. Wait at least a day between steps 5 and 6 to avoid this.

Type changes: same pattern, different implementation

Want to change id INT to id BIGINT because you are running out of integer space? Same shape:

Add id_new BIGINT column.
Trigger or app writes id_new = id on every change.
Backfill historical rows.
Add unique index on id_new concurrently.
Switch app reads to id_new.
Drop id, rename id_new to id.

The trigger version (step 2):

CREATE FUNCTION sync_id_new() RETURNS trigger AS $$
BEGIN
  NEW.id_new := NEW.id;
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_id_new
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION sync_id_new();

For foreign keys referring to the changed column, the same pattern applies on the dependent tables in lockstep — and this is where these migrations get genuinely involved.

Adding `NOT NULL` to a populated column

A common request: the phone column is nullable but should be required going forward. The naive version locks the table while Postgres scans every row to verify:

-- Locks for the duration of a full-table scan.
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;

The zero-downtime version uses CHECK ... NOT VALID first:

-- 1. Add a check constraint without validating existing rows.
ALTER TABLE users ADD CONSTRAINT users_phone_not_null CHECK (phone IS NOT NULL) NOT VALID;
-- This locks the table briefly but does not scan it.

-- 2. Validate the constraint in the background. Takes a long time on big tables
--    but does not block writes.
ALTER TABLE users VALIDATE CONSTRAINT users_phone_not_null;

-- 3. Optionally, promote the check to NOT NULL (Postgres 12+ uses the validated check).
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;

Backfill any null rows before step 2 — otherwise validation fails and the constraint stays invalid.

The lock cheat sheet

The locks each schema change takes (Postgres 14+, hot path only):

Operation	Lock
`ADD COLUMN` (nullable, no default)	`AccessExclusive`, brief
`ADD COLUMN ... DEFAULT non-volatile`	`AccessExclusive`, brief (no rewrite)
`ADD COLUMN ... DEFAULT volatile`	`AccessExclusive`, full rewrite — avoid
`DROP COLUMN`	`AccessExclusive`, brief
`RENAME COLUMN`	`AccessExclusive`, brief
`ALTER COLUMN TYPE` (compatible cast)	`AccessExclusive`, may rewrite
`CREATE INDEX`	blocks writes; use `CONCURRENTLY`
`CREATE INDEX CONCURRENTLY`	does not block, slower
`ALTER TABLE SET NOT NULL`	full table scan, blocks writes
`ADD CONSTRAINT ... NOT VALID`	brief
`VALIDATE CONSTRAINT`	reads only, no write block
`DROP CONSTRAINT`	`AccessExclusive`, brief

The two important takeaways:

Use CONCURRENTLY for indexes. A non-concurrent index build on a hot table blocks writes for as long as it takes — easily hours.
Avoid DEFAULT with volatile expressions. DEFAULT now() on a new column requires a full rewrite. If you need it, set the default after adding the column and backfill.

A migration framework that supports the pattern

Most ORMs default to “one migration = one SQL operation” which makes expand-migrate-contract awkward. A few patterns:

One migration file per step, with explicit ordering. 001_add_email_address_column.sql, 002_backfill_email_address.sql, etc. Each runs independently and can be retried.
Online schema change tools like pg-online-schema-change or reshape build the expand-migrate-contract pattern into the migration tool.
Plain SQL files committed alongside code. The discipline of “no migration without an expand step” lives in code review, not tooling.

For most teams, the third option (plain SQL + code review discipline) is enough. The tooling is helpful but not essential.

Pre-flight checks before any migration

A small checklist that prevents most incidents:

Estimate the lock duration. EXPLAIN does not work for DDL, but \timing on and a staging-environment dry-run on a comparable-size table do.
Set lock_timeout in the migration session. SET lock_timeout = '5s'; aborts the migration if it cannot get the lock — better than blocking the entire database.
Set statement_timeout for the migration as a safety net.
Run during low-traffic windows for anything that takes >1 second.
Have a rollback ready. Not “we’ll figure it out” — actually written down, copy-paste ready.

The takeaway

A schema change against a populated production table is a six-step deploy, not a one-line SQL change. Expand the schema, write to both shapes, backfill, switch reads, stop writing the old shape, drop it. Two deploys, always backwards-compatible, rollback by re-deploying the previous version.

The first time you do this it feels like ceremony. By the third one it is routine, and the alternative — “schedule a maintenance window for the rename” — feels like 2010.

A note from Yojji

The kind of operations discipline that turns a column rename from a maintenance window into a routine deploy — expand-migrate-contract, lock-budgeting, backfill jobs that respect production load — is the kind of long-haul backend work Yojji’s teams build into the products they ship.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and the kind of database operations work that decides whether a schema change is a non-event or a postmortem.