Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All
Renaming a column on a 50-million-row table looks like a one-line SQL change and is actually a six-step deploy spread across two PRs. Here is the pattern — expand, migrate, contract — applied to renames, type changes, and NOT NULL backfills, with the locks each step takes and the rollback at every stage.
A junior engineer opens a PR titled “rename email to email_address.” It is a three-line migration:
ALTER TABLE users RENAME COLUMN email TO email_address;
Reviewed and approved. Deploy starts. The migration runs. The previous version of the application — still serving traffic during the rolling deploy — does SELECT email FROM users and crashes. Login is broken for 90 seconds while pods cycle. The rollback is even worse, because rolling back the migration only works if no one has written email_address yet.
This is the simplest version of the zero-downtime migration problem. The general pattern that solves all of these — column renames, type changes, table splits, NOT NULL backfills — is called expand-migrate-contract, and it is the only way to do non-trivial schema changes without a maintenance window. Six steps. Two deploys. Always backwards-compatible at every stage.
The pattern
Step 1 (deploy 1): Add the new shape alongside the old.
Step 2 (deploy 1): App writes BOTH old and new.
Step 3 (background): Backfill old data into the new shape.
Step 4 (deploy 2): App reads the NEW shape, still writes both.
Step 5 (deploy 2): App writes only the new shape.
Step 6 (deploy 2): Drop the old shape.
At every step, both the previous version of the application and the next can run against the database without errors. Rollback is a re-deploy of the previous version — never a destructive SQL operation.
For trivial migrations (adding a nullable column with no backfill, dropping an unused column nothing reads), this is overkill. For anything else, it is the price of zero downtime.
Worked example: renaming email to email_address
Six steps, two deploys, around three days of clock time on a real table.
Deploy 1, step 1: add the new column
ALTER TABLE users ADD COLUMN email_address text;
Adding a nullable column with no default is fast and metadata-only on Postgres 11+. No table rewrite. Cost: milliseconds.
Deploy 1, step 2: app writes both, reads old
async function createUser({ email, ... }) {
await db.query(
`INSERT INTO users (email, email_address, ...) VALUES ($1, $1, ...)`,
[email, ...],
);
}
async function updateEmail(userId, newEmail) {
await db.query(
`UPDATE users SET email = $1, email_address = $1 WHERE id = $2`,
[newEmail, userId],
);
}
// Reads still use email.
async function getEmail(userId) {
return (await db.query(`SELECT email FROM users WHERE id = $1`, [userId])).rows[0].email;
}
Now every new write keeps both columns in sync. The previous version of the app (still rolling out / rolling back) writes only email; the new code reads email so it does not depend on email_address yet.
Deploy 1, step 3: backfill in the background
The new column is empty for all pre-existing rows. Run a backfill — chunked, idempotent, off-peak.
-- Backfill in batches of 10k to avoid long locks.
DO $$
DECLARE
rows_updated int := 1;
BEGIN
WHILE rows_updated > 0 LOOP
WITH batch AS (
SELECT id FROM users
WHERE email_address IS NULL
ORDER BY id
LIMIT 10000
FOR UPDATE SKIP LOCKED
)
UPDATE users SET email_address = email
WHERE id IN (SELECT id FROM batch);
GET DIAGNOSTICS rows_updated = ROW_COUNT;
PERFORM pg_sleep(0.1); -- gentle pacing
END LOOP;
END $$;
For very large tables (>50M rows), use a tool like pg_repack or run the backfill from the application as a background job. The Postgres anonymous block above works for moderate tables and avoids deployment fuss.
When the backfill is done, verify:
SELECT count(*) FROM users WHERE email_address IS NULL;
Should be zero. (Or only rows that genuinely have no email — handle them per business logic.)
Deploy 2, step 4: read from new column
async function getEmail(userId) {
return (await db.query(`SELECT email_address FROM users WHERE id = $1`, [userId])).rows[0].email_address;
}
Writes still go to both. If anything is wrong with the data in email_address, you can roll back to deploy 1 and the app reads email again. Do this step on its own and watch metrics for a day before proceeding.
Deploy 2, step 5: stop writing the old column
async function createUser({ emailAddress, ... }) {
await db.query(
`INSERT INTO users (email_address, ...) VALUES ($1, ...)`,
[emailAddress, ...],
);
}
After this deploys and the previous version is fully rolled out, no code reads or writes email.
Deploy 2, step 6: drop the old column
ALTER TABLE users DROP COLUMN email;
Fast, metadata-only. Done.
What if a step fails?
Rollback is by reverting the application, not the migration:
- Step 4 deployed and reads are wrong. Revert app to step 2/3 state. Reads go back to
email, which still has data. Investigate. - Step 5 deployed and old code still in flight. No issue — old code will only see existing
emaildata, which is unchanged. - Step 6 deployed and you discover a bug. This is the only step that destroys data. Restore from backup or accept loss. Wait at least a day between steps 5 and 6 to avoid this.
Type changes: same pattern, different implementation
Want to change id INT to id BIGINT because you are running out of integer space? Same shape:
- Add
id_new BIGINTcolumn. - Trigger or app writes
id_new = idon every change. - Backfill historical rows.
- Add unique index on
id_newconcurrently. - Switch app reads to
id_new. - Drop
id, renameid_newtoid.
The trigger version (step 2):
CREATE FUNCTION sync_id_new() RETURNS trigger AS $$
BEGIN
NEW.id_new := NEW.id;
RETURN NEW;
END
$$ LANGUAGE plpgsql;
CREATE TRIGGER users_sync_id_new
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_id_new();
For foreign keys referring to the changed column, the same pattern applies on the dependent tables in lockstep — and this is where these migrations get genuinely involved.
Adding NOT NULL to a populated column
A common request: the phone column is nullable but should be required going forward. The naive version locks the table while Postgres scans every row to verify:
-- Locks for the duration of a full-table scan.
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
The zero-downtime version uses CHECK ... NOT VALID first:
-- 1. Add a check constraint without validating existing rows.
ALTER TABLE users ADD CONSTRAINT users_phone_not_null CHECK (phone IS NOT NULL) NOT VALID;
-- This locks the table briefly but does not scan it.
-- 2. Validate the constraint in the background. Takes a long time on big tables
-- but does not block writes.
ALTER TABLE users VALIDATE CONSTRAINT users_phone_not_null;
-- 3. Optionally, promote the check to NOT NULL (Postgres 12+ uses the validated check).
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
Backfill any null rows before step 2 — otherwise validation fails and the constraint stays invalid.
The lock cheat sheet
The locks each schema change takes (Postgres 14+, hot path only):
| Operation | Lock |
|---|---|
ADD COLUMN (nullable, no default) | AccessExclusive, brief |
ADD COLUMN ... DEFAULT non-volatile | AccessExclusive, brief (no rewrite) |
ADD COLUMN ... DEFAULT volatile | AccessExclusive, full rewrite — avoid |
DROP COLUMN | AccessExclusive, brief |
RENAME COLUMN | AccessExclusive, brief |
ALTER COLUMN TYPE (compatible cast) | AccessExclusive, may rewrite |
CREATE INDEX | blocks writes; use CONCURRENTLY |
CREATE INDEX CONCURRENTLY | does not block, slower |
ALTER TABLE SET NOT NULL | full table scan, blocks writes |
ADD CONSTRAINT ... NOT VALID | brief |
VALIDATE CONSTRAINT | reads only, no write block |
DROP CONSTRAINT | AccessExclusive, brief |
The two important takeaways:
- Use
CONCURRENTLYfor indexes. A non-concurrent index build on a hot table blocks writes for as long as it takes — easily hours. - Avoid
DEFAULTwith volatile expressions.DEFAULT now()on a new column requires a full rewrite. If you need it, set the default after adding the column and backfill.
A migration framework that supports the pattern
Most ORMs default to “one migration = one SQL operation” which makes expand-migrate-contract awkward. A few patterns:
- One migration file per step, with explicit ordering.
001_add_email_address_column.sql,002_backfill_email_address.sql, etc. Each runs independently and can be retried. - Online schema change tools like pg-online-schema-change or reshape build the expand-migrate-contract pattern into the migration tool.
- Plain SQL files committed alongside code. The discipline of “no migration without an expand step” lives in code review, not tooling.
For most teams, the third option (plain SQL + code review discipline) is enough. The tooling is helpful but not essential.
Pre-flight checks before any migration
A small checklist that prevents most incidents:
- Estimate the lock duration.
EXPLAINdoes not work for DDL, but\timing onand a staging-environment dry-run on a comparable-size table do. - Set
lock_timeoutin the migration session.SET lock_timeout = '5s';aborts the migration if it cannot get the lock — better than blocking the entire database. - Set
statement_timeoutfor the migration as a safety net. - Run during low-traffic windows for anything that takes >1 second.
- Have a rollback ready. Not “we’ll figure it out” — actually written down, copy-paste ready.
The takeaway
A schema change against a populated production table is a six-step deploy, not a one-line SQL change. Expand the schema, write to both shapes, backfill, switch reads, stop writing the old shape, drop it. Two deploys, always backwards-compatible, rollback by re-deploying the previous version.
The first time you do this it feels like ceremony. By the third one it is routine, and the alternative — “schedule a maintenance window for the rename” — feels like 2010.
A note from Yojji
The kind of operations discipline that turns a column rename from a maintenance window into a routine deploy — expand-migrate-contract, lock-budgeting, backfill jobs that respect production load — is the kind of long-haul backend work Yojji’s teams build into the products they ship.
Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and the kind of database operations work that decides whether a schema change is a non-event or a postmortem.