The Practical Developer

Postgres VACUUM Is Not Magic: How Your Hot Table Bloats To 80GB And How To Fix It

A 4 GB table somehow uses 80 GB on disk and queries are slow. Autovacuum is on, autovacuum is running, autovacuum is not actually freeing space — and the reason is one setting most teams have never heard of. Here is what bloat is, why long-running transactions kill VACUUM, and the four queries you need to run before reaching for `pg_repack`.

A close-up of circuitry — the right metaphor for the dense, hidden machinery of MVCC

The events table has 4 GB of live rows. The on-disk size is 80 GB. Queries that used to take 50 ms now take 4 seconds. Somebody runs VACUUM events; — no change. Somebody runs VACUUM FULL events; — table is locked for 20 minutes and shrinks to 4 GB. Two weeks later, it is back at 60 GB.

This is bloat, and it is the single most misunderstood operational issue in Postgres. Autovacuum is on by default, autovacuum is running according to the logs, and the table is bloating anyway. The reason is almost always the same: a long-running transaction is preventing VACUUM from freeing dead row versions, and dead rows are accumulating faster than VACUUM can keep up. This post is what bloat actually is, the queries that diagnose it, and the four interventions in order of cost.

What VACUUM does, briefly

Postgres uses MVCC: when you UPDATE or DELETE a row, the old version stays on disk, marked invisible to new transactions but still readable by older ones that need it. New row versions are written elsewhere on the page. Over time, dead row versions pile up. VACUUM scans the table, identifies dead row versions whose xmin (transaction ID) is older than the oldest active transaction, and marks their disk space as reusable.

Two things to internalize:

  1. VACUUM does not return disk space to the OS. It marks pages as having free space inside them; future inserts/updates can reuse that space. The on-disk file size shrinks only with VACUUM FULL (locks the table, rewrites it).
  2. VACUUM cannot remove a dead row whose xmin is greater than the oldest running transaction’s xmin. A long-running transaction holds the floor and freezes VACUUM’s effective horizon.

That second point is where 95% of bloat issues come from.

The query that shows the problem

SELECT
  pid,
  age(backend_xmin) AS xmin_age,
  state,
  query,
  now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

xmin_age is the number of transactions that have elapsed since the connection’s snapshot was taken. Anything over a few hundred thousand is concerning. Over a few million is the smoking gun.

The candidates for “long-running transaction blocking VACUUM”:

  • Idle transactions (state = 'idle in transaction'). The connection started a transaction, did one query, and is now sitting idle. Common cause: an app forgot to commit, or a debugger paused execution.
  • Long analytics queries. A 30-minute reporting query is fine in isolation, but it pins the VACUUM horizon for everyone.
  • Replication slots with stale confirmed_flush_lsn. A replication consumer that has stopped reading WAL holds back VACUUM via the xmin of its slot.
  • Prepared transactions (two-phase commit) that nobody finished.

Run the query above on any bloated database and 90% of the time you find the culprit on the first row.

The query that shows the bloat

SELECT
  schemaname || '.' || relname AS table,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  n_dead_tup,
  n_live_tup,
  round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
  last_autovacuum,
  last_vacuum
FROM pg_stat_all_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY n_dead_tup DESC
LIMIT 20;

dead_pct over 20% on a hot table is bloat. Over 50% is severe. The last_autovacuum and last_vacuum columns tell you when Postgres last cleaned the table — if it has been hours and the dead count is high, something is preventing autovacuum from working.

For a sharper bloat estimate (which accounts for index bloat too), the pgstattuple extension is more accurate but more expensive:

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('events');

This reads every page, so for a huge table it is slow — run it during off hours.

The four interventions, cheapest to most disruptive

1. Kill the long-running transaction. If pg_stat_activity shows an idle-in-transaction connection holding back VACUUM, terminate it:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity
 WHERE state = 'idle in transaction'
   AND xact_start < now() - interval '10 minutes';

Cost: zero, except killing whatever the app was in the middle of. Set idle_in_transaction_session_timeout = '5min' in postgresql.conf so future connections cannot pin VACUUM by accident. This single change prevents most bloat incidents.

2. Tune autovacuum for hot tables. Default thresholds are too lax for tables with millions of writes per day:

ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.05,    -- vacuum at 5% dead, default 20%
  autovacuum_vacuum_threshold = 1000,       -- minimum 1000 dead rows
  autovacuum_analyze_scale_factor = 0.02
);

The defaults assume small tables. A 100M-row table at 20% bloat is 20M dead rows — that is a lot. For hot tables, drop the scale factor.

Also worth raising autovacuum’s parallelism:

# postgresql.conf
autovacuum_max_workers = 6
autovacuum_vacuum_cost_limit = 2000  # less throttling
maintenance_work_mem = 1GB           # more memory per VACUUM

The cost-limit and work-mem changes let VACUUM run faster at the price of more I/O. On modern SSDs the trade is usually worth it.

3. VACUUM (VERBOSE, ANALYZE) on demand. When you have caught the table mid-bloat, manually run:

VACUUM (VERBOSE, ANALYZE) events;

This compacts dead row space without rewriting the table. On-disk size does not shrink, but future inserts reuse the space, so the table stops growing.

For specific row classes, VACUUM (FREEZE) early-freezes old rows so VACUUM has less work later — useful for append-mostly tables.

4. pg_repack to reclaim disk. When you actually need the disk space back without VACUUM FULL’s exclusive lock, pg_repack rewrites the table online:

pg_repack -h db.example.com -d appdb -t events

It builds a new copy concurrently, syncs writes via triggers, and swaps the new table in atomically. Takes a long time on huge tables but the production impact is small. The cost: temporarily 2× the disk space (during rebuild).

VACUUM FULL does the same job but holds an ACCESS EXCLUSIVE lock — no one can read or write the table during the operation. Use it only for tables small enough that ~1 minute of downtime is acceptable.

Index bloat, separately

VACUUM cleans tables. Indexes have their own bloat — when rows are updated, the index entries pointing to old row versions become dead and need cleaning. VACUUM does some of this; on heavily-updated indexes, it does not keep up.

-- Detect index bloat
SELECT
  schemaname || '.' || tablename AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 1000000000  -- > 1GB
ORDER BY pg_relation_size(indexrelid) DESC;

Fix:

REINDEX INDEX CONCURRENTLY events_user_id_idx;

CONCURRENTLY rebuilds without taking a write lock. Available since Postgres 12.

Replication slots: the silent bloat cause

A logical replication slot whose consumer has fallen behind keeps xmin pinned. Check for slots:

SELECT slot_name, active, age(xmin) AS xmin_age,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;

If a slot has active = false and xmin_age in the millions, the consumer has been gone for a long time. Either reconnect the consumer or drop the slot:

SELECT pg_drop_replication_slot('orphaned_slot');

Dropping a slot that another system thinks it owns will mean re-bootstrapping that consumer — make sure you know what you are dropping.

Monitoring you should set up

Three alerts catch bloat issues before they become outages:

-- 1. Alert if any backend xmin is older than 1M transactions.
SELECT max(age(backend_xmin)) FROM pg_stat_activity WHERE backend_xmin IS NOT NULL;
-- threshold: > 1000000

-- 2. Alert on dead-tuple ratio per hot table.
SELECT relname, n_dead_tup::float / NULLIF(n_live_tup, 0) AS ratio
FROM pg_stat_user_tables
WHERE n_live_tup > 100000;
-- threshold: > 0.3

-- 3. Alert on table growth rate beyond expected.
-- (Compare pg_relation_size now vs 24h ago for the same table.)

Pipe them through Prometheus / DataDog / your dashboard. The alert cost is about 10 lines of config per check.

What idle_in_transaction_session_timeout is worth

If you set one parameter as a result of reading this post, set this:

idle_in_transaction_session_timeout = '5min'

In postgresql.conf. Reload. This kills any transaction that has been idle for 5 minutes. The downside is application bugs may now manifest as terminated transactions; the upside is you cannot accidentally pin VACUUM for hours because somebody hit a debugger breakpoint.

Pair with statement_timeout (kills statements running longer than the threshold) and lock_timeout (kills queries waiting too long for a lock). Three timeouts; one-line each; eliminate a class of incidents.

The takeaway

Postgres bloat is not “VACUUM is broken.” It is “VACUUM is being prevented from working by something else.” The diagnosis is two pg_stat_activity and pg_stat_all_tables queries. The four fixes — kill the long transaction, tune autovacuum thresholds, manual VACUUM, pg_repack — are progressive in cost. idle_in_transaction_session_timeout is the single setting that prevents 80% of the cases.

Most “the database is slow” incidents that turn out to be bloat-related are one query and one config change away from being closed. The next time you see the on-disk table size diverge from the live row count, run the queries above before you reach for VACUUM FULL.


A note from Yojji

The kind of database operations work that keeps a Postgres instance healthy at year five — autovacuum tuning, slot monitoring, the boring details that decide whether disk usage stays linear or explodes — is the kind of long-haul backend engineering Yojji’s teams have shipped across hundreds of client projects.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and the kind of database and infrastructure work that decides whether a product feels good a year later or grinds slowly to a halt.