PostgreSQL Production Health Checks: The Queries That Save Your Weekend

The alert fired at 02:14 on a Saturday. Postgres CPU at 98%. Active connections maxed out. Every API call that touched the database was timing out. You SSH into the box, run psql, and stare at the prompt. What do you query first? Most engineers open pg_stat_activity, see 200 idle-in-transaction connections, and have no idea what to do next.

This is the moment where a small library of repeatable, memorized health-check queries separates a 10-minute fix from a 90-minute scrape through docs, blog posts, and the pg_catalog docs page. The eight queries below cover every common production Postgres emergency. They are the ones I have run on three separate on-call rotations, at a startup that managed 50 GB databases and at a company that managed 50 TB databases. The thresholds are calibrated to real incidents.

Save these somewhere searchable. Put them in your runbook. Put them in a healthchecks.sql file in your repo. When the pager goes off, you will not need to think.

1. What is running right now?

Every incident starts here. You need to know what queries are active, how long they have been running, and whether they are doing anything useful. The stock pg_stat_activity view is too noisy. Filter it down.

SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  state,
  wait_event_type,
  wait_event,
  left(query, 120) AS query_snippet,
  usename,
  application_name
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid != pg_backend_pid()
ORDER BY query_start ASC;

The output tells you three things immediately. Are there queries that have been running for minutes instead of milliseconds (that is your first suspect). Are most connections in active state or idle in transaction (idle-in-transaction connections hold locks and block vacuum). Is there a wait_event_type of Lock or IO (that tells you whether the bottleneck is contention or storage).

The left(query, 120) truncation keeps the output readable. If you need the full query, query query directly for a specific PID.

Threshold: Any query running longer than 30 seconds in active state during normal traffic is a problem. Longer than 5 minutes during any traffic is an incident.

2. Who is blocking whom?

Long-running queries do not exist in isolation. They block other queries. Postgres exposes lock contention through pg_locks and pg_stat_activity. This query maps every blocked PID to the PID that holds the lock.

SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocked.query_start AS blocked_start,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  blocking.query_start AS blocking_start,
  blocked.wait_event,
  blocked.wait_event_type
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON (
  SELECT pg_blocking_pids(blocked.pid)
) @> ARRAY[blocking.pid]
WHERE blocked.wait_event_type = 'Lock';

This is the single most useful query during a connection spike. It will show you exactly which query is the root cause. The blocking query is often an uncommitted UPDATE or DELETE that the application left hanging because the client disconnected without committing or rolling back.

Threshold: If you see any blocked query with a wait time over 5 seconds during normal traffic, investigate. If the blocking query has been running longer than 5 minutes, terminate it.

-- If you need to terminate the blocker (use with extreme care)
SELECT pg_terminate_backend(<blocking_pid>);

Only terminate if you understand what the blocking query was doing. Terminating a long-running CREATE INDEX or ALTER TABLE can leave the table in an unusable state.

3. Are there long-running idle-in-transaction connections?

Idle-in-transaction connections are the silent killers of Postgres performance. They hold locks, prevent VACUUM from cleaning up dead tuples, and cause connection pool exhaustion. Applications frequently leak them when a transaction is opened but never committed or rolled back after an error.

SELECT
  pid,
  usename,
  application_name,
  state,
  now() - xact_start AS transaction_duration,
  now() - state_change AS idle_duration,
  left(query, 120) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;

These connections are holding whatever locks they acquired during the transaction. The last query column tells you what they were doing before they went idle. If you see a simple SELECT that has been open for 30 minutes, the connection pool in your application is not properly releasing connections after read-only queries.

Threshold: Any idle-in-transaction connection older than 30 seconds during normal traffic is a leak. Older than 5 minutes is a production incident. Terminate aggressively:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '5 minutes';

Put this in a cron job that runs every minute during business hours. It is not a fix for the root cause (fix the connection leak in your ORM or pooler), but it is a tourniquet that keeps the database alive while you debug.

4. Is the cache hit ratio healthy?

Postgres relies on shared buffers to cache frequently accessed data. When the cache hit ratio drops, every query pays the price of reading from disk. This query shows you the cache efficiency since the last restart.

SELECT
  'buffer_hit' AS stat,
  round(blks_hit::numeric / (blks_hit + blks_read) * 100, 2) AS ratio
FROM pg_stat_bgwriter
UNION ALL
SELECT
  'index_hit',
  round(sum(idx_blks_hit)::numeric / (sum(idx_blks_hit) + sum(idx_blks_read)) * 100, 2)
FROM pg_statio_user_indexes
UNION ALL
SELECT
  'table_hit',
  round(sum(heap_blks_hit)::numeric / (sum(heap_blks_hit) + sum(heap_blks_read)) * 100, 2)
FROM pg_statio_user_tables;

The buffer hit ratio should be above 99% for most OLTP workloads. Index hit ratio should be above 95%. Table hit ratio (sequential scans that hit shared buffers) is less reliable as a metric because it depends on access patterns.

Threshold: Buffer hit ratio below 95% means your shared_buffers setting is too small for your working set, or you have queries doing sequential scans of tables that do not fit in memory. Index hit ratio below 90% means indexes are being evicted faster than they are being used, or you have too many unused indexes bloating the cache.

If the ratio is low but the server has free RAM, increase shared_buffers. The formula is situational, but a safe starting point is 25% of available RAM for a dedicated Postgres server.

5. Which queries are doing sequential scans?

Sequential scans are not inherently bad. Sequential scans are bad when they hit large tables that should be accessed through an index. Postgres logs the decision in pg_stat_user_tables.

SELECT
  schemaname,
  relname,
  seq_scan,
  seq_tup_read,
  idx_scan,
  seq_tup_read / nullif(seq_scan, 0) AS avg_tuples_per_seq,
  n_live_tup AS estimated_row_count
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND (seq_tup_read / nullif(seq_scan, 0)) > 1000
ORDER BY seq_tup_read DESC
LIMIT 20;

This filters to tables that have had more than 100 sequential scans and where each sequential scan reads more than 1,000 tuples on average. Those are the tables that are screaming for a missing index.

Cross-reference this with pg_stat_user_indexes to see whether existing indexes exist but are not being used:

SELECT
  schemaname,
  relname,
  indexrelname,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY relname;

An index with zero scans is dead weight. It consumes disk space, slows down writes, and bloats the shared buffer cache. Drop it.

Threshold: Any table over 100,000 rows with more sequential scans than index scans is almost certainly missing an index on the filter column used by your most common queries.

6. Are there table bloat or vacuum problems?

Postgres does not reclaim disk space from updated or deleted rows automatically. That is VACUUM’s job. When vacuum falls behind, table bloat grows, query performance degrades, and disk space fills up. This query estimates bloat for the 20 largest tables.

SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
  last_autovacuum,
  last_autoanalyze,
  n_mod_since_analyze,
  round(pg_relation_size(relid) / 1024 / 1024.0, 1) AS table_size_mb
FROM pg_stat_user_tables
WHERE n_live_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;

The dead_pct column shows the proportion of dead tuples relative to live tuples. Anything above 20% is a problem. Above 50% means autovacuum is not keeping up, and you need to tune autovacuum_vacuum_scale_factor or autovacuum_vacuum_threshold for that table.

For a deeper bloat analysis, use the pgstattuple extension:

-- Requires: CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('your_table_name');

The dead_tuple_percent column gives you the exact bloat percentage. The free_percent column shows internal fragmentation.

Threshold: Dead tuple ratio above 20% needs investigation. Above 50% is an active incident. If last_autovacuum is NULL on a table that has been receiving writes, autovacuum may be disabled globally or the table may be too large for the default scale factor.

7. Is replication lag growing?

If you use read replicas, replication lag is the metric that silently causes stale data reads, incorrect reports, and confused customers. Postgres exposes this in pg_stat_replication.

SELECT
  application_name,
  client_addr,
  state,
  sync_state,
  pg_wal_lsn_diff(
    pg_current_wal_lsn(),
    replay_lsn
  ) / 1024 / 1024 AS lag_mb,
  now() - replay_lag_check_time() AS lag_time,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

Note: pg_stat_replication shows data from the perspective of the primary. Run this on the primary. The lag_mb column shows how far behind the replica is in megabytes of WAL. The flush_lag and replay_lag columns (available in Postgres 10+) show time-based lag.

If the replica is streaming but not applying, check pg_stat_wal_receiver on the replica to see if WAL apply is stuck:

-- Run on the replica
SELECT
  status,
  receive_start_lsn,
  received_lsn,
  latest_end_lsn,
  slot_name,
  conninfo
FROM pg_stat_wal_receiver;

A stuck received_lsn that does not advance over time means the replica is connected but not making progress. This usually means the replica is stuck applying a long-running transaction or its disk is full.

Threshold: Lag above 100 MB during normal traffic needs investigation. Above 1 GB is an incident. For time-based lag, anything above 30 seconds for a replica that serves user-facing traffic is unacceptable.

8. Is the connection pool saturated?

Connection exhaustion looks like a database performance problem but is often an application connection pool problem. This query shows you the distribution of connections by state and source.

SELECT
  usename,
  application_name,
  state,
  count(*) AS connection_count,
  round(count(*)::numeric / (SELECT count(*) FROM pg_stat_activity) * 100, 1) AS pct_of_total
FROM pg_stat_activity
GROUP BY usename, application_name, state
ORDER BY connection_count DESC;

This tells you exactly which application (the application_name column, if set) is consuming connections and in what state. If you see one application with 80 connections in idle state and 20 in active, that application is overprovisioned in its pool and may need its max setting reduced.

Combine this with information about the max connections:

SHOW max_connections;

SELECT count(*) AS total_connections FROM pg_stat_activity;
SELECT count(*) AS active_connections FROM pg_stat_activity WHERE state = 'active';

Threshold: If total connections are above 80% of max_connections, you are one traffic spike away from FATAL: remaining connection slots are reserved. In PgBouncer-managed setups, the limit is at the pooler level, not the Postgres level, but the symptom is the same.

Putting it all together: the emergency runbook

When the pager goes off at 2 a.m., do not run these queries in order. Run them in this order:

Query 1 (what is running right now) to see if there is an obvious runaway query.
Query 2 (who is blocking whom) to see if queries are stacked up behind a blocker.
Query 3 (idle-in-transaction) to see if leaked transactions are holding locks.

If the system is in crisis (connections exhausted, CPU pegged), skip to the intervention:

-- Emergency: terminate idle-in-transaction connections older than 30 seconds
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '30 seconds';

-- Emergency: terminate any query running longer than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
  AND pid != pg_backend_pid()
  AND usename != 'postgres';  -- protect superuser sessions

This is a blunt instrument. It will cause application errors as transactions are aborted. But it is better than a full database crash.

Once the system is stable, use queries 4 through 8 for root cause analysis. The cache hit ratio tells you if your shared_buffers are undersized. The sequential scan query tells you if missing indexes caused the spike. The vacuum stats tell you if bloat silently built up over weeks. The replication query tells you if lag is building. The pool query tells you if an application change suddenly increased connection demand.

Automating the health checks

Do not wait for the pager. Run these queries on a schedule and alert on the thresholds.

A simple approach: wrap each threshold check into a shell script that exits non-zero when a threshold is breached, then feed that into your monitoring system. Here is a pattern for the idle-in-transaction check:

#!/bin/bash
# healthcheck_idle_tx.sh - alert if any idle-in-transaction older than 30s
DB_URL="${1:-postgres://localhost:5432/mydb}"

IDLE_COUNT=$(psql "$DB_URL" -t -A -c "
  SELECT count(*) FROM pg_stat_activity
  WHERE state = 'idle in transaction'
    AND now() - xact_start > interval '30 seconds';
")

if [ "$IDLE_COUNT" -gt 0 ]; then
  echo "WARNING: $IDLE_COUNT idle-in-transaction connections older than 30s"
  exit 1
fi

echo "OK: no stale idle-in-transaction connections"
exit 0

Wire this into your Prometheus node_exporter’s textfile collector directory, or run it as a Nagios/Icinga check, or schedule it in a Kubernetes cron job that exposes metrics. The important thing is that you have an automated check before the incident happens, not during it.

The one query you should memorize

If you remember only one query from this post, make it this one. It combines active queries, their duration, and blocking information into a single snapshot that tells you almost everything you need to know in the first 30 seconds of an incident.

SELECT
  pid,
  now() - query_start AS duration,
  state,
  wait_event_type,
  wait_event,
  (SELECT count(*) FROM unnest(pg_blocking_pids(pid))) AS blocked_by_count,
  left(query, 90) AS query
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid != pg_backend_pid()
ORDER BY query_start;

Add the blocked_by_count subquery and you get instant visibility into lock contention. Zero blockers and long duration means a slow query. Multiple blockers and any duration means a lock chain.

Type this into a ~/.psqlrc alias:

\set active 'SELECT pid, now() - query_start AS duration, state, wait_event_type, wait_event, (SELECT count(*) FROM unnest(pg_blocking_pids(pid))) AS blocked_by_count, left(query, 90) FROM pg_stat_activity WHERE state != ''idle'' AND pid != pg_backend_pid() ORDER BY query_start;'

Then type :active in any psql session to get a snapshot. Muscle memory for this single query will save you more time than any monitoring dashboard.

A note from Yojji

Building a production database monitoring strategy that catches problems before they escalate into outages requires deep familiarity with Postgres internals, alerting thresholds that actually correspond to real risks, and automation that bridges the gap between raw system views and actionable signals. It is the kind of operational expertise that Yojji’s senior engineering teams bring to every backend they ship, from connection management and query optimization to full observability pipelines. Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their teams specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and full-cycle product engineering that includes the database reliability patterns this post covers.