The Practical Developer

Postgres DISTINCT ON: The Fastest Way to Get the Latest Row Per Group

Your "latest status per device" query takes 800 ms with window functions and self-joins. Postgres DISTINCT ON solves it in 8 ms with a single index scan. Here is the syntax, the index strategy, and the gotchas that make the difference between a fast dashboard and a slow one.


The device status dashboard was the first screen every operator opened in the morning. It showed the latest heartbeat, temperature, and error count for 12,000 IoT devices in a single table. At 9:00 AM it loaded in 820 milliseconds. By 2:00 PM, with more historical data in the events table, it was pushing 1.4 seconds. The query plan showed a WindowAgg over 4.2 million rows, followed by a Filter to keep only row_number = 1. The index was being used. The planner was not stupid. But the query was doing work that did not need to happen.

The team had written what every blog post recommends: a window function with PARTITION BY device_id ORDER BY created_at DESC. It is portable SQL. It is easy to understand. And for this specific pattern (one row per group, sorted by time), it is the second-slowest correct solution you can write in Postgres. The slowest is the correlated subquery.

Postgres has a specialized primitive for exactly this problem. It is called DISTINCT ON. It is not standard SQL. It will not port to MySQL or SQL Server. But if you run Postgres in production and you need the latest row per group, it is the tool you should reach for first. This post shows why it wins, how to index for it, and the three gotchas that will bite you if you skip the details.

The query everyone writes first

The events table looks like this:

CREATE TABLE events (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  device_id uuid NOT NULL,
  temperature numeric(5,2),
  status text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

The requirement: show the most recent event for every device. The window-function approach:

SELECT id, device_id, temperature, status, created_at
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY created_at DESC) AS rn
  FROM events
) sub
WHERE rn = 1;

This query scans the table, sorts every partition by created_at DESC, assigns a row number, then filters out every row except the first per device. If you have 12,000 devices and 4.2 million events, the database sorts 4.2 million rows to return 12,000. The sort is the killer. Even with an index on (device_id, created_at), the planner may choose a Seq Scan plus Sort because random index lookups for 4.2 million rows can be slower than reading the table sequentially and sorting in memory.

The correlated subquery is worse:

SELECT e1.*
FROM events e1
WHERE e1.created_at = (
  SELECT MAX(e2.created_at)
  FROM events e2
  WHERE e2.device_id = e1.device_id
);

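Conceptually this runs the subquery once per outer row: roughly four million MAX lookups, each an index probe at best. It also returns more than one row for a device whenever two events tie on created_at. In practice, with four million rows, it is a disaster. Do not use this pattern.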

DISTINCT ON: one scan, one row per group

Postgres extends the SQL standard with DISTINCT ON (expression). It keeps the first row for each distinct value of the expression, discarding the rest. The ORDER BY clause controls which row is “first.”

SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
ORDER BY device_id, created_at DESC;

That is the entire query. No subquery. No window function. No filter predicate.

The semantics are simple: the database scans rows in ORDER BY order. Every time it sees a new device_id, it emits that row and skips every subsequent row with the same device_id. Because the ordering is device_id first and created_at DESC second, the first row per device is the most recent one.
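To make the semantics concrete, here is a minimal sketch on four made-up rows. The device names, statuses, and timestamps are hypothetical and only stand in for the events table:

```sql
-- Hypothetical sample data: two devices, two events each.
WITH sample(device_id, status, created_at) AS (
  VALUES
    ('dev-a', 'ok',    timestamptz '2026-01-01 09:00+00'),
    ('dev-a', 'error', timestamptz '2026-01-01 10:00+00'),
    ('dev-b', 'ok',    timestamptz '2026-01-01 08:30+00'),
    ('dev-b', 'warn',  timestamptz '2026-01-01 08:45+00')
)
SELECT DISTINCT ON (device_id)
  device_id, status, created_at
FROM sample
ORDER BY device_id, created_at DESC;
-- One row per device: dev-a with its 10:00 'error' event,
-- dev-b with its 08:45 'warn' event.
```

Flipping created_at DESC to ASC would return the oldest event per device instead, which is a quick way to confirm that ORDER BY alone decides which row counts as "first."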

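The key performance insight is that this query can be satisfied with an index scan that never sorts. If you have an index on (device_id, created_at DESC), the index is already ordered exactly the way the query needs. Postgres walks the index in order, keeps the first row it meets for each device_id, and discards the rest as they stream past. It does not allocate memory for a sort, and it does not number every row only to filter most of them out afterward. One caveat: the scan still passes over the index entries for all 4.2 million events, because the planner does not skip from one device_id to the next for DISTINCT ON. How much of the gap between 820 ms and 8 ms this single rewrite closes therefore depends on how much of the original cost was sorting, so measure it with EXPLAIN (ANALYZE, BUFFERS) on your own data.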

The index that makes it work

Without the right index, DISTINCT ON is not automatically fast. The planner must still sort the data to group by device_id in the order the query specifies.

CREATE INDEX idx_events_device_created
ON events (device_id, created_at DESC);

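Note the DESC. If your ordering is device_id, created_at DESC but your index is (device_id, created_at ASC), scanning the index backward does not help: a backward scan yields device_id DESC, created_at DESC, which is not the order the query asks for. The planner then has to sort, possibly with an Incremental Sort that reuses the device_id ordering on Postgres 13 and later. Match the index direction to the query direction.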

Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) on both versions. The window-function plan looks like this:

WindowAgg  (cost=... rows=4200000)
  ->  Index Scan using idx_events_device_created on events
        (cost=... rows=4200000)

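Even with the index, the database reads every row. The index delivers rows already in order, so this plan has no separate Sort node, but the WindowAgg node still numbers each of the 4.2 million rows before the rn = 1 filter discards all but 12,000 of them.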

The DISTINCT ON plan with the same index looks like this:

Unique  (cost=... rows=12000)
  ->  Index Scan using idx_events_device_created on events
        (cost=... rows=4200000)

The Unique node is a streaming deduplicator. It reads the index in order, remembers the last-seen device_id, and emits a row only when the value changes. It emits 12,000 rows, but the Index Scan beneath it still feeds it every entry. The scan ends early only if the query has a LIMIT, which DISTINCT ON composes with naturally: once enough distinct devices have been emitted, the scan stops.

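Check the Buffers: line in EXPLAIN (ANALYZE, BUFFERS). When both queries run over the same index scan, their shared buffer counts will be close, because both read the same index and heap pages. The gap opens when the window-function version falls back to a Seq Scan plus Sort: look for a Sort node, its Sort Method line, and temp read/written counts, which indicate a sort that spilled to disk. Those are the costs that DISTINCT ON with a matching index removes.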
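A minimal sketch of that comparison, assuming the events table and the idx_events_device_created index from above. Run both statements against the same data and read the plans side by side:

```sql
-- Window-function version: expect WindowAgg, plus a Sort node
-- if the planner does not read the rows in index order.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, device_id, temperature, status, created_at
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY created_at DESC) AS rn
  FROM events
) sub
WHERE rn = 1;

-- DISTINCT ON version: expect Unique directly above an Index Scan,
-- with no Sort node in between.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
ORDER BY device_id, created_at DESC;
```

ANALYZE executes the query for real, so on a large table it is kinder to run this on a replica or a staging copy than on a busy primary.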

Partial indexes for soft deletes and stale data

Most production tables have rows you do not care about: soft-deleted records, events older than a retention window, or devices that have been decommissioned. If your dashboard only shows active devices, indexing every row is wasted space and slower maintenance.

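A partial index targets exactly the rows the query needs. This example assumes the events table has a nullable deleted_at column for soft deletes, which the schema above does not show: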

CREATE INDEX idx_events_active_device_created
ON events (device_id, created_at DESC)
WHERE deleted_at IS NULL;

And the query should include the same predicate:

SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
WHERE deleted_at IS NULL
ORDER BY device_id, created_at DESC;

The partial index is smaller, faster to scan, and faster to maintain during inserts and updates. If your dashboard query is the only thing that needs this specific ordering, a partial index is often the right call.

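For events with a retention window, add a time-based predicate. The predicate of a partial index may only use immutable expressions, so it cannot call now(); use a fixed cutoff and rebuild the index periodically as the window moves. The cutoff below is only an example value:

CREATE INDEX idx_events_recent_device_created
ON events (device_id, created_at DESC)
WHERE created_at > '2026-06-01 00:00:00+00';

The planner uses a partial index only when it can prove that the query's WHERE clause implies the index predicate. A query with a literal cutoff at or after the index's cutoff, such as WHERE created_at > '2026-06-15 00:00:00+00', qualifies. A query that says WHERE created_at > now() - interval '30 days' generally does not, because the planner cannot evaluate now() when it checks the predicate, so compute the cutoff in the application and pass it as a literal. If the query omits the predicate, the planner will fall back to the broader index or a sequential scan.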

The three gotchas

DISTINCT ON is powerful, but it has sharp edges. Most production bugs with this syntax come from one of three mistakes.

Gotcha 1: ORDER BY must start with the DISTINCT ON expression.

This is not optional. The ORDER BY clause must lead with the same expression (or expressions, in the same order) as DISTINCT ON, followed by the sorting you want within each group.

-- CORRECT
SELECT DISTINCT ON (device_id) *
FROM events
ORDER BY device_id, created_at DESC;

-- WRONG: will error
SELECT DISTINCT ON (device_id) *
FROM events
ORDER BY created_at DESC;

The error is clear: SELECT DISTINCT ON expressions must match initial ORDER BY expressions. But if you are generating SQL dynamically from an ORM or a query builder, it is easy to produce invalid output. Always assert that your generator prepends the distinct expression to the order clause.
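A related case that generators often trip over: the dashboard wants the final rows sorted by created_at, which conflicts with the required leading device_id. A minimal sketch of one way around it is to wrap the DISTINCT ON query in a subquery and sort the outer result (the alias latest is arbitrary):

```sql
-- Inner query: one row per device, using the ordering DISTINCT ON requires.
-- Outer query: re-sort those rows for display, newest first.
SELECT id, device_id, temperature, status, created_at
FROM (
  SELECT DISTINCT ON (device_id)
    id, device_id, temperature, status, created_at
  FROM events
  ORDER BY device_id, created_at DESC
) latest
ORDER BY created_at DESC;
```

The outer sort only touches the rows that survived deduplication (one per device), so it is cheap compared with sorting the full events table.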

Gotcha 2: NULLs group together, and NULL ordering matters.

If device_id can be NULL (it should not be, but schemas drift), all NULL values form a single group. DISTINCT ON returns one row for the entire NULL group, and which row that is depends on the second-level sort order.

NULLS LAST is already the default for an ascending sort, but if you depend on where the NULL group lands, spell it out:

SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
ORDER BY device_id NULLS LAST, created_at DESC;

Gotcha 3: DISTINCT ON gives you one row per group, not N rows per group.

This is the most common confusion. If your dashboard needs the latest three events per device, DISTINCT ON cannot help you. It is strictly a “top 1 per group” tool. For “top N per group,” use a LATERAL join:

SELECT e.*
FROM devices d
CROSS JOIN LATERAL (
  SELECT *
  FROM events
  WHERE events.device_id = d.id
  ORDER BY created_at DESC
  LIMIT 3
) e;

Or use ROW_NUMBER() with a filter such as rn <= 3, switching to RANK() or DENSE_RANK() if tied rows should share a position. Do not try to hack DISTINCT ON into an N-per-group query with arrays or string aggregation. It will be slower and less readable than the right tool.
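The same LATERAL shape also handles the top-1 case by setting LIMIT 1. This sketch assumes, like the query above, a devices table whose id column matches events.device_id; that table is not defined in this post. With an index on (device_id, created_at DESC), each device costs one short index probe, so it is worth comparing against DISTINCT ON when every device has many events:

```sql
-- One index probe per device: fetch only the newest event for each.
SELECT e.id, e.device_id, e.temperature, e.status, e.created_at
FROM devices d
CROSS JOIN LATERAL (
  SELECT id, device_id, temperature, status, created_at
  FROM events
  WHERE events.device_id = d.id
  ORDER BY created_at DESC
  LIMIT 1
) e;
```

Devices with no events drop out of this result. Using LEFT JOIN LATERAL (...) e ON true instead keeps them, with NULLs in the event columns.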

When DISTINCT ON loses

There are cases where a window function is the better choice, even in Postgres.

Ties need explicit handling. If two events for the same device have exactly the same created_at, DISTINCT ON picks one arbitrarily (or based on the next column in the ORDER BY, if you add one). A window function with RANK() or DENSE_RANK() makes tie behavior explicit.
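If a deterministic pick is all you need, a tie-breaker column in the ORDER BY may be enough. A minimal sketch, assuming the identity column id is an acceptable tie-breaker (higher id wins) and using a hypothetical index name:

```sql
-- Ties on created_at are resolved by id, so the result is repeatable.
SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
ORDER BY device_id, created_at DESC, id DESC;

-- Optional: an index whose order matches the full ORDER BY,
-- so the planner does not need an extra sort for the tie-breaker.
CREATE INDEX idx_events_device_created_id
ON events (device_id, created_at DESC, id DESC);
```

Without the wider index, the planner may add a sort step to honor the id DESC part of the ordering, so check the plan after adding the tie-breaker.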

You need the rank value. If the dashboard shows “this is the #1 event and here is its rank among all events,” you need the rank. DISTINCT ON does not compute it.

Portability is a hard requirement. If you maintain the same schema on Postgres and another database, DISTINCT ON will not port. But in practice, most teams do not switch databases. They optimize for the database they run.

Complex grouping expressions. DISTINCT ON works best when the grouping key is a single column or a small set of columns that index well. If your “group” is a computed expression like DATE_TRUNC('hour', created_at), an index may not help, and the planner may choose a sort anyway. In that case, a window function may be no slower and more standard.

The migration: replacing window functions in production

If you already have the window-function query in production, migrating to DISTINCT ON is usually safe. The result set is identical when the query is semantically correct (one row per group, deterministic ordering).

Steps:

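  1. Create the composite index in a migration. Use CONCURRENTLY to avoid blocking writes while it builds. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so disable your migration tool's transaction wrapper for this step: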

    CREATE INDEX CONCURRENTLY idx_events_device_created
    ON events (device_id, created_at DESC);

  2. Run EXPLAIN (ANALYZE, BUFFERS) on both the old and new queries in a read replica or a staging environment with production-like data size. Verify that the Sort node is gone from the new plan, and compare execution time and buffer counts.

  3. Replace the query. If you use an ORM (Prisma, Drizzle, TypeORM), you may need to drop down to raw SQL. DISTINCT ON is not universally supported in query builders. Prisma does not support it natively as of mid-2026. Drizzle exposes a selectDistinctOn method on its query builder. Raw SQL is always an option:

    const latestEvents = await prisma.$queryRaw`
      SELECT DISTINCT ON (device_id) *
      FROM events
      ORDER BY device_id, created_at DESC
    `;

  4. Monitor query duration in your observability tool. The p95 should drop immediately. If it does not, check whether the planner chose a different path. Run ANALYZE events if statistics are stale.
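Before step 3, it can help to confirm that the old and new queries pick the same rows. A minimal sketch using EXCEPT; the id DESC tie-breaker is an addition that is not in the original queries, included so that ties on created_at cannot produce false differences:

```sql
-- Rows picked by the window-function query but not by DISTINCT ON.
-- An empty result means the two queries agree in this direction.
(
  SELECT id, device_id, created_at
  FROM (
    SELECT id, device_id, created_at,
      ROW_NUMBER() OVER (
        PARTITION BY device_id
        ORDER BY created_at DESC, id DESC
      ) AS rn
    FROM events
  ) sub
  WHERE rn = 1
)
EXCEPT
(
  SELECT DISTINCT ON (device_id) id, device_id, created_at
  FROM events
  ORDER BY device_id, created_at DESC, id DESC
);
```

Swap the two halves to check the other direction. On a table that is still receiving writes, run both halves in one statement as shown, so that they see the same snapshot.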

The takeaway

DISTINCT ON is not a secret. It is right there in the Postgres documentation. But most developers learn SQL from tutorials that target the lowest common denominator across databases, so they reach for ROW_NUMBER() for every “top per group” problem. That is overkill. It sorts rows you do not need and allocates memory for partitions you will throw away.

If you run Postgres and your requirement is “the latest row per group,” DISTINCT ON with the right index is the fastest, simplest, and most resource-efficient solution. Create the composite index. Write the shorter query. Watch your dashboard drop from 800 ms to 8 ms. Then spend the time you saved on a harder problem.

