The Practical Developer

Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page

The daily report cron ran twice last Tuesday, missed Wednesday entirely, and silently failed on Thursday until a customer complained. Here is the small Postgres-backed pattern that makes scheduled tasks observable, overlap-safe, and idempotent. With working TypeScript.


The daily report job is supposed to run at 06:00. It emails a summary of yesterday’s transactions to the finance team. Last Tuesday it sent two copies, 40 minutes apart. On Wednesday it sent nothing at all. The ops channel got a ping at 08:15 from someone asking if the system was broken. It was not broken. The cron daemon was running. On Tuesday the Node process had started the job and hung on a slow database query, and a second run fired while the first was still stuck. With no overlap guard, the second process started in parallel, and both jobs read the same unmarked rows, computed the same totals, and sent the same emails. The missed Wednesday run? A deploy restarted the container at 05:59 and the cron scheduler lost the tick.

Every team that runs cron jobs in production has a version of this story. The fix is not a bigger cron scheduler. It is a tiny set of rules: prevent overlap, detect missed runs, make the work idempotent, and record what happened. This post shows the Postgres-backed pattern that implements all four in about 80 lines of TypeScript. It works with node-cron, bree, node-schedule, or a simple setInterval. The scheduler is interchangeable. The state table is what matters.

What a cron job actually needs

A production cron job is not a shell script that emails output to root. It is a distributed task with four failure modes:

  1. Overlap. The previous run is still going when the next tick fires. Without a lock, both runs execute concurrently and corrupt or duplicate work.
  2. Missed execution. The server is down, deploying, or paused during the scheduled window. The tick is lost forever unless something notices.
  3. Silent failure. The job throws, logs to stdout, and nobody reads the log until a downstream human complains.
  4. Non-idempotent side effects. Even with a perfect lock, if the process crashes after the work is done but before the lock is released, a retry or recovery run may repeat the side effects.

The standard solutions people reach for (Redis locks, external job queues, Kubernetes CronJobs with suspend logic) are fine, but most teams already have Postgres. A small state table in the same database gives you overlap detection, execution history, and failure recovery without adding new infrastructure.

The state table

Create one table. It is the source of truth for every scheduled task.

CREATE TABLE cron_jobs (
  name TEXT PRIMARY KEY,
  schedule_interval INTERVAL NOT NULL,
  last_run_at TIMESTAMPTZ,
  last_duration_ms INT,
  last_status TEXT CHECK (last_status IN ('started', 'success', 'failure')),
  locked_until TIMESTAMPTZ,
  next_expected_at TIMESTAMPTZ,
  failure_count INT NOT NULL DEFAULT 0
);

INSERT INTO cron_jobs (name, schedule_interval)
VALUES ('daily-finance-report', '1 day');

The columns:

  • name: the job identifier.
  • schedule_interval: how often it should run, as a Postgres interval. 1 day, 1 hour, 15 minutes.
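  • last_run_at: when the last execution started.
  • last_duration_ms: how long the last execution took, in milliseconds.
  • last_status: started, success, or failure. NULL until the job has run once.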
  • locked_until: a coarse time boundary during which the job is considered “in flight.” If a process dies, the lock expires and another runner can pick it up.
  • next_expected_at: the scheduled time of the next run. Used to detect missed executions.
  • failure_count: consecutive failures, so you can alert before the human complaint arrives.

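Overlap prevention with a row lock and a lease

The guard combines two mechanisms. A SELECT ... FOR UPDATE on the job’s row serializes runners for the few milliseconds it takes to check and claim the job. The locked_until column then acts as a time-bound lease for the duration of the work. Postgres advisory locks are a reasonable alternative, but a session-level advisory lock is tied to a single connection, which sits awkwardly with a connection pool, and it leaves no visible trace in the table. The lease costs one column and expires on its own after a crash.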

The worker loop looks like this:

import pg from 'pg';
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// How late a start may be before it is logged as a missed run. Five minutes
// is an assumed default; tune it to your schedule.
const MISSED_GRACE_MS = 5 * 60_000;

export async function runCronJob(
  name: string,
  work: () => Promise<void>,
  maxDurationMs: number = 300_000,
) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // 1. Lock the job's row so concurrent runners serialize on it.
    const { rows } = await client.query(
      `SELECT locked_until, last_run_at, next_expected_at, last_status
       FROM cron_jobs
       WHERE name = $1
       FOR UPDATE`,
      [name],
    );

    if (rows.length === 0) {
      throw new Error(`Unknown cron job: ${name}`);
    }

    const job = rows[0];
    const now = new Date();

    // 2. Overlap check: is another runner still holding the lease?
    if (job.locked_until && job.locked_until > now) {
      await client.query('COMMIT');
      console.log(`[${name}] skipped: still locked until ${job.locked_until.toISOString()}`);
      return;
    }

    // 3. Missed-execution detection. If this start is later than expected by
    //    more than the grace period, we still run, but we log the gap for alerting.
    if (
      job.next_expected_at &&
      now.getTime() - job.next_expected_at.getTime() > MISSED_GRACE_MS
    ) {
      console.warn(`[${name}] missed expected run at ${job.next_expected_at.toISOString()}`);
    }

    // 4. Take the lease and mark the run as started.
    const lockedUntil = new Date(now.getTime() + maxDurationMs);
    await client.query(
      `UPDATE cron_jobs
       SET locked_until = $1,
           last_run_at = $2,
           last_status = 'started',
           next_expected_at = $2 + schedule_interval
       WHERE name = $3`,
      [lockedUntil, now, name],
    );

    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK').catch(() => {});
    throw err;
  } finally {
    // Release exactly once, on every path. The work itself does not need this
    // connection, and releasing a pg client twice throws.
    client.release();
  }

  // 5. Run the actual work, outside the transaction.
  const start = Date.now();
  let status: 'success' | 'failure' = 'success';
  try {
    await work();
  } catch (err) {
    status = 'failure';
    console.error(`[${name}] failed:`, err);
    throw err; // rethrow so your outer error tracker (Sentry, etc.) catches it
  } finally {
    // 6. Record the outcome and release the lease. A failure to record is
    //    logged rather than thrown so it cannot mask an error from work().
    const duration = Date.now() - start;
    await pool
      .query(
        `UPDATE cron_jobs
         SET last_status = $1,
             last_duration_ms = $2,
             locked_until = NULL,
             failure_count = CASE WHEN $1 = 'success' THEN 0 ELSE failure_count + 1 END
         WHERE name = $3`,
        [status, duration, name],
      )
      .catch((recordErr) => console.error(`[${name}] could not record outcome:`, recordErr));
  }
}

A few things to notice:

  • FOR UPDATE on the cron_jobs row means two concurrent runners serialize. One wins, the other sees locked_until in the future and skips.
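  • The lock is time-bound (maxDurationMs). If the process dies, locked_until eventually passes and the next runner picks the job up automatically. No orphan locks. The same expiry applies to a run that is merely slow, so set maxDurationMs comfortably above the worst-case runtime; otherwise a second runner can start while the first is still working.
  • The next_expected_at is computed from schedule_interval at the moment we mark started. That means if the job starts at 06:03 instead of 06:00 (because the server was restarting), the next expectation is 06:03 tomorrow. One late start therefore does not make every later run look late. The trade-off is that the expectation tracks actual start times rather than the wall-clock schedule.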
  • The actual work() runs outside the transaction. The transaction only mutates the small state row. If work() takes five minutes, you are not holding a transaction open for five minutes.
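
The skip-or-run decision can also be pulled out as a pure function, which makes it testable without a database. This is a minimal sketch and not part of the listing above: decideRun, JobState, and graceMs are names introduced here for illustration, and it treats a start as missed once it is later than expected by more than an assumed grace period.

```typescript
// Hypothetical helper: the skip/run decision as a pure function.
interface JobState {
  lockedUntil: Date | null;
  nextExpectedAt: Date | null;
}

type Decision =
  | { action: 'skip'; reason: string }
  | { action: 'run'; lockedUntil: Date; missedByMs: number | null };

function decideRun(
  job: JobState,
  now: Date,
  maxDurationMs: number,
  graceMs: number = 5 * 60_000, // assumed tolerance before a late start counts as missed
): Decision {
  // An unexpired lease means another runner is, or may still be, working.
  if (job.lockedUntil && job.lockedUntil.getTime() > now.getTime()) {
    return { action: 'skip', reason: `locked until ${job.lockedUntil.toISOString()}` };
  }
  // A start later than expected by more than the grace period is reported as missed.
  const lateBy = job.nextExpectedAt ? now.getTime() - job.nextExpectedAt.getTime() : 0;
  return {
    action: 'run',
    lockedUntil: new Date(now.getTime() + maxDurationMs),
    missedByMs: lateBy > graceMs ? lateBy : null,
  };
}
```

Keeping the decision separate from the SQL means the edge cases (expired lease, first-ever run, late start) can be covered by ordinary unit tests.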

Wiring it to a scheduler

Here is the integration with node-cron. Any scheduler works; the function above is the guard.

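import cron from 'node-cron';

// Assumes pool and runCronJob from the listing above are in scope.
// getYesterdayRange() and sendFinanceEmail() are application-specific helpers.
cron.schedule('0 6 * * *', async () => {
  try {
    await runCronJob('daily-finance-report', async () => {
      const yesterday = getYesterdayRange();
      // Only rows not yet reported, so a recovery run does not recount them.
      const { rows } = await pool.query(
        `SELECT sum(amount) AS total, count(*) AS cnt
         FROM transactions
         WHERE created_at BETWEEN $1 AND $2
           AND reported IS NOT TRUE`,
        [yesterday.start, yesterday.end],
      );
      await sendFinanceEmail(rows[0]);
      await pool.query(
        `UPDATE transactions SET reported = true
         WHERE created_at BETWEEN $1 AND $2
           AND reported IS NOT TRUE`,
        [yesterday.start, yesterday.end],
      );
    }, 300_000);
  } catch (err) {
    // runCronJob rethrows. Report the error here so a failed run does not
    // surface as an unhandled promise rejection.
    console.error('[daily-finance-report] run failed:', err);
  }
});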

The scheduler fires the callback every day at 06:00. The callback may fire while a previous instance is still running, but runCronJob handles that. The work() function inside is ordinary TypeScript. It does not need to know about locking.
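
For the setInterval variant, one possible shape is a poller that checks whether the job is due and hands off to the guard. This is a sketch under assumptions: isDue, startPolling, the two callbacks, and the one-minute poll interval are all hypothetical names and values, not something the scheduler libraries provide.

```typescript
// Hypothetical polling scheduler built on setInterval.
type Runner = () => Promise<void>;

function isDue(nextExpectedAt: Date | null, now: Date): boolean {
  // A job that has never run (NULL next_expected_at) is due immediately.
  return nextExpectedAt === null || nextExpectedAt.getTime() <= now.getTime();
}

function startPolling(
  loadNextExpectedAt: () => Promise<Date | null>, // e.g. reads next_expected_at from cron_jobs
  run: Runner, // e.g. () => runCronJob('daily-finance-report', work)
  pollMs: number = 60_000,
): () => void {
  const tick = async () => {
    try {
      if (isDue(await loadNextExpectedAt(), new Date())) {
        await run();
      }
    } catch (err) {
      console.error('poll tick failed:', err);
    }
  };
  void tick(); // check once at boot, so a tick lost to a restart is caught up
  const timer = setInterval(tick, pollMs);
  return () => clearInterval(timer); // call the returned function to stop polling
}
```

Because next_expected_at is anchored to the actual start time, a poller like this can slide later by up to one poll interval per run. If wall-clock alignment matters, a cron-expression scheduler is the better fit.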

Testing overlap before production proves it

The worst time to discover your overlap guard is broken is when two containers both pick up the job after a deploy. Test it locally with two Node processes pointing at the same database.

Open two terminals. In both, run a small script that calls runCronJob with a 10-second sleep inside work(). The first process should acquire the lock and sleep. The second should log skipped: still locked and exit immediately. After 10 seconds, the first process should release the lock and update last_status to success.

If both processes sleep, your FOR UPDATE is not working. Check that both use the same database and that the table row exists. If the second process never runs even after the first finishes, your locked_until is not being cleared. Check the finally block in runCronJob.

This test takes 30 seconds and saves you the Tuesday double-email incident.
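
The same check can be approximated in a single process by firing two guarded runs at once and counting how many actually execute. The helper below is a sketch: countOverlappingRuns and the Guard type are names introduced here, and the guard is passed in as a parameter so the probe works with runCronJob against a test database or with an in-memory fake.

```typescript
// Hypothetical test helper: counts how many of two simultaneous runs execute.
type Guard = (name: string, work: () => Promise<void>) => Promise<void>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function countOverlappingRuns(
  guard: Guard,
  name: string,
  workMs: number,
): Promise<number> {
  let executions = 0;
  const work = async () => {
    executions += 1;
    await sleep(workMs);
  };
  // Fire two runs at once; a working overlap guard lets exactly one through.
  await Promise.all([guard(name, work), guard(name, work)]);
  return executions;
}
```

Calling countOverlappingRuns(runCronJob, 'daily-finance-report', 10_000) should yield 1 if the guard works and 2 if it does not. It complements the two-terminal test rather than replacing it, since only separate processes exercise separate connection pools.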

Detecting missed executions

The next_expected_at column exists for one reason: alerting. A separate monitoring query, run every few minutes by your health checker or Prometheus exporter, detects jobs that are overdue:

SELECT name,
       next_expected_at,
       now() - next_expected_at AS overdue_by
FROM cron_jobs
WHERE next_expected_at < now() - interval '5 minutes'
  AND (locked_until IS NULL OR locked_until < now());

If next_expected_at was 06:00 and it is now 06:10, and the lock is not held, the job missed its window. Alert on this. The five-minute grace period avoids noise from clock skew and short deploy restarts.

You can also alert on failure_count:

SELECT name, failure_count
FROM cron_jobs
WHERE failure_count >= 2;

Two consecutive failures usually mean something is structurally wrong, not just a transient blip.
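
Both checks can be wrapped in one function that a health endpoint or exporter calls. This is a sketch under assumptions: AlertDb is a minimal structural type that pg.Pool is expected to satisfy, collectCronAlerts is a hypothetical name, and the overdue query is rewritten to return seconds as a float so the result is a plain number rather than an interval object.

```typescript
// Minimal structural type; pg.Pool's query method is assumed to be compatible.
interface AlertDb {
  query(sql: string): Promise<{ rows: any[] }>;
}

const OVERDUE_SQL = `
  SELECT name,
         extract(epoch FROM now() - next_expected_at)::float8 AS overdue_seconds
  FROM cron_jobs
  WHERE next_expected_at < now() - interval '5 minutes'
    AND (locked_until IS NULL OR locked_until < now())`;

const FAILING_SQL = `
  SELECT name, failure_count
  FROM cron_jobs
  WHERE failure_count >= 2`;

async function collectCronAlerts(db: AlertDb): Promise<string[]> {
  const [overdue, failing] = await Promise.all([
    db.query(OVERDUE_SQL),
    db.query(FAILING_SQL),
  ]);
  // One human-readable line per problem; an empty array means all clear.
  return [
    ...overdue.rows.map(
      (r) => `${r.name}: overdue by ${Math.round(Number(r.overdue_seconds) / 60)} min`,
    ),
    ...failing.rows.map((r) => `${r.name}: ${r.failure_count} consecutive failures`),
  ];
}
```

Returning plain strings keeps the function independent of any particular alerting tool; the caller decides whether to page, post to chat, or expose a metric.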

Making the work idempotent

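The lock prevents double runs under normal conditions. But what if the job succeeds, writes the reported = true flag, crashes before releasing the lock, and a recovery run picks it up once the lease expires? As long as the report query only selects rows that are not yet marked reported, the next runner finds nothing to report and produces an empty report. That is fine. The email might be empty, which is annoying but not harmful. The harder case is a crash after the email is sent but before the flag is written: the recovery run sends the same report again.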

If your side effect is not naturally idempotent (sending a webhook, charging a fee, creating an invoice), build idempotency into the work itself. An idempotency key table, a processed_at timestamp on the target rows, or a deduplication hash of the inputs. The full pattern is in the post on idempotency keys. The short version: every side effect should be safe to repeat, because in distributed systems the repeat will happen eventually.
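
As an illustration of the idempotency key table, the sketch below claims a key before performing the side effect. Everything in it is an assumption made for the example: the cron_side_effects table, the KeyDb interface (which pg.Pool is expected to satisfy), and the reportKey and oncePerKey helpers.

```typescript
// Hypothetical table, assumed to exist:
//   CREATE TABLE cron_side_effects (
//     key TEXT PRIMARY KEY,
//     created_at TIMESTAMPTZ NOT NULL DEFAULT now()
//   );
interface KeyDb {
  query(sql: string, params: unknown[]): Promise<{ rowCount: number | null }>;
}

function reportKey(jobName: string, start: Date, end: Date): string {
  // The same job and the same input range always produce the same key.
  return `${jobName}:${start.toISOString()}:${end.toISOString()}`;
}

async function oncePerKey(
  db: KeyDb,
  key: string,
  effect: () => Promise<void>,
): Promise<boolean> {
  // The primary key makes the claim atomic: only one caller inserts the row.
  const res = await db.query(
    `INSERT INTO cron_side_effects (key) VALUES ($1) ON CONFLICT (key) DO NOTHING`,
    [key],
  );
  if (res.rowCount !== 1) {
    return false; // already claimed by an earlier run; skip the side effect
  }
  await effect();
  return true;
}
```

Claiming the key before the effect gives at-most-once behavior: a crash between the insert and the effect means the effect never happens. If a lost email is worse than a duplicate, perform the effect first and record the key afterwards, accepting the occasional repeat.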

The dashboard you need

With the cron_jobs table, a one-page dashboard is trivial:

SELECT name,
       last_run_at,
       last_duration_ms,
       last_status,
       next_expected_at,
       failure_count
FROM cron_jobs
ORDER BY name;

You can ship this as an admin API route in 10 lines. It tells you which jobs are healthy, which are failing, which are overdue, and how long they take. Compare this to crontab -l on a server you cannot SSH into, where the only visibility is “did it email root?”
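
To turn those rows into a status column, one option is a small classifier applied to each row before rendering. This is a sketch: the CronJobRow shape follows the table definition, while classify, the Health labels, and the five-minute grace period are assumptions introduced for the example.

```typescript
// Hypothetical row shape matching the columns the dashboard query returns.
interface CronJobRow {
  name: string;
  last_status: 'started' | 'success' | 'failure' | null;
  next_expected_at: Date | null;
  failure_count: number;
}

type Health = 'never-run' | 'failing' | 'overdue' | 'running' | 'healthy';

function classify(row: CronJobRow, now: Date, graceMs: number = 5 * 60_000): Health {
  if (row.last_status === null) return 'never-run';
  if (row.last_status === 'failure') return 'failing';
  // next_expected_at moves forward when a run starts, so a job that is still
  // 'started' long after that point has most likely crashed mid-run.
  const overdue =
    row.next_expected_at !== null &&
    now.getTime() - row.next_expected_at.getTime() > graceMs;
  if (overdue) return 'overdue';
  return row.last_status === 'started' ? 'running' : 'healthy';
}
```

An admin route can then return rows.map((row) => ({ ...row, health: classify(row, new Date()) })) as JSON.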

When this pattern is not enough

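This design is built for one scheduler triggering each job. The guard prevents overlap even when several processes share the database, but it does not stop a second scheduler from starting the job again after the first run has finished. If several servers each run their own scheduler, or you need work distributed across the fleet, use a real job queue: pg-boss (Postgres-backed), BullMQ (Redis-backed), or a SaaS option. Those give you retry logic, dead-letter queues, and horizontal scaling out of the box.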

Also, if your job needs sub-minute precision, cron is the wrong tool. Use a streaming consumer, a webhook handler, or an event-driven architecture.

The migration path from a naked cron script to this pattern is straightforward. Add the cron_jobs table, insert a row for your task, wrap the existing work function in runCronJob, and keep the scheduler unchanged. You do not need to rewrite the work. You only need to wrap it. Most migrations take 20 minutes and a single deploy.

The Postgres-backed cron guard is the 80% solution: it takes an existing cron script and makes it safe, observable, and debuggable without adding Redis, a message broker, or a new service.

The takeaway

Do not trust node-cron alone. It schedules fine, but it knows nothing about whether the previous run finished, whether the server was down during the window, or whether the work is safe to repeat. Add one small table, one locking function, and two alerting queries. The result is scheduled tasks that skip overlapping runs, announce their own missed runs, and expose their health in a query any developer can read.

The next time someone proposes “let’s just add a cron job,” ask how it will handle overlap, how you will know if it misses a run, and what happens if it runs twice. If the answer is “it probably won’t,” show them the 80-line pattern above.


A note from Yojji

The difference between a cron job that silently fails and one that self-reports its health is the same difference that separates prototype infrastructure from production infrastructure. Yojji’s teams build that kind of operational rigor into every engagement.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem, cloud platforms, and full-cycle product engineering, including the background-task reliability that keeps daily reports and data pipelines running without the 2 AM page.