API Dependency Health Checks: Why /health Is Not Enough

The pager went off at 3:17 a.m. The checkout API had a 94% error rate. The pods were all Running. The CPU was at 8%. The liveness probes were green. The readiness probes were green. Every health check in the cluster said the service was fine. The truth was simpler: the Postgres connection pool had exhausted its slots because a background migration job had leaked connections. New requests could not acquire a database handle. The application threw 500s. Kubernetes saw a healthy pod and kept sending traffic.

This is the /health trap. Teams build a route that returns { status: "ok" } and call it done. Kubernetes uses it for readiness. Load balancers use it for target health. Engineers look at it and feel safe. But a process that can execute res.status(200).json({}) tells you almost nothing about whether that process can actually serve a request. The database might be down. Redis might be partitioned. The downstream payment API might be rejecting auth tokens. The queue consumer might be wedged. The health check is blind to all of it.

This post shows how to build dependency-aware health checks: ones that validate the actual resources your API needs. We will look at the three classes of dependency you need to distinguish, the design rules that keep probes cheap, and the code to implement them without creating new outages. No framework changes. Just honest probes.

What a naive health check actually tells you

Here is the naive version that exists in half the production services I audit:

app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

This confirms three things, and three things only: the event loop is not completely frozen, the process has not been OOM-killed, and the HTTP router is mounted. It does not confirm that:

A database connection can be acquired in a reasonable time.
Redis accepts a PING.
The downstream inventory API returns a non-error status.
The file system is writable (if you buffer uploads locally).
The event consumer thread is actually processing messages.

In distributed systems, the most common failures are not process crashes. They are partial failures: a dependency is slow, misconfigured, or rejecting requests. A naive health check is useless against partial failure. Worse, it is dangerous, because it tells your infrastructure that everything is fine when it is not.

The three classes of dependency failures

Not every dependency failure should mark a pod as unhealthy. If you take yourself out of rotation the moment Redis blips, you amplify a small failure into a cascading outage. You need to classify dependencies before you probe them.

Critical dependencies. If this is down, you cannot serve meaningful traffic. For a REST API that reads and writes a relational database, the database is critical. For a video processing service, the object storage backend is critical. If the dependency fails, the pod should fail its readiness probe. Traffic should route elsewhere. If nowhere is healthy, the load balancer returns 503s, which is honest.

Degraded dependencies. If this is down, you can still serve traffic, but some features are unavailable. A caching layer is the classic example. If Redis fails, the API should still serve requests from the database. If analytics telemetry drops, the API should still process checkouts. These should not fail readiness. Instead, they should be monitored, and the service should degrade gracefully.

Best-effort dependencies. These are nice to have, but failures are invisible to users. Think of a metrics push gateway or a non-blocking audit log. These should not affect health checks at all. Probe them for observability, but never let a best-effort dependency evict a pod from the load balancer.

Get this classification wrong and you build a service that falls over because its statsd agent restarted. Get it right and you isolate failures to the blast radius they deserve.
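
To make the classification concrete, here is a minimal sketch of how the three classes could map to a readiness decision. The evaluate helper, the statsd entry, and the 'best_effort' type name are assumptions for illustration; the implementation later in this post only uses 'critical' and 'degraded'.

```javascript
// Hypothetical helper: turns per-dependency health into a readiness decision.
// `deps` is a list of { name, type }; `healthy` maps name -> boolean.
function evaluate(deps, healthy) {
  const failing = deps.filter((d) => healthy[d.name] === false);
  // Only a failing critical dependency takes the pod out of rotation.
  const ready = !failing.some((d) => d.type === 'critical');
  // A failing degraded dependency is flagged but does not affect readiness.
  const degraded = failing.some((d) => d.type === 'degraded');
  // Best-effort failures are reported for observability only.
  const ignored = failing.filter((d) => d.type === 'best_effort').map((d) => d.name);
  return { ready, degraded, ignored };
}

const deps = [
  { name: 'postgres', type: 'critical' },
  { name: 'redis', type: 'degraded' },
  { name: 'statsd', type: 'best_effort' },
];

const outcome = evaluate(deps, { postgres: true, redis: false, statsd: false });
console.log(outcome); // { ready: true, degraded: true, ignored: [ 'statsd' ] }
```

With Redis and statsd both down, the pod stays in rotation and reports itself as degraded. Flip postgres to false and ready becomes false, regardless of the other two.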

Designing the dependency check

The naive next step is to call every dependency inside the /health handler. This is a mistake. A health check endpoint is queried frequently. Kubernetes probes it every 10 seconds by default. A load balancer might query it every 5 seconds from multiple zones. If every probe runs a SELECT 1 on Postgres, a PING to Redis, and a GET to a downstream API, the background load on those dependencies grows with every replica and every prober you add. Worse, if the probe timeout is short (1-2 seconds is common), a transient slowdown in any dependency marks the pod unhealthy even when the dependency would recover a moment later.

A better design uses three techniques.

Background polling with caching. Instead of checking dependencies inline on every HTTP request to /health, run a background task that polls each dependency every few seconds and stores the result in memory. The /health endpoint returns the cached state. This separates probe frequency from dependency check frequency, and it lets you use longer, more realistic timeouts for the actual checks.

Separate readiness and liveness. Kubernetes distinguishes these for a reason. Liveness should mean “this pod is not stuck.” Keep it cheap. Readiness should mean “this pod can serve traffic.” That is where dependency checks belong. If readiness fails, Kubernetes removes the pod from the service endpoints. The pod stays alive so you can inspect logs. If liveness fails, Kubernetes restarts the container. Never put dependency checks on liveness, or a slow database will cause a restart loop.

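Timeout discipline. Every dependency check must have an explicit timeout. When a check runs inline in the probe handler, that timeout has to be shorter than the probe timeout: if your readiness probe timeout is 2 seconds, your Postgres check should time out in about 1 second. If the check exceeds that, the dependency is effectively unreachable and the pod should not receive traffic. If you instead set the dependency timeout to 5 seconds and the probe timeout to 2 seconds, the probe gives up first on every slow check, even when the dependency would have answered within 3 seconds. With background polling the constraint moves: the handler only reads memory, so the check timeout needs to be shorter than the polling interval instead, so that slow checks cannot pile up on top of each other.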
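
One way to enforce a per-check timeout is to race the check against a timer. This is a minimal sketch; withTimeout and fakeCheck are hypothetical names, and fakeCheck only stands in for a real dependency call.

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms`.
function withTimeout(promise, ms, label = 'check') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label}_timeout_after_${ms}ms`)), ms);
  });
  // Whichever settles first wins. Always clear the timer so it cannot
  // fire after a fast success or keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Stand-in for a dependency call that takes `delayMs` to answer.
const fakeCheck = (delayMs) =>
  new Promise((resolve) => setTimeout(() => resolve('ok'), delayMs));
```

Calling withTimeout(fakeCheck(10), 200) resolves with 'ok', while withTimeout(fakeCheck(300), 30) rejects after 30 ms. Note that racing only stops the waiting: the underlying call keeps running unless the client supports cancellation or has its own timeout.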

The code: a production dependency health checker

Here is a Node.js implementation that follows the rules above. It polls critical and degraded dependencies on an interval, caches the results, and exposes separate /health/live and /health/ready endpoints.

import { EventEmitter } from 'node:events';
import pg from 'pg';
import Redis from 'ioredis';

// Configuration: classify your dependencies explicitly.
const DEPENDENCIES = [
  {
    name: 'postgres',
    type: 'critical',
    check: checkPostgres,
    intervalMs: 5_000,
    timeoutMs: 1_500,
  },
  {
    name: 'redis',
    type: 'degraded',
    check: checkRedis,
    intervalMs: 5_000,
    timeoutMs: 1_000,
  },
];

const { Pool } = pg;
const dbPool = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = new Redis(process.env.REDIS_URL);

class DependencyHealthMonitor extends EventEmitter {
  constructor(deps) {
    super();
    this.deps = deps;
    this.state = new Map();
    this.timers = [];
    for (const dep of deps) {
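      // Start pessimistic: a dependency counts as unhealthy until its first check succeeds.
      this.state.set(dep.name, { healthy: false, lastChecked: null, error: 'not checked yet' });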
    }
  }

  start() {
    for (const dep of this.deps) {
      this._poll(dep);
      const timer = setInterval(() => this._poll(dep), dep.intervalMs);
      this.timers.push(timer);
    }
  }

  stop() {
    for (const t of this.timers) clearInterval(t);
  }

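  async _poll(dep) {
    const controller = new AbortController();
    let timeout;
    // pool.connect() and redis.ping() do not accept an AbortSignal, so the
    // signal alone cannot interrupt them. Race the check against the timer
    // so a hung dependency is still recorded as unhealthy.
    const deadline = new Promise((_, reject) => {
      timeout = setTimeout(() => {
        controller.abort();
        reject(new Error(`${dep.name} check timed out after ${dep.timeoutMs} ms`));
      }, dep.timeoutMs);
    });

    try {
      await Promise.race([dep.check({ signal: controller.signal }), deadline]);
      this._update(dep.name, true, null);
    } catch (err) {
      this._update(dep.name, false, err.message);
    } finally {
      clearTimeout(timeout);
    }
  }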

  _update(name, healthy, error) {
    const previous = this.state.get(name);
    if (previous.healthy !== healthy) {
      console.log(JSON.stringify({
        event: 'dependency_health_changed',
        dependency: name,
        healthy,
        previousHealthy: previous.healthy,
        error,
        timestamp: new Date().toISOString()
      }));
      this.emit('change', { name, healthy, error });
    }
    this.state.set(name, { healthy, lastChecked: new Date().toISOString(), error });
  }

  isReady() {
    for (const dep of this.deps) {
      if (dep.type === 'critical' && !this.state.get(dep.name).healthy) {
        return false;
      }
    }
    return true;
  }

  isDegraded() {
    for (const dep of this.deps) {
      if (dep.type === 'degraded' && !this.state.get(dep.name).healthy) {
        return true;
      }
    }
    return false;
  }

  summary() {
    const out = {};
    for (const [name, s] of this.state) {
      out[name] = s;
    }
    return out;
  }
}

// Dependency check functions
async function checkPostgres({ signal }) {
  const client = await dbPool.connect();
  try {
    // Use a lightweight query. Do not SELECT * from a large table.
    await client.query('SELECT 1');
  } finally {
    client.release();
  }
}

async function checkRedis({ signal }) {
  await redis.ping();
}

// Application wiring
const monitor = new DependencyHealthMonitor(DEPENDENCIES);
monitor.start();

monitor.on('change', ({ name, healthy }) => {
  // Integrate with your metrics pipeline here.
  // Example: dependencyHealthyGauge.set({ name }, healthy ? 1 : 0);
});

import express from 'express';
const app = express();

// Liveness: cheap. Just confirms the event loop is responsive.
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness: checks cached dependency state.
app.get('/health/ready', (req, res) => {
  const ready = monitor.isReady();
  const degraded = monitor.isDegraded();
  // Degraded still returns 200; only a failed critical dependency returns 503.
  const statusCode = ready ? 200 : 503;

  // Some teams prefer to return 200 with a body flag even when degraded.
  // For Kubernetes readiness, 503 removes the pod from the service.
  res.status(statusCode).json({
    status: ready ? (degraded ? 'degraded' : 'ready') : 'not_ready',
    degraded,
    dependencies: monitor.summary(),
  });
});

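A few details matter here. The checkPostgres function acquires a connection from the pool, runs SELECT 1, and releases it. It does not reuse a dedicated connection, because a dedicated connection might survive when the pool is exhausted. You want to verify that the pool can actually hand out a handle. Each check receives an AbortSignal, but neither pool.connect() nor redis.ping() accepts one, so the signal only helps calls that support cancellation, such as fetch. For everything else, the timeout holds only if the poller stops waiting when the timer fires, or if the client carries its own limit, such as connectionTimeoutMillis on the pg pool. Without one of those, an exhausted pool leaves the check hanging and the cached state stale, which is exactly the failure this post opened with. The polling interval is 5 seconds, so the cached state can lag a dependency failure by up to one interval plus the check timeout. Kubernetes then needs failureThreshold consecutive failed probes (3 by default, 10 seconds apart) before it removes the pod from the endpoints, so tune the probe period and threshold together with the polling interval. For most services that is fast enough without being aggressive.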
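
As a rough worked example of that delay, the sketch below adds the polling lag to the Kubernetes probe lag. The function name is hypothetical, the probe values are the Kubernetes defaults rather than anything configured in this post, and the result ignores the time it takes for endpoint changes to propagate.

```javascript
// Rough upper bound on how long a pod can keep receiving traffic after a
// critical dependency fails. All durations are in milliseconds.
function worstCaseDetectionMs({ intervalMs, timeoutMs, probePeriodMs, failureThreshold }) {
  // The failure can begin right after a successful poll, and the next poll
  // may use its full timeout before it records the failure.
  const cacheLagMs = intervalMs + timeoutMs;
  // The kubelet then needs `failureThreshold` consecutive failed probes.
  const probeLagMs = probePeriodMs * failureThreshold;
  return cacheLagMs + probeLagMs;
}

const bound = worstCaseDetectionMs({
  intervalMs: 5_000,     // postgres polling interval from DEPENDENCIES
  timeoutMs: 1_500,      // postgres check timeout from DEPENDENCIES
  probePeriodMs: 10_000, // assumed: Kubernetes default periodSeconds of 10
  failureThreshold: 3,   // assumed: Kubernetes default failureThreshold of 3
});
console.log(bound); // 36500
```

With default probe settings the probe side dominates: about 30 of the 36.5 seconds come from the kubelet, not from the poller. Shortening the probe period or threshold usually buys more than polling the dependency harder.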

Handling the degraded state gracefully

When a degraded dependency fails, the readiness endpoint still returns 200, so Kubernetes keeps routing traffic. Your application code needs to handle the absence of that dependency without crashing or serving 500s.

For a cache, this means skipping the cache and hitting the primary store:

async function getUser(id) {
  try {
    const cached = await redis.get(`user:${id}`);
    if (cached) return JSON.parse(cached);
  } catch (err) {
    // Log at debug level. Do not throw.
    console.log(JSON.stringify({ event: 'cache_read_failed', error: err.message }));
  }

  // Cache miss or cache failure: fall through to the primary store.
  const result = await dbPool.query('SELECT * FROM users WHERE id = $1', [id]);
  return result.rows[0];
}

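The same pattern covers best-effort dependencies such as a non-blocking audit log. Bound the write with a short timeout so a slow audit service cannot hold up the request. If the write fails or times out, log locally and move on: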

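async function audit(event) {
  let timer;
  try {
    // Wait at most 500 ms for the audit service.
    await Promise.race([
      auditClient.post('/events', event),
      new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('audit_timeout')), 500);
      }),
    ]);
  } catch (err) {
    console.log(JSON.stringify({ event: 'audit_fallback', payload: event, error: err.message }));
  } finally {
    // Do not leave the timer pending after a fast success.
    clearTimeout(timer);
  }
}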

The principle is: if a dependency is classified as degraded, every code path that touches it must have a fallback. If you cannot build a fallback, reclassify the dependency as critical. Do not lie to yourself about resilience.
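
The fallback can also consult the cached health state, so requests stop paying for a cache call that is known to fail. The sketch below is one way to do that; readThrough and its parameters are hypothetical, and the cache object only needs async get and set methods.

```javascript
// Hypothetical helper: read-through cache that degrades to the primary store.
// `isCacheHealthy` returns the cached health state of the cache dependency.
async function readThrough({ key, cache, isCacheHealthy, loadFromPrimary }) {
  if (isCacheHealthy()) {
    try {
      const hit = await cache.get(key);
      if (hit != null) return JSON.parse(hit);
    } catch (err) {
      // A cache error never fails the request; fall through to the primary.
    }
  }

  const value = await loadFromPrimary();

  if (isCacheHealthy()) {
    // Repopulate the cache without blocking or failing the response.
    Promise.resolve()
      .then(() => cache.set(key, JSON.stringify(value)))
      .catch(() => {});
  }
  return value;
}
```

With the monitor from this post, isCacheHealthy could be () => monitor.summary().redis.healthy, and loadFromPrimary would wrap the database query.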

Operational guidance: do not take yourself down

Dependency health checks are powerful, but they introduce a new failure mode: the thundering herd of health checks. If every pod in a 40-replica deployment starts probing Postgres every 5 seconds, you add 8 probes per second to the database. That is usually fine, but during a recovery event, when Postgres is already slow, those probes can make things worse.
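
The arithmetic behind that figure is simple enough to keep next to your configuration. A minimal sketch, with a hypothetical function name:

```javascript
// Background checks per second hitting one dependency across a deployment.
function checksPerSecond(replicas, intervalMs) {
  return replicas / (intervalMs / 1000);
}

console.log(checksPerSecond(40, 5_000));  // 8
console.log(checksPerSecond(40, 10_000)); // 4
```

Doubling the interval halves the load, at the cost of a slower reaction to failures.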

Mitigate this with three practices.

Jitter your intervals. Do not start every pod’s polling at the same millisecond. Add a random offset up to the interval on startup:

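// Inside start(), replacing the body of the loop:
const jitter = Math.floor(Math.random() * dep.intervalMs);
const startTimer = setTimeout(() => {
  this._poll(dep);
  const timer = setInterval(() => this._poll(dep), dep.intervalMs);
  this.timers.push(timer);
}, jitter);
// Track the startup timer too, so stop() also cancels polls that have not begun.
this.timers.push(startTimer);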

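Use separate credentials or a least-privileged user for probes. If your probe runs SELECT 1, it does not need write access. In extreme cases, give health-check queries their own connection pool with a small cap, so a runaway health check cannot exhaust the application’s main pool. The trade-off is that a separate pool no longer proves the main pool can hand out a connection, so pair it with a saturation metric on the main pool (the pg Pool exposes totalCount, idleCount, and waitingCount).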

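Watch the watcher. Monitor the duration of each background dependency check as its own metric. If the Postgres check starts creeping toward its timeout, either the check is too heavy or the dependency is under stress, and you will see it before the check starts failing. The /health/ready handler itself only reads memory, so if that route starts taking 500 ms, suspect a blocked event loop rather than a slow dependency. Either way, it is a signal.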
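
One way to capture check latency is to wrap each check function before handing it to the monitor. This is a sketch under assumptions: timed is a hypothetical wrapper, and record stands in for whatever metrics client you use.

```javascript
// Hypothetical wrapper: runs a check and reports how long it took.
// `record` stands in for a metrics client call (histogram, statsd timing, ...).
function timed(name, check, record) {
  return async (...args) => {
    const startedAt = performance.now();
    let ok = false;
    try {
      const result = await check(...args);
      ok = true;
      return result;
    } finally {
      // Runs on success and on failure, so slow failures are measured too.
      record({ name, ok, durationMs: performance.now() - startedAt });
    }
  };
}
```

In the DEPENDENCIES list this would look like check: timed('postgres', checkPostgres, record). The wrapper rethrows the original error, so the monitor still sees failures exactly as before.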

Takeaway

A /health endpoint that returns 200 OK is not a health check. It is a process heartbeat. Real health checks validate the resources your service needs to do its job, classify those resources by criticality, and distinguish between “alive but useless” and “alive but slower.” Build background polling, cache the results, wire it to readiness probes, and write fallback code for degraded dependencies. Your 3 a.m. self will thank you.

A note from Yojji

Building resilient APIs means testing failure paths as seriously as happy paths. Yojji helps teams design dependency-aware architectures and implement production-grade health checks that prevent cascading outages. If your monitoring says everything is green while users see errors, it is time to look at how you validate the layers underneath.