The Practical Developer

The Four Timeouts Every Node.js HTTP Client Needs

A production incident walkthrough: Node.js connection pools silently fill with dead TCP sockets, every outbound request hangs forever, and your service looks down while the downstream API is healthy. Here are the four timeout values — connect, response, idle, and keepalive — with the working Agent and fetch config that prevents it.

Your service is not down. The downstream API is not down. But every request your Node.js service makes to it hangs forever, and your own health checks eventually fail.

The connection pool is full. Sixteen sockets, all marked ESTABLISHED in netstat, all idle, all dead. Somewhere between your Kubernetes pod and the upstream load balancer, a quiet network event happened: a NAT table entry expired, a load balancer shifted, a container restarted. TCP is a reliable protocol, but it is only reliable when someone tells the truth. When a peer disappears without sending a FIN or RST, the remaining end will wait indefinitely unless you told it not to.

Node.js defaults do not tell it not to. By default, http.request() waits as long as the kernel allows for a connection, waits forever for a response, and puts no upper bound on how long a pooled socket may sit idle. The Node runtime is fast; the absence of timeouts is not. This post covers the four values you set once and verify with iptables so that a silent network partition becomes a fast error instead of a slow outage.

The shape of the failure

You see it first as latency, not errors. p50 stays flat; p99 climbs, then p99.9 climbs, then the latency histogram turns into a single bar at your maximum client timeout. If your callers have no maximum timeout, the histogram never settles — requests just accumulate.

Inside the process, netstat or ss shows a pool of sockets in ESTABLISHED to the upstream IP. The event loop is not blocked; the sockets are simply waiting for data that will never arrive. If you use an HTTP Agent with maxSockets: 16, the 17th request queues behind the first 16 and never runs. The downstream API is healthy, fast, and has plenty of capacity, but your process never reaches it because the front of the queue is occupied by ghosts.

This is not a bug in Node.js. It is a missing configuration. There are four timeouts, and you need all four because they guard four different phases of a connection lifecycle.

1. Connect timeout: how long the handshake may take

Before any HTTP data flows, TCP must complete its three-way handshake. In a healthy data center this takes a millisecond. Across regions, maybe twenty. If a SYN packet is black-holed (wrong security group, failed NAT, upstream instance terminated), the kernel retransmits it with exponential backoff, and with the Linux default of net.ipv4.tcp_syn_retries = 6 the connect attempt takes roughly two minutes to fail. That is system-default territory, not application-default territory, and it is far too long for a service that should fail fast.

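Node’s native http module does not expose a connect timeout directly. You build it by starting a timer when the request is created, clearing it once the socket reports 'connect', and destroying the request if the timer fires first.

import http from 'node:http';

function requestWithConnectTimeout(url, options = {}, connectMs = 5_000) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, options, resolve);

    // Fires only if the TCP handshake has not completed in time.
    const timer = setTimeout(() => {
      // destroy(err) emits 'error' with this err, which rejects below.
      req.destroy(new Error(`Connect timeout after ${connectMs}ms`));
    }, connectMs);

    req.on('socket', (socket) => {
      if (socket.connecting) {
        // Fresh socket: stop the clock when the handshake finishes.
        socket.once('connect', () => clearTimeout(timer));
      } else {
        // Reused keep-alive socket: it is already connected.
        clearTimeout(timer);
      }
    });

    req.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });

    req.end();
  });
}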

Five seconds is generous for a service-to-service call inside the same cloud region. One second is often enough. The point is not the exact number; the point is that the number exists and is bounded.

2. Response timeout: how long until the first byte

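Once the connection is established and the request is sent, how long do you wait for the server to respond? Node.js has no dedicated time-to-first-byte option: request.setTimeout() is a socket inactivity timer, not a deadline for headers. To bound time-to-first-byte, run your own timer and clear it in the response callback, which fires once the headers have been parsed. In the helper below the timer starts when the request is created, so it also covers connection setup.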

function requestWithResponseTimeout(url, options = {}, responseMs = 10_000) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, options, (res) => {
      clearTimeout(timer);
      resolve(res);
    });

    const timer = setTimeout(() => {
      req.destroy(new Error(`Response timeout after ${responseMs}ms`));
      reject(new Error(`Response timeout after ${responseMs}ms`));
    }, responseMs);

    req.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });

    req.end();
  });
}

Do not conflate this with a total wall-clock deadline. A response timeout of 30 seconds is reasonable for an expensive database export endpoint. A response timeout of 30 seconds for a user-profile lookup is not. Match the timeout to the endpoint’s expected worst case, not to a global default.
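To make the distinction concrete, here is a minimal sketch of a total wall-clock deadline, as opposed to a time-to-first-byte limit. It assumes a Node.js version where AbortSignal.timeout() and the signal option of http.get are available; getWithDeadline is a hypothetical helper name, not part of Node.js.

```javascript
import http from 'node:http';

// Hypothetical helper: bounds the whole exchange (connect, headers, body).
function getWithDeadline(url, deadlineMs) {
  return new Promise((resolve, reject) => {
    // The signal aborts the request once deadlineMs has elapsed in total.
    const signal = AbortSignal.timeout(deadlineMs);
    const req = http.get(url, { signal }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () =>
        resolve({ statusCode: res.statusCode, body: Buffer.concat(chunks) }));
      // An abort while the body is streaming surfaces on the response.
      res.on('error', reject);
    });
    // An abort before the headers arrive surfaces on the request.
    req.on('error', reject);
  });
}
```

A server that sends headers quickly and then stalls mid-body passes a response timeout but still trips this deadline, which is why the two are worth keeping separate.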

If you use the request's built-in timeout (req.setTimeout(ms) or the timeout option), Node.js fires the 'timeout' event once the socket has been inactive for that long, but it does not destroy the request. You must listen and abort yourself. The explicit setTimeout + req.destroy() pattern above is clearer and harder to miss.
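If you do go the req.setTimeout() route, the handler has to do the aborting. A minimal sketch, where getWithIdleTimeout is a hypothetical helper name and the body is simply discarded:

```javascript
import http from 'node:http';

// Hypothetical helper: fails the request if the socket is idle for idleMs.
function getWithIdleTimeout(url, idleMs) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, (res) => {
      res.resume(); // discard the body in this sketch
      resolve(res.statusCode);
    });
    // setTimeout only emits 'timeout'; destroying the request is our job.
    req.setTimeout(idleMs, () => {
      req.destroy(new Error(`Socket idle for ${idleMs}ms`));
    });
    req.on('error', reject);
  });
}
```

Without the req.destroy() call inside the handler, the 'timeout' event fires and the request keeps waiting anyway.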

3. Socket idle timeout: how long a pooled socket may sit unused

Connection reuse is fast. Keeping a socket open for the next request avoids another TCP handshake and TLS negotiation. But a socket that has been idle for five minutes is statistically more likely to belong to a ghost than to a healthy peer.

The http.Agent enables pooling with keepAlive: true. The neighboring keepAliveMsecs option is often mistaken for an idle timeout, but it is not one: it is the initial delay for TCP keepalive probes on pooled sockets, and it puts no limit on how long an idle socket stays in the pool. The native agent has no option called idleTimeout. The closest equivalent is the Agent's timeout option, a socket inactivity timeout that also causes the agent to destroy free sockets when it fires. If you want the idle limit to be explicit, switch to a modern client.

With undici — the HTTP client that powers Node.js 18+ global fetch — the concept is explicit:

import { Agent } from 'undici';

const agent = new Agent({
  connect: {
    timeout: 5_000,           // connect timeout
    rejectUnauthorized: true,
  },
  bodyTimeout: 30_000,         // max gap between body chunks
  headersTimeout: 10_000,      // time to receive headers (response timeout)
  keepAliveTimeout: 30_000,    // idle socket timeout
  keepAliveMaxTimeout: 30_000, // cap on server keep-alive hints
  maxRequestsPerClient: 100,   // rotate connections periodically
});

keepAliveTimeout: 30_000 means a socket is evicted from the pool after 30 seconds of idleness. That is short enough that a stale NAT mapping or a silently-replaced load balancer target will not fool you for long, and long enough that a burst of traffic benefits from reuse.

If you are still on the native http module and cannot migrate yet, set the Agent's timeout option so that idle pooled sockets are destroyed when it fires, or periodically replace the Agent and call agent.destroy() on the old one from a background timer. The rotation approach is crude, because destroy() also closes sockets that are still in use, but it works.
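For the native module, the following is a minimal sketch of idle eviction using only built-in Agent options. It rests on one assumption you should verify on your Node.js version: that the agent destroys a free (pooled) socket when the socket's inactivity timeout fires. freeSocketCount is a hypothetical helper for observing the pool, not a Node.js API.

```javascript
import http from 'node:http';

// keepAlive pools sockets; timeout is a socket inactivity limit in ms.
// Assumption: the agent destroys free (pooled) sockets when it fires.
const agent = new http.Agent({
  keepAlive: true,
  maxSockets: 16,
  maxFreeSockets: 4,
  timeout: 30_000,
});

// Hypothetical helper: number of idle sockets currently held in the pool.
function freeSocketCount(a) {
  return Object.values(a.freeSockets)
    .reduce((total, list) => total + list.length, 0);
}
```

The same timeout also applies as an inactivity limit while a request is in flight, where it only emits 'timeout' on the request, so long-running endpoints still need their own handling.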

4. TCP keepalive: probing dead peers at the OS level

The first three timeouts guard the edges: connecting, waiting for a response, and retiring idle pool members. But what if a socket is mid-request when the peer dies? Or what if your response timeout is intentionally long — a large file transfer, a streaming endpoint — and you want to detect a dead peer inside that long window?

TCP keepalive sends empty probe packets after a period of silence. If the peer does not acknowledge them, the kernel declares the connection dead and closes the socket, which causes Node.js to emit an 'error' event that you can handle. Without keepalive, a socket can sit in ESTABLISHED forever, convinced the peer is alive because no one contradicted it.

Enable it on every new socket by overriding createConnection on the Agent:

import http from 'node:http';
import net from 'node:net';

class KeepaliveAgent extends http.Agent {
  createConnection(options) {
    // Returning the socket synchronously is enough: the Agent attaches
    // its own listeners, so no callback handling is needed here.
    const socket = net.createConnection(options);
    socket.setKeepAlive(true, 5_000);   // probe after 5s of silence
    socket.setNoDelay(true);            // disable Nagle's algorithm
    return socket;
  }
}

const agent = new KeepaliveAgent({
  keepAlive: true,
  maxSockets: 16,
  maxFreeSockets: 4,
});

socket.setKeepAlive(true, 5000) tells the OS to start sending keepalive probes after 5 seconds of idleness. On Linux the system-wide default start delay (net.ipv4.tcp_keepalive_time) is 7200 seconds, which is useless for this purpose; the socket option overrides it. What happens after the first probe depends on the runtime. The Node.js documentation for current releases says setKeepAlive also sets TCP_KEEPINTVL=1 and TCP_KEEPCNT=10 on the socket, while older runtimes may leave the probe interval and count to the sysctl settings (tcp_keepalive_intvl, tcp_keepalive_probes). Check which applies to your version, and treat the resulting detection window as approximate, not a contract.

For sockets that rely on the system defaults (older runtimes, sidecars, other processes in the same pod), add a sysctl init container or tune the node if you control it:

# sysctl init container snippet
- name: sysctl
  image: busybox
  command: ['sh', '-c', 'sysctl -w net.ipv4.tcp_keepalive_time=30 net.ipv4.tcp_keepalive_intvl=5 net.ipv4.tcp_keepalive_probes=3']
  securityContext:
    privileged: true

For sockets that use these system-wide values, this shortens the total detection window to roughly 30 + (3 × 5) = 45 seconds. If you do not control the node, set a tighter application-level heartbeat or rely on the response timeout instead.
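The arithmetic is worth keeping as a small function so you can compare configurations. This is a sketch of the approximation used in the text; keepaliveDetectionSeconds is a hypothetical name, and the field names simply mirror the sysctl suffixes.

```javascript
// Approximate seconds before the kernel declares a silent peer dead:
// idle time before the first probe, plus one interval per unanswered probe.
function keepaliveDetectionSeconds({ time, intvl, probes }) {
  return time + intvl * probes;
}

// Tuned values from the sysctl snippet.
console.log(keepaliveDetectionSeconds({ time: 30, intvl: 5, probes: 3 }));    // 45
// Stock Linux defaults: more than two hours.
console.log(keepaliveDetectionSeconds({ time: 7200, intvl: 75, probes: 9 })); // 7875
```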

Putting them together: a production-ready fetch wrapper

If you use the global fetch in Node.js 18+, you are already using undici under the hood, but the global fetch does not expose timeout or keepalive options directly. They live on an undici Agent that you pass as a custom dispatcher, shown here with undici's request API:

import { Agent, request } from 'undici';

const agent = new Agent({
  connect: { timeout: 5_000, keepAlive: true },
  headersTimeout: 10_000,
  bodyTimeout: 30_000,
  keepAliveTimeout: 30_000,
});

export async function fetchWithTimeouts(url, options = {}) {
  const { statusCode, headers, body } = await request(url, {
    ...options,
    dispatcher: agent,
    // undici request options
  });

  const data = await body.json();
  return { statusCode, headers, data };
}

If you need to match the fetch API shape while keeping timeouts, wrap undici’s fetch and pass the dispatcher:

import { Agent, fetch as undiciFetch } from 'undici';

const agent = new Agent({
  connect: { timeout: 5_000 },
  headersTimeout: 10_000,
  bodyTimeout: 30_000,
  keepAliveTimeout: 30_000,
});

export function fetchWithTimeouts(url, options = {}) {
  return undiciFetch(url, {
    ...options,
    dispatcher: agent,
  });
}
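When one of these timeouts fires, it helps to know which phase failed. The sketch below maps undici's timeout error codes to a phase name; timeoutPhase is a hypothetical helper, and it assumes that fetch wraps the underlying error in a TypeError whose cause carries the code, while request() rejects with the coded error directly.

```javascript
// undici error codes for the three timeout phases configured on the Agent.
const UNDICI_TIMEOUT_PHASES = new Map([
  ['UND_ERR_CONNECT_TIMEOUT', 'connect'],
  ['UND_ERR_HEADERS_TIMEOUT', 'headers'],
  ['UND_ERR_BODY_TIMEOUT', 'body'],
]);

// Hypothetical helper: returns 'connect', 'headers', 'body', or null.
function timeoutPhase(err) {
  // request() rejects with the coded error; fetch() puts it on err.cause.
  const code = err?.code ?? err?.cause?.code;
  return UNDICI_TIMEOUT_PHASES.get(code) ?? null;
}
```

Tagging metrics and logs with the phase tells you whether to look at the network path (connect), the peer's latency (headers), or a stalled transfer (body).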

For the native http/https module, compose the pieces into one helper:

import https from 'node:https';

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 16,
  maxFreeSockets: 4,
});

export function httpsRequest(url, options = {}) {
  // Separate the helper's own options from those passed to https.request.
  const { connectTimeout, responseTimeout, ...requestOptions } = options;
  const connectMs = connectTimeout ?? 5_000;
  const responseMs = responseTimeout ?? 10_000;

  return new Promise((resolve, reject) => {
    const req = https.request(url, { agent, ...requestOptions }, (res) => {
      // Headers arrived: neither timer may fire after this point.
      clearTimeout(connectTimer);
      clearTimeout(responseTimer);
      resolve(res);
    });

    const connectTimer = setTimeout(() => {
      req.destroy(new Error(`Connect timeout after ${connectMs}ms`));
    }, connectMs);

    const responseTimer = setTimeout(() => {
      req.destroy(new Error(`Response timeout after ${responseMs}ms`));
    }, responseMs);

    req.on('socket', (socket) => {
      socket.setKeepAlive(true, 5_000);
      socket.setNoDelay(true);
      if (socket.connecting) {
        // Fresh socket: 'secureConnect' fires once the TLS handshake is done.
        socket.once('secureConnect', () => clearTimeout(connectTimer));
      } else {
        // Reused pooled socket: there is no handshake to wait for.
        clearTimeout(connectTimer);
      }
    });

    req.on('error', (err) => {
      clearTimeout(connectTimer);
      clearTimeout(responseTimer);
      reject(err);
    });

    req.end();
  });
}

Use this helper everywhere. Do not sprinkle ad-hoc timeouts across your codebase; the inconsistency will hide bugs.

Why one zombie socket kills throughput

Suppose your downstream API is healthy and p50 response time is 50 ms. You set maxSockets: 16 in the Agent. Under normal load, 16 concurrent requests share the pool, finish in 50 ms, and the next batch reuses or creates fresh sockets. Throughput is roughly 320 req/s.

Now a network event kills half the pooled sockets without a TCP close. Eight sockets are dead. The next eight requests grab those dead sockets, send data, and wait. The remaining eight requests grab healthy sockets and finish in 50 ms. But because there is no idle timeout or keepalive, the dead sockets are never evicted. They sit in ESTABLISHED forever.

Your effective pool size shrinks from 16 to 8. Throughput drops to 160 req/s. Load increases. The queue grows. Latency climbs from 50 ms to seconds. Eventually the queue exceeds your caller’s patience, and the failure looks like a downstream outage even though the API is fine.

The math is simple: maxSockets is a promise you make to the downstream, but if you do not bound the lifetime of each socket, the promise is not kept. The four timeouts are the enforcement mechanism.
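The numbers in this section follow from one line of arithmetic. A minimal sketch, with poolThroughput as a hypothetical name, assuming every live socket handles one request at a time at a fixed latency:

```javascript
// Requests per second a pool can sustain when each live socket serves
// one request at a time and every request takes latencyMs.
function poolThroughput(liveSockets, latencyMs) {
  return liveSockets * (1000 / latencyMs);
}

console.log(poolThroughput(16, 50)); // 320, the healthy pool
console.log(poolThroughput(8, 50));  // 160, after eight sockets become ghosts
```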

Production signals

Add metrics that prove the timeouts are working, not just configured.

Track outbound request latency by host. A bimodal distribution — one peak at normal latency, another at exactly your timeout value — means requests are timing out rather than failing fast. That is a signal to shorten the timeout or to investigate the network path.

Track socket pool utilization. undici exposes pool statistics. In the native module, count agent.sockets, agent.freeSockets, and agent.requests periodically. If agent.sockets stays at maxSockets while agent.requests keeps growing, you have a leak or a ghost pool.
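For the native module, a pool snapshot can be as small as the sketch below. It relies on the http.Agent properties sockets, freeSockets, and requests, each an object mapping a host key to an array; poolSnapshot and countAll are hypothetical helper names.

```javascript
// Sum the lengths of every per-host array in one of the Agent's maps.
function countAll(byHost) {
  return Object.values(byHost ?? {})
    .reduce((total, list) => total + list.length, 0);
}

// Hypothetical helper: one sample of pool state for a metrics gauge.
function poolSnapshot(agent) {
  return {
    active: countAll(agent.sockets),     // sockets currently serving a request
    free: countAll(agent.freeSockets),   // idle sockets waiting for reuse
    queued: countAll(agent.requests),    // requests waiting for a socket
  };
}
```

Sampling this on an unref'd interval and exporting the three numbers as gauges is usually enough to spot a pool where active is pinned at maxSockets and queued only grows.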

Track TCP retransmits and ESTABLISHED socket count per destination from the host or sidecar. A high ratio of sockets to request rate indicates churn or ghosts.

Alert on timeout ratio, not timeout count. A spike of timeouts during a deployment is expected. A steady 2% timeout rate on a stable endpoint means the network or the peer is lying, and your timeouts are the only reason you are not completely down.
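A ratio needs a window. Here is a minimal in-process sketch of a sliding-window timeout ratio; the TimeoutRatio class and its method names are hypothetical, and in practice your metrics system would usually compute this from counters instead.

```javascript
// Tracks the share of requests that timed out within the last windowMs.
class TimeoutRatio {
  constructor(windowMs = 60_000) {
    this.windowMs = windowMs;
    this.events = [];
  }

  // Record one finished request; pass timedOut = true if a timeout fired.
  record(timedOut, now = Date.now()) {
    this.events.push({ at: now, timedOut });
  }

  // Drop events older than the window, then return timeouts / total.
  ratio(now = Date.now()) {
    const cutoff = now - this.windowMs;
    this.events = this.events.filter((event) => event.at > cutoff);
    if (this.events.length === 0) return 0;
    const timeouts = this.events.filter((event) => event.timedOut).length;
    return timeouts / this.events.length;
  }
}
```

Alerting when ratio() stays above a threshold for several consecutive windows filters out the deployment spike while still catching the steady 2% case.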

The practical takeaway

When a Node.js service makes outbound HTTP calls, copy this checklist into the client initialization:

  • Connect timeout — bounded, typically 1–5 seconds.
  • Response timeout — bounded, matched to the endpoint’s realistic worst case.
  • Socket idle timeout — pool sockets evicted after 15–60 seconds of idleness.
  • TCP keepalive — enabled, with a start delay of 5–30 seconds, and OS-level probes tuned if you control the node.

Set them in one shared helper or dispatcher. Never rely on the defaults. The defaults assume a reliable network and honest peers, and production has neither.

Test the behavior with iptables in a local container:

# Blackhole the upstream IP after the connection is established
iptables -A OUTPUT -p tcp -d <upstream-ip> --dport 443 -j DROP

Watch your service. Without the four timeouts, it hangs. With them, it errors in seconds and your retry or circuit-breaker logic handles the rest. That is the difference between a blip and an outage.
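When iptables is not an option, for example in a unit test, a local stand-in covers part of the same ground. The sketch below is a TCP server that accepts connections and then never replies, which exercises the response timeout; it does not simulate dropped SYN packets or pooled sockets that die silently, so it complements the iptables test and does not replace it. startSilentServer is a hypothetical helper name.

```javascript
import net from 'node:net';

// Hypothetical test helper: accepts TCP connections and never replies.
function startSilentServer() {
  const sockets = new Set();
  const server = net.createServer((socket) => {
    sockets.add(socket);
    socket.on('close', () => sockets.delete(socket));
    socket.on('error', () => {}); // ignore resets caused by client aborts
    socket.resume();              // read and discard whatever the client sends
  });

  return new Promise((resolve) => {
    // Port 0 asks the OS for any free port on the loopback interface.
    server.listen(0, '127.0.0.1', () => {
      resolve({
        port: server.address().port,
        close() {
          for (const socket of sockets) socket.destroy();
          server.close();
        },
      });
    });
  });
}
```

Point your client helper at http://127.0.0.1:<port> and assert that it rejects within the configured response timeout; a helper with no timeout will hang until the test runner kills it.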
