TCP Keepalive: Detecting Dead Peers Before Your Connection Pool Drowns

The checkout service went quiet at 11:47 a.m. Not a crash. Not an error spike. Just a steady climb in p99 latency from 120 ms to 28 seconds, followed by a wave of 504 Gateway Timeouts. The pods were healthy. The CPU was flat. The database was responsive. The problem was that every HTTP request from checkout to inventory was being dispatched through a connection that had been dead for six minutes. The TCP state on the client read ESTABLISHED. The server it pointed to had been OOM-killed and replaced by Kubernetes six minutes earlier. No FIN. No RST. Just a socket that looked alive and swallowed requests until the application timeout fired.

This is the TCP half-open problem, and if you run microservices, you have it. Connection pools (in http.Agent, pg.Pool, undici, or any language’s equivalent) reuse TCP connections to avoid the overhead of the three-way handshake. That reuse assumes the connection is either alive or will fail fast. After a silent peer death, neither is true. The request hangs. The pool fills with these ghosts. New requests queue behind them. The service dies by congestion, not by crash.

The fix is TCP keepalive, plus a few HTTP agent settings that most teams never touch. This post covers the kernel tunables, the Node.js agent wiring, and the operational validation that proves it works. No framework upgrades required. No service mesh. Just sockets configured honestly.

Why application timeouts are not enough

The obvious response is “we have a 30-second HTTP timeout.” That is an application-level timeout, and it is the wrong layer for this problem. Here is why.

When your HTTP client reuses a connection from the pool, it writes the request to the socket and starts waiting for the response. If the peer is dead, the write succeeds (TCP data goes into the kernel send buffer on the client side; there is no immediate error). The client then waits for ACKs that never come. TCP starts retransmitting. The retransmit schedule is aggressive at first (200 ms, then 400 ms, then 800 ms) but it backs off quickly. On Linux, the default is 15 retransmits (net.ipv4.tcp_retries2) over roughly 13 to 30 minutes, depending on the exact RTO. Only then does the kernel report ETIMEDOUT to the application.
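
The size of that window follows from the retry count and the backoff. Here is a rough sketch of the arithmetic, assuming the RTO starts at 200 ms, doubles on every attempt, and is capped at 120 seconds. Those constants are assumptions that match common Linux behavior on a low-latency network, and giveUpAfterMs is a name invented for this illustration:

```typescript
// Estimate how long Linux keeps retransmitting unacknowledged data before
// it gives up and reports ETIMEDOUT. Assumes the RTO doubles from rtoMinMs
// until it reaches rtoMaxMs.
export function giveUpAfterMs(
  retries = 15,       // net.ipv4.tcp_retries2 default
  rtoMinMs = 200,     // assumed starting RTO
  rtoMaxMs = 120_000, // assumed RTO ceiling
): number {
  let total = 0;
  // One wait after the original send, plus one after each retransmission.
  for (let attempt = 0; attempt <= retries; attempt++) {
    total += Math.min(rtoMinMs * 2 ** attempt, rtoMaxMs);
  }
  return total;
}

console.log(giveUpAfterMs() / 1000); // 924.6 seconds, about 15 minutes
```

A higher measured RTT raises the starting RTO and pushes the total toward the upper end of the range.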

Your 30-second application timeout fires first, so the HTTP client gives up and returns an error. That is good for that one request. But the connection is not removed from the pool. The pool manager sees a returned connection, checks whether it seems okay (usually by asking the socket if it is destroyed, which it isn’t), and puts it back for the next request. The next request gets the same ghost. Repeat. The pool becomes a landfill of dead connections that each cost 30 seconds to discover.

Application timeouts protect the user experience. They do not protect the connection pool. You need transport-layer detection of dead peers, which is what TCP keepalive provides.
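
One application-side mitigation is to make sure a timed-out request takes its socket down with it, so the socket cannot be handed to the next caller. The sketch below assumes a Node.js http.ClientRequest created with the timeout option; destroyOnTimeout and the label parameter are hypothetical names, and the interface is only the minimal shape the helper needs:

```typescript
// Minimal shape shared by http.ClientRequest and by test doubles.
interface TimeoutCapable {
  once(event: 'timeout', listener: () => void): unknown;
  destroy(error?: Error): unknown;
}

// Hypothetical helper: Node emits 'timeout' on the request when its socket
// has been inactive for the configured time, but does not abort anything.
// Destroying the request also destroys the underlying socket, so the agent
// cannot return that socket to the pool.
export function destroyOnTimeout(req: TimeoutCapable, label: string): void {
  req.once('timeout', () => {
    req.destroy(new Error(`${label}: no response before timeout, socket discarded`));
  });
}

// Usage sketch (host, port, and path are placeholders):
//   const req = http.request({ host, port, path, timeout: 30_000 }, onResponse);
//   destroyOnTimeout(req, 'inventory');
//   req.end();
```

This bounds the damage to one slow request per dead socket. It does not detect a dead peer while the socket sits idle in the pool, which is the gap keepalive closes.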

How TCP keepalive actually works

TCP keepalive sends empty ACK segments on an idle connection to probe whether the peer is still there. If the peer responds with an ACK, the connection is alive. If it does not respond, the kernel repeats the probe at a fixed interval (unlike data retransmission, there is no exponential backoff). After a configured number of unacknowledged probes, the kernel marks the connection as dead and reports ETIMEDOUT to the application on the next socket operation.

Linux exposes three sysctl knobs:

net.ipv4.tcp_keepalive_time (default: 7200 seconds): how many seconds of idleness before the first probe is sent.
net.ipv4.tcp_keepalive_intvl (default: 75 seconds): the interval between probes.
net.ipv4.tcp_keepalive_probes (default: 9): how many probes to send before giving up.

With the defaults, a dead peer is detected after 7200 + (9 * 75) = 7875 seconds, or roughly two hours and eleven minutes. That is useless for a microservice.

For a typical internal service mesh or Kubernetes cluster, sane values are:

sudo sysctl -w net.ipv4.tcp_keepalive_time=30
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10
sudo sysctl -w net.ipv4.tcp_keepalive_probes=3

This means: after 30 seconds of idleness, send a probe. If no response, send another every 10 seconds. If three probes go unanswered, kill the socket. Total time to detect a dead peer: 30 + (3 * 10) = 60 seconds. It can be faster: if a probe reaches a host that no longer knows the connection, that host answers with RST and the socket is torn down immediately.
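
The same arithmetic as a small helper, useful for sanity-checking a proposed tuning before rolling it out. The interface and function names are invented for this sketch; the formula is the one used in the text (idle time plus one interval per unanswered probe):

```typescript
// The three Linux keepalive tunables, in the units the kernel uses.
interface KeepaliveSettings {
  timeSec: number;  // tcp_keepalive_time: idle seconds before the first probe
  intvlSec: number; // tcp_keepalive_intvl: seconds between unanswered probes
  probes: number;   // tcp_keepalive_probes: unanswered probes before giving up
}

// Worst-case seconds between a silent peer death and the kernel
// declaring the connection dead.
export function worstCaseDetectionSec(s: KeepaliveSettings): number {
  return s.timeSec + s.intvlSec * s.probes;
}

console.log(worstCaseDetectionSec({ timeSec: 7200, intvlSec: 75, probes: 9 })); // 7875
console.log(worstCaseDetectionSec({ timeSec: 30, intvlSec: 10, probes: 3 }));   // 60
```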

To persist these across reboots, add them to /etc/sysctl.conf or a file in /etc/sysctl.d/:

net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3

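Then run sudo sysctl --system, which loads /etc/sysctl.conf and every file under /etc/sysctl.d/. Plain sysctl -p only reads /etc/sysctl.conf unless you pass it a file name.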

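These settings apply to every socket in the network namespace that has SO_KEEPALIVE enabled. They do not turn keepalive on by themselves; the application still has to request it on each socket. If you run a database on the same node and do not want its long-idle admin connections probed aggressively, leave the sysctls at their defaults and set the idle time per socket instead. In Node.js, that is socket.setKeepAlive(true, 30000).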

The Node.js agent wiring that makes it real

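Node.js http.Agent and https.Agent create and manage TCP sockets for HTTP requests. With keepAlive: true the agent reuses connections, and it enables SO_KEEPALIVE on a socket when that socket returns to the free pool, using keepAliveMsecs (default 1000 ms) as the idle delay before the first probe. The 1-second default is rarely what you want, so set keepAliveMsecs explicitly. You should also set timeout on the agent (not only on the request) to bound how long a socket can sit idle in the pool.

Here is a production-grade agent configuration:

// agent.ts
import http from 'node:http';
import https from 'node:https';

const AGENT_OPTIONS: http.AgentOptions = {
  keepAlive: true,
  keepAliveMsecs: 30_000, // idle delay before the first TCP keepalive probe
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 30_000,        // socket inactivity timeout; idle pooled sockets that hit it are destroyed
};

export const httpAgent = new http.Agent(AGENT_OPTIONS);
export const httpsAgent = new https.Agent(AGENT_OPTIONS);

The idle-socket timeout is the critical setting. Core http.Agent has no freeSocketTimeout option; that name comes from the agentkeepalive package, which lets you set the idle limit separately from the active-request limit. In core Node.js the agent's timeout covers the same ground, and the rest of this post uses freeSocketTimeout as shorthand for that idle limit. It destroys a socket that has been idle in the pool for more than 30 seconds, forcing a new TCP handshake on the next request. This is a blunt but effective way to evict ghosts: if your keepalive probes haven’t already caught the dead peer, the socket is discarded before it can be reused. The trade-off is a few extra TCP handshakes under low load. In a busy service, sockets turn over quickly and are reused long before the limit, so the setting rarely fires.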

If you use undici (the HTTP client that powers Node.js native fetch), the same concepts apply via the Pool constructor:

import { Pool } from 'undici';

const pool = new Pool('http://inventory.internal:8080', {
  connections: 50,
  keepAliveTimeout: 30000,
  keepAliveMaxTimeout: 30000,
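  // No TLS options are needed for a plain http:// origin. For https:// with
  // a private CA, pass connect: { ca: <PEM> } rather than disabling
  // verification with rejectUnauthorized: false.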
});

keepAliveTimeout in undici is the equivalent of freeSocketTimeout: how long a socket can sit unused before the pool closes it.

Neither of these gives you explicit control over TCP-level keepalive on each socket from the moment it is created. For that, you need to hook socket creation.

Enabling TCP keepalive per socket

Node.js agents expose a createConnection method that the agent calls for every new socket. Passing a createConnection function in the agent options does not replace it; the documented approach is to override the method in a subclass. This is where you enable SO_KEEPALIVE and set the idle delay before the HTTP layer takes over:

// keepalive-agent.ts
import net from 'node:net';
import http from 'node:http';
import https from 'node:https';

function configureSocket(socket: net.Socket): void {
  // Enable SO_KEEPALIVE; first probe after 30s of idleness.
  socket.setKeepAlive(true, 30_000);
  // Node has no public API for choosing the probe interval or probe count.
  // Recent Node.js docs state that setKeepAlive itself sets TCP_KEEPINTVL=1
  // and TCP_KEEPCNT=10; where it does not, the sysctl values apply.
  // Use a native addon if you need to pick those values yourself.
}

const AGENT_OPTIONS: http.AgentOptions = {
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10,
  // timeout also evicts sockets that sit idle in the pool. Core http.Agent
  // has no freeSocketTimeout option; that name comes from agentkeepalive.
  timeout: 30_000,
};

// Delegating to super keeps the default connect logic (including TLS setup
// in https.Agent) and only adds the keepalive configuration.
// Assumes @types/node typings that declare Agent#createConnection.
class KeepAliveHttpAgent extends http.Agent {
  createConnection(...args: Parameters<http.Agent['createConnection']>) {
    const socket = super.createConnection(...args);
    configureSocket(socket as net.Socket);
    return socket;
  }
}

class KeepAliveHttpsAgent extends https.Agent {
  createConnection(...args: Parameters<https.Agent['createConnection']>) {
    const socket = super.createConnection(...args);
    configureSocket(socket as net.Socket); // tls.TLSSocket extends net.Socket
    return socket;
  }
}

export const httpAgent = new KeepAliveHttpAgent(AGENT_OPTIONS);
export const httpsAgent = new KeepAliveHttpsAgent(AGENT_OPTIONS);

socket.setKeepAlive(true, 30000) enables SO_KEEPALIVE with an initial delay of 30 seconds. After 30 seconds of idleness, the kernel starts sending probes.

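socket.setKeepAlive(enable, initialDelay) is not a recent addition; it has had this signature across Node.js LTS releases, so the call above works on all of them. initialDelay defaults to 0, which leaves the idle time at the kernel value (tcp_keepalive_time).

The probe interval and count are a different matter. The current Node.js documentation states that enabling keepalive this way also sets TCP_KEEPINTVL=1 and TCP_KEEPCNT=10 on the socket, which overrides tcp_keepalive_intvl and tcp_keepalive_probes for that socket; check the socket.setKeepAlive documentation for the version you run. If you need to choose those values yourself, use the net-keepalive npm package or a small native addon. Where Node leaves them alone, the sysctl values apply, and for most Kubernetes-hosted microservices that is sufficient as long as the sysctls are set in the network namespace the process actually runs in.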

TCP_USER_TIMEOUT: the missing piece

There is a second kernel knob that works alongside keepalive and solves a related problem: TCP_USER_TIMEOUT. This is a socket option that says “if data I have sent is not acknowledged within X milliseconds, destroy the connection.” It does not require the connection to be idle, and it catches a different failure mode than keepalive.

Scenario: your service sends a large POST body. The peer receives the headers, then dies. Keepalive won’t help here because the connection is not idle (you are actively sending). TCP_USER_TIMEOUT will: if the ACKs for your transmitted data do not arrive within the timeout, the kernel tears down the socket.

Recommended value for microservices: 10 to 30 seconds. This is aggressive, but internal networks in a data center or VPC should not take 30 seconds to deliver an ACK unless something is catastrophically wrong. Node.js core has no API for TCP_USER_TIMEOUT (socket.setNoDelay() controls TCP_NODELAY, which is unrelated), so you need a native addon. The net-keepalive package supports it.

If you cannot set TCP_USER_TIMEOUT per-socket, the combination of application-level request timeouts (30s) and TCP keepalive (30s idle detection) catches the majority of issues. TCP_USER_TIMEOUT is the belt-and-suspenders layer.

What about HTTP/2 and gRPC?

HTTP/2 and gRPC connections are long-lived and multiplex many requests over a single TCP socket. A dead peer here is even more expensive because one bad connection stalls many in-flight requests.

For gRPC in Node.js, the @grpc/grpc-js channel options include:

const client = new InventoryClient(
  'inventory.internal:50051',
  grpc.credentials.createInsecure(),
  {
    'grpc.keepalive_time_ms': 30000,
    'grpc.keepalive_timeout_ms': 10000,
    'grpc.http2.max_pings_without_data': 0,
    'grpc.keepalive_permit_without_calls': 1,
  }
);

grpc.keepalive_time_ms: send a PING frame every 30 seconds.
grpc.keepalive_timeout_ms: wait 10 seconds for the PING ACK.
grpc.keepalive_permit_without_calls: send PINGs even when there are no active RPCs. Without this, the channel only probes when busy, which misses the idle-ghost case.

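The HTTP/2 equivalent in Node.js depends on your client library. The built-in node:http2 module exposes session.ping(), which you can call on a timer, treating a failed or missing reply as a dead connection; check whether your HTTP client offers an equivalent setting. If it does not, the underlying TCP keepalive still applies to the single TCP socket carrying the HTTP/2 session.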

Validating that it works

Configuring keepalive is only half the job. You must prove it works before the next incident. There are three ways to validate.

Method 1: tcpdump

Pick a known connection between two of your services. Run tcpdump on the client host:

sudo tcpdump -i any -n host 10.0.1.15 and port 8080 -w /tmp/keepalive.pcap

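Wait for traffic to stop, then wait your tcp_keepalive_time interval, stop the capture, and read it back with tcpdump -n -r /tmp/keepalive.pcap. While the peer is alive you should see an empty TCP segment (length 0) and its ACK after each tcp_keepalive_time of idleness. When the peer stops answering, the probes repeat every tcp_keepalive_intvl seconds. If you see them, keepalive is on. If you don’t, it isn’t.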

Method 2: kill a pod and time recovery

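In staging, identify a downstream pod that your service is connected to and make it disappear abruptly. A plain kubectl delete pod is not enough: like scaling the deployment to zero, it sends SIGTERM, which triggers graceful shutdown and FIN packets. Use kubectl delete pod --grace-period=0 --force, or better, drop traffic to the pod (a network policy or an iptables DROP rule) so that no FIN or RST can reach the client. Time how long your service takes to start serving successful requests again.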

Without keepalive: 30 seconds (your application timeout) or longer. With keepalive at 30s/10s/3 probes: 30 to 60 seconds. With keepalive + freeSocketTimeout: possibly faster, if the bad socket had been idle long enough to be evicted.
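
To put a number on recovery, poll the service at a fixed rate during the test and compute the outage from the recorded samples. This is a minimal sketch of the calculation only; ProbeSample and outageDurationMs are names invented here, and the samples are assumed to be in chronological order:

```typescript
// One poll result: when it was taken and whether the request succeeded.
interface ProbeSample {
  atMs: number;
  ok: boolean;
}

// Milliseconds from the first failed sample to the first successful sample
// after it. Returns null if nothing failed or the service never recovered.
export function outageDurationMs(samples: ProbeSample[]): number | null {
  const failure = samples.find((s) => !s.ok);
  if (!failure) return null;
  const recovery = samples.find((s) => s.ok && s.atMs > failure.atMs);
  return recovery ? recovery.atMs - failure.atMs : null;
}

const run: ProbeSample[] = [
  { atMs: 0, ok: true },
  { atMs: 1_000, ok: false }, // pod killed just before this poll
  { atMs: 31_000, ok: false },
  { atMs: 43_000, ok: true }, // first success after the failure
];
console.log(outageDurationMs(run)); // 42000
```

The resolution of the result is limited by the polling interval, so poll at least once per second if the target is sub-60-second recovery.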

Method 3: ss and the timer column

Linux’s ss command shows socket timers:

ss -tanio | grep 10.0.1.15:8080

Look for keepalive in the timer column. A healthy connection with keepalive enabled shows something like:

timer:(keepalive,29sec,0)

Meaning the next keepalive probe fires in 29 seconds. If you grep your services for connections that show timer:(keepalive,...), you know keepalive is active. If every connection shows timer:(timewait,...) or no timer at all, it isn’t.
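
If you want to automate that check, the timer field is easy to pull out of each line. The parser below is a small sketch that assumes the timer:(name,expiry,retrans) layout shown above; parseSsTimer and SsTimer are names made up for this example, and the sample line in the usage is illustrative, not captured output:

```typescript
export interface SsTimer {
  kind: string;        // e.g. "keepalive", "on", "timewait"
  remaining: string;   // time until the timer fires, as printed by ss
  retransmits: number; // third field of the timer tuple
}

// Extract the timer tuple from one line of `ss -tanio` output.
// Returns null when the line carries no timer.
export function parseSsTimer(line: string): SsTimer | null {
  const match = /timer:\(([a-z]+),([^,)]+),(\d+)\)/.exec(line);
  if (!match) return null;
  const [, kind = '', remaining = '', count = '0'] = match;
  return { kind, remaining, retransmits: Number(count) };
}

const sample = 'ESTAB 0 0 10.0.0.4:51234 10.0.1.15:8080 timer:(keepalive,29sec,0)';
console.log(parseSsTimer(sample)?.kind); // keepalive
```

Counting the connections to a given peer whose kind is not "keepalive" gives a quick signal for sockets that were created without SO_KEEPALIVE.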

The operational checklist

Before you declare socket hygiene done, run through this list:

Host sysctl has tcp_keepalive_time <= 600, tcp_keepalive_intvl <= 30, tcp_keepalive_probes <= 5.
Node.js Agent has keepAlive: true.
Idle pooled sockets are evicted after a bounded time (the agent timeout in core Node.js, freeSocketTimeout in agentkeepalive; 30s for most services).
Node.js Agent has timeout set to match your application SLA.
Per-socket setKeepAlive(true, delay) is called in createConnection or equivalent.
If using gRPC, keepalive PINGs are enabled with grpc.keepalive_permit_without_calls = 1.
Staging chaos test (abrupt pod kill) recovers in under 60 seconds.
ss -tanio shows keepalive timers on active connections.
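
The sysctl item on this list is easy to check from inside a running container. The sketch below compares the three values against the upper bounds in the first checklist item. The function names are invented for illustration, and the reader assumes a Linux /proc filesystem:

```typescript
import { readFileSync } from 'node:fs';

export interface KeepaliveSysctls {
  time: number;   // tcp_keepalive_time
  intvl: number;  // tcp_keepalive_intvl
  probes: number; // tcp_keepalive_probes
}

// Upper bounds from the first checklist item.
const LIMITS: KeepaliveSysctls = { time: 600, intvl: 30, probes: 5 };

// Returns one message per value that exceeds its limit; empty means pass.
export function checkKeepaliveSysctls(values: KeepaliveSysctls): string[] {
  const problems: string[] = [];
  for (const key of ['time', 'intvl', 'probes'] as const) {
    if (values[key] > LIMITS[key]) {
      problems.push(`tcp_keepalive_${key}=${values[key]} exceeds ${LIMITS[key]}`);
    }
  }
  return problems;
}

// Linux only: reads the values visible in the current network namespace.
export function readKeepaliveSysctls(dir = '/proc/sys/net/ipv4'): KeepaliveSysctls {
  const read = (name: string): number =>
    Number(readFileSync(`${dir}/tcp_keepalive_${name}`, 'utf8').trim());
  return { time: read('time'), intvl: read('intvl'), probes: read('probes') };
}

// Kernel defaults fail all three checks.
console.log(checkKeepaliveSysctls({ time: 7200, intvl: 75, probes: 9 }).length); // 3
```

Running checkKeepaliveSysctls(readKeepaliveSysctls()) at service startup and logging the result makes a missing tuning visible long before the next incident.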

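If you run in Kubernetes, remember that these sysctls are scoped to the network namespace, so a pod with its own network namespace does not pick up values set on the node. Set them in the pod spec through securityContext.sysctls (whether a given sysctl is allowed without extra kubelet configuration depends on your Kubernetes version), or from a privileged init container that runs sysctl -w inside the pod. Do not rely on node admin scripts that someone ran once and forgot to add to the node pool template.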

Common mistakes

Enabling keepalive but leaving tcp_keepalive_time at 7200. This is the default on most Linux distributions. You enabled keepalive, but the first probe won’t fire for two hours. In a container that restarts every few days, you may never see a probe. Always tune the sysctl.

Setting freeSocketTimeout to a very low value (5 seconds). This causes excessive connection churn, burning CPU on TLS handshakes and reducing throughput. 30 seconds is a balanced default for internal services. For public-facing APIs with highly variable latency, 15 seconds may be appropriate.

Forgetting that keepalive only fires on idle connections. If your service sends data every 5 seconds, TCP will not probe. That is fine in most cases, because if data is flowing and ACKs stop, the normal retransmit logic catches the failure. But for gRPC and HTTP/2, use application-layer PINGs (gRPC keepalive, HTTP/2 ping frames) because a single stalled stream may not trigger TCP-level detection if other streams on the same socket are healthy.

Assuming Kubernetes readiness probes fix this. They don’t. A readiness probe checks whether a pod should receive traffic. It does not check whether existing TCP connections to a dead pod are being cleaned up. The load balancer stops sending new connections to a failed pod, but existing persistent connections in another service’s pool are unaffected.

When not to bother

Keepalive tuning is unnecessary for:

Short-lived scripts and CLI tools that make one request and exit.
Services behind an API gateway that creates a new TCP connection per request (rare for HTTP/1.1 keepalive, but some proxy configs do this).
Serverless functions that cold-start on every request and never reuse connections.

If your connections live for more than a few seconds and are reused by a pool, keepalive matters.

The takeaway

Silent TCP peer death is one of the most expensive failure modes in microservices because it doesn’t look like a failure. The socket is open. No errors fire. The application timeout eventually kills the request, but the socket goes back into the pool and poisons the next request. Repeat until the pool is full of ghosts and your latency graph looks like a cliff.

TCP keepalive is not a new feature. It is a 40-year-old protocol mechanism that most production systems have disabled by default or misconfigured into uselessness. Turning it on, tuning the intervals for sub-60-second detection, pairing it with freeSocketTimeout for pool hygiene, and validating with tcpdump and chaos tests is an afternoon of work that eliminates an entire class of incidents.

Your connection pool should contain connections to services that are actually alive. That shouldn’t be a radical idea. But after you watch a 28-second p99 caused by a socket to a pod that died six minutes ago, you start treating every ESTABLISHED connection as potentially guilty until proven responsive.

A note from Yojji

The kind of production work that separates a functioning system from a reliable one is rarely the flashy rewrite. It is the afternoon spent tuning kernel sysctls, verifying that ss -tanio shows keepalive timers, and proving with a staged pod kill that your service recovers in 30 seconds instead of 30 minutes. That kind of unglamorous infrastructure discipline is what Yojji’s teams build into the platforms they ship.
