DNS Caching in Node.js: The Silent Cause of Production Latency Spikes
Your downstream API is healthy but some requests hang for 5 seconds before a timeout. The problem is not the network, the target, or the client. It is DNS resolution, and Node.js does not cache it by default. Here is how to fix it.
You triple-checked the downstream service. Its p99 is 12 ms. Your API timeout is 5 seconds. Yet every few minutes a request logs ConnectTimeoutError or ETIMEDOUT and you have no idea why. You scale pods, retry harder, blame the cloud provider. The real problem is in a layer most Node.js engineers never think about: your process is resolving the same hostname thousands of times per second, and the OS resolver is drowning.
Node.js does not cache DNS lookups. Every new connection opened for a fetch to api.internal.example.com, every Redis connection to redis.production.local, every Postgres connection string parsed and opened, calls getaddrinfo unless something stops it. On a busy service making hundreds of outbound requests per second, you are hammering the OS resolver, the upstream DNS server, and quite possibly tying up the libuv thread pool that the rest of your process depends on. This post shows how to confirm it, fix it in application code, and monitor it.
Why Node.js does not cache DNS
Node.js delegates DNS resolution to the operating system through getaddrinfo(3). That is correct for portability, but it means Node has no built-in cache for the results. If your code opens ten thousand connections per minute to api.example.com, getaddrinfo runs ten thousand times per minute. The OS may provide its own cache (for example through systemd-resolved or nscd), but minimal container images often ship without one, and where one exists it is shared across every process on the machine. On Kubernetes, the cluster's kube-dns or CoreDNS pods see a firehose of identical queries from every container.
Worse, dns.lookup is not async in the way you think. It runs the blocking getaddrinfo call on libuv's thread pool, which defaults to four threads and is shared with file system, crypto, and zlib work. Only a handful of DNS lookups can run in parallel. Extra requests queue. On a high-throughput service, that queue is a bottleneck that looks like network latency but is actually local resource exhaustion. The dns.resolve family avoids the thread pool by using the c-ares library, but net.connect and everything built on it call dns.lookup by default, and the fundamental problem remains on both paths: there is no TTL-aware application-level cache.
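To see why a few slow lookups can stall everything queued behind them, here is a back-of-the-envelope sketch. It assumes libuv's default pool of four threads (adjustable through UV_THREADPOOL_SIZE) and a resolver that gives up after 5 seconds, which is the glibc default timeout. The function name and its inputs are hypothetical, and the model ignores everything else competing for the pool.

```javascript
// Rough model: a burst of lookups is served in "waves" of threadPoolSize.
// Returns how long the last lookup in the burst waits before it even starts.
function worstCaseQueueWaitMs({ burstSize, lookupMs, threadPoolSize = 4 }) {
  const wavesBeforeLast = Math.ceil(burstSize / threadPoolSize) - 1;
  return Math.max(0, wavesBeforeLast) * lookupMs;
}

// 40 simultaneous lookups at 2 ms each: nine waves run before the last one.
console.log(worstCaseQueueWaitMs({ burstSize: 40, lookupMs: 2 })); // 18

// The same burst while the resolver is timing out at 5 s per lookup.
console.log(worstCaseQueueWaitMs({ burstSize: 40, lookupMs: 5000 })); // 45000
```

The second number is the point: the downstream can be perfectly healthy while a connection attempt sits in a local queue for longer than your whole request timeout.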
The symptoms are predictable once you know what to look for:
- Intermittent ConnectTimeoutError or ETIMEDOUT on perfectly healthy downstreams.
- Spikes that do not correlate with request rate, CPU, or memory.
- Multiple pods failing at the same time, suggesting overload of a shared DNS server.
- Timeouts that disappear immediately when you switch to IP addresses.
- CoreDNS logs showing thousands of identical queries for the same hostname per minute.
If any of that sounds familiar, you are probably running without DNS caching.
Measuring DNS resolution time in production
You cannot fix what you do not measure. The quickest production diagnostic is instrumenting the dns module directly:
import dns from 'node:dns';
import { performance } from 'node:perf_hooks';

const originalLookup = dns.lookup;

// In production, prefer a proper histogram. This is the minimal version.
function instrumentedLookup(hostname, options, callback) {
  const start = performance.now();
  // dns.lookup(hostname, callback) is a valid call shape, so normalize it
  if (typeof options === 'function') {
    callback = options;
    options = {};
  }
  return originalLookup(hostname, options, (err, address, family) => {
    const duration = performance.now() - start;
    // Ship this to your metrics pipeline
    console.log(JSON.stringify({
      event: 'dns_lookup_timing',
      hostname,
      durationMs: Math.round(duration * 100) / 100,
      cached: false,
      timestamp: new Date().toISOString()
    }));
    callback(err, address, family);
  });
}

dns.lookup = instrumentedLookup;
On a healthy system, DNS lookup for a warm hostname should be sub-millisecond. If you see p50 above 5 ms, or p99 above 50 ms, your resolver is overloaded. If you see durations of 100 ms or more, DNS is queuing or timing out and retrying. That is the smoking gun.
Do not leave the monkey-patch in production. Use it for a single deploy to confirm the diagnosis, then fix the root cause.
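The p50 and p99 thresholds above need percentiles, not individual log lines. A minimal sketch of an in-memory recorder follows, assuming you call record() from the instrumented callback. LatencyRecorder and dnsLatency are hypothetical names, and a real metrics library with bucketed histograms is the better long-term choice.

```javascript
// Keeps a sliding window of recent durations and computes percentiles on demand.
class LatencyRecorder {
  constructor(maxSamples = 10_000) {
    this.maxSamples = maxSamples;
    this.samples = [];
  }

  record(durationMs) {
    // Drop the oldest sample once the window is full.
    if (this.samples.length >= this.maxSamples) this.samples.shift();
    this.samples.push(durationMs);
  }

  // Nearest-rank percentile for p in (0, 100]; undefined when there is no data.
  percentile(p) {
    if (this.samples.length === 0) return undefined;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
  }

  summary() {
    return {
      count: this.samples.length,
      p50: this.percentile(50),
      p99: this.percentile(99),
    };
  }
}

const dnsLatency = new LatencyRecorder();
// Inside the instrumented callback: dnsLatency.record(duration);
// On a timer or a debug endpoint:   console.log(dnsLatency.summary());
```

Sorting on every call is fine for a diagnostic that runs once a minute. It is not something to put on a hot path.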
The first fix: persistent connections with keep-alive
The cheapest DNS cache is to never look up the hostname again. HTTP keep-alive and connection pooling reuse the same TCP connection for many requests, which means the hostname is resolved once per connection instead of once per request. This is not a DNS cache per se, but it reduces the lookup rate by orders of magnitude.
If you use undici (which powers Node.js native fetch since Node 18), configure a Pool with keep-alive:
import { Pool } from 'undici';

const pool = new Pool('https://api.example.com', {
  connections: 50,
  keepAliveTimeout: 30000,
  keepAliveMaxTimeout: 60000,
});

// Use pool.request(...) or assign it to a custom fetch dispatcher
If you use node:http, node:https, or axios, set keepAlive: true on the Agent. Use the agent that matches the URL scheme: node:http rejects an https:// URL.
import https from 'node:https';

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 30000,
});

// Pass agent to every request
https.get('https://api.example.com/data', { agent }, (res) => { /* ... */ });
Keep-alive alone often drops DNS lookups from thousands per minute to dozens. If that solves the problem, great. But it only works for HTTP. Database drivers, Redis clients, gRPC channels, and direct TCP connections need their own pooling. If any of those reconnect frequently, you still have a DNS storm.
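One way to convince yourself of the effect without touching real DNS is to count lookups against a local server. The sketch below is self-contained and uses only node:http. The names service.test, countingLookup, and countLookups are made up for the demo, and the lookup function always answers 127.0.0.1 instead of resolving anything.

```javascript
import http from 'node:http';

// Stand-in for DNS: counts calls and always answers with loopback.
let lookupCalls = 0;
function countingLookup(hostname, options, callback) {
  lookupCalls++;
  const record = { address: '127.0.0.1', family: 4 };
  // Callers that pass all: true expect an array of records.
  if (options.all) callback(null, [record]);
  else callback(null, record.address, record.family);
}

function get(url, agent) {
  return new Promise((resolve, reject) => {
    http.get(url, { agent, lookup: countingLookup }, (res) => {
      res.resume(); // drain the body so the socket can go back to the pool
      res.on('end', resolve);
    }).on('error', reject);
  });
}

// Sends sequential requests through one agent and reports how many lookups ran.
async function countLookups(keepAlive, requests = 5) {
  lookupCalls = 0;
  const server = http.createServer((req, res) => res.end('ok'));
  await new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));
  const agent = new http.Agent({ keepAlive, maxSockets: 1 });
  const url = `http://service.test:${server.address().port}/`;
  for (let i = 0; i < requests; i++) await get(url, agent);
  agent.destroy();
  await new Promise((resolve) => server.close(resolve));
  return lookupCalls;
}

// Usage: console.log(await countLookups(true), await countLookups(false));
```

With keepAlive enabled the agent should report a single lookup for all five requests, while the agent without it performs one lookup per request. The same ratio holds against a real hostname. The difference is that each of those lookups then costs a trip through getaddrinfo.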
The real fix: a TTL-aware DNS cache
For services where connections churn, or where you connect to many different hosts, you need an application-level DNS cache that respects TTL. Node.js does not ship one, but you can build it in under 50 lines.
The approach: cache the resolved IP, respect a TTL (or a conservative hard-coded one if you do not want to parse DNS records), and evict entries on expiry. Use it inside a custom lookup function that you pass to your HTTP agent, database driver, or Redis client.
Here is a minimal cache that works as a starting point:
import dns from 'node:dns';
import { promisify } from 'node:util';

const dnsLookup = promisify(dns.lookup);

class DnsCache {
  constructor({ defaultTtlMs = 60_000, maxEntries = 1000 } = {}) {
    this.cache = new Map();
    this.defaultTtlMs = defaultTtlMs;
    this.maxEntries = maxEntries;
  }

  async lookup(hostname, options = {}) {
    const now = Date.now();
    // Key on the address family too, so an IPv4 answer is never
    // served to a caller that asked for IPv6
    const key = `${hostname}|${options.family ?? 0}`;
    const cached = this.cache.get(key);
    if (cached && cached.expiresAt > now) {
      return { address: cached.address, family: cached.family };
    }

    // Always request a single address so the result is { address, family }
    const result = await dnsLookup(hostname, { ...options, all: false });

    // Drop any expired entry for this key, then evict the oldest
    // insertion if we are at capacity (FIFO, not true LRU)
    this.cache.delete(key);
    if (this.cache.size >= this.maxEntries) {
      const firstKey = this.cache.keys().next().value;
      this.cache.delete(firstKey);
    }
    this.cache.set(key, {
      address: result.address,
      family: result.family,
      expiresAt: now + this.defaultTtlMs,
    });
    return result;
  }

  get size() {
    return this.cache.size;
  }
}

const dnsCache = new DnsCache({ defaultTtlMs: 30_000 });
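You can pass this cache to any library that accepts a custom lookup function. For undici agents (node:http and node:https agents accept the same function through their lookup option):
import { Agent } from 'undici';

const agent = new Agent({
  connect: {
    lookup: (hostname, options, callback) => {
      dnsCache.lookup(hostname, { family: options.family }).then(
        ({ address, family }) => {
          // Callers that pass all: true (net.connect does this when
          // autoSelectFamily is enabled) expect an array of records
          if (options.all) callback(null, [{ address, family }]);
          else callback(null, address, family);
        },
        (err) => callback(err),
      );
    },
  },
});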
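For ioredis or other Redis clients, the same adapter applies when the client forwards a custom lookup option to the underlying socket. Check the documentation for your client version, because not every client does:
import Redis from 'ioredis';

const redis = new Redis({
  host: 'redis.production.local',
  lookup: (hostname, options, callback) => {
    dnsCache.lookup(hostname, { family: options.family }).then(
      ({ address, family }) => {
        // Callers that pass all: true expect an array of records
        if (options.all) callback(null, [{ address, family }]);
        else callback(null, address, family);
      },
      (err) => callback(err),
    );
  },
});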
This cache is intentionally simple. It does not parse real DNS TTLs, but a 30-second default TTL is usually safe for internal services. For public hostnames where A records may shift, monitor cache hit rates and lower TTL accordingly.
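One gap worth knowing about in a simple cache like this: when an entry expires on a busy service, every request that arrives before the fresh answer comes back triggers its own lookup. A common remedy is to share a single in-flight promise per hostname. The sketch below shows the idea with a fake resolver so it runs without DNS. coalesce, fakeResolver, and lookupOnce are hypothetical names, and the key is the hostname only, for brevity.

```javascript
// Wraps any async resolver so concurrent calls for the same hostname
// share one pending promise instead of each hitting the resolver.
function coalesce(resolver) {
  const inFlight = new Map();
  return function coalesced(hostname) {
    const pending = inFlight.get(hostname);
    if (pending) return pending;
    const promise = Promise.resolve()
      .then(() => resolver(hostname))
      // Clear the slot on success or failure so the next call starts fresh.
      .finally(() => inFlight.delete(hostname));
    inFlight.set(hostname, promise);
    return promise;
  };
}

// Demo resolver: no real DNS involved, it only counts how often it is called.
let upstreamCalls = 0;
const fakeResolver = async (hostname) => {
  upstreamCalls++;
  return { address: '10.0.0.1', family: 4 };
};

const lookupOnce = coalesce(fakeResolver);
// 100 concurrent lookupOnce('api.internal.example.com') calls reach
// fakeResolver once; a later call after they settle reaches it again.
```

In a real service the wrapped resolver would be the cache's own miss path, so that a hundred simultaneous misses become one getaddrinfo call.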
Getting TTL from the DNS layer
If you want to be more precise, use dns.resolve4 or dns.resolve6 instead of dns.lookup. They query the DNS server directly and, with the ttl option, return the TTL of each A or AAAA record. Keep in mind that the dns.resolve family does not read /etc/hosts or the rest of the system's name service configuration, so it can return different answers than dns.lookup.
import dns from 'node:dns';
import { promisify } from 'node:util';

const resolve4 = promisify(dns.resolve4);

async function resolveWithTtl(hostname) {
  // With ttl: true each record is { address, ttl }, with ttl in seconds
  const records = await resolve4(hostname, { ttl: true });
  return records.map((r) => ({
    address: r.address,
    ttlMs: r.ttl * 1000,
  }));
}
Using real TTLs prevents caching a record longer than the domain owner intended. For internal .local or .svc.cluster.local hostnames, the TTL is often short (5 seconds in some Kubernetes DNS setups), so a cache that honors it strictly saves very little. In practice, if the upstream DNS itself returns 5-second TTLs, pinning the IP for even 10-15 seconds reduces the query rate by 2-3x with minimal risk. Just make sure you handle failures gracefully: if a connection to a cached IP fails, evict that entry and resolve again immediately instead of waiting for it to expire.
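The "pin for a little longer than the published TTL" policy can be written as a small clamp. This is a sketch under assumptions: the 10-second floor and 60-second ceiling are illustrative defaults rather than recommendations, cacheLifetimeMs is a hypothetical name, and the input is the { address, ttl } shape that resolve4 returns with ttl: true.

```javascript
// Picks one cache lifetime for a set of records ({ address, ttl }, ttl in seconds).
function cacheLifetimeMs(records, { minTtlMs = 10_000, maxTtlMs = 60_000 } = {}) {
  if (records.length === 0) return 0; // nothing to cache

  // The shortest-lived record decides when the whole answer goes stale.
  const shortestMs = Math.min(...records.map((r) => r.ttl * 1000));

  // Raise very short TTLs to the floor, cap very long ones at the ceiling.
  return Math.min(Math.max(shortestMs, minTtlMs), maxTtlMs);
}

console.log(cacheLifetimeMs([{ address: '10.0.0.7', ttl: 5 }]));   // 10000 (floor applied)
console.log(cacheLifetimeMs([{ address: '10.0.0.7', ttl: 30 }]));  // 30000 (TTL honored)
console.log(cacheLifetimeMs([{ address: '10.0.0.7', ttl: 300 }])); // 60000 (ceiling applied)
```

The floor deliberately outlives the record's own TTL, which is exactly the trade-off described above. It is only safe if a failed connection evicts the entry right away.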
Platform-level DNS caching
Before you write a cache in every application, check whether your platform already has one. On Linux servers:
- systemd-resolved caches DNS with configurable TTLs. If it is active, getaddrinfo hits the local daemon rather than the upstream server.
- dnsmasq or unbound can run as a local caching resolver in your container or on the node.
- In Kubernetes, the default CoreDNS configuration enables a cache plugin that holds responses for up to 30 seconds. But that cache lives inside each CoreDNS pod and is shared by every client whose queries reach it. Under heavy load, it can still saturate.
Application-level caching gives you isolation. A misbehaving neighbor pod cannot evict your records from the CoreDNS cache. You also get observability: you can log cache hits, misses, and TTLs without parsing CoreDNS logs.
If you run on AWS, EC2 and Fargate both cache VPC DNS internally, but the cache is small and shared. We have still seen DNS throttling on high-throughput Node.js services that rely on it alone. The application cache is the final defense.
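To check which resolver a Linux host or container actually uses, reading /etc/resolv.conf is usually enough. The sketch below parses that file's format into the fields that matter here. parseResolvConf and usesLocalResolver are hypothetical helpers, the parser handles only the nameserver, search, and options keywords, and the sample text is illustrative, not taken from a real cluster.

```javascript
// Parses resolv.conf-style text into nameservers, search domains, and options.
function parseResolvConf(text) {
  const config = { nameservers: [], search: [], options: {} };
  for (const rawLine of text.split('\n')) {
    const line = rawLine.replace(/[#;].*$/, '').trim(); // strip comments
    if (line === '') continue;
    const [keyword, ...values] = line.split(/\s+/);
    if (keyword === 'nameserver') {
      config.nameservers.push(...values);
    } else if (keyword === 'search') {
      config.search = values;
    } else if (keyword === 'options') {
      for (const opt of values) {
        const [name, value] = opt.split(':');
        // Flags like "rotate" have no value; "ndots:5" becomes a number.
        config.options[name] = value === undefined ? true : Number(value);
      }
    }
  }
  return config;
}

// A loopback nameserver (systemd-resolved listens on 127.0.0.53) suggests
// a local caching resolver sits in front of the upstream DNS server.
function usesLocalResolver(config) {
  return config.nameservers.some((ns) => ns.startsWith('127.') || ns === '::1');
}

const sample = [
  'search default.svc.cluster.local svc.cluster.local cluster.local',
  'nameserver 10.96.0.10',
  'options ndots:5',
].join('\n');
console.log(usesLocalResolver(parseResolvConf(sample))); // false
```

In a real process you would feed it the contents of /etc/resolv.conf. A high ndots value combined with several search domains is also worth noticing, because the resolver may try a short name against each search domain before querying it as written, which multiplies the query rate.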
Monitoring and alerting
Once you deploy a cache, you want to know it is working. Export these metrics:
class DnsCache {
  constructor(options) {
    this.cache = new Map();
    this.defaultTtlMs = options?.defaultTtlMs ?? 60_000;
    this.maxEntries = options?.maxEntries ?? 1000;
    this.hits = 0;
    this.misses = 0;
  }

  async lookup(hostname, options) {
    const now = Date.now();
    const cached = this.cache.get(hostname);
    if (cached && cached.expiresAt > now) {
      this.hits++;
      return { address: cached.address, family: cached.family };
    }
    this.misses++;
    const result = await dnsLookup(hostname, options);
    /* ... */
    return result;
  }

  stats() {
    const total = this.hits + this.misses;
    return {
      hits: this.hits,
      misses: this.misses,
      // Percentage with one decimal place
      hitRate: total === 0 ? 0 : Math.round((this.hits / total) * 1000) / 10,
      size: this.cache.size,
    };
  }
}
Expose stats() on a /health or /metrics endpoint. Alert if:
- hitRate drops below 70% after the cache has warmed up. That means TTLs are too short, or connections are churning too fast.
- misses spike suddenly. That suggests a deploy cleared the cache, or a hostname is failing to resolve and retries are bypassing the cache.
- DNS lookup duration from the instrumented version goes above 50 ms. If lookups are slow even with a local cache, the OS resolver is the bottleneck.
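If your /metrics endpoint is scraped by Prometheus, the counters from stats() map directly onto its text exposition format. A minimal sketch follows, assuming the { hits, misses, size } shape returned above. The metric names and the renderDnsCacheMetrics helper are made up for illustration.

```javascript
// Renders cache stats as Prometheus text exposition format.
function renderDnsCacheMetrics({ hits, misses, size }) {
  return [
    '# TYPE dns_cache_hits_total counter',
    `dns_cache_hits_total ${hits}`,
    '# TYPE dns_cache_misses_total counter',
    `dns_cache_misses_total ${misses}`,
    '# TYPE dns_cache_entries gauge',
    `dns_cache_entries ${size}`,
    '', // the exposition format expects a trailing newline
  ].join('\n');
}

// Example: respond to GET /metrics with
//   renderDnsCacheMetrics(dnsCache.stats())
// and Content-Type: text/plain; version=0.0.4
```

Exporting the raw counters instead of the precomputed hitRate lets the monitoring system compute the ratio over whatever window the alert needs, so a long-running process does not drown a recent drop in its lifetime average.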
A full example: wrapping a service client
Here is how to wire the cache into a real service-client pattern:
import { Agent, request } from 'undici';

const dnsCache = new DnsCache({ defaultTtlMs: 30_000 });

const agent = new Agent({
  connections: 50,
  connect: {
    lookup: (hostname, options, callback) => {
      dnsCache.lookup(hostname, { family: options.family }).then(
        ({ address, family }) => {
          // Callers that pass all: true expect an array of records
          if (options.all) callback(null, [{ address, family }]);
          else callback(null, address, family);
        },
        (err) => callback(err),
      );
    },
  },
});

async function fetchUser(userId) {
  const { body } = await request(
    `https://users-api.internal/users/${userId}`,
    { dispatcher: agent, method: 'GET' }
  );
  return body.json();
}
The first request to users-api.internal resolves the hostname and caches the IP for 30 seconds. Every subsequent request in that window reuses the cached address. If the keep-alive pool also holds the TCP connection open, the total DNS lookup rate drops to near zero.
The decision tree
| Situation | Fix |
|---|---|
| High-rate HTTP calls to one hostname | Keep-alive pool first, DNS cache as backup |
| Frequent database or Redis reconnects | Custom lookup with TTL cache in the client config |
| Node 18 using native fetch without dispatcher | Switch to undici Agent or global fetch dispatcher override |
| Multiple hostnames, short-lived connections per host | TTL-aware dns.lookup cache is essential |
| Kubernetes with very high pod density | Add application cache even if CoreDNS caching is on |
| Debugging unexplained connect timeouts on healthy targets | Instrument dns.lookup duration to confirm DNS is the cause |
The takeaway
DNS is not free. On a busy Node.js service, it is a hidden tax on every outbound connection, and the default stack does nothing to amortize it. The symptoms are maddening: random timeouts on healthy downstreams, latency spikes with no CPU profile blame, and errors that vanish when you switch to raw IPs.
Measure it first: patch dns.lookup for one deploy and log the duration. If p50 is above 5 ms or p99 is above 50 ms, you have a DNS problem. Fix it with keep-alive and connection pooling to eliminate redundant lookups, then add a TTL-aware application cache for anything that still reconnects frequently. Most services can drop their DNS query rate by 100x with fewer than 80 lines of code.
Your downstream APIs are fast. Your network is fine. Make sure your process is not spending its time asking the same question over and over.
A note from Yojji
Infrastructure reliability is about fixing the layers no one talks about until they break at 2 AM. DNS caching, connection pooling, and request path instrumentation are exactly the kind of operational detail Yojji engineers build into production systems from the start.
Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their team of 50+ senior engineers has completed hundreds of projects using Node.js, TypeScript, and cloud-native architecture. If your team is dealing with unexplained latency spikes or building high-throughput microservices, Yojji can help you ship infrastructure that stays fast under real load.