Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service
When a downstream service slows from 50ms to 5s, your service inherits the latency, then runs out of connections, then takes everything else with it. A circuit breaker is the 50 lines that say “I will stop calling you for 30 seconds and let you recover.” Here is the implementation, the three states, and the four metrics worth alerting on.
The third-party email API normally responds in 80 ms. Today it is responding in 8 seconds — when it responds at all. Your service, which calls the email API on every signup, is now holding 200 requests open waiting on a doomed network call. The Node.js process exhausts its outbound HTTP agent pool. New incoming requests pile up behind the stuck ones. CPU is fine, memory is fine, the application is just standing in line. From the outside it looks like your service is broken.
The fix is the circuit breaker pattern. When a downstream service starts failing, your service stops calling it for a while, returns a fast fallback, and gives the dependency time to recover. About 50 lines of Node.js. The bar to add it is whether the dependency has ever had a multi-minute outage. (Spoiler: every dependency has.)
The three states
A circuit breaker is a tiny state machine wrapping a function:
- Closed. All calls go through normally. Failures are counted.
- Open. Failures crossed the threshold. New calls fail immediately without making the network request — for a cooldown period.
- Half-open. Cooldown elapsed. The next call is allowed through as a probe. If it succeeds, the breaker closes. If it fails, the breaker re-opens.
[ Closed ] ── failure rate too high ──> [ Open ]
    ▲                                      │
    │                                      │ cooldown elapsed
    │ probe ok                             ▼
    └─────────────────────────────── [ Half-Open ]
                                           │
                                           │ probe failed
                                           ▼
                                        [ Open ]
The point of half-open is that you do not flip from “everything blocked” to “everything allowed” — you let one or two requests through to test the water. If the dependency is still broken, you do not flood it with retries.
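The same three states can be written down as a transition table, which is a handy way to check that no transition has been forgotten. This is a minimal sketch: the event names (`threshold-exceeded`, `cooldown-elapsed`, `probe-ok`, `probe-failed`) and the `nextState` helper are made up for illustration and are not part of any library.

```typescript
type State = 'closed' | 'open' | 'half-open';

// Hypothetical event names, chosen to match the arrows in the diagram.
type BreakerEvent = 'threshold-exceeded' | 'cooldown-elapsed' | 'probe-ok' | 'probe-failed';

// Allowed transitions. Any (state, event) pair not listed leaves the state unchanged.
const transitions: Record<State, Partial<Record<BreakerEvent, State>>> = {
  'closed': { 'threshold-exceeded': 'open' },
  'open': { 'cooldown-elapsed': 'half-open' },
  'half-open': { 'probe-ok': 'closed', 'probe-failed': 'open' },
};

function nextState(state: State, event: BreakerEvent): State {
  return transitions[state][event] ?? state;
}
```

Note that there is no direct path from open to closed: the only way back is through a successful half-open probe.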
A 50-line implementation
type State = 'closed' | 'open' | 'half-open';

interface BreakerOptions {
  failureThreshold: number; // e.g., 0.5 (50% errors)
  windowSize: number;       // e.g., 20 calls (sample size)
  cooldownMs: number;       // e.g., 30_000 (wait before half-open probe)
  timeoutMs: number;        // e.g., 3_000 (per-call timeout)
}

export class CircuitBreaker<T extends (...a: any[]) => Promise<any>> {
  private state: State = 'closed';
  private results: boolean[] = []; // recent successes (true) / failures (false)
  private nextAttemptAt = 0;

  constructor(private readonly fn: T, private readonly opts: BreakerOptions) {}

  async call(...args: Parameters<T>): Promise<Awaited<ReturnType<T>>> {
    if (this.state === 'open') {
      if (Date.now() < this.nextAttemptAt) throw new BreakerOpenError();
      this.state = 'half-open'; // cooldown elapsed: this call becomes the probe
    } else if (this.state === 'half-open') {
      throw new BreakerOpenError(); // a probe is already in flight; keep rejecting
    }
    try {
      const result = await this.withTimeout(this.fn(...args));
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  // Rejects after timeoutMs. This stops waiting; it does not cancel the underlying call.
  private async withTimeout<R>(p: Promise<R>): Promise<R> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new TimeoutError()), this.opts.timeoutMs);
    });
    try {
      return await Promise.race([p, timeout]);
    } finally {
      clearTimeout(timer); // do not leave a timer pending after the call settles
    }
  }

  private onSuccess() {
    if (this.state === 'half-open') {
      this.state = 'closed';
      this.results = []; // fresh window, so old failures cannot re-trip the breaker at once
    }
    this.record(true);
  }

  private onFailure() {
    this.record(false);
    if (this.state === 'half-open' || this.shouldOpen()) {
      this.state = 'open';
      this.nextAttemptAt = Date.now() + this.opts.cooldownMs;
    }
  }

  private record(ok: boolean) {
    this.results.push(ok);
    if (this.results.length > this.opts.windowSize) this.results.shift();
  }

  private shouldOpen(): boolean {
    // Wait for a full window before judging, so one early failure cannot trip the breaker.
    if (this.results.length < this.opts.windowSize) return false;
    const failures = this.results.filter(r => !r).length;
    return failures / this.results.length >= this.opts.failureThreshold;
  }
}

export class BreakerOpenError extends Error { constructor() { super('breaker open'); } }
export class TimeoutError extends Error { constructor() { super('timed out'); } }
That is the entire breaker. Three states, a sliding window of recent results, a cooldown clock, and a timeout. About 50 lines.
Wiring it in
import { CircuitBreaker, BreakerOpenError } from './circuit-breaker';
import { sendEmailViaProvider } from './email';
// Assumed location of the outbox helper used in the fallback; adjust to your project.
import { queueEmailForRetry } from './email-queue';

const emailBreaker = new CircuitBreaker(sendEmailViaProvider, {
  failureThreshold: 0.5,
  windowSize: 20,
  cooldownMs: 30_000,
  timeoutMs: 3_000,
});

export async function sendWelcomeEmail(userId: string) {
  try {
    await emailBreaker.call({ userId, template: 'welcome' });
  } catch (err) {
    if (err instanceof BreakerOpenError) {
      // Fallback: queue for retry instead of dropping or blocking the request.
      await queueEmailForRetry({ userId, template: 'welcome' });
      return;
    }
    throw err; // timeouts and provider errors still surface to the caller
  }
}
Notice the fallback. A breaker without a fallback is just a faster way to fail. The shapes of useful fallbacks:
- Queue for retry. Put the work in an outbox / job queue and process it later when the dependency recovers.
- Serve stale. If the dependency was a read, return the last cached value with a stale=true flag.
- Degrade. If the dependency was non-essential (recommendations, analytics), skip it.
- Static fallback. Return a generic answer (“loading…”, “no related items”).
The fallback is what turns “circuit breaker tripped” from “this feature is broken” into “this feature is gracefully degraded.”
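As an illustration of the "serve stale" shape, here is a minimal sketch of a wrapper that remembers the last good value per key and returns it, flagged as stale, when the read fails. The names `withStaleFallback` and `Stale` are hypothetical, and the `read` function is assumed to be whatever you would normally call through the breaker. The in-memory `Map` is unbounded, so a real version would want eviction.

```typescript
interface Stale<V> {
  value: V;
  stale: boolean; // true when the value came from the last successful read
}

// Wraps an async read. On success it caches the value; on failure it serves
// the cached value marked stale, or rethrows if nothing was ever cached.
function withStaleFallback<K, V>(read: (key: K) => Promise<V>) {
  const lastGood = new Map<K, V>();

  return async (key: K): Promise<Stale<V>> => {
    try {
      const value = await read(key);
      lastGood.set(key, value);
      return { value, stale: false };
    } catch (err) {
      if (lastGood.has(key)) {
        return { value: lastGood.get(key) as V, stale: true };
      }
      throw err; // no cached value yet, so there is nothing to fall back to
    }
  };
}
```

Because the wrapper catches any error, it treats a breaker rejection and a real failure the same way, which is usually what you want for reads.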
The numbers that matter
Default settings that work for most HTTP dependencies:
- failureThreshold: 0.5. Open when half the recent calls failed.
- windowSize: 20. Sample size. Smaller windows are jumpy; larger windows are slow to react.
- cooldownMs: 30_000. 30 seconds open before the next probe. Long enough that a downstream blip clears, short enough that recovery is quick.
- timeoutMs: a little above the slowest legitimate response. The timeout is what makes slow calls count as failures quickly. Without it, slow calls do not trip the breaker.
A surprising one: timeoutMs is often the most important. A downstream that responds in 8s instead of 80ms is, for your purposes, broken — but unless you time it out, it never registers as a failure. A 3-second timeout against a service that should respond in <500ms is a reasonable default.
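One caveat about timeouts built on `Promise.race`: they stop waiting, but the underlying request keeps running and keeps holding its socket. A sketch of a timeout that also signals cancellation, using the standard `AbortController`, might look like the following. `callWithDeadline` and `DeadlineError` are hypothetical names, and the wrapped function is assumed to honor the `AbortSignal` it receives.

```typescript
class DeadlineError extends Error {
  constructor(ms: number) {
    super(`no response within ${ms} ms`);
    this.name = 'DeadlineError';
  }
}

// Runs fn with an AbortSignal and rejects with DeadlineError after timeoutMs.
async function callWithDeadline<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so the caller sees DeadlineError rather than the abort error.
      reject(new DeadlineError(timeoutMs));
      controller.abort(); // then tell the underlying call to stop
    }, timeoutMs);
  });

  try {
    return await Promise.race([fn(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // never leave a timer pending after the call settles
  }
}
```

With `fetch`, usage would be along the lines of `callWithDeadline((signal) => fetch(url, { signal }), 3_000)`, where `url` stands in for your dependency's endpoint.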
What NOT to wrap in a breaker
A breaker is for calls that can fail. A few things should not go through one:
- Calls to your own database. If your DB is broken, your service is broken. Fallbacks do not help; degrade more carefully.
- Calls inside a critical path with no useful fallback. If you have nothing to do when the breaker opens, the breaker is just adding complexity. Either find a fallback or accept the dependency.
- Long-running streams. Breakers wrap a single call; they do not understand a stream’s lifecycle.
- Idempotent retries. If your retry logic already backs off exponentially and times out, adding a breaker on top double-counts failures.
The right targets: third-party HTTP calls, downstream microservices, queue producers, anything where “we’ll try again in 30 seconds” is a sensible response to current failures.
Bulkheads, alongside breakers
A breaker stops calls from reaching a stuck dependency. A bulkhead limits the concurrency of calls before they reach the breaker. They are complementary: the breaker says “stop calling,” the bulkhead says “no more than 50 calls at a time.”
import pLimit from 'p-limit';
const limit = pLimit(50);
await limit(() => emailBreaker.call(...));
p-limit is enough for in-process bulkheads. With both in place, a single bad dependency cannot consume more than 50 of your worker slots, and after a few failures it stops consuming any.
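If you would rather not add a dependency, or you want callers rejected instead of queued without bound, a bulkhead is small enough to write by hand. The following is a minimal sketch under that assumption; `Bulkhead`, `BulkheadFullError`, and the `maxQueued` parameter are hypothetical names, and the reject-when-full behavior is a design variation rather than what p-limit does.

```typescript
class BulkheadFullError extends Error {
  constructor() {
    super('bulkhead full');
    this.name = 'BulkheadFullError';
  }
}

class Bulkhead {
  private active = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly maxQueued: number,
  ) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // All slots busy: wait in line, or fail fast if the line is full too.
      if (this.waiting.length >= this.maxQueued) throw new BulkheadFullError();
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next();     // hand this slot directly to the next waiter
      else this.active--;   // nobody waiting: free the slot
    }
  }
}
```

Usage mirrors the p-limit version: create `new Bulkhead(50, 100)` once per dependency and pass it a function that calls the breaker.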
Observability: the four metrics
A breaker without metrics is invisible until it trips at 3 a.m. and nobody knows why. The four metrics worth emitting:
- Breaker state. Gauge: 0 closed, 1 half-open, 2 open. Plot over time. Easy to spot flapping.
- Failure rate over the window. Gauge: current failures / windowSize. Tells you how close to the threshold you are.
- Calls rejected by breaker. Counter: incremented every time a call is rejected without trying. Distinguishes “we did not try” from “we tried and failed.”
- Latency of underlying calls. Histogram. The breaker will trip when latency degrades — having the histogram next to the breaker state shows correlations.
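Alert when the breaker has stayed out of the closed state for 5 minutes. With a 30-second cooldown it flips briefly to half-open on every probe, so alert on state > 0 rather than on state == 2 exactly. That is the “something is genuinely broken downstream” signal.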
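To make the four metrics concrete, here is a minimal in-memory sketch of what a breaker could record. `BreakerMetrics`, its method names, and the latency bucket bounds are all assumptions for illustration. In production you would forward these values to whatever metrics client you already use rather than keep them in fields.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Gauge encoding from the list above: 0 closed, 1 half-open, 2 open.
const STATE_GAUGE: Record<BreakerState, number> = { 'closed': 0, 'half-open': 1, 'open': 2 };

// Histogram upper bounds in milliseconds. Assumed values; tune per dependency.
const LATENCY_BUCKETS_MS = [50, 100, 250, 500, 1000, 3000];

class BreakerMetrics {
  state = 0;        // gauge: current breaker state
  failureRate = 0;  // gauge: failures / calls in the current window
  rejected = 0;     // counter: calls refused without touching the dependency
  // One count per bucket, plus a final overflow slot for anything slower.
  readonly latency: number[] = new Array<number>(LATENCY_BUCKETS_MS.length + 1).fill(0);

  setState(s: BreakerState): void {
    this.state = STATE_GAUGE[s];
  }

  setWindow(results: boolean[]): void {
    const failures = results.filter((ok) => !ok).length;
    this.failureRate = results.length === 0 ? 0 : failures / results.length;
  }

  countRejected(): void {
    this.rejected++;
  }

  observeLatency(ms: number): void {
    const i = LATENCY_BUCKETS_MS.findIndex((bound) => ms <= bound);
    this.latency[i === -1 ? LATENCY_BUCKETS_MS.length : i]++;
  }
}
```

The natural hook points are the state changes, the window update after each call, the fast-fail path that throws without calling the dependency, and a timer around the underlying call.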
Production breakers you can buy
If you do not want to write the 50 lines, there are libraries:
- opossum — the most popular Node.js circuit breaker. Well-maintained, similar feature set to Hystrix.
- cockatiel — by a Microsoft engineer, modern API, includes retries / timeouts / bulkheads in one library.
For Java, resilience4j is the standard. Spring Cloud has built-in integration. The patterns transfer one-to-one.
I generally use opossum or cockatiel in production for the metrics and tested fallback semantics. The 50-line version is for understanding.
When the breaker is the wrong tool
Two cases.
The dependency is fundamentally broken. A breaker recovers when the dependency does. If the third-party API is down for two days, the breaker just keeps tripping and back-off does not help. You need a queue, manual ops, and an SLA conversation.
The fallback is more expensive than the call. Sometimes “degrade gracefully” is more compute than “let the user wait.” Profile both paths before adding a breaker.
For everything else — most external dependencies, most service-to-service calls — a breaker is a 50-line change that prevents one of the most common cascading failure modes.
The takeaway
A circuit breaker is one of the highest-leverage reliability investments you can make in any service that calls a service it does not own. It costs ~50 lines, prevents a stuck dependency from cascading into your service, and turns “we are down because Stripe is down” into “Stripe is down and our retry queue is filling — we will catch up in 5 minutes.”
Pick failure thresholds, cooldown, timeout, and a fallback that makes sense for the call. Wrap third-party calls and downstream services. Emit four metrics. Alert on the breaker being open. The next time a dependency has a bad afternoon, you will not have one.
A note from Yojji
The kind of resilience engineering that prevents one slow downstream from taking down a whole service — circuit breakers, bulkheads, fallbacks, the metrics that prove they work — is the kind of long-haul backend work that decides how a system behaves on its worst days. It is the kind of engineering Yojji’s teams build into the production systems they ship for clients.
Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and microservices — including the reliability engineering that decides whether your incident is a blip or a saga.