#reliability

42 posts

Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m.
A single poison message crashes your worker, the broker redelivers it, and the crash loop takes down your entire pipeline. Here is the DLQ pattern that separates bad messages from good ones, with working code for RabbitMQ and the replay strategy that turns a dead letter into a recovered system.

May 23, 2026
reliability distributed-systems messaging
File Descriptor Exhaustion: The Kernel Limit That Silently Drops Node.js Connections
Your API pods show green health checks while clients get connection refused errors. The culprit is not your application. It is the Linux file descriptor limit, and the fix is a mix of kernel tuning, pool sizing discipline, and monitoring that most teams skip.

May 23, 2026
node.js reliability infrastructure
TCP Keepalive: Detecting Dead Peers Before Your Connection Pool Drowns
Your microservice connection pool is full of zombies. TCP connections that look ESTABLISHED but lead to dead peers will hang every request you send through them. Here is the keepalive tuning, HTTP agent wiring, and kernel sysctl config that detects silent failures in seconds instead of minutes.

May 22, 2026
node.js performance reliability networking
DNS Caching in Node.js: The Silent Cause of Production Latency Spikes
Your downstream API is healthy but some requests hang for 5 seconds before a timeout. The problem is not the network, the target, or the client. It is DNS resolution, and Node.js does not cache it by default. Here is how to fix it.

May 21, 2026
node.js performance reliability
Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page
The daily report cron ran twice last Tuesday, missed Wednesday entirely, and silently failed on Thursday until a customer complained. Here is the small Postgres-backed pattern that makes scheduled tasks observable, overlap-safe, and idempotent. With working TypeScript.

May 21, 2026
node.js reliability postgres
Load Shedding in Node.js: How to Reject Traffic Before You Drown
When traffic spikes and every dependency slows down, your service queues itself to death. Here is the admission control pattern that rejects requests early, keeps latency flat, and prevents cascading failures, with the Node.js middleware you can deploy today.

May 19, 2026
node.js reliability performance
Node.js Server Timeouts: The Settings That Stop Slow Clients from Holding Sockets Hostage
Your memory stays flat but connection count climbs until new clients get refused. The culprit is almost never a leak. It is a slow client holding a socket forever because Node.js server defaults assume everyone plays nice. Here are the three timeout values that turn a slowloris attack or a runaway upload into a fast error, with the 40-line production config and the test that proves it works.

May 16, 2026
node.js reliability security
Event Sourcing with PostgreSQL: The Pragmatic 80% Solution
Your product team wants an audit trail, replayable history, and the ability to rebuild read models without running migrations on a 500GB table. Here is how to implement event sourcing in PostgreSQL without Kafka, schema registries, or six months of migration pain — just an append-only table, a projection function, and the replay logic that makes it useful.

May 15, 2026
postgres architecture reliability
The Four Timeouts Every Node.js HTTP Client Needs
A production incident walkthrough: Node.js connection pools silently fill with dead TCP sockets, every outbound request hangs forever, and your service looks down while the downstream API is healthy. Here are the four timeout values — connect, response, idle, and keepalive — with the working Agent and fetch config that prevents it.

May 15, 2026
node.js reliability networking
The Bulkhead Pattern: Why One Slow Endpoint Should Not Drown Your Whole Service
A single slow report endpoint consumed every connection in the pool, and your login API started timing out. Here is how the bulkhead pattern isolates failure domains in Node.js — with semaphores, separate pools, and the fast-fail logic that keeps the rest of your service alive.

May 14, 2026
node.js reliability architecture
Postgres Deadlocks: Logging the Victim, Reproducing the Race, and Fixing the Lock Order
The batch job runs fine locally and explodes in production with ERROR: 40P01 deadlock detected. Here is how to make Postgres tell you exactly which queries fought, how to reproduce the race in a test script, and the three lock-ordering rules that eliminate deadlocks without guesswork.

May 14, 2026
database postgres reliability
Your Node.js HTTP Client Is the Bottleneck: Connection Pool Tuning That Works
Your API health checks pass, your downstream service is fast, but p99 latency still spikes under load. The culprit is often the Node.js HTTP connection pool. Here is how to measure it, size it, and stop throwing 500s at the problem.

May 13, 2026
node.js performance reliability
Reliable Webhook Delivery: Architecture for Outbound HTTP You Can Trust
Your "just POST to the callback URL" webhooks are creating angry customers, retry storms, and silent data loss. Here is the architecture — queue, circuit breaker, dead-letter, and backoff — that turns fire-and-forget HTTP into a delivery system you can monitor and trust.

May 12, 2026
reliability api node.js
Request Timeouts and Deadline Propagation: Stop the Chain of Slowness
Why one slow dependency cascades into a site-wide outage, and how to wire deadline propagation through HTTP APIs, database queries, and background jobs so your system fails fast instead of failing everywhere.

May 12, 2026
node.js reliability distributed-systems
Webhook Signature Verification Is Not Enough: Stop Replay Attacks in Node.js
A valid webhook signature only proves who signed the payload, not that the request is fresh. Build a replay-safe Node.js webhook handler with raw-body verification, timestamp windows, idempotency, and atomic Redis locks.

May 11, 2026
api security node.js reliability
Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys
On every redeploy, your users see a 4–7 second window of 502s. Here is exactly why, the 40 lines of Node code that eliminate it, and how to verify the fix with a real load test.

May 9, 2026
node.js devops reliability
Structured Logging With Pino: The 60 Lines That Make Your 3 a.m. Debugging Possible
When a customer reports a 500 at 3 a.m., your logs decide whether you fix it in ten minutes or two hours. Here is the Pino + correlation-id setup that turns Node.js logs from a wall of text into a searchable timeline — with the queries that actually find bugs in production.

May 8, 2026
node.js observability reliability
Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice
Webhook retries silently double-charge customers, double-create resources, and turn one ticket into a refund spreadsheet. Here is the 30-line Postgres-backed middleware that makes any handler safe to retry — plus the hammer-test that proves it works.

May 6, 2026
api reliability node.js
Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming
Naive retry loops are how a single sick dependency takes the whole platform down. Here is the retry pattern that actually survives a real outage — exponential backoff with decorrelated jitter, retry budgets, deadline propagation, and the four mistakes that will turn your "self-healing" client into a self-DDoS tool.

May 4, 2026
reliability distributed-systems node.js
The Cache Stampede: Why Your "Just Add Redis" Layer Crashes Postgres at 3 a.m.
A cached endpoint quietly serves 50k req/s for weeks, until the key expires and 4,000 simultaneous misses hit Postgres in the same millisecond. Here are the 40 lines of single-flight + probabilistic early refresh that turn cache expiration from a cliff into a soft handoff, with the load-test numbers that prove it.

May 3, 2026
caching reliability performance
Postgres SKIP LOCKED: An 80-Line Job Queue You Can Run Without Redis
Most teams reach for Redis, Sidekiq, or BullMQ the moment they need background jobs. You probably do not need any of them. Here are the 80 lines of Postgres-only code that give you a multi-worker, retry-safe job queue, and the test that proves it does not double-process under load.

May 3, 2026
database postgres node.js reliability
Stop Doing Work Nobody Wants: AbortController in Node.js, Done Right
Half the CPU your API burns under load is spent on requests the client already gave up on. Here is the AbortController pattern that propagates a single cancellation signal through your entire Node.js stack — HTTP, database, fetch — with the 60 lines you actually have to write and the three traps that keep teams from getting the win.

May 2, 2026
node.js performance reliability
The Three Pillars of Observability Are A Myth: What Actually Matters In Production
The "logs, metrics, traces" framework gets repeated everywhere and obscures what observability is actually about: asking new questions of your system. Here is the alternative framing — high-cardinality events — and the practical setup that gets you the actual capability.

September 13, 2024
observability reliability sre
SQLite As Your Application Database In 2024: When It's The Right Call
Most teams reach for Postgres because "SQLite is for embedded use." That assumption is years out of date. SQLite with WAL mode and Litestream replication runs real production workloads at 50,000 writes per second. Here is when it's the right tool, the patterns that work, and the limits to know.

June 21, 2024
database sqlite reliability
The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath
Most teams ship features as “merge to main and deploy.” The result is that a bug affects 100% of users immediately. Five-stage rollouts — internal, 1%, 10%, 50%, 100% — turn “oh no” into “catch it at 1%.” Here is the working pattern, the metrics that gate each stage, and the rollback procedure.

May 24, 2024
process reliability devops
The Blameless Postmortem That Actually Improves Things: A Template And Six Hard-Won Rules
Most postmortems are theatre: a Google Doc with a timeline and three action items that nobody owns. The version that actually prevents the next incident has six properties, among them that it is blameless, focuses on the system, has owned action items, and gets shared widely. Here is the template and the rules.

April 26, 2024
reliability process sre
Pod Disruption Budgets: The K8s Object That Keeps Your Service Up During Cluster Maintenance
You set up rolling deploys carefully. Then a node drains during cluster upgrade and takes 80% of your pods at once. PodDisruptionBudget is the manifest that says “never evict more than N at a time.” Three lines of YAML, real production benefits.

January 5, 2024
kubernetes devops reliability
SLOs Without The Theatre: How To Pick Three Numbers That Actually Help
Most teams adopt SLOs by copying Google's book and end up with 30 dashboards nobody reads. The version that earns its keep is two SLIs per service, an error budget that drives real decisions, and a quarterly review. Here is the working setup and the rule that keeps SLOs from becoming bureaucracy.

December 8, 2023
reliability sre process
Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale
Most “chaos engineering” discussions are about Chaos Monkey at Netflix and have nothing to do with how a 20-engineer team should test resilience. The five drills here are practical, scoped, runnable in an afternoon, and will surface the broken assumption your monitoring missed.

November 24, 2023
reliability devops distributed-systems
Secrets Management For Real Teams: Vault, SOPS, And The .env File You Should Burn
Almost every team starts with a .env file in 1Password and ends with secrets in Slack. Here are the three credible options for production secrets — Vault, SOPS-encrypted-in-git, cloud-native (AWS/GCP) — with the trade-offs, the migration paths, and the rotation policy that survives a year.

October 27, 2023
security devops reliability
Saga Pattern vs Two-Phase Commit: Distributed Transactions Without The Lies
Two-phase commit is the textbook answer for distributed transactions. It also doesn't survive contact with real systems. The saga pattern — orchestrated or choreographed — is what production systems actually use. Here is the difference, the implementation patterns, and the compensation logic that handles the inevitable failure cases.

September 29, 2023
distributed-systems reliability data-modeling
Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works
Default HPA scales on CPU, which is wrong for most modern workloads. Memory, queue depth, request rate, and custom business metrics are what actually correlate with “need more pods.” Here is the working setup with custom metrics, the formula HPA uses, and the four mistakes that cause flapping.

September 15, 2023
kubernetes devops reliability
Distributed Locks With Redis: An Honest Look At Redlock And When You Don't Need It
Redlock is the most-recommended distributed-lock algorithm and the one with the most published criticism. The truth: simple Redis locks are fine for most teams, Redlock fixes a narrow set of failure modes most teams don't experience, and the cases where you really need correctness call for Postgres or ZooKeeper. Here is the decision tree.

August 18, 2023
distributed-systems redis reliability
Postgres Streaming Vs. Logical Replication: Which One Solves Your Actual Problem
Postgres has two replication systems and most teams cannot articulate the difference. Streaming gives you a hot standby identical to the primary; logical lets you replicate selected tables to a different schema or major version. Here is the decision tree, the operational gotchas of each, and a realistic answer for which one you actually need.

May 26, 2023
database postgres reliability
Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All
Renaming a column on a 50-million-row table looks like a one-line SQL change and is actually a six-step deploy spread across two PRs. Here is the pattern — expand, migrate, contract — applied to renames, type changes, and NOT NULL backfills, with the locks each step takes and the rollback at every stage.

March 3, 2023
database postgres devops reliability
Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service
When a downstream service slows from 50ms to 5s, your service inherits the latency, then runs out of connections, then takes everything else with it. A circuit breaker is the 50 lines that say “I will stop calling you for 30 seconds and let you recover.” Here is the implementation, the three states, and the four metrics worth alerting on.

February 17, 2023
reliability node.js distributed-systems
Kubernetes Liveness And Readiness Probes: The Difference That Causes Half Your Outages
Most teams configure liveness and readiness probes identically and wonder why a slow database makes Kubernetes restart their pods in a death spiral. Here is what each probe is actually for, the right endpoint shape for each, and the four-line config that turns an outage into a non-event.

January 20, 2023
kubernetes devops reliability
Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis
The “sliding window” rate limiter every tutorial shows you breaks at scale. Token bucket is the algorithm real APIs use because it allows bursts without exceeding the average rate. Here is a 30-line Lua-on-Redis implementation, the failure modes to test for, and the headers you should be returning to clients.

December 23, 2022
api redis reliability distributed-systems
The Outbox Pattern: How To Stop Losing Events When Postgres And Kafka Disagree
Whenever your code does “write to the database, then publish to Kafka,” there is a window where one succeeds and the other does not. The outbox pattern closes that window with a single extra table and 60 lines of dispatcher code. Here is how it works and why every alternative ends up reinventing it.

November 25, 2022
database kafka reliability distributed-systems
Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers)
Most load tests slam one endpoint with a constant rate of requests and report a percentile. That graph means almost nothing. Real bugs live in ramp-up, soak, and spike scenarios — here are the k6 scripts for each, the metric to read, and why the constant-load test you ran last quarter missed the regression.

November 11, 2022
performance reliability devops
OpenTelemetry in Node.js: Distributed Tracing That Actually Helps During an Incident
Distributed tracing only earns its keep at 3 a.m., when one slow request is hiding in a microservice call graph. Here is the OpenTelemetry setup for Node.js that auto-instruments the boring stuff, lets you add the span attributes that matter, and connects to any backend you point it at.

September 30, 2022
observability node.js reliability
Connection Pooling Without the Cargo Cult: pgbouncer in 100 Lines of Config
Postgres falls over not because of slow queries but because of too many connections. Most teams reach for pgbouncer and copy a config they do not understand. Here is the actual job each setting does, the three pool modes ranked by what they break, and the rule for sizing pool_size that holds at any traffic level.

August 19, 2022
database postgres reliability devops