#distributed-systems

14 posts

Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m.
A single poison message crashes your worker, the broker redelivers it, and the crash loop takes down your entire pipeline. Here is the DLQ pattern that separates bad messages from good ones, with working code for RabbitMQ and the replay strategy that turns a dead letter into a recovered system.

May 23, 2026
reliability distributed-systems messaging
MessagePack vs JSON: The Binary Serialization Switch That Cut Our Internal RPC Overhead by 40%
JSON.stringify is the default for every internal service call, but on high-throughput RPC it burns CPU and inflates payloads. Here is how MessagePack replaces it, with Node.js benchmarks, Express middleware code, and the migration path that does not break your public API.

May 21, 2026
performance node.js distributed-systems api
Request Hedging: Cut Tail Latency In Half Without Overprovisioning
In a microservice with ten replicas, one overloaded instance can push your P99 from 100 ms to 2 s. Request hedging sends a second request after a short delay and keeps the faster response. Here is the safe way to implement it in Node.js, with cancellation, in-flight limits, and the math that decides whether it is worth it.

May 19, 2026
performance distributed-systems node.js
Request Timeouts and Deadline Propagation: Stop the Chain of Slowness
Why one slow dependency cascades into a site-wide outage, and how to wire deadline propagation through HTTP APIs, database queries, and background jobs so your system fails fast instead of failing everywhere.

May 12, 2026
node.js reliability distributed-systems
Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming
Naive retry loops are how a single sick dependency takes the whole platform down. Here is the retry pattern that actually survives a real outage — exponential backoff with decorrelated jitter, retry budgets, deadline propagation, and the four mistakes that will turn your "self-healing" client into a self-DDoS tool.

May 4, 2026
reliability distributed-systems node.js
gRPC vs REST In 2024: When The Switch Pays For Itself
gRPC is faster, smaller, and strongly typed, but it has worse browser support and is harder to debug. The decision is workload-specific. Here is the honest comparison: where gRPC genuinely wins, where REST stays the right choice, and the connect-rpc middle ground that resolves most of the trade-offs.

July 5, 2024
api grpc distributed-systems
Service Mesh: When Istio Or Linkerd Earns Its Operational Cost, And When Not
Service mesh promises automatic mTLS, traffic shifting, and observability. The operational cost is real — Istio doubles a cluster's control-plane complexity. Here is the honest framework for whether your team needs a mesh, the lighter alternatives, and the migration that doesn't break production.

April 12, 2024
kubernetes devops distributed-systems
Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale
Most “chaos engineering” discussions are about Chaos Monkey at Netflix and have nothing to do with how a 20-engineer team should test resilience. The five drills here are practical, scoped, runnable in an afternoon, and will surface the broken assumption your monitoring missed.

November 24, 2023
reliability devops distributed-systems
Saga Pattern vs Two-Phase Commit: Distributed Transactions Without The Lies
Two-phase commit is the textbook answer for distributed transactions. It also doesn't survive contact with real systems. The saga pattern — orchestrated or choreographed — is what production systems actually use. Here is the difference, the implementation patterns, and the compensation logic that handles the inevitable failure cases.

September 29, 2023
distributed-systems reliability data-modeling
Distributed Locks With Redis: An Honest Look At Redlock And When You Don't Need It
Redlock is the most-recommended distributed-lock algorithm and the one with the most published criticism. The truth: simple Redis locks are fine for most teams, Redlock fixes a narrow set of failure modes most teams don't experience, and the cases where you really need correctness call for Postgres or ZooKeeper. Here is the decision tree.

August 18, 2023
distributed-systems redis reliability
Kafka vs RabbitMQ: A Decision Tree That Doesn't Hate You
Kafka and RabbitMQ both move messages, but they are not interchangeable. One is a distributed log, the other is a message router. Picking the wrong one means a year of fighting the abstractions. Here is the workload-based decision tree, the operational realities of each, and the rare case where you need both.

July 7, 2023
messaging kafka rabbitmq distributed-systems
Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service
When a downstream service slows from 50 ms to 5 s, your service inherits the latency, then runs out of connections, then takes everything else with it. A circuit breaker is the 50 lines that say “I will stop calling you for 30 seconds and let you recover.” Here is the implementation, the three states, and the four metrics worth alerting on.

February 17, 2023
reliability node.js distributed-systems
Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis
The “sliding window” rate limiter every tutorial shows you breaks at scale. Token bucket is the algorithm real APIs use because it allows bursts without exceeding the average rate. Here is a 30-line Lua-on-Redis implementation, the failure modes to test for, and the headers you should be returning to clients.

December 23, 2022
api redis reliability distributed-systems
The Outbox Pattern: How To Stop Losing Events When Postgres And Kafka Disagree
Whenever your code does “write to the database, then publish to Kafka,” there is a window where one succeeds and the other does not. The outbox pattern closes that window with a single extra table and 60 lines of dispatcher code. Here is how it works and why every alternative ends up reinventing it.

November 25, 2022
database kafka reliability distributed-systems