A single poison message crashes your worker, the broker redelivers it, and the crash loop takes down your entire pipeline. Here is the DLQ pattern that separates bad messages from good ones, with working code for RabbitMQ and the replay strategy that turns a dead letter into a recovered system.
JSON.stringify is the default for every internal service call, but on high-throughput RPC it burns CPU and inflates payloads. Here is how MessagePack replaces it, with Node.js benchmarks, Express middleware code, and the migration path that does not break your public API.
In a microservice with ten replicas, one overloaded instance can push your P99 from 100 ms to 2 s. Request hedging sends a second request after a short delay and keeps the faster response. Here is the safe way to implement it in Node.js, with cancellation, in-flight limits, and the math that decides whether it is worth it.
Why one slow dependency cascades into a site-wide outage, and how to wire deadline propagation through HTTP APIs, database queries, and background jobs so your system fails fast instead of failing everywhere.
Naive retry loops are how a single sick dependency takes the whole platform down. Here is the retry pattern that actually survives a real outage — exponential backoff with decorrelated jitter, retry budgets, deadline propagation, and the four mistakes that will turn your "self-healing" client into a self-DDoS tool.
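As a rough illustration of the backoff part of that pattern, here is a minimal TypeScript sketch of decorrelated jitter, assuming the commonly cited formulation in which each delay is drawn uniformly between a base and three times the previous delay, then capped. The names `nextDelayMs`, `retryWithJitter`, and the option fields are hypothetical, and retry budgets are deliberately left out.

```typescript
// Decorrelated jitter: the next delay is uniform in [baseMs, prevMs * 3], capped at capMs.
// `rand` is injectable so the function can be tested deterministically.
function nextDelayMs(
  prevMs: number,
  baseMs: number,
  capMs: number,
  rand: () => number = Math.random,
): number {
  const upper = Math.max(baseMs, prevMs * 3);
  return Math.min(capMs, baseMs + rand() * (upper - baseMs));
}

interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseMs: number;      // smallest possible delay
  capMs: number;       // largest possible delay
  deadlineMs: number;  // overall time budget for all attempts and sleeps
}

// Retries `op` until it succeeds, attempts run out, or the next sleep
// would push past the caller's deadline.
async function retryWithJitter<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const startedAt = Date.now();
  let delay = opts.baseMs;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
    }
    if (attempt === opts.maxAttempts) break;
    delay = nextDelayMs(delay, opts.baseMs, opts.capMs);
    // Fail fast instead of sleeping beyond the deadline.
    if (Date.now() - startedAt + delay > opts.deadlineMs) break;
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
  }
  throw lastError;
}
```

Because each delay depends on the previous one rather than on the attempt number, clients that failed at the same moment tend to spread out instead of retrying in lockstep.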
gRPC is faster, smaller, and strongly typed, but it has worse browser support and is harder to debug. The decision is workload-specific. Here is the honest comparison: where gRPC genuinely wins, where REST stays the right choice, and the connect-rpc middle ground that resolves most of the trade-offs.
A service mesh promises automatic mTLS, traffic shifting, and observability. The operational cost is real: Istio adds its own control plane to run, upgrade, and debug, which can roughly double a cluster's control-plane complexity. Here is the honest framework for whether your team needs a mesh, the lighter alternatives, and the migration that doesn't break production.
Most “chaos engineering” discussions are about Chaos Monkey at Netflix and have nothing to do with how a 20-engineer team should test resilience. The five drills here are practical, scoped, runnable in an afternoon, and will surface the broken assumption your monitoring missed.
Two-phase commit is the textbook answer for distributed transactions. It also doesn't survive contact with real systems. The saga pattern — orchestrated or choreographed — is what production systems actually use. Here is the difference, the implementation patterns, and the compensation logic that handles the inevitable failure cases.
Redlock is the most-recommended distributed-lock algorithm and the one with the most published criticism. The truth: simple Redis locks are fine for most teams, Redlock fixes a narrow set of failure modes most teams don't experience, and the cases where you really need correctness call for Postgres or ZooKeeper. Here is the decision tree.
Kafka and RabbitMQ both move messages, but they are not interchangeable. One is a distributed log, the other is a message router. Picking the wrong one means a year of fighting the abstractions. Here is the workload-based decision tree, the operational realities of each, and the rare case where you need both.
When a downstream service slows from 50 ms to 5 s, your service inherits the latency, then runs out of connections, then takes everything else with it. A circuit breaker is the 50 lines that say “I will stop calling you for 30 seconds and let you recover.” Here is the implementation, the three states, and the four metrics worth alerting on.
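To make the three states concrete, here is a minimal TypeScript sketch of a breaker that moves between closed, open, and half-open. It is an illustration under stated assumptions, not the article's implementation: the class name, the consecutive-failure threshold of 5, and the injectable clock are all hypothetical, and the 30-second cooldown simply mirrors the figure quoted above.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  // failureThreshold = 5 is an assumed default; cooldownMs = 30000 matches "30 seconds".
  constructor(failureThreshold = 5, cooldownMs = 30000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // An open breaker becomes half-open once the cooldown has elapsed.
  currentState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAtMs >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Any success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  // A failed half-open probe, or too many consecutive failures, opens the breaker.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.currentState() === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = this.now();
    }
  }

  // Wraps a call: rejects immediately while open, otherwise records the outcome.
  async call<T>(op: () => Promise<T>): Promise<T> {
    if (this.currentState() === "open") {
      throw new Error("circuit open: dependency not called");
    }
    try {
      const result = await op();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

This sketch lets every caller through while half-open; a stricter version would admit only a single probe request at a time.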
The “sliding window” rate limiter every tutorial shows you breaks at scale. Token bucket is the algorithm real APIs use because it allows short bursts while holding the long-run average to the configured rate. Here is a 30-line Lua-on-Redis implementation, the failure modes to test for, and the headers you should be returning to clients.
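The core of the algorithm fits in a few lines. The sketch below is an in-memory, single-process TypeScript version, not the Lua-on-Redis implementation mentioned above; the class name, method name, and result fields are hypothetical. It shows lazy refill: tokens are topped up from the elapsed time on each request rather than by a background timer.

```typescript
interface BucketResult {
  allowed: boolean;
  remaining: number;     // whole tokens left after this request
  retryAfterSec: number; // seconds until enough tokens exist; 0 when allowed
}

class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: () => number;

  // capacity bounds the burst size; refillPerSec is the sustained average rate.
  constructor(capacity: number, refillPerSec: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity; // start full so an idle client can burst
    this.lastRefillMs = now();
  }

  tryRemove(cost = 1): BucketResult {
    // Lazy refill: add tokens for the time elapsed since the last call, up to capacity.
    const t = this.now();
    const elapsedSec = (t - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = t;

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return { allowed: true, remaining: Math.floor(this.tokens), retryAfterSec: 0 };
    }
    // Not enough tokens: report how long until the shortfall is refilled.
    const retryAfterSec = Math.ceil((cost - this.tokens) / this.refillPerSec);
    return { allowed: false, remaining: Math.floor(this.tokens), retryAfterSec };
  }
}
```

The `retryAfterSec` value could feed an HTTP `Retry-After` header on a 429 response. A shared limiter across replicas would need the same refill-and-deduct step to run atomically in a central store, which is the role the Redis script plays.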
Whenever your code does “write to the database, then publish to Kafka,” there is a window where one succeeds and the other does not. The outbox pattern closes that window with a single extra table and 60 lines of dispatcher code. Here is how it works and why every alternative ends up reinventing it.