Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m.
A single poison message crashes your worker, the broker redelivers it, and the crash loop takes down your entire pipeline. Here is the DLQ pattern that separates bad messages from good ones, with working code for RabbitMQ and SQS and the replay strategy that turns a dead letter into a recovered system.
Your queue worker has been stable for three months. Then at 02:17, the pod restarts. It comes back up, processes one message, and crashes again. The broker redelivers the same message. The pod crashes again. Kubernetes helpfully restarts it again. This loop continues until the entire consumer group is stuck in CrashLoopBackOff and your queue depth graph is a vertical line.
The message is not malformed. It is not too large. It is a completely valid order cancellation that references a user whose account was deleted yesterday. Your worker did not expect a missing user, threw an unhandled exception, and now that message is a grenade sitting in the middle of your pipeline.
Without a dead letter queue, the broker keeps redelivering. With a dead letter queue, the message moves to a separate queue after a few retries, your worker processes the rest of the backlog, and you replay the dead letter during business hours after writing a three-line defensive check. The difference is not tooling sophistication. It is whether you thought about poison messages before they thought about you.
This post is the DLQ pattern: when messages die, how to route them safely, the consumer code that handles retries without suicide, and the replay workflow that turns a dead letter back into a healthy message.
The five ways a message becomes a poison pill
A poison message is any message that causes the consumer to fail every time it is processed. It does not have to be corrupt. It just has to hit a code path the consumer cannot handle.
- Schema drift. The producer started sending a new field as a number; the consumer still expects a string. JSON.parse succeeds, validation fails, the worker throws.
- Missing dependency data. The message references an order ID that does not exist yet because of replication lag, or a user that was hard-deleted by a GDPR job.
- Transient downstream failure that outlasts retries. Your payment provider is down for 20 minutes. Each message retries 5 times and exhausts its budget, but now the broker thinks the message is poison when it was just unlucky timing.
- Logic bug in the consumer. A new deployment introduced a null-pointer dereference for edge-case payloads.
- Message too large or too complex. The payload itself is valid but triggers an OOM in the worker because of an unbounded buffer somewhere in the processing pipeline.
In every case, the message is not the enemy. The lack of a quarantine path is.
What a dead letter queue actually is
A DLQ is just another queue. The broker does not treat it specially. The difference is routing policy: when a message fails processing N times, the broker moves it from the main queue to the DLQ instead of redelivering it indefinitely.
The DLQ is not a trash can. It is a holding area with three purposes:
- Preserve the evidence. You need the failed message, its headers, the error that killed it, and the retry count. You cannot debug a ghost.
- Protect the pipeline. Removing the poison message lets the rest of the backlog drain normally.
- Enable replay. After you fix the bug or backfill the missing data, you move the message back to the main queue and process it again.
If you do not have a DLQ, your options are: drop the message (data loss), block the queue (outage), or manually edit the broker database (operational heroics). A DLQ is cheaper than all three.
RabbitMQ: the policy-based approach
RabbitMQ implements DLQs through queue arguments or policies. You declare a main queue with a dead-letter exchange (DLX) and a dead-letter routing key. When a message is rejected without requeue, expires, or exceeds the queue's delivery limit, RabbitMQ routes it to the DLX, which delivers it to the DLQ.
import amqp from 'amqplib';
const RABBIT_URL = process.env.RABBIT_URL ?? 'amqp://localhost';
async function setupTopology() {
  const conn = await amqp.connect(RABBIT_URL);
  const ch = await conn.createChannel();
  // Dead letter exchange and queue.
  await ch.assertExchange('orders.dlx', 'direct', { durable: true });
  await ch.assertQueue('orders.dlq', { durable: true });
  await ch.bindQueue('orders.dlq', 'orders.dlx', 'orders-retry');
  // Main queue. It must be a quorum queue for the delivery limit to apply;
  // classic queues ignore it.
  await ch.assertQueue('orders', {
    durable: true,
    arguments: {
      'x-queue-type': 'quorum',
      'x-dead-letter-exchange': 'orders.dlx',
      'x-dead-letter-routing-key': 'orders-retry',
      // Optional. Messages that wait longer than 30 s are dead-lettered too,
      // so leave this out if a backlog of that age is normal.
      'x-message-ttl': 30_000,
    },
  });
  await ch.prefetch(10);
  await ch.consume('orders', async (msg) => {
    if (!msg) return;
    try {
      const payload = JSON.parse(msg.content.toString());
      await processOrder(payload);
      ch.ack(msg);
    } catch (err) {
      // Requeue on failure. The quorum queue counts each redelivery and,
      // once the delivery limit is exceeded, dead-letters the message.
      ch.nack(msg, false, true);
    }
  });
}
The key line is ch.nack(msg, false, true). The third argument is requeue. Setting it to true returns the message to the queue, and a quorum queue counts each return against its delivery limit; once the limit is exceeded, RabbitMQ routes the message to the DLX. Setting requeue to false, or calling ch.reject(msg, false), skips the retries entirely: RabbitMQ discards the message or, if a DLX is configured, dead-letters it on the first failure.
The delivery limit applies only to quorum queues, which track delivery count natively; classic queues ignore it. Set it with a policy:
# Set the delivery limit on the orders queue.
rabbitmqctl set_policy orders-delivery-limit \
  "^orders$" \
  '{"delivery-limit": 3}' \
  --apply-to queues
Or declare the queue in Terraform / Pulumi if you manage RabbitMQ through code:
resource "rabbitmq_queue" "orders" {
  name  = "orders"
  vhost = "/"
  settings {
    durable = true
    arguments = {
      # The delivery limit only takes effect on quorum queues.
      "x-queue-type"              = "quorum"
      "x-dead-letter-exchange"    = "orders.dlx"
      "x-dead-letter-routing-key" = "orders-retry"
    }
  }
}
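Once a message reaches orders.dlq, RabbitMQ records why in an x-death header, with one entry per queue and reason. The sketch below condenses that header into something you can log next to the payload. The names summarizeDeath and DeathSummary are illustrative, and only the queue, reason, and count fields of each entry are assumed.

```typescript
// One entry of RabbitMQ's x-death header. Only the fields this sketch reads
// are listed; real entries carry more (exchange, routing-keys, time).
interface XDeathEntry {
  queue: string;
  reason: string; // e.g. 'rejected', 'expired', 'maxlen', 'delivery_limit'
  count: number;
}

// Illustrative summary shape, not part of any library.
interface DeathSummary {
  totalDeaths: number;
  reasons: string[];
  queues: string[];
}

// Returns null when the headers carry no x-death entries.
function summarizeDeath(
  headers: Record<string, unknown> | undefined,
): DeathSummary | null {
  const raw = headers ? headers['x-death'] : undefined;
  if (!Array.isArray(raw) || raw.length === 0) return null;
  const entries = raw as XDeathEntry[];
  const reasons: string[] = [];
  const queues: string[] = [];
  let totalDeaths = 0;
  for (const entry of entries) {
    totalDeaths += Number(entry.count) || 0;
    if (reasons.indexOf(entry.reason) === -1) reasons.push(entry.reason);
    if (queues.indexOf(entry.queue) === -1) queues.push(entry.queue);
  }
  return { totalDeaths, reasons, queues };
}
```

In a DLQ consumer built on amqplib, the input would be msg.properties.headers; logging the summary with the message body gives you the failure reason and retry count in one record.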
SQS: the native dead letter queue
AWS SQS has first-class DLQ support. You attach a dead letter queue to the main queue and set a maxReceiveCount. When a message is received that many times without being deleted, SQS moves it to the DLQ automatically.
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
  SendMessageCommand,
} from '@aws-sdk/client-sqs';
const sqs = new SQSClient({ region: process.env.AWS_REGION });
const QUEUE_URL = process.env.ORDERS_QUEUE_URL!;
// Assumed variable name; the inspect and replay snippets later on use it.
const DLQ_URL = process.env.ORDERS_DLQ_URL!;
async function consume() {
  while (true) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20,
      VisibilityTimeout: 60,
      AttributeNames: ['ApproximateReceiveCount'],
    }));
    if (!Messages || Messages.length === 0) continue;
    for (const message of Messages) {
      const receiveCount = Number(
        message.Attributes?.ApproximateReceiveCount ?? '1',
      );
      // Filled in once the body parses, so the error handler never has to
      // re-parse a body that may be the thing that failed.
      let orderId: string | null = null;
      try {
        const payload = JSON.parse(message.Body ?? '{}');
        orderId = payload.orderId ?? null;
        await processOrder(payload);
        await sqs.send(new DeleteMessageCommand({
          QueueUrl: QUEUE_URL,
          ReceiptHandle: message.ReceiptHandle!,
        }));
      } catch (err) {
        // Log the error with the receive count.
        console.error({
          orderId,
          error: (err as Error).message,
          receiveCount,
          messageId: message.MessageId,
        });
        // Do NOT delete the message. SQS will return it after the visibility
        // timeout expires. Once receiveCount hits maxReceiveCount, SQS will
        // move it to the DLQ automatically.
      }
    }
  }
}
The critical behavior: on failure, you do nothing. You log the error, set a metric, and let the message return to the queue by not deleting it. SQS handles the counting and the DLQ routing for you.
One trap: if your VisibilityTimeout is shorter than your processOrder duration, the message becomes visible again before your worker finishes, another worker picks it up, and you get duplicate processing. Set VisibilityTimeout to at least 6× your p99 processing time.
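The arithmetic behind that rule of thumb is small enough to sketch. The helper names below are hypothetical; the 6× multiplier comes from the paragraph above, and the 43,200-second cap is the SQS maximum visibility timeout of 12 hours. The second helper gives a rough upper bound on how long a message that always fails keeps retrying before SQS dead-letters it.

```typescript
// SQS caps the visibility timeout at 12 hours.
const SQS_MAX_VISIBILITY_SECONDS = 43_200;

// Visibility timeout as a multiple of p99 processing time, clamped to the
// range SQS will accept for a useful timeout (1 s to 12 h).
function visibilityTimeoutSeconds(
  p99ProcessingSeconds: number,
  multiplier = 6,
): number {
  const raw = Math.ceil(p99ProcessingSeconds * multiplier);
  return Math.min(Math.max(raw, 1), SQS_MAX_VISIBILITY_SECONDS);
}

// Rough upper bound: each failed receive hides the message for one full
// visibility timeout before the next attempt.
function worstCaseSecondsToDlq(
  visibilityTimeout: number,
  maxReceiveCount: number,
): number {
  return visibilityTimeout * maxReceiveCount;
}
```

With a p99 of 10 seconds this yields the 60-second timeout used in the consumer above. If maxReceiveCount is 10, a message that always fails spends about 600 seconds retrying before it reaches the DLQ.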
The consumer code that does not kill itself
The broker policy sets the DLQ destination. The consumer code decides whether a message lives or dies. There are two classes of failure and you must handle them differently.
enum FailureClass {
  TRANSIENT = 'transient', // retryable: downstream timeout, network blip.
  POISON = 'poison', // not retryable: schema error, missing data, bug.
}
function classifyError(err: unknown): FailureClass {
  const message = (err as Error).message ?? '';
  // Transient: network or downstream is sick.
  if (message.includes('ECONNRESET')) return FailureClass.TRANSIENT;
  if (message.includes('ETIMEDOUT')) return FailureClass.TRANSIENT;
  if (message.includes('503')) return FailureClass.TRANSIENT;
  // Poison: our code cannot handle this message.
  if (message.includes('Cannot read properties of null')) return FailureClass.POISON;
  if (message.includes('validation failed')) return FailureClass.POISON;
  // Default to transient; better to retry once too often than to
  // dead-letter a message that would have succeeded.
  return FailureClass.TRANSIENT;
}
async function safeProcess(
  payload: unknown,
  messageId: string,
): Promise<{ outcome: 'ack' | 'nack' | 'retry' }> {
  try {
    await processOrder(payload as OrderPayload);
    return { outcome: 'ack' };
  } catch (err) {
    const cls = classifyError(err);
    if (cls === FailureClass.POISON) {
      // Poison: reject without requeue to move it out of the queue at once.
      // If a DLX is configured, this dead-letters it; otherwise it is
      // dropped, so make sure your DLX policy is set.
      return { outcome: 'nack' };
    }
    // Transient: let the broker redeliver. Do not ack.
    return { outcome: 'retry' };
  }
}
Why classify? Because retrying a schema error 10 times wastes 10 visibility timeouts and delays the message’s arrival in the DLQ by 10 minutes. A poison message should be rejected immediately so the pipeline can move on.
Never put business-logic exceptions in the transient bucket by default. A UserNotFoundError is usually poison (the user is gone; retrying in 30 seconds will not bring them back). A PaymentGatewayTimeout is transient.
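Matching on message strings is brittle once business-logic errors enter the picture. One alternative, sketched here under the assumption that you control the error types your worker throws, is to let each error class declare its own failure kind. The class names mirror the two examples above; the kind property, classifyTyped, and brokerActionFor are hypothetical names for illustration.

```typescript
type FailureKind = 'transient' | 'poison';

// Business-logic error: the user is gone, so retrying cannot help.
class UserNotFoundError extends Error {
  readonly kind: FailureKind = 'poison';
  constructor(userId: string) {
    super(`user ${userId} not found`);
    this.name = 'UserNotFoundError';
  }
}

// Downstream error: the gateway may recover, so a retry is worthwhile.
class PaymentGatewayTimeout extends Error {
  readonly kind: FailureKind = 'transient';
  constructor() {
    super('payment gateway timed out');
    this.name = 'PaymentGatewayTimeout';
  }
}

// Reads the declared kind when there is one. Unknown errors fall back to
// transient, so the broker's delivery limit remains the safety net.
function classifyTyped(err: unknown): FailureKind {
  if (typeof err === 'object' && err !== null && 'kind' in err) {
    const kind = (err as { kind: unknown }).kind;
    if (kind === 'poison' || kind === 'transient') return kind;
  }
  return 'transient';
}

// Maps a failure kind to what the consumer should ask the broker to do.
function brokerActionFor(kind: FailureKind): 'dead-letter' | 'requeue' {
  return kind === 'poison' ? 'dead-letter' : 'requeue';
}
```

With amqplib, 'dead-letter' would translate to ch.reject(msg, false) and 'requeue' to ch.nack(msg, false, true). Checking the kind property instead of using instanceof keeps the classification working even when errors cross module or transpilation boundaries.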
Monitoring: what you need to watch
A DLQ is only useful if you know when it has messages. Three alerts, all cheap:
1. DLQ depth above zero for more than 5 minutes.
# Using rabbitmq_queue_messages for RabbitMQ, or
# aws_sqs_approximate_number_of_messages_visible for SQS.
rabbitmq_queue_messages{queue="orders.dlq"} > 0
This pages you. A DLQ with messages means a bug or data issue is blocking real work. It should not wait until Monday.
2. Retry rate spike on the main queue.
rate(rabbitmq_channel_messages_redelivered_total[5m]) > 0.5
A redelivery spike usually means a new deployment introduced a bug or a downstream dependency started failing. Catch it before the DLQ fills.
3. Time-to-dead-letter histogram.
How long does a poison message spend retrying before landing in the DLQ? If it is longer than your SLO for processing latency, tighten the delivery limit or classify errors more aggressively.
The replay workflow
Messages in a DLQ are not dead. They are waiting. The replay workflow is how you bring them back.
Step 1: inspect
Read a batch from the DLQ without deleting it. Received messages are hidden for the queue's visibility timeout and then reappear:
async function inspectDlq(batchSize = 10) {
  const { Messages } = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: DLQ_URL,
    MaxNumberOfMessages: batchSize,
    WaitTimeSeconds: 5,
  }));
  for (const msg of Messages ?? []) {
    const body = JSON.parse(msg.Body ?? '{}');
    console.log({ messageId: msg.MessageId, orderId: body.orderId });
  }
  // Do NOT delete. These messages stay in the DLQ while you decide and
  // become visible again once the visibility timeout expires.
}
Step 2: fix the root cause
If every failed message is userId: null, your producer has a bug. If they are all for a single partner integration that changed its API, update your adapter. Do not replay until the fix is deployed.
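Spotting a pattern like "every failed message has userId: null" is easier with a tally than by reading logs. The helper below is a minimal sketch, with tallyBy as a hypothetical name: it takes raw message bodies, such as the msg.Body strings from the inspect step, and counts the values of one field.

```typescript
// Counts how often each value of `field` appears across JSON message bodies.
// Values are keyed by their JSON form, so null and "null" stay distinct.
function tallyBy(bodies: string[], field: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const body of bodies) {
    let key: string;
    try {
      const parsed: unknown = JSON.parse(body);
      const value =
        typeof parsed === 'object' && parsed !== null
          ? (parsed as Record<string, unknown>)[field]
          : undefined;
      key = value === undefined ? '<missing>' : JSON.stringify(value);
    } catch {
      // The body itself is not valid JSON.
      key = '<unparseable>';
    }
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```

A tally dominated by a single key points at one root cause; a flat spread suggests several unrelated failures that need separate fixes.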
Step 3: replay
Move messages from the DLQ back to the main queue in batches:
async function replayDlq(batchSize = 10) {
  while (true) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: DLQ_URL,
      MaxNumberOfMessages: batchSize,
      WaitTimeSeconds: 5,
      // Without this, MessageAttributes comes back undefined and the
      // replayed message silently loses its attributes.
      MessageAttributeNames: ['All'],
    }));
    if (!Messages || Messages.length === 0) break;
    for (const msg of Messages) {
      // Re-publish to the main queue first, then delete, so a crash in
      // between duplicates a message instead of losing it.
      await sqs.send(new SendMessageCommand({
        QueueUrl: QUEUE_URL,
        MessageBody: msg.Body!,
        MessageAttributes: msg.MessageAttributes,
      }));
      // Delete from DLQ.
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: DLQ_URL,
        ReceiptHandle: msg.ReceiptHandle!,
      }));
    }
  }
}
For RabbitMQ, the equivalent is consuming from the DLQ and publishing back to the main exchange. Some teams prefer to replay by timestamp range (“everything from Tuesday night”) which requires storing the original publish time in message headers.
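A minimal sketch of that RabbitMQ replay follows. It is written against a small interface rather than amqplib directly: ReplayChannel is an assumed structural subset of an amqplib channel (get, sendToQueue, ack), and the x-replay-count header is a hypothetical convention, not something RabbitMQ sets. Stamping a replay counter makes a message that keeps bouncing between the main queue and the DLQ easy to spot.

```typescript
// Hypothetical header name; RabbitMQ does not define it.
const REPLAY_HEADER = 'x-replay-count';

// Assumed subset of an amqplib channel, kept small so it can be faked.
interface ReplayMessage {
  content: Uint8Array;
  properties: { headers?: Record<string, unknown> };
}

interface ReplayChannel {
  get(queue: string, options: { noAck: boolean }): Promise<ReplayMessage | false>;
  sendToQueue(
    queue: string,
    content: Uint8Array,
    options: { persistent: boolean; headers: Record<string, unknown> },
  ): boolean;
  ack(message: ReplayMessage): void;
}

// Moves up to maxMessages from the DLQ to the target queue and returns how
// many were moved. Publishes before acking, so a crash in between leaves a
// duplicate rather than a lost message.
async function replayDeadLetters(
  ch: ReplayChannel,
  dlq: string,
  target: string,
  maxMessages: number,
): Promise<number> {
  let replayed = 0;
  while (replayed < maxMessages) {
    const msg = await ch.get(dlq, { noAck: false });
    if (msg === false) break; // DLQ is empty
    const headers = msg.properties.headers ?? {};
    const previous = Number(headers[REPLAY_HEADER] ?? 0) || 0;
    ch.sendToQueue(target, msg.content, {
      persistent: true,
      headers: { ...headers, [REPLAY_HEADER]: previous + 1 },
    });
    ch.ack(msg);
    replayed += 1;
  }
  return replayed;
}
```

The maxMessages cap lets you replay a small batch, watch the main queue, and continue only if the batch processes cleanly. For stronger delivery guarantees, a confirm channel that waits for the broker to acknowledge each publish before acking the DLQ copy would be the safer variant.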
Step 4: verify
Watch the main queue depth, processing latency, and error rate. If replayed messages fail again, either your fix was incomplete or you have a new class of poison message. Stop the replay and investigate.
The traps that break DLQs
Trap 1: treating the DLQ as a permanent archive. A DLQ is a triage area, not a data lake. If it grows without bound, you have a monitoring failure. Set a max age (TTL) on DLQ messages and alert when depth is non-zero.
Trap 2: no retry count in the message. If the consumer logs do not expose the delivery count, you cannot tell whether a DLQ’d message failed 3 times or 300. Embed retry metadata in message headers so your observability shows the full story.
Trap 3: rejecting messages that should be retried. A consumer that nacks every error, regardless of class, will move transient failures to the DLQ on the first try. Your DLQ will fill every time the payment gateway blips. Classify first, reject with intent.
Trap 4: replaying without fixing the cause. Teams under pressure sometimes replay a DLQ to “clear the alert” and hope the messages process successfully this time. If the bug still exists, they loop: main queue → DLQ → replay → main queue → DLQ. Do not replay until a developer has read the error and merged a fix.
When you do not need a DLQ
Skip DLQs when:
- The queue is low volume and losing a message is cheaper than building the policy. A nightly batch job with 20 messages does not need a DLQ; it needs email-on-failure.
- The processing is fully idempotent and you can safely retry forever. Some queue systems (Kafka with at-least-once consumers) naturally replay offsets, and a poison message will block the partition anyway, so a DLQ does not help.
- You are using a streaming platform (Kafka, Pulsar) and the architecture tolerates per-partition blocks while you fix the consumer. Streaming DLQs exist but are more complex; they are out of scope here.
For any task queue with independent messages and a non-zero cost of failure, the DLQ is baseline infrastructure, not an optional extra.
The takeaway
A poison message is not an edge case. It is an inevitability. The question is whether one bad payload takes down your pipeline or whether it quietly moves to a side queue while the rest of your backlog clears.
Set up the DLX policy on RabbitMQ, or attach the DLQ in SQS. Classify errors into transient and poison in your consumer. Alert on DLQ depth. Fix the root cause before replaying. This is 40 lines of configuration and 20 lines of consumer code. The alternative is a 2 a.m. page, a crashed consumer group, and a queue depth graph that looks like a wall.
Wire it now, before you need it.
A note from Yojji
The kind of backend infrastructure work that separates a pipeline which handles bad data gracefully from one that crashes at the first unexpected null pointer is exactly the kind of practical engineering Yojji builds into the systems it ships.
Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. Their engineers specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and event-driven microservices architectures. Whether you need a dedicated senior team or full-cycle product support from discovery through DevOps, Yojji handles the queue policies, DLQ wiring, and failure-classification logic that keep systems running through bad data and bad traffic.
If you would rather have a pipeline that dead-letters poison messages correctly than learn why it matters during your next 2 a.m. outage, Yojji is worth a conversation.