The Practical Developer

Kafka vs RabbitMQ: A Decision Tree That Doesn't Hate You

Kafka and RabbitMQ both move messages, but they are not interchangeable. One is a distributed log, the other is a message router. Picking the wrong one means a year of fighting the abstractions. Here is the workload-based decision tree, the operational realities of each, and the rare case where you need both.


The architecture diagram has a “message bus” between three services. Somebody asks “Kafka or RabbitMQ?” The answer is “it depends,” and the next 90 minutes is a debate that mostly recapitulates the README of each project. Six months later the team has chosen wrong and is spending a sprint per quarter fighting the tool’s natural shape.

Both systems transport messages, but their core models differ. Kafka is a distributed, replicated log: messages are written to topics, persisted until a retention policy removes them (or indefinitely, if you configure it that way), and read by consumers at their own pace. RabbitMQ is a smart router: producers publish, the broker decides where each message goes (exchanges, bindings, queues), and once a consumer acks a message it is gone. Different model, different operational footprint, different failure modes.

This post is the decision tree based on workload, the cases where each shines, and the rare case where you need both.

The shape of each, in 30 seconds

Kafka. Topics are partitioned, append-only logs. Each partition has one leader broker; replicas live on others. Consumers read by offset and track their own position. Messages are not deleted on consumption; they age out by retention policy (typically 1–7 days). Throughput is high (hundreds of MB/sec per topic once it is spread across enough partitions). Not a queue: there is no per-message ack-or-redeliver semantic.
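
To make the offset model concrete, here is a minimal in-memory sketch of a single partition with two consumers. It is a toy, not the Kafka client API: PartitionLog and Consumer are hypothetical names invented for this illustration, and replication, batching, and commits are all left out.

```python
class PartitionLog:
    """Toy stand-in for one Kafka partition: an append-only list."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset, max_records=10):
        # Reading never removes anything from the log.
        return self.records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Replay is nothing more than moving the position back.
        self.offset = offset


log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

fast = Consumer(log)
slow = Consumer(log)

fast_batch = fast.poll()                # reads all three records
slow_batch = slow.poll(max_records=1)   # reads only the first one

fast.seek(0)                            # rewind and replay from the start
replayed = fast.poll()
```

The point of the sketch is that the slow consumer never holds the fast one back, and that a replay costs nothing on the broker side because the records were never removed.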

RabbitMQ. Producers send to exchanges. Exchanges route to queues via bindings (direct, topic, fanout, headers). Consumers receive messages from queues and acknowledge them. Acked messages are deleted; un-acked messages are redelivered. Per-message routing, per-message ack — flexible. Throughput is lower per node (tens of thousands of messages/sec is typical).
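
The exchange, binding, and ack flow can be sketched the same way. This is a deliberately small model, assuming only direct and fanout exchanges; ToyBroker and its methods are hypothetical names for this illustration and are not the API of any RabbitMQ client library.

```python
from collections import deque


class ToyBroker:
    """Toy model of exchange -> binding -> queue routing with acks."""

    def __init__(self):
        self.exchanges = {}   # exchange name -> type ("direct" or "fanout")
        self.bindings = []    # (exchange name, binding key, queue name)
        self.queues = {}      # queue name -> deque of ready messages
        self.unacked = {}     # delivery tag -> (queue name, message)
        self.next_tag = 1

    def declare_exchange(self, name, exchange_type):
        self.exchanges[name] = exchange_type

    def bind(self, exchange, queue, binding_key=""):
        self.queues.setdefault(queue, deque())
        self.bindings.append((exchange, binding_key, queue))

    def publish(self, exchange, routing_key, message):
        exchange_type = self.exchanges[exchange]
        for bound_exchange, binding_key, queue in self.bindings:
            if bound_exchange != exchange:
                continue
            # fanout ignores the routing key; direct needs an exact match
            if exchange_type == "fanout" or binding_key == routing_key:
                self.queues[queue].append(message)

    def get(self, queue):
        if not self.queues[queue]:
            return None
        message = self.queues[queue].popleft()
        tag = self.next_tag
        self.next_tag += 1
        self.unacked[tag] = (queue, message)
        return tag, message

    def ack(self, tag):
        del self.unacked[tag]  # acked messages are gone for good

    def nack(self, tag):
        queue, message = self.unacked.pop(tag)
        self.queues[queue].appendleft(message)  # requeue for redelivery


broker = ToyBroker()
broker.declare_exchange("orders", "direct")
broker.declare_exchange("audit", "fanout")
broker.bind("orders", "billing", binding_key="order.paid")
broker.bind("orders", "shipping", binding_key="order.shipped")
broker.bind("audit", "audit-a")
broker.bind("audit", "audit-b")

broker.publish("orders", "order.paid", "order 42 paid")
broker.publish("audit", "ignored", "login from new device")

tag, first_delivery = broker.get("billing")
broker.nack(tag)                        # processing failed: message is requeued
tag, redelivered = broker.get("billing")
broker.ack(tag)                         # success: message is deleted
```

Compare this with the log model: here the broker owns the routing decision and the message lifecycle, and nothing is left to replay once the ack arrives.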

Kafka is “the durable log of everything that happened.” RabbitMQ is “the smart traffic cop for live messages.”

When Kafka is the right answer

Use Kafka if any of these are true:

Event sourcing or CDC. You want a durable record of every state change, replayable from the beginning of time. Kafka was built for this. Topics with log compaction enabled (cleanup.policy=compact) give you “the latest state of every key” semantics.
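
What “the latest state of every key” means is easiest to see on a tiny example. The function below is a simplified sketch of the end result of compaction, not of how the broker does it: it assumes a log of (key, value) pairs and treats a None value as a tombstone. Real compaction runs in the background, skips the active segment, and keeps tombstones around for a while before dropping them.

```python
def compact(log):
    """Keep only the most recent record for each key, in log order.

    A value of None acts as a tombstone and removes the key entirely.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones

    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]


log = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "c@example.com"),   # supersedes the first user-1 record
    ("user-2", None),              # tombstone: user-2 was deleted
    ("user-3", "d@example.com"),
]

compacted = compact(log)
```

A new consumer reading the compacted topic from offset zero sees one record for user-1 and one for user-3, which is enough to rebuild current state without the full history.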

Multiple consumers, different speeds. A topic feeds an analytics pipeline (slow batch), a search-index updater (fast), and a notifier (real-time). Each consumer tracks its own offset; nobody’s pace affects anyone else’s.

Stream processing. You are using Kafka Streams, Flink, ksqlDB, or building joins between event topics. Kafka’s partitioning maps directly to processing parallelism.

Very high throughput. Hundreds of MB/sec per topic. Kafka’s per-partition log structure scales linearly with disk bandwidth.

Replay-on-demand. You need to re-process the last 24 hours after fixing a consumer bug. Kafka stores the messages; reset the offset and replay. RabbitMQ deleted them on ack.

When RabbitMQ is the right answer

Use RabbitMQ if any of these are true:

Per-message routing. Different message types go to different consumers based on routing keys, headers, or business rules. RabbitMQ’s exchange/binding model is purpose-built for this. Doing it in Kafka requires custom application logic.

Per-message acknowledgment / dead-lettering. A message that fails goes to a dead-letter queue for inspection or retry. Built into RabbitMQ. In Kafka, you implement this in your consumer (and most teams get it wrong).
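
For comparison, this is roughly what the hand-rolled Kafka version looks like. It is a sketch under assumptions: records arrive as (offset, value) pairs, the dead-letter topic is modelled as a plain list, and consume_with_dead_letter is a hypothetical helper, not part of any Kafka client.

```python
MAX_ATTEMPTS = 3


def consume_with_dead_letter(records, handler, dead_letters,
                             max_attempts=MAX_ATTEMPTS):
    """Process records in order; park poison messages instead of blocking.

    Returns the offset to commit (one past the last record handled).
    """
    next_offset = 0
    for offset, record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(record)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # In a real consumer this would be a produce() to a
                    # dead-letter topic, with the error attached as headers.
                    dead_letters.append(
                        {"offset": offset, "record": record, "error": str(exc)}
                    )
        # Only advance once the record was processed or parked.
        next_offset = offset + 1
    return next_offset


processed = []
dead_letters = []


def handler(record):
    if record == "bad":
        raise ValueError("cannot parse")
    processed.append(record)


records = [(0, "a"), (1, "bad"), (2, "b")]
committed = consume_with_dead_letter(records, handler, dead_letters)
```

The two usual mistakes are visible in what the sketch avoids: retrying forever blocks every record behind the poison message in that partition, and committing past a failure without parking it drops the message silently.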

Request/response over a broker. RPC-style messaging where a producer waits for a response on a reply queue. RabbitMQ supports it natively. Kafka has no built-in concept of replies.

Lower-volume, complex routing. Tens of thousands of messages per second across many queues with topology that changes often. RabbitMQ’s flexibility wins.

You want one less thing to operate. RabbitMQ is simpler to run for smaller workloads. Kafka requires Zookeeper (or KRaft), which is non-trivial. If your team has not run Kafka before, do not adopt it for a low-volume use case.

The decision tree

Five questions, in order:

  1. Will any consumer ever need to replay history? Yes → Kafka. No → consider RabbitMQ.
  2. Do you need per-message ack/retry/dead-letter semantics? Yes → RabbitMQ. (Kafka makes this awkward.)
  3. Is throughput per topic > 100 MB/sec? Yes → Kafka. No → either works.
  4. Is the routing logic complex (many bindings, header-based)? Yes → RabbitMQ.
  5. Is your team comfortable operating either one? Pick the one you know.
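
The five questions above can be written down as a function, which is a handy way to check that the team agrees on the order. This is only a sketch of the tree as stated: Workload and recommend are hypothetical names, and the final fallback to RabbitMQ when the team knows neither tool is an assumption based on the "simpler starting point" advice in this post.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_replay: bool
    needs_per_message_ack: bool
    throughput_mb_per_sec: float
    complex_routing: bool
    team_knows: str = ""   # "kafka", "rabbitmq", or "" if neither


def recommend(w: Workload) -> str:
    # Questions are evaluated in order; the first decisive answer wins.
    if w.needs_replay:
        return "kafka"
    if w.needs_per_message_ack:
        return "rabbitmq"
    if w.throughput_mb_per_sec > 100:
        return "kafka"
    if w.complex_routing:
        return "rabbitmq"
    # Question 5: either would work, so go with what the team can operate.
    return w.team_knows or "rabbitmq"


analytics_feed = Workload(
    needs_replay=True,
    needs_per_message_ack=False,
    throughput_mb_per_sec=20,
    complex_routing=False,
)
job_queue = Workload(
    needs_replay=False,
    needs_per_message_ack=True,
    throughput_mb_per_sec=1,
    complex_routing=True,
)

analytics_choice = recommend(analytics_feed)
job_queue_choice = recommend(job_queue)
```

Because the questions are ordered, a workload that needs both replay and per-message dead-lettering lands on Kafka here. That is the case where the consumer-side retry logic has to be built by hand.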

For most teams’ first message broker, RabbitMQ is the simpler starting point. Once you hit the cases where it does not fit (replay, very high throughput, stream processing), Kafka is the next step — but the migration is a real project.

Operational realities of Kafka

Storage planning. With 7-day retention and 100 MB/sec throughput, you are storing about 60 TB, and that is before replication: each additional replica stores the same amount again. Plan disks accordingly.
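
The arithmetic is worth keeping in a small helper so it can be rerun when retention or traffic changes. The sketch below uses decimal units (1 MB = 1,000,000 bytes); the replication factor of 3 in the second call is an assumed, commonly used value, not a figure from this post.

```python
def retained_bytes(throughput_mb_per_sec, retention_days, replication_factor=1):
    """Bytes on disk across the cluster for a steady write rate."""
    seconds = retention_days * 24 * 60 * 60
    return throughput_mb_per_sec * 1_000_000 * seconds * replication_factor


single_copy_tb = retained_bytes(100, 7) / 1e12
replicated_tb = retained_bytes(100, 7, replication_factor=3) / 1e12

print(f"one copy:      {single_copy_tb:.2f} TB")   # 60.48 TB
print(f"three copies:  {replicated_tb:.2f} TB")    # 181.44 TB
```

This ignores compression, index files, and headroom for traffic spikes, so treat the result as a starting estimate rather than a disk order.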

Partition count is hard to change. You set it when the topic is created. Adding partitions later changes the hashing and breaks ordering guarantees for existing consumers. Pick a number that works for the next 5 years (often: 4× the number of consumer instances you ever expect).
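
The remapping is easy to demonstrate. The sketch below assumes a simple hash-modulo partitioner and uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2); partition_for is a hypothetical helper for this illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"user-{i}" for i in range(1000)]

# Which keys land on a different partition after growing from 6 to 8?
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]
moved_fraction = len(moved) / len(keys)

print(f"{moved_fraction:.0%} of keys changed partition")
```

With a uniform hash, going from 6 to 8 partitions moves about three quarters of the keys. For each moved key, old events sit in one partition and new events in another, so a consumer can no longer rely on seeing that key's history in order.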

Consumer rebalances are stop-the-world. When consumers join or leave, partitions are redistributed and processing pauses. Cooperative rebalancing (newer Kafka versions) reduces the impact, but you still have to design for it.

Exactly-once is partial. “Exactly-once” semantics in Kafka cover the read-process-write loop when both the input and the output are Kafka topics. They do not magically extend to your database; that is your problem to solve, usually with the outbox pattern.
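
A minimal sketch of the outbox pattern, using SQLite from the standard library so it runs anywhere. The table names, place_order, and relay are hypothetical, and the publish callback stands in for a real Kafka producer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def place_order(order_id):
    # One local transaction: the state change and the event commit
    # together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "status": "placed"})),
        )


def relay(publish):
    """Publish pending outbox rows in order, marking each one afterwards."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # a crash right after this causes a re-publish
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
place_order(1)
place_order(2)
first_run = relay(lambda topic, payload: sent.append((topic, json.loads(payload))))
second_run = relay(lambda topic, payload: sent.append((topic, json.loads(payload))))
```

Note what this buys: the database write and the event can no longer disagree, but delivery is at-least-once, so downstream consumers still need to be idempotent.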

ZooKeeper or KRaft. Kafka traditionally required ZooKeeper, an entirely separate distributed system. KRaft (metadata managed by Kafka’s own Raft quorum) replaces it: it has been production-ready since Kafka 3.3, and Kafka 4.0 removed ZooKeeper support. Your operational burden depends on which version you run.

Operational realities of RabbitMQ

Memory pressure on big queues. RabbitMQ is memory-hungry. A queue with millions of un-acked messages can OOM the broker. Set queue length limits and lazy-queue mode for long backlogs.

Mirrored queues vs quorum queues. Older HA mirrored queues have known issues. Use quorum queues (Raft-based), available since RabbitMQ 3.8. They are the modern, supported HA mode.

Connection vs channel. Open one connection per process; many channels per connection. A common bug is opening one connection per request, which exhausts the broker’s connection limit fast.

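Exchange declaration is idempotent. Code can call channel.exchangeDeclare(...) on every startup safely, as long as the parameters match the existing exchange. Redeclaring with different parameters (type, durability, arguments) fails with a PRECONDITION_FAILED channel error; you have to delete and recreate the exchange. Plan for this in deployments.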
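
The declare semantics can be modelled in a few lines. This is a toy model of the broker-side check, not client code: ExchangeRegistry and PreconditionFailed are hypothetical names, and only the type and durable parameters are compared.

```python
class PreconditionFailed(Exception):
    """Stands in for the broker's PRECONDITION_FAILED channel error."""


class ExchangeRegistry:
    def __init__(self):
        self.exchanges = {}   # name -> (exchange_type, durable)

    def declare(self, name, exchange_type, durable=True):
        requested = (exchange_type, durable)
        existing = self.exchanges.get(name)
        if existing is None:
            self.exchanges[name] = requested      # first declaration creates it
        elif existing != requested:
            raise PreconditionFailed(
                f"exchange {name!r} exists as {existing}, requested {requested}"
            )
        # Same parameters as before: nothing to do.

    def delete(self, name):
        self.exchanges.pop(name, None)


registry = ExchangeRegistry()
registry.declare("orders", "direct")
registry.declare("orders", "direct")      # safe to repeat on every startup

try:
    registry.declare("orders", "topic")   # changed type: rejected
    changed_type_accepted = True
except PreconditionFailed:
    changed_type_accepted = False

registry.delete("orders")                 # the deployment step to plan for
registry.declare("orders", "topic")
```

The delete-and-recreate step also drops the exchange's bindings on a real broker, which is why it belongs in a planned migration and not in application startup code.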

Single-broker default. A standalone RabbitMQ is fine for most workloads but not HA. Production needs a cluster, and clusters have their own quirks (network partitions, split-brain).

When you genuinely need both

A few real architectures use both:

  • Operational events on RabbitMQ, business events on Kafka. Notifications, retries, dead-letter routing on RMQ; long-term event sourcing on Kafka.
  • CDC into Kafka, fanout via RabbitMQ. Debezium ships database changes to Kafka; a Kafka-to-RMQ bridge fans out specific events to per-consumer queues.
  • Migration in flight. A team moving from RMQ to Kafka may run both for a quarter.

This is rare. Most teams pick one and stay there.

Cloud-managed alternatives

For both, a managed service often makes more sense than running it yourself.

  • Kafka: Confluent Cloud, Amazon MSK, Azure Event Hubs (Kafka-compatible).
  • RabbitMQ: CloudAMQP, Amazon MQ.

The managed cost is real but the alternative is an SRE who knows how to recover from a cluster split-brain at 3 a.m. — usually more expensive.

For very simple workloads, Amazon SQS (queue) or SNS (pub/sub) might be enough; you do not always need a “real” message broker. The same goes for Google Cloud Pub/Sub.

A small thing that catches teams

Order guarantees.

Kafka guarantees order within a partition. If you partition by user ID, all events for one user are ordered, but events across users may interleave. If you spread messages without a key (round-robin), one user’s events land in different partitions and there is no per-user ordering guarantee at all.
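
A small simulation shows the difference between keyed and keyless assignment. It assumes a hash-modulo partitioner with CRC32 as a stand-in hash (Kafka's default uses murmur2) and a strict round-robin for the keyless case; assign and partitions_holding are hypothetical helpers.

```python
import zlib


def assign(events, num_partitions, keyed):
    """Place (user, sequence) events into partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for index, (user, seq) in enumerate(events):
        if keyed:
            # Same user always hashes to the same partition.
            target = zlib.crc32(user.encode("utf-8")) % num_partitions
        else:
            # Round-robin: position in the stream decides the partition.
            target = index % num_partitions
        partitions[target].append((user, seq))
    return partitions


def partitions_holding(partitions, user):
    return sum(1 for p in partitions if any(u == user for u, _ in p))


events = [("alice", 1), ("bob", 1), ("alice", 2), ("bob", 2), ("alice", 3)]

keyed = assign(events, 3, keyed=True)
spread = assign(events, 3, keyed=False)

alice_keyed = partitions_holding(keyed, "alice")     # one partition
alice_spread = partitions_holding(spread, "alice")   # scattered across three
```

In the keyed layout a single consumer sees alice's events 1, 2, 3 in order. In the round-robin layout three different consumers may each hold one of them, and nothing coordinates which is processed first.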

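RabbitMQ guarantees order within a queue. With one queue and one consumer, messages are processed in order. With one queue and many consumers, messages are still delivered in queue order, but consumers finish at different times, so the order in which messages are processed is not guaranteed.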

Order matters more than people remember. Design the partitioning / queue structure around what needs to stay in order.

The takeaway

Kafka and RabbitMQ are different shapes. Kafka is a durable, replayable log; RabbitMQ is a smart message router. Pick by workload: high-throughput, replayable, multi-consumer streaming → Kafka. Per-message routing, ack/retry, RPC, lower volume → RabbitMQ. Pick by team: the one your engineers can operate.

The decision is harder to undo than to make. Spending an afternoon on the decision tree before adoption saves the year of fighting the wrong tool.

