The Practical Developer

Kafka Consumer Group Rebalancing: Why Your Event Processing Stalls and How to Stop It

Your Kafka consumers process 50,000 events per second until one pod restarts and the entire group freezes for 45 seconds. Consumer group rebalancing is the silent killer of event-driven throughput. Here is how the protocol actually works, the configs that stop the stalls, and the Node.js consumer setup that survives rolling deployments.

Server room with rows of blinking rack equipment representing the infrastructure layer where Kafka consumer rebalancing problems hide

Your Kafka consumer group runs six pods. They process 50,000 events per second with a median latency of 12 milliseconds. A deployment rolls a new version. One pod drains, the next one starts, and for the next 45 seconds every consumer in the group stops processing. Incoming events pile up in the topic. Lag spikes from 0 to 200,000. Your alerting fires. By the time the rebalance finishes and consumers resume, you are five minutes behind real-time and the catch-up takes another ten.

This is not a crash. This is not a bug in your consumer code. This is the Kafka consumer group rebalance protocol, and it is probably the most painful production behavior in the entire Kafka ecosystem. The default settings are tuned for correctness, not for speed. They guarantee every partition is consumed by exactly one consumer. They do not guarantee that the handoff is fast.

The good news is that the Kafka community has spent the last several releases fixing this. Cooperative rebalancing, static group membership, and smart heartbeat tuning can turn a 45-second freeze into a sub-second pause. The bad news is that most teams never change the defaults, and most tutorials never mention the settings that matter.

This post covers the rebalance protocol from the inside out, the specific configurations that stop the stalls, and the Node.js consumer setup (using kafkajs) that survives a rolling deployment without spiking lag.

What a rebalance actually does

A Kafka consumer group is a set of consumers that share the work of reading a topic’s partitions. Each partition is assigned to exactly one consumer in the group. When the group membership changes (a consumer joins, leaves, or fails), Kafka must reassign partitions so every partition is covered and no partition is assigned twice. That reassignment is a rebalance.

The sequence looks like this:

  1. A consumer joins the group by sending a JoinGroup request to the group coordinator (one of the Kafka brokers).
  2. The coordinator picks one consumer as the group leader.
  3. The leader receives the full list of members and their subscribed topics, computes a partition assignment (using the PartitionAssignor), and sends it back to the coordinator.
  4. The coordinator distributes the assignment to every member via SyncGroup responses.
  5. Every consumer receives its new assignment and starts fetching from its new partitions.

During steps 1 through 4, every consumer in the group is paused. No processing happens. No offsets are committed. The group has a generation ID, and any consumer that tries to commit offsets using a stale generation gets a IllegalGeneration error and is kicked out of the group.

This is the stop-the-world moment. And its length depends on three things: which rebalance protocol you use, how fast your consumers respond to heartbeats, and how much work your assignor does.

Eager rebalancing: the default that freezes everything

The classic Kafka rebalance protocol is called “eager” (also called “stop-the-world”). Every rebalance revokes ALL partitions from ALL consumers before reassigning them. Even consumers whose assignments will not change must stop, release their partitions, and wait for the new assignment.

For a consumer group with six pods and thirty partitions, an eager rebalance means:

  • Every consumer stops processing.
  • Every consumer sends a LeaveGroup or the coordinator detects a missed heartbeat.
  • The coordinator waits for the max.poll.interval.ms timeout or the session timeout, whichever comes second.
  • The leader computes the assignment from scratch.
  • The coordinator distributes it.
  • Every consumer resumes with its new set of partitions.

The practical result: during a rolling deployment, your group rebalances twice per pod (once when the old pod leaves, once when the new pod joins). If your session.timeout.ms is the default of 45 seconds and your max.poll.interval.ms is the default of 5 minutes, a single rebalance can take 30+ seconds. Multiply by twelve for a six-pod rolling update, and you have two minutes of total processing downtime spread across six minutes of deployment.

Here is what that looks like in practice:

14:00:00 Pod-1 drains, leaves group. Rebalance starts.
14:00:30 Coordinator detects Pod-1's session expired. Sends JoinGroup to remaining members.
14:00:31 Leader computes assignment. Sends SyncGroup.
14:00:32 All pods resume. Lag has grown by 1.5 million events.
14:00:33 Pod-2 (new) starts, joins group. Rebalance starts again.
...

The fix is not “make the timeouts smaller” (though that helps). The fix is to stop revoking partitions that do not need to move.

Cooperative rebalancing: only move what changed

Kafka 2.4 introduced the cooperative rebalancing protocol (KIP-429). Instead of revoking all partitions on every rebalance, cooperative rebalancing works in phases:

  1. The coordinator detects a membership change and tells consumers which partitions they might need to give up.
  2. Consumers revoke only those specific partitions and continue processing the rest.
  3. The coordinator assigns the freed partitions to the consumers that need them.
  4. If more partitions need to move, the cycle repeats. Usually one cycle is enough.

This is the difference between “everyone stops for 30 seconds” and “one consumer stops processing 5 partitions for 200 milliseconds while the rest keep going.”

To use cooperative rebalancing, you need three things:

  • Kafka broker version 2.4 or later (most clusters in 2026 are well past this).
  • A cooperative partition assignor (CooperativeStickyAssignor).
  • A consumer client that supports cooperative rebalancing (Kafka client 2.4+ or kafkajs 2.0+).

In kafkajs, the setup looks like this:

import { Kafka, AssignerProtocol } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'order-processor',
  brokers: ['kafka-0:9092', 'kafka-1:9092', 'kafka-2:9092'],
});

const consumer = kafka.consumer({
  groupId: 'order-processor-group',
  // Use cooperative sticky assignor instead of the default round-robin
  partitionAssigners: [AssignerProtocol.cooperativeStickyAssignor],
  // Allow the consumer to continue processing during rebalance
  rebalanceTimeout: 60000,
});

The rebalanceTimeout is the max time the consumer will take to revoke the partitions it is asked to give up. During this window, the consumer keeps fetching from partitions it gets to keep. This is the key difference from eager mode, where the consumer must stop everything before the rebalance can proceed.

One important detail: cooperative rebalancing does not eliminate the pause entirely. The consumer still has to stop fetching from the partitions it is revoking, and there is a brief synchronization phase. But the pause shrinks from “all partitions, all consumers, for the duration of the rebalance” to “a subset of partitions, on a subset of consumers, for the time it takes to revoke.”

Static group membership: the biggest single win

The most disruptive rebalances happen when consumers leave and rejoin during deployments. Every pod replacement triggers a full rebalance (or a cooperative cycle) because the group sees a new member with a new memberId.

Static group membership (KIP-345, available since Kafka 2.3) solves this by letting consumers identify themselves with a stable group.instance.id instead of a transient memberId. When a consumer with group.instance.id: "consumer-1" disconnects and reconnects within the session.timeout.ms window, the coordinator treats it as the same member and does not trigger a rebalance.

In kafkajs, you set it in the consumer config:

const consumer = kafka.consumer({
  groupId: 'order-processor-group',
  groupId: 'order-processor-group',
  // Every consumer gets a stable identity
  groupInstanceId: `consumer-${process.env.HOSTNAME || process.env.POD_NAME}`,
  partitionAssigners: [AssignerProtocol.cooperativeStickyAssignor],
  sessionTimeout: 30000,
  rebalanceTimeout: 60000,
});

With groupInstanceId set, here is what happens during a rolling deployment:

  1. The old pod shuts down gracefully, calling consumer.stop() and consumer.disconnect().
  2. The coordinator sees the disconnect but keeps the member’s assignment reserved for session.timeout.ms (30 seconds by default, which is fine to set lower with static membership).
  3. The new pod starts up, connects with the same groupInstanceId, and the coordinator hands it the same partition assignment.
  4. No rebalance. Zero lag spikes. The deployment completes in seconds.

The catch: each groupInstanceId must be unique across the group. If two consumers try to register with the same ID, the second one gets an error. In Kubernetes, use the pod name or the StatefulSet ordinal.

Heartbeat tuning: stop getting kicked out for no reason

Two timeouts control how long a consumer can be silent before the coordinator kicks it out of the group:

  • session.timeout.ms (default: 45 seconds): the max time between heartbeats. If the coordinator does not receive a heartbeat within this window, it marks the consumer as dead and triggers a rebalance.
  • heartbeat.interval.ms (default: 3 seconds): how often the consumer sends heartbeats. Should be one-third of session.timeout.ms.

The defaults are conservative. Forty-five seconds means a consumer can be stuck in a GC pause or a slow processBatch call for 44 seconds before the coordinator considers it dead. That sounds generous, but it also means the coordinator waits 45 seconds to start a rebalance when a consumer genuinely dies. During a rolling deployment with eager rebalancing and default session timeouts, every pod replacement costs you nearly a minute of downtime before the rebalance even starts.

For most production workloads, tighten these:

const consumer = kafka.consumer({
  groupId: 'order-processor-group',
  groupInstanceId: `consumer-${process.env.POD_NAME}`,
  partitionAssigners: [AssignerProtocol.cooperativeStickyAssignor],
  sessionTimeout: 10000,         // 10 seconds
  heartbeatInterval: 3000,       // 3 seconds (1/3 of session timeout)
  rebalanceTimeout: 60000,
  maxWaitTimeInMs: 5000,         // how long to block in fetch for data
  maxBytesPerPartition: 1048576, // 1 MB per partition per fetch
});

A 10-second session timeout means the coordinator detects a dead consumer in 10 seconds instead of 45. The rebalance starts sooner and finishes faster. The downside: if your consumer regularly pauses processing for more than 10 seconds (heavy batch processing, slow downstream calls), you will get false positives. In that case, keep the timeout higher but understand the trade-off.

max.poll.interval.ms: the config most teams get wrong

max.poll.interval.ms (default: 5 minutes) is the maximum time between calls to consumer.poll(). If your consumer takes longer than this to process a batch and call poll() again, the coordinator assumes the consumer is stuck and removes it from the group.

This is the setting that catches teams who do heavy processing inside the consumer loop:

// BAD: long processing inside the poll loop
await consumer.run({
  eachBatch: async ({ batch, heartbeat, commitOffsetsIfNecessary }) => {
    for (const message of batch.messages) {
      // This HTTP call could take 30 seconds per message
      await callExternalApi(message.value.toString());
    }
    // By the time we return, we might have exceeded max.poll.interval.ms
  },
});

The fix is one of two patterns:

  1. Process outside the poll loop. Acknowledge the messages immediately, push the work to a queue or worker thread, and let the consumer poll again quickly.

  2. Increase the timeout. If your batch processing genuinely takes 60 seconds, set maxPollInterval: 120000 on the consumer. But this delays rebalance detection, so it is a last resort.

The right approach for most teams is pattern 1:

await consumer.run({
  eachBatchAutoResolve: false,
  eachBatch: async ({ batch, heartbeat, commitOffsetsIfNecessary }) => {
    // Heartbeat frequently so the coordinator knows we are alive
    for (const message of batch.messages) {
      // Offload to a bounded queue, do not block the poll loop
      workQueue.push(message);
      await heartbeat(); // Explicit heartbeat between messages
    }
    // Commit offsets after the batch is queued
    await commitOffsetsIfNecessary();
  },
});

With frequent explicit heartbeat() calls, the coordinator sees the consumer as alive even when the batch takes longer. The max.poll.interval.ms timer resets only on poll() calls, but the session timeout resets on heartbeats. Keeping both alive requires either fast poll() cycles or explicit heartbeats.

Putting it all together: the production consumer

Here is the complete Node.js consumer setup that survives rolling deployments without lag spikes:

import { Kafka, AssignerProtocol, type Consumer } from 'kafkajs';

interface ConsumerConfig {
  groupId: string;
  brokers: string[];
  podName: string;
}

function createResilientConsumer(config: ConsumerConfig): Consumer {
  const kafka = new Kafka({
    clientId: `${config.groupId}-${config.podName}`,
    brokers: config.brokers,
    retry: {
      initialRetryTime: 100,
      retries: 8,
      maxRetryTime: 30000,
    },
  });

  const consumer = kafka.consumer({
    groupId: config.groupId,
    groupInstanceId: config.podName,  // Static membership
    partitionAssigners: [
      AssignerProtocol.cooperativeStickyAssignor,  // Cooperative rebalancing
    ],
    sessionTimeout: 10000,       // Fast dead-consumer detection
    heartbeatInterval: 3000,     // 1/3 of session timeout
    rebalanceTimeout: 60000,     // Max time to complete a rebalance
    maxPollInterval: 300000,     // 5 min (keep default, we use heartbeats)
    maxBytesPerPartition: 1048576,
    minBytes: 1,
    maxBytes: 52428800,          // 50 MB max per fetch
    maxWaitTimeInMs: 500,
  });

  return consumer;
}

// Graceful shutdown: important for static membership
async function shutdown(consumer: Consumer, signal: string) {
  console.log(`Received ${signal}, shutting down consumer...`);
  await consumer.stop();
  await consumer.disconnect();
  process.exit(0);
}

// Usage
const consumer = createResilientConsumer({
  groupId: 'order-processor-group',
  brokers: process.env.KAFKA_BROKERS?.split(',') ?? [],
  podName: process.env.HOSTNAME ?? `consumer-${Math.random().toString(36)}`,
});

// Wire up graceful shutdown
process.on('SIGTERM', () => shutdown(consumer, 'SIGTERM'));
process.on('SIGINT', () => shutdown(consumer, 'SIGINT'));

await consumer.connect();
await consumer.subscribe({ topic: 'orders', fromBeginning: false });

await consumer.run({
  eachBatchAutoResolve: true,
  eachBatch: async ({ batch, heartbeat, commitOffsetsIfNecessary }) => {
    for (const message of batch.messages) {
      await processOrder(message.value);
      await heartbeat(); // Keep session alive during long batches
    }
    await commitOffsetsIfNecessary();
  },
});

The three key decisions in this setup:

  1. Static group membership (groupInstanceId). Eliminates rebalances during graceful deployments. The single biggest win.

  2. Cooperative rebalancing (cooperativeStickyAssignor). If a rebalance does happen (pod crash, partition count change), only the affected partitions pause, not the entire group.

  3. Tight session timeout (10 seconds) with explicit heartbeats. The coordinator detects failures fast, and long-running batches do not trigger false rebalances.

Measuring rebalance impact

You cannot fix what you do not measure. Add these metrics to every consumer:

  • kafka_consumer_rebalance_count - incremented on every rebalance.
  • kafka_consumer_rebalance_duration_ms - time from JoinGroup to SyncGroup completion.
  • kafka_consumer_lag - per-partition lag, which spikes during rebalances.
  • kafka_consumer_assigned_partitions - expected to be stable during normal operation.

In kafkajs, you can instrument the consumer events:

consumer.on(consumer.events.GROUP_JOIN, (event) => {
  metrics.increment('kafka_consumer_rebalance_count');
});

consumer.on(consumer.events.REBALANCE, (event) => {
  metrics.histogram('kafka_consumer_rebalance_duration_ms', Date.now() - rebalanceStart);
});

A healthy consumer group should rebalance only when:

  • You add or remove consumers intentionally (deployment, scaling).
  • You change the topic partition count.
  • A consumer crashes (hopefully rare).

If your group is rebalancing every few minutes, something is wrong. Likely candidates: a consumer with a too-short session timeout that cannot keep up with processing, a network issue causing heartbeat failures, or a consumer that is committing offsets too slowly and getting kicked out.

The one thing that still causes rebalances

Even with all the fixes above, one scenario still triggers a full rebalance: adding partitions to a topic. When you increase the partition count, the group must reassign the new partitions to consumers. With cooperative rebalancing, this is a minimal reassignment (existing partitions stay, new ones are distributed), but it is still a rebalance event.

If you plan to scale partitions over time, do it in batches. Adding 50 partitions at once triggers one rebalance. Adding 1 partition fifty times triggers fifty rebalances. Plan your partition count for the peak throughput you expect in the next 12 months, then add 20% headroom. Changing partition count is not zero-cost.

The takeaway

Consumer group rebalancing is Kafka’s sharpest edge for teams running event-driven systems. The defaults assume you value correctness over speed, which is the right default for a database but the wrong one for a real-time event processor that needs to stay within a latency budget.

The fix is not a single config. It is a combination of four things that work together:

  • Static group membership to eliminate deployment-triggered rebalances.
  • Cooperative rebalancing to minimize the blast radius when rebalances happen.
  • Tight session timeouts to detect failures fast.
  • Explicit heartbeats to keep the session alive during long processing.

Together they turn a 45-second freeze into a sub-second blip. The code above is production-ready: copy it, set your pod name, and watch your deployment lag spike disappear on the next roll.


A note from Yojji

Building event-driven systems that stay reliable through deployments, scaling events, and partial failures requires experience with the sharp edges of distributed consensus protocols. The kind of consumer-group engineering that turns a brittle Kafka pipeline into a stable one is the kind of backend architecture work Yojji’s teams deliver for clients across fintech, e-commerce, and SaaS platforms.

Yojji is an international custom software development company founded in 2016, with teams in Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript) and event-driven architectures, including the Kafka consumer patterns and operational tooling that keep production systems stable as they scale.