Scaling WebSockets Horizontally with Redis: From One Server to a Cluster of Real-Time Connections

Your WebSocket server works fine on your laptop. One instance, a dozen connections, every message arrives where it should. Then you deploy to production. You add a second instance behind a load balancer. And suddenly messages disappear. User A, connected to instance 1, sends a message to User B, connected to instance 2, and the message never arrives. User A’s client thinks it sent. User B sees nothing. The user opens a bug ticket. You add logging. You see the message arrive at instance 1. It broadcasts to all connections on instance 1. Instance 2 never gets the memo.

This is the WebSocket scaling wall. And if you hit it without a plan, you will spend a week introducing a message bus, and during that week your users will keep filing bugs.

Here is the plan.

The Problem: WebSocket Connections Are Tied to a Process

A regular HTTP request is stateless. You can route it to any instance in your cluster and get the same result. A WebSocket connection, once established, lives on a single process. The TCP socket is pinned to that Node.js process, that container, that machine. When user A connects to instance 1, their socket lives in instance 1’s memory. Any broadcast or targeted message that originates on instance 2 has no way to reach that socket unless you build a bridge.

The naive fix is sticky sessions. Configure your load balancer to pin a client to the same instance based on a cookie or IP hash. That keeps a reconnecting client on the instance that already holds its state, but it does not put User A and User B on the same instance, so a message between them still has nowhere to go. The pinning also breaks when an instance restarts, when the load balancer rebalances, or when you scale up during a traffic spike. And it does nothing to help a message that originates from a different service entirely (a background job, a webhook handler, another WebSocket server).

You need an external message bus. Redis is the simplest one that works at production scale without a dedicated ops team.

The Architecture: Redis as the Cross-Instance Bridge

Every WebSocket server instance connects to the same Redis instance (or cluster). When a message needs to reach a user who is not connected to the current server, the server publishes that message to a Redis channel. Every other server instance subscribes to that same channel, receives the message, and forwards it to the relevant local socket.

User A --WS--> Instance 1 --PUB--> Redis --SUB--> Instance 2 --WS--> User B

This is publish/subscribe fan-out rather than a work queue: publish once, deliver to every subscriber, and let each subscriber decide whether the message belongs to a local connection.
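The flow in the diagram can be tried without Redis at all. The following is a minimal in-memory sketch, not the production wiring: FakeBus stands in for the Redis channel, FakeInstance for a server process, and an inbox array for a user's WebSocket. All three names are hypothetical and exist only for this illustration.

```typescript
// In-memory stand-in for a Redis channel: every subscriber sees every message.
type Handler = (raw: string) => void;

class FakeBus {
  private handlers: Handler[] = [];

  subscribe(handler: Handler): void {
    this.handlers.push(handler);
  }

  publish(raw: string): void {
    for (const handler of this.handlers) handler(raw);
  }
}

// One server instance. The inbox array stands in for a user's WebSocket.
class FakeInstance {
  readonly inboxes = new Map<string, string[]>();
  private bus: FakeBus;

  constructor(bus: FakeBus) {
    this.bus = bus;
    bus.subscribe((raw) => {
      const { targetUserId, payload } = JSON.parse(raw);
      const inbox = this.inboxes.get(targetUserId);
      if (!inbox) return; // user is not connected to this instance
      inbox.push(payload);
    });
  }

  connect(userId: string): void {
    this.inboxes.set(userId, []);
  }

  send(targetUserId: string, payload: string): void {
    // Always publish; each instance filters against its own local map.
    this.bus.publish(JSON.stringify({ targetUserId, payload }));
  }
}

const bus = new FakeBus();
const instance1 = new FakeInstance(bus);
const instance2 = new FakeInstance(bus);
instance1.connect('userA');
instance2.connect('userB');

// User A (instance 1) sends to User B (instance 2).
instance1.send('userB', 'hello');
console.log(instance2.inboxes.get('userB')); // contains 'hello'
console.log(instance1.inboxes.get('userA')); // still empty
```

Both instances receive the published message, but only the one holding userB's connection delivers it.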

Step 1: Track Connections Per Instance

Before you can route messages, you need to know which user is connected to which server instance. Each instance maintains a local map of authenticated user IDs to WebSocket connections. When a user connects and authenticates, push their ID into the map. When they disconnect, remove it.

import { WebSocketServer, WebSocket } from 'ws';
import { createClient } from 'redis';

const wss = new WebSocketServer({ port: 8080 });

// Local connection map: userId -> Set<WebSocket>
// One user can have multiple tabs/device connections
const connections = new Map<string, Set<WebSocket>>();

// Subscribe to Redis for cross-instance messages
const subscriber = createClient({ url: process.env.REDIS_URL });
await subscriber.connect();

await subscriber.subscribe('ws:messages', (rawMessage) => {
  const { targetUserId, payload } = JSON.parse(rawMessage);
  const userSockets = connections.get(targetUserId);
  if (!userSockets) return; // This user is not on this instance

  for (const ws of userSockets) {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify(payload));
    }
  }
});

The connections map is the source of truth for local sockets. It lives in process memory, so lookups are O(1) and there is no Redis round trip for every incoming message. The Redis subscription only handles messages that need to cross instance boundaries.

Step 2: Publish to Redis When the Target Is Remote

When a user sends a message, you need to decide: is the target on this instance or on another instance? You cannot know without a global registry. The simplest approach is to always publish to Redis and let each instance filter locally. The overhead of an empty filter is a JSON parse and a map lookup, which is negligible at the message rates most applications handle.

wss.on('connection', async (ws, req) => {
  // Authenticate (simplified for clarity)
  const userId = await authenticate(req);
  registerConnection(userId, ws);

  // Both callbacks are async because they use await
  ws.on('message', async (rawData) => {
    const message = JSON.parse(rawData.toString());

    // Publish to Redis channel
    // Every instance receives this, including the sender
    const redisMessage = JSON.stringify({
      targetUserId: message.recipientId,
      payload: { from: userId, text: message.text, timestamp: Date.now() },
    });
    await publisher.publish('ws:messages', redisMessage);
  });

  ws.on('close', () => {
    removeConnection(userId, ws);
  });
});

You need a separate Redis client for publishing because Redis does not allow a client to publish while in subscriber mode. Create two clients:

const publisher = createClient({ url: process.env.REDIS_URL });
const subscriber = createClient({ url: process.env.REDIS_URL });
await Promise.all([publisher.connect(), subscriber.connect()]);

One client publishes. The other subscribes. Never mix the two.

Step 3: Avoid Echoes with a Sender Filter

If every instance publishes to the same Redis channel and every instance subscribes to it, the sending instance receives its own message back. That is harmless while the subscriber handler is the only delivery path. It becomes a problem as soon as the instance also delivers locally, for example by echoing a message straight back to the sender’s sockets: the subscriber handler picks up the published copy, finds the same user in the local connections map, and sends the WebSocket message again. The client gets a duplicate.

Fix this with a sender filter. Tag every message with the originating instance ID and the sender’s user ID, and skip delivery in the subscriber handler if the message originated from the current instance AND the target user is the sender.

const INSTANCE_ID = crypto.randomUUID();

await subscriber.subscribe('ws:messages', (rawMessage) => {
  const { targetUserId, payload, originInstanceId, senderUserId } = JSON.parse(rawMessage);

  // Skip echo: if this message came from our instance and the
  // recipient is the same user who sent it, they already got it.
  if (originInstanceId === INSTANCE_ID && targetUserId === senderUserId) {
    return;
  }

  const userSockets = connections.get(targetUserId);
  if (!userSockets) return;

  for (const ws of userSockets) {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify(payload));
    }
  }
});

// When publishing:
await publisher.publish('ws:messages', JSON.stringify({
  targetUserId: message.recipientId,
  senderUserId: userId,
  originInstanceId: INSTANCE_ID,
  payload: { from: userId, text: message.text, timestamp: Date.now() },
}));

This suppresses the duplicate without a Redis SISMEMBER check or a distributed deduplication set. A message that the local handler already delivered to its sender is dropped when it comes back through the subscriber handler, while messages for every other target are delivered as before.
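The filter is easier to reason about, and to unit test, when it is pulled out of the subscriber callback. A minimal sketch, assuming the envelope fields used in the handler above; the Envelope interface and the shouldDeliver name are illustrative, not part of any library.

```typescript
// Shape of the JSON published to the channel (fields as used in the handler above).
interface Envelope {
  targetUserId: string;
  senderUserId?: string;
  originInstanceId?: string;
  payload: unknown;
}

// Returns true when the subscriber handler should deliver the envelope
// to local sockets, false when it is an echo of a self-addressed message
// that originated on this instance.
function shouldDeliver(envelope: Envelope, instanceId: string): boolean {
  const fromThisInstance = envelope.originInstanceId === instanceId;
  const sentToSelf = envelope.targetUserId === envelope.senderUserId;
  return !(fromThisInstance && sentToSelf);
}

const me = 'instance-1';

// Echo of a self-addressed message from this instance: skipped.
const echo = shouldDeliver(
  { targetUserId: 'alice', senderUserId: 'alice', originInstanceId: me, payload: {} },
  me,
);

// Message from this instance to a different user: delivered.
const toOther = shouldDeliver(
  { targetUserId: 'bob', senderUserId: 'alice', originInstanceId: me, payload: {} },
  me,
);

// Self-addressed message from another instance (another tab): delivered.
const fromElsewhere = shouldDeliver(
  { targetUserId: 'alice', senderUserId: 'alice', originInstanceId: 'instance-2', payload: {} },
  me,
);

console.log({ echo, toOther, fromElsewhere });
```

Note that only the self-addressed case is suppressed. A message from this instance to another local user still goes through the subscriber handler, which is what you want when that handler is the only delivery path for them.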

Step 4: Handle Disconnections

When a user disconnects, remove them from the local map. But what about messages that arrive around the moment of the disconnect? That is fine. If the close handler has already run, the subscriber calls connections.get(targetUserId), gets undefined, and drops the message. If it has not run yet, the readyState check skips the closing socket. Either way, the user’s client has already closed the socket.

The harder problem is a server crash. If instance 1 crashes, any messages published by instance 2 to ws:messages will never reach instance 1’s users. And instance 1’s users are now disconnected from the chat entirely. That is acceptable for a real-time chat app as long as the clients reconnect and the new connection gets assigned to a live instance. The reconnection logic (exponential backoff with jitter) is the client’s responsibility.

But you can do better. Store the user-to-instance mapping in Redis itself, so any instance can look up where a user is currently connected, and the mapping follows the user when they reconnect to instance 3.

// On authentication, store the mapping in Redis with a TTL
await publisher.set(`ws:user:${userId}`, INSTANCE_ID, {
  EX: 60, // Expire after 60 seconds
});

// Refresh the TTL periodically with a heartbeat
setInterval(async () => {
  for (const userId of connections.keys()) {
    await publisher.set(`ws:user:${userId}`, INSTANCE_ID, { EX: 60 });
  }
}, 30_000);

// When publishing, check if the target is on a different instance
async function sendToUser(userId: string, payload: unknown) {
  const targetInstance = await publisher.get(`ws:user:${userId}`);

  if (targetInstance === INSTANCE_ID) {
    // Local delivery
    const sockets = connections.get(userId);
    if (!sockets) return;
    for (const ws of sockets) {
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(JSON.stringify(payload));
      }
    }
    return;
  }

  // Remote delivery via Redis channel
  await publisher.publish('ws:messages', JSON.stringify({
    targetUserId: userId,
    originInstanceId: INSTANCE_ID,
    payload,
  }));
}

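Now you have a global routing table. When a user switches instances (their original server crashed and they reconnected to instance 3), the Redis key is overwritten as soon as they authenticate on instance 3. Messages sent between the crash and the reconnect are still published to ws:messages, match no local socket on any instance, and are lost. If the user never comes back, the heartbeat stops refreshing the key and the TTL removes the stale mapping within 60 seconds. One caveat: the key holds a single instance ID, so a user with tabs on two instances is only partly covered by the local-delivery shortcut. The always-publish approach from Step 2 does not have that limitation.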
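Because one user can hold sockets on several instances, a single instance ID per user cannot describe every case. One possible variation, sketched here as an assumption rather than as part of the design above, is to keep a set of instance IDs per user (for example in a Redis set) and derive the routing decision from it. The planDelivery function and DeliveryPlan type are hypothetical names for illustration.

```typescript
interface DeliveryPlan {
  local: boolean;   // deliver to sockets held by this process
  publish: boolean; // publish to the Redis channel for other instances
}

// `instanceIds` is the (hypothetical) list of instances that currently
// hold at least one socket for the target user.
function planDelivery(instanceIds: readonly string[], selfId: string): DeliveryPlan {
  if (instanceIds.length === 0) {
    // Unknown or expired mapping: fall back to publishing and let
    // every instance filter locally, as in Step 2.
    return { local: false, publish: true };
  }
  const local = instanceIds.includes(selfId);
  const publish = instanceIds.some((id) => id !== selfId);
  return { local, publish };
}

console.log(planDelivery(['i1'], 'i1'));       // local only
console.log(planDelivery(['i2'], 'i1'));       // publish only
console.log(planDelivery(['i1', 'i2'], 'i1')); // both
console.log(planDelivery([], 'i1'));           // fallback: publish
```

When a plan asks for both local delivery and a publish, the origin instance's subscriber would also need to skip envelopes carrying its own originInstanceId, otherwise local sockets receive the message twice.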

Step 5: Room-Based Multicast

If your application uses rooms, channels, or groups (most real-time apps do), the pattern extends naturally. Instead of targeting a single user, publish to a room channel. Each instance subscribes to the rooms that have local members.

// When a user joins a room
async function joinRoom(userId: string, roomId: string, ws: WebSocket) {
  // Track locally
  if (!rooms.has(roomId)) rooms.set(roomId, new Map());
  const members = rooms.get(roomId)!;
  if (!members.has(userId)) members.set(userId, new Set());
  members.get(userId)!.add(ws);

  // Subscribe to the room channel on Redis if not already subscribed.
  // Mark the channel before awaiting so two concurrent joins cannot
  // register two listeners, which would deliver every message twice.
  const roomChannel = `ws:room:${roomId}`;
  if (!subscribedRooms.has(roomChannel)) {
    subscribedRooms.add(roomChannel);
    await subscriber.subscribe(roomChannel, (rawMessage) => {
      const { payload, excludeUserId, originInstanceId } = JSON.parse(rawMessage);

      // Skip echo: sendToRoom already delivered to local members
      if (originInstanceId === INSTANCE_ID) return;

      const roomMembers = rooms.get(roomId);
      if (!roomMembers) return;

      for (const [memberId, sockets] of roomMembers) {
        if (memberId === excludeUserId) continue;
        for (const sock of sockets) {
          if (sock.readyState === WebSocket.OPEN) {
            sock.send(JSON.stringify(payload));
          }
        }
      }
    });
  }
}

// When sending a message to a room
async function sendToRoom(roomId: string, payload: unknown, excludeUserId?: string) {
  // Publish to the room channel
  await publisher.publish(`ws:room:${roomId}`, JSON.stringify({
    payload,
    excludeUserId,
    originInstanceId: INSTANCE_ID,
  }));

  // Also deliver locally (the Redis subscriber skips originInstanceId === INSTANCE_ID)
  const roomMembers = rooms.get(roomId);
  if (!roomMembers) return;
  for (const [memberId, sockets] of roomMembers) {
    if (memberId === excludeUserId) continue;
    for (const sock of sockets) {
      if (sock.readyState === WebSocket.OPEN) {
        sock.send(JSON.stringify(payload));
      }
    }
  }
}

This avoids broadcasting every room message to every instance. Instances only subscribe to rooms that have local members. If room “general” has 10,000 members spread across 5 instances, every instance receives the message, and 10,000 clients get it. That is the unavoidable broadcast cost. But if room “private-team-alpha” has 3 members all on instance 2, instance 1 does not even see the message. The per-room subscription granularity saves Redis bandwidth and CPU.
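Keeping subscriptions limited to rooms with local members also means dropping a subscription when the last local member leaves. The following is a sketch of that bookkeeping, assuming the same rooms and subscribedRooms structures as above. leaveRoom and SocketLike are hypothetical names, and SocketLike replaces the real WebSocket type so the example runs on its own.

```typescript
// Minimal stand-in for a WebSocket; only object identity matters here.
interface SocketLike {
  id: string;
}

type RoomMembers = Map<string, Set<SocketLike>>; // userId -> sockets

// Removes one socket from a room. Returns the Redis channel to unsubscribe
// from when the room has no local members left, otherwise null.
function leaveRoom(
  rooms: Map<string, RoomMembers>,
  subscribedRooms: Set<string>,
  userId: string,
  roomId: string,
  ws: SocketLike,
): string | null {
  const members = rooms.get(roomId);
  if (!members) return null;

  const sockets = members.get(userId);
  if (sockets) {
    sockets.delete(ws);
    if (sockets.size === 0) members.delete(userId);
  }
  if (members.size > 0) return null;

  // Last local member left: forget the room and report its channel.
  rooms.delete(roomId);
  const roomChannel = `ws:room:${roomId}`;
  return subscribedRooms.delete(roomChannel) ? roomChannel : null;
}

// Demo: two local members leave one after the other.
const sockA: SocketLike = { id: 'a' };
const sockB: SocketLike = { id: 'b' };
const demoRooms = new Map<string, RoomMembers>();
demoRooms.set(
  'general',
  new Map<string, Set<SocketLike>>([
    ['alice', new Set<SocketLike>([sockA])],
    ['bob', new Set<SocketLike>([sockB])],
  ]),
);
const demoSubscribed = new Set<string>(['ws:room:general']);

const afterAlice = leaveRoom(demoRooms, demoSubscribed, 'alice', 'general', sockA);
const afterBob = leaveRoom(demoRooms, demoSubscribed, 'bob', 'general', sockB);
console.log(afterAlice, afterBob);
```

In the server, the caller would pass a non-null result to subscriber.unsubscribe(channel), and would call leaveRoom for every room a socket belongs to when it closes.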

Step 6: Graceful Shutdown and Connection Draining

When you deploy a new version, Kubernetes or your orchestrator sends a SIGTERM to the old pods. Your WebSocket server has open connections. If you exit immediately, every connected user drops with a connection reset. They reconnect, hit the load balancer, and establish new sockets on the new instances. That works, but the reconnect storm can spike your database connection pool and cause cascading failures.

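Drain connections gracefully. Stop the WebSocket server from accepting new connections, notify connected clients that a reconnect is coming, and then close each socket with an explicit close code. The example uses 4001, which falls in the 4000-4999 range reserved for application-defined codes; the standard 1001 (Going Away) would also fit.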

process.on('SIGTERM', async () => {
  console.log(`[${INSTANCE_ID}] Shutting down, draining ${connections.size} users`);

  // Stop accepting new connections
  wss.close();

  // Notify all connected clients to reconnect
  const reconnectPayload = JSON.stringify({
    type: 'server:reconnect',
    reconnectAfter: 1000,
    server: INSTANCE_ID,
  });

  for (const sockets of connections.values()) {
    for (const ws of sockets) {
      try {
        ws.send(reconnectPayload);
        ws.close(4001, 'Server restarting');
      } catch {
        // Socket already closed
      }
    }
  }

  // Wait for the client to receive the message
  await new Promise((resolve) => setTimeout(resolve, 2000));

  // Unsubscribe from Redis
  await subscriber.unsubscribe('ws:messages');
  await subscriber.quit();
  await publisher.quit();

  process.exit(0);
});

The client should recognize the server:reconnect message type, close the current socket, and initiate a new connection with exponential backoff. This turns a chaotic disconnect storm into an orderly reconnection wave.
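On the client side, the timing logic is small enough to sketch. The example below assumes the server:reconnect payload shape from the shutdown handler above and uses a "full jitter" backoff. The function names, the 500 ms base, and the 30 second cap are illustrative choices, not values from any library.

```typescript
// Delay before reconnect attempt `attempt` (0-based), using full jitter:
// a random value in [0, min(maxMs, baseMs * 2^attempt)).
function reconnectDelay(
  attempt: number,
  baseMs = 500,
  maxMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Payload sent by the server before a planned restart.
interface ReconnectNotice {
  type: 'server:reconnect';
  reconnectAfter: number; // milliseconds
}

// After a planned restart, wait at least `reconnectAfter` and spread
// clients over the following window so they do not all return at once.
function delayAfterNotice(
  notice: ReconnectNotice,
  random: () => number = Math.random,
): number {
  return notice.reconnectAfter + Math.floor(random() * notice.reconnectAfter);
}

// Deterministic demo with a fixed "random" value of 0.5.
const half = () => 0.5;
console.log(reconnectDelay(0, 500, 30_000, half));  // 250
console.log(reconnectDelay(3, 500, 30_000, half));  // 2000
console.log(reconnectDelay(10, 500, 30_000, half)); // 15000 (capped ceiling)
console.log(delayAfterNotice({ type: 'server:reconnect', reconnectAfter: 1000 }, half)); // 1500
```

The client would reset its attempt counter after a successful connection and pass the computed delay to setTimeout before opening the next socket.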

Putting It All Together

Here is the complete server skeleton with all the pieces:

import { WebSocketServer, WebSocket } from 'ws';
import { createClient } from 'redis';

const INSTANCE_ID = crypto.randomUUID();
const REDIS_URL = process.env.REDIS_URL || 'redis://localhost:6379';

const connections = new Map<string, Set<WebSocket>>();
const rooms = new Map<string, Map<string, Set<WebSocket>>>();
const subscribedRooms = new Set<string>();

const publisher = createClient({ url: REDIS_URL });
const subscriber = createClient({ url: REDIS_URL });

await Promise.all([publisher.connect(), subscriber.connect()]);

// Global message channel
await subscriber.subscribe('ws:messages', (raw) => {
  const { targetUserId, payload, originInstanceId, senderUserId } = JSON.parse(raw);
  if (originInstanceId === INSTANCE_ID && targetUserId === senderUserId) return;

  const sockets = connections.get(targetUserId);
  if (!sockets) return;
  for (const ws of sockets) {
    if (ws.readyState === WebSocket.OPEN) ws.send(JSON.stringify(payload));
  }
});

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', async (ws, req) => {
  const userId = await authenticate(req); // Your auth logic
  registerConnection(userId, ws);

  ws.on('message', async (raw) => {
    const msg = JSON.parse(raw.toString());

    if (msg.type === 'direct') {
      await publisher.publish('ws:messages', JSON.stringify({
        targetUserId: msg.recipientId,
        senderUserId: userId,
        originInstanceId: INSTANCE_ID,
        payload: msg.data,
      }));
    } else if (msg.type === 'room') {
      // sendToRoom (Step 5) publishes to the room channel and delivers to
      // local members; the room subscriber skips this instance's own messages.
      await sendToRoom(msg.roomId, msg.data);
    }
  });

  ws.on('close', () => removeConnection(userId, ws));
});

function registerConnection(userId: string, ws: WebSocket) {
  if (!connections.has(userId)) connections.set(userId, new Set());
  connections.get(userId)!.add(ws);

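  // Record which instance holds this user; log instead of leaving
  // a rejected promise unhandled if Redis is unavailable
  publisher.set(`ws:user:${userId}`, INSTANCE_ID, { EX: 60 }).catch(console.error);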
}

function removeConnection(userId: string, ws: WebSocket) {
  const sockets = connections.get(userId);
  if (!sockets) return;
  sockets.delete(ws);
  if (sockets.size === 0) connections.delete(userId);
}

// Heartbeat interval
setInterval(() => {
  for (const userId of connections.keys()) {
    publisher.set(`ws:user:${userId}`, INSTANCE_ID, { EX: 60 }).catch(console.error);
  }
}, 30_000);

What This Pattern Does Not Do

This pattern handles real-time messaging across N instances. It does not handle:

Persistent message history. Redis Pub/Sub has no built-in message persistence: delivery is at-most-once, and a subscriber that is disconnected even briefly misses whatever was published in the meantime. Use Redis Streams if you need to replay missed messages, or write messages to Postgres before publishing.
Exactly-once delivery. The subscriber callback fires once per connected Redis client. If the callback throws, the message is lost. Wrap the handler in a try/catch and log failures.
Ordering across rooms. Messages on different channels are processed concurrently. If your application requires strict ordering across all messages, use a single channel and serialize through it. This limits throughput, so only do this if you actually need it.
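The try/catch advice from the list above can be packaged as a small wrapper around any Pub/Sub handler. This is a sketch under the assumption that handlers receive the raw JSON string, as in the examples in this post. safeHandler is a hypothetical helper name.

```typescript
type RawHandler = (raw: string) => void;

// Wraps a Pub/Sub handler so malformed JSON or a throwing handler
// is reported and dropped instead of propagating out of the callback.
function safeHandler(
  handle: (message: unknown) => void,
  onError: (err: unknown, raw: string) => void = (err) =>
    console.error('Pub/Sub handler failed', err),
): RawHandler {
  return (raw) => {
    try {
      handle(JSON.parse(raw));
    } catch (err) {
      onError(err, raw);
    }
  };
}

// Demo: one valid message, one malformed message, one handler failure.
const handled: unknown[] = [];
const failures: string[] = [];

const wrapped = safeHandler(
  (message) => {
    if ((message as { fail?: boolean }).fail) throw new Error('handler blew up');
    handled.push(message);
  },
  (_err, raw) => failures.push(raw),
);

wrapped('{"targetUserId":"bob","payload":"hi"}');
wrapped('not json');
wrapped('{"fail":true}');
console.log(handled.length, failures.length); // 1 2
```

The wrapped function can be passed directly as the listener argument to subscriber.subscribe. The dropped message is still lost, so the error callback is the place to log or count failures.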

The Practical Takeaway

If you have more than one WebSocket server instance, you need a cross-instance message bus. Redis Pub/Sub is the simplest option that scales to thousands of connections across dozens of instances. Create two Redis clients (publisher and subscriber), tag messages with an origin instance ID to suppress echoes, and subscribe per-room to avoid broadcasting every message to every server.

Start with the basic global channel. Add per-room subscriptions when you measure Redis CPU as a bottleneck. Add the user-to-instance mapping in Redis when you need faster failover. Do not add sticky sessions as a shortcut — they mask the problem and introduce a new one when an instance dies.

Ship the client reconnection logic before you ship the server. Without it, scaling WebSockets just means more users experience the same disconnect delay when you deploy.
