WebSocket Reconnection with Backoff and State Recovery in Production

The live dashboard froze at 14:03. Not a crash. The WebSocket connection showed green in the browser DevTools, but the last trade timestamp was six minutes old. Three engineers stared at the chart until someone hit refresh, and the missing trades flooded in all at once. The server had restarted for a routine deployment. The client had reconnected in under a second, but the new server process had no memory of the connection and no concept of what events the client had already seen. The socket was open. The feed was dead.

This is the most common failure mode in production WebSocket systems, and it is rarely handled well. Most tutorials show you how to open a socket and listen for onmessage. They do not show you how to survive a server restart, a network blip, or a load balancer that drops idle connections after 60 seconds. Reconnecting is not enough. You need to reconnect politely (with backoff), detect silence (with heartbeats), and recover state (with event IDs). This post covers all three, with working code you can drop into a browser or a Node.js client today.

Why instant reconnection is worse than no reconnection

The naive approach looks like this:

function connect() {
  const ws = new WebSocket(url);
  ws.onclose = () => {
    // reconnect after a fixed 1-second delay, forever
    setTimeout(connect, 1000);
  };
}
connect();

This code is dangerous for three reasons.

First, it lacks backoff. If the server restarts and needs 10 seconds to accept connections, 1,000 clients will hammer it every second with reconnection attempts. That turns a graceful restart into a denial-of-service event. The server never gets the breathing room to finish startup because it is buried under WebSocket handshake traffic.

Second, the retry interval never grows and the retries never stop. A client with a flaky mobile connection will reconnect every second forever, burning battery and bandwidth.

Third, and most importantly, it throws away state. The new WebSocket instance has no memory of what the old instance received. If the server emitted three events during the 1.2-second gap between onclose and reconnection, those events are gone. The user sees stale data and has no idea anything is wrong.

The fix is a three-layer pattern: transport resilience, heartbeat detection, and state synchronization.

Exponential backoff with full jitter

When a WebSocket closes, the client should wait before reconnecting. The wait time should grow exponentially with each consecutive failure, capped at a maximum, and randomized with jitter to prevent thundering herds.

Full jitter is the safest strategy in practice. For each attempt, you pick a random duration between zero and the exponential cap. This spreads reconnections across the entire interval and eliminates synchronization across clients. It is slightly slower than some alternatives on average, but it is the most friendly to an overloaded server.

Here is the helper:

function reconnectDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const cap = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(Math.random() * cap);
}

For the first failure, the delay is 0-1000 ms. For the second, 0-2000 ms. The cap keeps doubling (4, 8, 16 seconds) until the sixth consecutive failure, where it reaches the 30-second ceiling. The randomness means 10,000 clients that disconnect at the same time spread their reconnections across the whole backoff window (up to 30 seconds once the ceiling is reached) instead of stampeding the server in a single millisecond.
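Tabulating the cap for each attempt makes the schedule concrete. This is a small sketch that reuses the same formula as reconnectDelay; backoffCapMs is a hypothetical helper name, and attempt is zero-based (attempt 0 is the first failure).

```typescript
// Upper bound of the jitter window for a given zero-based attempt number.
function backoffCapMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Caps for the first seven consecutive failures (attempts 0 through 6).
const caps = Array.from({ length: 7 }, (_, attempt) => backoffCapMs(attempt));
console.log(caps); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

The actual delay for each attempt is a random value between zero and the listed cap.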

Never use a fixed delay like setTimeout(reconnect, 5000). A fixed delay synchronizes all clients. If a server restarts at 14:00:00, every client with a 5-second fixed delay reconnects at 14:00:05 simultaneously. That is a thundering herd, and it will crash the server you just restarted.
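A quick simulation illustrates the difference. The sketch below is an assumption-laden toy, not a load test: it uses a small deterministic pseudo-random generator (makeRng, a hypothetical helper) so the run is repeatable, and it counts how many of 10,000 clients land in the busiest 100 ms slice.

```typescript
// Small deterministic generator (a linear congruential generator) for repeatable runs.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // scale to [0, 1)
  };
}

// Number of reconnections that fall into the busiest slice of sliceMs milliseconds.
function peakPerSlice(delaysMs: number[], sliceMs = 100): number {
  const counts = new Map<number, number>();
  for (const delay of delaysMs) {
    const slice = Math.floor(delay / sliceMs);
    counts.set(slice, (counts.get(slice) ?? 0) + 1);
  }
  return Math.max(...Array.from(counts.values()));
}

const clientCount = 10000;
const rng = makeRng(42);

// Every client waits exactly 5 seconds.
const fixedDelays = Array.from({ length: clientCount }, () => 5000);
// Every client waits a random time between 0 and 5 seconds (full jitter).
const jitterDelays = Array.from({ length: clientCount }, () => Math.floor(rng() * 5000));

console.log(peakPerSlice(fixedDelays));  // 10000: everyone arrives in the same slice
console.log(peakPerSlice(jitterDelays)); // a small fraction of that; a uniform spread averages 200 per slice
```

With the fixed delay the server sees the entire population at once; with jitter the same population arrives as a steady trickle.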

Heartbeats: detecting half-open connections

WebSocket, like the TCP connection it runs over, is susceptible to half-open connections. A client that silently loses network access (laptop lid closed, subway tunnel, mobile tower handoff) will not trigger onclose on either side. The server thinks the client is still there. The client thinks the server is still there. Nothing flows, and neither side knows it.

The standard fix is a heartbeat ping-pong. The server sends a ping frame every N seconds, and the client responds with a pong. If the pong does not arrive before the next check, the server declares the connection dead and closes it locally. The client needs the same protection against a dead server, but the browser exposes no API for sending or observing ping frames, so the client side must use application-level heartbeat messages.

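A lightweight protocol message works well (the complete client later in this post exchanges bare 'ping' and 'pong' strings instead, for brevity):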

interface HeartbeatMessage {
  type: 'ping' | 'pong';
  ts: number;
}
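As a rough sketch of how such a message could be validated and answered, the helpers below are hypothetical (isHeartbeat, pongFor, and roundTripMs are not part of any library), and they assume the pong echoes the ping's timestamp so the sender can measure round-trip time on its own clock.

```typescript
interface HeartbeatMessage {
  type: 'ping' | 'pong';
  ts: number;
}

// Type guard: true only for well-formed heartbeat objects.
function isHeartbeat(value: unknown): value is HeartbeatMessage {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (v.type === 'ping' || v.type === 'pong') && typeof v.ts === 'number';
}

// Build the pong for a ping, echoing the sender's timestamp (an assumption of this sketch).
function pongFor(ping: HeartbeatMessage): HeartbeatMessage {
  return { type: 'pong', ts: ping.ts };
}

// Round-trip time, measured entirely on the clock of the side that sent the ping.
function roundTripMs(pong: HeartbeatMessage, nowMs: number): number {
  return nowMs - pong.ts;
}

const pingMessage: HeartbeatMessage = { type: 'ping', ts: 1000 };
const wire = JSON.stringify(pongFor(pingMessage));
const parsedMessage: unknown = JSON.parse(wire);
if (isHeartbeat(parsedMessage)) {
  console.log(roundTripMs(parsedMessage, 1042)); // 42
}
```

Echoing the timestamp avoids any dependence on the two machines having synchronized clocks.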

On the server (Node.js with ws):

import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  // track liveness in a closure variable; the ws WebSocket type has no isAlive property
  let isAlive = true;

  // the peer answered our last ping frame
  ws.on('pong', () => {
    isAlive = true;
  });

  const interval = setInterval(() => {
    if (!isAlive) {
      // no pong since the previous ping: treat the peer as dead
      ws.terminate();
      clearInterval(interval);
      return;
    }
    isAlive = false;
    ws.ping();
  }, 30000);

  ws.on('close', () => clearInterval(interval));
});

This is a per-connection variant of the heartbeat pattern shown in the ws documentation, which runs a single interval over wss.clients instead. terminate() destroys the socket immediately without waiting for a graceful close handshake, which is what you want for a dead peer. The 30-second interval is a balanced default for most applications. Trade desks and chat apps may want 10 seconds. Batched telemetry pipelines can tolerate 60.

State recovery: the missing piece

Backoff and heartbeats keep the transport healthy. They do not solve the data gap. When a client reconnects, the server must be able to answer one question: “What happened since event 847?”

The solution is an event ID on every message. The server assigns monotonically increasing IDs (per client or globally, depending on your consistency model). The client remembers the highest ID it has received. On reconnect, it sends that ID, and the server replays everything after it.

This sounds simple, but there are two practical constraints.

Buffer size: The server cannot store infinite history. A ring buffer of the last 10,000 events or the last 5 minutes is usually enough for momentary reconnections. If a client has been offline for an hour, you fall back to a snapshot plus live diff rather than replaying 50,000 events.

Global ordering: If you need strict global ordering across all clients, you need a single sequence counter (in Redis, Postgres, or a log-backed stream). If you only need per-client ordering (e.g., a personal notification feed), a per-client counter is simpler and horizontally scalable.
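The two constraints can be sketched together as a bounded buffer that either replays the gap or tells the caller to fall back to a snapshot. EventBuffer, BufferedEvent, and the 'snapshot-required' result are hypothetical names for illustration, and the single in-process counter stands in for whatever sequence source your consistency model requires.

```typescript
interface BufferedEvent {
  id: number;
  payload: unknown;
}

type ReplayResult =
  | { kind: 'replay'; events: BufferedEvent[] }
  | { kind: 'snapshot-required' };

class EventBuffer {
  private events: BufferedEvent[] = [];
  private nextId = 1;
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Assign the next monotonic ID and evict the oldest event when full.
  append(payload: unknown): BufferedEvent {
    const event: BufferedEvent = { id: this.nextId++, payload };
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift();
    return event;
  }

  // Everything after lastEventId, or a signal that the gap is no longer covered.
  since(lastEventId: number): ReplayResult {
    const oldestId = this.events.length > 0 ? this.events[0].id : this.nextId;
    // If the event right after lastEventId was already evicted, a replay would leave a hole.
    if (lastEventId + 1 < oldestId) return { kind: 'snapshot-required' };
    return { kind: 'replay', events: this.events.filter((e) => e.id > lastEventId) };
  }
}

const eventBuffer = new EventBuffer(3);
for (let i = 0; i < 5; i++) eventBuffer.append({ n: i }); // the buffer now holds IDs 3, 4, 5

console.log(eventBuffer.since(2).kind); // 'replay' (events 3, 4, 5 are all still buffered)
console.log(eventBuffer.since(1).kind); // 'snapshot-required' (event 2 was evicted)
```

Returning an explicit "snapshot required" result keeps the server from silently replaying a history with a hole in it.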

Here is a compact TypeScript client that implements all three layers (transport backoff, heartbeat, and event recovery) and can be dropped into a React hook, a Node.js service, or a plain browser script.

The complete resilient client

interface EventMessage {
  id: number;
  type: string;
  payload: unknown;
}

interface Options {
  url: string;
  heartbeatIntervalMs?: number;
  heartbeatTimeoutMs?: number;
  reconnectBaseMs?: number;
  reconnectMaxMs?: number;
  onMessage: (msg: EventMessage) => void;
  onStatusChange?: (status: 'open' | 'closed' | 'reconnecting') => void;
}

export class ResilientWebSocket {
  private ws: WebSocket | null = null;
  private url: string;
  private lastEventId = 0;
  private reconnectAttempt = 0;
  private reconnectTimer: ReturnType<typeof setTimeout> | null = null;
  private heartbeatTimer: ReturnType<typeof setInterval> | null = null;
  private heartbeatTimeoutTimer: ReturnType<typeof setTimeout> | null = null;
  private intentionallyClosed = false;
  private onMessage: (msg: EventMessage) => void;
  private onStatusChange?: (status: 'open' | 'closed' | 'reconnecting') => void;

  private heartbeatIntervalMs: number;
  private heartbeatTimeoutMs: number;
  private reconnectBaseMs: number;
  private reconnectMaxMs: number;

  constructor(opts: Options) {
    this.url = opts.url;
    this.onMessage = opts.onMessage;
    this.onStatusChange = opts.onStatusChange;
    this.heartbeatIntervalMs = opts.heartbeatIntervalMs ?? 30000;
    this.heartbeatTimeoutMs = opts.heartbeatTimeoutMs ?? 10000;
    this.reconnectBaseMs = opts.reconnectBaseMs ?? 1000;
    this.reconnectMaxMs = opts.reconnectMaxMs ?? 30000;
  }

  connect() {
    this.intentionallyClosed = false;
    this._connect();
  }

  private _connect() {
    if (this.ws) return;

    // append lastEventId so the server can resume the stream
    const resumeUrl = `${this.url}?lastEventId=${this.lastEventId}`;
    this.ws = new WebSocket(resumeUrl);

    this.ws.onopen = () => {
      this.reconnectAttempt = 0;
      this.onStatusChange?.('open');
      this._startHeartbeat();
    };

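    this.ws.onmessage = (event) => {
      const raw = event.data.toString();

      if (raw === 'ping') {
        this.ws?.send('pong');
        return;
      }

      if (raw === 'pong') {
        // the server answered our heartbeat; cancel the pending timeout
        // so the connection is not closed as dead
        if (this.heartbeatTimeoutTimer) {
          clearTimeout(this.heartbeatTimeoutTimer);
          this.heartbeatTimeoutTimer = null;
        }
        return;
      }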

      try {
        const msg: EventMessage = JSON.parse(raw);
        if (typeof msg.id === 'number') {
          this.lastEventId = msg.id;
        }
        this.onMessage(msg);
      } catch {
        // ignore malformed messages
      }
    };

    this.ws.onclose = () => {
      this._cleanup();
      this.onStatusChange?.('closed');
      if (!this.intentionallyClosed) {
        this._scheduleReconnect();
      }
    };

    this.ws.onerror = () => {
      // let onclose handle the reconnect logic; do not call it twice
    };
  }

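  private _startHeartbeat() {
    this.heartbeatTimer = setInterval(() => {
      if (this.ws?.readyState !== WebSocket.OPEN) return;

      this.ws.send('ping');
      if (this.heartbeatTimeoutTimer) clearTimeout(this.heartbeatTimeoutTimer);
      this.heartbeatTimeoutTimer = setTimeout(() => {
        // server did not pong in time. On a dead link, close() may not fire
        // onclose promptly, so detach the handlers and reconnect right away.
        const dead = this.ws;
        if (!dead) return;
        dead.onopen = dead.onmessage = dead.onclose = dead.onerror = null;
        dead.close();
        this._cleanup();
        this.onStatusChange?.('closed');
        if (!this.intentionallyClosed) this._scheduleReconnect();
      }, this.heartbeatTimeoutMs);
    }, this.heartbeatIntervalMs);
  }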

  private _cleanup() {
    if (this.reconnectTimer) {
      clearTimeout(this.reconnectTimer);
      this.reconnectTimer = null;
    }
    if (this.heartbeatTimer) {
      clearInterval(this.heartbeatTimer);
      this.heartbeatTimer = null;
    }
    if (this.heartbeatTimeoutTimer) {
      clearTimeout(this.heartbeatTimeoutTimer);
      this.heartbeatTimeoutTimer = null;
    }
    this.ws = null;
  }

  private _scheduleReconnect() {
    this.onStatusChange?.('reconnecting');
    const delay = this._reconnectDelay();
    this.reconnectTimer = setTimeout(() => {
      this.reconnectAttempt++;
      this._connect();
    }, delay);
  }

  private _reconnectDelay(): number {
    const cap = Math.min(
      this.reconnectMaxMs,
      this.reconnectBaseMs * Math.pow(2, this.reconnectAttempt)
    );
    return Math.floor(Math.random() * cap);
  }

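  close() {
    this.intentionallyClosed = true;
    // keep a reference: _cleanup() sets this.ws to null,
    // which would turn this.ws?.close() into a no-op
    const ws = this.ws;
    this._cleanup();
    ws?.close();
  }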
}

This class is intentionally boring. It does not use RxJS, it does not depend on React, and it does not hide state in closures that are impossible to test. It is a plain class with explicit timers that you can unit test with fake timers and a stubbed global WebSocket, or run in Node.js against a real server (for example, by assigning the ws package's WebSocket to globalThis.WebSocket).

The critical behavior is in _connect: every reconnection appends lastEventId to the URL. The server reads that parameter and replays buffered events after that ID before switching to live pushes.
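One caveat with plain string concatenation is that it produces a malformed URL when the base URL already carries a query string. A slightly more defensive way to build the resume URL, sketched here with the standard URL class (buildResumeUrl is a hypothetical helper name, not part of the class above), could look like this:

```typescript
// Append or overwrite the lastEventId query parameter without disturbing other parameters.
function buildResumeUrl(baseUrl: string, lastEventId: number): string {
  const url = new URL(baseUrl);
  url.searchParams.set('lastEventId', String(lastEventId));
  return url.toString();
}

console.log(buildResumeUrl('wss://example.com/feed', 847));
// wss://example.com/feed?lastEventId=847
console.log(buildResumeUrl('wss://example.com/feed?symbol=ABC', 847));
// wss://example.com/feed?symbol=ABC&lastEventId=847
```

Because searchParams.set overwrites an existing value, calling the helper on an already-resumed URL does not stack duplicate parameters.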

Server-side catch-up before live push

The server needs a small adapter to handle the resume handshake. Here is a minimal example with ws and an in-memory ring buffer (swap this for Redis Streams, an append-only Postgres table, or Kafka in production; Postgres LISTEN/NOTIFY on its own keeps no history to replay).

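import { WebSocketServer, WebSocket } from 'ws';

// same shape as the client's EventMessage
interface EventMessage {
  id: number;
  type: string;
  payload: unknown;
}

const MAX_HISTORY = 10000;
const eventHistory: EventMessage[] = [];
let nextId = 1;

const wss = new WebSocketServer({ port: 8080 });

// assign the next ID, remember the event, and push it to every open client
function broadcast(type: string, payload: unknown) {
  const msg: EventMessage = { id: nextId++, type, payload };
  eventHistory.push(msg);
  if (eventHistory.length > MAX_HISTORY) eventHistory.shift();
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) {
      client.send(JSON.stringify(msg));
    }
  }
}

wss.on('connection', (ws, req) => {
  // answer the client's application-level heartbeat
  // (the protocol-level ping/pong from the heartbeat section belongs here too)
  ws.on('message', (data) => {
    if (data.toString() === 'ping') ws.send('pong');
  });

  // resume from lastEventId; the base URL is only needed to parse the relative req.url
  const params = new URL(req.url ?? '/', 'ws://localhost').searchParams;
  const lastId = parseInt(params.get('lastEventId') ?? '', 10) || 0;
  const missed = eventHistory.filter((e) => e.id > lastId);
  for (const msg of missed) {
    ws.send(JSON.stringify(msg));
  }

  // then join live broadcast
  // (already part of broadcast() above)
});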

In production, replace the in-memory eventHistory with a bounded stream (for example, Redis Streams trimmed with MAXLEN, or an append-only Postgres table if you already run Postgres). The key invariant is that the buffer must hold more events than can arrive during your worst expected reconnection window. If your 99th percentile mobile disconnect lasts 8 seconds and you emit 200 events per second, you need at least 8 × 200 = 1,600 events. A 10x safety margin brings that to 16,000; rounding up to 20,000 leaves some headroom.
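The sizing rule is simple enough to encode as a helper, which makes it easy to re-run when traffic changes. This is a sketch of the arithmetic only; requiredBufferSize and its parameter names are hypothetical.

```typescript
// Events that can arrive during a disconnect, multiplied by a safety factor.
function requiredBufferSize(
  disconnectSeconds: number,
  eventsPerSecond: number,
  safetyFactor = 10
): number {
  return Math.ceil(disconnectSeconds * eventsPerSecond * safetyFactor);
}

console.log(requiredBufferSize(8, 200, 1)); // 1600: the bare minimum for an 8-second gap
console.log(requiredBufferSize(8, 200));    // 16000: with the 10x safety margin
```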

Common mistakes

Using ws.send blindly without checking readyState === OPEN. After a disconnect, there is a brief window where your application code may still try to publish. Always guard sends, or queue them client-side and flush on onopen.
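A minimal sketch of the queue-and-flush approach follows. OutboundQueue and SocketLike are hypothetical names; SocketLike covers only the two members of the WebSocket interface that the sketch needs, and the queue drops its oldest entry once maxPending is exceeded, which is one policy among several you might choose.

```typescript
// The subset of the WebSocket interface this sketch relies on.
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

const SOCKET_OPEN = 1; // numeric value of WebSocket.OPEN

class OutboundQueue {
  private pending: string[] = [];
  private maxPending: number;

  constructor(maxPending = 1000) {
    this.maxPending = maxPending;
  }

  // Send immediately when the socket is open; otherwise queue. Returns true if sent.
  send(socket: SocketLike | null, data: string): boolean {
    if (socket !== null && socket.readyState === SOCKET_OPEN) {
      socket.send(data);
      return true;
    }
    this.pending.push(data);
    if (this.pending.length > this.maxPending) this.pending.shift(); // drop the oldest
    return false;
  }

  // Call from onopen: drain queued messages in order. Returns how many were sent.
  flush(socket: SocketLike): number {
    let sent = 0;
    while (this.pending.length > 0 && socket.readyState === SOCKET_OPEN) {
      socket.send(this.pending.shift() as string);
      sent++;
    }
    return sent;
  }
}

// Demonstration with a fake socket that records what it was asked to send.
const sentMessages: string[] = [];
const fakeSocket: SocketLike = { readyState: 3, send: (data) => { sentMessages.push(data); } };
const outbound = new OutboundQueue();

console.log(outbound.send(fakeSocket, 'a')); // false: socket is closed, message queued
console.log(outbound.send(fakeSocket, 'b')); // false
fakeSocket.readyState = SOCKET_OPEN;         // simulate a successful reconnect
console.log(outbound.flush(fakeSocket));     // 2
console.log(sentMessages);                   // ['a', 'b']
```

Bounding the queue matters: an unbounded queue on a client that stays offline for hours turns a connectivity problem into a memory problem.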

Relying on the browser's native ping handling. The browser answers server ping frames with pong frames automatically, but it exposes neither to JavaScript: your code cannot tell whether a ping arrived, and it cannot send one. That protects the server from dead clients, not the client from a dead server. A client-side dead-peer detector needs application-level heartbeats.

Forgetting to reset reconnectAttempt on success. If you do not reset the counter, a client that suffered five failures an hour ago may still wait up to 30 seconds on its next disconnect. Reset to zero on every onopen.

Storing lastEventId in localStorage for long-lived sessions. It seems smart, but if the user has two tabs open, each tab advances its own lastEventId. On refresh, the tab reads the highest ID from localStorage, which may belong to the other tab, and skips events. Keep lastEventId in memory per instance, or scope localStorage keys by a tab ID.
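If you do persist the ID, scoping the key by tab might look like the sketch below. The helper names and the key format are hypothetical, and KeyValueStore is a minimal stand-in for the getItem and setItem methods of the Web Storage API so the example also runs outside a browser. In a browser you would pass window.localStorage and a tab ID of your own choosing (one option is a random ID kept in sessionStorage).

```typescript
// Minimal subset of the Web Storage API used by this sketch.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function lastEventIdKey(tabId: string): string {
  return `lastEventId:${tabId}`;
}

function saveLastEventId(store: KeyValueStore, tabId: string, id: number): void {
  store.setItem(lastEventIdKey(tabId), String(id));
}

// Returns 0 when nothing valid is stored for this tab.
function loadLastEventId(store: KeyValueStore, tabId: string): number {
  const raw = store.getItem(lastEventIdKey(tabId));
  const parsed = raw === null ? NaN : Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? 0 : parsed;
}

// In-memory stand-in for localStorage, for demonstration only.
const backing = new Map<string, string>();
const memoryStore: KeyValueStore = {
  getItem: (key) => backing.get(key) ?? null,
  setItem: (key, value) => { backing.set(key, value); },
};

saveLastEventId(memoryStore, 'tab-a', 847);
saveLastEventId(memoryStore, 'tab-b', 912);
console.log(loadLastEventId(memoryStore, 'tab-a')); // 847: unaffected by tab-b
console.log(loadLastEventId(memoryStore, 'tab-c')); // 0: nothing stored yet
```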

The practical takeaway

A production WebSocket client is not a new WebSocket(url) wrapped in a useEffect. It is a state machine with three concerns: transport (backoff and jitter), liveness (heartbeats and timeouts), and semantics (event IDs and catch-up). Neglect any one, and the other two become decorative.

Before your next deploy, run through this checklist:

Reconnection uses exponential backoff with jitter, not a fixed interval.
Backoff has a maximum ceiling (e.g., 30 seconds).
The client resets the attempt counter on every successful open.
Heartbeats run in both directions (protocol-level pings from the server, application-level pings from the client), each with a timeout far shorter than the time the OS TCP stack takes to give up on a dead peer.
Every event carries a monotonic ID.
The server accepts a lastEventId parameter on connection and replays missed events before pushing live data.
The history buffer is sized for the 99th percentile disconnect duration at peak event volume.
The client does not send without checking readyState === OPEN.

Your socket will disconnect. That is not a failure. The failure is assuming it will not, and building a system that has no vocabulary for “catch me up.”

A note from Yojji

Engineering resilient real-time systems is not about writing clever binary protocols. It is about acknowledging that networks fail, servers restart, and mobile users ride trains through tunnels. The discipline of adding backoff, heartbeat timeouts, and event replay to a WebSocket client is exactly the kind of production-hardened thinking that separates a prototype from a shipping product. Yojji’s teams bring that discipline to the full-cycle applications they build, from real-time dashboards to high-throughput messaging infrastructure.
