Node.js Child Processes: Spawn, Errors, Orphans, and Supervision in Production

The image conversion service crashed three times before anyone noticed. Not the Express server. That stayed up. The worker started with child_process.fork() to process uploaded images exited silently when Sharp hit a corrupt JPEG. No error in the parent. No restart. The queue filled up. Users uploaded, the upload returned 200, and the resized thumbnail never appeared. By the time monitoring caught the 4,000-image backlog, the damage was done.

Node.js child_process is one of the most commonly misused APIs in production. It looks simple: call exec(), get output, move on. But the defaults are designed for interactive shells, not long-running servers. Output overruns a buffer limit and the child is killed, or nobody reads the pipe and the child blocks. Exit codes go unchecked. Orphan processes accumulate when the parent crashes. Stderr is ignored until the child has already failed.

This post covers four patterns that turn child_process from a footgun into a reliable subsystem: correct spawning with backpressure, error handling that covers every failure mode, orphan prevention with process groups, and a supervisor pattern for long-lived workers.

Pattern 1: spawn, not exec

The most common mistake is using exec() to run a shell command that produces output.

import { exec } from 'node:child_process';
import { promisify } from 'node:util';

const execAsync = promisify(exec);

// Looks harmless. Is not.
const { stdout, stderr } = await execAsync('ffmpeg -i input.mp4 output.webm');

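exec() runs the command through a shell and buffers stdout and stderr into strings. The default maxBuffer is 1024 * 1024 bytes (1 MiB) per stream. If your ffmpeg output (or any command output) exceeds that, Node terminates the child and rejects with an error, leaving you with truncated output and a half-finished job. The hang people describe is the related failure: a child spawned with piped stdio whose output nobody reads fills the OS pipe buffer, its write system call blocks, and the command never finishes. Neither is theoretical. It happens with image processors, video transcoders, database dumps, and anything that produces more than a megabyte of output.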
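To see what happens at the limit, here is a minimal sketch. It uses execFile() rather than exec() to sidestep shell quoting, on the assumption that both share the same maxBuffer handling; demoMaxBuffer and OverflowReport are hypothetical names made up for this illustration.

```typescript
import { execFile } from 'node:child_process';

interface OverflowReport {
  errorCode: string | undefined; // Error code reported by Node, if any
  capturedBytes: number;         // How much stdout survived
}

// Runs a child Node process that prints `outputBytes` bytes while the parent
// allows only `maxBuffer` bytes of buffered stdout.
function demoMaxBuffer(outputBytes: number, maxBuffer: number): Promise<OverflowReport> {
  const script = `process.stdout.write('x'.repeat(${outputBytes}))`;
  return new Promise((resolve) => {
    execFile(process.execPath, ['-e', script], { maxBuffer }, (err, stdout) => {
      resolve({
        errorCode: err ? String(err.code) : undefined,
        capturedBytes: Buffer.byteLength(stdout),
      });
    });
  });
}

// Example: 5000 bytes of output against a 1024-byte limit.
demoMaxBuffer(5000, 1024).then((report) => {
  // errorCode is 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER' and the output is truncated
  console.log(report);
});
```

The child is killed and the captured output is cut off at the limit, so a caller that ignores the error ends up working with partial data.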

The fix is spawn() with streaming stdio:

import { spawn } from 'node:child_process';

interface SpawnResult {
  stdout: string;
  stderr: string;
  exitCode: number | null;
  signal: string | null;
}

function spawnCollect(
  command: string,
  args: string[],
  options?: { timeout?: number; maxBuffer?: number }
): Promise<SpawnResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, {
      stdio: ['ignore', 'pipe', 'pipe'],
      timeout: options?.timeout ?? 30_000,
    });

    const stdoutChunks: Buffer[] = [];
    const stderrChunks: Buffer[] = [];
    let stdoutSize = 0;
    let stderrSize = 0;
    const maxBuffer = options?.maxBuffer ?? 10 * 1024 * 1024; // 10 MB default

    child.stdout!.on('data', (chunk: Buffer) => {
      stdoutSize += chunk.length;
      if (stdoutSize > maxBuffer) {
        child.kill();
        reject(new Error(`stdout exceeded maxBuffer (${maxBuffer} bytes)`));
        return;
      }
      stdoutChunks.push(chunk);
    });

    child.stderr!.on('data', (chunk: Buffer) => {
      stderrSize += chunk.length;
      if (stderrSize > maxBuffer) {
        child.kill();
        reject(new Error(`stderr exceeded maxBuffer (${maxBuffer} bytes)`));
        return;
      }
      stderrChunks.push(chunk);
    });

    child.on('error', (err) => {
      reject(err);
    });

    child.on('close', (exitCode, signal) => {
      resolve({
        stdout: Buffer.concat(stdoutChunks).toString('utf-8'),
        stderr: Buffer.concat(stderrChunks).toString('utf-8'),
        exitCode,
        signal,
      });
    });
  });
}

This gives you streaming reads with backpressure (the OS pipe buffer drains as you read), explicit max-buffer enforcement that kills the child instead of hanging, and separate access to stdout and stderr.

Key differences from exec():

Stdio is piped, not buffered. Node reads from the OS pipe in chunks, so the child never stalls on a full pipe.
You control the buffer limit. Pick a value that fits your workload. Logs: 1 MB. Video frames: 100 MB. Kill the child if it exceeds.
You get the exit code AND the signal. A child killed by SIGTERM (signal: 'SIGTERM') is different from one that exits with code 1.

Use exec() only for trivial commands where you control the input and the output is known to be small (under 10 KB). For everything else, spawn() with explicit stdio handling.
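When the output is too large to hold in memory at all (a database dump, a transcoded video), collecting chunks is the wrong shape. One option, sketched here under the assumption that the output belongs in a file, is to pipe stdout straight to disk with pipeline(), which stops reading from the child while the file stream is busy. spawnToFile is a hypothetical helper, not part of Node.

```typescript
import { spawn } from 'node:child_process';
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';

// Streams a command's stdout to `outPath` without buffering it in memory.
async function spawnToFile(
  command: string,
  args: string[],
  outPath: string
): Promise<{ code: number | null; bytes: number }> {
  const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'inherit'] });
  const out = createWriteStream(outPath);

  // Settles on spawn failure ('error') or once the child has exited and
  // its stdio streams are closed ('close').
  const closed = new Promise<number | null>((resolve, reject) => {
    child.once('error', reject);
    child.once('close', (code) => resolve(code));
  });

  // Promise.all attaches handlers to both promises right away, so a spawn
  // failure cannot surface as an unhandled rejection.
  const [code] = await Promise.all([closed, pipeline(child.stdout!, out)]);
  return { code, bytes: out.bytesWritten };
}
```

Memory use stays flat regardless of output size, because each chunk is written out before the next one is read.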

Pattern 2: handle every failure mode

A child process can fail in five distinct ways, and you need to handle all of them:

1. Command not found: spawn() does not throw; the child emits an error event with code ENOENT.
2. Permission denied: the child emits an error event with code EACCES.
3. Non-zero exit code: the process ran but returned a failure code.
4. Killed by signal: the OS or another process terminated it.
5. Timeout: the process ran longer than expected.

Most implementations handle only #3. Here is the complete handler:

interface ProcessResult {
  stdout: string;
  stderr: string;
  ok: boolean;
  code: number | null;
  signal: string | null;
}

async function runProcess(
  command: string,
  args: string[],
  options?: { timeout?: number }
): Promise<ProcessResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, {
      stdio: ['ignore', 'pipe', 'pipe'],
      timeout: options?.timeout ?? 30_000,
    });

    let stdout = '';
    let stderr = '';

    child.stdout!.setEncoding('utf-8');
    child.stderr!.setEncoding('utf-8');
    child.stdout!.on('data', (d) => { stdout += d; });
    child.stderr!.on('data', (d) => { stderr += d; });

    child.on('error', (err: NodeJS.ErrnoException) => {
      // ENOENT, EACCES, etc. The child never started.
      resolve({
        stdout,
        stderr: `Failed to spawn: ${err.message}`,
        ok: false,
        code: err.code === 'ENOENT' ? 127 : 126,
        signal: null,
      });
    });

    child.on('close', (code, signal) => {
      const ok = code === 0 && signal === null;
      resolve({ stdout, stderr, ok, code, signal });
    });
  });
}

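The error event fires when the child cannot start. The close event fires when the child has exited and its stdio streams have closed. Both can fire for the same child: a failed spawn such as ENOENT emits error, and close can follow it. Because a promise settles only once, the resolve() in the error handler wins and the later call is a no-op. The close event alone handles normal exits, signal kills, and timeouts (on timeout Node sends killSignal, SIGTERM by default, which fires close with a signal). That also means a timeout looks identical to an external SIGTERM unless you track it yourself.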

One gotcha: the exit event fires before close. Use close instead of exit because close guarantees all stdio streams have finished. If you use exit, you might read partial output.
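To tell the five cases apart, including timeout versus an outside signal, one approach is to run your own timer and record that it fired. The sketch below is an illustration, not the handler above: FailureKind, classify, and runClassified are hypothetical names, and it ignores output entirely to keep the classification logic visible.

```typescript
import { spawn } from 'node:child_process';

type FailureKind = 'none' | 'spawn-failed' | 'timeout' | 'signal' | 'exit-code';

// Pure mapping from what the events reported to one failure mode.
function classify(
  spawnFailed: boolean,
  timedOut: boolean,
  code: number | null,
  signal: string | null
): FailureKind {
  if (spawnFailed) return 'spawn-failed';
  if (timedOut) return 'timeout';
  if (signal !== null) return 'signal';
  return code === 0 ? 'none' : 'exit-code';
}

function runClassified(command: string, args: string[], timeoutMs: number): Promise<FailureKind> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    let timedOut = false;

    // Own timer instead of spawn's timeout option, so the flag records
    // that the SIGTERM came from us and not from outside.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGTERM');
    }, timeoutMs);

    child.on('error', () => {
      clearTimeout(timer);
      resolve(classify(true, timedOut, null, null));
    });
    child.on('close', (code, signal) => {
      clearTimeout(timer);
      // No-op if the error handler already resolved
      resolve(classify(false, timedOut, code, signal));
    });
  });
}
```

With the flag in hand, a caller can retry timeouts but alert on unexpected signals, which usually point at the OOM killer or an operator.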

Pattern 3: prevent orphans with process groups

When your Node.js process crashes, any child processes it spawned become orphans. The OS reparents them to init (PID 1), and they keep running. This is how production incidents compound: the parent OOMs, but the twelve ffmpeg children it spawned continue consuming CPU and memory on the same host.

The fix is to launch children in a process group and kill the group when the parent dies.

import { spawn, execSync } from 'node:child_process';

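function spawnWithGroup(command: string, args: string[]): ReturnType<typeof spawn> {
  const child = spawn(command, args, {
    stdio: ['ignore', 'pipe', 'pipe'],
    // On POSIX, detached: true makes the child the leader of a new session and
    // process group whose group ID equals child.pid, which is what
    // process.kill(-child.pid) targets. With detached: false the child shares
    // the parent's group and the negative-PID kill fails with ESRCH.
    // On Windows, killProcessGroup() uses taskkill /T instead.
    detached: true,
  });

  return child;
}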

// Graceful cleanup
function killProcessGroup(child: ReturnType<typeof spawn>): void {
  if (child.pid === undefined) return;

  try {
    if (process.platform === 'win32') {
      // /T terminates the whole process tree, /F forces termination
      execSync(`taskkill /PID ${child.pid} /T /F`, { stdio: 'ignore' });
    } else {
      // Negative PID sends signal to the process group
      process.kill(-child.pid, 'SIGTERM');
    }
  } catch {
    // Process group may already be dead
  }
}

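// Use with exit handlers
function setupOrphanPrevention(child: ReturnType<typeof spawn>): void {
  const cleanup = () => {
    killProcessGroup(child);
  };
  // A signal listener replaces Node's default "terminate" behavior, so the
  // handler has to end the process itself. The 'exit' listener below then
  // performs the cleanup.
  const onSignal = (signal: NodeJS.Signals) => {
    process.exit(signal === 'SIGINT' ? 130 : signal === 'SIGHUP' ? 129 : 143);
  };

  process.on('SIGTERM', onSignal);
  process.on('SIGINT', onSignal);
  process.on('SIGHUP', onSignal);
  // 'exit' also fires for process.exit() and uncaught exceptions; 'beforeExit'
  // does not. Its listeners must be synchronous, which killProcessGroup() is.
  process.on('exit', cleanup);

  // Remove listeners when child exits
  child.on('close', () => {
    process.removeListener('SIGTERM', onSignal);
    process.removeListener('SIGINT', onSignal);
    process.removeListener('SIGHUP', onSignal);
    process.removeListener('exit', cleanup);
  });
}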

The key detail is the negative PID in process.kill(-child.pid, 'SIGTERM'). On POSIX systems, a negative PID sends the signal to every process in the process group. If your child spawns its own children (ffmpeg does, so does make), they all get killed.
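A single SIGTERM is a request, and a child that ignores it stays alive. A common extension, sketched here for POSIX only, is to wait for the group to disappear and escalate to SIGKILL after a grace period. It assumes the child was spawned with detached: true so that its PID is also its process group ID; groupAlive and terminateGroup are hypothetical helper names.

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// True if any process in the group led by `pgid` still exists (POSIX only).
function groupAlive(pgid: number): boolean {
  try {
    process.kill(-pgid, 0); // Signal 0 checks for existence without signaling
    return true;
  } catch (err) {
    // EPERM means the group exists but we may not signal it
    return (err as NodeJS.ErrnoException).code === 'EPERM';
  }
}

// Sends SIGTERM to the child's group, then SIGKILL if it outlives `graceMs`.
function terminateGroup(child: ChildProcess, graceMs = 5000): Promise<void> {
  return new Promise((resolve) => {
    const pgid = child.pid;
    if (pgid === undefined || !groupAlive(pgid)) return resolve();

    try { process.kill(-pgid, 'SIGTERM'); } catch { /* raced with exit */ }
    const started = Date.now();

    const poll = setInterval(() => {
      if (!groupAlive(pgid)) {
        clearInterval(poll);
        resolve();
      } else if (Date.now() - started > graceMs) {
        try { process.kill(-pgid, 'SIGKILL'); } catch { /* raced with exit */ }
        clearInterval(poll);
        resolve();
      }
    }, 100);
  });
}

// Usage: the child must lead its own group for the negative PID to address it.
const worker = spawn(process.execPath, ['-e', 'setInterval(() => {}, 1000)'], {
  detached: true,
  stdio: 'ignore',
});
terminateGroup(worker, 2000).then(() => console.log('group terminated'));
```

Polling keeps the sketch short; a production version would more likely wait on the child's close event and poll only for grandchildren.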

Important caveat: This only works if the parent gets a chance to run its handlers. If the parent dies without running them (SIGKILL, the OOM killer, a segfault inside the runtime), nothing signals the group. For that scenario, run your Node.js process under a process supervisor (systemd, supervisord, or Docker with --init) so that something outside the process stops or reaps the leftover children. Or record each detached child's PID somewhere durable and clean up stale groups on the next start.

If you are on Linux and want maximum protection, set PR_SET_PDEATHSIG on the child: the kernel signals the child when the parent dies, no matter how the parent dies. Node has no pre-exec hook for calling prctl() in the child, so the usual workaround is a wrapper that sets the flag and then execs the real command. A sketch, assuming util-linux 2.33 or later, whose setpriv supports --pdeathsig:

import { spawn } from 'node:child_process';

// Linux only. setpriv sets the parent-death signal and then execs the worker
// in place, so child.pid is the worker's PID.
const child = spawn('setpriv', ['--pdeathsig', 'SIGKILL', '--', 'node', 'worker.js'], {
  stdio: ['pipe', 'pipe', 'pipe'],
});

For a portable solution that handles crashes too, use a monitoring process (pattern 4) that watches both the parent and children, or run your service in Docker with init: true in your Compose file, which runs an init process as PID 1 that reaps orphans.

Pattern 4: the supervisor pattern for long-lived workers

When you fork() a worker process to handle CPU-bound work, you need more than just spawning it. You need supervision: restart on crash, backoff on repeated crashes, and health checks.

Here is a worker supervisor that handles all three:

import { fork, ChildProcess } from 'node:child_process';
import { EventEmitter } from 'node:events';

interface SupervisorOptions {
  modulePath: string;
  args?: string[];
  env?: Record<string, string>;
  maxRestarts?: number;
  restartDelay?: number;       // Base delay in ms
  healthCheckInterval?: number;
}

class WorkerSupervisor extends EventEmitter {
  private child: ChildProcess | null = null;
  private restartCount = 0;
  private healthCheckTimer: ReturnType<typeof setInterval> | null = null;
  private stopped = false;

  constructor(private options: SupervisorOptions) {
    super();
    this.options.maxRestarts ??= 10;
    this.options.restartDelay ??= 1000;
    this.options.healthCheckInterval ??= 15_000;
  }

  start(): void {
    this.stopped = false;
    this.spawn();
  }

  stop(): void {
    this.stopped = true;
    this.clearHealthCheck();
    if (this.child) {
      this.child.kill('SIGTERM');
      this.child = null;
    }
  }

  private spawn(): void {
    if (this.stopped) return;

    this.child = fork(this.options.modulePath, this.options.args, {
      env: { ...process.env, ...this.options.env },
      // fork() requires exactly one 'ipc' entry when stdio is given as an array
      stdio: ['pipe', 'pipe', 'pipe', 'ipc'],
    });

    this.child.stdout?.pipe(process.stdout);
    this.child.stderr?.pipe(process.stderr);

    this.child.on('message', (msg: unknown) => {
      this.emit('message', msg);
      // Reset restart count on any message (worker is alive and working)
      this.restartCount = 0;
    });

    this.child.on('exit', (code, signal) => {
      this.child = null;
      this.clearHealthCheck();

      if (this.stopped) return;

      const unexpected = code !== 0 || signal !== null;
      if (unexpected) {
        this.restartCount++;
        this.emit('crashed', { code, signal, restartCount: this.restartCount });

        if (this.restartCount > this.options.maxRestarts!) {
          this.emit('exhausted', {
            message: `Worker crashed ${this.restartCount} times. Giving up.`,
          });
          return;
        }

        const delay = this.options.restartDelay! * Math.pow(2, this.restartCount - 1);
        this.emit('restarting', { delay, attempt: this.restartCount });
        setTimeout(() => this.spawn(), delay);
      }
    });

    this.child.on('error', (err) => {
      this.emit('error', err);
    });

    this.startHealthCheck();
  }

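  private startHealthCheck(): void {
    this.clearHealthCheck();
    this.healthCheckTimer = setInterval(() => {
      // Pin this worker so a late timeout can never hit its replacement
      const child = this.child;
      if (!child || !child.connected) return;

      const onMessage = (msg: unknown) => {
        // Only a pong counts; unrelated worker messages must not cancel the check
        if ((msg as { type?: string } | null)?.type !== 'pong') return;
        settle();
        this.restartCount = 0; // Healthy response resets throttle
      };
      const settle = () => {
        clearTimeout(timeout);
        child.off('message', onMessage);
        child.off('exit', settle);
      };
      // If no pong arrives within 5 seconds, kill the worker; the exit handler restarts it
      const timeout = setTimeout(() => {
        settle();
        this.emit('unresponsive');
        child.kill('SIGKILL');
      }, 5000);

      child.on('message', onMessage);
      child.once('exit', settle);
      child.send({ type: 'ping' });
    }, this.options.healthCheckInterval);
  }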

  private clearHealthCheck(): void {
    if (this.healthCheckTimer) {
      clearInterval(this.healthCheckTimer);
      this.healthCheckTimer = null;
    }
  }
}

Usage:

const supervisor = new WorkerSupervisor({
  modulePath: './image-worker.js',
  maxRestarts: 5,
  restartDelay: 500,
  healthCheckInterval: 10_000,
});

supervisor.on('crashed', ({ code, signal, restartCount }) => {
  console.error(`Worker crashed (code=${code}, signal=${signal}), restart ${restartCount}`);
});

supervisor.on('exhausted', ({ message }) => {
  console.error(message);
  // Alert PagerDuty, send to metrics, etc.
});

supervisor.on('unresponsive', () => {
  console.warn('Worker did not respond to health check. Forcing restart.');
});

supervisor.start();

And the worker (image-worker.js) needs to handle the health-check protocol:

process.on('message', (msg: { type?: string }) => {
  if (msg.type === 'ping') {
    process.send!({ type: 'pong' });
  }
});
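A forked worker can also protect itself against being orphaned: the IPC channel closes when the parent goes away, and the worker sees a disconnect event. The sketch below is one way to structure that, with the ping handling pulled into a small function so it can be exercised without a parent. handleParentMessage, ParentMessage, and Reply are hypothetical names.

```typescript
type ParentMessage = { type?: string };
type Reply = { type: 'pong' };

// Answers health-check pings. Returns true if the message was a ping.
function handleParentMessage(msg: ParentMessage | null, reply: (r: Reply) => void): boolean {
  if (msg?.type === 'ping') {
    reply({ type: 'pong' });
    return true;
  }
  return false;
}

// process.send exists only when this file runs as a forked child.
if (process.send) {
  process.on('message', (msg: unknown) => {
    handleParentMessage(msg as ParentMessage | null, (r) => process.send!(r));
  });

  // The channel closes when the parent exits for any reason, including
  // SIGKILL, so exiting here keeps the worker from outliving its parent.
  process.on('disconnect', () => process.exit(0));
}
```

Note that a worker busy in a synchronous loop cannot answer pings or notice the disconnect until it yields to the event loop, which is exactly the case the supervisor's timeout is for.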

The supervisor does three things the naive approach misses:

Exponential backoff on restarts. The delay doubles after every crash, and if the worker keeps crashing (config error, corrupted input), the supervisor gives up after maxRestarts instead of restarting forever in a tight loop.
Health checks. A worker can be alive but stuck (infinite loop, deadlock). The supervisor sends a ping and expects a pong within 5 seconds. No response means SIGKILL.
Restart count reset on success. Any message from the worker, including a pong, resets the counter. This prevents transient failures (OOM from a single large image) from accumulating until the supervisor gives up for good.
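The backoff arithmetic is worth seeing on its own, because the delays grow quickly. The sketch below reproduces the supervisor's formula, restartDelay * 2^(attempt - 1), as a pure function. The capMs parameter is an addition for illustration and is not part of the supervisor above; backoffDelay and backoffSchedule are hypothetical names.

```typescript
// Delay before restart number `attempt` (1-based), optionally capped.
function backoffDelay(baseMs: number, attempt: number, capMs: number = Infinity): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Full list of delays a supervisor would wait through before giving up.
function backoffSchedule(baseMs: number, maxRestarts: number, capMs?: number): number[] {
  return Array.from({ length: maxRestarts }, (_, i) => backoffDelay(baseMs, i + 1, capMs));
}

console.log(backoffSchedule(500, 5));
// → [500, 1000, 2000, 4000, 8000]

console.log(backoffSchedule(1000, 10, 30_000));
// → [1000, 2000, 4000, 8000, 16000, 30000, 30000, 30000, 30000, 30000]
```

Without a cap, the defaults (1000 ms base, 10 restarts) put the last wait at 512 seconds, so a cap is worth considering if a worker should never stay down that long.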

The practical takeaway

Here is the rule of thumb for choosing a child process API:

Task	API	Why
Run a short command with tiny output (under 10 KB)	`exec()`	Convenient, but only for known-small output.
Run any command with unknown or large output	`spawn()` + streaming pipes	You read output as it arrives and set the limit yourself.
Fork a Node.js module as a worker	`fork()` + supervisor pattern	Built-in IPC, health checks, restart backoff.
Run a daemon or background service	`spawn()` with `detached: true`	Process group isolation.

And the checklist for every child_process usage in production:

Are stdout and stderr piped as streams, not buffered strings?
Is there a max buffer limit that kills the child if exceeded?
Are both the error event (spawn failure) and close event (exit) handled?
Is a non-zero exit code treated as an error?
Are orphan children cleaned up when the parent exits or crashes?
If the child is long-lived, is there a health check and restart policy?

The image conversion service I mentioned at the start? The fix was a 40-line supervisor with streaming stdio, a max buffer of 50 MB (images are big), and a health check that caught the corrupt-JPEG case within seconds. The crash loop that had filled 4,000 queue items over six hours was caught within 30 seconds on the next deployment. The code in this post is that fix, extracted and generalized.
