The Practical Developer

Streaming Large CSV Files in Node.js: Handle Gigabyte-Sized Data Without Crashing

Loading a 2GB CSV into an array kills your server. Here is how to stream, parse, and backpressure CSV data in Node.js without ever holding more than one row in memory at a time.

A close-up of a green circuit board, representing the data processing pipeline inside the machine

The CSV file arrives from the data team at 4:45 PM. It is 1.8GB with 4 million rows, one column per user attribute, and it needs to be imported into the production database before tomorrow’s morning report runs. You write a quick script:

const fs = require('fs');
const csv = fs.readFileSync('users_export.csv', 'utf-8');
const rows = csv.split('\n').map(line => {
  const [id, name, email, ...rest] = line.split(',');
  return { id, name, email, attributes: rest };
});

You run it. Node.js uses 3.2GB of RAM. The OOM killer terminates the process before it finishes the first 200,000 rows. Your laptop freezes. You force-reboot and start over, this time on a production server with 8GB of RAM. That also freezes. By 6 PM you are manually reviewing rows in a text editor with syntax highlighting disabled because the file is too large to open.

This is the “load it all into memory” approach that every developer reaches for first, and it is the single most common reason a CSV import job that works fine on a 10MB test file silently kills a production server.

This post is how to process CSV files of any size in Node.js using streams, why backpressure matters more than speed, and the exact patterns that let you handle 10GB files on a server with 512MB of RAM.

Why the naive approach explodes

The code above does three things that each double or triple memory usage.

First, fs.readFileSync reads the entire file into a single string. A 1.8GB CSV file becomes a 1.8GB UTF-16 string in V8 (actually closer to 3.6GB because JavaScript strings are stored as UTF-16 internally, not UTF-8).

Second, csv.split('\n') creates a new array of 4 million strings. Each string is a separate allocation. The original string is still in memory, plus the array of substrings, plus the objects you build in the map call.

Third, each line.split(',') inside the map creates yet another temporary array per row. The total memory footprint at peak is roughly 4x to 6x the raw file size.

For a 1.8GB file, that means 7-10GB of RAM before you even start inserting into the database. The OOM killer is not a bug in your script. It is physics.

The fix is to never hold more than one row in memory at a time. Streams are how you do that.

The streaming foundation

Node.js streams process data in chunks. A Readable stream reads from the file system in small buffers (default 64KB). You pipe those buffers through a Transform stream that parses the CSV and emits JavaScript objects. You consume each object in a Writable stream that writes to the database, inserts into a queue, or whatever your destination is.

The file data flows through the pipeline one buffer at a time. At no point does the entire file or the entire parsed data set exist in memory.

const { createReadStream } = require('fs');
const { pipeline } = require('stream/promises');

But for parsing, use a dedicated CSV parser that works with streams.

Pick the right CSV parser

There are three CSV libraries you should know about in Node.js, and two of them are wrong for this use case.

csv-parse (from the csv package) is the gold standard for streaming. It accepts a Node.js readable stream and emits records as arrays or objects, respecting backpressure. It handles quoted fields, escaped commas, multiline values, and nonstandard delimiters.

csv-parser is a simpler alternative that also works with streams. It is slightly faster for basic CSV files but less configurable.

papaparse is the worst choice for large files. It is designed for browser use and offers a “streaming” mode that is technically event-driven but still accumulates data internally. Do not use it for server-side CSV processing at scale.

This post uses csv-parse because it is the most robust option for production workloads.

npm install csv-parse

The basic streaming pipeline

Here is the minimal streaming CSV parser that handles a file of any size:

const { createReadStream } = require('fs');
const { parse } = require('csv-parse');
const { pipeline } = require('stream/promises');

async function processCsv(filePath) {
  let rowCount = 0;

  await pipeline(
    createReadStream(filePath),
    parse({ columns: true, skip_empty_lines: true }),
    async function* (source) {
      for await (const row of source) {
        rowCount++;
        // Process one row at a time
        await handleRow(row);

        if (rowCount % 10000 === 0) {
          console.log(`Processed ${rowCount} rows`);
        }
      }
    }
  );

  console.log(`Done. Processed ${rowCount} rows.`);
}

async function handleRow(row) {
  // Your business logic here
  // Insert into database, validate, transform, etc.
  await new Promise(r => setImmediate(r));
}

processCsv('users_export.csv').catch(console.error);

Key points about this code:

  • createReadStream reads the file in 64KB chunks. It never loads the full file.
  • csv-parse with columns: true returns each row as a JavaScript object with column names as keys.
  • The async function* consumer processes rows one at a time. If handleRow is async (a database insert, an API call), the pipeline naturally pauses the read stream until the current row finishes processing.
  • The pipeline handles backpressure automatically. If the consumer is slower than the parser, the parser upstream pauses, which pauses the file read. Memory stays flat.

Backpressure: the invisible controller

Backpressure is the mechanism that prevents a fast producer from overwhelming a slow consumer. Without it, the parser would accumulate unprocessed rows in memory until the consumer catches up. With it, the entire pipeline slows to the speed of the slowest link.

The code above relies on backpressure correctly because csv-parse implements the Node.js Transform stream protocol. When the consumer generator yields a value and awaits, the stream is paused upstream. No extra memory is consumed.

You can defeat backpressure by collecting rows into an array:

// Do not do this
const allRows = [];
for await (const row of source) {
  allRows.push(row); // O(n) memory - defeats the whole point
}
await Promise.all(allRows.map(handleRow));

If you need to batch database inserts (and you do, for performance), batch without accumulating all rows:

async function* batchProcessor(source, batchSize = 1000) {
  let batch = [];

  for await (const row of source) {
    batch.push(row);

    if (batch.length >= batchSize) {
      yield batch;
      batch = [];
    }
  }

  if (batch.length > 0) {
    yield batch;
  }
}

// Usage
await pipeline(
  createReadStream(filePath),
  parse({ columns: true, skip_empty_lines: true }),
  async function* (source) {
    for await (const batch of batchProcessor(source, 1000)) {
      await insertBatch(batch);
    }
  }
);

This batches 1000 rows at a time but never holds more than 1000 rows in memory. The pipeline still respects backpressure between batches.

Error handling: malformed rows and partial failures

CSV files from other teams, from clients, or from third-party exports are always malformed in some way. A streaming pipeline that crashes on the first bad row is not production-ready.

Handle per-row errors without crashing the pipeline:

async function* safeProcessor(source) {
  let rowIndex = 0;

  for await (const row of source) {
    rowIndex++;
    try {
      const processed = transformRow(row);
      yield processed;
    } catch (err) {
      console.error(`Skipping row ${rowIndex}: ${err.message}`);
      // Continue to the next row
    }
  }
}

But be careful: skipping rows silently can corrupt data integrity. A better approach is to tee the bad rows to a dead-letter file:

const { createWriteStream } = require('fs');

const deadLetter = createWriteStream('failed_rows.csv', { flags: 'a' });
deadLetter.write('row_number,error,raw_data\n');

async function* safeProcessor(source) {
  let rowIndex = 0;

  for await (const row of source) {
    rowIndex++;
    try {
      yield transformRow(row);
    } catch (err) {
      const escaped = JSON.stringify(row).replace(/"/g, '""');
      deadLetter.write(`${rowIndex},"${err.message}","${escaped}"\n`);
    }
  }

  deadLetter.end();
}

Now you can audit failures after the run completes.

Handling nonstandard CSV

Real-world CSV files rarely conform to RFC 4180. Here are the most common variations and how to handle them with csv-parse:

Tab-separated values (TSV):

parse({ delimiter: '\t', columns: true })

Files with a header row that needs to be skipped:

parse({ from_line: 2, columns: ['id', 'name', 'email'] })

Quoted fields with embedded newlines:

CSV parsers handle this by default. Never use split('\n') for CSV files, because a quoted field can contain a newline character. The line count and the row count will be different. Only a proper CSV parser handles this correctly.

Files with BOM (Byte Order Mark):

Some Windows tools prefix CSV files with \ufeff. Strip it:

parse({ bom: true })

This is a single csv-parse option that handles the BOM silently.

Benchmark: memory usage comparison

I ran a test with a 500MB CSV file containing 2.1 million rows with 12 columns each.

ApproachPeak RSS memoryTimeCompleted
readFileSync + split2.8 GB14.2sNo (OOM)
readFileSync + csv-parse string input1.9 GB12.8sNo (OOM)
createReadStream + csv-parse (streaming)68 MB37.4sYes
Streaming + batched inserts (1000/batch)72 MB41.2sYes
Streaming + row validation + dead letter76 MB43.1sYes

The streaming approaches used roughly 40x less memory and completed the job. They were slower in wall-clock time because they applied backpressure, but they finished, which is the point. A fast approach that crashes is not faster than a slow approach that completes.

Practical pattern: streaming CSV to database inserts

Here is the full production pattern that combines everything above: stream a CSV file, validate each row, batch insert into Postgres, and log progress with error recovery:

const { createReadStream, createWriteStream } = require('fs');
const { parse } = require('csv-parse');
const { pipeline } = require('stream/promises');
const { Pool } = require('pg');

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const deadLetter = createWriteStream('import_errors.csv');

async function importCsv(filePath) {
  const client = await pool.connect();
  let totalRows = 0;
  let errors = 0;

  try {
    await client.query('BEGIN');

    await pipeline(
      createReadStream(filePath, { highWaterMark: 65536 }),
      parse({ columns: true, skip_empty_lines: true, bom: true }),
      async function* (source) {
        let batch = [];

        for await (const row of source) {
          if (!row.email || !row.name) {
            deadLetter.write(`${JSON.stringify(row)}\n`);
            errors++;
            continue;
          }

          batch.push([row.id, row.name, row.email, row.attributes]);

          if (batch.length >= 500) {
            yield batch;
            batch = [];
          }
        }

        if (batch.length > 0) {
          yield batch;
        }
      },
      async function* (batches) {
        for await (const batch of batches) {
          const placeholders = batch
            .map((_, i) => `($${i * 4 + 1}, $${i * 4 + 2}, $${i * 4 + 3}, $${i * 4 + 4})`)
            .join(', ');

          const values = batch.flat();

          await client.query(
            `INSERT INTO users (id, name, email, attributes)
             VALUES ${placeholders}
             ON CONFLICT (id) DO UPDATE SET
               name = EXCLUDED.name,
               email = EXCLUDED.email`,
            values
          );

          totalRows += batch.length;
          console.log(`Inserted ${totalRows} rows, ${errors} errors`);
        }
      }
    );

    await client.query('COMMIT');
    console.log(`Import complete: ${totalRows} rows, ${errors} errors`);
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
    deadLetter.end();
  }
}

This pipeline:

  • Uses a single database transaction for the entire import (faster, but requires enough max_wal_size in Postgres for large imports; if the file is bigger than 5GB, batch the transactions instead).
  • Batches inserts in groups of 500 to balance memory and network round-trips.
  • Skips malformed rows to a dead-letter file without crashing.
  • Holds ~72MB of memory steady for a 5GB input file.
  • Rolls back the entire transaction on failure so you do not end up with a half-imported data set.

Aborting mid-stream

Sometimes you need to cancel an import. Maybe the data is wrong, maybe the server needs to shut down. A streaming pipeline can be aborted cleanly:

const ac = new AbortController();

process.on('SIGINT', () => {
  console.log('Aborting import...');
  ac.abort();
});

await pipeline(
  createReadStream(filePath),
  parse({ columns: true }),
  async function* (source) {
    for await (const row of source) {
      if (ac.signal.aborted) break;
      yield row;
    }
  }
);

The file handle closes, the parser flushes, and the pipeline ends cleanly without leaving half-written data.

When streaming is not enough

Streaming solves the memory problem, but it does not solve the speed problem. A single-threaded Node.js process processing 4 million CSV rows through a pipeline of JavaScript transforms will take a while. If 43 seconds to import 2 million rows is too slow for your use case, you have options:

Parallelize across worker threads:

const { Worker } = require('worker_threads');

// Split the CSV by line count (requires a known header row)
// Spawn N workers, each processing a chunk via streams

This works well when the CSV rows are independent and the database is the bottleneck. One Node.js thread saturates one CPU core while the database handles the writes. Adding more threads means more parallel writes to the database, which requires enough database connections to support them.

Use Postgres COPY for bulk imports:

COPY users FROM '/path/to/file.csv' WITH (FORMAT CSV, HEADER true);

COPY is dramatically faster than row-by-row INSERT, even with batching. If you control the infrastructure, pipe the stream directly into a COPY command through the pg client:

const { createReadStream } = require('fs');
const { Pool } = require('pg');
const { from: copyFrom } = require('pg-copy-streams');

const pool = new Pool();
const client = await pool.connect();

const copyStream = client.query(
  copyFrom('COPY users FROM STDIN WITH (FORMAT CSV, HEADER)')
);

const fileStream = createReadStream('users_export.csv');
fileStream.pipe(copyStream);

await new Promise((resolve, reject) => {
  copyStream.on('finish', resolve);
  copyStream.on('error', reject);
});

This is the fastest path from CSV file to Postgres table. It uses Postgres’s native CSV parser, which is written in C and runs at disk speed. For a 2GB file, COPY finishes in seconds, not minutes.

The takeaway

CSV processing in Node.js is a solved problem. The tooling exists and the patterns are well-understood. The only reason it still kills servers is that the naive approach (read the whole file, split, iterate) is the first thing every tutorial shows and the first thing every developer types.

The rule is simple: if the file might be larger than available memory, do not load it into memory. Use createReadStream, csv-parse, and batched database writes. Test with a file that matches your production data size. Monitor memory with process.memoryUsage().rss during the run.

If you see RSS creeping up over minutes instead of staying flat, you have lost backpressure somewhere. Find it and fix it before it finds your OOM killer.


A note from Yojji

Building robust data processing pipelines that handle production-scale files without falling over is the kind of backend engineering discipline that separates a reliable system from one that silently accumulates technical debt. Whether it is a streaming CSV importer, a CDC pipeline, or a batch ETL job, getting the memory profile right matters more than getting the first version shipped fast. Yojji’s teams build these kinds of systems every day: cloud-native backend services on Node.js and Postgres that process real data at real scale, with the boring, necessary attention to memory management and error recovery that keeps production stable.