Linux Capabilities and Container Security for Node.js: Running Without Root

Your Node.js container runs as root. You know this because your Dockerfile says FROM node:20-slim and you never added a USER directive. The process runs with uid 0 inside the container, which means if an attacker gets RCE through a vulnerability in express, lodash, or any of the other 1,200 packages in node_modules, they have full root privileges in the container. From there, given a kernel exploit or a misconfigured seccomp profile, host access is one CVE away.

The Dockerfile that ships with half the tutorials on the internet looks exactly like this:

FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]

No non-root user. No capability drops. No read-only filesystem. No seccomp. It builds, it runs, it passes every smoke test. And it is one curl command from a container breakout that exposes the host.

This post covers the exact four things you need to harden a Node.js container: dropping Linux capabilities, running as a non-root user, mounting the root filesystem read-only, and applying a seccomp profile. Every step is deployable today, compatible with Docker and Kubernetes, and breaks nothing if you account for the side effects.

Why root inside a container is still root

A common misconception is that Docker containers run in a sandbox and that root inside a container is somehow less powerful than root on the host. That is partially true and dangerously misleading.

Docker applies a default seccomp profile and drops some Linux capabilities. But the default set of capabilities Docker keeps is generous. A node:20-slim container running as root has the following capabilities by default:

CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD, CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_AUDIT_WRITE, CAP_KILL

That is fourteen capabilities, including CAP_DAC_OVERRIDE (bypass file permission checks), CAP_NET_RAW (raw socket access for ARP spoofing), and CAP_SYS_CHROOT (chroot escapes). If an attacker compromises your Node.js process, they inherit all of these.
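To see which of these capabilities a process actually holds, you can decode the CapEff bitmask that Linux exposes in /proc/self/status. The following is a minimal Node.js sketch; the helper name decodeCaps is made up for illustration, and the name table follows the bit order in linux/capability.h.

```javascript
// Capability names indexed by bit number (order from linux/capability.h).
const CAP_NAMES = [
  "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
  "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
  "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
  "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
  "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
  "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
  "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
  "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
  "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
  "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
  "CAP_CHECKPOINT_RESTORE",
];

// Decode a hex mask such as the CapEff value from /proc/self/status
// into the list of capability names whose bits are set.
function decodeCaps(hexMask) {
  const mask = BigInt("0x" + hexMask.trim());
  return CAP_NAMES.filter((_, bit) => (mask >> BigInt(bit)) & 1n);
}
```

Decoding the mask 00000000a80425fb yields exactly the fourteen capabilities listed above, and an all-zero mask yields an empty list.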

The attack chain looks like this:

A prototype pollution vulnerability in a dependency lets the attacker write a file to disk.
Because the container runs as root, the file lands with uid 0 and can overwrite any binary in /usr, /sbin, or anywhere else in the container.
The attacker writes a malicious binary that uses CAP_SYS_CHROOT and a mounted /proc to escape the container namespace.
The attacker now has a foothold on the host.

Every step of this chain is blocked by the hardening techniques below.

Step 1: Drop every capability, then add back only what you need

The first and easiest hardening step is to drop all capabilities and only add back the ones your application actually needs.

For a typical Node.js HTTP server, the only capability you need is CAP_NET_BIND_SERVICE if you want to bind to a privileged port (under 1024). If your application listens on port 3000 or above (which it should), you do not even need that.
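One way to make sure the application never depends on CAP_NET_BIND_SERVICE is to reject privileged ports at startup. This is a sketch, not a required pattern: resolvePort and startServer are hypothetical helpers, and the default of 3000 simply mirrors the examples in this post.

```javascript
const http = require("node:http");

// Parse the PORT setting and refuse privileged ports, so the container
// never needs CAP_NET_BIND_SERVICE.
function resolvePort(value, fallback = 3000) {
  const port = value === undefined || value === "" ? fallback : Number(value);
  if (!Number.isInteger(port) || port < 1024 || port > 65535) {
    throw new RangeError(`PORT must be an integer in 1024..65535, got ${value}`);
  }
  return port;
}

// Start a trivial HTTP server on the validated port.
function startServer(env = process.env) {
  const server = http.createServer((req, res) => res.end("ok\n"));
  return server.listen(resolvePort(env.PORT));
}
```

If a low port is needed externally, publish it on the host side (for example "80:3000" in the ports mapping) and keep the process itself on a high port.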

Docker Compose:

services:
  app:
    build: .
    cap_drop:
      - ALL
    cap_add: []
    ports:
      - "3000:3000"

Docker run:

# Add --cap-add=NET_BIND_SERVICE only if the app binds a port below 1024
docker run --cap-drop=ALL -p 3000:3000 my-app

But wait. If you test --cap-drop=ALL on a Node.js container that still runs as root, you might see something unexpected: writes that used to succeed now fail with EACCES. Node.js’s fs module goes through libuv’s uv_fs_open(), which ends up in the open()/openat() system call. Root normally passes that call because CAP_DAC_OVERRIDE lets it skip permission checks. Without the capability, the kernel enforces the file’s permission bits strictly, even for uid 0. If your application writes to a log file or stores an upload, the uid and gid of the running process must have write permission on the target directory. Once the capability is gone this is an ordinary file-ownership problem, which the next step solves.

The key insight: capability drops are free. They add zero runtime overhead, they require no code changes, and they block entire classes of kernel-level exploits. There is no reason not to drop ALL and add back only what you need.

Step 2: Run as a non-root user

This is the single highest-impact change you can make. A process running as uid 1000 inside the container cannot write to /usr/bin, cannot modify /etc/passwd, and cannot call chroot(). The kernel authorizes privileged operations by checking the capabilities of the calling process, and when a non-root user executes a program its permitted and effective capability sets start out empty (unless ambient or file capabilities are configured). So even if the container is granted capabilities, a process running as uid 1000 does not hold them.
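As a belt-and-braces measure, the application can refuse to start when it finds itself running as uid 0. A minimal sketch, with hypothetical helper names; process.getuid() only exists on POSIX platforms, so the check is skipped elsewhere.

```javascript
// Return the numeric uid, or null on platforms without process.getuid().
function currentUid() {
  return typeof process.getuid === "function" ? process.getuid() : null;
}

// Throw when the given uid is root; otherwise return it unchanged.
function assertNotRoot(uid = currentUid()) {
  if (uid === 0) {
    throw new Error("Refusing to start as root; set USER in the Dockerfile");
  }
  return uid;
}
```

Calling assertNotRoot() as the first statement in server.js turns a forgotten USER directive into an immediate, visible failure instead of a silent risk.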

The Dockerfile change is only a few lines. The official node images already ship an unprivileged user named node with uid 1000 and gid 1000, so there is no account to create. Running groupadd --gid 1000 on top of the image fails because that gid is already taken.

FROM node:20-slim

WORKDIR /app
COPY package*.json ./
# Install as root, then hand the application files to the built-in node user (uid 1000)
RUN npm ci --omit=dev && \
    chown -R node:node /app
COPY --chown=node:node . .

USER node
EXPOSE 3000
CMD ["node", "server.js"]

If you are using Alpine-based images (node:20-alpine), the same node user exists, so the Dockerfile above works with only the FROM line changed.

If you prefer a dedicated account, pick ids that are not already taken and keep them in sync with the user and runAsUser settings used later in this post. The commands differ between the two image families because Alpine uses busybox:

# Debian-based (node:20-slim)
RUN groupadd --system --gid 10001 appuser && \
    useradd --system --uid 10001 --gid appuser appuser

# Alpine-based (node:20-alpine)
RUN addgroup -S -g 10001 appuser && \
    adduser -S -u 10001 -G appuser appuser

The uid 1000 is conventional, not special. Any non-zero uid works; avoid ids below 100, which are reserved for system accounts.

What breaks when you switch to a non-root user?

Anything that writes to filesystem paths controlled by root. The most common issues:

Log files written to /var/log. Your application cannot create files there. Write logs to stdout/stderr (which you should be doing anyway for containerized apps) or to a directory under /app that has the right ownership.
Socket files in /var/run. If you use Unix domain sockets, create the socket in a directory owned by the app user.
npm install with lifecycle scripts. Some npm packages run postinstall scripts that need to write to protected paths. If you run npm install as the non-root user, those scripts fail. Always run npm ci during the build (as root or with a temporary build user) and copy the result.
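For the logging case, writing structured lines to stdout and stderr avoids the need for any writable log directory. The sketch below is one possible shape; formatLogLine and log are illustrative names, not part of any library.

```javascript
// Serialize one log entry as a single JSON line.
function formatLogLine(level, message, fields = {}) {
  return JSON.stringify({
    time: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

// Write to stdout/stderr instead of a file under /var/log, so the
// container runtime collects the output and no writable path is needed.
function log(level, message, fields) {
  const stream = level === "error" ? process.stderr : process.stdout;
  stream.write(formatLogLine(level, message, fields) + "\n");
}
```

A call such as log("info", "listening", { port: 3000 }) then shows up in docker logs or kubectl logs without touching the filesystem.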

Once you switch to a non-root user and drop all capabilities, your container is dramatically harder to exploit.

Step 3: Mount the root filesystem read-only

A read-only root filesystem means the process cannot write to any path on the root filesystem, period. Combined with a non-root user, this closes the entire class of binary-overwrite and configuration-tampering attacks.

Docker:

docker run --read-only --tmpfs /tmp --tmpfs /app/data my-app

Docker Compose:

services:
  app:
    build: .
    read_only: true
    tmpfs:
      - /tmp
      - /app/data
    cap_drop:
      - ALL

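Kubernetes (pod-level and container-level securityContext; readOnlyRootFilesystem is only valid on the container):

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: app
      securityContext:
        readOnlyRootFilesystem: true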

The --read-only flag makes the container’s union filesystem immutable. Node.js writes to /tmp and /app/data are redirected to an in-memory tmpfs. No files survive a container restart, which is fine because containers are ephemeral.

What needs a writable path that is not /tmp?

Node.js itself writes to a few paths at runtime:

V8 compilation cache. Node.js 20 does not persist compiled bytecode to disk by default. Node.js 22 and later offer an opt-in on-disk cache through the NODE_COMPILE_CACHE environment variable; if you enable it, point it at a writable path such as a directory under /tmp. Running without the cache costs only a little startup time.
npm cache. If your application runs npm commands at runtime (which it should not), the npm cache directory needs to be writable. Point it at tmpfs, for example with ENV npm_config_cache=/tmp/.npm in your Dockerfile.
Temporary files. Libraries like sharp (image processing), puppeteer (headless Chrome), and node-gyp (native compilation) write to /tmp. As long as /tmp is mounted as tmpfs, they work fine.
Upload directories. If your application accepts file uploads, the upload destination must be a tmpfs mount or a PersistentVolumeClaim in Kubernetes.

The rule is simple: everything under / is read-only. Anything that needs writes goes to /tmp or a named volume.
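A quick way to confirm the rule at startup is to probe each directory the application expects to write to. This is a sketch under the assumption that creating and removing an empty scratch directory is an acceptable test; isWritable and checkWritablePaths are hypothetical helpers.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Try to create and remove a scratch directory inside dir.
// Returns true on success and false on any error (EROFS, EACCES, ENOENT...).
function isWritable(dir) {
  try {
    const scratch = fs.mkdtempSync(path.join(dir, ".probe-"));
    fs.rmdirSync(scratch);
    return true;
  } catch {
    return false;
  }
}

// Map each expected-writable path to the result of the probe.
function checkWritablePaths(dirs) {
  return Object.fromEntries(dirs.map((dir) => [dir, isWritable(dir)]));
}
```

With the mounts used in this post, checkWritablePaths(["/tmp", "/app/data", "/app"]) should report the first two as writable and /app as read-only.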

Step 4: Apply a seccomp profile

seccomp (secure computing mode) restricts the system calls a process can make. Docker ships with a default seccomp profile that blocks around 44 of the 300+ system calls (including mount, reboot, and swapon). But the default profile is permissive enough to run most applications without issues. You can tighten it.

A custom seccomp profile for a Node.js application should block syscalls that are never used by a JavaScript runtime: mount, umount2, ptrace, perf_event_open, bpf, kexec_file_load, swapon, swapoff, create_module, init_module, finit_module, delete_module.

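Here is an allow-list profile that is stricter than the Docker default. It denies every syscall that is not listed, so treat it as a starting point and test it against your own workload, since native addons may need syscalls that are not on the list: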

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "arch_prctl", "bind",
        "brk", "capget", "capset", "chdir", "chmod", "chown",
        "clock_getres", "clock_gettime", "clock_nanosleep",
        "clone", "clone3", "close", "connect", "copy_file_range",
        "creat", "dup", "dup2", "dup3", "epoll_create1",
        "epoll_ctl", "epoll_pwait", "epoll_wait", "eventfd2", "execve",
        "exit", "exit_group", "faccessat", "faccessat2", "fadvise64",
        "fallocate", "fchdir", "fchmod", "fchmodat", "fchown",
        "fchownat", "fcntl", "fdatasync", "fgetxattr",
        "flistxattr", "flock", "fork", "fremovexattr",
        "fsetxattr", "fstat", "fstatfs", "fsync", "ftruncate",
        "futex", "getcwd", "getdents64", "getegid", "geteuid",
        "getgid", "getpeername", "getpgid", "getpgrp",
        "getpid", "getppid", "getpriority", "getrandom",
        "getresgid", "getresuid", "getrlimit", "getrusage",
        "getsockname", "getsockopt", "gettid", "gettimeofday",
        "getuid", "getxattr", "inotify_add_watch",
        "inotify_init1", "inotify_rm_watch", "ioctl",
        "ioprio_get", "ioprio_set", "kcmp", "kill",
        "lgetxattr", "link", "linkat", "listen", "listxattr",
        "llistxattr", "lremovexattr", "lseek", "lsetxattr",
        "lstat", "madvise", "mbind", "memfd_create",
        "membarrier", "mincore", "mkdir", "mkdirat",
        "mlock", "mlock2", "mmap", "mprotect", "mremap",
        "msgctl", "msgget", "msgrcv", "msgsnd",
        "msync", "munlock", "munmap", "name_to_handle_at",
        "nanosleep", "newfstatat", "open", "openat",
        "openat2", "pause", "pidfd_getfd", "pidfd_open",
        "pidfd_send_signal", "pipe", "pipe2", "poll",
        "ppoll", "prctl", "pread64", "preadv", "preadv2",
        "prlimit64", "pselect6",
        "pwrite64", "pwritev", "pwritev2", "read",
        "readlink", "readlinkat", "readv", "recvfrom",
        "recvmmsg", "recvmsg", "rename", "renameat",
        "renameat2", "restart_syscall", "rmdir", "rseq",
        "rt_sigaction", "rt_sigpending", "rt_sigprocmask",
        "rt_sigqueueinfo", "rt_sigreturn", "rt_sigsuspend",
        "rt_sigtimedwait", "sched_getaffinity",
        "sched_getattr", "sched_getparam", "sched_getscheduler",
        "sched_rr_get_interval", "sched_setaffinity",
        "sched_setattr", "sched_setparam", "sched_setscheduler",
        "sched_yield", "seccomp", "select", "semctl",
        "semget", "semop", "semtimedop", "sendfile",
        "sendmmsg", "sendmsg", "sendto",
        "set_robust_list", "set_tid_address",
        "setgid", "setgroups", "setitimer",
        "setpgid", "setpriority", "setregid", "setresgid",
        "setresuid", "setreuid", "setrlimit", "setsid",
        "setsockopt", "setuid", "shmctl", "shmdt",
        "shmget", "shutdown", "sigaltstack", "signalfd4",
        "socket", "socketpair", "splice", "stat", "statfs",
        "statx", "symlink", "symlinkat", "sync",
        "sync_file_range", "sysinfo", "tee", "tgkill",
        "time", "timer_create", "timer_delete",
        "timer_getoverrun", "timer_gettime", "timer_settime",
        "timerfd_create", "timerfd_gettime", "timerfd_settime",
        "tkill", "truncate", "umask", "uname", "unlink",
        "unlinkat", "utimensat", "utimes",
        "vfork", "vmsplice", "wait4", "waitid", "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Save this as node-seccomp.json and apply it:

docker run --security-opt seccomp=node-seccomp.json my-app
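An allow-list this long is easy to get wrong, so it can help to lint the profile before shipping it. The sketch below assumes the profile's default action is SCMP_ACT_ERRNO and reuses the list of syscalls named earlier as ones a JavaScript runtime never needs; findForbiddenAllowed is a hypothetical helper.

```javascript
// Syscalls this post says a JavaScript runtime should never need.
const FORBIDDEN = [
  "mount", "umount2", "ptrace", "perf_event_open", "bpf", "kexec_file_load",
  "swapon", "swapoff", "create_module", "init_module", "finit_module",
  "delete_module",
];

// Return every forbidden syscall that an SCMP_ACT_ALLOW rule permits.
// An empty result means the allow-list does not contradict the deny list.
function findForbiddenAllowed(profile, forbidden = FORBIDDEN) {
  const allowed = new Set(
    (profile.syscalls || [])
      .filter((rule) => rule.action === "SCMP_ACT_ALLOW")
      .flatMap((rule) => rule.names || [])
  );
  return forbidden.filter((name) => allowed.has(name));
}
```

In a build step you could load the file with JSON.parse(fs.readFileSync("node-seccomp.json", "utf8")) and fail the build when the returned array is not empty.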

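In Kubernetes, a custom profile is referenced from the securityContext: set seccompProfile.type to Localhost and localhostProfile to the profile's path relative to the kubelet's seccomp directory, which means the file has to be present on every node. PodSecurityPolicy, which used to carry seccomp annotations, was removed in Kubernetes 1.25. The simplest approach is to use the RuntimeDefault profile and tighten capabilities instead, since custom seccomp profiles are harder to manage across a cluster.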

Putting it all together: the hardened Dockerfile

Here is the complete hardened Dockerfile that combines every technique above:

FROM node:20-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

FROM node:20-slim

# Install tini as root, before switching users, for proper signal handling
RUN apt-get update && apt-get install -y --no-install-recommends tini && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

WORKDIR /app
# The official image already ships the unprivileged node user (uid 1000, gid 1000)
COPY --from=builder --chown=node:node /app/node_modules ./node_modules
COPY --chown=node:node . .

USER node
EXPOSE 3000
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["node", "server.js"]
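tini takes care of forwarding signals and reaping zombies, but the application still decides what to do when SIGTERM arrives. A minimal sketch of a graceful shutdown hook follows; installGracefulShutdown is a hypothetical helper, and the proc parameter exists only so the handler can be exercised without killing the real process.

```javascript
// On SIGTERM or SIGINT, stop accepting new connections and exit once
// in-flight requests have finished.
function installGracefulShutdown(server, proc = process) {
  const shutdown = () => server.close(() => proc.exit(0));
  proc.on("SIGTERM", shutdown);
  proc.on("SIGINT", shutdown);
  return shutdown;
}
```

In server.js this would be called once with the object returned by http.createServer(...).listen(...), so that docker stop and Kubernetes pod termination end the process cleanly instead of waiting for the kill timeout.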

And the corresponding docker-compose.yml:

version: '3.8'
services:
  app:
    build: .
    user: "1000:1000"
    cap_drop:
      - ALL
    cap_add: []
    read_only: true
    tmpfs:
      - /tmp
      - /app/data
    security_opt:
      - no-new-privileges:true
      - seccomp:node-seccomp.json
    ports:
      - "3000:3000"

Kubernetes Pod Security Context

In Kubernetes, all of these settings go into the Pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: node-app
  labels:
    app: node-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: my-app:latest
      ports:
        - containerPort: 3000
      securityContext:
        allowPrivilegeEscalation: false
        privileged: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
          add: []
        runAsNonRoot: true
        runAsUser: 1000
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: data
          mountPath: /app/data
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory
    - name: data
      emptyDir:
        medium: Memory

The allowPrivilegeEscalation: false flag is critical. It sets the no_new_privs bit on the process, which prevents the process and its children from gaining additional privileges through setuid binaries or executables with file capabilities. Combined with runAsNonRoot: true, this means that even if an attacker manages to plant a setuid root binary, the kernel will refuse to elevate the process.
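Both the no_new_privs bit and the seccomp mode are visible from inside the container in /proc/self/status, so a process can report on its own hardening. A small sketch, assuming a kernel recent enough to expose the NoNewPrivs field; statusField and hardeningStatus are illustrative names.

```javascript
// Read a numeric field such as "NoNewPrivs" or "Seccomp" from the text of
// /proc/<pid>/status. Returns null when the field is absent.
function statusField(statusText, name) {
  const match = new RegExp(`^${name}:\\s*(\\d+)$`, "m").exec(statusText);
  return match ? Number(match[1]) : null;
}

// Summarize the two flags this section cares about.
function hardeningStatus(statusText) {
  return {
    noNewPrivs: statusField(statusText, "NoNewPrivs") === 1,
    // Seccomp: 0 = disabled, 1 = strict mode, 2 = filter mode
    seccompFilter: statusField(statusText, "Seccomp") === 2,
  };
}
```

On Linux the input would come from fs.readFileSync("/proc/self/status", "utf8"); exposing the result on a health or debug endpoint makes a missing setting easy to spot.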

Testing that the hardening is actually enforced

A quick smoke test to verify your container is not running as root:

# Verify the user inside the container
docker run --rm --cap-drop=ALL --read-only my-app id
# Expected output with the image's built-in node user: uid=1000(node) gid=1000(node) groups=1000(node)

# Verify you cannot write anywhere outside /tmp
docker run --rm --cap-drop=ALL --read-only my-app touch /test.txt
# Expected output: touch: cannot touch '/test.txt': Read-only file system

# Verify the no_new_privs bit is set on the container process
docker run --rm --security-opt no-new-privileges:true my-app \
  grep NoNewPrivs /proc/self/status
# Expected output: a NoNewPrivs line with the value 1

In your CI pipeline, add a step that runs these checks after the image build:

# GitHub Actions
- name: Security smoke test
  run: |
    docker run --rm --read-only --cap-drop=ALL \
      my-app node -e "process.exit(0)"
    echo "Container runs with read-only root FS and dropped capabilities"
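If the pipeline already runs Node.js, the same check can live in a small script so the flags are defined in one place. This is a sketch: hardenedRunArgs and smokeTest are hypothetical helpers, my-app is the image name used throughout this post, and only flags that already appear above are used.

```javascript
const { spawnSync } = require("node:child_process");

// Build the argument list for a hardened `docker run`.
function hardenedRunArgs(image, command = []) {
  return [
    "run", "--rm",
    "--read-only", "--tmpfs", "/tmp",
    "--cap-drop=ALL",
    "--security-opt", "no-new-privileges:true",
    image, ...command,
  ];
}

// Run the container and report whether it exited with status 0.
function smokeTest(image) {
  const args = hardenedRunArgs(image, ["node", "-e", "process.exit(0)"]);
  const result = spawnSync("docker", args, { stdio: "inherit" });
  return result.status === 0;
}
```

A CI step could then call smokeTest("my-app") and set a non-zero exit code when it returns false.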

What about npm audit and image scanning?

The container hardening in this post is about runtime security: what happens after the container starts. It is complementary to image-level scanning (Trivy, Grype, Snyk) that checks for known CVEs in your base image and dependencies. You need both.

A container that passes every CVE scan can still be exploited if the process runs as root with too many capabilities. And a hardened container running as non-root with read-only filesystem can still be exploited if a dependency has a deserialization vulnerability. Layer the defenses.

A note from Yojji

Container security is easy to defer until after a breach, and nearly impossible to retrofit without breaking something if you do not plan for it from the start. The four layers covered here (non-root user, capability drops, read-only filesystem, seccomp) cost nothing to implement and require no architectural changes if applied during initial setup. Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK, and their teams regularly design and deploy Node.js services on AWS, Azure, and Google Cloud with the kind of security-first container posture that makes platform engineers breathe a little easier during incident calls.