Linux Capabilities and Container Security for Node.js: Running Without Root
A practical guide to running Node.js containers with least privilege: dropping Linux capabilities, switching to a non-root user, enabling read-only root filesystems, and applying seccomp profiles so a container breakout is significantly harder.
Your Node.js container runs as root. You know this because your Dockerfile says FROM node:20-slim and you never added a USER directive. The process runs with uid 0 inside the container, which means if an attacker gets RCE through a vulnerability in express, lodash, or any of the other 1,200 packages in node_modules, they have full root privileges in the container. From there, a kernel exploit or a misconfigured seccomp profile puts host access one CVE away.
The Dockerfile that ships with half the tutorials on the internet looks exactly like this:
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
No non-root user. No capability drops. No read-only filesystem. No seccomp. It builds, it runs, it passes every smoke test. And it is one curl command from a container breakout that exposes the host.
This post covers the exact four things you need to harden a Node.js container: dropping Linux capabilities, running as a non-root user, mounting the root filesystem read-only, and applying a seccomp profile. Every step is deployable today, compatible with Docker and Kubernetes, and breaks nothing if you account for the side effects.
Why root inside a container is still root
A common misconception is that Docker containers run in a sandbox and that root inside a container is somehow less powerful than root on the host. That is partially true and dangerously misleading.
Docker applies a default seccomp profile and drops some Linux capabilities. But the default set of capabilities Docker keeps is generous. A node:20-slim container running as root has the following capabilities by default:
CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD, CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_AUDIT_WRITE, CAP_KILL
That is fourteen capabilities, including CAP_DAC_OVERRIDE (bypass file permission checks), CAP_NET_RAW (raw socket access for ARP spoofing), and CAP_SYS_CHROOT (chroot escapes). If an attacker compromises your Node.js process, they inherit all of these.
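To see which capabilities a running process actually holds, you can decode the CapEff bitmask that Linux exposes in /proc/self/status. The following is a minimal sketch, assuming a Linux host. The names decodeCaps and currentCaps are illustrative, not part of Node.js, and the bit positions follow the kernel's capability.h.

```javascript
// caps.js: decode the effective capability set of the current process (Linux only).
const fs = require('node:fs');

// Capability names indexed by bit number, as defined in linux/capability.h.
const CAP_NAMES = [
  'CAP_CHOWN', 'CAP_DAC_OVERRIDE', 'CAP_DAC_READ_SEARCH', 'CAP_FOWNER',
  'CAP_FSETID', 'CAP_KILL', 'CAP_SETGID', 'CAP_SETUID', 'CAP_SETPCAP',
  'CAP_LINUX_IMMUTABLE', 'CAP_NET_BIND_SERVICE', 'CAP_NET_BROADCAST',
  'CAP_NET_ADMIN', 'CAP_NET_RAW', 'CAP_IPC_LOCK', 'CAP_IPC_OWNER',
  'CAP_SYS_MODULE', 'CAP_SYS_RAWIO', 'CAP_SYS_CHROOT', 'CAP_SYS_PTRACE',
  'CAP_SYS_PACCT', 'CAP_SYS_ADMIN', 'CAP_SYS_BOOT', 'CAP_SYS_NICE',
  'CAP_SYS_RESOURCE', 'CAP_SYS_TIME', 'CAP_SYS_TTY_CONFIG', 'CAP_MKNOD',
  'CAP_LEASE', 'CAP_AUDIT_WRITE', 'CAP_AUDIT_CONTROL', 'CAP_SETFCAP',
  'CAP_MAC_OVERRIDE', 'CAP_MAC_ADMIN', 'CAP_SYSLOG', 'CAP_WAKE_ALARM',
  'CAP_BLOCK_SUSPEND', 'CAP_AUDIT_READ', 'CAP_PERFMON', 'CAP_BPF',
  'CAP_CHECKPOINT_RESTORE',
];

// Turn a hex mask such as "00000000a80425fb" into a list of capability names.
function decodeCaps(hexMask) {
  const mask = BigInt('0x' + hexMask.trim());
  return CAP_NAMES.filter((_, bit) => (mask >> BigInt(bit)) & 1n);
}

// Read the CapEff line for this process; returns null if the field is missing.
function currentCaps(statusPath = '/proc/self/status') {
  const status = fs.readFileSync(statusPath, 'utf8');
  const match = status.match(/^CapEff:\s*([0-9a-fA-F]+)$/m);
  return match ? decodeCaps(match[1]) : null;
}

module.exports = { decodeCaps, currentCaps };
```

Calling currentCaps() inside the container and logging the result at startup is one way to confirm what a given set of --cap-drop and --cap-add flags really leaves behind. An all-zero mask decodes to an empty list.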
The attack chain looks like this:
- A prototype pollution vulnerability in a dependency lets the attacker write a file to disk.
- Because the container runs as root, the file lands with uid 0 and can overwrite any binary in /usr, /sbin, or anywhere else in the container.
- The attacker writes a malicious binary that uses CAP_SYS_CHROOT and a mounted /proc to escape the container namespace.
- The attacker now has a foothold on the host.
The hardening techniques below break this chain at every step after the initial compromise.
Step 1: Drop every capability, then add back only what you need
The first and easiest hardening step is to drop all capabilities and only add back the ones your application actually needs.
For a typical Node.js HTTP server, the only capability you need is CAP_NET_BIND_SERVICE if you want to bind to a privileged port (under 1024). If your application listens on port 3000 or above (which it should), you do not even need that.
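One way to keep the privileged-port question from creeping back in is to validate the port at startup. This is a small sketch, assuming the port arrives in a PORT environment variable and that the unprivileged range starts at 1024 (the kernel default). resolvePort is a hypothetical helper name.

```javascript
// Resolve the listening port and refuse values that would need CAP_NET_BIND_SERVICE.
function resolvePort(value, fallback = 3000) {
  const port = value === undefined || value === '' ? fallback : Number(value);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`invalid port: ${value}`);
  }
  if (port < 1024) {
    throw new RangeError(
      `port ${port} is privileged; listen on 1024 or above and map it at the container edge`
    );
  }
  return port;
}

// Typical use in server.js (not executed here):
//   const http = require('node:http');
//   http.createServer(handler).listen(resolvePort(process.env.PORT));

module.exports = { resolvePort };
```

Publishing "80:3000" in Docker or using a Service in Kubernetes gives clients the low port without the process ever needing the capability.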
Docker Compose:
services:
  app:
    build: .
    cap_drop:
      - ALL
    cap_add: []
    ports:
      - "3000:3000"
Docker run:
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE my-app
But wait. If you test --cap-drop=ALL on a Node.js container that still runs as root, you might see something unexpected. Node.js’s fs module goes through libuv’s uv_fs_open(), which ends up in the openat() system call. Without CAP_DAC_OVERRIDE, the kernel enforces a file’s permission bits strictly, even for uid 0: root can no longer write to files or directories whose owner, group, and mode bits do not allow it. If your application writes a log file or stores an upload, the uid and gid of the running process must have write permission on the target directory. The fix is not to add the capability back but to get ownership and permissions right, which the next step does.
The key insight: capability drops are free. They add zero runtime overhead, they require no code changes, and they block entire classes of kernel-level exploits. There is no reason not to drop ALL and add back only what you need.
Step 2: Run as a non-root user
This is the single highest-impact change you can make. A process running as uid 1000 inside the container cannot write to /usr/bin, cannot modify /etc/passwd, and cannot call chroot(). The kernel gates privileged operations on the process's effective capability set rather than on the uid itself, but when a non-root user executes a program that set starts out empty (unless ambient or file capabilities are configured), so those operations are refused even if the container's bounding set still lists the capability.
The Dockerfile change is a few lines:
FROM node:20-slim
# The official node images already ship a "node" user with uid/gid 1000, so
# creating a second account with those ids fails. Rename the existing one.
RUN groupmod --new-name appuser node && \
    usermod --login appuser --home /home/appuser --move-home node
WORKDIR /app
COPY package*.json ./
# Install production dependencies as root, then hand /app to the app user
RUN npm ci --omit=dev && \
    chown -R appuser:appuser /app
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]
If you are using Alpine-based images (node:20-alpine), the commands are different because Alpine uses busybox:
FROM node:20-alpine
# Remove the built-in "node" user (uid/gid 1000) before reusing its ids
RUN deluser --remove-home node && \
    addgroup -S -g 1000 appuser && \
    adduser -S -u 1000 -G appuser appuser
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev && \
    chown -R appuser:appuser /app
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]
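The uid 1000 is arbitrary but conventional: it is the first regular-user uid on most distributions and the uid of the node user that the official images ship. Any uid of 1000 or above works, as long as the Dockerfile, the Compose file, and the Kubernetes manifests agree on it. Do not use uids below 100 (system accounts) for application processes.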
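As a second line of defense, the application itself can refuse to start as root, which catches a missing USER directive or a user override at deploy time. A minimal sketch follows. assertNotRoot is a hypothetical helper, and process.getuid only exists on POSIX platforms, so the default falls back to null elsewhere.

```javascript
// Throw early if the process is running as uid 0; return the uid otherwise.
function assertNotRoot(uid = process.getuid ? process.getuid() : null) {
  if (uid === 0) {
    throw new Error('refusing to start as root (uid 0); add a USER directive to the image');
  }
  return uid;
}

// Typical use at the very top of server.js (not executed here):
//   assertNotRoot();

module.exports = { assertNotRoot };
```

The uid is a parameter so the check can be unit tested without actually changing users.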
What breaks when you switch to a non-root user?
Anything that writes to filesystem paths controlled by root. The most common issues:
- Log files written to /var/log. Your application cannot create files there. Write logs to stdout/stderr (which you should be doing anyway for containerized apps) or to a directory under /app that has the right ownership.
- Socket files in /var/run. If you use Unix domain sockets, create the socket in a directory owned by the app user.
- npm install with lifecycle scripts. Some npm packages run postinstall scripts that need to write to protected paths. If you npm install as the appuser, those scripts fail. Always run npm ci during the build (as root or with a temporary build user) and copy the result.
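For the logging point above, writing to stdout/stderr needs no writable path at all. Here is a minimal sketch of a JSON-lines logger, assuming you do not already use a logging library. createLogger and its options are illustrative names.

```javascript
// Minimal JSON-lines logger: info goes to stdout, error goes to stderr.
// Streams and the clock are injectable so the logger can be tested.
function createLogger({ out = process.stdout, err = process.stderr, now = () => new Date() } = {}) {
  const write = (stream, level, msg, fields = {}) => {
    const entry = { time: now().toISOString(), level, msg, ...fields };
    stream.write(JSON.stringify(entry) + '\n');
  };
  return {
    info: (msg, fields) => write(out, 'info', msg, fields),
    error: (msg, fields) => write(err, 'error', msg, fields),
  };
}

// Typical use (not executed here):
//   const log = createLogger();
//   log.info('listening', { port: 3000 });

module.exports = { createLogger };
```

The container runtime collects both streams, so rotation and shipping happen outside the container and nothing is written under /var/log.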
Once you switch to a non-root user and drop all capabilities, your container is dramatically harder to exploit.
Step 3: Mount the root filesystem read-only
A read-only root filesystem means the process cannot write to any path on the root filesystem, period. Combined with a non-root user, this closes the entire class of binary-overwrite and configuration-tampering attacks.
Docker:
docker run --read-only --tmpfs /tmp --tmpfs /app/data my-app
Docker Compose:
services:
  app:
    build: .
    read_only: true
    tmpfs:
      - /tmp
      - /app/data
    cap_drop:
      - ALL
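Kubernetes (Pod Security Context):
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: app
      image: my-app:latest
      securityContext:
        # readOnlyRootFilesystem is a container-level field, not a pod-level one
        readOnlyRootFilesystem: true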
The --read-only flag makes the container’s union filesystem immutable. Node.js writes to /tmp and /app/data are redirected to an in-memory tmpfs. No files survive a container restart, which is fine because containers are ephemeral.
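To find out which paths your application can still write to under --read-only, you can probe them from inside the container. This is a rough sketch. isWritableDir and probeWritable are hypothetical helpers, and the probe briefly creates and removes an empty file in each directory.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Try to create and delete an empty file in dir; any failure
// (EROFS, EACCES, ENOENT, ...) counts as "not writable".
function isWritableDir(dir) {
  const probe = path.join(dir, `.write-probe-${process.pid}-${Date.now()}`);
  try {
    fs.writeFileSync(probe, '');
    fs.unlinkSync(probe);
    return true;
  } catch {
    return false;
  }
}

// Map each directory to true/false.
function probeWritable(dirs) {
  return Object.fromEntries(dirs.map((dir) => [dir, isWritableDir(dir)]));
}

// Typical use inside the container (not executed here):
//   console.log(probeWritable(['/', '/app', '/tmp', '/app/data']));

module.exports = { isWritableDir, probeWritable };
```

With the flags shown above, you would expect only the tmpfs mounts to report true.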
What needs a writable path that is not /tmp?
Node.js itself writes to a few paths at runtime:
- V8 compilation cache. Node.js 20 does not persist compiled bytecode to disk on its own. The on-disk compile cache added in Node.js 22 is opt-in (NODE_COMPILE_CACHE); if you enable it, point it at a writable directory such as /tmp. For a long-running server, the cost of going without it is usually negligible.
- npm cache. If your application runs npm commands at runtime (which it should not), the npm cache directory needs to be writable. Set ENV NPM_CONFIG_CACHE=/tmp/.npm in your Dockerfile.
- Temporary files. Libraries like sharp (image processing), puppeteer (headless Chrome), and node-gyp (native compilation) write to /tmp. As long as /tmp is mounted as tmpfs, they work fine.
- Upload directories. If your application accepts file uploads, the upload destination must be a tmpfs mount or a PersistentVolumeClaim in Kubernetes.
The rule is simple: everything under / is read-only. Anything that needs writes goes to /tmp or a named volume.
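In application code, that rule means never building paths relative to the source tree for uploads or scratch files. The sketch below assumes the writable mount is passed in a DATA_DIR environment variable, which is a hypothetical name, and falls back to the OS temp directory. createScratchDir and removeScratchDir are illustrative helpers.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Create a uniquely named working directory under the writable mount.
// DATA_DIR is an assumed variable name; point it at your tmpfs or volume.
function createScratchDir(base = process.env.DATA_DIR || os.tmpdir(), prefix = 'upload-') {
  // mkdtempSync appends random characters to the prefix and creates the directory.
  return fs.mkdtempSync(path.join(base, prefix));
}

// Remove the directory and everything in it once the request is done.
function removeScratchDir(dir) {
  fs.rmSync(dir, { recursive: true, force: true });
}

module.exports = { createScratchDir, removeScratchDir };
```

Cleaning up matters more than usual here: a tmpfs mount is backed by memory, so leftover uploads count against the container's memory limit.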
Step 4: Apply a seccomp profile
seccomp (secure computing mode) restricts the system calls a process can make. Docker ships with a default seccomp profile that blocks around 44 of the 300+ system calls (such as mount, reboot, and swapon) while staying permissive enough to run most applications unmodified. You can tighten it.
A custom seccomp profile for a Node.js application should block syscalls that are never used by a JavaScript runtime: mount, umount2, ptrace, perf_event_open, bpf, kexec_file_load, swapon, swapoff, create_module, init_module, finit_module, delete_module.
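Here is an allowlist profile: any syscall not named in it fails with an error instead of executing. Treat it as a starting point and run your test suite under it, since native addons and newer Node.js releases can need syscalls that are not listed: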
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "arch_prctl", "bind",
        "brk", "capget", "capset", "chdir", "chmod", "chown",
        "clock_getres", "clock_gettime", "clock_nanosleep",
        "clone", "clone3", "close", "connect", "copy_file_range",
        "creat", "dup", "dup2", "dup3", "epoll_create1",
        "epoll_ctl", "epoll_pwait", "eventfd2", "execve",
        "exit", "exit_group", "faccessat2", "fadvise64",
        "fallocate", "fchdir", "fchmod", "fchmodat", "fchown",
        "fchownat", "fcntl", "fdatasync", "fgetxattr",
        "flistxattr", "flock", "fork", "fremovexattr",
        "fsetxattr", "fstat", "fstatfs", "fsync", "ftruncate",
        "futex", "getcwd", "getdents64", "getegid", "geteuid",
        "getgid", "getpeername", "getpgid", "getpgrp",
        "getpid", "getppid", "getpriority", "getrandom",
        "getresgid", "getresuid", "getrlimit", "getrusage",
        "getsockname", "getsockopt", "gettid", "gettimeofday",
        "getuid", "getxattr", "inotify_add_watch",
        "inotify_init1", "inotify_rm_watch", "ioctl",
        "ioprio_get", "ioprio_set", "kcmp", "kill",
        "lgetxattr", "link", "linkat", "listen", "listxattr",
        "llistxattr", "lremovexattr", "lseek", "lsetxattr",
        "lstat", "madvise", "mbind", "memfd_create",
        "membarrier", "mincore", "mkdir", "mkdirat",
        "mlock", "mlock2", "mmap", "mprotect", "mremap",
        "msgctl", "msgget", "msgrcv", "msgsnd",
        "msync", "munlock", "munmap",
        "nanosleep", "newfstatat", "open", "openat",
        "openat2", "pause", "pidfd_open",
        "pidfd_send_signal", "pipe", "pipe2", "poll",
        "ppoll", "prctl", "pread64", "preadv", "preadv2",
        "prlimit64", "pselect6",
        "pwrite64", "pwritev", "pwritev2", "read",
        "readlink", "readlinkat", "readv", "recvfrom",
        "recvmmsg", "recvmsg", "rename", "renameat",
        "renameat2", "restart_syscall", "rmdir", "rseq",
        "rt_sigaction", "rt_sigpending", "rt_sigprocmask",
        "rt_sigqueueinfo", "rt_sigreturn", "rt_sigsuspend",
        "rt_sigtimedwait", "sched_getaffinity",
        "sched_getattr", "sched_getparam", "sched_getscheduler",
        "sched_rr_get_interval", "sched_setaffinity",
        "sched_setattr", "sched_setparam", "sched_setscheduler",
        "sched_yield", "seccomp", "select", "semctl",
        "semget", "semop", "semtimedop", "sendfile",
        "sendmmsg", "sendmsg", "sendto",
        "set_robust_list", "set_tid_address",
        "setgid", "setgroups", "setitimer",
        "setpgid", "setpriority", "setregid", "setresgid",
        "setresuid", "setreuid", "setrlimit", "setsid",
        "setsockopt", "setuid", "shmctl", "shmdt",
        "shmget", "shutdown", "sigaltstack", "signalfd4",
        "socket", "socketpair", "splice", "stat", "statfs",
        "statx", "symlink", "symlinkat", "sync",
        "sync_file_range", "sysinfo", "tee", "tgkill",
        "time", "timer_create", "timer_delete",
        "timer_getoverrun", "timer_gettime", "timer_settime",
        "timerfd_create", "timerfd_gettime", "timerfd_settime",
        "tkill", "truncate", "umask", "uname", "unlink",
        "unlinkat", "utimensat", "utimes",
        "vfork", "vmsplice", "wait4", "waitid", "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
Save this as node-seccomp.json and apply it:
docker run --security-opt seccomp=node-seccomp.json my-app
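In Kubernetes, a seccomp profile is selected through securityContext.seccompProfile on the pod or container: type RuntimeDefault applies the container runtime's default profile, and type Localhost with localhostProfile loads a custom JSON file that must already exist under the kubelet's seccomp directory on every node. PodSecurityPolicy is no longer an option, since it was removed in Kubernetes 1.25. The simplest approach is to use RuntimeDefault and tighten capabilities instead, since custom profiles are harder to distribute and manage across a cluster.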
Putting it all together: the hardened Dockerfile
Here is the complete hardened Dockerfile that combines every technique above:
FROM node:20-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
FROM node:20-slim
# Install tini for proper signal handling (as root, before switching users)
RUN apt-get update && apt-get install -y --no-install-recommends tini && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
# The official image already has a "node" user with uid/gid 1000; rename it
RUN groupmod --new-name appuser node && \
    usermod --login appuser --home /home/appuser --move-home node
WORKDIR /app
COPY --from=builder --chown=appuser:appuser /app/node_modules ./node_modules
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 3000
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["node", "server.js"]
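tini forwards SIGTERM to the node process, but the application still has to act on it if in-flight requests should finish before the container stops. Below is a minimal sketch of such a handler. installShutdown and the 10-second default are assumptions for illustration, and the process object is injectable so the logic can be tested without killing anything.

```javascript
// Close the HTTP server on SIGTERM/SIGINT, then exit; force-exit after a timeout.
function installShutdown(server, { proc = process, timeoutMs = 10000 } = {}) {
  let shuttingDown = false;

  const shutdown = () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Safety net: exit non-zero if open connections never drain.
    const timer = setTimeout(() => proc.exit(1), timeoutMs);
    timer.unref();

    // server.close stops accepting new connections and calls back
    // once the existing ones have ended.
    server.close((err) => {
      clearTimeout(timer);
      proc.exit(err ? 1 : 0);
    });
  };

  proc.on('SIGTERM', shutdown);
  proc.on('SIGINT', shutdown);
  return shutdown;
}

// Typical use in server.js (not executed here):
//   const server = http.createServer(handler).listen(3000);
//   installShutdown(server);

module.exports = { installShutdown };
```

Keep the timeout below the orchestrator's grace period (Docker's stop timeout or the pod's terminationGracePeriodSeconds), otherwise the runtime sends SIGKILL first.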
And the corresponding docker-compose.yml:
version: '3.8'
services:
  app:
    build: .
    user: "1000:1000"
    cap_drop:
      - ALL
    cap_add: []
    read_only: true
    tmpfs:
      - /tmp
      - /app/data
    security_opt:
      - no-new-privileges:true
      - seccomp:node-seccomp.json
    ports:
      - "3000:3000"
Kubernetes Pod Security Context
In Kubernetes, all of these settings go into the Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: node-app
  labels:
    app: node-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: my-app:latest
      ports:
        - containerPort: 3000
      securityContext:
        allowPrivilegeEscalation: false
        privileged: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
          add: []
        runAsNonRoot: true
        runAsUser: 1000
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: data
          mountPath: /app/data
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory
    - name: data
      emptyDir:
        medium: Memory
The allowPrivilegeEscalation: false flag is critical. It sets the no_new_privs bit on the container process, which prevents it and its children from gaining additional privileges through setuid binaries or executables with file capabilities. Combined with runAsNonRoot: true, this means that even if an attacker manages to plant a setuid-root binary, executing it will not elevate the process.
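Both the no_new_privs bit and the seccomp mode are visible to the process itself through /proc/self/status, so the application can report them at startup. This is a sketch for Linux only. parseHardeningStatus is a hypothetical helper, and the mode numbers (0 disabled, 1 strict, 2 filter) are the values the kernel documents for the Seccomp field.

```javascript
// Extract NoNewPrivs and Seccomp from the text of /proc/<pid>/status.
function parseHardeningStatus(statusText) {
  const field = (name) => {
    const match = statusText.match(new RegExp(`^${name}:\\s*(\\S+)$`, 'm'));
    return match ? match[1] : null;
  };
  const seccompModes = { 0: 'disabled', 1: 'strict', 2: 'filter' };
  return {
    noNewPrivs: field('NoNewPrivs') === '1',
    seccomp: seccompModes[field('Seccomp')] ?? 'unknown',
  };
}

// Typical use inside the container (not executed here):
//   const fs = require('node:fs');
//   console.log(parseHardeningStatus(fs.readFileSync('/proc/self/status', 'utf8')));

module.exports = { parseHardeningStatus };
```

With the pod spec above, you would expect noNewPrivs to be true and seccomp to be 'filter'. Anything else suggests the securityContext is not being applied.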
Testing that the hardening is actually enforced
A quick smoke test to verify your container is not running as root:
# Verify the user inside the container
docker run --rm --cap-drop=ALL --read-only my-app id
# Expected output: uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
# Verify you cannot write anywhere outside /tmp
docker run --rm --cap-drop=ALL --read-only my-app touch /test.txt
# Expected output: touch: cannot touch '/test.txt': Read-only file system
# Verify the no_new_privs bit is set, so setuid binaries cannot elevate the process
docker run --rm --security-opt no-new-privileges:true my-app \
  grep NoNewPrivs /proc/self/status
# Expected output: NoNewPrivs: 1
In your CI pipeline, add a step that runs these checks after the image build:
# GitHub Actions
- name: Security smoke test
  run: |
    docker run --rm --read-only --cap-drop=ALL \
      my-app node -e "process.exit(0)"
    echo "Container runs with read-only root FS and dropped capabilities"
What about npm audit and image scanning?
The container hardening in this post is about runtime security: what happens after the container starts. It is complementary to image-level scanning (Trivy, Grype, Snyk) that checks for known CVEs in your base image and dependencies. You need both.
A container that passes every CVE scan can still be exploited if the process runs as root with too many capabilities. And a hardened container running as non-root with read-only filesystem can still be exploited if a dependency has a deserialization vulnerability. Layer the defenses.
A note from Yojji
Container security is easy to defer until after a breach, and nearly impossible to retrofit without breaking something if you do not plan for it from the start. The four layers covered here (non-root user, capability drops, read-only filesystem, seccomp) cost nothing to implement and require no architectural changes if applied during initial setup. Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK, and their teams regularly design and deploy Node.js services on AWS, Azure, and Google Cloud with the kind of security-first container posture that makes platform engineers breathe a little easier during incident calls.