Introduction

Welcome to “Rootless Containers from Scratch”.

This book documents my journey of building a functional, rootless container implementation using nothing but standard Linux utilities and Bash.

Why this book?

While tools like Docker and Podman abstract away the complexity of containers, understanding the underlying mechanisms is crucial for security engineers, system administrators, and curious developers.

In this book, I dissect:

  • How containers are just fancy Linux processes.
  • The security implications of running as root vs. rootless.
  • The specific hurdles encountered when implementing this in a shell script.

Resources

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Core Concepts

Before diving into the code, I needed to understand the three pillars of Linux containers:

  1. Namespaces: Configuring what a process sees.
  2. Cgroups: Configuring what a process can use.
  3. Union Filesystems: Configuring how files are layered.

The Traditional Model (with root)

To appreciate the complexity of rootless containers, it helps to understand how “normal” containers (like standard Docker or containerd) work.

In the traditional model, a daemon runs with root privileges on the host. This simplifies everything:

  1. Creation: The daemon calls unshare() or clone() to create namespaces. Since it is root, the kernel allows this immediately.
  2. Networking: The daemon creates a “veth pair” (virtual ethernet cable). It plugs one end into a host bridge (like docker0) and moves the other end into the container. It modifies iptables for NAT. All of this requires CAP_NET_ADMIN on the host.
  3. Filesystem: The daemon mounts OverlayFS directly. Since it is root, it ignores the ownership of the files on disk. It can read/write anything.
  4. Cgroups: The daemon writes directly to /sys/fs/cgroup/....
  5. User Switching: Finally, if the container is supposed to run as a specific user (like postgres), the daemon performs the namespace setup and then “drops” privileges using setuid() before executing the payload.
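The networking step above (step 2) can be sketched as a handful of privileged commands. This is a hypothetical sketch: the interface names, the `docker0` bridge, and the subnet are illustrative, and every command requires real root on the host.

```shell
# Hypothetical sketch of step 2 (rootful networking). Interface names, the
# docker0 bridge, and the subnet are illustrative; every command needs real root.
setup_rootful_net() {
    ctr_pid=$1
    ip link add veth-host type veth peer name veth-ctr     # create the virtual cable
    ip link set veth-host master docker0 up                # plug one end into the host bridge
    ip link set veth-ctr netns "$ctr_pid"                  # push the other end into the container
    iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE  # NAT for outbound
}
# setup_rootful_net "$CONTAINER_PID"   # would require root (CAP_NET_ADMIN) on the host
```

Each of these commands fails with "Operation not permitted" for a normal user, which is exactly the wall the rest of this book climbs over.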

The Rootless Difference

In a rootless environment, I don’t have a privileged daemon. I am just vagmi (UID 1000).

  • I cannot create a bridge device.
  • I cannot mount OverlayFS on host directories owned by root.
  • I cannot modify Cgroup limits (unless delegated).
  • I cannot write to iptables.

This forces me to use “User Namespaces” as a wedge to gain fake root privileges, and user-space tools (like slirp4netns and fuse-overlayfs) to emulate kernel features I cannot access.

Namespaces: The Seven Kingdoms

Linux Namespaces are the fundamental isolation primitive of containers. They wrap a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.

While there are several namespaces, seven are crucial for containers.

The Big 7

| Namespace | Flag | Isolates | Purpose in Containers |
| --- | --- | --- | --- |
| User | CLONE_NEWUSER | User & Group IDs | Allows a process to run as root inside while being a normal user outside. |
| Mount | CLONE_NEWNS | Mount points | Provides a separate filesystem view (the rootfs). |
| PID | CLONE_NEWPID | Process IDs | Ensures the container’s init process sees itself as PID 1. |
| Network | CLONE_NEWNET | Network stack | Gives the container its own IP, localhost, and routing table. |
| UTS | CLONE_NEWUTS | Hostname | Allows the container to have its own hostname. |
| IPC | CLONE_NEWIPC | IPC resources | Prevents shared memory attacks between host and container. |
| Cgroup | CLONE_NEWCGROUP | Cgroup root | Isolates the view of the cgroup hierarchy (less commonly used directly). |

The User Namespace: The Key to Rootless

The User Namespace is special. It is the only namespace that an unprivileged user can create without sudo.

When creating a new User Namespace:

  1. I become UID 0 (root) inside that namespace.
  2. I gain a full set of Capabilities (like CAP_SYS_ADMIN, CAP_NET_ADMIN) but only within that namespace.

This is the magic trick:

“To perform privileged operations (like mounting filesystems or configuring networks), I don’t need to be real root. I just need to be root inside a User Namespace.”

Example: Becoming Root (Fake)

Try this in your terminal:

# -U: User Namespace
# -r: Map current user to root inside
unshare -U -r whoami

Output:

root

You are now root! (Well, technically, you are root inside that ephemeral namespace).

Manipulating Namespaces

I use two primary syscalls (wrapped by command-line tools) to manage namespaces.

1. unshare (Create)

The unshare command creates new namespaces and then executes a program inside them.

In my rootless script:

unshare --user --fork --map-root-user bash

This creates a new user namespace and runs bash inside it.

2. nsenter (Join)

The nsenter command allows an existing process to “enter” the namespaces of another process. This is how docker exec works.

Every process has a directory in /proc/[pid]/ns/ containing links to its namespaces:

ls -l /proc/$$/ns/
# lrwxrwxrwx 1 vagmi vagmi 0 Jan 12 10:00 net -> 'net:[4026531992]'
# lrwxrwxrwx 1 vagmi vagmi 0 Jan 12 10:00 user -> 'user:[4026531837]'

To join a container, I target its PID:

nsenter --target $CONTAINER_PID --mount --net --pid /bin/sh

The Order of Operations

In my rootless implementation, the order matters immensely:

  1. Create User Namespace: I must do this first to gain the privileges needed to create the others.
  2. Become Root (Inside): I map my host UID to root.
  3. Create Other Namespaces: Now that I am “root”, I can create Mount, Network, and PID namespaces.
  4. Mount Filesystems: I mount /proc, /sys, and my overlayfs root.

If I tried to create a Mount namespace (unshare -m) before the User namespace, the kernel would deny me because I am not real root.
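The ordering can be demonstrated in one line with util-linux `unshare`: the `-U -r` (user namespace plus root mapping) must come first on the flag list conceptually, because it is what grants the `CAP_SYS_ADMIN` that makes `-m` and `-p` succeed. A minimal probe, with a fallback for kernels where unprivileged user namespaces are disabled:

```shell
# -U -r: new user namespace, map me to root; -m -p -f: mount and PID namespaces,
# which only succeed because the user namespace grants CAP_SYS_ADMIN first.
INNER="$(unshare -U -r -m -p -f sh -c 'id -u' 2>/dev/null || echo "unshare unavailable")"
echo "$INNER"   # prints "0" where unprivileged user namespaces are enabled
```

Running `unshare -m` alone, without `-U`, reproduces the failure described above.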

Control Groups (Cgroups)

While Namespaces isolate visibility (what you can see), Control Groups (cgroups) isolate usage (what you can use). They are the resource management layer of the kernel.

Hierarchy and Controllers

Cgroups are organized in a hierarchy (a tree), similar to a filesystem. Directories in this tree represent groups of processes.

Controllers are the subsystems that enforce limits. Common controllers include:

  • memory: Limits RAM usage.
  • cpu: Limits CPU cycles.
  • io: Limits disk I/O bandwidth.
  • pids: Limits the total number of processes.

Cgroup v1 vs. v2

Linux is transitioning from v1 to v2.

  • Cgroup v1: Had a separate hierarchy for each controller (/sys/fs/cgroup/memory, /sys/fs/cgroup/cpu, etc.). This was messy and hard to coordinate.
  • Cgroup v2: Has a Unified Hierarchy (/sys/fs/cgroup). All controllers exist in the same tree. This is what modern container runtimes (and my script) use.

Rootless Delegation

Normally, only root can modify cgroups (e.g., set a memory limit). So how can a rootless user restrict their container’s memory?

The answer is Delegation.

systemd (the init system) automatically creates a cgroup for every user login. It looks like this:

/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/

Systemd can “delegate” ownership of this directory (or a subdirectory) to the user. This means the directory is chowned to the user (vagmi:vagmi).

Once I own the directory, I can create sub-directories (sub-cgroups) and write to their control files.

Implementing Limits

In my script, I create a sub-cgroup for the container:

CGROUP_PATH="/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/alpine-box"
mkdir -p "$CGROUP_PATH"

Then I attach the container process and set its limits by writing to the control files:

  1. Add Process: Move the container’s PID into the cgroup.

    echo $CONTAINER_PID > "$CGROUP_PATH/cgroup.procs"
    
  2. Set Memory Limit: Limit to 500MB.

    echo "500M" > "$CGROUP_PATH/memory.high"
    
  3. Set CPU Limit: Limit to 50% of one core (50000us out of 100000us).

    echo "50000 100000" > "$CGROUP_PATH/cpu.max"
    

The “No Permission” Warning

If you see a warning about permissions when setting cgroups, it’s usually because:

  1. You are using Cgroup v1 (which doesn’t support safe delegation easily).
  2. Systemd hasn’t delegated the controllers to your user session.

You can verify delegation with:

cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers

If this file is empty, your user cannot control cgroups.

Union Filesystems (OverlayFS)

Containers are famous for their speed. A big part of this speed comes from Copy-on-Write (CoW) filesystems. I don’t copy the entire operating system every time I start a container; I just layer a writable sheet on top of the read-only image.

The Layer Cake

OverlayFS is the standard union filesystem used today. It merges multiple directories into one.

The Components

  1. LowerDir (Read-Only): This is the base image (e.g., Alpine Linux contents). It is never modified. Example: /home/vagmi/.local/containers/alpine-box/rootfs (original state)

  2. UpperDir (Read-Write): This is the “diff” layer. Any file I create or modify goes here. Example: /home/vagmi/.local/containers/alpine-box/overlay/upper

  3. WorkDir (Internal): A scratchpad directory required by OverlayFS for atomic operations. Example: /home/vagmi/.local/containers/alpine-box/overlay/work

  4. MergedDir (The View): This is the final mount point the container sees. Example: /home/vagmi/.local/containers/alpine-box/rootfs (mounted state)

How Operations Work

  • Reading a file: The kernel looks in UpperDir. If not found, it looks in LowerDir.
  • Modifying a file: The kernel copies the file from LowerDir to UpperDir (this is the “Copy” in “Copy-on-Write”), and then applies the modification to the copy.
  • Creating a file: It is created directly in UpperDir.
  • Deleting a file: A “Whiteout” file (a special 0/0 character device) is created in UpperDir. This tells the kernel “mask this file from LowerDir”.
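The lookup rules above can be modeled without any real mount at all. The following toy sketch uses two plain directories and reimplements the "upper wins, lower is fallback" read logic in a shell function; the paths are throwaway temp dirs, and this is only a model of the lookup order, not of OverlayFS itself:

```shell
# Toy model of the overlay lookup rules using plain directories (no mount).
ovl=$(mktemp -d)
mkdir -p "$ovl/lower" "$ovl/upper"
echo "from the image"   > "$ovl/lower/os-release"
echo "from the image"   > "$ovl/lower/motd"
echo "container change" > "$ovl/upper/motd"     # a "modified" file: the copy lives in upper

overlay_read() {            # read: prefer UpperDir, fall back to LowerDir
    if   [ -e "$ovl/upper/$1" ]; then cat "$ovl/upper/$1"
    elif [ -e "$ovl/lower/$1" ]; then cat "$ovl/lower/$1"
    else return 1; fi
}

overlay_read os-release     # -> "from the image"   (only exists in lower)
overlay_read motd           # -> "container change" (upper masks lower)
```

A whiteout works the same way in reverse: a marker in upper makes the lookup pretend the lower file does not exist.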

The Rootless Challenge

Mounting an OverlayFS usually requires CAP_SYS_ADMIN in the initial namespace (i.e., root).

mount -t overlay overlay -o lowerdir=...,upperdir=... merged
# Error: Operation not permitted

Solution 1: Kernel Support (User Namespaces)

Since Linux kernel 5.11, unprivileged users can mount OverlayFS if they are inside a User Namespace. This is why my script waits until the User Namespace is created before attempting the mount.

Solution 2: FUSE-OverlayFS

Before kernel 5.11, or for complex UID mapping scenarios, I use FUSE-OverlayFS. This is a userspace implementation.

Why FUSE? The native kernel OverlayFS has limitations with UID mapping.

  • The LowerDir usually contains files owned by root (UID 0).
  • On the host, these are just files owned by vagmi (UID 1000).
  • Inside the container, I want them to look like UID 0.

FUSE-OverlayFS is smarter about this shifting. It can actively translate UIDs on the fly, ensuring that chown works as expected inside the container without actually changing the ownership of the source files on the host disk in a way that breaks things.

This is why, in my final script, using fuse-overlayfs inside the user namespace was the most robust solution.

The Rootless Revolution

Rootless containers refer to the ability for an unprivileged user to create, run, and manage containers. This represents a significant shift in the Linux container ecosystem, moving away from the “root-owned daemon” model popularized by early Docker versions.

Historical Context: BSD Jails

Before Linux had Namespaces, FreeBSD introduced Jails in 2000.

FreeBSD Jails

A Jail is an OS-level virtualization mechanism that partitions a FreeBSD system into several independent mini-systems.

  • Privilege Model: Traditionally, creating a jail required root privileges. The jail itself runs a “virtual root”, but the setup (IP assignment, filesystem creation) was an administrative task.
  • Design: Jails were designed primarily for isolation, not necessarily for unprivileged users to create their own environments at will.

The Linux Divergence: Namespaces

Linux took a different path. Instead of a single “Jail” object, Linux decomposed isolation into granular Namespaces (PID, Network, Mount, etc.). However, for a long time, entering these namespaces required CAP_SYS_ADMIN (root).

The Convergence: User Namespaces

The introduction of the User Namespace (CLONE_NEWUSER) bridged the gap. It allowed a normal user to say: “I want to be root, but only inside this new sandbox I just created.”

This is the key difference between Jails and Rootless Containers:

  • Jails: Admin sets up a sandbox; user plays in it.
  • Rootless Containers: User sets up their own sandbox and plays in it, with zero admin intervention required (once the initial kernel/shadow-utils configuration is done).

Why this matters

  1. Security: If the container runtime is compromised, the attacker only gains the privileges of the user, not root.
  2. Multi-tenancy: Allows multiple users on a shared system (HPC clusters, university servers) to run containers without asking an admin.
  3. Isolation: A breakout from a rootless container lands you as a normal user, not as root.

Why Rootless?

For years, the mantra of container security has been “containers do not contain”. While hypervisors (VMs) rely on hardware-enforced isolation, containers rely on kernel software abstractions. If those abstractions fail, the game is over - especially if the container is running as root.

Rootless containers fundamentally change this equation.

The “Root is Root” Problem

In a traditional container (like standard Docker):

  1. The Docker daemon runs as root.
  2. The container process (often) runs as root.

If a malicious actor breaks out of the container (via a kernel vulnerability or misconfiguration), they are root on the host. They can load kernel modules, wipe the filesystem, install rootkits, or access other users’ data.

The Rootless Defense

In a rootless container:

  1. The container engine runs as a normal user (e.g., UID 1000).
  2. The container process runs as a mapped UID.

If an attacker breaks out of a rootless container, they find themselves… as vagmi (UID 1000).

  • They cannot modify system files (/etc, /boot).
  • They cannot install kernel modules.
  • They cannot inspect other users’ processes.

They are contained not just by the container boundary, but by the standard Unix permissions of the host user.

The Rise of Coding Agents

The importance of rootless containers has exploded with the advent of LLM-based Coding Agents.

Agents like Devin, OpenCode, or GitHub Copilot Workspace are designed to:

  1. Take a user prompt.
  2. Write code.
  3. Execute that code (to test it, run builds, etc.).

The Agent Security Paradox

I want agents to be powerful. I want them to install packages (npm install), run servers (python server.py), and delete temporary files. However, I am effectively allowing an AI (which can hallucinate or be prompt-injected) to execute arbitrary code on my infrastructure.

Running this code on the host machine is reckless (rm -rf / is one hallucination away). Running this code in a standard root-privileged container is risky (container escapes are rare but real).

Rootless Containers: The Perfect Sandbox for Agents

Rootless containers offer the ideal balance for coding agents:

  1. Safety by Design: Even if the agent executes malicious code that escapes the container, the blast radius is limited to the user’s session. It cannot compromise the underlying node or other tenants.
  2. No “Sudo” Friction: Agents often need to run apk add or apt-get install. In a rootless container, the agent is root inside the namespace. It can install packages freely into its own overlay filesystem without needing to ask the human user for a password or having actual root access to the host.
  3. Ephemeral Environments: Rootless containers are lightweight. An agent can spin up a container, trash the filesystem with dependencies, and destroy it cleanly without leaving residue on the user’s machine.

Summary

| Feature | Rootful Container | Rootless Container |
| --- | --- | --- |
| Daemon Privilege | Root | User (UID 1000) |
| Breakout Result | System Compromise | User Compromise |
| Installation | Requires sudo | No sudo required |
| Ideal Use Case | Infrastructure services | Agents, CI/CD, Desktop Apps |

For the future of AI-driven development, where code execution is autonomous and frequent, rootless containers are not just a feature - they are a requirement.

User Namespaces & ID Mapping

This is the engine room of rootless containers. It is the mechanism that allows a process to feel like root inside a container while remaining a standard, unprivileged user on the host.

The Illusion of Root

Inside the container, I want files to look like they are owned by root (UID 0), bin (UID 1), daemon (UID 2), etc. Outside the container, on the shared host system, I cannot actually let a user own files as root. That would be a massive security hole.

User Namespaces solve this by creating a translation layer (a map) between the UIDs inside the container and the UIDs on the host.

The UID Map (/proc/self/uid_map)

When a process is in a User Namespace, it has a file /proc/self/uid_map. This file contains three numbers per line: [Inside-ID] [Outside-ID] [Length]

A Simple Map (Single User)

If I just map my own user:

0 1000 1
  • Inside ID 0 (Container Root) maps to Outside ID 1000 (Vagmi).
  • Length is 1.

This works for basic things. whoami inside says “root”. But if I try to chown a file to user nobody (UID 65534), it fails. Why? Because I only mapped ONE id. I don’t have permission to be UID 65534 on the host.
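You can watch this failure happen. The probe below enters a single-mapping namespace, prints its uid_map, and attempts the doomed chown; the temp filename is illustrative, and there is a fallback for systems without unprivileged user namespaces:

```shell
# Inside `unshare -U -r` only one UID is mapped, so chown to 65534 must fail.
RESULT="$(unshare -U -r sh -c '
    cat /proc/self/uid_map          # shows the single-entry map
    touch /tmp/uidmap-probe.$$
    chown 65534 /tmp/uidmap-probe.$$ 2>&1 || echo "chown denied, as expected"
    rm -f /tmp/uidmap-probe.$$
' 2>/dev/null || echo "user namespaces unavailable")"
echo "$RESULT"
```

The chown fails because UID 65534 simply does not exist inside this namespace; there is no host UID it could translate to.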

The Subuid/Subgid Mechanism

To run a full Linux distro, a container needs to own thousands of UIDs (for postgres, nginx, www-data, etc.).

Since a normal user (UID 1000) only owns one UID, system administrators grant them a range of Subordinate UIDs (subuids).

These are defined in /etc/subuid:

vagmi:100000:65536

This grants user vagmi ownership of 65,536 UIDs starting from ID 100,000.

The Complex Map (Full Distro)

To support a full container, I create a map with two ranges:

  1. Map Root: Container UID 0 -> Host UID 1000.
  2. Map Users: Container UIDs 1..65536 -> Host UIDs 100000..165535.

The command newuidmap (a setuid helper binary) writes this to the kernel:

# newuidmap <pid> 0 1000 1 1 100000 65536

Resulting uid_map:

0       1000    1
1       100000  65536

Visualizing the Mapping

| Container Reality (Inside) | Host Reality (Outside) |
| --- | --- |
| root (UID 0) | vagmi (UID 1000) |
| bin (UID 1) | 100000 |
| daemon (UID 2) | 100001 |
| nobody (UID 65534) | 165533 |
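The translation is plain arithmetic. This helper (my own illustration, not part of the script) reproduces the table for the two-range map used above (0 → 1000, length 1; 1 → 100000, length 65536):

```shell
# Translate a container UID to a host UID under the two-range map above.
ctr_to_host() {
    uid=$1
    if [ "$uid" -eq 0 ]; then
        echo 1000                          # root maps to vagmi
    elif [ "$uid" -ge 1 ] && [ "$uid" -le 65536 ]; then
        echo $(( 100000 + uid - 1 ))       # offset into the subuid range
    else
        echo "unmapped" >&2; return 1
    fi
}

ctr_to_host 0       # -> 1000   (vagmi)
ctr_to_host 2       # -> 100001 (daemon)
ctr_to_host 65534   # -> 165533 (nobody)
```

Any UID outside both ranges is simply unmapped, which is exactly why the single-entry map in the earlier example could not chown to nobody.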

The newuidmap Security Gate

You might ask: “Why can’t I just write any mapping I want?”

The kernel allows a user to write to their own uid_map, BUT they can only map UIDs they actually own (which is usually just their own current UID).

To map the range 100,000+, I need privilege. This is why newuidmap has the SUID bit set (owned by root, executable by users).

  1. newuidmap starts as root.
  2. It checks /etc/subuid.
  3. Does vagmi own the range 100,000 to 165,535?
  4. If yes, it writes the privileged mapping to /proc/[pid]/uid_map.

This mechanism delegates the ability to isolate specific ID ranges without giving full root access.

Building the Container: The Assembly Line

If you have used Docker, you are used to typing docker run alpine and having everything - downloading, unpacking, networking, mounting - happen instantly.

In this chapter, I am going to build that “run” command from scratch, using a bash script. I am stepping away from the “magic” and looking at the raw assembly line.

The Process

Building a rootless container involves four distinct phases:

  1. Preparation: Ensuring the host Linux system allows me (a regular user) to perform the necessary magic.
  2. Acquisition: Getting the Operating System files (the “Image”) and unpacking them onto my disk.
  3. Storage Assembly: Creating the “copy-on-write” filesystem layer so the container can write files without corrupting the original image.
  4. Wiring: Connecting the isolated container to the outside world (the internet).

The Script

I will reference my rootless-alpine.sh script throughout this section. Think of this script as my custom-built container engine - a tiny, 400-line version of Docker.

Prerequisites & Environment

Before I can build anything, I need to ensure my environment is compatible. Since we are developers, let’s compare this to setting up a development environment.

1. The Kernel Features (/proc/self/ns/user)

Just as a Python script might check sys.version to ensure it’s running on Python 3.10+, my container script checks for User Namespace support.

I check if /proc/self/ns/user exists.

  • What it is: This file represents the namespace of the current process.
  • Why I need it: If the kernel doesn’t support User Namespaces, I cannot create my “fake root” environment. The game is over before it begins.

2. The Permission Helpers (newuidmap & newgidmap)

In a typical web app, you might need an API Key to access a third-party service. In Linux, these two binaries are our “API Keys” to the kernel’s privileged ID mapping features.

  • The Problem: A normal user cannot map arbitrary UIDs (like mapping host UID 100000 to container UID 1). Only root can do that.
  • The Solution: These two programs have the SUID bit set. This means when you run them, they execute with root privileges, even if you are just a normal user.
  • The Check: My script verifies they are installed and have the SUID bit (chmod u+s).

3. The Allowance (/etc/subuid)

Think of /etc/subuid as an Access Control List (ACL) or a permissions.json file.

Just because I have the newuidmap tool doesn’t mean I can use any ID. The system administrator must explicitly grant me a range of IDs to play with.

A line like vagmi:100000:65536 says:

“User ‘vagmi’ is allowed to use 65,536 IDs starting from 100,000.”

If this file is missing or doesn’t contain my user, my script cannot create the mapping, and the container will fail to start.
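The three prerequisites above translate into a small preflight function. This is a sketch of the kind of checks my script performs; the exact messages are mine, and the paths come straight from the text:

```shell
# Preflight: kernel support, setuid helpers, and a subuid allowance.
check_prereqs() {
    [ -e /proc/self/ns/user ] \
        && echo "ok: kernel supports user namespaces" \
        || echo "missing: /proc/self/ns/user"
    for bin in newuidmap newgidmap; do
        p=$(command -v "$bin") || { echo "missing: $bin"; continue; }
        [ -u "$p" ] && echo "ok: $bin is setuid" || echo "warn: $bin lacks u+s"
    done
    grep -q "^$(id -un):" /etc/subuid 2>/dev/null \
        && echo "ok: subuid range granted to $(id -un)" \
        || echo "missing: no /etc/subuid entry for $(id -un)"
}
check_prereqs
```

If any line reports "missing", the container will fail to start later with a far less helpful error, so it pays to check up front.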

Obtaining the Image

In the Docker world, docker pull does a lot of work. I am going to break it down using lower-level “OCI” (Open Container Initiative) tools: Skopeo and Umoci.

Step 1: Download (Skopeo)

An “Image” is really just a tarball (a compressed archive) of a filesystem, plus a JSON file describing metadata (like “run /bin/sh by default”).

I use skopeo copy to fetch this from a registry (like Docker Hub) to my local disk.

skopeo copy docker://alpine:latest oci:local-image:latest

This saves the blobs to a directory. At this point, it’s just a pile of data - you can’t “cd” into it yet.

Step 2: Unpack (Umoci)

I need to turn that blob of data into a real folder with real files (/bin, /etc, /home). This is called “unpacking” or “extracting” the rootfs.

I use umoci unpack.
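A sketch of the unpack step, continuing from the `skopeo copy` above. The bundle directory name `alpine-bundle` is my own choice, and the snippet guards against umoci not being installed:

```shell
# Unpack the OCI layout produced by skopeo into a browsable bundle.
HAVE_UMOCI="$(command -v umoci || echo "not installed")"
if [ "$HAVE_UMOCI" != "not installed" ]; then
    # --rootless: extract as the current user, recording intended ownership separately
    umoci unpack --rootless --image local-image:latest alpine-bundle
    ls alpine-bundle/rootfs     # a real filesystem tree: bin, etc, home, lib, ...
else
    echo "umoci $HAVE_UMOCI; skipping unpack"
fi
```

After this, `alpine-bundle/rootfs` is a directory you can actually `cd` into.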

The Ownership Problem

This is the trickiest part of rootless images.

  1. The Source: The Alpine image contains files owned by root (UID 0).
  2. The Destination: I am downloading this to /home/vagmi/my-container.
  3. The Conflict: User vagmi cannot create files owned by root. Only root can do that.

How Umoci handles it: If I run umoci unpack --rootless, it extracts the files owned by me (UID 1000), but it records the intended ownership in a separate metadata file.

My Fallback (Tar): If I don’t have umoci, I use tar. But tar will just make vagmi own everything. This causes issues later because programs like sudo or apk expect to be owned by root.

  • Fix: I rely on the User Namespace mapping later to “shift” these UIDs so they appear correct inside the container.

The Filesystem Challenge: OverlayFS

I have my unpacked image (the rootfs). Now, I want to run a container. But wait! If I run the container directly in that directory, any changes I make (installing packages, creating files) will permanently modify the image. I don’t want that. I want a fresh start every time.

The Solution: OverlayFS

I use OverlayFS to create a “sandwich”:

  1. LowerDir (The Bread): My read-only rootfs (the Alpine image).
  2. UpperDir (The Filling): An empty directory where my changes go.
  3. MergedDir (The Sandwich): The magical view that combines them.

The Rootless Twist

In a standard system, I would run:

mount -t overlay ...

But mount requires root.

The “Permission Denied” Trap

If I try to run mount (or even fuse-overlayfs) from my regular user shell, it often fails. Why? Because the files in LowerDir are owned by vagmi (UID 1000). But inside the container, I want them to look like UID 0.

If I mount it outside, the filesystem sees “Owner: 1000”. When I enter the container (where I am fake-root), I might not have the right permissions to modify them, or the ownership looks wrong.

The Fix: Mount INSIDE the Namespace

This was the key breakthrough in my script.

I do not setup the mount before creating the container.

  1. I create the User Namespace first.
  2. I enter the namespace.
  3. Then I run fuse-overlayfs.

Because I am inside the namespace, fuse-overlayfs sees the UID mapping. It knows:

  • “Oh, the backing file on disk is owned by 1000.”
  • “But inside here, 1000 is mapped to 0.”
  • “So I will present this file to the user as owned by root.”

This dynamic translation is what allows commands like apk install (which needs to chown files to root) to succeed.
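Concretely, the in-namespace mount looks like this. The paths follow the chapter's layout, but the `merged` mountpoint name is illustrative, and this is a sketch to be run inside the user namespace, never from the host shell:

```shell
# Sketch of the in-namespace mount. Must run INSIDE the user namespace
# (e.g. under `unshare -U -r -m`); paths follow the chapter's layout.
BASE=/home/vagmi/.local/containers/alpine-box
mount_overlay() {
    fuse-overlayfs \
        -o "lowerdir=$BASE/rootfs,upperdir=$BASE/overlay/upper,workdir=$BASE/overlay/work" \
        "$BASE/merged"
}
# unshare -U -r -m sh -c '... mount_overlay ...'   # illustrative invocation
```

Run from the host instead, the exact same command fails with "Operation not permitted", which is the trap described above.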

Networking User Space: Slirp4netns

Networking is usually the realm of the kernel. When you plug in an ethernet cable, the kernel handles the electrical signals and IP packets. In a container, I usually use “veth pairs” (Virtual Ethernet), which act like a virtual cable. But creating veth pairs requires - you guessed it - root.

The Problem

A rootless user cannot modify the system routing table. I cannot assign an IP address to an interface. I cannot manipulate iptables.

The Solution: User-Space NAT (slirp4netns)

I use a tool called slirp4netns. The name combines “slirp” (a venerable user-mode networking stack) with “netns” (network namespace). Think of it as a router written in software.

How it works

  1. The TAP Device: The container gets a generic network interface called a “TAP” device. To the container, this looks like a real ethernet card.

  2. The Process: slirp4netns runs on the host as a normal process. It holds onto the other end of that TAP device.

  3. The Translation (Packet -> Socket):

    • Outgoing: When the container sends a TCP packet to google.com (IP 1.2.3.4), slirp4netns reads it. It doesn’t put it on the wire. Instead, it effectively calls socket.connect('1.2.3.4') on the host machine, just like a Python script would.
    • Incoming: When the host receives data from Google, slirp4netns wraps it back up into a TCP packet and injects it into the TAP device for the container to see.

Why this is cool

This means the container’s traffic looks, to the host kernel, exactly like regular traffic from the user vagmi.

  • It respects your host firewall.
  • It works over VPNs.
  • It doesn’t require any special permissions.

It is slower than raw kernel networking (because of the overhead of reconstructing packets), but it is completely secure and unprivileged.
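Wiring this up is a single long-running process attached to the container's PID. A sketch, where `$CONTAINER_PID`, the tap name `tap0`, and the pidfile are illustrative:

```shell
# Attach slirp4netns to a running container's network namespace.
start_slirp() {
    pid=$1
    # --configure: bring up the tap device and assign the default 10.0.2.x addressing
    # --disable-host-loopback: block the container from reaching 127.0.0.1 on the host
    slirp4netns --configure --mtu=65520 --disable-host-loopback "$pid" tap0 &
    echo $! > slirp.pid     # remember the router process so cleanup can kill it
}
# start_slirp "$CONTAINER_PID"
```

When the container exits, killing the slirp4netns process tears the whole "network" down; nothing on the host was ever modified.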

Lessons Learned

Building a container runtime from scratch is a humbling experience. Even with access to extensive documentation and powerful LLMs, the intricacies of Linux namespaces often lead to confusing dead ends.

What works in theory (“just map the UIDs!”) often fails in practice due to the nuanced interplay between the kernel, the filesystem, and process hierarchy.

This chapter documents the specific “gotchas” I encountered. These aren’t just bug fixes; they are lessons in how Linux actually works under the hood. I learned that context is everything: where you run a command (host vs. namespace) matters just as much as what command you run.

The OverlayFS Mount Trap

This was the most persistent hurdle in my journey.

The Symptom

I tried to mount the filesystem. I had fuse-overlayfs installed. I had my directories ready. Yet, every time I ran the script:

fusermount3: mount failed: Operation not permitted

Or, if I tried to use mount -t overlay, it simply failed silently or with “Permission Denied”.

The Mental Model Failure

I assumed that because I created the User Namespace (which gives me CAP_SYS_ADMIN), I could just mount the filesystem from my script’s main process context.

I forgot that capabilities are scoped to the namespace.

The Reality

  1. My script (running as vagmi) created a User Namespace.
  2. But the script process itself was still running in the Host Namespace.
  3. When I ran fuse-overlayfs from the script, it was trying to execute on the host.
  4. The host kernel looked at vagmi (UID 1000) and said: “You are not root. You cannot mount filesystems.”

The Solution: nsenter is the Key

I had to fundamentally restructure the script. Instead of preparing the filesystem before starting the container process, I had to move the mounting logic inside the container creation flow.

I used nsenter to inject the fuse-overlayfs command into the User Namespace I just created.

# Wrong (Host context)
fuse-overlayfs -o ... "$ROOTFS"

# Right (Namespace context)
nsenter --user -t $USERNS_PID fuse-overlayfs -o ... "$ROOTFS"

Once inside the namespace:

  1. I am effectively root.
  2. I have the necessary capabilities.
  3. The UID mapping is active, allowing fuse-overlayfs to translate the file ownership correctly (making host files owned by UID 1000 appear as UID 0).

Process Management & The Grandchild Problem

Shell scripting makes it easy to spawn background processes (&), but it makes tracking them surprisingly difficult.

The Symptom

The container would start, print “Ready”, and then immediately execute the cleanup function and die.

The Code

I had a structure like this:

unshare --user --fork bash -c "
    unshare --mount --net --fork bash -c 'sleep infinity' &
    echo \$! > container.pid
" &
# ...
wait $(cat container.pid)

The “Wait” Limitation

The wait command in Bash has a strict rule: You can only wait for your direct children.

In my architecture:

  1. Script (PID 100) spawns ->
  2. User NS unshare (PID 101) spawns ->
  3. Mount NS unshare (PID 102) spawns ->
  4. Container Payload (PID 103)

I was trying to wait 103 from PID 100. Bash immediately complains “not a child of this shell” and wait returns right away instead of blocking. The script thinks “Oh, the container finished!”, runs cleanup(), and kills everything.

The Fix: Tracking the Parent

I realized I couldn’t wait on the container process directly. I had to wait on the User Namespace process (PID 101), which is the direct child of my script.

I introduced parent.pid to track this intermediate process.

UNSHARE_PID=$!
echo $UNSHARE_PID > parent.pid
# ...
wait $(cat parent.pid)

This ensures the script stays alive as long as the namespace hierarchy exists.

Permission Conflicts & resolv.conf

One of the final errors I hit was deceptively simple:

/etc/resolv.conf: Permission denied

The Context

I needed to configure DNS for the container. The standard way is to write nameserver 10.0.2.3 into /etc/resolv.conf inside the container’s rootfs.

The Attempt (Failed)

My script tried to do this from the host:

echo "nameserver ..." > $ROOTFS/etc/resolv.conf

The Permission Paradox

  1. On Disk: The file is physically owned by vagmi (UID 1000).
  2. The Error: So why can’t vagmi write to it?

The issue was OverlayFS combined with User Namespaces. Once the OverlayFS is mounted, it presents a unified view. If fuse-overlayfs is doing its job correctly, it tells the host kernel: “This file is owned by UID 100000 (the mapped root)”.

When my script (running as UID 1000) tries to write to it through the mount point, the permission check fails because UID 1000 != UID 100000.

The Solution: Write from Within

I moved the configuration step inside the start_rootless_container function, which runs inside the namespace.

Inside the namespace:

  1. I am UID 0 (Root).
  2. The file appears to be owned by UID 0.
  3. The write succeeds.
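The fix can be sketched as a helper that performs the write from inside the namespace. `$CONTAINER_PID`, the `$ROOTFS` mountpoint argument, and the nameserver address (slirp4netns's conventional 10.0.2.3) follow the chapter's setup but are illustrative here:

```shell
# Write DNS config from INSIDE the container's user+mount namespaces,
# where the overlay presents the file as owned by UID 0 (us).
write_resolv_conf() {
    # $1: PID of the namespaced process; $2: where the overlay rootfs is mounted there
    nsenter --user --mount -t "$1" \
        sh -c "echo 'nameserver 10.0.2.3' > '$2/etc/resolv.conf'"
}
# write_resolv_conf "$CONTAINER_PID" "$ROOTFS"
```

The same `echo` that failed from the host succeeds here, because the permission check now compares mapped-root against mapped-root.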

This reinforces the core lesson: Rootless containers are a separate world. You cannot simply reach in from the outside and change things; you must enter the world (the namespace) to make changes.

Conclusion

I have successfully built a working rootless container runtime in just over 400 lines of Bash.

It works. It downloads an image, creates secure namespaces, maps UIDs, sets up a writable filesystem, and establishes networking. It allows a regular user to run a full Alpine Linux environment safely.

Reflections: Clunky but Doable

Comparing this experience to FreeBSD Jails (which have existed since 2000) is illuminating.

  • FreeBSD Jails: Feel like a cohesive, designed feature of the OS. You define a jail in a config file, and the OS handles it. It feels “solid”.
  • Linux Rootless Containers: Feel like a “Rube Goldberg machine”. I am gluing together disparate features - Namespaces, Cgroups, OverlayFS, Slirp4netns, Setuid helpers - using shell scripts and hope.

The Linux approach is composable but clunky. It requires a deep understanding of how six or seven different subsystems interact. A single misstep in the order of operations (like creating the mount namespace before the user namespace) causes the whole house of cards to collapse with opaque “Permission Denied” errors.

But… It’s Magic

Despite the complexity, the result is magical.

To be able to become root - to install packages, modify network routes, and mount filesystems - without actually having sudo access to the host machine is a triumph of kernel engineering.

It opens the door for:

  • Secure CI/CD pipelines that don’t need privileged runners.
  • AI Agents that can write and execute code without endangering the user.
  • Desktop App Sandboxing (like Flatpak) that works for everyone.

The complexity I faced is exactly why tools like Docker, Podman, and containerd exist: to wrap this complexity in a friendly API. But now, having built it myself, I understand exactly what those tools are doing for me.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.