Chapter 18: The Jailer

The hardware virtualization boundary is real, but it is not the outermost wall. Once a process holds an open file descriptor on /dev/kvm, the kernel imposes no further restriction on which KVM ioctls that process can issue. A guest that breaks out of its vCPU execution environment and controls the VMM process inherits the entire ioctl surface. The question, then, is what limits the blast radius of a compromised VMM. The answer for Firecracker is jailer — a small privileged binary that wraps every Firecracker instance in the same Linux isolation primitives that contain an OCI container: a pivot_root-based filesystem jail, mount and PID namespaces, cgroup resource limits, rlimit caps, and a uid/gid privilege drop. The jailer is not itself a VMM; it never touches /dev/kvm. It exists to impose a second boundary around the process that does.

Why the Hardware Boundary Is Not Enough

KVM has had exploitable bugs. CVE-2021-29657, fixed in Linux 5.11.12, is a use-after-free in the SVM nested-virtualization path (arch/x86/kvm/svm/nested.c, function nested_svm_vmrun()) allowing an AMD KVM guest to bypass MSR access controls on the host (the VMCB12 double-fetch in nested_svm_vmrun() is one proposed mechanical account; the NVD record classifies it as CWE-416). The virtio 1.2 spec designates the split virtqueue descriptor table (Section 2.7.5) as guest-writable memory; device backends that trust descriptor contents without bounds-checking have produced heap overflows reachable from the guest — CVE-2019-14835, in the Linux kernel's vhost subsystem (vhost/vhost.c), exploited get_indirect() and was fixed in Linux 5.3.

Neither of these is a flaw in Firecracker. But they are evidence of a pattern: every abstraction between guest code and host resources has had bugs, and the hardware boundary is only one layer. Firecracker's docs/design.md states the position plainly: "all vCPU threads are considered to be running malicious code as soon as they have been started; these malicious threads need to be contained." The jailer is the outermost layer of that containment. It does not prevent the bugs from existing; it limits what a successful exploit can reach.

One Binary, One Instance, One Jail

Starting as root is a cost. jailer is a single Rust binary, distributed alongside firecracker in the same release tarball, whose entire purpose is to pay that cost once and then eliminate the privilege. It performs every privileged operation required to build and enter a jail, drops to a caller-specified uid and gid, and replaces itself with firecracker via execve. After the exec, the firecracker image has never run as root, has never seen a file path outside the jail, and is constrained by cgroup limits that were set before it existed. Every Firecracker instance runs inside its own jail. Each jail is an independent directory tree, an independent set of cgroup leaves, and a distinct uid/gid pair, so a process that escapes one instance cannot reach another instance's files.

The minimum required arguments are:

Argument             Meaning
--id <id>            Unique identifier for this microVM instance
--exec-file <path>   Path to the firecracker binary on the host
--uid <uid>          uid the jailed process will run as
--gid <gid>          gid the jailed process will run as

Additional flags wire in a network namespace (--netns), set cgroup limits (--cgroup, --cgroup-version, --parent-cgroup), add rlimit overrides (--resource-limit), daemonize the process (--daemonize), and place the process in a fresh PID namespace (--new-pid-ns). The chroot base directory defaults to /srv/jailer. Given --exec-file /usr/bin/firecracker --id i1, the jail root is /srv/jailer/firecracker/i1/root — constructed as <chroot-base-dir>/<exec-file-name>/<id>/root. That path is the filesystem root firecracker sees when it starts.
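The path rule above can be written down as a few lines of Rust. This is a minimal sketch, not the jailer's own code: the function name jail_root is hypothetical, and it assumes the exec-file path ends in a normal file-name component.

```rust
use std::path::{Path, PathBuf};

/// Builds <chroot-base-dir>/<exec-file-name>/<id>/root.
/// Returns None when `exec_file` has no final file-name component
/// (for example, a path ending in "..").
fn jail_root(chroot_base: &Path, exec_file: &Path, id: &str) -> Option<PathBuf> {
    // Only the last component of the exec-file path is used.
    let exec_name = exec_file.file_name()?;
    Some(chroot_base.join(exec_name).join(id).join("root"))
}
```

Called with /srv/jailer, /usr/bin/firecracker, and i1, the function yields /srv/jailer/firecracker/i1/root, matching the example in the text.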

Process Sanitization

Before parsing arguments, sanitize_process() in main.rs closes every file descriptor from 3 upward by calling close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE) — Linux syscall SYS_close_range, available since kernel 5.9. It then removes every environment variable from the process environment via clean_env_vars(). Both operations happen unconditionally, before anything else: the goal is to ensure that whatever the caller passed to jailer — inherited fds, proxy-authorization tokens in the environment, sockets, anything — does not survive into the jail. The jailer's own subsequent work starts from a clean slate.

The Ordered Sequence

The bulk of the work lives in Env::run() in src/jailer/src/env.rs. The ordering is not arbitrary; each step has a dependency on what comes before it, and the sequence must be understood as a unit.

flowchart TD
    A["sanitize_process()"] --> B["copy exec-file into chroot, fchown to target uid/gid"]
    B --> C["join network namespace (setns CLONE_NEWNET)"]
    C --> D["setrlimit for each --resource-limit"]
    D --> E["cgroup setup: create hierarchy, write PID"]
    E --> F["open /dev/null (if --daemonize)"]
    F --> F2["[aarch64 only] copy /sys/bus/cpu cache info into chroot"]
    F2 --> G["pivot_root sequence (unshare, bind-mount, pivot, umount)"]
    G --> H["mknod device nodes inside jail"]
    H --> I["daemonize (if --daemonize)"]
    I --> J["clone CLONE_NEWPID (if --new-pid-ns)"]
    J --> K["setuid / setgid / execve firecracker"]

The network namespace join happens before cgroup setup and before the filesystem root changes, because --netns is a host-side path that must be opened and passed to setns(2) while the jailer can still see the host filesystem. Cgroup setup must complete before the chroot step for the same reason: the cgroup pseudo-filesystem hierarchy under /sys/fs/cgroup is not visible inside the jail, so any write to a cgroup control file must happen while the jailer can still see the host filesystem. Opening /dev/null before the chroot follows the same logic: daemonize needs to redirect stdio, and /dev/null is a host path at that point.
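The dependency rule in this paragraph (anything that touches a host path must run before the root changes) can be expressed as a small check. The sketch below is illustrative only: the Step enum and first_misordered are hypothetical names, and the enum covers just a subset of the real sequence.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Step {
    JoinNetns,
    SetRlimits,
    CgroupSetup,
    OpenDevNull,
    PivotRoot,
    MknodDevices,
    Exec,
}

impl Step {
    /// True for steps that open or write a host-side path.
    fn needs_host_fs(self) -> bool {
        matches!(self, Step::JoinNetns | Step::CgroupSetup | Step::OpenDevNull)
    }
}

/// Returns the first host-dependent step scheduled after PivotRoot, if any.
/// Returns None when the order is valid or when PivotRoot is absent.
fn first_misordered(steps: &[Step]) -> Option<Step> {
    let pivot = steps.iter().position(|s| *s == Step::PivotRoot)?;
    steps[pivot + 1..].iter().copied().find(|s| s.needs_host_fs())
}
```

A plan that schedules CgroupSetup after PivotRoot is reported as misordered; the order shown in the flowchart passes.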

The Network Namespace Handoff

The jailer does not create a network namespace. It joins an existing one. The caller — typically a container runtime or a higher-level orchestrator — is responsible for creating the tap device inside the network namespace before invoking jailer. When --netns <path> is supplied, the jailer opens the namespace file with O_CLOEXEC, calls setns(fd, CLONE_NEWNET), and closes the file descriptor. Everything after this point — cgroup writes, the chroot, the exec — inherits the joined namespace. firecracker starts inside the network namespace and has no mechanism to escape it.

The docs/jailer.md documentation places the network namespace join after the chroot sequence. The source code in env.rs places it before cgroup setup and before the chroot. The source is authoritative; this is noted explicitly because a reader who checks the documentation will see a different order.

Resource Limits

Two rlimit resources are supported via --resource-limit <name>=<value>:

Name      rlimit constant   Default
no-file   RLIMIT_NOFILE     2048
fsize     RLIMIT_FSIZE      not set

For each limit, setrlimit(2) is called with both rlim_cur (the soft limit) and rlim_max (the hard limit) set to the same value. This is the critical detail: a process can normally raise its own soft limit up to the hard limit at any time, without privilege, whereas raising the hard limit requires CAP_SYS_RESOURCE. Setting both to the same value removes that headroom, and because Firecracker runs unprivileged after the exec, the cap cannot be raised for the lifetime of the process. The default of 2048 open file descriptors is intentionally conservative: a Firecracker instance with two vCPUs and a handful of virtio devices holds well under 100 open fds.
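To make the soft-equals-hard rule concrete, here is a minimal sketch of parsing one --resource-limit argument into the pair of values that would be handed to setrlimit(2). The types Resource and Limit and the function parse_resource_limit are hypothetical; the real jailer's parsing and error types may differ.

```rust
#[derive(Debug, PartialEq)]
enum Resource {
    NoFile, // maps to RLIMIT_NOFILE
    Fsize,  // maps to RLIMIT_FSIZE
}

#[derive(Debug, PartialEq)]
struct Limit {
    resource: Resource,
    soft: u64, // rlim_cur
    hard: u64, // rlim_max
}

/// Parses "<name>=<value>" and sets soft and hard limits to the same value.
fn parse_resource_limit(arg: &str) -> Result<Limit, String> {
    let (name, value) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected <name>=<value>, got {arg:?}"))?;
    let resource = match name {
        "no-file" => Resource::NoFile,
        "fsize" => Resource::Fsize,
        other => return Err(format!("unsupported resource {other:?}")),
    };
    let parsed: u64 = value
        .parse()
        .map_err(|e| format!("invalid value {value:?}: {e}"))?;
    // Equal soft and hard limits leave no headroom to raise the soft limit later.
    Ok(Limit { resource, soft: parsed, hard: parsed })
}
```

Passing no-file=1024 produces a limit whose soft and hard values are both 1024; an unknown name or a missing value is rejected.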

Cgroup Resource Control

Cgroup setup runs before the chroot and writes the jailer's own PID into the cgroup, so every subsequent operation — including the execve into Firecracker — inherits the cgroup membership. The controller discovery process parses /proc/mounts at runtime using a regex that distinguishes v1 and v2 mounts by the presence of a 2 suffix on the filesystem type:

^([a-z2]*)[[:space:]](?P<dir>.*)[[:space:]]cgroup(?P<ver>2?)[[:space:]](?P<options>.*)[[:space:]]0[[:space:]]0$

The (?P<ver>2?) group is empty for cgroup v1 and "2" for cgroup v2. The implementation is in src/jailer/src/cgroup.rs.
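The same classification can be sketched without a regex engine by splitting a /proc/mounts line into its six whitespace-separated fields. This is a stand-in for the regex above, not a copy of cgroup.rs: the names CgroupMount and parse_cgroup_mount are hypothetical, and unlike the regex the sketch does not constrain the characters of the first field.

```rust
#[derive(Debug, PartialEq)]
struct CgroupMount<'a> {
    version: u8,       // 1 for "cgroup", 2 for "cgroup2"
    dir: &'a str,      // mount point
    options: &'a str,  // comma-separated mount options
}

/// Parses one /proc/mounts line; returns None for non-cgroup mounts.
fn parse_cgroup_mount(line: &str) -> Option<CgroupMount<'_>> {
    // Fields: device, mount point, fs type, options, dump, pass.
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() != 6 || fields[4] != "0" || fields[5] != "0" {
        return None;
    }
    let version = match fields[2] {
        "cgroup" => 1,
        "cgroup2" => 2,
        _ => return None,
    };
    Some(CgroupMount { version, dir: fields[1], options: fields[3] })
}
```

Splitting on whitespace is safe here because /proc/mounts escapes spaces inside paths as \040.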

cgroup v1

With --cgroup-version=1 (the default), each controller has its own hierarchy. The jailer creates a directory <controller-mount>/<parent-cgroup>/<id>/ (where --parent-cgroup defaults to the exec-file name, firecracker), writes the requested values into the controller-specific files, and attaches the process by writing its PID to <cgroup-path>/tasks. Under v1 the tasks file takes thread IDs and moves only the thread whose ID was written; moving a whole thread group in one write is what cgroup.procs is for. Writing a PID to tasks is therefore sufficient when the process has a single thread at that point, since threads created later inherit their creator's cgroup.

The cpuset controller requires special handling. A v1 cpuset cgroup whose cpuset.mems or cpuset.cpus file is empty rejects any attempt to attach a process, and newly created cgroups start with both files empty by default. The jailer calls inherit_from_parent(), which walks up the cgroup hierarchy until it finds a non-empty value, then propagates it down. Without this, a freshly created cgroup under a default install would refuse to accept any process.
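The walk-up-then-propagate-down idea can be sketched against an in-memory stand-in for the cgroup files. This is an assumption-laden illustration, not the jailer's implementation: the map plays the role of one cpuset file (say cpuset.cpus) per cgroup directory, and the signature is invented for the example.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// `files` maps a cgroup directory to the contents of one cpuset file.
/// Returns the value the cgroup ends up with, or None if no ancestor has one.
fn inherit_from_parent(
    files: &mut HashMap<PathBuf, String>,
    cgroup: &Path,
) -> Option<String> {
    // Stop climbing as soon as a non-empty value is found.
    if let Some(value) = files.get(cgroup) {
        if !value.trim().is_empty() {
            return Some(value.clone());
        }
    }
    // Otherwise ask the parent; the hierarchy root has no parent.
    let parent = cgroup.parent()?;
    let value = inherit_from_parent(files, parent)?;
    // On the way back down, copy the inherited value into this level.
    files.insert(cgroup.to_path_buf(), value.clone());
    Some(value)
}
```

With "0-3" at the root and empty files below it, a call for /firecracker/i1 fills in both /firecracker and /firecracker/i1 with "0-3".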

Production-relevant resource knobs documented in docs/prod-host-setup.md:

Category              cgroup v1 file                    Meaning
Memory hard limit     memory.limit_in_bytes             Hard memory ceiling
Memory + swap limit   memory.memsw.limit_in_bytes       Combined memory and swap cap
Memory soft limit     memory.soft_limit_in_bytes        Sharing threshold under contention
CPU weight            cpu.shares                        Relative CPU weight
CPU period            cpu.cfs_period_us                 CFS period (default 100,000 µs)
CPU quota             cpu.cfs_quota_us                  CPU time allowed per period
IO IOPS               blkio.throttle.io_serviced        IOPS throttle
IO throughput         blkio.throttle.io_service_bytes   Bandwidth throttle
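The CPU period and quota rows work as a ratio: the quota divided by the period is the number of CPUs' worth of time the group may use in each period. A minimal sketch of the arithmetic, with a hypothetical helper name:

```rust
/// Computes the cpu.cfs_quota_us value that grants `cpus` CPUs' worth of
/// time per period of `period_us` microseconds.
fn cfs_quota_us(cpus: f64, period_us: u64) -> u64 {
    (cpus * period_us as f64).round() as u64
}
```

With the default period of 100,000 µs, a cap of 1.5 CPUs corresponds to a quota of 150,000 µs, and a cap of a quarter of a CPU to 25,000 µs.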

cgroup v2

With --cgroup-version=2, there is a single unified hierarchy. Before writing any controller property, the jailer must enable each required controller by writing +<controller> to every cgroup.subtree_control file along the path from the mount root down to the parent cgroup, because a v2 controller is available in a cgroup only if the cgroup's parent has enabled it. This interacts with the no-internal-process constraint: a non-root cgroup can distribute domain resources to child cgroups only if it contains no processes itself. The jailer calls write_all_subtree_control() before creating or touching any child directory. Once controllers are enabled, it creates <v2-root>/<parent-cgroup>/<id>/ and writes the PID to cgroup.procs, not tasks. Writing to cgroup.procs migrates all threads of the process atomically.
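The set of files that must be written, root first, can be enumerated as follows. This sketch only computes paths and the string to write; it performs no I/O, and the names subtree_control_files and enable_line are hypothetical rather than taken from cgroup.rs.

```rust
use std::path::{Path, PathBuf};

/// Lists every cgroup.subtree_control file from the v2 mount root down to
/// (and including) the parent cgroup, in top-down order.
fn subtree_control_files(v2_root: &Path, parent_cgroup: &str) -> Vec<PathBuf> {
    let mut dir = v2_root.to_path_buf();
    let mut files = vec![dir.join("cgroup.subtree_control")];
    // Skip empty components so "a//b" and "/a/b" behave like "a/b".
    for component in parent_cgroup.split('/').filter(|c| !c.is_empty()) {
        dir.push(component);
        files.push(dir.join("cgroup.subtree_control"));
    }
    files
}

/// Formats the text written to each file, e.g. "+cpu +memory".
fn enable_line(controllers: &[&str]) -> String {
    controllers
        .iter()
        .map(|c| format!("+{c}"))
        .collect::<Vec<_>>()
        .join(" ")
}
```

The top-down order matters: a controller cannot be enabled at a deeper level until the level above has enabled it.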

The v1/v2 distinction here is not just a naming difference. In v1, tasks and cgroup.procs exist side by side and mean different things; in v2 only cgroup.procs exists and the subtree enablement step is mandatory. If a domain controller is already enabled in a cgroup that contains processes and you attempt to move a PID into a child, the kernel rejects the write. The jailer's error message for this case includes: "Hint: If you intended to create a child cgroup under {0}, pass any --cgroup parameters."

A note for Linux 6.1 (x86_64) deployments: a boot regression affecting cgroup v2 performance is mitigated by remounting the unified hierarchy with the favordynmods option, using mount -o remount,favordynmods /sys/fs/cgroup. The favordynmods option is specific to the cgroup v2 unified mount; it does not apply to cgroup v1 per-controller mounts.

The pivot_root Sequence

The filesystem jail is implemented in src/jailer/src/chroot.rs. The function is named chroot() in the source, but it performs a pivot_root(2) sequence rather than a simple chroot(2) call. The distinction matters: a process holding CAP_SYS_CHROOT can escape a chroot(2) jail by calling chroot(2) again on a subdirectory while its working directory remains outside the new root, then walking upward with chdir(".."). pivot_root plus MNT_DETACH removes the old filesystem tree from the process's mount namespace, so there is no outside directory left to walk to. This is the same choice that runc and youki make for OCI containers.

Root required. The following sequence requires CAP_SYS_ADMIN and CAP_SYS_CHROOT. In normal operation, jailer holds these capabilities as root before the chroot step; after the exec into Firecracker, they are gone. Do not test this sequence against a production system without understanding the mount namespace implications.

The steps, in order:

  1. unshare(CLONE_NEWNS): creates a new mount namespace, giving the jailer a private copy of the mount table. Whether mount events still cross between that copy and the host is governed by the propagation type set in step 2.

  2. mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL): sets mount propagation to MS_SLAVE recursively, so mount and unmount events inside the jail cannot propagate to the host's mount namespace, although events originating on the host may still propagate in. The docs/jailer.md documentation says MS_PRIVATE | MS_REC; the source code uses MS_SLAVE | MS_REC. The source is authoritative.

  3. mount(<chroot_dir>, <chroot_dir>, NULL, MS_BIND | MS_REC, NULL): self bind-mounts the jail directory. pivot_root(2) requires the new root to be a mount point distinct from the current root; the self bind-mount satisfies that requirement.

  4. mkdir("old_root") inside the chroot directory, mode 0600.

  5. chdir(<chroot_dir>).

  6. libc::syscall(libc::SYS_pivot_root, ".", "old_root"): a raw syscall, because glibc provides no wrapper for pivot_root(2). The old root is now accessible at ./old_root.

  7. chdir("/"): needed because pivot_root(2) does not reposition the current working directory.

  8. umount2("old_root", MNT_DETACH): lazily detaches the old root. Any process still holding a reference to it can continue, but no new path traversals reach the old filesystem.

  9. rmdir("old_root"): removes the empty mountpoint.

After step 9, there is no path from inside the jail to the host filesystem. No chroot(2) call follows; the pivot_root sequence is complete.

Device Nodes Inside the Jail

After the chroot, the jailer creates the directory structure that Firecracker needs — /dev, /dev/net, and /run, each mode 0o700, each chown'd to the target uid/gid — and then calls mknod to create the character device nodes that Firecracker will open. The major/minor values are derived from Documentation/admin-guide/devices.txt:

Device        Jail path          Major   Minor     Notes
KVM           /dev/kvm           10      232       misc device; required for all guest execution
TUN/TAP       /dev/net/tun       10      200       misc device; /dev/net/ created first
urandom       /dev/urandom       1       9         failure is non-fatal; MMDS v2 unavailable without it
userfaultfd   /dev/userfaultfd   10      dynamic   minor number read from /proc/misc at runtime

Every node is created with mode S_IFCHR | S_IRUSR | S_IWUSR — character device, readable and writable only by the owner — and immediately chown'd to the target uid/gid. Two of these have non-trivial behavior. The /dev/urandom mknod is allowed to fail: if it fails, the jailer prints a warning ("MMDS version 2 will not be available to use.") and continues. The userfaultfd minor number is allocated dynamically by the kernel using MISC_DYNAMIC_MINOR; the jailer reads /proc/misc, finds the line containing "userfaultfd", and parses the minor number from its first column. If the kernel was built without userfaultfd support, the entry is absent and the node is silently omitted.
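The three ingredients of each mknod call (mode, device number, and the runtime lookup of the userfaultfd minor) can be sketched with the standard library alone. The constants are the usual Linux values written out by hand, makedev follows the glibc encoding, and all three function names are hypothetical. One deliberate difference from the description above: the sketch matches the device name exactly rather than searching for a line that merely contains it.

```rust
const S_IFCHR: u32 = 0o020000; // character device
const S_IRUSR: u32 = 0o400;    // owner read
const S_IWUSR: u32 = 0o200;    // owner write

/// Mode passed to mknod for every device node in the jail.
fn device_mode() -> u32 {
    S_IFCHR | S_IRUSR | S_IWUSR
}

/// Encodes a major/minor pair the way glibc's makedev() does.
fn makedev(major: u64, minor: u64) -> u64 {
    ((major & 0xffff_f000) << 32)
        | ((major & 0x0000_0fff) << 8)
        | ((minor & 0xffff_ff00) << 12)
        | (minor & 0x0000_00ff)
}

/// Finds the userfaultfd minor in the text of /proc/misc, whose lines
/// have the form "<minor> <name>". Returns None if the entry is absent.
fn userfaultfd_minor(proc_misc: &str) -> Option<u32> {
    proc_misc.lines().find_map(|line| {
        let mut cols = line.split_whitespace();
        let minor = cols.next()?;
        let name = cols.next()?;
        if name == "userfaultfd" {
            minor.parse::<u32>().ok()
        } else {
            None
        }
    })
}
```

For /dev/kvm (major 10, minor 232) the encoded device number is 2792, and the mode works out to octal 020600.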

PID Namespace

When --new-pid-ns is set, the jailer calls:

libc::syscall(libc::SYS_clone, libc::CLONE_NEWPID, null_stack, 0, 0, 0)

A null child stack is intentional: without CLONE_VM, the raw clone syscall behaves like fork(2), and the child continues on its own copy-on-write copy of the parent's stack until it replaces itself with execve. The jailer parent writes the child PID to <exec-file-name>.pid and exits. The child process becomes PID 1 inside the new namespace. This is the init role in Linux PID namespace semantics: orphaned processes created within the namespace are reparented to PID 1, and the namespace is torn down when PID 1 exits.

The PID namespace provides isolation but also a monitoring handle. The orchestrator can watch the PID file and know exactly which host PID corresponds to the init of a given microVM instance.

Privilege Drop and the Handoff

The final operation before exec is the privilege drop. The jailer constructs a std::process::Command with .uid(uid).gid(gid) set. The Rust standard library applies these as setgid(2) and setuid(2) calls immediately before execve(2), with the group change first, because a process that has already given up uid 0 can no longer change its gid. Since the jailer replaces itself rather than spawning a child, no fork(2) is involved at this step. When the real, effective, and saved uids all move from 0 to a nonzero value, the Linux kernel clears the permitted and effective capability sets (including CAP_SYS_ADMIN, CAP_NET_ADMIN, and CAP_SYS_CHROOT), and the exec of an ordinary binary does not restore them.

No explicit capset call appears in the jailer's privilege-drop path; the uid/gid transition is the mechanism by which capabilities are shed. prctl(PR_SET_NO_NEW_PRIVS, 1) does not appear in the jailer either, but it must be issued inside Firecracker before seccomp filters are installed: seccomp(2) requires either CAP_SYS_ADMIN or the no_new_privs bit, and after the uid drop Firecracker no longer holds the capability. Chapter 19 covers where and how Firecracker sets that bit and installs per-thread filters.

The arguments passed to Firecracker on exec are:

--id <id> \
--start-time-us <value> \
--start-time-cpu-us <value> \
--parent-cpu-time-us <value> \
[any extra args passed to jailer after --]

The timing arguments allow Firecracker to report accurate startup metrics even though the clock starts before the exec. After this point, docs/design.md states: "past this point, Firecracker can only access resources that a privileged third-party grants access to (e.g., by copying a file into the chroot, or passing a file descriptor)." The gate closes at execve.
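As an illustration of the handoff, the argument vector could be assembled like this. The function name and parameter types are hypothetical, and the sketch assumes each flag and its value are passed as two separate arguments; it makes no claim about how the real jailer formats them.

```rust
/// Builds the argv tail handed to firecracker: the instance id, the three
/// timing values, then whatever the caller placed after "--".
fn firecracker_argv(
    id: &str,
    start_time_us: u64,
    start_time_cpu_us: u64,
    parent_cpu_time_us: u64,
    extra: &[&str],
) -> Vec<String> {
    let mut argv = vec![
        "--id".to_string(),
        id.to_string(),
        "--start-time-us".to_string(),
        start_time_us.to_string(),
        "--start-time-cpu-us".to_string(),
        start_time_cpu_us.to_string(),
        "--parent-cpu-time-us".to_string(),
        parent_cpu_time_us.to_string(),
    ];
    // Extra arguments are forwarded unchanged and in order.
    argv.extend(extra.iter().map(|s| s.to_string()));
    argv
}
```

The timing values are captured by the jailer and forwarded, which is what lets Firecracker account for time spent before its own first instruction.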

Isolation Per Instance

docs/prod-host-setup.md specifies that each Firecracker instance must use a distinct uid/gid pair. This is not a convention — it is load-bearing. Every file and directory created inside the jail, including the firecracker binary copy, all device nodes, and the jail root directory, was chown'd to the target uid/gid before the exec. A process that escapes its chroot will find itself owning only those files. A process that somehow escapes to the host filesystem will be an unprivileged user with a uid that no other instance shares — POSIX ownership prevents lateral movement to a neighboring instance's files.

flowchart LR
    host["Host filesystem"]
    j1["/srv/jailer/firecracker/i1/root\nuid 10001 / gid 10001"]
    j2["/srv/jailer/firecracker/i2/root\nuid 10002 / gid 10002"]
    j3["/srv/jailer/firecracker/i3/root\nuid 10003 / gid 10003"]
    host --> j1
    host --> j2
    host --> j3
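An orchestrator can enforce the distinct uid/gid rule before launching anything. The following is a minimal sketch under the assumption that the orchestrator keeps a list of planned instances; the Instance struct and first_shared_identity are hypothetical names.

```rust
use std::collections::HashSet;

struct Instance<'a> {
    id: &'a str,
    uid: u32,
    gid: u32,
}

/// Returns the id of the first instance whose uid or gid was already
/// claimed by an earlier instance, or None if every pair is distinct.
fn first_shared_identity<'a>(instances: &[Instance<'a>]) -> Option<&'a str> {
    let mut uids = HashSet::new();
    let mut gids = HashSet::new();
    for inst in instances {
        // HashSet::insert returns false when the value was already present.
        let uid_is_new = uids.insert(inst.uid);
        let gid_is_new = gids.insert(inst.gid);
        if !uid_is_new || !gid_is_new {
            return Some(inst.id);
        }
    }
    None
}
```

Applied to the three jails in the diagram (10001, 10002, 10003), the check passes; reusing 10001 for a fourth instance would be flagged.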

Host-Level Concerns Beyond the Jailer

The jailer does not address every attack surface.

Instance Metadata Service filtering is the orchestrator's responsibility. Firecracker performs no network filtering. All guest egress reaches the host tap interface as untrusted traffic. Block guest access to the IMDS with a rule such as:

Root required. The following nftables rule modifies host network policy and takes effect immediately.

nft add rule firecracker filter iifname "tap*" ip daddr 169.254.169.254 counter drop

SMT (simultaneous multithreading, or hyperthreading) lets two hardware threads share one physical core's execution resources and caches, which creates cross-thread speculative execution side channels (Spectre, MDS variants) that neither the hardware virtualization boundary nor the jailer prevents. In multi-tenant environments, SMT should be disabled at the host level.

KSM (Kernel Samepage Merging) deduplicates identical memory pages across VMs. The deduplication timing is observable as a memory-content oracle across VM boundaries. KSM must be disabled.

The kvm-pit/<pid> kernel thread that KVM creates for the PIT timer is not automatically placed in the microVM's cgroup. If strict resource accounting is required, an external agent must move it into the instance's cgroup after VM creation.

The guest can write to the serial device, which maps to host stdout or stderr, at an arbitrary rate. Production deployments should redirect serial output to a bounded buffer or /dev/null to prevent a guest from consuming host I/O capacity through the serial device.
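One way to realize a bounded buffer for serial output is a fixed-capacity byte ring that discards the oldest data when full. This is a sketch of the general technique, not something Firecracker or the jailer provides; BoundedLog and its methods are hypothetical.

```rust
use std::collections::VecDeque;

/// Keeps at most `capacity` of the most recent bytes written to it.
struct BoundedLog {
    capacity: usize,
    bytes: VecDeque<u8>,
    dropped: u64, // bytes discarded so far
}

impl BoundedLog {
    fn new(capacity: usize) -> Self {
        BoundedLog { capacity, bytes: VecDeque::with_capacity(capacity), dropped: 0 }
    }

    fn write(&mut self, data: &[u8]) {
        for &byte in data {
            if self.capacity == 0 {
                // A zero-capacity log stores nothing and counts everything as dropped.
                self.dropped += 1;
                continue;
            }
            if self.bytes.len() == self.capacity {
                // Full: evict the oldest byte to make room.
                self.bytes.pop_front();
                self.dropped += 1;
            }
            self.bytes.push_back(byte);
        }
    }

    fn contents(&self) -> Vec<u8> {
        self.bytes.iter().copied().collect()
    }
}
```

However fast the guest writes, memory use stays at the configured capacity, and the dropped counter gives the operator a signal that output was lost.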

The Full Picture

sequenceDiagram
    participant O as Orchestrator
    participant J as jailer (root)
    participant K as kernel
    participant F as firecracker (uid N)

    O->>J: exec jailer --id i1 --uid N --gid N ...
    J->>J: close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE)
    J->>J: clean_env_vars()
    J->>K: setns(netns_fd, CLONE_NEWNET)
    J->>K: setrlimit(RLIMIT_NOFILE, 2048)
    J->>K: write cgroup limits, write PID to tasks/cgroup.procs
    Note over J: tasks (v1) or cgroup.procs (v2)
    J->>K: unshare(CLONE_NEWNS)
    J->>K: mount MS_SLAVE | MS_REC
    J->>K: mount MS_BIND | MS_REC (chroot dir)
    J->>K: SYS_pivot_root, umount2 MNT_DETACH
    J->>K: mknod /dev/kvm, /dev/net/tun, /dev/urandom
    J->>K: clone(CLONE_NEWPID) (if --new-pid-ns)
    J->>K: setgid(N), setuid(N)
    J->>F: execve firecracker --id i1 ...
    Note over J: jailer exits
    F->>F: prctl(PR_SET_NO_NEW_PRIVS, 1), install seccomp filters per-thread
    F->>K: open /dev/kvm, KVM_CREATE_VM ...

Past the execve, docs/design.md states: "Firecracker can only access resources that a privileged third-party grants access to (e.g., by copying a file into the chroot, or passing a file descriptor)." The gate closes at exec.

Sources And Further Reading