Chapter 20: The Threat Model
If you run untrusted code on a multi-tenant system, the central question is not whether isolation works in the happy path — it does. The question is what an adversary can reach after breaking the first barrier. Containers answer it one way: a process that escapes its namespace lands in the host kernel, which is also the kernel every other tenant is running on. MicroVMs answer it differently, with a sequence of barriers, each designed to contain what the previous one failed to stop.
This chapter maps that sequence in Firecracker's terms. The barriers are not informal defense-in-depth platitudes. They are hardware CPU modes, per-thread BPF filters, and a setuid jail binary — each with a specific syscall, VMCS field, or kernel module parameter attached to the claim.
The Trust Axiom
Firecracker's design document states the baseline assumption plainly:
"all vCPU threads are considered to be running malicious code as soon as they have been started; these malicious threads need to be contained."
This is not a disclaimer hedging an improbable edge case. It is the load-bearing premise the entire architecture rests on. A guest OS that boots correctly and runs cooperative workloads is operationally convenient; the containment model is designed for a guest that has been taken over and is actively probing for weaknesses in every direction.
From that premise, the trust hierarchy partitions into three zones: an untrusted zone (all guest vCPU threads and all guest network traffic), a semi-trusted zone (the Firecracker VMM process itself), and a trusted zone (the host kernel, KVM module, the Unix socket API channel, snapshot files, and the physical hardware). "Semi-trusted" is precise: Firecracker is written in safe Rust with a deliberately small codebase, but it remains a userspace process, and a vulnerability in its virtio emulation paths is in scope for exploitation. What limits the damage is the containment imposed around that process — not the assumption that the process is correct.
Firecracker explicitly disclaims network traffic filtering. The design document
states that all egress from a guest is untrusted and must be filtered at the host
level by the operator — typically with nftables rules applied to the TAP
interfaces. That delegation is not a weakness; it reflects where the right tool
sits. The VMM is not a firewall.
Layer 1: Hardware Virtualization
The outermost barrier is the one the CPU enforces without software assistance.
Intel VMX introduces two orthogonal operating modes: VMX root operation, where the
hypervisor and host OS run, and VMX non-root operation, where the guest runs. Each
mode retains the usual CPL 0-3 ring hierarchy, but a guest OS running at CPL 0 in
VMX non-root mode does not have full ring-0 privilege. Every privileged action the
guest attempts (writing CR3, accessing an MSR, executing INVLPG) is governed
by the VM-execution control fields in the VMCS (Virtual Machine Control Structure),
which determine whether the action causes a VM exit instead of executing.
On a VM exit, the CPU atomically loads host state from the VMCS host-state area —
CR0, CR3, CR4, segment selectors, RIP and RSP from the HOST_RIP and
HOST_RSP fields — and saves guest state. Linux's arch/x86/kvm/vmx/vmenter.S notes
this directly: "After a successful VMRESUME/VMLAUNCH, control flow 'magically'
resumes below at vmx_vmexit due to the VMCS HOST_RIP setting." The guest did
not transfer control voluntarily; the CPU forced the transition and simultaneously
switched to a separate address space.
That vmenter.S path also zeroes all general-purpose registers except RSP and RBX
before returning to host code, preventing speculative use of guest register values
in host execution paths. RSB (Return Stack Buffer) clearing and SPEC_CTRL MSR
handling are applied as post-exit mitigations for Spectre-class side channels —
more on those below.
AMD SVM is structurally parallel. The VMCB (Virtual Machine Control Block) is
divided into a control area (intercepts, ASID, ASID flush bits) and a save area
(guest register state). VMRUN saves host state to the area pointed at by the
HSAVE_PA MSR and enters the guest; #VMEXIT reverses it.
The KVM API exposes this hardware boundary through the three-scope ioctl
hierarchy. Applications verify KVM_GET_API_VERSION returns 12, create a VM
with KVM_CREATE_VM (_IO(KVMIO, 0x01)) on /dev/kvm, register guest memory
with KVM_SET_USER_MEMORY_REGION using struct kvm_userspace_memory_region (slot,
flags, guest_phys_addr, memory_size, userspace_addr), and run a vCPU with
KVM_RUN (_IO(KVMIO, 0x80), decimal 44672). When a guest action requires VMM
intervention, KVM sets kvm_run->exit_reason in the shared mmap region and
returns. Common exit reasons include KVM_EXIT_IO (2) for port I/O,
KVM_EXIT_MMIO (6) for MMIO, and KVM_EXIT_SHUTDOWN (8) for guest shutdown.
Operations the host kernel can handle entirely in KVM — local APIC, IOAPIC, PIT —
never cross the KVM_RUN boundary to userspace at all.
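The request numbers quoted above follow from the kernel's _IO() encoding. The sketch below reproduces that arithmetic in Python, assuming the generic asm-generic/ioctl.h bit layout used on x86-64; the helper name _IO mirrors the C macro, and the dictionary of exit reasons is only a convenience for this illustration.

```python
# Sketch of the Linux _IO() ioctl-number encoding for requests that carry no
# argument payload. Field positions follow the generic asm-generic/ioctl.h
# layout used on x86-64 (an assumption; some architectures differ).
IOC_NRSHIFT = 0      # command number: bits 0-7
IOC_TYPESHIFT = 8    # subsystem "magic" byte: bits 8-15
IOC_SIZESHIFT = 16   # argument size: bits 16-29
IOC_DIRSHIFT = 30    # transfer direction: bits 30-31
IOC_NONE = 0         # _IO() encodes "no data transfer"

KVMIO = 0xAE         # KVM's ioctl type byte, from include/uapi/linux/kvm.h


def _IO(ioc_type: int, nr: int) -> int:
    """Encode an argument-less ioctl request number."""
    return (
        (IOC_NONE << IOC_DIRSHIFT)
        | (0 << IOC_SIZESHIFT)
        | (ioc_type << IOC_TYPESHIFT)
        | (nr << IOC_NRSHIFT)
    )


KVM_GET_API_VERSION = _IO(KVMIO, 0x00)
KVM_CREATE_VM = _IO(KVMIO, 0x01)
KVM_RUN = _IO(KVMIO, 0x80)

# Exit reasons named in the text, keyed by their numeric value in kvm.h.
KVM_EXIT_REASONS = {2: "KVM_EXIT_IO", 6: "KVM_EXIT_MMIO", 8: "KVM_EXIT_SHUTDOWN"}

if __name__ == "__main__":
    print(hex(KVM_RUN), KVM_RUN)
```

Because the direction and size fields are zero for _IO(), the result is simply the type byte followed by the command byte, which is why KVM_RUN reads as 0xAE80.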
The guarantee this layer provides is direct: guest code cannot read or write host memory, cannot execute privileged host instructions, and cannot modify host page tables. What it does not guarantee is that KVM's own kernel-mode code is bug-free — and that caveat is exactly where CVE-2021-29657 sits.
Layer 2: Seccomp BPF Filters
Chapter 19 walked through the seccomp(2) mechanism and Firecracker's filter
allow-lists in detail. Here the relevant frame is what the filters contribute to
the layered barrier.
Suppose a guest has found a bug in the KVM hardware boundary and is now executing
arbitrary code inside a vCPU thread on the host. Without any further containment,
that thread can call every syscall the process is permitted to call: socket,
execve, fork, mount. The host kernel evaluates each one against the
Firecracker process's credentials and the host's network configuration. A guest
that can issue arbitrary syscalls on the host has escaped.
Seccomp BPF filters answer this by restricting which syscalls each thread in the
Firecracker process can reach at all, without trying to judge whether any individual
call is malicious. The filter policy is not system-wide; it is per-thread, with one
filter per thread category defined in a JSON source that seccompiler-bin compiles at
build time into BPF bytecode embedded in the firecracker binary. The three thread
categories and their approximate allow-list sizes on x86_64-unknown-linux-musl (main
branch) are:
- The vmm thread allows roughly 68 distinct syscalls, covering virtio I/O, KVM VM-scope ioctls (KVM_SET_USER_MEMORY_REGION, KVM_IOEVENTFD, KVM_IRQFD, KVM_GET_DIRTY_LOG), and TUN/TAP ioctls (TUNSETIFF, TUNSETOFFLOAD, TUNSETVNETHDRSZ).
- The api thread allows roughly 41 syscalls and zero KVM ioctls: only FIONBIO (value 21537) for non-blocking I/O control, plus the Unix socket calls it actually needs.
- The vcpu thread allows roughly 47 syscalls, principally the vCPU-scope ioctls including KVM_RUN (44672), but nothing beyond what vCPU execution requires.
The default_action across all three filters is "trap", which maps to
SECCOMP_RET_TRAP: the kernel delivers SIGSYS to the offending thread. An unlisted
call does not simply return an error; Firecracker treats the signal as fatal. An
operator can supply a custom pre-compiled filter via --seccomp-filter, but the
default posture is deny-by-default.
Beyond the syscall number, Firecracker uses the argument-evaluation capability of
seccomp BPF — the filter receives struct seccomp_data.args[6] and can inspect
up to six arguments, though it cannot dereference pointers. A handful of
constraints that matter for an escape scenario:
- mmap requires that PROT_EXEC (bit 2) be unset, so a compromised VMM process cannot create a new executable mapping to JIT shellcode.
- mprotect carries the same PROT_EXEC exclusion, preventing an attacker from making an existing mapping executable.
- socket permits only AF_UNIX (value 1). No AF_INET or AF_INET6 sockets are reachable from any thread in the Firecracker process.
- tkill is restricted to signals 6 (SIGABRT) and 35 (SIGRTMIN + the per-vCPU RT signal offset), blocking arbitrary signal delivery to host threads.
The combined effect: a compromised VMM process cannot open a network socket, cannot JIT executable code, and cannot escalate through signal tricks. It is constrained to the exact ioctls and syscalls Firecracker itself needs to run.
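The argument constraints listed above can be modelled in a few lines. This is a toy evaluator written for illustration, not Firecracker's filter: the real policy is BPF bytecode run by the kernel, and the RULES table, the evaluate function, and the VCPU_SIGNAL name are hypothetical. It covers only the four syscalls discussed here and treats everything else as unlisted.

```python
# Toy model of seccomp argument checks. NOT Firecracker's actual filter: the
# rule table and names below are hypothetical and cover only the four
# constraints described in the text.
PROT_EXEC = 0x4    # bit 2 of the prot argument
AF_UNIX = 1
SIGABRT = 6
VCPU_SIGNAL = 35   # the real-time signal value quoted in the text

ALLOW, TRAP = "allow", "trap"

# syscall name -> predicate over the six raw arguments (args[0]..args[5]).
# Like a real filter, the predicates compare values and never follow pointers.
RULES = {
    "mmap": lambda a: (a[2] & PROT_EXEC) == 0,       # prot is argument 3
    "mprotect": lambda a: (a[2] & PROT_EXEC) == 0,   # prot is argument 3
    "socket": lambda a: a[0] == AF_UNIX,             # domain is argument 1
    "tkill": lambda a: a[1] in (SIGABRT, VCPU_SIGNAL),  # sig is argument 2
}


def evaluate(syscall: str, args: tuple) -> str:
    """Return ALLOW if the call matches a rule, otherwise the default TRAP."""
    rule = RULES.get(syscall)
    if rule is None:
        return TRAP  # deny-by-default for anything not listed
    padded = tuple(args) + (0,) * (6 - len(args))
    return ALLOW if rule(padded) else TRAP


if __name__ == "__main__":
    print(evaluate("socket", (AF_UNIX, 1, 0)))          # Unix socket
    print(evaluate("socket", (2, 1, 0)))                # AF_INET
    print(evaluate("mmap", (0, 4096, 0x1 | PROT_EXEC, 0x22, -1, 0)))
```

The point of the model is the shape of the decision: a call is permitted only when both its number and its scalar arguments match an explicit rule, and the absence of a rule is itself the denial.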
Installing a seccomp BPF filter requires either CAP_SYS_ADMIN or a prior
prctl(PR_SET_NO_NEW_PRIVS, 1) call; the kernel returns -EACCES otherwise.
Firecracker uses PR_SET_NO_NEW_PRIVS — which also prevents execve from
granting the child more privileges than the parent, closing an escalation path
before any filter is in place.
Layer 3: The Jailer
The jailer is a separate setuid binary. Its job is to set up every privileged
resource Firecracker needs, then exec() into the firecracker binary. After
that handoff, firecracker can only access resources the jailer explicitly created
inside the jail before transferring control.
The sequence of operations the jailer binary performs, in order:
- Places the process into a cgroup (v1: by writing to the tasks file under one of cpuset, cpu, cpuacct, memory, net_cls, net_prio, or pids; v2: by writing to cgroup.procs). The --cgroup-version flag selects which (default: v1).
- Creates a new mount namespace with unshare(), then calls pivot_root() (not the older chroot()) to establish a jail root at <chroot_base>/<exec_file_name>/<id>/root.
- Creates only the device nodes Firecracker actually needs inside the chroot: /dev/net/tun and /dev/kvm. Nothing else is mknod()'d.
- Optionally creates a PID namespace via clone(CLONE_NEWPID) (the --new-pid-ns flag) and joins an existing network namespace via setns(fd, CLONE_NEWNET) (the --netns flag).
- Closes all file descriptors that were not explicitly inherited, using close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE) (close_range syscall, requires kernel 5.9+).
- Drops privilege with setgid(gid) and setuid(uid) to a unique non-privileged uid/gid per instance; the group must be changed first, because setgid() is no longer permitted once the uid is unprivileged.
- Sets resource limits: setrlimit(RLIMIT_FSIZE) and setrlimit(RLIMIT_NOFILE), the latter defaulting to 2048 file descriptors.
- exec()s into firecracker.
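To make the jail layout from the list above concrete, here is a small sketch that derives the jail root and the two device-node paths from the jailer's inputs. The function names are hypothetical, and the example values in the usage note (/srv/jailer, /usr/bin/firecracker, vm-42) are illustrative assumptions, not values taken from this chapter.

```python
from pathlib import PurePosixPath


def jail_root(chroot_base: str, exec_file: str, vm_id: str) -> PurePosixPath:
    """Build <chroot_base>/<exec_file_name>/<id>/root.

    exec_file_name is taken to be the final path component of the executable,
    which is how the placeholder in the text is read here (an assumption).
    """
    exec_file_name = PurePosixPath(exec_file).name
    return PurePosixPath(chroot_base) / exec_file_name / vm_id / "root"


def jail_device_nodes(root: PurePosixPath) -> list:
    """Host-side paths of the only device nodes created inside the jail."""
    return [root / "dev" / "kvm", root / "dev" / "net" / "tun"]


if __name__ == "__main__":
    root = jail_root("/srv/jailer", "/usr/bin/firecracker", "vm-42")
    print(root)
    for node in jail_device_nodes(root):
        print(node)
```

After pivot_root(), the process sees the jail root as /, so the same nodes appear to firecracker as /dev/kvm and /dev/net/tun.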
The use of pivot_root() rather than chroot() is meaningful. chroot() only
changes the root directory for path resolution; a process with CAP_SYS_CHROOT
can break out of a chroot() jail. pivot_root() replaces the entire mount
namespace root, so the old filesystem tree is genuinely unreachable unless the
jailer explicitly binds it in — which it does not.
The jailer's own inputs are treated as trusted. The jailer documentation is
explicit: it is the operator's responsibility to ensure that jailer input paths
cannot be tampered with by unprivileged local users. The jailer does not defend
against a malicious operator; it defends against a compromised firecracker
process trying to reach host resources.
Device Model as Attack Surface Reduction
The device model bounds what the attacker has to aim at in the first place — not by filtering, but by the code not being present.
Firecracker's emulated device set is: VirtIO Net, VirtIO Block, and VirtIO Vsock (all over a virtio-mmio transport with I/O rate limiting); a serial console (8250 UART); a partial i8042 keyboard controller used only for reboot signaling; and the PIC, IOAPIC, and PIT handled entirely within KVM's in-kernel emulation. That is the complete list. There is no PCI bus, no BIOS or firmware, no ACPI, no USB controller, no GPU, no floppy disk controller, no sound device, no legacy ISA devices beyond i8042 and 8250. The guest boots via a direct-boot protocol straight to the Linux kernel, with no firmware layer in the path.
This absence is a security property. VENOM (CVE-2015-3456) exploited the floppy
disk controller emulation in QEMU: fdctrl_handle_drive_specification_command()
allocated a 512-byte FIFO buffer, but a missing data_pos reset in one branch of
the fifth-parameter handling allowed the write pointer to advance past the buffer
boundary on every subsequent I/O byte. A privileged guest user writing to the FDC
I/O port could overflow the heap region immediately following that FIFO. CVSS 2
score: 7.7 HIGH, fixed via commit
e907746266721f305d67bc0718795fedee2e824c, released in QEMU after 2.3.0. In Firecracker, the code path does not
exist. You cannot exploit emulation for a device the VMM did not implement.
The same logic applies to the virtio descriptor table, which Firecracker does
parse. The virtio 1.2 specification defines struct virtq_desc as 16 bytes:
le64 addr (guest-physical), le32 len, le16 flags, le16 next. The flags
field carries VIRTQ_DESC_F_NEXT (bit 0, value 1) for descriptor chaining and
VIRTQ_DESC_F_WRITE (bit 1, value 2) for write-only device buffers. Maximum queue
size is 32,768 entries. The spec places a MUST-level obligation on the VMM at
section 2.7.5.1: "A device MUST NOT write to any descriptor table entry."
Equally, the VMM must validate all descriptor fields before acting on them — the
addr, len, and next fields are all guest-controlled and all could be crafted
to manipulate host memory if bounds checks are missing.
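A minimal sketch of that validation follows, using the 16-byte layout given above. It is not Firecracker's parser: walk_chain is a hypothetical helper, guest memory is simplified to a single flat region starting at guest-physical address 0, and indirect descriptors are not handled.

```python
import struct

# struct virtq_desc: le64 addr, le32 len, le16 flags, le16 next (16 bytes).
VIRTQ_DESC = struct.Struct("<QIHH")
VIRTQ_DESC_F_NEXT = 1
VIRTQ_DESC_F_WRITE = 2


def walk_chain(table: bytes, head: int, queue_size: int, guest_mem_size: int):
    """Return [(addr, len, device_writable), ...] for one descriptor chain.

    Raises ValueError if an index is out of range, a buffer falls outside
    guest memory, or the chain is longer than the queue (implying a loop).
    Assumes flat guest memory of guest_mem_size bytes starting at address 0.
    """
    if len(table) < queue_size * VIRTQ_DESC.size:
        raise ValueError("descriptor table shorter than queue size")
    buffers = []
    index = head
    for _ in range(queue_size):
        if index >= queue_size:
            raise ValueError(f"descriptor index {index} out of range")
        addr, length, flags, nxt = VIRTQ_DESC.unpack_from(
            table, index * VIRTQ_DESC.size
        )
        # Python integers do not wrap, so addr + length cannot overflow here;
        # a C or Rust implementation needs a checked addition instead.
        if addr + length > guest_mem_size:
            raise ValueError(f"buffer at {addr:#x} exceeds guest memory")
        buffers.append((addr, length, bool(flags & VIRTQ_DESC_F_WRITE)))
        if not flags & VIRTQ_DESC_F_NEXT:
            return buffers
        index = nxt
    raise ValueError("descriptor chain longer than queue size (loop)")
```

Bounding the walk by the queue size is what turns a guest-crafted cycle of next pointers into an error instead of an unbounded loop, and checking every index and address before use is what keeps guest-controlled values from steering host memory accesses.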
CVE-2019-14835 is the canonical example of what happens when that validation is
absent. The get_indirect() function in the vhost-net kernel driver
(drivers/vhost/vhost.c) iterated up to USHRT_MAX + 1 (65,536) times writing to
a log buffer during live migration, without checking that *log_num stayed within
the actual buffer size. A guest could craft descriptor tables with large len
values to trigger the overflow during a migration event, achieving kernel heap
overflow. CVSS 3.1: 7.8 HIGH. Introduced in Linux 2.6.34, fixed in Linux 5.3,
commit 060423bfdee3f8bc6e2c1bac97de24d5415e2bc4. This was a kernel-mode virtio
backend, not a userspace VMM, but the attack surface is the same: guest-controlled
descriptor table content reaching code that fails to validate it.
Comparison to Container Isolation
The contrast between container isolation and microVM isolation is not a matter of degree. It is a categorical difference in what the attacker reaches if the first barrier breaks.
A container shares the host kernel. Every syscall a containerized process issues goes directly to the same kernel all other containers on the host are running on. Docker's default seccomp profile blocks approximately 44 syscalls out of 300-plus, leaving roughly 256 reachable to the host kernel by default. Research from 2025 (arxiv:2510.03720) showed that optimized per-container syscall limiting can reduce the average allowed set to roughly 87 syscalls, which would statically prevent exploitation of 87 CVEs in the study's dataset — a meaningful improvement, but still operating entirely within the shared-kernel model.
Consider CVE-2022-0847, Dirty Pipe. Commit f6dd975583bd ("pipe: merge
anon_pipe_buf*_ops"), introduced in Linux 5.8, made an older bug exploitable: the
flags field of struct pipe_buffer was left uninitialized, so a stale
PIPE_BUF_FLAG_CAN_MERGE could survive slot reuse. An unprivileged process could fill
all pipe ring slots (setting the flag on each), drain the pipe so the slots were free
again with the flag still set, splice() a read-only file's page into the pipe
(inheriting the flag from the ring), then write() to the pipe, which appended into
the page cache and silently overwrote read-only file content without write
permission. Fixed in Linux 5.16.11, 5.15.25, and 5.10.102 on 2022-02-23.
From a container, this attack calls pipe(2), splice(2), and write(2) — all
syscalls Docker's default profile permits — directly into the host kernel's
syscall handler. The result is host page cache overwrite. From a microVM guest,
those same calls go to the guest OS kernel. The host kernel does not see them.
Reaching the host would require first escaping the hardware VMX/SVM boundary,
surviving the seccomp filter, and escaping the jailer's pivot_root() jail —
three barriers that are not present in the container model.
Similarly, CVE-2017-7308 let an unprivileged user reach privilege escalation by
crafting setsockopt() calls via AF_PACKET sockets. From a container,
socket(AF_PACKET, ...) may reach the host kernel depending on the container's
seccomp profile. From a Firecracker VMM process, socket() is allowed by the
seccomp filter only for AF_UNIX (value 1) — AF_PACKET is not on the list,
and the default action is SECCOMP_RET_TRAP.
| Property | Container (default Docker) | Firecracker microVM |
|---|---|---|
| Kernel boundary | Namespace + cgroup (shared kernel) | Hardware VMX/SVM + KVM |
| Syscall surface to host | ~256 of 300+ reachable | vcpu thread: ~47 via seccomp |
| Default seccomp posture | ~44 syscalls blocked | All threads: default_action=trap |
| Device attack surface | Full host driver stack | 8 emulated devices; no PCI/BIOS/USB |
| Escape path | Single shared-kernel bug sufficient | KVM escape + VMM exploit + seccomp bypass + jailer escape |
The KVM Boundary Is Not Inviolable
The hardware virtualization boundary stops the vast majority of guest-originating attacks because it is enforced by CPUs that AMD and Intel have spent decades hardening. But the boundary is not a proof — it is an engineering artifact, and KVM's kernel-mode code is in scope for bugs.
CVE-2021-29657 is the subject of the first public writeup of a KVM guest-to-host breakout that did not rely on any bug in QEMU or a userspace VMM at all. Affected kernels: v5.10 through v5.12-rc6, patched in March 2021. The attack targeted KVM's AMD SVM-specific kernel-mode code directly from a guest vCPU, without requiring the guest to first manipulate the VMM process; Intel VMX users were not affected. A comparable Intel-VMX-specific guest-to-host breakout has not been publicly demonstrated. The CVE is nonetheless evidence that the "KVM escape" scenario the rest of the containment model is designed for is not purely theoretical.
The practical implication for the threat model is this: the seccomp filters and the jailer are not fallback measures deployed on the assumption that the hardware boundary works. They are independent containment layers designed for the scenario where the hardware boundary has already failed.
Microarchitectural Side Channels
Software barriers are not the only threat surface. Several CPU microarchitectural vulnerabilities allow cross-tenant information leakage through shared hardware state that the VMM cannot observe or block in software.
CVE-2018-3646 (L1TF / Foreshadow-VMM) is the defining example. When an x86 PTE has
the Present bit cleared, the CPU can speculatively load data from the L1D cache at
the physical address encoded in that PTE, before the page fault is raised and before
the permission check that would stop it. With Hyper-Threading enabled, a guest vCPU running on one
logical processor can speculatively read L1D contents populated by the host on the
sibling logical processor of the same physical core. The mitigation MSR is
IA32_FLUSH_CMD at address 0x10B — write-only; writing bit 0 (L1D_FLUSH,
value 1) flushes and invalidates the L1D on the executing physical core. Support
is enumerated via CPUID.(EAX=07H,ECX=0):EDX[28]. Susceptibility can be checked
via IA32_ARCH_CAPABILITIES MSR at 0x10A; bit 0 (RDCL_NO, value 1)
indicates the processor is not vulnerable. KVM exposes the mitigation via
/sys/module/kvm_intel/parameters/vmentry_l1d_flush: cond (flush only after
non-audited code paths, default) or always (unconditional, with 1% to 50%
performance overhead depending on VM exit rate).
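The bit positions above can be checked mechanically once the raw register values are in hand. The sketch below only decodes values passed to it; how they are obtained (for example through the msr driver or a CPUID utility) is left out, and the function and dictionary key names are hypothetical.

```python
# Decode the L1TF-related bits named in the text from raw register values.
# Reading the MSR or executing CPUID is out of scope; callers supply the values.
IA32_ARCH_CAPABILITIES = 0x10A   # MSR address, read-only
IA32_FLUSH_CMD = 0x10B           # MSR address, write-only

RDCL_NO = 1 << 0                 # IA32_ARCH_CAPABILITIES bit 0
L1D_FLUSH = 1 << 0               # IA32_FLUSH_CMD bit 0 (value to write)
CPUID_7_0_EDX_L1D_FLUSH = 1 << 28  # CPUID.(EAX=07H,ECX=0):EDX[28]


def l1tf_hardware_summary(arch_capabilities: int, cpuid_7_0_edx: int) -> dict:
    """Summarize susceptibility and flush support from raw register values."""
    return {
        # Set when the processor reports it is not susceptible.
        "rdcl_no": bool(arch_capabilities & RDCL_NO),
        # Set when IA32_FLUSH_CMD is implemented on this processor.
        "l1d_flush_supported": bool(cpuid_7_0_edx & CPUID_7_0_EDX_L1D_FLUSH),
    }


if __name__ == "__main__":
    # Example inputs chosen for illustration, not read from real hardware.
    print(l1tf_hardware_summary(arch_capabilities=0x0, cpuid_7_0_edx=1 << 28))
```

A result with rdcl_no false and l1d_flush_supported true describes a part that needs the flush and can perform it, which is the case the vmentry_l1d_flush parameter exists for.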
CVE-2017-5715 (Spectre v2) targets the branch predictor. The mitigation MSRs are
IA32_SPEC_CTRL at 0x48 (IBRS = bit 0, STIBP = bit 1) and IA32_PRED_CMD at
0x49 (IBPB = bit 0, write-only). KVM exposes these to guests and handles the
host-side save/restore: on VM exit, for CPUs using classic IBRS, KVM sets IBRS to
0; CPUs with enhanced IBRS (eIBRS, widely available since ~2019) do not require
this per-exit write because eIBRS protection persists across the transition. On VM
entry KVM restores the guest's saved IBRS value. The RSB is flushed on every VM
exit. Current mitigation status is visible at
/sys/devices/system/cpu/vulnerabilities/spectre_v2.
Neither of these is a VMM bug in the conventional sense. They are properties of the physical hardware, and the software mitigation for both converges on the same recommendation in Firecracker's production host setup guide: disable SMT (Hyper-Threading) entirely. With SMT disabled, no sibling logical processor can speculatively read L1D contents belonging to another tenant. The L1D flush MSR then becomes a belt-and-suspenders measure rather than the primary defense.
A complete production host mitigation table, drawn from Firecracker's
prod-host-setup.md:
| Mitigation | Mechanism | Threat addressed |
|---|---|---|
| Disable SMT | Kernel boot parameter or BIOS | Cross-tenant L1D leakage via sibling threads (L1TF) |
| Disable KSM | echo 0 > /sys/kernel/mm/ksm/run | Cross-tenant timing attacks via page deduplication |
| Disable swap | swapoff -a | Guest memory remanence on storage media |
| DDR4 with TRR + ECC | Hardware selection | Rowhammer |
| kvm.nx_huge_pages=never | Kernel module parameter | iTLB multihit regression (Linux 6.1, x86-64) |
| Updated CPU microcode | Distribution security updates | All speculative execution side channels |
Note: Disabling KSM (/sys/kernel/mm/ksm/run) and swap requires root on the host. Consult your platform's hardening guide before making these changes in production.
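Three of the rows above can be verified read-only from procfs and sysfs. The following is a sketch of such a check, not a tool from the Firecracker repository: check_host and its result keys are hypothetical, the root parameter exists so the function can be pointed at a test directory, and the accepted SMT control values are taken from the kernel's sysfs ABI as understood here.

```python
from pathlib import Path


def _read(path: Path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def check_host(root: str = "/") -> dict:
    """Report whether SMT, KSM, and swap appear disabled under `root`."""
    base = Path(root)
    smt = _read(base / "sys/devices/system/cpu/smt/control")
    ksm = _read(base / "sys/kernel/mm/ksm/run")
    swaps = _read(base / "proc/swaps")
    # /proc/swaps starts with a header line; any further line is an active
    # swap device or file.
    swap_entries = swaps.splitlines()[1:] if swaps else []
    return {
        "smt_disabled": smt in ("off", "forceoff", "notsupported", "notimplemented"),
        "ksm_disabled": ksm == "0",
        "swap_disabled": swaps is not None and not swap_entries,
    }


if __name__ == "__main__":
    for name, ok in check_host().items():
        print(f"{name}: {'ok' if ok else 'CHECK'}")
```

An unreadable file is reported as not disabled, so a missing sysfs entry shows up as something to investigate instead of silently passing.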
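On ARM, the physical counter is visible to the guest. Firecracker resets the CNTPCT
physical counter at VM boot, so the guest cannot read the host's counter value, but
only when KVM_CAP_COUNTER_OFFSET is available, which requires kernel 6.4 or later.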
Snapshot Trust and Operational Hazards
Snapshot files — the VM state file, the memory snapshot, and the disk image — are classified as trusted in Firecracker's threat model. This is not a strong guarantee. Firecracker applies a 64-bit CRC to the VM state file for partial corruption detection; it does not authenticate or encrypt snapshot content, and the CRC covers only the state file, not the memory snapshot or the disk image. All three files must be independently secured by the operator — an attacker who can modify a snapshot file can inject arbitrary guest state.
Resuming identical snapshots multiple times creates a subtler hazard: UUID collisions, reuse of entropy pool state, repeated cryptographic tokens, and reused RNG seeds across multiple resumed instances. If snapshot triggering is exposed to customers, operators must enforce disk quotas to prevent DoS via unbounded snapshot files.
There are two configuration hazards worth naming before a deployment reaches
production. The 8250 serial device can cause unbounded memory and storage usage on
the host if guest output is not rate-limited; the production guidance is to disable
it with the kernel command line argument 8250.nr_uarts=0. The MMDS (MicroVM
Metadata Service) is accessible from the guest at 169.254.169.254 by default;
operators must block it at the host with an nftables rule targeting TAP
interfaces:
Note: The commands below modify the host firewall. The firecracker table and filter chain must exist before adding the rule; create them once if they do not (see prod-host-setup.md for the full setup). Verify the rule does not conflict with existing nftables rulesets before applying.
nft add table ip firecracker
nft add chain ip firecracker filter { type filter hook forward priority 0 \; }
nft add rule ip firecracker filter iifname "tap*" ip daddr 169.254.169.254 counter drop
The threat model fixes the adversary's position and the defender's posture; Chapter 21 covers how to validate both in practice, using automated policy checks and runtime attestation to confirm that the barriers described here are actually in place on a production host.
Sources And Further Reading
- Firecracker design document: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
- Firecracker jailer documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md
- Firecracker production host setup guide: https://github.com/firecracker-microvm/firecracker/blob/main/docs/prod-host-setup.md
- Firecracker seccomp documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccomp.md
- Firecracker seccompiler documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccompiler.md
- Firecracker snapshot support documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md
- Firecracker x86-64 seccomp filter JSON: https://github.com/firecracker-microvm/firecracker/blob/main/resources/seccomp/x86_64-unknown-linux-musl.json
- KVM API documentation: https://www.kernel.org/doc/html/latest/virt/kvm/api.html
- Linux kernel KVM UAPI header (kvm.h): https://github.com/torvalds/linux/blob/master/include/uapi/linux/kvm.h
- Linux kernel vmenter.S: https://github.com/torvalds/linux/blob/master/arch/x86/kvm/vmx/vmenter.S
- Kernel seccomp_filter documentation: https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
- L1TF (CVE-2018-3646) kernel documentation: https://docs.kernel.org/admin-guide/hw-vuln/l1tf.html
- Spectre v2 (CVE-2017-5715) kernel documentation: https://docs.kernel.org/admin-guide/hw-vuln/spectre.html
- OASIS virtio v1.2 specification: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
- Intel SDM Vol. 3C (VMX architecture): https://cdrdv2-public.intel.com/789585/326019-sdm-vol-3c.pdf
- NVD entry for CVE-2015-3456 (VENOM): https://nvd.nist.gov/vuln/detail/CVE-2015-3456
- CrowdStrike VENOM technical disclosure: https://www.crowdstrike.com/en-us/blog/venom-vulnerability-details/
- NVD entry for CVE-2019-14835 (vhost-net): https://nvd.nist.gov/vuln/detail/CVE-2019-14835
- oss-security disclosure for CVE-2019-14835: https://www.openwall.com/lists/oss-security/2019/09/17/1
- LWN case study on CVE-2021-29657 (KVM guest-to-host breakout): https://lwn.net/Articles/861330/
- Dirty Pipe (CVE-2022-0847) canonical writeup: https://dirtypipe.cm4all.com/
- NVD entry for CVE-2022-0847: https://nvd.nist.gov/vuln/detail/cve-2022-0847
- Docker seccomp documentation: https://docs.docker.com/engine/security/seccomp/
- Syscall limitation research (arxiv:2510.03720): https://arxiv.org/html/2510.03720v1