Chapter 23: The VMM Landscape

Firecracker's device list fits on a napkin: virtio-net, virtio-blk, virtio-vsock, virtio-balloon, a 16550A UART, and an i8042 stub. Six devices, no PCI bus, no display, no audio, no ISA timer chain beyond what the interrupt controller requires. That restraint is not an accident of being new — it is the load-bearing design decision the whole microVM argument rests on. But to appreciate what was removed, you have to see what was not.

Three VMMs define the space around Firecracker: QEMU, the full-featured baseline that emulates a vast range of real and virtual machines across many instruction sets; crosvm, the Google ChromeOS VMM from which Firecracker's codebase was directly forked; and Cloud Hypervisor, the modern Rust sibling targeting general-purpose cloud workloads. None of these is better than the others in any absolute sense. Each embodies a specific tradeoff between feature breadth and attack surface, and understanding that tradeoff is what makes Firecracker's six devices legible as engineering rather than as minimalism for its own sake.


The Shared Substrate

All three Rust VMMs are connected through the rust-vmm umbrella project, as consumers of its crates, as sources of its code, or both, and that common ancestry shapes what is comparable between them.

rust-vmm was founded in December 2018 when engineers from Amazon, Google, Intel, and Red Hat began extracting core virtualization code from crosvm and Firecracker into shared crates. Firecracker itself had already been forked from crosvm's codebase before Amazon open-sourced it on November 27, 2018. The first shared artifact was the vm-memory crate, whose initial commit (83f61119 on GitHub) was derived simultaneously from crosvm commit 186eb8b0 and Firecracker commit 80128ea6. Today rust-vmm hosts more than 30 crates; current contributors include Alibaba, AWS, Intel, Google, Linaro, and Red Hat.

The crates that appear in VMM comparisons throughout this chapter are: kvm-ioctls (v0.25.0), which provides safe Rust wrappers over the three-level KVM file descriptor API; kvm-bindings (v0.14.1), the FFI layer beneath it; vm-memory (v0.17.1), which decouples memory consumers from memory providers through GuestMemoryMmap and GuestAddress; linux-loader (v0.13.2) for loading ELF, bzImage, and PE kernels into guest memory; seccompiler (v0.5.0) for compiling and installing per-thread BPF seccomp filters; and virtio-queue (v0.17.0), the split-ring virtqueue implementation that every virtio device in the Rust VMMs drives. Cloud Hypervisor's Cargo.toml workspace manifest pins all of these versions explicitly. Firecracker uses a subset. crosvm, as of the last documentation published on chromium.googlesource.com, had not formally consumed rust-vmm crates in production, citing friction around non-Linux OS support for kvm-bindings and vmm-sys-util — though that document dates to late 2020 and may no longer reflect the current codebase.

The shared substrate means that a virtqueue, a guest memory region, or a KVM ioctl wrapper follows the same design across all three Rust VMMs, whether a given VMM consumes the published crate or carries the ancestral code the crate was extracted from. The differences lie above the crate boundary: which devices each VMM implements, how it isolates them, and what workloads it is designed to serve.


QEMU: The Full-Featured Baseline

Fabrice Bellard published QEMU's first preview in 2003. It is a hosted hypervisor and machine emulator written in C, licensed under GPL-2.0-only for its core, and it remains the single tool in this space that can boot a guest of one supported ISA on a host of a different ISA without hardware assistance. That capability explains everything that follows.

QEMU runs in two fundamentally different modes. In TCG mode (Tiny Code Generator), a software JIT translates guest basic blocks into host instructions at runtime, with no kernel support required. This is how QEMU can emulate a MIPS board on an x86-64 laptop. In KVM mode, selected with -accel kvm or -enable-kvm, QEMU delegates vCPU execution to the kernel's kvm.ko module and achieves near-native speed. TCG is not involved in that mode: when a guest access to an emulated device causes a VM exit that the kernel cannot handle itself, control returns to QEMU's userspace device models, which are ordinary C code. Other acceleration backends exist (Hypervisor.Framework on macOS via -accel hvf, WHPX on Windows, MSHV for Hyper-V), but KVM is the standard production path on Linux.

The device model is where QEMU's scope becomes concrete. The hw/ directory in the QEMU source tree contains approximately 70 subdirectories covering device families: acpi, block, char, display, dma, i386, ide, input, net, nvme, pci, scsi, timer, tpm, usb, vfio, virtio, and dozens more from 9pfs to xen and xtensa. QEMU's own security documentation identifies the primary attack surface as "emulated devices" and lists as untrusted inputs: guest code, VNC and SPICE connections, NBD and live-migration network protocols, user-supplied disk images, device trees, and PCI and USB passthrough devices. The breadth of that list is not incidental — it follows directly from the breadth of what QEMU emulates.

Even when KVM is doing the vCPU work, a QEMU process in default x86-64 mode emulates a full PCI bus hierarchy, ISA devices (the i8254 PIT, i8257 DMA controller, CMOS/RTC, i8259 PIC), USB host controllers, VGA and display adapters, sound cards, and the BIOS or OVMF firmware path. Each emulated device is code reachable from the guest over a hardware interface. Every byte the guest can write to a device register is a potential attack surface. This is not a critique of QEMU — it is the inevitable cost of universal compatibility.

QEMU's security policy at qemu.org/docs/master/system/security.html describes isolation options: SELinux, AppArmor, resource limits, cgroups, Linux namespaces, and seccomp via --sandbox. The key word is "options": seccomp is not on by default. Hardening QEMU for multi-tenant use requires explicit, per-deployment configuration; the tool does not arrive hardened.

The microvm Machine Type

QEMU does contain one concession to the minimalist camp: a machine type called microvm, selected with -machine microvm. This machine type strips away PCI, ACPI, and most legacy hardware, exposing up to 8 virtio-MMIO devices, one optional ISA serial port, LAPIC, IOAPIC, kvmclock, and fw_cfg. It does not support device hotplug or live migration across QEMU versions.

The microvm machine type is useful as a side-by-side comparison with Firecracker: same virtio-MMIO transport, similarly stripped device set. The difference is that it still runs inside QEMU's full codebase — the TCG JIT, the multi-ISA emulation engine, all 70 device family subdirectories — even when none of them are exposed to a given guest. The attack surface of the process includes code that is present but not reachable through the current machine type's configuration. Firecracker's device count is low because the code for everything else was never written, not because it is compiled out.

Intel recognized this gap. Its NEMU project (github.com/intel/nemu, archived April 14, 2021) was an attempt to strip QEMU specifically for cloud workloads. The archived repository's own README now redirects visitors: "Cloud Hypervisor is the successor." That sentence closes one lineage and opens the next.


Cloud Hypervisor: The Modern rust-vmm Sibling

Cloud Hypervisor grew from NEMU's successor effort and was relaunched under the cloud-hypervisor GitHub organization at v0.4.0. It is now governed by the Linux Foundation as "a Series of LF Projects, LLC." Its README.md states plainly that "a large part of the Cloud Hypervisor code is based on either the Firecracker or the crosvm project's implementations." Supporting organizations include Alibaba, AMD, Ampere, ARM, ByteDance, Cyberus Technology, Intel, Microsoft, SAP, and Tencent Cloud. The current release as of this writing is v52.0, released May 14, 2026, following an approximately monthly cadence that has been in place since v15.0.

Two Hypervisor Backends

Cloud Hypervisor runs on two backends compiled into a single binary with run-time detection, a capability introduced at v26.0. The primary backend is KVM on Linux. The second is MSHV — the Microsoft Hypervisor interface, used on Azure hosts and Windows Hyper-V. This dual-backend binary stands in contrast to Firecracker, which is KVM-only, and to crosvm, which supports KVM plus Gunyah, GenieZone, and Halla for Android hardware, plus WHPX and HAXM on Windows.

The minimum recommended host kernel for Cloud Hypervisor is 5.13 for required KVM functionality; CI runs against 5.15. Supported architectures are x86-64 (primary), AArch64 (primary, requiring GICv3), and riscv64 (experimental). Supported guest operating systems are 64-bit Linux and Windows 10 / Windows Server 2019 — the Windows guest support alone distinguishes Cloud Hypervisor from both Firecracker and crosvm in a meaningful way for enterprise workloads.

The Device Model

The most structurally important decision in Cloud Hypervisor's device model is transport: all virtio devices use virtio-PCI exclusively. virtio-MMIO was removed at v0.11.0 to simplify the code and reduce the testing burden. Firecracker uses virtio-MMIO throughout. This single decision means that Cloud Hypervisor requires a guest kernel with PCI support and a host that can set up a PCI bus in the VM, whereas Firecracker's guests need no PCI driver at all.

The built-in virtio device set, as of v52.0, spans ten devices: virtio-blk (default io_uring backend since v0.11.0), virtio-console, virtio-net (multi-queue and multi-threaded since v0.5.0, with rate limiting), virtio-pmem, virtio-rng, virtio-vsock (forked from Firecracker's vsock implementation, as acknowledged in docs/device_model.md), virtio-iommu, virtio-mem (the virtio 1.2 memory device for hotplug, since v0.7.0), virtio-balloon (with free-page reporting since v22.0), and virtio-watchdog (experimental, since v0.11.0).

Beyond the built-in virtio devices, Cloud Hypervisor supports four vhost-user offload backends: vhost-user-blk for high-performance block via SPDK, vhost-user-net for DPDK-backed networking, vhost-user-fs for shared filesystems via virtiofsd, and vhost-user-generic (added in v52.0), which allows arbitrary backends without requiring the VMM to know the device type. Each of these runs in a separate process; the VMM and the backend communicate over a UNIX socket.

Emulated legacy devices are kept deliberately narrow: a 16550A serial port on x86-64 (PL011 UART on AArch64), RTC/CMOS, I/O APIC, i8042, ARM PL061 GPIO (for AArch64 shutdown), and an ACPI device as the default shutdown and reboot path. No emulated e1000, no IDE controller, no PS/2 bus, no i8254 PIT by default. This is the "wider than Firecracker, narrower than QEMU" line Cloud Hypervisor draws.

VFIO passthrough of physical PCI and PCIe devices has been available since v0.1.0, with hotplug since v0.6.0. Version 52.0 added support for the modern iommufd/vfio-cdev interface introduced in Linux 6.6, enabling selective BAR mapping, sub-page BAR expansion, MSI-X synchronization, and lazy GSI allocation.

Memory and CPU Architecture

Cloud Hypervisor's memory configuration is richer than Firecracker's precisely because its target workload is longer-lived and more varied. The --memory flag accepts fields including size, mergeable (KSM), hugepages, prefault (using MAP_POPULATE), shared (using mmap(MAP_SHARED)), hotplug_method (either acpi or virtio-mem), and hotplug_size. ACPI hotplug increments must be multiples of 128 MiB; virtio-mem carries no such alignment constraint. Multiple PCIe segments are available since v20.0, configured via --platform num_pci_segments=<N>,iommu_segments=<range>.
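The ACPI alignment rule is simple enough to check before a resize request ever reaches the VMM. The following is a minimal sketch rather than Cloud Hypervisor code: the helper name check_hotplug_size is hypothetical, and only the 128 MiB rule and the two method names come from the paragraph above.

```python
MIB = 1024 * 1024
ACPI_HOTPLUG_ALIGN = 128 * MIB  # ACPI increments must be multiples of 128 MiB


def check_hotplug_size(size_bytes: int, method: str) -> bool:
    """Return True if size_bytes is an acceptable hotplug increment.

    Hypothetical helper. 'acpi' enforces the 128 MiB multiple described in
    the text; 'virtio-mem' is treated as free of that particular rule. Any
    other constraint a real VMM applies is not modeled here.
    """
    if size_bytes <= 0:
        return False
    if method == "acpi":
        return size_bytes % ACPI_HOTPLUG_ALIGN == 0
    if method == "virtio-mem":
        return True
    raise ValueError(f"unknown hotplug method: {method!r}")
```

A caller would run this check on the requested increment and reject the request early, instead of discovering the misalignment from a VMM error.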

The virtio-iommu device provides a paravirtualized IOMMU that eliminates shadow page table complexity. When a physical IOMMU and a VFIO device are both present, DMA remapping tables are updated via VFIO whenever the guest updates its mappings, enabling nested VFIO passthrough; hugepages reduce IOMMU mapping overhead substantially in this configuration.

vCPU topology follows a four-level model: threads:cores:dies:packages, defaulting to 1:1:1:1. vCPU hotplug is supported by configuring max greater than boot at startup and then onlining CPUs in the guest via /sys/devices/system/cpu/cpu*/online. AMX support for x86 was added at v23.0. SMT side-channel mitigation is handled with the core_scheduling option (added v52.0), which supports modes vm, vcpu, and off.

Version 52.0 also added KVM SEV-SNP support for confidential VMs, using KVM_CREATE_GUEST_MEMFD for private guest memory. Firmware is packaged as IGVM; the kernel, command line, and initrd are included in the launch measurement.

Live Migration and Snapshot

Cloud Hypervisor is designed for the full lifecycle of a cloud VM, which includes moving it. Live migration uses a UNIX socket for local transfer or TCP for remote; TLS and mTLS are available on the TCP path. Version 52.0 added multi-connection TCP (1 to 128 parallel connections) to saturate high-bandwidth links. Migration supports both precopy (the default) and postcopy modes. Protocol versioning is strict: each release sends its current version number and accepts the immediately preceding version; jumping multiple releases requires stepping through intermediate versions. Default maximum downtime is 300 ms; timeout is 3600 s.
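The versioning rule has a practical consequence for fleet upgrades, which a few lines of code make explicit. This is a sketch under stated assumptions: protocol versions are modeled as plain integers, the function names are hypothetical, and rejecting a sender that is newer than the receiver is an assumption of the sketch rather than something the text above states.

```python
def receiver_accepts(sender_version: int, receiver_version: int) -> bool:
    """True if the receiver accepts a migration stream from the sender.

    Encodes the rule from the text: a release accepts its own version and
    the immediately preceding one. Everything else is treated as rejected.
    """
    return sender_version in (receiver_version, receiver_version - 1)


def upgrade_hops(current: int, target: int) -> list[int]:
    """Versions to step through, in order, to move from current to target."""
    if target < current:
        raise ValueError("downgrades are not modeled in this sketch")
    return list(range(current + 1, target + 1))
```

Under these assumptions, a host three releases behind cannot migrate its guests straight to the newest release; upgrade_hops lists the intermediate versions it has to pass through first.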

Snapshots are written to a directory containing three files: config.json (the full VM configuration, human-readable), memory-ranges (raw guest RAM), and state.json (per-component state). The ondemand restore mode, added in v52.0, uses userfaultfd to fault pages in lazily, reducing time-to-first-instruction on restore. VFIO devices are excluded from snapshot and restore.

None of Firecracker's published specifications mention live migration. The workloads Firecracker serves — serverless function invocations lasting tens to hundreds of milliseconds — do not need it. The workloads Cloud Hypervisor serves — persistent cloud VMs, containers-as-VMs, Windows guests, confidential compute instances — do.


crosvm: The ChromeOS Ancestor

Google built crosvm for ChromeOS's Crostini Linux container runtime and for the Android guest (ARCVM). crosvm's README states that Firecracker "used [crosvm] as the basis for their own VMM," and the history of the vm-memory crate confirms it: the first rust-vmm artifact was derived from both codebases simultaneously in December 2018. crosvm has since expanded to Android's TerminalApp, Cuttlefish (Google's virtual Android device platform), and Windows hosts.

Process-Per-Device Isolation

crosvm's defining architectural decision — the one that distinguishes its security model most sharply from Firecracker's — is its process-per-device sandbox model. Each virtio device backend runs in a sandboxed child process, not as a thread within the main VMM process. The main VMM forks and jails each device process using minijail, a Google library that wraps Linux namespaces and seccomp-BPF.

Each jailed device process receives three layers of containment: namespace isolation (VFS, PID, user, and network namespaces, with pivot_root changing the process's root filesystem); a per-device-type BPF seccomp policy from jail/seccomp/{arch}/{device}.policy; and Linux capability dropping. The main process retains PciRoot coordination; jailed device processes communicate back via VM control sockets and shared GuestMemory.

flowchart TD
    main["crosvm main process\n(PciRoot, vCPU threads)"]
    net["virtio-net process\nminijail + seccomp policy:\nnet.policy"]
    blk["virtio-blk process\nminijail + seccomp policy:\nblk.policy"]
    gpu["virtio-gpu process\nminijail + seccomp policy:\ngpu.policy"]
    snd["virtio-snd process\nminijail + seccomp policy:\nsnd.policy"]
    main -->|"VM control socket\n+ shared GuestMemory"| net
    main -->|"VM control socket\n+ shared GuestMemory"| blk
    main -->|"VM control socket\n+ shared GuestMemory"| gpu
    main -->|"VM control socket\n+ shared GuestMemory"| snd

The blast-radius argument here is different from Firecracker's. In Firecracker, a compromised device path is contained by a per-thread seccomp-BPF filter inside a single monolithic process — the attacker is in the same process but limited in which syscalls they can make. In crosvm, a compromised virtio-net backend is confined to a separate process with its own namespace and its own seccomp policy; to affect the VMM main process, an attacker must additionally escape the process boundary. The tradeoff is that process-per-device adds IPC overhead and complexity in the main process's PciRoot coordination, costs that Firecracker avoids by accepting the monolithic model.
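The process boundary itself can be shown in a few lines. The sketch below illustrates the shape of the model only and is not crosvm's implementation: it forks one child to stand in for a device process and talks to it over a UNIX socket pair. The function names are hypothetical, the choice of SOCK_SEQPACKET is an assumption, and the jailing steps (namespaces, capability dropping, seccomp) are marked by a comment rather than performed.

```python
import os
import socket


def run_device_in_child(handler):
    """Fork a child that serves one request over a socket pair, then exits."""
    parent_sock, child_sock = socket.socketpair(
        socket.AF_UNIX, socket.SOCK_SEQPACKET
    )
    pid = os.fork()
    if pid == 0:
        # Child: the stand-in "device process". A real VMM would enter new
        # namespaces, drop capabilities, and install a seccomp policy here.
        parent_sock.close()
        try:
            request = child_sock.recv(4096)
            child_sock.send(handler(request))
        finally:
            child_sock.close()
            os._exit(0)  # never return into the parent's code path
    # Parent: keep only its own end of the control socket.
    child_sock.close()
    return pid, parent_sock


def main_process_roundtrip(message: bytes) -> bytes:
    """Send one control message to the device process and return the reply."""
    pid, sock = run_device_in_child(lambda req: b"ack:" + req)
    try:
        sock.send(message)
        reply = sock.recv(4096)
    finally:
        sock.close()
        os.waitpid(pid, 0)  # reap the child
    return reply
```

Even in this toy form the cost is visible: every interaction between the main process and the device crosses a socket, which is the IPC overhead the monolithic model avoids.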

Hypervisor Backends

crosvm runs on more hypervisor backends than either Firecracker or Cloud Hypervisor: KVM on Linux (primary), Gunyah, GenieZone, and Halla on Linux and Android hardware, plus WHPX and HAXM on Windows. This breadth reflects crosvm's deployment reality — Android hardware may run non-KVM hypervisors, and the same crosvm codebase must function across all of them.

Device Set

crosvm's device scope follows its workload: a ChromeOS or Android desktop environment, not a serverless function. The virtio device list includes virtio-blk (raw, QCOW2, zstd, and Android sparse image formats), virtio-net (vhost and slirp backends), virtio-vsock, virtio-gpu (2D, virglrenderer 3D, gfxstream, and Vulkan), virtio-snd (CRAS and AAudio backends), virtio-fs (FUSE), virtio-9p, virtio-input, virtio-balloon, virtio-console, virtio-rng, virtio-iommu, virtio-tpm, virtio-pmem, and experimental virtio-video and virtio-scsi. Emulated legacy devices include CMOS/RTC, i8042, serial (I/O port), and xHCI USB passthrough.

Wayland display forwarding operates through virtio-gpu cross-domain mode and requires a guest Linux kernel at version 5.16 or later with CONFIG_DRM_VIRTIO_GPU enabled.

The devices Firecracker deliberately excludes — virtio-gpu, audio, 9P filesystem sharing, USB passthrough, and display forwarding — are exactly the devices crosvm includes as first-class features. Neither list is wrong; they reflect genuinely different user-visible workloads.


Scope, Attack Surface, and Intended Workload

The four VMMs arrange into a clear gradient when you hold them against the same axes: how many device types each exposes, how it isolates them, and what workload justifies the design.

flowchart LR
    FC["Firecracker\n6 devices\nper-thread seccomp\nno PCI\nserverless"]
    CH["Cloud Hypervisor\n10+ virtio + 4 vhost-user\nvirtio-PCI\nper-thread seccomp\ncloud VMs"]
    CV["crosvm\n~16 virtio + xHCI USB\nprocess-per-device\nmulti-backend hypervisor\nclient / desktop"]
    QE["QEMU\n~70 hw/ subdirectories\noptional --sandbox\nmulti-ISA emulation\ndevelopment / compat"]
    FC --- CH --- CV --- QE

Firecracker draws the smallest perimeter: six devices confirmed in the FAQ and design document, no PCI bus, no GPU, no audio, no display. The measurable results are stated in SPECIFICATION.md and enforced in CI: VMM startup to API socket in at most 8 CPU milliseconds, guest /sbin/init from InstanceStart in at most 125 milliseconds, and no more than 5 MiB of VMM memory overhead per microVM at 1 vCPU and 128 MiB of guest RAM. Guest CPU performance stays above 95 percent of bare metal; network throughput reaches 14.5 Gbps at 80 percent host CPU utilization. Max vCPUs per microVM is 32. The narrow device list is what makes those bounds achievable and verifiable in CI.

The seccomp posture is the other half of Firecracker's security argument. The three thread types — the API thread, the VMM thread, and one vCPU thread per guest CPU — each run under a separate BPF filter, installed before any guest code executes. The API thread's filter allows exactly FIONBIO among ioctls. Selected ioctls the VMM thread's filter allows include KVM_SET_USER_MEMORY_REGION, KVM_IOEVENTFD, KVM_IRQFD, TUNSETIFF, TUNSETOFFLOAD, TUNSETVNETHDRSZ, KVM_GET_DIRTY_LOG, KVM_GET_IRQCHIP, KVM_GET_CLOCK, and KVM_GET_PIT2 (the last three enable snapshot and restore of interrupt-controller and timer state). The vCPU thread's filter allows KVM_RUN and GET ioctls for registers and CPU state; the full enumeration is in the seccomp JSON files rather than reproduced here, because the set changes across releases. These filters are compiled into the binary for exactly two target triples: x86_64-unknown-linux-musl and aarch64-unknown-linux-musl. The files live at resources/seccomp/x86_64-unknown-linux-musl.json and resources/seccomp/aarch64-unknown-linux-musl.json.
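The per-thread structure is easier to see as data. What follows is a toy model, not Firecracker's filter format: it reduces each filter to a set of ioctl names taken from the paragraph above and applies a deny-by-default lookup. The thread keys and the function name are hypothetical, the vCPU entry is deliberately incomplete, and the real filters match numeric syscall arguments in BPF inside the kernel.

```python
# Illustrative subset only. The names come from the paragraph above; the
# JSON files in the Firecracker repository are the authoritative source.
ALLOWED_IOCTLS = {
    "api": {"FIONBIO"},
    "vmm": {
        "KVM_SET_USER_MEMORY_REGION", "KVM_IOEVENTFD", "KVM_IRQFD",
        "TUNSETIFF", "TUNSETOFFLOAD", "TUNSETVNETHDRSZ",
        "KVM_GET_DIRTY_LOG", "KVM_GET_IRQCHIP", "KVM_GET_CLOCK",
        "KVM_GET_PIT2",
    },
    "vcpu": {"KVM_RUN"},  # plus the GET ioctls, omitted in this sketch
}


def ioctl_allowed(thread: str, request: str) -> bool:
    """Deny by default: allow only requests listed for this thread type."""
    return request in ALLOWED_IOCTLS.get(thread, set())
```

The property worth noticing is that the sets barely overlap: a request that is routine on the VMM thread, such as TUNSETIFF, is simply absent from the vCPU thread's set.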

Cloud Hypervisor takes a wider stance on every axis: ten built-in virtio devices, four vhost-user offload backends, VFIO passthrough, Windows guest support, live migration, vCPU hotplug, memory hotplug, and confidential VM support via KVM SEV-SNP. Its seccomp posture uses the seccompiler rust-vmm crate for per-thread filters, but the set of allowed operations is necessarily larger because the device model is larger. The workloads that justify the additional surface area (Windows guests, persistent cloud VMs, VFIO-attached hardware, confidential compute) cannot fit inside Firecracker's perimeter.

crosvm trades off differently again. Its device set is the widest of the three Rust VMMs — virtio-gpu with Vulkan, virtio-snd, Wayland forwarding, xHCI USB, 9P and virtio-fs — but it contains the blast radius of any single compromised device through process isolation rather than through device omission. The seccomp policies at jail/seccomp/{arch}/{device}.policy are per-device-type rather than per-thread, because each device lives in its own process. A compromised virtio-gpu backend is confined to the GPU policy and the GPU process's namespaces; it does not directly threaten the VMM main process. crosvm's threat model is "the guest is interactive and relatively trusted; the device boundary is where you isolate against bugs" — a different assumption from Firecracker's "the guest is hostile and the perimeter is the whole VMM."

QEMU maximizes compatibility. The hw/ subdirectory count of approximately 70 is the structural proxy for attack surface: each subdirectory is a family of emulated hardware, each family is code that some guest configuration can reach, and a guest in QEMU's default configuration can reach many of them. Hardening requires explicit opt-in (--sandbox on, an AppArmor profile, cgroup limits), and even with all of these applied, the underlying device model is present in the process. QEMU is the right tool for development, cross-ISA testing, CI environments, and any workload that requires hardware you cannot buy; it is the wrong starting point for a multi-tenant production deployment of untrusted code without a significant hardening investment.

The KVM Interface Each VMM Uses

Firecracker and Cloud Hypervisor reach the kernel through the kvm-ioctls crate; crosvm, which per the documentation cited earlier may not consume the rust-vmm crates, wraps the same ioctls in its own code, and QEMU uses its own C wrappers. The KVM API itself is the same three-level file descriptor interface for all of them: a /dev/kvm fd for system-level operations, a VM fd created with KVM_CREATE_VM, and per-vCPU fds created with KVM_CREATE_VCPU. VmFd in kvm-ioctls exposes KVM_SET_USER_MEMORY_REGION, KVM_CREATE_IRQCHIP, KVM_IOEVENTFD, KVM_IRQFD, KVM_GET_DIRTY_LOG, and KVM_CREATE_GUEST_MEMFD (for confidential VM private memory), among others. VcpuFd exposes KVM_RUN, KVM_GET_REGS, KVM_SET_REGS, KVM_GET_SREGS, KVM_SET_CPUID2, KVM_GET_LAPIC, KVM_SET_LAPIC, KVM_GET_MSRS, KVM_SET_MSRS, KVM_GET_XSAVE, KVM_GET_NESTED_STATE, and more.
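The three levels can be walked by hand with nothing but the standard library, which shows how thin the layer beneath any VMM's wrappers is. This is a sketch, not how any of the VMMs is written. It assumes an x86-64 or AArch64 Linux host, where the _IO macro carries no direction bits, and the function name probe_kvm is hypothetical. The four request numbers follow from the kernel's KVM header.

```python
import fcntl
import os

KVMIO = 0xAE  # ioctl "type" byte used by the KVM API


def _io(type_: int, nr: int) -> int:
    """Linux _IO(type, nr) on x86-64 and AArch64: no direction, no size."""
    return (type_ << 8) | nr


KVM_GET_API_VERSION = _io(KVMIO, 0x00)  # system ioctl on /dev/kvm
KVM_CREATE_VM = _io(KVMIO, 0x01)        # system ioctl, returns a VM fd
KVM_CREATE_VCPU = _io(KVMIO, 0x41)      # VM ioctl, returns a vCPU fd
KVM_RUN = _io(KVMIO, 0x80)              # vCPU ioctl (defined, not called)


def probe_kvm(path: str = "/dev/kvm"):
    """Walk system fd -> VM fd -> vCPU fd once and return the API version.

    Returns None when KVM cannot be opened or used, for example when the
    device node is missing or the caller lacks permission.
    """
    fds = []
    try:
        kvm_fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
        fds.append(kvm_fd)
        version = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
        vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)     # machine type 0
        fds.append(vm_fd)
        vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)  # vCPU id 0
        fds.append(vcpu_fd)
        return version
    except OSError:
        return None
    finally:
        for fd in reversed(fds):
            os.close(fd)
```

On a host with KVM available, the returned value should be the stable API version that the kernel's KVM documentation specifies, which is 12. The sketch stops short of KVM_RUN because running a vCPU first requires guest memory and register setup.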

Note: Opening /dev/kvm requires membership in the kvm group on most Linux distributions, or root access. Any process that holds a VM fd or a vCPU fd can issue every ioctl that fd supports, so each handle is a direct path into the host kernel's KVM code. Run VMMs as non-root, apply seccomp, and restrict /dev/kvm permissions before exposing any VMM to untrusted guest workloads.

Each VMM uses a subset of the available ioctls. Firecracker's seccomp filter is the most explicit statement of which subset: it is the list of ioctls the process is allowed to call at all, enforced by the kernel's BPF machinery, not merely by the code paths the VMM happens to exercise. Cloud Hypervisor and crosvm do not publish an equivalent enumeration in a single file; their allowed sets are wider because their device models demand it.

A Note on the virtio 1.2 Registry

The OASIS virtio 1.2 specification defines 19 device types by ID; the IDs are not contiguous, which is why values above 19 appear below. Comparing each VMM against that registry makes the gaps visible at a glance. Firecracker implements IDs 1 (Network), 2 (Block), 5 (Balloon, since v0.24.0), and 19 (vsock); Cloud Hypervisor adds IDs 3 (Console / virtio-console), 4 (RNG), 23 (IOMMU), 24 (Memory / virtio-mem), 26 (File System, via vhost-user-fs), and 27 (PMEM); crosvm adds IDs 16 (GPU), 18 (Input), 25 (Sound), 26 (File System), and more. IDs 8 (SCSI), 20 (Crypto), and 29 (reserved in the ID table for the mac80211 hwsim wireless simulation device) appear in none of the three Rust VMMs' stable device sets as of this writing.
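Treating the registry as a coordinate system means the comparison can be computed rather than eyeballed. The sketch below encodes only the IDs named in the paragraph above, so the registry is a subset and the crosvm entry is knowingly partial (the text says "and more"). The dictionary and function names are hypothetical.

```python
# Subset of the virtio device ID registry: only IDs discussed in the text.
VIRTIO_IDS = {
    1: "Network", 2: "Block", 3: "Console", 4: "Entropy (RNG)",
    5: "Balloon", 8: "SCSI host", 16: "GPU", 18: "Input",
    19: "Socket (vsock)", 20: "Crypto", 23: "IOMMU", 24: "Memory",
    25: "Sound", 26: "File system", 27: "PMEM",
}

FIRECRACKER = {1, 2, 5, 19}

# Per-VMM sets as described in the paragraph; crosvm's is incomplete.
IMPLEMENTED = {
    "Firecracker": FIRECRACKER,
    "Cloud Hypervisor": FIRECRACKER | {3, 4, 23, 24, 26, 27},
    "crosvm": FIRECRACKER | {16, 18, 25, 26},
}


def missing_everywhere() -> list[int]:
    """IDs in this registry subset that no listed VMM implements."""
    covered = set().union(*IMPLEMENTED.values())
    return sorted(set(VIRTIO_IDS) - covered)


def describe(device_id: int) -> str:
    """Look up a device ID, falling back to a neutral label."""
    return VIRTIO_IDS.get(device_id, "not in this subset")
```

With the data above, missing_everywhere() yields the SCSI and Crypto IDs, matching the gap the paragraph identifies. Adding a device to one of the sets updates the comparison without touching the logic.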

The spec registry is useful not as a scorecard but as a stable coordinate system: when a new device type appears in a VMM, you can look up its ID and description in the OASIS document rather than treating it as a proprietary feature.


The next chapter turns back to Firecracker specifically and examines how its seccomp filters are constructed — the mechanism that makes the thread model's security promises concrete.


Sources And Further Reading