Chapter 9: Anatomy Of A VMM

Every VMM you will ever read (kvmtool, crosvm, Cloud Hypervisor, Firecracker) does the same five jobs, and the first three are finished before a single guest instruction retires. The wrappers differ, the languages differ, and the ambition differs, but the kernel interface is stable and narrow. Underneath each VMM is a handful of ioctl calls on three file descriptors, a shared page that ferries data across every VM-exit, and a thread for each guest CPU. The five jobs compose into every VMM, from the 200-line teaching example in LWN to the production runtime that boots AWS Lambda functions.

The KVM Fd Hierarchy

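KVM exposes a three-level file-descriptor hierarchy. Only the root, /dev/kvm, is a character device that you open by path; the VM and vCPU fds are anonymous file descriptors returned by ioctls on the level above. Each level accepts only the ioctls designed for it: send a vCPU ioctl to a system fd and the kernel rejects it with an error (typically ENOTTY or EINVAL, depending on the level and the architecture).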

The root is /dev/kvm. Opening it gives you a system fd whose most important job is creating VMs. Before doing that, a VMM should call KVM_GET_API_VERSION (_IO(0xAE, 0x00)) and verify the result is 12. That number has been fixed since Linux 2.6.22 (the 2.6.20 and 2.6.21 kernels reported earlier, undocumented versions) and is not expected to change; the KVM API documentation says applications should refuse to run on any other value.
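The version handshake is short enough to sketch in full. The helper below is a minimal illustration rather than code from any of the VMMs discussed: the name open_kvm_checked is hypothetical, and KVM_API_VERSION is the constant that linux/kvm.h defines as 12.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical helper: open /dev/kvm and refuse any API version other
 * than the stable one. Returns the system fd, or -1 on any failure. */
int open_kvm_checked(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) {
        perror("open /dev/kvm");
        return -1;
    }

    int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
    if (version != KVM_API_VERSION) {   /* KVM_API_VERSION is 12 */
        fprintf(stderr, "unexpected KVM API version %d\n", version);
        close(kvm);
        return -1;
    }
    return kvm;
}
```

On a host without KVM, or without permission to open the device, the helper simply returns -1, which is usually the right moment for a VMM to give up with a clear message.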

flowchart TD
    A["/dev/kvm (system fd)"]
    A -->|"KVM_GET_API_VERSION → 12"| A
    A -->|"KVM_CREATE_VM"| B["VM fd"]
    B -->|"KVM_SET_USER_MEMORY_REGION"| B
    B -->|"KVM_CREATE_VCPU"| C["vCPU fd"]
    C -->|"KVM_RUN"| C
    C -->|"KVM_GET_REGS / KVM_SET_REGS"| C
    C -->|"KVM_GET_SREGS / KVM_SET_SREGS"| C

KVM_CREATE_VM (_IO(0xAE, 0x01)) on the system fd yields a VM fd that represents the guest address space and its interrupt state. KVM_CREATE_VCPU (_IO(0xAE, 0x41)) on the VM fd yields a vCPU fd — one per guest core — that exposes the per-CPU register file and the interface for entering and exiting guest mode. The type byte 0xAE is the KVMIO constant; it appears in every KVM ioctl number by convention.
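To see where the 0xAE byte ends up, it can help to pull a request code apart. The sketch below assumes the standard Linux ioctl encoding, in which the command number occupies bits 0-7 and the type byte bits 8-15; the helper names are hypothetical and exist only for illustration.

```c
#include <stdio.h>
#include <sys/ioctl.h>

#define KVMIO_TYPE 0xAEu  /* same value as KVMIO in linux/kvm.h */

/* Command number: bits 0-7 of the request code. */
unsigned ioctl_nr(unsigned long req)
{
    return (unsigned)(req & 0xFF);
}

/* Type byte: bits 8-15 of the request code. */
unsigned ioctl_type(unsigned long req)
{
    return (unsigned)((req >> 8) & 0xFF);
}

/* Hypothetical helper: does this request code belong to the KVM family? */
int is_kvm_ioctl(unsigned long req)
{
    return ioctl_type(req) == KVMIO_TYPE;
}

/* Print the decoded fields of one request code. */
void describe_ioctl(const char *name, unsigned long req)
{
    printf("%-20s type=0x%02X nr=0x%02X\n",
           name, ioctl_type(req), ioctl_nr(req));
}
```

A check like is_kvm_ioctl is the kind of test a tracing tool or a syscall filter could apply to tell KVM requests apart from, say, terminal ioctls, whose type byte is 'T'.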

The hierarchy matters because it is also a privilege boundary. A process that holds only a VM fd cannot create further VMs, and one that holds only a vCPU fd cannot change the VM's memory slots. Firecracker builds its sandbox on the same idea of narrowing what each part of the process may do: the jailer binary sets up a chroot containing a /dev/kvm device node, drops privileges, and execs the VMM, which then installs seccomp filters so that each thread can issue only the system calls it needs.

Job 1: Allocate Guest Memory

The guest's physical address space is a lie. Every guest-physical address (GPA) is a host-virtual address (HVA) in disguise: the VMM mmaps an anonymous buffer, then registers a range of host virtual memory as a contiguous block of guest-physical memory. The hardware (Intel's EPT, AMD's NPT) translates GPA to HPA transparently during guest execution, using nested page tables that the kernel maintains in host memory and points the hardware at through the VMCS (Intel) or VMCB (AMD). The VMM never writes to those hardware tables directly; it only tells the kernel where its buffer lives.

The registration call is KVM_SET_USER_MEMORY_REGION (_IOW(0xAE, 0x46, struct kvm_userspace_memory_region)), a VM ioctl. The struct is compact:

struct kvm_userspace_memory_region {
    __u32 slot;             /* memory slot index */
    __u32 flags;            /* KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY */
    __u64 guest_phys_addr;  /* base GPA of this region */
    __u64 memory_size;      /* size in bytes; 0 deletes the slot */
    __u64 userspace_addr;   /* HVA: address of the mmap'd backing buffer */
};

The slot field is an index into the kernel's memory slot table. Each slot describes one contiguous GPA-to-HVA mapping; the kernel supports many slots simultaneously, which is how production VMMs carve out distinct regions for DRAM, ROM, and firmware while leaving MMIO holes uncovered by any slot. flags is almost always zero for a regular RAM slot; KVM_MEM_LOG_DIRTY_PAGES enables dirty-page tracking for live migration, and KVM_MEM_READONLY makes the guest unable to write the range.
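A VMM usually keeps its own copy of the slot table, because device emulation needs the same GPA-to-HVA translation in userspace. The following is a minimal sketch of that lookup; struct mem_slot and gpa_to_hva are hypothetical names, not part of the KVM API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VMM-side record of one registered slot. */
struct mem_slot {
    uint64_t guest_phys_addr;  /* base GPA of the slot */
    uint64_t memory_size;      /* size in bytes */
    uint8_t *host_base;        /* HVA of the backing buffer */
};

/* Translate a GPA to an HVA by scanning the slot table. Returns NULL
 * when [gpa, gpa + len) is not fully inside one slot, which a real VMM
 * would treat as an MMIO hole or an error. */
void *gpa_to_hva(const struct mem_slot *slots, size_t nslots,
                 uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < nslots; i++) {
        const struct mem_slot *s = &slots[i];
        if (gpa < s->guest_phys_addr)
            continue;
        uint64_t off = gpa - s->guest_phys_addr;
        /* Written this way to avoid overflow in off + len. */
        if (off < s->memory_size && len <= s->memory_size - off)
            return s->host_base + off;
    }
    return NULL;
}
```

A linear scan is adequate for a handful of slots; a VMM with many regions might prefer a sorted array and binary search, but the bounds check stays the same.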

The LWN reference VMM reduces this to its minimum: one mmap call, one slot:

void *mem = mmap(NULL, 0x1000,
    PROT_READ | PROT_WRITE,
    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

struct kvm_userspace_memory_region region = {
    .slot            = 0,
    .guest_phys_addr = 0x1000,
    .memory_size     = 0x1000,
    .userspace_addr  = (uint64_t)mem,
};
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);

GPA 0x1000 is chosen so the guest starts executing at CS:IP = 0x0000:0x1000 (flat physical 0x1000) in 16-bit real mode. One page, one slot, one shot.

Firecracker does the same thing structurally but at a different scale. build_microvm_for_boot() in src/vmm/src/builder.rs calls vm_resources.allocate_guest_memory() before anything else, then vm.register_dram_memory_regions(guest_memory), which wraps KVM_SET_USER_MEMORY_REGION in a loop over named slots. The guest-physical layout is defined in src/vmm/src/arch/x86_64/layout.rs: the kernel loads at HIMEM_START = 0x100000 (1 MB), boot_params lives at the zero page 0x7000, the command line sits at 0x20000, and the IOAPIC MMIO window opens at 0xFEC00000. The numbers are fixed constants in that file; every other component in Firecracker is built around them.

Job 2: Create vCPUs And Map kvm_run

KVM_CREATE_VCPU takes a single integer argument — the vCPU ID, which on x86 becomes the APIC ID. It returns a vCPU fd:

int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);

The fd is nearly useless until the VMM maps the kvm_run communication page. This is the mechanism by which guest exit state crosses the kernel–userspace boundary without a copy: the kernel writes exit_reason and the exit-specific data directly into a page that the VMM can also read, via a shared mapping:

int mmap_size = ioctl(kvmfd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, mmap_size,
    PROT_READ | PROT_WRITE, MAP_SHARED, vcpufd, 0);

KVM_GET_VCPU_MMAP_SIZE (_IO(0xAE, 0x04)) is called on the system fd, /dev/kvm, not on the vCPU fd. That surprises most first-time readers. The returned size is usually larger than sizeof(struct kvm_run) because the kernel stores transient per-vCPU data, such as the PIO data buffer, in the same mapping after the struct. MAP_SHARED is not optional: exit_reason is written by the kernel, not the guest, and a private mapping would give the VMM a copy-on-write view that can diverge from the page the kernel updates.
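The two calls above have no error handling, which is fine for a teaching example but easy to get wrong in practice. A hedged sketch of a checked version follows; map_kvm_run is a hypothetical helper name, and the sanity check against sizeof(struct kvm_run) is an extra precaution of this sketch rather than something the API requires.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Hypothetical helper: map the kvm_run region for one vCPU.
 * kvmfd is the system fd (/dev/kvm); vcpufd is the vCPU fd.
 * Returns NULL on failure; on success *len_out holds the mapping size. */
struct kvm_run *map_kvm_run(int kvmfd, int vcpufd, size_t *len_out)
{
    /* The size query goes to the system fd, not the vCPU fd. */
    int size = ioctl(kvmfd, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (size < 0 || (size_t)size < sizeof(struct kvm_run))
        return NULL;

    /* The mapping itself is made on the vCPU fd and must be shared. */
    void *p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, vcpufd, 0);
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)size;
    return p;
}
```

Keeping the length around matters at teardown, since munmap needs the same size that mmap was given.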

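struct kvm_run begins with fields the VMM writes before KVM_RUN, followed by fields the kernel fills in on each exit, starting with exit_reason. The listing below is abridged; the real union has many more members, including the internal struct that the run loop later in this chapter reads: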

struct kvm_run {
    /* inputs: VMM writes these before KVM_RUN */
    __u8  request_interrupt_window;
    __u8  immediate_exit;
    __u8  padding1[6];

    /* output: kernel writes this after each VM-exit */
    __u32 exit_reason;

    __u8  ready_for_interrupt_injection;
    __u8  if_flag;
    __u16 flags;
    __u64 cr8;
    __u64 apic_base;

    union {
        struct { /* KVM_EXIT_IO */
            __u8  direction;   /* 0 = IN, 1 = OUT */
            __u8  size;        /* 1, 2, or 4 bytes */
            __u16 port;
            __u32 count;
            __u64 data_offset; /* offset from kvm_run* to the data buffer */
        } io;
        struct { /* KVM_EXIT_MMIO */
            __u64 phys_addr;
            __u8  data[8];
            __u32 len;
            __u8  is_write;
        } mmio;
        struct { /* KVM_EXIT_FAIL_ENTRY */
            __u64 hardware_entry_failure_reason;
            __u32 cpu;
        } fail_entry;
        struct { /* KVM_EXIT_HYPERCALL */
            __u64 nr;
            __u64 args[6];
            __u64 ret;
        } hypercall;
        char padding[256];
    };
};

Setting immediate_exit to 1 before calling KVM_RUN makes the call return at once with EINTR instead of entering the guest. The kernel polls the field once, when KVM_RUN starts, so writing it from another thread has no effect on a vCPU that is already in guest mode; that case needs a signal as well, as the section on interrupting a running vCPU explains. The request_interrupt_window field asks the kernel to return as soon as the guest can accept an external interrupt, which is how a VMM that injects interrupts itself (via KVM_INTERRUPT) avoids racing against a window where IF = 0.

Job 3: Load The Guest

With memory registered and a vCPU created, the VMM has to put executable code into guest memory and configure the CPU state to match. This job splits naturally into two parts: writing the guest image into the GPA range and setting the register file via KVM_SET_REGS and KVM_SET_SREGS.

The register setup is architecture-specific and fiddly. The LWN minimal VMM boots a tiny flat binary in real mode and sets registers directly:

struct kvm_sregs sregs;
ioctl(vcpufd, KVM_GET_SREGS, &sregs);
sregs.cs.base     = 0;
sregs.cs.selector = 0;
ioctl(vcpufd, KVM_SET_SREGS, &sregs);

struct kvm_regs regs = {
    .rip    = 0x1000,
    .rax    = 2,
    .rbx    = 2,
    .rflags = 0x2,     /* bit 1 is architecturally reserved-set */
};
ioctl(vcpufd, KVM_SET_REGS, &regs);

rflags = 0x2 is not a quirk — x86 defines bit 1 of EFLAGS as permanently reserved-set. Failing to set it causes KVM_EXIT_FAIL_ENTRY before the first instruction retires.

A production VMM targeting a Linux guest does not poke registers in real mode. It implements the Linux/x86 boot protocol, current version 2.15 (introduced in kernel 5.5). The protocol specifies a boot_params structure (the "zero page") placed at a well-known GPA, populated with a setup_header that begins at offset 0x01F1 in the kernel image. The VMM must write 0xAA55 at boot_params[0x1FE] (the boot_flag), 0x53726448 ("HdrS") at 0x202, 0xFF in type_of_loader, and a physical pointer to the kernel command line in cmd_line_ptr. For bzImage format (boot protocol >= 2.00), the protected-mode kernel body loads at physical address 0x100000 when LOADED_HIGH (bit 0 of loadflags) is set; the 32-bit entry point expects CS = 0x10 (4 GB flat), DS = ES = SS = 0x18, %esi pointing to boot_params, paging off, and interrupts off.
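The four mandatory fields are easiest to understand as byte offsets into the zero page. The sketch below writes them into a 4 KiB buffer using the offsets from the boot protocol (0x1FE, 0x202, 0x210, 0x228); fill_boot_params and the macro names are hypothetical, and zeroing the whole page first is a simplification, since a real loader typically starts from the setup_header found in the kernel image and then patches it.

```c
#include <stdint.h>
#include <string.h>

#define ZERO_PAGE_SIZE   4096
#define OFF_BOOT_FLAG    0x1FE  /* u16: 0xAA55 */
#define OFF_HEADER       0x202  /* u32: "HdrS" */
#define OFF_TYPE_LOADER  0x210  /* u8:  0xFF = undefined loader */
#define OFF_CMD_LINE_PTR 0x228  /* u32: GPA of the command line */

/* x86 is little-endian; write the bytes explicitly so the result does
 * not depend on the byte order of the machine running the VMM. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    put_le16(p, (uint16_t)(v & 0xFFFF));
    put_le16(p + 2, (uint16_t)(v >> 16));
}

/* Hypothetical helper: populate the minimum fields named in the text. */
void fill_boot_params(uint8_t zero_page[ZERO_PAGE_SIZE], uint32_t cmdline_gpa)
{
    memset(zero_page, 0, ZERO_PAGE_SIZE);
    put_le16(zero_page + OFF_BOOT_FLAG, 0xAA55);
    put_le32(zero_page + OFF_HEADER, 0x53726448);   /* "HdrS" */
    zero_page[OFF_TYPE_LOADER] = 0xFF;
    put_le32(zero_page + OFF_CMD_LINE_PTR, cmdline_gpa);
}
```

Called with cmdline_gpa = 0x20000, this matches the command-line address from Firecracker's layout quoted earlier; the VMM would then copy the page into guest memory at the zero-page GPA.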

kvmtool chooses 16-bit real mode. Its x86/kvm-cpu.c sets:

kvm_regs.rip    = arch.boot_ip;   /* must be <= 65535 */
kvm_regs.rsp    = arch.boot_sp;
kvm_regs.rbp    = arch.boot_sp;
kvm_regs.rflags = 0x0000000000000002ULL;

All segment registers (CS, SS, DS, ES, FS, GS) receive the same boot_selector, and the base for each is computed with the standard real-mode left-shift:

static inline uint32_t selector_to_base(uint16_t sel) {
    return (uint32_t)sel << 4;
}
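The reason rip must fit in 16 bits follows from how real mode forms an address: the CPU adds the 16-bit offset to the shifted selector. A small worked sketch, with hypothetical function names and selector values chosen purely for illustration:

```c
#include <stdint.h>

/* Real-mode segment base: selector * 16. */
uint32_t real_mode_base(uint16_t sel)
{
    return (uint32_t)sel << 4;
}

/* Linear address the CPU fetches from in real mode: base + 16-bit offset.
 * Because the offset is a uint16_t, an IP above 65535 cannot be expressed. */
uint32_t real_mode_linear(uint16_t sel, uint16_t off)
{
    return real_mode_base(sel) + off;
}
```

For example, selector 0x1000 with offset 0x0200 resolves to linear address 0x10200, and the highest address reachable this way is 0xFFFF:0xFFFF = 0x10FFEF.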

kvmtool also initializes MSRs via KVM_SET_MSRS (_IOW(0xAE, 0x89, struct kvm_msrs)): MSR_IA32_SYSENTER_CS/ESP/EIP are all zeroed, MSR_IA32_TSC is zeroed, and MSR_IA32_MISC_ENABLE has the FAST_STRING bit enabled. FPU state via KVM_SET_FPU (_IOW(0xAE, 0x8d, struct kvm_fpu)) is set to fcw = 0x37f and mxcsr = 0x1f80: the x87 control word as FNINIT leaves it, and the power-on default of the SSE control register.

Firecracker goes further still. src/vmm/src/arch/x86_64/regs.rs programs the vCPU into 64-bit long mode directly: CR0 has PE and PG set (long mode requires paging), CR4 has PAE set, the EFER MSR has both LME and LMA set, the CS descriptor is built from gdt_table[1] (GDT index 1), giving selector 0x08 (a 64-bit execute/read code segment with L = 1), and the GDT is loaded at GPA 0x500. Firecracker also supports PVH boot (32-bit protected mode with the PVH magic 0x336EC578), which skips real mode entirely and hands control to the kernel at a different entry point. The KVM_SET_CPUID2 (_IOW(0xAE, 0x90, struct kvm_cpuid2)) call configures which CPU features the guest sees, and KVM_SET_SIGNAL_MASK (_IOW(0xAE, 0x8b, struct kvm_signal_mask)) optionally controls which signals interrupt KVM_RUN on the vCPU thread.
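The long-mode register combination is exactly the kind of state that produces KVM_EXIT_FAIL_ENTRY when one bit is missing. Architecturally, EFER.LMA is only consistent when CR0.PE, CR0.PG, CR4.PAE, and EFER.LME are all set. The sketch below encodes that rule as a pre-flight check; the function name is hypothetical and the check is deliberately narrower than everything the hardware validates.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE   (1ull << 0)   /* protected mode enable */
#define CR0_PG   (1ull << 31)  /* paging enable */
#define CR4_PAE  (1ull << 5)   /* physical address extension */
#define EFER_LME (1ull << 8)   /* long mode enable */
#define EFER_LMA (1ull << 10)  /* long mode active */

/* Hypothetical sanity check a VMM could run before KVM_SET_SREGS when it
 * intends to start the vCPU directly in 64-bit long mode. Returns true
 * only if every bit that long mode depends on is present. */
bool long_mode_state_ok(uint64_t cr0, uint64_t cr4, uint64_t efer)
{
    bool paging = (cr0 & CR0_PE) != 0 && (cr0 & CR0_PG) != 0;
    bool pae    = (cr4 & CR4_PAE) != 0;
    bool lme    = (efer & EFER_LME) != 0;
    bool lma    = (efer & EFER_LMA) != 0;
    return paging && pae && lme && lma;
}
```

Catching the mistake in userspace gives a far better error message than decoding a hardware entry-failure reason after the fact.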

Job 4: Run The Loop

KVM_RUN (_IO(0xAE, 0x80)) is a no-argument vCPU ioctl. It blocks the calling thread, switches the physical CPU into VMX or SVM guest mode, and does not return until a VM-exit occurs. On success it returns 0; kvm_run->exit_reason identifies why the guest exited. On EINTR (a signal arrived before or during guest execution) it returns -1 with errno = EINTR.

Every production VMM is a direct elaboration of this loop:

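/* needs <err.h>, <errno.h>, <stdio.h>, <sys/ioctl.h>, <linux/kvm.h> */
while (1) {
    if (ioctl(vcpufd, KVM_RUN, NULL) == -1) {
        if (errno == EINTR)
            continue;                /* interrupted by a signal; re-enter */
        err(1, "KVM_RUN");
    }
    switch (run->exit_reason) {
    case KVM_EXIT_HLT:
        return 0;                    /* clean shutdown */
    case KVM_EXIT_IO:
        if (run->io.direction == KVM_EXIT_IO_OUT
            && run->io.size == 1
            && run->io.port == 0x3f8
            && run->io.count == 1)
            putchar(*(((char *)run) + run->io.data_offset));
        else
            errx(1, "unhandled KVM_EXIT_IO");
        break;
    case KVM_EXIT_FAIL_ENTRY:        /* errx() does not return */
        errx(1, "KVM_EXIT_FAIL_ENTRY: 0x%llx",
             (unsigned long long)run->fail_entry.hardware_entry_failure_reason);
    case KVM_EXIT_INTERNAL_ERROR:
        errx(1, "KVM_EXIT_INTERNAL_ERROR: suberror = 0x%x",
             run->internal.suberror);
    default:
        errx(1, "unhandled exit_reason = 0x%x", run->exit_reason);
    }
}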

KVM_EXIT_HLT (value 5) means the guest executed an HLT instruction and has no pending interrupt to wake it. For the minimal VMM this signals a clean exit; for a real VMM it means putting the vCPU thread to sleep until an interrupt arrives. KVM_EXIT_FAIL_ENTRY (value 9) is distinct and serious: KVM tried to enter guest mode but the hardware rejected the VMCS or VMCB state. fail_entry.hardware_entry_failure_reason encodes the hardware VM-exit reason that the processor reported when rejecting VM-entry, and it usually means the VMM set an illegal register combination during job 3. KVM_EXIT_INTERNAL_ERROR (value 17) means KVM itself detected an inconsistency; it is always fatal.

data_offset is a byte offset from the start of the kvm_run struct, not a pointer. The data buffer sits inside the same mmap'd region as the struct, past the fixed header (on x86 it is typically the page that follows). Casting to (char *)run + run->io.data_offset is correct; dereferencing run->io.data_offset as a pointer is a common first-time mistake that produces a segfault at a small address equal to the offset itself, such as 0x1000.
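Wrapping the pointer arithmetic in one helper removes the temptation to dereference the offset directly. The sketch below also handles string OUTs, where count is greater than 1 and the buffer holds size * count bytes; both function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <linux/kvm.h>

/* Address of the PIO data buffer for the current KVM_EXIT_IO:
 * the start of the kvm_run mapping plus data_offset bytes. */
uint8_t *kvm_io_data(struct kvm_run *run)
{
    return (uint8_t *)run + run->io.data_offset;
}

/* Hypothetical handler: copy every byte of an OUT to the given stream.
 * Returns the number of bytes written, or 0 if the exit was an IN. */
size_t pio_out_to_stream(struct kvm_run *run, FILE *out)
{
    if (run->io.direction != KVM_EXIT_IO_OUT)
        return 0;
    size_t n = (size_t)run->io.size * run->io.count;
    return fwrite(kvm_io_data(run), 1, n, out);
}
```

A serial-console handler could call pio_out_to_stream(run, stdout) after checking that run->io.port is 0x3f8, which generalizes the single-byte putchar in the loop above.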

The full KVM_EXIT_* vocabulary used in production is broader. A selection of the values in include/uapi/linux/kvm.h worth knowing:

Constant                  Value  Meaning
KVM_EXIT_UNKNOWN          0      Hardware exit reason unrecognized
KVM_EXIT_IO               2      Guest PIO (IN/OUT instruction)
KVM_EXIT_HYPERCALL        3      Guest hypercall (VMCALL/VMMCALL)
KVM_EXIT_HLT              5      Guest executed HLT with no pending interrupt
KVM_EXIT_MMIO             6      Guest accessed a GPA with no memory slot
KVM_EXIT_SHUTDOWN         8      Triple fault or guest-requested shutdown
KVM_EXIT_FAIL_ENTRY       9      Hardware refused VM-entry; see fail_entry.hardware_entry_failure_reason
KVM_EXIT_INTR             10     Signal arrived during KVM_RUN
KVM_EXIT_INTERNAL_ERROR   17     KVM internal consistency error; always fatal
KVM_EXIT_X86_RDMSR        29     Guest RDMSR with no in-kernel handler
KVM_EXIT_X86_WRMSR        30     Guest WRMSR with no in-kernel handler
KVM_EXIT_MEMORY_FAULT     39     Guest memory fault that KVM could not resolve
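When debugging a run loop, printing the exit reason by name saves a trip to the header. The lookup below is a small sketch that covers only the rows of the table above; the struct and function names are hypothetical, and a fuller version would take its values from linux/kvm.h rather than literals.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct exit_name {
    uint32_t value;
    const char *name;
};

/* Only the exit reasons listed in the table above. */
static const struct exit_name exit_names[] = {
    {  0, "KVM_EXIT_UNKNOWN" },
    {  2, "KVM_EXIT_IO" },
    {  3, "KVM_EXIT_HYPERCALL" },
    {  5, "KVM_EXIT_HLT" },
    {  6, "KVM_EXIT_MMIO" },
    {  8, "KVM_EXIT_SHUTDOWN" },
    {  9, "KVM_EXIT_FAIL_ENTRY" },
    { 10, "KVM_EXIT_INTR" },
    { 17, "KVM_EXIT_INTERNAL_ERROR" },
    { 29, "KVM_EXIT_X86_RDMSR" },
    { 30, "KVM_EXIT_X86_WRMSR" },
    { 39, "KVM_EXIT_MEMORY_FAULT" },
};

#define N_EXIT_NAMES (sizeof(exit_names) / sizeof(exit_names[0]))

/* Name for an exit_reason value, or NULL if it is not in the table. */
const char *kvm_exit_name(uint32_t reason)
{
    for (size_t i = 0; i < N_EXIT_NAMES; i++)
        if (exit_names[i].value == reason)
            return exit_names[i].name;
    return NULL;
}

/* Value for an exit name, or -1 if it is not in the table. */
int kvm_exit_value(const char *name)
{
    for (size_t i = 0; i < N_EXIT_NAMES; i++)
        if (strcmp(exit_names[i].name, name) == 0)
            return (int)exit_names[i].value;
    return -1;
}
```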

kvmtool's kvm-cpu.c handles KVM_EXIT_UNKNOWN, KVM_EXIT_DEBUG, KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_INTR, KVM_EXIT_SHUTDOWN, and KVM_EXIT_SYSTEM_EVENT. It also drains coalesced MMIO (via KVM_COALESCED_MMIO_PAGE_OFFSET) both before and after the standard MMIO exit path — a performance optimization that batches MMIO writes from the guest to reduce the number of true VM-exits.

Job 5: Emulate Devices

Most VM-exits are the guest asking the VMM to do something on its behalf. The two most common forms are PIO (KVM_EXIT_IO) and MMIO (KVM_EXIT_MMIO).

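PIO exits arrive when the guest executes an IN or OUT instruction. The io sub-struct in kvm_run carries everything needed: direction (in or out), size (bytes per access), port, count (number of accesses, greater than 1 for string instructions), and data_offset (where the data buffer sits relative to the start of kvm_run).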

For KVM_EXIT_IO_OUT, the data is already in the buffer when the VMM reads it; the VMM routes the write to the appropriate emulated device. For KVM_EXIT_IO_IN, the VMM writes the device's response into the buffer and then re-enters KVM_RUN — the kernel delivers those bytes to the IN instruction as if they came from real hardware.

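MMIO exits arrive when the guest accesses a GPA that has no registered memory slot. The mmio sub-struct carries phys_addr (the GPA accessed), data (up to 8 bytes, supplied by the guest for a write or filled in by the VMM for a read), len (the access width in bytes), and is_write.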

The VMM dispatches phys_addr to the emulated device that owns that GPA range. Firecracker uses virtio-MMIO for every device — there is no PCI bus — so its MMIO handler routes accesses in the 32-bit MMIO region starting at 0xC0000000 (MMIO32_MEM_START in layout.rs) to the appropriate virtio backend, reads or writes the virtio queue registers, and re-enters.
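The dispatch step can be sketched independently of any real device. The toy below models each device as a 16-byte register window and routes an access by address; the struct, the window size, and the function names are all hypothetical, and the parameters simply mirror the phys_addr, data, len, and is_write fields of the mmio sub-struct.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEV_WINDOW 16  /* bytes of register space per toy device */

/* Hypothetical device model: a base GPA and a block of scratch registers. */
struct mmio_dev {
    uint64_t base;
    uint8_t  regs[DEV_WINDOW];
};

/* Find the device whose window fully contains [addr, addr + len). */
struct mmio_dev *mmio_find(struct mmio_dev *devs, size_t n,
                           uint64_t addr, uint32_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (addr < devs[i].base)
            continue;
        uint64_t off = addr - devs[i].base;
        if (off < DEV_WINDOW && len <= DEV_WINDOW - off)
            return &devs[i];
    }
    return NULL;
}

/* Service one MMIO exit. Returns 0 on success, -1 if no device claims
 * the address or the access width is not 1 to 8 bytes. */
int mmio_dispatch(struct mmio_dev *devs, size_t n, uint64_t phys_addr,
                  uint8_t data[8], uint32_t len, uint8_t is_write)
{
    if (len == 0 || len > 8)
        return -1;
    struct mmio_dev *dev = mmio_find(devs, n, phys_addr, len);
    if (dev == NULL)
        return -1;
    uint64_t off = phys_addr - dev->base;
    if (is_write)
        memcpy(dev->regs + off, data, len);   /* guest -> device */
    else
        memcpy(data, dev->regs + off, len);   /* device -> guest */
    return 0;
}
```

In a run loop this would be called as mmio_dispatch(devs, n, run->mmio.phys_addr, run->mmio.data, run->mmio.len, run->mmio.is_write); for a read, the bytes left in run->mmio.data are what the guest sees after the next KVM_RUN.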

The emulation path is also where interrupt injection happens. After servicing an MMIO or PIO write that updates a device's status (say, a virtio queue kick), the VMM may need to deliver an interrupt back to the guest CPU. With an in-kernel irqchip (KVM_CREATE_IRQCHIP), Firecracker routes interrupts through the IOAPIC GSI table (KVM_SET_GSI_ROUTING) rather than directly manipulating the APIC — the in-kernel implementation handles the interrupt delivery atomically with the next KVM_RUN re-entry. The minimal VMM has none of this; the virtio story and interrupt injection are the subjects of later chapters.

The vCPU Thread Model

KVM_RUN blocks its calling thread for the entire duration of guest execution. That constraint is the whole vCPU thread model: one host thread per guest CPU, each calling KVM_RUN in a loop on its own vCPU fd. The LWN walkthrough is explicit: "To run a multi-CPU VM, the user-space process must spawn multiple threads, and call KVM_RUN for different virtual CPUs in different threads." The kernel schedules those threads across physical cores, so a 2-vCPU guest can genuinely execute two streams of guest code on two physical CPUs simultaneously.

The sequence from process start to a running multi-vCPU guest follows this shape:

sequenceDiagram
    participant M as "VMM main thread"
    participant T1 as "vCPU thread 0"
    participant T2 as "vCPU thread 1"
    participant K as "KVM kernel module"
    M->>K: open /dev/kvm, KVM_CREATE_VM
    M->>K: KVM_SET_USER_MEMORY_REGION
    M->>K: KVM_CREATE_VCPU (id=0)
    M->>K: KVM_CREATE_VCPU (id=1)
    M->>K: KVM_SET_REGS / KVM_SET_SREGS (both vCPUs)
    M->>T1: spawn thread, pass vcpufd 0
    M->>T2: spawn thread, pass vcpufd 1
    T1->>K: KVM_RUN (blocks)
    T2->>K: KVM_RUN (blocks)
    K-->>T1: VM-exit → exit_reason
    T1->>T1: handle exit, re-enter
    K-->>T2: VM-exit → exit_reason
    T2->>T2: handle exit, re-enter

The kernel documentation notes that vCPU ioctls should be issued from the same thread that created the vCPU, apart from the few it marks as asynchronous. Using a different thread is not forbidden, but the documentation warns that the first ioctl after switching threads could see a performance impact.

Interrupting A Running vCPU

When the VMM needs to pull a vCPU out of guest mode — to inject an interrupt, to handle an API request, to stop the VM — it cannot simply call a function, because the vCPU thread is blocked inside the kernel. The mechanism has three parts:

  1. Write 1 to kvm_run->immediate_exit (a u8 field in the mmap'd struct). The kernel polls this flag once, when KVM_RUN starts, and returns EINTR without entering the guest if it is set.
  2. Issue a memory fence to ensure the write is visible before the signal arrives.
  3. Send a signal to the vCPU thread. If the vCPU is in guest mode, the signal forces a VM-exit and KVM_RUN returns -1 with errno = EINTR. If the signal lands before the thread has entered KVM_RUN, the flag from step 1 makes that next call return immediately instead of entering guest mode.

The vCPU thread on the other side clears immediate_exit back to 0 and checks for pending events before re-entering the loop.
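One common arrangement, and the one the KVM API documentation describes, performs the write inside the signal handler itself, so the flag and the signal can never be observed out of order. The sketch below shows that variant under stated assumptions: the function names and the thread-local tls_run pointer are hypothetical, and the handler is installed without SA_RESTART so that KVM_RUN is not silently restarted.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <linux/kvm.h>

/* Hypothetical per-thread pointer to this vCPU's mmap'd kvm_run region.
 * _Thread_local (C11) gives each vCPU thread its own copy. */
static _Thread_local struct kvm_run *tls_run;

/* Runs on the vCPU thread when the kick signal is delivered. */
static void kick_handler(int signo)
{
    (void)signo;
    if (tls_run != NULL) {
        tls_run->immediate_exit = 1;
        /* Make the store visible before control returns toward KVM_RUN. */
        atomic_thread_fence(memory_order_seq_cst);
    }
}

/* Called once by each vCPU thread before it enters its run loop. */
int install_kick_handler(int signo, struct kvm_run *run)
{
    struct sigaction sa;
    sa.sa_handler = kick_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;             /* no SA_RESTART: KVM_RUN must see EINTR */
    tls_run = run;
    return sigaction(signo, &sa, NULL);
}

/* Called by the vCPU thread after KVM_RUN has returned EINTR. */
void ack_kick(struct kvm_run *run)
{
    run->immediate_exit = 0;
}
```

The kicking thread then only needs to send the signal to the target vCPU thread, for example with pthread_kill; the handler takes care of the flag and the fence on the receiving side.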

An alternative is KVM_SET_SIGNAL_MASK (_IOW(0xAE, 0x8b, struct kvm_signal_mask)), which installs a signal mask that temporarily replaces the thread's own mask for the duration of KVM_RUN. Any signal left unblocked by that mask causes KVM_RUN to return -1 with errno = EINTR when it arrives. This is useful when the process has other signal handlers that should not disturb the run loop.

The kernel also has an internal wake path, kvm_vcpu_kick(), used when one component of the KVM code needs to stop a vCPU that is executing inside the guest. kvm_vcpu_kick() sends an inter-processor interrupt (IPI) to the physical CPU running the guest, causing a VM-exit, after which the kernel checks the vcpu->requests bitmap (set by kvm_make_request()) before allowing re-entry.

Firecracker's Thread Categories

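Firecracker structures its threads into three roles: an API thread that serves the HTTP control interface, a VMM thread that owns device emulation and the event loop, and one vCPU thread per guest CPU.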

Each vCPU thread is spawned in Vcpu::start_threaded() inside src/vmm/src/vstate/vcpu.rs:

thread::Builder::new()
    .name(format!("fc_vcpu {}", self.kvm_vcpu.index))
    .spawn(move || {
        self.register_kick_signal_handler();
        barrier.wait();
        self.run(seccomp_filter)
    })

The thread registers a handler for SIGRTMIN + 0 (Firecracker's VCPU_RTSIG_OFFSET = 0) as its kick signal, waits on a barrier so all threads synchronize at start, and then enters self.run(). That function implements a state machine: the initial state is paused; the running state calls run_emulation() in a tight loop, which calls self.kvm_vcpu.fd.run() — the kvm-ioctls crate's safe wrapper around ioctl(vcpufd, KVM_RUN, NULL). When KVM_RUN returns EINTR, Firecracker sets immediate_exit = 0 and returns VcpuEmulation::Interrupted, triggering a check of the mpsc channel the VMM thread uses to send events. This is the seam between the run loop and everything else: the vCPU thread is single-purpose and reactive; complexity lives in the VMM thread and crosses to the vCPU thread only through the event channel and kvm_run.

Firecracker wraps all KVM kernel types through the rust-vmm crate ecosystem: kvm-bindings provides Rust FFI structs generated from kernel headers, and kvm-ioctls provides safe wrappers — Kvm, VmFd, VcpuFd — that mirror the three-level fd hierarchy exactly. VcpuFd::run() calls the ioctl, reads exit_reason from the mmap'd struct, and returns a VcpuExit<'_> enum variant; on failure it checks specifically for KVM_EXIT_MEMORY_FAULT (value 39) before returning an error, because that exit requires distinct handling from other kernel errors.

Five Jobs, One Picture

The five jobs are not parallel: each depends on the last. You cannot create a vCPU before you have a VM fd, cannot run the loop before you have loaded the guest, cannot emulate devices before the loop is running to deliver exits to you.

flowchart LR
    A["1. Allocate memory\n(mmap + KVM_SET_USER_MEMORY_REGION)"]
    B["2. Create vCPUs\n(KVM_CREATE_VCPU + mmap kvm_run)"]
    C["3. Load guest\n(write image + KVM_SET_REGS / KVM_SET_SREGS)"]
    D["4. Run the loop\n(KVM_RUN per vCPU thread)"]
    E["5. Emulate devices\n(KVM_EXIT_IO / KVM_EXIT_MMIO handlers)"]
    A --> B --> C --> D --> E
    E -->|"re-enter"| D

Jobs 1 through 3 are one-time setup. Jobs 4 and 5 repeat in a tight loop for as long as the guest runs. The boundary between them (KVM_RUN returning, exit_reason being read, a handler firing, KVM_RUN being called again) is the innermost loop of every VMM, and its latency is the floor on device emulation performance.

The LWN example covers this loop in about 200 lines of C. kvmtool adds real-mode boot, MSR initialization, and a proper exit handler in a few thousand. Firecracker adds a security sandbox, virtio devices, and three thread roles on top of the same five jobs, in roughly 30,000 lines of Rust. The mechanism does not change. The policy around it does.

Sources And Further Reading