Appendix C: The KVM ioctl Reference
This appendix covers every ioctl the book uses, grouped by the scope of the fd it targets, with its _IO/_IOW/_IOR/_IOWR encoding, its raw hex number, and a plain-English account of its effect. The capability table at the end maps KVM_CAP_* constants to the ioctls they gate. For the mental model of why the three-fd hierarchy exists and what a VM exit looks like from the VMM's side, see Chapter 1. For how Firecracker wires KVM_IRQFD and KVM_IOEVENTFD into its device model, see Chapter 8.
The 32-Bit ioctl Encoding
Every KVM ioctl is a 32-bit integer assembled by _IOC(dir, type, nr, size) from four packed fields defined in include/uapi/asm-generic/ioctl.h.
| Field | Bits | Width | Meaning |
|---|---|---|---|
| nr | 7–0 | 8 bits | Function number within the subsystem |
| type | 15–8 | 8 bits | Magic byte identifying the subsystem |
| size | 29–16 | 14 bits | sizeof the argument struct (advisory only) |
| dir | 31–30 | 2 bits | Data direction: 0 = none, 1 = write-to-kernel, 2 = read-from-kernel, 3 = both |
Four derived macros cover the common cases:
#define _IO(type, nr) _IOC(0, (type), (nr), 0)
#define _IOR(type, nr, argtype) _IOC(2, (type), (nr), sizeof(argtype))
#define _IOW(type, nr, argtype) _IOC(1, (type), (nr), sizeof(argtype))
#define _IOWR(type, nr, argtype) _IOC(3, (type), (nr), sizeof(argtype))
For _IO ioctls, where dir = 0 and size = 0, the encoding collapses to (type << 8) | nr. Because every KVM ioctl uses the magic byte KVMIO = 0xAE, a system-scope _IO ioctl with function number nr always encodes as 0x0000AE00 | nr. The size field is advisory: the ioctl(2) man page notes that "the size bits are very unreliable — in lots of cases they are wrong," and the kernel does not rely on the encoded size for correctness.
The Linux ioctl-number registry assigns 0xAE to two subsystems: KVM (numbers 0x00–0x1F and 0x40–0xFF in linux/kvm.h) and AWS Nitro Enclaves (numbers 0x20–0x3F in linux/nitro_enclaves.h). On x86 and arm64 the encoding above is exact; PowerPC overrides _IOC_SIZEBITS and _IOC_DIRBITS, which shifts bits in the size and dir fields for _IOW/_IOR/_IOWR ioctls, though nr and type are identical on all architectures.
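As a cross-check on the field layout above, the following minimal C sketch packs a request number by hand and compares it with the system macros. The helper names ioc_encode, kvm_io, and encoding_matches_macros are hypothetical, and the shifts assume the asm-generic layout used on x86 and arm64, not the PowerPC variant.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>   /* provides _IO and _IOW on Linux */

#define KVMIO 0xAE       /* magic byte shared by every KVM ioctl */

/* Hypothetical helper: pack the four fields using the asm-generic
 * layout (nr in bits 7..0, type 15..8, size 29..16, dir 31..30). */
static uint32_t ioc_encode(uint32_t dir, uint32_t type,
                           uint32_t nr, uint32_t size)
{
    assert(dir < 4 && type < 256 && nr < 256 && size < (1u << 14));
    return (dir << 30) | (size << 16) | (type << 8) | nr;
}

/* An _IO request carries no direction and no size. */
static uint32_t kvm_io(uint32_t nr)
{
    return ioc_encode(0, KVMIO, nr, 0);
}

/* Returns 1 when the hand-rolled encoding agrees with the libc macros
 * for one _IO request (nr 0x80) and one _IOW request (nr 0x48, __u64). */
static int encoding_matches_macros(void)
{
    return kvm_io(0x80) == (uint32_t)_IO(KVMIO, 0x80) &&
           ioc_encode(1, KVMIO, 0x48, sizeof(uint64_t)) ==
               (uint32_t)_IOW(KVMIO, 0x48, uint64_t);
}
```

Under these assumptions kvm_io(0x80) yields 0x0000AE80, the number used for KVM_RUN, and the _IOW case yields 0x4008AE48: dir 1, size 8, type 0xAE, nr 0x48.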
Three Scopes, Three File Descriptors
KVM organises its API around a three-level fd hierarchy. Issuing an ioctl on the wrong fd produces ENOTTY; there is no fallback.
System fd. Opened directly from /dev/kvm. Ioctls here query or configure KVM as a whole: check the API version, create a VM, probe capabilities, fetch the supported CPUID set.
VM fd. Returned by KVM_CREATE_VM. Ioctls here configure one virtual machine: map memory, create vCPUs, wire the irqchip, register eventfd bindings.
vCPU fd. Returned by KVM_CREATE_VCPU. Ioctls here configure and run one virtual CPU: set registers, program CPUID, and issue KVM_RUN.
A threading constraint applies: VM ioctls must originate from the same process (address space) that called KVM_CREATE_VM. vCPU ioctls should come from the thread that called KVM_CREATE_VCPU, except for ioctls explicitly documented as asynchronous (the immediate_exit field, written from a signal handler, being the canonical exception).
Note.
/dev/kvm is a character device with mode 0660, owned by root:kvm. Opening it requires either root or membership in the kvm group. Any command or program in this appendix that opens /dev/kvm needs that access.
System Ioctls
Issued on the fd returned by open("/dev/kvm", O_RDWR).
| Ioctl | Encoding | Nr | One-line effect |
|---|---|---|---|
| KVM_GET_API_VERSION | _IO(KVMIO, 0x00) | 0x00 | Returns KVM_API_VERSION = 12; abort if not 12 |
| KVM_CREATE_VM | _IO(KVMIO, 0x01) | 0x01 | Creates a VM; returns VM fd |
| KVM_GET_MSR_INDEX_LIST | _IOWR(KVMIO, 0x02, struct kvm_msr_list) | 0x02 | Returns MSR indices the kernel handles |
| KVM_CHECK_EXTENSION | _IO(KVMIO, 0x03) | 0x03 | Tests a KVM_CAP_*; returns 0 (absent) or positive (present) |
| KVM_GET_VCPU_MMAP_SIZE | _IO(KVMIO, 0x04) | 0x04 | Returns bytes to mmap on each vCPU fd to obtain struct kvm_run |
| KVM_GET_SUPPORTED_CPUID | _IOWR(KVMIO, 0x05, struct kvm_cpuid2) | 0x05 | Fills CPUID leaves KVM can emulate on this host |
| KVM_GET_MSR_FEATURE_INDEX_LIST | _IOWR(KVMIO, 0x0a, struct kvm_msr_list) | 0x0a | Returns MSR indices with per-feature data |
KVM_GET_API_VERSION
_IO(KVMIO, 0x00) encoded: 0x0000AE00
Returns the integer constant KVM_API_VERSION, which is hard-coded to 12 and has been 12 since at least Linux 2.6.22; the kernel docs note that 2.6.20 and 2.6.21 reported earlier unsupported values. Any VMM that receives a value other than 12 must refuse to continue. There is no migration path; the value is frozen.
KVM_CREATE_VM
_IO(KVMIO, 0x01) encoded: 0x0000AE01
The argument is a machine-type integer. Pass 0 for the standard VM type on x86; the KVM documentation says "you probably want to use 0." Returns a new VM fd. The new VM has no vCPUs and no memory; both require subsequent ioctls before KVM_RUN is valid.
KVM_CREATE_VM can return EINTR because its kernel path calls mm_take_all_locks(), which is CPU-intensive and interruptible. Firecracker (src/vmm/src/vstate/vm.rs) handles this by retrying up to five times with exponential back-off on EINTR. A VMM that does not retry will fail spuriously under load.
KVM_CHECK_EXTENSION
_IO(KVMIO, 0x03) encoded: 0x0000AE03
Takes a KVM_CAP_* integer. Returns 0 if the capability is absent, or a positive integer if present. Some capabilities encode a meaningful count in the return value rather than simply returning 1: KVM_CAP_NR_MEMSLOTS, for example, returns the maximum number of memory slots the VM supports.
KVM_CHECK_EXTENSION can be issued on the system fd (global query) or on a VM fd (VM-specific query). The VM-level call is preferred because different VMs may present different capabilities. The VM-level form requires KVM_CAP_CHECK_EXTENSION_VM.
KVM_GET_VCPU_MMAP_SIZE
_IO(KVMIO, 0x04) encoded: 0x0000AE04
Returns the byte size of the region that must be mmap(2)-ed on each vCPU fd to obtain struct kvm_run. The returned size is often larger than sizeof(struct kvm_run):
- When KVM_CAP_COALESCED_MMIO is present, a coalesced-MMIO ring page sits at KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE within the mapping.
- When KVM_CAP_DIRTY_LOG_RING is present, dirty-log ring pages sit at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.
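The returned size covers struct kvm_run plus the pages that follow it in the same mapping, such as the coalesced-MMIO page and, on x86, the PIO data page that io.data_offset points into. The dirty ring is not part of that size: it is mapped by a separate mmap call on the vCPU fd at its own offset. Always pass the full returned size as the length argument to the first mmap, not sizeof(struct kvm_run). A shorter mapping leaves the trailing pages inaccessible to the VMM, so the first access through data_offset or to the coalesced ring faults.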
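The system-fd calls described above can be sketched in C as follows. This is a minimal illustration, not Firecracker's code: the helper names kvm_open_checked, kvm_cap, and kvm_run_map_size are hypothetical, and the sketch assumes a Linux host with the kernel UAPI headers installed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Hypothetical helper: open /dev/kvm and verify the API version.
 * Returns a system fd, or -1 if KVM is unavailable or unexpected. */
static int kvm_open_checked(void)
{
    int sysfd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (sysfd < 0)
        return -1;                    /* no device node or no permission */

    if (ioctl(sysfd, KVM_GET_API_VERSION, 0) != KVM_API_VERSION) {
        close(sysfd);                 /* anything other than 12: refuse */
        return -1;
    }
    return sysfd;
}

/* Hypothetical helper: capability value, with 0 meaning "absent".
 * Works on a system fd; a VM fd also works where
 * KVM_CAP_CHECK_EXTENSION_VM is present. */
static int kvm_cap(int fd, long cap)
{
    int r = ioctl(fd, KVM_CHECK_EXTENSION, cap);
    return r < 0 ? 0 : r;
}

/* Hypothetical helper: bytes to mmap on each vCPU fd, or -1 on error. */
static long kvm_run_map_size(int sysfd)
{
    int size = ioctl(sysfd, KVM_GET_VCPU_MMAP_SIZE, 0);
    /* The kernel never reports less than the struct itself. */
    assert(size < 0 || (size_t)size >= sizeof(struct kvm_run));
    return size;
}
```

A caller would typically run kvm_open_checked() once, probe the capabilities it depends on with kvm_cap(), and keep the value from kvm_run_map_size() for the mmap on every vCPU fd it creates later.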
VM Ioctls
Issued on the fd returned by KVM_CREATE_VM.
| Ioctl | Encoding | Nr | One-line effect |
|---|---|---|---|
| KVM_CREATE_VCPU | _IO(KVMIO, 0x41) | 0x41 | Creates a vCPU; returns vCPU fd |
| KVM_GET_DIRTY_LOG | _IOW(KVMIO, 0x42, struct kvm_dirty_log) | 0x42 | Returns and clears the dirty-page bitmap for a memory slot |
| KVM_SET_USER_MEMORY_REGION | _IOW(KVMIO, 0x46, struct kvm_userspace_memory_region) | 0x46 | Maps host virtual memory into guest physical address space |
| KVM_SET_TSS_ADDR | _IO(KVMIO, 0x47) | 0x47 | Sets Intel VMX internal TSS guest physical address (Intel hosts only) |
| KVM_SET_IDENTITY_MAP_ADDR | _IOW(KVMIO, 0x48, __u64) | 0x48 | Sets guest physical address of the identity-map page for real-mode entry (Intel) |
| KVM_SET_USER_MEMORY_REGION2 | _IOW(KVMIO, 0x49, struct kvm_userspace_memory_region2) | 0x49 | Extended form; adds guest_memfd support for confidential VMs |
| KVM_CREATE_IRQCHIP | _IO(KVMIO, 0x60) | 0x60 | Creates in-kernel PIC + IOAPIC + per-vCPU LAPIC on x86 |
| KVM_IRQ_LINE | _IOW(KVMIO, 0x61, struct kvm_irq_level) | 0x61 | Sets the level of an IRQ line on the in-kernel irqchip |
| KVM_SET_GSI_ROUTING | _IOW(KVMIO, 0x6a, struct kvm_irq_routing) | 0x6a | Programs GSI-to-irqchip-pin or GSI-to-MSI routing table |
| KVM_IRQFD | _IOW(KVMIO, 0x76, struct kvm_irqfd) | 0x76 | Binds an eventfd to a GSI so writes signal the interrupt |
| KVM_IOEVENTFD | _IOW(KVMIO, 0x79, struct kvm_ioeventfd) | 0x79 | Fires an eventfd when the guest writes to an MMIO or PIO address |
| KVM_SET_CLOCK | _IOW(KVMIO, 0x7b, struct kvm_clock_data) | 0x7b | Sets the VM's master clock |
| KVM_GET_CLOCK | _IOR(KVMIO, 0x7c, struct kvm_clock_data) | 0x7c | Reads the VM's master clock |
| KVM_CREATE_DEVICE | _IOWR(KVMIO, 0xe0, struct kvm_create_device) | 0xe0 | Creates an in-kernel device (e.g. GICv3 on arm64) |
KVM_CREATE_VCPU
_IO(KVMIO, 0x41) encoded: 0x0000AE41
The argument is the vCPU ID integer, which must be in [0, KVM_CAP_MAX_VCPU_ID). At most KVM_CAP_MAX_VCPUS vCPUs may be added to a VM. Returns a vCPU fd.
A VMM should not rely on the initial register state of a new vCPU. On x86, KVM brings the vCPU up in the architectural power-on reset state (real mode at the reset vector), which is not what a direct kernel boot needs, so the VMM must call at minimum KVM_SET_SREGS and KVM_SET_REGS before the first KVM_RUN.
Sequencing constraint. If an in-kernel IRQ chip is desired, KVM_CREATE_IRQCHIP must be called before KVM_CREATE_VCPU. Each new vCPU receives an in-kernel local APIC only if the irqchip already exists at creation time. On current x86 kernels the reverse order is rejected: KVM_CREATE_IRQCHIP fails with EINVAL once any vCPU exists. A VMM that ignores that error ends up with vCPUs that lack in-kernel LAPICs and with broken interrupt delivery.
KVM_SET_USER_MEMORY_REGION
_IOW(KVMIO, 0x46, struct kvm_userspace_memory_region)
Requires capability KVM_CAP_USER_MEMORY. The struct is:
struct kvm_userspace_memory_region {
__u32 slot; /* bits 0–15: slot ID; bits 16–31: address space */
__u32 flags;
__u64 guest_phys_addr;
__u64 memory_size; /* bytes; 0 = delete the slot */
__u64 userspace_addr; /* host virtual address; must span full memory_size */
};
The userspace_addr field is the crucial insight: the guest "physical" address space is backed by host virtual memory, typically obtained with mmap. KVM resolves each host virtual address to a host-physical page frame through the host's own page tables and installs the resulting guest-physical to host-physical mapping in EPT (on Intel) or NPT (on AMD). The guest never sees or controls this translation layer.
Flags:
| Flag | Value | Meaning |
|---|---|---|
| KVM_MEM_LOG_DIRTY_PAGES | (1UL << 0) | Enable dirty-page tracking for live migration |
| KVM_MEM_READONLY | (1UL << 1) | Guest writes produce KVM_EXIT_MMIO instead of writing through; requires KVM_CAP_READONLY_MEM |
| KVM_MEM_GUEST_MEMFD | (1UL << 2) | Back slot with a guest_memfd object for confidential VMs |
Slots must not overlap within the same address space. An existing slot can be moved or have its flags changed but cannot be resized in place; set memory_size = 0 to delete the slot first, then re-add it. To enable 2 MiB large-page backing, the lower 21 bits of guest_phys_addr and userspace_addr should match so that the EPT walker can map a single 2 MiB page rather than 512 separate 4 KiB pages.
Firecracker allocates slot IDs sequentially via next_kvm_slot() and enforces the limit reported by KVM_CHECK_EXTENSION(KVM_CAP_NR_MEMSLOTS) on the VM fd.
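The slot packing and the 2 MiB alignment recommendation can be illustrated with a small C sketch. The helper names pack_slot, congruent_2m, and make_region are hypothetical and are not taken from Firecracker; only the struct and flag names come from linux/kvm.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

#define LOW_21_BITS ((1ULL << 21) - 1)   /* offset within a 2 MiB page */

/* Hypothetical helper: slot ID in bits 0-15, address space in bits 16-31. */
static uint32_t pack_slot(uint16_t as_id, uint16_t slot_id)
{
    return ((uint32_t)as_id << 16) | slot_id;
}

/* True when the guest physical and host virtual addresses share their
 * low 21 bits, the precondition for 2 MiB mappings. */
static int congruent_2m(uint64_t gpa, uint64_t hva)
{
    return (gpa & LOW_21_BITS) == (hva & LOW_21_BITS);
}

/* Hypothetical helper: build the argument for KVM_SET_USER_MEMORY_REGION.
 * A size of 0 describes a slot deletion. */
static struct kvm_userspace_memory_region
make_region(uint32_t slot, uint64_t gpa, void *host, uint64_t size,
            uint32_t flags)
{
    struct kvm_userspace_memory_region r;

    memset(&r, 0, sizeof r);
    r.slot = slot;
    r.flags = flags;
    r.guest_phys_addr = gpa;
    r.memory_size = size;
    r.userspace_addr = (uint64_t)(uintptr_t)host;
    return r;
}
```

The resulting struct would be passed as ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &r). Because a slot cannot be resized in place, a resize is two calls: one with size 0 to delete the slot, then one with the new size.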
KVM_SET_TSS_ADDR
_IO(KVMIO, 0x47) encoded: 0x0000AE47
Required on Intel VMX hosts only. KVM reserves a three-page region (3 × 4 KiB = 12 KiB) in guest physical space as an internal Task State Segment used by VMX bookkeeping. The argument is the guest physical address of the first page. The address must be within the first 4 GiB of guest physical space and must not overlap any memory slot or MMIO range. Guest access to this region produces undefined behavior; the VMM must ensure the firmware and guest OS never map or use it.
KVM_CREATE_IRQCHIP
_IO(KVMIO, 0x60) encoded: 0x0000AE60
Requires KVM_CAP_IRQCHIP. On x86, creates three in-kernel interrupt controller components: a virtual IOAPIC, a PIC master (i8259A), and a PIC slave (i8259A). Each subsequently created vCPU also receives a local APIC. The default GSI routing installed by KVM_CREATE_IRQCHIP routes GSIs 0–15 to both the PIC and the IOAPIC, and GSIs 16–23 to the IOAPIC only.
On arm64, KVM_CREATE_DEVICE with type KVM_DEV_TYPE_ARM_VGIC_V3 is now the preferred path for the GIC rather than this ioctl.
Must be called before KVM_CREATE_VCPU; see the sequencing note under that entry.
KVM_IRQFD
_IOW(KVMIO, 0x76, struct kvm_irqfd)
Requires KVM_CAP_IRQFD.
struct kvm_irqfd {
__u32 fd;
__u32 gsi; /* Global System Interrupt number */
__u32 flags;
__u32 resamplefd; /* used with KVM_IRQFD_FLAG_RESAMPLE */
__u8 pad[16];
};
Writing the bound eventfd signals the GSI. The kernel resolves the GSI to an irqchip pin or an MSI vector using the routing table set with KVM_SET_GSI_ROUTING. Firecracker uses routing entries of type KVM_IRQ_ROUTING_IRQCHIP (value 1, IOAPIC pin) and KVM_IRQ_ROUTING_MSI (value 2). When KVM_CAP_MSI_DEVID is present, MSI entries may carry the KVM_MSI_VALID_DEVID flag.
| Flag | Value | Meaning |
|---|---|---|
| KVM_IRQFD_FLAG_DEASSIGN | (1 << 0) | Remove an existing binding |
| KVM_IRQFD_FLAG_RESAMPLE | (1 << 1) | Level-triggered: re-arm after delivery via resamplefd |
The value of KVM_IRQFD is that interrupt injection bypasses the VMM entirely on the hot path: a virtio device backend writes an eventfd, and the kernel delivers the interrupt to the guest without a round-trip through userspace.
KVM_IOEVENTFD
_IOW(KVMIO, 0x79, struct kvm_ioeventfd)
Requires KVM_CAP_IOEVENTFD.
struct kvm_ioeventfd {
__u64 datamatch;
__u64 addr; /* MMIO or PIO address */
__u32 len; /* 0, 1, 2, 4, or 8 */
__s32 fd;
__u32 flags;
__u8 pad[36];
};
| Flag | Value | Meaning |
|---|---|---|
| KVM_IOEVENTFD_FLAG_DATAMATCH | (1 << 0) | Fire only when written value equals datamatch |
| KVM_IOEVENTFD_FLAG_PIO | (1 << 1) | addr is an I/O port, not MMIO |
| KVM_IOEVENTFD_FLAG_DEASSIGN | (1 << 2) | Remove an existing registration |
| KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY | (1 << 3) | s390 virtio-ccw specific |
KVM_IOEVENTFD is the complementary mechanism to KVM_IRQFD: where KVM_IRQFD lets the host signal the guest without a userspace round-trip, KVM_IOEVENTFD lets the guest notify the host without one. When the guest writes to the registered MMIO or PIO address, the kernel fires the eventfd directly, never returning to userspace. This is the primary mechanism for virtio queue kick notification. With KVM_CAP_IOEVENTFD_ANY_LENGTH, len = 0 is valid and the kernel ignores the write width.
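The pairing of the two mechanisms for one virtio queue might look like the C sketch below. The helper names make_queue_kick, make_irqfd, and wire_queue are hypothetical, the notify address and GSI are whatever the device model has chosen, and the 4-byte notify width is an assumption of this example rather than a KVM requirement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Guest to host: fire `efd` when the guest writes `queue_index`
 * (4 bytes wide, assumed) to the MMIO notify address. */
static struct kvm_ioeventfd
make_queue_kick(uint64_t notify_addr, uint32_t queue_index, int efd)
{
    struct kvm_ioeventfd io;

    memset(&io, 0, sizeof io);
    io.addr = notify_addr;
    io.len = 4;
    io.fd = efd;
    io.datamatch = queue_index;
    io.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;  /* only this queue's value */
    return io;
}

/* Host to guest: a write to `efd` raises `gsi` (edge-style, no resample). */
static struct kvm_irqfd make_irqfd(uint32_t gsi, int efd)
{
    struct kvm_irqfd irq;

    memset(&irq, 0, sizeof irq);
    irq.fd = (uint32_t)efd;
    irq.gsi = gsi;
    return irq;
}

/* Register both bindings on a VM fd. Returns 0 on success, -1 on failure. */
static int wire_queue(int vmfd, uint64_t notify_addr, uint32_t queue_index,
                      uint32_t gsi, int kick_fd, int irq_fd)
{
    struct kvm_ioeventfd io = make_queue_kick(notify_addr, queue_index, kick_fd);
    struct kvm_irqfd irq = make_irqfd(gsi, irq_fd);

    assert(vmfd >= 0);
    if (ioctl(vmfd, KVM_IOEVENTFD, &io) < 0)
        return -1;
    return ioctl(vmfd, KVM_IRQFD, &irq) < 0 ? -1 : 0;
}
```

Both file descriptors would come from eventfd(2). The device backend then waits on kick_fd for guest notifications and writes to irq_fd to raise the interrupt, with no KVM_RUN exit in either direction.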
vCPU Ioctls
Issued on the fd returned by KVM_CREATE_VCPU.
| Ioctl | Encoding | Nr | One-line effect |
|---|---|---|---|
| KVM_RUN | _IO(KVMIO, 0x80) | 0x80 | Enter guest; returns when exit requires VMM attention |
| KVM_GET_REGS | _IOR(KVMIO, 0x81, struct kvm_regs) | 0x81 | Reads x86 GPRs, RIP, and RFLAGS |
| KVM_SET_REGS | _IOW(KVMIO, 0x82, struct kvm_regs) | 0x82 | Writes x86 GPRs, RIP, and RFLAGS |
| KVM_GET_SREGS | _IOR(KVMIO, 0x83, struct kvm_sregs) | 0x83 | Reads segment registers, descriptor tables, CR*, EFER |
| KVM_SET_SREGS | _IOW(KVMIO, 0x84, struct kvm_sregs) | 0x84 | Writes segment registers, descriptor tables, CR*, EFER |
| KVM_GET_MSRS | _IOWR(KVMIO, 0x88, struct kvm_msrs) | 0x88 | Reads one or more MSRs |
| KVM_SET_MSRS | _IOW(KVMIO, 0x89, struct kvm_msrs) | 0x89 | Writes one or more MSRs |
| KVM_SET_CPUID2 | _IOW(KVMIO, 0x90, struct kvm_cpuid2) | 0x90 | Programs CPUID leaves returned to the guest |
| KVM_GET_CPUID2 | _IOWR(KVMIO, 0x91, struct kvm_cpuid2) | 0x91 | Reads the CPUID table currently set for this vCPU |
| KVM_GET_FPU | _IOR(KVMIO, 0x8c, struct kvm_fpu) | 0x8c | Reads x87 FPU and SSE state |
| KVM_SET_FPU | _IOW(KVMIO, 0x8d, struct kvm_fpu) | 0x8d | Writes x87 FPU and SSE state |
| KVM_GET_LAPIC | _IOR(KVMIO, 0x8e, struct kvm_lapic_state) | 0x8e | Reads local APIC page (requires in-kernel APIC) |
| KVM_SET_LAPIC | _IOW(KVMIO, 0x8f, struct kvm_lapic_state) | 0x8f | Writes local APIC page |
| KVM_ENABLE_CAP | _IOW(KVMIO, 0xa3, struct kvm_enable_cap) | 0xa3 | Enables a per-vCPU capability |
| KVM_GET_VCPU_EVENTS | _IOR(KVMIO, 0x9f, struct kvm_vcpu_events) | 0x9f | Reads pending exceptions, interrupts, and NMI state |
| KVM_SET_VCPU_EVENTS | _IOW(KVMIO, 0xa0, struct kvm_vcpu_events) | 0xa0 | Writes pending exceptions, interrupts, and NMI state |
| KVM_GET_XSAVE | _IOR(KVMIO, 0xa4, struct kvm_xsave) | 0xa4 | Reads XSAVE area (requires KVM_CAP_XSAVE) |
| KVM_SET_XSAVE | _IOW(KVMIO, 0xa5, struct kvm_xsave) | 0xa5 | Writes XSAVE area |
| KVM_GET_XCRS | _IOR(KVMIO, 0xa6, struct kvm_xcrs) | 0xa6 | Reads extended control registers including XCR0 |
| KVM_SET_XCRS | _IOW(KVMIO, 0xa7, struct kvm_xcrs) | 0xa7 | Writes extended control registers |
| KVM_GET_ONE_REG | _IOW(KVMIO, 0xab, struct kvm_one_reg) | 0xab | Reads a single named register (arm64 and other arches); _IOW rather than _IOR because the struct carries a pointer, see detail below |
| KVM_SET_ONE_REG | _IOW(KVMIO, 0xac, struct kvm_one_reg) | 0xac | Writes a single named register |
| KVM_KVMCLOCK_CTRL | _IO(KVMIO, 0xad) | 0xad | Tells the guest, via its pvclock flags, that this vCPU was paused by the host (needed after snapshot restore) |
| KVM_SET_SIGNAL_MASK | _IOW(KVMIO, 0x8b, struct kvm_signal_mask) | 0x8b | Sets the signal mask active while this vCPU is running inside KVM_RUN |
KVM_RUN
_IO(KVMIO, 0x80) encoded: 0x0000AE80
No explicit argument. All communication between the VMM and the kernel about this vCPU's execution happens through the struct kvm_run region mapped at offset 0 of the vCPU fd. The region's size comes from KVM_GET_VCPU_MMAP_SIZE.
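Returns 0 on a normal exit and -1 on error. Notable errno values: EINTR when an unmasked signal is pending (the VMM handles the signal and re-enters KVM_RUN), ENOEXEC when the vCPU has not been initialized (arm64 also returns it when the guest executes from device memory), and EPERM on arm64 when a vCPU feature such as SVE was requested but not finalized.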
Key fields in struct kvm_run used around the call:
| Field | Type | Direction | Meaning |
|---|---|---|---|
| request_interrupt_window | __u8 | in (VMM writes) | Causes exit when the guest is ready to accept an external interrupt |
| immediate_exit | __u8 | in (VMM writes) | Polled once when KVM_RUN starts; if non-zero, KVM_RUN returns EINTR immediately without entering the guest |
| exit_reason | __u32 | out (kernel writes) | KVM_EXIT_* constant indicating why the guest exited |
| ready_for_interrupt_injection | __u8 | out | 1 if an interrupt can be injected now |
| if_flag | __u8 | out | Current value of RFLAGS.IF |
Firecracker writes immediate_exit from a signal handler and follows the write with fence(Ordering::Release), ensuring the store is visible before any subsequent Acquire load on the vCPU thread. The signal itself forces a vCPU that is already in guest mode out of KVM_RUN with EINTR. The flag covers the remaining race, in which the signal lands just before the vCPU thread enters KVM_RUN: the kernel polls the field at the start of the call and returns EINTR at once if it is set. The VMM must clear the field again before it wants the guest to run.
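The same kick pattern can be written in C as a minimal sketch. This is not Firecracker's implementation: the names kick_target, kick_handler, and install_kick_handler are hypothetical, and a single vCPU is assumed, where a real VMM would keep one pointer per vCPU thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <linux/kvm.h>

/* Hypothetical: the mmap'ed kvm_run of the one vCPU this sketch handles. */
static struct kvm_run *volatile kick_target;

/* Signal handler: ask the next KVM_RUN to return EINTR immediately. */
static void kick_handler(int signo)
{
    (void)signo;
    if (kick_target) {
        kick_target->immediate_exit = 1;
        /* Publish the store before anything that follows the handler. */
        atomic_thread_fence(memory_order_release);
    }
}

/* Install the handler for `signo`. SA_RESTART is deliberately not set,
 * so a KVM_RUN interrupted by the signal returns EINTR instead of
 * being restarted. Returns 0 on success, -1 on failure. */
static int install_kick_handler(int signo, struct kvm_run *run)
{
    struct sigaction sa = {0};

    assert(run != NULL);
    kick_target = run;
    sa.sa_handler = kick_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

Another thread would then stop the vCPU by sending the chosen signal to the vCPU thread, for example with pthread_kill. After KVM_RUN returns EINTR, the vCPU loop sets immediate_exit back to 0 before it re-enters the guest.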
The exit reason determines which union subfield of kvm_run is valid. See the exit-reason table below.
KVM_GET_REGS / KVM_SET_REGS
_IOR(KVMIO, 0x81, struct kvm_regs) (GET)
_IOW(KVMIO, 0x82, struct kvm_regs) (SET)
The x86_64 struct, from arch/x86/include/uapi/asm/kvm.h:
struct kvm_regs {
__u64 rax, rbx, rcx, rdx;
__u64 rsi, rdi, rsp, rbp;
__u64 r8, r9, r10, r11;
__u64 r12, r13, r14, r15;
__u64 rip, rflags;
};
Firecracker programs these registers differently depending on boot protocol. For 64-bit Linux boot (BootProtocol::LinuxBoot):
| Register | Value | Meaning |
|---|---|---|
| rip | kernel entry address | First instruction of the kernel image |
| rsp / rbp | BOOT_STACK_POINTER | Top of the boot stack |
| rsi | ZERO_PAGE_START | Pointer to boot params (Linux 64-bit boot ABI) |
| rflags | 0x0000_0000_0000_0002 | Bit 1 (Reserved=1) always set; interrupts disabled |
For PVH boot (BootProtocol::PvhBoot):
| Register | Value | Meaning |
|---|---|---|
| rip | PVH entry address | First instruction |
| rbx | PVH_INFO_START | Pointer to hvm_start_info struct |
| rflags | 0x0000_0000_0000_0002 | Same as above |
KVM_GET_SREGS / KVM_SET_SREGS
_IOR(KVMIO, 0x83, struct kvm_sregs) (GET)
_IOW(KVMIO, 0x84, struct kvm_sregs) (SET)
struct kvm_sregs {
struct kvm_segment cs, ds, es, fs, gs, ss, tr, ldt;
struct kvm_dtable gdt, idt;
__u64 cr0, cr2, cr3, cr4, cr8;
__u64 efer;
__u64 apic_base;
__u64 interrupt_bitmap[(256 + 63) / 64];
};
interrupt_bitmap is four __u64 words (since KVM_NR_INTERRUPTS = 256, so (256 + 63) / 64 = 4). At most one bit may be set. It represents an external interrupt acknowledged by the APIC but not yet injected into the CPU core.
Firecracker always calls KVM_GET_SREGS first, modifies the relevant fields, then calls KVM_SET_SREGS. Overwriting the struct wholesale would corrupt APIC state and any fields the kernel may have already populated.
For 64-bit Linux boot, Firecracker sets:
| Field | Value | Meaning |
|---|---|---|
| cr0 | \|= X86_CR0_PE (0x1) | Protected Mode Enable |
| efer | \|= EFER_LME \| EFER_LMA (0x100 \| 0x400) | Long Mode Enable + Long Mode Active |
| cr4 | \|= X86_CR4_PAE (0x20) | Physical Address Extension |
| cr3 | 0x9000 | PML4 base; PDPTE at 0xa000; PDE at 0xb000 |
The GDT sits at guest physical 0x500 with four entries: NULL, CODE (0xa09b), DATA (0xc093), TSS (0x808b). The IDT sits at guest physical 0x520 with limit 7 (one 8-byte entry).
For PVH boot, the setup is different: cr0 = X86_CR0_PE | X86_CR0_ET = 0x11 (32-bit protected mode, no paging), cr4 = 0.
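The get-modify-set discipline for the 64-bit boot values can be sketched in C as below. The macro and function names are hypothetical, the sketch assumes x86_64 kernel headers, and it additionally sets CR0.PG, which the table does not list; that bit is included here as an assumption because long mode cannot be active without paging. Segment, GDT, and IDT setup are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Local names for the architectural bits (hypothetical macro names). */
#define BOOT_CR0_PE   0x1ULL          /* protected mode enable */
#define BOOT_CR0_PG   0x80000000ULL   /* paging; assumed, not in the table */
#define BOOT_CR4_PAE  0x20ULL         /* physical address extension */
#define BOOT_EFER_LME 0x100ULL        /* long mode enable */
#define BOOT_EFER_LMA 0x400ULL        /* long mode active */

/* Pure part: OR the long-mode bits into an sregs image that was
 * previously read from the kernel, leaving all other fields alone. */
static void sregs_enter_long_mode(struct kvm_sregs *s, uint64_t pml4_gpa)
{
    assert((pml4_gpa & 0xfff) == 0);  /* page tables are 4 KiB aligned */
    s->cr3 = pml4_gpa;
    s->cr4 |= BOOT_CR4_PAE;
    s->cr0 |= BOOT_CR0_PE | BOOT_CR0_PG;
    s->efer |= BOOT_EFER_LME | BOOT_EFER_LMA;
}

/* Read, modify, write back. Returns 0 on success, -1 on failure. */
static int vcpu_enter_long_mode(int vcpufd, uint64_t pml4_gpa)
{
    struct kvm_sregs s;

    if (ioctl(vcpufd, KVM_GET_SREGS, &s) < 0)
        return -1;
    sregs_enter_long_mode(&s, pml4_gpa);
    return ioctl(vcpufd, KVM_SET_SREGS, &s) < 0 ? -1 : 0;
}
```

Starting from the kernel's own sregs image rather than a zeroed struct is what preserves apic_base and the other fields the kernel has already populated.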
KVM_SET_CPUID2
_IOW(KVMIO, 0x90, struct kvm_cpuid2)
struct kvm_cpuid2 {
__u32 nent;
__u32 padding;
struct kvm_cpuid_entry2 entries[];
};
struct kvm_cpuid_entry2 {
__u32 function; /* CPUID leaf (EAX input) */
__u32 index; /* CPUID sub-leaf (ECX input) */
__u32 flags;
__u32 eax, ebx, ecx, edx;
__u32 padding[3];
};
The KVM_CPUID_FLAG_SIGNIFCANT_INDEX flag (value (1 << 0)) must be set on entries where the index field distinguishes sub-leaves, such as leaf 0xB (extended topology) and leaf 0xD (XSAVE state). Two other flag bits exist in the header (KVM_CPUID_FLAG_STATEFUL_FUNC = bit 1, KVM_CPUID_FLAG_STATE_READ_NEXT = bit 2) but are vestigial and not used in current VMM practice.
Two sequencing rules matter and both come directly from the kernel documentation. First: "Using KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability." The CPUID table must be set before the first KVM_RUN on any vCPU. Second: all vCPUs in a VM should receive identical CPUID data unless the guest explicitly supports per-CPU CPUID differences; heterogeneous tables produce guest instability. KVM_SET_CPUID2 supersedes the older KVM_SET_CPUID (_IOW(KVMIO, 0x8a, struct kvm_cpuid)); use KVM_SET_CPUID2 on any modern kernel.
The typical workflow is to call KVM_GET_SUPPORTED_CPUID on the system fd to obtain the host-supported leaves, apply any per-guest masking or feature-hiding policy, and then call KVM_SET_CPUID2 on each vCPU.
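The masking step in the middle of that workflow can be sketched in C as follows. The helper names cpuid_alloc, cpuid_find, and cpuid_hide_leaf1_ecx are hypothetical, and hiding a leaf 1 ECX bit is only an example policy, not something the book's VMM is known to do. x86 kernel headers are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <linux/kvm.h>

/* Allocate a zeroed kvm_cpuid2 with room for `nent` entries
 * (the struct ends in a flexible array). Caller frees. */
static struct kvm_cpuid2 *cpuid_alloc(uint32_t nent)
{
    struct kvm_cpuid2 *c;

    assert(nent > 0);
    c = calloc(1, sizeof *c + (size_t)nent * sizeof c->entries[0]);
    if (c)
        c->nent = nent;
    return c;
}

/* Find a leaf. The index only matters on entries that carry
 * KVM_CPUID_FLAG_SIGNIFCANT_INDEX (the kernel's own spelling). */
static struct kvm_cpuid_entry2 *
cpuid_find(struct kvm_cpuid2 *c, uint32_t function, uint32_t index)
{
    for (uint32_t i = 0; i < c->nent; i++) {
        struct kvm_cpuid_entry2 *e = &c->entries[i];
        int indexed = (e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) != 0;

        if (e->function == function && (!indexed || e->index == index))
            return e;
    }
    return NULL;
}

/* Example policy: hide one feature by clearing a bit in leaf 1 ECX.
 * Returns 0 on success, -1 if the leaf is missing or the bit is invalid. */
static int cpuid_hide_leaf1_ecx(struct kvm_cpuid2 *c, unsigned bit)
{
    struct kvm_cpuid_entry2 *e = cpuid_find(c, 1, 0);

    if (!e || bit > 31)
        return -1;
    e->ecx &= ~(UINT32_C(1) << bit);
    return 0;
}
```

In a full VMM the table would be filled by ioctl(sysfd, KVM_GET_SUPPORTED_CPUID, c) with nent set to the allocated capacity, masked, and then passed unchanged to ioctl(vcpufd, KVM_SET_CPUID2, c) for every vCPU so that all of them see identical data.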
KVM_SET_FPU
_IOW(KVMIO, 0x8d, struct kvm_fpu)
Firecracker's initialization sets two fields in struct kvm_fpu before the first KVM_RUN: fcw = 0x37f (the x87 FPU control word with all exceptions masked) and mxcsr = 0x1f80 (the SSE control/status register with all SIMD exceptions masked). Without this initialization the FPU starts in an unpredictable state and any guest floating-point operation may produce a spurious #MF or #XF exception that panics the guest kernel before it has printed its first console line.
KVM_GET_ONE_REG / KVM_SET_ONE_REG
_IOW(KVMIO, 0xab, struct kvm_one_reg) (GET)
_IOW(KVMIO, 0xac, struct kvm_one_reg) (SET)
Both use _IOW (write-to-kernel direction). This is intentional: struct kvm_one_reg contains a pointer to the caller's output buffer rather than embedding the data inline, so in both the GET and the SET case the struct itself travels only from userspace to the kernel. The pointer dereference is what transfers data in the GET direction.
Used primarily on arm64 and other non-x86 architectures where there is no flat struct kvm_regs. On x86 the flat register ioctls are preferred.
KVM_KVMCLOCK_CTRL
_IO(KVMIO, 0xad) encoded: 0x0000AEAD
Sets a flag in the vCPU's pvclock structure that tells the guest this vCPU was paused by the host; the guest's soft-lockup watchdog checks that flag. A Firecracker fix merged in 2024 added a KVM_KVMCLOCK_CTRL call before resuming a vCPU from a snapshot to prevent the Linux guest watchdog from firing a soft-lockup warning on time-skewed restore. Without the call, the guest kernel sees a large jump in time and the watchdog fires within seconds of snapshot resume.
KVM_EXIT_* Exit Reason Codes
When KVM_RUN returns 0, kvm_run.exit_reason holds one of the following KVM_EXIT_* constants, defined in include/uapi/linux/kvm.h. The union subfields listed below are valid only for their respective exit reasons.
| Constant | Value | Meaning |
|---|---|---|
| KVM_EXIT_UNKNOWN | 0 | Unrecognised exit; inspect kvm_run.hw.hardware_exit_reason |
| KVM_EXIT_EXCEPTION | 1 | x86 hardware exception |
| KVM_EXIT_IO | 2 | PIO in/out; see kvm_run.io subfield |
| KVM_EXIT_HYPERCALL | 3 | Hypercall |
| KVM_EXIT_DEBUG | 4 | Debug event |
| KVM_EXIT_HLT | 5 | Guest executed HLT with no pending work |
| KVM_EXIT_MMIO | 6 | MMIO access; see kvm_run.mmio subfield |
| KVM_EXIT_IRQ_WINDOW_OPEN | 7 | Interrupt window open (response to request_interrupt_window) |
| KVM_EXIT_SHUTDOWN | 8 | Guest shutdown (triple fault or ACPI power-off) |
| KVM_EXIT_FAIL_ENTRY | 9 | VMX/SVM VM-entry failure; see kvm_run.fail_entry subfield |
| KVM_EXIT_INTR | 10 | Host signal interrupted KVM_RUN (errno = EINTR); must retry |
| KVM_EXIT_SET_TPR | 11 | CR8 write (Task Priority Register) |
| KVM_EXIT_TPR_ACCESS | 12 | TPR access reporting |
| KVM_EXIT_NMI | 16 | NMI window |
| KVM_EXIT_INTERNAL_ERROR | 17 | KVM internal error; see kvm_run.internal.suberror |
| KVM_EXIT_SYSTEM_EVENT | 24 | System event; see kvm_run.system_event subfield |
| KVM_EXIT_X86_RDMSR | 29 | Unhandled RDMSR (user-space MSR handling enabled) |
| KVM_EXIT_X86_WRMSR | 30 | Unhandled WRMSR |
| KVM_EXIT_DIRTY_RING_FULL | 31 | Dirty ring full; VMM must drain before KVM_RUN again |
| KVM_EXIT_MEMORY_FAULT | 39 | Guest accessed memory with no valid mapping |
The full list through Linux 6.x contains 44 values; only those the book exercises are shown here. The complete set is in include/uapi/linux/kvm.h.
Union Subfields for Common Exits
KVM_EXIT_IO (exit_reason = 2):
struct {
__u8 direction; /* 0 = IN (guest reading from port; VMM provides value), 1 = OUT (guest writing to port; VMM consumes value) */
__u8 size; /* 1, 2, or 4 bytes */
__u16 port;
__u32 count;
__u64 data_offset; /* byte offset from start of kvm_run to data buffer */
} io;
KVM_EXIT_MMIO (exit_reason = 6):
struct {
__u64 phys_addr; /* guest physical address */
__u8 data[8]; /* value read or to be written */
__u32 len; /* access size in bytes */
__u8 is_write;
} mmio;
KVM_EXIT_FAIL_ENTRY (exit_reason = 9):
struct {
__u64 hardware_entry_failure_reason; /* VM-entry control failure code */
__u32 cpu;
} fail_entry;
KVM_EXIT_INTERNAL_ERROR (exit_reason = 17):
struct {
__u32 suberror;
__u32 ndata;
__u64 data[16];
} internal;
KVM_EXIT_SYSTEM_EVENT (exit_reason = 24):
struct {
__u32 type; /* KVM_SYSTEM_EVENT_SHUTDOWN=1, KVM_SYSTEM_EVENT_RESET=2,
KVM_SYSTEM_EVENT_CRASH=3 */
__u32 ndata;
union { __u64 flags; __u64 data[16]; };
} system_event;
Firecracker's VcpuExit match in src/vmm/src/vstate/vcpu.rs maps the exits as follows: MmioRead/MmioWrite dispatch to the device bus; IoIn/IoOut dispatch to the IO bus; X86Rdmsr/X86Wrmsr are handled for select MSRs (including the kvmclock MSR); SystemEvent maps KVM_SYSTEM_EVENT_SHUTDOWN and KVM_SYSTEM_EVENT_RESET to microVM lifecycle actions; FailEntry and InternalError are logged and returned as FaultyKvmExit.
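A C counterpart to that dispatch, reading the union subfields directly, could look like the sketch below. It is not a translation of Firecracker's code: the vmm_action enum and handle_exit are hypothetical, the serial port number 0x3f8 is an assumed device address, and unmapped reads are simply answered with zeros.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <linux/kvm.h>

enum vmm_action { VMM_RESUME, VMM_STOP, VMM_FATAL };  /* hypothetical */

/* Decide what the vCPU loop does after KVM_RUN has returned. */
static enum vmm_action handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO: {
        /* The data buffer lives inside the kvm_run mapping. */
        uint8_t *data = (uint8_t *)run + run->io.data_offset;
        size_t bytes = (size_t)run->io.size * run->io.count;

        if (run->io.direction == KVM_EXIT_IO_OUT) {
            if (run->io.port == 0x3f8)       /* assumed serial port */
                fwrite(data, 1, bytes, stdout);
        } else {
            memset(data, 0, bytes);          /* IN: VMM supplies the value */
        }
        return VMM_RESUME;
    }
    case KVM_EXIT_MMIO:
        assert(run->mmio.len <= sizeof run->mmio.data);
        if (!run->mmio.is_write)
            memset(run->mmio.data, 0, run->mmio.len);
        return VMM_RESUME;
    case KVM_EXIT_INTR:
        return VMM_RESUME;                   /* signal handled; re-enter */
    case KVM_EXIT_HLT:
    case KVM_EXIT_SHUTDOWN:
        return VMM_STOP;
    case KVM_EXIT_SYSTEM_EVENT:
        if (run->system_event.type == KVM_SYSTEM_EVENT_SHUTDOWN ||
            run->system_event.type == KVM_SYSTEM_EVENT_RESET)
            return VMM_STOP;
        return VMM_FATAL;
    case KVM_EXIT_FAIL_ENTRY:
        fprintf(stderr, "VM entry failed: 0x%llx\n", (unsigned long long)
                run->fail_entry.hardware_entry_failure_reason);
        return VMM_FATAL;
    case KVM_EXIT_INTERNAL_ERROR:
        fprintf(stderr, "KVM internal error, suberror %u\n",
                run->internal.suberror);
        return VMM_FATAL;
    default:
        return VMM_FATAL;
    }
}
```

The vCPU loop would call ioctl(vcpufd, KVM_RUN, 0), pass the mapped struct to handle_exit, and re-enter the guest only on VMM_RESUME.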
The KVM_RUN Loop
The sequence diagram below traces the full VMM lifecycle from open to exit, placing the ioctls in their correct order and on their correct fd targets. Capability checks are shown before the operations they gate.
sequenceDiagram
participant VMM
participant KVM
VMM->>KVM: open("/dev/kvm") → sysfd
VMM->>KVM: ioctl(sysfd, KVM_GET_API_VERSION) → 12
VMM->>KVM: ioctl(sysfd, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY)
VMM->>KVM: ioctl(sysfd, KVM_CHECK_EXTENSION, KVM_CAP_IRQCHIP)
VMM->>KVM: ioctl(sysfd, KVM_GET_VCPU_MMAP_SIZE) → mmap_size
VMM->>KVM: ioctl(sysfd, KVM_CREATE_VM, 0) → vmfd
VMM->>KVM: ioctl(vmfd, KVM_SET_TSS_ADDR, tss_gpa)
VMM->>KVM: ioctl(vmfd, KVM_CREATE_IRQCHIP)
VMM->>KVM: ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, slot)
VMM->>KVM: ioctl(vmfd, KVM_CREATE_VCPU, 0) → vcpufd
VMM->>VMM: mmap(vcpufd, mmap_size) → kvm_run*
VMM->>KVM: ioctl(vcpufd, KVM_SET_SREGS, sregs)
VMM->>KVM: ioctl(vcpufd, KVM_SET_REGS, regs)
VMM->>KVM: ioctl(vcpufd, KVM_SET_CPUID2, cpuid)
loop until shutdown
VMM->>KVM: ioctl(vcpufd, KVM_RUN)
KVM-->>VMM: returns, kvm_run.exit_reason set
VMM->>VMM: dispatch on exit_reason
end
The diagram omits KVM_IRQFD, KVM_IOEVENTFD, and KVM_SET_GSI_ROUTING for clarity. Those three are issued on the VM fd after KVM_CREATE_IRQCHIP and before the first KVM_RUN, as Chapter 8 details.
Capability Check Reference
KVM_CHECK_EXTENSION takes a KVM_CAP_* integer. The table below covers the capabilities most relevant to x86 microVM construction. Capabilities marked as gating an ioctl must return a positive value before that ioctl is issued; issuing an ungated ioctl returns ENOTTY or EINVAL depending on the kernel version.
| Capability | Value | Gates or meaning |
|---|---|---|
| KVM_CAP_IRQCHIP | 0 | KVM_CREATE_IRQCHIP, KVM_IRQ_LINE |
| KVM_CAP_HLT | 1 | HLT trap generates KVM_EXIT_HLT |
| KVM_CAP_USER_MEMORY | 3 | KVM_SET_USER_MEMORY_REGION |
| KVM_CAP_SET_TSS_ADDR | 4 | KVM_SET_TSS_ADDR |
| KVM_CAP_EXT_CPUID | 7 | KVM_GET_SUPPORTED_CPUID on system fd |
| KVM_CAP_NR_VCPUS | 9 | Recommended (soft) max vCPU count; assume 4 if absent |
| KVM_CAP_COALESCED_MMIO | 15 | Coalesced MMIO ring in vCPU mmap region |
| KVM_CAP_IRQFD | 32 | KVM_IRQFD |
| KVM_CAP_IOEVENTFD | 36 | KVM_IOEVENTFD |
| KVM_CAP_SET_IDENTITY_MAP_ADDR | 37 | KVM_SET_IDENTITY_MAP_ADDR |
| KVM_CAP_ADJUST_CLOCK | 39 | KVM_GET_CLOCK / KVM_SET_CLOCK |
| KVM_CAP_INTERNAL_ERROR_DATA | 40 | Extended data in kvm_run.internal on KVM_EXIT_INTERNAL_ERROR |
| KVM_CAP_XSAVE | 55 | KVM_GET_XSAVE / KVM_SET_XSAVE |
| KVM_CAP_XCRS | 56 | KVM_GET_XCRS / KVM_SET_XCRS |
| KVM_CAP_TSC_CONTROL | 60 | Per-vCPU TSC frequency scaling |
| KVM_CAP_GET_TSC_KHZ | 61 | KVM_GET_TSC_KHZ vCPU ioctl |
| KVM_CAP_MAX_VCPUS | 66 | Hard limit on vCPU count per VM |
| KVM_CAP_SPLIT_IRQCHIP | 121 | Split LAPIC / IOAPIC configuration |
| KVM_CAP_IMMEDIATE_EXIT | 136 | immediate_exit field in struct kvm_run |
Three capabilities mentioned in this appendix (KVM_CAP_READONLY_MEM, KVM_CAP_NR_MEMSLOTS, KVM_CAP_MAX_VCPU_ID) are omitted from the table because their numeric values could not be confirmed, and printing them speculatively would be worse than leaving them out. Consult include/uapi/linux/kvm.h directly for their current numeric assignments.
rust-vmm Wrappers
Firecracker does not call ioctl(2) directly. It uses two crates from the rust-vmm project:
kvm-ioctls wraps the three fd types as Kvm (system fd), VmFd (VM fd), and VcpuFd (vCPU fd). VcpuFd::run() returns a VcpuExit enum whose variants include IoIn, IoOut, MmioRead, MmioWrite, Hlt, X86Rdmsr, X86Wrmsr, Hypercall, FailEntry, InternalError, and SystemEvent. When KVM_RUN fails with EINTR, the crate passes the error back to the caller rather than retrying, so the retry decision stays in the VMM's vCPU loop.
kvm-bindings provides #[repr(C)] Rust structs that mirror the C structs in include/uapi/linux/kvm.h and arch/x86/include/uapi/asm/kvm.h. When the kernel header changes struct layout, kvm-bindings is the single point that tracks the change; Firecracker's register-initialization code in src/vmm/src/arch/x86_64/regs.rs then compiles against the updated bindings.
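A KVM_SET_SREGS call in Firecracker looks like vcpu.set_sregs(&sregs)? rather than ioctl(vcpufd, 0x4138AE84, &sregs). The raw number follows from the encoding rules at the start of this appendix: _IOW gives dir = 1, the x86_64 struct kvm_sregs is 312 bytes (0x138), the type is 0xAE, and nr is 0x84. Error handling for the underlying ioctl failures, including the cases where KVM_RUN returns EINTR, is covered where kvm-ioctls surfaces those errors to Firecracker's vCPU thread loop.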
Sources And Further Reading
- Linux KVM API documentation (primary specification for all ioctls and capability constants): https://docs.kernel.org/virt/kvm/api.html
- include/uapi/linux/kvm.h (kernel master: KVM_EXIT_* constants, struct definitions, KVM_CAP_* values): https://github.com/torvalds/linux/blob/master/include/uapi/linux/kvm.h
- include/uapi/asm-generic/ioctl.h (ioctl 32-bit encoding macros _IO, _IOR, _IOW, _IOWR): https://github.com/torvalds/linux/blob/master/include/uapi/asm-generic/ioctl.h
- arch/x86/include/uapi/asm/kvm.h (x86-specific structs: kvm_regs, kvm_sregs, kvm_segment, kvm_dtable): https://raw.githubusercontent.com/torvalds/linux/master/arch/x86/include/uapi/asm/kvm.h
- Linux ioctl decoding guide: https://docs.kernel.org/userspace-api/ioctl/ioctl-decoding.html
- Linux ioctl number registry (KVMIO = 0xAE assignment): https://docs.kernel.org/userspace-api/ioctl/ioctl-number.html
- ioctl(2) man page (warning on unreliable size bits): https://man7.org/linux/man-pages/man2/ioctl.2.html
- kvm-ioctls crate (rust-vmm): https://docs.rs/kvm-ioctls/latest/kvm_ioctls/
- kvm-bindings crate (rust-vmm): https://github.com/rust-vmm/kvm-bindings
- Firecracker src/vmm/src/vstate/vm.rs (KVM_CREATE_VM retry logic, memory slot management): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/vstate/vm.rs
- Firecracker src/vmm/src/arch/x86_64/regs.rs (boot register initialization, FPU setup, GDT layout): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/arch/x86_64/regs.rs
- Firecracker CHANGELOG (KVM_KVMCLOCK_CTRL snapshot-restore fix, 2024): https://github.com/firecracker-microvm/firecracker/blob/main/CHANGELOG.md