Chapter 6: Guest Memory And Two-Dimensional Paging
The previous chapter registered guest memory with KVM_SET_USER_MEMORY_REGION and moved on. What it did not explain is what "guest physical address" actually means, or what the hardware does when a guest instruction fetches from one. The guest OS believes it is running on a machine whose RAM starts at address zero. The VMM process is running on a host OS that has no idea a guest exists. Somehow a load instruction issued inside the VM must reach host physical DRAM — and on a busy host that memory might be at physical address 0x2_4e3f_0000, nothing close to what the guest sees. This chapter is about how that bridging works, what data structures implement it, and what it costs.
The answer is two-dimensional paging: the guest's page tables translate guest virtual to guest physical, and a second hardware page-table tree translates guest physical to host physical. A TLB miss has to walk both trees. Intel calls its implementation EPT (Extended Page Tables); AMD calls its implementation NPT (Nested Page Tables), marketed as RVI (Rapid Virtualization Indexing). Before those technologies existed, hypervisors had to maintain shadow page tables that merged both translations into one, at a maintenance cost so severe that the hardware acceleration provided speed-ups of up to 48% on MMU-intensive benchmarks and up to 600% on MMU-intensive microbenchmarks.
Four Address Spaces, Not Two
The KVM MMU documentation (Documentation/virt/kvm/x86/mmu.rst) defines four distinct address spaces that coexist in a virtualized x86-64 system. GPA is not HVA; confusing the two is the most common error in VMM memory code.
| Symbol | Name | Controlled by |
|---|---|---|
| GVA | Guest Virtual Address | Guest OS — CR3-rooted 4-level page tables inside the VM |
| GPA | Guest Physical Address | VMM — KVM memory slots and the EPT/NPT structure |
| HVA | Host Virtual Address | VMM process mmap — an ordinary pointer in the Firecracker address space |
| HPA | Host Physical Address | Host OS page tables — where DRAM actually is |
The translation chain in EPT/NPT mode is:
GVA --(guest page tables, CR3)--> GPA
GPA --(KVM memslot lookup)--> HVA
HVA --(host page tables)--> HPA
The instinct to equate GPA with HPA, or GPA with HVA, is the most common mental-model error when reading VMM source code. The guest believes its RAM starts at GPA zero; the VMM allocated the backing memory at some HVA like 0x7f3a80000000. The two are related only by the slot registration. Similarly, GPA is not HPA: the host kernel places the physical backing wherever it sees fit when the pages are first touched. Neither equality holds in production. The hardware's job, through EPT/NPT, is to make the distinction invisible to guest code.
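The GPA-to-HVA step of the chain can be sketched as a lookup over the registered slots. This is a minimal illustration, not KVM's or any VMM's code: the slot_map struct and gpa_to_hva function are hypothetical names, and a linear scan stands in for whatever index a real VMM would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VMM-side record of one registered slot (not a KVM type). */
struct slot_map {
    uint64_t gpa_base;  /* guest_phys_addr passed to KVM */
    uint64_t size;      /* memory_size in bytes */
    uint8_t *hva_base;  /* userspace_addr, as a pointer in the VMM process */
};

/* Translate a GPA to an HVA by scanning the slot table.
 * Returns NULL when no slot covers the address (for example, an MMIO hole). */
uint8_t *gpa_to_hva(const struct slot_map *slots, size_t n, uint64_t gpa)
{
    assert(slots != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct slot_map *s = &slots[i];
        if (gpa >= s->gpa_base && gpa - s->gpa_base < s->size)
            return s->hva_base + (gpa - s->gpa_base);
    }
    return NULL;
}
```

The subtraction is done before the comparison against size so that the range check cannot overflow for slots near the top of the address space.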
Memory Slots
KVM's model for guest memory is built around memory slots: named, numbered regions that declare "guest physical addresses from guest_phys_addr to guest_phys_addr + memory_size are backed by host virtual memory starting at userspace_addr." A slot is not the memory itself; it is a mapping declaration. The VMM supplies the backing memory by any means it chooses, and KVM uses the slot to build the EPT/NPT entries that make the translation fast.
KVM_SET_USER_MEMORY_REGION
KVM_SET_USER_MEMORY_REGION is _IOW(KVMIO, 0x46, struct kvm_userspace_memory_region), a VM ioctl issued on the VM file descriptor. It requires capability KVM_CAP_USER_MEMORY. The struct, from include/uapi/linux/kvm.h:
struct kvm_userspace_memory_region {
__u32 slot; /* bits 0-15: slot index; bits 16-31: address space ID */
__u32 flags;
__u64 guest_phys_addr; /* GPA base of this slot */
__u64 memory_size; /* bytes; 0 = delete this slot */
__u64 userspace_addr; /* HVA: host virtual address of backing memory */
};
The slot index in bits 0–15 is the identifier KVM uses to distinguish slots. Bits 16–31 carry the address space ID, used when KVM_CAP_MULTI_ADDRESS_SPACE is available; most VMMs operate in address space zero and leave these bits clear. Calling the ioctl with an existing slot number modifies that slot: it can be moved to a different guest physical address or have its flags changed, but it cannot be resized in place. Passing memory_size = 0 deletes the slot. Slots must not overlap in guest physical address space; the kernel enforces this and fails the ioctl with EEXIST on a conflict.
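The slot-field encoding and the no-overlap rule can be sketched in a few lines. The helper names make_slot_field and gpa_ranges_overlap are illustrative assumptions, and the overlap check assumes ranges that do not wrap around the 64-bit address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack an address space ID (bits 16-31) and slot index (bits 0-15)
 * into the 32-bit slot field of the ioctl struct. */
uint32_t make_slot_field(uint16_t as_id, uint16_t slot_index)
{
    return ((uint32_t)as_id << 16) | slot_index;
}

/* True when [a, a + a_size) and [b, b + b_size) share at least one byte.
 * A VMM can run this before the ioctl to catch layout bugs early. */
bool gpa_ranges_overlap(uint64_t a, uint64_t a_size,
                        uint64_t b, uint64_t b_size)
{
    assert(a_size > 0 && b_size > 0);
    return a < b + b_size && b < a + a_size;
}
```

Two slots that merely touch, such as 0x0 to 0x1000 and 0x1000 to 0x2000, do not overlap under this definition.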
The flags field has three defined bits:
| Flag | Value | Meaning |
|---|---|---|
| KVM_MEM_LOG_DIRTY_PAGES | 0x1 | KVM maintains a dirty bitmap; retrieve with KVM_GET_DIRTY_LOG |
| KVM_MEM_READONLY | 0x2 | Guest writes produce KVM_EXIT_MMIO; requires KVM_CAP_READONLY_MEM |
| KVM_MEM_GUEST_MEMFD | 0x4 | Backed by a guest memfd; only valid in KVM_SET_USER_MEMORY_REGION2 |
On ARM64, a write to a KVM_MEM_READONLY slot injects an abort into the guest rather than generating KVM_EXIT_MMIO — a behavioral difference worth knowing if the code targets multiple architectures.
Slot Counts
The kernel on x86-64 has historically supported 509 user-accessible slots (plus 3 internal slots, for a total KVM_MEM_SLOTS_NUM of 512). That limit was raised from 125 to 509 in a 2014 patch to support 256 memory hotplug slots plus 253 device slots. ARM64 was raised from 32 to 508 in a later patch; the 32-slot limit had constrained PCI passthrough device counts. Because the limit depends on architecture and kernel version, query it at run time with KVM_CHECK_EXTENSION(KVM_CAP_NR_MEMSLOTS) rather than hardcoding 509. KVM_CAP_NR_MEMSLOTS is capability number 10, and the query goes through the same KVM_CHECK_EXTENSION ioctl used in chapter 5.
Internally, the kernel stores memory slots in struct kvm_memory_slot (from include/linux/kvm_host.h), which holds base_gfn (the GPA base right-shifted by PAGE_SHIFT), npages, userspace_addr, dirty_bitmap, id, and as_id. The slots are indexed by a red-black tree keyed by guest frame number and a hash table keyed by slot ID, with two sets maintained for lockless readers — an active set and an inactive set swapped via a generation counter on each update.
The Extended Variant
A newer KVM_SET_USER_MEMORY_REGION2 (_IOW(KVMIO, 0x49, struct kvm_userspace_memory_region2)) extends the struct with guest_memfd_offset and guest_memfd fields. It requires capabilities KVM_CAP_GUEST_MEMFD and KVM_CAP_USER_MEMORY2. The extension exists for confidential computing — Intel TDX and AMD SEV-SNP — where guest memory must be isolated from the host even at the hypervisor level. Standard Firecracker deployments use the original ioctl.
The VMM mmap Pattern
The canonical sequence for allocating and registering guest RAM, following the LWN "Using the KVM API" article:
Note: The program issuing these calls must have read/write access to /dev/kvm. On most Linux distributions that means membership in the kvm group or root privilege. The mmap backing must remain live for the entire lifetime of the VM. Unmapping it while the VM is running leaves the slot pointing at an HVA range that is no longer mapped; KVM does not keep the backing alive on the VMM's behalf, so later guest accesses to that range fail.
/* Step 1: allocate anonymous pages in the host process */
void *mem = mmap(NULL, size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, /* MAP_SHARED | MAP_ANONYMOUS also works as slot backing */
-1, 0);
/* Step 2: declare the GPA->HVA mapping to KVM */
struct kvm_userspace_memory_region region = {
.slot = 0,
.guest_phys_addr = 0x1000,
.memory_size = size,
.userspace_addr = (uint64_t)mem,
};
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
userspace_addr is an HVA — a pointer in the VMM process's address space. The kernel has not touched the physical pages yet; the host kernel maps them lazily on first access. KVM records the slot's HVA range and uses it to build the EPT/NPT leaf entries that will resolve GPA to HPA when the guest faults in each page.
The backing can be anonymous memory (MAP_ANONYMOUS), file-backed memory (memfd or hugetlbfs), or device memory. The kernel recommends — not requires — that bits 20:0 of guest_phys_addr and userspace_addr be identical. The reason is alignment: both addresses must have the same offset within a 2 MiB boundary (2^21 bytes) for a host huge page to back a guest huge page without sub-page remapping. Violating this alignment is not an error, but it silently prevents huge-page EPT/NPT entries.
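The alignment recommendation reduces to a mask comparison on the low 21 bits. A minimal sketch follows; hugepage_alignable is a hypothetical helper name, not a KVM or libc function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HUGE_2M (UINT64_C(1) << 21)

static_assert(HUGE_2M == 2u * 1024u * 1024u, "2 MiB is 2^21 bytes");

/* True when the GPA and HVA have the same offset within a 2 MiB
 * boundary, i.e. bits 20:0 of the two addresses are identical. */
bool hugepage_alignable(uint64_t guest_phys_addr, uint64_t userspace_addr)
{
    const uint64_t mask = HUGE_2M - 1;  /* selects bits 20:0 */
    return (guest_phys_addr & mask) == (userspace_addr & mask);
}
```

A VMM that wants huge-page entries would typically over-allocate the mapping and pick a userspace_addr inside it that satisfies this check.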
Firecracker's Slot Layout
Firecracker uses the vm-memory crate from the rust-vmm project, with two principal types:
pub type GuestRegionMmap = vm_memory::GuestRegionMmap<Option<AtomicBitmap>>;
pub type GuestMemoryMmap = vm_memory::GuestRegionCollection<GuestRegionMmapExt>;
GuestRegionMmapExt wraps GuestRegionMmap and adds region_type (Dram or Hotpluggable), slot_from (starting KVM slot number), slot_size (uniform byte size per KVM slot), and plugged (a Mutex<BitVec> tracking which sub-slots are currently active). The AtomicBitmap type tracks dirty pages at page granularity using atomic operations, so multiple vCPU threads can mark pages concurrently without a lock. When dirty tracking is disabled, AtomicBitmap is None.
The slot registration is a From implementation in src/vmm/src/vstate/memory.rs. That From impl is the entire gap between Rust types and the raw ioctl struct. KVM_MEM_LOG_DIRTY_PAGES is set if and only if the AtomicBitmap is present, a clean expression of the dirty-tracking flag as a type-level choice.
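To make the shape of that conversion concrete, here is a C sketch of the same idea. It is not Firecracker's code: guest_region, memory_region_args, and region_to_kvm are hypothetical names, and memory_region_args simply mirrors the field order of struct kvm_userspace_memory_region so the example compiles without kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_LOG_DIRTY_PAGES 0x1u  /* same value as KVM_MEM_LOG_DIRTY_PAGES */

/* Field-for-field mirror of struct kvm_userspace_memory_region. */
struct memory_region_args {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
};

/* Hypothetical VMM-side description of one guest RAM region. */
struct guest_region {
    uint64_t gpa_start;
    uint64_t len;
    void *host_ptr;
    const uint64_t *dirty_bitmap;  /* NULL when dirty tracking is off */
};

/* Build the ioctl argument; the dirty-logging flag follows directly
 * from whether the region carries a bitmap. */
struct memory_region_args region_to_kvm(const struct guest_region *r,
                                        uint32_t slot)
{
    assert(r != NULL && r->len > 0);
    struct memory_region_args out = {
        .slot = slot,
        .flags = r->dirty_bitmap ? MEM_LOG_DIRTY_PAGES : 0,
        .guest_phys_addr = r->gpa_start,
        .memory_size = r->len,
        .userspace_addr = (uint64_t)(uintptr_t)r->host_ptr,
    };
    return out;
}
```

Tying the flag to the presence of the bitmap means there is no separate boolean that could drift out of sync with the tracking state.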
Firecracker's GPA layout on x86-64, from src/vmm/src/arch/x86_64/layout.rs:
| Region | GPA Start | Notes |
|---|---|---|
| Low RAM | 0x0 | Up to the 32-bit MMIO hole |
| EBDA / system data | 0x9fc00 | MPTable, ACPI tables |
| Kernel load | 0x0010_0000 (1 MiB) | HIMEM_START |
| 32-bit MMIO gap | 0xC000_0000–0xFFFF_FFFF | Device BARs, LAPIC, IOAPIC |
| RAM above 4 GiB | 0x1_0000_0000+ | Second slot for guests larger than ~3 GiB |
| 64-bit MMIO gap | 0x40_0000_0000 | MMIO64_MEM_START = 256 << 30 |
The MMIO hole between 3 GiB (0xC000_0000) and 4 GiB is reserved for device configuration. Guest RAM that would otherwise land in that range is placed above 4 GiB in a second KVM slot. This means Firecracker uses at most two RAM memory slots on x86-64.
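The split around the 32-bit MMIO gap can be sketched as follows, taking the gap start from the layout table (0xC000_0000). This is an illustration under that assumption, not Firecracker's implementation: split_guest_ram and ram_region are hypothetical names, and the sketch ignores the 64-bit MMIO gap.

```c
#include <assert.h>
#include <stdint.h>

#define MMIO32_GAP_START       UINT64_C(0xC0000000)   /* 3 GiB, from the table */
#define FIRST_ADDR_PAST_32BITS UINT64_C(0x100000000)  /* 4 GiB */

struct ram_region {
    uint64_t gpa;   /* guest physical start of the region */
    uint64_t size;  /* bytes */
};

/* Lay out ram_size bytes of guest RAM around the 32-bit MMIO gap.
 * Writes one or two regions to out[] and returns how many. */
int split_guest_ram(uint64_t ram_size, struct ram_region out[2])
{
    assert(ram_size > 0);
    if (ram_size <= MMIO32_GAP_START) {
        out[0] = (struct ram_region){ 0, ram_size };
        return 1;
    }
    /* RAM that would collide with the gap continues above 4 GiB. */
    out[0] = (struct ram_region){ 0, MMIO32_GAP_START };
    out[1] = (struct ram_region){ FIRST_ADDR_PAST_32BITS,
                                  ram_size - MMIO32_GAP_START };
    return 2;
}
```

Each returned region would be registered as its own memory slot, since a single slot cannot describe a GPA range with a hole in it.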
Backing mode is determined at construction time in MmapRegion: anonymous() produces MAP_PRIVATE | MAP_ANONYMOUS memory (with optional hugepages), memfd_backed() produces MAP_SHARED memory via a memfd file descriptor, and snapshot_file() uses MAP_PRIVATE from a file for snapshot restore.
Two-Dimensional Paging: EPT And NPT
Shadow Paging: The Problem It Solved and Created
Before EPT and NPT existed, KVM used shadow page tables. A struct kvm_mmu_page held 512 shadow PTEs (SPTEs) that mapped GVA directly to HPA, effectively collapsing the two-level translation into one. That sounds efficient, but the maintenance cost was severe. KVM had to write-protect all guest page tables so it could intercept modifications; every guest CR3 load, every INVLPG, and every page-table write triggered a VM exit so KVM could rebuild or invalidate the corresponding shadow entries. The KVM MMU documentation notes that in EPT mode "neither invlpg nor CR3 loads and stores cause a vmexit in EPT mode, and kvm_set_cr3 is hardly ever called" — describing, by contrast, how intrusive shadow paging was.
Shadow paging was also the source of a central MMU lock that serialized all vCPU threads on page-fault handling, a design that broke down catastrophically at scale. That lock is what motivated the TDP MMU rewrite discussed below.
The 24-Access Worst Case
A TLB miss under two-dimensional paging on x86-64 requires up to 24 memory accesses in the worst case. The derivation: the guest has 4-level page tables (PML4 → PDPT → PD → PT), and the pointer to each guest table is a GPA. Before the hardware can read a guest entry, it must resolve that GPA through the 4-level nested page table, so each guest level costs 4 nested accesses plus 1 access to read the guest entry itself. That gives 4 × 5 = 20 accesses for the guest walk, plus 4 more to translate the final GPA to HPA: 24 total. This is the cold worst case, with no TLB or paging-structure cache entries populated. In steady state the hardware TLBs cache the composed translations and most accesses cost nothing beyond a normal TLB hit.
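The same count generalizes to other table depths. With g guest levels and n nested levels the worst case is g × (n + 1) + n accesses, which the sketch below computes; nested_walk_accesses is an illustrative name.

```c
#include <assert.h>

/* Worst-case memory accesses for one translation under
 * two-dimensional paging with cold TLBs and no walk caches. */
unsigned nested_walk_accesses(unsigned guest_levels, unsigned nested_levels)
{
    assert(guest_levels > 0 && nested_levels > 0);
    /* Per guest level: a full nested walk to locate the guest table,
     * then one access to read the guest entry itself. */
    unsigned guest_walk = guest_levels * (nested_levels + 1);
    /* The final data GPA needs one more nested walk. */
    return guest_walk + nested_levels;
}
```

With four levels on each side this yields 24; with five-level tables on both sides it rises to 35, which is one reason large pages and paging-structure caches matter more as tables deepen.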
Intel EPT
EPT was introduced in the Nehalem microarchitecture — the first Intel Core i-series, around 2008. The "unrestricted guest" mode, which allows a guest to run in real mode without shadow paging, requires EPT and was added in the subsequent Westmere generation.
Enabling EPT. EPT is activated through VMCS Secondary Processor-Based VM-Execution Controls, encoding 0x401E (SECONDARY_VM_EXEC_CONTROL, confirmed in arch/x86/include/asm/vmx.h). Bit 1 of that field is "Enable EPT." Setting it to 1 activates hardware two-level paging for that VM.
EPT Pointer. Once EPT is enabled, the hardware needs to know where the root of the EPT paging structure lives. VMCS field EPT_POINTER (encoding 0x201A, confirmed in arch/x86/include/asm/vmx.h) is written via VMWRITE to supply that root. The EPTP bit layout:
| Bits | Meaning |
|---|---|
| 2:0 | EPT paging-structure memory type (0 = UC, 6 = WB; WB is normal) |
| 5:3 | Page-walk length minus 1 (3 = 4-level EPT, the current standard) |
| 6 | Enable accessed and dirty flags in EPT entries (requires CPU support; absent before Haswell) |
| 11:7 | Reserved, must be zero |
| 51:12 | Physical address of EPT PML4 table |
| 63:52 | Reserved |
A 5-level EPT (PML5) was added for 57-bit guest physical addresses; bits 5:3 would be 4 for 5-level.
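Assembling an EPTP value from the bit layout in the table can be sketched as below. make_eptp is a hypothetical helper; it always selects the write-back memory type and leaves every reserved bit at zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build an EPTP value for a 4- or 5-level EPT rooted at pml4_hpa. */
uint64_t make_eptp(uint64_t pml4_hpa, unsigned levels, bool enable_ad)
{
    assert((pml4_hpa & 0xFFF) == 0);           /* root is 4 KiB aligned */
    assert(levels == 4 || levels == 5);

    uint64_t eptp = 6;                          /* bits 2:0, WB memory type */
    eptp |= (uint64_t)(levels - 1) << 3;        /* bits 5:3, walk length - 1 */
    if (enable_ad)
        eptp |= UINT64_C(1) << 6;               /* bit 6, accessed/dirty flags */
    eptp |= pml4_hpa & UINT64_C(0x000FFFFFFFFFF000);  /* bits 51:12 */
    return eptp;
}
```

For a root at 0x12345000 with 4-level paging and accessed/dirty flags enabled, the low byte is 0x5E: 0x06 for WB, 0x18 for the walk length, and 0x40 for the A/D bit.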
EPT leaf PTE fields. Each EPT entry is 8 bytes. The leaf PTE bits that matter most:
| Bit | Meaning |
|---|---|
| 0 | Read permission |
| 1 | Write permission |
| 2 | Execute permission (covers all instruction fetches, or supervisor-mode fetches only when mode-based execute control is enabled) |
| 5:3 | EPT memory type (6 = WB) |
| 8 | Accessed flag (set by hardware) |
| 9 | Dirty flag (set by hardware on write; leaf entries only) |
| 51:12 | Host physical page frame address |
EPT violations and misconfigurations. An EPT violation exits when a guest access lacks sufficient EPT permission, for example a write to a read-only EPT entry. Exit reason EXIT_REASON_EPT_VIOLATION = 48 (from arch/x86/include/uapi/asm/vmx.h). An EPT misconfiguration exits when an EPT entry has an illegal format, such as an entry with write permission set but read permission clear. Exit reason EXIT_REASON_EPT_MISCONFIG = 49. KVM uses these exits deliberately for MMIO interception: MMIO ranges are left unmapped in the EPT, so the first guest access generates an EPT violation that KVM handles as KVM_EXIT_MMIO back to the VMM, without any explicit MMIO range registration in EPT. KVM can then cache the page as a deliberately malformed entry, so that later accesses to it arrive through the misconfiguration exit, which KVM treats as its fast MMIO path.
flowchart TD
A["Guest memory access (GPA)"]
F{"Entry format valid?\n(e.g. write=1 but read=0 is illegal)"}
G["EPT misconfiguration → EXIT_REASON_EPT_MISCONFIG (49)"]
B{"Entry present?\n(read | write | execute != 0)"}
D{"Permission sufficient\nfor this access?"}
E["EPT violation → EXIT_REASON_EPT_VIOLATION (48)"]
C["Hardware resolves GPA→HPA (no exit)"]
A --> F
F -- no --> G
F -- yes --> B
B -- no --> E
B -- yes --> D
D -- no --> E
D -- yes --> C
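The decision order in the flowchart can be expressed as a small classifier over the three permission bits of an entry. This is a simplified sketch: classify_ept_access is a hypothetical name, only the write-without-read misconfiguration is modeled, and the enum values reuse the exit reason numbers quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define EPT_R 0x1u  /* bit 0: read */
#define EPT_W 0x2u  /* bit 1: write */
#define EPT_X 0x4u  /* bit 2: execute */

enum ept_outcome {
    EPT_OK = 0,          /* hardware resolves GPA to HPA, no exit */
    EPT_VIOLATION = 48,  /* EXIT_REASON_EPT_VIOLATION */
    EPT_MISCONFIG = 49   /* EXIT_REASON_EPT_MISCONFIG */
};

/* Classify one access (a mask of EPT_R/EPT_W/EPT_X) against an entry. */
enum ept_outcome classify_ept_access(uint64_t entry, unsigned access)
{
    assert(access != 0 && (access & ~7u) == 0);
    unsigned perms = (unsigned)(entry & 7);

    if ((perms & EPT_W) && !(perms & EPT_R))
        return EPT_MISCONFIG;  /* illegal format: writable but not readable */
    if (perms == 0)
        return EPT_VIOLATION;  /* entry not present */
    if ((access & perms) != access)
        return EPT_VIOLATION;  /* present, but permission is missing */
    return EPT_OK;
}
```

The format check comes first, matching the flowchart: a malformed entry reports a misconfiguration regardless of which access touched it.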
EPTP switching. VM function 0 allows VMX non-root software to switch EPT roots without a full VM exit, by indexing into a hypervisor-controlled list of 512 8-byte EPTP entries via ECX. VMCS VM-function controls live at encoding 0x2018; the EPTP-list address lives at 0x2024. This is a niche optimization for workloads that need to quickly present different physical memory views to a guest.
AMD NPT
AMD introduced nested paging in 3rd-generation Opteron (codename Barcelona, 2007), one year before Intel's Nehalem. AMD's marketing name is Rapid Virtualization Indexing; the engineering name is NPT. Performance gains over shadow paging: VMware research measured up to 42%; Red Hat OLTP testing showed approximately 2× throughput improvement.
Enabling NPT. AMD's VM control block is the VMCB, a structure distinct from Intel's VMCS. Bit 0 of the np_enable field (VMCB control area offset 0x90) activates nested paging when the VMRUN instruction is issued.
nCR3. The nested page table root is held in nCR3 (Nested CR3), a 64-bit field at VMCB control area offset 0xB0 (confirmed in FreeBSD's sys/amd64/vmm/amd/vmcb.h as VMCB_OFF_NPT_BASE). It holds a host physical address — the HPA of the top-level NPT paging structure. The guest's ordinary CR3 (gCR3) holds a GPA of the guest's own page-table root. Both pointers are active simultaneously; this is the fundamental asymmetry. The hardware consults gCR3 for GVA→GPA and nCR3 for GPA→HPA.
ASID. Each guest is assigned an Address Space Identifier at VMCB control area offset 0x58 (VMCB_OFF_ASID), so the hardware can tag TLB entries per-guest and avoid full TLB flushes on VM entry and exit.
Nested page faults. A nested page fault generates SVM exit code SVM_EXIT_NPF = 0x400 (from arch/x86/include/uapi/asm/svm.h). KVM module parameters kvm-amd.npt=0 and kvm-intel.ept=0 disable NPT and EPT respectively at module load time; the default for both is 1 (enabled for 64-bit and 32-bit PAE mode).
VMCB clean bits. Bit 4 of the VMCB clean-bits field (offset 0xC0) is VMCB_CACHE_NP, signaling that the nested paging fields including nCR3 are clean and need not be reloaded from VMCB on VM entry. This caching reduces the cost of rapid VMRUN calls on nested-paging paths.
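The VMCB offsets mentioned in this section can be exercised on a plain byte buffer. The sketch below is illustrative only: vmcb_enable_npt is a hypothetical helper, the buffer stands in for a real VMCB page, and the memcpy stores assume a little-endian host, which holds on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VMCB_OFF_ASID      0x58  /* guest ASID */
#define VMCB_OFF_NP_ENABLE 0x90  /* bit 0 enables nested paging */
#define VMCB_OFF_NPT_BASE  0xB0  /* nCR3 */
#define VMCB_OFF_CLEAN     0xC0  /* clean bits */
#define VMCB_CLEAN_NP_BIT  4     /* nested paging fields are cached */

/* Enable nested paging in a VMCB control area held in memory. */
void vmcb_enable_npt(uint8_t *vmcb, uint32_t asid, uint64_t ncr3_hpa)
{
    assert(vmcb != NULL);
    assert(asid != 0);                 /* ASID 0 is reserved for the host */
    assert((ncr3_hpa & 0xFFF) == 0);   /* NPT root is 4 KiB aligned */

    uint64_t np_ctl;
    memcpy(&np_ctl, vmcb + VMCB_OFF_NP_ENABLE, sizeof np_ctl);
    np_ctl |= 1;                       /* set bit 0, keep the other bits */
    memcpy(vmcb + VMCB_OFF_NP_ENABLE, &np_ctl, sizeof np_ctl);

    memcpy(vmcb + VMCB_OFF_ASID, &asid, sizeof asid);
    memcpy(vmcb + VMCB_OFF_NPT_BASE, &ncr3_hpa, sizeof ncr3_hpa);

    /* nCR3 changed, so the cached nested-paging state is stale. */
    uint32_t clean;
    memcpy(&clean, vmcb + VMCB_OFF_CLEAN, sizeof clean);
    clean &= ~(UINT32_C(1) << VMCB_CLEAN_NP_BIT);
    memcpy(vmcb + VMCB_OFF_CLEAN, &clean, sizeof clean);
}
```

Clearing the clean bit is the step that is easy to forget: if it stays set after nCR3 changes, the processor is allowed to keep using the old root.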
The TDP MMU
KVM's TDP MMU (arch/x86/kvm/mmu/tdp_mmu.c) is a reimplementation of the KVM MMU designed specifically for EPT/NPT. It eliminates the reverse mapping (rmap) data structure that shadow paging required. Shadow paging needed rmaps to find every SPTE that mapped a given host physical page — necessary for write-protection maintenance. TDP page tables are per-VM and map GPA directly to HPA without indirection through GVA contexts, so no rmap is needed. Eliminating rmap removed the central MMU lock that serialized all vCPU threads.
When the TDP MMU was introduced as a 22-patch series in September 2020, Google measured an 89% reduction in demand-paging test duration on 416-vCPU VMs; previously 98% of time was spent waiting for the MMU lock. The TDP MMU enabled live migration of 416-vCPU, 12 TiB VMs that had been impractical with the legacy MMU. The TDP MMU became the default for x86-64 KVM in Linux 5.15. The legacy shadow MMU remains as a fallback when EPT/NPT is unavailable.
In TDP mode, the SPTE role has role.base.direct = true (direct GPA→HPA mapping), with role.base.cr0_wp and role.base.efer_nx unconditionally set to true — unlike shadow paging, where they reflect actual guest CPU state. KVM supports 4 KiB (level-1 SPTE), 2 MiB (level-2), and 1 GiB (level-3) EPT/NPT entries. A large SPTE requires that the host supports the page size, that the guest PTE maps an equivalent range, that no write-protected pages exist in the range, and that the entire range falls within a single memory slot.
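The level-to-size relationship, and the requirement that a large mapping stay inside one memory slot, can be sketched as follows. Both function names are hypothetical, and the sketch checks only the slot-containment condition, not host page size, guest PTE size, or write protection.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes mapped by one SPTE at the given level:
 * level 1 = 4 KiB, level 2 = 2 MiB, level 3 = 1 GiB. */
uint64_t spte_level_bytes(unsigned level)
{
    assert(level >= 1 && level <= 3);
    return UINT64_C(4096) << (9 * (level - 1));
}

/* Largest level whose naturally aligned block around gpa lies
 * entirely inside the slot [slot_gpa, slot_gpa + slot_size). */
unsigned max_level_within_slot(uint64_t slot_gpa, uint64_t slot_size,
                               uint64_t gpa)
{
    assert(gpa >= slot_gpa && gpa - slot_gpa < slot_size);
    for (unsigned level = 3; level > 1; level--) {
        uint64_t size = spte_level_bytes(level);
        uint64_t start = gpa & ~(size - 1);
        if (start >= slot_gpa && start - slot_gpa + size <= slot_size)
            return level;
    }
    return 1;  /* a 4 KiB mapping always fits */
}
```

A slot that starts at GPA 0x1000 illustrates the cost of misalignment: the first 2 MiB block begins below the slot, so addresses near the start fall back to 4 KiB mappings.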
Dirty-Page Tracking
Two use cases drive dirty-page tracking: live migration (which must replay every write the guest makes after the first full copy) and snapshot diffing (which records only pages changed since the last snapshot). KVM offers two interfaces for the same underlying data: a per-slot bitmap and a per-vCPU ring buffer. They are mutually exclusive for a given VM.
The Legacy Bitmap Interface
Setting KVM_MEM_LOG_DIRTY_PAGES (0x1) in kvm_userspace_memory_region.flags instructs the kernel to maintain a dirty bitmap for that slot: one bit per 4 KiB page, with bit 0 corresponding to the first page. The bitmap lives in kvm_memory_slot.dirty_bitmap. The kernel also sets EPT/NPT entries in that slot to read-only, so a guest write generates a fault that sets the bit and restores write permission. This per-page write protection is why enabling dirty-page tracking forces 4 KiB EPT/NPT granularity even when the host uses 2 MiB huge pages: a huge-page EPT entry cannot be write-protected at 4 KiB sub-page granularity.
KVM_GET_DIRTY_LOG (_IOW(KVMIO, 0x42, struct kvm_dirty_log)) retrieves the bitmap for one slot:
struct kvm_dirty_log {
__u32 slot;
__u32 padding1;
union {
void __user *dirty_bitmap;
__u64 padding2;
};
};
By default, the kernel clears dirty bits atomically before the ioctl returns. KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 (capability value 168) defers that clearing to a subsequent KVM_CLEAR_DIRTY_LOG call. KVM_CLEAR_DIRTY_LOG (_IOWR(KVMIO, 0xc0, struct kvm_clear_dirty_log)) adds __u32 num_pages and __u64 first_page fields, enabling partial range clearing rather than whole-slot clearing — useful for large slots where clearing the entire bitmap stalls the guest.
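Once the bitmap has been copied out, turning set bits back into guest physical addresses is a short loop. The sketch below assumes the bitmap is viewed as 64-bit words on a little-endian 64-bit host and that pages are 4 KiB; collect_dirty_gpas is a hypothetical helper, not part of the KVM API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K UINT64_C(4096)

/* Scan a dirty bitmap covering npages pages of a slot whose first page
 * is at slot_gpa_base. Stores up to out_cap dirty GPAs in out[] and
 * returns the total number of dirty pages found. */
size_t collect_dirty_gpas(const uint64_t *bitmap, size_t npages,
                          uint64_t slot_gpa_base,
                          uint64_t *out, size_t out_cap)
{
    assert(bitmap != NULL);
    assert(out != NULL || out_cap == 0);
    size_t found = 0;
    for (size_t page = 0; page < npages; page++) {
        uint64_t word = bitmap[page / 64];
        if (word == 0) {
            page |= 63;  /* skip the rest of a clean 64-page word */
            continue;
        }
        if (word & (UINT64_C(1) << (page % 64))) {
            if (found < out_cap)
                out[found] = slot_gpa_base + (uint64_t)page * PAGE_SIZE_4K;
            found++;
        }
    }
    return found;
}
```

Skipping zero words matters in practice, because between two harvests most of a large slot is usually clean.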
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 supports two sub-flags: KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0) and KVM_DIRTY_LOG_INITIALLY_SET (1 << 1). When INITIALLY_SET is active, the bitmap starts all-ones, treating all pages as initially dirty, and pages are not write-protected when tracking is enabled. A VMM can then use KVM_CLEAR_DIRTY_LOG to write-protect pages in chunks, spreading the cost of enabling protection across time rather than concentrating it at the moment tracking is enabled. KVM_DIRTY_LOG_INITIALLY_SET is incompatible with the dirty ring interface.
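Chunked clearing needs a plan for which page ranges to pass in each KVM_CLEAR_DIRTY_LOG call. The sketch below assumes the constraint stated in the KVM API documentation, that first_page is a multiple of 64 and num_pages is a multiple of 64 unless the range ends at the end of the slot. plan_clear_chunks and clear_chunk are hypothetical names; the actual ioctl call is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two range fields a caller would copy into struct kvm_clear_dirty_log. */
struct clear_chunk {
    uint64_t first_page;
    uint32_t num_pages;
};

/* Split [0, slot_pages) into chunks of chunk_pages pages each.
 * Stores up to cap chunks in out[] and returns how many are needed. */
size_t plan_clear_chunks(uint64_t slot_pages, uint32_t chunk_pages,
                         struct clear_chunk *out, size_t cap)
{
    assert(chunk_pages > 0 && chunk_pages % 64 == 0);
    assert(out != NULL || cap == 0);
    size_t n = 0;
    for (uint64_t first = 0; first < slot_pages; first += chunk_pages) {
        uint64_t left = slot_pages - first;
        if (n < cap) {
            out[n].first_page = first;
            /* Only the final chunk may be shorter than chunk_pages. */
            out[n].num_pages = left < chunk_pages ? (uint32_t)left
                                                  : chunk_pages;
        }
        n++;
    }
    return n;
}
```

A VMM would issue one KVM_CLEAR_DIRTY_LOG per chunk, interleaved with other work, so the guest never sees one long pause.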
The Dirty Ring Interface
KVM_CAP_DIRTY_LOG_RING (capability value 192) enables the dirty ring. The ring is a per-vCPU mmap'd region, separate from kvm_run, containing struct kvm_dirty_gfn entries:
struct kvm_dirty_gfn {
__u32 flags; /* KVM_DIRTY_GFN_F_DIRTY = (1<<0), KVM_DIRTY_GFN_F_RESET = (1<<1) */
__u32 slot;
__u64 offset; /* page offset within slot */
};
The state machine is: flags = 0 means the entry is invalid (empty slot); flags = 1 (DIRTY set) means the kernel has recorded a dirty GFN; RESET set (flags = 3 when DIRTY is left in place, although the kernel only tests the RESET bit) means userspace has read the entry and is requesting reset. Userspace reads ring entries without any ioctl. After harvesting entries, it issues KVM_RESET_DIRTY_RINGS (_IO(KVMIO, 0xc7)) to re-write-protect the harvested pages.
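The userspace side of that state machine can be sketched as a harvest loop. This is an assumption-laden illustration: dirty_gfn mirrors the layout of struct kvm_dirty_gfn so the example builds without kernel headers, harvest_dirty_ring is a hypothetical name, the ring size is assumed to be a power of two, and the atomics use GCC/Clang builtins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GFN_F_DIRTY 0x1u  /* same value as KVM_DIRTY_GFN_F_DIRTY */
#define GFN_F_RESET 0x2u  /* same value as KVM_DIRTY_GFN_F_RESET */

/* Mirror of struct kvm_dirty_gfn. */
struct dirty_gfn {
    uint32_t flags;
    uint32_t slot;
    uint64_t offset;  /* page offset within the slot */
};

/* Copy consecutive dirty entries, starting at index *next, into out[]
 * and mark each one for reset. Returns the number harvested. */
uint32_t harvest_dirty_ring(struct dirty_gfn *ring, uint32_t ring_entries,
                            uint32_t *next,
                            struct dirty_gfn *out, uint32_t out_cap)
{
    assert(ring != NULL && next != NULL);
    assert(ring_entries != 0 && (ring_entries & (ring_entries - 1)) == 0);
    uint32_t n = 0;
    while (n < out_cap) {
        struct dirty_gfn *e = &ring[*next & (ring_entries - 1)];
        uint32_t flags = __atomic_load_n(&e->flags, __ATOMIC_ACQUIRE);
        if (flags != GFN_F_DIRTY)
            break;  /* empty, or already handed back for reset */
        out[n].flags = 0;
        out[n].slot = e->slot;
        out[n].offset = e->offset;
        n++;
        /* Publish the reset request only after the entry was copied. */
        __atomic_store_n(&e->flags, flags | GFN_F_RESET, __ATOMIC_RELEASE);
        (*next)++;
    }
    return n;
}
```

After a harvest that returns a non-zero count, the caller would issue KVM_RESET_DIRTY_RINGS so the kernel can recycle the entries and re-protect the pages.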
The ring has a genuine trade-off. Under an 800 MB/s random-write rate with a 24 GiB guest, dirty ring required approximately 73 seconds to complete live migration versus approximately 55 seconds for dirty bitmap. At very high dirty rates, the per-page write-protection overhead of the ring can exceed the cost of the bitmap's bulk clearing, making the bitmap faster. The right choice depends on workload: the ring shines when dirty pages are scattered and sparse; the bitmap wins under sustained high write pressure.
Firecracker's Dirty Tracking
Firecracker's MachineConfig (in src/vmm/src/vmm_config/machine_config.rs) has a track_dirty_pages: bool field, default false. When true, Firecracker sets KVM_MEM_LOG_DIRTY_PAGES on all memory slots, and each GuestRegionMmapExt receives a Some(AtomicBitmap) rather than None.
The snapshot flow in src/vmm/src/vmm_config/snapshot.rs uses store_dirty_bitmap() to read KVM's dirty log and merge it into the internal AtomicBitmap. dump_dirty() then iterates 64-bit words of the merged bitmap, seeking past clean regions using sparse-file semantics, and writes only dirty 4 KiB pages to the diff snapshot file. After a diff snapshot, Firecracker resets the dirty bitmap to baseline the next diff.
Without track_dirty_pages, Firecracker falls back to mincore(2) to identify resident pages. This mode requires swap to be disabled: a page swapped out appears as not-in-core and would be silently omitted from the snapshot. The trade-off is that mincore produces no write overhead at runtime, while track_dirty_pages introduces the write-protection overhead described above and forces 4 KiB granularity even when the host uses hugepages. Diff snapshots are currently in developer preview.
Userfaultfd for Snapshot Resume
When restoring a VM from a snapshot, the VMM must repopulate guest memory without stalling the guest for the full restore to complete. Firecracker supports two modes, controlled by LoadSnapshotParams.mem_backend.backend_type: File (blocking read-back) and Uffd (demand-paged via userfaultfd).
In the Uffd path, a separate userspace process receives the userfaultfd file descriptor over a Unix domain socket and responds to UFFD_EVENT_PAGEFAULT by issuing UFFDIO_COPY to populate individual pages on demand as the guest touches them. On Linux 5.10, the userfaultfd object is created via the userfaultfd(2) syscall; on Linux 6.1 and later it is created via /dev/userfaultfd. When the virtio-balloon inflates during a UFFD-backed resume, madvise(MADV_DONTNEED) triggers UFFD_EVENT_REMOVE, and the page handler must zero those pages rather than reloading them from the snapshot file. This is a subtle interaction between two separately designed subsystems, and Firecracker's documentation explicitly warns about it.
Memory Ballooning
Ballooning is the mechanism by which the host can reclaim memory from a running guest without stopping it. The guest OS voluntarily surrenders pages through a device driver, and the VMM releases the backing host memory. The protocol is virtio.
The Virtio Balloon Device
The virtio balloon device has device ID 5 (OASIS virtio 1.2 spec §5.5.1). It uses up to five virtqueues: index 0 (inflate queue, where the guest passes PFNs of pages to surrender), index 1 (deflate queue, where the guest passes PFNs of pages to reclaim), index 2 (stats queue, enabled with VIRTIO_BALLOON_F_STATS_VQ), index 3 (free-page hint queue, enabled with VIRTIO_BALLOON_F_FREE_PAGE_HINT), and index 4 (reporting queue, enabled with VIRTIO_BALLOON_F_REPORTING). The protocol is asymmetric: the host signals how many pages it wants by writing num_pages into the virtio_balloon_config struct; the guest driver responds at its own pace by inflating or deflating through the queues. The host cannot force the guest to respond promptly.
The feature bits that define balloon behavior (from include/uapi/linux/virtio_balloon.h):
| Bit | Constant | Meaning |
|---|---|---|
| 0 | VIRTIO_BALLOON_F_MUST_TELL_HOST | Guest must notify host before reusing deflated pages |
| 1 | VIRTIO_BALLOON_F_STATS_VQ | Enables stats virtqueue (index 2) |
| 2 | VIRTIO_BALLOON_F_DEFLATE_ON_OOM | Guest deflates balloon instead of invoking OOM killer |
| 3 | VIRTIO_BALLOON_F_FREE_PAGE_HINT | Guest reports free pages to host (index 3) |
| 4 | VIRTIO_BALLOON_F_PAGE_POISON | Guest reports page-poison value via poison_val config field |
| 5 | VIRTIO_BALLOON_F_REPORTING | Guest reports free pages via reporting queue for host to reclaim |
The config struct carries two fields visible to the driver: __le32 num_pages (how many pages the host wants in the balloon) and __le32 actual (how many are currently held). The stats queue exchanges struct virtio_balloon_stat entries — tag-value pairs, packed, 10 bytes each — with 16 defined tags as of Linux 6.12, including swap-in/out counts, major/minor faults, free and total memory, OOM kills, and direct and async reclaim statistics.
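Decoding a stats buffer follows directly from the 10-byte entry layout: a 16-bit tag followed by a 64-bit value, both little-endian. The sketch below reads the bytes explicitly so it does not depend on host endianness or struct packing. balloon_stat_find is a hypothetical helper; the two tag constants use the values of VIRTIO_BALLOON_S_MEMFREE and VIRTIO_BALLOON_S_MEMTOT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BALLOON_STAT_SIZE 10  /* 2-byte tag + 8-byte value, packed */
#define STAT_MEMFREE 4        /* VIRTIO_BALLOON_S_MEMFREE */
#define STAT_MEMTOT  5        /* VIRTIO_BALLOON_S_MEMTOT */

static uint16_t load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Search a stats buffer for one tag. Returns true and stores the value
 * when the tag is present; trailing partial entries are ignored. */
bool balloon_stat_find(const uint8_t *buf, size_t len,
                       uint16_t tag, uint64_t *val_out)
{
    assert(buf != NULL && val_out != NULL);
    for (size_t off = 0; off + BALLOON_STAT_SIZE <= len;
         off += BALLOON_STAT_SIZE) {
        if (load_le16(buf + off) == tag) {
            *val_out = load_le64(buf + off + 2);
            return true;
        }
    }
    return false;
}
```

Because the guest chooses which tags to send, a consumer should treat every tag as optional rather than indexing the buffer by position.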
Firecracker's Balloon
Firecracker exposes the balloon through its REST API:
- Pre-boot: PUT /balloon with {"amount_mib": N, "deflate_on_oom": bool, "stats_polling_interval_s": N}
- Runtime: PATCH /balloon to adjust target size and polling interval
- GET /balloon/statistics to read the stats virtqueue values
Firecracker supports three virtio-balloon feature bits: VIRTIO_BALLOON_F_DEFLATE_ON_OOM (bit 2), VIRTIO_BALLOON_F_FREE_PAGE_HINT (bit 3, developer preview), and VIRTIO_BALLOON_F_REPORTING (bit 5).
When the guest inflates the balloon — surrendering pages — Firecracker issues madvise(MADV_DONTNEED) on the corresponding HVA range. This releases the underlying host physical pages back to the host kernel and reduces the Firecracker process RSS. On the next access, the host kernel zero-fills the page (anonymous MAP_PRIVATE semantics), preventing cross-VM data leakage. The Firecracker documentation notes a critical asymmetry: the host can set num_pages to request balloon inflation, but the actual surrender rate is governed by the guest kernel's balloon driver. An operator cannot count on the balloon responding within any particular time bound. The documentation's warning is worth repeating directly: ensure the host is prepared to handle a situation in which the Firecracker process uses all of the memory it was given at boot, even if the balloon was used to restrict guest memory.
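The inflate path can be sketched as a PFN-to-HVA translation followed by madvise. This is a Linux-specific illustration rather than Firecracker's code: balloon_release_page is a hypothetical name, guest RAM is assumed to be one contiguous private anonymous mapping, and balloon PFNs are taken to be in 4 KiB units.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define BALLOON_PAGE_SHIFT 12  /* balloon PFNs count 4 KiB pages */
#define BALLOON_PAGE_SIZE  (UINT64_C(1) << BALLOON_PAGE_SHIFT)

/* Release the host backing of one ballooned guest page.
 * Returns 0 on success, -1 if the PFN is outside guest RAM or
 * madvise fails. */
int balloon_release_page(uint8_t *ram_hva, uint64_t ram_gpa_base,
                         uint64_t ram_size, uint32_t pfn)
{
    assert(ram_hva != NULL);
    uint64_t gpa = (uint64_t)pfn << BALLOON_PAGE_SHIFT;
    if (gpa < ram_gpa_base ||
        gpa - ram_gpa_base + BALLOON_PAGE_SIZE > ram_size)
        return -1;  /* never trust a guest-supplied PFN */
    /* For private anonymous memory the next access is zero-filled. */
    return madvise(ram_hva + (gpa - ram_gpa_base),
                   (size_t)BALLOON_PAGE_SIZE, MADV_DONTNEED);
}
```

The bounds check is the important part: the PFN comes from the guest, so an unchecked value could point madvise at memory that belongs to the VMM itself.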
Oversubscription
Firecracker's design document states that microVMs can oversubscribe host CPU and memory; the degree is controlled by the operator. No built-in hard cap is enforced. The mmap call for guest RAM succeeds as long as virtual address space is available; host physical pages are committed lazily on first access. An operator running 100 Firecracker processes, each with 512 MiB of guest RAM, is not necessarily using 50 GiB of host RAM — only the pages the guests have actually touched are physically backed.
Firecracker's production host setup guide (docs/prod-host-setup.md) mandates two settings that define the oversubscription envelope:
Disable swap. /proc/swaps must be empty. The reason is not performance but security: guest memory swapped to host storage creates data remanence — a page that was part of a guest's heap appears on disk, accessible to the host operator and potentially recoverable after the VM is destroyed. With swap disabled, the only way memory pressure resolves is through the balloon.
Disable KSM. echo 0 > /sys/kernel/mm/ksm/run. KSM (Kernel Same-page Merging) deduplicates pages with identical content across processes, saving physical RAM. The security cost is a timing side channel: by measuring how long certain memory operations take, a process can determine which pages are shared with another process — leaking information about memory access patterns across VM boundaries. Disabling KSM removes this channel entirely.
Note: Both changes require root and affect the entire host, not just the Firecracker process. Disable swap and KSM before starting any Firecracker production workload, not inline with the VMM's startup sequence.
With swap and KSM disabled, virtio-balloon is the only host-side mechanism for reclaiming memory from running VMs. Cgroup memory.limit_in_bytes (or the v2 equivalent memory.max) provides a hard ceiling on how much memory a single Firecracker process can consume, which is the primary per-VM isolation tool. The operator's oversubscription ratio is the total memory_size across all memory slots of all VMs divided by the host's physical RAM. How far above 1.0 that ratio can safely go depends on the guests' actual working sets, so leave a safety margin.
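The ratio itself is simple arithmetic, sketched below with a hypothetical oversubscription_ratio helper. It measures configured guest RAM against host RAM and says nothing about how much of that RAM the guests have actually touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Configured guest RAM across all VMs divided by host physical RAM.
 * A result above 1.0 means the host is oversubscribed on paper. */
double oversubscription_ratio(const uint64_t *guest_ram_bytes, size_t n_vms,
                              uint64_t host_ram_bytes)
{
    assert(host_ram_bytes > 0);
    assert(guest_ram_bytes != NULL || n_vms == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n_vms; i++)
        total += guest_ram_bytes[i];
    return (double)total / (double)host_ram_bytes;
}
```

For the earlier example of 100 microVMs with 512 MiB each, the configured total is 50 GiB; on a host with 32 GiB of RAM the ratio is 1.5625.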
The GPA→HPA chain built in this chapter (slots declaring the GPA→HVA mapping, EPT/NPT translating GPA→HPA in hardware, dirty tracking and ballooning managing the physical backing) is the axis that everything else in a VMM's memory subsystem rotates around. The next chapter turns from memory layout to device I/O: how a guest reaches storage and network without knowing it is virtualized.
Sources And Further Reading
- KVM API kernel documentation (canonical reference for KVM_SET_USER_MEMORY_REGION, dirty log ioctls, flags, capability values): https://docs.kernel.org/virt/kvm/api.html
- KVM MMU documentation (Documentation/virt/kvm/x86/mmu.rst), covering address space definitions (GVA, GPA, HVA, HPA), shadow paging vs. TDP, SPTE levels: https://docs.kernel.org/virt/kvm/x86/mmu.html
- include/uapi/linux/kvm.h (ioctl encodings, struct kvm_userspace_memory_region, dirty log structs, dirty ring structs, capability constants): https://github.com/torvalds/linux/blob/master/include/uapi/linux/kvm.h
- include/linux/kvm_host.h (internal struct kvm_memory_slot: red-black tree, hash table, dirty_bitmap field): https://github.com/torvalds/linux/blob/master/include/linux/kvm_host.h
- arch/x86/include/asm/vmx.h (VMCS field encodings: EPT_POINTER = 0x0000201A, SECONDARY_VM_EXEC_CONTROL = 0x0000401E): https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vmx.h
- arch/x86/include/uapi/asm/vmx.h (VMX exit reason codes: EXIT_REASON_EPT_VIOLATION = 48, EXIT_REASON_EPT_MISCONFIG = 49): https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/vmx.h
- arch/x86/include/uapi/asm/svm.h (SVM exit codes: SVM_EXIT_NPF = 0x400): https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/svm.h
- include/uapi/linux/virtio_balloon.h (virtio-balloon feature bits and stat tag definitions): https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_balloon.h
- FreeBSD sys/amd64/vmm/amd/vmcb.h (AMD VMCB layout: VMCB_OFF_NPT_BASE = 0xB0, VMCB_OFF_ASID = 0x58, VMCB_CACHE_NP bit 4): https://github.com/freebsd/freebsd-src/blob/master/sys/amd64/vmm/amd/vmcb.h
- rust-vmm kvm-bindings (src/x86_64/bindings.rs; KVM_MEM_LOG_DIRTY_PAGES = 0x1, KVM_MEM_READONLY = 0x2): https://github.com/rust-vmm/kvm-bindings/blob/main/src/x86_64/bindings.rs
- ia32-doc machine-readable Intel SDM extract (VMCS field encodings): https://github.com/wbenny/ia32-doc/blob/master/yaml/Intel/VMX/VMCS.yml
- Firecracker memory types and KVM slot registration (src/vmm/src/vstate/memory.rs): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/vstate/memory.rs
- Firecracker x86-64 GPA layout (src/vmm/src/arch/x86_64/layout.rs): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/arch/x86_64/layout.rs
- Firecracker track_dirty_pages field (src/vmm/src/vmm_config/machine_config.rs): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/vmm_config/machine_config.rs
- Firecracker snapshot config (src/vmm/src/vmm_config/snapshot.rs): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/vmm_config/snapshot.rs
- Firecracker snapshot support documentation (diff snapshots, mincore fallback, dirty-tracking constraints): https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md
- Firecracker page-fault handling on snapshot resume (UFFD backend, UFFD_EVENT_PAGEFAULT, UFFDIO_COPY, UFFD_EVENT_REMOVE): https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
- Firecracker hugepages documentation (dirty-tracking incompatibility with huge pages): https://github.com/firecracker-microvm/firecracker/blob/main/docs/hugepages.md
- Firecracker ballooning documentation (REST API, MADV_DONTNEED, supported feature bits): https://github.com/firecracker-microvm/firecracker/blob/main/docs/ballooning.md
- Firecracker production host setup guide (no swap, no KSM, cgroup memory limits): https://github.com/firecracker-microvm/firecracker/blob/main/docs/prod-host-setup.md
- Firecracker design document (oversubscription policy and design goals): https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
- OASIS virtio 1.2 specification §5.5 (balloon device ID 5, virtqueues, feature bits, config struct): https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
- vm-memory crate bitmap/AtomicBitmap documentation: https://docs.rs/vm-memory/latest/vm_memory/bitmap/index.html
- LWN "Using the KVM API" (Josh Triplett), the canonical mmap + KVM_SET_USER_MEMORY_REGION pattern: https://lwn.net/Articles/658511/
- LWN TDP MMU introduction (September 2020; 89% demand-paging improvement, 416-vCPU VMs, no-rmap design): https://lwn.net/Articles/832835/
- LWN dirty ring performance data (800 MB/s random-write rate, 24 GiB guest, 73 s ring vs. 55 s bitmap): https://lwn.net/Articles/833784/
- Phoronix: TDP MMU made default in Linux 5.15: https://www.phoronix.com/news/Linux-5.15-KVM
- KVM x86 memslot increase patch (125 → 509 user slots, 2014): https://patchwork.kernel.org/patch/5244591/
- ARM64 memslot increase patch (32 → 508 user slots): https://patchwork.kernel.org/project/linux-arm-kernel/patch/1486538141-30627-3-git-send-email-linucherian@gmail.com/
- Wikipedia: Second Level Address Translation (EPT Nehalem introduction, NPT Barcelona introduction, 24-access derivation, performance gain figures): https://en.wikipedia.org/wiki/Second_Level_Address_Translation
- KVM memory overview (nCR3 vs. gCR3 distinction): https://www.linux-kvm.org/page/Memory
- ACRN hypervisor memory management (EPT violations, misconfigurations, and MMIO interception pattern): https://projectacrn.github.io/latest/developer-guides/hld/hv-memmgt.html
- KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 INITIALLY_SET details: https://patchwork.kernel.org/patch/11419191/
- Dirty ring and bitmap exclusivity, INITIALLY_SET incompatibility: https://lkml.kernel.org/kvm/20200331190000.659614-7-peterx@redhat.com/