Chapter 4: CPU Virtualization Extensions

Before hardware added explicit virtualization support, a VMM had to catch every privileged guest instruction by running the guest in user mode and trapping each fault. The approach was called "trap and emulate," and it worked well enough for most instructions — but x86 had a category of instructions that were sensitive without being privileged. Instructions like SGDT, SIDT, and PUSHF read or modify state that differs between the host and guest, yet they execute silently at ring 3 instead of faulting. A guest OS issuing SGDT to read the GDT register would get the host's descriptor-table base, not its own. There was no trap, so there was nothing to catch. VMware's early x86 hypervisors solved the problem with binary translation — a JIT compiler that rewrote guest code on the fly before execution, patching out the troublesome instructions. It worked, but it required a scanning pass over every basic block — and VMware's own published benchmarks showed 5–20% overhead on kernel-intensive workloads.

Intel's answer, published in 2005 as VT-x (Virtualization Technology for IA-32 and Intel 64; the Itanium counterpart is VT-i), and AMD's concurrent answer, AMD-V (also called SVM, Secure Virtual Machine), solve the problem the same way: add a hardware-managed execution mode in which the CPU runs guest ring-0 code natively at full speed, intercepts exactly the operations the hypervisor cares about, and saves and restores the world through a control structure in memory. ARM took a structurally different route, adding a dedicated privilege level, EL2, that sits above the OS's EL1 and controls what the guest can see. All three mechanisms share the same goal: make the guest kernel's privilege level architecturally distinct from the hypervisor's, enforced by hardware rather than software convention.

Intel VT-x

Detecting and Entering VMX Mode

The first thing any VMM must do is confirm the CPU supports VMX. That check is CPUID leaf 1, ECX bit 5: CPUID.01H:ECX.VMX[bit 5] = 1. A zero there means no VT-x, full stop.

Three setup steps must happen before VMXON will succeed. First, the VMM sets CR4.VMXE (bit 13). Executing VMXON with that bit clear raises #UD, an invalid-opcode fault — the instruction does not even exist to the CPU in that state. Second, the VMM reads IA32_FEATURE_CONTROL (MSR address 0x3A) and verifies bit 0 (the lock bit) and bit 2 are both set. Bit 0, once written to 1, is latched until power-on reset; BIOS programs it during POST. Bit 2 authorizes VMXON outside SMX operation, which is the normal case. If the lock bit is clear, firmware has not committed the machine to VMX operation and VMXON will fault with #GP(0). Third, the VMM allocates a 4 KiB-aligned VMXON region, writes the 31-bit VMCS revision identifier (from IA32_VMX_BASIC MSR 0x480, bits 30:0) into its first four bytes with bit 31 cleared, and passes its physical address as the operand to VMXON.
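The software-visible checks described above reduce to a few bit tests. The following Python sketch models them on plain integers. The function name and the idea of passing register values as arguments are illustrative assumptions; real code would read CPUID, CR4, and the MSR from ring 0.

```python
# Bit positions taken from the text above.
CPUID_1_ECX_VMX = 1 << 5        # CPUID.01H:ECX.VMX
CR4_VMXE = 1 << 13              # CR4.VMXE
FC_LOCK = 1 << 0                # IA32_FEATURE_CONTROL lock bit
FC_VMX_OUTSIDE_SMX = 1 << 2     # VMXON permitted outside SMX operation


def vmxon_blockers(cpuid_1_ecx, cr4, feature_control):
    """Hypothetical helper: list the reasons VMXON would fail (empty = ready)."""
    blockers = []
    if not cpuid_1_ecx & CPUID_1_ECX_VMX:
        blockers.append("CPU does not report VMX support")
    if not cr4 & CR4_VMXE:
        blockers.append("CR4.VMXE clear: VMXON raises #UD")
    if not feature_control & FC_LOCK:
        blockers.append("IA32_FEATURE_CONTROL unlocked: VMXON raises #GP(0)")
    elif not feature_control & FC_VMX_OUTSIDE_SMX:
        blockers.append("VMX outside SMX not enabled: VMXON raises #GP(0)")
    return blockers
```

An empty list means the VMM can go on to allocate the VMXON region; the sketch deliberately leaves out that third step because it involves physical memory rather than register bits.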

IA32_VMX_BASIC carries a few other fields worth naming. Bits 44:32 give the allocation size for VMXON and VMCS regions (1–4096 bytes). Bits 53:50 give the memory type the processor expects for those regions; value 6 means write-back (WB), the value reported by all processors since Nehalem.

On success, VMXON transitions the logical processor into VMX root operation. The current-VMCS pointer is set to FFFFFFFF_FFFFFFFFH (no VMCS active), INIT signals are blocked, and A20M is disabled. The processor will stay in this mode — handling guest entry and exit — until VMXOFF returns it to plain IA-32e operation.

VMX Root and Non-Root: A Mode Orthogonal to Rings

The central insight of VT-x is the distinction between VMX root operation and VMX non-root operation. The two modes are orthogonal to the familiar ring hierarchy: each has its own complete set of privilege levels, so CPL 0 exists in both.

In VMX root operation the VMM executes. The full instruction set is available, including every VMX instruction. In VMX non-root operation the guest executes. Certain operations that would proceed normally in root operation instead cause VM exits — a hardware-managed transfer of control back to the VMM. The crucial detail: there is no software-visible register bit that indicates which mode the CPU is in. A guest OS running at CPL 0 in VMX non-root operation cannot read a flag and discover it is being virtualized. The mode exists only in the processor's internal state machine, which is exactly what makes it sound as an isolation boundary.

flowchart LR
  subgraph root["VMX Root Operation"]
    vmm["VMM / KVM\n(CPL 0)"]
  end
  subgraph nonroot["VMX Non-Root Operation"]
    gkernel["Guest kernel\n(CPL 0)"]
    guser["Guest user\n(CPL 3)"]
  end
  vmm -->|"VMLAUNCH / VMRESUME"| gkernel
  gkernel -->|"VM exit"| vmm
  guser -->|"VM exit (selected ops)"| vmm

A guest's attempt to execute VMXOFF does not switch the processor back to non-VMX mode; it causes a VM exit. The same applies to VMXON, VMLAUNCH, VMRESUME, VMCLEAR, VMPTRLD, and VMPTRST, which exit unconditionally in non-root operation. VMREAD and VMWRITE also exit by default; the one exception is VMCS shadowing, where the VMM can let selected VMREAD and VMWRITE accesses go to a shadow VMCS without an exit.

The VMCS

Every virtual CPU (vCPU) in a VT-x system is associated with a VMCS — Virtual Machine Control Structure. The VMCS is a processor-managed data structure up to 4096 bytes in size (the exact size is read from IA32_VMX_BASIC bits 44:32). Its internal layout is implementation-specific and never documented by Intel; the only portable way to read or write a VMCS field is through VMREAD and VMWRITE, which take a 32-bit field encoding as their operand.

The first eight bytes have a fixed layout. Bytes 0–3 hold the revision identifier (bits 30:0) and a shadow-VMCS indicator (bit 31): if bit 31 is set, this is a shadow VMCS used for VMCS shadowing, and VMPTRLD will reject it unless VMCS shadowing is enabled. Bytes 4–7 are the VMX-abort indicator, written nonzero by the processor if a VMX abort occurs during a VM exit.

Every VMCS has a launch state — either "clear" (immediately after VMCLEAR) or "launched" (after a successful VMLAUNCH). This state is internal to the processor and cannot be read by software; it determines which entry instruction is legal.

VMCS Field Encodings

Every field is addressed by a 32-bit encoding:

| Bits | Meaning |
|---|---|
| 0 | Access type: 0 = full field, 1 = high 32 bits of a 64-bit field |
| 9:1 | Index within type/width category |
| 11:10 | Type: 0 = control, 1 = VM-exit info, 2 = guest state, 3 = host state |
| 14:13 | Width: 0 = 16-bit, 1 = 64-bit, 2 = 32-bit, 3 = natural-width |

A few encodings from arch/x86/include/asm/vmx.h illustrate the scheme. PIN_BASED_VM_EXEC_CONTROL is 0x4000: type 0 (control), 32-bit, index 0. EPT_POINTER is 0x201A: type 0 (control), 64-bit. VM_EXIT_REASON is 0x4402: type 1 (VM-exit info, read-only), 32-bit. GUEST_CR0 is 0x6800: type 2 (guest state), natural-width. HOST_CR0 is 0x6C00: type 3 (host state), natural-width.
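The encoding scheme can be checked mechanically. The sketch below splits an encoding into the four bit fields listed above; the function name and the returned dictionary shape are illustrative choices, not anything defined by the kernel header.

```python
TYPES = ("control", "VM-exit info", "guest state", "host state")
WIDTHS = ("16-bit", "64-bit", "32-bit", "natural-width")


def decode_vmcs_encoding(enc):
    """Split a VMCS field encoding into its documented bit fields."""
    return {
        "high_half": bool(enc & 1),           # bit 0: access type
        "index": (enc >> 1) & 0x1FF,          # bits 9:1
        "type": TYPES[(enc >> 10) & 0x3],     # bits 11:10
        "width": WIDTHS[(enc >> 13) & 0x3],   # bits 14:13
    }


if __name__ == "__main__":
    for name, enc in [("PIN_BASED_VM_EXEC_CONTROL", 0x4000),
                      ("EPT_POINTER", 0x201A),
                      ("VM_EXIT_REASON", 0x4402),
                      ("GUEST_CR0", 0x6800),
                      ("HOST_CR0", 0x6C00)]:
        print(f"{name:28s} {enc:#06x} {decode_vmcs_encoding(enc)}")
```

Running it over the five encodings quoted above reproduces the type and width stated for each one, which is a quick way to sanity-check a new field constant.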

VMCS Logical Groups

The VMCS groups its fields into six logical areas:

| Group | Content |
|---|---|
| Guest-state area | CR0, CR3, CR4, segment selectors, bases, limits, AR bytes, GDTR, IDTR, RIP, RSP, RFLAGS, DR7, IA32_EFER, IA32_PAT, IA32_DEBUGCTL, activity state, interruptibility state, VMCS link pointer, preemption timer value |
| Host-state area | CR0, CR3, CR4, segment selectors (CS/SS/DS/ES/FS/GS/TR), FS/GS/TR/GDTR/IDTR bases, RIP, RSP, IA32_EFER, IA32_PAT, IA32_SYSENTER_CS/ESP/EIP |
| VM-execution control fields | Pin-based controls, primary/secondary processor-based controls, exception bitmap, I/O bitmaps, MSR bitmaps, CR3-target controls, APIC-access address, EPT pointer (0x201A), VPID (0x0000), preemption timer |
| VM-exit control fields | VM-exit controls, MSR-store/load areas and counts |
| VM-entry control fields | VM-entry controls, MSR-load area, event injection field |
| VM-exit information fields (read-only) | Exit reason (0x4402), exit qualification, guest-linear address, guest-physical address, IDT-vectoring info, instruction info, instruction length |

The guest-state area is what the processor saves on every VM exit and restores on every VM entry. The host-state area is what the processor loads on every VM exit to hand control back to the VMM. The VMM is responsible for writing host-state fields correctly before the first VMLAUNCH; if the processor ever needs to exit and finds garbage in the host-state area, it will jump to a garbage instruction pointer.

VMCS Management Instructions

VMCLEAR writes the VMCS to memory, sets its launch state to "clear," and dissociates it from the logical processor. VMPTRLD makes a VMCS the current VMCS for the logical processor without changing its launch state. VMPTRST stores the current-VMCS pointer to a memory location. VMREAD and VMWRITE read and write individual fields of the current VMCS by encoding. These instructions are only available in VMX root operation; a guest executing any of them causes a VM exit.

KVM's Use of the VMCS

KVM's VMX backend lives in arch/x86/kvm/vmx/vmx.c. The per-vCPU struct is struct vcpu_vmx, which embeds struct kvm_vcpu. VMCS bookkeeping is tracked by struct loaded_vmcs in arch/x86/kvm/vmx/vmcs.h. Its bool launched field is the flag KVM consults to decide between VMLAUNCH and VMRESUME for the next guest entry. Its struct vmcs_host_state host_state caches the host CR3, CR4, GS and FS bases, RSP, and segment selectors most recently written to the VMCS, so KVM can skip the VMWRITE when a value is unchanged. VMWRITE costs noticeably more than an ordinary store, so the savings add up on the hot entry path.

Nested virtualization (KVM hosting a guest that itself runs VMs) introduces three VMCS instances per nested vCPU. vmcs01 is what KVM builds for the L1 guest hypervisor during normal non-nested operation. vmcs12 is the VMCS the L1 hypervisor constructs for its L2 nested guest, represented in KVM as struct vmcs12. vmcs02 is the VMCS KVM actually executes L2 with — it merges the policies from vmcs01 and vmcs12 so that neither L1 nor KVM can bypass the other's intercept controls.

VM Entry: VMLAUNCH and VMRESUME

VMLAUNCH (opcode 0F 01 C2) performs the first entry into a VMCS. The current VMCS must be in "clear" launch state; on success the processor transitions it to "launched." VMRESUME (opcode 0F 01 C3) performs every subsequent entry and requires the VMCS to already be in "launched" state. Using the wrong instruction — VMLAUNCH on a launched VMCS or VMRESUME on a clear one — produces a VMfailValid with the appropriate error code in the VM-instruction error field.
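The launch-state rule is a two-state machine. The toy model below is only a sketch: the class and function names are invented for illustration, and on real hardware the state is internal to the processor, which is why KVM mirrors it in its own launched flag.

```python
class VmcsLaunchState:
    """Toy model of the per-VMCS launch state (not readable on real hardware)."""

    def __init__(self):
        self.state = "clear"            # as if VMCLEAR had just run

    def vmclear(self):
        self.state = "clear"

    def vmlaunch(self):
        if self.state != "clear":
            return "VMfailValid"        # VMLAUNCH on a launched VMCS
        self.state = "launched"
        return "entered"

    def vmresume(self):
        if self.state != "launched":
            return "VMfailValid"        # VMRESUME on a clear VMCS
        return "entered"


def pick_entry_instruction(launched):
    """The choice a VMM makes from its own software copy of the launch state."""
    return "VMRESUME" if launched else "VMLAUNCH"
```

Because software cannot query the hardware state, a VMM that loses track of its own flag (for example after a VMCLEAR during vCPU migration between cores) has no way to recover other than clearing the VMCS again and starting from VMLAUNCH.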

Both instructions require VMX root operation at CPL 0, CR0.PE = 1, and RFLAGS.VM = 0. They also require no MOV-SS or POP-SS blocking to be active.

VM entry proceeds through three check phases. Phase 1 validates VMX controls and the host-state area; failure causes VMfailValid and leaves guest state unchanged. Phase 2 validates the guest-state area and PDPTRs; failure causes the processor to load the host state and transfer to the host RIP, with bit 31 of the exit-reason field set to indicate a VM-entry failure rather than a true exit. Phase 3 validates the MSR-load area; failure also loads host state. Only after all three phases pass does the processor commit to VMX non-root operation and begin executing guest code.

VM Exits

A VM exit occurs when the guest executes an operation the VMCS execution-control fields have marked for interception, or when the processor encounters a condition that mandates host intervention. Some of these conditions are unconditional, such as a triple fault or an INIT signal; others depend on a pin-based control, such as an NMI when NMI exiting is set or an external interrupt when external-interrupt exiting is set. The processor atomically saves guest state into the VMCS guest-state area, loads host state from the VMCS host-state area, and jumps to the address in the VMCS host RIP field.

The VM-exit reason field (0x4402, 32-bit, read-only) describes what happened. Bits 15:0 carry the basic exit reason. Bit 31 distinguishes a true VM exit (0) from a VM-entry failure that loaded host state (1). Selected basic reasons:

| Code | Reason |
|---|---|
| 0 | Exception or NMI |
| 1 | External interrupt |
| 2 | Triple fault |
| 10 | CPUID |
| 12 | HLT |
| 18 | VMCALL (hypercall) |
| 28 | CR access |
| 30 | I/O instruction |
| 31 | MSR read |
| 32 | MSR write |
| 48 | EPT violation |
| 49 | EPT misconfiguration |
| 52 | VMX-preemption timer expired |
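A handler's first job is to split the raw exit-reason value into the basic reason and the entry-failure flag. The sketch below does that for the reasons listed above; the function name and the "unlisted" placeholder are assumptions made for the example.

```python
BASIC_REASONS = {
    0: "Exception or NMI", 1: "External interrupt", 2: "Triple fault",
    10: "CPUID", 12: "HLT", 18: "VMCALL", 28: "CR access",
    30: "I/O instruction", 31: "MSR read", 32: "MSR write",
    48: "EPT violation", 49: "EPT misconfiguration",
    52: "VMX-preemption timer expired",
}


def decode_exit_reason(raw):
    """Split the 32-bit exit-reason field into its documented parts."""
    basic = raw & 0xFFFF                       # bits 15:0
    return {
        "basic": basic,
        "name": BASIC_REASONS.get(basic, "unlisted"),
        "entry_failure": bool((raw >> 31) & 1),  # bit 31
    }
```

Checking the entry-failure flag before dispatching matters: when it is set, the guest never ran, so any handler that assumes the guest executed an instruction would act on stale state.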

Exit reasons 48 and 49 deserve a note. Both arise while translating a guest-physical address through the Extended Page Tables: a violation means the walk hit a not-present entry or one whose permissions forbid the access, while a misconfiguration means the walk hit an entry with a reserved or unsupported bit combination. Both route to KVM's memory-fault handling, which either populates the mapping or reflects the access to user space as a KVM_EXIT_MMIO exit from KVM_RUN. Chapter 5 covers EPT in detail.

Execution Controls and the MSR Bitmap

The most important tool for tuning VM-exit overhead is the MSR bitmap. When the primary processor-based VM-execution control "Use MSR bitmaps" (field 0x4002, bit 28) is set, the processor checks the MSR bitmap before deciding whether an RDMSR or WRMSR causes an exit. The bitmap is a 4 KiB page made of four 1 KiB regions, in order: reads of MSRs 0x00000000–0x00001FFF, reads of MSRs 0xC0000000–0xC0001FFF, then writes for the same two ranges. A set bit means intercept; a clear bit means pass through to the guest. KVM clears the read bit for performance-critical MSRs such as IA32_TSC (safe once TSC offsetting, field 0x4002 bit 3, is active) so that guest reads do not exit. The RDTSC instruction itself, which clock_gettime typically relies on, is governed by the separate "RDTSC exiting" control rather than by the bitmap.
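Locating an MSR's bit in the bitmap is simple arithmetic. The sketch below assumes the region order read-low, read-high, write-low, write-high within the 4 KiB page; the function name is hypothetical.

```python
def msr_bitmap_position(msr, write):
    """Return (byte_offset, bit) inside the 4 KiB MSR bitmap.

    Returns None for MSRs outside both covered ranges; accesses to those
    are not controlled by the bitmap.
    """
    if 0x00000000 <= msr <= 0x00001FFF:
        region = 0                      # low MSR range
    elif 0xC0000000 <= msr <= 0xC0001FFF:
        region = 1                      # high MSR range
    else:
        return None
    # Reads occupy the first 2 KiB, writes the second; each region is 1 KiB.
    base = (0x800 if write else 0x000) + region * 0x400
    index = msr & 0x1FFF
    return base + index // 8, index % 8


def passes_through(bitmap, msr, write):
    """True if the bitmap lets this access run in the guest without an exit."""
    pos = msr_bitmap_position(msr, write)
    if pos is None:
        return False
    byte_offset, bit = pos
    return not (bitmap[byte_offset] >> bit) & 1
```

A VMM would typically start from a page of all ones (intercept everything) and clear individual bits, for example `bitmap = bytearray(b"\xff" * 4096)` followed by clearing the read bit for the MSRs it wants to pass through.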

The VMX-preemption timer (pin-based control 0x4000 bit 6) provides a deadline mechanism: a 32-bit counter in the VMCS guest-state area counts down while the guest runs, at a rate proportional to the TSC (one decrement each time TSC bit X changes, where X comes from bits 4:0 of IA32_VMX_MISC). When the counter reaches zero, a VM exit fires with reason 52. KVM uses the timer to back the guest's local APIC timer, so a guest timer deadline forces an exit without relying on a host timer interrupt.

Event Injection

To deliver an interrupt or exception to a guest on the next VM entry, the VMM writes the VM-entry interruption-information field in the VMCS. The 32-bit layout: bits 7:0 are the vector; bits 10:8 are the type (0 = external interrupt, 2 = NMI, 3 = hardware exception, 4 = software interrupt, 5 = privileged software exception, 6 = software exception); bit 11 enables error-code delivery; bit 31 marks the field valid. Setting bit 31 to 0 suppresses injection. On the next VM entry (VMLAUNCH or VMRESUME), the processor delivers the event as if it had arrived naturally: through the IDT, with all the privilege checks that entails.
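Assembling the interruption-information word is a matter of packing the fields just listed. The helper below is a sketch with invented names; it does not write any VMCS field and omits the separate error-code field that accompanies bit 11.

```python
TYPE_EXTERNAL_INTERRUPT = 0
TYPE_NMI = 2
TYPE_HARDWARE_EXCEPTION = 3
TYPE_SOFTWARE_INTERRUPT = 4


def make_interruption_info(vector, event_type, deliver_error_code=False):
    """Pack vector, type, error-code flag, and valid bit into one 32-bit word."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    info = vector | ((event_type & 0x7) << 8)   # bits 7:0 and 10:8
    if deliver_error_code:
        info |= 1 << 11                         # bit 11
    return info | (1 << 31)                     # bit 31: valid
```

For example, injecting a general-protection fault (vector 13, a hardware exception that carries an error code) yields 0x80000B0D, while a plain external interrupt on vector 0x20 yields 0x80000020.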

AMD-V (SVM)

Intel shipped the first VT-x–capable processors in November 2005; AMD followed in May 2006 with the Athlon 64 Orleans and Windsor desktop processors, and added Nested Page Tables in the third-generation Opteron "Barcelona" (Family 0x10) in 2007. The architecture is called SVM — Secure Virtual Machine — and while it achieves the same isolation goals as VT-x, it makes different tradeoffs that show up clearly in KVM's two backends.

Detecting and Enabling SVM

SVM availability is signaled by CPUID leaf 0x80000001, ECX bit 2 = 1. The feature capability leaf 0x8000000A gives further detail: EAX returns the SVM revision number, EBX returns the NASID (number of available ASIDs), and EDX carries individual feature bits. The features that matter most in practice are bit 0 (Nested Page Tables), bit 3 (NRIP Save — next sequential RIP recorded in the VMCB on exit, sparing the hypervisor from decoding the faulting instruction), bit 5 (VMCB Clean Bits — allows the CPU to cache VMCB fields across consecutive VMRUN calls), and bit 13 (AVIC — Advanced Virtual Interrupt Controller, hardware-accelerated interrupt delivery).

Enabling SVM requires setting EFER.SVME (bit 12 of MSR 0xC0000080). Before doing so, the VMM reads MSR_VM_CR at 0xC0010114 and checks bit 4 (SVMDIS). If SVMDIS is 1 and the register's LOCK bit (bit 3) is also set, firmware has locked SVM off and software cannot re-enable it; the setting persists until reset. Finally, the VMM allocates a 4 KiB-aligned host save area and writes its physical address to MSR_VM_HSAVE_PA at 0xC0010117. This page is where the CPU will save host state on every VMRUN.

The VMCB

AMD's counterpart to the VMCS is the VMCB (Virtual Machine Control Block), a 4 KiB page whose physical address the hypervisor passes in RAX to the VMRUN instruction. Unlike the VMCS, the VMCB is a plain memory-mapped structure with documented field offsets. The hypervisor reads and writes it with ordinary load and store instructions after mapping it into the host virtual address space. That accessibility makes VMCB manipulation cheaper than VMCS manipulation (no VMREAD/VMWRITE round trips), but because the processor may cache parts of the VMCB, KVM must tell it which fields have changed through the VMCB clean bits.

The VMCB splits into two halves. The control area (bytes 0x000–0x3FF, 1024 bytes) holds everything the CPU consults before guest entry: intercept vectors, TSC offset, IOPM and MSRPM pointers, ASID, TLB controls, interrupt controls, event injection, the NPT root pointer, and VMCB clean bits. The state save area (starting at byte 0x400) holds the full architectural state of the vCPU. The layout, from arch/x86/include/asm/svm.h, places segment registers and descriptors at 0x400, EFER at 0x4D0, CR4 at 0x548, CR3 at 0x550, CR0 at 0x558, RFLAGS at 0x570, RIP at 0x578, RSP at 0x5D8, RAX at 0x5F8, the SYSCALL MSRs (STAR=0x600, LSTAR=0x608, CSTAR=0x610, SFMASK=0x618, KERNEL_GS_BASE=0x620), g_PAT at 0x668, and SPEC_CTRL at 0x6E0. The save-area struct is 744 bytes in total. CET state fields (s_cet, ssp, isst_addr) sit between RSP and RAX, in space that earlier versions of the header declared as reserved padding.

Intercept Controls

At VMCB offset 0x000, six consecutive 32-bit words (192 bits total) form the intercept bitmap. Each bit controls whether a specific guest operation triggers #VMEXIT. KVM manipulates these through vmcb_set_intercept() and vmcb_clr_intercept(), defined in arch/x86/kvm/svm/svm.h; the bit indices themselves come from arch/x86/include/asm/svm.h.

Selected flat bit indices: 96 (external interrupt), 97 (NMI), 114 (CPUID), 120 (HLT), 123 (IOIO, I/O port access), 124 (MSR_PROT, MSR access controlled by the MSRPM), 127 (SHUTDOWN, triple fault), 128 (VMRUN, which must always be intercepted: hardware rejects a VMCB with this bit clear, so a guest hypervisor can never execute VMRUN directly), 129 (VMMCALL, hypercall), and 141 (XSETBV).
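A flat index maps onto the six-word array by integer division. The sketch below models the array as a Python list; the helper names echo KVM's but are standalone illustrations, not the kernel functions.

```python
INTERCEPT_HLT = 120
INTERCEPT_VMRUN = 128


def intercept_position(flat_index):
    """Return (byte offset from the start of the VMCB, bit within that word)."""
    word, bit = divmod(flat_index, 32)
    return word * 4, bit


def set_intercept(words, flat_index):
    """Set one intercept bit in a list of six 32-bit words."""
    word, bit = divmod(flat_index, 32)
    words[word] |= 1 << bit


def clr_intercept(words, flat_index):
    """Clear one intercept bit in a list of six 32-bit words."""
    word, bit = divmod(flat_index, 32)
    words[word] &= ~(1 << bit) & 0xFFFFFFFF


def is_intercepted(words, flat_index):
    word, bit = divmod(flat_index, 32)
    return bool((words[word] >> bit) & 1)
```

With this mapping, HLT (index 120) lands in word 3 at bit 24, and VMRUN (index 128) is bit 0 of word 4.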

The I/O and MSR permission maps work analogously to the VT-x bitmaps. The MSRPM is 8 KiB at 4 KiB alignment, divided into four 2 KiB regions: the first three cover MSR ranges 0x00000000–0x00001FFF, 0xC0000000–0xC0001FFF, and 0xC0010000–0xC0011FFF, and the fourth is reserved. Each MSR gets 2 bits (read intercept, then write intercept). The IOPM is 12 KiB with one bit per I/O port. KVM programs both during vCPU creation and updates them as the virtual device model grows.
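Because the MSRPM packs two bits per MSR, its indexing differs from the VT-x bitmap. The sketch below assumes the read bit is the lower of each pair and the write bit the upper; the function name is hypothetical.

```python
# Base MSR number of each 2 KiB region; each region covers 0x2000 MSRs.
MSRPM_RANGE_BASES = (0x00000000, 0xC0000000, 0xC0010000)


def msrpm_position(msr, write):
    """Return (byte_offset, bit) of an MSR's intercept bit in the MSRPM.

    Returns None for MSRs that fall outside the three covered ranges.
    """
    for region, base in enumerate(MSRPM_RANGE_BASES):
        if base <= msr <= base + 0x1FFF:
            # Two bits per MSR: even bit = read, odd bit = write (assumed order).
            bit_index = (msr - base) * 2 + (1 if write else 0)
            return region * 0x800 + bit_index // 8, bit_index % 8
    return None
```

As a worked case, EFER (0xC0000080) sits in the second region: 0x80 MSRs in, times two bits, is bit 0x100, which is byte 0x820 of the map, with bit 0 for reads and bit 1 for writes.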

VMCB Clean Bits

On CPUs where CPUID 0x8000000A EDX bit 5 is set, the processor can cache VMCB field groups across consecutive VMRUN calls. The clean bits field at VMCB offset 0x0C0 acts as a validity bitmap: when a bit is set, the CPU is permitted to use its cached copy instead of re-reading from memory. The hypervisor must clear any bit whose corresponding fields it has modified since the last VMRUN.

Selected bits from arch/x86/kvm/svm/svm.h:

| Bit | Covers |
|---|---|
| VMCB_INTERCEPTS (0) | Intercept vectors, TSC offset, pause filter |
| VMCB_PERM_MAP (1) | IOPM and MSRPM base addresses |
| VMCB_ASID (2) | ASID |
| VMCB_INTR (3) | Interrupt control fields |
| VMCB_NPT (4) | NPT enable, nested_cr3, g_PAT |
| VMCB_CR (5) | CR0, CR3, CR4, EFER |
| VMCB_SEG (8) | CS, DS, SS, ES, CPL |

KVM clears the appropriate bits whenever it modifies a field group, and sets them all at the end of a successful #VMEXIT handler so the next VMRUN can take maximum advantage of caching. On a busy system where most exits are IOIO or MSR faults with no CR changes, bits 0, 1, 3, 4, 5, and 8 can survive across hundreds of consecutive entries, materially reducing the cost of each VMRUN.
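The clean-bit discipline amounts to "mark dirty on every write, mark clean after every run." The sketch below tracks only the bits listed in the table above, which is a simplification of what a real hypervisor tracks; the class and method names are invented for the example.

```python
VMCB_INTERCEPTS, VMCB_PERM_MAP, VMCB_ASID, VMCB_INTR, VMCB_NPT, VMCB_CR = range(6)
VMCB_SEG = 8

# Mask of the bits this sketch tracks: bits 0-5 and bit 8.
TRACKED_MASK = sum(1 << b for b in (VMCB_INTERCEPTS, VMCB_PERM_MAP, VMCB_ASID,
                                    VMCB_INTR, VMCB_NPT, VMCB_CR, VMCB_SEG))


class CleanBits:
    """Software copy of the VMCB clean-bits field."""

    def __init__(self):
        self.value = 0                  # nothing is cached before the first VMRUN

    def mark_dirty(self, bit):
        """Call after modifying any field in the group that `bit` covers."""
        self.value &= ~(1 << bit)

    def mark_all_clean(self):
        """Call once the CPU has consumed the VMCB, so it may cache everything."""
        self.value = TRACKED_MASK

    def may_use_cache(self, bit):
        return bool((self.value >> bit) & 1)
```

The failure mode to avoid is the silent one: forgetting `mark_dirty` after a store does not fault, it just lets the CPU keep running with its stale cached copy of the field.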

VMRUN and #VMEXIT

VMRUN rAX (opcode 0F 01 D8) is the SVM instruction that corresponds to both VMLAUNCH and VMRESUME combined. There is no separate "first entry" instruction: the CPU loads guest state from the VMCB, applies control fields, and begins executing guest code. Host state is saved to the MSR_VM_HSAVE_PA page automatically; the saved fields include the SS selector, RSP, CR0, CR3, CR4, EFER, IDTR, and GDTR.

On #VMEXIT, the processor writes guest state back into the VMCB state save area, records the exit reason at VMCB offset 0x070 (exit_code), stores additional qualification at 0x078 (exit_info_1) and 0x080 (exit_info_2), restores host state from MSR_VM_HSAVE_PA, and resumes the host at the instruction following VMRUN.

Not everything goes through VMRUN's automatic save mechanism. VMSAVE rAX (opcode 0F 01 DB) and VMLOAD rAX (opcode 0F 01 DA) handle extended state: FS/GS base, LDTR, TR, STAR, LSTAR, CSTAR, SFMASK, KernelGsBase, and the SYSENTER MSRs. KVM brackets VMRUN with these instructions: it saves the host values with VMSAVE and loads the guest's with VMLOAD before entry, then reverses the pair after #VMEXIT. Skipping this step would leave the wrong context's values live across the switch, for example the guest's FS base still loaded while host code runs, and no software check would catch it.

If the NRIP feature is present (CPUID 0x8000000A EDX bit 3), the processor records the next sequential instruction address at VMCB offset 0x0C8 (next_rip) on every #VMEXIT. KVM uses this to skip the faulting instruction when emulating I/O port accesses and similar traps, avoiding a full instruction decode.

Selected #VMEXIT exit codes from arch/x86/include/uapi/asm/svm.h:

| Constant | Value | Meaning |
|---|---|---|
| SVM_EXIT_EXCP_BASE | 0x040 | Exception base (vectors 0–31 at 0x040–0x05F) |
| SVM_EXIT_INTR | 0x060 | External interrupt |
| SVM_EXIT_NMI | 0x061 | NMI |
| SVM_EXIT_VINTR | 0x064 | Virtual interrupt window open |
| SVM_EXIT_CPUID | 0x072 | CPUID |
| SVM_EXIT_HLT | 0x078 | HLT |
| SVM_EXIT_IOIO | 0x07B | I/O port access |
| SVM_EXIT_MSR | 0x07C | MSR access |
| SVM_EXIT_SHUTDOWN | 0x07F | Triple fault |
| SVM_EXIT_VMRUN | 0x080 | Guest executed VMRUN |
| SVM_EXIT_VMMCALL | 0x081 | Hypercall |
| SVM_EXIT_NPF | 0x400 | Nested page fault |
| SVM_EXIT_VMGEXIT | 0x403 | SEV-ES VMGEXIT |

SVM_EXIT_NPF at 0x400 is AMD's nested-page-fault exit — the equivalent of Intel's EPT violation (exit reason 48). exit_info_1 carries the fault-error bits and exit_info_2 carries the guest-physical address that faulted.

ASIDs and TLB Management

SVM uses ASIDs (Address Space Identifiers) to tag TLB entries per guest, preventing cross-VM TLB pollution and avoiding full flushes on every VMRUN. ASID 0 is reserved for the host. The maximum ASID is CPUID(0x8000000A).EBX - 1 (NASID minus one). The ASID is set in the VMCB control area at offset 0x058 and the TLB flush mode is set at 0x05C (tlb_ctl):

| Value | Meaning |
|---|---|
| 0 | TLB_CONTROL_DO_NOTHING: reuse existing TLB entries |
| 1 | TLB_CONTROL_FLUSH_ALL_ASID: flush the entire TLB, all ASIDs |
| 3 | TLB_CONTROL_FLUSH_ASID: flush all entries for this guest's ASID |
| 7 | TLB_CONTROL_FLUSH_ASID_LOCAL: flush only the non-global entries for this guest's ASID |

When the per-CPU ASID counter exhausts the pool (next_asid > max_asid), KVM increments its generation counter and writes TLB_CONTROL_FLUSH_ALL_ASID to force a clean slate on the next VMRUN. Intel's analogue is the VPID (16-bit, VMCS field 0x0000) and the INVVPID instruction.
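The generation-counter scheme can be sketched in a few lines. This is a simplified model of the idea described above, not KVM's code: the class name, the tuple return value, and the default minimum ASID of 1 are assumptions for illustration.

```python
class AsidAllocator:
    """Per-CPU ASID pool with a generation counter (simplified model)."""

    def __init__(self, nasid, min_asid=1):
        self.min_asid = min_asid        # ASID 0 is reserved for the host
        self.max_asid = nasid - 1       # NASID minus one
        self.next_asid = min_asid
        self.generation = 1

    def allocate(self):
        """Return (asid, generation, flush_all) for the next VMRUN."""
        flush_all = False
        if self.next_asid > self.max_asid:
            # Pool exhausted: start a new generation and request a full
            # flush, corresponding to TLB_CONTROL_FLUSH_ALL_ASID.
            self.generation += 1
            self.next_asid = self.min_asid
            flush_all = True
        asid = self.next_asid
        self.next_asid += 1
        return asid, self.generation, flush_all
```

A vCPU that remembers the generation it was allocated under can compare it with the pool's current generation before each entry; a mismatch means its ASID may have been handed to someone else and it must allocate again.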

Event Injection

Event injection on AMD uses event_inj (VMCB offset 0x0A8, 32-bit). The layout is structurally identical to VT-x: bits 7:0 are the vector, bits 10:8 are the type (0 = hardware interrupt, 2 = NMI, 3 = exception, 4 = software interrupt), bit 11 is the error-code-valid flag, bit 31 marks the field valid. When bit 31 is set on entry into VMRUN, the processor delivers the event to the guest before executing its first instruction.

The interrupt-window mechanism differs between the two architectures. VT-x uses a dedicated primary processor-based control (field 0x4002, bit 2, "interrupt-window exiting") that causes a VM exit as soon as the guest reaches an interruptible state (RFLAGS.IF = 1, no blocking). AMD uses the V_IRQ bit in the int_ctl field (VMCB offset 0x060), which signals that a virtual interrupt is pending; with the VINTR intercept enabled, the processor raises SVM_EXIT_VINTR (0x064) when the guest becomes interruptible. Either way, the VMM regains control at the first moment it is safe to inject.

Nested Page Tables

AMD's second-level address translation is called NPT (Nested Page Tables), also marketed as RVI (Rapid Virtualization Indexing). It was introduced with the "Barcelona" Family 0x10 Opteron. Enabling NPT requires setting the SVM_MISC_ENABLE_NP bit in the VMCB misc_ctl field at offset 0x090, and writing the host-physical address of the nested page-table root into nested_cr3 at offset 0x0B0. KVM additionally clears the INTERCEPT_INVLPG bit and removes PF_VECTOR from the exception intercept bitmap when NPT is active — with NPT, page faults inside the guest no longer need to exit, because the hardware resolves guest-physical to host-physical without software involvement.

One documented asymmetry between AMD NPT and Intel EPT: AMD NPT does not support execute-only mappings. An NPT entry with execute permission set must also have read permission. Intel EPT permits execute-only pages (XWR bits 0b100). This asymmetry surfaces in KVM's NPT entry construction and in any hypervisor that tries to use execute-only guard pages for shadow-stack hardening.

VT-x and SVM Side by Side

The two architectures solve the same problem with the same primitives but make opposite tradeoffs on structure access:

| Aspect | Intel VT-x | AMD SVM |
|---|---|---|
| Control block | VMCS (opaque, VMREAD/VMWRITE) | VMCB (4 KiB memory-mapped struct) |
| Entry instruction | VMLAUNCH / VMRESUME | VMRUN rAX |
| Exit reason location | VMCS VM_EXIT_REASON (0x4402) | VMCB exit_code offset 0x070 |
| Host state save | VMCS host-state area | MSR_VM_HSAVE_PA physical page |
| TLB tagging | VPID (16-bit, VMCS 0x0000) | ASID (32-bit, VMCB 0x058) |
| SLAT | EPT (execute-only pages supported) | NPT (execute-only not supported) |
| Interrupt window | Primary control bit 2 | V_IRQ in int_ctl, then SVM_EXIT_VINTR |
| Hypercall | VMCALL | VMMCALL |
| Next-RIP on exit | VM-exit instruction-length VMCS field | next_rip at VMCB 0x0C8 (NRIP feature) |

The structural difference matters to KVM's two backends (arch/x86/kvm/vmx/vmx.c and arch/x86/kvm/svm/svm.c), which share a common struct kvm_vcpu core but diverge entirely on how they program and read control state. From the guest's perspective — the one running at CPL 0 in non-root operation — the difference is invisible.

ARM Virtualization Extensions

Exception Levels

AArch64 organizes privilege into four exception levels:

EL3  Secure Monitor / firmware (unrestricted)
EL2  Hypervisor
EL1  OS kernel
EL0  User applications

EL0 and EL1 are mandatory on every AArch64 implementation. EL2 and EL3 are optional; hardware that omits EL2 provides no virtualization extensions and cannot run KVM. Code at EL0 can access only the small set of system registers defined for EL0 use, such as TPIDR_EL0. Code at EL1 can access EL1 system registers but not EL2 or EL3 registers. EL2 can access EL1 registers (to save and restore guest context) and has its own set of hypervisor registers, including HCR_EL2, VTCR_EL2, and VTTBR_EL2, that EL1 cannot read or write.

flowchart TB
  el3["EL3\nSecure Monitor\n(firmware)"]
  el2["EL2\nHypervisor\n(KVM)"]
  el1g["EL1\nGuest OS kernel"]
  el0g["EL0\nGuest user apps"]
  el1h["EL1\nHost OS kernel\n(nVHE only)"]
  el3 --> el2
  el2 --> el1g
  el2 --> el1h
  el1g --> el0g

This is not the x86 root/non-root distinction. There is no separate mode bit orthogonal to the privilege hierarchy. The EL hierarchy is the isolation boundary. A guest OS at EL1 has no valid instruction encoding for writing EL2 registers: an attempt is treated as an undefined instruction or, where nested virtualization is configured, trapped to EL2, and it never executes.

Exceptions in AArch64 can only move to the same level or a higher level on entry. ERET can only return to the same or a lower level. A guest kernel at EL1 issuing an ERET cannot route execution to EL2; it can only stay at EL1 or drop to EL0. The hardware exception model makes EL2 a strict parent of EL1, not a peer reachable from below.

HCR_EL2

HCR_EL2, the Hypervisor Configuration Register, is a 64-bit system register at EL2 that governs what EL1 and EL0 can do. It is the primary control surface for a KVM vCPU, roughly analogous to the VMCS execution-control fields for VT-x.

The most important bit for basic virtualization is bit 0, VM: when set, the CPU enables stage-2 address translation for the EL1&0 regime, translating Intermediate Physical Addresses (what the guest calls physical memory) to Host Physical Addresses (the real DRAM locations). Clearing bit 0 makes the guest's physical addresses resolve directly to host-physical — appropriate only during early bring-up when the hypervisor is not yet protecting guest memory.

Bits 3 (FMO), 4 (IMO), and 5 (AMO) route physical FIQ, IRQ, and SError exceptions to EL2, preventing the guest from seeing raw hardware interrupts and allowing the hypervisor to inject virtual IRQs through the GIC instead.

Bit 13 (TWI) traps WFI (Wait For Interrupt) from EL0 and EL1 to EL2. This is the ARM equivalent of VT-x's HLT exiting (exit reason 12). When a guest vCPU executes WFI to idle, KVM gets control and can schedule another vCPU or yield the physical core.

Bit 18 (TID3) traps EL1 reads of the group-3 ID registers (the registers advertising CPU features, implementation options, and ISA revisions) to EL2. KVM intercepts these reads and returns synthesized values, which is the mechanism behind Firecracker's V1N1 static CPU template on ARM: a host running on AWS Graviton3 (Neoverse V1 microarchitecture) presents itself to the guest as Neoverse N1, improving migration portability across instance types. Firecracker's V1N1 template requires host KVM capabilities KVM_CAP_ARM_PTRAUTH_ADDRESS (171) and KVM_CAP_ARM_PTRAUTH_GENERIC (172) so it can safely expose or suppress pointer-authentication features.

Bit 31 (RW) sets the execution state for EL1: 1 means AArch64, 0 means AArch32. Every 64-bit Linux guest needs this set to 1.

Bit 34 (E2H) enables Virtualization Host Extensions. Bit 46 (FWB, added in ARMv8.4) allows stage-2 attributes to directly override stage-1 cacheability, giving the hypervisor direct control over guest memory type without the guest being able to influence it.
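The bit positions above combine into a single 64-bit register value. The sketch below composes one plausible configuration for a 64-bit guest; the function and its parameters are illustrative, and the result is not claimed to match the flag set KVM actually programs.

```python
HCR_VM = 1 << 0      # stage-2 translation enable
HCR_FMO = 1 << 3     # route physical FIQ to EL2
HCR_IMO = 1 << 4     # route physical IRQ to EL2
HCR_AMO = 1 << 5     # route SError to EL2
HCR_TWI = 1 << 13    # trap WFI
HCR_TID3 = 1 << 18   # trap group-3 ID register reads
HCR_TGE = 1 << 27    # trap general exceptions (host EL0 semantics under VHE)
HCR_RW = 1 << 31     # EL1 is AArch64
HCR_E2H = 1 << 34    # Virtualization Host Extensions
HCR_FWB = 1 << 46    # stage-2 forced write-back


def guest_hcr(trap_wfi=True, trap_id_regs=False):
    """Compose an HCR_EL2 value for a 64-bit guest (hypothetical policy)."""
    value = HCR_VM | HCR_FMO | HCR_IMO | HCR_AMO | HCR_RW
    if trap_wfi:
        value |= HCR_TWI
    if trap_id_regs:
        value |= HCR_TID3
    return value


def describe(value):
    """List the names of the known bits set in an HCR_EL2 value."""
    names = {k: v for k, v in globals().items() if k.startswith("HCR_")}
    return sorted(name for name, bit in names.items() if value & bit)
```

With the defaults, `guest_hcr()` evaluates to 0x80002039; adding `trap_id_regs=True` sets bit 18 as well, giving 0x80042039.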

Stage-2 Address Translation

When HCR_EL2.VM = 1, every memory access from EL1 or EL0 goes through two independent MMU walks. The guest OS programs its own page tables as always, translating guest-virtual addresses to what it believes are physical addresses — ARM calls these Intermediate Physical Addresses (IPA). The hardware then performs a second walk, controlled by the hypervisor, translating each IPA to a Host Physical Address (HPA).

VTCR_EL2 (Virtualization Translation Control Register) configures the stage-2 walk: T0SZ (bits 5:0) defines the IPA input range as 2^(64-T0SZ) bytes; SL0 (bits 7:6) sets the starting lookup level; TG0 (bits 15:14) selects the translation granule (4 KB = 0b00, 64 KB = 0b01, 16 KB = 0b10); PS (bits 18:16) sets the output address size.
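The field layout translates directly into shifts and masks. The decoder below is a minimal sketch covering only the four fields named above; the function name and the dictionary keys are illustrative.

```python
GRANULES = {0b00: "4 KB", 0b01: "64 KB", 0b10: "16 KB"}


def decode_vtcr(vtcr):
    """Extract T0SZ, SL0, TG0, and PS from a VTCR_EL2 value."""
    t0sz = vtcr & 0x3F                                   # bits 5:0
    return {
        "t0sz": t0sz,
        "ipa_bits": 64 - t0sz,                           # IPA range is 2^(64-T0SZ)
        "sl0": (vtcr >> 6) & 0x3,                        # bits 7:6
        "granule": GRANULES.get((vtcr >> 14) & 0x3, "reserved"),  # bits 15:14
        "ps": (vtcr >> 16) & 0x7,                        # bits 18:16
    }


def encode_vtcr(t0sz, sl0, tg0, ps):
    """Inverse of decode_vtcr for the same four fields."""
    return (t0sz & 0x3F) | ((sl0 & 0x3) << 6) | ((tg0 & 0x3) << 14) | ((ps & 0x7) << 16)
```

For instance, T0SZ = 24 describes a 40-bit IPA space, since 64 - 24 = 40, which is 1 TiB of guest-physical address range.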

VTTBR_EL2 carries two things: the VMID in bits [63:48] (for 16-bit VMIDs, ARMv8.1+) or [55:48] (for 8-bit VMIDs, ARMv8.0), and the stage-2 page-table base address in bits [47:1]. The VMID tags TLB entries per VM exactly as ASID does for AMD or VPID does for Intel: world-switching between two VMs does not require flushing the TLB so long as each VM has a distinct VMID.

The stage-2 walk is independent of stage-1. The hypervisor can disable or re-enable stage-2 independently, which is useful during early boot when the hypervisor is initializing a guest's memory map before enabling the full translation regime.

VHE: Running the Host Kernel at EL2

Without Virtualization Host Extensions, the standard AArch64 arrangement places the KVM hypervisor stub at EL2 and the host Linux kernel at EL1. Every guest entry and exit requires a full world switch: save all EL1 host registers, load all EL1 guest registers, ERET to guest EL1 on entry; reverse the process on exit, returning to EL2, then dropping back to EL1 host context. The host and guest EL1 contexts are completely symmetric — both are "just an EL1" — but they cannot coexist on the CPU simultaneously.

VHE (Virtualization Host Extensions), introduced in ARMv8.1-A, removes that double switch. When HCR_EL2.E2H = 1 (bit 34), the CPU enters a mode where EL2 becomes a superset of EL1: the host kernel can run directly at EL2 with full OS semantics. Most EL1 system-register encodings, when executed at EL2, are redirected to their EL2 equivalents: an access to SCTLR_EL1 at EL2 reaches SCTLR_EL2, and so on. New _EL12 aliases (SCTLR_EL12, TCR_EL12, TTBR0_EL12, VBAR_EL12) give the hypervisor access to the actual EL1 register contents for guest context save/restore, without confusion from the E2H redirect.

A companion bit, HCR_EL2.TGE (bit 27), switches user-space semantics. When TGE = 1 alongside E2H = 1, all physical exceptions from EL0 route to EL2 — the EL0 threads are treated as host-OS user processes. When TGE = 0, EL0 threads are guest user space, with exceptions routing to EL1 as normal.
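The interaction of the two bits can be sketched as a tiny classifier. The bit positions are the ones stated above; the function names and the string labels are hypothetical, and the E2H = 0 case is lumped into a single "nvhe" label because the text does not discuss TGE in that configuration.

```python
# Sketch only: bit positions follow the text; names and labels are illustrative.
HCR_TGE = 1 << 27
HCR_E2H = 1 << 34


def el0_role(hcr: int) -> str:
    """Classify who owns EL0 for a given HCR_EL2 value."""
    if not hcr & HCR_E2H:
        return "nvhe"   # outside the VHE cases covered by this sketch
    if hcr & HCR_TGE:
        return "host"   # EL0 exceptions route to EL2 (host user processes)
    return "guest"      # EL0 exceptions route to EL1 (guest user space)


def prepare_guest_entry(hcr: int) -> int:
    """Clear TGE so EL0 belongs to the guest while it runs."""
    return hcr & ~HCR_TGE


def prepare_host_return(hcr: int) -> int:
    """Set TGE again so EL0 belongs to the host after an exit."""
    return hcr | HCR_TGE


host_hcr = HCR_E2H | HCR_TGE
print(el0_role(host_hcr), el0_role(prepare_guest_entry(host_hcr)))
```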

Linux KVM on ARMv8.1+ runs the host kernel at EL2 using VHE. The same kernel binary supports both VHE and non-VHE (nVHE) operation: alternative instruction sequences are patched in at boot according to CPU feature detection. Starting with ARMv9.5, implementations may make HCR_EL2.E2H a RES1 field, permanently 1, so that VHE is the only implemented behavior and the non-VHE code path is never exercised on such silicon.

flowchart LR
  subgraph nvhe["nVHE (ARMv8.0)"]
    el2s["KVM stub at EL2"]
    el1h2["Host kernel at EL1"]
    el1g2["Guest at EL1"]
    el2s -->|"world switch"| el1g2
    el2s --> el1h2
  end
  subgraph vhe["VHE (ARMv8.1+, E2H=1)"]
    el2v["Host kernel + KVM at EL2"]
    el1gv["Guest at EL1"]
    el2v -->|"ERET / exception"| el1gv
  end

Exit Handling on ARM

When the guest triggers an exception that routes to EL2, the syndrome register ESR_EL2 records what happened. Bits 31:26 (EC field) carry the exception class. KVM's arm_exit_handlers[] array in arch/arm64/kvm/handle_exit.c maps EC values to handler functions via kvm_get_exit_handler().

Selected EC codes:

| Symbol | Cause |
| --- | --- |
| ESR_ELx_EC_WFx | WFI/WFE trap (guest idle) |
| ESR_ELx_EC_HVC64 | HVC (hypervisor call from guest) |
| ESR_ELx_EC_SMC64 | SMC (secure monitor call) |
| ESR_ELx_EC_SYS64 | System register access trap |
| ESR_ELx_EC_DABT_LOW | Data abort / MMIO / stage-2 fault |
| ESR_ELx_EC_IABT_LOW | Instruction abort |
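The EC-to-handler mapping can be sketched as a plain lookup table. This is a Python analogy for the dispatch idea, not KVM's code: the handler functions and their return strings are hypothetical, and the numeric EC values are the architectural encodings for the symbols listed above.

```python
# Sketch only: handlers are illustrative stand-ins, not KVM functions.
ESR_EC_SHIFT = 26
ESR_EC_MASK = 0x3F

EC_WFX = 0x01
EC_HVC64 = 0x16
EC_SMC64 = 0x17
EC_SYS64 = 0x18
EC_IABT_LOW = 0x20
EC_DABT_LOW = 0x24


def exception_class(esr: int) -> int:
    """Extract the EC field, bits 31:26 of the syndrome register."""
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK


def handle_wfx(esr): return "yield vCPU"
def handle_hvc(esr): return "hypercall"
def handle_smc(esr): return "secure monitor call"
def handle_sysreg(esr): return "emulate system register"
def handle_guest_abort(esr): return "stage-2 fault"
def handle_unknown(esr): return "inject undefined exception"


EXIT_HANDLERS = {
    EC_WFX: handle_wfx,
    EC_HVC64: handle_hvc,
    EC_SMC64: handle_smc,
    EC_SYS64: handle_sysreg,
    EC_IABT_LOW: handle_guest_abort,
    EC_DABT_LOW: handle_guest_abort,
}


def dispatch(esr: int) -> str:
    """Look up the handler for this syndrome and run it."""
    handler = EXIT_HANDLERS.get(exception_class(esr), handle_unknown)
    return handler(esr)


print(dispatch(EC_HVC64 << ESR_EC_SHIFT))
```

Falling back to a single "unknown" handler keeps the table sparse: only the exception classes the hypervisor actually expects need an entry.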

ESR_ELx_EC_DABT_LOW is the ARM equivalent of an EPT violation: a stage-2 fault on a data access. The handler inspects the ESR fields and the faulting address to determine whether the fault is MMIO (no mapping exists and the address is in a device region) or a genuine page fault (a mapping needs to be installed), then either emulates the device access or populates the stage-2 table.
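A minimal sketch of that decision, under stated assumptions. The ISS field positions used here (ISV, SAS, SRT, WnR, DFSC) are the architectural data-abort syndrome fields; the MMIO test is simplified to "no guest RAM region covers the faulting IPA", and the region list, type, and function names are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class DataAbort(NamedTuple):
    isv: bool       # syndrome valid: size/reg/is_write are meaningful
    size: int       # access size in bytes
    reg: int        # general-purpose register used by the access
    is_write: bool  # True for a store, False for a load
    dfsc: int       # data fault status code


def decode_data_abort(esr: int) -> DataAbort:
    """Pull the data-abort fields out of the low 25 ISS bits."""
    iss = esr & 0x1FFFFFF
    return DataAbort(
        isv=bool((iss >> 24) & 1),
        size=1 << ((iss >> 22) & 0x3),
        reg=(iss >> 16) & 0x1F,
        is_write=bool((iss >> 6) & 1),
        dfsc=iss & 0x3F,
    )


def classify_stage2_fault(ipa: int,
                          ram_regions: Sequence[Tuple[int, int]]) -> str:
    """Return 'map_page' if the IPA is guest RAM, otherwise 'mmio'."""
    for base, size in ram_regions:
        if base <= ipa < base + size:
            return "map_page"
    return "mmio"


# Hypothetical guest layout: 2 GiB of RAM starting at 1 GiB.
ram = [(0x4000_0000, 0x8000_0000)]
print(classify_stage2_fault(0x0900_0000, ram),
      classify_stage2_fault(0x4000_1000, ram))
```

When the syndrome is valid, the decoded size, register, and direction are what a device emulator needs to complete the access on the guest's behalf.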

GIC Virtualization

ARM guests need to receive interrupts. The Generic Interrupt Controller (GIC) has had virtualization support since GICv2, adding a two-register-bank split: the virtual interface control block (GICH_* registers) that the hypervisor programs, and the virtual CPU interface (GICV_* registers) that the guest reads as if they were the physical GIC CPU interface registers.

Virtual interrupt delivery uses GICH_LRn — the List Registers. Each LR entry specifies a virtual interrupt ID, an optional physical interrupt ID (when HW = 1 for hardware-backed IRQs), a priority, a state (pending, active, or pending+active), and EOI behavior. The number of list registers is reported by GICH_VTR, typically between 4 and 16 on real hardware. GICH_HCR bit 0 (En) must be set before guest entry or no virtual interrupts will be delivered.

When HW = 1 in a list register, hardware deactivates the physical interrupt automatically when the guest writes EOI to GICV_EOIR. When HW = 0, the hypervisor must deactivate the physical IRQ itself, usually by writing the physical INTID to GICC_DIR, after the guest's interrupt handler runs.
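The list register contents described above can be sketched as a packing helper for the GICv2 GICH_LRn layout. The function name and keyword arguments are hypothetical; the bit positions are the GICv2 ones as best recalled here (virtual ID in bits 9:0, physical ID in bits 19:10 when HW = 1, EOI maintenance flag in bit 19 when HW = 0, priority in bits 27:23, state in bits 29:28, HW in bit 31) and should be checked against the GIC specification before use.

```python
from typing import Optional

# Sketch only: GICv2 GICH_LRn layout; names are illustrative.
LR_STATE = {"invalid": 0b00, "pending": 0b01,
            "active": 0b10, "pending_active": 0b11}


def pack_gich_lr(virt_id: int, state: str, priority: int,
                 phys_id: Optional[int] = None,
                 eoi_maintenance: bool = False) -> int:
    """Build one GICv2 list register value.

    Passing phys_id sets HW = 1 (hardware-backed interrupt).
    eoi_maintenance is only meaningful when HW = 0.
    """
    if not 0 <= virt_id <= 0x3FF:
        raise ValueError("virtual ID is a 10-bit field")
    if not 0 <= priority <= 0x1F:
        raise ValueError("priority is a 5-bit field")
    lr = virt_id | (priority << 23) | (LR_STATE[state] << 28)
    if phys_id is not None:
        if eoi_maintenance:
            raise ValueError("EOI maintenance flag applies only when HW = 0")
        if not 0 <= phys_id <= 0x3FF:
            raise ValueError("physical ID is a 10-bit field")
        lr |= (1 << 31) | (phys_id << 10)
    elif eoi_maintenance:
        lr |= 1 << 19
    return lr


# Hardware-backed virtual IRQ 27, pending, mapped to physical IRQ 27.
print(hex(pack_gich_lr(27, "pending", 0x14, phys_id=27)))
```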

GICv3 and GICv4 move the interface to system registers: ICH_HCR_EL2 replaces GICH_HCR, ICH_LR<n>_EL2 (up to 16) replace GICH_LRn, and the virtual CPU interface becomes ICV_* registers at EL1. This system-register interface is part of the AArch64 architectural state from ARMv8 onward; GICv3 support is mandatory for any arm64 server platform.

Firecracker on aarch64 requires KVM_CAP_IRQCHIP (in-kernel GICv2 or GICv3 emulation) in addition to /dev/kvm. A host whose kernel is built with CONFIG_KVM=y but which reports no IRQ chip capability at runtime cannot host Firecracker guests; this was confirmed by Firecracker issue #1186, where a Raspberry Pi 3 built with KVM enabled failed the capability check at startup.

PSCI

Guest power management on ARM does not use ACPI PM registers. Instead, the guest issues PSCI (Power State Coordination Interface, ARM DEN0022) calls using the SMCCC (DEN0028) calling convention. The conduit is either SMC or HVC: SMC reaches firmware at EL3, while HVC reaches a hypervisor at EL2, which is the conduit a KVM guest normally uses.

KVM emulates PSCI via kvm_smccc_call_handler(). Selected function IDs: CPU_ON is 0xC4000003 and CPU_SUSPEND is 0xC4000001 (both SMC64 variants), while SYSTEM_OFF is 0x84000008 and SYSTEM_RESET is 0x84000009 (both defined only in the SMC32 range). KVM exposes PSCI to guests via the KVM_ARM_VCPU_PSCI_0_2 feature flag set on the vCPU. For the events it cannot handle internally, SYSTEM_OFF and SYSTEM_RESET, KVM returns KVM_EXIT_SYSTEM_EVENT to user space. In Firecracker, a poweroff command inside the guest triggers a PSCI SYSTEM_OFF call, KVM emits KVM_EXIT_SYSTEM_EVENT, and Firecracker performs a clean shutdown, unmounting storage and releasing resources before the VMM process exits.
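The split between calls handled in the kernel and calls forwarded to user space can be sketched as follows. This models the routing decision only and is not KVM's implementation: the function names and the returned labels are hypothetical, while the function IDs are the ones quoted above and the bit layout of the ID (fast-call bit 31, SMC64 bit 30, owner in bits 29:24, function number in bits 15:0) follows the SMCCC convention.

```python
# Sketch only: routing labels and helper names are illustrative.
PSCI_CPU_SUSPEND_64 = 0xC4000001
PSCI_CPU_ON_64 = 0xC4000003
PSCI_SYSTEM_OFF = 0x84000008
PSCI_SYSTEM_RESET = 0x84000009


def decode_function_id(fid: int) -> dict:
    """Split an SMCCC function ID into its fields."""
    return {
        "fast": bool((fid >> 31) & 1),
        "smc64": bool((fid >> 30) & 1),
        "owner": (fid >> 24) & 0x3F,
        "number": fid & 0xFFFF,
    }


def route_psci_call(fid: int) -> str:
    """Decide where a PSCI call is completed in this simplified model."""
    if fid in (PSCI_SYSTEM_OFF, PSCI_SYSTEM_RESET):
        return "exit_to_userspace"   # modelled on KVM_EXIT_SYSTEM_EVENT
    if fid in (PSCI_CPU_ON_64, PSCI_CPU_SUSPEND_64):
        return "handled_in_kernel"
    return "not_supported"


print(decode_function_id(PSCI_CPU_ON_64))
print(route_psci_call(PSCI_SYSTEM_OFF))
```

Decoding the ID makes the SMC32/SMC64 distinction visible: the two CPU calls have bit 30 set, while the two system calls do not.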

How Silicon Enforces the Boundary

The same question runs through all three architectures: what actually prevents guest ring 0 from becoming host ring 0? The answer is not a software invariant or an operating system convention — it is hardware decode logic that the guest cannot observe or bypass.

On x86 with VT-x, the CPU maintains an internal mode bit distinguishing VMX root from VMX non-root operation. Guest code executing at CPL 0 in VMX non-root cannot use VMX instructions: they cause VM exits instead of executing. It cannot read or write the VMM's VMCS: VMREAD and VMWRITE cause VM exits unless the VMM has enabled VMCS shadowing, in which case they reach only a shadow VMCS that the VMM designates. It cannot execute VMXOFF or VMXON, which are unconditionally intercepted. The guest's view of physical memory is confined to what EPT maps, and EPT is controlled entirely by the VMM running in root mode. The guest kernel therefore operates at full CPL 0 privilege within its architectural world, but that world is bounded by hardware: nothing the guest can do in non-root mode reaches host state or host memory.

On AMD-V, the VMRUN intercept bit (index 128 in the VMCB intercept bitmap) must be set in every VMCB: VMRUN's consistency checks reject a VMCB with the bit clear and exit immediately with VMEXIT_INVALID. Every VMRUN the guest attempts therefore fires a #VMEXIT back to the real hypervisor. The VMCB resides in host-physical memory that NPT does not expose to the guest once the nested page tables are active. Host state is saved to and restored from the MSR_VM_HSAVE_PA page, which is likewise outside the guest's NPT-visible address space.

On ARM, the mechanism is the exception level hierarchy itself. A guest OS at EL1 has no usable instruction encodings that write EL2 registers. Writing HCR_EL2 requires EL2; the guest cannot disable stage-2 translation by clearing HCR_EL2.VM. Writing VTTBR_EL2 to install new stage-2 page tables requires EL2; the guest cannot remap the hypervisor's memory into its own address space. An EL1 instruction that attempts to access an EL2-only register never takes effect: it is UNDEFINED at EL1, or it traps to the hypervisor at EL2 when nested virtualization support is enabled. Exceptions are never delivered from one VM's EL1 directly to another VM's EL1; switching between VMs always passes through EL2 first.

Chapter 5 builds on this foundation — EPT and NPT give the hypervisor the same kind of structural control over the guest's view of physical memory that the mode bit and exception-level hierarchy provide over instruction execution.

Sources And Further Reading