Chapter 7: Virtual Interrupts And Time

A guest operating system expects to receive interrupts. The NIC driver expects the hardware to signal that a packet arrived. The block driver expects a completion notice. The timer subsystem expects the interrupt controller to fire on schedule. None of that hardware exists — the guest is running on emulated devices — yet the guest must receive interrupts that arrive with plausible timing and in the right order. Meanwhile, the guest's clock must advance at something close to wall-clock rate, survive live migration to a host whose TSC runs at a slightly different frequency, and not drift by seconds across a long-lived workload.

The interrupt and timekeeping problems share an ancestor. Both require the hypervisor to intercept hardware-level operations — writes to interrupt-controller registers, reads of the time-stamp counter — and either emulate them fully in kernel space or share a memory-mapped ABI with the guest that makes the emulation cheap enough to be invisible. Getting the first problem wrong produces missed interrupts, hangs, and stalled virtio queues. Getting the second produces drifting logs, failing TLS certificates, and confused application timing loops.

The Interrupt Architecture Problem

On real hardware, an interrupt follows a path that software rarely thinks about: a device asserts a line, the I/O APIC records it, the local APIC on the target CPU raises an interrupt request, and the CPU saves state and dispatches the handler. This path crosses several pieces of hardware the guest believes it owns — the local APIC register page at 0xFEE00000, the I/O APIC at 0xFEC00000, the legacy 8259 PIC pair at I/O ports 0x20 and 0xA0 — none of which exists in the guest's physical address space unless the hypervisor puts it there.

The hypervisor has three options. It can emulate the entire interrupt controller stack in userspace — every register write from the guest triggers a VM exit, the VMM updates its software model, and the VMM injects the interrupt on the next KVM_RUN. It can emulate the controllers inside the kernel, handling register accesses without returning to userspace. Or it can split the work: keep the local APIC in-kernel where injection is fast, but handle the legacy PIC and I/O APIC in userspace where the VMM can apply its own routing policy. Each choice makes different tradeoffs between latency, flexibility, and complexity.

In-Kernel Irqchip vs. Userspace

The canonical path for x86 microVMs is fully in-kernel. A single VM-level ioctl, KVM_CREATE_IRQCHIP (_IO(KVMIO, 0x60)), requires KVM_CAP_IRQCHIP (capability value 0) and creates three emulated controllers in one call: a master 8259 PIC (KVM_IRQCHIP_PIC_MASTER = 0), a slave 8259 PIC (KVM_IRQCHIP_PIC_SLAVE = 1), and an I/O APIC (KVM_IRQCHIP_IOAPIC = 2). Every vCPU created after this call gets an in-kernel local APIC. The state of any chip can be read or written later with KVM_GET_IRQCHIP and KVM_SET_IRQCHIP, which reference chips by these same ID constants.

On arm64, KVM_CREATE_IRQCHIP creates a GICv2 only. For a GICv3 — the interrupt controller on any recent Arm server or embedded SoC — userspace must instead call KVM_CREATE_DEVICE with KVM_DEV_TYPE_ARM_VGIC_V3. The kernel enforces that only one VGIC instance may exist per VM; GICv2 and GICv3 cannot coexist.

Split-irqchip mode, enabled by KVM_CAP_SPLIT_IRQCHIP (value 121), keeps the in-kernel local APIC but moves the legacy PIC, I/O APIC, and PIT to userspace. When a guest performs an EOI that would normally notify the I/O APIC, KVM surfaces this to userspace as KVM_EXIT_IOAPIC_EOI rather than handling it in-kernel. This is the configuration chosen by VMMs that want fine-grained control over routing while still keeping interrupt injection fast.

Without either KVM_CREATE_IRQCHIP or split mode, the VMM emulates every controller in userspace and injects interrupts via the vCPU ioctl KVM_INTERRUPT, which queues a single interrupt vector for delivery at the next VM entry. This pure-userspace path requires a round-trip to userspace for every interrupt and is too slow for production use; it exists for completeness and for VMMs that deliberately avoid the kernel irqchip.

Firecracker calls KVM_CREATE_IRQCHIP at VM creation on x86_64, using the fully in-kernel path for all three chips. On aarch64, it provisions a GICv3 via KVM_CREATE_DEVICE with KVM_DEV_TYPE_ARM_VGIC_V3, falling back to KVM_DEV_TYPE_ARM_VGIC_V2 if the host kernel or hardware does not support GICv3.

The Local APIC In Detail

After KVM_CREATE_IRQCHIP, each vCPU's local APIC is emulated by KVM in arch/x86/kvm/lapic.c. The APIC register page is 4 KiB (LAPIC_MMIO_LENGTH = 4096), mapped at 0xFEE00000 in the guest's physical address space. Register accesses go through KVM's MMIO handler rather than to real hardware, so they are resolved in-kernel without a userspace round-trip.

The key registers and their offsets within the APIC page:

Register    Offset  Purpose
APIC_ID     0x020   APIC identifier
APIC_LVR    0x030   Version; KVM emulates 0x14
APIC_SPIV   0x0F0   Spurious interrupt vector; bit APIC_SPIV_APIC_ENABLED arms the APIC
APIC_ICR    0x300   Interrupt Command Register, low 32 bits
APIC_ICR2   0x310   ICR high 32 bits (destination field)
APIC_LVT0   0x350   LVT entry 0 (LINT0)
APIC_LVT1   0x360   LVT entry 1 (LINT1)

The Interrupt Request Register (IRR), In-Service Register (ISR), and Trigger-Mode Register (TMR) are each 256-bit bitmaps stored as eight 32-bit registers inside the kvm_lapic_state.regs page. When the guest writes the EOI register, KVM's handler calls apic_find_highest_isr(), clears the ISR bit for the current interrupt, recomputes the Processor Priority Register, and notifies the in-kernel I/O APIC via kvm_ioapic_send_eoi(). All of this happens inside the APIC MMIO handler without returning to userspace.
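The bitmap layout can be made concrete with a small user-space model. This is a sketch, not KVM's code: it assumes the xAPIC register map (ISR at offset 0x100, IRR at 0x200, one 32-bit word every 16 bytes), and the helper names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define APIC_REG_SIZE 0x400   /* size of kvm_lapic_state.regs */
#define APIC_ISR_BASE 0x100   /* assumed offsets from the xAPIC register map */
#define APIC_IRR_BASE 0x200

/* Byte offset of the 32-bit word holding `vector`: eight words, 16 bytes apart. */
uint32_t vec_word_offset(uint32_t base, uint8_t vector)
{
    return base + ((uint32_t)(vector >> 5) << 4);
}

/* Mark `vector` pending (or in service) in the bitmap that starts at `base`. */
void apic_set_vector(uint8_t *regs, uint32_t base, uint8_t vector)
{
    uint32_t off = vec_word_offset(base, vector), word;

    assert(off + sizeof(word) <= APIC_REG_SIZE);
    memcpy(&word, regs + off, sizeof(word));
    word |= UINT32_C(1) << (vector & 31);
    memcpy(regs + off, &word, sizeof(word));
}

/* Highest vector set in the bitmap at `base`, or -1 if the bitmap is empty. */
int apic_find_highest(const uint8_t *regs, uint32_t base)
{
    for (int i = 7; i >= 0; i--) {
        uint32_t word;

        memcpy(&word, regs + base + ((uint32_t)i << 4), sizeof(word));
        if (word) {
            int bit = 31;
            while (!(word & (UINT32_C(1) << bit)))
                bit--;
            return i * 32 + bit;
        }
    }
    return -1;
}
```

Scanning from the top word down mirrors the priority rule: a numerically higher vector always wins, which is why the EOI path only needs the highest set ISR bit.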

The LAPIC timer complicates things. Emulating a timer accurately requires the kernel to know when the next deadline fires, then absorb the jitter introduced by VM-exit and VM-entry latency. KVM applies a configurable timer advance: LAPIC_TIMER_ADVANCE_NS_INIT = 1000 ns, capped at LAPIC_TIMER_ADVANCE_NS_MAX = 5000 ns. The kernel fires the host-side timer slightly early, then busy-waits in a tight loop to hit the precise deadline, spending less than one microsecond of CPU time per timer event on a quiet system.

The full LAPIC state can be saved and restored across migration with KVM_GET_LAPIC (ioctl 0x8e) and KVM_SET_LAPIC (ioctl 0x8f). Both operate on struct kvm_lapic_state { char regs[KVM_APIC_REG_SIZE]; } where KVM_APIC_REG_SIZE = 0x400 (1024 bytes).

x2APIC

The original xAPIC mode uses 8-bit APIC IDs stored in MMIO registers. x2APIC extends this to 32-bit IDs and replaces the MMIO interface with MSR accesses in the range 0x800–0x8FF. Each MSR maps to an APIC register at offset (msr - APIC_BASE_MSR) << 4, where APIC_BASE_MSR = 0x800. KVM emulates the full x2APIC MSR range, but doing so requires KVM_CREATE_IRQCHIP to have been called first; KVM does not support forwarding x2APIC MSR accesses to userspace.
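The MSR-to-register mapping is simple enough to check by hand. The following sketch implements it in both directions; the function names are illustrative and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define APIC_BASE_MSR   0x800u
#define X2APIC_MSR_LAST 0x8FFu

/* Translate an x2APIC MSR index to the equivalent xAPIC register offset. */
uint32_t x2apic_msr_to_offset(uint32_t msr)
{
    assert(msr >= APIC_BASE_MSR && msr <= X2APIC_MSR_LAST);
    return (msr - APIC_BASE_MSR) << 4;
}

/* Inverse mapping; xAPIC register offsets are 16-byte aligned. */
uint32_t offset_to_x2apic_msr(uint32_t offset)
{
    assert((offset & 0xFu) == 0 && offset <= 0xFF0u);
    return APIC_BASE_MSR + (offset >> 4);
}
```

For example, MSR 0x80B maps to offset 0x0B0 (the EOI register) and MSR 0x830 maps to offset 0x300 (the ICR).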

Enabling the extended API requires KVM_CAP_X2APIC_API (value 129). When the KVM_X2APIC_API_USE_32BIT_IDS flag is set within this capability, KVM stores the full 32-bit x2APIC ID in bytes 32–35 of kvm_lapic_state.regs; xAPIC stores only an 8-bit ID in byte 35 (bits 31–24 of that word).

The In-Kernel I/O APIC

The KVM I/O APIC (arch/x86/kvm/ioapic.c) emulates exactly 24 input pins (KVM_IOAPIC_NUM_PINS = 24), matching the Intel 82093AA specification. The MMIO window is 256 bytes (0x100) at default base 0xFEC00000. Like real hardware, the I/O APIC uses an indirect addressing scheme: a write to IOAPIC_REG_SELECT at offset 0x00 sets the internal register index; a subsequent read or write to IOAPIC_REG_WINDOW at offset 0x10 accesses the selected register.

Indirect register 0x00 is the ID (IOAPICID), 0x01 the version (IOAPICVER; KVM reports IOAPIC_VERSION_ID = 0x11), 0x02 the arbitration register (IOAPICARB). Redirection table entries start at index 0x10 (pin 0) and occupy two 32-bit words each, running through 0x3F (pin 23). Each 64-bit entry (union kvm_ioapic_redirect_entry) encodes the destination vector, delivery mode, destination APIC ID, trigger mode (edge or level), mask bit, and remote IRR flag.
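As a rough illustration of the indexing and the entry format, the sketch below computes the indirect register index for a pin and packs a subset of the entry's fields. The bit positions follow the 82093AA redirection-entry layout; the function names and the choice of fields are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define IOAPIC_NUM_PINS   24
#define IOAPIC_REDIR_BASE 0x10

/* Indirect register index of the low (high = 0) or high (high = 1) word of a pin's entry. */
uint8_t ioapic_redir_index(unsigned pin, int high)
{
    assert(pin < IOAPIC_NUM_PINS);
    return (uint8_t)(IOAPIC_REDIR_BASE + 2 * pin + (high ? 1 : 0));
}

/* Pack selected fields into a 64-bit redirection entry. */
uint64_t ioapic_make_entry(uint8_t vector, uint8_t delivery_mode,
                           int level_triggered, int masked, uint8_t dest_id)
{
    uint64_t e = vector;                               /* bits 7:0   vector        */
    e |= (uint64_t)(delivery_mode & 0x7) << 8;         /* bits 10:8  delivery mode */
    e |= (uint64_t)(level_triggered ? 1 : 0) << 15;    /* bit 15     trigger mode  */
    e |= (uint64_t)(masked ? 1 : 0) << 16;             /* bit 16     mask          */
    e |= (uint64_t)dest_id << 56;                      /* bits 63:56 destination   */
    return e;
}
```

A guest programming pin 23 would therefore select index 0x3E for the low word and 0x3F for the high word, each through the IOAPIC_REG_SELECT / IOAPIC_REG_WINDOW pair.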

GSI routing determines which controller receives each interrupt: GSIs 0–15 route to both the PIC and the I/O APIC (for compatibility with legacy software); GSIs 16–23 go to the I/O APIC only. The RTC IRQ, RTC_GSI = 8, routes through both.

The GSI Routing Table

Higher-level interrupt routing — from a device's logical signal to the right controller and pin — lives in a table the VMM manages with KVM_SET_GSI_ROUTING (_IOW(KVMIO, 0x6a, struct kvm_irq_routing)), gated by KVM_CAP_IRQ_ROUTING (value 25). Each call atomically replaces the entire table; there is no incremental-update path. A VMM that needs to add one route must rebuild and resubmit the full table.

Each entry in the table is a struct kvm_irq_routing_entry:

struct kvm_irq_routing_entry {
    __u32 gsi;
    __u32 type;   /* KVM_IRQ_ROUTING_IRQCHIP=1, KVM_IRQ_ROUTING_MSI=2,
                     KVM_IRQ_ROUTING_S390_ADAPTER=3, KVM_IRQ_ROUTING_HV_SINT=4,
                     KVM_IRQ_ROUTING_XEN_EVTCHN=5 */
    __u32 flags;
    __u32 pad;
    union {
        struct kvm_irq_routing_irqchip irqchip;
        struct kvm_irq_routing_msi     msi;
        /* ... */
    } u;
};

For MSI devices, KVM_IRQ_ROUTING_MSI = 2 entries carry address_lo, address_hi, and data — the three fields that encode the destination APIC and vector in the MSI message format. Setting KVM_MSI_VALID_DEVID (bit 0 in struct kvm_msi.flags) passes a PCIe Requester ID via devid, which enables interrupt remapping hardware to associate the interrupt with a specific device (requires KVM_CAP_MSI_DEVID = 131).

On arm64, GSI routing applies to KVM_IRQFD bindings but does not apply to KVM_IRQ_LINE.

Firecracker builds its routing table by collecting all unmasked entries — one KVM_IRQ_ROUTING_IRQCHIP entry pointing to KVM_IRQCHIP_IOAPIC per device on x86_64, one entry with chip index 0 per device on aarch64, and one KVM_IRQ_ROUTING_MSI entry per MSI-capable device — and submits them with a single set_gsi_routing() call. Rebuilding on every change is acceptable because routing changes are rare and the atomicity guarantee is valuable.

irqfd: Interrupt Injection Without a VM Exit

The most important observation about interrupt delivery is that the fast path does not involve userspace at all. The mechanism that enables this is irqfd, introduced in Linux 2.6.32 (commit 721eecbf, Gregory Haskins, Novell) and requiring KVM_CAP_IRQFD (value 32).

The underlying primitive is eventfd(2) (available since Linux 2.6.22): a file description backed by a 64-bit kernel counter. Writing 8 bytes adds to the counter; reading 8 bytes returns and resets it. The fd becomes EPOLLIN-readable the moment the counter is nonzero. KVM exploits the poll notification mechanism: during irqfd registration, kvm_irqfd_assign() calls init_poll_funcptr() and vfs_poll() on the eventfd file description, installing a custom waitqueue entry whose irqfd_wakeup function is called the instant the counter becomes nonzero — that is, the instant a device thread writes to the eventfd.
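The counter semantics are easy to observe from userspace. The sketch below assumes a Linux host with glibc's sys/eventfd.h; the function name is invented for the example. It signals an eventfd twice and then drains it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal an eventfd twice, then drain it. Returns the value read, or 0 on failure. */
uint64_t eventfd_accumulate_demo(void)
{
    uint64_t one = 1, two = 2, total = 0, extra = 0;
    int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);

    if (fd < 0)
        return 0;

    /* Two signals arrive before the consumer runs: the counter sums them. */
    if (write(fd, &one, sizeof one) != (ssize_t)sizeof one ||
        write(fd, &two, sizeof two) != (ssize_t)sizeof two ||
        read(fd, &total, sizeof total) != (ssize_t)sizeof total)
        total = 0;

    /* The read reset the counter, so a second nonblocking read should fail. */
    ssize_t r = read(fd, &extra, sizeof extra);
    assert(r >= 0 || errno == EAGAIN);
    if (r >= 0)
        total = 0;

    close(fd);
    return total;
}
```

The accumulation matters for irqfd: several device signals that land before KVM's wakeup callback runs collapse into one nonzero counter rather than being queued individually.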

struct kvm_irqfd {
    __u32 fd;          /* eventfd file descriptor */
    __u32 gsi;         /* irqchip GSI / pin number */
    __u32 flags;
    __u32 resamplefd;  /* used only with KVM_IRQFD_FLAG_RESAMPLE */
    __u8  pad[16];
};

KVM_IRQFD is a VM ioctl: _IOW(KVMIO, 0x76, struct kvm_irqfd). Setting KVM_IRQFD_FLAG_DEASSIGN in flags removes the binding (both fd and gsi must be provided). Setting KVM_IRQFD_FLAG_RESAMPLE (requires KVM_CAP_IRQFD_RESAMPLE = 82) switches to level-triggered mode: when the guest performs an EOI, KVM de-asserts the GSI and writes resamplefd, allowing the VMM to re-inject if the device still has work pending.
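The _IO and _IOW macros pack direction, argument size, type, and number into one 32-bit request code. The sketch below reproduces that packing under the assumption of the asm-generic encoding used on x86 and arm64 (some other architectures use different direction bits); ioc_encode is a hypothetical helper, not a kernel macro.

```c
#include <assert.h>
#include <stdint.h>

/* Direction values in the asm-generic ioctl encoding (assumed here). */
enum { IOC_NONE = 0, IOC_WRITE = 1, IOC_READ = 2 };

#define KVMIO 0xAEu

/* dir: bits 31:30, size: bits 29:16, type: bits 15:8, nr: bits 7:0 */
uint32_t ioc_encode(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    assert(dir <= 3 && type <= 0xFF && nr <= 0xFF && size < (1u << 14));
    return (dir << 30) | (size << 16) | (type << 8) | nr;
}
```

With struct kvm_irqfd at 32 bytes (four __u32 fields plus 16 bytes of padding), ioc_encode(IOC_WRITE, KVMIO, 0x76, 32) yields 0x4020AE76, and the argument-free KVM_CREATE_IRQCHIP encodes as 0xAE60.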

The path through the kernel (in virt/kvm/eventfd.c) avoids acquiring the irqfds.lock during the fast path to prevent deadlock; SRCU read locks protect the routing table instead. When irqfd_wakeup fires on EPOLLIN, it takes an SRCU read lock, reads the cached IRQ routing, and calls kvm_arch_set_irq_inatomic(). If that call returns -EWOULDBLOCK — the interrupt cannot be injected atomically at this moment, perhaps because the vCPU is not in a state that can receive it — the function schedules the irqfd_inject work item on a workqueue for deferred delivery. On EPOLLHUP (the eventfd was closed), irqfd_deactivate() removes the registration and queues irqfd_shutdown.

The result: a device thread writes 8 bytes to a file descriptor, and the guest receives an interrupt without the VMM process ever executing a line of code to handle it. No KVM_RUN return, no userspace round-trip, no scheduling decision — just a kernel callback and a VMCS update.

ioeventfd: Eliminating the Outbound Round-Trip

irqfd handles the host-to-guest direction. The guest-to-host direction — a guest writing to a virtqueue notify register to tell the host that work is ready — needs a mirror primitive. Without one, every guest MMIO write to a device register causes a KVM_EXIT_MMIO return from KVM_RUN, the VMM process wakes up, reads the address and value from kvm_run.mmio, dispatches to the right device handler, and re-enters the guest. That sequence costs on the order of 9 microseconds per notification.

KVM_IOEVENTFD (_IOW(KVMIO, 0x79, struct kvm_ioeventfd), requiring KVM_CAP_IOEVENTFD = 36) was introduced in Linux 2.6.32 (commit d34e6b17, Gregory Haskins, August 2009) to eliminate that round-trip:

struct kvm_ioeventfd {
    __u64 datamatch;
    __u64 addr;    /* legal pio/mmio address */
    __u32 len;     /* 0, 1, 2, 4, or 8 bytes */
    __s32 fd;
    __u32 flags;
    __u8  pad[36];
};

The flags field controls matching behavior:

Flag                                  Meaning
KVM_IOEVENTFD_FLAG_DATAMATCH          Signal only if the written value matches datamatch
KVM_IOEVENTFD_FLAG_PIO                Target PIO address space instead of MMIO
KVM_IOEVENTFD_FLAG_DEASSIGN           Remove the binding
KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY  s390 virtio-ccw channel device

KVM_CAP_IOEVENTFD_ANY_LENGTH permits len = 0 registrations that match regardless of write size.

The kernel fast-path in virt/kvm/eventfd.c: kvm_assign_ioeventfd_idx() registers the ioeventfd on KVM's MMIO, PIO, or VIRTIO_CCW bus via kvm_io_bus_register_dev(). When the guest executes a write to the registered address, KVM's exit handler calls ioeventfd_write(), which checks the address, the write length, and (if KVM_IOEVENTFD_FLAG_DATAMATCH) the written value via ioeventfd_in_range(). On a hit, it calls eventfd_signal() in-kernel and returns 0 — preventing the exit from propagating to userspace. A hardware-level VM exit still occurs (VMX must trap the write to unmapped MMIO), but the kernel services it without returning to the VMM process.
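The matching rule can be modeled in a few lines. This is a user-space approximation of the check described above, not the kernel function itself; the struct and its field names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of one ioeventfd registration. */
struct ioeventfd_model {
    uint64_t addr;
    uint32_t len;            /* 0 matches a write of any size */
    uint64_t datamatch;
    bool     use_datamatch;  /* KVM_IOEVENTFD_FLAG_DATAMATCH was set */
};

/* True when a guest write of `len` bytes of `val` at `addr` should signal the eventfd. */
bool ioeventfd_hit(const struct ioeventfd_model *p,
                   uint64_t addr, uint32_t len, uint64_t val)
{
    assert(p);
    if (addr != p->addr)
        return false;            /* address must match exactly */
    if (p->len == 0)
        return true;             /* length 0: only the address is checked */
    if (len != p->len)
        return false;            /* write size must match exactly */
    if (!p->use_datamatch)
        return true;             /* no datamatch: any value signals */
    return val == p->datamatch;  /* otherwise compare the written value */
}
```

A registration at a device's notify register with len = 4 and datamatch = 1 fires only for a 4-byte write of the value 1; a write of 0 to the same address falls through to the next registration on the bus.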

The patch commit message from August 2009 reported the performance effect:

Path                IOPS     Round-trip latency
QEMU MMIO baseline  110,000  9.09 µs
ioeventfd MMIO      200,100  5.00 µs
ioeventfd PIO       367,300  2.72 µs

The ioeventfd path recovers roughly 4 µs per notification by eliminating the userspace hop. For a workload with high virtqueue notification rates — sustained disk I/O or network traffic — that 4 µs per operation accumulates into a significant fraction of total CPU time.

How Firecracker Uses irqfd and ioeventfd

Firecracker registers one ioeventfd per virtqueue. The MMIO notify address is device_base + NOTIFY_REG_OFFSET where NOTIFY_REG_OFFSET = 0x50 (defined in src/vmm/src/devices/virtio/mod.rs, matching the virtio MMIO specification). The datamatch value is the queue index i, so KVM signals the queue-i eventfd only when the guest writes i to the QueueNotify register — avoiding spurious signals when the guest notifies a different queue at the same address:

// src/vmm/src/device_manager/mmio.rs (simplified)
for (i, queue_evt) in locked_device.queue_events().iter().enumerate() {
    let io_addr = IoEventAddress::Mmio(
        device.resources.addr + u64::from(NOTIFY_REG_OFFSET),
    );
    vm.fd()
        .register_ioevent(queue_evt, &io_addr, u32::try_from(i).unwrap())
        .map_err(MmioError::RegisterIoEvent)?;
}

Each device's irqfd registration assigns a single GSI (allocated via resource_allocator.allocate_gsi_legacy(1)) and binds it to the device's interrupt eventfd:

vm.register_irq(&mmio_device.interrupt.irq_evt, gsi)
    .map_err(MmioError::RegisterIrqFd)?;

The virtio device thread polls the per-queue ioeventfd file descriptors via epoll; the interrupt eventfd is the irqfd that triggers a guest interrupt when the device signals completion. The MMIO slot for each virtio device is 4 KiB (0x1000).

The two mechanisms are complements:

sequenceDiagram
    participant G as "Guest vCPU"
    participant K as "KVM (kernel)"
    participant D as "Device thread"
    G->>K: Write QueueNotify (MMIO)
    K->>D: eventfd_signal (ioeventfd hit)
    Note over K,D: No return to VMM process
    D->>D: Process virtqueue
    D->>K: Write irqfd eventfd
    K->>G: Inject interrupt (irqfd_wakeup)
    Note over K,G: No return to VMM process

The VMM process is not in the path for either the notification or the interrupt injection. It configured the bindings at setup time; the kernel and the device thread handle the fast path entirely.

Intel Posted Interrupt Processing

Even with irqfd, interrupt injection has a cost: when irqfd_wakeup fires and calls kvm_arch_set_irq_inatomic(), KVM must update the vCPU's interrupt state in the VMCS — which means the vCPU thread must either be in a state where the update is safe, or the injection must wait for the next VM entry. Intel VT-x includes a hardware mechanism that eliminates even this software step for the common case.

Posted interrupt processing was introduced on Ivy Bridge-EP and Haswell server processors in 2013. Two VMCS fields enable it: the posted-interrupt notification vector, a dedicated interrupt vector the CPU checks on incoming interrupts, and the posted-interrupt descriptor address, a pointer to a 64-byte Posted-Interrupt Descriptor (PID) in memory. All modifications to the PID must use locked read-modify-write instructions because the CPU and software may access it concurrently.

The PID layout:

Bits     Field     Meaning
255:0    PIR       256-bit bitmap; bit N indicates interrupt vector N is pending
256      ON        Outstanding Notification bit
511:257  Reserved

When an interrupt arrives while the vCPU is in VMX non-root mode and the interrupt vector matches the notification vector, the CPU does not exit. Instead it atomically clears the ON bit (bit 256), scans the PIR bitmap, and delivers all pending interrupts directly to the virtual APIC — without executing any VMM or KVM code. The cost of interrupt delivery when the vCPU is running is reduced to pure hardware time.

When the target vCPU is not currently executing in guest mode, no notification IPI is needed. KVM records the vector in the PIR and, if the vCPU thread is blocked, wakes it; the pending PIR bits are then merged into the virtual APIC when the vCPU next enters VMX non-root mode, so the interrupt is delivered at entry rather than requiring an exit and re-entry cycle.
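The descriptor protocol can be sketched in portable C11 atomics. This is a software model only: the struct follows the PID layout given above, but pi_post and pi_sync are invented names, and pi_sync merely imitates in software what the CPU does in hardware on the notification vector.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* 64-byte descriptor: PIR in bits 255:0, ON in bit 256, remainder reserved. */
struct pi_desc {
    _Alignas(64) _Atomic uint64_t pir[4];
    _Atomic uint32_t control;            /* bit 0 = ON */
    uint32_t reserved[7];
};
_Static_assert(sizeof(struct pi_desc) == 64, "PID must be 64 bytes");

/*
 * Post `vector` with locked read-modify-write operations. Returns true when
 * ON was previously clear, i.e. when a notification is worth sending.
 */
bool pi_post(struct pi_desc *pid, uint8_t vector)
{
    atomic_fetch_or(&pid->pir[vector / 64], UINT64_C(1) << (vector % 64));
    return (atomic_fetch_or(&pid->control, 1u) & 1u) == 0;
}

/* Model of notification processing: clear ON, then move PIR into the virtual IRR. */
void pi_sync(struct pi_desc *pid, uint64_t virr[4])
{
    atomic_fetch_and(&pid->control, ~1u);
    for (int i = 0; i < 4; i++)
        virr[i] |= atomic_exchange(&pid->pir[i], 0);
}
```

Testing ON before notifying is what lets a burst of postings share a single notification: only the first poster after a sync sees ON clear.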

Enabling posted interrupts requires setting the "process posted interrupts" VM-execution control bit in the VMCS, configuring the notification vector and descriptor address, and ensuring KVM_CAP_X2APIC_API and related capabilities are in order. KVM manages this transparently when the hardware supports it; the VMM does not need to handle posted interrupts explicitly.

Paravirtual Interrupt Optimizations

Even with in-kernel APIC emulation, certain interrupt operations are expensive. An EOI write to the APIC at 0xFEE000B0 is an MMIO write that KVM must intercept and handle. On a guest processing thousands of interrupts per second, those EOI exits accumulate.

PV-EOI eliminates most of them. The guest writes MSR_KVM_PV_EOI_EN = 0x4b564d04 with the low bit set and bits 63–2 holding a 4-byte-aligned guest physical address. KVM then sets bit 0 of the word at that address before injecting each interrupt. The guest's interrupt return path tests and clears that bit atomically; if the bit was set, the EOI is complete without any APIC MMIO write. Only when the bit is already clear — when multiple interrupt levels are active and the APIC needs to update the ISR — does the guest fall back to the MMIO EOI.
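A minimal sketch of the guest side, assuming C11 atomics stand in for the kernel's test-and-clear primitive; the function names are illustrative only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_KVM_PV_EOI_EN 0x4b564d04u
#define PV_EOI_ENABLE_BIT UINT64_C(1)

/* Value the guest writes to MSR_KVM_PV_EOI_EN for a given flag-word address. */
uint64_t pv_eoi_msr_value(uint64_t gpa)
{
    assert((gpa & 0x3) == 0);   /* flag word must be 4-byte aligned */
    return gpa | PV_EOI_ENABLE_BIT;
}

/*
 * Guest-side EOI attempt. Atomically clears bit 0 of the shared word and
 * returns true if it had been set, meaning no APIC EOI write is needed.
 * A false return means the guest must fall back to the APIC EOI register.
 */
bool pv_eoi_try(_Atomic uint32_t *flag_word)
{
    return (atomic_fetch_and(flag_word, ~1u) & 1u) != 0;
}
```

The test and the clear have to be a single atomic operation because the hypervisor may rewrite the word concurrently when it injects the next interrupt.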

The paravirtual hypercall interface provides additional shortcuts for inter-processor interrupts:

Hypercall           Number  Description
KVM_HC_KICK_CPU     5       Wake a vCPU from HLT; a1 = target APIC ID
KVM_HC_SEND_IPI     10      Multicast IPI; a0/a1 = 128-bit APIC ID bitmap, a2 = lowest APIC ID, a3 = ICR value; up to 128 destinations per call in 64-bit mode
KVM_HC_SCHED_YIELD  11      Yield to scheduler when IPI target is preempted; a0 = destination APIC ID

Sending a TLB shootdown IPI on a large guest with many vCPUs would otherwise require one APIC ICR write per destination, each of which may exit. The hypercall encodes 128 destinations in two registers, collapsing the loop to a single VM exit.
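To show how the destinations collapse into two registers, the sketch below builds the a0/a1/a2 arguments for one KVM_HC_SEND_IPI call from a sorted list of APIC IDs. The struct and function are hypothetical; the only things taken from the text are the 128-destination window and the bitmap being relative to the lowest APIC ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPI_CLUSTER_SIZE 128   /* destinations per hypercall in 64-bit mode */

struct ipi_batch {
    uint64_t bitmap_lo;   /* a0: bits 63:0 of the bitmap   */
    uint64_t bitmap_hi;   /* a1: bits 127:64 of the bitmap */
    uint32_t min_id;      /* a2: APIC ID that bit 0 stands for */
};

/* Fill one batch from ascending APIC IDs; returns how many IDs were consumed. */
size_t ipi_fill_batch(const uint32_t *ids, size_t n, struct ipi_batch *out)
{
    size_t used = 0;

    out->bitmap_lo = out->bitmap_hi = 0;
    out->min_id = n ? ids[0] : 0;
    while (used < n) {
        assert(ids[used] >= out->min_id);   /* input must be sorted ascending */
        uint32_t bit = ids[used] - out->min_id;
        if (bit >= IPI_CLUSTER_SIZE)
            break;                          /* outside the window: needs another call */
        if (bit < 64)
            out->bitmap_lo |= UINT64_C(1) << bit;
        else
            out->bitmap_hi |= UINT64_C(1) << (bit - 64);
        used++;
    }
    return used;
}
```

A caller would loop, issuing one hypercall per batch and advancing the ID pointer by the returned count until the list is exhausted.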

The ARM GIC (VGIC)

Arm's interrupt controller architecture is the Generic Interrupt Controller, or GIC. KVM's virtual GIC implementation is the VGIC. Interrupt IDs are organized in four ranges:

Range                               IDs      Type
SGI (Software Generated)            0–15     Per-vCPU, used for IPIs
PPI (Private Peripheral)            16–31    Per-vCPU, used for timers
SPI (Shared Peripheral)             32–1019  Shared across all vCPUs
LPI (Locality-specific Peripheral)  8192+    Shared; GICv3 only

The kernel VGIC allows 64–1024 IRQs in steps of 32, configured via KVM_DEV_ARM_VGIC_GRP_NR_IRQS. Total IRQ count in Firecracker is GSI_LEGACY_NUM (32) + SPI count.

GICv2

GICv2 uses MMIO exclusively. The device type for KVM_CREATE_DEVICE is KVM_DEV_TYPE_ARM_VGIC_V2. Two MMIO regions must be placed via KVM_DEV_ARM_VGIC_GRP_ADDR:

Attribute                   Alignment  Region size
KVM_VGIC_V2_ADDR_TYPE_DIST  4 KiB      4 KiB
KVM_VGIC_V2_ADDR_TYPE_CPU   4 KiB      8 KiB

The distributor holds global state; the CPU interface, one per vCPU, is the per-CPU window through which a running vCPU reads pending priority and signals EOI.

GICv3

GICv3 replaces the per-CPU MMIO CPU interface with system registers — the ICC and ICH register families, accessed via MSR/MRS rather than memory loads and stores. This makes EOI and priority reads faster on real hardware, and KVM emulates the same path via KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS. Device type: KVM_DEV_TYPE_ARM_VGIC_V3.

The memory layout changes substantially. Where GICv2 has one CPU interface region for all vCPUs, GICv3 introduces a redistributor — a 128 KiB (KVM_VGIC_V3_REDIST_SIZE = 0x20000) per-vCPU MMIO region that holds per-CPU state and LPI pending bits. The distributor grows to 64 KiB (KVM_VGIC_V3_DIST_SIZE = 0x10000):

Attribute                     Alignment  Region size
KVM_VGIC_V3_ADDR_TYPE_DIST    64 KiB     64 KiB
KVM_VGIC_V3_ADDR_TYPE_REDIST  64 KiB     128 KiB per vCPU

The redistributor is selected by MPIDR (the Multiprocessor Affinity Register), so the same redistributor address space can serve multiple vCPUs if they have distinct MPIDR values.

The Interrupt Translation Service

LPIs are Locality-specific Peripheral Interrupts, GICv3's mechanism for MSI delivery. Rather than asserting a wire, a device writes its EventID to the GITS_TRANSLATER register of the Interrupt Translation Service (ITS); the ITS looks the event up in that device's Interrupt Translation Table (ITT) and translates the write into an LPI. KVM exposes this via KVM_DEV_TYPE_ARM_VGIC_ITS with a 128 KiB MMIO region (GIC_V3_ITS_SIZE = 0x20000 in Firecracker), placed at a 64 KiB-aligned address via KVM_VGIC_ITS_ADDR_TYPE.

The ITS maintains three tables: a Device Table mapping DeviceID to an Interrupt Translation Table, an Interrupt Translation Entry table mapping EventID to a physical LPI number, and a Collection Table mapping collection IDs to redistributors. Key control registers in the ITS MMIO space: GITS_CBASER (command queue base address), GITS_CWRITER and GITS_CREADR (command queue write and read pointers), and GITS_CTLR (enable).

Lifecycle Constraints

The VGIC imposes strict ordering on its initialization sequence. KVM_DEV_ARM_VGIC_CTRL_INIT must be called after all vCPUs are created, because the redistributor count is determined by the vCPU count. LPI pending state — which LPIs were pending at snapshot time — must be flushed to guest RAM before a snapshot with KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES, and reloaded at restore time.

Firecracker places the GICv3 regions at fixed offsets below MMIO32_MEM_START: the distributor at MMIO32_MEM_START - 0x10000, the redistributors at dist_addr - (vcpu_count * 0x20000), and the ITS at redist_addr - 0x20000.
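The placement arithmetic can be written out directly. In this sketch the value of MMIO32_MEM_START is left as a parameter because the text does not give it; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GICV3_DIST_SIZE   UINT64_C(0x10000)   /* 64 KiB           */
#define GICV3_REDIST_SIZE UINT64_C(0x20000)   /* 128 KiB per vCPU */
#define GICV3_ITS_SIZE    UINT64_C(0x20000)   /* 128 KiB          */

struct gicv3_layout {
    uint64_t dist;
    uint64_t redist;
    uint64_t its;
};

/* Stack the distributor, redistributors, and ITS downward from mmio32_start. */
struct gicv3_layout gicv3_place(uint64_t mmio32_start, unsigned vcpu_count)
{
    struct gicv3_layout l;

    assert(vcpu_count > 0 && (mmio32_start & 0xFFFF) == 0);
    l.dist   = mmio32_start - GICV3_DIST_SIZE;
    l.redist = l.dist - (uint64_t)vcpu_count * GICV3_REDIST_SIZE;
    l.its    = l.redist - GICV3_ITS_SIZE;
    return l;
}

/* Base of the redistributor region for vCPU index i, assuming one contiguous block. */
uint64_t gicv3_redist_for(const struct gicv3_layout *l, unsigned i)
{
    return l->redist + (uint64_t)i * GICV3_REDIST_SIZE;
}
```

Because every region size is a multiple of 64 KiB, a 64 KiB-aligned starting address keeps all three bases aligned as the alignment table requires.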

Why Clocks Are Hard Under Virtualization

Interrupt delivery can be made fast with the mechanisms above. Timekeeping is harder because the inaccuracy accumulates invisibly and manifests far from its cause.

The time-stamp counter — RDTSC on x86 — is the cheapest way to measure elapsed time on a running CPU. The instruction is not serializing: out-of-order execution can produce a later RDTSC result that is numerically less than an earlier one on a different CPU. On multi-socket systems the TSC oscillators are independent crystals that drift due to temperature and electrical variation, so RDTSC on CPU 0 and RDTSC on CPU 1 may disagree. The Linux kernel documentation is explicit: "do not trust the TSCs to remain synchronized on NUMA or multiple socket systems."

The TSC has also historically changed rate with CPU power states. X86_FEATURE_CONSTANT_TSC (CPUID.80000007H:EDX[8], the "invariant TSC" bit) guarantees the TSC runs at a constant rate across all ACPI P-, C-, and T-states. Without it, any transition to a lower-frequency P-state or a deep C-state corrupts time measurements. X86_FEATURE_NONSTOP_TSC guarantees the counter does not halt in C-states. On a virtualized system the guest has no control over which C-states the host enters, so it cannot rely on either property being present unless the hypervisor advertises them.

Live migration to a different host is the hardest case. The destination host's TSC may run at a slightly higher or lower frequency than the source. A faster destination TSC cannot be slowed; the hypervisor must apply an offset and a scaling factor to make the guest's visible TSC advance at the same rate.

Legacy timekeeping via the PIT (Programmable Interval Timer, I/O ports 0x40–0x43, base frequency 1.193182 MHz) or the RTC (32.768 kHz crystal) relies on interrupt delivery rates that the hypervisor cannot always guarantee. When a host CPU is overloaded, timer interrupts arrive late or are coalesced, and the guest's clock falls behind wall-clock time.

Both VMX and SVM virtualize the TSC with an offset field (TSC_OFFSET in the VMCS, and the equivalent field in the VMCB on AMD), so the guest reads host_TSC + offset. Both also support TSC scaling: VMX provides a 64-bit TSC multiplier field in the VMCS, a fixed-point number with 48 fractional bits; AMD SVM provides the TSC ratio MSR at 0xC0000104. KVM programs these fields via the KVM_SET_TSC_KHZ ioctl (gated by KVM_CAP_TSC_CONTROL for per-vCPU control, KVM_CAP_VM_TSC_CONTROL for a VM-wide default applied to subsequently created vCPUs). KVM_GET_TSC_KHZ returns a negative error if the host TSC is unstable.
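The scaling arithmetic is worth seeing once. The sketch below assumes the VMX encoding, in which the multiplier carries 48 fractional bits, and relies on the GCC/Clang unsigned __int128 extension for the wide product; both helper names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TSC_FRAC_BITS 48   /* assumed: VMX multiplier has 48 fractional bits */

/* Multiplier that makes a host running at host_khz look like guest_khz. */
uint64_t tsc_multiplier(uint32_t guest_khz, uint32_t host_khz)
{
    assert(host_khz != 0);
    return (uint64_t)(((unsigned __int128)guest_khz << TSC_FRAC_BITS) / host_khz);
}

/* Guest-visible TSC: scale the host value first, then apply the offset. */
uint64_t guest_tsc(uint64_t host_tsc, uint64_t multiplier, uint64_t offset)
{
    unsigned __int128 scaled = (unsigned __int128)host_tsc * multiplier;
    return (uint64_t)(scaled >> TSC_FRAC_BITS) + offset;
}
```

A guest that booted on a 3.0 GHz host and now runs on a 2.0 GHz host gets a multiplier of 1.5 in fixed point, so 2000 host ticks read as 3000 guest ticks. The offset is an unsigned 64-bit value that wraps, which is how a negative adjustment is expressed.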

The IA32_TSC_ADJUST MSR (0x3B) provides a per-logical-processor offset added to the hardware TSC on every guest read. Its reset value is 0. Linux guests use IA32_TSC_ADJUST for TSC synchronization across CPUs rather than writing IA32_TSC directly, which would disrupt offset-based invariants. KVM tracks IA32_TSC_ADJUST separately from the VMCS TSC_OFFSET.

kvmclock: The Paravirtual Clock

The structural solution to these problems is to remove the guest's dependence on hardware clocks entirely and give it a shared-memory ABI the hypervisor updates directly. That ABI is kvmclock, also known as pvclock from the name of the data structures it uses.

The guest detects KVM with CPUID leaf 0x40000000; the signature at EBX:ECX:EDX spells "KVMKVMKVM\0\0\0". Feature bits are at leaf 0x40000001 EAX:

Bit  Constant                            Meaning
0    KVM_FEATURE_CLOCKSOURCE             kvmclock available at deprecated MSRs 0x11 / 0x12
3    KVM_FEATURE_CLOCKSOURCE2            kvmclock at canonical MSRs 0x4b564d00 / 0x4b564d01
5    KVM_FEATURE_STEAL_TIME              steal time at MSR 0x4b564d03
24   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT  host guarantees no per-CPU warp; enables vDSO fast path

The detection algorithm: check kvm_para_available(), then read cpuid_eax(0x40000001). If bit 3 is set, use the canonical MSRs MSR_KVM_SYSTEM_TIME_NEW (0x4b564d01) and MSR_KVM_WALL_CLOCK_NEW (0x4b564d00); if only bit 0 is set, use the deprecated pair 0x12 / 0x11.
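The selection logic reduces to two bit tests. In this sketch the caller is assumed to have already confirmed the KVM signature and read EAX from leaf 0x40000001; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KVM_FEATURE_CLOCKSOURCE  0
#define KVM_FEATURE_CLOCKSOURCE2 3

struct kvmclock_msrs {
    uint32_t system_time;
    uint32_t wall_clock;
};

/* `features` is EAX from CPUID leaf 0x40000001. Returns false if kvmclock is absent. */
bool kvmclock_pick_msrs(uint32_t features, struct kvmclock_msrs *out)
{
    assert(out);
    if (features & (1u << KVM_FEATURE_CLOCKSOURCE2)) {
        out->system_time = 0x4b564d01;   /* MSR_KVM_SYSTEM_TIME_NEW */
        out->wall_clock  = 0x4b564d00;   /* MSR_KVM_WALL_CLOCK_NEW  */
        return true;
    }
    if (features & (1u << KVM_FEATURE_CLOCKSOURCE)) {
        out->system_time = 0x12;         /* deprecated pair */
        out->wall_clock  = 0x11;
        return true;
    }
    return false;
}
```

Checking bit 3 first means a host that advertises both bits is always driven through the canonical MSRs.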

The Wall Clock

The guest writes a 4-byte-aligned guest physical address to MSR_KVM_WALL_CLOCK_NEW (0x4b564d00). The hypervisor fills the structure at that address:

struct pvclock_wall_clock {
    u32 version;  /* seqlock */
    u32 sec;      /* seconds since Unix epoch at guest boot */
    u32 nsec;
} __attribute__((__packed__));

This MSR is global — not per-vCPU — and records the wall time at the moment the MSR was written. To compute current wall time, the guest adds pvclock_wall_clock.sec/nsec to the elapsed system time obtained from MSR_KVM_SYSTEM_TIME_NEW.
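The addition involves a nanosecond carry that is easy to get wrong. A minimal sketch, with a hypothetical wall_time struct and function name, assuming system_time_ns is the value computed from the per-vCPU system time structure:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

struct wall_time {
    uint64_t sec;
    uint32_t nsec;
};

/* Current wall time = pvclock_wall_clock base + elapsed system time. */
struct wall_time wall_clock_now(uint32_t base_sec, uint32_t base_nsec,
                                uint64_t system_time_ns)
{
    struct wall_time t;
    uint64_t nsec;

    assert(base_nsec < NSEC_PER_SEC);
    nsec   = (uint64_t)base_nsec + system_time_ns % NSEC_PER_SEC;
    t.sec  = (uint64_t)base_sec + system_time_ns / NSEC_PER_SEC + nsec / NSEC_PER_SEC;
    t.nsec = (uint32_t)(nsec % NSEC_PER_SEC);
    return t;
}
```

With a base of 1700000000 s + 900000000 ns and 1.2 s of elapsed system time, the nanosecond fields sum past one second, so the result is 1700000002 s + 100000000 ns.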

The System Time Clock

MSR_KVM_SYSTEM_TIME_NEW (0x4b564d01) is per-vCPU. The guest writes a 4-byte-aligned guest physical address with bit 0 as the enable bit. The hypervisor fills and periodically updates the structure at that address:

struct pvclock_vcpu_time_info {
    u32 version;           /* seqlock; odd = update in progress */
    u32 pad0;
    u64 tsc_timestamp;     /* host TSC at last update */
    u64 system_time;       /* host monotonic ns at last update */
    u32 tsc_to_system_mul; /* fixed-point multiplier */
    s8  tsc_shift;         /* shift before multiply: positive=left, negative=right */
    u8  flags;
    u8  pad[2];
} __attribute__((packed));  /* 32 bytes total */

The comment in arch/x86/include/asm/pvclock-abi.h is unambiguous: "these structs MUST NOT be changed" — they are stable ABI shared between KVM and Xen guests.

Reading the Clock

The conversion from TSC ticks to nanoseconds uses the multiplier and shift:

delta = current_tsc - tsc_timestamp
if (tsc_shift >= 0): delta <<= tsc_shift
else:                delta >>= -tsc_shift
time_ns = ((delta * tsc_to_system_mul) >> 32) + system_time

The version field is a seqlock: read it before and after capturing the time fields; if either read is odd or the two values differ, the hypervisor updated the structure mid-read and the guest must retry. The seqlock protocol is what makes the update safe without a kernel lock in the guest read path.
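Putting the conversion and the retry loop together gives roughly the following. This is a user-space sketch rather than the kernel's reader: it takes the TSC value as a parameter so it can be exercised without RDTSC (a real guest reads the TSC inside the loop), uses C11 fences in place of the kernel's barriers, and depends on the GCC/Clang packed attribute and unsigned __int128.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct pvclock_vcpu_time_info {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
} __attribute__((packed));
_Static_assert(sizeof(struct pvclock_vcpu_time_info) == 32, "ABI size is 32 bytes");

/* Shift first, then multiply by the 32.32 fixed-point factor. */
uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Seqlock read: retry while an update is in progress or the version changed. */
uint64_t pvclock_read_ns(const volatile struct pvclock_vcpu_time_info *ti, uint64_t tsc)
{
    uint32_t v1, v2;
    uint64_t ns;

    assert(ti);
    do {
        v1 = ti->version;
        atomic_thread_fence(memory_order_acquire);
        ns = ti->system_time +
             pvclock_scale_delta(tsc - ti->tsc_timestamp,
                                 ti->tsc_to_system_mul, ti->tsc_shift);
        atomic_thread_fence(memory_order_acquire);
        v2 = ti->version;
    } while ((v1 & 1) || v1 != v2);
    return ns;
}
```

With tsc_to_system_mul = 0x80000000 (0.5) and tsc_shift = 1, each TSC tick is worth exactly one nanosecond, which makes the result easy to verify by hand.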

Flags

Bit  Value  Meaning
0    1      Timestamps across CPUs are guaranteed monotonic; no global atomic needed per read
1    2      Guest vCPU was paused by the host; clear this flag and touch watchdogs

Bit 0 is set when the host advertises KVM_FEATURE_CLOCKSOURCE_STABLE_BIT (bit 24 in CPUID.0x40000001). Without it, pvclock.c enforces global monotonicity by updating a last_value counter via atomic compare-and-swap on every read — because without the host's guarantee, an unlucky migration could move the guest to a CPU with a slightly earlier TSC value, causing time to appear to go backward. With bit 0 set, the raw computed value is returned directly with no global serialization, enabling per-CPU vDSO reads.
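The compare-and-swap clamp used in the unstable case can be sketched with C11 atomics. This models the idea rather than reproducing pvclock.c; the function name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Return a timestamp that never goes backward across CPUs. `last_value` is
 * shared by all readers; `now` is the raw per-CPU pvclock reading.
 */
uint64_t monotonic_clamp(_Atomic uint64_t *last_value, uint64_t now)
{
    uint64_t last;

    assert(last_value);
    last = atomic_load(last_value);
    do {
        if (now <= last)
            return last;   /* someone already reported a later time */
        /* On failure, `last` is reloaded with the current shared value. */
    } while (!atomic_compare_exchange_weak(last_value, &last, now));
    return now;
}
```

Every read touches one shared cache line, which is the cost the stable bit removes: with the host's guarantee in place, readers skip the clamp and return the raw value.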

kvmclock as a Linux Clocksource

kvmclock_init() in arch/x86/kernel/kvmclock.c registers kvm_clock with clocksource_register_hz(&kvm_clock, NSEC_PER_SEC). The default clocksource rating is 400, which wins over the HPET (rating 250) and the ACPI PM timer (rating 200). When the host exposes both X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC and !check_tsc_unstable(), the rating is reduced to 299 so that the native TSC clocksource — which is cheaper, requiring no shared-memory read — can win instead.

When flags bit 0 is set, kvmclock also calls kvm_sched_clock_init() to register the scheduler clock and exposes the fast path via vDSO, so that clock_gettime(CLOCK_MONOTONIC, ...) is serviced by a user-space shared-library read rather than a system call.

Each vCPU has its own pvclock_vcpu_time_info structure: the boot CPU uses a static array, while hotplugged CPUs use dynamic allocation in kvmclock_setup_percpu(). TSC frequency is retrieved from these structures via kvm_get_tsc_khz().

Steal Time

The paravirtual clock tells the guest how much time has elapsed from the host's perspective. Steal time tells it how much of that elapsed time the vCPU wanted to run but was not scheduled, because the host scheduler gave the physical CPU to other workloads.

The guest writes a 64-byte-aligned guest physical address (stricter than the 4-byte alignment required by the clock MSRs) with bit 0 as the enable bit to MSR_KVM_STEAL_TIME (0x4b564d03). The structure must be zero-initialized before the MSR write. The hypervisor fills:

struct kvm_steal_time {
    __u64 steal;      /* ns vCPU was not scheduled (excludes idle time) */
    __u32 version;    /* seqlock; even/odd protocol */
    __u32 flags;      /* currently always 0 */
    __u8  preempted;  /* nonzero = vCPU currently descheduled */
    __u8  u8_pad[3];
    __u32 pad[11];
};

The steal field counts only involuntary non-run time — host-scheduler preemption — not idle time. A guest CPU consuming 100% of its allowed time shows zero steal; a vCPU that the host is not scheduling shows increasing steal. The preempted field is a hint the guest can use to avoid spinning on spinlocks when it knows the vCPU holding the lock has been descheduled.
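Two small pieces of arithmetic are implied here: composing the value written to MSR_KVM_STEAL_TIME, and turning two samples of the steal field into a percentage. The sketch below shows both under stated assumptions; the helper names are hypothetical, and the actual MSR write and the seqlock read of the structure are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_KVM_STEAL_TIME 0x4b564d03u /* MSR index from the text */
#define STEAL_TIME_ALIGN   64u         /* required alignment of the GPA */
#define STEAL_TIME_ENABLE  1ull        /* bit 0 of the MSR value */

/* Stores the MSR value and returns 0, or returns -1 if gpa is misaligned. */
int steal_time_msr_value(uint64_t gpa, uint64_t *msr_value)
{
    assert(msr_value != NULL);
    if (gpa & (STEAL_TIME_ALIGN - 1))
        return -1;
    *msr_value = gpa | STEAL_TIME_ENABLE;
    return 0;
}

/* Percentage of an interval during which the vCPU was runnable but not run. */
double steal_percent(uint64_t steal_prev_ns, uint64_t steal_now_ns,
                     uint64_t elapsed_ns)
{
    assert(elapsed_ns > 0);
    uint64_t stolen = steal_now_ns - steal_prev_ns;
    return 100.0 * (double)stolen / (double)elapsed_ns;
}
```

A vCPU whose steal counter advances by 250 ms over a one-second window reports 25 percent steal, which is the figure tools such as top show in the st column.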

VM-Level Clock Ioctls

Two VM-level ioctls expose the kvmclock value to the VMM for snapshot and restore purposes, gated by KVM_CAP_ADJUST_CLOCK:

struct kvm_clock_data {
    __u64 clock;      /* kvmclock nanosecond value */
    __u32 flags;
    __u32 pad0;
    __u64 realtime;   /* host CLOCK_REALTIME at snapshot (if KVM_CLOCK_REALTIME set) */
    __u64 host_tsc;   /* host TSC at snapshot (if KVM_CLOCK_HOST_TSC set) */
    __u32 pad[4];
};

KVM_GET_CLOCK reads the current kvmclock value; KVM_SET_CLOCK restores it. Flags in the structure:

Constant               Value    Meaning
KVM_CLOCK_TSC_STABLE   2        clock is consistent across all vCPUs
KVM_CLOCK_REALTIME     1 << 2   realtime field is valid
KVM_CLOCK_HOST_TSC     1 << 3   host_tsc field is valid

KVM_KVMCLOCK_CTRL is a vCPU ioctl that sets a flag in the KVM vCPU state indicating the vCPU was paused by host userspace. The guest checks this flag and skips soft-lockup watchdog triggers for the duration of the pause — preventing false lockup reports when the VMM deliberately pauses vCPUs for snapshotting.
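On the guest side, consuming the pause notification amounts to a test-and-clear of bit 1 in the pvclock flags byte. The following is a sketch under that assumption; the macro and function names are illustrative, and the watchdog call that a real guest would make when the function returns true is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PV_FLAG_TSC_STABLE    (1u << 0) /* flags bit 0 */
#define PV_FLAG_GUEST_STOPPED (1u << 1) /* flags bit 1: paused by the host */

/*
 * Returns true if the host marked this vCPU as paused, clearing the flag so
 * the next pause can be detected. The caller would then touch its watchdogs.
 */
bool pv_consume_guest_stopped(uint8_t *flags)
{
    assert(flags != NULL);
    if (!(*flags & PV_FLAG_GUEST_STOPPED))
        return false;
    *flags &= (uint8_t)~PV_FLAG_GUEST_STOPPED;
    return true;
}
```

Clearing the bit leaves the stability bit untouched, so the fast read path is unaffected by a pause.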

Clocks And Snapshots

The interaction between kvmclock and snapshot/restore is subtle enough to have produced multiple Firecracker bugs over several releases. Each fix is instructive.

Firecracker v1.8.0 addressed MSR_IA32_TSC_DEADLINE behavior at restore. If the guest sets a TSC deadline timer and then the vCPU is snapshotted, the saved MSR_IA32_TSC_DEADLINE may be 0 at snapshot time (because the timer already fired). Restoring a 0 deadline means the timer never fires again, stalling any guest that relies on TSC deadline interrupts. Firecracker v1.8.0 detects this condition and substitutes the current MSR_IA32_TSC value. It also changed restore ordering to apply MSR_IA32_TSC_DEADLINE after MSR_IA32_TSC, because KVM uses the guest TSC value when computing the deadline's meaning.
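The two rules in this fix, substitution and ordering, can be modeled in a few lines of C. This is not Firecracker's code (Firecracker is written in Rust); the struct and function names are hypothetical, and the sketch assumes the substituted value is the TSC captured in the snapshot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_TSC          0x00000010u
#define MSR_IA32_TSC_DEADLINE 0x000006e0u

/* Hypothetical index/value pair standing in for a real MSR entry. */
struct msr_write_sketch {
    uint32_t index;
    uint64_t data;
};

/*
 * Fills out[] in the order the MSRs should be written at restore:
 * the TSC first, then the deadline. A saved deadline of 0 is replaced
 * by the TSC value so that the timer can fire again after restore.
 */
void plan_tsc_restore(uint64_t saved_tsc, uint64_t saved_deadline,
                      struct msr_write_sketch out[2])
{
    assert(out != NULL);
    out[0].index = MSR_IA32_TSC;
    out[0].data = saved_tsc;
    out[1].index = MSR_IA32_TSC_DEADLINE;
    out[1].data = saved_deadline != 0 ? saved_deadline : saved_tsc;
}
```

A nonzero saved deadline passes through unchanged; only the already-fired case is rewritten.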

Firecracker v1.10.0 added a KVM_KVMCLOCK_CTRL call after pausing vCPUs on x86_64. Previously, paused vCPUs could trigger the guest's soft-lockup watchdog because the guest did not know a pause was intentional. Non-fatal failures from this ioctl increment the vcpu.kvmclock_ctrl_fails metric rather than aborting the pause.

Firecracker v1.16.0 fixed two separate kvmclock problems. First, guests using kvm-clock and running on host Linux 5.16 or later experienced the monotonic clock jumping forward by the wall-clock time elapsed since snapshot creation. The root cause: KVM_SET_CLOCK was not being called at restore, so the guest's internal monotonic time and the host's wall clock could diverge. The LoadSnapshot API now accepts an optional clock_realtime: true flag that opts into calling KVM_SET_CLOCK with KVM_CLOCK_REALTIME at restore time. Without the flag, the guest monotonic clock resumes from the snapshot timestamp — a deliberate choice for workloads that want time to appear frozen across the pause.
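The two restore behaviors differ only in whether the elapsed host wall-clock time is added to the saved kvmclock value. The sketch below models that arithmetic as the KVM API documentation describes it for KVM_CLOCK_REALTIME; the struct is a hypothetical subset of kvm_clock_data, the names are invented, and no ioctl is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLOCK_FLAG_REALTIME (1u << 2) /* same bit as KVM_CLOCK_REALTIME */

/* Hypothetical subset of struct kvm_clock_data saved in a snapshot. */
struct clock_snapshot_sketch {
    uint64_t clock;    /* kvmclock nanoseconds at snapshot */
    uint32_t flags;
    uint64_t realtime; /* host CLOCK_REALTIME nanoseconds at snapshot */
};

/* kvmclock value the guest observes immediately after restore. */
uint64_t restored_clock_ns(const struct clock_snapshot_sketch *s,
                           uint64_t host_realtime_now_ns)
{
    assert(s != NULL);
    if ((s->flags & CLOCK_FLAG_REALTIME) && host_realtime_now_ns > s->realtime)
        return s->clock + (host_realtime_now_ns - s->realtime);
    return s->clock; /* time appears frozen across the pause */
}
```

If the host clock has stepped backward since the snapshot, the model leaves the saved value untouched rather than subtracting, so the guest never sees its clock move backward at restore.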

The second v1.16.0 fix concerned snapshot serialization: the snapshot previously covered only a small subset of KVM custom MSRs, missing entries such as MSR_KVM_ASYNC_PF_INT (0x4b564d06). The fix extended coverage to the full KVM custom MSR range, 0x4b564d00 through 0x4b564dff.
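Covering the whole range rather than a hand-picked list reduces the membership test to a pair of comparisons. A minimal sketch, with hypothetical names for the bounds and helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KVM_CUSTOM_MSR_FIRST 0x4b564d00u
#define KVM_CUSTOM_MSR_LAST  0x4b564dffu

/* True if index falls inside the KVM custom MSR range (inclusive). */
bool is_kvm_custom_msr(uint32_t index)
{
    return index >= KVM_CUSTOM_MSR_FIRST && index <= KVM_CUSTOM_MSR_LAST;
}

/* Counts how many entries of an MSR index list a range-based save covers. */
size_t count_kvm_custom_msrs(const uint32_t *indices, size_t n)
{
    size_t count = 0;

    assert(indices != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (is_kvm_custom_msr(indices[i]))
            count++;
    return count;
}
```

A range check also picks up paravirtual MSRs added by later kernels without a VMM change, which an explicit allowlist would miss.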

Firecracker exposes kvm-clock and tsc as clocksources on x86_64, and arch_sys_counter on aarch64. After snapshot restore, the guest wall clock continues from the moment of snapshot creation. For workloads that need accurate wall time, the recommended remediation is to update the clock post-restore via NTP or a guest agent — the VMM cannot safely correct it without risking time monotonicity violations visible to running guest processes.

Sources And Further Reading