Chapter 11: virtio — The Paravirtualized Device Model
Every device the guest needs — a network card, a disk, an entropy source — has to go through the VMM. The question is how. The naive answer is emulation: the VMM impersonates a real piece of hardware, the guest drives it with an unmodified driver, and the VMM translates guest I/O port writes and MMIO accesses into host operations. QEMU's emulation of the 82093AA I/OAPIC, the Intel 82576 Gigabit Ethernet controller, and a dozen other real chips is how most full VMs run. It is also slow, not because the emulation itself is expensive, but because the interface was designed for a device that has its own DMA engine, its own FIFO, and a long latency to silicon. The guest issues dozens of register writes to queue one operation, each of which exits the CPU into the VMM and back. The overhead is not in the emulated logic; it is in the round-trip count.
Paravirtualization cuts the round-trip count by giving up the pretense of real hardware. The guest runs a driver that knows it is talking to a VMM, and the protocol between them is designed for that context: large batches, a shared memory ring for communication, and a single notification per batch rather than one per operation. The physical-device illusion disappears; in its place is an explicit contract between the driver and the device backend.
virtio is that contract, standardized. The OASIS virtio Committee Specification v1.2 CS01 (published 1 July 2022) defines the shared-memory ring format, the feature negotiation handshake, three transports (PCI, MMIO, and channel I/O; this chapter covers MMIO and PCI), and the wire protocol for each device class. It is the interface that Firecracker, crosvm, Cloud Hypervisor, and QEMU all implement, which means a Linux guest compiled once can run on any of them without modification, because the driver it loads is the kernel's standard virtio driver, not a VMM-specific one.
The Virtqueue
Every virtio device exposes one or more virtqueues: shared-memory rings
through which the driver (the guest's kernel driver) submits work and the
device (the VMM backend) returns completions. The virtio v1.2 spec defines two
queue formats. The split virtqueue (spec section 2.7) uses three separate
memory regions. The packed virtqueue (spec section 2.8), introduced in
v1.1 and enabled by feature bit VIRTIO_F_RING_PACKED = 34, collapses those
three regions into one circular ring plus two small event-suppression
structures. Firecracker implements split virtqueues only; the packed format is
not supported.
Three Rings, Three Owners
A split virtqueue consists of three physically independent memory regions, each owned exclusively by one side:
Guest Memory
┌──────────────────────────────────────────────────────┐
│ Descriptor Table (16 bytes × Queue Size) │◄── Driver writes
│ Alignment: 16 bytes │ Device reads
├──────────────────────────────────────────────────────┤
│ Available Ring (6 + 2 × Queue Size bytes) │◄── Driver writes
│ Alignment: 2 bytes │ Device reads
├──────────────────────────────────────────────────────┤
│ Used Ring (6 + 8 × Queue Size bytes) │◄── Device writes
│ Alignment: 4 bytes │ Driver reads
└──────────────────────────────────────────────────────┘
The alignment constants are defined in linux/include/uapi/linux/virtio_ring.h
as VRING_DESC_ALIGN_SIZE = 16, VRING_AVAIL_ALIGN_SIZE = 2, and
VRING_USED_ALIGN_SIZE = 4. Queue Size must be a power of two, at least 1,
and at most 32,768 (0x8000). Firecracker caps every queue at 256 entries
(FIRECRACKER_MAX_QUEUE_SIZE = 256).
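The size formulas in the diagram translate directly into arithmetic. The sketch below is illustrative only: the helper names and VIRTQ_MAX_SIZE are hypothetical and do not come from any header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTQ_MAX_SIZE 32768u   /* spec limit for a split virtqueue */

/* Hypothetical helper: true if qsz is a legal split-virtqueue size
 * (non-zero, a power of two, and no larger than 32,768). */
bool virtq_size_valid(uint32_t qsz)
{
    return qsz != 0 && qsz <= VIRTQ_MAX_SIZE && (qsz & (qsz - 1)) == 0;
}

/* Byte sizes of the three regions, using the formulas from the diagram. */
size_t virtq_desc_bytes(uint32_t qsz)
{
    assert(virtq_size_valid(qsz));
    return 16u * (size_t)qsz;
}

size_t virtq_avail_bytes(uint32_t qsz)
{
    assert(virtq_size_valid(qsz));
    return 6u + 2u * (size_t)qsz;
}

size_t virtq_used_bytes(uint32_t qsz)
{
    assert(virtq_size_valid(qsz));
    return 6u + 8u * (size_t)qsz;
}
```

At Firecracker's cap of 256 entries, these work out to 4,096 bytes of descriptors, 518 bytes of available ring, and 2,054 bytes of used ring.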
The ownership rule is strict: the driver never writes to the used ring; the device never writes to the available ring or the descriptor table. This means no memory location ever has two writers. There is still a concurrency hazard, because the guest and the VMM run concurrently and each reads what the other writes, but it is bounded to the index fields, which the spec addresses with explicit memory barrier requirements.
The Descriptor Table
Each entry in the descriptor table is a struct virtq_desc (16 bytes, all
fields in little-endian):
struct virtq_desc {
le64 addr; /* offset 0: guest-physical buffer address */
le32 len; /* offset 8: buffer length in bytes */
le16 flags; /* offset 12: control flags */
le16 next; /* offset 14: index of next descriptor (if chaining) */
};
Three flag bits control how the descriptor is used. VIRTQ_DESC_F_NEXT = 0x1
means the descriptor is not the last in a chain — the next field holds the
index of the next descriptor. VIRTQ_DESC_F_WRITE = 0x2 marks the buffer as
device-writable; without it the buffer is device-readable. VIRTQ_DESC_F_INDIRECT = 0x4
signals that addr and len point not to data but to an in-memory table of
further virtq_desc entries, enabled by feature bit
VIRTIO_F_RING_INDIRECT_DESC = 28. Within an indirect table, only
VIRTQ_DESC_F_WRITE and VIRTQ_DESC_F_NEXT are valid; VIRTQ_DESC_F_INDIRECT
is forbidden in indirect entries, and the device must ignore
VIRTQ_DESC_F_WRITE on the outer descriptor that points to the table.
Descriptors chain together to describe a single I/O request. A virtio-blk read, for example, uses at least three descriptors in a chain: a device-readable header (16 bytes: request type, reserved padding, sector number), one or more device-writable data buffers, and a device-writable one-byte status field. All device-readable descriptors precede all device-writable ones in the chain; this is a hard split-virtqueue rule, not a convention.
The driver builds these chains by filling descriptor table entries, then publishes the chain by placing the head descriptor's index into the available ring.
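As a rough illustration of chain building, the sketch below fills three consecutive descriptor slots with a virtio-blk read chain and then checks the readable-before-writable rule. It makes several simplifying assumptions: fields are host-endian rather than le16/le32/le64, the three slots are assumed free and contiguous, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  0x1u
#define VIRTQ_DESC_F_WRITE 0x2u

/* Host-endian stand-in for struct virtq_desc. */
struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};
static_assert(sizeof(struct virtq_desc) == 16, "descriptor must be 16 bytes");

/* Fill slots head, head+1, head+2 with header -> data -> status.
 * Returns the head index the driver would publish in the available ring. */
uint16_t build_blk_read_chain(struct virtq_desc *table, uint16_t head,
                              uint64_t hdr_gpa, uint64_t data_gpa,
                              uint32_t data_len, uint64_t status_gpa)
{
    table[head] = (struct virtq_desc){
        .addr = hdr_gpa, .len = 16,
        .flags = VIRTQ_DESC_F_NEXT, .next = (uint16_t)(head + 1) };
    table[head + 1] = (struct virtq_desc){
        .addr = data_gpa, .len = data_len,
        .flags = VIRTQ_DESC_F_NEXT | VIRTQ_DESC_F_WRITE,
        .next = (uint16_t)(head + 2) };
    table[head + 2] = (struct virtq_desc){
        .addr = status_gpa, .len = 1,
        .flags = VIRTQ_DESC_F_WRITE, .next = 0 };
    return head;
}

/* Returns 1 if no device-readable descriptor follows a device-writable one.
 * The walk is bounded by the queue size so a circular chain cannot hang it. */
int chain_order_ok(const struct virtq_desc *table, uint16_t head, uint16_t qsz)
{
    int seen_write = 0;
    uint16_t i = head;
    for (uint16_t hops = 0; hops < qsz && i < qsz; hops++) {
        int writable = (table[i].flags & VIRTQ_DESC_F_WRITE) != 0;
        if (seen_write && !writable)
            return 0;
        seen_write |= writable;
        if (!(table[i].flags & VIRTQ_DESC_F_NEXT))
            return 1;
        i = table[i].next;
    }
    return 0;   /* index out of range, or chain longer than the queue */
}
```

A device backend performs a similar bounded walk in the other direction, which is why the ordering rule matters: it can stop looking for input as soon as it sees the first writable descriptor.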
The Available Ring
The available ring is the driver's outbox. Its layout (from spec section 2.7.6):
struct virtq_avail {
le16 flags; /* VIRTQ_AVAIL_F_NO_INTERRUPT = 1 */
le16 idx; /* where driver will write next head index */
le16 ring[/* Queue Size */]; /* head indices of published chains */
le16 used_event; /* only if VIRTIO_F_EVENT_IDX negotiated */
};
The idx field wraps naturally at 2^16. The driver first stores the head
indices in ring[idx % QueueSize] through ring[(idx + n - 1) % QueueSize],
issues a write memory barrier, and only then increments idx by n, the number
of chains it publishes. A second barrier follows before the driver checks
whether it must notify the device. The device reads ring[last_seen_idx % QueueSize]
through ring[(avail.idx - 1) % QueueSize] to collect new chains.
Notice that idx is never reset — it grows monotonically, modulo 2^16. A
device that tracks the last idx it saw can detect new work without any
locking; the index is the only synchronization signal.
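The publish step can be sketched with C11 atomics. This is a minimal model under stated assumptions, not how any particular driver is written: fields are host-endian, the queue size is a fixed example value, used_event is omitted, and a release store on idx stands in for the write barrier. The function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QSZ 8u   /* example queue size; must be a power of two */

/* Host-endian stand-in for struct virtq_avail (used_event omitted). */
struct virtq_avail {
    uint16_t flags;
    _Atomic uint16_t idx;
    uint16_t ring[QSZ];
};

/* Driver side: store n chain heads, then make them visible via idx. */
void avail_publish(struct virtq_avail *av, const uint16_t *heads, uint16_t n)
{
    uint16_t idx = atomic_load_explicit(&av->idx, memory_order_relaxed);
    for (uint16_t i = 0; i < n; i++)
        av->ring[(uint16_t)(idx + i) % QSZ] = heads[i];
    /* Release store: the ring slots become visible before the new idx. */
    atomic_store_explicit(&av->idx, (uint16_t)(idx + n), memory_order_release);
}

/* Device side: chains published since last_seen. The 16-bit subtraction
 * stays correct across the 2^16 wrap. */
uint16_t avail_pending(struct virtq_avail *av, uint16_t last_seen)
{
    uint16_t idx = atomic_load_explicit(&av->idx, memory_order_acquire);
    return (uint16_t)(idx - last_seen);
}
```

Publishing three chains when idx is 65534 leaves idx at 1 and fills slots 6, 7, and 0, which shows the free-running index and the modulo slot selection working together.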
The Used Ring
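The used ring is the device's completion outbox. Its layout (section 2.7.8):
struct virtq_used {
le16 flags; /* VIRTQ_USED_F_NO_NOTIFY = 1 */
le16 idx; /* where device will write next used element */
struct virtq_used_elem ring[/* Queue Size */];
le16 avail_event; /* only if VIRTIO_F_EVENT_IDX negotiated */
};
struct virtq_used_elem {
le32 id; /* head index of the completed chain */
le32 len; /* bytes the device wrote into the chain */
};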
Each virtq_used_elem is 8 bytes. When the device finishes a chain, it writes
the head index and byte count into the current used slot, increments idx,
and — unless notification suppression says otherwise — signals the guest
interrupt. The driver scans from its last-seen idx to used.idx - 1 to
harvest completions.
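The harvest loop on the driver side might look like the following sketch. It assumes host-endian fields, a fixed example queue size, and an acquire load on idx in place of the read barrier; used_harvest is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QSZ 8u   /* example queue size; must be a power of two */

struct virtq_used_elem {
    uint32_t id;    /* head index of the completed chain */
    uint32_t len;   /* bytes the device wrote */
};
static_assert(sizeof(struct virtq_used_elem) == 8, "used element is 8 bytes");

/* Host-endian stand-in for struct virtq_used (avail_event omitted). */
struct virtq_used {
    uint16_t flags;
    _Atomic uint16_t idx;
    struct virtq_used_elem ring[QSZ];
};

/* Copy completions from *last_seen up to used.idx - 1 into out[],
 * advance *last_seen, and return how many were copied. */
uint16_t used_harvest(struct virtq_used *u, uint16_t *last_seen,
                      struct virtq_used_elem *out, uint16_t max)
{
    uint16_t idx = atomic_load_explicit(&u->idx, memory_order_acquire);
    uint16_t n = 0;
    while (*last_seen != idx && n < max) {
        out[n++] = u->ring[*last_seen % QSZ];
        (*last_seen)++;          /* wraps at 2^16, like used.idx */
    }
    return n;
}
```

Comparing with != rather than < is what keeps the loop correct when used.idx wraps past 65535.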
Notification Suppression
Left to themselves, driver and device fire an interrupt or a doorbell write after every descriptor batch. For high-throughput paths, that overhead adds up. virtio provides two suppression mechanisms.
The coarse mechanism uses the binary flags: the driver sets
avail.flags = VIRTQ_AVAIL_F_NO_INTERRUPT to suppress device-to-driver
interrupts; the device sets used.flags = VIRTQ_USED_F_NO_NOTIFY to suppress
driver-to-device kicks. Either side can assert its flag at any time. The
tradeoff is crude — all notifications or none.
The fine-grained mechanism, enabled by VIRTIO_F_RING_EVENT_IDX = 29, uses
threshold fields instead of binary flags. The driver places a target idx
value into avail.used_event; the device fires an interrupt only when
used.idx reaches that value. The device places a target into
used.avail_event; the driver kicks only when avail.idx reaches it. This
lets either side defer a notification precisely until the peer has enough work
queued to justify waking up. Firecracker implements the EVENT_IDX path and
validates the notification-suppression logic and 16-bit index wraparound with
Kani formal proofs.
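The threshold test has to survive 16-bit wraparound, so it is not a plain equality check. The spec gives a wrap-safe comparison, and the Linux virtio_ring.h header carries the same logic as vring_need_event(); the version below restates it in a self-contained form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if moving the index from old_idx to new_idx stepped past event_idx,
 * meaning the peer asked to be notified somewhere in that range.
 * All arithmetic is modulo 2^16. */
bool vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

For example, with the threshold at 5, advancing from 3 to 5 does not trigger a notification (entry 5 has not been written yet), while advancing from 5 to 6 does. The same holds when the indices straddle the wrap from 65535 to 0.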
The virtio-queue Crate
Firecracker's virtqueue implementation lives in the rust-vmm virtio-queue
crate (published at https://crates.io/crates/virtio-queue). The crate
provides two queue types: Queue for single-threaded use and QueueSync
(Arc<Mutex<Queue>>) for shared access, both implementing the QueueT trait.
Key methods include set_desc_table_address, set_avail_ring_address,
set_used_ring_address, set_size, set_ready, set_event_idx, is_valid,
add_used, needs_notification, disable_notification, and
enable_notification. AvailIter is a consuming iterator over available
descriptor chain heads; DescriptorChain with DescriptorChainRwIter
separates readable from writable segments cleanly.
The crate uses Rust read_volatile and write_volatile with explicit
acquire/release memory fences for every ring access, matching the spec's
barrier requirements without relying on the compiler to infer them. A
Time-To-Live counter limits chain traversal depth to prevent infinite loops
from a malicious guest that crafts a circular chain. Used-ring notifications
are batched via prepare_kick() rather than checked after each add_used()
call — the crate documents this as a deliberate deviation from spec section
2.6.7.2. The crate targets the virtio v1.1 split virtqueue spec.
Feature Negotiation
The spec imposes a strict handshake before the device becomes usable. This is the mechanism by which a driver compiled three years ago negotiates with a device model compiled last week: each side publishes what it supports; the intersection is what they use. Neither side assumes the other is current.
The Nine-Step Sequence
Spec section 3.1.1 defines the mandatory initialization sequence. The driver must follow these steps in order:
Features are read and written in two 32-bit pages via DeviceFeaturesSel and
DriverFeaturesSel: page 0 covers bits 0–31, page 1 covers bits 32–63. This
matters in practice because VIRTIO_F_VERSION_1 = 32 sits at bit 0 of page 1.
A device presenting itself as modern must advertise this bit; a driver that
does not acknowledge it is treated as a legacy driver, and a v2 MMIO device
must reject initialization if the driver fails to acknowledge it.
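The paging scheme is easier to see in code. The sketch below models the selector and data registers with an in-memory struct; the struct and function names are hypothetical, and a real driver would go through the transport's register accessors instead.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32

/* Hypothetical in-memory model of DeviceFeaturesSel and DeviceFeatures. */
struct feat_regs {
    uint64_t device_features;   /* everything the device offers */
    uint32_t sel;               /* current DeviceFeaturesSel value */
};

void write_features_sel(struct feat_regs *r, uint32_t page)
{
    r->sel = page;
}

/* Reading DeviceFeatures returns the 32-bit page chosen by the selector. */
uint32_t read_features(const struct feat_regs *r)
{
    if (r->sel > 1)
        return 0;               /* this model has only pages 0 and 1 */
    return (uint32_t)(r->device_features >> (32 * r->sel));
}

/* Driver side: assemble the 64-bit feature word from both pages. */
uint64_t read_all_features(struct feat_regs *r)
{
    write_features_sel(r, 0);
    uint64_t lo = read_features(r);
    write_features_sel(r, 1);
    uint64_t hi = read_features(r);
    return (hi << 32) | lo;
}

/* The negotiated set is the intersection of offer and support. */
uint64_t negotiate(uint64_t device, uint64_t driver)
{
    return device & driver;
}
```

A device offering bits 28, 29, and 32 reads as 0x30000000 on page 0 and 0x1 on page 1; a driver that supports only bits 29 and 32 ends up with exactly those two.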
The six device status register bits (from
linux/include/uapi/linux/virtio_config.h) are the handshake signals:
| Constant | Value | Meaning |
|---|---|---|
| VIRTIO_CONFIG_S_ACKNOWLEDGE | 1 | Driver found the device |
| VIRTIO_CONFIG_S_DRIVER | 2 | Driver knows how to drive it |
| VIRTIO_CONFIG_S_FEATURES_OK | 8 | Feature negotiation complete |
| VIRTIO_CONFIG_S_DRIVER_OK | 4 | Driver is live |
| VIRTIO_CONFIG_S_NEEDS_RESET | 64 | Device needs reset (unrecoverable) |
| VIRTIO_CONFIG_S_FAILED | 128 | Fatal error |
Status starts at 0. The driver must not clear individual bits; only writing 0 resets the register and the device.
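Because bits are only ever added, the status register on the success path takes the cumulative values 1, 3, 11, and 15. The sketch below captures that progression and the no-clearing rule; status_write_ok and status_next are hypothetical helpers, not kernel or Firecracker functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_ACKNOWLEDGE 1u
#define VIRTIO_CONFIG_S_DRIVER      2u
#define VIRTIO_CONFIG_S_DRIVER_OK   4u
#define VIRTIO_CONFIG_S_FEATURES_OK 8u

/* A write is acceptable in this model if it is a reset (0) or keeps
 * every bit that was already set. */
bool status_write_ok(uint8_t old_status, uint8_t new_status)
{
    return new_status == 0 || (new_status & old_status) == old_status;
}

/* Next cumulative value on the success path: ACKNOWLEDGE, then DRIVER,
 * then FEATURES_OK, then DRIVER_OK. */
uint8_t status_next(uint8_t s)
{
    if (!(s & VIRTIO_CONFIG_S_ACKNOWLEDGE))
        return (uint8_t)(s | VIRTIO_CONFIG_S_ACKNOWLEDGE);
    if (!(s & VIRTIO_CONFIG_S_DRIVER))
        return (uint8_t)(s | VIRTIO_CONFIG_S_DRIVER);
    if (!(s & VIRTIO_CONFIG_S_FEATURES_OK))
        return (uint8_t)(s | VIRTIO_CONFIG_S_FEATURES_OK);
    return (uint8_t)(s | VIRTIO_CONFIG_S_DRIVER_OK);
}
```

Note that FEATURES_OK (8) is set before DRIVER_OK (4), so the numeric order of the constants is not the order in which they appear.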
Kernel Implementation
virtio_dev_probe() in drivers/virtio/virtio.c implements steps 2–7: it
sets DRIVER, calls virtio_get_features(), ANDs the device and driver
feature tables, calls dev->config->finalize_features(), sets FEATURES_OK,
and reads back status. If FEATURES_OK is absent, it returns -ENODEV.
virtio_device_ready() sets DRIVER_OK after queue setup completes.
The FEATURES_OK write and read-back are done by the helper
virtio_features_ok() in the same file, which skips them when
VIRTIO_F_VERSION_1 is not in the negotiated set, since a legacy device has no
FEATURES_OK step.
The Transport-Layer Feature Bits
Most of these bits live in the range VIRTIO_TRANSPORT_F_START = 28 through
VIRTIO_TRANSPORT_F_END = 42 and apply to every device type.
VIRTIO_F_ANY_LAYOUT is listed here for completeness — it predates the formal
transport range and sits at bit 27, just outside it.
The Linux kernel uapi headers (virtio_ring.h) name the indirect-descriptor
and event-index bits VIRTIO_RING_F_INDIRECT_DESC and VIRTIO_RING_F_EVENT_IDX;
the OASIS spec uses VIRTIO_F_RING_INDIRECT_DESC and VIRTIO_F_RING_EVENT_IDX
for the same bits (28 and 29). This chapter follows the spec naming.
| Constant (OASIS spec) | Bit | Meaning |
|---|---|---|
| VIRTIO_F_ANY_LAYOUT | 27 | Device handles any descriptor ordering (predates transport range) |
| VIRTIO_F_RING_INDIRECT_DESC | 28 | Indirect descriptor tables |
| VIRTIO_F_RING_EVENT_IDX | 29 | Descriptor-granularity notification suppression |
| VIRTIO_F_VERSION_1 | 32 | Modern device (mandatory for modern devices) |
| VIRTIO_F_ACCESS_PLATFORM | 33 | IOMMU DMA required |
| VIRTIO_F_RING_PACKED | 34 | Packed virtqueue format |
| VIRTIO_F_IN_ORDER | 35 | Buffers used in availability order |
| VIRTIO_F_RING_RESET | 40 | Per-queue reset |
Firecracker advertises VIRTIO_F_VERSION_1 on all its devices and
VIRTIO_F_RING_EVENT_IDX on its net, block, and vsock devices.
VIRTIO_F_RING_PACKED is never advertised because Firecracker
does not implement packed virtqueues.
Config Space Atomicity
Device-specific configuration fields (capacity, MAC address, queue pair count,
and so on) live in a config space region that can be updated at any time — for
example, a network link-state change arriving mid-probe. Spec section 2.5
has the driver read multi-field or wider-than-32-bit config values in a
compare-and-retry loop using the config_generation field (MMIO offset 0x0fc).
The device changes config_generation whenever the configuration changes; if
the driver reads a different value at the end of a read sequence than at the
beginning, it retries.
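A compare-and-retry read of a 64-bit field could be sketched as follows. The cfg_ops accessors and the fake device are hypothetical scaffolding added so the example can run on its own; the fake device changes its value once, right after the first 32-bit read, to force a retry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessors a transport would provide. */
struct cfg_ops {
    uint32_t (*read_generation)(void *dev);
    uint32_t (*read32)(void *dev, unsigned offset);
};

/* Read a 64-bit field as two 32-bit halves, retrying whenever the
 * generation counter differs before and after the reads. */
uint64_t read_config64(const struct cfg_ops *ops, void *dev, unsigned offset)
{
    uint32_t before, after, lo, hi;
    do {
        before = ops->read_generation(dev);
        lo = ops->read32(dev, offset);
        hi = ops->read32(dev, offset + 4);
        after = ops->read_generation(dev);
    } while (before != after);
    return ((uint64_t)hi << 32) | lo;
}

/* Fake device for demonstration only: one 64-bit field at offset 0. */
struct fake_dev {
    uint64_t capacity;
    uint32_t generation;
    int reads;
};

uint32_t fake_gen(void *d)
{
    return ((struct fake_dev *)d)->generation;
}

uint32_t fake_read32(void *d, unsigned offset)
{
    struct fake_dev *f = d;
    uint32_t v = (uint32_t)(f->capacity >> (offset ? 32 : 0));
    if (++f->reads == 1) {              /* update lands mid-read */
        f->capacity = 0x200000000ULL;
        f->generation++;
    }
    return v;
}
```

Without the retry, a value changing from 0x1FFFFFFFF to 0x200000000 between the two halves would be read as 0x2FFFFFFFF, a capacity the device never reported.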
The MMIO Transport
The MMIO transport (spec section 4.2) exposes the device as a flat register window mapped into the guest's physical address space. There is no bus, no enumeration protocol, no capability list — just a base address and an IRQ number that the VMM communicates to the guest out-of-band.
Register Map
All registers are 4 bytes wide, 4-byte-aligned, at fixed offsets from the base
address (from linux/include/uapi/linux/virtio_mmio.h):
| Register | Offset | Dir | Purpose |
|---|---|---|---|
| MagicValue | 0x000 | RO | Must read 0x74726976 ("virt" in LE ASCII) |
| Version | 0x004 | RO | 2 = modern; 1 = legacy |
| DeviceID | 0x008 | RO | virtio device type |
| VendorID | 0x00c | RO | Vendor identifier |
| DeviceFeatures | 0x010 | RO | 32-bit feature page |
| DeviceFeaturesSel | 0x014 | WO | Feature page selector (0 or 1) |
| DriverFeatures | 0x020 | WO | Accepted feature bits |
| DriverFeaturesSel | 0x024 | WO | Driver feature page selector |
| QueueSel | 0x030 | WO | Select active queue (0-indexed) |
| QueueNumMax | 0x034 | RO | Maximum queue size |
| QueueNum | 0x038 | WO | Actual queue size (driver chooses) |
| QueueReady | 0x044 | RW | Write 1 to activate queue |
| QueueNotify | 0x050 | WO | Write queue index to kick device |
| InterruptStatus | 0x060 | RO | Bit 0 = used-buffer; bit 1 = config change |
| InterruptACK | 0x064 | WO | Acknowledge interrupt bits |
| Status | 0x070 | RW | Device status register |
| QueueDescLow | 0x080 | WO | Descriptor Table GPA bits 31:0 |
| QueueDescHigh | 0x084 | WO | Descriptor Table GPA bits 63:32 |
| QueueAvailLow | 0x090 | WO | Available Ring GPA bits 31:0 |
| QueueAvailHigh | 0x094 | WO | Available Ring GPA bits 63:32 |
| QueueUsedLow | 0x0a0 | WO | Used Ring GPA bits 31:0 |
| QueueUsedHigh | 0x0a4 | WO | Used Ring GPA bits 63:32 |
| ConfigGeneration | 0x0fc | RO | Config space atomicity counter |
| Config | 0x100+ | RW | Device-specific config (up to 0xfff) |
The legacy (Version 1) layout adds GuestPageSize at 0x028, QueueAlign at
0x03c, and QueuePFN at 0x040, and collapses the split 64-bit address
pairs into a single page-frame number. Modern drivers do not touch these.
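A driver's first contact with the window is a three-register sanity check. The sketch below models the window as an array of 32-bit words on a little-endian host. The three offset macros match the names in virtio_mmio.h; VIRTIO_MMIO_MAGIC and both functions are hypothetical. Treating DeviceID 0 as "no device here" follows the spec's placeholder-device convention.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_MMIO_MAGIC_VALUE 0x000
#define VIRTIO_MMIO_VERSION     0x004
#define VIRTIO_MMIO_DEVICE_ID   0x008
#define VIRTIO_MMIO_MAGIC       0x74726976u   /* bytes 'v','i','r','t' */

/* Read the 32-bit register at a byte offset from the window base. */
uint32_t mmio_read32(const volatile uint32_t *base, unsigned offset)
{
    return base[offset / 4];
}

/* Returns the device ID (> 0) for a usable modern device, or a negative
 * code: -1 bad magic, -2 not version 2, -3 placeholder (ID 0). */
int virtio_mmio_probe(const volatile uint32_t *base)
{
    if (mmio_read32(base, VIRTIO_MMIO_MAGIC_VALUE) != VIRTIO_MMIO_MAGIC)
        return -1;
    if (mmio_read32(base, VIRTIO_MMIO_VERSION) != 2)
        return -2;
    uint32_t id = mmio_read32(base, VIRTIO_MMIO_DEVICE_ID);
    return id == 0 ? -3 : (int)id;
}
```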
Device Discovery
MMIO has no self-describing discovery mechanism (spec section 4.2.1). The guest
must learn each device's base address and IRQ from the VMM. Linux's
drivers/virtio/virtio_mmio.c driver supports three paths: a device tree node
with compatible = "virtio,mmio", a kernel command-line parameter
virtio_mmio.device=<size>@<baseaddr>:<irq>[:<id>] (requires
CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES), and static platform device registration
in board code.
Firecracker uses the command-line path: it appends one
virtio_mmio.device=... entry per device to the kernel command line at boot,
advertising each device's MMIO window size, base address, and IRQ. An open
issue (#2519) proposes replacing this with a device tree blob passed via the
setup_data boot protocol field, but as of this writing the proposal has not
been implemented.
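Generating one such command-line entry is a single formatted write. The helper below is a hypothetical sketch, and the size, base address, and IRQ in the usage note are example values rather than Firecracker's actual memory layout. It writes the size and base in hexadecimal, which the kernel parameter parser accepts alongside suffixed forms such as 4K.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Format "virtio_mmio.device=<size>@<baseaddr>:<irq>" into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int format_virtio_mmio_param(char *buf, size_t cap,
                             uint64_t size, uint64_t base, unsigned irq)
{
    int n = snprintf(buf, cap,
                     "virtio_mmio.device=0x%" PRIx64 "@0x%" PRIx64 ":%u",
                     size, base, irq);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Calling it with a 0x1000-byte window at 0xd0000000 on IRQ 5 yields virtio_mmio.device=0x1000@0xd0000000:5.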
Firecracker's MMIO Backend
Firecracker's MMIO transport is implemented in
src/vmm/src/devices/virtio/transport/mmio.rs as MmioTransport, which
implements the BusDevice trait. Guest writes to MMIO space cause VM exits;
the VMM dispatches them through BusDevice::read and BusDevice::write.
A few Firecracker-specific constants are worth naming. MMIO_VERSION = 2 is
hardcoded — the device always presents as modern. VENDOR_ID = 0 deviates
from the spec's recommended value of 0x1AF4 (Red Hat, Inc.) and mirrors the
crosvm convention. Reading DeviceFeatures with DeviceFeaturesSel = 1 ORs
in 0x1 unconditionally, so VIRTIO_F_VERSION_1 (bit 32) is always visible
to the driver regardless of what the inner device model advertises.
set_device_status() enforces the spec state machine with a VALID_TRANSITIONS
table; any status write that is not a legal transition logs a warning.
Transition to DRIVER_OK calls locked_device().activate(), which hands
control to the device backend — at that point, the virtqueues are live and
the device can begin processing descriptors.
The guest kernel requires CONFIG_VIRTIO_MMIO=y and
CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y to use Firecracker MMIO devices.
The PCI Transport
The PCI transport (spec section 4.1) is self-describing. The guest scans the
PCI bus, finds devices with Vendor ID 0x1AF4 (Red Hat, Inc.), and walks each
device's PCI capability list to find the five vendor-specific capability
structures that virtio-PCI defines. No out-of-band communication is needed:
the bus itself tells the driver where everything is.
Device IDs
PCI device IDs split into two ranges. Legacy (transitional) devices use IDs
0x1000–0x103F; modern devices use 0x1040 + the virtio device ID, so
virtio-net is 0x1041, virtio-blk is 0x1042, virtio-rng is 0x1044, and
virtio-vsock is 0x1053.
Five Capability Structures
Each capability uses cap_vndr = PCI_CAP_ID_VNDR (identifying it as a
vendor-specific capability) and a cfg_type field that says which of the five
roles it plays (from linux/include/uapi/linux/virtio_pci.h):
| cfg_type | Value | Purpose |
|---|---|---|
| VIRTIO_PCI_CAP_COMMON_CFG | 1 | Common configuration struct (virtio_pci_common_cfg) |
| VIRTIO_PCI_CAP_NOTIFY_CFG | 2 | Queue doorbell addresses |
| VIRTIO_PCI_CAP_ISR_CFG | 3 | Interrupt status byte |
| VIRTIO_PCI_CAP_DEVICE_CFG | 4 | Device-specific configuration |
| VIRTIO_PCI_CAP_PCI_CFG | 5 | Alternative PCI config-space access window |
struct virtio_pci_cap records which BAR holds the region (bar, 0–5), the
byte offset within that BAR (offset), and the region's length (length).
The notification capability also carries notify_off_multiplier; the doorbell
address for queue N is cap.offset + queue_notify_off × notify_off_multiplier.
struct virtio_pci_common_cfg exposes the feature selectors and data fields,
the queue count, the device status register, config_generation, the queue
selector and size, queue_enable, and the split 64-bit GPA fields
queue_desc_lo/hi, queue_avail_lo/hi, and queue_used_lo/hi — a
superset of the MMIO register map, accessed through a memory-mapped struct
rather than individual register offsets.
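The doorbell formula is worth a worked example. The function below computes the byte offset of a queue's doorbell within the BAR named by the notification capability; the function name is hypothetical, and the arithmetic is done in 64 bits so the product cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Offset within the BAR of the doorbell for one queue:
 * cap.offset + queue_notify_off * notify_off_multiplier. */
uint64_t notify_offset_in_bar(uint32_t cap_offset, uint16_t queue_notify_off,
                              uint32_t notify_off_multiplier)
{
    return (uint64_t)cap_offset
         + (uint64_t)queue_notify_off * notify_off_multiplier;
}
```

With cap.offset = 0x3000, queue_notify_off = 2, and a multiplier of 4, the doorbell sits at 0x3008. A multiplier of 0 puts every queue on the same doorbell at 0x3000, in which case the value written identifies the queue.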
Why Firecracker Originally Chose MMIO
PCI bus enumeration, ACPI table parsing, and MSI/MSI-X interrupt wiring each add work to the guest boot path. For Firecracker's original target — the serverless VM that must start in under 150 ms — those milliseconds matter. MMIO requires none of that infrastructure, the device model is simpler, and the command-line discovery mechanism is a handful of string appends.
PCI transport was later added to Firecracker behind --enable-pci. Benchmarks
from the Firecracker team (discussion #4845) show the tradeoff concretely:
block synchronous reads improve by about 50%, block synchronous writes by 46%
(on a 1-vCPU VM), network transmit throughput by 2–11%, and network receive
throughput by 9–17%. Latency drops roughly 27%. The cost is an approximately
8% slower boot on VMs under 4 GiB. The fundamental reason is interrupt
delivery: MMIO uses level-triggered interrupts that require a VM exit per
notification; MSI-X can deliver interrupts without a VMM-side exit. PCI is
the right answer when throughput dominates; MMIO is the right answer when
boot time does.
Enabling PCI mode requires additional guest kernel configuration:
CONFIG_PCI, CONFIG_PCI_MMCONFIG, CONFIG_PCI_MSI, CONFIG_PCIEPORTBUS,
CONFIG_VIRTIO_PCI, CONFIG_BLK_MQ_PCI, CONFIG_PCI_HOST_COMMON, and
CONFIG_PCI_HOST_GENERIC. The guest must not pass pci=off on its command
line.
The Devices That Matter
The five device types Firecracker exposes cover everything a modern serverless workload needs: a network path, a block device, a host-guest socket channel, memory pressure signaling, and entropy. Each is a separate protocol layered on top of the virtqueue machinery.
virtio-net (Device ID 1)
The network device presents the guest with an Ethernet interface backed by a TAP device on the host.
TAP setup. Firecracker opens /dev/net/tun with O_RDWR | O_NONBLOCK |
O_CLOEXEC and calls TUNSETIFF (_IOW('T', 202, int)) with three flags:
IFF_TAP = 0x0002 (Ethernet frames, not raw IP), IFF_NO_PI = 0x1000 (do
not prepend the four-byte struct tun_pi packet info header), and
IFF_VNET_HDR = 0x4000 (prepend or strip a virtio_net_hdr on each frame).
Two more ioctls complete the setup: TUNSETOFFLOAD (_IOW('T', 208, unsigned int))
advertises which checksum offloads the tap device can handle, and
TUNSETVNETHDRSZ (_IOW('T', 216, int)) tells the kernel to use the 12-byte
virtio_net_hdr_v1 format rather than the legacy 10-byte form. The interface
name is a 16-byte array matching IFNAMSIZ. These constants and structures are
defined in linux/include/uapi/linux/if_tun.h.
Before running commands that open /dev/net/tun or manage TAP devices, the process needs either CAP_NET_ADMIN or a pre-created TAP interface. On a production host Firecracker relies on the jailer to set up the TAP before dropping privileges; on a development machine, sudo ip tuntap add dev tap0 mode tap creates one manually.
The virtio-net header. Every frame crossing the TAP/virtqueue boundary
carries a virtio_net_hdr_v1 (12 bytes) that describes the offload state of
the packet (defined in linux/include/uapi/linux/virtio_net.h):
| Offset | Field | Notes |
|---|---|---|
| 0 | flags | VIRTIO_NET_HDR_F_NEEDS_CSUM = 1 |
| 1 | gso_type | NONE=0, TCPV4=1, UDP=3, TCPV6=4, ECN flag=0x80 |
| 2–3 | hdr_len | Total L2+L3+L4 header length |
| 4–5 | gso_size | Desired MSS for segmentation |
| 6–7 | csum_start | Byte offset where checksum computation begins |
| 8–9 | csum_offset | Offset from csum_start to place the checksum |
| 10–11 | num_buffers | Merged receive buffer count (if VIRTIO_NET_F_MRG_RXBUF) |
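The table maps onto a C struct with no padding, which compile-time assertions can confirm. In this sketch the fields are host-endian stand-ins for the little-endian wire format, and csum_field_offset is a hypothetical helper showing how csum_start and csum_offset combine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1

/* Host-endian stand-in for the 12-byte virtio_net_hdr_v1. */
struct virtio_net_hdr_v1 {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;
};
static_assert(sizeof(struct virtio_net_hdr_v1) == 12, "header is 12 bytes");
static_assert(offsetof(struct virtio_net_hdr_v1, csum_start) == 6,
              "csum_start at offset 6");
static_assert(offsetof(struct virtio_net_hdr_v1, num_buffers) == 10,
              "num_buffers at offset 10");

/* Byte position in the frame where the finished checksum is stored. */
uint32_t csum_field_offset(const struct virtio_net_hdr_v1 *h)
{
    return (uint32_t)h->csum_start + h->csum_offset;
}
```

For a TCP segment over IPv4 with no IP options, csum_start is 34 (14 bytes of Ethernet plus 20 of IP) and csum_offset is 16 (the checksum field's position in the TCP header), so the checksum lands at byte 50 of the frame.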
Queues. Firecracker implements exactly two virtqueues: RX_INDEX = 0 and
TX_INDEX = 1, each capped at NET_QUEUE_MAX_SIZE = 256 descriptors. The
spec allows a multi-queue extension (VIRTIO_NET_F_MQ = 22) with one transmit
and one receive queue per CPU, but Firecracker does not implement it; each
virtio-net device has a single queue pair. MAX_BUFFER_SIZE = 65562 bytes
(64 KiB plus the virtio-net header overhead) is the largest receive buffer
the device will accept.
Feature bits Firecracker advertises. From linux/include/uapi/linux/virtio_net.h,
Firecracker sets: VIRTIO_NET_F_CSUM (0), VIRTIO_NET_F_GUEST_CSUM (1),
VIRTIO_NET_F_GUEST_TSO4 (7), VIRTIO_NET_F_GUEST_TSO6 (8),
VIRTIO_NET_F_GUEST_UFO (10), VIRTIO_NET_F_HOST_TSO4 (11),
VIRTIO_NET_F_HOST_TSO6 (12), VIRTIO_NET_F_HOST_UFO (14),
VIRTIO_NET_F_MRG_RXBUF (15), plus VIRTIO_F_RING_EVENT_IDX (29) and
VIRTIO_F_VERSION_1 (32). VIRTIO_NET_F_MAC (5) is added when a MAC address
is configured; VIRTIO_NET_F_MTU (3) when an MTU override is set.
The config space struct (virtio_net_config) carries MAC (6 bytes), status
(2 bytes), max_virtqueue_pairs (2 bytes), MTU (2 bytes), speed in Mbps (4
bytes), and duplex (1 byte).
virtio-blk (Device ID 2)
The block device exposes a single virtqueue to the guest. Firecracker
implements BLOCK_NUM_QUEUES = 1, sized to 256 descriptors. IO_URING_NUM_ENTRIES
is 128, half the queue depth: a block request spans at least two descriptors
(a header and a status byte, typically with a data buffer between them), so a
256-descriptor queue holds at most 128 requests at once, and a 128-entry
io_uring can absorb all of them.
The sector model. SECTOR_SIZE = 512 bytes (1 << 9). The capacity field
in the config struct is a u64 reporting the total sector count. A guest
request for sector N is served at byte offset N × 512 from the start of the
backing file or device.
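A backend has to turn the sector number into a byte offset and reject requests that run past the end of the disk. The sketch below shows one way to do that without overflow; sector_to_offset is a hypothetical name, not a Firecracker function.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
#define SECTOR_SIZE  (1u << SECTOR_SHIFT)   /* 512 bytes */

/* Store the byte offset of `sector` in *offset and return 0, or return -1
 * if a transfer of len_bytes starting there would exceed the capacity. */
int sector_to_offset(uint64_t sector, uint32_t len_bytes,
                     uint64_t capacity_sectors, uint64_t *offset)
{
    /* Sectors touched by the transfer, rounding a partial sector up. */
    uint64_t nsec = ((uint64_t)len_bytes + SECTOR_SIZE - 1) >> SECTOR_SHIFT;

    if (sector > capacity_sectors || nsec > capacity_sectors - sector)
        return -1;
    if (sector > (UINT64_MAX >> SECTOR_SHIFT))
        return -1;                           /* shift would overflow */
    *offset = sector << SECTOR_SHIFT;
    return 0;
}
```

On a 16-sector disk, a 4,096-byte read at sector 8 maps to byte offset 4,096 and just fits; the same read at sector 9 is rejected.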
Request layout. Each I/O request is a chain with three parts:
- A 16-byte device-readable header: type (u32), reserved (u32), sector (u64).
- One or more data buffers: device-readable for writes, device-writable for reads.
- A one-byte device-writable status field: VIRTIO_BLK_S_OK = 0, VIRTIO_BLK_S_IOERR = 1, or VIRTIO_BLK_S_UNSUPP = 2.
The type field in the header selects the operation: VIRTIO_BLK_T_IN = 0
(read), VIRTIO_BLK_T_OUT = 1 (write), VIRTIO_BLK_T_FLUSH = 4 (cache
flush), VIRTIO_BLK_T_GET_ID = 8 (identify device: returns a 20-byte
ASCII string). All defined in linux/include/uapi/linux/virtio_blk.h.
Feature bits Firecracker advertises. VIRTIO_F_VERSION_1 (32) and
VIRTIO_F_RING_EVENT_IDX (29) always. VIRTIO_BLK_F_FLUSH (9) when the
backing disk is in writeback-cache mode. VIRTIO_BLK_F_RO (5) when the
disk is read-only.
virtio-vsock (Device ID 19)
vsock gives the guest and host a socket channel without a network interface.
The guest opens a socket with socket(AF_VSOCK, SOCK_STREAM, 0) and
addresses the host by its well-known CID. This is the channel Firecracker
uses for its API proxy feature and for guest agent communication in richer
microVM platforms.
The address family AF_VSOCK was introduced in Linux 3.9 for VMware VMCI; the virtio transport followed in Linux 4.8. Each endpoint is
addressed by a (CID, port) pair. Reserved CIDs: VMADDR_CID_HYPERVISOR = 0,
VMADDR_CID_LOCAL = 1, VMADDR_CID_HOST = 2, VMADDR_CID_ANY = 0xFFFFFFFF.
Firecracker sets VSOCK_HOST_CID = 2 for the host-side endpoint.
Queues. Firecracker implements three queues (VSOCK_NUM_QUEUES = 3): RXQ
(index 0) for data from host to guest, TXQ (index 1) for data from guest to
host, and EVQ (index 2) for event messages. All three are sized to 256
descriptors. Each descriptor chain encodes exactly one vsock packet: a 44-byte
header followed by an optional payload up to MAX_PKT_BUF_SIZE = 65536 bytes.
The header. virtio_vsock_hdr is 44 bytes, packed:
| Offset | Field | Type | Notes |
|---|---|---|---|
| 0–7 | src_cid | le64 | Source context ID |
| 8–15 | dst_cid | le64 | Destination context ID |
| 16–19 | src_port | le32 | Source port |
| 20–23 | dst_port | le32 | Destination port |
| 24–27 | len | le32 | Payload byte count |
| 28–29 | type | le16 | Socket type (1=STREAM, 2=SEQPACKET) |
| 30–31 | op | le16 | Operation code |
| 32–35 | flags | le32 | Operation-specific flags |
| 36–39 | buf_alloc | le32 | Receiver buffer allocation (flow control) |
| 40–43 | fwd_cnt | le32 | Bytes consumed by receiver (flow control) |
The op field drives the connection state machine. VIRTIO_VSOCK_OP_REQUEST = 1
initiates a connection; VIRTIO_VSOCK_OP_RESPONSE = 2 accepts it;
VIRTIO_VSOCK_OP_RST = 3 rejects or aborts; VIRTIO_VSOCK_OP_SHUTDOWN = 4
begins a graceful close; VIRTIO_VSOCK_OP_RW = 5 carries data;
VIRTIO_VSOCK_OP_CREDIT_UPDATE = 6 and VIRTIO_VSOCK_OP_CREDIT_REQUEST = 7
implement receive-window flow control through buf_alloc and fwd_cnt in
the header — the receiver advertises available buffer space, and the sender
tracks how much of it it has consumed.
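The credit arithmetic follows the formula in the spec: the peer's free space is its buf_alloc minus the bytes sent but not yet forwarded. The sketch below keeps the three counters a sender needs; the struct and function names are hypothetical. The counters are free-running 32-bit values, so unsigned subtraction gives the right answer across wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Sender-side view of one connection's flow-control state. */
struct vsock_credit {
    uint32_t peer_buf_alloc;   /* last buf_alloc received from the peer */
    uint32_t peer_fwd_cnt;     /* last fwd_cnt received from the peer */
    uint32_t tx_cnt;           /* total payload bytes sent so far */
};

/* Bytes the peer can still accept:
 * peer_buf_alloc - (tx_cnt - peer_fwd_cnt), all modulo 2^32. */
uint32_t vsock_peer_free(const struct vsock_credit *c)
{
    return c->peer_buf_alloc - (c->tx_cnt - c->peer_fwd_cnt);
}

/* Largest payload the sender may put in the next OP_RW packet. */
uint32_t vsock_clamp_send(const struct vsock_credit *c, uint32_t want)
{
    uint32_t avail = vsock_peer_free(c);
    return want < avail ? want : avail;
}
```

With a 65,536-byte peer buffer, 70,000 bytes sent, and 10,000 acknowledged as forwarded, 60,000 bytes are in flight and 5,536 more can be sent before the sender must wait for a credit update.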
Feature bits. Firecracker advertises VIRTIO_F_VERSION_1 (32),
VIRTIO_F_IN_ORDER (35), and VIRTIO_F_RING_EVENT_IDX (29), combined as
AVAIL_FEATURES = (1 << 32) | (1 << 35) | (1 << 29). The device-specific
feature VIRTIO_VSOCK_F_SEQPACKET = 1 (SOCK_SEQPACKET support) is not
advertised.
virtio-balloon (Device ID 5)
The balloon device lets the host reclaim memory from a running guest without
stopping it. The host writes a target page count into the num_pages field of
the config struct; the guest driver inflates the balloon: it allocates that
many 4 KiB pages, reports them to the host, and the host calls
madvise(MADV_DONTNEED) on the host memory backing those guest-physical ranges.
The guest kernel treats the pages as in use and leaves them alone; the host OS
reclaims the physical pages. When the host writes a smaller target, the guest
deflates: it releases the pinned pages back to its own allocator and the host
stops discarding them. Re-access by the guest returns zero-filled pages on demand.
This is the mechanism by which Firecracker supports memory overcommit: a fleet of microVMs can collectively commit more memory than the host has, and the balloon keeps actual physical usage within bounds.
Config struct. virtio_balloon_config (from
linux/include/uapi/linux/virtio_balloon.h):
| Field | Type | Meaning |
|---|---|---|
| num_pages | le32 | Host-requested balloon size in 4 KiB pages |
| actual | le32 | Current balloon size in 4 KiB pages |
| free_page_hint_cmd_id | le32 | Command ID for free page hinting |
| poison_val | le32 | Page poison value |
All balloon accounting is in 4 KiB pages; 256 pages equals 1 MiB. The host
sets num_pages; the guest updates actual as it completes inflation or
deflation.
Queues. The inflate queue (index 0) and deflate queue (index 1) are always
present, sized to 128 descriptors in Firecracker. Additional queues appear
conditionally: the stats queue (index 2) when VIRTIO_BALLOON_F_STATS_VQ
is negotiated, a free-page-hinting queue (index 3) when
VIRTIO_BALLOON_F_FREE_PAGE_HINT is set, and a page-reporting queue (index 4)
when VIRTIO_BALLOON_F_REPORTING is set.
Feature bits Firecracker advertises. VIRTIO_F_VERSION_1 (32),
VIRTIO_BALLOON_F_DEFLATE_ON_OOM (2) (the guest deflates the balloon on its own
when it runs out of memory), VIRTIO_BALLOON_F_STATS_VQ (1), VIRTIO_BALLOON_F_FREE_PAGE_HINT (3),
and VIRTIO_BALLOON_F_REPORTING (5).
Statistics. The stats queue carries tagged 10-byte entries, each a packed u16
tag followed by a u64 value. The defined tags include swap-in and swap-out counts
(VIRTIO_BALLOON_S_SWAP_IN = 0, VIRTIO_BALLOON_S_SWAP_OUT = 1), page fault
counts (VIRTIO_BALLOON_S_MAJFLT = 2, VIRTIO_BALLOON_S_MINFLT = 3), free
and total memory (VIRTIO_BALLOON_S_MEMFREE = 4, VIRTIO_BALLOON_S_MEMTOT = 5),
available memory (VIRTIO_BALLOON_S_AVAIL = 6), page cache size
(VIRTIO_BALLOON_S_CACHES = 7), and OOM kill count
(VIRTIO_BALLOON_S_OOM_KILL = 10). The statistics polling interval is
configurable in Firecracker; setting it to 0 disables polling. The guest
kernel requires CONFIG_MEMORY_BALLOON=y and CONFIG_VIRTIO_BALLOON=y.
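A packed u16 tag followed by a u64 value occupies 10 bytes, so a stats buffer is best parsed byte by byte rather than through an unpacked struct. The sketch below looks up one tag in such a buffer; balloon_stat_find and BALLOON_STAT_SIZE are hypothetical names, and the decoding assumes the little-endian layout of a modern device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BALLOON_S_MEMFREE 4
#define VIRTIO_BALLOON_S_MEMTOT  5
#define BALLOON_STAT_SIZE 10    /* le16 tag + le64 value, no padding */

/* Scan buf for an entry with the given tag. On success store its value
 * in *val and return 1; return 0 if the tag is not present. */
int balloon_stat_find(const uint8_t *buf, size_t len,
                      uint16_t tag, uint64_t *val)
{
    for (size_t off = 0; off + BALLOON_STAT_SIZE <= len;
         off += BALLOON_STAT_SIZE) {
        uint16_t t = (uint16_t)(buf[off] | (buf[off + 1] << 8));
        if (t != tag)
            continue;
        uint64_t v = 0;
        for (int i = 7; i >= 0; i--)         /* little-endian decode */
            v = (v << 8) | buf[off + 2 + i];
        *val = v;
        return 1;
    }
    return 0;
}
```

Decoding by hand also sidesteps alignment: the u64 value starts at offset 2 of each entry, which is not 8-byte aligned.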
virtio-rng (Device ID 4)
The entropy device is the simplest virtio device in the spec: linux/include/uapi/linux/virtio_rng.h
contains no device-specific feature bits — it includes only virtio_ids.h
and virtio_config.h. There is no device-specific config space. The entire
protocol fits in a paragraph.
Firecracker exposes one queue (RNG_NUM_QUEUES = 1) and advertises only
VIRTIO_F_VERSION_1 (32). The queue direction is device-writable only: the
guest posts write-only descriptors pointing to buffers it wants filled with
entropy, and the device fills them. The guest never sends data to the device.
MAX_ENTROPY_BYTES = 65536 (64 KiB) is the cap on bytes served per request,
preventing host memory exhaustion from a malicious guest that crafts
overlapping descriptor chains pointing to enormous ranges. Firecracker draws
entropy from aws_lc_rs::rand (the AWS LibCrypto Rust bindings), not from
/dev/random or getrandom() directly. Rate limiting is available via the
Firecracker API, with independent controls for bytes-per-second and
operations-per-second.
The simplicity is the point. A random number device has no protocol state, no connection setup, no flow control, and no error conditions beyond buffer exhaustion. It is what virtio looks like when nothing is left to remove.
Wiring It Together
flowchart TB
gk["Guest Kernel Driver"]
vq["Virtqueue<br/>(desc / avail / used rings)"]
mmio["virtio-MMIO or PCI Transport"]
vmm["VMM Device Backend<br/>(Firecracker)"]
tap["TAP /dev/net/tun"]
blkfile["Block backing file"]
vsockunix["Unix socket<br/>(vsock muxer)"]
awslc["aws_lc_rs::rand"]
gk -->|"write head idx to avail.ring,\nkick QueueNotify"| vq
vq -->|"VM exit on register write"| mmio
mmio -->|"dispatch to activate()d device"| vmm
vmm -->|"net: read/write virtio_net_hdr + frame"| tap
vmm -->|"blk: read/write 512-byte sectors"| blkfile
vmm -->|"vsock: virtio_vsock_hdr + payload"| vsockunix
vmm -->|"rng: fill entropy bytes"| awslc
vmm -->|"write id+len to used.ring,\ninterrupt guest"| vq
vq -->|"driver polls used.idx"| gk
Every device Firecracker ships uses the same queue depth (256), the same
notification suppression path (VIRTIO_F_RING_EVENT_IDX), and the same memory
fence discipline (read_volatile/write_volatile with acquire/release
barriers). The per-device protocol differences are entirely in the descriptor
chain layout and the config space struct — the queue machinery is shared.
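Two pieces of that shared machinery are pure index arithmetic and can be shown in isolation. The sketch below assumes the queue depth of 256 stated above. `need_event` transcribes the `vring_need_event` check from the virtio spec and Linux's virtio_ring.h, while `ring_slot` is a hypothetical helper name. Memory ordering is deliberately left out, so this covers only the wraparound logic.

```rust
/// Queue depth quoted in the text.
const QUEUE_SIZE: u16 = 256;

/// Ring slot for a free-running 16-bit index.
/// The index itself is never reduced; only the slot lookup wraps.
fn ring_slot(idx: u16) -> usize {
    (idx % QUEUE_SIZE) as usize
}

/// Event-index check used with VIRTIO_F_RING_EVENT_IDX.
/// Returns true when `event_idx` falls inside the window of entries
/// published by advancing the index from `old_idx` to `new_idx`,
/// meaning the other side asked to be notified for one of them.
/// Wrapping subtraction keeps the comparison correct across u16 overflow.
fn need_event(event_idx: u16, new_idx: u16, old_idx: u16) -> bool {
    new_idx.wrapping_sub(event_idx).wrapping_sub(1) < new_idx.wrapping_sub(old_idx)
}
```

For example, `need_event(5, 6, 5)` is true because the entry at index 5 was just published and the peer asked for a notification at 5, while `need_event(10, 6, 5)` is false because the peer does not want to hear about anything before index 10. Because the indices are free-running, the same check works unchanged when the index wraps from 65535 to 0.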
Sources And Further Reading
- OASIS virtio Committee Specification v1.2, CS01, 1 July 2022 (HTML): https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html
- OASIS virtio v1.2 CS01 (PDF): https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.pdf
- oasis-tcs/virtio-spec canonical C headers (virtio-queue.h): https://github.com/oasis-tcs/virtio-spec/blob/master/virtio-queue.h
- Linux UAPI linux/virtio_ring.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_ring.h
- Linux UAPI linux/virtio_config.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_config.h
- Linux UAPI linux/virtio_mmio.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_mmio.h
- Linux UAPI linux/virtio_pci.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_pci.h
- Linux UAPI linux/virtio_ids.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_ids.h
- Linux UAPI linux/virtio_net.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_net.h
- Linux UAPI linux/virtio_blk.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_blk.h
- Linux UAPI linux/virtio_vsock.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_vsock.h
- Linux UAPI linux/virtio_balloon.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_balloon.h
- Linux UAPI linux/virtio_rng.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_rng.h
- Linux UAPI linux/if_tun.h: https://github.com/torvalds/linux/blob/master/include/uapi/linux/if_tun.h
- Linux drivers/virtio/virtio.c (feature negotiation implementation): https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio.c
- Linux drivers/virtio/virtio_mmio.c: https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_mmio.c
- Linux kernel virtio driver API documentation: https://docs.kernel.org/driver-api/virtio/virtio.html
- vsock(7) man page: https://man7.org/linux/man-pages/man7/vsock.7.html
- rust-vmm virtio-queue crate: https://crates.io/crates/virtio-queue
- rust-vmm virtio-queue README: https://github.com/rust-vmm/vm-virtio/blob/main/virtio-queue/README.md
- Firecracker src/vmm/src/devices/virtio/transport/mmio.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/transport/mmio.rs
- Firecracker src/vmm/src/devices/virtio/net/device.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/net/device.rs
- Firecracker src/vmm/src/devices/virtio/net/tap.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/net/tap.rs
- Firecracker src/vmm/src/devices/virtio/block/virtio/device.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/block/virtio/device.rs
- Firecracker src/vmm/src/devices/virtio/vsock/device.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/vsock/device.rs
- Firecracker src/vmm/src/devices/virtio/balloon/device.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/balloon/device.rs
- Firecracker src/vmm/src/devices/virtio/rng/device.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/rng/device.rs
- Firecracker ballooning documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/ballooning.md
- Firecracker network setup documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/network-setup.md
- Firecracker issue #2519 (virtio-mmio device tree): https://github.com/firecracker-microvm/firecracker/issues/2519
- Firecracker PCI performance discussion #4845: https://github.com/firecracker-microvm/firecracker/discussions/4845
- Kani formal verification of Firecracker virtio queue: https://model-checking.github.io/kani-verifier-blog/2022/07/13/using-the-kani-rust-verifier-on-a-firecracker-example.html