Chapter 21: Host Networking For MicroVMs
When Firecracker boots a guest kernel, the guest is completely isolated from the host network: no shared namespace, no injected interface, no route added by the runtime. The question is how traffic gets in and out. A container reaches its bridge through a veth pair the runtime handed to a CNI plugin; a microVM has its own kernel, so there is nothing to share — every byte that crosses the isolation boundary must pass through a file descriptor the VMM process owns. That file descriptor is the TAP device, and it is the whole story.
The TAP Device and the VMM File Descriptor
A TAP device (network TAP) is a kernel virtual Ethernet interface whose packet stream is exposed to a userspace process through a file descriptor on /dev/net/tun (character device major 10, minor 200). The kernel document at Documentation/networking/tuntap.rst describes two variants: TUN at layer 3 (raw IP datagrams) and TAP at layer 2 (Ethernet frames including MAC header). Firecracker uses TAP because virtio-net presents Ethernet semantics to the guest: the guest's driver sees a virtual NIC with a MAC address, sends and receives full Ethernet frames, and never knows those frames are travelling through a file descriptor in the VMM process on the other side.
Opening the device requires CAP_NET_ADMIN. Firecracker opens /dev/net/tun with:
libc::open(c"/dev/net/tun".as_ptr(), libc::O_RDWR | libc::O_NONBLOCK | libc::O_CLOEXEC)
Each flag matters. O_RDWR gives bidirectional packet access: the same fd both dequeues incoming frames and injects outgoing ones. O_NONBLOCK makes read and write return EAGAIN rather than blocking, which is essential for Firecracker's epoll-driven event loop: the I/O thread can poll multiple event sources without getting stuck in a slow system call. O_CLOEXEC closes the fd automatically on any exec, so it cannot leak into another program image. Because a non-persistent TAP interface exists for as long as any fd referencing it stays open, a leaked fd could keep the interface alive after the microVM is gone.
After open, Firecracker calls ioctl(fd, TUNSETIFF, &ifreq). TUNSETIFF is defined in include/uapi/linux/if_tun.h as _IOW('T', 202, int) — type 'T' (decimal 84), sequence 202. The ifreq.ifr_name field sets the interface name (maximum IFNAMSIZ = 16 bytes including the NUL terminator); an empty name lets the kernel assign the next available tapN. Firecracker's generated bindings (src/vmm/src/devices/virtio/net/generated/if_tun.rs) set three flags in ifreq.ifr_flags:
| Flag | Value | Effect |
|---|---|---|
| IFF_TAP | 2 | Layer-2 (Ethernet) mode |
| IFF_NO_PI | 4096 | Suppress the 4-byte struct tun_pi protocol info header |
| IFF_VNET_HDR | 16384 | Prepend/consume a virtio_net_hdr on each frame |
IFF_NO_PI removes a 4-byte prefix the kernel otherwise adds to every read; it carries flags and the EtherType, which the Ethernet header already provides. IFF_VNET_HDR is the flag that does the real work: it tells the kernel to expect a virtio_net_hdr structure at the front of every frame written through the fd, and to prepend one to every frame read from it. That header carries checksum-offload and segmentation-offload metadata between the guest driver and the host network stack, allowing large sends to pass through the VMM without being segmented in VMM userspace.
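The ioctl number and flag word described above can be reproduced by hand. The following is a minimal sketch, assuming the common Linux ioctl encoding used on x86-64 and aarch64 and a 40-byte struct ifreq on 64-bit hosts; the helper names `iow` and `build_ifreq` are hypothetical and are not part of Firecracker.

```python
import struct

# Bit positions of the ioctl request encoding on x86-64 and aarch64
# (assumption: other architectures may use different field widths).
_IOC_NRSHIFT, _IOC_TYPESHIFT, _IOC_SIZESHIFT, _IOC_DIRSHIFT = 0, 8, 16, 30
_IOC_WRITE = 1  # userspace writes, kernel reads

IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFF_VNET_HDR = 0x4000


def iow(type_char: str, nr: int, size: int) -> int:
    """Hypothetical Python equivalent of the kernel macro _IOW(type, nr, T)."""
    return (
        (_IOC_WRITE << _IOC_DIRSHIFT)
        | (size << _IOC_SIZESHIFT)
        | (ord(type_char) << _IOC_TYPESHIFT)
        | (nr << _IOC_NRSHIFT)
    )


# _IOW('T', 202, int): the size field is sizeof(int).
TUNSETIFF = iow("T", 202, struct.calcsize("i"))


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack the part of struct ifreq that TUNSETIFF reads: name, then flags."""
    raw = name.encode("ascii")
    if len(raw) > 15:  # IFNAMSIZ is 16 including the NUL terminator
        raise ValueError("interface name too long")
    # 16-byte name, 16-bit flags, zero padding up to 40 bytes.
    return struct.pack("16sH22x", raw, flags)


TAP_FLAGS = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR
```

Passing `build_ifreq("tap0", TAP_FLAGS)` to `fcntl.ioctl` on an open /dev/net/tun fd would be the next step; that call needs CAP_NET_ADMIN and is left out of the sketch.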
Two more ioctls follow during device activation. TUNSETOFFLOAD (_IOW('T', 208, unsigned int)) accepts a TUN_F_* bitmask derived from the virtio features the guest negotiated: TUN_F_CSUM (0x01), TUN_F_TSO4 (0x02), TUN_F_TSO6 (0x04), TUN_F_TSO_ECN (0x08), and TUN_F_UFO (0x10). TUNSETVNETHDRSZ (_IOW('T', 216, int)) is called with 12, the size of virtio_net_hdr_v1, telling the kernel exactly how many bytes of virtio header to expect on each transfer.
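To make the derivation of the TUNSETOFFLOAD bitmask concrete, here is a small sketch. The TUN_F_* values come from the text; the virtio feature bit positions are the ones defined in the virtio specification. The one-to-one mapping table and the function name `tap_offload_mask` are assumptions for illustration, and Firecracker's own mapping may differ in detail.

```python
# TUN_F_* values as listed in the text (include/uapi/linux/if_tun.h).
TUN_F_CSUM = 0x01
TUN_F_TSO4 = 0x02
TUN_F_TSO6 = 0x04
TUN_F_TSO_ECN = 0x08
TUN_F_UFO = 0x10

# Guest offload feature bit positions from the virtio specification.
VIRTIO_NET_F_GUEST_CSUM = 1
VIRTIO_NET_F_GUEST_TSO4 = 7
VIRTIO_NET_F_GUEST_TSO6 = 8
VIRTIO_NET_F_GUEST_ECN = 9
VIRTIO_NET_F_GUEST_UFO = 10

# Assumed mapping: each negotiated guest offload enables the matching TAP offload.
_FEATURE_TO_TUN = {
    VIRTIO_NET_F_GUEST_CSUM: TUN_F_CSUM,
    VIRTIO_NET_F_GUEST_TSO4: TUN_F_TSO4,
    VIRTIO_NET_F_GUEST_TSO6: TUN_F_TSO6,
    VIRTIO_NET_F_GUEST_ECN: TUN_F_TSO_ECN,
    VIRTIO_NET_F_GUEST_UFO: TUN_F_UFO,
}


def tap_offload_mask(acked_features: int) -> int:
    """Translate a negotiated virtio feature word into a TUN_F_* bitmask."""
    mask = 0
    for bit, tun_flag in _FEATURE_TO_TUN.items():
        if acked_features & (1 << bit):
            mask |= tun_flag
    return mask
```

A guest that negotiated only checksum offload and TSO4 would yield `TUN_F_CSUM | TUN_F_TSO4`, and a guest that negotiated none of these features would yield zero, which turns all TAP offloads off.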
Firecracker never calls TUNSETPERSIST. The TAP device therefore lives only as long as the fd is open — when the Tap struct drops, the fd closes, and the kernel removes the interface. Device lifetime is bound to the VMM process with no cleanup step required.
The VMM's Use of the TAP fd in the Event Loop
The Tap struct in src/vmm/src/devices/virtio/net/tap.rs wraps a single File that owns the fd. At runtime, the net device registers five event sources with Firecracker's epoll manager:
| Token | Source | Purpose |
|---|---|---|
| PROCESS_VIRTQ_RX | RX virtqueue eventfd | Guest driver has posted receive buffers |
| PROCESS_VIRTQ_TX | TX virtqueue eventfd | Guest driver has posted transmit buffers |
| PROCESS_TAP_RX | TAP fd (EPOLLIN, edge-triggered) | Frame has arrived from the host |
| PROCESS_RX_RATE_LIMITER | RateLimiter timerfd | RX token bucket has refilled |
| PROCESS_TX_RATE_LIMITER | RateLimiter timerfd | TX token bucket has refilled |
EPOLLIN on the TAP fd triggers process_rx() — the handler reads a frame from the host network stack and writes it into the guest's receive ring. The other direction is driven not by the TAP fd but by an eventfd: when the guest driver writes the virtio MMIO notify register, KVM translates that write into an eventfd notification (PROCESS_VIRTQ_TX), and the VMM dequeues the transmit buffer, runs it through the rate limiter, and calls libc::writev() on the TAP fd to inject the Ethernet frame into the kernel.
This asymmetry is worth pausing on. The guest signals the VMM through a KVM eventfd bound to the MMIO address (the standard virtio/KVM notification path); the host signals the VMM through the TAP fd being readable. Two distinct kernel mechanisms, unified by the same epoll loop.
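The shape of such a loop can be sketched with Python's `select.epoll` on Linux. This is an illustration only: two pipes stand in for the TX virtqueue eventfd and the TAP fd, the token strings are borrowed from the table above, and the helper names are hypothetical.

```python
import os
import select

PROCESS_VIRTQ_TX = "PROCESS_VIRTQ_TX"
PROCESS_TAP_RX = "PROCESS_TAP_RX"


def poll_once(epoll, tokens):
    """Return the tokens of all ready fds without blocking."""
    return [tokens[fd] for fd, _events in epoll.poll(0)]


def demo():
    # Pipes are stand-ins: tx_r plays the virtqueue eventfd, tap_r the TAP fd.
    tx_r, tx_w = os.pipe()
    tap_r, tap_w = os.pipe()
    epoll = select.epoll()
    try:
        epoll.register(tx_r, select.EPOLLIN)
        # The TAP source is registered edge-triggered, as in the table.
        epoll.register(tap_r, select.EPOLLIN | select.EPOLLET)
        tokens = {tx_r: PROCESS_VIRTQ_TX, tap_r: PROCESS_TAP_RX}

        os.write(tap_w, b"frame")   # the "host" delivers a frame
        first = poll_once(epoll, tokens)

        os.write(tx_w, b"\x01")     # the "guest" kicks the TX queue
        second = poll_once(epoll, tokens)
        return first + second
    finally:
        epoll.close()
        for fd in (tx_r, tx_w, tap_r, tap_w):
            os.close(fd)
```

Because the stand-in TAP fd is edge-triggered and is not read between the two polls, it is reported only once; the second poll reports just the TX source. A real handler would drain the fd until EAGAIN before returning to the loop.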
virtio-net Queues and the vnet Header
Firecracker's virtio-net device (VIRTIO_ID_NET = 1) uses exactly two virtqueues (NET_NUM_QUEUES = 2), each of depth 256 (NET_QUEUE_MAX_SIZE = 256). Queue 0 (RX_INDEX) holds device-writable buffers the guest driver posts for incoming frames. Queue 1 (TX_INDEX) holds device-readable buffers the guest driver fills with outgoing frames. The maximum per-buffer size is 65,562 bytes: a 12-byte vnet header plus the largest Ethernet frame.
With IFF_VNET_HDR set and TUNSETVNETHDRSZ at 12, every frame transferred through the TAP fd is prefixed with a virtio_net_hdr_v1:
| Field | Type | Offset | Meaning |
|---|---|---|---|
| flags | u8 | 0 | VIRTIO_NET_HDR_F_NEEDS_CSUM = 1, VIRTIO_NET_HDR_F_DATA_VALID = 2 |
| gso_type | u8 | 1 | GSO_NONE = 0, GSO_TCPV4 = 1, GSO_UDP = 3, GSO_TCPV6 = 4, GSO_ECN = 0x80 |
| hdr_len | u16 | 2 | Combined length of the Ethernet, IP, and transport headers |
| gso_size | u16 | 4 | MSS for segmentation |
| csum_start | u16 | 6 | Byte offset at which checksumming starts |
| csum_offset | u16 | 8 | Offset from csum_start at which the checksum is stored |
| num_buffers | u16 | 10 | Number of merged receive buffers |
VIRTIO_NET_F_MRG_RXBUF (bit 15) is always advertised, which activates the num_buffers field and allows a single incoming frame to span multiple guest ring buffers. VIRTIO_NET_F_MQ (bit 22) is explicitly not advertised; Firecracker is single-queue only, and IFF_MULTI_QUEUE (Linux 3.8+) is defined in the generated bindings but unused.
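The 12-byte layout can be checked with a few lines of `struct` code. This is a sketch that assumes a little-endian header, which holds on x86-64 and aarch64 hosts; the `VnetHdr` class and `split_frame` helper are hypothetical names, not Firecracker types.

```python
import struct
from typing import NamedTuple

# Two u8 fields followed by five u16 fields, little-endian: 12 bytes total.
_VNET_HDR = struct.Struct("<BBHHHHH")

VIRTIO_NET_HDR_F_NEEDS_CSUM = 1
VIRTIO_NET_HDR_GSO_NONE = 0
VIRTIO_NET_HDR_GSO_TCPV4 = 1


class VnetHdr(NamedTuple):
    flags: int = 0
    gso_type: int = VIRTIO_NET_HDR_GSO_NONE
    hdr_len: int = 0
    gso_size: int = 0
    csum_start: int = 0
    csum_offset: int = 0
    num_buffers: int = 0

    def pack(self) -> bytes:
        return _VNET_HDR.pack(*self)

    @classmethod
    def unpack(cls, buf: bytes) -> "VnetHdr":
        return cls(*_VNET_HDR.unpack_from(buf))


def split_frame(buf: bytes):
    """Split one TAP read into (vnet header, Ethernet frame)."""
    return VnetHdr.unpack(buf), buf[_VNET_HDR.size:]


# Example: a TSO send over IPv4 with 14 + 20 + 20 bytes of headers.
tso_hdr = VnetHdr(
    flags=VIRTIO_NET_HDR_F_NEEDS_CSUM,
    gso_type=VIRTIO_NET_HDR_GSO_TCPV4,
    hdr_len=54,
    gso_size=1448,
    csum_start=34,   # checksumming starts at the TCP header
    csum_offset=16,  # TCP checksum field sits 16 bytes into the TCP header
)
```

With IFF_VNET_HDR set, every buffer read from or written to the TAP fd would begin with the output of `pack()`, followed immediately by the Ethernet frame.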
The packet flows, written out concretely:
Guest TX (guest to host): the guest driver posts a descriptor chain to queue 1 and writes the MMIO notify register; KVM writes the bound eventfd; the epoll loop wakes on PROCESS_VIRTQ_TX; the VMM pops the head descriptor, checks the rate limiter (ops bucket first, then bytes bucket), copies scatter-gather data out of guest memory into a TX buffer, and calls libc::writev() on the TAP fd, delivering the Ethernet frame to the host kernel's network stack. It then signals the used ring to the guest.
Guest RX (host to guest): EPOLLIN fires on the TAP fd; the VMM checks the RX rate limiter (if throttled, it unregisters the TAP fd from epoll until the timer fires); libc::readv() fills scatter-gather buffers from the fd; the buffers are written into guest memory; the VMM updates the used ring and injects an interrupt into the guest.
Connecting the TAP to the World
Firecracker supports only the TUN/TAP backend; every networking topology is built outside the VMM. The operator is responsible for creating the TAP device and connecting it to the rest of the host before the microVM boots. Two topologies cover the common cases.
Routed NAT (Point-to-Point)
The minimal and most common setup assigns a /30 subnet to each TAP device, giving two usable addresses: one on the host-side TAP interface and one configured inside the guest. The host IP-forwards between the TAP and the outbound interface and masquerades the guest traffic. This is the topology Firecracker's docs/network-setup.md uses as its starting example, and it scaled to several thousand simultaneous microVMs in the original Firecracker demo.
Root required. The commands below create a TAP device and modify the host routing table. Run them as root or with CAP_NET_ADMIN.
# Create the TAP device and assign the host-side address
ip tuntap add tap0 mode tap
ip addr add 172.16.0.1/30 dev tap0
ip link set tap0 up
# Enable IP forwarding
echo 1 > /proc/sys/net/ipv4/ip_forward
# Masquerade guest traffic leaving on eth0 (nftables)
nft add table ip firecracker
nft add chain ip firecracker postrouting \
'{ type nat hook postrouting priority 100 ; }'
nft add rule ip firecracker postrouting \
ip saddr 172.16.0.2 oifname eth0 counter masquerade
nft add chain ip firecracker filter \
'{ type filter hook forward priority 0 ; }'
nft add rule ip firecracker filter \
iifname tap0 oifname eth0 accept
Inside the guest, the equivalent with iproute2:
ip addr add 172.16.0.2/30 dev eth0
ip link set eth0 up
ip route add default via 172.16.0.1 dev eth0
For N microVMs, carve sequential /30 subnets out of 172.16.0.0/16. For zero-based ordinal O, the host TAP address is 172.16.[(4*O+1)/256].[(4*O+1)%256] and the guest address is 172.16.[(4*O+2)/256].[(4*O+2)%256], where / is integer division. The /16 holds 16,384 such subnets. Each TAP stays in its own /30 broadcast domain, which means no lateral traffic between VMs without routing.
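The ordinal-to-address arithmetic is easy to get wrong by one, so a short sketch helps. It uses the standard `ipaddress` module; the function name `tap_addresses` and the `MAX_ORDINAL` bound are illustrative choices.

```python
import ipaddress

BASE = ipaddress.IPv4Address("172.16.0.0")
MAX_ORDINAL = (1 << 16) // 4 - 1  # a /16 divides into 16384 /30 subnets


def tap_addresses(ordinal: int):
    """Return (host TAP address, guest address) for a zero-based ordinal."""
    if not 0 <= ordinal <= MAX_ORDINAL:
        raise ValueError(f"ordinal must be in 0..{MAX_ORDINAL}")
    offset = 4 * ordinal
    # Offset +0 is the network address and +3 the broadcast address.
    return str(BASE + offset + 1), str(BASE + offset + 2)
```

Ordinal 0 gives 172.16.0.1 and 172.16.0.2, and ordinal 64 rolls over into the third octet with 172.16.1.1 and 172.16.1.2, which matches the division-and-remainder form of the formula.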
Many operators skip the iproute2 step entirely and pass the guest IP configuration through the Linux kernel's ip= boot parameter, documented in Documentation/admin-guide/nfs/nfsroot.rst. The full 10-field positional format:
ip=<client-IP>:<server-IP>:<gw-IP>:<netmask>:<hostname>:<device>:<autoconf>:<dns0-IP>:<dns1-IP>:<ntp0-IP>
All fields are optional; trailing colons for omitted fields can be dropped.
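For a static Firecracker guest with no server, no hostname, and no autoconf, using the /30 addressing from the routed example:
ip=172.16.0.2::172.16.0.1:255.255.255.252::eth0:off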
The kernel performs the equivalent of ip addr add, ip link set, and ip route add default during boot initialization, before the init process starts. No iproute2 package and no DHCP client need to be in the rootfs. This is particularly useful with the minimal disk images that Firecracker's fast-boot architecture encourages.
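When boot arguments are generated per VM, a small builder keeps the positional fields straight. The sketch below is an illustration; the function name `kernel_ip_param` and its keyword names are hypothetical, and it only assembles the string in the 10-field format given above.

```python
def kernel_ip_param(client="", server="", gateway="", netmask="",
                    hostname="", device="", autoconf="",
                    dns0="", dns1="", ntp0="") -> str:
    """Assemble the kernel ip= argument from its ten positional fields."""
    fields = [client, server, gateway, netmask, hostname,
              device, autoconf, dns0, dns1, ntp0]
    # Trailing empty fields, and their colons, can be dropped.
    while fields and not fields[-1]:
        fields.pop()
    return "ip=" + ":".join(fields)


boot_arg = kernel_ip_param(
    client="172.16.0.2",
    gateway="172.16.0.1",
    netmask="255.255.255.252",
    device="eth0",
    autoconf="off",
)
```

Empty fields in the middle keep their colon, which is why the server and hostname positions appear as `::` in the result.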
When using Firecracker's official getting-started rootfs, the guest MAC must follow the form 06:00:AC:10:00:02, where the last four octets encode the guest IPv4 address in hex (AC:10:00:02 = 172.16.0.2). This is a convention of that rootfs's init system, not a requirement of Firecracker itself. The API field guest_mac is optional; if omitted, the guest kernel generates a random MAC at startup.
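The MAC convention is a direct byte mapping, sketched here in both directions. The helper names are hypothetical; the 06:00 prefix and the example address are taken from the text.

```python
import ipaddress

MAC_PREFIX = b"\x06\x00"


def guest_mac_for(ip: str) -> str:
    """Encode an IPv4 address into the last four octets of the MAC."""
    octets = ipaddress.IPv4Address(ip).packed
    return ":".join(f"{b:02X}" for b in MAC_PREFIX + octets)


def ip_from_guest_mac(mac: str) -> str:
    """Recover the IPv4 address from a MAC that follows the convention."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6 or raw[:2] != MAC_PREFIX:
        raise ValueError("MAC does not follow the 06:00:<ipv4> convention")
    return str(ipaddress.IPv4Address(raw[2:]))
```

An orchestrator could derive the value for the guest_mac API field from the address it allocated, so the two never drift apart.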
Bridge (L2 Multi-VM)
A Linux bridge provides layer-2 connectivity across multiple TAP devices, putting several microVMs on the same broadcast domain:
Root required. The commands below create a bridge and add a TAP to it, modifying the host network topology.
ip link add name br0 type bridge
ip link set dev tap0 master br0
ip link set br0 up
With a bridge, the kernel bridge layer handles MAC learning and frame forwarding. Inter-VM communication requires no NAT: two guests on the same bridge communicate directly through the bridge's MAC table. A masquerade rule for the bridge's subnet still covers external access.
The routed and bridge topologies are the two extremes of isolation. A /30 point-to-point link gives each VM the maximum isolation short of no connectivity at all; a bridge maximizes inter-VM connectivity at the cost of a shared L2 domain. Production deployments such as Firecracker on AWS Lambda avoid the shared L2 domain in favor of higher-level service meshes for inter-function traffic.
The Security Boundary
Firecracker performs no network traffic filtering. Its docs/design.md states plainly: "all outbound network traffic data is copied by the Firecracker I/O thread from the emulated network interface to the backing host TAP device." Filtering is the host operator's responsibility. In practice that means nftables or iptables rules on the host, applied to the TAP interface, before traffic reaches the bridge or the routing table.
Rate Limiting at the VMM Edge
The problem rate limiting solves for microVMs is the same one it solves for containers: a single noisy tenant can starve every other workload on the same host.
Token Bucket in VMM Userspace
Firecracker rate limiting executes entirely in VMM userspace. There is no kernel qdisc, no netfilter rule, and no traffic control class. The I/O thread applies token-bucket checks synchronously, in the same epoll handler that moves data between the virtqueue ring buffers and the TAP fd. A guest TX frame that fails the bucket check is not written to the TAP fd, so the host kernel does not see it until the bucket refills; a throttled RX frame stays queued in the TAP device until the VMM resumes reading.
Each network interface in the Firecracker API accepts two optional RateLimiter objects, one for RX and one for TX. Each RateLimiter holds two independent TokenBucket configurations:
- bandwidth: the unit is bytes; limits throughput.
- ops: the unit is packets; limits packet rate independently of size.
Both run simultaneously. The ops bucket is checked first (consume(1, TokenType::Ops)); if it passes, the bytes bucket is checked (consume(frame_size, TokenType::Bytes)). If the bytes check fails after the ops check succeeded, the ops token is manually replenished (manual_replenish(1, Ops)) to keep the two buckets consistent. Both buckets must pass for a frame to proceed.
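The check-then-roll-back ordering can be sketched in a few lines. This is a simplified model without time-based refill; the `Bucket` class and `admit` function are hypothetical stand-ins for the Rust types named above.

```python
class Bucket:
    """Minimal token bucket: a budget that can be consumed and replenished."""

    def __init__(self, budget: int) -> None:
        self.budget = budget

    def consume(self, tokens: int) -> bool:
        if tokens > self.budget:
            return False
        self.budget -= tokens
        return True

    def replenish(self, tokens: int) -> None:
        self.budget += tokens


def admit(ops: Bucket, bandwidth: Bucket, frame_size: int) -> bool:
    """Admit a frame only if both the ops and the bytes bucket allow it."""
    if not ops.consume(1):
        return False
    if not bandwidth.consume(frame_size):
        # Give the ops token back so a rejected frame costs nothing.
        ops.replenish(1)
        return False
    return True
```

Without the rollback, a frame rejected by the bytes bucket would still drain the ops bucket, and the packet-rate limit would tighten every time the bandwidth limit was hit.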
TokenBucket has three fields:
| Field | Type | Notes |
|---|---|---|
| size | int64 | Bucket capacity and initial budget |
| refill_time | int64 (ms) | Time to refill one full bucket |
| one_time_burst | int64 | Non-replenishing burst, consumed before the main budget |
The bucket starts full (budget = size). The one_time_burst is consumed first and does not replenish after draining — it is a startup grace period, not a sustained burst allowance. Setting size = 0 or refill_time = 0 disables that limiter.
The refill formula, from src/vmm/src/rate_limiter/mod.rs:
refill_tokens = (time_delta_ns * size) / (refill_time_ms * 1_000_000)
To avoid integer overflow for large size values, the implementation pre-divides size and complete_refill_time_ns by their GCD (Euclidean algorithm) and stores the reduced pair as processed_capacity / processed_refill_time. The refill polling interval is REFILL_TIMER_DURATION = 100 ms, driven by a TimerFd per RateLimiter that is armed when the bucket hits empty and cleared by event_handler() when the timer fires.
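The effect of the GCD reduction is easiest to see with numbers. The sketch below mirrors the field names from the text but is an illustration rather than the Firecracker code; Python integers do not overflow, so it demonstrates only that the reduced pair produces the same refill count with smaller intermediate products.

```python
import math

NANOS_PER_MILLI = 1_000_000


class RefillRate:
    """Pre-reduced size / refill-time pair for computing refilled tokens."""

    def __init__(self, size: int, refill_time_ms: int) -> None:
        complete_refill_time_ns = refill_time_ms * NANOS_PER_MILLI
        common = math.gcd(size, complete_refill_time_ns)
        self.processed_capacity = size // common
        self.processed_refill_time = complete_refill_time_ns // common

    def tokens_for(self, time_delta_ns: int) -> int:
        # Same ratio as (time_delta_ns * size) / (refill_time_ms * 1_000_000).
        return time_delta_ns * self.processed_capacity // self.processed_refill_time
```

For a bucket of 1,000,000 tokens refilled over 1,000 ms, the pair reduces to 1 and 1,000, so a 100 ms delta multiplies by 1 rather than by 1,000,000 before dividing. In a fixed-width integer type that difference is what keeps the product in range.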
The reduce() method returns one of three variants. BucketReduction::Success means enough tokens are available and the packet proceeds. BucketReduction::Failure means the bucket is dry; the timer is armed, draining halts, and — for RX — the TAP fd is unregistered from epoll until the refill fires. BucketReduction::OverConsumption(f64) is the interesting edge case: the frame is larger than the full bucket capacity. The VMM lets it through (dropping an oversized frame would be worse) but arms the timer for ratio * refill_time ms to compensate. The rx_rate_limiter_throttled metric is incremented on throttle.
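The three outcomes can be modeled compactly. The following is a sketch under simplifying assumptions: time-based refill is omitted, `size` is assumed positive, and the class and enum are hypothetical Python stand-ins for the Rust types, not a transcription of them.

```python
from enum import Enum, auto


class Reduction(Enum):
    SUCCESS = auto()
    FAILURE = auto()
    OVER_CONSUMPTION = auto()


class TokenBucket:
    def __init__(self, size: int, one_time_burst: int = 0) -> None:
        self.size = size
        self.one_time_burst = one_time_burst
        self.budget = size  # the bucket starts full

    def reduce(self, tokens: int):
        """Return (outcome, ratio); ratio is meaningful only on overconsumption."""
        # The one-time burst is spent first and never comes back.
        if self.one_time_burst > 0:
            if self.one_time_burst >= tokens:
                self.one_time_burst -= tokens
                return Reduction.SUCCESS, 0.0
            tokens -= self.one_time_burst
            self.one_time_burst = 0
        if tokens > self.budget:
            if tokens > self.size:
                # Larger than a full bucket: let it through, report how far over.
                self.budget = 0
                return Reduction.OVER_CONSUMPTION, tokens / self.size
            return Reduction.FAILURE, 0.0
        self.budget -= tokens
        return Reduction.SUCCESS, 0.0
```

A caller would treat the ratio as a multiplier on refill_time when arming the timer, which is how an oversized frame is paid for after the fact.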
Rate limiter configuration is pre-boot via PUT /network-interfaces/{iface_id}. Live reconfiguration on a running microVM is available through PATCH /network-interfaces/{iface_id}, introduced in Firecracker v0.15.0.
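A request body for this endpoint might be assembled as follows. The field names (`iface_id`, `host_dev_name`, `rx_rate_limiter`, `tx_rate_limiter`, `bandwidth`, `ops`) reflect my reading of the Firecracker OpenAPI spec and should be checked against the version in use; the helper names are hypothetical, and the sketch only builds JSON without sending it.

```python
import json


def token_bucket(size: int, refill_time_ms: int, one_time_burst: int = 0) -> dict:
    """One TokenBucket object: capacity, refill period in ms, optional burst."""
    return {
        "size": size,
        "refill_time": refill_time_ms,
        "one_time_burst": one_time_burst,
    }


def rate_limiter(bandwidth: dict = None, ops: dict = None) -> dict:
    """One RateLimiter object holding up to two independent buckets."""
    limiter = {}
    if bandwidth is not None:
        limiter["bandwidth"] = bandwidth
    if ops is not None:
        limiter["ops"] = ops
    return limiter


def network_interface_body(iface_id: str, host_dev_name: str,
                           rx: dict = None, tx: dict = None) -> str:
    body = {"iface_id": iface_id, "host_dev_name": host_dev_name}
    if rx:
        body["rx_rate_limiter"] = rx
    if tx:
        body["tx_rate_limiter"] = tx
    return json.dumps(body)


# Example: cap guest TX at 10 MB per second and 5,000 packets per second.
example_body = network_interface_body(
    "eth0", "tap0",
    tx=rate_limiter(bandwidth=token_bucket(10_000_000, 1000),
                    ops=token_bucket(5_000, 1000)),
)
```

The same limiter objects would be sent in the PATCH body for live reconfiguration, with only the fields being changed.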
Comparison With CNI Bandwidth Shaping
Container networking shapes traffic through the kernel qdisc scheduler. The CNI bandwidth meta-plugin adds a tbf (token bucket filter) qdisc to the host-side veth using RTM_NEWQDISC netlink calls. For egress shaping (pod-to-host direction), it attaches an ingress qdisc (handle ffff:) to the host-side veth, installs a U32 filter with a MirredAction (TCA_EGRESS_REDIR) that redirects matching traffic to an IFB device named bwp<hash> (max 15 chars), and adds a tbf root qdisc on the IFB. For ingress shaping (host-to-pod), a tbf root qdisc goes directly on the host-side veth. The CNI binary exits after setup; enforcement is entirely in the kernel packet scheduler.
The tbf qdisc parameters: Rate (bytes/s, computed as ingressRate / 8 since the CNI config specifies bits), Limit (derived from rate, burst, and a hardcoded latencyInMillis = 25), and Buffer (burst in kernel tick units). Burst values in the CNI config are in bits; the burst/8 value must be less than 2^32 bytes.
The architectural split:
| Property | Firecracker rate limiter | CNI bandwidth plugin |
|---|---|---|
| Enforcement point | VMM userspace I/O thread (epoll) | Linux kernel qdisc on host-side veth |
| Algorithm | Custom Rust token bucket | Kernel tbf qdisc (net/sched/sch_tbf.c) |
| Granularity | Per-NIC, per-direction, dual bucket (bytes + ops) | Per-NIC, per-direction, bytes only |
| Burst mechanism | one_time_burst (non-replenishing) | burst parameter (bits, converted to buffer ticks) |
| Refill timer | 100 ms TimerFd in VMM userspace | Kernel packet scheduler tick |
| Live reconfiguration | Yes, PATCH since v0.15.0 | No — requires deleting and recreating the pod |
| Isolation boundary | Before packet reaches TAP fd or kernel | After packet exits the container namespace |
The tbf qdisc only sheds bytes. Firecracker's dual-bucket design lets operators cap packet rate (ops/s) independently from byte throughput, which is useful for controlling CPU cost from small-packet floods — a stream of 64-byte UDP packets consumes almost no bandwidth but can saturate the VMM's I/O thread with interrupt processing. Setting an ops limit sheds packets before the VMM copies them, directly capping interrupt load.
The enforcement point is the other decisive difference. For guest egress, the Firecracker rate limiter acts before the host kernel ever sees the frame. The CNI/qdisc path enforces limits in the kernel scheduler after the packet has already crossed the veth pair into the host network stack. For a microVM, there is no veth pair and no shared namespace to cross: the TAP fd is the only crossing point, and the rate limiter sits between the virtqueue and the fd.
MMDS: In-Process Metadata Service
A microVM that boots from a minimal rootfs with no DHCP client needs another way to receive its per-instance configuration: its IP address, its TLS certificate, its role credentials, the bootstrap data that tells its init process what to run. The EC2 Instance Metadata Service (IMDS) solved this problem when EC2 launched by placing a magic link-local address (169.254.169.254) on every instance that routes to the hypervisor's management plane. Firecracker embeds the equivalent: the MicroVM Metadata Service (MMDS).
Architecture: Dumbo Inside the Data Path
MMDS is not a sidecar process and not a separate network path. It is three components embedded directly in the Firecracker VMM:
- A host-side HTTP API handler that lets the operator populate a per-VM JSON data store before or after boot.
- A global JSON data store (serde_json::Value, default size limit 51,200 bytes, configurable with --mmds-size-limit).
- Dumbo, a minimalist TCP/IPv4/ARP network stack implemented in Rust, embedded in the virtio-net data path.
Dumbo intercepts frames between the guest's virtio ring buffers and the TAP fd. For each guest-to-host frame on an MMDS-enabled interface, the VMM applies a heuristic: if the frame could be an ARP request for the MMDS IP or an IPv4 packet destined to the MMDS IP, Dumbo handles it and the frame never reaches the TAP fd. Otherwise the frame is forwarded normally. The heuristic has no false negatives — a frame that might be for MMDS is never incorrectly forwarded to the host.
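A conservative classifier of this kind can be sketched with plain byte offsets. This is a simplified illustration, not Dumbo's code: it looks only at the EtherType and the destination (or ARP target) address, does not verify that an ARP frame is a request, and the function name `may_be_for_mmds` is hypothetical.

```python
import ipaddress
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETH_HDR_LEN = 14
MMDS_IP = ipaddress.IPv4Address("169.254.169.254").packed


def may_be_for_mmds(frame: bytes, mmds_ip: bytes = MMDS_IP) -> bool:
    """Return True if the frame could be addressed to the MMDS endpoint."""
    if len(frame) < ETH_HDR_LEN:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    payload = frame[ETH_HDR_LEN:]
    if ethertype == ETHERTYPE_ARP:
        # Target protocol address: bytes 24..28 of an Ethernet/IPv4 ARP body.
        return len(payload) >= 28 and payload[24:28] == mmds_ip
    if ethertype == ETHERTYPE_IPV4:
        # Destination address: bytes 16..20 of the IPv4 header.
        return len(payload) >= 20 and payload[16:20] == mmds_ip
    # Any other EtherType, including 802.1Q tags and IPv6, is not a candidate.
    return False
```

Frames for which this returns True would be handed to the embedded stack for full parsing; everything else would continue to the TAP fd untouched.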
MMDS is disabled by default. Enabling it requires associating it with one or more network interfaces via PUT /mmds/config. The network_interfaces field (array of interface ID strings) controls which TAP devices Dumbo monitors; frames arriving on non-associated interfaces pass through unmodified.
MMDS answers at 169.254.169.254 by default, overridable with any valid link-local IPv4 via ipv4_address in MmdsConfig. The hardcoded Dumbo source MAC is 06:01:23:45:67:01, used in ARP replies and as the source MAC on all outgoing TCP segments. TTL on all MMDS outgoing packets is 1. The guest must add a host route for the MMDS IP:
ip route add 169.254.169.254 dev eth0
Without a DHCP client this route must be injected by the init system or an in-guest script. The kernel ip= parameter configures an address and a default gateway but has no field for additional static routes.
flowchart LR
subgraph guest["Guest kernel"]
drv["virtio-net driver"]
end
subgraph vmm["Firecracker VMM process"]
vq["Virtio ring buffers"]
dumbo["Dumbo\n(ARP + TCP/IP stack)"]
store["JSON data store"]
tap["TAP fd"]
end
hostapi["Host API\nPUT /mmds"]
drv <-->|"MMIO + DMA"| vq
vq --> dumbo
dumbo -->|"frame for 169.254.169.254"| store
dumbo -->|"all other frames"| tap
hostapi --> store
Dumbo's Constraints
Dumbo is a deliberately narrow implementation. The design document enumerates its limitations explicitly:
- No 802.1Q VLAN tag support; tagged frames pass through to the TAP fd unexamined.
- No IP fragmentation reassembly; fragmented packets are treated as independent datagrams.
- Only EtherType 0x0806 (ARP) and 0x0800 (IPv4) are processed; other EtherTypes, including IPv6, are never handled by Dumbo.
- Minimal TCP: flow control only, no congestion control, no support for most TCP options.
- At most one pending HTTP response per TCP connection.
- If a guest request exceeds the fixed receive buffer, the connection is reset.
When an ARP request targeting the MMDS address arrives, Dumbo records it (retaining only the most recent). On the next available slot in the guest receive ring, it sends an ARP reply with source MAC 06:01:23:45:67:01 before serving any TCP segments. This ordering guarantees the guest's ARP table has the MMDS MAC before any HTTP connection attempt.
Host API and Data Store
Four endpoints manage MMDS:
| Endpoint | Method | Purpose |
|---|---|---|
| /mmds/config | PUT | Set MMDS version, IPv4 address, and allowed interface IDs (pre-boot) |
| /mmds | PUT | Replace entire data store with any valid JSON |
| /mmds | PATCH | Partial update via JSON Merge Patch (RFC 7396) |
| /mmds | GET | Retrieve full data store from host side |
The data store accepts any valid JSON. The MMDS version, network configuration, and IPv4 address are preserved across snapshot and restore. The data store contents are not saved in the snapshot, so a VM restored from it starts with an empty store; this avoids leaking per-VM secrets to clones derived from the same snapshot.
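The PATCH semantics follow RFC 7396, whose algorithm is short enough to sketch in full. This is a generic implementation of the RFC's MergePatch procedure, not Firecracker's code.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7396) and return the patched value."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in the patch means "delete this member".
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


store = {"latest": {"meta-data": {"ami-id": "ami-12345678",
                                  "local-ipv4": "172.16.0.2"}}}
patch = {"latest": {"meta-data": {"local-ipv4": None, "hostname": "vm-1"}}}
patched = merge_patch(store, patch)
```

One consequence worth remembering is that a merge patch cannot store an explicit null, because null is reserved for deletion.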
V1 and V2
MMDS has two protocol versions. V1 is stateless and deprecated; it is scheduled for removal in the next major Firecracker release. V2 is modeled after AWS IMDSv2 and is the version new deployments should use.
V2 session flow: the guest first obtains a session token:
PUT http://169.254.169.254/latest/api/token
X-metadata-token-ttl-seconds: 21600
Dumbo responds with a token string. TTL is between 1 and 21600 seconds (6 hours). The X-Forwarded-For header must not be present in the token request; its presence causes a 400 to be returned, which closes the same SSRF attack surface that IMDSv2 targets. Subsequent data requests supply the token:
GET http://169.254.169.254/latest/meta-data/ami-id
X-metadata-token: <token>
An invalid or expired token returns 401 Unauthorized. V2 also accepts the AWS header names (X-aws-ec2-metadata-token-ttl-seconds, X-aws-ec2-metadata-token) so that unmodified EC2 IMDS clients can run against a Firecracker MMDS.
V1 skips the token step entirely. Requests arrive without authentication and Dumbo responds. Metrics mmds.rx_invalid_token and mmds.rx_no_token are incremented on V1 requests so operators can track migration progress, but no enforcement occurs.
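From the guest side, the V2 flow is two HTTP requests. The sketch below builds them with the standard library's `urllib.request`; the function names are hypothetical, the header names and TTL bounds are taken from the text, and `fetch` can only succeed inside a guest with MMDS V2 enabled and a route to the MMDS address.

```python
import urllib.request

MMDS_BASE = "http://169.254.169.254"
MAX_TTL_SECONDS = 21600


def token_request(ttl_seconds: int) -> urllib.request.Request:
    """Build the PUT request that asks MMDS for a session token."""
    if not 1 <= ttl_seconds <= MAX_TTL_SECONDS:
        raise ValueError("TTL must be between 1 and 21600 seconds")
    return urllib.request.Request(
        f"{MMDS_BASE}/latest/api/token",
        method="PUT",
        headers={"X-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def data_request(path: str, token: str) -> urllib.request.Request:
    """Build the GET request for one resource, authenticated by the token."""
    return urllib.request.Request(
        MMDS_BASE + path,
        headers={"X-metadata-token": token},
    )


def fetch(path: str, ttl_seconds: int = 60, timeout: float = 2.0) -> str:
    """Run the two-step flow: obtain a token, then read the resource."""
    with urllib.request.urlopen(token_request(ttl_seconds), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(data_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()
```

A long-running agent would cache the token and request a new one only after a 401, rather than performing the PUT before every read.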
Addressing the Data Store
Resources are addressed by JSON Pointer (RFC 6901) as the URI path. For a data store containing:
{
"latest": {
"meta-data": {
"ami-id": "ami-12345678",
"local-ipv4": "172.16.0.2"
}
}
}
the guest fetches /latest/meta-data/ami-id and receives ami-12345678. Two response formats are available: Accept: application/json returns JSON; Accept: plain/text (or no header) returns IMDS text format, where object keys are separated by newlines and objects are represented with a trailing /, matching AWS EC2 IMDS behavior. Note that plain/text is a Firecracker-specific token, not the RFC 2045-compliant text/plain; this is an intentional Firecracker quirk, not a typo. Setting imds_compat: true in MmdsConfig forces IMDS format regardless of the Accept header, enabling unmodified EC2 IMDS clients. JSON types that have no IMDS text representation (numbers, arrays, booleans) return 501 when IMDS format is requested.
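Path resolution and the IMDS text rendering can be sketched together. This is a simplified model: it resolves object members only (no array indexing), ignores empty path segments, and raises an exception where the service would answer 501; the function names are hypothetical.

```python
def resolve(store, path: str):
    """Walk a JSON Pointer style path through nested objects."""
    node = store
    for token in filter(None, path.split("/")):
        # RFC 6901 escapes: ~1 stands for "/", ~0 stands for "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if not isinstance(node, dict) or token not in node:
            raise KeyError(path)
        node = node[token]
    return node


def imds_text(value) -> str:
    """Render a value in IMDS text form: strings verbatim, objects as key lists."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return "\n".join(
            key + "/" if isinstance(child, dict) else key
            for key, child in value.items()
        )
    # Numbers, arrays, and booleans have no IMDS text form (HTTP 501).
    raise NotImplementedError("value has no IMDS text representation")


example_store = {
    "latest": {
        "meta-data": {
            "ami-id": "ami-12345678",
            "local-ipv4": "172.16.0.2",
        }
    }
}
```

Resolving /latest on this store yields the single line `meta-data/`, and the trailing slash tells the client that another level can be listed.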
Why Not a veth?
The last question this chapter should answer is why Firecracker does not use the same veth-plus-CNI architecture that container runtimes use. The answer is the second kernel.
A container is a process in a separate Linux network namespace. Its veth pair is a kernel object that tunnels between two namespaces within the same kernel. The kernel bridge or route table handles switching between them. No userspace process mediates every packet; the kernel does it directly.
A microVM runs a separate kernel. There is no shared kernel object that spans both kernels — any communication must cross from the guest kernel into the VMM process and then back into the host kernel. The TAP fd is the crossing point. Firecracker reads frames from the guest virtqueue (shared memory that KVM maps into both address spaces), converts them into writev() calls on the TAP fd, and lets the host kernel handle them from there. The veth pair is a kernel-to-kernel shortcut that simply does not exist across a VM boundary.
The CNI model could be grafted on top of the TAP device — some orchestrators do exactly that, running CNI plugins that configure the host-side TAP rather than a veth. But the virtio-net device, the epoll event loop, and the rate limiter always remain between the guest and whatever the host side is.
Sources And Further Reading
- Linux TUN/TAP documentation: https://docs.kernel.org/networking/tuntap.html
- Linux if_tun.h kernel uapi: https://github.com/torvalds/linux/blob/master/include/uapi/linux/if_tun.h
- Linux virtio_net.h kernel uapi: https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_net.h
- OASIS virtio 1.2 CS01: https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html
- Firecracker tap.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/net/tap.rs
- Firecracker device.rs (net): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/net/device.rs
- Firecracker event_handler.rs (net): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/net/event_handler.rs
- Firecracker mod.rs (net): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/devices/virtio/net/mod.rs
- Firecracker rate_limiter/mod.rs: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/rate_limiter/mod.rs
- Firecracker OpenAPI spec: https://raw.githubusercontent.com/firecracker-microvm/firecracker/main/src/firecracker/swagger/firecracker.yaml
- Firecracker network-setup.md: https://github.com/firecracker-microvm/firecracker/blob/main/docs/network-setup.md
- Firecracker design.md: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
- Firecracker mmds-design.md: https://github.com/firecracker-microvm/firecracker/blob/main/docs/mmds/mmds-design.md
- Firecracker mmds-user-guide.md: https://github.com/firecracker-microvm/firecracker/blob/main/docs/mmds/mmds-user-guide.md
- CNI bandwidth plugin main.go: https://github.com/containernetworking/plugins/blob/main/plugins/meta/bandwidth/main.go
- CNI bandwidth plugin ifb_creator.go: https://github.com/containernetworking/plugins/blob/main/plugins/meta/bandwidth/ifb_creator.go
- CNI bandwidth plugin docs: https://www.cni.dev/plugins/current/meta/bandwidth/
- Linux kernel NFS root docs (ip= parameter): https://docs.kernel.org/admin-guide/nfs/nfsroot.html