I've been benchmarking our agent isolation setup using Firecracker microVMs, and I've hit a consistent performance snag. Every agent startup now takes roughly 800ms longer than it did with our old container-based isolation, which pushes some of our latency-sensitive workflows close to their thresholds.
My current configuration is pretty standard:
- Firecracker `v1.5.0`
- Agent runs in a minimal Alpine-based rootfs (ext4)
- MicroVM specs: 1 vCPU, 128 MB memory, rootfs attached as a virtio-blk device
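
For anyone who wants to reproduce this, here is a minimal sketch of that configuration in the JSON shape Firecracker uses for its config file. It is not my exact file: the paths, the `rootfs` drive ID, and the boot args are placeholders I picked for illustration, and `build_vm_config` is just a hypothetical helper name.

```python
import json


def build_vm_config(kernel_path, rootfs_path, vcpus=1, mem_mib=128):
    """Build a Firecracker-style config dict for a single-drive microVM.

    The boot args below are an assumption (a commonly used minimal set),
    not necessarily what any particular deployment uses.
    """
    return {
        "boot-source": {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [
            {
                "drive_id": "rootfs",          # placeholder ID
                "path_on_host": rootfs_path,   # the ext4 image
                "is_root_device": True,
                "is_read_only": False,
            }
        ],
        "machine-config": {
            "vcpu_count": vcpus,
            "mem_size_mib": mem_mib,
        },
    }


if __name__ == "__main__":
    # Placeholder paths; substitute real ones.
    config = build_vm_config("/path/to/vmlinux", "/path/to/rootfs.ext4")
    print(json.dumps(config, indent=2))
```

As far as I understand, a file of this shape can be passed with `--config-file`, and combining that with `--no-api` avoids starting the API server at all. I haven't measured that path yet, so treat it as something to test rather than a known fix.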
I've already ruled out a few obvious culprits:
* The kernel boot time for the microVM itself is sub-100ms.
* The rootfs image isn't large (< 50MB).
* The agent binary startup in a regular container is ~120ms.
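
If kernel boot accounts for under 100ms of the ~800ms overhead, that leaves roughly 700ms unattributed. To narrow it down, a per-phase wall-clock breakdown helps. Below is a minimal sketch of the kind of timing harness I mean; `PhaseTimer` and the phase names are hypothetical, and the `time.sleep` calls stand in for the real steps (spawning the VMM, configuring the drive, waiting for the agent).

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Record wall-clock duration (in milliseconds) of named phases."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Store elapsed time even if the phase raised.
            self.phases[name] = (time.perf_counter() - start) * 1000.0

    def report(self):
        """Return one line per phase, slowest first, with its share of the total."""
        total = sum(self.phases.values()) or 1.0
        ordered = sorted(self.phases.items(), key=lambda kv: kv[1], reverse=True)
        return [
            f"{name:<20} {ms:8.1f} ms  {100.0 * ms / total:5.1f}%"
            for name, ms in ordered
        ]


if __name__ == "__main__":
    timer = PhaseTimer()
    # Stand-ins for the real startup steps.
    with timer.phase("vmm_spawn"):
        time.sleep(0.02)
    with timer.phase("drive_setup"):
        time.sleep(0.03)
    with timer.phase("guest_boot"):
        time.sleep(0.01)
    print("\n".join(timer.report()))
```

Wrapping each host-side step this way gives a coarse split that can be compared against the flamegraphs.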
The delay feels like it's coming from the Firecracker initialization or the block device attachment. Has anyone else done deep profiling on this pipeline? I'm particularly curious about:
* The impact of using `vsock` vs. a network bridge for agent control communication.
* Whether pre-initializing/pooling microVMs is the only viable path to sub-200ms starts, or if there's a configuration tweak I'm missing.
* Any known trade-offs in the kernel config or Firecracker build flags that affect cold-start time.
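
To make the pooling option concrete, here is a minimal sketch of the idea, not something we run today. `WarmPool` is a hypothetical name, and `boot_vm` is an assumed callable that cold-starts one microVM and returns a handle to it. The pool pays the boot cost up front and refills in the background after each hand-out.

```python
import queue
import threading


class WarmPool:
    """Keep a fixed number of pre-booted VMs ready to hand out."""

    def __init__(self, boot_vm, size=4):
        if size < 1:
            raise ValueError("size must be at least 1")
        self._boot_vm = boot_vm
        self._ready = queue.Queue(maxsize=size)
        # Pay the cold-start cost up front, off the request path.
        for _ in range(size):
            self._ready.put(boot_vm())

    def acquire(self):
        """Return a warm VM if one is ready, else fall back to a cold start."""
        try:
            vm = self._ready.get_nowait()
        except queue.Empty:
            return self._boot_vm()
        # Replace the VM we just handed out without blocking the caller.
        threading.Thread(target=self._refill, daemon=True).start()
        return vm

    def _refill(self):
        vm = self._boot_vm()
        try:
            self._ready.put_nowait(vm)
        except queue.Full:
            # Should not happen with one refill per hand-out; real code
            # would shut the surplus VM down here.
            pass


if __name__ == "__main__":
    import itertools

    ids = itertools.count()
    pool = WarmPool(lambda: next(ids), size=2)  # fake boot: returns an ID
    print(pool.acquire(), pool.acquire())
```

The obvious costs are idle memory for each warm VM (128 MB apiece in my setup) and the fact that a burst larger than the pool still hits the cold path.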
I can share my flamegraph snippets if there's interest. They point heavily to time spent in `api_server` startup and block device setup.