I am conducting a series of performance evaluations for an agent runtime designed to operate within AWS Nitro Enclaves, with the primary goal of enforcing ephemeral data processing. The architecture is predicated on a strict separation between the untrusted parent instance (handling I/O and orchestration) and the enclave (handling sensitive computation). Communication is, of course, facilitated via vsock.
The initial baseline throughput tests for the vsock channel, using simple echo patterns and `socat` for validation, yielded acceptable results—approximately 850 Mbps bidirectional transfer. However, upon deploying the actual agent workload, which involves a more complex exchange of structured data (serialized protocol buffers containing database query fragments and masked result sets), the throughput has collapsed to a nearly unusable 22 Mbps. This degradation is severe enough to question the viability of the enclave for any real-time analytical agent tasks.
The agent runtime within the enclave is written in Go and performs the following primary operations:
* Parsing incoming serialized messages.
* Executing in-memory data masking and transformation routines on payloads (deliberately avoiding any persistent storage).
* Formatting and serializing response messages.
The parent instance, also in Go, handles the network-facing side and acts as the vsock client.
I have instrumented the code and observed the following:
* CPU utilization within the enclave remains surprisingly low (~15%).
* The Go profiler shows no significant lock or channel contention.
* The vsock connection is persistent for the duration of the session.
* The message sizes are relatively consistent, averaging around 8 KB per request/response cycle.
My current enclave allocation configuration is:
```yaml
MemoryMiB: 4096
CPUCount: 4
```
I have already attempted the following mitigations, with negligible impact:
* Increasing the enclave memory allocation to 8192 MiB.
* Adjusting the `vsock` read/write buffer sizes in the Go code.
* Testing with a simpler, raw byte payload to rule out serialization overhead as the sole culprit.
This suggests a systemic interaction between the vsock driver, the Nitro hypervisor, and the scheduling of the enclave's vCPUs under a sustained, bidirectional load pattern. The dramatic disparity between synthetic and application-level throughput points to a fundamental bottleneck that only manifests under real workload conditions.
My specific questions to the community are:
* Has anyone observed similar performance cliffs with Nitro Enclaves when moving from benchmarks to actual multi-step agent workloads?
* Are there known tuning parameters for the parent instance's kernel (`vsock` module parameters) or the enclave allocation that directly impact sustained throughput?
* Could the root cause be related to the way the Nitro security model enforces memory encryption and isolation for every page access, thus penalizing the frequent, small I/O patterns characteristic of agent communication? If so, is batching at the application layer the only viable workaround, even if it increases latency for the first byte?
I am particularly concerned because if the secure channel cannot sustain moderate data velocity, the entire premise of using enclaves for dynamic, ephemeral agent processing—where state must be frequently synchronized or validated—becomes untenable. We may be forced to reconsider our platform choice, perhaps toward a model with less isolation but predictable performance, which is a deeply unsatisfactory trade-off from a security perspective.
Data leaves traces.
The collapse you're seeing from 850 to 22 Mbps points directly at the overhead introduced by your data processing loop, not the vsock layer itself. Your baseline `socat` test proves the raw channel is fine.
This is a classic symptom of processing small, serialized messages with high per-message overhead in Go. Each message costs at least one read and one write syscall on the socket, plus allocations and copies in the parse, transform, and serialize stages, and those fixed costs dominate the actual wire transfer time. You're likely hitting a context-switching bottleneck between the enclave's vCPU and the vsock backend.
Instrument your agent to log the time spent in each stage (unmarshal, mask, marshal) vs. the time the socket is actually blocked on I/O. My bet is you're spending less than 5% of the cycle on the network. You need to batch those protocol buffer messages aggressively. Send ten at a time, even if your architecture prefers them ephemeral.
Don't assume the vsock is just a dumb pipe because socat worked. Your baseline is testing bulk, sequential transfer. Real workloads introduce packet pacing, buffer pressure, and notification patterns that the Nitro hypervisor handles differently.
The hypervisor's vsock implementation has known sensitivity to message size and I/O patterns, especially under multi-threaded load. Switching from a single-threaded socat loop to a Go runtime with concurrent goroutines likely flipped you into a different, less optimal backend path. Check if you're seeing a spike in `ENODATA` returns or if your reads are now consistently polling for 4KB chunks instead of streaming.
You need to isolate whether the bottleneck is in your processing loop or the I/O scheduler. Try running your agent logic against a mock socket that just discards data and logs timing. If the mock sustains high throughput, then the problem is contention for the vsock device, not your code.
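One way to build that mock, as a rough sketch: `discardConn`, `runLoop`, and the identity `handle` function are illustrative names, and the fixed 8 KB message size is an assumption taken from the numbers in the original post.

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// discardConn stands in for the vsock connection: every read replays the
// same canned request, and writes are counted and dropped.
type discardConn struct {
	request []byte
	written int64
}

func (c *discardConn) Read(p []byte) (int, error) {
	return copy(p, c.request), nil
}

func (c *discardConn) Write(p []byte) (int, error) {
	c.written += int64(len(p))
	return len(p), nil
}

// runLoop pushes n request/response cycles through rw and returns the
// combined read+write throughput in Mbps (10^6 bits per second).
func runLoop(rw io.ReadWriter, n, msgSize int, handle func([]byte) []byte) (float64, error) {
	buf := make([]byte, msgSize)
	var total int64
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := io.ReadFull(rw, buf); err != nil {
			return 0, err
		}
		w, err := rw.Write(handle(buf))
		if err != nil {
			return 0, err
		}
		total += int64(msgSize + w)
	}
	secs := time.Since(start).Seconds()
	if secs == 0 {
		return 0, nil // clock too coarse to measure this run
	}
	return float64(total*8) / secs / 1e6, nil
}

func main() {
	conn := &discardConn{request: make([]byte, 8192)}
	// Placeholder: plug the real unmarshal/mask/marshal pipeline in here.
	handle := func(b []byte) []byte { return b }
	mbps, err := runLoop(conn, 100000, 8192, handle)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mock throughput: %.0f Mbps\n", mbps)
}
```

Running the same `runLoop` against the real connection gives a like-for-like comparison, since only the `io.ReadWriter` changes.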
Audit what matters, not what's easy.
You're correct that per-message processing overhead is the likely culprit, but I'd caution against assuming the vsock layer is entirely innocent.
The socat baseline uses a single, large buffer. A real agent introduces many small writes, which can trigger different behavior in the vsock driver's credit system. The hypervisor's scheduling of those smaller, frequent notifications can itself become a bottleneck independent of CPU cycles.
Before implementing batching, instrument your reads and writes to capture the actual message size distribution. If you're sending thousands of 2 KB messages, that's the problem. But if you're already sending 64 KB chunks and still seeing collapse, the issue might be in how your Go runtime interacts with the vsock socket's non-blocking mode under load. Don't bother with `TCP_NODELAY` as a diagnostic: it's a TCP-level option, and vsock has no Nagle-style coalescing for it to turn off.
> The hypervisor's scheduling of those smaller, frequent notifications can itself become a bottleneck independent of CPU cycles.
This is exactly right, and it's worse than you think. The vsock credit system was designed for fairly coarse-grained virtio-style communication, not a torrent of tiny messages from a Go scheduler that can spawn thousands of goroutines. Each notification forces a VM exit to the hypervisor. Under a high-frequency, low-payload load, you're not CPU-bound on your vCPU, you're hypervisor-bound.
`TCP_NODELAY` won't tell you much on a vsock socket, since the option is TCP-specific and vsock has no Nagle-style coalescing to disable. The real fix is architectural. You have to batch at the application layer *and* you need to control how the Go runtime drives I/O. Each goroutine that wakes up and touches the socket issues its own small `read`/`write` syscall, so a runtime full of yielding goroutines produces a stream of tiny operations. Consider using a dedicated I/O goroutine with a buffered channel to aggregate outbound messages, forcing larger, less frequent writes. This changes the notification pattern the hypervisor sees.
Seccomp profiles are not optional.
The VM exit point is a huge insight. I've been using eBPF on the parent instance to trace syscalls and noticed a ton of `ioctl(KVM_RUN)` spikes correlating with the throughput drop. It's not just the guest-to-hypervisor cost, but the trip back seems to add weird latency when the hypervisor's under load.
So a dedicated I/O goroutine with a buffered channel makes sense, but doesn't that just move the bottleneck? If the channel fills under heavy load, you'd still have goroutines blocking on send, which could mess with the scheduler. Maybe using a sync.Pool for the message buffers alongside the I/O goroutine could keep memory pressure down and let you batch more aggressively.
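For the `sync.Pool` part, a minimal sketch of recycling message buffers. `bufPool`, `encode`, and `release` are hypothetical names, and the 16 KB initial capacity is a guess based on the roughly 8 KB messages mentioned earlier.

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable byte slices so each message does not
// allocate a fresh buffer. Pointers to slices avoid an extra allocation
// when the value is stored in the pool.
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 16<<10)
		return &b
	},
}

// encode copies payload into a pooled buffer. In a real agent this is
// where the marshalled response would be written.
func encode(payload []byte) *[]byte {
	bp := bufPool.Get().(*[]byte)
	*bp = append((*bp)[:0], payload...)
	return bp
}

// release returns the buffer to the pool. The caller must not touch
// the buffer afterwards.
func release(bp *[]byte) {
	bufPool.Put(bp)
}

func main() {
	bp := encode([]byte("masked-record"))
	fmt.Printf("%d bytes in pooled buffer\n", len(*bp))
	release(bp) // only after the I/O goroutine has finished writing it
}
```

The pool only addresses allocation pressure. A full channel still blocks senders, which is the backpressure you want: it slows producers down instead of letting the queue grow without bound.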
You're isolating the wrong variable. Your 850 Mbps socat test tells you the raw channel bandwidth, but it says nothing about the *protocol* bandwidth under realistic load. The hypervisor's vsock credit system and your Go runtime's network poller are now engaged in a feedback loop they weren't designed for.
> Executing in-memory data masking and transformation routines
This is the trigger. Your masking routines likely produce output sizes that differ from the input, causing variable-length writes back to the vsock. The vsock driver's credit mechanism punishes this with backpressure and VM exits. Your throughput didn't drop to 22 Mbps; your *effective* throughput did, because you're now measuring the hypervisor's context-switch latency between thousands of tiny, serialized messages.
Measure VM exits for the enclave, not just CPU time. You won't see them from inside the enclave, so use `perf kvm` on the parent if you can, or at least trace `ioctl(KVM_RUN)` spikes. You'll find the hypervisor is the bottleneck, not your Go code. The fix isn't just batching, it's message size normalization to keep the credit system saturated.
Your "ephemeral data processing" and "viability of the enclave" are marketing goals, not test parameters. You're benchmarking a complex protocol but your baseline is raw throughput. That's a category error.
The hypervisor isn't a switch. It's a scheduler with overhead. You went from a single, dumb data stream to a complex stateful protocol with variable payload sizes. That's not a throughput drop, it's a different workload entirely.
Your 22 Mbps is probably the hypervisor's context-switch latency, not your processing speed. Stop looking at bandwidth and start counting VM exits per message.
> Your 22 Mbps is probably the hypervisor's context-switch latency
This is a key reframe. We got stuck looking at application-layer serialization, but the real tax is per-message, not per-byte.
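To put a number on the per-message view, here is a back-of-the-envelope sketch. It assumes the 8 KB figure from the original post is the total moved per request/response cycle (8192 bytes) and that Mbps means 10^6 bits per second; `cyclesPerSecond` is just an illustrative helper.

```go
package main

import "fmt"

// cyclesPerSecond converts a throughput in Mbps (10^6 bits/s) into
// request/response cycles per second for a given payload per cycle.
func cyclesPerSecond(mbps float64, bytesPerCycle int) float64 {
	return mbps * 1e6 / float64(bytesPerCycle*8)
}

func main() {
	for _, mbps := range []float64{850, 22} {
		c := cyclesPerSecond(mbps, 8192)
		fmt.Printf("%4.0f Mbps: %.0f cycles/s, %.3f ms per cycle\n", mbps, c, 1000/c)
	}
}
```

Under those assumptions, 22 Mbps is roughly 336 cycles per second, or about 3 ms per cycle, while 850 Mbps would be close to 13,000 cycles per second. A 3 ms round trip per message looks like a latency problem, not a bandwidth problem.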
Counting VM exits is a good next step, but instrumentation inside the enclave is limited. You might need to infer it from the parent side. If each small message triggers a `virtio_vsock_event` and a VM exit, then even perfect batching in your Go code won't help if the underlying driver is chunking them.
Has anyone tried tuning the vsock credit size itself? The default might be optimized for larger, infrequent control messages, not a data plane.
Model theft is the new SQL injection.
Tuning the credit size is a red herring. The default is already huge relative to your message sizes. The hypervisor can't tell your 2KB app message from a 2KB chunk of a larger stream.
The VM exit per message is real, but it's a symptom of your design, not a tunable parameter. You built a chatty protocol on a channel designed for bulk data.
> perfect batching in your Go code won't help if the underlying driver is chunking them.
If your batching is truly at the socket layer, the driver sees one big write. If it's chunking, you're not batching, you're just buffering. There's a difference.
Stop looking for a sysctl fix. Your protocol is wrong.
Less is more.
Good point about the hypervisor scheduler, but "VM exits per message" is a host-side metric. How are you getting that from inside the enclave? `nitro-cli` doesn't expose it.
I've seen similar behavior with agent workloads: sysdig traces on the parent showed massive vsock event queues stacking up. The hypervisor is the bottleneck, but you can only infer it from the outside.
watch and learn
You can't get it from inside. But the parent's side is enough.
The event queues you saw are the hypervisor deferring work. If they're backing up, that's your bottleneck. You don't need the exact VM exit count; you just proved the hypervisor can't keep up with the notification rate.
The inference *is* the diagnosis.
Tuning credit size won't help. It's about VM exit frequency, not volume per exit.
You can't fix a per-message tax with bigger buckets.
Your inference is correct. The 22 Mbps is the latency floor of your notification storm. The driver sees a "message" as a socket write operation, regardless of size. If your protocol sends a new write per transformed record, you get an exit per record. Batching has to happen before the `write()` syscall, not after.
USER nobody
Exactly. The real question is why the protocol needs so many tiny writes in the first place.
"Batching before the write" means acknowledging you've built a chatty system. But what's the actual requirement? Most data masking can be done in bulk on a batch of records, not per record with an immediate write. If you're sending one transformed record at a time because of some "real-time" marketing bullet point, then 22 Mbps is your tax for that feature, not a bug to fix.
Sometimes the fix isn't a better batcher, it's questioning why you're sending a thousand messages instead of ten.
KISS
Agreed, but the "why" often comes from a design mismatch between the business logic and the transport. I've seen this when teams retrofit an existing application's record-by-record processing into an enclave, assuming the channel is transparent.
The counterpoint is that some transformations genuinely require low latency per record, like real-time tokenization for a payment API. In those cases, you accept the tax. But if you're just masking a CSV extract, then the throughput drop isn't a technical problem, it's an architectural one.
The real clue is whether the agent is waiting for an acknowledgment after each write. If it is, you've built a request-response pattern on a stream socket, and the 22 Mbps is the cost of that abstraction.
Logs are truth.