Yeah, that 1.5ms you measured is exactly what I've been worried about. It's like adding a fixed tax to every single step.
> part of the problem is that the benchmarks we *do* see are often for heavier tasks
I think that's definitely it. If the tool takes 100ms to run anyway, who cares about 1ms of overhead? But most of my tools are super simple things, like formatting output or checking a condition. At that point, the tax is bigger than the work.
I'm still trying to understand when the isolation is worth that cost. For a calculator, probably never.
Exactly, the runtime and serialization variables make published benchmarks almost useless for this case. If someone's using wasmtime-py with JSON on Python objects, and I'm using wasmedge with packed CBOR in Rust, we're talking about entirely different performance profiles.
Your point about wasi-nn is interesting, but it just highlights the core issue: we're adding more and more complex host interfaces to patch the fundamental problem that we're moving data across a boundary it doesn't need to cross. Now you need a secure tensor-handling surface too. That's more surface area, not less.
What I'd like to see is a benchmark that isolates *just* the crossing cost: identical logic in native Rust vs. WASM, with the most minimal shared memory interface possible. Until then, we're all just guessing how much of that 1.5ms is inevitable and how much is bad architecture.
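As a starting point, here is a rough sketch of the host-side half of that benchmark. It only compares a direct call against the same call wrapped in a JSON encode/decode round trip, so it measures the marshalling share and nothing else. There is no WASM runtime in the loop; `add_via_json` and `per_call_us` are names made up for illustration, and a real harness would add a third variant that actually calls into the module.

```python
import json
import timeit


def add(a, b):
    """The tool logic under test."""
    return a + b


def add_via_json(payload: str) -> str:
    # Stand-in for a boundary crossing: decode the arguments,
    # run the same logic, encode the result. No sandbox involved.
    args = json.loads(payload)
    return json.dumps({"result": add(args["a"], args["b"])})


def per_call_us(fn, *args, number=100_000):
    # Average wall-clock cost per call, in microseconds.
    total_s = timeit.timeit(lambda: fn(*args), number=number)
    return total_s / number * 1e6


if __name__ == "__main__":
    payload = json.dumps({"a": 2, "b": 3})
    direct = per_call_us(add, 2, 3)
    marshalled = per_call_us(add_via_json, payload)
    print(f"direct:      {direct:.3f} us/call")
    print(f"marshalled:  {marshalled:.3f} us/call")
    print(f"difference:  {marshalled - direct:.3f} us/call")
```

Whatever is left between the marshalled number and a real sandboxed call would then be attributable to the runtime and bindings rather than serialization.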
-- Dave
You've hit on the critical blind spot: the lack of meaningful, granular benchmarks. Everyone measures the "compute" part inside the sandbox, but the real cost is in the marshalling layer, which is almost never isolated in the results. I'd argue the performance penalty is less about the WASM execution and more about the architectural choice to treat every tool call as a remote procedure call.
Where this becomes a logging nightmare is in silent amplification. That 2ms overhead per call doesn't just slow the chain; it obscures the performance profile of your actual application logic. When every function call has a fixed, significant latency tax, your monitoring and alerting for genuine performance degradation become noise. You can't distinguish between a slow network call in `get_weather` and the baseline sandbox tax, so you either miss real issues or drown in false positives.
The security theater accusation sticks when teams implement this without the requisite observability. If you're going to pay a 1-2ms tax per call for isolation, you must instrument and log each crossing to prove the boundary is holding and to measure its true cost. Otherwise, you're just building a slower, less transparent system.
That monitoring blind spot is exactly what kills you in production. You implement this for a 10% safety gain, and suddenly your p99 latency is a flat line 100ms above baseline, drowning every real signal.
The false positive problem gets worse. Your alert for "tool call > 50ms" now fires constantly because the tax is stacked on top of every real latency. So you raise the threshold to 100ms, which means a genuinely stuck tool calling a slow API slips right through.
Instrumenting each crossing sounds good until you realize you've just doubled your telemetry volume to measure the overhead of... your telemetry system. It's a self-licking ice cream cone.
Oh wow, this is such a practical point I hadn't even considered. My monitoring setup is so basic right now, I just watch for "things being slow". You're right, if I added this overhead, my entire baseline would shift and I'd have no idea what's actually broken versus what's just the tax.
It reminds me of trying to measure network speed from inside a VPN - the overhead is just baked in everywhere.
So how do you even start to untangle that? Do you have to log two different latencies for every call, one with the crossing and one without, just to see the real performance? That sounds... exhausting.
Yeah, the double-logging idea is exactly the trap. You're basically building a monitoring system for your overhead, which just adds more overhead 😅
What worked for me was adding a simple label or tag to the latency metric, marking it as crossing the WASM boundary. So you're not logging twice, you're just flagging *which* latencies have the tax baked in. Your baseline for those calls is just higher, and your alerts have different thresholds.
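Roughly what that looks like, as a minimal sketch and not tied to any particular metrics library. The tag names, the threshold values, and the `LatencyRecorder` class are all assumptions for illustration; in a real setup the tag would just be a label on whatever histogram you already emit.

```python
from collections import defaultdict

# Assumed per-tag alert thresholds in milliseconds. The "wasm" value is
# the native threshold plus a guessed allowance for the crossing tax.
ALERT_THRESHOLD_MS = {"native": 50.0, "wasm": 55.0}


class LatencyRecorder:
    """Stores one latency per call, keyed by tool name and boundary tag."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, tool: str, latency_ms: float, boundary: str) -> bool:
        # One measurement per call; the tag says whether the tax is baked in.
        self.samples[(tool, boundary)].append(latency_ms)
        # Returns True when this call should trigger an alert for its tag.
        return latency_ms > ALERT_THRESHOLD_MS[boundary]


if __name__ == "__main__":
    recorder = LatencyRecorder()
    print(recorder.record("get_weather", 52.0, "native"))  # over the native threshold
    print(recorder.record("get_weather", 52.0, "wasm"))    # within the wasm threshold
```

The same 52ms reading alerts on the native path and not on the sandboxed one, which is the whole point: one log line per call, two baselines.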
But you're right, it's still exhausting to manage. The VPN analogy is perfect - once the tunnel's up, everything's slower and you just have to accept that as the new normal for that traffic.
Yuki
Absolutely. That missing benchmark drives me nuts too. I ran a quick and dirty test last month for exactly this - a simple `add(a, b)` function.
Native Python call was sub-0.1ms. The same logic in a minimal Wasmtime module, with JSON marshalling, was consistently 1.2-1.8ms. So for a chain of ten simple operations, you're adding roughly 12-18ms of pure overhead. That's not nothing.
The real kicker? That's with a *good* runtime. I've seen some setups hit 5ms+ just for the crossing. Makes you wonder if we're just trading one bottleneck for another.
Security is a process, not a product.
That's a really useful data point, thank you for sharing it. Seeing the numbers for a dead-simple function like `add` really puts it in perspective.
> The same logic in a minimal Wasmtime module, with JSON marshalling, was consistently 1.2-1.8ms.
This makes me wonder, is the JSON part a big contributor? I've been reading about using something like packed CBOR or even a raw memory buffer to pass data, but I'm not sure if that cuts the overhead down meaningfully, or if the runtime startup is the main cost.
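One way to get a feel for the serialization share on its own, without any runtime involved, is to time the two encodings side by side on the host. This is only a sketch: `struct` here stands in for the "raw memory buffer" option (CBOR isn't in the standard library), the fixed two-integer layout is an assumption, and the numbers will say nothing about instantiation cost.

```python
import json
import struct
import timeit

# Assumed wire layout for the packed variant: two little-endian int64 values.
PAIR = struct.Struct("<qq")


def encode_json(a: int, b: int) -> bytes:
    return json.dumps({"a": a, "b": b}).encode("utf-8")


def decode_json(buf: bytes):
    data = json.loads(buf)
    return data["a"], data["b"]


def encode_packed(a: int, b: int) -> bytes:
    return PAIR.pack(a, b)


def decode_packed(buf: bytes):
    return PAIR.unpack(buf)


def round_trip_us(encode, decode, number=100_000):
    # Average cost of one encode plus one decode, in microseconds.
    total_s = timeit.timeit(lambda: decode(encode(2, 3)), number=number)
    return total_s / number * 1e6


if __name__ == "__main__":
    print(f"json:   {round_trip_us(encode_json, decode_json):.3f} us/round trip")
    print(f"packed: {round_trip_us(encode_packed, decode_packed):.3f} us/round trip")
```

If both of these come out far below the 1.2-1.8ms figure, the bulk of the cost is somewhere other than the encoding itself.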
Your comment about the 5ms+ setups is worrying. Is that just down to a slower runtime, or are there specific configuration mistakes that can blow up the crossing time?
The "fresh instantiation per call" pattern is a classic case of cargo cult security. It adds massive overhead for a threat model that often doesn't exist.
If the tool itself is untrusted, you instantiate it once and keep it in a pool. The state you're scared of is *inside* the sandbox, where it belongs. The host state stays on the host side of the interface.
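A minimal sketch of the pooling pattern, independent of any specific runtime. `InstancePool` and `FakeAddTool` are hypothetical names; the factory would be whatever your runtime's bindings use to instantiate a module, which is deliberately left out here.

```python
import queue
from contextlib import contextmanager


class InstancePool:
    """Instantiates `size` sandbox instances up front and reuses them."""

    def __init__(self, factory, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # pay the instantiation cost once

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until an instance is free, then hands it out.
        instance = self._idle.get(timeout=timeout)
        try:
            yield instance
        finally:
            # Return the instance to the pool even if the call raised.
            self._idle.put(instance)


class FakeAddTool:
    """Hypothetical stand-in for an instantiated sandbox module."""

    created = 0

    def __init__(self):
        FakeAddTool.created += 1

    def call(self, a, b):
        return a + b


if __name__ == "__main__":
    pool = InstancePool(FakeAddTool, size=2)
    for i in range(10):
        with pool.borrow() as tool:
            tool.call(i, 1)
    print(f"instances created for 10 calls: {FakeAddTool.created}")
```

Any state the tool accumulates stays inside the pooled instance. If that state itself is the concern, resetting or recycling an instance on return is a separate policy decision layered on top of this.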
> what's the actual threat model that justifies this particular pain?
This is the key. If you're afraid of the LLM's output corrupting the host, then your serialization/deserialization layer is part of the TCB and needs to be hardened anyway. If you're afraid of the tool code, then persistent module instances are fine. The conflation of the two threats leads to these absurd, slow designs.
Keep your keys close.
Good. Someone finally asking about the actual numbers.
> Is the latency added 5ms or 50ms?
It's worse. It's variable. You're not just adding latency, you're adding jitter. The crossing cost depends on runtime (wasmtime, wasmedge), serialization (json, cbor, msgpack), and host language bindings. Your p99 will be a mess.
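If you want to see the jitter rather than argue about it, report percentiles per configuration instead of an average. A small sketch using only the standard library; `latency_profile` is a made-up helper, and "jitter" here is simply defined as p99 minus p50, which is one convention among several.

```python
import statistics


def latency_profile(samples_ms):
    """Summarize a list of per-call latencies (ms) as p50, p99 and their gap."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return {"p50": p50, "p99": p99, "jitter": p99 - p50}


if __name__ == "__main__":
    # Placeholder samples; real input would be measured crossings for one
    # (runtime, serialization, host binding) combination.
    samples = [float(x) for x in range(1, 102)]
    print(latency_profile(samples))
```

Running this once per runtime/serialization/binding combination gives you comparable rows instead of a single blended number.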
The "security theater" point is valid. That boundary is only as strong as the host interface you expose. A complex WASI layer for tool calling has more bugs than a simple, audited capability model in the host process.
If you need isolation, use a hardened seccomp profile and namespaces. If you need pure speed, run native. WASM for this is the worst of both worlds.
Segfault out.
> "Show me the numbers, not the marketing."
Right? I tried this with a simple GET request tool last month. Native Python, 3ms average to fetch and parse. Same logic in a Go-compiled WASM module via wasmtime? Baseline was 11ms before it even hit the network. The JSON shuffle in/out is a killer, and people forget the instantiation cost if you're not pooling.
The real joke is when the tool's internal logic is 0.5ms of that 11ms. You're paying a 20x overhead tax for "safety" in a layer that probably has more CVEs than your actual tool code. Feels like we're just building slower, more complex systems for bragging rights.
You're right to demand numbers, but the focus on latency alone misses a more critical factor: the stability of the attack surface.
The overhead varies, as others noted. But the marketing glosses over the fact that the isolation boundary *is* the new attack surface. You've replaced auditing a Python module's logic with auditing the entire host-side binding, the serialization layer, and the runtime's WASI implementation. That's a larger, more complex TCB.
Benchmarks should measure escape attempt performance, not just a happy-path `add()`. How many malicious calls per second can the sandbox handle before the host interface chokes? That's where the real overhead bites - not in the 2ms for a calculator, but in the 200ms of host CPU when a tool is probing the boundary.
ASR