Hey everyone! Been diving deep into our agent isolation problem lately, and I keep coming back to a thought: if we're serious about containment, maybe each discrete *step* in an agent's workflow deserves its own sandbox, not just the whole agent. The blast radius from a single container escape feels too big. So, I've been prototyping with gVisor.
The idea is to treat each agent step (e.g., `collect_logs`, `parse_data`, `upload_results`) as a separate container, but each one runs inside its own independent gVisor sandbox. This means even if an exploit breaks out of the gVisor kernel for *one* step, the other steps remain isolated. It's a "defense in depth" move for the pipeline itself.
Here's a stripped-down example of how you might structure the pod spec for a two-step agent. Notice the `runtimeClassName: gvisor` set once at the pod level:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: isolated-step-agent
spec:
  runtimeClassName: gvisor # Applies to all containers, but each gets its own sandbox
  containers:
    - name: agent-step-collect
      image: myregistry/collect:latest
      command: ["/bin/collect.sh"]
      volumeMounts:
        - name: shared-data
          mountPath: /data/input
    - name: agent-step-process
      image: myregistry/process:latest
      command: ["/bin/process.sh"]
      volumeMounts:
        - name: shared-data
          mountPath: /data/input
        - name: processed-data
          mountPath: /data/output
  volumes:
    - name: shared-data
      emptyDir: {}
    - name: processed-data
      emptyDir: {}
```
The key is that `/data/input` is shared via `emptyDir`, but the containers are in distinct gVisor sandboxes. Communication is purely via that mounted volume. No direct network or IPC between them.
Performance? There's a hit, for sure. Each sandbox has its own kernel emulation overhead. For I/O-heavy steps, you might see higher latency compared to plain containers. But for many security-sensitive agent tasks, the trade-off feels worth it. The real security delta from ordinary containers is substantial—you're adding a well-audited userspace kernel that intercepts and filters syscalls, which stops many container escape routes cold.
I'm still tuning the `runsc` configuration for optimal balance. Anyone else trying something similar? Would love to compare notes on configs or see if others are layering this with network policies or SELinux for even tighter control.
~jay
I've been experimenting with a similar architecture for some of my adversarial test pipelines. Your pod spec is a good start, but the main friction I've hit is the orchestration overhead. Spinning up a fresh gVisor sandbox per micro-step introduces noticeable latency, especially for short-lived tasks.
One nuance: you must be extremely careful with that shared volume. If `agent-step-collect` writes poisoned or malformed data to `/data/input`, the parsing step, while in a separate sandbox, is still going to process it. The isolation is against host kernel escape, not against data-based attacks between steps. You might need a sanitization step or immutable snapshots.
Have you done any benchmarking on the performance cost versus the security gain? I'm curious if the trade-off favors this for all agents, or just for those handling untrusted data sources.
theory meets practice
Interesting angle, but the overhead seems nuts for most real workloads. If your agent step is just curling an API and spitting out JSON, you're adding what, 100ms+ of sandbox spin-up per step? That adds up fast.
Also, `runtimeClassName: gvisor` at the pod level? I've had weird issues with that on k3s - sometimes the second container inherits a wonky network namespace from the first. Safer to declare it per container, even if it's verbose.
The shared volume is the real problem though, like user37 said. If step one gets popped and writes a malicious shared library to the volume, step two's gVisor sandbox won't save it from an `LD_PRELOAD` attack. You've just traded the kernel threat for a data dependency threat. Feels like you'd need a full-blown workflow engine with explicit data handoffs, not a pod with a mount.
Oh, this is such a great direction to be thinking in! You're absolutely right about treating the *step* as the security boundary. I've been pushing my Nano Claw agents toward a similar model, but using a more event-driven approach where each step is a separate, short-lived function that spins up in its own sandboxed environment. The pod spec you've started is the right foundational idea.
That `runtimeClassName: gvisor` at the pod level is smart, but I'd mirror what others hinted at - I've found it more reliable to declare it per container definition, even though it's a bit repetitive. It prevents any weird inheritance issues, especially with network and PID namespaces.
The shared volume is the real trick, isn't it? My current hack is to make the volume mount read-only for the second step, and have the first step output to a unique, timestamped subdirectory. The second step then reads from that specific, immutable path. It's not perfect, but it prevents a compromised first step from overwriting the binaries the second step will use. Have you considered something like a small init container that sets up a fresh, empty volume for each step?
~Ella
Makes sense. But you cut off the pod spec example mid-sentence after the volume mount. Can you post the full yaml? Specifically how you wire up the shared-data volume and the second container.
Trying to replicate this on a Pi cluster with k3s and need to see the complete structure.
Ah right, sorry about that! The full spec got lost in my paste. Here's the complete structure I'm using for a two-step collector. The key is making the volume read-only for the second container.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: agent-step-example
spec:
  # runtimeClassName is a PodSpec field; the Container type has no such
  # field, so it is set once here instead of on each container.
  runtimeClassName: gvisor
  containers:
    - name: agent-step-collect
      image: collector:latest
      command: ["/bin/collect.sh"]
      volumeMounts:
        - name: shared-data
          mountPath: /data/output
      securityContext:
        runAsUser: 1001
    - name: agent-step-parse
      image: parser:latest
      command: ["/bin/parse.sh"]
      volumeMounts:
        - name: shared-data
          mountPath: /data/input
          readOnly: true
      securityContext:
        runAsUser: 1002
  volumes:
    - name: shared-data
      emptyDir: {}
```
Note the different mount paths and the `readOnly: true` on the second mount. This at least prevents step two from corrupting the volume for step one if things go sideways.
On a Pi cluster with k3s, just make sure you've got the gVisor runtime class configured properly. I found I needed to tag my images for arm64 explicitly.
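For anyone who hasn't set that up yet, a minimal sketch of the RuntimeClass object follows. It assumes the node's containerd config already registers a handler called `runsc` (the name the gVisor docs conventionally use); if your handler is named differently, `handler` has to match it.

```yaml
# Cluster-scoped object that maps the name used in pod specs ("gvisor")
# to a container runtime handler configured on the node.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor        # the value pods reference in spec.runtimeClassName
handler: runsc        # assumption: must match the handler name in containerd's config
```

Pods that reference a RuntimeClass the node can't satisfy will fail to start, so it's worth checking this before debugging anything else.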
--Jenna
Exactly, the data dependency threat is the real killer. You've sandboxed the kernel, but the actual attack surface just shifted sideways.
Your benchmarking point is crucial. The overhead for short steps *is* punishing, but that's the wrong metric. The trade-off isn't speed vs. security, it's containment granularity vs. complexity. The real question is whether the *blast radius* of a sandbox escape justifies the cost. For a step handling, say, raw PDFs from an untrusted source? Maybe. For sanitizing internal logs? Probably not.
My gripe with the immutable snapshot idea is it's still a serialized data pipe. A poisoned payload can still propagate - it just does it through the snapshot. The next step's gVisor won't save its userspace from parsing a crafted file that triggers a memory corruption bug in the parser binary itself. We're just swapping kernel exploits for application-level ones.
So, are we just building a more expensive way to chain vulnerable processes?
-- sim
Sandbox per step is clever, but the kernel is rarely the target. The containers share the same pod network. If one step gets popped, it can pivot and MITM the other's traffic. You're still trusting the orchestrator's network policy, which is probably wide open.
Also, `runtimeClassName: gvisor` at the pod level *doesn't* give each container its own sandbox instance. That's a common misunderstanding. They share the same sentry. You'd need separate pods, and then you're back to orchestration hell.
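On the network point, here is a rough sketch of a default-deny policy for the separate-pods layout. NetworkPolicy selects pods, so it can't separate two containers that share one pod's network namespace; it only helps once each step is its own pod. The `app: agent-step` label is a made-up example, and the policy only has an effect if the cluster's network plugin enforces NetworkPolicy.

```yaml
# Deny all ingress and egress for pods carrying the (hypothetical) label
# app=agent-step. Listing a policy type with no matching rules blocks
# that direction entirely for the selected pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-agent-steps   # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: agent-step              # hypothetical label on every step pod
  policyTypes:
    - Ingress
    - Egress
```

Steps that genuinely need to reach an API would then get a narrower allow rule layered on top, rather than starting from wide open.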
Right, using `runtimeClassName: gvisor` at the pod spec level doesn't give each container its own independent sandbox instance. They share the same sentry process. You need separate pods for true kernel isolation between steps, which gets messy fast with coordination.
Have you looked at creating a `Job` or `CronJob` for each step instead? Each one spins up its own pod with a gVisor runtime, and you pass data between them using a read-write-many volume or a message queue. It adds orchestration latency, but the isolation boundary is actually where you want it.
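A minimal sketch of what one such step could look like as a Job, reusing the collector image from earlier in the thread. The PVC name `agent-handoff` and the Job name are hypothetical, and the claim would need an access mode that fits however the next step's pod gets scheduled.

```yaml
# One Job per step: the Job's pod gets its own gVisor sandbox because
# runtimeClassName is set on the pod template.
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-step-collect          # hypothetical name
spec:
  backoffLimit: 0                   # don't silently retry a failed step
  template:
    spec:
      runtimeClassName: gvisor
      restartPolicy: Never
      containers:
        - name: collect
          image: myregistry/collect:latest
          command: ["/bin/collect.sh"]
          volumeMounts:
            - name: handoff
              mountPath: /data/output
      volumes:
        - name: handoff
          persistentVolumeClaim:
            claimName: agent-handoff   # hypothetical pre-existing PVC
```

The parse step would be a second Job of the same shape, mounting the same claim with `readOnly: true`.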
Kenji
Oh wow, that's a really important clarification about the shared sentry process, thank you. I'd been thinking of the pod as the boundary, but if the containers share the sandbox instance, that's a much weaker separation.
Using separate Jobs is an interesting idea to get the true isolation. But you mentioned the orchestration getting messy - how do you reliably trigger the second Job only after the first one finishes and writes its data to the volume? Is there a clean pattern for that, or does it require another controller?
The Job pattern's trigger problem is why you need a workflow engine. Argo Workflows or Tekton handle step dependencies and ordering, plus they can enforce runtimeClass per step. But now you've added a whole new moving part.
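For a sense of scale, a two-step chain in Argo Workflows might look roughly like the sketch below. Treat it as untested: it assumes Argo is installed, leans on the `podSpecPatch` field to push `runtimeClassName` into every step's pod, and leaves out the data handoff (artifacts or a volume) entirely. Image names are the ones from the first post.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: agent-steps-        # hypothetical name prefix
spec:
  entrypoint: pipeline
  # Assumption: patch applied to each step's pod spec so every step
  # pod runs under the gvisor RuntimeClass.
  podSpecPatch: '{"runtimeClassName": "gvisor"}'
  templates:
    - name: pipeline
      steps:                        # outer list items run in order
        - - name: collect
            template: collect
        - - name: parse             # starts only after collect succeeds
            template: parse
    - name: collect
      container:
        image: myregistry/collect:latest
        command: ["/bin/collect.sh"]
    - name: parse
      container:
        image: myregistry/process:latest
        command: ["/bin/process.sh"]
```

Each step runs as its own pod here, which is where the per-step boundary comes from; the cost is the controller you now have to run and secure.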
If you're not at that scale, a simple initContainer setup can give you a cleaner pod-based sequence with separate sentries, but you lose parallel step execution.
Example for a linear chain:
```
spec:
  # runtimeClassName is only valid at the pod spec level, not per container
  runtimeClassName: gvisor
  initContainers:
    - name: step-one
      image: collector:latest
      volumeMounts: [ ... ]
  containers:
    - name: step-two # starts only after step-one has run to completion
      image: parser:latest
      volumeMounts: [ ... readOnly: true ... ]
```
Each initContainer runs to completion, then the main container(s) start. Each gets its own gVisor sentry because the pod lifecycle creates them sequentially. It's still one pod, but the isolation is better than two containers running side-by-side.
automate, audit, repeat
That initContainer trick is clever. I'd been so focused on sidecars I didn't think to use them for sequencing. You're right, each initContainer would get its own fresh sentry when it spins up, so the isolation is there.
The caveat I see is resource accounting. Even though initContainers run sequentially, the scheduler sizes the pod by the larger of the biggest initContainer request and the sum of the main containers' requests, per resource. That makes scheduling trickier if one step needs a big memory slice and the next needs high CPU: you're essentially reserving the per-resource maximum for the pod's whole lifetime.
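To put numbers on it, here's a small sketch with made-up requests (the pod name and the values are hypothetical; the images are the ones used upthread). As I understand the scheduling rule, the effective request per resource is the maximum of the highest initContainer request and the sum across the main containers.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-accounting-example     # hypothetical name
spec:
  runtimeClassName: gvisor
  initContainers:
    - name: step-one
      image: collector:latest
      resources:
        requests:
          cpu: 100m
          memory: 2Gi               # memory-hungry, but only while it runs
  containers:
    - name: step-two
      image: parser:latest
      resources:
        requests:
          cpu: "2"                  # CPU-hungry
          memory: 256Mi
# Effective pod request used for scheduling:
#   cpu    = max(100m, 2)    = 2
#   memory = max(2Gi, 256Mi) = 2Gi
```

So the node has to fit 2 CPUs and 2Gi at once, even though no single step ever uses both.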
Also, debugging gets weirder when a step fails. The pod status just shows "Init:Error" and you have to dig into which specific initContainer crashed. Not a deal-breaker, but it adds friction compared to independent Jobs.
trace -e all