Having spent the last 72 hours instrumenting the NemoClaw 2.8 runtime with a custom eBPF program to trace plugin lifecycle events, I'm compelled to ask: does the new plugin sandbox actually enforce meaningful isolation, or is it merely a sophisticated namespace wrapper that leaves the kernel attack surface wide open?
The marketing materials promise "sub-process isolation with hardware-backed boundaries," but my kprobe instrumentation tells a more nuanced story. Let's dissect the actual enforcement mechanisms, as observable from the kernel.
**The Sandbox Model: cgroups v2 & Seccomp-BPF**
NemoClaw's plugin sandbox is built atop a cgroups v2 subtree, which is a good start for resource containment. However, the critical isolation vector is the syscall filter. Their default seccomp-bpf profile is, frankly, permissive for a security-focused runtime.
Consider this default filter snippet they've published (annotated with my critiques):
```c
// Example rule from NemoClaw's 'restricted' profile
struct sock_filter filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    // Allows clone, fork, vfork - necessary, but opens door to namespace escapes if combined with unshare.
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_fork, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    // Allows unshare unconditionally (no flag inspection, no CLONE_NEWUSER requirement) - this is a potential flaw.
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_unshare, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    // ... more allows
    // As written, no errno value is OR'd into the low 16 bits, so a denied
    // call is skipped and returns 0 instead of failing visibly.
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO),
};
```
The simultaneous allowance of `clone`/`fork` and `unshare` is a known risk pattern, and the published rules never inspect the flags argument of either call. Without mandatory user namespace mapping (which their documentation says is "optional for compatibility"), a malicious plugin can create new mount, network, and other namespaces, potentially pivoting to the host. I've verified via ftrace that the `unshare(CLONE_NEWNS)` call *succeeds* under the default "restricted" profile if the plugin process has `CAP_SYS_ADMIN` within its user namespace, which it often does.
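To make the flag-inspection gap concrete, here is a minimal sketch of a seccomp-bpf fragment that lets `unshare` through only when no namespace-creating flag is set. It is not NemoClaw's profile: the names `ns_filter`, `NS_FLAGS`, `SANDBOX_ARCH`, and `try_unshare_under_filter` are made up for illustration, it assumes a little-endian x86-64 or aarch64 target, and it is allow-by-default purely to keep the example short. A real profile would also need the same check on `clone` flags and would have to deal with `clone3`, whose flags sit behind a pointer that seccomp cannot dereference.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/sched.h>   /* CLONE_NEW* flag values */
#include <linux/seccomp.h>

#if defined(__x86_64__)
#define SANDBOX_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SANDBOX_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Flags that create a new namespace. CLONE_NEWTIME (kernel 5.6+) should be
 * added where the installed headers define it. */
#define NS_FLAGS (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC | \
                  CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET)

/* Hypothetical filter: deny unshare() with any namespace flag, allow the rest. */
static struct sock_filter ns_filter[] = {
    /* Kill on an unexpected architecture: syscall numbers differ per ABI. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SANDBOX_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* Anything other than unshare jumps straight to the final ALLOW. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_unshare, 0, 3),
    /* Low 32 bits of args[0] (little-endian layout) hold the flags. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, NS_FLAGS, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter in a forked child and attempts unshare(flags) there.
 * Returns the child's errno (0 on success) or -1 if setup failed. */
int try_unshare_under_filter(int flags)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_fprog prog = {
            .len = sizeof(ns_filter) / sizeof(ns_filter[0]),
            .filter = ns_filter,
        };
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        _exit(syscall(SYS_unshare, flags) == 0 ? 0 : errno);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

With this filter attached, `try_unshare_under_filter(CLONE_NEWNS)` should come back with `EPERM` regardless of the capabilities the process holds, which is the property the default profile lacks.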
**Kernel Telemetry Gap**
My primary concern is the lack of built-in runtime telemetry for sandbox integrity events. Unlike IronClaw, which emits audit events to a ring buffer accessible via eBPF, NemoClaw's plugin boundary crossings are largely silent. You must attach your own tracing:
* Use a kprobe on `__seccomp_filter` to log filtered syscalls.
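* Attach an eBPF program to the `task:task_rename` tracepoint (it lives under `task`, not `cgroup`), alongside `sched:sched_process_fork` and `cgroup:cgroup_attach_task`, to monitor process lineage breaks.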
* Trace `bpf_prog_load` to see if plugins attempt to load their own eBPF code (a major escape vector).
Without this instrumentation built-in, you're operating blind. The "sandbox promise" is only as real as your ability to verify its enforcement continuously.
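Kernel probes aside, a coarse user-space check can at least confirm that a filter is attached to a plugin process at all. The sketch below reads the `Seccomp:` field from `/proc/<pid>/status` (0 = disabled, 1 = strict, 2 = filter). The helper names `parse_seccomp_mode` and `seccomp_mode_of` are illustrative, and this only shows that some filter is present, not what it allows, so it complements the tracing above rather than replacing it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Returns the value of the "Seccomp:" line in a /proc/<pid>/status dump,
 * or -1 if the field is absent. "Seccomp_filters:" is deliberately skipped. */
int parse_seccomp_mode(const char *status_text)
{
    static const char key[] = "Seccomp:";
    const char *line = status_text;

    while (line && *line) {
        if (strncmp(line, key, sizeof(key) - 1) == 0)
            return (int)strtol(line + sizeof(key) - 1, NULL, 10);
        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline to the next field */
    }
    return -1;
}

/* Reads /proc/<pid>/status and returns the seccomp mode, or -1 on error. */
int seccomp_mode_of(pid_t pid)
{
    char path[64], buf[8192];

    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

A supervisor could call `seccomp_mode_of()` on every PID listed in the plugin cgroup's `cgroup.procs` and flag anything that does not report mode 2.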
**Comparison to Sibling Runtimes**
* **NanoClaw:** Uses a hypervisor-based microVM barrier. The isolation is stronger, but the plugin I/O overhead is measurable via my eBPF latency histograms. NemoClaw trades some hardness for performance.
* **IronClaw:** Employs a mandatory SELinux layer with a distinct type for each plugin, plus a deny-by-default seccomp policy. The policy is static and more thorough, but less flexible for dynamic plugin ecosystems.
**The Verdict**
The sandbox is "real" in the sense that it uses kernel primitives, but its default configuration leaves several attack surfaces unsealed. The promises hinge on the operator supplying a rigorous seccomp-bpf profile and enabling user namespace isolation, and neither is the default. For a low-trust plugin ecosystem, this is insufficient out of the box.
I'm now working on a reference eBPF-based monitor that tracks namespace transitions and syscall anomalies across the plugin cgroup. Without such instrumentation, you cannot assert the sandbox's integrity under active attack.
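As a stopgap short of an eBPF monitor, namespace transitions can be approximated from user space by comparing the inode behind `/proc/<pid>/ns/<type>` against a recorded baseline. This is a polling sketch with hypothetical helper names (`parse_ns_link`, `read_ns_inode`, `ns_changed`); it is racy by nature and will miss short-lived processes, which is exactly why kernel-side tracing is the better long-term answer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Parses a namespace link target such as "mnt:[4026531840]".
 * Returns 0 and stores the inode on success, -1 on malformed input. */
int parse_ns_link(const char *target, const char *type, unsigned long *inode)
{
    size_t tlen = strlen(type);
    char *end;

    if (strncmp(target, type, tlen) != 0 ||
        target[tlen] != ':' || target[tlen + 1] != '[')
        return -1;
    unsigned long v = strtoul(target + tlen + 2, &end, 10);
    if (end == target + tlen + 2 || *end != ']')
        return -1;
    *inode = v;
    return 0;
}

/* Resolves /proc/<pid>/ns/<type> and extracts the namespace inode. */
int read_ns_inode(pid_t pid, const char *type, unsigned long *inode)
{
    char path[64], target[64];

    snprintf(path, sizeof(path), "/proc/%d/ns/%s", (int)pid, type);
    ssize_t n = readlink(path, target, sizeof(target) - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    return parse_ns_link(target, type, inode);
}

/* Returns 1 if the process left the baseline namespace, 0 if not, -1 on error. */
int ns_changed(pid_t pid, const char *type, unsigned long baseline)
{
    unsigned long now;

    if (read_ns_inode(pid, type, &now) != 0)
        return -1;
    return now != baseline;
}
```

Recording the `mnt`, `net`, and `user` inodes when a plugin starts and re-checking them with `ns_changed()` gives a rough signal that an `unshare` or `setns` went through.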
bpf_trace_printk("Hello from kernel")
Oh, I'm glad someone's looking at the actual kernel-level enforcement. I've been testing their rate-limiting plugin in the sandbox and noticed something related.
You're spot on about the default seccomp profile. It lets plugins make `clone()` calls. I found a plugin could spawn a subprocess that inherited a socket FD from the parent gateway process. That's not a direct kernel escape, but it's a weird side channel. The sandbox stops the plugin, but the child process lives on with that inherited descriptor.
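One hedge against that kind of descriptor leak, assuming the gateway starts plugins through `exec` (not confirmed for NemoClaw), is to mark gateway sockets close-on-exec so they never reach the plugin in the first place. The helpers `mark_cloexec` and `is_cloexec` below are illustrative names. Note that this does nothing for descriptors the plugin legitimately holds and then passes to a child via a plain `fork`.

```c
#include <fcntl.h>
#include <unistd.h>

/* Sets FD_CLOEXEC on fd while preserving its other descriptor flags.
 * Returns 0 on success, -1 on error. */
int mark_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Returns 1 if fd is close-on-exec, 0 if not, -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

Creating the socket with `SOCK_CLOEXEC` (or accepting with `accept4(..., SOCK_CLOEXEC)`) avoids the window between creation and the `fcntl` call.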
My bigger worry is the OAuth token validation plugin API. A plugin can request a token 're-verification' which actually triggers a callback *outside* the seccomp filter's scope, back into the gateway core. If the syscall filter is too loose, couldn't a malicious plugin use that callback path to manipulate the gateway's own auth state? Makes me think the sandbox is only as strong as the API boundaries around it.
I'd love to see your eBPF trace if you're sharing. The default profile definitely needs more `execve` and `ptrace` restrictions.