I've been experimenting with seccomp-bpf for isolating a simple agent runtime I'm prototyping, and I've hit what seems like a logical endpoint: a filter that permits only `read`, `write`, `exit`, and `futex`. The idea was to create a bare-minimum sandbox for a process that only needs to perform I/O and then terminate cleanly, perhaps a network proxy or a data filter.
On paper, it works. The agent loads, it can read from stdin and write to stdout/stderr, and it can exit. But I'm immediately suspicious. This feels *too* minimal, even for a highly constrained workload. I've been methodically testing edge cases, and I'm already seeing some puzzling behavior, particularly around the `futex` syscall. My understanding is that `futex` (Fast Userspace muTEX) is crucial for thread synchronization, but my test program is single-threaded. I suspect higher-level libraries (like glibc's `malloc` or even some parts of the Rust runtime, though I'm testing in C for now) are using it internally for things like locking memory arenas or managing thread-local storage, even in a nominally single-threaded context.
Here's the filter I constructed using `libseccomp`:
```c
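#include <seccomp.h>  // link with -lseccomp

// Default action is KILL: anything not explicitly allowed below is fatal
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
if (ctx == NULL) { /* handle error */ }
// Allow the absolute bare essentials; each call returns 0 or a negative errno
int rc = 0;
rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);
if (rc != 0 || seccomp_load(ctx) != 0) { /* handle error: filter is not active */ }
seccomp_release(ctx);  // frees the context; a loaded filter stays in the kernel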
```
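The `futex` rule above matches every operation the syscall multiplexes. As a possible tightening, here is a minimal sketch that allows only `FUTEX_WAIT` and `FUTEX_WAKE` by inspecting the op argument. It is written as raw BPF via `prctl` rather than libseccomp so it builds without the library; with libseccomp the same idea would likely be an argument comparison such as `SCMP_A1(SCMP_CMP_MASKED_EQ, ...)`. The helper names (`install_narrow_filter`, `probe_futex_op`, `PROBE_UNSUPPORTED`) are hypothetical, and the sketch assumes a little-endian target and skips the `seccomp_data.arch` check a real filter should perform first.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <signal.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/futex.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>

#define PROBE_UNSUPPORTED (-1) /* hypothetical sentinel: filter could not be installed */

/* If the accumulator equals v, return ALLOW; otherwise skip to the next check. */
#define ALLOW_IF_EQ(v) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (v), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
#define KILL BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL)

static int install_narrow_filter(void)
{
    struct sock_filter prog[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        ALLOW_IF_EQ(SYS_read),
        ALLOW_IF_EQ(SYS_write),
        ALLOW_IF_EQ(SYS_exit),
        /* Not futex: fall into KILL. futex: skip it and inspect the op. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_futex, 1, 0),
        KILL,
        /* Low 32 bits of args[1] (the futex op); little-endian assumed. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
        /* Strip FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME before comparing. */
        BPF_STMT(BPF_ALU | BPF_AND | BPF_K, (unsigned int)FUTEX_CMD_MASK),
        ALLOW_IF_EQ(FUTEX_WAIT),
        ALLOW_IF_EQ(FUTEX_WAKE),
        KILL,
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Issues one futex call with the given op in a filtered child. Returns 0 if
 * the child survived, the fatal signal number if it was killed, or
 * PROBE_UNSUPPORTED if the filter could not be installed. */
static int probe_futex_op(int op)
{
    /* Value 1, so FUTEX_WAIT with val 0 fails with EAGAIN instead of blocking. */
    static unsigned int word = 1;
    pid_t pid = fork();
    if (pid < 0)
        return PROBE_UNSUPPORTED;
    if (pid == 0) {
        if (install_narrow_filter() != 0)
            syscall(SYS_exit_group, 127); /* no filter active yet */
        syscall(SYS_futex, &word, op, 0, NULL, NULL, 0);
        syscall(SYS_exit, 0); /* the only exit path the filter allows */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return PROBE_UNSUPPORTED;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return WEXITSTATUS(status) == 127 ? PROBE_UNSUPPORTED : WEXITSTATUS(status);
}
```

Calling `probe_futex_op(FUTEX_REQUEUE)` should report `SIGSYS`, since that op is not on the list, while `FUTEX_WAKE` (with or without `FUTEX_PRIVATE_FLAG`) should pass. Whether two ops are actually enough for a given libc is a separate question this sketch does not answer.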
My immediate questions for the community are:
* **Is this futex allowance a dangerous gap?** I'm wary because the rule matches every `futex` operation, not just `FUTEX_WAIT` and `FUTEX_WAKE`. Those two act on a user-space address but go through kernel-side wait queues, and the syscall also multiplexes more complex operations such as requeueing and priority-inheritance locks. In a more adversarial scenario (like an untrusted workload), could this be leveraged in a chain to manipulate kernel state, even with no other syscalls? Or is its utility for an escape negligible without `clone`, `mmap`, etc.?
* **What about `exit_group`?** I've only allowed `exit`. If my process *were* multi-threaded (it's not, but I'm thinking defensively), would the lack of `exit_group` cause all other threads to remain live after the main thread exits, creating a zombie-like state? Should `exit_group` always be paired with `exit` in a real-world filter?
* **Are there hidden dependencies for `read`/`write`?** They seem straightforward, but do they implicitly rely on other syscalls for buffer management or signal handling? For instance, if a signal interrupts a `write`, does the C library restart it using the same syscall, or could it involve something else?
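To make the last question concrete, here is a minimal sketch of the retry pattern a caller would typically need around `write`. It uses only the `write` syscall, so it stays inside the filter; `write_all` is a hypothetical helper name. It covers the two cases visible to user space: a short write, and a `-1`/`EINTR` return when a handler installed without `SA_RESTART` interrupts the call. One related point to check separately: returning from any signal handler normally goes through `rt_sigreturn`, which is not among the four allowed syscalls.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Writes all `len` bytes from `buf` to `fd`, retrying on EINTR and on short
 * writes. Returns 0 on success, -1 on any other error (errno set by write). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted before any byte was written: retry */
            return -1;
        }
        p += n;           /* short write: advance past what was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

For example, `write_all(STDOUT_FILENO, msg, strlen(msg))` would replace a bare `write` call. A matching loop for `read` follows the same shape, with a return of `0` treated as end of input.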
The goal of this exercise is to understand the true minimum viable seccomp profile. I'm trying to dissect what each permitted syscall truly "buys" you in terms of functionality and, conversely, what risk it introduces. My current hypothesis is that `futex` is the most complex and potentially risky of these four, but I lack the low-level knowledge to confirm that. Has anyone else built a similarly restrictive profile and stress-tested it against real (but benign) workloads? I'm particularly interested in any observed crashes or hangs that trace back to a missing, seemingly unrelated syscall.
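For tracing a crash back to a specific missing syscall, one approach is a small probe that runs a single raw syscall in a filtered child and reports how the child died. The following is a minimal sketch under several assumptions: it rebuilds the four-syscall allow-list as raw BPF via `prctl` (so it needs no libseccomp), it omits the `seccomp_data.arch` check a production filter should include, and the names `install_minimal_filter`, `probe_syscall`, and `PROBE_UNSUPPORTED` are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <signal.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>

#define PROBE_UNSUPPORTED (-1) /* hypothetical sentinel: filter could not be installed */

/* Allow read, write, exit, futex; kill on anything else. */
static int install_minimal_filter(void)
{
    struct sock_filter prog[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Each match jumps forward to the final ALLOW instruction. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read,  4, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit,  2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_futex, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Runs syscall `nr` (all arguments zero) in a filtered child. Returns 0 if
 * the child exited cleanly, the fatal signal number if it was killed, or
 * PROBE_UNSUPPORTED if the filter could not be installed. */
static int probe_syscall(long nr)
{
    pid_t pid = fork();
    if (pid < 0)
        return PROBE_UNSUPPORTED;
    if (pid == 0) {
        if (install_minimal_filter() != 0)
            syscall(SYS_exit_group, 127); /* no filter active yet */
        syscall(nr, 0, 0, 0, 0, 0, 0);    /* the syscall under test */
        syscall(SYS_exit, 0);             /* the only exit path the filter allows */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return PROBE_UNSUPPORTED;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return WEXITSTATUS(status) == 127 ? PROBE_UNSUPPORTED : WEXITSTATUS(status);
}
```

Because `exit_group` is not on the allow-list, `probe_syscall(SYS_exit_group)` should come back as `SIGSYS`, while `probe_syscall(SYS_exit)` should return 0. Looping the probe over the syscalls that `strace` shows for a real workload would give a rough map of what the profile is missing.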