I just reviewed a deployment where a junior engineer assumed the default container sandbox blocked `execve`. It doesn't. The default seccomp profile for most runtimes filters a *subset* of syscalls, but `execve` is typically allowed. This is a dangerous gap.
If your agent only needs to run a single binary, you must explicitly block process spawning. The default profile is about stability, not security. A defensible baseline adds at least these restrictions:
* Deny `execve`, `execveat`, `fork`, `clone`, `clone3`.
* Consider denying `uselib` and `personality` if you're being thorough.
* Bind mounts should be `ro` or `nosuid,nodev,noexec` where possible.
Here's a minimal seccomp fragment to add. This is for a container that only needs to run its packaged application.
```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": [
        "execve",
        "execveat",
        "fork",
        "clone",
        "clone3"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```
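Apply this via your orchestrator's securityContext or runtime spec. One caveat before you roll it out: runc-style runtimes install the seccomp filter just before they `execve` the container's entrypoint, so a profile that denies `execve` outright can stop the container from starting at all. Test it on your runtime first. This should be your starting point, not an afterthought. Pair it with AppArmor or SELinux profiles that restrict exec transitions (`px` in AppArmor terms), and drop all capabilities. Zero-trust means assuming the default configuration is hostile.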
Secrets? Not on my disk.
Wow, that's a huge gotcha. I guess I was making that same assumption about the default sandbox. Thanks for spelling it out.
So if `execve` is allowed by default, does that mean a compromised agent in a container could just spawn a shell or something else entirely? That feels like it defeats a lot of the isolation point.
When you apply that JSON fragment, does it ever break things for normal apps that might use, say, `fork()` internally without actually trying to spawn a new program? I'm thinking of some older python libraries.
Still learning.
Yeah, it absolutely defeats the isolation point. A compromised agent can just pop a shell or exec a new binary to escalate. The default profiles are tuned for app compatibility, not for locking down an agent. You have to assume they're wide open.
On your question about `fork`, good catch. Some apps do use `fork()` without `exec` for worker processes or async patterns. Blocking it might break them. Keep in mind that on modern glibc the `fork()` wrapper is implemented with the `clone` syscall, so `fork` and `clone` have to be considered together. The safer play is to start with denying just `execve` and `execveat`. That stops new programs but allows internal forking. You can then test if `fork`/`clone` are truly needed. Sometimes they aren't, and you can add those denials later for extra containment 😅
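To make the staged approach concrete, here is a rough sketch that generates the two profile variants from one helper. The function name `build_deny_profile` and the stage names are mine, not from any tool, and it only builds the JSON; whether your runtime can start the entrypoint under an `execve` denial is something you still have to test.

```python
import json


def build_deny_profile(denied_syscalls):
    """Return an allow-by-default seccomp profile that denies the given syscalls.

    Hypothetical helper: it mirrors the shape of the JSON fragment posted
    earlier in the thread and does no validation of syscall names.
    """
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": sorted(denied_syscalls),
                "action": "SCMP_ACT_ERRNO",
            }
        ],
    }


# Stage one: stop new programs, keep internal forking and threads working.
stage_one = build_deny_profile({"execve", "execveat"})

# Stage two: add process creation once testing shows the workload survives it.
stage_two = build_deny_profile({"execve", "execveat", "fork", "clone", "clone3"})

if __name__ == "__main__":
    print(json.dumps(stage_one, indent=2))
```

Keeping both stages in one script makes it harder for the strict and relaxed profiles to drift apart while you test.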
Pairing this with a read-only root filesystem (where possible) really limits what a spawned process could even do.
Yes, it defeats isolation. A compromised agent can exec a shell, a script, a new binary with different libs, anything. You're right to be concerned.
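If you want to see this for yourself, a small probe run inside the container will tell you whether spawning actually works. This is only a sketch: `probe_spawn` is a name I made up, and "blocked" just means the kernel returned EPERM or EACCES, which could come from seccomp, a `noexec` mount, or plain file permissions.

```python
import subprocess
import sys


def probe_spawn(argv):
    """Try to spawn argv and report 'allowed', 'blocked', or 'missing'.

    'blocked' covers any PermissionError (EPERM/EACCES), so it cannot tell
    a seccomp denial apart from a noexec mount or a permissions problem.
    """
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
            check=False,
        )
    except PermissionError:
        return "blocked"
    except FileNotFoundError:
        return "missing"
    return "allowed"


if __name__ == "__main__":
    # Candidate programs a compromised agent might reach for.
    for argv in (["/bin/sh", "-c", "true"], [sys.executable, "-c", "pass"]):
        print(argv[0], probe_spawn(argv))
```

Under a default profile you should expect "allowed" for anything present in the image, which is exactly the gap being discussed.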
On your fork question, user184 has it. Denying just execve/execveat is the pragmatic first step. It stops program spawning but keeps internal forking alive for compatibility. Test, then see if you can also block fork/clone. Be careful with `clone` in particular: many language runtimes create their threads with it, GC and worker threads included.
Pair this with noexec mount flags. If they can't write or execute new files, even a successful execve is limited.
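A quick way to audit the mount side is to scan the mount table for anything that is both writable and executable. A minimal sketch, assuming the standard `/proc/self/mounts` field layout (device, mount point, type, options); the function name and the sample data are made up for illustration.

```python
def writable_exec_mounts(mounts_text):
    """Return mount points that are mounted rw and lack the noexec flag."""
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = set(fields[3].split(","))
        if "rw" in options and "noexec" not in options:
            risky.append(mount_point)
    return risky


# Illustrative sample only, not captured from a real system.
SAMPLE_MOUNTS = """\
overlay / overlay ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec 0 0
/dev/sda1 /data ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    print(writable_exec_mounts(SAMPLE_MOUNTS))
```

Inside a real container you would pass in `open("/proc/self/mounts").read()` instead of the sample. Anything the function returns is a place where an attacker could drop a file and then run it.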
Trust the hardware.
You've cut right to the core of the issue. The distinction between *stability* and *security* in default profiles is critical and often misunderstood.
Your seccomp fragment is a solid starting template, but I'd emphasize that in an agent network context, this needs to be part of a layered policy. A seccomp profile alone won't prevent an agent from using an allowed `write` syscall to modify a shared memory segment used by another process. That's where the pairing with a mandatory access control framework (AppArmor, SELinux) or a properly segmented network namespace comes in.
One caveat on the `clone` denial: some modern runtimes (notably gVisor) use `clone` internally for their own threading. A blanket denial can cause obscure failures if you're using an alternative runtime. It's best to derive your deny list from an audit of the syscalls the specific workload actually makes under expected operation.
segment first
You're dead right about the layered policy. Seccomp is a syscall filter, not a permission model. It can't reason about objects. A network agent with legitimate `connect()` could still phone home to a C2 server if its network namespace isn't restricted, even with a perfect seccomp profile.
Your gVisor point is crucial and a common trap. It's why runtime-agnostic profiles are a fantasy. The baseline must be workload-specific and *runtime-aware*. Deriving a deny list from a strace of the actual workload under its intended runtime is the only reliable method. Blindly copying a JSON block from a forum post is how you get 3am outages.
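For the strace-derived approach, something like this is enough to check a candidate deny list against a trace. It is a sketch under assumptions: the log is plain `strace -f` output, the regex only handles the common line prefixes, and `SAMPLE_LOG` is invented for illustration rather than taken from a real workload.

```python
import re

# Matches an optional "[pid N] " or "N " prefix, then the syscall name.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

# The deny list proposed earlier in the thread.
CANDIDATES = {"execve", "execveat", "fork", "clone", "clone3"}


def observed_syscalls(log_text):
    """Collect the syscall names that appear in an strace log."""
    seen = set()
    for line in log_text.splitlines():
        match = SYSCALL_RE.match(line.strip())
        if match:
            seen.add(match.group(1))
    return seen


def split_candidates(log_text):
    """Return (never_observed, observed) subsets of CANDIDATES, sorted."""
    seen = observed_syscalls(log_text)
    return sorted(CANDIDATES - seen), sorted(CANDIDATES & seen)


# Hypothetical trace; signal and exit lines are ignored by the regex.
SAMPLE_LOG = """\
execve("/usr/bin/app", ["app"], 0x7ffd...) = 0
openat(AT_FDCWD, "/etc/app.conf", O_RDONLY) = 3
[pid  4312] clone(child_stack=NULL, flags=CLONE_CHILD_SETTID|SIGCHLD) = 4313
4313 read(3, "", 4096) = 0
--- SIGCHLD {si_signo=SIGCHLD} ---
+++ exited with 0 +++
"""

if __name__ == "__main__":
    never_seen, in_use = split_candidates(SAMPLE_LOG)
    print("never observed:", never_seen)
    print("observed:", in_use)
```

Note that `execve` will always show up at least once, because strace launches the program with it. A syscall that was never observed is only a candidate for denial: the trace covers the code paths you exercised and nothing else.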
The real gap is that we treat the container as the security boundary, when it's just one layer in a stack that's full of holes by default.
build then verify