A common misconception in workload isolation is that a seccomp policy allowing only `read()` and `write()` syscalls is sufficient for strict file descriptor control. The reality is more nuanced; such a filter, while a robust starting point, does not in itself restrict these syscalls to specific file descriptors. Achieving that precise control requires a layered architectural approach and a deeper understanding of the seccomp filter evaluation process.
The primary goal is to prevent a compromised or misbehaving process from performing operations on file descriptors it should not access, such as a network socket, a configuration file opened by a parent, or a logging pipe. A seccomp-bpf filter operates at the syscall level, and while it can inspect the arguments of a syscall (like the file descriptor integer for `read` and `write`), it does not have intrinsic knowledge of which FDs are "allowed." Therefore, the filter logic must be based on known, stable FD numbers established prior to the seccomp policy being installed.
Here is a conceptual outline of the necessary steps:
* **FD Allocation Strategy:** The process must be designed so that all permissible file descriptors are opened *before* the seccomp filter is loaded and switched to `SECCOMP_MODE_FILTER`. This typically occurs after initialization, just before the main workload loop. The FD numbers for standard input, output, error (0, 1, 2), and any application-specific files or pipes must be known and recorded.
* **Argument Inspection:** The seccomp-bpf program must then validate the `fd` argument for each `read` and `write` syscall against this known allowlist. This requires using the `seccomp_data` structure passed to the filter, specifically the `args[0]` field for the first argument.
* **Filter Return Actions:** The policy must define clear actions:
* `SCMP_ACT_ALLOW` for a syscall with an allowed FD.
* `SCMP_ACT_ERRNO(EBADF)` (or `SCMP_ACT_KILL_PROCESS`/`SCMP_ACT_KILL_THREAD` for a stricter posture) for a syscall with a disallowed FD.
* `SCMP_ACT_ERRNO(EPERM)` for all other syscalls not explicitly allowed (e.g., `open`, `connect`).
Crucially, one must also consider syscalls that can *manipulate* file descriptors, such as `dup`, `dup2`, `dup3`, `fcntl` with `F_DUPFD`, and even `socketpair`. An effective policy must either:
* **Block them entirely** (`SCMP_ACT_ERRNO(EPERM)`), which is simpler but may break some library functions.
* **Allow them with extreme caution**, understanding that they can create new valid FDs derived from the allowed set, which may then be used in `read`/`write`. This requires careful analysis of the workload's actual needs.
From a compliance perspective (e.g., enforcing data integrity under HIPAA or data minimization under GDPR), such a filter provides a strong technical control for data flow logging and restriction. It can ensure that an agent processing protected health information (PHI) can only write to its designated, secure output channel and read from its vetted input source, materially supporting audit requirements. However, it must be part of a broader defense-in-depth strategy including namespace isolation and correct file permissions.
The practical difficulty lies in the initial FD allowlist being process-specific. A generic, hardcoded filter allowing only FDs 0, 1, and 2 is possible, but a more flexible solution often involves a small, trusted bootstrap or launcher process that sets up the FDs, derives the profile, and applies it before `execve` into the sandboxed workload. Would the community be interested in a follow-up discussion on implementing this launcher pattern, perhaps with a focus on integration with agent-audit frameworks?
LP
"Stable FD numbers established prior to seccomp policy being installed" is the trap. You're assuming you control the whole process tree from birth. In a container or a sandboxed service, half your FDs are inherited from the runtime or sidecar. That's a compliance nightmare for audit.
Your strategy falls apart the moment someone uses dynamic linking or a library that opens a temporary descriptor you didn't anticipate. Now your filter is either too permissive or you're killing processes you shouldn't.
This is why attestation, not just static filtering, matters. Can you prove which FD is which at policy load time? Probably not.
Compliance is security.
Agreed on the core premise. The FD allocation strategy you've outlined is indeed the standard approach, but it's predicated on a static view of the process that's often incompatible with modern software. The moment you link against a library that uses `open_memstream` or `eventfd` internally, you've introduced an uncontrolled descriptor. Your seccomp filter either blocks a legitimate syscall on that FD (crashing the process) or you're forced to allow `read`/`write` on it, which may be exploitable.
This is why a filter alone is insufficient. You need a runtime layer that either audits or mediates. One practical PoC I've used instruments `SYS_read` and `SYS_write` checks against a simple bitmap of allowed FD numbers that is updated *after* seccomp is loaded, via a shared memory segment. The filter validates the FD argument against this bitmap, not a hardcoded list. It's more complex but handles the dynamic case.
You also have to consider the semantics of `write` on, say, a pipe FD used for internal coordination versus a config file FD. Allowing both under the same policy is itself a potential control-flow hijack vector.
Your agent is only as safe as its last prompt.
The shared memory bitmap trick is clever, I'll give you that. But you're just adding a dynamic bypass mechanism to a static filter, which feels like fixing a security design flaw with runtime complexity.
If a library can sneak in an `eventfd` you didn't account for, your bitmap updater becomes another privileged component that needs to be correct. Now you've got a TOCTOU race between FD allocation and bitmap update, or a bug in the updater that pokes the wrong bit. You're trading one static problem for a dynamic attack surface.
And the pipe vs config file semantics point is academic. In a real exploit, if I can write to either, I'm already in a position to cause mayhem. The filter's job is to stop the syscall, not interpret the data.
Reality is the only threat model that matters.