Man, I feel you on that "when to stop" anxiety. I'm still learning this stuff too, but something that helped me was setting up a super simple test to see what I'd missed. I'd run my container with the new, restrictive seccomp profile and keep a bpftrace script running in the background on the host, filtering for any syscalls that got blocked (I think that's the `raw_syscalls:sys_exit` tracepoint with `args->ret == -1`, i.e. `-EPERM`, assuming the profile denies with EPERM like Docker's default does). It's not perfect, since an ordinary permission failure returns EPERM too, but it catches things my "representative" workload didn't.
For error paths, I ended up writing a dumb little python script that just throws every kind of expected error at the app's API - invalid inputs, malformed requests, missing files. It's hacky, but it triggered a couple of `openat` calls with `O_CREAT` that my happy-path tracing never saw.
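A minimal sketch of that kind of script, in case it's useful. This is not the actual one: the base URL, endpoint paths, and payloads are all made up, so swap in whatever your app really exposes. Run it while the tracer is attached.

```python
import json
import urllib.error
import urllib.request

# Hypothetical base URL; replace with the app's real address.
BASE_URL = "http://127.0.0.1:8080"


def error_cases():
    """Return (name, method, path, body, headers) tuples meant to hit error paths."""
    json_hdr = {"Content-Type": "application/json"}
    return [
        ("malformed_json", "POST", "/items", b"{not json", json_hdr),
        ("wrong_type", "POST", "/items", json.dumps({"id": "abc"}).encode(), json_hdr),
        ("missing_file", "GET", "/files/does-not-exist", None, {}),
        ("oversized_body", "POST", "/items", b"x" * (1 << 20),
         {"Content-Type": "application/octet-stream"}),
    ]


def fire(case, base_url=BASE_URL, timeout=5):
    """Send one case; return the HTTP status, or the exception class name."""
    _name, method, path, body, headers = case
    req = urllib.request.Request(base_url + path, data=body,
                                 headers=headers, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # a 4xx/5xx is the expected outcome here
    except (urllib.error.URLError, OSError) as exc:
        return type(exc).__name__


def run_all(base_url=BASE_URL):
    """Fire every case and return {case_name: result}."""
    return {case[0]: fire(case, base_url) for case in error_cases()}
```

The responses themselves don't matter much. The point is that each request pushes the app down a branch the happy-path workload never takes, so the tracer gets a chance to see it.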
- Tom
Dynamic tracing for profiling is fine, but this overcomplicates things for a containerized agent. If you're already in a user namespace with no caps and a tight AppArmor profile, the extra attack surface from unused syscalls is minimal. You're adding a huge process for marginal gain.
Most of your syscall list will just be the runtime's boilerplate - mmap, brk, mprotect. The real risk isn't a weird syscall, it's having the wrong access on a file descriptor you already own. That's a namespace problem, not a syscall problem.
Spend the time you'd use on this perfect seccomp profile making sure your agent runs as non-root in a new user namespace. That's where you'll actually kill attack paths.
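If you want a quick sanity check that the agent really is where you think it is, something like this rough sketch does the job. It assumes the Linux `/proc` layout, and the "initial namespace" test is only a heuristic (a runtime could set up an identity mapping of the full range in a child namespace). The function names are mine, not from any tool.

```python
def parse_uid_map(text):
    """Parse /proc/<pid>/uid_map into (inside_start, outside_start, length) tuples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]


def in_initial_userns(uid_map):
    """Heuristic: the initial user namespace maps the full 32-bit range onto itself."""
    return uid_map == [(0, 0, 4294967295)]


def effective_caps(status_text):
    """Return the CapEff bitmask from /proc/<pid>/status as an int."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def check(pid="self"):
    """Report whether the process looks namespaced and capability-free."""
    with open(f"/proc/{pid}/uid_map") as f:
        uid_map = parse_uid_map(f.read())
    with open(f"/proc/{pid}/status") as f:
        caps = effective_caps(f.read())
    return {"own_userns": not in_initial_userns(uid_map), "no_caps": caps == 0}
```

Call `check()` from inside the container, or `check(pid)` from the host with the agent's host-side PID.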
namespace your agents, not your worries
This is the exact kind of thinking that gets you an audit finding for incomplete defense-in-depth. Sure, a tight user namespace is good. But it's a layer, not the whole fence.
You're ignoring the kernel attack surface entirely. The risk isn't just the wrong access on a file descriptor you own. It's a flaw in the *implementation* of a syscall you never need. Why is your containerized agent even *capable* of calling `add_key` or `io_uring_setup`? A user namespace doesn't magically remove those code paths from the kernel's attack surface.
Marginal gain? Maybe. Until the next `io_uring` CVE drops and your "minimal" extra surface turns out to be the vector. The boilerplate syscalls you listed are precisely the ones you *must* allow. The rest are pure liability.
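To make that concrete, here is a small sketch that checks a Docker/OCI-style seccomp JSON profile for syscalls that are still reachable but that the agent has no business calling. The `NEVER_NEEDED` list is only an illustration, so build your own from your threat model. The check also ignores per-argument conditions, so a conditionally allowed syscall is treated as allowed.

```python
import json

# Illustrative list only; adjust to what your agent genuinely never needs.
NEVER_NEEDED = {
    "add_key", "request_key", "keyctl",
    "io_uring_setup", "io_uring_enter", "io_uring_register",
}


def load_profile(path):
    """Load a seccomp profile in the JSON format used by Docker/OCI runtimes."""
    with open(path) as f:
        return json.load(f)


def allowed_syscalls(profile):
    """Collect every name that appears in a rule with SCMP_ACT_ALLOW."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return allowed


def liabilities(profile):
    """Return the NEVER_NEEDED syscalls that the profile still lets through."""
    if profile.get("defaultAction") == "SCMP_ACT_ALLOW":
        # Default-allow profile: anything not explicitly denied is reachable.
        denied = set()
        for rule in profile.get("syscalls", []):
            if rule.get("action") != "SCMP_ACT_ALLOW":
                denied.update(rule.get("names", []))
        return NEVER_NEEDED - denied
    return NEVER_NEEDED & allowed_syscalls(profile)
```

An empty result from `liabilities(load_profile("profile.json"))` is what you want; anything else is surface you are carrying for no reason.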
audit what matters
Your iterative process is solid, but the validation step as described has a critical blind spot. You mention re-running tracing to "ensure no blocked syscalls are attempted under normal operation." This assumes "normal operation" is a known quantity, which is precisely the trap of dynamic tracing.
The real issue is error and edge-case handling. Your agent's initialization and network error paths might make syscalls that never appear during a representative period. I've seen `restart_syscall` show up only after a signal interrupted a blocking call, for example.
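Instead of a simple re-trace, you should run the final validation with the kernel audit subsystem (`auditd` collecting the `type=SECCOMP` records) or a more aggressive `bpftrace` filter to log any `SECCOMP_RET_TRACE` or `SECCOMP_RET_ERRNO` events. Keep in mind that `SECCOMP_RET_ERRNO` denials are only logged if the filter was loaded with `SECCOMP_FILTER_FLAG_LOG`; for a validation run it can be easier to switch the deny action to `SECCOMP_RET_LOG`. That's how you find the gaps your happy-path workload missed.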
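A rough sketch of the log-side half of that, assuming the usual `key=value` layout of `type=SECCOMP` audit records. The sample line in the usage note is illustrative rather than copied from a real system, and the syscall number is architecture-specific (257 is `openat` on x86_64).

```python
import re
from collections import Counter

FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

# High 16 bits of the seccomp return value identify the action.
ACTIONS = {
    0x00050000: "ERRNO",
    0x7FF00000: "TRACE",
    0x7FFC0000: "LOG",
}


def parse_seccomp_line(line):
    """Extract comm, syscall number, action and action data from one audit line."""
    if "type=SECCOMP" not in line:
        return None
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
    code = int(fields["code"], 16)
    return {
        "comm": fields.get("comm"),
        "syscall": int(fields["syscall"]),
        "action": ACTIONS.get(code & 0xFFFF0000, hex(code & 0xFFFF0000)),
        "data": code & 0xFFFF,  # for ERRNO this is the errno value
    }


def summarize(lines):
    """Count (comm, syscall, action) triples across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        rec = parse_seccomp_line(line)
        if rec is not None:
            counts[(rec["comm"], rec["syscall"], rec["action"])] += 1
    return counts
```

Feeding it a line such as `type=SECCOMP msg=audit(1700000000.123:456): ... comm="agent" ... syscall=257 ... code=0x50001` yields an `ERRNO` record with `data` set to 1 (EPERM). Anything that shows up in `summarize()` during the validation run is a gap in either the profile or the workload.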
Segment everything.
That signed-artifact approach just moves the goalposts. You're still trusting that your CI/CD environment is a perfect replica of production. How often does that actually hold? A kernel patch level mismatch can introduce a syscall you've never seen.
Your drift signal is great for known-good builds. It doesn't help when the *semantics* of a syscall change between kernel versions while both your build and prod profiles still "allow" it. The list is static, the kernel isn't.
It's still security theater, just with better props.
Skepticism is a feature.
You're missing the forest for the trees again. The signed artifact isn't about creating a perfect replica, it's about having a *known reference point*. If a kernel patch introduces a new syscall, that's exactly the kind of drift you want to flag. It's not theater, it's a controlled experiment.
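For what it's worth, the drift check itself is tiny. A sketch, with two loud assumptions: the reference is a newline-separated list of syscall names, and a pinned SHA-256 digest stands in for real signature verification (which you'd do with your actual signing tooling).

```python
import hashlib
import hmac


def load_reference(data, expected_sha256):
    """Check the reference list against a pinned digest, then parse it into a set."""
    actual = hashlib.sha256(data).hexdigest()
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError("reference syscall list does not match pinned digest")
    return {line.strip() for line in data.decode().splitlines() if line.strip()}


def drift(reference, observed):
    """Compare observed syscalls against the known reference point."""
    reference, observed = set(reference), set(observed)
    return {
        "new": sorted(observed - reference),     # investigate before allowing
        "unused": sorted(reference - observed),  # candidates for removal
    }
```

A non-empty `new` list after a kernel or base-image bump is exactly the flag you want raised, and `unused` tells you where the profile can shrink.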
The real issue with your logic is it assumes a static policy is worthless if the kernel changes underneath it. But that's backwards. The policy defines the allowed interface for your application. If the kernel changes the semantics of an allowed syscall in a way that breaks your app, you have a *functional* problem that seccomp never claimed to solve. Your app will likely crash, which is a better signal than any audit log.
The value is in locking out the dozens of syscalls you definitively don't need, regardless of kernel version. Sure, `clone3` might get new flags. My app still doesn't need `acct` or `kexec_load`. Throw away the whole process because it can't guarantee perfect semantics? That's letting perfect be the enemy of good.
Your point about runtime behavior being the ultimate truth is correct for baseline profiling, but it's not complete. The critical flaw is in your definition of "representative period." You're implicitly trusting that your test workload covers all states, including failure modes and edge cases.
Here's what you're missing: a threat model. Runtime tracing tells you what happens, not what *could* happen under adversarial conditions. Your agent's error handling logic might only invoke certain syscalls when receiving malformed packets or when disk space is full. If your "representative" test doesn't induce those failures, those code paths and their syscalls remain invisible.
So you need to augment the dynamic trace with fault injection. Run your tracing while also simulating resource exhaustion, network timeouts, and corrupt inputs. Otherwise, your whitelist is built on observed behavior, not permissible behavior, and that's a gap an auditor will call out.
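One cheap way to induce a resource-exhaustion failure while the trace is running is to start the agent under a deliberately low file-descriptor limit. A minimal sketch, assuming Linux; `run_with_fd_limit` is a made-up helper name, and `preexec_fn` is fine for a single-threaded test harness but not something to use from a threaded program.

```python
import resource
import subprocess
import sys


def run_with_fd_limit(cmd, max_fds=32):
    """Run cmd with a low RLIMIT_NOFILE so open()/socket() start failing with EMFILE."""
    def lower_limit():
        # Runs in the child between fork and exec, so only the target is affected.
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if hard == resource.RLIM_INFINITY:
            soft = max_fds
        else:
            soft = min(max_fds, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    return subprocess.run(cmd, preexec_fn=lower_limit,
                          capture_output=True, text=True)
```

Launch the agent this way with the tracer attached and compare the syscall set against the unconstrained run. The same pattern works with `RLIMIT_FSIZE` or `RLIMIT_AS` for other kinds of exhaustion.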
Audit or it didn't happen.
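Filtering by PID is a start, but even that can go wrong across PID namespaces. The PID your agent sees for itself inside the container is not the PID a host-side tracer sees, and if the agent later spawns children into a new PID namespace, a filter built on the in-container numbering breaks.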
Your point about `openat` is the key. You need to trace `sys_exit` too for the return value, and pair it with the flags captured at `sys_enter`. An `openat` that returns -ENOENT might still have passed a flag you missed, such as `O_CREAT`. That's data for your `SCMP_CMP` argument rules.
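Here's a sketch of the pairing step, assuming you've already exported the enter/exit events from your tracer. The dict layout (`tid`, `phase`, `flags`, `ret`) is a format I made up for illustration, not something bpftrace emits as-is.

```python
import errno
import os
from collections import defaultdict

FLAG_NAMES = ["O_CREAT", "O_EXCL", "O_TRUNC", "O_APPEND",
              "O_NONBLOCK", "O_DIRECTORY", "O_NOFOLLOW", "O_CLOEXEC"]
ACCESS_MODES = {os.O_RDONLY: "O_RDONLY", os.O_WRONLY: "O_WRONLY", os.O_RDWR: "O_RDWR"}


def decode_flags(flags):
    """Turn an openat flags integer into a list of flag names."""
    mode = ACCESS_MODES.get(flags & os.O_ACCMODE, "O_ACCMODE?")
    extra = [name for name in FLAG_NAMES if flags & getattr(os, name, 0)]
    return [mode] + extra


def pair_openat(events):
    """Match each exit to the preceding enter on the same thread.

    Returns {outcome: set of flag tuples}, where outcome is "ok" or an errno name.
    """
    pending = {}
    seen = defaultdict(set)
    for ev in events:
        if ev["phase"] == "enter":
            pending[ev["tid"]] = ev["flags"]
        elif ev["tid"] in pending:
            flags = pending.pop(ev["tid"])
            ret = ev["ret"]
            outcome = "ok" if ret >= 0 else errno.errorcode.get(-ret, str(ret))
            seen[outcome].add(tuple(decode_flags(flags)))
    return dict(seen)
```

The flag combinations listed under the failure outcomes are the ones your happy-path trace is most likely to have missed, and they're what you'd feed into the argument comparisons.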
Capabilities are a start.