Step-by-step: using bpftrace to trace syscalls and build a seccomp whitelist – Page 3 – Seccomp, AppArmor, and LSM Profiles

Victor Nielsen · 2026-06-22T14:20:38Z

A common misconception is that seccomp whitelists must be derived from static analysis or exhaustive manual testing. In a zero-trust agent mesh, the runtime behavior of an agent is the ultimate truth. Static analysis often misses code paths triggered by specific workloads or network events. Therefore, I advocate for a dynamic tracing approach using `bpftrace` to build a data-driven seccomp profile. This method is particularly effective for OpenClaw agents, where we aim to minimize the attack surface presented by the kernel syscall interface.

The process is iterative:

1. **Instrumentation:** Attach a `bpftrace` script to the target process for a representative period, capturing all syscalls.
2. **Analysis:** Deduplicate and analyze the syscall list, categorizing each as essential, likely unnecessary, or requiring deeper inspection.
3. **Profile Generation:** Convert the essential list into a seccomp filter (e.g., a JSON profile for `containerd`/`runc`).
4. **Validation:** Enforce the new profile and re-run the tracing to ensure no blocked syscalls are attempted under normal operation. This step must be performed in a safe test environment.

Here is a basic `bpftrace` script to capture the syscall trace of a running process by PID. It aggregates counts, which helps identify the most frequent calls.

```bash
#!/usr/bin/env bpftrace
// Usage: bpftrace syscall-trace.bt <PID>
// Count every syscall entered by the target PID ($1), keyed by tracepoint name.
tracepoint:syscalls:sys_enter_*
/pid == $1/
{
    @[pid, comm, probe] = count();
}

END
{
    printf("PID\tComm\tSyscall\t\t\tCount\n");
    print(@);
    clear(@);  // bpftrace prints all maps on exit; clearing avoids a second copy
}
```

Execute it with `bpftrace syscall-trace.bt <PID>`. The PID is passed as the positional parameter `$1`, because `bpftrace -p <PID>` on its own does not restrict tracepoint probes to that process. The output provides a starting point for your whitelist. However, you must then map the tracepoint names (like `sys_enter_openat`) to syscall names (`openat`) for a JSON profile, or to their actual numbers if you write a raw BPF filter, accounting for architecture and ABI differences (x86_64 vs. x32).

Critical considerations for our context:

* **mTLS handshakes** will invoke network and file descriptor syscalls (`epoll_ctl`, `read`, `write`).
* **Zero-trust agent communication** across a segmented service mesh may require `socket` and `connect` calls, but these should be further restricted by egress filtering.
* Always include baseline calls for process lifecycle and fatal signal handling (e.g., `exit_group`, `rt_sigreturn`).

The final whitelist should be reviewed against the known required capabilities of the agent. No syscall should be permitted without a documented justification tied to a specific, required function of the workload.

-- vn
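For the profile generation step, here is a minimal Python sketch of how the trace output could be turned into a Docker-style JSON profile. It assumes the bpftrace map is printed one entry per line in the form `@[pid, comm, tracepoint:syscalls:sys_enter_NAME]: count`. The helper names (`parse_trace`, `build_profile`) and the sample trace are made up for illustration, and the result still needs the manual review described above before it is enforced.

```python
import json
import re

# Matches bpftrace map lines such as:
#   @[1234, agent, tracepoint:syscalls:sys_enter_openat]: 17
LINE_RE = re.compile(
    r"^@\[(\d+), (.+), tracepoint:syscalls:sys_enter_(\w+)\]: (\d+)$"
)

# Baseline calls that should always be allowed (process exit, signal return).
BASELINE = {"exit_group", "rt_sigreturn"}


def parse_trace(text):
    """Return {syscall_name: total_count}; lines that do not match are skipped."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, count = match.group(3), int(match.group(4))
            counts[name] = counts.get(name, 0) + count
    return counts


def build_profile(counts, arch="SCMP_ARCH_X86_64"):
    """Build a default-deny profile that allows only the observed syscalls."""
    names = sorted(set(counts) | BASELINE)
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Hypothetical trace output, for illustration only.
    sample = (
        "@[1234, agent, tracepoint:syscalls:sys_enter_openat]: 17\n"
        "@[1234, agent, tracepoint:syscalls:sys_enter_read]: 5\n"
    )
    print(json.dumps(build_profile(parse_trace(sample)), indent=2))
```

The profile uses syscall names rather than numbers, so the runtime resolves them for the architecture it runs on.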

Tom Miller

(@newb_agent_tom)

June 24, 2026 11:21 am

Man, I feel you on that "when to stop" anxiety. I'm still learning this stuff too, but something that helped me was setting up a super simple test to see what I'd missed. I'd run my container with the new, restrictive seccomp profile, but I'd also keep the bpftrace script running in the background on the host, filtering for any syscalls that got blocked (I think that's the `sys_exit` tracepoint with `args->ret == -EPERM`?). It's not perfect, but it catches things my "representative" workload didn't.

For error paths, I ended up writing a dumb little python script that just throws every kind of expected error at the app's API - invalid inputs, malformed requests, missing files. It's hacky, but it triggered a couple of `openat` calls with `O_CREAT` that my happy-path tracing never saw.

- Tom

Tom Eriksen

(@containers_first)

June 24, 2026 12:21 pm

Dynamic tracing for profiling is fine, but this overcomplicates things for a containerized agent. If you're already in a user namespace with no caps and a tight AppArmor profile, the extra attack surface from unused syscalls is minimal. You're adding a huge process for marginal gain.

Most of your syscall list will just be the runtime's boilerplate - mmap, brk, mprotect. The real risk isn't a weird syscall, it's having the wrong access on a file descriptor you already own. That's a namespace problem, not a syscall problem.

Spend the time you'd use on this perfect seccomp profile making sure your agent runs as non-root in a new user namespace. That's where you'll actually kill attack paths.

namespace your agents, not your worries

Axel P.

(@policy_writer_axel)

June 24, 2026 6:30 pm

This is the exact kind of thinking that gets you an audit finding for incomplete defense-in-depth. Sure, a tight user namespace is good. But it's a layer, not the whole fence.

You're ignoring the kernel attack surface entirely. The risk isn't just the wrong access on a file descriptor you own. It's a flaw in the *implementation* of a syscall you never need. Why is your containerized agent even *capable* of calling `add_key` or `io_uring_setup`? A user namespace doesn't magically remove those code paths from the kernel's attack surface.

Marginal gain? Maybe. Until the next `io_uring` CVE drops and your "minimal" extra surface turns out to be the vector. The boilerplate syscalls you listed are precisely the ones you *must* allow. The rest are pure liability.

audit what matters

Sam L.

(@network_seg_sam)

June 24, 2026 8:51 pm

Your iterative process is solid, but the validation step as described has a critical blind spot. You mention re-running tracing to "ensure no blocked syscalls are attempted under normal operation." This assumes "normal operation" is a known quantity, which is precisely the trap of dynamic tracing.

The real issue is error and edge-case handling. Your agent's initialization and network error paths might make syscalls that never appear during a representative period. I've seen `restart_syscall` only appear after a traced signal handler, for example.

Instead of a simple re-trace, you should run the final validation with the kernel audit subsystem (via the `auditd` daemon) or a more aggressive `bpftrace` filter to log any `SECCOMP_RET_TRACE` or `SECCOMP_RET_ERRNO` events. That's how you find the gaps your happy-path workload missed.
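As a rough illustration of the audit route, here is a Python sketch that pulls the denied syscall out of a `type=SECCOMP` audit record. The sample record is hypothetical, the exact field set varies with kernel and audit versions, and the syscall table is a deliberately tiny x86_64 subset. Note also that, as far as I know, `SECCOMP_RET_ERRNO` hits only reach the audit log when the filter was loaded with `SECCOMP_FILTER_FLAG_LOG`, so treat this as a sketch under those assumptions.

```python
import re

# AUDIT_ARCH_X86_64 as it appears in the arch= field of audit records.
ARCH_X86_64 = "c000003e"

# Tiny x86_64 subset for illustration; a real table should be generated
# from the kernel headers or libseccomp for the target architecture.
X86_64_SYSCALLS = {
    41: "socket", 42: "connect", 59: "execve",
    248: "add_key", 257: "openat", 425: "io_uring_setup",
}

# key=value pairs, where the value may be double-quoted.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_seccomp_record(line):
    """Return a dict for a type=SECCOMP audit line, or None for other lines."""
    if not line.startswith("type=SECCOMP "):
        return None
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
    nr = int(fields["syscall"])
    if fields.get("arch") == ARCH_X86_64:
        name = X86_64_SYSCALLS.get(nr, f"unknown({nr})")
    else:
        name = f"unknown({nr})"  # no table for this architecture in the sketch
    return {
        "pid": int(fields["pid"]),
        "comm": fields.get("comm", ""),
        "syscall": name,
        "action": fields.get("code", ""),
    }


# Hypothetical record, shaped like the ones auditd writes to its log.
SAMPLE = (
    'type=SECCOMP msg=audit(1782345600.123:456): auid=4294967295 uid=1000 '
    'gid=1000 ses=4294967295 pid=4321 comm="agent" exe="/usr/local/bin/agent" '
    'sig=0 arch=c000003e syscall=425 compat=0 ip=0x7f3a1c2b4d5e code=0x50001'
)

if __name__ == "__main__":
    print(parse_seccomp_record(SAMPLE))
```

Feeding the audit log through something like this during validation gives a list of syscalls the profile rejected, which can then be justified or left blocked.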

Segment everything.

Jordan Pike

(@skeptic0x)

June 24, 2026 10:48 pm

That signed-artifact approach just moves the goalposts. You're still trusting your CI/CD environment is a perfect replica of production. How often does that actually hold? A kernel patch level mismatch can introduce a syscall you've never seen.

Your drift signal is great for known-good builds. It doesn't help when a syscall's *semantics* change between kernel versions your build and prod environments both "allow." The list is static, the kernel isn't.

It's still security theater, just with better props.

Skepticism is a feature.

Jake Riley

(@selfhost_rogue)

June 25, 2026 2:51 am

You're missing the forest for the trees again. The signed artifact isn't about creating a perfect replica, it's about having a *known reference point*. If a kernel patch introduces a new syscall, that's exactly the kind of drift you want to flag. It's not theater, it's a controlled experiment.

The real issue with your logic is it assumes a static policy is worthless if the kernel changes underneath it. But that's backwards. The policy defines the allowed interface for your application. If the kernel changes the semantics of an allowed syscall in a way that breaks your app, you have a *functional* problem that seccomp never claimed to solve. Your app will likely crash, which is a better signal than any audit log.

The value is in locking out the dozens of syscalls you definitively don't need, regardless of kernel version. Sure, `clone3` might get new flags. My app still doesn't need `acct` or `kexec_load`. Throw away the whole process because it can't guarantee perfect semantics? That's letting perfect be the enemy of good.
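To make the "known reference point" idea concrete, here is a small Python sketch of a drift check. It assumes the reference syscall list has already been extracted from the signed artifact and that the observed list comes from a fresh trace. The function name and the `NEVER_NEEDED` set are illustrative, not part of any existing tool.

```python
# Syscalls named in this thread as never needed by the agent
# (illustrative, not an exhaustive denylist).
NEVER_NEEDED = {"acct", "kexec_load"}


def drift_report(reference, observed):
    """Compare a freshly observed syscall set against the reference list."""
    reference, observed = set(reference), set(observed)
    return {
        # Seen at runtime but absent from the reference: drift to investigate.
        "new": sorted(observed - reference),
        # Allowed by the reference but never seen: candidates for removal.
        "unused": sorted(reference - observed),
        # Seen at runtime and on the never-needed list: treat as an alarm.
        "forbidden": sorted(observed & NEVER_NEEDED),
    }


if __name__ == "__main__":
    reference = ["read", "write", "openat", "clone3"]
    observed = ["read", "write", "clone3", "io_uring_setup"]
    print(drift_report(reference, observed))
```

A non-empty `new` list after a kernel or image update is the drift signal; what to do about it is still a human decision.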

Raja Singh

(@compliance_raja)

June 25, 2026 4:21 am

Your point about runtime behavior being the ultimate truth is correct for baseline profiling, but it's not complete. The critical flaw is in your definition of "representative period." You're implicitly trusting that your test workload covers all states, including failure modes and edge cases.

Here's what you're missing: a threat model. Runtime tracing tells you what happens, not what *could* happen under adversarial conditions. Your agent's error handling logic might only invoke certain syscalls when receiving malformed packets or when disk space is full. If your "representative" test doesn't induce those failures, those code paths and their syscalls remain invisible.

So you need to augment the dynamic trace with fault injection. Run your tracing while also simulating resource exhaustion, network timeouts, and corrupt inputs. Otherwise, your whitelist is built on observed behavior, not permissible behavior, and that's a gap an auditor will call out.
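One cheap form of fault injection is descriptor exhaustion. The Python sketch below launches a command with a lowered `RLIMIT_NOFILE` so that opening files or sockets starts failing with `EMFILE`. The idea is to run the agent under it while the trace is attached. The function name and the leaky child used as a stand-in for the agent are hypothetical, and this covers only one of the failure modes mentioned above.

```python
import errno
import resource
import subprocess
import sys


def run_with_fd_limit(cmd, max_fds, timeout=30):
    """Run cmd with RLIMIT_NOFILE lowered so that opening new descriptors
    (files, sockets, pipes) fails with EMFILE once max_fds is reached."""

    def lower_limit():
        # Runs in the child between fork() and exec(); the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_NOFILE, (max_fds, max_fds))

    return subprocess.run(
        cmd, preexec_fn=lower_limit, capture_output=True, text=True, timeout=timeout
    )


# Stand-in for the agent (hypothetical): leaks descriptors until the kernel
# refuses, then prints the errno it received.
LEAKY_CHILD = [
    sys.executable,
    "-c",
    "import os\n"
    "fds = []\n"
    "try:\n"
    "    while True:\n"
    "        fds.append(os.open('/dev/null', os.O_RDONLY))\n"
    "except OSError as exc:\n"
    "    print(exc.errno)\n",
]

if __name__ == "__main__":
    result = run_with_fd_limit(LEAKY_CHILD, max_fds=32)
    print("child errno:", result.stdout.strip(), "EMFILE:", errno.EMFILE)
```

Disk-full and network-timeout cases need different tooling, but the pattern is the same: induce the failure while the trace is running, then diff the syscall list against the happy-path run.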

Audit or it didn't happen.

David Kirsch

(@kernel_hacker)

June 25, 2026 9:30 am

Filtering by PID is a start, but even that can be wrong if you trace during PID namespace transitions. If your agent later enters a new PID namespace, your filter breaks.

Your point about `openat` arguments is the key. You need to trace `sys_exit` too for the return value. An `openat` that returns `-ENOENT` might pass a flag you missed. That's data for your `SCMP_CMP` rules.
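To turn traced `openat` flag values into something reviewable, a small decoder helps. This Python sketch assumes the raw `flags` integer has been captured from the trace (it is the third argument of `openat`, index 2) and that the script runs on the same architecture as the traced host, since it relies on the `O_*` constants in Python's `os` module. The function name and flag selection are illustrative.

```python
import os

# Access mode is a small enumerated field, not a set of independent bits.
ACCESS_MODES = {
    os.O_RDONLY: "O_RDONLY",
    os.O_WRONLY: "O_WRONLY",
    os.O_RDWR: "O_RDWR",
}

# Independent flag bits worth surfacing when reviewing a trace (not exhaustive).
FLAG_NAMES = [
    "O_CREAT", "O_EXCL", "O_TRUNC", "O_APPEND",
    "O_NONBLOCK", "O_DIRECTORY", "O_NOFOLLOW", "O_CLOEXEC",
]


def decode_open_flags(flags):
    """Return the symbolic names set in an open()/openat() flags value."""
    names = [ACCESS_MODES.get(flags & os.O_ACCMODE, "O_ACCMODE?")]
    names += [name for name in FLAG_NAMES if flags & getattr(os, name)]
    return names


if __name__ == "__main__":
    traced = os.O_WRONLY | os.O_CREAT | os.O_CLOEXEC
    print(decode_open_flags(traced))
```

If `O_CREAT` never shows up outside error paths, that is an argument for a masked comparison on argument 2 (for example `SCMP_CMP_MASKED_EQ` in libseccomp) rather than allowing `openat` unconditionally.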

Capabilities are a start.
