This sounds perfect for my lab setup, but I'm already stuck on step one. When you say "attach a bpftrace script to the target process," do you mean I need to run the script first and then start my container? Or can I attach it to a process that's already running?
Also, that "representative period" worries me. I'm afraid my simple test run won't be enough and I'll block something important later. How do you know when to stop tracing?
Learning by doing (and breaking).
> Is it after a full business cycle, or after simulating every possible alert condition?
That's the trap - you'll never get them all. The goal isn't perfection, it's to catch the 99% and then lock the door. For a logging agent, I'd focus on its known states: idle, polling, writing, rotating, and error handling. Script a fault injection loop that triggers each of those.
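One way to keep track of whether each of those states actually got exercised is a small coverage report over the per-state traces. This is only a sketch: the input format (a dict from state name to the set of syscall names seen while that state was triggered) is an assumption, and the trigger scripts themselves are left out.

```python
# The five states named above; traces filed under other labels are ignored.
KNOWN_STATES = ["idle", "polling", "writing", "rotating", "error handling"]

def coverage_report(traces_by_state):
    """Summarise which states were traced and what each one added.

    traces_by_state: hypothetical mapping of state name -> set of syscall
    names observed while that state was being triggered.
    """
    # States with no trace at all (or an empty one) still need a trigger.
    missing = [s for s in KNOWN_STATES if not traces_by_state.get(s)]

    seen = set()
    added = {}
    for state in KNOWN_STATES:
        # Syscalls this state contributes beyond the states listed before it.
        new = set(traces_by_state.get(state, ())) - seen
        added[state] = sorted(new)
        seen |= new

    return {
        "missing_states": missing,
        "added_by_state": added,
        "allowlist": sorted(seen),
    }
```

A non-empty `missing_states` list means the fault injection loop never reached that state, so the profile should not be locked down yet.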
For the validation problem, you're right that a passing trace doesn't prove completeness. You need negative testing: run the agent with the new profile under strace with `-e inject=<syscall>:error=<errno>` for each of the *blocked* syscalls. If it crashes or behaves weirdly on an injected fault, you missed a dependency.
Escape artist.
I'm glad you're promoting this approach - it's a solid, pragmatic way to get to a baseline profile. The iterative loop you've outlined is key.
Your step 4, validation, is where I've seen most people cut corners. Just re-running the same trace with the profile active isn't enough. You need to intentionally **fail** the syscalls you're about to block. Use `strace -e inject=<syscall>:error=<errno>` (for example `error=EPERM`) to simulate the profile blocking a syscall before you actually deploy it. If the agent handles it gracefully, you're good. If it crashes or hangs, you've missed a dependency or error-handling path.
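A rough sketch of how that injection run could be scripted. It only builds the strace command line. The agent path and the choice of `EPERM` are assumptions, so pick whatever errno your profile's deny action actually returns.

```python
import shlex

def strace_inject_argv(agent_argv, blocked_syscalls, errno_name="EPERM"):
    """Build an strace command that fails every syscall you plan to block.

    -f follows child processes; each -e inject=<syscall>:error=<errno>
    makes that syscall return the given error instead of succeeding.
    """
    argv = ["strace", "-f"]
    # One inject expression per syscall, de-duplicated and in a stable order.
    for name in sorted(set(blocked_syscalls)):
        argv += ["-e", f"inject={name}:error={errno_name}"]
    return argv + list(agent_argv)

def as_shell(argv):
    """Render the command for copy-pasting into a terminal."""
    return shlex.join(argv)
```

For example, `as_shell(strace_inject_argv(["/usr/bin/agent"], ["ptrace", "mount"]))` (with `/usr/bin/agent` standing in for your binary) gives a command you can run by hand while watching how the agent reacts.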
Also, don't forget to trace fork/clone events. If your agent spawns subprocesses, they'll inherit the seccomp policy and you need to account for their syscalls too. Missed that once and spent hours debugging a worker that couldn't `pipe`. 😅
Model the threats before the code.
Totally feel you on the fork/clone trap. I containerized a voice assistant last month that used a Python lib to play audio, and it silently spawned an `aplay` subprocess. The main process profile was airtight, but the child got nuked instantly. Had to re-trace with strace's `--follow-forks` (`-f`) to catch it.
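To show what the child would have contributed, here is a minimal sketch that splits a fork-following trace by PID. It assumes `strace -f -o <file>` style output, where every line starts with the PID, and the sample lines in the usage note are made up for illustration.

```python
import re
from collections import defaultdict

# "<pid> <syscall>(" at the start of a line; signal and exit lines won't match.
LINE = re.compile(r"^(\d+)\s+([a-z_][a-z0-9_]*)\(")

def syscalls_by_pid(lines):
    """Group syscall names by the PID that issued them."""
    per_pid = defaultdict(set)
    for line in lines:
        m = LINE.match(line)
        if m:
            per_pid[int(m.group(1))].add(m.group(2))
    return per_pid

def child_only(per_pid, main_pid):
    """Syscalls made by children that the main process never made itself."""
    main = per_pid.get(main_pid, set())
    children = set()
    for pid, names in per_pid.items():
        if pid != main_pid:
            children |= names
    return sorted(children - main)
```

Feeding it lines such as `100 clone(...) = 101` followed by `101 execve("/usr/bin/aplay", ...) = 0` would report `execve` as child-only, which is exactly the kind of entry a main-process-only trace misses.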
> You need to intentionally **fail** the syscalls you're about to block.
This is gold. I started doing something similar by running the app under `SECCOMP_RET_TRACE` with a dummy denylist first, just to see what would get caught, but injecting errors with strace is way more direct. Makes you think about whether the app actually *handles* the failure, not just whether it calls the syscall.
Lab never sleeps.
You're asking for a benchmark, which presupposes a goal of completeness. That's the wrong frame. The value of the runtime trace isn't to build a perfect model, it's to invalidate a flawed static one.
A static audit begins with assumptions about code paths and library behavior. A three-hour trace, however incomplete, provides empirical counter-evidence. It catches the syscalls from that dynamically-linked helper you didn't know about, or the fallback path triggered by a failed DNS resolution. You're not trading one model for another; you're using a crude experiment to falsify your initial hypothesis.
The real failure mode isn't missing the "full moon" edge case. It's missing the mundane syscall that your static analysis dismissed because you misread the man page. Show me a static audit that reliably predicts the behavior of `glibc`'s name service switch routines under all configurations, and then we can talk about benchmarks. Until then, the trace is a necessary sanity check, precisely because it's messy and grounded in actual execution.
Compliance is not security.
You've hit the nail on the head. A threat model that just rubber-stamps whatever the trace spits out is worse than useless, it gives you a false sense of security.
But your conclusion is off. You say the answer might be "fix the app architecture." That's a luxury most of us don't have. I'm not going to rewrite a legacy COTS binary because it calls `ptrace` during normal operation. The threat model question isn't "should this exist," it's "given that it does, what's the blast radius if an attacker controls its arguments?" Sometimes the answer is "accept the risk and monitor for misuse," which is a valid and pragmatic outcome. The trace gives you the list to evaluate, it doesn't make the decision for you.
Your point about the runtime behavior being the ultimate truth is a critical philosophical shift. Too often we treat policy as something derived from architecture diagrams or product documentation, when those are aspirational at best.
I've applied this method in SOX compliance audits for financial data pipelines. We'd start with the vendor's documented list of required syscalls, then run bpftrace during a full quarter-end reconciliation workload. Without fail, we'd discover syscalls related to diagnostics, performance counters, or legacy fallback paths that weren't in the documentation. The resulting seccomp profile wasn't just more secure, it was a more accurate artifact for the audit trail, proving the control was based on observed behavior, not paper specifications.
This empirical grounding is what separates a box-ticking exercise from an actual control. My only caveat would be that this method must be integrated into the change management lifecycle; a profile built against version 1.2.3 is only valid until the next patch, and your validation cycle needs to account for that.
That audit trail point makes a lot of sense. It turns a guess into a record.
But the version caveat is huge. If you rebuild the profile for every patch, that's a big maintenance burden. How do you know if a patch actually changed the syscall pattern? Do you have to trace the whole workload again, or is there a way to check for regressions?
The pid filtering is correct, but that bpftrace predicate won't work as written. The `target` variable is only set when using `-p` for *attach*. For tracepoints, you need `args->pid` and `args->common_pid`. It's easy to get wrong and silently include garbage.
Your second point is the real blocker. Most logs show the `openat` flags as a raw *int*, so `O_RDONLY` appears as a bare `0`. You need the symbolic flag constants to write the `SCMP_CMP` masks. Without that translation, your argument filter is useless. I've seen teams waste weeks building a profile that logs `openat` with flag `0` and then write a rule allowing all flags because they didn't parse the raw integer.
Trust but verify? I skip the trust.
Exactly, and that reduction is the real goal. But I've seen teams get stuck trying to parse that raw integer `cmd` value from the trace. They'll log a `2` and then have to go dig through headers to find which `F_*` command constant that corresponds to on their exact libc and kernel version.
That translation layer, from raw trace number to meaningful seccomp rule, is a manual, error-prone step. It's easy to end up with a rule for `2` that silently allows the wrong thing on a different ABI. The trace gives you the data, but you still need the system-specific knowledge to interpret it correctly.
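As a small illustration of that lookup, here is a sketch that resolves a raw `cmd` integer against the constants Python's `fcntl` module exposes. The caveat is the same one as above: the answer comes from the platform the script runs on, so it is only valid if that matches the traced system.

```python
import fcntl

# Common fcntl commands; extend the list if your trace shows others.
_CMD_NAMES = ["F_DUPFD", "F_GETFD", "F_SETFD", "F_GETFL", "F_SETFL", "F_DUPFD_CLOEXEC"]

def fcntl_cmd_name(raw):
    """Map a raw fcntl cmd integer to its symbolic name on this platform."""
    for name in _CMD_NAMES:
        # getattr with a default keeps this working where a constant is absent.
        if getattr(fcntl, name, None) == raw:
            return name
    # Refuse to guess: an unknown value should be looked up by hand.
    return f"UNKNOWN({raw})"
```

Returning `UNKNOWN(...)` instead of a best guess is deliberate, since a wrong name here turns into a wrong rule later.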
Safety first, then security.
The translation problem is why my team moved to generating seccomp-bpf rules directly from the trace, not a log. We wrote a bpftrace script that maps raw integers to symbolic names using the kernel's own `TRACE_EVENT` format definitions, which are stable within a major kernel version. It outputs libseccomp rules with the correct `SCMP_CMP` masks.
For example, it doesn't just log `openat` with flag `0`. It reads the `flags` argument of the `sys_enter_openat` tracepoint, which arrives as a raw integer. We can then compare the numeric value against `O_RDONLY`, `O_WRONLY`, etc., and emit a rule like `SCMP_CMP(2, SCMP_CMP_EQ, 0)` for the `flags` argument (index 2, after `dfd` and `filename`).
The real gap is that bpftrace's `args` don't always expose the symbolic constants. You often need a separate mapping table, which becomes a maintenance burden across kernel versions. We ended up embedding a minimal copy of the kernel's UAPI headers into the generation tool.
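To give a feel for the output side, here is a sketch that turns one observed `openat` flags value into a libseccomp C call as text. It is not the in-house generator described above. It assumes a filter context variable named `ctx` in the generated C, and it uses a masked compare on the access-mode bits instead of an exact match so that extra bits like `O_CLOEXEC` don't defeat the rule.

```python
import os

ACCESS_MODES = {
    os.O_RDONLY: "O_RDONLY",
    os.O_WRONLY: "O_WRONLY",
    os.O_RDWR: "O_RDWR",
}

def openat_access_rule(raw_flags):
    """Emit a seccomp_rule_add() call allowing openat for one access mode.

    Argument index 2 of openat is `flags`; the rule masks it with
    O_ACCMODE and compares the result to the observed access mode.
    """
    mode = ACCESS_MODES.get(raw_flags & os.O_ACCMODE)
    if mode is None:
        raise ValueError(f"unrecognised access mode in flags {raw_flags:#o}")
    return (
        "seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1, "
        f"SCMP_CMP(2, SCMP_CMP_MASKED_EQ, O_ACCMODE, {mode}));"
    )
```

Emitting symbolic names rather than numbers pushes the final resolution to the C compiler on the build host, which is one way to keep the raw integers out of the rule file.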
I was wondering the same thing about coverage. The advice I've seen is to run the most comprehensive integration test suite you have, not just unit tests. If you have a staging environment, maybe trace a full user session there.
For conversion, user173 mentioned their team's bpftrace script that outputs libseccomp rules. Is that tool public, or is it all in-house? Doing it by hand sounds like a recipe for errors.
Coverage is indeed the weak link in this empirical approach. You're right that integration tests or staging environments help, but they often miss failure modes or initialization paths that only run on a fresh boot or under specific error conditions. I've seen syscalls from `dl_iterate_phdr` only appear when the dynamic linker's cache is missing.
Regarding the conversion tool, user173's approach is sound but brittle. The kernel's tracepoint format isn't a public API guarantee, and those internal structures can shift between minor kernel versions. Building a mapping table for flags per-syscall is essentially recreating a subset of libseccomp's own knowledge, which feels like the wrong abstraction.
A more maintainable method is to write a small Rust program that consumes the raw trace output and uses the `libc` crate to resolve constants for your target. It ties the translation to your toolchain's headers, not the running kernel's internals. The script isn't public, but the pattern is straightforward: parse the syscall and raw integer, then compare it against `libc::O_RDONLY` and friends to find the match. You still need to handle architecture differences, but that's a known problem space.
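The parse-then-match pattern is short enough to sketch. This version is in Python and uses the `os` module's constants as a stand-in for the `libc` crate, so it is not the Rust program itself. The input format (`<syscall> <raw integer>` per line) is an assumption about what the trace emits.

```python
import os

ACCESS = {os.O_RDONLY: "O_RDONLY", os.O_WRONLY: "O_WRONLY", os.O_RDWR: "O_RDWR"}
MODIFIERS = ["O_CREAT", "O_EXCL", "O_TRUNC", "O_APPEND",
             "O_NONBLOCK", "O_DIRECTORY", "O_CLOEXEC"]

def decode_open_flags(raw):
    """Split a raw open/openat flags integer into symbolic names."""
    mode = raw & os.O_ACCMODE
    names = [ACCESS.get(mode, f"ACCMODE({mode})")]
    rest = raw & ~os.O_ACCMODE
    for name in MODIFIERS:
        bit = getattr(os, name, 0)  # 0 if this platform lacks the flag
        if bit and (rest & bit) == bit:
            names.append(name)
            rest &= ~bit
    if rest:
        # Leftover bits are reported, never silently dropped.
        names.append(hex(rest))
    return names

def parse_line(line):
    """Parse one '<syscall> <raw flags>' line (hypothetical trace format)."""
    syscall, raw = line.split()
    return syscall, decode_open_flags(int(raw, 0))
```

Note that the access mode is handled separately from the modifier bits: `O_RDONLY` is `0`, so a plain bit test would never detect it.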
cargo audit --deny warnings
I agree that tying translation to the libc crate is a step forward, but you're still introducing a build-time dependency. That mapping is only correct for the libc version the program was compiled against, not necessarily the one it runs on in production. This is the same hazard as using the kernel's internal structures, just shifted up a layer.
A more deterministic, albeit heavier, approach is to generate the seccomp policy as part of your CI/CD pipeline, using the same container image or OS package versions you'll deploy. Then, you can treat the generated policy as a signed artifact, tied to a specific build ID. The trace-derived list becomes part of your SBOM, and any drift in syscall patterns between builds is a clear signal for review.
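The drift check itself can be very small. A sketch, assuming each build's trace has already been reduced to a list of syscall names:

```python
def syscall_drift(baseline, candidate):
    """Compare the syscall sets of two builds."""
    baseline, candidate = set(baseline), set(candidate)
    return {
        "added": sorted(candidate - baseline),    # new in this build
        "removed": sorted(baseline - candidate),  # no longer observed
    }

def needs_review(drift):
    """Any change in either direction is treated as a review signal."""
    return bool(drift["added"] or drift["removed"])
```

Treating removals as drift too is a judgment call: a vanished syscall can mean the workload no longer exercises a path, and that is worth knowing before the profile is tightened.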
The real risk is a silent mismatch between the translation environment and the runtime environment.
Yes! Treating the generated policy as a signed artifact is the logical endpoint for this. It fits perfectly with the supply chain mindset - you're tying the runtime security control directly to a specific, auditable build.
That SBOM integration is key. If your policy is a signed attestation, and your SBOM lists the exact libc and kernel packages, you have a complete provenance chain. A drift alert isn't just about a new syscall appearing, it's a signal that your deployed binary no longer matches the attested and validated behavior from your pipeline. You can fail closed.
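A sketch of what failing closed could look like, under some loud assumptions: the attestation is reduced to a SHA-256 digest over the build ID, package versions, and syscall list, and the signing of that digest is left out entirely. Field names and package names are hypothetical.

```python
import hashlib
import hmac
import json

def policy_digest(build_id, packages, syscalls):
    """Hash the build ID, package versions, and syscall allowlist together."""
    record = {
        "build_id": build_id,
        "packages": packages,               # e.g. {"libc6": "<version>"}
        "syscalls": sorted(set(syscalls)),  # order and duplicates don't matter
    }
    # Canonical JSON so the same facts always produce the same digest.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def admit(attested_digest, build_id, packages, syscalls):
    """Fail closed: admit only if the deployed facts match the attestation."""
    actual = policy_digest(build_id, packages, syscalls)
    return hmac.compare_digest(attested_digest, actual)
```

In a real pipeline the digest would be signed and the signature verified first. This only shows the matching step that turns drift into a hard refusal.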
The heaviness you mention is real, but maybe that's the cost of determinism. It forces you to have a reproducible build environment, which solves a lot of other problems too.
Trust no source without a signature.