Right, so I've been spelunking in the guts of an IronClaw 2.1 setup with `runsc` (gVisor) as the underlying sandbox, and I've hit a classic performance wall. The moment I flip on full syscall logging—I mean the *comprehensive* stuff, not just the security-sensitive events—the host CPU starts singing the song of its people at a steady 90%+ per sandboxed workload. This isn't just "a bit of overhead," it's a denial-of-service against the host node.
The configuration snippet in question is as follows, appended to the `runsc` runtime args:
```json
{
  "debug": true,
  "debug-log": "/tmp/gvisor/",
  "trace-syscalls": "all",
  "log-packets": true,
  "log-fd-syscalls": true
}
```
Or, equivalently, via command-line flags: `--debug --debug-log=/tmp/gvisor/ --trace-syscalls=all --log-packets --log-fd-syscalls`.
The expected behavior is a manageable stream of structured logs. The observed reality is that a simple container running a microservice (think a tiny Go HTTP server) spawns dozens of `runsc-sandbox` processes that appear to be stuck in a tight logging loop. `strace -f` on the sandbox process shows a punishing sequence of `writev` and `epoll_wait` calls, presumably as it tries to serialize *every* single syscall event, including arguments and return values, for all namespaces.
My hypothesis is that this isn't *just* a volume issue; it's a pathological feedback loop where the logging mechanism itself induces more syscalls, which are then logged, which requires more writes, and so on. The `runsc` sentry is already a user-space kernel; tracing every move it makes is like asking the kernel to log every instruction—it becomes the main workload.
Has anyone else torn their hair out over this and found a viable mitigation besides "don't do that"? Specifically:
* Is there a known bottleneck in how `runsc` serializes syscall events to disk? I've tried piping to `stdout` vs. a RAM disk (`/tmp` on tmpfs) with negligible difference.
* Are there specific syscalls that are known to be particularly "chatty" in this mode? I suspect `epoll`, `futex`, and the various `clock_gettime` calls are the usual suspects, but confirming would help.
* Has anyone patched or recompiled `runsc` with a sampling mechanism for the tracer? Or found a way to filter syscalls *after* the trace point but before the serialization hit?
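To make the sampling idea in the last bullet concrete, here is a minimal Python sketch of the kind of filter I have in mind. It is not `runsc` code and does not correspond to any existing `runsc` option: `SamplingFilter`, `noisy`, and `rate` are hypothetical names, and the noisy set is only my guess from the bullet above. The point is that the keep/drop decision happens before any serialization work.

```python
from collections import Counter


class SamplingFilter:
    """Keep every event for quiet syscalls, but only 1 in `rate` for noisy ones.

    Hypothetical illustration only; not part of runsc.
    """

    def __init__(self, noisy, rate):
        if rate < 1:
            raise ValueError("rate must be >= 1")
        self.noisy = set(noisy)
        self.rate = rate
        self.seen = Counter()  # per-syscall occurrence counter

    def should_log(self, syscall):
        """Decide before serialization, so a dropped event costs one counter bump."""
        if syscall not in self.noisy:
            return True
        self.seen[syscall] += 1
        # Keep the 1st, (rate+1)th, (2*rate+1)th ... occurrence of a noisy syscall.
        return (self.seen[syscall] - 1) % self.rate == 0


if __name__ == "__main__":
    # Assumed noisy set; confirm against real counts before relying on it.
    flt = SamplingFilter(noisy={"futex", "epoll_pwait", "clock_gettime"}, rate=100)
    events = ["futex"] * 1000 + ["openat"] * 3
    kept = [e for e in events if flt.should_log(e)]
    print(f"kept {len(kept)} of {len(events)} events")
```

A counter-based 1-in-N scheme keeps the sample deterministic, which makes traces easier to compare between runs than random sampling would.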
I'm leaning towards this being a fundamental limitation of full tracing in any high-interaction user-space kernel: the observability tax is simply the entire system's resources. But before I resign myself to only enabling this on test boxes with two cores I'm willing to sacrifice, I wanted to see if the hive mind has found any clever workarounds. Perhaps an eBPF filter on the host to drop certain trace events before they hit the log file? Or a custom `runsc` sink that batches writes more aggressively?
-- ben
Escape artist, security consultant.
You've hit the fundamental performance cliff of syscall-level instrumentation, which is exactly why production systems use selective audit rules and not blanket tracing. The `writev`/`epoll_wait` storm you see is the sentry trying to serialize and flush every single syscall event, arguments, and partial state to the log sink. For a non-trivial workload, this can be hundreds of thousands of events per second.
Instead of `trace-syscalls=all`, target the specific syscalls relevant to your investigation. Running `strace -c` on the workload inside the sandbox first is mandatory, so you know which syscalls actually dominate. For a Go HTTP server, you'll likely find the noise comes from `futex`, `epoll`, and endless `clock_gettime` calls, not the `write` or `connect` calls you probably care about.
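If you want to rank the offenders quickly, a small script over the `strace -c` summary does the job. This is a rough sketch assuming the usual `strace -c` table layout (`% time`, `seconds`, `usecs/call`, `calls`, `errors`, `syscall`); `parse_strace_summary` is a name I made up, and the numbers in `SAMPLE` are invented for illustration, not measurements from a real workload.

```python
# Invented sample in the usual `strace -c` layout; replace with real output.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.20    0.306000           6     51000      4100 futex
 22.40    0.112000           4     28000           epoll_pwait
 14.00    0.070000           2     35000           clock_gettime
  1.60    0.008000          20       400           write
  0.80    0.004000          33       120        12 openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.500000                114520      4112 total
"""


def parse_strace_summary(text):
    """Return [(syscall, calls), ...] sorted by call count, highest first."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have at least 5 columns; the "total" row is not a syscall.
        if len(parts) < 5 or parts[-1] == "total":
            continue
        try:
            float(parts[0])        # "% time" column; fails on header and rule lines
            calls = int(parts[3])  # "calls" column
        except ValueError:
            continue
        rows.append((parts[-1], calls))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    rows = parse_strace_summary(SAMPLE)
    total = sum(calls for _, calls in rows)
    for name, calls in rows:
        print(f"{name:<16} {calls:>8} {100 * calls / total:6.2f}%")
```

Whatever lands at the top of that list is what you leave out of the trace filter.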
A more surgical runtime config would be:
```
--trace-syscalls=openat,execve,connect,accept4 --log-fd-syscalls=false
```
This reduces the event rate by orders of magnitude. Also, ensure your `/tmp/gvisor/` isn't on a slow filesystem or a network mount; the blocking I/O on the log writes themselves can cause further scheduler pile-ups.
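To check what the log directory actually sits on, you can look the path up in the mount table. A minimal sketch, assuming the Linux `/proc/self/mounts` line format (device, mount point, filesystem type, options, ...); `fs_type_for`, `NETWORK_FS`, and the sample mount table are hypothetical, and escaped characters in mount points (such as `\040` for a space) are not handled.

```python
import os

# Filesystem types treated as "network" here; an assumed, non-exhaustive list.
NETWORK_FS = {"nfs", "nfs4", "cifs", "smb3", "ceph", "glusterfs", "fuse.sshfs"}

# Made-up mount table for illustration; on a real host, pass in the contents
# of /proc/self/mounts instead.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
fileserver:/export /mnt/logs nfs4 rw,relatime 0 0
"""


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing `path`."""
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same mount point shadow an earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__":
    for log_dir in ("/tmp/gvisor/", "/mnt/logs/gvisor"):
        fs_type = fs_type_for(log_dir, SAMPLE_MOUNTS)
        note = "network mount, avoid for debug logs" if fs_type in NETWORK_FS else "local"
        print(f"{log_dir}: {fs_type} ({note})")
```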
The CPU cost isn't linear; it's the context switches and buffer management that kill you. Have you considered using the `--profile` flags for CPU and heap instead? They often give you the insight without the logging overhead.
Syscalls don't lie.
Yes, user350's observation is the textbook outcome. The overhead isn't linear; it's catastrophic, because you're forcing a synchronous, high-fidelity serialization bottleneck onto the hottest path in the runtime. Every `getpid()` and `futex()` call now requires a full context capture and a write to what is likely a blocking I/O channel.
You can see this in the `strace` output: the `writev` storm is the sentry process trying to drain its internal ring buffer to the log file descriptor. If the disk or pipe can't keep up, the sandbox blocks, causing the guest application to stall, which then creates a feedback loop of pending log events.
While selective tracing is the correct answer, you also need to consider the sink. Writing to a FIFO pipe or a RAM disk can mitigate some I/O wait, but the serialization cost in the sentry itself remains immense.
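On the sink side, the batching idea from the original post looks roughly like this. It is a generic Python sketch, not a `runsc` extension point that I know to exist; `BatchingSink` and `max_records` are hypothetical names. Note what it does and doesn't buy you: fewer write calls, but every record still has to be built, so the serialization cost stays.

```python
import io


class BatchingSink:
    """Buffer log records and write them to the underlying stream in batches."""

    def __init__(self, stream, max_records=512):
        self.stream = stream
        self.max_records = max_records
        self.pending = []
        self.write_calls = 0  # how many times the underlying stream was written

    def emit(self, record):
        """Queue one record; flush once the batch is full."""
        self.pending.append(record)
        if len(self.pending) >= self.max_records:
            self.flush()

    def flush(self):
        """Write all pending records with a single write call."""
        if not self.pending:
            return
        self.stream.write("\n".join(self.pending) + "\n")
        self.write_calls += 1
        self.pending.clear()


if __name__ == "__main__":
    out = io.StringIO()  # stand-in for the real log file
    sink = BatchingSink(out, max_records=512)
    for i in range(2000):
        sink.emit(f"futex seq={i}")
    sink.flush()  # drain the final partial batch
    print(f"{sink.write_calls} writes for 2000 records")
```

The trade-off is that anything still sitting in `pending` is lost if the process dies before the next flush, which matters if the trace exists to debug a crash.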
Safe by default.