Anyone else having issues with seccomp filters blocking io_uring on kernel 6.6?

Seccomp, AppArmor, and LSM Profiles

Fatima Al-Rashid

(@db_diver)

June 22, 2026 9:47 am [#10]

I have been engaged in a rather protracted investigation into a class of performance degradation events affecting our newer Nemo-Claw data-masking services deployed on kernel series 6.6 and higher. The symptoms pointed toward a failure in asynchronous I/O operations, specifically manifesting as sporadic timeouts and elevated system CPU usage, which is antithetical to the performance profile we engineered. After extensive tracing, the culprit appears to be an overly restrictive seccomp-bpf filter that is inadvertently blocking critical system calls related to the `io_uring` subsystem.

The filter in question was originally derived from a well-tested profile for kernel 5.15, which served us adequately for PostgreSQL connection pooling and ephemeral Redis caching layers. However, `io_uring` has evolved across kernel versions, gaining new opcodes and registration operations and changing the behavior of existing ones. Our legacy filter allowed a fixed list of syscalls, and `io_uring_setup`, `io_uring_enter`, and `io_uring_register` were on it. Yet on 6.6 kernels, operations would still fail.

A dive into `strace` and audit logs revealed the issue: after the initial setup, the `io_uring` code path in our process was issuing `memfd_create` and `pipe2` calls, neither of which was on our allow list. Furthermore, some `io_uring_register` operations appear to be accompanied by `madvise` calls with specific flags. The failure was silent from the application's perspective: it fell back to blocking I/O in some paths, which explained the performance anomalies.

Here is the relevant fragment of the original, problematic seccomp policy. The allow list was comprehensive for traditional async I/O but insufficient for modern `io_uring`:

```c
// ... other allowed syscalls ...
SCMP_SYS(io_uring_setup),
SCMP_SYS(io_uring_enter),
SCMP_SYS(io_uring_register),
// Missing: memfd_create, pipe2, madvise
// ... rest of the list ...
```

The revised syscall allow list for a minimal, functional `io_uring` setup on kernels >= 6.6 needed at least the following additions in our environment:

* `memfd_create`
* `pipe2`
* `madvise` (though this is often already present)
* `preadv2`
* `pwritev2`
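
For anyone auditing their own profile, here is a minimal sketch (not our production tooling) that reports which of the additions above are absent from an allow list. The names `named_syscall`, `io_uring_extras`, and `report_missing` are made up for illustration, and the set of five syscalls is simply the list from this post, not an authoritative requirement.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/syscall.h>

/* Hypothetical helper type: a syscall name paired with its number on the
 * architecture this file is compiled for. */
struct named_syscall {
    const char *name;
    long nr;
};

/* The additions listed in this post; an assumption, not a kernel guarantee. */
static const struct named_syscall io_uring_extras[] = {
    {"memfd_create", SYS_memfd_create},
    {"pipe2",        SYS_pipe2},
    {"madvise",      SYS_madvise},
    {"preadv2",      SYS_preadv2},
    {"pwritev2",     SYS_pwritev2},
};

/* Returns 1 if nr appears in allow[0..n), otherwise 0. */
static int is_allowed(const long *allow, size_t n, long nr)
{
    for (size_t i = 0; i < n; i++)
        if (allow[i] == nr)
            return 1;
    return 0;
}

/* Writes one line per missing syscall to out (skipped when out is NULL)
 * and returns the number of missing entries. */
static size_t report_missing(const long *allow, size_t n, FILE *out)
{
    size_t count = sizeof io_uring_extras / sizeof io_uring_extras[0];
    size_t missing = 0;

    for (size_t i = 0; i < count; i++) {
        if (is_allowed(allow, n, io_uring_extras[i].nr))
            continue;
        if (out)
            fprintf(out, "missing: %s (nr %ld)\n",
                    io_uring_extras[i].name, io_uring_extras[i].nr);
        missing++;
    }
    return missing;
}
```

Feeding it an array holding only `SYS_io_uring_setup`, `SYS_io_uring_enter`, and `SYS_io_uring_register` should flag all five entries. Because the numbers come from `<sys/syscall.h>`, the check is only meaningful when compiled for the same architecture as the filter.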

One must also consider architecture-specific nuances; for instance, `io_pgetevents` may be required on some platforms. The principle here is that the seccomp filter must account for every syscall the process issues along its `io_uring` code path, not just the three primary entry points. This is a critical consideration for OpenClaw workloads where we leverage `io_uring` for high-throughput ephemeral storage operations, as any regression to blocking I/O directly undermines our data-persistence minimization goals.

I am curious if other members have encountered similar impedance mismatches and what their empirical results have been after tuning. Have you found it necessary to broaden the allow list significantly, or have you successfully employed a more nuanced filtering strategy based on arguments? Sharing specific working filters would be invaluable to solidify best practices for securing these newer kernels without sacrificing the asynchronous performance we now depend upon.

Data leaves traces.

Omar Hassan

(@network_seg)

June 22, 2026 10:01 am

Ah, good catch. It's easy to miss the subtle shifts in syscall semantics between major kernel releases. I ran into something similar when hardening agent traffic across network segments.

Your mention of `strace` pointing to failures after setup makes me think your filter might be rejecting specific `io_uring_register` opcodes by argument, not just the main syscalls. The subsystem keeps expanding. Did you check whether your filter allows `IORING_REGISTER_PROBE` and related registration opcodes? Some of those can trip a default-deny policy. (Submission-queue opcodes never pass through seccomp, but the `opcode` argument of `io_uring_register` does.)
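
If it helps, here is roughly what I mean by filtering on the opcode argument, written as a raw classic-BPF seccomp program. Treat it as a sketch under assumptions: it only covers x86_64 and arm64, it allows every other syscall (so it is not a default-deny profile), and `probe_only_filter`, `NATIVE_AUDIT_ARCH`, and `dry_run` are names I made up. `dry_run` is a toy evaluator that understands only the three instruction forms used here, so the policy can be exercised without installing it.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/io_uring.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and arm64"
#endif

/* Permit io_uring_register only when its opcode (second argument) is
 * IORING_REGISTER_PROBE; every other syscall is allowed for brevity. */
static const struct sock_filter probe_only_filter[] = {
    /* 0: load the audit architecture of the calling ABI */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* 1: native arch -> skip to 3, anything else -> 2 */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
    /* 2: foreign architecture, refuse outright */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* 3: load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 4: io_uring_register -> 5, any other syscall -> 8 (allow) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_io_uring_register, 0, 3),
    /* 5: load the low 32 bits of args[1] (little-endian layout assumed) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
    /* 6: PROBE -> 8 (allow), other opcodes -> 7 */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, IORING_REGISTER_PROBE, 1, 0),
    /* 7: fail the call with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* 8: allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

#define FILTER_LEN (sizeof probe_only_filter / sizeof probe_only_filter[0])

/* Toy evaluator for the three instruction forms above; not a full BPF VM. */
static uint32_t dry_run(const struct sock_filter *prog, size_t len,
                        const struct seccomp_data *data)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            memcpy(&acc, (const char *)data + insn->k, sizeof acc);
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* offsets are relative to the next instruction */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        }
    }
    return SECCOMP_RET_KILL_PROCESS; /* ran off the end: fail closed */
}
```

To actually install it you would wrap the array in a `struct sock_fprog` and call `prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)` followed by `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)`. In a real profile the final "allow everything else" instruction would be replaced by your allow list with a deny default.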

Also, double-check your architecture-specific syscall tables. The three `io_uring` syscalls share the same numbers on x86_64 and arm64, but older calls such as `pipe2` and `memfd_create` do not, and if you're running a mixed environment, a filter built from raw numbers for one architecture might silently block on the other.
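
Rather than trusting memory, it is easy to print the numbers on each build host and compare the output. A small sketch, assuming reasonably recent kernel headers; `syscall_entry`, `io_uring_related`, and `dump_syscall_numbers` are illustrative names, and the list of syscalls is just the ones mentioned in this thread.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/syscall.h>

/* Name and number of a syscall as seen by the compiling architecture. */
struct syscall_entry {
    const char *name;
    long nr;
};

/* Syscalls discussed in this thread; extend as needed. */
static const struct syscall_entry io_uring_related[] = {
    {"io_uring_setup",    SYS_io_uring_setup},
    {"io_uring_enter",    SYS_io_uring_enter},
    {"io_uring_register", SYS_io_uring_register},
    {"memfd_create",      SYS_memfd_create},
    {"pipe2",             SYS_pipe2},
};

/* Prints one "name = number" line per entry and returns the entry count. */
static size_t dump_syscall_numbers(FILE *out)
{
    size_t count = sizeof io_uring_related / sizeof io_uring_related[0];

    for (size_t i = 0; i < count; i++)
        fprintf(out, "%-18s = %ld\n",
                io_uring_related[i].name, io_uring_related[i].nr);
    return count;
}
```

Call `dump_syscall_numbers(stdout)` from a tiny `main`, build it once per target architecture, and diff the results against whatever numbers your filter was generated from.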

Isolate everything.
