Spot on. That omission is why a lot of container security feels like a theoretical exercise. You mentioned `CONFIG_USER_NS` being missing, but there's another subtlety even when it's present: the `user.max_user_namespaces` sysctl. A distro like RHEL or a hardened host might have the feature compiled in but set that limit to zero, effectively disabling it at runtime. So your rootless container guide assumes a capability the admin has intentionally neutered.
The same logic applies to seccomp. If the kernel was built with `CONFIG_SECCOMP` but not `CONFIG_SECCOMP_FILTER`, you only get strict mode (`read`, `write`, `_exit`, `sigreturn`), with no BPF filtering. Your fancy JSON profile specifying complex rules can't be enforced as written, and you might never know from the runtime logs.
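For what it's worth, here's a rough way to look at the seccomp side from a shell. It's a sketch, not a definitive test: the function names are mine, and it assumes a kernel new enough (4.14 or later, as far as I know) to expose `/proc/sys/kernel/seccomp/actions_avail`, which only appears when filter support is built in.

```shell
# seccomp_filter_state [ACTIONS_FILE]
# Reports whether the kernel lists any seccomp filter actions.
# A missing file means no filter support OR a kernel too old to expose it.
seccomp_filter_state() {
    actions_file="${1:-/proc/sys/kernel/seccomp/actions_avail}"
    if [ -r "$actions_file" ]; then
        echo "filter-available: $(cat "$actions_file")"
    else
        echo "filter-missing-or-old-kernel"
    fi
}

# seccomp_mode [STATUS_FILE]
# Prints the seccomp mode of a process: 0 = disabled, 1 = strict, 2 = filter.
seccomp_mode() {
    status_file="${1:-/proc/self/status}"
    awk '/^Seccomp:/ { print $2 }' "$status_file"
}
```

Run `seccomp_mode` inside the container, not on the host. If it prints `0` while you think a profile is loaded, the profile isn't being enforced.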
The only real audit is a two-step check: compile-time flags *and* runtime sysctl state. Anything less is guessing.
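A minimal sketch of that two-step check for user namespaces, in case it helps. The function names are made up for illustration, and it expects the kernel config as plain text on stdin, so you have to decompress `/proc/config.gz` (or find the config elsewhere) yourself.

```shell
# Step 1 (compile time): succeeds if FLAG is set to =y in the
# kernel config text arriving on stdin.
kconfig_builtin() {
    grep -q "^$1=y\$"
}

# Step 2 (runtime): succeeds if the sysctl file holds a number above zero.
userns_limit_nonzero() {
    limit_file="${1:-/proc/sys/user/max_user_namespaces}"
    [ -r "$limit_file" ] || return 1
    read -r limit < "$limit_file"
    [ "$limit" -gt 0 ] 2>/dev/null
}

# userns_verdict [LIMIT_FILE]
# Reads config text on stdin and prints a single verdict line.
userns_verdict() {
    if ! kconfig_builtin CONFIG_USER_NS; then
        echo "not built in (or config unreadable)"
    elif ! userns_limit_nonzero "$1"; then
        echo "built in, but disabled at runtime"
    else
        echo "usable"
    fi
}

# Usage: zcat /proc/config.gz | userns_verdict
```

The middle verdict is the case most guides miss: the flag is there, the limit is zero.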
hardened by default
You're focusing on compile-time, but the silent failure is even more insidious at runtime. The runtime's own error messages are often lies. It'll throw "permission denied" when the real error is "kernel doesn't understand this request." The abstraction is so leaky it's flooding the basement, and the guides just tell you to mop faster.
`rm -rf /` is an API call away.
Oh that's a good point about local LLMs. I was just setting up llama.cpp with Docker and assumed the cgroup limits I set would work. You're saying the kernel's hugetlb controller could be the real bottleneck?
How would I even check if that's enabled? Is it in /proc/config.gz too, or is it more about the running cgroup setup?
Totally. The others already gave you the exact commands, which is great. I'd add one more quick check I use a lot for this specific flag: `zgrep -q '^CONFIG_USER_NS=y' /proc/config.gz 2>/dev/null && echo "present" || echo "maybe not built"`. Note that it has to be `zgrep` because the file is gzipped, and anchoring on `=y` stops it from matching the `# CONFIG_USER_NS is not set` line. But honestly, the sysctl that user432 mentioned (`user.max_user_namespaces`) is the real gotcha. You can have the config enabled but the feature locked down to zero. That's the first thing I'd check after verifying the compile flag.
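The "maybe" is there because `/proc/config.gz` only exists when the kernel was built to expose it. Something like the following handles the fallback. It's a sketch with a made-up function name, and it assumes your distro ships the config as `/boot/config-$(uname -r)`, which many do but not all.

```shell
# kernel_config_text [PROC_GZ] [BOOT_CFG]
# Prints the kernel config as plain text from the first readable source.
# Returns 2 when neither source can be read.
kernel_config_text() {
    proc_gz="${1:-/proc/config.gz}"
    boot_cfg="${2:-/boot/config-$(uname -r)}"
    if [ -r "$proc_gz" ]; then
        gzip -dc "$proc_gz"
    elif [ -r "$boot_cfg" ]; then
        cat "$boot_cfg"
    else
        echo "no kernel config found" >&2
        return 2
    fi
}

# Usage: kernel_config_text | grep '^CONFIG_USER_NS='
```

If it returns 2, you can't conclude anything about compile-time flags and are left with runtime probing only.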
build and break
Exactly. Your point about the weakest link being ignored is why benchmarks fail. Everyone tests container escape on a stock Ubuntu kernel with everything enabled and calls it "secure."
Even with CONFIG_USER_NS present, what's the *performance* impact on syscalls with nested namespaces? I've seen a 15% latency hit on network-heavy microservices when you actually enable user namespace remapping. The guides never mention that trade-off. They treat it as a binary checkbox.
The omission is worse for cgroup v2 controllers. Guides say "set memory limits." But if the host kernel wasn't built with `CONFIG_MEMCG`, or the hugetlb controller (`CONFIG_CGROUP_HUGETLB`) isn't enabled, those limits are lies. Show me a single vendor hardening guide that includes a kernel feature audit script. They don't.
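The runtime half of that audit is not hard to sketch. This assumes a unified cgroup v2 hierarchy mounted at `/sys/fs/cgroup`; on v1 or hybrid hosts the file won't be there. The function names are mine.

```shell
# cgroup2_has_controller NAME [CONTROLLERS_FILE]
# Succeeds if NAME is listed as an available controller.
# Returns 2 when the controllers file cannot be read.
cgroup2_has_controller() {
    controllers_file="${2:-/sys/fs/cgroup/cgroup.controllers}"
    [ -r "$controllers_file" ] || return 2
    # The file is one space-separated line; match whole names only.
    tr ' ' '\n' < "$controllers_file" | grep -qx "$1"
}

# cgroup2_report [CONTROLLERS_FILE]
# Prints one line per controller that container limits tend to rely on.
cgroup2_report() {
    for c in memory hugetlb pids; do
        if cgroup2_has_controller "$c" "$1"; then
            echo "$c: available"
        else
            echo "$c: missing"
        fi
    done
}
```

Note that an unreadable file reports everything as missing, and that "available" at the root still doesn't mean the controller is delegated to your container's cgroup. Check `cgroup.subtree_control` along the path for that.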
Prove it.