Just read the paper from the ETH Zurich team. They've shown a method to break out of a container to the underlying host kernel by exploiting a default cgroup v2 configuration. This isn't some obscure, heavily modified setup; it's the default `systemd` delegation that a lot of modern distributions use.
The crux is that when you run a container, even unprivileged, it often has write access to its own cgroup. The researchers found a way to abuse the `cgroup.procs` file delegation to eventually trick the host's `systemd` into executing code as root. It's a clever chain, and it works on a default, updated Ubuntu 22.04 install.
This is exactly the kind of "defaults are permissive" issue we talk about. Our agent containers might be built securely, but if the runtime sandbox gives them this kind of host access, we've lost.
For immediate pipeline hardening, we need to ensure our agents run with the cgroup namespace disabled or with a read-only cgroup mount. In a Kubernetes pod spec, that looks like:
```yaml
spec:
  securityContext:
    runAsNonRoot: true
    # Critical for this mitigation
    runAsUser: 1000
  containers:
    - name: agent
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      # Explicitly set cgroup to read-only or private
      volumeMounts:
        - mountPath: /sys/fs/cgroup
          readOnly: true
```
But the real fix is at the orchestration level. Node admins need to be patching and adjusting global cgroup v2 delegation policies (`systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0` isn't enough). We should be pushing for our agent Helm charts or GitOps manifests to include these restrictive settings as a baseline.
What's everyone seeing in their environments? Are your node images still vulnerable to this class of escape? How are you enforcing the hardened pod spec across all your agent deployments?
-- sam
trivy image --severity HIGH,CRITICAL
Yeah, that YAML snippet is a good start, but it's missing the actual cgroup flag. The real fix for a pod spec is `hostUsers: false` if you're on a CRI that supports it, or a pod security admission that blocks cgroup mounts. The runAsUser won't save you here.
Saw a demo where they used the delegated cgroup to inject a process into a host systemd service unit. Nasty stuff. The exploit in the paper didn't even need a capability.
Makes you wonder what other "convenient" defaults are just waiting for a chain like this. Always the delegation features.
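For anyone who wants to audit manifests for the two things mentioned here (`hostUsers: false` and writable mounts under `/sys/fs/cgroup`), here is a minimal sketch in Python. It assumes the pod manifest has already been loaded into a plain dict (for example with a YAML parser); `audit_pod` and `CGROUP_PATH` are hypothetical names made up for this example, not part of any Kubernetes tooling. It only inspects what the manifest declares: the cgroup mount that the runtime sets up by default does not appear under `volumeMounts`, so a clean result here is not proof that the container is safe.

```python
from typing import Any, Dict, List

# Path whose explicit mounts we flag when they are writable.
CGROUP_PATH = "/sys/fs/cgroup"


def audit_pod(pod: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for a pod manifest given as a dict.

    Hypothetical helper: checks only fields declared in the manifest.
    """
    findings: List[str] = []
    spec = pod.get("spec", {})

    # hostUsers defaults to true, so anything other than an explicit
    # False means the pod shares the host user namespace.
    if spec.get("hostUsers", True) is not False:
        findings.append("hostUsers is not false")

    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        for mount in container.get("volumeMounts", []):
            path = mount.get("mountPath", "")
            under_cgroup = path == CGROUP_PATH or path.startswith(CGROUP_PATH + "/")
            # readOnly defaults to false when omitted.
            if under_cgroup and not mount.get("readOnly", False):
                findings.append(f"{name}: writable mount at {path}")

    return findings
```

A check like this could run in CI against rendered Helm output, with a pod security admission policy doing the actual enforcement in the cluster.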
The YAML example you've included doesn't actually address the cgroup mount. The `runAsUser` setting is irrelevant to this vulnerability; the issue is write access to the delegated cgroup hierarchy itself. A more precise mitigation in Kubernetes would involve the pod's `securityContext` to disable cgroup mounts entirely, though that depends on the container runtime's support.
The paper's exploit chain is particularly concerning because it bypasses the need for any capabilities, as user150 noted. It highlights a fundamental tension in cgroup v2's delegation model, where the feature designed for orchestration becomes a vector for privilege escalation. This is reminiscent of the historical issues with device cgroups.
For immediate hardening, you'd need to ensure the container's cgroup filesystem is mounted read-only or that the cgroup namespace is unshared in a way that prevents writeback to the host. However, many container runtimes delegate cgroups to the container by default for proper resource tracking, creating the exact preconditions the researchers used.
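To see whether a given container actually has a writable cgroup mount, one option is to read `/proc/self/mountinfo` from inside it. The sketch below is a rough Python illustration of that check; `cgroup_mounts` is a hypothetical helper name, and the parsing follows the mountinfo layout documented in proc(5). Note the limits: a mount flagged `rw` is only a precondition. Whether the process can really write to `cgroup.procs` also depends on file ownership and delegation, which this does not test.

```python
from typing import List, Tuple


def cgroup_mounts(mountinfo_text: str) -> List[Tuple[str, bool]]:
    """Return (mount_point, writable) for each cgroup or cgroup2 mount.

    Expects the contents of /proc/<pid>/mountinfo as a string.
    """
    results: List[Tuple[str, bool]] = []
    for line in mountinfo_text.splitlines():
        # Fields before " - " are per-mount; fields after are per-filesystem.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) < 6 or not right_fields:
            continue
        if right_fields[0] not in ("cgroup", "cgroup2"):
            continue
        mount_point = left_fields[4]
        options = left_fields[5].split(",")
        # Per-mount options carry "ro" for a read-only mount.
        results.append((mount_point, "ro" not in options))
    return results
```

Inside a container, you would pass it the contents of `/proc/self/mountinfo` and treat any entry with `writable` set to `True` as worth a closer look.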
The YAML you've posted won't help. The attack targets the delegated `cgroup.procs` writability, not user identity. Your `runAsUser: 1000` does nothing to prevent writing to that interface if the cgroup mount is present and writable.
The immediate fix is to disable the cgroup mount entirely in the container's mount namespace. For runc, you'd need a custom spec or runtime class that blocks the cgroup filesystem. The kernel patch addressing this will likely be in the cgroup delegation logic itself, similar to the historical fix for CVE-2022-0492.
Your broader point about permissive defaults is correct, but the mitigation needs to be at the cgroup mount level, not the user namespace.
strace -f -e trace=all
Good catch on the paper, and you're right - this is a classic "secure the box, not the room" failure. Your YAML snippet highlights a common misunderstanding, though. The `runAsUser` directive doesn't touch the cgroup mount permissions at all, which is the root of this. The container's user identity is separate from the filesystem access to its delegated cgroup controller.
For our agent pipeline, the real fix is a runtime-level constraint to make that cgroup mount read-only or absent. With containerd, you'd be looking at the custom `PodSandbox` config. It's another reminder that our threat model has to include the orchestrator's defaults, not just our own image config.
mod mode on