We've been seeing a pattern in recent breakout attempts where the pod security context is trusted implicitly, but the container image itself has a more permissive default user or capabilities baked in.
Example: A pod spec sets `runAsNonRoot: true` and drops all capabilities, but the container image has a `USER 0` directive in its Dockerfile and a setuid binary. The runtime enforces the pod spec, but if that layer is bypassed, the container's own configuration becomes the attack surface.
So, which provides the stronger security boundary: the Kubernetes enforcement, or the hardened container build? The pod spec is a runtime gate, but the container's own config is the actual runtime environment. If a flaw allows a breakout to the node, the container's own defaults may be what ends up applying.
I lean towards "defense in depth is the only answer," but I'm curious about which layer has failed more often in real escapes. Any concrete examples or CVEs?
Stay safe, stay skeptical.
Your example is the whole problem. If you rely on the pod spec as your primary boundary, you're already wrong. The container image is the real execution environment.
Those breakout attempts you mentioned usually exploit a flaw that bypasses Kubernetes enforcement. At that point, the container's baked-in config is what you're stuck with. A pod spec is just a policy overlay.
So the answer is obvious. The stronger boundary is the one you actually control: the build. The pod spec is a safety net, and a flimsy one. Hardening the container itself shrinks your attack surface regardless of the orchestrator's integrity.
Defense in depth means both, but prioritize the thing that can't be switched off by a runtime bug. Which layer fails more? The one you didn't bother to fix because you assumed the platform would save you.
mw
I agree that the container image is the more fundamental layer, but calling the pod security context "flimsy" misses its crucial role in centralization. You control the build, but in a large organization, you may not control every image being run. The pod security context, especially when enforced via Admission Control or a PSA policy, is what lets you apply a security baseline to *all* workloads, including third-party images you can't rebuild.
My own cluster monitoring shows numerous cases where a Pod's security context blocked a privilege escalation that the container's defaults would have allowed. It's a runtime gate, yes, but a necessary one for governance. The real failure mode I see is treating them as mutually exclusive. The context fails when you assume it's sufficient on its own, not when you use it as an enforced supplement to a hardened image.
So the priority should be both: enforce strict pod security standards at the platform level *and* mandate hardened builds. Relying solely on the build leaves you vulnerable to whatever gets pushed to your registry.
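To make the "baseline for all workloads" idea concrete, here is a minimal sketch of the kind of check an admission policy applies. It covers only a simplified subset of the Pod Security Standards "restricted" profile, looks only at the container-level `securityContext` (real policies also consider pod-level settings and more fields), and the function name `restricted_violations` is hypothetical.

```python
from typing import Any


def restricted_violations(container: dict[str, Any]) -> list[str]:
    """Return reasons a container spec misses a simplified 'restricted'-style baseline.

    Illustrative subset only; not the actual Pod Security Admission logic.
    """
    sc = container.get("securityContext") or {}
    problems = []

    if sc.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if sc.get("runAsUser") == 0:
        problems.append("runAsUser must not be 0")
    if sc.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")

    caps = sc.get("capabilities") or {}
    if "ALL" not in (caps.get("drop") or []):
        problems.append("capabilities.drop must include ALL")
    # NET_BIND_SERVICE is the only capability the restricted profile lets you add back.
    extra = set(caps.get("add") or []) - {"NET_BIND_SERVICE"}
    if extra:
        problems.append(f"capabilities.add not allowed: {sorted(extra)}")

    return problems
```

Nothing in this check ever looks inside the image, which is why it works on third-party containers you can't rebuild, and also why it can't tell you about a setuid binary sitting in a layer.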
metric over magic
The container's baked-in config is the actual execution environment, so that layer failing is catastrophic. The pod security context is a policy filter applied by the runtime - if the runtime is compromised or bypassed, that filter disappears. Your example with `USER 0` and a setuid binary is perfect. A breakout that compromises the kubelet or the container runtime would find that binary ready to use, while the `runAsNonRoot` directive is just a line in a YAML file the attacker no longer respects.
I've seen this pattern in root cause investigations: the initial exploit chain often ends with "and then the container ran as root because its Dockerfile said to." The pod spec isn't a security boundary in the same way. It's a control. The image *is* the boundary.
Defense in depth means both, yes, but you've identified the real question: which layer has failed more? In my experience reviewing incidents, it's the assumption that the runtime control will hold that fails. The container's own defaults are the constant.
Run as non-root or don't run.
You've hit the nail on the head with runtime compromise. That's the kill chain.
But your conclusion about the image being *the* boundary is wrong. It's a softer one. If an attacker controls the runtime, they control the container's entire environment, not just the UID. They can mount volumes, inject binaries, change kernel parameters. Your "USER 0" Dockerfile is the least of your worries.
The point isn't which boundary is stronger. It's that the pod security context fails silently. The image's config fails loudly. I've seen incidents where the PSA policy blocked an exploit and nobody noticed because it just logged a violation. The flawed image still ran. The failure is in the monitoring and response, not the layer.
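For the "logged a violation and nobody noticed" problem, a rough sketch of surfacing those entries is below. It assumes Pod Security Admission runs in audit mode and that the API server audit log is written as one JSON event per line; the helper name `psa_violations` is hypothetical, and your log pipeline may shape events differently.

```python
import json
from typing import Iterable, Iterator

# Annotation key that Pod Security Admission attaches to audit events in audit mode.
AUDIT_KEY = "pod-security.kubernetes.io/audit-violations"


def psa_violations(lines: Iterable[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (namespace, name, message) for audit events carrying a PSA violation."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON events
        if not isinstance(event, dict):
            continue
        message = (event.get("annotations") or {}).get(AUDIT_KEY)
        if message:
            ref = event.get("objectRef") or {}
            yield ref.get("namespace", ""), ref.get("name", ""), message
```

Feeding this from an open log file and alerting on any output is one way to turn a silent log line into something a human actually sees.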
Trust but verify.
You're spot on about the image config becoming the attack surface after a breakout. I see this often when writing network policies for egress - if a pod's security context gets bypassed, my rules are useless if the container's internal config allows unrestricted outbound calls.
> but I'm curious about which layer has failed more often in real escapes
For pure container escapes, I'd point to runtime CVEs as the primary failure point (think runc, recent CRI-O issues). Those bypass *both* layers. But the post-exploit impact? That's where the baked-in `USER 0` or hidden capabilities in the image do real damage. The security context didn't "fail", it was rendered irrelevant by a lower-level flaw.
Defense in depth isn't just both, it's acknowledging the image is your last line of defense when everything else is gone. So harden it like your cluster depends on it.
allow nothing by default
Yeah, that root cause you mentioned hits close to home. Seen it too many times.
You're right, the runtime filter just vanishes if the layer below is gone. It's like locking your front door but leaving a window wide open because you figure the lock is "good enough." The `USER 0` in the Dockerfile *is* that open window.
My own take is that the pod spec failure is quieter, like you said. But the container's config is the actual terrain an attacker has to fight on after a breakout. If your image is stripped down and running as a safe, non-root user with no caps, their job gets way harder even if they bypass Kubernetes. That's the real last stand.
So maybe the question isn't which layer fails more, but which one's failure hurts the most? For me, it's always the image defaults. They're the foundation everything else sits on.
Selfhosted since 2004
Your question about which layer fails more is the key. In real incidents, the runtime gate (pod security context) fails more frequently, but it's often a silent or logged failure within the cluster's control plane. The container image's defaults rarely "fail" on their own. They are simply exposed when the runtime layer is bypassed.
Recent CVE patterns show this. Exploits for CVE-2021-30465 (runc) or CVE-2024-21626 (runc again) bypassed the container runtime, making pod security context irrelevant instantly. The subsequent impact, however, was defined entirely by the baked-in image configuration. If the image had USER 0 and default capabilities, the attacker's foothold was root on the node.
So the pod security context fails more often in terms of being circumvented. The hardened build's value is measured in the severity of the breach when that happens. You can't cite a CVE where a Dockerfile USER directive was the initial vulnerability, but you can find many where it was the critical enabler for post-exploit lateral movement.
Compliance is a side effect of good architecture.
Your "actual terrain an attacker has to fight on" is a perfect way to frame it. It's the kernel's perspective, really. Once a breakout occurs, my eBPF hooks see the resulting syscalls, and they reflect the container's baked-in identity, not the pod spec's wishes.
I'd add a small caveat to the last stand idea: the image defaults are foundational, but the kernel ultimately sees the process's real, effective credentials. If your image runs as UID 1000 but has a setuid binary or a capability like CAP_SYS_MODULE left in the filesystem, that's the terrain. An attacker's first move post-breakout is often to exec that binary or call capset(2). The pod spec's dropped capabilities list doesn't survive the runtime compromise, but a missing capability from the image's filesystem and configuration absolutely does.
So the failure that hurts most is indeed the image, because it's the environment that persists into the compromised state. The pod spec is an ephemeral policy layer that evaporates.
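If you want to know what terrain an image leaves behind, a quick sketch like this can list setuid/setgid files in an unpacked image root filesystem. It assumes you've already exported the image to a directory (the `rootfs` path is whatever you extracted to), the function name `find_setid_files` is hypothetical, and it does not check file capabilities stored in the `security.capability` xattr.

```python
import os
import stat


def find_setid_files(rootfs: str) -> list[tuple[str, str]]:
    """Walk an unpacked image filesystem and report setuid/setgid regular files."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(rootfs):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            except OSError:
                continue  # unreadable entry, skip it
            if not stat.S_ISREG(mode):
                continue
            if mode & stat.S_ISUID:
                findings.append((path, "setuid"))
            if mode & stat.S_ISGID:
                findings.append((path, "setgid"))
    return findings
```

Running it in the build pipeline and failing on any unexpected hit keeps that terrain bare before the image ever reaches a registry.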
~ jay
Great example, and I've seen that exact pattern bite a team using a third-party logging sidecar. They'd set `runAsNonRoot` in their own pod spec, trusting it, but the sidecar image had `USER 0` and a `chmod +s` on a diagnostic tool. When a runtime CVE popped (can't recall the number, but it was a runc issue last year), that setuid binary was the trampoline to node root.
So to your question about which layer fails more, I think it's a trap. The pod security context fails *silently* and frequently when runtime CVEs hit. The image's defaults don't "fail," they just sit there, passive, waiting for the runtime enforcement to be stripped away. That's why I'd call the hardened container the more *reliable* boundary, because its existence doesn't depend on the orchestrator's health. It's just there.
You asked for a CVE - CVE-2024-21626 is a textbook case. Exploit bypasses the container runtime, and suddenly the attacker's shell inherits whatever the Dockerfile set as USER. If that's root, game over. The pod spec didn't fail, it was just erased from the picture. So the real lesson is to treat the pod security context like a seatbelt, but build the container like a roll cage.
iptables -A INPUT -j DROP
The kernel doesn't see the pod spec, it sees the result of the container runtime's setup. So when you say the container's config is the *actual terrain*, you're describing the post-compromise reality.
But that terrain is still defined by the initial merge of the image config and the pod security context. If the image has `USER 0`, but the pod spec sets `runAsNonRoot: true` with no `runAsUser`, the kubelet checks one against the other and refuses to start the container. The attacker never gets to that terrain unless the runtime layer is compromised. So the image defaults are the foundation, but they are only exposed after the orchestration layers are stripped away by a CVE. The hurt is absolute then, because it's the raw, ungoverned image that's left.
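A simplified model of that check, as I understand it, is sketched below. This is not the kubelet's code: the function name `non_root_check` and the reason strings are made up for illustration, and treating an image with no `USER` as root is an assumption about runtime behavior.

```python
from typing import Optional


def non_root_check(run_as_non_root: bool, run_as_user: Optional[int], image_user: str) -> Optional[str]:
    """Return None if the container may start, otherwise a reason for refusing it.

    image_user is the image config's User field, e.g. "", "0", "1000:1000", "nginx".
    """
    if not run_as_non_root:
        return None  # nothing to verify
    if run_as_user is not None:
        # An explicit runAsUser overrides whatever the image declares.
        return "runAsUser is 0" if run_as_user == 0 else None

    user = image_user.split(":", 1)[0].strip()  # drop an optional ":group" part
    if user == "":
        return "image declares no USER, treated as root"  # assumption, see lead-in
    if user.isdigit():
        return "image will run as root" if int(user) == 0 else None
    # A name like "nginx" cannot be proven non-root without reading the image's passwd file.
    return f"image has non-numeric user ({user}), cannot verify non-root"
```

Note that `non_root_check(True, 1000, "0")` returns `None`: with an explicit `runAsUser`, the `USER 0` image starts fine as UID 1000, so the refusal only happens when the pod spec leaves the UID up to the image.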
Exactly. The runc CVEs are the textbook case for runtime bypass rendering orchestration controls moot. Your point about CVE-2024-21626 is particularly illustrative - a leaked internal file descriptor let an attacker set the container's working directory to a path on the host filesystem, leading to a container escape. At the moment of escape, the pod security context becomes a set of suggestions the runtime is no longer enforcing.
But this highlights a nuance in your "critical enabler" point. The image configuration isn't just a static enabler; it's an immutable artifact. Once a runtime CVE is exploited, the attacker's capabilities are frozen to what the image provides. If that image was built with a non-root user and dropped capabilities, even a full container escape yields a process with those constraints on the host. The pod security context is dynamic and disappears, but the image's security posture is baked into the process itself.
So while we can't cite a CVE where `USER 0` was the initial vulnerability, we can analyze post-exploit CVEs where its absence contained the blast radius. The 2021 runc breakout (CVE-2021-30465) in a properly hardened image often resulted in a non-root foothold on the node, drastically limiting lateral movement options compared to a default `docker.io/library/nginx` image.
A CVE a day keeps the complacency away.