Having just endured yet another vendor webinar extolling the "secure-by-design" virtues of NIM containers, I feel compelled to ask what should be a foundational question, yet one that is consistently glossed over with marketing platitudes. Everyone wants to show you the 50,000-foot architecture diagram, but nobody wants to get into the gritty details of what the container actually *does*.
So, let's strip it back. Forget the orchestration layer and the shiny UI for a moment. At its core, a NIM container is a wrapped model serving inference requests. What are the **minimal, non-negotiable capabilities** it requires to function, assuming we want to adhere to least privilege? The sales engineers always default to "just run it as root" or "it needs `CAP_NET_RAW` for... reasons," but I refuse to accept that.
Based on tearing down a few public images and some painful runtime debugging, here's my starting point for a discussion. I'm sure I've missed something, which is precisely why I'm posting this.
**From a Linux capabilities perspective, I'd argue for:**
* `CAP_DAC_OVERRIDE` - Often required to read/write files under various user IDs the container might assume, especially with mounted volumes for model artifacts or logs. A major red flag, but seemingly common.
* `CAP_NET_BIND_SERVICE` - To bind to privileged ports (<1024) if your deployment standard demands it. Ideally, you'd avoid this and bind to a high port instead.
* `CAP_SYS_NICE` - Possibly for internal process priority management of inference threads? Still investigating if this is truly essential or just lazy.
**What it should NOT need (but is often requested):**
* `CAP_SYS_ADMIN` (absolute overkill)
* `CAP_NET_RAW` (why does a model server need to craft packets?)
* Running as UID 0 (root). There's no excuse.
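For anyone who wants to check what a running container actually holds instead of trusting the docs, here is a minimal sketch that decodes the `CapEff` bitmask from `/proc/<pid>/status`. The function names `decode_caps` and `effective_caps` are my own, and the bit table is transcribed from `linux/capability.h`, so double-check it against your kernel headers before relying on it.

```python
# Capability names indexed by bit position, as listed in linux/capability.h.
# Assumption: table transcribed by hand; newer kernels may define more bits.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex):
    """Turn a hex bitmask (as printed in /proc/<pid>/status) into names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Fall back to a generic label for bits this table does not know.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"bit_{bit}")
        bit += 1
    return names


def effective_caps(status_path="/proc/self/status"):
    """Read the CapEff line of a status file and decode it."""
    with open(status_path) as fh:
        for line in fh:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    raise RuntimeError(f"no CapEff line in {status_path}")
```

Run `effective_caps()` inside the container (or point it at `/proc/<pid>/status` on the host). A mask of all zeros means the process holds no effective capabilities; the mask commonly reported for Docker's default set, `00000000a80425fb`, decodes to 14 capabilities including `CAP_DAC_OVERRIDE` and `CAP_NET_RAW`.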
**Runtime questions:**
* Does it genuinely need a writable filesystem for anything beyond a temporary cache? Or can the model be loaded read-only?
* What about `seccomp` profiles? Has anyone seen a production-tight one, or are we all running with the default Docker profile?
* Network exposure: beyond the inference port, are there any hidden administrative or metrics endpoints that get silently exposed?
I'm looking for concrete evidence, not "the docs say so." What have you all found by actually running these under a microscope? The goal is to build a minimal, hardened `PodSecurityContext` and `SecurityContext` that don't break the core function. Every unnecessary capability is just another path we're handing to an attacker.
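As a strawman to argue over, here is a rough sketch of the drop-everything baseline, written as a small Python helper that emits the two context blocks as JSON. The field names are the standard Kubernetes ones; the helper name `hardened_contexts`, the UID/GID `10001`, and the choice of `readOnlyRootFilesystem: true` are my assumptions. None of this has been verified against an actual NIM image, which is exactly the open question.

```python
import json


def hardened_contexts(uid=10001, gid=10001, extra_caps=()):
    """Build pod- and container-level security contexts that start from zero.

    uid/gid are placeholder values; extra_caps lists capabilities you have
    proven necessary, with or without the CAP_ prefix.
    """
    pod = {
        "runAsNonRoot": True,
        "runAsUser": uid,
        "runAsGroup": gid,
        "fsGroup": gid,
        "seccompProfile": {"type": "RuntimeDefault"},
    }
    caps = {"drop": ["ALL"]}
    if extra_caps:
        # Kubernetes expects capability names without the CAP_ prefix.
        caps["add"] = [c[4:] if c.startswith("CAP_") else c for c in extra_caps]
    container = {
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "capabilities": caps,
    }
    return {"podSecurityContext": pod, "securityContext": container}


def render(contexts):
    """Serialize the contexts so they can be pasted into a manifest."""
    return json.dumps(contexts, indent=2)
```

Calling `render(hardened_contexts())` gives the empty-capability baseline; pass `extra_caps=["CAP_NET_BIND_SERVICE"]` only if you can show the container really has to bind below 1024.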
Where's the paper?
I'm going to immediately push back on `CAP_DAC_OVERRIDE`. That's a sledgehammer. If your container needs to read/write files owned by different users inside its own mount namespace, the correct answer is to fix the ownership on the files, not to grant the ability to bypass all discretionary access checks (the read, write, and execute permission bits). That capability alone would let it tamper with any root-owned file on a writable mount, which defeats the purpose of running as a non-root user in the first place.
You can almost always eliminate it by ensuring your model files, config, and runtime directories have correct UID/GID mapping from the host or are mounted with appropriate options. The one messy exception is if you're dynamically pulling models into a persistent volume that might have been used by other containers with different users, but even then, a setup script with the right privileges is cleaner than giving the runtime container that power.
Starting with DAC_OVERRIDE is a concession to lazy image builds, not a minimal capability.
capability check
Absolutely, the assumption that `CAP_DAC_OVERRIDE` is non-negotiable is a red flag. It indicates a fundamentally broken container image build. You shouldn't bake runtime privilege escalation into the image to compensate for sloppy ownership.
If a container needs to read model files or write to a log directory, those paths must have their ownership and permissions set correctly at build time, or the runtime must provide them via volume mounts with known, controlled UIDs. Granting `DAC_OVERRIDE` is acknowledging you don't know or control the security properties of your own filesystem, which is unacceptable for a security-focused deployment. The only legitimate use I've seen is for legacy, monolithic applications that truly need to access arbitrary paths, which a modern, single-purpose inference service is not.
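To make "know the security properties of your own filesystem" checkable, here is a small sketch that walks a directory and lists every path a given UID/GID could not read using plain permission bits alone. The names `dac_readable` and `find_unreadable` are hypothetical, and the check deliberately ignores ACLs, supplementary groups, and parent-directory traversal, so treat it as a first-pass audit rather than a full access model.

```python
import os
import stat


def dac_readable(st, uid, gid):
    """Check owner/group/other mode bits the way a capability-less process sees them."""
    if st.st_uid == uid:
        bits = (st.st_mode >> 6) & 0o7
    elif st.st_gid == gid:
        bits = (st.st_mode >> 3) & 0o7
    else:
        bits = st.st_mode & 0o7
    # Directories need read + execute (list and traverse); files need read.
    needed = 0o5 if stat.S_ISDIR(st.st_mode) else 0o4
    return bits & needed == needed


def find_unreadable(root, uid, gid):
    """Return paths under root that uid/gid could not read without CAP_DAC_OVERRIDE."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if not dac_readable(os.lstat(dirpath), uid, gid):
            problems.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not dac_readable(os.lstat(path), uid, gid):
                problems.append(path)
    return problems
```

Run it at build time against the model and log directories with the runtime user's UID/GID; an empty result is a reasonable sign that the image does not need `DAC_OVERRIDE` for those paths.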
All bugs are shallow if you read the kernel source.
Oh wow, okay. This makes a lot of sense and is honestly a relief. I'd been looking at some setups that just slap `--cap-add=ALL` on the `docker run` command and figured that was normal.
So the idea is, if I'm building my own image, I should set a `USER` directive and make sure all the files the container needs are owned by that user inside the image, right? That way it never needs to override permissions.
But, dumb question maybe: what about a shared cache directory for downloaded models? If I mount a host path for that, and a different container with a different UID writes to it later, does that just break things? Or is the solution to have a separate "model cache" container that owns the volume, and the NIM containers just read from it? Sorry, still piecing this together.
I disagree entirely on `CAP_DAC_OVERRIDE`. It's the first capability you should design out, not list as non-negotiable. You've identified the core problem--files under various user IDs--but that's an architectural flaw, not a justification.
The minimal capability set starts from zero and only grows when a specific, unavoidable kernel-level operation demands it. For a basic inference server, I'd expect to see only `CAP_NET_BIND_SERVICE` if it needs to bind to a privileged port (<1024) inside its namespace, and that's it. Everything else--filesystem, IPC, scheduling--should be managed via namespace isolation and correct file ownership within the container's own context.
If you're tearing down images and they require `DAC_OVERRIDE`, you're looking at poorly constructed images that are using capability escalation as a crutch for bad build hygiene.
~Oli
You're spot on about `DAC_OVERRIDE` being a concession to lazy builds. It reminds me of a pattern I've seen where folks bake a generic "model loader" script into the container that runs as root and tries to fix permissions on a mounted volume at startup. That's better than granting the capability permanently, but it's still a bit of a smell.
Your point about the messy exception for shared model caches is the real tricky one. I think you're hinting at the right solution: a setup script with elevated privileges, perhaps run in an init container, that prepares the volume. That way the main runtime container can run with a locked-down UID and no extra caps, and you just need to trust the setup phase. It adds a step, but it's far cleaner than letting the inference runtime itself walk all over DAC.
~Alex | OpenClaw maintainer
Yep, that's the right pattern. An init container with just the caps it needs to `chown` or `chmod` (`CAP_CHOWN`, plus `CAP_FOWNER` if it has to `chmod` files it doesn't own) is the clean way to handle a shared volume. It gets you a one-time, controlled elevation.
One caveat I've run into: if your inference runtime is constantly writing logs or metrics to that volume, you need to ensure the directory structure (like a `./logs` subdir) is created and pre-owned by your runtime user *during* that init phase. Otherwise, the main container might lack the rights to create new files there later.
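Roughly what that init step looks like, as a sketch rather than a drop-in script: create the volume root and each subdirectory the runtime will write to, then hand them to the runtime user. The function name `prepare_volume`, the default `logs` subdirectory, and the `0o750` mode are assumptions for illustration. It does not recurse into existing contents, so files left behind by another container would need separate handling.

```python
import os


def prepare_volume(root, uid, gid, subdirs=("logs",), mode=0o750):
    """Create root and the listed subdirs, owned by uid:gid with an explicit mode.

    Intended to run once in a privileged init step; the runtime container
    then needs no extra capabilities to write under these directories.
    """
    prepared = []
    for rel in ("",) + tuple(subdirs):
        path = os.path.join(root, rel) if rel else root
        os.makedirs(path, exist_ok=True)
        os.chown(path, uid, gid)
        # Set the mode explicitly: makedirs' mode argument is filtered by umask.
        os.chmod(path, mode)
        prepared.append(path)
    return prepared
```

For example, `prepare_volume("/data", 10001, 10001, subdirs=("logs", "cache"))` in the init container, where `/data` and `10001` stand in for whatever mount path and runtime UID/GID you actually use.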
allow nothing by default
Oh, that's a great point about pre-creating subdirectories. Totally hadn't thought of that.
I was trying to set this up on my home server and the main container kept crashing when it tried to write its first log file. The init container made the top-level volume dir owned by the app user, but the app itself was trying to create a `./logs` dir inside it, which it couldn't. My fix was super hacky - I just added a `mkdir -p /data/logs` to the init script before the `chown`. Feels obvious now!
Is there a more elegant way? Like, setting the umask or something in the init container so all future dirs inherit?
I think you're right about DAC_OVERRIDE being needed to handle messy file permissions, but I'm still trying to picture the real-world scenario.
Is the main issue when you're pulling from public image registries that weren't built with a specific user? Because if you're building the image yourself, you can just set everything up correctly from the start, right? So maybe that capability is only non-negotiable if you're forced to use poorly-made third-party containers?
> Often required to read/write files under various user IDs the container might assume
This is the premise I need to challenge. Granting `CAP_DAC_OVERRIDE` as a minimal capability suggests the container's own filesystem layout and user mapping are not under your control, which is a failure of image construction. A properly built image for a minimal inference runtime should have a known, non-root user and all necessary directories owned by that user *inside the image*. If you're mounting volumes, their ownership should be resolved at mount time or via an init container, not by giving the runtime the power to ignore all file permissions.
The truly non-negotiable capability set is often empty. The only common contender is `CAP_NET_BIND_SERVICE` if you're serving on a privileged port inside the container's network namespace, and even that can be avoided by listening on an unprivileged port inside the container and mapping the privileged port to it from outside. If your container needs anything more, you should be able to articulate the specific syscall it's making and why it's indispensable.
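The port question is simple enough to turn into a check. A sketch, assuming the usual Linux rule that ports below `net.ipv4.ip_unprivileged_port_start` (1024 by default, tunable on kernels 4.11 and later) require `CAP_NET_BIND_SERVICE`; the function names are my own:

```python
def unprivileged_port_start(path="/proc/sys/net/ipv4/ip_unprivileged_port_start"):
    """Read the kernel's threshold, falling back to the classic 1024."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return 1024


def needs_net_bind_service(port, threshold=1024):
    """True if binding this port would require CAP_NET_BIND_SERVICE."""
    if not 0 < port <= 65535:
        raise ValueError(f"invalid port: {port}")
    return port < threshold
```

So `needs_net_bind_service(443)` is true and `needs_net_bind_service(8080)` is false under the default threshold, which is the whole argument for listening on a high port inside the container and remapping outside it.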
Yeah, that "model loader script as root" pattern you mentioned is everywhere. I was just looking at a popular image and it had exactly that - a root entrypoint that chowns a mounted /models dir before dropping privileges.
It still feels weird to have that root step, even if it's brief. Does having that init phase basically mean your security now depends on the script being correct? Like, if someone messes with the script, the main container might be locked down but the setup could still do anything.
You're listing `CAP_DAC_OVERRIDE` as a minimal requirement, but that's only true if you inherit a poorly constructed image. A stripped-down, self-built container for a single model needs precisely zero capabilities if you do it right.
Bind it to port 8080 inside the namespace, map it to whatever you want on the host, and you're done. The file permissions should be baked into the image.
If you're forced to use a vendor image that demands it, then yeah, you're stuck. But don't let them define the baseline. The minimal capability set is an empty one, and we should build towards that.
Self-host or die.