Hi everyone, new to the forum but I've been following Open Claw's NemoClaw project with a lot of interest. I'm trying to wrap my head around the practical side of GPU multi-tenancy, especially when it comes to making sure one user's workload doesn't leave traces for another.
I understand that NemoClaw uses a combination of cgroups, namespaces, and NVIDIA's MIG or MPS to isolate workloads, but I keep coming back to a basic operational question: how do we *see* if isolation is working correctly? The main tool seems to be `nvidia-smi`, but I find the output a bit opaque when it comes to spotting potential cross-tenant contamination.
Could someone walk me through how to interpret `nvidia-smi` stats with a security lens? For example:
* If I'm running two separate LLM inference containers for two different tenants on the same GPU (without full MIG), what metrics in `nvidia-smi` should I monitor most closely for signs of memory leakage or unexpected sharing?
* I've seen the "GPU Memory Usage" column in the default process table and the "BAR1 Memory Usage" section in `nvidia-smi -q`. Does a persistent, non-zero "Used GPU Memory" reading after a tenant's container is fully terminated indicate VRAM residue? Or is that just normal driver/allocator caching?
* What about the "Processes" table at the bottom of `nvidia-smi`? If I see a PID listed there that doesn't correspond to any currently running container I know about, is that a major red flag?
* Are there specific patterns in the "Volatile GPU-Util" or "Memory-Usage" readings, sampled over time, that could suggest one tenant's activity is affecting another's performance in a way that hints at poor isolation?
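To make the "Processes" question concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the script runs on the host (so the PIDs `nvidia-smi` reports are host PIDs) and that I already have a set of PIDs I expect to be on the GPU; `known` below is a hypothetical placeholder, and the helper names are my own. As far as I can tell this only shows who holds allocations, not what is left in memory, so it can't prove or disprove residue by itself.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into {pid: MiB or None}."""
    apps = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        pid_str, mem_str = [field.strip() for field in line.split(",", 1)]
        # used_memory may be reported as "[N/A]" instead of a number,
        # so anything non-numeric is stored as None.
        apps[int(pid_str)] = int(mem_str) if mem_str.isdigit() else None
    return apps


def unexpected_pids(apps, known_pids):
    """Return GPU-attached PIDs that are not in the expected set."""
    return sorted(pid for pid in apps if pid not in known_pids)


def query_compute_apps():
    """Ask nvidia-smi for the compute processes currently on the GPU(s)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    known = {1234, 5678}  # hypothetical: host PIDs of my tenants' containers
    apps = query_compute_apps()
    for pid in unexpected_pids(apps, known):
        print(f"unexpected GPU process: pid={pid} used_MiB={apps[pid]}")
```

Is running something like this before and after each tenant teardown a sensible baseline, or does it miss the cases that actually matter?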
I'm hoping for a kind of guide on what a "clean" vs. a "potentially contaminated" state looks like through this tool. I think understanding this would really help me appreciate what NemoClaw is managing under the hood and what risks might still exist at the hardware/firmware level that even the best software stack can't fully mitigate.