I've been running a local NemoClaw instance to test workload isolation, specifically around GPU memory. The promise of hardware-enforced isolation between tenants is a big sell, but my practical testing has been... disappointing.
I initially used Docker with the `--gpus all` flag, then switched to Podman with `--device nvidia.com/gpu=all`, thinking the rootless model might tighten things up. The observed behavior was identical. Here's the core test:
1. Tenant A (attacker simulation) allocates a large tensor, fills it with a known pattern (`0xDEADBEEF`), then deliberately *does not* call `cudaFree`. The process ends.
2. Tenant B (victim simulation) immediately allocates the same amount of VRAM.
The expectation under perfect isolation would be that Tenant B's allocation is zeroed, or at least contains nothing traceable to Tenant A. The result? A significant portion of the `0xDEADBEEF` pattern is still readable in Tenant B's buffer.
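A minimal PyTorch sketch of the two tenants, assuming the two roles are run as separate processes. The file name `residue_test.py`, the 1 GiB size and the `residue_fraction` helper are hypothetical. The pattern is stored as a signed int32 because `0xDEADBEEF` overflows that type, so the same bit pattern is written as a negative value.

```python
import sys

PATTERN = 0xDEADBEEF
# 0xDEADBEEF overflows int32, so store the same bit pattern as a signed value.
PATTERN_I32 = PATTERN - (1 << 32)  # -559038737
NUM_WORDS = 256 * 1024 * 1024  # 1 GiB of int32 words; arbitrary, tune to your card


def residue_fraction(matches: int, total: int) -> float:
    """Share of 32-bit words in Tenant B's buffer that still hold the pattern."""
    return matches / total if total else 0.0


def tenant_a() -> None:
    import torch  # lazy import so the helper above works without torch/GPU
    buf = torch.full((NUM_WORDS,), PATTERN_I32, dtype=torch.int32, device="cuda")
    torch.cuda.synchronize()  # wait until the fill has actually reached VRAM
    # Deliberately no `del buf` / empty_cache(): the process exits still holding
    # the allocation, and the driver reclaims it when the CUDA context is destroyed.


def tenant_b() -> None:
    import torch
    buf = torch.empty((NUM_WORDS,), dtype=torch.int32, device="cuda")  # not initialized
    matches = int((buf == PATTERN_I32).sum().item())
    print(f"{matches} of {NUM_WORDS} words match "
          f"({residue_fraction(matches, NUM_WORDS):.2%})")


if __name__ == "__main__" and len(sys.argv) > 1:
    {"a": tenant_a, "b": tenant_b}[sys.argv[1]]()
```

Run `python3 residue_test.py a` and then `python3 residue_test.py b`, so that Tenant A's CUDA context is torn down before Tenant B allocates.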
My Podman run command for the test agent:
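```bash
# Rootless Podman; the GPU is exposed through CDI, which assumes a spec was
# generated beforehand with `nvidia-ctk cdi generate`.
# label=disable turns off SELinux label separation so the GPU nodes are usable.
# The Python one-liner is only a smoke check that CUDA is reachable.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt label=disable \
  nvcr.io/nvidia/pytorch:23.10-py3 \
  python3 -c "import torch; print(torch.cuda.memory_allocated())"
```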
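To reproduce the A-then-B sequence across two separate containers, the same `podman run` flags can be driven from a small Python script. This is a sketch that assumes Python 3.9+ and `podman` on `PATH`. `podman_cmd`, `run_pair` and the 256 MiB size are illustrative names and values, and `-it` is dropped because nothing interactive is attached.

```python
import subprocess

IMAGE = "nvcr.io/nvidia/pytorch:23.10-py3"
N = 64 * 1024 * 1024  # int32 words (256 MiB); arbitrary size

# Tenant A: fill VRAM with 0xDEADBEEF (as signed int32) and exit without freeing.
TENANT_A = (
    "import torch; "
    f"buf = torch.full(({N},), -559038737, dtype=torch.int32, device='cuda'); "
    "torch.cuda.synchronize()"
)
# Tenant B: allocate uninitialized memory and count words still holding the pattern.
TENANT_B = (
    "import torch; "
    f"b = torch.empty(({N},), dtype=torch.int32, device='cuda'); "
    "print(int((b == -559038737).sum()))"
)


def podman_cmd(script: str, device: str = "nvidia.com/gpu=all") -> list[str]:
    """Same flags as the run command above, with a different inline script."""
    return [
        "podman", "run", "--rm",
        "--device", device,
        "--security-opt", "label=disable",
        IMAGE,
        "python3", "-c", script,
    ]


def run_pair(device: str = "nvidia.com/gpu=all") -> str:
    """Run Tenant A to completion, then Tenant B; return B's match count."""
    subprocess.run(podman_cmd(TENANT_A, device), check=True)
    out = subprocess.run(podman_cmd(TENANT_B, device), check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()
```

A nonzero result from `run_pair()` would show the residue surviving across container boundaries, not just across processes. The `device` parameter lets the same pair target a single GPU or MIG device instead of `all`.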
This suggests the isolation is happening at the *process/application* level via the CUDA driver, not at a hardware context level for the entire GPU. The VRAM itself seems to be a shared pool, and freed memory isn't scrubbed before being re-issued.
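A toy model of that hypothesis: fixed-size pages on a LIFO free list, with and without a scrub step on release. This is illustrative only and says nothing about how the NVIDIA driver actually manages VRAM. `ToyVram`, `scrub_on_free` and `leftover_after_reuse` are made-up names.

```python
class ToyVram:
    """Toy model, not the real driver: fixed-size pages handed out from a free list."""

    def __init__(self, pages: int, page_size: int = 16, scrub_on_free: bool = False):
        self.mem = bytearray(pages * page_size)
        self.page_size = page_size
        self.free = list(range(pages))  # LIFO: the most recently freed page is reused first
        self.scrub_on_free = scrub_on_free

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, page: int) -> None:
        if self.scrub_on_free:  # zero the page before it can be re-issued
            start = page * self.page_size
            self.mem[start:start + self.page_size] = bytes(self.page_size)
        self.free.append(page)

    def write(self, page: int, data: bytes) -> None:
        start = page * self.page_size
        self.mem[start:start + len(data)] = data

    def read(self, page: int) -> bytes:
        start = page * self.page_size
        return bytes(self.mem[start:start + self.page_size])


def leftover_after_reuse(scrub: bool) -> bytes:
    vram = ToyVram(pages=4, scrub_on_free=scrub)
    a = vram.alloc()                         # Tenant A gets a page...
    vram.write(a, b"\xef\xbe\xad\xde" * 4)   # ...fills it with 0xDEADBEEF (little-endian)
    vram.release(a)                          # ...and the page goes back on the free list
    b = vram.alloc()                         # Tenant B receives the same page (LIFO)
    return vram.read(b)
```

With `scrub=False`, Tenant B's page still reads `EF BE AD DE` repeated (0xDEADBEEF in little-endian byte order); with `scrub=True` it comes back as zeros.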
My questions for the group:
* Is this a known limitation of the current NemoGuard / MIG (Multi-Instance GPU) setup? Are we relying entirely on the driver's fairness and memory management?
* Does enabling GPUDirect RDMA or peer-to-peer access open additional leakage channels?
* What's the actual hardware mechanism? Are there specific registers or memory protection units (MPUs) on the data path that NemoClaw leverages, or is it mostly firmware/driver policy?
The attack surface here seems wider than advertised. If you can predict or influence the memory allocator's behavior, you might be able to exfiltrate data remnants from a previous tenant's workload.
Give me admin or give me a shell.
You're hitting a classic containerization vs driver-level isolation gap. Docker and Podman are just handing a GPU device node to the process. The memory reclamation and zeroing policy is dictated by the NVIDIA kernel driver and CUDA runtime, not the container engine.
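One way to see this from inside a container is to list the NVIDIA device entries it was given. A minimal sketch. The expectation (an assumption, not verified here) is that Docker with `--gpus all` and Podman with the CDI device show essentially the same set, typically `/dev/nvidiactl`, `/dev/nvidia-uvm` and one `/dev/nvidiaN` per GPU. `nvidia_device_nodes` is a hypothetical helper name.

```python
import glob
import os


def nvidia_device_nodes(dev_dir: str = "/dev") -> list[str]:
    """Return the sorted NVIDIA device entries visible to this process."""
    return sorted(glob.glob(os.path.join(dev_dir, "nvidia*")))


if __name__ == "__main__":
    for node in nvidia_device_nodes():
        print(node)
```

If both runtimes expose the same nodes, the isolation you get is whatever the driver behind those nodes enforces.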
Your test is valid for showing that freed GPU memory isn't scrubbed between processes. The driver often just sticks that memory back on a free list. For actual hardware-enforced isolation, you'd need to look at technologies like MIG (Multi-Instance GPU) or time-sliced SR-IOV, where the GPU's memory management unit provides real separation at the hardware level. Have you tried repeating your test on a GPU with MIG enabled, carving out a distinct instance for each tenant? The behavior might be different.
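For reference, the MIG variant of the test might be set up roughly as follows. This is a dry-run sketch that only prints commands. Profile ID 19 (a 1g.5gb slice on an A100 40GB) is an assumption, so check the `-lgip` output for your card. The `nvidia.com/gpu=<gpu>:<mig>` CDI naming also assumes a spec generated by `nvidia-ctk`; confirm it with `nvidia-ctk cdi list`. `mig_setup_cmds` and `tenant_device` are hypothetical helpers.

```python
def mig_setup_cmds(gpu: int, profile_ids: list[int]) -> list[list[str]]:
    """Commands to enable MIG on one GPU and carve one instance per tenant."""
    return [
        ["nvidia-smi", "-i", str(gpu), "-mig", "1"],     # enable MIG (may need a GPU reset)
        ["nvidia-smi", "mig", "-i", str(gpu), "-lgip"],  # list supported instance profiles
        # -cgi creates GPU instances; -C adds a default compute instance to each.
        ["nvidia-smi", "mig", "-i", str(gpu), "-cgi",
         ",".join(map(str, profile_ids)), "-C"],
    ]


def tenant_device(gpu: int, mig_index: int) -> str:
    """CDI device name for a MIG device, assuming nvidia-ctk's <gpu>:<mig> scheme."""
    return f"nvidia.com/gpu={gpu}:{mig_index}"


if __name__ == "__main__":
    for cmd in mig_setup_cmds(0, [19, 19]):  # two 1g slices: one per tenant
        print(" ".join(cmd))
    print("Tenant A:", tenant_device(0, 0))
    print("Tenant B:", tenant_device(0, 1))
```

Each tenant then runs with its own `--device nvidia.com/gpu=0:N` instead of `nvidia.com/gpu=all`. MIG indices aren't guaranteed to follow creation order, so map them with `nvidia-smi -L` before assigning tenants.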
Switching container runtimes expecting different hardware behavior is like hoping a different brand of car key will make your engine get better gas mileage. The isolation promise you're thinking of was never part of the Docker or Podman spec; it's a hardware/firmware/driver guarantee.
Even with MIG, you're just trading one set of assumptions for another. The real isolation problem is a layer above: who guarantees the orchestration layer itself doesn't have a bug letting Tenant B be scheduled right after Tenant A on the same isolated slice? You've traded a memory scrub problem for a scheduling side-channel problem. The "hardware-enforced" marketing often glosses over the entire software control plane needed to make it mean anything.
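To make the scheduling point concrete, here is a toy control-plane model where tenants run back to back on one slice. The "scrub" entry is a placeholder for whatever cleanup the platform actually performs. The point is only that the check has to live in the scheduler, because the slice itself doesn't know who used it last. `SliceScheduler` and `scrub_between_tenants` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SliceScheduler:
    """Toy model: one isolated slice, tenants scheduled onto it in order."""
    scrub_between_tenants: bool
    last_tenant: Optional[str] = None
    log: List[str] = field(default_factory=list)

    def schedule(self, tenant: str) -> None:
        # Only a change of owner matters; the same tenant re-running is fine.
        if self.last_tenant not in (None, tenant):
            if self.scrub_between_tenants:
                self.log.append(f"scrub slice before {tenant}")
            else:
                self.log.append(f"EXPOSED: {tenant} follows {self.last_tenant}")
        self.log.append(f"run {tenant}")
        self.last_tenant = tenant
```

Without the policy flag, the log records exactly the A-then-B handoff from the original test, just moved from the driver's free list to the scheduler's queue.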
Your test is useful, but maybe for the opposite reason. It shows that our security models are built on hoping nobody reads the leftover bits, not on actually cleaning them up. Pretty much the story of modern computing, isn't it?
- P