Anyone else having issues with CUDA context persistence across container restarts?

GPU Memory Isolation and Leakage

Alexei Volkov

(@kernel_watcher)

July 4, 2026 3:01 pm [#1386]

I've been conducting a series of isolation tests for GPU-accelerated workloads under NemoClaw, specifically focusing on the persistence of CUDA context state across container lifecycle events. My findings suggest there is a non-trivial, and likely undocumented, risk of VRAM metadata leakage even after a tenant's container is terminated and its cgroups/namespaces are cleaned up. This isn't about visible data in framebuffers; it's about the driver's internal context handles, allocated memory page lists, and potentially kernel-mode driver state that remains associated with the physical GPU device.

The core issue appears to be that while `nvidia-container-cli` does a commendable job of cleaning up the visible device nodes (`/dev/nvidia*`) from the container's mount namespace, the underlying CUDA driver context established by the tenant's process can persist on the GPU and in the host's kernel driver. This context is not fully torn down until the last reference to the GPU device is closed *on the host*. A subsequent container, even with a different user namespace and cgroup, that acquires access to the same GPU device may inherit a context with stale internal allocations.

Consider this simple reproducer pattern:

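```bash
# In container A (with GPU access)
python3 <<'EOF'
import time
import torch

x = torch.randn(1000, 1000, device='cuda')
torch.cuda.synchronize()  # make sure the allocation has actually completed
print(f"Allocated on GPU: {x.device}", flush=True)
# Keep the context alive so the container can be forcibly killed here
# (SIGKILL from the host), not allowing graceful teardown.
time.sleep(3600)
EOF

# Host cleans up container cgroup, namespace.

# In container B (with GPU access, same physical device)
python3 <<'EOF'
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
proc_info = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
# Are there any lingering processes reported? Sometimes yes, sometimes no.
for p in proc_info:
    print(f"pid={p.pid} used_gpu_memory={p.usedGpuMemory}")
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used={mem.used} free={mem.free} total={mem.total}")
# The more telling test is to query internal state via CUDA APIs.
pynvml.nvmlShutdown()
EOF
```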
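The container B check can be made a little less ad hoc by comparing the device-wide memory figure NVML reports against what it attributes to running compute processes. The following is a minimal sketch, assuming `pynvml` is installed and that it runs somewhere NVML can see all processes (typically the host). The helper names `residual_bytes` and `probe` and the 64 MiB `slack` default are hypothetical; driver-reserved memory varies by GPU and driver version, so the threshold would need calibrating against a known-idle device, and graphics processes are not counted here.

```python
def residual_bytes(used_total, per_process_used, slack=64 * 1024 * 1024):
    """Bytes of VRAM in use that no reported process accounts for.

    used_total: device-wide used bytes (e.g. nvmlDeviceGetMemoryInfo().used)
    per_process_used: per-process used bytes; None means NVML could not
        attribute a figure to that process and is counted as 0.
    slack: hypothetical allowance for driver-reserved memory.
    """
    attributed = sum(u or 0 for u in per_process_used)
    return max(0, used_total - attributed - slack)


def probe(index=0):
    """Query a real device. Requires pynvml and a loaded NVIDIA driver."""
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        return residual_bytes(mem.used, [p.usedGpuMemory for p in procs])
    finally:
        pynvml.nvmlShutdown()
```

A non-zero result right after a killed container has been cleaned up would be worth recording alongside the driver version, though on its own it does not prove that another tenant's state is reachable.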

The isolation gap stems from several layers:

* **Kernel-mode driver persistence:** The NVIDIA kernel driver (`nvidia.ko`) maintains per-process state that is keyed to the process's PID on the host, not to its PID inside the container's PID namespace. A container kill may orphan some of this state.
* **VRAM page table remnants:** The GPU's internal memory management unit (MMU) has page tables that map virtual addresses to physical framebuffer pages. These mappings are not guaranteed to be flushed on context termination; the only way to be certain is a full GPU reset (which is disruptive).
* **NVIDIA's "GPU Isolation" claims:** Their documentation speaks to hardware-enforced isolation between processes *through the driver*, but this assumes the driver's own internal bookkeeping is flawless after a violent termination. The guardrails are designed for healthy, cooperating processes, not for adversarial post-mortem state scavenging.
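Since the driver state is tied to host-side references, one cheap check after a container kill is to see which host PIDs still hold an NVIDIA device node open. The sketch below uses only the standard library and the Linux `/proc` layout; `gpu_device_holders` is a hypothetical helper name, and it needs enough privilege (usually root on the host) to read other processes' `fd` directories.

```python
import os


def gpu_device_holders(proc_root="/proc", prefix="/dev/nvidia"):
    """Map host PID -> sorted list of NVIDIA device nodes it has open."""
    holders = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return holders
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        nodes = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed between listdir and readlink
            if target.startswith(prefix):
                nodes.add(target)
        if nodes:
            holders[int(entry)] = sorted(nodes)
    return holders


if __name__ == "__main__":
    for pid, nodes in sorted(gpu_device_holders().items()):
        print(pid, " ".join(nodes))
```

If a PID from the killed container (or a helper such as a persistence daemon) still shows up here, the "last reference" has not been dropped yet. An empty result only rules out open file descriptors; it says nothing about state the driver keeps internally.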

My questions to the group are:

* Have you observed similar artifacts (unexpected "cannot allocate memory" errors, strange device query results, or even measurable performance anomalies) when scheduling a new workload onto a GPU recently vacated by a killed container?
* What are your operational mitigations? I've been experimenting with a custom `runtimeClass` hook that, prior to assigning a GPU to a new tenant, attempts to:
  * Use `nvidia-smi --gpu-reset` on the specific device (too heavy-handed for a shared node).
  * Trigger a dummy allocation and free cycle via a privileged initContainer to "clean" the context state.
* Is anyone aware of a definitive syscall or driver ioctl sequence that forces a complete GPU context tear-down from userspace, without requiring `CAP_SYS_ADMIN` for a full device reset?
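For reference, the dummy allocation and free cycle mentioned above could look roughly like this. It is a sketch under stated assumptions, not a verified mitigation: it assumes PyTorch with CUDA is available in the initContainer, the names `plan_chunks` and `scrub_device` and the chunk and reserve sizes are hypothetical, and it only overwrites data pages the allocator hands out. Whether it has any effect on driver bookkeeping or page tables is exactly the open question.

```python
def plan_chunks(free_bytes, chunk_bytes=256 * 1024 * 1024,
                reserve_bytes=512 * 1024 * 1024):
    """Split the scrubbable part of free VRAM into allocation sizes (bytes)."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    # Leave some headroom so the scrub itself does not hit out-of-memory.
    budget = max(0, free_bytes - reserve_bytes)
    full, rest = divmod(budget, chunk_bytes)
    return [chunk_bytes] * full + ([rest] if rest else [])


def scrub_device(index=0):
    """Zero-fill most free VRAM, then release it. Requires torch with CUDA."""
    import torch

    device = torch.device("cuda", index)
    free_bytes, _total = torch.cuda.mem_get_info(device)
    held = []
    for size in plan_chunks(free_bytes):
        # torch.zeros writes every byte; torch.empty would leave them as-is.
        held.append(torch.zeros(size, dtype=torch.uint8, device=device))
    torch.cuda.synchronize(device)
    scrubbed = sum(t.numel() for t in held)
    del held
    torch.cuda.empty_cache()  # return cached blocks to the driver
    return scrubbed
```

Chunked allocation is used so a fragmented device fails on one chunk rather than on a single huge request; the reserve means some pages are never touched, so this is at best a partial scrub.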

The security implications are clear for a multi-tenant cluster. VRAM residue from a prior tenant could be probed and potentially partially reconstructed by a subsequent, adversarial tenant, especially if they can engineer specific allocation patterns to occupy previously-used page frames. This moves beyond theoretical into the realm of practical side-channel attacks.

I'll be presenting a more detailed analysis, including eBPF traces of the driver ioctl calls during container termination, at the next Open Claw meetup. In the meantime, I'm keen to compare notes.

--av
