I switched our NemoClaw cluster from its default `nvidia-gpu-sched` to a custom Kubernetes device plugin and scheduler for finer control over multi-tenant GPU sharing.
Performance improved, but I suspect GPU memory isolation is now worse. The default scheduler hooks into NVIDIA's kernel driver and `nvfatbin` cache isolation. My custom scheduler only applies `cgroup` limits and runs `nvidia-smi` commands.
Questions:
* Does bypassing the default scheduler break the VRAM clearing that happens between tenants? I'm not seeing the same `CUDA_MEMPOOL_CLEAR_ON_RESET` flags being applied.
* What hardware-level guardrails (if any) are we losing? The docs are vague about what's in the driver vs. the GPU's memory management unit.
My current config does this:
```yaml
# Custom device plugin allocates via libnvidia-container
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0"
  - name: NVIDIA_MEMORY_LIMIT
    value: "4096MiB"
```
But I think this only enforces a soft, software-level limit via `cgroup`, not a hardware partition of the GPU's memory.
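For what it's worth, a quick way to check whether any hardware partitioning is active is to ask `nvidia-smi` for the MIG mode. This is a minimal sketch, assuming the nodes have `nvidia-smi` on the PATH and GPUs that support MIG (A100 and H100 do; many other Ampere cards don't). The helper names `parse_mig_modes` and `query_mig_modes` are made up for illustration.

```python
import subprocess


def parse_mig_modes(csv_text):
    """Parse 'index, mig.mode.current' CSV rows into {gpu_index: mode}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_mig_modes():
    """Ask nvidia-smi for the current MIG mode of every visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,mig.mode.current",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_mig_modes(result.stdout)
```

If a GPU reports anything other than `Enabled`, tenants on it share one memory space, so any per-tenant limit would have to be enforced in software.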
Known risks I'm tracking:
* VRAM residue from previous tenant's data.
* Potential info leak through un-cleared GPU caches.
* Missing the driver-level syscall filtering for CUDA APIs.
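To check the first risk directly, one option is to allocate device memory in a fresh tenant container without initializing it and see how much of it is non-zero. This is a rough sketch, assuming CuPy is installed in the test container (my assumption, not something the cluster ships); `nonzero_fraction` and `scan_uninitialized_vram` are hypothetical helper names.

```python
def nonzero_fraction(buf):
    """Return the fraction of bytes in buf that are not zero."""
    if not buf:
        return 0.0
    zeros = buf.count(0)  # bytes.count accepts an int in range(256)
    return (len(buf) - zeros) / len(buf)


def scan_uninitialized_vram(nbytes=64 * 1024 * 1024):
    """Allocate nbytes on the GPU without initializing them and
    report how much of the returned memory is non-zero.

    Assumes CuPy is available; it is imported here so the pure
    helper above works without a GPU.
    """
    import cupy

    raw = cupy.empty(nbytes, dtype=cupy.uint8)  # uninitialized allocation
    return nonzero_fraction(cupy.asnumpy(raw).tobytes())
```

A zero result doesn't prove isolation, since the driver or allocator may hand back scrubbed pages, but a non-zero result right after a tenant rotation would be worth investigating.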
Has anyone else rolled their own scheduler and audited the isolation drop? Specifically for Ampere and Hopper architectures.
Yeah, you're right about the hardware partition thing, I think. I'm just getting started with multi-tenant GPU stuff on a smaller scale, so this is super helpful to read.
Your point about the driver-level syscall filtering is worrying. If your custom scheduler just uses `nvidia-smi`, does that mean a container could still call driver functions it shouldn't, even with a memory limit set?
Also, do you have any actual data that suggests a leak, or is it just a hunch? I'm trying to figure out what metrics to watch for in my own setup.
That's a really sharp observation about the driver-level hooks. I think you've hit on the core tradeoff here: performance control vs. the integrated safety features.
If you're just using `cgroup` limits and `nvidia-smi` commands, you're almost certainly losing the driver-enforced memory clearing between contexts. The default scheduler uses the NVML APIs to reset the device state properly, which includes clearing those caches. Your method might leave data resident.
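What are you using for metrics? I'd be curious if you see any delta in `nvidia-smi` output for "Used GPU Memory" between tenant rotations, compared to the old scheduler. A delta there would point to allocations that weren't freed; it won't show stale data left in freed memory, so you'd need a separate check for that.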
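A minimal sketch of that comparison, assuming `nvidia-smi` is on the PATH of whatever runs the rotation; the function names are hypothetical. Take one snapshot just before a tenant is evicted and one after the next tenant's slot is idle, then diff them.

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse 'index, memory.used' rows (MiB, no units) into {gpu_index: MiB}."""
    used = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, value = (field.strip() for field in line.split(",", 1))
        if value.isdigit():  # skip rows where the value is not reported
            used[int(index)] = int(value)
    return used


def query_memory_used():
    """Snapshot used memory per GPU via nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def rotation_delta(before, after):
    """Per-GPU change in used MiB between two snapshots."""
    return {idx: after[idx] - before[idx] for idx in before if idx in after}
```

Running the same rotation under both schedulers and comparing the deltas would at least show whether the custom one leaves more memory allocated.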
Better safe than sorry.