Switched from NemoClaw's default scheduler to a custom one - worse isolation?

GPU Memory Isolation and Leakage

Last Post by Paul D. 7 days ago

3 Posts

3 Users

0 Reactions

2 Views

Mia Hardener

(@harden_ops_mia)

Active Member

Joined: 1 week ago

Posts: 10

Topic starter

June 23, 2026 3:19 pm [#637]

I switched our NemoClaw cluster from its default `nvidia-gpu-sched` to a custom Kubernetes device plugin and scheduler for finer control over multi-tenant GPU sharing.

Performance improved, but I suspect GPU memory isolation is now worse. The default scheduler hooks into NVIDIA's kernel driver and `nvfatbin` cache isolation. My custom scheduler only applies `cgroup` limits and runs `nvidia-smi` commands.

Questions:

* Does bypassing the default scheduler break the VRAM clearing that happens between tenants? I'm not seeing the same `CUDA_MEMPOOL_CLEAR_ON_RESET` flags being applied.
* What hardware-level guardrails (if any) are we losing? The docs are vague about what's in the driver vs. the GPU's memory management unit.

My current config does this:

```yaml
# Custom device plugin allocates via libnvidia-container
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0"
  - name: NVIDIA_MEMORY_LIMIT
    value: "4096MiB"
```

But I think this only enforces a soft limit via `cgroup`, not a hard hardware partition.

Known risks I'm tracking:
* VRAM residue from previous tenant's data.
* Potential info leak through un-cleared GPU caches.
* Missing the driver-level syscall filtering for CUDA APIs.

Has anyone else rolled their own scheduler and audited the isolation drop? Specifically for Ampere and Hopper architectures.
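
For anyone who wants to test the VRAM residue risk listed above, here is a minimal sketch of a residue probe. It assumes CuPy is installed in the probing container and that the probe runs in a fresh process right after the previous tenant exits. The helper names `residue_stats` and `read_uninitialized_vram` are made up for illustration. Non-zero bytes are only a signal worth investigating, not proof of a cross-tenant leak, since the driver or runtime may place its own data in that memory.

```python
def residue_stats(data: bytes) -> dict:
    """Summarize how much of a buffer is non-zero."""
    total = len(data)
    nonzero = total - data.count(0)  # bytes.count accepts an int in 0..255
    return {
        "total_bytes": total,
        "nonzero_bytes": nonzero,
        "nonzero_fraction": nonzero / total if total else 0.0,
    }


def read_uninitialized_vram(nbytes: int) -> bytes:
    """Allocate device memory without initializing it and copy it to the host.

    Assumption: CuPy is available. cupy.empty does not zero the allocation,
    so whatever the device hands back is what gets read.
    """
    import cupy  # imported lazily so the stats helper works without a GPU

    buf = cupy.empty((nbytes,), dtype=cupy.uint8)
    return cupy.asnumpy(buf).tobytes()


if __name__ == "__main__":
    # Probe 64 MiB; run this as the first workload of the next tenant.
    print(residue_stats(read_uninitialized_vram(64 * 1024 * 1024)))
```

Running the same probe under the default scheduler and under the custom one gives a like-for-like comparison of `nonzero_fraction`.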

Jay R.

(@rookie_sec_jay)

Eminent Member

Joined: 1 week ago

Posts: 16

June 23, 2026 3:21 pm

Yeah, you're right about the hardware partition thing, I think. I'm just getting started with multi-tenant GPU stuff on a smaller scale, so this is super helpful to read.

Your point about the driver-level syscall filtering is worrying. If your custom scheduler just uses `nvidia-smi`, does that mean a container could still call driver functions it shouldn't, even with a memory limit set?

Also, can you see any actual performance data that suggests a leak, or is it just a feeling? I'm trying to figure out what metrics to watch for in my own setup.

Paul D.

(@newb_cautious_selfhost_paul)

Active Member

Joined: 1 week ago

Posts: 14

June 23, 2026 3:24 pm

That's a really sharp observation about the driver-level hooks. I think you've hit on the core tradeoff here: performance control vs. the integrated safety features.

If you're just using cgroup limits and nvidia-smi commands, you're almost certainly losing the driver-enforced memory clearing between contexts. The default scheduler uses the NVML APIs to reset the device state properly, which includes clearing those caches. Your method might leave data resident.

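What are you using for metrics? I'd be curious if you see any delta in `nvidia-smi` output for "Used GPU Memory" between tenant rotations, compared to the old scheduler. A nonzero delta would point to allocations that were never released, though on its own it would not show residual data in memory that was freed.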

Better safe than sorry.
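
A rough way to collect the "Used GPU Memory" comparison suggested above is to sample `nvidia-smi` before and after each tenant rotation and diff the values. This is only a sketch: the function names are hypothetical, it assumes `nvidia-smi` is on the PATH of the node, and it measures allocated memory only, so it says nothing about the contents of freed memory.

```python
import subprocess

# Standard nvidia-smi query flags; output looks like "0, 123" per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text: str) -> dict[int, int]:
    """Parse 'index, memory.used' CSV rows into {gpu_index: used_mib}."""
    used = {}
    for line in csv_text.strip().splitlines():
        index, mib = (field.strip() for field in line.split(","))
        used[int(index)] = int(mib)
    return used


def sample_memory_used() -> dict[int, int]:
    """Run nvidia-smi once and return used memory per GPU in MiB."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)


def rotation_delta(before: dict[int, int], after: dict[int, int]) -> dict[int, int]:
    """Used-memory change per GPU between two samples (after minus before)."""
    return {gpu: after[gpu] - before[gpu] for gpu in before if gpu in after}
```

Take one sample on an idle GPU before the first tenant starts and another after it has been torn down, then compare `rotation_delta` between the default scheduler and the custom one.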
