Switched from NemoClaw's default scheduler to a custom one - worse isolation?

(@harden_ops_mia)
Active Member
Joined: 1 week ago
Posts: 10
Topic starter
  [#637]

I switched our NemoClaw cluster from its default `nvidia-gpu-sched` to a custom Kubernetes device plugin and scheduler for finer control over multi-tenant GPU sharing.

Performance improved, but I suspect GPU memory isolation is now worse. The default scheduler hooks into NVIDIA's kernel driver and `nvfatbin` cache isolation. My custom scheduler only applies `cgroup` limits and runs `nvidia-smi` commands.

Questions:

* Does bypassing the default scheduler break the VRAM clearing that happens between tenants? I'm not seeing the same `CUDA_MEMPOOL_CLEAR_ON_RESET` flags being applied.
* What hardware-level guardrails (if any) are we losing? The docs are vague about what's in the driver vs. the GPU's memory management unit.

My current config does this:

```yaml
# Custom device plugin allocates via libnvidia-container
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0"
  - name: NVIDIA_MEMORY_LIMIT
    value: "4096MiB"
```

But I think this only enforces a soft limit via `cgroup`, not a hard hardware partition.
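One way to check whether any hard partitioning is in place is to ask the driver whether MIG (Multi-Instance GPU, the hardware partitioning feature on some Ampere and Hopper parts) is enabled on each device. The sketch below is only that, a sketch: it assumes `nvidia-smi` is on the PATH and that the driver exposes the `mig.mode.current` query field. The helper names (`parse_mig_modes`, `query_mig_modes`, `gpus_without_mig`) are made up for illustration and are not part of NemoClaw or any NVIDIA tooling.

```python
import subprocess
from typing import Dict, List


def parse_mig_modes(csv_text: str) -> Dict[int, str]:
    """Map GPU index -> MIG mode string from `nvidia-smi` CSV output.

    Expects lines such as "0, Enabled" or "1, [N/A]".
    """
    modes: Dict[int, str] = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_mig_modes() -> Dict[int, str]:
    """Run nvidia-smi on this host (assumes it is installed and on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_mig_modes(result.stdout)


def gpus_without_mig(modes: Dict[int, str]) -> List[int]:
    """Return indices of GPUs where MIG is not reported as enabled."""
    return sorted(index for index, mode in modes.items() if mode != "Enabled")


if __name__ == "__main__":
    # Any GPU listed here has no MIG partitioning, so a per-container
    # memory limit on it is not backed by a hardware partition.
    print(gpus_without_mig(query_mig_modes()))
```

If a GPU shows up in that list, whatever limit the container sees is being enforced in software, which would match the suspicion above.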

Known risks I'm tracking:
* VRAM residue from a previous tenant's data.
* Potential info leak through uncleared GPU caches.
* Loss of the driver-level syscall filtering for CUDA APIs.

Has anyone else rolled their own scheduler and audited the isolation drop? Specifically for Ampere and Hopper architectures.
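For the VRAM residue risk, one possible audit is a canary test: a job in tenant A fills its allocation with a recognisable pattern before exiting, then a job in tenant B allocates memory without initialising it, copies it back to the host, and scans for that pattern. The sketch below covers only the host-side part (building the canary and scanning a buffer). How the buffer is read back from uninitialised device memory depends on the CUDA binding in use and is deliberately left out. `make_canary` and `count_canary_hits` are hypothetical helper names.

```python
import hashlib


def make_canary(tenant_id: str, length: int = 32) -> bytes:
    """Derive a deterministic, recognisable byte pattern for a tenant."""
    if length <= 0:
        raise ValueError("length must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Repeat the digest as needed, then trim to the requested length.
    repeats = length // len(digest) + 1
    return (digest * repeats)[:length]


def count_canary_hits(buf: bytes, canary: bytes) -> int:
    """Count non-overlapping occurrences of the canary in a host buffer.

    `buf` is assumed to be a host copy of freshly allocated,
    uninitialised device memory taken by the *next* tenant.
    """
    if not canary:
        raise ValueError("canary must not be empty")
    return buf.count(canary)


if __name__ == "__main__":
    canary = make_canary("tenant-a")
    # Simulated buffer: mostly zeros with two leftover canary copies.
    leaked = bytes(64) + canary + bytes(10) + canary
    print(count_canary_hits(leaked, canary))
```

Zero hits on a handful of runs would not prove isolation, since the allocator may simply hand back different pages, but any hit at all would be direct evidence of residue.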



   
(@rookie_sec_jay)
Eminent Member
Joined: 1 week ago
Posts: 16
 

Yeah, you're right about the hardware partition thing, I think. I'm just getting started with multi-tenant GPU stuff on a smaller scale, so this is super helpful to read.

Your point about the driver-level syscall filtering is worrying. If your custom scheduler just uses `nvidia-smi`, does that mean a container could still call driver functions it shouldn't, even with a memory limit set?

Also, do you have any actual data that suggests a leak, or is it just a feeling? I'm trying to figure out what metrics to watch for in my own setup.



   
(@newb_cautious_selfhost_paul)
Active Member
Joined: 1 week ago
Posts: 14
 

That's a really sharp observation about the driver-level hooks. I think you've hit on the core tradeoff here: performance control vs. the integrated safety features.

If you're just using cgroup limits and nvidia-smi commands, you're almost certainly losing the driver-enforced memory clearing between contexts. The default scheduler uses the NVML APIs to reset the device state properly, which includes clearing those caches. Your method might leave data resident.

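What are you using for metrics? I'd be curious whether `nvidia-smi` shows any delta in "Used GPU Memory" between tenant rotations, compared to the old scheduler. A non-zero value after a tenant exits would point to memory that was never released, though it would not by itself prove that freed memory still holds the previous tenant's data.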
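A rough sketch of that comparison, assuming `nvidia-smi` is on the PATH and supports the `memory.used` query field with `--format=csv,noheader,nounits` (values in MiB). The function names are made up for illustration; take one snapshot on an idle GPU before the tenant starts and another after it has been torn down.

```python
import subprocess
from typing import Dict

NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text: str) -> Dict[int, int]:
    """Map GPU index -> used memory in MiB from nvidia-smi CSV output.

    Lines that do not parse as two integers (e.g. "[N/A]") are skipped.
    """
    used: Dict[int, int] = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue
        try:
            used[int(fields[0])] = int(fields[1])
        except ValueError:
            continue
    return used


def snapshot() -> Dict[int, int]:
    """Take a snapshot on this host (assumes nvidia-smi is installed)."""
    result = subprocess.run(
        NVIDIA_SMI_QUERY, check=True, capture_output=True, text=True
    )
    return parse_memory_used(result.stdout)


def rotation_delta(before: Dict[int, int], after: Dict[int, int]) -> Dict[int, int]:
    """Per-GPU change in used MiB between two snapshots."""
    return {gpu: after[gpu] - before[gpu] for gpu in before if gpu in after}
```

Typical use would be `before = snapshot()`, run and tear down the tenant, then `rotation_delta(before, snapshot())`. A persistent positive delta under the custom scheduler but not the default one would be worth chasing.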


Better safe than sorry.


   