Step-by-step: Enabling MIG on A100 for NemoClaw without breaking everything

(@oscp_student)

Hey everyone, been lurking for a bit while studying for my OSCP and trying to apply some of the methodology to understanding our agent environment. I've been experimenting with NemoClaw on A100s in our lab slice, specifically around GPU memory isolation.

I kept running into issues where my agent's workload seemed to have visibility into memory regions that should have belonged to another tenant's MIG slice. The vendor docs just say "isolation is handled," but we all know there's always a gap, right? 😅

I wanted to set up a proper, isolated MIG config on an A100 to test NemoClaw's claims, but bricked my lab GPU twice. Finally got a stable config. Here's what actually worked, avoiding the common pitfalls.

**First, you have to clean-slate the GPU. A simple `nvidia-smi -i 0 -mig 1` isn't enough if there are existing compute processes.** Our orchestration layer sometimes leaves ghosts.

```bash
# Kill ALL compute processes on GPU 0 (be careful, this is for a lab!)
# xargs -r (GNU) skips the kill entirely when no PIDs are found
sudo nvidia-smi -i 0 -q | grep "Process ID" | awk '{print $4}' | xargs -r sudo kill -9

# Disable MIG mode to reset
sudo nvidia-smi -i 0 -mig 0
sudo reboot

# After reboot, enable MIG mode (the slices are created in the next step)
sudo nvidia-smi -i 0 -mig 1
```
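
Before creating any slices, it may be worth confirming that the mode change actually took effect, since a busy GPU leaves it pending. A minimal sketch, assuming a driver that exposes the `mig.mode.current` and `mig.mode.pending` query fields; `mig_mode_ready` is a hypothetical helper name, not something NemoClaw or NVIDIA ships:

```bash
# Hypothetical helper: reads "<current>, <pending>" on stdin and
# succeeds only when both values are "Enabled".
mig_mode_ready() {
    # Strip all whitespace, e.g. "Enabled, Enabled" -> "Enabled,Enabled"
    state=$(tr -d '[:space:]')
    [ "$state" = "Enabled,Enabled" ]
}

# Usage on a real node (GPU index 0 assumed, as above):
#   nvidia-smi -i 0 --query-gpu=mig.mode.current,mig.mode.pending \
#       --format=csv,noheader | mig_mode_ready && echo "MIG is active"
```

If the check fails with the pending value set to `Enabled`, something is most likely still holding the GPU.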

**The key for NemoClaw was creating asymmetric profiles.** The default 1g.5gb slices left weird leftover memory. (These profile names are for the 40 GB A100; the 80 GB card uses 1g.10gb, 2g.20gb, and so on.) I found this config prevented the "GPU lost communication" errors:

```bash
# Create one 2g.10gb instance for the primary agent
sudo nvidia-smi mig -i 0 -cgi 2g.10gb -C
# Create two 1g.5gb instances for secondary workloads
sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C
sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C
```
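
To check that all three slices were actually created, one option is to count the MIG entries that `nvidia-smi -L` reports (`nvidia-smi mig -lgi` lists the GPU instances as well). A rough sketch; `count_mig_devices` is a hypothetical helper, and it assumes MIG devices appear on indented lines starting with `MIG `, which may differ between driver versions:

```bash
# Hypothetical helper: counts MIG device lines in `nvidia-smi -L` output on stdin.
# Assumes each MIG device is listed on an indented line beginning with "MIG ".
count_mig_devices() {
    # grep -c prints 0 but exits non-zero on no match; "|| true" keeps the exit status clean
    grep -c '^[[:space:]]*MIG ' || true
}

# Usage on a real node: expect 3 for the layout above
#   [ "$(nvidia-smi -L | count_mig_devices)" -eq 3 ] && echo "all slices present"
```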

**Biggest gotcha:** The NemoClaw driver module sometimes doesn't re-bind correctly after MIG changes. You have to manually unbind/rebind the NVMe and GPU controllers in the PCIe hierarchy, or the agent can't see the new MIG instances.
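
For the GPU side of that rebind, the generic Linux sysfs route looks roughly like the sketch below. Everything specific here is an assumption: the PCI address is a placeholder, the driver name `nvidia` is the stock NVIDIA kernel driver and may not match whatever module NemoClaw loads, and `print_rebind_cmds` is a hypothetical helper. It only prints the commands so they can be reviewed first:

```bash
# Hypothetical helper: prints (does NOT run) the sysfs writes that
# unbind and then rebind a single PCI device.
#   $1: PCI address, e.g. 0000:3b:00.0 (placeholder; find yours with `lspci -D`)
#   $2: kernel driver name (assumed "nvidia" if omitted)
print_rebind_cmds() {
    bdf=$1
    driver=${2:-nvidia}
    printf 'echo %s > /sys/bus/pci/drivers/%s/unbind\n' "$bdf" "$driver"
    printf 'echo %s > /sys/bus/pci/drivers/%s/bind\n' "$bdf" "$driver"
}

# Usage: review the output, then run it as root, for example
#   print_rebind_cmds 0000:3b:00.0 | sudo sh
```

Stop anything using the GPU first; unbinding a device that is still in use can hang.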

Has anyone else tried poking at the isolation between these MIG slices? I'm curious if the guardrails are just at the memory controller level, or if there's actual VRAM residue between workloads. My next write-up will be on a small experiment trying to read from a deallocated buffer in a neighboring slice.



   
(@mod_community_tech_li)

Good to see someone putting NemoClaw's isolation claims through a real test. The "ghost process" issue you hit is a classic gotcha with MIG, especially on shared lab nodes.

Your `nvidia-smi -q` pipeline is the right idea, but I'd caution others that grepping for "Process ID" can miss some driver-level contexts on a busy system. Safer to first use `nvidia-smi --query-compute-apps=pid --format=csv,noheader` to target only compute PIDs before the nuclear option.

One thing to add: after your reboot, check that the GPU's persistence mode is enabled before creating slices. A missing `nvidia-smi -pm 1` can lead to configs not surviving driver reloads, which might be why you "bricked" it initially.
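
A small sketch of that check, assuming the `persistence_mode` query field reports `Enabled` or `Disabled`; `persistence_enabled` is a hypothetical helper name:

```bash
# Hypothetical helper: reads the persistence_mode value on stdin and
# succeeds only when it is "Enabled".
persistence_enabled() {
    state=$(tr -d '[:space:]')
    [ "$state" = "Enabled" ]
}

# Usage on a real node: turn persistence mode on only if it is off
#   nvidia-smi -i 0 --query-gpu=persistence_mode --format=csv,noheader \
#       | persistence_enabled || sudo nvidia-smi -i 0 -pm 1
```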



   