Step-by-step: Enabling MIG on A100 for NemoClaw without breaking everything


GPU Memory Isolation and Leakage


Sophia Martinez

(@oscp_student)


June 22, 2026 6:49 pm [#462]

Hey everyone, been lurking for a bit while studying for my OSCP and trying to apply some of the methodology to understanding our agent environment. I've been experimenting with NemoClaw on A100s in our lab slice, specifically around GPU memory isolation.

I kept running into issues where my agent's workload seemed to have visibility into memory regions that should have belonged to another tenant's MIG slice. The vendor docs just say "isolation is handled," but we all know there's always a gap, right? 😅

I wanted to set up a proper, isolated MIG config on an A100 to test NemoClaw's claims, but bricked my lab GPU twice. Finally got a stable config. Here's what actually worked, avoiding the common pitfalls.

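**First, you have to clean-slate the GPU. A simple `nvidia-smi -i 0 -mig 1` isn't enough if there are existing compute processes:** the mode change stays pending until the GPU is reset, and the reset won't go through while anything still holds the device. Our orchestration layer sometimes leaves ghosts.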

```bash
# Kill ALL processes holding GPU 0 (be careful, this is for a lab!)
# -r stops xargs from running kill when no PIDs are found
sudo nvidia-smi -i 0 -q | grep "Process ID" | awk '{print $4}' | xargs -r sudo kill -9

# Disable MIG mode to reset
sudo nvidia-smi -i 0 -mig 0
sudo reboot

# After reboot, re-enable MIG mode (the slices are created in the next step)
sudo nvidia-smi -i 0 -mig 1
```
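Before rebooting, it can help to check whether the mode change actually took effect or is still pending. A minimal sketch, assuming GPU index 0 and the `mig.mode.current` / `mig.mode.pending` query fields of `nvidia-smi`; `mig_needs_reset` is a hypothetical helper name, not part of any tool:

```bash
# Hypothetical helper: succeeds when current and pending MIG modes differ,
# i.e. a GPU reset or reboot is still needed for the change to apply.
mig_needs_reset() {
    # Argument: one CSV line such as "Disabled, Enabled" (current, pending)
    local current pending
    IFS=',' read -r current pending <<< "$1"
    current="${current// /}"   # strip the space nvidia-smi puts after the comma
    pending="${pending// /}"
    [ "$current" != "$pending" ]
}

# Only query the driver when nvidia-smi is actually installed
if command -v nvidia-smi >/dev/null 2>&1; then
    modes="$(nvidia-smi -i 0 --query-gpu=mig.mode.current,mig.mode.pending --format=csv,noheader)"
    if mig_needs_reset "$modes"; then
        echo "MIG mode change still pending ($modes): reset or reboot required"
    else
        echo "MIG mode settled ($modes)"
    fi
fi
```

If the two values match, the reboot may not be necessary for that particular change.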

**The key for NemoClaw was creating asymmetric profiles.** A layout made only of 1g.5gb slices left some memory stranded. I found this config prevented the "GPU lost communication" errors:

```bash
# Create one 2g.10gb instance for the primary agent
sudo nvidia-smi mig -i 0 -cgi 2g.10gb -C
# Create two 1g.5gb instances for secondary workloads
sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C
sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C
```
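To confirm that all three instances really exist after the commands above, one option is to count the MIG devices in the `nvidia-smi -L` listing. This is a rough sketch: it assumes MIG devices show up as indented lines starting with `MIG`, and `count_mig_devices` is a hypothetical helper name.

```bash
# Hypothetical helper: counts MIG device lines in `nvidia-smi -L` style text on stdin.
# Assumes each MIG device is listed on an indented line beginning with "MIG ".
count_mig_devices() {
    # grep -c exits non-zero on a count of 0, so mask that with || true
    grep -c '^[[:space:]]*MIG ' || true
}

expected=3   # one 2g.10gb plus two 1g.5gb

# Only query the driver when nvidia-smi is actually installed
if command -v nvidia-smi >/dev/null 2>&1; then
    found="$(nvidia-smi -L | count_mig_devices)"
    if [ "$found" -eq "$expected" ]; then
        echo "all $expected MIG devices are visible"
    else
        echo "expected $expected MIG devices, found $found"
        nvidia-smi mig -i 0 -lgi   # list GPU instances to see what was created
    fi
fi
```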

**Biggest gotcha:** The NemoClaw driver module sometimes doesn't re-bind correctly after MIG changes. You have to manually unbind and rebind the GPU on the PCIe bus, or the agent can't see the new MIG instances.

Has anyone else tried poking at the isolation between these MIG slices? I'm curious if the guardrails are just at the memory controller level, or if there's actual VRAM residue between workloads. My next write-up will be on a small experiment trying to read from a deallocated buffer in a neighboring slice.


Li X.

(@mod_community_tech_li)


June 22, 2026 8:38 pm

Good to see someone putting NemoClaw's isolation claims through a real test. The "ghost process" issue you hit is a classic gotcha with MIG, especially on shared lab nodes.

Your `nvidia-smi -q` pipeline is the right idea, but I'd caution others that grepping for "Process ID" can miss some driver-level contexts on a busy system. Safer to first use `nvidia-smi --query-compute-apps=pid --format=csv,noheader` to target only compute PIDs before the nuclear option.
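A minimal sketch of that safer first step, assuming GPU index 0. It only prints what it would do; `filter_pids` is a hypothetical helper that drops anything in the query output that isn't a plain numeric PID:

```bash
# Hypothetical helper: keep only purely numeric lines from stdin,
# trim whitespace, and remove duplicates.
filter_pids() {
    grep -E '^[[:space:]]*[0-9]+[[:space:]]*$' | tr -d '[:blank:]' | sort -un
}

# Only query the driver when nvidia-smi is actually installed
if command -v nvidia-smi >/dev/null 2>&1; then
    pids="$(nvidia-smi -i 0 --query-compute-apps=pid --format=csv,noheader | filter_pids)"
    for pid in $pids; do
        # Dry run: swap the echo for `sudo kill -TERM "$pid"` once the list looks right
        echo "would send SIGTERM to compute PID $pid"
    done
fi
```

Starting with SIGTERM and a dry run leaves room to spot a PID that belongs to someone else's job before reaching for `kill -9`.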

One thing to add: after your reboot, check that the GPU's persistence mode is enabled before creating slices. Without `nvidia-smi -pm 1`, the driver can unload when no client is attached, and the slices you created may not survive the reload, which might be why the GPU looked bricked initially.
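That check can be scripted as a pre-flight step. A small sketch, assuming GPU index 0 and the `persistence_mode` query field; `persistence_enabled` is a hypothetical helper name:

```bash
# Hypothetical helper: succeeds only when the reported state is "Enabled".
persistence_enabled() {
    # Argument: value reported by nvidia-smi, e.g. "Enabled" or "Disabled"
    [ "$(printf '%s' "$1" | tr -d '[:space:]')" = "Enabled" ]
}

# Only query the driver when nvidia-smi is actually installed
if command -v nvidia-smi >/dev/null 2>&1; then
    state="$(nvidia-smi -i 0 --query-gpu=persistence_mode --format=csv,noheader)"
    if persistence_enabled "$state"; then
        echo "persistence mode is on, safe to create slices"
    else
        echo "persistence mode is off; run: sudo nvidia-smi -i 0 -pm 1"
    fi
fi
```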
