Troubleshooting: High 'GPU Memory Used' reported after all agents are stopped

GPU Memory Isolation and Leakage

Last Post by Claire Anderson 5 days ago

Maya Chen

(@ghost_wrangler)

Eminent Member

Joined: 1 week ago

Posts: 20

Topic starter

June 25, 2026 4:38 am [#857]

We've been profiling NemoClaw's resource reclamation in our staging environment and observed a consistent pattern: `nvidia-smi` reports significant GPU memory in use (2-4 GB per GPU) even after all tenant agents have been cleanly stopped and their containers removed. The `nvidia-smi` process list is empty.

This suggests the GPU VRAM is not being fully released back to the driver. Given our focus on attestation and hardening, this is a multi-faceted concern:
* **Isolation Gap:** Could this residual allocation provide a side-channel or data residue risk for the next workload scheduled on the same GPU?
* **Operational Impact:** It reduces available VRAM for the next tenant, potentially causing unnecessary scheduling delays or failures.

Our initial troubleshooting points to the CUDA driver's memory caching behavior, but we need to verify what is actually happening at the NemoClaw layer. We executed the following to stop all workloads:

```bash
clawctl agent list --all-tenants | grep -v ID | awk '{print $1}' | xargs -I {} clawctl agent stop --force {}
clawctl container prune --all-tenants
```

Despite this, `nvidia-smi` continues to report used memory. A system reboot or driver reload clears it, but neither is a viable operational solution.
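
For anyone who wants to capture the same observation in a repeatable form, here is a minimal sketch. It assumes only the standard `nvidia-smi` query flags (`--query-gpu`, `--query-compute-apps`); the function name `flag_residual_vram` and the 512 MiB threshold are illustrative choices, not anything NemoClaw provides.

```bash
#!/usr/bin/env bash
# Flag GPUs whose used memory exceeds a threshold (in MiB).
# Reads CSV lines of the form "index, memory.used" on stdin, as printed by:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
# flag_residual_vram is a hypothetical helper name used for illustration.
flag_residual_vram() {
  local threshold_mib="$1"
  awk -F', *' -v t="$threshold_mib" \
    '$2 + 0 > t { print "gpu=" $1 " used_mib=" $2 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  # 512 MiB is an arbitrary example threshold, not a recommended value.
  nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits \
    | flag_residual_vram 512
  # Lists compute processes known to the driver; no rows means none reported.
  nvidia-smi --query-compute-apps=pid,process_name,used_memory \
    --format=csv,noheader
fi
```

Running this before and after the stop/prune sequence gives a per-GPU record that can be attached to a ticket or an audit log.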

**Key Questions for the Group:**

1. Does NemoClaw's GPU scheduler invoke `cudaDeviceReset()` or an equivalent on the context it creates for each tenant's agent/workload? Or does it rely solely on container teardown?
2. What are the specific hardware-level guardrails from NVIDIA (MIG, i.e. Multi-Instance GPU, or time-slicing) that NemoClaw leverages? Documentation states memory is "cleared" on instance termination. Is this a true zeroization or merely a pointer deallocation?
3. Has anyone instrumented the driver to track allocation ownership? We suspect that the default driver caching behavior (`CUDA_VISIBLE_DEVICES` + container removal) may not trigger a full reset of the GPU's memory state.

This isn't just about reclaiming megabytes; it's about verifying the integrity of the isolation boundary. If the hardware doesn't guarantee erasure, then the software layer must enforce it.

Grace Hsu

(@grace_audit)

Active Member

Joined: 1 week ago

Posts: 11

June 25, 2026 10:16 am

Your isolation concern is valid, but the data residue risk is likely low for structured VRAM. The cache is typically zeroed buffers, not plaintext client data. The real compliance issue is with operational controls and audit trails.

Your troubleshooting misses a key layer. Have you validated the NemoClaw control plane's own CUDA context? A persistent management process, like the scheduler or telemetry collector, can hold a context open. That context allocates pinned memory for DMA operations that isn't tied to a user container. The command `clawctl system status --verbose` should show its PID.
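
As a cross-check that does not depend on `clawctl` output, one generic Linux approach is to look for any process that still has an NVIDIA device node open. The sketch below walks `/proc/<pid>/fd`; the function name `list_gpu_holders` is made up for illustration, and it assumes you run it on the host as root so that other users' file descriptors are readable.

```bash
#!/usr/bin/env bash
# Print the PIDs that hold a /dev/nvidia* device node open.
# proc_root defaults to /proc; it is a parameter only so the function
# can be exercised against a fake directory tree.
# list_gpu_holders is a hypothetical helper name, not a NemoClaw command.
list_gpu_holders() {
  local proc_root="${1:-/proc}" fd target pid
  for fd in "$proc_root"/[0-9]*/fd/*; do
    # Skip descriptors we cannot read or that vanished mid-scan.
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      /dev/nvidia*)
        pid=${fd#"$proc_root"/}   # strip the proc root prefix
        echo "${pid%%/*}"         # keep only the PID component
        ;;
    esac
  done | sort -un
}
```

Calling `list_gpu_holders` with no arguments scans the live `/proc`. Any PID it prints after all agents are stopped is a candidate for the management process described above.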

For attestation, you need to document this as a known behavior in your hardening guide and justify it as an acceptable, documented deviation if you can't reclaim it. An auditor will ask why your stated 'clean slate' reclamation procedure doesn't match the physical evidence from `nvidia-smi`.

-- grace

Tomás Rojas

(@tom_skeptic)

Active Member

Joined: 1 week ago

Posts: 11

June 25, 2026 11:21 am

"Cache is typically zeroed buffers" is a big assumption. Depends entirely on the allocator's free routine. Has anyone actually dumped that memory to check, or are we just trusting the vendor's docs?

Control plane context is a good guess. But if the scheduler holds onto that much pinned memory between workloads, that's a design flaw. It should allocate on demand and release. Otherwise you're just reserving GPU memory for internal use, which they never advertise.

Auditors will see that deviation and ask for the threat model. "Acceptable behavior" without a PoC showing the memory contents is just hand-waving.

PoC or it didn't happen

Tom Eriksen

(@containers_first)

Eminent Member

Joined: 1 week ago

Posts: 15

June 25, 2026 2:16 pm

They're right about the vendor docs. The allocator free routine is key, and NVIDIA's isn't open source. But dumping that memory to prove it's zeroed is unrealistic in production; you'd need a kernel module.

The design flaw argument misses the point. That "reserved" memory isn't for the workload, it's for the control plane's own ops. It's a fixed overhead, like any system daemon. If they didn't allocate it upfront, you'd get latency spikes when it does need it.

namespace your agents, not your worries

Claire Anderson

(@arch_sec_lead)

Eminent Member

Joined: 1 week ago

Posts: 18

June 25, 2026 5:51 pm

Good initial troubleshooting. That pattern is well-known within the platform team and you've hit the right two concerns.

You can verify the driver caching theory by running the CUDA device reset call (`clawctl gpu reset --device `) on a test GPU, which will force a full context destruction and flush the cache. If the memory clears, that's your culprit. The audit trail for that reset event is crucial, as it's a privileged control plane operation.
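
A rough way to script that before/after check is sketched below. The function name `verify_reset_clears` and the tolerance value are illustrative assumptions. The read and reset commands are passed in as strings so the same wrapper works with whatever tooling you have; that also makes it a natural place to emit an audit log line.

```bash
#!/usr/bin/env bash
# Usage: verify_reset_clears <read_cmd> <reset_cmd> <tolerance_mib>
#   read_cmd  : prints the GPU's used memory in MiB as a bare integer
#   reset_cmd : performs the reset being tested
# Returns 0 if used memory is at or below the tolerance after the reset.
# verify_reset_clears is a hypothetical helper, not part of clawctl.
verify_reset_clears() {
  local read_cmd="$1" reset_cmd="$2" tol="$3" before after
  before=$(eval "$read_cmd") || return 2
  eval "$reset_cmd" || return 2
  after=$(eval "$read_cmd") || return 2
  echo "before_mib=$before after_mib=$after"
  [ "$after" -le "$tol" ]
}
```

On a real host, the read command could be `nvidia-smi -i 0 --query-gpu=memory.used --format=csv,noheader,nounits`. The reset command could be the `clawctl gpu reset` invocation above with your device filled in, or `nvidia-smi --gpu-reset -i 0`, which needs root and an otherwise idle GPU.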

On isolation, while the memory is likely zeroed allocator cache, the side-channel potential from allocation patterns alone is why we document a hardware-based scheduling boundary in our attestation package. The next workload should never land on the same physical GPU as a previous tenant from a different trust zone without a full node reboot.

--ca
