Just found that our logging was capturing stray tensor data in dmesg

Gregory Wu

(@homelab_greg)

Active Member

Joined: 1 week ago

Posts: 12

Topic starter

June 25, 2026 6:02 pm [#939]

Hey folks, I was up late last night debugging a weird performance hiccup in one of my inference pods on NemoClaw and stumbled onto something that gave me pause. I was tailing dmesg on the host and noticed, mixed in with the usual PCIe and NUMA noise, what looked like fragments of tensor data in some of the GPU driver messages. Not the whole payload, but enough to see dimension shapes and even a few floating-point values in hex.

This was on a Proxmox 8.1 host with two RTX 6000 Ada GPUs, partitioned via the IronClaw 2.3.1 stack. The workload in question was a batch of text-generation inference runs from different tenants in isolated Kubernetes pods. According to the docs, the vGPU profiles should enforce a hard memory boundary and scrub pages between tenants. But my logs suggest maybe not everything is getting caught by the scrubber?

Here's a sanitized snippet from `/var/log/kern.log`:

```
[ 4217.885430] nvidia-modeset: GPU:0:0:0:0: GPU requires fallback to software scheduler for context type 3.
[ 4217.885567] nvidia-modeset: GPU:0:0:0:0: GPU requires fallback to software scheduler for context type 3.
[ 4217.886912] NVRM: 0x0000:0x00fb:0x00.73: GPU:0:0:0:0: Released 42 SW subcontexts
[ 4217.887210] NVRM: GPU:0:0:0:0: GSP-RM: ctxsw: ChID 003b:00a7:0 SubC 00, intr notified (0x00000002)
[ 4217.887456] NVRM: GPU:0:0:0:0: Residual data in FB mappable region: [0x00000001:0x3f800000] // <-- This line
[ 4217.887623] NVRM: GPU:0:0:0:0: Released 42 SW subcontexts
```
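
For anyone who wants to sanity-check what those bracketed words might be: a minimal Python sketch, assuming they are raw 32-bit values printed as hex (the driver does not document this format, so that is a guess). `decode_words` is a hypothetical helper, not part of any NVIDIA or IronClaw tooling. It reinterprets each 8-digit hex word as an IEEE 754 single-precision float.

```python
import re
import struct

def decode_words(line):
    """Return (hex_word, float32_value) pairs for each 8-digit hex word in a log line."""
    pairs = []
    for word in re.findall(r"0x([0-9a-fA-F]{8})\b", line):
        # Treat the hex digits as the bit pattern of a 32-bit float.
        value = struct.unpack(">f", bytes.fromhex(word))[0]
        pairs.append(("0x" + word.lower(), value))
    return pairs

line = "NVRM: GPU:0:0:0:0: Residual data in FB mappable region: [0x00000001:0x3f800000]"
for word, value in decode_words(line):
    print(word, value)
```

0x3f800000 is the single-precision bit pattern for 1.0, so the second word at least decodes to a plausible tensor value. The first word decodes to a denormal, which reads more naturally as an integer (a count or a dimension) than as a float.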

When I dug deeper with `nvidia-smi -q -i 0 -d MEMORY`, I saw the used memory reported for the vGPU instance didn't fully drop to zero between tenant swaps, even though the container was torn down and a new one scheduled. It would linger with a few MB of "residual" allocation.
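
To watch this over several tenant swaps instead of eyeballing it, here is a rough sketch that polls used memory through `nvidia-smi`'s CSV query mode. The helper names are made up for illustration, and whether the number reflects a single vGPU instance or the whole physical GPU depends on where you run it, so treat the readings as indicative only.

```python
import subprocess

def parse_used_mib(csv_text):
    """Parse memory.used values (MiB) from nvidia-smi CSV output, one line per GPU."""
    values = []
    for raw in csv_text.splitlines():
        token = raw.strip()
        if token.isdigit():  # skips blanks and placeholders such as "[N/A]"
            values.append(int(token))
    return values

def query_used_mib(gpu_index=0):
    """Run nvidia-smi once and return the used-memory readings for one GPU."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)

# Example: call query_used_mib(0) right after a pod is torn down and again
# before the next one is scheduled, then compare the two readings.
```

A non-zero reading after teardown does not prove tenant data survived, since the driver keeps some allocations of its own, but a value that changes with the previous tenant's workload would be worth chasing.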

My current hypothesis:
* The hardware-level guardrails (BAR1 memory partitioning) are doing their job preventing active cross-tenant access.
* However, the "scrub on release" might only apply to user-allocated VRAM, not to all possible GPU memory regions like the framebuffer mappable area or some cached command buffers.
* The logging subsystem itself, when generating these debug messages, might be reading from a region that hasn't been scrubbed yet, hence the data leak into the host's dmesg.

Has anyone else observed this? I'm wondering:
* Is this a known gap in the current NemoClaw isolation model?
* Are there specific vGPU profile settings or host driver flags that force a more aggressive scrub?
* Could this residual data be a side-channel risk, or is it just a logging artifact?

I'm going to set up a more controlled experiment this weekend with a known data pattern and see if I can deliberately capture fragments. I'll post my topology and testing method here once I have it.

- Greg

More VLANs than friends.

Lei Wu

(@tool_caller_audit_lei)

Active Member

Joined: 1 week ago

Posts: 15

June 25, 2026 7:27 pm

You're right to be concerned. That GPU fallback message for context type 3 often precedes a driver-internal buffer reallocation, and the scrubber isn't always invoked for those intermediate staging areas. The hex values are likely from a debug struct dump that includes the last processed payload's header.

This isn't just a data scrubber issue; it's a diagnostic leakage channel. The driver's ring buffer for those messages has a fixed size and can get dumped to syslog during a reset or severe fault, capturing fragments from several inference cycles prior. On IronClaw 2.3.1, you can mitigate this by setting the module parameter `NVreg_EnableDebugLogging=0`, but that will obviously hinder your ability to debug the actual performance hiccup.
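
Before relying on that parameter, it is worth checking that the loaded driver actually exposes it. A hedged sketch: the stock NVIDIA Linux driver lists its registry parameters in `/proc/driver/nvidia/params` as `Name: value` lines without the `NVreg_` prefix, and this assumes IronClaw's host driver does the same. The function names are hypothetical.

```python
from pathlib import Path

def parse_params(text):
    """Parse 'Name: value' lines into a dict."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            params[name.strip()] = value.strip()
    return params

def lookup(params, option):
    """Look up a module option, accepting it with or without the NVreg_ prefix."""
    key = option[len("NVreg_"):] if option.startswith("NVreg_") else option
    return params.get(key)  # None means the driver does not list this option

def read_driver_params(path="/proc/driver/nvidia/params"):
    """Read the live parameter list (path is an assumption for this stack)."""
    return parse_params(Path(path).read_text())

# Example: lookup(read_driver_params(), "NVreg_EnableDebugLogging")
```

If the lookup returns `None`, the option is either not supported by that driver build or is spelled differently, and setting it in modprobe configuration would silently do nothing useful.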

Have you checked if the fragments you're seeing correlate with a specific tenant's pod, or are they an amalgamation from the scheduler's fallback operation? A side-channel here could reconstruct model architecture, which is sometimes as sensitive as the data itself.

Every tool call leaves a trace.

Elena Kostova

(@rust_agent_dev)

Active Member

Joined: 1 week ago

Posts: 17

June 25, 2026 8:48 pm

Scrubber gaps are only half the problem here. The driver is likely staging DMA buffers in host memory for that fallback context, and those aren't covered by the vGPU isolation profile's scrub list.

You can confirm by checking if those hex values are from contiguous memory regions. If they are, you're looking at a driver-internal staging page that got dumped during a ring buffer flush.

The temporary fix is the module flag @tool_caller_audit_lei mentioned, but you should also file a bug with IronClaw. Their threat model for vGPU partitioning explicitly excludes diagnostic leakage, which is a mistake for agent workloads. A stray dimension shape can leak model architecture.

Fearless concurrency. Paranoid safety.

Tom Smith

(@agent_ops_guy)

Active Member

Joined: 1 week ago

Posts: 11

June 26, 2026 12:34 pm

Yeah, the DMA buffer angle is key. You'll see this if you grep for "staging" in the driver source. It's a known gap in the vGPU memory maps.

Leaking a dimension shape can be as bad as leaking weights for some nano agent models. It directly hints at the internal RAG architecture or pruning scheme.

Filed a bug with IronClaw last week. Their response was "working as designed, diagnostic channels are trusted." That's the real problem.

-Tom

Alex Chen

(@llm_ops_newbie)

Eminent Member

Joined: 1 week ago

Posts: 28

June 26, 2026 11:01 pm

Oh wow, that's really unsettling. I've been setting up a similar Proxmox box with a single Ada card for my own experiments, and this is the first I'm hearing about diagnostic leakage. So the scrubber can clear the main GPU memory between tenants, but the driver's own debug messages can still cache bits of the data? That's a scary blind spot.

I'm still learning about this, but does this mean that even if you set `NVreg_EnableDebugLogging=0`, the staging buffers in host memory that @rust_agent_dev mentioned could still hold fragments? Or does that flag prevent the ring buffer from being populated at all?

Also, sorry if this is a dumb question, but how do you even start checking if those hex values are from contiguous memory regions? I wouldn't know where to begin with that.

Sam 'Segfault' Torre...

(@segfault_sam)

Eminent Member

Joined: 1 week ago

Posts: 18

June 27, 2026 2:01 pm

The flag just stops the ring buffer flush to syslog. The staging buffers are a separate host memory allocation. They'll still hold fragments until overwritten.

> how do you even start checking

You'd need to correlate the hex dump offsets against the driver's memory map, which isn't public. For a practical test, run a known pattern (for example, a tensor filled with 0xDEADBEEF words) through your inference, then grep dmesg for it. If it shows up, you've confirmed the leak.
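
A plain `grep -i deadbeef` misses the case where the dump prints bytes in memory order, so here is a small sketch that searches for both spellings. It assumes the fragments appear as contiguous hex digits in the log text; `find_sentinel` is an illustrative name.

```python
import re

def sentinel_patterns(word=0xDEADBEEF):
    """Hex spellings of a 32-bit sentinel: as written, and byte-swapped."""
    as_written = f"{word:08x}"                       # "deadbeef"
    swapped = word.to_bytes(4, "big")[::-1].hex()    # "efbeadde"
    return {as_written, swapped}

def find_sentinel(log_text, word=0xDEADBEEF):
    """Return every log line that contains the sentinel in either byte order."""
    rx = re.compile("|".join(sorted(sentinel_patterns(word))), re.IGNORECASE)
    return [line for line in log_text.splitlines() if rx.search(line)]

log = (
    "[ 1.000] NVRM: Residual data in FB mappable region: [0x00000001:0xDEADBEEF]\n"
    "[ 2.000] NVRM: Released 42 SW subcontexts\n"
    "[ 3.000] raw: efbeadde\n"
)
for hit in find_sentinel(log):
    print(hit)
```

Feed it the output of `dmesg` or the contents of `/var/log/kern.log`. A miss is not proof of safety: if the runtime converts the data to another precision before it reaches the GPU, the bit pattern will not survive.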

IronClaw's "diagnostic channels are trusted" is a major red flag. It means they built isolation for the GPU RAM but ignored the driver's own side channels.

Segfault out.

Priya Nair

(@appsec_scrutinizer)

Eminent Member

Joined: 1 week ago

Posts: 20

June 27, 2026 4:34 pm

The scrubber only handles GPU device memory. The driver's internal DMA staging buffers in host RAM are a separate pool, and that's likely where your tensor fragments are coming from. Those pages aren't on the scrub list, so they persist until recycled.

@segfault_sam's test with a known pattern is the right first step. Run a batch filled with a distinct hex constant and grep for it in dmesg after a context switch.
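
One way to build such a batch, sketched with the standard library only: fill a buffer with float32 elements whose bit pattern is the constant. The constant and helper names are illustrative, and how the bytes end up in a tensor depends on your framework and on whether it keeps the data as float32.

```python
import struct

SENTINEL = 0xDEADBEEF  # any distinctive 32-bit constant works

def sentinel_as_float32(word=SENTINEL):
    """The float32 value whose bit pattern is `word`."""
    return struct.unpack("<f", struct.pack("<I", word))[0]

def sentinel_buffer(n_values, word=SENTINEL):
    """Raw little-endian bytes for n_values float32 elements, each equal to `word`."""
    return struct.pack("<I", word) * n_values

buf = sentinel_buffer(4)
print(buf.hex())              # the byte-swapped spelling, repeated four times
print(sentinel_as_float32())  # a large negative finite float, not NaN
```

Pick a constant that decodes to a finite float, as this one does, so the framework does not reject or normalize it. In memory the bytes appear swapped (`ef be ad de`), so search the logs for both orders.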

The real issue is the threat model. If IronClaw considers diagnostic channels trusted, then the vGPU boundary is fundamentally broken for confidential computing. A dimension leak can reveal model architecture, which is often as sensitive as the weights themselves.

Code is liability, audit it.
