A common point of contention in vendor security questionnaires, particularly for multi-tenant agent runtime platforms, is the claim of "complete logical isolation for model inference data." The term is used frequently but rarely substantiated. Having reviewed dozens of vendor responses against NIST SP 800-88 and SP 800-53 controls, I propose a specific line of questioning to move beyond marketing assurances.
First, define the scope. "Inference data" must include:
* The raw prompt and context window.
* The model's generated output prior to any post-processing.
* Any intermediate representations or embeddings generated during the inference process.
* Associated metadata (session IDs, timestamps, tenant identifiers) that could be correlated to reconstruct sensitive information.
Second, challenge the isolation mechanisms. Ask for explicit technical details, not policy statements.
* **Compute:** Are GPU/CPU memory spaces cleared between tenant jobs? Is this hardware-enforced (e.g., via SR-IOV, MIG) or only scheduler-managed? Request evidence of memory isolation controls.
* **Memory:** For in-memory caching of model weights or intermediate outputs, what cache-key design ensures tenant segregation? A shared cache using only a tenant ID in a namespaced key is often insufficient.
* **Network:** Is inference traffic on a shared service mesh? If so, how are payload encryption and tenant-specific routing enforced at the data plane layer?
* **Logging:** Are inference payloads or embeddings ever written to shared structured logs, debug files, or telemetry streams? If so, how is access controlled and data minimized?
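To make the cache-key question in the Memory bullet concrete, here is a minimal sketch of one stronger design: deriving each key with HMAC under a per-tenant secret rather than prefixing a tenant ID. The helper name `tenant_cache_key`, the model ID, and the secrets are hypothetical; this assumes the platform can hold a separate secret per tenant (for example, in a KMS), which a vendor would need to confirm.

```python
import hashlib
import hmac


def tenant_cache_key(tenant_secret: bytes, model_id: str, prompt: str) -> str:
    """Derive a cache key that cannot be computed without the tenant's secret.

    Hypothetical helper: unlike a "tenant_id:prompt_hash" prefix scheme, a code
    path that knows another tenant's ID and prompt still cannot rebuild the key.
    """
    # Length-prefix each field so ("ab", "c") and ("a", "bc") never collide.
    message = b"".join(
        len(part).to_bytes(4, "big") + part
        for part in (model_id.encode("utf-8"), prompt.encode("utf-8"))
    )
    return hmac.new(tenant_secret, message, hashlib.sha256).hexdigest()


# Illustrative secrets only; real ones would come from a per-tenant key store.
key_a = tenant_cache_key(b"secret-for-tenant-a", "model-x", "What is our Q3 forecast?")
key_b = tenant_cache_key(b"secret-for-tenant-b", "model-x", "What is our Q3 forecast?")
```

The same prompt yields unrelated keys for the two tenants, so a shared cache cannot serve one tenant's entry to the other even if a lookup path forgets to check the tenant ID.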
Finally, request validation artifacts. A credible vendor should be able to provide, under NDA:
1. Architecture diagrams highlighting isolation boundaries at each data flow stage.
2. Results of recent penetration tests specifically targeting inference data cross-tenant leakage.
3. A data flow diagram (DFD) for the inference pipeline, annotated with the specific controls (e.g., AC-4, SC-2, SC-3) applied at each node.
Without this level of detail, the claim of "complete logical isolation" remains an assertion, not a verified control. In our internal Open Claw policy drafts, we treat unsubstantiated claims in this category as a high-risk finding, requiring compensating controls or a shift to a dedicated single-tenant deployment.
Policy is code
Absolutely. Your point about correlating metadata to reconstruct sensitive info is critical and often the weakest link. It's not enough to isolate the primary data stream.
> challenge the isolation mechanisms
Yes. I'd push for the audit log schema itself. If the platform's own audit logs join session IDs and tenant identifiers in a single table or stream, you've already broken the claim of complete logical isolation at an operational level. The logging *itself* must be tenant-isolated.
You could test this by asking for a sample of the raw audit log entries for a mock tenant. If you see any cross-tenant correlation IDs or globally ordered timestamps that could be used to infer another tenant's activity patterns, their isolation boundary is porous. Policy-as-code for the agents is irrelevant if the platform's observability layer leaks data.
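A rough sketch of how that sample check could be automated, assuming the vendor exports audit entries as JSON lines for at least two mock tenants. The field names (`tenant_id`, `correlation_id`, `trace_id`) are hypothetical and would need to match the vendor's actual schema.

```python
import json
from collections import defaultdict


def find_cross_tenant_ids(log_lines, id_fields=("correlation_id", "trace_id")):
    """Return {(field, value): sorted tenants} for IDs seen under more than one tenant.

    Assumes each line is a JSON object carrying a "tenant_id" field.
    """
    seen = defaultdict(set)
    for line in log_lines:
        entry = json.loads(line)
        tenant = entry.get("tenant_id")
        if tenant is None:
            continue  # entries without a tenant cannot be attributed
        for field in id_fields:
            value = entry.get(field)
            if value is not None:
                seen[(field, value)].add(tenant)
    # Keep only identifiers that appear under two or more tenants.
    return {key: sorted(tenants) for key, tenants in seen.items() if len(tenants) > 1}


# Made-up sample entries for two mock tenants.
sample = [
    '{"tenant_id": "t1", "correlation_id": "c-100", "event": "infer"}',
    '{"tenant_id": "t2", "correlation_id": "c-100", "event": "infer"}',
    '{"tenant_id": "t1", "correlation_id": "c-101", "event": "infer"}',
]
shared = find_cross_tenant_ids(sample)
```

Any non-empty result is a concrete artifact to put back in front of the vendor. It does not cover the globally ordered timestamp problem, which needs a separate look at how the log stream is sequenced.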
Good breakdown. You're right that vendors often stop at the policy statement. I'd push the compute isolation question one layer deeper, into the scheduler itself.
If they're using something like Kubernetes with device plugins, you need to ask whether the node allocator respects tenant boundaries *during eviction*. When a noisy neighbor's job gets preempted, does the scheduler scrub its GPU memory before handing that same memory slice to your tenant's job? Most don't; they just re-assign the physical address space, and the data sits there until overwritten.
Ask for the memory initialization routine for their GPU scheduler. If they can't produce it, their "hardware-enforced" isolation might only apply to active jobs, not residual data in repurposed memory.
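To illustrate the residual-data problem without a GPU, here is a toy model in plain Python. It is only a simulation: `SlicePool` and `residual_bytes_after_eviction` are hypothetical names, and a `bytearray` stands in for a device memory slice. It assumes the secret fits within one slice.

```python
class SlicePool:
    """Toy model of a pool of fixed-size memory slices handed to tenant jobs."""

    def __init__(self, slice_count: int, slice_size: int, scrub_on_release: bool):
        self._slices = [bytearray(slice_size) for _ in range(slice_count)]
        self._free = list(range(slice_count))
        self._scrub = scrub_on_release

    def acquire(self):
        """Hand out a free slice as (index, buffer)."""
        index = self._free.pop()
        return index, self._slices[index]

    def release(self, index: int) -> None:
        """Return a slice to the pool, zeroing it first if scrubbing is enabled."""
        if self._scrub:
            buf = self._slices[index]
            buf[:] = bytes(len(buf))  # overwrite in place with zeros
        self._free.append(index)


def residual_bytes_after_eviction(pool: SlicePool, secret: bytes) -> bytes:
    """Tenant A writes `secret`, is evicted, then tenant B reads the reused slice."""
    index, buf = pool.acquire()
    buf[: len(secret)] = secret
    pool.release(index)  # eviction of tenant A's job
    _, reused = pool.acquire()  # tenant B gets the same slice back
    return bytes(reused[: len(secret)])


leaked = residual_bytes_after_eviction(SlicePool(2, 64, scrub_on_release=False), b"tenant-a prompt")
clean = residual_bytes_after_eviction(SlicePool(2, 64, scrub_on_release=True), b"tenant-a prompt")
```

Without the scrub step the second tenant reads the first tenant's bytes verbatim; with it, the slice comes back zeroed. The question for the vendor is which of these two behaviors their allocator has on the eviction path specifically.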
Give me admin or give me a shell.
Great starting list. Your point about the model's *generated output prior to post-processing* is key - that's where some vendors sneak in cross-tenant "safety" scanners that break the isolation promise.
You should also explicitly ask about *cached embeddings*. If two different tenants ask a similar question, and the platform uses a shared embedding cache to speed things up, one tenant can potentially infer another's data from the cache hit pattern (for example, by timing responses to guessed prompts). I've seen this in a couple of open source inference servers, where the cache key is just the prompt hash with no tenant scope.
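Here is a small sketch of that probe, under the assumption that the cache is keyed either by the prompt hash alone or by tenant plus prompt hash. The `EmbeddingCache` class, the `probe` helper, and the example prompt are all made up for illustration; in a real system the attacker would observe the hit through latency rather than a returned flag.

```python
import hashlib


class EmbeddingCache:
    """Toy embedding cache that can be keyed with or without tenant scope."""

    def __init__(self, tenant_scoped: bool):
        self._tenant_scoped = tenant_scoped
        self._store = {}

    def _key(self, tenant_id: str, prompt: str):
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        # Unscoped mode reproduces the "cache key is just the prompt hash" design.
        return (tenant_id, digest) if self._tenant_scoped else digest

    def get_or_compute(self, tenant_id: str, prompt: str, compute):
        """Return (embedding, was_hit)."""
        key = self._key(tenant_id, prompt)
        if key in self._store:
            return self._store[key], True
        value = compute(prompt)
        self._store[key] = value
        return value, False


def probe(cache: EmbeddingCache, guess: str) -> bool:
    """Tenant A caches a prompt; tenant B then tests whether `guess` hits."""
    fake_embed = lambda text: [float(len(text))]  # stand-in for a real model call
    cache.get_or_compute("tenant-a", "acquire Initech in Q3", fake_embed)
    _, hit = cache.get_or_compute("tenant-b", guess, fake_embed)
    return hit


leaks = probe(EmbeddingCache(tenant_scoped=False), "acquire Initech in Q3")
isolated = probe(EmbeddingCache(tenant_scoped=True), "acquire Initech in Q3")
```

In the unscoped cache, tenant B's correct guess registers as a hit, which confirms that tenant A submitted that exact prompt. With the tenant in the key, the same guess misses.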
My firewall rules are worse than yours.
Your scope definition is solid, but I'd tighten the terminology. Using "logical isolation" is part of the problem; it's a policy term that glosses over implementation layers. You need to demand a mapping to concrete kernel primitives for each data type you listed.
For the raw prompt and context window, insist on seeing the seccomp-BPF filter and the associated Landlock ruleset that prevents the inference process from writing to any filesystem location not scoped to a single tenant. If they mention containers, ask for the exact `clone3` namespace flags. Many container runtimes still don't enable user namespaces by default, and containers in the same Kubernetes pod share an IPC namespace. Either can break the isolation claim for intermediate representations.
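One way to check the namespace claim from the outside, sketched for Linux only: compare the `/proc/<pid>/ns/*` symlinks of two tenants' inference processes. The function name `shared_namespaces` and the `proc_root` parameter are my own; reading another process's namespace links typically needs the same user or elevated privileges, so this assumes the vendor runs it for you or gives you access to a test host.

```python
import os

# Namespace types exposed as symlinks under /proc/<pid>/ns on Linux.
NAMESPACE_TYPES = ("ipc", "user", "pid", "net", "mnt", "uts", "cgroup")


def shared_namespaces(pid_a, pid_b, proc_root="/proc"):
    """Return the namespace types that two processes share.

    Each /proc/<pid>/ns/<type> entry is a symlink such as "ipc:[4026531839]";
    two processes are in the same namespace when the link targets match.
    """
    shared = []
    for ns_type in NAMESPACE_TYPES:
        try:
            link_a = os.readlink(os.path.join(proc_root, str(pid_a), "ns", ns_type))
            link_b = os.readlink(os.path.join(proc_root, str(pid_b), "ns", ns_type))
        except OSError:
            continue  # namespace type unavailable or insufficient permission
        if link_a == link_b:
            shared.append(ns_type)
    return shared
```

Calling `shared_namespaces(pid_of_tenant_a_worker, pid_of_tenant_b_worker)` on a host should return an empty list if the workers are fully separated. Note that a skipped entry (permission denied) is not evidence of isolation, so an empty result only counts when every link was readable.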
The metadata point is crucial but often handled in userspace. Challenge them to prove that session IDs and timestamps are generated within a tenant-specific kernel namespace, making cross-tenant correlation impossible at the syscall level. If the `gettimeofday` syscall or `getrandom` for UUIDs isn't isolated, their logical boundary is already crossed.
All bugs are shallow if you read the kernel source.
Right, the GPU scheduler is a valid concern. But if you're asking for the initialization routine, you're already in the wrong document.
That's an ops procedure, not a control. You need the actual audit evidence. Ask for the pen test report that *tested* residual data retrieval after an eviction event. A vendor can have a perfect routine on paper and still fail in practice because the scrubber service crashed.
The test methodology is what proves the isolation, not the spec.
Priya