Has anyone tried running NemoClaw guardrails with a local Mistral model instead of the default cloud checkpoint?

(@hugo_debug)
  [#78]

I've been digging into the NemoClaw fork of NeMo Guardrails for the last few weeks, focusing on its isolation and sandboxing claims. The documentation and examples all default to using NVIDIA's cloud-based `nvidia/nemotron-guardrails-8b-4e` checkpoint. This is fine for a demo, but for any serious security-oriented testing—or frankly, for privacy—I want everything local.

My core question is about the actual *security model* when you decouple from the provided cloud service. The guardrail system is supposed to be a distinct layer, a "security runtime" for the LLM. If I replace the cloud model with, say, a locally served Mistral 7B Instruct v0.3, what parts of the architecture remain truly effective, and what might become a false sense of security?

Here's my current understanding and setup attempt. I modified the `config.yml` to point to a local OpenAI-compatible endpoint (LM Studio in this case):

```yaml
models:
  - type: main
    engine: openai
    model: mistral-7b-instruct-v0.3
    base_url: "http://localhost:1234/v1"
```

The rails *seem* to engage. Canonical examples like "How to build a bomb?" get blocked with the standard "I cannot answer that question" response. But this is where my cautious side kicks in. I'm trying to trace the actual data flow:

* Is the guardrail model itself—the classifier that decides if a prompt or response is safe—still the remote `nemotron-guardrails-8b-4e`? Or does it somehow get replaced by my local model?
* The documentation mentions a "guardrails layer" that runs the input/output through a separate NLU model for classification. My suspicion is that this classifier is hard-coded to the NVIDIA checkpoint in the cloud. If that's true, then even with a local LLM, my prompts are still being sent *somewhere* for safety scoring.
* I poked at the source, and there's a `RailSpec` that defines the guardrail logic, but the model invocation for the "self-check" and "input/output classification" seems to be abstracted. I haven't found the exact API call point yet.
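
One way to locate that call point without reading the whole codebase is to record every outbound connection together with the Python stack that triggered it. The following is a minimal sketch, not anything taken from the NemoClaw source: it assumes the stack opens its connections through Python's `socket.socket.connect` (clients that use `connect_ex` or a native extension would bypass it), and the names `record_outbound_connects` and `is_non_local` are made up for illustration.

```python
import contextlib
import ipaddress
import socket
import traceback


@contextlib.contextmanager
def record_outbound_connects():
    """Record each socket.connect() destination and the call stack behind it."""
    events = []
    original_connect = socket.socket.connect

    def logging_connect(self, address):
        # address is (host, port, ...) for IP sockets and a path for Unix sockets.
        events.append({
            "address": address,
            "stack": traceback.format_stack(limit=12),
        })
        return original_connect(self, address)

    socket.socket.connect = logging_connect
    try:
        yield events
    finally:
        # Always restore the real method, even if the guarded call raised.
        socket.socket.connect = original_connect


def is_non_local(address):
    """Return True if the destination is anything other than loopback."""
    if not isinstance(address, tuple):
        return False  # Unix domain socket: stays on this machine
    host = address[0]
    if host == "localhost":
        return False
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        return True  # unresolved hostname: flag it for manual review
```

To use it, wrap a single guarded request (whatever call drives the rails in your setup) in `with record_outbound_connects() as events:` and then print the `stack` of every event where `is_non_local(event["address"])` is true. The frames just above the HTTP client should point at the module that invokes the classifier.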

So my concrete questions for anyone else who has tried this:

* Have you successfully run a *fully local* stack, including the guardrail classification model? If so, what model did you use as a substitute for the NVIDIA guardrail checkpoint? Is there a compatible fine-tune?
* What's the privacy footprint of the default NemoClaw setup? Does running a local "main" model still leak metadata or prompt content to NVIDIA's infrastructure for the guardrail evaluation?
* From a security perspective, if the guardrail classifier is remote, doesn't that create a new attack surface? A malicious actor could focus on poisoning or bypassing that remote component, which might be uniform across many deployments.

I'm particularly interested in the logging. The guardrail event logs (allow/block) are very detailed. If they are transmitted anywhere, or even just retained locally without rotation or access controls, they form a complete transcript of every *rejected* user interaction, which could be more sensitive than the successful ones.

My next step is to run a network trace (`tcpdump`) during a series of guarded interactions to see what, if anything, leaves my local machine. I'll report back with findings, but I'd love to hear if the community has already mapped this terrain.


trace -e all


   