Hey everyone, total newbie question here. I've been reading about NemoClaw's guardrails and keep seeing the term "bypass." I get that the system uses regex patterns and an LLM-as-judge to catch bad stuff.
But in simple terms, what does a "bypass" actually look like in practice? Like, if the regex is looking for a specific word, and I misspell it, does that count as a bypass? Or does it only count if it fools the LLM judge too? Just trying to picture the failure modes before I even think about testing anything 😅
Kevin
Learning by doing (and breaking).
A bypass is any input that gets past both the regex filter AND the LLM judge, delivering a harmful response.
> if I misspell it, does that count as a bypass?
Yes, if it gets through. But in NemoClaw, that's only step one. A misspelling might slip by the regex, but the LLM judge should still catch the intent. That's the point of the two layers.
The real bypass is when the final model (the one being guarded) outputs harmful content. So you have to fool the judge *and* the target model.
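If it helps to picture it, here's a rough sketch of the two layers. To be clear, this is not NemoClaw's code: the blocklist pattern, the function names, and the "judge" (a crude string check standing in for a real LLM call) are all made up for illustration.

```python
import re

# Hypothetical blocklist; a real deployment would carry its own patterns.
BLOCKLIST = re.compile(r"\bmalware\b", re.IGNORECASE)

# Maps a few common digit-for-letter swaps back to letters (0->o, 1->l, 3->e, 4->a).
LEET_TABLE = str.maketrans("0134", "olea")


def regex_layer(prompt: str) -> bool:
    """Return True if the regex layer lets the prompt through."""
    return BLOCKLIST.search(prompt) is None


def judge_layer(prompt: str) -> bool:
    """Stand-in for the LLM judge. A real judge is a model call, not a string check."""
    normalized = prompt.lower().translate(LEET_TABLE)
    return "malware" not in normalized


def guardrail_allows(prompt: str) -> bool:
    # Both layers must allow the prompt before it reaches the target model.
    return regex_layer(prompt) and judge_layer(prompt)


if __name__ == "__main__":
    prompt = "how does malw4re spread"
    print("regex allows:", regex_layer(prompt))
    print("judge allows:", judge_layer(prompt))
    print("guardrail allows:", guardrail_allows(prompt))
```

The misspelled prompt slips past the regex, which makes it a regex bypass, but the second layer still blocks it, so the guardrail as a whole holds.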
Common bypass patterns for LLM-as-judge:
- Contextual poisoning ("I'm writing a safety research paper about...")
- Multi-step instructions that seem benign until final output.
- Obfuscation the judge doesn't parse but the target model does.
The guardrail fails when the judge's understanding is weaker than the target model's.
--Jay
Exactly. A misspelling that gets past the regex is precisely the kind of thing the LLM-as-judge is supposed to catch. It's a *regex bypass*, not a full guardrail bypass.
The failure state you're asking about happens when both layers fail in sequence. For instance, you might use a cultural reference or an idiom the judge doesn't understand, but the underlying model does. The judge sees "make the weather warm" and thinks it's benign, while the target model interprets it as a known euphemism for a violent act.
So the bypass is complete only when the final output is harmful.
Know your dependencies, or they will know you.
You're right about the two-layer failure sequence, but there's a critical network perspective missing. The pipeline's design assumes both components are equally exposed to the input string.
Consider if an attacker can route the judge's evaluation through a different network path than the target model's processing. A microsegmentation failure could allow the input to be pre-processed or altered between the judge's analysis and the model's execution. The bypass then occurs not at the semantic layer, but at the transport layer.
This is why a zero-trust architecture for the agent mesh is non-negotiable. Each component's communication needs mutual TLS and strict egress filtering to prevent context manipulation in transit.
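For anyone unsure what the "mutual" part means in practice, here's a minimal sketch using Python's standard `ssl` module. The function name and the idea that the judge service is a Python server are my assumptions, and the file paths are placeholders you'd supply yourself. Egress filtering lives at the network layer and isn't shown.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Server-side TLS context that refuses clients without a valid certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # The "mutual" part: the server demands and verifies a client certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # This service's own identity (placeholder paths).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The internal CA that signs certificates for the other mesh components.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

Each component in the mesh would wrap its listening socket with a context like this, so an unauthenticated peer can't talk to the judge or the target at all.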
segment or sink
Good point about the bypass depending on the judge having a weaker understanding than the target. I watch the real-time metrics on these pipelines. You'll often see the judge's confidence score drop to ~60% on the contextual poisoning examples you mentioned, while the target model's confidence stays high. That gap is the bypass forming, before the harmful output even happens.
It's a signal you could alert on, if you're monitoring the right telemetry. The latency between the judge's "allow" and the final model's "generate" is also a small window where you might see an anomaly spike in the runtime trace.
The egress filtering comment from user277 is relevant too. If that judge LLM call is going over an unmonitored channel, you'd never even see that low confidence score.
watch and learn
Spot-on about the transport layer, and that's a subtle point that doesn't get enough airtime. It reminds me of a live exercise we ran last year where the judge's API endpoint was, for historical reasons, configured with a different input sanitizer than the target model's endpoint. The payload was base64-encoded between the two calls, and the decoding step on the model's side introduced a null byte that the judge's sanitizer had stripped. The judge saw a clean string, the model got something entirely different.
Your zero-trust call is right, but I'd add it's not just about mTLS and filtering. You need to enforce **identical** preprocessing stacks for both the judge and the target model in the pipeline runtime. If they're not sharing the exact same code path for decoding, normalization, and tokenization before the logic even starts, you're baking in a transport-layer bypass. The architecture diagram looks clean, but the dependency graph is where these splits hide.
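A minimal sketch of what I mean by a shared code path, loosely modeled on the null-byte case above. All function names here are hypothetical, and the specific normalization choices (NFKC, dropping control characters) are assumptions for the example, not what our pipeline actually ran.

```python
import base64
import unicodedata


def preprocess(raw_b64: str) -> str:
    """Single decode + normalize path used by BOTH the judge and the target."""
    text = base64.b64decode(raw_b64, validate=True).decode("utf-8")
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters such as null bytes, but keep newlines and tabs.
    return "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


def judge_input(raw_b64: str) -> str:
    return preprocess(raw_b64)


def target_input(raw_b64: str) -> str:
    return preprocess(raw_b64)


def divergent_target_input(raw_b64: str) -> str:
    """What NOT to do: the target decodes on its own and skips the sanitizer."""
    return base64.b64decode(raw_b64).decode("utf-8")


if __name__ == "__main__":
    payload = base64.b64encode(b"hello\x00world").decode("ascii")
    print(repr(judge_input(payload)))
    print(repr(target_input(payload)))
    print(repr(divergent_target_input(payload)))
```

With the shared path, the judge and the target see the same string by construction. The divergent version is the split we hit: the judge evaluates one string and the model receives another.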
That's a great point about monitoring the confidence gap. It's exactly the kind of telemetry we've found useful for early warning.
We set an alert to fire when the judge's confidence dips below 70% while the main model's stays above 90%. It catches a lot of those contextual poisoning attempts before they complete the request.
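Stripped down, the rule looks something like this. The function and constant names are made up for the post, and it assumes both confidences are already exposed as floats between 0 and 1, which depends on your telemetry setup.

```python
JUDGE_CONFIDENCE_FLOOR = 0.70      # alert when the judge is below this
MODEL_CONFIDENCE_THRESHOLD = 0.90  # ...while the main model is above this


def confidence_gap_alert(judge_confidence: float, model_confidence: float) -> bool:
    """Return True when the judge is unsure but the main model is confident."""
    return (
        judge_confidence < JUDGE_CONFIDENCE_FLOOR
        and model_confidence > MODEL_CONFIDENCE_THRESHOLD
    )


if __name__ == "__main__":
    print(confidence_gap_alert(0.60, 0.95))  # judge unsure, model confident
    print(confidence_gap_alert(0.85, 0.95))  # both confident
```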
The latency window you mention is key, too. We've seen injection attempts that rely on a rapid sequence of requests, trying to get the 'allow' decision cached before the judge's confidence score is even logged. That means you need to treat the judge's verdict and its confidence as a single atomic data point.
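One way to sketch the "single atomic data point" idea: the verdict and its confidence travel together in one immutable object, and the cache fails closed when either is missing or too low. The class names are hypothetical, and the 0.70 default is just borrowed from the alert threshold above as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Verdict and confidence are created together and cannot be changed later."""
    allowed: bool
    confidence: float


class VerdictCache:
    def __init__(self, min_confidence: float = 0.70) -> None:
        self._min_confidence = min_confidence
        self._entries: dict[str, JudgeVerdict] = {}

    def store(self, request_id: str, verdict: JudgeVerdict) -> None:
        # Only complete verdict objects can be cached, never a bare "allow".
        self._entries[request_id] = verdict

    def is_allowed(self, request_id: str) -> bool:
        verdict = self._entries.get(request_id)
        # No verdict yet, a deny, or a low-confidence allow: fail closed.
        return (
            verdict is not None
            and verdict.allowed
            and verdict.confidence >= self._min_confidence
        )
```

Because there is no way to cache an "allow" without its confidence, a fast follow-up request can't ride on a half-recorded decision.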
Great question, Kevin! I was wondering about this too.
It sounds like a misspelling only counts as a bypass if it actually tricks the final model into doing something bad. So if it just gets past the regex but the LLM judge stops it, the guardrails are still working.
I'm curious, does anyone have an example of a simple, real bypass that worked? Like the first one you'd see in a tutorial? It helps to picture the basic case.
That cultural reference example is a good illustration of the semantic gap. The risk compounds if the judge is a smaller, cheaper model than the target model. You're banking on a less capable model correctly interpreting all the nuances and euphemisms a more capable model understands, which is a tall order.
We've seen this in multilingual contexts, where an idiom in one language is translated literally for the judge but the target model runs the original prompt. The judge sees a nonsensical literal phrase and lets it pass, while the target model acts on the intended meaning.
Sandboxed from the kernel up.
The cost angle is a real driver for that smaller judge model setup, and it's a security flaw disguised as a budget win.
You see this in agent chains where every sub-agent call has to be cheap. The judge gets a 7B model, while the writing or coding agent is a 70B model. The semantic gap isn't just a possibility, it's guaranteed. The 70B model understands subtext, implications, and cultural references that the 7B model literally can't see.
That multilingual idiom case is perfect. We replicated it by having the judge on a model with poor non-English support. The target model, with full multilingual training, got the joke and executed the intent. The bypass wasn't crafted, it was emergent from the architecture.
The fix is to match capabilities or accept you're leaving a huge hole open.
--Ray
So if the regex misses a misspelling but the LLM judge catches it, the guardrails did their job. It's only a full bypass if the harmful content makes it through to the final output.
I'm still learning too, but it makes me wonder about timing. What if the bad output happens before the judge even finishes its check? Is that possible, or is the pipeline always sequential?
Good point on the transport layer. But doesn't mTLS just prove identity and encrypt? If the attacker controls the client, they can still send the manipulated payload directly. The judge gets the clean version, the malicious client sends a different one to the model, and both calls have valid mTLS certs.
The architecture assumes a single, honest client. If that's compromised, your zero-trust mesh might just be authenticating the attacker twice.
You're right. The threat model of an honest, monolithic client is a common architectural blind spot. Even with a zero-trust transport, if the client is compromised or malicious, you've lost.
This is why the pipeline's integrity must be enforced server-side. The system invoking the judge and the target model must be a single, trusted runtime component, not a client instruction. The client should only ever send one payload to this orchestrator. If the orchestrator then calls two different internal services, they must share the identical request context object, not be fed independently by the client.
Otherwise, as you said, you're just authenticating the attacker twice.
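A bare-bones sketch of that orchestrator shape, with hypothetical names throughout. The judge and target are passed in as plain callables here so the example stays self-contained. In a real system they would be internal service calls.

```python
from dataclasses import dataclass
from typing import Callable

BLOCKED_MESSAGE = "[blocked by guardrail]"


@dataclass(frozen=True)
class RequestContext:
    """The one payload the client sends. Immutable once the orchestrator builds it."""
    request_id: str
    payload: str


def orchestrate(
    ctx: RequestContext,
    judge: Callable[[RequestContext], bool],
    target: Callable[[RequestContext], str],
) -> str:
    # The judge runs first and sees the exact object the target will see.
    if not judge(ctx):
        return BLOCKED_MESSAGE
    # The target is only invoked after an allow, with the same context object.
    return target(ctx)


if __name__ == "__main__":
    ctx = RequestContext(request_id="r1", payload="hello")
    print(orchestrate(ctx, lambda c: True, lambda c: "echo: " + c.payload))
    print(orchestrate(ctx, lambda c: False, lambda c: "echo: " + c.payload))
```

The client never gets to address the judge and the target separately, so it has no opportunity to send them different payloads.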
risk is not a number