I've been running OpenClaw through its paces for a potential enterprise pilot. The guardrail layer is a critical control point, but I see two major gaps in how we evaluate it: security effectiveness and privacy side effects.
Most discussions focus on whether guardrails *work*. That's not enough. We need to know:
* What specific prompt/response patterns do they actually block?
* What are the known bypass techniques for each runtime (NeMo, Llama Guard, etc.)?
* More importantly, what data gets logged when a guardrail triggers?
This last point is a compliance headache. Detailed logs of blocked user interactions could create a new privacy risk: you might be storing the sensitive topics users tried to explore.
To get concrete answers, I built a synthetic benchmarking suite. It doesn't use real user data. Instead, it systematically tests guardrails against categorized adversarial prompts and measures two things:
1. Block rate per threat category (e.g., misinformation, harassment).
2. Granularity of data written to the audit log upon a block (e.g., is the full prompt captured, just a topic tag, or a hash?).
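To make those two measurements concrete, here is a minimal sketch of how they could be computed. It is not the suite itself: `Trial`, `block_rate_by_category`, and `log_granularity` are hypothetical names, and it assumes you have already captured, per synthetic prompt, whether the guardrail blocked it and what text landed in the audit log. It does not call any guardrail runtime's API.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One synthetic test case and what was observed (hypothetical structure)."""
    category: str   # threat category, e.g. "harassment"
    prompt: str     # synthetic adversarial prompt, never real user data
    blocked: bool   # did the guardrail block it?
    log_entry: str  # raw text the audit log recorded for this trial


def block_rate_by_category(trials):
    """Return {category: fraction of trials in that category that were blocked}."""
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for t in trials:
        totals[t.category] += 1
        if t.blocked:
            blocked[t.category] += 1
    return {c: blocked[c] / totals[c] for c in totals}


def log_granularity(prompt, log_entry):
    """Roughly classify how much of the prompt the log entry exposes.

    Assumption: a "hash" means an unsalted SHA-256 hex digest of the prompt.
    A real logger may use a different scheme, which this check would miss.
    """
    if prompt in log_entry:
        return "full_prompt"
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if digest in log_entry:
        return "hash"
    if log_entry.strip():
        return "metadata_only"
    return "nothing"
```

The substring check only catches verbatim capture. A log that stores a truncated or re-encoded prompt would be classified as `metadata_only`, so treat that label as "needs manual review" rather than "safe".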
Initial findings on the default NeMo config are concerning. The block rate is solid for obvious violations, but nuanced jailbreaks slip through. Worse, in some scenarios the default logging records the entire flagged user input, which could violate data minimization principles if you're subject to GDPR or similar regulations.
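For comparison, this is roughly what a more minimal block event could look like. It is a sketch under my own assumptions, not a NeMo feature or config option: `minimized_block_event` is a hypothetical helper that would sit in your own logging layer, and it stores a category tag, a keyed hash, and the prompt length instead of the raw text.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def minimized_block_event(prompt, category, key):
    """Build an audit record for a blocked prompt without storing its text.

    Hypothetical helper. `key` is a secret bytes value held outside the log
    store, so the digest cannot be brute-forced from the log alone.
    """
    digest = hmac.new(key, prompt.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "event": "guardrail_block",
        "category": category,          # coarse topic tag only
        "prompt_hmac": digest,         # lets you correlate repeats of one prompt
        "prompt_length": len(prompt),  # rough signal for triage
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Even this is a tradeoff: the category tag alone still reveals what a user tried to ask about, and a keyed hash is pseudonymization, not anonymization. Whether it is enough is a question for your compliance team.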
I'm looking for others to run this benchmark on their configurations. We need hard data on the tradeoffs between security strength and privacy exposure. Are you logging guardrail events? Has your legal or compliance team reviewed what's being stored?
DS