Vendors keep talking about "runtime defenses", but their demos are garbage: scripted attacks against toy models. We need real benchmarks.
I'm looking at integrating OpenClaw's prompt injection test suite into a CI pipeline. The idea is to fail the build if a new model version or prompt template is more susceptible to known injection patterns than the previous one.
Has anyone actually done this? Not just running the tests, but making them a build gate. I'm thinking:
* Hooking the OpenClaw CLI into a Jenkins or GitHub Actions stage.
* Storing baseline scores as artifacts.
* Failing the build when a new score is worse than the baseline by more than a set threshold.
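
To make the question concrete, here is a rough sketch of the gate I have in mind. It assumes the scores end up in a JSON file shaped like `{"pattern_name": attack_success_rate}`; that format, the function names, and the 0.02 default are all made up for illustration and are not OpenClaw's actual output or API.

```python
import json
from pathlib import Path


def load_scores(path):
    """Load a score file. Assumed (hypothetical) shape: {"pattern": rate in [0, 1]}."""
    return json.loads(Path(path).read_text())


def find_regressions(baseline, current, max_delta=0.02):
    """Return {pattern: (old_rate, new_rate)} where the attack success rate
    rose by more than max_delta compared with the baseline."""
    regressions = {}
    for pattern, new_rate in current.items():
        old_rate = baseline.get(pattern)
        if old_rate is None:
            continue  # pattern is new, so there is nothing to compare against
        if new_rate - old_rate > max_delta:
            regressions[pattern] = (old_rate, new_rate)
    return regressions


def gate_exit_code(baseline, current, max_delta=0.02):
    """Print any regressions and return 1 (fail the build) or 0 (pass)."""
    regressions = find_regressions(baseline, current, max_delta)
    for pattern, (old_rate, new_rate) in sorted(regressions.items()):
        print(f"REGRESSION {pattern}: {old_rate:.2f} -> {new_rate:.2f}")
    return 1 if regressions else 0
```

The CI stage would then call something like `sys.exit(gate_exit_code(load_scores("baseline.json"), load_scores("current.json")))`, with `baseline.json` pulled from the previous build's artifacts. The file names are placeholders.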
Main hurdles I see:
* The benchmark needs a live, deployed endpoint. That's extra infrastructure for every pipeline run.
* Scoring isn't just pass/fail, so we need a policy on what counts as a regression.
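
On the regression policy, one option I'm considering is to run the suite a few times and only flag a regression when the average moves by more than the run-to-run noise. This is a sketch under my own assumptions: the scores are attack success rates where higher is worse, and `min_delta` and `noise_factor` are arbitrary knobs I invented, not anything the suite defines.

```python
from statistics import mean, stdev


def is_regression(baseline_runs, current_runs, min_delta=0.02, noise_factor=2.0):
    """Decide whether current_runs are meaningfully worse than baseline_runs.

    Each argument is a list of attack success rates from repeated runs
    (higher is worse). The allowed increase is whichever is larger:
    a fixed floor (min_delta) or a multiple of the baseline's spread.
    """
    # stdev needs at least two samples; with one run, assume no noise estimate.
    noise = stdev(baseline_runs) if len(baseline_runs) > 1 else 0.0
    allowed_increase = max(min_delta, noise_factor * noise)
    return mean(current_runs) - mean(baseline_runs) > allowed_increase
```

The fixed floor keeps a very stable baseline from tripping the gate on tiny changes, while the noise term keeps a flaky baseline from failing builds at random. Whether that trade-off is right is exactly what I'd like input on.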
If you've tried it, how did you structure it? How do you store and update the baseline? Show me the code.
Trust but verify.