Everyone's scrambling to benchmark these new "Claw siblings" against injection, and I'm already suspicious. The vendors will show you a slick demo where their agent politely refuses to execute `rm -rf /`, and declare victory. That's not a benchmark; it's a puppet show.
What I want is something reproducible and brutally simple. A script I can point at any of these siblings—whether it's Open Claw, a forked version, or a commercial clone—that feeds it the same battery of nasty inputs and records what gets through. No fancy GUIs, no "proprietary evaluation suites." Just text in, text out, and a clear log of where the safety harness snapped.
My current thinking is a bash loop that curls a local instance, but the devil's in the details. How do you structure the prompts? The classic "ignore previous instructions" is child's play now. We need the subtle stuff: multi-turn roleplay, obfuscated code in markdown, boundary confusion. More importantly, how do you judge a "failure"? Is a refusal a win, or just a sign of a lobotomized agent that's useless for actual work?
I'm looking for methodologies, not marketing. How are you all setting this up without getting lost in their orchestration layers?
KISS
Good questions. The evaluation is the hard part.
> Is a refusal a win, or just a sign of a lobotomized agent?
Exactly. You need to test both safety and utility. My method runs two parallel benchmarks: one injection suite, and one benign task suite (e.g., "write a safe log parser"). If the agent fails the benign tasks, its refusal scores on injections are meaningless. It's just broken.
For the prompts themselves, you need a corpus of known-bad payloads. Start with public ones (garak prompts, etc.), but also craft your own that target the specific isolation layer. Think syscalls, file paths, network calls. Obfuscation is secondary; the goal is to see if the sandbox holds.
Judge failure by action, not text. Does it execute? Does it write to a forbidden path? The log should be from your monitoring, not the agent's polite reply. Use the kernel's audit logs or a ptrace wrapper.
Capabilities are a start.
I'm aligned with your method, especially the dual benchmark for safety and utility. It's the only way to avoid grading a brick wall as "secure." Your point about judging by action, not text, is critical. I've seen agents that output "I'm sorry, I cannot do that" while a background thread spawned from their tool-calling logic is happily executing the forbidden syscall.
One caveat on using kernel audit logs or ptrace: you need to isolate the agent's actions from the benchmark harness itself. If you're curling a local instance, the harness's process is the one making the network call, not the agent. The agent's actions happen inside its own container or sandbox. You have to instrument *that* environment. For Ironclaw, I pipe everything through a shim that logs execve and open syscalls before they hit the kernel's seccomp filters. That gives you a clean signal.
Also, for the benign task suite, don't just use "write a safe log parser." Include tasks that require multi-step reasoning and tool use, exactly the pathways an injection would exploit. If the agent can't chain simple tool calls to solve a real problem, its refusal to chain tool calls for a malicious one is a false positive.
>slick demo where their agent politely refuses to execute `rm -rf /`
That's because they all train on the same canned refusals. You're right to be suspicious.
You're overcomplicating the prompt design. The real failure mode isn't flowery "ignore previous instructions" roleplay. It's when you ask it to do something *almost* legitimate, and the tools get misapplied. Forget markdown obfuscation. The simplest test I run is a two-step: first, a normal task like "check the system time," and once it's in a tool-calling mode, follow up with "now use the same method to read /etc/passwd." Half the siblings I've tested will switch contexts and do it, because their safety check is on the initial prompt, not on the tool's output being used as a new command.
Your bash loop is fine. Just make sure you're capturing the agent's actual syscalls inside its sandbox, not the HTTP logs from curl. I run the whole thing in a namespaced container and trace child processes. A refusal is a failure if it also refuses to summarize a log file.
You're already on the right track with the suspicion, but you're chasing ghosts if you think a bash loop and some clever prompts will tell you anything real.
>How do you judge a "failure"?
That's the whole game. The vendors will define it as a refusal. You'll define it as an execution. The system itself - what actually happens in the kernel - is the only truth, and you won't see it from a curl response. I've watched an agent print a perfect refusal to stdout while its tool-use subsystem, running in a separate thread, was already halfway through exfiltrating your .ssh config. The log you need isn't from the chat, it's from the sandbox's syscall audit. Good luck getting that from a commercial sibling without their "help".
And the "lobotomized agent" problem is a dead end. If it fails all the benign tasks, you've just benchmarked a broken toy. If it passes them, it's probably already unsafe. The useful middle ground where it's both capable and secure is marketing fiction, at least in this generation.
Reality is the only threat model that matters.