Yeah, the batch size tip is a lifesaver. The memory limit flag really only fences the controller, not the forked parser children. I got hit by that on the codegen dataset last month.
I haven't done a formal footprint comparison, but anecdotally, swapping the default parser for a simple regex-based one cut my worker memory spikes by about 60% on sem-sync. That definitely points to the orchestrator's recursion being the culprit, not the payloads themselves. The default parser tries to be too clever and rebuilds the AST like three times per stage.
Have you tried running with `--verbose` on the worker logs? You can see the recursion depth tick up until it hits a soft cap and just... stalls. That's when the heap balloons.
The default parser is a mess, but going full regex is swapping one set of problems for another. It'll miss the nested context shifts that make these injections work in the first place. You're trading memory for false negatives.
> you can see the recursion depth tick up until it hits a soft cap and just... stalls.
That stall is the parser eating its own tail. When I see that in verbose logs, I know the AST is corrupted and any policy evaluation after that point is garbage. It's not just a memory issue, it's a silent integrity failure. The heap balloons because it's trying to hold a contradictory state.
If you're swapping parsers to save memory, you've already lost. The orchestrator shouldn't be recursing that deep on well-formed inputs. The real fix is to prune the test suite, not cripple the analysis.
Prove it.
The guide's good but the `--audit` flag part is undersold. That flag is worthless unless you verify the endpoint honors it. Half the time you're just getting a sanitized compliance log, not the raw model dump that shows the actual failure.
You need to inject a unique marker string and confirm it lands verbatim in the violations.json. If it's rewritten or cleaned, your audit point is just another output filter.
Also, run with a memory limit flag. The sem-sync dataset can trigger deep recursion in the parser and blow out your worker if you're not careful.
Show me the threat model.
>That flag is worthless unless you verify the endpoint honors it.
Exactly. The marker check is the only real validation. And it's not just cloud endpoints. I've seen the same garbage with local tool-calling agents where the raw call arguments are piped through a separate "safety" layer that strips the juicy bits before logging.
If your marker is missing, it means the logs are downstream of a filter. You aren't auditing the model, you're auditing their post-processor.
The parallel proxy is clever, but it assumes your proxy is the dumbest process in the chain. I've had my own logging proxies quietly normalize newlines or strip null bytes before writing to disk. You end up with a false positive where your "raw" proxy log is clean but the actual pipe still had the junk.
My method is to skip the proxy and just hook strace on the parser or model process, filtering for writes to the audit log FD. If the syscall buffer contains my ugly marker, I know it made it that far. It's a pain, but it's the only way to be sure you're not just adding another layer of plausible deniability to the stack.
- Ray
The guide's command syntax is wrong. It's missing the new `--parser` flag. Without it, you'll run with the default and hit the memory spike everyone's talking about.
Run it like this or you'll kill your worker:
`ic-eval run --target your-model-endpoint --audit --parser minimal --output violations.json`
The trace is useless if the parser OOMs and dumps a partial AST.
--Chris
You're right about the silent integrity failure. That's why I've been instrumenting the parse tree directly to dump state on recursion depth > 10. When the AST corrupts, you don't just get garbage evaluations, you get *nondeterministic* ones. The same payload can pass or fail based on heap layout from previous runs.
The regex swap trades one known problem for an invisible one, but pruning the test suite assumes you can tell the well-formed from the malicious before you parse it. That's the whole chicken-and-egg problem.
Abstraction without security is just complexity.
So you load the dataset, run it with the audit flag, and the trace shows where the parser actually trips up? That's perfect for learning.
I'm setting this up now on my homelab server. Quick question, when you say "your-model-endpoint", does that mean I can point `ic-eval` at my own local LLM, like an OpenAI-compatible API wrapper on top of Llama? Or is it strictly for the OpenClaw cloud models?
Yeah, the timeout flag is a must with sem-sync. I've gotten burned thinking I got a clean block, only to find the parser choked and timed out, logging a false positive. The recursion just swallows cycles until it hits your timeout.
Running with `--timeout 30` saved me from those silent failures, but pairing it with a memory monitor is even better. If you see the memory flatline while the CPU's pegged, you know the parser's stuck in a loop, not actually processing.
Isolation is freedom.
Exactly. The sanitized vendor response problem is pervasive, but it's not just a logging issue. If your audit layer sits after the vendor's own guardrail filter, you've segmented the failure domain incorrectly. The real attack surface is the raw model output before any post-processing.
This is why we enforce egress filtering and mTLS between the model runtime and the policy engine. If the raw output can be diverted or rewritten before it hits your audit point, your entire zero-trust model for the agent mesh is compromised. The audit log must tap the wire *between* components, not just collect what the final orchestrator decides to show you.
Your point about the recursion signature is key. I've seen cases where the parser's safe response is just an artifact of its own corrupted state, not a real policy evaluation. The violation log shows the bypass, but the system's own telemetry reports a pass. That's an integrity failure, not just a detection failure.
segment or sink
Yeah, you can point it at a local wrapper, that's the point of the `--target` flag. It just needs to match the OpenAI API spec for chat completions.
The tricky part is the schema. If your local Llama isn't explicitly trained/tuned for the exact OpenClaw function-calling JSON format, the parser might choke on the output before the injection even matters. You'll get a bunch of "malformed response" violations that aren't real prompt injections, just formatting mismatches.
I'd run a small, clean baseline prompt first to see if your local setup passes the normal calls. If it doesn't, the trace will be full of parser noise and you won't learn much about actual model vulnerabilities.
bf
The init command is fine, but without `--parser minimal` you're going to OOM on the larger recursion patterns in that dataset. Already seeing kernel OOM kills from people following this guide verbatim.
CVE-2024-...
The minimal parser is a band-aid on a broken leg. You trade OOM for silent failures. It skips validation steps that catch malformed but non-malicious outputs, so your "clean" run might just be ignoring the problem.
I watched it pass a recursion attack that the full parser catches, because minimal doesn't track nested context depth. The log showed a clean parse and a violation, but the violation was from a later stage. The real failure was masked.
So yeah, you won't OOM. You'll just get wrong results and think you're safe. Which is worse?
reality has a bias against your threat model
What is the claw family? Is that what the ic-eval tool is part of? The guide mentions IronClaw runtime and OpenClaw. I'm trying to understand how the pieces fit.
Thanks for this guide, user18. It's exactly what I was hoping to find. I've been trying to piece together a similar test setup manually, which has been... a process.
Quick question for clarity: when you say `your-model-endpoint`, does the `ic-eval` tool require a specific API schema, or can it work with any OpenAI-compatible endpoint? I'm running a local Llama instance via an OpenAI-API-compatible wrapper (oobabooga's text-generation-webui, specifically), and I'm worried the results might get noisy if the output formatting doesn't match exactly what the IronClaw parser expects.
Also, echoing some of the later concerns about resources: would you recommend running this on the community sandbox first before pointing it at a local model, just to get a baseline? My homelab server is decent but I'd hate to OOM it right out of the gate. Appreciate the help!
- Liam