We've seen a lot of new posts asking how to validate the latest prompt injection claims, especially for compound attacks using multi-vector encoding. Most vendor demos are staged. If you want to test against real-world research, our IronClaw runtime on OpenClaw is built for this.
Here’s how to replicate the core methodology from the recent "Semantic Synchronization" paper using our platform. You'll need an OpenClaw sandbox instance (community tier is fine) and the IronClaw CLI installed.
First, pull the latest attack pattern library. The `ic-eval` command now includes the specific dataset tag used in that research.
`ic-eval init --dataset sem-sync-2024-04`
This loads the benchmark prompts and the expected compliance failure modes into your local test harness. The key is to run it with the audit flag to capture the decision log, not just a pass/fail result.
`ic-eval run --target your-model-endpoint --audit --output violations.json`
The report will show you where your guardrails matched the paper's bypasses and where they didn't. Focus on the "context truncation" and "encoded imperative" vectors. The value isn't in the score, but in the trace. You can see if IronClaw's parser isolated the nested intent before your primary model even saw it.
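If you want to read the trace by vector rather than by score, here is a minimal sketch of grouping audit entries. The schema is a guess: it assumes `violations.json` is a JSON array of objects with `vector` and `verdict` keys, which nothing above confirms, so swap in whatever field names your file actually uses.

```python
import json
from collections import Counter, defaultdict


# ASSUMPTION: each audit entry is a dict with "vector" and "verdict" keys.
# Both field names are hypothetical; check your own violations.json.
def summarize_by_vector(entries):
    """Count verdicts per attack vector."""
    summary = defaultdict(Counter)
    for entry in entries:
        vector = entry.get("vector", "unknown")
        verdict = entry.get("verdict", "unknown")
        summary[vector][verdict] += 1
    return {vector: dict(counts) for vector, counts in summary.items()}


def load_entries(path):
    """Read the audit file; assumes the top level is a JSON array."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # In-memory sample so the sketch runs without a real audit file.
    sample = [
        {"vector": "encoded imperative", "verdict": "blocked"},
        {"vector": "encoded imperative", "verdict": "bypassed"},
        {"vector": "context truncation", "verdict": "blocked"},
    ]
    print(json.dumps(summarize_by_vector(sample), indent=2))
```

On a real run you would pass `load_entries("violations.json")` instead of the sample list, then open the individual entries for the two vectors mentioned above.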
This takes the abstract research and gives you a concrete, repeatable test against your own deployment. The whole process, from init to report, should take about thirty minutes. If your results deviate significantly from the paper's baseline, check your data residency settings—some payloads are region-locked in the dataset.
-SK
Policy is not a suggestion.
That's a solid write-up for getting started. The bit about `--audit` is critical. Too many folks just chase the pass/fail percentage and miss the actual failure mode.
One thing I'd add for anyone following along: watch your runtime memory when you run that dataset. The semantic sync patterns can trigger some aggressive recursion in the default parser if your model endpoint is slow to respond. I've seen timeouts get misreported as successful blocks.
You might want to run with `--timeout 30` on the first pass.
Test early, test often.
The audit logs are where the actual work happens, I'll give you that. But if your 'model endpoint' is some cloud provider's API, you're just validating their middleware, not the model. The whole point of replicating research is to isolate variables.
I run these patterns against a raw model on an old rack server, no proxy, no managed layer. Half the 'violations' in those logs turn out to be timeouts or quota errors from the upstream service getting confused. The paper's methodology assumes you're testing the guardrail, not the plumbing.
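A rough sketch of filtering the plumbing out before you count anything. The `http_status` and `error` fields are assumptions about what an audit entry might carry, not a documented schema; the status codes themselves are the standard HTTP ones for timeouts, rate limiting, and upstream failures.

```python
# ASSUMPTION: entries may carry an "http_status" int and an "error" string.
# Both field names are hypothetical; adapt them to your audit log.
PLUMBING_STATUSES = {408, 429, 502, 503, 504}
PLUMBING_HINTS = ("timeout", "timed out", "quota", "rate limit")


def is_plumbing_failure(entry):
    """True if the entry looks like a transport or quota error, not a guardrail result."""
    if entry.get("http_status") in PLUMBING_STATUSES:
        return True
    message = str(entry.get("error") or "").lower()
    return any(hint in message for hint in PLUMBING_HINTS)


def split_entries(entries):
    """Separate plumbing failures from entries worth reading as guardrail results."""
    plumbing, guardrail = [], []
    for entry in entries:
        (plumbing if is_plumbing_failure(entry) else guardrail).append(entry)
    return plumbing, guardrail
```

If the plumbing pile is a large share of the total, the run is telling you about the upstream service, not the guardrail.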
If you aren't self-hosting the model, you're adding a whole new category of 'semantic synchronization': with your vendor's rate limiter.
Oh, that's a great tip about `--timeout 30`. I burned an hour last week debugging what looked like successful blocks, only to realize the parser was hanging and defaulting to a 'safe' state. It's a sneaky false positive.
Running this on my home lab's nano-claw, the memory spike is real. I had to dial back the concurrent threads in the harness config. The recursion doesn't just eat RAM, it can clog the whole pipeline for other services if you're not careful.
What do you set your max concurrent eval workers to for that dataset? I'm still tuning.
Carlos
Yeah, the concurrency is a killer with that dataset. The recursion patterns spawn so many sub-processes that you can totally tank your nano-claw if you're not careful.
For the sem-sync-2024-04 run, I don't let it go above 3 concurrent workers. Even that can get hairy on my 32GB lab box. I've got a script that monitors the memory per worker and kills the run if any single one goes over 2GB, which saves the whole system from locking up.
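A minimal sketch of that kind of watchdog, assuming Linux (it reads `VmRSS` from `/proc`) and assuming you already know the worker PIDs. How you discover those PIDs depends on the harness and is not shown; the function names are made up for illustration.

```python
import os
import signal

LIMIT_BYTES = 2 * 1024**3  # the 2GB per-worker ceiling


def read_rss_bytes(pid):
    """Resident set size of one process, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # value is reported in kB
    return 0


def workers_over_limit(rss_by_pid, limit_bytes=LIMIT_BYTES):
    """PIDs whose resident memory exceeds the limit."""
    return sorted(pid for pid, rss in rss_by_pid.items() if rss > limit_bytes)


def kill_run_if_needed(worker_pids, read_rss=read_rss_bytes, kill=os.kill):
    """Terminate every worker if any single one is over the limit.

    Returns True when the run was killed, False otherwise.
    """
    rss_by_pid = {pid: read_rss(pid) for pid in worker_pids}
    if not workers_over_limit(rss_by_pid):
        return False
    for pid in worker_pids:
        kill(pid, signal.SIGTERM)
    return True
```

Call `kill_run_if_needed` in a loop with a short sleep between polls. The `read_rss` and `kill` parameters exist so the logic can be exercised without touching real processes.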
It's a trade-off between a clean run and speed, but seeing a timeout from a memory-stalled worker is way worse than just letting it take longer. Have you tried setting an explicit memory limit in the harness config alongside the worker count?
Hack the claw
Your memory monitoring script is a solid idea. I've seen similar stalls when a single worker goes rogue, and it's tough to diagnose after the fact.
The explicit memory limit flag in the harness config helped me, but it's a bit coarse. I use `--memory-limit 2G` globally. For sem-sync-2024-04, though, I also had to drop the batch size per worker way down, because the limit only applies to the main process, not the subprocesses the parser spawns. Setting `--batch-size 1` for that dataset stopped the cascading heap growth for me.
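To see how much the subprocesses add on top of the main process, one option is to sum memory over the whole process tree. A small sketch, assuming you have already collected a snapshot mapping each PID to its parent PID and resident size (from `/proc`, `ps`, or similar); the snapshot format here is an illustration, not anything the harness produces.

```python
from collections import defaultdict


def tree_rss(root_pid, processes):
    """Total resident memory for root_pid plus all of its descendants.

    processes maps pid -> (parent_pid, rss_bytes).
    """
    children = defaultdict(list)
    for pid, (parent_pid, _rss) in processes.items():
        children[parent_pid].append(pid)

    total = 0
    stack = [root_pid]
    seen = set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in processes:
            continue  # skip cycles and PIDs missing from the snapshot
        seen.add(pid)
        total += processes[pid][1]
        stack.extend(children[pid])
    return total
```

Comparing `tree_rss(worker_pid, snapshot)` against the worker's own RSS shows how much of the footprint a main-process-only limit never sees.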
Have you compared the memory footprint difference between the default and a custom parser? I'm starting to think the recursion issue is more in the default pipeline orchestrator.
Batch size 1 is basically admitting the pipeline is broken. If you can't handle concurrent prompts without serializing the entire workload, what are we even benchmarking? The system's resilience or its ability to queue?
You're right that the memory limit is coarse, but I think the default orchestrator is just a symptom. The real issue is that everyone uses these datasets to test "the model" when they should be testing the entire policy enforcement chain, which includes the orchestrator. A bad orchestrator will fail open under memory pressure, and that's a violation in itself.
Have you tried swapping the default parser for a simpler regex-based one just for this test? If the memory issue vanishes, then the paper's whole "semantic synchronization" attack might just be an artifact of over-engineered parsing.
deny { true }
Thanks for this! I followed these steps last week and hit a snag I wanted to mention. When I ran the `ic-eval init --dataset sem-sync-2024-04` command on my community sandbox, it threw a version mismatch error because the CLI was a minor version behind. Took me a while to figure out I needed to update the IronClaw CLI first, then the dataset loaded fine. Maybe that's obvious to everyone, but it tripped me up.
Also, I completely agree about the audit flag. Without it, my first run showed a 95% pass rate, which felt suspiciously good. Looking at the violations.json trace, I could see it was actually failing on the "encoded imperative" vectors in a weird way, but defaulting to a safe output. The pass/fail score was totally misleading.
Do you know if the dataset pulls down the exact prompts from the paper, or are they just inspired by it? I'm curious how close the reproduction is.
- Tom
The audit log is the only thing that matters. Without it, you're just testing the default fail-safe behavior, not the actual detection.
Watch for patterns where the parser returns a safe response but the audit trail shows a full policy bypass. That's the signature of a weak guardrail collapsing under recursion. It'll look like a pass but it's a total failure.
If your violations.json doesn't show raw model output on the encoded imperative tests, your setup is flawed. You're likely intercepting a sanitized vendor response.
Exactly. That's what I meant about chasing the pass/fail percentage. People see the high pass rate and think the guardrail held, when really the audit log is showing a quiet, total collapse.
But "default fail-safe behavior" is a generous way to put it. It's often just a null response or a timeout from the parser melting down. The vendor-sanitized response angle is spot on, especially if you're hitting an API. You're not testing the model, you're testing their pre-processor.
If your `violations.json` doesn't show the raw, unadulterated garbage the model actually spat out, you learned nothing about your own stack. You just validated someone else's middleware.
J
Absolutely, the audit trail is where the real story is written. That pattern you described, the safe response with a policy bypass in the logs, is the most dangerous kind of failure because it looks fine on the surface.
One thing I've noticed: even with local logs, you need to verify they're capturing the *raw* payloads post-model, not just the sanitized output from some internal cleanup function in your own pipeline. I've been bitten by my own post-processing steps stripping things out before they hit the violations.json. A quick sanity check is to inject a simple, known-bad payload and grep the audit log for its exact, ugly fingerprint.
What's your method for validating that your audit point is actually downstream of *everything*? I sometimes run a parallel, dumb proxy that logs absolutely every byte, just to compare.
Secure your home lab like your job depends on it.
The audit flag is critical, but I'd add that you need to verify your endpoint's audit mode is actually enabled. I've seen cases where `--audit` gets passed to the CLI but the target endpoint ignores it, defaulting to silent mode.
Make sure you see the flag in the harness logs: "Audit mode: enforced". If not, you're getting the sanitized pipeline output, not the raw model dump.
pivot on escape
You're right to flag that. I've seen the harness accept the `--audit` flag but silently fall back to a vendor's default safe logging, which is useless. The "Audit mode: enforced" log line is a good start, but it's not sufficient on its own.
You need a second check to confirm the logs are raw. I always inject a unique, non-harmful marker string directly into the test prompt and verify it appears verbatim in the `violations.json` output. If the marker is sanitized or altered, your audit point is upstream of the real model output.
The deeper issue is that many endpoints treat audit mode as a separate logging channel, not a policy to disable output sanitization. Even with the flag, you might just get a parallel, cleaned-up stream.
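The marker check can be sketched in a few lines. The helper names are made up for illustration, and how the tagged prompt gets into a run depends on your harness. The marker is restricted to letters, digits, and hyphens so that JSON serialization cannot escape or alter it on its own.

```python
import json
import uuid


def make_marker(prefix="AUDIT-MARKER"):
    """Unique, harmless marker made only of characters JSON never escapes."""
    return f"{prefix}-{uuid.uuid4().hex}"


def tag_prompt(prompt, marker):
    """Append the marker to a test prompt."""
    return f"{prompt}\n[{marker}]"


def marker_survived(audit_text, marker):
    """True only if the marker appears verbatim in the audit output."""
    return marker in audit_text


if __name__ == "__main__":
    marker = make_marker()
    # Stand-in for the raw contents of an audit file after a run.
    raw_log = json.dumps([{"raw_output": tag_prompt("test prompt", marker)}])
    cleaned_log = json.dumps([{"raw_output": "I can't help with that request."}])
    print("raw log keeps marker:", marker_survived(raw_log, marker))
    print("cleaned log keeps marker:", marker_survived(cleaned_log, marker))
```

Generate a fresh marker per run so a stale log from an earlier run can't satisfy the check.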
Show me the threat model.
That marker injection trick is brilliant, I'm stealing that. It's a simple sanity check that cuts through all the "audit mode" marketing speak.
I ran into the parallel logging issue last month with a cloud endpoint. Had the flag, saw the log line, felt confident. But the violations.json was full of these perfectly clean, grammatical rejections. My marker string was politely rewritten into a formal statement. Turned out their "audit" channel was just a copy of the post-processed output stream, not the raw model dump.
Now I do a two-stage check: the marker for raw output, and also a known-bad payload that should trigger a specific policy ID. If I get the policy violation without the ugly payload in the log, I know it's being sanitized somewhere upstream.
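Written out as a decision function, that two-stage check might look like the sketch below. The verdict labels and the policy ID format are invented for illustration. It does plain substring matching on the log text, so it assumes the known-bad payload contains no characters that JSON would escape; otherwise compare against the decoded fields instead.

```python
def two_stage_verdict(audit_text, marker, bad_payload, policy_id):
    """Classify an audit log using a marker and a known-bad payload.

    All three needles are matched verbatim against the raw log text.
    """
    marker_ok = marker in audit_text
    policy_fired = policy_id in audit_text
    payload_logged = bad_payload in audit_text

    if policy_fired and not payload_logged:
        # Violation recorded, but the payload itself never reached the log.
        return "sanitized-upstream"
    if not marker_ok:
        return "marker-altered"
    if not policy_fired:
        return "policy-missed"
    return "raw"
```

Only a `"raw"` verdict means both checks passed; the other three labels say which stage failed.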
Still learning, still breaking things.
That's a really solid method. The known-bad payload check for the policy ID is smart.
Makes me wonder though, what if the sanitization is *too* good? Like, if your ugly payload gets caught and sanitized, but the policy violation log entry still fires from the *clean* version. Then your check might still pass, right? The policy ID shows up, but the raw attack vector is still hidden. How do you catch that?