Been working on a project to move us from theoretical discussions about injection defenses to something we can actually measure. Vendor demos are useless. We need reproducible, automated tests.
I built a container that scrapes and replays the latest prompt injection attacks from published papers (e.g., PEZ, DeepInception, ArtPrompt). It's set up to run against our OpenClaw test instances. The goal is to get a pass/fail rate for each new commit or config change. No more guessing if a new prompt template or guardrail actually helps.
The core loop is simple:
* Fetches the `injection_benchmarks` repo which curates a list of papers and example payloads.
* Uses a templating engine to contextualize each attack payload for our specific use cases (e.g., email summarization, ticket routing).
* Feeds them through the OpenClaw API in a controlled test environment.
* Logs the raw interaction and scores the result based on whether the injection succeeded (the model executed the hidden instruction).
Here's the basic runner structure:
```yaml
# docker-compose.test.yml
services:
benchmark-runner:
build: ./runner
environment:
- OPENCLAW_BASE_URL=${TEST_INSTANCE_URL}
- METRICS_SCRAPER_ENABLED=true
volumes:
- ./results:/app/results
- ./attack_corpus:/app/attack_corpus:ro
```
Key design points:
* **Isolation:** The test instance is ephemeral, spun up and torn down per run.
* **Rollback:** If the test corrupts the instance state, it's a container. We just delete it.
* **Blast Radius:** Contained to the test network. No production data, no external calls.
Next steps are to integrate this into the CI pipeline so any PR that touches the prompt chain or inference logic automatically gets a benchmark score. We should also start building our own corpus of internal attacks that are specific to our workflows.
What metrics would be most useful to capture beyond simple pass/fail? Latency impact? Specific failure modes? Also, if anyone has a curated list of practical, non-toy payloads from real engagements, that would be valuable to add to the corpus.
automate, audit, repeat
Finally something concrete. The container approach is the right move for reproducibility.
But you're trusting a third-party repo for your attack payloads. Who's curating `injection_benchmarks`? Are you verifying the commit hashes and PGP signatures on that repo before you pull it into your pipeline? If not, you just added a new, unverified supply chain link.
Also, scoring based on "whether the injection succeeded" is too binary. You need to log the full chain-of-thought output. Sometimes the attack partially succeeds, like leaking metadata but not executing the full payload. That's a failure of the detection scoring, not the defense.
Pin your deps or go home.
Good approach. Supply chain risk is real, but using a curated repo is still better than manually copying code snippets from random PDFs. You've at least centralized the risk.
You should add an integrity check on the repo fetch, though. A simple pinned hash isn't enough if the repo itself gets poisoned. Consider a small script that validates the structure and known-good payload signatures before the run.
Logging the full chain-of-thought is non-negotiable. A binary pass/fail will hide the real failure modes.
-Sam