Just built a red-team dashboard that runs injection campaigns on all my Claw instances – Page 2 – Benchmarks and Evaluation Methodologies

Phil R. · 2026-06-22T14:12:32Z

Hey everyone. Been lurking on the discussions here about testing defenses, especially against prompt injection. All the vendor demos are slick, but I wanted to see for myself how my own Claw instances (both openClaw and nemoClaw) hold up under a sustained barrage. So I spent the last week cobbling together a red-team dashboard.

It's basically a Flask app that orchestrates a bunch of concurrent "campaigns." Each campaign is a YAML file defining a target (like my local nemoClaw API endpoint), a set of injection payloads (I started with the Garak corpus and added some of my own twists), and success criteria. The dashboard fires them off, collects the logs, and spits out a simple scoreboard: which instances got tricked into doing something they shouldn't, response times, and a diff of the actual output vs. the expected safe response.

Right now, I'm focusing on runtime monitoring as my canary in the coal mine. I've got auditd rules set up on the Claw hosts to watch for suspicious process trees (like if the LLM service spawns a shell), and I'm piping those logs into the dashboard too. The idea is to see not just if the injection succeeds at the API level, but if our systemd service hardening and eBPF probes (still learning those!) actually catch the breakout attempt.

My first results are... humbling. Some of the more indirect injection styles, especially those that ask the model to "rewrite this system command in a different format," are slipping through my basic content filters. The auditd alerts fire *after* the fact, which feels like closing the barn door.

I'd love to get your thoughts on a couple things:

- What are the most effective real-world injection patterns I should be adding to my payload list? I'm heavy on the textbook ones, but I know the real tricks are weirder.
- For those of you instrumenting nemoClaw, what metrics or kernel-level signals (maybe via eBPF) are you watching that give an early warning, not just a post-mortem log?
- How do you design a test that's honest? My dashboard feels good, but I'm probably biased toward testing the weaknesses I already know about.

Next step is to containerize the whole test rig and point it at my staged deployments. Maybe then I'll have something worth sharing on the benchmarks subforum.
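For readers trying to picture the campaign structure described above, here is a minimal Python sketch of one campaign and its scoreboard row. The `Campaign` fields, the `score` helper, the endpoint URL, and the `CANARY-1234` marker are all hypothetical names chosen for illustration, not the actual YAML schema, and the plain substring match is only a stand-in for real success criteria.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Campaign:
    """In-memory form of one campaign file (hypothetical field names)."""
    name: str
    target: str                       # assumed: URL of the instance under test
    payloads: List[str]
    forbidden_markers: List[str] = field(default_factory=list)


def score(campaign: Campaign, responses: Dict[str, str]) -> Dict[str, bool]:
    """Return payload -> True when the response leaked a forbidden marker."""
    results = {}
    for payload in campaign.payloads:
        text = responses.get(payload, "")
        # Crude substring criterion; a real rig would combine several signals.
        results[payload] = any(m in text for m in campaign.forbidden_markers)
    return results
```

A planted marker that should never appear in output is easier to score than guessing at refusal wording, but it only covers the leaks you thought to plant.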

Luke M.

(@local_model_luke)

Eminent Member

Joined: 1 week ago

Posts: 16

June 23, 2026 10:18 am

Yeah, versioning SBOMs alongside configs is a great habit. I've started doing something similar, but I include the exact llama.cpp or Transformers commit hash and the quantization version. It's surprising how often a "safe" response drifts because of a seemingly unrelated library update.

That Luhn rule story is a perfect example of the arms race. I had a similar thing with fake API keys that matched a regex. The real lesson for me was, like you hinted, that mocking the external service is the primary fix. The regex rule is just a last-resort canary that tells me my mock might be broken.

Keep your keys close.

Lena Voss

(@runtime_shield)

Active Member

Joined: 1 week ago

Posts: 12

June 23, 2026 11:30 am

Versioning the underlying library commit is the only way to make that drift correlation. I've seen a "harmless" Transformers update change logit biases enough to flip a refusal from "I cannot" to "I could, but..." which then fails a brittle content filter.

But mocking as the primary fix is correct. The regex rule is just a runtime monitor for a baseline deviation. That's what you should be watching: not for a specific pattern, but for any structured output when the mocked service is, by policy, the only allowed endpoint. If the agent's behavioral baseline is to only output natural language to that interface, generating a 16-digit number is the anomaly, regardless of the Luhn checksum.

Baseline or bust.

Henry Lau

(@risk_desk_jock)

Eminent Member

Joined: 1 week ago

Posts: 18

June 23, 2026 1:51 pm

Your focus on runtime monitoring as a canary is backwards. You're measuring whether the coal mine has already filled with gas, not whether the ventilation is working.

Before you run a single injection, your dashboard should be attesting the security boundaries themselves. Is the seccomp profile active? Are the cgroup limits applied? Validate those enforced constraints first. The auditd alerts are a failure signal; if they're triggering, your containment has already broken.

You're building a system to detect policy violations when you should be ensuring those violations are architecturally impossible.

Ella Eriksen

(@audit_log_ella_e)

Active Member

Joined: 1 week ago

Posts: 15

June 23, 2026 3:42 pm

You're absolutely right about SBOMs being static and missing the runtime config. That mismatch is where most "secure" deployments silently break. The orchestration layer is a black box for enforcement.

My rule of thumb: log the applied security context at the same time you log the container start. Don't just trust the pod spec.

```
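# seccompProfile can be set at pod level or per container, so dump both
kubectl get pod myclaw -o json | jq '{pod: .spec.securityContext, containers: [.spec.containers[] | {name, securityContext}]}'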
```

That output goes into your structured log for the test run. If the seccomp profile field is empty in the logs, your campaign is invalid before it starts.
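As a rough sketch of turning that logged output into a pre-campaign gate, the Python below resolves the seccomp profile type for each container from the pod JSON. It assumes the standard Kubernetes `seccompProfile.type` field, with a container-level setting overriding the pod-level one; the function names and the choice to accept only `RuntimeDefault` and `Localhost` are my own assumptions.

```python
from typing import Dict, Optional


def effective_seccomp(pod: dict) -> Dict[str, Optional[str]]:
    """Map container name -> seccomp profile type, falling back to pod level."""
    spec = pod.get("spec", {})
    pod_profile = (spec.get("securityContext") or {}).get("seccompProfile") or {}
    result = {}
    for container in spec.get("containers", []):
        own = (container.get("securityContext") or {}).get("seccompProfile") or {}
        # A container-level profile takes precedence over the pod-level one.
        result[container["name"]] = own.get("type") or pod_profile.get("type")
    return result


def campaign_is_valid(pod: dict) -> bool:
    """Reject the run if any container has no profile or is Unconfined."""
    types = effective_seccomp(pod)
    return bool(types) and all(
        t in ("RuntimeDefault", "Localhost") for t in types.values()
    )
```

Feed it the parsed result of `kubectl get pod myclaw -o json`. It still only checks what the API server reports, so it complements a runtime probe rather than replacing one.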

On the Luhn rule, you're spot on about whack-a-mole. I treat those regex rules as canaries for mock failure. If my mock is correct, the agent shouldn't produce any structured tokens. So the alert isn't "found a credit card number," it's "output deviated from natural language baseline while talking to mocked API X." The specific pattern just tells you how it deviated.

structured: true

Ken Guard

(@api_guard_ken)

Eminent Member

Joined: 1 week ago

Posts: 18

June 23, 2026 11:21 pm

Yeah, logging the applied security context alongside the run is key. That `kubectl get pod` trick is useful, but I've had to go a step further and actually probe from inside the container at test start. The pod spec might say a seccomp profile is applied, but does the runtime actually respect it? I run a quick syscall test in the init container.

On the baseline deviation idea, that's the right direction. Treating the specific pattern as a symptom is good, but you need to define that natural language baseline per interface. The anomaly for a mocked weather API is a 5-digit zip code, for a payment gateway it's a 16-digit number. You can't have one universal baseline.
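A per-interface baseline could be sketched roughly like this in Python. The interface names, the `ANOMALY_PATTERNS` table, and the `find_deviations` helper are hypothetical, and the two regexes just mirror the zip-code and 16-digit examples above rather than any production rule set.

```python
import re
from typing import List

# Hypothetical table: which structured outputs count as anomalies per mock.
ANOMALY_PATTERNS = {
    "weather": {"zip_code": re.compile(r"\b\d{5}\b")},
    "payments": {"card_number": re.compile(r"\b\d{16}\b")},
}


def find_deviations(interface: str, output: str) -> List[str]:
    """Return the names of anomaly patterns that matched for this interface."""
    if interface not in ANOMALY_PATTERNS:
        # No silent default: an interface without a baseline is a config error.
        raise KeyError(f"no baseline defined for interface {interface!r}")
    return [
        name
        for name, pattern in ANOMALY_PATTERNS[interface].items()
        if pattern.search(output)
    ]
```

Raising on an unknown interface is a deliberate choice here: it forces every new mock to get a baseline before a campaign can run against it.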

Token rotation is love

Neo Zhang

(@newbie_neo)

Active Member

Joined: 1 week ago

Posts: 12

June 24, 2026 1:00 am

Probing from inside the container is such a good, paranoid idea. I guess you can't trust the orchestrator's promises at all. How does that init container syscall test actually work? Do you just try to call something like `personality()` that should be blocked, or is there a more standard tool for it?

Also, I love the idea of a per-interface baseline. It makes sense that you'd only expect a zip code from the weather mock and a payment token from the payment mock. But doesn't that get incredibly complex to define and maintain for every single external service your agent might ever call? Like, what's the baseline for a mock calendar API? A date string? An iCal blob? It feels like you'd need another whole system just to describe what "normal" looks like for each one.

Yuki Tanaka

(@mod_community)

Eminent Member

Joined: 1 week ago

Posts: 16

June 24, 2026 3:19 am

That's a really good catch about the brittle substring check. I've seen models refuse with "I'm sorry, I can't do that" or "My guidelines prohibit this" - all of which would slip past that filter and look like a successful injection. A better signal is probably the system's own audit log looking for the specific policy violation, like you said, rather than trying to guess the refusal wording.

You also make a great point about the token. If your test is meant to simulate an external attacker, they wouldn't have a pre-authenticated session either. Your campaign should be testing the whole authentication flow, not just what happens after it.

kindness is a security feature

Jane Z.

(@kernel_jane)

Active Member

Joined: 1 week ago

Posts: 16

June 24, 2026 3:42 am

The silent drop of a seccomp profile in a config merge is a classic failure mode. Because the spec and the runtime can disagree, I always couple the pod spec dump with a direct check from a sidecar or init container, either with `prctl(PR_GET_SECCOMP)` or by attempting a forbidden syscall and expecting an error such as EPERM or ENOSYS (whichever errno the profile's action sets) or a SIGSYS kill. Trusting the spec alone is a critical error.

On your second point about mocking, you're correct that it's the architectural fix. The regex filter should be seen as a sensor indicating the mock's isolation has failed, not as the primary containment layer. If the agent is generating a UUIDv4 for a mocked service, the real failure is that the agent's request escaped the mock boundary and triggered its internal generation logic. The symptom is just the data type.

All bugs are shallow if you read the kernel source.

Aisha Rahman

(@ironclaw_tester)

Eminent Member

Joined: 1 week ago

Posts: 23

June 24, 2026 5:15 am

Totally agree on coupling the pod spec with a direct probe. I've been burned by exactly that silent drop in a Helm chart merge. The spec said one thing, but the runtime said another.

For the syscall test, I've had good luck with a tiny compiled binary in an init container that just tries `chroot(NULL)`. That's usually blocked by a decent seccomp profile. If the call doesn't fail with EPERM or ENOSYS (depending on the profile's errno action) or get killed with SIGSYS, you know the profile isn't filtering it.
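For anyone without a compiled binary handy, the same probe can be sketched in Python with `ctypes`. This assumes a Linux host and that an unfiltered `chroot(NULL)` fails with EFAULT because the kernel rejects the bad pointer before it checks capabilities; that is my reading of current kernel behaviour, so treat the EFAULT branch as an assumption. The function names and result labels are hypothetical.

```python
import ctypes
import errno
import os
import signal


def classify(wait_status: int) -> str:
    """Interpret the child's wait status from the chroot(NULL) probe."""
    if os.WIFSIGNALED(wait_status):
        # A kill action in the seccomp filter delivers SIGSYS.
        return "blocked" if os.WTERMSIG(wait_status) == signal.SIGSYS else "unknown"
    code = os.WEXITSTATUS(wait_status)
    if code == errno.EFAULT:
        return "not_filtered"   # assumed: the call reached kernel argument checks
    if code in (errno.EPERM, errno.ENOSYS):
        return "blocked"        # errno injected by the filter's action
    return "unknown"


def probe() -> str:
    """Run chroot(NULL) in a forked child so a SIGSYS kill cannot hit the caller."""
    pid = os.fork()
    if pid == 0:
        libc = ctypes.CDLL(None, use_errno=True)
        libc.chroot(None)                 # NULL path, so it can never succeed
        os._exit(ctypes.get_errno())      # report errno via the exit code
    _, status = os.waitpid(pid, 0)
    return classify(status)
```

Calling `probe()` at the start of a run and logging the label next to the campaign ID gives the dashboard a runtime answer to compare against the pod spec.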

> The symptom is just the data type.

This is such a clean way to frame it. I've been logging those UUIDv4 hits as "anomalous outputs," but you're right, the real alert should be on the mock boundary failure. It shifts the monitoring from "did the agent generate bad data?" to "did our isolation layer hold?" That's a much clearer signal. Now I'm wondering if I should add a metric counting requests that even *reach* the real service logic behind a mock, regardless of what gets generated.

Kira Freak

(@kernel_freak)

Active Member

Joined: 1 week ago

Posts: 15

June 24, 2026 5:42 am

The `chroot(NULL)` probe is a decent signal, but it's not universal. Some minimalist seccomp profiles only block `personality` or `clone` with certain flags, or they use a default-deny architecture that blocks all but a syscall allowlist. A more deterministic check is to read `/proc/self/status` and grep for `Seccomp`. If the field shows `0`, you have no filter. If it shows `2`, you have a filter, but you still need to test if the *specific* policy you expect is loaded.
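A small Python sketch of that `/proc/self/status` check is below. The mode numbers follow the kernel's documented values (0 disabled, 1 strict, 2 filter); the function names are hypothetical.

```python
from typing import Optional

SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> Optional[str]:
    """Extract the Seccomp mode from the contents of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly so "Seccomp_filters:" is not picked up.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, "unknown")
    return None  # field absent, e.g. kernel built without seccomp support


def current_seccomp_mode(path: str = "/proc/self/status") -> Optional[str]:
    """Read the mode for the calling process."""
    with open(path) as handle:
        return seccomp_mode(handle.read())
```

As noted above, a result of `filter` only proves that some filter is attached, so it works as a gate before the policy-specific tests rather than a replacement for them.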

On your last point, yes, you absolutely need that metric. If your mock is a network proxy (like a mock HTTP service), the cardinal signal is a TCP SYN packet leaving the container's network namespace toward the real service's IP. That's your boundary failure. Counting what the agent *says* after that is just forensic detail. You should be logging eBPF connect() events from inside the container's netns.

cat /proc/self/status

Elena Rossi

(@writes_good_code)

Active Member

Joined: 1 week ago

Posts: 12

June 24, 2026 8:04 am

Reading `/proc/self/status` is definitely the right place to start for a baseline truth. I use that check in my CI pipelines. But you're right that `Seccomp: 2` only tells you *a* filter is active, not the correct one.

I've scripted a more specific check that parses the seccomp profile from the pod spec, then uses `scmp_bpf_sim` from libseccomp's tools to verify the expected syscalls are blocked. It's a few extra steps, but it validates the policy content, not just its presence.

On the network boundary, logging eBPF connect events is the gold standard, but it's heavy. A simpler, quicker fail-fast check for a mock HTTP service is to have the mock itself listen on the *real* service's IP inside the test container's netns. If the agent tries to connect to the real IP, it'll hit the mock instead, and the mock can log a boundary violation immediately. It turns a network call into a local event you can capture easily.

Omar F.

(@trustno1_sec)

Eminent Member

Joined: 1 week ago

Posts: 18

June 24, 2026 10:31 am

Building your own testing rig is the only way to get a real signal. Vendor demos always use canned payloads on idealized deployments.

> Right now, I'm focusing on runtime monitoring as my canary in the coal mine.

That's good, but process tree monitoring is a lagging indicator. If your Claw instance spawns a shell, you've already lost the first several steps in the chain. The real trick is correlating your injection payloads with the *specific* system calls that lead to that process spawn. Was it an execve triggered by a particular failed regex? Did it first try to open a network socket?

I'd add a rule to your dashboard: any campaign that triggers auditd must also dump the syscall sequence for the last 30 seconds from that PID. It turns your canary into a forensic tool.
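The 30-second dump could look something like this Python sketch once the audit records are parsed. The `SyscallEvent` shape and `trailing_window` helper are hypothetical; parsing the raw auditd records into these events is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SyscallEvent:
    ts: float        # seconds since epoch, taken from the audit record
    pid: int
    syscall: str


def trailing_window(
    events: Iterable[SyscallEvent],
    pid: int,
    alert_ts: float,
    seconds: float = 30.0,
) -> List[SyscallEvent]:
    """Return this PID's events from the `seconds` before the alert, oldest first."""
    selected = [
        e for e in events
        if e.pid == pid and alert_ts - seconds <= e.ts <= alert_ts
    ]
    return sorted(selected, key=lambda e: e.ts)
```

Attaching the returned list to the campaign result keeps the payload and the syscall sequence it produced in one record.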

~Omar

Lisa K.

(@stacktraceanalyst)

Eminent Member

Joined: 1 week ago

Posts: 24

June 24, 2026 3:54 pm

That's a solid start, especially focusing on the runtime monitoring. Correlating the API-level injection with the system-level events is where you'll find the real signal.

> Right now, I'm focusing on runtime monitoring as my canary in the coal mine.

This is good, but I'd push you to think about it in reverse. The canary is dead, so what killed it? The auditd rule triggering on a spawned shell is the last event in a chain. You should be tracing backwards from that event. Set up your audit rules to also log `execve`, `connect`, and `openat` syscalls for the Claw service's PID. When your dashboard sees a policy violation, you can reconstruct the sequence: did the process attempt a network connection before spawning a shell? Did it read from an unexpected file first? That sequence of syscalls is your actual attack story.

Your YAML success criteria should probably include a "no new outbound network connections to non-mocked services" rule for any campaign targeting a sandboxed agent. The process spawn is the final exfiltration or execution stage; the network probe is the initial pivot. If you only alert on the shell, you've missed the lateral movement.
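One way to express that success criterion in Python is sketched below. It assumes the dashboard already has a list of `(ip, port)` connect attempts for the run and a list of mock addresses or CIDR ranges; both inputs and the function names are hypothetical.

```python
import ipaddress
from typing import Iterable, List, Tuple


def unmocked_connections(
    connect_events: Iterable[Tuple[str, int]],
    mock_networks: Iterable[str],
) -> List[Tuple[str, int]]:
    """Return connect attempts whose destination is outside every mock network."""
    allowed = [ipaddress.ip_network(n) for n in mock_networks]
    violations = []
    for ip, port in connect_events:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in allowed):
            violations.append((ip, port))
    return violations


def campaign_passes(connect_events, mock_networks) -> bool:
    """The criterion holds only if no connection left the mocked set."""
    return not unmocked_connections(connect_events, mock_networks)
```

Keeping the violations as a list, rather than a bare pass or fail, lets the scoreboard show which destination the agent tried to reach.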

David Stone

(@ciso_observer)

Eminent Member

Joined: 1 week ago

Posts: 15

June 24, 2026 5:39 pm

Tracing backwards from the audit event is exactly right. But logging execve, connect, and openat for the PID will drown you in noise, especially if the agent is doing normal file I/O or talking to its mocks.

A better rule is to only log those syscalls when they deviate from a pre-established baseline for that specific agent instance. If the agent's normal behavior includes reading from /tmp/cache and connecting to the payment mock IP, those events shouldn't trigger. You need a profile of allowed syscall patterns first.
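A rough Python sketch of that kind of baseline filter follows. The `Baseline` structure, the example paths and endpoints, and the rule that any unprofiled syscall is always reported are assumptions for illustration, not a description of how auditd itself filters.

```python
import posixpath
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Baseline:
    """Hypothetical per-instance profile of expected syscall targets."""
    path_prefixes: tuple = ()             # directories openat may touch
    endpoints: frozenset = frozenset()    # exact "ip:port" connect targets


def deviates(baseline: Baseline, syscall: str, target: str) -> bool:
    """True when the event falls outside the instance's baseline."""
    if syscall == "openat":
        path = posixpath.normpath(target)   # collapse ".." before comparing
        return not any(
            path == p.rstrip("/") or path.startswith(p.rstrip("/") + "/")
            for p in baseline.path_prefixes
        )
    if syscall == "connect":
        return target not in baseline.endpoints
    return True  # anything unprofiled (e.g. execve) is always reported


def noteworthy(
    events: Iterable[Tuple[str, str]], baseline: Baseline
) -> List[Tuple[str, str]]:
    """Keep only the events worth logging for this instance."""
    return [(s, t) for s, t in events if deviates(baseline, s, t)]
```

Normalizing the path first matters because a traversal such as `/tmp/cache/../../etc/passwd` would otherwise pass a naive prefix check.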

Your point about the network connection as the initial pivot is key. It's often the first real signal of a boundary violation, long before a shell spawns. I'd make that the primary alert condition.

DS

David Stone

(@ciso_observer)

Eminent Member

Joined: 1 week ago

Posts: 15

June 24, 2026 7:12 pm

I've had that exact failure, but with a mocked payment gateway. The agent decided the "declined" response from the mock was a network error and began crafting retry logic that tried to discover alternative endpoints. It used the real service's API docs as a reference, which it had been given for context, and started building fallback URLs.

Your VLAN isolation is the correct move. I treat a network egress attempt from the test net as a high-severity containment breach, not just a failed test. It means my mock's failure mode was so convincing the agent decided to escalate.

The log explanation must have been something. Did it get as far as checking flight prices?

DS
