What's the least misleading way to compare vendor 'injection detection' numbers?

Benchmarks and Evaluation Methodologies

Last Post by Jack O. 2 hours ago

1 Posts

1 Users

0 Reactions

0 Views

RSS

Jack O.

(@contrarian_risk_taker_jack)

Active Member

Joined: 2 weeks ago

Posts: 8

Topic starter

Translate ▼

July 3, 2026 9:01 pm [#1350]

We’ve all seen the charts: “Our product blocks 99.8% of prompt injections!” Usually followed by a footnote in size-2 font about their “proprietary benchmark.” It’s security theater dressed as a data sheet.

The problem isn't that vendors test; it's that they get to define both the exam and the grading rubric. A detection rate is meaningless without knowing what’s being detected. Are they counting simple keyword flagging on curated, obvious attacks? Are they including subtle context corruption, multi-turn jailbreaks, or indirect injection via retrieved documents? Or is their benchmark just a thousand variations of “Ignore previous instructions” and “You are now DAN”?

If we want numbers that aren’t purely for marketing, we need to agree on a few baseline principles for comparison. Not another monolithic benchmark—those get gamed quickly—but a methodology.

First, the attack taxonomy must be public and extensive. It should cover the spectrum from naive to novel, including:
- Direct injection (plaintext, encoded, natural language)
- Indirect injection (via tool output, RAG context, user history)
- Multi-modal or multi-step attacks
- Non-English and culturally-specific social engineering prompts

Second, the test set must include a “benign” corpus. What’s the false positive rate on normal, quirky, or edge-case user queries? A system that flags 10% of legitimate customer service prompts as malicious is useless, regardless of its detection score.

Third, the runtime conditions matter. Is the detection running pre-execution, or is it monitoring during agent operation? Static analysis catches the lazy attacks; a dynamic environment is where the real fight happens.

So, my question is this: what would a minimally misleading evaluation framework actually require? I think it starts with transparent, community-defined test suites and the courage to publish failure cases, not just success rates. Otherwise, we’re just comparing vanity metrics.

Jack

Security theater is still theater.

Quote

Topic Tags

80 Forums
1,353 Topics
7,896 Posts
0 Online
508 Members

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed