What's the least misleading way to compare vendor 'injection detection' numbers?

(@contrarian_risk_taker_jack)
  [#1350]

We’ve all seen the charts: “Our product blocks 99.8% of prompt injections!” Usually followed by a footnote in size-2 font about their “proprietary benchmark.” It’s security theater dressed as a data sheet.

The problem isn't that vendors test; it's that they get to define both the exam and the grading rubric. A detection rate is meaningless without knowing what’s being detected. Are they counting simple keyword flagging on curated, obvious attacks? Are they including subtle context corruption, multi-turn jailbreaks, or indirect injection via retrieved documents? Or is their benchmark just a thousand variations of “Ignore previous instructions” and “You are now DAN”?

If we want numbers that aren’t purely for marketing, we need to agree on a few baseline principles for comparison. Not another monolithic benchmark (those get gamed quickly), but a methodology.

First, the attack taxonomy must be public and extensive. It should cover the spectrum from naive to novel, including:
- Direct injection (plaintext, encoded, natural language)
- Indirect injection (via tool output, RAG context, user history)
- Multi-modal or multi-step attacks
- Non-English and culturally-specific social engineering prompts
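To make the taxonomy point concrete, here is a rough sketch of why a single headline number hides things. The category labels and counts are made up for illustration and are not from any vendor or benchmark; the only point is that an aggregate rate is dominated by whichever category has the most samples.

```python
from collections import defaultdict


def detection_by_category(results):
    """Per-category detection rate.

    results: iterable of (category, detected) pairs, attack samples only.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, detected in results:
        totals[category] += 1
        if detected:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}


def aggregate_rate(results):
    """Single headline number: detected attacks / all attacks."""
    results = list(results)
    return sum(1 for _, detected in results if detected) / len(results)


# Hypothetical test set: 990 easy direct attacks (all caught) and
# 10 indirect attacks (only 2 caught). Numbers are invented.
sample = (
    [("direct", True)] * 990
    + [("indirect", True)] * 2
    + [("indirect", False)] * 8
)

headline = aggregate_rate(sample)            # dominated by the easy category
breakdown = detection_by_category(sample)    # exposes the weak category
```

With this invented mix, the headline comes out above 99% while the indirect category sits at 20%. That is why I'd want any published number broken down per taxonomy category, with the sample counts next to it.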

Second, the test set must include a “benign” corpus. What’s the false positive rate on normal, quirky, or edge-case user queries? A system that flags 10% of legitimate customer service prompts as malicious is useless, regardless of its detection score.
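A back-of-the-envelope sketch of why the benign corpus matters. The 10% false positive rate is the figure from the paragraph above and 99.8% is the kind of headline vendors quote; the attack prevalence (1 in 1,000 prompts) is purely my assumption, so plug in your own traffic numbers.

```python
def rates(tp, fn, fp, tn):
    """True positive rate and false positive rate from confusion counts."""
    tpr = tp / (tp + fn)   # share of attacks that were flagged
    fpr = fp / (fp + tn)   # share of benign prompts that were flagged
    return tpr, fpr


def precision_at_prevalence(tpr, fpr, prevalence):
    """Fraction of flagged prompts that are real attacks (Bayes' rule).

    prevalence: assumed share of all incoming prompts that are attacks.
    """
    flagged_attacks = tpr * prevalence
    flagged_benign = fpr * (1 - prevalence)
    return flagged_attacks / (flagged_attacks + flagged_benign)


# Invented counts: 1,000 attack samples and 1,000 benign samples.
tpr, fpr = rates(tp=998, fn=2, fp=100, tn=900)

# Assumed prevalence: 1 attack per 1,000 prompts in production traffic.
precision = precision_at_prevalence(tpr, fpr, prevalence=0.001)
```

Under those assumptions, roughly 1 in 100 flagged prompts is an actual attack; the rest are legitimate users getting blocked. A detection rate published without the false positive rate and the benign corpus it was measured on tells you nothing about that.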

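Third, the runtime conditions matter. Is the detection running pre-execution, or is it monitoring during agent operation? A pre-execution filter on the raw prompt catches the lazy attacks; indirect and multi-step attacks only show up once the agent is calling tools and reading retrieved content, and that dynamic environment is where the real fight happens. Any published number should say which mode was tested.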

So, my question is this: what would a minimally misleading evaluation framework actually require? I think it starts with transparent, community-defined test suites and the courage to publish failure cases, not just success rates. Otherwise, we’re just comparing vanity metrics.

Jack


Security theater is still theater.


   