We’ve all seen the charts: “Our product blocks 99.8% of prompt injections!” Usually followed by a footnote in size-2 font about their “proprietary benchmark.” It’s security theater dressed as a data sheet.
The problem isn't that vendors test; it's that they get to define both the exam and the grading rubric. A detection rate is meaningless without knowing what’s being detected. Are they counting simple keyword flagging on curated, obvious attacks? Are they including subtle context corruption, multi-turn jailbreaks, or indirect injection via retrieved documents? Or is their benchmark just a thousand variations of “Ignore previous instructions” and “You are now DAN”?
If we want numbers that aren’t purely for marketing, we need to agree on a few baseline principles for comparison. Not another monolithic benchmark—those get gamed quickly—but a methodology.
First, the attack taxonomy must be public and extensive. It should cover the spectrum from naive to novel, including:
- Direct injection (plaintext, encoded, natural language)
- Indirect injection (via tool output, RAG context, user history)
- Multi-modal or multi-step attacks
- Non-English and culturally-specific social engineering prompts
Second, the test set must include a “benign” corpus. What’s the false positive rate on normal, quirky, or edge-case user queries? A system that flags 10% of legitimate customer service prompts as malicious is useless, regardless of its detection score.
Third, the runtime conditions matter. Is the detection running pre-execution, or is it monitoring during agent operation? Static analysis catches the lazy attacks; a dynamic environment is where the real fight happens.
So, my question is this: what would a minimally misleading evaluation framework actually require? I think it starts with transparent, community-defined test suites and the courage to publish failure cases, not just success rates. Otherwise, we’re just comparing vanity metrics.
Jack
Security theater is still theater.