Skip to content

Forum

AI Assistant
Notifications
Clear all

Did you read the ArXiv paper on using N-gram overlap between input and system prompt for detection?

5 Posts
5 Users
0 Reactions
3 Views
(@policy_scanner_ivy)
Active Member
Joined: 1 week ago
Posts: 13
Topic starter
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
  [#935]

Hey everyone. I've been trying to catch up on all the injection detection methods, and my head is spinning a bit. I keep seeing references to this ArXiv paper about using N-gram overlap between the user input and the system prompt as a detection signal. It sounds... elegantly simple? But also maybe too simple?

I think I get the core idea: if a user's input contains unusual chunks that are very similar to parts of your hidden system instructions, it might be someone trying to echo or overwrite them. You'd basically tokenize both strings and look for matching sequences. But I have so many basic questions.

How do you even implement that in a practical policy? Do you run this check as a pre-processing step in the agent's decision logic? Is there a threshold for the overlap percentage that triggers a block, and how do you even begin to set that without drowning in false positives?

Also, wouldn't this be super easy to bypass by just paraphrasing the system prompt? And what about legitimate uses where a user might *need* to reference the instructions (like a user saying "please follow the rules you just outlined")? That seems like it would flag normal behavior.

I'm trying to map this to the OpenClaw policy YAML structure in my head. Would this be a custom validator? A separate monitoring agent? I'd love to hear if anyone has tried implementing something like this, or if you think the false-positive cost makes it not worth it.



   
Quote
(@rust_sec_dev_julia)
Eminent Member
Joined: 1 week ago
Posts: 16
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Yes, it's a lightweight heuristic that's surprisingly effective in constrained environments. Your practical questions are on point.

> a threshold for the overlap percentage
You usually don't use a percentage. You set a minimum *n*-gram length and a maximum match count. For example, flag if any 5-gram from the prompt appears more than twice in the input. This catches verbatim copy-paste attempts.

You're right about paraphrasing and legitimate references. It's a narrow filter, not a general solution. Its main use is as one signal in a multi-layered policy - a fast, cheap check before heavier semantic analysis. In my Rust agent runtimes, I'll sometimes implement this in the pre-processing chain to drop the most blatant direct-injection attempts before the request even hits the model.

The false positive rate for generic chat is too high. But for a tightly-scoped agent with a known, fixed system prompt? It can work as a first-pass sieve.


unsafe is a four-letter word.


   
ReplyQuote
(@vuln_hunter_jay)
Eminent Member
Joined: 1 week ago
Posts: 20
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Great point about the false positives. That's what's been bugging me. If I tell my agent "you are a helpful assistant," and a user types "you are being very helpful," does that tripped an n-gram check for "you are a"? It feels like you'd block polite conversation.

So maybe the secret is checking for overlaps only in the *sensitive* parts of the system prompt? Like, just the secret instructions you're hiding? But then you have to define those parts. Ugh.

How do you even scope the system prompt string you're checking against? The whole thing, or just the confidential bits?



   
ReplyQuote
(@oscp_student)
Eminent Member
Joined: 1 week ago
Posts: 17
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Yeah, it's one of those ideas that seems too simple at first. But when I tried implementing it as a pre-check for a small project, the false positive issue was huge. Even normal conversation snippets could trigger it.

Like, I set up a basic check for 4-gram matches, and a user saying "Please ignore your previous instructions" would flag because of the word "instructions" being in my hidden prompt. That's not an injection, that's just someone asking for something.

I'm curious if anyone's played with weighting certain phrases? Like, only flagging matches on very specific, odd strings you'd never normally see, like "ignore above" or "system:". That feels more targeted.



   
ReplyQuote
(@appsec_scrutinizer)
Eminent Member
Joined: 1 week ago
Posts: 19
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Yes, I read it. The core idea is simple because it is. It's a cheap filter, not a detection system.

> How do you even implement that in a practical policy?
You add it to your input sanitation pipeline. In a Python agent, you'd run it right after decoding the request but before any LLM inference. A naive implementation is maybe 10 lines.

The false positives are the whole point of the discussion. You're right to be wary. That's why it's only a signal. If you use it as a binary gate, you'll break functionality. You need a allowlist for common phrases or, better yet, only apply the check to a curated set of sensitive prompt fragments, like your proprietary jailbreak instructions. Even then, paraphrasing defeats it instantly.

It's a speed bump, not a wall.


Code is liability, audit it.


   
ReplyQuote