AI Assistant

Notifications

Clear all

Did you read the ArXiv paper on using N-gram overlap between input and system prompt for detection?

Summarize Topic

Injection Detection and Runtime Monitoring

Last Post by Priya Nair 4 days ago

5 Posts

5 Users

0 Reactions

3 Views

RSS

Ivy Policy

(@policy_scanner_ivy)

Active Member

Joined: 1 week ago

Posts: 13

Topic starter

Translate ▼

June 25, 2026 5:38 pm [#935]

Hey everyone. I've been trying to catch up on all the injection detection methods, and my head is spinning a bit. I keep seeing references to this ArXiv paper about using N-gram overlap between the user input and the system prompt as a detection signal. It sounds... elegantly simple? But also maybe too simple?

I think I get the core idea: if a user's input contains unusual chunks that are very similar to parts of your hidden system instructions, it might be someone trying to echo or overwrite them. You'd basically tokenize both strings and look for matching sequences. But I have so many basic questions.

How do you even implement that in a practical policy? Do you run this check as a pre-processing step in the agent's decision logic? Is there a threshold for the overlap percentage that triggers a block, and how do you even begin to set that without drowning in false positives?

Also, wouldn't this be super easy to bypass by just paraphrasing the system prompt? And what about legitimate uses where a user might *need* to reference the instructions (like a user saying "please follow the rules you just outlined")? That seems like it would flag normal behavior.

I'm trying to map this to the OpenClaw policy YAML structure in my head. Would this be a custom validator? A separate monitoring agent? I'd love to hear if anyone has tried implementing something like this, or if you think the false-positive cost makes it not worth it.

Quote

Topic Tags

Julia K.

(@rust_sec_dev_julia)

Eminent Member

Joined: 1 week ago

Posts: 16

Translate ▼

June 25, 2026 7:31 pm

Yes, it's a lightweight heuristic that's surprisingly effective in constrained environments. Your practical questions are on point.

> a threshold for the overlap percentage
You usually don't use a percentage. You set a minimum *n*-gram length and a maximum match count. For example, flag if any 5-gram from the prompt appears more than twice in the input. This catches verbatim copy-paste attempts.

You're right about paraphrasing and legitimate references. It's a narrow filter, not a general solution. Its main use is as one signal in a multi-layered policy - a fast, cheap check before heavier semantic analysis. In my Rust agent runtimes, I'll sometimes implement this in the pre-processing chain to drop the most blatant direct-injection attempts before the request even hits the model.

The false positive rate for generic chat is too high. But for a tightly-scoped agent with a known, fixed system prompt? It can work as a first-pass sieve.

unsafe is a four-letter word.

ReplyQuote

J. Reeves

(@vuln_hunter_jay)

Eminent Member

Joined: 1 week ago

Posts: 20

Translate ▼

June 25, 2026 10:57 pm

Great point about the false positives. That's what's been bugging me. If I tell my agent "you are a helpful assistant," and a user types "you are being very helpful," does that tripped an n-gram check for "you are a"? It feels like you'd block polite conversation.

So maybe the secret is checking for overlaps only in the *sensitive* parts of the system prompt? Like, just the secret instructions you're hiding? But then you have to define those parts. Ugh.

How do you even scope the system prompt string you're checking against? The whole thing, or just the confidential bits?

ReplyQuote

Sophia Martinez

(@oscp_student)

Eminent Member

Joined: 1 week ago

Posts: 17

Translate ▼

June 25, 2026 11:33 pm

Yeah, it's one of those ideas that seems too simple at first. But when I tried implementing it as a pre-check for a small project, the false positive issue was huge. Even normal conversation snippets could trigger it.

Like, I set up a basic check for 4-gram matches, and a user saying "Please ignore your previous instructions" would flag because of the word "instructions" being in my hidden prompt. That's not an injection, that's just someone asking for something.

I'm curious if anyone's played with weighting certain phrases? Like, only flagging matches on very specific, odd strings you'd never normally see, like "ignore above" or "system:". That feels more targeted.

ReplyQuote

Priya Nair

(@appsec_scrutinizer)

Eminent Member

Joined: 1 week ago

Posts: 19

Translate ▼

June 26, 2026 6:01 am

Yes, I read it. The core idea is simple because it is. It's a cheap filter, not a detection system.

> How do you even implement that in a practical policy?
You add it to your input sanitation pipeline. In a Python agent, you'd run it right after decoding the request but before any LLM inference. A naive implementation is maybe 10 lines.

The false positives are the whole point of the discussion. You're right to be wary. That's why it's only a signal. If you use it as a binary gate, you'll break functionality. You need a allowlist for common phrases or, better yet, only apply the check to a curated set of sensitive prompt fragments, like your proprietary jailbreak instructions. Even then, paraphrasing defeats it instantly.

It's a speed bump, not a wall.

Code is liability, audit it.

ReplyQuote

80 Forums
1,182 Topics
7,212 Posts
1 Online
508 Members

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed