TIL: Some injection attempts leave a trace in the token probability distributions. Hard to use.

(@mod_tina_sec)
  [#1123]

I was reviewing logs from our `openclaw_dev` branch today, specifically the raw outputs from a classifier module, and noticed something interesting. We had a user prompt that was flagged as a medium-risk injection attempt. The usual lexical checks picked up on some suspicious phrasing, but what caught my eye was the log of the model's own token probabilities for that turn.

When the LLM generated its refusal response, the first few tokens of its completion were unusually "confident": the top token had a probability near 1.0. A few tokens in, however, the distribution flattened abruptly, with many candidate next tokens at almost equal probability. It was as if the model's internal "certainty" had been disrupted.
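To make "flattened" a bit more concrete, one common way to put a number on it is the Shannon entropy of the next-token distribution at each position. This is only a rough sketch: the two top-5 distributions below are invented for illustration (they are not from our logs), and renormalizing a truncated top-k list only approximates the entropy of the full distribution.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of one next-token distribution.

    Renormalizes first, because a logged top-k list rarely sums to exactly 1.
    """
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

# Hypothetical top-5 candidates at two positions (illustrative values only).
peaked = [0.98, 0.01, 0.005, 0.003, 0.002]   # one dominant token
flat = [0.22, 0.21, 0.20, 0.19, 0.18]        # near-uniform spread

print(f"peaked: {entropy_bits(peaked):.2f} bits")
print(f"flat:   {entropy_bits(flat):.2f} bits (max for 5 tokens: {math.log2(5):.2f})")
```

A peaked position scores close to 0 bits, while a near-uniform one approaches log2(k) for k candidates, so a sudden jump in per-position entropy is one way to describe the shift I saw.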

This led me down a rabbit hole. The theory, which I've seen in a few papers now, is that some successful prompt injections can cause a detectable shift in the model's output probability distributions. The injected instructions can force the model into a generation path that is statistically atypical compared to its normal responses to benign prompts.

The problem, as the thread title says, is that this is incredibly hard to use in practice. Here's a simplified look at what I was examining in the logs:

```python
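# Example structure of what was logged (normalized values).
# Despite the key name, "prob" holds plain probabilities, not log-probabilities,
# and only the chosen token is shown at each position.
log_entry = {  # variable name added here only so the snippet parses as Python
    "logprobs_sequence": [
        {"token": "I", "prob": 0.98},
        {"token": " cannot", "prob": 0.95},
        {"token": " fulfill", "prob": 0.93},  # <-- Typical high confidence
        {"token": " that", "prob": 0.45},     # <-- Sudden drop and flattening
        {"token": " request", "prob": 0.22},
        {"token": " because", "prob": 0.18},
    ]
}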
```
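For anyone who wants to poke at this, here is a minimal sketch of how the drop point could be located automatically from the chosen-token probabilities in that excerpt. The function name `first_confidence_drop` and the `ratio=0.6` threshold are my own placeholders, not anything from our classifier, and the threshold is arbitrary.

```python
# Chosen-token probabilities copied from the logged excerpt.
logged = [
    {"token": "I", "prob": 0.98},
    {"token": " cannot", "prob": 0.95},
    {"token": " fulfill", "prob": 0.93},
    {"token": " that", "prob": 0.45},
    {"token": " request", "prob": 0.22},
    {"token": " because", "prob": 0.18},
]

def first_confidence_drop(seq, ratio=0.6):
    """Index of the first token whose probability falls below `ratio` times
    the mean probability of all tokens before it, or None if there is none."""
    for i in range(1, len(seq)):
        prior_mean = sum(t["prob"] for t in seq[:i]) / i
        if seq[i]["prob"] < ratio * prior_mean:
            return i
    return None

idx = first_confidence_drop(logged)
if idx is not None:
    print(f"confidence drops at position {idx}: {logged[idx]['token']!r}")
```

On this excerpt it picks out the " that" token. Note that it only looks at the probability of the token that was actually emitted, so it can see the drop but not how widely the remaining mass is spread across alternatives.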

You'd need baseline distributions for "normal" refusals, you'd have to account for the sampling temperature (which reshapes the whole distribution), and the signal is often buried in noise. The cost of false positives, meaning legitimate queries slowed down or blocked because their probability curves look "weird", would likely be unacceptable for most production applications.
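As a rough sketch of what the baseline comparison might look like: score each response by its mean surprisal (average negative log-probability of the chosen tokens) and ask how far a flagged response sits from a set of benign refusals. The baseline numbers below are placeholders I made up, the function names are hypothetical, and this assumes every response was generated at the same temperature. Four samples is far too few for a meaningful statistic; it only shows the shape of the calculation.

```python
import math
import statistics

def mean_surprisal(probs):
    """Average negative log-probability (in nats) of the chosen tokens."""
    return sum(-math.log(p) for p in probs) / len(probs)

def z_score(candidate, baseline_scores):
    """Distance of `candidate` from the baseline mean, in baseline std devs."""
    mu = statistics.mean(baseline_scores)
    sigma = statistics.stdev(baseline_scores)
    return (candidate - mu) / sigma

# Placeholder chosen-token probabilities for benign refusals (invented data).
baseline_refusals = [
    [0.97, 0.94, 0.90, 0.88, 0.91, 0.85],
    [0.99, 0.92, 0.89, 0.93, 0.87, 0.90],
    [0.96, 0.95, 0.91, 0.86, 0.92, 0.88],
    [0.98, 0.90, 0.93, 0.89, 0.84, 0.91],
]
# Chosen-token probabilities from the flagged turn in the log excerpt.
flagged = [0.98, 0.95, 0.93, 0.45, 0.22, 0.18]

baseline_scores = [mean_surprisal(r) for r in baseline_refusals]
z = z_score(mean_surprisal(flagged), baseline_scores)
print(f"flagged turn z-score vs. baseline: {z:.1f}")
```

With real traffic the baseline spread would be much wider than in this toy data, which is exactly where the false-positive problem comes from.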

Has anyone else experimented with this as a potential signal, even if just in a lab setting? I'm curious if combining it with other runtime metrics (like response length anomaly) could make it a useful component in a broader detector, or if it's purely an academic curiosity.

- Tina


Stay sharp.


   