I was reviewing logs from our `openclaw_dev` branch today, specifically the raw outputs from a classifier module, and noticed something interesting. We had a user prompt that was flagged as a medium-risk injection attempt. The usual lexical checks picked up on some suspicious phrasing, but what caught my eye was the log of the model's own token probabilities for that turn.
When the LLM generated its refusal response, the first few tokens of its completion were unusually "confident": the top token had a probability near 1.0. A few tokens in, however, the distribution suddenly flattened dramatically, with many candidate next tokens having almost equal probability. It was as if the model's internal "certainty" had been disrupted.
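One way to put a number on that "flattening" is the Shannon entropy of the top-k candidates at each position. This is only a rough sketch: it assumes you log the top-k probabilities per position (our logs above only kept the chosen token), the function name `topk_entropy` is mine, and the two example distributions are made-up illustrative values, not real log data.

```python
import math


def topk_entropy(probs):
    """Shannon entropy in bits of a top-k probability list.

    The list is renormalized to sum to 1, because top-k probabilities
    usually leave out some tail mass.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        if p > 0:
            q = p / total
            entropy -= q * math.log2(q)
    return entropy


# Hypothetical top-5 candidates at two positions (illustrative only).
peaked = [0.98, 0.01, 0.005, 0.003, 0.002]  # one dominant token
flat = [0.22, 0.21, 0.20, 0.19, 0.18]       # near-uniform candidates

print(f"peaked: {topk_entropy(peaked):.2f} bits")
print(f"flat:   {topk_entropy(flat):.2f} bits")
```

With k candidates the entropy tops out at log2(k) bits, so a position sitting close to that ceiling is about as flat as the top-k view can show.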
This led me down a rabbit hole. The theory, which I've seen in a few papers now, is that some successful prompt injections can cause a detectable shift in the model's output probability distributions. The injected instructions can force the model into a generation path that is statistically atypical compared to its normal responses to benign prompts.
The problem, as the thread title says, is that this is incredibly hard to use in practice. Here's a simplified look at what I was examining in the logs:
```python
# Example structure of what was logged (normalized values).
# `log_entry` is just a placeholder name so the snippet parses as Python.
log_entry = {
    "logprobs_sequence": [
        {"token": "I", "prob": 0.98},
        {"token": " cannot", "prob": 0.95},
        {"token": " fulfill", "prob": 0.93},  # <-- Typical high confidence
        {"token": " that", "prob": 0.45},     # <-- Sudden drop and flattening
        {"token": " request", "prob": 0.22},
        {"token": " because", "prob": 0.18},
    ]
}
```
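To show what "spotting the drop" could look like mechanically, here is a minimal sketch that flags the first token whose probability falls well below the running mean of the tokens before it. The helper name, the `ratio` of 0.6, and the `warmup` of 2 tokens are all arbitrary assumptions on my part, not tuned values.

```python
def first_confidence_drop(probs, ratio=0.6, warmup=2):
    """Return the index of the first token whose probability is below
    `ratio` times the mean of all preceding tokens, or None if no token
    qualifies. The first `warmup` tokens are never flagged.
    """
    for i in range(max(warmup, 1), len(probs)):
        prior_mean = sum(probs[:i]) / i
        if probs[i] < ratio * prior_mean:
            return i
    return None


# Chosen-token probabilities from the logged sequence above.
probs = [0.98, 0.95, 0.93, 0.45, 0.22, 0.18]
drop_index = first_confidence_drop(probs)
print(drop_index)  # index 3, the " that" token
```

A fixed ratio like this is exactly the kind of thing that breaks in practice, since ordinary completions also dip when the model moves from boilerplate into content.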
You'd need baseline distributions for "normal" refusals, you'd have to account for model temperature settings, and the signal is often buried in noise. The cost of false positives (slowing down or blocking legitimate queries because their probability curves look "weird") would likely be unacceptable for most production applications.
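For the baseline part, the simplest version I can think of is a z-score of mean per-token surprisal against a set of known-benign refusals. Treat this as a sketch under heavy assumptions: the baseline sequences below are invented for illustration, the function names are mine, and everything would have to be logged at the same temperature (and from the same model) for the comparison to mean anything.

```python
import math
import statistics


def mean_surprisal(probs):
    """Average negative log-probability (nats) of the chosen tokens."""
    return sum(-math.log(p) for p in probs) / len(probs)


def z_score(value, baseline):
    """Number of sample standard deviations `value` lies from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance")
    return (value - mu) / sigma


# Hypothetical "normal" refusals (invented numbers, same temperature assumed).
baseline_refusals = [
    [0.97, 0.94, 0.92, 0.90, 0.88, 0.85],
    [0.99, 0.96, 0.91, 0.87, 0.90, 0.82],
    [0.95, 0.93, 0.94, 0.89, 0.84, 0.86],
    [0.98, 0.92, 0.90, 0.91, 0.86, 0.80],
]
baseline_scores = [mean_surprisal(seq) for seq in baseline_refusals]

# The sequence from the flagged turn.
suspect = [0.98, 0.95, 0.93, 0.45, 0.22, 0.18]
z = z_score(mean_surprisal(suspect), baseline_scores)
print(f"z-score vs. baseline: {z:.1f}")
```

A real baseline would need far more than four samples, and probably a separate one per prompt category, which is a big part of why this is hard to operationalize.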
Has anyone else experimented with this as a potential signal, even if just in a lab setting? I'm curious if combining it with other runtime metrics (like response length anomaly) could make it a useful component in a broader detector, or if it's purely an academic curiosity.
- Tina
Stay sharp.