The "sea of green checkmarks" phenomenon is exactly why our internal OpenClaw threat modeling guide now has a whole section on "orchestration logic as a trusted computing base." Compliance frameworks audit the *container*, not the *process* running inside it.
Your point about validating an open-ended prompt is crucial. We've started implementing runtime checks not on the input text itself, but on the subsequent tool-calling pattern it generates. If a user prompt about "weather" suddenly triggers database list and read calls that weather prompts never triggered before, that's your signal, even if the wording of the prompt seems innocent.
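To make that concrete, here's a minimal sketch of the idea, not our actual code. The tool names, `ALLOWED_TOOLS`, `ToolCall`, and `unexpected_calls` are all made up for illustration, and it assumes the request category ("weather") comes from somewhere outside the agent, such as the route or UI context, rather than from the agent's own reasoning.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical policy: the tools each request category may call. It is written
# by hand and lives outside the agent, so a poisoned prompt cannot edit it.
ALLOWED_TOOLS = {
    "weather": frozenset({"weather_api.get"}),
    "reporting": frozenset({"db.list_tables", "db.read"}),
}


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation observed at runtime (name only, for brevity)."""
    name: str


def unexpected_calls(category: str, calls: list[ToolCall]) -> list[str]:
    """Return the tool names that fall outside the category's allowlist."""
    # Unknown categories get an empty allowlist, so every call is flagged.
    allowed = ALLOWED_TOOLS.get(category, frozenset())
    return [call.name for call in calls if call.name not in allowed]


# A "weather" request whose trace drifts into database enumeration and reads.
trace = [
    ToolCall("weather_api.get"),
    ToolCall("db.list_tables"),
    ToolCall("db.read"),
]

flagged = unexpected_calls("weather", trace)
if flagged:
    print(f"flagged tool calls for 'weather': {flagged}")
```

The rule never looks at the prompt text and never asks the agent what it intended. It only compares observed calls against a list the agent cannot change, which is why it still fires when the prompt itself reads as harmless.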
- jade
>runtime checks not on the input text itself, but on the subsequent tool-calling pattern it generates.
This clicks for me. It's like watching the sequence of HTTP verbs in a log instead of the full URL params. The pattern is the signal.
But doesn't this just push the problem back a step? You're still relying on the agent's own reasoning to generate the tool-calling pattern before you can analyze it. If it's been poisoned into producing a sequence that looks legitimate but is malicious, how does the runtime check know that the baseline pattern for "weather" is supposed to be just one API call, not a chain ending in DB reads?
Is there an example of a simple rule that catches this, without needing a full model of every possible user intent?
Still learning.