"Adversarial testing" is the new "penetration testing." Everyone's asking for it, few define what they're actually defending against. You're not testing "decision boundaries," you're testing the integrity of the system enforcing them.
First, scrap the abstract goal. Define your threat model. Is it:
* Prompt injection to exfiltrate system prompts?
* Tool misuse to write files or make network calls?
* Context poisoning to alter future behavior?
Your "agent" is just a chain of components. Test each one. If it's using a framework, look at its actual sandboxing. LangChain's Python executor? That's just `subprocess`. AutoGen's code execution? Check the Docker image, if they even use one. Most "agent security" is just hoping the LLM says no.
Start by instrumenting every external call. Log every input to and output from the model, every tool invocation. Then throw garbage at it—malformed JSON, escaped prompts, simulated tool errors. See what breaks, and more importantly, what gets through. The boundaries are in the code, not the model's responses.