Just read through the NCCoE's latest "Mitigating AI and ML Security Threats" document. While I appreciate the effort, the guidance on securing AI agents feels like a high-level checklist with zero operational teeth. It's heavy on "you should monitor" and light on "here's what a malicious action actually looks like in your logs."
My main gripe: they talk about monitoring for prompt injection and anomalous agent behavior, but don't bridge the gap to concrete, deployable detection strategies. For those of us running infrastructure, that's the entire problem.
For example, they suggest monitoring for "unusual resource access patterns." In a traditional SIEM, that's IAM logs, CloudTrail, and maybe some heuristics. For an agent, the "resource" is often an API call or a tool execution. The signal is buried in the application logs, not the infrastructure layer.
Here's what's missing and what we should be discussing:
* **Structured Audit Trails:** The agent framework MUST emit structured logs for every action. Not just "the agent called a function," but:
  * User session/request ID
  * The exact tool/function called
  * The full parameters passed (sanitized if sensitive)
  * The reasoning chain or prompt snippet that triggered it
  * The result/return
```
{
  "timestamp": "2024-05-15T14:23:01Z",
  "session_id": "req_abc123",
  "agent_action": "execute_tool",
  "tool_name": "send_email",
  "parameters": {"to": "external@example.com", "subject": "..."},
  "prompt_context_hash": "sha256_abc...",
  "result": "success"
}
```
* **Baseline Behavior:** Detection requires knowing "normal." That means profiling the allowed tools, typical parameter ranges (e.g., the `database_query` tool should only hit certain datasource IDs), and the expected sequence patterns during normal operations.
* **Canary Tokens Aren't Magic:** The document mentions canary tokens in system prompts. Fine, but that only catches lazy, non-targeted injections. A sophisticated injection will strip or ignore them. We need to monitor for the *effect* of an injection, not just hope the injection contains a magic string.
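To make the structured audit trail point concrete, here's a rough sketch of a logging helper that produces records shaped like the JSON example above. Everything here is an assumption for illustration: `build_audit_record`, `emit`, and the `SENSITIVE_KEYS` set are hypothetical names, not part of any agent framework or of the NCCoE guidance. The hash stands in for the prompt context so the log can be correlated with it without storing the raw prompt.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical list of parameter names to redact; tune per deployment.
SENSITIVE_KEYS = {"password", "api_key", "token"}


def build_audit_record(session_id, tool_name, parameters, prompt_context, result):
    """Build one structured audit record for a single tool execution."""
    # Redact sensitive values but keep the key, so the call shape stays visible.
    sanitized = {
        key: "[REDACTED]" if key in SENSITIVE_KEYS else value
        for key, value in parameters.items()
    }
    digest = hashlib.sha256(prompt_context.encode("utf-8")).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session_id": session_id,
        "agent_action": "execute_tool",
        "tool_name": tool_name,
        "parameters": sanitized,
        "prompt_context_hash": "sha256_" + digest,
        "result": result,
    }


def emit(record):
    """Write the record as one JSON line; stdout stands in for a log shipper."""
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

In practice you'd call this from whatever wrapper dispatches tool calls, so nothing executes without leaving a record.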
The false-positive cost is going to be brutal if we rely on naive keyword matching on LLM output. We need to shift the detection layer to the **agent's actions on the wire**, not its internal reasoning. If the agent never executes `delete_user` or `export_data` during normal operation, that's a high-fidelity signal, regardless of what the LLM said it was "thinking."
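As a minimal sketch of what action-level detection could look like, assume audit records with `tool_name` and `parameters` fields as in the JSON example earlier. The `ActionBaseline` class and the `tracked_params` mapping are hypothetical names of my own. It learns which tools and which values of selected parameters appear during known-good operation, then flags anything outside that set. It deliberately ignores the LLM's text output, and it doesn't cover sequence patterns.

```python
from collections import defaultdict


class ActionBaseline:
    """Profile of tools and tracked parameter values seen in normal operation."""

    def __init__(self, tracked_params):
        # tracked_params: tool name -> set of parameter names worth profiling,
        # e.g. {"database_query": {"datasource_id"}}. Values must be hashable.
        self.tracked_params = tracked_params
        self.seen_tools = set()
        self.seen_values = defaultdict(lambda: defaultdict(set))

    def learn(self, record):
        """Add one known-good audit record to the baseline."""
        tool = record["tool_name"]
        self.seen_tools.add(tool)
        for name in self.tracked_params.get(tool, ()):
            if name in record["parameters"]:
                self.seen_values[tool][name].add(record["parameters"][name])

    def check(self, record):
        """Return a list of findings; an empty list means nothing unusual."""
        tool = record["tool_name"]
        if tool not in self.seen_tools:
            return [f"tool not in baseline: {tool}"]
        findings = []
        for name in self.tracked_params.get(tool, ()):
            # A missing tracked parameter shows up as None and is also flagged.
            value = record["parameters"].get(name)
            if value not in self.seen_values[tool][name]:
                findings.append(f"unseen {name} for {tool}: {value!r}")
        return findings
```

Feed it a few weeks of clean audit records through `learn`, then run `check` on live records and alert on any non-empty result. A real deployment would need a way to age out and re-approve baseline entries, which this sketch skips.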
So, is the NCCoE guidance too vague? From an implementer's perspective, absolutely. It gives C-levels a list of concerns but doesn't help the engineer building the monitoring. The real work is in instrumenting the agent framework itself and defining the allowed behavior matrix.
Log everything, alert on anomalies.