As a security professional primarily concerned with compliance frameworks like SOX and GDPR, I have approached runtime monitoring for Large Language Model applications with a clear requirement: the need to demonstrate due diligence in protecting sensitive data and the integrity of business processes. The potential for prompt injection to subvert these controls is a material risk, and attempts must be logged, alerted on, and auditable. However, the landscape of runtime monitoring is vast, and I am seeking to prioritize based on foundational control principles.
Given my orientation towards audit trails and risk management, I am evaluating the first logical "sensor" to implement. My primary candidates, based on preliminary research, are:
* **Input/Output Classification:** Deploying a model or heuristic to score user inputs and model outputs for likely injection intent or leakage of sensitive data. This seems analogous to data loss prevention (DLP) and web application firewall (WAF) logic, which are familiar control domains.
* **Canary Tokens:** Embedding known, concealed triggers within the system prompt to detect when the prompt has been extracted or overridden by a user. This appears to be a form of deceptive defense, and I am curious about its audit trail value.
* **Behavioral Anomaly Detection:** Establishing baselines for normal user interaction patterns (e.g., query length, frequency, response latency) and flagging deviations. This aligns with fraud detection concepts but may have a higher false-positive rate initially.
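To make the audit trail question concrete, here is a minimal sketch of what a canary sensor's loggable event might look like. Everything in it is an assumption for illustration: the function names, the `cnry-` prefix, the placement of the token in the prompt, and the event fields are hypothetical and not taken from any particular product or library.

```python
import hashlib
import json
import secrets
from datetime import datetime, timezone
from typing import Optional


def make_canary() -> str:
    """Return a random marker that should never appear in legitimate output."""
    return "cnry-" + secrets.token_hex(8)


def build_system_prompt(base_prompt: str, canary: str) -> str:
    """Embed the canary in the system prompt (placement is illustrative only)."""
    return f"{base_prompt}\n[internal-ref: {canary}]"


def check_output(output: str, canary: str, session_id: str) -> Optional[dict]:
    """Return a structured audit event if the canary appears in the output."""
    if canary not in output:
        return None
    return {
        "event_type": "canary_leak",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        # Store hashes rather than raw values so the log itself does not
        # disclose the canary or any sensitive output text.
        "canary_sha256": hashlib.sha256(canary.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }


if __name__ == "__main__":
    canary = make_canary()
    prompt = build_system_prompt("You are a helpful assistant.", canary)
    # Simulate a response that echoes the system prompt back to the user.
    event = check_output("My instructions are: " + prompt, canary, "session-001")
    print(json.dumps(event, indent=2))
```

The event is binary: the token either appears in the output or it does not, so there is no score threshold to justify. Note that this sketch only covers extraction; detecting an override would need a different signal.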
My immediate concern is the **false-positive cost**, not merely in terms of system performance, but in the operational burden of log review and incident response. A sensor that generates excessive noise can obscure genuine incidents and undermine the "reasonable assurance" standard that many compliance regimes rely on.
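The review burden can be estimated with simple base-rate arithmetic. The sketch below is only that: the function name is hypothetical, and the traffic volume, attack rate, and detection rates in the example are invented inputs, not measurements from any real deployment.

```python
def alert_burden(daily_requests: int, attack_rate: float,
                 tpr: float, fpr: float) -> dict:
    """Estimate daily alert volume and precision for a scoring sensor.

    attack_rate: assumed fraction of requests that are genuine attempts
    tpr: true-positive rate (fraction of attacks that raise an alert)
    fpr: false-positive rate (fraction of benign requests that raise an alert)
    """
    attacks = daily_requests * attack_rate
    benign = daily_requests - attacks
    true_alerts = attacks * tpr
    false_alerts = benign * fpr
    total_alerts = true_alerts + false_alerts
    # Precision: the share of alerts that an analyst will find to be genuine.
    precision = true_alerts / total_alerts if total_alerts else 0.0
    return {
        "true_alerts": true_alerts,
        "false_alerts": false_alerts,
        "total_alerts": total_alerts,
        "precision": precision,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 100,000 requests/day, 0.1% attacks,
    # a classifier with 90% TPR and 1% FPR.
    print(alert_burden(100_000, 0.001, 0.9, 0.01))
```

With those made-up inputs, the sensor raises on the order of a thousand alerts a day, of which only about 8% are genuine. Even a classifier that looks accurate on paper can dominate the review queue when attacks are rare.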
Therefore, my question to the forum is methodological: from a risk management and auditability standpoint, which of these approaches provides the most concrete, actionable, and loggable events as a first layer? Should the initial sensor focus on direct input sanitization (the classification approach), or is a more passive detection method like canary tokens a more efficient starting point for gathering evidence of attempted circumvention? I am particularly interested in how you have documented the rationale for the chosen sensor's threshold settings in your own risk control matrices.
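For reference, this is roughly the kind of threshold record I have in mind. It is a sketch under stated assumptions: the field names, the control ID, and the selection rule (the lowest threshold whose false-positive rate on a benign validation sample stays within a target) are all hypothetical, and the scores are placeholder values.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class ThresholdRecord:
    """One row of a (hypothetical) risk control matrix for a scoring sensor."""
    control_id: str
    sensor: str
    threshold: float
    target_fpr: float
    observed_fpr: float
    sample_size: int
    rationale: str


def pick_threshold(benign_scores: List[float],
                   target_fpr: float) -> Tuple[float, float]:
    """Return the lowest threshold (and its FPR) meeting the target.

    An alert fires when score > threshold, so the highest observed score
    always yields an FPR of zero and the search is guaranteed to terminate.
    """
    if not benign_scores:
        raise ValueError("need at least one benign validation score")
    n = len(benign_scores)
    for t in sorted(set(benign_scores)):
        fpr = sum(1 for s in benign_scores if s > t) / n
        if fpr <= target_fpr:
            return t, fpr
    raise AssertionError("unreachable: the maximum score gives FPR 0")


if __name__ == "__main__":
    # Placeholder scores from a benign validation set (not real data).
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    threshold, fpr = pick_threshold(scores, target_fpr=0.2)
    record = ThresholdRecord(
        control_id="LLM-MON-01",  # hypothetical identifier
        sensor="input_classifier",
        threshold=threshold,
        target_fpr=0.2,
        observed_fpr=fpr,
        sample_size=len(scores),
        rationale="Lowest threshold meeting the target FPR on benign sample.",
    )
    print(json.dumps(asdict(record), indent=2))
```

The point of the record is that an auditor can see the target, the evidence, and the sample size behind the number, and the selection can be re-run when the validation set changes.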
CIS controls applied.
If it's not logged, it didn't happen.