I've been evaluating runtime monitoring for LLM agent deployments, specifically focusing on data exfiltration vectors. While much of the discussion here centers on prompt injection detection, a simpler, more deterministic first line of defense is monitoring the agent's *output* for known dangerous patterns before they leave your controlled environment. This is orthogonal to—and should be combined with—input classification.
The core idea is to implement a post-processing filter on every agent response, scanning for high-confidence indicators of compromise (IoC) like raw IP addresses (especially internal RFC 1918) and API key patterns. The false-positive rate for this is manageable compared to injection detection, as you're only flagging specific, high-value data structures that should almost never be in a legitimate conversational response.
I've implemented a lightweight Python script using the `re` module. It's designed to be integrated into the agent's output pipeline, either as a decorator, a middleware step in your orchestration framework, or a simple function call. The key is that it must run *after* the LLM generates the response but *before* that response is forwarded to any external user or downstream system.
```python
#!/usr/bin/env python3
"""
Simple runtime monitor for LLM agent output.
Alerts on IP addresses and common API key patterns.
"""
import logging
import re
import sys
from typing import Optional, Tuple

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)


class OutputSanityMonitor:
    def __init__(self, alert_on_public_ip: bool = False):
        """
        :param alert_on_public_ip: If True, also alert on non-RFC1918 IPs.
        """
        # RFC 1918 IPv4 patterns. Dots are escaped and word boundaries keep
        # "110.0.0.1" from matching as "10.0.0.1". Octets are not range-checked.
        self.rfc1918_patterns = [
            r'\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',
            r'\b172\.(?:1[6-9]|2[0-9]|3[0-1])\.\d{1,3}\.\d{1,3}\b',
            r'\b192\.168\.\d{1,3}\.\d{1,3}\b',
        ]
        # Any dotted quad that does not start with an RFC 1918 prefix.
        self.public_ip_pattern = (
            r'\b(?!10\.|172\.(?:1[6-9]|2[0-9]|3[0-1])\.|192\.168\.)'
            r'(?:\d{1,3}\.){3}\d{1,3}\b'
        ) if alert_on_public_ip else None
        # Common API key patterns (simplified; extend based on your services)
        self.api_key_patterns = [
            r'sk_live_[0-9a-zA-Z]{24}',        # Stripe live key pattern
            r'sk_test_[0-9a-zA-Z]{24}',        # Stripe test key pattern
            r'AKIA[0-9A-Z]{16}',               # AWS Access Key ID
            r'eyJhbGciOiJ[^\s]{20,}',          # JWT-like (very simplistic)
            r'ghp_[0-9a-zA-Z]{36}',            # GitHub Personal Access Token (classic)
            r'github_pat_[0-9a-zA-Z_]{22,}',   # GitHub Fine-grained PAT
        ]
        self.compiled_patterns = []
        for pat in self.rfc1918_patterns + self.api_key_patterns:
            # IGNORECASE also matches lowercase variants (e.g. 'akia...');
            # drop it if that is too noisy.
            self.compiled_patterns.append(re.compile(pat, re.IGNORECASE))
        if self.public_ip_pattern:
            self.compiled_patterns.append(re.compile(self.public_ip_pattern))

    def scan(self, text: str) -> Optional[Tuple[str, str]]:
        """
        Scans the provided text.
        Returns a tuple (matched_string, pattern_name) for the first hit, else None.
        """
        for compiled_re in self.compiled_patterns:
            match = compiled_re.search(text)
            if match:
                # Identify which pattern matched for logging
                matched = match.group(0)
                if any(compiled_re.pattern == p for p in self.rfc1918_patterns):
                    return matched, "RFC1918_IP"
                elif self.public_ip_pattern and compiled_re.pattern == self.public_ip_pattern:
                    return matched, "PUBLIC_IP"
                else:
                    return matched, "API_KEY_PATTERN"
        return None


def main():
    monitor = OutputSanityMonitor(alert_on_public_ip=True)
    # Example: read from stdin for pipeline integration
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        result = monitor.scan(line)
        if result:
            matched, pattern_type = result
            # In production, integrate with your alerting system (PagerDuty, Slack, etc.)
            logger.error(f"ALERT: Pattern '{pattern_type}' detected. Matched: '{matched[:50]}...' in output.")
            # Decide on action: block, redact, or just alert.
            # For this example, we exit with a non-zero code to signal a problem.
            sys.exit(1)


if __name__ == "__main__":
    main()
```
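For the decorator route, a minimal sketch might look like the following. `guard_output`, `BlockedOutputError`, `demo_scan`, and `fake_agent` are hypothetical names for illustration only. The scanner passed in is assumed to follow the same contract as `OutputSanityMonitor.scan`: it returns a `(matched, pattern_type)` tuple or `None`.

```python
import functools
import re
from typing import Callable, Optional, Tuple

ScanFn = Callable[[str], Optional[Tuple[str, str]]]


class BlockedOutputError(RuntimeError):
    """Raised when a response is withheld because the scanner flagged it."""


def guard_output(scan: ScanFn):
    """Decorator factory: scan the wrapped function's string result before returning it."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> str:
            response = func(*args, **kwargs)
            hit = scan(response)
            if hit is not None:
                _, pattern_type = hit
                # Leave the matched text out of the error so it is not re-leaked.
                raise BlockedOutputError(f"output blocked: {pattern_type}")
            return response
        return wrapper
    return decorator


# Stand-in scanner for the example; in practice pass OutputSanityMonitor().scan.
_AWS_KEY = re.compile(r'AKIA[0-9A-Z]{16}')


def demo_scan(text: str) -> Optional[Tuple[str, str]]:
    m = _AWS_KEY.search(text)
    return (m.group(0), "API_KEY_PATTERN") if m else None


@guard_output(demo_scan)
def fake_agent(prompt: str) -> str:
    # Hypothetical agent call; echoes the prompt so the demo is deterministic.
    return f"You said: {prompt}"
```

Raising an exception is just one choice here. The caller decides whether that turns into a generic refusal, a retry, or an alert-only path.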
Integration notes and considerations:
* **Placement:** This must be in the trusted compute boundary. If your agent runs in an untrusted environment, this check is useless—the compromised agent could bypass it.
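* **Performance:** These patterns are simple enough that each scan is effectively linear in the response length, and the total cost grows with the number of patterns because each one is searched separately. For very high throughput, consider an Aho-Corasick prefilter on the literal prefixes (`sk_live_`, `AKIA`, `ghp_`) so the regexes only run on responses that contain one.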
* **False Positives:**
  * IP addresses in code examples or instructional text. You may need to add a context-awareness layer (e.g., ignore if the entire response is marked as a code block).
  * The API key patterns are simplistic. Tune them for your specific key formats and consider adding allow-lists for test keys in staging environments.
* **Evasion:** Trivial (e.g., `192 . 168. 0.1`). This is not a security boundary but a *detection* mechanism. It raises an alert, not a guarantee. For stronger guarantees, you must combine this with kernel-level syscall filtering (e.g., `seccomp-bpf` to block network egress) and mandatory access control.
* **Extension:** This framework can be easily extended to scan for other patterns: credit card numbers (PCI-DSS), specific command strings indicative of shell attempts, or even embeddings-based anomaly detection if you incorporate a simple ML model.
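The whitespace evasion from the list above (`192 . 168. 0.1`) can at least be narrowed with a normalization pass before scanning. This is a sketch, not a fix for evasion in general: `normalize_dotted_numbers` is a hypothetical helper, and it only handles whitespace between digits and dots.

```python
import re

# Whitespace around a dot that sits between two digits, e.g. "192 . 168".
_SPACED_DOT = re.compile(r'(?<=\d)\s*\.\s*(?=\d)')
_RFC1918_192 = re.compile(r'\b192\.168\.\d{1,3}\.\d{1,3}\b')


def normalize_dotted_numbers(text: str) -> str:
    """Collapse whitespace around dots between digits; leave other text alone."""
    return _SPACED_DOT.sub('.', text)


evasive = "connect to 192 . 168. 0.1 now"
normalized = normalize_dotted_numbers(evasive)

# The raw text slips past the pattern; the normalized text does not.
raw_hit = _RFC1918_192.search(evasive)
normalized_hit = _RFC1918_192.search(normalized)
```

Normalizing can itself create false positives (a sentence ending in a number followed by a numbered item, for instance), so it is probably safer to scan both the raw and the normalized text and tag which one fired.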
The cost of a false positive here is an alert that requires investigation. The cost of a false negative is a potential credential leak. In my deployment, this runs as a `seccomp`-sandboxed microservice that all agent traffic passes through, logging to a dedicated security channel.
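If dropping a whole response on a false positive is too costly, redaction is a middle ground between blocking and alert-only. A rough sketch, assuming a hypothetical `REDACTION_RULES` list of label and pattern pairs (only two patterns are shown; reuse whatever your monitor already compiles):

```python
import re
from typing import List, Tuple

# Hypothetical label/pattern pairs for illustration.
REDACTION_RULES = [
    ("RFC1918_IP", re.compile(r'\b192\.168\.\d{1,3}\.\d{1,3}\b')),
    ("API_KEY_PATTERN", re.compile(r'AKIA[0-9A-Z]{16}')),
]


def redact(text: str) -> Tuple[str, List[str]]:
    """Replace every match with a placeholder; return the text and the labels that fired."""
    fired = []
    for label, pattern in REDACTION_RULES:
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            fired.append(label)
    return text, fired


clean, fired = redact("host 192.168.0.1 uses AKIAABCDEFGHIJKLMNOP")
```

The list of fired labels still goes to the alerting path, so redaction changes what the user sees without hiding the event from whoever investigates.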
I'm interested in how others are implementing runtime output filtering, particularly if you've moved beyond regex to more semantic or behavioral checks.
-vp
Honest question: what are you planning to *do* with the alert? Log it to a text file where someone will grep for "CRITICAL" once a week?
Scanning for patterns is the trivial part. The real work is building an audit trail that can tell you:
* Which user session spawned that agent?
* What was the preceding conversation context?
* Was this a one-off or part of a pattern across multiple turns?
Without that structured context attached to the alert, you're just creating noise. You'll end up with a thousand lines of "192.168.1.1 detected" and no way to know if it was a pentester's query, a developer asking about a local API, or an actual exfiltration attempt. If you're going to wire this up, at least emit JSON logs so you can reconstruct the incident later.
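Something along these lines is the minimum I'd expect. The field names (`session_id`, `turn_index`, `transcript_ref`) and the `build_alert`/`emit_alert` helpers are made up for illustration, not a standard schema, and `transcript_ref` assumes you already store conversations somewhere you can point at.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent.output_monitor")


def build_alert(session_id: str, turn_index: int, pattern_type: str,
                matched: str, transcript_ref: str) -> dict:
    """Assemble a structured alert record. All field names are illustrative."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": "output_pattern_detected",
        "session_id": session_id,
        "turn_index": turn_index,
        "pattern_type": pattern_type,
        # Hash the match so repeated hits can be grouped without
        # writing the raw value into the log.
        "match_sha256": hashlib.sha256(matched.encode("utf-8")).hexdigest(),
        # Pointer to the stored conversation, not the conversation itself.
        "transcript_ref": transcript_ref,
    }


def emit_alert(record: dict) -> str:
    """Serialize the record as one JSON line and send it to the logger."""
    line = json.dumps(record, sort_keys=True)
    logger.error(line)
    return line


record = build_alert("sess-123", 4, "RFC1918_IP", "192.168.1.1", "transcripts/sess-123")
line = emit_alert(record)
```

With `session_id` and `match_sha256` in every record you can answer the "one-off or pattern" question with a group-by instead of a grep.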
log with schema