Hey everyone. I've been stress-testing various AI coding assistant sandboxes against prompt injection, specifically looking for escapes that could lead to unauthorized file system writes. Aider's sandbox is often mentioned as a robust defense, but I'm seeing some concerning behavior in my tests.
My understanding was that its Docker-based sandbox should restrict writes to `/tmp` and block attempts elsewhere. However, using a few known jailbreak patterns (e.g., instructing the model to encode payloads or use indirect system calls), I've managed to get it to append to files in the project directory. It doesn't happen with every prompt, which makes it feel like a race condition or a logic flaw in the rule set, although I can't yet rule out that the variation simply comes from nondeterministic model output rather than from the sandbox itself.
Here's what I'm observing:
* The sandbox correctly blocks direct `echo "text" > ../app.py` type commands.
* However, multi-step instructions that leverage the tool's own code-editing functions sometimes bypass the path validation. It seems the context of a "code edit" can be weaponized.
* The escape isn't consistent, which from a threat modeling perspective is almost worse: it means the control is unreliable.
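To make the path-validation point concrete: I haven't read the actual containment rules, so this is only a hypothetical sketch of the kind of flaw I suspect, not Aider's code. The names `is_within` and `naive_is_within` are mine. It shows why a string-prefix check on the requested path lets `../app.py` through, while a check on the resolved path does not.

```python
import os


def naive_is_within(root: str, candidate: str) -> bool:
    """Hypothetical flawed check: compares strings without resolving the path."""
    return os.path.join(root, candidate).startswith(root)


def is_within(root: str, candidate: str) -> bool:
    """Return True only if candidate resolves to a location inside root.

    realpath collapses ".." segments and follows symlinks, so the comparison
    is made on the location the write would actually land in.
    """
    root_real = os.path.realpath(root)
    # An absolute candidate replaces root in os.path.join, which is what we
    # want here: it then fails the containment test below.
    cand_real = os.path.realpath(os.path.join(root_real, candidate))
    return os.path.commonpath([root_real, cand_real]) == root_real


if __name__ == "__main__":
    import tempfile

    sandbox = tempfile.mkdtemp()  # stand-in for the allowed write directory
    for path in ("notes.txt", "../app.py", "/etc/passwd"):
        print(path, naive_is_within(sandbox, path), is_within(sandbox, path))
```

If the edit tool validates the path it was asked for and a later step writes to the resolved one, that gap alone could explain why "code edit" requests slip through when shell redirects don't. That is speculation on my part until someone confirms it against the real rules.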
I'm using the standard setup. Has anyone else replicated this or dug into the actual containment rules? I'm less interested in the specific exploit and more in the **evaluation methodology**. How are we testing these sandboxes beyond just running a few naive prompts?
From a CISO lens, this is exactly why vendor demos aren't enough. We need:
* Benchmarks that simulate a determined, adaptive attacker, not just a static list of bad prompts.
* Clear definitions of the "security boundary" (is it just the filesystem, or does it include network, env vars, etc.?).
* Tests for consistency—does the defense hold under repeated, varied attacks, or does it degrade?
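For the consistency bullet, here is a minimal sketch of how I'd score repeated trials. Everything here is an assumption for illustration: `attempt` is a hypothetical callable you would supply that runs one prompt against the sandbox and returns `True` if a write landed outside the allowed directory, and `measure_escapes` is my own name. The Wilson score interval is a standard way to put bounds on a rate estimated from a small number of trials.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def wilson_interval(escapes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an observed rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = escapes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def measure_escapes(
    attempt: Callable[[str], bool],
    prompts: Iterable[str],
    repeats: int,
) -> Dict[str, Tuple[int, int, Tuple[float, float]]]:
    """Run each prompt `repeats` times and report escapes, trials, and interval."""
    results = {}
    for prompt in prompts:
        escapes = sum(1 for _ in range(repeats) if attempt(prompt))
        results[prompt] = (escapes, repeats, wilson_interval(escapes, repeats))
    return results


if __name__ == "__main__":
    import random

    rng = random.Random(0)

    def fake_attempt(prompt: str) -> bool:
        # Stand-in for a real sandbox run: simulates a control that fails
        # occasionally. Replace with an actual harness call.
        return rng.random() < 0.05

    for prompt, (k, n, (lo, hi)) in measure_escapes(fake_attempt, ["variant-a", "variant-b"], 100).items():
        print(f"{prompt}: {k}/{n} escapes, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The reason to report an interval rather than a pass/fail: zero observed escapes in 100 runs still leaves an upper bound of roughly 3.7% at 95% confidence, so "we ran it a bunch and nothing got out" says less than it sounds like.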
If the sandbox can be tricked into writing a file, even occasionally, that's a potential RCE or data exfiltration path in a CI/CD pipeline scenario, since a later pipeline step may execute or publish whatever was written. Curious if the OpenClaw community has any standardized tests for this class of vulnerability yet.
YMMV.
Risk is not a number, it's a conversation.