We're seeing a lot of teams bolt a local LLM with tool use (file read/write, shell execution) into their IDEs and call it a "secure coding assistant" because it's not phoning home to an API. This is a catastrophic misunderstanding of the threat model. The model itself is the untrusted, potentially malicious actor. Your threat model must start from the premise that the AI will try to escape its constraints, whether through prompt injection, by exploiting tool implementation bugs, or by abusing granted permissions to escalate privileges.
Let's break down a realistic deployment pattern and model it. Assumptions for this template:
* The core runtime is OpenClaw.
* The LLM (e.g., Llama, CodeLlama) runs in a separate, unprivileged container or process.
* Tools are implemented as plugins the LLM can invoke via a structured JSON interface.
* Primary assets: host filesystem, CI/CD secrets, user data, network access.
**Primary Attack Surfaces & STRIDE Analysis:**
* **Spoofing:** The AI could craft input to a tool that mimics legitimate system input (e.g., a filename that is actually a path traversal). Threat is to the tool handlers, not user identity.
* **Tampering:** Direct write access to project files, configuration files (`.git/config`, `.bashrc`), or tool configuration itself. This is the most immediate risk.
* **Repudiation:** Without rigorous, append-only auditing of every tool call and decision, there is no way to prove whether a harmful command came from the model or from the user. Logs must be tamper-evident and stored outside the sandbox.
* **Information Disclosure:** File read tools, environment variable access, or network probes can exfiltrate secrets, SSH keys, or API tokens from the developer environment.
* **Denial of Service:** Fork bombs via shell tool, consuming all GPU memory, filling disk via write loops. Must be addressed via resource limits.
* **Elevation of Privilege:** The big one. Escaping the container or process sandbox to gain root or host access. This can happen via a kernel bug, but more likely via misconfigured capabilities, bind mounts, or syscall access.
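To make the Repudiation bullet concrete, here is a minimal sketch of a hash-chained audit log, where each entry's hash covers the previous entry's hash. The record fields (`seq`, `tool`, `args`, `prev`) and the `GENESIS` value are assumptions for illustration, not anything OpenClaw defines. This makes the log tamper-evident, not tamper-proof: truncating the tail is only detectable if the latest hash is also stored somewhere the sandbox can't write.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(body):
    # Canonical serialization so the same record always hashes the same way.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, tool, args):
    """Append a tool call whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "tool": tool, "args": args, "prev": prev_hash}
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify_chain(log):
    """Return False if any entry was edited, reordered, or cut from the middle."""
    prev_hash = GENESIS
    for seq, entry in enumerate(log):
        body = {key: entry[key] for key in ("seq", "tool", "args", "prev")}
        if entry["seq"] != seq or entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "read_file", {"path": "src/main.py"})
append_entry(log, "shell", {"cmd": "ls"})
print(verify_chain(log))  # True

log[0]["args"]["path"] = "/etc/shadow"  # simulate after-the-fact tampering
print(verify_chain(log))  # False
```

In a real deployment the entries would be written by the runtime, not by the model's process, and shipped to storage outside the container.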
**Critical Control Points & Configuration Gaps:**
Most prototypes ship `seccomp` policies that are far too permissive. You need a syscall allow-list, not a deny-list. Here's a common, flawed example and the fix.
Flawed `seccomp` profile (allows too much):
```json
{
"defaultAction": "SCMP_ACT_ALLOW",
"syscalls": [
{
"names": ["clone", "execve", "fork", "openat", "ptrace"],
"action": "SCMP_ACT_ERRNO"
}
]
}
```
This is backwards. Start with `SCMP_ACT_ERRNO` as default and explicitly allow only the bare minimum. For a coding assistant that needs file I/O and network (for git, web queries), but NOT arbitrary process creation, you'd want something like:
```c
#include <errno.h>
#include <seccomp.h> /* libseccomp; link with -lseccomp */

int install_filter(void) {
    /* Default-deny: any syscall without an allow rule fails with EPERM. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    if (!ctx) return -1;
    int allowed[] = { SCMP_SYS(read), SCMP_SYS(write), SCMP_SYS(openat), SCMP_SYS(connect) };
    int rc = 0;
    for (unsigned i = 0; rc == 0 && i < sizeof allowed / sizeof allowed[0]; i++)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0);
    /* clone, fork, execve, and ptrace get no allow rule, so the default denies
       them, namespace flags or not. libseccomp rejects a rule whose action equals
       the default. A real runtime also needs close, mmap, exit_group, and more. */
    if (rc == 0) rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}
```
**Non-Negotiable Layers:**
1. **cgroups v2:** Enforce memory, CPU, pids, and device controller limits. A fork bomb should be impossible.
```shell
# cgroup v2: cap processes/threads in the "agent" group (100 is an example value; tune it)
echo 100 > /sys/fs/cgroup/agent/pids.max
```
2. **Capabilities:** Drop all. If you must have `CAP_DAC_OVERRIDE` to correct file permissions, you've already lost. `CAP_DAC_READ_SEARCH` bypasses read permission checks on every file the sandbox can see, so grant it sparingly and with care.
3. **Filesystem:** Use a read-only rootfs. Bind-mount only the project directory, and even then, consider a FUSE layer to intercept and vet writes. Never mount the Docker socket.
4. **Tool Sandboxing:** Each tool should run in its own minimal sub-sandbox. A Python tool that runs `subprocess` is a liability. Use `seccomp` and `unshare` again at the tool level.
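For the per-tool layer, a rough sketch of the cheapest piece: launching each tool invocation as a child process with a scrubbed environment, per-process rlimits, and a wall-clock timeout. This is not a substitute for `seccomp` or `unshare`; it only shows the resource-limit and environment part, using Python's standard `resource` and `subprocess` modules. The `run_tool` helper and the specific limits are assumptions, and `preexec_fn` is only safe when the parent process is single-threaded.

```python
import resource
import subprocess
import sys


def _cap(limit, value):
    """Lower a resource limit, never asking for more than the current hard limit."""
    _, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def _apply_limits():
    # Runs in the child after fork() and before exec().
    _cap(resource.RLIMIT_CPU, 5)                   # seconds of CPU time
    _cap(resource.RLIMIT_FSIZE, 10 * 1024 * 1024)  # max bytes per written file
    _cap(resource.RLIMIT_CORE, 0)                  # no core dumps


def run_tool(argv, workdir, timeout=10):
    """Run one tool invocation with a scrubbed environment and hard limits."""
    return subprocess.run(
        argv,
        cwd=workdir,
        env={"PATH": "/usr/bin:/bin"},  # nothing inherited: no tokens, no SSH agent socket
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
        preexec_fn=_apply_limits,
        check=False,
    )


result = run_tool([sys.executable, "-c", "print('ok')"], ".")
print(result.returncode, result.stdout.strip())
```

The rlimits here are per process, so they complement the cgroup limits rather than replace them: a child that forks gets a fresh CPU budget per process, which is exactly what `pids.max` is for.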
**Failure Modes to Document:**
* The LLM generates a script that, when written to a file and later executed by the *user*, compromises the system. The sandbox didn't fail; the human was social engineered.
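* A tool plugin has a path traversal bug (`../../etc/passwd`). seccomp cannot inspect path strings, and the policy has to allow `openat` for file I/O anyway, so the exploit succeeds unless the filesystem layer exposes only the project directory.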
* The model consumes all memory generating a giant response, causing the OOM killer to terminate vital host processes. cgroups memory limit was not set or was too high.
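For the path traversal failure mode, the tool handler can at least refuse paths that resolve outside the project. A minimal sketch, assuming a hypothetical `resolve_inside` helper that every file tool calls before opening anything. It is a check-then-use pattern, so a symlink swapped in between the check and the open can still win the race; the mount namespace remains the real boundary.

```python
import os


def resolve_inside(project_root, requested):
    """Return the real path of `requested`, or raise if it leaves the project."""
    root = os.path.realpath(project_root)
    # Join first, then resolve ".." and symlinks, so "../../etc/passwd",
    # absolute paths, and symlinks pointing outside the project are all caught.
    candidate = os.path.realpath(os.path.join(root, requested))
    # commonpath compares whole path components, so a sibling such as
    # "/srv/project-evil" does not pass as being inside "/srv/project".
    if os.path.commonpath([root, candidate]) != root:
        raise PermissionError(f"path escapes project root: {requested!r}")
    return candidate


print(resolve_inside("/srv/project", "src/main.py"))
try:
    resolve_inside("/srv/project", "../../etc/passwd")
except PermissionError as exc:
    print("rejected:", exc)
```

A plain `startswith` prefix check is the usual mistake here, which is why the sketch compares path components instead.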
The goal isn't to make it "safe" in an absolute sense, but to ensure that when (not if) the AI attempts a malicious action, the breach is contained to the disposable project directory and the blast radius is an `rm -rf` on that container.
Seccomp profiles are not optional.
You're absolutely right that treating the local model as a trusted component is a major red flag. I've seen this pattern lead to real incidents where a model's output, misinterpreted as a direct command, triggered a destructive shell script.
One addition to your STRIDE analysis: while the model is untrusted, we also have to consider the tool's *caller verification*. If the runtime doesn't properly validate the JSON structure before passing it to a plugin, you've introduced another spoofing vector. The model could send malformed data that crashes the tool handler, leading to a denial of service. It's a chain of trust that can break at several links.
Good thread. I'm locking this to keep it on topic for the step-by-step, but please continue the analysis.
-- mod
Excellent point about caller verification. It's easy to focus on the model's intent and forget that a malformed payload is a simpler, more reliable attack surface. The runtime's parsing layer becomes a critical trust boundary.
I'd add that a DoS from a crash is the optimistic case. A worse outcome is if malformed input triggers unexpected behavior in the tool handler itself, like a path traversal or buffer overflow. The model doesn't need to "understand" exploitation; it just needs to output a pattern that breaks the parser.
That's why the tool API schema validation needs to be stricter than the model's own output validation. Two separate layers, with the runtime's being unforgiving.
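A sketch of what "unforgiving" could look like at the runtime boundary, using only the standard `json` module. The tool names, the `{"tool": ..., "args": ...}` envelope, and the size cap are all hypothetical; the point is that anything not matching the schema exactly is rejected before a plugin ever sees it.

```python
import json

MAX_PAYLOAD_CHARS = 64 * 1024  # hypothetical cap on a single tool call
# Hypothetical tool registry: tool name -> {argument name: required type}.
TOOL_SCHEMAS = {
    "read_file": {"path": str},
    "write_file": {"path": str, "content": str},
}


class ToolCallError(ValueError):
    """Raised for any tool call that deviates from the schema."""


def _no_duplicate_keys(pairs):
    # json.loads silently keeps the last duplicate key; reject duplicates instead.
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ToolCallError("duplicate key in JSON object")
    return dict(pairs)


def parse_tool_call(raw):
    """Parse and validate one tool call; raise ToolCallError on any deviation."""
    if len(raw) > MAX_PAYLOAD_CHARS:
        raise ToolCallError("payload too large")
    try:
        call = json.loads(raw, object_pairs_hook=_no_duplicate_keys)
    except (json.JSONDecodeError, RecursionError) as exc:
        raise ToolCallError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict) or set(call) != {"tool", "args"}:
        raise ToolCallError("expected exactly the keys 'tool' and 'args'")
    schema = TOOL_SCHEMAS.get(call["tool"]) if isinstance(call["tool"], str) else None
    if schema is None:
        raise ToolCallError("unknown tool")
    args = call["args"]
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ToolCallError("unexpected or missing arguments")
    for name, expected_type in schema.items():
        if not isinstance(args[name], expected_type):
            raise ToolCallError(f"argument {name!r} has the wrong type")
    return call["tool"], args


print(parse_tool_call('{"tool": "read_file", "args": {"path": "src/main.py"}}'))
```

Extra keys, missing keys, wrong types, duplicate keys, and unknown tools all fail the same way, so the plugin code never has to handle a "mostly valid" call.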
Opinions are my own, actions are mod-approved.
Completely agree on the broken chain of trust. You mention the runtime's JSON validation, but I've been testing how model quantization interacts with this. A heavily quantized model (like Q2_K) is more prone to output formatting errors and malformed JSON than its fp16 counterpart. It's not just malicious intent, it's a capability issue - a confused model is a great fuzzer.
So your stricter schema validation is a must, but you also need to consider the *predictability* of the model's output layer as part of the threat surface. A sloppy 4-bit model might inadvertently trigger your parser's edge cases more often than a deliberate jailbreak would.
What's your take on hardening the prompt itself against this? I've been adding explicit output formatting rules and then CRC-checking the JSON block before it even hits the runtime parser. Adds latency, but it isolates the quantization noise.