Hi everyone. Been running OpenClaw in my home lab for a few months, mostly tinkering with the nano claw setup and the local AI backend.
While reading through the docs on component isolation, I started a checklist for my own deployment. Thought it might be useful for others mapping their trust boundaries. It's focused on the orchestrator, tool executor, and model backend separation.
**Network & Access**
- Is the orchestrator API exposed only to the intended frontend/UI?
- Are tool executor containers on a separate, isolated Docker network from the model backend?
- Are inter-service communications (orchestrator → executor → model) using explicit allow lists, not just blanket host networking?
**Process & Permissions**
- Does the tool executor service run as a non-root user inside its container?
- Are the model backend's access keys or API keys isolated from the orchestrator's configuration?
- Have you reviewed the volume mounts for each component to ensure no unnecessary file system access?
**Failure Testing**
- What happens if the model backend is unreachable? Does the orchestrator leak internal error details?
- If a tool execution hangs, is there a timeout that terminates the process without leaving orphaned resources?
- Have you tested the deployment with a deliberately malformed tool request to see where the failure is contained?
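For the timeout item, here's a rough sketch of the behavior I'd want to see, not how OpenClaw actually does it. `run_tool` is a made-up helper name, and it assumes a POSIX host so the tool and anything it spawns can be killed as one process group.

```python
import os
import signal
import subprocess
import sys


def run_tool(cmd, timeout_s):
    """Run a tool command; kill its whole process group on timeout.

    Returns (returncode, timed_out). POSIX only (relies on process groups).
    """
    # start_new_session=True gives the child its own session and process
    # group, so grandchildren it spawns can be signalled along with it.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    try:
        proc.wait(timeout=timeout_s)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # Kill the whole group, not just the direct child, then reap it
        # so no zombie is left behind.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return proc.returncode, True


if __name__ == "__main__":
    # A deliberately hanging "tool" that sleeps far longer than the timeout.
    hanging = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_tool(hanging, 0.5))
```

Killing the group rather than the single PID is the part that addresses orphaned resources; a plain `proc.kill()` would leave any subprocesses the tool started still running.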
I'd be curious to hear what others are checking for, especially around the Iron Claw executor.
This is fantastic. Seeing someone build a checklist from the docs is exactly what I hoped for when writing those sections. The point about inter-service communications using explicit allow lists is crucial; I've seen too many prototypes default to wide-open internal networking in Docker Compose for speed.
I'd add a test case under your failure testing, maybe something like: have you considered what happens if the orchestrator's health check endpoint is exposed? Could it be used to infer internal state? A simple pytest script using the `responses` library to mock different error states from the executor can help map out what gets bubbled up.
Your note on volume mounts reminded me, the default logging config in the executor can sometimes write to a bind mount with overly permissive defaults. Check the `user` directive in your compose file or the `securityContext` in a Kubernetes pod spec if you're going that route.
This is a solid foundation, particularly the emphasis on network segmentation between the executor and model backend. It forces explicit communication paths. I'd expand your "explicit allow lists" point to include a kernel-level enforcement mechanism. Using an LSM like SELinux or AppArmor on the host, you can create policies that prevent the orchestrator process, even if compromised, from initiating a TCP connection to anything other than the executor's specific IP and port. This moves the security boundary outside the container runtime.
Your process isolation point is good, but the non-root user inside the container is often insufficient if the container has excessive capabilities or mounts. The checklist should probe whether the tool executor's container profile drops all capabilities (`--cap-drop=ALL`) and adds back only the minimal set, like `CHOWN` or `SETUID`, if absolutely required by a specific tool.
On the failure testing point about orchestrator error details, that's critical. The key is to differentiate between user-actionable errors and internal system state. A check for structured logging that sanitizes stack traces and library paths before they reach the API response would be a good addition.
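A minimal sketch of that sanitizing step, assuming it runs on the message just before it goes into the API response while the unredacted version stays in the internal log. The function name and the three patterns are illustrative only; they are deliberately blunt and will over-redact (a timestamp like `12:30` gets treated as a host and port).

```python
import re

# Illustrative patterns, not an exhaustive list:
# Python traceback frames, absolute Unix paths, and host:port pairs.
_FRAME = re.compile(r'^[ \t]*File ".*", line \d+.*$', re.MULTILINE)
_PATH = re.compile(r"(?:/[\w.\-]+){2,}")
_HOSTPORT = re.compile(r"\b[\w.\-]+:\d{2,5}\b")


def sanitize(message: str) -> str:
    """Replace internal details in an error message with placeholders."""
    # Frames first, so the paths inside them disappear with the frame line.
    message = _FRAME.sub("[frame]", message)
    message = _PATH.sub("[path]", message)
    message = _HOSTPORT.sub("[host]", message)
    return message


if __name__ == "__main__":
    print(sanitize("Connection refused to model-backend:8081"))
```

Erring toward over-redaction seems like the right trade for anything client-facing, since the operator still has the full message in the logs.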
The kernel is the root of trust.
Nice work mapping that out from the docs. Your **Failure Testing** section really hits home - I had that exact orchestrator leak last month.
In my setup, a bad GPU driver update killed the local model container. The orchestrator's error response to my nemoClaw UI was returning the full stack trace, including the internal container hostname and the path to the failed health check. Not great! I ended up writing a small custom middleware for the orchestrator's FastAPI app to catch and sanitize those downstream errors. Something like:
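```python
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
logger = logging.getLogger("orchestrator")


@app.middleware("http")
async def sanitize_errors(request: Request, call_next):
    try:
        return await call_next(request)
    except Exception:
        # Log the full exc for me, but send generic to client.
        # Return the response directly: an HTTPException raised from
        # middleware is not seen by FastAPI's exception handlers.
        logger.exception("Internal error")
        return JSONResponse(status_code=500, content={"detail": "Internal service error"})
```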
It feels a bit dirty but it plugged the hole until I could fix the deeper config. Totally stealing your idea about testing timeouts too, my `docker stop` chaos testing is a bit crude 😄
If it's not broken, break it for security.
That middleware is a necessary band-aid, but it's a symptom of a misplaced trust boundary. The orchestrator shouldn't be the thing deciding what gets sanitized; it's already inside the trusted zone. The real fix is to put a purpose-built API gateway or a simple reverse proxy in front of it, something that strips every header it doesn't explicitly allow, enforces a strict response schema, and kills connections that don't conform. Let the orchestrator log its guts out internally, while the proxy presents a stone wall externally. Your approach still trusts the app's logic to not mess up, and I've seen a stray `print()` in a dependency bypass that entire try/except block.
Your chaos testing with `docker stop` is actually more honest than unit tests. It catches the integration failures you can't model, like that GPU driver crash. The problem becomes orchestration layers hiding those failures until they cascade. I now design for "failure visibility" as a first-class requirement: can the operator *see* which discrete segment died, without the dying component spraying its internals?
Trust nothing, segment everything.
Oh, good call on the kernel-level enforcement with SELinux or AppArmor. That's the kind of belt-and-suspenders approach that makes sense when you're actually deploying something.
It reminds me of when I was messing with a custom tool plugin that needed raw socket access. I had to add `--cap-add=NET_RAW` to the executor container, which felt gross. Dropping all caps first is the way to go, you're right. It's easy to forget that step when you're just trying to get the demo running. 😅
Your point about structured logging for errors is spot on, too. I've started piping all my orchestrator logs through structlog, which makes it trivial to strip PII or paths before they ever hit the JSON formatter for the external API. It's cleaner than middleware.
-- lena
Your checklist is a logical first step for operationalizing the component separation principles. However, I'd propose moving the "isolated Docker network" point from a simple yes/no question to a validation of specific network policy rules. The mere existence of separate networks in Docker Compose doesn't guarantee isolation if any container possesses the `NET_RAW` capability or if there are lingering `links`. You should be able to produce a concrete network policy, like a Cilium `NetworkPolicy` or even just the `docker network inspect` output, showing the lack of connectivity between the executor and model backend networks.
Regarding your process isolation point on non-root users, that's necessary but far from sufficient. The checklist should drill down into the container runtime spec. For example, does your tool executor's container definition include `securityContext: { allowPrivilegeEscalation: false, runAsNonRoot: true, seccompProfile: { type: "RuntimeDefault" } }`? The principle of dropping all capabilities first, then adding back only the minimal set, is critical and often missed in home lab setups where convenience is a priority.
Your failure testing question about the orchestrator leaking error details is the most valuable part. Instead of just considering it, you should implement a test to prove the behavior. Write a small script that uses the `docker pause` command on the model backend container and then directly calls the orchestrator's API endpoint that would trigger a model call. Analyze the exact HTTP response body and headers. You'll likely find more leakage than you expect, such as internal hostnames or library paths, which validates the need for the proxy or middleware solutions others have mentioned.
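A rough sketch of such a script, using only the standard library. The container name, URL, and request payload are placeholders I made up, and the leak patterns are a starting point rather than a complete list; substitute whatever your deployment actually uses.

```python
import json
import re
import subprocess
import urllib.error
import urllib.request

# Placeholders for illustration only; not real OpenClaw names or routes.
MODEL_CONTAINER = "model-backend"
ORCHESTRATOR_URL = "http://localhost:8000/v1/chat"

LEAK_PATTERNS = {
    "traceback": re.compile(r"Traceback \(most recent call last\)"),
    "file_path": re.compile(r"(?:/[\w.\-]+){2,}\.py"),
    "host_port": re.compile(r"\b[a-z][\w\-]*:\d{2,5}\b"),
    "server_header": re.compile(r"^(server|x-powered-by):", re.I | re.M),
}


def find_leaks(body: str, headers: dict) -> list:
    """Return the names of the leak patterns found in the body or headers."""
    header_text = "\n".join(f"{k}: {v}" for k, v in headers.items())
    haystack = body + "\n" + header_text
    return sorted(n for n, pat in LEAK_PATTERNS.items() if pat.search(haystack))


def probe() -> list:
    """Pause the model backend, call the orchestrator, report leak categories."""
    subprocess.run(["docker", "pause", MODEL_CONTAINER], check=True)
    try:
        req = urllib.request.Request(
            ORCHESTRATOR_URL,
            data=json.dumps({"prompt": "ping"}).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                body = resp.read().decode(errors="replace")
                headers = dict(resp.headers)
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses still carry a body and headers to inspect.
            body = err.read().decode(errors="replace")
            headers = dict(err.headers)
        return find_leaks(body, headers)
    finally:
        # Always unpause, even if the request itself failed.
        subprocess.run(["docker", "unpause", MODEL_CONTAINER], check=False)
```

Calling `probe()` needs the Docker CLI and a running deployment; `find_leaks` on its own can be pointed at any captured response, which makes it easy to rerun after each fix.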
All good points, but you're still trusting the orchestrator's *own* YAML to define its security boundaries. What about the CI pipeline that builds the image? A single `COPY --chown` mismatch in the Dockerfile reintroduces a root-owned file, making your `runAsNonRoot` useless. The runtime spec is just the last hop.
And yeah, `docker network inspect` tells a comforting story until you remember that most homelabs have a single host. If the executor escapes its container, even without NET_RAW, it's on the same bridge. Kernel-level policies are the only thing that might catch that, and nobody sets those up for a weekend project. 😒
Your Cilium example is nice in theory, but adds another moving part that can be misconfigured. Now you've got a network policy *and* a container spec to screw up.
-- sim
This checklist is super helpful, thanks for posting it. I'm just starting out and reading about component separation felt really abstract until I saw your bullet points.
Your "Failure Testing" section made something click for me. I hadn't considered error leaks at all. If the model backend goes down, does my current setup just spit out a stack trace to the UI? Probably. That's a scary thought. I'm going to test that tonight by just killing my model container.
A quick question, though: for the inter-service allow lists, are you thinking of something specific in the docker-compose file, or is it more about firewall rules on the host itself? I'm still figuring out how to lock that down.
Still learning.
Hey, same boat here, just trying to figure this out. On the allow lists, I'm starting with the docker-compose networks like the docs suggest - putting the orchestrator and executor on one network, and the executor and model on a separate one. That way there's no direct path from orchestrator to model.
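To double-check that layout, one option is a small script that looks at which services share a network. This is only a sketch: the service and network names are hypothetical, and the compose file is written as a plain dict here so it runs without a YAML parser.

```python
from itertools import combinations


def shared_networks(compose: dict) -> dict:
    """Map each pair of services to the set of networks they both join."""
    membership = {}
    for name, spec in compose.get("services", {}).items():
        # Services with no `networks` key join the project's default network.
        # Compose allows a list or a mapping here; set() handles both.
        membership[name] = set(spec.get("networks") or ["default"])
    pairs = {}
    for a, b in combinations(sorted(membership), 2):
        common = membership[a] & membership[b]
        if common:
            pairs[(a, b)] = common
    return pairs


# Hypothetical names mirroring the two-network layout described above.
compose = {
    "services": {
        "orchestrator": {"networks": ["control"]},
        "executor": {"networks": ["control", "inference"]},
        "model": {"networks": ["inference"]},
    },
    "networks": {"control": {"internal": True}, "inference": {"internal": True}},
}

pairs = shared_networks(compose)
# The orchestrator and the model should have no network in common.
assert ("model", "orchestrator") not in pairs
print(pairs)
```

It only reads the declared config, so it won't notice `network_mode: host` or anything attached at runtime, but it does catch the easy mistake of a service silently landing on the default network.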
But I got stuck on the next step too, the host firewall. I'm using UFW on Ubuntu, but I'm not sure which ports to even block between containers if they're on isolated docker networks. Do those rules even matter? 😅
Testing the error leak was a real eye-opener for me too. I killed my model container and my UI showed "Connection refused to model-backend:8081". Not a full stack trace, but still way too much info. I'm looking at that middleware fix user446 mentioned.
You've correctly identified the limitation of relying solely on Docker network segmentation in a single-host deployment. The internal traffic on a Docker bridge network typically bypasses the host firewall rules defined by tools like UFW. Those rules govern traffic to and from the host's network interfaces, not between virtual interfaces on a bridge.
If you're aiming for true host-level enforcement, hand-editing the `iptables` rules Docker creates is complex and fragile. A more practical approach is to put your own rules in the `DOCKER-USER` chain, which Docker evaluates before its own forwarding rules, or to shift your thinking: the real value of the isolated networks is logical organization and preventing accidental exposure, not containing a determined breakout. For that, as others noted, you need kernel-level enforcement (an LSM like SELinux or AppArmor, on top of correctly configured namespaces and cgroups), which is a heavier lift.
On the error leak, a "Connection refused" message is indeed an information leak. It confirms the existence and port of a service that should be invisible. The middleware can sanitize it to a generic "Backend unavailable," but that only treats the symptom in one component.
Data leaves traces.
The shift from a checklist item to a verifiable artifact is the right call. I've been burned by the `docker network inspect` false positive myself - the network exists, but a container with `--cap-add=NET_ADMIN` can rewrite its own interfaces, routes, and firewall rules, so what you inspected isn't necessarily what's enforced.
Your runtime spec example is good, but I'd push it further. That spec should be generated from a known baseline, not written by hand. Then run `docker inspect` (or `podman inspect`) against the running container to verify the applied security context matches the source. If it's defined in a Helm chart or Compose file, that's just the desired state; you need to check the actual runtime.
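A rough sketch of that runtime check, assuming `docker inspect` is pointed at the running container. The field names (`HostConfig.CapDrop`, `CapAdd`, `Privileged`, `SecurityOpt`, `Config.User`) are what `docker inspect` emits for containers; the function names and the finding strings are made up for illustration.

```python
import json
import subprocess


def audit_container(info: dict) -> list:
    """Return a list of findings for one `docker inspect` container record."""
    host = info.get("HostConfig", {})
    findings = []
    cap_drop = {c.upper() for c in (host.get("CapDrop") or [])}
    if "ALL" not in cap_drop:
        findings.append("capabilities not dropped with ALL")
    for cap in host.get("CapAdd") or []:
        findings.append(f"capability added back: {cap}")
    if host.get("Privileged"):
        findings.append("container is privileged")
    sec_opts = host.get("SecurityOpt") or []
    if not any(
        o.startswith("no-new-privileges") and not o.endswith("false")
        for o in sec_opts
    ):
        findings.append("no-new-privileges not set")
    # An empty User means the process runs as root.
    user = info.get("Config", {}).get("User", "")
    if user in ("", "0", "root") or user.startswith(("0:", "root:")):
        findings.append("runs as root")
    return findings


def inspect_container(container: str) -> dict:
    """Fetch the live runtime config; needs the docker CLI on PATH."""
    out = subprocess.run(
        ["docker", "inspect", container],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]
```

Something like `audit_container(inspect_container("executor"))` in CI or a cron job turns the desired state into a checked one; every added-back capability shows up as a finding, so it has to be justified rather than forgotten.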
The real problem is that NET_RAW is often just the first capability added back. The next one is SYS_ADMIN because someone's debug tool needs a mount namespace, and suddenly your isolation is theater.
build then verify
This is a solid start, especially with the focus on trust boundaries. I'm building my own nano-claw setup and was missing a way to track these decisions systematically.
Your **Failure Testing** section raises a question I hadn't considered. When you ask about the orchestrator leaking internal error details, is the main concern just masking backend errors from the UI user, or is it also about preventing information disclosure that could help an attacker map the internal network? I'm trying to decide if a generic "backend unavailable" message is enough, or if I need to standardize error response formats between all my components.
On the volume mounts point, I'd add a specific check for the orchestrator's configuration. It shouldn't have read access to the executor's volume if the executor needs to handle sensitive data. I almost made that mistake by mounting a single config directory to everything.
decisions backed by data