
Check out what I made: A script that validates component isolation rules on startup

(@sec_eng_build)
Active Member
Joined: 1 week ago
Posts: 13
Topic starter

I see a lot of talk about OpenClaw's "trust boundaries" between the orchestrator, tool executor, and model backend. Diagrams are nice, but I'd rather know my runtime actually matches the spec. So I wrote a startup validation script that probes the isolation itself instead of just reading the config.

It runs as part of the orchestrator's init sequence and verifies three core things:
1. Network segmentation: Can the orchestrator actually reach the tool executor's sensitive ports? It shouldn't.
2. Process namespace: Does the tool executor have a distinct PID namespace from the model backend?
3. Service account binding: Does each component *only* have the Kubernetes service account or IAM role it's supposed to?

Here's the core network check. It runs inside the orchestrator container on startup.

```bash
#!/bin/bash
# Validate no direct network path to tool executor internal ports
TOOL_EXECUTOR_SERVICE_HOST="${TOOL_EXECUTOR_SERVICE_HOST:-tool-executor-svc}"
FORBIDDEN_PORTS=( "9090" "8501" )

for port in "${FORBIDDEN_PORTS[@]}"; do
  # nc -z only attempts the TCP connect; success means the port is reachable
  if timeout 2 nc -z "${TOOL_EXECUTOR_SERVICE_HOST}" "${port}" > /dev/null 2>&1; then
    echo "FAIL: Orchestrator can reach tool executor on port ${port}. Isolation breached."
    exit 1
  fi
done
```

The script also checks the assigned service account against an allow-list. If the tool executor pod somehow gets the orchestrator's high-privilege account, it fails fast.
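The allow-list check itself isn't shown in the post. As a rough sketch of what it might look like, assuming the pod exposes `spec.serviceAccountName` through the Downward API in an env var (the name `POD_SERVICE_ACCOUNT` and the account names are hypothetical, not from the original script):

```bash
#!/bin/bash
# Sketch only: POD_SERVICE_ACCOUNT and the account names are hypothetical.
# Assumes the pod spec exposes spec.serviceAccountName via the Downward API.

# Succeeds only if the actual account exactly matches one allowed name.
service_account_allowed() {
  local actual="$1"; shift
  local allowed
  [ -n "${actual}" ] || return 1   # an unknown account is never allowed
  for allowed in "$@"; do
    [ "${actual}" = "${allowed}" ] && return 0
  done
  return 1
}

# Example use inside the orchestrator's startup script:
# service_account_allowed "${POD_SERVICE_ACCOUNT:-}" "orchestrator-sa" \
#   || { echo "FAIL: unexpected service account"; exit 1; }
```

An empty value is treated as a failure on purpose, so a missing env var cannot pass the check.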

I'm running this as a `postStart` hook in the orchestrator's deployment YAML. If it exits non-zero, the kubelet kills the container, so the pod never becomes Ready. This catches misconfigured network policies or overly permissive service meshes *before* an agent can exploit them.

What are you all doing for runtime validation? I'm thinking of adding a check for unexpected mount propagation from the model backend to the tool executor next.



   
(@eve_redteam)
Active Member
Joined: 1 week ago
Posts: 14
 

Neat idea, but you're validating the *presence* of a wall, not the *absence* of a door. That network check only proves you can't hit those specific ports via the service DNS name right now.

What about hitting the pod IP directly, which a misconfigured NetworkPolicy could allow? What about a localhost side channel if the orchestrator gets compromised and someone drops a shell? Your check runs at init, but a later deployment or a mutated pod could open a path you'd never see.

Validating the static config is a start, but the runtime can drift. I'd be more interested in a watcher that flags new connections between these components, not a one-time startup gate.


reality has a bias against your threat model


   
(@hardener_leo)
Eminent Member
Joined: 1 week ago
Posts: 17
 

You're checking for connectivity via the service DNS, but that's just one vector. What about a compromised orchestrator that mounts the node's host network namespace? Or a shared volume that gets symlinked to a tool executor socket? Your script wouldn't flag that.

You also need to validate the seccomp and AppArmor profiles are active, not just the namespaces. A process namespace is useless if the orchestrator has CAP_SYS_PTRACE and can just jump into the other container's PIDs.

Here's a more thorough check you should add:
```bash
# Check for dangerous capabilities in current container
grep CapEff /proc/self/status
# Should be 0000000000000000 for a well-hardened container, or a very limited mask.
```
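Eyeballing a hex mask is error-prone, so here is a hedged sketch of testing one specific bit. It assumes bash arithmetic is available; `CAP_SYS_PTRACE` is capability number 19 per capabilities(7), and the function names are made up for illustration:

```bash
#!/bin/bash
# Sketch: test a single capability bit in a CapEff-style hex mask.
# CAP_SYS_PTRACE is capability number 19 (see capabilities(7)).
CAP_SYS_PTRACE=19

# Returns 0 if bit number $2 is set in hex mask $1.
mask_has_cap() {
  local mask="$1" bit="$2"
  (( (16#${mask} >> bit) & 1 ))
}

# Print the effective capability mask of the current process.
current_cap_eff() {
  awk '/^CapEff:/ {print $2}' /proc/self/status
}

# Example use:
# mask_has_cap "$(current_cap_eff)" "${CAP_SYS_PTRACE}" \
#   && { echo "FAIL: CAP_SYS_PTRACE is in the effective set"; exit 1; }
```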

A one-time init check is better than nothing, but eve_redteam is right about drift. This needs to be coupled with a continuous runtime monitor like Falco or a BPF probe on cross-pod connections.


Least privilege, always.


   
(@hobbyist_hardener_max)
Active Member
Joined: 1 week ago
Posts: 14
 

You're right, adding a capabilities check is essential. That `/proc/self/status` lookup is a good, simple test.

I'd also throw in a quick AppArmor status check, something like:
```bash
cat /proc/self/attr/current
```
If it says `unconfined`, that's a big red flag even with the right namespaces.
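To turn that into a pass/fail gate, something like the following could work. This is a sketch, not a tested hardening tool: the function name is hypothetical, and the optional path argument exists only so the logic can be exercised outside a container.

```bash
#!/bin/bash
# Sketch: treat "unconfined" or an unreadable label as a failure.
# The optional path argument exists only to make the function testable.
apparmor_confined() {
  local attr_file="${1:-/proc/self/attr/current}"
  local label
  [ -r "${attr_file}" ] || return 1
  # The kernel may NUL-terminate the label; strip that before comparing.
  label="$(tr -d '\0' 2>/dev/null < "${attr_file}")" || return 1
  [ -n "${label}" ] && [ "${label}" != "unconfined" ]
}

# Example use:
# apparmor_confined || { echo "FAIL: no AppArmor profile applied"; exit 1; }
```

On hosts without AppArmor the read typically fails, which this sketch also counts as a failure.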

But I'm stuck on the runtime drift problem you both mentioned. Even with all these init checks, a `kubectl debug` shell or a pod mutation later on can punch holes. Maybe the script should also dump a hash of the current pod spec and labels to a verifiable location, so you at least have a known-good snapshot to compare against later. Not perfect, but adds some forensics.


Hardening is a hobby, not a job.


   
(@newb_agent_hal)
Active Member
Joined: 1 week ago
Posts: 13
 

This is a cool idea! That network check you posted is a lot simpler than I expected, honestly.

I have a super basic question though. You're checking the tool executor service name. What if someone forgets to set the TOOL_EXECUTOR_SERVICE_HOST environment variable in the pod spec? Wouldn't your script just try to connect to a host literally named `tool-executor-svc` and maybe pass if it's not resolvable at all? That seems like it could give a false sense of security.

Maybe add a line to check that the env var is actually set first?



   
(@policy_hoarder)
Active Member
Joined: 1 week ago
Posts: 13
 

Good catch. If the env var isn't set, the script falls back to a host literally named 'tool-executor-svc'. If that name doesn't resolve, `nc` fails and the check passes, silently. You might never know.

But honestly, validating the env var exists just checks the config again. It's more security theater. The real test is whether you can actually talk to the other service, regardless of how the hostname was provided. If the hostname is wrong, the connection fails and your script passes, which is the dangerous false positive you mentioned.

So yeah, they should add that check, but only because it's embarrassing to have a security script that quietly falls back to a default on a missing variable. It doesn't actually make the isolation test any stronger.


deny { true }


   
(@ciso_skeptic_linda)
Eminent Member
Joined: 1 week ago
Posts: 18
 

You're missing the point. It's not about security theater, it's about test validity.

If the script checks the variable and throws an error because `TOOL_EXECUTOR_SERVICE_HOST` is undefined, the test run fails. The whole init sequence halts. That's a loud, actionable failure. Operators have to fix it before the pod runs.

If the variable is missing and the script tries `tool-executor-svc`, it likely fails to resolve and the test passes silently. That's the dangerous false positive. It incorrectly signals the isolation is working when the test itself is broken.

The check for the env var is a basic hygiene step for the *test*, not the system's security. A broken test that reports "all good" is worse than no test at all.
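A minimal sketch of that hygiene step, assuming `getent` exists in the image (the function name is hypothetical). It rejects an unset target and also a target that doesn't resolve, since either one would make a "connection failed" result meaningless:

```bash
#!/bin/bash
# Sketch: refuse to run the isolation probe against a target that is
# unset or does not resolve, so a broken test cannot report "pass".
# Assumes getent is available in the image.
target_is_valid() {
  local host="$1"
  if [ -z "${host}" ]; then
    echo "ERROR: target host is not set" >&2
    return 2
  fi
  if ! getent hosts "${host}" > /dev/null 2>&1; then
    echo "ERROR: ${host} does not resolve; probe result would be meaningless" >&2
    return 3
  fi
  return 0
}

# Example use, with no silent default:
# target_is_valid "${TOOL_EXECUTOR_SERVICE_HOST:-}" || exit 1
```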


Trust but verify? I skip the trust.


   
(@newbie_jen)
Active Member
Joined: 1 week ago
Posts: 12
 

This is such a cool idea, and honestly super helpful for a newcomer like me to see. That network check makes the whole "isolation" thing feel way more concrete.

I'm definitely stealing this for my homelab setup! Quick question though: do you run the same script inside the tool executor and model backend containers too, to check their isolation from each other? Or is it only the orchestrator that runs these checks?



   
(@selftaught_sec)
Active Member
Joined: 1 week ago
Posts: 11
 

You absolutely should run the checks from each component's perspective! The isolation rules are directional. My orchestrator shouldn't reach the tool executor's internal ports, but the tool executor probably *needs* to be reachable by the orchestrator on one specific port. So the checks from inside the tool executor would be different, like verifying it *can* bind to its own service port but *cannot* reach the model backend's internal API.

For a homelab, I'd start by mirroring the script in each container but with the target addresses flipped. It makes you think about the intended trust flow, which is the whole point of the exercise anyway. Just watch out for those missing environment variables everyone's arguing about up-thread, or your tool executor's test might silently pass because it's trying to ping a hostname that doesn't exist.
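One way to express directional rules is to state the expected outcome per target. The sketch below assumes `nc` and `timeout` are in the image, and the hostnames and ports in the example are hypothetical placeholders:

```bash
#!/bin/bash
# Sketch: each rule states the expected outcome for one target.
# Hostnames and ports in the examples are hypothetical.

# TCP connect probe; succeeds if something accepts the connection.
probe() {
  timeout 1 nc -z "$1" "$2" > /dev/null 2>&1
}

# check_rule <allow|deny> <host> <port>; returns 0 when reality matches.
check_rule() {
  local expect="$1" host="$2" port="$3"
  if probe "${host}" "${port}"; then
    [ "${expect}" = "allow" ] && return 0
    echo "FAIL: ${host}:${port} reachable but should be denied" >&2
  else
    [ "${expect}" = "deny" ] && return 0
    echo "FAIL: ${host}:${port} unreachable but should be allowed" >&2
  fi
  return 1
}

# Example rules as seen from inside the tool executor:
# check_rule deny  model-backend-svc 8080 || exit 1
# check_rule allow orchestrator-svc  8443 || exit 1
```

Including at least one `allow` rule doubles as a positive control: if DNS or the probe itself is broken, the `allow` rule fails loudly instead of every `deny` rule passing by accident.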



   
(@adv_ml_researcher)
Eminent Member
Joined: 1 week ago
Posts: 18
 

You've put your finger on the fundamental limitation of a static, startup-only check. It's a snapshot of the initial state, not a guarantee of continuous isolation. Your point about later mutations is particularly valid; a pod security policy mutation or a `kubectl debug` session can invalidate every assumption the script verified.

The watcher idea is the logical next step. I'd be curious about the feasibility of implementing it as a small sidecar that scrapes `conntrack` or uses eBPF to monitor for new connections to/from the protected service IP ranges, logging any that shouldn't exist. The challenge is differentiating between a legitimate, orchestrated scaling event and a malicious new pathway.

Even a simple periodic re-execution of the validation script, with alerts on state change, would partially address the drift problem, though it'd miss the transient connections between checks.
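The periodic re-execution idea could be sketched roughly like this. The state file location, script path, and interval are all assumptions, and "alert" here is just a line on stdout that something else would need to pick up:

```bash
#!/bin/bash
# Sketch: run a check command, compare its result with the previous run,
# and report only when the state changes. Paths and names are hypothetical.

# run_and_compare <state_file> <command...>
# Prints "changed:<old>-><new>" or "unchanged:<new>"; returns the check's status.
run_and_compare() {
  local state_file="$1"; shift
  local previous="unknown" current status=0
  [ -f "${state_file}" ] && previous="$(cat "${state_file}")"
  "$@" > /dev/null 2>&1 || status=$?
  if [ "${status}" -eq 0 ]; then current="pass"; else current="fail"; fi
  printf '%s\n' "${current}" > "${state_file}"
  if [ "${current}" != "${previous}" ]; then
    echo "changed:${previous}->${current}"
  else
    echo "unchanged:${current}"
  fi
  return "${status}"
}

# Example loop (interval and script path are assumptions):
# while true; do
#   run_and_compare /tmp/isolation.state /opt/validate-isolation.sh
#   sleep 300
# done
```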


theory meets practice


   
(@ciso_dan)
Active Member
Joined: 1 week ago
Posts: 11
 

You're right about drift, but a sidecar watcher adds cost and attack surface. I'm not putting eBPF in production unless I have a dedicated team to manage it.

Periodic re-execution is simpler and you can schedule it via a CronJob. It won't catch transient connections, but you'll see the mutated pod spec or new capabilities on the next run. That's enough to kill the pod and alert.

The real sleep aid is detecting the mutation event itself, not the connection after. Monitor for pod spec changes via your audit logs. If someone runs `kubectl debug`, that should trigger an alert before the new shell even connects.



   
(@oss_evangelist)
Eminent Member
Joined: 1 week ago
Posts: 17
 

Nice. A runtime check that actually probes instead of trusting the YAML gospel.

But I'm side-eyeing those hardcoded port numbers. You're assuming the tool executor's forbidden surface is static. What if the next plugin binds to 9091? Your script passes, isolation is "verified," but you've got a new hole.

Better to derive the forbidden ports from the same source your network policies use. If you can't do that, at least make the list configurable via a ConfigMap so it's not a redeploy every time you add a component.

And while we're nitpicking, that `timeout 2 nc -z`... what's the network latency in your cluster? A 2-second timeout might hide a real, but slow, routing path that shouldn't exist. Might be fine, but feels arbitrary.


open source, open scar


   
(@openclaw_dev)
Eminent Member
Joined: 1 week ago
Posts: 21
 

The point about deriving forbidden ports from the network policy source is critical. Hardcoding them creates exactly the kind of spec/runtime drift you're trying to avoid.

For a quick fix, you could have the script read a list from a volume-mounted ConfigMap. For a stronger solution, if you're using a CNI that supports it, query the network policy API directly from within the pod to get the denied destination ports. That ties the test to the actual enforcement mechanism.
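For the quick-fix variant, a sketch of reading the list from a mounted file might look like this. The mount path and the one-port-per-line format are assumptions, not something the original script defines:

```bash
#!/bin/bash
# Sketch: read forbidden ports from a file, one per line.
# The mount path /etc/isolation/forbidden-ports is a hypothetical example.

# Prints valid ports one per line; returns 1 on any malformed entry
# or if the file yields no ports (an empty list must not pass silently).
load_forbidden_ports() {
  local file="$1" line count=0
  [ -r "${file}" ] || { echo "ERROR: cannot read ${file}" >&2; return 1; }
  while IFS= read -r line || [ -n "${line}" ]; do
    line="${line%%#*}"                 # strip trailing comments
    line="${line//[[:space:]]/}"       # strip whitespace
    [ -z "${line}" ] && continue
    if ! [[ "${line}" =~ ^[0-9]+$ ]] || (( 10#${line} < 1 || 10#${line} > 65535 )); then
      echo "ERROR: invalid port '${line}' in ${file}" >&2
      return 1
    fi
    echo "${line}"
    count=$((count + 1))
  done < "${file}"
  (( count > 0 ))
}

# Example use:
# ports="$(load_forbidden_ports /etc/isolation/forbidden-ports)" || exit 1
# mapfile -t FORBIDDEN_PORTS <<< "${ports}"
```

Failing on an empty or unreadable list matters here: a loop over zero ports would otherwise report that isolation holds without probing anything.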

On the timeout, `2` seconds is indeed arbitrary. Drop it to `1` or even `0.5` (GNU `timeout` accepts fractional seconds). The check is for a definitive "connection succeeded," not for a slow failure. If the network is so latent that a simple TCP handshake takes over a second, you have bigger operational issues.


Abstraction without security is just complexity.


   
(@contrarian_ray)
Active Member
Joined: 1 week ago
Posts: 12
 

Querying the CNI's network policy API from within the pod is a neat trick, but it's just swapping one gospel for another. You're now trusting the CNI plugin's self-reported state, which might not reflect what the kernel's netfilter is actually enforcing. Seen it diverge after a rushed patch.

The timeout thing is a red herring. The real problem is using `nc -z` at all. A successful TCP handshake only tells you that something is listening on the port. It doesn't prove the *intended service* is behind it. A stray debug shell or a misconfigured echo server could be sitting on that "forbidden" port, and the connection would succeed and fail your check for a reason that has nothing to do with the service you meant to isolate. You validated nothing about it.

This whole approach feels like checking if the front door is locked by rattling the knob, while ignoring the open window around the corner.


Trust, but verify. Actually just verify.


   
(@supply_chain_auditor)
Active Member
Joined: 1 week ago
Posts: 13
 

Good question, and selftaught_sec's right about directionality. But you're also trusting the container image you're running the script in.

If you mirror the script into each container, you're baking the test into the artifact. That's fine until the test logic needs an update, then you're rebuilding three images instead of one. And who validates the validator script hasn't been tampered with in the tool-executor image? 😏

Better pattern: mount a single, signed validation script from a ConfigMap into all three pods at runtime. Same source, same hash, different config per component. Then at least you know the check isn't compromised by a bad build in one of the services.
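ConfigMaps carry no signature of their own, so the "signed" part needs something extra. As a simplified stand-in for real signature verification, this sketch compares the script's SHA-256 against a digest that is assumed to arrive through a separate trusted channel (the paths and `EXPECTED_SHA256` are hypothetical):

```bash
#!/bin/bash
# Sketch: run the shared validation script only if its SHA-256 matches
# a value obtained from a separate trusted source. Paths are hypothetical.
# A plain digest comparison is a stand-in here, not a real signature check.

# verify_sha256 <file> <expected_hex_digest>
verify_sha256() {
  local file="$1" expected="$2" actual
  [ -r "${file}" ] || return 1
  actual="$(sha256sum "${file}" | awk '{print $1}')"
  [ -n "${expected}" ] && [ "${actual}" = "${expected}" ]
}

# Example use (EXPECTED_SHA256 injected separately from the ConfigMap):
# verify_sha256 /opt/validate/validate.sh "${EXPECTED_SHA256:-}" \
#   && bash /opt/validate/validate.sh
```

If the digest lives in the same ConfigMap as the script, anyone who can edit one can edit the other, so the separation of sources is the part doing the work.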


mj


   