Check out what I made: A script that validates component isolation rules on startup

39 Posts
37 Users
0 Reactions
10 Views
(@mod_tom)
Active Member
Joined: 1 week ago
Posts: 17

Spot on about the severity mapping. We've actually implemented that exit code pattern for our init containers, but it created a new problem: the orchestrator treats any non-zero exit from an init container as a failure, so the pod never gets to start.

We had to wrap it so only the PID namespace check triggers a hard stop. The network warning exits 0 but dumps a critical log event, letting the orchestrator decide if it's deployable. Otherwise you can't even get to a "degraded but logging" state.

Makes you realize how much of our secure design is negotiating with the orchestrator's own failure semantics.
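
A minimal sketch of that kind of wrapper, for anyone who wants to try it. The check names (`pid_namespace`, `network`) and the JSON log shape are placeholders I made up for illustration, not what the original script uses.

```python
import json
import logging

# Hypothetical check names; only these are allowed to hard-stop the pod.
HARD_STOP_CHECKS = {"pid_namespace"}

log = logging.getLogger("isolation-check")


def exit_code_for(results):
    """Map check results to a process exit code.

    `results` is a dict of check name -> True (passed) / False (failed).
    Failed soft checks are logged at CRITICAL but leave the exit code at 0.
    """
    code = 0
    for name, passed in sorted(results.items()):
        if passed:
            continue
        if name in HARD_STOP_CHECKS:
            log.critical(json.dumps({"check": name, "action": "hard_stop"}))
            code = 1
        else:
            log.critical(json.dumps({"check": name, "action": "degraded"}))
    return code
```

The init container would end with something like `sys.exit(exit_code_for(results))`. With `{"pid_namespace": True, "network": False}` that returns 0, so the pod starts but the critical event is still in the logs.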



   
(@pentest_gabe)
Eminent Member
Joined: 1 week ago
Posts: 16

Good catch on the DNS abstraction. That's exactly where a threat actor pivoting from a compromised orchestrator would start - they'd enumerate pod IPs directly, not rely on service names.

Your point about mutating webhooks is the real kicker though. A webhook can rewrite the environment variable (or inject a sidecar) after the manifest has been validated but before the pod starts, leaving the script checking a dummy value. The check validates a *declared* boundary, not the *enforced* one.

So maybe the script's true value isn't as a security gate, but as a canary. If it suddenly starts passing when it should fail (because `$SERVICE_HOST` is now empty), that's a signal something mutated the spec unexpectedly. It's a detection mechanism for supply chain tampering, not prevention.
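
Rough sketch of the canary idea, assuming you keep a small table of which checks are *supposed* to pass or fail. The function name and the check names are hypothetical; `SERVICE_HOST` is the variable from the script.

```python
import os


def canary_signals(expected, observed, env=None, required_vars=("SERVICE_HOST",)):
    """Return a list of human-readable tamper signals.

    expected / observed: dict of check name -> bool (True = check passed).
    Any result that differs from the expectation is reported, and so is a
    required variable that is missing or empty.
    """
    env = os.environ if env is None else env
    signals = []
    for var in required_vars:
        if not env.get(var):
            signals.append(f"{var} is empty or unset")
    for name, want in sorted(expected.items()):
        got = observed.get(name)
        if got is None:
            signals.append(f"{name}: no result recorded")
        elif got != want:
            signals.append(
                f"{name}: expected {'pass' if want else 'fail'}, "
                f"got {'pass' if got else 'fail'}"
            )
    return signals
```

An empty list means nothing unexpected; anything else is worth shipping to whatever alerting you already have rather than using it to block startup.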


Trust me, I'm a pentester.


   
(@home_seg_frank)
Active Member
Joined: 1 week ago
Posts: 11

Yep, the chicken-and-egg on allowed connectivity is the real killer. You can't prove reachability from inside one container alone.

We hit this with our IoT agent setup. The agent's init script verified its own MQTT port was listening, but the control plane container couldn't actually talk to it because of a missing firewall rule. The agent started "healthy," but the system was dead.

My hack was a small, separate "handshake" init container on both sides that shared a tiny volume. Each would write its own IP and a nonce, then try to read the other's and attempt a connection. If it failed, it wrote a failure flag. The main init script just checked for that flag. Messy, but it broke the loop.
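
For the curious, the handshake looked roughly like this. This is a from-memory sketch, not the real thing: the file names and the `side`/`peer` labels are made up, and it assumes both sides can actually mount the same volume. The nonce is only recorded here; a fuller version would presumably exchange it over the connection.

```python
import json
import secrets
import socket
from pathlib import Path


def publish(shared_dir, side, ip, port):
    """Write this side's address and a fresh nonce to the shared volume."""
    record = {"ip": ip, "port": port, "nonce": secrets.token_hex(8)}
    Path(shared_dir, f"{side}.json").write_text(json.dumps(record))
    return record


def handshake(shared_dir, side, peer, timeout=2.0):
    """Try to connect to the peer's published address; flag failure on disk.

    Returns True on success. On any failure (peer file missing or malformed,
    connection refused or timed out) writes `<side>.failed` so the main
    init script only has to check for that file.
    """
    flag = Path(shared_dir, f"{side}.failed")
    try:
        record = json.loads(Path(shared_dir, f"{peer}.json").read_text())
        with socket.create_connection((record["ip"], record["port"]), timeout=timeout):
            pass
    except (OSError, ValueError, KeyError) as exc:
        flag.write_text(str(exc))
        return False
    flag.unlink(missing_ok=True)  # clear a stale flag from an earlier run
    return True
```

Each side calls `publish(...)` first, then `handshake(...)`; the main init script just fails if any `*.failed` file exists.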


Segment first, ask questions later.


   
(@hobby_pentester)
Eminent Member
Joined: 1 week ago
Posts: 15

Yeah, that's the gotcha. The policy might look like it's deny-ingress on paper, but if the label selector's too broad or someone flips the podSelector/namespaceSelector logic, you've got a backchannel.

I actually built a PoC for this last month - a tiny sidecar that curls the supposedly "blocked" management endpoint from the tool executor's net namespace. If it gets a 200, it dumps the whole network config to a debug volume. Found three "deny-all" policies in our staging cluster that were accidentally permissive because someone used `podSelector: {}` in the `egress` block, which matches every pod in the namespace.

Silent failure mode.
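
The probe part is tiny. Here's a Python sketch of the same idea (the PoC itself uses curl). Note one deliberate difference: this treats *any* HTTP answer as a leak, not just a 200, since even a 403 means the packet got through. The function names are mine, and the config dump is left out.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=3.0):
    """Return the HTTP status for `url`, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code   # the server answered, just not with a 2xx
    except OSError:
        return None       # refused, timed out, or unresolvable


def blocked_endpoint_leaks(url, fetch=http_status):
    """True if a supposedly blocked endpoint answered at all."""
    return fetch(url) is not None
```

`fetch` is injectable so you can unit-test the decision without a cluster. One caveat: an unresolvable name also comes back as `None`, so a typo in the URL looks the same as a working block.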


if it moves, fuzz it


   
(@prompt_shield_leo)
Active Member
Joined: 1 week ago
Posts: 13

That sidecar curl PoC is a great idea for catching those label selector gaps. It's basically a runtime test of the actual NetworkPolicy, not the YAML.

I wonder if you could push it further and have the sidecar periodically re-test after startup, not just during init. Policies can be updated live, and a pod that passed at t=0 could suddenly have a backchannel opened at t=300. A canary that logs a sudden, unexpected reachability change would be a nice signal for drift.

Found a similar thing in our setup where a `namespaceSelector: {}` was allowing egress to kube-system from a supposed user-pod. Totally silent.
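
Something like this is what I have in mind for the periodic re-test. It's only a sketch: `probe` is whatever reachability check you already have (assumed to return a bool), and the target strings are placeholders.

```python
import time


def reachability_drift(baseline, current):
    """Compare two snapshots of target -> reachable (bool).

    Returns (opened, closed): targets that became reachable since the
    baseline, and targets that stopped being reachable.
    """
    opened = sorted(t for t, ok in current.items() if ok and not baseline.get(t, False))
    closed = sorted(t for t, ok in baseline.items() if ok and not current.get(t, False))
    return opened, closed


def watch(targets, probe, baseline, interval=300, rounds=None, sleep=time.sleep):
    """Re-probe every `interval` seconds and yield drift as it appears.

    Runs forever when `rounds` is None. The baseline is never updated,
    so a backchannel keeps being reported until someone closes it.
    """
    n = 0
    while rounds is None or n < rounds:
        current = {t: probe(t) for t in targets}
        opened, closed = reachability_drift(baseline, current)
        if opened or closed:
            yield {"opened": opened, "closed": closed}
        n += 1
        if rounds is None or n < rounds:
            sleep(interval)
```

Keeping the baseline fixed is a choice: it's noisier, but drift can't quietly become the new normal.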


Injection? Not on my watch.


   
(@th3r3s4)
Eminent Member
Joined: 1 week ago
Posts: 21

Excellent foundational idea. I'm in complete agreement that testing the runtime state, not the declared configuration, is the only way to validate a threat model. Your script moves from theoretical to empirical, which is crucial.

However, your service account check as described is fundamentally flawed. It can only verify the presence of a token file or a specific annotation, not the effective permissions bound to that identity. A pod can have the correct `spec.serviceAccountName`, but that account can still be bound to a wildly over-permissive `ClusterRole` via a `RoleBinding` or `ClusterRoleBinding`. Your script would pass while the actual authorization boundary is nonexistent.

The more reliable pattern is to have each component's init sequence attempt a *prohibited* action using its own service account, like a test pod trying to list secrets in another namespace. If that action succeeds, the isolation is broken. This tests the aggregate of the service account, role, and binding.
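
A minimal sketch of such a negative test, using only the standard library. It assumes the default in-pod service account mount path and the usual `KUBERNETES_SERVICE_HOST` / `KUBERNETES_SERVICE_PORT` variables; the function names and the choice of "list secrets" as the prohibited action are illustrative.

```python
import os
import ssl
import urllib.error
import urllib.request

SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def secrets_url(namespace, host=None, port=None):
    """Build the API URL for listing secrets in `namespace`."""
    host = host or os.environ.get("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
    port = port or os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    return f"https://{host}:{port}/api/v1/namespaces/{namespace}/secrets"


def classify(status):
    """Interpret the HTTP status of a request that *should* be denied."""
    if status in (401, 403):
        return "isolated"      # the API server refused, as intended
    if 200 <= status < 300:
        return "broken"        # the prohibited action succeeded
    return "inconclusive"      # e.g. 404 or 5xx: says nothing about RBAC


def try_list_secrets(namespace):
    """Attempt the prohibited call with the pod's own token."""
    with open(f"{SA_DIR}/token") as fh:
        token = fh.read().strip()
    ctx = ssl.create_default_context(cafile=f"{SA_DIR}/ca.crt")
    req = urllib.request.Request(
        secrets_url(namespace), headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(req, timeout=5, context=ctx) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as exc:
        return classify(exc.code)
```

Only a result of `"broken"` should fail the init sequence; `"inconclusive"` deserves a log line, since an unreachable API server proves nothing either way.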


If you can't explain the risk, you can't mitigate it.


   
(@red_team_learner_ivy)
Eminent Member
Joined: 1 week ago
Posts: 16

Yeah, the false pass when the hostname is wrong is the real killer. It makes the test look green while the actual path is wide open.

Would the fix be to test against the pod's actual IP, pulled from the Downward API? That way you're testing the network boundary, not the DNS config. But then you're still trusting the orchestrator not to lie to you about the IP.

Feels like you can only prove a negative from *outside* the pod, which loops back to needing a separate validation system.
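
One partial mitigation I've been sketching: make the probe three-state, so a name that doesn't resolve is reported as "inconclusive" instead of silently counting as "blocked". The function and state names are my own, and this still doesn't solve the trust problem above.

```python
import socket


def probe(host, port, timeout=2.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return 'reachable', 'blocked', or 'inconclusive' for host:port.

    A name that does not resolve is 'inconclusive', never 'blocked':
    the network boundary was not actually exercised.
    """
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "inconclusive"
    for info in infos:
        addr = info[4][:2]  # (ip, port) for both IPv4 and IPv6 results
        try:
            connect(addr, timeout=timeout).close()
            return "reachable"
        except OSError:
            continue  # refused or timed out on this address; try the next
    return "blocked"
```

Passing a literal IP as `host` skips DNS entirely, which covers the pod-IP variant. Note that "connection refused" is lumped in with "blocked" here, even though it means the host itself answered.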


Breaking things to learn.


   
(@ray_selfhost)
Eminent Member
Joined: 1 week ago
Posts: 16

That's such a clean approach! I'm setting up a home server for my own OpenClaw tinkering and this is exactly the kind of concrete check I need.

But how do you test the reverse? Like, can the *tool executor* initiate a connection back to the orchestrator on a port it shouldn't? Would you just run a mirrored version of this script from the tool executor's init container? Feels like you'd need to coordinate them.



   
(@local_agent_lars)
Active Member
Joined: 1 week ago
Posts: 12

You're so right about it being for the *next* engineer. I've been that inheritor, staring at a spaghetti of network policies and trying to reverse-engineer intent from stale Confluence pages. A simple script in the repo that actually pokes the live system becomes the single source of truth.

That out-of-band scanner point hits home. In my homelab, I've been using NetFlow logs from my OPNsense box fed into a tiny Grafana dashboard just to *visualize* east-west traffic. It caught a Redis pod talking directly to a Postgres backend on a port that was supposed to be blocked - the init script passed because it checked the wrong service name. The script validated a theory, but the flow logs showed the reality.

You've got me thinking now... maybe the script's output should be structured JSON that gets consumed *by* that out-of-band scanner as a baseline. Let the script define the intended rule, and let the network logs continuously audit for deviations.
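
Quick sketch of what I mean, nothing I've actually wired up yet. The field names (`src`, `dst`, `port`, `allowed`) are an invented schema, and it assumes the flow logs can be exported as rows with matching keys.

```python
import json


def baseline_json(rules):
    """Serialise intended rules given as (src, dst, port, allowed) tuples."""
    return json.dumps(
        [{"src": s, "dst": d, "port": p, "allowed": a} for s, d, p, a in rules],
        sort_keys=True,
    )


def audit_flows(baseline, flows):
    """Return observed flows that the baseline marks as not allowed.

    `flows` is an iterable of dicts with src, dst, and port keys, e.g.
    rows exported from a flow-log pipeline.
    """
    denied = {
        (r["src"], r["dst"], r["port"])
        for r in json.loads(baseline)
        if not r["allowed"]
    }
    return [f for f in flows if (f["src"], f["dst"], f["port"]) in denied]
```

The init script would print `baseline_json(...)` once at startup, and the scanner side would run `audit_flows` against each batch of flow records.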


Keep your data local.


   