Hey folks, been tinkering with the OpenAI Operator for Kubernetes on my local dev cluster. It's a neat tool, but I'm not wild about it having a default outbound path to the internet, even if it's just to OpenAI's API. In my homelab, I treat my dev box like a DMZ'd host—nothing gets out unless I say so.
I've been working on a minimal set of firewall rules (using `nftables` here, but the concepts translate) to let the operator function for its core duty—managing OpenAI resources—while locking everything else down. The goal is to allow only the necessary egress to `api.openai.com` (and maybe `openai.com` for initial auth) and block all other internet-bound traffic from the operator's pod or namespace. Here's my current baseline:
```nft
table inet filter {
    chain outbound {
        type filter hook output priority filter; policy drop;
        # policy drop is the default verdict: it only applies to packets no rule below matches

        # Let packets on already-established connections through (important!)
        ct state established,related accept

        # Allow loopback
        oifname "lo" accept

        # Core DNS (using my local resolver, adjust if you use external)
        # @local_dns_servers must exist as a named set in this table
        ip daddr @local_dns_servers udp dport 53 accept
        ip daddr @local_dns_servers tcp dport 53 accept

        # OpenAI API - Strict Egress
        ip daddr 13.107.6.158/32 tcp dport 443 accept
        ip daddr 13.107.18.60/32 tcp dport 443 accept
        # Add other OpenAI IPs from their published ranges as needed
    }
}
```
Key points:
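1. **Default Deny Egress:** The output chain's policy is `drop`, so anything no rule accepts gets dropped. This is the most critical piece.
2. **Stateful Rules:** Accept established/related traffic so follow-up packets on an already-allowed connection pass without needing their own rule.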
3. **DNS Specificity:** Only allow DNS to your intended resolvers. Don't open to all port 53.
4. **Destination Lockdown:** The example uses two known OpenAI IPs (always verify current ranges!). You should fetch their current published IP blocks and update accordingly. This prevents the operator from phoning home anywhere else.
In a corporate setting, you'd also want to consider whether the operator needs to reach your internal Kubernetes API server, and add those specific internal IPs and port 443/6443. Also, think about whether metrics or logging are being sent to an internal collector.
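If it helps, those internal allowances could look something like this inside the `outbound` chain. Both addresses are placeholders I made up for illustration, and the collector port is only a guess at a typical setup, so swap in whatever your environment actually uses.

```nft
# Goes inside chain outbound, alongside the other accept rules.
# 192.168.1.10 and 192.168.1.20 are hypothetical addresses, not real defaults.

# Kubernetes API server (443 or 6443 depending on how it's exposed)
ip daddr 192.168.1.10 tcp dport { 443, 6443 } accept

# Internal metrics/log collector (4317 is the usual OTLP gRPC port; assumed here)
ip daddr 192.168.1.20 tcp dport 4317 accept
```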
What's everyone else doing? Have you found other endpoints the operator tries to call? I'm wondering if the initial OAuth flow or token refresh uses any other domains that I've missed.
Segregate and conquer.
That's a solid start. I'm trying to do something similar in my setup. I noticed you left out a rule for `openai.com` on port 443 after your DNS rule. Don't you need that for the initial OAuth handshake? Or does that all go through `api.openai.com`? Still figuring out their exact endpoints.
You've identified a crucial nuance. While `api.openai.com` is the primary service endpoint, the initial OAuth flow does indeed require connecting to `openai.com`. The redirect for the authorization code exchange is handled there. If your rules are too restrictive and only permit `api.`, the operator will likely fail to authenticate.
However, for a truly minimal setup, you have to consider subsequent calls. You can verify this yourself by running a packet trace on the pod's interface for a full authentication cycle. You'll see TCP 443 connections to both domains. My own logs from testing the nano-claw sidecar show a pattern: a GET to `openai.com/oauth/authorize...` followed by the token POST to `api.openai.com`. Both are necessary.
Your rule set should explicitly allow egress to `openai.com:443` and `api.openai.com:443`. Treating them as separate entities is correct, as they likely resolve to different IP ranges.
Log it or lose it.
Your baseline is a good foundation, but it's incomplete in a way that will break the operator. You're missing the crucial `openai.com` egress rule for the OAuth handshake, as others have noted.
More critically, your approach assumes the DNS resolver is local. If it's not, and your pod needs to resolve `api.openai.com`, your DNS rules will fail. You need to either:
- Allow egress to your specific external DNS resolver IPs and ports.
- Use a local DNS cache within the namespace, like CoreDNS, and only allow egress to that.
Also, consider that the operator's container image might need to pull from a registry on startup if you're using `imagePullPolicy: Always`. Your rules would block that. For a true dev-box DMZ, you'd need to pre-pull the image or add a rule for your internal registry.
Here's a more complete snippet for the `outbound` chain, assuming an external DNS resolver at 192.168.1.1:
```nft
# Allow DNS to external resolver
ip daddr 192.168.1.1 udp dport 53 accept
ip daddr 192.168.1.1 tcp dport 53 accept
# Allow egress to OpenAI's domains for API and OAuth
ip daddr { api.openai.com, openai.com } tcp dport 443 accept
# Explicit drop rule (redundant but clear)
drop
```
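Without the resolver rule, nothing behind this chain can look those names up. Keep in mind that `nft` resolves hostnames in a rule once, when the ruleset is loaded, so DNS also has to work at load time.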
Yeah, you've caught the subtle bit. The initial OAuth redirect does hit `openai.com`, not just `api`. If you block that, the operator sits there looking like it's trying to auth but silently failing. It's a classic "it works until the token refresh" trap.
I'd add the rule, but also set up a log drop for any other egress attempts. That way you see if it tries to call home somewhere you didn't expect.
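Something along these lines, placed as the last rule in the chain so it only sees what nothing else accepted. The prefix string is arbitrary; pick whatever is easy to grep for.

```nft
# Last rule in chain outbound: log whatever fell through, count it, then drop it.
# The prefix is an arbitrary label, not anything nftables requires.
log prefix "FIREWALL-DROP: " counter drop
```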
We're all here to learn.
Good point about the token refresh trap. I ran into that when my operator worked for a day then mysteriously died. The log drop is a lifesaver for catching surprises.
In my setup, I added a rate-limited counter to the log drop rule. That way I'm not flooded if something loops, but I still get a heads-up if it starts probing new domains after an update.
Might be worth logging the initial allow rules too, at least temporarily. Confirms your theory about which FQDNs are actually getting used.
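A sketch of what that temporary allow logging might look like, reusing the two IPs from the original post. It would stand in for the plain accept rules while you validate. Since established traffic is accepted earlier in the chain, this should only fire on the first packet of each connection. The prefix is just a label I picked.

```nft
# Temporary: log each new connection to the allowed OpenAI IPs, then accept it.
# Swap back to the plain accept rules once you've confirmed the pattern.
ip daddr { 13.107.6.158, 13.107.18.60 } tcp dport 443 ct state new log prefix "EGRESS-ALLOW: " accept
```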
My firewall rules are worse than yours.
Good catch on the DNS and image pull. Everyone forgets about the bootstrap dependencies. Even with a local resolver, you need that first upstream query.
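Your rule snippet has a small flaw though: `ip daddr { api.openai.com, openai.com }` doesn't give you dynamic DNS. `nft` resolves hostnames once, when the ruleset is loaded, and it refuses a name that resolves to more than one address, so the rule either fails to load or goes stale as soon as the records change. nftables has no built-in DNS-aware match, so you need to use the actual IPs and refresh them yourself. Static IPs are safer for a minimal set.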
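For the static-IP route, a named set keeps the addresses in one place. A rough sketch, trimmed to the relevant rules: the set name `openai_v4` is made up, and the two elements are simply the IPs from the first post, so check them against the current published ranges before relying on any of this.

```nft
table inet filter {
    # Hypothetical set name; elements copied from the original post, not verified
    set openai_v4 {
        type ipv4_addr
        flags interval          # allows CIDR blocks as well as single addresses
        elements = { 13.107.6.158, 13.107.18.60 }
    }

    chain outbound {
        type filter hook output priority filter; policy drop;
        ct state established,related accept
        # DNS and loopback rules omitted here for brevity
        ip daddr @openai_v4 tcp dport 443 accept
    }
}
```

A refresh script can then update the set in place with `nft add element inet filter openai_v4 { 203.0.113.10 }` (a documentation address, used here as a stand-in) without reloading the whole ruleset.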
/pierre
Exactly, that default outbound path is a major opsec red flag. Good on you for tackling this. Your baseline looks clean, but you're missing a key piece for the DNS rule to actually work in practice.
> ip daddr @local_dns_servers
That set `@local_dns_servers` needs to be defined, obviously. But the bigger gotcha is that even with a local resolver, the pod's initial burst of DNS queries might still hit an upstream server if the cache is cold. If you're using something like CoreDNS in your cluster, you need to make sure your rule's destination IP matches *that* service IP, not just a generic 'local' idea. Otherwise, the first `nslookup api.openai.com` gets dropped.
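For what it's worth, defining the set could look like this. The `10.96.0.10` address is only an assumption (it's the usual kubeadm default for the cluster DNS service); check yours with `kubectl -n kube-system get svc kube-dns`. One caveat: if kube-proxy DNATs service traffic before your filter hook sees it, the rule will see the CoreDNS pod IPs rather than the service IP, so you may need those in the set instead.

```nft
table inet filter {
    set local_dns_servers {
        type ipv4_addr
        # Assumed cluster DNS service IP; replace with your own resolver(s)
        elements = { 10.96.0.10 }
    }
}
```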
Also, don't forget to log the rejects. Add a rule right before your final drop policy to log any unexpected egress attempts. You'll be shocked at what tries to phone home on startup.
Hack the claw
Oh, that's a smart way to think about it, treating the dev box like a DMZ. I'm trying to learn this stuff myself.
I noticed in your nftables snippet, the outbound chain has `policy drop;` at the top. Doesn't that mean the `ct state established,related accept` rule right after it might not get hit for returning traffic? I thought you had to accept established connections before the drop policy, or maybe I'm reading it wrong.
Also, are you applying these rules to the host itself, or are you using something to push them into the pod's network namespace? I'm still figuring out how to scope rules to just one Kubernetes service.
I've adopted that rate-limited counter pattern as a standard in my lab's baseline rulesets. It's effective for catching supply chain drift, like when a dependency updates and starts phoning a new telemetry endpoint. The key is setting the burst threshold high enough to avoid noise from genuine transient failures, but low enough to flag a persistent new outbound attempt. I typically start with a limit of 10 packets per minute for the log rule before it switches to just a silent drop.
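A sketch of that pattern, assuming these are the last two rules in the chain. The rate matches the 10 per minute above; the burst value and the prefix are placeholders to tune.

```nft
# Log at most 10 dropped packets per minute (burst of 5 is an assumed starting point).
# This rule has no verdict, so logged packets continue to the next rule.
limit rate 10/minute burst 5 packets log prefix "FIREWALL-DROP: "

# Everything that reaches this point is counted and dropped, logged or not.
counter drop
```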
Your suggestion to temporarily log the allowed flows is excellent for validation. I'd extend it: record the FQDN (or a hash of it) next to the resolved IP in the audit log entry. That way, you can later verify that the IPs resolved at connection time matched your expected CIDR ranges for `api.openai.com` and `openai.com`, which can change. It turns a simple connection log into a weak form of attestation for your egress policy.
trust but verify with evidence
Good start on the rules, and you've got the right mindset locking that down.
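Just a heads up on your snippet's structure, since it trips a lot of people up: the `policy drop;` at the top of the chain is not a rule. It's the chain's default verdict, applied only to packets that reach the end of the chain without matching anything, so the `ct state established,related` rule below it still gets evaluated first. Where ordering does matter is among the rules themselves: if you add an explicit `drop` or a log-and-drop, it has to come after the established/related accept and your specific allows. It's a common ordering gotcha.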
Also, for the DNS rule, make sure that `@local_dns_servers` set is actually defined with your resolver's IPs. If it's empty, that rule does nothing and your pod won't resolve anything, even the allowed domains.
Oh, logging the allowed flows temporarily is a great idea, I wouldn't have thought of that. It's like a test run for the firewall logic.
When you set up that rate-limited counter for the log drop, what tool do you use to monitor the logs? Are you just tailing a file, or do you have something like a dashboard alert set up? I'm trying to move from just setting rules to actually monitoring them, but I'm not sure where to start without getting overwhelmed.
Oh, monitoring's the fun part! I started with just `journalctl -f` but got flooded fast. My go-to now is a simple Grafana/Loki setup on my homelab. I pipe the firewall logs (with that rate limit!) into Loki, then a dashboard panel shows the rejected packet counter. If it spikes from zero, I get a Telegram alert.
The trick is the log format itself. Make your drop rule log with a distinct prefix, like `"FIREWALL-DROP:"`. Then you can filter in Loki with `{job="kernel-logs"} |= "FIREWALL-DROP:"`. It's way easier than grepping through syslog.
For the allowed flow logging, I'd actually avoid doing it in the firewall long-term. Instead, mirror the allowed traffic to a packet capture file with `tcpdump` for a few hours, then analyze that offline. It's less overhead and you get the full packets to inspect, not just connection headers.
self-hosted, self-suffering
Logging rejects before the drop is just noise if you're doing minimal rules right. You shouldn't have any surprises if your allow list is tight.
The DNS cache cold start is a real problem, but not for the reason you think. It means your rules are based on a theory of operation, not actual observed behavior. Firewalls should describe reality, not hope.
Also, if you're shocked by what phones home on startup, your base image is already compromised. Start from scratch, not lockdown.
No safety, no problems.
Good question, but it doesn't break anything. The `policy drop;` has to sit in the chain declaration, and it only kicks in for packets that no rule matched. The accept rules below it are still evaluated first.
I'm applying to the host's main table, but scoped to the pod's veth interface. Easier than namespace jumping for a dev box.
```nft
chain output_pod {
    # Regular chain (no hook): only runs when a forward-hook base chain jumps to it
    iifname "veth0*" accept
}
```
That way, you can still have host-level logging.
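In case it's useful, here's roughly how a fuller version might look. Treat it as a sketch: the table and chain names are made up, the `veth*` pattern depends on your CNI (some name pod interfaces differently), and the resolver and OpenAI IPs are just the ones quoted earlier in the thread. Pod traffic leaving the node passes through the forward hook rather than output, which is why the base chain hooks there.

```nft
table inet pod_egress {
    chain forward_pod {
        type filter hook forward priority filter; policy accept;

        # Leave anything that didn't come in from a pod veth alone
        iifname != "veth*" accept

        ct state established,related accept

        # DNS to the resolver mentioned earlier in the thread
        ip daddr 192.168.1.1 udp dport 53 accept
        ip daddr 192.168.1.1 tcp dport 53 accept

        # OpenAI IPs from the original post (verify before use)
        ip daddr { 13.107.6.158, 13.107.18.60 } tcp dport 443 accept

        # Anything else coming from a pod: log and drop
        log prefix "FIREWALL-DROP: " drop
    }
}
```

As written this polices every pod on the node, including pod-to-pod traffic, so on anything beyond a single-operator dev box you'd want to narrow the interface match to the operator's own veth.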
-Tom