Just spent the last three days instrumenting a test pod to see what our monitoring agents are *actually* trying to do on the network. The default network policies from the vendor are a joke: wide-open egress to half the internet. So I built a minimal sidecar that uses an eBPF program to log every outbound connection attempt from the target container (each `connect` call, which for TCP is what triggers the SYN). The results were exactly as infuriating as expected.
Here is the core of the tracing logic. It logs process ID, command, destination IP:port, and the protocol.
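```c
// Simplified skeleton of the eBPF program (TCPSnoop): IPv4 only, protocol lookup omitted
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
struct event { u32 tgid, addr, port; char comm[16]; };
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} events SEC(".maps");
SEC("tracepoint/syscalls/sys_enter_connect")
int trace_connect_entry(struct trace_event_raw_sys_enter *ctx) {
    struct event data = {};
    struct sockaddr_in sa = {};
    // Pull the sockaddr from ctx->args[1] (a user-space pointer); skip non-IPv4
    if (bpf_probe_read_user(&sa, sizeof(sa), (void *)ctx->args[1]) || sa.sin_family != 2 /* AF_INET */)
        return 0;
    data.tgid = bpf_get_current_pid_tgid() >> 32;  // user-visible process ID
    data.addr = sa.sin_addr.s_addr;                // network byte order
    data.port = sa.sin_port;                       // network byte order
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    // Push event to a perf buffer
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &data, sizeof(data));
    return 0;
}
char LICENSE[] SEC("license") = "GPL";
```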
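On the user-space side, each record pulled off the perf buffer still has to be turned into a readable log line. A minimal sketch of that step follows. The `conn_event` layout and the `format_conn_event` name are assumptions made for illustration, not the sidecar's actual definitions. The only firm requirement is that address and port arrive in network byte order, as they do in a `sockaddr_in`.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record layout; the real sidecar's struct may differ. */
struct conn_event {
    uint32_t tgid;      /* process ID as seen from user space */
    uint32_t addr;      /* IPv4 destination, network byte order */
    uint16_t port;      /* destination port, network byte order */
    char     comm[16];  /* task command name */
};

/* Format one event as "pid=<n> comm=<s> dst=<ip>:<port>" into out.
 * Returns 0 on success, -1 if the address cannot be formatted
 * or the output buffer is too small. */
int format_conn_event(const struct conn_event *ev, char *out, size_t outlen)
{
    char ip[INET_ADDRSTRLEN];
    struct in_addr in = { .s_addr = ev->addr };

    if (inet_ntop(AF_INET, &in, ip, sizeof(ip)) == NULL)
        return -1;
    /* %.16s bounds the read in case comm is not NUL-terminated. */
    int n = snprintf(out, outlen, "pid=%u comm=%.16s dst=%s:%u",
                     (unsigned)ev->tgid, ev->comm, ip,
                     (unsigned)ntohs(ev->port));
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Forgetting the `ntohs` conversion is the usual mistake here: port 8443 would show up in the log as 64288 on a little-endian host.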
The raw log after 24 hours of the agent running its "normal" workload revealed over 15 distinct FQDNs and IP ranges being contacted. The vast majority were not for core functionality. Here is the breakdown of what was *actually* necessary versus what was requested:
* **Required for basic function:**
  * `metrics-collector.internal.cluster.local:8443` – Our internal metrics gateway.
  * `region-1.control-plane.vendor.com:443` – For initial heartbeat and config fetch (could be restricted to a specific IP range).
  * `k8s-api-server:443` – For pod metadata (already internal).
* **Declared as required by vendor docs but entirely unused:**
  * `download.vendor.com` – For "auto-updates" (disabled in our config).
  * `telemetry.vendor.com` – For "product improvement" (disabled).
  * `status.vendor.com` – For "status checks".
* **Completely undocumented and suspicious:**
  * Connections to two separate IPs in a public cloud block on port `8883` (MQTT).
  * Periodic DNS lookups for `pool.ntp.org` (the container has a read-only `/etc/localtime` and we provide NTP servers via pod spec).
The actionable takeaway: Your allowlist design must start from zero egress, then build from observed necessity, not vendor paperwork. This sidecar is now a permanent fixture in our agent deployment pipeline. Every new version gets a 72-hour observation period in the staging environment before its network policy is hardened and pushed to production.
The next step is to convert these logs into automated, version-controlled NetworkPolicy manifests. If the agent attempts a new outbound connection after an update, the sidecar logs it, the pipeline flags it, and a human has to justify the addition to the allowlist. No more blind trust.
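The "pipeline flags it" step comes down to checking each observed destination against the current allowlist. A rough sketch of that check, assuming the allowlist is kept as IPv4 CIDR-plus-port rules; the `allow_rule` struct and the `needs_review` name are made up for illustration:

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One allowlist rule: IPv4 CIDR plus destination port (hypothetical layout). */
struct allow_rule {
    const char *cidr_base;  /* network address, e.g. "10.96.0.0" */
    unsigned    prefix_len; /* 0..32 */
    uint16_t    port;       /* host byte order */
};

/* True if ip:port (dotted-quad string, host-order port) matches the rule. */
static bool rule_matches(const struct allow_rule *r, const char *ip, uint16_t port)
{
    struct in_addr base, dst;

    if (port != r->port || r->prefix_len > 32)
        return false;
    if (inet_pton(AF_INET, r->cidr_base, &base) != 1 ||
        inet_pton(AF_INET, ip, &dst) != 1)
        return false;
    /* /0 is special-cased: shifting a 32-bit value by 32 is undefined. */
    uint32_t mask = r->prefix_len == 0 ? 0 : 0xFFFFFFFFu << (32 - r->prefix_len);
    return (ntohl(base.s_addr) & mask) == (ntohl(dst.s_addr) & mask);
}

/* True if the observed destination is covered by no rule,
 * i.e. a human has to justify it before it joins the allowlist. */
bool needs_review(const struct allow_rule *rules, size_t n,
                  const char *ip, uint16_t port)
{
    for (size_t i = 0; i < n; i++)
        if (rule_matches(&rules[i], ip, port))
            return false;
    return true;
}
```

With rules for `10.96.0.0/12` on 8443 and `203.0.113.0/24` on 443, a connection to `203.0.113.9:8883` would be flagged because the port has no rule, even though the address is in an allowed range.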
- Leo
Least privilege, always.
Oh wow, that's a really clever approach. I've seen people try to do this with tcpdump at the pod level, but the context gets so messy. Using eBPF to tie it directly to the process is brilliant.
Did you find any of the outbound calls were truly surprising, or was it mostly the expected telemetry and update servers? I'm wondering how many "normal" agents have hidden pings for license checks or customer success metrics.
This is exactly why I'm into nano agents - you can audit every single line. Makes me want to adapt your sidecar for a little JS-based monitoring tool I run.
~Anna
The eBPF approach is clean for attribution, but you have to be careful about the blind spots. It won't show you connections that bypass the syscall, like a library using a raw socket. I've seen agents do that for "heartbeats" they don't want logged.
>mostly the expected telemetry and update servers
Yes, but the volume was the surprise. One agent was phoning a metrics endpoint every 30 seconds with a full payload, not a simple ping. That's a data exfiltration risk if the endpoint gets compromised.
Nano agents are fine for your own code, but the real problem is vendor bloat. Even if you audit every line of a nano agent, you're still surrounded by other containers making calls you didn't code. The sidecar shows you the actual attack surface, which is almost always bigger than the documented one.
STRIDE or bust
That's a slick method. Tying it directly to the `sys_enter_connect` tracepoint is clever for clarity.
You mentioned the vendor network policies are a joke. Once you got that raw log of 15+ FQDNs, did you try to actually map them back to a *legitimate* purpose from the vendor's docs? I'm curious if there's even a facade of justification, or if they're just completely undocumented callbacks.
Also, why do you think vendors do this? Just sloppiness, or is there a business reason to hide egress calls?
That's a great question about mapping the calls back to the docs. I tried that once with a different agent, and it was a dead end. The documentation just said "requires internet connectivity for core functions" with no port or domain list. No facade at all.
For the business reason, I think it's partly sloppiness from fast development, but also a kind of lazy telemetry. It's easier to send everything to their cloud than to build proper local logging, so they hide the cost and complexity in your network traffic. Makes you wonder what they'd do if egress bandwidth wasn't so cheap for them.
I've been down that road too. The documentation omission isn't just laziness, it's a liability hedge. If they publish a list of FQDNs, they're on the hook when they add a new one without notice. "Requires internet connectivity" is a legal CYA, not a technical spec.
The bandwidth cost observation is sharp. It gets worse: they're not just hiding the cost, they're shifting the *risk* of a BGP hijack or a compromised endpoint entirely onto you. Your data is in their telemetry stream, but it's your network that carries it to an IP they can change at will.
You could argue they should at least provide a `--audit` flag that dumps the DNS resolution list at runtime, but that would require them to *know* what they're doing.
- Ray
Over 15 distinct FQDNs is exactly the kind of data I want to see. Post the actual list and the protocol for each. Which ones were for TLS, which were plain HTTP, and which used a custom port?
Also, you said "the vast majority were not for core fu...". If they weren't for core functions, what were they for? You can't just hint at it. Give a concrete example, like "metrics.foo.com on 443" or "update.bar.com on 8080". Without that, this is just another anecdote.
hm
I appreciate the push for concrete data, but posting the actual FQDNs would border on doxxing the vendor. I can categorize them, though, which might be more generically useful.
Of the 17 unique domains logged, the breakdown was roughly:
- 8 to vendor-controlled subdomains of a major cloud provider (all TLS/443). These were for metrics, error reporting, and a "runtime config" service.
- 4 to third-party analytics and crash reporting services (TLS/443). This was the truly unexpected bit - the agent was bundling libraries that phoned home independently.
- 3 to internal corporate endpoints (non-routable from my cluster, which suggests lazy development defaults).
- 2 to update servers (HTTP/80, no TLS). This was the most concerning from a security perspective.
Your request for protocol detail is spot on. The TLS connections aren't inherently benign, but they at least obscure the payload. The plain HTTP calls to update servers were transmitting the full agent version, a machine ID, and the cluster name in the clear. That's a reconnaissance goldmine.
The "core functions" line I referenced was from the vendor's support site. In practice, only 2 of the 17 domains were listed there as required for "license validation" and "critical security updates." The other 15 fell under "performance monitoring" or were entirely undocumented. The volume of data sent to the metrics endpoints, sampled every 30 seconds, dwarfed the actual license check payload by two orders of magnitude.
theory meets practice
That's a solid tracepoint for initial attribution, but you're only seeing the first leg of a connection's lifecycle. Many agents, particularly the ones doing telemetry, will initiate with `connect` then immediately spawn a thread or child process to handle the actual data transfer. Your eBPF program will show the initial outbound SYN, but the PID/command logged might be a short-lived launcher, not the component writing the payload.
You need to also hook `sys_enter_sendto` or trace the socket descriptor through its lifetime to see which binary is actually pumping bytes. I've seen agents where the `connect` originates from a generic networking library, but the sensitive data is shoved into the socket by a separate, obfuscated module. Your log would show a benign parent process, missing the real culprit.
Your choice of the `sys_enter_connect` tracepoint is a good starting point for visibility, but I need to push back on its completeness for this threat model. You're only capturing the initiation of a TCP stream. A sophisticated agent, particularly one trying to obfuscate its telemetry, could easily use a pre-connected socket inherited from a parent process, or use `sendto`/`sendmsg` on an already-established UDP or raw socket, bypassing the `connect` syscall entirely.
The more critical blind spot is encrypted payload inspection. Your log shows `destination IP:port` and protocol, but the contents of the POST to `metrics.foo.com:443` are opaque. An agent could be embedding environment variables, partial configuration, or system fingerprints in those TLS payloads. You've mapped the exfiltration channels, but not the data volume or sensitivity. For a true audit, you'd need to pair this with a sidecar that performs TLS interception using a CA the agent has been made to trust, which introduces its own complexity (and fails outright if the agent pins certificates).
Still, your data on the sheer number of distinct FQDNs is the real value here. It quantifies the attack surface shift. Even if you can't see the data, you now know how many external endpoints you're implicitly trusting not to be compromised or malicious.
Your eBPF approach is a correct first step for mapping the declared attack surface, but you're right to be infuriated. Over 15 distinct FQDNs from a single monitoring agent indicates a clear violation of the principle of least privilege that should be contractually mandated.
The critical next step isn't just logging, it's policy enforcement. You now have the empirical data to author a NetworkPolicy or CiliumNetworkPolicy that denies all egress, then whitelists only the endpoints required for the agent's *documented* function. The remaining connections will fail, and you can present those failure logs to the vendor as a compliance defect.
Without that translation into enforced policy, you're just documenting the bleed. The vendor's "wide open egress" default becomes your liability in an audit.
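To make the deny-all-then-allow step concrete, here is a small sketch that renders a standard `networking.k8s.io/v1` NetworkPolicy from a list of observed destinations. The function name, the `egress_rule` struct, and the `app` label key are illustrative assumptions; the manifest fields themselves are the stock NetworkPolicy schema.

```c
#include <stddef.h>
#include <stdio.h>

/* One allowed destination: an IPv4 CIDR and a TCP port (hypothetical layout). */
struct egress_rule {
    const char *cidr;  /* e.g. "203.0.113.0/24" (documentation range) */
    unsigned    port;
};

/* Render a NetworkPolicy that selects pods labelled app=<app> and allows
 * egress only to the given rules. With n == 0 the egress list is empty,
 * so all egress from the selected pods is denied.
 * Returns the number of bytes written, or -1 if out is too small. */
int render_policy(char *out, size_t outlen, const char *name, const char *app,
                  const struct egress_rule *rules, size_t n)
{
    int w = snprintf(out, outlen,
        "apiVersion: networking.k8s.io/v1\n"
        "kind: NetworkPolicy\n"
        "metadata:\n"
        "  name: %s\n"
        "spec:\n"
        "  podSelector:\n"
        "    matchLabels:\n"
        "      app: %s\n"
        "  policyTypes:\n"
        "  - Egress\n"
        "  egress:%s\n", name, app, n == 0 ? " []" : "");
    if (w < 0 || (size_t)w >= outlen)
        return -1;
    size_t used = (size_t)w;

    for (size_t i = 0; i < n; i++) {
        w = snprintf(out + used, outlen - used,
            "  - to:\n"
            "    - ipBlock:\n"
            "        cidr: %s\n"
            "    ports:\n"
            "    - protocol: TCP\n"
            "      port: %u\n", rules[i].cidr, rules[i].port);
        if (w < 0 || (size_t)w >= outlen - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

Two caveats with this approach: a policy like this also blocks DNS unless a rule for the cluster resolver is added, and plain NetworkPolicy matches only IP blocks, so FQDN-based rules need something like CiliumNetworkPolicy.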
Policy is code
Good point about logging the initial SYN. I've seen the same pattern where the connect call is just a shell game. The real payload gets handed off.
Your eBPF method is perfect for the initial map, but to get the full picture you'll need to also trace socket writes and ideally correlate them back to the original connection. A lot of these agents use connection pools, so the PID that does the send might be totally different from the one that opened the socket.
Did you consider adding a second tracepoint for `sys_enter_sendto` and trying to match the socket descriptor? It's more complex but you'd catch the actual data movement, not just the door opening.
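For what the descriptor matching could look like once both kinds of events reach user space, here is a rough sketch of the bookkeeping. It assumes the eBPF side reports `(tgid, fd, tid)` for both connect and send events, which is an assumption about the event format, and every name below is hypothetical. The table is keyed on `(tgid, fd)` because a descriptor number only means something inside one process.

```c
#include <stdint.h>
#include <string.h>

#define MAX_TRACKED 256

/* Who opened a socket, keyed by (tgid, fd). */
struct sock_owner {
    uint32_t tgid;
    int      fd;
    uint32_t opener_tid;
    char     dest[48];   /* "ip:port" as logged at connect time */
    int      in_use;
};

static struct sock_owner table[MAX_TRACKED];

/* Record a connect() event. Returns 0, or -1 if the table is full. */
int on_connect(uint32_t tgid, int fd, uint32_t tid, const char *dest)
{
    struct sock_owner *slot = NULL;

    for (int i = 0; i < MAX_TRACKED; i++) {
        if (table[i].in_use && table[i].tgid == tgid && table[i].fd == fd) {
            slot = &table[i];          /* fd was reused: overwrite stale entry */
            break;
        }
        if (!table[i].in_use && slot == NULL)
            slot = &table[i];          /* remember the first free slot */
    }
    if (slot == NULL)
        return -1;
    slot->tgid = tgid;
    slot->fd = fd;
    slot->opener_tid = tid;
    strncpy(slot->dest, dest, sizeof(slot->dest) - 1);
    slot->dest[sizeof(slot->dest) - 1] = '\0';
    slot->in_use = 1;
    return 0;
}

/* Attribute a send event. Returns 0 if the writer is the thread that opened
 * the socket, 1 if a different thread is writing, and -1 if no connect was
 * ever seen for this socket (inherited or pre-connected). On 0 or 1, *dest
 * points at the destination recorded at connect time. */
int on_send(uint32_t tgid, int fd, uint32_t tid, const char **dest)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (table[i].in_use && table[i].tgid == tgid && table[i].fd == fd) {
            *dest = table[i].dest;
            return table[i].opener_tid == tid ? 0 : 1;
        }
    }
    return -1;
}
```

The `-1` case is the interesting one for an audit: bytes leaving on a socket nobody was seen opening. A real version would also need to drop entries on `close` and deal with descriptors passed across `fork`, neither of which this sketch attempts.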
Model theft is the new SQL injection.
You're only seeing the front door. That `sys_enter_connect` hook is blind to any outbound traffic that reuses an existing socket from a connection pool, or that goes out over unconnected UDP. Your list of 15 FQDNs is likely incomplete.
And you logged the destination. Great. What about the payload? Those TLS connections to metrics.foo.com could be sending your entire environment variable list. You've mapped the phone lines, not the conversation.
no default passwords
That's a solid starting point for mapping. I'd also throw a `sys_enter_sendmsg` hook in there to catch writes to already-connected sockets. I've seen agents where the initial connect is from a generic lib, but the sensitive data is shoved in by a different PID from a thread pool.
Your list of 15 FQDNs is probably incomplete. Check for UDP DNS lookups to catch the ones that only resolve for heartbeat or failover endpoints that weren't triggered in your 24-hour window.
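If the DNS query payloads are captured (say, UDP datagrams to port 53), pulling the looked-up name out is straightforward, since the question section follows a fixed 12-byte header. A minimal sketch, with `dns_query_name` as a made-up helper name; it handles only the uncompressed names that ordinary queries carry and does not validate the rest of the packet:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Extract the first question name from a raw DNS query into out as a
 * dotted string. Returns the name length, or -1 if the packet is
 * truncated, uses compression pointers, or out is too small. */
int dns_query_name(const uint8_t *pkt, size_t len, char *out, size_t outlen)
{
    size_t pos = 12;   /* fixed DNS header size */
    size_t o = 0;

    if (len < 13 || outlen == 0)
        return -1;
    while (pos < len) {
        uint8_t label_len = pkt[pos++];
        if (label_len == 0) {          /* root label terminates the name */
            out[o] = '\0';
            return (int)o;
        }
        if (label_len > 63)            /* compression pointer or invalid */
            return -1;
        if (pos + label_len > len)     /* label runs past end of packet */
            return -1;
        if (o + label_len + 2 > outlen)  /* room for '.', label, and NUL */
            return -1;
        if (o > 0)
            out[o++] = '.';
        memcpy(out + o, pkt + pos, label_len);
        o += label_len;
        pos += label_len;
    }
    return -1;                         /* ran out of packet before root label */
}
```

Logging these names next to the connect events would surface endpoints like `pool.ntp.org` that get resolved but never connected to during the observation window.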
--Chris
Totally agree on the blind spot for raw sockets. I've seen a similar pattern with some libraries that open a raw ICMP socket for "latency checks" - completely bypasses any connect or sendto hooks. It's like they're intentionally avoiding the usual syscalls.
And yeah, the volume is the killer. Even when you expect the telemetry, the sheer frequency and payload size can swamp your logs and mask real exfil. I had one case where the "metrics" payload included the hostname and a list of nearby network services. That's not telemetry, that's reconnaissance.
Your last point is the real takeaway for me. You can build the cleanest, most auditable microservice, but if it's sharing a pod with a vendor sidecar, your actual attack surface is whatever that black box decides to do. The sidecar logger isn't just for visibility, it's for proving the delta between the spec and reality.
--Em