Just built a tiny sidecar that logs all outbound connection attempts

Leo M. · 2026-06-24T05:04:49Z

Just spent the last three days instrumenting a test pod to see what our monitoring agents are *actually* trying to do on the network. The default network policies from the vendor are a joke: wide-open egress to half the internet. So I built a minimal sidecar that uses an eBPF program attached to cgroup sockets to log every single outbound connection attempt (SYN packet) from the target container. The results were exactly as infuriating as expected.

Here is the core of the tracing logic. It logs process ID, command, destination IP:port, and the protocol.

```c
// This is a simplified skeleton of the eBPF program (TCPSnoop)
SEC("tracepoint/syscalls/sys_enter_connect")
int trace_connect_entry(struct trace_event_raw_sys_enter *ctx)
{
    u16 port = 0;
    u32 addr = 0;

    // ... logic to pull sockaddr from ctx->args[1]
    // ... resolve PID, TGID, command from current task struct

    // Push event to a perf buffer
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &data, sizeof(data));
    return 0;
}
```

The raw log after 24 hours of the agent running its "normal" workload revealed over 15 distinct FQDNs and IP ranges being contacted. The vast majority were not for core functionality. Here is the breakdown of what was *actually* necessary versus what was requested:

* **Required for basic function:**
    * `metrics-collector.internal.cluster.local:8443` – Our internal metrics gateway.
    * `region-1.control-plane.vendor.com:443` – For initial heartbeat and config fetch (could be restricted to a specific IP range).
    * `k8s-api-server:443` – For pod metadata (already internal).
* **Declared as required by vendor docs but entirely unused:**
    * `download.vendor.com` – For "auto-updates" (disabled in our config).
    * `telemetry.vendor.com` – For "product improvement" (disabled).
    * `status.vendor.com` – For "status checks".
* **Completely undocumented and suspicious:**
    * Connections to two separate IPs in a public cloud block on port `8883` (MQTT).
    * Periodic DNS lookups for `pool.ntp.org` (the container has a read-only `/etc/localtime` and we provide NTP servers via pod spec).

The actionable takeaway: Your allowlist design must start from zero egress, then build from observed necessity, not vendor paperwork.

This sidecar is now a permanent fixture in our agent deployment pipeline. Every new version gets a 72-hour observation period in the staging environment before its network policy is hardened and pushed to production.

The next step is to convert these logs into automated, version-controlled NetworkPolicy manifests. If the agent attempts a new outbound connection after an update, the sidecar logs it, the pipeline flags it, and a human has to justify the addition to the allowlist. No more blind trust.

- Leo
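
The flagging step in that pipeline (the sidecar logs a destination, the pipeline checks it against the allowlist, a human reviews anything new) could be sketched roughly as follows. This is only an illustration: the `struct dest` record, the function names `dest_allowed` and `flag_new_destinations`, and the exact host-and-port matching are all assumptions, since the post does not show the sidecar's log format or the pipeline code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One observed or allowed egress destination.
 * Hypothetical record layout; the real sidecar log format is not shown. */
struct dest {
    const char *host;   /* FQDN or IP string as logged */
    unsigned    port;
};

/* Returns 1 if d matches one of the n entries in list exactly
 * (same host string and same port), otherwise 0. */
int dest_allowed(const struct dest *d, const struct dest *list, size_t n)
{
    assert(d != NULL && (list != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        if (d->port == list[i].port && strcmp(d->host, list[i].host) == 0)
            return 1;
    }
    return 0;
}

/* Copies each distinct observed destination that is NOT on the allowlist
 * into out and returns how many were stored (at most out_cap).
 * Size out_cap >= n_seen to be sure nothing is dropped. */
size_t flag_new_destinations(const struct dest *seen, size_t n_seen,
                             const struct dest *allow, size_t n_allow,
                             struct dest *out, size_t out_cap)
{
    size_t flagged = 0;
    assert(out != NULL || out_cap == 0);
    for (size_t i = 0; i < n_seen && flagged < out_cap; i++) {
        if (dest_allowed(&seen[i], allow, n_allow))
            continue;   /* already justified on the allowlist */
        if (dest_allowed(&seen[i], out, flagged))
            continue;   /* already flagged once, skip the duplicate */
        out[flagged++] = seen[i];
    }
    return flagged;
}
```

A CI gate could call `flag_new_destinations` with the destinations parsed from the staging observation log and fail the build when the return value is non-zero. The exact string match is deliberately strict; IP ranges or wildcard hosts would need a different comparison.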

Allowlist Design for Agent Network Access

Last Post by Omar Hassan 4 days ago

19 Posts

18 Users

0 Reactions

3 Views

Bob Thornton

(@contrarian_risk_bob)

Active Member

Joined: 1 week ago

Posts: 13

June 25, 2026 5:03 pm

Fifteen FQDNs logged and you're infuriated. Wait until you correlate that to your actual business risk. What's the worst they're sending? Your pod's CPU average?

You're not running a spy agency. This is a monitoring agent. It's going to call home. Your list is good for writing a tight network policy. Do that instead of complaining about the map.

What is the actual threat?

Chris P.

(@shed_sysadmin)

Eminent Member

Joined: 1 week ago

Posts: 19

June 25, 2026 7:09 pm

>Your pod's CPU average?

Last breach we handled started with an "innocent" monitoring agent uploading a hashed environment file that included a temporary S3 credential. It's not about the metric, it's about the vector.

You're right about the network policy, but you can't write an effective one if you don't know the full call pattern. That's the point of the map. You think those 15 FQDNs are the whole story? They never are.

--Chris

Ava Carter

(@agent_network_architect)

Active Member

Joined: 1 week ago

Posts: 14

June 25, 2026 8:07 pm

>It's not about the metric, it's about the vector.

Precisely. That's the core distinction between telemetry and exfiltration. A permitted destination is not a permitted payload. The network policy that whitelists `metrics.foo.com:443` still permits the agent to transmit any data it can serialize and encrypt within that single TLS session.

The real failure mode I've seen is when the agent's architecture treats "metric collection" and "system state collection" as the same data-gathering pipeline. A single allowed egress endpoint becomes the conduit for both. Your S3 credential example is a perfect illustration: the vector was authorized, the content was not.

This is why mapping is just step one. The subsequent step has to be payload inspection, either via a MITM proxy with DLP rules or, more commonly now, agent-level attestation that the library is only sending signed, expected metrics. Without that, you're just trusting the vendor's definition of "metric," which tends to expand over time.
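
The "expected metrics" half of that attestation idea can be illustrated with a key-level check: every field name in an outbound payload must appear in a manifest of approved metric names. The sketch below is hypothetical throughout. The functions `key_in_manifest` and `first_unexpected_key` and the flat list-of-keys payload shape are assumptions, and it does not cover signature verification, which would need a crypto library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if key appears in the manifest of approved metric names. */
int key_in_manifest(const char *key,
                    const char *const *manifest, size_t n_manifest)
{
    assert(key != NULL);
    for (size_t i = 0; i < n_manifest; i++) {
        if (strcmp(key, manifest[i]) == 0)
            return 1;
    }
    return 0;
}

/* Returns the index of the first payload key that is missing from the
 * manifest, or -1 if every key is approved (an empty payload passes). */
long first_unexpected_key(const char *const *keys, size_t n_keys,
                          const char *const *manifest, size_t n_manifest)
{
    for (size_t i = 0; i < n_keys; i++) {
        if (!key_in_manifest(keys[i], manifest, n_manifest))
            return (long)i;
    }
    return -1;
}
```

With a made-up manifest of `cpu.avg`, `mem.rss`, and `disk.io`, a payload that also carried a made-up `env.file_hash` field would be rejected at that key even though its destination is allowed. A check like this only covers field names; it says nothing about what the values contain.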

segment first

Omar Hassan

(@sysadmin_prod)

Eminent Member

Joined: 1 week ago

Posts: 20

June 25, 2026 9:27 pm

That's the exact pivot point. You can have perfect network policies and still get burned because the session is authorized.

The MITM approach is technically correct but introduces a huge new operational surface - now you're managing CA certs, DLP rule tuning, and decryption performance at scale. I've seen teams implement it, then disable it during an incident because it broke the vendor's telemetry and they couldn't troubleshoot.

The agent-level attestation you mentioned is the more sustainable path, but it requires the vendor to actually provide a verifiable, immutable manifest of what the library sends. Good luck getting that contractually. Most just give you a PDF that says "trust us."

So you're stuck. Block everything and break functionality, or allow a tunnel and hope their definition of "metric" doesn't creep.

automate, audit, repeat
