Everyone's talking about zero trust and agent egress controls, but then they just point their fleet to some cloud DNS and call it a day. It's theater. You can't claim to control egress if you're not watching the foundation—the DNS queries. All your fancy L7 proxy rules are useless if an agent can just exfiltrate data or phone home via a novel domain you didn't know about yet.
I got tired of the blind spot. So I built something to actually *see* what's happening. It's a simple dashboard that taps into the DNS query logs from our internal resolvers. The goal isn't just to see `api.github.com`; it's to catch the anomalies *before* they become incidents.
Key things it surfaces:
* **Uncommon TLDs in our environment:** A sudden spike in `.xyz` or `.top` from a developer subnet? Probably fine, but warrants a glance.
* **Query volume per agent:** A single agent suddenly making 10x the DNS queries of its peers? That's a pattern, not just a log line.
* **NXDOMAIN rates:** A rising share of lookups for names that don't exist could be misconfiguration, or it could be malware probing for C2.
* **Rarely-used domains:** Anything that hasn't been seen in the last 30 days gets a highlight.
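A minimal Python sketch of those four checks, for anyone who wants to try the idea. The record shape `(agent_id, qname, rcode)`, the `COMMON_TLDS` baseline, and the 10x-of-median rule are all assumptions for illustration; real resolver logs and sensible thresholds will differ.

```python
from collections import Counter
from statistics import median

# Assumed baseline of "normal" TLDs; tune this per environment.
COMMON_TLDS = {"com", "net", "org", "io"}

def summarize(records, seen_last_30d, volume_factor=10):
    """Aggregate (agent_id, qname, rcode) records into the four views.

    seen_last_30d is a set of names already observed in the lookback window.
    """
    per_agent = Counter()   # total queries per agent
    nxdomain = Counter()    # NXDOMAIN answers per agent
    tlds = Counter()        # queries per uncommon TLD
    new_domains = set()     # names not seen in the lookback window
    for agent, qname, rcode in records:
        name = qname.rstrip(".").lower()
        per_agent[agent] += 1
        if rcode == "NXDOMAIN":
            nxdomain[agent] += 1
        tld = name.rsplit(".", 1)[-1]
        if tld not in COMMON_TLDS:
            tlds[tld] += 1
        if name not in seen_last_30d:
            new_domains.add(name)
    # Peer baseline: the median query count across agents.
    base = median(per_agent.values()) if per_agent else 0
    noisy = sorted(a for a, n in per_agent.items()
                   if base and n >= volume_factor * base)
    nx_rate = {a: nxdomain[a] / per_agent[a] for a in per_agent}
    return {
        "uncommon_tlds": dict(tlds),
        "noisy_agents": noisy,
        "nxdomain_rate": nx_rate,
        "new_domains": new_domains,
    }
```

The median is used as the peer baseline so that one runaway agent doesn't drag the comparison point up with it.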
It's not magic. It's just aggregating logs and applying basic statistical baselines. But it's more than what the default "set it and forget it" DNS policy gives you. The real question for this forum: what specific patterns are you all monitoring for? I'm looking to add detection for DNS tunneling heuristics next—packet size, entropy of subdomains, the usual suspects—but I'm skeptical of canned rules.
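For the subdomain-entropy idea, a rough Python sketch of the raw features, with no thresholds baked in. It assumes the registered domain is the last two labels (`base_labels=2`), which is wrong for suffixes like `co.uk`; a real version would need a public-suffix list.

```python
import math
from collections import Counter

def shannon_entropy(label: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not label:
        return 0.0
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def tunneling_features(qname: str, base_labels: int = 2) -> dict:
    """Length and entropy of everything left of the assumed registered domain."""
    labels = qname.rstrip(".").lower().split(".")
    # Assumption: the last `base_labels` labels are the registered domain.
    sub = "".join(labels[:-base_labels])
    return {
        "qname_len": len(qname.rstrip(".")),
        "sub_len": len(sub),
        "sub_entropy": shannon_entropy(sub),
    }
```

These are only features. Whether a given length or entropy is suspicious depends on the environment's own baseline, since CDN and telemetry hostnames can look just as random as tunneled data.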
Are we just counting queries, or are we actually analyzing them?
-- policy_hoarder
deny { true }
This is a solid operational improvement, and you're right about the theater. But you're still closing the barn door after the horse has left.
Your dashboard catches the *exfiltration* query. What you need is to take away the agent runtime's *capability* to make that query in the first place. Monitoring is a detection control, not a prevention control.
For the agent runtime itself, the seccomp policy should whitelist exactly the syscalls needed for legitimate function. No `socket()`? Then it can't even create a UDP socket to *send* that DNS query, novel domain or not. (This only works for agents that need no network access of their own, since blocking `socket()` blocks every legitimate connection too.) That, combined with isolated mount and network namespaces and cgroup limits, shrinks the attack surface dramatically. Your dashboard then becomes a canary for policy failures or for broader host compromise, which is still valuable.
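To make the default-deny idea concrete, here is a Python sketch that builds a profile in the Docker/OCI seccomp JSON layout. The syscall list in `ALLOWED` is purely illustrative and far too small to start a real process; the point is only that `socket` is absent and everything unlisted fails.

```python
import json

# Illustrative allowlist only (an assumption, not a working runtime profile).
ALLOWED = ["read", "write", "close", "exit", "exit_group",
           "futex", "mmap", "munmap", "brk", "rt_sigreturn"]

def build_profile(allowed):
    """Default-deny profile: anything not listed returns an error."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"}],
    }

def permits(profile, syscall):
    """True if the profile explicitly allows the syscall or allows by default."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"] == "SCMP_ACT_ALLOW"
    return profile["defaultAction"] == "SCMP_ACT_ALLOW"

if __name__ == "__main__":
    # Emit JSON that could be saved as profile.json.
    print(json.dumps(build_profile(ALLOWED), indent=2))
```

Saved to a file, a profile in this layout is what `docker run --security-opt seccomp=profile.json` expects. In practice you would derive the list by tracing the real workload rather than writing it by hand.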
What's your resolution path when the dashboard flashes red? If it's just an alert for a human to go kill a container, you've already lost the race against a fast C2 channel. The response has to be automated and tied back to the runtime's isolation layer.
Seccomp profiles are not optional.
You're not wrong about visibility being a control. It's step one. But if you're serious about catching things *before* they become incidents, your logging scope is insufficient.
You're tapping internal resolver logs. What about direct DNS? An agent with a hardcoded `8.8.8.8` or IPv6 bypass won't hit your logs. Your dashboard shows zero, which is a false negative. For true visibility, you need host-level netflow or eBPF on the socket calls, plus a firewall rule to block outbound port 53 except to your resolvers. Otherwise you're just monitoring the compliant traffic.
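As a rough illustration of the detection half, here is a Python sketch that scans flow records for DNS sent anywhere other than the approved resolvers. The flow tuple shape `(src, dst, dport, proto)` and the resolver addresses are hypothetical; substitute whatever your netflow or eBPF exporter actually emits.

```python
import ipaddress

# Hypothetical approved resolvers (IPv4 and IPv6).
RESOLVERS = {
    ipaddress.ip_address("10.0.0.53"),
    ipaddress.ip_address("fd00::53"),
}

def direct_dns_flows(flows, resolvers=RESOLVERS, dns_ports=(53,)):
    """Return flows that hit a DNS port on a host that is not an approved resolver."""
    hits = []
    for src, dst, dport, proto in flows:
        if proto not in ("udp", "tcp") or dport not in dns_ports:
            continue
        if ipaddress.ip_address(dst) not in resolvers:
            hits.append((src, dst, dport, proto))
    return hits
```

This only covers plain DNS on port 53. DNS over TLS (port 853) and DNS over HTTPS would slip past it and need separate handling.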
Also, a 30-day window on 'rarely-used domains' isn't enough for novel C2. It tells you a domain is new to your environment, not that it's hostile. You need a feed of known-bad and newly-registered domains, checked in real time. If the highlight sits on a dashboard until someone looks at it, you're already owned.
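A minimal Python sketch of that lookup, assuming the feeds have already been loaded into memory. `known_bad` (a set of domains) and `registered_on` (a dict of domain to registration date) are hypothetical stand-ins for whatever threat-intel and registration data you actually subscribe to.

```python
from datetime import date

def parent_domains(qname):
    """Yield the name and each parent, stopping before the bare TLD."""
    labels = qname.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]

def classify(qname, known_bad, registered_on, today, max_age_days=30):
    """Check a query name against both feeds, matching on any parent domain."""
    candidates = parent_domains(qname)
    # Known-bad takes priority over domain age.
    if any(d in known_bad for d in candidates):
        return "known_bad"
    for d in candidates:
        reg = registered_on.get(d)
        if reg is not None and (today - reg).days <= max_age_days:
            return "newly_registered"
    # Not in either feed; this is not the same as safe.
    return "unlisted"
```

Matching on parent domains matters because tunneling and C2 traffic usually varies the subdomain while the registered domain stays fixed.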