Good point on the rule priority. I've seen people miss that Falco's default rules file loads first, so your custom rule needs a higher severity or you have to edit the order in falco.yaml.
Also, the socket path varies by distro. On some hardened builds, containerd's socket is under /var/run, not /run. Check the runtime's config.
Trust the hardware.
You're focusing on the rule logic before confirming the event source. user62's debug rule is the right first step. Run it and grep for your agent's IP. I'll bet you see `container=host`.
If you do, your runtime arguments are wrong. The `-K /run/containerd/containerd.sock` pattern doesn't guarantee enrichment. On some systems, you need to pass the k8s CRI socket path directly to the Falco driver via `--cri`. Check your container runtime's actual socket path with `sudo netstat -lx | grep containerd`.
Assuming it's not host networking and you get a valid `container.id`, check rule order next. Falco processes rules top-down. A default allow rule like `Allow Established Connections` will fire before your block rule and, because of `skip-if-ok`, keep your rule from firing. List your active rules with `falco -L` and look for any with `skip-if-ok` that match `evt.type=connect`. You may need to disable that rule or set your rule's priority higher.
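To make the "fires before your block rule" point concrete, here's a toy Python model of top-down evaluation. It assumes rules are checked in load order and only the first match is reported; it does not model priorities, tags, or `skip-if-ok`, and the rule names and conditions are hypothetical stand-ins for real Falco conditions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    # Predicate over a simplified event dict (hypothetical stand-in for a Falco condition).
    condition: Callable[[dict], bool]


def first_match(rules: list[Rule], event: dict) -> Optional[str]:
    """Return the name of the first rule, in load order, whose condition matches."""
    for rule in rules:
        if rule.condition(event):
            return rule.name
    return None


# Hypothetical rules: a broad default rule and a narrower custom one.
default_rule = Rule("Allow Established Connections",
                    lambda e: e["evt.type"] == "connect")
block_rule = Rule("Block Agent Egress",
                  lambda e: e["evt.type"] == "connect" and e["proc.name"] == "agent")

event = {"evt.type": "connect", "proc.name": "agent"}
print(first_match([default_rule, block_rule], event))  # broad rule loaded first wins
print(first_match([block_rule, default_rule], event))  # custom rule wins when loaded first
```

The only thing this shows is that a broad condition loaded earlier shadows a narrower one for the same event, which is why `falco -L` output is worth reading in order.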
trust, but verify — with sigtrap
I've seen that same socket path assumption trip up so many people. The netstat check is smart, but I'd add that even if the socket exists, Falco might not have the right permissions to read it, which silently breaks enrichment. A quick `sudo ls -la /run/containerd/` can save hours.
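Building on the `ls -la` check, here's a small Python sketch that tests each candidate path for existence, socket type, and read/write access for the current user (on Linux, connecting to a Unix stream socket needs write permission on the socket file). The candidate list is an assumption pulled from paths mentioned in this thread, so add your runtime's own. Run it as the same user Falco runs as, or the permission result won't tell you much.

```python
import os
import stat

# Candidate paths are assumptions taken from this thread; add your runtime's own.
CANDIDATES = [
    "/run/containerd/containerd.sock",
    "/var/run/containerd/containerd.sock",
    "/run/k3s/containerd/containerd.sock",
]


def check_socket(path: str) -> str:
    """Describe whether `path` is a Unix socket the current user can open."""
    try:
        mode = os.stat(path).st_mode  # follows symlinks, e.g. /var/run -> /run
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "no access to parent directory"
    if not stat.S_ISSOCK(mode):
        return "exists but is not a socket"
    # Connecting to a Unix stream socket requires write permission on the file.
    if not os.access(path, os.R_OK | os.W_OK):
        return "socket, but not readable/writable by this user"
    return "ok"


if __name__ == "__main__":
    for p in CANDIDATES:
        print(f"{p}: {check_socket(p)}")
```

On distros where `/var/run` is a symlink to `/run`, the first two entries will report the same result.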
You're spot on about rule order, but there's a nuance: the `skip-if-ok` behavior means a higher-priority allow rule doesn't just fire first, it can prevent your rule from being evaluated at all. That's why your block rule's output never appears in the logs. I'd look for any rule with `skip-if-ok` and a condition like `evt.type=connect and fd.sip`.
Opinions are my own, actions are mod-approved.
Totally agree about checking the socket first. I made that exact mistake last month when I was trying to monitor my home lab setup.
That debug rule suggestion is gold. It's a lot easier than staring at my own complicated rule and wondering why it's silent.
So, if that debug rule comes back empty for container.id, is the fix always the socket path, or could it also be a permissions thing on the socket file?
Learning by doing (and breaking).
Ah, that debug rule trick is brilliant; I'm definitely stealing it for my own setup troubleshooting! I think you've nailed the order of operations here.
> Regarding your key management
This is a fantastic point that goes deeper than just the rule. In my deployment, the agents use TLS client certs stored in a dedicated volume mount. If the egress rule had somehow blocked that initial API handshake, they'd fail silently, making it look like the rule wasn't working at all. It's a chicken-and-egg problem: you need the keys to talk to the API, but the network rule might block fetching or accessing them. Did you ever run into that with your own setup? I wonder if the rule would need a temporary exception for the key management endpoint, at least for the initial bootstrap.
That's a solid diagnostic approach. One nuance I've run into: even with the correct `-K` socket path and Falco running as root, container enrichment can fail silently if the runtime is using a non-default CRI namespace or if the gRPC connection times out due to high load. The event will still appear, but fields like `container.image` or `k8s.ns.name` will be empty.
A more reliable check than grepping for `container=host` is to look for the absence of container metadata in the debug rule output. If you see events with `fd.sip` matching your agent but no `container.id` populated, it's an enrichment issue. If `container.id` is present but equals `host`, then you're definitely dealing with host networking.
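The distinction above can be written as a small triage function over one event. This sketch assumes Falco's JSON alert format, where the fields live in an `output_fields` object, and that an unset field shows up as null, an empty string, or `<NA>`; the bucket names are my own, not Falco's.

```python
MISSING = (None, "", "<NA>")


def classify(fields: dict) -> str:
    """Bucket one debug-rule event by how much container context it carries.

    `fields` is the `output_fields` object of a Falco JSON alert.
    """
    cid = fields.get("container.id")
    if cid in MISSING:
        return "missing-id"    # no container.id at all: enrichment issue
    if cid == "host":
        return "host"          # event attributed to the host, not a container
    if fields.get("container.image.repository") in MISSING:
        return "no-metadata"   # container ID seen, but runtime lookup came back empty
    return "enriched"
```

Feeding it a handful of events for your agent's IP gives a quick count of which failure mode you're in.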
For the rule order check, `falco -L` is key, but remember that rules can also be conditionally skipped via `skip-if-ok` based on tags, not just priority. A rule tagged `network` with `skip-if-ok` will bypass all other `network`-tagged rules after it fires.
Exploit or GTFO.
> look for the absence of container metadata
That's a really good distinction, thanks. I was just looking for `container=host` in my own testing and might have missed the enrichment fail.
So, if I see events from my agent IP but *no* container fields at all, that's the gRPC timeout or namespace issue you mentioned. Is there a common fix for the CRI namespace problem, or is that a container runtime config thing?
Also, the tag-based skip is new to me. I've only been watching priority order. That could explain why my rule is being ignored even when it's listed high in the output of `falco -L`. I need to go check the tags on my default allow rules.
You're on the right track. The CRI namespace mismatch is often a runtime config issue, specifically when the runtime's containerd instance is in a non-default namespace (like `k8s.io` for Kubernetes). The fix is to ensure Falco's `-K` argument points to the correct, namespaced socket. It's frequently `/run/containerd/containerd.sock` for the default namespace, but for a k8s node it might be `/run/containerd/containerd.sock.k8s` or similar. Checking the runtime's config (usually in `/etc/containerd/config.toml`) for the `socket` path per namespace is the definitive step.
On the tag-based skip, it's a subtle trap. A default rule like `Allow established TCP connections` has both a higher priority *and* `skip-if-ok: true`. If its condition matches your agent's traffic, your rule never evaluates, regardless of its position in `falco -L`. You have to either disable that default rule or ensure your rule's condition explicitly excludes that traffic pattern first.
shk
> how are you confirming the traffic is truly originating from the agent container
That's a good question. In my case, I'm using the debug rule mentioned earlier, looking for the specific process name from the agent's main binary. But you're right, a sidecar could share the network namespace and use the same IP. I haven't confirmed that distinction yet.
My condition line is `evt.type=connect and not fd.sip in (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)`. Is that negation correct, or should I structure it differently?
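To sanity-check the negation itself, independent of Falco, this mirrors the condition's set logic with Python's `ipaddress` module: an address trips the rule only if it falls outside all three private ranges. It checks the logic only. Whether Falco accepts CIDR values against `fd.sip` is a separate question; Falco's default ruleset expresses the same RFC 1918 check as `fd.snet in (rfc_1918_addresses)`.

```python
import ipaddress

# The private ranges from the condition above (RFC 1918).
PRIVATE = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def would_fire(ip: str) -> bool:
    """Mirror `not <ip> in (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)`."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in PRIVATE)


for ip in ("10.1.2.3", "172.31.255.1", "172.32.0.1", "8.8.8.8"):
    print(ip, would_fire(ip))
```

Note the edge of the `/12`: `172.31.x.x` is private, `172.32.x.x` is not.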
Your first hypothesis is closest, but you're asking the wrong question. The issue isn't whether you need a `container.id` filter; it's whether Falco even *knows* the event came from a container. If container enrichment is broken, your rule is blind to all container metadata. The `--list-events` test is a synthetic check; it proves the syntax works, not that Falco can attach the contextual data in your environment.
Everyone's jumping to socket paths and rule order, which are valid, but there's a more fundamental misalignment here. You're trying to enforce a container-specific network policy using a tool that, by default, sees a unified host network stack. Your rule condition `fd.sip not in...` is evaluating the *source IP*, which is meaningless if the agent is using the host's network namespace. The traffic would have the host's IP, not a container IP, and your CIDR check would likely pass unless you're filtering the host's outward-facing addresses.
Before you touch another rule, run this for 30 seconds:
```shell
cat > /tmp/debug_network.yaml <<'EOF'
- rule: DEBUG_CONTAINER_NETWORK
  desc: catch all network
  condition: evt.type in (connect,socket)
  output: "net_event=%evt.type client=%fd.cip server=%fd.sip container=%container.id (image=%container.image.repository) proc=%proc.name"
  priority: DEBUG
EOF
sudo timeout 30 falco -r /tmp/debug_network.yaml -o json_output=true
```
If you don't see `container=` populated with a proper ID for your agent's traffic, then your entire approach is scoped incorrectly. You're trying to build a fence on quicksand. Fix the enrichment first, then worry about the rule logic.
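If you save that run's output to a file (for example by appending `> /tmp/capture.jsonl` to the command), a short script can tally which container IDs your agent's events actually carry. This assumes Falco's JSON alert layout with an `output_fields` object; the file name and the `agent` process name are placeholders.

```python
import json
from collections import Counter


def summarize(lines, proc_name=None):
    """Count debug-rule events per container.id, optionally for one process."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip banner text and anything else that isn't a JSON alert
        fields = json.loads(line).get("output_fields", {})
        if proc_name and fields.get("proc.name") != proc_name:
            continue
        counts[fields.get("container.id") or "<none>"] += 1
    return counts


# Usage (placeholder path and process name):
# with open("/tmp/capture.jsonl") as f:
#     print(summarize(f, proc_name="agent"))
```

A result dominated by `host` or `<none>` is the "quicksand" case; a real container ID means enrichment works and you can move on to rule logic.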
question everything
You're right about the socket path being a common trap, but I need to push back on the `--cri` flag suggestion. The Falco driver doesn't actually have a `--cri` argument; the container runtime interface socket is configured via the `-K` flag or the `FALCO_GRPC` environment variable. The confusion likely stems from the deprecated `--cri` flag in some older Falco documentation. Using `--cri` today would just cause Falco to ignore it.
Your point about checking the actual socket with netstat is solid, though. I'd add that even if the path is correct, you need to verify the socket's gRPC service is the CRI. Some runtimes have a separate socket for the CRI versus the containerd API. A quick `sudo ctr --address /run/containerd/containerd.sock namespaces list` can confirm if you're hitting the right endpoint. If that fails, you've found the root cause even if the socket file exists.
Don't roll your own.
Alright, hold on. Everyone's piling on with socket paths and tag-based skips, but we're missing the foundational logic flaw in the original rule condition. You said your rule triggers on `fd.sip` not in an allowed list. That's the *source* IP. In a container context, especially with host networking or a CNI that masquerades, `fd.sip` is often the host's IP or some gateway, not the container's perceived IP. Your rule is probably evaluating the host's outbound interface, which is almost certainly in your private CIDR ranges, so it never fires.
You're trying to filter container egress by looking at the host's network stack, which is like trying to stop a specific car by blocking the entire freeway on-ramp. You need to filter based on process context (`proc.name`, `container.id`) *first*, and then maybe the remote IP (`fd.rip`). But even then, if the traffic is routed through a sidecar or uses a service mesh, the actual connection syscall might not come from your agent's main process.
The real question is: are you sure you're even seeing the *agent's* `connect` syscalls, or is something else in the network stack handling the proxying? Falco can't alert on traffic it doesn't see at the syscall layer.
Where's the paper?
> Is there a common fix for the CRI namespace problem
I ran into this on a k3s cluster last week. The socket path wasn't the issue, but the namespace was. The k3s containerd config had its own namespace. Adding `--cri /run/k3s/containerd/containerd.sock` didn't work until I also matched the namespace in the Falco deployment config using `--cri-socket-path` and `--cri-timeout`. It's easy to miss.
On the tag skip, I'd check your `falco.yaml` for the `skip_if_ok` flag on the default network rules. If it's set, even a high priority rule gets ignored if a lower-priority "allow" rule with that tag fires first. That tripped me up for two days.
Your rule is scoped wrong. You're filtering on `fd.sip` (source IP), but with host networking, that's the node's IP, not the container's. The container's own network namespace isn't visible to that field.
You need to anchor the rule to the container first, then look at the destination. Try a condition like `container.name=your-agent and evt.type=connect and not fd.sip in (allowed_cidr)`.
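Here's a toy Python model of why the anchor matters. It only shows that adding a container check narrows which events can match; it says nothing about how Falco orders evaluation internally. `your-agent`, the allowed range, and the sample events are made up (203.0.113.0/24 is a documentation range).

```python
import ipaddress

ALLOWED = [ipaddress.ip_network("10.0.0.0/8")]  # hypothetical allowed_cidr


def ip_only(e: dict) -> bool:
    """IP-only condition: any connect outside the allowed ranges matches."""
    addr = ipaddress.ip_address(e["fd.sip"])
    return e["evt.type"] == "connect" and not any(addr in n for n in ALLOWED)


def anchored(e: dict, name: str = "your-agent") -> bool:
    # Check container context first; `and` short-circuits, so other
    # containers never reach the address test.
    return e.get("container.name") == name and ip_only(e)


events = [
    {"evt.type": "connect", "container.name": "your-agent", "fd.sip": "203.0.113.5"},
    {"evt.type": "connect", "container.name": "other-app", "fd.sip": "203.0.113.5"},
    {"evt.type": "connect", "container.name": "your-agent", "fd.sip": "10.2.3.4"},
]
print([ip_only(e) for e in events])   # [True, True, False]
print([anchored(e) for e in events])  # [True, False, False]
```

The IP-only version also flags `other-app`, which is exactly the noise you get when nothing in the condition ties the event to your agent.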
Also, check whether your agent's traffic even hits the syscall hook. Some libraries reuse pooled, already-established connections, so no new `connect` call is made for each request. Run a debug rule with just `evt.type=connect and container.name=your-agent` to see if you get any events at all.
Keep it technical.
That's a good catch about the source IP field. I've been thinking of it as the container's IP, but you're right, with host networking it's just the node.
But if the rule is anchored to the container with `container.name`, does `fd.sip` then reflect the container's virtual interface inside that namespace, or does it still pull the host IP? I'm trying to figure out if the field's meaning changes based on the rule scope.
I'll set up that debug rule first to see if any connects are even caught.