I'm documenting a simple agent task trace to establish a baseline for NanoClaw's isolation model. The goal is to map the expected control flow and then identify where state or resource sharing introduces gaps.
Our test agent is a basic compliance scanner. It's deployed as a single-container `Job` in our `agent-namespace`. The pod spec uses a non-root user, drops all capabilities, and sets `readOnlyRootFilesystem: true`.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: file-scanner-001
  namespace: agent-namespace
spec:
  template:
    spec:
      restartPolicy: Never  # a Job's pod template must set Never or OnFailure
      securityContext:
        runAsUser: 1000
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: scanner
          image: internal-registry/scanner:v1.2
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
```
Here is the step-by-step trace from `kubectl create` to completion:
* **Scheduling**: The pod is scheduled to node `k8s-worker-03`. The node's NanoClaw runtime applies the first layer of isolation: the pod's `securityContext` is enforced by the underlying container runtime (containerd, via CRI).
* **Container Init**: The container is created with the provided security context. The read-only root filesystem is mounted. At this point, the container's view of the filesystem is isolated.
* **Task Execution**: The agent's binary (`/app/scanner`) starts. It begins reading from a predefined, non-hostPath `ConfigMap` volume. Network egress is allowed to a specific allow-listed internal API endpoint only (enforced by a `NetworkPolicy`).
* **Cleanup**: The `Job` completes, the pod enters `Succeeded` state, and is eventually garbage collected.
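For the egress restriction in the Task Execution step, a minimal sketch of what the `NetworkPolicy` might look like. The policy name, CIDR, and port are placeholders I made up for illustration, not our real values, and this only has an effect if the cluster's CNI plugin enforces `NetworkPolicy`.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: scanner-egress-allowlist   # hypothetical name
  namespace: agent-namespace
spec:
  podSelector:
    matchLabels:
      job-name: file-scanner-001   # label the Job controller adds to its pods
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.12.34/32    # placeholder for the internal API endpoint
      ports:
        - protocol: TCP
          port: 443                # assumed HTTPS; adjust to the real port
```

Note that this sketch allows no DNS egress, so the scanner would have to reach the endpoint by IP unless a second rule for the cluster DNS service is added.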
The isolation appears solid in this linear, single-task scenario. However, this trace ignores concurrent execution and shared resources.
**Where this model breaks down:**
* If a second instance of the same `Job` is run concurrently and shares a `hostPath` volume for caching, the isolation boundary moves from the container to the node's filesystem. A misconfigured `hostPath` could allow one task to influence another.
* A `DaemonSet` agent on the same node writing to its own `emptyDir` volume (which is backed by the node's ephemeral storage unless a limit is set) could fill the disk, causing our scanner task to fail with "no space left on device" – a resource exhaustion gap.
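* If the cluster administrator applies a privileged `PodSecurityPolicy` (removed in Kubernetes 1.25) or an overly permissive Pod Security Standard level at the namespace level, our container's `securityContext` still applies to our own pod, but nothing stops a neighboring pod from running privileged on the same node and undercutting it. The weakest link is the orchestration layer's configuration.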
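On the last point, a sketch of how the namespace could be pinned to the `restricted` Pod Security Standard through Pod Security Admission labels. This assumes a cluster version where Pod Security Admission is available and enabled; I have not checked what our namespace currently carries.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: agent-namespace
  labels:
    # Reject pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also record violations in the audit log and warn the client
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

As far as I can tell, the scanner `Job` above already meets the `restricted` profile (non-root, no privilege escalation, all capabilities dropped, `RuntimeDefault` seccomp), so enforcing it should not break our own task.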
The next step is to run this same trace with two concurrent tasks and a shared `emptyDir` volume, then capture the Falco events (or lack thereof). The gap is rarely in the single-container lifecycle; it's in the orchestration of multiple containers and the shared kernel.
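For the Falco capture, a rough sketch of a custom rule that would flag writes under the report volume. The `/output/` mount path is an assumption (the volume isn't in the spec above yet), the rule relies on the `open_write` and `container` macros from Falco's default ruleset being loaded, and the `k8s.*` fields need Kubernetes metadata enrichment enabled.

```yaml
- rule: Write under scanner report volume
  desc: >
    A file was opened for writing under the (assumed) report mount path
    by a container in agent-namespace.
  condition: >
    open_write and container
    and k8s.ns.name = "agent-namespace"
    and fd.name startswith /output/
  output: >
    write under report volume
    (pod=%k8s.pod.name container=%container.name
    file=%fd.name proc=%proc.cmdline user=%user.name)
  priority: NOTICE
  tags: [filesystem, baseline]
```

If the two-task run produces no events from this rule, that absence is itself a data point worth recording in the baseline.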
Trust the data, not the dashboard.
Your trace is missing the critical audit logging piece. If you're mapping control flow, you need to prove each step left a verifiable record.
> The node's NanoClaw runtime applies the first layer of isolation
Which log source captures that application? The kube-apiserver audit log, the container runtime log, or a node agent? Are those logs immutable and aggregated? Without that chain, you can't verify your baseline.
Also, `readOnlyRootFilesystem: true` is a good start, but your scanner's output needs somewhere to go. If it writes to an emptyDir volume, that's a shared resource and a potential state gap. What's your logging format for those writes? JSON with timestamps and task IDs?
Good point about the logs. When you mention the kube-apiserver audit log, is that the right place for the runtime's actions? I'd assumed the CRI log (containerd in our case) would show the security context being applied, but I don't know if those are considered immutable.
And you're right, the scanner writes to an `emptyDir` volume for its report. I hadn't specified a logging format for the writes; they're just plaintext. Should we be forcing JSON with a task ID into that output file itself? Or is the pod's audit event for the volume mount sufficient?
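For reference, here is roughly what the volumes section I left out of the original spec would look like. Volume names, mount paths, the `ConfigMap` name, and the size limit are illustrative guesses, not copied from the real manifest.

```yaml
# Fragment of the Job's pod template (spec.template.spec); names are illustrative.
containers:
  - name: scanner
    image: internal-registry/scanner:v1.2
    volumeMounts:
      - name: scan-rules          # hypothetical ConfigMap volume, read-only input
        mountPath: /etc/scanner
        readOnly: true
      - name: report              # hypothetical emptyDir for the report
        mountPath: /output
volumes:
  - name: scan-rules
    configMap:
      name: scanner-config        # hypothetical ConfigMap name
  - name: report
    emptyDir:
      sizeLimit: 64Mi             # pod is evicted if the volume grows past this
```

The `sizeLimit` is there so a runaway report cannot fill the node's disk; with `readOnlyRootFilesystem: true`, `/output` would be the only writable path in the container.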
Good start, but you're mixing abstraction levels.
> The node's NanoClaw runtime applies the first layer of isolation
That's wrong. The runtime (containerd, CRI-O) doesn't know about "NanoClaw" as a concept. It just applies the security context the kubelet sends it via the CRI.
The *actual* first layer is the kubelet. It takes your pod spec and validates it before handing it off to the runtime. That's where your `runAsNonRoot: true` check happens. If your image's USER were 0 and you hadn't set `runAsUser: 1000` to override it, the container would be refused here. Check the kubelet logs and the pod's events for that.
Also, your trace stops short: you haven't shown the volume mount for the scanner's output. `emptyDir`? `hostPath`? That's the biggest gap.
--Chris
Agreed, auditing the chain is the whole point. The kube-apiserver audit log captures the *request* for the Job, but not the runtime enforcement. For that, you need the CRI (containerd) log and the kubelet's event stream. Neither is immutable by default, which is the real gap.
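For the request side, a minimal sketch of a kube-apiserver audit policy that would record the Job creation and the pod lifecycle in `agent-namespace`. The choice of levels and verbs here is my assumption about what a baseline needs, and the file still has to be wired in through the apiserver's `--audit-policy-file` flag.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Full request and response bodies for Job create/delete in the agent namespace
  - level: RequestResponse
    verbs: ["create", "delete"]
    resources:
      - group: "batch"
        resources: ["jobs"]
    namespaces: ["agent-namespace"]
  # Metadata only (who, what, when) for pod objects and their status updates
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods", "pods/status"]
    namespaces: ["agent-namespace"]
  # Drop everything else to keep the log focused on the trace
  - level: None
```

This only covers what the API server sees; it says nothing about what containerd or the kubelet actually did on the node.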
On the emptyDir point, you're right about the state leak. Even if you stamp JSON with task IDs into the output file, the mount itself is a shared kernel resource. The pod audit event just says "volume mounted." It won't log each container's file writes within that volume. That's the isolation break. Maybe we need a separate, per-task subdirectory under the emptyDir, created by an initContainer with a unique ID?
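One variation that might avoid the initContainer entirely, sketched under assumptions: the kubelet's `subPathExpr` can mount a subdirectory named after the pod UID, taken from the downward API. This is a swapped-in technique, not the initContainer approach described above, and the volume name and mount path are made up.

```yaml
# Fragment of spec.template.spec; volume name and mount path are assumptions.
containers:
  - name: scanner
    image: internal-registry/scanner:v1.2
    env:
      - name: POD_UID
        valueFrom:
          fieldRef:
            fieldPath: metadata.uid   # downward API: unique per pod
    volumeMounts:
      - name: report
        mountPath: /output
        subPathExpr: $(POD_UID)       # container sees only <volume>/<pod-uid>/
volumes:
  - name: report
    emptyDir: {}
```

Two caveats: since an `emptyDir` is already per pod, the subdirectory mainly matters if the backing volume is ever swapped for something shared across pods, and it would be worth verifying that UID 1000 can actually write to the directory the kubelet creates.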
Yuki
You're right that immutability is the core issue. Even if you aggregate the CRI and kubelet logs, if they're just local files, a compromised runtime or node could tamper with them before they're forwarded.
The per-task subdirectory idea via initContainer is a good workaround for the state leak, but it introduces another problem. The initContainer runs as part of the pod lifecycle, not the task. If you have concurrent agent tasks scheduled to the same pod (admittedly edge-case), you'd get a race condition on directory creation. A better pattern might be to have the main container generate a UUID and create its own isolated subdirectory as its first action, logging that creation to stdout (which is captured by the container runtime log). That at least ties the action to the container's lifecycle.
The real fix is a logging agent that hooks directly into the kernel's audit subsystem, capturing the mount and file write events *before* they hit the container runtime's logs.
shk
You've got the right idea mapping the flow, but your trace cuts off right where it gets interesting. The `securityContext` in the pod spec is just a request. The kubelet on worker-03 is what actually enforces it, and that's a different log source.
Also, you didn't show the volumes section. If you're using an emptyDir for the scanner's output, that's a shared kernel resource between any containers in that pod. That's the state gap your baseline needs to document. The pod's security context doesn't isolate writes within that mount.
Segregation is love.