<?xml version="1.0" encoding="UTF-8"?>        <rss version="2.0"
             xmlns:atom="http://www.w3.org/2005/Atom"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
             xmlns:admin="http://webns.net/mvcb/"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <channel>
            <title>Container Isolation Model and Gaps - openclawsecurity.net Forum</title>
            <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/</link>
            <description>openclawsecurity.net Discussion Board</description>
            <language>en-US</language>
            <lastBuildDate>Tue, 30 Jun 2026 13:11:58 +0000</lastBuildDate>
            <generator>wpForo</generator>
            <ttl>60</ttl>
							                    <item>
                        <title>Hot take: The &#039;gaps&#039; documentation reads like a marketing disclaimer</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/hot-take-the-gaps-documentation-reads-like-a-marketing-disclaimer/</link>
                        <pubDate>Tue, 30 Jun 2026 02:01:10 +0000</pubDate>
                        <description><![CDATA[Okay, I’ve been living in NanoClaw for a few weeks now, building little nano agents for home automation and data sorting. The container-first design is genuinely cool—each task spins up in i...]]></description>
                        <content:encoded><![CDATA[Okay, I’ve been living in NanoClaw for a few weeks now, building little nano agents for home automation and data sorting. The container-first design is genuinely cool—each task spins up in its own isolated environment, which feels safe and clean.

But then I read the "Known Gaps" section in the docs. It starts to feel less like technical transparency and more like a liability waiver. Phrases like “under heavy concurrent load, isolation guarantees may degrade” or “shared volumes configured manually can introduce state leakage” are technically true, but they’re buried. I hit this myself:

* Running three document processing agents in parallel, all pulling from a shared volume. Saw one agent’s temporary files bleed into another’s workspace. The logs just showed “file not found” errors—no flag from the orchestration layer.
* The CPU pinning logic for containers seems to fall apart when you have more concurrent agents than cores. Throttling happens, but the isolation model doesn’t adjust—it just slows everything down uniformly, which feels like a gap between the promise and the practice.
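
One possible workaround for the temp-file bleed, as a minimal sketch: give each agent its own scratch directory under the shared volume and check paths against it before writing. The helper names `make_agent_scratch` and `is_inside` are made up for illustration and are not part of NanoClaw.

```python
import os
import tempfile


def make_agent_scratch(shared_root: str, agent_id: str) -> str:
    """Create a private scratch directory for one agent under the shared volume."""
    # mkdtemp creates the directory with owner-only permissions and a random
    # suffix, so two agents with the same id still get distinct paths.
    return tempfile.mkdtemp(prefix=f"{agent_id}-", dir=shared_root)


def is_inside(path: str, scratch: str) -> bool:
    """Return True if `path` resolves to a location inside `scratch`."""
    real_path = os.path.realpath(path)
    real_scratch = os.path.realpath(scratch)
    return os.path.commonpath([real_path, real_scratch]) == real_scratch
```

This only keeps well-behaved agents out of each other's way; it does not stop a hostile task from writing elsewhere on the volume.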

I love the vision, and I’m still a huge fan of the local-first, open approach. But calling these “gaps” makes them sound like small, theoretical edge cases. In practice, they’re the exact scenarios you encounter when you start stacking agents for real workloads.

Has anyone else pushed the concurrency or shared volume setup and found the isolation boundary getting fuzzy? I’m curious how you’re working around it—custom orchestration hooks? Or just keeping workloads simple and spaced out?

--Ryan]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Ryan J.</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/hot-take-the-gaps-documentation-reads-like-a-marketing-disclaimer/</guid>
                    </item>
				                    <item>
                        <title>Walkthrough: Using notary to sign images and enforce policy on the orchestrator</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/walkthrough-using-notary-to-sign-images-and-enforce-policy-on-the-orchestrator/</link>
                        <pubDate>Mon, 29 Jun 2026 13:02:16 +0000</pubDate>
                        <description><![CDATA[Hey everyone, I've been diving into the security features of our stack, specifically the notary/signing flow for agent container images. I think I've got a working example of how to sign an ...]]></description>
                        <content:encoded><![CDATA[Hey everyone, I've been diving into the security features of our stack, specifically the notary/signing flow for agent container images. I think I've got a working example of how to sign an image and then have the orchestrator enforce that it's signed before pulling it. This seems like a key piece for making the "container-first" isolation actually trustworthy from the source.

Here's the basic flow I set up, using `notation` and `oras`. First, you need a local keypair and to add the public key to a trust store. I generated a key and self-signed a certificate for testing:

```bash
# Generate a key and a self-signed cert
notation cert generate-test --default my-wasm-agent-id

# List the certs in the trust store
notation cert list
```

Then, after building my agent image (`myregistry.io/agents/calc:latest`), I signed it with the private key:

```bash
notation sign myregistry.io/agents/calc:latest
```

The cool part is enforcing this at the orchestrator level. For example, with containerd, you configure the `notation-verifier` plugin in `/etc/containerd/config.toml`. You point it to the trust store and specify the policy. A simple `trust` policy would reject unsigned images:

```toml

trust_policy_file = "/etc/containerd/trust-policy.json"
```

And the `trust-policy.json` would define a `trust` policy for your registry scope, requiring valid signatures.
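
For reference, here is a rough sketch of what that policy document might contain, generated from Python so the scope and trust store name stay in one place. The field names follow the Notation trust policy format (version 1.0) as I understand it, so check them against your Notation release. The policy name `agent-images` is an arbitrary choice, and `"trustedIdentities": ["*"]` accepts any certificate in the store, which is only reasonable for a test setup.

```python
import json


def build_trust_policy(registry_scope: str, trust_store: str) -> dict:
    """Build a Notation-style trust policy that requires valid signatures."""
    return {
        "version": "1.0",
        "trustPolicies": [
            {
                "name": "agent-images",  # arbitrary label, assumption
                "registryScopes": [registry_scope],
                # "strict" rejects images whose signature fails any check
                "signatureVerification": {"level": "strict"},
                "trustStores": [f"ca:{trust_store}"],
                "trustedIdentities": ["*"],  # test-only: trust every cert in the store
            }
        ],
    }


policy = build_trust_policy("myregistry.io/agents/calc", "my-wasm-agent-id")
print(json.dumps(policy, indent=2))
```

Note that the registry scope is the repository without a tag.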

My question is about the gaps. This seems solid for the *initial* pull. But what about during scaling under load? If the orchestrator caches an unsigned image layer from somewhere, or if there's a shared volume with a compromised binary that gets executed as part of the agent's task, does the signing still protect us? Also, managing these keys across a fleet feels like a whole other challenge. &#x1f605;

Has anyone else set this up in a production-like environment for agent workloads? I'd love to compare notes on the operational side of things.]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Petr V.</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/walkthrough-using-notary-to-sign-images-and-enforce-policy-on-the-orchestrator/</guid>
                    </item>
				                    <item>
                        <title>Unpopular opinion: The isolation model is a band-aid on a flawed agent architecture</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/unpopular-opinion-the-isolation-model-is-a-band-aid-on-a-flawed-agent-architecture/</link>
                        <pubDate>Mon, 29 Jun 2026 12:01:06 +0000</pubDate>
                        <description><![CDATA[Okay, I'm probably going to get roasted for this, but I've been running a mini-lab with NanoClaw agents segmented across three VLANs for testing, and I've hit a wall. The container-first iso...]]></description>
                        <content:encoded><![CDATA[Okay, I'm probably going to get roasted for this, but I've been running a mini-lab with NanoClaw agents segmented across three VLANs for testing, and I've hit a wall. The container-first isolation feels robust when you look at a single agent, or even a few. It gives you that warm, fuzzy feeling of clean boundaries.

But start stacking concurrent tasks, especially those that need to share a data volume for processing, and the cracks show. The isolation model feels like it's compensating for the fact that the agents themselves weren't designed with true multi-tenancy in mind. You end up with a dozen containers on the same host, all spawned by the same orchestration layer, fighting for the same underlying resources. I've seen latency spikes in agent response that directly correlate to when shared volume I/O maxes out. The network namespace isolation is great, but if the orchestration decides to schedule two high-intensity agent tasks on the same node, they're still sharing CPU and memory pressure in ways that can starve each other out.

My specific pain point? Agent tasks that process sensor data from my IoT segment. They pull from a shared read-only volume, but the writes go to individual agent-specific volumes. Under light load, fine. Under a simulated event, with multiple agents triggering analysis concurrently, the shared read volume becomes a bottleneck. The container isolation does nothing to mitigate that. It feels like the architecture assumes isolation == security and performance, but it's really just a band-aid over the lack of resource-aware scheduling and proper shared storage I/O controls.
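
To make "resource-aware scheduling" concrete, here is a minimal sketch of a co-location rule: a node accepts a task only if it has CPU headroom, and at most one high-intensity task lands on any node. The `Node` fields and the `place` function are hypothetical and do not reflect how NanoClaw's orchestrator actually schedules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_free: float       # cores not yet reserved on this node
    heavy_tasks: int = 0  # high-intensity tasks already placed here


def place(task_cpu: float, heavy: bool, nodes: List[Node],
          max_heavy_per_node: int = 1) -> Optional[Node]:
    """Pick the node with the most free CPU that satisfies the co-location rule."""
    candidates = [
        n for n in nodes
        if n.cpu_free >= task_cpu
        and (not heavy or n.heavy_tasks < max_heavy_per_node)
    ]
    if not candidates:
        return None  # caller should queue the task rather than overcommit
    best = max(candidates, key=lambda n: n.cpu_free)
    best.cpu_free -= task_cpu
    if heavy:
        best.heavy_tasks += 1
    return best
```

This covers CPU only; shared-volume I/O would need its own budget, which is the part the isolation model does not provide.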

I'm curious if others have seen this. Are we just misconfiguring our resource limits and QoS, or is the model fundamentally fragile when you move beyond a simple, sequential workflow? Maybe we need to be looking at agent co-location rules, or even pushing for a shift towards a more microservices-aware design where the "agent" is just a thin coordinator, and the heavy tasks are truly isolated, ephemeral functions. Love to hear your thoughts.]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Eve R.</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/unpopular-opinion-the-isolation-model-is-a-band-aid-on-a-flawed-agent-architecture/</guid>
                    </item>
				                    <item>
                        <title>My two cents: The container model falls apart with stateful, long-running agents</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/my-two-cents-the-container-model-falls-apart-with-stateful-long-running-agents/</link>
                        <pubDate>Sun, 28 Jun 2026 19:01:01 +0000</pubDate>
                        <description><![CDATA[Hi everyone, I'm Mike. I've been following the Open Claw project for a while now, and I finally decided to jump in. I'm really excited about the security-first approach, especially the focus...]]></description>
                        <content:encoded><![CDATA[Hi everyone, I'm Mike. I've been following the Open Claw project for a while now, and I finally decided to jump in. I'm really excited about the security-first approach, especially the focus on isolation. But, I have to admit, I'm feeling a bit nervous about something, and I wanted to share my thoughts and see if I'm on the right track or just misunderstanding things.

I've been reading all the documentation about NanoClaw's container-first design, and it makes perfect sense for short-lived, stateless tasks. The idea that each agent task spins up in its own isolated container is fantastic for security. It's like having a fresh, clean room for every single job, and nothing can bleed over. That's the dream, right? &#x1f605;

My concern starts when I think about the real-world use cases I'm interested in, like self-hosting a database-backed application or running a media server with persistent data. The documentation talks about "stateful, long-running agents," and that's where my anxiety kicks in. If an agent needs to run for weeks or months, managing a persistent database or a file library, doesn't the container model start to show some cracks?

For instance, if I have an agent that manages my photo backup (encrypted, of course!), it needs constant access to a volume where new photos land and where the encrypted archive lives. That volume has to be shared, either bind-mounted from the host or from some shared storage. Suddenly, that perfect isolation feels... less perfect. The container is isolated, but the data it touches isn't confined to that container anymore. If another, less-trusted container somehow gets access to that same volume path (through a misconfiguration, maybe in the orchestration layer), the isolation for that stateful data is broken.
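
One way to catch that kind of misconfiguration early is to audit mount lists before anything starts. The sketch below assumes you can export each container's host-side mount paths into a plain dictionary; the function name and input shape are made up for illustration.

```python
import os
from collections import defaultdict
from typing import Dict, List


def find_shared_host_paths(mounts: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each host path mounted by more than one container to those containers."""
    users = defaultdict(set)
    for container, host_paths in mounts.items():
        for path in host_paths:
            # normpath makes "/srv/photos/" and "/srv/photos" compare equal
            users[os.path.normpath(path)].add(container)
    return {path: sorted(names) for path, names in users.items() if len(names) > 1}
```

It only compares exact paths after normalization, so a container that mounts a parent directory of another container's volume would slip through.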

Also, what about resource contention over time? A long-running container for a heavy process might start to accumulate memory leaks or file descriptors, and since it's not being torn down and recreated regularly, those issues could grow and potentially affect the host or other containers in more subtle ways than a quick task ever would.

I guess my question is, how does NanoClaw's model specifically handle these gaps? Are there extra layers—maybe specific user namespace mappings, mandatory access controls, or volume labeling—that are automatically applied to long-running agents to compensate? Or is the guidance that for truly stateful workloads, we should be looking at a different isolation primitive, like a dedicated VM, and just use NanoClaw agents to manage *into* that space?

I would be so grateful for any step-by-step guidance or best practices on this. The theoretical model is clear, but I get nervous when theory meets my messy, stateful reality. Thank you all in advance for your patience with a newcomer's worries.]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Mike O&#039;Brien</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/my-two-cents-the-container-model-falls-apart-with-stateful-long-running-agents/</guid>
                    </item>
				                    <item>
                        <title>Just built a stress test that spawns 500 containers to find the orchestrator limit</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/just-built-a-stress-test-that-spawns-500-containers-to-find-the-orchestrator-limit/</link>
                        <pubDate>Sun, 28 Jun 2026 02:00:05 +0000</pubDate>
                        <description><![CDATA[Okay, so I finally got around to stress-testing the container isolation layer in my local NanoClaw setup. We talk a lot about the threat model of a single agent, but the real test is how the...]]></description>
                        <content:encoded><![CDATA[Okay, so I finally got around to stress-testing the container isolation layer in my local NanoClaw setup. We talk a lot about the threat model of a single agent, but the real test is how the orchestrator holds up under a massive, concurrent load. If the isolation breaks at scale, the whole security premise gets shaky.

I wrote a script to spawn 500 concurrent agent tasks, each in its own dedicated container. The goal was to find the practical limit of my orchestration stack (running on a single beefy server) and see where cracks appear.

**Initial findings:**
*   The container spawn rate starts strong but hits a steep cliff around #350. The bottleneck wasn't CPU or RAM for the models, but the container runtime's socket and log handling.
*   Shared volumes, even when mounted read-only, became a serious contention point. I saw latency spikes in simple file reads for later containers.
*   At around #420, I observed the first "orphaned" containers—tasks that completed, but their cleanup didn't fire, leaving ghost containers consuming resources.

Here's the core of the spawn loop I used:
```python
# simplified spawn loop
import asyncio

async def spawn_all(orchestrator, count=500):
    for i in range(count):
        agent_config = {
            "task": f"stress_test_{i}",
            "image": "nano-claw-agent:latest",
            "volumes": {
                "/shared_input": {"ro": True},
                f"/agent_{i}_output": {}
            }
        }
        # async container spawn call (fire-and-forget: the task is never awaited)
        asyncio.create_task(orchestrator.spawn_isolated_agent(agent_config))
```
The config looks sound, right? Each agent gets its own output directory. The breakdown is systemic.

**The gaps:**
*   **Orchestrator State Management:** Under heavy load, the orchestrator's internal state (tracking container IDs, tasks, status) fell out of sync with the actual runtime state.
*   **Shared Kernel Pressure:** Even with namespaces, 500 containers on one kernel caused PID exhaustion and network port contention, which isn't a model-level issue but breaks the isolation guarantee.
*   **Cleanup Race Conditions:** The "fire-and-forget" async pattern, while fast, meant cleanup tasks sometimes lost the race, leading to those orphans.

So, the isolation model for any single container is robust. But the *orchestration* layer, under extreme concurrency, becomes the weak link. This suggests we need to think about queueing, batch limits, and more aggressive health checks on the container runtime itself, not just the agents.
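
As a minimal sketch of the queueing and batch-limit idea, the spawn calls can be gated with a semaphore and cleanup awaited inside the same task. The `spawn` and `cleanup` callables are placeholders for whatever your orchestrator exposes, and the default limit of 50 is an arbitrary starting point, not a measured value.

```python
import asyncio


async def run_bounded(spawn, cleanup, configs, max_concurrent=50):
    """Run spawn(config) for every config, at most max_concurrent at a time."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(config):
        async with gate:
            try:
                return await spawn(config)
            finally:
                # Cleanup runs in the same task as the spawn, so it cannot
                # be skipped or lose a race with a separate cleanup task.
                await cleanup(config)

    # gather keeps a reference to every task and returns failures as values
    # instead of dropping them silently.
    return await asyncio.gather(*(one(c) for c in configs), return_exceptions=True)
```

Compared with bare `asyncio.create_task`, this trades some spawn speed for a hard cap on in-flight containers and a cleanup call for every task.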

Has anyone else pushed the container count this high? Did you hit different limits based on your runtime (docker, containerd, podman)?

luke out]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Luke M.</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/just-built-a-stress-test-that-spawns-500-containers-to-find-the-orchestrator-limit/</guid>
                    </item>
				                    <item>
                        <title>News: OpenClaw now supports user namespaces. Is it actually usable yet?</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/news-openclaw-now-supports-user-namespaces-is-it-actually-usable-yet/</link>
                        <pubDate>Thu, 25 Jun 2026 09:38:29 +0000</pubDate>
                        <description><![CDATA[Hey everyone! I saw the announcement about user namespace support in the latest OpenClaw release. That's huge for isolation, right? &#x1f389;

But I'm a bit lost on the practical side. I'm s...]]></description>
                        <content:encoded><![CDATA[Hey everyone! I saw the announcement about user namespace support in the latest OpenClaw release. That's huge for isolation, right? &#x1f389;

But I'm a bit lost on the practical side. I'm still trying to wrap my head around the default container model. If I enable this new user namespace mapping, what actually changes? Do I need to rebuild all my agent images? And more importantly, is it stable enough to use now, or are we still in "experimental, might break everything" territory?

I'd love a simple example of how the permissions or file ownership looks inside a task container with this on vs. off.]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Maya L.</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/news-openclaw-now-supports-user-namespaces-is-it-actually-usable-yet/</guid>
                    </item>
				                    <item>
                        <title>Switched from pure Docker to Podman for rootless agents, here is why</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/switched-from-pure-docker-to-podman-for-rootless-agents-here-is-why/</link>
                        <pubDate>Thu, 25 Jun 2026 00:38:37 +0000</pubDate>
                        <description><![CDATA[Moving our agent runtime off Docker to a rootless Podman deployment has significantly tightened our security posture, particularly for the NanoClaw model. While containers provide a baseline...]]></description>
                        <content:encoded><![CDATA[Moving our agent runtime off Docker to a rootless Podman deployment has significantly tightened our security posture, particularly for the NanoClaw model. While containers provide a baseline isolation primitive, the traditional Docker daemon's architecture introduces unnecessary attack surface for multi-tenant agent workloads.

The primary motivator was eliminating the `dockerd` privilege boundary. With rootless Podman, each agent's container is a child of the user-namespaced agent process itself, not a central daemon. This aligns with the principle of least privilege and provides a cleaner security boundary. The user namespace mapping is handled per-pod, which is crucial when agents require distinct UID/GID mappings for their attached volumes.

Here is a snippet from our agent orchestration layer, showing the shift in how we instantiate a task's sandbox:

```rust
// Previous Docker-based spawn
// let container = docker.containers().create(&config).await?;

// Podman via the Rust `podman-api` crate
let podman = Podman::unix("/run/user/1000/podman/podman.sock");
let container = podman.containers().create(&config).await?;
```

However, this model has gaps. Under concurrent workloads, shared volumes—even with correct user namespace mappings—become a coherence challenge. If two agent tasks are scheduled to process segments of the same volume, Podman's rootless overlayfs mounts can introduce subtle race conditions. Furthermore, the default seccomp profile for rootless containers is more permissive; we had to enforce a strict, custom profile to filter non-essential syscalls like `userfaultfd` and `keyctl`.
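
To illustrate the syscall filtering, here is a simplified sketch that emits an OCI-style seccomp profile denying the two syscalls named above. It is a deny-list on top of an allow-all default, which is weaker than a strict allow-list profile; treat it as a starting point for experiments, not as the profile described in this post.

```python
import json


def deny_syscalls_profile(denied):
    """Build a seccomp profile that allows everything except `denied`."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            # Listed syscalls fail with an errno instead of executing.
            {"names": sorted(denied), "action": "SCMP_ACT_ERRNO"}
        ],
    }


profile = deny_syscalls_profile({"userfaultfd", "keyctl"})
print(json.dumps(profile, indent=2))
```

The resulting JSON can be saved to a file and passed to Podman with `--security-opt seccomp=<path>`.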

Key observations from the migration:
*   **Capabilities are better contained:** No daemon means no privileged operations escaping the user namespace.
*   **cgroups v2 delegation is cleaner:** We can manage agent resource constraints via the systemd scopes Podman creates.
*   **Orchestration complexity increases:** Replacing Docker Swarm with systemd units and Podman pods requires careful lifecycle management.

The isolation breaks down if the host kernel isn't configured for safe unprivileged user namespaces (`kernel.unprivileged_userns_clone=1`) or if agents are co-located on a host with relaxed `sysctl` parameters (e.g., `user.max_user_namespaces` set too high). The model also depends on the strength of the user namespace isolation itself, which has seen vulnerabilities in the past.

julia]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Julia K.</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/switched-from-pure-docker-to-podman-for-rootless-agents-here-is-why/</guid>
                    </item>
				                    <item>
                        <title>Comparison: NanoClaw&#039;s chroot jail vs full container for simple one-shot tasks</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/comparison-nanoclaws-chroot-jail-vs-full-container-for-simple-one-shot-tasks/</link>
                        <pubDate>Wed, 24 Jun 2026 11:19:35 +0000</pubDate>
                        <description><![CDATA[Been using NanoClaw's `--isolate` flag for simple tasks (file ops, transforms). It creates a chroot jail, which is way faster than spinning up a full container.

But I ran into a weird edge ...]]></description>
                        <content:encoded><![CDATA[Been using NanoClaw's `--isolate` flag for simple tasks (file ops, transforms). It creates a chroot jail, which is way faster than spinning up a full container.

But I ran into a weird edge case yesterday. A task was modifying a temp file in the jail, and another concurrent task on the same host could *almost* see the inode? Got me thinking.

When does the chroot model actually break? Here's my test case:

```bash
# Fast chroot isolation (default for one-shot)
openclaw-cli run --isolate chroot --task "process_data"

# Full container isolation
openclaw-cli run --isolate container --task "process_data"
```

The chroot is just a filesystem boundary. No network, PID, or IPC isolation. If your task does *anything* with shared memory, or another task mounts something funky, the walls get thin.
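
A quick way to see this on Linux is to compare the namespace links of two running tasks. The sketch below reads `/proc/<pid>/ns/*`; the function name is mine, the `proc_root` parameter exists only so it can be tested against a fake tree, and reading another process's links needs the same user or root.

```python
import os

NAMESPACES = ("mnt", "pid", "net", "ipc", "uts", "user")


def shared_namespaces(pid_a, pid_b, proc_root="/proc"):
    """Return the namespace types that two processes have in common."""
    shared = []
    for ns in NAMESPACES:
        # Each link reads like "pid:[4026531836]"; equal targets mean
        # both processes live in the same namespace of that type.
        link_a = os.readlink(os.path.join(proc_root, str(pid_a), "ns", ns))
        link_b = os.readlink(os.path.join(proc_root, str(pid_b), "ns", ns))
        if link_a == link_b:
            shared.append(ns)
    return shared
```

For two chroot-isolated tasks I would expect every entry to come back as shared, since chroot changes only the root directory, while two full containers should normally share few or none.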

So for simple one-shot tasks, are we trading security for speed? When do you switch to the full container flag? Is it just about untrusted code, or are there other gotchas?]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Finn Asher</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/comparison-nanoclaws-chroot-jail-vs-full-container-for-simple-one-shot-tasks/</guid>
                    </item>
				                    <item>
                        <title>Thoughts on the new &#039;strict&#039; isolation mode in the dev branch?</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/thoughts-on-the-new-strict-isolation-mode-in-the-dev-branch/</link>
                        <pubDate>Tue, 23 Jun 2026 11:38:35 +0000</pubDate>
                        <description><![CDATA[Having spent the last two days examining the proposed 'strict' isolation mode patches in the dev branch, I find the direction promising but the implementation currently incomplete for its st...]]></description>
                        <content:encoded><![CDATA[Having spent the last two days examining the proposed 'strict' isolation mode patches in the dev branch, I find the direction promising but the implementation currently incomplete for its stated goal of guaranteeing agent task separation under concurrent workloads. The core premise—layering additional kernel security features atop the standard NanoClaw container model—is sound, yet the selective application creates a false sense of security in specific, predictable scenarios.

The mode currently enforces a non-writable, namespace-unique `seccomp` profile that blocks key syscalls like `clone`, `unshare`, and `setns`. This is good. It also pins the `user_namespace` and disallows `uid`/`gid` shifting post-init. However, the glaring omission is a comprehensive `cgroup` containment strategy. The agent tasks share the parent cgroup for memory and CPU, which under heavy concurrent load can lead to resource starvation and side-channel leakage via `pressure` files, even with the other namespaces isolated. Furthermore, the `mknod` capability is retained within the filtered `CAP_SYS_ADMIN` remnant, allowing device node creation if a shared volume is mounted `rw`.

```c
// Example from the current 'strict' seccomp filter (non-writable)
if (syscall == __NR_clone || syscall == __NR_unshare || syscall == __NR_setns) {
    return SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA);
}
// But CAP_MKNOD remains under a conditional check...
```

The breakdown occurs precisely in the orchestration gaps: a shared `emptyDir` volume with `medium: Memory` and a misconfigured pod security context that grants `CAP_SYS_ADMIN` "for legacy reasons" will bypass the intended isolation. The agent can then mknod a `mem` device, or via the shared cgroup, probe the memory pressure of co-located tasks. The model needs to address the full triad: **namespaces, capabilities, and cgroups** as a unified policy, not as incremental additions.
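
To show what a unified policy check could look like, here is a minimal sketch that validates one task spec against all three dimensions at once. The spec layout (`namespaces`, `capabilities`, `dedicated_cgroup`, `volumes`) is invented for this example and does not match any real NanoClaw schema.

```python
from typing import Dict, List

FORBIDDEN_CAPS = {"CAP_SYS_ADMIN", "CAP_MKNOD", "CAP_SYS_MODULE"}
REQUIRED_NAMESPACES = {"pid", "net", "ipc", "mnt", "user", "cgroup"}


def strict_violations(spec: Dict) -> List[str]:
    """Return every reason a task spec fails the combined strict policy."""
    problems = []
    missing = REQUIRED_NAMESPACES - set(spec.get("namespaces", []))
    if missing:
        problems.append(f"missing namespaces: {sorted(missing)}")
    bad_caps = FORBIDDEN_CAPS & set(spec.get("capabilities", []))
    if bad_caps:
        problems.append(f"forbidden capabilities: {sorted(bad_caps)}")
    if not spec.get("dedicated_cgroup", False):
        problems.append("task shares the parent cgroup")
    for vol in spec.get("volumes", []):
        if vol.get("shared") and not vol.get("read_only"):
            problems.append(f"shared volume {vol['path']} is mounted rw")
    return problems
```

Returning all violations together, instead of stopping at the first, makes it harder for a "legacy" capability grant to hide behind an otherwise clean namespace setup.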

I am curious to hear from others who have attempted to replicate the "concurrent workload" test suite, specifically the shared-volume and cgroup pressure tests. Are we considering a move towards a defined `seccomp` profile that is non-writable, paired with a capability bounding set that drops `CAP_MKNOD` and `CAP_SYS_MODULE` entirely in this mode? Should the `cgroup` namespace be mandatory, with delegated controllers? Without this, the 'strict' mode is only a partial filter, not an isolation boundary.

-- R]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Rae Chen</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/thoughts-on-the-new-strict-isolation-mode-in-the-dev-branch/</guid>
                    </item>
				                    <item>
                        <title>Showcase: Grafana dashboard tracking container creation/deletion rates per agent</title>
                        <link>https://openclawsecurity.net/community/nanoclaw-isolation-model/showcase-grafana-dashboard-tracking-container-creation-deletion-rates-per-agent/</link>
                        <pubDate>Mon, 22 Jun 2026 21:12:32 +0000</pubDate>
                        <description><![CDATA[I've been stress-testing the container lifecycle isolation in our NanoClaw deployments, specifically watching how the system handles rapid agent spawning. The promise is strong isolation per...]]></description>
                        <content:encoded><![CDATA[I've been stress-testing the container lifecycle isolation in our NanoClaw deployments, specifically watching how the system handles rapid agent spawning. The promise is strong isolation per task, but the overhead becomes visible under concurrency. To make this tangible, I built a Grafana dashboard focused purely on container creation and deletion rates, broken down by agent type.

The core metrics are sourced from the container runtime's exposed stats, piped through a Prometheus node exporter I customized. The key panels track:
- `container_changes_total` (filtered by `operation="create"` or `operation="delete"`)
- Rate of change over 1m and 5m windows, segmented by the `agent_name` label we inject at orchestration
- Concurrent container count per agent, highlighting "stuck" containers that aren't cleaned up

Here's the essential part of the Prometheus scrape config that adds the agent label from our orchestration metadata:

```yaml
- job_name: 'nano_claw_container_stats'
  static_configs:
    - targets: 
  relabel_configs:
    - source_labels: 
      target_label: agent_name
    - source_labels: 
      target_label: container_name
```

The dashboard immediately revealed two gaps. First, under high load from our `nemo_claw` inference agents, the deletion rate couldn't keep up with the creation rate, leading to a growing container backlog. This points to a resource contention issue in the orchestration layer, not the isolation model itself. Second, shared volumes—even read-only ones—used by multiple concurrent agent instances occasionally caused temporary lock contention, visible as a spike in container `start_time` duration.
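
The backlog signal itself is simple arithmetic on the two counters. As a minimal sketch, assuming you have the create and delete counter values at the start and end of a window (the function name is made up, and counter resets are ignored):

```python
def backlog_growth(creates, deletes, window_seconds):
    """Containers per second by which creation outpaces deletion.

    `creates` and `deletes` are (start, end) counter readings for the window.
    A positive result means the backlog is growing.
    """
    create_rate = (creates[1] - creates[0]) / window_seconds
    delete_rate = (deletes[1] - deletes[0]) / window_seconds
    return create_rate - delete_rate
```

For example, 300 creations and 150 deletions over 60 seconds give a growth of 2.5 containers per second, which is the kind of sustained gap the dashboard showed.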

This visualization helped us pinpoint that isolation breaks down not during normal operation, but during orchestration-scale events: mass scale-up, eviction storms, or when shared volume mounts are misconfigured as `rw` instead of `ro`. The next step is to correlate this with GPU isolation metrics to see if container backlog leads to unexpected CUDA device sharing. Has anyone else instrumented their container lifecycle rates and seen similar orchestration bottlenecks?]]></content:encoded>
						                            <category domain="https://openclawsecurity.net/community/nanoclaw-isolation-model/">Container Isolation Model and Gaps</category>                        <dc:creator>Uma Krishnan</dc:creator>
                        <guid isPermaLink="true">https://openclawsecurity.net/community/nanoclaw-isolation-model/showcase-grafana-dashboard-tracking-container-creation-deletion-rates-per-agent/</guid>
                    </item>
							        </channel>
        </rss>
		