Okay, so I finally got around to stress-testing the container isolation layer in my local NanoClaw setup. We talk a lot about the threat model of a single agent, but the real test is how the orchestrator holds up under a massive, concurrent load. If the isolation breaks at scale, the whole security premise gets shaky.
I wrote a script to spawn 500 concurrent agent tasks, each in its own dedicated container. The goal was to find the practical limit of my orchestration stack (running on a single beefy server) and see where cracks appear.
**Initial findings:**
* The container spawn rate started strong but hit a steep cliff around container #350. The bottleneck wasn't CPU or RAM for the models; it was the container runtime's socket and log handling.
* Shared volumes, even when mounted read-only, became a serious contention point. I saw latency spikes in simple file reads for later containers.
* At around #420, I observed the first "orphaned" containers: tasks that completed but whose cleanup never fired, leaving ghost containers consuming resources.
Here's the core of the spawn loop I used:
```python
import asyncio

# simplified spawn loop (must run inside a running event loop)
for i in range(500):
    agent_config = {
        "task": f"stress_test_{i}",
        "image": "nano-claw-agent:latest",
        "volumes": {
            "/shared_input": {"ro": True},
            f"/agent_{i}_output": {},
        },
    }
    # async container spawn call: fire-and-forget, no reference to the task is kept
    asyncio.create_task(orchestrator.spawn_isolated_agent(agent_config))
```
The config looks sound, right? Each agent gets its own output directory. The breakdown is systemic.
**The gaps:**
* **Orchestrator State Management:** Under heavy load, the orchestrator's internal state (tracking container IDs, tasks, status) fell out of sync with the actual runtime state.
* **Shared Kernel Pressure:** Even with namespaces, 500 containers on one kernel caused PID exhaustion and network port contention. That isn't a model-level issue, but it undermines the isolation guarantee, because one container's resource use can starve its neighbors.
* **Cleanup Race Conditions:** The "fire-and-forget" async pattern, while fast, meant cleanup tasks sometimes lost the race, leading to those orphans.
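For the cleanup race specifically, here's a minimal sketch of one possible fix, not what my orchestrator does today: keep a reference to every spawned task and put the cleanup in a `finally`, so it runs whether the agent finishes or fails. `run_agent` and `cleanup` are hypothetical stand-ins for the real workload and container removal.

```python
import asyncio

cleaned = []  # records which agents had cleanup run (stand-in for container removal)

async def run_agent(i: int) -> None:
    # Hypothetical stand-in for the real agent workload; every 7th agent fails.
    await asyncio.sleep(0)
    if i % 7 == 0:
        raise RuntimeError(f"agent {i} failed")

async def cleanup(i: int) -> None:
    # Hypothetical stand-in for removing the agent's container.
    cleaned.append(i)

async def supervised(i: int) -> None:
    try:
        await run_agent(i)
    finally:
        # Runs whether run_agent returned normally or raised.
        await cleanup(i)

async def main(n: int) -> int:
    # Holding the task objects in a list keeps them referenced until they finish.
    tasks = [asyncio.create_task(supervised(i)) for i in range(n)]
    # return_exceptions=True: one failing agent does not abort the wait for the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return sum(isinstance(r, Exception) for r in results)

failures = asyncio.run(main(50))
```

The trade-off is that the caller now waits for every task, so it's no longer fire-and-forget, but no agent can finish without its cleanup having been attempted.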
So, the isolation model for any single container is robust. But the *orchestration* layer, under extreme concurrency, becomes the weak link. This suggests we need to think about queueing, batch limits, and more aggressive health checks on the container runtime itself, not just the agents.
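As a rough sketch of the batch-limit idea, assuming a plain asyncio orchestrator: an `asyncio.Semaphore` caps how many spawns are in flight at once. The limit of 50 is an arbitrary placeholder, and `fake_spawn` is a hypothetical stand-in for `orchestrator.spawn_isolated_agent`.

```python
import asyncio

MAX_IN_FLIGHT = 50  # assumed batch limit; tune to the host

async def spawn_bounded(n_tasks, spawn, limit=MAX_IN_FLIGHT):
    """Run spawn(i) for every i, with at most `limit` running concurrently."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def one(i):
        nonlocal in_flight, peak
        async with sem:  # waits here once `limit` spawns are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await spawn(i)
            finally:
                in_flight -= 1

    await asyncio.gather(*(one(i) for i in range(n_tasks)))
    return peak  # highest concurrency actually observed

async def fake_spawn(i):
    # Hypothetical stand-in for orchestrator.spawn_isolated_agent(config).
    await asyncio.sleep(0.001)

peak = asyncio.run(spawn_bounded(500, fake_spawn))
```

All 500 tasks still get scheduled; the semaphore only delays when each one starts, so the runtime never sees more than `limit` spawns at a time.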
Has anyone else pushed the container count this high? Did you hit different limits based on your runtime (docker, containerd, podman)?
luke out
Keep your keys close.
500 agents? What's the actual use case? Back in the day, a single cron job and a well-written bash script handled batch processing just fine. All this overhead for what? You found the limits of the abstraction, not the system.
Your 'ghost containers' just prove my point. More moving parts, more things to leak. Keep it simple, stupid.
Interesting bottleneck point about the container runtime sockets. I've hit similar limits with my homelab's Caddy reverse proxy when I tried to spin up hundreds of isolated dev environments. The kernel's file descriptor limit for the socket pool was the real killer, not the app resource use. What's your orchestrator's FD ulimit? Might be worth tuning the systemd unit if the runtime runs under systemd.
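For a quick look at the limit from Python, something like this works on Unix. It reads the limit of the process that runs it; the runtime daemon's own limit can differ and, under systemd, is set by `LimitNOFILE=` in the unit. The 8 FDs per container is a pure guess for illustration, not a measurement.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def estimated_fd_demand(n_containers: int, fds_per_container: int) -> int:
    """Back-of-the-envelope FD demand; fds_per_container is an assumption."""
    return n_containers * fds_per_container

soft, hard = nofile_limits()
demand = estimated_fd_demand(500, 8)  # 8 FDs per container is a guess
print(f"soft={soft} hard={hard} estimated demand={demand}")
```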
Ghost containers are the worst. I ended up writing a watchdog script for my Tailscale funnel nodes that scrapes the Docker API for any container older than the expected max task time and force-removes it. Adds a bit of overhead but cleans up the leaks.
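The selection logic is roughly the following sketch. The container records here are simplified, hypothetical dicts; in the real script they come from the Docker API's container list, and the removal step is the equivalent of `docker rm -f`.

```python
from datetime import datetime, timedelta, timezone

def stale_ids(containers, now, max_age):
    """Return IDs of containers created more than max_age before now."""
    return [c["id"] for c in containers if now - c["created"] > max_age]

# Made-up example data: one fresh container, one well past the limit.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
containers = [
    {"id": "a1", "created": now - timedelta(minutes=5)},
    {"id": "b2", "created": now - timedelta(hours=2)},
]

to_remove = stale_ids(containers, now, timedelta(minutes=30))
```

Keeping the age check separate from the removal call makes it easy to dry-run the watchdog before letting it delete anything.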
> a single cron job and a well-written bash script handled batch processing just fine.
Sure, for a single machine with trusted tasks. But that's not the threat model here, right? The whole point is isolation when you can't trust every process. That's where the orchestration overhead buys you something, even if it breaks differently. Your bash script won't save you from a compromised agent trying to escape.
iptables -A INPUT -j DROP