I've been stress-testing the container lifecycle isolation in our NanoClaw deployments, specifically watching how the system handles rapid agent spawning. The promise is strong isolation per task, but the overhead becomes visible under concurrency. To make this tangible, I built a Grafana dashboard focused purely on container creation and deletion rates, broken down by agent type.
The core metrics are sourced from the container runtime's exposed stats, piped through a Prometheus node exporter I customized. The key panels track:
- `container_changes_total` (filtered by `operation="create"` or `operation="delete"`)
- Rate of change over 1m and 5m windows, segmented by the `agent_name` label we inject at orchestration
- Concurrent container count per agent, highlighting "stuck" containers that aren't cleaned up
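For anyone who wants to sanity-check the panel math outside Grafana, here's a minimal Python sketch of the rate and stuck-container panels. The `Sample` type, the function names, and the `max_age_s` cutoff are placeholders I made up for illustration, not anything NanoClaw or Prometheus ships. The rate is a simplification of PromQL's `rate()`: it handles counter resets but does no extrapolation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since epoch
    value: float      # cumulative counter value, e.g. container_changes_total


def per_second_rate(samples: list[Sample]) -> float:
    """Approximate a counter's per-second rate over the window the samples cover.

    A drop in value is treated as a counter reset (counting restarts from 0).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        delta = cur.value - prev.value
        increase += delta if delta >= 0 else cur.value
    elapsed = samples[-1].timestamp - samples[0].timestamp
    return increase / elapsed if elapsed > 0 else 0.0


def stuck_containers(started_at: dict[str, float], now: float, max_age_s: float) -> list[str]:
    """Return names of containers that have been alive longer than max_age_s."""
    return sorted(name for name, t in started_at.items() if now - t > max_age_s)
```

Feeding it the create-counter samples for one `agent_name` over a 1m or 5m window gives roughly the number the panel plots; the stuck list is just an age threshold, so tune the cutoff to your longest legitimate task.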
Here's the essential part of the Prometheus scrape config that adds the agent label from our orchestration metadata:
```yaml
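# Caveat: __meta_docker_* source labels are only populated by docker_sd_configs,
# so pair these relabel rules with Docker service discovery rather than static targets.
- job_name: 'nano_claw_container_stats'
  static_configs:
    - targets: ['node-exporter:9100']
  relabel_configs:
    - source_labels: [__meta_docker_container_label_agent_name]
      target_label: agent_name
    - source_labels: [__meta_docker_container_name]
      target_label: container_name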
```
The dashboard immediately revealed two gaps. First, under high load from our `nemo_claw` inference agents, the deletion rate couldn't keep up with the creation rate, leading to a growing container backlog. This points to a resource contention issue in the orchestration layer, not the isolation model itself. Second, shared volumes (even read-only ones) used by multiple concurrent agent instances occasionally caused temporary lock contention, visible as spikes in the container `start_time` duration.
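The backlog in the first gap is just creations minus deletions per agent. A rough Python sketch of that bookkeeping, assuming you can export lifecycle events as `(agent_name, operation)` pairs (the function name and event shape are hypothetical, not part of our exporter):

```python
from collections import Counter


def backlog_by_agent(events):
    """Return {agent_name: creates - deletes} for (agent_name, operation) pairs.

    Operations other than "create" and "delete" are ignored.
    """
    backlog = Counter()
    for agent_name, operation in events:
        if operation == "create":
            backlog[agent_name] += 1
        elif operation == "delete":
            backlog[agent_name] -= 1
    return dict(backlog)
```

A value that keeps climbing across successive windows for one agent is the pattern we saw with `nemo_claw`; a value hovering near zero means cleanup is keeping pace.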
This visualization helped us pinpoint that isolation breaks down not during normal operation, but during orchestration-scale events: mass scale-up, eviction storms, or when shared volume mounts are misconfigured as `rw` instead of `ro`. The next step is to correlate this with GPU isolation metrics to see if container backlog leads to unexpected CUDA device sharing. Has anyone else instrumented their container lifecycle rates and seen similar orchestration bottlenecks?
Interesting approach. I've been working on similar instrumentation for the Ironclaw agent sandbox, but focused on failure states rather than just rates. Your `container_changes_total` metric is useful, but it might be missing a crucial dimension: the exit status or termination signal of deleted containers.
In my tests, a high delete rate coupled with a high frequency of non-zero exit codes became a leading indicator for isolation bypass attempts. The orchestrator was cleaning up the containers, but the failures suggested the isolation boundaries were being probed. You might consider adding a panel that joins your deletion rate with a metric like `container_state_exit_code`, filtered for non-zero values.
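To show what I mean by that join, here's a small Python sketch that flags agents whose deletions are both frequent and mostly dirty. The `(agent_name, exit_code)` record shape, the function name, and both thresholds are assumptions for illustration; pick values that match your own baseline.

```python
from collections import defaultdict


def suspicious_agents(deletions, min_deletes=20, max_dirty_ratio=0.25):
    """Return agents with at least min_deletes deletions in the window and a
    non-zero-exit fraction above max_dirty_ratio.

    deletions: iterable of (agent_name, exit_code) for deleted containers.
    """
    totals = defaultdict(int)
    dirty = defaultdict(int)
    for agent_name, exit_code in deletions:
        totals[agent_name] += 1
        if exit_code != 0:
            dirty[agent_name] += 1
    return sorted(
        agent
        for agent, count in totals.items()
        if count >= min_deletes and dirty[agent] / count > max_dirty_ratio
    )
```

The `min_deletes` floor keeps a handful of crashes on a quiet agent from triggering the alert; the signal I care about is high churn and dirty exits together.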
Also, be cautious with the `agent_name` label injection via relabeling. If the agent name originates from user-controlled task metadata, you're potentially exposing your metrics to label cardinality explosion or injection issues. I've seen a scrape config similar to this where an attacker was able to spawn thousands of unique agent names, causing significant memory pressure on Prometheus. It's a good idea to sanitize or hash the label value before it becomes a metric dimension.
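A minimal Python sketch of what that sanitizing step could look like before the value ever reaches a label. Note that hashing on its own only fixes the charset and length problem: a thousand unique names still give a thousand unique hashes, so the sketch also caps the number of distinct values. The names `label_value` and `LabelLimiter`, the `agent_` prefix, and the default cap of 100 are all made up for this example.

```python
import hashlib


def label_value(agent_name: str) -> str:
    """Map a user-supplied agent name to a fixed-length, charset-safe label value."""
    digest = hashlib.sha256(agent_name.encode("utf-8")).hexdigest()
    return "agent_" + digest[:12]


class LabelLimiter:
    """Caps the number of distinct label values; overflow collapses into "other"."""

    def __init__(self, max_values=100):
        self.max_values = max_values
        self._seen = set()

    def admit(self, agent_name: str) -> str:
        value = label_value(agent_name)
        if value in self._seen:
            return value
        if len(self._seen) >= self.max_values:
            return "other"  # cap reached: do not create a new series
        self._seen.add(value)
        return value
```

The limiter is first-come, first-served, which is crude; an allowlist of known agent names would be stricter if your agent set is fixed.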
Exploit or GTFO.
Absolutely right about correlating with exit codes. I've been pushing for that exact data point to be part of the standard audit log entry in OpenClaw's event schema. A high churn rate with clean exits is usually just autoscaling, but dirty exits tell a different story.
Your warning on the `agent_name` label is spot-on and something we learned the hard way. We now hash it and use a lookup table if we need the plaintext for an investigation. Letting user input directly into labels is asking for trouble. If you only need the raw per-agent data for less than a day, I'd also recommend a recording rule that aggregates the high-cardinality label away for anything kept longer.
Have you found a reliable way to distinguish a termination signal (like SIGKILL from the orchestrator) from a container's own abnormal exit? That distinction is crucial for my compliance reporting: a forced kill for a policy violation and a runtime crash can look identical if all you have is the exit code.