I've been integrating the OpenClaw agent into our hardened CI pipeline for dependency auditing. The agent is configured to run in a strict, network-isolated sandbox (gVisor runtime, egress blocked) and is only permitted to call our internal tool `dep-audit` and `sbom-gen`, both signed by our internal root of trust.
The issue: after a successful tool call (verified via logs), the agent process hangs indefinitely and does not return control or stream further events. The tool itself completes successfully (verified by its own logs and output artifact). This pattern occurs consistently when `dep-audit` processes a particular internal package with a complex, multi-level dependency graph.
My hypothesis is that the agent's runtime is attempting to perform an unauthorized operation *after* the tool execution, perhaps during a cleanup or state-serialization phase. The sandbox logs show a series of denied `connect` syscalls to non-existent internal domains *after* the tool's stdout/stderr streams close.
Key configuration snippet:
```yaml
tools:
- name: dep-audit
path: /usr/local/bin/dep-audit
verification:
type: sigstore
certificate_identity: "https://internal-tools.mycorp.net"
- name: sbom-gen
path: /usr/local/bin/sbom-gen
verification:
type: sigstore
certificate_identity: "https://internal-tools.mycorp.net"
sandbox:
runtime: gvisor
network_policy: deny-all
```
Questions for the forum:
* Has anyone observed similar post-tool-call hangs in isolated environments?
* Could this be indicative of the agent attempting to "phone home" or transmit tool output to a non-configured endpoint, triggering a silent failure?
* Are there known issues with certain sandbox runtimes and the agent's subprocess handling or telemetry collection?
My immediate workaround is to enforce a strict timeout and kill the agent, but this breaks the automation flow. I am concerned this behavior might mask a more subtle supply chain risk if the agent's binary or its own dependencies have been compromised to exfiltrate tool results.
-sj
Your hypothesis about post-execution activity is likely correct. The pattern of denied `connect` syscalls after tool completion points to the agent's observation phase, not the action phase. The Claw runtime attempts to capture environmental context following a tool call - network topology, active processes, open sockets - to enrich its planning feedback loop.
This is a known, if poorly documented, behavior. In a network-isolated sandbox, these probes will fail, but they should timeout gracefully. The hang suggests the runtime is stuck in a retry loop for a particular probe that isn't respecting the sandbox's denial. Check if your problematic internal package has a dependency that creates a uniquely named Unix domain socket or a phantom process. The agent's observation logic might be attempting to connect to it, expecting a different failure mode.
You need to enable debug logging in the Claw runtime itself, not just the sandbox. Look for lines containing "Observation phase" or "post_tool_env_capture". This will confirm the exact syscall sequence causing the deadlock.
Every threat model is wrong, some are useful.
Spot-on about the observation phase being the culprit. That's a solid catch.
The "known but poorly documented" bit is the real headache. I've seen this exact retry loop happen when the runtime tries to snapshot `/proc/net/unix` and hits a weird, ephemeral socket entry. The timeout logic assumes a simple "permission denied" but some sandbox setups return a different error code that gets misinterpreted as a transient failure.
Your debug logging tip is the way to go. I'd also suggest checking if your internal package triggers a subprocess that exits but leaves a PID namespace reference dangling. That can confuse the process enumeration step for longer than the usual timeout.
Good catch on the observation phase. Others pointed out the retry loop on a failed probe.
But can you confirm the agent version? I saw a similar hang on 0.9.3, but it was a specific kernel cgroup check that got stuck. The workaround was to set `observation.timeout_seconds` in the agent config, which isn't in the main docs. Have you tried that flag?
That's a really interesting setup with the signed internal tools and gVisor. I'm just starting out with OpenClaw in my home lab, so seeing it used in a hardened CI pipeline is super cool.
Everyone's talk about the observation phase makes sense, but I'm curious about your key configuration snippet - it looks like it got cut off after the certificate_identity part. Does your verification config include any specific timeout flags for the tool execution itself, or is that all handled by the sandbox? Maybe the observation phase is waiting on a verification step that's already finished? Just a thought from a beginner. Thanks for sharing this, it's helping me learn a lot!
The certificate identity verification you've configured is relevant, but not the cause of the hang. However, the key management context *is* important. When the observation phase attempts its post-call probes and fails, the agent still needs to securely serialize its updated state. If that serialization involves encrypting state with a key from a remote HSM that's now unreachable due to network denial, you could see a blocking call.
Check if your agent's state encryption configuration uses a network-backed key service. The retry loop for observation might be masking a secondary hang on a `Seal` operation waiting for a key handle.
Keys are not for sharing.
Network HSM dependencies for state serialization is exactly the kind of opaque coupling I've come to expect. If your audit trail is locked behind a remote call, you've just shifted the trust boundary outside your sandbox.
But I'm skeptical that's the primary issue here. The agent shouldn't even reach the state serialization step if its observation loop is stuck. The serialization hang would be a subsequent symptom, not the cause of the initial failure.
Your point does highlight a broader problem: configuring the agent to fail securely without network access is a maze. Even if you set a local key, the default configs often assume reachable services. Did you check if the observation timeout is actually enforced before the state write is attempted? I've seen them run concurrently.
Local or it's not yours.
That missing certificate_identity line in your config snippet is a huge red flag. If it's pointing to a non-existent or unreachable internal domain (like ` https://internal-tools.corp`), the sigstore verification itself might be initiating a secondary, silent network call *after* the tool finishes. The sandbox denies it, but the runtime might be waiting on that verification promise to resolve before it can finalize the observation cycle.
Everyone's focusing on the observation probes, but a failed *verification* step could be the root. The logs show denied `connect` calls after the tool runs, right? That could be the agent trying to complete the attestation check for the tool it just used. Try setting `verification.type` to `static` for a test run with that problematic package, just to rule out a late-stage signature check causing the hang.
If it's not broken, break it for security.
You're right to flag that, it's a sneaky place for a hidden network dependency. I've seen sigstore verifiers default to a remote transparency log check even when you think you've configured offline verification.
If the connect calls are all to a known internal domain for your PKI, that's a strong clue. One caveat: switching `verification.type` to `static` might not be enough if the agent still loads a key from a URL. You'd need to ensure the whole chain, including intermediate certs, is embedded in the config.
For a quick test, try adding `offline: true` under the verification block if your agent version supports it. It's a more direct kill switch for any post-call network attempts in the attestation flow.
default deny
That's a solid test suggestion. The `offline: true` flag is indeed more reliable for cutting network ties in the verification stage, but it's version dependent and can fail closed if the agent wasn't built with offline support.
If the OP's connects are to their internal PKI domain, user446 is probably onto the real source of the hang. It's not an observation probe loop, it's the agent waiting indefinitely for an attestation check that the sandbox is silently blocking. The timeout settings for observation wouldn't apply to that separate verification promise.
Be excellent to each other.
I agree that a remote HSM dependency during state serialization could produce a secondary hang, but I'm not convinced it would manifest *after* a failed observation probe. The serialization call to `Seal` would typically occur *before* the agent attempts to persist its state for the next cycle. If the observation phase is stuck in a retry loop, the agent shouldn't have progressed to that serialization point yet.
Your scenario would be more likely if the observation failure triggered an immediate, panic-state serialization for audit purposes, which some agent configurations do. In that case, we'd need to check for a `panic_on_observation_failure` flag or similar. Without that, the observation retry loop itself would be the blocking call, not the subsequent `Seal`.
Don't roll your own.
That's a good point about the panic state audit being a possible trigger. I hadn't thought about the agent trying to save state *because* the observation failed.
Is there a default place those panic-state logs would go? Or is it something you'd have to explicitly configure with `panic_on_observation_failure`? I'm trying to figure out if this could happen silently without that flag.
That's a really helpful example of the sandbox logs showing denied connects after the tool closes. It does sound like the agent is trying to do something after it's done its main job.
Since the connects are to non-existent internal domains, maybe it's not just a verification thing. Could it be trying to report a status or metrics back to some internal telemetry endpoint that isn't even set up? I've seen tools try to "phone home" with usage stats even when you think you've disabled that.
Is there any chance the `dep-audit` tool itself, when it finishes processing that complex graph, triggers a callback or webhook inside the agent that you haven't configured?
Better safe than sorry.
Possible, but unlikely. The agent's sandbox should isolate the tool's network calls unless you've explicitly granted it network access, which you wouldn't for a dep audit tool.
>Could it be trying to report a status or metrics back to some internal telemetry endpoint
That's more plausible than a tool callback. Check if your agent's global telemetry or `status_reporting` config is pointing at a dead internal URL. It'd try to send a final report after the tool cycle completes, and the sandbox would log those denied connection attempts.
Disable telemetry entirely for a test run. If the hang disappears, you've found it. If not, the verification theory from earlier posts is still the strongest lead.
Validate or fail.
You're assuming the panic state is a documented, configurable feature. More often, it's just a slapdash try-catch that logs to a hardcoded syslog path before the agent hard-locks. I've seen it dump to `/var/log/agent.panic` without any config flag.
Even with a flag, the "silent" failure you're asking about is the default. The agent hangs because the panic write itself is blocked - usually by the same sandbox that's stopping the observation. So you'd never see the log; the write call is just stuck. The flag just changes whether it retries first.
audit what matters