>Even with a flag, the "silent" failure you're asking about is the default.
Yeah, that's the real killer here. Even if a `panic_on_observation_failure` flag exists and is set, if the agent's default panic routine tries to write to a network endpoint or a blocked syslog socket, you get a silent hang with no trace. The observation retry loop is the first hang, and the panic-state write is a second, invisible one layered on top.
Been there with a different agent framework. The only clue was a `strace` showing it stuck on a `connect()` to a non-existent log aggregation host. The config flag just changed *when* it tried to make that doomed call.
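For what it's worth, this is roughly what I'd want a panic routine to do instead: one bounded attempt at the remote sink, then fall back to stderr. A minimal Python sketch, not any real agent's API; `emit_panic`, the host, and the port are made-up names for illustration.

```python
import socket
import sys


def emit_panic(message, host, port, timeout=2.0, fallback=sys.stderr):
    """Try a remote log sink with a hard timeout, else write locally.

    Hypothetical helper: the name and arguments are illustrative only.
    Returns "remote" or "local" so the caller knows which path was taken.
    """
    try:
        # create_connection applies `timeout` to the connect attempt and
        # leaves it set on the returned socket, so sendall is bounded too.
        # Caveat: the timeout does not bound DNS resolution itself.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(message.encode("utf-8") + b"\n")
        return "remote"
    except OSError as exc:
        # Covers refused, unreachable, DNS failure, and timeouts
        # (all of them are OSError subclasses).
        print(f"panic sink unavailable ({exc!r}): {message}", file=fallback)
        return "local"
```

The point is that the failure path always leaves a local trace, so a dead aggregation host can't turn a panic into a silent hang.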
~Fiona
>Disable telemetry entirely for a test run.
Sure, that's a valid test. But if the telemetry endpoint is dead, why would the agent *hang*? It should time out and fail. A hang suggests it's stuck in a retry loop with no backoff, or worse, waiting on a connection that never completes. Bad telemetry config usually just drops packets; it doesn't freeze the agent.
My bet is still on the attestation/verification path. It's the only subsystem with a legitimate reason to block the agent lifecycle until it gets a response, and vendors love to make that call synchronous for "security." Telemetry is almost always fire-and-forget.
Where is the PoC?
Good point about telemetry being async. It usually is.
But I've seen bad telemetry libraries block the main thread on flush, especially if they're doing TLS handshakes to a dead host with a crazy-long TCP timeout. It looks like a hang but it's just a poorly implemented sender.
>vendors love to make that call synchronous for "security."
Yeah, that's the real problem. The attestation check is almost certainly sync. If it's hanging, it's probably stuck in that path, not telemetry.
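Quick test: run the agent with `strace -e connect` (add `-f` if it's multithreaded or spawns child processes, otherwise you only trace the initial thread) and look for repeated connection attempts to the same address after the tool finishes. That'll tell you what's actually stuck.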
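If the trace is long, tallying it helps. Rough Python sketch that counts IPv4 `connect()` attempts per destination from saved strace output. It assumes strace's usual rendering of `connect()`; the function name is mine, and IPv6 or Unix-socket lines are simply ignored.

```python
import re
from collections import Counter

# Matches strace's rendering of an IPv4 connect(), with or without the
# "[pid N]" prefix that -f adds in front of the line.
CONNECT_RE = re.compile(
    r'connect\(\d+, \{sa_family=AF_INET, sin_port=htons\((\d+)\), '
    r'sin_addr=inet_addr\("([^"]+)"\)\}, \d+\) = (-?\d+)(?: (\w+))?'
)


def count_connects(strace_lines):
    """Count connect() attempts per (ip, port, errno-name) triple."""
    counts = Counter()
    for line in strace_lines:
        match = CONNECT_RE.search(line)
        if match:
            port, ip, _ret, err = match.groups()
            # Successful calls have no errno name after the return value.
            counts[(ip, int(port), err or "OK")] += 1
    return counts
```

Feed it the lines of the strace log (for example from `strace -o trace.log ...`) and look at `count_connects(lines).most_common(5)`. One destination with a large count and the same errno every time is the thing that's stuck.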
-Tom
That's a solid hypothesis, and the denied connects after stdout/stderr close are a huge clue. It really does point to a cleanup or state-serialization step.
Since you're seeing connects to *non-existent* internal domains, it's not a typical telemetry endpoint. That suggests the agent might be trying to resolve and connect to a hardcoded or misconfigured internal service for some post-execution verification or logging that isn't documented.
Your sandbox is doing its job blocking it, but the agent's library is probably stuck in a retry loop with a long timeout or no backoff. The `strace -e connect` test mentioned above is your next move - it'll show you the exact call pattern and confirm it's stuck there.
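For contrast, this is the shape a well-behaved retry loop has: a fixed attempt budget, capped exponential backoff, and the last error re-raised. A generic Python sketch, not the agent library's actual code; `retry_with_backoff` and its parameters are illustrative.

```python
import time


def retry_with_backoff(op, attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call op() up to `attempts` times, doubling the delay between tries.

    Re-raises the last error instead of looping forever, so a blocked
    endpoint becomes a visible failure rather than a hang.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the failure
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

If the trace shows connects firing at a constant interval with no end, the library is missing at least the attempt budget.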
One extra thing to check: is there a `state_dir` or `cache_dir` config setting that might be writing to a network mount? Sometimes those map to an NFS path that looks like a local dir but triggers network I/O under the hood. That usually shows up as a stalled `open()` or `write()` rather than a `connect()`, so look for those in the trace too.
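A quick way to check is to look up which mount covers the directory. Python sketch below, Linux-only since it reads the `/proc/mounts` format; the helper names and the list of filesystem types are my own and not exhaustive. It doesn't resolve symlinks or unescape spaces in mount points.

```python
import os

# Filesystem types that imply network I/O (illustrative, not exhaustive).
NETWORK_FS = {"nfs", "nfs4", "cifs", "smb3", "ceph"}


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point covering path.

    mounts_text is the content of /proc/mounts.
    """
    path = os.path.abspath(path)
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific (longest) mount point.
            if len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_network_backed(path, mounts_text):
    """True if path sits on a known network FS or any FUSE mount."""
    _, fs_type = fs_type_for(path, mounts_text)
    return fs_type in NETWORK_FS or fs_type.startswith("fuse")
```

Usage inside the sandbox would be something like `is_network_backed(state_dir, open("/proc/mounts").read())`, where `state_dir` is whatever your config resolves to.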
Test early, test often.
Yeah, the denied connects after the tool exits are the smoking gun. Been down this exact road with a different agent in a Kata container. It's almost always a post-execution step you haven't accounted for.
Since your `dep-audit` is signed and internal, check if the agent's verification config applies a *post-call* policy check. Some frameworks automatically try to re-verify the tool's artifact *after* execution to detect runtime tampering. If that's hitting a dead internal CA or timestamp service, it'll hang on those connects.
Your sandbox logs will show the target domains. Match them to any `certificate_identity` or `timestamp_server` URLs in your sigstore config block. That's likely your culprit. Disable the post-execution re-check if the option exists, or stub the endpoint.
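The matching step is easy to script once the config is loaded as a dict. Python sketch, assuming nothing about the real config schema: it walks whatever nesting it finds and reports every URL whose hostname is in the denied list. Function names are mine.

```python
from urllib.parse import urlparse


def hosts_in_config(node, path=""):
    """Yield (config_path, hostname) for each URL-looking string in a nested config."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from hosts_in_config(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from hosts_in_config(value, f"{path}[{index}]")
    elif isinstance(node, str) and "://" in node:
        host = urlparse(node).hostname  # already lowercased by urlparse
        if host:
            yield path, host


def match_denied(config, denied_hosts):
    """Return the config paths whose URL host appears in the sandbox's denied list."""
    denied = {h.lower() for h in denied_hosts}
    return [(p, h) for p, h in hosts_in_config(config) if h in denied]
```

Anything it returns is a config key pointing straight at a host the sandbox is refusing, which is a much shorter list to reason about than the full config.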
That post-execution verification check is a great call, and matches the logs. Your key config snippet cuts off, but if that `certificate_identity` URL points to a dead internal host, you've likely found the hang.
One nuance: it might not be a re-verification of the tool itself, but a separate step to verify the *observation* of the tool's output. Some frameworks try to send an "action completed" proof to an internal attestation service. The failed connections would be the agent trying and retrying that call.
I'd look for a config block named something like `observers` or `attestation_verifiers` in addition to the tool's own `verification`.
--ca
Your hypothesis is correct, but you're likely looking at the wrong layer. gVisor denying the connect is just the symptom. The hang is because the agent's runtime, probably built on an async I/O layer (epoll or io_uring), isn't handling the denied syscall correctly. It's not just blocked; it's stuck in a syscall that never returns an error the runtime can process.
The `connect` calls to non-existent domains are a red herring. They're probably from a library's default DNS resolution failing inside the sandbox. The real issue is the agent's control loop is waiting on a future or promise that will never resolve because the underlying syscall is in a permanent error state gVisor isn't surfacing properly.
Run it under `strace -f` and look for a stuck `epoll_wait`, `io_uring_enter`, or `futex` call *after* the denied connects. That's your actual hang. You'll need to adjust your seccomp profile to return an error code like `EACCES` instead of just blocking the call, or patch the runtime's async handler.
capability check
The DNS resolution failure theory from user363 is a solid angle. I've seen similar hangs in Python's asyncio when a socket call is blocked - the event loop just freezes because the exception isn't raised in a way the async framework can catch.
If you're using the SDK's default async tool handler, try wrapping your agent execution in a timeout. It won't fix the root cause, but it'll show whether the hang is something the event loop can still cancel.
Something like:
```python
import asyncio  # asyncio.timeout() needs Python 3.11+

async with asyncio.timeout(30):  # must run inside a coroutine
    result = await agent.run(prompt)
```
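If it times out, the event loop is still alive and the agent is awaiting something that never completes after the tool finishes. (If the timeout never fires at all, the loop thread itself is blocked in a synchronous call.) Either way, you then need to find which config is causing the post-call connection. Check for any `audit_endpoint` or `reporting_url` hidden in a default config layer.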
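`asyncio.timeout` can only cancel while the event loop is still turning, so for a loop thread stuck in a blocking call I'd also arm the stdlib `faulthandler` watchdog before starting the agent. It dumps every thread's stack from its own watchdog thread. Sketch only; the two wrapper names are mine.

```python
import faulthandler
import sys


def arm_hang_watchdog(seconds, file=sys.stderr):
    """Dump all thread stacks to `file` if still armed after `seconds`.

    The dump is written by faulthandler's own watchdog thread, so it
    fires even when the asyncio loop thread is stuck in a blocking call.
    `file` must be a real file with a file descriptor and must stay open.
    """
    faulthandler.dump_traceback_later(seconds, exit=False, file=file)


def disarm_hang_watchdog():
    """Cancel the pending dump once the agent run has returned."""
    faulthandler.cancel_dump_traceback_later()
```

Call `arm_hang_watchdog(60)` right before the agent run and `disarm_hang_watchdog()` right after. If the run hangs, the dump shows which Python frame was executing when it stalled, which usually names the library making the post-call connection.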
You're right about telemetry usually being async, but that flush blocking on a dead host is a real headache. I've seen it in a Java agent where the telemetry library's shutdown hook tried a synchronous send with a 30-second socket timeout. The agent appeared to hang after completing its work, when it was really just waiting to die gracefully.
The `strace -e connect` test is the quickest way to confirm. If you see repeated connects to the same address *after* the tool's stdout closes, you've found your culprit, whether it's telemetry flush or attestation. The pattern is identical.
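If you own the shutdown path and can't change the library's socket timeout, you can at least put a deadline around the flush. Python sketch rather than the Java from my story; `flush_with_deadline` is a hypothetical wrapper and `flush` stands in for whatever the telemetry library exposes.

```python
import threading


def flush_with_deadline(flush, deadline=5.0):
    """Run flush() on a daemon thread and stop waiting after `deadline` seconds.

    Returns True if the flush finished, False if we gave up on it. The
    thread is a daemon, so an abandoned flush cannot keep the interpreter
    alive at exit.
    """
    worker = threading.Thread(target=flush, daemon=True, name="telemetry-flush")
    worker.start()
    worker.join(timeout=deadline)
    return not worker.is_alive()
```

The trade-off is that a flush you give up on may lose its buffered events. For telemetry that's usually acceptable; for an attestation call it probably isn't, so there you'd want to fail loudly instead.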
stay frosty
That Java telemetry shutdown hook example is a really good one, because it highlights how a synchronous call can hide in a place you wouldn't look. It's not the main telemetry send, it's the graceful shutdown trying to flush a buffer one last time.
But it makes me wonder - if we're seeing denied connects *after* the tool's stdout closes, how can we tell the difference between a telemetry flush and an attestation call? The pattern is the same. Maybe the timing? Like, a flush might happen a few seconds after the tool ends, but an attestation check might be immediate? I'm not sure.
Also, if the agent is hanging on a flush, wouldn't setting a socket timeout in the agent's config solve it? Or is the timeout value inside the library just unreasonably long and unchangeable?
You're all looking for a ghost in the machine. It's hanging. So what?
If your internal tool finishes and spits out the artifact you need, the job is done. Just kill the agent process after a reasonable timeout in your pipeline step. This isn't a sandbox escape, it's a poorly written client library that can't handle a network error.
The cost of you figuring out exactly which misconfigured URL it's trying to hit is higher than the cost of just setting a timeout and moving on. The risk is zero. Your artifact is already verified and logged.
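If your pipeline step is Python, that's a handful of lines. Sketch only; `run_agent_step` and the command you pass it are placeholders for whatever your pipeline actually runs.

```python
import subprocess


def run_agent_step(cmd, timeout_s):
    """Run the agent command; kill it if it outlives timeout_s.

    Returns (finished, returncode). returncode is None when we killed it.
    """
    try:
        done = subprocess.run(cmd, timeout=timeout_s)
        return True, done.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return False, None
```

Note this only kills the direct child. If the agent spawns its own helpers, you'd need a process group or the container runtime's own timeout to clean those up.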
What is the actual threat?
Yeah, that's a classic post-execution hang. Everyone's chasing the misconfigured endpoint, but the real problem is the agent's runtime isn't handling the connection denial as a fatal error.
Since you're on gVisor, I bet the `connect` syscall isn't returning `ECONNREFUSED` or `ETIMEDOUT` in a way your agent's async library can catch. It's probably stuck in a loop where the syscall gets denied, but the runtime's network stack waits on a socket timeout that never fires correctly inside the sandbox.
Before you go down the config rabbit hole, try this: run the agent under `strace -f -e poll,epoll_wait,futex`. Look for one of those calls hanging *after* the last denied connect. If you see it, the fix isn't finding the URL, it's making the denial fatal with a seccomp filter that kills the process on `connect`. Seccomp can't match on domains (it only sees the raw syscall arguments, not the address behind the pointer), so the rule has to cover every `connect` the process makes. Brutal, but it'll unblock your pipeline.
default deny
Oh wow, this is fascinating, and honestly a bit scary as someone just starting to lock down my own setup. That pattern of denied connects after the tool finishes really stands out.
So, if I'm understanding this right, the tool itself is done and has logged its output, but the agent's own process is stuck trying to... phone home for something? That makes me think it's something in the agent's own cleanup or logging, not the tool. Maybe it's trying to send a completion signal or an audit log entry somewhere?
I saw user274 mention a telemetry flush hanging the whole process, and that seems plausible to me. Could there be a default setting in the agent itself, maybe buried in an SDK config, that's trying to report "tool execution success" to some internal dashboard? Even if you've blocked egress, the agent might still be trying, and if that call has a crazy long timeout or no way to fail fast, it'd just hang there forever.
Is there any chance the agent has a built-in metrics or reporting module that gets triggered on completion, separate from the tool verification you've already configured? I'd be sweating bullets trying to track that down.