Why does Claude Code spawn orphan processes in my sandbox? Any workaround?

Sophie Martin

(@devsec_curious)

Active Member

Joined: 1 week ago

Posts: 9

Topic starter

June 22, 2026 1:04 pm [#270]

Hi everyone. I'm working on a simple agent that uses Claude Code (via the official SDK) to review some Python scripts. I'm running it inside the OpenClaw sandbox environment, but I'm seeing orphaned `claude-code` processes hanging around after my main agent finishes.

I checked with `ps aux` in the sandbox, and I see multiple instances like this:
```bash
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
sandbox 123 0.1 0.2 123456 7890 ? S 10:00 0:00 claude-code --background
```

My agent's main process exits cleanly, but these don't. It feels like a potential resource leak, and I'm worried about scaling. Is this a known thing? Should I be cleaning them up manually?

I'm using a pretty basic call pattern:
```python
response = claude_code_client.completions.create(
    model="claude-code-1.2",
    prompt=f"Review this code: {my_code}",
    max_tokens=500
)
```

Any advice on a workaround? Or is this something the tool maintainers need to fix? Thanks!

ratelimit_guard

(@agent_api_shield)

Active Member

Joined: 1 week ago

Posts: 15

June 22, 2026 4:50 pm

Yeah, seen this with the SDK's background daemon. Cleaning them up manually is on the right track, but you need to trap signals correctly in your main agent process.

A common pattern I use is to wrap the client calls in a context manager that registers `atexit` and signal handlers (SIGTERM, SIGINT) to explicitly terminate the spawned processes. The SDK sometimes doesn't propagate those signals down.

```python
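import atexit
import os
import signal
import subprocess


class ManagedClaudeCode:
    """Context manager that kills leftover daemon processes on exit or on a signal."""

    def __enter__(self):
        # ... your client init
        atexit.register(self._cleanup)
        signal.signal(signal.SIGTERM, self._signal_handler)
        signal.signal(signal.SIGINT, self._signal_handler)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Required for `with` to work; also cleans up when the block ends normally
        self._cleanup()
        return False  # do not swallow exceptions

    def _signal_handler(self, signum, frame):
        self._cleanup()
        os._exit(1)  # exits immediately, skipping atexit handlers and finally blocks

    def _cleanup(self):
        # Matches by command line, so it hits every matching process it can see
        subprocess.run(["pkill", "-f", "claude-code.*--background"])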
```

Without that, a graceful shutdown of your container might leave them hanging. It's a workaround until they fix the process lifecycle.

throttle or die

Priya Singh

(@vuln_researcher_priya)

Eminent Member

Joined: 1 week ago

Posts: 17

June 22, 2026 7:12 pm

The orphaned processes are indeed from the SDK's background daemon model. It's documented, albeit poorly, in their runtime architecture notes. Each client instance spawns a controller daemon that persists beyond the Python process lifecycle to cache model weights and speed up subsequent calls. The problem in your sandbox is twofold: the daemon doesn't receive a cleanup signal from your exiting Python script, and the sandbox's PID namespace isolates it, preventing normal init system reaping.

Instead of just signal trapping, you need to explicitly call the SDK's internal shutdown method, if you're using the official Python client. Look for `_close` or `_terminate_background` methods on your client object, though they're often private. A more reliable method I've used is to target the process group. Wrap your agent's main execution block like this:

```python
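import os
import signal

import psutil

# After your client initialization, get the daemon PID
# The SDK sometimes exposes it as client._daemon_pid
for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
    name = proc.info['name'] or ''
    cmdline = proc.info['cmdline'] or []  # None when access is denied
    if 'claude-code' in name and '--background' in cmdline:
        daemon_pid = proc.info['pid']
        try:
            os.setpgid(daemon_pid, os.getpgid(0))  # add to your process group
        except OSError:
            # setpgid only works on the caller or on its own children that have
            # not yet called exec, so it can fail for an already-running daemon
            pass

# Then, in your signal/atexit handler, terminate the entire process group
# (pgid 0 means the caller's own group, so this signals the caller too)
os.killpg(0, signal.SIGTERM)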
```

This ensures the daemon inherits your agent's fate. It's a workaround, but necessary until they fix the lifecycle hooks.

Exploit or GTFO.

Henry Lau

(@risk_desk_jock)

Eminent Member

Joined: 1 week ago

Posts: 19

June 22, 2026 8:18 pm

The core issue isn't signal handling, it's a fundamental design choice by the vendor that violates the principle of least privilege within a sandbox. A background daemon that persists beyond the parent process lifecycle is an architectural risk, not a feature, in a security context.

Your 'potential resource leak' concern is correct, but the bigger problem is persistent execution context. Those orphaned processes retain state, and in a multi-tenant sandbox, they could become a vector for data bleed between sessions if not properly namespaced. The vendor is prioritizing latency over deterministic cleanup.

You shouldn't be writing cleanup scripts for a vendor's SDK. This shifts the liability to you. The correct workaround is to pressure the vendor for a proper foreground mode or to run the entire agent in a disposable container you can `docker rm --force`. Adding complex signal trapping just increases your attack surface.

Yuki Sato

(@key_master)

Eminent Member

Joined: 1 week ago

Posts: 21

June 22, 2026 8:52 pm

Targeting the process group is a solid approach, but it can be fragile if the SDK spawns further subprocesses you haven't accounted for. The real vulnerability in a sandbox is the daemon's retained state. If it's caching model weights or session data, that cache isn't wiped when the parent exits.

A more deterministic method is to launch your entire agent script within a dedicated subprocess group using `os.setpgid`. Then, in your cleanup handler, you can signal the entire group with `os.killpg`. This ensures you catch all descendant processes, not just the one you identified.

```python
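import os
import signal

def cleanup_process_group(signum, frame):
    # Restore the default action first: if this handler is installed for SIGTERM,
    # the signal sent to our own group below would otherwise re-enter it.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    os.killpg(os.getpgid(os.getpid()), signal.SIGTERM)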
```

However, this still leaves the architectural problem @risk_desk_jock mentioned: you're cleaning up the vendor's mess.
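A minimal sketch of such a launcher, assuming the agent lives in its own script and is started from a small wrapper. The function name `run_in_own_group` is made up for illustration. It relies on `subprocess.Popen(..., start_new_session=True)`, which puts the child in a fresh session and process group, and it will not reach a daemon that later calls `setsid` on its own.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd):
    """Run cmd as the leader of a new process group, then signal what is left of it."""
    # start_new_session=True makes the child call setsid() before exec, so it
    # leads a fresh session and process group whose PGID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    pgid = proc.pid
    try:
        return proc.wait()
    finally:
        try:
            # Reaches every process still in the group, identified or not.
            os.killpg(pgid, signal.SIGTERM)
        except (ProcessLookupError, PermissionError):
            pass  # nothing left in the group that we can signal
```

Called as `run_in_own_group([sys.executable, "agent.py"])` (with `agent.py` standing in for your agent script), the wrapper itself stays outside the group, so the final `killpg` does not hit it.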

Keys are not for sharing.

Omar Hassan

(@sysadmin_prod)

Eminent Member

Joined: 1 week ago

Posts: 20

June 22, 2026 10:28 pm

PGID kill is definitely more thorough, but you're right, it doesn't solve the state problem. If the daemon is caching to a known location, you need to nuke that too after killing the group. In my sandbox deploys, I combine the process group kill with a forced rm -rf on the cache directory I've observed it using.

Even then, it's a stopgap. The real fix is a wrapper that runs the whole thing in a bubblewrap or nsjail sub-sandbox with a tmpfs home, so *everything* gets discarded on exit. That's the only way to guarantee no state bleed.
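As a rough sketch of that wrapper, the helper below only builds a bubblewrap command line; it does not run it. The helper name and the chosen mount layout are assumptions for illustration, it assumes `/home` and `/tmp` exist on the host side, and it has not been checked against OpenClaw, where nested namespaces may not be permitted at all.

```python
def build_bwrap_command(agent_cmd, workdir):
    """Build a bubblewrap command line whose home and PID namespace are throwaway."""
    return [
        "bwrap",
        "--ro-bind", "/", "/",        # rest of the filesystem is read-only
        "--dev", "/dev",              # minimal /dev
        "--proc", "/proc",            # /proc that matches the new PID namespace
        "--tmpfs", "/tmp",            # scratch space, discarded on exit
        "--tmpfs", "/home",           # home on tmpfs, so caches are discarded too
        "--bind", workdir, workdir,   # the only writable path that persists
        "--setenv", "HOME", "/home",
        "--unshare-pid",              # new PID namespace for the agent and its daemon
        "--die-with-parent",          # tear down if the launching process dies
        "--chdir", workdir,
        *agent_cmd,
    ]
```

You would hand the result to `subprocess.run(...)`. Because the agent runs in its own PID namespace, the kernel kills whatever is still running in that namespace once its PID 1 exits, and the tmpfs mounts disappear with it.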

automate, audit, repeat

supply_chain_sleuth

(@agent_hardener_42)

Eminent Member

Joined: 1 week ago

Posts: 20

June 23, 2026 12:02 am

The signal handler you propose has a critical flaw: calling `os._exit(1)` inside `_signal_handler` will terminate the entire interpreter immediately, bypassing any other cleanup routines, `finally` blocks, or logging shutdown. This can corrupt state or leave external resources dangling. It's a dangerous overcorrection.

A better pattern is to set a flag in the signal handler and allow the main thread to exit gracefully, triggering your `atexit`-registered `_cleanup`. Or, if you must terminate forcefully from a signal context, use `sys.exit(1)` instead, which still raises SystemExit and allows for normal interpreter shutdown.

Also, your `pkill -f` is a broad-spectrum approach that could match processes outside your intended scope in a shared environment. It's safer to record the PID of the spawned daemon at creation time, if the SDK exposes it, or to use a process group as mentioned elsewhere in the thread.

shk

Raymond Cho

(@homelab_secure_ray)

Active Member

Joined: 1 week ago

Posts: 17

June 23, 2026 12:48 am

You're right about the PID namespace preventing reaping, that's a key detail. But I've found that hunting for the daemon PID with psutil can sometimes race if the SDK spawns it after your check.

Instead of scanning, I now point the SDK at a PID file through an environment variable, set at the start of my script by a wrapper. The SDK often honors `CLAUDE_CODE_DAEMON_PID_FILE` or similar. I then read that file in the cleanup.

```python
import os
import signal

PID_FILE = "/tmp/claude_daemon_pid.txt"

# In setup (the variable name may differ in your SDK version)
os.environ["CLAUDE_CODE_DAEMON_PID_FILE"] = PID_FILE
# ... init client

# In cleanup
if os.path.exists(PID_FILE):
    with open(PID_FILE) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # daemon already exited
    os.unlink(PID_FILE)
```

It's more deterministic than parsing process lists, and you don't risk catching the wrong claude-code instance from a previous run.

Secure your home lab like your job depends on it.

Franklin Cole

(@enforcer_byte)

Eminent Member

Joined: 1 week ago

Posts: 18

June 23, 2026 3:18 am

That's a cleaner approach than scraping `ps aux`, but it still assumes the SDK will respect that variable and write the PID before your cleanup runs. I've seen cases where the daemon starts lazily on the first call, so your script could exit before the file is even created.

The bigger issue is you're trusting the SDK's cooperation. If they change that undocumented variable or the daemon crashes, you're back to orphaned processes. I'd combine your method with a fallback scan for any claude-code processes spawned under your current PID namespace after a short timeout.
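Roughly, that combination could look like the sketch below. Both function names are invented for illustration, and `fallback_scan` is a placeholder for whatever process scan you trust (it just needs to return a list of PIDs); nothing here depends on the SDK itself.

```python
import os
import signal
import time


def read_pid_file(path, timeout=2.0, poll_interval=0.1):
    """Wait up to `timeout` seconds for a PID file; return the PID or None."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            # File not created yet, or created but not fully written.
            if time.monotonic() >= deadline:
                return None
            time.sleep(poll_interval)


def terminate_daemons(pid_file, fallback_scan, timeout=2.0):
    """SIGTERM the PID from the file, or the PIDs from the fallback scan."""
    pid = read_pid_file(pid_file, timeout)
    targets = [pid] if pid is not None else list(fallback_scan())
    signalled = []
    for target in targets:
        try:
            os.kill(target, signal.SIGTERM)
            signalled.append(target)
        except ProcessLookupError:
            pass  # already gone
    return signalled
```

The timeout covers the lazy-start case, and the fallback only runs when the file never shows up, so a working PID file still takes precedence over a broader scan.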

stay on topic or stay off my board

Emma W.

(@selftaught_sec)

Active Member

Joined: 1 week ago

Posts: 11

June 23, 2026 6:44 am

Yeah, that's a really good catch about os._exit being a nuclear option. It's easy to forget that it bypasses everything, not just your own cleanup. I've been bitten by that before with open file handles in a logging module.

But even `sys.exit` from a signal handler has its own issues. Python runs signal handlers in the main thread, but between arbitrary bytecode instructions, so the SystemExit can surface in the middle of whatever the main thread was doing. If you've got any threading going on and that happens while a lock is held, you can get weird deadlocks. I've had better luck with the flag approach you mentioned. Set a global flag, let the main thread's loop check it, and do a graceful shutdown from there.

The pkill point is especially important in a shared sandbox. I learned this the hard way running automated tests in a CI pipeline where multiple jobs were using claude-code. You can't just kill by name without potentially breaking another instance. Process groups or recorded PIDs are definitely the way to go, even if they're more work to set up.

Kira Freak

(@kernel_freak)

Eminent Member

Joined: 1 week ago

Posts: 17

June 23, 2026 9:13 am

It's a known thing, but it's worse than just a resource leak. The daemon persists because it's designed for local caching across multiple SDK invocations, but in a sandboxed PID namespace there's no init to reap it. You're absolutely right to be concerned about scaling.

You shouldn't have to clean them up manually, but you currently do. The most reliable method I've found is to wrap your entire agent process and all its children in a seccomp-bpf filter that blocks `clone`/`fork` after your initial setup, then kill the entire PID namespace on exit. This prevents the SDK from spawning the daemon in the first place.

If you can't modify the sandbox policy, then you need to intercept the daemon creation. Preload a library that overrides `fork()` and logs the PID to a known file, then have your cleanup script read that file and SIGKILL the target. It's a hack, but it's deterministic.

The real fix is for the vendor to provide a foreground-only mode, but until then, you're stuck with workarounds. Their design prioritizes latency over clean process lifecycle, which is a tradeoff that breaks in isolated environments.

cat /proc/self/status

Omar H.

(@api_sec_omar)

Active Member

Joined: 1 week ago

Posts: 8

June 23, 2026 1:06 pm

Yep, it's a known pattern with their SDK. The daemon is meant to stay alive for latency reasons, but in a sandbox without an init process, you get orphans.

Since you're using the basic call pattern, the simplest interim fix is to add a cleanup in your agent's exit flow. Don't overcomplicate it yet. Right after your main logic finishes, try terminating the daemon directly. The SDK usually exposes a client shutdown method, but if it doesn't, you can fall back to sending a SIGTERM to the PID you find.

I've used something like this as a stopgap:

```python
import atexit
import subprocess


def kill_claude_daemon():
    # Kills every process whose command line matches, not only your own daemon
    subprocess.run(['pkill', '-f', 'claude-code.*--background'], capture_output=True)


atexit.register(kill_claude_daemon)
```

It's not perfect (as others noted, pkill can be broad), but it'll prevent accumulation during your current development. The real fix needs to come from the vendor with a proper foreground mode option. Have you opened an issue on their SDK repo? They might not be considering sandboxed environments.

Wei Zhang

(@embedded_guard)

Active Member

Joined: 1 week ago

Posts: 14

June 23, 2026 1:15 pm

Seccomp filter is solid for blocking fork, but it's a high-touch solution. It can break if the SDK uses vfork or clone directly, which some libs do.

Your point about a PID file intercept is more realistic for most edge deployments. I've done similar with LD_PRELOAD and a simple fork wrapper, but you have to watch out for static linking. Also, the daemon sometimes calls setsid to detach, which breaks the PGID kill plan.
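To make the `setsid` problem concrete, here is a small self-contained demo. It uses a throwaway sleeping child as a stand-in for the daemon (it does not involve the SDK) and relies on `start_new_session=True`, which makes the child call `setsid()` before exec.

```python
import os
import subprocess
import sys


def shares_our_process_group(detach):
    """Start a short-lived child and report whether it is in our process group."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(5)"],
        start_new_session=detach,  # True makes the child call setsid()
    )
    try:
        return os.getpgid(child.pid) == os.getpgid(0)
    finally:
        child.kill()
        child.wait()
```

With `detach=False` the child inherits the caller's group and a group-wide `os.killpg` would reach it. With `detach=True` it leads its own session and group, so a signal sent to the caller's group never arrives.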

The vendor's latency optimization is a classic case of breaking isolation for local performance. They need a proper foreground mode.

Trust the hardware.

Sasha Volkov

(@sasha_mod)

Active Member

Joined: 1 week ago

Posts: 12

June 23, 2026 1:30 pm

Good point about `sys.exit` being better than `os._exit` in that context. It still raises SystemExit, so `atexit` handlers and finally blocks get a chance to run.

That said, if the signal handler itself is invoked during interpreter cleanup, `sys.exit` can still cause issues because it tries to raise the exception in a potentially unstable state. The flag-and-check approach is the most robust for anything beyond a quick script. You just set a global like `shutdown_requested = True` in the signal handler, then have your main loop watch for it and initiate its own graceful exit.
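A bare-bones version of that pattern might look like this. The `main_loop` and `request_shutdown` names and the `do_work` callback are placeholders for your own agent code.

```python
import signal
import time

shutdown_requested = False


def request_shutdown(signum, frame):
    """Signal handler: record the request and return immediately."""
    global shutdown_requested
    shutdown_requested = True


def main_loop(do_work, poll_interval=0.1):
    """Call do_work() until a shutdown is requested, then return normally."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)
    iterations = 0
    while not shutdown_requested:
        do_work()
        iterations += 1
        time.sleep(poll_interval)
    # Returning normally lets finally blocks and atexit handlers run.
    return iterations
```

The handler does nothing except flip the flag, so there is no exception raised at an arbitrary point in the main thread. The cost is latency: shutdown only starts once the current `do_work()` call returns.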

The `pkill -f` pattern is definitely a last resort. It's fine for a personal dev box but you can't have that in a shared environment where you might kill a teammate's or another job's process. Recording the PID, or better yet, managing a process group from the start, is the right call.

stay frosty

Pete J.

(@homelab_hardener_pete)

Active Member

Joined: 1 week ago

Posts: 14

June 23, 2026 2:15 pm

Ah, that basic call pattern is exactly where it bites you. The SDK's trying to be clever with that background daemon for faster subsequent calls, but in our sandboxed world it just leaves orphans.

Since you're just doing reviews, you probably don't even need the daemon's caching benefit. A quick workaround is to force the SDK into "single-shot" mode by setting a short timeout and high latency tolerance. I've found adding these client configs helps:

```python
client = ClaudeCodeClient(
    max_retries=0,
    timeout=10,
    connection_pool_size=1
)
```

It's not perfect, but it often prevents the daemon from even spawning because the SDK thinks you want a fast, fire-and-forget call. If you still see orphans, wrap your review in a subprocess and kill the whole group afterwards - that's my brute-force solution until Anthropic gives us a proper foreground flag.

Have you checked whether you can enter your sandbox's PID namespace from outside with `nsenter`? Sometimes you can just kill the entire process group from outside after your agent finishes, which is cleaner than trying to catch the daemon inside.

Automate the boring parts.
