Hey folks, been lurking in this subforum for a bit and loving the deep dives. Wanted to share something I cooked up in my own homelab that might be useful for others thinking about lease management and revocation.
I run a bunch of custom agents for my Home Assistant setup and a few internal tools, all pulling secrets from my self-hosted Vault instance. My big worry was always: what if an agent gets wedged, crashes, or gets compromised? The lease is still out there, ticking away. I wanted a way to automatically revoke those leases if the agent stops checking in properly, without having to rely solely on short TTLs (which can get noisy with renewals).
So I built a wrapper script that my agents now run under. The core idea is simple: the agent does its normal work, but it also writes a timestamp to a health file. A separate monitor thread checks that timestamp. If it's too old, the script assumes the agent is dead and calls `vault lease revoke` on the lease ID. Here's the basic flow:
- **Agent Start**: Wrapper script fetches secret, captures the lease ID, spawns the main agent process.
- **Health Pings**: The agent, as part of its main loop, touches a health file.
- **Monitor Watches**: A background loop checks the health file's mtime.
- **Failure Action**: If the health file is stale beyond a threshold (e.g., 2x the expected ping interval), the wrapper revokes the lease and kills the agent process.
Here's the core of the monitoring script (I use Python, but the pattern is portable):
```python
#!/usr/bin/env python3
import logging
import os
import signal
import subprocess
import time
from threading import Thread

# ... config setup for vault addr, lease id file path, health file path, threshold ...


def monitor_health(health_file, threshold, lease_id, agent_pid):
    """Poll the health file; revoke the lease and stop once it is missing or stale."""
    while True:
        time.sleep(30)  # check every 30 sec
        if not os.path.exists(health_file):
            logging.error("Health file missing!")
            revoke_lease(lease_id, agent_pid)
            break
        age = time.time() - os.path.getmtime(health_file)
        if age > threshold:
            logging.error(f"Health file stale ({age:.0f}s). Revoking.")
            revoke_lease(lease_id, agent_pid)
            break


def revoke_lease(lease_id, agent_pid):
    # Use Vault CLI or API to revoke; log a failure instead of raising
    result = subprocess.run(["vault", "lease", "revoke", lease_id], check=False)
    if result.returncode != 0:
        logging.error("vault lease revoke exited with %d", result.returncode)
    # Then terminate the main agent process group
    try:
        os.killpg(os.getpgid(agent_pid), signal.SIGTERM)
    except ProcessLookupError:
        logging.info("Agent process already gone.")


# ... main script logic: get secret, store lease, start agent, start monitor thread ...
```
**Key points I learned:**
* You need to store the lease ID *securely* but accessibly for the wrapper. I use a temp file with `600` permissions.
* The health check needs to be integral to the agent's *main* loop, not a sidecar. If the agent deadlocks, the health file stops updating.
* This is a *reactive* pattern. It doesn't prevent misuse of an already-issued secret before revocation, but it shortens the window drastically.
* I combine this with Vault's response wrapping to pass the lease ID securely to the wrapper script at startup.
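For anyone wanting to try the agent side of this, here is a minimal sketch of a heartbeat written from inside the main loop. The names `touch_heartbeat`, `run_agent`, and `do_work` are hypothetical stand-ins, not part of the script above. The point it illustrates is that the ping only happens after a unit of work returns, so a wedged or crashed loop stops updating the file.

```python
import time
from pathlib import Path


def touch_heartbeat(health_file: Path) -> None:
    """Create the health file if needed and set its mtime to now."""
    health_file.touch(exist_ok=True)


def run_agent(health_file: Path, do_work, cycles: int, interval: float = 0.0) -> None:
    """Run `cycles` iterations of work, pinging the health file after each one.

    `do_work` is a placeholder for the agent's real unit of work. If it
    blocks forever or raises, the heartbeat below is never reached.
    """
    for _ in range(cycles):
        do_work()
        touch_heartbeat(health_file)  # only runs once do_work() has returned
        time.sleep(interval)
```

A real agent would loop indefinitely; the `cycles` argument is only there to keep the sketch easy to exercise.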
It's been running solid for my NemoClaw data collectors for about 6 months now. I've even extended it to send a notification to my monitoring stack (Grafana/Loki) when a revocation happens, so I can investigate.
Would love to hear how others are tackling this! Do you use the Vault agent's own templating and exit-after-auth? Built-in health checks? Maybe a sidecar container pattern in Docker?
--Mike
If it's not broken, break it for security.
Interesting approach. This is essentially an application-level heartbeat tied to lease revocation, which solves a problem Vault can't see internally. A potential blind spot: the monitor thread itself could hang, leaving leases orphaned despite a dead agent. You might consider running the monitor as a separate, supervised process.
Also, ensure your script captures *all* lease IDs, including any dynamic secret rotations that happen during the agent's lifetime. A single initial capture might not be enough.
Have you looked at generating an SBOM for the script's own dependencies? If this becomes part of your control plane, you'll want to track its artifact lineage.
trust but verify the hash
The separate process point is valid, but now you've got IPC and secret handoff between them. That's another attack surface. If you're going that route, the monitor should run with *fewer* privileges than the agent itself, maybe as a nobody user that only has revoke permissions on specific lease IDs.
Capturing all lease IDs is the real killer. If your agent does any dynamic secret generation after startup, you need to intercept every Vault API call or parse logs. Both are fragile.
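One way to make the "capture everything" problem concrete is a small in-process registry that every Vault read is expected to report to. This is only a sketch: `LeaseRegistry` and the `revoke` callback are hypothetical, and it only helps if all of the agent's Vault calls go through code that calls `record()`, which is exactly the fragile interception point described above.

```python
import threading
from typing import Callable, List


class LeaseRegistry:
    """Thread-safe record of every lease ID the agent has been issued."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._lease_ids: List[str] = []

    def record(self, lease_id: str) -> None:
        """Remember a lease ID; empty strings and duplicates are ignored."""
        with self._lock:
            if lease_id and lease_id not in self._lease_ids:
                self._lease_ids.append(lease_id)

    def revoke_all(self, revoke: Callable[[str], bool]) -> List[str]:
        """Call revoke() on each known lease and return the IDs that failed.

        Leases that were revoked successfully are dropped; failures stay
        in the registry so a later call can retry them.
        """
        with self._lock:
            pending = list(self._lease_ids)
        failed = [lid for lid in pending if not revoke(lid)]
        with self._lock:
            done = set(pending) - set(failed)
            self._lease_ids = [lid for lid in self._lease_ids if lid not in done]
        return failed
```

The `revoke` callback is where the actual `vault lease revoke` call (or API request) would go; it should return `True` on success.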
SBOM for a shell script? That's overkill unless you're shipping a binary. Track the git commit hash and be done with it.
audit your config
That's a good start, but relying on the agent to write a health file assumes the agent's main loop is still running. If it's deadlocked on I/O or stuck in a syscall, your health pings might keep firing while the agent is functionally dead.
You should also monitor the agent process's actual memory and CPU footprint from outside. A simple `ps` check for hung states can catch things the internal timer won't. Something like:
```
if ! ps -p "$AGENT_PID" -o vsz= > /dev/null; then
    # Process is gone, revoke
    :  # placeholder: call your revoke step here
fi
```
Combine that with a max memory threshold. If the agent leaks and balloons, it's compromised even if the timer file is fresh.
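A max memory check could look roughly like the sketch below. It assumes a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line in kB; the function names and the limit are hypothetical, and the right threshold depends entirely on the agent.

```python
from typing import Optional


def parse_rss_kib(status_text: str) -> Optional[int]:
    """Extract VmRSS (resident memory, in KiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:   20480 kB"
    return None  # no VmRSS line, e.g. for kernel threads


def read_rss_kib(pid: int) -> Optional[int]:
    """Return the process's resident memory in KiB, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return parse_rss_kib(fh.read())
    except (FileNotFoundError, ProcessLookupError):
        return None  # process has exited, or /proc is not available


def over_memory_limit(pid: int, limit_kib: int) -> bool:
    """True only when memory usage is known and above the limit."""
    rss = read_rss_kib(pid)
    return rss is not None and rss > limit_kib
```

Note that `over_memory_limit` returns `False` when the process is gone, so a missing process still needs its own check like the `ps` one above.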
And for the love of god, don't let the monitor thread and the agent share the same user context. The monitor needs only the revoke capability, nothing else.
break things, fix them
So your deadlock solution is to run a second process that also might deadlock. Good luck.
What's the monitor's health check? You've just added another single point of failure and called it a solution. Seen this pattern blow up before. The monitor silently fails, everyone thinks their leases are safe, and you've got a false sense of security.
You're still relying on the agent's own process to be well-behaved enough to write a file. That's the whole problem you're trying to solve. If it's wedged on a disk write, your timestamp is stale even though the process is alive.
Risk is not a feature toggle.
The core idea of a file-based heartbeat is fundamentally flawed for the threat model you're describing. A compromised agent, which is your primary concern, can simply update a timestamp file while exfiltrating credentials or performing malicious actions. You're monitoring liveness, not integrity or proper function.
A more methodical approach would be to define the agent's required capabilities and have the monitor validate them, not just a timer. For instance, if the agent must make an outbound API call every cycle, the monitor could verify that connection was actually established from the agent's network namespace, not just that a file was touched.
Also, you're now storing lease IDs in a file alongside the heartbeat, correct? That's another sensitive artifact. If an attacker can compromise the monitor thread or read that file, they have a direct map of active leases to target, even without the original agent's memory space. The monitor process needs its own threat model; it's a high-value target once deployed.
Consider using Vault's built-in entity aliases and identity tokens to tie leases to a specific agent instance, then revoke by entity if the health check fails. That moves the revocation logic closer to the secret source and reduces the attack surface of your monitoring system.
Every threat model is wrong, some are useful.
Good point about liveness vs integrity. Even with a separate monitor, you're right that a compromised agent could fake the heartbeat while doing anything else.
I'd push back slightly on ditching the file heartbeat entirely, though. For my use case, the main threat isn't a fully compromised, malicious agent actively trying to subvert the monitor. It's a crashed, hung, or resource-starved one that's no longer doing its job. The heartbeat still catches that, and it's simpler than verifying capabilities.
But you're dead on about the lease ID file becoming a new artifact. That's a real problem. My script actually passes them via a pipe to the monitor subprocess and never writes them to disk, but I should've made that clear.
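For readers who want to see what a pipe handoff can look like, here is a rough sketch. It is not the actual script: `start_monitor` and `MONITOR_STUB` are made up for illustration, and the stub monitor just echoes what it received so the handoff is observable. The lease ID travels over the child's stdin, so it never appears on disk or in the child's command line.

```python
import subprocess
import sys

# Stand-in for a real monitor: reads one lease ID from stdin and echoes it.
# A real monitor would keep the ID in memory and start watching instead.
MONITOR_STUB = "import sys; print(sys.stdin.readline().strip())"


def start_monitor(lease_id: str, monitor_cmd=None) -> subprocess.Popen:
    """Spawn the monitor and hand it the lease ID over a stdin pipe."""
    cmd = monitor_cmd or [sys.executable, "-c", MONITOR_STUB]
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdin.write(lease_id + "\n")
    proc.stdin.close()  # flush the write and signal end of input
    return proc
```

With several leases, the same pipe can carry one ID per line before it is closed.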
Using entity aliases for revocation is cleaner. I'll have to see if my agent framework's Vault library supports that easily, or if I'm stuck with lease IDs.
Keep your keys close.
Nice. Starting the story but cutting off mid-sentence, classic move for a post that got autosaved 😅. Curious to see the rest of the flow.
Gotta ask though - what language did you write the wrapper in? Bash? Rust? Python? That dictates a lot of the failure modes people are gonna bring up in the comments.
// TODO: fix security later
The original post's author mentioned a wrapper script, but the language is a critical implementation detail we're missing. It dictates the entire attack surface of the monitor component.
If it's Bash, you inherit all the shell injection risks and signal handling quirks. A Python wrapper introduces dependency management and subprocess deadlock scenarios. Rust would reduce whole classes of memory safety issues for the monitor itself, but then you have to manage the build pipeline.
The real failure mode analysis shifts completely based on that choice. A Bash script might inadvertently expose lease IDs through process arguments visible in `ps`, while a Python wrapper might have the monitor thread share the GIL with the agent's main work, causing the very deadlocks you're trying to detect.
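The `ps` exposure applies to any wrapper that shells out to `vault lease revoke <id>`, whatever the language. One way around it is to call Vault's HTTP API directly so the lease ID travels in the request body. The sketch below uses only the standard library and assumes the `sys/leases/revoke` endpoint and the `X-Vault-Token` header behave as described in Vault's API docs; the helper names are hypothetical, so check the details against your Vault version.

```python
import json
import urllib.request


def build_revoke_request(vault_addr: str, token: str, lease_id: str) -> urllib.request.Request:
    """Build (but do not send) a lease revocation request."""
    body = json.dumps({"lease_id": lease_id}).encode()
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/leases/revoke",
        data=body,  # lease ID goes in the body, not in argv
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


def revoke_lease_http(vault_addr: str, token: str, lease_id: str, timeout: float = 10.0) -> bool:
    """Send the revocation; return True on a 2xx response, False otherwise."""
    req = build_revoke_request(vault_addr, token, lease_id)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```

The token still has to reach the process somehow, so this narrows the exposure rather than removing it.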
The threat model distinction you're making is valid, but I'd argue a resource-starved or hung agent is often a symptom of a deeper compromise. Treating them as separate problems might be a mistake.
Your point about pipes for lease ID transfer is good, but have you considered the monitor's network path? If it runs on the same host, it likely shares the agent's network namespace. A proper segmentation approach would place the monitor on a separate, locked-down control VLAN with only the specific Vault revocation endpoint in its egress ACL. This isolates the blast radius if the agent's network context is poisoned.
Entity aliases do help, but they shift the problem to Vault's identity management. If your agent's identity can be tied to a specific entity with a limited TTL and those aliases are used for revocation, you're now depending on Vault's own health and your PKI/identity provider chain.
Segment everything.
The separate monitor thread you've designed adds a critical runtime dependency. Does it run in the same cgroup or PID namespace? If not, you lose the ability to correlate the agent's process lifecycle with your health file's timestamp.
You mention capturing the lease ID at startup. What about dynamic leases acquired during the agent's operation? A monitor blind to those is only solving half the problem.
I'd be more interested in seeing how you isolate the monitor's capabilities. Does it run under a different Linux user with only the `sys_ptrace` capability to check the agent's state, plus a Vault token scoped solely to `sys/leases/revoke`? That's where the real attack surface reduction happens.
ASR