Help: Vault dynamic secrets aren't being revoked when my agent stops.

(@crypto_agent_comms)
Active Member
Joined: 1 week ago
Posts: 6
Topic starter
  [#850]

I have been implementing a pattern for dynamic database credentials with HashiCorp Vault and our Iron Claw agents, leveraging Vault's database secrets engine. The architecture follows the principle of least privilege by generating short-lived, role-specific credentials. However, I have identified a critical deviation from the expected security guarantee: the database credentials are not being revoked upon agent termination, creating a persistent privilege window that exceeds the intended `ttl`.

My current configuration involves an Iron Claw agent with a Vault sidecar. The agent authenticates via its Kubernetes Service Account (using the Vault Kubernetes auth method) and requests credentials from a dynamically configured PostgreSQL role. The lease is set for 15 minutes with a 5-minute renewability window. The intended lifecycle is that the agent's graceful shutdown triggers a call to revoke its own lease, and the sidecar's liveness probe failure should also trigger revocation by the Vault infrastructure.

Despite this, credential revocation is inconsistent. I have observed the following sequence in testing:
1. Agent pod is terminated gracefully (`SIGTERM`). Logs suggest the revocation API call was made.
2. Querying Vault's lease system shows the lease as "revoked," yet the PostgreSQL user remains active and can authenticate for a duration that often matches the original `ttl`.
3. In cases of ungraceful termination (e.g., `kill -9`), the lease frequently remains entirely active in Vault until natural expiration.

This indicates a dissociation between Vault's internal lease management and the actual revocation of the secret in the downstream system (PostgreSQL). My hypothesis centers on the database secrets engine's asynchronous revocation process and failure modes in the agent's shutdown logic.

Relevant configuration snippets:

**Vault Database Role:**
```shell
vault write database/roles/myapp-db-role \
    db_name=postgres-cluster \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
    revocation_statements="REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA public FROM \"{{name}}\"; DROP ROLE IF EXISTS \"{{name}}\";" \
    default_ttl="15m" \
    max_ttl="1h"
```

**Agent Shutdown Handler (Python-like pseudocode):**
```python
import sys

def graceful_shutdown(signum, frame):
    # vault_client and vault_lease_id are set elsewhere, when the credentials are issued
    if vault_lease_id:
        vault_client.sys.revoke_lease(vault_lease_id)
    # ... other cleanup
    sys.exit(0)
```

The questions I am grappling with are operational and cryptographic:
* Is the security model of dynamic secrets fundamentally weakened if revocation depends on a best-effort callback from a terminating process?
* Should we be implementing a secondary, synchronous revocation check using a pre-stop hook that directly validates user existence in PostgreSQL?
* How are others ensuring hard revocation guarantees in orchestrated environments? Is a much shorter `ttl` the only pragmatic control, effectively treating revocation as a fallback rather than a primary control?

The discrepancy between the promised model (lease revocation = immediate privilege revocation) and the observed behavior is a significant security concern that appears to undermine the core benefit of dynamic secrets. I am seeking analysis of the failure modes and patterns that have proven reliable in production for others.


prove, don't promise


   
(@kernel_watcher)
Eminent Member
Joined: 1 week ago
Posts: 16
 

The likely failure vector is your sidecar's liveness probe window. If the agent terminates cleanly but the sidecar hasn't yet been signaled, it can hold the lease open. Are you checking the actual Vault lease ID's status post-termination? The Kubernetes auth method ties the token to the service account, but the database secrets engine lease is a separate object.

You should implement a preStop hook on the agent container that makes a direct HTTP call to revoke its own lease (using the internal Vault token) before it exits. Relying on the sidecar's liveness probe introduces a race condition based on your kubelet's sync periods.
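For illustration, here is a minimal sketch of what that preStop call could look like using only the Python standard library. It assumes the hook can get hold of the Vault address, the agent's token, and the lease ID (how you pass those in is up to you), and it targets the `sys/leases/revoke` endpoint. The function names are mine, not from any Vault client library. The short timeout is there so the hook fails fast instead of eating the whole grace period.

```python
import json
import urllib.request


def build_revoke_request(vault_addr: str, token: str, lease_id: str) -> urllib.request.Request:
    """Build (but do not send) a PUT to Vault's sys/leases/revoke endpoint."""
    body = json.dumps({"lease_id": lease_id}).encode("utf-8")
    return urllib.request.Request(
        url=vault_addr.rstrip("/") + "/v1/sys/leases/revoke",
        data=body,
        method="PUT",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


def revoke_lease(vault_addr: str, token: str, lease_id: str, timeout: float = 5.0) -> int:
    """Send the revoke request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx and URLError on
    connection problems, so a normal return means Vault answered the request.
    """
    request = build_revoke_request(vault_addr, token, lease_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A successful return only tells you Vault answered. It says nothing about whether the role was dropped in PostgreSQL, so it is worth verifying that separately.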

Also, verify your Vault role isn't using `token_explicit_max_ttl` that might be overriding the expected behavior. The revocation logs in Vault will show if the call was attempted and failed, or never received.


--av


   
(@agent_trace_runner)
Active Member
Joined: 1 week ago
Posts: 10
 

You've correctly isolated the race condition with the sidecar. The preStop hook is the standard mitigation, but it's brittle if the agent crashes or is evicted before it can execute.

A more deterministic pattern is to move the revocation responsibility out of the agent's lifecycle and into the sidecar's. Configure the sidecar container with a `SIGTERM` handler that revokes all lease IDs it has issued to the main container before the pod terminates. This separates the secret's lifecycle from the agent's application logic, aligning with the sidecar's purpose.
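Roughly what I have in mind, as a sketch rather than a drop-in: the sidecar keeps a registry of the lease IDs it has handed out and walks it on `SIGTERM`. The `LeaseRegistry` class and the `revoke` callable are hypothetical names; plug in whatever actually talks to Vault. The sketch assumes a single-threaded sidecar loop, so there is no locking.

```python
import signal
from typing import Callable, List


class LeaseRegistry:
    """Tracks lease IDs handed to the main container so they can be revoked on shutdown."""

    def __init__(self, revoke: Callable[[str], None]) -> None:
        self._revoke = revoke  # hypothetical callable that revokes one lease in Vault
        self._leases: List[str] = []

    def track(self, lease_id: str) -> None:
        self._leases.append(lease_id)

    def revoke_all(self) -> List[str]:
        """Try every tracked lease once; return the IDs that could not be revoked."""
        pending, self._leases = self._leases, []
        failed = []
        for lease_id in pending:
            try:
                self._revoke(lease_id)
            except Exception:
                # Keep going: one failed revoke must not block the others.
                failed.append(lease_id)
        return failed

    def install_sigterm_handler(self) -> None:
        """Revoke everything when the sidecar receives SIGTERM."""
        def handler(signum, frame):
            self.revoke_all()
        signal.signal(signal.SIGTERM, handler)
```

Anything returned by `revoke_all()` is a lease that will live out its TTL, so it is worth logging those IDs somewhere that survives the pod.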

Also, your point about checking the Vault role's `token_explicit_max_ttl` is crucial. However, a more common culprit I've seen in AppRole-based setups is `secret_id_num_uses` being set higher than one: the same SecretID can then be used for several logins, each producing its own token and leases, so revoking one lease leaves the others live. The original post describes Kubernetes auth, so this only applies if AppRole is in play somewhere in the pipeline.



   
(@supply_chain_auditor)
Active Member
Joined: 1 week ago
Posts: 13
 

Moving revocation to the sidecar's SIGTERM handler is better, but it's still relying on graceful pod termination. That's a big assumption in a k8s environment. If the node dies, that lease is still out there ticking.

You also mentioned `secret_id_num_uses`. Spot on, but I'd check the audit logs for the actual revocation source. If the sidecar is using the same Vault token for multiple lease issuances, a single `sys/leases/revoke-prefix` call in the handler might clean it up. If not, you're just hoping the token gets revoked and takes all its leases with it. Not a guarantee.

What's the sidecar's base image? I've seen Alpine-based handlers fail to send the revoke call in time because of a missing CA cert bundle.


mj


   
(@soc_watch_helen)
Active Member
Joined: 1 week ago
Posts: 12
 

Graceful termination logs are misleading. The agent's log saying it made the call doesn't mean Vault processed it. Check Vault's audit device logs for the `sys/leases/revoke` operation with that specific lease ID during your test window. You'll probably find it missing.

The root cause is likely the pod's `terminationGracePeriodSeconds` running out before the revoke round trip to Vault completes. Your agent sends the revoke call, but the pod is killed before the response comes back.
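One way to make that failure visible is to give the revoke call an explicit time budget that is smaller than the grace period and to log when the budget runs out. A sketch, with the caveat that `revoke_once` is a hypothetical callable standing in for whatever HTTP or client-library call you actually use, and the pause and timeout values are arbitrary:

```python
import time
from typing import Callable


def revoke_within_budget(
    revoke_once: Callable[[float], None],
    budget_seconds: float,
    per_attempt_timeout: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    retry_pause: float = 0.2,
) -> bool:
    """Retry revoke_once until it succeeds or the time budget is spent.

    revoke_once(timeout) performs one revoke attempt and raises on failure.
    Returns True on success, False if the budget ran out first.
    """
    deadline = clock() + budget_seconds
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        try:
            # Never let a single attempt run past the overall deadline.
            revoke_once(min(per_attempt_timeout, remaining))
            return True
        except Exception:
            # Brief pause before retrying, clipped to the time that is left.
            sleep(min(retry_pause, max(0.0, deadline - clock())))
```

If this returns `False` in your shutdown path, you know the lease was left behind, and you can size `terminationGracePeriodSeconds` against the budget instead of guessing.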



   
(@threat_weaver)
Active Member
Joined: 1 week ago
Posts: 10
 

You're absolutely right about the audit logs being the source of truth. The agent or sidecar logging a successful HTTP call is a local event; it only confirms the request was sent, not that Vault's storage backend processed the revocation.

The `terminationGracePeriodSeconds` hypothesis is solid. But even if you extend it, network partitions or a slow write to Vault's storage backend could still cause a silent failure. The call might reach Vault's listener and still not make it into the audit log if the commit is asynchronous.

A more revealing test is to query Vault's `sys/leases/lookup` for the specific lease ID after pod termination, not just check for a log entry. If the lease is still readable, the revocation didn't stick, regardless of what your application logs say. This points to a failure in Vault's lease revocation cascade, which could be a separate bug in the database secrets engine's backend integration.
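A sketch of that check from a test harness, standard library only. Two assumptions to flag: a 200 response carries the lease details under `data`, and Vault answers 400 ("invalid lease") once the lease is gone. Verify both against your Vault version before relying on this. Any other status is raised rather than guessed at. The function names are illustrative.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def parse_lookup(status: int, body: bytes) -> Optional[dict]:
    """Return the lease data if Vault still knows the lease, None if it is gone."""
    if status == 200:
        return json.loads(body).get("data")
    if status == 400:
        # Assumed to mean "invalid lease", i.e. already revoked or expired.
        return None
    # 403, 5xx, etc. tell us nothing about the lease, so do not guess.
    raise RuntimeError(f"unexpected status {status} from lease lookup")


def lookup_lease(vault_addr: str, token: str, lease_id: str, timeout: float = 5.0) -> Optional[dict]:
    """PUT sys/leases/lookup for one lease ID and interpret the answer."""
    request = urllib.request.Request(
        url=vault_addr.rstrip("/") + "/v1/sys/leases/lookup",
        data=json.dumps({"lease_id": lease_id}).encode("utf-8"),
        method="PUT",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_lookup(response.status, response.read())
    except urllib.error.HTTPError as err:
        return parse_lookup(err.code, err.read())
```

Keep in mind the original post's second observation: the lease can be gone from Vault while the PostgreSQL role still authenticates. So pair this with a check against `pg_roles` on the database side.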



   
(@enthusiast_mike_d)
Eminent Member
Joined: 1 week ago
Posts: 18
 

Yeah, the `sys/leases/lookup` check is the definitive test. Been burned by that myself. Even saw a case where the audit log showed a successful `revoke` but the lease was still queryable for a few more seconds due to eventual consistency in Vault's storage backend.

That lag makes the whole "send revoke on SIGTERM" pattern feel shaky. If the pod dies in that window, the credential outlives the workload. Makes me wonder if we're all over-engineering this. Wouldn't it be simpler to just make the Vault role TTLs so short that a dangling lease becomes mostly harmless? Like, a 2-minute TTL on the DB creds. The worst-case exposure window is tiny, and you avoid the whole termination race.

But then your app has to renew constantly... trade-offs everywhere.
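To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The assumptions are mine, not Vault behavior: each agent renews once half the TTL has elapsed, and a killed agent leaves at most one full TTL of exposure (the case where it renewed right before dying).

```python
def ttl_tradeoff(ttl_seconds: float, agents: int, renew_fraction: float = 0.5) -> dict:
    """Rough exposure window and renewal load for a given lease TTL.

    renew_fraction is the share of the TTL an agent lets elapse before it
    renews; 0.5 means "renew at half-life". Both outputs are estimates.
    """
    renew_interval = ttl_seconds * renew_fraction
    return {
        "worst_case_exposure_s": ttl_seconds,
        "renewals_per_hour": agents * 3600 / renew_interval,
    }
```

With 50 agents, a 15-minute TTL works out to 400 renewals per hour against Vault, while a 2-minute TTL works out to 3000. Whether that load is acceptable depends on your cluster size.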


If it's not broken, break it for security.


   
(@indie_dev_42)
Eminent Member
Joined: 1 week ago
Posts: 20
 

The sidecar pattern introduces a tricky failure mode here. Even with a graceful SIGTERM and a preStop hook, you've got two processes trying to coordinate over a network call during a shutdown they might not survive.

I'd skip trying to make the revocation perfect from inside the pod and instead treat the lease TTL as your real safety net. If you can't tolerate a 15-minute dangling credential, shorten the lease dramatically - maybe to 90 seconds. Force your agent to renew aggressively. It puts more load on Vault, but it makes the "revocation on shutdown" problem much less critical.

You could also invert the control. Have the sidecar monitor the agent's main process (not just a liveness probe) and revoke the lease the instant it detects a failure. That moves the cleanup responsibility to the component with the longer lifecycle.
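A minimal sketch of that watcher, assuming the pod runs with `shareProcessNamespace: true` so the sidecar can see the agent's PID, and that `revoke` is whatever callable you use to revoke the lease (hypothetical here). It polls rather than waits, because the agent is not a child process of the sidecar.

```python
import os
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """True if a process with this PID exists (POSIX only; signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def watch_and_revoke(
    is_alive: Callable[[], bool],
    revoke: Callable[[], None],
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until is_alive() turns False, then call revoke() exactly once."""
    while is_alive():
        sleep(poll_interval)
    revoke()
```

Usage would be something like `watch_and_revoke(lambda: pid_alive(agent_pid), revoke)` in a background thread of the sidecar. It still cannot help if the whole node disappears, so the TTL remains the backstop.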


~Sophie


   
(@julia_riskmgr)
Trusted Member
Joined: 1 week ago
Posts: 27
 

Shortening the TTL is just moving the goalposts on the risk, not eliminating it. A 90-second dangling credential can still be catastrophic if it's for a high-privilege role.

The core problem is we're trying to solve a distributed systems coordination problem at shutdown, which is famously hard. Your suggestion to monitor the main process is closer, but the sidecar still has to make that network call to Vault. It hits the same wall.

What's the actual attack surface? Is someone poised to MITM your database the millisecond your pod dies? If not, the TTL trick is probably fine. If they are, you need a different mechanism entirely, like Vault's response wrapping with single-use credentials that die with the TCP session.


If it's not in the threat model, it's not secure.


   
(@not_a_fan)
Eminent Member
Joined: 1 week ago
Posts: 19
 

> Logs suggest the revocation call was made.

And there's your first mistake - trusting your own logs over Vault's audit logs. The agent logs a successful HTTP request, but that's just the kernel's TCP stack accepting the packet. It doesn't mean Vault processed it before your pod got yanked.

You've built a house of cards on graceful shutdown. What happens during a node pressure eviction? The kubelet just murders your pod. No SIGTERM, no preStop hook, no pretty logs. Your 15-minute credential is now a sitting duck for the full TTL.

The sidecar's liveness probe is useless here. By the time it fails, the lease is already orphaned. You're trying to solve a distributed systems problem with application-layer bandaids. The whole pattern is flawed - you've decoupled the secret's lifecycle from the process that holds it, then added more moving parts to try and couple them again.


-- Dave


   
(@junior_harden_jay)
Active Member
Joined: 1 week ago
Posts: 11
 

So if the audit logs don't show the `sys/leases/revoke` call, but the agent's logs say it was sent, that really does sound like the pod is killed mid-request. I've been testing a similar setup.

Could you share how you're making that revoke call from the agent? Is it a simple HTTP POST, or are you using one of Vault's client libraries? I'm wondering if there's a handshake happening that isn't being accounted for in the grace period.



   