HashiCorp's move to the BSL seems to target competitive hosting. If Vault is now "production use prohibited" for managed agent services, what's our move?
We rely on Vault's dynamic database creds for our agent runtime. Need a drop-in replacement that handles short-lived secrets and automatic revocation. AWS Secrets Manager? Azure Key Vault? Something open-source like OpenBao?
Specifically, how do we handle revocation when an agent is compromised, without relying on Vault's leases? Is there a clean pattern for immediate secret invalidation that doesn't require a full infrastructure rebuild?
Breaking things to learn.
Good question. I've been testing OpenBao (the Vault fork) as a potential replacement and its lease system seems similar, but I'm not sure about the agent compromise scenario.
How are you detecting the compromise in the first place? Would you need something like an intrusion detection signal to trigger the revocation, or is it more about having a kill switch ready to go?
If the agent itself is compromised, couldn't it just use the valid secret until the lease expires anyway? Maybe the pattern needs to include faster rotation outside the secret manager's control.
Yeah, the BSL change is a real kicker for agent workloads. OpenBao's the obvious fork to test, but for your specific issue with revocation on agent compromise, I've been playing with a sidecar pattern that might help.
Instead of just giving the agent a database credential, I run a tiny companion service alongside it that holds the lease. That companion listens for a kill signal - say, from a monitoring system that spots weird network traffic. If we get a compromise alert, we can send a SIGTERM to the sidecar, which then immediately drops the lease and stops renewing. The actual agent loses DB access on the next connection attempt, not at lease expiry. It's a bit more overhead but separates the secret lifecycle from the agent's.
The trick is getting a reliable compromise signal. I've been feeding audit logs from the agent hosts into a simple rule engine that looks for abnormal outbound calls, then triggers the kill via a webhook. Still messy, but it avoids rebuilding everything from scratch. Have you looked at how your detection pipeline could plug into a revocation hook?
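For anyone who wants to see the shape of it, here is a rough sketch of the lease-holding sidecar described above. It assumes a Vault-compatible HTTP API at a local address and uses the lease revoke endpoint; `LeaseHolder`, `install_kill_handler` and the `renew`/`revoke` callables are names I made up for illustration, not part of any library.

```python
import json
import signal
import threading
import urllib.request

VAULT_ADDR = "http://127.0.0.1:8200"  # assumption: local Vault/OpenBao listener


def build_revoke_request(lease_id, token):
    """Build (but do not send) a lease-revoke call for a Vault-compatible API."""
    body = json.dumps({"lease_id": lease_id}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/sys/leases/revoke",
        data=body,
        method="PUT",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


class LeaseHolder:
    """Hypothetical sidecar core: renew until killed, then revoke exactly once."""

    def __init__(self, lease_id, renew, revoke, interval=10.0):
        self.lease_id = lease_id
        self._renew = renew      # callable(lease_id), e.g. wraps a renew request
        self._revoke = revoke    # callable(lease_id), e.g. sends build_revoke_request
        self._interval = interval
        self._stop = threading.Event()

    def kill(self, *_signal_args):
        """Safe to use as a signal handler or to call from a webhook handler."""
        self._stop.set()

    def run(self):
        # Event.wait returns False on timeout (keep renewing) and True once killed.
        while not self._stop.wait(self._interval):
            self._renew(self.lease_id)
        self._revoke(self.lease_id)


def install_kill_handler(holder):
    """Wire SIGTERM to the kill switch; must be called from the main thread."""
    signal.signal(signal.SIGTERM, holder.kill)
```

The agent never sees the lease ID or the token used to renew it, which is the whole point of the separation. Whether existing database connections are cut on revoke depends on how the revocation statements for the database role are written, so treat "next connection attempt" as the baseline.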
More VLANs than friends.
That sidecar pattern is a clever architectural separation. It directly addresses the core risk of the compromised agent retaining a valid, renewable secret.
My main caveat would be operational complexity. You now have to manage and secure that sidecar's own lifecycle and API surface. If the attacker gains control of the agent, could they also interfere with the sidecar or block its kill webhook to *prevent* revocation? You'd need to ensure the kill signal channel is strictly one-way, originating from a separate, more trusted control plane.
Have you considered requiring the sidecar to poll for a revocation flag in a separate, immutable location, rather than listening for an incoming signal? It trades some latency for a smaller attack surface.
- Asia (mod)
The immediate revocation problem is the crucial architectural gap between Vault's lease-based model and most cloud secrets managers. AWS Secrets Manager and Azure Key Vault primarily offer versioned, static secrets; they lack a native, immediate revocation primitive.
OpenBao is your functional drop-in, but it inherits the same lease model. The sidecar pattern mentioned later adds complexity, but its core insight is correct: you need to decouple the secret's validity from the agent's runtime. A simpler, though less granular, alternative is to bind the secret's validity to the agent's instance identity. Have the secrets service issue credentials that are valid only for a specific attested workload, using something like SPIFFE/SPIRE. If the agent is terminated or its identity is compromised, you remove its registration so no new SVID is issued. SPIRE relies on short-lived SVIDs rather than a revocation list, so anything that required that identity stops working once the current SVID expires.
This shifts the problem from secret revocation to identity revocation, which often has better infrastructure support. You'd need to ensure your database or middleware can validate these SPIFFE IDs, of course.
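As a rough illustration of the validation step on the database or middleware side, here is a minimal sketch that parses a SPIFFE ID and checks it against a trust domain, an allowed set of paths, and a local deny list. The function names, the deny list, and the example IDs are all hypothetical; a real deployment would also verify the SVID's signature and expiry, which this sketch skips.

```python
from urllib.parse import urlparse


def parse_spiffe_id(spiffe_id):
    """Split 'spiffe://trust-domain/path' into (trust_domain, path)."""
    parts = urlparse(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc or parts.query or parts.fragment:
        raise ValueError(f"not a valid SPIFFE ID: {spiffe_id!r}")
    return parts.netloc, parts.path


def is_authorized(spiffe_id, trust_domain, allowed_paths, denied_ids):
    """Allow only known workloads in our trust domain that are not denied.

    denied_ids is a hypothetical deny list kept by the middleware itself, so a
    flagged workload is refused even while its current SVID is still unexpired.
    """
    try:
        domain, path = parse_spiffe_id(spiffe_id)
    except ValueError:
        return False
    if spiffe_id in denied_ids:
        return False
    return domain == trust_domain and path in allowed_paths


denied = set()
agent = "spiffe://example.org/agent/runtime"
print(is_authorized(agent, "example.org", {"/agent/runtime"}, denied))
denied.add(agent)  # compromise flagged
print(is_authorized(agent, "example.org", {"/agent/runtime"}, denied))
```

The deny list is what gives you something close to immediate cut-off here; without it you are waiting for the SVID's own lifetime to run out.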
Signed from commit to container.
We're looking at OpenBao too, for exactly the same dynamic database creds use case. Initial tests show the API is identical, so the swap seems straightforward.
But you're right, the lease model itself doesn't solve the immediate revocation problem if the agent is actively compromised. The sidecar pattern discussed later is interesting, but I'm wondering if there's a simpler, dumber layer we can add. Could we use a network-level kill switch? Something like a short-lived firewall rule managed by our IDS, blocking the agent's egress to the database, instead of trying to revoke the secret directly. It's a blunt instrument, but it might be faster and harder for a compromised agent to interfere with.
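To make the kill-switch idea concrete, here is a small sketch of what the IDS hook might generate. It assumes the rule is applied with iptables on a gateway or on the database host (not on the agent's own host, where a compromised agent could remove it), and that the database listens on 5432. `block_rule` and the addresses are made up for the example.

```python
import ipaddress


def block_rule(agent_ip, db_ip, db_port=5432, chain="FORWARD", delete=False):
    """Return the iptables argv that drops agent -> database TCP traffic.

    delete=True builds the matching removal command, so the block can be
    lifted once the incident is closed (the 'short-lived' part).
    """
    src = ipaddress.IPv4Address(agent_ip)  # raises ValueError on bad input
    dst = ipaddress.IPv4Address(db_ip)
    if not 0 < db_port < 65536:
        raise ValueError(f"invalid port: {db_port}")
    action = "-D" if delete else "-I"  # -I inserts at the top of the chain
    return [
        "iptables", action, chain,
        "-s", str(src), "-d", str(dst),
        "-p", "tcp", "--dport", str(db_port),
        "-j", "DROP",
    ]


print(" ".join(block_rule("10.0.1.15", "10.0.9.20")))
```

You would hand the returned list to something like `subprocess.run(argv, check=True)` on the enforcing host. Building the argv as a list rather than a shell string keeps alert data out of a shell, which matters when the input comes from a detection pipeline.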
Have you done any threat modeling on what exactly a compromised agent could *do* with a valid, short-lived secret? That might narrow down whether we need atomic revocation or if we can just rely on very short TTLs and rapid detection.
Due diligence.
The BSL is annoying, but for your core problem, the lease model itself is the real issue, not the license.
OpenBao is the drop-in. But as others have noted, immediate revocation on compromise isn't something the lease gives you by itself, it's a missing control. A lease can be revoked early, but something outside the agent has to notice the compromise and make that call.
Simpler than a sidecar: bind the credential to the instance. Use SPIFFE or even the cloud's instance metadata. Agent dies, the secret's identity proof dies with it. No signal needed.
Network kill-switch is valid, but now you're managing firewall rules for security events. That's a different ops burden.
You're onto a key issue with the sidecar approach: "The trick is getting a reliable compromise signal."
The detection pipeline *is* the hard part, and it often ends up being a custom, messy rule engine as you describe. My caveat is that your revocation is only as fast as your detection's false-positive tolerance. If you need near-instant revocation, you might have to accept a higher chance of false-positive disruption.
One pattern I've seen is to make the kill signal itself less critical. Instead of a webhook that *must* be delivered, have the sidecar periodically fetch a 'health token' from a very simple, separate service. If the token is missing or revoked, the sidecar drops the lease. This shifts the burden to keeping that simple service secure and available, but it might be easier than a perfect detection feed.
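Roughly, the sidecar's decision logic for that health-token pattern could look like the sketch below. `HealthGate`, the `"valid"` status string, and the three-strikes error budget are my own assumptions for illustration; the fetch function stands in for whatever call reaches the token service.

```python
class HealthGate:
    """Decides whether the sidecar may keep (and renew) its lease."""

    def __init__(self, fetch_status, max_errors=3):
        # fetch_status: hypothetical callable returning a status string, or
        # raising OSError when the token service cannot be reached.
        self._fetch_status = fetch_status
        self._max_errors = max_errors
        self._errors = 0

    def lease_allowed(self):
        try:
            status = self._fetch_status()
        except OSError:
            # Fail closed after a few consecutive errors, so blocking the
            # token service does not keep the lease alive forever.
            self._errors += 1
            return self._errors < self._max_errors
        self._errors = 0
        # Anything other than an explicit "valid" (missing, revoked) drops the lease.
        return status == "valid"
```

The sidecar would call `lease_allowed()` once per poll and revoke the lease the first time it returns False. The error budget is the trade-off knob: a small value disrupts agents during token-service blips, a large one gives an attacker who can block the service more time.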
Opinions are my own, actions are mod-approved.
Ok so the BSL is a problem, but you're asking about the lease and immediate revocation. That's the hard part.
From what I'm reading here, OpenBao is the drop-in for the secrets, but it still has the same lease system. So swapping won't solve your compromise problem.
I'm still new to this, but what happens if you just make the lease super short? Like one minute? Then the agent has to renew constantly, and you could kill that renewal from outside somehow. It's not instant, but it's faster than a long lease.
That's a clever workaround! The sidecar handling the lease makes a lot of sense.
I'm still learning about this stuff. For the kill signal, could you use something like a short-lived file on a ramdisk that the sidecar checks? The monitoring system writes to it if something's wrong. It might be simpler than setting up a whole webhook listener on the sidecar. Just an idea!
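If it helps, this is roughly what I imagine the file check looking like. The path under /dev/shm and the names `watch` and `on_kill` are just placeholders I picked, and `max_polls` only exists so the loop can be tried out without running forever.

```python
import time
from pathlib import Path

KILL_FLAG = Path("/dev/shm/agent-kill")  # assumption: a tmpfs path the monitor can write


def watch(flag, on_kill, poll_seconds=1.0, max_polls=None):
    """Poll for the flag file; call on_kill once and return True when it appears."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if flag.exists():
            on_kill()
            return True
        time.sleep(poll_seconds)
        polls += 1
    return False
```

In the sidecar you would call `watch(KILL_FLAG, on_kill=drop_the_lease)` with whatever function revokes the lease. The directory would need permissions that let the monitoring system write the file but stop the agent process from deleting it.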
Your point about the sidecar's own API surface is critical. Moving from a push webhook to a pull model for the kill signal does shrink the attack profile.
But that immutable location you mention becomes a single point of compromise itself. If an attacker can tamper with or block access to that revocation flag, the whole control breaks. You're trading a webhook endpoint on the sidecar for a centralized service that needs even higher availability and integrity guarantees.
The real question is what's in your threat model for that central flag service. Can your agent workloads reach it? Is it in the same trust domain?
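One way to blunt both the tampering and the blocking concerns is to make the flag a signed, expiring token instead of a bare value. The sketch below uses an HMAC from the standard library to keep it short; the token layout and function names are invented for the example, and a shared key means whoever holds it can also mint tokens, so an asymmetric signature would be the better fit if the sidecar should only be able to verify.

```python
import hashlib
import hmac
import time


def sign_token(key, agent_id, expires_at):
    """Issue 'agent_id|expires_at|mac' (agent_id must not contain '|')."""
    msg = f"{agent_id}|{expires_at}".encode()
    mac = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return f"{agent_id}|{expires_at}|{mac}"


def token_is_valid(key, token, agent_id, now=None):
    """True only for an untampered, unexpired token issued for this agent."""
    try:
        tok_agent, exp_str, mac = token.split("|")
        expires_at = int(exp_str)
    except ValueError:
        return False  # wrong number of fields or non-numeric expiry
    expected = hmac.new(key, f"{tok_agent}|{exp_str}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return False  # tampered or signed with a different key
    if tok_agent != agent_id:
        return False  # token belongs to another agent
    current = time.time() if now is None else now
    return current < expires_at
```

With this shape, revocation is simply the service declining to issue a fresh token. An attacker who blocks the service only causes the last token to expire, and one who edits the token fails the MAC check, so both cases end with the sidecar dropping the lease.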
403 Forbidden
OpenBao is your straight swap for the dynamic creds. It'll handle the lease and revocation the same way.
But you're right, that doesn't solve the "active compromise" problem. The secret's already out. The sidecar and network kill-switch ideas are workarounds for that specific gap.
One more angle: could you make the database itself the enforcement point? Short-lived creds, but also configure the database to only accept connections from a specific set of IPs (the known agent pool). An attacker who exfiltrates the secret still holds a valid credential, but can't reach the endpoint from outside that pool. It's another layer to manage, but it's a static rule, not a dynamic firewall.
Defend the perimeter, control the API.
That's a good point. I hadn't considered making the database the enforcement layer.
But would the static IP rule work if agents are ephemeral, like in auto scaling groups? Their IPs would change. I guess you'd need to manage a CIDR block for the whole VPC or subnet, which might be too broad.
I'm still trying to understand all the layers. This is helpful!
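To put numbers on the "too broad" worry, here is a quick check using Python's `ipaddress` module. The CIDR ranges are invented examples, not anyone's real network.

```python
import ipaddress


def is_allowed(client_ip, allowlist):
    """True if client_ip falls inside any CIDR block on the allowlist."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowlist)


def allowlist_size(allowlist):
    """Total number of addresses the allowlist admits."""
    return sum(ipaddress.ip_network(cidr).num_addresses for cidr in allowlist)


agent_subnet = ["10.0.3.0/28"]   # hypothetical dedicated agent subnet
whole_vpc = ["10.0.0.0/16"]      # hypothetical VPC-wide rule

print(allowlist_size(agent_subnet))          # 16
print(allowlist_size(whole_vpc))             # 65536
print(is_allowed("10.0.7.9", agent_subnet))  # False
print(is_allowed("10.0.7.9", whole_vpc))     # True
```

So a small dedicated subnet for the auto scaling group keeps the rule static while admitting far fewer hosts than a VPC-wide block, at the cost of capping how many agents can run at once.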
Good question. For the drop-in replacement, OpenBao is your best bet, especially if you're self-hosting - the API compatibility is a lifesaver for dynamic database creds.
But you're asking about immediate revocation during an active compromise. That's a policy and architecture problem, not really a product one. Leases alone don't protect you there - a stolen secret stays valid until the lease expires or something explicitly revokes it.
One pattern we've used is combining short leases with a sidecar that pulls a 'health token' from a simple internal service. If the agent is flagged, the token is revoked, the sidecar sees it on its next poll and drops the lease. It's not instant, but it's within your poll window (like 30 seconds) and shifts the problem to securing that one tiny token service. Still, it's extra complexity.
Have you looked at binding the credential to the instance identity (like SPIFFE or cloud IMDS) instead of just the agent process? It's a different approach, but if the instance terminates, the secret is useless.
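For comparing the options, it can help to write the worst-case exposure window down as arithmetic. This is a back-of-the-envelope sketch: the 30-second poll comes from the setup above, while the 20-second detection time, the 2-second revoke time, and the lease TTLs are made-up numbers.

```python
def exposure_with_polling(detect_s, poll_s, revoke_s=0):
    """Worst case: compromise flagged just after a poll, then revoke runs."""
    return detect_s + poll_s + revoke_s


def exposure_ttl_only(detect_s, ttl_s):
    """Worst case when you only stop renewals and wait for the lease to lapse."""
    return detect_s + ttl_s


print(exposure_with_polling(detect_s=20, poll_s=30, revoke_s=2))  # 52
print(exposure_ttl_only(detect_s=20, ttl_s=60))                   # 80
print(exposure_ttl_only(detect_s=20, ttl_s=3600))                 # 3620
```

Detection time shows up in every row, which is why a faster poll or a shorter TTL only helps up to the point where detection becomes the dominant term.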
Selfhosted since 2004
Oh, I like the ramdisk file idea! That does feel simpler than a listener.
My only worry would be, how does the monitoring system get the file onto the ramdisk? Wouldn't it need some kind of access to write to it on the agent's host? That seems like it could be another access to manage and secure. But maybe it's easier than opening a network port on the sidecar.