Exactly, the IP restriction starts to fall apart with ephemeral workloads. But you *could* couple it with a network policy that only allows egress to the database from agent pods matched by a label selector. That's a Kubernetes NetworkPolicy if you're in that world. (Stock NetworkPolicy selects pods by label; matching on a service account needs a CNI-specific policy, e.g. Calico's or Cilium's.)
It's not about the source IP, it's about the workload identity. The database might still see the same IP pool, but the policy engine knows which pods should be allowed to egress. Still, it's a pain to manage if you have lots of different agent types.
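To make the label-selector idea concrete, here's a rough sketch that builds such a policy as a Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The namespace, the `role: db-agent` label, the database CIDR and the port are all made-up placeholders, not anything from this thread.

```python
import json


def agent_egress_policy(namespace, agent_role, db_cidr, db_port):
    """Build a NetworkPolicy manifest allowing labeled agent pods to reach the DB."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "agent-db-egress", "namespace": namespace},
        "spec": {
            # The policy applies to pods by label, not by source IP.
            "podSelector": {"matchLabels": {"role": agent_role}},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    # Only the database address and port are allowed as a destination.
                    "to": [{"ipBlock": {"cidr": db_cidr}}],
                    "ports": [{"protocol": "TCP", "port": db_port}],
                }
            ],
        },
    }


# Hypothetical values for illustration only.
manifest = agent_egress_policy("agents", "db-agent", "10.0.5.10/32", 5432)
print(json.dumps(manifest, indent=2))
```

Two caveats: policies are additive, so pods that no policy selects can still egress anywhere, which means you also need a default-deny egress policy in the namespace for this to actually keep other pods out. And once the agent pods are selected, this rule blocks their DNS lookups too unless you add a rule for that.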
Yeah, the BSL change is a real pain for managed services. For your drop-in, OpenBao is the obvious choice if you're self-hosting, since it's API-compatible. AWS Secrets Manager can do rotation, but the dynamic creds pattern isn't quite the same.
The revocation-on-compromise problem is the real kicker though. Leases aren't designed for that - if an agent is owned, the secret is out. The sidecar pattern others mentioned helps, but you're trading one problem for another (securing the kill channel). Maybe the real answer is accepting that and just making the lease *incredibly* short, like 10 seconds, so the window is tiny. Brutal on your Vault/OpenBao server, but it closes the gap.
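For a feel of how brutal, here's a back-of-the-envelope sketch. The fleet size of 500 agents and the "renew at half the lease" rule are my assumptions, not numbers from anyone's setup.

```python
def renewal_rate(num_agents, lease_seconds, renew_fraction=0.5):
    """Steady-state credential requests per second hitting the secrets backend.

    renew_fraction is an assumption: each agent fetches a fresh credential once
    that fraction of the lease has elapsed, so it never holds an expired one.
    """
    if lease_seconds <= 0 or not 0 < renew_fraction <= 1:
        raise ValueError("lease_seconds must be > 0 and renew_fraction in (0, 1]")
    return num_agents / (lease_seconds * renew_fraction)


# Hypothetical fleet of 500 agents at three different lease lengths.
for lease in (3600, 300, 10):
    print(f"{lease:>5}s lease: {renewal_rate(500, lease):.2f} req/s")
```

Going from a one-hour lease to a 10-second one multiplies the request rate by 360. With dynamic database credentials, each request usually also means creating (and later dropping) a database user, so the database feels it as well as the secrets server.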
build and break
OpenBao is indeed the direct substitute for Vault's dynamic secrets engine, maintaining API compatibility for a near drop-in replacement. However, the core issue you've identified, revocation during active compromise, is outside the scope of any lease mechanism.
Your question presupposes a clean architectural pattern, but I have to disagree that one exists without significant trade-offs. A lease is a temporal promise; immediate revocation requires a separate, real-time control plane, which introduces its own availability and integrity risks. The "clean pattern" you seek is a control loop with a trusted component, like the discussed sidecar, polling an authoritative revocation list. The complexity isn't an oversight, it's inherent to the security requirement.
If you must have near-instant revocation, you're building a distributed system with a consensus problem. Shortening lease times to seconds, as user374 noted, shifts the load to your secret management backend but may be the most operationally straightforward mitigation, accepting that "immediate" is defined by your shortest feasible lease duration.
Proof, not promises.
You're right that the complexity is inherent, but I think you can make that control plane pretty minimal. It doesn't need to be a full consensus system.
For my own setup, I just run a tiny internal CA that signs short-lived client certificates for the agents. The sidecar fetches a fresh one every 30 seconds. The "revocation" is just the CA refusing to sign the next certificate for that agent ID. It's a single POST endpoint, stateless, with a deny list in a Redis cache.
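Roughly, the refuse-to-reissue logic looks like the sketch below. To keep it stdlib-only, an HMAC-signed token stands in for the X.509 certificate and a plain `set` stands in for Redis; the key, function names and agent ID are placeholders, not the real service.

```python
import hashlib
import hmac
import time

SIGNING_KEY = b"demo-key-not-for-production"  # placeholder; a real CA holds a private key
TTL_SECONDS = 30  # matches the 30-second refresh interval


def issue_credential(agent_id, deny_list, now=None):
    """Return a short-lived signed credential, or None if the agent is deny-listed."""
    if agent_id in deny_list:
        return None  # "revocation" is just refusing to issue the next credential
    now = int(time.time()) if now is None else int(now)
    payload = f"{agent_id}|{now + TTL_SECONDS}"
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{sig}"


def verify_credential(credential, now=None):
    """Check signature and expiry, as the service accepting the credential would."""
    now = time.time() if now is None else now
    try:
        agent_id, expires, sig = credential.rsplit("|", 2)
        expires_at = int(expires)
    except ValueError:
        return False  # malformed credential
    expected = hmac.new(
        SIGNING_KEY, f"{agent_id}|{expires}".encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(sig.encode(), expected.encode()) and now < expires_at


deny_list = set()  # stand-in for the Redis deny list
cred = issue_credential("agent-7", deny_list, now=1000)
deny_list.add("agent-7")  # operator flags the agent as compromised
reissued = issue_credential("agent-7", deny_list, now=1020)
print(reissued is None, verify_credential(cred, now=1029), verify_credential(cred, now=1030))
```

The thing to notice is that the credential issued before the deny-list entry keeps verifying until it expires, so the worst-case revocation latency is one full TTL, 30 seconds here.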
It's still a distributed system, yeah, but it's a lot less heavy than trying to retrofit leases for this case. And you're right, it trades one problem for another - now you have to run that CA service. But for a homelab, it's a fun weekend project 😅
My firewall rules are worse than yours.
That's basically the sidecar pattern, but with client certs instead of a token. It's clever. The problem I see is you're now managing a CA, redis, and a signing endpoint. That's three services to keep up and secure, which feels heavier than the 'ramdisk file' idea someone floated earlier.
Also, 30-second polling on the sidecar. If you're going to that much trouble, why not push? A tiny WebSocket endpoint on the sidecar that the control plane can blast a 'drop lease' message to. Less latency, less wasted cycles.
No null pointers allowed.
Yeah, shifting the problem to a simple token service feels like moving the furniture around on the Titanic. You're right about the false-positive tolerance, that's the real bottleneck.
But the whole 'simple service' bit is what gets me. Now you've got a new single point of failure and a brand new API that needs to be absolutely bulletproof. If an attacker owns an agent, pivoting to own or DoS that token service is now priority one. The blast radius just changed shape, not size.
It's always a trade-off, but calling it easier feels optimistic. You're swapping a complex detection problem for a complex availability problem.
Trust me, I'm a hacker.
Good point about the compromise scenario. Leases are a grace period, not an ejection seat.
If you need that instant kill, the cleanest pattern I've seen is pushing workload identity into your database auth. Think IAM roles for RDS or GCP's Cloud SQL IAM integration. The secret becomes the agent's own identity token, which you can revoke centrally in seconds by deleting the service account or IAM role. No sidecars, no polling, just the cloud's own authz plane.
It's not a drop-in replacement for Vault's API, but it removes the whole "secret distribution" problem. You do get locked into a cloud provider, though. For on-prem, SPIFFE/SPIRE can give you a similar pattern, but that's a whole other project.
CVE or GTFO.