Using Vault as a private CA is one of the few things it gets right. It cuts out the ceremony of managing static cert files. But most guides overcomplicate the policy.
Keep the agent's policy tight. It only needs two permissions:
* `create`/`update` on the role's issue path, to generate certificates
* `list` on the certs path
Example policy for an agent role:
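```
# Lets the agent request certificates from the agent-dynamic role.
path "pki_int/issue/agent-dynamic" {
  capabilities = ["create", "update"]
}

# Lets the agent list issued certificates.
path "pki_int/certs" {
  capabilities = ["list"]
}
```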
Key points most setups miss:
* Set a short TTL on the role (e.g., 24h). This is your real revocation mechanism.
* Use a distinct common name template per agent type to limit blast radius.
* Agents must *always* check CRL. A compromised agent's short-lived cert is still a weapon.
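For what "always check CRL" can look like on the client side, here is a minimal Python sketch using the standard `ssl` module. The function name and the bundle path are made up for illustration, and how you fetch and refresh the CRL file is left out. The point is only that the check is switched on in the agent's own TLS context.

```python
import ssl


def make_crl_checking_context(ca_and_crl_bundle=None):
    """Build a client TLS context that checks the peer cert against a CRL.

    ca_and_crl_bundle: hypothetical path to a PEM file holding the CA
    certificate(s) plus the current CRL published by that CA.
    """
    # PROTOCOL_TLS_CLIENT enables certificate and hostname verification.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Check the peer (leaf) certificate against a loaded CRL. With this
    # flag set, verification fails if no matching CRL has been loaded.
    ctx.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF
    if ca_and_crl_bundle is not None:
        # load_verify_locations accepts CA certs and CRLs in PEM format.
        ctx.load_verify_locations(cafile=ca_and_crl_bundle)
    return ctx
```

Because a missing CRL makes verification fail, this is a fail-closed setup: a stale or unreachable CRL source turns into connection errors, which is worth planning for.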
The real problem is the Vault agent itself. It becomes a new single point of failure. If you can't tolerate that, this pattern fails.
Less is more.
Oh wow, that's really helpful, thanks for breaking it down. The point about the short TTL being the real revocation mechanism just clicked for me. I'd been thinking about CRLs as the main thing, but you're right, a 24-hour cert is way simpler.
Can I ask a probably basic question? When you say "agents must *always* check CRL", is that something you enforce in the agent config itself, or is there a Vault role setting that forces it? I'm still trying to get my head around all the moving parts.
And yeah, that single point of failure bit... that's the part that makes me nervous about rolling this out for anything serious.
Oh yeah, the single point of failure part always gets me. It's like you solve the cert problem but now your whole auth depends on Vault being up.
You mentioned using a distinct common name template per agent type. For a Raspberry Pi cluster, would you do something like `agent-pi-{{identity.entity.name}}-{{random_uuid}}`? Or is that overkill?
That single point of failure part is what I keep circling back to. You solve one problem so neatly, but then you're just chaining everything to a new service. If Vault goes down, can your agents still renew their certs before the short TTL expires? Or does everything just slowly grind to a halt?
I'm also curious about that policy snippet. Why the list permission on pki_int/certs? Is that for the agent to check its own cert status, or is it needed for something else in the background?
The single point of failure is the real trade-off, you're right on that. And yes, it grinds to a halt if Vault is down during a renewal window; that's the built-in risk. You counter it by running Vault in a high-availability cluster, not just a single instance. It's another piece of infra to manage, but it keeps the SPOF risk from being a total dealbreaker.
About the list permission: it's a common gotcha. Vault's API for reading a *specific* cert (`pki_int/cert/`) doesn't need it, but the agent's TLS library often does a generic OCSP or CRL check against the CA's endpoint, and the list cap is there so that fetch isn't denied. Without the CRL, a cert might be marked revoked and the agent wouldn't know. One thing to verify in your own setup: stock Vault serves the CRL from `pki_int/crl` without authentication, so the `list` capability may turn out to be unnecessary for that.
The grind-to-a-halt part is real. That's why the HA setup isn't a suggestion, it's the price of entry. The other half is staggering your renewals. Don't let 10k agents all hit Vault at T-5 minutes.
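A rough sketch of what staggering can look like, in Python. The function name, the 50% base point, and the 20% jitter window are all assumptions picked for illustration, not anything Vault prescribes. Each agent derives a stable offset from a hash of its own ID, so the fleet spreads out without any coordination.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def renewal_time(agent_id, issued_at, ttl, base_fraction=0.5, jitter_fraction=0.2):
    """Pick a renewal time spread deterministically across a window.

    The agent renews somewhere between base_fraction and
    base_fraction + jitter_fraction of the cert lifetime. The exact
    point comes from a hash of the agent ID, so it is stable per agent.
    """
    digest = hashlib.sha256(agent_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float between 0 and 1.
    spread = int.from_bytes(digest[:8], "big") / 2**64
    fraction = base_fraction + jitter_fraction * spread
    return issued_at + ttl * fraction


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(renewal_time("pi-01", issued, timedelta(hours=24)))
```

With a 24h TTL and these defaults, every agent renews somewhere between hour 12 and hour 16.8, which leaves at least 7.2 hours of headroom to retry if Vault is unreachable at the first attempt.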
> Why the list permission on pki_int/certs?
user484 nailed it. It's for the CRL fetch. But to your point about checking its own status, an agent *could* technically try to list and parse thousands of certs to find its own, which is hilarious. Don't do that. Use the specific cert endpoint for status if you need it, that's a `read` cap.
if it moves, fuzz it
The template you're suggesting, `agent-pi-{{identity.entity.name}}-{{random_uuid}}`, is a decent starting point, but I'd argue the random_uuid is redundant if you're already using the entity name. The entity should be unique per agent. Adding a UUID just makes the CN harder to read in logs and doesn't improve security meaningfully.
I'd propose something more structured, like `pi.agent.{{identity.entity.name}}.cluster.example.internal`. This gives you immediate visibility into the role and the specific agent. The key with the template is to segment by function *and* location. You wouldn't want a compromised `pi.agent` cert to be able to impersonate a `core.database` service, even within the same entity namespace.
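To make the segmentation concrete, here is a small Python sketch that builds one role body per agent function. The helper name is hypothetical. The field names (`allowed_domains`, `allowed_domains_template`, `allow_bare_domains`, `allow_subdomains`, `ttl`, `max_ttl`) are PKI role parameters as I understand them, but treat the exact combination as an assumption to check against the PKI role documentation for your Vault version.

```python
def role_payload(function, domain_suffix, ttl="24h"):
    """Build the JSON body for a per-function PKI role (hypothetical helper).

    function: role segment such as "pi.agent" or "core.database".
    domain_suffix: shared suffix such as "cluster.example.internal".
    """
    # The identity template is resolved by Vault at issue time, so each
    # entity can only request the name built from its own entity name.
    allowed = function + ".{{identity.entity.name}}." + domain_suffix
    return {
        "allowed_domains": [allowed],
        "allowed_domains_template": True,
        "allow_bare_domains": True,   # permit exactly the templated name
        "allow_subdomains": False,    # no names beneath it
        "ttl": ttl,
        "max_ttl": ttl,
    }


pi_role = role_payload("pi.agent", "cluster.example.internal")
db_role = role_payload("core.database", "cluster.example.internal")
```

Two separate roles mean a token scoped to the `pi.agent` issue path has no way to request a `core.database` name, which is the separation described above.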
The overkill, in my view, is not in the template itself but in managing the Vault entities for a large Pi cluster. That's where the real orchestration complexity lives.
segment first
This is such a good starting point, thanks. The policy example really clarifies things.
One follow-up: when you say a short TTL is the real revocation, does that mean I should *also* disable the automatic CRL building in Vault? Or just accept that both mechanisms are running?
Also, the single point of failure thing... yeah. Makes me wonder if this is better for lab/test setups than for production, unless you're fully bought into Vault's HA ecosystem already.
Yeah, that policy example is spot on. Keeping it minimal is the secret sauce.
You're totally right about the Vault agent becoming the SPOF. It's funny how we chase elegant solutions and end up with a new central point anyway. One thing I've tried is baking a super-stripped-down Vault CLI into the agent image itself, so it can at least attempt renewals directly without a separate sidecar process. It's still a dependency, but it cuts one moving part.
And on the short TTL as revocation, it works great until you need to *actively* kill a cert now, not in 24 hours. That's where the CRL check is non-negotiable, even with short lifetimes. A compromised agent with a valid 4-hour-old cert still has 20 hours to cause havoc.
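The arithmetic is simple enough to write down. This is just an illustrative Python helper (the name and the `crl_refresh` parameter are mine): without revocation checking, exposure is whatever lifetime is left on the cert; with it, exposure is capped by how often clients refresh their CRL.

```python
from datetime import timedelta


def remaining_exposure(ttl, cert_age, crl_refresh=None):
    """Return how long a compromised cert stays usable.

    ttl: lifetime the cert was issued with.
    cert_age: how old the cert is when the compromise is detected.
    crl_refresh: how often clients refresh the CRL, or None if clients
    do not check revocation at all.
    """
    remaining = max(ttl - cert_age, timedelta(0))
    if crl_refresh is None:
        return remaining
    # Once revoked, clients stop trusting the cert at their next refresh.
    return min(remaining, crl_refresh)


print(remaining_exposure(timedelta(hours=24), timedelta(hours=4)))
print(remaining_exposure(timedelta(hours=24), timedelta(hours=4),
                         crl_refresh=timedelta(minutes=15)))
```

The first call gives the 20 hours from the example above. The second shows the same cert bounded to 15 minutes, assuming it is revoked immediately and clients really do refresh on that schedule.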
Your point about baking the Vault CLI into the agent image is a pragmatic reduction of the SPOF's blast radius. It shifts the dependency from a network-reachable service to a binary on-disk, which is often more reliable. The trade-off is you now have to manage the lifecycle of that binary across your entire fleet, including its own authentication token or configuration, which reintroduces a shared secret problem.
On the short TTL versus active revocation, you've hit the core tension. The threat model dictates the control. If you're defending against a persistent attacker who has compromised an agent but not its private key, a short TTL is a slow bleed. They still have the valid cert until it expires. A CRL check is the only instantaneous kill switch. However, if your threat model includes an attacker who can intercept and manipulate traffic, forcing CRL/OCSP checks on every connection introduces a new online dependency and potential denial-of-service vector. You must decide which failure mode you can tolerate.
Trust but verify. Actually, just verify.
Your policy example is correct. The `list` on `pki_int/certs` is for CRL retrieval, which is critical.
Short TTL as revocation only works if your threat model excludes an active attacker. A compromised agent with a valid 8-hour cert is a persistent threat. CRL check is the kill switch.
The Vault agent SPOF is the real issue. HA is mandatory, not optional. Stagger renewals. Bake the Vault CLI into the agent image to remove the network sidecar dependency, but you're just trading one problem for secret distribution.
Drop the --privileged flag.
Exactly, that's the tension. You can't fully replace one with the other.
The short TTL is your containment for normal churn - leaked creds, decommissioned services. The CRL is your emergency stop for a live intrusion. I run both: a 24-hour TTL for sanity, but my monitoring stack is configured to slam a cert onto the CRL instantly if an agent starts beaconing to a strange IP.
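For anyone wondering what the "slam it onto the CRL" hook might look like, here is a stripped-down Python sketch that only builds the HTTP request for Vault's PKI revoke endpoint. The function name, address, token, and serial number are placeholders, and it assumes the monitoring system holds a token whose policy allows writing to `pki_int/revoke`. Sending the request and handling errors is left out.

```python
import json
import urllib.request


def build_revoke_request(vault_addr, token, serial_number, mount="pki_int"):
    """Build (but do not send) a request that revokes one certificate.

    serial_number: the colon- or hyphen-separated serial of the cert,
    as returned when it was issued.
    """
    body = json.dumps({"serial_number": serial_number}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/{mount}/revoke",
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


# Placeholder values; a real alert handler would pass these in.
req = build_revoke_request("https://vault.example.internal:8200", "dummy-token", "39:dd:2e")
# urllib.request.urlopen(req) would send it.
```

This only helps if the agent's serial number is recorded at issue time, so the alert can map "strange beaconing from host X" to a specific cert.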
Trading the sidecar for a baked-in CLI is a solid move for reliability, but you're right, it just moves the secret problem. I've had some luck with a short-lived, Vault-signed JWT for the initial bootstrap on the baked CLI, but then you're back to managing that bootstrap secret. It's turtles all the way down.