So you're tired of the platform's managed HSM or KMS and want to roll your own key provider for the enclave. Brave, or foolish. Usually both.
First, disabuse yourself of the notion that this is just another service. You're building the crown jewels vault, not a config file. The compliance checklist crowd will tell you to "use FIPS 140-2 Level 3" and call it a day, but that's a starting point, not a design. The real questions are about actual risk:
* What's the threat model for your key material *before* it reaches the enclave?
* How are you handling the provisioning ceremony? (Hint: if it involves a web console and an IAM role, go back to the drawing board.)
* What's your real recovery story when (not if) your initial sealed storage blob is corrupted?
You'll need to map the entire lifecycle, and I mean *entire*:
* **Provisioning:** Secure channel *into* the enclave. Are you using attested launch and a remote quote? Or just hoping the internal API is safe?
* **Sealing:** Relying solely on the platform's seal key? Better understand the derivation process and its ties to your enclave's identity and software.
* **Persistence:** Where does the sealed blob live? Who/what can access that storage? How often do you re-seal?
* **Teardown/Migration:** This is where most hand-wavy designs collapse. If the enclave dies, does the key material persist for a new, identical enclave? Should it? If you're moving between hardware, what's the attestation and authorization flow for rehydrating the key?
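As a thinking aid for the sealing and migration bullets, here is a toy model of a seal key that depends on both a per-device secret and the enclave's code measurement. It is a sketch under loud assumptions: `derive_seal_key`, the `seal-v1` label, and the HMAC-SHA256 construction are all hypothetical stand-ins, not any platform's actual derivation. The point is only to show why "same enclave, new hardware" and "same hardware, new build" both lose access to the old blob unless you design a migration path.

```python
import hashlib
import hmac


def derive_seal_key(device_secret: bytes, enclave_measurement: bytes,
                    label: bytes = b"seal-v1") -> bytes:
    """Toy KDF (assumption): bind the key to the hardware secret AND the code identity."""
    return hmac.new(device_secret, label + b"|" + enclave_measurement,
                    hashlib.sha256).digest()


# Hypothetical values, for illustration only.
host_a = b"A" * 32                                      # per-device secret on machine A
host_b = b"B" * 32                                      # per-device secret on machine B
build_1 = hashlib.sha256(b"enclave image v1").digest()  # measurement of build 1
build_2 = hashlib.sha256(b"enclave image v2").digest()  # measurement of build 2

original = derive_seal_key(host_a, build_1)

same_enclave_restart = derive_seal_key(host_a, build_1) == original
after_code_update = derive_seal_key(host_a, build_2) == original
after_hw_migration = derive_seal_key(host_b, build_1) == original

print(same_enclave_restart, after_code_update, after_hw_migration)
```

Under this model only the restart case recovers the key. Whether your real platform behaves this way (some allow sealing to a signer identity rather than an exact measurement) is exactly what the policy document has to pin down.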
Write the policy document first. No code. Answer those lifecycle questions under the assumption that your first three designs will have fatal flaws. Then, and only then, look at the SDK samples.
- Levi
Audit what matters, not what's easy.
Wow. Okay. This is incredibly dense and I'm suddenly feeling a lot less brave and a lot more foolish. The phrase "crown jewels vault" just made my stomach drop a little.
Your questions are the exact kind of thing I hadn't even thought to ask. I was so focused on the 'how do I make the enclave accept my key' part, I completely skipped over "how does the key get to the enclave in the first place without being seen." Like, I was picturing a config map or a secret mount, but you're saying that's basically just handing it out the window on the way in, right? 😅
So my stupid-newbie follow-up: when you mention a provisioning ceremony that shouldn't involve a web console... is the ideal something physical? Like, a one-time manual step with a USB stick and a person in a room for the very first root key? That seems so... medieval. But also kind of makes sense?
Exactly. The internal API *is* the attack surface you're trying to shrink. If you're provisioning through a regular cluster service, you've already lost.
Your bullet points hit the critical path, but I'd swap the order. Persistence is often the afterthought that kills you. Where the sealed blob lives determines the recovery story, which determines whether you can ever safely rotate the master key. If your sealed blob is in a bucket with versioning, you just gave an attacker a timeline to work with: every superseded blob is still sitting there to be replayed, including ones sealed before a rotation.
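One reading of the versioned-bucket concern is rollback: an old sealed blob is still validly sealed, so the enclave cannot tell it is stale from the blob alone. A minimal sketch of one common countermeasure follows, assuming the enclave has access to some trusted monotonic counter that lives outside the bucket. The names `wrap_blob` and `accept_blob` are hypothetical, and the ciphertext is treated as opaque, already-sealed bytes.

```python
import hashlib
import hmac
import json


def _tag(seal_key: bytes, header: bytes, ciphertext: bytes) -> str:
    # Length-prefix the header so header/ciphertext boundaries are unambiguous.
    msg = len(header).to_bytes(4, "big") + header + ciphertext
    return hmac.new(seal_key, msg, hashlib.sha256).hexdigest()


def wrap_blob(seal_key: bytes, version: int, ciphertext: bytes) -> dict:
    """Attach an authenticated version number to an already-sealed payload."""
    header = json.dumps({"version": version}).encode()
    return {"header": header, "ciphertext": ciphertext,
            "tag": _tag(seal_key, header, ciphertext)}


def accept_blob(seal_key: bytes, blob: dict, trusted_counter: int) -> bool:
    """Reject tampered blobs and blobs whose version is not the current one."""
    expected = _tag(seal_key, blob["header"], blob["ciphertext"])
    if not hmac.compare_digest(expected, blob["tag"]):
        return False
    # Strict equality is a simplification; real designs need a crash-recovery rule.
    return json.loads(blob["header"])["version"] == trusted_counter


key = b"k" * 32  # stand-in for the enclave's seal key
blob_v1 = wrap_blob(key, 1, b"old sealed bytes")
blob_v2 = wrap_blob(key, 2, b"new sealed bytes")

current_ok = accept_blob(key, blob_v2, trusted_counter=2)
rollback_ok = accept_blob(key, blob_v1, trusted_counter=2)
print(current_ok, rollback_ok)
```

The hard part is the counter, not the check: if the counter lives in the same storage the attacker can rewind, the version field buys nothing.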
So the ceremony starts with answering: who or what is allowed to *write* to that specific persistence endpoint, and how does it prove it's the *right* enclave making the request? Without remote attestation in the mix, you're just playing musical chairs with credentials.
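The shape of that check can be sketched as a challenge-response gate in front of the persistence endpoint. Everything here is an illustrative assumption: `Quote`, `StorageGate`, and `sign_quote` are hypothetical names, and an HMAC under a shared `hw_key` stands in for what would really be a hardware-rooted asymmetric signature verified against a vendor certificate chain.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    measurement: bytes  # 32-byte hash of the enclave image
    nonce: bytes        # 16-byte challenge issued by the verifier
    signature: bytes    # stand-in for the hardware signature


def sign_quote(hw_key: bytes, measurement: bytes, nonce: bytes) -> Quote:
    """Simulates the platform producing a quote (assumption: HMAC, not real attestation)."""
    sig = hmac.new(hw_key, measurement + nonce, hashlib.sha256).digest()
    return Quote(measurement, nonce, sig)


class StorageGate:
    """Authorizes writes only for fresh quotes from allow-listed enclave builds."""

    def __init__(self, hw_key: bytes, allowed_measurements: list[bytes]):
        self._hw_key = hw_key
        self._allowed = set(allowed_measurements)
        self._pending: set[bytes] = set()

    def challenge(self) -> bytes:
        nonce = secrets.token_bytes(16)
        self._pending.add(nonce)
        return nonce

    def authorize_write(self, quote: Quote) -> bool:
        if quote.nonce not in self._pending:
            return False                      # unknown or already-used nonce
        self._pending.discard(quote.nonce)    # nonces are single use
        expected = hmac.new(self._hw_key, quote.measurement + quote.nonce,
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, quote.signature):
            return False                      # quote was not produced by the platform
        return quote.measurement in self._allowed


hw_key = secrets.token_bytes(32)
good_build = hashlib.sha256(b"key-provider enclave v1").digest()
gate = StorageGate(hw_key, [good_build])

quote = sign_quote(hw_key, good_build, gate.challenge())
first_attempt = gate.authorize_write(quote)
replayed_attempt = gate.authorize_write(quote)
print(first_attempt, replayed_attempt)
```

The nonce covers freshness and the allow-list covers identity; drop either one and you are back to plain credentials.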
~Omar
> Without remote attestation in the mix, you're just playing musical chairs with credentials.
Bang on. And most DIY attempts stop right there, because adding remote attestation feels like building a second, even more complex system just to secure the first one.
The practical caveat is that you end up tied to a specific cloud provider's enclave type (like AWS Nitro Enclaves or Azure confidential VMs) just to get that attestation doc. That's a massive, often hidden, lock-in cost. Your custom key provider now has a single point of failure: that vendor's TCB and their API availability for the attestation verification step.
You're not just building a vault anymore, you're building the security guard that checks the guard's ID before he can check the vault. Fun times.
- ken
Yeah, the vendor lock-in is the real kicker. You finally get remote attestation working, and now your entire key provider chain is bolted to AWS's Nitro API being up and healthy. If that hiccups during a key rotation or a disaster recovery drill, you're dead in the water.
We actually sidestepped this a bit for an internal project by using a hybrid model. The initial, high-value provisioning ceremony used a physical HSM appliance (yeah, real hardware). But for day-to-day operations, we accept a slightly weaker, cloud-attested model, knowing we have the air-gapped HSM as a root of trust to re-seed from if needed.
It's messy, but it means the cloud vendor's TCB isn't the *only* single point of failure. Just the one we use most often 😅
Selfhosted since 2004