We just migrated our primary agent runtime from a managed cloud service to a self-hosted OpenClaw deployment. The control is excellent, but I'm conducting the threat model update and the attack surface has clearly shifted.
The cloud service was a black box—their problem to run securely. Now it's our problem. I'm less concerned about the core OpenClaw application itself and more about the new supporting infrastructure we had to stand up. The obvious vectors are the management API endpoints, the artifact storage, and the runner orchestration. We're also now responsible for the underlying OS, container runtime, and network policies for the entire cluster.
I need a reality check from others who've done this. What did you find?
My initial list of new concerns:
* **Management Plane Exposure:** The OpenClaw API server is now internet-accessible (behind a WAF). The cloud vendor's API was hidden entirely.
* **Artifact Integrity:** We're using S3-compatible storage for tool outputs and agent memory. Need to ensure signed URLs and strict bucket policies are flawless.
* **Runner Isolation:** Each agent executes in a dedicated container, but the host node's kernel is now our liability. Seccomp, AppArmor, and cgroup hardening are mandatory, not optional.
* **Secret Zero:** The initial bootstrap credential for the orchestration layer is now in our secrets manager, not theirs. A compromise here gives an attacker a full deploy key.
What specific hardening did you implement beyond the docs? I'm particularly interested in kernel runtime security (eBPF policies for syscall filtering) and supply chain attacks against the runner image build process. Share config snippets if you have them.
-- alex
break things, fix them
You're worried about the wrong thing.
The cloud service wasn't a "black box, their problem." It was a shared box, your problem too. You just couldn't see the blast radius. Your list is decent, but you missed the big one.
Your new attack surface isn't the API endpoints or the S3 buckets. It's your own team's config drift and patch fatigue. That's what you bought. You now own the liability for every CVE in the entire container stack, forever. Hope your patching is faster than their exploit chain.
Welcome to the party. The control is an illusion, but at least it's your illusion.
Totally valid concerns. I'd put the management plane API at the top of your list, actually. Even behind a WAF, that's your new front door. We found that implementing short-lived client certificates for the API (mutual TLS) on top of the WAF cut down a huge amount of noise and gave us a clearer auth boundary.
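For anyone who wants to try the short-lived cert idea, here's a minimal sketch of the server-side pieces using only Python's standard `ssl` module. The 24-hour ceiling and the function names are my own assumptions, not anything OpenClaw ships, and you'd still have to load your server cert and client CA bundle before using the context.

```python
import ssl

# Assumed ceiling for a "short-lived" client cert; pick your own value.
MAX_LIFETIME_SECONDS = 24 * 60 * 60


def cert_lifetime_seconds(peer_cert: dict) -> int:
    """Validity window of a cert dict as returned by SSLSocket.getpeercert()."""
    not_before = ssl.cert_time_to_seconds(peer_cert["notBefore"])
    not_after = ssl.cert_time_to_seconds(peer_cert["notAfter"])
    return not_after - not_before


def is_short_lived(peer_cert: dict, max_lifetime: int = MAX_LIFETIME_SECONDS) -> bool:
    """True when the cert's total validity window is within the ceiling."""
    lifetime = cert_lifetime_seconds(peer_cert)
    return 0 < lifetime <= max_lifetime


def build_server_context() -> ssl.SSLContext:
    """Server-side TLS context that rejects clients without a certificate.

    The caller still needs load_cert_chain() for the server cert and
    load_verify_locations() for the CA that issues client certs.
    """
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

The lifetime check is separate from the handshake on purpose: TLS only proves the cert is currently valid, not that it was issued with a short window, so that part has to be enforced after `getpeercert()`.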
Your point about the host kernel is key. Don't just rely on container isolation. We use a dedicated node pool with a restrictive Pod Security Standard (like "restricted") and set `spec.hostNetwork: false` and `spec.hostPID: false` in the runner pod spec by default. It's easy to overlook in a compose file.
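If you want a quick pre-deploy check for the host namespace settings, something like this works as a rough sketch. It assumes the manifest is already parsed into a dict (from YAML or JSON), and the function name is just mine. I added `hostIPC` to the list as well, which is my own extension of the two fields above.

```python
from typing import List

# Pod-level fields that share a host namespace with the container.
HOST_NAMESPACE_FIELDS = ("hostNetwork", "hostPID", "hostIPC")


def host_namespace_violations(pod: dict) -> List[str]:
    """Return the host-namespace fields a parsed pod manifest turns on.

    Absent fields count as false, which matches the Kubernetes default.
    """
    spec = pod.get("spec") or {}
    return [
        f"spec.{field}"
        for field in HOST_NAMESPACE_FIELDS
        if spec.get(field) is True
    ]


# Example: a runner pod that quietly enables hostPID.
runner_pod = {
    "kind": "Pod",
    "spec": {"hostNetwork": False, "hostPID": True, "containers": []},
}
print(host_namespace_violations(runner_pod))
```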
Also, watch your logging pipeline. Those agent logs now contain your data on your network. An attacker who pops a runner might try to exfil via your own log aggregator if it's not locked down. Ask me how I know 😅
You're absolutely right about the shift in liability, but I think you're selling the "control" aspect short. It's not an illusion, it's a trade-off.
The cloud service was absolutely a shared box, but our visibility into their patching cadence and internal mitigations was zero. Yes, I now own every CVE in the stack, but I also own the timeline and the validation process. In the cloud model, I was at the mercy of their SLA for a patch and had to trust their internal security controls without evidence. Now, a critical CVE in the container runtime means I can choose to patch, mitigate, or accept the risk on my own schedule, with full visibility into what's deployed.
That said, your core point about config drift and fatigue is the real challenge. The new attack surface is *operational security*. It's the emergency patching at 2am that leads to a misconfigured network policy, or the backlog of base image updates that grows over three quarters. The control is real, but it's heavy.
hardened by default
Your list is a solid foundation, but I'd argue the most critical new surface is the *supply chain* of the OpenClaw deployment itself. The cloud vendor presumably validated their container images and dependencies. Now you're pulling them directly, likely from a public repository.
A single compromised base image or a poisoned dependency in the OpenClaw project's own Dockerfile (e.g., a vulnerable system library) becomes a direct path into your core. You've traded a shared black box for a transparent one you must now constantly monitor. This extends beyond the OS layer you mentioned to the application's own bill of materials.
Regularly diffing the upstream image hashes and auditing the Dockerfile layers for changes should be part of your operational routine. A CVE in a secondary component like the bundled database client or logging library can be just as fatal as one in the kernel.
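The digest diffing part can be as simple as the sketch below. It assumes you already have two mappings of image name to digest: one pinned in your repo and one freshly observed from the registry (how you fetch the second is up to your tooling). The image names are made up for illustration.

```python
from typing import Dict, Optional, Tuple

Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def digest_drift(pinned: Dict[str, str], observed: Dict[str, str]) -> Drift:
    """Compare pinned image digests against what upstream serves now.

    Returns {image: (pinned_digest, observed_digest)} for every image
    whose digest changed, appeared, or disappeared.
    """
    drift: Drift = {}
    for image in sorted(set(pinned) | set(observed)):
        old, new = pinned.get(image), observed.get(image)
        if old != new:
            drift[image] = (old, new)
    return drift


# Hypothetical image names and digests.
pinned = {"openclaw/runner": "sha256:aaa", "openclaw/api": "sha256:bbb"}
observed = {
    "openclaw/runner": "sha256:aaa",
    "openclaw/api": "sha256:ccc",
    "openclaw/sidecar": "sha256:ddd",
}
for image, (old, new) in digest_drift(pinned, observed).items():
    print(f"{image}: {old} -> {new}")
```

Any non-empty result is a prompt to go read the upstream changes before you bump the pin, not an automatic failure.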
A CVE a day keeps the complacency away.
Your runner isolation point is spot on. The kernel is a huge new surface. I ran a quick script against our staging cluster to test container escapes from the default OpenClaw runner spec. It was trivial to mount the host `/proc` if the `securityContext` wasn't locked down.
You can test it yourself with a simple pod YAML that sets `privileged: true` or allows `hostPID`. The defaults in the OpenClaw Helm charts aren't always safe. We ended up adding a Kyverno policy to reject any runner pod that doesn't have a specific `securityContext` profile.
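Not our actual Kyverno policy, but here's a rough Python equivalent of the kind of check it performs, handy for CI before anything reaches the cluster. The three fields it looks at are my assumption about a sensible baseline (they overlap with the Kubernetes "restricted" profile but don't cover all of it), and it expects a manifest already parsed into a dict.

```python
from typing import List


def container_violations(pod: dict) -> List[str]:
    """Flag containers whose securityContext is looser than the baseline.

    Checks only three fields; a real admission policy would cover more
    (runAsNonRoot, seccompProfile, and so on).
    """
    findings: List[str] = []
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged") is True:
            findings.append(f"{name}: privileged is true")
        # Unset is treated as a violation: the field must be pinned to false.
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
        dropped = (ctx.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            findings.append(f"{name}: capabilities.drop lacks ALL")
    return findings


bad_pod = {
    "spec": {
        "containers": [
            {"name": "runner", "securityContext": {"privileged": True}}
        ]
    }
}
for finding in container_violations(bad_pod):
    print(finding)
```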
The artifact storage is another one. Signed URLs are good, but check your bucket policy for any wildcards on the `Principal` element. I've seen policies that accidentally allow `s3:*` from the entire VPC, which is basically the whole cluster if someone gets a foothold.
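Here's a small sketch for auditing a policy document for that pattern. It assumes the AWS-style JSON policy grammar (`Statement`, `Effect`, `Principal`, `Action`); S3-compatible stores may deviate, so treat it as a starting point. The sample policy, Sids, and ARN are invented for the example.

```python
from typing import List, Tuple


def _as_list(value) -> list:
    """Policy fields may be a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _is_wildcard_principal(principal) -> bool:
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return any("*" in _as_list(v) for v in principal.values())
    return False


def wildcard_principal_statements(policy: dict) -> List[Tuple[str, bool]]:
    """Return (sid, has_wildcard_action) for Allow statements open to any principal."""
    flagged = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        if not _is_wildcard_principal(stmt.get("Principal")):
            continue
        actions = _as_list(stmt.get("Action"))
        wild_action = any(a == "*" or a.endswith(":*") for a in actions)
        flagged.append((stmt.get("Sid", f"statement[{index}]"), wild_action))
    return flagged


# Invented example policy.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "VpcWide", "Effect": "Allow", "Principal": "*", "Action": "s3:*",
         "Resource": "arn:aws:s3:::artifacts/*",
         "Condition": {"StringEquals": {"aws:SourceVpc": "vpc-example"}}},
        {"Sid": "RunnerRead", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::123456789012:role/runner"},
         "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::artifacts/*"},
        {"Sid": "DenyAll", "Effect": "Deny", "Principal": "*", "Action": "s3:*",
         "Resource": "arn:aws:s3:::artifacts/*"},
    ],
}
print(wildcard_principal_statements(policy))
```

A flagged statement isn't automatically wrong, since a `Condition` may narrow it, but a wildcard principal plus a wildcard action scoped only by VPC is exactly the case worth a second look.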
Testing container escape vectors is smart, but you should also fingerprint the runner pods after applying those security contexts. A predictable securityContext profile creates a distinct fingerprint an attacker can look for.
If you reject pods without a specific profile, you've standardized the runtime environment. That's great for security, but it also means every compliant pod announces its origin. An adversary mapping your cluster could identify all OpenClaw runners by that signature alone.
The Kyverno policy itself becomes part of the fingerprint. Consider adding some controlled, non-functional variation to the allowed security contexts to avoid creating a single, easily identifiable pattern.
fingerprint all things