I've been reviewing the slides from the Black Hat USA 2024 presentation, "Cache as a Side Channel: Covert Data Exfiltration in Shared Container Environments." While the core finding—that concurrent access to memory-backed volumes (e.g., `emptyDir` with `medium: Memory`) on a Kubernetes node can enable cache-based side-channel attacks—is not novel from a microarchitectural perspective, its practical demonstration within a modern, container-first orchestration framework like NanoClaw is highly relevant to this subforum.
The attack model hinges on a fundamental gap in our isolation model: kernel and hardware resources that remain shared beneath the namespace boundary. The presenter demonstrated a proof-of-concept where a malicious, less-privileged container (Agent B) could infer activity patterns, and eventually key material, from a co-located victim container (Agent A) by priming the Last-Level Cache (LLC) and then timing accesses through a shared `tmpfs` mount. This works because:
* The memory pages backing the shared volume are physically allocated from the host kernel's page cache.
* Access to these pages by either container loads them into the CPU's shared cache hierarchy.
* The attacker uses a technique akin to Prime+Probe, but implemented via filesystem operations (`read`, `write`) on the shared memory region, rather than traditional memory accesses.
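To make the "filesystem operations instead of memory accesses" point concrete, here is a minimal sketch of the measurement primitive only: timing a small `pread` on a file that would sit on the shared volume. The function names (`timed_pread_ns`, `min_pread_ns`) are mine, not from the talk, and this does not reproduce the presenter's eviction or inference logic. Syscall overhead dominates each sample, so whether a cache hit is distinguishable from a miss depends entirely on the noise floor of the host.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time a single pread() of `len` bytes at offset `off`.
 * Returns elapsed nanoseconds, or -1 if the read fails or comes back short. */
int64_t timed_pread_ns(int fd, void *buf, size_t len, off_t off) {
    assert(buf != NULL);
    uint64_t t0 = now_ns();
    ssize_t n = pread(fd, buf, len, off);
    uint64_t t1 = now_ns();
    if (n != (ssize_t)len) return -1;
    return (int64_t)(t1 - t0);
}

/* Repeat the measurement and keep the minimum, which is less sensitive to
 * scheduler noise than a mean. Returns -1 if any read fails or samples <= 0. */
int64_t min_pread_ns(int fd, off_t off, int samples) {
    unsigned char line[64]; /* 64 bytes: a common cache-line size (assumption) */
    int64_t best = -1;
    for (int i = 0; i < samples; i++) {
        int64_t dt = timed_pread_ns(fd, line, sizeof line, off);
        if (dt < 0) return -1;
        if (best < 0 || dt < best) best = dt;
    }
    return best;
}
```

The same helper is useful defensively: running it on a quiet node versus a loaded one gives a rough idea of how much timing signal survives the syscall path in a given deployment.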
This bypasses several layers of NanoClaw's intended isolation:
1. **User Namespace Isolation**: UIDs are remapped, but the physical pages are shared.
2. **Mount Namespace Isolation**: The volume is explicitly shared, which is a correct but dangerous configuration.
3. **Seccomp-bpf Filtering**: The syscalls used (`open`, `read`, `write`, `fstat`) are typically allowed for basic functionality.
The critical oversight in many deployments is the assumption that sharing a "memory" volume is functionally equivalent to sharing a pipe—a private communication channel. In reality, it shares a direct, cacheable mapping of physical memory.
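The difference from a pipe can be sketched in a few lines. In the example below, two processes stand in for the two containers and each maps the same file with `MAP_SHARED`; a store by one becomes visible to the other with no `read`/`write` call in between, because both mappings resolve to the same physical page. The function name and the idea of using `fork` as a stand-in for container co-location are my own simplifications.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_LEN 4096 /* one page on most systems (assumption) */

/* Open `path` and map its first REGION_LEN bytes as a shared mapping. */
static unsigned char *map_region(const char *path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return NULL;
    void *p = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (unsigned char *)p;
}

/* Returns 1 if a store made by a child through its own mapping is visible
 * through the parent's mapping, 0 if not, -1 on setup failure. */
int shared_mapping_is_visible(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    int rc = ftruncate(fd, REGION_LEN);
    close(fd);
    if (rc != 0) return -1;

    unsigned char *mine = map_region(path);
    if (mine == NULL) return -1;
    mine[0] = 0;

    pid_t pid = fork();
    if (pid < 0) { munmap(mine, REGION_LEN); return -1; }
    if (pid == 0) {
        /* "Other container": maps the file independently and stores one byte. */
        unsigned char *theirs = map_region(path);
        if (theirs == NULL) _exit(1);
        theirs[0] = 0xA5;
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) { munmap(mine, REGION_LEN); return -1; }
    int seen = (mine[0] == 0xA5);
    munmap(mine, REGION_LEN);
    return seen;
}
```

A pipe would have copied the byte through a kernel buffer; here there is no copy at all, which is exactly why cache state for that page is observable from both sides.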
A naive mitigation would be to disallow `emptyDir` volumes with `medium: Memory` entirely, but that ignores legitimate use-cases. A more robust approach requires a defense-in-depth strategy:
* **Orchestrator-Level**: Implement stronger affinity/anti-affinity rules to prevent scheduling untrusted agents on the same node, especially if one handles sensitive data. This is a policy gap.
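* **Kernel-Level (Agent)**: `mlock` pins pages in RAM but says nothing about CPU cache residency, so it does not help here. Likewise, `madvise(..., MADV_DONTNEED)` and `madvise(..., MADV_COLD)` act on page-level residency and reclaim, not on cache lines; applied aggressively to the shared buffer after use they may perturb the channel, but they do not evict data from the LLC. An agent that wants to scrub its footprint needs an explicit cache-line flush over the buffer (e.g., `clflush`/`clflushopt` on x86), and even that only narrows the window rather than closing it.
* **Kernel-Level (Host)**: The ultimate fix requires hardware features such as Cache Allocation Technology (CAT), exposed by the kernel through the `resctrl` filesystem, to partition LLC resources. Memory Bandwidth Allocation (MBA) is configured through the same interface, but it throttles memory bandwidth rather than partitioning cache, so at best it adds noise. This is where our model truly breaks down: these controls are not container-aware by default and require manual configuration, as referenced in the kernel documentation (`Documentation/x86/resctrl_ui.rst`; the file has moved in newer trees, e.g. to `Documentation/arch/x86/resctrl.rst`).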
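For the host-level item, the manual configuration amounts to creating a control group directory under the resctrl mount, writing a cache bitmask to its `schemata` file, and writing PIDs to its `tasks` file. A minimal sketch, assuming resctrl is mounted at `/sys/fs/resctrl` and the hardware supports L3 CAT; the helper names and the group name are hypothetical, and the `root` parameter exists only so the logic can be exercised against an ordinary directory.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write one formatted line to <root>/<group>/<file>. Returns 0 or -1. */
static int write_group_file(const char *root, const char *group,
                            const char *file, const char *line) {
    char path[512];
    int n = snprintf(path, sizeof path, "%s/%s/%s", root, group, file);
    if (n < 0 || n >= (int)sizeof path) return -1;
    FILE *f = fopen(path, "w");
    if (f == NULL) return -1;
    int ok = fputs(line, f) >= 0;
    if (fclose(f) != 0) ok = 0; /* resctrl reports invalid values on write/close */
    return ok ? 0 : -1;
}

/* Create (or reuse) a group and restrict it to `mask` ways of L3 cache
 * domain `cache_id`. On Intel hardware the mask must be a contiguous run of
 * bits; valid widths are listed under info/L3/cbm_mask in the real mount. */
int resctrl_set_l3_mask(const char *root, const char *group,
                        unsigned cache_id, unsigned long mask) {
    char path[512], line[64];
    assert(root != NULL && group != NULL);
    int n = snprintf(path, sizeof path, "%s/%s", root, group);
    if (n < 0 || n >= (int)sizeof path) return -1;
    if (mkdir(path, 0755) != 0 && errno != EEXIST) return -1;
    snprintf(line, sizeof line, "L3:%u=%lx\n", cache_id, mask);
    return write_group_file(root, group, "schemata", line);
}

/* Move one task into the group by writing its PID to the tasks file. */
int resctrl_add_task(const char *root, const char *group, pid_t pid) {
    char line[32];
    assert(root != NULL && group != NULL);
    snprintf(line, sizeof line, "%ld\n", (long)pid);
    return write_group_file(root, group, "tasks", line);
}
```

On a real host the call would look like `resctrl_set_l3_mask("/sys/fs/resctrl", "untrusted", 0, 0xf)` followed by `resctrl_add_task(...)` for each container process. Nothing in the container runtime does this automatically, which is the gap described above: something at the orchestrator level would have to map pods to groups and keep the `tasks` file in sync.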
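Consider the following seccomp rule addition, which denies `clock_gettime(CLOCK_MONOTONIC)` at the syscall boundary (and breaks many legitimate applications in the process). It is only a partial measure: on most Linux builds `clock_gettime` is served from the vDSO without entering the kernel, and `rdtsc` is not a syscall at all, so the attacker may still have high-resolution timing available for the probe phase: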
```c
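// In your seccomp policy generator (libseccomp; needs <errno.h>, <time.h>, <seccomp.h>).
// `ctx` is an already-initialized scmp_filter_ctx.
// Match calls whose first argument (the clock ID) equals CLOCK_MONOTONIC.
struct scmp_arg_cmp arg_cmp = SCMP_A0(SCMP_CMP_EQ, CLOCK_MONOTONIC);
if (seccomp_rule_add_array(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(clock_gettime), 1, &arg_cmp) < 0) {
    // handle error
}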
```
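How much a rule like this actually buys can be checked empirically. The sketch below swaps libseccomp for a raw BPF filter installed via `prctl`, purely so it builds without linking an extra library; it denies the `clock_gettime` syscall in a child process and then tries both the raw syscall and the libc wrapper. The function names are mine. It omits the architecture check a production filter needs and assumes a platform where the syscall is `SYS_clock_gettime` (some 32-bit targets use `clock_gettime64` instead).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Make the clock_gettime *syscall* fail with EPERM for the calling process.
 * No architecture check: this is a sketch, not a production filter. */
static int deny_clock_gettime_syscall(void) {
    struct sock_filter prog[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* nr == clock_gettime: fall through to ERRNO; otherwise skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_clock_gettime, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof prog / sizeof prog[0]),
        .filter = prog,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Runs the experiment in a child so the filter does not stick to the caller.
 * Returns a bitmask: bit 0 = raw syscall was denied with EPERM,
 * bit 1 = libc clock_gettime() still succeeded. Returns -1 if the filter
 * could not be installed or the child did not exit normally. */
int probe_timer_paths(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct timespec ts;
        int result = 0;
        if (deny_clock_gettime_syscall() != 0) _exit(127);
        if (syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts) == -1 && errno == EPERM)
            result |= 1;
        if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
            result |= 2;
        _exit(result);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    int code = WEXITSTATUS(status);
    return code == 127 ? -1 : code;
}
```

If `probe_timer_paths()` comes back with both bits set, the wrapper is being served without a syscall (typically via the vDSO) and the seccomp rule is not removing the timer. Closing that path is a separate problem, and it still leaves `rdtsc` and counting threads as timing sources.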
The question for this forum is: Given NanoClaw's design philosophy of minimal, agent-focused containers, how should we formally model and mitigate this class of shared-kernel-resource side channels? Is it sufficient to document the risk of shared `tmpfs` volumes, or do we need to advocate for mandatory `resctrl` profiles at the orchestrator level, even at a performance cost?
-- vp
strace -f -e trace=all