I've been conducting a post-mortem analysis on a recent container escape incident in our lab environment. During the forensic review, I noticed something concerning in the audit logs: a process was able to read sensitive application data from a memory-backed file system (`tmpfs` at `/dev/shm`) even after the parent container was terminated and re-instantiated.
This led me down a rabbit hole investigating persistent memory (PMEM) and `memfd`-backed file systems. The core issue appears to be that common encryption-at-rest solutions (LUKS, eCryptfs) do not cover volatile or persistent memory regions by default. Data written to `/dev/shm`, `/run/shm`, or via `memfd_create()` remains unencrypted.
Consider this simple demonstration. A process creates an in-memory file and writes sensitive data:
```c
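#define _GNU_SOURCE
#include <stdio.h>      /* perror */
#include <string.h>     /* strlen */
#include <sys/mman.h>   /* memfd_create */
#include <unistd.h>     /* write, lseek, pause */

int main(void) {
    /* Anonymous in-memory file; it has no path on any mounted filesystem. */
    int fd = memfd_create("secrets", 0);
    if (fd < 0) { perror("memfd_create"); return 1; }

    const char *secret = "AUTH_KEY=supersecret123";
    if (write(fd, secret, strlen(secret)) < 0) { perror("write"); return 1; }
    lseek(fd, 0, SEEK_SET);

    /* Block here so the process can be killed (e.g. with SIGKILL) without
     * any cleanup; the secret is never zeroed before the pages are released. */
    pause();
    return 0;
}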
```
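Once the process is gone, the kernel releases these pages, but unless the system zeroes memory on free (for example with the `init_on_free=1` boot parameter), their contents can sit in physical RAM until the pages are reused. On actual persistent memory (NVDIMMs) it is worse, because the data also survives a reboot. My testing with `pmem` namespaces on a test system confirms that `ndctl`-created namespaces mounted with the `dax` option bypass the block layer entirely, so block-level encryption such as dm-crypt/LUKS never sees the data.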
**Key findings from my lab:**
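* **Page Cache Retention:** `tmpfs` pages stay in the page cache for as long as the file exists or any process still holds it open or mapped. After the last reference is dropped, the freed pages are not necessarily zeroed, so their contents can remain in physical memory until reuse, accessible via direct physical memory inspection or certain kernel debug interfaces.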
* **PMEM/DAX Bypass:** Filesystems mounted with Direct Access (DAX) on persistent memory avoid the block layer. Full-disk encryption does not apply.
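* **Container Shared Memory:** Kubernetes `emptyDir` with `medium: Memory` creates a `tmpfs` mount. The volume lives as long as the pod, not the container, so its contents survive container restarts, and multi-container pods can leak data via this shared memory if it is not explicitly cleared.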
**Potential mitigation paths I'm evaluating:**
* Implementing a kernel module to hook `memfd_create()` and `shm_open()` to enforce encryption via a lightweight cipher (e.g., ChaCha20) for selected processes.
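* Using `mlock()` to keep sensitive pages out of swap and zeroing them before termination in sensitive applications, preferably with `explicit_bzero()`, since a plain `memset()` on a buffer that is never read again can be optimized away by the compiler.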
* For PMEM, configuring the namespace to use the `sector` (block translation) mode instead of `fsdax` or `devdax`, then applying LUKS. This sacrifices some performance.
My primary questions for the community:
* Are there existing, production-tested frameworks for transparent memory encryption in user-space for Linux?
* Has anyone successfully implemented a policy (e.g., via eBPF) to detect uncleared sensitive data in persistent memory regions?
* Is this considered a realistic threat model in your organization's hardening guides, or is it typically dismissed as requiring physical access?
Logs don't lie.
Good catch. This is a known, often overlooked side effect of how memory pressure works. The kernel can page out `tmpfs` and `memfd` pages to swap. If you have encrypted swap, that covers the pages once they hit disk, but the data still sits in plaintext RAM until it is evicted.
The real risk isn't just the persisting pages, it's a cold boot attack or a DMA attack if the physical hardware is accessible, and on shared hosts, memory-deduplication side channels like CVE-2015-2877 (KSM). For containers, if the host kernel crashes and memory ends up in a crash dump, or freed pages are read before they have been zeroed or reused, that's your data leak.
You need to combine `mlock()` to pin sensitive pages (prevents swap) and explicit zeroing before process exit. Even then, it's a defense-in-depth game.
trust, but verify — with sigtrap
The swap encryption point is key. Many distros don't enable it by default, so that layer is often absent.
You're right about `mlock()` and zeroing, but in a container context, you're at the mercy of the orchestrator's security context. Locking more than the `RLIMIT_MEMLOCK` soft limit requires `CAP_IPC_LOCK`, and every extra capability you grant widens the container's attack surface. It's a trade-off between locking pages and a reduced attack surface.
For container workloads, I've seen more success treating all in-memory data as potentially exposed and focusing on limiting what gets written there in the first place, coupled with a tight seccomp profile that blocks `memfd_create`. Not perfect, but pragmatic.
Trust the data, not the dashboard.