TIL: How to use fault injection to test an agent's failure recovery logic.

Lee H.

(@selfhost_sec_architect_lee)

Eminent Member

Joined: 1 week ago

Posts: 19

Topic starter

June 29, 2026 1:01 pm [#1142]

I was stress-testing my latest OpenClaw deployment's self-healing routines and realized I was just simulating failures in software. That's good, but it doesn't test the agent's true resilience against physical layer faults. So I dug into hardware fault injection. It's a game-changer for validating failure recovery logic at the lowest levels.

The core idea is to deliberately corrupt the environment—memory, CPU, power, network—and see if your agent's watchdog, state sync, and restart mechanisms hold up. I'm not talking about pulling a cable (though that's valid). I mean targeted, reproducible faults.
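
To make "reproducible" concrete: one option is to drive every fail/pass decision from a seeded PRNG, so the same seed replays the same fault sequence on the next run. This is only a minimal sketch; `fault_schedule` and its functions are hypothetical names for illustration, not part of any existing tool.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fault schedule: the same seed always yields the same
 * sequence of fail/pass decisions, so a failing run can be replayed. */
typedef struct {
    uint32_t state;        /* PRNG state; xorshift requires it to be non-zero */
    unsigned fail_percent; /* share of operations to fail, 0-100 */
} fault_schedule;

static void fault_schedule_init(fault_schedule *fs, uint32_t seed,
                                unsigned fail_percent) {
    assert(fail_percent <= 100);
    fs->state = seed ? seed : 1u; /* avoid the all-zero state */
    fs->fail_percent = fail_percent;
}

/* One xorshift32 step: small, deterministic, no global state. */
static uint32_t fault_schedule_next(fault_schedule *fs) {
    uint32_t x = fs->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    fs->state = x;
    return x;
}

/* Returns 1 if the next operation should be failed, 0 otherwise. */
static int fault_schedule_should_fail(fault_schedule *fs) {
    return fault_schedule_next(fs) % 100u < fs->fail_percent;
}
```

Logging the seed at startup is what makes a crash replayable: rerun with the same seed and the faults land on the same calls, as long as the agent itself behaves deterministically.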

Here's a simple example using `LD_PRELOAD` to simulate memory allocation failures for a specific agent process. This tests its graceful handling of `malloc` failures.

```c
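// fail_malloc.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

static int fail_rate = 0;                    // percentage of calls to fail (0-100)
static void *(*real_malloc)(size_t) = NULL;  // libc's real malloc, resolved lazily

static void fail_malloc_init(void) {
    // Look up the next "malloc" in link order, i.e. the real one
    real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");

    // Read the failure percentage from the FAIL_RATE environment variable
    const char *env = getenv("FAIL_RATE");
    if (env != NULL) fail_rate = atoi(env);

    srand((unsigned) time(NULL));
}

void *malloc(size_t size) {
    if (real_malloc == NULL) fail_malloc_init();

    if (fail_rate > 0 && rand() % 100 < fail_rate) {
        // Simulate allocation failure the way libc reports it
        errno = ENOMEM;
        return NULL;
    }
    return real_malloc(size);
}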
```
Compile with `gcc -shared -fPIC -o fail_malloc.so fail_malloc.c -ldl`. Then inject it into your agent's process:
```bash
FAIL_RATE=30 LD_PRELOAD=./path/to/fail_malloc.so ./your_agent
```
This will cause roughly 30% of `malloc` calls to fail. Only `malloc` is intercepted here; `calloc` and `realloc` still go straight to libc. Does your agent crash, or does it log, release resources, and attempt recovery?
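
For comparison, here is a rough sketch of the "log, release resources, and attempt recovery" path on the agent side. Every name in it (`alloc_or_recover`, `drop_caches`, `flaky_alloc`) is hypothetical. The allocator is passed in as a function pointer so the sketch can be exercised without the preloaded shim; a real agent would pass `malloc`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t); /* allocator under test, e.g. malloc */
typedef void (*release_fn)(void);  /* hook that frees droppable memory */

/* Try to allocate; on each failure, log it, run the release hook, and
 * retry. Returns NULL once max_attempts is exhausted so the caller can
 * enter a degraded mode or hand over to its supervisor. */
static void *alloc_or_recover(alloc_fn alloc, size_t size,
                              int max_attempts, release_fn release) {
    assert(alloc != NULL && max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        void *p = alloc(size);
        if (p != NULL) return p;
        /* stderr is unbuffered by default, so this is unlikely to allocate */
        fprintf(stderr, "alloc of %zu bytes failed (attempt %d/%d)\n",
                size, attempt, max_attempts);
        if (release != NULL) release();
    }
    return NULL;
}

/* Stand-ins so the sketch runs on its own (assumptions, not agent code). */
static int flaky_failures_left = 0; /* how many calls flaky_alloc will fail */
static int release_calls = 0;       /* how often drop_caches was invoked */

static void *flaky_alloc(size_t size) {
    if (flaky_failures_left > 0) {
        flaky_failures_left--;
        return NULL;
    }
    return malloc(size);
}

static void drop_caches(void) {
    release_calls++; /* a real hook would free caches or pooled buffers */
}
```

Setting `flaky_failures_left = 2` and calling `alloc_or_recover(flaky_alloc, 64, 3, drop_caches)` succeeds on the third attempt after two release calls. With more failures than attempts it returns `NULL`, which is the case the injection run is meant to provoke.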

Other fault injection vectors I've been playing with:
* **Network:** Using `tc` to introduce packet loss, corruption, or delay on the agent's egress interface.
* **Process:** Random `SIGKILL` via a cron script, but *only* if the agent's PID is tracked by a supervisor.
* **Filesystem:** Mount a tmpfs with limited inodes or use `libfiu` to fail filesystem operations.
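
The process vector above can be sketched in C as a tiny kill-and-restart harness: fork a stand-in agent, `SIGKILL` it, confirm through `waitpid` that it died from that signal, then restart it. `fake_agent` and the other function names are hypothetical; a real rig would exec the actual agent binary and leave the restart to the supervisor being tested.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the agent: idles until it is killed. */
static void fake_agent(void) {
    for (;;) pause();
}

/* Fork the stand-in agent; returns the child's PID, or -1 on failure. */
static pid_t spawn_agent(void) {
    pid_t pid = fork();
    if (pid == 0) {
        fake_agent();
        _exit(0); /* not reached; fake_agent never returns */
    }
    return pid;
}

/* Inject the fault: SIGKILL the child, reap it, and report whether it
 * really terminated because of SIGKILL. */
static int kill_and_confirm(pid_t pid) {
    int status = 0;
    if (kill(pid, SIGKILL) != 0) return 0;
    if (waitpid(pid, &status, 0) != pid) return 0;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}

/* Run `rounds` kill/restart cycles; returns how many restarts succeeded. */
static int run_kill_restart_cycles(int rounds) {
    assert(rounds >= 0);
    int restarts = 0;
    pid_t pid = spawn_agent();
    if (pid < 0) return 0;
    for (int i = 0; i < rounds; i++) {
        if (!kill_and_confirm(pid)) return restarts;
        pid = spawn_agent(); /* the "supervisor" brings the agent back */
        if (pid < 0) return restarts;
        restarts++;
    }
    kill_and_confirm(pid); /* clean up the last instance */
    return restarts;
}
```

Because the harness owns the PID it kills, it follows the same rule as the cron approach: never signal a process that nothing is tracking.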

The goal isn't just to break things—it's to verify that your architectural safeguards (like the **Nano Claw**'s heartbeat and immutable ledger) actually trigger and restore service. Without this, you're just hoping your recovery logic works.

Has anyone else built a dedicated fault injection rig for their self-hosted agents? I'd love to compare methods, especially for testing zero-trust network handshakes under duress.

Lee

Isolation is freedom.

Wendy Chen

(@wendy_homelab)

Active Member

Joined: 1 week ago

Posts: 17

June 29, 2026 2:01 pm

That LD_PRELOAD trick is really clever for simulating low-level failures without needing special hardware. I had to look up how it works, but it makes sense now.

It got me thinking - this is great for forcing a single process to fail, but in my little home lab setup, I'm more worried about cascading failures. Like, if my main Pi running the agent watchdog dies, will the backup Pi actually pick up the monitoring? I can't easily simulate the main Pi's power supply dying, but I guess I could pull the plug.

Do you think pulling the plug (literally) on a node is still a valid part of this kind of testing, even if it's less surgical than your code example? It's the only "hardware fault" I can easily do right now.
