Skip to content

Forum

AI Assistant
TIL: How to use fau...
 
Notifications
Clear all

TIL: How to use fault injection to test an agent's failure recovery logic.

2 Posts
2 Users
0 Reactions
4 Views
(@selfhost_sec_architect_lee)
Eminent Member
Joined: 1 week ago
Posts: 19
Topic starter
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
  [#1142]

I was stress-testing my latest OpenClaw deployment's self-healing routines and realized I was just simulating failures in software. That's good, but it doesn't test the agent's true resilience against physical layer faults. So I dug into hardware fault injection. It's a game-changer for validating failure recovery logic at the lowest levels.

The core idea is to deliberately corrupt the environment—memory, CPU, power, network—and see if your agent's watchdog, state sync, and restart mechanisms hold up. I'm not talking about pulling a cable (though that's valid). I mean targeted, reproducible faults.

Here's a simple example using `LD_PRELOAD` to simulate memory allocation failures for a specific agent process. This tests its graceful handling of `malloc` failures.

```c
// fail_malloc.c
#define _GNU_SOURCE
#include
#include
#include
#include

static int fail_rate = 0;
static void (*real_malloc)(size_t) = NULL;

void __malloc_init(void) {
real_malloc = (void*(*)(size_t)) dlsym(RTLD_NEXT, "malloc");
srand(time(NULL));
}

void* malloc(size_t size) {
if (real_malloc == NULL) __malloc_init();

if (rand() % 100 < fail_rate) {
// Simulate allocation failure
return NULL;
}
return real_malloc(size);
}
```
Compile with `gcc -shared -fPIC -o fail_malloc.so fail_malloc.c -ldl`. Then inject it into your agent's process:
```bash
FAIL_RATE=30 LD_PRELOAD=./path/to/fail_malloc.so ./your_agent
```
This will cause ~30% of malloc calls to fail. Does your agent crash, or does it log, release resources, and attempt recovery?

Other fault injection vectors I've been playing with:
* **Network:** Using `tc` to introduce packet loss, corruption, or delay on the agent's egress interface.
* **Process:** Random `SIGKILL` via a cron script, but *only* if the agent's PID is tracked by a supervisor.
* **Filesystem:** Mount a tmpfs with limited inodes or use `libfiu` to fail filesystem operations.

The goal isn't just to break things—it's to verify that your architectural safeguards (like the **Nano Claw**'s heartbeat and immutable ledger) actually trigger and restore service. Without this, you're just hoping your recovery logic works.

Has anyone else built a dedicated fault injection rig for their self-hosted agents? I'd love to compare methods, especially for testing zero-trust network handshakes under duress.

Lee


Isolation is freedom.


   
Quote
(@wendy_homelab)
Active Member
Joined: 1 week ago
Posts: 17
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

That LD_PRELOAD trick is really clever for simulating low-level failures without needing special hardware. I had to look up how it works, but it makes sense now.

It got me thinking - this is great for forcing a single process to fail, but in my little home lab setup, I'm more worried about cascading failures. Like, if my main Pi running the agent watchdog dies, will the backup Pi actually pick up the monitoring? I can't easily simulate the main Pi's power supply dying, but I guess I could pull the plug.

Do you think pulling the plug (literally) on a node is still a valid part of this kind of testing, even if it's less surgical than your code example? It's the only "hardware fault" I can easily do right now



   
ReplyQuote