Just got our agent runtime logs feeding into Elastic. Built a basic Grafana dashboard to track agent health and it's already caught a few weird spikes.
I'm curious what others are monitoring. Right now I'm just looking at:
- Agent heartbeat status and uptime
- Total actions executed per agent
- Average action execution time
What key metrics or log fields are you pulling into your SIEM? I'm especially interested in detecting an agent that is stuck or behaving abnormally. Also, any tips on setting up useful alert rules from this data? I'm on Proxmox with VLANs, so the network path is already isolated.
Good start. Those three are the core of runtime health. Most people stop there and miss the context.
You should also pull agent resource consumption. A sudden, sustained CPU or memory spike often precedes a hung state. If your agent is supposed to be idle, any persistent network egress is a red flag.
For alert rules, don't just alert on heartbeat failure. That's too late. Set a threshold for max action execution time. If an action takes 3x the historical average, something is likely stuck. Your Proxmox VLAN setup helps, but also alert if an agent starts trying to talk to anything outside its defined peer group.
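Rough sketch of the 3x rule in Python, just to show the shape of the check. The function name, the `min_samples` guard, and the sample numbers are my own placeholders, not anything from your stack; in practice you'd express this as an alert rule over the stored durations.

```python
from statistics import mean


def is_slow_action(duration_s, history_s, factor=3.0, min_samples=5):
    """Return True when duration_s exceeds factor x the historical mean.

    history_s holds past execution times (seconds) for the same agent.
    min_samples is an assumed guard so a near-empty history never alerts.
    """
    if len(history_s) < min_samples:
        return False
    return duration_s > factor * mean(history_s)


# Hypothetical history with a mean of about 1 second.
history = [1.0, 1.2, 0.9, 1.1, 0.8]
print(is_slow_action(2.5, history))  # under 3x the mean
print(is_slow_action(3.5, history))  # over 3x the mean
```

A median or a high percentile would be less sensitive to outliers than the mean, but the mean matches the heuristic as stated.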
Keep it technical.
Totally agree on the resource consumption angle, that's a solid next step. The 3x historical average for action execution time is a clever heuristic too - simple but probably effective.
One thing I'd add is to also watch for the *absence* of expected metrics. If your agent normally logs a specific event after each action and that suddenly stops, even if the heartbeat is fine, it could mean the agent's logic is stuck in a weird state before it formally hangs. Saw something like that when a third-party API we relied on started returning malformed data that didn't trip a timeout but broke our parsing loop. Good times 😅
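Here's a minimal sketch of that "heartbeat fine, expected event missing" check. The event names (`heartbeat`, `action_completed`), the tuple layout, and the 300 second gap are all assumptions for illustration; swap in whatever your agents actually log.

```python
def silent_agents(events, now, max_gap_s=300.0):
    """Return agents with a recent heartbeat but no recent completion event.

    events: iterable of (timestamp_s, agent_id, event_name) tuples.
    Event names and the gap threshold are hypothetical.
    """
    last_heartbeat = {}
    last_completed = {}
    for ts, agent, name in events:
        if name == "heartbeat":
            last_heartbeat[agent] = max(ts, last_heartbeat.get(agent, ts))
        elif name == "action_completed":
            last_completed[agent] = max(ts, last_completed.get(agent, ts))

    flagged = []
    for agent, hb in last_heartbeat.items():
        alive = now - hb <= max_gap_s
        done = last_completed.get(agent)
        stale = done is None or now - done > max_gap_s
        if alive and stale:
            flagged.append(agent)
    return sorted(flagged)


events = [
    (0, "agent-a", "action_completed"),
    (900, "agent-a", "heartbeat"),
    (850, "agent-b", "action_completed"),
    (900, "agent-b", "heartbeat"),
]
print(silent_agents(events, now=1000))
```

This only makes sense for agents that are expected to be doing work; an agent that is legitimately idle would need to be excluded or given a longer gap.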
unsafe { /* not here */ }
That's a great start. I'd definitely echo pulling in resource metrics like others have said - a memory leak will show up there long before a full hang.
One specific log field I've found useful for weird behavior is the agent's internal task queue depth. If that number keeps climbing while action execution time stays flat, it's a clear sign something's backing up. I set up an alert for a sustained queue depth increase, and it caught a bug where our agents were accepting tasks faster than they could dispatch them. Heartbeat was totally fine the whole time.
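For what it's worth, the logic behind a "sustained increase" alert can be as simple as the sketch below. This is not my actual rule, just an illustration; the window size and the `min_rise` parameter are assumed values you'd tune.

```python
def sustained_increase(depths, window=5, min_rise=1):
    """Return True if the last `window` queue-depth samples rise every step.

    depths: queue-depth samples in time order.
    min_rise: assumed minimum total growth across the window.
    """
    if len(depths) < window:
        return False
    recent = depths[-window:]
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return rising and recent[-1] - recent[0] >= min_rise


print(sustained_increase([3, 4, 6, 9, 13, 18]))  # steadily climbing
print(sustained_increase([5, 7, 6, 8, 7, 9]))    # noisy, not sustained
```

Requiring every step to rise is strict; comparing the window's average against the previous window's would tolerate a noisy sample or two.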
Your Proxmox VLAN setup is perfect for this next bit: can you also pull network connection attempts (success and failure) by agent? Seeing an agent try to reach an unexpected subnet is an instant red flag, even if the attempt fails.
Queue depth is a solid metric, but it depends on your agent's architecture. Some designs drop tasks when overloaded, so depth stays flat while actions fail silently. In that case, monitoring action success rate is a better indicator of overload.
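A success-rate check is only a few lines. Sketch below; the 95% floor and the 20-sample minimum are assumed numbers, and it presumes each action logs a boolean outcome, which your agents may or may not do.

```python
def success_rate(outcomes):
    """Fraction of successful actions, or None if there are no samples."""
    if not outcomes:
        return None
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def should_alert(outcomes, min_rate=0.95, min_samples=20):
    """Alert when enough samples exist and the success rate is below min_rate."""
    if len(outcomes) < min_samples:
        return False
    return success_rate(outcomes) < min_rate


window = [True] * 18 + [False] * 2  # hypothetical recent outcomes, 90% success
print(success_rate(window))
print(should_alert(window))
```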
On network connection attempts, that's mandatory. Beyond just unexpected subnets, watch for unexpected *protocols*. An agent that normally uses HTTPS attempting raw TCP or even ICMP is an immediate containment event. Your VLANs will block it, but the attempt itself is the signal.
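Minimal sketch of the destination and protocol check, using Python's standard `ipaddress` module. The peer subnet and the allowed protocol/port pair are made-up values; populate them from your own VLAN layout, and feed it whatever connection records your firewall or flow logs give you.

```python
import ipaddress

# Hypothetical allowlists; replace with your real peer group and protocols.
ALLOWED_PEERS = [ipaddress.ip_network("10.20.0.0/24")]
ALLOWED_PROTOCOLS = {("tcp", 443)}


def classify_attempt(dst_ip, protocol, dst_port):
    """Return the reasons a connection attempt is suspicious (empty if none).

    dst_port may be None for protocols without ports, such as ICMP.
    The attempt is judged whether or not it succeeded.
    """
    reasons = []
    addr = ipaddress.ip_address(dst_ip)
    if not any(addr in net for net in ALLOWED_PEERS):
        reasons.append("unexpected_destination")
    if (protocol.lower(), dst_port) not in ALLOWED_PROTOCOLS:
        reasons.append("unexpected_protocol")
    return reasons


print(classify_attempt("10.20.0.5", "tcp", 443))
print(classify_attempt("10.30.0.9", "icmp", None))
```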
POC or it didn't happen
Nice! I've been thinking about doing something similar with my own agents, but I'm still pretty new to this. Quick question about your heartbeat monitoring - do you have a fallback alert if the agent stops sending its own heartbeat? Like, maybe a secondary ping from the management host to detect a total network or VM failure? Just thinking of gaps 😅
Also, thanks for this post - the replies gave me a lot of good ideas to add to my own list.
That's a good start. Thanks for posting the list, I'm trying to set something similar up myself.
I have a follow-up about the action execution time metric. How do you account for actions that are *supposed* to take a really long time? Like a data backup or a big file transfer. Do you categorize them separately, or does your alert rule only look at specific action types? I'm worried I'll get false alarms.