Vendors push their own dashboards. They're expensive and often miss the core issue: runaway agent processes running up the cloud bill. You need to see the raw events.
Our Elastic stack was ingesting OpenClaw agent runtime logs. Noticed cost alerts correlating with periods of high `execution_loop_count` events. Built this SPL query for Splunk to catch agents stuck in tight, repetitive loops before they rack up compute time.
```
index=oc_agent_events sourcetype="oc:runtime" event_type="loop_iteration"
| stats count dc(hostname) as unique_hosts by agent_id
| where count > 1000
| eval loop_ratio = count / unique_hosts
| where loop_ratio > 100 AND unique_hosts=1
| table agent_id, count, loop_ratio
```
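Key logic: filter for loop events, then look for a high count coming from a single host. The `unique_hosts=1` condition is what filters out legitimate high-volume distributed work. On a single host `loop_ratio` equals `count`, so the count threshold of 1000 is the effective trigger; the `loop_ratio > 100` check only matters if you relax the single-host condition. Found three agents last month stuck due to a faulty condition check. Stopped the meter before it became a problem.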
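If you want to sanity-check the logic outside Splunk, here is a minimal Python sketch of the same filter. It assumes the events are already exported as a list of dicts carrying the `agent_id`, `hostname`, and `event_type` fields used in the query. The function name and its parameters are hypothetical, not part of any OpenClaw or Splunk API.

```python
from collections import defaultdict


def find_runaway_agents(events, min_count=1000, min_ratio=100):
    """Mirror the query: flag single-host agents with many loop events.

    `events` is assumed to be an iterable of dicts with the keys
    agent_id, hostname, and event_type (an assumption about the export).
    Returns a sorted list of (agent_id, count, loop_ratio) tuples.
    """
    counts = defaultdict(int)
    hosts = defaultdict(set)

    # Equivalent of: event_type="loop_iteration" | stats count dc(hostname) by agent_id
    for ev in events:
        if ev.get("event_type") != "loop_iteration":
            continue
        agent = ev["agent_id"]
        counts[agent] += 1
        hosts[agent].add(ev["hostname"])

    flagged = []
    for agent, count in counts.items():
        unique_hosts = len(hosts[agent])
        loop_ratio = count / unique_hosts
        # Equivalent of: count > 1000, loop_ratio > 100, unique_hosts = 1
        if count > min_count and loop_ratio > min_ratio and unique_hosts == 1:
            flagged.append((agent, count, loop_ratio))
    return sorted(flagged)
```

Feeding it a mix of single-host and multi-host agents makes it easy to see that the single-host condition, not the ratio, is what excludes distributed work.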
- Use the raw event log, not the cooked metrics.
- Tune the threshold based on your normal loop patterns.
- This catches logic bugs, not just overload.
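On tuning the threshold: one rough way to pick a starting value is to take per-agent loop counts from periods you know were healthy and set the cutoff a bit above a high percentile of them. The sketch below does that with a nearest-rank percentile. The function name, the 99th-percentile default, and the 1.5x margin are all assumptions for illustration, not values from our setup.

```python
import math


def suggest_threshold(baseline_counts, percentile=99.0, margin=1.5):
    """Suggest a loop-count threshold from known-good per-agent counts.

    Uses the nearest-rank percentile of the baseline, scaled by a safety
    margin. Both defaults are illustrative assumptions; adjust to taste.
    """
    if not baseline_counts:
        raise ValueError("need at least one baseline count")
    ordered = sorted(baseline_counts)
    # Nearest-rank method: smallest value with at least `percentile` percent
    # of the data at or below it.
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1] * margin


# Example: loop counts per agent from a week with no incidents (made-up numbers)
baseline = [100, 120, 90, 300, 110]
threshold = suggest_threshold(baseline)
```

Whatever number comes out, compare it against a real incident before trusting it. A threshold that would not have caught your last runaway agent is too loose.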
Built this because the vendor's own "anomaly detection" missed it. Had to prove the point with data.
- Ray
Prove it.