Skip to content

Forum

AI Assistant
Notifications
Clear all

My agent got stuck in a loop calling the same tool. How do I build in circuit breakers?

2 Posts
2 Users
0 Reactions
2 Views
(@ai_sysadmin)
Eminent Member
Joined: 1 week ago
Posts: 21
Topic starter
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
  [#605]

I've been experimenting with the Anthropic Agent SDK to orchestrate a set of internal data analysis tools. During a recent extended run, I observed a concerning failure mode: the agent entered a pathological loop, repeatedly calling the same "query_metrics" tool with near-identical parameters. This wasn't a simple retry on error; the reasoning output suggested it was stuck in a cyclical thought process, racking up unnecessary API calls and compute time.

This highlights a gap in the default SDK security posture: there are no inherent guardrails against an agent's own logic going awry. In a traditional microservices architecture, you'd rely on circuit breakers (e.g., via a service mesh or client libraries like resilience4j). Here, the "circuit" is the agent's decision loop, and the breaker must be internal.

My initial approach was to implement a tool-level call counter with a short-term memory. Here's a simplified version of the wrapper I used:

```python
from anthropic import Anthropic
from collections import defaultdict
from datetime import datetime, timedelta

class CircuitBreakerToolWrapper:
def __init__(self, tool_callable, threshold=5, window_seconds=30):
self.tool = tool_callable
self.call_history = defaultdict(list)
self.threshold = threshold
self.window = timedelta(seconds=window_seconds)

def __call__(self, *args, **kwargs):
# Create a simple key from the tool name and a hash of normalized arguments
arg_hash = hash(frozenset(kwargs.items()))
key = f"{self.tool.__name__}_{arg_hash}"

now = datetime.now()
# Clean old entries
self.call_history[key] = [t for t in self.call_history[key] if now - t = self.threshold:
raise RuntimeError(f"Circuit breaker tripped for tool {self.tool.__name__}. Too many calls with similar arguments within {self.window.total_seconds()}s.")

self.call_history[key].append(now)
return self.tool(*args, **kwargs)
```

This works, but it's crude. It doesn't consider the agent's broader context or the semantic similarity of arguments, just a hash. Furthermore, the `RuntimeError` thrown back into the agent loop sometimes leads to unhelpful reactions.

I'm now considering a more integrated approach:
- **Tool-level throttling**: The wrapper above, but perhaps using a token bucket algorithm.
- **Session-level memory**: The agent SDK provides message history. Could we add a system prompt directive that instructs the agent to self-monitor for loops?
- **Supervisor agent pattern**: A lightweight overseer process that monitors the main agent's tool call patterns via Prometheus metrics and can intervene by modifying the context or issuing a stop.

My primary concerns are:
1. Minimizing latency overhead.
2. Ensuring the breaker doesn't inadvertently block legitimate, rapid-fire tool use (e.g., polling for a status).
3. Where should the breaker logic live? Wrapping individual tools feels right for granularity, but a centralized orchestrator might have better visibility.

Has anyone else implemented a pattern for this? I'm particularly interested in solutions that tie into the existing `AnthropicBedrock` or `AnthropicVertex` tool-calling flow without requiring a full custom agent runner.


metric over magic


   
Quote
(@karen_secops)
Active Member
Joined: 1 week ago
Posts: 9
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

You're on the right track with the wrapper. I'd move the breaker out of the tool layer entirely though. Let the agent's own call be the trigger.

Wrap the main agent execution loop. Count total tool calls per run, or calls to the same tool within a sliding window. If you hit a limit, you force a stop and dump the full reasoning trace to your logs for review. That way you're not just stopping a broken run, you're capturing the bad logic that caused it.

For us, the threshold is 10 calls total or 5 to the same tool in 60 seconds. Beyond that, it's not thinking, it's thrashing. Kill it and alert.



   
ReplyQuote