Everyone's rushing to build these "self-correcting" agent loops with LangGraph, patting themselves on the back when the agent finally books the correct flight after twelve tries. Then the bill comes in. Not from the airline, but from the LLM API, and from your cloud provider after the tool that spins up EC2 instances gets called in a recursive death spiral.
The problem isn't the loop itself; it's that the graph's control flow is often dictated by untrusted, non-deterministic output (the LLM). You're handing the steering wheel to a model that can be coaxed, confused, or just plain wrong, and telling it "don't take too many turns" is meaningless.
So, you want to prevent a tool from being called *too many times*? You need to enforce it at the *system* level, not hope the LLM follows instructions. Here are the actual levers you have:
1. **Stateful Counting in the Graph State:** The most direct method. Increment a counter in the graph state every time the node runs, and conditionally route away from it when a limit is hit. This is a *graph* concern, not a tool concern.
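```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class State(TypedDict):
    messages: list
    expensive_tool_calls: int  # how many times the expensive tool has run

def expensive_tool_node(state: State) -> dict:
    # Placeholder body: call the real tool here, then bump the counter.
    return {"expensive_tool_calls": state.get("expensive_tool_calls", 0) + 1}

def should_continue(state: State) -> str:
    if state.get("expensive_tool_calls", 0) >= 5:
        return "stop_reasoning"  # Route to a node that ends or changes tack
    return "expensive_tool"

builder = StateGraph(State)
builder.add_node("expensive_tool", expensive_tool_node)
builder.add_node(
    "stop_reasoning",
    lambda s: {"messages": s["messages"] + ["Limit reached."]},
)
builder.set_entry_point("expensive_tool")
builder.add_conditional_edges(
    "expensive_tool",
    should_continue,  # <-- This function guards the loop
)
builder.add_edge("stop_reasoning", END)
```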
2. **Circuit Breakers at the Tool Level:** The tool itself should have a hard, in-memory limit. This is your last line of defense if the graph logic fails. A simple decorator can reject calls after a threshold, maybe even raising an exception that the graph can catch and handle.
3. **Checkpointing is Your Enemy Here:** If you're checkpointing your graph state to an external store (like Redis) and resuming on another worker, an in-memory circuit breaker doesn't help. The new worker loads the checkpointed state, but its in-process counter starts again from zero. To survive process restarts, the counter *must* be part of the persisted graph state (like in the example above); for a global limit across workers, keep it in external state (e.g., a Redis atomic counter).
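For the tool-level circuit breaker in point 2, here's a rough sketch in plain Python. The names `call_limit`, `ToolCallLimitExceeded`, and `launch_instance` are made up for illustration, not part of LangGraph. The count lives in process memory, so it has exactly the weakness described in point 3: it resets on restart and isn't shared between workers.

```python
import functools
import threading


class ToolCallLimitExceeded(RuntimeError):
    """Raised when a guarded tool is called more often than allowed."""


def call_limit(max_calls: int):
    """Reject calls to the wrapped function after max_calls attempts."""

    def decorator(func):
        lock = threading.Lock()
        calls = 0  # per-process count; lost on restart

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal calls
            with lock:
                if calls >= max_calls:
                    raise ToolCallLimitExceeded(
                        f"{func.__name__} exceeded {max_calls} calls"
                    )
                # Count the attempt before running, so failed calls
                # still use up the allowance.
                calls += 1
            return func(*args, **kwargs)

        return wrapper

    return decorator


@call_limit(max_calls=2)
def launch_instance(name: str) -> str:
    # Hypothetical stand-in for a tool that spins up real infrastructure.
    return f"launched {name}"
```

The third call to `launch_instance` raises `ToolCallLimitExceeded` no matter what the LLM asked for. Whether the graph catches that and routes to a fallback node or lets the run fail is a design choice; either way the tool stops running.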
The real oversight is assuming the LLM is the only attacker. Your own graph's logic, a malformed user query, or a weird context window overflow can trigger this. Don't just limit loops—budget *every* external call (API, cloud, database) and enforce it where the LLM can't argue: in the code.
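One way to budget every kind of external call, not just one tool, is a small per-category ledger that each call site has to charge before it runs. This is only a sketch under assumptions: `CallBudget`, `BudgetExceeded`, and the category names and limits are invented for the example, and the ledger is in-memory, so for anything multi-worker you'd back it with persisted or external state as discussed above.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a category of external calls has used up its allowance."""


@dataclass
class CallBudget:
    limits: dict[str, int]
    used: dict[str, int] = field(default_factory=dict)

    def spend(self, category: str, amount: int = 1) -> int:
        """Record a call and return what is left; raise if over the limit."""
        # Categories with no configured limit get zero, so unbudgeted
        # calls are rejected rather than silently allowed.
        limit = self.limits.get(category, 0)
        spent = self.used.get(category, 0)
        if spent + amount > limit:
            raise BudgetExceeded(f"{category}: {spent}/{limit} already used")
        self.used[category] = spent + amount
        return limit - self.used[category]


# Illustrative limits only; pick numbers that match your own cost tolerance.
budget = CallBudget(limits={"llm_api": 20, "cloud": 2, "database": 50})
budget.spend("cloud")
budget.spend("cloud")
```

After those two `spend("cloud")` calls, a third raises `BudgetExceeded`, and so does any category that was never given a limit. Defaulting unknown categories to zero is deliberate: a new tool has to be budgeted explicitly before it can run at all.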