Your AI agent was supposed to refactor the auth module. Instead, it deleted the test suite, created 47 empty files, and ran the same deploy command 15 times. Welcome to agent debugging.
Agents fail differently from traditional software. The code runs fine — the model just makes bad decisions. Here’s how to find and fix these failures.
The 5 common agent failure modes
1. Infinite loops
The agent tries the same action repeatedly, expecting different results.
Symptoms: Same tool call appearing 10+ times in logs. Token usage spiking. No progress.
Cause: The agent doesn’t recognize that its action failed, or it doesn’t have a fallback strategy.
Fix:
# Track action history, break loops
action_history = []
def run_agent_step(agent, task):
action = agent.plan_action(task)
# Detect loop: same action 3 times in a row
if len(action_history) >= 3 and all(a == action for a in action_history[-3:]):
action = agent.plan_alternative(task,
context=f"Previous approach failed 3 times: {action}. Try something different.")
action_history.append(action)
return execute(action)
2. Wrong tool selection
The agent picks the wrong tool for the job. It tries to write a file when it should read, or searches the web when the answer is in the codebase.
Symptoms: Tool calls that don’t make sense for the task. Errors from tools receiving wrong input.
Cause: Ambiguous tool descriptions, too many tools available, or the model doesn’t understand the tool’s purpose.
Fix:
- Reduce available tools to only what’s needed for the current task
- Improve tool descriptions with examples of when to use (and when NOT to use)
- Add input validation on tool calls
3. Hallucinated plans
The agent creates a plan that references files, functions, or APIs that don’t exist.
Symptoms: “I’ll modify the auth_service.py file” — but that file doesn’t exist. “I’ll call the /api/users endpoint” — but your API doesn’t have that route.
Cause: The agent is working from its training data, not your actual codebase.
Fix:
- Always provide the actual file tree in context
- Validate plans against reality before execution
- Use structured state to track what actually exists
4. Context window overflow
The agent accumulates so much context (conversation history, file contents, tool outputs) that it hits the context limit and starts losing important information.
Symptoms: Agent “forgets” earlier instructions. Quality degrades over long sessions. Errors about context length.
Cause: No context management strategy. See our agent memory patterns guide.
Fix:
- Summarize old context periodically
- Only include relevant files, not the entire codebase
- Set a maximum session length (30-60 minutes) and start fresh
5. Runaway costs
The agent makes expensive API calls in a loop, or uses a frontier model for tasks that a cheap model could handle.
Symptoms: Unexpected cost spikes in your monitoring dashboard.
Fix:
- Set per-session cost limits
- Use model routing — cheap models for planning, expensive models for execution
- Alert on cost spikes
The debugging toolkit
1. Enable verbose logging
Log every agent decision, not just the final output:
logger.info({
"step": step_number,
"thought": agent.last_reasoning,
"action": agent.last_action,
"tool": agent.last_tool_call,
"result": tool_result[:200], # Truncate long results
"tokens_used": tokens,
"cost": cost,
})
2. Use preserve_thinking
If your model supports it (Qwen 3.6, Claude with extended thinking), enable it to see the model’s reasoning:
response = call_llm(messages, extra_body={"preserve_thinking": True})
# Now you can see WHY the agent chose that action
3. Replay failed sessions
Save the full session state so you can replay failures:
# Save session for replay
session = {
"task": original_task,
"messages": all_messages,
"tool_calls": all_tool_calls,
"state": agent_state,
}
json.dump(session, open(f"debug/session_{session_id}.json", "w"))
Replay with a different model or modified prompt to test fixes.
4. Add guardrails
Prevent the worst failures with hard limits:
MAX_STEPS = 20
MAX_COST = 5.00 # dollars
MAX_TOOL_RETRIES = 3
FORBIDDEN_ACTIONS = ["rm -rf", "DROP TABLE", "DELETE FROM"]
def run_agent(task):
for step in range(MAX_STEPS):
action = agent.next_action()
if any(forbidden in str(action) for forbidden in FORBIDDEN_ACTIONS):
logger.error(f"Blocked forbidden action: {action}")
continue
if get_session_cost() > MAX_COST:
logger.error("Cost limit reached, stopping agent")
break
execute(action)
Debugging checklist
When your agent misbehaves:
- Check logs — what was the agent’s reasoning at each step?
- Check token usage — did it hit the context limit?
- Check cost — is it in a loop burning money?
- Check tool calls — are they valid? Are inputs correct?
- Replay the session — can you reproduce the failure?
- Try a different model — is it a model limitation or a prompt issue?
- Simplify the task — does the agent succeed on a simpler version?
Related: How to Build Multi-Agent Systems · Agent Memory Patterns · Agent Orchestration Patterns · When NOT to Use AI Agents · LLM Observability