AI Agent Security β Preventing Tool Abuse, Data Leaks, and Prompt Injection
An AI agent with tool access can read files, run commands, call APIs, and modify databases. Thatβs powerful when it works correctly. When it doesnβt β through prompt injection, hallucination, or bugs β itβs a security incident.
The threat model
| Threat | How it happens | Impact |
|---|---|---|
| Tool abuse | Agent calls dangerous tools (delete files, drop tables) | Data loss |
| Data exfiltration | Agent sends sensitive data to external APIs | Data breach |
| Prompt injection | User input tricks agent into unauthorized actions | Varies |
| Runaway costs | Agent loops, making expensive API calls | Financial |
| Privilege escalation | Agent accesses resources beyond its scope | Unauthorized access |
Defense 1: Least privilege
Give agents only the tools they need. Nothing more.
# Bad: agent has access to everything
tools = [read_file, write_file, delete_file, run_command,
send_email, query_database, drop_table]
# Good: agent only has what it needs for this task
tools = [read_file, write_file, run_tests]
For MCP servers, configure permissions per server:
- File system MCP: read-only for most tasks, write only to specific directories
- Database MCP: SELECT only, no DELETE/DROP
- Terminal MCP: allowlist of safe commands
Tool-level permissions
ALLOWED_COMMANDS = ["npm test", "npm run build", "git status", "git diff"]
BLOCKED_PATTERNS = ["rm -rf", "DROP TABLE", "DELETE FROM", "curl.*|.*sh"]
def safe_execute(command):
if any(blocked in command for blocked in BLOCKED_PATTERNS):
raise SecurityError(f"Blocked dangerous command: {command}")
if not any(command.startswith(allowed) for allowed in ALLOWED_COMMANDS):
raise SecurityError(f"Command not in allowlist: {command}")
return execute(command)
Defense 2: Input sanitization
User input goes into agent prompts. Sanitize it to prevent prompt injection:
def sanitize_user_input(text):
# Remove common injection patterns
dangerous = [
"ignore previous instructions",
"system prompt",
"you are now",
"forget everything",
]
for pattern in dangerous:
if pattern.lower() in text.lower():
logger.warning(f"Potential injection attempt: {pattern}")
text = text.replace(pattern, "[FILTERED]")
return text
This is a basic filter. For production, combine with:
- Input length limits
- Content classification (is this a normal request or an attack?)
- Separate system prompt from user input with clear delimiters
See our prompt injection guide for comprehensive defenses.
Defense 3: Output monitoring
Monitor what the agent does, not just what it says:
def monitor_agent_action(action, context):
# Log every action
logger.info({
"event": "agent_action",
"action_type": action.type,
"target": action.target,
"user_id": context.user_id,
})
# Alert on suspicious patterns
if action.type == "write_file" and "/etc/" in action.target:
alert("suspicious_write", f"Agent trying to write to {action.target}")
return False # Block the action
if action.type == "http_request" and action.target not in ALLOWED_DOMAINS:
alert("data_exfiltration", f"Agent calling unauthorized domain: {action.target}")
return False
return True # Allow the action
Defense 4: Cost limits
Prevent runaway agent loops from draining your budget:
MAX_STEPS = 20
MAX_COST_PER_SESSION = 5.00 # dollars
MAX_TOKENS_PER_SESSION = 500_000
class AgentBudget:
def __init__(self):
self.steps = 0
self.cost = 0
self.tokens = 0
def check(self):
if self.steps >= MAX_STEPS:
raise BudgetExceeded("Max steps reached")
if self.cost >= MAX_COST_PER_SESSION:
raise BudgetExceeded(f"Cost limit: ${self.cost:.2f}")
if self.tokens >= MAX_TOKENS_PER_SESSION:
raise BudgetExceeded(f"Token limit: {self.tokens}")
See our cost governance guide and alerting guide for production monitoring.
Defense 5: Human approval for dangerous actions
For high-risk actions, require human confirmation:
REQUIRES_APPROVAL = ["delete_file", "drop_table", "send_email", "deploy", "payment"]
async def execute_with_approval(action):
if action.type in REQUIRES_APPROVAL:
approved = await request_human_approval(
f"Agent wants to {action.type}: {action.description}"
)
if not approved:
return "Action denied by human reviewer"
return await execute(action)
In our AI Startup Race, agents have a 1hr/week human help budget. All deploy actions and external API calls are logged and reviewed.
The security checklist for agents
Before deploying any AI agent to production:
- Tools follow least privilege (only whatβs needed)
- Dangerous commands are blocked or require approval
- User input is sanitized for injection
- All actions are logged
- Cost limits are set per session
- Step limits prevent infinite loops
- External API calls are restricted to allowlisted domains
- API keys secured and rotated regularly
- File access is scoped to specific directories
- Red team testing completed
- Rollback plan exists (feature flag to disable agent)
See our AI security checklist and MCP security guide for the complete security framework.
Related: Prompt Injection Explained Β· MCP Security Checklist Β· AI Security Checklist Β· Red Team Your AI Application Β· How to Debug AI Agents Β· What is an AI Agent?