Apr 18, 2026 · 3 min read

AI Agent Security — Preventing Tool Abuse, Data Leaks, and Prompt Injection

An AI agent with tool access can read files, run commands, call APIs, and modify databases. That’s powerful when it works correctly. When it doesn’t — through prompt injection, hallucination, or bugs — it’s a security incident.

The threat model

Threat	How it happens	Impact
Tool abuse	Agent calls dangerous tools (delete files, drop tables)	Data loss
Data exfiltration	Agent sends sensitive data to external APIs	Data breach
Prompt injection	User input tricks agent into unauthorized actions	Varies
Runaway costs	Agent loops, making expensive API calls	Financial
Privilege escalation	Agent accesses resources beyond its scope	Unauthorized access

Defense 1: Least privilege

Give agents only the tools they need. Nothing more.

# Bad: agent has access to everything
tools = [read_file, write_file, delete_file, run_command, 
         send_email, query_database, drop_table]

# Good: agent only has what it needs for this task
tools = [read_file, write_file, run_tests]

For MCP servers, configure permissions per server:

File system MCP: read-only for most tasks, write only to specific directories
Database MCP: SELECT only, no DELETE/DROP
Terminal MCP: allowlist of safe commands

Tool-level permissions

ALLOWED_COMMANDS = ["npm test", "npm run build", "git status", "git diff"]
BLOCKED_PATTERNS = ["rm -rf", "DROP TABLE", "DELETE FROM", "curl.*|.*sh"]

def safe_execute(command):
    if any(blocked in command for blocked in BLOCKED_PATTERNS):
        raise SecurityError(f"Blocked dangerous command: {command}")
    
    if not any(command.startswith(allowed) for allowed in ALLOWED_COMMANDS):
        raise SecurityError(f"Command not in allowlist: {command}")
    
    return execute(command)

Defense 2: Input sanitization

User input goes into agent prompts. Sanitize it to prevent prompt injection:

def sanitize_user_input(text):
    # Remove common injection patterns
    dangerous = [
        "ignore previous instructions",
        "system prompt",
        "you are now",
        "forget everything",
    ]
    
    for pattern in dangerous:
        if pattern.lower() in text.lower():
            logger.warning(f"Potential injection attempt: {pattern}")
            text = text.replace(pattern, "[FILTERED]")
    
    return text

This is a basic filter. For production, combine with:

Input length limits
Content classification (is this a normal request or an attack?)
Separate system prompt from user input with clear delimiters

See our prompt injection guide for comprehensive defenses.

Defense 3: Output monitoring

Monitor what the agent does, not just what it says:

def monitor_agent_action(action, context):
    # Log every action
    logger.info({
        "event": "agent_action",
        "action_type": action.type,
        "target": action.target,
        "user_id": context.user_id,
    })
    
    # Alert on suspicious patterns
    if action.type == "write_file" and "/etc/" in action.target:
        alert("suspicious_write", f"Agent trying to write to {action.target}")
        return False  # Block the action
    
    if action.type == "http_request" and action.target not in ALLOWED_DOMAINS:
        alert("data_exfiltration", f"Agent calling unauthorized domain: {action.target}")
        return False
    
    return True  # Allow the action

Defense 4: Cost limits

Prevent runaway agent loops from draining your budget:

MAX_STEPS = 20
MAX_COST_PER_SESSION = 5.00  # dollars
MAX_TOKENS_PER_SESSION = 500_000

class AgentBudget:
    def __init__(self):
        self.steps = 0
        self.cost = 0
        self.tokens = 0
    
    def check(self):
        if self.steps >= MAX_STEPS:
            raise BudgetExceeded("Max steps reached")
        if self.cost >= MAX_COST_PER_SESSION:
            raise BudgetExceeded(f"Cost limit: ${self.cost:.2f}")
        if self.tokens >= MAX_TOKENS_PER_SESSION:
            raise BudgetExceeded(f"Token limit: {self.tokens}")

See our cost governance guide and alerting guide for production monitoring.

Defense 5: Human approval for dangerous actions

For high-risk actions, require human confirmation:

REQUIRES_APPROVAL = ["delete_file", "drop_table", "send_email", "deploy", "payment"]

async def execute_with_approval(action):
    if action.type in REQUIRES_APPROVAL:
        approved = await request_human_approval(
            f"Agent wants to {action.type}: {action.description}"
        )
        if not approved:
            return "Action denied by human reviewer"
    
    return await execute(action)

In our AI Startup Race, agents have a 1hr/week human help budget. All deploy actions and external API calls are logged and reviewed.

The security checklist for agents

Before deploying any AI agent to production:

See our AI security checklist and MCP security guide for the complete security framework.

AI Agent Security — Preventing Tool Abuse, Data Leaks, and Prompt Injection

The threat model

Defense 1: Least privilege

Tool-level permissions

Defense 2: Input sanitization

Defense 3: Output monitoring

Defense 4: Cost limits

Defense 5: Human approval for dangerous actions

The security checklist for agents

📬 AI Dev Weekly

You might also like

How to Debug AI Agents — When Your Agent Goes Off the Rails

When NOT to Use AI Agents — The Anti-Hype Guide

How to Handle AI Latency in User-Facing Apps (2026)

How to Design an AI-Powered Application — Architecture Patterns (2026)