πŸ€– AI Tools
Β· 3 min read

AI Agent Security β€” Preventing Tool Abuse, Data Leaks, and Prompt Injection


An AI agent with tool access can read files, run commands, call APIs, and modify databases. That’s powerful when it works correctly. When it doesn’t β€” through prompt injection, hallucination, or bugs β€” it’s a security incident.

The threat model

ThreatHow it happensImpact
Tool abuseAgent calls dangerous tools (delete files, drop tables)Data loss
Data exfiltrationAgent sends sensitive data to external APIsData breach
Prompt injectionUser input tricks agent into unauthorized actionsVaries
Runaway costsAgent loops, making expensive API callsFinancial
Privilege escalationAgent accesses resources beyond its scopeUnauthorized access

Defense 1: Least privilege

Give agents only the tools they need. Nothing more.

# Bad: agent has access to everything
tools = [read_file, write_file, delete_file, run_command, 
         send_email, query_database, drop_table]

# Good: agent only has what it needs for this task
tools = [read_file, write_file, run_tests]

For MCP servers, configure permissions per server:

  • File system MCP: read-only for most tasks, write only to specific directories
  • Database MCP: SELECT only, no DELETE/DROP
  • Terminal MCP: allowlist of safe commands

Tool-level permissions

ALLOWED_COMMANDS = ["npm test", "npm run build", "git status", "git diff"]
BLOCKED_PATTERNS = ["rm -rf", "DROP TABLE", "DELETE FROM", "curl.*|.*sh"]

def safe_execute(command):
    if any(blocked in command for blocked in BLOCKED_PATTERNS):
        raise SecurityError(f"Blocked dangerous command: {command}")
    
    if not any(command.startswith(allowed) for allowed in ALLOWED_COMMANDS):
        raise SecurityError(f"Command not in allowlist: {command}")
    
    return execute(command)

Defense 2: Input sanitization

User input goes into agent prompts. Sanitize it to prevent prompt injection:

def sanitize_user_input(text):
    # Remove common injection patterns
    dangerous = [
        "ignore previous instructions",
        "system prompt",
        "you are now",
        "forget everything",
    ]
    
    for pattern in dangerous:
        if pattern.lower() in text.lower():
            logger.warning(f"Potential injection attempt: {pattern}")
            text = text.replace(pattern, "[FILTERED]")
    
    return text

This is a basic filter. For production, combine with:

  • Input length limits
  • Content classification (is this a normal request or an attack?)
  • Separate system prompt from user input with clear delimiters

See our prompt injection guide for comprehensive defenses.

Defense 3: Output monitoring

Monitor what the agent does, not just what it says:

def monitor_agent_action(action, context):
    # Log every action
    logger.info({
        "event": "agent_action",
        "action_type": action.type,
        "target": action.target,
        "user_id": context.user_id,
    })
    
    # Alert on suspicious patterns
    if action.type == "write_file" and "/etc/" in action.target:
        alert("suspicious_write", f"Agent trying to write to {action.target}")
        return False  # Block the action
    
    if action.type == "http_request" and action.target not in ALLOWED_DOMAINS:
        alert("data_exfiltration", f"Agent calling unauthorized domain: {action.target}")
        return False
    
    return True  # Allow the action

Defense 4: Cost limits

Prevent runaway agent loops from draining your budget:

MAX_STEPS = 20
MAX_COST_PER_SESSION = 5.00  # dollars
MAX_TOKENS_PER_SESSION = 500_000

class AgentBudget:
    def __init__(self):
        self.steps = 0
        self.cost = 0
        self.tokens = 0
    
    def check(self):
        if self.steps >= MAX_STEPS:
            raise BudgetExceeded("Max steps reached")
        if self.cost >= MAX_COST_PER_SESSION:
            raise BudgetExceeded(f"Cost limit: ${self.cost:.2f}")
        if self.tokens >= MAX_TOKENS_PER_SESSION:
            raise BudgetExceeded(f"Token limit: {self.tokens}")

See our cost governance guide and alerting guide for production monitoring.

Defense 5: Human approval for dangerous actions

For high-risk actions, require human confirmation:

REQUIRES_APPROVAL = ["delete_file", "drop_table", "send_email", "deploy", "payment"]

async def execute_with_approval(action):
    if action.type in REQUIRES_APPROVAL:
        approved = await request_human_approval(
            f"Agent wants to {action.type}: {action.description}"
        )
        if not approved:
            return "Action denied by human reviewer"
    
    return await execute(action)

In our AI Startup Race, agents have a 1hr/week human help budget. All deploy actions and external API calls are logged and reviewed.

The security checklist for agents

Before deploying any AI agent to production:

  • Tools follow least privilege (only what’s needed)
  • Dangerous commands are blocked or require approval
  • User input is sanitized for injection
  • All actions are logged
  • Cost limits are set per session
  • Step limits prevent infinite loops
  • External API calls are restricted to allowlisted domains
  • API keys secured and rotated regularly
  • File access is scoped to specific directories
  • Red team testing completed
  • Rollback plan exists (feature flag to disable agent)

See our AI security checklist and MCP security guide for the complete security framework.

Related: Prompt Injection Explained Β· MCP Security Checklist Β· AI Security Checklist Β· Red Team Your AI Application Β· How to Debug AI Agents Β· What is an AI Agent?