
Automate Incident Response with AI: From Alert to Resolution (2026)



When PagerDuty fires at 3 AM, you need to triage fast: is this critical or noise? What's the likely root cause? What's the runbook? AI can handle the first 60 seconds of incident response, the part where you're still waking up.

The AI incident response pipeline

Alert (PagerDuty/Grafana/UptimeRobot)
    ↓
AI Triage (classify severity, identify service)
    ↓
AI Root Cause (analyze logs, metrics, recent deploys)
    ↓
Suggested Actions (runbook steps, rollback commands)
    ↓
Human Decision (approve/modify/escalate)
    ↓
AI Post-Mortem (generate draft after resolution)

The AI handles analysis and suggestions. Humans make decisions and execute actions. This is the human-in-the-loop pattern applied to incident response.
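
The flow above can be sketched as a small orchestrator. This is a minimal sketch under stated assumptions, not a definitive implementation: `run_pipeline` and the `approve` callback are hypothetical names, and the stages are passed in as plain callables so the flow can be tried with stubs before wiring in the model-backed functions from the steps below.

```python
from typing import Callable


def run_pipeline(
    alert: dict,
    triage: Callable[[dict], str],
    root_cause: Callable[[dict], str],
    suggest: Callable[[str], str],
    approve: Callable[[str], bool],
) -> dict:
    """Run the AI stages in order, then stop for a human decision."""
    result = {"alert": alert}
    result["triage"] = triage(alert)
    result["root_cause"] = root_cause(alert)
    result["actions"] = suggest(result["root_cause"])
    # The pipeline never executes anything itself: it only records
    # whether a human approved the suggested actions.
    result["approved"] = approve(result["actions"])
    return result


# Stub stages stand in for the model calls (all values are made up).
demo = run_pipeline(
    {"title": "High error rate", "service": "checkout"},
    triage=lambda a: "SEVERITY: HIGH",
    root_cause=lambda a: "Likely bad deploy",
    suggest=lambda rc: "1. Roll back (RISKY)",
    approve=lambda actions: False,  # the human declined
)
print(demo["approved"])
```

Keeping the approval step as an explicit callback makes it hard to skip the human decision by accident.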

Step 1: AI triage

import ollama

def triage_alert(alert):
    """Classify an alert dict with title, service, severity, description, timestamp."""
    response = ollama.chat(model="qwen3:8b", messages=[{
        "role": "user",
        "content": f"""Triage this alert:

Alert: {alert['title']}
Service: {alert['service']}
Severity: {alert['severity']}
Description: {alert['description']}
Time: {alert['timestamp']}

Classify as:
1. CRITICAL (user-facing outage, data loss risk)
2. HIGH (degraded performance, partial outage)
3. MEDIUM (elevated errors, no user impact yet)
4. LOW (noise, expected behavior, maintenance)

Also identify: affected service, likely component, and whether this correlates with a recent deployment.

Respond in this format:
SEVERITY: X
SERVICE: X
COMPONENT: X
RECENT_DEPLOY: yes/no
SUMMARY: one line"""
    }])
    return response["message"]["content"]
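
The function returns free text, so downstream steps need to turn the requested format back into fields. The following is a minimal sketch of one way to do that; `parse_triage` is a hypothetical helper, and mapping unparseable severities to "UNKNOWN" is an assumption of this sketch (the caller should treat it as needing human review).

```python
import re

TRIAGE_FIELDS = ("SEVERITY", "SERVICE", "COMPONENT", "RECENT_DEPLOY", "SUMMARY")
SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")


def parse_triage(text: str) -> dict:
    """Extract the KEY: value lines from the model's reply."""
    parsed = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z_]+)\s*:\s*(.+)", line)
        if match and match.group(1) in TRIAGE_FIELDS:
            parsed[match.group(1).lower()] = match.group(2).strip()
    # Normalise severity: models sometimes answer "1" or "1. CRITICAL".
    raw = parsed.get("severity", "").upper()
    severity = next((s for s in SEVERITIES if s in raw), None)
    if severity is None and raw[:1] in ("1", "2", "3", "4"):
        severity = SEVERITIES[int(raw[0]) - 1]
    # Assumption: anything unrecognised becomes UNKNOWN, not a guess.
    parsed["severity"] = severity or "UNKNOWN"
    return parsed


reply = """SEVERITY: 1. CRITICAL
SERVICE: checkout-api
COMPONENT: database pool
RECENT_DEPLOY: yes
SUMMARY: Connection pool exhausted after deploy"""
triage = parse_triage(reply)
print(triage["severity"], triage["recent_deploy"])  # CRITICAL yes
```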

Step 2: AI root cause analysis

def analyze_root_cause(alert, logs, metrics, recent_deploys):
    """Ask the larger model for a likely root cause from logs, metrics, and deploys."""
    response = ollama.chat(model="qwen3.5:27b", messages=[{
        "role": "user",
        "content": f"""Perform root cause analysis for this incident:

Alert: {alert['title']}
Service: {alert['service']}

Recent logs (last 30 min):
{logs[-3000:]}

Key metrics:
- Error rate: {metrics['error_rate']}
- Latency p99: {metrics['latency_p99']}
- CPU: {metrics['cpu']}
- Memory: {metrics['memory']}

Recent deployments (last 24h):
{recent_deploys}

Provide:
1. Most likely root cause (with confidence: high/medium/low)
2. Supporting evidence from logs/metrics
3. Whether this is likely caused by a recent deployment
4. Immediate mitigation steps"""
    }])
    return response["message"]["content"]

Use qwen3.5:27b for root cause analysis, which needs the reasoning capability of a larger model.
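
The prompt in `analyze_root_cause` caps the log excerpt at 3,000 characters. A plain character slice can cut a line in half, so here is a minimal sketch of a helper that keeps the newest whole lines within a budget. `tail_logs` and `max_chars` are hypothetical names, and the helper assumes the log text is ordered oldest first.

```python
def tail_logs(logs: str, max_chars: int = 3000) -> str:
    """Return the newest whole log lines that fit within max_chars."""
    if max_chars <= 0:
        return ""
    kept = []
    used = 0
    # Walk backwards from the newest line (assumes oldest-first ordering).
    for line in reversed(logs.splitlines()):
        cost = len(line) + 1  # +1 for the newline that joins lines
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    if not kept and logs:
        # A single oversized line: fall back to its last max_chars characters.
        return logs[-max_chars:]
    return "\n".join(reversed(kept))


sample = "a\nbb\nccc"
print(tail_logs(sample, 7))
```

With a budget of 7 characters, the call above keeps the two newest lines (`bb` and `ccc`) and drops the oldest.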

Step 3: Suggested actions

RUNBOOKS = {
    "high_error_rate": "1. Check recent deploys\n2. Rollback if deploy < 1h ago\n3. Check database connections\n4. Check external API status",
    "high_latency": "1. Check database slow queries\n2. Check cache hit rate\n3. Check for resource exhaustion\n4. Scale up if needed",
    "oom_killed": "1. Check memory limits\n2. Look for memory leaks in logs\n3. Increase memory limit\n4. Restart affected pods",
}

def suggest_actions(root_cause, service):
    # Render runbooks as readable text; the raw dict repr would show "\n" escapes.
    runbooks = "\n\n".join(f"{name}:\n{steps}" for name, steps in RUNBOOKS.items())
    response = ollama.chat(model="qwen3:8b", messages=[{
        "role": "user",
        "content": f"""Based on this root cause analysis for the {service} service:
{root_cause}

And these available runbooks:
{runbooks}

Provide specific, ordered actions to resolve this incident.
Include exact commands where possible (kubectl, aws cli, etc.).
Mark each action as SAFE (no risk) or RISKY (needs approval)."""
    }])
    return response["message"]["content"]
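
Because the prompt asks for each action to be marked SAFE or RISKY, the reply can be split into steps that need approval and steps that do not. This is a minimal sketch, assuming the model returns a numbered list; `split_actions` is a hypothetical helper, and the namespace and deployment names in the sample are made up.

```python
import re


def split_actions(text: str) -> list[dict]:
    """Turn the model's numbered list into steps tagged for approval."""
    steps = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if not match:
            continue  # skip prose between the numbered steps
        body = match.group(1).strip()
        marked_safe = re.search(r"\bSAFE\b", body) is not None
        marked_risky = re.search(r"\bRISKY\b", body) is not None
        # Anything not explicitly marked SAFE is treated as needing approval.
        steps.append({
            "step": body,
            "needs_approval": marked_risky or not marked_safe,
        })
    return steps


suggested = """1. kubectl get pods -n checkout (SAFE)
2. kubectl rollout undo deployment/checkout -n checkout (RISKY)
3. Check the status page"""
for item in split_actions(suggested):
    print(item["needs_approval"], item["step"])
```

Defaulting unmarked steps to "needs approval" keeps a formatting slip by the model from turning into an unreviewed action.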

Step 4: AI post-mortem draft

After the incident is resolved:

def generate_postmortem(incident):
    """Draft a post-mortem from an incident dict; a human should review the result."""
    response = ollama.chat(model="qwen3.5:27b", messages=[{
        "role": "user",
        "content": f"""Generate an incident post-mortem document:

Incident: {incident['title']}
Duration: {incident['duration']}
Impact: {incident['impact']}
Timeline: {incident['timeline']}
Root cause: {incident['root_cause']}
Resolution: {incident['resolution']}

Format as:
## Summary
## Impact
## Timeline
## Root Cause
## Resolution
## Action Items (preventive measures)
## Lessons Learned"""
    }])
    return response["message"]["content"]
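
`generate_postmortem` expects an `incident` dict with six keys. The following is a minimal sketch of how that dict might be assembled from timestamped notes; `build_incident` is a hypothetical helper, the event data is invented for illustration, and the duration is assumed to run from the first event to the last.

```python
from datetime import datetime


def build_incident(title, events, impact, root_cause, resolution):
    """Assemble the incident dict from (ISO timestamp, note) pairs."""
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e[0]))
    start = datetime.fromisoformat(ordered[0][0])
    end = datetime.fromisoformat(ordered[-1][0])
    minutes = int((end - start).total_seconds() // 60)
    timeline = "\n".join(f"{ts} {note}" for ts, note in ordered)
    return {
        "title": title,
        "duration": f"{minutes} min",
        "impact": impact,
        "timeline": timeline,
        "root_cause": root_cause,
        "resolution": resolution,
    }


incident = build_incident(
    title="Checkout errors",
    events=[
        ("2026-01-10T03:02:00", "Alert fired"),
        ("2026-01-10T03:40:00", "Rollback complete"),
        ("2026-01-10T03:05:00", "On-call acknowledged"),
    ],
    impact="Elevated checkout failures",
    root_cause="Bad deploy",
    resolution="Rolled back",
)
print(incident["duration"])  # 38 min
```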

Integration with alerting tools

With n8n

Use n8n + Ollama for a visual workflow:

Webhook (PagerDuty) → Ollama (triage) → Switch →
  ├── CRITICAL → Slack #incidents + page on-call
  ├── HIGH → Slack #incidents
  └── LOW → Log and ignore

With UptimeRobot

UptimeRobot detects the outage, and its webhook triggers your AI triage pipeline.
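
The webhook payload has to be reshaped into the alert dict that `triage_alert` reads. The following is a minimal sketch of that mapping. The payload keys (`monitorFriendlyName`, `alertTypeFriendlyName`, `alertDetails`, `alertDateTime`) are assumptions based on UptimeRobot's webhook variables, so check them against your own webhook configuration; `uptimerobot_to_alert` is a hypothetical helper.

```python
from datetime import datetime, timezone


def uptimerobot_to_alert(payload: dict) -> dict:
    """Map a webhook payload onto the alert dict used by triage_alert."""
    name = payload.get("monitorFriendlyName", "unknown monitor")
    state = payload.get("alertTypeFriendlyName", "Unknown")
    # alertDateTime is assumed to be a Unix timestamp in seconds.
    raw_time = payload.get("alertDateTime")
    if raw_time is not None:
        when = datetime.fromtimestamp(int(raw_time), tz=timezone.utc)
    else:
        when = datetime.now(timezone.utc)
    return {
        "title": f"{name} is {state}",
        "service": name,
        "severity": "unknown",  # left for the triage step to decide
        "description": payload.get("alertDetails", ""),
        "timestamp": when.isoformat(),
    }


alert = uptimerobot_to_alert({
    "monitorFriendlyName": "checkout",
    "alertTypeFriendlyName": "Down",
    "alertDetails": "Connection Timeout",
    "alertDateTime": "1767225600",
})
print(alert["title"])  # checkout is Down
```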

Why local models for incident response

Incident data is highly sensitive: internal service names, infrastructure topology, error messages with stack traces, database queries. Sending this to cloud AI during an incident:

  1. Adds latency (API call during outage)
  2. Exposes infrastructure details
  3. May fail if the outage affects your internet connectivity

Local Ollama models work offline, respond without a network round trip, and keep incident data on your own hardware.
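
One way to make the privacy property harder to break by accident is to refuse to send incident data unless the model endpoint is on the local machine or a private network. This is a minimal sketch of such a guard; `is_local_endpoint` is a hypothetical helper, and treating private-range addresses as acceptable is an assumption that may not fit every setup. Port 11434 is Ollama's default.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True if the URL points at loopback or a private address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch does not resolve it, so reject it.
        return False
    return addr.is_loopback or addr.is_private


print(is_local_endpoint("http://localhost:11434"))   # True
print(is_local_endpoint("https://api.example.com"))  # False
```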
