When PagerDuty fires at 3 AM, you need to triage fast: is this critical or noise? What's the likely root cause? What's the runbook? AI can handle the first 60 seconds of incident response, the part where you're still waking up.
The AI incident response pipeline
Alert (PagerDuty/Grafana/UptimeRobot)
↓
AI Triage (classify severity, identify service)
↓
AI Root Cause (analyze logs, metrics, recent deploys)
↓
Suggested Actions (runbook steps, rollback commands)
↓
Human Decision (approve/modify/escalate)
↓
AI Post-Mortem (generate draft after resolution)
The AI handles analysis and suggestions. Humans make decisions and execute actions. This is the human-in-the-loop pattern applied to incident response.
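The flow above can be sketched as a small orchestrator. This is only an illustration under stated assumptions: the stage functions are passed in as plain callables so the example runs without a model, and `run_pipeline`, `IncidentReport`, and `approve` are hypothetical names that do not appear elsewhere in this article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IncidentReport:
    triage: str
    root_cause: str
    actions: str
    approved: bool


def run_pipeline(
    alert: dict,
    triage: Callable[[dict], str],
    root_cause: Callable[[dict], str],
    suggest: Callable[[str], str],
    approve: Callable[[str], bool],
) -> IncidentReport:
    """Run the AI stages in order, then stop and ask a human."""
    triage_text = triage(alert)
    cause_text = root_cause(alert)
    actions_text = suggest(cause_text)
    # The pipeline never executes anything itself: it only records
    # whether the on-call engineer approved the suggested actions.
    approved = approve(actions_text)
    return IncidentReport(triage_text, cause_text, actions_text, approved)


# Stub stages for illustration; in practice these would wrap the model calls.
report = run_pipeline(
    {"title": "High error rate", "service": "checkout"},
    triage=lambda a: f"SEVERITY: HIGH\nSERVICE: {a['service']}",
    root_cause=lambda a: "Likely cause: bad deploy",
    suggest=lambda cause: "1. Roll back the last deploy (RISKY)",
    approve=lambda actions: False,  # the human declined
)
print(report.approved)
```

Keeping `approve` as a separate callable makes the human decision an explicit step in the code path rather than a convention.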
Step 1: AI triage
import ollama

def triage_alert(alert):
    """Ask a small local model to classify an incoming alert."""
    response = ollama.chat(model="qwen3:8b", messages=[{
        "role": "user",
        "content": f"""Triage this alert:
Alert: {alert['title']}
Service: {alert['service']}
Severity: {alert['severity']}
Description: {alert['description']}
Time: {alert['timestamp']}
Classify as:
1. CRITICAL (user-facing outage, data loss risk)
2. HIGH (degraded performance, partial outage)
3. MEDIUM (elevated errors, no user impact yet)
4. LOW (noise, expected behavior, maintenance)
Also identify: affected service, likely component, and whether this correlates with a recent deployment.
Respond in this format:
SEVERITY: X
SERVICE: X
COMPONENT: X
RECENT_DEPLOY: yes/no
SUMMARY: one line"""
    }])
    # The reply is free text that should follow the format requested above.
    return response["message"]["content"]
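The triage prompt asks for a fixed `KEY: value` format, but the function returns raw text. A minimal sketch of turning that text into a dictionary follows. `parse_triage` is a hypothetical helper, and it assumes the model puts each field on its own line without extra markup; fields it cannot find are left as `None` so a malformed reply is detectable.

```python
import re

EXPECTED_KEYS = ("SEVERITY", "SERVICE", "COMPONENT", "RECENT_DEPLOY", "SUMMARY")


def parse_triage(text: str) -> dict:
    """Extract KEY: value lines from the model's reply.

    Missing keys map to None so callers can detect a malformed reply.
    """
    result = {key: None for key in EXPECTED_KEYS}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Z_]+)\s*:\s*(.+?)\s*$", line)
        if match and match.group(1) in result:
            result[match.group(1)] = match.group(2)
    return result


# Made-up reply used only to exercise the parser.
sample = (
    "SEVERITY: CRITICAL\n"
    "SERVICE: checkout-api\n"
    "COMPONENT: database pool\n"
    "RECENT_DEPLOY: yes\n"
    "SUMMARY: Connection pool exhausted after deploy"
)
parsed = parse_triage(sample)
print(parsed["SEVERITY"], parsed["RECENT_DEPLOY"])
```

If `SEVERITY` comes back as `None`, treating the alert as unclassified and paging a human is a safer default than guessing.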
Step 2: AI root cause analysis
def analyze_root_cause(alert, logs, metrics, recent_deploys):
    """Ask a larger local model for a root cause hypothesis."""
    response = ollama.chat(model="qwen3.5:27b", messages=[{
        "role": "user",
        # logs[:3000] truncates the log text to keep the prompt short.
        "content": f"""Perform root cause analysis for this incident:
Alert: {alert['title']}
Service: {alert['service']}
Recent logs (last 30 min):
{logs[:3000]}
Key metrics:
- Error rate: {metrics['error_rate']}
- Latency p99: {metrics['latency_p99']}
- CPU: {metrics['cpu']}
- Memory: {metrics['memory']}
Recent deployments (last 24h):
{recent_deploys}
Provide:
1. Most likely root cause (with confidence: high/medium/low)
2. Supporting evidence from logs/metrics
3. Whether this is likely caused by a recent deployment
4. Immediate mitigation steps"""
    }])
    return response["message"]["content"]
Use qwen3.5:27b for root cause analysis: correlating logs, metrics, and deploys benefits from the stronger reasoning of a larger model.
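One detail in Step 2 is worth a closer look: `logs[:3000]` keeps the first 3,000 characters. If your log text is in chronological order (an assumption about your log source), that is the oldest part of the window. The sketch below keeps the most recent whole lines that fit a character budget instead. `tail_logs` is a hypothetical helper, not part of the code above.

```python
def tail_logs(logs: str, max_chars: int = 3000) -> str:
    """Return the last whole lines of `logs` that fit in `max_chars`.

    Returns an empty string if even the final line is over budget.
    """
    kept = []
    used = 0
    for line in reversed(logs.splitlines()):
        cost = len(line) + 1  # +1 for the newline that joins the lines
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(reversed(kept))


# Synthetic log text used only to exercise the helper.
logs = "\n".join(f"line {i}" for i in range(1000))
recent = tail_logs(logs, max_chars=50)
print(recent)
```

Passing `tail_logs(logs)` in place of `logs[:3000]` would put the lines closest to the alert in front of the model.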
Step 3: Suggested actions
RUNBOOKS = {
    "high_error_rate": "1. Check recent deploys\n2. Rollback if deploy < 1h ago\n3. Check database connections\n4. Check external API status",
    "high_latency": "1. Check database slow queries\n2. Check cache hit rate\n3. Check for resource exhaustion\n4. Scale up if needed",
    "oom_killed": "1. Check memory limits\n2. Look for memory leaks in logs\n3. Increase memory limit\n4. Restart affected pods",
}

def suggest_actions(root_cause, service):
    """Turn a root cause analysis into ordered, risk-labelled actions."""
    # Render the runbooks as readable text. Interpolating the dict directly
    # would show escaped newline sequences instead of real line breaks.
    runbooks = "\n\n".join(f"{name}:\n{steps}" for name, steps in RUNBOOKS.items())
    response = ollama.chat(model="qwen3:8b", messages=[{
        "role": "user",
        "content": f"""Based on this root cause analysis for the {service} service:
{root_cause}
And these available runbooks:
{runbooks}
Provide specific, ordered actions to resolve this incident.
Include exact commands where possible (kubectl, aws cli, etc.).
Mark each action as SAFE (no risk) or RISKY (needs approval)."""
    }])
    return response["message"]["content"]
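The prompt asks the model to mark each action SAFE or RISKY, which only helps if something reads those labels. A minimal sketch of that gate follows. `split_actions` is a hypothetical helper, and it assumes each action sits on its own line with the literal word SAFE or RISKY, which the model may not always do. Anything not clearly marked SAFE is routed to approval.

```python
import re


def split_actions(text: str) -> tuple[list[str], list[str]]:
    """Return (safe, needs_approval) lists of action lines."""
    safe, needs_approval = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        has_safe = re.search(r"\bSAFE\b", line) is not None
        has_risky = re.search(r"\bRISKY\b", line) is not None
        # Default to requiring approval: a line counts as safe only when
        # it is marked SAFE and not also marked RISKY.
        if has_safe and not has_risky:
            safe.append(line)
        else:
            needs_approval.append(line)
    return safe, needs_approval


# Made-up model reply used only to exercise the helper.
reply = (
    "1. Check recent deploys [SAFE]\n"
    "2. kubectl rollout undo deployment/checkout [RISKY]\n"
    "3. Restart pods"
)
safe, needs_approval = split_actions(reply)
print(len(safe), len(needs_approval))
```

The unlabelled third line lands in `needs_approval`, which matches the rule that humans make the decisions whenever the model's output is ambiguous.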
Step 4: AI post-mortem draft
After the incident is resolved:
def generate_postmortem(incident):
    """Draft a post-mortem from the resolved incident's details."""
    response = ollama.chat(model="qwen3.5:27b", messages=[{
        "role": "user",
        "content": f"""Generate an incident post-mortem document:
Incident: {incident['title']}
Duration: {incident['duration']}
Impact: {incident['impact']}
Timeline: {incident['timeline']}
Root cause: {incident['root_cause']}
Resolution: {incident['resolution']}
Format as:
## Summary
## Impact
## Timeline
## Root Cause
## Resolution
## Action Items (preventive measures)
## Lessons Learned"""
    }])
    # This is a draft for human review, not a finished document.
    return response["message"]["content"]
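`generate_postmortem` expects `duration` and `timeline` to be filled in already. One way to derive both from timestamped notes is sketched below. `build_timeline` is a hypothetical helper, it assumes ISO 8601 timestamps and at least one event, and the sample events are invented for illustration.

```python
from datetime import datetime


def build_timeline(events: list[tuple[str, str]]) -> tuple[str, str]:
    """Return (duration, timeline) from (ISO timestamp, note) pairs."""
    parsed = sorted((datetime.fromisoformat(ts), note) for ts, note in events)
    start, end = parsed[0][0], parsed[-1][0]
    minutes = int((end - start).total_seconds() // 60)
    duration = f"{minutes // 60}h {minutes % 60}m"
    timeline = "\n".join(f"{ts:%H:%M} {note}" for ts, note in parsed)
    return duration, timeline


# Invented events, listed out of order on purpose.
events = [
    ("2025-01-15T03:10:00", "Rollback approved"),
    ("2025-01-15T03:02:00", "Alert fired"),
    ("2025-01-15T03:47:00", "Error rate back to baseline"),
]
duration, timeline = build_timeline(events)
print(duration)
print(timeline)
```

Computing these values in code rather than asking the model to infer them keeps times and durations in the draft exact.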
Integration with alerting tools
With n8n
Use n8n + Ollama for a visual workflow:
Webhook (PagerDuty) → Ollama (triage) → Switch →
├── CRITICAL → Slack #incidents + page on-call
├── HIGH → Slack #incidents
└── LOW → Log and ignore
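For readers not using n8n, the Switch step above amounts to a lookup table. The sketch below expresses it in Python. `route` and the destination strings are hypothetical, and the handling of MEDIUM is an assumption, since the diagram does not show a branch for it.

```python
def route(severity: str) -> list[str]:
    """Map a triage severity to the destinations in the diagram."""
    routes = {
        "CRITICAL": ["slack:#incidents", "page:on-call"],
        "HIGH": ["slack:#incidents"],
        "LOW": ["log"],
    }
    # Assumption: MEDIUM and unrecognised values are not in the diagram,
    # so they fall back to the HIGH route rather than being dropped.
    return routes.get(severity.strip().upper(), routes["HIGH"])


print(route("critical"))
print(route("MEDIUM"))
```

Falling back to a visible channel means a malformed triage reply still reaches a person.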
With UptimeRobot
UptimeRobot detects the outage, and its webhook triggers your AI triage pipeline.
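A webhook receiver for this can be quite small. The sketch below uses only the Python standard library. The payload keys (`alert_title`, `monitor_name`, `details`, and so on) are placeholders rather than a documented UptimeRobot schema, so map them to whatever your webhook is configured to send. `payload_to_alert`, `AlertHandler`, and `serve` are hypothetical names.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


def payload_to_alert(payload: dict) -> dict:
    """Map a webhook payload onto the alert dict used by triage_alert.

    The payload keys below are placeholders, not a documented schema.
    """
    return {
        "title": payload.get("alert_title", "Unknown alert"),
        "service": payload.get("monitor_name", "unknown"),
        "severity": payload.get("severity", "unclassified"),
        "description": payload.get("details", ""),
        "timestamp": payload.get("timestamp")
        or datetime.now(timezone.utc).isoformat(),
    }


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            self.send_response(400)  # reject bodies that are not JSON objects
            self.end_headers()
            return
        alert = payload_to_alert(payload)
        # Hand `alert` to the triage step here (for example triage_alert).
        self.send_response(202)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever, accepting webhook POSTs on the given port."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. In production you would likely also verify that requests really come from your monitoring tool before acting on them.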
Why local models for incident response
Incident data is highly sensitive: internal service names, infrastructure topology, error messages with stack traces, database queries. Sending this to cloud AI during an incident:
- Adds latency (API call during outage)
- Exposes infrastructure details
- May fail if the outage affects your internet connectivity
Local Ollama models work offline, avoid the round trip to an external API, and keep incident data on your own infrastructure.
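A local model only helps if it is actually running when the alert arrives. The sketch below checks whether the Ollama HTTP API responds, using its model-list endpoint (`/api/tags`) on the default address `http://localhost:11434`. `ollama_is_up` is a hypothetical helper, and the two-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama HTTP API answers a model-list request."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a valid JSON body confirms it is the API
            return resp.status == 200
    except (OSError, ValueError):
        # Covers connection errors, timeouts, HTTP errors and invalid JSON.
        return False
```

Running this check from your regular monitoring, rather than at alert time, means you find out the triage model is down before you need it.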
Related: AI Log Analysis with Local Models · Self-Host n8n with Local AI · AI for CI/CD Pipelines · Human-in-the-Loop AI Agents · Ollama Complete Guide · AI Agent Error Handling