Agentic runbooks: what they are, how to write them, and who is shipping them in 2026.
The independent reference for the phrase the industry is still learning to spell. Vendor-neutral. Real YAML and LangGraph code in the first scroll. A forensic 12-vendor matrix. A free MTTR ROI calculator.
anatomy_of_an_incident / observe -> diagnose -> remediate -> postmortem
A runbook tells humans what to do. An agentic runbook tells an AI agent how to think.
A runbook is a set of instructions for handling an incident. An automated runbook executes those instructions on a trigger, via tools like Rundeck or Ansible. An agentic runbook goes further: the execution is handled by an AI agent that reads signals, reasons about what to do, calls real tools, and learns from the outcome. The agent is not following a fixed script. It is applying judgment.
Reasons over signals
Not just 'alert fired, run script'. The agent observes CPU spikes, log patterns, dependency health, and recent deploys before choosing an action.
Chooses from a tool scope
The agent has a defined inventory of actions it can take. It selects the right tool for the situation, not the next step in a fixed list.
Learns from outcomes
After resolution, the agent updates its runbook library via a learning loop. The next similar incident takes less time.
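The three properties above can be sketched as one small loop. This is a minimal illustration, not any vendor's implementation: `choose_tool` is a hand-written stand-in for the LLM reasoning step, and the signal fields, thresholds, and tool names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Signals:
    """Hypothetical snapshot of what the agent observes before acting."""
    cpu_percent: float
    error_log_lines: int
    recent_deploy: bool


# Hypothetical tool inventory. Real tools would call kubectl, a deploy API, etc.
TOOL_SCOPE = {
    "rollback_deploy": lambda: "rolled back",
    "restart_pods": lambda: "restarted",
    "collect_diagnostics": lambda: "diagnostics saved",
}


def choose_tool(signals: Signals) -> str:
    """Stand-in for the reasoning step: maps observed signals to one tool name."""
    if signals.recent_deploy and signals.error_log_lines > 100:
        return "rollback_deploy"
    if signals.cpu_percent > 90:
        return "restart_pods"
    return "collect_diagnostics"


@dataclass
class OutcomeLog:
    """Records what was tried so later incidents can consult it."""
    entries: list = field(default_factory=list)

    def record(self, signals: Signals, tool: str, result: str) -> None:
        self.entries.append({"signals": signals, "tool": tool, "result": result})


def handle_incident(signals: Signals, log: OutcomeLog) -> str:
    tool = choose_tool(signals)
    # The scope check holds even if the reasoning step names an unknown tool.
    if tool not in TOOL_SCOPE:
        raise PermissionError(f"{tool} is outside the tool scope")
    result = TOOL_SCOPE[tool]()
    log.record(signals, tool, result)
    return tool
```

In a real agent the `if` chain in `choose_tool` is replaced by a model call, but the scope check and the outcome record stay as ordinary code around it.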
Stop reading vendor diagrams. Read the YAML.
Below is a minimal LangGraph-compatible runbook for Kubernetes pod crash-loop remediation. The notes after it explain each section. Three full annotated examples are at /writing-your-first-agentic-runbook.
# agentic-runbook: pod-crashloop-remediation
metadata:
  id: k8s-crashloop-v2
  owner: platform-eng
  risk: medium
  approvers: [on-call-lead]
  last_verified: 2026-04-01
signal_spec:
  trigger: pagerduty_alert
  condition: "alert.title contains CrashLoopBackOff"
  cooldown_minutes: 5
tool_scope:
  - kubectl_get_pod_logs
  - kubectl_describe_pod
  - kubectl_rollout_restart # writes to prod
  - pagerduty_acknowledge
  - slack_post_update
action_boundary:
  auto_approve:
    - kubectl_get_pod_logs
    - kubectl_describe_pod
    - slack_post_update
  require_human:
    - kubectl_rollout_restart
execution_plan:
  framework: langgraph
  model: claude-sonnet-4-5
  max_iterations: 8

- metadata · identity, ownership, risk class, who can approve write actions
- signal_spec · the precise condition that triggers agent execution
- tool_scope · the closed inventory of tools the agent may call. anything outside this list is unreachable
- action_boundary · auto-approve for read-only and comms; require_human for any write to production
- execution_plan · LangGraph orchestrator, Sonnet 4.5 reasoning, 8-step bound
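How `tool_scope` and `action_boundary` might be enforced at call time can be sketched in a few lines of Python. This is an assumption about one reasonable enforcement model, not a LangGraph API: the `gate` function and the `Decision` enum are hypothetical names, and the runbook is shown as an already-parsed dict.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto_approve"
    HUMAN = "require_human"
    DENIED = "denied"


# The relevant parts of the runbook above, as a parsed dict.
RUNBOOK = {
    "tool_scope": [
        "kubectl_get_pod_logs",
        "kubectl_describe_pod",
        "kubectl_rollout_restart",
        "pagerduty_acknowledge",
        "slack_post_update",
    ],
    "action_boundary": {
        "auto_approve": [
            "kubectl_get_pod_logs",
            "kubectl_describe_pod",
            "slack_post_update",
        ],
        "require_human": ["kubectl_rollout_restart"],
    },
}


def gate(runbook: dict, tool: str) -> Decision:
    """Decide how a requested tool call is handled before it runs."""
    # Anything outside the closed inventory is unreachable.
    if tool not in runbook["tool_scope"]:
        return Decision.DENIED
    boundary = runbook.get("action_boundary", {})
    if tool in boundary.get("auto_approve", []):
        return Decision.AUTO
    # Fail closed: in scope but not explicitly auto-approved means a human decides.
    return Decision.HUMAN
```

The fail-closed default matters for this runbook: `pagerduty_acknowledge` is in `tool_scope` but appears in neither boundary list, so under this sketch it waits for a human rather than running unattended.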
The taxonomy is muddled. Here is the clean version.
Vendor blogs conflate three distinct categories. Full comparison at /traditional-vs-agentic.
| Dimension | Traditional | Automated | Agentic |
|---|---|---|---|
| Format | Confluence doc / PDF | Script / Ansible playbook | YAML + LangGraph / AutoGen |
| Trigger | Human reads alert | Webhook / cron | Observability signal + LLM reasoning |
| Execution | Human follows steps | Deterministic script | Agent chooses actions from scope |
| Adaptability | None | Low (pre-scripted paths) | High (reasons about novel situations) |
| Learning | Postmortem updates doc | None | Outcome feeds learning loop |
| Audit trail | Slack thread + notes | Script log | Full reasoning trace + tool calls |
| Typical tool | Confluence, Notion | Rundeck, Ansible | PagerDuty AIOps, Rootly, Kubiya |
The 2026 vendor roster, reduced to one tag and one price.
| Vendor | Tag | Pricing |
|---|---|---|
| PagerDuty | P2 · Runbook Automation + AIOps | Per-seat |
| incident.io | P4 · AI workflows, Slack-native | Custom |
| FireHydrant | P4 · AI-assisted runbooks | Custom |
| Rootly | P4 · AI postmortem + RCA | Custom |
| Shoreline | P3 · Notebooks, 75% MTTR claim | Custom |
| Kubiya | P4 · Meta-agent orchestration | Custom |
| Komodor Klaudia | P3 · K8s, 95% accuracy | Custom |
| AWS DevOps Agent | P4 · Bedrock AgentCore + MCP | Usage-based |
Pricing from vendor public pages, April 2026. Verify before procurement. See all 12 vendors including Traversal, Resolve.ai, Datadog Bits AI, xMatters, and OpenSRE.
What agents are actually doing on call in 2026.
Pod crash-loop remediation
Agent detects CrashLoopBackOff, reads logs, proposes restart, gets approval. 23-second MTTR.
Deployment rollback
Error-rate spike triggers agent to diff recent deploys, propose rollback to last stable version.
Certificate expiry rotation
Proactive agent runs nightly, detects certs expiring in 14 days, initiates rotation workflow.
Cost anomaly scale-down
Cloud cost spike triggers agent to find over-provisioned resources and propose scale-down.
Auth spike response
Login volume 10x normal: agent classifies campaign vs DDoS, routes to appropriate runbook.
Noise suppression
PagerDuty AIOps agent correlates 400 alerts into 3 actionable incidents. 91% reduction claimed.
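The noise-suppression case is the easiest to picture in code. The sketch below groups alerts by service within a time window. It is a deliberately naive heuristic and not how PagerDuty AIOps or any other product correlates events. The alert fields (`service`, `ts`) and the 300-second window are assumptions.

```python
def correlate(alerts: list, window_seconds: int = 300) -> list:
    """Group alerts that share a service and arrive within window_seconds
    of the first alert in that service's open incident."""
    incidents = []
    open_by_service = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        current = open_by_service.get(alert["service"])
        if current is not None and alert["ts"] - current["first_ts"] <= window_seconds:
            current["alerts"].append(alert)
        else:
            # Either the first alert for this service or the window has elapsed.
            current = {
                "service": alert["service"],
                "first_ts": alert["ts"],
                "alerts": [alert],
            }
            open_by_service[alert["service"]] = current
            incidents.append(current)
    return incidents


def noise_reduction(alert_count: int, incident_count: int) -> float:
    """Fraction of pages avoided by paging per incident instead of per alert."""
    return 1 - incident_count / alert_count
```

Production correlation usually adds topology, deploy history, and learned similarity on top of a time window. The point here is only the shape of the transformation: many alerts in, few incidents out.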
You just gave an LLM kubectl write. Here is the threat model.
Vendor pitches will not cover these. The pre-launch checklist at /security-considerations has 12 items. Read them before production rollout.
Prompt injection via alert payloads
An attacker crafts a pod name or service response that hijacks the agent's instructions mid-execution.
Over-privileged IAM
IBM research: 70% of orgs grant AI more access than equivalent humans. Those orgs see 4.5x more security incidents.
Destructive action blast radius
kubectl delete, terraform destroy, and misconfigured rollbacks can cascade. Circuit breakers are not optional.
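A circuit breaker for destructive actions can be very small. The sketch below is one possible shape, assuming a sliding time window and a fixed cap. The class name, the limits, and the injectable clock are all illustrative choices, not part of any framework.

```python
import time


class ActionCircuitBreaker:
    """Refuses further destructive actions once too many have run
    inside a sliding time window."""

    def __init__(self, max_actions: int, window_seconds: float, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._timestamps = []

    def allow(self) -> bool:
        now = self._clock()
        # Forget actions that have aged out of the window.
        self._timestamps = [
            t for t in self._timestamps if now - t < self.window_seconds
        ]
        if len(self._timestamps) >= self.max_actions:
            return False  # breaker is open: escalate to a human instead
        self._timestamps.append(now)
        return True
```

Call `allow()` immediately before every write action and treat `False` as a hard stop. For example, `ActionCircuitBreaker(max_actions=2, window_seconds=600)` permits at most two restarts in ten minutes, which bounds the damage from an agent stuck in a restart loop.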
What is your MTTR savings worth?
Vendor MTTR reduction claims range from 38% to 95%. The free ROI calculator lets SRE leads model their own numbers, no email required.
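The arithmetic behind an MTTR estimate is simple enough to check by hand. The function below is an assumed model (incident count × baseline MTTR × claimed reduction × cost per minute of downtime) and is not necessarily the formula the calculator uses. Every parameter name and example value is hypothetical.

```python
def annual_mttr_savings(
    incidents_per_year: int,
    baseline_mttr_minutes: float,
    mttr_reduction: float,
    cost_per_minute: float,
) -> float:
    """Estimated yearly savings from resolving incidents faster.

    mttr_reduction is a fraction, e.g. 0.38 for a claimed 38% reduction.
    """
    if not 0 <= mttr_reduction <= 1:
        raise ValueError("mttr_reduction must be between 0 and 1")
    minutes_saved = incidents_per_year * baseline_mttr_minutes * mttr_reduction
    return minutes_saved * cost_per_minute
```

With 120 incidents a year, a 45-minute baseline MTTR, the low-end 38% claim, and downtime costed at 100 per minute, the model gives about 205,200 per year. Re-running it with your own incident count and the vendor's worst-case claim is a quick sanity check on any quote.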
open_roi_calc →

What people ask before they adopt.
Q.01 What is an agentic runbook?
An agentic runbook is a runbook executed by an AI agent that reasons over live signals, chooses actions from a defined tool scope, and learns from outcomes. The three defining properties are agency, memory, and tool scope. Unlike a scripted automated runbook, the agent is not following a fixed execution path. It applies judgment to the current state of the system.
Q.02 Is PagerDuty's runbook automation really agentic?
Honest answer: partially. PagerDuty Runbook Automation (formerly Rundeck) has deterministic execution at its core: event triggers a job, job runs predefined steps. The recent additions of Gen-AI job authoring and the AIOps event-correlation layer push it toward agentic behaviour, but the runbook execution itself remains deterministic. The AIOps layer is agentic-adjacent; the runbook runner is not.
Q.03 How do you write an agentic runbook?
An agentic runbook needs eight fields: metadata (id, version, owner, risk, approvers), signal_spec (what triggers the agent), tool_scope (what APIs it can call), action_boundary (which actions require human approval), context_retrieval (what past incidents and docs the agent pulls via RAG), execution_plan (the LangGraph or AutoGen graph), observability (logs and reasoning dump), and a learning_loop (how outcomes feed back). The /writing-your-first-agentic-runbook page has three full working examples in YAML and LangGraph Python.
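A pre-merge check for those eight fields might look like the sketch below. It assumes the runbook has already been parsed into a dict. The field and metadata key names come from the list above, while `validate_runbook` and its message format are hypothetical.

```python
REQUIRED_FIELDS = (
    "metadata",
    "signal_spec",
    "tool_scope",
    "action_boundary",
    "context_retrieval",
    "execution_plan",
    "observability",
    "learning_loop",
)
REQUIRED_METADATA = ("id", "version", "owner", "risk", "approvers")


def validate_runbook(runbook: dict) -> list:
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in runbook
    ]
    metadata = runbook.get("metadata", {})
    problems += [
        f"missing metadata key: {key}"
        for key in REQUIRED_METADATA
        if key not in metadata
    ]
    # Every tool named in action_boundary should also be in tool_scope.
    scope = set(runbook.get("tool_scope", []))
    for bucket, tools in runbook.get("action_boundary", {}).items():
        for tool in tools:
            if tool not in scope:
                problems.append(f"{bucket} lists {tool}, which is not in tool_scope")
    return problems
```

Run against the minimal crash-loop runbook earlier on this page, a check like this would flag the absent `context_retrieval`, `observability`, and `learning_loop` sections and the missing `version` key, which is a fair summary of what separates a minimal example from a production runbook.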
Q.04 What is MCP and why does it matter for runbooks?
Model Context Protocol (MCP) is an open standard, introduced by Anthropic, for connecting agents to external tools and data sources. It standardises how an agent discovers and invokes capabilities. AWS Bedrock AgentCore wraps Kubernetes, logs, and metrics APIs as MCP tools, meaning a LangGraph or AutoGen agent can call kubectl, CloudWatch, and PagerDuty through a single interface. For runbooks, MCP simplifies integration and enables composable, vendor-neutral agent architectures.
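On the wire, MCP messages are JSON-RPC 2.0, and the specification defines `tools/list` for discovery and `tools/call` for invocation. The sketch below only builds the request payloads. The transport, the initialization handshake, and response handling are omitted, and the tool name and arguments in the usage note are hypothetical.

```python
import itertools

# Monotonically increasing JSON-RPC request ids.
_request_ids = itertools.count(1)


def mcp_request(method: str, params: dict = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope as used by MCP."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request() -> dict:
    """Ask the server which tools it exposes."""
    return mcp_request("tools/list")


def call_tool_request(name: str, arguments: dict) -> dict:
    """Invoke one named tool with structured arguments."""
    return mcp_request("tools/call", {"name": name, "arguments": arguments})
```

For example, `call_tool_request("kubectl_get_pod_logs", {"namespace": "prod", "pod": "checkout-7d9f"})` yields a payload that any MCP client library would serialise and send for you. Seeing the raw shape makes it clear why the same agent code can sit in front of different vendors' tool servers.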
Q.05 Can an agentic runbook work in an air-gapped environment?
Yes, with caveats. Cloud-API LLMs like Claude Sonnet or GPT-4o are out unless your compliance allows egress. The viable paths are self-hosted models (Llama 3, Mixtral, or fine-tuned smaller models) with a local vector database and on-premise orchestration. Emerging option: domain-specific distilled models fine-tuned on runbook reasoning tasks, running entirely inside your VPC. Latency and capability will be lower than cloud LLMs, but the pattern is architecturally sound.
Q.06 Will agentic runbooks still be relevant in five years?
The vocabulary may shift (as 'DevOps' became 'platform engineering'), but the underlying pattern is durable: AI-mediated operational reasoning sitting between observability signals and infrastructure action planes. The specific term 'agentic runbook' may not survive, but the category it describes will. The sites and practitioners who define the vocabulary now will carry that authority forward, regardless of what the term evolves into.
15 deep pages, cross-linked.
What is an agentic runbook?
Precise definition, taxonomy, and the four distinguishing properties.
Traditional vs agentic
Side-by-side comparison matrix and decision tree.
Compare 12 vendors
Forensic capability matrix. No sponsored placements.
Write your first runbook
Three working examples in YAML and LangGraph Python.
Security threat model
Prompt injection, over-privileged IAM, blast radius, and mitigations.
For Kubernetes
The 10 most automated K8s incident patterns, with real tools.
For AWS
DevOps Agent, Bedrock AgentCore, MCP gateway, and IAM policy.
Postmortem automation
AI-drafted postmortems: what they produce and where they fail.
Glossary (40 terms)
The SRE and agentic AI vocabulary, defined precisely.