What Is an Agentic Runbook? A Precise Definition for 2026
An agentic runbook is a runbook executed by an AI agent that reasons over live observability signals, selects actions from a defined tool scope, requests human approval where required, and updates its own reference library based on incident outcomes.
The formal definition: three required properties
Every agentic runbook has three properties that distinguish it from the terms it is frequently confused with. A runbook that lacks any of these three is something else: an automated runbook, an AI-assisted runbook, or a chatbot. The distinction matters because the engineering tradeoffs, the security surface, and the required infrastructure differ substantially between the categories.
Agency
The system plans before it acts. It does not simply receive a trigger and execute a fixed script. It reads the current state of the system, retrieves relevant context (past incidents, runbook library, dependency topology), reasons about the options, and selects an action sequence. This is the most important distinguishing property.
Memory
The agent retains context across incidents. When it encounters a pod crash-loop, it can retrieve the last three incidents of the same type, what actions were taken, and whether they succeeded. This retrieval-augmented approach means the agent gets better with exposure, not just better with model updates.
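The retrieval step can be sketched in a few lines. This is an illustrative stand-in, not a real vector store: a keyword match substitutes for embedding similarity, and the names `IncidentRecord` and `retrieve_similar` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentRecord:
    title: str
    actions: list[str]      # what was done
    resolved: bool          # did it work
    occurred: datetime

def retrieve_similar(store: list[IncidentRecord], keyword: str,
                     top_k: int = 3, max_age_days: int = 90) -> list[IncidentRecord]:
    """Return the most recent matching incidents within the age window.
    A production system would rank by vector similarity instead of
    substring match, but the shape of the query is the same."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    hits = [r for r in store if keyword in r.title and r.occurred >= cutoff]
    return sorted(hits, key=lambda r: r.occurred, reverse=True)[:top_k]
```

The `max_age_days` filter matters: a remediation that worked against last year's deployment topology may be actively misleading today.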
Tool scope
The agent calls real APIs: kubectl, the PagerDuty API, Slack, CloudWatch, Terraform state. It does not just generate text about what should happen. The tool scope is explicitly defined in the runbook's action_boundary field, which also specifies which actions require human approval.
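Enforcement of the action boundary should live outside the model, in plain code the LLM cannot rewrite. A minimal deny-by-default gate might look like this sketch (`classify_action` is an illustrative name; the tool names are taken from the crash-loop example later in this section):

```python
# Allow-lists are loaded from the runbook's action_boundary field,
# not generated by the model.
AUTO_APPROVE = {"kubectl_get_pod_logs", "kubectl_describe_pod",
                "kubectl_get_events", "pagerduty_acknowledge",
                "slack_post_update", "datadog_query_metrics"}
REQUIRE_HUMAN = {"kubectl_rollout_restart"}

def classify_action(tool_name: str) -> str:
    """Gate a proposed tool call. Anything not explicitly listed is
    blocked: deny-by-default means a hallucinated tool name fails
    closed rather than open."""
    if tool_name in AUTO_APPROVE:
        return "auto_approve"
    if tool_name in REQUIRE_HUMAN:
        return "require_human"
    return "never_allow"
```

The design point is that `never_allow` is the fall-through case, so the safety property holds even for tools nobody anticipated.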
What an agentic runbook is not
The term is used loosely by vendors, analysts, and engineers. Here is a precise taxonomy to cut through the noise.
A traditional runbook
Not agentic. A static, human-readable procedure document, typically in Confluence, Notion, or a PDF. A human reads the alert, opens the runbook, and follows the steps. The document cannot observe the system or take action independently. It is a reference artefact, not an executing system.
An automated runbook
Not agentic. A scripted procedure executed by a tool on a trigger. Rundeck, Ansible, and Terraform are the canonical examples. A webhook fires, a pre-defined job runs, outputs are logged. There is no reasoning step. The automation cannot handle situations outside its script. It is deterministic and replay-safe. Also called 'runbook automation' in vendor documentation.
An AI-assisted runbook
Not agentic. A hybrid where an LLM suggests the next steps and a human executes. The LLM reads the alert and the runbook, produces a recommendation ('restart the pod, then check logs'), and a human carries it out. Copilot-style incident response. The AI is advisory; the human is the action plane. incident.io's early AI features and FireHydrant's runbook suggestions fall here.
A chatbot
Not agentic. A conversational interface that answers questions about incidents or runbooks. It has no action plane. It cannot call kubectl. It cannot acknowledge a PagerDuty alert. It generates text, not actions.
An autonomous runbook (hypothetical)
Not yet real in production. A runbook agent with no human approval gates, full write access, and self-updating logic. This is the marketing end-state described in vendor pitch decks. It does not exist safely in production for most organisations in 2026. Full autonomy without human approval is the last mile, not the current state.
A concrete example: pod crash-loop remediation
This is a minimal working agentic runbook for Kubernetes pod crash-loops. It is not hypothetical: this pattern runs in production at organisations using LangGraph with Claude Sonnet or GPT-4o as the reasoning model. The comments explain what each section does.
```yaml
# agentic-runbook: pod-crashloop-remediation
# This is a working LangGraph-compatible runbook specification.
# The agent reads this file, executes the signal_spec to detect the trigger,
# retrieves context via the context_retrieval spec, and then builds a
# LangGraph execution graph from the execution_plan.
metadata:
  id: k8s-crashloop-v2
  version: "2.1"
  owner: platform-eng
  slack_channel: "#incidents-platform"
  pagerduty_service: auth-service
  risk: medium  # low | medium | high | critical
  approvers:
    - on-call-lead  # Slack handle, required for require_human actions
  last_verified: "2026-04-01"
  changelog:
    - "2026-04-01: Added OOMKilled sub-case, updated approver list"
    - "2025-11-15: Initial version"
signal_spec:
  # What triggers the agent. Must match before any execution starts.
  trigger: pagerduty_alert
  condition: "alert.title contains CrashLoopBackOff"
  cooldown_minutes: 5  # Prevents duplicate agent runs on repeated alerts
tool_scope:
  # Exhaustive list of tools this agent may invoke.
  # Any tool not listed here is inaccessible to the agent.
  - kubectl_get_pod_logs     # read-only
  - kubectl_describe_pod     # read-only
  - kubectl_rollout_restart  # WRITE - requires approval
  - kubectl_get_events       # read-only
  - pagerduty_acknowledge    # write - safe (status only)
  - slack_post_update        # write - safe (comms only)
  - datadog_query_metrics    # read-only
action_boundary:
  # auto_approve: agent executes without asking
  auto_approve:
    - kubectl_get_pod_logs
    - kubectl_describe_pod
    - kubectl_get_events
    - pagerduty_acknowledge
    - slack_post_update
    - datadog_query_metrics
  # require_human: agent proposes, waits for approval
  require_human:
    - kubectl_rollout_restart  # Writes to production
  # never_allow: hard-coded block, LLM cannot override
  never_allow:
    - kubectl_delete_pod
    - kubectl_delete_deployment
context_retrieval:
  # What the agent fetches via RAG before reasoning
  - source: runbook_library
    query: "pod crash-loop remediation kubernetes"
    top_k: 3
  - source: past_incidents
    query: "CrashLoopBackOff auth-service"
    top_k: 5
    max_age_days: 90
execution_plan:
  framework: langgraph
  model: claude-sonnet-4-5
  max_iterations: 8
  timeout_seconds: 300
  # LangGraph graph is generated from this spec at runtime.
  # Nodes: observe, retrieve, reason, propose, approve, execute, verify, report
observability:
  reasoning_trace: true   # Full chain-of-thought saved to audit log
  tool_call_log: true     # Every tool call logged with args and result
  audit_sink: cloudwatch  # cloudwatch | datadog | splunk
  immutable: true         # Agent cannot modify past records
learning_loop:
  enabled: true
  on_resolve: update_runbook_library  # Outcome + actions saved to vector DB
  on_failure: flag_for_review         # Human reviews failed runs
```

The full three-example tutorial with annotated LangGraph Python code is at /writing-your-first-agentic-runbook.
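Stripped of LangGraph specifics, the control flow the execution_plan describes reduces to a bounded loop with an approval gate in the middle. This is a sketch, not the framework's API: every callable (`observe`, `reason`, `request_approval`, `execute`) is a hypothetical stand-in for what would be a graph node in LangGraph.

```python
def run_agent(spec, observe, reason, request_approval, execute,
              max_iterations: int = 8) -> str:
    """Minimal observe -> reason -> approve -> execute loop.

    reason() returns either None (nothing left to do) or a proposed
    action dict carrying a 'gate' field from the action_boundary.
    The iteration cap mirrors execution_plan.max_iterations: an agent
    that cannot converge must stop, not thrash."""
    for _ in range(max_iterations):
        state = observe()                    # pull live signals
        action = reason(state)               # LLM picks the next step
        if action is None:
            return "resolved"
        if action["gate"] == "require_human" and not request_approval(action):
            return "awaiting_human"          # pause, do not proceed
        execute(action)                      # logged tool call
    return "max_iterations_exceeded"
```

The important structural point is that the approval check sits between reasoning and execution, so a `require_human` action can never run on the model's say-so alone.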
Why the term emerged in 2024 to 2026
Three infrastructure shifts converged to make agentic runbooks technically viable in 2024 and commercially available in 2026. Prior to 2024, each of the three existed independently but not in a form that supported reliable production use.
GPT-4 tool use and function calling
OpenAI's function calling API (June 2023) gave LLMs a reliable way to call structured tools. This was the missing piece: an LLM that could not just describe what should happen but actually call a specific function with specific arguments. Claude's tool use followed shortly after. Suddenly, the 'read alert, choose action, call API' pattern was feasible.
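Concretely, function calling works by handing the model a schema for each tool; the model replies with a function name plus structured arguments instead of free text. The sketch below shows the general JSON-schema shape these APIs accept, written as a Python dict; `restart_tool` and its parameters are illustrative, not a specific vendor's required format.

```python
# A tool definition in the JSON-schema shape used by function-calling
# APIs. The description tells the model when to pick the tool; the
# parameters block constrains the arguments it may emit.
restart_tool = {
    "name": "kubectl_rollout_restart",
    "description": "Restart a Kubernetes deployment to clear a crash-loop",
    "parameters": {
        "type": "object",
        "properties": {
            "namespace": {"type": "string"},
            "deployment": {"type": "string"},
        },
        "required": ["namespace", "deployment"],
    },
}
```

Because the model's output is validated against this schema before anything runs, the runtime can reject a malformed or out-of-scope call instead of parsing intent out of prose.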
LangGraph reaches production stability, MCP released
LangGraph (LangChain's graph-based agent orchestration framework) reached v0.2 in late 2024, providing the stateful, cyclical execution graphs that incident-response agents require. In November 2024, Anthropic released the Model Context Protocol (MCP), standardising how agents discover and call tools. AWS Bedrock AgentCore adopted MCP as its integration layer in early 2025.
Vendor adoption: PagerDuty, FireHydrant, Rootly, Kubiya
PagerDuty shipped Gen-AI job authoring and AIOps event correlation. FireHydrant launched AI-assisted runbooks. Rootly shipped AI postmortem and RCA. Kubiya introduced the meta-agent orchestration pattern. Komodor released Klaudia, a Kubernetes-specific agent trained on thousands of production K8s environments. By April 2026, every major incident management vendor had shipped a named AI product.
Who coined the term?
No single entity coined "agentic runbook". The earliest traceable uses are in ilert and PagerDuty blog posts from 2024, describing the pattern they were building. By 2025, the term appeared in BigPanda, FireHydrant, and Rootly content. By early 2026 it was in regular use across the SRE analyst community.
The fact that no vendor owns the term is precisely why it is an opportunity for a neutral reference. Weaveworks owned "GitOps" as a canonical source before the CNCF formalised it. Honeycomb and Charity Majors shaped "observability" as a term before it went mainstream. The phrase "agentic runbook" is at the same stage in April 2026 that "observability" was in 2018.
The full taxonomy: five levels of runbook autonomy
| Level | Name | Description | Example | 2026 status |
|---|---|---|---|---|
| 0 | Traditional runbook | Human reads and executes | Confluence doc | Common |
| 1 | Automated runbook | Script executes on trigger | Rundeck job | Common |
| 2 | AI-assisted runbook | LLM suggests, human executes | FireHydrant AI suggestions | Shipping in 2025-2026 |
| 3 | Agentic runbook | Agent reasons and executes with approval gates | LangGraph + PagerDuty AIOps | Early production in 2026 |
| 4 | Autonomous runbook | Full autonomy, no human gates | Hypothetical | Not production-safe in 2026 |