What Is an Agentic Runbook? A Precise Definition for 2026
An agentic runbook is a runbook executed by an AI agent that reasons over live observability signals, selects actions from a defined tool scope, requests human approval where required, and updates its own reference library based on incident outcomes.
The formal definition: three required properties
Every agentic runbook has three properties that distinguish it from the terms it is frequently confused with. A runbook that lacks any of these three is something else: an automated runbook, an AI-assisted runbook, or a chatbot. The distinction matters because the engineering tradeoffs, the security surface, and the required infrastructure differ substantially between the categories.
Agency
The system plans before it acts. It does not simply receive a trigger and execute a fixed script. It reads the current state of the system, retrieves relevant context (past incidents, runbook library, dependency topology), reasons about the options, and selects an action sequence. This is the most important distinguishing property.
Memory
The agent retains context across incidents. When it encounters a pod crash-loop, it can retrieve the last three incidents of the same type, what actions were taken, and whether they succeeded. This retrieval-augmented approach means the agent gets better with exposure, not just better with model updates.
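The retrieval step can be sketched in a few lines. This is an illustrative stand-in, not a real vector store: a keyword match substitutes for embedding similarity, and the names `IncidentRecord` and `retrieve_similar` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentRecord:
    title: str
    actions: list[str]      # what was done
    resolved: bool          # did it work
    occurred: datetime

def retrieve_similar(store: list[IncidentRecord], keyword: str,
                     top_k: int = 3, max_age_days: int = 90) -> list[IncidentRecord]:
    """Return the most recent matching incidents within the age window.
    A production system would rank by vector similarity instead of
    substring match, but the shape of the query is the same."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    hits = [r for r in store if keyword in r.title and r.occurred >= cutoff]
    return sorted(hits, key=lambda r: r.occurred, reverse=True)[:top_k]
```

The `max_age_days` filter matters: a remediation that worked against last year's deployment topology may be actively misleading today.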
Tool scope
The agent calls real APIs: kubectl, the PagerDuty API, Slack, CloudWatch, Terraform state. It does not just generate text about what should happen. The tool scope is explicitly defined in the runbook's action_boundary field, which also specifies which actions require human approval.
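Enforcement of the action boundary should live outside the model, in plain code the LLM cannot rewrite. A minimal deny-by-default gate might look like this sketch (`classify_action` is an illustrative name; the tool names are taken from the crash-loop example later in this section):

```python
# Allow-lists are loaded from the runbook's action_boundary field,
# not generated by the model.
AUTO_APPROVE = {"kubectl_get_pod_logs", "kubectl_describe_pod",
                "kubectl_get_events", "pagerduty_acknowledge",
                "slack_post_update", "datadog_query_metrics"}
REQUIRE_HUMAN = {"kubectl_rollout_restart"}

def classify_action(tool_name: str) -> str:
    """Gate a proposed tool call. Anything not explicitly listed is
    blocked: deny-by-default means a hallucinated tool name fails
    closed rather than open."""
    if tool_name in AUTO_APPROVE:
        return "auto_approve"
    if tool_name in REQUIRE_HUMAN:
        return "require_human"
    return "never_allow"
```

The design point is that `never_allow` is the fall-through case, so the safety property holds even for tools nobody anticipated.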
What an agentic runbook is not
The term is used loosely by vendors, analysts, and engineers. Here is a precise taxonomy to cut through the noise.
A traditional runbook
Not agentic. A static, human-readable procedure document, typically in Confluence, Notion, or a PDF. A human reads the alert, opens the runbook, and follows the steps. The document cannot observe the system or take action independently. It is a reference artefact, not an executing system.
An automated runbook
Not agentic. A scripted procedure executed by a tool on a trigger. Rundeck, Ansible, and Terraform are the canonical examples. A webhook fires, a pre-defined job runs, outputs are logged. There is no reasoning step. The automation cannot handle situations outside its script. It is deterministic and replay-safe. Also called 'runbook automation' in vendor documentation.
An AI-assisted runbook
Not agentic. A hybrid where an LLM suggests the next steps and a human executes. The LLM reads the alert and the runbook, produces a recommendation ('restart the pod, then check logs'), and a human carries it out. Copilot-style incident response. The AI is advisory; the human is the action plane. incident.io's early AI features and FireHydrant's runbook suggestions fall here.
A chatbot
Not agentic. A conversational interface that answers questions about incidents or runbooks. It has no action plane. It cannot call kubectl. It cannot acknowledge a PagerDuty alert. It generates text, not actions.
An autonomous runbook (hypothetical)
Not yet real in production. A runbook agent with no human approval gates, full write access, and self-updating logic. This is the marketing end-state described in vendor pitch decks. It does not exist safely in production for most organisations in 2026. Full autonomy without human approval is the last mile, not the current state.
A concrete example: pod crash-loop remediation
This is a minimal working agentic runbook for Kubernetes pod crash-loops. It is not hypothetical: this pattern runs in production at organisations using LangGraph with Claude Sonnet or GPT-4o as the reasoning model. The comments explain what each section does.
```yaml
# agentic-runbook: pod-crashloop-remediation
# This is a working LangGraph-compatible runbook specification.
# The agent reads this file, executes the signal_spec to detect the trigger,
# retrieves context via the context_retrieval spec, and then builds a
# LangGraph execution graph from the execution_plan.
metadata:
  id: k8s-crashloop-v2
  version: "2.1"
  owner: platform-eng
  slack_channel: "#incidents-platform"
  pagerduty_service: auth-service
  risk: medium  # low | medium | high | critical
  approvers:
    - on-call-lead  # Slack handle, required for require_human actions
  last_verified: "2026-04-01"
  changelog:
    - "2026-04-01: Added OOMKilled sub-case, updated approver list"
    - "2025-11-15: Initial version"
signal_spec:
  # What triggers the agent. Must match before any execution starts.
  trigger: pagerduty_alert
  condition: "alert.title contains CrashLoopBackOff"
  cooldown_minutes: 5  # Prevents duplicate agent runs on repeated alerts
tool_scope:
  # Exhaustive list of tools this agent may invoke.
  # Any tool not listed here is inaccessible to the agent.
  - kubectl_get_pod_logs     # read-only
  - kubectl_describe_pod     # read-only
  - kubectl_rollout_restart  # WRITE - requires approval
  - kubectl_get_events       # read-only
  - pagerduty_acknowledge    # write - safe (status only)
  - slack_post_update        # write - safe (comms only)
  - datadog_query_metrics    # read-only
action_boundary:
  # auto_approve: agent executes without asking
  auto_approve:
    - kubectl_get_pod_logs
    - kubectl_describe_pod
    - kubectl_get_events
    - pagerduty_acknowledge
    - slack_post_update
    - datadog_query_metrics
  # require_human: agent proposes, waits for approval
  require_human:
    - kubectl_rollout_restart  # Writes to production
  # never_allow: hard-coded block, LLM cannot override
  never_allow:
    - kubectl_delete_pod
    - kubectl_delete_deployment
context_retrieval:
  # What the agent fetches via RAG before reasoning
  - source: runbook_library
    query: "pod crash-loop remediation kubernetes"
    top_k: 3
  - source: past_incidents
    query: "CrashLoopBackOff auth-service"
    top_k: 5
    max_age_days: 90
execution_plan:
  framework: langgraph
  model: claude-sonnet-4-5
  max_iterations: 8
  timeout_seconds: 300
  # LangGraph graph is generated from this spec at runtime.
  # Nodes: observe, retrieve, reason, propose, approve, execute, verify, report
observability:
  reasoning_trace: true   # Full chain-of-thought saved to audit log
  tool_call_log: true     # Every tool call logged with args and result
  audit_sink: cloudwatch  # cloudwatch | datadog | splunk
  immutable: true         # Agent cannot modify past records
learning_loop:
  enabled: true
  on_resolve: update_runbook_library  # Outcome + actions saved to vector DB
  on_failure: flag_for_review         # Human reviews failed runs
```

The full three-example tutorial with annotated LangGraph Python code is at /writing-your-first-agentic-runbook.
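Stripped of LangGraph specifics, the control flow the execution_plan describes reduces to a bounded loop with an approval gate in the middle. This is a sketch, not the framework's API: every callable (`observe`, `reason`, `request_approval`, `execute`) is a hypothetical stand-in for what would be a graph node in LangGraph.

```python
def run_agent(spec, observe, reason, request_approval, execute,
              max_iterations: int = 8) -> str:
    """Minimal observe -> reason -> approve -> execute loop.

    reason() returns either None (nothing left to do) or a proposed
    action dict carrying a 'gate' field from the action_boundary.
    The iteration cap mirrors execution_plan.max_iterations: an agent
    that cannot converge must stop, not thrash."""
    for _ in range(max_iterations):
        state = observe()                    # pull live signals
        action = reason(state)               # LLM picks the next step
        if action is None:
            return "resolved"
        if action["gate"] == "require_human" and not request_approval(action):
            return "awaiting_human"          # pause, do not proceed
        execute(action)                      # logged tool call
    return "max_iterations_exceeded"
```

The important structural point is that the approval check sits between reasoning and execution, so a `require_human` action can never run on the model's say-so alone.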
Why the term emerged in 2024 to 2026
Three infrastructure shifts converged to make agentic runbooks technically viable in 2024 and commercially available in 2026. Prior to 2024, each of the three existed independently but not in a form that supported reliable production use.
GPT-4 tool use and function calling
OpenAI's function calling API (June 2023) gave LLMs a reliable way to call structured tools. This was the missing piece: an LLM that could not just describe what should happen but actually call a specific function with specific arguments. Claude's tool use followed shortly after. Suddenly, the 'read alert, choose action, call API' pattern was feasible.
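Concretely, function calling works by handing the model a schema for each tool; the model replies with a function name plus structured arguments instead of free text. The sketch below shows the general JSON-schema shape these APIs accept, written as a Python dict; `restart_tool` and its parameters are illustrative, not a specific vendor's required format.

```python
# A tool definition in the JSON-schema shape used by function-calling
# APIs. The description tells the model when to pick the tool; the
# parameters block constrains the arguments it may emit.
restart_tool = {
    "name": "kubectl_rollout_restart",
    "description": "Restart a Kubernetes deployment to clear a crash-loop",
    "parameters": {
        "type": "object",
        "properties": {
            "namespace": {"type": "string"},
            "deployment": {"type": "string"},
        },
        "required": ["namespace", "deployment"],
    },
}
```

Because the model's output is validated against this schema before anything runs, the runtime can reject a malformed or out-of-scope call instead of parsing intent out of prose.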
LangGraph reaches production stability, MCP released
LangGraph (LangChain's graph-based agent orchestration framework) reached v0.2 in late 2024, providing the stateful, cyclical execution graphs that incident-response agents require. In November 2024, Anthropic released the Model Context Protocol (MCP), standardising how agents discover and call tools. AWS Bedrock AgentCore adopted MCP as its integration layer in early 2025.
Vendor adoption: PagerDuty, FireHydrant, Rootly, Kubiya
PagerDuty shipped Gen-AI job authoring and AIOps event correlation. FireHydrant launched AI-assisted runbooks. Rootly shipped AI postmortem and RCA. Kubiya introduced the meta-agent orchestration pattern. Komodor released Klaudia, a Kubernetes-specific agent trained on thousands of production K8s environments. By April 2026, every major incident management vendor had shipped a named AI product.
Who coined the term?
No single entity coined "agentic runbook". The earliest traceable uses are in ilert and PagerDuty blog posts from 2024, describing the pattern they were building. By 2025, the term appeared in BigPanda, FireHydrant, and Rootly content. By early 2026 it was in regular use across the SRE analyst community.
The fact that no vendor owns the term is precisely why it is an opportunity for a neutral reference. Weaveworks owned "GitOps" as a canonical source before the CNCF formalised it. Honeycomb and Charity Majors shaped "observability" as a term before it went mainstream. The phrase "agentic runbook" is at the same stage in April 2026 that "observability" was in 2018.
The full taxonomy: five levels of runbook autonomy
| Level | Name | Description | Example | 2026 status |
|---|---|---|---|---|
| 0 | Traditional runbook | Human reads and executes | Confluence doc | Common |
| 1 | Automated runbook | Script executes on trigger | Rundeck job | Common |
| 2 | AI-assisted runbook | LLM suggests, human executes | FireHydrant AI suggestions | Shipping in 2025-2026 |
| 3 | Agentic runbook | Agent reasons and executes with approval gates | LangGraph + PagerDuty AIOps | Early production in 2026 |
| 4 | Autonomous runbook | Full autonomy, no human gates | Hypothetical | Not production-safe in 2026 |