SYS / OPERATIONAL
P1 / CRITICALP2 / HIGHP3 / MEDP4 / LOW

Agentic runbooks: what they are, how to write them, and who is shipping them in 2026.

The independent reference for the phrase the industry is still learning to spell. Vendor-neutral. Real YAML and LangGraph code in the first scroll. A forensic 12-vendor matrix. A free MTTR ROI calculator.

15 reference pages/40-term glossary/3 working examples/last_verified 2026-04
incident_live
PD-A1923 / svc:auth-service
03:14:02PAGEPD-A1923 / CrashLoopBackOff on auth-service-7d9f
03:14:04OBSERVEagent reads alert, queries deployment topology, fetches last 5 incidents (RAG)
03:14:06RETRIEVEloaded runbook k8s-crashloop-v2 (vector store, similarity 0.94)
03:14:09DIAGNOSEOOMKilled at 09:14:22, mem_limit 512Mi, working_set 489Mi -> recommend rollout restart
03:14:14PROPOSEkubectl rollout restart deploy/auth-service (write action -> requires approver)
03:14:23APPROVEsarah.kim ack via Slack /approve PD-A1923
03:14:25EXECUTEtool_call kubectl_rollout_restart -> success, new revision 47
03:14:31VERIFYpod ready, error_rate 0%, request_latency p99 142ms
03:14:33RESOLVEincident closed, MTTR 31s, postmortem draft queued
MTTA
2s
MTTR
31s
HUMAN
1 ack

anatomy_of_an_incident / observe -> diagnose -> remediate -> postmortem

DEFN.001the 60-second definition

A runbook tells humans what to do. An agentic runbook tells an AI agent how to think.

A runbook is a set of instructions for handling an incident. An automated runbook executes those instructions on a trigger, via tools like Rundeck or Ansible. An agentic runbook goes further: the execution is handled by an AI agent that reads signals, reasons about what to do, calls real tools, and learns from the outcome. The agent is not following a fixed script. It is applying judgment.

// 01

Reasons over signals

Not just 'alert fired, run script'. The agent observes CPU spikes, log patterns, dependency health, and recent deploys before choosing an action.

// 02

Chooses from a tool scope

The agent has a defined inventory of actions it can take. It selects the right tool for the situation, not the next step in a fixed list.

// 03

Learns from outcomes

After resolution, the agent updates its runbook library via a learning loop. The next similar incident takes less time.

CODE.002a real agentic runbook in 30 lines

Stop reading vendor diagrams. Read the YAML.

Below is a minimal LangGraph-compatible runbook for Kubernetes pod crash-loop remediation. Comments explain each section. Three full annotated examples at /writing-your-first-agentic-runbook.

# agentic-runbook: pod-crashloop-remediation
metadata:
  id: k8s-crashloop-v2
  owner: platform-eng
  risk: medium
  approvers: [on-call-lead]
  last_verified: 2026-04-01

signal_spec:
  trigger: pagerduty_alert
  condition: "alert.title contains CrashLoopBackOff"
  cooldown_minutes: 5

tool_scope:
  - kubectl_get_pod_logs
  - kubectl_describe_pod
  - kubectl_rollout_restart   # writes to prod
  - pagerduty_acknowledge
  - slack_post_update

action_boundary:
  auto_approve:
    - kubectl_get_pod_logs
    - kubectl_describe_pod
    - slack_post_update
  require_human:
    - kubectl_rollout_restart

execution_plan:
  framework: langgraph
  model: claude-sonnet-4-5
  max_iterations: 8
annotation
  • metadata · identity, ownership, risk class, who can approve write actions
  • signal_spec · the precise condition that triggers agent execution
  • tool_scope · the closed inventory of tools the agent may call. anything outside this list is unreachable
  • action_boundary · auto-approve for read-only and comms; require_human for any write to production
  • execution_plan · LangGraph orchestrator, Sonnet 4.5 reasoning, 8-step bound
full_tutorial →
TAXO.003traditional vs ai-assisted vs agentic

The taxonomy is muddled. Here is the clean version.

Vendor blogs conflate four distinct categories. Full comparison at /traditional-vs-agentic.

DimensionTraditionalAutomatedAgentic
FormatConfluence doc / PDFScript / Ansible playbookYAML + LangGraph / AutoGen
TriggerHuman reads alertWebhook / cronObservability signal + LLM reasoning
ExecutionHuman follows stepsDeterministic scriptAgent chooses actions from scope
AdaptabilityNoneLow (pre-scripted paths)High (reasons about novel situations)
LearningPostmortem updates docNoneOutcome feeds learning loop
Audit trailSlack thread + notesScript logFull reasoning trace + tool calls
Typical toolConfluence, NotionRundeck, AnsiblePagerDuty AIOps, Rootly, Kubiya
VEND.004who is shipping this in 2026

The 2026 vendor roster, reduced to one tag and one price.

full_matrix →

PagerDuty

P2

Runbook Automation + AIOps

Per-seat

incident.io

P4

AI workflows, Slack-native

Custom

FireHydrant

P4

AI-assisted runbooks

Custom

Rootly

P4

AI postmortem + RCA

Custom

Shoreline

P3

Notebooks, 75% MTTR claim

Custom

Kubiya

P4

Meta-agent orchestration

Custom

Komodor Klaudia

P3

K8s, 95% accuracy

Custom

AWS DevOps Agent

P4

Bedrock AgentCore + MCP

Usage-based

Pricing from vendor public pages, April 2026. Verify before procurement. See all 12 vendors including Traversal, Resolve.ai, Datadog Bits AI, xMatters, and OpenSRE.

RUN.005production-grade use cases

What agents are actually doing on call in 2026.

all_12_cases →
K8s// shipping

Pod crash-loop remediation

Agent detects CrashLoopBackOff, reads logs, proposes restart, gets approval. 23-second MTTR.

CI/CD// shipping

Deployment rollback

Error-rate spike triggers agent to diff recent deploys, propose rollback to last stable version.

Sched// shipping

Certificate expiry rotation

Proactive agent runs nightly, detects certs expiring in 14 days, initiates rotation workflow.

FinOps// shipping

Cost anomaly scale-down

Cloud cost spike triggers agent to find over-provisioned resources and propose scale-down.

Sec// shipping

Auth spike response

Login volume 10x normal: agent classifies campaign vs DDoS, routes to appropriate runbook.

AIOps// shipping

Noise suppression

PagerDuty AIOps agent correlates 400 alerts into 3 actionable incidents. 91% reduction claimed.

P1 / SECURITYdo_not_skip

You just gave an LLM kubectl write. Here is the threat model.

Vendor pitches will not cover these. The pre-launch checklist at /security-considerations has 12 items. Read them before production rollout.

Prompt injection via alert payloads

An attacker crafts a pod name or service response that hijacks the agent's instructions mid-execution.

Over-privileged IAM

IBM research: 70% of orgs grant AI more access than equivalent humans. Those orgs see 4.5x more security incidents.

Destructive action blast radius

kubectl delete, terraform destroy, and misconfigured rollbacks can cascade. Circuit breakers are not optional.

ROI.006mttr savings model

What is your MTTR savings worth?

Vendor MTTR reduction claims range from 38% to 95%. The free ROI calculator lets SRE leads model their own numbers, no email required.

open_roi_calc →
Traversal at DigitalOcean
36,000
hrs/yr saved
38% MTTR reduction
Shoreline claim
75%
MTTR cut
50% auto-remediation
PagerDuty claim
95%
faster
via Runbook Automation + AIOps
FAQ.007common questions

What people ask before they adopt.

Q.01What is an agentic runbook?+

An agentic runbook is a runbook executed by an AI agent that reasons over live signals, chooses actions from a defined tool scope, and learns from outcomes. The three defining properties are agency, memory, and tool scope. Unlike a scripted automated runbook, the agent is not following a fixed execution path. It applies judgment to the current state of the system.

Q.02Is PagerDuty's runbook automation really agentic?+

Honest answer: partially. PagerDuty Runbook Automation (formerly Rundeck) has deterministic execution at its core: event triggers a job, job runs predefined steps. The recent additions of Gen-AI job authoring and the AIOps event-correlation layer push it toward agentic behaviour, but the runbook execution itself remains deterministic. The AIOps layer is agentic-adjacent; the runbook runner is not.

Q.03How do you write an agentic runbook?+

An agentic runbook needs eight fields: metadata (id, version, owner, risk, approvers), signal_spec (what triggers the agent), tool_scope (what APIs it can call), action_boundary (which actions require human approval), context_retrieval (what past incidents and docs the agent pulls via RAG), execution_plan (the LangGraph or AutoGen graph), observability (logs and reasoning dump), and a learning_loop (how outcomes feed back). The /writing-your-first-agentic-runbook page has three full working examples in YAML and LangGraph Python.

Q.04What is MCP and why does it matter for runbooks?+

Model Context Protocol is Anthropic's open standard for agent-to-tool and agent-to-agent communication. It standardises how an agent discovers and invokes capabilities. AWS Bedrock AgentCore wraps Kubernetes, logs, and metrics APIs as MCP tools, meaning a LangGraph or AutoGen agent can call kubectl, CloudWatch, and PagerDuty through a single interface. For runbooks, MCP simplifies integration and enables composable, vendor-neutral agent architectures.

Q.05Can an agentic runbook work in an air-gapped environment?+

Yes, with caveats. Cloud-API LLMs like Claude Sonnet or GPT-4o are out unless your compliance allows egress. The viable paths are self-hosted models (Llama 3, Mixtral, or fine-tuned smaller models) with a local vector database and on-premise orchestration. Emerging option: domain-specific distilled models fine-tuned on runbook reasoning tasks, running entirely inside your VPC. Latency and capability will be lower than cloud LLMs, but the pattern is architecturally sound.

Q.06Will agentic runbooks still be relevant in five years?+

The vocabulary may shift (as 'DevOps' became 'platform engineering'), but the underlying pattern is durable: AI-mediated operational reasoning sitting between observability signals and infrastructure action planes. The specific term 'agentic runbook' may not survive, but the category it describes will. The sites and practitioners who define the vocabulary now will carry that authority forward, regardless of what the term evolves into.

Updated 2026-04-28