Emerging category, best practices evolving. Code samples illustrative. Verify security implications before production use. Data verified April 2026.

Agentic Runbook FAQ: 20 Questions Answered (2026)

Every common question about agentic runbooks, answered with specific data. No filler. Each answer has at least one concrete data point or citation.

01. What is an agentic runbook?

An agentic runbook is a runbook executed by an AI agent that reasons over live observability signals, selects actions from a defined tool scope, requests human approval where required, and updates its own reference library based on incident outcomes. The three defining properties are agency (it reasons, not just triggers), memory (it retains context across incidents), and tool scope (it calls real APIs, not just generates text). It differs from an automated runbook (which runs a fixed script on a trigger) and an AI-assisted runbook (where the LLM suggests and a human executes).

02. What is the difference between a runbook and a playbook?

Playbooks are strategic documents covering what to do and why in broad incident scenarios, including communication roles, escalation paths, and stakeholder management. Runbooks are tactical, step-by-step procedures for specific known failure modes. A pod-crash-loop runbook is a runbook; a major-outage response plan is a playbook. Agentic runbooks blur the line by absorbing the decision-making layer from playbooks: the agent determines which runbook applies and executes it, making the two documents increasingly redundant.

03. What is the difference between a runbook, an automated runbook, and an agentic runbook?

A runbook is a human-readable procedure document (Confluence, Notion). An automated runbook is a scripted procedure executed by a tool (Rundeck, Ansible) on a trigger: deterministic, no reasoning step. An agentic runbook is executed by an AI agent that reasons over the current state, selects from a tool scope, and learns from outcomes. Each level adds autonomy and risk. Most organisations in 2026 are at Level 1 (automated) or Level 2 (AI-assisted) and are evaluating Level 3 (agentic) for well-understood, high-frequency incidents.

04. Can AI actually replace on-call engineers?

Not in 2026. Agentic runbooks reduce MTTA and MTTR on known incident types. Shoreline claims 50% auto-remediation rates; Komodor Klaudia reports 95% accuracy on Kubernetes environments. But novel incidents outside training distribution, customer-facing communications, multi-system cascading failures, and destructive data operations still require human judgment. The realistic framing is that AI removes toil from the top 20% of incidents, freeing engineers to focus on the novel and complex.

05. How do you write an agentic runbook?

An agentic runbook requires eight fields: metadata (id, version, owner, risk, approvers), signal_spec (what triggers the agent), tool_scope (what APIs it can call), action_boundary (which actions require human approval, which are auto-approved, which are never allowed), context_retrieval (what past incidents and runbooks the agent retrieves via RAG), execution_plan (LangGraph or AutoGen framework, model, max_iterations), observability (reasoning_trace, tool_call_log, audit_sink), and learning_loop (how outcomes feed back into the runbook library). The writing-your-first-agentic-runbook page has three complete working examples.
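The eight fields above can be sketched as a plain data structure. This is an illustrative skeleton only: the field names come from the answer above, but every value (IDs, tool names, sinks) is a hypothetical placeholder, not a schema any specific vendor mandates.

```python
# Illustrative skeleton of the eight required fields; all values are hypothetical.
runbook = {
    "metadata": {
        "id": "rb-pod-crashloop-001",
        "version": "1.2.0",
        "owner": "platform-team",
        "risk": "medium",
        "approvers": ["oncall-lead"],
    },
    "signal_spec": {"alert": "KubePodCrashLooping", "source": "prometheus"},
    "tool_scope": ["kubectl.get", "kubectl.logs", "kubectl.rollout_restart"],
    "action_boundary": {
        "auto_approved": ["kubectl.get", "kubectl.logs"],   # read-only
        "require_human": ["kubectl.rollout_restart"],        # write, gated
        "never_allowed": ["kubectl.delete"],                 # destructive
    },
    "context_retrieval": {"rag_index": "incident-history", "top_k": 5},
    "execution_plan": {"framework": "langgraph", "model": "claude-sonnet",
                       "max_iterations": 10},
    "observability": {"reasoning_trace": True, "tool_call_log": True,
                      "audit_sink": "s3://audit-bucket"},
    "learning_loop": {"postmortem_feedback": True},
}

# Simple structural check: all eight required fields are present.
REQUIRED = {"metadata", "signal_spec", "tool_scope", "action_boundary",
            "context_retrieval", "execution_plan", "observability", "learning_loop"}
missing = REQUIRED - runbook.keys()
print(sorted(missing))  # an empty list means the skeleton is structurally complete
```

A check like the last three lines is worth running in CI so a runbook cannot reach the agent with a missing action_boundary.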

06. What tools offer agentic runbooks in 2026?

The 2026 vendor landscape includes PagerDuty Runbook Automation + AIOps ($125/user/month), incident.io AI workflows, FireHydrant AI-assisted runbooks, Rootly AI postmortem and RCA, Shoreline Notebooks (120+ pre-built, 75% MTTR claim), Kubiya meta-agent orchestration, xMatters AI Agent (launched November 2025), Komodor Klaudia (Kubernetes-focused, 95% accuracy), Resolve.ai (80% autonomous resolution target), Traversal (38% MTTR reduction at DigitalOcean), Datadog Bits AI, and AWS DevOps Agent (Bedrock AgentCore). See the full comparison at /by-tool-pagerduty-firehydrant-incidentio-rootly.

07. Is PagerDuty's runbook automation really agentic?

Partially. PagerDuty Runbook Automation (formerly Rundeck, $125/user/month) has deterministic execution at its core: event triggers a job, job runs predefined steps. The recent additions of Gen-AI job authoring and the AIOps event-correlation layer push it toward agentic behaviour, but the runbook execution itself remains deterministic in 2026. The AIOps layer is agentic-adjacent (it reasons about alert correlations); the runbook runner is not. This is Level 1.5 in the taxonomy, not a fully agentic Level 3 system.

08. What are the security risks of agentic runbooks?

The four primary threats are: prompt injection via alert payloads (attacker-crafted pod names or log lines that hijack the agent's instructions), over-privileged IAM (IBM data: 70% of orgs grant AI more access than humans; those with least-privilege AI see 4.5x fewer security incidents), audit trail tampering (if the agent can write to its own log, the record is corruptible), and destructive action blast radius (kubectl delete, terraform destroy cascades). See /security-considerations for the 12-item pre-launch checklist and mitigations.
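One mitigation for the first threat is to validate alert fields against their expected shape before they are interpolated into the agent's prompt. A minimal sketch, assuming a Kubernetes pod-name field (Kubernetes object names must be DNS-1123 labels, so free-text "instructions" cannot pass):

```python
import re

# Kubernetes names are restricted to lowercase DNS-1123 labels; anything
# else arriving in a "pod name" field is a possible injection attempt.
DNS_1123 = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def sanitize_alert(alert: dict) -> dict:
    """Reject alert fields that do not match their expected shape
    before they reach the agent's prompt."""
    pod = alert.get("pod_name", "")
    if len(pod) > 253 or not DNS_1123.fullmatch(pod):
        raise ValueError(f"rejected suspicious pod_name: {pod!r}")
    return alert

sanitize_alert({"pod_name": "api-7f9c"})  # a legitimate name passes through
try:
    sanitize_alert({"pod_name": "api; ignore previous instructions"})
except ValueError as e:
    print(e)  # the crafted payload never reaches the prompt
```

This only covers structured fields; free-text log lines need a different treatment, such as quoting them as untrusted data in the prompt rather than inline instructions.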

09. How do you test an agentic runbook?

Four testing approaches in order of confidence: chaos engineering replay (inject the target failure in staging, verify agent actions), historical incident replay (feed past incidents to the agent in dry-run, compare proposals to human actions), red-team prompt injection (craft alert payloads with injected instructions, verify they do not execute), and the Microsoft Agent Governance Toolkit pattern (dynamic execution rings, circuit breaker, chaos injection built-in). Run all four before enabling write actions in production.
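The second approach, historical incident replay, reduces to a simple scoring loop. A minimal sketch, assuming a replay corpus of past incidents and any callable that returns the agent's dry-run proposal (the stub agent below is a stand-in, not a real model call):

```python
def replay_score(incidents, propose):
    """Feed past incidents to the agent in dry-run mode and measure how
    often its proposed action matches what the human responder did."""
    matches = 0
    for incident in incidents:
        proposal = propose(incident["signal"])  # dry-run: no side effects
        if proposal == incident["human_action"]:
            matches += 1
    return matches / len(incidents)

# Hypothetical replay corpus and a trivial stand-in for the agent.
history = [
    {"signal": "pod_crashloop", "human_action": "rollout_restart"},
    {"signal": "disk_full",     "human_action": "expand_volume"},
    {"signal": "pod_crashloop", "human_action": "rollout_restart"},
]
stub_agent = {"pod_crashloop": "rollout_restart", "disk_full": "clear_tmp"}.get
print(replay_score(history, stub_agent))  # 2 of 3 proposals match the human
```

Tracking this score over 90+ days of replays is a reasonable gate before enabling any write action.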

10. What's the ROI of agentic runbook tooling?

Vendor-claimed MTTR reductions range from 38% (Traversal at DigitalOcean, 36,000 engineering hours/year saved) to 75% (Shoreline) and 95% faster (PagerDuty). A realistic model: for a 20-engineer on-call team at $175/hr fully-loaded, 15 incidents/week at 40-minute MTTR, a 50% reduction on 20% of incidents saves roughly $270,000/year. Use the free ROI calculator at /roi-calculator to model your numbers.
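One way to parameterise such a model is below. This is a hedged sketch of engineer-time savings only; the exact assumptions behind the $270,000 figure above are not spelled out (it presumably folds in multi-responder incidents, downtime cost, or both), so treat every parameter, including engineers_per_incident, as a placeholder to replace with your own numbers.

```python
def annual_savings(engineers_per_incident, incidents_per_week, mttr_minutes,
                   automatable_share, mttr_reduction, hourly_rate):
    """Engineer-time savings only; downtime and customer-impact costs
    would need their own terms in a fuller model."""
    automated = incidents_per_week * 52 * automatable_share     # incidents/year
    hours_saved = (automated * (mttr_minutes / 60)
                   * mttr_reduction * engineers_per_incident)
    return hours_saved * hourly_rate

# Figures from the answer above; engineers_per_incident is an assumption.
print(round(annual_savings(
    engineers_per_incident=3, incidents_per_week=15, mttr_minutes=40,
    automatable_share=0.2, mttr_reduction=0.5, hourly_rate=175)))
```

Varying engineers_per_incident and automatable_share dominates the result, which is why vendor claims span such a wide range.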

11. What is MCP and why does it matter?

Model Context Protocol is Anthropic's open standard for agent-to-tool and agent-to-agent communication. It standardises how an agent discovers and invokes capabilities. AWS Bedrock AgentCore wraps Kubernetes, CloudWatch, and CloudTrail APIs as MCP tools. A LangGraph agent consuming these MCP tools can investigate and act across AWS without custom integration code. For runbooks, MCP simplifies integration and enables composable, vendor-neutral architectures. Kubiya adopted MCP in 2025; it is the forward-looking standard for agentic tool integration.
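Concretely, MCP runs over JSON-RPC 2.0: a client discovers tools with tools/list and invokes one with tools/call. The sketch below shows the shape of a tools/call request; the tool name and arguments are hypothetical, not part of any real AWS MCP server's catalogue.

```python
import json

# Shape of an MCP "tools/call" request (JSON-RPC 2.0), as an agent's MCP
# client would send it to a server exposing a hypothetical CloudWatch tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "cloudwatch_get_metric",  # hypothetical tool name
        "arguments": {"metric": "CPUUtilization", "namespace": "AWS/EC2"},
    },
}
print(json.dumps(request, indent=2))
```

Because every MCP server speaks this same envelope, the agent needs one client implementation rather than one integration per vendor API.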

12. Can an agentic runbook work in an air-gapped environment?

Yes, with caveats. Cloud-API LLMs are out unless compliance allows egress. Viable paths: self-hosted models (Llama 3, Mixtral, or fine-tuned smaller models) with a local vector database and on-premise orchestration. Emerging option: domain-specific distilled models fine-tuned on runbook reasoning tasks, running entirely inside your VPC. Latency and capability will be lower than cloud LLMs, but the pattern is architecturally sound. Contact your MSSP or security consultant before choosing a model for an air-gapped environment.

13. What's the difference between AIOps and an agentic runbook?

AIOps is a category covering AI-augmented IT operations: alert correlation, anomaly detection, event management, and noise suppression. An agentic runbook is a specific implementation pattern: an AI agent that executes remediation procedures. AIOps is typically the event-router layer (Layer 2 in the integration stack); agentic runbooks are the agent runtime layer (Layer 3). They are complementary: PagerDuty AIOps does event correlation and routes a clean signal to an agentic runbook that handles remediation.

14. What programming framework should I use to build one?

LangGraph is the recommended framework for most teams in 2026. It provides stateful, cyclical execution graphs ideal for incident-response reasoning (observe, retrieve, reason, propose, execute, verify cycles). AutoGen is better for multi-agent conversation patterns (useful for A2A architectures). Amazon Bedrock AgentCore is the recommended managed option for AWS-centric teams. For open source, the Tracer-Cloud/opensre repository on GitHub provides the reference LangGraph implementation with Kubernetes and PagerDuty integration.
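The observe, retrieve, reason, propose, execute, verify cycle mentioned above can be sketched framework-agnostically; in production a LangGraph StateGraph would manage this loop as nodes and conditional edges, but the control flow is the same. All the stub callables below are placeholders.

```python
def run_cycle(observe, retrieve, reason, propose, execute, verify,
              max_iterations=5):
    """One incident-response loop: keep cycling until the signal clears
    (verify succeeds) or max_iterations is exhausted."""
    for i in range(max_iterations):
        state = observe()                 # read live signals
        context = retrieve(state)         # RAG over past incidents/runbooks
        hypothesis = reason(state, context)
        action = propose(hypothesis)      # candidate remediation
        execute(action)                   # gated by action_boundary in practice
        if verify():
            return f"resolved after {i + 1} iteration(s)"
    return "escalate to human"

# Trivial stand-ins to show the control flow.
health = {"ok": False}
result = run_cycle(
    observe=lambda: "pod_crashloop",
    retrieve=lambda s: ["similar incident #42"],
    reason=lambda s, c: "bad image tag",
    propose=lambda h: "rollout_restart",
    execute=lambda a: health.update(ok=True),
    verify=lambda: health["ok"],
)
print(result)  # resolved after 1 iteration(s)
```

The escalate-to-human fallback is the important part: a bounded loop that hands off cleanly is what separates an agentic runbook from an unbounded autonomous agent.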

15. How long does it take to deploy agentic runbooks in production?

The realistic arc is 6 to 18 months to reach mature production deployment. Months 1-3: runbook inventory and observability fixes. Months 3-6: automated runbooks for the top 20% of incidents. Months 6-9: AI-assisted layer added. Months 9-15: agent deployed with read-only scope. Months 12-18: incremental write-action expansion after 90+ days of accurate recommendations. Teams that try to skip stages typically see high false-positive rates, lose engineer trust, and roll back.

16. What if the agent takes the wrong action?

Three layers of defence: the require_human gate on all write actions (a human approves before the action executes), the circuit breaker (the agent pauses after N unexpected actions), and the kill switch (stops all agents immediately). If a wrong action executes despite these layers, the immutable audit trail provides a complete record of what the agent reasoned and why, enabling post-incident analysis. In practice, wrong actions on read-only operations have no infrastructure impact, and wrong write proposals are caught by human approval.
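The three layers compose naturally in code. A minimal sketch, with the class and method names invented for illustration (no vendor exposes exactly this interface):

```python
class Guardrails:
    """Three defence layers, checked in order: kill switch, circuit
    breaker, then the require_human gate on write actions."""

    def __init__(self, approve, anomaly_limit=3):
        self.approve = approve            # human approval callback
        self.anomalies = 0                # incremented on unexpected actions
        self.anomaly_limit = anomaly_limit
        self.killed = False               # global kill switch

    def execute(self, action, is_write, run):
        if self.killed:
            return "blocked: kill switch engaged"
        if self.anomalies >= self.anomaly_limit:
            return "blocked: circuit breaker open"
        if is_write and not self.approve(action):
            return "blocked: human approval denied"
        return run(action)

# Stand-in approval policy: deny anything containing "delete".
guard = Guardrails(approve=lambda a: "delete" not in a)
print(guard.execute("kubectl get pods", is_write=False, run=lambda a: f"ran {a}"))
print(guard.execute("kubectl delete pod x", is_write=True, run=lambda a: f"ran {a}"))
guard.killed = True
print(guard.execute("kubectl get pods", is_write=False, run=lambda a: f"ran {a}"))
```

Checking the kill switch first matters: it must override everything, including pending approvals.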

17. Do agentic runbooks help with compliance (SOC 2, HIPAA, PCI)?

They can, if implemented correctly. Agentic runbooks provide a complete, timestamped audit trail of every action and reasoning step, which is stronger evidence than a Slack thread. For SOC 2 Type II, the key controls are: agents cannot access production without authenticated credentials (IAM Identity Center), all write actions require human approval (action_boundary), all actions are logged to an immutable sink, and there is a kill switch. Deterministic wrappers (Kubiya pattern) are required for HIPAA and PCI where probabilistic LLM behaviour cannot be tolerated.
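One property auditors care about is that the trail is tamper-evident. A minimal sketch of a hash-chained audit log, where each entry commits to its predecessor; this illustrates the principle only, and a real deployment would additionally ship entries to write-once storage the agent has no credentials to modify:

```python
import hashlib
import json
import time

class AuditChain:
    """Append-only audit log where each entry hashes its predecessor,
    so any after-the-fact edit breaks verification."""

    def __init__(self):
        self.entries = []

    def append(self, action, reasoning):
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        record = {"ts": time.time(), "action": action,
                  "reasoning": reasoning, "prev": prev}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append(record)

    def verify(self):
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True

chain = AuditChain()
chain.append("kubectl rollout restart deploy/api",
             "crashloop traced to bad image tag")
print(chain.verify())   # intact chain verifies
chain.entries[0]["action"] = "tampered"
print(chain.verify())   # any edit breaks the chain
```

Pairing a structure like this with an external immutable sink covers both the "logged" and "tamper-evident" halves of the control.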

18. How much does agentic runbook tooling cost?

PagerDuty Runbook Automation is the only vendor with a published price: $125/user/month. All other purpose-built vendors (Rootly, Shoreline, Kubiya, Komodor, incident.io, FireHydrant, Traversal, Resolve.ai) are custom-quoted enterprise contracts. Build-your-own with LangGraph and a Bedrock LLM API costs roughly $1,000-2,000/month for a 1,000-service organisation. The free ROI calculator at /roi-calculator models savings against cost.

19. Is this just a rebranding of runbook automation?

The short answer is no, but with nuance. Traditional runbook automation (Rundeck, Ansible) is deterministic: a trigger fires a pre-scripted sequence of steps. Agentic runbooks add reasoning, memory, and tool selection. The distinction is not marketing; it represents a genuine capability increase that also introduces new risks (prompt injection, non-determinism, blast radius). That said, some vendors are rebranding deterministic automation as 'agentic'. The taxonomy page at /traditional-vs-agentic provides the criteria to evaluate any vendor's claim.

20. Will this still be relevant in five years?

The vocabulary may shift (as 'DevOps' became 'platform engineering'), but the underlying pattern is durable: AI-mediated operational reasoning sitting between observability signals and infrastructure action planes. The specific term 'agentic runbook' may not survive, but the category will. The sites and practitioners who define the vocabulary now carry that authority forward. The more interesting question is whether autonomous runbooks (Level 4: no human approval gates) become production-safe within five years. Based on current trajectory, selective autonomy for the lowest-risk actions seems likely by 2028.