How to Write Your First Agentic Runbook: A Working Tutorial (2026)
Three working examples with real YAML and LangGraph Python. Annotated line by line. Vendor-neutral. References real tools and real code patterns from AWS, LangChain, and the Kubernetes SRE community.
Prerequisites
Before writing an agentic runbook, your team needs three infrastructure layers already in place. Skipping any of them produces an agent with poor signal quality, insufficient action scope, or inadequate reasoning capability.
Observability layer
- +Structured alerts (not just email)
- +Log aggregation accessible by API
- +Metrics with named labels
- +Deployment event stream
e.g. Datadog, Prometheus + Alertmanager, Grafana
Incident management
- +Alert routing to on-call
- +Incident lifecycle tracking
- +Slack integration
- +Webhook-triggerable API
e.g. PagerDuty, incident.io, FireHydrant, Rootly
LLM API with tool use
- +Tool-use / function calling
- +Structured output mode
- +Sufficient context window
- +Acceptable latency
e.g. Claude Sonnet 4.5, GPT-4o, AWS Bedrock Anthropic Claude
The eight required fields
Every agentic runbook has eight required fields. All eight must be present. An agent without a defined tool_scope becomes unpredictably powerful. An agent without an action_boundary has no human oversight layer.
metadataid, version, owner, slack_channel, pagerduty_service, risk level, approvers list, last_verified date, and changelog. Treat this like a software package version.
signal_specWhat triggers the agent. Must include: trigger type (pagerduty_alert, prometheus_alert, scheduled), the condition string, and a cooldown_minutes to prevent duplicate runs.
tool_scopeAn exhaustive list of every tool (API) this agent may invoke. Any tool not listed is inaccessible. This is your primary security boundary.
action_boundaryThree categories: auto_approve (agent executes without asking), require_human (agent proposes, waits), never_allow (hard-coded block the LLM cannot override). The never_allow list is your blast-radius guardrail.
context_retrievalWhat the agent fetches via RAG before reasoning: runbook library queries, past incident patterns, dependency topology. This is what gives the agent institutional memory.
execution_planThe reasoning framework: LangGraph or AutoGen. Model choice, max_iterations (prevents infinite loops), timeout_seconds. LangGraph is recommended for stateful, cyclical reasoning; AutoGen for multi-agent conversations.
observabilityreasoning_trace: true (saves the full chain-of-thought), tool_call_log: true (every API call with args and response), audit_sink pointing to an immutable destination. Required for compliance and debugging.
learning_loopon_resolve: what happens when the incident is closed (update runbook library, mark action as successful). on_failure: what happens when the agent fails (flag for human review, never silently drop).
Example 1: Kubernetes pod crash-loop (annotated YAML)
The most common agentic runbook use case. Pod enters CrashLoopBackOff; agent retrieves logs, identifies likely cause, proposes and executes restart after approval. Annotated line-by-line.
# EXAMPLE 1: Pod CrashLoopBackOff remediation
# Minimum viable agentic runbook.
# Copy this and replace values for your environment.
metadata:
id: k8s-crashloop-v2 # Unique ID, referenced in audit logs
version: "2.1"
owner: platform-eng # Slack user or team handle
slack_channel: "#incidents-platform"
pagerduty_service: auth-service
risk: medium # low | medium | high | critical
approvers: # Must match Slack handles
- on-call-lead
last_verified: "2026-04-01"
signal_spec:
trigger: pagerduty_alert # pagerduty_alert | prometheus_alert | scheduled
condition: "alert.title contains CrashLoopBackOff"
cooldown_minutes: 5 # Prevents duplicate agent runs on repeated alerts
# If using Prometheus instead:
# trigger: prometheus_alert
# condition: 'kube_pod_container_status_restarts_total > 5'
tool_scope:
# Everything the agent MAY call.
# Any tool not listed here is inaccessible.
- kubectl_get_pod_logs # Read-only: fetch recent log lines
- kubectl_describe_pod # Read-only: describe pod state
- kubectl_get_events # Read-only: cluster events
- kubectl_rollout_restart # WRITE: restarts deployment (needs approval)
- pagerduty_acknowledge # Write (status only, safe)
- slack_post_update # Write (comms only, safe)
- datadog_query_metrics # Read-only: CPU/memory query
action_boundary:
auto_approve: # Agent executes without human confirmation
- kubectl_get_pod_logs
- kubectl_describe_pod
- kubectl_get_events
- pagerduty_acknowledge
- slack_post_update
- datadog_query_metrics
require_human: # Agent proposes; human approves via Slack
- kubectl_rollout_restart
never_allow: # Hard-coded block; LLM cannot override
- kubectl_delete_pod
- kubectl_delete_deployment
context_retrieval:
- source: runbook_library # Vector DB of your runbooks
query: "CrashLoopBackOff kubernetes pod remediation"
top_k: 3
- source: past_incidents # Vector DB of resolved incidents
query: "CrashLoopBackOff {pod_name}" # {pod_name} is injected from alert
top_k: 5
max_age_days: 90
execution_plan:
framework: langgraph
model: claude-sonnet-4-5
max_iterations: 8 # Hard limit; prevents infinite loops
timeout_seconds: 300 # 5 minutes; page human if exceeded
observability:
reasoning_trace: true # Full chain-of-thought saved
tool_call_log: true # Every API call logged with args + response
audit_sink: cloudwatch # cloudwatch | datadog | splunk
immutable: true # Agent cannot modify past records
learning_loop:
enabled: true
on_resolve: update_runbook_library # Save outcome to vector DB
on_failure: flag_for_review # Human reviews failed runsLangGraph Python: the execution layer
The YAML spec above is the configuration. LangGraph reads it and builds an execution graph. Here is the Python that instantiates the graph and runs it on an incoming alert. This is based on the pattern from the AWS Bedrock AgentCore + LangGraph SRE reference implementation.
# langgraph_runbook_agent.py
# Python 3.12 | LangGraph 0.2.x | Claude Sonnet 4.5
from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from langchain.tools import tool
from typing import TypedDict, Annotated
import yaml, subprocess, json
# Load the runbook spec
with open("runbooks/k8s-crashloop-v2.yaml") as f:
runbook = yaml.safe_load(f)
# Tool definitions (real kubectl wrappers)
@tool
def kubectl_get_pod_logs(namespace: str, pod_name: str, lines: int = 50) -> str:
"""Fetch recent logs from a pod. Read-only."""
result = subprocess.run(
["kubectl", "logs", pod_name, "-n", namespace, f"--tail={lines}"],
capture_output=True, text=True
)
return result.stdout
@tool
def kubectl_describe_pod(namespace: str, pod_name: str) -> str:
"""Describe pod state, events, and conditions. Read-only."""
result = subprocess.run(
["kubectl", "describe", "pod", pod_name, "-n", namespace],
capture_output=True, text=True
)
return result.stdout
@tool
def slack_post_update(channel: str, message: str) -> str:
"""Post an update to Slack. Comms only."""
# In production: call Slack API
return f"Posted to {channel}: {message}"
# Agent state
class AgentState(TypedDict):
alert: dict # Incoming alert payload
context: str # Retrieved runbook + past incidents
observations: list[str] # Tool call results
reasoning: str # Agent's current reasoning
proposed_action: str # Action requiring human approval
human_approved: bool # Approval flag
resolved: bool # Resolution flag
# Build the LangGraph graph
model = ChatAnthropic(model="claude-sonnet-4-5")
def observe(state: AgentState) -> AgentState:
"""Gather initial observations from the alert."""
alert = state["alert"]
logs = kubectl_get_pod_logs.invoke({
"namespace": alert["namespace"],
"pod_name": alert["pod_name"]
})
desc = kubectl_describe_pod.invoke({
"namespace": alert["namespace"],
"pod_name": alert["pod_name"]
})
state["observations"] = [logs, desc]
return state
def reason(state: AgentState) -> AgentState:
"""Apply the LLM to observations + context to propose an action."""
prompt = f"""
You are an SRE agent handling a CrashLoopBackOff incident.
RUNBOOK: {state['context']}
OBSERVATIONS: {json.dumps(state['observations'])}
ALERT: {json.dumps(state['alert'])}
Based on the observations, diagnose the cause and propose a remediation action.
Choose ONLY from the allowed tool scope: {runbook['tool_scope']}
Actions requiring human approval: {runbook['action_boundary']['require_human']}
Respond with JSON: {{"diagnosis": "...", "proposed_action": "...", "confidence": 0.0-1.0}}
"""
response = model.invoke(prompt)
parsed = json.loads(response.content)
state["reasoning"] = parsed["diagnosis"]
state["proposed_action"] = parsed["proposed_action"]
return state
def await_approval(state: AgentState) -> AgentState:
"""Send Slack message and wait for human approval."""
slack_post_update.invoke({
"channel": runbook["metadata"]["slack_channel"],
"message": f"""
Agent proposes: {state['proposed_action']}
Diagnosis: {state['reasoning']}
Approve? Reply 'approve k8s-crashloop-v2' in this thread.
@{runbook['metadata']['approvers'][0]}
"""
})
# In production: poll for Slack thread reply with timeout
# For this example, simulated:
state["human_approved"] = True # Replace with real Slack approval check
return state
graph = StateGraph(AgentState)
graph.add_node("observe", observe)
graph.add_node("reason", reason)
graph.add_node("await_approval", await_approval)
graph.add_edge("observe", "reason")
graph.add_edge("reason", "await_approval")
graph.add_edge("await_approval", END)
agent = graph.compile()
# Entry point (called by PagerDuty webhook handler)
def handle_alert(alert_payload: dict):
result = agent.invoke({
"alert": alert_payload,
"context": "", # populated by context_retrieval layer
"observations": [],
"reasoning": "",
"proposed_action": "",
"human_approved": False,
"resolved": False,
})
return resultExample 2: Deployment rollback on error-rate spike (abbreviated)
The pattern is identical to Example 1 but the trigger, tool scope, and action boundary differ. Key differences shown only.
metadata:
id: k8s-deployment-rollback-v1
risk: high # Rollback affects all traffic to the service
approvers:
- on-call-lead
- service-owner # Two approvers for high-risk actions
signal_spec:
trigger: prometheus_alert
condition: 'http_requests_total{status=~"5.."} / http_requests_total > 0.02'
# Error rate > 2% for 3 minutes (handled by alerting rule duration)
cooldown_minutes: 10
tool_scope:
- kubectl_get_deployment_history # Read recent rollout history
- git_log_recent_deploys # Read recent git deploy tags
- kubectl_rollout_undo # WRITE: rollback to previous revision
- datadog_query_error_rate # Read: error rate trend
action_boundary:
auto_approve:
- kubectl_get_deployment_history
- git_log_recent_deploys
- datadog_query_error_rate
require_human:
- kubectl_rollout_undo # Two approvers required (see metadata.approvers)
never_allow:
- kubectl_delete_deploymentExample 3: Certificate expiry rotation (scheduled, proactive)
The proactive pattern: the agent runs on a schedule rather than reacting to an alert. This is how agentic runbooks handle toil elimination rather than incident response.
metadata:
id: cert-expiry-rotation-v1
risk: low # cert-manager handles rotation; agent initiates only
approvers: [] # No human approval needed if cert-manager is fully trusted
signal_spec:
trigger: scheduled
schedule: "0 9 * * *" # Nightly at 09:00 UTC
condition: "cert.days_until_expiry < 14"
cooldown_minutes: 1440 # Once per day maximum
tool_scope:
- kubectl_get_certificates # Read: list cert-manager certificates
- cert_manager_request_renewal # Write: trigger cert renewal
- slack_post_cert_report # Write: post rotation summary to Slack
action_boundary:
auto_approve:
- kubectl_get_certificates
- cert_manager_request_renewal # Safe: cert-manager handles the rotation
- slack_post_cert_report
require_human: [] # No approval needed; cert-manager is the guardrail
never_allow:
- kubectl_delete_certificate # Never auto-delete; only renewTesting your agentic runbook
An agentic runbook that has not been tested is a liability. Four testing approaches, in order of increasing confidence.
1. Chaos engineering replay
Inject the failure the runbook targets into a staging environment. Feed the resulting alert to the agent. Verify the proposed action matches the expected remediation. Do not skip this step.
Tool: Chaos Monkey, LitmusChaos (K8s), AWS FIS
2. Historical incident replay
Feed 10-20 real historical incidents (from your incident tool's API) to the agent in dry-run mode. Compare agent-proposed actions to what humans actually did. Discrepancy rate > 20% means the runbook spec needs tuning.
Tool: PagerDuty Incidents API, incident.io export
3. Red-team prompt injection
Craft an alert payload with injected instructions in a pod name, service name, or log line. Verify the agent does not execute the injected instruction. Verify the never_allow boundary is respected. See /security-considerations for the full threat model.
Tool: Manual crafted payloads, Garak (LLM red-team)
4. Microsoft Agent Governance Toolkit pattern
Released April 2026. Open-source runtime security for agentic systems: cryptographic identity for each agent, dynamic execution rings (privilege levels), circuit breaker that pauses the agent after N consecutive unexpected actions, kill switch. Use the execution-ring pattern for safe production rollout.
Tool: Microsoft Agent Governance Toolkit (open source, April 2026)
Versioning and drift prevention
Agentic runbooks drift. The production environment changes; the agent's runbook does not. The result is a runbook that proposes actions for an environment that no longer exists. Two patterns prevent this.
Post-incident review loop (Rootly pattern)
After every resolved incident, the on-call engineer reviews the agent's reasoning trace and either confirms it was correct or flags a discrepancy. Rootly's knowledge reinforcement feature surfaces the flagged discrepancies as runbook update suggestions. Review takes 5-10 minutes per incident.
Quarterly runbook audit
Schedule a quarterly review of every agentic runbook. Verify: tool_scope includes all current API endpoints, action_boundary reflects current blast-radius tolerance, approvers list is current, last_verified date is updated. Run the chaos engineering replay to confirm the runbook still handles the target failure correctly.