Emerging category, best practices evolving. Code samples illustrative. Verify security implications before production use. Data verified April 2026.
Last verified April 2026

How to Write Your First Agentic Runbook: A Working Tutorial (2026)

Three working examples with real YAML and LangGraph Python. Annotated line by line. Vendor-neutral. References real tools and real code patterns from AWS, LangChain, and the Kubernetes SRE community.

Prerequisites

Before writing an agentic runbook, your team needs three infrastructure layers already in place. Skipping any of them produces an agent with poor signal quality, insufficient action scope, or inadequate reasoning capability.

Observability layer

  • +Structured alerts (not just email)
  • +Log aggregation accessible by API
  • +Metrics with named labels
  • +Deployment event stream

e.g. Datadog, Prometheus + Alertmanager, Grafana

Incident management

  • +Alert routing to on-call
  • +Incident lifecycle tracking
  • +Slack integration
  • +Webhook-triggerable API

e.g. PagerDuty, incident.io, FireHydrant, Rootly

LLM API with tool use

  • +Tool-use / function calling
  • +Structured output mode
  • +Sufficient context window
  • +Acceptable latency

e.g. Claude Sonnet 4.5, GPT-4o, AWS Bedrock Anthropic Claude

The eight required fields

Every agentic runbook has eight required fields. All eight must be present. An agent without a defined tool_scope becomes unpredictably powerful. An agent without an action_boundary has no human oversight layer.

metadata

id, version, owner, slack_channel, pagerduty_service, risk level, approvers list, last_verified date, and changelog. Treat this like a software package version.

signal_spec

What triggers the agent. Must include: trigger type (pagerduty_alert, prometheus_alert, scheduled), the condition string, and a cooldown_minutes to prevent duplicate runs.

tool_scope

An exhaustive list of every tool (API) this agent may invoke. Any tool not listed is inaccessible. This is your primary security boundary.

action_boundary

Three categories: auto_approve (agent executes without asking), require_human (agent proposes, waits), never_allow (hard-coded block the LLM cannot override). The never_allow list is your blast-radius guardrail.

context_retrieval

What the agent fetches via RAG before reasoning: runbook library queries, past incident patterns, dependency topology. This is what gives the agent institutional memory.

execution_plan

The reasoning framework: LangGraph or AutoGen. Model choice, max_iterations (prevents infinite loops), timeout_seconds. LangGraph is recommended for stateful, cyclical reasoning; AutoGen for multi-agent conversations.

observability

reasoning_trace: true (saves the full chain-of-thought), tool_call_log: true (every API call with args and response), audit_sink pointing to an immutable destination. Required for compliance and debugging.

learning_loop

on_resolve: what happens when the incident is closed (update runbook library, mark action as successful). on_failure: what happens when the agent fails (flag for human review, never silently drop).

Example 1: Kubernetes pod crash-loop (annotated YAML)

The most common agentic runbook use case. Pod enters CrashLoopBackOff; agent retrieves logs, identifies likely cause, proposes and executes restart after approval. Annotated line-by-line.

# EXAMPLE 1: Pod CrashLoopBackOff remediation
# Minimum viable agentic runbook.
# Copy this and replace values for your environment.

metadata:
  id: k8s-crashloop-v2            # Unique ID, referenced in audit logs
  version: "2.1"
  owner: platform-eng             # Slack user or team handle
  slack_channel: "#incidents-platform"
  pagerduty_service: auth-service
  risk: medium                    # low | medium | high | critical
  approvers:                      # Must match Slack handles
    - on-call-lead
  last_verified: "2026-04-01"

signal_spec:
  trigger: pagerduty_alert        # pagerduty_alert | prometheus_alert | scheduled
  condition: "alert.title contains CrashLoopBackOff"
  cooldown_minutes: 5             # Prevents duplicate agent runs on repeated alerts
  # If using Prometheus instead:
  # trigger: prometheus_alert
  # condition: 'kube_pod_container_status_restarts_total > 5'

tool_scope:
  # Everything the agent MAY call.
  # Any tool not listed here is inaccessible.
  - kubectl_get_pod_logs          # Read-only: fetch recent log lines
  - kubectl_describe_pod          # Read-only: describe pod state
  - kubectl_get_events            # Read-only: cluster events
  - kubectl_rollout_restart       # WRITE: restarts deployment (needs approval)
  - pagerduty_acknowledge         # Write (status only, safe)
  - slack_post_update             # Write (comms only, safe)
  - datadog_query_metrics         # Read-only: CPU/memory query

action_boundary:
  auto_approve:                   # Agent executes without human confirmation
    - kubectl_get_pod_logs
    - kubectl_describe_pod
    - kubectl_get_events
    - pagerduty_acknowledge
    - slack_post_update
    - datadog_query_metrics
  require_human:                  # Agent proposes; human approves via Slack
    - kubectl_rollout_restart
  never_allow:                    # Hard-coded block; LLM cannot override
    - kubectl_delete_pod
    - kubectl_delete_deployment

context_retrieval:
  - source: runbook_library       # Vector DB of your runbooks
    query: "CrashLoopBackOff kubernetes pod remediation"
    top_k: 3
  - source: past_incidents        # Vector DB of resolved incidents
    query: "CrashLoopBackOff {pod_name}"  # {pod_name} is injected from alert
    top_k: 5
    max_age_days: 90

execution_plan:
  framework: langgraph
  model: claude-sonnet-4-5
  max_iterations: 8               # Hard limit; prevents infinite loops
  timeout_seconds: 300            # 5 minutes; page human if exceeded

observability:
  reasoning_trace: true           # Full chain-of-thought saved
  tool_call_log: true             # Every API call logged with args + response
  audit_sink: cloudwatch          # cloudwatch | datadog | splunk
  immutable: true                 # Agent cannot modify past records

learning_loop:
  enabled: true
  on_resolve: update_runbook_library   # Save outcome to vector DB
  on_failure: flag_for_review          # Human reviews failed runs

LangGraph Python: the execution layer

The YAML spec above is the configuration. LangGraph reads it and builds an execution graph. Here is the Python that instantiates the graph and runs it on an incoming alert. This is based on the pattern from the AWS Bedrock AgentCore + LangGraph SRE reference implementation.

# langgraph_runbook_agent.py
# Python 3.12 | LangGraph 0.2.x | Claude Sonnet 4.5

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from langchain.tools import tool
from typing import TypedDict, Annotated
import yaml, subprocess, json

# Load the runbook spec
with open("runbooks/k8s-crashloop-v2.yaml") as f:
    runbook = yaml.safe_load(f)

# Tool definitions (real kubectl wrappers)
@tool
def kubectl_get_pod_logs(namespace: str, pod_name: str, lines: int = 50) -> str:
    """Fetch recent logs from a pod. Read-only."""
    result = subprocess.run(
        ["kubectl", "logs", pod_name, "-n", namespace, f"--tail={lines}"],
        capture_output=True, text=True
    )
    return result.stdout

@tool
def kubectl_describe_pod(namespace: str, pod_name: str) -> str:
    """Describe pod state, events, and conditions. Read-only."""
    result = subprocess.run(
        ["kubectl", "describe", "pod", pod_name, "-n", namespace],
        capture_output=True, text=True
    )
    return result.stdout

@tool
def slack_post_update(channel: str, message: str) -> str:
    """Post an update to Slack. Comms only."""
    # In production: call Slack API
    return f"Posted to {channel}: {message}"

# Agent state
class AgentState(TypedDict):
    alert: dict                  # Incoming alert payload
    context: str                 # Retrieved runbook + past incidents
    observations: list[str]      # Tool call results
    reasoning: str               # Agent's current reasoning
    proposed_action: str         # Action requiring human approval
    human_approved: bool         # Approval flag
    resolved: bool               # Resolution flag

# Build the LangGraph graph
model = ChatAnthropic(model="claude-sonnet-4-5")

def observe(state: AgentState) -> AgentState:
    """Gather initial observations from the alert."""
    alert = state["alert"]
    logs = kubectl_get_pod_logs.invoke({
        "namespace": alert["namespace"],
        "pod_name": alert["pod_name"]
    })
    desc = kubectl_describe_pod.invoke({
        "namespace": alert["namespace"],
        "pod_name": alert["pod_name"]
    })
    state["observations"] = [logs, desc]
    return state

def reason(state: AgentState) -> AgentState:
    """Apply the LLM to observations + context to propose an action."""
    prompt = f"""
You are an SRE agent handling a CrashLoopBackOff incident.

RUNBOOK: {state['context']}
OBSERVATIONS: {json.dumps(state['observations'])}
ALERT: {json.dumps(state['alert'])}

Based on the observations, diagnose the cause and propose a remediation action.
Choose ONLY from the allowed tool scope: {runbook['tool_scope']}
Actions requiring human approval: {runbook['action_boundary']['require_human']}

Respond with JSON: {{"diagnosis": "...", "proposed_action": "...", "confidence": 0.0-1.0}}
"""
    response = model.invoke(prompt)
    parsed = json.loads(response.content)
    state["reasoning"] = parsed["diagnosis"]
    state["proposed_action"] = parsed["proposed_action"]
    return state

def await_approval(state: AgentState) -> AgentState:
    """Send Slack message and wait for human approval."""
    slack_post_update.invoke({
        "channel": runbook["metadata"]["slack_channel"],
        "message": f"""
Agent proposes: {state['proposed_action']}
Diagnosis: {state['reasoning']}
Approve? Reply 'approve k8s-crashloop-v2' in this thread.
@{runbook['metadata']['approvers'][0]}
"""
    })
    # In production: poll for Slack thread reply with timeout
    # For this example, simulated:
    state["human_approved"] = True  # Replace with real Slack approval check
    return state

graph = StateGraph(AgentState)
graph.add_node("observe", observe)
graph.add_node("reason", reason)
graph.add_node("await_approval", await_approval)
graph.add_edge("observe", "reason")
graph.add_edge("reason", "await_approval")
graph.add_edge("await_approval", END)

agent = graph.compile()

# Entry point (called by PagerDuty webhook handler)
def handle_alert(alert_payload: dict):
    result = agent.invoke({
        "alert": alert_payload,
        "context": "",  # populated by context_retrieval layer
        "observations": [],
        "reasoning": "",
        "proposed_action": "",
        "human_approved": False,
        "resolved": False,
    })
    return result

Example 2: Deployment rollback on error-rate spike (abbreviated)

The pattern is identical to Example 1 but the trigger, tool scope, and action boundary differ. Key differences shown only.

metadata:
  id: k8s-deployment-rollback-v1
  risk: high        # Rollback affects all traffic to the service
  approvers:
    - on-call-lead
    - service-owner  # Two approvers for high-risk actions

signal_spec:
  trigger: prometheus_alert
  condition: 'http_requests_total{status=~"5.."} / http_requests_total > 0.02'
  # Error rate > 2% for 3 minutes (handled by alerting rule duration)
  cooldown_minutes: 10

tool_scope:
  - kubectl_get_deployment_history    # Read recent rollout history
  - git_log_recent_deploys            # Read recent git deploy tags
  - kubectl_rollout_undo              # WRITE: rollback to previous revision
  - datadog_query_error_rate          # Read: error rate trend

action_boundary:
  auto_approve:
    - kubectl_get_deployment_history
    - git_log_recent_deploys
    - datadog_query_error_rate
  require_human:
    - kubectl_rollout_undo           # Two approvers required (see metadata.approvers)
  never_allow:
    - kubectl_delete_deployment

Example 3: Certificate expiry rotation (scheduled, proactive)

The proactive pattern: the agent runs on a schedule rather than reacting to an alert. This is how agentic runbooks handle toil elimination rather than incident response.

metadata:
  id: cert-expiry-rotation-v1
  risk: low       # cert-manager handles rotation; agent initiates only
  approvers: []   # No human approval needed if cert-manager is fully trusted

signal_spec:
  trigger: scheduled
  schedule: "0 9 * * *"    # Nightly at 09:00 UTC
  condition: "cert.days_until_expiry < 14"
  cooldown_minutes: 1440   # Once per day maximum

tool_scope:
  - kubectl_get_certificates         # Read: list cert-manager certificates
  - cert_manager_request_renewal     # Write: trigger cert renewal
  - slack_post_cert_report           # Write: post rotation summary to Slack

action_boundary:
  auto_approve:
    - kubectl_get_certificates
    - cert_manager_request_renewal   # Safe: cert-manager handles the rotation
    - slack_post_cert_report
  require_human: []                  # No approval needed; cert-manager is the guardrail
  never_allow:
    - kubectl_delete_certificate     # Never auto-delete; only renew

Testing your agentic runbook

An agentic runbook that has not been tested is a liability. Four testing approaches, in order of increasing confidence.

1. Chaos engineering replay

Inject the failure the runbook targets into a staging environment. Feed the resulting alert to the agent. Verify the proposed action matches the expected remediation. Do not skip this step.

Tool: Chaos Monkey, LitmusChaos (K8s), AWS FIS

2. Historical incident replay

Feed 10-20 real historical incidents (from your incident tool's API) to the agent in dry-run mode. Compare agent-proposed actions to what humans actually did. Discrepancy rate > 20% means the runbook spec needs tuning.

Tool: PagerDuty Incidents API, incident.io export

3. Red-team prompt injection

Craft an alert payload with injected instructions in a pod name, service name, or log line. Verify the agent does not execute the injected instruction. Verify the never_allow boundary is respected. See /security-considerations for the full threat model.

Tool: Manual crafted payloads, Garak (LLM red-team)

4. Microsoft Agent Governance Toolkit pattern

Released April 2026. Open-source runtime security for agentic systems: cryptographic identity for each agent, dynamic execution rings (privilege levels), circuit breaker that pauses the agent after N consecutive unexpected actions, kill switch. Use the execution-ring pattern for safe production rollout.

Tool: Microsoft Agent Governance Toolkit (open source, April 2026)

Versioning and drift prevention

Agentic runbooks drift. The production environment changes; the agent's runbook does not. The result is a runbook that proposes actions for an environment that no longer exists. Two patterns prevent this.

Post-incident review loop (Rootly pattern)

After every resolved incident, the on-call engineer reviews the agent's reasoning trace and either confirms it was correct or flags a discrepancy. Rootly's knowledge reinforcement feature surfaces the flagged discrepancies as runbook update suggestions. Review takes 5-10 minutes per incident.

Quarterly runbook audit

Schedule a quarterly review of every agentic runbook. Verify: tool_scope includes all current API endpoints, action_boundary reflects current blast-radius tolerance, approvers list is current, last_verified date is updated. Run the chaos engineering replay to confirm the runbook still handles the target failure correctly.

Continue reading