Agentic Runbook Integration Patterns: Webhooks, MCP, and Agent-to-Agent
Five patterns for connecting agentic runbooks to your observability stack, incident tooling, and infrastructure APIs. With code examples and a decision matrix.
The integration layers
Every agentic runbook sits inside a five-layer stack. Understanding the stack helps you choose the right integration pattern.
- Audit: immutable logs of agent reasoning and actions (CloudTrail, Falco, Policy Engine)
- Execution: where the agent takes real actions (Kubernetes API, Terraform, AWS Lambda, GitHub API)
- Agent orchestration: the reasoning and orchestration layer (LangGraph, AutoGen, Bedrock AgentCore)
- Event routing: alert correlation and routing (PagerDuty AIOps, BigPanda, Kafka)
- Observability: metrics, logs, traces (Datadog, Prometheus, Grafana, CloudWatch)
Pattern 1: Webhook-triggered agent
The simplest and most widely supported pattern. An incident management tool fires a webhook when an alert is created or escalated. The agent receives the webhook payload, processes it, and executes the runbook.
Best for
Event-driven incident response. Works with any tool that supports outbound webhooks (PagerDuty, incident.io, Alertmanager, Datadog).
Gotchas
Webhook delivery is best-effort: most providers retry a handful of times after a failure, then drop the event. Put a durable queue (SQS, RabbitMQ) between the receiver and the agent so alerts survive agent downtime and delivery bursts.
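A minimal sketch of that decoupling, with an in-memory stand-in for the queue so it runs locally (swap in boto3/SQS or pika/RabbitMQ in practice; all names here are illustrative):

```python
# Decouple delivery from processing: the webhook handler only enqueues,
# and a separate worker drains the queue and drives the agent.
import json

class InMemoryQueue:
    """Stand-in for SQS/RabbitMQ so the sketch runs locally."""
    def __init__(self):
        self._messages = []

    def send(self, body: str):
        self._messages.append(body)

    def receive(self):
        return self._messages.pop(0) if self._messages else None

queue = InMemoryQueue()

def enqueue_webhook(payload: dict):
    # Ack the webhook immediately; durability comes from the queue
    queue.send(json.dumps(payload))

def worker_loop_once(handle_alert) -> bool:
    # One drain step; in production this loop runs in a worker process
    body = queue.receive()
    if body is None:
        return False
    handle_alert(json.loads(body))
    return True
```

The webhook endpoint returns 2xx as soon as the message is enqueued, so provider-side retries stop even if the agent is slow or down.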
```python
# Python FastAPI webhook handler
# Receives a PagerDuty (v2) webhook and routes it to the agentic runbook
import asyncio

from fastapi import FastAPI, Request
from langgraph_runbook_agent import handle_alert

app = FastAPI()

@app.post("/webhook/pagerduty")
async def pagerduty_webhook(request: Request):
    payload = await request.json()
    # PagerDuty delivers a list of messages per webhook call
    for message in payload.get("messages", []):
        if message["event"] == "incident.trigger":
            incident = message["incident"]
            # Route to the matching runbook
            alert_data = {
                "title": incident["title"],
                "body": incident.get("body", {}).get("details", ""),
                "service": incident["service"]["summary"],
                "severity": incident["urgency"],
            }
            # Fire and forget: run the (blocking) agent in a background thread
            asyncio.create_task(asyncio.to_thread(handle_alert, alert_data))
    return {"status": "accepted"}
```

Pattern 2: MCP-powered agent
The Model Context Protocol (Anthropic, released late 2024) is the emerging standard for agent-to-tool communication. Instead of custom tool wrappers, tools are exposed as MCP servers. The agent discovers and invokes them via a standard protocol.
AWS Bedrock AgentCore wraps Kubernetes, CloudWatch, CloudTrail, and EC2 APIs as MCP tools. A LangGraph or AutoGen agent consuming these MCP tools can investigate and remediate across AWS services without custom integration code.
Why MCP matters for runbooks
MCP standardises tool discovery. An agent that speaks MCP can use any MCP-compatible tool server without custom code. This enables vendor-neutral, composable agent architectures where tools are swappable. Kubiya adopted MCP in 2025; AWS Bedrock AgentCore is MCP-native.
```python
# MCP-powered agent (AWS Bedrock AgentCore pattern)
# Tools are MCP servers; the agent discovers them automatically
from bedrock_agent_core import AgentCore, MCPGateway

# Bedrock AgentCore wraps these services as MCP tool servers
gateway = MCPGateway(
    tools=[
        "aws-cloudwatch",    # Metrics and logs
        "aws-eks-kubectl",   # Kubernetes API (EKS)
        "aws-cloudtrail",    # Audit trail (read)
        "aws-codedeploy",    # Deployments
        "pagerduty-mcp",     # Incident management
    ],
    iam_role="arn:aws:iam::ACCOUNT:role/SREAgentRole",
    allowed_actions=[
        "cloudwatch:GetMetricData",
        "eks:DescribePodStatus",
        "codedeploy:GetDeployment",
        "codedeploy:CreateDeployment",  # Write: in require_human list
    ],
)

agent = AgentCore(
    model="anthropic.claude-sonnet-4-5-v2",
    gateway=gateway,
    runbook="runbooks/aws-ec2-remediation-v1.yaml",
)

# The agent discovers available MCP tools automatically on startup
# and matches signal_spec.trigger to route incoming alerts
```

Pattern 3: Agent-to-agent (A2A)
A top-level SRE agent orchestrates specialised sub-agents. The meta-agent handles incident classification and routing; sub-agents handle specific domains. Kubiya popularised this pattern in the SRE space.
```yaml
# Agent-to-agent pattern (Kubiya-inspired)
# Meta-agent routes to specialised sub-agents
meta_agent:
  role: "incident-classifier"
  receives: pagerduty_alerts
  routes_to:
    - condition: "alert.service == 'kubernetes'"
      agent: k8s-specialist-agent
    - condition: "alert.type == 'deployment'"
      agent: deploy-specialist-agent
    - condition: "alert.type == 'cert-expiry'"
      agent: cert-rotation-agent
    - default:
        agent: general-sre-agent

k8s-specialist-agent:
  tool_scope:
    - kubectl_*          # All kubectl operations
    - helm_*             # Helm operations
  model: claude-sonnet-4-5
  max_iterations: 10

deploy-specialist-agent:
  tool_scope:
    - git_*              # Git operations (read)
    - ci_cd_*            # CI/CD pipeline operations
    - kubectl_rollout_*  # Rollout operations
  model: claude-sonnet-4-5
  max_iterations: 6
```

Pattern 4: Event-stream processing
For high-volume, low-latency use cases. The agent subscribes to a Kafka, NATS, or Redis stream and processes events reactively. Suitable for noise suppression and alert correlation at scale.
```python
# Kafka consumer for high-volume alert processing
import json

from kafka import KafkaConsumer
from alert_correlator import AgentCorrelator
# "pagerduty" below stands for an incident-API client configured elsewhere

consumer = KafkaConsumer(
    "platform.alerts",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

correlator = AgentCorrelator(
    window_seconds=300,        # 5-minute correlation window
    max_alerts_per_incident=500,
    model="claude-haiku-3-5",  # Faster model for high-volume correlation
)

for message in consumer:
    alert = message.value
    # Correlate into incident groups
    incident = correlator.process(alert)
    if incident.is_new:
        # Only page on-call for novel incident clusters
        pagerduty.create_incident(incident)
    else:
        # Suppress: add to the existing incident
        pagerduty.add_alert_to_incident(incident, alert)
```

Pattern 5: Scheduled / proactive
The agent runs on a schedule rather than reacting to an alert. Used for toil elimination: cert expiry checking, cost anomaly scanning, config drift detection, runbook freshness auditing.
```python
# Scheduled proactive agent (cron-triggered)
# Runs daily at 09:00 to check for cert expiry
import time

import schedule
from cert_checker_agent import CertCheckerAgent

agent = CertCheckerAgent(
    runbook="runbooks/cert-expiry-rotation-v1.yaml",
    namespaces=["production", "staging"],
    warn_days=14,       # Alert if expiry < 14 days
    auto_renew_days=7,  # Auto-renew if < 7 days
)

schedule.every().day.at("09:00").do(agent.run)

while True:
    schedule.run_pending()
    time.sleep(60)
```

Pattern selection guide
| Pattern | Use when | Avoid when |
|---|---|---|
| Webhook | Standard incident response. Alert volume manageable. Tools support webhooks. | Alert volume > 1000/minute. Need guaranteed delivery. |
| MCP | Using AWS Bedrock AgentCore or Kubiya. Want vendor-neutral tool layer. Multi-cloud. | Running on-prem, air-gapped. MCP server not available for required tool. |
| A2A | Complex incident types needing domain specialists. Platform engineering teams. | Simple, single-domain incidents. Adds latency and complexity. |
| Event stream | High alert volume. Noise suppression is the primary use case. Kafka already in stack. | Low volume; the added infrastructure complexity is not justified. |
| Scheduled | Proactive toil elimination. Cert checks, cost scans, drift detection. | Reactive incident response. Scheduled agents cannot respond in real time. |
Integration gotchas
Timeouts cascade
The LLM API has a timeout. The tool call has a timeout. The webhook receiver has a timeout. These cascade: if the LLM takes 8 seconds and the webhook receiver has a 10-second timeout, you have 2 seconds for tool calls. Set timeouts explicitly at every layer.
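One way to keep the cascade explicit is to thread a single deadline budget through every layer instead of scattering independent timeout constants; a minimal sketch (the class name and budget values are illustrative, not from any library):

```python
import time

class Deadline:
    """One overall time budget, passed down through every layer."""
    def __init__(self, total_seconds: float):
        self._expires = time.monotonic() + total_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires - time.monotonic())

# Webhook receiver gives us 10 seconds end to end
deadline = Deadline(10.0)

# Cap the LLM call so tool calls are not starved of the remaining budget
llm_timeout = min(8.0, deadline.remaining())

# After the LLM call returns, whatever is left is the tool-call budget
tool_timeout = deadline.remaining()
```

Each layer asks the same object how much time is left, so an 8-second LLM call visibly shrinks the tool-call budget rather than silently overrunning the receiver's limit.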
Retry loops and duplicate execution
If a webhook is retried (common after 5xx response), the agent runs twice. Make your agent idempotent: check whether an action was already taken before executing it again. Use a deduplication key derived from the alert ID.
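A sketch of the deduplication check, using an in-memory set for illustration (in production, back this with Redis `SET NX` or a database unique constraint; the function names are assumptions):

```python
import hashlib

_processed: set[str] = set()

def dedup_key(alert: dict) -> str:
    # Stable key: retried deliveries of the same alert produce the same key
    return hashlib.sha256(alert["id"].encode()).hexdigest()

def handle_once(alert: dict, run_agent) -> bool:
    key = dedup_key(alert)
    if key in _processed:
        return False  # duplicate delivery: skip
    _processed.add(key)
    run_agent(alert)
    return True
```

The check-then-run must be atomic in a multi-worker deployment, which is why a shared store with an atomic set-if-absent operation replaces the local set in practice.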
Circular agent calls in A2A
A meta-agent that calls a sub-agent that calls the meta-agent creates an infinite loop. Enforce max_depth on agent-to-agent calls. LangGraph's cycle detection helps, but explicit depth limits are more reliable.
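Enforcing the limit takes only a counter threaded through every agent-to-agent hop; a sketch with a deliberately circular pair of agents (MAX_DEPTH and all function names are illustrative):

```python
MAX_DEPTH = 3

class AgentDepthExceeded(RuntimeError):
    pass

def call_agent(agent_fn, task: str, depth: int = 0):
    # Every agent-to-agent hop increments depth; refuse past the limit
    if depth >= MAX_DEPTH:
        raise AgentDepthExceeded(f"refusing agent call at depth {depth}")
    return agent_fn(task, depth + 1)

# A meta-agent and sub-agent that (pathologically) call each other forever
def meta_agent(task, depth):
    return call_agent(sub_agent, task, depth)

def sub_agent(task, depth):
    return call_agent(meta_agent, task, depth)
```

Without the depth check this pair recurses until the stack overflows; with it, the loop is cut off after MAX_DEPTH hops and the error surfaces in the audit log.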
LLM API rate limits
At high alert volume, you may hit rate limits on the LLM API. Implement exponential backoff, use a lower-tier model for correlation (Claude Haiku vs Sonnet), and queue alerts during bursts.
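A sketch of jittered exponential backoff around the LLM call (the exception class is a stand-in for your client's rate-limit error, and the delay values are illustrative):

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for the LLM client's rate-limit error (e.g. HTTP 429)."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0, cap: float = 30.0):
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with full jitter, capped at `cap` seconds
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
```

Full jitter spreads retries out so a burst of alerts does not hammer the API in lockstep; pairing this with a cheaper model for correlation keeps the expensive model's quota for actual remediation runs.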