Agentic Runbook Integration Patterns: Webhooks, MCP, and Agent-to-Agent
Five patterns for connecting agentic runbooks to your observability stack, incident tooling, and infrastructure APIs. With code examples and a decision matrix.
The integration layers
Every agentic runbook sits inside a five-layer stack. Understanding the stack helps you choose the right integration pattern.
- Audit: immutable logs of agent reasoning and actions (CloudTrail, Falco, Policy Engine)
- Execution: where the agent takes real actions (Kubernetes API, Terraform, AWS Lambda, GitHub API)
- Agent orchestration: the reasoning and orchestration layer (LangGraph, AutoGen, Bedrock AgentCore)
- Event routing: alert correlation and routing (PagerDuty AIOps, BigPanda, Kafka)
- Observability: metrics, logs, traces (Datadog, Prometheus, Grafana, CloudWatch)
Pattern 1: Webhook-triggered agent
The simplest and most widely supported pattern. An incident management tool fires a webhook when an alert is created or escalated. The agent receives the webhook payload, processes it, and executes the runbook.
Best for
Event-driven incident response. Works with any tool that supports outbound webhooks (PagerDuty, incident.io, Alertmanager, Datadog).
Gotchas
Webhook delivery is best-effort: most providers retry a handful of times after a failure, then drop the event. Put a durable queue (SQS, RabbitMQ) between the receiver and the agent so alerts survive agent downtime and delivery bursts.
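A minimal sketch of that decoupling, with an in-memory stand-in for the queue so it runs locally (swap in boto3/SQS or pika/RabbitMQ in practice; all names here are illustrative):

```python
# Decouple delivery from processing: the webhook handler only enqueues,
# and a separate worker drains the queue and drives the agent.
import json

class InMemoryQueue:
    """Stand-in for SQS/RabbitMQ so the sketch runs locally."""
    def __init__(self):
        self._messages = []

    def send(self, body: str):
        self._messages.append(body)

    def receive(self):
        return self._messages.pop(0) if self._messages else None

queue = InMemoryQueue()

def enqueue_webhook(payload: dict):
    # Ack the webhook immediately; durability comes from the queue
    queue.send(json.dumps(payload))

def worker_loop_once(handle_alert) -> bool:
    # One drain step; in production this loop runs in a worker process
    body = queue.receive()
    if body is None:
        return False
    handle_alert(json.loads(body))
    return True
```

The webhook endpoint returns 2xx as soon as the message is enqueued, so provider-side retries stop even if the agent is slow or down.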
```python
# Python FastAPI webhook handler
# Receives a PagerDuty (v2) webhook and routes it to the agentic runbook
import asyncio

from fastapi import FastAPI, Request
from langgraph_runbook_agent import handle_alert

app = FastAPI()

@app.post("/webhook/pagerduty")
async def pagerduty_webhook(request: Request):
    payload = await request.json()
    # PagerDuty delivers a list of messages per webhook call
    for message in payload.get("messages", []):
        if message["event"] == "incident.trigger":
            incident = message["incident"]
            # Route to the matching runbook
            alert_data = {
                "title": incident["title"],
                "body": incident.get("body", {}).get("details", ""),
                "service": incident["service"]["summary"],
                "severity": incident["urgency"],
            }
            # Fire and forget: run the (blocking) agent in a background thread
            asyncio.create_task(asyncio.to_thread(handle_alert, alert_data))
    return {"status": "accepted"}
```

Pattern 2: MCP-powered agent
The Model Context Protocol (Anthropic, released late 2024) is the emerging standard for agent-to-tool communication. Instead of custom tool wrappers, tools are exposed as MCP servers. The agent discovers and invokes them via a standard protocol.
AWS Bedrock AgentCore wraps Kubernetes, CloudWatch, CloudTrail, and EC2 APIs as MCP tools. A LangGraph or AutoGen agent consuming these MCP tools can investigate and remediate across AWS services without custom integration code.
Why MCP matters for runbooks
MCP standardises tool discovery. An agent that speaks MCP can use any MCP-compatible tool server without custom code. This enables vendor-neutral, composable agent architectures where tools are swappable. Kubiya adopted MCP in 2025; AWS Bedrock AgentCore is MCP-native.
```python
# MCP-powered agent (AWS Bedrock AgentCore pattern)
# Tools are MCP servers; the agent discovers them automatically
from bedrock_agent_core import AgentCore, MCPGateway

# Bedrock AgentCore wraps these services as MCP tool servers
gateway = MCPGateway(
    tools=[
        "aws-cloudwatch",    # Metrics and logs
        "aws-eks-kubectl",   # Kubernetes API (EKS)
        "aws-cloudtrail",    # Audit trail (read)
        "aws-codedeploy",    # Deployments
        "pagerduty-mcp",     # Incident management
    ],
    iam_role="arn:aws:iam::ACCOUNT:role/SREAgentRole",
    allowed_actions=[
        "cloudwatch:GetMetricData",
        "eks:DescribePodStatus",
        "codedeploy:GetDeployment",
        "codedeploy:CreateDeployment",  # Write: in require_human list
    ],
)

agent = AgentCore(
    model="anthropic.claude-sonnet-4-5-v2",
    gateway=gateway,
    runbook="runbooks/aws-ec2-remediation-v1.yaml",
)

# The agent discovers available MCP tools automatically on startup
# and matches signal_spec.trigger to route incoming alerts
```

Pattern 3: Agent-to-agent (A2A)
A top-level SRE agent orchestrates specialised sub-agents. The meta-agent handles incident classification and routing; sub-agents handle specific domains. Kubiya popularised this pattern in the SRE space.
```yaml
# Agent-to-agent pattern (Kubiya-inspired)
# Meta-agent routes to specialised sub-agents
meta_agent:
  role: "incident-classifier"
  receives: pagerduty_alerts
  routes_to:
    - condition: "alert.service == 'kubernetes'"
      agent: k8s-specialist-agent
    - condition: "alert.type == 'deployment'"
      agent: deploy-specialist-agent
    - condition: "alert.type == 'cert-expiry'"
      agent: cert-rotation-agent
    - default:
        agent: general-sre-agent

k8s-specialist-agent:
  tool_scope:
    - kubectl_*          # All kubectl operations
    - helm_*             # Helm operations
  model: claude-sonnet-4-5
  max_iterations: 10

deploy-specialist-agent:
  tool_scope:
    - git_*              # Git operations (read)
    - ci_cd_*            # CI/CD pipeline operations
    - kubectl_rollout_*  # Rollout operations
  model: claude-sonnet-4-5
  max_iterations: 6
```

Pattern 4: Event-stream processing
For high-volume, low-latency use cases. The agent subscribes to a Kafka, NATS, or Redis stream and processes events reactively. Suitable for noise suppression and alert correlation at scale.
```python
# Kafka consumer for high-volume alert processing
import json

from kafka import KafkaConsumer
from alert_correlator import AgentCorrelator
# "pagerduty" below stands for an incident-API client configured elsewhere

consumer = KafkaConsumer(
    "platform.alerts",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

correlator = AgentCorrelator(
    window_seconds=300,        # 5-minute correlation window
    max_alerts_per_incident=500,
    model="claude-haiku-3-5",  # Faster model for high-volume correlation
)

for message in consumer:
    alert = message.value
    # Correlate into incident groups
    incident = correlator.process(alert)
    if incident.is_new:
        # Only page on-call for novel incident clusters
        pagerduty.create_incident(incident)
    else:
        # Suppress: add to the existing incident
        pagerduty.add_alert_to_incident(incident, alert)
```

Pattern 5: Scheduled / proactive
The agent runs on a schedule rather than reacting to an alert. Used for toil elimination: cert expiry checking, cost anomaly scanning, config drift detection, runbook freshness auditing.
```python
# Scheduled proactive agent (cron-triggered)
# Runs daily at 09:00 to check for cert expiry
import time

import schedule
from cert_checker_agent import CertCheckerAgent

agent = CertCheckerAgent(
    runbook="runbooks/cert-expiry-rotation-v1.yaml",
    namespaces=["production", "staging"],
    warn_days=14,       # Alert if expiry < 14 days
    auto_renew_days=7,  # Auto-renew if < 7 days
)

schedule.every().day.at("09:00").do(agent.run)

while True:
    schedule.run_pending()
    time.sleep(60)
```

Pattern selection guide
| Pattern | Use when | Avoid when |
|---|---|---|
| Webhook | Standard incident response. Alert volume manageable. Tools support webhooks. | Alert volume > 1000/minute. Need guaranteed delivery. |
| MCP | Using AWS Bedrock AgentCore or Kubiya. Want vendor-neutral tool layer. Multi-cloud. | Running on-prem, air-gapped. MCP server not available for required tool. |
| A2A | Complex incident types needing domain specialists. Platform engineering teams. | Simple, single-domain incidents. Adds latency and complexity. |
| Event stream | High alert volume. Noise suppression is the primary use case. Kafka already in stack. | Low volume; the added infrastructure complexity is not justified. |
| Scheduled | Proactive toil elimination. Cert checks, cost scans, drift detection. | Reactive incident response. Scheduled agents cannot respond in real time. |
Integration gotchas
Timeouts cascade
The LLM API has a timeout. The tool call has a timeout. The webhook receiver has a timeout. These cascade: if the LLM takes 8 seconds and the webhook receiver has a 10-second timeout, you have 2 seconds for tool calls. Set timeouts explicitly at every layer.
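One way to keep the cascade explicit is to thread a single deadline budget through every layer instead of scattering independent timeout constants; a minimal sketch (the class name and budget values are illustrative, not from any library):

```python
import time

class Deadline:
    """One overall time budget, passed down through every layer."""
    def __init__(self, total_seconds: float):
        self._expires = time.monotonic() + total_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires - time.monotonic())

# Webhook receiver gives us 10 seconds end to end
deadline = Deadline(10.0)

# Cap the LLM call so tool calls are not starved of the remaining budget
llm_timeout = min(8.0, deadline.remaining())

# After the LLM call returns, whatever is left is the tool-call budget
tool_timeout = deadline.remaining()
```

Each layer asks the same object how much time is left, so an 8-second LLM call visibly shrinks the tool-call budget rather than silently overrunning the receiver's limit.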
Retry loops and duplicate execution
If a webhook is retried (common after 5xx response), the agent runs twice. Make your agent idempotent: check whether an action was already taken before executing it again. Use a deduplication key derived from the alert ID.
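A sketch of the deduplication check, using an in-memory set for illustration (in production, back this with Redis `SET NX` or a database unique constraint; the function names are assumptions):

```python
import hashlib

_processed: set[str] = set()

def dedup_key(alert: dict) -> str:
    # Stable key: retried deliveries of the same alert produce the same key
    return hashlib.sha256(alert["id"].encode()).hexdigest()

def handle_once(alert: dict, run_agent) -> bool:
    key = dedup_key(alert)
    if key in _processed:
        return False  # duplicate delivery: skip
    _processed.add(key)
    run_agent(alert)
    return True
```

The check-then-run must be atomic in a multi-worker deployment, which is why a shared store with an atomic set-if-absent operation replaces the local set in practice.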
Circular agent calls in A2A
A meta-agent that calls a sub-agent that calls the meta-agent creates an infinite loop. Enforce max_depth on agent-to-agent calls. LangGraph's cycle detection helps, but explicit depth limits are more reliable.
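Enforcing the limit takes only a counter threaded through every agent-to-agent hop; a sketch with a deliberately circular pair of agents (MAX_DEPTH and all function names are illustrative):

```python
MAX_DEPTH = 3

class AgentDepthExceeded(RuntimeError):
    pass

def call_agent(agent_fn, task: str, depth: int = 0):
    # Every agent-to-agent hop increments depth; refuse past the limit
    if depth >= MAX_DEPTH:
        raise AgentDepthExceeded(f"refusing agent call at depth {depth}")
    return agent_fn(task, depth + 1)

# A meta-agent and sub-agent that (pathologically) call each other forever
def meta_agent(task, depth):
    return call_agent(sub_agent, task, depth)

def sub_agent(task, depth):
    return call_agent(meta_agent, task, depth)
```

Without the depth check this pair recurses until the stack overflows; with it, the loop is cut off after MAX_DEPTH hops and the error surfaces in the audit log.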
LLM API rate limits
At high alert volume, you may hit rate limits on the LLM API. Implement exponential backoff, use a lower-tier model for correlation (Claude Haiku vs Sonnet), and queue alerts during bursts.
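A sketch of jittered exponential backoff around the LLM call (the exception class is a stand-in for your client's rate-limit error, and the delay values are illustrative):

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for the LLM client's rate-limit error (e.g. HTTP 429)."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0, cap: float = 30.0):
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with full jitter, capped at `cap` seconds
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
```

Full jitter spreads retries out so a burst of alerts does not hammer the API in lockstep; pairing this with a cheaper model for correlation keeps the expensive model's quota for actual remediation runs.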