Agentic Runbooks on AWS: DevOps Agent, Bedrock AgentCore, and the 2026 Stack
AWS has shipped the most complete native agentic runbook stack of any cloud provider in 2026. Here is what each component does and how to compose them.
The AWS native stack for agentic runbooks
| Component | Status | Role |
|---|---|---|
| AWS DevOps Agent | GA 2026 | The always-available SRE teammate. Cross-account investigation, topology intelligence, continuous learning. |
| Amazon Bedrock AgentCore | GA 2026 | Agent orchestration runtime. Wraps AWS APIs as MCP tools. Hosts LangGraph and AutoGen agents. |
| AWS CloudTrail | GA (existing) | Immutable audit log for all agent actions. Feeds the observability and learning_loop components. |
| AWS IAM Identity Center | GA (existing) | Authentication and least-privilege policy for agent credentials. Time-bounded credentials. |
| Amazon CloudWatch | GA (existing) | Metrics and log source. The agent queries CloudWatch for incident context. |
AWS DevOps Agent capabilities
AWS DevOps Agent is described by AWS as an "always-available AI SRE teammate". It has three distinguishing capabilities not yet matched by point-solution vendors:
Cross-account investigation
The agent can query resources across multiple AWS accounts in a single investigation. This is useful for organisations with account-per-environment or account-per-team strategies. When a production service fails and the root cause is in a shared service in a different account, the agent traverses the account boundary automatically.
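Crossing an account boundary on AWS usually means assuming a role in the target account via STS. The sketch below shows that step in isolation; it is not how AWS DevOps Agent is implemented internally. The role name `SreAgentReadOnly` and the `AccountSession` type are hypothetical, and the STS client is passed in so the function can be exercised without AWS access.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AccountSession:
    """Short-lived credentials for one target account (hypothetical type)."""
    account_id: str
    access_key_id: str
    secret_access_key: str
    session_token: str


def role_arn(account_id: str, role_name: str) -> str:
    """Build the ARN of the investigation role in a target account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_investigation_role(
    sts_client: Any,
    account_id: str,
    role_name: str = "SreAgentReadOnly",  # hypothetical role name
    duration_seconds: int = 900,          # 900 s is the minimum STS accepts
) -> AccountSession:
    """Call sts:AssumeRole and return the temporary credentials."""
    response = sts_client.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName=f"sre-agent-{account_id}",
        DurationSeconds=duration_seconds,
    )
    creds = response["Credentials"]
    return AccountSession(
        account_id=account_id,
        access_key_id=creds["AccessKeyId"],
        secret_access_key=creds["SecretAccessKey"],
        session_token=creds["SessionToken"],
    )
```

With boto3 installed, the client would typically come from `boto3.client("sts")`. Keeping the duration at the minimum matches the time-bounded credentials described for IAM Identity Center.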
Topology intelligence
The agent understands AWS resource relationships: EC2 instances in an Auto Scaling Group behind an ELB, ECS tasks on EC2 instances, RDS replicas, VPC routing tables. When investigating an incident, it navigates the topology rather than requiring the engineer to manually trace dependencies.
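One way to picture topology navigation is a breadth-first walk over a resource graph, starting at the failing resource. The following is a minimal sketch under that assumption; the resource identifiers and the `TOPOLOGY` map are made up for illustration and do not reflect the agent's actual data model.

```python
from collections import deque

# Hypothetical topology: each resource maps to the resources directly
# downstream of it.
TOPOLOGY = {
    "elb/web": ["asg/web"],
    "asg/web": ["ec2/i-0a", "ec2/i-0b"],
    "ec2/i-0a": ["rds/orders-primary"],
    "ec2/i-0b": ["rds/orders-primary"],
    "rds/orders-primary": ["rds/orders-replica"],
    "rds/orders-replica": [],
}


def resources_to_inspect(topology, start):
    """Breadth-first walk from the failing resource through its downstream
    resources, visiting each resource once."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for downstream in topology.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return order


print(resources_to_inspect(TOPOLOGY, "elb/web"))
# ['elb/web', 'asg/web', 'ec2/i-0a', 'ec2/i-0b',
#  'rds/orders-primary', 'rds/orders-replica']
```

Breadth-first order checks the nearest resources first, which mirrors how an engineer would trace from the load balancer inward.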
Continuous learning
Actions taken during incidents are recorded and used to improve future recommendations. Over time, the agent learns the specific patterns and fix sequences that work in your environment. This is the learning_loop component of the runbook specification.
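How the agent learns is not specified here, so the following is only a rough sketch of what a learning loop could look like: record each attempted fix with its outcome, then rank fixes for a given incident pattern by observed success rate. The `FixHistory` class, the pattern names, and the action names are all hypothetical.

```python
from collections import defaultdict


class FixHistory:
    """Tracks how often each action resolved each incident pattern."""

    def __init__(self):
        # (pattern, action) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, pattern, action, resolved):
        """Store the outcome of one attempted fix."""
        stats = self._stats[(pattern, action)]
        stats[1] += 1
        if resolved:
            stats[0] += 1

    def ranked_actions(self, pattern):
        """Return (action, success_rate) pairs for this pattern,
        highest success rate first."""
        rates = [
            (action, successes / attempts)
            for (p, action), (successes, attempts) in self._stats.items()
            if p == pattern
        ]
        return sorted(rates, key=lambda item: item[1], reverse=True)


history = FixHistory()
history.record("rds-latency", "restart-replica", resolved=False)
history.record("rds-latency", "raise-max-connections", resolved=True)
history.record("rds-latency", "raise-max-connections", resolved=True)
best_action, best_rate = history.ranked_actions("rds-latency")[0]
```

A production version would need persistence and some weighting for recency and sample size; this sketch only shows the record-then-rank shape.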
Bedrock AgentCore architecture
Bedrock AgentCore is the agent orchestration layer. It converts AWS service APIs into MCP tool servers. A LangGraph or AutoGen agent consuming these MCP tools can investigate and act across the AWS stack without custom integration code.
# Bedrock AgentCore: MCP tool registration
# Each MCP server wraps one or more AWS APIs
mcp_tools:
  cloudwatch-metrics:
    api: "cloudwatch:GetMetricData"
    access: read
    description: "Query CloudWatch metrics for any namespace/metric/dimension"
  eks-kubectl:
    api: ["eks:DescribeCluster", "eks:ListNodegroups"]
    kubectl_access: true   # Via EKS API server
    access: read-write     # Write actions go to require_human boundary
  rds-describe:
    api: "rds:DescribeDBInstances"
    access: read
  codedeploy-create:
    api: "codedeploy:CreateDeployment"
    access: write          # Always in require_human list
  cloudtrail-lookup:
    api: "cloudtrail:LookupEvents"
    access: read

Worked example: EC2 + ELB + RDS incident
A production incident scenario: database query latency spikes, ELB 5xx errors increase, user-facing error rate climbs. The agent investigates and remediates.
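For a scenario like this, the read/write boundary from the tool registration earlier in this article could be enforced with a small routing step. The sketch below is illustrative only: the step sequence is an assumed plan for this incident, not output from the agent, and `ToolCall`, `route`, and `split_by_approval` are hypothetical names. The tool names match the MCP registration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str    # MCP tool name from the registration
    access: str  # "read" or "write"
    reason: str


def plan_incident_steps():
    """Assumed step sequence for the latency + 5xx scenario."""
    return [
        ToolCall("cloudwatch-metrics", "read", "Confirm ELB 5xx and RDS latency spikes"),
        ToolCall("rds-describe", "read", "Check DB instance status"),
        ToolCall("cloudtrail-lookup", "read", "Find recent changes"),
        ToolCall("codedeploy-create", "write", "Roll back the last deployment"),
    ]


def route(call):
    """Only plain reads are auto-approved; anything else fails closed
    to the human-approval queue."""
    return "auto_approve" if call.access == "read" else "require_human"


def split_by_approval(calls):
    """Group tool names by the approval path they would take."""
    routed = {"auto_approve": [], "require_human": []}
    for call in calls:
        routed[route(call)].append(call.tool)
    return routed


routed = split_by_approval(plan_incident_steps())
# routed["require_human"] == ["codedeploy-create"]
```

Treating every access level other than `read` as requiring approval means a tool registered as `read-write` is also held for a human.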
Least-privilege IAM policy for an SRE agent
# Least-privilege IAM policy for an AWS SRE agent
# Read actions: always in the auto_approve list
# Write actions (codedeploy:CreateDeployment, rds:ModifyDBParameterGroup, ecs:UpdateService):
#   always in the require_human list, and scoped to us-east-1 via aws:RequestedRegion
# Never grant: *Delete*, *Terminate*, *Destroy*, iam:*, s3:Delete*
# These notes sit outside the policy document because JSON does not allow comments.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnly",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarms",
        "logs:FilterLogEvents",
        "logs:GetLogEvents",
        "ec2:DescribeInstances",
        "ec2:DescribeSecurityGroups",
        "rds:DescribeDBInstances",
        "rds:DescribeEvents",
        "ecs:DescribeTasks",
        "eks:DescribeCluster",
        "cloudtrail:LookupEvents",
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    },
    {
      "Sid": "WriteWithCondition",
      "Effect": "Allow",
      "Action": [
        "codedeploy:CreateDeployment",
        "rds:ModifyDBParameterGroup",
        "ecs:UpdateService"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    }
  ]
}

Estimated monthly cost of an AWS SRE agent (April 2026)
The estimate assumes a 1,000-service organisation with moderate incident volume (50 incidents/week, 30% handled by the agent).
| Component | Basis | Est. monthly cost |
|---|---|---|
| Bedrock AgentCore | Agent runtime + tool invocations | $200-800 |
| Claude Sonnet 4.5 (Bedrock) | ~15 incidents/day x 10K tokens = 4.5M tokens/mo | $675 |
| CloudWatch queries | 50 metric queries per incident x 780 incidents/mo | $80 |
| CloudTrail lookups | 5 lookups per incident x 780 | $20 |
| VPC / compute | Agent runtime (Lambda or ECS Fargate) | $100-300 |
| Total estimate | 1,000-service org, 50 incidents/week, 30% agent-handled | $1,075-1,875/mo |
Note: These are rough estimates based on April 2026 Bedrock pricing. Actual costs depend on model choice, incident volume, tool invocation frequency, and data transfer costs. Use the ROI calculator to model your savings against this cost.
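The total row can be reproduced by summing the low and high ends of each line item. This small sketch takes the table's line-item figures as given rather than re-deriving them from incident volume; the `LINE_ITEMS` name and structure are just for illustration.

```python
# (low, high) estimated monthly cost in USD, copied from the table above.
LINE_ITEMS = {
    "Bedrock AgentCore": (200, 800),
    "Claude Sonnet 4.5 (Bedrock)": (675, 675),
    "CloudWatch queries": (80, 80),
    "CloudTrail lookups": (20, 20),
    "VPC / compute": (100, 300),
}


def total_range(items):
    """Sum the low and high ends of every line item separately."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


print(total_range(LINE_ITEMS))  # (1075, 1875)
```

Swapping in your own per-component figures gives a quick sanity check before using a fuller cost model.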