Last verified April 2026

Agentic Runbooks on AWS: DevOps Agent, Bedrock AgentCore, and the 2026 Stack

AWS has shipped the most complete native agentic runbook stack of any cloud provider in 2026. Here is what each component does and how to compose them.

The AWS native stack for agentic runbooks

AWS DevOps Agent

GA 2026

The always-available SRE teammate. Cross-account investigation, topology intelligence, continuous learning.

AWS Bedrock AgentCore

GA 2026

Agent orchestration runtime. Wraps AWS APIs as MCP tools. Hosts LangGraph and AutoGen agents.

AWS CloudTrail

GA (existing)

Immutable audit log for all agent actions. Feeds the observability and learning_loop components of the runbook specification.

AWS IAM Identity Center

GA (existing)

Authentication and least-privilege policy for agent access. Issues time-bounded credentials.

Amazon CloudWatch

GA (existing)

Metrics and log source. Agent queries CloudWatch for incident context.

AWS DevOps Agent capabilities

AWS DevOps Agent is described by AWS as an "always-available AI SRE teammate". It has three distinguishing capabilities not yet matched by point-solution vendors:

Cross-account investigation

The agent can query resources across multiple AWS accounts in a single investigation, which is useful for organisations with account-per-environment or account-per-team strategies. When a production service fails and the root cause lies in a shared service in a different account, the agent traverses the account boundary automatically.
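For a custom agent built on the same pattern, crossing an account boundary usually means assuming an IAM role in the target account through AWS STS. The sketch below is a minimal illustration under stated assumptions: the role name `SreAgentReadOnly` is hypothetical, and the STS client is passed in as a parameter (for example `boto3.client("sts")`). It does not describe how AWS DevOps Agent implements this internally.

```python
from typing import Any, Dict

# Hypothetical role name; each target account would need to define it and
# trust the agent's home account.
INVESTIGATION_ROLE = "SreAgentReadOnly"


def role_arn(account_id: str, role_name: str = INVESTIGATION_ROLE) -> str:
    """Build the ARN of the investigation role in a target account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_investigation_role(sts_client: Any, account_id: str,
                              incident_id: str) -> Dict[str, str]:
    """Request short-lived credentials for one target account.

    sts_client is expected to behave like boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn(account_id),
        RoleSessionName=f"incident-{incident_id}",  # recorded in CloudTrail
        DurationSeconds=900,  # 15 minutes, the minimum STS accepts
    )
    creds = response["Credentials"]
    # Keys match the keyword arguments boto3.Session accepts.
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

Naming the session after the incident keeps every cross-account call traceable to the investigation that triggered it.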

Topology intelligence

The agent understands AWS resource relationships: EC2 instances in an Auto Scaling Group behind an ELB, ECS tasks on EC2 instances, RDS replicas, VPC route tables. When investigating an incident, it navigates the topology rather than requiring the engineer to trace dependencies manually.
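A topology walk of this kind can be modelled as a graph traversal. The following is a minimal sketch over a hand-written dependency map; the resource names are hypothetical, and a real agent would presumably build the graph from Describe calls rather than a literal.

```python
from collections import deque
from typing import Dict, List

# Hypothetical topology: each resource maps to the resources it depends on.
TOPOLOGY: Dict[str, List[str]] = {
    "elb/web": ["asg/web-fleet"],
    "asg/web-fleet": ["ec2/i-aaa", "ec2/i-bbb"],
    "ec2/i-aaa": ["rds/orders-primary"],
    "ec2/i-bbb": ["rds/orders-primary"],
    "rds/orders-primary": [],
}


def upstream_dependencies(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first walk: every dependency reachable from start, nearest first."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:  # shared dependencies are visited once
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order
```

Starting from `elb/web`, the walk reaches the Auto Scaling Group, then both instances, then the shared RDS primary, which is the order in which an investigator would check each hop.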

Continuous learning

Actions taken during incidents are recorded and used to improve future recommendations. Over time, the agent learns the specific patterns and fix sequences that work in your environment. This is the learning_loop component of the runbook specification.
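One simple way to picture a learning loop is as a success-rate table keyed by incident pattern. The sketch below is an assumption-laden illustration using plain frequency counting; the class name, pattern labels, and fix labels are hypothetical, and nothing here reflects how AWS DevOps Agent actually learns.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class LearningLoop:
    """Tracks how often each fix resolved a given incident pattern."""

    def __init__(self) -> None:
        # pattern -> fix -> [successes, attempts]
        self._stats: Dict[str, Dict[str, List[int]]] = defaultdict(
            lambda: defaultdict(lambda: [0, 0])
        )

    def record(self, pattern: str, fix: str, resolved: bool) -> None:
        """Store the outcome of one remediation attempt."""
        entry = self._stats[pattern][fix]
        entry[0] += int(resolved)
        entry[1] += 1

    def recommend(self, pattern: str) -> List[Tuple[str, float]]:
        """Return known fixes for a pattern, highest success rate first."""
        fixes = self._stats.get(pattern, {})
        rates = [(fix, s / n) for fix, (s, n) in fixes.items()]
        return sorted(rates, key=lambda item: item[1], reverse=True)
```

A production system would also need to weight recent outcomes and account for small sample sizes; this version only shows the record-then-rank shape of the loop.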

Bedrock AgentCore architecture

Bedrock AgentCore is the agent orchestration layer. It converts AWS service APIs into MCP tool servers. A LangGraph or AutoGen agent consuming these MCP tools can investigate and act across the AWS stack without custom integration code.

# Bedrock AgentCore: MCP tool registration
# Each MCP server wraps one or more AWS APIs

mcp_tools:
  cloudwatch-metrics:
    api: ["cloudwatch:GetMetricData"]
    access: read
    description: "Query CloudWatch metrics for any namespace/metric/dimension"

  eks-kubectl:
    api: ["eks:DescribeCluster", "eks:ListNodegroups"]
    kubectl_access: true  # Via the EKS API server
    access: read-write    # Write actions fall under the require_human boundary

  rds-describe:
    api: ["rds:DescribeDBInstances"]
    access: read

  codedeploy-create:
    api: ["codedeploy:CreateDeployment"]
    access: write          # Always in require_human list

  cloudtrail-lookup:
    api: ["cloudtrail:LookupEvents"]
    access: read
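The `access` field in the registration above is what drives the approval boundary. As a rough sketch, the split into auto_approve and require_human lists could be derived like this; the dict simply mirrors the tool names and access levels above, and the function name is hypothetical.

```python
from typing import Dict, List

# Mirrors the access levels in the registration above, as a plain dict.
MCP_TOOLS: Dict[str, str] = {
    "cloudwatch-metrics": "read",
    "eks-kubectl": "read-write",
    "rds-describe": "read",
    "codedeploy-create": "write",
    "cloudtrail-lookup": "read",
}


def approval_lists(tools: Dict[str, str]) -> Dict[str, List[str]]:
    """Split tools into auto_approve (read-only) and require_human (the rest)."""
    lists: Dict[str, List[str]] = {"auto_approve": [], "require_human": []}
    for name, access in sorted(tools.items()):
        # Anything that is not exactly "read" fails closed into require_human,
        # including unknown or misspelled access levels.
        key = "auto_approve" if access == "read" else "require_human"
        lists[key].append(name)
    return lists
```

This version gates a read-write tool such as `eks-kubectl` as a whole, which is more conservative than gating individual write calls; per-call gating would need the agent to classify each invocation.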

Worked example: EC2 + ELB + RDS incident

A production incident scenario: database query latency spikes, ELB 5xx errors increase, user-facing error rate climbs. The agent investigates and remediates.

1. Observe (agent): CloudWatch anomaly detected: RDS QueryLatency P99 > 500ms, ELB 5xx rate > 3%

2. Topology walk (agent): Agent maps: user traffic -> ELB -> EC2 fleet (ASG) -> RDS primary. Identifies RDS as the upstream bottleneck.

3. CloudTrail lookup (agent): Agent queries CloudTrail for changes in the past 24 hours on the RDS instance and associated parameter groups.

4. Finding (agent): RDS parameter group change: max_connections reduced from 500 to 50 by a database migration script at 14:32 UTC.

5. Context retrieval (agent): Agent fetches past postmortem for similar RDS parameter group incident. Found: resolved by reverting parameter and rebooting read replicas.

6. Proposal (agent): Agent proposes modifying the RDS parameter group to set max_connections back to 500 and rebooting the read replicas. Requires approval (write action).

7. Approval (human): On-call DBA approves via Slack at 15:01 UTC.

8. Execute (agent): Agent modifies parameter group, initiates read replica reboot via CodeDeploy task.

9. Verify (agent): Agent monitors RDS QueryLatency and ELB 5xx rate for 5 minutes. Both return to baseline.

10. Report (agent): Agent posts resolution summary to Slack, resolves PagerDuty incident, saves action to learning loop.
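The verify step (step 9) can be sketched as a check over the samples collected during the monitoring window. This is a minimal illustration that uses the alarm thresholds from step 1 as a stand-in for "baseline"; the metric keys and function name are hypothetical, and fetching the samples from CloudWatch is left out.

```python
from typing import Dict, Sequence

# Thresholds taken from step 1 of the walkthrough.
THRESHOLDS: Dict[str, float] = {
    "rds_query_latency_p99_ms": 500.0,
    "elb_5xx_rate_pct": 3.0,
}


def recovered(samples: Dict[str, Sequence[float]],
              thresholds: Dict[str, float] = THRESHOLDS) -> bool:
    """True only if every metric has data and no sample exceeds its threshold."""
    for metric, limit in thresholds.items():
        values = samples.get(metric, [])
        if not values:  # missing data counts as "not verified"
            return False
        if any(value > limit for value in values):
            return False
    return True
```

Treating missing data as a failure keeps the agent from closing an incident just because a metric stopped reporting.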

Least-privilege IAM policy for an SRE agent

# Least-privilege IAM policy for an AWS SRE agent
# Read: always in auto_approve list
# Write: always in require_human list (every action in WriteWithCondition requires human approval)
# Write actions are region-scoped to us-east-1 via aws:RequestedRegion
# Never grant: *Delete*, *Terminate*, *Destroy*, iam:*
# IAM policy JSON does not permit comments, so all notes are kept in this header

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnly",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarms",
        "logs:FilterLogEvents",
        "logs:GetLogEvents",
        "ec2:DescribeInstances",
        "ec2:DescribeSecurityGroups",
        "rds:DescribeDBInstances",
        "rds:DescribeEvents",
        "ecs:DescribeTasks",
        "eks:DescribeCluster",
        "cloudtrail:LookupEvents",
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    },
    {
      "Sid": "WriteWithCondition",
      "Effect": "Allow",
      "Action": [
        "codedeploy:CreateDeployment",
        "rds:ModifyDBParameterGroup",
        "ecs:UpdateService"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    }
  ]
}
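The "never grant" rule is easy to enforce mechanically before a policy is attached. The sketch below is one possible lint, assuming the policy is already loaded as a Python dict; the function name and the pattern list are illustrative choices, not an AWS tool.

```python
import fnmatch
from typing import Any, Dict, List

# Patterns derived from the "never grant" rule; matched case-insensitively.
FORBIDDEN = ["*:delete*", "*:terminate*", "*:destroy*", "iam:*"]


def forbidden_grants(policy: Dict[str, Any]) -> List[str]:
    """Return Allow-ed actions that match a forbidden pattern or are wildcards."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM accepts a single statement object
        statements = [statements]
    hits: List[str] = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # IAM accepts a single action string
            actions = [actions]
        for action in actions:
            lowered = action.lower()
            # "*" and "service:*" implicitly include destructive actions.
            too_broad = lowered == "*" or lowered.endswith(":*")
            if too_broad or any(fnmatch.fnmatchcase(lowered, p) for p in FORBIDDEN):
                hits.append(action)
    return hits
```

An empty result means no explicitly forbidden or service-wide wildcard action is allowed. The check does not expand partial wildcards such as `rds:De*`, so it complements rather than replaces a review with IAM Access Analyzer or similar tooling.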

Estimated monthly cost of an AWS SRE agent (April 2026)

For a 1,000-service organisation with moderate incident volume (50 incidents/week, 30% handled by agent).

Component	Basis	Est. monthly cost
Bedrock AgentCore	Agent runtime + tool invocations	$200-800
Claude Sonnet 4.5 (Bedrock)	~15 incidents/day x 10K tokens = 4.5M tokens/mo	$675
CloudWatch queries	50 metric queries per incident x 780 incidents/mo	$80
CloudTrail lookups	5 lookups per incident x 780	$20
VPC / compute	Agent runtime (Lambda or ECS Fargate)	$100-300
Total estimate	1,000-service org, 50 incidents/week, 30% agent-handled	$1,075-1,875/mo

Note: These are rough estimates based on April 2026 Bedrock pricing. Actual costs depend on model choice, incident volume, tool invocation frequency, and data transfer costs. Use the ROI calculator to model your savings against this cost.
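The arithmetic behind the table can be reproduced in a few lines, which makes it easier to substitute your own volumes. The sketch below copies the table's figures as given and is not a price quote. Note that the headline assumption (50 incidents/week, 30% agent-handled) works out to 15 agent-handled incidents per week, while the table rows use 15 per day and 780 per month, so the constants are worth adjusting to your actual volume.

```python
from typing import Dict, Tuple

# Figures copied from the table above.
TOKENS_PER_INCIDENT = 10_000
INCIDENTS_PER_DAY = 15
DAYS_PER_MONTH = 30

monthly_tokens = TOKENS_PER_INCIDENT * INCIDENTS_PER_DAY * DAYS_PER_MONTH

# (low, high) monthly estimates in USD per component.
LINE_ITEMS: Dict[str, Tuple[int, int]] = {
    "Bedrock AgentCore": (200, 800),
    "Claude Sonnet 4.5 (Bedrock)": (675, 675),
    "CloudWatch queries": (80, 80),
    "CloudTrail lookups": (20, 20),
    "VPC / compute": (100, 300),
}


def total_range(items: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def implied_rate_per_million(cost_usd: float, tokens: int) -> float:
    """Blended USD per million tokens implied by a cost and a token count."""
    return cost_usd / (tokens / 1_000_000)
```

With the table's figures, `monthly_tokens` is 4,500,000, `total_range(LINE_ITEMS)` returns `(1075, 1875)`, and the model row implies a blended rate of $150 per million tokens, a figure worth checking against the current Bedrock price list before relying on the total.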
