Last verified April 2026

Agentic Runbooks for Kubernetes: Tools and Patterns (2026)

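Kubernetes is the most common target for agentic runbooks in 2026. High incident frequency, well-known failure modes, and strong RBAC primitives make it well suited to agent automation. This guide covers the ten most commonly automated incident patterns, the main K8s-focused vendors, and how to scope an agent's access with RBAC.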

Why Kubernetes is the primary agentic runbook target

Three properties make Kubernetes well suited to agentic automation. First, K8s failure modes are highly recognisable: CrashLoopBackOff, OOMKilled, ImagePullBackOff, and pods stuck in Pending all surface as explicit status fields that an agent can identify reliably. Second, Kubernetes provides a well-structured API for both read (describe, logs) and write (rollout restart, scale) operations, which makes it straightforward to define tool_scope boundaries. Third, RBAC is built in: you can scope an agent's kubectl access to a specific namespace and a specific set of verbs, which provides a strong security boundary.
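To make the first property concrete, here is a minimal Python sketch of how an agent might map a Pod object to one of these failure patterns. It assumes the Pod has already been fetched and parsed into a dict (for example from `kubectl get pod -o json`); the function name `classify_pod` and the returned labels are illustrative, not part of any vendor's API.

```python
from typing import Optional

IMAGE_PULL_REASONS = {"ImagePullBackOff", "ErrImagePull"}

def classify_pod(pod: dict) -> Optional[str]:
    """Return a failure-pattern label for a Pod dict, or None if nothing matches."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        state = cs.get("state", {})
        waiting = state.get("waiting") or {}
        # OOMKilled appears as the reason of the current or last termination.
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "OOMKilled"
        if waiting.get("reason") in IMAGE_PULL_REASONS:
            return "ImagePullBackOff"
        if waiting.get("reason") == "CrashLoopBackOff":
            return "CrashLoopBackOff"
    # Checked last: image-pull failures also leave the pod in the Pending phase.
    if status.get("phase") == "Pending":
        return "Pending"
    return None
```

The container-level checks run before the phase check so that a more specific label wins over the generic Pending one.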

The 10 most automated K8s incident patterns

CrashLoopBackOff pod

Trigger: Restart count > 5 in 10 minutes

Vendors: Komodor Klaudia, Shoreline, OpenSRE

Agent actions: Fetch logs, describe pod, identify cause (OOMKill, config error, probe failure), propose restart or config fix

MTTR data: 23 seconds (reported by Komodor)
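The trigger above is a sliding-window count. A minimal sketch follows, assuming the caller has already collected restart timestamps (Kubernetes itself exposes only a cumulative restartCount, so the timestamps would come from sampling or watch events). The function name and defaults are illustrative.

```python
from datetime import datetime, timedelta

def crashloop_triggered(restart_times, now,
                        threshold=5, window=timedelta(minutes=10)):
    """True when more than `threshold` restarts fall inside the trailing window."""
    recent = [t for t in restart_times if now - window <= t <= now]
    return len(recent) > threshold
```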

OOMKilled pod

Trigger: Pod termination reason = OOMKilled

Vendors: Komodor Klaudia, Shoreline

Agent actions: Fetch memory metrics, compare to memory limit, propose limit increase or identify memory leak

MTTR data: Not published (treated as a variant of CrashLoopBackOff)
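Comparing memory usage to the limit requires parsing Kubernetes quantity strings such as 512Mi. The sketch below handles a common subset of suffixes only (not exponent notation), and the 0.9 ratio is an assumed threshold, not a documented default. Telling a genuine leak apart from an undersized limit would additionally need the usage trend over time.

```python
# Binary and decimal suffixes used by Kubernetes resource quantities (subset).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
          "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    # Try two-letter suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # no suffix: plain bytes

def memory_verdict(peak_usage: str, limit: str, ratio: float = 0.9) -> str:
    """'at_limit' when peak usage is within `ratio` of the limit, else 'headroom'."""
    used, cap = parse_memory(peak_usage), parse_memory(limit)
    return "at_limit" if used >= ratio * cap else "headroom"
```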

ImagePullBackOff

Trigger: Pod status = ImagePullBackOff

Vendors: Komodor Klaudia

Agent actions: Verify image name and tag, check registry credentials, check network connectivity to registry

MTTR data: Typically fast (registry issues are usually quick to identify)
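Verifying the image name and tag starts with splitting the reference into its parts. The following sketch uses the usual Docker convention that the first path component is a registry host only if it contains a dot or a colon, or is `localhost`. The function names and warning strings are hypothetical.

```python
def parse_image(ref: str) -> dict:
    """Split an image reference into registry, repository, tag, and digest."""
    name, _, digest = ref.partition("@")
    first, sep, rest = name.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", name
    repo, colon, tag = path.rpartition(":")
    if not colon or "/" in tag:
        repo, tag = path, ""
    return {"registry": registry, "repository": repo, "tag": tag, "digest": digest}

def image_warnings(ref: str) -> list:
    """Flag references that are likely to cause surprising pulls."""
    img = parse_image(ref)
    if img["digest"]:
        return []  # pinned by digest, nothing to flag
    if not img["tag"]:
        return ["no tag: resolves to 'latest' implicitly"]
    if img["tag"] == "latest":
        return ["mutable 'latest' tag"]
    return []
```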

Pod stuck in Pending (scheduling failure)

Trigger: Pod in Pending state > 5 minutes

Vendors: Komodor Klaudia, Shoreline

Agent actions: Check node capacity, check resource requests vs available, check node selectors and taints

MTTR data: Variable (depends on root cause)
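The three checks listed for this pattern can be sketched as a single fit test of one pod against one node. The dict shapes here (`requests`, `allocatable_free`, `tolerated_taint_keys`) are simplified, hypothetical structures rather than the real Kubernetes API objects, and the real scheduler evaluates many more predicates.

```python
def unschedulable_reasons(pod: dict, node: dict) -> list:
    """List the reasons this pod cannot be placed on this node (empty if it fits)."""
    reasons = []
    # 1. Resource requests vs what the node still has free.
    for resource, need in pod["requests"].items():
        if need > node["allocatable_free"].get(resource, 0):
            reasons.append(f"insufficient {resource}")
    # 2. Node selector labels must all match.
    for key, value in pod.get("node_selector", {}).items():
        if node.get("labels", {}).get(key) != value:
            reasons.append(f"node selector {key}={value} not matched")
    # 3. NoSchedule taints must be tolerated.
    tolerated = set(pod.get("tolerated_taint_keys", []))
    for taint in node.get("taints", []):
        if taint["effect"] == "NoSchedule" and taint["key"] not in tolerated:
            reasons.append(f"untolerated taint {taint['key']}")
    return reasons
```

Running this across all nodes and aggregating the reasons gives the agent a summary similar in spirit to the scheduler's FailedScheduling event message.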

HPA scaling anomaly

Trigger: HPA unable to scale or scaling unexpectedly

Vendors: Shoreline

Agent actions: Check metric source, verify HPA target, compare current vs desired replicas

MTTR data: Not published
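To compare current and desired replicas, an agent can recompute what the HPA should want. The Kubernetes documentation gives the core formula as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with scaling skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below implements only that core rule; the real controller also accounts for pod readiness, missing metrics, and stabilization windows. The default min and max values are placeholders.

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the replica count the HPA's core formula would produce."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

If this value differs from the HPA's reported desired replicas, the metric source is the first thing to check.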

Stuck deployment rollout

Trigger: Rollout progress deadline exceeded

Vendors: PagerDuty Runbook Automation, Kubiya

Agent actions: Describe deployment, check rollout status, identify stuck ReplicaSet, propose rollback or manual intervention

MTTR data: Not published
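When a rollout exceeds its progress deadline, the Deployment controller records a Progressing condition with status False and reason ProgressDeadlineExceeded. A minimal sketch of detecting that from a Deployment dict (for example parsed from `kubectl get deployment -o json`); the function name is illustrative.

```python
def rollout_stuck(deployment: dict) -> bool:
    """True when the Deployment reports that its progress deadline was exceeded."""
    for cond in deployment.get("status", {}).get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("status") == "False"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return True
    return False
```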

Node NotReady

Trigger: Node condition = NotReady > 2 minutes

Vendors: Shoreline, OpenSRE

Agent actions: Check node events, check kubelet status, cordon node, trigger pod eviction, notify team

MTTR data: Not published
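The trigger for this pattern can be evaluated from the node's Ready condition and its lastTransitionTime. The sketch assumes the Node object is available as a dict and that timestamps use the second-precision UTC format the API server normally returns; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def node_not_ready_for(node: dict, now: datetime,
                       grace: timedelta = timedelta(minutes=2)) -> bool:
    """True when the Ready condition has been False or Unknown for longer than `grace`."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") != "True":
            since = datetime.strptime(
                cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            return now - since > grace
    return False
```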

Certificate expiry

Trigger: Scheduled check: cert expiry < 14 days

Vendors: Shoreline, Kubiya

Agent actions: Identify expiring certificates via cert-manager, trigger renewal, verify

MTTR data: Proactive (prevents incident)
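The scheduled check reduces to a date comparison. cert-manager records the expiry of an issued certificate in the Certificate resource's status.notAfter field as an RFC 3339 timestamp; the sketch below assumes that string has already been read and is in second-precision UTC form. Function names are illustrative.

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a timestamp such as '2026-04-10T00:00:00Z'."""
    expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc)
    return (expiry - now).total_seconds() / 86400

def needs_renewal(not_after: str, now: datetime, threshold_days: float = 14) -> bool:
    """True when the certificate expires in fewer than `threshold_days` days."""
    return days_until_expiry(not_after, now) < threshold_days
```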

PVC full

Trigger: PVC usage > 90%

Vendors: Shoreline

Agent actions: Identify the PVC and workload, check growth rate, propose expansion or cleanup

MTTR data: Not published
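Checking the growth rate lets the agent decide how urgent the expansion is. A minimal sketch using a straight-line projection between the oldest and newest usage samples; this is a deliberate simplification (real usage is rarely linear), and the function name and sample format are assumptions.

```python
def hours_until_full(samples, capacity_bytes):
    """Project hours until the volume fills.

    samples: list of (hours_since_start, used_bytes), oldest first.
    Returns None when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    rate = (used1 - used0) / (t1 - t0)  # bytes per hour
    if rate <= 0:
        return None
    return (capacity_bytes - used1) / rate
```

A short projected time favours immediate expansion; a long one leaves room to propose cleanup instead.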

Service mesh connectivity loss

Trigger: Error rate spike between specific services (Istio telemetry)

Vendors: OpenSRE, custom LangGraph agents

Agent actions: Check mTLS certs, check Envoy proxy config, check network policy, propose restart of affected sidecar

MTTR data: Not published
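The trigger for this pattern depends on what counts as a spike. One simple sketch compares the current error rate between two services against a baseline; the multiplier and the minimum floor are assumed values that would need tuning per service, and the request counts are assumed to come from mesh telemetry that has already been queried.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0

def is_spike(baseline: float, current: float,
             factor: float = 3.0, floor: float = 0.01) -> bool:
    """True when the current rate is above `floor` and more than `factor` times baseline."""
    return current >= floor and current > factor * baseline
```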

Tool deep-dive: K8s-native agentic runbook vendors

Komodor Klaudia

95% accuracy, Kubernetes-specialist

Klaudia is trained on thousands of production Kubernetes environments. It has deep context awareness of K8s object relationships: it knows that a failing deployment affects a service, which affects an ingress, which affects user traffic. This topology knowledge is its primary differentiator. The 95% accuracy claim is on Kubernetes failure patterns specifically.

Honest: Not useful outside K8s environments. Architecture knowledge is K8s-specific.

Pricing: Custom
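The topology idea described for Klaudia (a failing deployment affects a service, which affects an ingress, which affects user traffic) can be illustrated with a plain graph traversal. This is only a sketch of the general concept, not Komodor's implementation; the graph format and resource names are made up for the example.

```python
from collections import deque

def impacted(graph: dict, start: str) -> list:
    """Breadth-first list of everything downstream of `start`.

    graph maps a resource to the resources that depend on it.
    """
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order

# Hypothetical dependency edges for one workload.
topology = {
    "deployment/api": ["service/api"],
    "service/api": ["ingress/web"],
    "ingress/web": ["user-traffic"],
}
```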

Shoreline.io

120+ pre-built K8s notebooks, claimed 75% MTTR reduction

Shoreline Notebooks are interactive runbooks that can be automated. The 120+ pre-built notebooks covering common K8s incidents are the primary value proposition: teams can start automating without writing runbooks from scratch. The Shoreline Language (Op-spec) DSL is purpose-built for incident remediation.

Honest: Op-spec has a learning curve. The DSL is Shoreline-specific; migrating to another tool requires rewriting runbooks.

Pricing: Custom

Kubiya

Meta-agent orchestration, deterministic execution

Kubiya's architecture treats K8s, Terraform, and CI/CD as first-class tool domains. The meta-agent pattern (one orchestrator, multiple specialist agents) is well-suited to platform engineering teams managing complex, multi-tool environments. The deterministic execution guarantee (structured tool calls, not free-form LLM output) is important for compliance.

Honest: More complex to set up than single-agent tools. The meta-agent pattern adds latency on simple incidents.

Pricing: Custom
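The general idea behind deterministic execution (structured tool calls rather than free-form LLM output) can be sketched as a validator that rejects anything outside a fixed schema before it reaches the cluster. This is not Kubiya's API; the tool names, argument sets, and call format below are hypothetical.

```python
# Hypothetical allowlist: tool name -> exact set of required argument names.
ALLOWED_TOOLS = {
    "get_pod_logs": {"namespace", "pod"},
    "rollout_restart": {"namespace", "deployment"},
}

def validate_tool_call(call: dict) -> dict:
    """Return a normalised call, or raise ValueError if it is not allowed."""
    name = call.get("tool")
    if name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("args", {})
    if set(args) != ALLOWED_TOOLS[name]:
        raise ValueError(f"{name} expects arguments {sorted(ALLOWED_TOOLS[name])}")
    if not all(isinstance(v, str) and v for v in args.values()):
        raise ValueError("all arguments must be non-empty strings")
    return {"tool": name, "args": dict(args)}
```

Because the executor only ever sees validated calls, the same model output always maps to the same action or to a rejection, which is what makes the audit trail useful for compliance.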

OpenSRE (Tracer-Cloud)

Open source AI SRE toolkit

OpenSRE is the open-source reference implementation for Kubernetes AI SRE agents. It provides LangGraph patterns for K8s incident agents, vector database integration for postmortem retrieval, and integrations with Prometheus and PagerDuty. The Tracer-Cloud/opensre repository on GitHub is the practical starting point for teams that choose to build rather than buy.

Honest: Requires engineering effort to deploy and maintain. No vendor support. Community-maintained.

Pricing: Free (open source)

RBAC and security: scoping an agent's kubectl access

An agent with unrestricted kubectl access can delete anything in any namespace. This is never the right configuration. Scope agent access to the minimum required permissions using Kubernetes RBAC.

# Minimal RBAC for a CrashLoop remediation agent
# Scoped to the 'production' namespace only

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sre-agent
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sre-agent-role
  namespace: production   # Namespace-scoped, not ClusterRole
rules:
  # Read-only: always safe for auto_approve
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]      # Reading logs only needs get
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
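  # Write: the agent layer must require human approval before using this verb
  # (RBAC grants the permission; it cannot enforce the approval step itself)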
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["patch"]    # For rollout restart (uses patch, not delete)
  # Never grant: delete, deletecollection on any resource
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sre-agent-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: sre-agent
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: sre-agent-role
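The policy encoded in the manifest's comments (reads are auto-approved, writes need a human, deletes are never granted) can also be checked in code before a Role is applied. The sketch below assumes the Role has already been parsed into a dict by a YAML loader; `approval_for` and `lint_role` are hypothetical helper names, and the `auto_approve` label follows the comment in the manifest.

```python
READ_VERBS = {"get", "list", "watch"}
FORBIDDEN_VERBS = {"delete", "deletecollection", "*"}

def approval_for(verb: str) -> str:
    """Map an RBAC verb to the approval mode the agent layer should apply."""
    if verb in FORBIDDEN_VERBS:
        return "deny"
    if verb in READ_VERBS:
        return "auto_approve"
    return "human_approval"

def lint_role(role: dict) -> list:
    """Return a list of policy problems found in a parsed Role manifest."""
    problems = []
    if role.get("kind") != "Role":
        problems.append(f"use a namespaced Role, not {role.get('kind')}")
    for rule in role.get("rules", []):
        bad = FORBIDDEN_VERBS & set(rule.get("verbs", []))
        if bad:
            resources = ",".join(rule.get("resources", []))
            problems.append(f"forbidden verbs {sorted(bad)} on {resources}")
    return problems
```

Running `lint_role` in CI on every agent Role keeps a later edit from quietly widening the agent's blast radius.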

The full security threat model, including prompt injection into K8s object names and audit trail tamper protection, is at /security-considerations.
