This is an emerging category and best practices are still evolving. The code samples are illustrative; verify their security implications before production use. Data verified April 2026.

Agentic Runbooks for Kubernetes: Tools and Patterns (2026)

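Kubernetes is the most common target for agentic runbooks in 2026. High incident frequency, well-known failure modes, and strong RBAC primitives make it well suited to agent automation. This guide covers the most commonly automated incident patterns, the main Kubernetes-native tools, and how to scope an agent's access with RBAC.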

Why Kubernetes is the primary agentic runbook target

Three properties make Kubernetes ideal for agentic automation. First, its failure modes are highly recognisable: CrashLoopBackOff, OOMKilled, ImagePullBackOff, and pods stuck in Pending are patterns an agent can learn to identify reliably. Second, the Kubernetes API is well structured for both read operations (describe, logs) and write operations (rollout restart, scale), which makes tool_scope boundaries straightforward to define. Third, RBAC is built in: you can scope an agent's kubectl access to a specific namespace and a specific set of verbs, which provides a strong security boundary.
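To illustrate how recognisable these failure modes are, here is a minimal Python sketch that classifies a pod from the JSON returned by kubectl get pod -o json. The field names (status.phase, containerStatuses, state.waiting.reason, lastState.terminated.reason) come from the Kubernetes Pod API; the function name classify_pod and the rule that OOMKilled takes precedence over CrashLoopBackOff are assumptions made for this example.

```python
from typing import Optional

# Container "waiting" reasons this sketch treats as known failure patterns.
WAITING_PATTERNS = {"CrashLoopBackOff", "ImagePullBackOff"}


def classify_pod(pod: dict) -> Optional[str]:
    """Return a failure pattern name for a Pod object, or None if none matches."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        state = cs.get("state", {})
        last_state = cs.get("lastState", {})
        # Assumption: an OOM kill is the more specific diagnosis, so it wins
        # even when the container is currently waiting in CrashLoopBackOff.
        for terminated in (state.get("terminated"), last_state.get("terminated")):
            if terminated and terminated.get("reason") == "OOMKilled":
                return "OOMKilled"
        waiting = state.get("waiting") or {}
        if waiting.get("reason") in WAITING_PATTERNS:
            return waiting["reason"]
    # No container-level match: fall back to the pod phase.
    if status.get("phase") == "Pending":
        return "Pending"
    return None
```

A real agent would also look at events and logs before acting; this only shows that the first triage step can be a deterministic lookup rather than free-form reasoning.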

The 10 most automated K8s incident patterns

01. CrashLoopBackOff pod

Trigger: Restart count > 5 in 10 minutes
Vendors: Komodor Klaudia, Shoreline, OpenSRE
Agent actions: Fetch logs, describe pod, identify cause (OOMKill, config error, probe failure), propose restart or config fix
MTTR data: 23 seconds (as reported by Komodor)
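The trigger above can be sketched as a check over periodic samples of a container's cumulative restartCount. This is a simplified illustration: the sampling approach, the function names, and the handling of counter resets are assumptions, not how any listed vendor implements it.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def restarts_in_window(
    samples: List[Tuple[datetime, int]],
    now: datetime,
    window: timedelta = timedelta(minutes=10),
) -> int:
    """Restarts observed inside the window, from (timestamp, restartCount) samples."""
    recent = sorted((t, c) for t, c in samples if now - window <= t <= now)
    if len(recent) < 2:
        return 0
    # restartCount is cumulative; clamp at 0 in case the pod was recreated
    # and the counter started again.
    return max(0, recent[-1][1] - recent[0][1])


def crashloop_triggered(
    samples: List[Tuple[datetime, int]], now: datetime, threshold: int = 5
) -> bool:
    """True when more than `threshold` restarts happened in the last 10 minutes."""
    return restarts_in_window(samples, now) > threshold
```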

02. OOMKilled pod

Trigger: Pod termination reason = OOMKilled
Vendors: Komodor Klaudia, Shoreline
Agent actions: Fetch memory metrics, compare to memory limit, propose limit increase or identify memory leak
MTTR data: Not published (variant of CrashLoop)
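The comparison of memory metrics against the limit might look like the sketch below. The quantity parser handles only the common binary and decimal suffixes, and the decision rule (steadily rising usage suggests a leak, otherwise propose a limit 1.5 times higher) is a hypothetical heuristic chosen for illustration.

```python
from typing import List

# Subset of Kubernetes quantity suffixes; binary suffixes are listed first.
UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def diagnose_oom(samples: List[int], limit: str, headroom: float = 1.5) -> dict:
    """Classify an OOM kill from memory usage samples (bytes, oldest first)."""
    limit_bytes = parse_memory(limit)
    # Heuristic (assumption): usage that rises at every sample looks like a leak.
    rising = len(samples) >= 3 and all(b > a for a, b in zip(samples, samples[1:]))
    if rising:
        return {"verdict": "possible_leak", "limit_bytes": limit_bytes}
    proposed = int(max(limit_bytes, max(samples)) * headroom)
    return {"verdict": "propose_limit_increase", "proposed_limit_bytes": proposed}
```

Either verdict is a proposal for a human or a downstream approval step; raising a limit on a leaking workload only delays the next kill.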

03. ImagePullBackOff

Trigger: Pod status = ImagePullBackOff
Vendors: Komodor Klaudia
Agent actions: Verify image name and tag, check registry credentials, check network connectivity to registry
MTTR data: Not published; typically fast, because registry issues are usually quick to identify

04. Pod stuck in Pending (scheduling failure)

Trigger: Pod in Pending state > 5 minutes
Vendors: Komodor Klaudia, Shoreline
Agent actions: Check node capacity, check resource requests vs available, check node selectors and taints
MTTR data: Variable (depends on root cause)
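The three checks listed above (capacity, node selectors, taints) can be sketched as a per-node function that returns the reasons a pod cannot be scheduled there. The flat dictionaries used for pods and nodes are a simplified, hypothetical data model rather than the Kubernetes API shape, and toleration matching covers only the Equal and Exists operators.

```python
from typing import List


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration match: same key, compatible effect, Equal or Exists."""
    if toleration.get("key") != taint["key"]:
        return False
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")


def blockers(pod: dict, node: dict) -> List[str]:
    """Reasons this node cannot host the pod; an empty list means it fits."""
    reasons = []
    if pod["cpu_m"] > node["free_cpu_m"]:
        reasons.append("insufficient cpu")
    if pod["mem_bytes"] > node["free_mem_bytes"]:
        reasons.append("insufficient memory")
    for key, value in pod.get("node_selector", {}).items():
        if node.get("labels", {}).get(key) != value:
            reasons.append(f"node selector {key}={value} not matched")
    for taint in node.get("taints", []):
        # PreferNoSchedule is a soft preference and does not block scheduling.
        if taint["effect"] in ("NoSchedule", "NoExecute") and not any(
            tolerates(t, taint) for t in pod.get("tolerations", [])
        ):
            reasons.append(f"untolerated taint {taint['key']}")
    return reasons
```

If every node returns at least one reason, the agent can report the most common blocker instead of a generic "Pending" message.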

05. HPA scaling anomaly

Trigger: HPA unable to scale or scaling unexpectedly
Vendors: Shoreline
Agent actions: Check metric source, verify HPA target, compare current vs desired replicas
MTTR data: Not published

06. Stuck deployment rollout

Trigger: Rollout progress deadline exceeded
Vendors: PagerDuty Runbook Automation, Kubiya
Agent actions: Describe deployment, check rollout status, identify stuck ReplicaSet, propose rollback or manual intervention
MTTR data: Not published
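Detecting this trigger is a matter of reading the Deployment's status conditions: when the progress deadline is exceeded, Kubernetes sets the Progressing condition to status False with reason ProgressDeadlineExceeded. The sketch below assumes the Deployment JSON from kubectl get deployment -o json; the function name and the wording of the proposal are illustrative.

```python
from typing import Optional


def stuck_rollout_proposal(deployment: dict) -> Optional[str]:
    """Return a rollback proposal if the rollout exceeded its progress deadline."""
    name = deployment.get("metadata", {}).get("name", "<unknown>")
    for cond in deployment.get("status", {}).get("conditions", []):
        if (
            cond.get("type") == "Progressing"
            and cond.get("status") == "False"
            and cond.get("reason") == "ProgressDeadlineExceeded"
        ):
            # The agent only proposes; a human approves the write operation.
            return f"propose: kubectl rollout undo deployment/{name}"
    return None
```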

07. Node NotReady

Trigger: Node condition = NotReady > 2 minutes
Vendors: Shoreline, OpenSRE
Agent actions: Check node events, check kubelet status, cordon node, trigger pod eviction, notify team
MTTR data: Not published
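The two-minute threshold can be evaluated from the node's Ready condition and its lastTransitionTime, both standard fields in the Node status. The following is a sketch under that assumption; the function names are hypothetical, and a Ready status of either False or Unknown is treated as NotReady.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def not_ready_duration(node: dict, now: datetime) -> Optional[timedelta]:
    """How long the node has been NotReady, or None if it is Ready."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") != "True":
            # The API returns RFC 3339 timestamps ending in "Z"; older Python
            # versions of fromisoformat need an explicit UTC offset instead.
            since = datetime.fromisoformat(
                cond["lastTransitionTime"].replace("Z", "+00:00")
            )
            return now - since
    return None


def node_not_ready_triggered(
    node: dict, now: datetime, threshold: timedelta = timedelta(minutes=2)
) -> bool:
    duration = not_ready_duration(node, now)
    return duration is not None and duration > threshold
```

Waiting out the threshold before cordoning avoids reacting to brief kubelet heartbeat gaps.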

08. Certificate expiry

Trigger: Scheduled check: cert expiry < 14 days
Vendors: Shoreline, Kubiya
Agent actions: Identify expiring certificates via cert-manager, trigger renewal, verify
MTTR data: Proactive (prevents incident)
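A scheduled expiry check reduces to a date comparison. The sketch below assumes the expiry timestamp is read from the notAfter field that cert-manager records in a Certificate's status; the function name and the way the timestamp reaches it are illustrative.

```python
from datetime import datetime, timedelta, timezone


def expires_soon(not_after: str, now: datetime, threshold_days: int = 14) -> bool:
    """True if the certificate expires within `threshold_days` (or already has)."""
    # Convert the trailing "Z" so fromisoformat accepts it on older Pythons.
    expiry = datetime.fromisoformat(not_after.replace("Z", "+00:00"))
    return expiry - now < timedelta(days=threshold_days)
```

An already expired certificate yields a negative remaining time and therefore also triggers, which is the behaviour a proactive check should have.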

09. PVC full

Trigger: PVC usage > 90%
Vendors: Shoreline
Agent actions: Identify the PVC and workload, check growth rate, propose expansion or cleanup
MTTR data: Not published
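Checking the growth rate lets the agent say how urgent the problem is, not just that the threshold was crossed. This sketch uses a linear extrapolation between two usage readings, which is an assumption; real growth is often bursty. All names are hypothetical, and the readings can be in any unit as long as it is consistent.

```python
from typing import Optional


def pvc_assessment(
    used_then: float, used_now: float, hours_between: float, capacity: float
) -> dict:
    """Assess a PVC from two usage readings taken `hours_between` hours apart."""
    usage = used_now / capacity
    growth_per_hour = (used_now - used_then) / hours_between
    # Only estimate time-to-full when the volume is actually growing.
    hours_to_full: Optional[float] = (
        (capacity - used_now) / growth_per_hour if growth_per_hour > 0 else None
    )
    return {
        "triggered": usage > 0.90,
        "usage": usage,
        "hours_to_full": hours_to_full,
    }
```

A short time-to-full argues for expansion now, while a flat growth rate suggests cleanup can wait for a human to review.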

10. Service mesh connectivity loss

Trigger: Error rate spike between specific services (Istio telemetry)
Vendors: OpenSRE, custom LangGraph agents
Agent actions: Check mTLS certs, check Envoy proxy config, check network policy, propose restart of affected sidecar
MTTR data: Not published

Tool deep-dive: K8s-native agentic runbook vendors

Komodor Klaudia

95% accuracy, Kubernetes-specialist

Klaudia is trained on thousands of production Kubernetes environments. It has deep context awareness of K8s object relationships: it knows that a failing deployment affects a service, which affects an ingress, which affects user traffic. This topology knowledge is its primary differentiator. Komodor's 95% accuracy claim applies specifically to Kubernetes failure patterns.

Honest: Not useful outside K8s environments. Architecture knowledge is K8s-specific.
Pricing: Custom

Shoreline.io

120+ pre-built K8s notebooks, claimed 75% MTTR reduction

Shoreline Notebooks are interactive runbooks that can be automated. The 120+ pre-built notebooks covering common K8s incidents are the primary value proposition: teams can start automating without writing runbooks from scratch. The Shoreline Language (Op-spec) DSL is purpose-built for incident remediation.

Honest: Op-spec has a learning curve. The DSL is Shoreline-specific; migrating to another tool requires rewriting runbooks.
Pricing: Custom

Kubiya

Meta-agent orchestration, deterministic execution

Kubiya's architecture treats K8s, Terraform, and CI/CD as first-class tool domains. The meta-agent pattern (one orchestrator, multiple specialist agents) is well-suited to platform engineering teams managing complex, multi-tool environments. The deterministic execution guarantee (structured tool calls, not free-form LLM output) is important for compliance.
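The idea of structured tool calls can be illustrated generically. The sketch below is not Kubiya's API: the tool registry, the JSON shape, and the function name are all hypothetical. It shows the principle that model output is only acted on when it parses as a known tool with exactly the expected arguments.

```python
import json

# Hypothetical tool registry: tool name -> exact set of required argument names.
ALLOWED_TOOLS = {
    "get_pod_logs": {"namespace", "pod"},
    "rollout_restart": {"namespace", "deployment"},
}


def parse_tool_call(raw: str) -> dict:
    """Validate model output as a structured tool call; raise ValueError otherwise."""
    call = json.loads(raw)  # free-form text fails here (JSONDecodeError is a ValueError)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError(f"arguments for {tool} must be exactly {sorted(ALLOWED_TOOLS[tool])}")
    return {"tool": tool, "args": args}
```

Because every executed action passes through one validator, the audit trail records a fixed vocabulary of operations, which is what makes the approach attractive for compliance.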

Honest: More complex to set up than single-agent tools. The meta-agent pattern adds latency on simple incidents.
Pricing: Custom

OpenSRE (Tracer-Cloud)

Open source AI SRE toolkit

OpenSRE is an open-source reference implementation for Kubernetes AI SRE agents. It provides LangGraph patterns for K8s incident agents, vector database integration for postmortem retrieval, and integrations with Prometheus and PagerDuty. The Tracer-Cloud/opensre repository on GitHub is the practical starting point for teams that decide to build rather than buy.

Honest: Requires engineering effort to deploy and maintain. No vendor support. Community-maintained.
Pricing: Free (open source)

RBAC and security: scoping an agent's kubectl access

An agent with unrestricted kubectl access can delete anything in any namespace. This is never the right configuration. Scope agent access to the minimum required permissions using Kubernetes RBAC.

# Minimal RBAC for a CrashLoop remediation agent
# Scoped to the 'production' namespace only

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sre-agent
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sre-agent-role
  namespace: production   # Namespace-scoped, not ClusterRole
rules:
  # Read-only: low risk and a reasonable auto_approve default (note that logs can expose sensitive data)
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  # Write: RBAC only grants the verb; the agent framework must enforce human approval before use
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["patch"]    # For rollout restart (uses patch, not delete)
  # Never grant: delete, deletecollection on any resource
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sre-agent-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: sre-agent
    namespace: production
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: sre-agent-role
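RBAC decides what the service account is able to do, but it cannot express "only after a human approves". That gate has to live in the agent layer. The Python sketch below mirrors the Role above: read verbs are auto-approved, patch requires approval, and everything else is denied. The auto_approve label follows the comment in the manifest; the other decision labels and the function name are hypothetical.

```python
READ_VERBS = {"get", "list", "watch"}
APPROVAL_VERBS = {"patch"}  # e.g. rollout restart


def approval_decision(
    verb: str, namespace: str, allowed_namespace: str = "production"
) -> str:
    """Decide how the agent may proceed with a requested Kubernetes operation."""
    if namespace != allowed_namespace:
        return "deny"  # outside the namespace the Role is bound to
    if verb in READ_VERBS:
        return "auto_approve"
    if verb in APPROVAL_VERBS:
        return "require_human_approval"
    # delete, deletecollection, and anything not listed are never allowed.
    return "deny"
```

Keeping this check in the agent in addition to RBAC gives two independent layers: if the gate has a bug, the API server still rejects verbs the Role does not grant.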

The full security threat model, including prompt injection into K8s object names and audit trail tamper protection, is at /security-considerations.
