Emerging category, best practices evolving. Code samples illustrative. Verify security implications before production use. Data verified April 2026.

Agentic Runbook Use Cases: What AI Agents Are Actually Doing in Production

Agentic runbooks in 2026 are not replacing on-call. They are removing toil from the top 20% of incidents: the ones that are frequent, well-understood, and have clear remediation paths. Shoreline reports 50% auto-remediation rates; Komodor Klaudia claims 95% accuracy on Kubernetes environments. Here is what that looks like in practice.

The use cases that do not work yet (honest assessment)

- Customer-facing incident communications (tone, legal risk, relationship sensitivity)
- Novel incidents outside training distribution (by definition, no runbook exists)
- Multi-system cascading failures (too many variables, attribution unclear)
- Destructive data operations (deletes, migrations, schema changes)

01. Pod crash-loop remediation (Kubernetes)

Trigger

CrashLoopBackOff alert from Prometheus or Datadog

Agent actions

Fetch logs, describe pod, query recent deploys, propose restart or rollback

Vendor support (April 2026)

Komodor Klaudia, Shoreline, OpenSRE

MTTR data

23 seconds, as reported by Komodor for Kubernetes environments

Honest limit: Cannot diagnose novel application bugs. Handles infrastructure-layer failures well, not application logic errors.
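The proposal step in this use case can be sketched as a small decision function. This is a minimal illustration and not any vendor's logic: the PodFacts fields, the 30-minute deploy window, and the returned action names are all assumptions made for the example. "OOMKilled" is a real Kubernetes termination reason; everything else is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PodFacts:
    """Hypothetical summary an agent might assemble from logs, pod description, and deploy history."""
    termination_reason: str                      # e.g. "OOMKilled" or "Error"
    minutes_since_last_deploy: Optional[float]   # None if no recent deploy was found


def propose_action(facts: PodFacts, deploy_window_minutes: float = 30.0) -> str:
    """Return a proposal for a human to approve; this function never executes anything."""
    recently_deployed = (
        facts.minutes_since_last_deploy is not None
        and facts.minutes_since_last_deploy <= deploy_window_minutes
    )
    if recently_deployed:
        # Crash loop began shortly after a deploy: the deploy is the likeliest suspect.
        return "propose_rollback"
    if facts.termination_reason == "OOMKilled":
        # Infrastructure-layer failure with a well-known shape.
        return "propose_restart_and_flag_memory_limit"
    # Anything else may be an application bug, which is outside what the agent can diagnose.
    return "escalate_to_on_call"
```

The fall-through to escalation reflects the limit above: when the failure does not match a known infrastructure pattern, the sketch hands off instead of guessing.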
02. Deployment rollback on error-rate spike

Trigger

5xx error rate > 2% for 3 minutes (Datadog, Prometheus)

Agent actions

Diff current vs previous deployment, fetch deploy history, propose rollback to last stable version, execute after approval

Vendor support (April 2026)

PagerDuty AIOps, Rootly, Traversal

MTTR data

Traversal reports 38% MTTR reduction at DigitalOcean, saving 36,000 engineering hours/year

Honest limit: Requires clean deployment history. If the deploy manifest is missing or git-diff is ambiguous, the agent cannot identify the correct rollback target.
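The trigger and the rollback-target selection can be sketched as two small functions. This is an illustration under stated assumptions: one 5xx-rate sample per minute, a deploy history ordered oldest to newest, and hypothetical "version" and "stable" field names. It does not reflect how any listed vendor implements this.

```python
from typing import Optional, Sequence


def error_rate_breached(samples: Sequence[float], threshold: float = 0.02,
                        required_samples: int = 3) -> bool:
    """True if the most recent `required_samples` readings all exceed the threshold.

    Assumes one sample per minute, so 3 samples approximate "3 minutes".
    """
    if len(samples) < required_samples:
        return False
    return all(s > threshold for s in samples[-required_samples:])


def pick_rollback_target(history: Sequence[dict]) -> Optional[str]:
    """Return the newest version marked stable before the current deploy, or None.

    `history` is ordered oldest to newest; the last entry is the current deploy.
    """
    for deploy in reversed(history[:-1]):
        if deploy.get("stable"):
            return deploy["version"]
    return None  # missing or ambiguous history: do not guess a target
```

Returning None when no stable predecessor is recorded mirrors the limit above: without clean deployment history there is no safe rollback target to propose.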
03. Certificate expiry detection and rotation

Trigger

Scheduled (nightly): certificates expiring within 14 days

Agent actions

Enumerate certs in scope, identify expiring, initiate cert-manager rotation, verify renewal, update Slack and Jira

Vendor support (April 2026)

Shoreline, Kubiya, AWS DevOps Agent

MTTR data

Proactive pattern: prevents incident rather than reducing MTTR. Estimated 2-4 hours of on-call toil eliminated per renewal cycle.

Honest limit: Certificates managed outside Kubernetes (load balancer SSL termination, CDN) require separate integration. Not all cert-manager providers are supported by every vendor.
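The "identify expiring" step reduces to a date comparison. A minimal sketch, assuming the agent has already collected each certificate's expiry time into a dictionary (the function name and input shape are hypothetical); how the expiry times are enumerated depends on the environment and is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expiring_certs(not_after_by_name: Dict[str, datetime], now: datetime,
                   window_days: int = 14) -> List[str]:
    """Return names of certificates expiring within the window, soonest first.

    Certificates that have already expired are included, since they also need rotation.
    """
    cutoff = now + timedelta(days=window_days)
    due = [(expiry, name) for name, expiry in not_after_by_name.items() if expiry <= cutoff]
    return [name for _, name in sorted(due)]


if __name__ == "__main__":
    now = datetime(2026, 4, 1, tzinfo=timezone.utc)
    certs = {"api": now + timedelta(days=3), "web": now + timedelta(days=30)}
    print(expiring_certs(certs, now))
```

Using timezone-aware datetimes throughout avoids comparing naive and aware values, which Python rejects with a TypeError.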
04. Cost anomaly detection and scale-down

Trigger

Cloud cost spike: AWS/GCP/Azure spend > 30% above 7-day average

Agent actions

Query cost explorer API, identify over-provisioned resources (idle EC2, oversized RDS, unused load balancers), propose scale-down or termination, get approval, execute

Vendor support (April 2026)

Datadog Bits AI, AWS DevOps Agent, Kubiya

MTTR data

N/A (cost reduction use case). Typical savings: 15-40% of flagged over-provisioned resources, per AWS DevOps Agent documentation.

Honest limit: Requires Cost Explorer or equivalent API access. Scale-down of production databases always requires human approval. Spot market recommendations need historical pricing context.
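The trigger condition is simple arithmetic over daily spend. A minimal sketch, assuming daily totals have already been pulled from a cost API (the function name and input shape are hypothetical):

```python
from statistics import mean
from typing import Sequence


def is_cost_anomaly(previous_7_days: Sequence[float], today: float,
                    threshold: float = 0.30) -> bool:
    """True if today's spend exceeds the 7-day average by more than `threshold` (30% by default)."""
    if len(previous_7_days) != 7:
        raise ValueError("expected exactly 7 daily totals")
    baseline = mean(previous_7_days)
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (today - baseline) / baseline > threshold
```

For example, with a baseline of 100 per day, a day at 131 trips the trigger and a day at 120 does not. The sketch only detects the spike; identifying which resources to scale down, and getting approval, are separate steps.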
05. Auth and login spike response

Trigger

Login volume 10x above baseline for 10 minutes

Agent actions

Query auth logs, classify spike (marketing campaign vs credential stuffing vs DDoS), route to appropriate runbook, update status page, optionally trigger rate-limit escalation

Vendor support (April 2026)

PagerDuty AIOps, incident.io AI workflows

MTTR data

Significant MTTA reduction: classification takes seconds, versus minutes of manual log triage. MTTR varies by response type.

Honest limit: Classification accuracy depends on training data. Mixed signals (campaign that coincides with bot traffic) produce ambiguous classifications. Human review recommended for threat-vector determination.
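The trigger and the routing step might look like the sketch below. The classifier itself is out of scope here; the sketch assumes it returns a label and a confidence score. The runbook names, the label strings, and the 0.8 confidence floor are hypothetical.

```python
from typing import Sequence

# Hypothetical mapping from classifier label to runbook identifier.
RUNBOOKS = {
    "marketing_campaign": "runbook/scale-auth-service",
    "credential_stuffing": "runbook/rate-limit-escalation",
    "ddos": "runbook/ddos-mitigation",
}


def is_login_spike(per_minute_logins: Sequence[int], baseline_per_minute: float,
                   factor: float = 10.0, minutes: int = 10) -> bool:
    """True if each of the last `minutes` samples is at least `factor` times the baseline."""
    if baseline_per_minute <= 0 or len(per_minute_logins) < minutes:
        return False
    recent = per_minute_logins[-minutes:]
    return all(count >= factor * baseline_per_minute for count in recent)


def route_spike(label: str, confidence: float, min_confidence: float = 0.8) -> str:
    """Pick a runbook, or hand off to a human when the classification is weak or unknown."""
    if confidence < min_confidence or label not in RUNBOOKS:
        return "human-review"
    return RUNBOOKS[label]
```

Routing low-confidence results to human review is one way to handle the mixed-signal case described above, where a campaign and bot traffic overlap.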
06. Disk pressure auto-remediation

Trigger

Node disk usage > 85% (Prometheus node_exporter)

Agent actions

Identify large files and stale logs, propose cleanup or volume expansion, execute cleanup after approval, verify resolution

Vendor support (April 2026)

Shoreline, Komodor Klaudia, OpenSRE

MTTR data

Common use case with well-defined remediation. Shoreline reports 50% auto-remediation rate across all incident types.

Honest limit: Cannot safely delete application data without explicit scope definition. Only OS-layer logs and temp files belong in the auto_approve scope. Volume expansion on managed services has cost implications.
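An explicit scope definition can be as small as an allowlist of directory roots. The sketch below is one way to express it, assuming POSIX paths; the roots chosen (/var/log and /tmp) and the function name are illustrative, not a recommendation for any particular system.

```python
from pathlib import PurePosixPath

# Hypothetical auto_approve scope: OS-layer logs and temp files only.
AUTO_APPROVE_ROOTS = (PurePosixPath("/var/log"), PurePosixPath("/tmp"))


def in_auto_approve_scope(path: str) -> bool:
    """True only if `path` is strictly inside one of the allowlisted roots."""
    candidate = PurePosixPath(path)
    # Reject relative paths and any ".." component, which could escape the root.
    if not candidate.is_absolute() or ".." in candidate.parts:
        return False
    # `parents` excludes the path itself, so the root directories are never deletable.
    return any(root in candidate.parents for root in AUTO_APPROVE_ROOTS)
```

Anything outside the allowlist, such as a database data directory, falls through to the approval path. The check is purely lexical, so it does not resolve symlinks; a real deployment would need to account for that.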
07. Config drift detection and re-apply

Trigger

Scheduled or triggered by deployment pipeline

Agent actions

Compare running config against desired state (Terraform state, Helm values), identify drifted resources, propose re-apply, execute after approval

Vendor support (April 2026)

Kubiya, AWS DevOps Agent

MTTR data

Proactive pattern. Prevents config-drift-induced incidents. Not an MTTR metric.

Honest limit: Requires a clear desired-state source (Terraform, ArgoCD, Helm). If desired state is ambiguous or multiple sources conflict, the agent cannot safely determine the correct target.
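The comparison step, and the refusal to proceed when desired-state sources conflict, can be sketched with plain dictionaries. This assumes both desired and running config have been flattened to key/value pairs, which real Terraform state or Helm values would need a preprocessing step to reach; the function names are hypothetical.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"


def resolve_desired(sources: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge desired-state sources; raise if two sources disagree on the same key."""
    merged: Dict[str, Any] = {}
    owner: Dict[str, str] = {}
    for source_name, values in sources.items():
        for key, value in values.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"conflict on {key!r}: {owner[key]} vs {source_name}")
            merged[key] = value
            owner[key] = source_name
    return merged


def find_drift(desired: Dict[str, Any], running: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, running_value)} for every key that differs."""
    drift = {}
    for key in sorted(set(desired) | set(running)):
        want = desired.get(key, _MISSING)
        have = running.get(key, _MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift
```

Raising on conflict, instead of picking a winner, matches the limit above: if the sources disagree, the correct target is not something the agent can determine on its own.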
08. Cache flush on stale-data alerts

Trigger

Cache hit rate < 20% or stale-data user reports

Agent actions

Identify cache cluster, verify staleness via sampling, propose targeted or full flush, execute, monitor hit rate recovery

Vendor support (April 2026)

PagerDuty Runbook Automation, Shoreline

MTTR data

Typically 5-15 minutes of triage time saved.

Honest limit: Full cache flush can cause thundering-herd on the backing database. Agent must check backing-DB capacity before full flush. Targeted flush is safer but requires key-space mapping.
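The choice between a targeted and a full flush can be sketched as a guard around backing-database capacity. All thresholds and names below are assumptions for illustration; how stale fraction and database headroom are measured is environment-specific and not shown.

```python
def choose_flush(stale_fraction: float, key_space_mapped: bool,
                 db_headroom_fraction: float,
                 full_flush_threshold: float = 0.8,
                 min_db_headroom: float = 0.5) -> str:
    """Propose a flush strategy.

    stale_fraction: share of sampled keys found stale (0.0 to 1.0).
    key_space_mapped: whether the stale keys can be addressed individually or by prefix.
    db_headroom_fraction: spare capacity on the backing database (0.0 to 1.0).
    """
    if key_space_mapped and stale_fraction < full_flush_threshold:
        # Safer option: only the affected keys are evicted.
        return "targeted_flush"
    if db_headroom_fraction >= min_db_headroom:
        # The backing DB has room to absorb the refill traffic.
        return "full_flush"
    # A full flush would risk a thundering herd on the database.
    return "escalate_to_on_call"
```

The ordering encodes the preference stated above: targeted flush when the key space is known, full flush only after the capacity check passes.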
09. Noise suppression and alert correlation

Trigger

Alert storm: > 50 alerts firing in 5 minutes

Agent actions

Correlate alerts into incident groups by service and time window, suppress duplicates, create single incident record, notify on-call with summary

Vendor support (April 2026)

PagerDuty AIOps (91% alert reduction claim), BigPanda, Datadog Bits AI

MTTR data

MTTA reduction is the primary benefit. PagerDuty AIOps claims 91% alert volume reduction.

Honest limit: Correlation accuracy drops on novel infrastructure topologies. Over-eager suppression can hide critical secondary failures. Requires tuning per environment.
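Grouping by service and time window, with duplicate suppression, can be sketched in a few lines. This is a deliberately naive version using fixed time buckets, so related alerts that straddle a bucket boundary land in separate groups; commercial correlation engines use richer signals. The tuple layout and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Tuple[float, str, str]  # (unix_timestamp, service, alert_name)


def is_alert_storm(alerts: Sequence[Alert], now: float,
                   window_seconds: int = 300, threshold: int = 50) -> bool:
    """True if more than `threshold` alerts fired in the last `window_seconds`."""
    recent = [a for a in alerts if now - window_seconds <= a[0] <= now]
    return len(recent) > threshold


def correlate(alerts: Sequence[Alert],
              window_seconds: int = 300) -> Dict[Tuple[str, int], List[str]]:
    """Group alerts by (service, time bucket), keeping each alert name once per group."""
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for timestamp, service, name in sorted(alerts):
        bucket = int(timestamp // window_seconds)
        names = groups[(service, bucket)]
        if name not in names:  # suppress duplicates within the group
            names.append(name)
    return dict(groups)
```

Note that suppression here only drops repeats of the same alert name within a group; distinct alerts on the same service are kept, which helps avoid hiding a secondary failure.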
10. Postmortem drafting

Trigger

Incident resolved (manual or auto-trigger from incident tool)

Agent actions

Aggregate Slack thread, PagerDuty timeline, CLI commands run, deploy events, metric anomalies; draft structured postmortem in Confluence/Notion format

Vendor support (April 2026)

Rootly, incident.io, FireHydrant, PagerDuty

MTTR data

N/A (post-incident). Time saved: 1-3 hours of postmortem drafting per incident.

Honest limit: AI drafts the structure and timeline; human insight is required for root cause analysis and action items. Over-reliance on AI postmortems flattens organisational learning. The doc is scaffolding, not the analysis.
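The aggregation step is essentially a merge of timestamped events from several sources into one ordered timeline. A minimal sketch, assuming every source has already been normalised to (ISO-8601 UTC timestamp, source, description) tuples; the section names in the output are illustrative, not a Confluence or Notion template.

```python
from typing import Iterable, Tuple

Event = Tuple[str, str, str]  # (ISO-8601 UTC timestamp, source, description)


def draft_postmortem(title: str, *sources: Iterable[Event]) -> str:
    """Merge events from all sources into one timeline and leave analysis sections empty."""
    # Same-format ISO-8601 UTC timestamps sort correctly as plain strings.
    events = sorted(event for source in sources for event in source)
    lines = [f"Postmortem draft: {title}", "", "Timeline"]
    lines += [f"{ts} [{src}] {desc}" for ts, src, desc in events]
    lines += ["", "Root cause", "TODO (human): not filled in by the agent",
              "", "Action items", "TODO (human): not filled in by the agent"]
    return "\n".join(lines)
```

The root cause and action item sections are left as explicit placeholders, in keeping with the point above that the draft is scaffolding and the analysis belongs to people.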
11. RCA timeline generation

Trigger

Incident resolved, postmortem phase

Agent actions

Correlate metrics anomalies, deployment events, dependency health, and log patterns to generate a causal timeline; highlight contributing factors; flag known patterns from past incidents

Vendor support (April 2026)

Traversal (90%+ accuracy), Komodor Klaudia (95% on K8s), Rootly, Neubird

MTTR data

N/A (post-incident). Traversal reports 38% MTTR reduction partly from faster RCA on recurrence.

Honest limit: Accuracy is high on known incident shapes. Novel failures and cascading multi-system failures remain difficult. RCA output should be treated as a starting hypothesis, not a conclusion.
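One very simple way to produce a starting hypothesis is to list the events that preceded the anomaly, nearest first. The sketch below does only that; it is not how Traversal, Komodor, or any other vendor computes causality, and temporal proximity alone does not establish cause. The event tuple layout, the function name, and the 30-minute lookback are assumptions.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, str, str]  # (unix_timestamp, kind, description)


def rank_candidates(events: Sequence[Event], anomaly_start: float,
                    lookback_seconds: float = 1800.0) -> List[Event]:
    """Return events inside the lookback window before the anomaly, closest in time first."""
    window = [
        e for e in events
        if anomaly_start - lookback_seconds <= e[0] <= anomaly_start
    ]
    return sorted(window, key=lambda e: anomaly_start - e[0])
```

The output is a ranked list of candidates for a human to examine, not a conclusion.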
12. On-call schedule optimisation

Trigger

Weekly (scheduled) or on-demand from SRE lead

Agent actions

Analyse incident patterns by time of day, engineer workload, and expertise match; suggest schedule adjustments; flag overloaded engineers

Vendor support (April 2026)

PagerDuty Operations Console, incident.io scheduling AI

MTTR data

N/A (operational use case). Reduces toil from manual schedule management.

Honest limit: Does not replace human judgment on team capacity and engineer preferences. Treats on-call as a scheduling problem; burn-out signals are outside scope.
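Flagging overloaded engineers can start from a plain count of pages per person against the team average. A minimal sketch; the 1.5x factor, the function name, and the input shape are assumptions, and page count is only a rough proxy for load.

```python
from collections import Counter
from typing import List, Sequence


def flag_overloaded(pages: Sequence[str], roster: Sequence[str],
                    factor: float = 1.5) -> List[str]:
    """Return engineers paged more than `factor` times the roster average, sorted by name.

    pages: one entry per page received, holding the engineer's name.
    roster: everyone on the rotation, including engineers with zero pages.
    """
    if not roster:
        raise ValueError("roster must not be empty")
    counts = Counter(pages)
    average = len(pages) / len(roster)
    return sorted(name for name in roster if counts[name] > factor * average)
```

For a roster of three with 6, 2, and 1 pages, the average is 3 and only the engineer with 6 pages crosses the 4.5 cut-off. A count like this says nothing about burn-out or preferences, which is the limit noted above.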