Agentic Runbook Use Cases: What AI Agents Are Actually Doing in Production
Agentic runbooks in 2026 are not replacing on-call engineers. They are removing toil from the top 20% of incidents: the ones that are frequent, well understood, and have clear remediation paths. Shoreline reports a 50% auto-remediation rate; Komodor Klaudia claims 95% accuracy in Kubernetes environments. Here is what that looks like in practice.
The use cases that work in production today (honest assessment)
Pod crash-loop remediation (Kubernetes)
Trigger: CrashLoopBackOff alert from Prometheus or Datadog.
Agent actions: fetch pod logs, describe the pod, query recent deploys, propose a restart or rollback.
Tools: Komodor Klaudia, Shoreline, OpenSRE.
Impact: 23 seconds (Komodor, Kubernetes environments).
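The restart-vs-rollback decision above can be sketched as a small heuristic. This is a minimal illustration, not any vendor's actual logic; the function name, log markers, and thresholds are all assumptions:

```python
# Hypothetical triage heuristic for a CrashLoopBackOff alert: decide whether
# to propose a restart (transient failure) or a rollback (bad release).
# Names and thresholds are illustrative, not any vendor's API.

def propose_action(log_tail: list[str], deploys_last_hour: int) -> str:
    """Return 'rollback' if fatal startup errors coincide with a recent
    deploy, otherwise 'restart'."""
    fatal = any("panic" in line or "OOMKilled" in line or "config" in line.lower()
                for line in log_tail)
    if fatal and deploys_last_hour > 0:
        return "rollback"   # fatal error right after a deploy: suspect the release
    return "restart"        # looks transient: try a restart first

# A config error immediately after a deploy points at the release itself
print(propose_action(["FATAL: invalid config key 'db_hots'"], deploys_last_hour=1))
# → rollback
```

In practice the agent proposes this action and waits for human approval rather than executing it directly.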
Deployment rollback on error-rate spike
Trigger: 5xx error rate > 2% for 3 minutes (Datadog, Prometheus).
Agent actions: diff current vs. previous deployment, fetch deploy history, propose rollback to the last stable version, execute after approval.
Tools: PagerDuty AIOps, Rootly, Traversal.
Impact: Traversal reports a 38% MTTR reduction at DigitalOcean, saving 36,000 engineering hours per year.
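The sustained-threshold trigger is the part worth getting right, since a single noisy sample should not fire a rollback. A minimal sketch, with illustrative sample values:

```python
# Sketch of the trigger condition: 5xx rate above 2% for three consecutive
# 1-minute samples. Rates are fractions (0.02 == 2%); data is illustrative.

def should_rollback(error_rates: list[float], threshold: float = 0.02,
                    sustained: int = 3) -> bool:
    """True only if the last `sustained` samples all exceed `threshold`."""
    window = error_rates[-sustained:]
    return len(window) == sustained and all(r > threshold for r in window)

print(should_rollback([0.001, 0.004, 0.031, 0.045, 0.052]))  # → True
```

A one-sample spike (e.g. `[0.05, 0.01, 0.03]`) does not trip the condition, which is what keeps this trigger from amplifying alert noise.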
Certificate expiry detection and rotation
Trigger: scheduled (nightly) scan for certificates expiring within 14 days.
Agent actions: enumerate certificates in scope, identify those expiring, initiate cert-manager rotation, verify renewal, update Slack and Jira.
Tools: Shoreline, Kubiya, AWS DevOps Agent.
Impact: proactive pattern: prevents the incident rather than reducing MTTR. Estimated 2-4 hours of on-call toil eliminated per renewal cycle.
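The "identify those expiring" step reduces to a date comparison. A minimal sketch, assuming the cert inventory is already in hand (a real agent would pull it from cert-manager or a secrets store):

```python
from datetime import datetime, timedelta, timezone

# Sketch of the nightly scan: flag certificates whose notAfter date falls
# within the next 14 days. The inventory dict is illustrative.

def expiring_soon(certs: dict[str, datetime], days: int = 14) -> list[str]:
    cutoff = datetime.now(timezone.utc) + timedelta(days=days)
    return sorted(name for name, not_after in certs.items() if not_after <= cutoff)

now = datetime.now(timezone.utc)
inventory = {
    "api.example.com": now + timedelta(days=3),    # inside the window
    "web.example.com": now + timedelta(days=90),   # safe
}
print(expiring_soon(inventory))  # → ['api.example.com']
```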
Cost anomaly detection and scale-down
Trigger: cloud cost spike: AWS/GCP/Azure spend more than 30% above the 7-day average.
Agent actions: query the cost explorer API, identify over-provisioned resources (idle EC2 instances, oversized RDS instances, unused load balancers), propose scale-down or termination, get approval, execute.
Tools: Datadog Bits AI, AWS DevOps Agent, Kubiya.
Impact: N/A (cost-reduction use case). Typical savings: 15-40% of flagged over-provisioned resources, per AWS DevOps Agent documentation.
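The trigger itself is simple arithmetic over daily spend. A sketch with illustrative figures (a real agent would source these from the provider's cost API):

```python
from statistics import mean

# Sketch of the spike detector: today's spend vs. the trailing 7-day average.
# All figures are illustrative.

def cost_spike(daily_spend: list[float], today: float, pct: float = 0.30) -> bool:
    """True if today's spend exceeds the 7-day average by more than `pct`."""
    baseline = mean(daily_spend[-7:])
    return today > baseline * (1 + pct)

history = [1000, 980, 1020, 990, 1010, 1005, 995]  # 7-day baseline = 1000
print(cost_spike(history, today=1400))  # → True (40% above average)
```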
Auth and login spike response
Trigger: login volume 10x above baseline for 10 minutes.
Agent actions: query auth logs, classify the spike (marketing campaign vs. credential stuffing vs. DDoS), route to the appropriate runbook, update the status page, optionally trigger rate-limit escalation.
Tools: PagerDuty AIOps, incident.io AI workflows.
Impact: significant MTTA reduction (classification in seconds vs. minutes of manual log triage); MTTR varies by response type.
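The classification step is where seconds are saved. This is a deliberately crude illustration of one possible heuristic (failure rate plus source-IP diversity); the labels and thresholds are assumptions, not any vendor's model:

```python
# Hypothetical login-spike classifier. Credential stuffing tends to show a
# high failure rate from relatively few source IPs; a campaign shows mostly
# successful logins. Thresholds are illustrative.

def classify_spike(attempts: int, failures: int, unique_ips: int) -> str:
    failure_rate = failures / attempts
    if failure_rate > 0.5 and unique_ips < attempts * 0.1:
        return "credential-stuffing"   # concentrated, mostly failing
    if failure_rate > 0.5:
        return "ddos-or-abuse"         # distributed and failing
    return "likely-campaign"           # mostly succeeding: real users

print(classify_spike(attempts=50_000, failures=45_000, unique_ips=120))
# → credential-stuffing
```

The agent's output here is a routing decision, not a verdict; the matched runbook still decides the actual response.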
Disk pressure auto-remediation
Trigger: node disk usage > 85% (Prometheus node_exporter).
Agent actions: identify large files and stale logs, propose cleanup or volume expansion, execute cleanup after approval, verify resolution.
Tools: Shoreline, Komodor Klaudia, OpenSRE.
Impact: common use case with well-defined remediation; Shoreline reports a 50% auto-remediation rate across all incident types.
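The "identify large files and stale logs" step is a filter on size and age. A sketch over plain tuples (a real agent would walk the filesystem and check open file handles before proposing deletion):

```python
# Sketch of the cleanup-candidate step: pick files that are both large and
# stale. Paths, sizes, and cutoffs are illustrative.

def cleanup_candidates(files, min_bytes=100 * 2**20, min_age_days=7):
    """files: iterable of (path, size_bytes, age_days) tuples."""
    return [path for path, size, age in files
            if size >= min_bytes and age >= min_age_days]

sample = [
    ("/var/log/app.log.1", 500 * 2**20, 30),  # large and stale: candidate
    ("/var/log/app.log",   200 * 2**20, 0),   # active log: keep
    ("/tmp/cache.bin",      10 * 2**20, 90),  # stale but small: keep
]
print(cleanup_candidates(sample))  # → ['/var/log/app.log.1']
```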
Config drift detection and re-apply
Trigger: scheduled, or triggered by the deployment pipeline.
Agent actions: compare running configuration against desired state (Terraform state, Helm values), identify drifted resources, propose a re-apply, execute after approval.
Tools: Kubiya, AWS DevOps Agent.
Impact: proactive pattern; prevents config-drift-induced incidents. Not an MTTR metric.
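At its core, drift detection is a diff between desired and observed state. A minimal sketch over flat dicts (real Terraform state and Helm values are nested, but the comparison idea is the same):

```python
# Sketch of drift detection: report every key where desired and observed
# state disagree, as (desired, observed) pairs. Values are illustrative.

def drifted_keys(desired: dict, observed: dict) -> dict:
    keys = desired.keys() | observed.keys()
    return {k: (desired.get(k), observed.get(k))
            for k in keys if desired.get(k) != observed.get(k)}

desired  = {"replicas": 3, "image": "app:1.4.2", "cpu": "500m"}
observed = {"replicas": 5, "image": "app:1.4.2", "cpu": "500m"}
print(drifted_keys(desired, observed))  # → {'replicas': (3, 5)}
```

Showing both values per drifted key matters: the proposed re-apply is easier to approve when the reviewer can see exactly what will change back.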
Cache flush on stale-data alerts
Trigger: cache hit rate < 20%, or stale-data reports from users.
Agent actions: identify the cache cluster, verify staleness via sampling, propose a targeted or full flush, execute, monitor hit-rate recovery.
Tools: PagerDuty Runbook Automation, Shoreline.
Impact: typically 5-15 minutes of triage time saved.
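The "verify staleness via sampling" step is what keeps the agent from flushing a healthy cache. A sketch with dicts standing in for the cache and the source of truth (a real agent would hit Redis and the backing database):

```python
# Sketch of the staleness check: sample keys, compare cached values against
# the source of truth, and recommend a flush only past a mismatch threshold.
# Stores and thresholds are illustrative.

def should_flush(cache: dict, source: dict, sample_keys, max_stale=0.2) -> bool:
    stale = sum(1 for k in sample_keys if cache.get(k) != source.get(k))
    return stale / len(sample_keys) > max_stale

cache  = {"a": 1, "b": 2, "c": 99, "d": 98}   # c and d are stale
source = {"a": 1, "b": 2, "c": 3,  "d": 4}
print(should_flush(cache, source, ["a", "b", "c", "d"]))  # → True (50% stale)
```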
Noise suppression and alert correlation
Trigger: alert storm: more than 50 alerts firing in 5 minutes.
Agent actions: correlate alerts into incident groups by service and time window, suppress duplicates, create a single incident record, notify on-call with a summary.
Tools: PagerDuty AIOps (91% alert-reduction claim), BigPanda, Datadog Bits AI.
Impact: MTTA reduction is the primary benefit. PagerDuty AIOps claims a 91% reduction in alert volume.
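The simplest form of the correlation step is bucketing by service and time window. A sketch with epoch-second timestamps and illustrative data (production correlators also use topology and alert similarity, not just time):

```python
from collections import defaultdict

# Sketch of correlation: bucket alerts by (service, 5-minute window) so a
# storm collapses into a handful of incident groups.

def correlate(alerts, window=300):
    """alerts: iterable of (service, timestamp_seconds) tuples."""
    groups = defaultdict(list)
    for service, ts in alerts:
        groups[(service, ts // window)].append(ts)
    return groups

storm = [("checkout", 10), ("checkout", 45), ("checkout", 120),
         ("payments", 60), ("checkout", 400)]
print(len(correlate(storm)))  # → 3 incident groups instead of 5 alerts
```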
Postmortem drafting
Trigger: incident resolved (manual, or auto-triggered from the incident tool).
Agent actions: aggregate the Slack thread, PagerDuty timeline, CLI commands run, deploy events, and metric anomalies; draft a structured postmortem in Confluence/Notion format.
Tools: Rootly, incident.io, FireHydrant, PagerDuty.
Impact: N/A (post-incident). Time saved: 1-3 hours of postmortem drafting per incident.
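The mechanical part of the drafting step is merging events from several sources into one chronological timeline. A sketch emitting a markdown skeleton, with illustrative events:

```python
# Sketch of the drafting step: merge timeline events from several sources,
# sort chronologically, and emit a markdown skeleton for a human to finish.

def draft_postmortem(title: str, events: list[tuple[str, str, str]]) -> str:
    """events: (iso_time, source, description) tuples."""
    lines = [f"# Postmortem: {title}", "", "## Timeline"]
    for ts, source, desc in sorted(events):  # ISO timestamps sort lexically
        lines.append(f"- {ts} [{source}] {desc}")
    lines += ["", "## Contributing factors", "- TODO", "", "## Action items", "- TODO"]
    return "\n".join(lines)

events = [
    ("2026-01-05T10:14Z", "deploy", "api v2.3.1 rolled out"),
    ("2026-01-05T10:02Z", "pagerduty", "5xx alert fired"),
]
print(draft_postmortem("API error spike", events))
```

Note the TODO sections: the tools in this space draft the timeline but leave analysis and action items to the responders, which is why the time saved is 1-3 hours rather than the whole postmortem.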
RCA timeline generation
Trigger: incident resolved, postmortem phase.
Agent actions: correlate metric anomalies, deployment events, dependency health, and log patterns to generate a causal timeline; highlight contributing factors; flag known patterns from past incidents.
Tools: Traversal (90%+ accuracy), Komodor Klaudia (95% on Kubernetes), Rootly, Neubird.
Impact: N/A (post-incident). Traversal reports a 38% MTTR reduction, partly from faster RCA on recurrence.
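One of the simplest RCA heuristics behind a causal timeline is temporal proximity: the deploy nearest before the anomaly onset is the prime suspect. A sketch with illustrative epoch-second timestamps (real tools weigh this against dependency graphs and log patterns, not proximity alone):

```python
# Sketch of one RCA heuristic: among deploys that precede the anomaly onset,
# the nearest one is the prime suspect. Data is illustrative.

def likely_cause(anomaly_start: int, deploys: list[tuple[str, int]]):
    """deploys: (service, timestamp) tuples. Returns the service whose
    deploy most closely precedes the onset, or None."""
    prior = [(anomaly_start - ts, svc) for svc, ts in deploys if ts <= anomaly_start]
    return min(prior)[1] if prior else None

deploys = [("billing", 900), ("api", 1180), ("frontend", 1300)]
print(likely_cause(anomaly_start=1200, deploys=deploys))  # → 'api'
```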
On-call schedule optimisation
Trigger: weekly (scheduled), or on demand from an SRE lead.
Agent actions: analyse incident patterns by time of day, engineer workload, and expertise match; suggest schedule adjustments; flag overloaded engineers.
Tools: PagerDuty Operations Console, incident.io scheduling AI.
Impact: N/A (operational use case). Reduces toil from manual schedule management.
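The time-of-day analysis reduces to a frequency count over incident history. A minimal sketch with illustrative hours:

```python
from collections import Counter

# Sketch of the pattern-analysis step: count incidents by hour of day to
# find the windows that deserve heavier on-call coverage.

def busiest_hours(incident_hours: list[int], top: int = 2) -> list[int]:
    """incident_hours: hour-of-day (0-23) for each past incident."""
    return [hour for hour, _ in Counter(incident_hours).most_common(top)]

history = [3, 3, 3, 14, 14, 22, 3, 14, 9]  # illustrative incident hours
print(busiest_hours(history))  # → [3, 14]
```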