Incident Runbook Templates for Cloud CDN Failures
Ready-to-use runbooks for Cloudflare/CDN failures: detection, traffic rerouting, cache invalidation, DNS failover and postmortem templates for 2026.
When your CDN goes dark: ready-to-use runbook templates for Cloudflare and CDN failures
You’ll never be judged for failing to prevent a CDN outage; you will be judged for how slowly you recover. In 2026, engineering teams must treat Cloudflare and CDN failures like high-severity incidents: detect fast, route traffic safely, purge or serve stale caches, and produce airtight postmortems for SLA and audit purposes.
What this article gives you
Below are battle-tested, ready-to-use runbook templates you can copy into your incident manager (PagerDuty, Opsgenie, Prepared.cloud, etc.) and your runbook repo. Together, the templates cover detection, mitigation, traffic rerouting, cache invalidation, DNS failover, and a postmortem structure. I also include 2026 trends and automation tips so you can validate these during tabletop exercises.
Why this matters in 2026 — trends and context
Late 2025 and early 2026 saw renewed attention on CDN resiliency after multiple high-profile edge and DNS incidents. Businesses now adopt multi-CDN architectures, automated DNS failover, and SRE-driven SLOs for edge services. Regulators and auditors increasingly expect documented, tested continuity plans — especially for services that affect customers directly.
Two trends to keep in mind:
- Edge compute and originless architectures: Your CDN is now also compute. Failure modes include not just cache misses but failing edge functions.
- Automation and AI detection: Observability tooling with ML anomaly detection is common — but false positives require robust runbooks to avoid unnecessary DNS flapping.
Incident detection template — CDN failure (S1)
Objective: confirm a CDN-level failure within 5 minutes and classify impact.
- Trigger: Any of the following within 5 minutes:
  - Global 5XX rate > 3% for 5m across edge nodes
  - Sudden 50% drop in traffic to CDN with no upstream change
  - Monitoring alert: Cloudflare Load Balancer pool down or origin health failures
  - External reports (DownDetector, social) + internal telemetry
- Initial responder: On-call Infra/SRE and NetOps join the incident channel immediately.
- Confirm: Run the following checks (runbook commands):
  - Quick HTTP check (status code and time to first byte): curl -I -s -S https://www.example.com -o /dev/null -w "%{http_code} %{time_starttransfer}s\n"
  - Edge headers: curl -sI https://www.example.com | grep -Ei "cf-ray|via|x-cache"
  - Cloudflare API reachability (lists the zone's DNS records; an error or timeout points to control-plane trouble, while a success does not prove the edge is healthy): curl -s -X GET "https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records" -H "Authorization: Bearer $CF_API_TOKEN"
  - DNS check: dig +short www.example.com @8.8.8.8
- Classify impact: outage vs degradation; customer-impacting vs internal telemetry only. Assign severity S1/S2.
- Notify: Post initial summary in incident channel and kick off paging to stakeholders and customer comms owner.
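The two numeric triggers above can be encoded so that monitoring and humans apply the same thresholds. What follows is a minimal sketch, not a definitive implementation: the `EdgeWindow` shape and the `baseline_requests` field are hypothetical, and the way you obtain per-window counts from your telemetry is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EdgeWindow:
    """Aggregated edge metrics for one 5-minute window (hypothetical shape)."""
    total_requests: int
    errors_5xx: int
    baseline_requests: int  # assumed: typical request count for the same window


def cdn_triggers(window: EdgeWindow,
                 max_5xx_rate: float = 0.03,
                 max_traffic_drop: float = 0.50) -> list[str]:
    """Return the names of the numeric triggers that fire for this window."""
    fired = []
    if window.total_requests > 0:
        rate = window.errors_5xx / window.total_requests
        if rate > max_5xx_rate:  # template: global 5XX rate > 3%
            fired.append("5xx_rate")
    if window.baseline_requests > 0:
        drop = 1 - window.total_requests / window.baseline_requests
        if drop >= max_traffic_drop:  # template: sudden 50% drop in traffic
            fired.append("traffic_drop")
    return fired
```

For example, `cdn_triggers(EdgeWindow(1000, 40, 1000))` flags only the 5XX trigger. The non-numeric triggers (pool-down alerts, external reports) still need a human or a separate alert source.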
Mitigation runbook — immediate containment
Objective: contain customer impact and provide short-term relief for the first 30-90 minutes.
- Stabilize traffic:
  - If using Cloudflare Load Balancer, shift traffic to healthy pools using the UI or API. Example API call that points the load balancer at a healthy default pool and sets a fallback pool (placeholder IDs; this swaps the active pools rather than adjusting weights):
    curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers/{lb_id}" \
      -H "Authorization: Bearer $CF_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data '{"default_pools":["pool-id-A"],"fallback_pool":"pool-id-B"}'
  - If Cloudflare is partially degraded, enable Origin Pull fallback or temporarily reduce edge TTL to force fresh origin requests (careful with origin load).
- Serve stale / origin fallback: Configure your CDN or app to serve stale content if possible so users see last-known-good pages while you triage.
  - Cloudflare: enable Always Online or Workers that serve cached snapshots for critical routes.
- Cache purge restraint: Avoid wholesale purges unless necessary. If you must purge, target by tag/path using API:
    curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
      -H "Authorization: Bearer $CF_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data '{"files":["https://www.example.com/critical-path.js"]}'
- Temporary downgrade: If an edge compute function is failing, roll back the worker/function to the last known good revision.
- Throttle automated remediation: If you have automation touching DNS or CDN config, pause automations to avoid cascading changes.
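The purge restraint rule above can also be enforced in tooling, so that a script under incident pressure cannot send an untargeted purge by accident. This is a small sketch under stated assumptions: `build_purge_payload` is a hypothetical helper, and limiting each call to one selector (files or tags) is a conservative choice of this sketch, not a documented API constraint.

```python
import json


def build_purge_payload(files=None, tags=None) -> str:
    """Build a targeted purge body; refuse an empty (wholesale) request."""
    files = list(files or [])
    tags = list(tags or [])
    if not files and not tags:
        raise ValueError("refusing to build an untargeted purge")
    if files and tags:
        # Conservative choice for this sketch: one selector per call.
        raise ValueError("purge by files or by tags, not both in one call")
    body = {"files": files} if files else {"tags": tags}
    return json.dumps(body)
```

The returned string is meant to be passed as the `--data` body of the purge call shown in the list above.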
Traffic rerouting playbook — multi-CDN and DNS strategies
Objective: reroute traffic away from the failing CDN with minimal TTL churn and minimal customer impact.
Strategy A — Weighted multi-CDN failover (recommended)
- Precondition: You have two CDNs (Cloudflare + secondary CDN) behind a DNS or Traffic Manager.
- Action: Shift weight from failing CDN to secondary CDN in 10-20% steps over 5-10 minutes while monitoring error rate and origin load.
- Example (Route 53 weighted records; shows the 10/90 end state reached after the incremental steps, with placeholder IPs):
    aws route53 change-resource-record-sets --hosted-zone-id Z3... --change-batch '{
      "Changes": [
        { "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "www.example.com.", "Type": "A",
            "SetIdentifier": "cdn-primary", "Weight": 10, "TTL": 60,
            "ResourceRecords": [{"Value": "1.2.3.4"}] } },
        { "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "www.example.com.", "Type": "A",
            "SetIdentifier": "cdn-secondary", "Weight": 90, "TTL": 60,
            "ResourceRecords": [{"Value": "5.6.7.8"}] } }
      ]
    }'
- Monitor: Cache hit ratio, 5XX rate, RPS at secondary CDN. If errors spike, roll back weight changes.
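The incremental shift in Strategy A can be planned ahead of time rather than improvised during the incident. Here is a rough sketch, assuming weights are kept on a 0-100 scale so they read as percentages (Route 53 weights are relative values, so this scale is a convention, not a requirement). `weight_steps` is a hypothetical helper name.

```python
def weight_steps(start_primary: int = 100, target_primary: int = 10,
                 step: int = 15) -> list[tuple[int, int]]:
    """Return (primary, secondary) weight pairs, moving down in fixed steps."""
    if not 0 <= target_primary <= start_primary <= 100:
        raise ValueError("weights must satisfy 0 <= target <= start <= 100")
    if step <= 0:
        raise ValueError("step must be positive")
    steps = []
    primary = start_primary
    while primary > target_primary:
        # Never overshoot the target on the final step.
        primary = max(target_primary, primary - step)
        steps.append((primary, 100 - primary))
    return steps
```

Apply one pair, watch error rate and origin load for a minute or two, then apply the next. Reversing the returned list gives a matching rollback order.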
Strategy B — DNS failover (single CDN fallback)
- Use health checks from your DNS or traffic-management provider (Route 53, NS1, Akamai Global Traffic Management) to detect global failure and switch A/AAAA records to an alternate origin or CDN.
- Be aware: DNS TTL and client DNS caching mean this is slower and can cause split-brain experiences.
- Use low TTLs (60s) for high-impact records in production where rapid failover is required.
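To make the "slower" caveat concrete, you can estimate how long a DNS failover takes to reach clients. The sketch below is a simplified upper bound under loud assumptions: it adds the health-check detection time to one full TTL, and it ignores provider propagation delay and resolvers that hold records longer than the TTL. The parameter names and example values are illustrative, not quoted from any provider.

```python
def worst_case_failover_seconds(check_interval_s: int,
                                failure_threshold: int,
                                record_ttl_s: int) -> int:
    """Estimate: time to declare failure plus one full TTL of resolver caching."""
    if min(check_interval_s, failure_threshold, record_ttl_s) < 0:
        raise ValueError("inputs must be non-negative")
    detection = check_interval_s * failure_threshold
    return detection + record_ttl_s
```

With a 30-second check interval, three consecutive failures required, and a 60s TTL, the estimate is 150 seconds. Raising the TTL to 300s pushes it to 390 seconds, which is why the low TTL matters for records that need rapid failover.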
Cache invalidation and origin protection
Objective: avoid origin overload while ensuring critical fixes reach users.
- Targeted purge first: Purge specific assets (path, tag) rather than everything.
    curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
      -H "Authorization: Bearer $CF_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data '{"tags":["release-2026-01-16"]}'
- Stagger purges: Throttle purge rate to protect origin. Use background workers to rehydrate cache gradually.
- Origin scaling: If you must bypass CDN, ensure autoscaling policies are ready and confirm DB/third-party dependency capacity.
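Staggering can be as simple as splitting the URL list into batches and pausing between them. Below is a minimal sketch, assuming a `purge` callable that you supply (for example, a wrapper around the purge API call). The batch size and pause are placeholder values to tune against your provider's limits and your origin capacity.

```python
import time
from typing import Callable, Iterable, Iterator


def batches(urls: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` URLs."""
    if size <= 0:
        raise ValueError("size must be positive")
    chunk: list[str] = []
    for url in urls:
        chunk.append(url)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def staggered_purge(urls: Iterable[str],
                    purge: Callable[[list[str]], None],
                    size: int = 30,        # assumed batch size
                    pause_s: float = 5.0,  # assumed pause between batches
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `purge` once per batch, pausing between batches; return batch count."""
    count = 0
    for chunk in batches(urls, size):
        if count:
            sleep(pause_s)  # pause only between batches, not before the first
        purge(chunk)
        count += 1
    return count
```

The `sleep` parameter is injectable so the pacing can be exercised in a tabletop drill or unit test without real delays.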
Rollback & recovery runbook
- Confirmed fix: Declare the fix confirmed only when edge errors remain below threshold for 15 continuous minutes.
- Rollback plan: If you changed CDN weights, reverse incrementally. If you purged caches, warm caches by replaying traffic from a controlled source (bots or workers) for key pages.
- Validation: Validate globally using synthetic checks and actual user metrics (RUM) across regions.
- Close incident: Close when SLA metrics are restored and stakeholders agree; start the postmortem within 24 hours.
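The "15 continuous minutes" rule is easy to misjudge by eye on a dashboard, so it is worth encoding. This is a sketch under assumptions: one error-rate sample per minute, oldest first, and a 3% threshold borrowed from the detection template (your recovery threshold may differ).

```python
def is_recovered(error_rates: list[float],
                 threshold: float = 0.03,
                 required_minutes: int = 15) -> bool:
    """True if the most recent `required_minutes` samples are all below threshold."""
    if required_minutes <= 0:
        raise ValueError("required_minutes must be positive")
    if len(error_rates) < required_minutes:
        return False  # not enough history to confirm yet
    recent = error_rates[-required_minutes:]
    return all(rate < threshold for rate in recent)
```

A single sample at or above the threshold inside the window resets the clock, which matches the "continuous" wording in the runbook.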
Postmortem template — CDN failure
Produce a clear, auditable document suitable for engineering leadership and compliance.
- Summary (1-2 lines): Include impact window and services affected.
- Severity: S1/S2, SLA breach details (which SLAs were impacted and how much).
- Timeline: Minute-by-minute timeline with who did what. Use UTC timestamps. Include: detection time, mitigation start, traffic reroute completion, full recovery, incident close.
- Root cause: Technical cause + contributing factors (e.g., misconfigured edge rule, upstream DNS provider partial outage, rate-limited purge spikes).
- Impact metrics:
  - Total requests failed
  - Number of users affected
  - Downtime per region
  - SLA/Metrics: % uptime during window, pages/sec vs baseline
- Remediation & action items: Assign owners, due dates, and verification steps. Examples:
  - Implement automated weight rollback if error rate > 2% (owner: NetOps, due: 2 days)
  - Add replica origin and test origin shielding (owner: infra, due: 7 days)
  - Run tabletop exercise for DNS failover (owner: SRE, due: 14 days)
- Lessons learned & communications: Customer-facing message template and internal notes.
- Appendix: raw logs, diagnostic commands, API calls used, and artifacts for auditors.
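The timeline milestones above translate directly into the durations auditors usually ask for (time to detect, time to mitigate, time to recover). A small sketch, assuming timestamps are recorded as `YYYY-MM-DDTHH:MM:SSZ` strings and the event names are your own. The names and values in the example are made up for illustration.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def timeline_minutes(events: dict[str, str]) -> dict[str, float]:
    """Minutes elapsed from the 'start' event to every other milestone."""
    start = parse_utc(events["start"])
    return {
        name: (parse_utc(stamp) - start).total_seconds() / 60
        for name, stamp in events.items()
        if name != "start"
    }


if __name__ == "__main__":
    # Hypothetical incident timestamps, not taken from a real event.
    print(timeline_minutes({
        "start": "2026-01-16T10:00:00Z",
        "detected": "2026-01-16T10:04:00Z",
        "mitigation_start": "2026-01-16T10:09:30Z",
        "recovered": "2026-01-16T10:40:00Z",
    }))
```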
SLA calculations & customer communication
Be transparent. Provide a timeline and show how the outage affected SLA guarantees. For example, if SLA is 99.95% monthly uptime, compute downtime minutes and whether SLA credits apply:
- Monthly minutes = 30 * 24 * 60 = 43,200
- Allowed downtime for 99.95% = 43,200 * (1 - 0.9995) = 21.6 minutes
- If outage = 40 minutes, SLA breached; follow contract process for credits and customer notifications.
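The same arithmetic as a reusable helper, so the numbers in a customer notice can be reproduced from the postmortem data. This is a minimal sketch, assuming a 30-day month as in the example above. Check your contract for the actual measurement period and any exclusions.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given uptime SLA over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def achieved_uptime_percent(outage_minutes: float, days: int = 30) -> float:
    """Uptime percentage actually delivered over the period."""
    total_minutes = days * 24 * 60
    return 100 * (1 - outage_minutes / total_minutes)


def sla_breached(outage_minutes: float, sla_percent: float, days: int = 30) -> bool:
    """True if the outage exceeds the downtime budget."""
    return outage_minutes > allowed_downtime_minutes(sla_percent, days)
```

For the 40-minute outage above, `achieved_uptime_percent(40)` is roughly 99.907%, below the 99.95% commitment, so `sla_breached(40, 99.95)` returns `True`.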
Testing, drills, and automation (2026 best practices)
Runbooks are living documents. In 2026, best-practice teams run automated drills monthly and integrate runbooks with CI/CD and IaC:
- Automate synthetic failures during off-hours using feature flags and canary DNS changes.
- Use chaos engineering targeting edge behavior (simulate Cloudflare 5XX at the load balancer level).
- Store runbooks in Git and deploy them to your incident platform (Prepared.cloud, Opsgenie) via IaC.
- Use playbook templating with variables (zone_id, lb_id, secondary_cdn_ip) to avoid human error.
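One lightweight way to template runbook commands is Python's standard `string.Template`. The sketch below is illustrative only: the template text mirrors the purge command used in this article, `render` is a hypothetical helper, and your incident platform may provide its own variable mechanism.

```python
from string import Template

# `$$` escapes a literal dollar sign, so the shell variable $CF_API_TOKEN
# survives rendering and is expanded by the shell at run time instead.
PURGE_TEMPLATE = Template(
    'curl -X POST "https://api.cloudflare.com/client/v4/zones/${zone_id}/purge_cache" '
    '-H "Authorization: Bearer $$CF_API_TOKEN" '
    '-H "Content-Type: application/json" '
    "--data '{\"tags\":[\"${tag}\"]}'"
)


def render(template: Template, **variables: str) -> str:
    """Render a command template; a missing variable raises KeyError."""
    return template.substitute(**variables)
```

Using `substitute` rather than `safe_substitute` is deliberate: a forgotten variable fails loudly at render time instead of producing a command with a literal `${zone_id}` in it.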
Concrete commands and API examples
Replace the placeholders ({zone_id}, example hostnames and IPs) before running these, wrap them in scripts you have tested safely, and ensure API tokens are scoped and rotated.
Cloudflare: targeted purge by tag
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"tags":["critical-js","release-2026-01"]}'
Health check — quick headers
curl -sI https://www.example.com | grep -Ei "^HTTP/|cf-cache-status:|cf-ray:|via:|x-cache:" || true
Route 53 weighted record change (JSON snippet; shows the cdn-secondary record only, so pair it with a matching UPSERT for cdn-primary in the same batch)
{
  "Comment": "Switching weights during incident",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": "cdn-secondary",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "5.6.7.8" }]
      }
    }
  ]
}
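Hand-editing JSON during an incident invites mistakes, so the change batch can be generated instead. A sketch under assumptions: the record names, set identifiers, and IPs are the placeholders used in this article, `weighted_change_batch` is a hypothetical helper, and weights are kept on a 0-100 scale by convention.

```python
import json


def weighted_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          primary_weight: int, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that upserts both weighted A records."""
    if not 0 <= primary_weight <= 100:
        raise ValueError("primary_weight must be between 0 and 100")

    def upsert(set_id: str, ip: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }

    return {
        "Comment": "Switching weights during incident",
        "Changes": [
            upsert("cdn-primary", primary_ip, primary_weight),
            upsert("cdn-secondary", secondary_ip, 100 - primary_weight),
        ],
    }


if __name__ == "__main__":
    batch = weighted_change_batch("www.example.com.", "1.2.3.4", "5.6.7.8", 10)
    print(json.dumps(batch, indent=2))
```

Writing the output to a file and passing it with `--change-batch file://batch.json` keeps the shell command short and leaves the exact change available as an artifact for the postmortem appendix.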
Mini case study (anonymized)
In December 2025, a SaaS provider faced a Cloudflare edge rules misconfiguration that returned 502s for API requests. Using a multi-CDN setup and a prepared runbook, they:
- Detected the increase in 5XX via ML anomaly alerts
- Shifted 80% of traffic to their secondary CDN in 15 minutes
- Served stale snapshots on critical billing routes
- Completed the postmortem within 24 hours and implemented an automated weight rollback that triggers if errors rise
Checklist: put these in your incident playbook now
- Scoped API tokens for Cloudflare and DNS providers
- Low TTL DNS records for critical hostnames
- Automated health checks and synthetic transactions per region
- Pre-provisioned secondary CDN or origin host for failover
- Documented contact list for CDN provider support and escalation
- Postmortem template stored in your runbook repo (with owner assignments)
Closing notes — preparing for the next outage
CDN failures are inevitable. What separates resilient teams from the rest in 2026 is:
- Runbooks that are tested and executable under pressure
- Automation that reduces manual toil without removing human judgment
- Postmortems that create measurable remediation and verification
Actionable takeaways:
- Implement the detection template in your monitoring tool today (thresholds + anomaly alerts).
- Script targeted cache purge and load balancer weight changes; keep them in Git behind protected branches.
- Run a DNS failover tabletop this quarter and publish your postmortem template in your compliance artifacts.
Call to action
Copy these runbook templates into your incident platform and run a fire-drill within 30 days. If you want a starter repo or policy-as-code examples (Terraform + Cloudflare provider + Route 53 change sets) tailored to your architecture, request our incident playbook kit and a 30-minute walkthrough.
