When your CDN goes dark: ready-to-use runbook templates for Cloudflare and CDN failures
You’ll never be judged for not preventing a CDN outage; you will be judged for how slowly you recover. In 2026, engineering teams must treat Cloudflare and CDN failures like high-severity incidents: detect fast, route traffic safely, purge or serve stale caches, and produce airtight postmortems for SLA and audit purposes.
What this article gives you
Below are battle-tested, ready-to-use runbook templates you can copy into your incident manager (PagerDuty, Opsgenie, Prepared.cloud, etc.) and your runbook repo. Each template includes detection, mitigation, traffic rerouting, cache invalidation, DNS failover, and a postmortem structure. I also include 2026 trends and automation tips so you can validate these templates during tabletop exercises.
Why this matters in 2026 — trends and context
Late 2025 and early 2026 saw renewed attention on CDN resiliency after multiple high-profile edge and DNS incidents. Businesses now adopt multi-CDN architectures, automated DNS failover, and SRE-driven SLOs for edge services. Regulators and auditors increasingly expect documented, tested continuity plans — especially for services that affect customers directly.
Two trends to keep in mind:
- Edge compute and originless architectures: Your CDN is now also compute. Failure modes include not just cache misses but failing edge functions.
- Automation and AI detection: Observability tooling with ML anomaly detection is common — but false positives require robust runbooks to avoid unnecessary DNS flapping.
Incident detection template — CDN failure (S1)
Objective: confirm a CDN-level failure within 5 minutes and classify impact.
- Trigger: Any of the following within 5 minutes:
  - Global 5XX rate > 3% for 5m across edge nodes
  - Sudden 50% drop in traffic to CDN with no upstream change
  - Monitoring alert: Cloudflare Load Balancer pool down or origin health failures
  - External reports (DownDetector, social) + internal telemetry
- Initial responder: On-call Infra/SRE and NetOps join the incident channel immediately.
- Confirm: Run the following checks (runbook commands):
  - Quick HTTP check: curl -I -s -S https://www.example.com -o /dev/null -w "%{http_code} %{time_starttransfer}s\n"
  - Edge headers: curl -sI https://www.example.com | grep -Ei "cf-ray|via|x-cache"
  - Cloudflare API reachability (lists the zone's DNS records; a successful response confirms the API and your token work): curl -s -X GET "https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records" -H "Authorization: Bearer $CF_API_TOKEN"
  - DNS check: dig +short www.example.com @8.8.8.8
- Classify impact: outage vs degradation; customer-impacting vs internal telemetry only. Assign severity S1/S2.
- Notify: Post initial summary in incident channel and kick off paging to stakeholders and customer comms owner.
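The 5XX trigger above can be evaluated from raw counters. What follows is a minimal sketch, assuming you can export total and 5XX request counts for the last five minutes from your monitoring tool. The function name `edge_5xx_breach` and the sample counts are hypothetical, not part of any vendor tooling.

```shell
#!/bin/sh
# Hypothetical helper: decide whether the 5XX rate breaches the trigger threshold.
# Arguments: total requests, 5XX responses, threshold percent (default 3).
# Prints "breach", "ok", or "unknown", followed by the computed rate in percent.
edge_5xx_breach() {
  total="$1"; errors="$2"; threshold="${3:-3}"
  awk -v t="$total" -v e="$errors" -v th="$threshold" 'BEGIN {
    # No traffic at all is a separate signal (see the traffic-drop trigger).
    if (t <= 0) { print "unknown 0.00"; exit }
    rate = 100 * e / t
    # The trigger is strictly greater than the threshold.
    printf "%s %.2f\n", (rate > th ? "breach" : "ok"), rate
  }'
}

# Example with made-up counters for a five-minute window:
# edge_5xx_breach 120000 4800
```

A result of `unknown` is worth treating as its own alert, since zero observed requests usually points to the traffic-drop trigger and not to a healthy edge.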
Mitigation runbook — immediate containment
Objective: contain customer impact quickly and provide short-term relief for the next 30-90 minutes.
- Stabilize traffic:
  - If using Cloudflare Load Balancer, shift traffic to healthy pools using the UI or API. Example API call that changes the default and fallback pools (pseudo; the pool IDs are placeholders):
    curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers/{lb_id}" \
      -H "Authorization: Bearer $CF_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data '{"default_pools":["pool-id-A"],"fallback_pool":"pool-id-B"}'
  - If Cloudflare is partially degraded, enable Origin Pull fallback or temporarily reduce edge TTL to force fresh origin requests (careful with origin load).
- Serve stale / origin fallback: Configure your CDN or app to serve stale content if possible so users see last-known-good pages while you triage.
  - Cloudflare: enable Always Online or Workers that serve cached snapshots for critical routes.
- Cache purge restraint: Avoid wholesale purges unless necessary. If you must purge, target by tag or path using the API:
  curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data '{"files":["https://www.example.com/critical-path.js"]}'
- Temporary downgrade: If an edge compute function is failing, roll back the worker/function to the last known good revision.
- Throttle automated remediation: If you have automation touching DNS or CDN config, pause automations to avoid cascading changes.
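To make the purge-restraint step harder to get wrong under pressure, the request body can be built by a small helper instead of typed by hand. This is a sketch under stated assumptions: the function name `purge_files_payload` is hypothetical, it only handles the `files` form of the purge body shown above, and it assumes URLs contain no quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: build a targeted purge body from URLs passed as arguments.
# It refuses an empty list and anything that is not an https URL, so a scripting
# mistake cannot silently turn into a malformed or overly broad request.
purge_files_payload() {
  if [ "$#" -eq 0 ]; then
    echo "refusing to build a purge payload with no URLs" >&2
    return 1
  fi
  payload='{"files":['
  sep=''
  for url in "$@"; do
    # Reject characters that would break the hand-built JSON string.
    case "$url" in
      *\"*|*\\*) echo "unsupported character in URL: $url" >&2; return 1 ;;
    esac
    case "$url" in
      https://*) ;;
      *) echo "not an https URL: $url" >&2; return 1 ;;
    esac
    payload="${payload}${sep}\"${url}\""
    sep=','
  done
  printf '%s]}\n' "$payload"
}
```

One way to use it is to pipe the output into the purge call shown above with `--data @-`, which tells curl to read the request body from standard input.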
Traffic rerouting playbook — multi-CDN and DNS strategies
Objective: reroute traffic away from the failing CDN with minimal TTL churn and minimal customer impact.
Strategy A — Weighted multi-CDN failover (recommended)
- Precondition: You have two CDNs (Cloudflare + secondary CDN) behind a DNS or Traffic Manager.
- Action: Shift weight from failing CDN to secondary CDN in 10-20% steps over 5-10 minutes while monitoring error rate and origin load.
- Example (Route 53 weighted records):
  aws route53 change-resource-record-sets --hosted-zone-id Z3... --change-batch '{
    "Changes": [
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "www.example.com.",
          "Type": "A",
          "SetIdentifier": "cdn-primary",
          "Weight": 10,
          "TTL": 60,
          "ResourceRecords": [{ "Value": "1.2.3.4" }]
        }
      },
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "www.example.com.",
          "Type": "A",
          "SetIdentifier": "cdn-secondary",
          "Weight": 90,
          "TTL": 60,
          "ResourceRecords": [{ "Value": "5.6.7.8" }]
        }
      }
    ]
  }'
- Monitor: Cache hit ratio, 5XX rate, RPS at secondary CDN. If errors spike, roll back weight changes.
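The stepwise shift in Strategy A can be scripted so that each step is generated and not typed by hand. The following is a sketch, not a definitive implementation: `weight_steps` and `change_batch` are hypothetical helper names, the record name and IP addresses are the placeholders used in this article, and `HOSTED_ZONE_ID` is an assumed environment variable.

```shell
#!/bin/sh
# Hypothetical helper: print "primary secondary" weight pairs that move traffic
# from the primary CDN to the secondary in fixed steps (default 20 points).
weight_steps() {
  step="${1:-20}"
  if ! [ "$step" -ge 1 ] 2>/dev/null; then
    echo "step must be a positive integer" >&2
    return 1
  fi
  primary=100
  while [ "$primary" -gt 0 ]; do
    primary=$((primary - step))
    if [ "$primary" -lt 0 ]; then primary=0; fi
    echo "$primary $((100 - primary))"
  done
}

# Hypothetical helper: build a Route 53 change batch for one pair of weights.
change_batch() {
  cat <<EOF
{"Changes":[
 {"Action":"UPSERT","ResourceRecordSet":{"Name":"www.example.com.","Type":"A","SetIdentifier":"cdn-primary","Weight":$1,"TTL":60,"ResourceRecords":[{"Value":"1.2.3.4"}]}},
 {"Action":"UPSERT","ResourceRecordSet":{"Name":"www.example.com.","Type":"A","SetIdentifier":"cdn-secondary","Weight":$2,"TTL":60,"ResourceRecords":[{"Value":"5.6.7.8"}]}}
]}
EOF
}

# Example loop, left commented out because it changes live DNS:
# weight_steps 20 | while read -r p s; do
#   aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" \
#     --change-batch "$(change_batch "$p" "$s")"
#   sleep 120   # watch error rate and origin load before the next step
# done
```

Generating the pairs separately from applying them keeps the rollback simple: replay the same pairs in reverse order.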
Strategy B — DNS failover (single CDN fallback)
- Use DNS-level health checks (for example Route 53 or NS1) to detect global failure and switch A/AAAA records to an alternate origin or CDN.
- Be aware: DNS TTL and client DNS caching mean this is slower and can cause split-brain experiences.
- Use low TTLs (60s) for high-impact records in production where rapid failover is required.
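Whether a record really carries a low TTL is easy to verify before an incident. Here is a minimal sketch that parses the answer section printed by `dig +noall +answer`; the function names `max_answer_ttl` and `ttl_within` are hypothetical. Note that a recursive resolver reports the remaining TTL of its cached copy, so querying your authoritative nameserver gives the configured value.

```shell
#!/bin/sh
# Hypothetical helper: read dig answer lines on stdin
# ("name TTL class type data") and print the largest TTL seen (0 if none).
max_answer_ttl() {
  awk 'NF >= 5 && $2 ~ /^[0-9]+$/ { if ($2 + 0 > max) max = $2 + 0 }
       END { print max + 0 }'
}

# Hypothetical helper: succeed only if at least one answer was returned and
# every answer has a TTL at or below the limit in seconds (default 60).
ttl_within() {
  limit="${1:-60}"
  ttl="$(max_answer_ttl)"
  [ "$ttl" -gt 0 ] && [ "$ttl" -le "$limit" ]
}

# Example, left commented out because it needs network access:
# dig +noall +answer www.example.com @8.8.8.8 | ttl_within 60 || echo "TTL too high"
```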
Cache invalidation and origin protection
Objective: avoid origin overload while ensuring critical fixes reach users.
- Targeted purge first: Purge specific assets (path, tag) rather than everything.
  curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data '{"tags":["release-2026-01-16"]}'
- Stagger purges: Throttle purge rate to protect origin. Use background workers to rehydrate cache gradually.
- Origin scaling: If you must bypass CDN, ensure autoscaling policies are ready and confirm DB/third-party dependency capacity.
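Staggering purges mostly comes down to splitting the work into small batches and pausing between them. The sketch below shows only the batching step; `purge_batches` is a hypothetical name, and the batch size and pause length are assumptions you would tune against your origin capacity.

```shell
#!/bin/sh
# Hypothetical helper: read URLs on stdin (one per line) and print them in
# batches of at most N per output line, space-separated (default N = 10).
# Each output line is meant to become one purge request.
purge_batches() {
  size="${1:-10}"
  if ! [ "$size" -ge 1 ] 2>/dev/null; then
    echo "batch size must be a positive integer" >&2
    return 1
  fi
  awk -v n="$size" '
    NF {
      # Separator: space inside a batch, newline between batches, nothing first.
      printf "%s%s", (c % n ? " " : (c ? "\n" : "")), $0
      c++
    }
    END { if (c) print "" }'
}

# Example loop, left commented out because it would send live requests:
# purge_batches 10 < urls.txt | while read -r batch; do
#   echo "purging: $batch"   # send one purge request for this batch here
#   sleep 30                 # give the origin time to absorb the refill
# done
```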
Rollback & recovery runbook
- Confirmed fix: Only when edge errors remain below threshold for 15 continuous minutes.
- Rollback plan: If you changed CDN weights, reverse incrementally. If you purged caches, warm caches by replaying traffic from a controlled source (bots or workers) for key pages.
- Validation: Validate globally using synthetic checks and actual user metrics (RUM) across regions.
- Close incident: Close when SLA metrics are restored and stakeholders agree; start the postmortem within 24 hours.
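The "15 continuous minutes" rule in the confirmed-fix step is simple to check mechanically. As a rough sketch, assume your monitoring can export one error-rate sample per minute, oldest first. The function name `recovered` is hypothetical, and the default threshold of 3% is an assumption borrowed from the detection trigger.

```shell
#!/bin/sh
# Hypothetical helper: read per-minute error rates (percent, one per line,
# oldest first) on stdin. Succeeds only if the most recent WINDOW samples
# exist and are all below THRESHOLD.
recovered() {
  threshold="${1:-3}"; window="${2:-15}"
  tail -n "$window" | awk -v th="$threshold" -v w="$window" '
    $1 + 0 >= th + 0 { bad = 1 }
    END {
      # Too few samples means the window has not elapsed yet.
      if (NR < w + 0 || bad) exit 1
      exit 0
    }'
}

# Example, assuming a file with one sample per line:
# recovered 3 15 < error_rates.txt && echo "fix confirmed"
```

Requiring the full window of samples prevents declaring recovery a few minutes after a fix simply because the data is sparse.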
Postmortem template — CDN failure
Produce a clear, auditable document suitable for engineering leadership and compliance.
- Summary (1-2 lines): Include impact window and services affected.
- Severity: S1/S2, SLA breach details (which SLAs were impacted and how much).
- Timeline: Minute-by-minute timeline with who did what. Use UTC timestamps. Include: detection time, mitigation start, traffic reroute completion, full recovery, incident close.
- Root cause: Technical cause + contributing factors (e.g., misconfigured edge rule, upstream DNS provider partial outage, rate-limited purge spikes).
- Impact metrics:
  - Total requests failed
  - Number of users affected
  - Downtime per region
  - SLA/Metrics: % uptime during window, pages/sec vs baseline
- Remediation & action items: Assign owners, due dates, and verification steps. Examples:
  - Implement automated weight rollback if error rate > 2% (owner: NetOps, due: 2 days)
  - Add replica origin and test origin shielding (owner: infra, due: 7 days)
  - Run tabletop exercise for DNS failover (owner: SRE, due: 14 days)
- Lessons learned & communications: Customer-facing message template and internal notes.
- Appendix: raw logs, diagnostic commands, API calls used, and artifacts for auditors.
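A minute-by-minute timeline is much easier to reconstruct if responders record events as they happen. One possible sketch is a tiny logging helper that stamps each entry in UTC; the name `log_event` and the file path in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a UTC-timestamped entry to a timeline file.
# Usage: log_event FILE message words...
log_event() {
  file="$1"; shift
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
}

# Example:
# log_event incident-timeline.txt "mitigation start: paused DNS automation"
```

The resulting file can go straight into the appendix alongside the raw logs and API calls.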
SLA calculations & customer communication
Be transparent. Provide a timeline and show how the outage affected SLA guarantees. For example, if SLA is 99.95% monthly uptime, compute downtime minutes and whether SLA credits apply:
- Monthly minutes = 30 * 24 * 60 = 43,200
- Allowed downtime for 99.95% = 43,200 * (1 - 0.9995) = 21.6 minutes
- If outage = 40 minutes, SLA breached; follow contract process for credits and customer notifications.
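The arithmetic above is worth scripting so the same numbers appear in every postmortem. A minimal sketch, assuming a 30-day month by default; the function names `sla_allowed_minutes` and `sla_status` are hypothetical, and your contract may define the measurement window differently.

```shell
#!/bin/sh
# Hypothetical helper: print the allowed downtime in minutes for an SLA
# percentage over a number of days (default 30).
sla_allowed_minutes() {
  awk -v sla="$1" -v days="${2:-30}" 'BEGIN {
    printf "%.1f\n", days * 24 * 60 * (1 - sla / 100)
  }'
}

# Hypothetical helper: print "breached" or "within" for an outage length
# in minutes against the same SLA window.
sla_status() {
  awk -v sla="$1" -v outage="$2" -v days="${3:-30}" 'BEGIN {
    allowed = days * 24 * 60 * (1 - sla / 100)
    print (outage > allowed ? "breached" : "within")
  }'
}

# Example from the text: 99.95% monthly SLA and a 40-minute outage.
# sla_allowed_minutes 99.95
# sla_status 99.95 40
```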
Testing, drills, and automation (2026 best practices)
Runbooks are living documents. In 2026, best-practice teams run automated drills monthly and integrate runbooks with CI/CD and IaC:
- Automate synthetic failures during off-hours using feature flags and canary DNS changes.
- Use chaos engineering targeting edge behavior (simulate Cloudflare 5XX at the load balancer level).
- Store runbooks in Git and deploy them to your incident platform (Prepared.cloud, Opsgenie) via IaC.
- Use playbook templating with variables (zone_id, lb_id, secondary_cdn_ip) to avoid human error.
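Templated playbooks only prevent human error if a missing variable stops the script before any command runs. The following sketch shows one way to enforce that in plain shell; `require_vars` is a hypothetical helper, and it assumes the names passed in are valid shell identifiers.

```shell
#!/bin/sh
# Hypothetical helper: fail if any named variable is unset or empty, so a
# templated command never runs with a literal "{zone_id}" or a blank value.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example guard at the top of a runbook script:
# require_vars zone_id lb_id secondary_cdn_ip || exit 1
```

Reporting every missing name in one pass, instead of stopping at the first, saves a round trip when someone is filling in variables during an incident.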
Concrete commands and API examples
Use these as-is inside safe scripts and ensure API tokens are scoped and rotated.
Cloudflare: targeted purge by tag
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"tags":["critical-js","release-2026-01"]}'
Health check — quick headers
curl -sI https://www.example.com | grep -Ei "^HTTP/|cf-cache-status:|cf-ray:|via:|x-cache:" || true
Route 53 weighted record change (JSON snippet)
{
  "Comment": "Switching weights during incident",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": "cdn-secondary",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "5.6.7.8" }]
      }
    }
  ]
}
Mini case study (anonymized)
In December 2025, a SaaS provider faced a Cloudflare edge rules misconfiguration that returned 502s for API requests. Using a multi-CDN setup and a prepared runbook they:
- Detected the increase in 5XX via ML anomaly alerts
- Shifted 80% traffic to their secondary CDN in 15 minutes
- Served stale snapshots on critical billing routes
- Completed postmortem within 24 hours and implemented an automated weight rollback if errors rise
Checklist: put these in your incident playbook now
- Scoped API tokens for Cloudflare and DNS providers
- Low TTL DNS records for critical hostnames
- Automated health checks and synthetic transactions per region
- Pre-provisioned secondary CDN or origin host for failover
- Documented contact list for CDN provider support and escalation
- Postmortem template stored in your runbook repo (with owner assignments)
Closing notes — preparing for the next outage
CDN failures are inevitable. What separates resilient teams from the rest in 2026 is:
- Runbooks that are tested and executable under pressure
- Automation that reduces manual toil without removing human judgment
- Postmortems that create measurable remediation and verification
Actionable takeaways:
- Implement the detection template in your monitoring tool today (thresholds + anomaly alerts).
- Script targeted cache purge and load balancer weight changes; keep them in Git behind protected branches.
- Run a DNS failover tabletop this quarter and publish your postmortem template in your compliance artifacts.
Call to action
Copy these runbook templates into your incident platform and run a fire-drill within 30 days. If you want a starter repo or policy-as-code examples (Terraform + Cloudflare provider + Route 53 change sets) tailored to your architecture, request our incident playbook kit and a 30-minute walkthrough.