
Incident Runbook Templates for Cloud CDN Failures

prepared

2026-02-05


Ready-to-use runbooks for Cloudflare/CDN failures: detection, traffic rerouting, cache invalidation, DNS failover and postmortem templates for 2026.

When your CDN goes dark: ready-to-use runbook templates for Cloudflare and CDN failures

You’ll never be judged for failing to prevent a CDN outage, but you will be judged for how slowly you recover. In 2026, engineering teams must treat Cloudflare and CDN failures like high-severity incidents: detect fast, route traffic safely, purge or serve stale caches, and produce airtight postmortems for SLA and audit purposes.

What this article gives you

Below are battle-tested, ready-to-use runbook templates you can copy into your incident manager (PagerDuty, Opsgenie, Prepared.cloud, etc.) and your runbook repo. Each template covers detection, mitigation, traffic rerouting, cache invalidation, DNS failover, and a postmortem structure. I also include 2026 trends and automation tips so you can validate the templates during tabletop exercises.

Why this matters in 2026 — trends and context

Late 2025 and early 2026 saw renewed attention on CDN resiliency after multiple high-profile edge and DNS incidents. Businesses now adopt multi-CDN architectures, automated DNS failover, and SRE-driven SLOs for edge services. Regulators and auditors increasingly expect documented, tested continuity plans — especially for services that affect customers directly.

Two trends to keep in mind:

- Edge compute and originless architectures: Your CDN is now also compute. Failure modes include not just cache misses but failing edge functions.
- Automation and AI detection: Observability tooling with ML anomaly detection is common, but false positives require robust runbooks to avoid unnecessary DNS flapping.

Incident detection template — CDN failure (S1)

Objective: confirm a CDN-level failure within 5 minutes and classify impact.

Trigger: Any of the following within 5 minutes:
- Global 5XX rate > 3% for 5m across edge nodes
- Sudden 50% drop in traffic to CDN with no upstream change
- Monitoring alert: Cloudflare Load Balancer pool down or origin health failures
- External reports (DownDetector, social) + internal telemetry
Initial responder: On-call Infra/SRE and NetOps join the incident channel immediately.
Confirm: Run the following checks (runbook commands):
- Quick HTTP check: curl -I -s -S https://www.example.com -o /dev/null -w "%{http_code} %{time_starttransfer}s\n"
- Edge headers: curl -sI https://www.example.com | grep -Ei "cf-ray|via|x-cache"
- Cloudflare API reachability (confirms the API answers and your token is valid; this call lists DNS records and is not a status feed): curl -s "https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records" -H "Authorization: Bearer $CF_API_TOKEN"
- DNS check: dig +short www.example.com @8.8.8.8
Classify impact: outage vs degradation; customer-impacting vs internal telemetry only. Assign severity S1/S2.
Notify: Post initial summary in incident channel and kick off paging to stakeholders and customer comms owner.
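
The trigger thresholds above can be sketched as a small evaluation function. This is a minimal illustration, not a monitoring integration: the `MinuteSample` shape, the per-minute aggregation, and the function names are assumptions, and "3% for 5m" is read here as every one of the last five minutes exceeding 3%.

```python
from dataclasses import dataclass
from typing import Sequence

# Thresholds copied from the trigger list; tune them to your own baseline.
ERROR_RATE_THRESHOLD = 0.03    # global 5XX rate above 3%
TRAFFIC_DROP_THRESHOLD = 0.50  # 50% drop versus baseline
WINDOW_MINUTES = 5


@dataclass
class MinuteSample:
    """One minute of aggregated edge telemetry (hypothetical shape)."""
    requests: int
    errors_5xx: int


def error_rate_breached(samples: Sequence[MinuteSample]) -> bool:
    """True when each of the last WINDOW_MINUTES samples exceeds the 5XX threshold."""
    window = samples[-WINDOW_MINUTES:]
    if len(window) < WINDOW_MINUTES:
        return False  # not enough data to call the breach sustained
    return all(
        s.requests > 0 and s.errors_5xx / s.requests > ERROR_RATE_THRESHOLD
        for s in window
    )


def traffic_dropped(samples: Sequence[MinuteSample], baseline_rpm: float) -> bool:
    """True when average requests per minute fall 50% or more below baseline."""
    window = samples[-WINDOW_MINUTES:]
    if not window or baseline_rpm <= 0:
        return False
    current = sum(s.requests for s in window) / len(window)
    return current <= baseline_rpm * (1 - TRAFFIC_DROP_THRESHOLD)


def should_page(samples: Sequence[MinuteSample], baseline_rpm: float) -> bool:
    """Page the on-call when either telemetry trigger fires."""
    return error_rate_breached(samples) or traffic_dropped(samples, baseline_rpm)
```

Requiring the full window to breach is one way to keep a single noisy minute from paging the team; external reports and load balancer alerts would still need their own inputs.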

Mitigation runbook — immediate containment

Objective: contain customer-facing impact and provide short-term relief for the first 30-90 minutes.

Stabilize traffic:
- If using Cloudflare Load Balancer, shift traffic to healthy pools using the UI or API. Example API call that repoints the default and fallback pools (pseudo; pool IDs are placeholders):
```
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers/{lb_id}" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"default_pools":["pool-id-A"],"fallback_pool":"pool-id-B"}'
```
- If Cloudflare is partially degraded, enable Origin Pull fallback or temporarily reduce edge TTL to force fresh origin requests (careful with origin load).
Serve stale / origin fallback: Configure your CDN or app to serve stale content if possible so users see last-known-good pages while you triage.
- Cloudflare: enable Always Online or Workers that serve cached snapshots for critical routes.

Cache purge restraint: Avoid wholesale purges unless necessary. If you must purge, target by tag/path using API:

```
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"files":["https://www.example.com/critical-path.js"]}'
```

Temporary downgrade: If an edge compute function is failing, rollback the worker/function to the last known good revision.
Throttle automated remediation: If you have automation touching DNS or CDN config, pause automations to avoid cascading changes.

Traffic rerouting playbook — multi-CDN and DNS strategies

Objective: reroute traffic away from the failing CDN with minimal TTL churn and minimal customer impact.

Strategy A — Weighted multi-CDN failover (recommended)

Precondition: You have two CDNs (Cloudflare + secondary CDN) behind a DNS or Traffic Manager.
Action: Shift weight from failing CDN to secondary CDN in 10-20% steps over 5-10 minutes while monitoring error rate and origin load.

Example (Route 53 weighted records):

```
aws route53 change-resource-record-sets --hosted-zone-id Z3... --change-batch '{
  "Changes":[{
    "Action":"UPSERT",
    "ResourceRecordSet":{
      "Name":"www.example.com.",
      "Type":"A",
      "SetIdentifier":"cdn-primary",
      "Weight":10,
      "TTL":60,
      "ResourceRecords":[{"Value":"1.2.3.4"}]
    }
  },{
    "Action":"UPSERT",
    "ResourceRecordSet":{
      "Name":"www.example.com.",
      "Type":"A",
      "SetIdentifier":"cdn-secondary",
      "Weight":90,
      "TTL":60,
      "ResourceRecords":[{"Value":"5.6.7.8"}]
    }
  }]
}'
```

Monitor: Cache hit ratio, 5XX rate, RPS at secondary CDN. If errors spike, roll back weight changes.
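
The stepped shift in Strategy A can be sketched as a schedule generator plus a change-batch builder. This is an illustrative sketch only: the function names are hypothetical, the IPs and set identifiers are the placeholders from the example above, and nothing here calls AWS.

```python
import json
from typing import Dict, List, Tuple


def weight_steps(step_percent: int = 20) -> List[Tuple[int, int]]:
    """Return (primary, secondary) weight pairs moving from 100/0 toward 0/100."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be between 1 and 100")
    steps = []
    primary = 100
    while primary > 0:
        primary = max(primary - step_percent, 0)
        steps.append((primary, 100 - primary))
    return steps


def change_batch(name: str, primary_ip: str, secondary_ip: str,
                 primary_weight: int, secondary_weight: int, ttl: int = 60) -> Dict:
    """Build a Route 53 change batch that upserts both weighted A records."""
    def record(set_id: str, ip: str, weight: int) -> Dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }

    return {
        "Comment": "Incident weight shift %d/%d" % (primary_weight, secondary_weight),
        "Changes": [
            record("cdn-primary", primary_ip, primary_weight),
            record("cdn-secondary", secondary_ip, secondary_weight),
        ],
    }


if __name__ == "__main__":
    for primary, secondary in weight_steps(20):
        batch = change_batch("www.example.com.", "1.2.3.4", "5.6.7.8", primary, secondary)
        print(json.dumps(batch))  # review, then pass as --change-batch
```

Each printed document could be passed to the `aws route53 change-resource-record-sets` command one step at a time, with a pause to check error rate and origin load before applying the next.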

Strategy B — DNS failover (single CDN fallback)

Use health checks from a DNS traffic manager (Route 53, NS1, Akamai Global Traffic Management) to detect global failure and switch A/AAAA records to an alternate origin or CDN.
Be aware: DNS TTL and client DNS caching mean this is slower and can cause split-brain experiences.
Use low TTLs (60s) for high-impact records in production where rapid failover is required.
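
Because a single failed probe should not flip DNS, health-check results are usually debounced before failover. The sketch below shows one possible approach; the class name and the default streak lengths are assumptions, not settings from any specific provider.

```python
class FailoverGate:
    """Debounce health-check results so one bad probe does not flip DNS."""

    def __init__(self, fail_after: int = 3, recover_after: int = 5):
        self.fail_after = fail_after        # consecutive failures before failing over
        self.recover_after = recover_after  # consecutive successes before failing back
        self.failed_over = False
        self._streak = 0  # consecutive probes that disagree with the current state

    def observe(self, healthy: bool) -> bool:
        """Record one probe result; return True while traffic should use the fallback."""
        disagrees = healthy if self.failed_over else not healthy
        self._streak = self._streak + 1 if disagrees else 0
        needed = self.recover_after if self.failed_over else self.fail_after
        if self._streak >= needed:
            self.failed_over = not self.failed_over
            self._streak = 0
        return self.failed_over
```

Asking for more consecutive successes to fail back than failures to fail over is a deliberate asymmetry: it trades slower recovery for less flapping. With 30-second probes, three required failures, and a 60-second TTL, resolvers that honor the TTL would converge roughly two minutes after the first failed probe.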

Cache invalidation and origin protection

Objective: avoid origin overload while ensuring critical fixes reach users.

Targeted purge first: Purge specific assets (path, tag) rather than everything.

```
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"tags":["release-2026-01-16"]}'
```

Stagger purges: Throttle purge rate to protect origin. Use background workers to rehydrate cache gradually.
Origin scaling: If you must bypass CDN, ensure autoscaling policies are ready and confirm DB/third-party dependency capacity.
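
Staggering purges can be sketched as batching URLs and pausing between requests. In this sketch `send` is a hypothetical callable that would POST each payload to the `purge_cache` endpoint; the default batch size of 30 and the pause length are assumptions, so check the per-request limit for your plan.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List


def batched(urls: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def staggered_purge(urls: Iterable[str],
                    send: Callable[[Dict], None],
                    batch_size: int = 30,
                    pause_seconds: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Send purge payloads batch by batch, pausing between batches.

    Returns the number of batches sent.
    """
    count = 0
    for batch in batched(list(urls), batch_size):
        if count:
            sleep(pause_seconds)  # give the origin time to absorb refills
        send({"files": batch})
        count += 1
    return count
```

Injecting `send` and `sleep` keeps the pacing logic testable without touching the network, and lets a dry run log the payloads before anything is purged.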

Rollback & recovery runbook

Confirmed fix: Treat the fix as confirmed only when edge errors stay below the alert threshold for 15 continuous minutes.
Rollback plan: If you changed CDN weights, reverse incrementally. If you purged caches, warm caches by replaying traffic from a controlled source (bots or workers) for key pages.
Validation: Validate globally using synthetic checks and actual user metrics (RUM) across regions.
Close incident: Close when SLA metrics are restored and stakeholders agree; start the postmortem within 24 hours.
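
The 15-minute confirmation rule can be expressed as a simple check over per-minute error rates. This is a sketch under assumptions: the function name is hypothetical, and the 3% default reuses the detection threshold because the runbook does not state a separate recovery threshold.

```python
from typing import Sequence


def fix_confirmed(error_rates: Sequence[float],
                  threshold: float = 0.03,
                  required_minutes: int = 15) -> bool:
    """True when the most recent `required_minutes` samples are all below threshold.

    `error_rates` holds one 5XX rate per minute, oldest first.
    """
    if len(error_rates) < required_minutes:
        return False  # not enough history since mitigation
    return all(rate < threshold for rate in error_rates[-required_minutes:])
```

A single minute at or above the threshold restarts the wait, which is what "continuous" implies here.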

Postmortem template — CDN failure

Produce a clear, auditable document suitable for engineering leadership and compliance.

Summary (1-2 lines): Include impact window and services affected.
Severity: S1/S2, SLA breach details (which SLAs were impacted and how much).
Timeline: Minute-by-minute timeline with who did what. Use UTC timestamps.
Include: detection time, mitigation start, traffic reroute completion, full recovery, incident close.
Root cause: Technical cause + contributing factors (e.g., misconfigured edge rule, upstream DNS provider partial outage, rate-limited purge spikes).
Impact metrics:
- Total requests failed
- Number of users affected
- Downtime per region
- SLA/Metrics: % uptime during window, pages/sec vs baseline
Remediation & action items: Assign owners, due dates, and verification steps. Examples:
- Implement automated weight rollback if error rate > 2% (owner: NetOps, due: 2 days)
- Add replica origin and test origin shielding (owner: infra, due: 7 days)
- Run tabletop exercise for DNS failover (owner: SRE, due: 14 days)
Lessons learned & communications: Customer-facing message template and internal notes.
Appendix: raw logs, diagnostic commands, API calls used, and artifacts for auditors.
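
The timeline milestones lend themselves to computed durations so the postmortem numbers are reproducible. The sketch below assumes ISO-style UTC timestamps; the milestone keys, the metric names, and the example times are hypothetical, and `impact_start` is an added milestone not listed in the template.

```python
from datetime import datetime, timezone
from typing import Dict

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # UTC, as the timeline guidance asks


def parse_utc(stamp: str) -> datetime:
    """Parse a timestamp such as 2026-01-16T10:04:00Z into an aware datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def timeline_metrics(events: Dict[str, str]) -> Dict[str, float]:
    """Return durations in minutes between key incident milestones."""
    t = {name: parse_utc(stamp) for name, stamp in events.items()}

    def minutes(start: str, end: str) -> float:
        return (t[end] - t[start]).total_seconds() / 60

    return {
        "time_to_detect_min": minutes("impact_start", "detected"),
        "time_to_mitigate_min": minutes("detected", "mitigation_start"),
        "time_to_reroute_min": minutes("mitigation_start", "reroute_complete"),
        "time_to_recover_min": minutes("impact_start", "recovered"),
    }


# Illustrative values only, not data from a real incident.
EXAMPLE_EVENTS = {
    "impact_start": "2026-01-16T10:00:00Z",
    "detected": "2026-01-16T10:04:00Z",
    "mitigation_start": "2026-01-16T10:09:00Z",
    "reroute_complete": "2026-01-16T10:24:00Z",
    "recovered": "2026-01-16T10:40:00Z",
}

if __name__ == "__main__":
    print(timeline_metrics(EXAMPLE_EVENTS))
```

Keeping the raw timestamps next to the derived durations in the appendix lets an auditor recompute every figure.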

SLA calculations & customer communication

Be transparent. Provide a timeline and show how the outage affected SLA guarantees. For example, if SLA is 99.95% monthly uptime, compute downtime minutes and whether SLA credits apply:

Monthly minutes = 30 * 24 * 60 = 43,200
Allowed downtime for 99.95% = 43,200 * (1 - 0.9995) = ~21.6 minutes
If outage = 40 minutes, SLA breached; follow contract process for credits and customer notifications.
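
The same arithmetic as a small helper, so the calculation can be reused for other SLA tiers. The function names are illustrative, and the 30-day month is the same simplification used in the worked example; your contract may define the measurement period differently.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Downtime budget in minutes for a monthly uptime SLA."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def sla_breached(outage_minutes: float, sla_percent: float, days_in_month: int = 30) -> bool:
    """True when cumulative outage minutes exceed the monthly budget."""
    return outage_minutes > allowed_downtime_minutes(sla_percent, days_in_month)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.95)
    print("Budget: %.1f minutes" % budget)
    print("40-minute outage breaches SLA:", sla_breached(40, 99.95))
```

Remember that the budget is cumulative for the month, so earlier incidents in the same period count against it.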

Testing, drills, and automation (2026 best practices)

Runbooks are living documents. In 2026, best-practice teams run automated drills monthly and integrate runbooks with CI/CD and IaC:

- Automate synthetic failures during off-hours using feature flags and canary DNS changes.
- Use chaos engineering targeting edge behavior (simulate Cloudflare 5XX at the load balancer level).
- Store runbooks in Git and deploy them to your incident platform (Prepared.cloud, Opsgenie) via IaC.
- Use playbook templating with variables (zone_id, lb_id, secondary_cdn_ip) to avoid human error.
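
Playbook templating can be as simple as the standard library's `string.Template`, which refuses to render when a variable is missing. This is a minimal sketch: the `healthy_pool_id` variable and the `render_lb_command` helper are hypothetical, and the API token is deliberately left as a shell variable rather than rendered into the command.

```python
from string import Template

# "$$" renders a literal "$", so $CF_API_TOKEN stays a shell variable
# and the secret never lands in the rendered runbook text.
LB_SHIFT_COMMAND = Template(
    'curl -X PATCH '
    '"https://api.cloudflare.com/client/v4/zones/${zone_id}/load_balancers/${lb_id}" '
    '-H "Authorization: Bearer $$CF_API_TOKEN" '
    '-H "Content-Type: application/json" '
    "--data '{\"default_pools\":[\"${healthy_pool_id}\"]}'"
)


def render_lb_command(**variables: str) -> str:
    """Render the command; raises KeyError if any template variable is missing."""
    return LB_SHIFT_COMMAND.substitute(variables)


if __name__ == "__main__":
    print(render_lb_command(zone_id="ZONE_ID", lb_id="LB_ID", healthy_pool_id="POOL_ID"))
```

Using `substitute` rather than `safe_substitute` is the point of the design: a half-filled command fails loudly at render time instead of reaching a terminal during an incident.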

Concrete commands and API examples

Substitute your own zone, record, and tag values, run these from reviewed scripts rather than ad hoc shells, and ensure API tokens are scoped and rotated.

Cloudflare: targeted purge by tag

```
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"tags":["critical-js","release-2026-01"]}'
```

Health check — quick headers

```
curl -sI https://www.example.com | grep -Ei "^HTTP/|cf-cache-status:|cf-ray:|via:|x-cache:" || true
```

Route 53 weighted record change (JSON snippet)

```
{
  "Comment": "Switching weights during incident",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": "cdn-secondary",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "5.6.7.8" }]
      }
    }
  ]
}
```

Mini case study (anonymized)

In December 2025, a SaaS provider faced a Cloudflare edge rules misconfiguration that returned 502s for API requests. Using a multi-CDN setup and a prepared runbook they:

- Detected the increase in 5XX via ML anomaly alerts
- Shifted 80% of traffic to their secondary CDN in 15 minutes
- Served stale snapshots on critical billing routes
- Completed the postmortem within 24 hours and implemented an automated weight rollback if errors rise

Recovery time went from 120 minutes (previous outage) to 28 minutes. That operational improvement prevented an SLA credit and reduced churn risk.

Checklist: put these in your incident playbook now

- Scoped API tokens for Cloudflare and DNS providers
- Low TTL DNS records for critical hostnames
- Automated health checks and synthetic transactions per region
- Pre-provisioned secondary CDN or origin host for failover
- Documented contact list for CDN provider support and escalation
- Postmortem template stored in your runbook repo (with owner assignments)

Closing notes — preparing for the next outage

CDN failures are inevitable. What separates resilient teams from the rest in 2026 is:

- Runbooks that are tested and executable under pressure
- Automation that reduces manual toil without removing human judgment
- Postmortems that create measurable remediation and verification

Actionable takeaways:

- Implement the detection template in your monitoring tool today (thresholds + anomaly alerts).
- Script targeted cache purge and load balancer weight changes; keep them in Git behind protected branches.
- Run a DNS failover tabletop this quarter and publish your postmortem template in your compliance artifacts.

Call to action

Copy these runbook templates into your incident platform and run a fire-drill within 30 days. If you want a starter repo or policy-as-code examples (Terraform + Cloudflare provider + Route 53 change sets) tailored to your architecture, request our incident playbook kit and a 30-minute walkthrough.
