A Byzantine failure in the real world
November 27, 2020 12:00PM
When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). In this post, we present a timeline of a real-world incident, and how an interesting failure mode known as a Byzantine fault played a role in a cascading series of events....
August 30th 2020: Analysis of CenturyLink/Level(3) Outage
August 30, 2020 9:55PM
Outage
Post Mortem
Today CenturyLink/Level(3), a major ISP and Internet bandwidth provider, experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet....
Cloudflare outage on July 17, 2020
July 18, 2020 2:22AM
Post Mortem
Outage
Engineering
Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network....
Details of the Cloudflare outage on July 2, 2019
July 12, 2019 4:45PM
Post Mortem
Outage
Deep Dive
Almost nine years ago, Cloudflare was a tiny company and I was a customer, not an employee. Cloudflare had launched a month earlier and one day alerting told me that my little site, jgc.org, didn’t seem to have working DNS any more....
Cloudflare outage caused by bad software deploy (updated)
July 02, 2019 4:50PM
Post Mortem
Outage
Deep Dive
Starting at 13:42 UTC today, we experienced a global outage across our network that resulted in visitors to Cloudflare-proxied domains being shown 502 errors (“Bad Gateway”). The cause of this outage was the deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF)...