Post Mortem: The Ugly, the Bad & the

Last night was not our finest hour. Around 07:30 GMT, we were finishing up a push of a new DNS infrastructure. The core of what this new update was built to do is make DNS updates even faster. Before it took about a minute for a change to your DNS settings to propagate to all our infrastructure, with the new DNS update it is almost instant. That is important to understand in order to understand what went wrong.

Making an update to the DNS requires changing underlying code deep in our system and taking servers offline while we do so. We scheduled the update for the quietest time on our network, which is around 07:00 GMT (around 11:00pm in San Francisco). The code had been running smoothly in our test environment and one data center for the last week so we were feeling pretty good. And, in fact, the push of the DNS update went smoothly and was ahead of schedule.

The Ugly

When the update was complete in 10 of our 14 data centers we got word of a minor issue that was affecting some data getting pushed from the master DNS database. In the process of diagnosing the minor issue, the master DNS database was deleted. The new DNS system did its job and rapidly propagated across the 10 datacenters where the update was live. The result was that if recursive DNS looked up a domain and hit one of those 10 datacenters, around 07:30 GMT they would receive an invalid result. That meant those sites went offline and it was entirely our fault.

The Bad

The DNS database is regularly backed up, but it took us about 5 minutes to recognize the issue, retrieve the backup, and push it to production. Our new DNS infrastructure pushed the update out to most of the datacenters immediately, but because it was such a large update it took a few minutes to rebuild. In most places, new DNS requests were correctly answered with less than a 10 minute window of bad results.

Unfortunately, DNS is a series of interconnected caches, many of which are not in our control. If you accessed a page during the issue, your ISP's recursive DNS likely cached the result. Since most DNS providers don't make it easy to flush their cache (compared with a recursive provider like OpenDNS, which does) it extended the outage for people who were already seeing an issue. Generally, within 30 minutes, recursive DNS had flushed and by 8:00 GMT sites were back online.

Two datacenters did not take all the corrected DNS file updates correctly. We are still investigating why, but our speculation is that because the update affected a large number of records the systems choked on the initial attempt at the updates. Requests that hit those data centers returned bad results for some sites until about 8:10 GMT. Some visitors in Europe and Asia would have seen a longer period of downtime on some sites as a result. Our system has multiple layers of redundancy, including at the datacenter level, so we removed the two data centers from rotation as soon as we recognized the issue and affected visitors once again saw correct DNS results.

Two last problems exacerbated things. First, as is normal operations for us, we were dealing with two mid-sized DDoS attacks directed at some of our customers at the time. Nothing abnormal about that, but having two fewer data centers in rotation made us less effective at stopping them and caused a small handful of 500 errors. The impact of those, however, was minimal (less than 0.001% of traffic for around a 12 minute period). Second, there were some DNS entries in our system for TLDs like that shouldn't have been there. While it wasn't a validated DNS zone record, the way that the DNS update was pushed caused a handful of records that fell under these TLDs to also see an extended outage. When we got reports of this we identified the issue and removed the problematic entries.

The Good

There's not a ton of good in this incident itself. While the system status is green now, we will memorialize the incident on our system status page. I, along with the rest of the team, apologize for the problem and anyone who experienced it. We've built a system that is resilient to most attacks, but a mistake on our part can still cause a significant issue. This is the second significant period of downtime we've had network wide. The first was more than a year ago and also occurred due to an error we made ourselves. Any period of downtime is unacceptable to us and, again, we sincerely apologize.

Going forward, we've already added several layers of safeguards to prevent this, or a similar incident, from occurring. CloudFlare's technical systems are designed to learn over time, that same ethos is in our team itself. While this incident was ugly, I was proud to see almost the entire engineering, ops, and support teams online into the wee hours helping customers sort out issues and building the safeguards to prevent an issue like this in the future.

What I was planning on writing a blog post about this morning is our new DNS infrastructure, so I will end with a bit more detail on that. As described above, one of the main benefits is that DNS updates are even faster than before. In the past, DNS files were replicated every minute or so. Now changes are pushed instantly to our entire network. While that wasn't a great thing last night, in general we believe it is a big benefit to our publishers and makes us the fastest updating global authoritative DNS in the world.

The update to the DNS systems also includes hardening against some of the new breed of DNS-directed DDoS attacks we've begun to see. Going forward, this will help us provide even better protection against larger and larger attacks. Our goal is to stay ahead of the bad guys and ensure that everyone on CloudFlare has state-of-the-art protection against attacks.

I apologize again for those of you who experienced downtime as a result of our mistake. We will learn from it and continue to build redundancy and resiliency into CloudFlare in order to earn your trust.