Today, shortly after 21:00 UTC, one of our senior support staff posted an alarming message in our internal operations chat: "getting DNS resolution errors on support.cloudflare.com". At the same time, automated monitoring indicated a problem. Shortly thereafter, we saw alarms and reports from a variety of customers (but not all of them) of "1001 errors", which indicate a DNS resolution error on the CloudFlare backend. Needless to say, this got an immediate and overwhelming response from our operations and engineering teams, as we hadn't changed anything and had no other indication of an anomaly.
In the course of debugging, we identified a common characteristic of the affected sites: they were CNAME-based users of CloudFlare rather than complete domains hosted entirely on CloudFlare. Ironically, this included our own support site, support.cloudflare.com. When users point (via CNAME) to a domain instead of providing us with an IP address, our network has to resolve that name, and it cannot connect if the DNS provider is having issues. (Our status page, https://www.cloudflarestatus.com/, is off-network and was unaffected.) We then investigated why only certain domains were having issues: was the problem with the upstream DNS? Testing whether those domains were resolvable on the Internet (which they were) added a confounding data point.
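To make the dependency concrete, here is a minimal Python sketch of the difference between an origin given as an IP address and one given as a hostname. It is an illustration only, not a description of how our network is implemented: the names `resolve_origin` and `OriginResolutionError` are hypothetical, and the resolver is passed in as a parameter purely so the failure path can be exercised without touching real DNS.

```python
import ipaddress
import socket


class OriginResolutionError(Exception):
    """Hypothetical stand-in for a "1001"-style DNS resolution failure."""


def resolve_origin(origin, resolver=socket.getaddrinfo):
    """Return the IP addresses to connect to for an origin.

    `origin` is either an IP literal or a hostname (the CNAME-style setup).
    `resolver` is injectable (an assumption of this sketch) for offline testing.
    """
    try:
        # An IP literal needs no lookup, so a DNS provider outage cannot affect it.
        return [str(ipaddress.ip_address(origin))]
    except ValueError:
        pass  # Not an IP literal; treat it as a hostname.

    try:
        results = resolver(origin, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # The hostname's DNS provider could not answer: nothing to connect to.
        raise OriginResolutionError(f"cannot resolve {origin!r}: {exc}") from exc

    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({entry[4][0] for entry in results})
```

With an origin such as `"192.0.2.1"` the function returns immediately, whatever state DNS is in. With a hostname, the result depends entirely on the authoritative DNS provider for that name answering at that moment, which is the situation the CNAME-based sites were in.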
Ultimately, we traced the outage to Dyn, another major DNS operator, which was having issues with its own DNS configuration (https://www.dynstatus.com/incidents/4sbm48rdsdbq).
The Internet is made up of many networks, operated by companies, organizations, governments, and individuals around the world, all cooperating using a common set of protocols and agreed policies and behaviors. These systems interoperate in a number of ways, sometimes entirely non-obviously. The mutual goal is to provide service to end users, letting them communicate, enjoy new services, and explore together. When one provider has a technical issue, it can cascade throughout the Internet and it isn’t obvious to users exactly which piece is broken.
Fortunately, even when companies are competitors, the spirit of the Internet is to work together for the good of the users. Once we identified this issue, we immediately contacted Dyn and relayed what we knew, and worked with them to resolve the issue for everyone’s benefit. We have temporarily put in workarounds to address this issue on our side, and hope the underlying difficulties will be resolved shortly.
Update: The good folks at Dyn have posted a short explanation of what happened on their nameservers.