CloudFlare had an outage across much of our network today. The outage began at 20:19 (GMT). It affected approximately 75% of traffic to CloudFlare's network. The length of time for the outage varied depending on region, but the maximum period of downtime was approximately 15 minutes. I wanted to quickly get information out about what happened, why it happened, and what we're doing to ensure it never happens again.
Routes, Routers and Routing
To understand the problem, you need to understand a bit about how Internet routing works. The Internet is a massively interconnected network. Networks send packets to each other across routes. These routes are set for each network by routers. A route defines the path for packets to take to get to a particular IP address. One network will announce that it is responsible for a particular set of IP addresses. That fact is then shared to upstream routers so if they see a packet bound for a particular IP they can send it in the correct direction.
Routers exchange routes between each other using something called Border Gateway Protocol (BGP). When two networks interconnect, they generally trust each other's routes. If a routing change is announced by one router, the immediately connected upstream routers will pickup the routing change. They will subsequently pass the change on to other routers that are further upstream.
Bad Route to Hong Kong
Today we had a scheduled maintenance for our Hong Kong data center while its systems were being upgraded. The data center was taken offline by shutting down all the in-bound Anycast routes. This, as we intended, caused all traffic that would have gone to that data center to hit the next closest facility (either Singapore or Tokyo).
While the systems were being upgraded, our network team worked to optimize some of the routing in Hong Kong. At some point, the out-bound routes were entered into the in-bound interface. The out-bound routes describe our entire net range so the net effect was the router in Hong Kong was announcing that it was the correct place to send all traffic bound for CloudFlare's IP space.
Our upstream provider trusts our routes so, via BGP, they were quickly relayed throughout their network and to their upstreams. The result was traffic from around the world was directed to the Hong Kong data center, which was offline. We realized the issue and announced the corrected routes. It took approximately 15 minutes from the beginning of the problem to when the routes were corrected network wide. About 25% of CloudFlare's in-bound traffic comes from direct peers. This traffic was not affected by the routing because the direct peers trusted our routes more than the ones they were receiving from other upstreams.
We are implementing systems to run all routing changes through a verification layer that double check before any routes are announced. We are also talking with all our upstream providers to enable additional checks on their networks that do not automatically propogate major routing changes without confirmation.
This is only the second significant outage in CloudFlare's history (here's our post mortem from the other). Any period of downtime is completely unacceptable to us. On behalf of our whole team, I apologize for the problem. We have learned from this experience and are already implementing the safeguards to ensure it will not happen again.