Post Mortem: What Yesterday's Network Outage Looked Like

Yesterday, around 16:36 GMT, we had an interruption to our network services. The interruption was caused by a combination of factors. First, one of our upstream bandwidth providers experienced network issues that primarily affected our European data centers. Second, we misapplied a network rate limit in an attempt to mitigate a large DDoS attack our ops team had been fighting throughout the night.

CloudFlare is designed to make sites faster, safer and more reliable, so any time an incident on our network makes any of our customers' sites unreachable, it is unacceptable. I wanted to take some time to give you a clearer sense of exactly what happened, and of what our ops and engineering teams have been working on since service was restored in order to protect our network from incidents like this in the future.

Two Visible Events

The graph at the top of this post is the aggregate traffic across CloudFlare's eight European data centers. The green section represents traffic that is inbound to our network, the blue line represents traffic that is outbound from our network. Inbound traffic includes both requests that we receive from visitors to our customers' sites as well as any content that we pull from our customers' origin servers. Since we cache content on our network, the blue line should always be significantly above the green one.

Two things stand out in the graph: the big green spike around 13:30 GMT and the drop in the blue line around 16:30 GMT. While the two were almost 3 hours apart, they are in fact related. Here's what happened.

Limited Network and a Nasty Attack

One of our upstream network providers began having issues in Europe, so we routed traffic around their network, which concentrated more traffic than usual in some of our facilities in the region. These incidents happen all the time and our network is designed to make them invisible to our customers. Around 13:00 GMT, a very large DDoS attack was launched against one of our customers' websites. It began as a basic layer 4 attack: a large number of infected machines sending a flood of garbage traffic at the site. The attack peaked at over 65 Gbps of traffic, much of it concentrated in Europe.

This attack is represented by the big green spike in the graph above. It was a lot of traffic, but nothing our network can't handle. We're pretty good at stopping simple attacks like this and, by 13:30 GMT, the attacker had largely abandoned that simple attack vector. During that time, the attack didn't affect any other customers on our network. Nothing so far is atypical for a normal day at CloudFlare.

Mitigation and a Mistake

The attacker switched to trying other vectors over the next several hours. While we have automated systems to deal with many of these attacks, the size of the attack was sufficient that several members of our ops team were monitoring the situation and manually adjusting routing and rules in order to ensure the customer under attack stayed online and none of the rest of the network was impacted.

Around 16:30 GMT the attacker switched vectors again. Our team implemented a new rate limit. The rate limit was supposed to apply only to the customer under attack, but was misapplied to a much wider set of customers. Because traffic was already concentrated in Europe more than usual, the misapplied network rate limit impacted a large number of customers in the region. There was some spillover to traffic at our facilities in North America and Asia Pacific, but the brunt of the outage was felt in Europe.
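To make the scoping mistake concrete, here is a minimal sketch of how a rate limit matched on too broad a prefix sweeps in unrelated customers. The rule structure, prefixes, and cap below are purely illustrative; they are not CloudFlare's actual tooling or address space.

```python
from ipaddress import ip_address, ip_network

# Illustrative sketch: a rate limit is a prefix to match plus a cap
# on packets per second for traffic that matches it.
class RateLimit:
    def __init__(self, prefix, pps_cap):
        self.prefix = ip_network(prefix)
        self.pps_cap = pps_cap

    def matches(self, addr):
        # Traffic is rate-limited if its address falls inside the prefix.
        return ip_address(addr) in self.prefix

# Intended: limit only the customer under attack (documentation prefix).
intended = RateLimit("203.0.113.0/24", pps_cap=10_000)

# Misapplied: a /8 instead of a /24 covers far more address space,
# so unrelated customers' traffic now matches the rule too.
misapplied = RateLimit("203.0.0.0/8", pps_cap=10_000)

unrelated_customer_ip = "203.0.50.7"
print(intended.matches(unrelated_customer_ip))    # False
print(misapplied.matches(unrelated_customer_ip))  # True
```

The point of the sketch is that the cap itself was fine; only the match scope was wrong, which is why the impact showed up as a region-wide drop rather than a single degraded site.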

As you can see from the graph, both inbound and outbound traffic fell off almost entirely in the region. We realized our mistake and reverted the rate limit. In some cases, the rate limit also affected the BGP announcements that set up routes to our network. The spikes you see on the graph over the next hour are from network routing rebalancing in the region.

Sanity Checks

We have been working to automate more and more of our attack mitigation. For most manual changes that could affect our network, we have sanity checks in place to ensure mistakes don't make it into production. This outage exposed one more place we need such checks, and our team worked yesterday to build additional sanity checks to protect against something similar happening again in the future.
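One shape such a check can take is a pre-deployment gate that refuses any rate-limit rule whose prefix is not fully contained within the prefixes assigned to the customer it targets. The sketch below is a hypothetical illustration of that idea, not our actual deployment pipeline; the customer names and prefixes are invented.

```python
from ipaddress import ip_network

# Hypothetical registry mapping each customer to its assigned prefixes
# (documentation address ranges used for illustration).
CUSTOMER_PREFIXES = {
    "customer-under-attack": [ip_network("203.0.113.0/24")],
    "other-customer": [ip_network("198.51.100.0/24")],
}

def sanity_check(target_customer, rule_prefix):
    """Reject a rule whose prefix reaches beyond the target customer."""
    rule = ip_network(rule_prefix)
    allowed = CUSTOMER_PREFIXES.get(target_customer, [])
    if not any(rule.subnet_of(p) for p in allowed):
        raise ValueError(
            f"rule {rule} is broader than {target_customer}'s prefixes; "
            "refusing to deploy"
        )
    return True

# A rule scoped inside the customer's /24 passes the check.
sanity_check("customer-under-attack", "203.0.113.0/25")

# A /8 would be rejected before it ever reached production:
# sanity_check("customer-under-attack", "203.0.0.0/8")  # raises ValueError
```

A gate like this would have caught yesterday's mistake at deploy time, because the misapplied rule matched address space well outside the targeted customer's assignments.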

CloudFlare has grown very quickly because we provide a service many websites need in a way that is affordable and easy for anyone to implement. Core to what we do is ensuring the uptime and availability of our network. We let many of our customers down yesterday. We will learn from the outage and continue to work toward implementing systems that ensure our network is among the most reliable on the Internet.