Today at 16:13 UTC, a large amount of traffic began hitting our Los Angeles data center. We have an in-house team that monitors our network 24x7x365, and their alarms immediately went off. We initially thought it was a very large attack. In fact, it was something much trickier to resolve.
Background
CloudFlare makes wide use of Anycast routing: we announce the same IP addresses from multiple data centers at once, and the Internet's routing system determines which facility a visitor's traffic reaches. This gives us a very large capacity to stop huge DDoS attacks. The challenge is managing the routing to ensure that traffic goes to the correct place.
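For readers unfamiliar with Anycast, here is a deliberately simplified sketch of the idea. The data center codes and network names are made up, and real BGP best-path selection weighs many attributes beyond path length; the toy below simply picks the announcement heard across the fewest networks.

```python
# Toy model of Anycast route selection. Every data center announces the
# same IP prefix; a visitor's network picks whichever announcement it
# hears across the fewest intermediate networks. (Hypothetical names;
# real BGP best-path selection weighs many more attributes.)

def best_pop(paths: dict[str, list[str]]) -> str:
    """Return the data center whose advertised path crosses the fewest networks."""
    return min(paths, key=lambda pop: len(paths[pop]))

# Paths from one hypothetical European ISP to the same anycasted prefix,
# as heard via three different data centers.
paths_from_eu_isp = {
    "AMS": ["TransitA"],                          # 1 network away
    "EWR": ["TransitB", "TransitC"],              # 2 networks away
    "LAX": ["TransitB", "TransitC", "TransitD"],  # 3 networks away
}

print(best_pop(paths_from_eu_isp))  # -> "AMS": European traffic stays in Europe
```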
CloudFlare buys bandwidth to connect to the Internet via what are known as transit providers. The first transit provider we used starting back in 2010 was a company called nLayer. They have been a terrific partner over the years.
In the last year, nLayer merged with GTT. Then, about a month ago, GTT/nLayer purchased Inteliquent (a.k.a. TINET). Over the last few weeks, GTT/nLayer has been consolidating their network with Inteliquent's. When this is complete, GTT/nLayer will move from being a Tier 2 network provider to one of the small handful of Tier 1 network providers.
Bumps
Today's issue was an indirect result of this migration. GTT/nLayer previously connected to Global Crossing, another large transit provider that is now owned by Level3. As part of the GTT/nLayer/Inteliquent consolidation, Level3 switched a route that had run between Global Crossing and GTT/nLayer to instead run directly between Level3 and GTT/nLayer.
For most non-Anycasted traffic, this wouldn't cause any disruption. In our case, it shifted a large amount of traffic that would usually hit our data centers on the east coast of the United States and in Europe so that it all hit our facility in Los Angeles instead. In the worst case, this caused some machines in Los Angeles to overload and return 502 (Bad Gateway) errors. Other visitors may have seen packet loss and slow connections as some links were saturated.
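To make the mechanics concrete, here is the same toy "fewest networks wins" model with hypothetical paths for a European ISP that reaches us through Level3. These are not actual routes; they only illustrate how removing one network from a path can flip which facility wins.

```python
def best_pop(paths: dict[str, list[str]]) -> str:
    # Same simplified "fewest networks wins" rule as the earlier sketch.
    return min(paths, key=lambda pop: len(paths[pop]))

# Before the consolidation: the path toward Los Angeles runs through
# Global Crossing, so it is no better than the path toward Amsterdam and
# this ISP's traffic stays in Europe (ties go to the first entry here).
before = {
    "AMS": ["TransitA", "TransitB", "GTT"],
    "LAX": ["Level3", "GlobalCrossing", "GTT"],
}
print(best_pop(before))  # -> "AMS"

# After Level3 replaced the Global Crossing leg with a direct Level3 <->
# GTT/nLayer route, the path toward Los Angeles got one network shorter
# and suddenly won -- pulling this European ISP's traffic to LAX.
after = {
    "AMS": ["TransitA", "TransitB", "GTT"],
    "LAX": ["Level3", "GTT"],
}
print(best_pop(after))  # -> "LAX"
```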
It wasn't immediately obvious what the cause of the issue was. We worked directly with GTT/nLayer's network team to rebalance traffic, which temporarily put additional load on Seattle, then Dallas, then Chicago. While usually only customers near the affected data centers would see an issue, in this case traffic from as far away as Europe was landing in the wrong place.
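The exact mechanics of the rebalancing aren't described here, but one common lever for steering Anycast traffic is AS-path prepending: artificially lengthening the path advertised from an overloaded location so that other sites win route selection. Continuing the toy model with made-up names and hop counts, the sketch also shows why draining one location tends to push load onto the next-best site rather than spreading it out.

```python
def best_pop(paths: dict[str, list[str]]) -> str:
    # Same simplified "fewest networks wins" rule as the earlier sketches.
    return min(paths, key=lambda pop: len(paths[pop]))

def prepend(path: list[str], origin_as: str, times: int) -> list[str]:
    """Lengthen an advertised path by repeating the origin AS (prepending)."""
    return path + [origin_as] * times

# Overloaded: the shortened Level3 path is dragging everything to LAX.
paths = {
    "SEA": ["TransitA", "TransitB", "GTT"],
    "LAX": ["Level3", "GTT"],
}
print(best_pop(paths))  # -> "LAX"

# Prepend on the LAX announcement so it stops winning. The traffic does
# not spread out evenly -- it simply moves to the next-best location,
# which is why load hopped from Seattle to Dallas to Chicago as routes
# were adjusted.
paths["LAX"] = prepend(paths["LAX"], "OUR_AS", 3)
print(best_pop(paths))  # -> "SEA"
```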
Whether a visitor was affected was a bit of a crapshoot. We use multiple transit providers, so if your ISP wasn't connected to Level3 and you weren't naturally hitting an overloaded data center, you likely saw no problem. Overall, we estimate that around 10% of connections to our network were impacted during an approximately 20-minute window. A small percentage of users may have seen issues for a longer period, depending on their connection to Level3 and whether they were pulled to more than one affected location.
Responsibility
Neither Level3 nor GTT/nLayer had any way of knowing how the changes they were making to their systems would affect us downstream.
While this was a very tricky situation for us to anticipate, or even to diagnose while it was happening, the responsibility lies with us to ensure our routing gets people to the right locations and that no facility is overburdened. We've added this scenario to the conditions we guard against, so a similar change upstream should not affect us in the future.
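As a rough illustration only (the exact check isn't described here, and the data center shares and thresholds below are made up), a guard of this kind might watch for any single facility suddenly carrying far more than its normal share of global traffic.

```python
# Hypothetical guard: flag any data center carrying far more than its
# usual share of global requests. Baseline shares and the alert factor
# are illustrative, not real operational values.
USUAL_SHARE = {"LAX": 0.08, "EWR": 0.12, "AMS": 0.15}  # fraction of global requests
ALERT_FACTOR = 2.5  # alert when a facility carries >2.5x its usual share

def pops_to_alert(current_requests: dict[str, int]) -> list[str]:
    """Return the data centers whose current share of traffic is anomalously high."""
    total = sum(current_requests.values())
    return [
        pop for pop, count in current_requests.items()
        if total and count / total > ALERT_FACTOR * USUAL_SHARE.get(pop, 0.05)
    ]

# During an event like today's, LAX's share would spike well past its baseline:
print(pops_to_alert({"LAX": 9_000, "EWR": 2_000, "AMS": 2_500}))  # -> ['LAX']
```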
The GTT/nLayer migration is scheduled to be completed today. One of the benefits of connecting to Tier 1 providers is route stability. While today's network issue was painful, I am encouraged that the underlying reason for the issue stems from an effort to build a more robust, stable, and reliable network.