Today, Google's services experienced a limited outage for about 27 minutes over some portions of the Internet. The reason this happened dives into the deep, dark corners of networking. I'm a network engineer at CloudFlare and I played a small part in helping ensure Google came back online. Here's a bit about what happened.
At around 6:24pm PST / 02:24 UTC (5 Nov. 2012 PST / 6 Nov. 2012 UTC), CloudFlare employees noticed that Google's services were offline. We use Google Apps for things like email so when we can't reach their servers the office notices quickly. I'm on the Network Engineering team so I jumped online to figure out if the problem was local to us or global.
I quickly realised that we were unable to resolve all of Googles services — or even reach 188.8.131.52, Googles public DNS server — so I started troubleshooting DNS.
$ dig +trace google.com
Here's the response I got when I tried to reach any of Google.com's name servers:
google.com. 172800 IN NS ns2.google.com.google.com. 172800 IN NS ns1.google.com.google.com. 172800 IN NS ns3.google.com.google.com. 172800 IN NS ns4.google.com.;; Received 164 bytes from 184.108.40.206#53(e.gtld-servers.net) in 152 ms;; connection timed out; no servers could be reached
The fact that no servers could be reached means something was wrong. Specifically, it meant that from our office network we were unable to reach any of Googles DNS servers.
I started to look at the network layer, see if that's where the problems lay.
PING 220.127.116.11 (18.104.22.168): 56 data bytesRequest timeout for icmp_seq 092 bytes from 1-1-15.edge2-eqx-sin.moratelindo.co.id (22.214.171.124): Time to live exceeded
That was curious. Normally, we shouldn't be seeing an Indonesian ISP (Moratel) in the path to Google. I jumped on one of CloudFlare's routers to check what was going on. Meanwhile, others reports from around the globe on Twitter suggested we weren't the only ones seeing the problem.
To understand what went wrong you need to understand a bit about how networking on the Internet works. The Internet is a collection of networks, known as "Autonomous Systems" (AS). Each network has a unique number to identify it known as AS number. CloudFlare's AS number is 13335, Google's is 15169. The networks are connected together by what is known as Border Gateway Protocol (BGP). BGP is the glue of the Internet — announcing what IP addresses belong to each network and establishing the routes from one AS to another. An Internet "route" is exactly what it sounds like: a path from the IP address on one AS to an IP address on another AS.
BGP is largely a trust-based system. Networks trust each other to say which IP addresses and other networks are behind them. When you send a packet or make a request across the network, your ISP connects to its upstream providers or peers and finds the shortest path from your ISP to the destination network.
Unfortunately, if a network starts to send out an announcement of a particular IP address or network behind it, when in fact it is not, if that network is trusted by its upstreams and peers then packets can end up misrouted. That is what was happening here.
I looked at the BGP Routes for a Google IP Address. The route traversed Moratel (23947), an Indonesian ISP. Given that I'm looking at the routing from California and Google is operating Data Centre's not far from our office, packets should never be routed via Indonesia. The most likely cause was that Moratel was announcing a network that wasn't actually behind them.
The BGP Route I saw at the time was:
[email protected]> show route 126.96.36.199 inet.0: 422168 destinations, 422168 routes (422154 active, 0 holddown, 14 hidden)+ = Active Route, - = Last Active, * = Both188.8.131.52/24 *[BGP/170] 00:15:47, MED 18, localpref 100 AS path: 4436 3491 23947 15169 I > to 184.108.40.206 via ge-1/0/9.0
Looking at other routes, for example to Google's Public DNS, it was also stuck routing down the same (incorrect) path:
[email protected]> show route 220.127.116.11 inet.0: 422196 destinations, 422196 routes (422182 active, 0 holddown, 14 hidden)+ = Active Route, - = Last Active, * = Both18.104.22.168/24 *[BGP/170] 00:27:02, MED 18, localpref 100 AS path: 4436 3491 23947 15169 I > to 22.214.171.124 via ge-1/0/9.0
(Image Credit: The Simpsons)
Situations like this are referred to in the industry as "route leakage", as the route has "leaked" past normal paths. This isn't an unprecedented event. Google previously suffered a similar outage when Pakistan was allegedly trying to censor a video on YouTube and the National ISP of Pakistan null routed the service's IP addresses. Unfortunately, they leaked the null route externally. Pakistan Telecom's upstream provider, PCCW, trusted what Pakistan Telecom's was sending them and the routes spread across the Internet. The effect was YouTube was knocked offline for around 2 hours.
The case today was similar. Someone at Moratel likely "fat fingered" an Internet route. PCCW, who was Moratel's upstream provider, trusted the routes Moratel was sending to them. And, quickly, the bad routes spread. It is unlikely this was malicious, but rather a misconfiguaration or an error evidencing some of the failings in the BGP Trust model.
The solution was to get Moratel to stop announcing the routes they shouldn't be. A large part of being a network engineer, especially working at a large network like CloudFlare's, is having relationships with other network engineers around the world. When I figured out the problem, I contacted a colleague at Moratel to let him know what was going on. He was able to fix the problem at around 2:50 UTC / 6:50pm PST. Around 3 minutes later, routing returned to normal and Google's services came back online.
Looking at peering maps, I'd estimate the outage impacted around 3–5% of the Internet's population. The heaviest impact will have been felt in Hong Kong, where PCCW is the incumbent provider. If you were in the area and unable to reach Google's services around that time, now you know why.
Building a Better Internet
This all is a reminder about how the Internet is a system built on trust. Today's incident shows that, even if you're as big as Google, factors outside of your direct control can impact the ability of your customers to get to your site so it's important to have a network engineering team that is watching routes and managing your connectivity around the clock. CloudFlare works every day to ensure our customers get the optimal possible routes. We look out for all the websites on our network to ensure that their traffic is always delivered as fast as possible. Just another day in our ongoing efforts to #savetheweb.
Update:Tuesday, November 6 11:00am PST
Moratel says the issue was caused by an unexpected hardware failure, causing this abnormal condition. This was not a malicious attempt. Moratel immediately shutdown the BGP peering with Google after contact was made while the hardware failure was being looked into.
Thanks for reading all the way to the end. If you enjoyed this post, take a second to learn more about CloudFlare or nominate us for the 2012 Crunchie Award for Best Technical Innovation.