On May 31, 2018 we had a 17 minute outage on our 1.1.1.1 resolver service; this was our doing and not the result of an attack.
Cloudflare is protected from attacks by the Gatebot DDoS mitigation pipeline. Gatebot performs hundreds of mitigations a day, shielding our infrastructure and our customers from L3/L4 and L7 attacks. Here is a chart of a count of daily Gatebot actions this year:
In the past, we have blogged about our systems:
Today, things didn't go as planned.
Gatebot
Cloudflare’s network is large, handles many different types of traffic and mitigates different types of known and not-yet-seen attacks. The Gatebot pipeline manages this complexity in three separate stages:
attack detection - collects live traffic measurements across the globe and detects attacks
reactive automation - chooses appropriate mitigations
mitigations - executes mitigation logic on the edge
The benign-sounding "reactive automation" part is actually the most complicated stage in the pipeline. We expected that from the start, which is why we implemented this stage using a custom Functional Reactive Programming (FRP) framework. If you want to know more about it, see the talk and the presentation.
Our mitigation logic often combines multiple inputs from different internal systems, to come up with the best, most appropriate mitigation. One of the most important inputs is the metadata about our IP address allocations: we mitigate attacks hitting HTTP and DNS IP ranges differently. Our FRP framework allows us to express this in clear and readable code. For example, this is part of the code responsible for performing DNS attack mitigation:
def action_gk_dns(...):
[...]
if port != 53:
return None
if whitelisted_ip.get(ip):
return None
if ip not in ANYCAST_IPS:
return None
[...]
It's the last check in this code that we tried to improve today.
Clearly, the code above is a huge oversimplification of all that goes into attack mitigation, but making an early decision about whether the attacked IP serves DNS traffic or not is important. It's that check that went wrong today. If the IP does serve DNS traffic then attack mitigation is handled differently from IPs that never serve DNS.
Cloudflare is growing, so must Gatebot
Gatebot was created in early 2015. Three years may not sound like much time, but since then we've grown dramatically and added layers of services to our software stack. Many of the internal integration points that we rely on today didn't exist then.
One of them is what we call the Provision API. When Gatebot sees an IP address, it needs to be able to figure out whether or not it’s one of Cloudflare’s addresses. Provision API is a simple RESTful API used to provide this kind of information.
This is a relatively new API, and prior to its existence, Gatebot had to figure out which IP addresses were Cloudflare addresses by reading a list of networks from a hard-coded file. In the code snippet above, the ANYCAST_IPS variable is populated using this file.
Things went wrong
Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.
What we did not account for, and what Provision API didn’t know about, was that 1.1.1.0/24 and 1.0.0.0/24 are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.
As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!
Impact
The effect was that, after pushing the new code release, our systems interpreted the resolver traffic as an attack. The automatic systems deployed DNS mitigations for our DNS resolver IP ranges for 17 minutes, between 17:58 and 18:13 May 31st UTC. This caused 1.1.1.1 DNS resolver to be globally inaccessible.
Lessons Learned
While Gatebot, the DDoS mitigation system, has great power, we failed to test the changes thoroughly. We are using today’s incident to improve our internal systems.
Our team is incredibly proud of 1.1.1.1 and Gatebot, but today we fell short. We want to apologize to all of our customers. We will use today’s incident to improve. The next time we mitigate 1.1.1.1 traffic, we will make sure there is a legitimate attack hitting us.