The Cloudflare Blog

How Cloudflare responded to the “Copy Fail” Linux vulnerability

Chris J Arges — Thu, 07 May 2026 13:00:00 GMT

On April 29, 2026, a Linux kernel local privilege escalation vulnerability was publicly disclosed under the name "Copy Fail" (CVE-2026-31431). Cloudflare’s Security and Engineering teams began assessing the vulnerability as soon as it was disclosed. We reviewed the exploit technique, evaluated exposure across our infrastructure, and validated that our existing behavioral detections could identify the exploit pattern within minutes.

There was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point. Read on to learn how our preparedness paid off.

Background

Our Linux kernel release process

Cloudflare operates a global Linux server infrastructure at an immense scale, with datacenters located across 330 cities. We maintain a custom Linux kernel build based on the community's Long-Term Support (LTS) versions to manage updates effectively at this volume. At any given time, we may utilize multiple LTS versions from various series, such as 6.12 or 6.18, which benefit from extended update periods.

The community regularly merges and releases security and stability updates which trigger an automated job to generate a new internal kernel build approximately every week. These builds undergo testing in our staging data centers to ensure stability before a global rollout. Following a successful release, the Edge Reboot Release (ERR) pipeline manages a systematic update and reboot of the edge infrastructure on a four-week cycle. Our control plane infrastructure typically adopts the most recent kernel, with reboots scheduled according to specific workload requirements.

By the time a CVE becomes public knowledge, the necessary fix has typically been integrated into stable Linux LTS releases for several weeks. Our established procedures ensure that we have already deployed these patches.

At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version, while a subset of machines had begun transitioning to the newer 6.18 LTS release.

About the Copy Fail vulnerability

It helps to understand the vulnerability before getting to the response story. A comprehensive write-up can be found in the original Xint Code disclosure post.

AF_ALG and the kernel crypto API

The Linux kernel's internal crypto API manages functions like kTLS and IPsec. Userspace programs access this via the AF_ALG socket family, allowing unprivileged processes to request encryption or decryption. The algif_aead module facilitates this for Authenticated Encryption with Associated Data (AEAD) ciphers.

An unprivileged program follows these steps:

Opens an AF_ALG socket and binds to an AEAD template.
Sets a key and accepts a request socket.
Submits input via sendmsg() or splice().
Executes the operation using recvmsg().

The splice() system call is critical here, as it moves data by passing page cache references.

Memory mechanics: page cache and in-place crypto

The page cache is a shared system cache for file contents. Modifying a page belonging to a setuid binary effectively edits that program for all users until the page is evicted.

The crypto API utilizes scatterlists, which are structures linking various memory pages. In 2017, algif_aead was optimized for in-place operations, chaining destination and reference pages together. This design lacked enforcement to prevent algorithms from writing past intended boundaries.

The vulnerability: out-of-bounds write

When the user executes recvmsg(), the authencesn wrapper in the kernel performs a 4-byte write past the legitimate output region:

scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);

By using splice(), an attacker can chain a target file's page cache pages to the scatterlist. The out-of-bounds write then taints the cached file, allowing an attacker to control which file is modified, the offset, and the specific 4 bytes written. This means the attacker can manipulate the following with this exploit:

File: Any readable file.
Offset: Tunable via assoclen and splice parameters.
Value: Controlled via AAD bytes 4-7 in sendmsg()

The exploit, step by step

The default exploit targets /usr/bin/su, a setuid-root binary present on essentially every distribution.

Cache Reference: Open /usr/bin/su as O_RDONLY and read() to populate the page cache. Use splice() on the file descriptor to pass these page cache references into the crypto scatterlist.
Setup: Create an AF_ALG socket, bind() to authencesn(hmac(sha256),cbc(aes)), set a key, and accept a request socket without needing privileges.
Write Construction: For each 4-byte shellcode chunk:
- sendmsg() with AAD bytes 4–7 containing the shellcode.
- splice() the binary into a pipe then the AF_ALG socket so assoclen + cryptlen targets the desired .text offset.
Trigger: recvmsg() initiates decryption. authencesn writes its scratch data to the target offset of /usr/bin/su in the page cache. Although the function returns -EBADMSG, the 4-byte write is now in the global page cache.
Execution: Running execve("/usr/bin/su") loads the tainted page cache. Since the binary is setuid-root, the injected shellcode executes with root privileges.

The upstream fix (commit a664bf3d603d) reverts the 2017 in-place optimization, removing the exploit.

How we responded

When the vulnerability was disclosed, many workstreams started in parallel:

Mapping the blast radius: Our security team worked with kernel engineers to determine which kernel versions were vulnerable and assess the potential exposure.
Validating coverage: Security reviewed the exploit technique and confirmed that our existing behavioral detections could identify the exploit pattern during authorized internal validation.
Proactive threat hunting: Security began searching for signs that the vulnerability had been exploited before it was publicly known, going back 48 hours in our fleet-wide logs.
Engineering a mitigation: Kernel engineers began building a runtime mitigation that would protect the fleet without breaking production services.
Continuing software updates: Our engineering teams worked on delivering an updated Linux kernel, which required carefully rebooting and rolling it out across our servers.

There was no customer impact at any point during this response.

Validating detection coverage

One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.

When our engineers validated the vulnerability internally as part of the response, the detection platform flagged it within minutes. The system linked the entire execution chain—starting at the script interpreter, moving through the kernel’s cryptographic subsystem, and ending at the privilege escalation binary—flagging it as malicious based on fleet-wide behavioral patterns.

This happened without a signature update, without a rule change, and without human intervention. Our behavioral detection coverage existed before we wrote any custom logic for this particular Copy File exploit.

The confirmation was important because it meant we had coverage before writing a vulnerability-specific rule.

Hunting for exploitation

While our engineering team moved to a more targeted mitigation, our security investigation had been running since disclosure. This is our standard procedure for any critical vulnerability.

Our security team operates on a simple principle for critical vulnerabilities: assume compromise until you can prove otherwise. The investigation started from the assumption that exploitation could have occurred before the vulnerability was public, and we worked systematically to either confirm or rule it out.

The exploit leaves a distinctive trace in kernel logs when it runs. We searched for that trace across our centralized logging infrastructure, covering 48 hours before the vulnerability was publicly disclosed. If someone had exploited this before the world knew about it, we would have seen it.

We pulled access logs for affected systems and reconstructed who connected, when, and what commands they ran. This gave us a complete forensic picture of interactive activity on potentially affected infrastructure.

We checked that system binaries had not been tampered with, validated cryptographic hashes against known-good package manifests, looked for persistence mechanisms, and audited network connections for anything unusual. Everything was clean.

Incident timeline and impact

Time (UTC)	Event
2026-04-29 16:00	Copy Fail publicly disclosed.
2026-04-29 ~21:00	Security and Engineering teams began assessing fleet exposure and mitigation options before full declaration of the Incident Response process
2026-04-29 22:52	Security confirmed existing behavioral detection covered the Copy Fail exploit pattern. During authorized internal validation, detection flagged the activity within minutes.
2026-04-29 23:01	Existing behavioral detection generated a high-severity alert for exploit-like activity, confirming detection coverage for the technique.
2026-04-29 (evening)	First mitigation attempt pushed to our staging datacenter. The deployment process surfaced a dependency conflict; the mitigation was rolled back. No production systems were affected.
2026-04-29 (overnight)	Engineering drafted bpf-lsm mitigation program.
2026-04-30 03:14	Security incident declared to drive cross-functional collaboration and urgency. Security performed fleetwide threat hunting of historical data to confirm that no malicious activity was present on Cloudflare systems.
2026-04-30 (morning)	Engineering tested the bpf-lsm mitigation program and made it production-ready.
2026-04-30 14:25	Engineering incident declared to coordinate mitigation program and Linux patch rollout.
2026-04-30 ~17:00	Decision made: ship a patched build of the previous LTS line through reboot automation; do not accelerate the new LTS; lean on bpf-lsm in the meantime.
2026-04-30 (afternoon)	Visibility pipeline (eBPF tracing of AF_ALG socket usage) deployed fleet-wide. Gives a complete picture of all legitimate AF_ALG users.
2026-04-30 (evening)	bpf-lsm mitigation program rolled out behind a separate gate to fully mitigate the fleet. End-to-end verification on a previously-vulnerable test node confirms the exploit no longer works.
2026-05-04 (morning)	Reboot automation resumed at normal pace with the patched kernel.
2026-05-04 onward	Servers that had already passed through reboot automation earlier in the week manually rebooted to pick up the patched kernel. Unpatched servers update per our normal reboot automation.

This graph shows the progress of our mitigation program as it progressed through our infrastructure.

How did we mitigate it?

Because of the long timeframe involved in deploying a patched Linux kernel, we also pursued mitigating this exploit without a reboot.

Removing the module

The bug was in the algif_aead kernel module. Therefore, the simple fix was to just remove this module and disallow it from being reloaded.

This mitigation was therefore exactly what the Copy Fail write-up from the security researchers who identified it recommends.

echo "install algif_aead /bin/false" > /etc/modprobe.d/disable-algif.conf
rmmod algif_aead 2>/dev/null || true

Unfortunately removing the module would have impacted software that leverages the kernel crypto API. This meant that we had to figure out a more surgical mitigation.

Bpf-lsm

We’ve already developed and deployed such a tool for this exact scenario: bpf-lsm. Instead of removing the module, this tool leaves it loaded for legitimate users and uses a BPF Linux Security Module program to deny the socket_bind LSM hook for everyone else. This completely blocks the front door for any exploits.

A draft of the eBPF program was put together overnight. Team members picked it up the following morning, ran validations, and made it production-ready. The program is fairly straightforward. On every socket_bind call:

If the socket family is not AF_ALG, allow the call through unchanged.
If the family is AF_ALG, check the calling binary's path against an allow-list of the binaries we know to be legitimate users.
If the binary is on the allow-list, allow the bind. Otherwise, deny it.

To verify the mitigation on a given machine without exploiting it, the Copy Fail write-up gives a one-liner:

python3 -c 'import socket; s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0); s.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));'

On a mitigated machine you get PermissionError: [Errno 1] Operation not permitted (or FileNotFoundError, depending on which mitigation is active) instead of a successful bind.

Rolling it out

Before enabling enforcement, we verified that our known internal service was the sole legitimate AF_ALG user to avoid accidental outages. We used prometheus-ebpf-exporter to hook the socket() syscall and track AF_ALG usage per binary across the fleet. This required no kernel changes and provided aggregate data from hundreds of thousands of servers within hours. Results confirmed the identified service was indeed the only legitimate user.

So the bpf-lsm rollout was deliberately staged in two steps:

Get visibility first. Push the ebpf-exporter config gated by salt. Confirm at the metric layer that the known service is effectively the only thing creating AF_ALG sockets.
Then enforce. Push the bpf-lsm program behind a separate enforcement gate.

In parallel, the upstream backport for our majority LTS line finally became available, and our internal automation built a patched kernel against it.

We started to test the patched kernel in our staging datacenters as soon as possible, then we resumed the longer reboot process in order to fully patch our fleet.

Remediation and follow-up steps

While we were prepared for this scenario, at Cloudflare we’re always learning and improving. Key areas we identified for improvement:

Better visibility into kernel-API dependencies. We will review kernel-subsystem usage across production services, so we can continue to quickly mitigate exploits without service disruption.
Better runtime mitigation. bpf-lsm is a valuable tool for mitigations, but we want to make this tool even better. This will include looking into faster deployments, better playbooks, and better logging and visibility of the tool.
Reduce attack surface of Linux Kernel. Review and audit our kernel configuration. Proactively identify unused modules or features so that we can remove them from our build entirely.

Conclusion

The "Copy Fail" vulnerability presented a unique challenge for us. Despite our practice of deploying Linux patch updates every two weeks, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line. Despite that, we were still able to roll out patched kernels within hours of the backport's release. In the interim, bpf-lsm provided a surgical, no-reboot mitigation that secured our fleet. While our initial attempt to disable the problematic module failed, it did so safely within our internal staging environment rather than production, allowing us to identify this dependency.

By the end of the rollout, every machine in our fleet was protected by either a patched kernel or a bpf-lsm program denying the vulnerable code path to non-allow-listed binaries. There was no customer impact at any point during this incident, and we have committed to the follow-up work above to make our response faster and our visibility better the next time something like this lands. Responsible disclosure works, in-kernel visibility tooling pays off in moments exactly like this one, and bpf-lsm continues to be one of the most useful primitives we have for runtime kernel mitigation.

At Cloudflare, critical vulnerability response is a coordinated effort across Security, Engineering, Product, and many other teams. Special thanks to Ali Adnan, Ivan Babrou, Frederik Baetens, Curtis Bray, Piers Cornwell, Everton Didone Foscarini, Rob Dinh, Elle Dougherty, Kevin Flansburg, Matt Fleming, Kimberley Hall, Brandon Harris, Jerry Ho, Oxana Kharitonova, Marek Kroemeke, Fred Lawler, James Munson, Nafeez Nazer, Walead Parviz, Miguel Pato, Evan Pratten, Josh Seba, June Slater, Ryan Timken, Michael Wolf, Jianxin Zeng and everyone else who contributed to the investigation, mitigation, and remediation of Copy Fail. We'd also like to thank the Linux upstream maintainers and Copy Fail researchers whose work helped make a rapid response possible.

SLP: a new DDoS amplification vector in the wild

Alex Forster — Tue, 25 Apr 2023 13:07:56 GMT

Earlier today, April 25, 2023, researchers Pedro Umbelino at Bitsight and Marco Lux at Curesec published their discovery of CVE-2023-29552, a new DDoS reflection/amplification attack vector leveraging the SLP protocol. If you are a Cloudflare customer, your services are already protected from this new attack vector.

Service Location Protocol (SLP) is a “service discovery” protocol invented by Sun Microsystems in 1997. Like other service discovery protocols, it was designed to allow devices in a local area network to interact without prior knowledge of each other. SLP is a relatively obsolete protocol and has mostly been supplanted by more modern alternatives like UPnP, mDNS/Zeroconf, and WS-Discovery. Nevertheless, many commercial products still offer support for SLP.

Since SLP has no method for authentication, it should never be exposed to the public Internet. However, Umbelino and Lux have discovered that upwards of 35,000 Internet endpoints have their devices’ SLP service exposed and accessible to anyone. Additionally, they have discovered that the UDP version of this protocol has an amplification factor of up to 2,200x, which is the third largest discovered to-date.

Cloudflare expects the prevalence of SLP-based DDoS attacks to rise significantly in the coming weeks as malicious actors learn how to exploit this newly discovered attack vector.

Cloudflare customers are protected

If you are a Cloudflare customer, our automated DDoS protection system already protects your services from these SLP amplification attacks.To avoid being exploited to launch the attacks, if you are a network operator, you should ensure that you are not exposing the SLP protocol directly to the public Internet. You should consider blocking UDP port 427 via access control lists or other means. This port is rarely used on the public Internet, meaning it is relatively safe to block without impacting legitimate traffic. Cloudflare Magic Transit customers can use the Magic Firewall to craft and deploy such rules.

How to drop 10 million packets per second

Marek Majkowski — Fri, 06 Jul 2018 13:00:00 GMT

Internally our DDoS mitigation team is sometimes called "the packet droppers". When other teams build exciting products to do smart things with the traffic that passes through our network, we take joy in discovering novel ways of discarding it.

CC BY-SA 2.0 image by Brian Evans

Being able to quickly discard packets is very important to withstand DDoS attacks.

Dropping packets hitting our servers, as simple as it sounds, can be done on multiple layers. Each technique has its advantages and limitations. In this blog post we'll review all the techniques we tried thus far.

Test bench

To illustrate the relative performance of the methods we'll show some numbers. The benchmarks are synthetic, so take the numbers with a grain of salt. We'll use one of our Intel servers, with a 10Gbps network card. The hardware details aren't too important, since the tests are prepared to show the operating system, not hardware, limitations.

Our testing setup is prepared as follows:

We transmit a large number of tiny UDP packets, reaching 14Mpps (millions packets per second).
This traffic is directed towards a single CPU on a target server.
We measure the number of packets handled by the kernel on that one CPU.

We're not trying to maximize userspace application speed, nor packet throughput - instead, we're trying to specifically show kernel bottlenecks.

The synthetic traffic is prepared to put maximum stress on conntrack - it uses random source IP and port fields. Tcpdump will show it like this:

$ tcpdump -ni vlan100 -c 10 -t udp and dst port 1234
IP 198.18.40.55.32059 > 198.18.0.12.1234: UDP, length 16
IP 198.18.51.16.30852 > 198.18.0.12.1234: UDP, length 16
IP 198.18.35.51.61823 > 198.18.0.12.1234: UDP, length 16
IP 198.18.44.42.30344 > 198.18.0.12.1234: UDP, length 16
IP 198.18.106.227.38592 > 198.18.0.12.1234: UDP, length 16
IP 198.18.48.67.19533 > 198.18.0.12.1234: UDP, length 16
IP 198.18.49.38.40566 > 198.18.0.12.1234: UDP, length 16
IP 198.18.50.73.22989 > 198.18.0.12.1234: UDP, length 16
IP 198.18.43.204.37895 > 198.18.0.12.1234: UDP, length 16
IP 198.18.104.128.1543 > 198.18.0.12.1234: UDP, length 16

On the target side all of the packets are going to be forwarded to exactly one RX queue, therefore one CPU. We do this with hardware flow steering:

ethtool -N ext0 flow-type udp4 dst-ip 198.18.0.12 dst-port 1234 action 2

Benchmarking is always hard. When preparing the tests we learned that having any active raw sockets destroys performance. It's obvious in hindsight, but easy to miss. Before running any tests remember to make sure you don't have any stale tcpdump process running. This is how to check it, showing a bad process active:

$ ss -A raw,packet_raw -l -p|cat
Netid  State      Recv-Q Send-Q Local Address:Port
p_raw  UNCONN     525157 0      *:vlan100          users:(("tcpdump",pid=23683,fd=3))

Finally, we are going to disable the Intel Turbo Boost feature on the machine:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

While Turbo Boost is nice and increases throughput by at least 20%, it also drastically worsens the standard deviation in our tests. With turbo enabled we had ±1.5% deviation in our numbers. With Turbo off this falls down to manageable 0.25%.

Step 1. Dropping packets in application

Let's start with the idea of delivering packets to an application and ignoring them in userspace code. For the test setup, let's make sure our iptables don't affect the performance:

iptables -I PREROUTING -t mangle -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT
iptables -I PREROUTING -t raw -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT
iptables -I INPUT -t filter -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT

The application code is a simple loop, receiving data and immediately discarding it in the userspace:

s = socket.socket(AF_INET, SOCK_DGRAM)
s.bind(("0.0.0.0", 1234))
while True:
    s.recvmmsg([...])

We prepared the code, to run it:

$ ./dropping-packets/recvmmsg-loop
packets=171261 bytes=1940176

This setup allows the kernel to receive a meagre 175kpps from the hardware receive queue, as measured by ethtool and using our simple mmwatch tool:

$ mmwatch 'ethtool -S ext0|grep rx_2'
 rx2_packets: 174.0k/s

The hardware technically gets 14Mpps off the wire, but it's impossible to pass it all to a single RX queue handled by only one CPU core doing kernel work. mpstat confirms this:

$ watch 'mpstat -u -I SUM -P ALL 1 1|egrep -v Aver'
01:32:05 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:32:06 PM    0    0.00    0.00    0.00    2.94    0.00    3.92    0.00    0.00    0.00   93.14
01:32:06 PM    1    2.17    0.00   27.17    0.00    0.00    0.00    0.00    0.00    0.00   70.65
01:32:06 PM    2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
01:32:06 PM    3    0.95    0.00    1.90    0.95    0.00    3.81    0.00    0.00    0.00   92.38

As you can see application code is not a bottleneck, using 27% sys + 2% userspace on CPU #1, while network SOFTIRQ on CPU #2 uses 100% resources.

By the way, using recvmmsg(2) is important. In these post-Spectre days, syscalls got more expensive and indeed, we run kernel 4.14 with KPTI and retpolines:

$ tail -n +1 /sys/devices/system/cpu/vulnerabilities/*
==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Mitigation: PTI

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Mitigation: __user pointer sanitization

==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Mitigation: Full generic retpoline, IBPB, IBRS_FW

Step 2. Slaughter conntrack

We specifically designed the test - by choosing random source IP and ports - to put stress on the conntrack layer. This can be verified by looking at number of conntrack entries, which during the test is reaching the maximum:

$ conntrack -C
2095202

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 2097152

You can also observe conntrack shouting in dmesg:

[4029612.456673] nf_conntrack: nf_conntrack: table full, dropping packet
[4029612.465787] nf_conntrack: nf_conntrack: table full, dropping packet
[4029617.175957] net_ratelimit: 5731 callbacks suppressed

To speed up our tests let's disable it:

iptables -t raw -I PREROUTING -d 198.18.0.12 -p udp -m udp --dport 1234 -j NOTRACK

And rerun the tests:

$ ./dropping-packets/recvmmsg-loop
packets=331008 bytes=5296128

This instantly bumps the application receive performance to 333kpps. Hurray!

PS. With SO_BUSY_POLL we can bump the numbers to 470k pps, but this is a subject for another time.

Step 3. BPF drop on a socket

Going further, why deliver packets to userspace application at all? While this technique is uncommon, we can attach a classical BPF filter to a SOCK_DGRAM socket with setsockopt(SO_ATTACH_FILTER) and program the filter to discard packets in kernel space.

See the code, to run it:

$ ./bpf-drop
packets=0 bytes=0

With drops in BPF (both Classical as well as extended eBPF have similar performance) we process roughly 512kpps. All of them get dropped in the BPF filter while still in software interrupt mode, which saves us CPU needed to wake up the userspace application.

Step 4. iptables DROP after routing

As a next step we can simply drop packets in the iptables firewall INPUT chain by adding rule like this:

iptables -I INPUT -d 198.18.0.12 -p udp --dport 1234 -j DROP

Remember we disabled conntrack already with -j NOTRACK. These two rules give us 608kpps.

The numbers in iptables counters:

$ mmwatch 'iptables -L -v -n -x | head'

Chain INPUT (policy DROP 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination
605.9k/s    26.7m/s DROP       udp  --  *      *       0.0.0.0/0            198.18.0.12          udp dpt:1234

600kpps is not bad, but we can do better!

Step 5. iptables DROP in PREROUTING

An even faster technique is to drop packets before they get routed. This rule can do this:

iptables -I PREROUTING -t raw -d 198.18.0.12 -p udp --dport 1234 -j DROP

This produces whopping 1.688mpps.

This is quite a significant jump in performance, I don't fully understand it. Either our routing layer is unusually complex or there is a bug in our server configuration.

In any case - "raw" iptables table is definitely way faster.

Step 6. nftables DROP before CONNTRACK

Iptables is considered passé these days. The new kid in town is nftables. See this video for a technical explanation why nftables is superior. Nftables promises to be faster than gray haired iptables for many reasons, among them is a rumor that retpolines (aka: no speculation for indirect jumps) hurt iptables quite badly.

Since this article is not about comparing the nftables vs iptables speed, let's try only the fastest drop I could came up with:

nft add table netdev filter
nft -- add chain netdev filter input { type filter hook ingress device vlan100 priority -500 \; policy accept \; }
nft add rule netdev filter input ip daddr 198.18.0.0/24 udp dport 1234 counter drop
nft add rule netdev filter input ip6 daddr fd00::/64 udp dport 1234 counter drop

The counters can be seen with this command:

$ mmwatch 'nft --handle list chain netdev filter input'
table netdev filter {
    chain input {
        type filter hook ingress device vlan100 priority -500; policy accept;
        ip daddr 198.18.0.0/24 udp dport 1234 counter packets    1.6m/s bytes    69.6m/s drop # handle 2
        ip6 daddr fd00::/64 udp dport 1234 counter packets 0 bytes 0 drop # handle 3
    }
}

Nftables "ingress" hook yields around 1.53mpps. This is slightly slower than iptables in the PREROUTING layer. This is puzzling - theoretically "ingress" happens before PREROUTING, so should be faster.

In our test nftables was slightly slower than iptables, but not by much. Nftables is still better :P

Step 7. tc ingress handler DROP

A somewhat surprising fact is that a tc (traffic control) ingress hook happens before even PREROUTING. tc makes it possible to select packets based on basic criteria and indeed - action drop - them. The syntax is rather hacky, so it's recommended to use this script to set it up. We need a tiny bit more complex tc match, here is the command line:

tc qdisc add dev vlan100 ingress
tc filter add dev vlan100 parent ffff: prio 4 protocol ip u32 match ip protocol 17 0xff match ip dport 1234 0xffff match ip dst 198.18.0.0/24 flowid 1:1 action drop
tc filter add dev vlan100 parent ffff: protocol ipv6 u32 match ip6 dport 1234 0xffff match ip6 dst fd00::/64 flowid 1:1 action drop

We can verify it:

$ mmwatch 'tc -s filter  show dev vlan100  ingress'
filter parent ffff: protocol ip pref 4 u32 
filter parent ffff: protocol ip pref 4 u32 fh 800: ht divisor 1 
filter parent ffff: protocol ip pref 4 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit   1.8m/s success   1.8m/s)
  match 00110000/00ff0000 at 8 (success   1.8m/s ) 
  match 000004d2/0000ffff at 20 (success   1.8m/s ) 
  match c612000c/ffffffff at 16 (success   1.8m/s ) 
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 1.0/s sec
        Action statistics:
        Sent    79.7m/s bytes   1.8m/s pkt (dropped   1.8m/s, overlimits 0 requeues 0) 
        backlog 0b 0p requeues 0

A tc ingress hook with u32 match allows us to drop 1.8mpps on a single CPU. This is brilliant!

But we can go even faster...

Step 8. XDP_DROP

Finally, the ultimate weapon is XDP - eXpress Data Path. With XDP we can run eBPF code in the context of a network driver. Most importantly, this is before the skbuff memory allocation, allowing great speeds.

Usually XDP projects have two parts:

the eBPF code loaded into the kernel context
the userspace loader, which loads the code onto the right network card and manages it

Writing the loader is pretty hard, so instead we can use the new iproute2 feature and load the code with this trivial command:

ip link set dev ext0 xdp obj xdp-drop-ebpf.o

Tadam!

The source code for the loaded eBPF XDP program is available here. The program parses IP packets and looks for desired characteristics: IP transport, UDP protocol, desired target subnet and destination port:

if (h_proto == htons(ETH_P_IP)) {
    if (iph->protocol == IPPROTO_UDP
        && (htonl(iph->daddr) & 0xFFFFFF00) == 0xC6120000 // 198.18.0.0/24
        && udph->dest == htons(1234)) {
        return XDP_DROP;
    }
}

XDP program needs to be compiled with modern clang that can emit BPF bytecode. After this we can load and verify the running XDP program:

$ ip link show dev ext0
4: ext0:  mtu 1500 xdp qdisc fq state UP mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:8a:59:8e brd ff:ff:ff:ff:ff:ff
    prog/xdp id 5 tag aedc195cc0471f51 jited

And see the numbers in ethtool -S network card statistics:

$ mmwatch 'ethtool -S ext0|egrep "rx"|egrep -v ": 0"|egrep -v "cache|csum"'
     rx_out_of_buffer:     4.4m/s
     rx_xdp_drop:         10.1m/s
     rx2_xdp_drop:        10.1m/s

Whooa! With XDP we can drop 10 million packets per second on a single CPU.

CC BY-SA 2.0 image by Andrew Filer

Summary

We repeated the these for both IPv4 and IPv6 and prepared this chart:

Generally speaking in our setup IPv6 had slightly lower performance. Remember that IPv6 packets are slightly larger, so some performance difference is unavoidable.

Linux has numerous hooks that can be used to filter packets, each with different performance and ease of use characteristics.

For DDoS purporses, it may totally be reasonable to just receive the packets in the application and process them in userspace. Properly tuned applications can get pretty decent numbers.

For DDoS attacks with random/spoofed source IP's, it might be worthwhile disabling conntrack to gain some speed. Be careful though - there are attacks for which conntrack is very helpful.

In other circumstances it may make sense to integrate the Linux firewall into the DDoS mitigation pipeline. In such cases, remember to put the mitigations in a "-t raw PREROUTING" layer, since it's significantly faster than "filter" table.

For even more demanding workloads, we always have XDP. And boy, it is powerful. Here is the same chart as above, but including XDP:

If you want to reproduce these numbers, see the README where we documented everything.

Here at Cloudflare we are using... almost all of these techniques. Some of the userspace tricks are integrated with our applications. The iptables layer is managed by our Gatebot DDoS pipeline. Finally, we are working on replacing our proprietary kernel offload solution with XDP.

Want to help us drop more packets? We're hiring for many roles, including packet droppers, systems engineers and more!

Special thanks to Jesper Dangaard Brouer for helping with this work.

Today we mitigated 1.1.1.1

Marek Majkowski — Fri, 01 Jun 2018 01:13:53 GMT

On May 31, 2018 we had a 17 minute outage on our 1.1.1.1 resolver service; this was our doing and not the result of an attack.

Cloudflare is protected from attacks by the Gatebot DDoS mitigation pipeline. Gatebot performs hundreds of mitigations a day, shielding our infrastructure and our customers from L3/L4 and L7 attacks. Here is a chart of a count of daily Gatebot actions this year:

In the past, we have blogged about our systems:

Meet Gatebot, a bot that allows us to sleep

Today, things didn't go as planned.

Gatebot

Cloudflare’s network is large, handles many different types of traffic and mitigates different types of known and not-yet-seen attacks. The Gatebot pipeline manages this complexity in three separate stages:

attack detection - collects live traffic measurements across the globe and detects attacks
reactive automation - chooses appropriate mitigations
mitigations - executes mitigation logic on the edge

The benign-sounding "reactive automation" part is actually the most complicated stage in the pipeline. We expected that from the start, which is why we implemented this stage using a custom Functional Reactive Programming (FRP) framework. If you want to know more about it, see the talk and the presentation.

Our mitigation logic often combines multiple inputs from different internal systems, to come up with the best, most appropriate mitigation. One of the most important inputs is the metadata about our IP address allocations: we mitigate attacks hitting HTTP and DNS IP ranges differently. Our FRP framework allows us to express this in clear and readable code. For example, this is part of the code responsible for performing DNS attack mitigation:

def action_gk_dns(...):

    [...]

    if port != 53:
        return None

    if whitelisted_ip.get(ip):
        return None

    if ip not in ANYCAST_IPS:
        return None
        [...]

It's the last check in this code that we tried to improve today.

Clearly, the code above is a huge oversimplification of all that goes into attack mitigation, but making an early decision about whether the attacked IP serves DNS traffic or not is important. It's that check that went wrong today. If the IP does serve DNS traffic then attack mitigation is handled differently from IPs that never serve DNS.

Cloudflare is growing, so must Gatebot

Gatebot was created in early 2015. Three years may not sound like much time, but since then we've grown dramatically and added layers of services to our software stack. Many of the internal integration points that we rely on today didn't exist then.

One of them is what we call the Provision API. When Gatebot sees an IP address, it needs to be able to figure out whether or not it’s one of Cloudflare’s addresses. Provision API is a simple RESTful API used to provide this kind of information.

This is a relatively new API, and prior to its existence, Gatebot had to figure out which IP addresses were Cloudflare addresses by reading a list of networks from a hard-coded file. In the code snippet above, the ANYCAST_IPS variable is populated using this file.

Things went wrong

Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.

What we did not account for, and what Provision API didn’t know about, was that 1.1.1.0/24 and 1.0.0.0/24 are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.

As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!

Impact

The effect was that, after pushing the new code release, our systems interpreted the resolver traffic as an attack. The automatic systems deployed DNS mitigations for our DNS resolver IP ranges for 17 minutes, between 17:58 and 18:13 May 31st UTC. This caused 1.1.1.1 DNS resolver to be globally inaccessible.

Lessons Learned

While Gatebot, the DDoS mitigation system, has great power, we failed to test the changes thoroughly. We are using today’s incident to improve our internal systems.

Our team is incredibly proud of 1.1.1.1 and Gatebot, but today we fell short. We want to apologize to all of our customers. We will use today’s incident to improve. The next time we mitigate 1.1.1.1 traffic, we will make sure there is a legitimate attack hitting us.

Rate Limiting: Delivering more rules, and greater control

Alex Cruz Farmer — Mon, 21 May 2018 20:41:37 GMT

With more and more platforms taking the necessary precautions against DDoS attacks like integrating DDoS mitigation services and increasing bandwidth at weak points, Layer 3 and 4 attacks are just not as effective anymore. For Cloudflare, we have fully automated Layer 3/4 based protections with our internal platform, Gatebot. In the last 6 months we have seen a large upward trend of Layer 7 based DDoS attacks. The key difference to these attacks is they are no longer focused on using huge payloads (volumetric attacks), but based on Requests per Second to exhaust server resources (CPU, Disk and Memory). On a regular basis we see attacks that are over 1 million requests per second. The graph below shows the number of Layer 7 attacks Cloudflare has monitored, which is trending up. On average seeing around 160 attacks a day, with some days spiking up to over 1000 attacks.

A year ago, Cloudflare released Rate Limiting, and it is proving to be a hugely effective tool for customers to protect their web applications and APIs from all sorts of attacks, from “low and slow” DDoS attacks, through to bot-based attacks, such as credential stuffing and content scraping. We’re pleased about the success our customers are seeing with Rate Limiting and are excited to announce additional capabilities to give our customers further control.

So what’s changing?

There are times when you clearly know that traffic is malicious. In cases like this, our existing Block action is proving effective for our customers. But there are times when it is not the best option, and causes a negative user experience. Rather than risk a false negative, customers often want to challenge a client to ensure they are who they represent themselves to be, which is in most situations, human not a bot.

Firstly, to help customers more accurately identify the traffic, we are adding Cloudflare JavaScript Challenge, and Google reCAPTCHA (Challenge) mitigation actions to the UI and API for Pro and Business plans. The existing Block and Simulate actions still exist. As a reminder, to test any rule, deploying in Simulate means that you will not be charged for any requests. This is a great way to test your new rules to make sure they have been configured correctly.

Secondly, we’re making Rate Limiting more dynamically scalable. A new feature has been added which allows Rate Limiting to count on Origin Response Headers for Business and Enterprise customers. The way this feature works is by matching attributes which are returned by the Origin to Cloudflare.

The new capabilities - in action!

One of the things that really drives our innovation is solving the real problems we hear from customers every day. With that, we wanted to provide some real world examples of these new capabilities in action.

Each of the use cases have Basic and Advanced implementation options. After some testing, we found that tiering rate limits is an extremely effective solution against repeat offenders.

Credential Stuffing Protection for Login Pages and APIs. The best way to build applications is to utilise the standardized Status Codes. For example, if I fail to authenticate against an endpoint or a website, I should receive a “401” or “403”. Generally speaking a user to a website will often get their password wrong three times before selecting the “I forgot my password” option. Most Credential Stuff bots will try thousands of times cycling through many usernames and password combinations to see what works.

Here are some example rate limits which you can configure to protect your application from credential stuffing.

Basic:Cloudflare offers a “Protect My Login” feature out the box. Enter the URL for your login page and Cloudflare will create a rule such that clients that attempt to log in more than 5 times in 5 minutes will be blocked for 15 minutes.

With the new Challenge capabilities of Rate Limiting, you can customize the response parameters for log in to more closely match the behavior pattern for bots you see on your site through a custom-built rule.

Logging in four times in one minute is hard - I type fast, but couldn’t even do this. If I’m seeing this pattern in my logs, it is likely a bot. I can now create a Rate Limiting rule based on the following criteria:

RuleID	URL	Count	Timeframe	Matching Criteria	Action
1	/login	4	1 minute	Method: POST Status Code: 401,403	Challenge

With this new rule, if someone tries to log in four times within a minute, they will be thrown a challenge. My regular human users will likely never hit it, but if they do - the challenge insures they can still access the site.

Advanced: And sometimes bots are just super persistent in their attacks. We can tier rules together to tackle repeat offenders. For example, instead of creating just a single rule, we can create a series of rules which can be tiered to protect against persistent threats:

RuleID	URL	Count	Timeframe	Matching Criteria	Action
1	/login	4	1 minute	Method: POST Status Code: 401,403	JavaScript Challenge
2	/login	10	5 minutes	Method: POST Status Code: 401,403	Challenge
3	/login	20	1 hour	Method: POST Status Code: 401,403	Block for 1 day

With this type of tiering, any genuine users that are just having a hard time remembering their login details whilst also being extremely fast typers will not be fully blocked. Instead, they will first be given out automated JavaScript challenge followed by a traditional CAPTCHA if they hit the next limit. This is a much more user-friendly approach while still securing your login endpoints.

Time-based Firewall

Our IP Firewall is a powerful feature to block problematic IP addresses from accessing your app. Particularly this is related to repeated abuse, or based on IP Reputation or Threat Intelligence feeds that are integrated at the origin level.

While the IP firewall is powerful, maintaining and managing a list of IP addresses which are currently being blocked can be cumbersome. It becomes more complicated if you want to allow blocked IP addresses to “age out” if bad behavior stops after a period of time. This often requires authoring and managing a script and multiple API calls to Cloudflare.

The new Rate Limiting Origin Headers feature makes this all so much easier. You can now configure your origin to respond with a Header to trigger a Rate-Limit. To make this happen, we need to generate a Header at the Origin, which is then added to the response to Cloudflare. As we are matching on a static header, we can set a severity level based on the content of the Header. For example, if it was a repeat offender, you could respond with High as the Header value, which could Block for a longer period.

Create a Rate Limiting rule based on the following criteria:

RuleID	URL	Count	Timeframe	Matching Criteria	Action
1	*	1	1 second	Method: _ALL_ Header: X-CF-Block = low	Block for 5 minutes
2	*	1	1 second	Method: _ALL_ Header: X-CF-Block = medium	Block for 15 minutes
3	*	1	1 second	Method: _ALL_ Header: X-CF-Block = high	Block for 60 minutes

Once that Rate-Limit has been created, Cloudflare’s Rate-Limiting will then kick-in immediately when that Header is received.

Enumeration Attacks

Enumeration attacks are proving to be increasingly popular and pesky to mitigate. With enumeration attacks, attackers identify an expensive operation in your app and hammer at it to tie up resources and slow or crash your app. For example, an app that offers the ability to look up a user profile requires a database lookup to validate whether the user exists. In an enumeration attack, attackers will send a random set of characters to that endpoint in quick succession, causing the database to ground to a halt.

Rate Limiting to the rescue!

One of our customers was hit with a huge enumeration attack on their platform earlier this year, where the aggressors were trying to do exactly what we described above, in an attempt to overload their database platform. Their Rate Limiting configuration blocked over 100,000,000 bad requests during the 6-hour attack.

When a query is sent to the app, and the user is not found, the app serves a 404 (page not found). A very basic approach is to set a rate limit for 404s. If a user crosses a threshold of 404’s in a period of time, set the app to challenge the user to prove themselves to be a real person.

RuleID	URL	Count	Timeframe	Matching Criteria	Action
1	*	10	1 minute	Method: GET Status Code: 404	Challenge

To catch repeat offenders, you can tier the tier Rate Limits:

RuleID	URL	Count	Timeframe	Matching Criteria	Action
1	/public/profile*	10	1 minute	Method: GET Status Code: 404	JavaScript Challenge
2	/public/profile*	25	1 minute	Method: GET Status Code: 200	Challenge
3	/public/profile*	50	10 minutes	Method: GET Status Code: 200, 404	Block for 4 hours

With this type of tiered defense in place, it means that you can “caution” an offender with a JavaScript challenge or Challenge (Google Captcha), and then “block” them if they continue.

Content Scraping

Increasingly, content owners are wrestling with content scraping - malicious bots copying copyrighted images or assets and redistributing or reusing them. For example, we work with an eCommerce store that uses copyrighted images and their images are appearing elsewhere on the web without their consent. Rate Limiting can help!

In their app, each page displays 4 copyrighted images, 1 which is actual size, and 3 which are thumbnails. By looking at logs and user patterns, they determined that most users, at a stretch, would never view more than 10–15 products in a minute, which would equate to 40–60 loads from the images store.

They chose to tier their Rate Limiting rules to prevent end users from getting unnecessarily blocked when they were browsing heavily. To block malicious attempts at content scraping can be quite simple, however it does require some forward planning. Placing the rate limit on the right URL is key to insure you are placing the rule on exactly what you are trying to protect and not the broader content. Here’s an example set of rate limits this customer set to protect their images:

RuleID	URL	Count	Timeframe	Matching Criteria	Action
1	/img/thumbs/*	10	1 minute	Method: GET Status Code: 404	Challenge
2	/img/thumbs/*	25	1 minute	Method: GET Status Code: 200	Challenge
3	/img/*	75	1 minute	Method: GET Status Code: 200	Block for 4 hours
4	/img/*	5	1 minute	Method: GET Status Code: 403, 404	Challenge

As we can see here, rules 1 and 2 are counting based on the number of requests to each endpoint. Rule 3 is counting based on all hits to the image store, and if it gets above 75 requests, the user will be blocked for 4 hours. Finally, to avoid any enumeration or bots guessing image names and numbers, we are counting on 404 and 403s and challenging if we see unusual spikes.

One more thing ... more rules, totally rules!

We want to ensure you have the rules you need to secure your app. To do that, we are increasing the number of available rules for Pro and Business, for no additional charge.

Pro plans increase from 3 to 10 rules
Business plans increase from 3 to 15 rules

As always, Cloudflare only charges for good traffic - requests that are allowed through Rate Limiting, not blocked. For more information click here.

The Rate-Limiting feature can be enabled within the Firewall tab on the Dashboard, or by visiting: cloudflare.com/a/firewall

Memcrashed - Major amplification attacks from UDP port 11211

Marek Majkowski — Tue, 27 Feb 2018 14:38:35 GMT

Over last couple of days we've seen a big increase in an obscure amplification attack vector - using the memcached protocol, coming from UDP port 11211.

CC BY-SA 2.0 image by David TrawinIn the past, we have talked a lot about amplification attacks happening on the internet. Our most recent two blog posts on this subject were:

SSDP amplifications crossing 100Gbps. Funnily enough, since then we were a target of an 196Gbps SSDP attack.
General statistics about various amplification attacks we see.

The general idea behind all amplification attacks is the same. An IP-spoofing capable attacker sends forged requests to a vulnerable UDP server. The UDP server, not knowing the request is forged, politely prepares the response. The problem happens when thousands of responses are delivered to an unsuspecting target host, overwhelming its resources - most typically the network itself.

Amplification attacks are effective, because often the response packets are much larger than the request packets. A carefully prepared technique allows an attacker with limited IP spoofing capacity (such as 1Gbps) to launch very large attacks (reaching 100s Gbps) "amplifying" the attacker's bandwidth.

Memcrashed

Obscure amplification attacks happen all the time. We often see "chargen" or "call of duty" packets hitting our servers.

A discovery of a new amplification vector though, allowing very great amplification, happens rarely. This new memcached UDP DDoS is definitely in this category.

The DDosMon from Qihoo 360 monitors amplification attack vectors and this chart shows recent memcached/11211 attacks:

The number of memcached attacks was relatively flat, until it started spiking just a couple days ago. Our charts also confirm this, here are attacks in packets per second over the last four days:

While the packets per second count is not that impressive, the bandwidth generated is:

At peak we've seen 260Gbps of inbound UDP memcached traffic. This is massive for a new amplification vector. But the numbers don't lie. It's possible because all the reflected packets are very large. This is how it looks in tcpdump:

$ tcpdump -n -t -r memcrashed.pcap udp and port 11211 -c 10
IP 87.98.205.10.11211 > 104.28.1.1.1635: UDP, length 13
IP 87.98.244.20.11211 > 104.28.1.1.41281: UDP, length 1400
IP 87.98.244.20.11211 > 104.28.1.1.41281: UDP, length 1400
IP 188.138.125.254.11211 > 104.28.1.1.41281: UDP, length 1400
IP 188.138.125.254.11211 > 104.28.1.1.41281: UDP, length 1400
IP 188.138.125.254.11211 > 104.28.1.1.41281: UDP, length 1400
IP 188.138.125.254.11211 > 104.28.1.1.41281: UDP, length 1400
IP 188.138.125.254.11211 > 104.28.1.1.41281: UDP, length 1400
IP 5.196.85.159.11211 > 104.28.1.1.1635: UDP, length 1400
IP 46.31.44.199.11211 > 104.28.1.1.6358: UDP, length 13

The majority of packets are 1400 bytes in size. Doing the math 23Mpps x 1400 bytes gives 257Gbps of bandwidth, exactly what the chart shows.

Memcached does UDP?

I was surprised to learn that memcached does UDP, but there you go! The protocol specification shows that it's one of the best protocols to use for amplification ever! There are absolutely zero checks, and the data WILL be delivered to the client, with blazing speed! Furthermore, the request can be tiny and the response huge (up to 1MB).

Launching such an attack is easy. First the attacker implants a large payload on an exposed memcached server. Then, the attacker spoofs the "get" request message with target Source IP.

Synthetic run with Tcpdump shows the traffic:

$ sudo tcpdump -ni eth0 port 11211 -t
IP 172.16.170.135.39396 > 192.168.2.1.11211: UDP, length 15
IP 192.168.2.1.11211 > 172.16.170.135.39396: UDP, length 1400
IP 192.168.2.1.11211 > 172.16.170.135.39396: UDP, length 1400
...(repeated hundreds times)...

15 bytes of request triggered 134KB of response. This is amplification factor of 10,000x! In practice we've seen a 15 byte request result in a 750kB response (that's a 51,200x amplification).

Source IPs

The vulnerable memcached servers are all around the globe, with higher concentration in North America and Europe. Here is a map of the source IPs we've seen in each of our 120+ points of presence:

Interestingly our datacenters in EWR, HAM and HKG see disproportionally large numbers of attacking IPs. This is because most of the vulnerable servers are located in major hosting providers. The AS numbers of the IPs that we've seen:

┌─ips─┬─srcASN──┬─ASName───────────────────────────────────────┐
│ 578 │ AS16276 │ OVH                                          │
│ 468 │ AS14061 │ DIGITALOCEAN-ASN - DigitalOcean, LLC         │
│ 231 │ AS7684  │ SAKURA-A SAKURA Internet Inc.                │
│ 199 │ AS9370  │ SAKURA-B SAKURA Internet Inc.                │
│ 165 │ AS12876 │ AS12876                                      │
│ 119 │ AS9371  │ SAKURA-C SAKURA Internet Inc.                │
│ 104 │ AS16509 │ AMAZON-02 - Amazon.com, Inc.                 │
│ 102 │ AS24940 │ HETZNER-AS                                   │
│  81 │ AS26496 │ AS-26496-GO-DADDY-COM-LLC - GoDaddy.com, LLC │
│  74 │ AS36351 │ SOFTLAYER - SoftLayer Technologies Inc.      │
│  65 │ AS20473 │ AS-CHOOPA - Choopa, LLC                      │
│  49 │ AS49981 │ WORLDSTREAM                                  │
│  48 │ AS51167 │ CONTABO                                      │
│  48 │ AS33070 │ RMH-14 - Rackspace Hosting                   │
│  45 │ AS19994 │ RACKSPACE - Rackspace Hosting                │
│  44 │ AS60781 │ LEASEWEB-NL-AMS-01 Netherlands               │
│  42 │ AS45899 │ VNPT-AS-VN VNPT Corp                         │
│  41 │ AS2510  │ INFOWEB FUJITSU LIMITED                      │
│  40 │ AS7506  │ INTERQ GMO Internet,Inc                      │
│  35 │ AS62567 │ DIGITALOCEAN-ASN-NY2 - DigitalOcean, LLC     │
│  31 │ AS8100  │ ASN-QUADRANET-GLOBAL - QuadraNet, Inc        │
│  30 │ AS14618 │ AMAZON-AES - Amazon.com, Inc.                │
│  30 │ AS31034 │ ARUBA-ASN                                    │
└─────┴─────────┴──────────────────────────────────────────────┘

Most of the memcached servers we've seen were coming from AS16276 - OVH, AS14061 - Digital Ocean and AS7684 - Sakura.

In total we've seen only 5,729 unique source IPs of memcached servers. We're expecting to see much larger attacks in future, as Shodan reports 88,000 open memcached servers:

Let's fix it up

It's necessary to fix this and prevent further attacks. Here is a list of things that should be done.

Memcached Users

If you are using memcached, please disable UDP support if you are not using it. On memcached startup you can specify --listen 127.0.0.1 to listen only to localhost and -U 0 to disable UDP completely. By default memcached listens on INADDR_ANY and runs with UDP support ENABLED. Documentation:

https://github.com/memcached/memcached/wiki/ConfiguringServer#udp

You can easily test if your server is vulnerable by running:

$ echo -en "\x00\x00\x00\x00\x00\x01\x00\x00stats\r\n" | nc -q1 -u 127.0.0.1 11211
STAT pid 21357
STAT uptime 41557034
STAT time 1519734962
...

If you see non-empty response (like the one above), your server is vulnerable.

System administrators

Please ensure that your memcached servers are firewalled from the internet! To test whether they can be accessed using UDP I recommend the nc example above, to verify if TCP is closed run nmap:

$ nmap TARGET -p 11211 -sU -sS --script memcached-info
Starting Nmap 7.30 ( https://nmap.org ) at 2018-02-27 12:44 UTC
Nmap scan report for xxxx
Host is up (0.011s latency).
PORT      STATE         SERVICE
11211/tcp open          memcache
| memcached-info:
|   Process ID           21357
|   Uptime               41557524 seconds
|   Server time          2018-02-27T12:44:12
|   Architecture         64 bit
|   Used CPU (user)      36235.480390
|   Used CPU (system)    285883.194512
|   Current connections  11
|   Total connections    107986559
|   Maximum connections  1024
|   TCP Port             11211
|   UDP Port             11211
|_  Authentication       no
11211/udp open|filtered memcache

Internet Service Providers

In order to defeat such attacks in future, we need to fix vulnerable protocols and also IP spoofing. As long as IP spoofing is permissible on the internet, we'll be in trouble.

Help us out by tracking who is behind these attacks. We must know not who has problematic memcached servers, but who sent them queries in the first place. We can't do this without your help!

Developers

Please please please: Stop using UDP. If you must, please don't enable it by default. If you do not know what an amplification attack is I hereby forbid you from ever typing SOCK_DGRAM into your editor.

We've been down this road so many times. DNS, NTP, Chargen, SSDP and now memcached. If you use UDP, you must always respond with strictly a smaller packet size then the request. Otherwise your protocol will be abused. Also remember that people do forget to set up a firewall. Be a nice citizen. Don't invent a UDP-based protocol that lacks authentication of any kind.

That's all

It's anyone's guess how large the memcached attacks will become before we clean the vulnerable servers up. There were already rumors of 0.5Tbps amplifications in the last few days, and this is just a start.

Finally, you are OK if you are a Cloudflare customer. Cloudflare's Anycast architecture works well to distribute the load in case of large amplification attacks, and unless your origin IP is exposed, you are safe behind Cloudflare.

Prologue

A comment (below) points out that the possibility of using memcached for DDoS was discussed in a 2017 presentation.

UpdateWe received a word from Digital Ocean, OVH, Linode and Amazon that they tackled the memcached problem, their networks should not be a vector in future attacks. Hurray!

Dealing with DDoS attacks sound interesting? Join our world famous team in London, Austin, San Francisco and our elite office in Warsaw, Poland.

On the Leading Edge - Cloudflare named a leader in The Forrester Wave: DDoS Mitigation Solutions

Jen Taylor — Thu, 07 Dec 2017 20:44:39 GMT

Cloudflare has been recognized as a leader in the “Forrester Wave^TM: DDoS Mitigation Solutions, Q4 2017.”

The DDoS landscape continues to evolve. The increase in sophistication, frequency, and range of targets of DDoS attacks has placed greater demands on DDoS providers, many of which were evaluated in the report.

This year, Cloudflare received the highest scores possible in 15 criteria, including:

Length of Implementation
Layers 3 and 4 Attacks Mitigation
DNS Attack Mitigation
IoT Botnets
Multi-Vector Attacks
Filtering Deployment
Secure Socket Layer Investigation
Mitigation Capacity
Pricing Model

We believe that Cloudflare’s position as a leader in the report stems from the following:

An architecture designed to address high-volume attacks. This post written in October 2016 provides some insight into how Cloudflare’s architecture scales to meet the most advanced DDoS attacks differently than legacy scrubbing centers.
In September 2017, due to the size and effectiveness of our network, we announced the elimination of “surge pricing” commonly found in other DDoS vendors by offering unmetered mitigation. Regardless of what Cloudflare plan a customer is on—Free, Pro, Business, or Enterprise—we will never terminate a customer or charge more based on the size of an attack.
Because we protect over 7 million Internet properties, we have a unique view into the types of attacks launched across the Internet, especially harder-to-deflect Layer 7 application attacks. This allows us to reduce the amount of manual intervention and use automated mitigations to more quickly detect and block attacks.
Our DDoS mitigation solution helps protect customers by integrating with not only a stack of other security features, such as SSL and WAF, but also with a full suite of performance features. With a highly scalable network of over 118 data centers, Cloudflare can both accelerate legitimate traffic and block malicious DDoS traffic.

This combination of scale, ease-of-use through automatic mitigations, and integration with performance solutions, continues to advance our mission to help build a better Internet.

At Cloudflare, our mission is to help build a better internet - one that is performant, secure and reliable for all. We do this through a combination of scale, ease-of use, and data-driven insights that enable us to deliver automatic mitigation. It is because of this focus and these types of innovation that we were able to offer unmetered DDoS mitigation at no additional cost to all of our customers this year. We are honored to be recognized as a leader in the Forrester Wave^TM: DDoS Mitigation Solutions, Q4 2017 report.

To check out the full report, download your complimentary copy here.

The New DDoS Landscape

Junade Ali — Thu, 23 Nov 2017 03:28:00 GMT

News outlets and blogs will frequently compare DDoS attacks by the volume of traffic that a victim receives. Surely this makes some sense, right? The greater the volume of traffic a victim receives, the harder to mitigate an attack - right?

At least, this is how things used to work. An attacker would gain capacity and then use that capacity to launch an attack. With enough capacity, an attack would overwhelm the victim's network hardware with junk traffic such that they can no longer serve legitimate requests. If your web traffic is served by a server with a 100 Gbps port and someone sends you 200 Gbps, your network will be saturated and the website will be unavailable.

Recently, this dynamic has shifted as attackers have gotten far more sophisticated. The practical realities of the modern Internet have increased the amount of effort required to clog up the network capacity of a DDoS victim - attackers have noticed this and are now choosing to perform attacks higher up the network stack.

In recent months, Cloudflare has seen a dramatic reduction in simple attempts to flood our network with junk traffic. Whilst we continue to see large network level attacks, in excess of 300 and 400 Gbps, network level attacks in general have become far less common (the largest recent attack was just over 0.5 Tbps). This has been especially true since the end of September when we made official a policy that would not remove any customers from our network merely for receiving a DDoS attack that's too big, including those on our free plan.

Far from attackers simply closing shop, we see a trend whereby attackers are moving to more advanced application-layer attack strategies. This trend is not only seen in metrics from our automated attack mitigation systems, but has also been the experience of our frontline customer support engineers. Whilst we continue to see very large network level attacks, note that they are occurring less frequently since the introduction of Unmetered Mitigation:

To understand how the landscape has made such a dramatic shift, we must first understand how DDoS attacks are performed.

Performing a DDoS

From presentation by @IcyApril - First thing you absolutely need for a successful DDoS - is a cool costume. pic.twitter.com/WIC0LjF4ka
— majek04 (@majek04) November 22, 2017

The first thing you need before you can carry out a DDoS Attack is capacity. You need the network resources to be able to overwhelm your victim.

To build up capacity, attackers have a few mechanisms at their disposal; three such examples are Botnets, IoT Devices and DNS Amplification:

Botnets

Computer viruses are deployed for multiple reasons, for example; they can harvest private information from users or blackmail users into paying them money to get their precious files back. Another utility of computer viruses is building capacity to perform DDoS Attacks.

A Botnet is a network of infected computers that are centrally controlled by an attacker; these zombie computers can then be used to send spam emails or perform DDoS attacks.

Consumers have access to faster Internet than ever before. In November 2016, the average UK broadband upload speed reached 4.3Mbps - this means a Botnet which has infected a little under 2,400 computers can launch an attack of around 10Gbps. Such capacity is plenty enough to saturate the end networks that power most websites online.

On August 17th, 2017, multiple networks online were subject to significant attacks from a botnet known as WireX. Researchers from a variety of organisations, including from Akamai, Cloudflare, Flashpoint, Google, Oracle Dyn, RiskIQ, Team Cymru, and other organizations cooperated to combat this botnet - eventually leading to hundreds of Android apps being removed and a process started to remove the malware-ridden apps from all devices.

IoT Devices

More and more of our everyday appliances are being embedded with Internet connectivity. Like other types of technology, they can be taken over with malware and controlled to launch large-scale DDoS attacks.

Towards the end of last year, we began to see Internet-connected cameras start to launch large DDoS attacks. The use of video cameras was advantageous to attackers in the sense they needed to be connected to networks with enough bandwidth to be capable of streaming video.

Mirai was one such botnet which targeted Internet-connected cameras and Internet routers. It would start by logging into the web dashboard of a device using a table of 60 default usernames and passwords, then installing malware on the device.

Where users set passwords instead of them merely being the default values, other pieces of malware can use Dictionary Attacks to repeatedly guess simple user-configured passwords, using a list of common passwords like the one shown below. I have self-censored some of the passwords, apparently users can be in a fairly angry state-of-mind when setting passwords:

Passwords aside, back in May, I blogged specifically about some examples of security risks we are starting to see which are specific to IoT devices: IoT Security Anti-Patterns.

DNS Amplification

DNS is the phonebook of the Internet; in order to reach this site, your local computer used DNS to look up which IP Address would serve traffic for blog.cloudflare.com. I can perform this DNS queries from my command line using dig A blog.cloudflare.com:

Firstly notice that the response is pretty big, it's certainly bigger than the question we asked.

DNS is built on a Transport Protocol called UDP, when using UDP it's easy to forge the requester of a query as UDP doesn't require a handshake before sending a response.

Due to these two factors, someone is able to make a DNS query on behalf of someone else. We can make a request for a relatively small DNS query, which will will then result in a long response being sent somewhere else.

Online there are open DNS resolvers that will take a request from anyone online and send the response to anyone else. In most cases we should not be exposing open DNS resolvers to the internet (most are open due to configuration mistakes). However, when intentionally exposing DNS resolvers online, security steps should be taken - StrongArm has a primer on securing open DNS resolvers on their blog.

Let's use a hypothetical to illustrate this point. Imagine that you wrote to a mail order retailer requesting a catalogue (if you still know what one is). You'd send a relatively short postcard with your contact information and your request - you'd then get back quite a big catalogue. Now imagine you did the same, but sent hundreds of these postcards and instead included the address of someone else. Assuming the retailer was obliging in sending such a vast amount of catalogues, your friend could wake up one day with their front door blocked with catalogues.

In 2013, we blogged about how one DNS Amplification attack we faced almost broke the internet; however, in recent times aside from one exceptionally high attack (occurring just after we launched Unmetered DDoS Mitigation), DNS Amplification attacks have generally been a low proportion of attacks we see recently:

Whilst we're seeing fewer of these attacks, you can find a more detailed overview on our learning centre: DNS Amplification Attack

DDoS Mitigation: The Old Priorities

Using the capacity an attacker has built up, they can send junk traffic to a web property. This is referred to as a Layer 3/4 attack. This kind of attack primarily seeks to block up the network capacity of the victim network.

Above all, mitigating these attacks requires capacity. If you get an attack of 600 Gbps and you only have 10 Gbps of capacity you either need to pay an intermediary network to filter traffic for you or have your network go offline due to the force of the attack.

As a network, Cloudflare works by passing a customer's traffic through our network; in doing so, we are able to apply performance optimisations and security filtering to the traffic we see. One such security filter is removing junk traffic associated with Layer 3/4 DDoS attacks.

Cloudflare's network was built when large scale DDoS Attacks were becoming a reality. Huge network capacity, spread out over the world in many different data centres makes it easier to absorb large attacks. We currently have over 15 Tbps and this figure is always growing fast.

Preventing DDoS attacks needs a little more sophistication than just capacity though. While traditional Content Delivery Networks are built using Unicast technology, Cloudflare's network is built using an Anycast design.

In essence, this means that network traffic is routed to the nearest available Point-of-Presence and it is not possible for an attacker to override this routing behaviour - the routing is effectively performed using BGP, the routing protocol of the Internet.

Unicast networks will frequently use technology like DNS to steer traffic to close data centres. This routing can easily be overridden by an attacker, allowing them to force attack traffic to a single data centre. This is not possible with Cloudflare's Anycast network; meaning we maintain full control of how our traffic is routed (providing intermediary ISPs respect our routes). With this network design, have the ability to rapidly update routing decisions even against ISPs which ordinarily do not respect cache expiration times for DNS records (TTLs).

Cloudflare's network also maintains an Open Peering Policy; we are open to interconnecting our network with any other network without cost. This means we tend to eliminate intermediary networks across our network. When we are under attack, we usually have a very short network path from the attacker to us - this means there are no intermediary networks which can suffer collateral damage.

The New Landscape

I started this blog post with a chart which demonstrates the frequency of a type of network-layer attack known as a SYN Flood against the Cloudflare network. You'll notice how the largest attacks are further spaced out over the past few months:

You can also see that this trend does not follow when compared to a graph of Application Layer DDoS attacks which we continue to see coming in:

The chart above has an important caveat, an Application Layer attack is defined by what we determine an attack is. Application Layer (Layer 7) attacks are more indistinguishable from real traffic than Layer 3/4 attacks. The attacks effectively resemble normal web requests, instead of junk traffic.

Attackers can order their Botnets to perform attacks against websites using "Headless Browsers" which have no user interface. Such Headless Browsers work exactly like normal browsers, except that they are controlled programmatically instead of being controlled via a window on a user's screen.

Botnets can use Headless Browsers to effectively make HTTP requests that load and behave just like ordinary web requests. As this can be done programmatically, they can order bots to repeat these HTTP requests rapidly - effectively taking up the entire capacity of a website, taking it offline for ordinary visitors.

This is a non-trivial problem to solve. At Cloudflare, we have specific services like Gatebot which identify DDoS attacks by picking up on anomalies in network traffic. We have tooling like "I'm Under Attack Mode" to analyse traffic to ensure the visitor is human. This is, however, only part of the story.

A $20/month server running a resource intense e-commerce platform may not be able to cope with any more than a dozen concurrent HTTP requests before being unable to serve any more traffic.

An attack which can take down a small e-commerce site will likely not even be a drop in the ocean for Cloudflare's network, which sees around 10% of Internet requests online.

The chart below outlines DDoS attacks per day against Cloudflare customers; but it is important to bear in mind that this includes what we define as an attack. In recent times, Cloudflare has built specific products to help customers define what they think an attack looks like and how much traffic they feel they should cope with.

A Web Developers Guide to Defeating an Application Layer DDoS Attack

One of the reasons why Application Layer DDoS attacks are so attractive is due to the uneven balance between the relative computational overhead required to for someone to request a web page, and the computational difficulty in serving one. Serving a dynamic website requires all kinds of operations; fetching information from a database, firing off API requests to separate services, rendering a page, writing log lines and potentially even pushing data down a message queue.

Fundamentally, there are two ways of dealing with this problem:

making the balance between requester and server, less asymmetric by making it easier to serve web requests
limiting requests which are in such excess, they are blatantly abusive

It remains critical you have a high-capacity DDoS mitigation network in front of your web application; one of the reasons why Application Layer attacks are increasingly attractive to attackers because networks have gotten good at mitigating volumetric attacks at the network layer.

Cloudflare has found that whilst performing Application Layer attacks, attackers will sometimes pick cryptographic ciphers that are the hardest for servers to calculate and are not usually accelerated in hardware. What this means is the attackers will try and consume more of your servers resources by actually using the fact you offer encrypted HTTPS connections against you. Having a proxy in front of your web traffic has the added benefit of ensuring that it has to establish a brand new secure connection to your origin web server - effectively meaning you don't have to worry about Presentation Layer attacks.

Additionally, offloading services which don't have custom application logic (like DNS) to managed providers can help ensure you have less surface area to worry about at the Application Layer.

Aggressive Caching

One of the ways to make it easier to serve web requests is to use some form of caching. There are multiple forms of caching; however, here I'm going to be talking about how you enable caching for HTTP requests.

Suppose you're using a CMS (Content Management System) to update your blog, the vast majority of visitors to your blog will see the identical page to every other visitor. It will only be the when an anonymous visitor logs in or leaves a comment that they will see a page that's dynamic and unlike every other page that's been rendered.

Despite the vast majority of HTTP requests to specific URLs being identical, your CMS has to regenerate the page for every single request as if it was brand new. Application Layer DDoS attacks exploit this to amplification to make their attacks more brutal.

Caching proxies like NGINX and services like Cloudflare allow you to specify that until a user has a browser cookie that de-anonymises them, content can be served from cache. Alongside performance benefits, these configuration changes can prevent the most crude Application Layer DDoS Attacks.

For further information on this, you can consult NGINX guide to caching or alternatively see my blog post on caching anonymous page views:

Rate Limiting

Caching isn't enough; non-idempotent HTTP requests like POST, PUT and DELETE are not safe to cache - as such making these requests can bypass caching efforts used to prevent Application Layer DDoS attacks. Additionally, attackers can attempt to vary URLs to bypass advanced caching behaviour.

Software exists for web servers to be able to perform rate limiting before anything hits dynamic logic; examples of such tools include Fail2Ban and Apache mod_ratelimit.

If you do rate limiting on your server itself, be sure to configure your edge network to cache the rate limit block pages, such that attackers cannot perform Application Layer attackers once blocked. This can be done by caching responses with a 429 status code against a Custom Cache Key based on the IP Address.

Services like Cloudflare offer Rate Limiting at their edge network; for example Cloudflare Rate Limiting.

Conclusion

As the capacity of networks like Cloudflare continue to grow, attackers move from attempting DDoS attacks at the network layer to performing DDoS attacks targeted at applications themselves.

For applications to be resilient to DDoS attacks, it is no longer enough to use a large network. A large network must be complemented with tooling that is able to filter malicious Application Layer attack traffic, even when attackers are able to make such attacks look near-legitimate.

No Scrubs: The Architecture That Made Unmetered Mitigation Possible

John Graham-Cumming — Mon, 25 Sep 2017 13:00:33 GMT

When building a DDoS mitigation service it’s incredibly tempting to think that the solution is scrubbing centers or scrubbing servers. I, too, thought that was a good idea in the beginning, but experience has shown that there are serious pitfalls to this approach.

A scrubbing server is a dedicated machine that receives all network traffic destined for an IP address and attempts to filter good traffic from bad. Ideally, the scrubbing server will only forward non-DDoS packets to the Internet application being attacked. A scrubbing center is a dedicated location filled with scrubbing servers.

Three Problems With Scrubbers

The three most pressing problems with scrubbing are: bandwidth, cost, knowledge.

The bandwidth problem is easy to see. As DDoS attacks have scaled to >1Tbps having that much network capacity available is problematic. Provisioning and maintaining multiple-Tbps of bandwidth for DDoS mitigation is expensive and complicated. And it needs to be located in the right place on the Internet to receive and absorb an attack. If it’s not then attack traffic will need to be received at one location, scrubbed, and then clean traffic forwarded to the real server: that can introduce enormous delays with a limited number of locations.

Imagine for a moment you’ve built a small number of scrubbing centers, and each center is connected to the Internet with many Gbps of connectivity. When a DDoS attack occurs that center needs to be able to handle potentially 100s of Gbps of attack traffic at line rate. That means exotic network and server hardware. Everything from the line cards in routers, to the network adapter cards in the servers, to the servers themselves is going to be very expensive.

This (and bandwidth above) is one of the reasons DDoS mitigation has traditionally cost so much and been billed by attack size.

The final problem, knowledge, is the most easily overlooked. When you set out to build a scrubbing server you are building something that has to separate good packets from bad.

At first this seems easy (let’s filter out all TCP ACK packets for non-established connections, for example), and low level engineers are easy to excite about writing high-performance code to do that. But attackers are not stupid and they’ll throw legitimate looking traffic at a scrubbing server and it gets harder and harder to distinguish good from bad.

At that point, scrubbing engineers need to become protocol experts at all levels of the stack. That means you have to build a competency in all levels of TCP/IP, DNS, HTTP, TLS, etc. And that’s hard.

CC BY-SA 2.0 image by Lisa Stevens

The bottom line is scrubbing centers and exotic hardware are great marketing. But, like citadels of medieval times, they are monumentally expensive and outdated, overwhelmed by better weapons and warfighting techniques.

And many DDoS mitigation services that use scrubbing centers operate in an offline mode. They are only enabled when a DDoS occurs. This typically means that an Internet application will succumb to the DDoS attack before its traffic is diverted to the scrubbing center.

Just imagine citizens fleeing to hide behind the walls of the citadel under fire from an approaching army.

Better, Cheaper, Smarter

There’s a subtler point about not having dedicated scrubbers: it forces us to build better software. If a scrubbing server becomes overwhelmed or fails then only the customer being scrubbed is affected, but when the mitigation happens on the very servers running the core service it has to work and be effective.

I spoke above about the ‘knowledge gap’ that comes about with dedicated DDoS scrubbing. The Cloudflare approach means that if bad traffic gets through, say a flood of bad DNS packets, then it reaches a service owned and operated by people who are experts in that domain. If a DNS flood gets through our DDoS protection it hits our custom DNS server, RRDNS, the engineers who work on it can bring their expertise to bear.

This makes an enormous difference because the result is either improved DDoS scrubbing or a change to the software (e.g. the DNS stack) that improves its performance under load. We’ve lived that story many, many times and the entire software stack has improved because of it.

The approach Cloudflare took to DDoS mitigation is rather simple: make every single server in Cloudflare participate in mitigation, load balance DDoS attacks across the data centers and servers within them and then apply smarts to the handling of packets. These are the same servers, processors and cores handling our entire service.

Eliminating scrubbing centers and hardware completely changes the cost of building a DDoS mitigation service.

We currently have around 15 Tbps of network capacity worldwide but this capacity doesn’t require exotic network hardware. We are able to use low cost or commodity networking equipment bound together using network automation to handle normal and DDoS traffic. Just as Google originally built its service by writing software that tied together commodity servers into a super (search) computer; our architecture binds commodity servers together into one giant network device.

By building the world’s most peered network we’ve built this capacity at reasonable cost and more importantly are able to handle attack traffic globally wherever it originates with low latency links. No scrubbing solution is able to say the same.

And because Cloudflare manages DNS for our customers and uses an Anycasted network attack traffic originating from botnets is automatically distributed across our global network. Each data center deals with a portion of DDoS traffic.

Within each data center DDoS traffic is load balanced across multiple servers running our service. Each server handles a portion of the DDoS traffic. This spreading of DDoS traffic means that a single DDoS attack will be handled by a large number of individual servers across the world.

And as Cloudflare grows our DDoS mitigation capacity grows automatically, and because our DDoS mitigation is built into our stack it is always on. We mitigate a new DDoS attack every three minutes with no downtime for Internet applications and have no need to ‘switch over’ to a scrubbing center.

Inside a Server

Once all this global and local load balancing has occurred packets do finally hit a network adapter card in a server. It’s here that Cloudflare’s custom DDoS mitigation stack comes into play.

Over the years we’ve learned how to automatically detect and mitigate anything the Internet can throw at us. For most of the attacks, we rely on dynamically managing iptables: the standard Linux firewall. We’ve spoken about the most effective techniques in past. iptables has a number of very powerful features which we select depending on specific attack vector. From our experience xt_bpf, ipset, hashlimits and connlimits are the most useful iptables modules.

For very large attacks the Linux Kernel is not fast enough though. To relieve the kernel from processing excessive number of packets, we experimented with various kernel bypass techniques. We’ve settled on a partial kernel bypass interface - Solarflare specific EFVI.

With EFVI we can offload the processing of our firewall rules to a user space program, and we can easily process millions of packets per second on each server, while keeping the CPU usage low. This allows us to withstand the largest attacks, without affecting our multi-tenant service.

Open Source

Cloudflare’s vision is to help to build a better Internet. Fixing DDoS is a part of it. We’ve been relentlessly documenting the most important and dangerous attacks we’ve encountered, fighting botnets and open sourcing critical pieces of our DDoS infrastructure.

We’ve open sourced various tools, from the very low level projects like our BPF Tools, that we use to fight DNS and SYN floods, to contributing to OpenResty a performant application framework on top of NGINX, which is great for building L7 defenses.

Conclusion

Cloudflare’s DDoS mitigation architecture and custom software makes Unmetered Mitigation possible. With it we can withstand the largest DDoS attacks and as our network grows our DDoS mitigation capability grows with it.

Meet Gatebot - a bot that allows us to sleep

Marek Majkowski — Mon, 25 Sep 2017 13:00:30 GMT

In the past, we’ve spoken about how Cloudflare is architected to sustain the largest DDoS attacks. During traffic surges we spread the traffic across a very large number of edge servers. This architecture allows us to avoid having a single choke point because the traffic gets distributed externally across multiple datacenters and internally across multiple servers. We do that by employing Anycast and ECMP.

We don't use separate scrubbing boxes or specialized hardware - every one of our edge servers can perform advanced traffic filtering if the need arises. This allows us to scale up our DDoS capacity as we grow. Each of the new servers we add to our datacenters increases our maximum theoretical DDoS “scrubbing” power. It also scales down nicely - in smaller datacenters we don't have to overinvest in expensive dedicated hardware.

During normal operations our attitude to attacks is rather pragmatic. Since the inbound traffic is distributed across hundreds of servers we can survive periodic spikes and small attacks without doing anything. Vanilla Linux is remarkably resilient against unexpected network events. This is especially true since kernel 4.4 when the performance of SYN cookies was greatly improved.

But at some point, malicious traffic volume can become so large that we must take the load off the networking stack. We have to minimize the amount of CPU spent on dealing with attack packets. Cloudflare operates a multi-tenant service and we must always have enough processing power to serve valid traffic. We can't afford to starve our HTTP proxy (nginx) or custom DNS server (named RRDNS, written in Go) of CPU. When the attack size crosses a predefined threshold (which varies greatly depending on specific attack type), we must intervene.

Mitigations

During large attacks we deploy mitigations to reduce the CPU consumed by malicious traffic. We have multiple layers of defense, each tuned to specific attack vector.

First, there is “scattering”. Since we control DNS resolution we are able to move the domains we serve between IP addresses (we call this "scattering"). This is an effective technique as long as the attacks don’t follow the updated DNS resolutions. This often happens for L3 attacks where the attacker has hardcoded the IP address of the target.

Next, there is a wide range of mitigation techniques that leverage iptables, the firewall built in to the Linux kernel. But we don't use it like a conventional firewall, with a static set of rules. We continuously add, tweak and remove rules, based on specific attack characteristics. Over the years we have mastered the most effective iptables extensions:

xt_bpf
ipsets
hashlimits
connlimit

To make the most of iptables, we built a system to manage the iptables configuration across our entire fleet, allowing us to rapidly deploy rules everywhere. This fits our architecture nicely: due to Anycast, an attack against a single IP will be delivered to multiple locations. Running iptables rules for that IP on all servers makes sense.

Using stock iptables gives us plenty of confidence. When possible we prefer to use off-the-shelf tools to deal with attacks.

Sometimes though, even this is not sufficient. Iptables is fast in the general case, but has its limits. During very large attacks, exceeding 1M packets per second per server, we shift the attack traffic from kernel iptables to a kernel bypass user space program (which we call floodgate). We use a partial kernel bypass solution using Solarflare EF_VI interface. With this on each server we can process more than 5M attack packets per second while consuming only a single CPU core. With floodgate we have comfortable amount of CPU left for our applications, even during the largest network events.

Finally, there are a number of tweaks we can make on at the HTTP layer. For specific attacks we disable HTTP Keep-Alives forcing attackers to re-establish TCP sessions for each request. This sacrifices a bit of performance for valid traffic as well, but is a surprisingly powerful tool throttling many attacks. For other attack patterns we turn the “I’m under attack” mode on, forcing the attack to hit our JavaScript challenge page.

Manual attack handling

Early on these mitigations were applied manually by our tireless System Reliability Engineers (SRE's). Unfortunately, it turns out that humans under stress... well, make mistakes. We learned it the hard way - one of the most famous incidents happened in March 2013 when a simple typo brought our whole network down.

Humans are also not great at applying precise rules. As our systems grew and mitigations became more complex, having many specific toggles, our SREs got overwhelmed by the details. It was challenging to present all the specific information about the attack to the operator. We often applied overly-broad mitigations, which were unnecessarily affecting legitimate traffic. All that changed with the introduction of Gatebot.

Meet Gatebot

To aid our SREs we developed a fully automatic mitigation system. We call it Gatebot[1].

The main goal of Gatebot was to automate as much of the mitigation workflow as possible. That means: to observe the network and note the anomalies, understand the targets of attacks and their metadata (such as the type of customer involved), and perform appropriate mitigation action.

Nowadays we have multiple Gatebot instances - we call them “mitigation pipelines”. Each pipeline has three parts:

“attack detection” or “signal” - A dedicated system detects anomalies in network traffic. This is usually done by sampling a small fraction of the network packets hitting our network, and analyzing them using streaming algorithms. With this we have a real-time view of the current status of the network. This part of the stack is written in Golang, and even though it only examines the sampled packets, it's pretty CPU intensive. It might comfort you to know that at this very moment two big Xeon servers burn all of their combined 48 Skylake CPU cores toiling away counting packets and performing sophisticated analytics looking for attacks.
“reactive automation” or “business logic”. For each anomaly (attack) we see who the target is, can we mitigate it, and with what parameters. Depending on the specific pipeline, the business logic may be anything from a trivial procedure to a multi-step process requiring a number of database lookups and potentially confirmation from a human operator. This code is not performance critical and is written in Python. To make it more accessible and readable by others in company, we developed a simple functional, reactive programming engine. It helps us to keep the code clean and understandable, even as we add more steps, more pipelines and more complex logic. To give you a flavor of the complexity: imagine how the system should behave if a customer upgraded a plan during an attack.
“mitigation”. The previous step feeds specific mitigation instructions into the centralized mitigation management systems. The mitigations are deployed across the world to our servers, applications, customer settings and, in some cases, to the network hardware.

Sleeping at night

Gatebot operates constantly, without breaks for lunch. For the iptables mitigations pipelines alone, Gatebot got engaged between 30 and 1500 times a day. Here is a chart of mitigations per day over last 6 months:

Gatebot is much faster and much more precise than even our most experienced SREs. Without Gatebot we wouldn’t be able to operate our service with the appropriate level of confidence. Furthermore, Gatebot has proved to be remarkably adaptable - we started by automating handling of Layer 3 attacks, but soon we proved that the general model works well for automating other things. Today we have more than 10 separate Gatebot instances doing everything from mitigating Layer 7 attacks to informing our Customer Support team of misbehaving customer origin servers.

Since Gatebot’s inception we learned greatly from the "detection / logic / mitigation" workflow. We reused this model in our Automatic Network System which is used to relieve network congestion[2].

Gatebot allows us to protect our users no matter of the plan. Whether you are a on a Free, Pro, Business or Enterprise plan, Gatebot is working for you. This is why we can afford to provide the same level of DDoS protection for all our customers[3].

Dealing with attacks sound interesting? Join our world famous DDoS team in London, Austin, San Francisco and our elite office in Warsaw, Poland.

Fun fact: all our components in this area are called “gate-something”, like: gatekeeper, gatesetter, floodgate, gatewatcher, gateman... Who said that naming things must be hard? ↩︎
Some of us have argued that this system should be called Netbot. ↩︎
Note: there are caveats. Ask your Success Engineer for specifics! ↩︎

Unmetered Mitigation: DDoS Protection Without Limits

Matthew Prince — Mon, 25 Sep 2017 13:00:00 GMT

This is the week of Cloudflare's seventh birthday. It's become a tradition for us to announce a series of products each day of this week and bring major new benefits to our customers. We're beginning with one I'm especially proud of: Unmetered Mitigation.

CC BY-SA 2.0 image by Vassilis

Cloudflare runs one of the largest networks in the world. One of our key services is DDoS mitigation and we deflect a new DDoS attack aimed at our customers every three minutes. We do this with over 15 terabits per second of DDoS mitigation capacity. That's more than the publicly announced capacity of every other DDoS mitigation service we're aware of combined. And we're continuing to invest in our network to expand capacity at an accelerating rate.

Surge Pricing

Virtually every Cloudflare competitor will send you a bigger bill if you are unlucky enough to get targeted by an attack. We've seen examples of small businesses that survive massive attacks to then be crippled by the bills other DDoS mitigation vendors sent them. From the beginning of Cloudflare's history, it never felt right that you should have to pay more if you came under an attack. That feels barely a step above extortion.

With today’s announcement we are eliminating this industry standard of ‘surge pricing’ for DDoS attacks. Why should customers pay more just to defend themselves? Charging more when the customer is experiencing a painful attack feels wrong; just as surge pricing when it rains hurts ride-sharing customers when they need a ride the most.

End of the FINT

That said, from our early days, we would sometimes fail customers off our network if the size of an attack they received got large enough that it affected other customers. Internally, we referred to this as FINTing (for Fail INTernal) a customer.

The standards for when a customer would get FINTed were situation dependent. We had rough thresholds depending on what plan they were on, but the general rule was to keep a customer online unless the size of the attack impacted other customers. For customers on higher tiered plans, when our automated systems didn't handle the attacks themselves, our technical operations team could take manual steps to protect them.

Every morning I receive a list of all the customers that were FINTed the day before. Over the last four years the number of FINTs has dwindled. The reality is that our network today is at such a scale that we are able to mitigate even the largest DDoS attacks without it impacting other customers. This is almost always handled automatically. And, when manual intervention is required, our techops team has gotten skilled enough that it isn't overly taxing.

Aligning With Our Customers

So today, on the first day of our Birthday Week celebration, we make it official for all our customers: Cloudflare will no longer terminate customers, regardless of the size of the DDoS attacks they receive, regardless of the plan level they use. And, unlike the prevailing practice in the industry, we will never jack up your bill after the attack. Doing so, frankly, is perverse.

CC BY-SA 2.0 image by Dennis Jarvis

We call this Unmetered Mitigation. It stems from a basic idea: you shouldn't have to pay more to be protected from bullies who try and silence you online. Regardless of what Cloudflare plan you use — Free, Pro, Business, or Enterprise — we will never tell you to go away or that you need to pay us more because of the size of an attack.

Cloudflare's higher tier plans will continue to offer more sophisticated reporting, tools, and customer support to better tune our protections against whatever threats you face online. But volumetric DDoS mitigation is now officially unlimited and unmetered.

Setting the New Standard

Back in 2014, during Cloudflare's birthday week, we announced that we were making encryption free for all our customers. We did it because it was the right thing to do and we'd finally developed the technical systems we needed to do it at scale. At the time, people said we were crazy. I'm proud of the fact that, three years later, the rest of the industry has followed our lead and encryption by default has become the standard.

I'm hopeful the same will happen with DDoS mitigation. If the rest of the industry moves away from the practice of surge pricing and builds DDoS mitigation in by default then it would largely end DDoS attacks for good. We took a step down that path today and hope, like with encryption, the rest of the industry will follow.

Want to know more? Read No Scrubs: The Architecture That Made Unmetered Mitigation Possible and Meet Gatebot - a bot that allows us to sleep.

Cloudflare Rate Limiting - Insight, Control, and Mitigation against Layer 7 DDoS Attacks

Timothy Fong — Thu, 13 Apr 2017 20:34:00 GMT

Today, Cloudflare is extending its Rate Limiting service by allowing any of our customers to sign up. Our Enterprise customers have enjoyed the benefits of Cloudflare’s Rate Limiting offering for the past several months. As part of our mission to build a better internet, we believe that everyone should have the ability to sign up for the service to protect their websites and APIs.

CC-BY 2.0 image by Benjamin Child

Rate Limiting is one more feature in our arsenal of tools that help to protect our customers against denial-of-service attacks, brute-force password attempts, and other types of abusive behavior targeting the application layer. Application layer attacks are usually a barrage of HTTP/S requests which may look like they originate from real users, but are typically generated by machines (or bots). As a result, application layer attacks are often harder to detect and can more easily bring down a site, application, or API. Rate Limiting complements our existing DDoS protection services by providing control and insight into Layer 7 DDoS attacks.

Rate Limiting is now available to all customers across all plans as an optional paid feature. The first 10,000 qualifying requests are free, which allows customers to start using the feature without any cost .

Real world examples of how Rate Limiting helped Cloudflare customers

Over the past few months, Cloudflare customers ranging from e-commerce companies to high-profile, ad-driven platforms have been using this service to mitigate malicious attacks. It made a big difference to their business: they’ve stopped revenue loss, reduced infrastructure costs, and protected valuable information, such as intellectual property and/or customer data.

Several common themes have emerged for customers who have been successfully using Rate Limiting during the past couple months. The following are examples of some of the issues those customers have faced and how Rate Limiting addressed them.

High-volume attacks designed to bring down e-commerce sites

Buycraft, a Minecraft e-commerce platform, was subjected to denial-of-service attacks which could have brought down the e-commerce stores of its 500,000+ customers. Rate Limiting addresses this common attack type by blocking offending IP addresses at its network edge, so the malicious traffic doesn’t reach the origin servers and impact customers.

Attacks against API endpoints

Haveibeenpwned.com provides an API that surfaces accounts that have been hacked to help potential victims identify whether their credentials have been compromised. Troy Hunt (the service’s creator), decided to use Cloudflare’s Rate Limiting to protect his API from malicious traffic, leading to improved performance and reduced infrastructure costs.

Brute-force login attacks

After IT consulting firm 2600 Solutions, which manages Wordpress sites for clients, was brute-forced over 200 times in a month, owner Jeff Williams decided to use Cloudflare Rate Limiting. By blocking excessive failed login attempts, they were able to not only protect their clients’ sites from being compromised, they also ensured legitimate users were not impacted by slower application performance.

Bots scraping the site for content

Another Cloudflare customer saw valuable content being scraped from their site by competitors using bots. Competitors then used this scraped content to boost their own search engine ranking at the expense of the targeted site. Our customer lost tens of thousands of dollars before using Cloudflare’s Rate Limiting to prevent the bots from scraping content.

How do I get started with Rate Limiting?

Anyone can start utilizing the benefits of Cloudflare’s Rate Limiting. With the Cloudflare Dashboard, go to the Firewall tab, and within the Rate Limiting card, click on “Enable Rate Limiting.”

Even though you will be prompted to enter a payment method to start using the service, you will not be charged for the first 10,000 qualifying requests. Once done, you’ll be able to create rules.

If you are on an Enterprise plan, contact your Cloudflare Customer Success Manager to enable Rate Limiting.

Tighter control over the type of traffic to rate limit

As customers begin to understand attack patterns and their own application’s potential vulnerabilities, they can tighten criteria. All customers can create path-specific rules, using wildcards (for example: www.example.com/login/\* or www.example.com/\*/checkout.php). Customers on a Business or higher plan can specify to rate limit only certain HTTP request methods.

Simulate traffic to tune your rules

Customers on the Pro and higher plans will be able to ‘simulate’ rules. A rule in simulate mode will not actually block malicious traffic, but will allow you to understand what traffic will be blocked if you were to setup a ‘live’ rule. All Customers will have analytics (coming soon) to let them gain insights into the traffic patterns to their site, and the efficacy of their rules.

Next Steps

If you haven’t enabled Rate Limiting yet, go to the Firewall App and enable Rate Limiting
Create your first rule
For more information, including a demo of Rate Limiting in action, visit www.cloudflare.com/rate-limiting/.

The Daily DDoS: Ten Days of Massive Attacks

John Graham-Cumming — Fri, 02 Dec 2016 13:21:26 GMT

Back in March my colleague Marek wrote about a Winter of Whopping Weekend DDoS Attacks where we were seeing 400Gbps attacks occurring mostly at the weekends. We speculated that attackers were busy with something else during the week.

This winter we've seen a new pattern, and attackers aren't taking the week off, but they do seem to be working regular hours.

CC BY 2.0 image by Carol VanHook

On November 23, the day before US Thanksgiving, our systems detected and mitigated an attack that peaked at 172Mpps and 400Gbps. The attack started at 1830 UTC and lasted non-stop for almost exactly 8.5 hours stopping at 0300 UTC. It felt as if an attacker 'worked' a day and then went home.

The very next day the same thing happened again (although the attack started 30 minutes earlier at 1800 UTC).

On the third day the attacker started promptly at 1800 UTC but went home a little early at around 0130 UTC. But they managed to peak the attack over 200Mpps and 480Gbps.

And the attacker just kept this up day after day. Right through Thanksgiving, Black Friday, Cyber Monday and into this week. Night after night attacks were peaking at 400Gbps and hitting 320Gbps for hours on end.

This chart shows the packet rate in Mpps.

This chart shows the attack bandwidth in gigabytes per second (multiply by 8 to get Gbps).

This Tuesday things got interesting. The attacker stopped taking the night off and moved onto working 24 hours a day.

Another curiosity with these attacks is that they are not coming from the much talked about Mirai botnet. They are using different attack software and are sending very large L3/L4 floods aimed at the TCP protocol. The attacks are also highly concentrated in a small number of locations mostly on the US west coast.

Throughout we've mitigated the attack without impact on customers.

As we've written before, we architected Cloudflare to handle massive attacks automatically. If you are interested in working on systems like this, we're hiring.

The Internet is Hostile: Building a More Resilient Network

Jérôme Fleury — Tue, 08 Nov 2016 18:56:49 GMT

In a recent post we discussed how we have been adding resilience to our network.

The strength of the Internet is its ability to interconnect all sorts of networks — big data centers, e-commerce websites at small hosting companies, Internet Service Providers (ISP), and Content Delivery Networks (CDN) — just to name a few. These networks are either interconnected with each other directly using a dedicated physical fiber cable, through a common interconnection platform called an Internet Exchange (IXP), or they can even talk to each other by simply being on the Internet connected through intermediaries called transit providers.

The Internet is like the network of roads across a country and navigating roads means answering questions like “How do I get from Atlanta to Boise?” The Internet equivalent of that question is asking how to reach one network from another. For example, as you are reading this on the Cloudflare blog, your web browser is connected to your ISP and packets from your computer found their way across the Internet to Cloudflare’s blog server.

Figuring out the route between networks is accomplished through a protocol designed 25 years ago (on two napkins) called BGP.

BGP allows interconnections between networks to change dynamically. It provides an administrative protocol to exchange routes between networks, and allows for withdrawals in the case that a path is no longer viable (when some route no longer works).

The Internet has become such a complex set of tangled fibers, neighboring routers, and millions of servers that you can be certain there is a server failing or a optical fibre being damaged at any moment. Whether it’s in a datacenter, a trench next to a railroad, or at the bottom of the ocean. The reality is that the Internet is in a constant state of flux as connections break and are fixed; it’s incredible strength is that it operates in the face of the real world where conditions constantly change.

While BGP is the cornerstone of Internet routing, it does not provide first class mechanisms to automatically deal with these events, nor does it provide tools to manage quality of service in general.

Although BGP is able to handle the coming and going of networks with grace, it wasn’t designed to deal with Internet brownouts. One common problem is that a connection enters a state where it hasn’t failed, but isn’t working correctly either. This usually presents itself as packet loss: packets enter a connection and never arrive at their destination. The only solution to these brownouts is active, continuous monitoring of the health of the Internet.

CC BY 2.0 image by Steve Jurvetson

Again, the metaphor of a system of roads is useful. A printed map may tell you the route from one city to another, but it won't tell you where there's a traffic jam. However, modern GPS applications such as Waze can tell you which roads are congested and which are clear. Similarly, Internet monitoring shows which parts of the Internet are blocked or losing packets and which are working well.

At Cloudflare we decided to deploy our own mechanisms to react to unpredictable events causing these brownouts. While most events do not fall under our jurisdiction — they are “external” to the Cloudflare network — we have to operate a reliable service by minimizing the impact of external events.

This is a journey of continual improvement, and it can be deconstructed into a few simple components:

Building an exhaustive and consistent view of the quality of the Internet
Building a detection and alerting mechanism on top of this view
Building the automatic mitigation mechanisms to ensure the best reaction time

Monitoring the Internet

Having deployed our network in a hundred locations worldwide, we are in a unique position to monitor the quality of the Internet from a wide variety of locations. To do this, we are leveraging the probing capabilities of our network hardware and have added some extra tools that we’ve built.

By collecting data from thousands of automatically deployed probes, we have a real-time view of the Internet’s infrastructure: packet loss in any of our transit provider’s backbones, packet loss on Internet Exchanges, or packet loss between continents. It is salutary to watch this real-time view over time and realize how often parts of the Internet fail and how resilient the overall network is.

Our monitoring data is stored in real-time in our metrics pipeline powered by a mix of open-source software: ZeroMQ, Prometheus and OpenTSDB. The data can then be queried and filtered on a single dashboard to give us a clear view of the state of a specific transit provider, or one specific PoP.

Above we can see a time-lapse of a transit provider having some packet loss issues.

Here we see a transit provider having some trouble on the US West Coast on October 28, 2016.

Building a Detection Mechanism

We didn’t want to stop here. Having a real-time map of Internet quality puts us in a great position to detect problems and create alerts as they unfold. We have defined a set of triggers that we know are indicative of a network issue, which allow us to quickly analyze and repair problems.

For example, 3% packet loss from Latin America to Asia is expected under normal Internet conditions and not something that would trigger an alert. However, 3% packet loss between two countries in Europe usually indicates a bigger and potentially more impactful problem, and thus will immediately trigger alerts for our Systems Reliability Engineering and Network Engineering teams to look into the issue.

Sitting between eyeball networks and content networks, it is easy for us to correlate this packet loss with various other metrics in our system, such as difficulty connecting to customer origin servers (which manifest as Cloudflare error 522) or a sudden decrease of traffic from a local ISP.

Automatic Mitigation

Receiving valuable and actionable alerts is great, however we were still facing the hard to compress time-to-reaction factor. Thankfully in our early years, we’ve learned a lot from DDoS attacks. We’ve learned how to detect and auto-mitigate most attacks with our efficient automated DDoS mitigation pipeline. So naturally we wondered if we could apply what we’ve learned from DDoS mitigation to these generic internet events? After all, they do exhibit the same characteristics: they’re unpredictable, they’re external to our network, and they can impact our service.

The next step was to correlate these alerts with automated actions. The actions should reflect what an on-call network engineer would have done given the same information. This includes running some important checks: is the packet loss really external to our network? Is the packet loss correlated to an actual impact? Do we currently have enough capacity to reroute the traffic? When all the stars align, we know we have a case to perform some action.

All that said, automating actions on network devices turns out to be more complicated than one would imagine. Without going into too much detail, we struggled to find a common language to talk to our equipment with because we’re a multi-vendor network. We decided to contribute to the brilliant open-source project Napalm, coupled it with the automation framework Salt, and and improved it to bring us the features we needed.

We wanted to be able to perform actions such as configuring probes, retrieving their data, and managing complex BGP neighbor configuration regardless of the network device a given PoP was using. With all these features put together into an automated system, we can see the impact of actions it has taken:

Here you can see one of our transit provider having a sudden problem in Hong-Kong. Our system automatically detects the fault and takes the necessary action, which is to disable this link for our routing.

Our system keeps improving every day, but it is already running at a high pace and making immediate adjustments across our network to optimize performance every single day.

Here we can see actions taken during 90 days of our mitigation bot.

The impact of this is that we’ve managed to make the Internet perform better for our customers and reduce the number of errors that they'd see if they weren't using Cloudflare. One way to measure this is how often we're unable to reach a customer's origin. Sometimes origins are completely offline. However, we are increasingly at a point where if an origin is reachable we'll find a path to reach it. You can see the effects of our improvements over the last year in the graph below.

The Future

While we keep improving this resiliency pipeline every day, we are looking forward to deploying some new technologies to streamline it further: Telemetry will permit a more real-time collection of our data by moving from a pull model to a push model, and new automation languages like OpenConfig will unify and simplify our communication with network devices. We look forward to deploying these improvements as soon as they are mature enough for us to release.

At Cloudflare our mission is to help build a better internet. The internet though, by its nature and size, is in constant flux — breaking down, being added to, and being repaired at almost any given moment — meaning services are often interrupted and traffic is slowed without warning. By enhancing the reliability and resiliency of this complex network of networks we think we are one step closer to fulfilling our mission and building a better internet.

How Cloudflare's Architecture Allows Us to Scale to Stop the Largest Attacks

Matthew Prince — Wed, 26 Oct 2016 12:59:29 GMT

The last few weeks have seen several high-profile outages in legacy DNS and DDoS-mitigation services due to large scale attacks. Cloudflare's customers have, understandably, asked how we are positioned to handle similar attacks.

While there are limits to any service, including Cloudflare, we are well architected to withstand these recent attacks and continue to scale to stop the larger attacks that will inevitably come. We are, multiple times per day, mitigating the very botnets that have been in the news. Based on the attack data that has been released publicly, and what has been shared with us privately, we have been successfully mitigating attacks of a similar scale and type without customer outages.

I thought it was a good time to talk about how Cloudflare's architecture is different than most legacy DNS and DDoS-mitigation services and how that's helped us keep our customers online in the face of these extremely high volume attacks.

Analogy: How Databases Scaled

Before delving into our architecture, it's worth taking a second to think about another analogous technology problem that is better understood: scaling databases. From the mid-1980s, when relational databases started taking off, through the early 2000s the way companies thought of scaling their database was by buying bigger hardware. The game was: buy the biggest database server you could afford, start filling it with data, and then hope a newer, bigger server you could afford was released before you ran out of room. Hardware companies responded with more and more exotic, database-specific hardware.

Meet the IBM z13 mainframe (source: IBM)

At some point, the bounds of a box couldn't contain all the data some organizations wanted to store. Google is a famous example. Back when the company was a startup, they didn't have the resources to purchase the largest database servers. Nor, even if they did, could the largest servers store everything they wanted to index — which was, literally, everything.

So, rather than going the traditional route, Google wrote software that allowed many cheap, commodity servers to work together as if they were one large database. Over time, as Google developed more services, the software became efficient at distributing load across all the machines in Google's network to maximize utilization of network, compute, and storage. And, as Google's needs grew, they just added more commodity servers — allowing them to linearly scale resources to meet their needs.

Legacy DNS and DDoS Mitigation

Compare this with the way legacy DNS and DDoS mitigation services mitigate attacks. Traditionally, the way to stop an attack was to buy or build a big box and use it to filter incoming traffic. If you were to dig into the technical details of most legacy DDoS mitigation service vendors you'd find hardware from companies like Cisco, Arbor Networks, and Radware clustered together into so-called "scrubbing centers."

CC BY-SA 3.0 sewage treatment image by Annabel

Just like in the old database world, there were tricks to get these behemoth mitigation boxes to (sort of) work together, but they were kludgy. Often the physical limits of the number of packets that a single box could absorb became the effective limit on the total volume that could be mitigated by a service provider. And, in very large DDoS attacks, much of the attack traffic will never reach the scrubbing center because, with only a few locations, upstream ISPs become the bottleneck.

The expense of the equipment meant that it is not cost effective to distribute scrubbing hardware broadly. If you were a DNS provider, how often would you really get attacked? How could you justify investing in expensive mitigation hardware in every one of your data centers? Even if you were a legacy DDoS vendor, typically your service was only provisioned when a customer came under attack so it never made sense to have capacity much beyond a certain margin over the largest attack you'd previously seen. It seemed rational that any investment beyond that was a waste, but that conclusion is proving ultimately fatal to the traditional model.

The Future Doesn't Come in a Box

From the beginning at Cloudflare, we saw our infrastructure much more like how Google saw their database. In our early days, the traditional DDoS mitigation hardware vendors tried to pitch us to use their technology. We even considered building mega boxes ourselves and using them just to scrub traffic. It seemed like a fascinating technical challenge, but we realized that it would never be a scalable model.

Instead, we started with a very simple architecture. Cloudflare's first racks had only three components: router, switch, server. Today we’ve made them even simpler, often dropping the router entirely and using switches that can also handle enough of the routing table to route packets over the geographic region the data center serves.

Rather than using load balancers or dedicated mitigation hardware, which could become bottlenecks in an attack, we wrote software that uses BGP, the fundamental routing protocol of the Internet, to distribute load geographically and also within each data center in our network. Critical to our model: every server in every rack is able to answer every type of request. Our software dynamically allocates load based on what is needed for a particular customer at a particular time. That means that we automatically spread load across literally tens of thousands of servers during large attacks.

Graphene: a simple architecture that’s 100 times stronger than the best steel (credit: Wikipedia)

It has also meant that we can cost-effectively continue to invest in our network. If Frankfurt needs 10 percent more capacity, we can ship it 10 percent more servers rather than having to make the step-function decision of whether to buy or build another Colossus Mega Scrubber™ box.

Since every core in every server in every data center can help mitigate attacks, it means that with each new data center we bring online we get better and better at stopping attacks nearer the source. In other words, the solution to a massively distributed botnet is a massively distributed network. This is actually how the Internet was meant to work: distributed strength, not focused brawn within a few scrubbing locations.

How We Made DDoS Mitigation Essentially Free

The efficient use of resources isn't only with capital expenditures but also with operating expenditures. Because we use the same equipment and networks to provide all the functions of Cloudflare, we rarely have any additional bandwidth costs associated with stopping an attack. Bear with me for a second, because, to understand this, you need to understand a bit about how we buy bandwidth.

We pay for bandwidth from transit providers on an aggregated basis billed monthly at the 95th percentile of the greater of ingress vs. egress. Ingress is just network speak for traffic being sent into our network. Egress is traffic being sent out from our network.

In addition to being a DDoS mitigation service, Cloudflare also offers other functions including caching. The nature of a cache is that you should always have more traffic going out from your cache than coming in. In our case, during normal circumstances, we have many times more egress (traffic out) than ingress (traffic in).

Large DDoS attacks drive up our ingress but don't affect our egress. However, even in a very large attack, it is extremely rare that that ingress exceeds egress. Because we only pay for the greater of ingress vs. egress, and because egress is always much higher than ingress, we effectively have an enormous amount of zero-cost bandwidth with which to soak up attacks.

As use of our services increases, the amount of capacity to stop attacks increases proportionately. People wonder how we can provide DDoS mitigation at a fixed fee regardless of the size of the attack; the answer is because attacks don't increase the biggest of our unit costs. And, while legacy providers have stated that their offering pro bono DDoS mitigation would cost them millions, we’re able to protect politically and artistically important sites against huge attacks for free through Project Galileo without it breaking the bank.

Winning the Arms Race

Cloudflare is the only DNS provider that was designed, from the beginning, to mitigate large scale DDoS attacks. Just as DDoS attacks are by their very nature distributed, Cloudflare’s DDoS mitigation system is distributed across our massive global network.

There is no doubt that we are in an arms race with attackers. However, we are well positioned technically and economically to win that race. Against most legacy providers, attackers have an advantage: providers' costs are high because they have to buy expensive boxes and bandwidth, while attackers' costs are low because they use hacked devices. That’s why our secret sauce is the software that spreads our load across our massively distributed network of commodity hardware. By keeping our costs low we are able to continue to grow our capacity efficiently and stay ahead of attacks.

Today, we believe Cloudflare has more capacity to stop attacks than the publicly announced capacity of all our competitors — combined. And we continue to expand, opening nearly a new data center a week. The good news for our customers is that we’ve designed Cloudflare in such a way that we can continue to cost effectively scale our capacity as attacks grow. There are limits to any service, and we remain ever vigilant for new attacks, but we are confident that our architecture is ultimately the right way to stop whatever comes next.

PS - Want to work at our scale on some of the hardest problems the Internet faces? We’re hiring.

Rate Limiting: Live Demo

Timothy Fong — Fri, 30 Sep 2016 19:56:00 GMT

Cloudflare helps customers control their own traffic at the edge. One of two products that we introduced to empower customers to do so is Cloudflare Rate Limiting*.

CC BY 2.0 image by Brian Hefele

Rate Limiting allows a customer to rate limit, shape or block traffic based on the rate of requests per client IP address, cookie, authentication token, or other attributes of the request. Traffic can be controlled on a per-URI (with wildcards for greater flexibility) basis giving pinpoint control over a website, application, or API.

Cloudflare has been dogfooding Rate Limiting to add more granular controls against Layer 7 DOS and brute-force attacks. For example, we've experienced attacks on cloudflare.com from more than 4,000 IP addresses sending 600,000+ requests in 5 minutes to the same URL but with random parameters. These types of attacks send large volumes of HTTP requests intended to bring down our site or to crack login passwords.

Rate Limiting protects websites and APIs from similar types of bad traffic. By leveraging our massive network, we are able to process and enforce rate limiting near the client, shielding the customer's application from unnecessary load.

To make this more concrete, let's look at a live demonstration rule for cloudflare.com. Multiple rules may be used and combined to great effect -- this is just a limited example.

Read on, and then test it yourself.

Creating the rule

Imagine an endpoint that is resource intensive. To maintain availability, we want to protect it from high-volume request rates - like those from an aggressive bot or attacker.

URL *.cloudflare.com/rate-limit-test

Rate Limiting allows for * wildcards to give more flexibility. An API with multiple endpoints might use a pattern of api.example.com/v2/*

With that pattern, all resources under /v2 would be protected by the same rule.

ThresholdWe set this demonstration rule to 10 requests per minute, which is too sensitive for a real web application, but allows a curious user refreshing their browser ten times to see Rate Limiting in action.

ActionWe set this value to block which means that once an IP addresses triggers a rule, all traffic from that IP address will be blocked at the edge and served with a default 429 HTTP error code.

Other possible choices include simulate which means no action taken, but analytics would indicate which requests would have been mitigated to help customers evaluate the potential impact of a given rule.

Timeout

This is the duration of the mitigation once the rule has been triggered. In this example, an offending IP address will be blocked for 1 minute.

Response body type

This type was set to HTML in the demo so that Rate Limiting returns a web page to mitigated requests. For an API endpoint, the response body type could return JSON.

Response body

The response body can be anything you want. Refresh the link below 10 times very quickly to see our choice for this demonstration rule.

https://www.cloudflare.com/rate-limit-test

Other possible configurations

We could have specified a Method. If we only cared to rate limit POST requests, we could adjust the rule to do so. This rule could be used for a login page where high frequency of POSTs by the same IP is potentially suspicious.

We also could have specified a Response Code. If we only wanted to rate limit IPs which were consistently failing to authenticate, we could create the rule to trigger only after a certain threshold of 403’s have been served. Once an IP is flagged, perhaps because it was pounding a login endpoint with incorrect credentials, that client IP could be blocked from hitting either that endpoint or the whole site.

We will expand the matching criteria, such as adding headers or cookies. We will also extend the mitigation options to include CAPTCHA or other challenges. This will give our users even more flexibility and power to protect their websites and API endpoints.

Early Access

We'd love to have you try Rate Limiting. Read more and sign up for Early Access.

**Note: This post was updated 4/13/17 to reflect the current product name. All references to Traffic Control have been changed to Rate Limiting.*

400Gbps: Winter of Whopping Weekend DDoS Attacks

Marek Majkowski — Thu, 03 Mar 2016 02:32:00 GMT

Over the last month, we’ve been watching some of the largest distributed denial of service (DDoS) attacks ever seen unfold. As CloudFlare has grown we've brought on line systems capable of absorbing and accurately measuring attacks. Since we don't need to resort to crude techniques to block traffic we can measure and filter attacks with accuracy. Our systems sort bad packets from good, keep websites online and keep track of attack packet rates and bits per second.

The current spate of large attacks are all layer 3 (L3) DDoS. Layer 3 attacks consist of a large volume of packets hitting the target network, and the aim is usually to overwhelm the target network hardware or connectivity.

L3 attacks are dangerous because most of the time the only solution is to acquire large network capacity and buy beefy networking hardware, which is simply not an option for most independent website operators. Or, faced with huge packet rates, some providers simply turn off connections or entirely block IP addresses.

A Typical Day At CloudFlare

Historically, L3 attacks were the biggest headache for CloudFlare. Over the last two years, we’ve automated almost all of our L3 attack handling and these automatic systems protect CloudFlare customers worldwide 24 hours a day.

This chart shows our L3 DoS mitigation during the last quarter of 2015:

The y-axis is the number of individual "DoS events" that we responded to, which is about 20-80 on a typical day. Each of these DoS events are automatically triggered by an attack on one or more of our customers.

Recent DDoS Attacks Were Larger. Much Larger

Most of the mitigated DoS events are small. When a big attack happens, it usually shows up as a large number of separate events in our system.

During the last month or so, we’e been dealing (mostly automatically) with very large attacks. Here is the same chart, but including the first quarter of 2016. Notice the scale:

That’s about a 15x increase in individual DoS events. These new attacks are interesting for a couple of reasons. First, the spikes align with the weekends. It seems the attackers are busy with something else during the week. Second, they are targeting a couple of fairly benign websites—this demonstrates that anybody can become the target of a large attack. Third, the overall volume of the attack is enormous.

Let me elaborate on this last point.

BGP Black Holes

When operating at a smaller scale, it’s not unusual for a DDoS attack to overwhelm the capacity of the target network. This causes network congestion and forces the operators of the attacked network to black hole the attacked IP addresses, pretty much removing them from the Internet.

After that happens, it’s impossible to report the volume of the attack. The attack traffic will disappear in a “black hole” and is invisible to the operators of the attacked network. This is why reporting on big attacks is difficult—with BGP blackholing, it’s impossible to assess the true scale of the malicious traffic.

CC BY 2.0 image by Marcin Wichary

We’ve encountered very large attacks in the past, and we didn’t often have to resort to blackholing. This allows us to report in detail on the attack volume we see. Fortunately, the same is true this time around—the size of our network and the efficiency of our systems allowed us to simply absorb the attacks. This is the only reason we’re able to provide the following metrics.

How Big Was This DDoS Attack?

DDoS attacks are measured in two dimensions: the number of malicious packets per second (pps) and the attack bandwidth in bits per second (bps).

Rate of Packets (pps)

The packets per second metric is important, since the processing power required on a router raises proportionally with the pps count. Whenever an attack overwhelms a router’s processing power, we will see a frantic error message like this:

PROBLEM/CRITICAL:
edge21.sin01.cloudflare.cc router PFE hardware drops
Info cell drops are increasing at a rate of 210106.35/s.
Fabric drops are increasing at a rate of 329678.81/s.

It’s not uncommon for attacks against CloudFlare to reach 100Mpps. The recent attacks were larger and peaked at 180Mpps. Here’s a chart of the attack packets per second that we received over the last month:

There aren’t many networks in the world that could sustain this rate of attack packets, but our hardware was able to keep up with all the malicious traffic. Our networking team does an excellent job at keeping CloudFlare’s networking hardware up to the task.

Volume in bits (bps)

The other interesting parameter is size of the attack in gigabits per second. Some large DDoS attacks attempt to saturate the network capacity (bps), rather than router processing power (pps). If that happens, we see alerts like this:

PROBLEM/CRITICAL:
edge112.tel01.cloudflare.cc router interfaces
ae-1/2/2 (HardLayer) in: 81092mbps (81.10%)

We saw this warning some time ago when a link got quite full. In fact, in recent weeks a couple of our fat 100Gbps links got quite busy, hitting a ceiling at 77Gbps on inbound:

We speculate the attack against this datacenter was larger than 77Gbps, but that congestion occurred on the other side of the network—close to the attackers. We’re investigating this with our Internet providers.

The recent attacks peaked at around 400Gbps of aggregate inbound traffic. The peak wasn't a one-off spike, it lasted for a couple of hours!

Even during the peak attack traffic, our network wasn’t congested and systems operated normally. Our automatic mitigation software was able to sort the good from the bad packets.

Final Words

Over the past weeks, we encountered a series of large DDoS attacks. In peak, our systems reported over 400Gbps of incoming traffic, which is amongst the largest by aggregate volume that we’ve ever seen. The attack was mostly absorbed by our automatic mitigation systems.

With each attack, we’re improving our automatic attack mitigation system. In addition, our network team constantly keeps an eye on the network, working to identify new bottlenecks. Without our automatic mitigation system and network team, dealing with attacks of this size would not be possible.

You might think that as CloudFlare grows, the relative scale of attacks will shrink in comparison to legitimate traffic. The funny thing is, the opposite is true. Every time we upgrade our network capacity, we see larger attacks because we don’t have to resort to blackholing. With more capacity, we’re able to withstand and accurately measure even bigger attacks.

Interested in dealing with the largest DDoS attacks in the world? We’re hiring in San Francisco, London, and Singapore.

Announcing Virtual DNS: DDoS Mitigation and Global Distribution for DNS Traffic

Dani Grant — Tue, 10 Mar 2015 12:59:09 GMT

It’s 9am and CloudFlare has already mitigated three billion malicious requests for our customers today. Six out of every one hundred requests we see are malicious, and increasingly, more of those bad requests are targeting DNS nameservers.

DNS is the phone book of the Internet and fundamental to the usability of the web, but is also a serious weak link in Internet security. One of the ways CloudFlare is trying to make DNS more secure is by implementing DNSSEC, cryptographic authentication for DNS responses. Another way is Virtual DNS, the authoritative DNS proxy service we are introducing today.

Virtual DNS provides CloudFlare’s DDoS mitigation and global distribution to DNS nameservers. DNS operators need performant, resilient infrastructure, and we are offering ours, the fastest of any providers, to any organization’s DNS servers.

Many organizations have legacy DNS infrastructure that is difficult to change. The hosting industry is a key example of this. A host may have given thousands of clients a set of nameservers but now realize that they don't have the performance or defensibility that their clients need.

Virtual DNS means that the host can get the benefits of a global, modern DNS infrastructure without having to contact every customer and get them to update their name servers.

With legacy infrastructure blocking a host from deploying modern cloud-based security services, DNS providers, even if they are securing their customers' websites, may have a massive single point of failure: their own nameservers.

A Quick Brief on DDoS

Source: Android Corps

DDoS stands for Distributed Denial of Service, and works much like the 2004 video game Diner Dash. In each case, the server is expected to handle more and more requests (for food, in the case of Diner Dash, and for data, in the case of web servers) until the server is so overwhelmed that it invariably fails to answer at all.

A successful DDoS attack on a provider's nameservers will take every website with DNS records on those nameservers offline. For larger providers, this could be hundreds of thousands or millions of websites depending on those nameservers.

Introducing Virtual DNS

Today, CloudFlare introduces Virtual DNS, leveraging its global DNS and proxying infrastructure to provide performance and security for any nameserver by acting as authoritative for its domains.

With Virtual DNS, DNS queries for the provider's records are responded to by the nearest CloudFlare edge location. If the proper DNS response is available in CloudFlare's cache, CloudFlare will return the response to the visitor, saving bandwidth at the origin nameserver.

If the DNS response is not available in cache, CloudFlare will query one of the provider's nameservers in the background to fetch the DNS response and send it back to the visitor. Simultaneously, that response will be temporarily cached on CloudFlare to be automatically returned when the next query for that record comes along. The caching of records at the edge makes CloudFlare one of the fastest DNS providers worldwide.

To protect against attacks, malicious requests to the nameservers will be identified and blocked at CloudFlare’s edge before those requests ever make it to the provider's DNS infrastructure.

A simple representation of this communication can be seen below:

Additional Security

Virtual DNS provides two additional layers of security through the CloudFlare proxy:

First, if for some reason the origin nameserver is knocked offline and the DNS records are cached on CloudFlare, CloudFlare will keep the records in the cache and will continue to answer for them, providing DNS answers even when the origin nameserver is unreachable, and automatically checking in the background for the origin's return or failing over to designated origins.

Secondly, Virtual DNS masks the true origin IP addresses of the provider's nameservers behind CloudFlare’s IP addresses. Visitors and/or attackers only see CloudFlare’s IP addresses when requesting answers, keeping customer nameservers safe from being targeted by attackers.

Virtual DNS Rollout

We are currently rolling out Virtual DNS support. Organizations interested in enabling Virtual DNS should contact our sales team.

Over the past year, we’ve been testing the product with hosting providers, registrars and some enterprises with very positive results.

DigitalOcean, for example, put their nameservers behind Virtual DNS in July 2014, and is now supporting 10K requests per second of 100% clean traffic. They report that they haven’t seen malicious traffic reach their nameservers since.

Maintaining custom DNS infrastructure is hard and expensive, and Virtual DNS makes it more accessible. Any enterprise can use CloudFlare Virtual DNS to deliver answers to the edge, with high performance anywhere in the world, saving bandwidth costs by caching answers, and stopping malicious traffic.

Understanding and mitigating NTP-based DDoS attacks

John Graham-Cumming — Thu, 09 Jan 2014 16:00:00 GMT

Over the last couple of weeks you may have been hearing about a new tool in the DDoS arsenal: NTP-based attacks. These have become popular recently and caused trouble for some gaming web sites and service providers. We'd long thought that NTP might become a vector for DDoS attacks because, like DNS, it is a simple UDP-based protocol that can be persuaded to return a large reply to a small request. Unfortunately, that prediction has come true.

This blog post explains how an NTP-based attack works and how web site owners can help mitigate them. CloudFlare defends web sites against NTP based attacks, but it's best to stem the flow of NTP-based DDoS by making simple configuration changes to firewalls and NTP servers. Doing so makes the web safer for everyone.

DNS Reflection is so 2013

We've written in the past about DNS-based reflection and amplification attacks and NTP-based attacks use similar techniques, just a different protocol.

A reflection attack works when an attacker can send a packet with a forged source IP address. The attacker sends a packet apparently from the intended victim to some server on the Internet that will reply immediately. Because the source IP address is forged, the remote Internet server replies and sends data to the victim.

That has two effects: the actual source of the attack is hidden and is very hard to trace, and, if many Internet servers are used, an attack can consist of an overwhelming number of packets hitting a victim from all over the world.

But what makes reflection attacks really powerful is when they are also amplified: when a small forged packet elicits a large reply from the server (or servers). In that case, an attacker can send a small packet "from" a forged source IP address and have the server (or servers) send large replies to the victim.

Amplification attacks like that result in an attacker turning a small amount of bandwidth coming from a small number of machines into a massive traffic load hitting a victim from around the Internet. Until recently the most popular protocol for amplification attacks was DNS: a small DNS query looking up the IP address of a domain name would result in a large reply.

For DNS the amplification factor (how much larger a reply is than a request) is 8x. So an attacker can generate an attack 8x larger than the bandwidth they themselves have access to. For example, an attacker controlling 10 machines with 1Gbps could generate an 80Gbps DNS amplification attack.

In the past, we've seen one attack that used SNMP for amplification: it has a factor of 650x! Luckily, there are few open SNMP servers on the Internet and SNMP usually requires authentication (although many are poorly secured). That makes SNMP attacks relatively rare.

The new kid on the block today is NTP.

Network Time Protocol attacks: as easy as (UDP port) 123

NTP is the Network Time Protocol that is used by machines connected to the Internet to set their clocks accurately. For example, the address time.euro.apple.com seen in the clock configuration on my Mac is actually the address of an NTP server run by Apple.

My Mac quietly synchronizes with that server to keep its clock accurate. And, of course, NTP is not just used by Macs: it is widely used across the Internet by desktops, servers and even phones to keep their clocks in sync.

Unfortunately, the simple UDP-based NTP protocol is prone to amplification attacks because it will reply to a packet with a spoofed source IP address and because at least one of its built in commands will send a long reply to a short request. That makes it ideal as a DDoS tool.

NTP contains a command called monlist (or sometimes MON_GETLIST) which can be sent to an NTP server for monitoring purposes. It returns the addresses of up to the last 600 machines that the NTP server has interacted with. This response is much bigger than the request sent making it ideal for an amplification attack.

To get an idea of how much larger, I used the ntpdc command to send a monlist command to a randomly chosen open NTP server on the Internet. Here are the request and response packets captured with Wireshark.

At the command line I typed

ntpdc –c monlist 1xx.xxx.xxx.xx9

to send the MON_GETLIST command to the server at 1xx.xxx.xxx.xx9. The request packet is 234 bytes long. The response is split across 10 packets totaling 4,460 bytes. That's an amplification factor of 19x and because the response is sent in many packets an attack using this would consume a large amount of bandwidth and have a high packet rate.

This particular NTP server only had 55 addresses to tell me about. Each response packet contains 6 addresses (with one short packet at the end), so a busy server that responded with the maximum 600 addresses would send 100 packets for a total of over 48k in response to just 234 bytes. That's an amplification factor of 206x!

An attacker, armed with a list of open NTP servers on the Internet, can easily pull off a DDoS attack using NTP. And NTP servers aren't hard to find. Common tools like Metasploit and NMAP have had modules capable of identifying NTP servers that support monlist for a long time. There's also the Open NTP Project which aims to highlight open NTP servers and get them patched.

Don't be part of the problem

If you're running a normal NTP program to set the time on your server and need to know how to configure it to protect your machine, I suggest Team Cymru's excellent page on a Secure NTP Template. It shows how to secure an NTP client on Cisco IOS, Juniper JUNOS or using iptables on a Linux system.

If you're running an ntpd server that needs to be on the public Internet then it's vital that it's upgraded to at least version 4.2.7p26 (more details in CVE-2013-5211). The vulnerability was classed as a bug in the ntpd bug database (issue 1532).

If you are running an ntpd server and still need something like monlist there's the mrulist command (see issue 1531) which now requires a nonce (a proof that the command came from the IP address in the UDP packet).

Neither of these changes are recent, ntpd v4.2.7p26 was released in March 24, 2010, so upgrading doesn't require using bleeding edge code.

If you're running a network (or are a service provider) then it's vital that you implement BCP-38. Implementation of it (and the related BCP-84) would eliminate source IP spoofed attacks of all kinds (DNS, NTP, SNMP, ...).

Footnote

The black and white photograph at the top of this blog post shows the UK's original speaking clock and the original voice of the clock Jane Cain. A common way to synchronize clocks and watches was to telephone the speaking clock to get the precise time.

Geeks like me will be amused that the NTP UDP port for time synchronization is 123 and that the telephone number of the UK speaking clock is also 123. Even today dialing 123 in the UK gets you the time.

The DDoS that almost broke the Internet

Matthew Prince — Wed, 27 Mar 2013 16:35:00 GMT

The New York Times this morning published a story about the Spamhaus DDoS attack and how CloudFlare helped mitigate it and keep the site online. The Times calls the attack the largest known DDoS attack ever on the Internet. We wrote about the attack last week. At the time, it was a large attack, sending 85Gbps of traffic. Since then, the attack got much worse. Here are some of the technical details of what we've seen.

Growth Spurt

On Monday, March 18, 2013 Spamhaus contacted CloudFlare regarding an attack they were seeing against their website spamhaus.org. They signed up for CloudFlare and we quickly mitigated the attack. The attack, initially, was approximately 10Gbps generated largely from open DNS recursors. On March 19, the attack increased in size, peaking at approximately 90Gbps. The attack fluctuated between 90Gbps and 30Gbps until 01:15 UTC on on March 21.

The attackers were quiet for a day. Then, on March 22 at 18:00 UTC, the attack resumed, peaking at 120Gbps of traffic hitting our network. As we discussed in the previous blog post, CloudFlare uses Anycast technology which spreads the load of a distributed attack across all our data centers. This allowed us to mitigate the attack without it affecting Spamhaus or any of our other customers. The attackers ceased their attack against the Spamhaus website four hours after it started.

Other than the scale, which was already among the largest DDoS attacks we've seen, there was nothing particularly unusual about the attack to this point. Then the attackers changed their tactics. Rather than attacking our customers directly, they started going after the network providers CloudFlare uses for bandwidth. More on that in a second, first a bit about how the Internet works.

Peering on the Internet

The "inter" in Internet refers to the fact that it is a collection of independent networks connected together. CloudFlare runs a network, Google runs a network, and bandwidth providers like Level3, AT&T, and Cogent run networks. These networks then interconnect through what are known as peering relationships.

When you surf the web, your browser sends and receives packets of information. These packets are sent from one network to another. You can see this by running a traceroute. Here's one from Stanford University's network to the New York Times' website (nytimes.com):

1 rtr-servcore1-serv01-webserv.slac.stanford.edu (134.79.197.130) 0.572 ms 2 rtr-core1-p2p-servcore1.slac.stanford.edu (134.79.252.166) 0.796 ms 3 rtr-border1-p2p-core1.slac.stanford.edu (134.79.252.133) 0.536 ms 4 slac-mr2-p2p-rtr-border1.slac.stanford.edu (192.68.191.245) 25.636 ms 5 sunncr5-ip-a-slacmr2.es.net (134.55.36.21) 3.306 ms 6 eqxsjrt1-te-sunncr5.es.net (134.55.38.146) 1.384 ms 7 xe-0-3-0.cr1.sjc2.us.above.net (64.125.24.1) 2.722 ms 8 xe-0-1-0.mpr1.sea1.us.above.net (64.125.31.17) 20.812 ms 9 209.249.122.125 (209.249.122.125) 21.385 ms

There are three networks in the above traceroute: stanford.edu, es.net, and above.net. The request starts at Stanford. Between lines 4 and 5 it passes from Stanford's network to their peer es.net. Then, between lines 6 and 7, it passes from es.net to above.net, which appears to provide hosting for the New York Times. This means Stanford has a peering relationship with ES.net. ES.net has a peering relationship with Above.net. And Above.net provides connectivity for the New York Times.

CloudFlare connects to a large number of networks. You can get a sense of some, although not all, of the networks we peer with through a tool like Hurricane Electric's BGP looking glass. CloudFlare connects to peers in two ways. First, we connect directly to certain large carriers and other networks to which we send a large amount of traffic. In this case, we connect our router directly to the router at the border of the other network, usually with a piece of fiber optic cable. Second, we connect to what are known as Internet Exchanges, IXs for short, where a number of networks meet in a central point.

Most major cities have an IX. The model for IXs are different in different parts of the world. Europe runs some of the most robust IXs, and CloudFlare connects to several of them including LINX (the London Internet Exchange), AMS-IX (the Amsterdam Internet Exchange), and DE-CIX (the Frankfurt Internet Exchange), among others. The major networks that make up the Internet --Google, Facebook Yahoo, etc. -- connect to these same exchanges to pass traffic between each other efficiently. When the Spamhaus attacker realized he couldn't go after CloudFlare directly, he began targeting our upstream peers and exchanges.

Headwaters

Once the attackers realized they couldn't knock CloudFlare itself offline even with more than 100Gbps of DDoS traffic, they went after our direct peers. In this case, they attacked the providers from whom CloudFlare buys bandwidth. We, primarily, contract with what are known as Tier 2 providers for CloudFlare's paid bandwidth. These companies peer with other providers and also buy bandwidth from so-called Tier 1 providers.

There are approximately a dozen Tier 1 providers on the Internet. The nature of these providers is that they don't buy bandwidth from anyone. Instead, they engage in what is known as settlement-free peering with the other Tier 1 providers. Tier 2 providers interconnect with each other and then buy bandwidth from the Tier 1 providers in order to ensure they can connect to every other point on the Internet. At the core of the Internet, if all else fails, it is these Tier 1 providers that ensure that every network is connected to every other network. If one of them fails, it's a big deal.

Anycast means that if the attacker attacked the last step in the traceroute then their attack would be spread across CloudFlare's worldwide network, so instead they attacked the second to last step which concentrated the attack on one single point. This wouldn't cause a network-wide outage, but it could potentially cause regional problems.

We carefully select our bandwidth providers to ensure they have the ability to deal with attacks like this. Our direct peers quickly filtered attack traffic at their edge. This pushed the attack upstream to their direct peers, largely Tier 1 networks. Tier 1 networks don't buy bandwidth from anyone, so the majority of the weight of the attack ended up being carried by them. While we don't have direct visibility into the traffic loads they saw, we have been told by one major Tier 1 provider that they saw more than 300Gbps of attack traffic related to this attack. That would make this attack one of the largest ever reported.

The challenge with attacks at this scale is they risk overwhelming the systems that link together the Internet itself. The largest routers that you can buy have, at most, 100Gbps ports. It is possible to bond more than one of these ports together to create capacity that is greater than 100Gbps however, at some point, there are limits to how much these routers can handle. If that limit is exceeded then the network becomes congested and slows down.

Over the last few days, as these attacks have increased, we've seen congestion across several major Tier 1s, primarily in Europe where most of the attacks were concentrated, that would have affected hundreds of millions of people even as they surfed sites unrelated to Spamhaus or CloudFlare. If the Internet felt a bit more sluggish for you over the last few days in Europe, this may be part of the reason why.

Attacks on the IXs

In addition to CloudFlare's direct peers, we also connect with other networks over the so-called Internet Exchanges (IXs). These IXs are, at their most basic level, switches into which multiple networks connect and can then pass bandwidth. In Europe, these IXs are run as non-profit entities and are considered critical infrastructure. They interconnect hundreds of the world's largest networks including CloudFlare, Google, Facebook, and just about every other major Internet company.

Beyond attacking CloudFlare's direct peers, the attackers also attacked the core IX infrastructure on the London Internet Exchange (LINX), the Amsterdam Internet Exchange (AMS-IX), the Frankfurt Internet Exchange (DE-CIX), and the Hong Kong Internet Exchange (HKIX). From our perspective, the attacks had the largest effect on LINX which caused impact over the exchange and LINX's systems that monitor the exchange, as visible through the drop in traffic recorded by their monitoring systems. (Corrected: see below for original phrasing.)

The congestion impacted many of the networks on the IXs, including CloudFlare's. As problems were detected on the IX, we would route traffic around them. However, several London-based CloudFlare users reported intermittent issues over the last several days. This is the root cause of those problems.

The attacks also exposed some vulnerabilities in the architecture of some IXs. We, along with many other network security experts, worked with the team at LINX to better secure themselves. In doing so, we developed a list of best practices for any IX in order to make them less vulnerable to attacks.

Two specific suggestions to limit attacks like this involve making it more difficult to attack the IP addresses that members of the IX use to interchange traffic between each other. We are working with IXs to ensure that: 1) these IP addresses should not be announced as routable across the public Internet; and 2) packets destined to these IP addresses should only be permitted from other IX IP addresses. We've been very impressed with the team at LINX and how quickly they've worked to implement these changes and add additional security to their IX and are hopeful other IXs will quickly follow their lead.

The Full Impact of the Open Recursor Problem

At the bottom of this attack we once again find the problem of open DNS recursors. The attackers were able to generate more than 300Gbps of traffic likely with a network of their own that only had access 1/100th of that amount of traffic themselves. We've written about how these mis-configured DNS recursors as a bomb waiting to go off that literally threatens the stability of the Internet itself. We've now seen an attack that begins to illustrate the full extent of the problem.

While lists of open recursors have been passed around on network security lists for the last few years, on Monday the full extent of the problem was, for the first time, made public. The Open Resolver Project made available the full list of the 21.7 million open resolvers online in an effort to shut them down.

We'd debated doing the same thing ourselves for some time but worried about the collateral damage of what would happen if such a list fell into the hands of the bad people. The last five days have made clear that the bad people have the list of open resolvers and they are getting increasingly brazen in the attacks they are willing to launch. We are in full support of the Open Resolver Project and believe it is incumbent on all network providers to work with their customers to close any open resolvers running on their networks.

Unlike traditional botnets which could only generate limited traffic because of the modest Internet connections and home PCs they typically run on, these open resolvers are typically running on big servers with fat pipes. They are like bazookas and the events of the last week have shown the damage they can cause. What's troubling is that, compared with what is possible, this attack may prove to be relatively modest.

As someone in charge of DDoS mitigation at one of the Internet giants emailed me this weekend: "I've often said we don't have to prepare for the largest-possible attack, we just have to prepare for the largest attack the Internet can send without causing massive collateral damage to others. It looks like you've reached that point, so...congratulations!"

At CloudFlare one of our goals is to make DDoS something you only read about in the history books. We're proud of how our network held up under such a massive attack and are working with our peers and partners to ensure that the Internet overall can stand up to the threats it faces.

Correction: The original sentence about the impact on LINX was "From our perspective, the attacks had the largest effect on LINX which for a little over an hour on March 23 saw the infrastructure serving more than half of the usual 1.5Tbps of peak traffic fail." That was not well phrased, and has been edited, with notation in place.

The Cloudflare Blog

How Cloudflare responded to the “Copy Fail” Linux vulnerability

Background

Our Linux kernel release process

About the Copy Fail vulnerability

AF_ALG and the kernel crypto API

Memory mechanics: page cache and in-place crypto

The vulnerability: out-of-bounds write

The exploit, step by step

How we responded

Validating detection coverage

Hunting for exploitation

Incident timeline and impact

How did we mitigate it?

Removing the module

Bpf-lsm

Rolling it out

Remediation and follow-up steps

Conclusion

SLP: a new DDoS amplification vector in the wild

Cloudflare customers are protected

How to drop 10 million packets per second

Test bench

Step 1. Dropping packets in application

Step 2. Slaughter conntrack

Step 3. BPF drop on a socket

Step 4. iptables DROP after routing

Step 5. iptables DROP in PREROUTING

Step 6. nftables DROP before CONNTRACK

Step 7. tc ingress handler DROP

Step 8. XDP_DROP

Summary

Today we mitigated 1.1.1.1

Gatebot

Cloudflare is growing, so must Gatebot

Things went wrong

Impact

Lessons Learned

Rate Limiting: Delivering more rules, and greater control

So what’s changing?

The new capabilities - in action!

Time-based Firewall

Enumeration Attacks

Content Scraping

One more thing ... more rules, totally rules!

Memcrashed - Major amplification attacks from UDP port 11211

Memcrashed

Memcached does UDP?

Source IPs

Let's fix it up

Memcached Users

System administrators

Internet Service Providers

Developers

That's all

Prologue

On the Leading Edge - Cloudflare named a leader in The Forrester Wave: DDoS Mitigation Solutions

The New DDoS Landscape

Performing a DDoS

Botnets

IoT Devices

DNS Amplification

DDoS Mitigation: The Old Priorities

The New Landscape

A Web Developers Guide to Defeating an Application Layer DDoS Attack

Aggressive Caching

Rate Limiting

Conclusion

No Scrubs: The Architecture That Made Unmetered Mitigation Possible

Three Problems With Scrubbers

Better, Cheaper, Smarter

Inside a Server

Open Source

Further Reading

Conclusion

Meet Gatebot - a bot that allows us to sleep

Mitigations

Manual attack handling

Meet Gatebot

Sleeping at night