
How to drop 10 million packets per second

07/06/2018


Internally our DDoS mitigation team is sometimes called "the packet droppers". When other teams build exciting products to do smart things with the traffic that passes through our network, we take joy in discovering novel ways of discarding it.

[Image: CC BY-SA 2.0 by Brian Evans]

Being able to quickly discard packets is very important for withstanding DDoS attacks.

Dropping packets hitting our servers, simple as it sounds, can be done at multiple layers. Each technique has its advantages and limitations. In this blog post we'll review all the techniques we've tried so far.

Test bench

To illustrate the relative performance of the methods, we'll show some numbers. The benchmarks are synthetic, so take the numbers with a grain of salt. We'll use one of our Intel servers with a 10Gbps network card. The hardware details aren't too important, since the tests are designed to expose operating system limitations, not hardware ones.

Our testing setup is prepared as follows:

  • We transmit a large number of tiny UDP packets, reaching 14Mpps (millions of packets per second).

  • This traffic is directed towards a single CPU on a target server.

  • We measure the number of packets handled by the kernel on that one CPU.

We're not trying to maximize userspace application speed or packet throughput - instead, we're trying specifically to show kernel bottlenecks.

The synthetic traffic is prepared to put maximum stress on conntrack - it uses random source IP and port fields. Tcpdump will show it like this:

$ tcpdump -ni vlan100 -c 10 -t udp and dst port 1234
IP 198.18.40.55.32059 > 198.18.0.12.1234: UDP, length 16
IP 198.18.51.16.30852 > 198.18.0.12.1234: UDP, length 16
IP 198.18.35.51.61823 > 198.18.0.12.1234: UDP, length 16
IP 198.18.44.42.30344 > 198.18.0.12.1234: UDP, length 16
IP 198.18.106.227.38592 > 198.18.0.12.1234: UDP, length 16
IP 198.18.48.67.19533 > 198.18.0.12.1234: UDP, length 16
IP 198.18.49.38.40566 > 198.18.0.12.1234: UDP, length 16
IP 198.18.50.73.22989 > 198.18.0.12.1234: UDP, length 16
IP 198.18.43.204.37895 > 198.18.0.12.1234: UDP, length 16
IP 198.18.104.128.1543 > 198.18.0.12.1234: UDP, length 16
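
For illustration, here is a rough C sketch of generating traffic of this shape with a raw socket. This is not our actual generator - a plain sendto(2) loop is nowhere near fast enough to reach 14Mpps - but it shows where the random source fields come from (requires CAP_NET_RAW):

// Hedged sketch: floods 198.18.0.12:1234 with UDP packets carrying
// random source IPs (within 198.18.0.0/16) and random source ports.
#include <arpa/inet.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <sys/socket.h>

int main(void) {
    // IPPROTO_RAW implies IP_HDRINCL: we build the IP header ourselves.
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);

    char pkt[sizeof(struct iphdr) + sizeof(struct udphdr) + 16] = {0};
    struct iphdr *ip = (struct iphdr *)pkt;
    struct udphdr *udp = (struct udphdr *)(pkt + sizeof(*ip));

    ip->version = 4;
    ip->ihl = 5;
    ip->ttl = 64;
    ip->protocol = IPPROTO_UDP;
    ip->daddr = inet_addr("198.18.0.12");
    udp->dest = htons(1234);
    udp->len = htons(sizeof(struct udphdr) + 16);
    // The kernel fills in total length, packet id and the IP checksum.

    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = ip->daddr,
    };

    for (;;) {
        ip->saddr = htonl(0xC6120000 | (rand() & 0xFFFF)); // 198.18.x.x
        udp->source = htons(1024 + rand() % 64000);
        sendto(fd, pkt, sizeof(pkt), 0,
               (struct sockaddr *)&dst, sizeof(dst));
    }
}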

On the target side all of the packets are forwarded to exactly one RX queue, and therefore one CPU. We do this with hardware flow steering (action 2 directs matching packets to RX queue #2):

ethtool -N ext0 flow-type udp4 dst-ip 198.18.0.12 dst-port 1234 action 2

Benchmarking is always hard. When preparing the tests we learned that having any active raw sockets destroys performance. It's obvious in hindsight, but easy to miss. Before running any tests make sure you don't have a stale tcpdump process running. This is how to check - the output here shows a bad process active:

$ ss -A raw,packet_raw -l -p|cat
Netid  State      Recv-Q Send-Q Local Address:Port
p_raw  UNCONN     525157 0      *:vlan100          users:(("tcpdump",pid=23683,fd=3))

Finally, we are going to disable the Intel Turbo Boost feature on the machine:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

While Turbo Boost is nice and increases throughput by at least 20%, it also drastically worsens the standard deviation in our tests. With turbo enabled we had ±1.5% deviation in our numbers; with it off, this falls to a manageable 0.25%.

[Image: diagram of the layers at which packets can be dropped]

Step 1. Dropping packets in application

Let's start with the idea of delivering packets to an application and ignoring them in userspace code. For the test setup, let's make sure our iptables rules don't affect performance:

iptables -I PREROUTING -t mangle -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT
iptables -I PREROUTING -t raw -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT
iptables -I INPUT -t filter -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT

The application code is a simple loop, receiving data and immediately discarding it in userspace:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("0.0.0.0", 1234))
while True:
    s.recvmmsg([...])  # pseudocode - the standard socket module has no recvmmsg binding
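
Our recvmmsg-loop tool is written in C; here is a minimal sketch of the same idea (simplified, with error handling elided - the real tool differs):

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#define VLEN  1024  // datagrams received per recvmmsg() call
#define BUFSZ 2048

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(1234),
        .sin_addr.s_addr = INADDR_ANY,
    };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    static char bufs[VLEN][BUFSZ];
    struct iovec iov[VLEN];
    struct mmsghdr msgs[VLEN];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < VLEN; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BUFSZ;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    for (;;) {
        // One syscall drains up to VLEN datagrams; we discard them all.
        recvmmsg(fd, msgs, VLEN, 0, NULL);
    }
}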

We prepared the code; to run it:

$ ./dropping-packets/recvmmsg-loop
packets=171261 bytes=1940176

This setup allows the kernel to receive a meagre 175kpps from the hardware receive queue, as measured with ethtool and our simple mmwatch tool:

$ mmwatch 'ethtool -S ext0|grep rx_2'
 rx2_packets: 174.0k/s

The hardware technically gets 14Mpps off the wire, but it's impossible to pass it all to a single RX queue handled by only one CPU core doing kernel work. mpstat confirms this:

$ watch 'mpstat -u -I SUM -P ALL 1 1|egrep -v Aver'
01:32:05 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:32:06 PM    0    0.00    0.00    0.00    2.94    0.00    3.92    0.00    0.00    0.00   93.14
01:32:06 PM    1    2.17    0.00   27.17    0.00    0.00    0.00    0.00    0.00    0.00   70.65
01:32:06 PM    2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
01:32:06 PM    3    0.95    0.00    1.90    0.95    0.00    3.81    0.00    0.00    0.00   92.38

As you can see, the application code is not the bottleneck, using 27% sys + 2% userspace on CPU #1, while the network SOFTIRQ on CPU #2 consumes 100% of that core.

By the way, using recvmmsg(2) is important. In these post-Spectre days syscalls have become more expensive, and indeed we run kernel 4.14 with KPTI and retpolines:

$ tail -n +1 /sys/devices/system/cpu/vulnerabilities/*
==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Mitigation: PTI

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Mitigation: __user pointer sanitization

==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Mitigation: Full generic retpoline, IBPB, IBRS_FW

Step 2. Slaughter conntrack

We specifically designed the test - by choosing random source IPs and ports - to put stress on the conntrack layer. This can be verified by looking at the number of conntrack entries, which during the test reaches the configured maximum:

$ conntrack -C
2095202

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 2097152

You can also observe conntrack shouting in dmesg:

[4029612.456673] nf_conntrack: nf_conntrack: table full, dropping packet
[4029612.465787] nf_conntrack: nf_conntrack: table full, dropping packet
[4029617.175957] net_ratelimit: 5731 callbacks suppressed

To speed up our tests let's disable it:

iptables -t raw -I PREROUTING -d 198.18.0.12 -p udp -m udp --dport 1234 -j NOTRACK

And rerun the tests:

$ ./dropping-packets/recvmmsg-loop
packets=331008 bytes=5296128

This instantly bumps the application receive performance to 333kpps. Hurray!

PS. With SO_BUSY_POLL we can bump the numbers to 470kpps, but that's a subject for another time.

Step 3. BPF drop on a socket

Going further, why deliver packets to a userspace application at all? While this technique is uncommon, we can attach a classical BPF filter to a SOCK_DGRAM socket with setsockopt(SO_ATTACH_FILTER) and program the filter to discard packets in kernel space.
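
The heart of the trick is small. Here is a hedged C sketch (the actual bpf-drop code differs): a classic BPF filter that returns 0 tells the kernel to keep zero bytes of each packet, discarding it before it ever reaches the socket queue:

#include <linux/filter.h>
#include <sys/socket.h>

// Attach a one-instruction classic BPF filter that drops every packet
// destined for this socket while still in kernel space.
int attach_drop_all(int fd) {
    struct sock_filter code[] = {
        // BPF_RET|BPF_K with constant 0: accept 0 bytes, i.e. drop.
        { BPF_RET | BPF_K, 0, 0, 0 },
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };
    return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                      &prog, sizeof(prog));
}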

See the code; to run it:

$ ./bpf-drop
packets=0 bytes=0

With drops in BPF (both classical BPF and extended eBPF have similar performance here) we process roughly 512kpps. All of the packets get dropped by the BPF filter while still in software interrupt mode, which saves us the CPU needed to wake up the userspace application.

Step 4. iptables DROP after routing

As a next step we can simply drop packets in the iptables firewall INPUT chain by adding a rule like this:

iptables -I INPUT -d 198.18.0.12 -p udp --dport 1234 -j DROP

Remember we disabled conntrack already with -j NOTRACK. These two rules give us 608kpps.

The numbers in iptables counters:

$ mmwatch 'iptables -L -v -n -x | head'

Chain INPUT (policy DROP 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination
605.9k/s    26.7m/s DROP       udp  --  *      *       0.0.0.0/0            198.18.0.12          udp dpt:1234

600kpps is not bad, but we can do better!

Step 5. iptables DROP in PREROUTING

An even faster technique is to drop packets before they get routed. This rule does exactly that:

iptables -I PREROUTING -t raw -d 198.18.0.12 -p udp --dport 1234 -j DROP

This produces a whopping 1.688Mpps.

This is quite a significant jump in performance, and I don't fully understand it. Either our routing layer is unusually complex, or there is a bug in our server configuration.

In any case - the "raw" iptables table is definitely way faster.

Step 6. nftables DROP before CONNTRACK

Iptables is considered passé these days. The new kid in town is nftables. See this video for a technical explanation of why nftables is superior. Nftables promises to be faster than the gray-haired iptables for many reasons, among them a rumor that retpolines (that is: no speculation for indirect jumps) hurt iptables quite badly.

Since this article is not about comparing nftables vs iptables speed, let's try only the fastest drop I could come up with:

nft add table netdev filter
nft -- add chain netdev filter input { type filter hook ingress device vlan100 priority -500 \; policy accept \; }
nft add rule netdev filter input ip daddr 198.18.0.0/24 udp dport 1234 counter drop
nft add rule netdev filter input ip6 daddr fd00::/64 udp dport 1234 counter drop

The counters can be seen with this command:

$ mmwatch 'nft --handle list chain netdev filter input'
table netdev filter {
    chain input {
        type filter hook ingress device vlan100 priority -500; policy accept;
        ip daddr 198.18.0.0/24 udp dport 1234 counter packets    1.6m/s bytes    69.6m/s drop # handle 2
        ip6 daddr fd00::/64 udp dport 1234 counter packets 0 bytes 0 drop # handle 3
    }
}

The nftables "ingress" hook yields around 1.53Mpps. This is slightly slower than iptables in the PREROUTING layer, which is puzzling - theoretically "ingress" happens before PREROUTING, so it should be faster.

In our test nftables was slightly slower than iptables, but not by much. Nftables is still better :P

Step 7. tc ingress handler DROP

A somewhat surprising fact is that the tc (traffic control) ingress hook happens even before PREROUTING. tc makes it possible to select packets based on basic criteria and indeed - action drop - them. The syntax is rather hacky, so it's recommended to use this script to set it up. We need a slightly more complex tc match; here is the command line:

tc qdisc add dev vlan100 ingress
tc filter add dev vlan100 parent ffff: prio 4 protocol ip u32 match ip protocol 17 0xff match ip dport 1234 0xffff match ip dst 198.18.0.0/24 flowid 1:1 action drop
tc filter add dev vlan100 parent ffff: protocol ipv6 u32 match ip6 dport 1234 0xffff match ip6 dst fd00::/64 flowid 1:1 action drop

We can verify it:

$ mmwatch 'tc -s filter  show dev vlan100  ingress'
filter parent ffff: protocol ip pref 4 u32 
filter parent ffff: protocol ip pref 4 u32 fh 800: ht divisor 1 
filter parent ffff: protocol ip pref 4 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit   1.8m/s success   1.8m/s)
  match 00110000/00ff0000 at 8 (success   1.8m/s ) 
  match 000004d2/0000ffff at 20 (success   1.8m/s ) 
  match c612000c/ffffffff at 16 (success   1.8m/s ) 
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 1.0/s sec
        Action statistics:
        Sent    79.7m/s bytes   1.8m/s pkt (dropped   1.8m/s, overlimits 0 requeues 0) 
        backlog 0b 0p requeues 0

A tc ingress hook with a u32 match allows us to drop 1.8Mpps on a single CPU. This is brilliant!

But we can go even faster...

Step 8. XDP_DROP

Finally, the ultimate weapon is XDP - eXpress Data Path. With XDP we can run eBPF code in the context of the network driver. Most importantly, this happens before the skbuff memory allocation, allowing great speeds.

Usually XDP projects have two parts:

  • the eBPF code loaded into the kernel context
  • the userspace loader, which loads the code onto the right network card and manages it

Writing the loader is pretty hard, so instead we can use the new iproute2 feature and load the code with this trivial command:

ip link set dev ext0 xdp obj xdp-drop-ebpf.o

Tadam!

The source code for the loaded eBPF XDP program is available here. The program parses IP packets and looks for the desired characteristics: IP transport, the UDP protocol, the target subnet and the destination port:

if (h_proto == htons(ETH_P_IP)) {
    if (iph->protocol == IPPROTO_UDP
        && (htonl(iph->daddr) & 0xFFFFFF00) == 0xC6120000 // 198.18.0.0/24
        && udph->dest == htons(1234)) {
        return XDP_DROP;
    }
}
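
To give a sense of the whole thing, here is a self-contained sketch of a comparable XDP program - an illustrative reconstruction, not the linked source. The bounds checks are required by the in-kernel eBPF verifier, the endianness helpers assume a little-endian host, and IP options are ignored for brevity:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>

// Assumes a little-endian host; real programs use bpf_endian.h helpers.
#define bpf_htons(x) __builtin_bswap16(x)
#define bpf_htonl(x) __builtin_bswap32(x)

// iproute2's bpf loader looks for the "prog" ELF section by default.
__attribute__((section("prog"), used))
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;
    if (iph->protocol != IPPROTO_UDP)
        return XDP_PASS;

    // Simplification: assumes no IP options (ihl == 5).
    struct udphdr *udph = (void *)(iph + 1);
    if ((void *)(udph + 1) > data_end)
        return XDP_PASS;

    // Drop UDP to 198.18.0.0/24 port 1234, pass everything else.
    if ((bpf_htonl(iph->daddr) & 0xFFFFFF00) == 0xC6120000 &&
        udph->dest == bpf_htons(1234))
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] __attribute__((section("license"), used)) = "GPL";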

The XDP program needs to be compiled with a modern clang that can emit BPF bytecode, for example (assuming the source file is named xdp-drop-ebpf.c):
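
clang -O2 -target bpf -c xdp-drop-ebpf.c -o xdp-drop-ebpf.o

After this we can load it and verify the running XDP program: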

$ ip link show dev ext0
4: ext0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc fq state UP mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:8a:59:8e brd ff:ff:ff:ff:ff:ff
    prog/xdp id 5 tag aedc195cc0471f51 jited

And we can see the numbers in the ethtool -S network card statistics:

$ mmwatch 'ethtool -S ext0|egrep "rx"|egrep -v ": 0"|egrep -v "cache|csum"'
     rx_out_of_buffer:     4.4m/s
     rx_xdp_drop:         10.1m/s
     rx2_xdp_drop:        10.1m/s

Whooa! With XDP we can drop 10 million packets per second on a single CPU.

[Image: CC BY-SA 2.0 by Andrew Filer]

Summary

We repeated these tests for both IPv4 and IPv6 and prepared this chart:

[Chart: dropped packets per second for each technique, without XDP]

Generally speaking, in our setup IPv6 had slightly lower performance. Remember that IPv6 packets are slightly larger, so some performance difference is unavoidable.

Linux has numerous hooks that can be used to filter packets, each with different performance and ease of use characteristics.

For DDoS purposes it may be perfectly reasonable to just receive the packets in the application and process them in userspace. Properly tuned applications can get pretty decent numbers.

For DDoS attacks with random/spoofed source IPs, it might be worthwhile to disable conntrack to gain some speed. Be careful though - there are attacks for which conntrack is very helpful.

In other circumstances it may make sense to integrate the Linux firewall into the DDoS mitigation pipeline. In such cases, remember to put the mitigations in the "-t raw PREROUTING" layer, since it's significantly faster than the "filter" table.

For even more demanding workloads, we always have XDP. And boy, it is powerful. Here is the same chart as above, but including XDP:

[Chart: dropped packets per second for each technique, including XDP]

If you want to reproduce these numbers, see the README where we documented everything.

Here at Cloudflare we are using... almost all of these techniques. Some of the userspace tricks are integrated with our applications. The iptables layer is managed by our Gatebot DDoS pipeline. Finally, we are working on replacing our proprietary kernel offload solution with XDP.

Want to help us drop more packets? We're hiring for many roles, including packet droppers, systems engineers and more!

Special thanks to Jesper Dangaard Brouer for helping with this work.
