
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 16:02:26 GMT</lastBuildDate>
        <item>
            <title><![CDATA[xdpcap: XDP Packet Capture]]></title>
            <link>https://blog.cloudflare.com/xdpcap/</link>
            <pubDate>Wed, 24 Apr 2019 18:21:59 GMT</pubDate>
            <description><![CDATA[ Our servers manage heaps of network packets, from legit traffic to big DDoS attacks. For top efficiency, we adopted eXpress Data Path (XDP), a Linux kernel tool for swift, low-level packet handling. ]]></description>
            <content:encoded><![CDATA[ <p>Our servers process a lot of network packets, be it legitimate traffic or <a href="/say-cheese-a-snapshot-of-the-massive-ddos-attacks-coming-from-iot-cameras/">large</a> <a href="/how-the-consumer-product-safety-commission-is-inadvertently-behind-the-internets-largest-ddos-attacks/">denial of service</a> <a href="/reflections-on-reflections/">attacks</a>. To do so efficiently, we’ve embraced <a href="http://docs.cilium.io/en/latest/bpf/">eXpress Data Path (XDP)</a>, a Linux kernel technology that provides a high performance mechanism for low level packet processing. We’re using it to <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">drop DoS attack packets with L4Drop</a>, and also in our new layer 4 load balancer. But there’s a downside to XDP: because it processes packets before the normal Linux network stack sees them, packets redirected or dropped are invisible to regular debugging tools such as <a href="https://www.tcpdump.org/">tcpdump</a>.</p><p>To address this, we built a tcpdump replacement for XDP, xdpcap. We are open sourcing this tool: the <a href="https://github.com/cloudflare/xdpcap">code and documentation are available on GitHub</a>.</p><p>xdpcap is built on cbpfc, our compiler from classic BPF (cBPF) to eBPF or C, which we are also open sourcing: the <a href="https://github.com/cloudflare/cbpfc">code and documentation are available on GitHub</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1X6o2OMKm5ePt10Xm3Vgpr/deb5e63073ba5b69d08be1e41ba4344b/White_tailed_eagle_raftsund_square_crop.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a> <a href="https://commons.wikimedia.org/wiki/File:White_tailed_eagle_raftsund_square_crop.jpg">image</a> by <a href="http://www.christophmueller.org">Christoph Müller</a></p><p>Tcpdump provides an easy way to dump specific packets of interest. For example, to capture all IPv4 DNS packets, one could:</p>
            <pre><code>$ tcpdump ip and udp port 53</code></pre>
            <p>xdpcap reuses the same syntax! xdpcap can write packets to a pcap file:</p>
            <pre><code>$ xdpcap /path/to/hook capture.pcap "ip and udp port 53"
XDPAborted: 0/0   XDPDrop: 0/0   XDPPass: 254/0   XDPTx: 0/0   (received/matched packets)
XDPAborted: 0/0   XDPDrop: 0/0   XDPPass: 995/1   XDPTx: 0/0   (received/matched packets)</code></pre>
            <p>Or write the pcap to stdout, and decode the packets with tcpdump:</p>
            <pre><code>$ xdpcap /path/to/hook - "ip and udp port 53" | sudo tcpdump -r -
reading from file -, link-type EN10MB (Ethernet)
16:18:37.911670 IP 1.1.1.1 &gt; 1.2.3.4.21563: 26445$ 1/0/1 A 93.184.216.34 (56)</code></pre>
            <p>The remainder of this post explains how we built xdpcap, including how <code>/path/to/hook</code> is used to attach to XDP programs.</p>
    <div>
      <h2>tcpdump</h2>
      <a href="#tcpdump">
        
      </a>
    </div>
    <p>To replicate tcpdump, we first need to understand its inner workings. <a href="/bpf-the-forgotten-bytecode/">Marek Majkowski has previously written a detailed post on the subject</a>. Tcpdump exposes a high level filter language, <a href="https://www.tcpdump.org/manpages/pcap-filter.7.html">pcap-filter</a>, to specify which packets are of interest. Reusing our earlier example, the following filter expression captures all IPv4 UDP packets to or from port 53, likely DNS traffic:</p>
            <pre><code>ip and udp port 53</code></pre>
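            <p>Written out as plain C, the checks this filter performs look roughly like the following. This is a hypothetical model of the filter's logic (including libpcap's implicit check that the packet is not a later IP fragment), not how tcpdump implements it:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Load a big-endian 16-bit field from packet data. */
static uint16_t load_be16(const uint8_t *p) {
    return (uint16_t)(p[0] << 8 | p[1]);
}

/* Model of "ip and udp port 53" over an Ethernet frame: IPv4,
 * protocol UDP, source or destination port 53. Any access that would
 * run past the packet means "no match", like the cBPF semantics. */
static bool match_ipv4_udp_port53(const uint8_t *pkt, size_t len) {
    if (len < 14 + 20 + 4) return false;              /* eth + min IP + ports */
    if (load_be16(pkt + 12) != 0x0800) return false;  /* EtherType: IPv4 */
    const uint8_t *ip = pkt + 14;
    if (ip[9] != 17) return false;                    /* IP protocol: UDP */
    if (load_be16(ip + 6) & 0x1fff) return false;     /* not a first fragment */
    size_t ihl = 4u * (ip[0] & 0x0f);                 /* IP header length */
    if (len < 14 + ihl + 4) return false;
    const uint8_t *udp = ip + ihl;
    return load_be16(udp) == 53 || load_be16(udp + 2) == 53;
}
```
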
            <p>Internally, tcpdump uses libpcap to compile the filter to classic BPF (cBPF). cBPF is a simple bytecode language to represent programs that inspect the contents of a packet. A program returns non-zero to indicate that a packet matched the filter, and zero otherwise. The virtual machine that executes cBPF programs is very simple, featuring only two registers, <code>a</code> and <code>x</code>. There is no way of checking the length of the input packet<a href="/xdpcap/#fn1"><sup>[1]</sup></a>; instead any out of bounds packet access will terminate the cBPF program, returning 0 (no match). The full set of opcodes is listed in <a href="https://www.kernel.org/doc/Documentation/networking/filter.txt">the Linux documentation</a>. Returning to our example filter, <code>ip and udp port 53</code> compiles to the following cBPF program, expressed as an annotated flowchart:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7vcV3QJKDR7VK5bbrI8UKh/738d2cb9fe464ec27cc16dd9f4e08926/Screen-Shot-2019-08-27-at-10.43.41-AM.png" />
            
            </figure><p>Example cBPF filter flowchart</p><p>Tcpdump attaches the generated cBPF filter to a raw packet socket using a <code>setsockopt</code> system call with <code>SO_ATTACH_FILTER</code>. The kernel runs the filter on every packet destined for the socket, but only delivers matching packets. Tcpdump displays the delivered packets, or writes them to a pcap capture file for later analysis.</p>
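            <p>As an illustration of the attach mechanism (a minimal sketch on Linux, not tcpdump's actual code), a trivial accept-everything cBPF program can be attached to any socket with the same <code>setsockopt</code> call; tcpdump does this with its compiled filter on a raw packet socket:</p>

```c
#include <assert.h>
#include <linux/filter.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_ATTACH_FILTER
#define SO_ATTACH_FILTER 26
#endif

/* Attach a one-instruction cBPF program to a socket: BPF_RET with a
 * non-zero constant, i.e. "match, deliver up to 0xffffffff bytes".
 * Returns 0 on success. */
static int attach_accept_all(int fd) {
    struct sock_filter accept_all[] = {
        BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
    };
    struct sock_fprog prog = {
        .len = 1,
        .filter = accept_all,
    };
    return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));
}
```

            <p>Unlike a raw packet socket, an ordinary UDP socket does not require privileges, so the attach itself can be demonstrated unprivileged.</p>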
    <div>
      <h2>xdpcap</h2>
      <a href="#xdpcap">
        
      </a>
    </div>
    <p>In the context of XDP, our tcpdump replacement should:</p><ul><li><p>Accept filters in the same filter language as tcpdump</p></li><li><p>Dynamically instrument XDP programs of interest</p></li><li><p>Expose matching packets to userspace</p></li></ul>
    <div>
      <h3>XDP</h3>
      <a href="#xdp">
        
      </a>
    </div>
    <p>XDP uses an extended version of the cBPF instruction set, eBPF, to allow arbitrary programs to run for each packet received by a network card, potentially modifying the packets. A stringent kernel verifier statically analyzes eBPF programs, ensuring that memory bounds are checked for every packet load.</p><p>XDP programs can return one of several actions:</p><ul><li><p><code>XDP_ABORTED</code>: Drop the packet, signaling that a program error occurred</p></li><li><p><code>XDP_DROP</code>: Drop the packet</p></li><li><p><code>XDP_TX</code>: Transmit the packet back out the network interface</p></li><li><p><code>XDP_PASS</code>: Pass the packet up the network stack</p></li></ul><p>eBPF introduces several new features, notably helper function calls, enabling programs to call functions exposed by the kernel. This includes retrieving or setting values in maps, key-value data structures that can also be accessed from userspace.</p>
    <div>
      <h3>Filter</h3>
      <a href="#filter">
        
      </a>
    </div>
    <p>A key feature of tcpdump is the ability to efficiently pick out packets of interest; packets are filtered before reaching userspace. To achieve this in XDP, the desired filter must be converted to eBPF.</p><p>cBPF is already used in our <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/#bpf-support">XDP based DoS mitigation pipeline</a>: cBPF filters are first converted to C by cbpfc, and the result compiled with Clang to eBPF. Reusing this mechanism allows us to fully support libpcap filter expressions:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cWmovp2qWyqPw96muVV6Y/4bfb47288ad02579c840042950979158/2-1.png" />
            
            </figure><p>Pipeline to convert pcap-filter expressions to eBPF via C using cbpfc</p><p>To remove the Clang runtime dependency, our cBPF compiler, cbpfc, was extended to directly generate eBPF:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6v22s8MPrGXaJIZwTdq6zk/0bc7df11722b408738234116542a7b3e/3-1.png" />
            
            </figure><p>Pipeline to convert pcap-filter expressions directly to eBPF using cbpfc</p><p>Converted to eBPF using cbpfc, <code>ip and udp port 53</code> yields:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3bMYRG5ajPayvdVoTelezY/6f3d6ed62167bd62ff0a4af51cca81ce/Screen-Shot-2019-08-27-at-10.44.20-AM.png" />
            
            </figure><p>Example cBPF filter converted to eBPF with cbpfc flowchart</p><p>The emitted eBPF requires a prologue, which is responsible for loading pointers to the beginning and end of the input packet into registers <code>r6</code> and <code>r7</code> respectively<a href="/xdpcap/#fn2"><sup>[2]</sup></a>.</p><p>The generated code follows a very similar structure to the original cBPF filter, but with:</p><ul><li><p>Bswap instructions to convert big endian packet data to little endian.</p></li><li><p>Guards to check the length of the packet before we load data from it. These are required by the kernel verifier.</p></li></ul><p>The epilogue can use the result of the filter to perform different actions on the input packet.</p><p>As mentioned earlier, we’re open sourcing cbpfc; <a href="https://github.com/cloudflare/cbpfc">the code and documentation are available on GitHub</a>. It can be used to compile cBPF to C, or directly to eBPF, and the generated code is accepted by the kernel verifier.</p>
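            <p>The guard-then-load pattern can be sketched in plain userspace C. This is a hypothetical helper modeled on the flowchart above, not cbpfc's actual output:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the load pattern in the generated eBPF: `start` and `end`
 * stand in for r6/r7. Every 16-bit load is preceded by a
 * verifier-mandated bounds guard and followed by a big-endian-to-host
 * byte swap. The `ok` out parameter models cBPF's
 * "out of bounds => terminate with no match" behavior. */
static uint16_t guarded_load_be16(const uint8_t *start, const uint8_t *end,
                                  size_t off, bool *ok) {
    if (start + off + 2 > end) {  /* guard: would read past the packet */
        *ok = false;
        return 0;
    }
    *ok = true;
    /* bswap: packet data is big endian */
    return (uint16_t)(start[off] << 8 | start[off + 1]);
}
```
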
    <div>
      <h3>Instrument</h3>
      <a href="#instrument">
        
      </a>
    </div>
    <p>Tcpdump can start and stop capturing packets at any time, without requiring coordination from applications. This rules out modifying existing XDP programs to directly run the generated eBPF filter; the programs would have to be modified each time xdpcap is run. Instead, programs should expose a hook that can be used by xdpcap to attach filters at runtime.</p><p>xdpcap’s hook support is built around eBPF tail-calls. XDP programs can yield control to other programs using the tail-call helper. Control is never handed back to the calling program; the return code of the subsequent program is used instead. For example, consider two XDP programs, foo and bar, with foo attached to the network interface. Foo can tail-call into bar:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4MaJdTMGs41UECBvq7FqhA/e1919c95963cf44419fefd2a14044c87/5.png" />
            
            </figure><p>Flow of XDP program foo tail-calling into program bar</p><p>The program to tail-call into is configured at runtime, using a special eBPF program array map. eBPF programs tail-call into a specific index of the map, the value of which is set by userspace. From our example above, foo’s tail-call map holds a single entry:</p><table><tr><td><p><b>index</b></p></td><td><p><b>program</b></p></td></tr><tr><td><p>0</p></td><td><p>bar</p></td></tr></table><p>A tail-call into an empty index does nothing, so XDP programs always need to return an action of their own in case a tail-call fails. Once again, this is enforced by the kernel verifier. In the case of program foo:</p>
            <pre><code>int foo(struct xdp_md *ctx) {
    // tail-call into index 0 - program bar
    tail_call(ctx, &amp;map, 0);

    // tail-call failed, pass the packet
    return XDP_PASS;
}</code></pre>
            <p>To leverage this as a hook point, the instrumented programs are modified to always tail-call, using a map that is exposed to xdpcap by <a href="https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#bpffs">pinning it to a bpffs</a>. To attach a filter, xdpcap sets it in the map. If no filter is attached, the instrumented program returns the correct action itself.</p><p>With a filter attached to program foo, we have:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S8T2hX3KZvC28JFvsQFB7/ffdc88d58cc5cfa5e1027cf1045274b6/6.png" />
            
            </figure><p>Flow of XDP program foo tail-calling into an xdpcap filter</p><p>The filter must return the original action taken by the instrumented program to ensure the packet is processed correctly. To achieve this, xdpcap generates one filter program per possible XDP action, each one hard-coded to return that specific action. All the programs are set in the map:</p><table><tr><td><p><b>index</b></p></td><td><p><b>program</b></p></td></tr><tr><td><p>0 (<code>XDP_ABORTED</code>)</p></td><td><p>filter <code>XDP_ABORTED</code></p></td></tr><tr><td><p>1 (<code>XDP_DROP</code>)</p></td><td><p>filter <code>XDP_DROP</code></p></td></tr><tr><td><p>2 (<code>XDP_PASS</code>)</p></td><td><p>filter <code>XDP_PASS</code></p></td></tr><tr><td><p>3 (<code>XDP_TX</code>)</p></td><td><p>filter <code>XDP_TX</code></p></td></tr></table><p>By tail-calling into the correct index, the instrumented program determines the final action:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3LFc3RSQQ10cP2N4GrAvwY/cd0f7bba4755f21dd1483b1307ffcf0b/7.png" />
            
            </figure><p>Flow of XDP program foo tail-calling into xdpcap filters, one for each action</p><p>xdpcap provides a helper function that attempts a tail-call for the given action. Should it fail, the action is returned instead:</p>
            <pre><code>enum xdp_action xdpcap_exit(struct xdp_md *ctx, enum xdp_action action) {
    // tail-call into the filter using the action as an index
    tail_call((void *)ctx, &amp;xdpcap_hook, action);

    // tail-call failed, return the action
    return action;
}</code></pre>
            <p>This allows an XDP program to simply:</p>
            <pre><code>int foo(struct xdp_md *ctx) {
    return xdpcap_exit(ctx, XDP_PASS);
}</code></pre>
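            <p>The per-action filter table relies on a simple invariant: the program at index <code>i</code> is hard-coded to return action <code>i</code>, so the tail-call preserves the instrumented program's verdict. A plain C model of that invariant (hypothetical names; the values mirror the kernel's <code>enum xdp_action</code>):</p>

```c
#include <assert.h>

/* Model of the XDP action codes: ABORTED=0, DROP=1, PASS=2, TX=3. */
enum action_model { ABORTED, DROP, PASS, TX, N_ACTIONS };

typedef int (*filter_fn)(void);

/* One "filter" per action, each hard-coded to return that action.
 * Real xdpcap filters also run the packet match and emit a perf
 * event; that logic is elided here. */
static int ret_aborted(void) { return ABORTED; }
static int ret_drop(void)    { return DROP; }
static int ret_pass(void)    { return PASS; }
static int ret_tx(void)      { return TX; }

/* The "program array map": index i holds the filter returning i, so
 * tail-calling into index `action` yields `action` again. */
static const filter_fn filters[N_ACTIONS] = {
    [ABORTED] = ret_aborted, [DROP] = ret_drop,
    [PASS]    = ret_pass,    [TX]   = ret_tx,
};
```
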
            
    <div>
      <h3>Expose</h3>
      <a href="#expose">
        
      </a>
    </div>
    <p>Matching packets, as well as the original action taken for them, need to be exposed to userspace. Once again, such a mechanism is already part of our <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/#packet-sampling">XDP based DoS mitigation pipeline</a>!</p><p>Another eBPF helper, <code>perf_event_output</code>, allows an XDP program to generate a perf event containing, among other metadata, the packet. As xdpcap generates one filter per XDP action, the filter program can include the action taken in the metadata. A userspace program can create a perf event ring buffer to receive the events, obtaining both the action and the packet.</p><ol><li><p>This is true of the original cBPF, but Linux implements a number of extensions, one of which allows the length of the input packet to be retrieved. <a href="/xdpcap/#fnref1">↩︎</a></p></li><li><p>This example uses registers <code>r6</code> and <code>r7</code>, but cbpfc can be configured to use any registers. <a href="/xdpcap/#fnref2">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2uU84JuP2Ez1MTF03l8xwY</guid>
            <dc:creator>Arthur Fabre</dc:creator>
        </item>
        <item>
            <title><![CDATA[L4Drop: XDP DDoS Mitigations]]></title>
            <link>https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/</link>
            <pubDate>Wed, 28 Nov 2018 19:59:25 GMT</pubDate>
            <description><![CDATA[ Efficient packet dropping is a key part of Cloudflare’s distributed denial of service (DDoS) attack mitigations. In this post, we introduce a new tool in our packet dropping arsenal: L4Drop. ]]></description>
            <content:encoded><![CDATA[ <p>Efficient packet dropping is a key part of Cloudflare’s distributed denial of service (DDoS) attack mitigations. In this post, we introduce a new tool in our packet dropping arsenal: L4Drop.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Md3ZoOlBITehVhbWnXZy0/01ff382d6c8a537f5f22a4617840a878/cover.jpg" />
            
            </figure><p><a href="https://www.usa.gov/government-works">Public domain</a> <a href="https://www.flickr.com/photos/usairforce/4293474325/in/photostream/">image</a> by US Air Force</p><p>We've written about our DDoS mitigation pipeline extensively in the past, covering:</p><ul><li><p><a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">Gatebot</a>: analyzes traffic hitting our edge and deploys DDoS mitigations matching suspect traffic.</p></li><li><p><a href="/introducing-the-bpf-tools/">bpftools</a>: generates Berkeley Packet Filter (BPF) bytecode that matches packets based on DNS queries, <a href="/introducing-the-p0f-bpf-compiler/">p0f signatures</a>, or tcpdump filters.</p></li><li><p>Iptables: matches traffic against the BPF generated by bpftools using the <code>xt_bpf</code> module, and drops it.</p></li><li><p><a href="/kernel-bypass/">Floodgate</a>: offloads work from iptables during big attacks that could otherwise overwhelm the kernel networking stack. Incoming traffic bypasses the kernel to go directly to a BPF interpreter in userspace, which efficiently drops packets matching the BPF rules produced by bpftools.</p></li></ul><p>Both iptables and Floodgate send samples of received traffic to Gatebot for analysis, and filter incoming packets using rules generated by bpftools. This ends up looking something like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/EhPnzNXl1KKxTnSIBx0ON/2e502b854d229bd811ece87f885c5656/floodgate.svg" />
            
            </figure><p>Floodgate based DDoS mitigation pipeline</p><p>This pipeline has served us well, but a lot has changed since we implemented Floodgate. Our new <a href="/a-tour-inside-cloudflares-g9-servers/">Gen9</a> and <a href="/arm-takes-wing/">ARM</a> servers use different network interface cards (NIC) than our earlier servers. These new NICs aren’t compatible with Floodgate as it relies on a proprietary Solarflare technology to redirect traffic directly to userspace. Floodgate’s time was finally up.</p>
    <div>
      <h3>XDP to the rescue</h3>
      <a href="#xdp-to-the-rescue">
        
      </a>
    </div>
    <p>A new alternative to the kernel bypass approach has been added to Linux: <a href="http://docs.cilium.io/en/latest/bpf/">eXpress Data Path (XDP)</a>. XDP uses an extended version of the classic BPF instruction set, eBPF, to allow arbitrary code to run for each packet received by a network card driver. As <a href="/how-to-drop-10-million-packets/">Marek demonstrated</a>, this enables high speed packet dropping! eBPF introduces a slew of new features, including:</p><ul><li><p>Maps, key-value data structures shared between the eBPF programs and userspace.</p></li><li><p>A Clang eBPF backend, allowing a sizeable subset of C to be compiled to eBPF.</p></li><li><p>A stringent kernel verifier that statically analyzes eBPF programs, ensuring run time performance and safety.</p></li></ul><p>Compared to our partial kernel bypass, XDP does not require busy polling for packets. This enables us to leave an XDP based solution “always on” instead of enabling it only when attack traffic exceeds a set threshold. XDP programs can also run on multiple CPUs, potentially allowing a higher number of packets to be processed than Floodgate, which was pinned to a single CPU to limit the impact of busy polling.</p><p>Updating our pipeline diagram with XDP yields:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4iszhBhHuL6NX7V4PhUznK/304a8e726b5c3fd63b31c06a65db5b29/xdp.svg" />
            
            </figure><p>XDP based DDoS mitigation pipeline</p>
    <div>
      <h3>Introducing L4Drop</h3>
      <a href="#introducing-l4drop">
        
      </a>
    </div>
    <p>All that remains is to convert our existing rules to eBPF! At first glance, it seems we should be able to store our rules in an eBPF map and have a single program that checks incoming packets against them. <a href="http://vger.kernel.org/lpc-networking.html#session-15__">Facebook’s firewall</a> implements this strategy. This allows rules to be easily inserted or removed from userspace.</p><p>However, the filters created by bpftools rely heavily on matching arbitrary packet data, and performing arbitrary comparisons across headers. For example, a single p0f signature can check both IP &amp; TCP options. On top of this, the thorough static analysis performed by the kernel’s eBPF verifier currently disallows loops. This restriction helps ensure that a given eBPF program will always terminate in a set number of instructions. Coupled together, the arbitrary matching and lack of loops prevent us from storing our rules in maps.</p><p>Instead, we wrote a tool to compile the rules generated by Gatebot and bpftools to eBPF. This allows the generated eBPF to match against any packet data it needs, at the cost of:</p><ul><li><p>Having to recompile the program to add or remove rules</p></li><li><p>Possibly hitting eBPF code complexity limits enforced by the kernel with many rules</p></li></ul><p>A C program is generated from the rules built by Gatebot, and compiled to eBPF using Clang. All that’s left is to reimplement the iptables features we use.</p>
    <div>
      <h4>BPF support</h4>
      <a href="#bpf-support">
        
      </a>
    </div>
    <p>We have many different tools for generating BPF filters, and we need to be able to include these filters in the eBPF generated by L4Drop. While the name eBPF might suggest a minor extension to BPF, the instruction sets are not compatible. In fact, BPF instructions don't even have a one-to-one mapping to eBPF! This can be seen in the <a href="https://elixir.bootlin.com/linux/v4.19.3/source/net/core/filter.c#L752">kernel's internal BPF to eBPF converter</a>, where a single BPF IP header length instruction maps to 6 eBPF instructions.</p><p>To simplify the conversion, we implemented a BPF to C compiler. This allows us to include any BPF program in the aforementioned C program generated by L4Drop. For example, if we generate a BPF program matching a DNS query to any subdomain of example.com using bpftools, we get:</p>
            <pre><code>$ ./bpfgen dns -- "*.example.com"
18,177 0 0 0,0 0 0 20,12 0 0 0,...</code></pre>
            <p>Converted to C, we end up with:</p>
            <pre><code>bool cbpf_0_0(uint8_t *data, uint8_t *data_end) {
    __attribute__((unused))
    uint32_t a, x, m[16];

    if (data + 1 &gt; data_end) return false;
    x = 4*(*(data + 0) &amp; 0xf);

    ...
}</code></pre>
            <p>The BPF instructions each expand to a single C statement, and the BPF registers (<code>a</code>, <code>x</code> and <code>m</code>) are emulated as variables. This has the added benefit of allowing Clang to optimize the full program. The generated C includes only the guards needed to prevent out of bounds packet accesses, as required by the kernel.</p>
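            <p>For example, <code>x = 4*(*(data + 0) &amp; 0xf)</code> in the generated code above is the classic BPF idiom for loading the IPv4 header length into <code>x</code>. As a standalone, runnable sketch of this emulated-register style (hypothetical function name, not actual compiler output):</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the "ldxb 4*([0]&0xf)" cBPF instruction as emitted by a
 * BPF-to-C compiler: the x register becomes an out parameter, and the
 * guard models "out of bounds access terminates with no match". */
static bool ip_header_len(const uint8_t *data, const uint8_t *data_end,
                          uint32_t *x) {
    if (data + 1 > data_end) return false;  /* bounds guard */
    *x = 4u * (*(data + 0) & 0xf);          /* IHL is in 32-bit words */
    return true;
}
```
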
    <div>
      <h4>Packet sampling</h4>
      <a href="#packet-sampling">
        
      </a>
    </div>
    <p>Gatebot requires all traffic received by a server to be sampled at a given rate, and sent off for analysis. This includes dropped packets. Consequently, we have to sample before we drop anything. Thankfully, eBPF can call into the kernel using a restricted set of helper functions, and one of these, <code>bpf_xdp_event_output</code>, allows us to copy packets to a perf event ring buffer. A userspace daemon then reads from the perf buffer, obtaining the packets. Coupled with another helper, <code>bpf_get_prandom_u32()</code>, to generate random numbers, the C code to sample packets ends up something like:</p>
            <pre><code>// Threshold may be &gt; UINT32_MAX
uint64_t rnd = (uint64_t)get_prandom_u32();

if (rnd &lt; threshold) {
    // perf_event_output passes the number of bytes to sample in the
    // high 32 bits of the flags parameter.
    uint64_t flags = len &lt;&lt; 32;

    // Use the current CPU number as index to sampled_packets.
    flags |= BPF_F_CURRENT_CPU;


    // Write the packet in ctx to the perf buffer
    if (xdp_event_output(ctx, &amp;sampled_packets, flags, &amp;len, sizeof(len))) {
        return XDP_ABORTED;
    }
}</code></pre>
            <p>The <a href="https://github.com/newtools/ebpf">newtools/ebpf</a> library we use to load eBPF programs into the kernel supports creating the required eBPF map (<code>sampled_packets</code> in this example), and reading from the perf buffer in userspace.</p>
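            <p>The two fixed-point tricks in the snippet above can be checked in plain C (hypothetical helper names): the sampling threshold is the desired rate scaled to the 32-bit range, and the flags word packs the sample length into its high 32 bits:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Model of the kernel's BPF_F_CURRENT_CPU flag (all low 32 bits set). */
#define CURRENT_CPU_MODEL 0xffffffffULL

/* rnd is uniform in [0, 2^32); sampling when rnd < threshold gives a
 * sample probability of threshold / 2^32. For a 1-in-1 rate the
 * threshold is 2^32, which exceeds UINT32_MAX, hence the 64-bit
 * comparison in the eBPF code above. */
static uint64_t threshold_for_rate(uint32_t one_in_n) {
    return (1ULL << 32) / one_in_n;
}

/* The number of bytes to sample lives in the high 32 bits of the
 * flags parameter; the low bits select the CPU/map index. */
static uint64_t sample_flags(uint64_t len) {
    return (len << 32) | CURRENT_CPU_MODEL;
}
```
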
    <div>
      <h4>Geo rules</h4>
      <a href="#geo-rules">
        
      </a>
    </div>
    <p>With our <a href="https://www.cloudflare.com/network/">large anycast network</a>, a truly distributed denial of service attack will impact many of our data centers. But not all attacks have this property. We sometimes require location specific rules.</p><p>To avoid having to build separate eBPF programs for every location, we want the ability to enable or disable rules before loading a program, but after compiling it.</p><p>One approach would be to store whether a rule is enabled or not in an eBPF map. Unfortunately, such map lookups can increase the code size. Due to the kernel’s strict code complexity limits for XDP code, this reduces the number of rules that fit in a single program. Instead, we modify the generated eBPF ELF before loading it into the kernel.</p><p>If, in the original C program, every rule is guarded by a conditional like so:</p>
            <pre><code>int xdp_test(struct xdp_md *ctx) {
    unsigned long long enabled;

    asm("%0 = 0 ll" : "=r"(enabled));

    if (enabled) {
        // Check the packet against the rule
        return XDP_DROP;
    } else {
        return XDP_PASS;
    }
}</code></pre>
            <p><code>asm("%0 = 0 ll" : "=r"(enabled))</code> will emit a single 64-bit eBPF load instruction, loading <code>0</code> into the register holding the variable <code>enabled</code>:</p>
            <pre><code>$ llvm-objdump-6.0 -S test.o

test.o: file format ELF64-BPF

Disassembly of section prog:
xdp_test:
       0:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00         r1 = 0 ll
       2:       b7 00 00 00 02 00 00 00         r0 = 2
       3:       15 01 01 00 00 00 00 00         if r1 == 0 goto +1 &lt;LBB0_2&gt;
       4:       b7 00 00 00 01 00 00 00         r0 = 1

LBB0_2:
       5:       95 00 00 00 00 00 00 00         exit</code></pre>
            <p>Modifying the ELF to change the load instruction to load <code>0</code> or <code>1</code> will change the value of <code>enabled</code>, enabling or disabling the rule. The kernel even <a href="https://elixir.bootlin.com/linux/v4.19.3/source/kernel/bpf/verifier.c#L3868">trims conditionals against constants like these</a>.</p><p>Modifying the instructions requires the ability to differentiate these special loads from ones normally emitted by Clang. Changing the asm to load a symbol (<code>asm("%0 = RULE_0_ENABLED ll" : "=r"(enabled))</code>) ensures it shows up in the ELF relocation info with that symbol name:</p>
            <pre><code>$ llvm-readelf-6.0 -r test.o

Relocation section '.relprog' at offset 0xf0 contains 1 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name
0000000000000000  0000000200000001 R_BPF_64_64            0000000000000000 RULE_0_ENABLED</code></pre>
            <p>This enables the <a href="https://github.com/newtools/ebpf">newtools/ebpf</a> loader to parse the ELF relocation info, and always find the correct load instruction that guards enabling or disabling a rule.</p>
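            <p>The patch itself is small. As the <code>r1 = 0 ll</code> disassembly above shows, a 64-bit load (opcode <code>0x18</code>) occupies two 8-byte instruction slots, with the 64-bit immediate split across the <code>imm</code> fields (bytes 4 to 7) of each slot. A sketch of the in-place patch, assuming a little-endian host and using a hypothetical helper (not the actual newtools/ebpf code):</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Patch the 64-bit immediate of a ld_imm64 instruction located at
 * byte offset `off` in a raw eBPF instruction stream. The low 32 bits
 * of the value go in bytes 4-7 of the first 8-byte slot, the high 32
 * bits in bytes 4-7 of the second. Returns 0 on success, -1 if the
 * instruction at `off` is not a ld_imm64. */
static int patch_ld_imm64(uint8_t *insns, size_t off, uint64_t value) {
    if (insns[off] != 0x18) return -1;  /* not a 64-bit load */
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);
    memcpy(insns + off + 4, &lo, 4);    /* little-endian host assumed */
    memcpy(insns + off + 12, &hi, 4);
    return 0;
}
```
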
    <div>
      <h3><b>Production</b></h3>
      <a href="#production">
        
      </a>
    </div>
    <p>L4Drop is running in production across all of our servers, and protecting us against DDoS attacks. For example, this server dropped over 8 million packets per second:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/596ZBycRpdqEzHLKxWI09H/f443840e5a74888b17a329eccc883ba9/combined-rx-narrow.png" />
            
            </figure><p>Received, dropped and lost packets per second vs. CPU usage on a server during a DDoS attack</p><p>The graph shows a sudden increase in received packets (red). Initially, the attack overwhelmed the kernel network stack, causing some packets to be lost on the network card (yellow). The overall CPU usage (magenta) rose sharply.</p><p>Gatebot detected the attack shortly thereafter, deploying a rule matching the attack traffic. L4Drop began dropping the relevant packets (green), reaching over 8 million dropped packets per second.</p><p>The amount of traffic dropped (green) closely followed the received traffic (red), and the amount of traffic passed through remained unchanged before, during, and after the attack. This highlights the effectiveness of the deployed rule. At one point the attack traffic changed slightly, leading to a gap between the dropped and received traffic until Gatebot could respond with a new rule.</p><p>During the brunt of the attack, the overall CPU usage (magenta) only rose by about 10%, demonstrating the efficiency of XDP.</p><p>The softirq CPU usage (blue) shows the CPU usage under which XDP / L4Drop runs, but also includes other network related processing. It increased by slightly over a factor of 2, while the number of incoming packets per second increased by over a factor of 40!</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>While we’re happy with the performance of L4Drop so far, our pipeline is in a constant state of improvement. We’re working on supporting a greater number of simultaneous rules in L4Drop through multiple, chained, eBPF programs. Another point of focus is increasing the efficiency of our generated programs, and supporting new eBPF features. Reducing our attack detection delay would also allow us to deploy rules more quickly, leading to fewer lost packets at the onset of an attack.</p>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">40D8ULOdPAsZwViKbE1SlP</guid>
            <dc:creator>Arthur Fabre</dc:creator>
        </item>
    </channel>
</rss>