Our servers process a lot of network packets, be it legitimate traffic or large denial of service attacks. To do so efficiently, we’ve embraced eXpress Data Path (XDP), a Linux kernel technology that provides a high performance mechanism for low level packet processing. We’re using it to drop DoS attack packets with L4Drop, and also in our new layer 4 load balancer. But there’s a downside to XDP: because it processes packets before the normal Linux network stack sees them, packets redirected or dropped are invisible to regular debugging tools such as tcpdump.
To address this, we built a tcpdump replacement for XDP, xdpcap. We are open sourcing this tool: the code and documentation are available on GitHub.
xdpcap uses our classic BPF (cBPF) to eBPF or C compiler, cbpfc, which we are also open sourcing: the code and documentation are available on GitHub.
CC BY 4.0 image by Christoph Müller
Tcpdump provides an easy way to dump specific packets of interest. For example, to capture all IPv4 DNS packets, one could:
$ tcpdump ip and udp port 53
xdpcap reuses the same syntax! xdpcap can write packets to a pcap file:
$ xdpcap /path/to/hook capture.pcap "ip and udp port 53"
XDPAborted: 0/0 XDPDrop: 0/0 XDPPass: 254/0 XDPTx: 0/0 (received/matched packets)
XDPAborted: 0/0 XDPDrop: 0/0 XDPPass: 995/1 XDPTx: 0/0 (received/matched packets)
Or write the pcap to stdout, and decode the packets with tcpdump:
$ xdpcap /path/to/hook - "ip and udp port 53" | sudo tcpdump -r -
reading from file -, link-type EN10MB (Ethernet)
16:18:37.911670 IP 1.1.1.1 > 1.2.3.4.21563: 26445$ 1/0/1 A 93.184.216.34 (56)
The remainder of this post explains how we built xdpcap, including how /path/to/hook is used to attach to XDP programs.
tcpdump
To replicate tcpdump, we first need to understand its inner workings. Marek Majkowski has previously written a detailed post on the subject. Tcpdump exposes a high level filter language, pcap-filter, to specify which packets are of interest. Reusing our earlier example, the following filter expression captures all IPv4 UDP packets to or from port 53, likely DNS traffic:
ip and udp port 53
Internally, tcpdump uses libpcap to compile the filter to classic BPF (cBPF). cBPF is a simple bytecode language to represent programs that inspect the contents of a packet. A program returns non-zero to indicate that a packet matched the filter, and zero otherwise. The virtual machine that executes cBPF programs is very simple, featuring only two registers, a and x. There is no way of checking the length of the input packet[1]; instead, any out of bounds packet access will terminate the cBPF program, returning 0 (no match). The full set of opcodes is listed in the Linux documentation. Returning to our example filter, ip and udp port 53 compiles to the following cBPF program, expressed as an annotated flowchart:
Example cBPF filter flowchart
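To make the execution model concrete, here is a toy interpreter for a hypothetical three-opcode subset of cBPF. The opcode names, encoding, and `run_cbpf` function are invented for illustration; the real instruction set is richer and only allows forward jumps, guaranteeing termination. What it does show is the accumulator-style VM and the out-of-bounds rule:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy subset of the cBPF virtual machine (hypothetical encoding):
   OP_LDB loads a packet byte into the accumulator, OP_JEQ branches on
   equality with an immediate, OP_RET returns an immediate. */
enum { OP_LDB, OP_JEQ, OP_RET };

struct insn {
    int op;
    uint32_t k; /* immediate: packet offset, comparison value, or return value */
    int jt, jf; /* relative jump offsets for OP_JEQ */
};

uint32_t run_cbpf(const struct insn *prog, const uint8_t *pkt, size_t len) {
    uint32_t a = 0; /* accumulator; real cBPF also has an index register x */
    for (size_t pc = 0;; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        case OP_LDB:
            if (i->k >= len)
                return 0; /* out of bounds load terminates: no match */
            a = pkt[i->k];
            break;
        case OP_JEQ:
            pc += (a == i->k) ? i->jt : i->jf;
            break;
        case OP_RET:
            return i->k; /* non-zero means the packet matched */
        default:
            return 0;
        }
    }
}
```

A four-instruction program checking that the first byte is 0x45 (an IPv4 header with no options) matches a packet starting with that byte, and returns 0 for any shorter packet, with no explicit length check in the program itself.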
Tcpdump attaches the generated cBPF filter to a raw packet socket using a setsockopt system call with SO_ATTACH_FILTER. The kernel runs the filter on every packet destined for the socket, but only delivers matching packets. Tcpdump displays the delivered packets, or writes them to a pcap capture file for later analysis.
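The attach step can be sketched in a few lines. This is a minimal example, not tcpdump's code: it attaches a trivial "accept everything" one-instruction cBPF program to a socket, whereas tcpdump attaches the program compiled from the filter expression to an AF_PACKET socket (which requires CAP_NET_RAW; a plain UDP socket works unprivileged):

```c
#include <linux/filter.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_FILTER
#define SO_ATTACH_FILTER 26 /* from <asm/socket.h>, in case libc headers lack it */
#endif

/* Attach a single-instruction cBPF program that accepts every packet,
   using the same SO_ATTACH_FILTER mechanism tcpdump relies on. */
int attach_accept_all(int fd) {
    struct sock_filter insns[] = {
        /* RET 0xFFFF: match, delivering up to 0xFFFF bytes of the packet */
        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };
    return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));
}
```

The kernel validates the program at attach time, rejecting malformed bytecode before it ever runs on a packet.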
xdpcap
In the context of XDP, our tcpdump replacement should:
Accept filters in the same filter language as tcpdump
Dynamically instrument XDP programs of interest
Expose matching packets to userspace
XDP
XDP uses an extended version of the cBPF instruction set, eBPF, to allow arbitrary programs to run for each packet received by a network card, potentially modifying the packets. A stringent kernel verifier statically analyzes eBPF programs, ensuring that memory bounds are checked for every packet load.
eBPF programs can return:
XDP_DROP: Drop the packet
XDP_TX: Transmit the packet back out the network interface
XDP_PASS: Pass the packet up the network stack
eBPF introduces several new features, notably helper function calls, enabling programs to call functions exposed by the kernel. This includes retrieving or setting values in maps, key-value data structures that can also be accessed from userspace.
Filter
A key feature of tcpdump is the ability to efficiently pick out packets of interest; packets are filtered before reaching userspace. To achieve this in XDP, the desired filter must be converted to eBPF.
cBPF is already used in our XDP based DoS mitigation pipeline: cBPF filters are first converted to C by cbpfc, and the result compiled with Clang to eBPF. Reusing this mechanism allows us to fully support libpcap filter expressions:
Pipeline to convert pcap-filter expressions to eBPF via C using cbpfc
To remove the Clang runtime dependency, our cBPF compiler, cbpfc, was extended to directly generate eBPF:
Pipeline to convert pcap-filter expressions directly to eBPF using cbpfc
Converted to eBPF using cbpfc, ip and udp port 53 yields:
Example cBPF filter converted to eBPF with cbpfc flowchart
The emitted eBPF requires a prologue, which is responsible for loading a pointer to the beginning, and end, of the input packet into registers r6 and r7 respectively[2].
The generated code follows a very similar structure to the original cBPF filter, but with:
Bswap instructions to convert big endian packet data to little endian.
Guards to check the length of the packet before we load data from it. These are required by the kernel verifier.
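As a rough illustration of those two differences, here is a hand-written C equivalent of the ip and udp port 53 filter (not actual cbpfc output) with the length guards made explicit. Each multi-byte field is assembled from individual bytes, making the big-endian-to-host conversion visible:

```c
#include <stddef.h>
#include <stdint.h>

/* Hand-written approximation of a compiled "ip and udp port 53" filter.
   Every packet load is preceded by a length guard, mirroring the bounds
   checks the eBPF verifier demands. Returns non-zero on match. */
int match_dns(const uint8_t *pkt, size_t len) {
    if (len < 34)
        return 0;                            /* Ethernet + minimal IPv4 header */
    if (pkt[12] != 0x08 || pkt[13] != 0x00)
        return 0;                            /* EtherType is not IPv4 */
    if (pkt[23] != 17)
        return 0;                            /* IP protocol is not UDP */
    if ((((pkt[20] & 0x1f) << 8) | pkt[21]) != 0)
        return 0;                            /* fragment: ports not present */
    size_t ihl = (pkt[14] & 0x0f) * 4;       /* IP header length in bytes */
    if (len < 14 + ihl + 4)
        return 0;                            /* guard before loading the ports */
    uint16_t sport = pkt[14 + ihl] << 8 | pkt[14 + ihl + 1];
    uint16_t dport = pkt[14 + ihl + 2] << 8 | pkt[14 + ihl + 3];
    return sport == 53 || dport == 53;
}
```

The structure deliberately mirrors the cBPF flowchart from earlier: EtherType, protocol, fragmentation, then ports, each behind its own guard.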
The epilogue can use the result of the filter to perform different actions on the input packet.
As mentioned earlier, we’re open sourcing cbpfc; the code and documentation are available on GitHub. It can be used to compile cBPF to C, or directly to eBPF, and the generated code is accepted by the kernel verifier.
Instrument
Tcpdump can start and stop capturing packets at any time, without requiring coordination from applications. This rules out modifying existing XDP programs to directly run the generated eBPF filter; the programs would have to be modified each time xdpcap is run. Instead, programs should expose a hook that can be used by xdpcap to attach filters at runtime.
xdpcap’s hook support is built around eBPF tail-calls. XDP programs can yield control to other programs using the tail-call helper. Control is never handed back to the calling program; the return code of the program tail-called into is used instead. For example, consider two XDP programs, foo and bar, with foo attached to the network interface. Foo can tail-call into bar:
Flow of XDP program foo tail-calling into program bar
The program to tail-call into is configured at runtime, using a special eBPF program array map. eBPF programs tail-call into a specific index of the map, the value of which is set by userspace. From our example above, foo’s tail-call map holds a single entry:
index | program
------+--------
0     | bar
A tail-call into an empty index does nothing; XDP programs always need to return an action themselves after a tail-call, in case it fails. Once again, this is enforced by the kernel verifier. In the case of program foo:
int foo(struct xdp_md *ctx) {
// tail-call into index 0 - program bar
tail_call(ctx, &map, 0);
// tail-call failed, pass the packet
return XDP_PASS;
}
To leverage this as a hook point, the instrumented programs are modified to always tail-call, using a map that is exposed to xdpcap by pinning it to a bpffs. To attach a filter, xdpcap can set it in the map. If no filter is attached, the instrumented program returns the correct action itself.
With a filter attached to program foo, we have:
Flow of XDP program foo tail-calling into an xdpcap filter
The filter must return the original action taken by the instrumented program to ensure the packet is processed correctly. To achieve this, xdpcap generates one filter program per possible XDP action, each one hardcoded to return that specific action. All the programs are set in the map:
index           | program
----------------+-------------------
0 (XDP_ABORTED) | filter XDP_ABORTED
1 (XDP_DROP)    | filter XDP_DROP
2 (XDP_PASS)    | filter XDP_PASS
3 (XDP_TX)      | filter XDP_TX
By tail-calling into the correct index, the instrumented program determines the final action:
Flow of XDP program foo tail-calling into xdpcap filters, one for each action
xdpcap provides a helper function that attempts a tail-call for the given action. Should it fail, the action is returned instead:
enum xdp_action xdpcap_exit(struct xdp_md *ctx, enum xdp_action action) {
// tail-call into the filter using the action as an index
tail_call((void *)ctx, &xdpcap_hook, action);
// tail-call failed, return the action
return action;
}
This allows an XDP program to simply:
int foo(struct xdp_md *ctx) {
return xdpcap_exit(ctx, XDP_PASS);
}
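The fallback behaviour can be modelled in ordinary userspace C. This is a simulation for illustration only (the names and structure are invented, and nothing here is eBPF): an array of function pointers stands in for the program array map, and an empty slot plays the role of a failed tail-call:

```c
#include <stddef.h>

enum xdp_action { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX };

typedef enum xdp_action (*filter_fn)(const void *pkt, size_t len);

/* Stand-in for the program array map: index i holds the filter hardcoded
   to return action i, or NULL when no filter is attached. */
static filter_fn hook[4];

/* Mirrors the logic of xdpcap_exit: try the filter for this action; if
   the slot is empty (the tail-call would fail), return the action as-is. */
enum xdp_action model_xdpcap_exit(const void *pkt, size_t len,
                                  enum xdp_action action) {
    if (hook[action] != NULL)
        return hook[action](pkt, len);
    return action;
}

/* A filter hardcoded to return XDP_PASS, as xdpcap generates per action. */
static enum xdp_action filter_pass(const void *pkt, size_t len) {
    (void)pkt; (void)len; /* a real filter would match the packet here */
    return XDP_PASS;
}
```

Whether or not a filter is attached, the packet ends up with the same action, which is exactly the property the per-action filter programs provide.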
Expose
Matching packets, as well as the original action taken for them, need to be exposed to userspace. Once again, such a mechanism is already part of our XDP based DoS mitigation pipeline!
Another eBPF helper, perf_event_output, allows an XDP program to generate a perf event containing, amongst some metadata, the packet. As xdpcap generates one filter per XDP action, the filter program can include the action taken in the metadata. A userspace program can create a perf event ring buffer to receive events into, obtaining both the action and the packet.