A friend gave me an interesting task: extract IP TTL values from TCP connections established by a userspace program. This seemingly simple task quickly exploded into an epic Linux system programming hack. The resulting code is grossly over-engineered, but boy, did we learn plenty in the process!
CC BY-SA 2.0 image by Paul Miller
Context
You may wonder why she wanted to inspect the TTL packet field (formally known as "Time To Live (TTL)" in IPv4, or "Hop Limit" in IPv6). The reason is simple - she wanted to ensure that the connections are routed outside of our datacenter. The "Hop Distance" - the difference between the TTL value set by the originating machine and the TTL value in the packet received at its destination - shows how many routers the packet crossed. If a packet crossed two or more routers, we know it indeed came from outside of our datacenter.
It's uncommon to look at TTL values (except for their intended purpose of mitigating routing loops by checking when the TTL reaches zero). The normal way to deal with the problem we had would be to blocklist IP ranges of our servers. But it’s not that simple in our setup. Our IP numbering configuration is rather baroque, with plenty of Anycast, Unicast and Reserved IP ranges. Some belong to us, some don't. We wanted to avoid having to maintain a hard-coded blocklist of IP ranges.
The gist of the idea is: we want to note the TTL value from a returned SYN+ACK packet. Having this number we can estimate the Hop Distance - number of routers on the path. If the Hop Distance is:
zero: we know the connection went to localhost or a local network.
one: connection went through our router, and was terminated just behind it.
two: connection went through two routers. Most likely our router and the one next to it.
For our use case, we want to see if the Hop Distance was two or more - this would ensure the connection was routed outside the datacenter.
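As a rough illustration of the Hop Distance arithmetic, here is a minimal Go sketch. It assumes the remote host started from one of a few common initial TTL values (64, 128 or 255). That list and the hopDistance name are illustrative assumptions, not necessarily what the program described in this post does.

```go
package main

import "fmt"

// commonInitialTTLs is an assumed list of initial TTL values used by
// popular operating systems. It is a heuristic, not an exhaustive list.
var commonInitialTTLs = []int{64, 128, 255}

// hopDistance guesses the sender's initial TTL as the smallest common
// value that is not below the observed TTL, and returns the difference.
// It returns -1 when the observed value is outside the 0..255 range.
func hopDistance(observedTTL int) int {
	if observedTTL < 0 {
		return -1
	}
	for _, initial := range commonInitialTTLs {
		if observedTTL <= initial {
			return initial - observedTTL
		}
	}
	return -1
}

func main() {
	for _, ttl := range []int{64, 63, 58, 125} {
		d := hopDistance(ttl)
		fmt.Printf("observed TTL %d: hop distance %d, outside datacenter: %v\n", ttl, d, d >= 2)
	}
}
```

The guess breaks down for hosts with unusual initial TTLs or very long paths, so treat the result as an estimate.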
Not so easy
It's easy to read TTL values from a userspace application, right? No. It turns out it's almost impossible. Here are the theoretical options we considered early on:
A) Run a libpcap/tcpdump-like raw socket, and catch the SYN+ACKs manually. We ruled out this design quickly - it requires elevated privileges. Also, raw sockets are pretty fragile: they can suffer packet loss if the userspace application can’t keep up.
B) Use the IP_RECVTTL socket option. IP_RECVTTL requests that a "cmsg" carrying the TTL be attached to the control/ancillary data returned by the recvmsg() syscall. This is a good choice for UDP, but this socket option is not supported by TCP SOCK_STREAM sockets.
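For comparison, this is roughly what option B looks like on a UDP socket. The sketch below is Linux-only and uses only the Go standard library; the udpLoopbackTTL name and the loopback self-send are illustrative assumptions made to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
	"unsafe"
)

// udpLoopbackTTL enables IP_RECVTTL on a loopback UDP socket, sends one
// datagram to itself and returns the TTL found in the ancillary data.
func udpLoopbackTTL() (int, error) {
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	raw, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var optErr error
	if err := raw.Control(func(fd uintptr) {
		optErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP, syscall.IP_RECVTTL, 1)
	}); err != nil {
		return 0, err
	}
	if optErr != nil {
		return 0, optErr
	}

	if _, err := conn.WriteToUDP([]byte("ping"), conn.LocalAddr().(*net.UDPAddr)); err != nil {
		return 0, err
	}
	conn.SetReadDeadline(time.Now().Add(2 * time.Second))

	buf, oob := make([]byte, 64), make([]byte, 128)
	_, oobn, _, _, err := conn.ReadMsgUDP(buf, oob)
	if err != nil {
		return 0, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return 0, err
	}
	for _, m := range msgs {
		if m.Header.Level == syscall.IPPROTO_IP && m.Header.Type == syscall.IP_TTL && len(m.Data) >= 4 {
			// The kernel stores the TTL as a native-endian C int.
			return int(*(*int32)(unsafe.Pointer(&m.Data[0]))), nil
		}
	}
	return 0, errors.New("no IP_TTL control message received")
}

func main() {
	ttl, err := udpLoopbackTTL()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("TTL of received datagram:", ttl)
}
```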
Extracting the TTL is not so easy.
SO_ATTACH_FILTER to rule the world!
CC BY-SA 2.0 image by Lee Jordan
Wait, there is a third way!
You see, for quite some time it has been possible to attach a BPF filtering program to a socket. See socket(7)
SO_ATTACH_FILTER (since Linux 2.2), SO_ATTACH_BPF (since Linux 3.19)
Attach a classic BPF (SO_ATTACH_FILTER) or an extended BPF (SO_ATTACH_BPF) program to the socket for use as a filter of incoming packets. A packet will be dropped if the filter program returns zero. If the filter program returns a nonzero value which is less than the packet's data length, the packet will be truncated to the length returned. If the value returned by the filter is greater than or equal to the packet's data length, the packet is allowed to proceed unmodified.
You probably take advantage of SO_ATTACH_FILTER already: This is how tcpdump/wireshark does filtering when you're dumping packets off the wire.
How does it work? Depending on the result of a BPF program, packets can be filtered, truncated or passed to the socket without modification. Normally SO_ATTACH_FILTER is used for RAW sockets, but surprisingly, BPF filters can also be attached to normal SOCK_STREAM and SOCK_DGRAM sockets!
We don't want to truncate packets though - we want to extract the TTL. Unfortunately with Classical BPF (cBPF) it's impossible to extract any data from a running BPF filter program.
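To make the "BPF on a normal socket" point concrete, here is a small Linux-only sketch that attaches a one-instruction classic BPF program ("ret 0", i.e. drop everything) to an ordinary UDP socket. It uses the deprecated but still available syscall.LsfStmt and syscall.AttachLsf helpers from the Go standard library; the function names and the loopback self-send are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
	"time"
)

// sendAndReceive sends one datagram from the socket to itself and reports
// whether it arrived within a short timeout.
func sendAndReceive(conn *net.UDPConn) bool {
	if _, err := conn.WriteToUDP([]byte("hello"), conn.LocalAddr().(*net.UDPAddr)); err != nil {
		return false
	}
	conn.SetReadDeadline(time.Now().Add(300 * time.Millisecond))
	_, _, err := conn.ReadFromUDP(make([]byte, 64))
	return err == nil
}

// attachDropAll attaches the classic BPF program "ret 0" with
// SO_ATTACH_FILTER. A zero return value tells the kernel to drop the packet.
func attachDropAll(conn *net.UDPConn) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	prog := []syscall.SockFilter{
		*syscall.LsfStmt(syscall.BPF_RET|syscall.BPF_K, 0),
	}
	var attachErr error
	if err := raw.Control(func(fd uintptr) {
		attachErr = syscall.AttachLsf(int(fd), prog)
	}); err != nil {
		return err
	}
	return attachErr
}

// dropAllDemo reports whether a datagram is delivered before and after
// the filter is attached.
func dropAllDemo() (before, after bool, err error) {
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return false, false, err
	}
	defer conn.Close()

	before = sendAndReceive(conn)
	if err = attachDropAll(conn); err != nil {
		return before, false, err
	}
	after = sendAndReceive(conn)
	return before, after, nil
}

func main() {
	before, after, err := dropAllDemo()
	fmt.Println("delivered before filter:", before, "after filter:", after, "err:", err)
}
```

The filter can decide the fate of a packet, but as noted above, a classic BPF program has no way to hand data such as the TTL back to the application.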
eBPF and maps
This changed with modern BPF machinery, which includes:
modernised eBPF bytecode
eBPF maps
SO_ATTACH_BPF socket option
eBPF bytecode can be thought of as an extension to Classical BPF, but it's the extra features that really let it shine.
The gem is the "map" abstraction. An eBPF map is a thingy that allows an eBPF program to store data and share it with userspace code. Think of an eBPF map as a data structure (most usually a hash table) shared between a userspace program and an eBPF program running in kernel space.
To solve our TTL problem, we can use an eBPF filter program. It will look at the TTL values of passing packets, and save them in an eBPF map. Later, we can inspect the eBPF map and analyze the recorded values from userspace.
SO_ATTACH_BPF to rule the world!
To use eBPF we need a number of things set up. First, we need to create an "eBPF map". There are many specialized map types, but for our purposes let's use the "hash" BPF_MAP_TYPE_HASH type.
We need to figure out the "bpf(BPF_MAP_CREATE, map type, key size, value size, limit, flags)" parameters. For our small TTL program, let's set 4 bytes for the key size and 8 bytes for the value size. The max element limit is set to 5. The exact number doesn't matter, since we expect all the packets in one connection to have just one coherent TTL value anyway.
This is how it would look in Golang code:
bpfMapFd, err := ebpf.NewMap(ebpf.Hash, 4, 8, 5, 0)
A word of warning is needed here. BPF maps use the "locked memory" resource. With multiple BPF programs and maps, it's easy to exhaust the default tiny 64 KiB limit. Consider bumping this with ulimit -l, for example:
ulimit -l 10240
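The same limit can also be inspected from inside the program. The sketch below is an assumption-laden illustration: it hard-codes the Linux RLIMIT_MEMLOCK resource number (8 on x86 and ARM) and only raises the soft limit up to the existing hard limit, which an unprivileged process is allowed to do. Going beyond the hard limit still needs ulimit run with sufficient privileges.

```go
package main

import (
	"fmt"
	"syscall"
)

// rlimitMemlock is RLIMIT_MEMLOCK on Linux for x86 and ARM. It is
// hard-coded here as an assumption; other architectures may differ.
const rlimitMemlock = 8

// raiseMemlock raises the soft locked-memory limit to the hard limit
// and returns the resulting limits.
func raiseMemlock() (syscall.Rlimit, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(rlimitMemlock, &lim); err != nil {
		return lim, err
	}
	if lim.Cur < lim.Max {
		lim.Cur = lim.Max
		if err := syscall.Setrlimit(rlimitMemlock, &lim); err != nil {
			return lim, err
		}
	}
	return lim, nil
}

func main() {
	lim, err := raiseMemlock()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("locked memory limit: soft=%d hard=%d bytes\n", lim.Cur, lim.Max)
}
```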
The bpf() syscall returns a file descriptor pointing to the kernel BPF map we just created. With it handy we can operate on the map. The possible operations are:
bpf(BPF_MAP_LOOKUP_ELEM, <key>)
bpf(BPF_MAP_UPDATE_ELEM, <key>, <value>, <flags>)
bpf(BPF_MAP_DELETE_ELEM, <key>)
bpf(BPF_MAP_GET_NEXT_KEY, <key>)
With the map created, we need to create a BPF program. As opposed to classical BPF - where the bytecode was a parameter to SO_ATTACH_FILTER - the bytecode is now loaded by the bpf() syscall. Specifically: bpf(BPF_PROG_LOAD).
In our Golang program the eBPF program setup looks like:
ebpfInss := ebpf.Instructions{
	// r0 = ctx->protocol
	ebpf.BPFIDstOffSrc(ebpf.LdXW, ebpf.Reg0, ebpf.Reg1, 16),
	// if r0 == ETH_P_IPV6, jump 3 instructions ahead
	ebpf.BPFIDstOffImm(ebpf.JEqImm, ebpf.Reg0, 3, int32(htons(ETH_P_IPV6))),
	// r6 = ctx, as required by the packet load that follows
	ebpf.BPFIDstSrc(ebpf.MovSrc, ebpf.Reg6, ebpf.Reg1),
	// r0 = byte at Layer 3 offset 8
	ebpf.BPFIImm(ebpf.LdAbsB, int32(-0x100000+8)),
	...
	// return -1
	ebpf.BPFIDstImm(ebpf.MovImm, ebpf.Reg0, -1),
	ebpf.BPFIOp(ebpf.Exit),
}
bpfProgram, err := ebpf.NewProgram(ebpf.SocketFilter, &ebpfInss, "GPL", 0)
Writing eBPF by hand is rather controversial. Most people use clang (from version 3.7 onwards) to compile code written in a C dialect into eBPF bytecode. The resulting bytecode is saved in an ELF file, which can be loaded by most eBPF libraries. This ELF file also includes a description of the maps, so you don’t need to set them up manually.
I personally don't see the point in adding an ELF/clang dependency for simple SO_ATTACH_BPF snippets. Don't be afraid of the raw bytecode!
BPF calling convention
Before we go further we should highlight a couple of things about the eBPF environment. The official kernel documentation isn't too friendly:
The first important bit to know is the calling convention:
R0 - return value from in-kernel function, and exit value for eBPF program
R1-R5 - arguments from eBPF program to in-kernel function
R6-R9 - callee saved registers that in-kernel function will preserve
R10 - read-only frame pointer to access stack
When the BPF program is started, R1 contains a pointer to ctx. This data structure is defined as struct __sk_buff. For example, to access the protocol field you'd need to run:
r0 = *(u32 *)(r1 + 16)
Or in other words:
ebpf.BPFIDstOffSrc(ebpf.LdXW, ebpf.Reg0, ebpf.Reg1, 16),
Which is exactly what we do in the first line of our program, since we need to choose between the IPv4 and IPv6 code branches.
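To see where the 16 comes from and what such an instruction looks like as raw bytes, here is a small sketch. The skBuffPrefix struct reproduces only the first five fields of struct __sk_buff, and the insn type with its encode method are illustrative helpers written for this example (they are not part of the Go library used in this post). The byte layout assumes a little-endian host.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// skBuffPrefix mirrors the first fields of the kernel's struct __sk_buff.
// The real struct is longer; only the leading fields are reproduced.
type skBuffPrefix struct {
	Len          uint32
	PktType      uint32
	Mark         uint32
	QueueMapping uint32
	Protocol     uint32
}

// insn is one 8-byte eBPF instruction: opcode, two 4-bit register
// numbers, a 16-bit offset and a 32-bit immediate.
type insn struct {
	Code     uint8
	Dst, Src uint8
	Off      int16
	Imm      int32
}

// encode returns the little-endian in-memory form of the instruction.
func (i insn) encode() [8]byte {
	var b [8]byte
	b[0] = i.Code
	b[1] = i.Src<<4 | i.Dst&0x0f // dst in the low nibble, src in the high one
	binary.LittleEndian.PutUint16(b[2:4], uint16(i.Off))
	binary.LittleEndian.PutUint32(b[4:8], uint32(i.Imm))
	return b
}

const opLdxW = 0x61 // BPF_LDX | BPF_MEM | BPF_W

// loadProtocol builds "r0 = *(u32 *)(r1 + offsetof(protocol))".
func loadProtocol() insn {
	off := int16(unsafe.Offsetof(skBuffPrefix{}.Protocol))
	return insn{Code: opLdxW, Dst: 0, Src: 1, Off: off}
}

func main() {
	i := loadProtocol()
	fmt.Printf("protocol offset: %d\n", i.Off)
	fmt.Printf("encoded: % x\n", i.encode())
}
```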
Accessing the BPF payload
Next, there are special instructions for packet payload loading. Most BPF programs (but not all!) run in the context of packet filtering, so it makes sense to accelerate data lookups by having magic opcodes for accessing packet data.
Instead of dereferencing context, like ctx->data[x] to load a byte, BPF supports the BPF_LD instruction that can do it in one operation. There are caveats though. The documentation says:
eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
(BPF_IND | <size> | BPF_LD) which are used to access packet data.
They had to be carried over from classic BPF to have strong performance of
socket filters running in eBPF interpreter. These instructions can only
be used when interpreter context is a pointer to 'struct sk_buff' and
have seven implicit operands. Register R6 is an implicit input that must
contain pointer to sk_buff. Register R0 is an implicit output which contains
the data fetched from the packet. Registers R1-R5 are scratch registers
and must not be used to store the data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.
In other words: before calling BPF_LD we must move ctx to R6, like this:
ebpf.BPFIDstSrc(ebpf.MovSrc, ebpf.Reg6, ebpf.Reg1),
Then we can call the load:
ebpf.BPFIImm(ebpf.LdAbsB, int32(-0x100000+7)),
At this stage the result is in r0, but we must remember that r1-r5 should be considered dirty. For an instruction, BPF_LD looks very much like a function call.
Magical Layer 3 offset
Next, note the load offset - we loaded the -0x100000+7 byte of the packet. This magic offset is another BPF context curiosity. It turns out that a BPF script loaded under SO_ATTACH_BPF on a SOCK_STREAM (or SOCK_DGRAM) socket will only see Layer 4 and higher OSI layers by default. To extract the TTL we need access to the Layer 3 header (i.e. the IP header). To access L3 in the L4 context, we must offset the data lookups by the magical -0x100000.
This magic constant is defined in the kernel.
For completeness, the +7 is, of course, the offset of the Hop Limit field in an IPv6 packet. Our small BPF program also supports IPv4, where the TTL is at offset +8.
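The offset arithmetic can be checked in userspace. In the sketch below the constant and function names are made up for illustration, and the two sample headers are hand-written bytes rather than captured traffic. ttlLoadOffset computes the immediate passed to the BPF byte load, while ttlFromHeader reads the same byte from a raw IP header held in a Go slice.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	skfNetOff          = -0x100000 // the magic Layer 3 offset
	ipv4TTLOffset      = 8         // TTL byte in the IPv4 header
	ipv6HopLimitOffset = 7         // Hop Limit byte in the IPv6 header
)

// ttlLoadOffset returns the immediate operand for the BPF_LD byte load
// that fetches the TTL (IPv4) or Hop Limit (IPv6).
func ttlLoadOffset(ipv6 bool) int32 {
	if ipv6 {
		return skfNetOff + ipv6HopLimitOffset
	}
	return skfNetOff + ipv4TTLOffset
}

// ttlFromHeader reads the TTL or Hop Limit from the start of a raw IP
// header, using the version nibble to pick the right offset.
func ttlFromHeader(hdr []byte) (int, error) {
	if len(hdr) < 9 {
		return 0, errors.New("header too short")
	}
	switch hdr[0] >> 4 {
	case 4:
		return int(hdr[ipv4TTLOffset]), nil
	case 6:
		return int(hdr[ipv6HopLimitOffset]), nil
	}
	return 0, errors.New("unknown IP version")
}

func main() {
	// Hand-written header prefixes: IPv4 with TTL 58, IPv6 with Hop Limit 57.
	v4 := []byte{0x45, 0x00, 0x00, 0x28, 0x00, 0x00, 0x40, 0x00, 0x3a, 0x06}
	v6 := []byte{0x60, 0x00, 0x00, 0x00, 0x00, 0x14, 0x06, 0x39, 0x20}

	t4, _ := ttlFromHeader(v4)
	t6, _ := ttlFromHeader(v6)
	fmt.Println("IPv4 load offset:", ttlLoadOffset(false), "TTL:", t4)
	fmt.Println("IPv6 load offset:", ttlLoadOffset(true), "Hop Limit:", t6)
}
```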
Return value
Finally, the return value of the BPF program is meaningful. In the context of packet filtering it will be interpreted as a truncated packet length. Had we returned 0, the packet would be dropped and wouldn't be seen by the userspace socket application. It's quite interesting that we can do packet-based data manipulation with eBPF on a stream-based socket. Anyway, our script returns -1, which when cast to unsigned will be interpreted as a very large number:
ebpf.BPFIDstImm(ebpf.MovImm, ebpf.Reg0, -1),
ebpf.BPFIOp(ebpf.Exit),
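The effect of the return value can be modelled in a few lines. This is only a userspace model of the rules quoted from socket(7) earlier, with a hypothetical deliveredLen helper; it is not kernel code.

```go
package main

import "fmt"

// deliveredLen models how the kernel treats a socket filter's return
// value for a packet of pktLen bytes: zero drops the packet, a smaller
// nonzero value truncates it, anything else passes it unmodified.
func deliveredLen(ret uint32, pktLen int) int {
	switch {
	case ret == 0:
		return 0 // dropped
	case int64(ret) < int64(pktLen):
		return int(ret) // truncated
	default:
		return pktLen // unmodified
	}
}

func main() {
	minusOne := int32(-1)
	ret := uint32(minusOne) // -1 reinterpreted as unsigned

	fmt.Printf("-1 as unsigned: %#x\n", ret)
	fmt.Println("ret=0   ->", deliveredLen(0, 1500))
	fmt.Println("ret=40  ->", deliveredLen(40, 1500))
	fmt.Println("ret=-1  ->", deliveredLen(ret, 1500))
}
```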
Extracting data from map
Our running BPF program will set a key on our map for any matched packet. The key is the recorded TTL value, the value is the packet count. The value counter is somewhat vulnerable to a tiny race condition, but it's ignorable for our purposes. Later on, to extract the data from a userspace program we use this Golang loop:
var (
	value  MapU64
	k1, k2 MapU32
)
for {
	// Look up the counter stored under the current key.
	ok, err := bpfMap.Get(k1, &value, 8)
	if ok {
		// k1 is TTL, value is counter
		...
	}
	// Advance to the next key; stop when there are no more.
	ok, err = bpfMap.GetNextKey(k1, &k2, 4)
	if err != nil || !ok {
		break
	}
	k1 = k2
}
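Once the (TTL, counter) pairs have been copied out of the map, something has to pick the value to report. A minimal sketch, assuming the pairs were collected into an ordinary Go map (the dominantTTL helper is hypothetical and not part of the program's source):

```go
package main

import (
	"fmt"
	"sort"
)

// dominantTTL returns the TTL with the highest packet count. Ties are
// resolved in favour of the lower TTL so the result is deterministic.
// ok is false when no TTL was recorded at all.
func dominantTTL(counts map[uint32]uint64) (ttl uint32, ok bool) {
	keys := make([]uint32, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	var best uint64
	for _, k := range keys {
		if c := counts[k]; !ok || c > best {
			ttl, best, ok = k, c, true
		}
	}
	return ttl, ok
}

func main() {
	// Example data: 12 packets seen with TTL 58, one stray with TTL 57.
	counts := map[uint32]uint64{58: 12, 57: 1}
	if ttl, ok := dominantTTL(counts); ok {
		fmt.Println("most common TTL:", ttl)
	}
}
```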
Putting it all together
Now with all the pieces ready we can make it a proper runnable program. There is little point in discussing it here, so allow me to refer to the source code. The BPF pieces are here:
We haven't discussed how to catch the inbound SYN+ACK in the BPF program. This is a matter of setting up BPF before calling connect(). Sadly, at the time of writing it was impossible to customize net.Dial in Golang. Instead we wrote a surprisingly painful and awful custom Dial implementation. The ugly custom dialer code is here:
To run all this you need a 4.4+ kernel with the bpf() syscall compiled in. BPF features of specific kernels are documented in this superb page from BCC:
Run the code to observe the TTL distances:
$ ./ttl-ebpf tcp4://google.com:80 tcp6://google.com:80 \
tcp4://cloudflare.com:80 tcp6://cloudflare.com:80
[+] TTL distance to tcp4://google.com:80 172.217.4.174 is 6
[+] TTL distance to tcp6://google.com:80 [2607:f8b0:4005:809::200e] is 4
[+] TTL distance to tcp4://cloudflare.com:80 198.41.215.162 is 3
[+] TTL distance to tcp6://cloudflare.com:80 [2400:cb00:2048:1::c629:d6a2] is 3
Takeaways
In this blog post we dived into the new eBPF machinery, including the bpf() syscall, maps and SO_ATTACH_BPF. This work allowed me to realize the potential of running SO_ATTACH_BPF on fully established TCP sockets. Undoubtedly, eBPF still requires plenty of love and documentation, but it seems to be a perfect bridge to expose low level toggles to userspace applications.
I highly recommend keeping the dependencies small. For small BPF programs, like the one shown, there is little need for complex clang compilation and ELF loading. Don't be afraid of the eBPF bytecode!
We only touched on SO_ATTACH_BPF, where we analyzed network packets with BPF running on network sockets. There is more! First, you can attach BPFs to a dozen "things", XDP being the most obvious example. Full list. Then, it's possible to actually affect kernel packet processing, here is a full list of helper functions, some of which can modify kernel data structures.
In February LWN jokingly wrote:
Developers should be careful, though; this could
prove to be a slippery slope leading toward something
that starts to look like a microkernel architecture.
There is a grain of truth here. Maybe the ability to run eBPF on a variety of subsystems feels like microkernel coding, but SO_ATTACH_BPF definitely smells like the STREAMS programming model from 1984.
Thanks to Gilberto Bertin and David Wragg for helping out with the eBPF bytecode.
Does doing eBPF work sound interesting? Join our world-famous team in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland.