eBPF can't count?!

Grant mechanical calculating machine, public domain image

It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you've read all the great man pages, docs, guides, and some of our blogs out there.

But we can tell you a war story, and who doesn't like those? This one is about how eBPF lost its ability to count for a while¹.

They say in our Austin, Texas office that all good stories start with "y'all ain't gonna believe this… tale." This one though, starts with a post to Linux netdev mailing list from Marek Majkowski after what I heard was a long night:

Marek's findings were quite shocking - if you subtract two 64-bit timestamps in eBPF, the result is garbage. But only when running as an unprivileged user. From root all works fine. Huh.

If you've seen Marek's presentation from the Netdev 0x13 conference, you know that we are using BPF socket filters as one of the defenses against simple, volumetric DoS attacks. So potentially getting your packet count wrong could be a Bad Thing™, and affect legitimate traffic.

Let's try to reproduce this bug with a simplified eBPF socket filter that subtracts two 64-bit unsigned integers passed to it from user-space though a BPF map. The input for our BPF program comes from a BPF array map, so that the values we operate on are not known at build time. This allows for easy experimentation and prevents the compiler from optimizing out the operations.

Starting small, eBPF, what is 2 - 1? View the code on our GitHub.

$ ./run-bpf 2 1
arg0                    2 0x0000000000000002
arg1                    1 0x0000000000000001
diff                    1 0x0000000000000001

OK, eBPF, what is 2^32 - 1?

$ ./run-bpf $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff 18446744073709551615 0xffffffffffffffff

Wrong! But if we ask nicely with sudo:

$ sudo ./run-bpf $[2**32] 1
[sudo] password for jkbs:
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff

Who is messing with my eBPF?

When computers stop subtracting, you know something big is up. We called for reinforcements.

Our colleague Arthur Fabre quickly noticed something is off when you examine the eBPF code loaded into the kernel. It turns out kernel doesn't actually run the eBPF it's supplied - it sometimes rewrites it first.

Any sane programmer would expect 64-bit subtraction to be expressed as a single eBPF instruction

$ llvm-objdump -S -no-show-raw-insn -section=socket1 bpf/filter.o
…
      20:       1f 76 00 00 00 00 00 00         r6 -= r7
…

However, that's not what the kernel actually runs. Apparently after the rewrite the subtraction becomes a complex, multi-step operation.

To see what the kernel is actually running we can use little known bpftool utility. First, we need to load our BPF

$ ./run-bpf --stop-after-load 2 1
[2]+  Stopped                 ./run-bpf 2 1

Then list all BPF programs loaded into the kernel with bpftool prog list

$ sudo bpftool prog list
…
5951: socket_filter  name filter_alu64  tag 11186be60c0d0c0f  gpl
        loaded_at 2019-04-05T13:01:24+0200  uid 1000
        xlated 424B  jited 262B  memlock 4096B  map_ids 28786

The most recently loaded socket_filter must be our program (filter_alu64). Now we now know its id is 5951 and we can list its bytecode with

$ sudo bpftool prog dump xlated id 5951
…
  33: (79) r7 = *(u64 *)(r0 +0)
  34: (b4) (u32) r11 = (u32) -1
  35: (1f) r11 -= r6
  36: (4f) r11 |= r6
  37: (87) r11 = -r11
  38: (c7) r11 s>>= 63
  39: (5f) r6 &= r11
  40: (1f) r6 -= r7
  41: (7b) *(u64 *)(r10 -16) = r6
…

bpftool can also display the JITed code with: bpftool prog dump jited id 5951.

As you see, subtraction is replaced with a series of opcodes. That is unless you are root. When running from root all is good

$ sudo ./run-bpf --stop-after-load 0 0
[1]+  Stopped                 sudo ./run-bpf --stop-after-load 0 0
$ sudo bpftool prog list | grep socket_filter
659: socket_filter  name filter_alu64  tag 9e7ffb08218476f3  gpl
$ sudo bpftool prog dump xlated id 659
…
  31: (79) r7 = *(u64 *)(r0 +0)
  32: (1f) r6 -= r7
  33: (7b) *(u64 *)(r10 -16) = r6
…

If you've spent any time using eBPF, you must have experienced first hand the dreaded eBPF verifier. It's a merciless judge of all eBPF code that will reject any programs that it deems not worthy of running in kernel-space.

What perhaps nobody has told you before, and what might come as a surprise, is that the very same verifier will actually also rewrite and patch up your eBPF code as needed to make it safe.

The problems with subtraction were introduced by an inconspicuous security fix to the verifier. The patch in question first landed in Linux 5.0 and was backported to 4.20.6 stable and 4.19.19 LTS kernel. The over 2000 words long commit message doesn't spare you any details on the attack vector it targets.

The mitigation stems from CVE-2019-7308 vulnerability discovered by Jann Horn at Project Zero, which exploits pointer arithmetic, i.e. adding a scalar value to a pointer, to trigger speculative memory loads from out-of-bounds addresses. Such speculative loads change the CPU cache state and can be used to mount a Spectre variant 1 attack.

To mitigate it the eBPF verifier rewrites any arithmetic operations on pointer values in such a way the result is always a memory location within bounds. The patch demonstrates how arithmetic operations on pointers get rewritten and we can spot a familiar pattern there

Wait a minute… What pointer arithmetic? We are just trying to subtract two scalar values. How come the mitigation kicks in?

It shouldn't. It's a bug. The eBPF verifier keeps track of what kind of values the ALU is operating on, and in this corner case the state was ignored.

Why running BPF as root is fine, you ask? If your program has CAP_SYS_ADMIN privileges, side-channel mitigations don't apply. As root you already have access to kernel address space, so nothing new can leak through BPF.

After our report, the fix has quickly landed in v5.0 kernel and got backported to stable kernels 4.20.15 and 4.19.28. Kudos to Daniel Borkmann for getting the fix out fast. However, kernel upgrades are hard and in the meantime we were left with code running in production that was not doing what it was supposed to.

32-bit ALU to the rescue

As one of the eBPF maintainers has pointed out, 32-bit arithmetic operations are not affected by the verifier bug. This opens a door for a work-around.

eBPF registers, r0..r10, are 64-bits wide, but you can also access just the lower 32 bits, which are exposed as subregisters w0..w10. You can operate on the 32-bit subregisters using BPF ALU32 instruction subset. LLVM 7+ can generate eBPF code that uses this instruction subset. Of course, you need to you ask it nicely with trivial -Xclang -target-feature -Xclang +alu32 toggle:

$ cat sub32.c
#include "common.h"

u32 sub32(u32 x, u32 y)
{
        return x - y;
}
$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub32.c
$ llvm-objdump -S -no-show-raw-insn sub32.o
…
sub32:
       0:       bc 10 00 00 00 00 00 00         w0 = w1
       1:       1c 20 00 00 00 00 00 00         w0 -= w2
       2:       95 00 00 00 00 00 00 00         exit

The 0x1c opcode of the instruction #1, which can be broken down as BPF_ALU | BPF_X | BPF_SUB (read more in the kernel docs), is the 32-bit subtraction between registers we are looking for, as opposed to regular 64-bit subtract operation 0x1f = BPF_ALU64 | BPF_X | BPF_SUB, which will get rewritten.

Armed with this knowledge we can borrow a page from bignum arithmetic and subtract 64-bit numbers using just 32-bit ops:

u64 sub64(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        u32 hi, lo;

        xl = x;
        yl = y;
        lo = xl - yl;

        xh = x >> 32;
        yh = y >> 32;
        hi = xh - yh - (lo > xl); /* underflow? */

        return ((u64)hi << 32) | (u64)lo;
}

This code compiles as expected on normal architectures, like x86-64 or ARM64, but BPF Clang target plays by its own rules:

$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub64.c -o - \
  | llvm-objdump -S -
…  
      13:       1f 40 00 00 00 00 00 00         r0 -= r4
      14:       1f 30 00 00 00 00 00 00         r0 -= r3
      15:       1f 21 00 00 00 00 00 00         r1 -= r2
      16:       67 00 00 00 20 00 00 00         r0 <<= 32
      17:       67 01 00 00 20 00 00 00         r1 <<= 32
      18:       77 01 00 00 20 00 00 00         r1 >>= 32
      19:       4f 10 00 00 00 00 00 00         r0 |= r1
      20:       95 00 00 00 00 00 00 00         exit

Apparently the compiler decided it was better to operate on 64-bit registers and discard the upper 32 bits. Thus we weren't able to get rid of the problematic 0x1f opcode. Annoying, back to square one.

Surely a bit of IR will do?

The problem was in Clang frontend - compiling C to IR. We know that BPF "assembly" backend for LLVM can generate bytecode that uses ALU32 instructions. Maybe if we tweak the Clang compiler's output just a little we can achieve what we want. This means we have to get our hands dirty with the LLVM Intermediate Representation (IR).

If you haven't heard of LLVM IR before, now is a good time to do some reading ². In short the LLVM IR is what Clang produces and LLVM BPF backend consumes.

Time to write IR by hand! Here's a hand-tweaked IR variant of our sub64() function:

define dso_local i64 @sub64_ir(i64, i64) local_unnamed_addr #0 {
  %3 = trunc i64 %0 to i32      ; xl = (u32) x;
  %4 = trunc i64 %1 to i32      ; yl = (u32) y;
  %5 = sub i32 %3, %4           ; lo = xl - yl;
  %6 = zext i32 %5 to i64
  %7 = lshr i64 %0, 32          ; tmp1 = x >> 32;
  %8 = lshr i64 %1, 32          ; tmp2 = y >> 32;
  %9 = trunc i64 %7 to i32      ; xh = (u32) tmp1;
  %10 = trunc i64 %8 to i32     ; yh = (u32) tmp2;
  %11 = sub i32 %9, %10         ; hi = xh - yh
  %12 = icmp ult i32 %3, %5     ; tmp3 = xl < lo
  %13 = zext i1 %12 to i32
  %14 = sub i32 %11, %13        ; hi -= tmp3
  %15 = zext i32 %14 to i64
  %16 = shl i64 %15, 32         ; tmp2 = hi << 32
  %17 = or i64 %16, %6          ; res = tmp2 | (u64)lo
  ret i64 %17
}

It may not be pretty but it does produce desired BPF code when compiled³. You will likely find the LLVM IR reference helpful when deciphering it.

And voila! First working solution that produces correct results:

$ ./run-bpf -filter ir $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff

Actually using this hand-written IR function from C is tricky. See our code on GitHub.

public domain image by Sergei Frolov

The final trick

Hand-written IR does the job. The downside is that linking IR modules to your C modules is hard. Fortunately there is a better way. You can persuade Clang to stick to 32-bit ALU ops in generated IR.

We've already seen the problem. To recap, if we ask Clang to subtract 32-bit integers, it will operate on 64-bit values and throw away the top 32-bits. Putting C, IR, and eBPF side-by-side helps visualize this:

The trick to get around it is to declare the 32-bit variable that holds the result as volatile. You might already know the volatile keyword if you've written Unix signal handlers. It basically tells the compiler that the value of the variable may change under its feet so it should refrain from reorganizing loads (reads) from it, as well as that stores (writes) to it might have side-effects so changing the order or eliminating them, by skipping writing it to the memory, is not allowed either.

Using volatile makes Clang emit special loads and/or stores at the IR level, which then on eBPF level translates to writing/reading the value from memory (stack) on every access. While this sounds not related to the problem at hand, there is a surprising side-effect to it:

With volatile access compiler doesn't promote the subtraction to 64 bits! Don't ask me why, although I would love to hear an explanation. For now, consider this a hack. One that does not come for free - there is the overhead of going through the stack on each read/write.

However, if we play our cards right we just might reduce it a little. We don't actually need the volatile load or store to happen, we just want the side effect. So instead of declaring the value as volatile, which implies that both reads and writes are volatile, let's try to make only the writes volatile with a help of a macro:

/* Emits a "store volatile" in LLVM IR */
#define ST_V(rhs, lhs) (*(volatile typeof(rhs) *) &(rhs) = (lhs))

If this macro looks strangely familiar, it's because it does the same thing as WRITE_ONCE() macro in the Linux kernel. Applying it to our example:

That's another hacky but working solution. Pick your poison.

CC BY-SA 3.0 image by ANKAWÜ

So there you have it - from C, to IR, and back to C to hack around a bug in eBPF verifier and be able to subtract 64-bit integers again. Usually you won't have to dive into LLVM IR or assembly to make use of eBPF. But it does help to know a little about it when things don't work as expected.

Did I mention that 64-bit addition is also broken? Have fun fixing it!

¹ Okay, it was more like 3 months time until the bug was discovered and fixed.

² Some even think that it is better than assembly.

³ How do we know? The litmus test is to look for statements matching r[0-9] [-+]= r[0-9] in BPF assembly.

The Cloudflare Blog

eBPF can't count?!

Who is messing with my eBPF?

32-bit ALU to the rescue

Surely a bit of IR will do?

The final trick

QUIC restarts, slow problems: udpgrm to the rescue

A steam locomotive from 1993 broke my yarn test

Searching for the cause of hung tasks in the Linux kernel

Multi-Path TCP: revolutionizing connectivity, one path at a time