
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, the technologies used, and how to join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 10:09:13 GMT</lastBuildDate>
        <item>
            <title><![CDATA[The quantum state of a TCP port]]></title>
            <link>https://blog.cloudflare.com/the-quantum-state-of-a-tcp-port/</link>
            <pubDate>Mon, 20 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ If I navigate to https://blog.cloudflare.com/, my browser will connect to a remote TCP address from the local IP address assigned to my machine, and a randomly chosen local TCP port. What happens if I then decide to head to another site? ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Have you noticed how simple questions sometimes lead to complex answers? Today we will tackle one such question. Category: our favorite - Linux networking.</p>
    <div>
      <h2>When can two TCP sockets share a local address?</h2>
      <a href="#when-can-two-tcp-sockets-share-a-local-address">
        
      </a>
    </div>
    <p>If I navigate to <a href="/">https://blog.cloudflare.com/</a>, my browser will connect to a remote TCP address, 104.16.132.229:443 in this case, from the local IP address assigned to my Linux machine, and a randomly chosen local TCP port, say 192.0.2.42:54321. What happens if I then decide to head to a different site? Is it possible to establish another TCP connection from the same local IP address and port?</p><p>To find the answer let's do a bit of <a href="https://en.wikipedia.org/wiki/Discovery_learning">learning by discovery</a>. We have prepared eight quiz questions. Each will let you discover one aspect of the rules that govern local address sharing between TCP sockets under Linux. Fair warning, it might get a bit mind-boggling.</p><p>Questions are split into two groups by test scenario:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Fu0occJtMgjz3Rd7xGJCo/9bf63f71e754ccbcf47c1fdd801d8f8f/image4-15.png" />
            
            </figure><p>In the first test scenario, two sockets connect from the same local port to the same remote IP and port. However, the local IP is different for each socket.</p><p>In the second scenario, the local IP and port are the same for all sockets, but the remote address, or rather just the remote IP, differs.</p><p>In our quiz questions, we will either:</p><ol><li><p>let the OS automatically select the local IP and/or port for the socket, or</p></li><li><p>explicitly assign the local address with <a href="https://man7.org/linux/man-pages/man2/bind.2.html"><code>bind()</code></a> before <a href="https://man7.org/linux/man-pages/man2/connect.2.html"><code>connect()</code></a>’ing the socket; a method also known as <a href="https://idea.popcount.org/2014-04-03-bind-before-connect/">bind-before-connect</a>.</p></li></ol><p>Because we will be examining corner cases in the bind() logic, we need a way to exhaust available local addresses, that is (IP, port) pairs. We could just create lots of sockets, but it will be easier to <a href="https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html?#ip-variables">tweak the system configuration</a> and pretend that there is just one ephemeral local port, which the OS can assign to sockets:</p><p><code>sysctl -w net.ipv4.ip_local_port_range='60000 60000'</code></p><p>Each quiz question is a short Python snippet. Your task is to predict the outcome of running the code. Does it succeed? Does it fail? If so, what fails? Asking ChatGPT is not allowed.</p><p>There is always a common setup procedure to keep in mind. We will omit it from the quiz snippets to keep them short:</p>
            <pre><code>from os import system
from socket import *

# Constant missing from Python's socket module
IP_BIND_ADDRESS_NO_PORT = 24

# Our network namespace has just *one* ephemeral port
system("sysctl -w net.ipv4.ip_local_port_range='60000 60000'")

# Open a listening socket at *:1234. We will connect to it.
ln = socket(AF_INET, SOCK_STREAM)
ln.bind(("", 1234))
ln.listen(SOMAXCONN)</code></pre>
            <p>With the formalities out of the way, let us begin. Ready. Set. Go!</p>
    <div>
      <h3>Scenario #1: When the local IP is unique, but the local port is the same</h3>
      <a href="#scenario-1-when-the-local-ip-is-unique-but-the-local-port-is-the-same">
        
      </a>
    </div>
    <p>In Scenario #1 we connect two sockets to the same remote address - 127.9.9.9:1234. The sockets will use different local IP addresses, but is it enough to share the local port?</p>
<table>
<thead>
  <tr>
    <th><span>local IP</span></th>
    <th><span>local port</span></th>
    <th><span>remote IP</span></th>
    <th><span>remote port</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>unique</span></td>
    <td><span>same</span></td>
    <td><span>same</span></td>
    <td><span>same</span></td>
  </tr>
  <tr>
    <td><span>127.0.0.1<br />127.1.1.1<br />127.2.2.2</span></td>
    <td><span>60_000</span></td>
    <td><span>127.9.9.9</span></td>
    <td><span>1234</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Quiz #1</h3>
      <a href="#quiz-1">
        
      </a>
    </div>
    <p>On the local side, we bind two sockets to distinct, explicitly specified IP addresses. We will allow the OS to select the local port. Remember: our local ephemeral port range contains just one port (60,000).</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.1.1.1', 0))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.2.2.2', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_1.py">Answer #1</a></p>
    <div>
      <h3>Quiz #2</h3>
      <a href="#quiz-2">
        
      </a>
    </div>
    <p>Here, the setup is almost identical to the one before. However, this time we ask the OS to select the local IP address and port for the first socket. Do you think the result will differ from the previous question?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.2.2.2', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_2.py">Answer #2</a></p>
    <div>
      <h3>Quiz #3</h3>
      <a href="#quiz-3">
        
      </a>
    </div>
    <p>This quiz question is just like the one above. We have only changed the ordering. First, we connect a socket from an explicitly specified local address. Then we ask the system to select a local address for us. Obviously, such an ordering change should not make any difference, right?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.1.1.1', 0))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_3.py">Answer #3</a></p>
    <div>
      <h3>Scenario #2: When the local IP and port are the same, but the remote IP differs</h3>
      <a href="#scenario-2-when-the-local-ip-and-port-are-the-same-but-the-remote-ip-differs">
        
      </a>
    </div>
    <p>In Scenario #2 we reverse our setup. Instead of multiple local IPs and one remote address, we now have one local address <code>127.0.0.1:60000</code> and two distinct remote addresses. The question remains the same - can two sockets share the local port? Reminder: the ephemeral port range still contains just one port.</p>
<table>
<thead>
  <tr>
    <th><span>local IP</span></th>
    <th><span>local port</span></th>
    <th><span>remote IP</span></th>
    <th><span>remote port</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>same</span></td>
    <td><span>same</span></td>
    <td><span>unique</span></td>
    <td><span>same</span></td>
  </tr>
  <tr>
    <td><span>127.0.0.1</span></td>
    <td><span>60_000</span></td>
    <td><span>127.8.8.8<br />127.9.9.9</span></td>
    <td><span>1234</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Quiz #4</h3>
      <a href="#quiz-4">
        
      </a>
    </div>
    <p>Let’s start from the basics. We <code>connect()</code> to two distinct remote addresses. This is a warm-up.</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_4.py">Answer #4</a></p>
    <div>
      <h3>Quiz #5</h3>
      <a href="#quiz-5">
        
      </a>
    </div>
    <p>What if we <code>bind()</code> to a local IP explicitly but let the OS select the port - does anything change?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 0))
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.0.0.1', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_5.py">Answer #5</a></p>
    <div>
      <h3>Quiz #6</h3>
      <a href="#quiz-6">
        
      </a>
    </div>
    <p>This time we explicitly specify both the local address and the local port - after all, sometimes an application needs a specific source port.</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 60_000))
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.0.0.1', 60_000))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_6.py">Answer #6</a></p>
    <div>
      <h3>Quiz #7</h3>
      <a href="#quiz-7">
        
      </a>
    </div>
    <p>Just when you thought it couldn’t get any weirder, we add <a href="https://manpages.debian.org/unstable/manpages/socket.7.en.html#SO_REUSEADDR"><code>SO_REUSEADDR</code></a> into the mix.</p><p>First, we ask the OS to allocate a local address for us. Then we explicitly bind to the same local address, which we know the OS must have assigned to the first socket. We enable local address reuse for both sockets. Is this allowed?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.bind(('127.0.0.1', 60_000))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_7.py">Answer #7</a></p>
    <div>
      <h3>Quiz #8</h3>
      <a href="#quiz-8">
        
      </a>
    </div>
    <p>Finally, a cherry on top. This is Quiz #7 but in reverse. Common sense dictates that the outcome should be the same, but is it?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.bind(('127.0.0.1', 60_000))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.connect(('127.8.8.8', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_8.py">Answer #8</a></p>
    <div>
      <h2>The secret tri-state life of a local TCP port</h2>
      <a href="#the-secret-tri-state-life-of-a-local-tcp-port">
        
      </a>
    </div>
    <p>Is it all clear now? Well, probably not. It feels like reverse engineering a black box. So what is happening behind the scenes? Let's take a look.</p><p>Linux tracks all TCP <b>ports</b> in use in a hash table named <a href="https://elixir.bootlin.com/linux/v6.2/source/include/net/inet_hashtables.h#L166">bhash</a>. Not to be confused with the <a href="https://elixir.bootlin.com/linux/v6.2/source/include/net/inet_hashtables.h#L156">ehash</a> table, which tracks <b>sockets</b> with both local and remote address already assigned.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3OJ1M8Zu9lgEZoEJdJCNgr/d50e8f331dc994366b2ff749ed10b519/Untitled.png" />
            
            </figure><p>Each hash table entry points to a chain of so-called bind buckets, which group together sockets that share a local port. To be precise, sockets are grouped into buckets by:</p><ul><li><p>the <a href="https://man7.org/linux/man-pages/man7/network_namespaces.7.html">network namespace</a> they belong to, and</p></li><li><p>the <a href="https://docs.kernel.org/networking/vrf.html">VRF</a> device they are bound to, and</p></li><li><p>the local port number they are bound to.</p></li></ul><p>But in the simplest possible setup - single network namespace, no VRFs - we can say that sockets in a bind bucket are grouped by their local port number.</p><p>The set of sockets in each bind bucket, that is the sockets sharing a local port, is kept on a linked list named owners.</p><p>When we ask the kernel to assign a local address to a socket, its task is to check for a conflict with any existing socket. That is because a local port number can be shared only <a href="https://elixir.bootlin.com/linux/v6.2/source/include/net/inet_hashtables.h#L43">under some conditions</a>:</p>
            <pre><code>/* There are a few simple rules, which allow for local port reuse by
 * an application.  In essence:
 *
 *   1) Sockets bound to different interfaces may share a local port.
 *      Failing that, goto test 2.
 *   2) If all sockets have sk-&gt;sk_reuse set, and none of them are in
 *      TCP_LISTEN state, the port may be shared.
 *      Failing that, goto test 3.
 *   3) If all sockets are bound to a specific inet_sk(sk)-&gt;rcv_saddr local
 *      address, and none of them are the same, the port may be
 *      shared.
 *      Failing this, the port cannot be shared.
 *
 * The interesting point, is test #2.  This is what an FTP server does
 * all day.  To optimize this case we use a specific flag bit defined
 * below.  As we add sockets to a bind bucket list, we perform a
 * check of: (newsk-&gt;sk_reuse &amp;&amp; (newsk-&gt;sk_state != TCP_LISTEN))
 * As long as all sockets added to a bind bucket pass this test,
 * the flag bit will be set.
 * ...
 */</code></pre>
            <p>The comment above hints that the kernel tries to optimize for the happy case of no conflict. To this end the bind bucket holds additional state which aggregates the properties of the sockets it holds:</p>
            <pre><code>struct inet_bind_bucket {
        /* ... */
        signed char          fastreuse;
        signed char          fastreuseport;
        kuid_t               fastuid;
#if IS_ENABLED(CONFIG_IPV6)
        struct in6_addr      fast_v6_rcv_saddr;
#endif
        __be32               fast_rcv_saddr;
        unsigned short       fast_sk_family;
        bool                 fast_ipv6_only;
        /* ... */
};</code></pre>
            <p>Let's focus our attention just on the first aggregate property - <code>fastreuse</code>. It has existed since the now prehistoric Linux 2.1.90pre1. Initially in the form of a <a href="https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/tree/include/net/tcp.h?h=2.1.90pre1&amp;id=9d11a5176cc5b9609542b1bd5a827b8618efe681#n76">bit flag</a>, as the comment says, only to evolve into a byte-sized field over time.</p><p>The other six fields came much later with the introduction of <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da5e36308d9f7151845018369148201a5d28b46d"><code>SO_REUSEPORT</code> in Linux 3.9</a>. Since they play a role only when there are sockets with the <a href="https://manpages.debian.org/unstable/manpages/socket.7.en.html#SO_REUSEPORT"><code>SO_REUSEPORT</code></a> flag set, we are going to ignore them today.</p><p>Whenever the Linux kernel needs to bind a socket to a local port, it first has to look for the bind bucket for that port. What makes life a bit more complicated is the fact that the search for a TCP bind bucket exists in two places in the kernel. The bind bucket lookup can happen early - at <code>bind()</code> time - or late - at <code>connect()</code> time. Which one gets called depends on how the connected socket has been set up:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KOEGkTF7HEH7qku86bufY/8560316429d383b25add8c9f0b2bab3b/image5-5.png" />
            
            </figure><p>However, whether we land in <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L486"><code>inet_csk_get_port</code></a> or <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_hashtables.c#L992"><code>__inet_hash_connect</code></a>, we always end up walking the bucket chain in the bhash looking for the bucket with a matching port number. The bucket might already exist or we might have to create it first. But once it exists, its fastreuse field is in one of three possible states: <code>-1</code>, <code>0</code>, or <code>+1</code>. As if Linux developers were inspired by <a href="https://en.wikipedia.org/wiki/Triplet_state">quantum mechanics</a>.</p><p>That state reflects two aspects of the bind bucket:</p><ol><li><p>What sockets are in the bucket?</p></li><li><p>When can the local port be shared?</p></li></ol><p>So let us try to decipher the three possible fastreuse states then, and what they mean in each case.</p><p>First, what does the fastreuse property say about the owners of the bucket, that is the sockets using that local port?</p>
<table>
<thead>
  <tr>
    <th><span>fastreuse is</span></th>
    <th><span>owners list contains</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>-1</span></td>
    <td><span>sockets connect()'ed from an ephemeral port</span></td>
  </tr>
  <tr>
    <td><span>0</span></td>
    <td><span>sockets bound without SO_REUSEADDR</span></td>
  </tr>
  <tr>
    <td><span>+1</span></td>
    <td><span>sockets bound with SO_REUSEADDR</span></td>
  </tr>
</tbody>
</table><p>While this is not the whole truth, it is close enough for now. We will soon get to the bottom of it.</p><p>When it comes to port sharing, the situation is far less straightforward:</p>
<table>
<thead>
  <tr>
    <th><span>Can I … when …</span></th>
    <th><span>fastreuse = -1</span></th>
    <th><span>fastreuse = 0</span></th>
    <th><span>fastreuse = +1</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>bind() to the same port (ephemeral or specified)</span></td>
    <td><span>yes</span><span> IFF local IP is unique ①</span></td>
    <td><span>← </span><a href="https://en.wiktionary.org/wiki/idem#Pronoun"><span>idem</span></a></td>
    <td><span>← idem</span></td>
  </tr>
  <tr>
    <td><span>bind() to the specific port with SO_REUSEADDR</span></td>
    <td><span>yes</span><span> IFF local IP is unique OR conflicting socket uses SO_REUSEADDR ①</span></td>
    <td><span>← idem</span></td>
    <td><span>yes</span><span> ②</span></td>
  </tr>
  <tr>
    <td><span>connect() from the same ephemeral port to the same remote (IP, port)</span></td>
    <td><span>yes</span><span> IFF local IP unique ③</span></td>
    <td><span>no</span><span> ③</span></td>
    <td><span>no</span><span> ③</span></td>
  </tr>
  <tr>
    <td><span>connect() from the same ephemeral port to a unique remote (IP, port)</span></td>
    <td><span>yes</span><span> ③</span></td>
    <td><span>no</span><span> ③</span></td>
    <td><span>no</span><span> ③</span></td>
  </tr>
</tbody>
</table><p>① Determined by <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L214"><code>inet_csk_bind_conflict()</code></a> called from <code>inet_csk_get_port()</code> (specific port bind) or <code>inet_csk_get_port()</code> → <code>inet_csk_find_open_port()</code> (ephemeral port bind).</p><p>② Because <code>inet_csk_get_port()</code> <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L531">skips conflict check</a> for <code>fastreuse == 1</code> buckets.</p><p>③ Because <code>inet_hash_connect()</code> → <code>__inet_hash_connect()</code> <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_hashtables.c#L1062">skips buckets</a> with <code>fastreuse != -1</code>.</p><p>While it all looks rather complicated at first sight, we can distill the table above into a few statements that hold true and are a bit easier to digest:</p><ul><li><p><code>bind()</code>, or early local address allocation, always succeeds if there is no local IP address conflict with any existing socket,</p></li><li><p><code>connect()</code>, or late local address allocation, always fails when the TCP bind bucket for a local port is in any state other than <code>fastreuse = -1</code>,</p></li><li><p><code>connect()</code> only succeeds if there is no local and remote address conflict,</p></li><li><p>the <code>SO_REUSEADDR</code> socket option allows local address sharing if all conflicting sockets also use it (and none of them is in the listening state).</p></li></ul>
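<p>As a quick sanity check of the first statement - one that needs no sysctl tweaks or special privileges - we can trigger a local IP conflict at <code>bind()</code> time from plain Python. A sketch, assuming a Linux host where every 127.x.y.z address is reachable on loopback:</p>

```python
# Sketch of the first rule above: bind() succeeds as long as the local
# IP does not clash with an existing socket bound to the same port.
# Assumes Linux, where the whole 127.0.0.0/8 range works on loopback.
from errno import EADDRINUSE
from socket import AF_INET, SOCK_STREAM, socket

s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 0))            # let the OS pick a free port
port = s1.getsockname()[1]

s2 = socket(AF_INET, SOCK_STREAM)
try:
    s2.bind(('127.0.0.1', port))     # same local IP -> conflict
    conflict = None
except OSError as e:
    conflict = e.errno               # expect EADDRINUSE

s3 = socket(AF_INET, SOCK_STREAM)
s3.bind(('127.1.1.1', port))         # unique local IP -> allowed
```

<p>The second socket hits the conflict, while the third one, bound to a unique local IP, shares the port without any <code>SO_REUSEADDR</code> tricks.</p>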
    <div>
      <h3>This is crazy. I don’t believe you.</h3>
      <a href="#this-is-crazy-i-dont-believe-you">
        
      </a>
    </div>
    <p>Fortunately, you don't have to. With <a href="https://drgn.readthedocs.io/en/latest/index.html">drgn</a>, the programmable debugger, we can examine the bind bucket state on a live kernel:</p>
            <pre><code>#!/usr/bin/env drgn

"""
dump_bhash.py - List all TCP bind buckets in the current netns.

Script is not aware of VRF.
"""

import os

from drgn.helpers.linux.list import hlist_for_each, hlist_for_each_entry
from drgn.helpers.linux.net import get_net_ns_by_fd
from drgn.helpers.linux.pid import find_task


def dump_bind_bucket(head, net):
    for tb in hlist_for_each_entry("struct inet_bind_bucket", head, "node"):
        # Skip buckets not from this netns
        if tb.ib_net.net != net:
            continue

        port = tb.port.value_()
        fastreuse = tb.fastreuse.value_()
        owners_len = len(list(hlist_for_each(tb.owners)))

        print(
            "{:8d}  {:{sign}9d}  {:7d}".format(
                port,
                fastreuse,
                owners_len,
                sign="+" if fastreuse != 0 else " ",
            )
        )


def get_netns():
    pid = os.getpid()
    task = find_task(prog, pid)
    with open(f"/proc/{pid}/ns/net") as f:
        return get_net_ns_by_fd(task, f.fileno())


def main():
    print("{:8}  {:9}  {:7}".format("TCP-PORT", "FASTREUSE", "#OWNERS"))

    tcp_hashinfo = prog.object("tcp_hashinfo")
    net = get_netns()

    # Iterate over all bhash slots
    for i in range(0, tcp_hashinfo.bhash_size):
        head = tcp_hashinfo.bhash[i].chain
        # Iterate over bind buckets in the slot
        dump_bind_bucket(head, net)


main()</code></pre>
            <p>Let's take <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/dump_bhash.py">this script</a> for a spin and try to confirm what <i>Table 1</i> claims to be true. Keep in mind that to produce the <code>ipython --classic</code> session snippets below I've used the same setup as for the quiz questions.</p><p>Two connected sockets sharing ephemeral port 60,000:</p>
            <pre><code>&gt;&gt;&gt; s1 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s1.connect(('127.1.1.1', 1234))
&gt;&gt;&gt; s2 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s2.connect(('127.2.2.2', 1234))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        3
   60000         -1        2
&gt;&gt;&gt;</code></pre>
            <p>Two bound sockets reusing port 60,000:</p>
            <pre><code>&gt;&gt;&gt; s1 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
&gt;&gt;&gt; s1.bind(('127.1.1.1', 60_000))
&gt;&gt;&gt; s2 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
&gt;&gt;&gt; s2.bind(('127.1.1.1', 60_000))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        1
   60000         +1        2
&gt;&gt;&gt; </code></pre>
            <p>A mix of bound sockets with and without REUSEADDR sharing port 60,000:</p>
            <pre><code>&gt;&gt;&gt; s1 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
&gt;&gt;&gt; s1.bind(('127.1.1.1', 60_000))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        1
   60000         +1        1
&gt;&gt;&gt; s2 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s2.bind(('127.2.2.2', 60_000))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        1
   60000          0        2
&gt;&gt;&gt;</code></pre>
            <p>With such tooling, proving that <i>Table 2</i> holds true is just a matter of writing a bunch of <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/test_fastreuse.py">exploratory tests</a>.</p><p>But what has happened in that last snippet? The bind bucket has clearly transitioned from one fastreuse state to another. This is what <i>Table 1</i> fails to capture. And it means that we still don't have the full picture.</p><p>We have yet to find out when the bucket's fastreuse state can change. This calls for a state machine.</p>
    <div>
      <h3>Das State Machine</h3>
      <a href="#das-state-machine">
        
      </a>
    </div>
    <p>As we have just seen, a bind bucket does not need to stay in its initial fastreuse state throughout its lifetime. Adding sockets to the bucket can trigger a state change. As it turns out, it can only ever transition into <code>fastreuse = 0</code>, and that happens when we bind() a socket that:</p><ol><li><p>doesn't conflict with the existing owners, and</p></li><li><p>doesn't have the <code>SO_REUSEADDR</code> option enabled.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7awSZQJp00EDbGbvcpBNv9/cca9b2c49a3db5f3d1db62ce142ac46a/Untitled--1-.png" />
            
            </figure><p>And while we could have figured it all out by carefully reading the code in <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L431"><code>inet_csk_get_port → inet_csk_update_fastreuse</code></a>, it certainly doesn't hurt to confirm our understanding with <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/test_fastreuse_states.py">a few more tests</a>.</p><p>Now that we have the full picture, this raises the question...</p>
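<p>The transition rules can also be captured as a toy model - to be clear, this is not kernel code, just a restatement of the state machine above in Python:</p>

```python
# Toy model (not kernel code) of the fastreuse state transitions:
# a bucket's state can only ever move to 0, and only when a socket
# without SO_REUSEADDR is bind()'ed into it.
def fastreuse_after_add(state, via_connect, sk_reuse):
    if state is None:
        # A fresh bucket takes its state from the first socket:
        # connect() from an ephemeral port -> -1,
        # bind() with SO_REUSEADDR -> +1, without it -> 0.
        if via_connect:
            return -1
        return +1 if sk_reuse else 0
    # Existing bucket: only a non-REUSEADDR bind() changes the state.
    if not via_connect and not sk_reuse:
        return 0
    return state
```

<p>Feeding it the sequence from the last drgn session above reproduces the +1 → 0 transition we observed.</p>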
    <div>
      <h3>Why are you telling me all this?</h3>
      <a href="#why-are-you-telling-me-all-this">
        
      </a>
    </div>
    <p>Firstly, so that the next time the <code>bind()</code> syscall rejects your request with <code>EADDRINUSE</code>, or <code>connect()</code> refuses to cooperate by throwing the <code>EADDRNOTAVAIL</code> error, you will know what is happening, or at least have the tools to find out.</p><p>Secondly, because we have previously <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">advertised a technique</a> for opening connections from a specific range of ports, which involves bind()'ing sockets with the SO_REUSEADDR option. What we did not realize back then is that there exists <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/test_fastreuse.py#L300">a corner case</a> where the same port can't be shared with regular, <code>connect()</code>'ed sockets. While that is not a deal-breaker, it is good to understand the consequences.</p><p>To make things better, we have worked with the Linux community to extend the kernel API with a new socket option that lets the user <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177">specify the local port range</a>. The new option will be available in the upcoming Linux 6.3. With it we no longer have to resort to bind()-tricks. This makes it possible to yet again share a local port with regular <code>connect()</code>'ed sockets.</p>
    <div>
      <h2>Closing thoughts</h2>
      <a href="#closing-thoughts">
        
      </a>
    </div>
    <p>Today we posed a relatively straightforward question - when can two TCP sockets share a local address? - and worked our way towards an answer. An answer that is too complex to compress into a single sentence. What is more, it's not even the full answer. After all, we have decided to ignore the existence of the SO_REUSEPORT feature, and did not consider conflicts with TCP listening sockets.</p><p>If there is a simple takeaway, though, it is that bind()'ing a socket can have tricky consequences. When using bind() to select an egress IP address, it is best to combine it with the IP_BIND_ADDRESS_NO_PORT socket option and leave the port assignment to the kernel. Otherwise we might unintentionally block local TCP ports from being reused.</p><p>It is too bad that the same advice does not apply to UDP, where IP_BIND_ADDRESS_NO_PORT does not really work today. But that is another story.</p><p>Until next time.</p><p>If you enjoy scratching your head while reading the Linux kernel source code, <a href="https://www.cloudflare.com/careers/">we are hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">74q2VGXmBazVsIZUpVUD8o</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[Assembly within! BPF tail calls on x86 and ARM]]></title>
            <link>https://blog.cloudflare.com/assembly-within-bpf-tail-calls-on-x86-and-arm/</link>
            <pubDate>Mon, 10 Oct 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ We first adopted BPF tail calls when building our XDP-based packet processing pipeline. BPF tail calls have served us well since then. But they do have their caveats. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Early on when we learn to program, we get introduced to the concept of <a href="https://ocw.mit.edu/courses/6-00sc-introduction-to-computer-science-and-programming-spring-2011/resources/lecture-6-recursion/">recursion</a>. And that it is handy for computing, among other things, sequences defined in terms of recurrences. Such as the famous <a href="https://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a> - <i>F<sub>n</sub> = F<sub>n-1</sub> + F<sub>n-2</sub></i>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Zkbmk81PBl2lx62bzcNz4/ce5eb7474578577dd0c5d4c3284d73f2/Screenshot-2022-10-10-at-10.13.32.png" />
            
            </figure><p>Later on, perhaps when diving into multithreaded programming, we come to terms with the fact that <a href="https://textbook.cs161.org/memory-safety/x86.html#26-stack-pushing-and-popping">the stack space</a> for call frames is finite. And that there is an “okay” way and a “cool” way to calculate the Fibonacci numbers using recursion:</p>
            <pre><code>// fib_okay.c

#include &lt;stdint.h&gt;

uint64_t fib(uint64_t n)
{
        if (n == 0 || n == 1)
                return 1;

        return fib(n - 1) + fib(n - 2);
}</code></pre>
            <p>Listing 1. An okay Fibonacci number generator implementation</p>
            <pre><code>// fib_cool.c

#include &lt;stdint.h&gt;

static uint64_t fib_tail(uint64_t n, uint64_t a, uint64_t b)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;

    return fib_tail(n - 1, b, a + b);
}

uint64_t fib(uint64_t n)
{
    return fib_tail(n, 1, 1);
}</code></pre>
            <p>Listing 2. A better version of the same</p><p>If we take a look at the machine code the compiler produces, the “cool” variant translates to a nice and tight sequence of instructions:</p><p>⚠ DISCLAIMER: This blog post is assembly-heavy. We will be looking at assembly code for x86-64, arm64 and BPF architectures. If you need an introduction or a refresher, I can recommend <a href="https://github.com/Apress/low-level-programming">“Low-Level Programming”</a> by Igor Zhirkov for x86-64, and <a href="https://github.com/Apress/programming-with-64-bit-ARM-assembly-language">“Programming with 64-Bit ARM Assembly Language”</a> by Stephen Smith for arm64. For BPF, see the <a href="https://www.kernel.org/doc/html/latest/bpf/standardization/instruction-set.html">Linux kernel documentation</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7bahmmfAErsC7aO9LUTzTa/e685101ca35f092cd482993e3b86405b/Screenshot-2022-10-10-at-10.25.23.png" />
            
            </figure><p>Listing 3. <code>fib_cool.c</code> compiled for x86-64 and arm64</p><p>The “okay” variant, disappointingly, leads to <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-10-bpf-tail-call/fib_okay.x86-64.disasm">more instructions</a> than a listing can fit. It is a spaghetti of <a href="https://en.wikipedia.org/wiki/Basic_block">basic blocks</a>.</p><p><a href="https://raw.githubusercontent.com/cloudflare/cloudflare-blog/master/2022-10-bpf-tail-call/fib_okay.dot.png"><img src="https://raw.githubusercontent.com/cloudflare/cloudflare-blog/master/2022-10-bpf-tail-call/fib_okay.dot.png" /></a></p><p>But more importantly, it is not free of <a href="https://textbook.cs161.org/memory-safety/x86.html#29-x86-function-call-in-assembly">x86 call instructions</a>.</p>
            <pre><code>$ objdump -d fib_okay.o | grep call
 10c:   e8 00 00 00 00          call   111 &lt;fib+0x111&gt;
$ objdump -d fib_cool.o | grep call
$</code></pre>
            <p>This has an important consequence - as fib recursively calls itself, the stacks keep growing. We can observe it with a bit of <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-10-bpf-tail-call/trace_rsp.gdb">help from the debugger</a>.</p>
            <pre><code>$ gdb --quiet --batch --command=trace_rsp.gdb --args ./fib_okay 6
Breakpoint 1 at 0x401188: file fib_okay.c, line 3.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
n = 6, %rsp = 0xffffd920
n = 5, %rsp = 0xffffd900
n = 4, %rsp = 0xffffd8e0
n = 3, %rsp = 0xffffd8c0
n = 2, %rsp = 0xffffd8a0
n = 1, %rsp = 0xffffd880
n = 1, %rsp = 0xffffd8c0
n = 2, %rsp = 0xffffd8e0
n = 1, %rsp = 0xffffd8c0
n = 3, %rsp = 0xffffd900
n = 2, %rsp = 0xffffd8e0
n = 1, %rsp = 0xffffd8c0
n = 1, %rsp = 0xffffd900
13
[Inferior 1 (process 50904) exited normally]
$</code></pre>
            <p>While the “cool” variant makes no use of the stack.</p>
            <pre><code>$ gdb --quiet --batch --command=trace_rsp.gdb --args ./fib_cool 6
Breakpoint 1 at 0x40118a: file fib_cool.c, line 13.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
n = 6, %rsp = 0xffffd938
13
[Inferior 1 (process 50949) exited normally]
$</code></pre>
            
    <div>
      <h2>Where did the <code>calls</code> go?</h2>
      <a href="#where-did-the-calls-go">
        
      </a>
    </div>
    <p>The smart compiler turned the last function call in the body into a regular jump. Why was it allowed to do that?</p><p>It is the last instruction in the function body we are talking about. The caller stack frame is going to be destroyed right after we return anyway. So why keep it around when we can reuse it for the callee’s <a href="https://eli.thegreenplace.net/2011/02/04/where-the-top-of-the-stack-is-on-x86/">stack frame</a>?</p><p>This optimization, known as <a href="https://en.wikipedia.org/wiki/Tail_call#In_assembly">tail call elimination</a>, leaves us with no function calls in the “cool” variant of our fib implementation. There was only one call to eliminate - right at the end.</p><p>Once applied, the call becomes a jump (loop). If assembly is not your second language, decompiling the fib_cool.o object file with <a href="https://ghidra-sre.org/">Ghidra</a> helps see the transformation:</p>
            <pre><code>long fib(ulong param_1)

{
  long lVar1;
  long lVar2;
  long lVar3;
  
  if (param_1 &lt; 2) {
    lVar3 = 1;
  }
  else {
    lVar3 = 1;
    lVar2 = 1;
    do {
      lVar1 = lVar3;
      param_1 = param_1 - 1;
      lVar3 = lVar2 + lVar1;
      lVar2 = lVar1;
    } while (param_1 != 1);
  }
  return lVar3;
}</code></pre>
            <p>Listing 4. <code>fib_cool.o</code> decompiled by Ghidra</p><p>This is very much desired. Not only is the generated machine code much shorter. It is also way faster due to the lack of calls, which pop up on the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-10-bpf-tail-call/fib_okay_50.perf.txt#L85">profile</a> for <code>fib_okay</code>.</p><p>But I am no <a href="https://github.com/dendibakh/perf-ninja">performance ninja</a> and this blog post is not about compiler optimizations. So why am I telling you about it?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4pi98zHx1blOehbqwhB1LP/049a1532b192261cdfedade320e871a1/image5.jpg" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Lemur_catta_-_tail_length_01.jpg">Alex Dunkel (Maky), CC BY-SA 3.0</a>, via <a href="https://creativecommons.org/licenses/by-sa/3.0">Wikimedia Commons</a></p>
    <div>
      <h2>Tail calls in BPF</h2>
      <a href="#tail-calls-in-bpf">
        
      </a>
    </div>
    <p>The concept of tail call elimination made its way into the BPF world. Although not in the way you might expect. Yes, the LLVM compiler does get rid of the trailing function calls when building for <code>-target bpf</code>. The transformation happens at the intermediate representation level, so it is backend agnostic. This can save you some <a href="https://docs.cilium.io/en/stable/bpf/#bpf-to-bpf-calls">BPF-to-BPF function calls</a>, which you can spot by looking for <code>call -N</code> instructions in the BPF assembly.</p><p>However, when we talk about tail calls in the BPF context, we usually have something else in mind. And that is a mechanism, built into the BPF JIT compiler, for <a href="https://docs.cilium.io/en/stable/bpf/#tail-calls">chaining BPF programs</a>.</p><p>We first adopted BPF tail calls when building our <a href="https://legacy.netdevconf.info/0x13/session.html?talk-XDP-based-DDoS-mitigation">XDP-based packet processing pipeline</a>. Thanks to it, we were able to divide the processing logic into several XDP programs, each responsible for doing one thing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ANc6el4eTdaklbTHnlzPH/688282a0c8534d50fd1d8c1143c53058/image8.png" />
            
            </figure><p>Slide from “<a href="https://legacy.netdevconf.info/0x13/session.html?talk-XDP-based-DDoS-mitigation">XDP based DDoS Mitigation</a>” talk by <a href="/author/arthur/">Arthur Fabre</a></p><p>BPF tail calls have served us well since then. But they do have their caveats. Until recently it was impossible to have both BPF tail calls and BPF-to-BPF function calls in the same XDP program on arm64, which is one of the architectures we support.</p><p>Why? Before we get to that, we have to clarify what a BPF tail call actually does.</p>
    <div>
      <h2>A tail call is a tail call is a tail call</h2>
      <a href="#a-tail-call-is-a-tail-call-is-a-tail-call">
        
      </a>
    </div>
    <p>BPF exposes the tail call mechanism through the <a href="https://elixir.bootlin.com/linux/v5.15.63/source/include/uapi/linux/bpf.h#L1712">bpf_tail_call helper</a>, which we can invoke from our BPF code. We don’t directly point out which BPF program we would like to call. Instead, we pass it a BPF map (a container) capable of holding references to BPF programs (BPF_MAP_TYPE_PROG_ARRAY), and an index into the map.</p>
            <pre><code>long bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)

       Description
              This  special  helper is used to trigger a "tail call", or
              in other words, to jump into  another  eBPF  program.  The
              same  stack frame is used (but values on stack and in reg‐
              isters for the caller are not accessible to  the  callee).
              This  mechanism  allows  for  program chaining, either for
              raising the maximum number of available eBPF instructions,
              or  to  execute  given programs in conditional blocks. For
              security reasons, there is an upper limit to the number of
              successive tail calls that can be performed.</code></pre>
            <p><a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">bpf-helpers(7) man page</a></p><p>At first glance, this looks somewhat similar to the <a href="https://man7.org/linux/man-pages/man2/execve.2.html">execve(2) syscall</a>. It is easy to mistake it for a way to execute a new program from the current program context. To quote the excellent <a href="https://docs.cilium.io/en/stable/bpf/#tail-calls">BPF and XDP Reference Guide</a> from the Cilium project documentation:</p><blockquote><p><i>Tail calls can be seen as a mechanism that allows one BPF program to call another, without returning to the old program. Such a call has minimal overhead as unlike function calls, it is implemented as a long jump, reusing the same stack frame.</i></p></blockquote><p>But once we add <a href="https://docs.cilium.io/en/stable/bpf/#bpf-to-bpf-calls">BPF function calls</a> into the mix, it becomes clear that the BPF tail call mechanism is indeed an implementation of tail call elimination, rather than a way to replace one program with another:</p><blockquote><p><i>Tail calls, before the actual jump to the target program, will unwind only its current stack frame. As we can see in the example above, if a tail call occurs from within the sub-function, the function’s (func1) stack frame will be present on the stack when a program execution is at func2. Once the final function (func3) function terminates, all the previous stack frames will be unwinded and control will get back to the caller of BPF program caller.</i></p></blockquote><p>It is, alas, an implementation with sometimes surprising semantics. Consider the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-10-bpf-tail-call/tail_call_ex3.bpf.c">code below</a>, where a BPF function calls the <code>bpf_tail_call()</code> helper:</p>
            <pre><code>struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} bar SEC(".maps");

SEC("tc")
int serve_drink(struct __sk_buff *skb __unused)
{
    return 0xcafe;
}

static __noinline
int bring_order(struct __sk_buff *skb)
{
    bpf_tail_call(skb, &amp;bar, 0);
    return 0xf00d;
}

SEC("tc")
int server1(struct __sk_buff *skb)
{
    return bring_order(skb);    
}

SEC("tc")
int server2(struct __sk_buff *skb)
{
    __attribute__((musttail)) return bring_order(skb);  
}</code></pre>
            <p>We have two seemingly not so different BPF programs - <code>server1()</code> and <code>server2()</code>. They both call the same BPF function <code>bring_order()</code>. The function tail calls into the <code>serve_drink()</code> program, if the <code>bar[0]</code> map entry points to it (let’s assume that).</p><p>Do both <code>server1</code> and <code>server2</code> return the same value? Turns out that - no, they don’t. We get a hex <code>0xf00d</code> from <code>server1</code>, and a ☕ (<code>0xcafe</code>) from <code>server2</code>. How so?</p><p>First thing to notice is that a BPF tail call unwinds just the current function stack frame. Code past the <code>bpf_tail_call()</code> invocation in the function body never executes, provided the tail call is successful (the map entry was set, and the tail call limit has not been reached).</p><p>When the tail call finishes, control returns to the caller of the function which made the tail call. Applying this to our example, the control flow is <code>serverX() --&gt; bring_order() --&gt; bpf_tail_call() --&gt; serve_drink() -return-&gt; serverX()</code> for both programs.</p><p>The second thing to keep in mind is that the compiler does not know that the <code>bpf_tail_call()</code> helper changes the control flow. Hence, the unsuspecting compiler optimizes the code as if the execution would continue past the BPF tail call.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57RsaISELMEErCz7js2C0w/70efe8757f435fb6450ae26c50ff0f2b/image2-8.png" />
            
            </figure><p>The call graph for <code>server1()</code> and <code>server2()</code> is the same, but the return value differs due to build-time optimizations.</p><p>In our case, the compiler thinks it is okay to propagate the constant which <code>bring_order()</code> returns to <code>server1()</code>. Possibly catching us by surprise, if we didn’t check the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-10-bpf-tail-call/tail_call_ex3.bpf.disasm">generated BPF assembly</a>.</p><p>We can prevent it by <a href="https://clang.llvm.org/docs/AttributeReference.html#musttail">forcing the compiler</a> to make a tail call to <code>bring_order()</code>. This way we ensure that whatever <code>bring_order()</code> returns will be used as the <code>server2()</code> program result.</p><p>General rule - for least surprising results, use the <a href="https://clang.llvm.org/docs/AttributeReference.html#musttail"><code>musttail</code> attribute</a> when calling a function that contains a BPF tail call.</p><p>How does <code>bpf_tail_call()</code> work underneath, then? And why won’t the BPF verifier let us mix function calls with tail calls on arm64? Time to dig deeper.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5rp6nKgRPwuuXUzAyolOA9/1b740786194fceb11c2fdf26e1c5a1d7/image7.jpg" />
            
            </figure><p>Public Domain <a href="https://www.rawpixel.com/image/3578996/free-photo-image-bulldozer-construction-site">image</a></p>
    <div>
      <h2>BPF tail call on x86-64</h2>
      <a href="#bpf-tail-call-on-x86-64">
        
      </a>
    </div>
    <p>What does a <code>bpf_tail_call()</code> helper call translate to after BPF JIT for x86-64 has compiled it? How does the implementation guarantee that we don’t end up in a tail call loop forever?</p><p>To find out we will need to piece together a few things.</p><p>First, there is the BPF JIT compiler source code, which lives in <a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/x86/net/bpf_jit_comp.c"><code>arch/x86/net/bpf_jit_comp.c</code></a>. Its code is annotated with helpful comments. We will focus our attention on the following call chain within the JIT:</p><p><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/x86/net/bpf_jit_comp.c#L894"><span>do_jit()</span><span><br /></span></a><span>   </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/x86/net/bpf_jit_comp.c#L292"><span>emit_prologue()</span><span><br /></span></a><span>   </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/x86/net/bpf_jit_comp.c#L257"><span>push_callee_regs()</span><span><br /></span></a><span>   </span><span>for (i = 1; i &lt;= insn_cnt; i++, insn++) {</span><span><br /></span><span>     </span><span>switch (insn-&gt;code) {</span><span><br /></span><span>     </span><span>case BPF_JMP | BPF_CALL:</span><span><br /></span><span>       </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/x86/net/bpf_jit_comp.c#L1434"><span>/* emit function call */</span><span><br /></span></a><span>     </span><span>case BPF_JMP | BPF_TAIL_CALL:</span><span><br /></span><span>       </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/x86/net/bpf_jit_comp.c#L531"><span>emit_bpf_tail_call_direct()</span><span><br /></span></a><span>     </span><span>case BPF_JMP | BPF_EXIT:</span><span><br /></span><span>       </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/x86/net/bpf_jit_comp.c#L1693"><span>/* emit epilogue */</span><span><br /></span></a><span>     </span><span>}</span> <span><br /></span><span>  </span><span>}</span></p><p>It is sometimes hard to visualize the generated instruction stream just from reading the compiler code. Hence, we will also want to inspect the input - BPF instructions - and the output - x86-64 instructions - of the JIT compiler.</p><p>To inspect BPF and x86-64 instructions of a loaded BPF program, we can use <code>bpftool prog dump</code>. However, first we must populate the BPF map used as the tail call jump table. Otherwise, we might not be able to see the tail call jump!</p><p>This is due to <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=428d5df1fa4f28daf622c48dd19da35585c9053c">optimizations</a> that use instruction patching when the index into the program array is known at load time.</p>
            <pre><code># bpftool prog loadall ./tail_call_ex1.o /sys/fs/bpf pinmaps /sys/fs/bpf
# bpftool map update pinned /sys/fs/bpf/jmp_table key 0 0 0 0 value pinned /sys/fs/bpf/target_prog
# bpftool prog dump xlated pinned /sys/fs/bpf/entry_prog
int entry_prog(struct __sk_buff * skb):
; bpf_tail_call(skb, &amp;jmp_table, 0);
   0: (18) r2 = map[id:24]
   2: (b7) r3 = 0
   3: (85) call bpf_tail_call#12
; return 0xf00d;
   4: (b7) r0 = 61453
   5: (95) exit
# bpftool prog dump jited pinned /sys/fs/bpf/entry_prog
int entry_prog(struct __sk_buff * skb):
bpf_prog_4f697d723aa87765_entry_prog:
; bpf_tail_call(skb, &amp;jmp_table, 0);
   0:   nopl   0x0(%rax,%rax,1)
   5:   xor    %eax,%eax
   7:   push   %rbp
   8:   mov    %rsp,%rbp
   b:   push   %rax
   c:   movabs $0xffff888102764800,%rsi
  16:   xor    %edx,%edx
  18:   mov    -0x4(%rbp),%eax
  1e:   cmp    $0x21,%eax
  21:   jae    0x0000000000000037
  23:   add    $0x1,%eax
  26:   mov    %eax,-0x4(%rbp)
  2c:   nopl   0x0(%rax,%rax,1)
  31:   pop    %rax
  32:   jmp    0xffffffffffffffe3   // bug?
; return 0xf00d;
  37:   mov    $0xf00d,%eax
  3c:   leave
  3d:   ret</code></pre>
            <p>There is a caveat. The target addresses for tail call jumps in <code>bpftool prog dump jited</code> output will not make any sense. To discover the real jump targets, we have to peek into the kernel memory. That can be done with <code>gdb</code> after we find the address of our JIT’ed BPF programs in <code>/proc/kallsyms</code>:</p>
            <pre><code># tail -2 /proc/kallsyms
ffffffffa0000720 t bpf_prog_f85b2547b00cbbe9_target_prog        [bpf]
ffffffffa0000748 t bpf_prog_4f697d723aa87765_entry_prog [bpf]
# gdb -q -c /proc/kcore -ex 'x/18i 0xffffffffa0000748' -ex 'quit'
[New process 1]
Core was generated by `earlyprintk=serial,ttyS0,115200 console=ttyS0 psmouse.proto=exps "virtme_stty_c'.
#0  0x0000000000000000 in ?? ()
   0xffffffffa0000748:  nopl   0x0(%rax,%rax,1)
   0xffffffffa000074d:  xor    %eax,%eax
   0xffffffffa000074f:  push   %rbp
   0xffffffffa0000750:  mov    %rsp,%rbp
   0xffffffffa0000753:  push   %rax
   0xffffffffa0000754:  movabs $0xffff888102764800,%rsi
   0xffffffffa000075e:  xor    %edx,%edx
   0xffffffffa0000760:  mov    -0x4(%rbp),%eax
   0xffffffffa0000766:  cmp    $0x21,%eax
   0xffffffffa0000769:  jae    0xffffffffa000077f
   0xffffffffa000076b:  add    $0x1,%eax
   0xffffffffa000076e:  mov    %eax,-0x4(%rbp)
   0xffffffffa0000774:  nopl   0x0(%rax,%rax,1)
   0xffffffffa0000779:  pop    %rax
   0xffffffffa000077a:  jmp    0xffffffffa000072b
   0xffffffffa000077f:  mov    $0xf00d,%eax
   0xffffffffa0000784:  leave
   0xffffffffa0000785:  ret
# gdb -q -c /proc/kcore -ex 'x/7i 0xffffffffa0000720' -ex 'quit'
[New process 1]
Core was generated by `earlyprintk=serial,ttyS0,115200 console=ttyS0 psmouse.proto=exps "virtme_stty_c'.
#0  0x0000000000000000 in ?? ()
   0xffffffffa0000720:  nopl   0x0(%rax,%rax,1)
   0xffffffffa0000725:  xchg   %ax,%ax
   0xffffffffa0000727:  push   %rbp
   0xffffffffa0000728:  mov    %rsp,%rbp
   0xffffffffa000072b:  mov    $0xcafe,%eax
   0xffffffffa0000730:  leave
   0xffffffffa0000731:  ret
#</code></pre>
            <p>Lastly, it will be handy to have a cheat sheet of <a href="https://elixir.bootlin.com/linux/v5.15.63/source/arch/x86/net/bpf_jit_comp.c#L104">mapping</a> between BPF registers (<code>r0</code>, <code>r1</code>, …) to hardware registers (<code>rax</code>, <code>rdi</code>, …) that the JIT compiler uses.</p>
<table>
<thead>
  <tr>
    <th>BPF</th>
    <th>x86-64</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>r0</td>
    <td>rax</td>
  </tr>
  <tr>
    <td>r1</td>
    <td>rdi</td>
  </tr>
  <tr>
    <td>r2</td>
    <td>rsi</td>
  </tr>
  <tr>
    <td>r3</td>
    <td>rdx</td>
  </tr>
  <tr>
    <td>r4</td>
    <td>rcx</td>
  </tr>
  <tr>
    <td>r5</td>
    <td>r8</td>
  </tr>
  <tr>
    <td>r6</td>
    <td>rbx</td>
  </tr>
  <tr>
    <td>r7</td>
    <td>r13</td>
  </tr>
  <tr>
    <td>r8</td>
    <td>r14</td>
  </tr>
  <tr>
    <td>r9</td>
    <td>r15</td>
  </tr>
  <tr>
    <td>r10</td>
    <td>rbp</td>
  </tr>
  <tr>
    <td>internal</td>
    <td>r9-r12</td>
  </tr>
</tbody>
</table><p>Now we are prepared to work out what happens when we use a BPF tail call.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3cyhfC6M0zUcXpRmaGDf6b/26595d9ce05e788d01a23fc0c6836d96/image9.png" />
            
            </figure><p>In essence, <code>bpf_tail_call()</code> emits a jump into another function, reusing the current stack frame. It is just like a regular optimized tail call, but with a twist.</p><p>Because of the BPF security guarantees - execution terminates, no stack overflows - there is a limit on the number of tail calls we can have (<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebf7f6f0a6cdcc17a3da52b81e4b3a98c4005028"><code>MAX_TAIL_CALL_CNT = 33</code></a>).</p><p>Counting the tail calls across BPF programs is not something we can do at load-time. The jump table (BPF program array) contents can change after the program has been verified. Our only option is to keep track of tail calls at run-time. That is why the JIT’ed code for the <code>bpf_tail_call()</code> helper checks and updates the <code>tail_call_cnt</code> counter.</p><p>The updated count is then passed from one BPF program to another, and from one BPF function to another, as we will see, through the <code>rax</code> register (<code>r0</code> in BPF).</p><p>Luckily for us, the x86-64 calling convention dictates that the <code>rax</code> register does not partake in passing function arguments, but rather holds the function return value. The JIT can repurpose it to pass an additional - hidden - argument.</p><p>The function body is, however, free to make use of the <code>r0/rax</code> register in any way it pleases. This explains why we want to save the <code>tail_call_cnt</code> passed via <code>rax</code> onto the stack right after we jump to another program. <code>bpf_tail_call()</code> can later load the value from a known location on the stack.</p><p>This way, the <a href="https://elixir.bootlin.com/linux/v5.15.63/source/arch/x86/net/bpf_jit_comp.c#L1437">code emitted</a> for each <code>bpf_tail_call()</code> invocation, and the <a href="https://elixir.bootlin.com/linux/v5.15.63/source/arch/x86/net/bpf_jit_comp.c#L281">BPF function prologue</a> work in tandem, keeping track of the tail call count across BPF program boundaries.</p><p>But what if our BPF program is split up into several BPF functions, each with its own stack frame? What if these functions perform BPF tail calls? How is the tail call count tracked then?</p>
    <div>
      <h3>Mixing BPF function calls with BPF tail calls</h3>
      <a href="#mixing-bpf-function-calls-with-bpf-tail-calls">
        
      </a>
    </div>
    <p>BPF has its own terminology when it comes to functions and calling them, which is influenced by the internal implementation. Function calls are referred to as <a href="https://docs.cilium.io/en/stable/bpf/#bpf-to-bpf-calls">BPF to BPF calls</a>. Also, the main/entry function in your BPF code is called “the program”, while all other functions are known as “subprograms”.</p><p>Each call to a subprogram allocates a stack frame for local state, which persists until the function returns. Naturally, BPF subprogram calls can be nested, creating a call chain, just like nested function calls in user-space.</p><p>BPF subprograms are also allowed to make BPF tail calls. This, effectively, is a mechanism for extending the call chain to another BPF program and its subprograms.</p><p>If we cannot track how long the call chain can get, and how much stack space each function uses, we put ourselves at risk of <a href="https://en.wikipedia.org/wiki/Stack_overflow">overflowing the stack</a>. We cannot let this happen, so BPF enforces limitations on <a href="https://elixir.bootlin.com/linux/v5.15.62/source/kernel/bpf/verifier.c#L3600">when and how many BPF tail calls can be done</a>:</p>
            <pre><code>static int check_max_stack_depth(struct bpf_verifier_env *env)
{
        …
        /* protect against potential stack overflow that might happen when
         * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
         * depth for such case down to 256 so that the worst case scenario
         * would result in 8k stack size (32 which is tailcall limit * 256 =
         * 8k).
         *
         * To get the idea what might happen, see an example:
         * func1 -&gt; sub rsp, 128
         *  subfunc1 -&gt; sub rsp, 256
         *  tailcall1 -&gt; add rsp, 256
         *   func2 -&gt; sub rsp, 192 (total stack size = 128 + 192 = 320)
         *   subfunc2 -&gt; sub rsp, 64
         *   subfunc22 -&gt; sub rsp, 128
         *   tailcall2 -&gt; add rsp, 128
         *    func3 -&gt; sub rsp, 32 (total stack size 128 + 192 + 64 + 32 = 416)
         *
         * tailcall will unwind the current stack frame but it will not get rid
         * of caller's stack as shown on the example above.
         */
        if (idx &amp;&amp; subprog[idx].has_tail_call &amp;&amp; depth &gt;= 256) {
                verbose(env,
                        "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
                        depth);
                return -EACCES;
        }
        …
}</code></pre>
            <p>While the stack depth can be calculated by the BPF verifier at load-time, we still need to keep count of tail call jumps at run-time, even when subprograms are involved.</p><p>This means that we have to pass the tail call count from one BPF subprogram to another, just like we did when making a BPF tail call, so we yet again turn to value passing through the <code>rax</code> register.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3cRz7vr40Hl4l65tgG15gV/09d1d3980f2f4c3e61ed2ff5cb5021f0/image1-11.png" />
            
            </figure><p>Control flow in a BPF program with a function call followed by a tail call.</p><p>To keep things simple, BPF code in our examples does not allocate anything on the stack. I encourage you to check how the JIT’ed code changes when you <a href="https://elixir.bootlin.com/linux/v5.19.11/source/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf6.c#L37">add some local variables</a>. Just make sure the compiler does not optimize them out.</p><p>To make it work, we need to:</p><p>① load the tail call count saved on the stack into <code>rax</code> before <code>call</code>’ing the subprogram,<br />② adjust the subprogram prologue, so that it does not reset <code>rax</code> like the main program does,<br />③ save the passed tail call count on the subprogram’s stack for the <code>bpf_tail_call()</code> helper to consume it.</p><p>A <code>bpf_tail_call()</code> within our subprogram will then:</p><p>④ load the tail call count from the stack,<br />⑤ unwind the BPF stack, but keep the current subprogram’s stack frame intact, and<br />⑥ jump to the target BPF program.</p><p>Now we have seen how all the pieces of the puzzle fit together to make BPF tail calls work on x86-64 safely. The only open question is whether it works the same way on other platforms, like arm64. Time to shift gears and dive into a completely different BPF JIT implementation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4WJ7PZxCCS8jaahx6fGfu/a3c1568774c9c47fbf902f1959b3ec17/image10.jpg" />
            
            </figure><p>Based on an image by <a href="https://www.flickr.com/photos/pockethifi/48303582037/">Wutthichai Charoenburi</a>, <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></p>
    <div>
      <h2>Tail calls on arm64</h2>
      <a href="#tail-calls-on-arm64">
        
      </a>
    </div>
    <p>If you try loading a BPF program that uses both BPF function calls (aka BPF to BPF calls) and BPF tail calls on an arm64 machine running the latest 5.15 LTS kernel, or even the latest 5.19 stable kernel, the BPF verifier will kindly ask you to reconsider your choice:</p>
            <pre><code># uname -rm
5.19.12 aarch64
# bpftool prog loadall tail_call_ex2.o /sys/fs/bpf
libbpf: prog 'entry_prog': BPF program load failed: Invalid argument
libbpf: prog 'entry_prog': -- BEGIN PROG LOAD LOG --
0: R1=ctx(off=0,imm=0) R10=fp0
; __attribute__((musttail)) return sub_func(skb);
0: (85) call pc+1
caller:
 R10=fp0
callee:
 frame1: R1=ctx(off=0,imm=0) R10=fp0
; bpf_tail_call(skb, &amp;jmp_table, 0);
2: (18) r2 = 0xffffff80c38c7200       ; frame1: R2_w=map_ptr(off=0,ks=4,vs=4,imm=0)
4: (b7) r3 = 0                        ; frame1: R3_w=P0
5: (85) call bpf_tail_call#12
tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls
processed 4 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
-- END PROG LOAD LOG --
…
#</code></pre>
            <p>That is a pity! We have been looking forward to reaping the benefits of code sharing with BPF to BPF calls in our lengthy machine-generated BPF programs. So we asked - how hard could it be to make it work?</p><p>After all, the BPF <a href="https://elixir.bootlin.com/linux/v5.15.70/source/arch/arm64/net/bpf_jit_comp.c">JIT for arm64</a> can already handle BPF tail calls and BPF to BPF calls, when used in isolation.</p><p>It is “just” a matter of understanding the existing JIT implementation, which lives in <code>arch/arm64/net/bpf_jit_comp.c</code>, and identifying the missing pieces.</p><p>To understand how the BPF JIT for arm64 works, we will use the same method as before - look at its code together with sample input (BPF instructions) and output (arm64 instructions).</p><p>We don’t have to read the whole source code. It is enough to zero in on a few particular code paths:</p><p><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L1356"><span>bpf_int_jit_compile()</span><span><br /></span></a><span>   </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L246"><span>build_prologue()</span><span><br /></span></a><span>   </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L1287"><span>build_body()</span><span><br /></span></a><span>     </span><span>for (i = 0; i &lt; prog-&gt;len; i++) {</span><span><br /></span><span>        </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L663"><span>build_insn()</span><span><br /></span></a><span>          </span><span>switch (code) {</span><span><br /></span><span>          </span><span>case BPF_JMP | BPF_CALL:</span><span><br /></span><span>            </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L983"><span>/* emit function call */</span><span><br /></span></a><span>          </span><span>case BPF_JMP | BPF_TAIL_CALL:</span><span><br /></span><span>            </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L329"><span>emit_bpf_tail_call()</span><span><br /></span></a><span>          </span><span>}</span><span><br /></span><span>     </span><span>}</span><span><br /></span><span>   </span><a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L559"><span>build_epilogue()</span></a></p><p>One thing that the arm64 architecture, and RISC architectures in general, are known for is a plethora of general purpose registers (<code>x0-x30</code>). This is a good thing. We have more registers to allocate to JIT internal state, like the tail call count. A cheat sheet of what <a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L42">roles</a> the hardware registers play in the BPF JIT will be helpful:</p>
<table>
<thead>
  <tr>
    <th>BPF</th>
    <th>arm64</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>r0</td>
    <td>x7</td>
  </tr>
  <tr>
    <td>r1</td>
    <td>x0</td>
  </tr>
  <tr>
    <td>r2</td>
    <td>x1</td>
  </tr>
  <tr>
    <td>r3</td>
    <td>x2</td>
  </tr>
  <tr>
    <td>r4</td>
    <td>x3</td>
  </tr>
  <tr>
    <td>r5</td>
    <td>x4</td>
  </tr>
  <tr>
    <td>r6</td>
    <td>x19</td>
  </tr>
  <tr>
    <td>r7</td>
    <td>x20</td>
  </tr>
  <tr>
    <td>r8</td>
    <td>x21</td>
  </tr>
  <tr>
    <td>r9</td>
    <td>x22</td>
  </tr>
  <tr>
    <td>r10</td>
    <td>x25</td>
  </tr>
  <tr>
    <td>internal</td>
    <td>x9-x12, x26 (tail_call_cnt), x27</td>
  </tr>
</tbody>
</table><p>Now let’s try to understand the state of things by looking at the JIT’s input and output for two particular scenarios: (1) a BPF tail call, and (2) a BPF to BPF call.</p><p>It is hard to read assembly code selectively. We will have to go through all instructions one by one, and understand what each one is doing.</p><p>⚠ Brace yourself. Time to decipher a bit of ARM64 assembly. If this is your first time reading ARM64 assembly, you might want to at least skim through this <a href="https://modexp.wordpress.com/2018/10/30/arm64-assembly/">Guide to ARM64 / AArch64 Assembly on Linux</a> before diving in.</p><p>Scenario #1: A single BPF tail call - <a href="https://github.com/jsitnicki/cloudflare-blog/blob/jakub/2022-10-bpf-tail-calls/2022-10-bpf-tail-call/tail_call_ex1.bpf.c"><code>tail_call_ex1.bpf.c</code></a></p><p>Input: BPF assembly (<code>bpftool prog dump xlated</code>)</p>
            <pre><code>   0: (18) r2 = map[id:4]           // jmp_table map
   2: (b7) r3 = 0
   3: (85) call bpf_tail_call#12
   4: (b7) r0 = 61453               // 0xf00d
   5: (95) exit</code></pre>
            <p>Output: ARM64 assembly (<code>bpftool prog dump jited</code>)</p>
            <pre><code> 0:   paciasp                            // Sign LR (ROP protection) ①
 4:   stp     x29, x30, [sp, #-16]!      // Save FP and LR registers ②
 8:   mov     x29, sp                    // Set up Frame Pointer
 c:   stp     x19, x20, [sp, #-16]!      // Save callee-saved registers ③
10:   stp     x21, x22, [sp, #-16]!      // ⋮ 
14:   stp     x25, x26, [sp, #-16]!      // ⋮ 
18:   stp     x27, x28, [sp, #-16]!      // ⋮ 
1c:   mov     x25, sp                    // Set up BPF stack base register (r10)
20:   mov     x26, #0x0                  // Initialize tail_call_cnt ④
24:   sub     x27, x25, #0x0             // Calculate FP bottom ⑤
28:   sub     sp, sp, #0x200             // Set up BPF program stack ⑥
2c:   mov     x1, #0xffffff80ffffffff    // r2 = map[id:4] ⑦
30:   movk    x1, #0xc38c, lsl #16       // ⋮ 
34:   movk    x1, #0x7200                // ⋮
38:   mov     x2, #0x0                   // r3 = 0
3c:   mov     w10, #0x24                 // = offsetof(struct bpf_array, map.max_entries) ⑧
40:   ldr     w10, [x1, x10]             // Load array-&gt;map.max_entries
44:   add     w2, w2, #0x0               // = index (0)
48:   cmp     w2, w10                    // if (index &gt;= array-&gt;map.max_entries)
4c:   b.cs    0x0000000000000088         //     goto out;
50:   mov     w10, #0x21                 // = MAX_TAIL_CALL_CNT (33)
54:   cmp     x26, x10                   // if (tail_call_cnt &gt;= MAX_TAIL_CALL_CNT)
58:   b.cs    0x0000000000000088         //     goto out;
5c:   add     x26, x26, #0x1             // tail_call_cnt++;
60:   mov     w10, #0x110                // = offsetof(struct bpf_array, ptrs)
64:   add     x10, x1, x10               // = &amp;array-&gt;ptrs
68:   lsl     x11, x2, #3                // = index * sizeof(array-&gt;ptrs[0])
6c:   ldr     x11, [x10, x11]            // prog = array-&gt;ptrs[index];
70:   cbz     x11, 0x0000000000000088    // if (prog == NULL) goto out;
74:   mov     w10, #0x30                 // = offsetof(struct bpf_prog, bpf_func)
78:   ldr     x10, [x11, x10]            // Load prog-&gt;bpf_func
7c:   add     x10, x10, #0x24            // += PROLOGUE_OFFSET * AARCH64_INSN_SIZE (4)
80:   add     sp, sp, #0x200             // Unwind BPF stack
84:   br      x10                        // goto *(prog-&gt;bpf_func + prologue_offset)
88:   mov     x7, #0xf00d                // r0 = 0xf00d
8c:   add     sp, sp, #0x200             // Unwind BPF stack ⑨
90:   ldp     x27, x28, [sp], #16        // Restore used callee-saved registers
94:   ldp     x25, x26, [sp], #16        // ⋮
98:   ldp     x21, x22, [sp], #16        // ⋮
9c:   ldp     x19, x20, [sp], #16        // ⋮
a0:   ldp     x29, x30, [sp], #16        // ⋮
a4:   add     x0, x7, #0x0               // Set return value
a8:   autiasp                            // Authenticate LR
ac:   ret                                // Return to caller</code></pre>
            <p>① The BPF program prologue starts with Pointer Authentication Code (PAC), which protects against <a href="https://developer.arm.com/documentation/102433/0100/Return-oriented-programming">Return-Oriented Programming</a> attacks. PAC instructions are emitted by the JIT only if CONFIG_ARM64_PTR_AUTH_KERNEL is enabled.</p><p>② The <a href="https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst">Arm 64 Architecture Procedure Call Standard</a> <a href="https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-frame-pointer">mandates</a> that the caller’s Frame Pointer (register X29) and Link Register (register X30), aka the return address, should be recorded onto the stack.</p><p>③ Registers X19 to X28, and X29 (FP) plus X30 (LR), are callee-saved. The ARM64 BPF JIT currently does not use registers X23 and X24, so they are not saved.</p><p>④ We track the tail call depth in X26. There is no need to save it onto the stack since we use a register dedicated just for this purpose.</p><p>⑤ FP bottom is an optimization that allows stores/loads to the BPF stack <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b3d19b9bd4080d7f5e260f91ce8f639e19eb499">with a single instruction and an immediate offset value</a>.</p><p>⑥ Reserve space for the BPF program stack. The stack layout is now as shown in the <a href="https://elixir.bootlin.com/linux/v5.19.12/source/arch/arm64/net/bpf_jit_comp.c#L260">diagram in <code>build_prologue()</code></a> source code.</p><p>⑦ The BPF function body starts here.</p><p>⑧ The <code>bpf_tail_call()</code> instructions start here.</p><p>⑨ The epilogue starts here.</p><p>Whew! That was a handful.</p><p>Notice that the BPF tail call implementation on arm64 is not as optimized as on x86-64. There is no code patching to emit direct jumps when the target program index is known at JIT-compilation time. 
Instead, the target address is always loaded from the BPF program array.</p><p>Ready for the second scenario? I promise it will be shorter. Function prologue and epilogue instructions will look familiar, so we are going to keep annotations down to a minimum.</p><p>Scenario #2: A BPF to BPF call - <a href="https://github.com/jsitnicki/cloudflare-blog/blob/jakub/2022-10-bpf-tail-calls/2022-10-bpf-tail-call/sub_call_ex1.bpf.c"><code>sub_call_ex1.bpf.c</code></a></p><p>Input: BPF assembly (<code>bpftool prog dump xlated</code>)</p>
            <pre><code>int entry_prog(struct __sk_buff * skb):
   0: (85) call pc+1#bpf_prog_a84919ecd878b8f3_sub_func
   1: (95) exit
int sub_func(struct __sk_buff * skb):
   2: (b7) r0 = 61453                   // 0xf00d
   3: (95) exit</code></pre>
            <p>Output: ARM64 assembly</p>
            <pre><code>int entry_prog(struct __sk_buff * skb):
bpf_prog_163e74e7188910f2_entry_prog:
   0:   paciasp                                 // Begin prologue
   4:   stp     x29, x30, [sp, #-16]!           // ⋮
   8:   mov     x29, sp                         // ⋮
   c:   stp     x19, x20, [sp, #-16]!           // ⋮
  10:   stp     x21, x22, [sp, #-16]!           // ⋮
  14:   stp     x25, x26, [sp, #-16]!           // ⋮
  18:   stp     x27, x28, [sp, #-16]!           // ⋮
  1c:   mov     x25, sp                         // ⋮
  20:   mov     x26, #0x0                       // ⋮
  24:   sub     x27, x25, #0x0                  // ⋮
  28:   sub     sp, sp, #0x0                    // End prologue
  2c:   mov     x10, #0xffffffffffff5420        // Build sub_func()+0x0 address
  30:   movk    x10, #0x8ff, lsl #16            // ⋮
  34:   movk    x10, #0xffc0, lsl #32           // ⋮
  38:   blr     x10 ------------------.         // Call sub_func()+0x0 
  3c:   add     x7, x0, #0x0 &lt;----------.       // r0 = sub_func()
  40:   mov     sp, sp                | |       // Begin epilogue
  44:   ldp     x27, x28, [sp], #16   | |       // ⋮
  48:   ldp     x25, x26, [sp], #16   | |       // ⋮
  4c:   ldp     x21, x22, [sp], #16   | |       // ⋮
  50:   ldp     x19, x20, [sp], #16   | |       // ⋮
  54:   ldp     x29, x30, [sp], #16   | |       // ⋮
  58:   add     x0, x7, #0x0          | |       // ⋮
  5c:   autiasp                       | |       // ⋮
  60:   ret                           | |       // End epilogue
                                      | |
int sub_func(struct __sk_buff * skb): | |
bpf_prog_a84919ecd878b8f3_sub_func:   | |
   0:   paciasp &lt;---------------------' |       // Begin prologue
   4:   stp     x29, x30, [sp, #-16]!   |       // ⋮
   8:   mov     x29, sp                 |       // ⋮
   c:   stp     x19, x20, [sp, #-16]!   |       // ⋮
  10:   stp     x21, x22, [sp, #-16]!   |       // ⋮
  14:   stp     x25, x26, [sp, #-16]!   |       // ⋮
  18:   stp     x27, x28, [sp, #-16]!   |       // ⋮
  1c:   mov     x25, sp                 |       // ⋮
  20:   mov     x26, #0x0               |       // ⋮
  24:   sub     x27, x25, #0x0          |       // ⋮
  28:   sub     sp, sp, #0x0            |       // End prologue
  2c:   mov     x7, #0xf00d             |       // r0 = 0xf00d
  30:   mov     sp, sp                  |       // Begin epilogue
  34:   ldp     x27, x28, [sp], #16     |       // ⋮
  38:   ldp     x25, x26, [sp], #16     |       // ⋮
  3c:   ldp     x21, x22, [sp], #16     |       // ⋮
  40:   ldp     x19, x20, [sp], #16     |       // ⋮
  44:   ldp     x29, x30, [sp], #16     |       // ⋮
  48:   add     x0, x7, #0x0            |       // ⋮
  4c:   autiasp                         |       // ⋮
  50:   ret ----------------------------'       // End epilogue</code></pre>
            <p>We have now seen what a BPF tail call and a BPF function/subprogram call compile down to. Can you already spot what would go wrong if mixing the two was allowed?</p><p>That’s right! Every time we enter a BPF subprogram, we reset the X26 register, which holds the tail call count, to zero (<code>mov x26, #0x0</code>). This is bad. It would let users create program chains longer than the <code>MAX_TAIL_CALL_CNT</code> limit.</p><p>How about we just skip this step when emitting the prologue for BPF subprograms?</p>
            <pre><code>@@ -246,6 +246,7 @@ static bool is_lsi_offset(int offset, int scale)
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
        const struct bpf_prog *prog = ctx-&gt;prog;
+       const bool is_main_prog = prog-&gt;aux-&gt;func_idx == 0;
        const u8 r6 = bpf2a64[BPF_REG_6];
        const u8 r7 = bpf2a64[BPF_REG_7];
        const u8 r8 = bpf2a64[BPF_REG_8];
@@ -299,7 +300,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
        /* Set up BPF prog stack base register */
        emit(A64_MOV(1, fp, A64_SP), ctx);

-       if (!ebpf_from_cbpf) {
+       if (!ebpf_from_cbpf &amp;&amp; is_main_prog) {
                /* Initialize tail_call_cnt */
                emit(A64_MOVZ(1, tcc, 0, 0), ctx);</code></pre>
            <p>Believe it or not, this is everything that <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4609a5d8c70d21b4a3f801cf896a3c16c613fe1">was missing</a> to get BPF tail calls working with function calls on arm64. The feature will be enabled in the upcoming Linux 6.0 release.</p>
    <div>
      <h2>Outro</h2>
      <a href="#outro">
        
      </a>
    </div>
    <p>From recursion to tweaking the BPF JIT. How did we get here? Not important. It’s all about the journey.</p><p>Along the way we have unveiled a few secrets behind BPF tail calls, and hopefully quenched your thirst for low-level programming. At least for today.</p><p>All that is left is to sit back and watch the fruits of our work. With GDB hooked up to a VM, we can observe how a BPF program calls into a BPF function, and from there tail calls to another BPF program:</p><p><a href="https://demo-gdb-step-thru-bpf.pages.dev/">https://demo-gdb-step-thru-bpf.pages.dev/</a></p><p>Until next time.</p> ]]></content:encoded>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">2mPM5QrHXSfrNhO9mxBcWD</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[Missing Manuals - io_uring worker pool]]></title>
            <link>https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/</link>
            <pubDate>Fri, 04 Feb 2022 13:58:05 GMT</pubDate>
            <description><![CDATA[ Chances are you might have heard of io_uring. It first appeared in Linux 5.1, back in 2019, and was advertised as the new API for asynchronous I/O. Its goal was to be an alternative to the deemed-to-be-broken-beyond-repair AIO, the “old” asynchronous I/O API ]]></description>
            <content:encoded><![CDATA[ <p>Chances are you might have heard of <code>io_uring</code>. It first appeared in <a href="https://kernelnewbies.org/Linux_5.1#High-performance_asynchronous_I.2FO_with_io_uring">Linux 5.1</a>, back in 2019, and was <a href="https://lwn.net/Articles/776703/">advertised as the new API for asynchronous I/O</a>. Its goal was to be an alternative to the deemed-to-be-broken-beyond-repair <a href="/io_submit-the-epoll-alternative-youve-never-heard-about/">AIO</a>, the “old” asynchronous I/O API.</p><p>Calling <code>io_uring</code> just an asynchronous I/O API doesn’t do it justice, though. Underneath the API calls, io_uring is a full-blown runtime for processing I/O requests. One that spawns threads, sets up work queues, and dispatches requests for processing. All this happens “in the background” so that the user space process doesn’t have to, but can, block while waiting for its I/O requests to complete.</p><p>A runtime that spawns threads and manages the worker pool for the developer makes life easier, but using it in a project begs the questions:</p><p>1. How many threads will be created for my workload by default?</p><p>2. How can I monitor and control the thread pool size?</p><p>I could not find the answers to these questions in either the <a href="https://kernel.dk/io_uring.pdf">Efficient I/O with io_uring</a> article, or the <a href="https://unixism.net/loti/">Lord of the io_uring</a> guide – two well-known pieces of available documentation.</p><p>And while a recent enough <a href="https://manpages.debian.org/unstable/liburing-dev/io_uring_register.2.en.html"><code>io_uring</code> man page</a> touches on the topic:</p><blockquote><p>By default, <code>io_uring</code> limits the unbounded workers created to the maximum processor count set by <code>RLIMIT_NPROC</code> and the bounded workers is a function of the SQ ring size and the number of CPUs in the system.</p></blockquote><p>… it also leads to more questions:</p><p>3. 
What is an unbounded worker?</p><p>4. How does it differ from a bounded worker?</p><p>Things seem a bit under-documented as is, hence this blog post. Hopefully, it will provide the clarity needed to put <code>io_uring</code> to work in your project when the time comes.</p><p>Before we dig in, a word of warning. This post is not meant to be an introduction to <code>io_uring</code>. The existing documentation does a much better job at showing you the ropes than I ever could. Please give it a read first, if you are not familiar yet with the io_uring API.</p>
    <div>
      <h2>Not all I/O requests are created equal</h2>
      <a href="#not-all-i-o-requests-are-created-equal">
        
      </a>
    </div>
    <p><code>io_uring</code> can perform I/O on any kind of file descriptor, be it a regular file or a special file, like a socket. However, the kind of file descriptor that it operates on makes a difference when it comes to the size of the worker pool.</p><p>You see, <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2e480058ddc21ec53a10e8b41623e245e908bdbc">I/O requests get classified into two categories</a> by <code>io_uring</code>:</p><blockquote><p><code>io-wq</code> divides work into two categories:</p><p>1. Work that completes in a bounded time, like reading from a regular file or a block device. This type of work is limited based on the size of the SQ ring.</p><p>2. Work that may never complete, we call this unbounded work. The amount of workers here is limited by <code>RLIMIT_NPROC</code>.</p></blockquote><p>This answers the latter two of our open questions. Unbounded workers handle I/O requests that operate on neither regular files (<code>S_IFREG</code>) nor block devices (<code>S_ISBLK</code>). This is the case for network I/O, where we work with sockets (<code>S_IFSOCK</code>), and other special files like character devices (e.g. <code>/dev/null</code>).</p><p>We now also know that there are different limits in place for how many bounded vs unbounded workers there can be running. So we have to pick one before we dig further.</p>
    <div>
      <h2>Capping the unbounded worker pool size</h2>
      <a href="#capping-the-unbounded-worker-pool-size">
        
      </a>
    </div>
    <p>Pushing data through sockets is Cloudflare’s bread and butter, so this is what we are going to base our test workload around. To put it in <code>io_uring</code> lingo – we will be submitting unbounded work requests.</p><p>To observe how <code>io_uring</code> goes about creating workers, we will ask it to read from a UDP socket multiple times. No packets will arrive on the socket, so we will have full control over when the requests complete.</p><p>Here is our test workload - <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-io_uring-worker-pool/src/bin/udp_read.rs">udp_read.rs</a>.</p>
            <pre><code>$ ./target/debug/udp-read -h
udp-read 0.1.0
read from UDP socket with io_uring

USAGE:
    udp-read [FLAGS] [OPTIONS]

FLAGS:
    -a, --async      Set IOSQE_ASYNC flag on submitted SQEs
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c, --cpu &lt;cpu&gt;...                     CPU to run on when invoking io_uring_enter for Nth ring (specify multiple
                                           times) [default: 0]
    -w, --workers &lt;max-unbound-workers&gt;    Maximum number of unbound workers per NUMA node (0 - default, that is
                                           RLIMIT_NPROC) [default: 0]
    -r, --rings &lt;num-rings&gt;                Number io_ring instances to create per thread [default: 1]
    -t, --threads &lt;num-threads&gt;            Number of threads creating io_uring instances [default: 1]
    -s, --sqes &lt;sqes&gt;                      Number of read requests to submit per io_uring (0 - fill the whole queue)
                                           [default: 0]</code></pre>
            <p>While it is parametrized for easy experimentation, at its core it doesn’t do much. We fill the submission queue with read requests from a UDP socket and then wait for them to complete. But because data doesn’t arrive on the socket out of nowhere, and there are no timeouts set up, nothing happens. As a bonus, we have complete control over when requests complete, which will come in handy later.</p><p>Let’s run the test workload to convince ourselves that things are working as expected. <code>strace</code> won’t be very helpful when using <code>io_uring</code>. We won’t be able to tie I/O requests to system calls. Instead, we will have to turn to in-kernel tracing.</p><p>Thankfully, <code>io_uring</code> comes with a set of ready to use static tracepoints, which save us the trouble of digging through the source code to decide where to hook up dynamic tracepoints, known as <a href="https://docs.kernel.org/trace/kprobes.html">kprobes</a>.</p><p>We can discover the tracepoints with <a href="https://man7.org/linux/man-pages/man1/perf-list.1.html"><code>perf list</code></a> or <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#4--l-listing-probes"><code>bpftrace -l</code></a>, or by browsing the <code>events/</code> directory on the <a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt"><code>tracefs filesystem</code></a>, usually mounted under <code>/sys/kernel/tracing</code>.</p>
            <pre><code>$ sudo perf list 'io_uring:*'

List of pre-defined events (to be used in -e):

  io_uring:io_uring_complete                         [Tracepoint event]
  io_uring:io_uring_cqring_wait                      [Tracepoint event]
  io_uring:io_uring_create                           [Tracepoint event]
  io_uring:io_uring_defer                            [Tracepoint event]
  io_uring:io_uring_fail_link                        [Tracepoint event]
  io_uring:io_uring_file_get                         [Tracepoint event]
  io_uring:io_uring_link                             [Tracepoint event]
  io_uring:io_uring_poll_arm                         [Tracepoint event]
  io_uring:io_uring_poll_wake                        [Tracepoint event]
  io_uring:io_uring_queue_async_work                 [Tracepoint event]
  io_uring:io_uring_register                         [Tracepoint event]
  io_uring:io_uring_submit_sqe                       [Tracepoint event]
  io_uring:io_uring_task_add                         [Tracepoint event]
  io_uring:io_uring_task_run                         [Tracepoint event]</code></pre>
            <p>Judging by the number of tracepoints to choose from, <code>io_uring</code> takes visibility seriously. To help us get our bearings, here is a diagram that maps out paths an I/O request can take inside io_uring code annotated with tracepoint names – not all of them, just those which will be useful to us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MeUxepPVh1riK3jpiWQ0m/e29927eafad9fc96d30473af1f637f5b/image3-8.png" />
            
            </figure><p>Starting on the left, we expect our toy workload to push entries onto the submission queue. When we publish submitted entries by calling <a href="https://manpages.debian.org/unstable/liburing-dev/io_uring_enter.2.en.html"><code>io_uring_enter()</code></a>, the kernel consumes the submission queue and constructs internal request objects. A side effect we can observe is a hit on the <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io_uring.c#L7193"><code>io_uring:io_uring_submit_sqe</code></a> tracepoint.</p>
            <pre><code>$ sudo perf stat -e io_uring:io_uring_submit_sqe -- timeout 1 ./udp-read

 Performance counter stats for 'timeout 1 ./udp-read':

              4096      io_uring:io_uring_submit_sqe

       1.049016083 seconds time elapsed

       0.003747000 seconds user
       0.013720000 seconds sys</code></pre>
            <p>But, as it turns out, submitting entries is not enough to make <code>io_uring</code> spawn worker threads. Our process remains single-threaded:</p>
            <pre><code>$ ./udp-read &amp; p=$!; sleep 1; ps -o thcount $p; kill $p; wait $p
[1] 25229
THCNT
    1
[1]+  Terminated              ./udp-read</code></pre>
            <p>This shows that <code>io_uring</code> is smart. It knows that sockets support non-blocking I/O, and that they <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io_uring.c#L2837">can be polled for readiness to read</a>.</p><p>So, by default, <code>io_uring</code> performs a non-blocking read on sockets. This is bound to fail with <code>-EAGAIN</code> in our case. What follows is that <code>io_uring</code> registers a wake-up call (<a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io_uring.c#L5570"><code>io_async_wake()</code></a>) for when the socket becomes readable. There is no need to perform a blocking read when we can wait to be notified.</p><p>This resembles polling the socket with <code>select()</code> or <code>[e]poll()</code> from user space. There is no timeout unless we ask for one explicitly by submitting an <code>IORING_OP_LINK_TIMEOUT</code> request. <code>io_uring</code> will simply wait indefinitely.</p><p>We can observe <code>io_uring</code> when it calls <a href="https://elixir.bootlin.com/linux/v5.15.16/source/include/linux/poll.h#L86"><code>vfs_poll</code></a>, the machinery behind non-blocking I/O, to monitor the sockets. If that happens, we will be hitting the <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io_uring.c#L5691"><code>io_uring:io_uring_poll_arm</code></a> tracepoint. Meanwhile, the wake-ups that follow, if the polled file becomes ready for I/O, can be recorded with the <code>io_uring:io_uring_poll_wake</code> tracepoint embedded in the <code>io_async_wake()</code> wake-up call.</p><p>This is what we are experiencing. <code>io_uring</code> is polling the socket for read-readiness:</p>
            <pre><code>$ sudo bpftrace -lv t:io_uring:io_uring_poll_arm
tracepoint:io_uring:io_uring_poll_arm
    void * ctx
    void * req
    u8 opcode
    u64 user_data
    int mask
    int events      
$ sudo bpftrace -e 't:io_uring:io_uring_poll_arm { @[probe, args-&gt;opcode] = count(); } i:s:1 { exit(); }' -c ./udp-read
Attaching 2 probes...


@[tracepoint:io_uring:io_uring_poll_arm, 22]: 4096
$ sudo bpftool btf dump id 1 format c | grep 'IORING_OP_.*22'
        IORING_OP_READ = 22,
$</code></pre>
            <p>To make <code>io_uring</code> spawn worker threads, we have to force the read requests to be processed concurrently in a blocking fashion. We can do this by marking the I/O requests as asynchronous. As <a href="https://manpages.debian.org/bullseye/liburing-dev/io_uring_enter.2.en.html"><code>io_uring_enter(2) man-page</code></a> says:</p>
            <pre><code>  IOSQE_ASYNC
         Normal operation for io_uring is to try and  issue  an
         sqe  as non-blocking first, and if that fails, execute
         it in an async manner. To support more efficient over‐
         lapped  operation  of  requests  that  the application
         knows/assumes will always (or most of the time) block,
         the  application can ask for an sqe to be issued async
         from the start. Available since 5.6.</code></pre>
            <p>This will trigger a call to <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io_uring.c#L1482"><code>io_queue_sqe() → io_queue_async_work()</code></a>, which deep down invokes <a href="https://elixir.bootlin.com/linux/v5.15.16/source/kernel/fork.c#L2520"><code>create_io_worker() → create_io_thread()</code></a> to spawn a new task to process work. Remember that last function, <code>create_io_thread()</code> – it will come up again later.</p><p>Our toy program sets the <code>IOSQE_ASYNC</code> flag on requests when we pass the <code>--async</code> command line option to it. Let’s give it a try:</p>
            <pre><code>$ ./udp-read --async &amp; pid=$!; sleep 1; ps -o pid,thcount $pid; kill $pid; wait $pid
[2] 3457597
    PID THCNT
3457597  4097
[2]+  Terminated              ./udp-read --async
$</code></pre>
            <p>The thread count went up by the number of submitted I/O requests (4,096). And there is one extra thread - the main thread. <code>io_uring</code> has spawned workers.</p><p>If we trace it again, we see that requests are now taking the blocking-read path, and we are hitting the <code>io_uring:io_uring_queue_async_work</code> tracepoint on the way.</p>
            <pre><code>$ sudo perf stat -a -e io_uring:io_uring_poll_arm,io_uring:io_uring_queue_async_work -- ./udp-read --async
^C./udp-read: Interrupt

 Performance counter stats for 'system wide':

                 0      io_uring:io_uring_poll_arm
              4096      io_uring:io_uring_queue_async_work

       1.335559294 seconds time elapsed

$</code></pre>
            <p>In the code, the fork happens in the <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io_uring.c#L7046"><code>io_queue_sqe()</code> function</a>, where we are now branching off to <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io_uring.c#L1482"><code>io_queue_async_work()</code></a>, which contains the corresponding tracepoint.</p><p>We got what we wanted. We are now using the worker thread pool.</p><p>However, having 4,096 threads just for reading one socket sounds like overkill. If we were to limit the number of worker threads, how would we go about that? There are four ways I know of.</p>
    <div>
      <h3>Method 1 - Limit the number of in-flight requests</h3>
      <a href="#method-1-limit-the-number-of-in-flight-requests">
        
      </a>
    </div>
    <p>If we take care to never have more than some number of in-flight blocking I/O requests, then we will have more or less the same number of workers. This is because:</p><ol><li><p><code>io_uring</code> spawns workers only when there is work to process. We control how many requests we submit and can throttle new submissions based on completion notifications.</p></li><li><p><code>io_uring</code> retires workers when there is no more pending work in the queue. Although, there is a grace period before a worker dies.</p></li></ol><p>The downside of this approach is that by throttling submissions, we reduce batching. We will have to drain the completion queue, refill the submission queue, and switch context with <code>io_uring_enter()</code> syscall more often.</p><p>We can convince ourselves that this method works by tweaking the number of submitted requests, and observing the thread count as the requests complete. The <code>--sqes &lt;n&gt;</code> option (<b>s</b>ubmission <b>q</b>ueue <b>e</b>ntrie<b>s</b>) controls how many read requests get queued by our workload. If we want a request to complete, we simply need to send a packet toward the UDP socket we are reading from. The workload does not refill the submission queue.</p>
            <pre><code>$ ./udp-read --async --sqes 8 &amp; pid=$!
[1] 7264
$ ss -ulnp | fgrep pid=$pid
UNCONN 0      0          127.0.0.1:52763      0.0.0.0:*    users:(("udp-read",pid=7264,fd=3))
$ ps -o thcount $pid; nc -zu 127.0.0.1 52763; echo -e '\U1F634'; sleep 5; ps -o thcount $pid
THCNT
    9
😴
THCNT
    8
$</code></pre>
            <p>After sending one packet, the run queue length shrinks by one, and the thread count soon follows.</p><p>This works, but we can do better.</p>
    <div>
      <h3>Method 2 - Configure IORING_REGISTER_IOWQ_MAX_WORKERS</h3>
      <a href="#method-2-configure-ioring_register_iowq_max_workers">
        
      </a>
    </div>
    <p>In 5.15 the <a href="https://manpages.debian.org/unstable/liburing-dev/io_uring_register.2.en.html"><code>io_uring_register()</code> syscall</a> gained a new command for setting the maximum number of bounded and unbounded workers.</p>
            <pre><code>  IORING_REGISTER_IOWQ_MAX_WORKERS
         By default, io_uring limits the unbounded workers cre‐
         ated   to   the   maximum   processor   count  set  by
         RLIMIT_NPROC and the bounded workers is a function  of
         the SQ ring size and the number of CPUs in the system.
         Sometimes this can be excessive (or  too  little,  for
         bounded),  and  this  command provides a way to change
         the count per ring (per NUMA node) instead.

         arg must be set to an unsigned int pointer to an array
         of  two values, with the values in the array being set
         to the maximum count of workers per NUMA node. Index 0
         holds  the bounded worker count, and index 1 holds the
         unbounded worker  count.  On  successful  return,  the
         passed  in array will contain the previous maximum va‐
         lues for each type. If the count being passed in is 0,
         then  this  command returns the current maximum values
         and doesn't modify the current setting.  nr_args  must
         be set to 2, as the command takes two values.

         Available since 5.15.</code></pre>
            <p>By the way, if you would like to grep through the <code>io_uring</code> man pages, they live in the <a href="https://github.com/axboe/liburing">liburing</a> repo maintained by <a href="https://twitter.com/axboe">Jens Axboe</a> – not the go-to repo for Linux API <a href="https://github.com/mkerrisk/man-pages">man-pages</a> maintained by <a href="https://twitter.com/mkerrisk">Michael Kerrisk</a>.</p><p>Since it is a fresh addition to the <code>io_uring</code> API, the <a href="https://docs.rs/io-uring/latest/io_uring/"><code>io-uring</code></a> Rust library we are using has not caught up yet. But with <a href="https://github.com/tokio-rs/io-uring/pull/121">a bit of patching</a>, we can make it work.</p><p>We can tell our toy program to set <code>IORING_REGISTER_IOWQ_MAX_WORKERS (= 19 = 0x13)</code> by running it with the <code>--workers &lt;N&gt;</code> option:</p>
            <pre><code>$ strace -o strace.out -e io_uring_register ./udp-read --async --workers 8 &amp;
[1] 3555377
$ pstree -pt $!
strace(3555377)───udp-read(3555380)─┬─{iou-wrk-3555380}(3555381)
                                    ├─{iou-wrk-3555380}(3555382)
                                    ├─{iou-wrk-3555380}(3555383)
                                    ├─{iou-wrk-3555380}(3555384)
                                    ├─{iou-wrk-3555380}(3555385)
                                    ├─{iou-wrk-3555380}(3555386)
                                    ├─{iou-wrk-3555380}(3555387)
                                    └─{iou-wrk-3555380}(3555388)
$ cat strace.out
io_uring_register(4, 0x13 /* IORING_REGISTER_??? */, 0x7ffd9b2e3048, 2) = 0
$</code></pre>
            <p>This works perfectly. We have spawned just eight <code>io_uring</code> worker threads to handle the 4,096 submitted read requests.</p><p>One question remains - is the configured limit per <code>io_uring</code> instance? Per thread? Per process? Per UID? Read on to find out.</p>
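<p>Until the library catches up, the raw syscall is easy enough to issue by hand. Here is a sketch in C for a 5.15+ kernel - the wrapper name is ours, while the opcode value (19 = 0x13) and the two-element array layout come from the man page excerpt above:</p>

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Not all libc headers ship these definitions yet. */
#ifndef SYS_io_uring_register
#define SYS_io_uring_register 427 /* x86-64 syscall number */
#endif
#ifndef IORING_REGISTER_IOWQ_MAX_WORKERS
#define IORING_REGISTER_IOWQ_MAX_WORKERS 19 /* 0x13, since Linux 5.15 */
#endif

/* Cap one ring's worker pools: index 0 = bounded, index 1 = unbounded.
 * Returns 0 on success, -1 with errno set on failure. */
long set_max_workers(int ring_fd, unsigned int bounded, unsigned int unbounded)
{
    unsigned int counts[2] = { bounded, unbounded };
    return syscall(SYS_io_uring_register, ring_fd,
                   IORING_REGISTER_IOWQ_MAX_WORKERS, counts, 2);
}
```

<p>Per the man page, passing <code>{0, 0}</code> queries the current limits without changing them.</p>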
    <div>
      <h3>Method 3 - Set RLIMIT_NPROC resource limit</h3>
      <a href="#method-3-set-rlimit_nproc-resource-limit">
        
      </a>
    </div>
    <p>A resource limit for the maximum number of new processes is another way to cap the worker pool size. The documentation for the <code>IORING_REGISTER_IOWQ_MAX_WORKERS</code> command mentions this.</p><p>This resource limit overrides the <code>IORING_REGISTER_IOWQ_MAX_WORKERS</code> setting, which makes sense because bumping <code>RLIMIT_NPROC</code> above the configured hard maximum requires <code>CAP_SYS_RESOURCE</code> capability.</p><p>The catch is that the limit is tracked <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=21d1c5e386bc751f1953b371d72cd5b7d9c9e270">per UID within a user namespace</a>.</p><p>Setting the new process limit without using a dedicated UID or outside a dedicated user namespace, where other processes are running under the same UID, can have surprising effects.</p><p>Why? io_uring will try over and over again to scale up the worker pool, only to generate a bunch of <code>-EAGAIN</code> errors from <code>create_io_worker()</code> if it can’t reach the configured <code>RLIMIT_NPROC</code> limit:</p>
            <pre><code>$ prlimit --nproc=8 ./udp-read --async &amp;
[1] 26348
$ ps -o thcount $!
THCNT
    3
$ sudo bpftrace --btf -e 'kr:create_io_thread { @[retval] = count(); } i:s:1 { print(@); clear(@); } END { clear(@); }' -c '/usr/bin/sleep 3' | cat -s
Attaching 3 probes...
@[-11]: 293631
@[-11]: 306150
@[-11]: 311959

$ mpstat 1 3
Linux 5.15.9-cloudflare-2021.12.8 (bullseye)    01/04/22        _x86_64_        (4 CPU)
                                   …
02:52:46     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:52:47     all    0.00    0.00   25.00    0.00    0.00    0.00    0.00    0.00    0.00   75.00
02:52:48     all    0.00    0.00   25.13    0.00    0.00    0.00    0.00    0.00    0.00   74.87
02:52:49     all    0.00    0.00   25.30    0.00    0.00    0.00    0.00    0.00    0.00   74.70
Average:     all    0.00    0.00   25.14    0.00    0.00    0.00    0.00    0.00    0.00   74.86
$</code></pre>
            <p>We are hogging one core trying to spawn new workers. This is not the best use of CPU time.</p><p>So, if you want to use <code>RLIMIT_NPROC</code> as a safety cap over the <code>IORING_REGISTER_IOWQ_MAX_WORKERS</code> limit, you better use a “fresh” UID or a throw-away user namespace:</p>
            <pre><code>$ unshare -U prlimit --nproc=8 ./udp-read --async --workers 16 &amp;
[1] 3555870
$ ps -o thcount $!
THCNT
    9</code></pre>
            
    <div>
      <h3>Anti-Method 4 - cgroup process limit - pids.max file</h3>
      <a href="#anti-method-4-cgroup-process-limit-pids-max-file">
        
      </a>
    </div>
    <p>There is also one other way to cap the worker pool size – <a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#pid-interface-files">limit the number of tasks</a> (that is, processes and their threads) in a control group.</p><p>It is an anti-example and a potential misconfiguration to watch out for, because just like with <code>RLIMIT_NPROC</code>, we can fall into the same trap where <code>io_uring</code> will burn CPU:</p>
            <pre><code>$ systemd-run --user -p TasksMax=128 --same-dir --collect --service-type=exec ./udp-read --async
Running as unit: run-ra0336ff405f54ad29726f1e48d6a3237.service
$ systemd-cgls --user-unit run-ra0336ff405f54ad29726f1e48d6a3237.service
Unit run-ra0336ff405f54ad29726f1e48d6a3237.service (/user.slice/user-1000.slice/user@1000.service/app.slice/run-ra0336ff405f54ad29726f1e48d6a3237.service):
└─823727 /blog/io-uring-worker-pool/./udp-read --async
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/run-ra0336ff405f54ad29726f1e48d6a3237.service/pids.max
128
$ ps -o thcount 823727
THCNT
  128
$ sudo bpftrace --btf -e 'kr:create_io_thread { @[retval] = count(); } i:s:1 { print(@); clear(@); }'
Attaching 2 probes...
@[-11]: 163494
@[-11]: 173134
@[-11]: 184887
^C

@[-11]: 76680
$ systemctl --user stop run-ra0336ff405f54ad29726f1e48d6a3237.service
$</code></pre>
            <p>Here, we again see <code>io_uring</code> wasting time trying to spawn more workers without success. The kernel does not let the number of tasks within the service’s control group go over the limit.</p><p>Okay, so now we know the best and the worst ways to put a limit on the number of <code>io_uring</code> workers. But is the limit per <code>io_uring</code> instance? Per user? Or something else?</p>
    <div>
      <h2>One ring, two ring, three ring, four …</h2>
      <a href="#one-ring-two-ring-three-ring-four">
        
      </a>
    </div>
    <p>Your process is not limited to one instance of io_uring, naturally. In the case of a network proxy, where we push data from one socket to another, we could have one instance of io_uring servicing each half of the proxy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TXCzhgUiUGocx8WP9rJ8B/b77b4b9d89414398674d069dd80ae749/image2-3.png" />
            
            </figure><p>How many worker threads will be created in the presence of multiple <code>io_urings</code>? That depends on whether your program is single- or multithreaded.</p><p>In the single-threaded case, if the main thread creates two io_urings and configures each io_uring to have a maximum of two unbounded workers, then:</p>
            <pre><code>$ unshare -U ./udp-read --async --threads 1 --rings 2 --workers 2 &amp;
[3] 3838456
$ pstree -pt $!
udp-read(3838456)─┬─{iou-wrk-3838456}(3838457)
                  └─{iou-wrk-3838456}(3838458)
$ ls -l /proc/3838456/fd
total 0
lrwx------ 1 vagrant vagrant 64 Dec 26 03:32 0 -&gt; /dev/pts/0
lrwx------ 1 vagrant vagrant 64 Dec 26 03:32 1 -&gt; /dev/pts/0
lrwx------ 1 vagrant vagrant 64 Dec 26 03:32 2 -&gt; /dev/pts/0
lrwx------ 1 vagrant vagrant 64 Dec 26 03:32 3 -&gt; 'socket:[279241]'
lrwx------ 1 vagrant vagrant 64 Dec 26 03:32 4 -&gt; 'anon_inode:[io_uring]'
lrwx------ 1 vagrant vagrant 64 Dec 26 03:32 5 -&gt; 'anon_inode:[io_uring]'</code></pre>
            <p>… a total of two worker threads will be spawned.</p><p>While in the case of a multithreaded program, where two threads create one <code>io_uring</code> each, with a maximum of two unbounded workers per ring:</p>
            <pre><code>$ unshare -U ./udp-read --async --threads 2 --rings 1 --workers 2 &amp;
[2] 3838223
$ pstree -pt $!
udp-read(3838223)─┬─{iou-wrk-3838224}(3838227)
                  ├─{iou-wrk-3838224}(3838228)
                  ├─{iou-wrk-3838225}(3838226)
                  ├─{iou-wrk-3838225}(3838229)
                  ├─{udp-read}(3838224)
                  └─{udp-read}(3838225)
$ ls -l /proc/3838223/fd
total 0
lrwx------ 1 vagrant vagrant 64 Dec 26 02:53 0 -&gt; /dev/pts/0
lrwx------ 1 vagrant vagrant 64 Dec 26 02:53 1 -&gt; /dev/pts/0
lrwx------ 1 vagrant vagrant 64 Dec 26 02:53 2 -&gt; /dev/pts/0
lrwx------ 1 vagrant vagrant 64 Dec 26 02:53 3 -&gt; 'socket:[279160]'
lrwx------ 1 vagrant vagrant 64 Dec 26 02:53 4 -&gt; 'socket:[279819]'
lrwx------ 1 vagrant vagrant 64 Dec 26 02:53 5 -&gt; 'anon_inode:[io_uring]'
lrwx------ 1 vagrant vagrant 64 Dec 26 02:53 6 -&gt; 'anon_inode:[io_uring]'</code></pre>
            <p>… four workers will be spawned in total – two for each of the program threads. This is reflected by the owner thread ID present in the worker’s name (<code>iou-wrk-&lt;tid&gt;</code>).</p><p>So you might think - “It makes sense! Each thread has its own dedicated pool of I/O workers, which service all the <code>io_uring</code> instances operated by that thread.”</p><p>And you would be right<sup>1</sup>. If we follow the code – <code>task_struct</code> has an instance of <code>io_uring_task</code>, aka the <code>io_uring</code> context for the task<sup>2</sup>. Inside the context, we have a reference to the <code>io_uring</code> work queue (<code>struct io_wq</code>), which is actually an array of work queue entries (<code>struct io_wqe</code>). More on why that is an array soon.</p><p>Moving down to the work queue entry, we arrive at the work queue accounting table (<code>struct io_wqe_acct [2]</code>), with one record for each type of work – bounded and unbounded. This is where <code>io_uring</code> keeps track of the worker pool limit (<code>max_workers</code>) and the number of existing workers (<code>nr_workers</code>).</p>
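<p>A simplified model of that accounting - not the kernel’s actual definitions, just the shape of the check that gates worker creation:</p>

```c
#include <stdbool.h>

/* Illustrative model of one accounting record (struct io_wqe_acct);
 * there is one per work type (bounded/unbounded) per work queue entry. */
struct acct_model {
    unsigned int max_workers; /* pool limit, e.g. set via IORING_REGISTER_IOWQ_MAX_WORKERS */
    unsigned int nr_workers;  /* workers currently alive */
};

/* The gist of the check that gates spawning another worker. */
bool may_create_worker(struct acct_model acct)
{
    return acct.nr_workers < acct.max_workers;
}
```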
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5czT5432CY7bwHC2jznios/dc8672e1c4ed4d215e5b4d659988806b/image4-3.png" />
            
            </figure><p>The perhaps not-so-obvious consequence of this arrangement is that setting just the <code>RLIMIT_NPROC</code> limit, without touching <code>IORING_REGISTER_IOWQ_MAX_WORKERS</code>, can backfire for multi-threaded programs.</p><p>See, when the maximum number of workers for an io_uring instance is not configured, <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io-wq.c#L1162">it defaults to <code>RLIMIT_NPROC</code></a>. This means that <code>io_uring</code> will try to scale the unbounded worker pool to <code>RLIMIT_NPROC</code> for each thread that operates on an <code>io_uring</code> instance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42pJE30GsrNspCHfDzq5xa/08dadfaa1120aaca648d8cb2ebbb1fc1/io_uring_workers_multi_threaded.png" />
            
            </figure><p>A multi-threaded process, by definition, creates threads. Now recall that the process management in the kernel tracks the number of tasks <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=21d1c5e386bc751f1953b371d72cd5b7d9c9e270">per UID within the user namespace</a>. Each spawned thread depletes the quota set by <code>RLIMIT_NPROC</code>. As a consequence, <code>io_uring</code> will never be able to fully scale up the worker pool, and will burn the CPU trying to do so.</p>
            <pre><code>$ unshare -U prlimit --nproc=4 ./udp-read --async --threads 2 --rings 1 &amp;
[1] 26249
vagrant@bullseye:/blog/io-uring-worker-pool$ pstree -pt $!
udp-read(26249)─┬─{iou-wrk-26251}(26252)
                ├─{iou-wrk-26251}(26253)
                ├─{udp-read}(26250)
                └─{udp-read}(26251)
$ sudo bpftrace --btf -e 'kretprobe:create_io_thread { @[retval] = count(); } interval:s:1 { print(@); clear(@); } END { clear(@); }' -c '/usr/bin/sleep 3' | cat -s
Attaching 3 probes...
@[-11]: 517270
@[-11]: 509508
@[-11]: 461403

$ mpstat 1 3
Linux 5.15.9-cloudflare-2021.12.8 (bullseye)    01/04/22        _x86_64_        (4 CPU)
                                   …
02:23:23     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:23:24     all    0.00    0.00   50.13    0.00    0.00    0.00    0.00    0.00    0.00   49.87
02:23:25     all    0.00    0.00   50.25    0.00    0.00    0.00    0.00    0.00    0.00   49.75
02:23:26     all    0.00    0.00   49.87    0.00    0.00    0.50    0.00    0.00    0.00   49.62
Average:     all    0.00    0.00   50.08    0.00    0.00    0.17    0.00    0.00    0.00   49.75
$</code></pre>
            
    <div>
      <h2>NUMA, NUMA, yay <a href="https://en.wikipedia.org/wiki/Numa_Numa_(video)">🎶</a></h2>
      <a href="#numa-numa-yay">
        
      </a>
    </div>
    <p>Lastly, there’s the case of NUMA systems with more than one memory node. <code>io_uring</code> documentation clearly says that <code>IORING_REGISTER_IOWQ_MAX_WORKERS</code> configures the maximum number of workers per NUMA node.</p><p>That is why, as we have seen, <a href="https://elixir.bootlin.com/linux/v5.15.16/source/fs/io-wq.c#L125"><code>io_wq.wqes</code></a> is an array. It contains one entry, <code>struct io_wqe</code>, for each NUMA node. If your servers are NUMA systems like <a href="/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/">Cloudflare’s</a>, that is something to take into account.</p><p>Luckily, we don’t need a NUMA machine to experiment. <a href="https://www.qemu.org/">QEMU</a> happily emulates NUMA architectures. If you are hardcore enough, you can configure the NUMA layout with the right combination of <a href="https://www.qemu.org/docs/master/system/qemu-manpage.html"><code>-smp</code> and <code>-numa</code> options</a>.</p><p>But why bother when the <a href="https://github.com/vagrant-libvirt/vagrant-libvirt"><code>libvirt</code> provider</a> for Vagrant makes it so simple to configure a 2 node / 4 CPU layout:</p>
            <pre><code>    libvirt.numa_nodes = [
      {:cpus =&gt; "0-1", :memory =&gt; "2048"},
      {:cpus =&gt; "2-3", :memory =&gt; "2048"}
    ]</code></pre>
            <p>Let’s confirm how <code>io_uring</code> behaves on a NUMA system. Here’s our NUMA layout with two vCPUs per node, ready for experimentation:</p>
            <pre><code>$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 1980 MB
node 0 free: 1802 MB
node 1 cpus: 2 3
node 1 size: 1950 MB
node 1 free: 1751 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10</code></pre>
            <p>If we once again run our test workload and ask it to create a single <code>io_uring</code> with a maximum of two workers per NUMA node, then:</p>
            <pre><code>$ ./udp-read --async --threads 1 --rings 1 --workers 2 &amp;
[1] 693
$ pstree -pt $!
udp-read(693)─┬─{iou-wrk-693}(696)
              └─{iou-wrk-693}(697)</code></pre>
            <p>… we get just two workers on a machine with two NUMA nodes. Not the outcome we were hoping for.</p><p>Why are we not reaching the expected pool size of <code>&lt;max workers&gt; × &lt;# NUMA nodes&gt;</code> = 2 × 2 = 4 workers? And is it possible to make it happen?</p><p>Reading the code reveals that – yes, it is possible. However, for the per-node worker pool to be scaled up for a given NUMA node, we have to submit requests, that is, call <code>io_uring_enter()</code>, from a CPU that belongs to that node. In other words, the process scheduler and thread CPU affinity have a say in how many I/O workers will be created.</p><p>We can demonstrate the effect that jumping between CPUs and NUMA nodes has on the worker pool by operating two instances of <code>io_uring</code>. We already know that having more than one io_uring instance per thread does not impact the worker pool limit.</p><p>This time, however, we are going to ask the workload to pin itself to a particular CPU before submitting requests with the <code>--cpu</code> option – first it will run on CPU 0 to enter the first ring, then on CPU 2 to enter the second ring.</p>
            <pre><code>$ strace -e sched_setaffinity,io_uring_enter ./udp-read --async --threads 1 --rings 2 --cpu 0 --cpu 2 --workers 2 &amp; sleep 0.1 &amp;&amp; echo
[1] 6949
sched_setaffinity(0, 128, [0])          = 0
io_uring_enter(4, 4096, 0, 0, NULL, 128) = 4096
sched_setaffinity(0, 128, [2])          = 0
io_uring_enter(5, 4096, 0, 0, NULL, 128) = 4096
io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 128
$ pstree -pt 6949
strace(6949)───udp-read(6953)─┬─{iou-wrk-6953}(6954)
                              ├─{iou-wrk-6953}(6955)
                              ├─{iou-wrk-6953}(6956)
                              └─{iou-wrk-6953}(6957)
$</code></pre>
            <p>Voilà. We have reached the limit of <code>&lt;max workers&gt; × &lt;# NUMA nodes&gt;</code> workers.</p>
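<p>In other words, the upper bound is reached only if every node gets touched. A toy model of this rule - our own simplification of the per-node, on-demand scaling described above:</p>

```c
/* Total workers possible given a per-node limit and which NUMA nodes
 * have had requests submitted (io_uring_enter) from one of their CPUs. */
unsigned int max_pool_size(unsigned int per_node_limit,
                           const int nodes_entered[],
                           unsigned int nr_nodes)
{
    unsigned int total = 0;
    for (unsigned int n = 0; n < nr_nodes; n++)
        if (nodes_entered[n]) /* the pool scales up per node, on demand */
            total += per_node_limit;
    return total;
}
```

<p>With two workers per node, entering the ring from only one of two nodes caps the pool at two; entering from both lifts it to four, as in the experiment above.</p>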
    <div>
      <h2>Outro</h2>
      <a href="#outro">
        
      </a>
    </div>
    <p>That is all for the very first installment of the Missing Manuals. <code>io_uring</code> has more secrets that deserve a write-up, like request ordering or handling of interrupted syscalls, so Missing Manuals might return soon.</p><p>In the meantime, please tell us which topic you would nominate to have a Missing Manual written for.</p><p>Oh, and did I mention that if you enjoy putting cutting-edge Linux APIs to use, <a href="https://www.cloudflare.com/careers/jobs/">we are hiring</a>? Now also remotely.</p><p>_____</p><p><sup>1</sup>And it probably does not make the users of runtimes that implement a hybrid threading model, like Golang, too happy.</p><p><sup>2</sup>To the Linux kernel, processes and threads are just kinds of tasks, which either share or don’t share some resources.</p>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6aZLxCaZdX6PbblPyyepdf</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[The tale of a single register value]]></title>
            <link>https://blog.cloudflare.com/the-tale-of-a-single-register-value/</link>
            <pubDate>Wed, 03 Nov 2021 14:37:10 GMT</pubDate>
            <description><![CDATA[ It’s not every day that you get to debug what may well be a packet of death. It was certainly the first time for me.
What do I mean by “a packet of death”? A software bug where the network stack crashes in reaction to a single received network packet, taking down the whole operating system with it.  ]]></description>
            <content:encoded><![CDATA[ <p></p><blockquote><p>“Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.” — Sherlock Holmes</p></blockquote>
    <div>
      <h2>Intro</h2>
      <a href="#intro">
        
      </a>
    </div>
    <p>It’s not every day that you get to debug what may well be a packet of death. It was certainly the first time for me.</p><p>What do I mean by “a packet of death”? A software bug where the network stack crashes in reaction to a single received network packet, taking down the whole operating system with it. Like in the well-known case of the Windows <a href="https://en.wikipedia.org/wiki/Ping_of_death">ping of death</a>.</p><p><i>Challenge accepted.</i></p>
    <div>
      <h2>It starts with an oops</h2>
      <a href="#it-starts-with-an-oops">
        
      </a>
    </div>
    <p>Around a year ago <a href="https://lore.kernel.org/netdev/CAKxSbF01cLpZem2GFaUaifh0S-5WYViZemTicAg7FCHOnh6kug@mail.gmail.com/T/#u">we started seeing kernel crashes</a> in the Linux ipv4 stack. Servers were crashing sporadically, but we learned the hard way to never ignore cases like that — when possible we always trace crashes. We also couldn’t tie the crashes to a particular kernel version, which would have indicated a regression that could hopefully be tracked down to a single faulty change in the Linux kernel.</p><p>The crashed servers were leaving behind only a crash report, affectionately known as a “kernel oops”. Let’s take a look at it and go over what information we have there.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3X4BpNNGAD1ZPfxBlb1o4T/4a7cde352e376a1fe414d0e2923e3890/oops-before-decode-white-bg.png" />
            
            </figure><p>Parts of the oops, like offsets into functions, need to be decoded in order to be human-readable. Fortunately, Linux comes with the <code>decode_stacktrace.sh</code> script that did the work for us.</p><p>All we need to do is install the kernel debug and source packages before running the script. We will use the latest version of the script, as it has been significantly improved since Linux v5.4 came out.</p>
            <pre><code>$ RELEASE=`uname -r`
$ apt install linux-image-$RELEASE-dbg linux-source-$RELEASE
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decode_stacktrace.sh
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decodecode
$ chmod +x decode_stacktrace.sh decodecode
$ ./decode_stacktrace.sh -r 5.4.14-cloudflare-2020.1.11 &lt; oops.txt &gt; oops-decoded.txt</code></pre>
            <p>When decoded, the oops report is even longer than before! But that is a good thing. There is new information there that can help us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1J7tFIHQHeUhXDkBLEWFth/fdde379245fe287ade67f07ccea9032e/oops-after-decode-white-bg.png" />
            
            </figure>
    <div>
      <h2>What has happened?</h2>
      <a href="#what-has-happened">
        
      </a>
    </div>
    <p>With this much input we can start sketching a picture of what could have happened. First thing to check: where exactly did we crash?</p><p>The report points at line 5160 in the <code>skb_gso_transport_seglen()</code> function. If we take a look at the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/skbuff.c?h=v5.4.14#n5160">source code</a>, we can get a rough idea of what happens there. We are processing a <a href="https://www.kernel.org/doc/html/latest/networking/segmentation-offloads.html#generic-segmentation-offload">Generic Segmentation Offload</a> (GSO) packet carrying an encapsulated TCP packet. What is a GSO packet? In this context, it’s a batch of consecutive TCP segments travelling through the network stack together to amortize the processing cost. We will take a closer look at GSO later.</p>
            <pre><code>net/core/skbuff.c:
5150) static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
5151) {
          …
5155)     if (skb-&gt;encapsulation) {
                  …
5159)             if (likely(shinfo-&gt;gso_type &amp; (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
5160)                     thlen += inner_tcp_hdrlen(skb);   &lt;-- crash
5161)     } else if (…) {
          …
5172)     return thlen + shinfo-&gt;gso_size;
5173) }</code></pre>
            <p>The exact line where we crashed belongs to an if-branch that handles tunnel traffic. It calculates the length of the TCP header of the inner packet, that is the encapsulated one. We do that to compute the length of the outer L4 segment, which accounts for the inner packet length:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5f4jCCQGws5d2j7iRB8Duo/434ac3684fedde0c205d267072173afa/image3-9.png" />
            
            </figure><p>To understand how the length of the inner TCP header is computed we have to peel off a few layers of inlined function calls:</p>
            <pre><code>inner_tcp_hdrlen(skb)
⇓
inner_tcp_hdr(skb)-&gt;doff * 4
⇓
((struct tcphdr *)skb_inner_transport_header(skb))-&gt;doff * 4
⇓
((struct tcphdr *)(skb-&gt;head + skb-&gt;inner_transport_header))-&gt;doff * 4</code></pre>
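Unrolled into plain C, the chain above amounts to a single byte load, a shift, and a multiply. A standalone sketch (the raw header bytes here are hypothetical):

```c
#include <stdint.h>

/* TCP header length in bytes, read the way inner_tcp_hdrlen() does:
 * Data Offset is the high nibble of header byte 12 (0-based),
 * counted in 32-bit words. */
unsigned int tcp_hdrlen_from_bytes(const uint8_t *tcp_hdr)
{
    uint8_t doff = tcp_hdr[12] >> 4; /* the shift we will meet again in the disassembly */
    return doff * 4;                 /* words -> bytes */
}
```

For a header with no options, byte 12 holds 0x50, giving the minimum length of 20 bytes.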
            <p>Now it is clear that <code>inner_tcp_hdrlen(skb)</code> simply reads the <a href="https://datatracker.ietf.org/doc/html/rfc793#section-3.1">Data Offset field</a> (<code>doff</code>) inside the inner TCP header. Because Data Offset carries the number of 32-bit words in the TCP header, we multiply it by 4 to get the TCP header length in bytes.</p><p>From the memory access point of view, to read the Data Offset value we need to:</p><ol><li><p>load <code>skb-&gt;head</code> value from address <code>skb + offsetof(struct sk_buff, head)</code></p></li><li><p>load <code>skb-&gt;inner_transport_header</code> value from address <code>skb + offsetof(struct sk_buff, inner_transport_header)</code>,</p></li><li><p>load the TCP Data Offset from <code>skb-&gt;head + skb-&gt;inner_transport_header + offsetof(struct tcphdr, doff)</code></p></li></ol><p>Potentially, any of these loads could trigger a page fault. But it’s unlikely that <code>skb</code> contains an invalid address since we accessed the <code>skb-&gt;encapsulation</code> field without crashing just a few lines earlier. Our main suspect is the last load.</p><p>The invalid memory address we attempt to load from should be in one of the CPU registers at the time of the exception. And we have the CPU register snapshot in the oops report. Which register holds the address? That has been decided by the compiler. We will need to take a look at the instruction stream to discover that.</p><p>Remember the disassembly in the decoded kernel oops? Now is the time to go back to it. Hint, it’s in <a href="https://en.wikipedia.org/wiki/X86_assembly_language#Syntax">AT&amp;T syntax</a>. But to give everyone a fair chance to follow along, here’s the same disassembly but in Intel syntax. (Alright, alright. You caught me. I just can’t read the AT&amp;T syntax.)</p>
            <pre><code>All code
========
   0:   c0 41 83 e0             rol    BYTE PTR [rcx-0x7d],0xe0
   4:   11 f6                   adc    esi,esi
   6:   87 81 00 00 00 20       xchg   DWORD PTR [rcx+0x20000000],eax
   c:   74 30                   je     0x3e
   e:   0f b7 87 aa 00 00 00    movzx  eax,WORD PTR [rdi+0xaa]
  15:   0f b7 b7 b2 00 00 00    movzx  esi,WORD PTR [rdi+0xb2]
  1c:   48 01 c1                add    rcx,rax
  1f:   48 29 f0                sub    rax,rsi
  22:   45 85 c0                test   r8d,r8d
  25:   48 89 c6                mov    rsi,rax
  28:   74 0d                   je     0x37
  2a:*  0f b6 41 0c             movzx  eax,BYTE PTR [rcx+0xc]           &lt;-- trapping instruction
  2e:   c0 e8 04                shr    al,0x4
  31:   0f b6 c0                movzx  eax,al
  34:   8d 04 86                lea    eax,[rsi+rax*4]
  37:   0f b7 52 04             movzx  edx,WORD PTR [rdx+0x4]
  3b:   01 d0                   add    eax,edx
  3d:   c3                      ret
  3e:   45                      rex.RB
  3f:   85                      .byte 0x85

Code starting with the faulting instruction
===========================================
   0:   0f b6 41 0c             movzx  eax,BYTE PTR [rcx+0xc]
   4:   c0 e8 04                shr    al,0x4
   7:   0f b6 c0                movzx  eax,al
   a:   8d 04 86                lea    eax,[rsi+rax*4]
   d:   0f b7 52 04             movzx  edx,WORD PTR [rdx+0x4]
  11:   01 d0                   add    eax,edx
  13:   c3                      ret
  14:   45                      rex.RB
  15:   85                      .byte 0x85</code></pre>
            <p>When the trapped page fault happened, we tried to load from address <code>%rcx + 0xc</code>, or 12 bytes from whatever memory location <code>%rcx</code> held. Which is hardly a coincidence since the Data Offset field is 12 bytes into the TCP header.</p><p>This means that <code>%rcx</code> holds the computed <code>skb-&gt;head + skb-&gt;inner_transport_header</code> address. Let’s take a look at it:</p>
            <pre><code>RSP: 0018:ffffa4740d344ba0 EFLAGS: 00010202
RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900
…</code></pre>
            <p>The RCX value doesn’t look particularly suspicious. We can say that:</p><ol><li><p>it’s in a kernel virtual address space because it is greater than <code>0xffff000000000000</code> - expected, and</p></li><li><p>it is very close to the 4 KiB page boundary (<code>0xffff9d9624bbb000</code> - 4),</p></li></ol><p>... and not much more.</p><p>We must go back further in the instruction stream. Where did the value in <code>%rcx</code> come from? What I like to do is try to correlate the machine code leading up to the crash with pseudo source code:</p>
            <pre><code>&lt;function entry&gt;                # %rdi = skb
…
movzx  eax,WORD PTR [rdi+0xaa]  # %eax = skb-&gt;inner_transport_header
movzx  esi,WORD PTR [rdi+0xb2]  # %esi = skb-&gt;transport_header
add    rcx,rax                  # %rcx = skb-&gt;head + skb-&gt;inner_transport_header
sub    rax,rsi                  # %rax = skb-&gt;inner_transport_header - skb-&gt;transport_header
test   r8d,r8d
mov    rsi,rax                  # %rsi = skb-&gt;inner_transport_header - skb-&gt;transport_header
je     0x37
movzx  eax,BYTE PTR [rcx+0xc]   # %eax = *(skb-&gt;head + skb-&gt;inner_transport_header + offsetof(struct tcphdr, doff))</code></pre>
            <p>How did I decode that assembly snippet? We know that the <code>skb</code> address was passed to our function in the <code>%rdi</code> register because the <a href="https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI">System V AMD64 ABI calling convention</a> dictates that. If the <code>%rdi</code> register hasn’t been clobbered by any function calls, or reused because the compiler decided so, then maybe, just maybe, it still holds the <code>skb</code> address.</p><p>If <code>0xaa</code> and <code>0xb2</code> are offsets into an <code>sk_buff</code> structure, then the <code>pahole</code> tool can tell us which fields they correspond to:</p>
            <pre><code>$ pahole --hex -C sk_buff /usr/lib/debug/vmlinux-5.4.14-cloudflare-2020.1.11 | grep '\(head\|inner_transport_header\|transport_header\);'
        __u16                      inner_transport_header; /*  0xaa   0x2 */
        __u16                      transport_header;     /*  0xb2   0x2 */
        unsigned char *            head;                 /*  0xc0   0x8 */</code></pre>
            <p>To confirm our guesswork, we can <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-10-gso-encap-crash/listings/gdb-disassemble-skb_gso_transport_seglen.txt">disassemble the whole function</a> in <code>gdb</code>.</p><p>It would be great to find out the value of the <code>inner_transport_header</code> and <code>transport_header</code> offsets. But the registers that were holding them, <code>%rax</code> and <code>%rsi</code>, respectively, were reused after the offset values were loaded.</p><p>However, we can still examine the difference between <code>inner_transport_header</code> and <code>transport_header</code> that both <code>%rax</code> and <code>%rsi</code> hold. Let’s take a look.</p>
    <div>
      <h2>The suspicious offset</h2>
      <a href="#the-suspicious-offset">
        
      </a>
    </div>
    <p>Here are the register values from the oops as a reminder:</p>
            <pre><code>RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900</code></pre>
    <p>From the register snapshot we can tell that:</p><p><code>%rax = %rsi = skb-&gt;inner_transport_header - skb-&gt;transport_header = 0xfeda = 65242</code></p><p>That is clearly suspicious. We expect that <code>skb-&gt;transport_header &lt; skb-&gt;inner_transport_header</code>, so either</p><ol><li><p><code>skb-&gt;inner_transport_header &gt; 0xfeda</code>, which would mean that between outer and inner L4 packets there is 65k+ bytes worth of headers - unlikely, or</p></li><li><p><code>0xfeda</code> is a garbage value, perhaps an effect of an underflow if <code>skb-&gt;inner_transport_header &lt; skb-&gt;transport_header</code>.</p></li></ol><p>Let’s entertain the theory that an underflow has occurred.</p><p>Any other scenario, be it an out-of-bounds write or a use-after-free that corrupted the memory, is a scary prospect where we don’t stand much chance of debugging it without help from tools like <a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">KASAN</a>.</p><p>But if we assume for a moment that it’s an underflow, then the task is simple. We “just” need to audit all places where the <code>skb-&gt;inner_transport_header</code> or <code>skb-&gt;transport_header</code> offsets could have been updated while the <code>skb</code> buffer travelled through the network stack.</p><p>That raises the question — what path did the packet take through the network stack before it brought the machine down?</p>
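<p>Before moving on, both the page-boundary observation and the underflow theory can be sanity-checked with a bit of arithmetic. A Python sketch, where the 254 and 290 offsets are made up purely for illustration:</p>

```python
RCX = 0xffff9d9624bbaffc       # faulting address from the oops report
PAGE_SIZE = 4096

# (1) A kernel virtual address on x86-64, just 4 bytes short of a page boundary:
next_page = (RCX // PAGE_SIZE + 1) * PAGE_SIZE
assert RCX >= 0xffff000000000000
assert next_page - RCX == 4

def u16(x):
    # sk_buff header offsets are __u16 fields; they wrap modulo 2**16
    return x % 0x10000

# (2) Hypothetical offsets where the inner offset dropped below the outer one:
wrapped = u16(254 - 290)
print(hex(wrapped))    # 0xffdc -- a large positive value, much like 0xfeda
print(hex(u16(-4)))    # 0xfffc -- a 4-byte underflow wraps to 65532
```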
    <div>
      <h2>Packet path</h2>
      <a href="#packet-path">
        
      </a>
    </div>
    <p>It is time to take a look at the call trace in the oops report. If we walk through it, it is apparent that a veth device received a packet. The packet then got routed and forwarded to some other network device. The kernel crashed before the egress device transmitted the packet out.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30cfhDbkYftnpJA2cU0q6W/8fa1d2db7aac9a171b82e987849bc3b1/image4-2.png" />
            
            </figure><p>What immediately draws our attention is the <code>veth_poll()</code> function in the call trace. Polling inside a virtual device that acts as a simple pipe joining two network namespaces together? Puzzling!</p><p>The regular operation mode of a <code>veth</code> device is that transmission of a packet from one side of a <code>veth</code>-pair results in immediate, in-line, on-the-same-CPU reception of the packet by the other side of the pair. There shouldn't be any polling, context switches, or anything of the sort.</p><p>However, in Linux v4.19 the veth driver <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=60afdf066a35317efd5d1d7ae7c7f4ef2b32601f">gained support</a> for native mode <a href="https://docs.cilium.io/en/latest/bpf/#xdp">eXpress Data Path (XDP)</a>. XDP relies on <a href="https://lwn.net/Articles/244640/">NAPI</a>, an interface between the network drivers and the Linux network stack. NAPI requires that drivers <a href="https://www.kernel.org/doc/html/latest/networking/kapi.html?highlight=netif_napi_add#c.netif_napi_add">register a <code>poll()</code> callback</a> for fetching received packets.</p><p>The NAPI receive path in the <code>veth</code> driver is taken only when there is an XDP program attached. The fork occurs in <a href="https://elixir.bootlin.com/linux/v5.4.14/source/drivers/net/veth.c#L228"><code>veth_forward_skb</code></a>, where the TX path ends and an RX path on the other side begins.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qHZ8EKuRnrslIiRz0Bj2L/18a0101bec28d22c5bead91f3306b8a2/image9-4.png" />
            
            </figure><p>This is an important observation, because only on the NAPI/XDP path in the <code>veth</code> driver can received packets get aggregated by <a href="https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#generic-receive-offloading-gro">Generic Receive Offload</a>.</p>
    <div>
      <h2>Super-packets</h2>
      <a href="#super-packets">
        
      </a>
    </div>
    <p>Early on we noted that the crash happens when processing a GSO packet. I promised we would get back to it, and now is the time.</p><p>Generic Segmentation Offload (GSO) is all about delaying the L4 segmentation process until the very last moment. So-called super-packets, which exceed the egress route MTU in size, travel all the way through the network stack, only to be cut into MTU-sized segments just before handing the data over to the network driver for transmission. This way we process just one big packet on the transmit path, instead of a few smaller ones, and save CPU cycles in all the IP-level stack functions like routing, nftables, and traffic control.</p><p>Where do these super-packets come from? They can be a result of a large write to a socket, or, as in our case, they can be received from one network and forwarded to another network.</p><p>The latter case, that is forwarding a super-packet, happens when Generic Receive Offload (GRO) kicks in during receive. GRO is the opposite process of GSO. Smaller, MTU-sized packets get merged to form a super-packet early on the receive path. The goal is the same — process less by pushing just one packet through the network stack layers.</p><p>Not just any packets can be fused together by GRO. <a href="https://lwn.net/Articles/358910/">Loosely speaking</a>, any two packets to be merged must form a logical sequence in the network flow, and carry the same metadata in protocol headers. It is critical that no information is lost in the aggregation process. Otherwise, GSO won’t be able to reconstruct the segment stream when serializing packets in the network card transmission code.</p><p>To this end, each network protocol that supports GRO provides a callback which signals whether the above conditions hold true. 
The GRO implementation (<a href="https://elixir.bootlin.com/linux/v5.4.14/source/net/core/dev.c#L5496"><code>dev_gro_receive()</code></a>) then walks through the packet headers, the outer as well as the inner ones, and delegates the pre-merge check to the right protocol callback. If all the stars align, the packets get spliced at the end of the callback chain (<a href="https://elixir.bootlin.com/linux/v5.4.14/source/net/core/skbuff.c#L3986"><code>skb_gro_receive()</code></a>).</p><p>I will be frank. The code that performs GRO is pretty complex, and I spent a significant amount of time staring into it. Hat tip to its authors. However, for our little investigation it will be enough to understand that a TCP stream encapsulated with <a href="https://en.wikipedia.org/wiki/Generic_Routing_Encapsulation">GRE</a><sup>1</sup> would trigger a callback chain like so:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qEZv8Xq8brXJwmOplw0HA/38715e0e84c5fa98bc931faff8d47942/image11-3.png" />
            
            </figure><p>Armed with basic GRO/GSO understanding we are ready to take a shot at reproducing the crash.</p>
    <div>
      <h2>The reproducer</h2>
      <a href="#the-reproducer">
        
      </a>
    </div>
    <p>Let’s recap what we know:</p><ol><li><p>a super-packet was received from a <code>veth</code> device,</p></li><li><p>the <code>veth</code> device had an <code>XDP</code> program attached,</p></li><li><p>the packet was forwarded to another device,</p></li><li><p>the egress device was transmitting a GSO super-packet,</p></li><li><p>the packet was encapsulated,</p></li><li><p>the super-packet must have been produced by GRO on ingress.</p></li></ol><p>This paints a pretty clear picture on what the setup should look like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2tj05SVdRlQaspSaW0DO9J/fff7b814e502dac66eee59a78a53f4d1/image6-4.png" />
            
            </figure><p>We can work with that. A <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-10-gso-encap-crash/setup.sh">simple shell script</a> will be our setup machinery.</p><p>We will be sending traffic from <code>10.1.1.1</code> to <code>10.2.2.2</code>. Our traffic pattern will be a TCP stream consisting of two consecutive segments so that GRO can merge something. A <a href="https://scapy.net/">Scapy</a> script will be great for that. Let’s call it <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-10-gso-encap-crash/send-a-pair.py"><code>send-a-pair.py</code></a> and give it a run:</p>
            <pre><code>$ { sleep 5; sudo ip netns exec A ./send-a-pair.py; } &amp;
[1] 1603
$ sudo ip netns exec B tcpdump -i BA -n -nn -ttt 'ip and not arp'
…
 00:00:00.020506 IP 10.1.1.1 &gt; 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 &gt; 192.168.2.2.443: Flags [.], seq 0:1436, ack 1, win 8192, length 1436
 00:00:00.000082 IP 10.1.1.1 &gt; 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 &gt; 192.168.2.2.443: Flags [.], seq 1436:2872, ack 1, win 8192, length 1436</code></pre>
            <p>Where is our super-packet? Look at the packet sizes: GRO didn’t merge anything.</p><p>It turns out NAPI is just too fast at fetching the packets from the Rx ring. We need a little buffering on transmit to increase our chances of GRO batching:</p>
            <pre><code># Help GRO
ip netns exec A tc qdisc add dev AB root netem delay 200us slot 5ms 10ms packets 2 bytes 64k</code></pre>
            <p>With the delay in place, things look better:</p>
            <pre><code> 00:00:00.016972 IP 10.1.1.1 &gt; 10.2.2.2: GREv0, length 2916: IP 192.168.1.1.12345 &gt; 192.168.2.2.443: Flags [.], seq 0:2872, ack 1, win 8192, length 2872</code></pre>
            <p>2,872 bytes shown by <code>tcpdump</code> clearly indicate GRO in action. And we are even hitting the crash point:</p>
            <pre><code>$ sudo bpftrace -e 'kprobe:skb_gso_transport_seglen { print(kstack()); }' -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 1 probe...

        skb_gso_transport_seglen+1
        skb_gso_validate_network_len+17
        __ip_finish_output+293
        ip_output+113
        ip_forward+876
        ip_rcv+188
        __netif_receive_skb_one_core+128
        netif_receive_skb_internal+47
        napi_gro_flush+151
        napi_complete_done+183
        veth_poll+1697
        net_rx_action+314
        …

^C</code></pre>
            <p>…but we are not crashing. We will need to dig deeper.</p><p>We know what packet metadata <code>skb_gso_transport_seglen()</code> looks at — the header offsets, the <code>encapsulation</code> flag, and the GSO info. Let’s <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-10-gso-encap-crash/why-no-crash.bt">dump</a> all of it:</p>
            <pre><code>$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1   294 254  |  1436 2    0x41 skb_gso_transport_seglen</code></pre>
            <p>Since the <code>skb-&gt;encapsulation</code> flag (<code>ENC</code>) is set, both outer and inner header offsets should be valid. Are they?</p><p>The outer network / L3 header (<code>NH</code>) looks sane. When XDP is enabled, it reserves 256 bytes of headroom before the headers. A 14-byte Ethernet header follows the headroom. The IPv4 header should then start at 270 bytes into the packet buffer.</p><p>The outer transport / L4 header offset is as expected as well. The IPv4 header takes 20 bytes, and the GRE header follows it.</p><p>The inner network header (<code>INH</code>) begins at the offset of 294 bytes. This makes sense because the GRE header in its most basic form is 4 bytes long.</p><p>The surprise comes last. The inner transport header offset points somewhere near the end of the headroom which XDP reserves, when it should start at offset 314, right after the inner IPv4 header.</p>
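<p>The sane offsets above all follow from simple addition. A Python sketch, assuming an option-less IPv4 header and the most basic GRE header:</p>

```python
XDP_HEADROOM = 256   # headroom reserved by veth when XDP is enabled
ETH_HLEN = 14
IPV4_HLEN = 20       # no IP options
GRE_HLEN = 4         # basic GRE header, no optional fields

nh  = XDP_HEADROOM + ETH_HLEN   # outer IPv4 header
th  = nh + IPV4_HLEN            # outer L4 (GRE) header
inh = th + GRE_HLEN             # inner IPv4 header
ith = inh + IPV4_HLEN           # inner TCP header -- the expected value

print(nh, th, inh, ith)   # 270 290 294 314
```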
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3cyux0UKvxGzmBEfxSh6Pf/40835a4882e722a2f530bd6b4920159e/image10-6.png" />
            
            </figure><p>Is this the smoking gun we were looking for?</p>
    <div>
      <h2>The bug</h2>
      <a href="#the-bug">
        
      </a>
    </div>
    <p><code>skb_gso_transport_seglen()</code> calculates the length of the outer L4 segment when given a GSO packet. If the <code>inner_transport_header</code> offset is off, then the result of the calculation might be off as well. Worth checking.</p><p>We know that our segments are 1500 bytes long. That makes the L4 part 1480 bytes long. What does <code>skb_gso_transport_seglen()</code> say though?</p>
            <pre><code>$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1460</code></pre>
            <p>Seems that we don’t agree. But if <code>skb_gso_transport_seglen()</code> is getting garbage on input we can’t really <a href="https://en.wikipedia.org/wiki/Garbage_in%2C_garbage_out">blame it</a>.</p><p>If <code>inner_transport_header</code> is not correct, that TCP Data Offset read that we know happens inside the function cannot end well.</p><p>If we map it out, it looks like we are loading part of the source MAC address (upper 4 bits of the 5th byte, to be precise) and interpreting it as TCP Data Offset.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5EPmlwZGRF0Yeykmr4gHPu/ff8c21aa49c9b7f3090e1160db1d09cb/image5-4.png" />
            
            </figure><p>Are we? There is an easy way to check.</p><p>If we ask nicely, <code>tcpdump</code> will tell us what the MAC addresses are:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6m1WVc1zTBaEVPTQLqLbz2/e99c5754c25cccde322c54787fc0b719/Screenshot-2021-11-03-at-14-57-51-The-tale-of-a-single-register-value---Blog-Drafts-documentation.png" />
            
            </figure><p>Plugging that into the calculations that <code>skb_gso_transport_seglen()</code> …</p>
            <pre><code>thlen = inner_transport_header(skb) - transport_header(skb) = 254 - 290 = -36
thlen += inner_transport_header(skb)-&gt;doff * 4 = -36 + (0xf * 4) = -36 + 60 = 24
retval = gso_size + thlen = 1436 + 24 = 1460</code></pre>
            <p>…checks out!</p><p>Does this mean that I can control the return value by setting source MAC address?!</p>
            <pre><code>                                               ?
$ sudo ip -n A link set AB address be:d6:07:5e:05:11 # Change the MAC address 
$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1400</code></pre>
            <p>Yes! <code>1436 + (-36) + (0 * 4) = 1400</code>. This is it.</p><p>However, how does all this tie in to the original crash? The badly calculated L4 segment length will make GSO emit shorter segments on egress. But that’s all.</p><p>Remember the suspicious offset from the crash report?</p><p><code>%rax = %rsi = skb-&gt;inner_transport_header - skb-&gt;transport_header = 0xfeda = 65242</code></p><p>We now know that <code>skb-&gt;transport_header</code> should be 290. That makes <code>skb-&gt;inner_transport_header = 65242 + 290 = 65532 = 0xfffc</code>.</p><p>Which means that when we triggered the page fault we were trying to load memory from a location at</p><p><code>skb-&gt;head + skb-&gt;inner_transport_header + offsetof(tcphdr, doff) = skb-&gt;head + 0xfffc + 12 = 0xffff9d9624bbb008</code></p><p>Solving it for <code>skb-&gt;head</code> yields <code>0xffff9d9624bbb008 - 0xfffc - 12 = 0xffff9d9624bab000</code>.</p><p>And this makes sense. The <a href="http://vger.kernel.org/~davem/skb_data.html"><code>skb-&gt;head</code> buffer</a> is page-aligned, meaning it’s a multiple of 4 KiB on x86-64 platforms — the 12 least significant bits of the address are 0.</p><p>However, the address we were trying to read was <code>(0xfffc+12)/4096 ~= 16 pages</code> (or 64 KiB) past the <code>skb-&gt;head</code> page boundary <code>(0xffff9d9624babfff)</code>.</p>
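<p>We can double-check both the controlled return value and the recovered <code>skb-&gt;head</code> address numerically. A Python sketch of the arithmetic (the original MAC’s fifth-byte upper nibble of <code>0xf</code> is inferred from the observed return value):</p>

```python
gso_size = 1436
thlen = 254 - 290              # the underflowed header delta: -36

def seglen(doff_nibble):
    # skb_gso_transport_seglen() effectively computes gso_size + thlen + doff*4
    return gso_size + thlen + doff_nibble * 4

assert seglen(0xF) == 1460     # original MAC: fifth-byte upper nibble 0xf
assert seglen(0x0) == 1400     # be:d6:07:5e:05:11 gives nibble 0

# Recovering skb->head from the faulting address:
FAULT = 0xffff9d9624bbb008
ITH = 0xfeda + 290             # inner_transport_header == 0xfffc
DOFF = 12                      # offsetof(struct tcphdr, doff)

head = FAULT - ITH - DOFF
assert head == 0xffff9d9624bab000
assert head % 4096 == 0                # page-aligned, as expected
assert (ITH + DOFF) // 4096 == 16      # about 16 pages past skb->head
```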
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69J4h6qU3nC1rv0Pvypxng/870bcf91eee6b65ecbacf76fab8b207b/image7-5.png" />
            
            </figure><p>Who knows if there was memory mapped at this address?! Apparently, from time to time there wasn’t anything mapped there, and the kernel page fault handling code went “oops!”.</p>
    <div>
      <h2>The fix</h2>
      <a href="#the-fix">
        
      </a>
    </div>
    <p>It is finally time to understand who sets the header offsets in a super-packet.</p><p>Once GRO is done merging segments, it flushes the super-packet down the pipe by kicking off a chain of <code>gro_complete</code> callbacks:</p><p><code>napi_gro_complete → inet_gro_complete → gre_gro_complete → inet_gro_complete → tcp4_gro_complete → tcp_gro_complete</code></p><p>These callbacks are responsible for updating the header offsets and populating the GSO-related fields in the <code>skb_shared_info</code> struct. The transmit side will consume this data later on.</p><p>Let’s see how the packet metadata changes as it travels through the <code>gro_complete</code> callbacks<sup>2</sup> by adding a few more tracepoints to our bpftrace script:</p>
            <pre><code>$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 7 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
BA   2936 294 314 0   254 254  |  1436 0    0x00 napi_gro_complete
BA   2936 294 314 0   254 254  |  1436 0    0x00 inet_gro_complete
BA   2936 294 314 0   254 254  |  1436 0    0x00 gre_gro_complete
BA   2936 294 314 1   254 254  |  1436 0    0x40 inet_gro_complete
BA   2936 294 314 1   294 254  |  1436 0    0x40 tcp4_gro_complete
BA   2936 294 314 1   294 254  |  1436 0    0x41 tcp_gro_complete
sink 2936 270 290 1   294 254  |  1436 2    0x41 skb_gso_transport_seglen</code></pre>
            <p>As the packet travels through the <code>gro_complete</code> callbacks, the inner network header (<code>INH</code>) offset <a href="https://elixir.bootlin.com/linux/v5.4.14/source/net/ipv4/af_inet.c#L1591">gets updated</a> after we have processed the inner IPv4 header.</p><p>However, the same did not happen to the inner transport header (<code>ITH</code>), which is the one causing trouble. We need to fix that.</p>
            <pre><code>--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -298,6 +298,9 @@ int tcp_gro_complete(struct sk_buff *skb)
        if (th-&gt;cwr)
                skb_shinfo(skb)-&gt;gso_type |= SKB_GSO_TCP_ECN;

+       if (skb-&gt;encapsulation)
+               skb-&gt;inner_transport_header = skb-&gt;transport_header;
+
        return 0;
 }
 EXPORT_SYMBOL(tcp_gro_complete);</code></pre>
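<p>In spirit, the one-line fix does the following. A Python sketch of the logic, using the offsets from our reproducer (at the point where <code>tcp_gro_complete()</code> runs, <code>skb-&gt;transport_header</code> already points at the inner TCP header, as the trace above shows):</p>

```python
class Skb:
    """Minimal stand-in for struct sk_buff, for illustration only."""
    def __init__(self):
        self.encapsulation = True
        self.transport_header = 314        # inner TCP header during gro_complete
        self.inner_transport_header = 254  # stale offset into the XDP headroom

def tcp_gro_complete_fix(skb):
    # Mirror of the branch added by the patch to tcp_gro_complete():
    if skb.encapsulation:
        skb.inner_transport_header = skb.transport_header

skb = Skb()
tcp_gro_complete_fix(skb)
assert skb.inner_transport_header == 314   # matches the fixed trace output
```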
            <p>With the patch in place, the header offsets are finally all sane and <code>skb_gso_transport_seglen()</code> return value is as expected:</p>
            <pre><code>$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1   294 314  |  1436 2    0x41 skb_gso_transport_seglen

$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1480</code></pre>
            <p>Don’t worry, though. Chances are the fix has been in your kernel for a long time already. Patch <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d51c5907e9809a803b276883d203f45849abd4d6">d51c5907e980 (“net, gro: Set inner transport header offset in tcp/udp GRO hook”)</a> has been merged into Linux v5.14, and backported to the v5.10.58 and v5.4.140 LTS kernels. The Linux kernel community has got you covered. But please, keep on updating your production kernels.</p>
    <div>
      <h2>Outro</h2>
      <a href="#outro">
        
      </a>
    </div>
    <p>What a journey! We have learned a ton and fixed a real bug in the Linux kernel. In the end it was not a Packet of Death. Maybe next time we can find one ;-)</p><p>Enjoyed the read? Why not join Cloudflare and help us fix the remaining bugs in the Linux kernel? We are hiring in <a href="https://boards.greenhouse.io/cloudflare/jobs/3547091?gh_jid=3547091">Lisbon</a>, <a href="https://boards.greenhouse.io/cloudflare/jobs/3550616?gh_jid=3550616">London</a>, and <a href="https://boards.greenhouse.io/cloudflare/jobs/3550623?gh_jid=3550623">Austin</a>.</p><p>And if you would like to see more kernel blog posts, please let us know!</p><p><sup>1</sup> Why GRE and not some other type of encapsulation? If you follow our blog closely, you might already know that <a href="/magic-transit-network-functions/">Cloudflare Magic Transit</a> uses veth pairs to route traffic into and out of network namespaces. It also happens to use GRE encapsulation. If you are curious why we chose network namespaces linked with veth pairs, be sure to watch the <a href="https://linuxplumbersconf.org/event/7/contributions/682/">How we built Magic Transit</a> talk.</p><p><sup>2</sup> Just turn off GRO on all other network devices in use to get a clean output (<code>sudo ethtool -K enp0s5 gro off</code>).</p>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">775eetgTezSasskekM5MSO</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[Conntrack turns a blind eye to dropped SYNs]]></title>
            <link>https://blog.cloudflare.com/conntrack-turns-a-blind-eye-to-dropped-syns/</link>
            <pubDate>Thu, 04 Mar 2021 12:00:00 GMT</pubDate>
            <description><![CDATA[ We have been dealing with conntrack, the connection tracking layer in the Linux kernel, for years. And yet, despite the collected know-how, questions about its inner workings occasionally come up. When they do, it is hard to resist the temptation to go digging for answers. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Intro</h2>
      <a href="#intro">
        
      </a>
    </div>
    <p>We have been working with conntrack, the connection tracking layer in the Linux kernel, for years. And yet, despite the collected know-how, questions about its inner workings occasionally come up. When they do, it is hard to resist the temptation to go digging for answers.</p><p>One such question popped up while writing <a href="/conntrack-tales-one-thousand-and-one-flows/">the previous blog post on conntrack</a>:</p><blockquote><p>“Why are there no entries in the conntrack table for SYN packets dropped by the firewall?”</p></blockquote><p>Ready for a deep dive into the network stack? Let’s find out.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hrMaSaIFXFsXevfMqgsdh/becd0e20191174d5c30970e7e6c00a1e/tqbla_C4bF9esDdiQKx-12wXVI3xKv6IDUklAgB0zu6G4KiC3ziZeCJkSgUWechnpaCnFBL6XY_glt1HNaXDqURrRm6ttta7ciHiG8vidp7x6Th0eQUqXQF4Ure.jpeg" />
            
            </figure><p><a href="https://pixabay.com/photos/female-diver-sea-scuba-diving-4829158/"><i>Image</i></a> <i>by</i> <a href="https://pixabay.com/users/chulmin1700-15022416/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=4829158"><i>chulmin park</i></a> <i>from</i> <a href="https://pixabay.com/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=4829158"><i>Pixabay</i></a></p><p>We already know from <a href="/conntrack-tales-one-thousand-and-one-flows/">last time</a> that conntrack is in charge of tracking incoming and outgoing network traffic. By running conntrack -L we can inspect existing network flows, or as conntrack calls them, connections.</p><p>So if we spin up a <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-conntrack-syn-drop/Vagrantfile">toy VM</a>, <i>connect</i> to it over SSH, and inspect the contents of the conntrack table, we will see…</p>
            <pre><code>$ vagrant init fedora/33-cloud-base
$ vagrant up
…
$ vagrant ssh
Last login: Sun Jan 31 15:08:02 2021 from 192.168.122.1
[vagrant@ct-vm ~]$ sudo conntrack -L
conntrack v1.4.5 (conntrack-tools): 0 flow entries have been shown.</code></pre>
            <p>… nothing!</p><p>Even though the conntrack kernel module is loaded:</p>
            <pre><code>[vagrant@ct-vm ~]$ lsmod | grep '^nf_conntrack\b'
nf_conntrack          163840  1 nf_conntrack_netlink</code></pre>
            <p>Hold on a minute. Why is the SSH connection to the VM not listed in conntrack entries? SSH is working. With each keystroke we are sending packets to the VM. But conntrack doesn’t register it.</p><p>Isn’t conntrack an integral part of the network stack that sees every packet passing through it?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NeBs0DqiYv9no3mYlBp7f/092bced2f034366e69afb52ad2c101a4/image1-5.png" />
            
            </figure><p>Based on an <a href="https://commons.wikimedia.org/wiki/File:Netfilter-packet-flow.svg"><i>image</i></a> <i>by</i> <a href="https://commons.wikimedia.org/wiki/User_talk:Jengelh"><i>Jan Engelhardt</i></a> <i>CC BY-SA 3.0</i></p><p>Clearly everything we learned about conntrack last time is not the whole story.</p>
    <div>
      <h2>Calling into conntrack</h2>
      <a href="#calling-into-conntrack">
        
      </a>
    </div>
    <p>Our little experiment with SSH’ing into a VM begs the question — how does conntrack actually get notified about network packets passing through the stack?</p><p>We can walk <a href="https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/">the receive path step by step</a> and we won’t find any direct calls into the conntrack code in either the IPv4 or IPv6 stack. Conntrack does not interface with the network stack directly.</p><p>Instead, it relies on the Netfilter framework, and its set of hooks baked into the stack:</p>
            <pre><code>int ip_rcv(struct sk_buff *skb, struct net_device *dev, …)
{
    …
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
               net, NULL, skb, dev, NULL,
               ip_rcv_finish);
}</code></pre>
            <p>Netfilter users, like conntrack, can register callbacks with it. Netfilter will then run all registered callbacks when its hook processes a network packet.</p><p>For the INET family, that is IPv4 and IPv6, there are five Netfilter hooks to choose from:</p>
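<p>For reference, the five INET-family hook points and their numeric values, as defined in <code>include/uapi/linux/netfilter.h</code> (the drgn session later on indexes <code>hooks_ipv4</code> with exactly these numbers):</p>

```python
# Hook point numbers from include/uapi/linux/netfilter.h;
# e.g. hooks_ipv4[0] selects the PRE_ROUTING hook.
NF_INET_PRE_ROUTING  = 0
NF_INET_LOCAL_IN     = 1
NF_INET_FORWARD      = 2
NF_INET_LOCAL_OUT    = 3
NF_INET_POST_ROUTING = 4
NF_INET_NUMHOOKS     = 5   # size of the hooks_ipv4 / hooks_ipv6 arrays
```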
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TAI6xvDHx8kWETIdRURIw/f69688dbf7790f766999fe904a5b3b2e/image5-1.png" />
            
            </figure><p>Based on <a href="https://thermalcircle.de/doku.php?id=blog:linux:nftables_packet_flow_netfilter_hooks_detail"><i>Nftables - Packet flow and Netfilter hooks in detail</i></a><i>,</i> <a href="https://thermalcircle.de/"><i>thermalcircle.de</i></a><i>,</i> <a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en"><i>CC BY-SA 4.0</i></a></p><p>Which ones does conntrack use? We will get to that in a moment.</p><p>First, let’s focus on the trigger. What makes conntrack register its callbacks with Netfilter?</p><p>The SSH connection doesn’t show up in the conntrack table just because the module is loaded. We already saw that. This means that conntrack doesn’t register its callbacks with Netfilter at module load time.</p><p>Or at least, it doesn't do it by <i>default</i>. Since Linux v5.1 (May 2019) the conntrack module has the <a href="https://github.com/torvalds/linux/commit/ba3fbe663635ae7b33a2d972c5d2def036258e42">enable_hooks parameter</a>, which causes conntrack to register its callbacks on load:</p>
            <pre><code>[vagrant@ct-vm ~]$ modinfo nf_conntrack
…
parm:           enable_hooks:Always enable conntrack hooks (bool)</code></pre>
            <p>Going back to our toy VM, let’s try to reload the conntrack module with enable_hooks set:</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo rmmod nf_conntrack_netlink nf_conntrack
[vagrant@ct-vm ~]$ sudo modprobe nf_conntrack enable_hooks=1
[vagrant@ct-vm ~]$ sudo conntrack -L
tcp      6 431999 ESTABLISHED src=192.168.122.204 dst=192.168.122.1 sport=22 dport=34858 src=192.168.122.1 dst=192.168.122.204 sport=34858 dport=22 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.5 (conntrack-tools): 1 flow entries have been shown.
[vagrant@ct-vm ~]$</code></pre>
            <p>Nice! The conntrack table now contains an entry for our SSH session.</p><p>The Netfilter hook notified conntrack about SSH session packets passing through the stack.</p><p>Now that we know how conntrack gets called, we can go back to our question — can we observe a TCP SYN packet dropped by the firewall with conntrack?</p>
    <div>
      <h2>Listing Netfilter hooks</h2>
      <a href="#listing-netfilter-hooks">
        
      </a>
    </div>
    <p>That is easy to check:</p><ol><li><p>Add a rule to drop anything coming to port tcp/2570<sup>2</sup></p></li></ol>
            <pre><code>[vagrant@ct-vm ~]$ sudo iptables -t filter -A INPUT -p tcp --dport 2570 -j DROP</code></pre>
            <ol start="2"><li><p>Connect to the VM on port tcp/2570 from the outside</p></li></ol>
            <pre><code>host $ nc -w 1 -z 192.168.122.204 2570</code></pre>
            <ol start="3"><li><p>List conntrack table entries</p></li></ol>
            <pre><code>[vagrant@ct-vm ~]$ sudo conntrack -L
tcp      6 431999 ESTABLISHED src=192.168.122.204 dst=192.168.122.1 sport=22 dport=34858 src=192.168.122.1 dst=192.168.122.204 sport=34858 dport=22 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.5 (conntrack-tools): 1 flow entries have been shown.</code></pre>
            <p>No new entries. Conntrack didn’t record a new flow for the dropped SYN.</p><p>But did it process the SYN packet? To answer that, we have to find out which callbacks conntrack registered with Netfilter.</p><p>Netfilter keeps track of callbacks registered for each hook in instances of <code>struct nf_hook_entries</code>. We can reach these objects through the Netfilter state (<code>struct netns_nf</code>), which lives inside the network namespace (<code>struct net</code>).</p>
            <pre><code>struct netns_nf {
    …
    struct nf_hook_entries __rcu *hooks_ipv4[NF_INET_NUMHOOKS];
    struct nf_hook_entries __rcu *hooks_ipv6[NF_INET_NUMHOOKS];
    …
}</code></pre>
            <p><code>struct nf_hook_entries</code>, if you look at its <a href="https://elixir.bootlin.com/linux/v5.10.14/source/include/linux/netfilter.h#L101">definition</a>, is a bit of an exotic construct. A glance at how the object size is calculated during its <a href="https://elixir.bootlin.com/linux/v5.10.14/source/net/netfilter/core.c#L50">allocation</a> gives a hint about its memory layout:</p>
            <pre><code>    struct nf_hook_entries *e;
    size_t alloc = sizeof(*e) +
               sizeof(struct nf_hook_entry) * num +
               sizeof(struct nf_hook_ops *) * num +
               sizeof(struct nf_hook_entries_rcu_head);</code></pre>
            <p>It’s an element count, followed by two arrays glued together, and some <a href="https://en.wikipedia.org/wiki/Read-copy-update">RCU</a>-related state which we’re going to ignore. The two arrays have the same size, but hold different kinds of values.</p><p>We can walk the second array, holding pointers to struct nf_hook_ops, to discover the registered callbacks and their priority. Priority determines the invocation order.</p>
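To make the memory layout concrete, here is a back-of-the-envelope offset calculation. It is only a sketch under assumed sizes (8-byte pointers, a two-pointer `struct nf_hook_entry`, a header padded to 8 bytes), not values read from kernel headers:

```python
# Sketch of the struct nf_hook_entries memory layout, assuming a 64-bit
# kernel: 8-byte pointers, nf_hook_entry = { hook fn ptr, priv ptr },
# and a header (num_hook_entries + padding) occupying 8 bytes.
PTR = 8
HEADER = 8                  # u16 num_hook_entries, padded to 8 bytes
NF_HOOK_ENTRY = 2 * PTR     # one hook callback entry

def layout(num):
    """Byte offsets of the two arrays and the trailing RCU head."""
    hooks_off = HEADER                          # nf_hook_entry hooks[num]
    ops_off = hooks_off + NF_HOOK_ENTRY * num   # nf_hook_ops *orig_ops[num]
    rcu_off = ops_off + PTR * num               # nf_hook_entries_rcu_head
    return hooks_off, ops_off, rcu_off

print(layout(2))  # (8, 40, 56)
```

With two registered callbacks, the pointer array we want to walk starts right after the hooks array, matching the allocation formula above.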
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nKYeHIB6aTRVx36yUT3la/d7148917310c537a81ecfed5d639dfe2/image3-3.png" />
            
            </figure><p>With <a href="https://github.com/osandov/drgn">drgn</a>, a programmable C debugger tailored for the Linux kernel, we can locate the Netfilter state in kernel memory and walk its contents relatively easily, given we know what we are looking for.</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo drgn
drgn 0.0.8 (using Python 3.9.1, without libkdumpfile)
…
&gt;&gt;&gt; pre_routing_hook = prog['init_net'].nf.hooks_ipv4[0]
&gt;&gt;&gt; for i in range(0, pre_routing_hook.num_hook_entries):
...     pre_routing_hook.hooks[i].hook
...
(nf_hookfn *)ipv4_conntrack_defrag+0x0 = 0xffffffffc092c000
(nf_hookfn *)ipv4_conntrack_in+0x0 = 0xffffffffc093f290
&gt;&gt;&gt;</code></pre>
            <p>Neat! We have a way to access Netfilter state.</p><p>Let’s take it to the next level and list all registered callbacks for each Netfilter hook (using <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-conntrack-syn-drop/tools/list-nf-hooks">less than 100 lines of Python</a>):</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo /vagrant/tools/list-nf-hooks
ipv4 PRE_ROUTING
       -400 → ipv4_conntrack_defrag     ☜ conntrack callback
       -300 → iptable_raw_hook
       -200 → ipv4_conntrack_in         ☜ conntrack callback
       -150 → iptable_mangle_hook
       -100 → nf_nat_ipv4_in

ipv4 LOCAL_IN
       -150 → iptable_mangle_hook
          0 → iptable_filter_hook
         50 → iptable_security_hook
        100 → nf_nat_ipv4_fn
 2147483647 → ipv4_confirm
…</code></pre>
            <p>The output from our script shows that conntrack has two callbacks registered with the <code>PRE_ROUTING</code> hook - <code>ipv4_conntrack_defrag</code> and <code>ipv4_conntrack_in</code>. But are they being called?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/iBNniB9sIszNQV2u34qd4/59c67b7c7e53749696ca82992296edf0/image4-3.png" />
            
            </figure><p>Based on <a href="https://thermalcircle.de/doku.php?id=blog:linux:nftables_packet_flow_netfilter_hooks_detail"><i>Netfilter PRE_ROUTING hook</i></a><i>,</i> <a href="https://thermalcircle.de/"><i>thermalcircle.de</i></a><i>,</i> <a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en"><i>CC BY-SA 4.0</i></a></p>
    <div>
      <h2>Tracing conntrack callbacks</h2>
      <a href="#tracing-conntrack-callbacks">
        
      </a>
    </div>
    <p>We expect that when the Netfilter <code>PRE_ROUTING</code> hook processes a TCP SYN packet, it will invoke <code>ipv4_conntrack_defrag</code> and then <code>ipv4_conntrack_in</code> callbacks.</p><p>To confirm it we will put to use the tracing powers of <a href="https://ebpf.io/">BPF</a>. BPF programs can run on entry to functions. These kinds of programs are known as BPF kprobes. In our case we will attach BPF kprobes to conntrack callbacks.</p><p>Usually, when working with BPF, we would write the BPF program in C and use <code>clang -target bpf</code> to compile it. However, for tracing it will be much easier to use <a href="https://bpftrace.org/">bpftrace</a>. With bpftrace we can write <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-conntrack-syn-drop/tools/trace-conntrack-prerouting.bt">our BPF kprobe program</a> in a <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md">high-level language</a> inspired by <a href="https://en.wikipedia.org/wiki/AWK">AWK</a>:</p>
            <pre><code>kprobe:ipv4_conntrack_defrag,
kprobe:ipv4_conntrack_in
{
    $skb = (struct sk_buff *)arg1;
    $iph = (struct iphdr *)($skb-&gt;head + $skb-&gt;network_header);
    $th = (struct tcphdr *)($skb-&gt;head + $skb-&gt;transport_header);

    if ($iph-&gt;protocol == 6 /* IPPROTO_TCP */ &amp;&amp;
        $th-&gt;dest == 2570 /* htons(2570) */ &amp;&amp;
        $th-&gt;syn == 1) {
        time("%H:%M:%S ");
        printf("%s:%u &gt; %s:%u tcp syn %s\n",
               ntop($iph-&gt;saddr),
               (uint16)($th-&gt;source &lt;&lt; 8) | ($th-&gt;source &gt;&gt; 8),
               ntop($iph-&gt;daddr),
               (uint16)($th-&gt;dest &lt;&lt; 8) | ($th-&gt;dest &gt;&gt; 8),
               func);
    }
}</code></pre>
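The shift-and-or on `$th->source` and `$th->dest` is nothing more than a 16-bit byte swap (ntohs): port numbers sit in the packet in network byte order. A Python sketch of the same operation:

```python
def swap16(x):
    """16-bit byte swap, the ntohs/htons of the bpftrace program."""
    return ((x << 8) | (x >> 8)) & 0xFFFF

print(swap16(0x0050))   # 20480: how port 80 looks in network byte order
print(swap16(0x0a0a))   # 2570: a byte-order palindrome, swap is a no-op
```

The palindrome is exactly why port 2570 (0x0a0a) was picked for the test: the filter can compare `$th->dest` against 2570 directly, with no htons() needed.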
            <p>What does this program do? It is roughly an equivalent of a tcpdump filter:</p>
            <pre><code>dst port 2570 and tcp[tcpflags] &amp; tcp-syn != 0</code></pre>
            <p>But only for packets passing through conntrack <code>PRE_ROUTING</code> callbacks.</p><p>(If you haven’t used bpftrace, it comes with an excellent <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md">reference guide</a> and gives you the ability to explore kernel data types on the fly with <code>bpftrace -lv 'struct iphdr'</code>.)</p><p>Let’s run the tracing program while we connect to the VM from the outside (<code>nc -z 192.168.122.204 2570</code>):</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo bpftrace /vagrant/tools/trace-conntrack-prerouting.bt
Attaching 3 probes...
Tracing conntrack prerouting callbacks... Hit Ctrl-C to quit
13:22:56 192.168.122.1:33254 &gt; 192.168.122.204:2570 tcp syn ipv4_conntrack_defrag
13:22:56 192.168.122.1:33254 &gt; 192.168.122.204:2570 tcp syn ipv4_conntrack_in
^C

[vagrant@ct-vm ~]$</code></pre>
            <p>Conntrack callbacks have processed the TCP SYN packet destined to tcp/2570.</p><p>But if conntrack saw the packet, why is there no corresponding flow entry in the conntrack table?</p>
    <div>
      <h2>Going down the rabbit hole</h2>
      <a href="#going-down-the-rabbit-hole">
        
      </a>
    </div>
    <p>What actually happens inside the conntrack <code>PRE_ROUTING</code> callbacks?</p><p>To find out, we can trace the call chain that starts on entry to the conntrack callback. The <code>function_graph</code> tracer built into the <a href="https://lwn.net/Articles/365835/">Ftrace</a> framework is perfect for this task.</p><p>But because all incoming traffic goes through the <code>PRE_ROUTING</code> hook, including our SSH connection, our trace will be polluted with events from SSH traffic. To avoid that, let’s switch from SSH access to a serial console.</p><p>When using <a href="https://github.com/vagrant-libvirt/vagrant-libvirt">libvirt</a> as the Vagrant provider, you can connect to the serial console with <code>virsh</code>:</p>
            <pre><code>host $ virsh -c qemu:///session list
 Id   Name                State
-----------------------------------
 1    conntrack_default   running

host $ virsh -c qemu:///session console conntrack_default</code></pre>
            <p>Once connected to the console and logged into the VM, we can record the call chain using the <code>trace-cmd</code> wrapper for Ftrace:</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo trace-cmd start -p function_graph -g ipv4_conntrack_defrag -g ipv4_conntrack_in
  plugin 'function_graph'
[vagrant@ct-vm ~]$ # … connect from the host with `nc -z 192.168.122.204 2570` …
[vagrant@ct-vm ~]$ sudo trace-cmd stop
[vagrant@ct-vm ~]$ sudo cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   1.219 us    |  finish_task_switch();
 1)   3.532 us    |  ipv4_conntrack_defrag [nf_defrag_ipv4]();
 1)               |  ipv4_conntrack_in [nf_conntrack]() {
 1)               |    nf_conntrack_in [nf_conntrack]() {
 1)   0.573 us    |      get_l4proto [nf_conntrack]();
 1)               |      nf_ct_get_tuple [nf_conntrack]() {
 1)   0.487 us    |        nf_ct_get_tuple_ports [nf_conntrack]();
 1)   1.564 us    |      }
 1)   0.820 us    |      hash_conntrack_raw [nf_conntrack]();
 1)   1.255 us    |      __nf_conntrack_find_get [nf_conntrack]();
 1)               |      init_conntrack.constprop.0 [nf_conntrack]() {  ❷
 1)   0.427 us    |        nf_ct_invert_tuple [nf_conntrack]();
 1)               |        __nf_conntrack_alloc [nf_conntrack]() {      ❶
                             … 
 1)   3.680 us    |        }
                           … 
 1) + 15.847 us   |      }
                         … 
 1) + 34.595 us   |    }
 1) + 35.742 us   |  }
 …
[vagrant@ct-vm ~]$</code></pre>
            <p>What catches our attention here is the allocation, <a href="https://elixir.bootlin.com/linux/v5.10.14/source/net/netfilter/nf_conntrack_core.c#L1471">__nf_conntrack_alloc()</a> (❶), inside <code>init_conntrack()</code> (❷). <code>__nf_conntrack_alloc()</code> creates a <a href="https://elixir.bootlin.com/linux/v5.10.14/source/include/net/netfilter/nf_conntrack.h#L58">struct nf_conn</a> object which represents a tracked connection.</p><p>This object is not created in vain. <a href="https://elixir.bootlin.com/linux/v5.10.14/source/net/netfilter/nf_conntrack_core.c#L1633">A glance</a> at <code>init_conntrack()</code> <a href="https://elixir.bootlin.com/linux/v5.10.14/source/net/netfilter/nf_conntrack_core.c#L1633">source</a> shows that it is pushed onto a list of unconfirmed connections<sup>3</sup>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NAnZwLE3NnwyYdlSpnPls/7f87ba94ccade0bea86d95a22130e2cd/image6-1.png" />
            
            </figure><p>What does it mean that a connection is unconfirmed? As <a href="https://manpages.debian.org/buster/conntrack/conntrack.8.en.html">conntrack(8) man page</a> explains:</p>
            <pre><code>unconfirmed:
       This table shows new entries, that are not yet inserted into the
       conntrack table. These entries are attached to packets that  are
       traversing  the  stack, but did not reach the confirmation point
       at the postrouting hook.</code></pre>
            <p>Perhaps we have been looking for our flow in the wrong table? Does the unconfirmed table have a record for our dropped TCP SYN?</p>
    <div>
      <h2>Pulling the rabbit out of the hat</h2>
      <a href="#pulling-the-rabbit-out-of-the-hat">
        
      </a>
    </div>
    <p>I have bad news…</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo conntrack -L unconfirmed
conntrack v1.4.5 (conntrack-tools): 0 flow entries have been shown.
[vagrant@ct-vm ~]$</code></pre>
            <p>The flow is not present in the unconfirmed table. We have to dig deeper.</p><p>Let’s for a moment assume that a <code>struct nf_conn</code> object was added to the <code>unconfirmed</code> list. If the list is now empty, then the object must have been removed from the list before we inspected its contents.</p><p>Has an entry been removed from the <code>unconfirmed</code> table? What function removes entries from the <code>unconfirmed</code> table?</p><p>It turns out that <code>nf_ct_add_to_unconfirmed_list()</code>, which <code>init_conntrack()</code> invokes, has its opposite defined right beneath it - <code>nf_ct_del_from_dying_or_unconfirmed_list()</code>.</p><p>It is worth a shot to check if this function is being called, and if so, from where. For that we can again use a BPF tracing program, attached to function entry. However, this time our program will record a kernel stack trace:</p>
            <pre><code>kprobe:nf_ct_del_from_dying_or_unconfirmed_list { @[kstack()] = count(); exit(); }</code></pre>
            <p>With <code>bpftrace</code> running our one-liner, we connect to the VM from the host with <code>nc</code> as before:</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo bpftrace -e 'kprobe:nf_ct_del_from_dying_or_unconfirmed_list { @[kstack()] = count(); exit(); }'
Attaching 1 probe...

@[
    nf_ct_del_from_dying_or_unconfirmed_list+1 ❹
    destroy_conntrack+78
    nf_conntrack_destroy+26
    skb_release_head_state+78
    kfree_skb+50 ❸
    nf_hook_slow+143 ❷
    ip_local_deliver+152 ❶
    ip_sublist_rcv_finish+87
    ip_sublist_rcv+387
    ip_list_rcv+293
    __netif_receive_skb_list_core+658
    netif_receive_skb_list_internal+444
    napi_complete_done+111
    …
]: 1

[vagrant@ct-vm ~]$</code></pre>
            <p>Bingo. The conntrack delete function was called, and the captured stack trace shows that on the local delivery path (❶), where the <code>LOCAL_IN</code> Netfilter hook runs (❷), the packet is destroyed (❸). Conntrack must be getting called when the <a href="https://elixir.bootlin.com/linux/v5.10.14/source/include/linux/skbuff.h#L713">sk_buff</a> (the packet and its metadata) is destroyed. This causes conntrack to remove the unconfirmed flow entry (❹).</p><p>It makes sense. After all, we have a <code>DROP</code> rule in the <code>filter/INPUT</code> chain. And that <code>iptables -j DROP</code> rule has a significant side effect. It cleans up an entry in the conntrack <code>unconfirmed</code> table!</p><p>This explains why we can’t observe the flow in the <code>unconfirmed</code> table. It lives for only a very short period of time.</p><p>Not convinced? You don’t have to take my word for it. I will prove it with a dirty trick!</p>
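The lifecycle we just traced can be condensed into a toy model. This is pure illustration in Python terms, not kernel code:

```python
# Toy model: a flow entry is always allocated, but a DROP verdict
# removes it from the unconfirmed list before it is ever confirmed.
unconfirmed, conntrack_table = [], []

def receive(packet, verdict):
    flow = ("flow", packet)       # init_conntrack(): allocate nf_conn...
    unconfirmed.append(flow)      # ...and add it to the unconfirmed list
    if verdict == "DROP":
        # filter/INPUT drops: the sk_buff is freed, which triggers
        # nf_ct_del_from_dying_or_unconfirmed_list() for the entry
        unconfirmed.remove(flow)
        return None
    unconfirmed.remove(flow)      # ipv4_confirm(): promote the flow
    conntrack_table.append(flow)
    return flow

receive("syn to tcp/2570", "DROP")
print(len(unconfirmed), len(conntrack_table))  # 0 0
```

A dropped SYN leaves both tables empty, which is exactly what `conntrack -L` and `conntrack -L unconfirmed` showed us.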
    <div>
      <h2>Making the rabbit disappear, or actually appear</h2>
      <a href="#making-the-rabbit-disappear-or-actually-appear">
        
      </a>
    </div>
    <p>If you recall the output from <code>list-nf-hooks</code> that we’ve seen earlier, there is another conntrack callback there - <code>ipv4_confirm</code>, which I have ignored:</p>
            <pre><code>[vagrant@ct-vm ~]$ sudo /vagrant/tools/list-nf-hooks
…
ipv4 LOCAL_IN
       -150 → iptable_mangle_hook
          0 → iptable_filter_hook
         50 → iptable_security_hook
        100 → nf_nat_ipv4_fn
 2147483647 → ipv4_confirm              ☜ another conntrack callback
… </code></pre>
            <p><code>ipv4_confirm</code> is “the confirmation point” mentioned in the <a href="https://manpages.debian.org/buster/conntrack/conntrack.8.en.html">conntrack(8) man page</a>. When a flow gets confirmed, it is moved from the <code>unconfirmed</code> table to the main <code>conntrack</code> table.</p><p>The callback is registered with a “weird” priority: 2,147,483,647. It’s the maximum value a 32-bit signed integer can hold, and at the same time, the lowest possible priority a callback can have.</p><p>This ensures that the <code>ipv4_confirm</code> callback runs last. We want the flows to graduate from the <code>unconfirmed</code> table to the main <code>conntrack</code> table only once we know the corresponding packet has made it through the firewall.</p><p>Luckily for us, it is possible to have more than one callback registered with the same priority. In such cases, the order of registration matters. We can put that to use. Just for educational purposes.</p><p>Good old <code>iptables</code> won’t be of much help here. Its Netfilter callbacks have hard-coded priorities which we can’t change. But <code>nftables</code>, the <code>iptables</code> <a href="https://developers.redhat.com/blog/2016/10/28/what-comes-after-iptables-its-successor-of-course-nftables/">successor</a>, is much more flexible in this regard. With <code>nftables</code> we can create a rule chain with arbitrary priority.</p><p>So this time, let’s use nftables to install a filter rule to drop traffic to port tcp/2570. The trick, though, is to register our chain before conntrack registers itself. This way our filter will run <i>last</i>.</p><p>First, delete the tcp/2570 drop rule in iptables and unregister conntrack.</p>
            <pre><code>vm # iptables -t filter -F
vm # rmmod nf_conntrack_netlink nf_conntrack</code></pre>
            <p>Then add tcp/2570 drop rule in <code>nftables</code>, with lowest possible priority.</p>
            <pre><code>vm # nft add table ip my_table
vm # nft add chain ip my_table my_input { type filter hook input priority 2147483647 \; }
vm # nft add rule ip my_table my_input tcp dport 2570 counter drop
vm # nft -a list ruleset
table ip my_table { # handle 1
        chain my_input { # handle 1
                type filter hook input priority 2147483647; policy accept;
                tcp dport 2570 counter packets 0 bytes 0 drop # handle 4
        }
}</code></pre>
            <p>Finally, re-register conntrack hooks.</p>
            <pre><code>vm # modprobe nf_conntrack enable_hooks=1</code></pre>
            <p>The registered callbacks for the <code>LOCAL_IN</code> hook now look like this:</p>
            <pre><code>vm # /vagrant/tools/list-nf-hooks
…
ipv4 LOCAL_IN
       -150 → iptable_mangle_hook
          0 → iptable_filter_hook
         50 → iptable_security_hook
        100 → nf_nat_ipv4_fn
 2147483647 → ipv4_confirm, nft_do_chain_ipv4
…</code></pre>
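The ordering above can be reproduced with a toy model of callback registration. The insertion rule sketched here (a new callback goes in front of the first existing entry with an equal or greater priority, so equal priorities end up in reverse registration order) is an assumption consistent with the output we observed; the real logic lives in the kernel's nf_hook_entries_grow():

```python
INT_MAX = 2**31 - 1  # lowest possible priority, runs last

def register(chain, name, priority):
    # Insert before the first entry whose priority is >= ours; with
    # equal priorities the newcomer therefore runs first.
    for i, (_, prio) in enumerate(chain):
        if priority <= prio:
            chain.insert(i, (name, priority))
            return
    chain.append((name, priority))

hooks = []
register(hooks, "nft_do_chain_ipv4", INT_MAX)  # our chain, registered first
register(hooks, "ipv4_confirm", INT_MAX)       # conntrack, registered second
print([name for name, _ in hooks])  # ['ipv4_confirm', 'nft_do_chain_ipv4']
```

Registering our chain first pushes it behind `ipv4_confirm`, so the flow gets confirmed before our rule ever sees the packet.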
            <p>What happens if we connect to port tcp/2570 now?</p>
            <pre><code>vm # conntrack -L
tcp      6 115 SYN_SENT src=192.168.122.1 dst=192.168.122.204 sport=54868 dport=2570 [UNREPLIED] src=192.168.122.204 dst=192.168.122.1 sport=2570 dport=54868 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.5 (conntrack-tools): 1 flow entries have been shown.</code></pre>
            <p>We have fooled conntrack!</p><p>Conntrack promoted the flow from the <code>unconfirmed</code> to the main <code>conntrack</code> table despite the fact that the firewall dropped the packet. We can observe it.</p>
    <div>
      <h2>Outro</h2>
      <a href="#outro">
        
      </a>
    </div>
    <p>Conntrack processes every received packet<sup>4</sup> and creates a flow for it. A flow entry is always created even if the packet is dropped shortly after. The flow might never be promoted to the main conntrack table and can be short lived.</p><p>However, this blog post is not really about conntrack. Its internals have been covered by <a href="https://www.usenix.org/publications/login/june-2006-volume-31-number-3/netfilters-connection-tracking-system">magazines</a>, <a href="https://www.semanticscholar.org/paper/Netfilter-Connection-Tracking-and-NAT-Boye/3f3c09dbc2a13c4840bb4a148753bb528493b607">papers</a>, <a href="https://books.google.pl/books?id=RpsQAwAAQBAJ&amp;lpg=PP1&amp;pg=PA253#v=onepage&amp;q&amp;f=false">books</a>, and on other blogs long before. We probably could have learned elsewhere all that has been shown here.</p><p>For us, conntrack was really just an excuse to demonstrate various ways to discover the inner workings of the Linux network stack. As good as any other.</p><p>Today we have powerful introspection tools like <a href="https://github.com/osandov/drgn">drgn</a>, <a href="https://bpftrace.org/">bpftrace</a>, or <a href="https://lwn.net/Articles/365835/">Ftrace</a>, and a <a href="https://elixir.bootlin.com/">cross referencer</a> to plow through the source code, at our fingertips. They help us look under the hood of a live operating system and gradually deepen our understanding of its workings.</p><p>I have to warn you, though. Once you start digging into the kernel, it is hard to stop…</p><p>...........</p><p><sup>1</sup>Actually since <a href="https://kernelnewbies.org/Linux_5.10#Networking">Linux v5.10</a> (Dec 2020) there is an additional Netfilter hook for the INET family named <code>NF_INET_INGRESS</code>. The new hook type allows users to attach nftables chains to the Traffic Control ingress hook.</p><p><sup>2</sup>Why did I pick this port number? Because 2570 = 0x0a0a. 
As we will see later, this saves us the trouble of converting between the network byte order and the host byte order.</p><p><sup>3</sup>To be precise, there are multiple lists of unconfirmed connections. One per each CPU. This is a common pattern in the kernel. Whenever we want to prevent CPUs from contending for access to a shared state, we give each CPU a private instance of the state.</p><p><sup>4</sup>Unless we explicitly exclude it from being tracked with <code>iptables -j NOTRACK</code>.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Kernel]]></category>
            <guid isPermaLink="false">2F04ZqBN3X4bKtLaLCptMm</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[It's crowded in here!]]></title>
            <link>https://blog.cloudflare.com/its-crowded-in-here/</link>
            <pubDate>Sat, 12 Oct 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ We recently gave a presentation on Programming socket lookup with BPF at the Linux Plumbers Conference 2019 in Lisbon, Portugal. ]]></description>
            <content:encoded><![CDATA[ <p>We recently gave a presentation on <a href="https://linuxplumbersconf.org/event/4/contributions/487/">Programming socket lookup with BPF</a> at the Linux Plumbers Conference 2019 in Lisbon, Portugal. This blog post is a recap of the problem statement and proposed solution we presented.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ODIBQvumtXQYscbSqjLFn/24257f7186eae5e62fcaf20a166490ea/birds_cable_wire.jpg" />
          </figure><p>CC0 Public Domain, <a href="https://pxhere.com/en/photo/1526517">PxHere</a></p><p>Our edge servers are crowded. We run more than a dozen public-facing services, leaving aside all the internal ones that do the work behind the scenes.</p><p>Quick Quiz #1: How many can you name? We blogged about them! <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-1">Jump to answer</a>.</p><p>These services are exposed on more than a million Anycast <a href="https://www.cloudflare.com/ips/">public IPv4 addresses</a> partitioned into 100+ network prefixes.</p><p>To keep things uniform every Cloudflare edge server runs all services and responds to every Anycast address. This allows us to make efficient use of the hardware by load-balancing traffic between all machines. We have shared the details of Cloudflare <a href="https://blog.cloudflare.com/no-scrubs-architecture-unmetered-mitigation/">edge</a> <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/">architecture</a> on the blog before.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MHwv0fusFTiEylSaWQeg3/dbbaa25e7093b0be4fa8ec6dbcd88bca/edge_data_center-1.png" />
          </figure><p>Granted not all services work on all the addresses but rather on a subset of them, covering one or several network prefixes.</p><p>So how do you set up your network services to listen on hundreds of IP addresses without driving the network stack over the edge? Cloudflare engineers have had to ask themselves this question more than once over the years, and the answer has changed as our edge evolved. This evolution forced us to look for creative ways to work with the <a href="https://en.wikipedia.org/wiki/Berkeley_sockets">Berkeley sockets API</a>, a POSIX standard for assigning a network address and a port number to your application. It has been quite a journey, and we are not done yet.</p>
    <div>
      <h2>When life is simple - one address, one socket</h2>
      <a href="#when-life-is-simple-one-address-one-socket">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6aVlT1BbslJwQRzC0HnF0p/eb5b0651ba36da1010e1bd72b7c57a55/mapping_1_to_1-1.png" />
          </figure><p>The simplest kind of association between an (IP address, port number) and a service that we can imagine is one-to-one. A server responds to client requests on a single address, on a well known port. To set it up the application has to open one socket for each transport protocol (be it TCP or UDP) it wants to support. A network server like our <a href="https://www.cloudflare.com/dns/">authoritative DNS</a> would open up two sockets (one for UDP, one for TCP):</p>
            <pre><code>(192.0.2.1, 53/tcp) ⇨ ("auth-dns", pid=1001, fd=3)
(192.0.2.1, 53/udp) ⇨ ("auth-dns", pid=1001, fd=4)</code></pre>
            <p>To take it to Cloudflare scale, the service is likely to have to receive on at least a /20 network prefix, which is a range of IPs with 4096 addresses in it.</p><p>This translates to opening 4096 sockets for each transport protocol. Something that is not likely to go unnoticed when looking at <a href="http://man7.org/linux/man-pages/man8/ss.8.html">ss tool</a> output.</p>
            <pre><code>$ sudo ss -ulpn 'sport = 53'
State  Recv-Q Send-Q  Local Address:Port Peer Address:Port
…
UNCONN 0      0           192.0.2.40:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11076))
UNCONN 0      0           192.0.2.39:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11075))
UNCONN 0      0           192.0.2.38:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11074))
UNCONN 0      0           192.0.2.37:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11073))
UNCONN 0      0           192.0.2.36:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11072))
UNCONN 0      0           192.0.2.31:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11071))
…</code></pre>
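The 4096 figure follows from the prefix length alone: a /20 leaves 12 host bits. A quick check with Python's ipaddress module, using a /20 that contains the 192.0.2.x example range:

```python
import ipaddress

# A /20 has 32 - 20 = 12 host bits, i.e. 2**12 addresses.
net = ipaddress.ip_network("192.0.0.0/20")  # contains 192.0.2.x
print(net.num_addresses)       # 4096 sockets per transport protocol
print(2 * net.num_addresses)   # 8192 sockets to cover both TCP and UDP
```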
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2v9WUgAYqxc0JyRM6llZCa/5c29ebd8a06ab9ded6dc0c483431b88d/lots_of_socks.jpg" />
          </figure><p>CC BY 2.0, Luca Nebuloni, <a href="https://flickr.com/photos/7897906@N06/20655224708">Flickr</a></p><p>The approach, while naive, has an advantage: when an IP from the range gets attacked with a UDP flood, the receive queues of sockets bound to the remaining IP addresses are not affected.</p>
    <div>
      <h2>Life can be easier - all addresses, one socket</h2>
      <a href="#life-can-be-easier-all-addresses-one-socket">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4nfFyQt9C1FLIKO6wnmV4Q/90405ed4e16846e092ecd7e0d88d5110/mapping_inaddr_any-1.png" />
          </figure><p>It seems rather silly to create so many sockets for one service to receive traffic on a range of addresses. Not only that, the more listening sockets there are, the longer the chains in the socket lookup hash table. We have learned the hard way that going in this direction <a href="https://blog.cloudflare.com/revenge-listening-sockets/">can hurt packet processing latency</a>.</p><p>The sockets API comes with a big hammer that can make our life easier - the <code>INADDR_ANY</code> aka <code>0.0.0.0</code> wildcard address. With <code>INADDR_ANY</code> we can make a single socket receive on all addresses assigned to our host, specifying just the port.</p>
            <pre><code>s = socket(AF_INET, SOCK_STREAM, 0)
s.bind(('0.0.0.0', 12345))
s.listen(16)</code></pre>
            <p>Quick Quiz #2: Is there another way to bind a socket to all local addresses? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-2">Jump to answer</a>.</p><p>In other words, compared to the naive “one address, one socket” approach, <code>INADDR_ANY</code> allows us to have a single catch-all listening socket for the whole IP range on which we accept incoming connections.</p><p>On Linux this is possible thanks to a two-phase listening socket lookup, where it falls back to searching for an <code>INADDR_ANY</code> socket if a more specific match has not been found.</p>
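The two phases can be sketched as a toy lookup table in Python. This is a simplification (the listener names are made up, and the real kernel lookup also scores bound devices and SO_REUSEPORT groups), but it captures the wildcard fallback:

```python
# Toy listener table: (local address, port) -> service name.
listeners = {
    ("192.0.2.1", 53): "auth-dns",    # bound to a specific address
    ("0.0.0.0", 12345): "catch-all",  # bound to INADDR_ANY
}

def lookup(dst_ip, dst_port):
    # Phase 1: look for an exact (address, port) match.
    svc = listeners.get((dst_ip, dst_port))
    if svc is None:
        # Phase 2: fall back to a wildcard-bound socket on that port.
        svc = listeners.get(("0.0.0.0", dst_port))
    return svc

print(lookup("192.0.2.1", 53))        # auth-dns
print(lookup("198.51.100.7", 12345))  # catch-all
print(lookup("198.51.100.7", 53))     # None: no listener matches
```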
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1B0Td9aV8dFNmmOZ7Z5Ell/51b479c1ed0abae1d71038ab030dd98b/tcp_socket_lookup-1.png" />
          </figure><p>Another upside of binding to <code>0.0.0.0</code> is that our application doesn’t need to be aware of what addresses we have assigned to our host. We are also free to assign or remove the addresses after binding the listening socket. No need to reconfigure the service when its listening IP range changes.</p><p>On the other hand if our service should be listening on just <code>A.B.C.0/20</code> prefix, binding to all local addresses is more than we need. We might unintentionally expose an otherwise internal-only service to external traffic without a proper firewall or a socket filter in place.</p><p>Then there is the security angle. Since we now only have one socket, attacks attempting to flood any of the IPs assigned to our host on our service’s port, will hit the catch-all socket and its receive queue. While in such circumstances the Linux <a href="https://blog.cloudflare.com/syn-packet-handling-in-the-wild/">TCP stack has your back</a>, UDP needs special care or legitimate traffic might drown in the flood of dropped packets.</p><p>Possibly the biggest downside, though, is that a service listening on the wildcard <code>INADDR_ANY</code> address claims the port number exclusively for itself. Binding over the wildcard-listening socket with a specific IP and port fails miserably due to the address already being taken (<code>EADDRINUSE</code>).</p>
            <pre><code>bind(3, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
</code></pre>
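The strace output above is easy to reproduce from an unprivileged Python session; a minimal sketch:

```python
import errno
import socket

# A wildcard-bound listener claims the port on every local address...
wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
wildcard.bind(("0.0.0.0", 0))        # let the stack pick a free port
port = wildcard.getsockname()[1]
wildcard.listen(16)

# ...so binding a specific address to the same port is refused.
specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    specific.bind(("127.0.0.1", port))
    outcome = "bound"
except OSError as exc:
    outcome = "EADDRINUSE" if exc.errno == errno.EADDRINUSE else "other"
finally:
    specific.close()
    wildcard.close()

print(outcome)  # EADDRINUSE
```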
            <p>Unless your service is UDP-only, setting the <code>SO_REUSEADDR</code> socket option will not help you overcome this restriction. The only way out is to turn to <code>SO_REUSEPORT</code>, normally used to construct a load-balancing socket group. And that is only if you are lucky enough to run the port-conflicting services as the same user (UID). That is a story for another post.</p><p>Quick Quiz #3: Does setting the <code>SO_REUSEADDR</code> socket option have any effect at all when there is a bind conflict? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-3">Jump to answer</a>.</p>
    <div>
      <h2>Life gets real - one port, two services</h2>
      <a href="#life-gets-real-one-port-two-services">
        
      </a>
    </div>
    <p>As it happens, at the Cloudflare edge we do host services that share the same port number but otherwise respond to requests on non-overlapping IP ranges. A prominent example of such port-sharing is our <a href="https://blog.cloudflare.com/dns-resolver-1-1-1-1/">1.1.1.1</a> recursive DNS resolver running side-by-side with the <a href="https://www.cloudflare.com/dns/">authoritative DNS service</a> that we offer to all customers.</p><p>Sadly the <a href="http://man7.org/linux/man-pages/man2/bind.2.html">sockets API</a> doesn’t allow us to express a setup in which two services share a port and accept requests on disjoint IP ranges.</p><p>However, as Linux development history shows, any networking API limitation can be overcome by introducing a new <a href="https://github.com/torvalds/linux/blame/master/include/uapi/asm-generic/socket.h">socket option</a>, with sixty-something options available (and counting!).</p><p>Enter <code>SO_BINDTOPREFIX</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VW8uJH7XkJKqUt6HZXWiU/6995812a081001a31238f45c0455960b/mapping_bindtoprefix-1.png" />
          </figure><p>Back in 2016 we proposed <a href="https://lore.kernel.org/netdev/1458699966-3752-1-git-send-email-gilberto.bertin@gmail.com/">an extension to the Linux network stack</a>. It allowed services to constrain a wildcard-bound socket to an IP range belonging to a network prefix.</p>
            <pre><code>import struct
from socket import socket, AF_INET, SOCK_STREAM, SOL_IP, inet_aton

# IP_BINDTOPREFIX is the custom socket option added by our kernel
# patch; it is not defined in the upstream headers.

# Service 1, 127.0.0.0/20, 1234/tcp
net1, plen1 = '127.0.0.0', 20
bindprefix1 = struct.pack('BBBBBxxx', *inet_aton(net1), plen1)

s1 = socket(AF_INET, SOCK_STREAM, 0)
s1.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix1)
s1.bind(('0.0.0.0', 1234))
s1.listen(1)

# Service 2, 127.0.16.0/20, 1234/tcp
net2, plen2 = '127.0.16.0', 20
bindprefix2 = struct.pack('BBBBBxxx', *inet_aton(net2), plen2)

s2 = socket(AF_INET, SOCK_STREAM, 0)
s2.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix2)
s2.bind(('0.0.0.0', 1234))
s2.listen(1)
</code></pre>
            <p>This mechanism has served us well since then. Unfortunately, it didn’t get accepted upstream due to being too specific to our use case. Having no better alternative, we ended up maintaining the patches in our kernel to this day.</p>
    <div>
      <h2>Life gets complicated - all ports, one service</h2>
      <a href="#life-gets-complicated-all-ports-one-service">
        
      </a>
    </div>
    <p>Just when we thought we had things figured out, we were faced with a new challenge. How to build a service that accepts connections on any of the 65,535 ports? The ultimate reverse proxy, if you will, code-named <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a>.</p><p>The <code>bind</code> syscall offers very little flexibility when it comes to mapping a socket to a port number. You can either specify the number you want or let the network stack pick an unused one for you. There is no counterpart of <code>INADDR_ANY</code>, a wildcard value to select all ports (<code>INPORT_ANY</code>?).</p><p>To achieve what we wanted, we had to turn to <a href="https://blog.cloudflare.com/how-we-built-spectrum/">TPROXY</a>, a <a href="https://www.kernel.org/doc/Documentation/networking/tproxy.txt">Netfilter / <code>iptables</code> extension</a> designed for intercepting remote-destined traffic on the forward path. However, we use it to steer local-destined packets, that is, ones targeted at our host, to a catch-all-ports socket.</p>
            <pre><code>iptables -t mangle -I PREROUTING \
         -d 192.0.2.0/24 -p tcp \
         -j TPROXY --on-ip=127.0.0.1 --on-port=1234</code></pre>
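<p>On the socket side, the catch-all listener targeted by the rule above has to be created with the <code>IP_TRANSPARENT</code> option set. A rough Python sketch (the option value is taken from <code>linux/in.h</code>; setting it requires elevated privileges, which the code detects rather than assumes):</p>

```python
import socket

IP_TRANSPARENT = 19  # from linux/in.h; not exported by older Python versions

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Allows binding to, and accepting on, non-local addresses.
    # Needs CAP_NET_ADMIN, hence "elevated privileges".
    s.setsockopt(socket.SOL_IP, IP_TRANSPARENT, 1)
    transparent = True
except PermissionError:
    transparent = False

# In the real setup this would be the --on-ip/--on-port target
# (127.0.0.1:1234); port 0 here to keep the sketch conflict-free.
s.bind(('127.0.0.1', 0))
s.listen(16)
```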
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3n0QTqrm2m4ZTiqPtiiF9a/fbbd5a464721df6ae8017805e2d540dd/mapping_tproxy-1.png" />
          </figure><p>A TPROXY-based setup comes at a price. For starters, your service needs elevated privileges to create a special catch-all socket (see the <a href="http://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_TRANSPARENT</code> socket option</a>). Then you also have to understand and consider the subtle interactions between TPROXY and the receive path for your traffic profile, for example:</p><ul><li><p>does connection tracking register the flows redirected with TPROXY?</p></li><li><p>is listening socket contention during a SYN flood when using TPROXY a concern?</p></li><li><p>do other parts of the network stack, like XDP programs, need to know about TPROXY redirecting packets?</p></li></ul><p>These are some of the questions we needed to answer, and after running it in production for a while now, we have a good idea of what the consequences of using TPROXY are.</p><p>That said, it would not come as a shock if tomorrow we discovered something new about TPROXY. Due to its complexity we’ve always considered using it to steer local-destined traffic a <a href="https://blog.cloudflare.com/how-we-built-spectrum/">hack</a>, a use case outside its intended application. No matter how well understood, a hack remains a hack.</p>
    <div>
      <h2>Can BPF make life easier?</h2>
      <a href="#can-bpf-make-life-easier">
        
      </a>
    </div>
    <p>Despite its complex nature TPROXY shows us something important. No matter what IP or port the listening socket is bound to, with a bit of support from the network stack, we can steer any connection to it. As long as the application is ready to handle this situation, things work.</p><p>Quick Quiz #4: Are there really no problems with accepting any connection on any socket? <a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-4">Jump to answer</a>.</p><p>This is a really powerful concept. With a bunch of TPROXY rules, we can configure any mapping between (address, port) tuples and listening sockets.</p><p><b>Idea #1:</b> A local-destined connection can be accepted by any listening socket.</p><p>We didn’t tell you the whole story before. When we published <code>SO_BINDTOPREFIX</code> patches, they did not just get rejected. <a href="https://meta.wikimedia.org/wiki/Cunningham%27s_Law">As sometimes happens</a>, by posting the wrong answer we got <a href="https://lore.kernel.org/netdev/1459261895.6473.176.camel@edumazet-glaptop3.roam.corp.google.com/">the right answer</a> to our problem:</p><blockquote><p>❝BPF is absolutely the way to go here, as it allows for whatever user specified tweaks, like a list of destination subnetwork, or/and a list of source network, or the date/time of the day, or port knocking without netfilter, or … you name it.❞</p></blockquote><p><b>Idea #2:</b> How we pick a listening socket can be tweaked with BPF.</p><p>Combine the two ideas together, and we arrive at an exciting concept. Let’s run BPF code to match an incoming packet with a listening socket, ignoring the address the socket is bound to.</p><p>Here’s an example to illustrate it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5zhLPX1wssfv3c4QvObgz9/66812e76990de528517d020dbca70638/idea_program_socket_lookup_with_bpf-2.png" />
          </figure><p>All packets coming on <code>192.0.2.0/24</code> prefix, port <code>53</code> are steered to socket <code>sk:2</code>, while traffic targeted at <code>203.0.113.1</code>, on any port number lands in socket <code>sk:4</code>.</p>
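<p>In pure-Python terms, the dispatch in the figure behaves like a first-match rule table. A toy model to illustrate the idea (the rule table and socket labels are made up to match the figure):</p>

```python
import ipaddress

# (destination prefix, destination port or None for "any port", socket label)
RULES = [
    (ipaddress.ip_network('192.0.2.0/24'), 53, 'sk:2'),
    (ipaddress.ip_network('203.0.113.1/32'), None, 'sk:4'),
]

def steer(dst_addr, dst_port):
    """Return the socket label for the first matching rule, else None."""
    dst = ipaddress.ip_address(dst_addr)
    for prefix, port, sk in RULES:
        if dst in prefix and port in (None, dst_port):
            return sk
    return None  # fall through to the regular socket lookup
```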
    <div>
      <h2>Welcome BPF inet_lookup</h2>
      <a href="#welcome-bpf-inet_lookup">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Pg5Lou5IFLuZcFHbaY46O/590e5267f496e9ce2e1eb6d1460bbd45/bpf_inet_lookup_hook-1.png" />
          </figure><p>To make this concept a reality, we are proposing a new mechanism to program the socket lookup with BPF. What is socket lookup? It’s a stage on the receive path where the transport layer searches for a socket to dispatch the packet to. It is the last possible moment to steer a packet before it lands in the selected socket’s receive queue. There we attach a new type of BPF program called <code>inet_lookup</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yATE46s8d1no0gYCw6Asw/6cd9e449e7d6b0571a8c69ce9e4d9f33/tcp_socket_lookup_with_bpf-1.png" />
          </figure><p>If you recall, socket lookup in the Linux TCP stack is a <a href="https://elixir.bootlin.com/linux/v5.4-rc2/source/include/net/inet_hashtables.h#L329">two phase process</a>. First the kernel will try to find an established (connected) socket matching the packet 4-tuple. If there isn’t one, it will continue by looking for a listening socket using just the packet 2-tuple as key.</p><p>Our proposed extension allows users to program the second phase, the listening socket lookup. If present, a BPF program is allowed to choose a listening socket and terminate the lookup. Our program is also free to ignore the packet, in which case the kernel will continue to look for a listening socket as usual.</p>
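<p>A toy model of the two-phase lookup (our own sketch, with Python dicts standing in for the kernel hash tables) helps picture where the hook sits:</p>

```python
def socket_lookup(established, listening, saddr, sport, daddr, dport):
    # Phase 1: exact 4-tuple match against connected sockets
    sk = established.get((saddr, sport, daddr, dport))
    if sk is not None:
        return sk
    # Phase 2: 2-tuple match against listening sockets; an exact
    # local address wins over the INADDR_ANY wildcard. The proposed
    # BPF program runs at the start of this phase and may short-circuit it.
    return listening.get((daddr, dport), listening.get(('0.0.0.0', dport)))
```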
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bjoj0UUcthGLLGOSTbida/3e139697c9d897cd11aabbdf80946032/bpf_inet_lookup_operation-1.png" />
          </figure><p>How does this new type of BPF program operate? On input, as context, it gets handed a subset of information extracted from packet headers, including the packet 4-tuple. Based on the input the program accesses a BPF map containing references to listening sockets, and selects one to yield as the socket lookup result.</p><p>If we take a look at the corresponding BPF code, the program structure resembles a firewall rule. We have some match statements followed by an action.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lhm5L9NH0AN8tnMh2cMpl/14656d47d0e81440617893422ac2140b/bpf_inet_lookup_code_sample-2.png" />
          </figure><p>You may notice that we don’t access the BPF map with sockets directly. Instead, we follow an established pattern in BPF called “map based redirection”, where a dedicated BPF helper accesses the map and carries out any steps necessary to redirect the packet.</p><p>We’ve skipped over one thing. Where does the BPF map of sockets come from? We create it ourselves and populate it with sockets. This is most easily done if your service uses systemd <a href="http://0pointer.de/blog/projects/socket-activation.html">socket activation</a>. systemd will let you associate more than one service unit with a socket unit, and both of the services will receive a file descriptor for the same socket. From there it’s just a matter of inserting the socket into the BPF map.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ufZk0QnR4cOf061PvseTW/a7bdf0a77d655be88adc527cadc4a5bf/bpf_inet_lookup_socket_activation-1.png" />
          </figure>
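<p>On the service side, picking up the sockets passed by systemd boils down to reading two environment variables. A minimal sketch mirroring what <code>sd_listen_fds(3)</code> does (the function name is ours):</p>

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited fd, per sd_listen_fds(3)

def inherited_sockets():
    # systemd sets LISTEN_PID to the service's PID and LISTEN_FDS to
    # the number of fds it passed, numbered consecutively from fd 3.
    if os.environ.get('LISTEN_PID') != str(os.getpid()):
        return []
    count = int(os.environ.get('LISTEN_FDS', '0'))
    return [socket.socket(fileno=fd)
            for fd in range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count)]
```

<p>Each returned socket can then be inserted into the BPF sockets map.</p>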
    <div>
      <h2>Demo time!</h2>
      <a href="#demo-time">
        
      </a>
    </div>
    <p>This is not just a concept. We have already published a first working <a href="https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/">set of patches for the kernel</a> together with ancillary <a href="https://github.com/majek/inet-tool">user-space tooling</a> to configure the socket lookup to your needs.</p><p>If you would like to see it in action, you are in luck. We’ve put together a demo that shows just how easily you can bind a network service to (i) a single port, (ii) all ports, or (iii) a network prefix. On-the-fly, without having to restart the service! There is a port scan running to prove it.</p><p>You can also bind to all-addresses-all-ports (<code>0.0.0.0/0</code>) because why not? Take that <code>INADDR_ANY</code>. All thanks to BPF superpowers.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>We have gone over how the way we bind services to network addresses on the Cloudflare edge has evolved over time. Each approach has its pros (+) and cons (-), summarized below. We are currently working on a new BPF-based mechanism for binding services to addresses, which is intended to address the shortcomings of existing solutions.</p><p><b>bind to one address and port
</b>+ flood traffic on one address hits one socket, doesn’t affect the rest
- as many sockets as listening addresses, doesn’t scale</p><p><b>bind to all addresses with </b><b><code>INADDR_ANY
</code></b>+ just one socket for all addresses, the kernel thanks you
+ application doesn’t need to know about listening addresses
- flood scenario requires custom protection, at least for UDP
- port sharing is tricky or impossible</p><p><b>bind to a network prefix with </b><b><code>SO_BINDTOPREFIX
</code></b>+ two services can share a port if their IP ranges are non-overlapping
- custom kernel API extension that never went upstream</p><p><b>bind to all ports with TPROXY
</b>+ enables redirecting all ports to a listening socket and more
- meant for intercepting forwarded traffic early on the ingress path
- has subtle interactions with the network stack
- requires privileges from the application</p><p><b>bind to anything you want with BPF </b><b><code>inet_lookup
</code></b>+ allows for the same flexibility as with TPROXY or <code>SO_BINDTOPREFIX</code>
+ services don’t need extra capabilities, meant for local traffic only
- needs cooperation from services or PID 1 to build a socket map</p><hr /><p>Getting to this point has been a team effort. A special thank you to Lorenz Bauer and <a href="https://blog.cloudflare.com/author/marek-majkowski/">Marek Majkowski</a> who have contributed in an essential way to the BPF <code>inet_lookup</code> implementation. The <code>SO_BINDTOPREFIX</code> patches were authored by <a href="https://blog.cloudflare.com/author/gilberto-bertin/">Gilberto Bertin</a>.</p><p>Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p>
    <div>
      <h2>Quiz Answers</h2>
      <a href="#quiz-answers">
        
      </a>
    </div>
    
    <div>
      <h3>Quiz 1</h3>
      <a href="#quiz-1">
        
      </a>
    </div>
    <p>Q: How many Cloudflare services can you name?</p><ol><li><p><a href="https://www.cloudflare.com/cdn/">HTTP CDN</a> (tcp/80)</p></li><li><p><a href="https://www.cloudflare.com/ssl/">HTTPS CDN</a> (tcp/443, <a href="https://cloudflare-quic.com/">udp/443</a>)</p></li><li><p><a href="https://www.cloudflare.com/dns/">authoritative DNS</a> (udp/53)</p></li><li><p><a href="https://blog.cloudflare.com/dns-resolver-1-1-1-1/">recursive DNS</a> (udp/53, 853)</p></li><li><p><a href="https://blog.cloudflare.com/secure-time/">NTP with NTS</a> (udp/1234)</p></li><li><p><a href="https://blog.cloudflare.com/roughtime/">Roughtime time service</a> (udp/2002)</p></li><li><p><a href="https://blog.cloudflare.com/distributed-web-gateway/">IPFS Gateway</a> (tcp/443)</p></li><li><p><a href="https://blog.cloudflare.com/cloudflare-ethereum-gateway/">Ethereum Gateway</a> (tcp/443)</p></li><li><p><a href="https://blog.cloudflare.com/spectrum/">Spectrum proxy</a> (tcp/any, udp/any)</p></li><li><p><a href="https://blog.cloudflare.com/announcing-warp-plus/">WARP</a> (udp)</p></li></ol><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-1-question">Go back</a></p>
    <div>
      <h3>Quiz 2</h3>
      <a href="#quiz-2">
        
      </a>
    </div>
    <p>Q: Is there another way to bind a socket to all local addresses?</p><p>Yes, there is - by not <code>bind()</code>’ing it at all. Calling <code>listen()</code> on an unbound socket is equivalent to binding it to <code>INADDR_ANY</code> and letting the kernel pick a free port.</p>
            <pre><code>$ strace -e socket,bind,listen nc -l
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
listen(3, 1)                            = 0
^Z
[1]+  Stopped                 strace -e socket,bind,listen nc -l
$ ss -4tlnp
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
LISTEN     0      1            *:42669</code></pre>
            <p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-2-question">Go back</a></p>
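<p>The same behavior can be reproduced from Python; note that <code>listen()</code> is called without any preceding <code>bind()</code>:</p>

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.listen(1)                   # no bind() beforehand
addr, port = s.getsockname()  # the kernel auto-bound the socket for us
```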
    <div>
      <h3>Quiz 3</h3>
      <a href="#quiz-3">
        
      </a>
    </div>
    <p>Q: Does setting the <code>SO_REUSEADDR</code> socket option have any effect at all when there is a bind conflict?</p><p>Yes. If two processes are racing to <code>bind</code> and <code>listen</code> on the same TCP port, on an overlapping IP, setting <code>SO_REUSEADDR</code> changes which syscall will report an error (<code>EADDRINUSE</code>). Without <code>SO_REUSEADDR</code> it will always be the second <code>bind</code>. With <code>SO_REUSEADDR</code> set there is a window of opportunity for a second <code>bind</code> to succeed but the subsequent <code>listen</code> to fail.</p><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-3-question">Go back</a></p>
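<p>This window can be demonstrated deterministically from a single process (a sketch; both sockets belong to the same user, as required):</p>

```python
import errno
import socket

def reuse_sk():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return s

s1, s2 = reuse_sk(), reuse_sk()
s1.bind(('127.0.0.1', 0))
port = s1.getsockname()[1]
s2.bind(('127.0.0.1', port))  # second bind succeeds: nobody is listening yet
s1.listen(1)
try:
    s2.listen(1)              # the conflict is reported here instead
    listen_errno = 0
except OSError as e:
    listen_errno = e.errno
```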
    <div>
      <h3><b>Quiz 4</b></h3>
      <a href="#quiz-4">
        
      </a>
    </div>
    <p>Q: Are there really no problems with accepting any connection on any socket?</p><p>If the connection is destined for an address assigned to our host, i.e. a local address, there are no problems. However, for remote-destined connections, sending return traffic from a non-local address (i.e., one not present on any interface) will not get past the Linux network stack. The <code>IP_TRANSPARENT</code> socket option lifts this restriction by bypassing a protection mechanism known as the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88ef4a5a78e63420dd1dd770f1bd1dc198926b04">source address check</a>.</p><p><a href="https://blog.cloudflare.com/its-crowded-in-here/#quiz-4-question">Go back</a></p>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[UDP]]></category>
            <guid isPermaLink="false">2tVUhaeVSAohZJbbDgK6vy</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[eBPF can't count?!]]></title>
            <link>https://blog.cloudflare.com/ebpf-cant-count/</link>
            <pubDate>Fri, 03 May 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you've read all the great man pages, docs, guides, and some of our blogs out there. But we can tell you a war story, and who doesn't like those?  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Grant mechanical calculating machine, public domain <a href="https://en.wikipedia.org/wiki/File:Grant_mechanical_calculating_machine_1877.jpg">image</a></p><p>It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you've read all the great <a href="http://man7.org/linux/man-pages/man2/bpf.2.html">man pages</a>, <a href="https://www.kernel.org/doc/Documentation/networking/filter.txt">docs</a>, <a href="https://cilium.readthedocs.io/en/latest/bpf/">guides</a>, and some of our <a href="https://blog.cloudflare.com/epbf_sockets_hop_distance/">blogs</a> out there.</p><p>But we can tell you a war story, and who doesn't like those? This one is about how eBPF lost its ability to count for a while<a href="#f1"><sup>1</sup></a>.</p><p>They say in our Austin, Texas office that all good stories start with "y'all ain't gonna believe this… tale." This one though, starts with a <a href="https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/">post</a> to Linux netdev mailing list from <a href="https://twitter.com/majek04">Marek Majkowski</a> after what I heard was a long night:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2z8UE9I87Kpw8ONzxBH3WW/96500c2ee678248af9c22936bb49cdee/ebpf_bug_email_netdev.png" />
            
            </figure><p>Marek's findings were quite shocking - if you subtract two 64-bit timestamps in eBPF, the result is garbage. But only when running as an unprivileged user. From root all works fine. Huh.</p><p>If you've seen Marek's <a href="https://speakerdeck.com/majek04/linux-at-cloudflare">presentation</a> from the Netdev 0x13 conference, you know that we are using BPF socket filters as one of the defenses against simple, volumetric DoS attacks. So potentially getting your packet count wrong could be a Bad Thing™, and affect legitimate traffic.</p><p>Let's try to reproduce this bug with a simplified <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.c#L63">eBPF socket filter</a> that subtracts two 64-bit unsigned integers passed to it from <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/run_bpf.go#L93">user-space</a> through a BPF map. The input for our BPF program comes from a <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.c#L44">BPF array map</a>, so that the values we operate on are not known at build time. This allows for easy experimentation and prevents the compiler from optimizing out the operations.</p><p>Starting small, eBPF, what is 2 - 1? View the code <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/run_bpf.go#L93">on our GitHub</a>.</p>
            <pre><code>$ ./run-bpf 2 1
arg0                    2 0x0000000000000002
arg1                    1 0x0000000000000001
diff                    1 0x0000000000000001</code></pre>
            <p>OK, eBPF, what is 2^32 - 1?</p>
            <pre><code>$ ./run-bpf $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff 18446744073709551615 0xffffffffffffffff</code></pre>
            <p>Wrong! But if we ask nicely with sudo:</p>
            <pre><code>$ sudo ./run-bpf $[2**32] 1
[sudo] password for jkbs:
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff</code></pre>
            
    <div>
      <h3>Who is messing with my eBPF?</h3>
      <a href="#who-is-messing-with-my-ebpf">
        
      </a>
    </div>
    <p>When computers stop subtracting, you know something big is up. We called for reinforcements.</p><p>Our colleague Arthur Fabre <a href="https://lore.kernel.org/netdev/20190301113901.29448-1-afabre@cloudflare.com/">quickly noticed</a> something is off when you examine the eBPF code loaded into the kernel. It turns out kernel doesn't actually run the eBPF it's supplied - it sometimes rewrites it first.</p><p>Any sane programmer would expect 64-bit subtraction to be expressed as <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.s#L47">a single eBPF instruction</a></p>
            <pre><code>$ llvm-objdump -S -no-show-raw-insn -section=socket1 bpf/filter.o
…
      20:       1f 76 00 00 00 00 00 00         r6 -= r7
…</code></pre>
            <p>However, that's not what the kernel actually runs. Apparently after the rewrite the subtraction becomes a complex, multi-step operation.</p><p>To see what the kernel is actually running we can use little known <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool?h=v5.0">bpftool utility</a>. First, we need to load our BPF</p>
            <pre><code>$ ./run-bpf --stop-after-load 2 1
[2]+  Stopped                 ./run-bpf 2 1</code></pre>
            <p>Then list all BPF programs loaded into the kernel with <code>bpftool prog list</code></p>
            <pre><code>$ sudo bpftool prog list
…
5951: socket_filter  name filter_alu64  tag 11186be60c0d0c0f  gpl
        loaded_at 2019-04-05T13:01:24+0200  uid 1000
        xlated 424B  jited 262B  memlock 4096B  map_ids 28786</code></pre>
            <p>The most recently loaded <code>socket_filter</code> must be our program (<code>filter_alu64</code>). Now we know its id is 5951 and we can list its bytecode with</p>
            <pre><code>$ sudo bpftool prog dump xlated id 5951
…
  33: (79) r7 = *(u64 *)(r0 +0)
  34: (b4) (u32) r11 = (u32) -1
  35: (1f) r11 -= r6
  36: (4f) r11 |= r6
  37: (87) r11 = -r11
  38: (c7) r11 s&gt;&gt;= 63
  39: (5f) r6 &amp;= r11
  40: (1f) r6 -= r7
  41: (7b) *(u64 *)(r10 -16) = r6
…</code></pre>
            <p>bpftool can also display the JITed code with: <code>bpftool prog dump jited id 5951</code>.</p><p>As you see, subtraction is replaced with a series of opcodes. That is unless you are root. When running from root all is good</p>
            <pre><code>$ sudo ./run-bpf --stop-after-load 0 0
[1]+  Stopped                 sudo ./run-bpf --stop-after-load 0 0
$ sudo bpftool prog list | grep socket_filter
659: socket_filter  name filter_alu64  tag 9e7ffb08218476f3  gpl
$ sudo bpftool prog dump xlated id 659
…
  31: (79) r7 = *(u64 *)(r0 +0)
  32: (1f) r6 -= r7
  33: (7b) *(u64 *)(r10 -16) = r6
…</code></pre>
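<p>To see why the rewritten sequence mangles 64-bit values, we can replay the instructions from the unprivileged xlated dump in Python, modeling the 64-bit registers with masked integers (our own simulation, not kernel code):</p>

```python
M64 = (1 << 64) - 1  # 64-bit register mask

def patched_sub(r6, r7):
    # Replay of instructions 34-40 from the unprivileged xlated dump.
    r11 = 0xffffffff                # 34: (u32) r11 = (u32) -1
    r11 = (r11 - r6) & M64          # 35: r11 -= r6
    r11 |= r6                       # 36: r11 |= r6
    r11 = (-r11) & M64              # 37: r11 = -r11
    r11 = M64 if r11 >> 63 else 0   # 38: r11 s>>= 63 (arithmetic shift)
    r6 &= r11                       # 39: r6 &= r11 -- mask is 0 for r6 > 0xffffffff!
    return (r6 - r7) & M64          # 40: r6 -= r7

assert patched_sub(2, 1) == 1                       # small values survive
assert patched_sub(2**32, 1) == 0xffffffffffffffff  # the garbage we observed
```

<p>The mask computed in instructions 34-38 zeroes <code>r6</code> whenever it exceeds 32 bits, which is exactly why <code>2^32 - 1</code> came out as <code>0xffffffffffffffff</code>.</p>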
            <p>If you've spent any time using eBPF, you must have experienced firsthand the dreaded eBPF verifier. It's a merciless judge of all eBPF code that will reject any programs that it deems not worthy of running in kernel-space.</p><p>What perhaps nobody has told you before, and what might come as a surprise, is that the very same verifier will actually also <a href="https://elixir.bootlin.com/linux/v4.20.13/source/kernel/bpf/verifier.c#L6421">rewrite and patch up your eBPF code</a> as needed to make it safe.</p><p>The problems with subtraction were introduced by an inconspicuous security fix to the verifier. The patch in question first landed in Linux 5.0 and was backported to 4.20.6 stable and 4.19.19 LTS kernel. <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=979d63d50c0c0f7bc537bf821e056cc9fe5abd38">The over-2,000-word commit message</a> doesn't spare you any details on the attack vector it targets.</p><p>The mitigation stems from the <a href="https://nvd.nist.gov/vuln/detail/CVE-2019-7308">CVE-2019-7308</a> vulnerability <a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=1711">discovered by Jann Horn at Project Zero</a>, which exploits pointer arithmetic, i.e. adding a scalar value to a pointer, to trigger speculative memory loads from out-of-bounds addresses. Such speculative loads change the CPU cache state and can be used to mount a <a href="https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html">Spectre variant 1 attack</a>.</p><p>To mitigate it the eBPF verifier rewrites any arithmetic operations on pointer values in such a way that the result is always a memory location within bounds. The patch demonstrates how arithmetic operations on pointers get rewritten and we can spot a familiar pattern there.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8irnzqpgIARBl3UdtS30S/5e885701ef2fe3a9426b1e960e48f0d1/bpf_commit.png" />
            
            </figure><p>Wait a minute… What pointer arithmetic? We are just trying to subtract two scalar values. How come the mitigation kicks in?</p><p>It shouldn't. It's a bug. The eBPF verifier keeps track of what kind of values the ALU is operating on, and in this corner case the state was ignored.</p><p>Why running BPF as root is fine, you ask? If your program has <code>CAP_SYS_ADMIN</code> privileges, side-channel mitigations <a href="https://elixir.bootlin.com/linux/v5.0/source/kernel/bpf/verifier.c#L7218">don't</a> <a href="https://elixir.bootlin.com/linux/v5.0/source/kernel/bpf/verifier.c#L3109">apply</a>. As root you already have access to kernel address space, so nothing new can leak through BPF.</p><p>After our report, <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3612af783cf52c74a031a2f11b82247b2599d3cd">the fix</a> has quickly landed in v5.0 kernel and got backported to stable kernels 4.20.15 and 4.19.28. Kudos to Daniel Borkmann for getting the fix out fast. However, kernel upgrades are hard and in the meantime we were left with code running in production that was not doing what it was supposed to.</p>
    <div>
      <h3>32-bit ALU to the rescue</h3>
      <a href="#32-bit-alu-to-the-rescue">
        
      </a>
    </div>
    <p>As one of the eBPF maintainers <a href="https://lore.kernel.org/netdev/af0643e0-08a1-6326-2a80-71892de1bf56@iogearbox.net/">has pointed out</a>, 32-bit arithmetic operations are not affected by the verifier bug. This opens a door for a workaround.</p><p>eBPF registers, <code>r0</code>..<code>r10</code>, are 64 bits wide, but you can also access just the lower 32 bits, which are exposed as subregisters <code>w0</code>..<code>w10</code>. You can operate on the 32-bit subregisters using the BPF ALU32 instruction subset. LLVM 7+ can generate eBPF code that uses this instruction subset. Of course, you need to ask it nicely with the trivial <code>-Xclang -target-feature -Xclang +alu32</code> toggle:</p>
            <pre><code>$ cat sub32.c
#include "common.h"

u32 sub32(u32 x, u32 y)
{
        return x - y;
}
$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub32.c
$ llvm-objdump -S -no-show-raw-insn sub32.o
…
sub32:
       0:       bc 10 00 00 00 00 00 00         w0 = w1
       1:       1c 20 00 00 00 00 00 00         w0 -= w2
       2:       95 00 00 00 00 00 00 00         exit</code></pre>
            <p>The <code>0x1c</code> <a href="https://elixir.bootlin.com/linux/v5.0/source/include/uapi/linux/bpf_common.h#L11">opcode</a> of the instruction #1, which can be broken down as <code>BPF_ALU | BPF_X | BPF_SUB</code> (read more in the <a href="https://www.kernel.org/doc/Documentation/networking/filter.txt">kernel docs</a>), is the 32-bit subtraction between registers we are looking for, as opposed to regular 64-bit subtract operation <code>0x1f = BPF_ALU64 | BPF_X | BPF_SUB</code>, which will get rewritten.</p><p>Armed with this knowledge we can borrow a page from <a href="https://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic">bignum arithmetic</a> and subtract 64-bit numbers using just 32-bit ops:</p>
            <pre><code>u64 sub64(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        u32 hi, lo;

        xl = x;
        yl = y;
        lo = xl - yl;

        xh = x &gt;&gt; 32;
        yh = y &gt;&gt; 32;
        hi = xh - yh - (lo &gt; xl); /* underflow? */

        return ((u64)hi &lt;&lt; 32) | (u64)lo;
}</code></pre>
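<p>As a sanity check, the same borrow-propagation trick can be modeled in Python, with masking standing in for the fixed-width registers:</p>

```python
M32 = (1 << 32) - 1
M64 = (1 << 64) - 1

def sub64(x, y):
    xl, yl = x & M32, y & M32
    lo = (xl - yl) & M32
    xh, yh = (x >> 32) & M32, (y >> 32) & M32
    hi = (xh - yh - (lo > xl)) & M32  # borrow if the low word underflowed
    return (hi << 32) | lo

# agrees with plain 64-bit wrap-around subtraction
assert sub64(2**32, 1) == 2**32 - 1
assert sub64(5, 7) == (5 - 7) & M64
```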
            <p>This code compiles as expected on normal architectures, like x86-64 or ARM64, but BPF Clang target plays by its own rules:</p>
            <pre><code>$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub64.c -o - \
  | llvm-objdump -S -
…  
      13:       1f 40 00 00 00 00 00 00         r0 -= r4
      14:       1f 30 00 00 00 00 00 00         r0 -= r3
      15:       1f 21 00 00 00 00 00 00         r1 -= r2
      16:       67 00 00 00 20 00 00 00         r0 &lt;&lt;= 32
      17:       67 01 00 00 20 00 00 00         r1 &lt;&lt;= 32
      18:       77 01 00 00 20 00 00 00         r1 &gt;&gt;= 32
      19:       4f 10 00 00 00 00 00 00         r0 |= r1
      20:       95 00 00 00 00 00 00 00         exit</code></pre>
            <p>Apparently the compiler decided it was better to operate on 64-bit registers and discard the upper 32 bits. Thus we weren't able to get rid of the problematic <code>0x1f</code> opcode. Annoying - back to square one.</p>
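<p>The opcode values above are easy to double-check by OR-ing together the constants from the kernel's BPF headers (<code>linux/bpf_common.h</code> and <code>linux/bpf.h</code>):</p>

```python
# Instruction class, source, and operation bits from the BPF headers
BPF_ALU   = 0x04   # 32-bit ALU operations
BPF_ALU64 = 0x07   # 64-bit ALU operations
BPF_X     = 0x08   # source operand is a register
BPF_SUB   = 0x10   # subtraction

assert BPF_ALU | BPF_X | BPF_SUB == 0x1c    # w-register subtract (safe)
assert BPF_ALU64 | BPF_X | BPF_SUB == 0x1f  # r-register subtract (rewritten)
```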
    <div>
      <h3>Surely a bit of IR will do?</h3>
      <a href="#surely-a-bit-of-ir-will-do">
        
      </a>
    </div>
    <p>The problem was in Clang frontend - compiling C to IR. We know that BPF "assembly" backend for LLVM can generate bytecode that uses ALU32 instructions. Maybe if we tweak the Clang compiler's output just a little we can achieve what we want. This means we have to get our hands dirty with the LLVM Intermediate Representation (IR).</p><p>If you haven't heard of LLVM IR before, now is a good time to do some <a href="http://www.aosabook.org/en/llvm.html">reading</a><a href="#f2"><sup>2</sup></a>. In short the LLVM IR is what Clang produces and LLVM BPF backend consumes.</p><p>Time to write IR by hand! Here's a hand-tweaked <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/sub64_ir.ll#L7">IR variant</a> of our <code>sub64()</code> function:</p>
            <pre><code>define dso_local i64 @sub64_ir(i64, i64) local_unnamed_addr #0 {
  %3 = trunc i64 %0 to i32      ; xl = (u32) x;
  %4 = trunc i64 %1 to i32      ; yl = (u32) y;
  %5 = sub i32 %3, %4           ; lo = xl - yl;
  %6 = zext i32 %5 to i64
  %7 = lshr i64 %0, 32          ; tmp1 = x &gt;&gt; 32;
  %8 = lshr i64 %1, 32          ; tmp2 = y &gt;&gt; 32;
  %9 = trunc i64 %7 to i32      ; xh = (u32) tmp1;
  %10 = trunc i64 %8 to i32     ; yh = (u32) tmp2;
  %11 = sub i32 %9, %10         ; hi = xh - yh
  %12 = icmp ult i32 %3, %5     ; tmp3 = xl &lt; lo
  %13 = zext i1 %12 to i32
  %14 = sub i32 %11, %13        ; hi -= tmp3
  %15 = zext i32 %14 to i64
  %16 = shl i64 %15, 32         ; tmp2 = hi &lt;&lt; 32
  %17 = or i64 %16, %6          ; res = tmp2 | (u64)lo
  ret i64 %17
}</code></pre>
            <p>It may not be pretty but it does produce the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/sub64_ir.s#L5">desired BPF code</a> when compiled<a href="#f3"><sup>3</sup></a>. You will likely find the <a href="https://llvm.org/docs/LangRef.html">LLVM IR reference</a> helpful when deciphering it.</p><p>And voilà! The first working solution that produces correct results:</p>
            <pre><code>$ ./run-bpf -filter ir $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff</code></pre>
            <p>Actually using this hand-written IR function from C is tricky. See <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/build.ninja#L27">our code on GitHub</a>.</p>
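<p>For readers more fluent in C than IR, here is a sketch of what the IR above computes, step by step (our own transliteration, not code from the repository; when compiled for x86 the compiler is of course free to widen it again - the point is the algorithm):</p>

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* C transliteration of the hand-written IR: a 64-bit subtraction
   built from 32-bit halves, with manual borrow propagation. */
static u64 sub64_ir(u64 x, u64 y)
{
	u32 xl = (u32)x;             /* trunc i64 %0 to i32 */
	u32 yl = (u32)y;             /* trunc i64 %1 to i32 */
	u32 lo = xl - yl;            /* sub i32 - wraps around on borrow */
	u32 xh = (u32)(x >> 32);     /* lshr i64 %0, 32 + trunc */
	u32 yh = (u32)(y >> 32);     /* lshr i64 %1, 32 + trunc */
	u32 hi = xh - yh;            /* sub i32 on the high halves */

	hi -= xl < lo;               /* icmp ult: borrow out of the low half */
	return ((u64)hi << 32) | lo; /* zext + shl + or: glue halves together */
}
```

<p>Feeding it the same inputs as before, <code>sub64_ir(4294967296, 1)</code> yields 4294967295, matching the <code>run-bpf</code> session above.</p>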
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6zQLSvXLH19lXCjYnstl2a/f8fb73710fc8e5eda96476006974eea5/LED_DISP-1.JPG.jpeg" />
            
            </figure><p>public domain <a href="https://en.wikipedia.org/wiki/File:LED_DISP.JPG">image</a> by <a href="https://commons.wikimedia.org/wiki/User:Sergei_Frolov">Sergei Frolov</a></p>
    <div>
      <h3>The final trick</h3>
      <a href="#the-final-trick">
        
      </a>
    </div>
    <p>Hand-written IR does the job. The downside is that linking IR modules into your C modules is hard. Fortunately, there is a better way. You can persuade Clang to stick to 32-bit ALU ops in the IR it generates.</p><p>We've already seen the problem. To recap: if we ask Clang to subtract 32-bit integers, it will operate on 64-bit values and throw away the top 32 bits. Putting C, IR, and eBPF side-by-side helps visualize this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33bbylbcFR7zuF7NwxHcTs/fa9f110a71ab6e5f5cf8f7d70aadd587/sub32_v1.png" />
            
            </figure><p>The trick to get around it is to declare the 32-bit variable that holds the result as <code>volatile</code>. You might already know the <a href="https://en.wikipedia.org/wiki/Volatile_(computer_programming)"><code>volatile</code> keyword</a> if you've written Unix signal handlers. It tells the compiler that the variable's value may change under its feet, so it must not reorder or eliminate loads (reads) from it. Likewise, stores (writes) to it might have side effects, so reordering them, or eliminating them by skipping the write to memory, is not allowed either.</p><p>Using <code>volatile</code> makes Clang emit <a href="https://llvm.org/docs/LangRef.html#volatile-memory-accesses">special loads and/or stores</a> at the IR level, which on the eBPF level translates to writing/reading the value from memory (the stack) on every access. While this sounds unrelated to the problem at hand, it has a surprising side effect:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fEaCu15LmmeauGv7Vh1Wb/354c4a8aa60da7a7e5797fe0f951508d/sub32_v2.png" />
            
            </figure><p>With volatile access, the compiler doesn't promote the subtraction to 64 bits! Don't ask me why, although I would love to hear an explanation. For now, consider this a hack. One that does not come for free - there is the overhead of going through the stack on each read/write.</p><p>However, if we play our cards right, we just might reduce it a little. We don't actually need the volatile load or store to happen; we just want the side effect. So instead of declaring the value as <code>volatile</code>, which implies that both reads and writes are volatile, let's try to make only the writes volatile with the help of a macro:</p>
            <pre><code>/* Emits a "store volatile" in LLVM IR */
#define ST_V(rhs, lhs) (*(volatile typeof(rhs) *) &amp;(rhs) = (lhs))</code></pre>
            <p>If this macro looks strangely familiar, it's because it does the same thing as the <a href="https://elixir.bootlin.com/linux/v5.1-rc5/source/include/linux/compiler.h#L214"><code>WRITE_ONCE()</code> macro</a> in the Linux kernel. Applying it to our example:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qQ1CrUvUBsJJE09xVqzA1/bc84a1c3fdd5a855c505c56c2273e664/sub32_v3.png" />
            
            </figure><p>That's another <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-04-ebpf-alu32/bpf/filter.c#L143">hacky but working solution</a>. Pick your poison.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/28GXw3wm8YlUgg9nvnBwEx/2939bf7b36758a66685385e16a48cc5f/poison_bottles-2.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a> <a href="https://commons.wikimedia.org/wiki/File:D-BW-Kressbronn_aB_-_Kl%C3%A4ranlage_067.jpg">image</a> by ANKAWÜ</p><p>So there you have it - from C, to IR, and back to C, to hack around a bug in the eBPF verifier and be able to subtract 64-bit integers again. Usually you won't have to dive into LLVM IR or assembly to make use of eBPF. But it does help to know a little about it when things don't work as expected.</p><p>Did I mention that 64-bit addition is also broken? Have fun fixing it!</p><hr /><p><sup>1</sup> Okay, it was more like 3 months until the bug was discovered and fixed.</p><p><sup>2</sup> Some even think that it is <a href="https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/">better than assembly</a>.</p><p><sup>3</sup> How do we know? The litmus test is to look for statements matching <code>r[0-9] [-+]= r[0-9]</code> in the BPF assembly.</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2fXVvP0CJO4Bt4KuJthpbx</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
    </channel>
</rss>