
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 06:45:17 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How to build your own VPN, or: the history of WARP]]></title>
            <link>https://blog.cloudflare.com/how-to-build-your-own-vpn-or-the-history-of-warp/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ WARP’s initial implementation resembled a VPN that allows Internet access through it. Here’s how we built it – and how you can, too.  ]]></description>
<content:encoded><![CDATA[ <p>Linux’s networking capabilities are a crucial part of how Cloudflare serves billions of requests in the face of DDoS attacks. The tools it provides us are <a href="https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/"><u>invaluable and useful</u></a>, and a constant stream of contributions from developers worldwide ensures it <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>continually gets more capable and performant</u></a>.</p><p>When we developed <a href="https://blog.cloudflare.com/1111-warp-better-vpn/"><u>WARP, our mobile-first performance and security app</u></a>, we faced a new challenge: how to securely and efficiently egress arbitrary user packets for millions of mobile clients from our edge machines. This post explores our first solution, which was essentially building our own high-performance VPN with the Linux networking stack. We needed to integrate it into our existing network: not just linking it directly into our CDN service, but also providing a way to securely egress arbitrary user packets from Cloudflare machines. The lessons we learned here helped us develop new <a href="https://www.cloudflare.com/en-gb/zero-trust/products/gateway/"><u>products</u></a> and <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>capabilities</u></a> and discover more strange things besides. But first, how did we get started?</p>
    <div>
      <h2>A bridge between two worlds</h2>
      <a href="#a-bridge-between-two-worlds">
        
      </a>
    </div>
    <p>WARP’s initial implementation resembled a virtual private network (VPN) that allows Internet access through it. Specifically, a Layer 3 VPN – a tunnel for IP packets.</p><p>IP packets are the building blocks of the Internet. When you send data over the Internet, it is split into small chunks and sent separately in packets, each one labeled with a destination address (who the packet goes to) and a source address (who to send a reply to). If you are connected to the Internet, you have an IP address.</p><p>You may not have a <i>unique</i> IP address, though. This is certainly true for IPv4 which, despite our and many others’ long-standing efforts to move everyone to IPv6, is still in widespread use. IPv4 has only 4 billion possible addresses and they have all been assigned – you’re gonna have to share.</p><p>When you use WiFi at home, work or the coffee shop, you’re connected to a local network. Your device is assigned a local IP address to talk to the access point and any other devices in your network. However, that address has no meaning outside of the local network. You can’t use that address in IP packets sent over the Internet, because every local IPv4 network uses <a href="https://en.wikipedia.org/wiki/Private_network"><u>the same few sets of addresses</u></a>.</p><p>So how does Internet access work? Local IPv4 networks generally employ a <i>router</i>, a device to perform network-address translation (NAT). NAT is used to convert the private IPv4 network addresses allocated to devices on the local-area network to a small set of publicly-routable addresses given by your Internet service provider. The router keeps track of the conversions it applies between the two networks in a translation table. When a packet is received on either network, the router consults the translation table and applies the appropriate conversion before sending the packet to the opposite network.</p>
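<p>The router’s bookkeeping can be sketched in a few lines of Python (a toy model for illustration only – real routers track full connection tuples, timeouts and much more; all names here are invented):</p>

```python
# Toy model of a NAT translation table (illustration only).
PUBLIC_IP = "198.51.100.42"  # the router's single public address

class NatTable:
    def __init__(self):
        self.next_port = 50000
        self.outbound_map = {}  # (private_ip, private_port) -> public_port
        self.inbound_map = {}   # public_port -> (private_ip, private_port)

    def outbound(self, src):
        """Rewrite a LAN source address to the router's public address."""
        if src not in self.outbound_map:
            self.outbound_map[src] = self.next_port
            self.inbound_map[self.next_port] = src
            self.next_port += 1
        return (PUBLIC_IP, self.outbound_map[src])

    def inbound(self, public_port):
        """Map a reply back to the device that opened the connection."""
        return self.inbound_map[public_port]

nat = NatTable()
public = nat.outbound(("192.168.0.7", 4321))  # rewritten source address
reply_to = nat.inbound(public[1])             # where replies are delivered
```

<p>Outbound packets are rewritten to the shared public address, and the table remembers enough state to deliver replies back to the right private device.</p>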
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uT2VOMUn2fJ9NleofEfVB/b871de07a16714f1d05b2b3d0d547aa7/image6.png" />
          </figure><p><sup>Diagram of a router using NAT to bridge connections from devices on a private network to the public Internet</sup></p><p>A VPN that provides Internet access is no different in this respect to a LAN – the only unusual aspect is that the user of the VPN communicates with the VPN server over the public Internet. The model is simple: private network IP packets are tunnelled, or encapsulated, in public IP packets addressed to the VPN server.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/613OhwoQSh2JHzIsBLzo8U/876446bed57eb8b70ba9ecac0d8f0c75/image5.png" />
</figure><p><sup>Schematic of HTTPS packets being encapsulated between a VPN client and server</sup></p><p>Most of the time, VPN software handles only the encapsulation and decapsulation of packets, and gives you a virtual network device to send and receive packets on the VPN. This gives you the freedom to configure the VPN however you like. For WARP, we need our servers to act as a router between the VPN client and the Internet.</p>
    <div>
      <h2>NAT’s how you do it</h2>
      <a href="#nats-how-you-do-it">
        
      </a>
    </div>
<p>Linux – the operating system powering our servers – can be configured to perform routing with NAT in its <a href="https://en.wikipedia.org/wiki/Netfilter"><u>Netfilter</u></a> subsystem. Netfilter is frequently configured through nftables or iptables rules. Configuring a “source NAT” to rewrite the source IP of outgoing packets is achieved with a single rule:</p><p><code>nft add rule ip nat postrouting oifname "eth0" ip saddr 10.0.0.0/8 snat to 198.51.100.42</code></p><p>This rule configures Netfilter’s NAT feature to perform source address translation for any packet matching the following criteria:</p><ol><li><p>The source address is in the 10.0.0.0/8 private network subnet – in this example, let’s say VPN clients have addresses from this subnet.</p></li><li><p>The packet will be sent out of the “eth0” interface – in this example, it’s the server’s only physical network interface, and thus the route to the public Internet.</p></li></ol><p>Where these two conditions are true, we apply the “snat” action to rewrite the packet’s source IP from whichever address the VPN client is using to our example server’s public IP address 198.51.100.42. We keep track of the original and rewritten addresses in the rewrite table.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4sUznAhNxIXRCdhjILq6fe/539a2ee09eb149ae9856172043a7d527/image1.png" />
</figure><p><sup>Schematic of an encapsulated packet being decapsulated and rewritten by a VPN server</sup></p><p><a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10/html/configuring_firewalls_and_packet_filters/configuring-nat-using-nftables"><u>You may require additional configuration</u></a> depending on how your distribution ships nftables – nftables is more flexible than the deprecated iptables, but has fewer “implicit” tables ready to use.</p><p>You might also need to <a href="https://linux-audit.com/kernel/sysctl/net/net.ipv4.ip_forward/"><u>enable IP forwarding in general</u></a>, as forwarding is disabled by default: you don’t want a machine connected to two different networks to forward between them without realising it.</p>
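<p>Putting those pieces together, a minimal setup might look like the following (a sketch only: the table and chain are created explicitly in case your distribution ships no implicit ones, and the addresses are the examples from this post):</p>

```shell
# Let the kernel forward packets between interfaces (off by default)
sysctl -w net.ipv4.ip_forward=1

# Create a NAT table and a postrouting chain if none exist yet
nft add table ip nat
nft 'add chain ip nat postrouting { type nat hook postrouting priority srcnat; }'

# Rewrite the source address of VPN-client packets leaving via eth0
nft add rule ip nat postrouting oifname "eth0" ip saddr 10.0.0.0/8 snat to 198.51.100.42
```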
    <div>
      <h2>A conntrack is a conntrack is a conntrack</h2>
      <a href="#a-conntrack-is-a-conntrack-is-a-conntrack">
        
      </a>
    </div>
<p>We said before that a router keeps track of the conversions between addresses in the two networks. In the diagram above, that state is held in the rewrite table.</p><p>In practice, a device can only implement NAT usefully if it understands the TCP and UDP protocols, in particular how they use port numbers to support multiple independent flows of data on a single IP address. The NAT device – in our case Linux – ensures that a unique combination of source address and port is used for each connection, and reassigns the port if required. It also needs to understand the lifecycle of a TCP connection, so that it knows when it is safe to reuse a port number: with only 65,536 possible ports, port reuse is essential.</p><p>Linux Netfilter has the <i>conntrack</i> module, widely used to implement a stateful firewall that protects servers against spoofed or unexpected packets, preventing them from interfering with legitimate connections. This protection is possible because it understands TCP and the valid state of a connection. This capability means it’s perfectly positioned to implement NAT, too. In fact, all packet rewriting is implemented by conntrack.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HjxbjpRIJIPygV4zMo4XL/7ff4e11334e8e64826be1f29f5e5fb17/image2.png" />
          </figure><p><sup>A diagram showing the steps taken by conntrack to validate and rewrite packets</sup></p><p>As a stateful firewall, the conntrack module maintains a table of all connections it has seen. If you know all of the active connections, you can rewrite a new connection to a port that is not in use.</p><p>In the “snat” rule above, Netfilter adds an entry to the rewrite table, but doesn’t change the packet yet. Only <a href="https://wiki.nftables.org/wiki-nftables/index.php/Mangling_packet_headers"><u>basic packet changes are permitted within nftables</u></a>. We must wait for packet processing to reach the conntrack module, which selects a port unused by any active connection, and only then rewrites the packet.</p>
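<p>Conntrack’s port selection can be pictured as a simple search over the tracked connections (a deliberately naive Python sketch – the real search is keyed by the full connection tuple and is far more involved):</p>

```python
def pick_source_port(in_use, lo=50000, hi=50009):
    """Return the first port in [lo, hi] not used by a tracked
    connection to the same destination, or None if all are taken --
    roughly the decision conntrack makes before rewriting a packet."""
    for port in range(lo, hi + 1):
        if port not in in_use:
            return port
    return None  # no free entry: the new connection cannot be NATed

tracked = {50000, 50001, 50003}   # ports already in the conntrack table
chosen = pick_source_port(tracked)  # first gap in the range
```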
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6qT3d8JXiTYLwQWsVOCtcQ/ff8c8adcb209f2cdc2578dc1218923ca/image4.png" />
          </figure><p><sup>A diagram showing the roles of netfilter and conntrack when applying NAT to traffic</sup></p>
    <div>
      <h2>Marky mark and the firewall bunch</h2>
      <a href="#marky-mark-and-the-firewall-bunch">
        
      </a>
    </div>
<p>Another mode of conntrack is to assign a persistent mark to packets belonging to a connection. The mark can be referenced in nftables rules to implement different firewall policies, or to control routing decisions.</p><p>Suppose you want to prevent specific addresses (e.g. from a guest network) from accessing certain services on your machine. You could add a firewall rule for each service denying access to those addresses. However, if you need to change the set of addresses to block, you have to update every rule accordingly.</p><p>Alternatively, you could use one rule to apply a mark to packets coming from the addresses you wish to block, and then reference the mark in all the service rules that implement the block. Now if you wish to change the addresses, you need only update a single rule to change the scope of that packet mark.</p><p>This is most beneficial for controlling routing behaviour, as routing rules cannot match on as many attributes of a packet as Netfilter can. Marks let you make routing decisions based on powerful Netfilter rules.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33J5E9eds0JiGVNqOInJ0K/829d033b3ee255093ff1927c0b03f4fb/image3.png" />
</figure><p><sup>A diagram showing netfilter marking specific packets to apply special routing rules</sup></p><p>The code powering the WARP service was written by Cloudflare in Rust, a security-focused systems programming language. We took great care implementing <a href="https://github.com/cloudflare/boringtun"><u>boringtun</u></a> – our WireGuard implementation – and <a href="https://blog.cloudflare.com/zero-trust-warp-with-a-masque/"><u>MASQUE</u></a>. But even if you think the front door is impenetrable, it is good security practice to employ defense-in-depth.</p><p>One example is distinguishing IP packets that come from clients vs. packets that originate elsewhere in our network. One common method is to allocate a unique IP space to WARP traffic and distinguish it based on IP address, but this can be fragile if we ever need to renumber our internal networks – remember IPv4’s limited address space! Instead we can do something simpler.</p><p>To bring IP packets from WARP clients into the Linux networking stack, WARP uses a <a href="https://blog.cloudflare.com/virtual-networking-101-understanding-tap/"><u>TUN device</u></a> – Linux’s name for the virtual network device that programs can use to send and receive IP packets. A TUN device can be configured like any other network device – Ethernet or Wi-Fi adapters, for example – including firewall and routing rules.</p><p>Using nftables, we mark all packets output on WARP’s TUN device. We have to explicitly store the mark in conntrack’s state table on the outgoing path and retrieve it for incoming packets, because Netfilter packet marks are per-packet: only conntrack’s connection mark persists for the lifetime of a connection.</p>
            <pre><code>table ip mangle {
    chain forward {
        type filter hook forward priority mangle; policy accept;
        oifname "warp-tun" counter ct mark set 42
    }
    chain prerouting {
        type filter hook prerouting priority mangle; policy accept;
        counter meta mark set ct mark
    }
}</code></pre>
            <p>We also need to add a routing rule to return marked packets to the TUN device:</p><p><code>ip rule add fwmark 42 table 100 priority 10
ip route add 0.0.0.0/0 proto static dev warp-tun table 100</code></p><p>Now we’re done. All connections from WARP are clearly identified and can be firewalled separately from locally-originated connections or other nodes on our network. Conntrack handles NAT for us, and the connection marks tell us which tracked connections were made by WARP clients.</p>
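<p>For completeness: the TUN device the rules above route to is created through <code>/dev/net/tun</code>. A minimal Python sketch (the ioctl constants are stable values from the linux/if_tun.h header; “warp-tun” is just this post’s example name):</p>

```python
import fcntl
import os
import struct

TUNSETIFF = 0x400454CA  # ioctl that names and configures the device
IFF_TUN = 0x0001        # Layer 3 device: raw IP packets, no Ethernet header
IFF_NO_PI = 0x1000      # don't prepend a 4-byte packet-information header

def tun_ifreq(name, flags=IFF_TUN | IFF_NO_PI):
    """Build the struct ifreq passed to the TUNSETIFF ioctl."""
    return struct.pack("16sH", name.encode(), flags)

def open_tun(name):
    """Open a TUN device (requires CAP_NET_ADMIN). Each read() returns
    one IP packet; each write() injects one into the network stack."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, tun_ifreq(name))
    return fd
```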
    <div>
      <h2>The end?</h2>
      <a href="#the-end">
        
      </a>
    </div>
<p>In our first version of WARP, we enabled clients to access arbitrary Internet hosts by combining multiple components of Linux’s networking stack. Each of our edge servers had a single IP address from an allocation dedicated to WARP, and we were able to configure NAT, routing, and appropriate firewall rules using standard and well-documented methods.</p><p>Linux is flexible and easy to configure, but this design requires one IPv4 address per machine. Due to IPv4 address exhaustion, it would not scale to Cloudflare’s large network: assigning a dedicated IPv4 address to every machine that runs the WARP server would result in an eye-watering address-lease bill. To bring costs down, we would have to limit the number of servers running WARP, increasing the operational complexity of deploying it.</p><p>We had ideas, but we would have to give up the easy path Linux gave us. <a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>IP sharing seemed to us the most promising solution</u></a>, but how much has to change if a single machine can only receive packets addressed to a narrow set of ports? We will reveal all in a follow-up blog post, but if you are the kind of curious problem-solving engineer who is already trying to imagine solutions to this problem, look at <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering"><u>our open positions</u></a> – we’d like to hear from you!</p>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[WARP]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">3ClsS6mSOdk413zjE9GH6t</guid>
            <dc:creator>Chris Branch</dc:creator>
        </item>
        <item>
            <title><![CDATA[So long, and thanks for all the fish: how to escape the Linux networking stack]]></title>
            <link>https://blog.cloudflare.com/so-long-and-thanks-for-all-the-fish-how-to-escape-the-linux-networking-stack/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Many products at Cloudflare aren’t possible without pushing the limits of network hardware and software to deliver improved performance, increased efficiency, or novel capabilities such as soft-unicast, our method for sharing IP subnets across data centers. Happily, most people do not need to know the intricacies of how your operating system handles network and Internet access in general. Yes, even most people within Cloudflare. But sometimes we try to push well beyond the design intentions of Linux’s networking stack. This is a story about one of those attempts. ]]></description>
            <content:encoded><![CDATA[ <p><b></b><a href="https://www.goodreads.com/quotes/2397-there-is-a-theory-which-states-that-if-ever-anyone"><u>There is a theory which states</u></a> that if ever anyone discovers exactly what the Linux networking stack does and why it does it, it will instantly disappear and be replaced by something even more bizarre and inexplicable.</p><p>There is another theory which states that Git was created to track how many times this has already happened.</p><p>Many products at Cloudflare aren’t possible without pushing the limits of network hardware and software to deliver improved performance, increased efficiency, or novel capabilities such as <a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>soft-unicast, our method for sharing IP subnets across data centers</u></a>. Happily, most people do not need to know the intricacies of how your operating system handles network and Internet access in general. Yes, even most people within Cloudflare.</p><p>But sometimes we try to push well beyond the design intentions of Linux’s networking stack. This is a story about one of those attempts.</p>
    <div>
      <h2>Hard solutions for soft problems</h2>
      <a href="#hard-solutions-for-soft-problems">
        
      </a>
    </div>
<p>My previous blog post about the Linux networking stack teased a problem: matching the ideal model of soft-unicast against the basic reality of IP packet forwarding rules. Soft-unicast is the name given to our method of sharing IP addresses between machines. <a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>You may learn about all the cool things we do with it</u></a>, but as far as a single machine is concerned, it has dozens to hundreds of combinations of IP address and source-port range, any of which may be chosen for use by outgoing connections.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1NsU3FdxgJ0FNL78SDCo9D/65a27e8fd4339d3318a1b55b5979e3c6/image3.png" />
          </figure><p>The SNAT target in iptables supports a source-port range option to restrict the ports selected during NAT. In theory, we could continue to use iptables for this purpose, and to support multiple IP/port combinations we could use separate packet marks or multiple TUN devices. In actual deployment we would have to overcome challenges such as managing large numbers of iptables rules and possibly network devices, interference with other uses of packet marks, and deployment and reallocation of existing IP ranges.</p><p>Rather than increase the workload on our firewall, we wrote a single-purpose service dedicated to egressing IP packets on soft-unicast address space. For reasons lost in the mists of time, we named it SLATFATF, or “fish” for short. This service’s sole responsibility is to proxy IP packets using soft-unicast address space and manage the lease of those addresses.</p><p>WARP is not the only user of soft-unicast IP space in our network. Many Cloudflare products and services make use of the soft-unicast capability, and many of them use it in scenarios where we create a TCP socket in order to proxy or carry HTTP connections and other TCP-based protocols. Fish therefore needs to lease addresses that are not used by open sockets, and ensure that sockets cannot be opened to addresses leased by fish.</p><p>Our first attempt was to use distinct per-client addresses in fish and continue to let Netfilter/conntrack apply SNAT rules. However, we discovered an unfortunate interaction between Linux’s socket subsystem and the Netfilter conntrack module that reveals itself starkly when you use packet rewriting.</p>
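<p>For reference, nftables can express a per-slice rewrite directly – the sketch below (using this post’s example slice and device name) is the kind of rule we would have needed, one per address/port-range combination:</p>

```shell
# SNAT forwarded TCP packets from fish's TUN device into the machine's
# soft-unicast slice: address 198.51.100.10, ports 9000-9009 only
nft add rule ip nat postrouting iifname "fishtun" ip protocol tcp snat to 198.51.100.10:9000-9009
```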
    <div>
      <h2>Collision avoidance</h2>
      <a href="#collision-avoidance">
        
      </a>
    </div>
<p>Suppose we have a soft-unicast address slice, 198.51.100.10:9000-9009. Then, suppose we have two separate processes that want to bind a TCP socket to 198.51.100.10:9000 and connect it to 203.0.113.1:443. The first process can do this successfully, but the second process will receive an error when it attempts to connect, because there is already a socket matching the requested 5-tuple.</p>
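<p>The collision is easy to reproduce on loopback (a self-contained sketch: a local listener stands in for 203.0.113.1:443, and SO_REUSEADDR is needed so that the second <code>bind()</code> itself succeeds):</p>

```python
import errno
import socket

def bound_socket(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR lets two sockets share a local address; the full
    # (saddr, sport, daddr, dport) tuple must still be unique.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    return s

# A local listener stands in for the remote 203.0.113.1:443.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(8)
dest = listener.getsockname()

first = bound_socket(0)
first.connect(dest)              # first process: succeeds
sport = first.getsockname()[1]

second = bound_socket(sport)     # binding alone is allowed...
try:
    second.connect(dest)         # ...but the 5-tuple is already taken
    collision = None
except OSError as exc:
    collision = exc.errno        # EADDRNOTAVAIL on Linux
```

<p>On Linux the second <code>connect()</code> fails locally, before any packet is sent.</p>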
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2eXmHlyC0pdDUkZ9OI3JI/b83286088b4efa6ddee897e8b5d3b191/image8.png" />
          </figure><p>Instead of creating sockets, what happens when we emit packets on a TUN device with the same destination IP but a unique source IP, and use source NAT to rewrite those packets to an address in this range?</p><p>If we add an nftables “snat” rule that rewrites the source address to 198.51.100.10:9000-9009, Netfilter will create an entry in the conntrack table for each new connection seen on fishtun, mapping the new source address to the original one. If we try to forward more connections on that TUN device to the same destination IP, new source ports will be selected in the requested range, until all ten available ports have been allocated; once this happens, new connections will be dropped until an existing connection expires, freeing an entry in the conntrack table.</p><p>Unlike when binding a socket, Netfilter will simply pick the first free space in the conntrack table. However, if you use up all the possible entries in the table <a href="https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/"><u>you will get an EPERM error when writing an IP packet</u></a>. Either way, whether you bind kernel sockets or you rewrite packets with conntrack, errors will indicate when there isn’t a free entry matching your requirements.</p><p>Now suppose that you combine the two approaches: a first process emits an IP packet on the TUN device that is rewritten to a packet on our soft-unicast port range. Then, a second process binds and connects a TCP socket with the same addresses as that IP packet:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57KuCP4vkp4TGPiLwDRPZv/c066279cd8a84a511f09ed5218488cec/image7.png" />
</figure><p>The first problem is that there is no way for the second process to know that there is an active connection from 198.51.100.10:9000 to 203.0.113.1:443 at the time the <code>connect()</code> call is made. The second problem is that the connection is successful from the point of view of that second process.</p><p>It should not be possible for two connections to share the same 5-tuple. Indeed, they don’t. Instead, the source address of the TCP socket is <a href="https://github.com/torvalds/linux/blob/v6.15/net/netfilter/nf_nat_core.c#L734"><u>silently rewritten to the next free port</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DWpWJ5gBIDoEhimxIR8TT/fd3d8bd46353cd42ed09a527d4841da8/image6.png" />
</figure><p>This behaviour is present even if you use conntrack without either SNAT or MASQUERADE rules. Usually the lifetime of a conntrack entry matches that of the socket it relates to, but this is not guaranteed, and you cannot depend on the source address of your socket matching the source address of the IP packets it generates.</p><p>Crucially for soft-unicast, it means conntrack may rewrite our connection to have a source port outside of the port slice assigned to our machine. This will silently break the connection, causing unnecessary delays and false reports of connection timeouts. We need another solution.</p>
    <div>
      <h2>Taking a breather</h2>
      <a href="#taking-a-breather">
        
      </a>
    </div>
<p>For WARP, the solution we chose was to stop rewriting and forwarding IP packets, and instead terminate all TCP connections within the server, proxying them to a locally-created TCP socket with the correct soft-unicast address. This was an easy and viable solution that we already employed for a portion of our connections, such as those directed at the CDN, or intercepted as part of the Zero Trust Secure Web Gateway. However, it does introduce additional resource usage and potentially increased latency compared to the status quo. We wanted to find another way (to) forward.</p>
    <div>
      <h2>An inefficient interface</h2>
      <a href="#an-inefficient-interface">
        
      </a>
    </div>
<p>If you want to use both packet rewriting and bound sockets, you need to decide on a single source of truth. Netfilter is not aware of the socket subsystem, but most of the code that uses sockets and is also aware of soft-unicast is code that Cloudflare wrote and controls. A slightly younger version of myself therefore thought it made sense to change our code to work correctly in the face of Netfilter’s design.</p><p>Our first attempt was to use the Netlink interface to the conntrack module, to inspect and manipulate the connection tracking tables before sockets were created. <a href="https://docs.kernel.org/userspace-api/netlink/intro.html"><u>Netlink is an extensible interface to various Linux subsystems</u></a> and is used by many command-line tools like <a href="https://man7.org/linux/man-pages/man8/ip.8.html"><u>ip</u></a> and, in our case, <a href="https://conntrack-tools.netfilter.org/manual.html"><u>conntrack-tools</u></a>. By creating the conntrack entry for the socket we are about to bind, we can guarantee that conntrack won’t rewrite the connection to an invalid port number, and ensure success every time. Likewise, if creating the entry fails, then we can try another valid address. This approach works regardless of whether we are binding a socket or forwarding IP packets.</p><p>There is one problem with this: it’s not terribly efficient. Netlink is slow compared to the bind/connect socket dance, and when creating conntrack entries you have to specify a timeout for the flow and delete the entry if your connection attempt fails, to ensure that the connection table doesn’t fill up too quickly for a given 5-tuple. In other words, you have to manually reimplement the <a href="https://sysctl-explorer.net/net/ipv4/tcp_tw_reuse/"><u>tcp_tw_reuse</u></a> option to support high-traffic destinations with limited resources. In addition, a stray RST packet can erase your connection tracking entry. At our scale, anything like this that can happen, will happen. It is not a place for fragile solutions.</p>
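<p>The Netlink dance is easiest to picture through the equivalent conntrack-tools commands (an approximate sketch: flags per conntrack(8), addresses are this post’s examples, and root is required):</p>

```shell
# Reserve the 5-tuple by creating the conntrack entry ourselves,
# before binding the socket that will use it
conntrack -I -p tcp --state SYN_SENT --timeout 120 \
    --src 10.0.0.1 --dst 203.0.113.1 --sport 4321 --dport 443

# If the connection attempt fails, delete the entry again by hand
conntrack -D -p tcp --src 10.0.0.1 --dst 203.0.113.1 --sport 4321 --dport 443
```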
    <div>
      <h2>Socket to ‘em</h2>
      <a href="#socket-to-em">
        
      </a>
    </div>
<p>Instead of creating conntrack entries, we can abuse kernel features for our own benefit. Some time ago Linux added <a href="https://lwn.net/Articles/495304/"><u>the TCP_REPAIR socket option</u></a>, ostensibly to support connection migration between servers, e.g. to relocate a VM. The scope of this feature allows you to create a new TCP socket and specify its entire connection state by hand.</p><p>An alternative use of this is to create a “connected” socket that never performed the TCP three-way handshake needed to establish that connection. At least, the kernel didn’t do that – if you are forwarding the IP packet containing a TCP SYN, you have more certainty about the expected state of the world.</p><p>However, the introduction of <a href="https://en.wikipedia.org/wiki/TCP_Fast_Open"><u>TCP Fast Open</u></a> provides an even simpler way to do this: you can create a “connected” socket that doesn’t perform the traditional three-way handshake, on the assumption that the SYN packet – when sent with its initial payload – contains a valid cookie to immediately establish the connection. And as nothing is sent until you write to the socket, this serves our needs perfectly.</p><p>You can try this yourself:</p>
<pre><code>import socket

# Values from the linux/tcp.h header; not all Python versions export these
TCP_FASTOPEN_CONNECT = 30
TCP_FASTOPEN_NO_COOKIE = 34

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_TCP, TCP_FASTOPEN_CONNECT, 1)
s.setsockopt(socket.SOL_TCP, TCP_FASTOPEN_NO_COOKIE, 1)
s.bind(('198.51.100.10', 9000))
s.connect(('1.1.1.1', 53))  # returns immediately; no SYN is sent yet</code></pre>
<p>Binding a “connected” socket that nevertheless corresponds to no actual connection has one important feature: if other processes attempt to bind to the same address as the socket, they will fail to do so. This solves the problem we set out with: making packet forwarding coexist with socket usage.</p>
    <div>
      <h2>Jumping the queue</h2>
      <a href="#jumping-the-queue">
        
      </a>
    </div>
    <p>While this solves one problem, it creates another. By default, you can’t use an IP address for both locally-originated packets and forwarded packets.</p><p>For example, we assign the IP address 198.51.100.10 to a TUN device. This allows any program to create a TCP socket using the address 198.51.100.10:9000. We can also write packets to that TUN device with the address 198.51.100.10:9001, and Linux can be configured to forward those packets to a gateway, following the same route as the TCP socket. So far, so good.</p><p>On the inbound path, TCP packets addressed to 198.51.100.10:9000 will be accepted and data put into the TCP socket. TCP packets addressed to 198.51.100.10:9001, however, will be dropped. They are not forwarded to the TUN device at all.</p><p>Why is this the case? Local routing is special. If packets are received to a local address, they are treated as “input” and not forwarded, regardless of any routing you think should apply. Behold the default routing rules:</p><p><code>cbranch@linux:~$ ip rule
0:        from all lookup local
32766:    from all lookup main
32767:    from all lookup default</code></p><p>The rule priority is a nonnegative integer; the smallest priority value is evaluated first. Inserting a lookup rule at the beginning – one that redirects marked packets to the packet forwarding service’s TUN device – requires some slightly awkward rule manipulation: you have to delete the existing rule, then create new rules in the right order. You also don’t want to leave the routing rules without any reference to the “local” table, or you may lose packets while manipulating the rules. In the end, the result looks something like this:</p><p><code>ip rule add fwmark 42 table 100 priority 10
ip rule add lookup local priority 11
ip rule del priority 0
ip route add 0.0.0.0/0 proto static dev fishtun table 100</code></p><p>As with WARP, we simplify connection management by assigning a mark to packets coming from the “fishtun” interface, which we can use to route them back there. To prevent locally-originated TCP sockets from having this same mark applied, we assign the IP to the loopback interface instead of fishtun, leaving fishtun with no assigned address. But it doesn’t need one, as we have explicit routing rules now.</p>
    <div>
      <h2>Uncharted territory</h2>
      <a href="#uncharted-territory">
        
      </a>
    </div>
    <p>While testing this last fix, I ran into an unfortunate problem. It did not work in our production environment.</p><p>It is not simple to debug the path of a packet through Linux’s networking stack. There are a few tools you can use, such as setting nftrace in nftables or applying the LOG/TRACE targets in iptables, which help you understand which rules and tables are applied for a given packet.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ofuljq2tDVVUyzyPOMYSp/3da5954ef254aa3aae5397b310f6dcad/image5.png" />
</figure><p><sup></sup><a href="https://en.m.wikipedia.org/wiki/File:Netfilter-packet-flow.svg"><sup><u>Schematic for the packet flow paths through Linux networking and *tables</u></sup></a><sup> by </sup><a href="https://commons.wikimedia.org/wiki/User_talk:Jengelh"><sup>Jan Engelhardt</sup></a></p><p>Our expectation is that the packet passes the prerouting hook, a routing decision sends it to our TUN device, and the packet then traverses the forward table. By tracing packets originating from the IP of a test host, we could see the packets enter the prerouting phase, but disappear after the ‘routing decision’ block.</p><p>While there is a block in the diagram for “socket lookup”, this occurs after processing the input table. Our packet doesn’t ever enter the input table; the only change we made was to create a local socket. If we stop creating the socket, the packet passes to the forward table as before.</p><p>It turns out that part of the ‘routing decision’ involves some protocol-specific processing. For IP packets, <a href="https://github.com/torvalds/linux/blob/89be9a83ccf1f88522317ce02f854f30d6115c41/net/ipv4/ip_input.c#L317"><u>routing decisions can be cached</u></a>, and some basic address validation is performed. In 2012, an additional feature was added: <a href="https://lore.kernel.org/all/20120619.163911.2094057156011157978.davem@davemloft.net/"><u>early demux</u></a>. The rationale was that at this point in packet processing we are already performing a lookup, and the majority of packets received are expected to be for local sockets, rather than unknown packets or packets that need to be forwarded somewhere. In that case, why not look up the socket directly here and save yourself an extra route lookup?</p>
    <div>
      <h2>The workaround at the end of the universe</h2>
      <a href="#the-workaround-at-the-end-of-the-universe">
        
      </a>
    </div>
    <p>Unfortunately for us, we just created a socket and didn’t want it to receive packets. Our adjustment to the routing table is ignored, because that routing lookup is skipped entirely when the socket is found. Raw sockets avoid this by receiving all packets regardless of the routing decision, but the packet rate is too high for this to be efficient. The only way around this is disabling the early demux feature. According to the patch’s claims, though, this feature improves performance: how far will performance regress on our existing workloads if we disable it?</p><p>This calls for a simple experiment: set the <a href="https://docs.kernel.org/6.16/networking/ip-sysctl.html"><u>net.ipv4.tcp_early_demux</u></a> sysctl to 0 on some machines in a datacenter, let them run for a while, then compare their CPU usage with machines using default settings and the same hardware configuration as the machines under test.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ypZGWN811vIQu04YERP8m/709e115068bad3994c88ce899cdfba29/image4.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eF441OrGSDwvAFEFYWbtT/40c330d687bf7e30597d046274d959e1/image2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34gBimlHXXvLLbGJpriVJA/39f7408dd6ef37aaff3f0fa50a37518f/image1.png" />
</figure><p>The key metric is CPU usage from /proc/stat. If there is a performance degradation, we would expect to see higher CPU usage allocated to “softirq” — the context in which Linux network processing occurs — with little change to either userspace (top) or kernel time (bottom). The observed difference is slight, and mostly appears to reduce efficiency during off-peak hours.</p>
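<p>For reference, the softirq share can be computed from the first “cpu” line of /proc/stat. The sketch below parses a fabricated sample line rather than the live file, but the field layout matches what the kernel exposes:</p>

```python
# Compute the fraction of CPU time spent in softirq context from a
# /proc/stat "cpu" line. Fields (after the label) are, in order:
# user nice system idle iowait irq softirq steal guest guest_nice.
# The sample line below is fabricated for illustration.

def softirq_share(cpu_line):
    fields = [int(x) for x in cpu_line.split()[1:]]
    softirq = fields[6]          # 7th field is time spent in softirq context
    return softirq / sum(fields)

sample = "cpu  10000 200 3000 80000 500 100 1200 0 0 0"
print(round(softirq_share(sample), 4))  # 0.0126
```

<p>Sampling this value periodically on the test and control machines, then comparing the two series, is enough to spot the kind of regression the experiment was looking for.</p>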
    <div>
      <h2>Swimming upstream</h2>
      <a href="#swimming-upstream">
        
      </a>
    </div>
    <p>While we tested different solutions to IP packet forwarding, we continued to terminate TCP connections on our network. Despite our initial concerns, the performance impact was small, and the benefits of increased visibility into origin reachability, fast internal routing within our network, and simpler observability of soft-unicast address usage flipped the burden of proof: was it worth trying to implement pure IP forwarding and supporting two different layers of egress?</p><p>So far, the answer is no. Fish runs on our network today, but with the much smaller responsibility of handling ICMP packets. However, when we decide to tunnel all IP packets, we know exactly how to do it.</p><p>A typical engineering role at Cloudflare involves solving many strange and difficult problems at scale. If you are the kind of goal-focused engineer willing to try novel approaches and explore the capabilities of the Linux kernel despite minimal documentation, look at <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering"><u>our open positions</u></a> — we would love to hear from you!</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Egress]]></category>
            <guid isPermaLink="false">x9Fb6GXRm3RObU5XezhnE</guid>
            <dc:creator>Chris Branch</dc:creator>
        </item>
        <item>
            <title><![CDATA[A deep dive into BPF LPM trie performance and optimization]]></title>
            <link>https://blog.cloudflare.com/a-deep-dive-into-bpf-lpm-trie-performance-and-optimization/</link>
            <pubDate>Tue, 21 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ This post explores the performance of BPF LPM tries, a critical data structure used for IP matching.  ]]></description>
            <content:encoded><![CDATA[ <p>It started with a mysterious soft lockup message in production. A single, cryptic line that led us down a rabbit hole into the performance of one of the most fundamental data structures we use: the BPF LPM trie.</p><p>BPF trie maps (<a href="https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_LPM_TRIE/">BPF_MAP_TYPE_LPM_TRIE</a>) are heavily used for things like IP and IP+Port matching when routing network packets, ensuring your request passes through the right services before returning a result. The performance of this data structure is critical for serving our customers, but the speed of the current implementation leaves a lot to be desired. We’ve run into several bottlenecks when storing millions of entries in BPF LPM trie maps, such as entry lookup times taking hundreds of milliseconds to complete and freeing maps locking up a CPU for over 10 seconds. For instance, BPF maps are used when evaluating Cloudflare’s <a href="https://www.cloudflare.com/network-services/products/magic-firewall/"><u>Magic Firewall</u></a> rules and these bottlenecks have even led to traffic packet loss for some customers.</p><p>This post gives a refresher of how tries and prefix matching work, benchmark results, and a list of the shortcomings of the current BPF LPM trie implementation.</p>
    <div>
      <h2>A brief recap of tries</h2>
      <a href="#a-brief-recap-of-tries">
        
      </a>
    </div>
    <p>If it’s been a while since you last looked at the trie data structure (or if you’ve never seen it before), a trie is a tree data structure, similar to a binary tree, that lets you store and search for data by key, where each node stores some number of the key’s bits.</p><p>Searches are performed by traversing a path from the root; the key is essentially reconstructed from the path taken, so nodes do not need to store their full key. This differs from a traditional binary search tree (BST), where the primary invariant is that the left child’s key is less than the current node’s key and the right child’s key is greater. BSTs require each node to store the full key so that a comparison can be made at each search step.</p><p>Here’s an example that shows how a BST might store values for the keys:</p><ul><li><p>ABC</p></li><li><p>ABCD</p></li><li><p>ABCDEFGH</p></li><li><p>DEF</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1uXt5qwpyq7VzrqxXlHFLj/99677afd73a98b9ce04d30209065499f/image4.png" />
          </figure><p>In comparison, a trie for storing the same set of keys might look like this.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TfFZmwekNAF18yWlOIVWh/58396a19e053bd1c02734a6a54eea18e/image8.png" />
          </figure><p>This way of splitting out bits is really memory-efficient when you have redundancy in your data, e.g. prefixes are common in your keys, because that shared data only requires a single set of nodes. It’s for this reason that tries are often used to efficiently store strings, e.g. dictionaries of words – storing the strings “ABC” and “ABCD” doesn’t require 3 bytes + 4 bytes (assuming ASCII), it only requires 3 bytes + 1 byte because “ABC” is shared by both (the exact number of bits required in the trie is implementation dependent).</p><p>Tries also allow more efficient searching. For instance, if you wanted to know whether the key “CAR” existed in the BST you are required to go to the right child of the root (the node with key “DEF”) and check its left child because this is where it would live if it existed. A trie is more efficient because it searches in prefix order. In this particular example, a trie knows at the root whether that key is in the trie or not.</p><p>This design makes tries perfectly suited for performing longest prefix matches and for working with IP routing using CIDR. CIDR was introduced to make more efficient use of the IP address space (no longer requiring that classes fall into 4 buckets of 8 bits) but comes with added complexity because now the network portion of an IP address can fall anywhere. Handling the CIDR scheme in IP routing tables requires matching on the longest (most specific) prefix in the table rather than performing a search for an exact match.</p><p>If searching a trie does a single-bit comparison at each node, that’s a binary trie. If searching compares more bits we call that a <b><i>multibit trie</i></b>. 
You can store anything you like in a trie, including IP and subnet addresses – it’s all just ones and zeroes.</p><p>Nodes in multibit tries use more memory than in binary tries, but since computers operate on multibit words anyhow, it’s more efficient from a microarchitecture perspective to use multibit tries because you can traverse through the bits faster, reducing the number of comparisons you need to make to search for your data. It’s a classic space vs time tradeoff.</p><p>There are other optimisations we can use with tries. The distribution of data that you store in a trie might not be uniform and there could be sparsely populated areas. For example, if you store the strings “A” and “BCDEFGHI” in a multibit trie, how many nodes do you expect to use? If you’re using ASCII, you could construct the binary trie with a root node and branch left for “A” or right for “B”. With 8-bit nodes, you’d need another 7 nodes to store “C”, “D”, “E”, “F”, “G”, “H”, “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LO6izFC5e06dRf9ra2roC/167ba5c4128fcebacc7b7a8eab199ea5/image5.png" />
          </figure><p>Since there are no other strings in the trie, that’s pretty suboptimal. Once you hit the first level after matching on “B” you know there’s only one string in the trie with that prefix, and you can avoid creating all the other nodes by using <b><i>path compression</i></b>. Path compression replaces nodes “C”, “D”, “E” etc. with a single one such as “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ADY3lNtF7NIgfUX7bX9vY/828a14e155d6530a4dc8cf3286ce8cc3/image13.png" />
</figure><p>If you traverse the tree and hit “I”, you still need to compare the search key with the bits you skipped (“CDEFGH”) to make sure your search key matches the string. Exactly how and where you store the skipped bits is implementation dependent – BPF LPM tries simply store the entire key in the leaf node. As your data becomes denser, path compression is less effective.</p><p>What if your data distribution is dense and, say, all the first 3 levels in a trie are fully populated? In that case you can use <b><i>level compression</i></b><i> </i>and replace all the nodes in those levels with a single node that has 2**3 children. This is how Level-Compressed Tries work; they are used for <a href="https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux">IP route lookup</a> in the Linux kernel (see <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a>).</p><p>There are other optimisations too, but this brief detour is sufficient for this post because the BPF LPM trie implementation in the kernel doesn’t fully use the three we just discussed.</p>
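<p>To make longest-prefix matching concrete, here is a minimal binary trie over IPv4 prefixes. This is a toy sketch of the concepts above, not the kernel’s implementation; the example prefixes and route names are made up:</p>

```python
# A minimal binary trie for IPv4 longest-prefix matching. One bit is
# consumed per level, and lookup remembers the most specific prefix
# seen along the path -- the essence of LPM.

class Node:
    def __init__(self):
        self.children = [None, None]  # one child per bit: 0 or 1
        self.value = None             # set if a stored prefix ends here

def bits(addr, length):
    """Yield the top `length` bits of a 32-bit address, MSB first."""
    return ((addr >> (31 - i)) & 1 for i in range(length))

def insert(root, addr, plen, value):
    node = root
    for b in bits(addr, plen):
        if node.children[b] is None:
            node.children[b] = Node()
        node = node.children[b]
    node.value = value

def lookup(root, addr):
    """Return the value of the longest matching prefix, or None."""
    best, node = None, root
    for b in bits(addr, 32):
        if node.value is not None:
            best = node.value     # most specific match so far
        node = node.children[b]
        if node is None:
            return best
    return node.value if node.value is not None else best

def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

root = Node()
insert(root, ip("10.0.0.0"), 8, "coarse route")
insert(root, ip("10.1.0.0"), 16, "specific route")

print(lookup(root, ip("10.1.2.3")))   # matches the /16: specific route
print(lookup(root, ip("10.9.9.9")))   # falls back to the /8: coarse route
```

<p>Note that both prefixes share the first 8 bits, so that part of the trie is stored only once – the memory-efficiency property described above.</p>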
    <div>
      <h2>How fast are BPF LPM trie maps?</h2>
      <a href="#how-fast-are-bpf-lpm-trie-maps">
        
      </a>
    </div>
    <p>Here are some numbers from running <a href="https://lore.kernel.org/bpf/20250827140149.1001557-1-matt@readmodwrite.com/"><u>BPF selftests benchmark</u></a> on AMD EPYC 9684X 96-Core machines. Here the trie has 10K entries, a 32-bit prefix length, and an entry for every key in the range [0, 10K).</p><table><tr><td><p>Operation</p></td><td><p>Throughput</p></td><td><p>Stddev</p></td><td><p>Latency</p></td></tr><tr><td><p>lookup</p></td><td><p>7.423M ops/s</p></td><td><p>0.023M ops/s</p></td><td><p>134.710 ns/op</p></td></tr><tr><td><p>update</p></td><td><p>2.643M ops/s</p></td><td><p>0.015M ops/s</p></td><td><p>378.310 ns/op</p></td></tr><tr><td><p>delete</p></td><td><p>0.712M ops/s</p></td><td><p>0.008M ops/s</p></td><td><p>1405.152 ns/op</p></td></tr><tr><td><p>free</p></td><td><p>0.573K ops/s</p></td><td><p>0.574K ops/s</p></td><td><p>1.743 ms/op</p></td></tr></table><p>The time to free a BPF LPM trie with 10K entries is noticeably large. We recently ran into an issue where this took so long that it caused <a href="https://lore.kernel.org/lkml/20250616095532.47020-1-matt@readmodwrite.com/"><u>soft lockup messages</u></a> to spew in production.</p><p>This benchmark gives some idea of worst case behaviour. Since the keys are so densely populated, path compression is completely ineffective. In the next section, we explore the lookup operation to understand the bottlenecks involved.</p>
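<p>As a sanity check on the table above, the latency column is (approximately) just the reciprocal of the throughput column; the two describe the same measurement:</p>

```python
# Latency (ns/op) is roughly the reciprocal of throughput (ops/s).
# Throughput figures are taken from the benchmark table above; the
# computed latencies land within a nanosecond of the reported ones.
throughput_ops_per_sec = {
    "lookup": 7.423e6,
    "update": 2.643e6,
    "delete": 0.712e6,
}
for op, tput in throughput_ops_per_sec.items():
    latency_ns = 1e9 / tput
    print(f"{op}: {latency_ns:.1f} ns/op")
```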
    <div>
      <h2>Why are BPF LPM tries slow?</h2>
      <a href="#why-are-bpf-lpm-tries-slow">
        
      </a>
    </div>
    <p>The LPM trie implementation in <a href="https://elixir.bootlin.com/linux/v6.12.43/source/kernel/bpf/lpm_trie.c"><u>kernel/bpf/lpm_trie.c</u></a> has a couple of the optimisations we discussed in the introduction. It is capable of multibit comparisons at leaf nodes, but since there are only two child pointers in each internal node, if your tree is densely populated with a lot of data that only differs by one bit, these multibit comparisons degrade into single bit comparisons.</p><p>Here’s an example. Suppose you store the numbers 0, 1, and 3 in a BPF LPM trie. You might hope that since these values fit in a single 32 or 64-bit machine word, you could use a single comparison to decide which next node to visit in the trie. But that’s only possible if your trie implementation has 3 child pointers in the current node (which, to be fair, most trie implementations do). In other words, you want to make a 3-way branching decision but since BPF LPM tries only have two children, you’re limited to a 2-way branch.</p><p>A diagram for this 2-child trie is given below.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ciL2t6aMyJHR2FfX41rNk/365abe47cf384729408cf9b98c65c0be/image9.png" />
          </figure><p>The leaf nodes are shown in green with the key, as a binary string, in the center. Even though a single 8-bit comparison is more than capable of figuring out which node has that key, the BPF LPM trie implementation resorts to inserting intermediate nodes (blue) to inject 2-way branching decisions into your path traversal because its parent (the orange root node in this case) only has 2 children. Once you reach a leaf node, BPF LPM tries can perform a multibit comparison to check the key. If a node supported pointers to more children, the above trie could instead look like this, allowing a 3-way branch and reducing the lookup time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17VoWl8OY6tzcARKDKuSjS/b9200dbeddf13f101b7085a549742f95/image3.png" />
          </figure><p>This 2-child design impacts the height of the trie. In the worst case, a completely full trie essentially becomes a binary search tree with height log2(nr_entries) and the height of the trie impacts how many comparisons are required to search for a key.</p><p>The above trie also shows how BPF LPM tries implement a form of path compression – you only need to insert an intermediate node where you have two nodes whose keys differ by a single bit. If instead of 3, you insert a key of 15 (0b1111), this won’t change the layout of the trie; you still only need a single node at the right child of the root.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ecfKSeoqN3bfBXmC9KHw5/3be952edea34d6b2cc867ba31ce14805/image12.png" />
          </figure><p>And finally, BPF LPM tries do not implement level compression. Again, this stems from the fact that nodes in the trie can only have 2 children. IP route tables tend to have many prefixes in common and you typically see densely packed tries at the upper levels which makes level compression very effective for tries containing IP routes.</p><p>Here’s a graph showing how the lookup throughput for LPM tries (measured in million ops/sec) degrades as the number of entries increases, from 1 entry up to 100K entries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33I92exrEZTcUWOjxaBOqY/fb1de551b06e3272c8670d0117d738fa/image2.png" />
          </figure><p>Once you reach 1 million entries, throughput is around 1.5 million ops/sec, and continues to fall as the number of entries increases.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4OhaAaI5Y2XJCofI9V39z/567a01b3335f29ef3b46ccdd74dc27e5/image1.png" />
          </figure><p>Why is this? Initially, this is because of the L1 dcache miss rate. All of those nodes that need to be traversed in the trie are potential cache miss opportunities.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Gx4fOLKmhUKHegybQU7sl/4936239213f0061d5cbc2f5d6b63fde6/image11.png" />
          </figure><p>As you can see from the graph, L1 dcache miss rate remains relatively steady and yet the throughput continues to decline. At around 80K entries, dTLB miss rate becomes the bottleneck.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Jy7aTN3Nyo2EsbSzw313n/d26871fa417ffe293adb47fe7f7dc56b/image7.png" />
</figure><p>Because BPF LPM tries dynamically allocate individual nodes from a freelist of kernel memory, these nodes can live at arbitrary addresses. This means that traversing a path through a trie will almost certainly incur cache misses and potentially dTLB misses. This gets worse as the number of entries, and the height of the trie, increases.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CB3MvSvSgH1T2eY7Xlei8/81ebe572592ca71529d79564a88993f0/image10.png" />
          </figure>
    <div>
      <h2>Where do we go from here?</h2>
      <a href="#where-do-we-go-from-here">
        
      </a>
    </div>
    <p>By understanding the current limitations of the BPF LPM trie, we can now work towards building a more performant and efficient solution for the future of the Internet.</p><p>We’ve already contributed these benchmarks to the upstream Linux kernel — but that’s only the start. We have plans to improve the performance of BPF LPM tries, particularly the lookup function, which is heavily used for our workloads. This post covered a number of optimisations that are already used by the <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a> code, so a natural first step is to refactor that code so that a common Level Compressed trie implementation can be used. Expect future blog posts to explore this work in depth.</p><p>If you’re interested in looking at more performance numbers, <a href="https://wiki.cfdata.org/display/~jesper">Jesper Brouer</a> has recorded some here: <a href="https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org">https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org</a>.</p><h6><i>If the Linux kernel, performance, or optimising data structures excites you, </i><a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering&amp;location=default"><i>our engineering teams are hiring</i></a><i>.</i></h6><p></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">2A4WHjTqyxprwUMPaZ6tfj</guid>
            <dc:creator>Matt Fleming</dc:creator>
            <dc:creator>Jesper Brouer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Safe in the sandbox: security hardening for Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/safe-in-the-sandbox-security-hardening-for-cloudflare-workers/</link>
            <pubDate>Thu, 25 Sep 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We are further hardening Cloudflare Workers with the latest software and hardware features. We use defense-in-depth, including V8 sandboxes and the CPU's memory protection keys to keep your data safe. ]]></description>
            <content:encoded><![CDATA[ <p>As a <a href="https://www.cloudflare.com/learning/serverless/what-is-serverless/"><u>serverless</u></a> cloud provider, we run your code on our globally distributed infrastructure. Being able to run customer code on our network means that anyone can take advantage of our global presence and low latency. Workers isn’t just efficient though, we also make it simple for our users. In short: <a href="https://workers.cloudflare.com/"><u>You write code. We handle the rest</u></a>.</p><p>Part of 'handling the rest' is making Workers as secure as possible. We have previously written about our <a href="https://blog.cloudflare.com/mitigating-spectre-and-other-security-threats-the-cloudflare-workers-security-model/"><u>security architecture</u></a>. Making Workers secure is an interesting problem because the whole point of Workers is that we are running third party code on our hardware. This is one of the hardest security problems there is: any attacker has the full power available of a programming language running on the victim's system when they are crafting their attacks.</p><p>This is why we are constantly updating and improving the Workers Runtime to take advantage of the latest improvements in both hardware and software. This post shares some of the latest work we have been doing to keep Workers secure.</p><p>Some background first: <a href="https://www.cloudflare.com/developer-platform/products/workers/"><u>Workers</u></a> is built around the <a href="https://v8.dev/"><u>V8</u></a> JavaScript runtime, originally developed for Chromium-based browsers like Chrome. This gives us a head start, because V8 was forged in an adversarial environment, where it has always been under intense attack and <a href="https://github.blog/security/vulnerability-research/getting-rce-in-chrome-with-incorrect-side-effect-in-the-jit-compiler/"><u>scrutiny</u></a>. Like Workers, Chromium is built to run adversarial code safely. 
That's why V8 is constantly being tested against the best fuzzers and sanitizers, and over the years, it has been hardened with new technologies like <a href="https://v8.dev/blog/oilpan-library"><u>Oilpan/cppgc</u></a> and improved static analysis.</p><p>We use V8 in a slightly different way, though, so we will be describing in this post how we have been making some changes to V8 to improve security in our use case.</p>
    <div>
      <h2>Hardware-assisted security improvements from Memory Protection Keys</h2>
      <a href="#hardware-assisted-security-improvements-from-memory-protection-keys">
        
      </a>
    </div>
    <p>Modern CPUs from Intel, AMD, and ARM have support for <a href="https://man7.org/linux/man-pages/man7/pkeys.7.html"><u>memory protection keys</u></a>, sometimes called <i>PKU</i>, Protection Keys for Userspace. This is a great security feature which increases the power of virtual memory and memory protection.</p><p>Traditionally, the memory protection features of the CPU in your PC or phone were mainly used to protect the kernel and to protect different processes from each other. Within each process, all threads had access to the same memory. Memory protection keys allow us to prevent specific threads from accessing memory regions they shouldn't have access to.</p><p>V8 already <a href="https://issues.chromium.org/issues/41480375"><u>uses memory protection keys</u></a> for the <a href="https://en.wikipedia.org/wiki/Just-in-time_compilation"><u>JIT compilers</u></a>. The JIT compilers for a language like JavaScript generate optimized, specialized versions of your code as it runs. Typically, the compiler is running on its own thread, and needs to be able to write data to the code area in order to install its optimized code. However, the compiler thread doesn't need to be able to run this code. The regular execution thread, on the other hand, needs to be able to run, but not modify, the optimized code. Memory protection keys offer a way to give each thread the permissions it needs, but <a href="https://en.wikipedia.org/wiki/W%5EX"><u>no more</u></a>. And the V8 team in the Chromium project certainly aren't standing still. They describe some of their future plans for memory protection keys <a href="https://docs.google.com/document/d/1l3urJdk1M3JCLpT9HDvFQKOxuKxwINcXoYoFuKkfKcc/edit?tab=t.0#heading=h.gpz70vgxo7uc"><u>here</u></a>.</p><p>In Workers, we have some different requirements than Chromium. 
<a href="https://developers.cloudflare.com/workers/reference/security-model/"><u>The security architecture for Workers</u></a> uses V8 isolates to separate different scripts that are running on our servers. (In addition, we have <a href="https://blog.cloudflare.com/spectre-research-with-tu-graz/"><u>extra mitigations</u></a> to harden the system against <a href="https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)"><u>Spectre</u></a> attacks). If V8 is working as intended, this should be enough, but we believe in <i>defense in depth</i>: multiple, overlapping layers of security controls.</p><p>That's why we have deployed internal modifications to V8 to use memory protection keys to isolate the isolates from each other. There are up to 15 different keys available on a modern x64 CPU and a few are used for other purposes in V8, so we have about 12 to work with. We give each isolate a random key which is used to protect its V8 <i>heap data</i>, the memory area containing the JavaScript objects a script creates as it runs. This means security bugs that might previously have allowed an attacker to read data from a different isolate would now hit a hardware trap in 92% of cases. (Assuming 12 keys, 92% is about 11/12.)</p>
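<p>The 92% figure is a simple back-of-the-envelope calculation (the real key-assignment details are internal to our runtime):</p>

```python
# With k usable protection keys assigned uniformly at random, an attacker's
# isolate shares a key with a specific victim isolate with probability 1/k,
# so a cross-isolate read is trapped with probability (k - 1) / k.
k = 12  # roughly 12 of the 15 x86-64 protection keys are available, per above
p_trapped = (k - 1) / k
print(f"{p_trapped:.1%}")  # 91.7%
```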
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cHaaZrAhQf759og04S63G/59ff1974dc878ec8ad7d40f1f079be37/image9.png" />
          </figure><p>The illustration shows an attacker attempting to read from a different isolate. Most of the time this is detected by the mismatched memory protection key, which kills their script and notifies us, so we can investigate and remediate. The red arrow represents the case where the attacker got lucky by hitting an isolate with the same memory protection key, represented by the isolates having the same colors.</p><p>However, we can further improve on a 92% protection rate. In the last part of this blog post we'll explain how we can lift that to 100% for a particular common scenario. But first, let's look at a software hardening feature in V8 that we are taking advantage of.</p>
    <div>
      <h2>The V8 sandbox, a software-based security boundary</h2>
      <a href="#the-v8-sandbox-a-software-based-security-boundary">
        
      </a>
    </div>
    <p>Over the past few years, V8 has been gaining another defense in depth feature: the V8 sandbox. (Not to be confused with the <a href="https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/"><u>layer 2 sandbox</u></a> which Workers have been using since the beginning.) The V8 sandbox has been a multi-year project that has been gaining <a href="https://v8.dev/blog/sandbox"><u>maturity</u></a> for a while. The sandbox project stems from the observation that many V8 security vulnerabilities start by corrupting objects in the V8 heap memory. Attackers then leverage this corruption to reach other parts of the process, giving them the opportunity to escalate and gain more access to the victim's browser, or even the entire system.</p><p>V8's sandbox project is an ambitious software security mitigation that aims to thwart that escalation: to make it impossible for the attacker to progress from a corruption on the V8 heap to a compromise of the rest of the process. This means, among other things, removing all pointers from the heap. But first, let's explain in as simple terms as possible, what a memory corruption attack is.</p>
    <div>
      <h3>Memory corruption attacks</h3>
      <a href="#memory-corruption-attacks">
        
      </a>
    </div>
    <p>A memory corruption attack tricks a program into misusing its own memory. Computer memory is just a store of integers, where each integer is stored in a location. The locations each have an <i>address</i>, which is also just a number. Programs interpret the data in these locations in different ways, such as text, pixels, or <i>pointers</i>. Pointers are addresses that identify a different memory location, so they act as a sort of arrow that points to some other piece of data.</p><p>Here's a concrete example, which uses a buffer overflow. This is a form of attack that was historically common and relatively simple to understand: Imagine a program has a small buffer (like a 16-character text field) followed immediately by an 8-byte pointer to some ordinary data. An attacker might send the program a 24-character string, causing a "buffer overflow." Because of a vulnerability in the program, the first 16 characters fill the intended buffer, but the remaining 8 characters spill over and overwrite the adjacent pointer.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VlcKOYtfRHwWZVDb6GOPm/517ae1987c89273e1f33eb6ca11d752d/image5.png" />
</figure><p><sup><i>See below for how such an attack would now be thwarted.</i></sup></p><p>Now the pointer has been redirected to point at sensitive data of the attacker's choosing, rather than the normal data it was originally meant to access. When the program tries to use what it believes is its normal pointer, it's actually accessing sensitive data chosen by the attacker.</p><p>This type of attack works in steps: first create a small confusion (like the buffer overflow), then use that confusion to create bigger problems, eventually gaining access to data or capabilities the attacker shouldn't have. The attacker can then use the misdirection to either steal information or plant malicious data that the program will treat as legitimate.</p><p>This was a somewhat abstract description of memory corruption attacks using a buffer overflow, one of the simpler techniques. For some much more detailed and recent examples, see <a href="https://googleprojectzero.blogspot.com/2015/06/what-is-good-memory-corruption.html"><u>this description from Google</u></a>, or this <a href="https://medium.com/@INTfinitySG/miscellaneous-series-2-a-script-kiddie-diary-in-v8-exploit-research-part-1-5b0bab211f5a"><u>breakdown of a V8 vulnerability</u></a>.</p>
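<p>As a toy illustration (a Python sketch of a flat memory model with made-up addresses, not real exploit code), the overflow described above can be simulated like this:</p>

```python
import struct

# Flat "memory": a 16-byte buffer followed immediately by an 8-byte
# little-endian pointer, as in the example above.
memory = bytearray(24)
ORDINARY_DATA = 0x1000       # made-up address of the normal data
SENSITIVE_DATA = 0xDEADBEEF  # made-up address of the attacker's target
memory[16:24] = struct.pack("<Q", ORDINARY_DATA)

def vulnerable_write(data: bytes) -> None:
    # The bug: writes are not bounded to the 16-byte buffer.
    memory[0:len(data)] = data

# 24 attacker-controlled bytes: 16 of filler, then 8 that overwrite the pointer.
vulnerable_write(b"A" * 16 + struct.pack("<Q", SENSITIVE_DATA))

pointer = struct.unpack("<Q", memory[16:24])[0]
assert pointer == SENSITIVE_DATA  # the program will now follow the wrong pointer
```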
    <div>
      <h3>Compressed pointers in V8</h3>
      <a href="#compressed-pointers-in-v8">
        
      </a>
    </div>
    <p>Many attacks are based on corrupting pointers, so ideally we would remove all pointers from the memory of the program. Since an object-oriented language's heap is absolutely full of pointers, that would seem, on its face, to be a hopeless task, but it was made feasible by an earlier development. Starting in 2020, V8 has offered the option of saving memory by using <a href="https://v8.dev/blog/pointer-compression"><u>compressed pointers</u></a>. This means that, on a 64-bit system, the heap uses only 32 bit offsets, relative to a base address. This limits the total heap to at most 4 GiB, a limitation that is acceptable for a browser, and also fine for individual scripts running in a V8 isolate on Cloudflare Workers.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/sO5ByQzR62UcxZiaxwcaq/2f2f0c04af57bb492e9ecaa321935112/image1.png" />
</figure><p><sup><i>An artificial object with various fields, showing how the layout differs in a compressed vs. an uncompressed heap. The boxes are 64 bits wide.</i></sup></p><p>If the whole of the heap is in a single 4 GiB area then the first 32 bits of all pointers will be the same, and we don't need to store them in every pointer field in every object. In the diagram we can see that the object pointers all start with 0x12345678, which is therefore redundant and doesn't need to be stored. This means that object pointer fields and integer fields can be reduced from 64 to 32 bits.</p><p>We still need 64 bit fields for some values, such as double precision floats and the sandbox offsets of buffers, which are typically used by the script for input and output data. See below for details.</p><p>Integers in an uncompressed heap are stored in the high 32 bits of a 64 bit field. In the compressed heap, the top 31 bits of a 32 bit field are used. In both cases the lowest bit is set to 0 to indicate integers (as opposed to pointers or offsets).</p><p>Conceptually, we have two methods for compressing and decompressing, using a base address that is divisible by 4 GiB:</p>
            <pre><code>// Decompress a 32 bit offset to a 64 bit pointer by adding a base address.
void* Decompress(uint32_t offset) { return base + offset; }
// Compress a 64 bit pointer to a 32 bit offset by discarding the high bits.
uint32_t Compress(void* pointer) { return (intptr_t)pointer &amp; 0xffffffff; }</code></pre>
            <p>This pointer compression feature, originally primarily designed to save memory, can be used as the basis of a sandbox.</p>
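<p>The same scheme can be sketched in Python (a toy model with a made-up, 4 GiB aligned base address; V8's actual tagging is more involved):</p>

```python
GIB = 1 << 30
BASE = 0x12345678 << 32  # made-up cage base address, divisible by 4 GiB

def compress(pointer: int) -> int:
    # Discard the high bits: every in-cage pointer shares them with BASE.
    return pointer & 0xFFFFFFFF

def decompress(offset: int) -> int:
    # Add the base back. The result cannot leave [BASE, BASE + 4 GiB).
    return BASE + offset

def tag_smi(n: int) -> int:
    # Small integers occupy the upper 31 bits of the 32 bit field;
    # the low bit 0 marks the value as an integer, not a pointer.
    return (n << 1) & 0xFFFFFFFF

ptr = BASE + 0x0000BEEF                    # a pointer somewhere in the cage
assert decompress(compress(ptr)) == ptr    # round-trips exactly
assert BASE <= decompress(0xFFFFFFFF) < BASE + 4 * GIB
assert tag_smi(21) & 1 == 0                # integer tag: low bit is 0
```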
    <div>
      <h3>From compressed pointers to the sandbox</h3>
      <a href="#from-compressed-pointers-to-the-sandbox">
        
      </a>
    </div>
    <p>The biggest 32-bit unsigned integer is about 4 billion, so the <code>Decompress()</code> function cannot generate any pointer that is outside the range [base, base + 4 GiB]. You could say the pointers are trapped in this area, so it is sometimes called the <i>pointer cage</i>. V8 can reserve 4 GiB of virtual address space for the pointer cage so that only V8 objects appear in this range. By eliminating <i>all</i> pointers from this range, and following some other strict rules, V8 can contain any memory corruption by an attacker to this cage. Even if an attacker corrupts a 32 bit offset within the cage, it is still only a 32 bit offset and can only be used to create new pointers that are still trapped within the pointer cage.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r5H81eDvHgaPIBFw5gG6B/65ffa220f9141a81af893183a09321ac/image7.png" />
</figure><p><sup><i>The buffer overflow attack from earlier no longer works because only the attacker's own data is available in the pointer cage.</i></sup></p><p>To construct the sandbox, we take the 4 GiB pointer cage and add another 4 GiB for buffers and other data structures to make the 8 GiB sandbox. This is why the buffer offsets above are 33 bits, so they can reach buffers in the second half of the sandbox (40 bits in Chromium with larger sandboxes). V8 stores these buffer offsets in the high 33 bits and shifts down by 31 bits before use, in case an attacker corrupted the low bits.</p><p>Cloudflare Workers have made use of compressed pointers in V8 for a while, but for us to get the full power of the sandbox we had to make some changes. Until recently, all isolates in a process had to share one single sandbox if you were using the sandboxed configuration of V8. This would have limited the total size of all V8 heaps to less than 4 GiB, far too little for our architecture, which relies on serving thousands of scripts at once.</p><p>That's why we commissioned <a href="https://www.igalia.com/"><u>Igalia</u></a> to add <a href="https://dbezhetskov.dev/multi-sandboxes/"><u>isolate groups</u></a> to V8. Each isolate group has its own sandbox and can have one or more isolates within it. Building on this change, we have been able to start using the sandbox, eliminating a whole class of potential security issues in one stroke. Although we can place multiple isolates in the same sandbox, we are currently only putting a single isolate in each sandbox.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3jwaGI8xIAC6755vw2BWfE/d8b0cd5b36dbe8b5e628c62ef7f3d474/image2.png" />
          </figure><p><sup><i>The layout of the sandbox. In the sandbox there can be more than one isolate, but all their heap pages must be in the pointer cage: the first 4 GiB of the sandbox. Instead of pointers between the objects, we use 32 bit offsets. The offsets for the buffers are 33 bits, so they can reach the whole sandbox, but not outside it.</i></sup></p>
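<p>The buffer-offset encoding described above can be sketched like this (a toy model of the shift trick, not V8's exact field layout):</p>

```python
GIB = 1 << 30
SANDBOX_SIZE = 8 * GIB  # the 8 GiB sandbox

def store_buffer_offset(offset: int) -> int:
    # The 33 bit offset is kept in the high 33 bits of a 64 bit field.
    assert 0 <= offset < (1 << 33)
    return offset << 31

def load_buffer_offset(field: int) -> int:
    # Shifting down by 31 bits discards whatever an attacker may have
    # written into the low bits: the decoded offset is always below
    # 2**33, i.e. still inside the 8 GiB sandbox.
    return (field & ((1 << 64) - 1)) >> 31

field = store_buffer_offset(5 * GIB)  # a buffer 5 GiB into the sandbox
corrupted = field | 0x7FFFFFFF        # attacker scribbles over the low bits
assert load_buffer_offset(corrupted) == 5 * GIB
assert load_buffer_offset(corrupted) < SANDBOX_SIZE
```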
    <div>
      <h2>Virtual memory isn't infinite, there's a lot going on in a Linux process</h2>
      <a href="#virtual-memory-isnt-infinite-theres-a-lot-going-on-in-a-linux-process">
        
      </a>
    </div>
    <p>At this point, we were not quite done, though. Each sandbox reserves 8 GiB of space in the virtual memory map of the process, and it must be 4 GiB aligned <a href="https://v8.dev/blog/pointer-compression"><u>for efficiency</u></a>. It uses much less physical memory, but the sandbox mechanism requires this much virtual space for its security properties. This presents us with a problem, since a Linux process 'only' has 128 TiB of virtual address space in a 4-level page table (another 128 TiB are reserved for the kernel, not available to user space).</p><p>At Cloudflare, we want to run Workers as efficiently as possible to keep costs and prices down, and to offer a generous free tier. That means that on each machine we have so many isolates running (one per sandbox) that it becomes hard to place them all in a 128 TiB space.</p><p>Knowing this, we have to place the sandboxes carefully in memory. Unfortunately, the Linux syscall, <a href="https://man7.org/linux/man-pages/man2/mmap.2.html"><u>mmap</u></a>, does not allow us to specify the alignment of an allocation unless we can guess a free location to request. To get an 8 GiB area that is 4 GiB aligned, we have to ask for 12 GiB, then find the aligned 8 GiB area that must exist within that, and return the unused (hatched) edges to the OS:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Dqey3y5ZsPugD3pyRpQUY/cdadceeb96dbb01a2062dc98c7c554bc/image6.png" />
          </figure><p>If we allow the Linux kernel to place sandboxes randomly, we end up with a layout like this with gaps. Especially after running for a while, there can be both 8 GiB and 4 GiB gaps between sandboxes:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6oaIPZnjaJrLYoFK6v03oI/6c53895f1151d70f71511d8cdfa35f00/image3.png" />
</figure><p>Sadly, because of our 12 GiB alignment trick, we can't even make use of the 8 GiB gaps. If we ask the OS for 12 GiB, it will never give us a gap like the 8 GiB gap between the green and blue sandboxes above. In addition, there are a host of other things going on in the virtual address space of a Linux process: the malloc implementation may want to grab pages at particular addresses, the executable and libraries are mapped at a random location by ASLR, and V8 has allocations outside the sandbox.</p><p>The latest generation of x64 CPUs supports a much bigger address space, which solves both problems, and Linux kernels are able to make use of the extra bits with <a href="https://en.wikipedia.org/wiki/Intel_5-level_paging"><u>five level page tables</u></a>. A process has to <a href="https://lwn.net/Articles/717293/"><u>opt into this</u></a>, which is done by a single mmap call suggesting an address outside the 47 bit area. The reason this needs an opt-in is that some programs can't cope with such high addresses. Curiously, V8 is one of them.</p><p>This isn't hard to fix in V8, but not all of our fleet has yet been upgraded to hardware that supports it. So for now, we need a solution that works with the existing hardware. We have modified V8 to be able to grab huge memory areas and then use <a href="https://man7.org/linux/man-pages/man2/mprotect.2.html"><u>mprotect syscalls</u></a> to create tightly packed 8 GiB spaces for sandboxes, bypassing the inflexible mmap API.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7kPgAWxoR7nDsZHUOBsNMp/15e7b2a1aac827acfce8b0d614e44cde/image8.png" />
          </figure>
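<p>The over-allocate-and-trim arithmetic from earlier can be sketched as follows (pure arithmetic with a made-up base address; the real code issues mmap and munmap calls):</p>

```python
GIB = 1 << 30

def carve_aligned(base: int, want: int = 8 * GIB, align: int = 4 * GIB):
    """Reserve want + align bytes starting at base, then find the
    aligned region of size want that must exist inside it."""
    reserve = want + align                     # ask for 12 GiB
    start = (base + align - 1) & ~(align - 1)  # round up to a 4 GiB boundary
    assert start + want <= base + reserve      # the aligned 8 GiB always fits
    front = start - base                       # unused edge returned to the OS
    back = (base + reserve) - (start + want)   # unused edge returned to the OS
    return start, front, back

start, front, back = carve_aligned(0x7F3A12345000)  # made-up mmap result
assert start % (4 * GIB) == 0
assert front + back == 4 * GIB  # exactly one alignment's worth is trimmed
```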
    <div>
      <h2>Putting it all together</h2>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Taking control of the sandbox placement like this actually gives us a security benefit, but first we need to describe a particular threat model.</p><p>We assume for the purposes of this threat model that an attacker has an arbitrary way to corrupt data within the sandbox. This is historically the first step in many V8 exploits. So much so that there is a <a href="https://bughunters.google.com/about/rules/chrome-friends/5745167867576320/chrome-vulnerability-reward-program-rules#v8-sandbox-bypass-rewards"><u>special tier</u></a> in Google's V8 bug bounty program where you may <i>assume</i> you have this ability to corrupt memory, and they will pay out if you can leverage that to a more serious exploit.</p><p>However, we assume that the attacker does not have the ability to execute arbitrary machine code. If they did, they could <a href="https://www.usenix.org/system/files/sec20fall_connor_prepub.pdf"><u>disable memory protection keys</u></a>. Having access to the in-sandbox memory only gives the attacker access to their own data. So the attacker must attempt to escalate, by corrupting data inside the sandbox to access data outside the sandbox.</p><p>You will recall that the compressed, sandboxed V8 heap only contains 32 bit offsets. Therefore, no corruption there can reach outside the pointer cage. But there are also arrays in the sandbox — vectors of data with a given size that can be accessed with an index. In our threat model, the attacker can modify the sizes recorded for those arrays and the indexes used to access elements in the arrays. That means an attacker could potentially turn an array in the sandbox into a tool for accessing memory incorrectly. For this reason, the V8 sandbox normally has <i>guard regions</i> around it: These are 32 GiB virtual address ranges that have no virtual-to-physical address mappings. This helps guard against the worst case scenario: Indexing an array where the elements are 8 bytes in size (e.g. 
an array of double precision floats) using a maximal 32 bit index. Such an access could reach a distance of up to 32 GiB outside the sandbox: 8 times the maximal 32 bit index of four billion.</p><p>We want such accesses to trigger an alarm, rather than letting an attacker access nearby memory.  This happens automatically with guard regions, but we don't have space for conventional 32 GiB guard regions around every sandbox.</p><p>Instead of using conventional guard regions, we can make use of memory protection keys. By carefully controlling which isolate group uses which key, we can ensure that no sandbox within 32 GiB has the same protection key. Essentially, the sandboxes are acting as each other's guard regions, protected by memory protection keys. Now we only need a wasted 32 GiB guard region at the start and end of the huge packed sandbox areas.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/53MPs8P84ayqEiTXh7gV5O/88104f74f1d51dbdda8d987e1c7df3aa/image10.png" />
          </figure><p>With the new sandbox layout, we use strictly rotating memory protection keys. Because we are not using randomly chosen memory protection keys, for this threat model the 92% problem described above disappears. Any in-sandbox security issue is unable to reach a sandbox with the same memory protection key. In the diagram, we show that there is no memory within 32 GiB of a given sandbox that has the same memory protection key. Any attempt to access memory within 32 GiB of a sandbox will trigger an alarm, just like it would with unmapped guard regions.</p>
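<p>A toy model of the rotating-key layout (the rotation length here is illustrative; x86 memory protection keys provide 16 keys in total):</p>

```python
GIB = 1 << 30
SANDBOX = 8 * GIB
REACH = 8 * (1 << 32)  # 32 GiB: 8-byte elements times a maximal 32 bit index
NUM_KEYS = 5           # illustrative rotation length

# Sandboxes packed back to back, protection keys assigned round-robin.
sandboxes = [(i * SANDBOX, i % NUM_KEYS) for i in range(64)]

for base, key in sandboxes:
    # Worst-case stray access from this sandbox: 32 GiB past either edge.
    lo, hi = base - REACH, base + SANDBOX + REACH
    for other_base, other_key in sandboxes:
        overlaps = other_base < hi and other_base + SANDBOX > lo
        if other_base != base and overlaps:
            # Every sandbox a stray index could touch holds a different
            # key, so the access faults, just as an unmapped guard
            # region would make it.
            assert other_key != key
```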
    <div>
      <h2>The future</h2>
      <a href="#the-future">
        
      </a>
    </div>
    <p>In a way, this whole blog post is about things our customers <i>don't</i> need to do. They don't need to upgrade their server software to get the latest patches, we do that for them. They don't need to worry whether they are using the most secure or efficient configuration. So there's no call to action here, except perhaps to sleep easy.</p><p>However, if you find work like this interesting, and especially if you have experience with the implementation of V8 or similar language runtimes, then you should consider coming to work for us. <a href="https://job-boards.greenhouse.io/cloudflare/jobs/6718312?gh_jid=6718312"><u>We are recruiting both in the US and in Europe</u></a>. It's a great place to work, and Cloudflare is going from strength to strength.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Malicious JavaScript]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <guid isPermaLink="false">7bZyPF4nBnr5gisZW2crax</guid>
            <dc:creator>Erik Corry</dc:creator>
            <dc:creator>Ketan Gupta</dc:creator>
        </item>
        <item>
            <title><![CDATA[QUIC restarts, slow problems: udpgrm to the rescue]]></title>
            <link>https://blog.cloudflare.com/quic-restarts-slow-problems-udpgrm-to-the-rescue/</link>
            <pubDate>Wed, 07 May 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ udpgrm is a lightweight daemon for graceful restarts of UDP servers. It leverages SO_REUSEPORT and eBPF to route new and existing flows to the correct server instance. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as "zero downtime") for UDP servers has proven to be surprisingly difficult.</p><p>We've <a href="https://blog.cloudflare.com/graceful-upgrades-in-go/"><u>previously</u></a> <a href="https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/"><u>written</u></a> about graceful restarts in the context of TCP, which is much easier to handle. We didn't have a strong reason to deal with UDP until recently — when protocols like HTTP/3 and QUIC became critical. This blog post introduces <b><i>udpgrm</i></b>, a lightweight daemon that helps us to upgrade UDP servers without dropping a single packet.</p><p><a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>Here's the </u><i><u>udpgrm</u></i><u> GitHub repo</u></a>.</p>
    <div>
      <h2>Historical context</h2>
      <a href="#historical-context">
        
      </a>
    </div>
    <p>In the early days of the Internet, UDP was used for stateless request/response communication with protocols like DNS or NTP. Restarts of a server process are not a problem in that context, because it does not have to retain state across multiple requests. However, modern protocols like QUIC, WireGuard, and SIP, as well as online games, use stateful flows. So what happens to the state associated with a flow when a server process is restarted? Typically, old connections are just dropped during a server restart. Migrating the flow state from the old instance to the new instance is possible, but it is complicated and notoriously hard to get right.</p><p>The same problem occurs for TCP connections, but there a common approach is to keep the old instance of the server process running alongside the new instance for a while, routing new connections to the new instance while letting existing ones drain on the old. Once all connections finish or a timeout is reached, the old instance can be safely shut down. The same approach works for UDP, but it requires more involvement from the server process than for TCP.</p><p>In the past, we <a href="https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/"><u>described</u></a> the <i>established-over-unconnected</i> method. It offers one way to implement flow handoff, but it comes with significant drawbacks: it’s prone to race conditions in protocols with multi-packet handshakes, and it suffers from a scalability issue. Specifically, the kernel hash table used for dispatching packets is keyed only by the local IP:port tuple, which can lead to bucket overfill when dealing with many inbound UDP sockets.</p><p>Now we have found a better method, leveraging Linux’s <code>SO_REUSEPORT</code> API. By placing both old and new sockets into the same REUSEPORT group and using an eBPF program for flow tracking, we can route packets to the correct instance and preserve flow stickiness. 
This is how <i>udpgrm</i> works.</p>
    <div>
      <h2>REUSEPORT group</h2>
      <a href="#reuseport-group">
        
      </a>
    </div>
    <p>Before diving deeper, let's quickly review the basics. Linux provides the <code>SO_REUSEPORT</code> socket option, typically set after <code>socket()</code> but before <code>bind()</code>. Please note that this has a separate purpose from the better known <code>SO_REUSEADDR</code> socket option.</p><p><code>SO_REUSEPORT</code> allows multiple sockets to bind to the same IP:port tuple. This feature is primarily used for load balancing, letting servers spread traffic efficiently across multiple CPU cores. You can think of it as a way for an IP:port to be associated with multiple packet queues. In the kernel, sockets sharing an IP:port this way are organized into a <i>reuseport group </i>— a term we'll refer to frequently throughout this post.</p>
            <pre><code>┌───────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443             │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ socket #1 │ │ socket #2 │ │ socket #3 │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────────────────────────┘
</code></pre>
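<p>As a minimal illustration (Python on Linux, using an ephemeral loopback port), two UDP sockets can join the same reuseport group like this:</p>

```python
import socket

# Two UDP sockets bind to the same IP:port. SO_REUSEPORT must be set on
# each socket before bind(), or the second bind() fails with EADDRINUSE.
a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
a.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
a.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
port = a.getsockname()[1]

b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
b.bind(("127.0.0.1", port))  # succeeds: same tuple, same reuseport group

assert a.getsockname() == b.getsockname()
a.close(); b.close()
```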
            <p>Linux supports several methods for distributing inbound packets across a reuseport group. By default, the kernel uses a hash of the packet's 4-tuple to select a target socket. Another method is <code>SO_INCOMING_CPU</code>, which, when enabled, tries to steer packets to sockets running on the same CPU that received the packet. This approach works but has limited flexibility.</p><p>To provide more control, Linux introduced the <code>SO_ATTACH_REUSEPORT_CBPF</code> option, allowing server processes to attach a classic BPF (cBPF) program to make socket selection decisions. This was later extended with <code>SO_ATTACH_REUSEPORT_EBPF</code>, enabling the use of modern eBPF programs. With <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>eBPF</u></a>, developers can implement arbitrary custom logic. A boilerplate program would look like this:</p>
            <pre><code>SEC("sk_reuseport")
int udpgrm_reuseport_prog(struct sk_reuseport_md *md)
{
    uint64_t socket_identifier = xxxx;
    bpf_sk_select_reuseport(md, &amp;sockhash, &amp;socket_identifier, 0);
    return SK_PASS;
}</code></pre>
            <p>To select a specific socket, the eBPF program calls <code>bpf_sk_select_reuseport</code>, using a reference to a map with sockets (<code>SOCKHASH</code>, <code>SOCKMAP</code>, or the older, mostly obsolete <code>SOCKARRAY</code>), along with a key or index. For example, a declaration of a <code>SOCKHASH</code> might look like this:</p>
            <pre><code>struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, MAX_SOCKETS);
	__uint(key_size, sizeof(uint64_t));
	__uint(value_size, sizeof(uint64_t));
} sockhash SEC(".maps");</code></pre>
            <p>This <code>SOCKHASH</code> is a hash map that holds references to sockets, even though the value size looks like a scalar 8-byte value. In our case it's indexed by a <code>uint64_t</code> key. This is pretty neat, as it allows for a simple number-to-socket mapping!</p><p>However, there's a catch: <b>the </b><code><b>SOCKHASH</b></code><b> must be populated and maintained from user space (or a separate control plane), outside the eBPF program itself</b>. Keeping this socket map accurate and in sync with the server process state is surprisingly difficult to get right — especially under dynamic conditions like restarts, crashes, or scaling events. The point of <i>udpgrm</i> is to take care of this stuff, so that server processes don’t have to.</p>
    <div>
      <h2>Socket generation and working generation</h2>
      <a href="#socket-generation-and-working-generation">
        
      </a>
    </div>
    <p>Let’s look at how graceful restarts for UDP flows are achieved in <i>udpgrm</i>. To reason about this setup, we’ll need a bit of terminology: A <b>socket generation</b> is a set of sockets within a reuseport group that belong to the same logical application instance:</p>
            <pre><code>┌───────────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 0                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #1 │ │ socket #2 │ │ socket #3 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 1                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #4 │ │ socket #5 │ │ socket #6 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────┘</code></pre>
            <p>When a server process needs to be restarted, the new version creates a new socket generation for its sockets. The old version keeps running alongside the new one, using sockets from the previous socket generation.</p><p>Reuseport eBPF routing boils down to two problems:</p><ul><li><p>For new flows, we should choose a socket from the socket generation that belongs to the active server instance.</p></li><li><p>For already established flows, we should choose the appropriate socket — possibly from an older socket generation — to keep the flows sticky. The flows will eventually drain away, allowing the old server instance to shut down.</p></li></ul><p>Easy, right?</p><p>Of course not! The devil is in the details. Let's take it one step at a time.</p><p>Routing new flows is relatively easy. <i>udpgrm</i> simply maintains a reference to the socket generation that should handle new connections. We call this reference the <b>working generation</b>. Whenever a new flow arrives, the eBPF program consults the working generation pointer and selects a socket from that generation.</p>
            <pre><code>┌──────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                │
│   ...                                        │
│   Working generation ────┐                   │
│                          V                   │
│           ┌───────────────────────────────┐  │
│           │ socket generation 1           │  │
│           │  ┌───────────┐ ┌──────────┐   │  │
│           │  │ socket #4 │ │ ...      │   │  │
│           │  └───────────┘ └──────────┘   │  │
│           └───────────────────────────────┘  │
│   ...                                        │
└──────────────────────────────────────────────┘</code></pre>
            <p>For this to work, we first need to be able to differentiate packets belonging to new connections from packets belonging to old connections. This is very tricky and highly dependent on the specific UDP protocol. For example, QUIC has an <a href="https://datatracker.ietf.org/doc/html/rfc9000#name-initial-packet"><i><u>initial packet</u></i></a> concept, similar to a TCP SYN, but other protocols might not.</p><p>There needs to be some flexibility here, and <i>udpgrm</i> makes this configurable. Each reuseport group sets a specific <b>flow dissector</b>.</p><p>The flow dissector has two tasks:</p><ul><li><p>It distinguishes new packets from packets belonging to old, already established flows.</p></li><li><p>For recognized flows, it tells <i>udpgrm</i> which specific socket the flow belongs to.</p></li></ul><p>These concepts are closely related and depend on the specific server. Different UDP protocols define flows differently. For example, a naive UDP server might use a typical 5-tuple to define flows, while QUIC uses a "connection ID" field in the QUIC packet header to survive <a href="https://www.rfc-editor.org/rfc/rfc9308.html#section-3.2"><u>NAT rebinding</u></a>.</p><p><i>udpgrm</i> supports three flow dissectors out of the box and is highly configurable to support any UDP protocol. More on this later.</p>
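<p>As a toy sketch of the naive case (Python, keying flows by their address/port tuple; udpgrm's real dissectors run as eBPF in the kernel):</p>

```python
# Flow table: address/port tuple -> socket generation that owns the flow.
flows = {}
working_gen = 0  # generation that should receive new flows

def dissect(src, sport, dst, dport):
    """Return the socket generation that should handle this packet."""
    key = (src, sport, dst, dport)
    if key in flows:
        return flows[key]     # known flow: keep it sticky to its owner
    flows[key] = working_gen  # new flow: route to the working generation
    return working_gen

assert dissect("198.51.100.7", 4242, "192.0.2.1", 443) == 0
working_gen = 1  # server restarted: bump the working generation
assert dissect("198.51.100.7", 4242, "192.0.2.1", 443) == 0  # old flow stays
assert dissect("198.51.100.9", 5353, "192.0.2.1", 443) == 1  # new flow moves on
```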
    <div>
      <h2>Welcome udpgrm!</h2>
      <a href="#welcome-udpgrm">
        
      </a>
    </div>
    <p>Now that we've covered the theory, we're ready for business: please welcome <b>udpgrm</b> — UDP Graceful Restart Marshal! <i>udpgrm</i> is a stateful daemon that handles all the complexities of the graceful restart process for UDP. It installs the appropriate eBPF REUSEPORT program, maintains flow state, communicates with the server process during restarts, and reports useful metrics for easier debugging.</p><p>We can describe <i>udpgrm</i> from two perspectives: for administrators and for programmers.</p>
    <div>
      <h2>udpgrm daemon for the system administrator</h2>
      <a href="#udpgrm-daemon-for-the-system-administrator">
        
      </a>
    </div>
    <p><i>udpgrm</i> is a stateful daemon. To run it:</p>
            <pre><code>$ sudo udpgrm --daemon
[ ] Loading BPF code
[ ] Pinning bpf programs to /sys/fs/bpf/udpgrm
[*] Tailing message ring buffer  map_id 936146</code></pre>
            <p>This sets up the basic functionality, prints rudimentary logs, and should be deployed as a dedicated systemd service — loaded after networking. However, this is not enough to fully use <i>udpgrm</i>. <i>udpgrm</i> needs to hook into <code>getsockopt</code>, <code>setsockopt</code>, <code>bind</code>, and <code>sendmsg</code> syscalls, which are scoped to a cgroup. You can install the <i>udpgrm</i> hooks for a given cgroup like this:</p>
            <pre><code>$ sudo udpgrm --install=/sys/fs/cgroup/system.slice</code></pre>
            <p>But a more common pattern is to install it within the <i>current</i> cgroup:</p>
            <pre><code>$ sudo udpgrm --install --self</code></pre>
            <p>Better yet, use it as part of the systemd "service" config:</p>
            <pre><code>[Service]
...
ExecStartPre=/usr/local/bin/udpgrm --install --self</code></pre>
            <p>Once <i>udpgrm</i> is running, the administrator can use the CLI to list reuseport groups, sockets, and metrics, like this:</p>
            <pre><code>$ sudo udpgrm list
[ ] Retrievieng BPF progs from /sys/fs/bpf/udpgrm
192.0.2.0:4433
	netns 0x1  dissector bespoke  digest 0xdead
	socket generations:
		gen  3  0x17a0da  &lt;=  app 0  gen 3
	metrics:
		rx_processed_total 13777528077
...</code></pre>
            <p>Now, with the <i>udpgrm</i> daemon running and the cgroup hooks set up, we can focus on the server part.</p>
    <div>
      <h2>udpgrm for the programmer</h2>
      <a href="#udpgrm-for-the-programmer">
        
      </a>
    </div>
    <p>We expect the server to create the appropriate UDP sockets by itself. We depend on <code>SO_REUSEPORT</code>, so that each server instance can have a dedicated socket or a set of sockets:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_DGRAM, 0)
sd.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
sd.bind(("192.0.2.1", 5201))</code></pre>
            <p>With a socket descriptor handy, we can pursue the <i>udpgrm</i> magic dance. The server communicates with the <i>udpgrm</i> daemon using <code>setsockopt</code> calls. Behind the scenes, udpgrm provides eBPF <code>setsockopt</code> and <code>getsockopt</code> hooks and hijacks specific calls. It's not easy to set up on the kernel side, but when it works, it’s truly awesome. A typical socket setup looks like this:</p>
            <pre><code>try:
    work_gen = sd.getsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN)
except OSError:
    raise OSError('Is udpgrm daemon loaded? Try "udpgrm --self --install"')
    
sd.setsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, work_gen + 1)
for i in range(10):
    v = sd.getsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, 8)
    sk_gen, sk_idx = struct.unpack('II', v)
    if sk_idx != 0xffffffff:
        break
    time.sleep(0.01 * (2 ** i))
else:
    raise OSError("Communicating with udpgrm daemon failed.")

sd.setsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN, work_gen + 1)</code></pre>
            <p>You can see three blocks here:</p><ul><li><p>First, we retrieve the working generation number and, by doing so, check for <i>udpgrm</i> presence. Typically, <i>udpgrm</i> absence is fine for non-production workloads.</p></li><li><p>Then we register the socket to an arbitrary socket generation. We choose <code>work_gen + 1</code> as the value and verify that the registration went through correctly.</p></li><li><p>Finally, we bump the working generation pointer.</p></li></ul><p>That's it! Hopefully, the API presented here is clear and reasonable. Under the hood, the <i>udpgrm</i> daemon installs the REUSEPORT eBPF program, sets up internal data structures, collects metrics, and manages the sockets in a <code>SOCKHASH</code>.</p>
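            <p>For completeness, the handover can also be observed from the old instance's side. The snippet below is a hedged sketch, not part of the <i>udpgrm</i> API walkthrough above: it only assumes that <code>UDP_GRM_WORKING_GEN</code> stays readable via <code>getsockopt</code>, and it takes the reader function as a plain callable so the sketch stays self-contained:</p>

```python
import time

def drain_when_superseded(my_gen, get_working_gen, poll_interval=1.0):
    """Block until the working generation moves past my_gen.

    With udpgrm loaded, get_working_gen would wrap the getsockopt call
    from the snippet above, e.g.:
        lambda: sd.getsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN)
    (hypothetical wiring; shown here as a plain callable).
    """
    while get_working_gen() == my_gen:
        time.sleep(poll_interval)
    # A newer instance now owns the working generation: stop accepting
    # new flows, finish serving established ones, then exit.
```

            <p>Once this returns, the old instance only receives packets for its established flows, which is exactly what the routing logic for old flows preserves.</p>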
    <div>
      <h2>Advanced socket creation with udpgrm_activate.py</h2>
      <a href="#advanced-socket-creation-with-udpgrm_activate-py">
        
      </a>
    </div>
    <p>In practice, we often need sockets bound to low ports like <code>:443</code>, which requires elevated privileges like <code>CAP_NET_BIND_SERVICE</code>. It's usually better to configure listening sockets outside the server itself. A typical pattern is to pass the listening sockets using <a href="https://0pointer.de/blog/projects/socket-activation.html"><u>socket activation</u></a>.</p><p>Sadly, systemd cannot create a new set of UDP <code>SO_REUSEPORT</code> sockets for each server instance. To overcome this limitation, <i>udpgrm</i> provides a script called <code>udpgrm_activate.py</code>, which can be used like this:</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm_activate.py test-port 0.0.0.0:5201</code></pre>
            <p>Here, <code>udpgrm_activate.py</code> binds to <code>0.0.0.0:5201</code> and stores the created socket in the systemd FD store under the name <code>test-port</code>. The server <code>echoserver.py</code> will inherit this socket and receive the appropriate <code>LISTEN_FDS</code> and <code>LISTEN_FDNAMES</code> environment variables, following the typical systemd socket activation pattern.</p>
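            <p>On the receiving side, picking the socket back up follows the standard systemd socket-activation contract: inherited file descriptors start at fd 3, <code>LISTEN_FDS</code> gives their count, and <code>LISTEN_FDNAMES</code> carries the colon-separated fd store names. A minimal sketch (the <code>start_fd</code> parameter exists only to keep the sketch testable; systemd always starts at 3):</p>

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited fd in the systemd protocol

def inherited_sockets(name, start_fd=SD_LISTEN_FDS_START):
    """Return the sockets stored in the systemd fd store under `name`."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []  # the fds were not meant for this process (or none passed)
    count = int(os.environ.get("LISTEN_FDS", "0"))
    names = os.environ.get("LISTEN_FDNAMES", "").split(":")
    socks = []
    for i in range(count):
        if i < len(names) and names[i] == name:
            # Wrap the already-open fd; family/type are detected from it.
            socks.append(socket.socket(fileno=start_fd + i))
    return socks
```

            <p>A server would call this at startup and fall back to creating its own <code>SO_REUSEPORT</code> sockets, as shown earlier, when nothing was inherited.</p>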
    <div>
      <h2>Systemd service lifetime</h2>
      <a href="#systemd-service-lifetime">
        
      </a>
    </div>
    <p>Systemd typically can't keep more than one instance of a service running at the same time: it prefers to kill the old instance quickly. In other words, it supports an "at most one" server instance model, not the "at least one" model that a graceful restart needs. To work around this, <i>udpgrm</i> provides a <b>decoy</b> script that exits when systemd asks it to, while the actual old instance of the server stays active in the background.</p>
            <pre><code>[Service]
...
ExecStart=/usr/local/bin/mmdecoy examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop.
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            <p>We can now put together the full template for a <i>udpgrm</i>-enabled server, containing all three elements: <code>udpgrm --install --self</code> for the cgroup hooks, <code>udpgrm_activate.py</code> for socket creation, and <code>mmdecoy</code> for fooling systemd's service lifetime checks.</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm --install --self
ExecStartPre=/usr/local/bin/udpgrm_activate.py --no-register test-port 0.0.0.0:5201
ExecStart=/usr/local/bin/mmdecoy PWD/examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop. 
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            
    <div>
      <h2>Dissector modes</h2>
      <a href="#dissector-modes">
        
      </a>
    </div>
    <p>We've discussed the <i>udpgrm</i> daemon, the <i>udpgrm</i> setsockopt API, and systemd integration, but we haven't yet covered the details of routing logic for old flows. To handle arbitrary protocols, <i>udpgrm</i> supports three <b>dissector modes</b> out of the box:</p><p><b>DISSECTOR_FLOW</b>: <i>udpgrm</i> maintains a flow table indexed by a flow hash computed from a typical 4-tuple. It stores a target socket identifier for each flow. The flow table size is fixed, so there is a limit to the number of concurrent flows supported by this mode. To mark a flow as "assured," <i>udpgrm</i> hooks into the <code>sendmsg</code> syscall and saves the flow in the table only when a message is sent.</p><p><b>DISSECTOR_CBPF</b>: A cookie-based model where the target socket identifier — called a udpgrm cookie — is encoded in each incoming UDP packet. For example, in QUIC, this identifier can be stored as part of the connection ID. The dissection logic is expressed as cBPF code. This model does not require a flow table in <i>udpgrm</i> but is harder to integrate because it needs protocol and server support.</p><p><b>DISSECTOR_NOOP</b>: A no-op mode with no state tracking at all. It is useful for traditional UDP services like DNS, where we want to avoid losing even a single packet during an upgrade.</p><p>Finally, <i>udpgrm</i> provides a template for a more advanced dissector called <b>DISSECTOR_BESPOKE</b>. Currently, it includes a QUIC dissector that can decode the QUIC TLS SNI and direct specific TLS hostnames to specific socket generations.</p><p>For more details, <a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>please consult the </u><i><u>udpgrm</u></i><u> README</u></a>. In short: the FLOW dissector is the simplest one, useful for old protocols. 
The CBPF dissector is good for experimentation when the protocol allows storing a custom connection ID (cookie) in each packet; we used it to develop our own QUIC connection ID schema (also known as the DCID). It's slow, though, because it interprets cBPF inside eBPF (yes, really!). NOOP is useful, but only for very specific niche servers. The real magic is in the BESPOKE type, where users can create arbitrary, fast, and powerful dissector logic.</p>
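    <p>To make the FLOW model concrete, here is a toy Python model of the idea. This is an illustration only: the real dissector lives in eBPF, uses a kernel flow hash rather than SHA-256, and the names below are ours:</p>

```python
import hashlib

class FlowTable:
    """Toy model of DISSECTOR_FLOW: remember which socket generation
    served a 4-tuple; route unknown flows to the working generation."""

    def __init__(self, size=4096):
        self.size = size   # fixed table size = concurrent-flow limit
        self.table = {}    # flow hash -> socket generation

    def _flow_hash(self, src_ip, src_port, dst_ip, dst_port):
        key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % self.size

    def route(self, four_tuple, working_gen):
        # New flows land on the current working generation.
        return self.table.get(self._flow_hash(*four_tuple), working_gen)

    def mark_assured(self, four_tuple, gen):
        # The flow is saved only when the server *sends* a reply,
        # mirroring udpgrm's sendmsg hook, so unanswered packets
        # never occupy a flow-table slot.
        if len(self.table) < self.size:
            self.table[self._flow_hash(*four_tuple)] = gen
```

    <p>During a restart, an established ("assured") flow keeps hitting the old generation's socket until that instance exits, while fresh flows follow the bumped working-generation pointer.</p>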
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>The adoption of QUIC and other UDP-based protocols means that gracefully restarting UDP servers is becoming an increasingly important problem. To our knowledge, a reusable, configurable and easy to use solution didn't exist yet. The <i>udpgrm</i> project brings together several novel ideas: a clean API using <code>setsockopt()</code>, careful socket-stealing logic hidden under the hood, powerful and expressive configurable dissectors, and well-thought-out integration with systemd.</p><p>While <i>udpgrm</i> is intended to be easy to use, it hides a lot of complexity and solves a genuinely hard problem. The core issue is that the Linux Sockets API has not kept up with the modern needs of UDP.</p><p>Ideally, most of this should really be a feature of systemd. That includes supporting the "at least one" server instance mode, UDP <code>SO_REUSEPORT</code> socket creation, installing a <code>REUSEPORT_EBPF</code> program, and managing the "working generation" pointer. We hope that <i>udpgrm</i> helps create the space and vocabulary for these long-term improvements.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2baeaA3qbgFISPMjlZ74a4</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[A steam locomotive from 1993 broke my yarn test]]></title>
            <link>https://blog.cloudflare.com/yarn-test-suffers-strange-derailment/</link>
            <pubDate>Wed, 02 Apr 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Yarn tests fail consistently at the 27-second mark. The usual suspects are swiftly eliminated. A deep dive is taken to comb through traces, only to be derailed into an unexpected crash investigation. ]]></description>
            <content:encoded><![CDATA[ <p>So the story begins with a pair programming session I had with my colleague, which I desperately needed because my node skill tree is still at level 1, and I needed to get started with React because I'll be working on our internal <a href="https://github.com/backstage/backstage"><u>backstage</u></a> instance.</p><p>We worked together on a small feature, tested it locally, and it worked. Great. Now it's time to make My Very First React Commit. So I ran the usual <code>git add</code> and <code>git commit</code>, which hooked into <code>yarn test</code>, to automatically run unit tests for backstage, and that's when <i>everything got derailed</i>. For all the React tutorials I have followed, I have never actually run a <code>yarn test</code> on <i>my machine</i>. And the first time I tried yarn test, it hung, and after a long time, the command eventually failed:</p>
            <pre><code>Determining test suites to run...

  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
🌈  backstage  ⚡</code></pre>
            <p>I could tell it was obviously unhappy about something, and then it threw some [Error]. I have very little actual JavaScript experience, but this looks suspiciously like someone had neglected to write a proper toString() or whatever, and thus we're stuck with the monumentally unhelpful [Error]. Searching the web yielded an entire ocean of false positives due to how vague the error message is. <i>What a train wreck!</i></p><p>Fine, let's put on our troubleshooting hats. My memory is not perfect, but thankfully shell history is. Let's see all the (ultimately useless) things that were tried (with commentary):</p>
            <pre><code>2025-03-19 14:18  yarn test --help                                                                                                  
2025-03-19 14:20  yarn test --verbose                    
2025-03-19 14:21  git diff --staged                                                                                                 
2025-03-19 14:25  vim README.md                    # Did I miss some setup?
2025-03-19 14:28  i3lock -c 336699                 # "I need a drink"            
2025-03-19 14:34  yarn test --debug                # Debug, verbose, what's the diff
2025-03-19 14:35  yarn backstage-cli repo test     # Maybe if I invoke it directly ...
2025-03-19 14:36  yarn backstage-cli --version     # Nope, same as mengnan's
2025-03-19 14:36  yarn backstage-cli repo --help
2025-03-19 14:36  yarn backstage-cli repo test --since HEAD~1   # Minimal changes?
2025-03-19 14:36  yarn backstage-cli repo test --since HEAD     # Uhh idk no changes???
2025-03-19 14:38  yarn backstage-cli repo test plugins          # The first breakthrough. More on this later
2025-03-19 14:39  n all tests.\n › Press f to run only failed tests.\n › Press o to only run tests related to changed files.\n › Pres
filter by a filename regex pattern.\n › Press t to filter by a test name regex pattern.\n › Press q to quit watch mode.\n › Press Ent
rigger a test run all tests.\n › Press f to run only failed tests.\n › Press o to only run tests related to changed files.\n › Press
lter by a filename regex pattern.\n › Press t to filter by a test name regex pattern.\n › Press q to quit watch mode.\n › Press Enter
gger a test ru                                     # Got too excited and pasted rubbish
2025-03-19 14:44  ls -a | fgrep log
2025-03-19 14:44  find | fgrep log                 # Maybe it leaves a log file?
2025-03-19 14:46  yarn backstage-cli repo test --verbose --debug --no-cache plugins    # "clear cache"
2025-03-19 14:52  yarn backstage-cli repo test --no-cache --runInBand .                # No parallel
2025-03-19 15:00  yarn backstage-cli repo test --jest-help
2025-03-19 15:03  yarn backstage-cli repo test --resetMocks --resetModules plugins     # I have no idea what I'm resetting</code></pre>
            <p>The first real breakthrough was <code>test plugins</code>, which runs only tests matching "plugins". This effectively bypassed the "Determining test suites to run..." logic, which was the part that was hanging. So now I was able to get tests to run. However, these too eventually crashed with the same cryptic <code>[Error]</code>:</p>
            <pre><code>PASS   @cloudflare/backstage-components  plugins/backstage-components/src/components/Cards/TeamMembersListCard/TeamMembersListCard.test.tsx (6.787 s)
PASS   @cloudflare/backstage-components  plugins/backstage-components/src/components/Cards/ClusterDependencyCard/ClusterDependencyCard.test.tsx
PASS   @internal/plugin-software-excellence-dashboard  plugins/software-excellence-dashboard/src/components/AppDetail/AppDetail.test.tsx
PASS   @cloudflare/backstage-entities  plugins/backstage-entities/src/AccessLinkPolicy.test.ts


  ● Test suite failed to run

thrown: [Error]</code></pre>
            <p>Re-running it or matching different tests will give slightly different run logs, but they always end with the same error.</p><p>By now, I've figured out that yarn test is actually backed by <a href="https://jestjs.io/"><u>Jest, a JavaScript testing framework</u></a>, so my next strategy is simply trying different Jest flags to see what sticks, but invariably, none do:</p>
            <pre><code>2025-03-19 15:16  time yarn test --detectOpenHandles plugins
2025-03-19 15:18  time yarn test --runInBand .
2025-03-19 15:19  time yarn test --detectLeaks .
2025-03-19 15:20  yarn test --debug aetsnuheosnuhoe
2025-03-19 15:21  yarn test --debug --no-watchman nonexisis
2025-03-19 15:21  yarn test --jest-help
2025-03-19 15:22  yarn test --debug --no-watch ooooooo &gt; ~/jest.config
</code></pre>
            
    <div>
      <h2>A pattern finally emerges</h2>
      <a href="#a-pattern-finally-emerges">
        
      </a>
    </div>
    <p>Eventually, after re-running it so many times, I started to notice a pattern. By default, after a test run, Jest drops you into an interactive menu where you can (Q)uit, run (A)ll tests, and so on, and I realized that Jest would eventually crash even while idling in that menu. I started timing the runs, which led me to the second breakthrough:</p>
            <pre><code>› Press q to quit watch mode.
 › Press Enter to trigger a test run.


  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test .  109.96s user 14.21s system 459% cpu 27.030 total</code></pre>
            
            <pre><code>RUNS   @cloudflare/backstage-components  plugins/backstage-components/src/components/Cards/TeamRoles/CustomerSuccessCard.test.tsx
 RUNS   @cloudflare/backstage-app  packages/app/src/components/catalog/EntityFipsPicker/EntityFipsPicker.test.tsx

Test Suites: 2 failed, 23 passed, 25 of 65 total
Tests:       217 passed, 217 total
Snapshots:   0 total
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test .  110.85s user 14.04s system 463% cpu 26.974 total</code></pre>
            <p>No matter what Jest was doing, it always crashed after almost exactly 27 wall-clock seconds. It literally didn't matter which tests I selected or re-ran. Even the original problem, a bare <code>yarn test</code> (no tests selected, it just hangs), crashed after 27 seconds:</p>
            <pre><code>Determining test suites to run...

  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test  2.05s user 0.71s system 10% cpu 27.094 total
</code></pre>
            <p>Obviously, some sort of timeout. 27 seconds is kind of a weird number (unlike, say, 5 seconds or 60 seconds), but let's try:</p>
            <pre><code>2025-03-19 15:09  find | fgrep 27
2025-03-19 15:09  git grep '\b27\b'
</code></pre>
            <p>No decent hits.</p><p>How about something like 20+7 or even 20+5+2? <i>Nope.</i></p><p>Googling/GPT-4oing for "jest timeout 27 seconds" again yielded nothing useful. Far more people were having problems with testing asynchronously, or getting their tests to time out, than with Jest proper.</p><p>At this time, my colleague came back from his call, and with his help we determined some other things:</p><ul><li><p>his system (macOS) is not affected at all, versus mine (Linux)</p></li><li><p><a href="https://github.com/nvm-sh/nvm?tab=readme-ov-file#intro"><u>nvm use v20</u></a> didn't fix it</p></li><li><p>I can reproduce it on a clean clone of <a href="https://github.com/backstage/backstage"><u>github.com/backstage/backstage</u></a>. The tests seem to progress further, to about 50+ seconds. This lends credence to a running theory that the filesystem crawler/watcher is the part that's crashing: backstage/backstage is a bigger repo than the internal Cloudflare instance, so it takes longer.</p></li></ul><p>I next went <i>on a little detour</i> to grab another colleague who I know has been working on a <a href="https://nextjs.org/"><u>Next.js</u></a> project. He's one of the few other people nearby who knows anything about <a href="https://nodejs.org/en"><u>Node.js</u></a>. In my experience with troubleshooting, it’s helpful to get multiple perspectives, so we can cover each other’s blind spots and avoid tunnel vision.</p><p>I then tried invoking many <code>yarn test</code> runs in parallel, and I did manage to get the crash time to stretch out to 28 or 29 seconds when the system was under heavy load. This tells me that it might not be a hard timeout but rather processing-driven. A series of sleeps <i>chugging along</i>, perhaps?</p><p>By now, there was a veritable crowd of curious onlookers gathered in front of my terminal, marveling at the consistent 27-second crash and trading theories. At some point, someone asked if I had tried rebooting yet, and I had to sheepishly reply that I hadn't, but "I'm absolutely sure it wouldn't help whatsoever".</p><p>And the astute reader can already guess that rebooting did nothing at all, or else this wouldn't even be a story worth telling. Besides, haven't I teased in the clickbaity title about some crazy Steam Locomotive from 1993?</p>
    <div>
      <h2>Strace to the rescue</h2>
      <a href="#strace-to-the-rescue">
        
      </a>
    </div>
    <p>My colleague then put us <i>back on track</i> and suggested <a href="https://strace.io/"><u>strace</u></a>, and I decided to trace the simpler case of the idling menu (rather than trace running tests, which generated far more syscalls).</p>
            <pre><code>Watch Usage
 › Press a to run all tests.
 › Press f to run only failed tests.
 › Press o to only run tests related to changed files.
 › Press p to filter by a filename regex pattern.
 › Press t to filter by a test name regex pattern.
 › Press q to quit watch mode.
 › Press Enter to trigger a test run.
[], 1024, 1000)          = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
read(21, "42375 (node) R 42372 42372 11692"..., 1023) = 301
close(21)                               = 0
epoll_wait(13, [], 1024, 0)             = 0
epoll_wait(13, [], 1024, 999)           = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
read(21, "42375 (node) R 42372 42372 11692"..., 1023) = 301
close(21)                               = 0
epoll_wait(13, [], 1024, 0)             = 0
epoll_wait(13,</code></pre>
            <p>It basically <code>epoll_waits</code> until 27 seconds are up and then, right when the crash happens:</p>
            <pre><code> ● Test suite failed to run                                                                                                                
                                                                                                                                            
thrown: [Error]                                                                                                                             
                                                                                                                                            
0x7ffd7137d5e0, 1024, 1000) = -1 EINTR (Interrupted system call)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=42578, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
read(4, "*", 1)                     	= 1
write(15, "\210\352!\5\0\0\0\0\21\0\0\0\0\0\0\0", 16) = 16
write(5, "*", 1)                    	= 1
rt_sigreturn({mask=[]})             	= -1 EINTR (Interrupted system call)
epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 1024, 101) = 1
read(14, "\210\352!\5\0\0\0\0\21\0\0\0\0\0\0\0", 512) = 16
wait4(42578, [{WIFEXITED(s) &amp;&amp; WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 42578
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
read(4, "*", 1)                     	= 1
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x79e91e045330}, NULL, 8) = 0
write(5, "*", 1)                    	= 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
mmap(0x34ecad880000, 1495040, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x34ecad880000
madvise(0x34ecad880000, 1495040, MADV_DONTFORK) = 0
munmap(0x34ecad9ae000, 258048)      	= 0
mprotect(0x34ecad880000, 1236992, PROT_READ|PROT_WRITE) = 0</code></pre>
            <p>I don't know about you, but sometimes I look at straces and wonder “Do people actually read this gibberish?” Fortunately, in the modern generative AI era, we can count on GPT-4o to gently chide: the process was interrupted (<code>EINTR</code>) by its child's <code>SIGCHLD</code>, which means you forgot about the children, silly human. Is the problem with <i>one of the cars rather than the engine</i>?</p><p>Following this <i>train of thought</i>, I re-ran with <code>strace --follow-forks</code>, which revealed a giant flurry of activity that promptly overflowed my terminal buffer. The investigation is really <i>gaining steam</i> now. The original trace weighs in at a hefty 500,000 lines, but here is a smaller equivalent version derived from a clean instance of backstage: <a href="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IX3GEpe1WPllLf4tJBw5l/9a6c00a3eb06693d545ca49834ded987/trace.log.gz"><u>trace.log.gz</u></a>. I have uploaded this trace here because the by-now overhyped Steam Locomotive is finally making its grand appearance, and I know there'll be people who'd love nothing more than to crawl through a haystack of system calls looking for a train-sized needle. Consider yourself lucky: I had to do it without even knowing what I was looking for, much less that it was a whole Steam Locomotive.</p><hr /><p>This section is left intentionally blank to allow locomotive enthusiasts who want to find the train on their own to do so first.</p><hr /><p>Remember my comment about straces being gibberish? Actually, I was kidding. 
There are a few ways to make it more manageable, and with experience you'll learn which system calls to pay attention to, such as <code>execve</code>, <code>chdir</code>, <code>open</code>, <code>read</code>, <code>fork</code>, and signals, and which ones to skim over, such as <code>mprotect</code>, <code>mmap</code>, and <code>futex</code>.</p><p>Since I'm writing this account after the fact, let's cheat a little and assume I was super smart and zeroed in on <code>execve</code> correctly on the first try:</p>
            <pre><code>🌈  ~  zgrep execve trace.log.gz | head
execve("/home/yew/.nvm/versions/node/v18.20.6/bin/yarn", ["yarn", "test", "steam-regulator"], 0x7ffdff573148 /* 72 vars */) = 0
execve("/home/yew/.pyenv/shims/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/.pyenv/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/repos/secrets/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/.nvm/versions/node/v18.20.6/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = 0
[pid 49307] execve("/bin/sh", ["/bin/sh", "-c", "backstage-cli repo test resource"...], 0x3d17d6d0 /* 156 vars */ &lt;unfinished ...&gt;
[pid 49307] &lt;... execve resumed&gt;)   	= 0
[pid 49308] execve("/home/yew/cloudflare/repos/backstage/node_modules/.bin/backstage-cli", ["backstage-cli", "repo", "test", "steam-regulator"], 0x5e7ef80051d8 /* 156 vars */ &lt;unfinished ...&gt;
[pid 49308] &lt;... execve resumed&gt;)   	= 0
[pid 49308] execve("/tmp/yarn--1742459197616-0.9027914591640542/node", ["node", "/home/yew/cloudflare/repos/backs"..., "repo", "test", "steam-regulator"], 0x7ffcc18af270 /* 156 vars */) = 0
🌈  ~  zgrep execve trace.log.gz | wc -l
2254
</code></pre>
            <p>Phew, over 2,000 <code>execve</code>s is a lot. Let's get the unique ones, plus their counts:</p>
            <pre><code>🌈  ~  zgrep -oP '(?&lt;=execve\(")[^"]+' trace.log.gz | xargs -L1 basename | sort | uniq -c | sort -nr
    576 watchman
    576 hg
    368 sl
    358 git
     16 sl.actual
     14 node
      2 sh
      1 yarn
      1 backstage-cli</code></pre>
            <p>Have you spotted the <b>S</b>team <b>L</b>ocomotive yet? I spotted it immediately because this is My Own System and Surely This Means I Am Perfectly Aware Of Everything That Is Installed Unlike, er, <code>node_modules</code>.</p><p><code>sl</code>  is actually a<a href="https://github.com/mtoyoda/sl"> <u>fun little joke program from 1993</u></a> that plays on users' tendencies to make a typo on <a href="https://man7.org/linux/man-pages/man1/ls.1.html"><code><u>ls</u></code></a>. When <code>sl</code>  runs, it clears your terminal to make way for an animated steam locomotive to come chugging through.</p>
            <pre><code>                        (  ) (@@) ( )  (@)  ()	@@	O 	@ 	O 	@  	O
                   (@@@)
               (	)
            (@@@@)
 
          (   )
      ====    	________            	___________
  _D _|  |_______/    	\__I_I_____===__|_________|
   |(_)---  |   H\________/ |   |    	=|___ ___|  	_________________
   / 	|  |   H  |  | 	|   |     	||_| |_|| 	_|            	\_____A
  |  	|  |   H  |__--------------------| [___] |   =|                    	|
  | ________|___H__/__|_____/[][]~\_______|   	|   -|                    	|
  |/ |   |-----------I_____I [][] []  D   |=======|____|________________________|_
__/ =| o |=-~~\  /~~\  /~~\  /~~\ ____Y___________|__|__________________________|_
 |/-=|___|=O=====O=====O=====O   |_____/~\___/      	|_D__D__D_|  |_D__D__D_|
  \_/  	\__/  \__/  \__/  \__/  	\_/           	\_/   \_/	\_/   \_/</code></pre>
            <div>
  
</div><p>When I first saw that Jest was running <code>sl</code> so many times, my first thought was to ask my colleague if <code>sl</code> is a valid command on his Mac, and of course it is not. After all, which serious engineer would stuff their machine full of silly commands like <code>sl</code>, <a href="https://r-wos.org/hacks/gti"><code><u>gti</u></code></a>, <a href="https://en.wikipedia.org/wiki/Cowsay"><code><u>cowsay</u></code></a>, or <a href="https://linuxcommandlibrary.com/man/toilet"><code><u>toilet</u></code></a> ? The next thing I tried was to rename <code>sl</code> to something else, and sure enough all my problems disappeared: <code>yarn test</code> started working perfectly.</p>
    <div>
      <h2>So what does Jest have to do with Steam Locomotives?</h2>
      <a href="#so-what-does-jest-have-to-do-with-steam-locomotives">
        
      </a>
    </div>
    <p><a href="https://github.com/jestjs/jest/issues/14046"><u>Nothing, that's what</u></a>. The whole affair is an unfortunate naming clash between <code>sl</code> the Steam Locomotive and <code>sl</code> the<a href="https://github.com/facebook/sapling?tab=readme-ov-file#sapling-cli"> <u>Sapling CLI</u></a>. Jest wanted <code>sl</code> the source control system, but ended up getting steam-rolled by <code>sl </code>the Steam Locomotive.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6BDDm4dEbvMuVyn1J5reqf/ac7f914b3b23608941c84a6cd7aadbb1/image3.png" />
          </figure><p>Fortunately the devs took it in good humor, and<a href="https://github.com/jestjs/jest/pull/15053"> <u>made a (still unreleased) fix</u></a>. Check out the train memes!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/n8MrPDXuVMuze7UEkxCie/5d187401c8bb12d8e3e0ef89d03b87f3/image4.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Fot6dhfCsL0zvxYkVEh6a/440436342f93d1f7aca7dd70a9059436/image2.png" />
          </figure><p>At this point the main story has ended. However, there are still some unresolved nagging questions, like...</p>
    <div>
      <h2>How did the crash arrive at the magic number of a relatively even 27 seconds?</h2>
      <a href="#how-did-the-crash-arrive-at-the-magic-number-of-a-relatively-even-27-seconds">
        
      </a>
    </div>
    <p>I don't know. Actually, I'm not sure a forked child executing <code>sl</code> even has a terminal anymore, but the train's travel time does depend on the terminal width. The wider it is, the longer it takes:</p>
            <pre><code>🌈  ~  tput cols
425
🌈  ~  time sl
sl  0.19s user 0.06s system 1% cpu 20.629 total
🌈  ~  tput cols
58
🌈  ~  time sl  
sl  0.03s user 0.01s system 0% cpu 5.695 total</code></pre>
            <p>So the first thing I tried was to run yarn test in a ridiculously narrow terminal and see what happens:</p>
            <pre><code>Determin
ing test
 suites 
to run..
.       
        
  ● Test
 suite f
ailed to
 run    
        
thrown: 
[Error] 
        
error Co
mmand fa
iled wit
h exit c
ode 1.  
info Vis
it https
://yarnp
kg.com/e
n/docs/c
li/run f
or docum
entation
 about t
his comm
and.    
yarn tes
t  1.92s
 user 0.
67s syst
em 9% cp
u 27.088
 total  
🌈  back
stage [m
aster] t
put cols
        
8</code></pre>
            <p>Alas, the terminal width doesn't affect Jest at all. Jest calls<a href="https://github.com/jestjs/jest/pull/13941/files#diff-4c649a2515ae8d1dc1ed6ebccbe3b43e54e5d7217cb011df616dcf471a80319cR59"> <code><u>sl via execa</u></code></a>, so let's mock that up locally:</p>
            <pre><code>🌈  choochoo  cat runSl.mjs 
import {execa} from 'execa';
const { stdout } = await execa('tput', ['cols']);
console.log('terminal colwidth:', stdout);
await execa('sl', ['root']);
🌈  choochoo  time node runSl.mjs
terminal colwidth: 80
node runSl.mjs  0.21s user 0.06s system 4% cpu 6.730 total</code></pre>
            <p>So <code>execa</code> uses the default terminal width of 80, which takes the train 6.7 seconds to cross. And 27 seconds divided by 6.7 is awfully close to 4. So is Jest running <code>sl</code> 4 times? Let's do a poor man's <a href="https://github.com/bpftrace/bpftrace"><u>bpftrace</u></a> by hooking into <code>sl</code> like so:</p>
            <pre><code>#!/bin/bash

uniqid=$RANDOM
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started" &gt;&gt; /home/yew/executed.log
/usr/games/sl.actual "$@"
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid ended" &gt;&gt; /home/yew/executed.log</code></pre>
            <p>And if we check <code>executed.log</code>, <code>sl</code> is indeed executed in 4 waves, albeit by 5 workers simultaneously in each wave:</p>
            <pre><code>#wave1
2025-03-20 13:23:57.125482563 21049 started
2025-03-20 13:23:57.127526987 21666 started
2025-03-20 13:23:57.131099388 4897 started
2025-03-20 13:23:57.134237754 102 started
2025-03-20 13:23:57.137091737 15733 started
#wave1 ends, wave2 starts
2025-03-20 13:24:03.704588580 21666 ended
2025-03-20 13:24:03.704621737 21049 ended
2025-03-20 13:24:03.707780748 4897 ended
2025-03-20 13:24:03.712086346 15733 ended
2025-03-20 13:24:03.711953000 102 ended
2025-03-20 13:24:03.714831149 18018 started
2025-03-20 13:24:03.721293279 23293 started
2025-03-20 13:24:03.724600164 27918 started
2025-03-20 13:24:03.729763900 15091 started
2025-03-20 13:24:03.733176122 18473 started
#wave2 ends, wave3 starts
2025-03-20 13:24:10.294286746 18018 ended
2025-03-20 13:24:10.297261754 23293 ended
2025-03-20 13:24:10.300925031 27918 ended
2025-03-20 13:24:10.300950334 15091 ended
2025-03-20 13:24:10.303498710 24873 started
2025-03-20 13:24:10.303980494 18473 ended
2025-03-20 13:24:10.308560194 31825 started
2025-03-20 13:24:10.310595182 18452 started
2025-03-20 13:24:10.314222848 16121 started
2025-03-20 13:24:10.317875812 30892 started
#wave3 ends, wave4 starts
2025-03-20 13:24:16.883609316 24873 ended
2025-03-20 13:24:16.886708598 18452 ended
2025-03-20 13:24:16.886867725 31825 ended
2025-03-20 13:24:16.890735338 16121 ended
2025-03-20 13:24:16.893661911 21975 started
2025-03-20 13:24:16.898525968 30892 ended
#crash imminent! wave4 ending, wave5 starting...
2025-03-20 13:24:23.474925807 21975 ended</code></pre>
            <p>The logs were emitted over about 26.35 seconds, which is close to 27. It probably crashed just as wave4 was reporting back. And each wave lasts about 6.7 seconds, right on the money with our manual measurement.</p>
    <div>
      <h2>So why is Jest running sl in 4 waves? Why did it crash at the start of the 5th wave?</h2>
      <a href="#so-why-is-jest-running-sl-in-4-waves-why-did-it-crash-at-the-start-of-the-5th-wave">
        
      </a>
    </div>
    <p>Let's again modify the poor man's bpftrace to also log the args and working directory:</p>
            <pre><code>echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started: $@ at $PWD" &gt;&gt; /home/yew/executed.log</code></pre>
            <p>From the results we can see that the 5 workers are busy executing <code>sl root</code>, which corresponds to the <code>getRoot()</code> function in <code>jest-changed-files/sl.ts</code>.</p>
            <pre><code>2025-03-21 05:50:22.663263304  started: root at /home/yew/cloudflare/repos/backstage/packages/app/src
2025-03-21 05:50:22.665550470  started: root at /home/yew/cloudflare/repos/backstage/packages/backend/src
2025-03-21 05:50:22.667988509  started: root at /home/yew/cloudflare/repos/backstage/plugins/access/src
2025-03-21 05:50:22.671781519  started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-components/src
2025-03-21 05:50:22.673690514  started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-entities/src
2025-03-21 05:50:29.247573899  started: root at /home/yew/cloudflare/repos/backstage/plugins/catalog-types-common/src
2025-03-21 05:50:29.251173536  started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects/src
2025-03-21 05:50:29.255263605  started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects-backend/src
2025-03-21 05:50:29.257293780  started: root at /home/yew/cloudflare/repos/backstage/plugins/pingboard-backend/src
2025-03-21 05:50:29.260285783  started: root at /home/yew/cloudflare/repos/backstage/plugins/resource-insights/src
2025-03-21 05:50:35.823374079  started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-gaia/src
2025-03-21 05:50:35.825418386  started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-r2/src
2025-03-21 05:50:35.829963172  started: root at /home/yew/cloudflare/repos/backstage/plugins/security-scorecard-dash/src
2025-03-21 05:50:35.832597778  started: root at /home/yew/cloudflare/repos/backstage/plugins/slo-directory/src
2025-03-21 05:50:35.834631869  started: root at /home/yew/cloudflare/repos/backstage/plugins/software-excellence-dashboard/src
2025-03-21 05:50:42.404063080  started: root at /home/yew/cloudflare/repos/backstage/plugins/teamcity/src</code></pre>
            <p>The 16 entries here correspond neatly to the 16 <code>rootDirs</code> configured in Jest for Cloudflare's backstage. We have 5 trains, and we want to visit 16 stations, so let's do some simple math: 16 / 5 = 3.2, which rounded up means our trains need to make at least 4 trips to cover them all.</p>
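            <p>We can sanity-check that arithmetic in the shell (the 16 and 5 come straight from the logs above; ceiling division rounds 3.2 up to 4):</p>

```shell
# 16 rootDirs visited by 5 concurrent workers: ceil(16 / 5) waves needed
echo $(( (16 + 5 - 1) / 5 ))  # prints 4
```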
    <div>
      <h2>Final mystery: Why did it crash?</h2>
      <a href="#final-mystery-why-did-it-crash">
        
      </a>
    </div>
    <p>Let's go back to the very start of our journey. The original <code>[Error]</code> was actually thrown <a href="https://github.com/jestjs/jest/pull/13941/files#diff-4c649a2515ae8d1dc1ed6ebccbe3b43e54e5d7217cb011df616dcf471a80319cR48"><u>from here</u></a>, and after modifying <code>node_modules/jest-changed-files/index.js</code>, I found that the error is <code>shortMessage: 'Command failed with ENAMETOOLONG: sl status...'</code>, and the reason why became clear when I interrogated Jest about what it thinks the repos are.</p><p>While the git repo is what you'd expect, the sl "repo" looks amazingly like a <i>train wreck in motion</i>:</p>
            <pre><code>got repos.git as Set(1) { '/home/yew/cloudflare/repos/backstage' }
got repos.sl as Set(1) {
  '\x1B[?1049h\x1B[1;24r\x1B[m\x1B(B\x1B[4l\x1B[?7h\x1B[?25l\x1B[H\x1B[2J\x1B[15;80H_\x1B[15;79H_\x1B[16d|\x1B[9;80H_\x1B[12;80H|\x1B[13;80H|\x1B[14;80H|\x1B[15;78H__/\x1B[16;79H|/\x1B[17;80H\\\x1B[9;
  79H_D\x1B[10;80H|\x1B[11;80H/\x1B[12;79H|\x1B[K\x1B[13d\b|\x1B[K\x1B[14d\b|/\x1B[15;1H\x1B[1P\x1B[16;78H|/-\x1B[17;79H\\_\x1B[9;1H\x1B[1P\x1B[10;79H|(\x1B[11;79H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
  _\x1B[14;1H\x1B[1P\x1B[15;76H__/ =\x1B[16;77H|/-=\x1B[17;78H\\_/\x1B[9;77H_D _\x1B[10;78H|(_\x1B[11;78H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b| _\x1B[14;77H|/ |\x1B[15;75H__/
  =|\x1B[16;76H|/-=|\x1B[17;1H\x1B[1P\x1B[8;80H=\x1B[9;76H_D _|\x1B[10;77H|(_)\x1B[11;77H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
  _\r\x1B[14d\x1B[1P\x1B[15d\x1B[1P\x1B[16;75H|/-=|_\x1B[17;1H\x1B[1P\x1B[8;79H=\r\x1B[9d\x1B[1P\x1B[10;76H|(_)-\x1B[11;76H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b| _\r\x1B[14d\x1B[1P\x1B[15;73H__/ =|
  o\x1B[16;74H|/-=|_\r\x1B[17d\x1B[1P\x1B[8;78H=\r\x1B[9d\x1B[1P\x1B[10;75H|(_)-\x1B[11;75H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
  _\r\x1B[14d\x1B[1P\x1B[15d\x1B[1P\x1B[16;73H|/-=|_\r\x1B[17d\x1B[1P\x1B[8;77H=\x1B[9;73H_D _|  |\x1B[10;74H|(_)-\x1B[11;74H/     |\x1B[12;73H|      |\x1B[13;73H| _\x1B[14;73H|/ |   |\x1B[15;71H__/
  =| o |\x1B[16;72H|/-=|___|\x1B[17;1H\x1B[1P\x 1B[5;79H(@\x1B[7;77H(\r\x1B[8d\x1B[1P\x1B[9;72H_D _|  |_\x1B[10;1H\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;72H| _\x1B[14;72H|/ |   |-\x1B[15;70H__/
  =| o |=\x1B[16;71H|/-=|___|=\x1B[17;1H\x1B[1P\x1B[8d\x1B[1P\x1B[9;71H_D _|  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;71H| _\x1B[14;71H|/ |   |-\x1B[15;69H__/ =| o
  |=-\x1B[16;70H|/-=|___|=O\x1B[17;71H\\_/      \\\x1B[8;1H\x1B[1P\x1B[9;70H_D _|  |_\x1B[10;71H|(_)---  |\x1B[11;71H/     |  |\x1B[12;70H|      |  |\x1B[13;70H| _\x1B[80G|\x1B[14;70H|/ |
  |-\x1B[15;68H__/ =| o |=-~\x1B[16;69H|/-=|___|=\x1B[K\x1B[17;70H\\_/      \\O\x1B[8;1H\x1B[1P\x1B[9;69H_D _|  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;69H| _\x1B[79G|_\x1B[14;69H|/ |
  |-\x1B[15;67H__/ =| o |=-~\r\x1B[16d\x1B[1P\x1B[17;69H\\_/      \\_\x1B[4d\b\b(@@\x1B[5;75H(    )\x1B[7;73H(@@@)\r\x1B[8d\x1B[1P\x1B[9;68H_D _|
  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;68H| _\x1B[78G|_\x1B[14;68H|/ |   |-\x1B[15;66H__/ =| o |=-~~\\\x1B[16;67H|/-=|___|=   O\x1B[17;68H\\_/ \\__/\x1B[8;1H\x1B[1P\x1B[9;67H_D _|
  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;67H| _\x1B[77G|_\x1B[14;67H|/ |   |-\x1B[15;65H__/ =| o |=-~O==\x1B[16;66H|/-=|___|= |\x1B[17;1H\x1B[1P\x1B[8d\x1B[1P\x1B[9;66H_D _|
  |_\x1B[10;67H|(_)---  |   H\x1B[11;67H/     |  |   H\x1B[12;66H|      |  |   H\x1B[13;66H| _\x1B[76G|___H\x1B[14;66H|/ |   |-\x1B[15;64H__/ =| o |=-O==\x1B[16;65H|/-=|___|=
  |\r\x1B[17d\x1B[1P\x1B[8d\x1B[1P\x1B[9;65H_D _|  |_\x1B[80G/\x1B[10;66H|(_)---  |   H\\\x1B[11;1H\x1B[1P\x1B[12d\x1B[1P\x1B[13;65H| _\x1B[75G|___H_\x1B[14;65H|/ | |-\x1B[15;63H__/ =| o |=-~~\\
  /\x1B[16;64H|/-=|___|=O=====O\x1B[17;65H\\_/      \\__/  \\\x1B[1;4r\x1B[4;1H\n' + '\x1B[1;24r\x1B[4;74H(    )\x1B[5;71H(@@@@)\x1B[K\x1B[7;69H(   )\x1B[K\x1B[8;68H====
  \x1B[80G_\x1B[9;1H\x1B[1P\x1B[10;65H|(_)---  |   H\\_\x1B[11;1H\x1B[1P\x1B[12d\x1B[1P\x1B[13;64H| _\x1B[74G|___H_\x1B[14;64H|/ |   |-\x1B[15;62H__/ =| o |=-~~\\  /~\x1B[16;63H|/-=|___|=
  ||\x1B[K\x1B[17;64H\\_/      \\O=====O\x1B[8;67H==== \x1B[79G_\r\x1B[9d\x1B[1P\x1B[10;64H|(_)---  |   H\\_\x1B[11;64H/     |  |   H  |\x1B[12;63H|      |  |   H  |\x1B[13;63H|
  _\x1B[73G|___H__/\x1B[14;63H|/ |   |-\x1B[15;61H__/ =| o |=-~~\\  /~\r\x1B[16d\x1B[1P\x1B[17;63H\\_/      \\_\x1B[8;66H==== \x1B[78G_\r\x1B[9d\x1B[1P\x1B[10;63H|(_)---  |
  H\\_\r\x1B[11d\x1B[1P\x1B[12;62H|      |  |   H  |_\x1B[13;62H| _\x1B[72G|___H__/_\x1B[14;62H|/ |   |-\x1B[15;60H__/ =| o |=-~~\\  /~~\\\x1B[16;61H|/-=|___|=   O=====O\x1B[17;62H\\_/      \\__/
  \\__/\x1B[8;65H==== \x1B[77G_\r\x1B[9d\x1B[1P\x1B[10;62H|(_)---  |   H\\_\r\x1B[11d\x1B[1P\x1B[12;61H|      |  |   H  |_\x1B[13;61H| _\x1B[71G|___H__/_\x1B[14;61H|/ |   |-\x1B[80GI\x1B[15;59H__/ =|
  o |=-~O=====O==\x1B[16;60H|/-=|___|=    ||    |\x1B[17;1H\x1B[1P\x1B[2;79H(@\x1B[3;74H(   )\x1B[K\x1B[4;70H(@@@@)\x1B[K\x1B[5;67H(    )\x1B[K\x1B[7;65H(@@@)\x1B[K\x1B[8;64H====
  \x1B[76G_\r\x1B[9d\x1B[1P\x1B[10;61H|(_)---  |   H\\_\x1B[11;61H/     |  |   H  |  |\x1B[12;60H|      |  |   H  |__-\x1B[13;60H| _\x1B[70G|___H__/__|\x1B[14;60H|/ |   |-\x1B[79GI_\x1B[15;58H__/ =| o
  |=-O=====O==\x1B[16;59H|/-=|___|=    ||    |\r\x1B[17d\x1B[1P\x1B[8;63H==== \x1B[75G_\r\x1B[9d\x1B[1P\x1B[10;60H|(_)---  |   H\\_\r\x1B[11d\x1B[1P\x1B[12;59H|      |  |   H  |__-\x1B[13;59H|
  _\x1B[69G|___H__/__|_\x1B[14;59H|/ |   |-\x1B[78GI_\x1B[15;57H__/ =| o |=-~~\\  /~~\\  /\x1B[16;58H|/-=|___|=O=====O=====O\x1B[17;59H\\_/      \\__/  \\__/  \\\x1B[8;62H====
  \x1B[74G_\r\x1B[9d\x1B[1P\x1B[10;59H|(_)---  |   H\\_\r\x1B  |  |   H  |__-\x1B[13;58H| _\x1B[68G|___H__/__|_\x1B[14;58H|/ |   |-\x1B[77GI_\x1B[15;56H__/ =| o |=-~~\\ /~~\\  /~\x1B[16;57H|/-=|___|=
  ||    ||\x1B[K\x1B[17;58H\\_/      \\O=====O=====O\x1B[8;61H==== \x1B[73G_\r\x1B[9d\x1B[1P\x1B[10;58H|(_)---    _\x1B[67G|___H__/__|_\x1B[14;57H|/ |   |-\x1B[76GI_\x1B[15;55H__/ =| o |=-~~\\  /~~\\
  /~\r\x1B[16d\x1B[1P\x1B[17;57H\\_/      \\_\x1B[2;75H(  ) (\x1B[3;70H(@@@)\x1B[K\x1B[4;66H()\x1B[K\x1B[5;63H(@@@@)\x1B[</code></pre>
            
    <div>
      <h2>Acknowledgements</h2>
      <a href="#acknowledgements">
        
      </a>
    </div>
    <p>Thank you to my colleagues Mengnan Gong and Shuhao Zhang, whose ideas and perspectives helped narrow down the root causes of this mystery.</p><p>If you enjoy troubleshooting weird and tricky production issues, <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering"><i><u>our engineering teams are hiring</u></i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4X0QnKvMC0yi0eRv5YqC7g</guid>
            <dc:creator>Yew Leong</dc:creator>
        </item>
        <item>
            <title><![CDATA[Searching for the cause of hung tasks in the Linux kernel]]></title>
            <link>https://blog.cloudflare.com/searching-for-the-cause-of-hung-tasks-in-the-linux-kernel/</link>
            <pubDate>Fri, 14 Feb 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ The Linux kernel can produce a hung task warning. Searching the Internet and the kernel docs, you can find a brief explanation that the process is stuck in the uninterruptible state. ]]></description>
            <content:encoded><![CDATA[ <p>Depending on your configuration, the Linux kernel can produce a hung task warning message in its log. Searching the Internet and the kernel documentation, you can find a brief explanation that the kernel process is stuck in the uninterruptible state and hasn’t been scheduled on the CPU for an unexpectedly long period of time. That explains the warning’s meaning, but doesn’t provide the reason it occurred. In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or the application itself, and whether it is worth monitoring at all.</p>
    <div>
      <h3>INFO: task XXX:1495882 blocked for more than YYY seconds.</h3>
      <a href="#info-task-xxx-1495882-blocked-for-more-than-yyy-seconds">
        
      </a>
    </div>
    <p>The hung task message in the kernel log looks like this:</p>
            <pre><code>INFO: task XXX:1495882 blocked for more than YYY seconds.
     Tainted: G          O       6.6.39-cloudflare-2024.7.3 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:XXX         state:D stack:0     pid:1495882 ppid:1      flags:0x00004002
. . .</code></pre>
    <p>Processes in Linux can be in different states. Some of them are running or ready to run on the CPU — they are in the <a href="https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/sched.h#L99"><code><u>TASK_RUNNING</u></code></a> state. Others are waiting for some signal or event to happen, e.g. network packets to arrive or terminal input from a user. They are in the <code>TASK_INTERRUPTIBLE</code> state and can spend an arbitrary length of time in it until woken up by a signal. The most important thing about this state is that such processes can still receive signals and be terminated by a signal. In contrast, a process in the <code>TASK_UNINTERRUPTIBLE</code> state waits only for certain special classes of events to wake it up and can’t be interrupted by a signal. Signals are not delivered until the process emerges from this state, and short of that, only a system reboot can clear the process. It’s marked with the letter <code>D</code> in the log shown above.</p><p>What if this wake-up event doesn’t happen, or happens with a significant delay? (A “significant delay” may be on the order of seconds or minutes, depending on the system.) Then our dependent process is hung in this state. What if this process holds some lock and prevents other processes from acquiring it? Or what if we see many processes in the D state? Then it might tell us that some of the system resources are overwhelmed or not working correctly. At the same time, this state is very valuable, especially if we want to preserve the process memory: perhaps part of the data has been written to disk and another part is still in the process memory — we don’t want inconsistent data on disk. Or maybe we want a snapshot of the process memory when a bug is hit. To preserve this behaviour, but make it more controlled, a new state was introduced in the kernel: <a href="https://lwn.net/Articles/288056/"><code><u>TASK_KILLABLE</u></code></a> — it still protects a process, but allows termination with a fatal signal.</p>
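    <p>To see whether any tasks on a machine are currently sitting in the uninterruptible state, you can filter the process list by its state letter (a quick sketch using standard <code>ps</code> and <code>awk</code>; on a healthy system the output is usually empty):</p>

```shell
# Print the pid and command of every task currently in the D
# (uninterruptible sleep) state
ps -eo state,pid,comm | awk '$1 == "D" { print $2, $3 }'
```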
    <div>
      <h3>How Linux identifies the hung process</h3>
      <a href="#how-linux-identifies-the-hung-process">
        
      </a>
    </div>
    <p>The Linux kernel has a special thread called <code>khungtaskd</code>. It runs regularly depending on the settings, iterating over all processes in the <code>D</code> state. If a process is in this state for more than YYY seconds, we’ll see a message in the kernel log. There are settings for this daemon that can be changed according to your wishes:</p>
            <pre><code>$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 10
kernel.hung_task_warnings = 200</code></pre>
            <p>At Cloudflare, we changed the notification threshold <code>kernel.hung_task_timeout_secs</code> from the default 120 seconds to 10 seconds. You can adjust the value for your system depending on configuration and how critical this delay is for you. If the process spends more than <code>hung_task_timeout_secs</code> seconds in the D state, a log entry is written, and our internal monitoring system emits an alert based on this log. Another important setting here is <code>kernel.hung_task_warnings</code> — the total number of messages that will be sent to the log. We limit it to 200 messages and reset it every 15 minutes. It allows us not to be overwhelmed by the same issue, and at the same time doesn’t stop our monitoring for too long. You can make it unlimited by <a href="https://docs.kernel.org/admin-guide/sysctl/kernel.html#hung-task-warnings"><u>setting the value to "-1"</u></a>.</p><p>To better understand the root causes of the hung tasks and how a system can be affected, we’re going to review more detailed examples. </p>
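    <p>The 15-minute reset mentioned above can be as simple as a scheduled job restoring the counter (a hypothetical sketch of a <code>cron.d</code>-style entry; the 200 matches the limit we use, and the exact mechanism is up to you):</p>

```shell
# Hypothetical /etc/cron.d/hung-task-warnings entry: every 15 minutes,
# restore the warning budget so monitoring is never silenced for long
*/15 * * * * root /usr/sbin/sysctl -w kernel.hung_task_warnings=200
```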
    <div>
      <h3>Example #1 or XFS</h3>
      <a href="#example-1-or-xfs">
        
      </a>
    </div>
    <p>Typically, there is a meaningful process or application name in the log, but sometimes you might see something like this:</p>
            <pre><code>INFO: task kworker/13:0:834409 blocked for more than 11 seconds.
 	Tainted: G      	O   	6.6.39-cloudflare-2024.7.3 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/13:0	state:D stack:0 	pid:834409 ppid:2   flags:0x00004000
Workqueue: xfs-sync/dm-6 xfs_log_worker</code></pre>
    <p>In this log, <code>kworker</code> is a kernel thread. It’s used as a deferring mechanism, meaning a piece of work will be scheduled to be executed in the future. Under <code>kworker</code>, the work is aggregated from different tasks, which makes it difficult to tell which application is experiencing a delay. Luckily, the <code>kworker</code> is accompanied by the <a href="https://docs.kernel.org/core-api/workqueue.html"><code><u>Workqueue</u></code></a> line. <code>Workqueue</code> is a linked list, usually predefined in the kernel, where these pieces of work are added and performed by the <code>kworker</code> in the order they were added to the queue. The <code>Workqueue</code> name <code>xfs-sync</code> and <a href="https://elixir.bootlin.com/linux/v6.12.6/source/kernel/workqueue.c#L6096"><u>the function it points to</u></a>, <code>xfs_log_worker</code>, give a good clue about where to look. Here we can make an assumption that <a href="https://en.wikipedia.org/wiki/XFS"><u>XFS</u></a> is under pressure and check the relevant metrics. This helped us discover that, due to some configuration changes, we had forgotten the <code>no_read_workqueue</code> / <code>no_write_workqueue</code> flags that were introduced some time ago to <a href="https://blog.cloudflare.com/speeding-up-linux-disk-encryption/"><u>speed up Linux disk encryption</u></a>.</p><p><i>Summary</i>: In this case, nothing critical happened to the system, but the hung task warnings alerted us that our file system had slowed down.</p>
    <div>
      <h3>Example #2 or Coredump</h3>
      <a href="#example-2-or-coredump">
        
      </a>
    </div>
    <p>Let’s take a look at the next hung task log and its decoded stack trace:</p>
            <pre><code>INFO: task test:964 blocked for more than 5 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:964   ppid:916    flags:0x00004000
Call Trace:
&lt;TASK&gt;
__schedule (linux/kernel/sched/core.c:5378 linux/kernel/sched/core.c:6697) 
schedule (linux/arch/x86/include/asm/preempt.h:85 (discriminator 13) linux/kernel/sched/core.c:6772 (discriminator 13)) 
[do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)) 
? finish_task_switch.isra.0 (linux/arch/x86/include/asm/irqflags.h:42 linux/arch/x86/include/asm/irqflags.h:77 linux/kernel/sched/sched.h:1385 linux/kernel/sched/core.c:5132 linux/kernel/sched/core.c:5250) 
do_group_exit (linux/kernel/exit.c:1005) 
get_signal (linux/kernel/signal.c:2869) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_try_to_cancel.part.0 (linux/kernel/time/hrtimer.c:1347) 
arch_do_signal_or_restart (linux/arch/x86/kernel/signal.c:310) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_nanosleep (linux/kernel/time/hrtimer.c:2105) 
exit_to_user_mode_prepare (linux/kernel/entry/common.c:176 linux/kernel/entry/common.c:210) 
syscall_exit_to_user_mode (linux/arch/x86/include/asm/entry-common.h:91 linux/kernel/entry/common.c:141 linux/kernel/entry/common.c:304) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
do_syscall_64 (linux/arch/x86/entry/common.c:88) 
entry_SYSCALL_64_after_hwframe (linux/arch/x86/entry/entry_64.S:121) 
&lt;/TASK&gt;</code></pre>
            <p>The stack trace says that the process or application <code>test</code> was blocked <code>for more than 5 seconds</code>. We might recognise this user space application by its name, but why is it blocked? It’s always helpful to check the stack trace when looking for a cause. The most interesting line here is <code>do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4))</code>. The <a href="https://elixir.bootlin.com/linux/v6.6.67/source/kernel/exit.c#L825"><u>source code</u></a> points to the <code>coredump_task_exit</code> function. Additionally, checking the process metrics revealed that the application crashed during the time when the warning message appeared in the log. When a process is terminated abnormally by certain signals, <a href="https://man7.org/linux/man-pages/man5/core.5.html"><u>the Linux kernel can provide a core dump file</u></a>, if enabled. The mechanism works like this: when a process terminates, the kernel makes a snapshot of the process memory before exiting and either writes it to a file or sends it through a socket to another handler, which can be <a href="https://systemd.io/COREDUMP/"><u>systemd-coredump</u></a> or your custom one. While this happens, the kernel moves the process to the <code>D</code> state to preserve its memory and protect it from early termination. The higher the process memory usage, the longer it takes to produce a core dump file, and the higher the chance of getting a hung task warning.</p><p>Let’s check our hypothesis by triggering it with a small Go program. We’ll use the default Linux coredump handler and will decrease the hung task threshold to 1 second.</p><p>Coredump settings:</p>
            <pre><code>$ sudo sysctl -a --pattern kernel.core
kernel.core_pattern = core
kernel.core_pipe_limit = 16
kernel.core_uses_pid = 1</code></pre>
            <p>You can make changes with <a href="https://man7.org/linux/man-pages/man8/sysctl.8.html"><u>sysctl</u></a>:</p>
            <pre><code>$ sudo sysctl -w kernel.core_uses_pid=1</code></pre>
            <p>Hung task settings:</p>
            <pre><code>$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 1
kernel.hung_task_warnings = -1</code></pre>
            <p>Go program:</p>
            <pre><code>$ cat main.go
package main

import (
	"os"
	"time"
)

func main() {
	_, err := os.ReadFile("test.file")
	if err != nil {
		panic(err)
	}
	time.Sleep(8 * time.Minute) 
}</code></pre>
            <p>This program reads a 10 GB file into process memory. Let’s create the file:</p>
            <pre><code>$ yes this is 10GB file | head -c 10GB &gt; test.file</code></pre>
            <p>The last step is to build the Go program, crash it, and watch our kernel log:</p>
            <pre><code>$ go mod init test
$ go build .
$ GOTRACEBACK=crash ./test
$ (Ctrl+\)</code></pre>
            <p>Hooray! We can see our hung task warning:</p>
            <pre><code>$ sudo dmesg -T | tail -n 31
INFO: task test:8734 blocked for more than 22 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
      Blocked by coredump.
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:8734  ppid:8406   task_flags:0x400448 flags:0x00004000</code></pre>
            <p>By the way, have you noticed the <code>Blocked by coredump.</code> line in the log? It was recently added to the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-nonmm-stable&amp;id=23f3f7625cfb55f92e950950e70899312f54afb7"><u>upstream</u></a> code to improve visibility and remove the blame from the process itself. The patch also added the <code>task_flags</code> information, as <code>Blocked by coredump</code> is detected via the flag <a href="https://elixir.bootlin.com/linux/v6.13.1/source/include/linux/sched.h#L1675"><code><u>PF_POSTCOREDUMP</u></code></a>, and knowing all the task flags is useful for further root-cause analysis.</p><p><i>Summary</i>: This example showed that even if everything suggests that the application is the problem, the real root cause can be something else — in this case, <code>coredump</code>.</p>
    <div>
      <h3>Example #3 or rtnl_mutex</h3>
      <a href="#example-3-or-rtnl_mutex">
        
      </a>
    </div>
    <p>This one was tricky to debug. Usually, the alerts are limited to one or two different processes, meaning only a certain application or subsystem experiences an issue. In this case, we saw dozens of unrelated tasks hanging for minutes with no improvement over time. Nothing else was in the log, most of the system metrics were fine, and existing traffic was being served, but it was not possible to ssh to the server. New Kubernetes container creations were also stalling. Analyzing the stack traces of different tasks initially revealed that all the traces were limited to just three functions:</p>
            <pre><code>rtnetlink_rcv_msg+0x9/0x3c0
dev_ethtool+0xc6/0x2db0 
bonding_show_bonds+0x20/0xb0</code></pre>
            <p>Further investigation showed that all of these functions were waiting for <a href="https://elixir.bootlin.com/linux/v6.6.74/source/net/core/rtnetlink.c#L76"><code><u>rtnl_lock</u></code></a> to be acquired. It looked like some application acquired the <code>rtnl_mutex</code> and didn’t release it. All other processes were in the <code>D</code> state waiting for this lock.</p><p>The RTNL lock is primarily used by the kernel networking subsystem for any network-related config, for both writing and reading. The RTNL is a global <b>mutex</b> lock, although <a href="https://lpc.events/event/18/contributions/1959/"><u>upstream efforts</u></a> are being made for splitting up RTNL per network namespace (netns).</p><p>From the hung task reports, we can observe the “victims” that are being stalled waiting for the lock, but how do we identify the task that is holding this lock for too long? For troubleshooting this, we leveraged <code>BPF</code> via a <code>bpftrace</code> script, as this allows us to inspect the running kernel state. The <a href="https://elixir.bootlin.com/linux/v6.6.75/source/include/linux/mutex.h#L67"><u>kernel's mutex implementation</u></a> has a struct member called <code>owner</code>. It contains a pointer to the <a href="https://elixir.bootlin.com/linux/v6.6.75/source/include/linux/sched.h#L746"><code><u>task_struct</u></code></a> from the mutex-owning process, except it is encoded as type <code>atomic_long_t</code>. This is because the mutex implementation stores some state information in the lower 3-bits (mask 0x7) of this pointer. Thus, to read and dereference this <code>task_struct</code> pointer, we must first mask off the lower bits (0x7).</p><p>Our <code>bpftrace</code> script to determine who holds the mutex is as follows:</p>
            <pre><code>#!/usr/bin/env bpftrace
interval:s:10 {
  $rtnl_mutex = (struct mutex *) kaddr("rtnl_mutex");
  $owner = (struct task_struct *) ($rtnl_mutex-&gt;owner.counter &amp; ~0x07);
  if ($owner != 0) {
    printf("rtnl_mutex-&gt;owner = %u %s\n", $owner-&gt;pid, $owner-&gt;comm);
  }
}</code></pre>
            <p>In this script, the <code>rtnl_mutex</code> lock is a global lock whose address can be exposed via <code>/proc/kallsyms</code> – using <code>bpftrace</code> helper function <code>kaddr()</code>, we can access the struct mutex pointer from the <code>kallsyms</code>. Thus, we can periodically (via <code>interval:s:10</code>) check if someone is holding this lock.</p><p>In the output we had this:</p>
            <pre><code>rtnl_mutex-&gt;owner = 3895365 calico-node</code></pre>
            <p>This allowed us to quickly identify <code>calico-node</code> as the process holding the RTNL lock for too long. To quickly observe where this process itself is stalled, the call stack is available via <code>/proc/3895365/stack</code>. This showed us that the root cause was a WireGuard config change, with function <code>wg_set_device()</code> holding the RTNL lock, and <code>peer_remove_after_dead()</code> waiting too long for a <code>napi_disable()</code> call. We continued debugging via a tool called <a href="https://drgn.readthedocs.io/en/latest/user_guide.html#stack-traces"><code><u>drgn</u></code></a>, a programmable debugger that can inspect a running kernel via a Python-like interactive shell. We still haven’t discovered the root cause of the WireGuard issue and have <a href="https://lore.kernel.org/lkml/CALrw=nGoSW=M-SApcvkP4cfYwWRj=z7WonKi6fEksWjMZTs81A@mail.gmail.com/"><u>asked the upstream</u></a> for help, but that is another story.</p><p><i>Summary</i>: The hung task messages were the only ones we had in the kernel log. Each stack trace of these messages was unique, but by carefully analyzing them, we could spot similarities and continue debugging with other instruments.</p>
    <div>
      <h3>Epilogue</h3>
      <a href="#epilogue">
        
      </a>
    </div>
    <p>Your system might have different hung task warnings, and we have many others not mentioned here. Each case is unique, and there is no standard approach to debug them. But hopefully this blog post helps you better understand why it’s good to have these warnings enabled, how they work, and what the meaning is behind them. We tried to provide some navigation guidance for the debugging process as well:</p><ul><li><p>analyzing the stack trace might be a good starting point for debugging it, even if all the messages look unrelated, like we saw in example #3</p></li><li><p>keep in mind that the alert might be misleading, pointing to the victim and not the offender, as we saw in example #2 and example #3</p></li><li><p>if the kernel doesn’t schedule your application on the CPU, puts it in the D state, and emits the warning – the real problem might exist in the application code</p></li></ul><p>Good luck with your debugging, and hopefully this material will help you on this journey!</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">3UHNgpNPKn2IAwDUzD4m3a</guid>
            <dc:creator>Oxana Kharitonova</dc:creator>
            <dc:creator>Jesper Brouer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Multi-Path TCP: revolutionizing connectivity, one path at a time]]></title>
            <link>https://blog.cloudflare.com/multi-path-tcp-revolutionizing-connectivity-one-path-at-a-time/</link>
            <pubDate>Fri, 03 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Multi-Path TCP (MPTCP) leverages multiple network interfaces, like Wi-Fi and cellular, to provide seamless mobility for more reliable connectivity. While promising, MPTCP is still in its early stages, ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The Internet is designed to provide multiple paths between two endpoints. Attempts to exploit multi-path opportunities are almost as old as the Internet, culminating in <a href="https://datatracker.ietf.org/doc/html/rfc2991"><u>RFCs</u></a> documenting some of the challenges. Still, today, virtually all end-to-end communication uses only one available path at a time. Why? It turns out that in multi-path setups, even the smallest differences between paths can harm the connection quality due to <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing#History"><u>packet reordering</u></a> and other issues. As a result, Internet devices usually use a single path and let the routers handle the path selection.</p><p>There is another way. Enter Multi-Path TCP (MPTCP), which exploits the presence of multiple interfaces on a device, such as a mobile phone that has both Wi-Fi and cellular antennas, to achieve multi-path connectivity.</p><p>MPTCP has had a long history — see the <a href="https://en.wikipedia.org/wiki/Multipath_TCP"><u>Wikipedia article</u></a> and the <a href="https://datatracker.ietf.org/doc/html/rfc8684"><u>spec (RFC 8684)</u></a> for details. It's a major extension to the TCP protocol, and historically, most extensions to TCP have failed to gain traction. However, MPTCP is supposed to be mostly an operating system feature, making it easy to enable. Applications should only need minor code changes to support it.</p><p>There is a caveat, however: MPTCP is still fairly immature, and while it can use multiple paths, giving it superpowers over regular TCP, it's not always strictly better. Whether to use MPTCP over TCP is really a case-by-case decision.</p><p>In this blog post we show how to set up MPTCP to find out.</p>
    <div>
      <h2>Subflows</h2>
      <a href="#subflows">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r8AP5BHbvtYEtXmYSXFwO/36e95cbac93cdecf2f5ee65945abf0b3/Screenshot_2024-12-23_at_3.07.37_PM.png" />
          </figure><p>Internally, MPTCP extends TCP by introducing "subflows". When everything is working, a single TCP connection can be backed by multiple MPTCP subflows, each using different paths. This is a big deal - a single TCP byte stream is now no longer identified by a single 5-tuple. On Linux you can see the subflows with <code>ss -M</code>, like:</p>
            <pre><code>marek$ ss -tMn dport = :443 | cat
tcp   ESTAB 0  	0 192.168.2.143%enx2800af081bee:57756 104.28.152.1:443
tcp   ESTAB 0  	0       192.168.1.149%wlp0s20f3:44719 104.28.152.1:443
mptcp ESTAB 0  	0                 192.168.2.143:57756 104.28.152.1:443</code></pre>
            <p>Here you can see a single MPTCP connection, composed of two underlying TCP flows.</p>
    <div>
      <h2>MPTCP aspirations</h2>
      <a href="#mptcp-aspirations">
        
      </a>
    </div>
    <p>Being able to separate the lifetime of a connection from the lifetime of a flow allows MPTCP to address two problems present in classical TCP: aggregation and mobility.</p><ul><li><p><b>Aggregation</b>: MPTCP can aggregate the bandwidth of many network interfaces. For example, in a data center scenario, it's common to use interface bonding. A single flow can make use of just one physical interface. MPTCP, by being able to launch many subflows, can expose greater overall bandwidth. I'm personally not convinced that this is a real problem. As we'll learn below, modern Linux has a <a href="https://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf"><u>BLEST-like MPTCP scheduler</u></a> and the macOS stack has the "aggregation" mode, so aggregation should work, but I'm not sure how practical it is. However, there are <a href="https://www.openmptcprouter.com/"><u>certainly projects that are trying to do link aggregation</u></a> using MPTCP.</p></li><li><p><b>Mobility</b>: On a consumer device, a TCP stream is typically broken if the underlying network interface goes away. This is not an uncommon occurrence — consider a smartphone dropping from Wi-Fi to cellular. MPTCP can fix this — it can create and destroy many subflows over the lifetime of a single connection and survive multiple network changes.</p></li></ul><p>Improving reliability for mobile clients is a big deal. While some software can use QUIC, for which <a href="https://www.ietf.org/archive/id/draft-ietf-quic-multipath-11.html"><u>Multipath Extensions</u></a> are being developed, a large number of classical services still use TCP. A great example is SSH: it would be very nice if you could walk around with a laptop, keep an SSH session open, and switch Wi-Fi networks seamlessly without breaking the connection.</p><p>MPTCP work was initially driven by <a href="https://uclouvain.be/fr/index.html"><u>UCLouvain in Belgium</u></a>. The first serious adoption was on the iPhone. Apparently, users have a tendency to use Siri while walking out of their homes, and it's very common to lose Wi-Fi connectivity when doing so (<a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>source</u></a>).</p>
    <div>
      <h2>Implementations</h2>
      <a href="#implementations">
        
      </a>
    </div>
    <p>Currently, there are only two major MPTCP implementations — the Linux kernel, with support from v5.6, though realistically you need at least kernel v6.1 (<a href="https://oracle.github.io/kconfigs/?config=UTS_RELEASE&amp;config=MPTCP"><u>MPTCP is not supported on Android</u></a> yet), and Apple's, with iOS from version 7 and Mac OS X from 10.10.</p><p>Typically, Linux is used on the server side, and iOS/macOS as the client. It's possible to get Linux to work as a client, but it's not straightforward, as we'll learn soon. Beware — there is plenty of outdated Linux MPTCP documentation. The code has had a bumpy history, and at least two different APIs were proposed. See the Linux kernel source for <a href="https://docs.kernel.org/networking/mptcp.html"><u>the mainline API</u></a> and the <a href="https://www.mptcp.dev/"><u>mptcp.dev</u></a> website.</p>
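<p>Before experimenting, it's worth checking whether the kernel you are running has MPTCP compiled in and enabled, via the <code>net.mptcp.enabled</code> sysctl. A small illustrative Python helper (our own sketch, not an official tool) might look like:</p>

```python
from pathlib import Path

def mptcp_enabled():
    """Return True/False according to the net.mptcp.enabled sysctl,
    or None when the kernel has no MPTCP support at all."""
    p = Path("/proc/sys/net/mptcp/enabled")
    if not p.exists():
        return None  # kernel built without CONFIG_MPTCP, or not Linux
    return p.read_text().strip() == "1"

print(mptcp_enabled())
```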
    <div>
      <h2>Linux as a server</h2>
      <a href="#linux-as-a-server">
        
      </a>
    </div>
    <p>Conceptually, the MPTCP design is pretty sensible. After the initial TCP handshake, each peer may announce additional addresses (and ports) on which it can be reached. There are two ways of doing this. First, in the handshake TCP packet each peer specifies the "<i>Do not attempt to establish new subflows to this address and port</i>" bit, also known as bit [C], in the MPTCP TCP extensions header.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bT8oz3wxpw7alftvdYg5n/b7614a4d10b6c81e18027f6785391ede/BLOG-2637_3.png" />
          </figure><p><sup><i>Wireshark dissecting MPTCP flags from a SYN packet. </i></sup><a href="https://github.com/multipath-tcp/mptcp_net-next/issues/535"><sup><i><u>Tcpdump does not report</u></i></sup></a><sup><i> this flag yet.</i></sup></p><p>With this bit cleared, the other peer is free to assume the two-tuple is fine to be reconnected to. Typically, the <b>server allows</b> the client to reuse the server IP/port address. Usually, the <b>client is not listening</b> and disallows the server from connecting back to it. There are caveats, though. For example, in the context of Cloudflare, where our servers are using Anycast addressing, reconnecting to the server IP/port won't work. Connecting twice to the same IP/port pair is unlikely to reach the same server. For us, it makes sense to set this flag, disallowing clients from reconnecting to our server addresses. This can be done on Linux with:</p>
            <pre><code># Linux server sysctl - useful for ECMP or Anycast servers
$ sysctl -w net.mptcp.allow_join_initial_addr_port=0
</code></pre>
            <p>There is also a second way to advertise a listening IP/port. During the lifetime of a connection, a peer can send an ADD-ADDR MPTCP signal which advertises a listening IP/port. This can be managed on Linux by <code>ip mptcp endpoint ... signal</code>, like:</p>
            <pre><code># Linux server - extra listening address
$ ip mptcp endpoint add 192.51.100.1 dev eth0 port 4321 signal
</code></pre>
            <p>With such a config, a Linux peer (typically server) will report the additional IP/port with ADD-ADDR MPTCP signal in an ACK packet, like this:</p>
            <pre><code>host &gt; host: Flags [.], ack 1, win 8, options [mptcp 30 add-addr v1 id 1 192.51.100.1:4321 hmac 0x...,nop,nop], length 0
</code></pre>
            <p>It's important to realize that either peer can send ADD-ADDR messages. Unusual as it might sound, it's totally fine for the client to advertise extra listening addresses. The most common scenario though, consists of either nobody, or just a server, sending ADD-ADDR.</p><p>Technically, to launch an MPTCP socket on Linux, you just need to replace IPPROTO_TCP with IPPROTO_MPTCP in the application code:</p>
            <pre><code># Python example; socket.IPPROTO_MPTCP is exposed from Python 3.10
from socket import socket, AF_INET, SOCK_STREAM

IPPROTO_MPTCP = 262
sd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP)
</code></pre>
            <p>In practice, though, this introduces some changes to the sockets API. Currently, not all setsockopt options work yet — <code>TCP_USER_TIMEOUT</code>, for example. Additionally, at this stage, MPTCP is incompatible with kTLS.</p>
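<p>Because of these gaps, client code often wants to try MPTCP and transparently fall back to plain TCP when the kernel refuses. Here is a hedged Python sketch of that pattern (the helper name is ours, not from any library):</p>

```python
import socket

# socket.IPPROTO_MPTCP exists from Python 3.10; 262 is the Linux value.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def stream_socket(prefer_mptcp=True):
    """Create an MPTCP socket if the kernel supports it, else plain TCP."""
    if prefer_mptcp:
        try:
            return socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                                 IPPROTO_MPTCP)
        except OSError:
            pass  # e.g. EPROTONOSUPPORT: no MPTCP in this kernel
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM)

sd = stream_socket()
sd.close()
```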
    <div>
      <h2>Path manager / scheduler</h2>
      <a href="#path-manager-scheduler">
        
      </a>
    </div>
    <p>Once the peers have exchanged the address information, MPTCP is ready to kick in and perform the magic. There are two independent pieces of logic that MPTCP handles. First, given the address information, MPTCP must figure out if it should establish additional subflows. The component that decides this is called the "Path Manager". Then, another component, called the "scheduler", is responsible for choosing a specific subflow to transmit the data over.</p><p>Both peers have a path manager, but typically only the client uses it. A path manager has a hard task: it must launch enough subflows to get the benefits, but not so many that they waste resources. This is where the MPTCP stacks get complicated.</p>
    <div>
      <h2>Linux as client</h2>
      <a href="#linux-as-client">
        
      </a>
    </div>
    <p>On Linux, the path manager is an operating system feature, not an application feature. The in-kernel path manager requires some configuration — it must know which IP addresses and interfaces are okay to use for starting new subflows. This is configured with <code>ip mptcp endpoint ... subflow</code>, like:</p>
            <pre><code>$ ip mptcp endpoint add dev wlp1s0 192.0.2.3 subflow  # Linux client
</code></pre>
            <p>This informs the path manager that we (typically a client) own a 192.0.2.3 IP address on interface wlp1s0, and that it's fine to use it as source of a new subflow. There are two additional flags that can be passed here: "backup" and "fullmesh". Maintaining these <code>ip mptcp endpoints</code> on a client is annoying. They need to be added and removed every time networks change. Fortunately, <a href="https://ubuntu.com/core/docs/networkmanager"><u>NetworkManager</u></a> from 1.40 supports managing these by default. If you want to customize the "backup" or "fullmesh" flags, you can do this here (see <a href="https://networkmanager.dev/docs/api/1.44.4/settings-connection.html#:~:text=mptcp-flags"><u>the documentation</u></a>):</p>
            <pre><code>ubuntu$ cat /etc/NetworkManager/conf.d/95-mptcp.conf
# set "subflow" on all managed "ip mptcp endpoints". 0x22 is the default.
[connection]
connection.mptcp-flags=0x22
</code></pre>
            <p>Path manager also takes a "limit" setting, to set a cap of additional subflows per MPTCP connection, and limit the received ADD-ADDR messages, like: </p>
            <pre><code>$ ip mptcp limits set subflow 4 add_addr_accepted 2  # Linux client
</code></pre>
            <p>I experimented with the "mobility" use case on my Ubuntu 22 Linux laptop. I repeatedly enabled and disabled Wi-Fi and Ethernet. On new kernels (v6.12), it works, and I was able to hold a reliable MPTCP connection over many interface changes. I was <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/534"><u>less lucky with the Ubuntu v6.8</u></a> kernel. Unfortunately, the <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/536"><u>default path manager</u></a> on a Linux client only works when the flag "<i>Do not attempt to establish new subflows to this address and port</i>" is cleared on the server. Server-announced ADD-ADDRs don't result in new subflows being created unless the <code>ip mptcp endpoint</code> has the <code>fullmesh</code> flag.</p><p>It feels like the underlying MPTCP transport code works, but the path manager requires a bit more intelligence. With a new kernel, it's possible to get the "interactive" case working out of the box, but not the ADD-ADDR case.</p>
    <div>
      <h2>Custom path manager</h2>
      <a href="#custom-path-manager">
        
      </a>
    </div>
    <p>Linux allows for two implementations of the path manager component. It can either use the built-in kernel implementation (the default), or a userspace netlink daemon.</p>
            <pre><code>$ sysctl -w net.mptcp.pm_type=1 # use userspace path manager
</code></pre>
            <p>However, from what I found, there is no serious implementation of a configurable userspace path manager. The existing <a href="https://github.com/multipath-tcp/mptcpd/blob/main/plugins/path_managers/sspi.c"><u>implementations don't do much</u></a>, and the API still <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/533"><u>seems</u></a> <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/532"><u>immature</u></a>.</p>
    <div>
      <h2>Scheduler and BPF extensions</h2>
      <a href="#scheduler-and-bpf-extensions">
        
      </a>
    </div>
    <p>Thus far we've covered Path Manager, but what about the scheduler that chooses which link to actually use? It seems that on Linux there is only one built-in "default" scheduler, and it can do basic failover on packet loss. The developers want to write <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/75"><u>MPTCP schedulers in BPF</u></a>, and this work is in-progress.</p>
    <div>
      <h2>macOS</h2>
      <a href="#macos">
        
      </a>
    </div>
    <p>As opposed to Linux, macOS and iOS expose a raw MPTCP API. On those operating systems, path manager is not handled by the kernel, but instead can be an application responsibility. The exposed low-level API is based on <code>connectx()</code>. For example, <a href="https://github.com/apple-oss-distributions/network_cmds/blob/97bfa5b71464f1286b51104ba3e60db78cd832c9/mptcp_client/mptcp_client.c#L461"><u>here's an example of obscure code</u></a> that establishes one connection with two subflows:</p>
            <pre><code>int sock = socket(AF_MULTIPATH, SOCK_STREAM, 0);
connectx(sock, ..., &amp;cid1);
connectx(sock, ..., &amp;cid2);
</code></pre>
            <p>This powerful API is hard to use though, as it would require every application to listen for network changes. Fortunately, macOS and iOS also expose higher-level APIs. One <a href="https://github.com/mptcp-apps/mptcp-hello/blob/main/c/macOS/main.c"><u>example is nw_connection</u></a> in C, which uses nw_parameters_set_multipath_service.</p><p>Another, more common example is using <code>Network.framework</code>, and would <a href="https://gist.github.com/majek/cb54b537c74506164d2a7fa2d6601491"><u>look like this</u></a>:</p>
            <pre><code>let parameters = NWParameters.tcp
parameters.multipathServiceType = .interactive
let connection = NWConnection(host: host, port: port, using: parameters) 
</code></pre>
            <p>The API supports three MPTCP service type modes:</p><ul><li><p><i>Handover Mode</i>: Tries to minimize cellular usage. It prefers Wi-Fi, and uses cellular only when <a href="https://support.apple.com/en-us/102228"><u>Wi-Fi Assist</u></a> is enabled and makes such a decision.</p></li><li><p><i>Interactive Mode</i>: Used for Siri. Reduces latency. Only for low-bandwidth flows.</p></li><li><p><i>Aggregation Mode</i>: Enables resource pooling, but it's only available for developer accounts and is not deployable.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47MukOs6bhCMOkO1JL15sP/7dd75417b855b681bde504122d5af01e/Screenshot_2024-12-23_at_2.59.51_PM.png" />
          </figure><p>The MPTCP API is nicely integrated with the <a href="https://support.apple.com/en-us/102228"><u>iPhone "Wi-Fi Assist" feature</u></a>. While the official documentation is lacking, it's possible to find <a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>sources explaining</u></a> how it actually works. I was able to successfully test both the cleared "<i>Do not attempt to establish new subflows"</i> bit and ADD-ADDR scenarios. Hurray!</p>
    <div>
      <h2>IPv6 caveat</h2>
      <a href="#ipv6-caveat">
        
      </a>
    </div>
    <p>Sadly, MPTCP IPv6 has a caveat. Since IPv6 addresses are long, and MPTCP uses the space-constrained TCP Extensions field, there is <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/448"><u>not enough room for ADD-ADDR messages</u></a> if TCP timestamps are enabled. If you want to use MPTCP and IPv6, it's something to consider.</p>
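<p>The arithmetic behind this caveat is simple enough to check. TCP options are capped at 40 bytes per segment; the rough byte counts below are drawn from RFC 7323 (timestamps) and RFC 8684 (ADD-ADDR), and are an illustration rather than an exact wire-format accounting:</p>

```python
# TCP header option space is limited to 40 bytes per segment.
TCP_OPTIONS_MAX = 40

TIMESTAMPS = 12    # 10-byte timestamps option, usually padded to 12 with NOPs
ADD_ADDR_V6 = 30   # IPv6 ADD-ADDR with port and 8-byte truncated HMAC
ADD_ADDR_V4 = 18   # IPv4 variant is 12 bytes smaller

assert TIMESTAMPS + ADD_ADDR_V6 > TCP_OPTIONS_MAX   # IPv6 doesn't fit
assert TIMESTAMPS + ADD_ADDR_V4 <= TCP_OPTIONS_MAX  # IPv4 fits
```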
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>I find MPTCP very exciting, being one of a few deployable serious TCP extensions. However, current implementations are limited. My experimentation showed that the only practical scenario where currently MPTCP might be useful is:</p><ul><li><p>Linux as a server</p></li><li><p>macOS/iOS as a client</p></li><li><p>"interactive" use case</p></li></ul><p>With a bit of effort, Linux can be made to work as a client.</p><p>Don't get me wrong, <a href="https://netdevconf.info/0x14/pub/slides/59/mptcp-netdev0x14-final.pdf"><u>Linux developers did tremendous work</u></a> to get where we are, but, in my opinion for any serious out-of-the-box use case, we're not there yet. I'm optimistic that Linux can develop a good MPTCP client story relatively soon, and the possibility of implementing the Path manager and Scheduler in BPF is really enticing. </p><p>Time will tell if MPTCP succeeds — it's been 15 years in the making. In the meantime, <a href="https://datatracker.ietf.org/meeting/121/materials/slides-121-quic-multipath-quic-00"><u>Multi-Path QUIC</u></a> is under active development, but it's even further from being usable at this stage.</p><p>We're not quite sure if it makes sense for Cloudflare to support MPTCP. <a href="https://community.cloudflare.com/c/feedback/feature-request/30"><u>Reach out</u></a> if you have a use case in mind!</p><p><i>Shoutout to </i><a href="https://fosstodon.org/@matttbe"><i><u>Matthieu Baerts</u></i></a><i> for tremendous help with this blog post.</i></p> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6ZxrGIedGqREgTs02vpt0t</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Linux kernel security tunables everyone should consider adopting]]></title>
            <link>https://blog.cloudflare.com/linux-kernel-hardening/</link>
            <pubDate>Wed, 06 Mar 2024 14:00:43 GMT</pubDate>
            <description><![CDATA[ This post illustrates some of the Linux Kernel features, which are helping us to keep our production systems more secure. We will deep dive into how they work and why you may consider enabling them as well ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MwDdpmFexb2YBRb6Y4VkD/f8fe7a101729ee9fa518a457b8bb170a/Technical-deep-dive-for-Security-week.png" />
            
            </figure><p>The Linux kernel is the heart of many modern production systems. It decides when any code is allowed to run and which programs/users can access which resources. It manages memory, mediates access to hardware, and does a bulk of work under the hood on behalf of programs running on top. Since the kernel is always involved in any code execution, it is in the best position to protect the system from malicious programs, enforce the desired system security policy, and provide security features for safer production environments.</p><p>In this post, we will review some Linux kernel security configurations we use at Cloudflare and how they help to block or minimize a potential system compromise.</p>
    <div>
      <h2>Secure boot</h2>
      <a href="#secure-boot">
        
      </a>
    </div>
    <p>When a machine (either a laptop or a server) boots, it goes through several boot stages:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LULhRh5qh02zFnil5ZPnl/b2e1867ba56492628bbb95ba6850448e/image3-17.png" />
            
            </figure><p>Within a secure boot architecture each stage from the above diagram verifies the integrity of the next stage before passing execution to it, thus forming a so-called secure boot chain. This way “trustworthiness” is extended to every component in the boot chain, because if we verified the code integrity of a particular stage, we can trust this code to verify the integrity of the next stage.</p><p>We <a href="/anchoring-trust-a-hardware-secure-boot-story">have previously covered</a> how Cloudflare implements secure boot in the initial stages of the boot process. In this post, we will focus on the Linux kernel.</p><p>Secure boot is the cornerstone of any operating system security mechanism. The Linux kernel is the primary enforcer of the operating system security configuration and policy, so we have to be sure that the Linux kernel itself has not been tampered with. In our previous <a href="/anchoring-trust-a-hardware-secure-boot-story">post about secure boot</a> we showed how we use UEFI Secure Boot to ensure the integrity of the Linux kernel.</p><p>But what happens next? After the kernel gets executed, it may try to load additional drivers, or as they are called in the Linux world, kernel modules. And kernel module loading is not confined just to the boot process. A module can be loaded at any time during runtime — a new device being plugged in and a driver is needed, some additional extensions in the networking stack are required (for example, for fine-grained firewall rules), or just manually by the system administrator.</p><p>However, uncontrolled kernel module loading might pose a significant risk to system integrity. Unlike regular programs, which get executed as user space processes, kernel modules are pieces of code which get injected and executed directly in the Linux kernel address space. There is no separation between the code and data in different kernel modules and core kernel subsystems, so everything can access everything. 
This means that a rogue kernel module can completely nullify the trustworthiness of the operating system and make secure boot useless. As an example, consider a simple Debian 12 (Bookworm installation), but with <a href="https://selinuxproject.org/">SELinux</a> configured and enforced:</p>
            <pre><code>ignat@dev:~$ lsb_release --all
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)
Release:	12
Codename:	bookworm
ignat@dev:~$ uname -a
Linux dev 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
ignat@dev:~$ sudo getenforce
Enforcing</code></pre>
            <p>Now we need to do some research. First, we see that we’re running the 6.1.76 Linux kernel. If we explore the source code, we can see that <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/selinux/hooks.c#L107">inside the kernel, the SELinux configuration is stored in a singleton structure</a>, which <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/selinux/include/security.h#L92">is defined</a> as follows:</p>
            <pre><code>struct selinux_state {
#ifdef CONFIG_SECURITY_SELINUX_DISABLE
	bool disabled;
#endif
#ifdef CONFIG_SECURITY_SELINUX_DEVELOP
	bool enforcing;
#endif
	bool checkreqprot;
	bool initialized;
	bool policycap[__POLICYDB_CAP_MAX];

	struct page *status_page;
	struct mutex status_lock;

	struct selinux_avc *avc;
	struct selinux_policy __rcu *policy;
	struct mutex policy_mutex;
} __randomize_layout;</code></pre>
            <p>From the above, we can see that if the kernel configuration has <code>CONFIG_SECURITY_SELINUX_DEVELOP</code> enabled, the structure would have a boolean variable <code>enforcing</code>, which controls the enforcement status of SELinux at runtime. This is exactly what the above <code>$ sudo getenforce</code> command returns. We can double check that the Debian kernel indeed has the configuration option enabled:</p>
            <pre><code>ignat@dev:~$ grep CONFIG_SECURITY_SELINUX_DEVELOP /boot/config-`uname -r`
CONFIG_SECURITY_SELINUX_DEVELOP=y</code></pre>
            <p>Good! Now that we have a variable in the kernel which is responsible for some security enforcement, we can try to attack it. One problem, though, is the <code>__randomize_layout</code> attribute: since <code>CONFIG_SECURITY_SELINUX_DISABLE</code> is actually not set for our Debian kernel, <code>enforcing</code> would normally be the first member of the struct. Thus, if we know where the struct is, we immediately know the position of the <code>enforcing</code> flag. With <code>__randomize_layout</code>, the compiler might place members at arbitrary positions within the struct during kernel compilation, making it harder to create generic exploits. But arbitrary struct randomization within the kernel <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/Kconfig.hardening#L301">may introduce a performance impact</a>, so it is often disabled, and it is disabled for the Debian kernel:</p>
            <pre><code>ignat@dev:~$ grep RANDSTRUCT /boot/config-`uname -r`
CONFIG_RANDSTRUCT_NONE=y</code></pre>
            <p>We can also confirm the compiled position of the <code>enforcing</code> flag using the <a href="https://git.kernel.org/pub/scm/devel/pahole/pahole.git/">pahole tool</a> and either kernel debug symbols, if available, or (on modern kernels, if enabled) in-kernel <a href="https://www.kernel.org/doc/html/next/bpf/btf.html">BTF</a> information. We will use the latter:</p>
            <pre><code>ignat@dev:~$ pahole -C selinux_state /sys/kernel/btf/vmlinux
struct selinux_state {
	bool                       enforcing;            /*     0     1 */
	bool                       checkreqprot;         /*     1     1 */
	bool                       initialized;          /*     2     1 */
	bool                       policycap[8];         /*     3     8 */

	/* XXX 5 bytes hole, try to pack */

	struct page *              status_page;          /*    16     8 */
	struct mutex               status_lock;          /*    24    32 */
	struct selinux_avc *       avc;                  /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct selinux_policy *    policy;               /*    64     8 */
	struct mutex               policy_mutex;         /*    72    32 */

	/* size: 104, cachelines: 2, members: 9 */
	/* sum members: 99, holes: 1, sum holes: 5 */
	/* last cacheline: 40 bytes */
};</code></pre>
            <p>So <code>enforcing</code> is indeed located at the start of the structure, and we don’t even have to be a privileged user to confirm this.</p><p>Great! All we need is the runtime address of the <code>selinux_state</code> variable inside the kernel:</p>
            <pre><code>ignat@dev:~$ sudo grep selinux_state /proc/kallsyms
ffffffffbc3bcae0 B selinux_state</code></pre>
            <p>With all this information, we can write an almost textbook-simple kernel module to manipulate the SELinux state:</p><p><code>mymod.c</code>:</p>
            <pre><code>#include &lt;linux/module.h&gt;

static int __init mod_init(void)
{
	bool *selinux_enforce = (bool *)0xffffffffbc3bcae0;
	*selinux_enforce = false;
	return 0;
}

static void mod_fini(void)
{
}

module_init(mod_init);
module_exit(mod_fini);

MODULE_DESCRIPTION("A somewhat malicious module");
MODULE_AUTHOR("Ignat Korchagin &lt;ignat@cloudflare.com&gt;");
MODULE_LICENSE("GPL");</code></pre>
            <p>And the respective <code>Kbuild</code> file:</p>
            <pre><code>obj-m := mymod.o</code></pre>
            <p>With these two files we can build a full fledged kernel module according to <a href="https://docs.kernel.org/kbuild/modules.html">the official kernel docs</a>:</p>
            <pre><code>ignat@dev:~$ cd mymod/
ignat@dev:~/mymod$ ls
Kbuild  mymod.c
ignat@dev:~/mymod$ make -C /lib/modules/`uname -r`/build M=$PWD
make: Entering directory '/usr/src/linux-headers-6.1.0-18-cloud-amd64'
  CC [M]  /home/ignat/mymod/mymod.o
  MODPOST /home/ignat/mymod/Module.symvers
  CC [M]  /home/ignat/mymod/mymod.mod.o
  LD [M]  /home/ignat/mymod/mymod.ko
  BTF [M] /home/ignat/mymod/mymod.ko
Skipping BTF generation for /home/ignat/mymod/mymod.ko due to unavailability of vmlinux
make: Leaving directory '/usr/src/linux-headers-6.1.0-18-cloud-amd64'</code></pre>
            <p>If we try to load this module now, the system may not allow it due to the SELinux policy:</p>
            <pre><code>ignat@dev:~/mymod$ sudo insmod mymod.ko
insmod: ERROR: could not load module mymod.ko: Permission denied</code></pre>
            <p>We can work around this by copying the module somewhere into the standard module path:</p>
            <pre><code>ignat@dev:~/mymod$ sudo cp mymod.ko /lib/modules/`uname -r`/kernel/crypto/</code></pre>
            <p>Now let’s try it out:</p>
            <pre><code>ignat@dev:~/mymod$ sudo getenforce
Enforcing
ignat@dev:~/mymod$ sudo insmod /lib/modules/`uname -r`/kernel/crypto/mymod.ko
ignat@dev:~/mymod$ sudo getenforce
Permissive</code></pre>
            <p>Not only did we disable the SELinux protection via a malicious kernel module, we did it quietly. Normal <code>sudo setenforce 0</code>, even if allowed, would go through the official <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/selinux/selinuxfs.c#L173">selinuxfs interface and would emit an audit message</a>. Our code manipulated the kernel memory directly, so no one was alerted. This illustrates why uncontrolled kernel module loading is very dangerous and that is why most security standards and commercial security monitoring products advocate for close monitoring of kernel module loading.</p><p>But we don’t need to monitor kernel modules at Cloudflare. Let’s repeat the exercise on a Cloudflare production kernel (module recompilation skipped for brevity):</p>
            <pre><code>ignat@dev:~/mymod$ uname -a
Linux dev 6.6.17-cloudflare-2024.2.9 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010 x86_64 GNU/Linux
ignat@dev:~/mymod$ sudo insmod /lib/modules/`uname -r`/kernel/crypto/mymod.ko
insmod: ERROR: could not insert module /lib/modules/6.6.17-cloudflare-2024.2.9/kernel/crypto/mymod.ko: Key was rejected by service</code></pre>
            <p>We get a <code>Key was rejected by service</code> error when trying to load a module, and the kernel log will have the following message:</p>
            <pre><code>ignat@dev:~/mymod$ sudo dmesg | tail -n 1
[41515.037031] Loading of unsigned module is rejected</code></pre>
            <p>This is because the Cloudflare kernel <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/module/Kconfig#L211">requires all the kernel modules to have a valid signature</a>, so we don’t even have to worry about a malicious module being loaded at some point:</p>
            <pre><code>ignat@dev:~$ grep MODULE_SIG_FORCE /boot/config-`uname -r`
CONFIG_MODULE_SIG_FORCE=y</code></pre>
            <p>For completeness, it is worth noting that the stock Debian kernel also supports module signatures, but does not enforce them:</p>
            <pre><code>ignat@dev:~$ grep MODULE_SIG /boot/config-6.1.0-18-cloud-amd64
CONFIG_MODULE_SIG_FORMAT=y
CONFIG_MODULE_SIG=y
# CONFIG_MODULE_SIG_FORCE is not set
…</code></pre>
            <p>The above configuration means that the kernel will validate a module signature, if one is available. If it is not, the module will be loaded anyway, with a warning message emitted, and the <a href="https://docs.kernel.org/admin-guide/tainted-kernels.html">kernel will be tainted</a>.</p>
    <div>
      <h3>Key management for kernel module signing</h3>
      <a href="#key-management-for-kernel-module-signing">
        
      </a>
    </div>
    <p>Signed kernel modules are great, but they create a key management problem: to sign a module we need a signing keypair that is trusted by the kernel. The public key of the keypair is usually embedded directly into the kernel binary, so the kernel can easily use it to verify module signatures. The private key of the pair needs to be kept secret and secure, because if it leaks, anyone could compile and sign a potentially malicious kernel module that our kernel would accept.</p><p>But what is the best way to eliminate the risk of losing something? Not to have it in the first place! Luckily, the kernel build system <a href="https://elixir.bootlin.com/linux/v6.6.17/source/certs/Makefile#L36">will generate a random keypair</a> for module signing if none is provided. At Cloudflare, we use that feature to sign all the kernel modules during the kernel compilation stage. When the compilation and signing are done, though, instead of storing the key in a secure place, we just destroy the private key:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1EWN5Kh5NebZTW3kAzK9uM/0dfb98bc095710e5b317a128d3efb1c0/image1-19.png" />
            
            </figure><p>So with the above process:</p><ol><li><p>The kernel build system generates a random keypair and compiles the kernel and modules</p></li><li><p>The public key is embedded into the kernel image; the private key is used to sign all the modules</p></li><li><p>The private key is destroyed</p></li></ol><p>With this scheme, not only do we avoid module signing key management entirely, we also use a different key for each kernel we release to production. So even if a particular build process is hijacked and the signing key is not destroyed and potentially leaked, the key will no longer be valid once a kernel update is released.</p><p>There are some flexibility downsides, though: we can’t “retrofit” a new kernel module into an already released kernel (for example, for <a href="https://www.cloudflare.com/en-gb/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/">a new piece of hardware we are adopting</a>). However, this is not a practical limitation for us, as we release kernels often (roughly every week) to keep up with the steady stream of bug fixes and vulnerability patches in the Linux kernel.</p>
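<p>To make the lifecycle concrete, here is a toy Python sketch of the “sign, embed the public key, destroy the private key” flow. It uses a Lamport one-time signature built from plain hashes purely for illustration; the real kernel build signs modules with X.509 keys via its <code>sign-file</code> tool, not like this:</p>

```python
import hashlib
import secrets

def keygen():
    # Private key: 2 x 256 random 32-byte secrets (one pair per hash bit).
    priv = [[secrets.token_bytes(32) for _ in range(256)] for _ in range(2)]
    # Public key: the SHA-256 hash of every private value.
    pub = [[hashlib.sha256(s).digest() for s in half] for half in priv]
    return priv, pub

def msg_bits(msg):
    h = int.from_bytes(hashlib.sha256(msg).digest(), "big")
    return [(h >> i) & 1 for i in range(256)]

def sign(priv, msg):
    # Reveal one private value per bit of the message hash.
    return [priv[bit][i] for i, bit in enumerate(msg_bits(msg))]

def verify(pub, msg, sig):
    # Verification needs only the public key: hash each revealed value
    # and compare against the published hashes.
    return all(hashlib.sha256(s).digest() == pub[bit][i]
               for (i, bit), s in zip(enumerate(msg_bits(msg)), sig))

priv, pub = keygen()                     # 1. build system generates a keypair
module = b"contents of mymod.ko"
sig = sign(priv, module)                 # 2. private key signs the module
del priv                                 # 3. private key is destroyed
assert verify(pub, module, sig)          # the embedded public key still verifies
assert not verify(pub, b"evil.ko", sig)  # a tampered module is rejected
```

<p>The point of the sketch is the asymmetry: after step 3 nothing on the system can produce a new valid signature, yet the public key embedded in the kernel can still verify the modules signed during the build.</p>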
    <div>
      <h2>KEXEC</h2>
      <a href="#kexec">
        
      </a>
    </div>
    <p><a href="https://en.wikipedia.org/wiki/Kexec">KEXEC</a> (or <code>kexec_load()</code>) is an interesting system call in Linux, which allows one kernel to directly execute (or jump to) another kernel. The idea is to switch/update/downgrade kernels faster, without going through a full reboot cycle, to minimize potential system downtime. However, it was developed quite a while ago, when secure boot and system integrity were not major concerns, so its original design has security flaws and is known to be able to <a href="https://mjg59.dreamwidth.org/28746.html">bypass secure boot and potentially compromise system integrity</a>.</p><p>We can see the problems just from the <a href="https://man7.org/linux/man-pages/man2/kexec_load.2.html">definition of the system call itself</a>:</p>
            <pre><code>struct kexec_segment {
	const void *buf;
	size_t bufsz;
	const void *mem;
	size_t memsz;
};
...
long kexec_load(unsigned long entry, unsigned long nr_segments, struct kexec_segment *segments, unsigned long flags);</code></pre>
            <p>So the kernel expects just a collection of buffers with code to execute. Back in those days there was not much desire to do a lot of data parsing inside the kernel, so the idea was to parse the to-be-executed kernel image in user space and provide the kernel with only the data it needs. Also, to switch kernels live, we need an intermediate program which would take over while the old kernel is shutting down and the new kernel has not yet been executed. In the kexec world this program is called <a href="https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/purgatory">purgatory</a>. Thus the problem is evident: we give the kernel a bunch of code and it will happily execute it at the highest privilege level. But instead of the original kernel or purgatory code, we can easily provide code similar to the one demonstrated earlier in this post, which disables SELinux (or does something else to the kernel).</p><p>At Cloudflare we have had <code>kexec_load()</code> disabled for some time now just because of this. The advantage of faster reboots with kexec comes with <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L30">a (small) risk of improperly initialized hardware</a>, so it was not worth using it even without the security concerns. However, kexec does provide one useful feature — it is the foundation of the Linux kernel <a href="https://docs.kernel.org/admin-guide/kdump/kdump.html">crashdumping solution</a>. In a nutshell, if a kernel crashes in production (due to a bug or some other error), a backup kernel (previously loaded with kexec) can take over, collect and save the memory dump for further investigation. 
This makes it possible to investigate kernel and other issues in production much more effectively, so it is a powerful tool to have.</p><p>Luckily, since the <a href="https://mjg59.dreamwidth.org/28746.html">original problems with kexec were outlined</a>, Linux has developed an alternative <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L36">secure interface for kexec</a>: instead of buffers with code, it expects file descriptors with the to-be-executed kernel image and initrd, and does the parsing inside the kernel. Thus, only a valid kernel image can be supplied. On top of this, we can <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L48">configure</a> and <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L62">require</a> kexec to ensure the provided images are properly signed, so only authorized code can be executed in the kexec scenario. A secure configuration for kexec looks something like this:</p>
            <pre><code>ignat@dev:~$ grep KEXEC /boot/config-`uname -r`
CONFIG_KEXEC_CORE=y
CONFIG_HAVE_IMA_KEXEC=y
# CONFIG_KEXEC is not set
CONFIG_KEXEC_FILE=y
CONFIG_KEXEC_SIG=y
CONFIG_KEXEC_SIG_FORCE=y
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG=y
…</code></pre>
            <p>Above, we ensure that the legacy <code>kexec_load()</code> system call is disabled (<code>CONFIG_KEXEC</code> is not set), while Linux kernel crashdumping can still be configured via the new <code>kexec_file_load()</code> system call (<code>CONFIG_KEXEC_FILE=y</code>) with enforced signature checks (<code>CONFIG_KEXEC_SIG=y</code> and <code>CONFIG_KEXEC_SIG_FORCE=y</code>).</p><p>Note that the stock Debian kernel has the legacy <code>kexec_load()</code> system call enabled and does not enforce signature checks for <code>kexec_file_load()</code> (similar to its module signature checks):</p>
            <pre><code>ignat@dev:~$ grep KEXEC /boot/config-6.1.0-18-cloud-amd64
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
CONFIG_KEXEC_SIG=y
# CONFIG_KEXEC_SIG_FORCE is not set
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG=y
…</code></pre>
            
    <div>
      <h2>Kernel Address Space Layout Randomization (KASLR)</h2>
      <a href="#kernel-address-space-layout-randomization-kaslr">
        
      </a>
    </div>
    <p>Even on the stock Debian kernel, if you try to repeat the exercise we described in the “Secure boot” section of this post after a system reboot, you will likely find that it now fails to disable SELinux. This is because we hardcoded the kernel address of the <code>selinux_state</code> structure in our malicious kernel module, but the address has changed after the reboot:</p>
            <pre><code>ignat@dev:~$ sudo grep selinux_state /proc/kallsyms
ffffffffb41bcae0 B selinux_state</code></pre>
            <p><a href="https://docs.kernel.org/security/self-protection.html#kernel-address-space-layout-randomization-kaslr">Kernel Address Space Layout Randomization (or KASLR)</a> is a simple concept: it slightly and randomly shifts the kernel code and data on each boot:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6e7sTU0q25lHmBG5b4Q8if/d19a762047547d6f14dff4f9ab8f2d81/Screenshot-2024-03-06-at-13.53.23-2.png" />
            
            </figure><p>This is to combat targeted exploitation (like the malicious module in this post) based on knowledge of the location of internal kernel structures and code. It is especially useful for popular Linux distribution kernels, like the Debian one, because most users run the same binary, and anyone can download the debug symbols and the System.map file with the addresses of all the kernel internals. Just to note: it will not prevent the module from loading and doing harm, but it will likely prevent the targeted effect of disabling SELinux. Instead, the module will modify a random piece of kernel memory, potentially causing the kernel to crash.</p><p>Both the Cloudflare kernel and the Debian one have this feature enabled:</p>
            <pre><code>ignat@dev:~$ grep RANDOMIZE_BASE /boot/config-`uname -r`
CONFIG_RANDOMIZE_BASE=y</code></pre>
            
    <div>
      <h3>Restricted kernel pointers</h3>
      <a href="#restricted-kernel-pointers">
        
      </a>
    </div>
    <p>While KASLR helps against targeted exploits, it is quite easy to bypass, since everything is shifted by a single random offset, as shown in the diagram above. Thus, if the attacker knows at least one runtime kernel address, they can recover this offset by subtracting the compile-time address of the same symbol (function or data structure), taken from the kernel’s System.map file, from the runtime address. Once they know the offset, they can recover the addresses of all other symbols by adjusting them by this offset.</p><p>Therefore, modern kernels take precautions not to leak kernel addresses, at least to unprivileged users. One of the main tunables for this is the <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#kptr-restrict">kptr_restrict sysctl</a>. It is a good idea to set it to at least <code>1</code>, so that regular users cannot see kernel pointers:</p>
            <pre><code>ignat@dev:~$ sudo sysctl -w kernel.kptr_restrict=1
kernel.kptr_restrict = 1
ignat@dev:~$ grep selinux_state /proc/kallsyms
0000000000000000 B selinux_state</code></pre>
            <p>Privileged users can still see the pointers:</p>
            <pre><code>ignat@dev:~$ sudo grep selinux_state /proc/kallsyms
ffffffffb41bcae0 B selinux_state</code></pre>
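<p>With a leaked pointer like the one above, recovering the KASLR offset is simple arithmetic. A toy Python sketch of the bypass (all symbol addresses and the offset here are made up for illustration):</p>

```python
# Compile-time addresses, as they would appear in the kernel's System.map.
# These values are purely illustrative.
system_map = {
    "selinux_state": 0xffffffff82a4cae0,
    "commit_creds":  0xffffffff810c9a50,
}

# At boot, KASLR shifts the whole kernel by one random offset.
kaslr_offset = 0x1570000  # unknown to the attacker
runtime = {name: addr + kaslr_offset for name, addr in system_map.items()}

# A single leaked runtime pointer is enough to recover the offset...
leaked = runtime["selinux_state"]
offset = leaked - system_map["selinux_state"]

# ...and with the offset, the runtime address of every other symbol.
recovered = system_map["commit_creds"] + offset
assert recovered == runtime["commit_creds"]
```

<p>This is exactly why hiding kernel pointers from unprivileged users matters: one leak defeats the whole randomization.</p>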
            <p>Similar to the <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#kptr-restrict">kptr_restrict sysctl</a>, there is also <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#dmesg-restrict">dmesg_restrict</a>, which, if set, prevents regular users from reading the kernel log (which may also leak kernel pointers via its messages). While you need to explicitly set <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#kptr-restrict">kptr_restrict</a> to a non-zero value on each boot (or use a system sysctl configuration utility, like <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-sysctl.service.html">this one</a>), you can configure the initial value of <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#dmesg-restrict">dmesg_restrict</a> via the <code>CONFIG_SECURITY_DMESG_RESTRICT</code> kernel configuration option. Both the Cloudflare kernel and the Debian one enforce <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#dmesg-restrict">dmesg_restrict</a> this way:</p>
            <pre><code>ignat@dev:~$ grep CONFIG_SECURITY_DMESG_RESTRICT /boot/config-`uname -r`
CONFIG_SECURITY_DMESG_RESTRICT=y</code></pre>
            <p>It is worth noting that <code>/proc/kallsyms</code> and the kernel log are not the only sources of potential kernel pointer leaks. There is a lot of legacy in the Linux kernel, and new sources are continuously being found and patched. That’s why it is very important to stay up to date with the latest kernel bugfix releases.</p>
    <div>
      <h2>Lockdown LSM</h2>
      <a href="#lockdown-lsm">
        
      </a>
    </div>
    <p><a href="https://www.kernel.org/doc/html/latest/admin-guide/LSM/index.html">Linux Security Modules (LSM)</a> is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux kernel. We have covered our usage of another LSM module, BPF-LSM, previously.</p><p>BPF-LSM is a useful foundational piece of our kernel security, but in this post we want to mention another useful LSM module we use: <a href="https://man7.org/linux/man-pages/man7/kernel_lockdown.7.html">the Lockdown LSM</a>. Lockdown can be in one of three states (controlled by the <code>/sys/kernel/security/lockdown</code> special file):</p>
            <pre><code>ignat@dev:~$ cat /sys/kernel/security/lockdown
[none] integrity confidentiality</code></pre>
            <p><code>none</code> is the state where nothing is enforced and the module is effectively disabled. When Lockdown is in the <code>integrity</code> state, the kernel tries to prevent any operation which may compromise its integrity. We already covered some examples of these in this post: loading unsigned modules and executing unsigned code via KEXEC. But there are other potential ways (mentioned in <a href="https://man7.org/linux/man-pages/man7/kernel_lockdown.7.html">the LSM’s man page</a>), all of which this LSM tries to block. <code>confidentiality</code> is the most restrictive mode, where Lockdown will also try to prevent any information leakage from the kernel. In practice this may be too restrictive for server workloads, as it blocks all runtime debugging capabilities, like <code>perf</code> or eBPF.</p><p>Let’s see the Lockdown LSM in action. On a barebones Debian system the initial state is <code>none</code>, meaning nothing is locked down:</p>
            <pre><code>ignat@dev:~$ uname -a
Linux dev 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
ignat@dev:~$ cat /sys/kernel/security/lockdown
[none] integrity confidentiality</code></pre>
            <p>We can switch the system into the <code>integrity</code> mode:</p>
            <pre><code>ignat@dev:~$ echo integrity | sudo tee /sys/kernel/security/lockdown
integrity
ignat@dev:~$ cat /sys/kernel/security/lockdown
none [integrity] confidentiality</code></pre>
            <p>It is worth noting that we can only put the system into a more restrictive state, but not back. That is, once in <code>integrity</code> mode we can only switch to <code>confidentiality</code> mode, but not back to <code>none</code>:</p>
            <pre><code>ignat@dev:~$ echo none | sudo tee /sys/kernel/security/lockdown
none
tee: /sys/kernel/security/lockdown: Operation not permitted</code></pre>
            <p>Now we can see that even on a stock Debian kernel, which, as we discovered above, does not enforce module signatures by default, we can no longer load a potentially malicious unsigned kernel module:</p>
            <pre><code>ignat@dev:~$ sudo insmod mymod/mymod.ko
insmod: ERROR: could not insert module mymod/mymod.ko: Operation not permitted</code></pre>
            <p>And the kernel log will helpfully point out that this is due to Lockdown LSM:</p>
            <pre><code>ignat@dev:~$ sudo dmesg | tail -n 1
[21728.820129] Lockdown: insmod: unsigned module loading is restricted; see man kernel_lockdown.7</code></pre>
            <p>As we can see, the Lockdown LSM helps tighten the security of a kernel that may not otherwise have these enforcement bits enabled, like the stock Debian one.</p><p>If you compile your own kernel, you can go one step further and set <a href="https://elixir.bootlin.com/linux/v6.6.17/source/security/lockdown/Kconfig#L33">the initial state of the Lockdown LSM to be more restrictive than none from the start</a>. This is exactly what we did for the Cloudflare production kernel:</p>
            <pre><code>ignat@dev:~$ grep LOCK_DOWN /boot/config-6.6.17-cloudflare-2024.2.9
# CONFIG_LOCK_DOWN_KERNEL_FORCE_NONE is not set
CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY=y
# CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY is not set</code></pre>
            
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In this post we reviewed some of the useful Linux kernel security configuration options we use at Cloudflare. This is only a small subset: many more are available, and even more are constantly being developed, reviewed, and improved by the Linux kernel community. We hope this post has shed some light on these security features and that, if you haven’t already, you will consider enabling them in your own Linux systems.</p>
    <div>
      <h2>Watch on Cloudflare TV</h2>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div>
  
</div><p>Tune in for more news, announcements and thought-provoking discussions! Don't miss the full <a href="https://cloudflare.tv/shows/security-week">Security Week hub page</a>.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">3ySkqS53T1nhzX61XzJFEG</guid>
            <dc:creator>Ignat Korchagin</dc:creator>
        </item>
        <item>
            <title><![CDATA[connect() - why are you so slow?]]></title>
            <link>https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/</link>
            <pubDate>Thu, 08 Feb 2024 14:00:27 GMT</pubDate>
            <description><![CDATA[ This is our story of what we learned about the connect() implementation for TCP in Linux. Both its strong and weak points. How connect() latency changes under pressure, and how to open connection so that the syscall latency is deterministic and time-bound ]]></description>
            <content:encoded><![CDATA[ <p></p><p>It is no secret that Cloudflare is encouraging companies to deprecate their use of IPv4 addresses and move to IPv6 addresses. We have a couple of articles on the subject from this year:</p><ul><li><p><a href="/amazon-2bn-ipv4-tax-how-avoid-paying/">Amazon’s $2bn IPv4 tax – and how you can avoid paying it</a></p></li><li><p><a href="/ipv6-from-dns-pov/">Using DNS to estimate worldwide state of IPv6 adoption</a></p></li></ul><p>And many more in our <a href="/searchresults#q=IPv6&amp;sort=date%20descending&amp;f:@customer_facing_source=[Blog]&amp;f:@language=[English]">catalog</a>. To help with this, we spent time this last year investigating and implementing infrastructure to reduce our internal and egress use of IPv4 addresses. We prefer to re-allocate our addresses rather than purchase more, due to increasing costs. In this effort we discovered that our cache service is one of our bigger consumers of IPv4 addresses, so before we remove IPv4 addresses from our cache services, we first need to understand how cache works at Cloudflare.</p>
    <div>
      <h2>How does cache work at Cloudflare?</h2>
      <a href="#how-does-cache-work-at-cloudflare">
        
      </a>
    </div>
    <p>Describing the full <a href="https://developers.cloudflare.com/reference-architecture/cdn-reference-architecture/#cloudflare-cdn-architecture-and-design">architecture</a> is out of scope for this article; however, we can provide a basic outline:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70ULgxsqU4zuyWYVrNn6et/8c80079d6dd93083059a875bbf48059d/image1-2.png" />
            
            </figure><ol><li><p>An Internet user makes a request to pull an asset</p></li><li><p>Cloudflare infrastructure routes that request to a handler</p></li><li><p>The handler machine returns the cached asset, or, on a miss</p></li><li><p>The handler machine reaches out to the origin server (owned by a customer) to pull the requested asset</p></li></ol><p>The particularly interesting part is the cache miss case. When a website suddenly becomes very popular, many uncached assets may need to be fetched all at once. Hence, we may make upwards of 50k TCP unicast connections to a single destination.</p><p>That is a lot of connections! We have strategies in place to limit the impact of this, or to avoid the problem altogether. But in the rare cases when it occurs, we balance these connections over two source IPv4 addresses.</p><p>Our goal is to remove the load balancing and prefer one IPv4 address. To do that, we need to understand the performance impact of two IPv4 addresses vs. one.</p>
    <div>
      <h2>TCP connect() performance of two source IPv4 addresses vs one IPv4 address</h2>
      <a href="#tcp-connect-performance-of-two-source-ipv4-addresses-vs-one-ipv4-address">
        
      </a>
    </div>
    <p>We leveraged a tool called <a href="https://github.com/wg/wrk">wrk</a> and modified it to distribute connections over multiple source IP addresses. Then we ran a workload of 70k connections over 48 threads for a period of time.</p><p>During the test we measured the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201">tcp_v4_connect()</a> with the BPF BCC libbpf-tools <a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/funclatency.c">funclatency</a> tool to gather latency metrics as time progressed.</p><p>Note that throughout the rest of this article, all the numbers are specific to a single machine with no production traffic. We are making the assumption that if we can improve a worst-case scenario in an algorithm on a best-case machine, the results can be extrapolated to production. Lock contention was specifically taken out of the equation, but it will have production implications.</p>
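<p>The charts below aggregate the measured latencies into powers-of-ten buckets. As a rough illustration of that aggregation step only (not of the BPF instrumentation that funclatency performs), a hypothetical Python helper:</p>

```python
import math
from collections import Counter

def log10_histogram(latencies_ns):
    """Bucket latencies into powers-of-ten bins, like the charts below."""
    hist = Counter()
    for ns in latencies_ns:
        # A 3,200 ns sample lands in the 1,000 ns bucket, and so on.
        bucket = 10 ** math.floor(math.log10(ns)) if ns > 0 else 0
        hist[bucket] += 1
    return dict(sorted(hist.items()))

# A bimodal sample: most connect() calls complete in microseconds,
# with a second mode several orders of magnitude slower.
samples = [3_000, 5_000, 8_000, 12_000, 40_000_000, 65_000_000]
print(log10_histogram(samples))
# {1000: 3, 10000: 1, 10000000: 2}
```

<p>With this framing, "more connections in the lower buckets" directly translates to faster syscalls.</p>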
    <div>
      <h3>Two IPv4 addresses</h3>
      <a href="#two-ipv4-addresses">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1q7v3WNgI5X3JQg5ua0B8g/5b557ca762a08422badae379233dee76/image6.png" />
            
            </figure><p>The y-axis shows buckets of nanoseconds in powers of ten. The x-axis represents the number of connections made per bucket. Therefore, more connections in the lower powers-of-ten buckets is better.</p><p>We can see that the majority of the connections occur in the fast case, with roughly ~20k in the slow case. We should expect this bimodal distribution to grow over time as wrk continuously closes and establishes connections.</p><p>Now let us look at the performance of one IPv4 address under the same conditions.</p>
    <div>
      <h3>One IPv4 address</h3>
      <a href="#one-ipv4-address">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6kpueuXS3SbBTIig306IDN/b27ab899656fbfc0bf3c885a44fb04a4/image8.png" />
            
            </figure><p>In this case, the bimodal distribution is even more pronounced: more than half of the total connections are in the slow case rather than the fast one! We may conclude that simply switching to one IPv4 address for cache egress is going to introduce significant latency into our connect() syscalls.</p><p>The next logical step is to figure out where this bottleneck is happening.</p>
    <div>
      <h2>Port selection is not what you think it is</h2>
      <a href="#port-selection-is-not-what-you-think-it-is">
        
      </a>
    </div>
    <p>To investigate this, we first took a flame graph of a production machine:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tFwadYDdC5UVK78j4yKsv/64aca09189acba5bf3dab2e043265e0f/image7.png" />
            
            </figure><p>Flame graphs depict the run-time function call stack of a system. The y-axis depicts call-stack depth, and the x-axis depicts each function as a horizontal bar whose width represents the number of times the function was sampled. Check out this in-depth <a href="https://www.brendangregg.com/flamegraphs.html">guide</a> to flame graphs for more details.</p><p>Most of the samples are taken in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a>. We can see that there are also many samples for <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a>, with some lock contention sampled in between. We now have a better picture of a potential bottleneck, but we do not have a consistent test to compare against.</p><p>Wrk introduces a bit more variability than we would like to see. Still focusing on the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201"><code>tcp_v4_connect()</code></a>, we performed another synthetic test with a homegrown benchmark tool to test one IPv4 address. A tool such as <a href="https://github.com/ColinIanKing/stress-ng">stress-ng</a> may also be used, but some modification is necessary to implement the socket option <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a>. More on that socket option later.</p><p>We are now going to ensure a deterministic number of connections, and remove lock contention from the problem. The result is something like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5d6tJum5BBe3jsLRqhXtFN/7952fb3d0a3da761de158fae4f925eb5/Screenshot-2024-02-07-at-15.54.29.png" />
            
            </figure><p>On the y-axis we measured the latency between the start and end of a connect() syscall. The x-axis denotes when connect() was called. Green dots are even-numbered ports, and red dots are odd-numbered ports. The orange line is a linear regression on the data.</p><p>The disparity in average allocation time between even and odd ports provides us with a major clue: connections with odd ports are established significantly more slowly than those with even ports. Further, odd ports are not interleaved with earlier connections. This implies that we exhaust our even ports before attempting the odd ones. The chart also confirms our bimodal distribution.</p>
    <div>
      <h3>__inet_hash_connect()</h3>
      <a href="#__inet_hash_connect">
        
      </a>
    </div>
    <p>At this point we wanted to understand this split a bit better. We know from the flame graph that the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a> holds the algorithm for port selection. For context, this function is responsible for associating the socket with a source port in a late bind. If a port was previously provided with bind(), the algorithm just tests for a unique TCP 4-tuple (src ip, src port, dest ip, dest port) and skips port selection.</p><p>Before we dive in, there is a little bit of setup work that happens first. Linux generates a time-based hash that is used as the basis for the starting port, adds randomization, and then stores that information in an offset variable, which is always set to an even integer.</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1043">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>   offset &amp;= ~1U;
    
other_parity_scan:
    port = low + offset;
    for (i = 0; i &lt; remaining; i += 2, port += 2) {
        if (unlikely(port &gt;= high))
            port -= remaining;

        inet_bind_bucket_for_each(tb, &amp;head-&gt;chain) {
            if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
                if (!check_established(death_row, sk, port, &amp;tw))
                    goto ok;
                goto next_port;
            }
        }
    }

    offset++;
    if ((offset &amp; 1) &amp;&amp; remaining &gt; 1)
        goto other_parity_scan;</code></pre>
            <p>Then in a nutshell: loop through one half of ports in our range (all even or all odd ports) before looping through the other half of ports (all odd or all even ports respectively) for each connection. Specifically, this is a variation of the <a href="https://datatracker.ietf.org/doc/html/rfc6056#section-3.3.4">Double-Hash Port Selection Algorithm</a>. We will ignore the bind bucket functionality since that is not our main concern.</p><p>Depending on your port range, you either start with an even port or an odd port. In our case, our low port, 9024, is even. Then the port is picked by adding the offset to the low port:</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1045">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>port = low + offset;</code></pre>
            <p>If low was odd, we will have an odd starting port because odd + even = odd.</p><p>There is a bit too much going on in the loop to explain in text. I have an example instead:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6uVqtAUR07epRRKqQbHWkp/2a5671b1dd3c68c012e7171b8103a53e/image5.png" />
            
            </figure><p>This example is bound by 8 ports and 8 possible connections. All ports start unused. As a port is used up, it is grayed out. Green boxes represent the next chosen port. All other colors represent open ports. Blue arrows are the even-port iterations of offset, and red are the odd-port iterations. Note that the offset is randomly picked, and once we cross over to the odd range, the offset is incremented by one.</p><p>For each selection of a port, the algorithm then makes a call to the function <code>check_established()</code>, which points to <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a>. This function loops over sockets to verify that the TCP 4-tuple is unique. The takeaway is that this socket list is usually short, but it grows as more unique TCP 4-tuples are introduced to the system, and longer socket lists may eventually slow down port selection. We have a blog post on <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">ephemeral port exhaustion</a> that dives into the socket list and port uniqueness criteria.</p><p>At this point, we can summarize that the odd/even port split is what is causing our performance bottleneck. During the investigation, it was not obvious to me (or maybe even to you) why the offset was initially calculated the way it was, and why the odd/even port split was introduced. After some git archaeology, the decisions become clearer.</p>
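<p>The scan order itself can be simulated in a few lines of Python. This is a simplification that assumes every port is free (so <code>check_established()</code> always succeeds) and ignores bind buckets; it just reproduces the even-pass-then-odd-pass walk over the range:</p>

```python
def scan_order(low, high, offset):
    """Mimic the __inet_hash_connect() walk: one parity class, then the other."""
    remaining = high - low + 1   # size of the port range (assumed even)
    offset &= ~1                 # connect() always starts on an even offset
    order = []
    for _parity in range(2):     # even pass, then odd pass
        port = low + offset
        for _ in range(0, remaining, 2):
            if port > high:      # wrap around within the range
                port -= remaining
            order.append(port)
            port += 2
        offset += 1              # switch to the other parity
    return order

# A range of 8 ports starting at an even low port, with random offset 4:
print(scan_order(9024, 9031, 4))
# [9028, 9030, 9024, 9026, 9029, 9031, 9025, 9027]
```

<p>Every even port in the range is visited before any odd port, which is exactly the behavior behind the bimodal latency chart: once the even half is exhausted, each new connection must walk the entire even half before it starts probing odd ports.</p>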
    <div>
      <h3>Security considerations</h3>
      <a href="#security-considerations">
        
      </a>
    </div>
    <p>Port selection has been shown to be usable for device <a href="https://lwn.net/Articles/910435/">fingerprinting</a> in the past. This led the authors to introduce more randomization into the initial port selection. Previously, ports were picked predictably, based solely on their initial hash and a salt value that does not change often. This helps explain the offset, but does not explain the split.</p>
    <div>
      <h3>Why the even/odd split?</h3>
      <a href="#why-the-even-odd-split">
        
      </a>
    </div>
    <p>Prior to this <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5">patch</a> and that <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1580ab63fc9a03593072cc5656167a75c4f1d173">patch</a>, services with connect()-heavy and bind()-heavy workloads could conflict with each other. To avoid those conflicts, the split was added: an even offset was chosen for the connect() workloads, and an odd offset for the bind() workloads. However, the split only works well for connect() workloads that do not exceed one half of the allotted port range.</p><p>Now we have an explanation for the flame graph and charts. So what can we do about this?</p>
    <div>
      <h2>User space solution (kernel &lt; 6.8)</h2>
      <a href="#user-space-solution-kernel-6-8">
        
      </a>
    </div>
    <p>We have a couple of strategies that would work best for us. Infrastructure and architectural changes are not considered here, due to the significant development effort they would require. Instead, we prefer to tackle the problem where it occurs.</p><h3>Select, test, repeat</h3><p>For the “select, test, repeat” approach, you may have code that ends up looking like this:</p>
            <pre><code>sys = get_ip_local_port_range()
estab = 0
while estab &lt; sys.hi:
    # pick a random candidate and let connect() test it for us
    random_port = random.randint(sys.lo, sys.hi)
    connection = attempt_connect(random_port)
    if connection is None:
        # conflict: rinse and repeat with a new random port
        continue
    estab += 1</code></pre>
            <p>The algorithm simply keeps picking a random port within the system port range and testing whether connect() succeeds. If not, rinse and repeat until range exhaustion.</p><p>This approach works well for up to ~70-80% port range utilization, but may take roughly eight to twelve attempts per connection as we approach exhaustion. The major downside is the extra syscall overhead on each conflict. To reduce this overhead, we can consider another approach that still lets the kernel select the port for us.</p><h3>Select port by random shifting range</h3><p>This approach leverages the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> socket option. And we were able to achieve performance like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Uz8whp12VuvqvKTDnE1u9/4701177d739bdffe2a2399213cf72941/Screenshot-2024-02-07-at-16.00.22.png" />
            
            </figure><p>That is much better! The chart also introduces black dots that represent errored connections. They have a tendency to clump at the very end of our port range as we approach exhaustion, which is not dissimilar to what we may see in “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>The way this solution works is something like:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51
sys = get_local_port_range()
window.lo = 0
window.hi = 1000
range = window.hi - window.lo
offset = randint(sys.lo, sys.hi - range)
window.lo = offset
window.hi = offset + range

sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
packed = pack("@I", window.lo | (window.hi &lt;&lt; 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, packed)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
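<p>One detail that is easy to get wrong is the packed option value: the kernel reads the lower bound from the low 16 bits and the upper bound from the high 16 bits. A minimal, runnable sketch of the encoding (helper names are ours):</p>

```python
import struct

IP_LOCAL_PORT_RANGE = 51  # socket option number from the Linux uapi headers

def pack_port_range(lo, hi):
    """Encode a [lo, hi] port window: low 16 bits hold the lower
    bound, high 16 bits hold the upper bound."""
    assert 0 < lo <= hi <= 0xFFFF
    return struct.pack("@I", lo | (hi << 16))

def unpack_port_range(raw):
    """Decode the packed value back into (lo, hi) for sanity checking."""
    value, = struct.unpack("@I", raw)
    return value & 0xFFFF, value >> 16
```

<p>Passing the result of <code>pack_port_range(window.lo, window.hi)</code> to setsockopt() is equivalent to the pack() call in the snippet above.</p>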
            <p>We first fetch the system's local port range, define a custom port range, and then randomly shift the custom range within the system range. This randomization lets the kernel start port selection at a random odd or even port, and reduces the loop's search space down to the size of the custom window.</p><p>We tested a few different window sizes, and determined that a window of five hundred or one thousand ports works fairly well for our port range:</p>
<table>
<thead>
  <tr>
    <th><span>Window size</span></th>
    <th><span>Errors</span></th>
    <th><span>Total test time</span></th>
    <th><span>Connections/second</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>500</span></td>
    <td><span>868</span></td>
    <td><span>~1.8 seconds</span></td>
    <td><span>~30,139</span></td>
  </tr>
  <tr>
    <td><span>1,000</span></td>
    <td><span>1,129</span></td>
    <td><span>~2 seconds</span></td>
    <td><span>~27,260</span></td>
  </tr>
  <tr>
    <td><span>5,000</span></td>
    <td><span>4,037</span></td>
    <td><span>~6.7 seconds</span></td>
    <td><span>~8,405</span></td>
  </tr>
  <tr>
    <td><span>10,000</span></td>
    <td><span>6,695</span></td>
    <td><span>~17.7 seconds</span></td>
    <td><span>~3,183</span></td>
  </tr>
</tbody>
</table><p>As the window size increases, the error rate increases. That is because a larger window leaves less room for the random offset. A max window size of 56,512 is no different from using the kernel's default behavior. Therefore, a smaller window size works better. But you do not want it to be too small either: a window size of one is no different from “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>In kernels &gt;= 6.8, we can do even better.</p><h2>Kernel solution (kernel &gt;= 6.8)</h2><p>A new <a href="https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=207184853dbd">patch</a> eliminates the need for the window shifting. This solution is available starting with the 6.8 kernel.</p><p>Instead of picking a random window offset for <code>setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE</code>, …), as in the previous solution, we simply pass the full system port range to activate the new behavior. The code may look something like this:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51
sys = get_local_port_range()
sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
packed = pack("@I", sys.lo | (sys.hi &lt;&lt; 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, packed)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>Setting the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> option tells the kernel to use an approach similar to “<a href="#random">select port by random shifting range</a>”: the start offset is randomized to be even or odd, but the search then proceeds incrementally rather than skipping every other port. We end up with results like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ttWStZgNYfwftr71r8Vrt/7c333411ef01b674cc839f27ae4cbbbf/Screenshot-2024-02-07-at-16.04.24.png" />
            
            </figure><p>The performance of this approach is quite comparable to our user space implementation, albeit a little faster. This is due in part to general improvements, and to the fact that, given the full search space of the range, the algorithm can always find a port, so no cycles are wasted on a potentially filled sub-range.</p><p>These results are great for TCP, but what about other protocols?</p>
    <div>
      <h2>Other protocols &amp; connect()</h2>
      <a href="#other-protocols-connect">
        
      </a>
    </div>
    <p>It is worth mentioning at this point that the port selection algorithms are <i>mostly</i> the same for IPv4 &amp; IPv6. Typically, the key differences are how the sockets are compared to determine uniqueness and where the port search happens. We did not compare performance for all protocols, but it is worth mentioning some similarities and differences between TCP and a couple of others.</p>
    <div>
      <h3>DCCP</h3>
      <a href="#dccp">
        
      </a>
    </div>
    <p>The DCCP protocol leverages the same port selection <a href="https://elixir.bootlin.com/linux/v6.6/source/net/dccp/ipv4.c#L115">algorithm</a> as TCP, and therefore benefits from the recent kernel changes. It is also possible that the protocol could benefit from our user space solution, but that is untested; we leave exercising DCCP use-cases to the reader.</p>
    <div>
      <h3>UDP &amp; UDP-Lite</h3>
      <a href="#udp-udp-lite">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a> leverages a different algorithm, found in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L239"><code>udp_lib_get_port()</code></a>. Similar to TCP, the algorithm loops over the whole port range incrementally, but only when the port is not already supplied in the bind() call. The key difference from TCP is that a random number is generated to act as a step variable. Once a first port is identified, the algorithm steps from that port by the random number, relying on uint16_t overflow to eventually wrap back around to the chosen port. If all ports are used, the port is incremented by one and the search repeats. There is no port splitting between even and odd ports.</p><p>The best comparison to the TCP measurements is a UDP setup similar to:</p>
            <pre><code>sk = socket(AF_INET, SOCK_DGRAM)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>And the results should be unsurprising with one IPv4 source address:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UM5d0RBTgqADgVLbbqMlQ/940306c90767ba4b5e3762c6467b71ed/Screenshot-2024-02-07-at-16.06.27.png" />
            
            </figure><p>UDP fundamentally behaves differently from TCP, and there is less work overall per port lookup. The outliers in the chart represent a worst-case scenario, when we hit a fairly bad random number collision and need to loop over most of the ephemeral range to find a port.</p><p>UDP has another problem. Given the socket option <code>SO_REUSEADDR</code>, the port you get back may conflict with another UDP socket. This is in part because the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L141"><code>udp_lib_lport_inuse()</code></a> skips the UDP 2-tuple (src ip, src port) check when that option is set. When this happens, a new socket may overwrite a previous one, so extra care is needed. We wrote more in depth about these cases in a previous <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">blog post</a>.</p>
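<p>The random-step walk can be sketched as a toy Python model. This is our own simplification; the real <code>udp_lib_get_port()</code> also hashes the address and checks reuse flags:</p>

```python
import random

def udp_pick_port(low, high, used):
    """Toy model of UDP's search: start from a pseudo-random port, then
    step by a random odd stride modulo 2**16. An odd stride visits every
    16-bit value before repeating (the uint16_t overflow trick)."""
    first = low + random.randrange(high - low + 1)
    step = random.randrange(1 << 16) | 1   # force an odd stride => full cycle
    port = first
    for _ in range(1 << 16):
        if low <= port <= high and port not in used:
            used.add(port)
            return port
        port = (port + step) & 0xFFFF      # uint16_t wrap-around
    return None                            # every port in [low, high] is taken
```

<p>Because the stride is odd, the walk is guaranteed to reach every port eventually, but a bad stride can force many iterations, which matches the outliers above.</p>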
    <div>
      <h2>In summary</h2>
      <a href="#in-summary">
        
      </a>
    </div>
    <p>Cloudflare can make a lot of unicast egress connections to origin servers with popular uncached assets. To avoid port-resource exhaustion, we balance the load over a couple of IPv4 source addresses during those peak times. Then we asked: “what is the performance impact of one IPv4 source address for our connect()-heavy workloads?”. Port selection is not only difficult to get right, but is also a performance bottleneck. This is evidenced by measuring connect() latency with a flame graph and synthetic workloads. That led us to discovering TCP’s quirky port selection process, which loops over half of your ephemeral ports before the other half on each connect().</p><p>We then proposed three solutions to the problem, short of adding more IP addresses or making other architectural changes: “<a href="#selecttestrepeat">select, test, repeat</a>”, “<a href="#random">select port by random shifting range</a>”, and an <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a> socket option <a href="#kernel">solution</a> in newer kernels. We closed out with honorable mentions of other protocols and their quirks.</p><p>Do not just take our numbers: please explore and measure your own systems. With a better understanding of your workloads, you can make a good decision on which strategy works best for your needs. Even better if you come up with your own strategy!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Protocols]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">1C6z0btasEsz1cmdmoug0m</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to execute an object file: part 4, AArch64 edition]]></title>
            <link>https://blog.cloudflare.com/how-to-execute-an-object-file-part-4/</link>
            <pubDate>Fri, 17 Nov 2023 14:00:35 GMT</pubDate>
            <description><![CDATA[ The initial posts are dedicated to the x86 architecture. Since then, the fleet of our working machines has expanded to include a large and growing number of ARM CPUs. This time we’ll repeat this exercise for the aarch64 architecture. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Ih4pvbZshdUy2ihfodYAU/7800842ad2c0270609b36a8a99d0ceb4/image1-1.png" />
            
            </figure><p>Translating source code written in a high-level programming language into an executable binary typically involves a series of steps, namely compiling and assembling the code into object files, and then linking those object files into the final executable. However, there are certain scenarios where it can be useful to apply an alternate approach that involves executing object files directly, bypassing the linker. For example, we might use it for malware analysis or when part of the code requires an incompatible compiler. We’ll be focusing on the latter scenario: when one of our libraries needed to be compiled differently from the rest of the code. Learning how to execute an object file directly will give you a much better sense of how code is compiled and linked together.</p><p>To demonstrate how this was done, we have previously published a series of posts on executing an object file:</p><ul><li><p><a href="/how-to-execute-an-object-file-part-1/">How to execute an object file: Part 1</a></p></li><li><p><a href="/how-to-execute-an-object-file-part-2/">How to execute an object file: Part 2</a></p></li><li><p><a href="/how-to-execute-an-object-file-part-3/">How to execute an object file: Part 3</a></p></li></ul><p>The initial posts are dedicated to the x86 architecture. Since then the fleet of our working machines has expanded to include a large and growing number of ARM CPUs. This time we’ll repeat this exercise for the aarch64 architecture. You can pause here to read the previous blog posts before proceeding with this one, or read through the brief summary below and reference the earlier posts for more detail. We might reiterate some theory as working with ELF files can be daunting, if it’s not your day-to-day routine. Also, please be mindful that for simplicity, these examples omit bounds and integrity checks. Let the journey begin!</p>
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>In order to obtain an object file or an executable binary from a high-level compiled programming language, the code needs to be processed by three components: compiler, assembler and linker. The compiler generates an assembly listing, which is picked up by the assembler and translated into an object file. If a program contains multiple source files, each goes through these two steps, producing one object file per source file. At the final step the linker unites all object files into one binary, additionally resolving references to the shared libraries (i.e. we don’t implement the <code>printf</code> function each time, rather we take it from a system library). Even though the approach is platform independent, the compiler output varies by platform, as the assembly listing is closely tied to the CPU architecture.</p><p>GCC (GNU Compiler Collection) can run each step (compiler, assembler and linker) separately for us:</p><p>main.c:</p>
            <pre><code>#include &lt;stdio.h&gt;

int main(void)
{
	puts("Hello, world!");
	return 0;
}</code></pre>
            <p>Compiler (output <code>main.s</code> - assembly listing):</p>
            <pre><code>$ gcc -S main.c
$ ls
main.c  main.s</code></pre>
            <p>Assembler (output <code>main.o</code> - an object file):</p>
            <pre><code>$ gcc -c main.s -o main.o
$ ls
main.c  main.o  main.s</code></pre>
            <p>Linker (output <code>main</code> - an executable):</p>
            <pre><code>$ gcc main.o -o main
$ ls
main  main.c  main.o  main.s
$ ./main
Hello, world!</code></pre>
            <p>All the examples assume gcc is running natively on aarch64, or include a cross-compilation flag for those who want to reproduce them without aarch64 hardware.</p><p>We have two ELF files in the output above: <code>main.o</code> and <code>main</code>. Object files are files encoded with the <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF (Executable and Linkable Format)</a> standard. Although <code>main.o</code> is an ELF file, it doesn’t contain all the information needed to be fully executable.</p>
            <pre><code>$ file main.o
main.o: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped

$ file main
main: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically
linked, interpreter /lib/ld-linux-aarch64.so.1,
BuildID[sha1]=d3ecd2f8ac3b2dec11ed4cc424f15b3e1f130dd4, for GNU/Linux 3.7.0, not stripped</code></pre>
            
    <div>
      <h2>The ELF File</h2>
      <a href="#the-elf-file">
        
      </a>
    </div>
    <p>The central idea of this series of blog posts is to understand how to resolve dependencies from object files without directly involving the linker. For illustrative purposes we generated an object file based on some C-code and used it as a library for our main program. Before switching to the code, we need to understand the basics of the ELF structure.</p><p>Each ELF file is made up of one <i>ELF header</i>, followed by file data. The data can include: a <i>program header</i> table, a <i>section header</i> table, and the data which is referred to by the program or section header tables.</p>
    <div>
      <h3>The ELF Header</h3>
      <a href="#the-elf-header">
        
      </a>
    </div>
    <p>The ELF header provides some basic information about the file: what architecture the file is compiled for, the program entry point and the references to other tables.</p><p>The ELF Header:</p>
            <pre><code>$ readelf -h main
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x640
  Start of program headers:          64 (bytes into file)
  Start of section headers:          68576 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         29
  Section header string table index: 28</code></pre>
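<p>The fields readelf prints come straight from the fixed-size header at the start of the file. As a sketch (our own helper, little-endian ELF64 only), the header can be decoded with nothing but struct:</p>

```python
import struct

def read_elf_header(data):
    """Decode the 64-byte ELF64 header from raw file bytes."""
    # e_ident: magic, class (2 = ELF64), endianness (1 = little), ...
    assert data[:4] == b"\x7fELF", "not an ELF file"
    assert data[4] == 2 and data[5] == 1, "expected little-endian ELF64"
    # Remaining 48 bytes: e_type, e_machine, e_version, e_entry, e_phoff,
    # e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize,
    # e_shnum, e_shstrndx
    fields = struct.unpack_from("<HHIQQQIHHHHHH", data, 16)
    keys = ("type", "machine", "version", "entry", "phoff", "shoff",
            "flags", "ehsize", "phentsize", "phnum",
            "shentsize", "shnum", "shstrndx")
    return dict(zip(keys, fields))
```

<p>For the <code>main</code> binary above, this would yield machine 183 (AArch64), entry 0x640, 9 program headers, 29 section headers and string table index 28, matching the readelf output.</p>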
            
    <div>
      <h3>The ELF Program Header</h3>
      <a href="#the-elf-program-header">
        
      </a>
    </div>
    <p>The execution process of almost every program starts from an auxiliary program, called the loader, which arranges the memory and calls the program’s entry point. In the following output the loader is marked with the line <code>“Requesting program interpreter: /lib/ld-linux-aarch64.so.1”</code>. The whole program memory is split into segments with associated sizes, permissions and types (which instruct the loader on how to interpret each block of memory). Because the execution process should be performed in the shortest possible time, <i>sections</i> with the same characteristics that are located nearby are grouped into bigger blocks — <i>segments</i> — and placed in the <i>program header</i>. We can say that the <i>program header</i> summarizes the types of data that appear in the <i>section header</i>.</p><p>The ELF Program Header:</p>
            <pre><code>$ readelf -Wl main

Elf file type is DYN (Position-Independent Executable file)
Entry point 0x640
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x00001b 0x00001b R   0x1
      [Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x00088c 0x00088c R E 0x10000
  LOAD           0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000270 0x000278 RW  0x10000
  DYNAMIC        0x00fdd8 0x000000000001fdd8 0x000000000001fdd8 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x000254 0x0000000000000254 0x0000000000000254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x0007a0 0x00000000000007a0 0x00000000000007a0 0x00003c 0x00003c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000238 0x000238 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   03     .init_array .fini_array .dynamic .got .got.plt .data .bss 
   04     .dynamic 
   05     .note.gnu.build-id .note.ABI-tag 
   06     .eh_frame_hdr 
   07     
   08     .init_array .fini_array .dynamic .got </code></pre>
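<p>The table above can be reproduced by walking <code>e_phnum</code> entries of <code>e_phentsize</code> bytes each, starting at <code>e_phoff</code>. A minimal sketch (our own helper, little-endian ELF64 only):</p>

```python
import struct

PT_LOAD, PT_DYNAMIC, PT_INTERP = 1, 2, 3  # p_type values from the ELF spec

def program_headers(data):
    """Yield each ELF64 program header as a tuple of its eight fields."""
    # e_phoff sits at offset 0x20 of the ELF64 header; e_phentsize and
    # e_phnum at 0x36. Each 56-byte entry holds: p_type, p_flags,
    # p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align.
    e_phoff, = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        yield struct.unpack_from("<IIQQQQQQ", data, e_phoff + i * e_phentsize)
```

<p>A loader maps every PT_LOAD entry at its p_vaddr with the permissions encoded in p_flags; the other entry types are metadata guiding that process.</p>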
            
    <div>
      <h3>The ELF Section Header</h3>
      <a href="#the-elf-section-header">
        
      </a>
    </div>
    <p>In the source code of high-level languages, variables, functions, and constants are mixed together. However, in assembly you might see that the data and instructions are separated into different blocks. The ELF file content is divided in an even more granular way. For example, variables with initial values are placed into different sections than the uninitialized ones. This approach optimizes for space, otherwise the values for uninitialized variables would be filled with zeros. Along with the space efficiency, there are security reasons for stratification — executable instructions can’t have writable permissions, while memory containing variables can't be executable. The section header describes each of these sections.</p><p>The ELF Section Header:</p>
            <pre><code>$ readelf -SW main
There are 29 section headers, starting at offset 0x10be0:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        0000000000000238 000238 00001b 00   A  0   0  1
  [ 2] .note.gnu.build-id NOTE            0000000000000254 000254 000024 00   A  0   0  4
  [ 3] .note.ABI-tag     NOTE            0000000000000278 000278 000020 00   A  0   0  4
  [ 4] .gnu.hash         GNU_HASH        0000000000000298 000298 00001c 00   A  5   0  8
  [ 5] .dynsym           DYNSYM          00000000000002b8 0002b8 0000f0 18   A  6   3  8
  [ 6] .dynstr           STRTAB          00000000000003a8 0003a8 000092 00   A  0   0  1
  [ 7] .gnu.version      VERSYM          000000000000043a 00043a 000014 02   A  5   0  2
  [ 8] .gnu.version_r    VERNEED         0000000000000450 000450 000030 00   A  6   1  8
  [ 9] .rela.dyn         RELA            0000000000000480 000480 0000c0 18   A  5   0  8
  [10] .rela.plt         RELA            0000000000000540 000540 000078 18  AI  5  22  8
  [11] .init             PROGBITS        00000000000005b8 0005b8 000018 00  AX  0   0  4
  [12] .plt              PROGBITS        00000000000005d0 0005d0 000070 00  AX  0   0 16
  [13] .text             PROGBITS        0000000000000640 000640 000134 00  AX  0   0 64
  [14] .fini             PROGBITS        0000000000000774 000774 000014 00  AX  0   0  4
  [15] .rodata           PROGBITS        0000000000000788 000788 000016 00   A  0   0  8
  [16] .eh_frame_hdr     PROGBITS        00000000000007a0 0007a0 00003c 00   A  0   0  4
  [17] .eh_frame         PROGBITS        00000000000007e0 0007e0 0000ac 00   A  0   0  8
  [18] .init_array       INIT_ARRAY      000000000001fdc8 00fdc8 000008 08  WA  0   0  8
  [19] .fini_array       FINI_ARRAY      000000000001fdd0 00fdd0 000008 08  WA  0   0  8
  [20] .dynamic          DYNAMIC         000000000001fdd8 00fdd8 0001e0 10  WA  6   0  8
  [21] .got              PROGBITS        000000000001ffb8 00ffb8 000030 08  WA  0   0  8
  [22] .got.plt          PROGBITS        000000000001ffe8 00ffe8 000040 08  WA  0   0  8
  [23] .data             PROGBITS        0000000000020028 010028 000010 00  WA  0   0  8
  [24] .bss              NOBITS          0000000000020038 010038 000008 00  WA  0   0  1
  [25] .comment          PROGBITS        0000000000000000 010038 00001f 01  MS  0   0  1
  [26] .symtab           SYMTAB          0000000000000000 010058 000858 18     27  66  8
  [27] .strtab           STRTAB          0000000000000000 0108b0 00022c 00      0   0  1
  [28] .shstrtab         STRTAB          0000000000000000 010adc 000103 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), p (processor specific)</code></pre>
            
    <div>
      <h2>Executing example from Part 1 on aarch64</h2>
      <a href="#executing-example-from-part-1-on-aarch64">
        
      </a>
    </div>
    <p>Actually, our <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-03-obj-file/1">initial code</a> from <a href="/how-to-execute-an-object-file-part-1/">Part 1</a> works on aarch64 as is!</p><p>Let’s have a quick summary about what was done in the code:</p><ol><li><p>We need to find the code of two functions (<code>add5</code> and <code>add10</code>) in the <code>.text</code> section of our object file (<code>obj.o</code>)</p></li><li><p>Load the functions in the executable memory</p></li><li><p>Return the memory locations of the functions to the main program</p></li></ol><p>There is one nuance: even though all the sections are in the section header, none of them carries a string name directly. Without the names we can’t identify them. However, having an additional character field for each section in the ELF structure would be space-inefficient — it would have to be limited to some maximum length, and shorter names would leave the space unfilled. Instead, ELF provides an additional section, <code>.shstrtab</code>. This string table concatenates all the names, each terminated by a null byte. Every section header holds an offset into this table, which we can follow to recover that section’s name. But how do we find <code>.shstrtab</code> itself if we don’t have a name? To solve this chicken-and-egg problem, the ELF header provides a direct pointer to <code>.shstrtab</code>: its section index, <code>e_shstrndx</code>. A similar approach is applied to two other sections: <code>.symtab</code>, which contains all information about the symbols, and <code>.strtab</code>, which holds the list of symbol names. In the code we work with these tables to resolve all the dependencies and find our functions.</p>
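<p>To see the chicken-and-egg resolution in code, here is a sketch (our own helper, little-endian ELF64 only) that finds <code>.shstrtab</code> through the header's section-name index and then resolves every section name from it:</p>

```python
import struct

def section_names(data):
    """Return the name of every section in an ELF64 image."""
    # The ELF header stores the section header table offset (e_shoff at
    # 0x28) plus e_shentsize, e_shnum and e_shstrndx (at 0x3A); each
    # section header stores an sh_name offset into the .shstrtab data.
    e_shoff, = struct.unpack_from("<Q", data, 0x28)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", data, 0x3A)

    def shdr(i):  # sh_name at +0x00, sh_offset at +0x18, sh_size at +0x20
        base = e_shoff + i * e_shentsize
        sh_name, = struct.unpack_from("<I", data, base)
        sh_offset, sh_size = struct.unpack_from("<QQ", data, base + 0x18)
        return sh_name, sh_offset, sh_size

    _, off, size = shdr(e_shstrndx)          # bootstrap: locate .shstrtab
    strtab = data[off:off + size]
    names = []
    for i in range(e_shnum):
        sh_name, _, _ = shdr(i)
        names.append(strtab[sh_name:strtab.index(b"\0", sh_name)].decode())
    return names
```

<p>The same two-step lookup (find the string table, then follow per-entry offsets) is exactly what we do for <code>.symtab</code> and <code>.strtab</code>.</p>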
    <div>
      <h2>Executing example from Part 2 on aarch64</h2>
      <a href="#executing-example-from-part-2-on-aarch64">
        
      </a>
    </div>
    <p>At the beginning of the second blog post on <a href="/how-to-execute-an-object-file-part-2/">how to execute an object file</a> we made the function <code>add10</code> depend on <code>add5</code> instead of being self-contained. This is the first time we faced relocations. <i>Relocation</i> is the process of loading symbols defined outside the current scope. Relocated symbols can represent global or thread-local variables, constants, functions, etc. We’ll start by checking which assembly instructions trigger relocations, and uncover how the ELF format handles them in a more general way.</p><p>After making <code>add10</code> depend on <code>add5</code>, our aarch64 version stopped working as well, just like the x86 one. Let’s take a look at the assembly listing:</p>
            <pre><code>$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;add5&gt;:
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	b9000fe0 	str	w0, [sp, #12]
   8:	b9400fe0 	ldr	w0, [sp, #12]
   c:	11001400 	add	w0, w0, #0x5
  10:	910043ff 	add	sp, sp, #0x10
  14:	d65f03c0 	ret

0000000000000018 &lt;add10&gt;:
  18:	a9be7bfd 	stp	x29, x30, [sp, #-32]!
  1c:	910003fd 	mov	x29, sp
  20:	b9001fe0 	str	w0, [sp, #28]
  24:	b9401fe0 	ldr	w0, [sp, #28]
  28:	94000000 	bl	0 &lt;add5&gt;
  2c:	b9001fe0 	str	w0, [sp, #28]
  30:	b9401fe0 	ldr	w0, [sp, #28]
  34:	94000000 	bl	0 &lt;add5&gt;
  38:	a8c27bfd 	ldp	x29, x30, [sp], #32
  3c:	d65f03c0 	ret</code></pre>
            <p>Have you noticed that all the hex values in the second column are exactly the same length, in contrast with the instruction lengths seen for x86 in Part 2 of our series? This is because all Armv8-A instructions are encoded in 32 bits. Since it is impossible to encode every immediate value into less than 32 bits, some operations require more than one instruction, as we’ll see later. For now, we’re interested in one instruction, <code>bl</code> (<a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/BL--Branch-with-Link-?lang=en">branch with link</a>), on rows <code>28</code> and <code>34</code>. The <code>bl</code> is a “jump” instruction, but before the jump it preserves the address of the next instruction in the link register (<code>lr</code>). When the callee finishes execution, the caller address is recovered from <code>lr</code>. Usually, aarch64 instructions reserve the top 6 bits [31:26] for the opcode and some auxiliary fields such as running architecture (32 or 64 bits), condition flags and others. The remaining bits are shared between arguments like the source register, destination register and immediate value. Since the <code>bl</code> instruction does not require a source or destination register, the full 26 bits can be used to encode the immediate offset instead. 26 bits can only encode a small range (+/-32 MB), but because the jump can only target the beginning of an instruction, which must always be aligned to 4 bytes, the effective range of the encoded immediate is increased fourfold, to +/-128 MB.</p><p>Similarly to what we did in <a href="/how-to-execute-an-object-file-part-2/">Part 2</a>, we’re going to resolve our relocations: first by manually calculating the correct addresses, and then by using an approach similar to what the linker does. The current value of our <code>bl</code> instruction is <code>94000000</code>, or in binary representation <code>100101 00000000000000000000000000</code> (the opcode followed by the 26-bit immediate). All 26 immediate bits are zeros, so we don’t jump anywhere. The address is calculated as an offset from the current <i>program counter</i> (<code>pc</code>), which can be positive or negative. In our case we expect it to be <code>-0x28</code> and <code>-0x34</code>. As described above, each should be divided by 4 and taken as <a href="https://en.wikipedia.org/wiki/Two%27s_complement">two's complement</a>: <code>-0x28 / 4 = -0xA == 0xFFFFFFF6</code> and <code>-0x34 / 4 = -0xD == 0xFFFFFFF3</code>. From these values we take the lower 26 bits and concatenate them with the initial 6 opcode bits to get the final instructions: <code>100101 11111111111111111111110110 == 0x97FFFFF6</code> and <code>100101 11111111111111111111110011 == 0x97FFFFF3</code>. Have you noticed that all the distance calculations are done relative to the <code>bl</code> itself (the current <code>pc</code>), not the next instruction as in x86?</p><p>Let’s add to the code and execute:</p>
            <pre><code>... 

static void parse_obj(void)
{
	...
	/* copy the contents of `.text` section from the ELF file */
	memcpy(text_runtime_base, obj.base + text_hdr-&gt;sh_offset, text_hdr-&gt;sh_size);

	*((uint32_t *)(text_runtime_base + 0x28)) = 0x97FFFFF6;
	*((uint32_t *)(text_runtime_base + 0x34)) = 0x97FFFFF3;

	/* make the `.text` copy readonly and executable */
	if (mprotect(text_runtime_base, page_align(text_hdr-&gt;sh_size), PROT_READ | PROT_EXEC)) {
	...</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52</code></pre>
            <p>It works! But this is not how the linker handles relocations. The linker resolves each relocation based on its type and the formula assigned to that type, something we investigated in detail in <a href="/how-to-execute-an-object-file-part-2/">Part 2</a>. Here again we need to find the type and check its formula:</p>
            <pre><code>$ readelf --relocs obj.o

Relocation section '.rela.text' at offset 0x228 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000028  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0
000000000034  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0

Relocation section '.rela.eh_frame' at offset 0x258 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000001c  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 0
000000000034  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 18</code></pre>
            <p>Our Type is R_AARCH64_CALL26 and the <a href="https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#5733relocation-operations">formula</a> for it is:</p><table>
	<tbody>
		<tr>
			<td>
			<p><span><span><span><b>ELF64 Code</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Name</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Operation</b></span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span>283</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>R_&lt;CLS&gt;_CALL26</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>S + A - P</span></span></span></p>
			</td>
		</tr>
	</tbody>
</table><p>where:</p><ul><li><p><code>S</code> (when used on its own) is the address of the symbol</p></li><li><p><code>A</code> is the addend for the relocation</p></li><li><p><code>P</code> is the address of the place being relocated (derived from <code>r_offset</code>)</p></li></ul><p>Here are the relevant changes to loader.c:</p>
            <pre><code>/* Replace `#define R_X86_64_PLT32 4` with our Type */
#define R_AARCH64_CALL26 283
...

static void do_text_relocations(void)
{
	...
	uint32_t val;

	switch (type)
	{
	case R_AARCH64_CALL26: ;
		/* The mask separates opcode (6 bits) and the immediate value */
		uint32_t mask_bl = (0xffffffff &lt;&lt; 26);
		/* S+A-P, divided by 4 */
		val = (symbol_address + relocations[i].r_addend - patch_offset) &gt;&gt; 2;
		/* Concatenate opcode and value to get final instruction */
		*((uint32_t *)patch_offset) &amp;= mask_bl;
		val &amp;= ~mask_bl;
		*((uint32_t *)patch_offset) |= val;
		break;
	}
	...
}</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o loader loader.c 
$ ./loader
Calculated relocation: 0x97fffff6
Calculated relocation: 0x97fffff3
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52</code></pre>
            <p>So far so good. The next challenge is to <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/2/obj.c#L16-L31">add constant data and global variables</a> to our object file and check relocations again:</p>
            <pre><code>$ readelf --relocs --wide obj.o

Relocation section '.rela.text' at offset 0x388 contains 8 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000000  0000000500000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .rodata + 0
0000000000000004  0000000500000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .rodata + 0
000000000000000c  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000010  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000024  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000028  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000068  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
0000000000000074  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
...</code></pre>
            <p>We now have two new relocation types: <code>R_AARCH64_ADD_ABS_LO12_NC</code> and <code>R_AARCH64_ADR_PREL_PG_HI21</code>. Their formulas are:</p><table>
	<tbody>
		<tr>
			<td>
			<p><span><span><span><b>ELF64 Code</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Name</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Operation</b></span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span><span>275</span></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><span>R_&lt;CLS&gt;_ADR_PREL_PG_HI21</span></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><span>Page(S+A) - Page(P)</span></span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span>277</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>R_&lt;CLS&gt;_ADD_ABS_LO12_NC</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>S + A</span></span></span></p>
			</td>
		</tr>
	</tbody>
</table><p>where:</p><p><code>Page(expr)</code> is the page address of the expression <code>expr</code>, defined as <code>(expr &amp; ~0xFFF)</code>. (This applies even if the machine page size supported by the platform has a different value.)</p><p>At first it may be unclear why we need two new types here, while on x86 we had only one. Let’s investigate the assembly code:</p>
            <pre><code>$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;get_hello&gt;:
   0:	90000000 	adrp	x0, 0 &lt;get_hello&gt;
   4:	91000000 	add	x0, x0, #0x0
   8:	d65f03c0 	ret

000000000000000c &lt;get_var&gt;:
   c:	90000000 	adrp	x0, 0 &lt;get_hello&gt;
  10:	91000000 	add	x0, x0, #0x0
  14:	b9400000 	ldr	w0, [x0]
  18:	d65f03c0 	ret

000000000000001c &lt;set_var&gt;:
  1c:	d10043ff 	sub	sp, sp, #0x10
  20:	b9000fe0 	str	w0, [sp, #12]
  24:	90000000 	adrp	x0, 0 &lt;get_hello&gt;
  28:	91000000 	add	x0, x0, #0x0
  2c:	b9400fe1 	ldr	w1, [sp, #12]
  30:	b9000001 	str	w1, [x0]
  34:	d503201f 	nop
  38:	910043ff 	add	sp, sp, #0x10
  3c:	d65f03c0 	ret</code></pre>
            <p>We see that all <code>adrp</code> instructions are followed by <code>add</code> instructions. The <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en"><code>add</code></a> instruction adds an immediate value to the source register and writes the result to the destination register. The source and destination registers can be the same, and the immediate value is 12 bits wide. The <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADRP--Form-PC-relative-address-to-4KB-page-?lang=en"><code>adrp</code></a> instruction generates a <code>pc</code>-relative (program counter) address and writes the result to the destination register. It takes the <code>pc</code> of the instruction itself and adds a 21-bit immediate value shifted left by 12 bits. If the immediate value weren’t shifted, it could only address a range of +/-1 MB, which isn’t enough. The left shift increases the range up to +/-1 GB. However, because the low 12 bits of the address are masked out by the shift, we need to store them somewhere and restore them later. That’s why each <code>adrp</code> is followed by an <code>add</code> instruction, and why we have two relocation types instead of one. Also, it’s a bit tricky to encode <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADRP--Form-PC-relative-address-to-4KB-page-?lang=en"><code>adrp</code></a>: the 2 low bits of the immediate value are placed in bits [30:29] and the remaining 19 bits in bits [23:5]. Due to size limitations, the aarch64 instructions try to make the most out of 32 bits.</p><p>In the code we are going to use the formulas to calculate the values, and the encoding descriptions of the <code>adrp</code> and <code>add</code> instructions to obtain the final opcodes:</p>
            <pre><code>#define R_AARCH64_CALL26 283
#define R_AARCH64_ADD_ABS_LO12_NC 277
#define R_AARCH64_ADR_PREL_PG_HI21 275
...

{
case R_AARCH64_CALL26: ;
	/* The mask separates opcode (6 bits) and the immediate value */
	uint32_t mask_bl = (0xffffffff &lt;&lt; 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) &gt;&gt; 2;
	/* Concatenate opcode and value to get final instruction */
	*((uint32_t *)patch_offset) &amp;= mask_bl;
	val &amp;= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
case R_AARCH64_ADD_ABS_LO12_NC: ;
	/* The mask of the `add` instruction separates the
	* opcode and registers from the 12-bit immediate
	* field in bits [21:10]
	*/
	uint32_t mask_add = 0b11111111110000000000001111111111;
	/* S + A: place the low 12 bits of the address into
	* the immediate field */
	val = ((uint64_t)(symbol_address + relocations[i].r_addend)) &lt;&lt; 10;
	val &amp;= ~mask_add;
	*((uint32_t *)patch_offset) &amp;= mask_add;
	/* Final instruction */
	*((uint32_t *)patch_offset) |= val;
	break;
case R_AARCH64_ADR_PREL_PG_HI21: ;
	/* Page(S+A)-Page(P), Page(expr) is defined as (expr &amp; ~0xFFF).
	* Shift the result right by 12 bits: during decoding `adrp`
	* shifts the immediate left by 12, so we do the opposite.
	*/
	uint64_t pages = ((((uint64_t)(symbol_address + relocations[i].r_addend)) &amp; ~0xFFFUL) - (((uint64_t)patch_offset) &amp; ~0xFFFUL)) &gt;&gt; 12;
	/* Separate the 2 lower and 19 upper bits of the 21-bit
	* immediate to place them in bits [30:29] and [23:5] */
	uint32_t immlo = ((uint32_t)pages &amp; 0x3) &lt;&lt; 29;
	uint32_t immhi = ((uint32_t)(pages &gt;&gt; 2) &amp; 0x7FFFF) &lt;&lt; 5;
	*((uint32_t *)patch_offset) |= immlo;
	*((uint32_t *)patch_offset) |= immhi;
	break;
}</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o loader loader.c 
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42</code></pre>
            <p>It works! The final code is <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-03-obj-file/4/2">here</a>.</p>
    <div>
      <h2>Executing example from Part 3 on aarch64</h2>
      <a href="#executing-example-from-part-3-on-aarch64">
        
      </a>
    </div>
    <p>Our <a href="/how-to-execute-an-object-file-part-3/">Part 3</a> is about resolving external dependencies. When we write code we don’t think much about how to allocate memory or print debug information to the console. Instead, we call functions from the system libraries. But the code of the system libraries needs to be made available to our programs somehow. Additionally, for optimization purposes, it would be nice if this code were stored in one place and shared between all programs. And one more wish: we don’t want to resolve all the functions and global variables from the libraries, only those we need, at the moment we need them. To solve these problems, ELF introduced two sections: the PLT (procedure linkage table) and the GOT (global offset table). The dynamic loader creates a list which contains all external functions and variables from the shared library, but doesn’t resolve them immediately; instead, each external symbol is represented in the PLT section by a small function, a stub, e.g. <code>puts@plt</code>. When an external symbol is requested, the stub checks whether it was resolved previously. If not, the stub looks up the absolute address of the symbol, writes it into the GOT and returns it to the requester. On subsequent requests, the address is returned directly from the GOT.</p><p>In <a href="/how-to-execute-an-object-file-part-3/">Part 3</a> we implemented a simplified PLT/GOT resolution. First, we added a new function, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/obj.c#L35"><code>say_hello</code></a>, to <code>obj.c</code>, which calls the unresolved system library function <code>puts</code>. Then we added an optional wrapper, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L73"><code>my_puts</code></a>, to <code>loader.c</code>. The wrapper isn’t required; we could have resolved directly to the standard function, but it’s a good example of how the implementation of some functions can be overridden with custom code. In the next steps we added our own PLT/GOT resolution:</p><ul><li><p>the PLT section we replaced with a <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L340">jumptable</a></p></li><li><p>the GOT we replaced with <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L238-L248">assembly instructions</a></p></li></ul><p>Basically, we created a small stub with assembly code (our <code>jumptable</code>) to resolve the global address of our <code>my_puts</code> wrapper and jump to it.</p><p>The approach for aarch64 is the same, but the <code>jumptable</code> is very different, as it consists of different assembly instructions.</p><p>The big difference compared to the other parts is that we need to work with a 64-bit address for the GOT resolution. Our custom PLT, or <code>jumptable</code>, is placed close to the main code of <code>obj.c</code> and can operate with relative addresses as before. For the GOT part, i.e. referencing the <code>my_puts</code> wrapper, we’ll use different branch instructions: <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/BR--Branch-to-Register-"><code>br</code></a> or <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/BLR--Branch-with-Link-to-Register-"><code>blr</code></a>. These instructions branch to an address held in a register, and aarch64 registers can hold full 64-bit values.</p><p>We can check how the native PLT/GOT resolution looks in our loader’s assembly code:</p>
            <pre><code>$ objdump --disassemble --section=.text loader
...
1d2c:	97fffb45 	bl	a40 &lt;puts@plt&gt;
1d30:	f94017e0 	ldr	x0, [sp, #40]
1d34:	d63f0000 	blr	x0
...</code></pre>
            <p>The first instruction is a <code>bl</code> jump to the <code>puts@plt</code> stub. The next instruction, <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/LDR--immediate---Load-Register--immediate--"><code>ldr</code></a>, tells us that some value was loaded into the register <code>x0</code> from the stack. Each function has its own <a href="https://en.wikipedia.org/wiki/Call_stack#Stack_and_frame_pointers">stack frame</a> to hold its local variables. The last instruction, <code>blr</code>, jumps to the address stored in the <code>x0</code> register. There is a convention in register naming: if the stored value is 64 bits wide, the register is referred to as <code>x0</code>-<code>x30</code>; if only 32 bits are used, it’s <code>w0</code>-<code>w30</code> (the value is stored in the lower 32 bits and the upper 32 bits are zeroed).</p><p>We need to do something similar: place the absolute address of our <code>my_puts</code> wrapper in some register and call <code>br</code> on this register. We don’t need to store the link before branching - the call will return to <code>say_hello</code> in <code>obj.c</code> - which is why a plain <code>br</code> is enough. Let’s check the assembly of a simple C function:</p><p>hello.c:</p>
            <pre><code>#include &lt;stdint.h&gt;

void say_hello(void)
{
    uint64_t reg = 0x555555550c14;
}</code></pre>
            
            <pre><code>$ gcc -c hello.c
$ objdump --disassemble --section=.text hello.o

hello.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;say_hello&gt;:
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	d2818280 	mov	x0, #0xc14                 	// #3092
   8:	f2aaaaa0 	movk	x0, #0x5555, lsl #16
   c:	f2caaaa0 	movk	x0, #0x5555, lsl #32
  10:	f90007e0 	str	x0, [sp, #8]
  14:	d503201f 	nop
  18:	910043ff 	add	sp, sp, #0x10
  1c:	d65f03c0 	ret</code></pre>
            <p>The number <code>0x555555550c14</code> is the address returned by <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L238"><code>lookup_ext_function</code></a>. We’ve printed it out to use as an example, but any <a href="https://www.kernel.org/doc/html/latest/arch/arm64/memory.html">48-bit</a> value could be used.</p><p>In the output we see that the value was split into three 16-bit chunks and written into the <code>x0</code> register with three instructions: one <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOV--inverted-wide-immediate---Move--inverted-wide-immediate---an-alias-of-MOVN-"><code>mov</code></a> and two <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOVK--Move-wide-with-keep-?lang=en"><code>movk</code></a>. The documentation says that there are only 16 bits for the immediate value, but a shift can be applied (in our case a left shift, <code>lsl</code>).</p><p>However, we can’t use <code>x0</code> in our context. By <a href="https://developer.arm.com/documentation/den0024/a/The-ABI-for-ARM-64-bit-Architecture/Register-use-in-the-AArch64-Procedure-Call-Standard/Parameters-in-general-purpose-registers">convention</a>, the registers <code>x0</code>-<code>x7</code> are used to pass function parameters, so clobbering them here could corrupt the arguments of the call. Let’s use <code>x9</code>, a scratch register, instead.</p><p>We need to modify our loader. First, let’s change the jumptable structure.</p><p>loader.c:</p>
            <pre><code>...
struct ext_jump {
	uint32_t instr[4];
};
...</code></pre>
            <p>As we saw above, we need four instructions: <code>mov</code>, <code>movk</code>, <code>movk</code>, <code>br</code>. We don’t need a stack frame, as we aren’t preserving any local variables; we just want to load the address into the register and branch to it. But we can’t write human-readable assembly such as <code>mov x9, #0xc14</code> into the jumptable: we need the machine binary or hex representation, e.g. <code>d2818289</code>. Let’s write a small assembly file to get it:</p><p>hw.s:</p>
            <pre><code>.global _start

_start: mov     x9, #0xc14 
        movk    x9, #0x5555, lsl #16
        movk    x9, #0x5555, lsl #32
        br      x9</code></pre>
            
            <pre><code>$ as -o hw.o hw.s
$ objdump --disassemble --section=.text hw.o

hw.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;_start&gt;:
   0:	d2818289 	mov	x9, #0xc14                 	// #3092
   4:	f2aaaaa9 	movk	x9, #0x5555, lsl #16
   8:	f2caaaa9 	movk	x9, #0x5555, lsl #32
   c:	d61f0120 	br	x9</code></pre>
            <p>Almost done! But there’s one more thing to consider. Even though the value <code>0x555555550c14</code> is a real <code>my_puts</code> wrapper address, it will be different on each run if <a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR (address space layout randomization)</a> is enabled. We need to patch these instructions to hold whatever value is returned by <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L238"><code>lookup_ext_function</code></a> on each run. We’ll split the obtained value into three 16-bit parts and place them in our <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOV--inverted-wide-immediate---Move--inverted-wide-immediate---an-alias-of-MOVN-"><code>mov</code></a> and <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOVK--Move-wide-with-keep-?lang=en"><code>movk</code></a> instructions according to the documentation, similar to what we did in Part 2.</p>
            <pre><code>if (symbols[symbol_idx].st_shndx == SHN_UNDEF) {
	static int curr_jmp_idx = 0;

	uint64_t addr = lookup_ext_function(strtab +  symbols[symbol_idx].st_name);
	uint32_t mov = 0b11010010100000000000000000001001 | ((addr &lt;&lt; 48) &gt;&gt; 43);
	uint32_t movk1 = 0b11110010101000000000000000001001 | (((addr &gt;&gt; 16) &lt;&lt; 48) &gt;&gt; 43);
	uint32_t movk2 = 0b11110010110000000000000000001001 | (((addr &gt;&gt; 32) &lt;&lt; 48) &gt;&gt; 43);
	jumptable[curr_jmp_idx].instr[0] = mov;         // mov  x9, #0x0c14
	jumptable[curr_jmp_idx].instr[1] = movk1;       // movk x9, #0x5555, lsl #16
	jumptable[curr_jmp_idx].instr[2] = movk2;       // movk x9, #0x5555, lsl #32
	jumptable[curr_jmp_idx].instr[3] = 0xd61f0120;  // br   x9

	symbol_address = (uint8_t *)(&amp;jumptable[curr_jmp_idx].instr[0]);
	curr_jmp_idx++;
} else {
	symbol_address = section_runtime_base(&amp;sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
}
uint32_t val;
switch (type)
{
case R_AARCH64_CALL26: ;
	/* The mask separates opcode (6 bits) and the immediate value */
	uint32_t mask_bl = (0xffffffff &lt;&lt; 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) &gt;&gt; 2;
	/* Concatenate opcode and value to get final instruction */
	*((uint32_t *)patch_offset) &amp;= mask_bl;
	val &amp;= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
...</code></pre>
            <p>In this code we take the address of the first instruction, <code>&amp;jumptable[curr_jmp_idx].instr[0]</code>, and write it into <code>symbol_address</code>. Because the <code>type</code> is still <code>R_AARCH64_CALL26</code>, it is encoded into the <code>bl</code> instruction as a jump to a relative address - in this case, the address of the stub’s first <code>mov</code> instruction. The whole <code>jumptable</code> stub then executes and finishes with the <code>br</code> instruction, which branches to the wrapper.</p><p>The final run:</p>
            <pre><code>$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
Executing say_hello...
my_puts executed
Hello, world!</code></pre>
            <p>The final code is <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-03-obj-file/4/3">here</a>.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>We covered several things in this blog post. We gave a brief introduction to how binaries are executed on Linux and how all the components are linked together. We saw significant differences between x86 and aarch64 assembly. We learned how we can hook into the code and change its behavior. But just as we said in the first blog post of this series, the most important thing is to always think about security first. Processing external input should always be done with great care. Bounds and integrity checks have been omitted to keep the examples short, so readers should be aware that the code is not production ready and is designed for educational purposes only.</p>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6bGK0NoXnHBGOjKvJ60FRu</guid>
            <dc:creator>Oxana Kharitonova</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Linux Crypto API for user applications]]></title>
            <link>https://blog.cloudflare.com/the-linux-crypto-api-for-user-applications/</link>
            <pubDate>Thu, 11 May 2023 13:00:58 GMT</pubDate>
            <description><![CDATA[ If you run your software on Linux, the Linux Kernel itself can satisfy all your cryptographic needs! In this post we will explore Linux Crypto API for user applications and try to understand its pros and cons ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6o7ZLXKVXmuq5yaRC7sdbe/cef8a48e2dead5815f187b829103622d/Screenshot_2024-08-26_at_6.21.48_PM.png" />
            
            </figure><p>In this post we will explore the Linux Crypto API for user applications and try to understand its pros and cons.</p><p>The Linux Kernel Crypto API was introduced in <a href="https://lwn.net/Articles/14197/">October 2002</a>. It was initially designed to satisfy internal needs, mostly for <a href="https://www.cloudflare.com/learning/network-layer/what-is-ipsec/">IPsec</a>. However, in addition to the kernel itself, user space applications can benefit from it.</p><p>If we apply the basic definition of an <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api/">API</a> to our case, we have the kernel on one side and our application on the other. The application needs to send data, i.e. plaintext or ciphertext, and get encrypted/decrypted text in response from the kernel. To communicate with the kernel we need to make a system call. Also, before starting the data exchange, we need to agree on some cryptographic parameters, at least the selected crypto algorithm and key length. These constraints, along with all supported algorithms, can be found in the <code>/proc/crypto</code> virtual file.</p><p>Below is a short excerpt from my <code>/proc/crypto</code>, looking at <code>ctr(aes)</code>. In the examples we will use the AES cipher in CTR mode; we will give more details about the algorithm itself later.</p>
            <pre><code>name         : ctr(aes)
driver       : ctr(aes-generic)
module       : ctr
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16


name         : ctr(aes)
driver       : ctr(aes-aesni)
module       : ctr
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16


name         : ctr(aes)
driver       : ctr-aes-aesni
module       : aesni_intel
priority     : 400
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : yes
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16</code></pre>
            <p>In the output above, there are three config blocks. The kernel may provide several implementations of the same algorithm depending on the CPU architecture, available hardware, presence of crypto accelerators etc.</p><p>We can pick the implementation based on the algorithm name or the driver name. The algorithm name is not unique, but the driver name is. If we use the algorithm name, the driver with the highest priority will be chosen for us, which in theory should provide the best cryptographic performance in this context. Let’s see the performance of different implementations of AES-CTR encryption. I use the <a href="https://github.com/smuellerDD/libkcapi">libkcapi library</a>: it’s a lightweight wrapper for the kernel crypto API which also provides built-in speed tests. We will examine <a href="https://github.com/smuellerDD/libkcapi/blob/master/speed-test/cryptoperf-skcipher.c#L228-L238">these tests</a>.</p>
            <pre><code>$ kcapi-speed -c "AES(G) CTR(G) 128" -b 1024 -t 10
AES(G) CTR(G) 128   	|d|	1024 bytes|          	149.80 MB/s|153361 ops/s
AES(G) CTR(G) 128   	|e|	1024 bytes|          	159.76 MB/s|163567 ops/s
 
$ kcapi-speed -c "AES(AESNI) CTR(ASM) 128" -b 1024 -t 10
AES(AESNI) CTR(ASM) 128 |d|	1024 bytes|          	343.10 MB/s|351332 ops/s
AES(AESNI) CTR(ASM) 128 |e|	1024 bytes|         	310.100 MB/s|318425 ops/s
 
$ kcapi-speed -c "AES(AESNI) CTR(G) 128" -b 1024 -t 10
AES(AESNI) CTR(G) 128   |d|	1024 bytes|          	155.37 MB/s|159088 ops/s
AES(AESNI) CTR(G) 128   |e|	1024 bytes|          	172.94 MB/s|177054 ops/s</code></pre>
            <p>Here and below, ignore the absolute numbers, as they depend on the environment where the tests were run; look instead at the relative differences.</p><p>The <a href="https://en.wikipedia.org/wiki/AES_instruction_set">x86 AES instructions</a> showed the best results, roughly twice as fast as the generic portable C implementation. As expected, this implementation has the highest priority in <code>/proc/crypto</code>. We will use only this one later.</p><p>This brief introduction can be rephrased as: “I can ask the kernel to encrypt or decrypt data from my application”. But why do I need it?</p>
    <div>
      <h2>Why do I need it?</h2>
      <a href="#why-do-i-need-it">
        
      </a>
    </div>
    <p>In our previous blog post, <a href="/the-linux-kernel-key-retention-service-and-why-you-should-use-it-in-your-next-application/">Linux Kernel Key Retention Service</a>, we talked a lot about cryptographic key protection. We concluded that the best Linux option is to store <a href="https://www.cloudflare.com/learning/ssl/what-is-a-cryptographic-key/">cryptographic keys</a> in the kernel space and restrict access to a limited number of applications. However, if all our cryptography is processed in user space, potentially compromised code still has access to the raw key material. We have to think carefully about how the key is used: which parts of the code have access to it, making sure it is never logged accidentally, how open-source libraries manage it, and whether memory is purged after use. We may need to maintain a dedicated process just to keep the key out of network-facing code. Thus, many things need to be done for security, and for every application that works with cryptography. And even after all these precautionary measures, the best of the best are subject to bugs and vulnerabilities: <a href="https://en.wikipedia.org/wiki/OpenSSL">OpenSSL</a>, the best known and most widely used cryptographic library in user space, <a href="/cloudflare-is-not-affected-by-the-openssl-vulnerabilities-cve-2022-3602-and-cve-2022-37/">has had its share of security problems</a>.</p><p>Can we move all the cryptography to the kernel and help solve these problems? It looks like it! Our <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7984ceb134bf31aa9a597f10ed52d831d5aede14">recent patch</a> upstream extended the key types which can be used for symmetric encryption in the Crypto API directly from the Linux Kernel Key Retention Service.</p><p>But nothing is free. There will be some overhead for the system calls and for copying data between user and kernel space. So, the next question is how fast it is.</p>
    <div>
      <h2>Is it fast?</h2>
      <a href="#is-it-fast">
        
      </a>
    </div>
    <p>To answer this question we need a baseline to compare against. OpenSSL is the natural choice, as it is used all around the Internet. OpenSSL provides a good set of tools, including C functions, a console utility and various speed tests. For a fair comparison, we will ignore the built-in tests and write our own tests using the OpenSSL C functions. We want the same data to be processed and the same logical parts to be measured in both cases (Kernel versus OpenSSL).</p><p>So, the task: write a benchmark for AES-CTR-128 encrypting data split into chunks, with implementations for both the Kernel Crypto API and OpenSSL.</p>
    <div>
      <h3>About AES-CTR-128</h3>
      <a href="#about-aes-ctr-128">
        
      </a>
    </div>
    <p>AES stands for <a href="https://en.wikipedia.org/wiki/Advanced_Encryption_Standard">Advanced Encryption Standard</a>. It is a block cipher algorithm: the plaintext is split into blocks, and two kinds of operations are applied, substitution and permutation. A block cipher is characterized by two parameters: the block size and the key size. AES processes blocks of 128 bits using a key of 128, 192 or 256 bits. Each 128-bit (16-byte) block is represented as a 4x4 two-dimensional array (matrix), where each element of the matrix holds one byte of the plaintext. To turn the plaintext into ciphertext, several rounds of transformation are applied: the bits of the block are XORed with a key derived from the main key, and substitution and permutation are applied to the rows and columns of the matrix. There are 10, 12 or 14 rounds depending on the key size (the key size determines how many round keys can be derived from it).</p><p>AES is a secure cipher, but there is one nuance: the same plaintext block always encrypts to the same ciphertext block. Look at <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Electronic_codebook_(ECB)">Linux’s mascot Tux</a> - that image was encrypted in ECB mode, which applies no transformation between blocks at all. To avoid this, a <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation">mode of operation</a> (or just mode) has to be applied. It determines how the input is varied so that identical input blocks do not produce identical output. Another mode example is CBC, where the ciphertext of the previously encrypted block is mixed into the next block; for the first block an initialization value (IV) is used. This mode guarantees that for the same input and a different IV the output will be different. However, CBC is slow, as each block depends on the previous one, so encryption cannot be parallelized. CTR is a counter mode: instead of using previously encrypted blocks, it encrypts a counter combined with a nonce, which also makes it trivially parallelizable. The counter is an integer incremented for each block. The nonce is just a random number, similar to the IV. The nonce and IV should be different for each message and can be transmitted openly alongside the encrypted text. So, the name AES-CTR-128 means AES used in CTR mode with a key size of 128 bits.</p>
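    <p>To make the CTR construction concrete, here is a toy sketch in Python. The block cipher is a hash-based placeholder (NOT real AES, and not secure); the point is the mode itself: the keystream comes from encrypting nonce-plus-counter blocks, and the ciphertext is just the plaintext XORed with that keystream, so decryption is the very same operation.</p>

```python
import hashlib

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    # Stand-in for the AES block operation: any pseudorandom function
    # works to illustrate the *mode*. This is NOT real AES.
    return hashlib.sha256(key + block).digest()[:16]

def ctr_keystream(key: bytes, nonce: bytes, nblocks: int):
    # CTR mode: encrypt (nonce || counter) for counter = 0, 1, 2, ...
    for counter in range(nblocks):
        counter_block = nonce + counter.to_bytes(8, "big")
        yield toy_block_encrypt(key, counter_block)

def ctr_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    nblocks = (len(plaintext) + 15) // 16
    keystream = b"".join(ctr_keystream(key, nonce, nblocks))
    # Ciphertext is plaintext XORed with the keystream.
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

key, nonce = b"0" * 16, b"8-byte-n"
msg = b"identical blocks" * 2      # two identical 16-byte blocks
ct = ctr_encrypt(key, nonce, msg)
# Unlike ECB, identical plaintext blocks give different ciphertext blocks,
# because each block uses a different counter value.
assert ct[:16] != ct[16:32]
# Decryption is the same XOR with the same keystream.
assert ctr_encrypt(key, nonce, ct) == msg
```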
    <div>
      <h3>Implementing AES-CTR-128 with the Kernel Crypto API</h3>
      <a href="#implementing-aes-ctr-128-with-the-kernel-crypto-api">
        
      </a>
    </div>
    <p>The kernel and user spaces are isolated for security reasons, and every time data crosses between them it is copied. In our case that would add significant overhead: copying large chunks of plaintext or ciphertext to the kernel and back. However, the Crypto API supports a zero-copy interface. Instead of transferring the actual data, a file descriptor is passed. But it has a limitation: the maximum size is only <a href="https://www.kernel.org/doc/html/latest/crypto/userspace-if.html#zero-copy-interface">16 pages</a>. So for our tests we picked a number close to that limit: 63KB (16 pages of 4KB, minus 1KB to avoid any potential edge cases).</p><p>The code below follows what is written in the <a href="https://www.kernel.org/doc/html/latest/crypto/userspace-if.html">kernel documentation</a>. First we create a socket of the AF_ALG type. The <code>salg_type</code> and <code>salg_name</code> parameters can be taken from the <code>/proc/crypto</code> file. Instead of a generic name we used the driver name <code>ctr-aes-aesni</code>. We could have passed just the name <code>ctr(aes)</code>, and the kernel would have picked the highest-priority driver (<code>ctr-aes-aesni</code> in our case) for us. Next we set the key and accept the socket. The IV is provided ahead of the payload as ancillary data. Constraints on the key and IV sizes can be found in <code>/proc/crypto</code> too.</p><p>Now we are ready to start communicating. We excluded all the setup steps from the measurements. In a loop we send plaintext for encryption with the <code>SPLICE_F_MORE</code> flag to inform the kernel that more data will be provided, and, still in the loop, we <code>read</code> the ciphertext back from the kernel. The last chunk of plaintext is sent without the flag, signaling that we are done and the kernel can finalize the encryption.</p><p>For brevity, error handling is omitted in both examples.</p><p>kernel.c</p>
            <pre><code>#define _GNU_SOURCE

#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;stdio.h&gt;

#include &lt;unistd.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;time.h&gt;
#include &lt;sys/random.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;linux/if_alg.h&gt;

#define PT_LEN (63 * 1024)
#define CT_LEN PT_LEN
#define IV_LEN 16
#define KEY_LEN 16
#define ITER_COUNT 100000

static uint8_t pt[PT_LEN];
static uint8_t ct[CT_LEN];
static uint8_t key[KEY_LEN];
static uint8_t iv[IV_LEN];

static void time_diff(struct timespec *res, const struct timespec *start, const struct timespec *end)
{
    res-&gt;tv_sec = end-&gt;tv_sec - start-&gt;tv_sec;
    res-&gt;tv_nsec = end-&gt;tv_nsec - start-&gt;tv_nsec;
    if (res-&gt;tv_nsec &lt; 0) {
        res-&gt;tv_sec--;
        res-&gt;tv_nsec += 1000000000;
    }
}

int main(void)
{
    // Fill the test data
    getrandom(key, sizeof(key), GRND_NONBLOCK);
    getrandom(iv, sizeof(iv), GRND_NONBLOCK);
    getrandom(pt, sizeof(pt), GRND_NONBLOCK);

    // Set up AF_ALG socket
    int alg_s, aes_ctr;
    struct sockaddr_alg sa = { .salg_family = AF_ALG };
    strcpy(sa.salg_type, "skcipher");
    strcpy(sa.salg_name, "ctr-aes-aesni");

    alg_s = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(alg_s, (const struct sockaddr *)&amp;sa, sizeof(sa));
    setsockopt(alg_s, SOL_ALG, ALG_SET_KEY, key, KEY_LEN);
    aes_ctr = accept(alg_s, NULL, NULL);
    close(alg_s);

    // Set up IV
    uint8_t cmsg_buf[CMSG_SPACE(sizeof(uint32_t)) + CMSG_SPACE(sizeof(struct af_alg_iv) + IV_LEN)] = {0};
    struct msghdr msg = {
	.msg_control = cmsg_buf,
	.msg_controllen = sizeof(cmsg_buf)
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&amp;msg);
    cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(uint32_t));
    cmsg-&gt;cmsg_level = SOL_ALG;
    cmsg-&gt;cmsg_type = ALG_SET_OP;
    *((uint32_t *)CMSG_DATA(cmsg)) = ALG_OP_ENCRYPT;
    
    cmsg = CMSG_NXTHDR(&amp;msg, cmsg);
    cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv) + IV_LEN);
    cmsg-&gt;cmsg_level = SOL_ALG;
    cmsg-&gt;cmsg_type = ALG_SET_IV;
    ((struct af_alg_iv *)CMSG_DATA(cmsg))-&gt;ivlen = IV_LEN;
    memcpy(((struct af_alg_iv *)CMSG_DATA(cmsg))-&gt;iv, iv, IV_LEN);
    sendmsg(aes_ctr, &amp;msg, 0);

    // Set up pipes for using zero-copying interface
    int pipes[2];
    pipe(pipes);

    struct iovec pt_iov = {
        .iov_base = pt,
        .iov_len = sizeof(pt)
    };

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &amp;start);
    
    int i;
    for (i = 0; i &lt; ITER_COUNT; i++) {
        vmsplice(pipes[1], &amp;pt_iov, 1, SPLICE_F_GIFT);
        // SPLICE_F_MORE means more data will be coming
        splice(pipes[0], NULL, aes_ctr, NULL, sizeof(pt), SPLICE_F_MORE);
        read(aes_ctr, ct, sizeof(ct));
    }
    vmsplice(pipes[1], &amp;pt_iov, 1, SPLICE_F_GIFT);
    // A final call without SPLICE_F_MORE
    splice(pipes[0], NULL, aes_ctr, NULL, sizeof(pt), 0);
    read(aes_ctr, ct, sizeof(ct));
    
    clock_gettime(CLOCK_MONOTONIC, &amp;end);

    close(pipes[0]);
    close(pipes[1]);
    close(aes_ctr);

    struct timespec diff;
    time_diff(&amp;diff, &amp;start, &amp;end);
    double tput_krn = ((double)ITER_COUNT * PT_LEN) / (diff.tv_sec + (diff.tv_nsec * 0.000000001 ));
    printf("Kernel: %.02f Mb/s\n", tput_krn / (1024 * 1024));
    
    return 0;
}</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o kernel kernel.c
$ ./kernel
Kernel: 2112.49 Mb/s</code></pre>
            
    <div>
      <h3>Implementing AES-CTR-128 with OpenSSL</h3>
      <a href="#implementing-aes-ctr-128-with-openssl">
        
      </a>
    </div>
    <p>With OpenSSL everything is straightforward: we simply repeated an example from the <a href="https://wiki.openssl.org/index.php/EVP_Symmetric_Encryption_and_Decryption#Encrypting_the_message">official documentation</a>.</p><p>openssl.c</p>
            <pre><code>#include &lt;time.h&gt;
#include &lt;sys/random.h&gt;
#include &lt;openssl/evp.h&gt;

#define PT_LEN (63 * 1024)
#define CT_LEN PT_LEN
#define IV_LEN 16
#define KEY_LEN 16
#define ITER_COUNT 100000

static uint8_t pt[PT_LEN];
static uint8_t ct[CT_LEN];
static uint8_t key[KEY_LEN];
static uint8_t iv[IV_LEN];

static void time_diff(struct timespec *res, const struct timespec *start, const struct timespec *end)
{
    res-&gt;tv_sec = end-&gt;tv_sec - start-&gt;tv_sec;
    res-&gt;tv_nsec = end-&gt;tv_nsec - start-&gt;tv_nsec;
    if (res-&gt;tv_nsec &lt; 0) {
        res-&gt;tv_sec--;
        res-&gt;tv_nsec += 1000000000;
    }
}

int main(void)
{
    // Fill the test data
    getrandom(key, sizeof(key), GRND_NONBLOCK);
    getrandom(iv, sizeof(iv), GRND_NONBLOCK);
    getrandom(pt, sizeof(pt), GRND_NONBLOCK);

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv);

    int outl = sizeof(ct);
    
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &amp;start);

    int i;
    for (i = 0; i &lt; ITER_COUNT; i++) {
        EVP_EncryptUpdate(ctx, ct, &amp;outl, pt, sizeof(pt));
    }
    uint8_t *ct_final = ct + outl;
    outl = sizeof(ct) - outl;
    EVP_EncryptFinal_ex(ctx, ct_final, &amp;outl);

    clock_gettime(CLOCK_MONOTONIC, &amp;end);

    EVP_CIPHER_CTX_free(ctx);

    struct timespec diff;
    time_diff(&amp;diff, &amp;start, &amp;end);
    double tput_ossl = ((double)ITER_COUNT * PT_LEN) / (diff.tv_sec + (diff.tv_nsec * 0.000000001 ));
    printf("OpenSSL: %.02f Mb/s\n", tput_ossl / (1024 * 1024));

    return 0;
}</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o openssl openssl.c -lcrypto
$ ./openssl
OpenSSL: 3758.60 Mb/s</code></pre>
            
    <div>
      <h3>Results of OpenSSL vs Crypto API</h3>
      <a href="#results-of-openssl-vs-crypto-api">
        
      </a>
    </div>
    
            <pre><code>OpenSSL: 3758.60 Mb/s
Kernel: 2112.49 Mb/s</code></pre>
            <p>Don’t pay attention to the absolute values; look at the relative difference.</p><p>The numbers look pessimistic. But why? Can't the kernel implement AES-CTR as efficiently as OpenSSL? We used <a href="https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md">bpftrace</a> to understand this better. The encryption function is called during the <code>read()</code> system call. To get as close to the encryption code as possible, we put a probe on the <a href="https://elixir.bootlin.com/linux/v5.15.90/source/arch/x86/crypto/aesni-intel_glue.c#L1027">ctr_crypt function</a> instead of the whole <code>read</code> call.</p>
            <pre><code>$ sudo bpftrace -e 'kprobe:ctr_crypt { @start=nsecs; @count+=1; } kretprobe:ctr_crypt /@start!=0/ { @total+=nsecs-@start; }'</code></pre>
            <p>We took the same plaintext, encrypted it in chunks of 63KB and measured how much time it took for both cases to encrypt it with <code>bpftrace</code> attached to the kernel:</p>
            <pre><code>OpenSSL: 1 sec 650532178 nsec
Kernel: 3 sec 120442931 nsec // 3120442931 ns
OpenSSL: 3727.49 Mb/s
Kernel: 1971.63 Mb/s

@total: 2031169756     //  2031169756 / 3120442931 = 0.6509235390339526</code></pre>
            <p>The <code>@total</code> number is output from bpftrace and tells us how much time the kernel spent in the encryption function. To compare pure kernel encryption against OpenSSL, we need to estimate how many Mb/s the kernel would have achieved if only encryption had been involved (excluding all system calls and data copying/manipulation). A little math:</p><ol><li><p>The ratio of the time the kernel spent in encryption to the total time is <code>2031169756 / 3120442931 = 0.6509235390339526</code>, or about 65%.</p></li><li><p>So the encryption-only throughput would be <code>1971.63 / 0.6509235390339526</code> = 3028.97 Mb/s. Comparing this to the OpenSSL number we get <code>3028.97 / 3727.49</code>, so around 81%.</p></li></ol><p>It is fair to note that <code>bpftrace</code> adds some overhead of its own, so our kernel numbers are lower than they could be. We can therefore safely say that while the Kernel Crypto API as a whole is roughly half as fast as OpenSSL, the crypto part itself is almost equal.</p>
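            <p>The arithmetic above can be double-checked in a few lines (the figures are taken straight from the bpftrace run):</p>

```python
# Figures from the bpftrace measurement above.
total_ns = 3120442931     # wall time of the kernel benchmark, ns
crypto_ns = 2031169756    # time spent inside ctr_crypt (@total), ns
kernel_tput = 1971.63     # measured kernel throughput, Mb/s
openssl_tput = 3727.49    # measured OpenSSL throughput, Mb/s

# Share of the benchmark spent doing actual encryption (~65%).
crypto_share = crypto_ns / total_ns
print(f"crypto share: {crypto_share:.2%}")

# Hypothetical throughput if only encryption were involved (~3028.97 Mb/s).
crypto_only_tput = kernel_tput / crypto_share
print(f"crypto-only throughput: {crypto_only_tput:.2f} Mb/s")

# How that compares to OpenSSL (~81%).
print(f"vs OpenSSL: {crypto_only_tput / openssl_tput:.2%}")
```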
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In this post we reviewed the Linux Kernel Crypto API and its user space interface. We reiterated the security benefits of doing encryption in the kernel rather than in a user space cryptographic library, and we measured the performance overhead of encrypting and decrypting data through the Kernel Crypto API. We confirmed that the in-kernel crypto itself is about as fast as OpenSSL's, but a better user space interface is needed to make the Kernel Crypto API as fast overall as a cryptographic library. Whether to use the Crypto API is a judgment call that depends on your circumstances: it is a trade-off between speed and security.</p>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <guid isPermaLink="false">TZcwKRTxODYQJEEXgDegx</guid>
            <dc:creator>Oxana Kharitonova</dc:creator>
        </item>
        <item>
            <title><![CDATA[The quantum state of a TCP port]]></title>
            <link>https://blog.cloudflare.com/the-quantum-state-of-a-tcp-port/</link>
            <pubDate>Mon, 20 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ If I navigate to https://blog.cloudflare.com/, my browser will connect to a remote TCP address from the local IP address assigned to my machine, and a randomly chosen local TCP port. What happens if I then decide to head to another site? ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Have you noticed how simple questions sometimes lead to complex answers? Today we will tackle one such question. Category: our favorite - Linux networking.</p>
    <div>
      <h2>When can two TCP sockets share a local address?</h2>
      <a href="#when-can-two-tcp-sockets-share-a-local-address">
        
      </a>
    </div>
    <p>If I navigate to <a href="/">https://blog.cloudflare.com/</a>, my browser will connect to a remote TCP address - say, 104.16.132.229:443 in this case - from the local IP address assigned to my Linux machine and a randomly chosen local TCP port, say 192.0.2.42:54321. What happens if I then decide to head to a different site? Is it possible to establish another TCP connection from the same local IP address and port?</p><p>To find the answer, let's do a bit of <a href="https://en.wikipedia.org/wiki/Discovery_learning">learning by discovery</a>. We have prepared eight quiz questions. Each will let you discover one aspect of the rules that govern local address sharing between TCP sockets under Linux. Fair warning: it might get a bit mind-boggling.</p><p>Questions are split into two groups by test scenario:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Fu0occJtMgjz3Rd7xGJCo/9bf63f71e754ccbcf47c1fdd801d8f8f/image4-15.png" />
            
            </figure><p>In the first test scenario, two sockets connect from the same local port to the same remote IP and port. However, the local IP is different for each socket.</p><p>In the second scenario, the local IP and port are the same for all sockets, but the remote address - actually just the remote IP - differs.</p><p>In our quiz questions, we will either:</p><ol><li><p>let the OS automatically select the local IP and/or port for the socket, or</p></li><li><p>explicitly assign the local address with <a href="https://man7.org/linux/man-pages/man2/bind.2.html"><code>bind()</code></a> before <a href="https://man7.org/linux/man-pages/man2/connect.2.html"><code>connect()</code></a>’ing the socket; a method also known as <a href="https://idea.popcount.org/2014-04-03-bind-before-connect/">bind-before-connect</a>.</p></li></ol><p>Because we will be examining corner cases in the bind() logic, we need a way to exhaust available local addresses, that is, (IP, port) pairs. We could just create lots of sockets, but it will be easier to <a href="https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html?#ip-variables">tweak the system configuration</a> and pretend that there is just one ephemeral local port which the OS can assign to sockets:</p><p><code>sysctl -w net.ipv4.ip_local_port_range='60000 60000'</code></p><p>Each quiz question is a short Python snippet. Your task is to predict the outcome of running the code. Does it succeed? Does it fail? If so, what fails? Asking ChatGPT is not allowed!</p><p>There is always a common setup procedure to keep in mind. We will omit it from the quiz snippets to keep them short:</p>
            <pre><code>from os import system
from socket import *

# Missing constants
IP_BIND_ADDRESS_NO_PORT = 24

# Our network namespace has just *one* ephemeral port
system("sysctl -w net.ipv4.ip_local_port_range='60000 60000'")

# Open a listening socket at *:1234. We will connect to it.
ln = socket(AF_INET, SOCK_STREAM)
ln.bind(("", 1234))
ln.listen(SOMAXCONN)</code></pre>
            <p>With the formalities out of the way, let us begin. Ready. Set. Go!</p>
    <div>
      <h3>Scenario #1: When the local IP is unique, but the local port is the same</h3>
      <a href="#scenario-1-when-the-local-ip-is-unique-but-the-local-port-is-the-same">
        
      </a>
    </div>
    <p>In Scenario #1 we connect two sockets to the same remote address - 127.9.9.9:1234. The sockets will use different local IP addresses, but is it enough to share the local port?</p>
<table>
<thead>
  <tr>
    <th><span>local IP</span></th>
    <th><span>local port</span></th>
    <th><span>remote IP</span></th>
    <th><span>remote port</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>unique</span></td>
    <td><span>same</span></td>
    <td><span>same</span></td>
    <td><span>same</span></td>
  </tr>
  <tr>
    <td><span>127.0.0.1<br />127.1.1.1<br />127.2.2.2</span></td>
    <td><span>60_000</span></td>
    <td><span>127.9.9.9</span></td>
    <td><span>1234</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Quiz #1</h3>
      <a href="#quiz-1">
        
      </a>
    </div>
    <p>On the local side, we bind two sockets to distinct, explicitly specified IP addresses. We will allow the OS to select the local port. Remember: our local ephemeral port range contains just one port (60,000).</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.1.1.1', 0))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.2.2.2', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_1.py">Answer #1</a></p>
    <div>
      <h3>Quiz #2</h3>
      <a href="#quiz-2">
        
      </a>
    </div>
    <p>Here, the setup is almost identical to the previous one. However, we ask the OS to select the local IP address and port for the first socket. Do you think the result will differ from the previous question?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.2.2.2', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_2.py">Answer #2</a></p>
    <div>
      <h3>Quiz #3</h3>
      <a href="#quiz-3">
        
      </a>
    </div>
    <p>This quiz question is just like the one above, only with the ordering changed. First, we connect a socket from an explicitly specified local address. Then we ask the system to select a local address for us. Obviously, such an ordering change should not make any difference, right?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.1.1.1', 0))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_3.py">Answer #3</a></p>
    <div>
      <h3>Scenario #2: When the local IP and port are the same, but the remote IP differs</h3>
      <a href="#scenario-2-when-the-local-ip-and-port-are-the-same-but-the-remote-ip-differs">
        
      </a>
    </div>
    <p>In Scenario #2 we reverse our setup. Instead of multiple local IPs and one remote address, we now have one local address <code>127.0.0.1:60000</code> and two distinct remote addresses. The question remains the same: can two sockets share the local port? Reminder: the ephemeral port range is still of size one.</p>
<table>
<thead>
  <tr>
    <th><span>local IP</span></th>
    <th><span>local port</span></th>
    <th><span>remote IP</span></th>
    <th><span>remote port</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>same</span></td>
    <td><span>same</span></td>
    <td><span>unique</span></td>
    <td><span>same</span></td>
  </tr>
  <tr>
    <td><span>127.0.0.1</span></td>
    <td><span>60_000</span></td>
    <td><span>127.8.8.8<br />127.9.9.9</span></td>
    <td><span>1234</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Quiz #4</h3>
      <a href="#quiz-4">
        
      </a>
    </div>
    <p>Let’s start from the basics. We <code>connect()</code> to two distinct remote addresses. This is a warm-up!</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_4.py">Answer #4</a></p>
    <div>
      <h3>Quiz #5</h3>
      <a href="#quiz-5">
        
      </a>
    </div>
    <p>What if we <code>bind()</code> to a local IP explicitly but let the OS select the port - does anything change?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 0))
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.0.0.1', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_5.py">Answer #5</a></p>
    <div>
      <h3>Quiz #6</h3>
      <a href="#quiz-6">
        
      </a>
    </div>
    <p>This time we explicitly specify the local address and port. Sometimes there is a need to specify the local port.</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 60_000))
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.0.0.1', 60_000))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_6.py">Answer #6</a></p>
    <div>
      <h3>Quiz #7</h3>
      <a href="#quiz-7">
        
      </a>
    </div>
    <p>Just when you thought it couldn’t get any weirder, we add <a href="https://manpages.debian.org/unstable/manpages/socket.7.en.html#SO_REUSEADDR"><code>SO_REUSEADDR</code></a> into the mix.</p><p>First, we ask the OS to allocate a local address for us. Then we explicitly bind to the same local address, which we know the OS must have assigned to the first socket. We enable local address reuse for both sockets. Is this allowed?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.bind(('127.0.0.1', 60_000))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_7.py">Answer #7</a></p>
    <div>
      <h3>Quiz #8</h3>
      <a href="#quiz-8">
        
      </a>
    </div>
    <p>Finally, a cherry on top. This is Quiz #7 but in reverse. Common sense dictates that the outcome should be the same, but is it?</p>
            <pre><code>s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.bind(('127.0.0.1', 60_000))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()

s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.connect(('127.8.8.8', 1234))
s2.getsockname(), s2.getpeername()</code></pre>
            <p>GOTO <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/quiz_8.py">Answer #8</a></p>
    <div>
      <h2>The secret tri-state life of a local TCP port</h2>
      <a href="#the-secret-tri-state-life-of-a-local-tcp-port">
        
      </a>
    </div>
    <p>Is it all clear now? Well, probably not. It feels like reverse engineering a black box. So what is happening behind the scenes? Let's take a look.</p><p>Linux tracks all TCP <b>ports</b> in use in a hash table named <a href="https://elixir.bootlin.com/linux/v6.2/source/include/net/inet_hashtables.h#L166">bhash</a>. It is not to be confused with the <a href="https://elixir.bootlin.com/linux/v6.2/source/include/net/inet_hashtables.h#L156">ehash</a> table, which tracks <b>sockets</b> that already have both a local and a remote address assigned.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3OJ1M8Zu9lgEZoEJdJCNgr/d50e8f331dc994366b2ff749ed10b519/Untitled.png" />
            
            </figure><p>Each hash table entry points to a chain of so-called bind buckets, which group together sockets that share a local port. To be precise, sockets are grouped into buckets by:</p><ul><li><p>the <a href="https://man7.org/linux/man-pages/man7/network_namespaces.7.html">network namespace</a> they belong to, and</p></li><li><p>the <a href="https://docs.kernel.org/networking/vrf.html">VRF</a> device they are bound to, and</p></li><li><p>the local port number they are bound to.</p></li></ul><p>But in the simplest possible setup - single network namespace, no VRFs - we can say that sockets in a bind bucket are grouped by their local port number.</p><p>The set of sockets in each bind bucket - that is, the sockets sharing a local port - is kept in a linked list named <code>owners</code>.</p><p>When we ask the kernel to assign a local address to a socket, its task is to check for a conflict with any existing socket. That is because a local port number can be shared only <a href="https://elixir.bootlin.com/linux/v6.2/source/include/net/inet_hashtables.h#L43">under some conditions</a>:</p>
            <pre><code>/* There are a few simple rules, which allow for local port reuse by
 * an application.  In essence:
 *
 *   1) Sockets bound to different interfaces may share a local port.
 *      Failing that, goto test 2.
 *   2) If all sockets have sk-&gt;sk_reuse set, and none of them are in
 *      TCP_LISTEN state, the port may be shared.
 *      Failing that, goto test 3.
 *   3) If all sockets are bound to a specific inet_sk(sk)-&gt;rcv_saddr local
 *      address, and none of them are the same, the port may be
 *      shared.
 *      Failing this, the port cannot be shared.
 *
 * The interesting point, is test #2.  This is what an FTP server does
 * all day.  To optimize this case we use a specific flag bit defined
 * below.  As we add sockets to a bind bucket list, we perform a
 * check of: (newsk-&gt;sk_reuse &amp;&amp; (newsk-&gt;sk_state != TCP_LISTEN))
 * As long as all sockets added to a bind bucket pass this test,
 * the flag bit will be set.
 * ...
 */</code></pre>
            <p>The comment above hints that the kernel tries to optimize for the happy case of no conflict. To this end, the bind bucket holds additional state that aggregates the properties of the sockets it contains:</p>
            <pre><code>struct inet_bind_bucket {
        /* ... */
        signed char          fastreuse;
        signed char          fastreuseport;
        kuid_t               fastuid;
#if IS_ENABLED(CONFIG_IPV6)
        struct in6_addr      fast_v6_rcv_saddr;
#endif
        __be32               fast_rcv_saddr;
        unsigned short       fast_sk_family;
        bool                 fast_ipv6_only;
        /* ... */
};</code></pre>
            <p>Let's focus our attention on just the first aggregate property, <code>fastreuse</code>. It has existed since the now-prehistoric Linux 2.1.90pre1, initially in the form of a <a href="https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/tree/include/net/tcp.h?h=2.1.90pre1&amp;id=9d11a5176cc5b9609542b1bd5a827b8618efe681#n76">bit flag</a>, as the comment says, only to evolve into a byte-sized field over time.</p><p>The other six fields came much later, with the introduction of <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da5e36308d9f7151845018369148201a5d28b46d"><code>SO_REUSEPORT</code> in Linux 3.9</a>. Since they play a role only when there are sockets with the <a href="https://manpages.debian.org/unstable/manpages/socket.7.en.html#SO_REUSEPORT"><code>SO_REUSEPORT</code></a> flag set, we are going to ignore them today.</p><p>Whenever the Linux kernel needs to bind a socket to a local port, it first has to look up the bind bucket for that port. What makes life a bit more complicated is that the search for a TCP bind bucket happens in two places in the kernel. The bind bucket lookup can happen early, at <code>bind()</code> time, or late, at <code>connect()</code> time. Which one gets called depends on how the connected socket has been set up:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KOEGkTF7HEH7qku86bufY/8560316429d383b25add8c9f0b2bab3b/image5-5.png" />
            
            </figure><p>However, whether we land in <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L486"><code>inet_csk_get_port</code></a> or <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_hashtables.c#L992"><code>__inet_hash_connect</code></a>, we always end up walking the bucket chain in the bhash looking for the bucket with a matching port number. The bucket might already exist, or we might have to create it first. But once it exists, its fastreuse field is in one of three possible states: <code>-1</code>, <code>0</code>, or <code>+1</code>. As if Linux developers were inspired by <a href="https://en.wikipedia.org/wiki/Triplet_state">quantum mechanics</a>.</p><p>That state reflects two aspects of the bind bucket:</p><ol><li><p>What sockets are in the bucket?</p></li><li><p>When can the local port be shared?</p></li></ol><p>So let us try to decipher the three possible fastreuse states, and what they mean in each case.</p><p>First, what does the fastreuse property say about the owners of the bucket, that is, the sockets using that local port?</p>
<table>
<thead>
  <tr>
    <th><span>fastreuse is</span></th>
    <th><span>owners list contains</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>-1</span></td>
    <td><span>sockets connect()'ed from an ephemeral port</span></td>
  </tr>
  <tr>
    <td><span>0</span></td>
    <td><span>sockets bound without SO_REUSEADDR</span></td>
  </tr>
  <tr>
    <td><span>+1</span></td>
    <td><span>sockets bound with SO_REUSEADDR</span></td>
  </tr>
</tbody>
</table><p>While this is not the whole truth, it is close enough for now. We will soon get to the bottom of it.</p><p>When it comes to port sharing, the situation is far less straightforward:</p>
<table>
<thead>
  <tr>
    <th><span>Can I … when …</span></th>
    <th><span>fastreuse = -1</span></th>
    <th><span>fastreuse = 0</span></th>
    <th><span>fastreuse = +1</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>bind() to the same port (ephemeral or specified)</span></td>
    <td><span>yes</span><span> IFF local IP is unique ①</span></td>
    <td><span>← </span><a href="https://en.wiktionary.org/wiki/idem#Pronoun"><span>idem</span></a></td>
    <td><span>← idem</span></td>
  </tr>
  <tr>
    <td><span>bind() to the specific port with SO_REUSEADDR</span></td>
    <td><span>yes</span><span> IFF local IP is unique OR conflicting socket uses SO_REUSEADDR ①</span></td>
    <td><span>← idem</span></td>
    <td><span>yes</span><span> ②</span></td>
  </tr>
  <tr>
    <td><span>connect() from the same ephemeral port to the same remote (IP, port)</span></td>
    <td><span>yes</span><span> IFF local IP unique ③</span></td>
    <td><span>no</span><span> ③</span></td>
    <td><span>no</span><span> ③</span></td>
  </tr>
  <tr>
    <td><span>connect() from the same ephemeral port to a unique remote (IP, port)</span></td>
    <td><span>yes</span><span> ③</span></td>
    <td><span>no</span><span> ③</span></td>
    <td><span>no</span><span> ③</span></td>
  </tr>
</tbody>
</table><p>① Determined by <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L214"><code>inet_csk_bind_conflict()</code></a> called from <code>inet_csk_get_port()</code> (specific port bind) or <code>inet_csk_get_port()</code> → <code>inet_csk_find_open_port()</code> (ephemeral port bind).</p><p>② Because <code>inet_csk_get_port()</code> <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L531">skips the conflict check</a> for <code>fastreuse == 1</code> buckets.</p><p>③ Because <code>inet_hash_connect()</code> → <code>__inet_hash_connect()</code> <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_hashtables.c#L1062">skips buckets</a> with <code>fastreuse != -1</code>.</p><p>While it all looks rather complicated at first sight, we can distill the table above into a few statements that hold true and are a bit easier to digest:</p><ul><li><p><code>bind()</code>, or early local address allocation, always succeeds if there is no local IP address conflict with any existing socket,</p></li><li><p><code>connect()</code>, or late local address allocation, always fails when the TCP bind bucket for a local port is in any state other than <code>fastreuse = -1</code>,</p></li><li><p><code>connect()</code> only succeeds if the (local IP, local port, remote IP, remote port) tuple is unique - that is, there is no local and remote address conflict,</p></li><li><p>the <code>SO_REUSEADDR</code> socket option allows local address sharing if all conflicting sockets also use it (and none of them is in the listening state).</p></li></ul>
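<p>The <code>bind()</code> rules in particular are easy to observe from user space. Below is a minimal Python sketch of just the bind cases - it assumes Linux semantics and an available loopback address:</p>

```python
import errno
import socket

# Bind a first TCP socket with SO_REUSEADDR to an ephemeral loopback port.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s1.bind(("127.0.0.1", 0))
port = s1.getsockname()[1]

# A second socket with SO_REUSEADDR may share the exact same (IP, port),
# because every conflicting socket also uses SO_REUSEADDR.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s2.bind(("127.0.0.1", port))
shared_ok = True

# A third socket without SO_REUSEADDR hits a local address conflict.
s3 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s3.bind(("127.0.0.1", port))
    conflict_errno = None
except OSError as e:
    conflict_errno = e.errno  # EADDRINUSE

for s in (s1, s2, s3):
    s.close()
```

<p>Note that none of these sockets is listening; with a listener in the bucket, the <code>SO_REUSEADDR</code> sharing above would no longer be allowed.</p>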
    <div>
      <h3>This is crazy. I don’t believe you.</h3>
      <a href="#this-is-crazy-i-dont-believe-you">
        
      </a>
    </div>
    <p>Fortunately, you don't have to. With <a href="https://drgn.readthedocs.io/en/latest/index.html">drgn</a>, the programmable debugger, we can examine the bind bucket state on a live kernel:</p>
            <pre><code>#!/usr/bin/env drgn

"""
dump_bhash.py - List all TCP bind buckets in the current netns.

Script is not aware of VRF.
"""

import os

from drgn.helpers.linux.list import hlist_for_each, hlist_for_each_entry
from drgn.helpers.linux.net import get_net_ns_by_fd
from drgn.helpers.linux.pid import find_task


def dump_bind_bucket(head, net):
    for tb in hlist_for_each_entry("struct inet_bind_bucket", head, "node"):
        # Skip buckets not from this netns
        if tb.ib_net.net != net:
            continue

        port = tb.port.value_()
        fastreuse = tb.fastreuse.value_()
        owners_len = len(list(hlist_for_each(tb.owners)))

        print(
            "{:8d}  {:{sign}9d}  {:7d}".format(
                port,
                fastreuse,
                owners_len,
                sign="+" if fastreuse != 0 else " ",
            )
        )


def get_netns():
    pid = os.getpid()
    task = find_task(prog, pid)
    with open(f"/proc/{pid}/ns/net") as f:
        return get_net_ns_by_fd(task, f.fileno())


def main():
    print("{:8}  {:9}  {:7}".format("TCP-PORT", "FASTREUSE", "#OWNERS"))

    tcp_hashinfo = prog.object("tcp_hashinfo")
    net = get_netns()

    # Iterate over all bhash slots
    for i in range(0, tcp_hashinfo.bhash_size):
        head = tcp_hashinfo.bhash[i].chain
        # Iterate over bind buckets in the slot
        dump_bind_bucket(head, net)


main()</code></pre>
            <p>Let's take <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/dump_bhash.py">this script</a> for a spin and try to confirm what <i>Table 1</i> claims to be true. Keep in mind that to produce the <code>ipython --classic</code> session snippets below I've used the same setup as for the quiz questions.</p><p>Two connected sockets sharing ephemeral port 60,000:</p>
            <pre><code>&gt;&gt;&gt; s1 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s1.connect(('127.1.1.1', 1234))
&gt;&gt;&gt; s2 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s2.connect(('127.2.2.2', 1234))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        3
   60000         -1        2
&gt;&gt;&gt;</code></pre>
            <p>Two bound sockets reusing port 60,000:</p>
            <pre><code>&gt;&gt;&gt; s1 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
&gt;&gt;&gt; s1.bind(('127.1.1.1', 60_000))
&gt;&gt;&gt; s2 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
&gt;&gt;&gt; s2.bind(('127.1.1.1', 60_000))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        1
   60000         +1        2
&gt;&gt;&gt; </code></pre>
            <p>A mix of bound sockets, with and without <code>SO_REUSEADDR</code>, sharing port 60,000:</p>
            <pre><code>&gt;&gt;&gt; s1 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
&gt;&gt;&gt; s1.bind(('127.1.1.1', 60_000))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        1
   60000         +1        1
&gt;&gt;&gt; s2 = socket(AF_INET, SOCK_STREAM)
&gt;&gt;&gt; s2.bind(('127.2.2.2', 60_000))
&gt;&gt;&gt; !./dump_bhash.py
TCP-PORT  FASTREUSE  #OWNERS
    1234          0        1
   60000          0        2
&gt;&gt;&gt;</code></pre>
            <p>With such tooling, proving that <i>Table 2</i> holds true is just a matter of writing a bunch of <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/test_fastreuse.py">exploratory tests</a>.</p><p>But what has happened in that last snippet? The bind bucket has clearly transitioned from one fastreuse state to another. This is what <i>Table 1</i> fails to capture. And it means that we still don't have the full picture.</p><p>We have yet to find out when the bucket's fastreuse state can change. This calls for a state machine.</p>
    <div>
      <h3>Das State Machine</h3>
      <a href="#das-state-machine">
        
      </a>
    </div>
    <p>As we have just seen, a bind bucket does not need to stay in its initial fastreuse state throughout its lifetime. Adding sockets to the bucket can trigger a state change. As it turns out, it can only transition into <code>fastreuse = 0</code>, and only if we happen to <code>bind()</code> a socket that:</p><ol><li><p>doesn't conflict with the existing owners, and</p></li><li><p>doesn't have the <code>SO_REUSEADDR</code> option enabled.</p></li></ol>
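<p>These transitions can be condensed into a toy Python model. To be clear, this is a deliberate simplification of the kernel logic: it tracks only the fastreuse field and ignores <code>SO_REUSEPORT</code> and the address conflict checks themselves:</p>

```python
# Three ways a socket can become an owner of a bind bucket.
CONNECTED = "connect() from an ephemeral port"
BOUND_REUSEADDR = "bind() with SO_REUSEADDR"
BOUND_PLAIN = "bind() without SO_REUSEADDR"

def initial_fastreuse(owner):
    # State of a freshly created bucket, set by its first owner (Table 1).
    return {CONNECTED: -1, BOUND_REUSEADDR: +1, BOUND_PLAIN: 0}[owner]

def add_owner(fastreuse, owner):
    # Adding an owner can only ever push the bucket into fastreuse = 0:
    # a successful bind() without SO_REUSEADDR resets the state, while
    # everything else leaves it untouched.
    if owner == BOUND_PLAIN:
        return 0
    return fastreuse
```

<p>This mirrors the last drgn session: the bucket starts out at +1 after the first <code>SO_REUSEADDR</code> bind, and drops to 0 once a plain <code>bind()</code> on a different IP joins it.</p>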
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7awSZQJp00EDbGbvcpBNv9/cca9b2c49a3db5f3d1db62ce142ac46a/Untitled--1-.png" />
            
            </figure><p>And while we could have figured it all out by carefully reading the code in <a href="https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/inet_connection_sock.c#L431"><code>inet_csk_get_port → inet_csk_update_fastreuse</code></a>, it certainly doesn't hurt to confirm our understanding with <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/test_fastreuse_states.py">a few more tests</a>.</p><p>Now that we have the full picture, this begs the question...</p>
    <div>
      <h3>Why are you telling me all this?</h3>
      <a href="#why-are-you-telling-me-all-this">
        
      </a>
    </div>
    <p>Firstly, so that the next time the <code>bind()</code> syscall rejects your request with <code>EADDRINUSE</code>, or <code>connect()</code> refuses to cooperate by throwing the <code>EADDRNOTAVAIL</code> error, you will know what is happening - or at least have the tools to find out.</p><p>Secondly, because we have previously <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">advertised a technique</a> for opening connections from a specific range of ports, which involves <code>bind()</code>'ing sockets with the <code>SO_REUSEADDR</code> option. What we did not realize back then is that there exists <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2023-03-quantum-state-of-tcp-port/test_fastreuse.py#L300">a corner case</a> where the same port can't be shared with regular, <code>connect()</code>'ed sockets. While that is not a deal-breaker, it is good to understand the consequences.</p><p>To make things better, we have worked with the Linux community to extend the kernel API with a new socket option that lets the user <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177">specify the local port range</a>. The new option will be available in the upcoming Linux 6.3. With it, we no longer have to resort to <code>bind()</code> tricks, making it possible to once again share a local port with regular <code>connect()</code>'ed sockets.</p>
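<p>The new option takes an unsigned 32-bit value, with the lower bound of the port range in the low 16 bits and the upper bound in the high 16 bits. Here is a sketch of how it could be used from Python - the option number 51 is taken from the kernel's <code>linux/in.h</code> as of the commit above, and on kernels older than 6.3 the <code>setsockopt()</code> call simply fails:</p>

```python
import socket
import struct

# Python may not export the constant yet; 51 is IP_LOCAL_PORT_RANGE
# in linux/in.h as introduced by the commit linked above.
IP_LOCAL_PORT_RANGE = getattr(socket, "IP_LOCAL_PORT_RANGE", 51)

def port_range(lo, hi):
    # Low 16 bits: lower bound of the range; high 16 bits: upper bound.
    return hi << 16 | lo

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.setsockopt(socket.IPPROTO_IP, IP_LOCAL_PORT_RANGE,
                 struct.pack("@I", port_range(40_000, 40_099)))
    supported = True
except OSError:
    supported = False  # kernel older than 6.3
s.close()
```

<p>With the option set, a subsequent <code>connect()</code> picks its ephemeral port from the requested range, no <code>bind()</code> needed.</p>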
    <div>
      <h2>Closing thoughts</h2>
      <a href="#closing-thoughts">
        
      </a>
    </div>
    <p>Today we posed a relatively straightforward question - when can two TCP sockets share a local address? - and worked our way towards an answer. An answer that is too complex to compress into a single sentence. What is more, it's not even the full answer. After all, we decided to ignore the existence of the <code>SO_REUSEPORT</code> feature, and did not consider conflicts with TCP listening sockets.</p><p>If there is a simple takeaway, though, it is that <code>bind()</code>'ing a socket can have tricky consequences. When using <code>bind()</code> to select an egress IP address, it is best to combine it with the <code>IP_BIND_ADDRESS_NO_PORT</code> socket option and leave the port assignment to the kernel. Otherwise, we might unintentionally block local TCP ports from being reused.</p><p>It is too bad that the same advice does not apply to UDP, where <code>IP_BIND_ADDRESS_NO_PORT</code> does not really work today. But that is another story.</p><p>Until next time.</p><p>If you enjoy scratching your head while reading the Linux kernel source code, <a href="https://www.cloudflare.com/careers/">we are hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">74q2VGXmBazVsIZUpVUD8o</guid>
            <dc:creator>Jakub Sitnicki</dc:creator>
        </item>
        <item>
            <title><![CDATA[CVE-2022-47929: traffic control noqueue no problem?]]></title>
            <link>https://blog.cloudflare.com/cve-2022-47929-traffic-control-noqueue-no-problem/</link>
            <pubDate>Tue, 31 Jan 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ In the Linux kernel before 6.1.6, a NULL pointer dereference bug in the traffic control subsystem allows an unprivileged user to trigger a denial of service (system crash) via a crafted traffic control configuration that is set up with "tc qdisc" and "tc class" commands. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Kt5g4yfw3QI3Gu8UzcclV/da58a3de4dc53ef2ff7130e27cbb0bf4/image1-56.png" />
            
            </figure><p>USER namespaces power the functionality of our favorite tools such as docker and podman. <a href="/live-patch-security-vulnerabilities-with-ebpf-lsm/">We wrote about Linux namespaces back in June</a> and explained them like this:</p><blockquote><p>Most of the namespaces are uncontroversial, like the UTS namespace which allows the host system to hide its hostname and time. Others are complex but straightforward - NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is this very special, very curious USER namespace. USER namespace is special since it allows the - typically unprivileged - owner to operate as "root" inside it. It's a foundation to having tools like Docker to not operate as true root, and things like rootless containers.</p></blockquote><p>Due to its nature, allowing unprivileged users access to USER namespaces has always carried a great security risk. With their help, an unprivileged user can in fact run code that typically requires root. This code is often under-tested and buggy. Today we will look into one such case, where USER namespaces are leveraged to exploit a kernel bug that can result in an unprivileged denial-of-service attack.</p>
    <div>
      <h3>Enter Linux Traffic Control queue disciplines</h3>
      <a href="#enter-linux-traffic-control-queue-disciplines">
        
      </a>
    </div>
    <p>In 2019, we were exploring leveraging <a href="https://man7.org/linux/man-pages/man8/tc.8.html#DESCRIPTION">Linux Traffic Control's</a> <a href="https://tldp.org/HOWTO/Traffic-Control-HOWTO/components.html#c-qdisc">queue discipline</a> (qdisc) to schedule packets for one of our services with the <a href="https://man7.org/linux/man-pages/man8/tc-htb.8.html">Hierarchy Token Bucket</a> (HTB) <a href="https://tldp.org/HOWTO/Traffic-Control-HOWTO/classful-qdiscs.html">classful qdisc</a> strategy. Linux Traffic Control is a user-configured system to schedule and filter network packets. Queue disciplines are the strategies in which packets are scheduled. In particular, we wanted to filter and schedule certain packets from an interface, and drop others into the <a href="https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt">noqueue</a> qdisc.</p><p>noqueue is a special case qdisc, such that packets are supposed to be dropped when scheduled into it. In practice, this is not the case. Linux handles noqueue such that packets are passed through and not dropped (for the most part). The <a href="https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt">documentation</a> states as much. It also states that “It is not possible to assign the noqueue queuing discipline to physical devices or classes.” So what happens when we assign noqueue to a class?</p><p>Let's write some shell commands to show the problem in action:</p>
            <pre><code>1. $ sudo -i
2. # dev=enp0s5
3. # tc qdisc replace dev $dev root handle 1: htb default 1
4. # tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
5. # tc qdisc add dev $dev parent 1:1 handle 10: noqueue</code></pre>
            <ol><li><p>First we need to log in as root, because that gives us <a href="https://man7.org/linux/man-pages/man7/capabilities.7.html#DESCRIPTION">CAP_NET_ADMIN</a>, which lets us configure traffic control.</p></li><li><p>We then assign a network interface to a variable. Interfaces can be found with <code>ip a</code>. Virtual interfaces can be located by calling <code>ls /sys/devices/virtual/net</code>; these will match the output from <code>ip a</code>.</p></li><li><p>Our interface is currently assigned the <a href="https://man7.org/linux/man-pages/man8/tc-pfifo_fast.8.html">pfifo_fast</a> qdisc, so we replace it with the HTB classful qdisc and assign it the handle <code>1:</code>. We can think of this as the root node in a tree. The “default 1” configures it such that unclassified traffic is routed directly through this qdisc, which falls back to pfifo_fast queuing (more on this later).</p></li><li><p>Next we add a class to our root qdisc <code>1:</code>, assign it to the first leaf node 1 of root 1: <code>1:1</code>, and give it some reasonable configuration defaults.</p></li><li><p>Lastly, we add the noqueue qdisc to our first leaf node in the hierarchy: <code>1:1</code>. This effectively means traffic routed here will be scheduled to noqueue.</p></li></ol><p>Assuming our setup executed without a hitch, we will receive something similar to this kernel panic:</p>
            <pre><code>BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
...
Call Trace:
&lt;TASK&gt;
htb_enqueue+0x1c8/0x370
dev_qdisc_enqueue+0x15/0x90
__dev_queue_xmit+0x798/0xd00
...
&lt;/TASK&gt;
</code></pre>
            <p>We know that the root user is responsible for setting qdiscs on interfaces, so if root can crash the kernel, so what? We simply do not apply the noqueue qdisc to a class id of an HTB qdisc:</p>
            <pre><code># dev=enp0s5
# tc qdisc replace dev $dev root handle 1: htb default 1
# tc class add dev $dev parent 1: classid 1:2 htb rate 10mbit // A
// B is missing, so anything not filtered into 1:2 will be pfifo_fast</code></pre>
            <p>Here, we leveraged the default case of HTB where we assign a class id 1:2 to be rate-limited (A), and implicitly did not set a qdisc to another class such as id 1:1 (B). Packets queued to (A) will be filtered to <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L620">HTB_DIRECT</a> and packets queued to (B) will be filtered into pfifo_fast.</p><p>Because we were not familiar with this part of the codebase, we <a href="https://lore.kernel.org/all/CALrw=nEdA0asN4n7B3P2TyHKJ+UBPvoAiMrwkT42=fqp2-CPiw@mail.gmail.com/">notified</a> the mailing lists and created a ticket. The bug did not seem all that important to us at that time.</p><p>Fast-forward to 2022, we are <a href="https://lwn.net/Articles/903580/">pushing</a> USER namespace creation hardening. We extended the Linux LSM framework with a new LSM hook: <a href="https://lore.kernel.org/all/20220815162028.926858-1-fred@cloudflare.com/">userns_create</a> to leverage <a href="/live-patch-security-vulnerabilities-with-ebpf-lsm/">eBPF LSM</a> for our protections, and encourage others to do so as well. Recently while combing our ticket backlog, we rethought this bug. We asked ourselves, “can we leverage USER namespaces to trigger the bug?” and the short answer is yes!</p>
    <div>
      <h3>Demonstrating the bug</h3>
      <a href="#demonstrating-the-bug">
        
      </a>
    </div>
    <p>The exploit can be performed with any classful qdisc that assumes a <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/include/net/sch_generic.h#L73">struct Qdisc.enqueue</a> function to not be NULL (more on this later), but in this case, we are demonstrating just with HTB.</p>
            <pre><code>$ unshare -rU --net
$ dev=lo
$ tc qdisc replace dev $dev root handle 1: htb default 1
$ tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
$ tc qdisc add dev $dev parent 1:1 handle 10: noqueue
$ ping -I $dev -w 1 -c 1 1.1.1.1</code></pre>
            <p>We use the “lo” interface to demonstrate that this bug is triggerable with a virtual interface. This is important for containers because they are fed virtual interfaces most of the time, and not the physical interface. Because of that, we can use a container to crash the host as an unprivileged user, and thus perform a denial of service attack.</p>
    <div>
      <h3>Why does that work?</h3>
      <a href="#why-does-that-work">
        
      </a>
    </div>
    <p>To understand the problem a bit better, we need to look back to the original <a href="https://lore.kernel.org/all/1440703299-21243-1-git-send-email-phil@nwl.cc/#t">patch series</a>, and specifically the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d66d6c3152e8d5a6db42a56bf7ae1c6cae87ba48">commit</a> that introduced the bug. Before this series, achieving noqueue on interfaces relied on a hack that would set a device's qdisc to noqueue if the device had <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_api.c#L1263">tx_queue_len = 0</a>. Commit d66d6c3152e8 (“net: sched: register noqueue qdisc”) removes that limitation by explicitly registering noqueue, so it can be added directly with the <code>tc</code> command.</p><p>The way the kernel checks whether we are in a noqueue case is simply to check whether a qdisc has a <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/core/dev.c#L4214">NULL enqueue()</a> function. Recall from earlier that noqueue does not actually drop packets in practice? When that check fails (the enqueue function is NULL), the logic that follows it implements the noqueue behavior. To make the check fail, the author had to <i>cheat</i> a reassignment from <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_generic.c#L628">noop_enqueue()</a> to <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_api.c#L142">NULL</a> by setting <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_generic.c#L683">enqueue = NULL</a> in the init function, which is called <i>way after</i> <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_api.c#L131">register_qdisc()</a>, at runtime.</p><p>Here is where classful qdiscs come into play. In this call path the enqueue function is no longer NULL - it is now set to HTB's (in our example) - so the kernel is allowed to enqueue the struct skb to a <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/core/dev.c#L3778">queue</a> by calling <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L612">htb_enqueue()</a>. Once in there, HTB performs a <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L216">lookup</a> to pull in the qdisc assigned to a leaf node, and eventually attempts to <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L635">queue</a> the struct skb to the chosen qdisc, ultimately reaching this function:</p><p><i>include/net/sch_generic.h</i></p>
            <pre><code>static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
				struct sk_buff **to_free)
{
	qdisc_calculate_pkt_len(skb, sch);
	return sch-&gt;enqueue(skb, sch, to_free); // sch-&gt;enqueue == NULL
}</code></pre>
            <p>We can see that the enqueueing process is fairly agnostic to physical/virtual interfaces. The permission and validation checks are done when adding a queue to an interface, which is why the classful qdiscs assume the enqueue function is not NULL. This knowledge leads us to a few solutions to consider.</p>
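<p>The control flow can be modeled in a few lines of Python, with <code>None</code> standing in for the NULL function pointer. This is a deliberately simplified sketch of the logic, not kernel code:</p>

```python
class Qdisc:
    def __init__(self, enqueue=None, leaf=None):
        self.enqueue = enqueue  # None models noqueue's NULL enqueue pointer
        self.leaf = leaf        # child qdisc, for classful qdiscs

def htb_enqueue(skb, sch):
    # The buggy path: the classful qdisc calls its leaf's enqueue function
    # without repeating the NULL check -- the analog of the kernel oops.
    return sch.leaf.enqueue(skb, sch.leaf)

def dev_queue_xmit(root, skb):
    # The kernel-side check: a NULL enqueue function means "noqueue",
    # so the packet bypasses queueing entirely.
    if root.enqueue is None:
        return "sent-directly"
    return root.enqueue(skb, root)

noqueue = Qdisc(enqueue=None)
assert dev_queue_xmit(noqueue, b"pkt") == "sent-directly"  # noqueue root: fine

htb = Qdisc(enqueue=htb_enqueue, leaf=noqueue)
try:
    dev_queue_xmit(htb, b"pkt")  # noqueue as an HTB class: "NULL dereference"
    crashed = False
except TypeError:
    crashed = True
```

<p>The root-level check passes for HTB, and by the time the leaf's NULL enqueue is dereferenced, nothing guards it - which is why disallowing noqueue on classes fixes the crash.</p>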
    <div>
      <h3>Solutions</h3>
      <a href="#solutions">
        
      </a>
    </div>
    <p>We had a few solutions, ranging from what we thought was best to worst:</p><ol><li><p>Follow the tc-noqueue documentation and do not allow noqueue to be assigned to a classful qdisc</p></li><li><p>Instead of checking for <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/core/dev.c#L4214">NULL</a>, check for <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_generic.c#L687">struct noqueue_qdisc_ops</a>, and reset noqueue back to noop_enqueue</p></li><li><p>For each classful qdisc, check for NULL and fall back</p></li></ol><p>While we ultimately went for the first option, <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=96398560f26aa07e8f2969d73c8197e6a6d10407">"disallow noqueue for qdisc classes"</a>, the third option creates a lot of churn in the code and does not solve the problem completely: future qdisc implementations could forget that important check, and so could the maintainers. The reason for passing on the second option is a bit more interesting: before following that approach, we would first need to answer these questions:</p><p>Why not allow noqueue for classful qdiscs?</p><p>This contradicts the documentation. The documentation does have some precedent for not being totally followed in practice, but we would need to update it to reflect the current state. That is fine to do, but it does not address the behavior-change problem beyond removing the NULL dereference bug.</p><p>What behavior changes if we do allow noqueue for qdiscs?</p><p>This is harder to answer, because we need to determine what that behavior should be. Currently, when noqueue is applied as the root qdisc for an interface, packets are essentially allowed to pass through. Claiming a fallback for classes is a different matter: each class may have its own fallback rules, so how do we know what the right fallback is? Sometimes in HTB the fallback is pass-through with HTB_DIRECT, sometimes it is pfifo_fast. What about the other classes? Perhaps instead we should fall back to the default noqueue behavior, as for root qdiscs?</p><p>We felt that going down this route would only add confusion and additional complexity to queuing. One could also argue that such a change would be a feature addition rather than a bug fix. Suffice it to say, adhering to the current documentation seemed the more appealing approach to prevent the vulnerability now, while something else can be worked out later.</p>
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>First and foremost, apply this <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=96398560f26aa07e8f2969d73c8197e6a6d10407">patch</a> as soon as possible. And consider hardening USER namespaces on your systems: by setting <code>sysctl -w</code> <a href="https://sources.debian.org/patches/linux/3.16.56-1+deb8u1/debian/add-sysctl-to-disallow-unprivileged-CLONE_NEWUSER-by-default.patch/"><code>kernel.unprivileged_userns_clone</code></a><code>=0</code>, which only lets root create USER namespaces in Debian kernels; by setting <code>sysctl -w</code> <a href="https://docs.kernel.org/admin-guide/sysctl/user.html?highlight=max_user_namespaces"><code>user.max_user_namespaces</code></a><code>=[number]</code> for a process hierarchy; or by backporting these two patches: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cd4c5c2101cb092db00f61f69d24380cf7a0ee8"><code>security_create_user_ns()</code></a> and the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed5d44d42c95e8a13bb54e614d2269c8740667f9">SELinux implementation</a> (now in Linux 6.1.x), which allow you to protect your systems with either eBPF or SELinux. If you are sure you're not using USER namespaces, in extreme cases you might consider turning the feature off with <code>CONFIG_USERNS=n</code>. This is just one example of many where namespaces are leveraged to perform an attack, and more are sure to crop up, in varying levels of severity, in the future.</p><p>Special thanks to Ignat Korchagin and Jakub Sitnicki for code reviews and helping demonstrate the bug in practice.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[CVE]]></category>
            <guid isPermaLink="false">KiLg1KENXvFAT8ADCa4hN</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
        <item>
            <title><![CDATA[A debugging story: corrupt packets in AF_XDP; a kernel bug or user error?]]></title>
            <link>https://blog.cloudflare.com/a-debugging-story-corrupt-packets-in-af_xdp-kernel-bug-or-user-error/</link>
            <pubDate>Mon, 16 Jan 2023 13:46:45 GMT</pubDate>
            <description><![CDATA[ A race condition in the virtual ethernet driver of the Linux kernel led to occasional packet content corruptions, which resulted in unwanted packet drops by one of our DDoS mitigation systems. This blogpost describes the thought process and technique we used to debug this complex issue. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6y8e5gSMEYuaTCmUVwscON/e7b52d75c044b155fbc6682379a6350b/image2-33.png" />
            
            </figure>
    <div>
      <h3>panic: Invalid TCP packet: Truncated</h3>
      <a href="#panic-invalid-tcp-packet-truncated">
        
      </a>
    </div>
    <p>A few months ago we started getting a handful of crash reports for flowtrackd, our <a href="https://developers.cloudflare.com/ddos-protection/tcp-protection/">Advanced TCP Protection</a> system that runs on our global network. The provided stack traces indicated that the panics occurred while parsing a TCP packet that was truncated.</p><p>What was most interesting wasn’t that we failed to parse the packet. It isn’t rare that we receive malformed packets from the Internet that are (deliberately or not) truncated. Those packets will be caught the first time we parse them and won’t make it to the latter processing stages. However, in our case, the panic occurred the second time we parsed the packet, indicating it had been truncated <b>after</b> we received it and successfully parsed it the first time. Both parse calls were made from a single green thread and referenced the same packet buffer in memory, and we made no attempts to mutate the packet in between.</p><p>It can be easy to dread discovering a bug like this. Is there a race condition? Is there memory corruption? Is this a kernel bug? A compiler bug? Our plan to get to the root cause of this potentially complex issue was to identify symptom(s) related to the bug, create theories on what may be occurring and create a way to test our theories or gather more information.</p><p>Before we get into the details we first need some background information about <code>AF_XDP</code> and our setup.</p>
    <div>
      <h3>AF_XDP overview</h3>
      <a href="#af_xdp-overview">
        
      </a>
    </div>
    <p><a href="https://www.kernel.org/doc/html/latest/networking/af_xdp.html">AF_XDP</a> is the high performance asynchronous user-space networking API in the Linux kernel. For network devices that support it, AF_XDP provides a way to perform extremely fast, zero-copy packet forwarding using a memory buffer that’s shared between the kernel and a user-space application.</p><p>A number of components need to be set up by the user-space application to start interacting with the packets entering a network device using AF_XDP.</p><p>First, a shared packet buffer (UMEM) is created. This UMEM is divided into equal-sized “frames” that are referenced by a “descriptor address,” which is just the offset from the start of the UMEM.</p>
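<p>Since a descriptor address is nothing more than a byte offset into the UMEM, translating between frame indices and descriptor addresses is plain arithmetic. A small sketch, assuming a typical 4 KiB frame size (real applications take these values from their own AF_XDP setup):</p>

```python
FRAME_SIZE = 4096   # assumed UMEM frame size in bytes
NUM_FRAMES = 4096   # assumed number of frames in the UMEM

def frame_to_addr(frame_index):
    # Descriptor address = offset of the frame from the start of the UMEM.
    assert 0 <= frame_index < NUM_FRAMES
    return frame_index * FRAME_SIZE

def addr_to_frame(addr):
    # Any offset that lands inside a frame maps back to that frame's index.
    assert 0 <= addr < NUM_FRAMES * FRAME_SIZE
    return addr // FRAME_SIZE
```

<p>Descriptors travelling through the rings described below carry such addresses, so both sides of the shared memory agree on which frame holds which packet.</p>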
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7sZldzVDmiFTIbWi5VROpW/799e655b50ce2288d748086a1c85bab4/image12-2.png" />
            
            </figure><p>Next, multiple AF_XDP sockets (XSKs) are created – one for each hardware queue on the network device – and bound to the UMEM. Each of these sockets provides four ring buffers (or “queues”) which are used to send descriptors back and forth between the kernel and user-space.</p><p>User-space sends packets by taking an unused descriptor and copying the packet into that descriptor (or rather, into the UMEM frame that the descriptor points to). It gives the descriptor to the kernel by enqueueing it on the <b>TX queue</b>. Some time later, the kernel dequeues the descriptor from the <b>TX queue</b> and transmits the packet that it points to out of the network device. Finally, the kernel gives the descriptor back to user-space by enqueueing it on the <b>COMPLETION queue</b>, so that user-space can reuse it later to send another packet.</p><p>To receive packets, user-space provides the kernel with unused descriptors by enqueueing them on the <b>FILL queue</b>. The kernel copies packets it receives into these unused descriptors, and then gives them to user-space by enqueueing them on the <b>RX queue</b>. Once user-space processes the packets it dequeues from the <b>RX queue</b>, it either transmits them back out of the network device by enqueueing them on the <b>TX queue</b>, or it gives them back to the kernel for later reuse by enqueueing them on the <b>FILL queue</b>.</p><table>
<thead>
  <tr>
    <th>Queue</th>
    <th>User space</th>
    <th>Kernel space</th>
    <th>Content description</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>COMPLETION</td>
    <td>Consumes</td>
    <td>Produces</td>
    <td>Descriptors containing a packet that was successfully transmitted by the kernel</td>
  </tr>
  <tr>
    <td>FILL</td>
    <td>Produces</td>
    <td>Consumes</td>
    <td>Descriptors that are empty and ready to be used by the kernel to receive packets</td>
  </tr>
  <tr>
    <td>RX</td>
    <td>Consumes</td>
    <td>Produces</td>
    <td>Descriptors containing a packet that was recently received by the kernel</td>
  </tr>
  <tr>
    <td>TX</td>
    <td>Produces</td>
    <td>Consumes</td>
    <td>Descriptors containing a packet that is ready to be transmitted by the kernel</td>
  </tr>
</tbody>
</table><p>Finally, a BPF program is attached to the network device. Its job is to direct incoming packets to whichever XSK is associated with the specific hardware queue that the packet was received on.</p><p>Here is an overview of the interactions between the kernel and user-space:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5g3TzP8qFYPWMxEXeXxhz3/798bc1db0ce5bdf894804039eee1b77d/Untitled-2.png" />
            
            </figure>
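<p>The BPF side of this can be very small. The following sketch is in the style of the kernel’s xdpsock sample, not our production program; the map name and size are illustrative:</p>

```c
/* Kernel-side XDP program (sketch): send each packet to the AF_XDP socket
 * registered for the hardware queue it arrived on. Built with clang -target bpf. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);     /* one entry per hardware queue (illustrative) */
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int redirect_to_xsk(struct xdp_md *ctx)
{
    /* If an XSK is bound to this queue, redirect to it; otherwise fall
     * back to the regular network stack. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```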
    <div>
      <h3>Our setup</h3>
      <a href="#our-setup">
        
      </a>
    </div>
    <p>Our application uses AF_XDP on a pair of multi-queue veth interfaces (“outer” and “inner”) that are each in different network namespaces. We follow the process outlined above to bind an XSK to each of the interfaces’ queues, and then either forward packets from one interface to the other, send them back out of the interface they were received on, or drop them. This functionality enables us to implement bidirectional traffic inspection to perform DDoS mitigation logic.</p><p>This setup is depicted in the following diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nAwNZF6NfDJK6sz1yymW2/45823d63ae47c22878796864d847187a/Untitled--1--1.png" />
            
            </figure>
    <div>
      <h3>Information gathering</h3>
      <a href="#information-gathering">
        
      </a>
    </div>
    <p>All we knew to start with was that our program was occasionally seeing corruption that seemed to be impossible. We didn’t know what these corrupt packets actually looked like. It was possible that their contents would reveal more details about the bug and how to reproduce it, so our first step was to log the packet bytes and discard the packet instead of panicking. We could then take the logs with packet bytes in them and create a PCAP file to analyze with <a href="https://www.wireshark.org/">Wireshark</a>. This showed us that the packets looked mostly normal, except for Wireshark’s TCP analyzer complaining that their “IPv4 total length exceeds packet length”. In other words, the “total length” IPv4 header field said the packet should be (for example) 60 bytes long, but the packet itself was only 56 bytes long.</p>
    <div>
      <h3>Lengths mismatch</h3>
      <a href="#lengths-mismatch">
        
      </a>
    </div>
    <p>Could it be possible that the number of bytes we read from the RX ring was incorrect? Let’s check.</p><p>An XDP descriptor has the following <a href="https://elixir.bootlin.com/linux/v5.15.77/source/include/uapi/linux/if_xdp.h#L103">C struct</a>:</p>
            <pre><code>struct xdp_desc {
	__u64 addr;
	__u32 len;
	__u32 options;
};</code></pre>
            <p>Here the <code>len</code> member tells us the total size of the packet pointed to by <code>addr</code> in the UMEM frame.</p><p>Our first interaction with the packet content happens in the BPF code attached to the network interfaces.</p><p>There, our entrypoint function gets a pointer to an <code>xdp_md</code> C struct with <a href="https://elixir.bootlin.com/linux/v5.15.77/source/include/uapi/linux/bpf.h#L5442">the following definition</a>:</p>
            <pre><code>struct xdp_md {
	__u32 data;
	__u32 data_end;
	__u32 data_meta;
	/* Below access go through struct xdp_rxq_info */
	__u32 ingress_ifindex; /* rxq-&gt;dev-&gt;ifindex */
	__u32 rx_queue_index;  /* rxq-&gt;queue_index  */

	__u32 egress_ifindex;  /* txq-&gt;dev-&gt;ifindex */
};</code></pre>
            <p>This context structure contains two pointers (as <code>__u32</code> values) referring to the start and the end of the packet. Getting the packet length can be done by subtracting <code>data</code> from <code>data_end</code>.</p><p>If we compare that value with the one we get from the descriptors, we would surely find they are the same, right?</p><p>We can use the BPF helper function <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html"><code>bpf_xdp_adjust_meta()</code></a> (since the veth driver supports it) to declare a metadata space that will hold the packet buffer length that we computed. We use it the same way <a href="https://elixir.bootlin.com/linux/v5.15.77/source/samples/bpf/xdp2skb_meta_kern.c#L41">this kernel sample code</a> does.</p><p>After deploying the new code in production, we saw the following lines in our logs:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NG7NbMAC6kpbMT1mBZtwT/cbb254e24627be61ecba34d2783d069b/pasted-image-0-2.png" />
            
            </figure><p>Here you can see three interesting things:</p><ol><li><p>As we theorized, the length of the packet when first seen in XDP doesn’t match the length present in the descriptor.</p></li><li><p>We had already observed from our truncated packet panics that sometimes the descriptor length is shorter than the actual packet length; however, the prints show that the descriptor length can also be larger than the real packet length.</p></li><li><p>These often appeared to happen in “pairs” where the XDP length and descriptor length would swap between packets.</p></li></ol>
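<p>The BPF change described above, computing the length and declaring a metadata space with <code>bpf_xdp_adjust_meta()</code>, can be sketched like this (modeled on the xdp2skb_meta kernel sample; illustrative, not our exact production code):</p>

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int record_xdp_len(struct xdp_md *ctx)
{
    /* Reserve 4 bytes of metadata immediately in front of the packet. */
    if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32)))
        return XDP_ABORTED;

    void *data      = (void *)(long)ctx->data;
    void *data_meta = (void *)(long)ctx->data_meta;
    __u32 *len      = data_meta;

    /* The verifier insists on an explicit bounds check. */
    if ((void *)(len + 1) > data)
        return XDP_ABORTED;

    /* data_end - data is the packet length as XDP sees it; user-space can
     * later compare this against the AF_XDP descriptor's len field. */
    *len = ctx->data_end - ctx->data;
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```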
    <div>
      <h3>Two packets and one buffer?</h3>
      <a href="#two-packets-and-one-buffer">
        
      </a>
    </div>
    <p>Seeing the XDP and descriptor lengths swap in “pairs” was perhaps the first lightbulb moment. Were two different packets being written to the same buffer? This also revealed a key piece of information that we had failed to add to our debug prints: the descriptor address! We took this opportunity to print additional information like the packet bytes, and to print at multiple locations in the path to see if anything changed over time.</p><p>The real key piece of information that these debug prints revealed was that not only was each swapped “pair” sharing a descriptor address, but nearly every corrupt packet on a single server used the same descriptor address. Here you can see 49750 corrupt packets that all used descriptor address 69837056:</p>
            <pre><code>$ cat flowtrackd.service-2022-11-03.log | grep 87m237 | grep -o -E 'desc_addr: [[:digit:]]+' | sort | uniq -c
  49750 desc_addr: 69837056</code></pre>
            <p>This was the second lightbulb moment. Not only are we trying to copy two packets to the same buffer, but it is always the same buffer. Perhaps the problem is that this descriptor has been inserted into the AF_XDP rings twice? We tested this theory by updating our consumer code to check whether a batch of descriptors read from the RX ring ever contained the same descriptor twice. This check wouldn’t catch every duplicate, since there is no guarantee that the two copies will land in the same read batch, but we were lucky enough that it did catch the same descriptor twice in a single read, proving this was our issue. In hindsight, the <a href="https://www.kernel.org/doc/html/latest/networking/af_xdp.html">Linux kernel AF_XDP documentation</a> points out this very issue:</p><blockquote><p><i>Q: My packets are sometimes corrupted. What is wrong?</i></p></blockquote><blockquote><p><i>A: Care has to be taken not to feed the same buffer in the UMEM into more than one ring at the same time. If you for example feed the same buffer into the FILL ring and the TX ring at the same time, the NIC might receive data into the buffer at the same time it is sending it. This will cause some packets to become corrupted. Same thing goes for feeding the same buffer into the FILL rings belonging to different queue ids or netdevs bound with the XDP_SHARED_UMEM flag.</i></p></blockquote><p>We now understand <i>why</i> we have corrupt packets, but we still don’t understand how a descriptor ever ends up in the AF_XDP rings twice. I would love to blame this on a kernel bug, but as the documentation points out, it is more likely that we placed the descriptor in the ring twice in our application. Additionally, since this is listed as an FAQ for AF_XDP, we would need sufficient evidence that this is caused by a kernel bug and not user error before reporting it to the kernel mailing list(s).</p>
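<p>Our consumer code is written in Rust, but the batch check is easy to sketch in C (names are illustrative): scan each batch of addresses dequeued from the RX ring for a repeated descriptor address.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if any descriptor address appears twice in one RX batch.
 * Batches are small (tens of descriptors), so a quadratic scan is fine. */
static bool batch_has_duplicate(const uint64_t *addrs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (addrs[i] == addrs[j])
                return true;
    return false;
}
```

<p>A clean result here proves nothing (the two copies may land in different batches), but a positive result is conclusive.</p>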
    <div>
      <h3>Tracking descriptor transitions</h3>
      <a href="#tracking-descriptor-transitions">
        
      </a>
    </div>
    <p>Auditing our application code did not show any obvious location where we might be inserting the same descriptor address into either the FILL or TX ring twice. We do however know that descriptors transition through a set of known states, and we could track those transitions with a state machine. The below diagram shows all the possible valid transitions:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7c0EnogcWg13acmzNWQmiJ/6fc9f78824fc566a68da6893b24b7f0a/af_xdp_descriptor_transitions.png" />
            
            </figure><p>For example, a descriptor going from the RX ring to either the FILL or the TX ring is a perfectly valid transition. On the other hand, a descriptor going from the FILL ring to the COMP ring is an invalid transition.</p><p>To test the validity of the descriptor transitions, we added code to track their membership across the rings. This produced some of the following log messages:</p>
            <pre><code>Nov 16 23:49:01 fuzzer4 flowtrackd[45807]: thread 'flowtrackd-ZrBh' panicked at 'descriptor 26476800 transitioned from Fill to Tx'
Nov 17 02:09:01 fuzzer4 flowtrackd[45926]: thread 'flowtrackd-Ay0i' panicked at 'descriptor 18422016 transitioned from Comp to Rx'
Nov 29 10:52:08 fuzzer4 flowtrackd[83849]: thread 'flowtrackd-5UYF' panicked at 'descriptor 3154176 transitioned from Tx to Rx'</code></pre>
            <p>The first print shows that a descriptor was put on the FILL ring and transitioned directly to the TX ring without being read from the RX ring first. This appears to hint at a bug in our application, perhaps indicating that our application duplicates the descriptor, putting one copy in the FILL ring and the other in the TX ring.</p><p>The second invalid transition happened for a descriptor moving from the COMP ring to the RX ring without first being put on the FILL ring. This appears to hint at a kernel bug, perhaps indicating that the kernel duplicated a descriptor and put it in both the COMP ring and the RX ring.</p><p>The third invalid transition was from the TX to the RX ring without going through the FILL or COMP ring first. This seems like an extended case of the previous COMP to RX transition and again hints at a possible kernel bug.</p><p>Confused by the results, we double-checked our tracking code and attempted to find any possible way our application could duplicate a descriptor, putting it in both the FILL and TX rings. With no bugs found, we felt we needed to gather more information.</p>
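<p>The transition rules from the diagram reduce to a small table. A plain-C sketch of the checker (our real tracker keys this state by descriptor address; names here are illustrative):</p>

```c
#include <stdbool.h>

/* The ring a descriptor currently sits in, from user-space's point of view. */
enum ring { FILL, RX, TX, COMP };

/* Valid transitions per the state diagram: a descriptor placed in the FILL
 * ring can only come back via the RX ring; from RX it may be recycled to
 * FILL or forwarded to TX; from TX it can only return via COMP; from COMP
 * it may be reused for FILL or TX. Everything else is a bug. */
static bool valid_transition(enum ring from, enum ring to)
{
    switch (from) {
    case FILL: return to == RX;
    case RX:   return to == FILL || to == TX;
    case TX:   return to == COMP;
    case COMP: return to == FILL || to == TX;
    }
    return false;
}
```

<p>All three panics above fail this check: Fill to Tx, Comp to Rx, and Tx to Rx are each rejected.</p>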
    <div>
      <h3>Using ftrace as a “flight recorder”</h3>
      <a href="#using-ftrace-as-a-flight-recorder">
        
      </a>
    </div>
    <p>While the state machine was able to catch invalid descriptor transitions, it still lacked a number of important details that might help track down the ultimate cause of the bug. We still didn’t know if the bug was a kernel issue or an application issue. Confusingly, the transition states seemed to indicate it was both.</p><p>To gather more information, we ideally wanted to track the history of a descriptor. Since we were using a shared UMEM, a descriptor could in theory transition between interfaces and receive queues. Additionally, our application uses a single green thread to handle each XSK, so it might be interesting to track those descriptor transitions by XSK, CPU, and thread. A simple but unscalable way to achieve this would be to print this information at every transition point. This, of course, is not really an option for a production environment that needs to process millions of packets per second: both the amount of data produced and the overhead of printing it would be prohibitive.</p><p>Up to this point we had been carefully debugging this issue on production systems. The issue was rare enough that, even with our large production deployment, it might take a day for some machines to start displaying it. If we wanted to explore more resource-intensive debugging techniques, we needed to reproduce the issue in a test environment. For this we created 10 virtual machines that continuously load-tested our application with <a href="https://iperf.fr/">iperf</a>. Fortunately, with this setup we were able to reproduce the issue about once a day, giving us more freedom to try more resource-intensive debugging techniques.</p><p>Even in a virtual machine, printing logs at every descriptor transition doesn’t scale, but do you really need to see every transition? In theory, the most interesting events are the ones right before the bug occurs. We could build something that internally keeps a log of the last N events and only dumps that log when the bug occurs, much like the black-box flight recorder used in airplanes to track the events leading up to a crash. Fortunately for us, we don’t really need to build this and can instead use the Linux kernel’s <a href="https://www.kernel.org/doc/html/latest/trace/ftrace.html">ftrace</a> feature, which has some additional capabilities that might ultimately help us track down the cause of this bug.</p><p>ftrace is a kernel feature that operates by internally keeping a set of per-CPU ring buffers of trace events. Each event stored in the ring buffer is time-stamped and contains some additional information about the context where the event occurred: the CPU, and what process or thread was running at the time of the event. Since these events are stored in per-CPU ring buffers, once a ring is full, new events overwrite the oldest events, leaving a log of the most recent events on that CPU. Effectively, this is the flight recorder we desired; all we need to do is add our own events to the ftrace ring buffers and disable tracing when the bug occurs.</p><p>ftrace is controlled using virtual files in the debugfs filesystem. Tracing can be enabled and disabled by writing either a 1 or a 0 to:</p><p><code>/sys/kernel/debug/tracing/tracing_on</code></p><p>We can update our application to insert our own events into the tracing ring buffer by writing our messages into the trace_marker file:</p><p><code>/sys/kernel/debug/tracing/trace_marker</code></p><p>And finally, after we’ve reproduced the bug and our application has disabled tracing, we can extract the contents of all the ring buffers into a single trace by reading the trace file:</p><p><code>/sys/kernel/debug/tracing/trace</code></p><p>It is worth noting that writing messages to the trace_marker virtual file still involves making a system call and copying your message into the ring buffers. 
This still adds overhead, and in our case, where we log several messages per packet, that overhead might be significant. Additionally, ftrace is a system-wide kernel tracing feature, so you may need to either adjust the permissions of the virtual files, or run your application with the appropriate permissions.</p><p>There is, of course, one more big advantage of using ftrace to assist in debugging this issue. As shown above, we can log our own application messages to ftrace using the trace_marker file, but at its core ftrace is a kernel tracing feature. This means that we can additionally use ftrace to log events from the kernel side of the AF_XDP packet processing. There are several ways to do this, but for our purposes we used kprobes so that we could target very specific lines of code and print some variables. kprobes can be created directly in ftrace, but I find it easier to create them using the “perf probe” command of the Linux perf tool. Using the “-L” and “-V” arguments, you can find which lines of a function can be probed and which variables can be viewed at those probe points. Finally, you can add the probe with the “-a” argument. For example, after examining the kernel code, we inserted the following probe in the receive path of an XSK:</p><p><code>perf probe -a '__xsk_rcv_zc:7 addr len xs xs-&gt;pool-&gt;fq xs-&gt;dev'</code></p><p>This will probe line 7 of __xsk_rcv_zc() and print the descriptor address, the packet length, the XSK address, the fill queue address and the net device address. For context, here is what <a href="https://elixir.bootlin.com/linux/v5.15.77/source/net/xdp/xsk.c#L152"><code>__xsk_rcv_zc()</code></a> looks like from the perf probe command:</p>
            <pre><code>$ perf probe -L __xsk_rcv_zc
      0  static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
         {
                struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
                u64 addr;
                int err;
         
                addr = xp_get_handle(xskb);
      7         err = xskq_prod_reserve_desc(xs-&gt;rx, addr, len);
      8         if (err) {
                        xs-&gt;rx_queue_full++;
                        return err;
                }</code></pre>
            <p>In our case, line 7 is the call to <code>xskq_prod_reserve_desc()</code>. At this point in the code, the kernel has already removed a descriptor from the FILL queue and copied a packet into that descriptor. The call to <code>xskq_prod_reserve_desc()</code> ensures that there is space in the RX queue and, if there is, adds the descriptor to it. It is important to note that while <code>xskq_prod_reserve_desc()</code> puts the descriptor in the RX queue, it does not update the producer pointer of the RX ring or notify the XSK that packets are ready to be read, because the kernel tries to batch these operations.</p><p>Similarly, we wanted to place a probe in the transmit path on the kernel side and ultimately placed the following probe:</p><p><code>perf probe -a 'xp_raw_get_data:0 addr'</code></p><p>There isn’t much interesting to show here in <a href="https://elixir.bootlin.com/linux/v5.15.77/source/net/xdp/xsk_buff_pool.c#L542">the code</a>, but this probe is placed at a location where descriptors have been removed from the TX queue but have not yet been put in the COMPLETION queue.</p><p>For both of these probes, it would have been nice to place them at the earliest locations where descriptors were added or removed from the XSK queues, and to print as much information as possible there. However, in practice the locations where kprobes can be placed and the variables available at those locations limit what can be seen.</p><p>With the probes created, we still need to enable them in ftrace. This can be done with:</p>
            <pre><code>echo 1 &gt; /sys/kernel/debug/tracing/events/probe/__xsk_rcv_zc_L7/enable
echo 1 &gt; /sys/kernel/debug/tracing/events/probe/xp_raw_get_data/enable</code></pre>
            <p>With our application updated to trace the transition of every descriptor and to stop tracing when an invalid transition occurred, we were ready to test again.</p>
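<p>On the application side, emitting events into the flight recorder is just a write to the trace_marker file, and freezing the recorder is a write of “0” to tracing_on. A sketch, with the path parameterized so the same helper serves both files (and can be exercised against an ordinary file):</p>

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append one message to an ftrace control file, e.g.
 * /sys/kernel/debug/tracing/trace_marker for events, or
 * /sys/kernel/debug/tracing/tracing_on (message "0") to stop tracing.
 * Returns 0 on success, -1 on failure. */
static int trace_write(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, msg, strlen(msg));
    close(fd);
    return n == (ssize_t)strlen(msg) ? 0 : -1;
}
```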
    <div>
      <h3>Tracking descriptor state is not enough</h3>
      <a href="#tracking-descriptor-state-is-not-enough">
        
      </a>
    </div>
    <p>Unfortunately, our initial test of our “flight recorder” didn’t immediately tell us anything new. Instead, it mostly confirmed what we already knew, which was that somehow we would end up in a state with the same descriptor twice. It also highlighted the fact that catching an invalid descriptor transition doesn’t mean you have caught the earliest point where the duplicate descriptor appeared. For example, assume we have descriptor A and its duplicate A’. If both are already present in the FILL queue, it is perfectly valid to see:</p>
            <pre><code>RX A  -&gt; FILL A
RX A’ -&gt; FILL A’</code></pre>
            <p>This can occur for many cycles before an invalid transition eventually occurs, when both descriptors are seen either in the same batch or across queues.</p><p>Instead, we needed to rethink our approach. We knew that the kernel removes descriptors from the FILL queue, fills them, and places them in the RX queue. This means that, for any given XSK, the order that descriptors are inserted into the FILL queue should match the order that they come out of the RX queue. If a descriptor was ever duplicated in this kernel RX path, we should see the duplicate descriptor appear out of order. With this in mind, we updated our application to independently track the order of the FILL queue using a double-ended queue. As our application puts descriptors into the FILL queue, we also push the descriptor address onto the tail of our tracking queue, and when we receive packets we pop the descriptor address from the head of our tracking queue and ensure the addresses match. If they ever don’t match, we can again log to trace_marker and stop ftrace.</p><p>Below is the end of the first trace we captured with the updated code tracking the order of the FILL to RX queues. The color has been added to improve readability:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/uSATZc25o3ESPjeXWQVUX/9967b0a78fd476266ca5daf42b5cc177/Screenshot-2023-01-16-at-14.16.47.png" />
            
            </figure><p>Here you can see the power of our ftrace flight recorder. For example, we can follow the full cycle of descriptor <span>0x16ce900</span> as it is first received in the kernel, received by our application which forwards the packet by adding to the TX queue, the kernel transmitting, and finally our application receiving the completion and placing the descriptor back in the FILL queue.</p>
<p>
The trace starts to get interesting on the next two packets received by the kernel. We can see <span>0x160a100</span> received first in the kernel and then by our application. However, things go wrong when the kernel receives <span>0x13d3900</span> but our application receives <span>0x1229100</span>. The last print of the trace shows the result of our descriptor order tracking. We can see that the kernel side appears to match our next expected descriptor and the next two descriptors, yet unexpectedly we see <span>0x1229100</span> arrive out of nowhere. We do think that the descriptor is present in the FILL queue, but much further down it. Another potentially interesting detail is that between <span>0x160a100</span> and <span>0x13d3900</span>, the kernel’s softirq switches from CPU 1 to CPU 2.</p>
<p>If you recall, our __xsk_rcv_zc_L7 kprobe was placed on the call to xskq_prod_reserve_desc() which adds the descriptor to the RX queue. Below we can examine that function to see if there are any clues on how the descriptor address received by our application could be different from what we think should have been inserted by the kernel.</p>
            <pre><code>static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
                                     	u64 addr, u32 len)
{
    	struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q-&gt;ring;
    	u32 idx;
 
    	if (xskq_prod_is_full(q))
            	return -ENOBUFS;
 
    	/* A, matches D */
    	idx = q-&gt;cached_prod++ &amp; q-&gt;ring_mask;
    	ring-&gt;desc[idx].addr = addr;
    	ring-&gt;desc[idx].len = len;
 
    	return 0;
}</code></pre>
            <p>Here you can see that the queue’s cached_prod pointer is incremented before the descriptor address and length are updated. As the name implies, the cached_prod pointer isn’t the actual producer pointer, which means that at some point <code>xsk_flush()</code> must be called to sync the cached_prod pointer with the prod pointer and actually expose the newly received descriptors to user-space. Perhaps there is a race where <code>xsk_flush()</code> is called after updating the <code>cached_prod</code> pointer, but before the actual descriptor address has been updated in the ring? If this were to occur, our application would see the old descriptor address from that slot in the RX queue, causing us to “duplicate” that descriptor.</p><p>We can test our theory by making two more changes. First, we can update our application to write back a known “poisoned” descriptor address to each RX queue slot after we have received a packet. In this case we chose <code>0xdeadbeefdeadbeef</code> as our known invalid address, and if we ever receive this value back out of the RX queue, we know a race has occurred and exposed an uninitialized descriptor. The second change we can make is to add a kprobe on <code>xsk_flush()</code> to see if we can actually capture the race in the trace.</p><p><code>perf probe -a 'xsk_flush:0 xs'</code></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Xx2qaoywvGvlb0VsenLAw/c0ec0249ac0f1ca77829fd6a39228624/Screenshot-2023-01-16-at-14.16.05.png" />
            
            </figure><p>Here we appear to have our smoking gun. As we predicted, we can see that <span><span>xsk_flush()</span></span> is called on CPU 0 while a softirq is currently in progress on CPU 2. After the flush our application sees the expected <span><span><span>0xff0900</span></span></span> filled in from the softirq on CPU 2, and then <span><span>0xdeadbeefdeadbeef</span></span>, which is our poisoned uninitialized descriptor address.</p>
<p>We now have evidence that the following order of operations is happening:</p>
            <pre><code>CPU 2                                                   CPU 0
-----------------------------------                     --------------------------------
__xsk_rcv_zc(struct xdp_sock *xs):                      xsk_flush(struct xdp_sock *xs):
                                        
idx = xs-&gt;rx-&gt;cached_prod++ &amp; xs-&gt;rx-&gt;ring_mask; 
                                                        // Flush the cached pointer as the new head pointer of
                                                        // the RX ring.
                                                        smp_store_release(&amp;xs-&gt;rx-&gt;ring-&gt;producer, xs-&gt;rx-&gt;cached_prod);

                                                        // Notify user-side that new descriptors have been produced to
                                                        // the RX ring.
                                                        sock_def_readable(&amp;xs-&gt;sk);

                                                        // flowtrackd reads a descriptor "too soon" where the addr
                                                        // and/or len fields have not yet been updated.
xs-&gt;rx-&gt;ring-&gt;desc[idx].addr = addr;
xs-&gt;rx-&gt;ring-&gt;desc[idx].len = len;</code></pre>
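<p>This interleaving can be reproduced deterministically with a toy single-threaded model of the RX ring (plain C, greatly simplified; the poison value mirrors our debug build):</p>

```c
#include <stdint.h>

/* A toy single-producer ring modeling the kernel's RX ring. */
#define RING_SIZE 4
#define POISON 0xdeadbeefdeadbeefULL

struct toy_ring {
    uint64_t desc[RING_SIZE];
    uint32_t cached_prod;  /* kernel-private producer position */
    uint32_t producer;     /* position visible to user-space   */
    uint32_t consumer;
};

/* Step 1 of __xsk_rcv_zc(): claim a slot by bumping cached_prod. */
static uint32_t rcv_reserve(struct toy_ring *r)
{
    return r->cached_prod++ & (RING_SIZE - 1);
}

/* Step 2: actually write the descriptor into the claimed slot. */
static void rcv_write(struct toy_ring *r, uint32_t idx, uint64_t addr)
{
    r->desc[idx] = addr;
}

/* xsk_flush(): publish everything up to cached_prod to user-space. */
static void flush(struct toy_ring *r)
{
    r->producer = r->cached_prod;
}

/* User-space consume: read a published slot, then poison it so a
 * re-exposed stale slot is detectable. Returns 0 if nothing published. */
static uint64_t consume(struct toy_ring *r)
{
    if (r->consumer == r->producer)
        return 0;
    uint32_t idx = r->consumer++ & (RING_SIZE - 1);
    uint64_t addr = r->desc[idx];
    r->desc[idx] = POISON;
    return addr;
}
```

<p>Running reserve, then flush (from the “other CPU”), then consume returns the stale, poisoned contents of the slot: publishing before the descriptor write is exactly the window our application hit.</p>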
            <p>The AF_XDP documentation states that: <i>“All rings are single-producer/single-consumer, so the user-space application needs explicit synchronization if multiple processes/threads are reading/writing to them.”</i> The explicit synchronization requirement must also apply on the kernel side. How can two operations on the RX ring of a socket run at the same time?</p><p>On Linux, a mechanism called <a href="https://wiki.linuxfoundation.org/networking/napi">NAPI</a> prevents CPU interrupts from occurring every time a packet is received by the network interface. It instructs the network driver to process a certain number of packets at a frequent interval. For the veth driver, that polling function is called <a href="https://elixir.bootlin.com/linux/v5.15.77/source/drivers/net/veth.c#L906">veth_poll</a>, and it is <a href="https://elixir.bootlin.com/linux/v5.15.77/source/drivers/net/veth.c#L1015">registered</a> as the function handler for each queue of the XDP-enabled network device. A NAPI-compliant network driver provides the guarantee that the processing of the packets tied to a NAPI context (<code>struct napi_struct *napi</code>) will not happen at the same time on multiple processors. In our case, a NAPI context exists for each queue of the device, which means one per AF_XDP socket and its associated set of ring buffers (RX, TX, FILL, COMPLETION).</p>
            <pre><code>static int veth_poll(struct napi_struct *napi, int budget)
{
	struct veth_rq *rq =
		container_of(napi, struct veth_rq, xdp_napi);
	struct veth_stats stats = {};
	struct veth_xdp_tx_bq bq;
	int done;

	bq.count = 0;

	xdp_set_return_frame_no_direct();
	done = veth_xdp_rcv(rq, budget, &amp;bq, &amp;stats);

	if (done &lt; budget &amp;&amp; napi_complete_done(napi, done)) {
		/* Write rx_notify_masked before reading ptr_ring */
		smp_store_mb(rq-&gt;rx_notify_masked, false);
		if (unlikely(!__ptr_ring_empty(&amp;rq-&gt;xdp_ring))) {
			if (napi_schedule_prep(&amp;rq-&gt;xdp_napi)) {
				WRITE_ONCE(rq-&gt;rx_notify_masked, true);
				__napi_schedule(&amp;rq-&gt;xdp_napi);
			}
		}
	}

	if (stats.xdp_tx &gt; 0)
		veth_xdp_flush(rq, &amp;bq);
	if (stats.xdp_redirect &gt; 0)
		xdp_do_flush();
	xdp_clear_return_frame_no_direct();

	return done;
}</code></pre>
            <p><code>veth_xdp_rcv()</code> processes as many packets as the budget variable allows, marks the NAPI processing as complete, potentially reschedules NAPI polling, and <b>then</b> calls <code>xdp_do_flush()</code>, breaking the NAPI guarantee cited above. After the call to <code>napi_complete_done()</code>, any CPU is free to execute the <code>veth_poll()</code> function before all the flush operations of the previous call are complete, allowing the race on the RX ring.</p><p>The race condition can be fixed by completing all the packet processing before signaling the NAPI poll as complete. The patch, as well as the discussion on the kernel mailing list that led to the fix, are available here: <a href="https://lore.kernel.org/bpf/20221220185903.1105011-1-sbohrer@cloudflare.com/">[PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors</a>. The patch was recently merged upstream.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We’ve found and fixed a race condition in the Linux virtual ethernet (veth) driver that was corrupting packets for AF_XDP-enabled devices!</p><p>This issue was a tough one to find (and to reproduce), but methodical iteration led us all the way down to the internals of the Linux kernel, where we saw that a few lines of code were not executed in the correct order.</p><p>A rigorous methodology and knowledge of the right debugging tools are essential for tracking down the root cause of potentially complex bugs.</p><p>This was important for us to fix because, while TCP was designed to recover from occasional packet drops, randomly dropping legitimate packets slightly increased the latency of connection establishment and data transfers across our network.</p><p>Interested in other deep-dive kernel debugging journeys? Read more of them on <a href="/">our blog</a>!</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">5bnyvGSAmPfUzfkmoNLY40</guid>
            <dc:creator>Shawn Bohrer</dc:creator>
            <dc:creator>Bastien Dhiver</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Linux Kernel Key Retention Service and why you should use it in your next application]]></title>
            <link>https://blog.cloudflare.com/the-linux-kernel-key-retention-service-and-why-you-should-use-it-in-your-next-application/</link>
            <pubDate>Mon, 28 Nov 2022 14:57:20 GMT</pubDate>
            <description><![CDATA[ Many leaks happen because of software bugs and security vulnerabilities. In this post we will learn how the Linux kernel can help protect cryptographic keys from a whole class of potential security vulnerabilities: memory access violations. ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LKKOrcwGlRDUpSWhMkxe4/406961181fbe307f99573e8fbc13a0b0/unnamed-5.png" />
            
            </figure><p>We want our digital data to be safe. We want to visit websites, send bank details, type passwords, sign documents online, log in to remote computers, encrypt data before storing it in databases and be sure that nobody can tamper with it. Cryptography can provide a high degree of data security, but we need to protect the cryptographic keys.</p><p>At the same time, we can’t just store a key somewhere safe and access it only occasionally. Quite the opposite: it’s involved in every request where we perform crypto-operations. If a site supports TLS, the private key is used to establish each connection.</p><p>Unfortunately, cryptographic keys sometimes leak, and when that happens, it is a big problem. Many leaks happen because of software bugs and security vulnerabilities. In this post we will learn how the Linux kernel can help protect cryptographic keys from a whole class of potential security vulnerabilities: memory access violations.</p>
    <div>
      <h3>Memory access violations</h3>
      <a href="#memory-access-violations">
        
      </a>
    </div>
    <p>According to the <a href="https://www.nsa.gov/Press-Room/News-Highlights/Article/Article/3215760/nsa-releases-guidance-on-how-to-protect-against-software-memory-safety-issues/">NSA</a>, around 70% of vulnerabilities in both Microsoft's and Google's code were related to memory safety issues. One consequence of incorrect memory accesses is leaking security data (including cryptographic keys). Cryptographic keys are just some (mostly random) data stored in memory, so they may be subject to memory leaks like any other in-memory data. The example below shows how a cryptographic key may accidentally leak via stack memory reuse:</p><p>broken.c</p>
            <pre><code>#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

static void encrypt(void)
{
    uint8_t key[] = "hunter2";
    printf("encrypting with super secret key: %s\n", key);
}

static void log_completion(void)
{
    /* oh no, we forgot to init the msg */
    char msg[8];
    printf("not important, just fyi: %s\n", msg);
}

int main(void)
{
    encrypt();
    /* notify that we're done */
    log_completion();
    return 0;
}</code></pre>
            <p>Compile and run our program:</p>
            <pre><code>$ gcc -o broken broken.c
$ ./broken 
encrypting with super secret key: hunter2
not important, just fyi: hunter2</code></pre>
            <p>Oops, we printed the secret key in the “fyi” logger instead of the intended log message! There are two problems with the code above:</p><ul><li><p>we didn’t securely destroy the key in our pseudo-encryption function (by overwriting the key data with zeroes, for example), when we finished using it</p></li><li><p>our buggy logging function has access to any memory within our process</p></li></ul><p>And while we can probably easily fix the first problem with some additional code, the second problem is the inherent result of how software runs inside the operating system.</p><p>Each process is given a block of contiguous virtual memory by the operating system. This allows the kernel to share limited computer resources among several simultaneously running processes. The approach is called <a href="https://en.wikipedia.org/wiki/Virtual_memory">virtual memory management</a>. Inside the virtual memory a process has its own address space and doesn’t have access to the memory of other processes, but it can access any memory within its own address space. In our example we are interested in a piece of process memory called the stack.</p><p>The stack consists of stack frames. A stack frame is dynamically allocated space for the currently running function. It contains the function’s local variables, arguments and return address. When compiling a function, the compiler calculates how much memory needs to be allocated and requests a stack frame of this size. Once a function finishes execution, the stack frame is marked as free and can be used again. A stack frame is a logical block: it doesn’t provide any boundary checks, and it’s not erased, just marked as free. Additionally, the virtual memory is a contiguous block of addresses. Both of these facts make it possible for malicious or buggy code to access data from anywhere within the virtual memory.</p><p>The stack of our program <code>broken.c</code> will look like:</p><img src="https://imagedelivery.net/52R3oh4H-57qkVChwuo3Ag/3526edee-ce7e-4f98-a2bf-ff1efd2fc800/public" /><p>At the beginning we have the stack frame of the main function. Next, <code>main()</code> calls <code>encrypt()</code>, whose frame is placed on the stack immediately below <code>main()</code>’s (the stack grows downwards). Inside <code>encrypt()</code> the compiler requests 8 bytes for the <code>key</code> variable (7 bytes of data + the terminating null character). When <code>encrypt()</code> finishes execution, the same memory addresses are taken by <code>log_completion()</code>. Inside <code>log_completion()</code> the compiler allocates eight bytes for the <code>msg</code> variable, which happens to land at the same place on the stack where our private key was stored before. The memory for <code>msg</code> was only allocated, not initialized; the data from the previous function is left as is.</p><p>In addition to code bugs, programming languages provide unsafe functions known for memory-safety vulnerabilities. For C, such functions include <code>printf()</code>, <code>strcpy()</code> and <code>gets()</code>. The function <code>printf()</code> doesn’t check how many arguments must be passed to replace all placeholders in the format string. The function arguments are placed on the stack above the function stack frame; <code>printf()</code> fetches arguments according to the number and type of placeholders, easily running past its actual arguments and reading data from neighboring stack frames.</p><p>The NSA advises us to use memory-safe languages like Python, Go and Rust. But will that completely protect us?</p><p>The Python runtime will definitely check boundaries in many cases for you and raise an error:</p>
            <pre><code>&gt;&gt;&gt; print("x: {}, y: {}, {}".format(1, 2))
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in &lt;module&gt;
IndexError: Replacement index 2 out of range for positional args tuple</code></pre>
            <p>However, this is a quote from one of 36 (for now) <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-10210/opov-1/Python.html">vulnerabilities</a>:</p><blockquote><p><i>Python 2.7.14 is vulnerable to a Heap-Buffer-Overflow as well as a Heap-Use-After-Free.</i></p></blockquote><p>Golang has its own list of <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-14185/opov-1/Golang.html">overflow vulnerabilities</a>, and has an <a href="https://pkg.go.dev/unsafe">unsafe package</a>. The name of the package speaks for itself, usual rules and checks don’t work inside this package.</p>
    <div>
      <h3>Heartbleed</h3>
      <a href="#heartbleed">
        
      </a>
    </div>
    <p>In 2014, the Heartbleed bug was discovered. OpenSSL, the most widely used cryptography library at the time, could leak private keys. We experienced it <a href="/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed/">too</a>.</p>
    <div>
      <h3>Mitigation</h3>
      <a href="#mitigation">
        
      </a>
    </div>
    <p>So memory bugs are a fact of life, and we can’t really fully protect ourselves from them. But, given that cryptographic keys are much more valuable than other data, can we at least do a better job of protecting the keys?</p><p>As we already said, a memory address space is normally associated with a process. And two different processes don’t share memory by default, so they are naturally isolated from each other. Therefore, a potential memory bug in one of the processes will not accidentally leak a cryptographic key from another process. The security of ssh-agent builds on this principle. There are always two processes involved: a client/requester and the <a href="https://linux.die.net/man/1/ssh-agent">agent</a>.</p><blockquote><p><i>The agent will never send a private key over its request channel. Instead, operations that require a private key will be performed by the agent, and the result will be returned to the requester. This way, private keys are not exposed to clients using the agent.</i></p></blockquote><p>A requester is usually a network-facing process and/or processing untrusted input. Therefore, the requester is much more likely to be susceptible to memory-related vulnerabilities, but in this scheme it would never have access to cryptographic keys (because keys reside in a separate process address space) and, thus, can never leak them.</p><p>At Cloudflare, we employ the same principle in <a href="/heartbleed-revisited/">Keyless SSL</a>. Customer private keys are stored in an isolated environment and protected from Internet-facing connections.</p>
    <div>
      <h3>Linux Kernel Key Retention Service</h3>
      <a href="#linux-kernel-key-retention-service">
        
      </a>
    </div>
    <p>The client/requester and agent approach provides better protection for secrets or cryptographic keys, but it brings some drawbacks:</p><ul><li><p>we need to develop and maintain two different programs instead of one</p></li><li><p>we also need to design a well-defined interface for communication between the two processes</p></li><li><p>we need to implement the communication support between two processes (Unix sockets, shared memory, etc.)</p></li><li><p>we might need to authenticate and support ACLs between the processes, as we don’t want any requester on our system to be able to use our cryptographic keys stored inside the agent</p></li><li><p>we need to ensure the agent process is up and running, when working with the client/requester process</p></li></ul><p>What if we replace the agent process with the Linux kernel itself?</p><ul><li><p>it is already running on our system (otherwise our software would not work)</p></li><li><p>it has a well-defined interface for communication (system calls)</p></li><li><p>it can enforce various ACLs on kernel objects</p></li><li><p>and it runs in a separate address space!</p></li></ul><p>Fortunately, the <a href="https://www.kernel.org/doc/html/v6.0/security/keys/core.html">Linux Kernel Key Retention Service</a> can perform all the functions of a typical agent process and probably even more!</p><p>Initially it was designed for kernel services like dm-crypt/ecryptfs, but was later opened up to userspace programs. It gives us some advantages:</p><ul><li><p>the keys are stored outside the process address space</p></li><li><p>the well-defined interface and the communication layer are implemented via syscalls</p></li><li><p>the keys are kernel objects and so have associated permissions and ACLs</p></li><li><p>the key lifecycle can be implicitly bound to the process lifecycle</p></li></ul><p>The Linux Kernel Key Retention Service operates with two types of entities: keys and keyrings, where a keyring is a key of a special type. By analogy with files and directories, we can say a key is a file and a keyring is a directory. Moreover, they form a key hierarchy similar to a filesystem tree: keyrings reference keys and other keyrings, but only keys can hold the actual cryptographic material, similar to files holding the actual data.</p><p>Keys have types. The type of a key determines which operations can be performed on it. For example, keys of the user and logon types can hold arbitrary blobs of data, but logon keys can never be read back into userspace; they are exclusively used by in-kernel services.</p><p>For the purposes of using the kernel instead of an agent process, the most interesting key type is the <a href="https://man7.org/linux/man-pages/man7/asymmetric.7.html">asymmetric type</a>. It can hold a private key inside the kernel and provides the ability for allowed applications to either decrypt or sign some data with the key. Currently, only RSA keys are supported, but work is underway to add <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">ECDSA key support</a>.</p><p>While keys are responsible for safeguarding the cryptographic material inside the kernel, keyrings determine key lifetime and shared access. In its simplest form, when a particular keyring is destroyed, all the keys that are linked only to that keyring are securely destroyed as well. We can create custom keyrings manually, but probably one of the most powerful features of the service is the “special keyrings”.</p><p>These keyrings are created implicitly by the kernel and their lifetime is bound to the lifetime of a different kernel object, like a process or a user. Currently there are four categories of “implicit” <a href="https://man7.org/linux/man-pages/man7/keyrings.7.html">keyrings</a>, but for the purposes of this post we’re interested in the two most widely used ones: process keyrings and user keyrings.</p><p>A user keyring’s lifetime is bound to the existence of a particular user, and the keyring is shared between all processes running under the same UID. Thus, one process, for example, can store a key in a user keyring and another process running as the same user can retrieve/use the key. When the UID is removed from the system, all the keys (and other keyrings) under the associated user keyring will be securely destroyed by the kernel.</p><p>Process keyrings are bound to some processes and may be of three types differing in semantics: process, thread and session. A process keyring is bound and private to a particular process. Thus, any code within the process can store/use keys in the keyring, but other processes (even ones with the same user ID, or child processes) cannot get access. And when the process dies, the keyring and the associated keys are securely destroyed. Besides the advantage of storing our secrets/keys in an isolated address space, the process keyring gives us the guarantee that the keys will be destroyed regardless of the reason for the process termination: even if our application crashed hard without being given an opportunity to execute any cleanup code, our keys will still be securely destroyed by the kernel.</p><p>A thread keyring is similar to a process keyring, but it is private and bound to a particular thread. For example, we can build a multithreaded web server, which can serve TLS connections using multiple private keys, and we can be sure that connections/code in one thread can never use a private key associated with another thread (one serving a different domain name, for example).</p><p>A session keyring makes its keys available to the current process and all its children. It is destroyed when the topmost process terminates, and child processes can store and access keys as long as the topmost process exists. It is mostly useful in shell and interactive environments, where we employ the <a href="https://man7.org/linux/man-pages/man1/keyctl.1.html">keyctl tool</a> to access the Linux Kernel Key Retention Service rather than the kernel system call interface. In the shell, we generally can’t use the process keyring, as every executed command creates a new process. Thus, if we add a key to the process keyring from the command line, that key will be immediately destroyed, because the “adding” process terminates when the command finishes executing. Let’s actually confirm this with <a href="https://github.com/iovisor/bpftrace"><code>bpftrace</code></a>.</p><p>In one terminal we will trace the <a href="https://elixir.bootlin.com/linux/v5.19.17/source/security/keys/user_defined.c#L146"><code>user_destroy</code></a> function, which is responsible for deleting a user key:</p>
            <pre><code>$ sudo bpftrace -e 'kprobe:user_destroy { printf("destroying key %d\n", ((struct key *)arg0)-&gt;serial) }'
Attaching 1 probe...</code></pre>
            <p>And in another terminal let’s try to add a key to the process keyring:</p>
            <pre><code>$ keyctl add user mykey hunter2 @p
742524855</code></pre>
            <p>Going back to the first terminal we can immediately see:</p>
            <pre><code>Attaching 1 probe...
destroying key 742524855</code></pre>
            <p>And we can confirm the key is not available by trying to access it:</p>
            <pre><code>$ keyctl print 742524855
keyctl_read_alloc: Required key not available</code></pre>
            <p>So in the above example, the key “mykey” was added to the process keyring of the subshell executing <code>keyctl add user mykey hunter2 @p</code>. But since the subshell process terminated the moment the command was executed, both its process keyring and the added key were destroyed.</p><p>Instead, the session keyring allows our interactive commands to add keys to our current shell environment and subsequent commands to consume them. The keys will still be securely destroyed when our main shell process terminates (likely when we log out from the system).</p><p>So by selecting the appropriate keyring type we can ensure the keys will be securely destroyed when no longer needed, even if the application crashes! This is a very brief introduction, but it will let you play with our examples; for the full context, please refer to the <a href="https://www.kernel.org/doc/html/v5.8/security/keys/core.html">official documentation</a>.</p>
    <div>
      <h3>Replacing the ssh-agent with the Linux Kernel Key Retention Service</h3>
      <a href="#replacing-the-ssh-agent-with-the-linux-kernel-key-retention-service">
        
      </a>
    </div>
    <p>We gave a long description of how the Linux Kernel Key Retention Service can take over the role of the agent in a two-process design. It’s time to put our words into code. We talked about ssh-agent as well, so it will be a good exercise to replace a private key stored in the memory of the agent with an in-kernel one. We picked the most popular SSH implementation, <a href="https://github.com/openssh/openssh-portable.git">OpenSSH</a>, as our target.</p><p>Some minor changes to the code are needed to add the ability to retrieve a key from the kernel:</p><p>openssh.patch</p>
            <pre><code>diff --git a/ssh-rsa.c b/ssh-rsa.c
index 6516ddc1..797739bb 100644
--- a/ssh-rsa.c
+++ b/ssh-rsa.c
@@ -26,6 +26,7 @@
 
 #include &lt;stdarg.h&gt;
 #include &lt;string.h&gt;
+#include &lt;stdbool.h&gt;
 
 #include "sshbuf.h"
 #include "compat.h"
@@ -63,6 +64,7 @@ ssh_rsa_cleanup(struct sshkey *k)
 {
 	RSA_free(k-&gt;rsa);
 	k-&gt;rsa = NULL;
+	k-&gt;serial = 0;
 }
 
 static int
@@ -220,9 +222,14 @@ ssh_rsa_deserialize_private(const char *ktype, struct sshbuf *b,
 	int r;
 	BIGNUM *rsa_n = NULL, *rsa_e = NULL, *rsa_d = NULL;
 	BIGNUM *rsa_iqmp = NULL, *rsa_p = NULL, *rsa_q = NULL;
+	bool is_keyring = (strncmp(ktype, "ssh-rsa-keyring", strlen("ssh-rsa-keyring")) == 0);
 
+	if (is_keyring) {
+		if ((r = ssh_rsa_deserialize_public(ktype, b, key)) != 0)
+			goto out;
+	}
 	/* Note: can't reuse ssh_rsa_deserialize_public: e, n vs. n, e */
-	if (!sshkey_is_cert(key)) {
+	else if (!sshkey_is_cert(key)) {
 		if ((r = sshbuf_get_bignum2(b, &amp;rsa_n)) != 0 ||
 		    (r = sshbuf_get_bignum2(b, &amp;rsa_e)) != 0)
 			goto out;
@@ -232,28 +239,46 @@ ssh_rsa_deserialize_private(const char *ktype, struct sshbuf *b,
 		}
 		rsa_n = rsa_e = NULL; /* transferred */
 	}
-	if ((r = sshbuf_get_bignum2(b, &amp;rsa_d)) != 0 ||
-	    (r = sshbuf_get_bignum2(b, &amp;rsa_iqmp)) != 0 ||
-	    (r = sshbuf_get_bignum2(b, &amp;rsa_p)) != 0 ||
-	    (r = sshbuf_get_bignum2(b, &amp;rsa_q)) != 0)
-		goto out;
-	if (!RSA_set0_key(key-&gt;rsa, NULL, NULL, rsa_d)) {
-		r = SSH_ERR_LIBCRYPTO_ERROR;
-		goto out;
-	}
-	rsa_d = NULL; /* transferred */
-	if (!RSA_set0_factors(key-&gt;rsa, rsa_p, rsa_q)) {
-		r = SSH_ERR_LIBCRYPTO_ERROR;
-		goto out;
-	}
-	rsa_p = rsa_q = NULL; /* transferred */
 	if ((r = sshkey_check_rsa_length(key, 0)) != 0)
 		goto out;
-	if ((r = ssh_rsa_complete_crt_parameters(key, rsa_iqmp)) != 0)
-		goto out;
-	if (RSA_blinding_on(key-&gt;rsa, NULL) != 1) {
-		r = SSH_ERR_LIBCRYPTO_ERROR;
-		goto out;
+
+	if (is_keyring) {
+		char *name;
+		size_t len;
+
+		if ((r = sshbuf_get_cstring(b, &amp;name, &amp;len)) != 0)
+			goto out;
+
+		key-&gt;serial = request_key("asymmetric", name, NULL, KEY_SPEC_PROCESS_KEYRING);
+		free(name);
+
+		if (key-&gt;serial == -1) {
+			key-&gt;serial = 0;
+			r = SSH_ERR_KEY_NOT_FOUND;
+			goto out;
+		}
+	} else {
+		if ((r = sshbuf_get_bignum2(b, &amp;rsa_d)) != 0 ||
+			(r = sshbuf_get_bignum2(b, &amp;rsa_iqmp)) != 0 ||
+			(r = sshbuf_get_bignum2(b, &amp;rsa_p)) != 0 ||
+			(r = sshbuf_get_bignum2(b, &amp;rsa_q)) != 0)
+			goto out;
+		if (!RSA_set0_key(key-&gt;rsa, NULL, NULL, rsa_d)) {
+			r = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
+		rsa_d = NULL; /* transferred */
+		if (!RSA_set0_factors(key-&gt;rsa, rsa_p, rsa_q)) {
+			r = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
+		rsa_p = rsa_q = NULL; /* transferred */
+		if ((r = ssh_rsa_complete_crt_parameters(key, rsa_iqmp)) != 0)
+			goto out;
+		if (RSA_blinding_on(key-&gt;rsa, NULL) != 1) {
+			r = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
 	}
 	/* success */
 	r = 0;
@@ -333,6 +358,21 @@ rsa_hash_alg_nid(int type)
 	}
 }
 
+static const char *
+rsa_hash_alg_keyctl_info(int type)
+{
+	switch (type) {
+	case SSH_DIGEST_SHA1:
+		return "enc=pkcs1 hash=sha1";
+	case SSH_DIGEST_SHA256:
+		return "enc=pkcs1 hash=sha256";
+	case SSH_DIGEST_SHA512:
+		return "enc=pkcs1 hash=sha512";
+	default:
+		return NULL;
+	}
+}
+
 int
 ssh_rsa_complete_crt_parameters(struct sshkey *key, const BIGNUM *iqmp)
 {
@@ -433,7 +473,14 @@ ssh_rsa_sign(struct sshkey *key,
 		goto out;
 	}
 
-	if (RSA_sign(nid, digest, hlen, sig, &amp;len, key-&gt;rsa) != 1) {
+	if (key-&gt;serial &gt; 0) {
+		len = keyctl_pkey_sign(key-&gt;serial, rsa_hash_alg_keyctl_info(hash_alg), digest, hlen, sig, slen);
+		if ((long)len == -1) {
+			ret = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
+	}
+	else if (RSA_sign(nid, digest, hlen, sig, &amp;len, key-&gt;rsa) != 1) {
 		ret = SSH_ERR_LIBCRYPTO_ERROR;
 		goto out;
 	}
@@ -705,6 +752,18 @@ const struct sshkey_impl sshkey_rsa_impl = {
 	/* .funcs = */		&amp;sshkey_rsa_funcs,
 };
 
+const struct sshkey_impl sshkey_rsa_keyring_impl = {
+	/* .name = */		"ssh-rsa-keyring",
+	/* .shortname = */	"RSA",
+	/* .sigalg = */		NULL,
+	/* .type = */		KEY_RSA,
+	/* .nid = */		0,
+	/* .cert = */		0,
+	/* .sigonly = */	0,
+	/* .keybits = */	0,
+	/* .funcs = */		&amp;sshkey_rsa_funcs,
+};
+
 const struct sshkey_impl sshkey_rsa_cert_impl = {
 	/* .name = */		"ssh-rsa-cert-v01@openssh.com",
 	/* .shortname = */	"RSA-CERT",
diff --git a/sshkey.c b/sshkey.c
index 43712253..3524ad37 100644
--- a/sshkey.c
+++ b/sshkey.c
@@ -115,6 +115,7 @@ extern const struct sshkey_impl sshkey_ecdsa_nistp521_cert_impl;
 #  endif /* OPENSSL_HAS_NISTP521 */
 # endif /* OPENSSL_HAS_ECC */
 extern const struct sshkey_impl sshkey_rsa_impl;
+extern const struct sshkey_impl sshkey_rsa_keyring_impl;
 extern const struct sshkey_impl sshkey_rsa_cert_impl;
 extern const struct sshkey_impl sshkey_rsa_sha256_impl;
 extern const struct sshkey_impl sshkey_rsa_sha256_cert_impl;
@@ -154,6 +155,7 @@ const struct sshkey_impl * const keyimpls[] = {
 	&amp;sshkey_dss_impl,
 	&amp;sshkey_dsa_cert_impl,
 	&amp;sshkey_rsa_impl,
+	&amp;sshkey_rsa_keyring_impl,
 	&amp;sshkey_rsa_cert_impl,
 	&amp;sshkey_rsa_sha256_impl,
 	&amp;sshkey_rsa_sha256_cert_impl,
diff --git a/sshkey.h b/sshkey.h
index 771c4bce..a7ae45f6 100644
--- a/sshkey.h
+++ b/sshkey.h
@@ -29,6 +29,7 @@
 #include &lt;sys/types.h&gt;
 
 #ifdef WITH_OPENSSL
+#include &lt;keyutils.h&gt;
 #include &lt;openssl/rsa.h&gt;
 #include &lt;openssl/dsa.h&gt;
 # ifdef OPENSSL_HAS_ECC
@@ -153,6 +154,7 @@ struct sshkey {
 	size_t	shielded_len;
 	u_char	*shield_prekey;
 	size_t	shield_prekey_len;
+	key_serial_t serial;
 };
 
 #define	ED25519_SK_SZ	crypto_sign_ed25519_SECRETKEYBYTES</code></pre>
            <p>We need to download and patch OpenSSH from the latest git sources, as the above patch won’t apply to the latest release (<code>V_9_1_P1</code> at the time of this writing):</p>
            <pre><code>$ git clone https://github.com/openssh/openssh-portable.git
…
$ cd openssh-portable
$ patch -p1 &lt; ../openssh.patch
patching file ssh-rsa.c
patching file sshkey.c
patching file sshkey.h</code></pre>
            <p>Now compile and build the patched OpenSSH</p>
            <pre><code>$ autoreconf
$ ./configure --with-libs=-lkeyutils --disable-pkcs11
…
$ make
…</code></pre>
            <p>Note that we instruct the build system to additionally link with <a href="https://man7.org/linux/man-pages/man3/keyctl.3.html"><code>libkeyutils</code></a>, which provides convenient wrappers to access the Linux Kernel Key Retention Service. Additionally, we had to disable PKCS#11 support, as the OpenSSH code has a function with the same name as one in <code>libkeyutils</code>, so there is a naming conflict. There might be a better fix for this, but it is out of scope for this post.</p><p>Now that we have the patched OpenSSH, let’s test it. First, we need to generate a new SSH RSA key that we will use to access the system. Because the Linux kernel only supports private keys in the PKCS8 format, we’ll use it from the start (instead of the default OpenSSH format):</p>
            <pre><code>$ ./ssh-keygen -b 4096 -m PKCS8
Generating public/private rsa key pair.
…</code></pre>
            <p>Normally, we would use <code>ssh-add</code> to add this key to our ssh-agent. In our case we need a replacement script, which adds the key to our current session keyring:</p><p>ssh-add-keyring.sh</p>
            <pre><code>#!/bin/bash -e

in=$1
key_desc=$2
keyring=$3

in_pub=$in.pub
key=$(mktemp)
out="${in}_keyring"

function finish {
    rm -rf $key
}
trap finish EXIT

# https://github.com/openssh/openssh-portable/blob/master/PROTOCOL.key
# null-terminated openssh-key-v1
printf 'openssh-key-v1\0' &gt; $key
# cipher: none
echo '00000004' | xxd -r -p &gt;&gt; $key
echo -n 'none' &gt;&gt; $key
# kdf: none
echo '00000004' | xxd -r -p &gt;&gt; $key
echo -n 'none' &gt;&gt; $key
# no kdf options
echo '00000000' | xxd -r -p &gt;&gt; $key
# one key in the blob
echo '00000001' | xxd -r -p &gt;&gt; $key

# grab the hex public key without the (00000007 || ssh-rsa) preamble
pub_key=$(awk '{ print $2 }' $in_pub | base64 -d | xxd -s 11 -p | tr -d '\n')
# size of the following public key with the (0000000f || ssh-rsa-keyring) preamble
printf '%08x' $(( ${#pub_key} / 2 + 19 )) | xxd -r -p &gt;&gt; $key
# preamble for the public key
# ssh-rsa-keyring is prepended with the length of the string
echo '0000000f' | xxd -r -p &gt;&gt; $key
echo -n 'ssh-rsa-keyring' &gt;&gt; $key
# the public key itself
echo $pub_key | xxd -r -p &gt;&gt; $key

# the private key is just a key description in the Linux keyring
# ssh will use it to actually find the corresponding key serial
# grab the comment from the public key
comment=$(awk '{ print $3 }' $in_pub)
# so the total size of the private key is
# two times the same 4 byte int +
# (0000000f || ssh-rsa-keyring) preamble +
# a copy of the public key (without preamble) +
# (size || key_desc) +
# (size || comment )
priv_sz=$(( 8 + 19 + ${#pub_key} / 2 + 4 + ${#key_desc} + 4 + ${#comment} ))
# we need to pad the size to 8 bytes
pad=$(( 8 - $(( priv_sz % 8 )) ))
# so, total private key size
printf '%08x' $(( $priv_sz + $pad )) | xxd -r -p &gt;&gt; $key
# repeated 4-byte int
echo '0102030401020304' | xxd -r -p &gt;&gt; $key
# preamble for the private key
echo '0000000f' | xxd -r -p &gt;&gt; $key
echo -n 'ssh-rsa-keyring' &gt;&gt; $key
# public key
echo $pub_key | xxd -r -p &gt;&gt; $key
# private key description in the keyring
printf '%08x' ${#key_desc} | xxd -r -p &gt;&gt; $key
echo -n $key_desc &gt;&gt; $key
# comment
printf '%08x' ${#comment} | xxd -r -p &gt;&gt; $key
echo -n $comment &gt;&gt; $key
# padding
for (( i = 1; i &lt;= $pad; i++ )); do
    echo 0$i | xxd -r -p &gt;&gt; $key
done

echo '-----BEGIN OPENSSH PRIVATE KEY-----' &gt; $out
base64 $key &gt;&gt; $out
echo '-----END OPENSSH PRIVATE KEY-----' &gt;&gt; $out
chmod 600 $out

# load the PKCS8 private key into the designated keyring
openssl pkcs8 -in $in -topk8 -outform DER -nocrypt | keyctl padd asymmetric $key_desc $keyring
</code></pre>
            <p>Depending on how our kernel was compiled, we might also need to load some kernel modules for asymmetric private key support:</p>
            <pre><code>$ sudo modprobe pkcs8_key_parser
$ ./ssh-add-keyring.sh ~/.ssh/id_rsa myssh @s
Enter pass phrase for ~/.ssh/id_rsa:
723263309</code></pre>
            <p>Finally, our private SSH key is added to the current session keyring with the name “myssh”. In addition, the <code>ssh-add-keyring.sh</code> script creates a pseudo-private key file in <code>~/.ssh/id_rsa_keyring</code>, which needs to be passed to the main <code>ssh</code> process. It is a pseudo-private key, because it doesn’t contain any sensitive cryptographic material. Instead, it only holds the “myssh” identifier in a native OpenSSH format. If we use multiple SSH keys, we have to somehow tell the main <code>ssh</code> process which in-kernel key name should be requested from the system.</p><p>Before we start testing it, let’s make sure our SSH server (running locally) will accept the newly generated key for authentication:</p>
            <pre><code>$ cat ~/.ssh/id_rsa.pub &gt;&gt; ~/.ssh/authorized_keys</code></pre>
            <p>Now we can try to SSH into the system:</p>
            <pre><code>$ SSH_AUTH_SOCK="" ./ssh -i ~/.ssh/id_rsa_keyring localhost
The authenticity of host 'localhost (::1)' can't be established.
ED25519 key fingerprint is SHA256:3zk7Z3i9qZZrSdHvBp2aUYtxHACmZNeLLEqsXltynAY.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.
Linux dev 5.15.79-cloudflare-2022.11.6 #1 SMP Mon Sep 27 00:00:00 UTC 2010 x86_64
…</code></pre>
            <p>It worked! Notice that we’re resetting the <code>SSH_AUTH_SOCK</code> environment variable to make sure we don’t use any keys from an ssh-agent running on the system. Still, the login flow does not request any password for our private key: the key itself resides in the kernel address space, and we reference it by its serial number for signature operations.</p>
    <div>
      <h3>User or session keyring?</h3>
      <a href="#user-or-session-keyring">
        
      </a>
    </div>
    <p>In the example above, we set up our SSH private key into the session keyring. We can check if it is there:</p>
            <pre><code>$ keyctl show
Session Keyring
 577779279 --alswrv   1000  1000  keyring: _ses
 846694921 --alswrv   1000 65534   \_ keyring: _uid.1000
 723263309 --als--v   1000  1000   \_ asymmetric: myssh</code></pre>
            <p>We might have used the user keyring as well. What is the difference? Currently, the “myssh” key’s lifetime is limited to the current login session. That is, if we log out and log in again, the key will be gone, and we would have to run the <code>ssh-add-keyring.sh</code> script again. Similarly, if we log in to a second terminal, we won’t see this key:</p>
            <pre><code>$ keyctl show
Session Keyring
 333158329 --alswrv   1000  1000  keyring: _ses
 846694921 --alswrv   1000 65534   \_ keyring: _uid.1000</code></pre>
            <p>Notice that the serial number of the session keyring <code>_ses</code> in the second terminal is different. A new session keyring was created, and the “myssh” key, along with the previous session keyring, doesn’t exist anymore:</p>
            <pre><code>$ SSH_AUTH_SOCK="" ./ssh -i ~/.ssh/id_rsa_keyring localhost
Load key "/home/ignat/.ssh/id_rsa_keyring": key not found
…</code></pre>
            <p>If instead we tell <code>ssh-add-keyring.sh</code> to load the private key into the user keyring (replace <code>@s</code> with <code>@u</code> in the command line parameters), it will be available and accessible from both login sessions, and the same key will survive logout and re-login. However, this has a security downside: any process running under our user ID will be able to access and use the key.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this post we learned about one of the most common ways that data, including highly valuable cryptographic keys, can leak. We talked about some real examples, which impacted many users around the world, including Cloudflare. Finally, we learned how the Linux Kernel Key Retention Service can help us protect our cryptographic keys and secrets.</p><p>We also introduced a working patch for OpenSSH to use this cool feature of the Linux kernel, so you can easily try it yourself. There are still many Linux Kernel Key Retention Service features left to explore, which might be a topic for another blog post. Stay tuned!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">1w40uuzCCDLItmJBNEkbkq</guid>
            <dc:creator>Oxana Kharitonova</dc:creator>
            <dc:creator>Ignat Korchagin</dc:creator>
        </item>
        <item>
            <title><![CDATA[When the window is not fully open, your TCP stack is doing more than you think]]></title>
            <link>https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/</link>
            <pubDate>Tue, 26 Jul 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection ]]></description>
            <content:encoded><![CDATA[ <p>Over the years I've been lurking around the Linux kernel and have investigated the TCP code many times. But when recently we were working on <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Optimizing TCP for high WAN throughput while preserving low latency</a>, I realized I have gaps in my knowledge about how Linux manages TCP receive buffers and windows. As I dug deeper I found the subject complex and certainly non-obvious.</p><p>In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection. Specifically, looking for answers to seemingly trivial questions:</p><ul><li><p>How much data can be stored in the TCP receive buffer? (it's not what you think)</p></li><li><p>How fast can it be filled? (it's not what you think either!)</p></li></ul><p>Our exploration focuses on the receiving side of the TCP connection. We'll try to understand how to tune it for the best speed, without wasting precious memory.</p>
    <div>
      <h3>A case of a rapid upload</h3>
      <a href="#a-case-of-a-rapid-upload">
        
      </a>
    </div>
    <p>To best illustrate the receive side buffer management we need pretty charts! But to grasp all the numbers, we need a bit of theory.</p><p>We'll draw charts from a receive side of a TCP flow, running a pretty straightforward scenario:</p><ul><li><p>The client opens a TCP connection.</p></li><li><p>The client does <code>send()</code>, and pushes as much data as possible.</p></li><li><p>The server doesn't <code>recv()</code> any data. We expect all the data to stay and wait in the receive queue.</p></li><li><p>We fix the SO_RCVBUF for better illustration.</p></li></ul><p>Simplified pseudocode might look like (<a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py">full code if you dare</a>):</p>
            <pre><code>from socket import *

sd = socket(AF_INET, SOCK_STREAM, 0)
sd.bind(('127.0.0.3', 1234))
sd.listen(32)

cd = socket(AF_INET, SOCK_STREAM, 0)
cd.setsockopt(SOL_SOCKET, SO_RCVBUF, 32*1024)
cd.connect(('127.0.0.3', 1234))

ssd, _ = sd.accept()

while True:
    cd.send(b'a'*128*1024)</code></pre>
            <p>We're interested in basic questions:</p><ul><li><p>How much data can fit in the server’s receive buffer? It turns out it's not exactly the same as the default read buffer size on Linux; we'll get there.</p></li><li><p>Assuming infinite bandwidth, what is the minimal time  - measured in <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> - for the client to fill the receive buffer?</p></li></ul>
    <div>
      <h3>A bit of theory</h3>
      <a href="#a-bit-of-theory">
        
      </a>
    </div>
    <p>Let's start by establishing some common nomenclature. I'll follow the wording used by the <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> Linux tool from the <code>iproute2</code> package</a>.</p><p>First, there is the buffer budget limit. <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> manpage</a> calls it <b>skmem_rb</b>, in the kernel it's named <b>sk_rcvbuf</b>. This value is most often controlled by the Linux autotune mechanism using the <code>net.ipv4.tcp_rmem</code> setting:</p>
            <pre><code>$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456</code></pre>
            <p>Alternatively, it can be set manually with <code>setsockopt(SO_RCVBUF)</code> on a socket. Note that the kernel doubles the value given to this setsockopt. For example, SO_RCVBUF=16384 will result in skmem_rb=32768. The maximum value allowed for this setsockopt is limited to a meager 208KiB by default:</p>
            <pre><code>$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992</code></pre>
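            <p>The doubling is easy to observe from user space. Here is a short illustrative snippet (assuming a Linux kernel; the constants come from the standard <code>socket</code> module) that requests a 16KiB receive buffer and reads back the effective value:</p>

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# ask the kernel for a 16 KiB receive buffer...
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16384)
# ...on Linux, getsockopt reports the doubled value (skmem_rb)
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective)  # 32768 on Linux
```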
            <p><a href="/optimizing-tcp-for-high-throughput-and-low-latency/">The aforementioned blog post</a> discusses why manual buffer size management is problematic - relying on autotuning is generally preferable.</p><p>Here’s a diagram showing how <b>skmem_rb</b> budget is being divided:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3EEuOnbl8CKYCv4oWj5Ejw/4a0bf778f484bbddebfac4099d8e21f4/image2-17.png" />
            
            </figure><p>At any given moment, we can think of the budget as being divided into four parts:</p><ul><li><p><b>Recv-q</b>: part of the buffer budget occupied by actual application bytes awaiting <code>read()</code>.</p></li><li><p>Another part is consumed by metadata handling - the cost of <b>struct sk_buff</b> and such.</p></li><li><p>Those two parts together are reported by <code>ss</code> as <b>skmem_r</b> - the kernel name is <b>sk_rmem_alloc</b>.</p></li><li><p>What remains is "free", that is: it's not actively used yet.</p></li><li><p>However, a portion of this "free" region is the advertised window - it may become occupied with application data soon.</p></li><li><p>The remainder will be used for future metadata handling, or might be divided into the advertised window further in the future.</p></li></ul><p>The upper limit for the window is configured by the <code>tcp_adv_win_scale</code> setting. By default, the window is set to at most 50% of the "free" space. The value can be clamped further by the TCP_WINDOW_CLAMP option or an internal <code>rcv_ssthresh</code> variable.</p>
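            <p>The mapping from "free" space to the window's upper bound can be sketched in a few lines. This mirrors the logic of the kernel's <code>tcp_win_from_space()</code> (a simplified sketch for illustration, not the actual kernel code):</p>

```python
def tcp_win_from_space(space, tcp_adv_win_scale):
    # Positive scale reserves 1/2**scale of the space for metadata;
    # zero or negative scale advertises space / 2**(-scale) directly.
    if tcp_adv_win_scale > 0:
        return space - space // (2 ** tcp_adv_win_scale)
    return space // (2 ** -tcp_adv_win_scale)

print(tcp_win_from_space(65536, 1))  # default: 50% of free space -> 32768
print(tcp_win_from_space(65536, 2))  # 75% -> 49152
print(tcp_win_from_space(65536, 0))  # 100% -> 65536
```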
    <div>
      <h3>How much data can a server receive?</h3>
      <a href="#how-much-data-can-a-server-receive">
        
      </a>
    </div>
    <p>Our first question was "How much data can a server receive?". A naive reader might think it's simple: if the server has a receive buffer set to say 64KiB, then the client will surely be able to deliver 64KiB of data!</p><p>But this is totally not how it works. To illustrate this, allow me to temporarily set sysctl <code>tcp_adv_win_scale=0</code>. This is not a default and, as we'll learn, it's the wrong thing to do. With this setting the server will indeed set 100% of the receive buffer as an advertised window.</p><p>Here's our setup:</p><ul><li><p>The client tries to send as fast as possible.</p></li><li><p>Since we are interested in the receiving side, we can cheat a bit and speed up the sender arbitrarily. The client has transmission congestion control disabled: we set initcwnd=10000 as the route option.</p></li><li><p>The server has a fixed <b>skmem_rb</b> set at 64KiB.</p></li><li><p>The server has <code><b>tcp_adv_win_scale=0</b></code>.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/44j6HUJ496dIXVMkltUe4O/1765b3f25ef767dfcb23d3c079f7e8cb/image6-10.png" />
            
            </figure><p>There are so many things here! Let's try to digest it. First, the X axis is the ingress packet number (we saw about 65). The Y axis shows the buffer sizes as seen on the receive path for every packet.</p><ul><li><p>First, the purple line is the buffer size limit in bytes - <b>skmem_rb</b>. In our experiment we called <code>setsockopt(SO_RCVBUF)=32K</code> and skmem_rb is double that value. Notice that by setting SO_RCVBUF we disabled the Linux autotune mechanism.</p></li><li><p>The green <b>recv-q</b> line shows how many application bytes are available in the receive socket. It grows linearly with each received packet.</p></li><li><p>Then there is the blue <b>skmem_r</b>, the used data + metadata cost in the receive socket. It grows just like <b>recv-q</b> but a bit faster, since it accounts for the cost of the metadata the kernel needs to deal with.</p></li><li><p>The orange <b>rcv_win</b> is the advertised window. We start with 64KiB (100% of skmem_rb) and go down as the data arrives.</p></li><li><p>Finally, the dotted line shows <b>rcv_ssthresh</b>, which is not important yet; we'll get there.</p></li></ul>
    <div>
      <h3>Running over the budget is bad</h3>
      <a href="#running-over-the-budget-is-bad">
        
      </a>
    </div>
    <p>It's super important to notice that we finished with <b>skmem_r</b> higher than <b>skmem_rb</b>! This is rather unexpected, and undesired. The whole point of the <b>skmem_rb</b> memory budget is, well, not to exceed it. Here's how <code>ss</code> shows it:</p>
            <pre><code>$ ss -m
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   
tcp    ESTAB  62464   0       127.0.0.3:1234      127.0.0.2:1235
     skmem:(r73984,rb65536,...)</code></pre>
            <p>As you can see, skmem_rb is 65536 and skmem_r is 73984, which is 8448 bytes over! When this happens we have an even bigger issue on our hands. At around the 62nd packet we have an advertised window of 3072 bytes, but while packets are being sent, the receiver is unable to process them! This is easily verifiable by inspecting an nstat TcpExtTCPRcvQDrop counter:</p>
            <pre><code>$ nstat -az TcpExtTCPRcvQDrop
TcpExtTCPRcvQDrop    13    0.0</code></pre>
            <p>In our run 13 packets were dropped. This variable counts the number of packets dropped due to either system-wide or per-socket memory pressure - we know we hit the latter. In our case, soon after the socket memory limit was crossed, new packets were prevented from being enqueued to the socket. This happened even though the TCP advertised window was still open.</p><p>This results in an interesting situation. The receiver's window is open, which might indicate it has resources to handle the data. But that's not always the case, like in our example when it runs out of the memory budget.</p><p>The sender will interpret the packet loss as network congestion and will run the usual retry mechanisms, including exponential backoff. Whether this behavior is desired or undesired depends on your point of view. On one hand, no data will be lost: the sender can eventually deliver all the bytes reliably. On the other hand, the exponential backoff logic might stall the sender for a long time, causing a noticeable delay.</p><p>The root of the problem is straightforward - the Linux kernel's <b>skmem_rb</b> sets a memory budget for both the <b>data</b> and the <b>metadata</b> which reside on the socket. In a pessimistic case each packet might incur the cost of a <b>struct sk_buff</b> + <b>struct skb_shared_info</b>, which on my system is 576 bytes, on top of the actual payload size, plus memory waste due to network card buffer alignment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7nJyE7p1rtHK9SvSTDnZoj/c02019aeed1e3b17f24b506b4eeaef36/image7-10.png" />
            
            </figure><p>We now understand that Linux can't just advertise 100% of the memory budget as an advertised window. Some budget must be reserved for metadata and such. The upper limit of window size is expressed as a fraction of the "free" socket budget. It is controlled by <code>tcp_adv_win_scale</code>, with the following values:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZfsgDbgLiLQ0HXUV5mVeK/31e596946e101fef2443896f9db8fcdb/image9-5.png" />
            
            </figure><p>By default, Linux sets the advertised window at most at 50% of the remaining buffer space.</p><p>Even with 50% of space "reserved" for metadata, the kernel is very smart and tries hard to reduce the metadata memory footprint. It has two mechanisms for this:</p><ul><li><p><b>TCP Coalesce</b> - on the happy path, Linux is able to throw away <b>struct sk_buff</b>. It can do so, by just linking the data to the previously enqueued packet. You can think about it as if it was <a href="https://www.spinics.net/lists/netdev/msg755359.html">extending the last packet on the socket</a>.</p></li><li><p><b>TCP Collapse</b> - when the memory budget is hit, Linux runs "collapse" code. Collapse rewrites and defragments the receive buffer from many small skb's into a few very long segments - therefore reducing the metadata cost.</p></li></ul><p>Here's an extension to our previous chart showing these mechanisms in action:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KhHEeAUvJ6rinNLoRBwd1/36b733fddcbb885d8db5b076602ca168/image3-10.png" />
            
            </figure><p><b>TCP Coalesce</b> is a very effective measure and works behind the scenes at all times. In the bottom chart, the packets where the coalesce was engaged are shown with a pink line. You can see - the <b>skmem_r</b> bumps (blue line) are clearly correlated with a <b>lack</b> of coalesce (pink line)! The nstat TcpExtTCPRcvCoalesce counter might be helpful in debugging coalesce issues.</p><p>The <b>TCP Collapse</b> is a bigger gun. <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Mike wrote about it extensively</a>, and <a href="/the-story-of-one-latency-spike/">I wrote a blog post years ago, when the latency of TCP collapse hit us hard</a>. In the chart above, the collapse is shown as a red circle. We clearly see it being engaged after the socket memory budget is reached - from packet number 63. The nstat TcpExtTCPRcvCollapsed counter is relevant here. This value growing is a bad sign and might indicate bad latency spikes - especially when dealing with larger buffers. Normally collapse is supposed to be run very sporadically. A <a href="https://lore.kernel.org/lkml/20120510173135.615265392@linuxfoundation.org/">prominent kernel developer describes</a> this pessimistic situation:</p><blockquote><p>This also means tcp advertises a too optimistic window for a given allocated rcvspace: When receiving frames, <code>sk_rmem_alloc</code> can hit <code>sk_rcvbuf</code> limit and we call <code>tcp_collapse()</code> too often, especially when application is slow to drain its receive queue [...] This is a major latency source.</p></blockquote><p>If the memory budget remains exhausted after the collapse, Linux will drop ingress packets. In our chart it's marked as a red "X". The nstat TcpExtTCPRcvQDrop counter shows the count of dropped packets.</p>
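            <p>A quick back-of-the-envelope calculation shows why these mechanisms matter. Using the ~576-byte per-packet overhead measured above (the exact number is system-dependent) and assuming full 1448-byte payloads with no coalescing at all, a 64KiB budget fills up well before 64KiB of payload has arrived:</p>

```python
SKB_OVERHEAD = 576   # struct sk_buff + skb_shared_info, from the text above
MSS = 1448           # assumed full-sized TCP payload per packet
BUDGET = 64 * 1024   # skmem_rb from the experiment

per_packet = MSS + SKB_OVERHEAD          # worst case: no coalescing
packets = BUDGET // per_packet
payload = packets * MSS

print(packets, payload)  # 32 packets, 46336 bytes of actual payload
```

<p>In other words, without coalescing only about 46KiB of a 64KiB budget would hold application bytes - which is why the kernel works so hard to merge incoming skbs.</p>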
    <div>
      <h3>rcv_ssthresh predicts the metadata cost</h3>
      <a href="#rcv_ssthresh-predicts-the-metadata-cost">
        
      </a>
    </div>
    <p>Perhaps counter-intuitively, the memory cost of a packet can be much larger than the amount of actual application data contained in it. It depends on a number of things:</p><ul><li><p><b>Network card</b>: some network cards always allocate a full page (4096, or even 16KiB) per packet, no matter how small or large the payload.</p></li><li><p><b>Payload size</b>: shorter packets will have a worse metadata-to-content ratio, since <b>struct skb</b> is comparatively larger.</p></li><li><p>Whether XDP is being used.</p></li><li><p>L2 header size: things like Ethernet, VLAN tags, and tunneling can add up.</p></li><li><p>Cache line size: many kernel structs are cache line aligned. On systems with larger cache lines, they will use more memory (see P4 or S390X architectures).</p></li></ul><p>The first two factors are the most important. Here's a run when the sender was specially configured to make the metadata cost bad and the coalesce ineffective (the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py#L90">details of the setup are messy</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Oo38G0pRDoxfqkcIE7D9Y/c372a9cba2402cee11c14fc815875ea3/image1-10.png" />
            
            </figure><p>You can see the kernel hitting TCP collapse multiple times, which is totally undesired. Each time a collapse runs, the kernel is likely to rewrite the full receive buffer. This whole kernel machinery, from reserving some space for metadata with tcp_adv_win_scale, via using coalesce to reduce the memory cost of each packet, up to the rcv_ssthresh limit, exists to avoid this very case of hitting collapse too often.</p><p>The kernel machinery most often works fine, and TCP collapse is rare in practice. However, we noticed that's not the case for certain types of traffic. One example is <a href="https://lore.kernel.org/lkml/CA+wXwBSGsBjovTqvoPQEe012yEF2eYbnC5_0W==EAvWH1zbOAg@mail.gmail.com/">websocket traffic with loads of tiny packets</a> and a slow reader. One <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L452">kernel comment talks about</a> such a case:</p>
            <pre><code>* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spaghetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.</code></pre>
            <p>Notice that the <b>rcv_ssthresh</b> line dropped down on the TCP collapse. This variable is an internal limit to the advertised window. By dropping it the kernel effectively says: hold on, I mispredicted the packet cost, next time I'm given an opportunity I'm going to open a smaller window. The kernel will advertise a smaller window and be more careful - all of this dance is done to avoid the collapse.</p>
    <div>
      <h3>Normal run - continuously updated window</h3>
      <a href="#normal-run-continuously-updated-window">
        
      </a>
    </div>
    <p>Finally, here's a chart from a normal run of a connection. Here, we use the default <code>tcp_adv_win_scale=1</code> (50%):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZlSO1vQxnHim1D8dav4Aa/5ce538b22b546d194df83130d9f39bc9/image5-13.png" />
            
            </figure><p>Early in the connection you can see <b>rcv_win</b> being continuously updated with each received packet. This makes sense: while the <b>rcv_ssthresh</b> and <b>tcp_adv_win_scale</b> restrict the advertised window to never exceed 32KiB, the window is sliding nicely as long as there is enough space. At packet 18 the receiver stops updating the window and waits a bit. At packet 32 the receiver decides there still is some space and updates the window again, and so on. At the end of the flow the socket has 56KiB of data. This 56KiB of data was received over a sliding window reaching at most 32KiB.</p><p>The saw blade pattern of rcv_win is caused by delayed ACKs. You can see the "<b>acked</b>" bytes as the red dashed line. Since the ACKs might be delayed, the receiver waits a bit before updating the window. If you want a smooth line, you can use the <code>quickack 1</code> per-route parameter, but this is not recommended since it will result in many small ACK packets flying over the wire.</p><p>In a normal connection we expect the majority of packets to be coalesced and the collapse/drop code paths never to be hit.</p>
    <div>
      <h3>Large receive windows - rcv_ssthresh</h3>
      <a href="#large-receive-windows-rcv_ssthresh">
        
      </a>
    </div>
    <p>For large bandwidth transfers over big latency links - big BDP case - it's beneficial to have a very wide advertised window. However, Linux takes a while to fully open large receive windows:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6UL4j5NH62FE1350yWnTcP/e257d2160e41a3f71aa3b727debc44fc/image8-4.png" />
            
            </figure><p>In this run, the <b>skmem_rb</b> is set to 2MiB. As opposed to the previous runs, the buffer budget is large and the receive window doesn't start with 50% of the skmem_rb! Instead it starts from 64KiB and grows linearly. It takes a while for Linux to ramp up the receive window to full size - ~800KiB in this case. The window is clamped by <b>rcv_ssthresh</b>. This variable starts at 64KiB and then grows at a rate of two full-MSS packets for each packet that has a "good" ratio of total size (truesize) to payload size.</p><p><a href="https://lore.kernel.org/lkml/CANn89i+mhqGaM2tuhgEmEPbbNu_59GGMhBMha4jnnzFE=UBNYg@mail.gmail.com/">Eric Dumazet writes</a> about this behavior:</p><blockquote><p>Stack is conservative about RWIN increase, it wants to receive packets to have an idea of the skb-&gt;len/skb-&gt;truesize ratio to convert a memory budget to RWIN. Some drivers have to allocate 16K buffers (or even 32K buffers) just to hold one segment (of less than 1500 bytes of payload), while others are able to pack memory more efficiently.</p></blockquote><p>This behavior of slow window opening is fixed, and not configurable in the vanilla kernel. <a href="https://lore.kernel.org/netdev/20220721151041.1215017-1-marek@cloudflare.com/#r">We prepared a kernel patch that allows starting with a higher rcv_ssthresh</a> based on the per-route option <code>initrwnd</code>:</p>
            <pre><code>$ ip route change local 127.0.0.0/8 dev lo initrwnd 1000</code></pre>
            <p>With the patch and the route change deployed, this is how the buffers look:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YE7Oolhn4ZQ9ihNi11HEL/af8c492bc03243e12b54541e954a3061/image4-12.png" />
            
            </figure><p>The advertised window is limited to 64KiB during the TCP handshake, but with our kernel patch enabled it's quickly bumped up to 1MiB in the first ACK packet afterwards. In both runs it took ~1800 packets to fill the receive buffer, but they took different amounts of time. In the first run the sender could push only 64KiB onto the wire in the second RTT. In the second run it could immediately push a full 1MiB of data.</p><p>This trick of aggressive window opening is not really necessary for most users. It's only helpful when:</p><ul><li><p>You have high-bandwidth TCP transfers over big-latency links.</p></li><li><p>The metadata + buffer alignment cost of your NIC is sensible and predictable.</p></li><li><p>Immediately after the flow starts your application is ready to send a lot of data.</p></li><li><p>The sender has configured a large <code>initcwnd</code>.</p></li><li><p>You care about shaving off every possible RTT.</p></li></ul>
    <p>On our systems we do have such flows, but arguably it might not be a common scenario. In the real world most of your TCP connections go to the nearest CDN point of presence, which is very close.</p>
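    <p>The ramp-up described earlier - rcv_ssthresh starting at 64KiB and gaining two full-MSS packets per well-behaved incoming packet - can be sanity-checked with rough arithmetic (assuming an MSS of 1448 bytes; an illustration, not kernel code):</p>

```python
MSS = 1448              # assumed payload bytes per full-sized packet
start = 64 * 1024       # initial rcv_ssthresh
target = 800 * 1024     # approximate full window in the 2 MiB-budget run

# each "good" packet raises the clamp by two full-MSS packets
packets_needed = (target - start) // (2 * MSS)
print(packets_needed)   # ~260 packets before the window is fully open
```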
    <div>
      <h3>Getting it all together</h3>
      <a href="#getting-it-all-together">
        
      </a>
    </div>
    <p>In this blog post, we discussed a seemingly simple case of a TCP sender filling up the receive socket. We tried to address two questions: with our isolated setup, how much data can be sent, and how quickly?</p><p>With the default settings of net.ipv4.tcp_rmem, Linux initially sets a memory budget of 128KiB for the receive data and metadata. On my system, given full-sized packets, it's able to eventually accept around 113KiB of application data.</p><p>Then, we showed that the receive window is not fully opened immediately. Linux keeps the receive window small, as it tries to predict the metadata cost and avoid overshooting the memory budget, therefore hitting TCP collapse. By default, with the net.ipv4.tcp_adv_win_scale=1, the upper limit for the advertised window is 50% of "free" memory. rcv_ssthresh starts up with 64KiB and grows linearly up to that limit.</p><p>On my system it took five window updates - six RTTs in total - to fill the 128KiB receive buffer. In the first batch the sender sent ~64KiB of data (remember we hacked the <code>initcwnd</code> limit), and then the sender topped it up with smaller and smaller batches until the receive window fully closed.</p><p>I hope this blog post is helpful and explains well the relationship between the buffer size and advertised window on Linux. Also, it describes the often misunderstood rcv_ssthresh which limits the advertised window in order to manage the memory budget and predict the unpredictable cost of metadata.</p><p>In case you wonder, similar mechanisms are in play in QUIC. The QUIC/H3 libraries though are still pretty young and don't have so many complex and mysterious toggles.... yet.</p><p>As always, <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-07-rmem-a">the code and instructions on how to reproduce the charts are available at our GitHub</a>.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">ROvfvY7ClXiGsjf1moUld</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[A story about AF_XDP, network namespaces and a cookie]]></title>
            <link>https://blog.cloudflare.com/a-story-about-af-xdp-network-namespaces-and-a-cookie/</link>
            <pubDate>Mon, 18 Jul 2022 12:56:42 GMT</pubDate>
            <description><![CDATA[ A crash in a development version of flowtrackd (the daemon that powers our Advanced TCP Protection) highlighted that libxdp (and specifically the AF_XDP part) was not Linux network namespace aware.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>A crash in a development version of <a href="/announcing-flowtrackd/">flowtrackd</a> (the daemon that powers our <a href="https://developers.cloudflare.com/ddos-protection/managed-rulesets/tcp-protection/">Advanced TCP Protection</a>) highlighted the fact that <a href="https://www.mankier.com/3/libxdp">libxdp</a> (and specifically the <a href="https://www.kernel.org/doc/html/latest/networking/af_xdp.html">AF_XDP</a> part) was not Linux <a href="https://man7.org/linux/man-pages/man7/network_namespaces.7.html">network namespace</a> aware.</p><p>This blogpost describes the debugging journey to find the bug, as well as a fix.</p><p><a href="/announcing-flowtrackd/">flowtrackd</a> is a volumetric denial of service defense mechanism that sits in the <a href="https://www.cloudflare.com/magic-transit/">Magic Transit</a> customer’s data path and protects the network from complex randomized TCP floods. It does so by challenging TCP connection establishments and by verifying that TCP packets make sense in an ongoing flow.</p><p>It uses the Linux kernel <a href="https://www.kernel.org/doc/html/latest/networking/af_xdp.html">AF_XDP</a> feature to transfer packets from a network device in kernel space to a memory buffer in user space without going through the network stack. We use most of the helper functions of the <a href="https://github.com/libbpf/libbpf/tree/v0.8.0">C libbpf</a> with the <a href="https://github.com/libbpf/libbpf-sys">Rust bindings</a> to interact with AF_XDP.</p><p>In our setup, both the ingress and the egress network interfaces are in different network namespaces. When a packet is determined to be valid (after a challenge or under some thresholds), it is forwarded to the second network interface.</p><p>For the rest of this post the network setup will be the following:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/gnH2TUlwgdNAxFjrw0vL6/be9a5939c06f5d10924c1942a35ee869/image13-3.png" />
            
            </figure><p>For example, eyeball packets arrive at the outer device in the root network namespace, are picked up by flowtrackd, and are then forwarded to the inner device in the inner-ns namespace.</p>
    <div>
      <h2>AF_XDP</h2>
      <a href="#af_xdp">
        
      </a>
    </div>
    <p>The kernel and the userspace share a memory buffer called the UMEM. This is where packet bytes are written to and read from.</p><p>The UMEM is split into contiguous, equal-sized "frames" that are referenced by "descriptors", which are just offsets from the start address of the UMEM.</p>
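    <p>A toy model helps make this concrete. The sketch below (illustrative Python, not libxdp code) treats the UMEM as one flat buffer and descriptors as plain byte offsets into it:</p>

```python
FRAME_SIZE = 2048
NUM_FRAMES = 4

# the UMEM: one contiguous buffer, split into equal-sized frames
umem = bytearray(FRAME_SIZE * NUM_FRAMES)

# a "descriptor" is just the byte offset of a frame inside the UMEM
descriptors = [i * FRAME_SIZE for i in range(NUM_FRAMES)]

def frame(desc):
    # resolve a descriptor back to the frame bytes it references
    return memoryview(umem)[desc:desc + FRAME_SIZE]

print(descriptors)  # [0, 2048, 4096, 6144]
```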
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48r1ecV9BRfVD8wNc103jB/8fe3ca45fc83d557af8086c27cfe91fa/image12-2.png" />
            
            </figure><p>The interactions and synchronization between the kernel and userspace happen via a set of queues (circular buffers) as well as a socket from the AF_XDP family.</p><p>Most of the work is about managing the ownership of the descriptors: which descriptors the kernel owns, and which descriptors the userspace owns.</p><p>The interface provided for ownership management is a set of queues:</p><table><tr><td><p><b>Queue</b></p></td><td><p><b>User space</b></p></td><td><p><b>Kernel space</b></p></td><td><p><b>Content description</b></p></td></tr><tr><td><p>COMPLETION</p></td><td><p>Consumes</p></td><td><p>Produces</p></td><td><p>Frame descriptors that have successfully been transmitted</p></td></tr><tr><td><p>FILL</p></td><td><p>Produces</p></td><td><p>Consumes</p></td><td><p>Frame descriptors ready to get new packet bytes written to</p></td></tr><tr><td><p>RX</p></td><td><p>Consumes</p></td><td><p>Produces</p></td><td><p>Frame descriptors of a newly received packet</p></td></tr><tr><td><p>TX</p></td><td><p>Produces</p></td><td><p>Consumes</p></td><td><p>Frame descriptors to be transmitted</p></td></tr></table><p>When the UMEM is created, a FILL and a COMPLETION queue are associated with it.</p><p>An RX and a TX queue are associated with the AF_XDP socket (abbreviated <b>Xsk</b>) at its creation. This particular socket is bound to a network device queue id. The userspace can then <code>poll()</code> on the socket to know when new descriptors are ready to be consumed from the RX queue, and to let the kernel deal with the descriptors that the application set on the TX queue.</p><p>The last plumbing operation needed to use AF_XDP is to load a BPF program (attached with XDP) on the network device we want to interact with, and to insert the Xsk file descriptor into a BPF map (of type XSKMAP).
Doing so will enable the BPF program to redirect incoming packets (with the <code>bpf_redirect_map()</code> <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">function</a>) to a specific socket that we created in userspace:</p>
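<p>As a concrete illustration, such a redirect program can be as small as this (a hedged sketch, not flowtrackd's actual program; the map name <code>xsks_map</code> and its size are made up for the example):</p>

```c
// Minimal XDP program: redirect every packet arriving on an RX queue
// to the AF_XDP socket registered in the XSKMAP for that queue id.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);          /* one slot per RX queue */
    __type(key, __u32);
    __type(value, __u32);             /* Xsk file descriptor */
} xsks_map SEC(".maps");

SEC("xdp")
int xsk_redirect(struct xdp_md *ctx)
{
    /* If an Xsk is bound to this queue, redirect to it;
       otherwise fall back to the regular network stack. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```

<p>A program of this shape would be compiled with <code>clang -O2 -target bpf</code> and attached to the device before the Xsk can receive anything.</p>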
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3K8DUwo8pARFEwtKmpRjp0/9b6e624d5d159b87e8bb3875f338312a/image4-9.png" />
            
            </figure><p>Once everything has been allocated and strapped together, what I call "the descriptors dance" can start. While this has nothing to do with courtship behaviors, it still requires flawless execution:</p><p>When the kernel (more specifically, the device driver) receives a packet, it writes the packet bytes to a UMEM frame (from a descriptor that the userspace put in the FILL queue) and then inserts the frame descriptor in the RX queue for the userspace to consume. The userspace can then read the packet bytes from the received descriptor, make a decision, and potentially send it back to the kernel for transmission by inserting the descriptor in the TX queue. The kernel can then transmit the content of the frame and move the descriptor from the TX queue to the COMPLETION queue. The userspace can then "recycle" this descriptor in the FILL or TX queue.</p><p>The overview of the queue interactions from the application perspective is represented in the following diagram (note that the queues contain descriptors that point to UMEM frames):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5TkCEuedruPJCPfzWidBjc/a600b902fea7894d12006afb8c2d46a4/image7-6.png" />
            
            </figure>
    <div>
      <h2>flowtrackd I/O rewrite project</h2>
      <a href="#flowtrackd-i-o-rewrite-project">
        
      </a>
    </div>
    <p>To increase flowtrackd's performance, and to be able to scale with the growth of the Magic Transit product, we decided to rewrite the I/O subsystem.</p><p>There will be a public blogpost about the technical aspects of the rewrite.</p><p>Prior to the rewrite, each customer had a dedicated flowtrackd instance (Unix process) that attached itself to dedicated network devices. A dedicated UMEM was created per network device (see schema on the left side below). The packets were copied from one UMEM to the other.</p><p>In this blogpost, we will only focus on the new usage of the <a href="https://www.kernel.org/doc/html/latest/networking/af_xdp.html#xdp-shared-umem-bind-flag">AF_XDP shared UMEM feature</a>, which enables us to handle all customer accounts with a single flowtrackd instance per server and with a single shared UMEM (see schema on the right side below).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5fzyIwrz5hCvFe5nvvPgf8/ed34ea06fa7d9f2ccc54e1df103ecde3/unnamed-4.png" />
            
            </figure><p>The Linux kernel documentation describes the additional plumbing steps to share a UMEM across multiple AF_XDP sockets:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7jo3Fu7TMp3riMHpFS2JVB/172fee5a98ad24fae110e75750590fe5/image14-3.png" />
            
            </figure><p>Followed by the instructions for our use case:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1SnkkPjseYCtlkpImqFUqN/4505c75ffd7ff6f07f851425c4f0eabe/image9-4.png" />
            
            </figure><p>Luckily for us, a helper function in libbpf does it all: <code>xsk_socket__create_shared()</code></p>
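<p>For reference, the helper's prototype in libbpf v0.8 looks like this; in shared-UMEM mode, each additional Xsk passes its own FILL and COMPLETION rings alongside the UMEM:</p>

```c
/* Prototype from libbpf v0.8 (xsk.h). The fill and comp rings are the
 * per-socket FILL and COMPLETION queues required by the shared-UMEM mode. */
int xsk_socket__create_shared(struct xsk_socket **xsk_ptr,
                              const char *ifname,
                              __u32 queue_id,
                              struct xsk_umem *umem,
                              struct xsk_ring_cons *rx,
                              struct xsk_ring_prod *tx,
                              struct xsk_ring_prod *fill,
                              struct xsk_ring_cons *comp,
                              const struct xsk_socket_config *config);
```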
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/X3bAf6b0i9A5MQ8rpPwEB/659c37c28c96d55f5d33b9abc6acdc4a/image10-2.png" />
            
            </figure><p>The final setup is the following: Xsks are created for each queue of the devices in their respective network namespaces. flowtrackd then handles the descriptors like a puppeteer, applying our DoS mitigation logic to the packets that they reference, with one exception… (notice the red crosses on the diagram):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Of32s5kJrl6seldfKEnI4/76b3dfb3e7b3dd4ab47ba1f4715ae422/unnamed1.png" />
            
            </figure>
    <div>
      <h2>What "Invalid argument"?!</h2>
      <a href="#what-invalid-argument">
        
      </a>
    </div>
    <p>We were happily nearing the end of the rewrite when, suddenly, after porting our integration tests to the CI, flowtrackd crashed!</p><p>The following error was displayed:</p>
            <pre><code>[...]
Thread 'main' panicked at 'failed to create Xsk: Libbpf("Invalid argument")', flowtrack-io/src/packet_driver.rs:144:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace</code></pre>
            <p>According to the line number, the first socket was created with success and flowtrackd crashed when the second Xsk was created:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5wSlomdCWIQcOjiFLpvtmk/c90313156ae25307fc0810f99e47e092/image11-3.png" />
            
            </figure><p>Here is what we do: we enter the network namespace where the interface sits, load and attach the BPF program, and, for each queue of the interface, create a socket. The UMEM and the config parameters are the same as for the ingress Xsk creation. Only the ingress_veth and egress_veth differ.</p><p>This is what the code to create a Xsk looks like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xd6YfIKTsCJOEq9OHGYfW/2b06a4be3d9ab9d64842d99b68b998c6/image3-6.png" />
            
            </figure><p>The call to the libbpf function <a href="https://github.com/libbpf/libbpf/blob/v0.8.0/src/xsk.c#L994"><code>xsk_socket__create_shared()</code></a> didn't return 0.</p><p>The <a href="https://www.mankier.com/3/libxdp">libxdp manual page</a> doesn't help us here…</p><p>Which argument is "invalid"? And why does this error show up only in the CI, and not when we run flowtrackd locally?</p><p>We can try to reproduce locally with a network setup script similar to the one used in the CI:</p>
            <pre><code>#!/bin/bash
 
set -e -u -x -o pipefail
 
OUTER_VETH=${OUTER_VETH:=outer}
TEST_NAMESPACE=${TEST_NAMESPACE:=inner-ns}
INNER_VETH=${INNER_VETH:=inner}
QUEUES=${QUEUES:=$(grep -c ^processor /proc/cpuinfo)}
 
ip link delete $OUTER_VETH &amp;&gt;/dev/null || true
ip netns delete $TEST_NAMESPACE &amp;&gt;/dev/null || true
ip netns add $TEST_NAMESPACE
ip link \
  add name $OUTER_VETH numrxqueues $QUEUES numtxqueues $QUEUES type veth \
  peer name $INNER_VETH netns $TEST_NAMESPACE numrxqueues $QUEUES numtxqueues $QUEUES
ethtool -K $OUTER_VETH tx off rxvlan off txvlan off
ip link set dev $OUTER_VETH up
ip addr add 169.254.0.1/30 dev $OUTER_VETH
ip netns exec $TEST_NAMESPACE ip link set dev lo up
ip netns exec $TEST_NAMESPACE ethtool -K $INNER_VETH tx off rxvlan off txvlan off
ip netns exec $TEST_NAMESPACE ip link set dev $INNER_VETH up
ip netns exec $TEST_NAMESPACE ip addr add 169.254.0.2/30 dev $INNER_VETH</code></pre>
            <p>For the rest of the blogpost, we set the number of queues per interface to 1. If you have questions about the <code>set</code> command in the script, <a href="/pipefail-how-a-missing-shell-option-slowed-cloudflare-down/">check this out</a>.</p><p>Not much success triggering the error.</p><p>What differs between my laptop setup and the CI setup?</p><p>I managed to find out that it crashes when the outer and inner interface <b>index numbers</b> are the same, even though the interfaces don't have the same name and are not in the same network namespace. When the tests are run by the CI, both interfaces get index number 5, which was not the case on my laptop, since I have more interfaces:</p>
            <pre><code>$ ip -o link | cut -d' ' -f1,2
1: lo:
2: wwan0:
3: wlo1:
4: virbr0:
7: br-ead14016a14c:
8: docker0:
9: br-bafd94c79ff4:
29: outer@if2:</code></pre>
            <p>We can edit the script to set a fixed interface index number:</p>
            <pre><code>ip link \
  add name $OUTER_VETH numrxqueues $QUEUES numtxqueues $QUEUES index 4242 type veth \
  peer name $INNER_VETH netns $TEST_NAMESPACE numrxqueues $QUEUES numtxqueues $QUEUES index 4242</code></pre>
            <p>And we can now reproduce the issue locally!</p><p>Interesting observation: I was not able to reproduce this issue with the previous flowtrackd version. Is this somehow related to the shared UMEM feature that we are now using?</p><p>Back to the "invalid" argument. strace to the rescue:</p>
            <pre><code>sudo strace -f -x ./flowtrackd -v -c flowtrackd.toml --ingress outer --egress inner --egress-netns inner-ns
 
[...]
 
// UMEM allocation + first Xsk creation
 
[pid 389577] brk(0x55b485819000)        = 0x55b485819000
[pid 389577] mmap(NULL, 8396800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f85037fe000
 
[pid 389577] socket(AF_XDP, SOCK_RAW|SOCK_CLOEXEC, 0) = 9
[pid 389577] setsockopt(9, SOL_XDP, XDP_UMEM_REG, "\x00\xf0\x7f\x03\x85\x7f\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 32) = 0
[pid 389577] setsockopt(9, SOL_XDP, XDP_UMEM_FILL_RING, [2048], 4) = 0
[pid 389577] setsockopt(9, SOL_XDP, XDP_UMEM_COMPLETION_RING, [2048], 4) = 0
[pid 389577] getsockopt(9, SOL_XDP, XDP_MMAP_OFFSETS, "\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x40\x01\x00\x00\x00\x00\x00\x00\xc4\x00\x00\x00\x00\x00\x00\x00"..., [128]) = 0
[pid 389577] mmap(NULL, 16704, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 9, 0x100000000) = 0x7f852801b000
[pid 389577] mmap(NULL, 16704, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 9, 0x180000000) = 0x7f8528016000
[...]
[pid 389577] setsockopt(9, SOL_XDP, XDP_RX_RING, [2048], 4) = 0
[pid 389577] setsockopt(9, SOL_XDP, XDP_TX_RING, [2048], 4) = 0
[pid 389577] getsockopt(9, SOL_XDP, XDP_MMAP_OFFSETS, "\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x40\x01\x00\x00\x00\x00\x00\x00\xc4\x00\x00\x00\x00\x00\x00\x00"..., [128]) = 0
[pid 389577] mmap(NULL, 33088, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 9, 0) = 0x7f850377e000
[pid 389577] mmap(NULL, 33088, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 9, 0x80000000) = 0x7f8503775000
[pid 389577] bind(9, {sa_family=AF_XDP, sa_data="\x08\x00\x92\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"}, 16) = 0
 
[...]
 
// Second Xsk creation
 
[pid 389577] socket(AF_XDP, SOCK_RAW|SOCK_CLOEXEC, 0) = 62
[...]
[pid 389577] setsockopt(62, SOL_XDP, XDP_RX_RING, [2048], 4) = 0
[pid 389577] setsockopt(62, SOL_XDP, XDP_TX_RING, [2048], 4) = 0
[pid 389577] getsockopt(62, SOL_XDP, XDP_MMAP_OFFSETS, "\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x40\x01\x00\x00\x00\x00\x00\x00\xc4\x00\x00\x00\x00\x00\x00\x00"..., [128]) = 0
[pid 389577] mmap(NULL, 33088, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 62, 0) = 0x7f85036e4000
[pid 389577] mmap(NULL, 33088, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 62, 0x80000000) = 0x7f85036db000
[pid 389577] bind(62, {sa_family=AF_XDP, sa_data="\x01\x00\x92\x10\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00"}, 16) = -1 EINVAL (Invalid argument)
 
[pid 389577] munmap(0x7f85036db000, 33088) = 0
[pid 389577] munmap(0x7f85036e4000, 33088) = 0
[pid 389577] close(62)                  = 0
[pid 389577] write(2, "thread '", 8thread ')    = 8
[pid 389577] write(2, "main", 4main)        = 4
[pid 389577] write(2, "' panicked at '", 15' panicked at ') = 15
[pid 389577] write(2, "failed to create Xsk: Libbpf(\"In"..., 48failed to create Xsk: Libbpf("Invalid argument")) = 48
[...]</code></pre>
            <p>OK, the second <code>bind()</code> syscall returns EINVAL.</p><p>The <code>sa_family</code> is the right one. Is something wrong with <code>sa_data="\x01\x00\x92\x10\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00"</code>?</p><p>Let's look at the <a href="https://elixir.bootlin.com/linux/v5.15/source/net/socket.c#L1702">bind syscall kernel code</a>:</p>
            <pre><code>err = sock-&gt;ops-&gt;bind(sock, (struct sockaddr *) &amp;address, addrlen);</code></pre>
            <p>The bind function of the protocol-specific socket operations gets called. Searching for "AF_XDP" in the code, we quickly find the bind function call related to the <a href="https://elixir.bootlin.com/linux/v5.15/source/net/xdp/xsk.c#L857">AF_XDP socket address family</a>.</p><p>So, where in the syscall could this value be returned?</p><p>First, let's examine the syscall parameters to see if the libbpf <code>xsk_socket__create_shared()</code> function sets weird values for us.</p><p>We use the <a href="https://linux.die.net/man/1/pahole">pahole</a> tool to print the structure definitions:</p>
            <pre><code>$ pahole sockaddr
struct sockaddr {
        sa_family_t                sa_family;            /*     0     2 */
        char                       sa_data[14];          /*     2    14 */
 
        /* size: 16, cachelines: 1, members: 2 */
        /* last cacheline: 16 bytes */
};
 
$ pahole sockaddr_xdp
struct sockaddr_xdp {
        __u16                      sxdp_family;          /*     0     2 */
        __u16                      sxdp_flags;           /*     2     2 */
        __u32                      sxdp_ifindex;         /*     4     4 */
        __u32                      sxdp_queue_id;        /*     8     4 */
        __u32                      sxdp_shared_umem_fd;  /*    12     4 */
 
        /* size: 16, cachelines: 1, members: 5 */
        /* last cacheline: 16 bytes */
};</code></pre>
            <p>Translation of the arguments of the bind syscall (the 14 bytes of <code>sa_data</code>) for the first <code>bind()</code> call (the bytes are shown in host order, little-endian on x86):</p><table><tr><td><p><b>Struct member</b></p></td><td><p><b>Raw bytes</b></p></td><td><p><b>Decimal</b></p></td><td><p><b>Meaning</b></p></td><td><p><b>Observation</b></p></td></tr><tr><td><p><a href="http://web.archive.org/web/20220814030152/https://elixir.bootlin.com/linux/v5.15/source/include/uapi/linux/if_xdp.h#L15">sxdp_flags</a></p></td><td><p>\x08\x00</p></td><td><p>8</p></td><td><p>XDP_USE_NEED_WAKEUP</p></td><td><p>expected</p></td></tr><tr><td><p>sxdp_ifindex</p></td><td><p>\x92\x10\x00\x00</p></td><td><p>4242</p></td><td><p>The network interface index</p></td><td><p>expected</p></td></tr><tr><td><p>sxdp_queue_id</p></td><td><p>\x00\x00\x00\x00</p></td><td><p>0</p></td><td><p>The network interface queue id</p></td><td><p>expected</p></td></tr><tr><td><p>sxdp_shared_umem_fd</p></td><td><p>\x00\x00\x00\x00</p></td><td><p>0</p></td><td><p>The UMEM is not shared yet</p></td><td><p>expected</p></td></tr></table><p>Second <code>bind()</code> call:</p><table><tr><td><p><b>Struct member</b></p></td><td><p><b>Raw bytes</b></p></td><td><p><b>Decimal</b></p></td><td><p><b>Meaning</b></p></td><td><p><b>Observation</b></p></td></tr><tr><td><p><a href="http://web.archive.org/web/20220814030152/https://elixir.bootlin.com/linux/v5.15/source/include/uapi/linux/if_xdp.h#L15">sxdp_flags</a></p></td><td><p>\x01\x00</p></td><td><p>1</p></td><td><p>XDP_SHARED_UMEM</p></td><td><p>expected</p></td></tr><tr><td><p>sxdp_ifindex</p></td><td><p>\x92\x10\x00\x00</p></td><td><p>4242</p></td><td><p>The network interface index</p></td><td><p>expected</p></td></tr><tr><td><p>sxdp_queue_id</p></td><td><p>\x00\x00\x00\x00</p></td><td><p>0</p></td><td><p>The network interface queue id</p></td><td><p>expected</p></td></tr><tr><td><p>sxdp_shared_umem_fd</p></td><td><p>\x09\x00\x00\x00</p></td><td><p>9</p></td><td><p>File descriptor of the first AF_XDP socket associated to the UMEM</p></td><td><p>expected</p></td></tr></table><p>The arguments look good...</p><p>We could try to statically infer where the EINVAL was returned by looking at the source code, but this kind of analysis has its limits and can be error-prone.</p><p>Overall, it seems that network namespaces are not taken into account somewhere: there is some confusion with the interface indexes.</p><p>Is the issue on the kernel-side?</p>
    <div>
      <h2>Digging deeper</h2>
      <a href="#digging-deeper">
        
      </a>
    </div>
    <p>It would be nice if we had step-by-step runtime inspection of code paths and variables.</p><p>Let's:</p><ul><li><p>Compile a Linux kernel version close to the one used on our servers (5.15), with debug symbols.</p></li><li><p>Generate a root filesystem for the kernel to boot.</p></li><li><p>Boot in <a href="https://www.qemu.org/">QEMU</a>.</p></li><li><p>Attach gdb to it and set a breakpoint on the syscall.</p></li><li><p>Check where the EINVAL value is returned.</p></li></ul><p>We could have used <a href="https://buildroot.org/">buildroot</a> with minimal reproduction code, but that wouldn't have been as much fun. Instead, we install a minimal Ubuntu and load our custom kernel. This has the benefit of providing a package manager in case we need to install other debugging tools.</p><p>Let's install a minimal Ubuntu server 21.10 (with ext4, no LVM, and an SSH server selected in the installation wizard):</p>
            <pre><code>qemu-img create -f qcow2 ubuntu-21.10-live-server-amd64.qcow2 20G
 
qemu-system-x86_64 \
  -smp $(nproc) \
  -m 4G \
  -hda ubuntu-21.10-live-server-amd64.qcow2 \
  -cdrom /home/bastien/Downloads/ubuntu-21.10-live-server-amd64.iso \
  -enable-kvm \
  -cpu host \
  -net nic,model=virtio \
  -net user,hostfwd=tcp::10022-:22</code></pre>
            <p>And then build a kernel (<a href="https://www.josehu.com/memo/2021/01/02/linux-kernel-build-debug.html">link</a> and <a href="https://www.starlab.io/blog/using-gdb-to-debug-the-linux-kernel">link</a>) with the following changes in the menuconfig:</p><ul><li><p>Cryptographic API → Certificates for signature checking → Provide system-wide ring of trusted keys</p><ul><li><p>change the additional string to be EMPTY ("")</p></li></ul></li><li><p>Device drivers → Network device support → Virtio network driver</p><ul><li><p>Set to Enable</p></li></ul></li><li><p>Device Drivers → Network device support → Virtual ethernet pair device</p><ul><li><p>Set to Enable</p></li></ul></li><li><p>Device drivers → Block devices → Virtio block driver</p><ul><li><p>Set to Enable</p></li></ul></li></ul>
            <pre><code>git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git &amp;&amp; cd linux/
git checkout v5.15
make menuconfig
make -j$(nproc) bzImage</code></pre>
            <p>We can now run Ubuntu with our custom kernel waiting for gdb to be connected:</p>
            <pre><code>qemu-system-x86_64 \
  -kernel /home/bastien/work/linux/arch/x86_64/boot/bzImage \
  -append "root=/dev/sda2 console=ttyS0 nokaslr" \
  -nographic \
  -smp $(nproc) \
  -m 8G \
  -hda ubuntu-21.10-live-server-amd64.qcow2 \
  -boot c \
  -cpu host \
  -net nic,model=virtio \
  -net user,hostfwd=tcp::10022-:22 \
  -enable-kvm \
  -s -S</code></pre>
            <p>And we can fire up gdb and set a breakpoint on the xsk_bind function:</p>
            <pre><code>$ gdb  -ex "add-auto-load-safe-path $(pwd)" -ex "file vmlinux" -ex "target remote :1234" -ex "hbreak start_kernel" -ex "continue"
(gdb) b xsk_bind
(gdb) continue</code></pre>
            <p>After executing the network setup script and running flowtrackd, we hit the <code>xsk_bind</code> breakpoint:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gubA71gEppejdOkeME2rL/61ea127cb68fdfd4a6603445bb28630c/image6-7.png" />
            
            </figure><p>We continue to hit the second <code>xsk_bind</code> breakpoint (the one that returns EINVAL) and after a few <code>next</code> and <code>step</code> commands, we find <a href="https://elixir.bootlin.com/linux/v5.15/source/net/xdp/xsk.c#L938">which function</a> returned the <a href="https://elixir.bootlin.com/linux/v5.15/source/net/xdp/xsk_buff_pool.c#L201">EINVAL value</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GRpxt6bPf3f42tL0rTmf7/a99d880dc0579353c49bb21703adfc79/image8-2.png" />
            
            </figure><p>In our Rust code, we allocate a new FILL and a COMPLETION queue for each queue id of the device prior to calling <code>xsk_socket__create_shared()</code>. Why are those set to NULL? Looking at the code, <code>pool-&gt;fq</code> comes from a struct field named <code>fq_tmp</code> that is accessed from the sock pointer <code>(print ((struct xdp_sock *)sock-&gt;sk)-&gt;fq_tmp)</code>. The field is set in the first call to xsk_bind() but isn't in the second call. We note that at the end of the xsk_bind() function, <a href="https://elixir.bootlin.com/linux/v5.15/source/net/xdp/xsk.c#L981">fq_tmp and cq_tmp are set to NULL</a> as per this comment: "FQ and CQ are now owned by the buffer pool and cleaned up with it.".</p><p>Something is definitely going wrong in libbpf because the FILL queue and COMPLETION queue pointers are missing.</p><p>Back to the libbpf <code>xsk_socket__create_shared()</code> function to check where the queues are set for the socket, and we quickly notice two functions that interact with the FILL and COMPLETION queues:</p><p>The first function called is <a href="https://github.com/libbpf/libbpf/blob/v0.8.0/src/xsk.c#L879"><code>xsk_get_ctx()</code></a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ZJQXz150mnzj47exyaKUB/c0cc03d40a70873a21725abc4401886f/image18-2.png" />
            
            </figure><p>The second is <a href="https://github.com/libbpf/libbpf/blob/v0.8.0/src/xsk.c#L923"><code>xsk_create_ctx()</code></a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/37eShbafNDmVglIOd8RUPt/c1ade49287889ec4a9b31e350b47c889/image16-3.png" />
            
            </figure><p>Remembering our setup, can you spot what the issue is?</p>
    <div>
      <h2>The bug / missing feature</h2>
      <a href="#the-bug-missing-feature">
        
      </a>
    </div>
    <p>The issue is in the comparison performed by <a href="https://github.com/libbpf/libbpf/blob/v0.8.0/src/xsk.c#L888"><code>xsk_get_ctx()</code></a> to find the socket context structure associated with the (ifindex, queue_id) pair in the linked-list. Because the UMEM is shared across Xsks, the same <code>umem-&gt;ctx_list</code> linked-list head is used to find the sockets that use this UMEM. Remember that in our setup, flowtrackd attaches itself to two network devices that live in different network namespaces. Using the interface index and the queue_id to find the right context (FILL and COMPLETION queues) associated with a socket is not sufficient, because another network interface with the same interface index can exist at the same time in another network namespace.</p>
    <div>
      <h2>What can we do about it?</h2>
      <a href="#what-can-we-do-about-it">
        
      </a>
    </div>
    <p>We need a way to tell two network devices apart "system-wide", that is, across network namespace boundaries.</p><p>Could we fetch and store the network namespace inode number of the current process (<code>stat -c%i -L /proc/self/ns/net</code>) at context creation and then use it in the comparison? According to <a href="https://man7.org/linux/man-pages/man7/inode.7.html">man 7 inode</a>: "Each file in a filesystem has a unique inode number. Inode numbers are guaranteed to be unique only within a filesystem". However, inode numbers can be reused:</p>
            <pre><code># ip netns add a
# stat -c%i /run/netns/a
4026532570
# ip netns delete a
# ip netns add b
# stat -c%i /run/netns/b
4026532570</code></pre>
            <p>Here are our options:</p><ul><li><p>Do a quick hack to ensure that the interface indexes are not the same (as done in the integration tests).</p></li><li><p>Explain our use case to the libbpf maintainers and see how the API for the <code>xsk_socket__create_shared()</code> function should change. It could be possible to pass an opaque "cookie" as a parameter at the socket creation and pass it to the functions that access the socket contexts.</p></li><li><p>Take our chances and look for Linux patches that contain the words “netns” and “cookie”.</p></li></ul><p>Well, well, well: <a href="https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net/">[PATCH bpf-next 3/7] bpf: add netns cookie and enable it for bpf cgroup hooks</a></p><p>This is almost what we need! This patch adds a kernel function named <code>bpf_get_netns_cookie()</code> that would get us the network namespace cookie linked to a socket:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4yeyvSXN19FPw3gJvKGPdh/31c07d8740a52ce7de8073565d181de0/image15-3.png" />
            
            </figure><p>A <a href="https://patchwork.kernel.org/project/linux-parisc/patch/20210210120425.53438-2-lmb@cloudflare.com/">second patch</a> enables us to get this cookie from userspace:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fkodeI6Um1we4uArDAl9w/864571b824f22e85d43265d964efc138/image17-2.png" />
            
            </figure><p>I know this Lorenz from somewhere :D</p><p>Note that this patch was shipped with the Linux <b>v5.14</b> release.</p><p>We now have stronger guarantees:</p><ul><li><p>The cookie is generated for us by the kernel.</p></li><li><p>It is strongly bound to the socket from its creation (the netns cookie value is present in the socket structure).</p></li><li><p>The network namespace cookie remains stable for its lifetime.</p></li><li><p>It provides a global identifier that can be assumed unique and not reused.</p></li></ul>
    <div>
      <h2>A patch</h2>
      <a href="#a-patch">
        
      </a>
    </div>
    <p>At the socket creation, we retrieve the netns_cookie from the Xsk file descriptor with <code>getsockopt()</code>, insert it in the <code>xsk_ctx</code> struct, and add it to the comparison performed in <code>xsk_get_ctx()</code>.</p><p>Our initial patch was tested on Linux v5.15 with libbpf v0.8.0.</p>
    <div>
      <h3>Testing the patch</h3>
      <a href="#testing-the-patch">
        
      </a>
    </div>
    <p>We keep the same network setup script, but we set the number of queues per interface to two (QUEUES=2). This will help us check that two sockets created in the same network namespace have the same netns_cookie.</p><p>After recompiling flowtrackd to use our patched libbpf, we can run it inside <b>our guest</b> with gdb and set breakpoints on <code>xsk_get_ctx</code> as well as <code>xsk_create_ctx</code>. We now have two instances of gdb running at the same time, one debugging the system and the other debugging the application running in that system. Here is the gdb guest view:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sbLABtvPacBjuX9oFVn0Z/c01cd118d1d3d6e15d178a0689e92691/unnamed2-1.png" />
            
            </figure><p>Here is the gdb system view:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4HDyMoI08Ipp4ppwCRZmH8/256e62125a122c3e8a493059dd3f718e/unnamed3.png" />
            
            </figure><p>We can see that the <code>netns_cookie</code> value for the first two Xsks is 1 (root namespace) and the <code>netns_cookie</code> value for the two other Xsks is 8193 (inner-ns namespace).</p><p>flowtrackd didn't crash and is behaving as expected. It works!</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    
    <div>
      <h3>Situation</h3>
      <a href="#situation">
        
      </a>
    </div>
    <p>Creating AF_XDP sockets with the XDP_SHARED_UMEM flag set fails when the two devices’ ifindex (and the queue_id) are the same. This can happen with devices in different network namespaces.</p><p>In the shared UMEM mode, each Xsk is expected to have a dedicated FILL and COMPLETION queue. Context data about those queues is set by libbpf in a linked-list stored by the UMEM object. The comparison performed to pick the right context in the linked-list only takes into account the device ifindex and the queue_id, which can be the same when the devices are in different network namespaces.</p>
    <div>
      <h3>Resolution</h3>
      <a href="#resolution">
        
      </a>
    </div>
    <p>We retrieve the <code>netns_cookie</code> associated with the socket at its creation and add it to the comparison operation.</p><p>The <a href="https://github.com/xdp-project/xdp-tools/pull/205">fix</a> has been submitted and merged into libxdp, which is where the AF_XDP parts of libbpf now <a href="https://github.com/libbpf/libbpf/wiki/Libbpf:-the-road-to-v1.0#xskch-is-moving-into-libxdp">live</a>.</p><p>We’ve also <a href="https://github.com/libbpf/libbpf/releases/tag/v0.8.1">backported the fix</a> to libbpf and updated the <a href="https://github.com/libbpf/libbpf-sys/releases/tag/0.8.2%2Bv0.8.1">libbpf-sys Rust crate</a> accordingly.</p>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">7imoRx7f0kPMEpeuPibXPO</guid>
            <dc:creator>Bastien Dhiver</dc:creator>
        </item>
        <item>
            <title><![CDATA[A July 4 technical reading list]]></title>
            <link>https://blog.cloudflare.com/july-4-2022-reading-list/</link>
            <pubDate>Mon, 04 Jul 2022 12:55:08 GMT</pubDate>
            <description><![CDATA[ Here’s a short list of recent technical blog posts to give you something to read today ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2S9gHqjdCaiGCiCTBkGt0P/3a2a26f413cb9a908a9112a858495a7e/image1-61.png" />
            
            </figure><p>Here’s a short list of recent technical blog posts to give you something to read today.</p>
    <div>
      <h3>Internet Explorer, we hardly knew ye</h3>
      <a href="#internet-explorer-we-hardly-knew-ye">
        
      </a>
    </div>
    <p>Microsoft has announced the end-of-life for the venerable Internet Explorer browser. Here <a href="/internet-explorer-retired/">we take a look</a> at the demise of IE and the rise of the Edge browser. And we investigate how many bots on the Internet continue to impersonate Internet Explorer versions that have long since been replaced.</p>
    <div>
      <h3>Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module</h3>
      <a href="#live-patching-security-vulnerabilities-inside-the-linux-kernel-with-ebpf-linux-security-module">
        
      </a>
    </div>
    <p>Looking for something with a lot of technical detail? Look no further than <a href="/live-patch-security-vulnerabilities-with-ebpf-lsm/">this blog about live-patching</a> the Linux kernel using eBPF. Code, Makefiles and more within!</p>
    <div>
      <h3>Hertzbleed explained</h3>
      <a href="#hertzbleed-explained">
        
      </a>
    </div>
    <p>Feeling mathematical? Or just need a dose of CPU-level antics? Look no further than this <a href="/hertzbleed-explained/">deep explainer</a> about how CPU frequency scaling leads to a nasty side channel affecting cryptographic algorithms.</p>
    <div>
      <h3>Early Hints update: How Cloudflare, Google, and Shopify are working together to build a faster Internet for everyone</h3>
      <a href="#early-hints-update-how-cloudflare-google-and-shopify-are-working-together-to-build-a-faster-internet-for-everyone">
        
      </a>
    </div>
    <p>The HTTP standard for Early Hints shows a lot of promise. How much? In this blog post, we <a href="/early-hints-performance/">dig into data</a> about Early Hints in the real world and show how much faster the web is with it.</p>
    <div>
      <h3>Private Access Tokens: eliminating CAPTCHAs on iPhones and Macs with open standards</h3>
      <a href="#private-access-tokens-eliminating-captchas-on-iphones-and-macs-with-open-standards">
        
      </a>
    </div>
    <p>Dislike CAPTCHAs? Yes, us too. As part of our program to eliminate CAPTCHAs, there’s a new standard: Private Access Tokens. This blog shows <a href="/eliminating-captchas-on-iphones-and-macs-using-new-standard/">how they work</a> and how they can be used to prove you’re human without saying who you are.</p>
    <div>
      <h3>Optimizing TCP for high WAN throughput while preserving low latency</h3>
      <a href="#optimizing-tcp-for-high-wan-throughput-while-preserving-low-latency">
        
      </a>
    </div>
    <p>Network nerd? Yeah, me too. Here’s a very <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">in-depth look</a> at how we tune TCP parameters for low latency and high throughput.</p><p>...<i>We protect </i><a href="https://www.cloudflare.com/network-services/"><i>entire corporate networks</i></a><i>, help customers build </i><a href="https://workers.cloudflare.com/"><i>Internet-scale applications efficiently</i></a><i>, accelerate any </i><a href="https://www.cloudflare.com/performance/accelerate-internet-applications/"><i>website or Internet application</i></a><i>, ward off </i><a href="https://www.cloudflare.com/ddos/"><i>DDoS attacks</i></a><i>, keep </i><a href="https://www.cloudflare.com/application-security/"><i>hackers at bay</i></a><i>, and can help you on </i><a href="https://www.cloudflare.com/products/zero-trust/"><i>your journey to Zero Trust</i></a><i>.</i></p><p><i>Visit </i><a href="https://1.1.1.1/"><i>1.1.1.1</i></a><i> from any device to get started with our free app that makes your Internet faster and safer. To learn more about our mission to help build a better Internet, start </i><a href="https://www.cloudflare.com/learning/what-is-cloudflare/"><i>here</i></a><i>. If you’re looking for a new career direction, check out </i><a href="http://cloudflare.com/careers"><i>our open positions</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Reading List]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Hertzbleed]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">4ffQabh80U3V99Grzwc88g</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module]]></title>
            <link>https://blog.cloudflare.com/live-patch-security-vulnerabilities-with-ebpf-lsm/</link>
            <pubDate>Wed, 29 Jun 2022 11:45:00 GMT</pubDate>
            <description><![CDATA[ Learn how to patch Linux security vulnerabilities without rebooting the hardware and how to tighten the security of your Linux operating system with eBPF Linux Security Module ]]></description>
            <content:encoded><![CDATA[ <p></p><p><a href="https://www.kernel.org/doc/html/latest/admin-guide/LSM/index.html">Linux Security Modules</a> (LSM) is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux kernel. Until recently, users looking to implement a security policy had just two options: configure an existing LSM module such as AppArmor or SELinux, or write a custom kernel module.</p><p><a href="https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.7">Linux 5.7</a> introduced a third way: <a href="https://docs.kernel.org/bpf/prog_lsm.html">LSM extended Berkeley Packet Filters (eBPF)</a> (LSM BPF for short). LSM BPF allows developers to write granular policies without configuration or loading a kernel module. LSM BPF programs are verified on load, and then executed when an LSM hook is reached in a call path.</p>
    <div>
      <h2>Let’s solve a real-world problem</h2>
      <a href="#lets-solve-a-real-world-problem">
        
      </a>
    </div>
    <p>Modern operating systems provide facilities for "partitioning" kernel resources. For example, FreeBSD has "jails" and Solaris has "zones". Linux is different - it provides a set of seemingly independent facilities, each isolating a specific resource. These are called "namespaces" and have been growing in the kernel for years. They are the base of popular tools like Docker, lxc and firejail. Many of the namespaces are uncontroversial, like the UTS namespace, which allows a process to have its own hostname and NIS domain name. Others are more complex - the NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is the very special, very curious USER namespace.</p><p>The USER namespace is special, since it allows the owner to operate as "root" inside it. How it works is beyond the scope of this blog post; suffice it to say it's the foundation that lets tools like Docker avoid running as true root, and that enables things like rootless containers.</p><p>Due to its nature, allowing unprivileged users access to the USER namespace has always carried a great security risk. One such risk is privilege escalation.</p><p>Privilege escalation is a <a href="https://www.cloudflare.com/learning/security/what-is-an-attack-surface/">common attack surface</a> for operating systems. One way users may gain privilege is by mapping their namespace to the root namespace via the unshare <a href="https://en.wikipedia.org/wiki/System_call">syscall</a> and specifying the <i>CLONE_NEWUSER</i> flag. This tells unshare to create a new user namespace with full permissions, and maps the new user and group ID to the previous namespace. You can use the <a href="https://man7.org/linux/man-pages/man1/unshare.1.html">unshare(1)</a> program to map root to your original namespace:</p>
            <pre><code>$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …
$ unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# cat /proc/self/uid_map
         0       1000          1</code></pre>
            <p>In most cases using unshare is harmless, and it is intended to run with lower privileges. However, this syscall has been known to be used to <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-0492">escalate privileges</a>.</p><p>Syscalls <i>clone</i> and <i>clone3</i> are worth looking into as they can also pass <i>CLONE_NEWUSER</i>. However, for this post we’re going to focus on unshare.</p><p>Debian solved this problem with its <a href="https://sources.debian.org/patches/linux/3.16.56-1+deb8u1/debian/add-sysctl-to-disallow-unprivileged-CLONE_NEWUSER-by-default.patch/">"add sysctl to disallow unprivileged CLONE_NEWUSER by default"</a> patch, but it was not mainlined. Another similar patch, <a href="https://lore.kernel.org/all/1453502345-30416-3-git-send-email-keescook@chromium.org/">"sysctl: allow CLONE_NEWUSER to be disabled"</a>, was an attempt at mainlining and was met with pushback. One critique was the <a href="https://lore.kernel.org/all/87poq5y0jw.fsf@x220.int.ebiederm.org/">inability to toggle this feature</a> for specific applications. In the article “<a href="https://lwn.net/Articles/673597/">Controlling access to user namespaces</a>” the author wrote: “... the current patches do not appear to have an easy path into the mainline.” And as we can see, the patches were ultimately not included in the vanilla kernel.</p>
    <div>
      <h2>Our solution - LSM BPF</h2>
      <a href="#our-solution-lsm-bpf">
        
      </a>
    </div>
    <p>Since upstreaming code that restricts the USER namespace seemed not to be an option, we decided to use LSM BPF to sidestep these issues. This requires no modifications to the kernel and allows us to express complex rules guarding access.</p>
    <div>
      <h3>Track down an appropriate hook candidate</h3>
      <a href="#track-down-an-appropriate-hook-candidate">
        
      </a>
    </div>
    <p>First, let us track down the syscall we’re targeting. We can find the prototype in the <a href="https://elixir.bootlin.com/linux/v5.18/source/include/linux/syscalls.h#L608"><i>include/linux/syscalls.h</i></a> file. From there, the implementation is not as obvious to track down, but the comment:</p>
            <pre><code>/* kernel/fork.c */</code></pre>
            <p>Gives us a clue of where to look next in <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/fork.c#L3201"><i>kernel/fork.c</i></a>. There a call to <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/fork.c#L3082"><i>ksys_unshare()</i></a> is made. Digging through that function, we find a call to <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/fork.c#L3129"><i>unshare_userns()</i></a>. This looks promising.</p><p>Up to this point, we’ve identified the syscall implementation, but the next question to ask is which hooks are available for us to use. Because we know from the <a href="https://man7.org/linux/man-pages/man2/unshare.2.html">man-pages</a> that unshare is used to mutate tasks, we look at the task-based hooks in <a href="https://elixir.bootlin.com/linux/v5.18/source/include/linux/lsm_hooks.h#L605"><i>include/linux/lsm_hooks.h</i></a>. Back in the function <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/user_namespace.c#L171"><i>unshare_userns()</i></a> we saw a call to <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/cred.c#L252"><i>prepare_creds()</i></a>. This looks very similar to the <a href="https://elixir.bootlin.com/linux/v5.18/source/include/linux/lsm_hooks.h#L624"><i>cred_prepare</i></a> hook. To verify we have our match, inside <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/cred.c#L291"><i>prepare_creds()</i></a> we see a call to the security hook <a href="https://elixir.bootlin.com/linux/v5.18/source/security/security.c#L1706"><i>security_prepare_creds()</i></a>, which ultimately calls the hook:</p>
            <pre><code>…
rc = call_int_hook(cred_prepare, 0, new, old, gfp);
…</code></pre>
            <p>Without going much further down this rabbit hole, we know this is a good hook to use because <i>prepare_creds()</i> is called right before <i>create_user_ns()</i> in <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/user_namespace.c#L181"><i>unshare_userns()</i></a>, which is the operation we’re trying to block.</p>
    <div>
      <h3>LSM BPF solution</h3>
      <a href="#lsm-bpf-solution">
        
      </a>
    </div>
    <p>We’re going to compile with the <a href="https://nakryiko.com/posts/bpf-core-reference-guide/#defining-own-co-re-relocatable-type-definitions">eBPF "compile once, run everywhere" (CO-RE)</a> approach. This allows us to compile on one architecture and load on another, but here we’re going to target x86_64 specifically. LSM BPF for ARM64 is still in development, and the following code will not run on that architecture. Watch the <a href="https://lore.kernel.org/bpf/">BPF mailing list</a> to follow the progress.</p><p>This solution was tested on kernel versions &gt;= 5.15 configured with the following:</p>
            <pre><code>BPF_EVENTS
BPF_JIT
BPF_JIT_ALWAYS_ON
BPF_LSM
BPF_SYSCALL
BPF_UNPRIV_DEFAULT_OFF
DEBUG_INFO_BTF
DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
DYNAMIC_FTRACE
FUNCTION_TRACER
HAVE_DYNAMIC_FTRACE</code></pre>
            <p>A boot option <code>lsm=bpf</code> may be necessary if <code>CONFIG_LSM</code> does not contain “bpf” in the list.</p><p>Let’s start with our preamble:</p><p><i>deny_unshare.bpf.c</i>:</p>
            <pre><code>#include &lt;linux/bpf.h&gt;
#include &lt;linux/capability.h&gt;
#include &lt;linux/errno.h&gt;
#include &lt;linux/sched.h&gt;
#include &lt;linux/types.h&gt;

#include &lt;bpf/bpf_tracing.h&gt;
#include &lt;bpf/bpf_helpers.h&gt;
#include &lt;bpf/bpf_core_read.h&gt;

#define X86_64_UNSHARE_SYSCALL 272
#define UNSHARE_SYSCALL X86_64_UNSHARE_SYSCALL</code></pre>
            <p>Next we set up our necessary structures for CO-RE relocation in the following way:</p><p><i>deny_unshare.bpf.c</i>:</p>
            <pre><code>…

typedef unsigned int gfp_t;

struct pt_regs {
	long unsigned int di;
	long unsigned int orig_ax;
} __attribute__((preserve_access_index));

typedef struct kernel_cap_struct {
	__u32 cap[_LINUX_CAPABILITY_U32S_3];
} __attribute__((preserve_access_index)) kernel_cap_t;

struct cred {
	kernel_cap_t cap_effective;
} __attribute__((preserve_access_index));

struct task_struct {
    unsigned int flags;
    const struct cred *cred;
} __attribute__((preserve_access_index));

char LICENSE[] SEC("license") = "GPL";

…</code></pre>
            <p>We don’t need to fully flesh out the structs; we just need the absolute minimum information a program needs to function. CO-RE will do whatever is necessary to perform the relocations for your kernel. This makes writing LSM BPF programs easy!</p><p><i>deny_unshare.bpf.c</i>:</p>
            <pre><code>SEC("lsm/cred_prepare")
int BPF_PROG(handle_cred_prepare, struct cred *new, const struct cred *old,
             gfp_t gfp, int ret)
{
    struct pt_regs *regs;
    struct task_struct *task;
    kernel_cap_t caps;
    int syscall;
    unsigned long flags;

    // If previous hooks already denied, go ahead and deny this one
    if (ret) {
        return ret;
    }

    task = bpf_get_current_task_btf();
    regs = (struct pt_regs *) bpf_task_pt_regs(task);
    // In x86_64 orig_ax has the syscall interrupt stored here
    syscall = regs-&gt;orig_ax;
    caps = task-&gt;cred-&gt;cap_effective;

    // Only process UNSHARE syscall, ignore all others
    if (syscall != UNSHARE_SYSCALL) {
        return 0;
    }

    // PT_REGS_PARM1_CORE pulls the first parameter passed into the unshare syscall
    flags = PT_REGS_PARM1_CORE(regs);

    // Ignore any unshare that does not have CLONE_NEWUSER
    if (!(flags &amp; CLONE_NEWUSER)) {
        return 0;
    }

    // Allow tasks with CAP_SYS_ADMIN to unshare (already root)
    if (caps.cap[CAP_TO_INDEX(CAP_SYS_ADMIN)] &amp; CAP_TO_MASK(CAP_SYS_ADMIN)) {
        return 0;
    }

    return -EPERM;
}</code></pre>
            <p>Creating the program is the first step; the second is loading and attaching it to our desired hook. There are several ways to do this: the <a href="https://github.com/cilium/ebpf">Cilium eBPF</a> project, <a href="https://github.com/libbpf/libbpf-rs">Rust bindings</a>, and several others on the <a href="https://ebpf.io/projects/">ebpf.io</a> project landscape page. We’re going to use native libbpf.</p><p><i>deny_unshare.c</i>:</p>
            <pre><code>#include &lt;bpf/libbpf.h&gt;
#include &lt;unistd.h&gt;
#include "deny_unshare.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char *argv[])
{
    struct deny_unshare_bpf *skel;
    int err;

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
    libbpf_set_print(libbpf_print_fn);

    // Loads and verifies the BPF program
    skel = deny_unshare_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "failed to load and verify BPF skeleton\n");
        goto cleanup;
    }

    // Attaches the loaded BPF program to the LSM hook
    err = deny_unshare_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("LSM loaded! ctrl+c to exit.\n");

    // The BPF link is not pinned, therefore exiting will remove program
    for (;;) {
        fprintf(stderr, ".");
        sleep(1);
    }

cleanup:
    deny_unshare_bpf__destroy(skel);
    return err;
}</code></pre>
            <p>Lastly, to compile, we use the following Makefile:</p><p><i>Makefile</i>:</p>
            <pre><code>CLANG ?= clang-13
LLVM_STRIP ?= llvm-strip-13
ARCH := x86
INCLUDES := -I/usr/include -I/usr/include/x86_64-linux-gnu
LIBS_DIR := -L/usr/lib/lib64 -L/usr/lib/x86_64-linux-gnu
LIBS := -lbpf -lelf

.PHONY: all clean run

all: deny_unshare.skel.h deny_unshare.bpf.o deny_unshare

run: all
	sudo ./deny_unshare

clean:
	rm -f *.o
	rm -f deny_unshare.skel.h

#
# BPF is kernel code. We need to pass -D__KERNEL__ to refer to fields present
# in the kernel version of pt_regs struct. uAPI version of pt_regs (from ptrace)
# has different field naming.
# See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd56e0058412fb542db0e9556f425747cf3f8366
#
deny_unshare.bpf.o: deny_unshare.bpf.c
	$(CLANG) -g -O2 -Wall -target bpf -D__KERNEL__ -D__TARGET_ARCH_$(ARCH) $(INCLUDES) -c $&lt; -o $@
	$(LLVM_STRIP) -g $@ # Removes debug information

deny_unshare.skel.h: deny_unshare.bpf.o
	sudo bpftool gen skeleton $&lt; &gt; $@

deny_unshare: deny_unshare.c deny_unshare.skel.h
	$(CC) -g -Wall -c $&lt; -o $@.o
	$(CC) -g -o $@ $(LIBS_DIR) $@.o $(LIBS)

.DELETE_ON_ERROR:</code></pre>
            
    <div>
      <h3>Result</h3>
      <a href="#result">
        
      </a>
    </div>
    <p>In a new terminal window run:</p>
            <pre><code>$ make run
…
LSM loaded! ctrl+c to exit.</code></pre>
            <p>In another terminal window, we’re successfully blocked!</p>
            <pre><code>$ unshare -rU
unshare: unshare failed: Cannot allocate memory
$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …</code></pre>
            <p>The policy has an additional feature to always allow privilege pass through:</p>
            <pre><code>$ sudo unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root)</code></pre>
            <p>In the unprivileged case the syscall aborts early. What is the performance impact in the privileged case?</p>
    <div>
      <h3>Measure performance</h3>
      <a href="#measure-performance">
        
      </a>
    </div>
    <p>For the measurements, we’re going to use a one-line unshare that maps the user namespace and executes a command within it:</p>
            <pre><code>$ unshare -frU --kill-child -- bash -c "exit 0"</code></pre>
            <p>Measuring in CPU cycles from unshare syscall enter to exit, we’ll compare the following as the root user:</p><ol><li><p>Command run without the policy</p></li><li><p>Command run with the policy</p></li></ol><p>We’ll record the measurements with <a href="https://docs.kernel.org/trace/ftrace.html">ftrace</a>:</p>
            <pre><code>$ sudo su
# cd /sys/kernel/debug/tracing
# echo 1 &gt; events/syscalls/sys_enter_unshare/enable ; echo 1 &gt; events/syscalls/sys_exit_unshare/enable</code></pre>
            <p>At this point, we’re enabling tracing for the syscall enter and exit for unshare specifically. Now we set the time-resolution of our enter/exit calls to count CPU cycles:</p>
            <pre><code># echo 'x86-tsc' &gt; trace_clock </code></pre>
            <p>Next we begin our measurements:</p>
            <pre><code># unshare -frU --kill-child -- bash -c "exit 0" &amp;
[1] 92014</code></pre>
            <p>Run the policy in a new terminal window, and then run our next syscall:</p>
            <pre><code># unshare -frU --kill-child -- bash -c "exit 0" &amp;
[2] 92019</code></pre>
            <p>Now we have our two calls for comparison:</p>
            <pre><code># cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:8
#
#                                _-----=&gt; irqs-off
#                               / _----=&gt; need-resched
#                              | / _---=&gt; hardirq/softirq
#                              || / _--=&gt; preempt-depth
#                              ||| / _-=&gt; migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         unshare-92014   [002] ..... 762950852559027: sys_unshare(unshare_flags: 10000000)
         unshare-92014   [002] ..... 762950852622321: sys_unshare -&gt; 0x0
         unshare-92019   [007] ..... 762975980681895: sys_unshare(unshare_flags: 10000000)
         unshare-92019   [007] ..... 762975980752033: sys_unshare -&gt; 0x0
</code></pre>
            <p>unshare-92014 used 63294 cycles. unshare-92019 used 70138 cycles.</p><p>We have a 6,844 (~10%) cycle penalty between the two measurements. Not bad!</p><p>These numbers are for a single syscall, and they add up the more frequently the code is called. Unshare is typically called at task creation, not repeatedly during normal execution of a program. Careful consideration and measurement are needed for your use case.</p>
    <div>
      <h2>Outro</h2>
      <a href="#outro">
        
      </a>
    </div>
    <p>We learned a bit about what LSM BPF is, how unshare is used to map a user to root, and how to solve a real-world problem by implementing a solution in eBPF. Tracking down the appropriate hook is not an easy task, and requires a bit of experimentation and a lot of digging through kernel code. Fortunately, that’s the hard part. Because a policy is written in C, we can tweak it granularly to fit our problem. This means one may extend this policy with an allow-list so that certain programs or users can continue to use an unprivileged unshare. Finally, we looked at the performance impact of this program, and saw that the overhead is a small price for blocking the attack vector.</p><p>“Cannot allocate memory” is not a clear error message for denying permissions. We proposed a <a href="https://lore.kernel.org/all/20220608150942.776446-1-fred@cloudflare.com/">patch</a> to propagate error codes from the <i>cred_prepare</i> hook up the call stack. Ultimately we came to the conclusion that a new hook is better suited to this problem. Stay tuned!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2AGA68zpZ0kGK4kfyvQ5Fa</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
    </channel>
</rss>