
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 12 Apr 2026 10:43:48 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How Automatic Return Routing solves IP overlap]]></title>
            <link>https://blog.cloudflare.com/automatic-return-routing-ip-overlap/</link>
            <pubDate>Thu, 05 Mar 2026 06:00:00 GMT</pubDate>
            <description><![CDATA[ Automatic Return Routing (ARR) solves the common enterprise challenge of overlapping private IP addresses by using stateful flow tracking instead of traditional routing tables. This userspace-driven approach ensures return traffic reaches the correct origin tunnel without manual NAT or VRF configuration. ]]></description>
            <content:encoded><![CDATA[ <p>The public Internet relies on a fundamental principle of predictable routing: a single IP address points to a logically unique destination. Even in an <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/"><u>Anycast architecture</u></a> like Cloudflare’s, where one IP is announced from hundreds of locations, every instance of that IP represents the same service. The routing table always knows exactly where a packet is intended to go.</p><p>This principle holds up because <a href="https://www.iana.org/numbers"><u>global addressing authorities</u></a> assign IP space to organizations to prevent duplication or conflict. When everyone adheres to a single, authoritative registry, a routing table functions as a source of absolute truth.</p><p>On the public Internet, an IP address is like a unique, globally registered national identity card. In private networks, an IP is just a name like “John Smith”, which is perfectly fine until you have three of them in the same room trying to talk to the same person.</p><p>As we expand Cloudflare One to become the <a href="https://blog.cloudflare.com/welcome-to-connectivity-cloud/"><u>connectivity cloud</u></a> for <a href="https://www.cloudflare.com/network-services/products/magic-wan/"><u>enterprise backbones</u></a>, we’ve entered the messy reality of private IP address space. There are good reasons why duplication arises, and enterprises need solutions to handle these conflicts.</p><p>Today, we are introducing Automatic Return Routing (ARR) in Closed Beta. ARR is an optional tool for Cloudflare One customers that gives you the flexibility to route traffic back to where it originated, without requiring an IP route in a routing table. This capability allows overlapping networks to coexist without a single line of Network Address Translation (NAT) or complex Virtual Routing and Forwarding (VRF) configuration.</p>
    <div>
      <h3>The ambiguity problem</h3>
      <a href="#the-ambiguity-problem">
        
      </a>
    </div>
    <p>In enterprise networking, IP overlap is a fact of life. We see it in three common scenarios that traditionally cause toil for admins:</p><ul><li><p><b>Mergers &amp; acquisitions:</b> Two companies merge, and both use <code>10.0.1.0/24</code> for their core services.</p></li><li><p><b>Extranets:</b> Partners, vendors or customers securely connect to your network using their own internal IP schemes, leading to unavoidable conflicts.</p></li><li><p><b>Cookie-cutter architectures:</b> SaaS providers or retail brands use identical IP space for every branch to simplify deployment and operation.</p></li></ul><p>The problem arises when these sites try to talk to the Internet or a data center through Cloudflare. If two different sites send traffic from the same source IP, the return packet hits an architectural wall. The administrator has to make a decision on how to route the traffic based on the ambiguous destination. If the administrator puts both routes into the routing table, it will be non-deterministic as to which path is taken: the correct path or the incorrect path. From the perspective of a standard routing table, there is no way to distinguish between two identical paths.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1fjcFntrxlnhXae4vobf5f/3ca26869ad923a1384349805fb4b371e/image1.png" />
          </figure><p><sup><i>This diagram shows two branches (Site A and Site B) both using 10.0.1.0/24. They send packets to Cloudflare. The return packet from the Internet reaches the Cloudflare edge, and this return traffic is sometimes sent to the wrong site because the routing table has two identical egress options.</i></sup></p>
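<p>A quick sketch (ours, not any real router's data structure) shows why a destination-keyed table cannot express this. A routing table maps a destination prefix to a next hop, so two sites advertising the same prefix contend for one entry; whether the table keeps only the last route, as below, or keeps both and picks one unpredictably, one site's return traffic is misdelivered:</p>

```go
package main

import "fmt"

func main() {
	// prefix -> next hop: the shape of a destination-based routing table
	routes := map[string]string{}

	routes["10.0.1.0/24"] = "tunnel-site-a" // Site A advertises its LAN
	routes["10.0.1.0/24"] = "tunnel-site-b" // Site B advertises the same prefix

	// One entry per prefix: Site A's path is silently gone, and every
	// return packet for 10.0.1.0/24 is now delivered toward Site B.
	fmt.Println(len(routes), routes["10.0.1.0/24"]) // prints: 1 tunnel-site-b
}
```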
    <div>
      <h3>Why traditional fixes fail</h3>
      <a href="#why-traditional-fixes-fail">
        
      </a>
    </div>
    <p>There are numerous ways to resolve this ambiguity, and we are committed to solving them in the easiest way for our customers to manage. The traditional “industry standard” fixes are functional, but they introduce significant administrative overhead and complexity that we are committed to eliminating:</p><ol><li><p><b>Virtual Routing and Forwarding </b>(<b>VRF):</b> This involves creating "virtual" routing tables to keep traffic isolated. While effective for separation, it adds administrative overhead. Managing cross-VRF communication (route leaking) is brittle and complex at scale. </p></li><li><p><b>Network Address Translation (NAT):</b> You can NAT each overlapping subnet from an unmanaged IP space to a managed IP range that is unique in your network. This approach works well, but the mapping is administrative toil for each new site or partner.</p></li></ol><p>Typically, the use case we hear from customers is an overlapping network needing to access the Internet or a private data center. How do we solve this without administrative overhead?</p>
    <div>
      <h3>Introducing Automatic Return Routing (ARR)</h3>
      <a href="#introducing-automatic-return-routing-arr">
        
      </a>
    </div>
    <p>We developed <b>ARR</b> as a "zero-touch" solution to this problem. ARR moves the intelligence from the routing table to stateful tracking.</p><p>So what is stateful tracking?</p><p>In traditional networking, a router is "forgetful" (aka “stateless”). It treats every single packet like a total stranger. Even if it just saw a packet from the exact same source going to the exact same destination a millisecond ago, it has to look at its routing table all over again to decide where to send the next one.</p><p><b>With stateful tracking, the system has a memory.</b> It recognizes when a series of packets are all part of the same “flow” (that is, a network conversation between two endpoints), and remembers key information about that flow until it finishes. With ARR, we remember one extra piece of information when initializing the flow: the specific tunnel that initiated it. This allows us to send return traffic back to that same tunnel, without ever consulting a routing table!</p><p>Instead of asking the network, "Where does this IP live?" ARR asks, "Where did this specific conversation originate?"</p><p><b>The Logic:</b></p><ol><li><p><b>Ingress:</b> A packet arrives at the Cloudflare edge from a site via a specific connection, i.e. an <a href="https://developers.cloudflare.com/cloudflare-wan/configuration/manually/how-to/configure-tunnel-endpoints/#ways-to-onboard-traffic-to-cloudflare"><u>IPsec tunnel, GRE tunnel, or Network Interconnect</u></a>.</p></li><li><p><b>Flow Matching:</b> The Cloudflare Virtual Network first checks (by header inspection) whether that packet matches an existing flow.</p><ol><li><p><b>Proxying: </b>If the packet matches, that's great! All of the decisions about this traffic have already been made and stored in our memory. All we need to do is pass that packet along already-established paths.</p></li><li><p><b>Flow Setup: </b>If it doesn’t match an existing flow, we decide which parts of the Cloudflare One stack to pass it through (e.g. 
<a href="https://developers.cloudflare.com/cloudflare-one/networks/connectors/cloudflare-wan/zero-trust/cloudflare-gateway/"><u>Gateway</u></a>, <a href="https://developers.cloudflare.com/cloudflare-one/data-loss-prevention/"><u>DLP</u></a>, <a href="https://developers.cloudflare.com/cloudflare-network-firewall/"><u>Firewall</u></a>), as well as its ultimate destination. We store all of this state in memory. With ARR, this is when we record which tunnel initiated the flow.</p></li></ol></li><li><p><b>Symmetric Return:</b> When return traffic arrives from the destination, the Cloudflare Virtual Network uses its existing in-memory state to proxy the traffic. Crucially, it does this without needing to examine the traffic’s destination IP, which could very well be reused across different sites. This completely bypasses the need to consult a routing table. We see the originating tunnel in the flow state and deliver the packet directly back to it.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5zOzhLU8jwsxcOSVmTGpXE/a305ba3b5ad1600b0bee4f5e11c992d4/image7.png" />
          </figure><p><sup><i>Example of overlapping source IPs tracked by in-memory flow state, tagged with source onramp to inform return routing decision.</i></sup></p><p>By remembering the originating tunnel for every flow, ARR facilitates <b>zero-touch routing</b>. If your site traffic is only client-to-Internet, there is no need to configure return routes at all, reducing toil when deploying new branch sites or “<a href="https://www.cloudflare.com/learning/access-management/coffee-shop-networking/"><u>Coffee Shop Networking</u></a>.”</p>
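<p>The ingress / flow-matching / symmetric-return logic described above can be sketched in a few lines of Go. This is purely an illustration under our own assumptions (the type names, the 5-tuple key, and the string tunnel IDs are ours, not Cloudflare's implementation), and a real data plane must also handle flows whose entire 5-tuples collide across sites:</p>

```go
package main

import "fmt"

// FlowKey is the classic 5-tuple identifying a network conversation.
type FlowKey struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            string
}

// FlowState is what we remember for the lifetime of a flow; for ARR-style
// return routing the key piece is the tunnel that initiated it.
type FlowState struct {
	OriginTunnel string // recorded once, at flow setup
}

type FlowTable map[FlowKey]FlowState

// Ingress: on a flow miss, perform flow setup and record the on-ramp tunnel.
func (t FlowTable) Ingress(k FlowKey, tunnel string) {
	if _, ok := t[k]; !ok {
		t[k] = FlowState{OriginTunnel: tunnel}
	}
}

// Return: look up the reversed 5-tuple; the stored state, not a routing
// table, decides which tunnel receives the return packet.
func (t FlowTable) Return(k FlowKey) (string, bool) {
	rev := FlowKey{k.DstIP, k.SrcIP, k.DstPort, k.SrcPort, k.Proto}
	s, ok := t[rev]
	return s.OriginTunnel, ok
}

func main() {
	flows := FlowTable{}

	// Site A and Site B both use 10.0.1.0/24: identical source IPs,
	// different on-ramp tunnels.
	a := FlowKey{"10.0.1.5", "203.0.113.9", 40001, 443, "tcp"}
	b := FlowKey{"10.0.1.5", "203.0.113.9", 40002, 443, "tcp"}
	flows.Ingress(a, "tunnel-site-a")
	flows.Ingress(b, "tunnel-site-b")

	// Return traffic from the Internet: the flow state disambiguates.
	ret := FlowKey{"203.0.113.9", "10.0.1.5", 443, 40001, "tcp"}
	tun, _ := flows.Return(ret)
	fmt.Println(tun) // prints: tunnel-site-a
}
```

<p>The point of the sketch is that the return-path decision reads only the flow state: the overlapping destination address 10.0.1.5 is never looked up in any routing table.</p>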
    <div>
      <h3>Built on Unified Routing</h3>
      <a href="#built-on-unified-routing">
        
      </a>
    </div>
    <p>To make ARR a reality at Cloudflare scale, we plugged into another initiative we have been working on: Unified Routing.</p><p>Historically, Cloudflare Zero Trust (users/proxies) and Cloudflare WAN (network-layer/sites) lived at different levels of the system. Cloudflare WAN relied on kernel primitives (Linux network namespaces, routes, eBPF, etc). Zero Trust lived in userspace, where proxies could perform deep inspection and application-level security. This "split-brain" approach often required complex logic to move traffic between component services, and some of this complexity became product limitations that customers might notice.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6BcqzXc35KtEu7SNre03g0/d14ff89eecdec047ae615e1bc6d9b713/image6.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rxEKg3TEpqxqI7LLGAMOB/a3ae3068b87a8393ad13f7638ca2c93a/image5.png" />
          </figure><p>With our new Unified Routing mode, we have moved the initial routing decision from our network-layer data plane into our existing Zero Trust userspace routing logic, the same hardened software used by Cloudflare One Clients and Cloudflare Tunnel in our Zero Trust solution. This change has <a href="https://developers.cloudflare.com/cloudflare-wan/reference/traffic-steering/#why-use-unified-routing"><u>many benefits</u></a> to how we enable our customers to use their private networks with products across the Cloudflare platform, as it fixes long-standing interoperability problems between Cloudflare WAN and Zero Trust. Unified Routing means you can use Cloudflare Mesh, Cloudflare Tunnel, and IPsec/GRE on-ramps together in the same account without a single conflict.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3iHhwmiZgl52HXuze7ct3t/f2746e8c75f202465ed9d8a9bc031204/image2.png" />
          </figure><p>In September 2025, we deployed Unified Routing mode internally for all Cloudflare employees and sites. We saw immediate 3-5x performance improvements for Cloudflare One Clients, as you can see in the graph above.</p><p>When designing ARR, we knew that we needed to move away from kernel-based routing and build on our new Unified Routing framework.</p><p>When Unified Routing is enabled, all Cloudflare WAN traffic flows through <a href="https://blog.cloudflare.com/extending-local-traffic-management-load-balancing-to-layer-4-with-spectrum/#how-we-enabled-spectrum-to-support-private-networks"><u>Apollo, our Zero Trust hub</u></a>. Unlike the Linux kernel's standard routing table, our userspace data plane is fully programmable. We can attach metadata, like the originating Tunnel ID, directly to a flow entry in Apollo. </p><p>Each packet is tracked by flow from the moment it hits our edge, and we no longer need to make independent, per-packet routing decisions. Instead, we can make consistent, session-aware decisions for the lifetime of the flow.</p><p>ARR is <a href="https://developers.cloudflare.com/magic-wan/configuration/manually/how-to/configure-routes/#configure-automatic-return-routing-beta"><u>straightforward to enable</u></a> on a per tunnel or interconnect basis:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gAcPydMW8AIdua46qrtwO/58ba33000804da6f890be9fbfd4d4b1f/image4.png" />
          </figure><p>Once enabled for a tunnel or interconnect, any traffic that matches an existing flow is routed back to the connection where it originated, without consulting the routing table.</p>
    <div>
      <h3>Putting ARR to work</h3>
      <a href="#putting-arr-to-work">
        
      </a>
    </div>
    <p>For the enterprise architect, ARR is a tool to bypass the persistent friction of IP address conflicts. Whether integrating an acquisition or onboarding a partner, the goal is to make the network invisible, so you can focus on the applications, not the plumbing.</p><p>Today, ARR is in closed beta and supports overlapping IP addresses accessing the Internet via our Secure Web Gateway. We are already extending this to support private data center access, adding mid-flow failover (pinning the flow to a primary onramp, and seamlessly detecting when that flow fails over to a backup onramp), and further investing in the architectural capabilities needed to make IP overlap a non-issue for even the most complex global deployments.</p><p>Not using Cloudflare One yet? <a href="https://dash.cloudflare.com/sign-up/zero-trust"><u>Start now</u></a> with our Free and Pay-as-you-go plans to protect and connect your users and networks, and <a href="https://www.cloudflare.com/contact/sase/"><u>contact us</u></a> for comprehensive private WAN connectivity via IPsec and private interconnect.</p> ]]></content:encoded>
            <category><![CDATA[SASE]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <guid isPermaLink="false">Fvm2xTFInpKNW6WLw63Bw</guid>
            <dc:creator>Steve Welham</dc:creator>
            <dc:creator>Lauren Joplin</dc:creator>
            <dc:creator>Jackson Kruger</dc:creator>
            <dc:creator>Thea Heinen</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we found a bug in Go's arm64 compiler]]></title>
            <link>https://blog.cloudflare.com/how-we-found-a-bug-in-gos-arm64-compiler/</link>
            <pubDate>Wed, 08 Oct 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ 84 million requests a second means even rare bugs appear often. We'll reveal how we discovered a race condition in the Go arm64 compiler and got it fixed. ]]></description>
            <content:encoded><![CDATA[ <p>Every second, 84 million HTTP requests hit Cloudflare across our fleet of data centers in 330 cities. At that scale, even the rarest of bugs show up frequently. In fact, it was our scale that recently led us to discover a bug in Go's arm64 compiler that causes a race condition in the generated code.</p><p>This post breaks down how we first encountered the bug, investigated it, and ultimately drove it to the root cause.</p>
    <div>
      <h2>Investigating a strange panic</h2>
      <a href="#investigating-a-strange-panic">
        
      </a>
    </div>
    <p>We run a service in our network which configures the kernel to handle traffic for some products like <a href="https://www.cloudflare.com/network-services/products/magic-transit/"><u>Magic Transit</u></a> and <a href="https://www.cloudflare.com/network-services/products/magic-wan/"><u>Magic WAN</u></a>. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.</p><p>We first saw one with a fatal error stating that <a href="https://github.com/golang/go/blob/c0ee2fd4e309ef0b8f4ab6f4860e2626c8e00802/src/runtime/traceback.go#L566"><u>traceback did not unwind completely</u></a>. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.</p><p>And then it kept happening. </p>
    <div>
      <h4>Coredumps per hour</h4>
      <a href="#coredumps-per-hour">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ohMD2ph3I9l7lbZQWZ4xl/243e679d8293565fc6a55a670c9f3583/BLOG-2906_2.png" />
          </figure><p>When we first saw this bug, we noticed that the fatal errors correlated with recovered panics. These were caused by some old code which used panic/recover as error handling. </p><p>At this point, our theory was: </p><ol><li><p>All of the fatal panics happen within stack unwinding.</p></li><li><p>We correlated an increased volume of recovered panics with these fatal panics.</p></li><li><p>Recovering a panic unwinds goroutine stacks to call deferred functions.</p></li><li><p>A related <a href="https://github.com/golang/go/issues/73259"><u>Go issue (#73259)</u></a> reported an arm64 stack unwinding crash.</p></li><li><p>Let’s stop using panic/recover for error handling and wait out the upstream fix?</p></li></ol><p>So we did that and watched as fatal panics stopped occurring as the release rolled out. Fatal panics gone, our theoretical mitigation seemed to work, and this was no longer our problem. We subscribed to the upstream issue so we could update when it was resolved and put it out of our minds.</p><p>But this turned out to be a much stranger bug than expected. Putting it out of our minds was premature, as the same class of fatal panics came back at a much higher rate. A month later, we were seeing up to 30 daily fatal panics with no real discernible cause; while that might account for only one machine a day in less than 10% of our data centers, we found it concerning that we didn’t understand the cause. The first thing we checked was the number of recovered panics, to match our previous pattern, but there were none. More interestingly, we could not correlate this increased rate of fatal panics with anything. A release? Infrastructure changes? The position of Mars? </p><p>At this point we felt like we needed to dive deeper to better understand the root cause. Pattern matching and hoping was clearly insufficient. </p><p>We saw two classes of this bug -- a crash while accessing invalid memory and an explicitly checked fatal error. </p>
    <div>
      <h4>Fatal Error</h4>
      <a href="#fatal-error">
        
      </a>
    </div>
    
            <pre><code>runtime: g8221077: frame.sp=0x4015d784c0 top=0x4015d89fd0
fatal error: traceback did not unwind completely
stack=[0x4015d6a000-0x4015d8a000

runtime stack:
runtime.throw({0x55558de6aa27?, 0x7ff97fffe638?})
       /usr/local/go/src/runtime/panic.go:1073 +0x38 fp=0x7ff97fffe490 sp=0x7ff97fffe460 pc=0x55558d403388
runtime.(*unwinder).finishInternal(0x7ff97fffe4f8?)
       /usr/local/go/src/runtime/traceback.go:566 +0x110 fp=0x7ff97fffe4d0 sp=0x7ff97fffe490 pc=0x55558d3eed40
runtime.(*unwinder).next(0x7ff97fffe5b0?)
       /usr/local/go/src/runtime/traceback.go:447 +0x2ac fp=0x7ff97fffe560 sp=0x7ff97fffe4d0 pc=0x55558d3eeb7c
runtime.scanstack(0x4014494380, 0x400005bc50)
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7ff97fffe6a0 sp=0x7ff97fffe560 pc=0x55558d3acaa0
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7ff97fffe6f0 sp=0x7ff97fffe6a0 pc=0x55558d3ab578
runtime.markroot(0x400005bc50, 0x17e6, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7ff97fffe7a0 sp=0x7ff97fffe6f0 pc=0x55558d3ab248
runtime.gcDrain(0x400005bc50, 0x7)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7ff97fffe810 sp=0x7ff97fffe7a0 pc=0x55558d3ad514
runtime.gcDrainMarkWorkerIdle(...)
       /usr/local/go/src/runtime/mgcmark.go:1102
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgc.go:1508 +0x68 fp=0x7ff97fffe860 sp=0x7ff97fffe810 pc=0x55558d3a9408
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7ff97fffe870 sp=0x7ff97fffe860 pc=0x55558d4098fc
goroutine 153 gp=0x4000105340 m=324 mp=0x400639ea08 [GC worker (active)]:</code></pre>
            <p></p>
    <div>
      <h4>Segmentation fault</h4>
      <a href="#segmentation-fault">
        
      </a>
    </div>
    
            <pre><code>SIGSEGV: segmentation violation
PC=0x55557e2bea58 m=13 sigcode=1 addr=0x118

goroutine 0 gp=0x40003af880 m=13 mp=0x40003ca008 [idle]:
runtime.(*unwinder).next(0x7fff2afde5b0)
       /usr/local/go/src/runtime/traceback.go:458 +0x188 fp=0x7fff2afde560 sp=0x7fff2afde4d0 pc=0x55557e2bea58
runtime.scanstack(0x40042cc000, 0x4000059750)
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7fff2afde6a0 sp=0x7fff2afde560 pc=0x55557e27caa0
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7fff2afde6f0 sp=0x7fff2afde6a0 pc=0x55557e27b578
runtime.markroot(0x4000059750, 0xb8, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7fff2afde7a0 sp=0x7fff2afde6f0 pc=0x55557e27b248
runtime.gcDrain(0x4000059750, 0x3)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7fff2afde810 sp=0x7fff2afde7a0 pc=0x55557e27d514
runtime.gcDrainMarkWorkerDedicated(...)
       /usr/local/go/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgc.go:1489 +0x94 fp=0x7fff2afde860 sp=0x7fff2afde810 pc=0x55557e279434
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7fff2afde870 sp=0x7fff2afde860 pc=0x55557e2d98fc
goroutine 187 gp=0x40003aea80 m=13 mp=0x40003ca008 [GC worker (active)]:</code></pre>
            <p></p><p>Now we could observe some clear patterns. Both errors occur when unwinding the stack in <code>(*unwinder).next</code>. In one case we saw an intentional<a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L566"> <u>fatal error</u></a> as the runtime identified that unwinding could not complete and the stack was in a bad state. In the other case there was a direct memory access error that happened while trying to unwind the stack. The segfault was discussed in the <a href="https://github.com/golang/go/issues/73259#issuecomment-2786818812"><u>GitHub issue</u></a> and a Go engineer identified it as a dereference of a Go scheduler struct, <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L536"><u>m</u></a>, when<a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L458"> <u>unwinding</u></a>.</p>
    <div>
      <h3>A review of Go scheduler structs</h3>
      <a href="#a-review-of-go-scheduler-structs">
        
      </a>
    </div>
    <p>Go uses a lightweight userspace scheduler to manage concurrency. Many goroutines are scheduled on a smaller number of kernel threads – this is often referred to as M:N scheduling. Any individual goroutine can be scheduled on any kernel thread. The scheduler has three core types – <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L394"><code><u>g</u></code></a> (the goroutine), <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L536"><code><u>m</u></code></a> (the kernel thread, or “machine”), and <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L644"><code><u>p</u></code></a> (the physical execution context, or “processor”). For a goroutine to be scheduled, a free <code>m</code> must acquire a free <code>p</code>, which will execute a <code>g</code>. Each <code>g</code> contains a field for its <code>m</code> if it is currently running; otherwise it will be <b>nil</b>. This is all the context needed for this post, but the <a href="https://github.com/golang/go/blob/master/src/runtime/HACKING.md#gs-ms-ps"><u>Go runtime docs</u></a> explore this more comprehensively.</p><p>At this point we can start to make inferences about what’s happening: the program crashes because we try to unwind a goroutine stack which is invalid. In the first backtrace, if a <a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L446"><u>return address is null, we call </u><code><u>finishInternal</u></code><u> and abort because the stack was not fully unwound</u></a>. The segmentation fault case in the second backtrace is a bit more interesting: if the return address is non-zero but not a function, then the unwinder code assumes that the goroutine is currently running. It will then dereference <code>m</code> and fault by accessing <code>m.incgo</code> (the offset of <code>incgo</code> into <code>struct m</code> is 0x118, the faulting memory access).</p><p>What, then, is causing this corruption? The traces were difficult to get anything useful from – our service has hundreds if not thousands of active goroutines. It was fairly clear from the beginning that the panic was remote from the actual bug. The crashes were all observed while unwinding the stack, and if this were an issue any time the stack was unwound on arm64, we would be seeing it in many more services. We felt pretty confident that the stack unwinding was happening correctly but on an invalid stack.</p><p>Our investigation stalled for a while at this point – making guesses, testing guesses, trying to infer if the panic rate went up or down, or if nothing changed. There was <a href="https://github.com/golang/go/issues/73259"><u>a known issue</u></a> on Go’s GitHub issue tracker which matched our symptoms almost exactly, but what they discussed was mostly what we already knew. At some point when looking through the linked stack traces we realized that their crash referenced an old version of a library that we were also using – Go Netlink.</p>
            <pre><code>goroutine 1267 gp=0x4002a8ea80 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /usr/local/go/src/runtime/preempt.go:308 +0x3c fp=0x4004cec4c0 sp=0x4004cec4a0 pc=0x46353c
runtime.asyncPreempt()
        /usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x4004cec6b0 sp=0x4004cec4c0 pc=0x4a6a8c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0x14360300000000?)
        /go/pkg/mod/github.com/!data!dog/netlink@v1.0.1-0.20240223195320-c7a4f832a3d1/nl/nl_linux.go:803 +0x130 fp=0x4004cfc710 sp=0x4004cec6c0 pc=0xf95de0
</code></pre>
            <p></p><p>We spot-checked a few stack traces and confirmed the presence of this Netlink library. Querying our logs showed more than a shared library: every single segmentation fault we observed had happened while preempting <a href="https://github.com/vishvananda/netlink/blob/e1e260214862392fb28ff72c9b11adc84df73e2c/nl/nl_linux.go#L880"><code><u>NetlinkSocket.Receive</u></code></a>.</p>
    <div>
      <h3>What’s (async) preemption?</h3>
      <a href="#whats-async-preemption">
        
      </a>
    </div>
    <p>In the prehistoric era of Go (&lt;=1.13) the runtime was cooperatively scheduled. A goroutine would run until it decided it was ready to yield to the scheduler – usually due to explicit calls to <code>runtime.Gosched()</code> or injected yield points at function calls/IO operations. Since <a href="https://go.dev/doc/go1.14#runtime"><u>Go 1.14</u></a> the runtime instead does async preemption. The Go runtime has a thread, <code>sysmon</code>, which tracks the runtime of goroutines and will preempt any that run for longer than 10ms (at time of writing). It does this by sending <code>SIGURG</code> to the OS thread; the signal handler then modifies the program counter and stack to mimic a call to <code>asyncPreempt</code>.</p><p>At this point we had two broad theories:</p><ul><li><p>This is a Go Netlink bug – likely due to <code>unsafe.Pointer</code> usage which invoked undefined behavior but is only actually broken on arm64</p></li><li><p>This is a Go runtime bug and we're only triggering it in <code>NetlinkSocket.Receive</code> for some reason</p></li></ul><p>After finding the same bug publicly reported upstream, we were feeling confident this was caused by a Go runtime bug. However, upon seeing that both issues implicated the same function, we felt more skeptical – notably the Go Netlink library uses <code>unsafe.Pointer</code>, so memory corruption was a plausible explanation even if we didn't understand why.</p><p>After an unsuccessful code audit we had hit a wall. The crashes were rare and remote from the root cause. Maybe these crashes were caused by a runtime bug, maybe they were caused by a Go Netlink bug. It seemed clear that there was something wrong with this area of the code, but code auditing wasn’t going anywhere.</p>
    <div>
      <h2>Breakthrough</h2>
      <a href="#breakthrough">
        
      </a>
    </div>
    <p>At this point we had a fairly good understanding of what was crashing but very little understanding of <b>why</b> it was happening. It was clear that the root cause of the stack unwinder crashing was remote from the actual crash, and that it had to do with <code>(*NetlinkSocket).Receive</code>, but why? We were able to capture a <b>coredump</b> of a production crash and view it in a debugger. The backtrace confirmed what we already knew – that there was a segmentation fault when unwinding a stack. The crux of the issue revealed itself when we looked at the goroutine which had been preempted while calling <code>(*NetlinkSocket).Receive</code>.     </p>
            <pre><code>(dlv) bt
0  0x0000555577579dec in runtime.asyncPreempt2
   at /usr/local/go/src/runtime/preempt.go:306
1  0x00005555775bc94c in runtime.asyncPreempt
   at /usr/local/go/src/runtime/preempt_arm64.s:47
2  0x0000555577cb2880 in github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive
   at
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779
3  0x0000555577cb19a8 in github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute
   at 
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:532
4  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
5  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
...
(dlv) disass -a 0x555577cb2878 0x555577cb2888
TEXT github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(SB) /vendor/github.com/vishvananda/netlink/nl/nl_linux.go
        nl_linux.go:779 0x555577cb2878  fdfb7fa9        LDP -8(RSP), (R29, R30)
        nl_linux.go:779 0x555577cb287c  ff430191        ADD $80, RSP, RSP
        nl_linux.go:779 0x555577cb2880  ff434091        ADD $(16&lt;&lt;12), RSP, RSP
        nl_linux.go:779 0x555577cb2884  c0035fd6        RET
</code></pre>
            <p></p><p>The goroutine was paused between two opcodes in the function epilogue. Since the process of unwinding a stack relies on the stack frame being in a consistent state, it felt immediately suspicious that we preempted in the middle of adjusting the stack pointer. The goroutine had been paused at 0x555577cb2880, between <code>ADD $80, RSP, RSP</code> and <code>ADD $(16&lt;&lt;12), RSP, RSP</code>. </p><p>We queried the service logs to confirm our theory. This wasn’t isolated – the majority of stack traces showed that this same opcode was preempted. This was no longer a weird production crash we couldn’t reproduce. A crash happened when the Go runtime preempted between these two stack pointer adjustments. We had our smoking gun. </p>
    <div>
      <h2>Building a minimal reproducer</h2>
      <a href="#building-a-minimal-reproducer">
        
      </a>
    </div>
    <p>At this point we felt pretty confident that this was actually just a runtime bug and it should be reproducible in an isolated environment without any dependencies. The theory at this point was:</p><ol><li><p>Stack unwinding is triggered by garbage collection</p></li><li><p>Async preemption between a split stack pointer adjustment causes a crash</p></li><li><p>What if we make a function which splits the adjustment and then call it in a loop?</p></li></ol>
            <pre><code>package main

import (
	"runtime"
)

//go:noinline
func big_stack(val int) int {
	var big_buffer = make([]byte, 1 &lt;&lt; 16)

	sum := 0
	// prevent the compiler from optimizing out the stack
	for i := 0; i &lt; (1&lt;&lt;16); i++ {
		big_buffer[i] = byte(val)
	}
	for i := 0; i &lt; (1&lt;&lt;16); i++ {
		sum ^= int(big_buffer[i])
	}
	return sum
}

func main() {
	go func() {
		for {
			runtime.GC()
		}
	}()
	for {
		_ = big_stack(1000)
	}
}
</code></pre>
            <p></p><p>This function ends up with a stack frame slightly larger than can be represented in 16 bits, and so on arm64 the Go compiler will split the stack pointer adjustment into two opcodes. If the runtime preempts between these opcodes then the stack unwinder will read an invalid stack pointer and crash. </p>
            <pre><code>; epilogue for main.big_stack
ADD $8, RSP, R29
ADD $(16&lt;&lt;12), R29, R29
ADD $16, RSP, RSP
; preemption is problematic between these opcodes
ADD $(16&lt;&lt;12), RSP, RSP
RET
</code></pre>
            <p></p><p>After running this for a few minutes the program panicked as expected!</p>
            <pre><code>SIGSEGV: segmentation violation
PC=0x60598 m=8 sigcode=1 addr=0x118

goroutine 0 gp=0x400019c540 m=8 mp=0x4000198708 [idle]:
runtime.(*unwinder).next(0x400030fd10)
        /home/thea/sdk/go1.23.4/src/runtime/traceback.go:458 +0x188 fp=0x400030fcc0 sp=0x400030fc30 pc=0x60598
runtime.scanstack(0x40000021c0, 0x400002f750)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:887 +0x290 

[...]

goroutine 1 gp=0x40000021c0 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /home/thea/sdk/go1.23.4/src/runtime/preempt.go:308 +0x3c fp=0x40003bfcf0 sp=0x40003bfcd0 pc=0x400cc
runtime.asyncPreempt()
        /home/thea/sdk/go1.23.4/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40003bfee0 sp=0x40003bfcf0 pc=0x75aec
main.big_stack(0x40003cff38?)
        /home/thea/dev/stack_corruption_reproducer/main.go:29 +0x94 fp=0x40003cff00 sp=0x40003bfef0 pc=0x77c04
Segmentation fault (core dumped)

real    1m29.165s
user    4m4.987s
sys     0m43.212s</code></pre>
            <p></p><p>A reproducible crash with only the standard library? This felt like conclusive evidence that our problem was a runtime bug.</p><p>This was an extremely particular reproducer! Even now, with a good understanding of the bug and its fix, some of the behavior is still puzzling. It's a one-instruction race condition, so it’s unsurprising that small changes could have a large impact. For example, this reproducer was originally written and tested on Go 1.23.4, but did not crash when compiled with 1.23.9 (the version in production), even though we could objdump the binary and see the split ADD still present! We don’t have a definite explanation for this behavior – even with the bug present there remain a few unknown variables which affect the likelihood of hitting the race condition.</p>
    <div>
      <h2>A single-instruction race condition window</h2>
      <a href="#a-single-instruction-race-condition-window">
        
      </a>
    </div>
    <p>arm64 is a fixed-length 4-byte instruction set architecture. This has a lot of implications for codegen, but most relevant to this bug is the fact that immediate length is limited.<a href="https://developer.arm.com/documentation/ddi0596/2020-12/Base-Instructions/ADD--immediate---Add--immediate--"> <code><u>add</u></code></a> gets a 12-bit immediate,<a href="https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/MOV--wide-immediate-"> <code><u>mov</u></code></a> gets a 16-bit immediate, etc. How does the architecture handle operands that don't fit? It depends – <code>ADD</code> in particular reserves a bit for "shift left by 12", so any 24-bit addition can be decomposed into two opcodes. Other instructions are decomposed similarly, or just require loading an immediate into a register first. </p><p>The very last step of the Go compiler before emitting machine code involves transforming the program into <code>obj.Prog</code> structs. It's a very low-level intermediate representation (IR) that mostly serves to be translated into machine code. </p>
            <pre><code>//https://github.com/golang/go/blob/fa2bb342d7b0024440d996c2d6d6778b7a5e0247/src/cmd/internal/obj/arm64/obj7.go#L856

// Pop stack frame.
// ADD $framesize, RSP, RSP
p = obj.Appendp(p, c.newprog)
p.As = AADD
p.From.Type = obj.TYPE_CONST
p.From.Offset = int64(c.autosize)
p.To.Type = obj.TYPE_REG
p.To.Reg = REGSP
p.Spadj = -c.autosize
</code></pre>
            <p></p><p>Notably, this IR is not aware of immediate length limitations. Instead, this happens in<a href="https://github.com/golang/go/blob/2f653a5a9e9112ff64f1392ff6e1d404aaf23e8c/src/cmd/internal/obj/arm64/asm7.go"> <u>asm7.go</u></a> when Go's internal intermediate representation is translated into arm64 machine code. The assembler will classify an immediate in <a href="https://github.com/golang/go/blob/2f653a5a9e9112ff64f1392ff6e1d404aaf23e8c/src/cmd/internal/obj/arm64/asm7.go#L1905"><u>conclass</u></a> based on bit size and then use that classification when emitting instructions, adding extra opcodes if needed.</p><p>The Go assembler uses a combination of (<code>mov, add</code>) opcodes for some adds that fit in 16-bit immediates, and prefers (<code>add, add + lsl 12</code>) opcodes for immediates larger than 16 bits.</p><p>Compare a stack of (slightly larger than) <code>1&lt;&lt;15</code>:</p>
            <pre><code>; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1&lt;&lt;15)
; 	return big_stack[0]
; }
MOVD $32776, R27
ADD R27, RSP, R29
MOVD $32784, R27
ADD R27, RSP, RSP
RET
</code></pre>
            <p></p><p>With a stack of <code>1&lt;&lt;16</code>:</p>
            <pre><code>; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1&lt;&lt;16)
; 	return big_stack[0]
; } 
ADD $8, RSP, R29
ADD $(16&lt;&lt;12), R29, R29
ADD $16, RSP, RSP
ADD $(16&lt;&lt;12), RSP, RSP
RET
</code></pre>
            <p>In the larger stack case, there is a point between the <code>ADD x, RSP, RSP</code> opcodes where the stack pointer is not pointing to the tip of a stack frame. We thought at first that this was a matter of memory corruption – that in handling async preemption the runtime would push a function call on the stack and corrupt the middle of the stack. However, this goroutine is already in the function epilogue – any data we corrupt is actively in the process of being thrown away. What's the issue then?</p><p>The Go runtime often needs to <b>unwind</b> the stack, which means walking backwards through the chain of function calls. For example: garbage collection uses it to find live references on the stack, panicking relies on it to evaluate <code>defer</code> functions, and generating stack traces needs to print the call stack. For this to work the stack pointer <b>must be accurate during unwinding</b> because of how Go dereferences <code>sp</code> to determine the calling function. If the stack pointer is partially modified, the unwinder will look for the calling function in the middle of the stack. The underlying data is meaningless when interpreted as directions to a parent stack frame, and the runtime will likely crash. </p>
            <pre><code>//https://github.com/golang/go/blob/66536242fce34787230c42078a7bbd373ef8dcb0/src/runtime/traceback.go#L373

if innermost &amp;&amp; frame.sp &lt; frame.fp || frame.lr == 0 {
    lrPtr = frame.sp
    frame.lr = *(*uintptr)(unsafe.Pointer(lrPtr))
}
</code></pre>
            <p></p><p>When async preemption happens it will push a function call onto the stack, but the recorded parent stack frame is no longer correct because <code>sp</code> was only partially adjusted when the preemption happened. The crash flow looks something like this:</p><ol><li><p>Async preemption happens between the two opcodes that <code>add x, rsp</code> expands to</p></li><li><p>Garbage collection triggers stack unwinding (to check for heap object liveness)</p></li><li><p>The unwinder starts traversing the stack of the problematic goroutine and correctly unwinds up to the problematic function</p></li><li><p>The unwinder dereferences <code>sp</code> to determine the parent function</p></li><li><p>Almost certainly the data behind <code>sp</code> is not a valid return address</p></li><li><p>Crash</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LKOBC6lQzuKvwy0vyEfk2/054c9aaedc14d155294a682f7de3a610/BLOG-2906_3.png" />
          </figure><p>We saw earlier a faulting stack trace which ended in <code>(*NetlinkSocket).Receive</code> – in this case stack unwinding faulted while it was trying to determine the parent frame.    </p>
            <pre><code>goroutine 90 gp=0x40042cc000 m=nil [preempted (scan)]:
runtime.asyncPreempt2()
/usr/local/go/src/runtime/preempt.go:306 +0x2c fp=0x40060a25d0 sp=0x40060a25b0 pc=0x55557e299dec
runtime.asyncPreempt()
/usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40060a27c0 sp=0x40060a25d0 pc=0x55557e2dc94c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xff48ce6e060b2848?)
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779 +0x130 fp=0x40060b2820 sp=0x40060a27d0 pc=0x55557e9d2880
</code></pre>
            <p></p><p>Once we discovered the root cause we reported it with a reproducer and the bug was quickly fixed. This bug is fixed in <a href="https://github.com/golang/go/commit/e8794e650e05fad07a33fb6e3266a9e677d13fa8"><u>go1.23.12</u></a>, <a href="https://github.com/golang/go/commit/6e1c4529e4e00ab58572deceab74cc4057e6f0b6"><u>go1.24.6</u></a>, and <a href="https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c6045ba73f66ef902"><u>go1.25.0</u></a>. Previously, the Go compiler emitted a single <code>add x, rsp</code> instruction and relied on the assembler to split immediates into multiple opcodes as necessary. After this change, for stacks larger than 1&lt;&lt;12 the compiler builds the offset in a temporary register and then adds it to <code>rsp</code> in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.</p>
            <pre><code>LDP -8(RSP), (R29, R30)
MOVD $32, R27
MOVK $(1&lt;&lt;16), R27
ADD R27, RSP, RSP
RET</code></pre>
            <p></p><p>This was a very fun problem to debug. We don’t often see bugs where you can accurately blame the compiler. Debugging it took weeks, and we had to learn about areas of the Go runtime that people don’t usually need to think about. It’s a nice example of a rare race condition – the sort of bug that only really shows up at scale.</p><p>We’re always looking for people who enjoy this kind of detective work. <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering"><u>Our engineering teams are hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">12E3V053vhNrZrU5I2AAV1</guid>
            <dc:creator>Thea Heinen</dc:creator>
        </item>
    </channel>
</rss>