
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Thu, 09 Apr 2026 03:01:31 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Inside Gen 13: how we built our most powerful server yet]]></title>
            <link>https://blog.cloudflare.com/gen13-config/</link>
            <pubDate>Mon, 23 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare's Gen 13 servers introduce AMD EPYC™ Turin 9965 processors and a transition to 100 GbE networking to meet growing traffic demands. In this technical deep dive, we explain the engineering rationale behind each major component selection. ]]></description>
            <content:encoded><![CDATA[ <p>A few months ago, Cloudflare announced <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>the transition to FL2</u></a>, our Rust-based rewrite of Cloudflare's core request handling layer. This transition accelerates our ability to help build a better Internet for everyone. Alongside the software migration, Cloudflare has refreshed its server hardware design for greater capability and efficiency, to serve the evolving demands of our network and software stack. Gen 13 is built around a 192-core AMD EPYC™ Turin 9965 processor, 768 GB of DDR5-6400 memory, 24 TB of PCIe 5.0 NVMe storage, and a dual-port 100 GbE network interface card.</p><p>Gen 13 delivers:</p><ul><li><p>Up to 2x throughput compared to Gen 12 while staying within latency SLA</p></li><li><p>Up to 50% improvement in performance / watt efficiency, reducing data center expansion costs</p></li><li><p>Up to 60% higher throughput per rack while keeping the rack power budget constant</p></li><li><p>2x memory capacity, 1.5x storage capacity, 4x network bandwidth</p></li><li><p>Hardware support for PCIe encryption in addition to memory encryption</p></li><li><p>Improved support for thermally demanding, high-power drop-in PCIe accelerators</p></li></ul><p>This blog post covers the engineering rationale behind each major component selection: what we evaluated, what we chose, and why.</p><table><tr><td><p>Generation</p></td><td><p>Gen 13 Compute</p></td><td><p>Previous Gen 12 Compute</p></td></tr><tr><td><p>Form Factor</p></td><td><p>2U1N, Single socket</p></td><td><p>2U1N, Single socket</p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 9965 
Turin 192-Core Processor</p></td><td><p>AMD EPYC™ 9684X 
Genoa-X 96-Core Processor</p></td></tr><tr><td><p>Memory</p></td><td><p>768GB of DDR5-6400 x12 memory channel</p></td><td><p>384GB of DDR5-4800 x12 memory channel</p></td></tr><tr><td><p>Storage</p></td><td><p>x3 E1.S NVMe</p><p>
</p><p> Samsung PM9D3a 7.68TB / 
Micron 7600 Pro 7.68TB</p></td><td><p>x2 E1.S NVMe </p><p>
</p><p>Samsung PM9A3 7.68TB / 
Micron 7450 Pro 7.68TB</p></td></tr><tr><td><p>Network</p></td><td><p>Dual 100 GbE OCP 3.0 </p><p>
</p><p>Intel Ethernet Network Adapter E830-CDA2 /
NVIDIA Mellanox ConnectX-6 Dx</p></td><td><p>Dual 25 GbE OCP 3.0</p><p>
</p><p>Intel Ethernet Network Adapter E810-XXVDA2 / 
NVIDIA Mellanox ConnectX-6 Lx</p></td></tr><tr><td><p>System Management</p></td><td><p>DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)</p></td><td><p>DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)</p></td></tr><tr><td><p>Power Supply</p></td><td><p>1300W, Titanium Grade</p></td><td><p>800W, Titanium Grade</p></td></tr></table>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Gawj2GP8s2CCZCWwNgBiB/587b0ed5ef65cf95cf178e5457150b6a/image3.png" />
          </figure><p><i><sup>Figure: Gen 13 server</sup></i></p>
    <div>
      <h2>CPU</h2>
      <a href="#cpu">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>AMD EPYC™ 9684X Genoa-X 96-Core (400W TDP, 1152 MB L3 Cache)</p></td></tr><tr><td><p>Gen 13</p></td><td><p>AMD EPYC™ 9965 Turin Dense 192-Core (500W TDP, 384 MB L3 Cache)</p></td></tr></table><p>During the design phase, we evaluated several 5th generation AMD EPYC™ Processors, code-named Turin, in Cloudflare’s hardware lab: AMD Turin 9755, AMD Turin 9845, and AMD Turin 9965. The table below summarizes the differences in <a href="https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/datasheets/amd-epyc-9005-series-processor-datasheet.pdf"><u>specifications</u></a> of the candidates for Gen 13 servers against the AMD Genoa-X 9684X used in our <a href="https://blog.cloudflare.com/gen-12-servers/"><u>Gen 12 servers</u></a>. Notably, all three candidates offer increases in core count but with smaller L3 cache per core. However, with the <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>migration to FL2</u></a>, the new workloads are <a href="https://blog.cloudflare.com/gen13-launch/"><u>less dependent on L3 cache and scale up well with the increased core count to achieve up to 100% increase in throughput</u></a>.</p><p>The three CPU candidates are designed to target different use cases: AMD Turin 9755 offers superior per-core performance, AMD Turin 9965 trades per-core performance for efficiency, and AMD Turin 9845 trades core count for lower socket power. We evaluated three CPUs in the production environment.</p><table><tr><td><p>CPU Model</p></td><td><p>AMD Genoa-X 9684X</p></td><td><p>AMD Turin 9755</p></td><td><p>AMD Turin 9845</p></td><td><p>AMD Turin 9965</p></td></tr><tr><td><p>For server platform</p></td><td><p>Gen 12</p></td><td><p>Gen 13 candidate</p></td><td><p>Gen 13 candidate</p></td><td><p>Gen 13 candidate</p></td></tr><tr><td><p># of CPU Cores</p></td><td><p>96</p></td><td><p>128</p></td><td><p>160</p></td><td><p>192</p></td></tr><tr><td><p># of Threads</p></td><td><p>192</p></td><td><p>256</p></td><td><p>320</p></td><td><p>384</p></td></tr><tr><td><p>Base Clock</p></td><td><p>2.4 GHz</p></td><td><p>2.7 GHz</p></td><td><p>2.1 GHz</p></td><td><p>2.25 GHz</p></td></tr><tr><td><p>Max Boost Clock</p></td><td><p>3.7 GHz</p></td><td><p>4.1 GHz</p></td><td><p>3.7 GHz</p></td><td><p>3.7 GHz</p></td></tr><tr><td><p>All Core Boost Clock</p></td><td><p>3.42 GHz</p></td><td><p>4.1 GHz</p></td><td><p>3.25 GHz</p></td><td><p>3.35 GHz</p></td></tr><tr><td><p>Total L3 Cache</p></td><td><p>1152 MB</p></td><td><p>512 MB</p></td><td><p>320 MB</p></td><td><p>384 MB</p></td></tr><tr><td><p>L3 cache per core</p></td><td><p>12 MB / core</p></td><td><p>4 MB / core</p></td><td><p>2 MB / core</p></td><td><p>2 MB / core</p></td></tr><tr><td><p>Maximum configurable TDP</p></td><td><p>400W</p></td><td><p>500W</p></td><td><p>390W</p></td><td><p>500W</p></td></tr></table>
    <div>
      <h3>Why AMD Turin 9965?</h3>
      <a href="#why-amd-turin-9965">
        
      </a>
    </div>
    <p>First, <b>FL2 ended the L3 cache crunch</b>.</p><p>L3 cache is the large, last-level cache shared among the CPU cores on the same compute die, used to store frequently accessed data. It bridges the gap between slow main memory external to the CPU and the fast but smaller L1 and L2 caches on each core, reducing the latency of data access.</p><p>Some may notice that the 9965 has only 2 MB of L3 cache per core, an 83.3% reduction from the 12 MB per core on Gen 12’s Genoa-X 9684X. Why trade away the very cache advantage that gave Gen 12 its edge? The answer lies in how our workloads have evolved.</p><p>Cloudflare has <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>migrated from FL1 to FL2</u></a>, a complete rewrite of our request handling layer in Rust. With the new software stack, Cloudflare’s request processing pipeline has become significantly less dependent on large L3 cache. FL2 workloads <a href="https://blog.cloudflare.com/gen13-launch/"><u>scale nearly linearly with core count</u></a>, and the 9965’s 192 cores provide a 2x increase in hardware threads over Gen 12.</p><p>Second, <b>performance per total cost of ownership (TCO)</b>. During production evaluation, the 9965’s 192 cores delivered the highest aggregate requests per second of the three candidates, and its performance-per-watt scaled favorably at 500W TDP, yielding superior rack-level TCO.</p><table><tr><td><p>
</p></td><td><p><b>Gen 12 </b></p></td><td><p><b>Gen 13 </b></p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 4th Gen Genoa-X 9684X</p></td><td><p>AMD EPYC™ 5th Gen Turin 9965</p></td></tr><tr><td><p>Core count</p></td><td><p>96C/192T</p></td><td><p>192C/384T</p></td></tr><tr><td><p>FL throughput</p></td><td><p>Baseline</p></td><td><p>Up to +100%</p></td></tr><tr><td><p>Performance per watt</p></td><td><p>Baseline</p></td><td><p>Up to +50%</p></td></tr></table><p>Third, <b>operational simplicity</b>. Our operational teams have a strong preference for fewer, higher-density servers. Managing a fleet of 192-core machines means fewer nodes to provision, patch, and monitor per unit of compute delivered. This directly reduces operational overhead across our global network.</p><p>Finally, the platform is <b>forward compatible</b>. The AMD processor architecture supports DDR5-6400, PCIe Gen 5.0, and CXL 2.0 Type 3 memory across all SKUs. The AMD Turin 9965 has the highest number of high-performing cores per socket in the industry, maximizing compute density and keeping the platform competitive and relevant for years to come. By moving from the AMD Genoa-X 9684X to the AMD Turin 9965, we also get longer security support from AMD, extending the useful life of Gen 13 servers before they become obsolete and need to be refreshed.</p>
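    <p>To make the cache sensitivity concrete, here is a minimal Rust sketch of the classic working-set experiment (our illustration, not Cloudflare’s benchmarking methodology): a random pointer chase whose time per hop jumps once the working set spills out of L3 into DRAM.</p><pre><code>use std::time::Instant;

// Random pointer chase: every load depends on the previous one, so time
// per hop approximates raw cache/memory latency at that working-set size.
fn ns_per_hop(working_set_bytes: usize) -&gt; f64 {
    let n = working_set_bytes / 8; // one 8-byte slot per element
    let mut next: Vec&lt;usize&gt; = (0..n).collect();
    // Sattolo's algorithm yields one cycle through all slots, so the
    // hardware prefetcher cannot predict the next address.
    let mut seed: u64 = 0x9e3779b97f4a7c15;
    for i in (1..n).rev() {
        seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1);
        next.swap(i, (seed % i as u64) as usize);
    }
    let hops = 10_000_000;
    let mut p = 0usize;
    let t0 = Instant::now();
    for _ in 0..hops {
        p = next[p];
    }
    std::hint::black_box(p);
    t0.elapsed().as_nanos() as f64 / hops as f64
}

fn main() {
    for mib in [1usize, 2, 4, 16, 64, 256] {
        println!("{:4} MiB: {:5.1} ns/hop", mib, ns_per_hop(mib * 1024 * 1024));
    }
}
</code></pre>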
    <div>
      <h2>Memory</h2>
      <a href="#memory">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>12x 32GB DDR5-4800 2Rx8 (384 GB total, 4 GB/core)</p></td></tr><tr><td><p>Gen 13</p></td><td><p>12x 64GB DDR5-6400 2Rx4 (768 GB total, 4 GB/core)</p></td></tr></table><p>Because the AMD Turin processor has twice the core count of the previous generation, it demands more memory resources, both in capacity and in bandwidth, to deliver throughput gains.</p>
    <div>
      <h3>Maximizing bandwidth with 12 channels</h3>
      <a href="#maximizing-bandwidth-with-12-channels">
        
      </a>
    </div>
    <p>The chosen AMD EPYC™ 9965 CPU supports twelve memory channels, and for Gen 13, we are populating every single one of them. We’ve selected 64 GB DDR5-6400 ECC RDIMMs in a “one DIMM per channel” (1DPC) configuration.</p><p>This setup provides 614 GB/s of peak memory bandwidth per socket, a 33.3% increase compared to our Gen 12 server platform. By utilizing all 12 channels, we ensure that the CPU is never “starved” for data, even during the most memory-intensive parallel workloads.</p><p>Populating all twelve channels in a balanced configuration — equal capacity per channel, with no mixed configurations — is a common best practice. This matters operationally: AMD Turin processors interleave across all memory channels populated with the same DIMM type, memory capacity, and rank configuration. Interleaving increases memory bandwidth by spreading contiguous memory accesses across all channels in the interleave set instead of sending them all to a single channel or a small subset of channels.</p>
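    <p>The peak-bandwidth arithmetic behind those figures is simple enough to sketch in a few lines of Rust:</p><pre><code>// Peak DDR5 bandwidth: channels x transfer rate x 8 bytes per transfer
// (each channel carries a 64-bit data bus).
fn peak_gb_s(channels: f64, mt_s: f64) -&gt; f64 {
    channels * mt_s * 8.0 / 1000.0
}

fn main() {
    let gen12 = peak_gb_s(12.0, 4800.0); // 460.8 GB/s
    let gen13 = peak_gb_s(12.0, 6400.0); // 614.4 GB/s
    println!("Gen 12: {gen12:.1} GB/s, Gen 13: {gen13:.1} GB/s");
    println!("uplift: {:.1}%", (gen13 / gen12 - 1.0) * 100.0); // 33.3%
}
</code></pre>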
    <div>
      <h3>The 4 GB per core “sweet spot”</h3>
      <a href="#the-4-gb-per-core-sweet-spot">
        
      </a>
    </div>
    <p>Our Gen 12 servers are configured with 4 GB per core. We revisited that decision as we designed Gen 13.</p><p>Cloudflare launches many new products and services every month, and each demands an incremental amount of memory capacity. These increments accumulate over time and can create memory pressure if capacity is not sized appropriately.</p><p>Our initial requirement called for a memory-to-core ratio between 4 GB and 6 GB per core. With 192 cores on the AMD Turin 9965, that translates to a range of 768 GB to 1152 GB. Note that at higher capacities, DIMM capacities typically step in 16 GB increments. With 12 channels in a 1DPC configuration, our options are 12x 48GB (576 GB), 12x 64GB (768 GB), or 12x 96GB (1152 GB).</p><ul><li><p>12x 48GB = 576 GB, or 1.5 GB/thread. The memory capacity of this configuration is too low; it would starve memory-hungry workloads and violate the 4 GB/core lower bound.</p></li><li><p>12x 96GB = 1152 GB, or 3.0 GB/thread. This would be a 50% capacity increase per core, and would also mean higher power consumption and a substantial increase in cost, especially in the current market conditions where memory prices are 10x what they were a year ago.</p></li><li><p>12x 64GB = 768 GB, or 2.0 GB/thread (4 GB/core). This configuration is consistent with our Gen 12 memory-to-core ratio and represents a 2x increase in memory capacity per server. Keeping the configuration at 4 GB per core provides sufficient capacity for workloads that scale with core count, like our primary workload, FL, and sufficient headroom for future growth without overprovisioning.</p></li></ul><p><a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>FL2 uses memory more efficiently</u></a> than FL1 did: our internal measurements show FL2 uses less than half the CPU of FL1, and far less than half the memory. The capacity freed up by the software stack migration provides ample headroom to support Cloudflare growth for the next few years.</p><p>The decision: 12x 64GB for 768 GB total. This maintains the proven 4 GB/core ratio, provides a 2x total capacity increase over Gen 12, and stays within the sweet spot of the DIMM cost curve.</p>
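    <p>A small sketch of the sizing screen described above; the 4 GB to 6 GB per core band is the stated requirement, and cost is evaluated separately:</p><pre><code>fn main() {
    let (cores, channels) = (192.0, 12.0);
    for dimm_gb in [48.0, 64.0, 96.0] {
        let total = dimm_gb * channels;
        let per_core = total / cores;
        let verdict = if per_core &lt; 4.0 {
            "below the 4 GB/core lower bound"
        } else if per_core &gt; 6.0 {
            "above the 6 GB/core upper bound"
        } else {
            "within the band" // 96 GB DIMMs pass the band but fail on cost and power
        };
        println!("12x {dimm_gb} GB = {total} GB ({per_core} GB/core): {verdict}");
    }
}
</code></pre>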
    <div>
      <h3>Efficiency through dual rank</h3>
      <a href="#efficiency-through-dual-rank">
        
      </a>
    </div>
    <p>In Gen 12, we demonstrated that dual-rank DIMMs provide measurably higher memory throughput than single-rank modules, with advantages of up to 17.8% at a 1:1 read-write ratio. Dual-rank DIMMs are faster because they allow the memory controller to access one rank while the other is refreshing. That same principle carries forward here.</p><p>Our requirement also calls for approximately 1 GB/s of memory bandwidth per hardware thread. With 614 GB/s of peak bandwidth across 384 threads, we deliver 1.6 GB/s per thread, comfortably exceeding the minimum. Production analysis has shown that Cloudflare workloads are not memory-bandwidth-bound, so we bank the headroom as margin for future workload growth.</p><p>By opting for 2Rx4 DDR5 RDIMMs at the maximum supported 6400 MT/s, we ensure we get the lowest latency and best performance from our Gen 13 platform memory configuration.</p>
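    <p>Checking the roughly 1 GB/s per hardware thread requirement against this configuration:</p><pre><code>fn main() {
    let peak_gb_s = 614.4; // 12 channels x 6400 MT/s x 8 bytes
    let threads = 384.0;   // 192 cores x 2 SMT threads
    let per_thread = peak_gb_s / threads; // 1.6 GB/s per thread
    println!("{per_thread:.1} GB/s per thread, {:.0}% above the 1 GB/s floor",
             (per_thread - 1.0) * 100.0); // 60% headroom
}
</code></pre>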
    <div>
      <h2>Storage</h2>
      <a href="#storage">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>x2 E1.S NVMe PCIe 4.0, 16 TB total</p><p>Samsung PM9A3 7.68TB</p><p>Micron 7450 Pro 7.68TB</p></td></tr><tr><td><p>Gen 13</p></td><td><p>x3 E1.S NVMe PCIe 5.0, 24 TB total</p><p>Samsung PM9D3a 7.68TB</p><p>Micron 7600 Pro 7.68TB</p><p>+10x U.2 NVMe PCIe 5.0 option</p></td></tr></table><p>Our storage architecture underwent a transformation in Gen 12, when we pivoted from M.2 to EDSFF E1.S. For Gen 13, we are increasing storage capacity and bandwidth to align with the latest technology. We have also added a front drive bay with the flexibility to hold up to 10x U.2 drives, keeping pace with Cloudflare storage product growth.</p>
    <div>
      <h3>The move to PCIe 5.0</h3>
      <a href="#the-move-to-pcie-5-0">
        
      </a>
    </div>
    <p>Gen 13 is configured with PCIe Gen 5.0 NVMe drives. While Gen 4.0 served us well, the move to Gen 5.0 ensures that our storage subsystem can serve data at lower latency and keep up with the increased storage bandwidth demand from the new processor.</p>
    <div>
      <h3>16 TB to 24 TB</h3>
      <a href="#16-tb-to-24-tb">
        
      </a>
    </div>
    <p>Beyond the speed increase, we are physically expanding the array from two to three NVMe drives. Our Gen 12 server platform was designed with four E1.S storage drive slots, but only two slots were populated with 8TB drives. The Gen 13 server platform uses the same design with four E1.S storage drive slots available, but with three slots populated with 8TB drives. Why add a third drive? This increases our storage capacity per server from 16TB to 24TB, expanding our global storage footprint to maintain and improve CDN cache performance. It also supports growth projections for Durable Objects, Containers, and Quicksilver.</p>
    <div>
      <h3>Front drive bay to support additional drives</h3>
      <a href="#front-drive-bay-to-support-additional-drives">
        
      </a>
    </div>
    <p>For Gen 13, the chassis is designed with a front drive bay that can support up to ten U.2 PCIe Gen 5.0 NVMe drives. The front drive bay provides the option for Cloudflare to use the same chassis across compute and storage platforms, as well as the flexibility to convert a compute SKU to a storage SKU when needed. </p>
    <div>
      <h3>Endurance and reliability</h3>
      <a href="#endurance-and-reliability">
        
      </a>
    </div>
    <p>We designed our servers for a 5-year operational life, and require the storage drives to have the endurance to sustain 1 DWPD (Drive Writes Per Day) over that full lifespan.</p><p>Both the Samsung PM9D3a and Micron 7600 Pro meet the 1 DWPD specification with hardware over-provisioning (OP) of approximately 7%. If future workload profiles demand higher endurance, we have the option to hold back additional user capacity to increase the effective OP.</p>
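    <p>The endurance requirement translates into total bytes written as follows. The raw-NAND figure used in the over-provisioning estimate is our assumption for illustration, not a vendor-published number:</p><pre><code>fn main() {
    // 1 DWPD on a 7.68 TB drive across a 5-year life:
    let tbw = 7.68 * 1.0 * 365.0 * 5.0;
    println!("required endurance: {tbw:.0} TB written"); // ~14,016 TBW

    // OP = (raw - usable) / usable, assuming ~8,192 GB of raw NAND
    // behind the 7,680 GB of user capacity (our assumption):
    let op = (8192.0 - 7680.0) / 7680.0;
    println!("estimated over-provisioning: {:.1}%", op * 100.0); // ~6.7%
}
</code></pre>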
    <div>
      <h3>NVMe 2.0 and OCP NVMe 2.0 compliance</h3>
      <a href="#nvme-2-0-and-ocp-nvme-2-0-compliance">
        
      </a>
    </div>
    <p>Both the Samsung PM9D3a and Micron 7600 adopt the NVMe 2.0 specification (up from NVMe 1.4) and the OCP NVMe Cloud SSD Specification 2.0. Key improvements include Zoned Namespaces (ZNS) for better write amplification management, Simple Copy Command for intra-device data movement without crossing the PCIe bus, and enhanced Command and Feature Lockdown for tighter security controls. The OCP 2.0 spec also adds deeper telemetry and debug capabilities purpose-built for datacenter operations, which aligns with our emphasis on fleet-wide manageability.</p>
    <div>
      <h3>Thermal efficiency</h3>
      <a href="#thermal-efficiency">
        
      </a>
    </div>
    <p>The storage drives will continue to be in the E1.S 15mm form factor. Its high-surface-area design is essential for cooling these new Gen 5.0 controllers, which can pull upwards of 25W under sustained heavy I/O. The 2U chassis provides ample airflow over the E1.S drives as well as U.2 drive bays, a design advantage we validated in Gen 12 when we made the decision to move from 1U to 2U.</p>
    <div>
      <h2>Network</h2>
      <a href="#network">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>Dual 25 GbE port OCP 3.0 NIC </p><p>Intel E810-XXVDA2</p><p>NVIDIA Mellanox ConnectX-6 Lx</p></td></tr><tr><td><p>Gen 13</p></td><td><p>Dual 100 GbE port OCP 3.0 NIC</p><p>Intel E830-CDA2</p><p>NVIDIA Mellanox ConnectX-6 Dx</p></td></tr></table><p>For more than eight years, dual 25 GbE has been the backbone of our fleet. <a href="https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/"><u>Since 2018</u></a> it has served us well, but as CPUs have improved to serve more requests and our products have scaled, we have finally hit the wall. For Gen 13, we are quadrupling our per-port bandwidth.</p>
    <div>
      <h3>Why 100 GbE and why now?</h3>
      <a href="#why-100-gbe-and-why-now">
        
      </a>
    </div>
    <p>Network Interface Card (NIC) bandwidth must keep pace with compute performance growth. With 192 modern cores, our 25 GbE links will become a measurable bottleneck. A week of production data from our colocations worldwide showed that, on Gen 12, P95 bandwidth per port is consistently &gt;50% of available bandwidth. With throughput doubling per server on Gen 13, we are at risk of saturating the NIC.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lxP5Vy6y6CzCk1rE9FKVU/064d9e5392a08e92b38bca637d053573/image4.png" />
          </figure><p><sup><i>Figure: on Gen 12, P95 bandwidth per port is consistently &gt;50% of available bandwidth</i></sup></p><p>The decision to go to 100 GbE rather than 50 GbE was driven by industry economics: 50 GbE transceiver volumes remain low in the industry, making them a poor supply chain bet. Dual 100 GbE ports also give us 200 Gb/s of aggregate bandwidth per server, future-proofing against the next several years of traffic growth.</p>
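    <p>Projecting the Gen 12 observation onto Gen 13’s doubled throughput shows why 25 GbE had run out of road; a quick sketch:</p><pre><code>fn main() {
    let p95_gbps = 0.5 * 25.0;       // P95 is &gt;50% of a 25 GbE port on Gen 12
    let gen13_gbps = p95_gbps * 2.0; // throughput doubles per server on Gen 13
    println!("on 25 GbE:  {:.0}% of the port", gen13_gbps / 25.0 * 100.0);  // 100%: saturated
    println!("on 100 GbE: {:.0}% of the port", gen13_gbps / 100.0 * 100.0); // 25%: headroom
}
</code></pre>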
    <div>
      <h3>Hardware choices and compatibility</h3>
      <a href="#hardware-choices-and-compatibility">
        
      </a>
    </div>
    <p>We are maintaining our dual-vendor strategy to ensure supply chain resilience, a lesson hard-learned during the pandemic when single-sourcing the Gen 11 NIC left us scrambling.</p><p>Both NICs are compliant with <a href="https://www.servethehome.com/ocp-nic-3-0-form-factors-quick-guide-intel-broadcom-nvidia-meta-inspur-dell-emc-hpe-lenovo-gigabyte-supermicro/"><u>OCP 3.0 SFF/TSFF</u></a> form factor with the integrated pull tab, maintaining chassis commonality with Gen 12 and ensuring field technicians need no new tools or training for swaps.</p>
    <div>
      <h3>PCIe Allocation</h3>
      <a href="#pcie-allocation">
        
      </a>
    </div>
    <p>The OCP 3.0 NIC slot is allocated sixteen PCIe 4.0 lanes on the motherboard, providing 256 Gb/s of raw bandwidth in each direction, more than enough for dual 100 GbE (200 Gb/s aggregate) with room to spare.</p>
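    <p>The lane arithmetic, including PCIe 4.0’s 128b/130b encoding overhead:</p><pre><code>fn main() {
    let raw_gbps = 16.0 * 16.0;               // 16 GT/s x 16 lanes = 256 Gb/s per direction
    let effective = raw_gbps * 128.0 / 130.0; // ~252 Gb/s after 128b/130b encoding
    println!("effective {effective:.0} Gb/s per direction vs 200 Gb/s of NIC ports");
}
</code></pre>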
    <div>
      <h2>Management</h2>
      <a href="#management">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p><a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0</p></td></tr><tr><td><p>Gen 13</p></td><td><p><a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0</p><p>PCIe encryption</p></td></tr></table><p>We are maintaining the architectural shift, introduced in Gen 12, of separating management and security-related components from the motherboard onto the <a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6F3XH0uQvBry9LJkZlVZOi/42f507d0d46d276db1e3724b21ea49dc/image1.png" />
          </figure><p><sup><i>Figure: Project Argus DC-SCM 2.0</i></sup></p>
    <div>
      <h3>Continuity with DC-SCM 2.0</h3>
      <a href="#continuity-with-dc-scm-2-0">
        
      </a>
    </div>
    <p>We are carrying forward the Data Center Secure Control Module 2.0 (DC-SCM 2.0) standard. By decoupling management and security functions from the motherboard, we ensure that the “brains” of the server’s security stay modular and protected.</p><p>The DC-SCM module houses our most critical components:</p><ul><li><p>Basic Input/Output System (BIOS)</p></li><li><p>Baseboard Management Controller (BMC)</p></li><li><p>Hardware Root of Trust (HRoT) and TPM (Infineon SLB 9672)</p></li><li><p>Dual BMC/BIOS flash chips for redundancy</p></li></ul>
    <div>
      <h3>Why we are staying the course with DC-SCM 2.0</h3>
      <a href="#why-we-are-staying-the-course-with-dc-scm-2-0">
        
      </a>
    </div>
    <p>The decision to keep this architecture for Gen 13 is driven by the proven security gains we saw in the previous generation. By offloading these functions to a dedicated module, we maintain:</p><ul><li><p><b>Rapid recovery</b>: Dual image redundancy allows for near-instant restoration of BIOS/UEFI and BMC firmware if accidental corruption or a malicious update is detected.</p></li><li><p><b>Physical resilience</b>: The Gen 13 chassis also moves the intrusion detection mechanism further from the flat edge of the chassis, making physical tampering harder.</p></li><li><p><b>PCIe encryption</b>: In addition to TSME (Transparent Secure Memory Encryption) for CPU-to-memory encryption, which has been enabled since our Gen 10 platforms, the AMD Turin 9965 processor in Gen 13 extends encryption to PCIe traffic, ensuring data is protected in transit across every bus in the system.</p></li><li><p><b>Operational consistency</b>: Sticking with the Gen 12 management stack means our security audits, deployment, provisioning, and operational standard procedures remain fully compatible.</p></li></ul>
    <div>
      <h2>Power</h2>
      <a href="#power">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>800W 80 PLUS Titanium CRPS</p></td></tr><tr><td><p>Gen 13</p></td><td><p>1300W 80 PLUS Titanium CRPS</p></td></tr></table><p>As we upgrade the compute and networking capability of the server, its power envelope has naturally expanded. Gen 13 servers are equipped with larger power supplies to deliver the power needed.</p>
    <div>
      <h3>The jump to 1300W</h3>
      <a href="#the-jump-to-1300w">
        
      </a>
    </div>
    <p>While our Gen 12 nodes operated comfortably with an 800W 80 PLUS Titanium CRPS (Common Redundant Power Supply), the Gen 13 specification requires a larger power supply. We have selected a 1300W 80 PLUS Titanium CRPS.</p><p>Power consumption of Gen 13 during typical operation has risen to 850W, a 250W increase over the 600W seen in Gen 12. The primary contributors are the 500W TDP CPU (up from 400W), the doubled memory capacity, and the additional NVMe drive.</p><p>Why 1300W instead of 1000W? The current PSU ecosystem lacks viable, high-efficiency options at 1000W. To ensure supply chain reliability, we moved to the next industry-standard tier of 1300W.</p><p><a href="https://eur-lex.europa.eu/eli/reg/2019/424/oj/eng"><u>EU Lot 9</u></a> is a regulation that requires servers deployed in the European Union to use power supplies whose efficiency at 10%, 20%, 50%, and 100% load is at or above the thresholds specified in the regulation. Those thresholds match the Titanium grade requirements of the <a href="https://www.clearesult.com/80plus/80plus-psu-ratings-explained"><u>80 PLUS power supply certification program</u></a>. We chose a Titanium grade PSU for Gen 13 to maintain full compliance with EU Lot 9, ensuring that the servers can be deployed in our European data centers and beyond.</p>
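    <p>A back-of-envelope check of the typical draw against the PSU rating. Only the 500W CPU TDP is a published figure; the other line items are rough assumptions for illustration:</p><pre><code>fn main() {
    // Rough budget for the ~850W typical draw. Per-component figures
    // other than the CPU TDP are illustrative assumptions.
    let parts = [
        ("CPU (TDP)", 500.0),
        ("12x DDR5-6400 RDIMMs", 120.0),      // assumption
        ("3x E1.S NVMe", 75.0),               // assumption, ~25W each under load
        ("NIC, BMC, fans, VR losses", 155.0), // assumption (remainder)
    ];
    let total: f64 = parts.iter().map(|(_, w)| w).sum();
    println!("typical draw ~{total:.0}W, PSU headroom {:.0}%",
             (1300.0 / total - 1.0) * 100.0); // ~53% at 1300W
}
</code></pre>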
    <div>
      <h3>Thermal design: 2U pays dividends again</h3>
      <a href="#thermal-design-2u-pays-dividends-again">
        
      </a>
    </div>
    <p>The 2U1N form factor we adopted in Gen 12 continues to pay dividends. Gen 13 uses 5x 80mm fans (up from 4x in Gen 12) to handle the increased thermal load from the 500W CPU. The additional fan, combined with the 2U chassis airflow characteristics, means the fans operate well below maximum duty cycle at typical ambient temperatures, keeping fan power under 50W per fan.</p>
    <div>
      <h2>Drop-in accelerator support</h2>
      <a href="#drop-in-accelerator-support">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>x2 single width FHFL or x1 double width FHFL</p></td></tr><tr><td><p>Gen 13</p></td><td><p>x2 double width FHFL</p></td></tr></table><p>Maintaining the modularity of our fleet is a core requirement for our server design. This requirement enabled Cloudflare to quickly retrofit and <a href="https://blog.cloudflare.com/workers-ai/#a-road-to-global-gpu-coverage"><u>deploy GPUs globally to more than 100 cities in 2024</u></a>. In Gen 13, we continue to support high-performance PCIe add-in cards.</p><p>On Gen 13, the 2U chassis layout has been updated to support more demanding power and thermal requirements. While Gen 12 was limited to a single double-width GPU, the Gen 13 architecture now supports two double-width PCIe cards.</p>
    <div>
      <h2>A launchpad to scale Cloudflare to greater heights</h2>
      <a href="#a-launchpad-to-scale-cloudflare-to-greater-heights">
        
      </a>
    </div>
    <p>Every generation of Cloudflare servers is an exercise in balancing competing constraints: performance versus power, capacity versus cost, flexibility versus simplicity. Gen 13 comes with 2x core count, 2x memory capacity, 4x network bandwidth, 1.5x storage capacity, and future-proofing for accelerator deployments — all while improving total cost of ownership and maintaining a robust management feature set and security posture that our global fleet demands.</p><p>Gen 13 servers are fully qualified and will be deployed to serve millions of requests across Cloudflare’s global network in more than 330 cities. As always, Cloudflare’s journey to serve the Internet as efficiently as possible does not end here. As the deployment of Gen 13 begins, we are planning the architecture for Gen 14.</p><p>If you are excited about helping build a better Internet, come join us. <a href="https://www.cloudflare.com/careers/jobs/"><u>We are hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[AMD]]></category>
            <guid isPermaLink="false">7KkjVfneO6PwoHTEAiSYVM</guid>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Ma Xiong</dc:creator>
            <dc:creator>Victor Hwang</dc:creator>
        </item>
        <item>
            <title><![CDATA[Launching Cloudflare’s Gen 13 servers: trading cache for cores for 2x edge compute performance]]></title>
            <link>https://blog.cloudflare.com/gen13-launch/</link>
            <pubDate>Mon, 23 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s Gen 13 servers double our compute throughput by rethinking the balance between cache and cores. Moving to high-core-count AMD EPYC™ Turin CPUs, we traded large L3 cache for raw compute density. By running our new Rust-based FL2 stack, we completely mitigated the latency penalty to unlock twice the performance. ]]></description>
            <content:encoded><![CDATA[ <p>Two years ago, Cloudflare deployed our <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor/"><u>12th Generation server fleet</u></a>, based on AMD EPYC™ Genoa-X processors with their massive 3D V-Cache. That cache-heavy architecture was a perfect match for our request handling layer at the time, FL1. But as we evaluated next-generation hardware, we faced a dilemma — the CPUs offering the biggest throughput gains came with a significant cache reduction. Our legacy software stack wasn't optimized for this, and the potential throughput benefits were being capped by increasing latency.</p><p>This blog describes how the <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>FL2 transition</u></a>, our Rust-based rewrite of Cloudflare's core request handling layer, allowed us to prove Gen 13's full potential and unlock performance gains that would have been impossible on our previous stack. FL2 removes the dependency on the larger cache, allowing performance to scale with cores while maintaining our SLAs. Today, we are proud to announce the launch of Cloudflare's Gen 13 servers, based on AMD EPYC™ 5th Gen Turin processors and running FL2, effectively capturing and scaling performance at the edge.</p>
    <div>
      <h2>What AMD EPYC™ Turin brings to the table</h2>
      <a href="#what-amd-epycturin-brings-to-the-table">
        
      </a>
    </div>
    <p><a href="https://www.amd.com/en/products/processors/server/epyc/9005-series.html"><u>AMD's EPYC™ 5th Generation Turin-based processors</u></a> deliver more than just a core count increase. The architecture delivers improvements across multiple dimensions of what Cloudflare servers require.</p><ul><li><p><b>2x core count:</b> up to 192 cores versus Gen 12's 96 cores, with SMT providing 384 threads</p></li><li><p><b>Improved IPC:</b> Zen 5's architectural improvements deliver better instructions-per-cycle compared to Zen 4</p></li><li><p><b>Better power efficiency</b>: Despite the higher core count, Turin consumes up to 32% fewer watts per core compared to Genoa-X</p></li><li><p><b>DDR5-6400 support</b>: Higher memory bandwidth to feed all those cores</p></li></ul><p>However, Turin's high density OPNs make a deliberate tradeoff: prioritizing throughput over per core cache. Our analysis across the Turin stack highlighted this shift. For example, comparing the highest density Turin OPN to our Gen 12 Genoa-X processors reveals that Turin's 192 cores share 384MB of L3 cache. This leaves each core with access to just 2MB, one-sixth of Gen 12's allocation. For any workload that relies heavily on cache locality, which ours did, this reduction posed a serious challenge.</p><table><tr><td><p>Generation</p></td><td><p>Processor</p></td><td><p>Cores/Threads</p></td><td><p>L3 Cache/Core</p></td></tr><tr><td><p>Gen 12</p></td><td><p>AMD Genoa-X 9684X</p></td><td><p>96C/192T</p></td><td><p>12MB (3D V-Cache)</p></td></tr><tr><td><p>Gen 13 Option 1</p></td><td><p>AMD Turin 9755</p></td><td><p>128C/256T</p></td><td><p>4MB</p></td></tr><tr><td><p>Gen 13 Option 2</p></td><td><p>AMD Turin 9845</p></td><td><p>160C/320T</p></td><td><p>2MB</p></td></tr><tr><td><p>Gen 13 Option 3</p></td><td><p>AMD Turin 9965</p></td><td><p>192C/384T</p></td><td><p>2MB</p></td></tr></table>
    <div>
      <h2>Diagnosing the problem with performance counters</h2>
      <a href="#diagnosing-the-problem-with-performance-counters">
        
      </a>
    </div>
    <p>For FL1, our NGINX- and LuaJIT-based request handling layer, this cache reduction presented a significant challenge. But we didn't just assume it would be a problem; we measured it.</p><p>During the CPU evaluation phase for Gen 13, we collected CPU performance counters and profiling data using the <a href="https://docs.amd.com/r/en-US/68658-uProf-getting-started-guide/Identifying-Issues-Using-uProfPcm"><u>AMD uProf tool</u></a> to identify exactly what was happening under the hood. The data showed:</p><ul><li><p>L3 cache miss rates increased dramatically compared to Gen 12 servers equipped with 3D V-Cache processors</p></li><li><p>Memory fetch latency dominated request processing time, as data that previously stayed in L3 now required trips to DRAM</p></li><li><p>The latency penalty scaled with utilization: as we pushed CPU usage higher, cache contention worsened</p></li></ul><p>L3 cache hits complete in roughly 50 cycles; L3 cache misses requiring DRAM access take 350+ cycles, close to an order of magnitude difference. With 6x less cache per core, FL1 on Gen 13 was hitting memory far more often, incurring latency penalties.</p>
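    <p>The scale of that penalty is easy to see with a simple average memory access time (AMAT) model; the miss rates below are illustrative stand-ins, not our measured uProf numbers:</p><pre><code>// AMAT at the L3 level, using the cycle counts quoted above:
// ~50-cycle hit, ~350-cycle miss to DRAM.
fn amat_cycles(l3_miss_rate: f64) -&gt; f64 {
    (1.0 - l3_miss_rate) * 50.0 + l3_miss_rate * 350.0
}

fn main() {
    // Illustrative: a workload whose misses triple when per-core L3
    // shrinks from 12 MB to 2 MB.
    println!("low miss rate (10%):  {:.0} cycles", amat_cycles(0.10)); // 80
    println!("high miss rate (30%): {:.0} cycles", amat_cycles(0.30)); // 140
}
</code></pre>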
    <div>
      <h2>The tradeoff: latency vs. throughput </h2>
      <a href="#the-tradeoff-latency-vs-throughput">
        
      </a>
    </div>
    <p>Our initial tests running FL1 on Gen 13 confirmed what the performance counters had already suggested. While the Turin processor could achieve higher throughput, it came at a steep latency cost.</p><table><tr><td><p>Metric</p></td><td><p>Gen 12 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9755 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9845 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9965 (FL1)</p></td><td><p>Delta</p></td></tr><tr><td><p>Core count</p></td><td><p>baseline</p></td><td><p>+33%</p></td><td><p>+67%</p></td><td><p>+100%</p></td><td><p></p></td></tr><tr><td><p>FL throughput</p></td><td><p>baseline</p></td><td><p>+10%</p></td><td><p>+31%</p></td><td><p>+62%</p></td><td><p>Improvement</p></td></tr><tr><td><p>Latency at low to moderate CPU utilization</p></td><td><p>baseline</p></td><td><p>+10%</p></td><td><p>+30%</p></td><td><p>+30%</p></td><td><p>Regression</p></td></tr><tr><td><p>Latency at high CPU utilization</p></td><td><p>baseline</p></td><td><p>&gt; 20% </p></td><td><p>&gt; 50% </p></td><td><p>&gt; 50% </p></td><td><p>Unacceptable</p></td></tr></table><p>The Gen 13 evaluation server with the AMD Turin 9965, which generated a 62% throughput gain, was compelling, and that performance uplift provided the most improvement to Cloudflare’s total cost of ownership (TCO).</p><p>But a more than 50% latency penalty is not acceptable. The increase in request processing latency would directly impact customer experience. We faced a familiar infrastructure question: do we accept a solution with no TCO benefit, accept the increased latency tradeoff, or find a way to boost efficiency without adding latency?</p>
    <div>
      <h2>Incremental gains with performance tuning</h2>
      <a href="#incremental-gains-with-performance-tuning">
        
      </a>
    </div>
    <p>To find a path to an optimal outcome, we collaborated with AMD to analyze the Turin 9965 data and run targeted optimization experiments. We systematically tested multiple configurations:</p><ul><li><p><b>Hardware Tuning:</b> Adjusting hardware prefetchers and Data Fabric (DF) Probe Filters, which showed only marginal gains</p></li><li><p><b>Scaling Workers</b>: Launching more FL1 workers, which improved throughput but cannibalized resources from other production services</p></li><li><p><b>CPU Pinning &amp; Isolation:</b> Adjusting workload isolation configurations to find the optimal mix, with limited success</p></li></ul><p>The configuration that ultimately provided the most value was <b>AMD’s Platform Quality of Service (PQOS)</b>. PQOS extensions enable fine-grained regulation of shared resources like cache and memory bandwidth. Since Turin processors consist of one I/O die and up to 12 Core Complex Dies (CCDs), each sharing an L3 cache across up to 16 cores, we put this to the test. Here is how the different experimental configurations performed (a resctrl-based sketch of this style of partitioning follows the table).</p><p>First, we used PQOS to allocate a dedicated L3 cache share within a single CCD for FL1; the gains were minimal. However, when we scaled the concept to the socket level, dedicating an <i>entire</i> CCD strictly to FL1, we saw meaningful throughput gains while keeping latency acceptable.</p><div>
<figure>
<table><colgroup><col></col><col></col><col></col><col></col></colgroup>
<tbody>
<tr>
<td>
<p><span><span>Configuration</span></span></p>
</td>
<td>
<p><span><span>Description</span></span></p>
</td>
<td>
<p><span><span>Illustration</span></span></p>
</td>
<td>
<p><span><span>Performance gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>NUMA-aware core affinity </span></span><br /><span><span>(equivalent to PQOS at socket level)</span></span></p>
</td>
<td>
<p><span><span>6 out of 12 CCD (aligned with NUMA domain) run FL.</span></span></p>
<p> </p>
<p><span><span>32MB L3 cache in each CCD shared among all cores. </span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/4CBSHY02oIZOiENgFrzLSz/0c6c2ac8ef0096894ff4827e30d25851/image3.png" /></span></span></p>
</td>
<td>
<p><span><span>&gt;15% incremental </span></span></p>
<p><span><span>throughput gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 1</span></span></p>
</td>
<td>
<p><span><span>1 of 2 vCPU on each physical core in each CCD runs FL. </span></span></p>
<p> </p>
<p><span><span>FL gets 75% of the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/3iJo1BBRueQRy92R3aXbGx/596c3231fa0e66f20de70ea02615f9a7/image2.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
<p> </p>
<p><span><span>Other services show minor signs of degradation</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 2</span></span></p>
</td>
<td>
<p><span><span>1 of 2 vCPU in each physical core in each CCD runs FL.</span></span></p>
<p> </p>
<p><span><span>FL gets 50% of the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/3iJo1BBRueQRy92R3aXbGx/596c3231fa0e66f20de70ea02615f9a7/image2.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 3</span></span></p>
</td>
<td>
<p><span><span>2 vCPU on 50% of the physical core in each CCD runs FL. </span></span></p>
<p> </p>
<p><span><span>FL gets 50% of  the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/7FKLfSxnSNUlXJCw8CJGzU/69c7b81b6cee5a2c7040ecc96748084b/image5.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
</td>
</tr>
</tbody>
</table>
</figure>
</div>
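    <p>On Linux, this style of L3 partitioning is typically driven through the kernel’s resctrl filesystem, which fronts AMD’s QoS extensions. The sketch below is a generic illustration rather than Cloudflare’s production tooling; the group name, way mask, and PID are hypothetical.</p><pre><code>use std::fs;

fn main() -&gt; std::io::Result&lt;()&gt; {
    // Assumes: mount -t resctrl resctrl /sys/fs/resctrl
    let grp = "/sys/fs/resctrl/fl";
    fs::create_dir_all(grp)?;
    // Allocate 12 of 16 cache ways (75%) in L3 domain 0 to this group;
    // only the domains being changed need to be listed on the write.
    fs::write(format!("{grp}/schemata"), "L3:0=0fff\n")?;
    // Move a (hypothetical) FL worker PID into the group.
    fs::write(format!("{grp}/tasks"), "12345")?;
    Ok(())
}
</code></pre>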
    <div>
      <h2>The opportunity: FL2 was already in progress</h2>
      <a href="#the-opportunity-fl2-was-already-in-progress">
        
      </a>
    </div>
    <p>Hardware tuning and resource configuration provided modest gains, but to truly unlock the performance potential of the Gen 13 architecture, we knew we would have to rewrite our software stack to fundamentally change how it utilized system resources.</p><p>Fortunately, we weren't starting from scratch. As we <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>announced during Birthday Week 2025</u></a>, we had already been rebuilding FL1 from the ground up. FL2 is a complete rewrite of our request handling layer in Rust, built on our <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a> and <a href="https://blog.cloudflare.com/introducing-oxy/"><u>Oxy</u></a> frameworks, replacing 15 years of NGINX and LuaJIT code.</p><p>The FL2 project wasn't initiated to solve the Gen 13 cache problem — it was driven by the need for better security (Rust's memory safety), faster development velocity (strict module system), and improved performance across the board (less CPU, less memory, modular execution).</p><p>FL2's cleaner architecture, with better memory access patterns and less dynamic allocation, might not depend on massive L3 caches the way FL1 did. This gave us an opportunity to use the FL2 transition to prove whether Gen 13's throughput gains could be realized without the latency penalty.</p>
    <div>
      <h2>Proving it out: FL2 on Gen 13</h2>
      <a href="#proving-it-out-fl2-on-gen-13">
        
      </a>
    </div>
    <p>As the FL2 rollout progressed, production metrics from our Gen 13 servers validated what we had hypothesized.</p><table><tr><td><p>Metric</p></td><td><p>Gen 13 AMD Turin 9965 (FL1)</p></td><td><p>Gen 13 AMD Turin 9965 (FL2)</p></td></tr><tr><td><p>FL requests per CPU%</p></td><td><p>baseline</p></td><td><p>50% higher</p></td></tr><tr><td><p>Latency vs Gen 12</p></td><td><p>baseline</p></td><td><p>70% lower</p></td></tr><tr><td><p>Throughput vs Gen 12</p></td><td><p>62% higher</p></td><td><p>100% higher</p></td></tr></table><p>The out-of-the-box efficiency gains on our new FL2 stack were substantial, even before any system optimizations. FL2 slashed the latency penalty by 70%, allowing us to push Gen 13 to higher CPU utilization while strictly meeting our latency SLAs. Under FL1, this would have been impossible.</p><p>By effectively eliminating the cache bottleneck, FL2 enables our throughput to scale linearly with core count. The impact is undeniable on the high-density AMD Turin 9965: we achieved a 2x performance gain, unlocking the true potential of the hardware. With further system tuning, we expect to squeeze even more power out of our Gen 13 fleet.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jV1q0n9PgmbbNzDl8E1J1/2ead24a20cc10836ba041f73a16f3883/image6.png" />
          </figure>
    <div>
      <h2>Generational improvement with Gen 13</h2>
      <a href="#generational-improvement-with-gen-13">
        
      </a>
    </div>
    <p>With FL2 unlocking the immense throughput of the high-core-count AMD Turin 9965, we have officially selected these processors for our Gen 13 deployment. Hardware qualification is complete, and Gen 13 servers are now shipping at scale to support our global rollout.</p>
    <div>
      <h3>Performance improvements</h3>
      <a href="#performance-improvements">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p>Gen 12 </p></td><td><p>Gen 13 </p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 4th Gen Genoa-X 9684X</p></td><td><p>AMD EPYC™ 5th Gen Turin 9965</p></td></tr><tr><td><p>Core count</p></td><td><p>96C/192T</p></td><td><p>192C/384T</p></td></tr><tr><td><p>FL throughput</p></td><td><p>baseline</p></td><td><p>Up to +100%</p></td></tr><tr><td><p>Performance per watt</p></td><td><p>baseline</p></td><td><p>Up to +50%</p></td></tr></table>
    <div>
      <h3>Gen 13 business impact</h3>
      <a href="#gen-13-business-impact">
        
      </a>
    </div>
    <p><b>Up to 2x throughput vs Gen 12 </b>for uncompromising customer experience: By doubling our throughput capacity while staying within our latency SLAs, we ensure applications remain fast, responsive, and able to absorb massive traffic spikes.</p><p><b>50% better performance/watt vs Gen 12 </b>for sustainable scaling: This gain in power efficiency not only reduces data center expansion costs, but also allows us to process growing traffic with a vastly lower carbon footprint per request.</p><p><b>60% higher rack throughput vs Gen 12 </b>for global edge upgrades: Because we achieved this throughput density while keeping the rack power budget constant, we can seamlessly deploy this next-generation compute anywhere in the world across our global edge network, delivering top-tier performance exactly where our customers want it.</p>
    <div>
      <h2>Gen 13 + FL2: ready for the edge </h2>
      <a href="#gen-13-fl2-ready-for-the-edge">
        
      </a>
    </div>
    <p>Our legacy request serving layer, FL1, hit a cache-contention wall on Gen 13, forcing an unacceptable tradeoff between throughput and latency. Instead of compromising, we built FL2.</p><p>Designed with a vastly leaner memory access pattern, FL2 removes our dependency on massive L3 caches and allows linear scaling with core count. Running on the Gen 13 AMD Turin platform, FL2 unlocks 2x the throughput and a 50% boost in power efficiency, all while keeping latency within our SLAs. This leap forward is a great reminder of the importance of hardware-software co-design. Unconstrained by cache limits, Gen 13 servers are now ready to be deployed to serve millions of requests across Cloudflare’s global network.</p><p>If you're excited about working on infrastructure at global scale, <a href="https://www.cloudflare.com/careers/jobs"><u>we're hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Engineering]]></category>
            <guid isPermaLink="false">4shbA7eyT2KredK7RJyizK</guid>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Jesse Brandeburg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Analysis of the EPYC 145% performance gain in Cloudflare Gen 12 servers]]></title>
            <link>https://blog.cloudflare.com/analysis-of-the-epyc-145-performance-gain-in-cloudflare-gen-12-servers/</link>
            <pubDate>Tue, 15 Oct 2024 15:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s Gen 12 server is the most powerful and power efficient server that we have deployed to date. Through sensitivity analysis, we found that Cloudflare workloads continue to scale with higher core count and higher CPU frequency, as well as achieving a significant boost in performance with larger L3 cache per core. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's <a href="https://www.cloudflare.com/network"><u>network</u></a> spans more than 330 cities in over 120 countries, serving over 60 million HTTP requests per second and 39 million DNS queries per second on average. These numbers will continue to grow, and at an accelerating pace, as will Cloudflare’s infrastructure to support them. While we can continue to scale out by deploying more servers, it is also paramount for us to develop and deploy more performant and more efficient servers.</p><p>At the heart of each server is the processor (central processing unit, or CPU). Even though many aspects of a server rack can be redesigned to improve the cost to serve a request, CPU remains the biggest lever, as it is typically the primary compute resource in a server, and the primary enabler of new technologies.</p><p><a href="https://blog.cloudflare.com/gen-12-servers/"><u>Cloudflare’s 12th Generation server with AMD EPYC 9684-X (codenamed Genoa-X) is 145% more performant and 63% more efficient</u></a>. These are big numbers, but where do the performance gains come from? Cloudflare’s hardware system engineering team did a sensitivity analysis on three variants of 4th generation AMD EPYC processor to understand the contributing factors.</p><p>For the 4th generation AMD EPYC Processors, AMD offers three architectural variants: </p><ol><li><p>mainstream classic Zen 4 cores, codenamed Genoa</p></li><li><p>efficiency optimized dense Zen 4c cores, codenamed Bergamo</p></li><li><p>cache optimized Zen 4 cores with 3D V-cache, codenamed Genoa-X</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hb5yDJksMIwbWVQcCzuoC/c085edca54d3820f7e93564791b0289a/image6.png" />
            
            </figure><p><sup>Figure 1 (from left to right): AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), AMD EPYC 9684X (Genoa-X)</sup></p><p>Key features common across the 4th Generation AMD EPYC processors:</p><ul><li><p>Up to 12x Core Complex Dies (CCDs)</p></li><li><p>Each core has a private 1MB L2 cache</p></li><li><p>The CCDs connect to memory, I/O, and each other through an I/O die</p></li><li><p>Configurable Thermal Design Power (cTDP) up to 400W</p></li><li><p>Support up to 12 channels of DDR5-4800 1DPC</p></li><li><p>Support up to 128 lanes PCIe Gen 5</p></li></ul><p>Classic Zen 4 Cores (Genoa):</p><ul><li><p>Each Core Complex (CCX) has 8x Zen 4 Cores (16x Threads)</p></li><li><p>Each CCX has a shared 32 MB L3 cache (4 MB/core)</p></li><li><p>Each CCD has 1x CCX</p></li></ul><p>Dense Zen 4c Cores (Bergamo):</p><ul><li><p>Each CCX has 8x Zen 4c Cores (16x Threads)</p></li><li><p>Each CCX has a shared 16 MB L3 cache (2 MB/core)</p></li><li><p>Each CCD has 2x CCX</p></li></ul><p>Classic Zen 4 Cores with 3D V-cache (Genoa-X):</p><ul><li><p>Each CCX has 8x Zen 4 Cores (16x Threads)</p></li><li><p>Each CCX has a shared 96MB L3 cache (12 MB/core)</p></li><li><p>Each CCD has 1x CCX</p></li></ul><p>For more information on 4th generation AMD EPYC Processors architecture, see: <a href="https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf"><u>https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf</u></a> </p><p>The following table is a summary of the specification of the AMD EPYC 7713 CPU in our <a href="https://blog.cloudflare.com/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/"><u>Gen 11 server</u></a> against the three CPU candidates, one from each variant of the 4th generation AMD EPYC Processors architecture:</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>CPU Model</strong></span></span></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 7713</u></strong></span></span></a></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 9654</u></strong></span></span></a></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 9754</u></strong></span></span></a></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 9684X</u></strong></span></span></a></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Series</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>Milan</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Genoa</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Bergamo</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Genoa-X</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong># of CPU Cores</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>64</span></span></p>
                    </td>
                    <td>
                        <p><span><span>96</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>128</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>96</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong># of Threads</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>128</span></span></p>
                    </td>
                    <td>
                        <p><span><span>192</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>256</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>192</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Base Clock</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.0 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.4 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.25 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.4 GHz</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>All Core Boost Clock</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>~2.7 GHz*</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>3.55 GHz</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.1 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.42 GHz</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Total L3 Cache</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>256 MB</span></span></p>
                    </td>
                    <td>
                        <p><span><span>384 MB</span></span></p>
                    </td>
                    <td>
                        <p><span><span>256 MB</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1152 MB</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>L3 cache per core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>4 MB / core</span></span></p>
                    </td>
                    <td>
                        <p><span><span>4 MB / core</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2 MB / core</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>12 MB / core</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Maximum configurable TDP</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>240W</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>400W</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>400W</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>400W</strong></span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p><sup>* AMD EPYC 7713 all core boost clock is based on Cloudflare production data, not the official specification from AMD.</sup></p>
    <div>
      <h2>cf_benchmark</h2>
      <a href="#cf_benchmark">
        
      </a>
    </div>
    <p>Readers may remember that Cloudflare introduced <a href="https://github.com/cloudflare/cf_benchmark"><u>cf_benchmark</u></a> when we evaluated <a href="https://blog.cloudflare.com/arm-takes-wing/"><u>Qualcomm's ARM chips</u></a>, using it as our first-pass benchmark to shortlist <a href="https://blog.cloudflare.com/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/"><u>AMD’s Rome CPU for our Gen 10 servers</u></a> and to evaluate <a href="https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-aws-graviton2/"><u>our chosen ARM CPU Ampere Altra Max against AWS Graviton 2</u></a>. Likewise, we ran cf_benchmark against the three candidate CPUs for our 12th Gen servers: AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), and AMD EPYC 9684X (Genoa-X). The majority of cf_benchmark workloads are compute bound: given more cores or a higher CPU frequency, they score better. The graph and the table below show the benchmark performance comparison of the three CPU candidates with Genoa 9654 as the baseline, where &gt; 1.00x indicates better performance.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YNQ8YB5XV3fZEVpVFT7lM/2dadd7fda9832716f198dee2d5fbfd22/image5.png" />
          </figure><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Genoa 9654 (baseline)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo 9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X 9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>openssl_pki</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.16x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.01x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>openssl_aead</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.20x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.01x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>luajit</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.86x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>brotli</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.11x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.98x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>gzip</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.87x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.01x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>go</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.09x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>Bergamo 9754, with 128 cores, scores better in the openssl_pki, openssl_aead, brotli, and go suites, and less favorably in the luajit and gzip suites. Genoa-X 9684X, despite its significantly larger L3 cache, doesn’t offer a significant performance boost in these compute-bound benchmarks.</p><p>These benchmarks are representative of some of the common workloads Cloudflare runs, and are useful for identifying software scaling issues, system configuration bottlenecks, and the impact of CPU design choices on workload-specific performance. However, the benchmark suite is not an exhaustive list of all the workloads Cloudflare runs in production, and in reality, no benchmarked workload has a CPU to itself. In short, though benchmark results are informative, they are not a reliable indicator of production performance, where a mix of these workloads runs on the same processor.</p>
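<p>For readers who want to reproduce this kind of comparison, the normalization is simple: divide each candidate’s raw score by the baseline’s score. A minimal sketch in Python, using illustrative placeholder scores rather than our raw cf_benchmark output:</p>
<pre><code># Normalize raw benchmark scores against a baseline CPU.
# The scores below are illustrative placeholders, not
# Cloudflare's actual cf_benchmark output.
raw_scores = {
    "openssl_pki": {"genoa_9654": 1000, "bergamo_9754": 1160, "genoax_9684x": 1010},
    "luajit":      {"genoa_9654": 1000, "bergamo_9754": 860,  "genoax_9684x": 1000},
}

BASELINE = "genoa_9654"
for suite, scores in raw_scores.items():
    for cpu, score in scores.items():
        print(f"{suite} {cpu}: {score / scores[BASELINE]:.2f}x")
</code></pre>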
    <div>
      <h2>Performance simulation</h2>
      <a href="#performance-simulation">
        
      </a>
    </div>
    <p>To get an early indication of production performance, Cloudflare has an internal performance simulation tool that exercises our software stack by fetching a fixed asset repeatedly. The tool can be configured to fetch an asset of a specified size and to include or exclude services like WAF or Workers in the request path. Below, we show the simulated performance of the three candidate CPUs against the Milan 7713 baseline for an asset size of 10 KB, where &gt;1.00x indicates better performance.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Milan 7713</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa 9654</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo 9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X 9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Lab simulation performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.20x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.95x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>Based on these results, Bergamo 9754, which has the highest core count but the smallest L3 cache per core, is the least performant of the three candidates, followed by Genoa 9654. Genoa-X 9684X, with the largest L3 cache per core, is the most performant. This data suggests that our software stack is very sensitive to L3 cache size, in addition to core count and CPU frequency. That is interesting, and worth a deeper dive: a sensitivity analysis of our workload against a few high-level CPU design points, specifically core count scaling, frequency scaling, and L2/L3 cache size scaling.</p>
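<p>As a rough illustration of the methodology (our internal tool is considerably more elaborate, with per-service toggles and latency tracking), a load generator that repeatedly fetches a fixed-size asset and reports requests per second could look like the sketch below. The URL, worker count, and request count are hypothetical:</p>
<pre><code># Minimal sketch of a fixed-asset fetch loop. Assumes a local test
# origin serving a 10 KB asset at the hypothetical URL below.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:8080/asset-10kb.bin"  # hypothetical test asset
REQUESTS_PER_WORKER = 500
WORKERS = 64

def fetch_batch(_):
    for _ in range(REQUESTS_PER_WORKER):
        with urllib.request.urlopen(URL) as resp:
            resp.read()  # drain the fixed-size body

start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(fetch_batch, range(WORKERS)))
elapsed = time.monotonic() - start

print(f"RPS: {WORKERS * REQUESTS_PER_WORKER / elapsed:.0f}")
</code></pre>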
    <div>
      <h2>Sensitivity analysis</h2>
      <a href="#sensitivity-analysis">
        
      </a>
    </div>
    
    <div>
      <h3>Core sensitivity</h3>
      <a href="#core-sensitivity">
        
      </a>
    </div>
    <p>Core count is the headline specification that practically everyone talks about, and one of the easiest levers CPU vendors can pull to increase performance per socket. The AMD Genoa 9654 has 96 cores, 50% more than the 64 cores available on the AMD Milan 7713 CPUs that we used in our Gen 11 servers. Is more always better? Does Cloudflare’s primary workload scale with core count and effectively utilize all available cores?</p><p>The figure and table below show the result of a core scaling experiment performed on an AMD Genoa 9654 configured with 96, 80, 64, and 48 cores, achieved by incrementally disabling two CCDs (8 cores per CCD) at each step. The result is great: Cloudflare’s simulated primary workload scales linearly with core count on AMD Genoa CPUs.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/f1w9JxhP8aLIoFONMq5tr/b3269fc121b9bfd8f4d9a3394a73c599/image4.png" />
          </figure><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Core count</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Core increase</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Performance increase</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>48</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>64</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.33x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.39x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>80</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.67x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.71x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>96</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.05x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
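<p>A quick sanity check on the linearity claim: dividing each performance multiplier by the corresponding core-count multiplier gives a per-step scaling efficiency, computed here from the table above:</p>
<pre><code># Scaling efficiency = performance multiplier / core-count multiplier,
# using the core-scaling results from the table above.
cores     = [48, 64, 80, 96]
core_mult = [1.00, 1.33, 1.67, 2.00]
perf_mult = [1.00, 1.39, 1.71, 2.05]

for n, c, p in zip(cores, core_mult, perf_mult):
    print(f"{n} cores: scaling efficiency {p / c:.2f}")
# Every step stays at or slightly above 1.0, i.e. near-linear
# (in fact slightly superlinear) scaling.
</code></pre>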
    <div>
      <h3>TDP sensitivity</h3>
      <a href="#tdp-sensitivity">
        
      </a>
    </div>
    <p><a href="https://blog.cloudflare.com/thermal-design-supporting-gen-12-hardware-cool-efficient-and-reliable/"><u>Thermal Design Power (TDP), is the maximum amount of heat generated by a CPU that the cooling system is designed to dissipate</u></a>, but more commonly refers to the power consumption of the processor under the maximum theoretical loads. AMD Genoa 9654’s default TDP is 360W, but can be configured up to 400W TDP. Is more always better? Does Cloudflare continue to see meaningful performance improvement up to 400W, or does performance stagnate at some point?</p><p>The chart below shows the result of sweeping the TDP of the AMD Genoa 9654 (in power determinism mode) from 240W to 400W. (Note: x-axis step size is not linear).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/216pSnjCDQOLwtRS822ugu/17dc1885d34ae10a8a1cecb6584b7dd7/image3.png" />
          </figure><p>Cloudflare’s simulated primary workload continues to see incremental performance improvements up to the maximum configurable 400W, albeit at a less favorable perf/watt ratio.</p><p>Looking at TDP sensitivity data is a quick and easy way to identify if performance stagnates at some power point, but what does power sensitivity actually measure? There are several factors contributing to CPU power consumption, but let's focus on one of the primary factors: dynamic power consumption. Dynamic power consumption is approximately <i>CV</i><i><sup>2</sup></i><i>f</i>, where C is the switched load capacitance, V is the regulated voltage, and f is the frequency. In modern processors like the AMD Genoa 9654, the CPU dynamically scales its voltage along with frequency, so theoretically, CPU dynamic power is loosely proportional to f<sup>3</sup>. In other words, measuring TDP sensitivity is measuring the frequency sensitivity of a workload. Does the data agree? Yes!</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>cTDP</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>All core boost frequency (GHz)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Perf (rps) / baseline</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>240</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.47</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.78x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>280</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.87x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>320</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.93</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.93x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>340</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.13</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.97x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>360</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>380</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.4</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.03x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>390</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.465</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.04x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>400</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.55</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.05x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
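<p>We can make that concrete with the table’s own numbers: normalizing each row’s frequency against the 360W / 3.3 GHz baseline shows the performance multiplier tracking the frequency ratio to within a few percent at every step:</p>
<pre><code># Performance multiplier vs. frequency ratio, both normalized to
# the 360W / 3.3 GHz baseline row of the table above.
rows = [  # (cTDP in watts, all core boost in GHz, perf multiplier)
    (240, 2.47, 0.78), (280, 2.75, 0.87), (320, 2.93, 0.93),
    (340, 3.13, 0.97), (360, 3.30, 1.00), (380, 3.40, 1.03),
    (390, 3.465, 1.04), (400, 3.55, 1.05),
]
BASE_GHZ = 3.30

for tdp, ghz, perf in rows:
    print(f"{tdp}W: freq ratio {ghz / BASE_GHZ:.2f}, perf {perf:.2f}x")
# The two columns agree to within a few percent at every step.
</code></pre>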
    <div>
      <h3>Frequency sensitivity</h3>
      <a href="#frequency-sensitivity">
        
      </a>
    </div>
    <p>Instead of relying on an indirect measure through the TDP, let’s measure frequency sensitivity directly by sweeping the maximum boost frequency.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2oobymubOiZrpJjSExQxxo/a02985506ecc803e8b09d7f999b86c55/image2.png" />
          </figure><p>Above 3 GHz, the data shows that Cloudflare’s primary workload sees a roughly 2% incremental improvement for every 0.1 GHz increase in all core average frequency. We hit the 400W power cap at 3.545 GHz. This is notably higher than the all core boost frequency of our Gen 11 servers with the AMD Milan 7713, which reach about 2.7 GHz in production and 2.4 GHz in our performance simulation. Amazing!</p>
    <div>
      <h3>L3 cache size sensitivity</h3>
      <a href="#l3-cache-size-sensitivity">
        
      </a>
    </div>
    <p>What about L3 cache size sensitivity? L3 cache size is one of the primary design choices, and a major difference between the trio of Genoa, Bergamo, and Genoa-X: Genoa 9654 has 4 MB of L3 per core, Bergamo 9754 has 2 MB per core, and Genoa-X 9684X has 12 MB per core. L3 cache is the last and largest on-chip “memory” bank before an access has to go out to the DIMMs, which costs significantly more CPU cycles.</p><p>We ran an experiment on the Genoa 9654 to check how performance scales with L3 cache size. L3 cache per core is reduced through MSR writes (this could also be done with cache allocation technology such as <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/use-intel-resource-director-technology-to-allocate-last-level-cache-llc.htm"><u>Intel RDT</u></a>; a resctrl-based sketch appears after the results below), and increased by disabling physical cores in a CCD, which reduces the number of cores sharing the fixed-size 32 MB L3 cache per CCD, effectively growing the L3 cache per core. Below is the result of the experiment, where &gt;1.00x indicates better performance:</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>L3 cache size increase vs baseline 4MB per core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>0.25x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>0.5x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>0.75x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1.14x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1.33x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1.60x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>2.00x</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>rps/core / baseline</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.67x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.78x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.89x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.08x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.15x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.25x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.31x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>L3 cache miss rate per CCD</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>56.04%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>39.15%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>30.37%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>23.55%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>22.39%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>19.73%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>16.94%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>14.28%</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/cjfPuhUHBfABvQ5hJGchb/e470ca9bb45b0c69b5069556ce866647/image7.png" />
          </figure><p>Even though we expected faster DDR5 and greater memory bandwidth to diminish the impact of L3 cache size, Cloudflare’s simulated primary workload is quite sensitive to it. The L3 cache miss rate dropped from 56% with only 1 MB of L3 per core to 14.28% with 8 MB per core. Changing the L3 cache size by 25% changes performance by approximately 11%, and performance keeps improving up to 2x the baseline L3 per core, though the gains start to diminish at that point.</p><p>Do we see the same behavior when comparing Genoa 9654, Bergamo 9754, and Genoa-X 9684X? We ran an experiment comparing the impact of L3 cache size while controlling for core count and all core boost frequency, and again saw significant deltas. Halving the L3 cache from 4 MB/core to 2 MB/core reduces performance by 24%, roughly matching the experiment above. However, tripling the cache from 4 MB/core to 12 MB/core only increases performance by 25%, less than the earlier experiment would suggest. This is likely because part of the gain in that experiment can be attributed to reduced cache contention from the disabled cores, an artifact of how we set up the test. Nevertheless, these are significant deltas!</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>L3/core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>2MB/core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>4MB/core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>12MB/core</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Perf (rps) / baseline</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.76x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.25x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
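<p>As an aside, a cache-size sweep like the one above can also be approximated from user space on kernels that expose cache allocation through the resctrl filesystem (backed by AMD’s L3 allocation or Intel RDT, where supported). A minimal sketch, assuming resctrl is already mounted; the group name, cache-domain id, way mask, and PID below are placeholders, not the values from our experiment:</p>
<pre><code># Minimal sketch: restrict a process to a subset of L3 cache ways via
# the Linux resctrl filesystem (requires root and resctrl mounted at
# /sys/fs/resctrl). The group name, domain id, way mask, and PID are
# placeholders, not the values used in our experiment.
import os

GROUP = "/sys/fs/resctrl/l3_limited"  # hypothetical control group
MASK = "00f"                          # e.g. 4 ways; mask width is part-specific
PID = 12345                           # hypothetical process to constrain

os.makedirs(GROUP, exist_ok=True)
with open(os.path.join(GROUP, "schemata"), "w") as f:
    f.write(f"L3:0={MASK}\n")         # cap the allocation on cache domain 0
with open(os.path.join(GROUP, "tasks"), "w") as f:
    f.write(str(PID))                 # move the process into the group
</code></pre>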
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>The table below summarizes how each factor from the sensitivity analysis above contributes to the overall performance gain. An additional 6% to 14% of the improvement is unaccounted for by these factors, and is contributed by others, such as the larger L2 cache, higher memory bandwidth, and miscellaneous CPU architecture changes that improve IPC.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Milan</strong></span></span></p>
                        <p><span><span><strong>7713</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa</strong></span></span></p>
                        <p><span><span><strong>9654</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo</strong></span></span></p>
                        <p><span><span><strong>9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X</strong></span></span></p>
                        <p><span><span><strong>9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Lab simulation performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.95x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to Core scaling</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.5x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.5x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to Frequency scaling</strong></span></span></p>
                        <p><span><span><strong>(*Note: Milan 7713 all core frequency is ~2.4GHz when running simulated workload at 100% CPU utilization)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.32x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.21x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.29x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to L3 cache size scaling</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.76x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.25x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to other factors like larger L2 cache, higher memory bandwidth, miscellaneous CPU architecture changes that improve IPC</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.11x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.06x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.14x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
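<p>The decomposition is easy to sanity-check: multiply the core, frequency, and L3 multipliers for each CPU, then divide the lab simulation multiplier by that product. The quotient is exactly the residual in the last row:</p>
<pre><code># Multiply the per-factor multipliers and derive the residual
# attributed to other factors (larger L2 cache, memory bandwidth, IPC).
# Columns: core scaling, frequency scaling, L3 scaling, lab total.
cpus = {
    "Genoa 9654":    (1.50, 1.32, 1.00, 2.20),
    "Bergamo 9754":  (2.00, 1.21, 0.76, 1.95),
    "Genoa-X 9684X": (1.50, 1.29, 1.25, 2.75),
}

for name, (core, freq, l3, total) in cpus.items():
    explained = core * freq * l3
    print(f"{name}: explained {explained:.2f}x, residual {total / explained:.2f}x")
# Residuals come out to roughly 1.11x, 1.06x, and 1.14x,
# matching the last row of the table.
</code></pre>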
    <div>
      <h2>Performance evaluation in production</h2>
      <a href="#performance-evaluation-in-production">
        
      </a>
    </div>
    <p>How do these CPU candidates perform with real-world traffic and an actual production workload mix? The table below summarizes the performance of the three CPUs in lab simulation and in production. Genoa-X 9684X continues to outperform in production.</p><p>In addition, the Gen 12 server equipped with Genoa-X offered outstanding performance while consuming only 1.5x the power per system of our Gen 11 server with Milan 7713. In other words, we see a 63% increase in performance per watt (2.45x the performance at 1.5x the power: 2.45 / 1.5 ≈ 1.63). Genoa-X 9684X provides the best TCO improvement among the three options, and was ultimately chosen as the CPU for our Gen 12 server.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Milan 7713</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa 9654</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo 9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X 9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Lab simulation performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.95x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Production performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.15x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.45x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Production performance per watt multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.33x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.38x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.63x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>The Gen 12 server with AMD Genoa-X 9684X is the most powerful and most power-efficient server Cloudflare has built to date. It serves as the underlying platform for all the incredible services that Cloudflare offers to our customers globally, and will help power the growth of Cloudflare infrastructure for the next several years with an improved cost structure.</p><p>Hardware engineers at Cloudflare work closely with our infrastructure engineering partners and externally with our vendors to design and develop world-class servers to best serve our customers.</p><p><a href="https://www.cloudflare.com/careers/jobs/"><u>Come join us</u></a> at Cloudflare to help build a better Internet!</p> ]]></content:encoded>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <guid isPermaLink="false">1Sj8x3oCQRSZqU8oNgMmkE</guid>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare’s 12th Generation servers — 145% more performant and 63% more efficient]]></title>
            <link>https://blog.cloudflare.com/gen-12-servers/</link>
            <pubDate>Wed, 25 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is thrilled to announce the general deployment of our next generation of server — Gen 12 powered by AMD Genoa-X processors. This new generation of server focuses on delivering exceptional performance across all Cloudflare services, enhanced support for AI/ML workloads, significant strides in power efficiency, and improved security features. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare is thrilled to announce the general deployment of our next generation of servers — Gen 12 powered by AMD EPYC 9684X (code name “Genoa-X”) processors. This next generation focuses on delivering exceptional performance across all Cloudflare services, enhanced support for AI/ML workloads, significant strides in power efficiency, and improved security features.</p><p>Here are some key performance indicators and feature improvements that this generation delivers as compared to the <a href="https://blog.cloudflare.com/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/"><u>prior generation</u></a>: </p><p>Beginning with performance, with close engineering collaboration between Cloudflare and AMD on optimization, Gen 12 servers can serve more than twice as many requests per second (RPS) as Gen 11 servers, resulting in lower Cloudflare infrastructure build-out costs.</p><p>Next, our power efficiency has improved significantly, by more than 60% in RPS per watt as compared to the prior generation. As Cloudflare continues to expand our infrastructure footprint, the improved efficiency helps reduce Cloudflare’s operational expenditure and carbon footprint as a percentage of our fleet size.</p><p>Third, in response to the growing demand for AI capabilities, we've updated the thermal-mechanical design of our Gen 12 server to support more powerful GPUs. This aligns with the <a href="https://www.cloudflare.com/lp/pg-ai/?utm_medium=cpc&amp;utm_source=google&amp;utm_campaign=2023-q4-acq-gbl-developers-wo-ge-general-paygo_mlt_all_g_search_bg_exp__dev&amp;utm_content=workers-ai&amp;gad_source=1&amp;gclid=CjwKCAjwl6-3BhBWEiwApN6_kjigJdDvEYqHPYi8tdXuTe4APbqX923v-CBjpGiAVwITNhp8GrW3ARoCyJ4QAvD_BwE&amp;gclsrc=aw.ds"><u>Workers AI</u></a> objective to support larger large language models and increase throughput for smaller models. This enhancement underscores our ongoing commitment to advancing AI inference capabilities</p><p>Fourth, to underscore our security-first position as a company, we've integrated hardware <a href="https://trustedcomputinggroup.org/about/what-is-a-root-of-trust-rot/"><u>root of trust</u></a> (HRoT) capabilities to ensure the integrity of boot firmware and board management controller firmware. Continuing to embrace open standards, the baseboard management and security controller (Data Center Secure Control Module or <a href="https://drive.google.com/file/d/13BxuseSrKo647hjIXjp087ei8l5QQVb0/view"><u>OCP DC-SCM</u></a>) that we’ve designed into our systems is modular and vendor-agnostic, enabling a unified <a href="https://www.openbmc.org/"><u>openBMC</u></a> image, quicker prototyping, and allowing for reuse.</p><p>Finally, given the increasing importance of supply assurance and reliability in infrastructure deployments, our approach includes a robust multi-vendor strategy to mitigate supply chain risks, ensuring continuity and resiliency of our infrastructure deployment.</p><p>Cloudflare is dedicated to constantly improving our server fleet, empowering businesses worldwide with enhanced performance, efficiency, and security.</p>
    <div>
      <h2>Gen 12 Servers </h2>
      <a href="#gen-12-servers">
        
      </a>
    </div>
    <p>Let's take a closer look at our Gen 12 server. The server is powered by a 4th generation AMD EPYC Processor, paired with 384 GB of DDR5 RAM, 16 TB of NVMe storage, a dual-port 25 GbE NIC, and two 800-watt power supply units.</p>
<div><table><thead>
  <tr>
    <th><span>Generation</span></th>
    <th><span>Gen 12 Compute</span></th>
    <th><span>Previous Gen 11 Compute</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Form Factor</span></td>
    <td><span>2U1N - Single socket</span></td>
    <td><span>1U1N - Single socket</span></td>
  </tr>
  <tr>
    <td><span>Processor</span></td>
    <td><span>AMD EPYC 9684X Genoa-X 96-Core Processor</span></td>
    <td><span>AMD EPYC 7713 Milan 64-Core Processor</span></td>
  </tr>
  <tr>
    <td><span>Memory</span></td>
    <td><span>384GB of DDR5-4800</span><br /><span>x12 memory channel</span></td>
    <td><span>384GB of DDR4-3200</span><br /><span>x8 memory channel</span></td>
  </tr>
  <tr>
    <td><span>Storage</span></td>
    <td><span>x2 E1.S NVMe</span><br /><span>Samsung PM9A3 7.68TB / Micron 7450 Pro 7.68TB</span></td>
    <td><span>x2 M.2 NVMe</span><br /><span>2x Samsung PM9A3 x 1.92TB</span></td>
  </tr>
  <tr>
    <td><span>Network</span></td>
    <td><span>Dual 25 GbE OCP 3.0 </span><br /><span>Intel Ethernet Network Adapter E810-XXVDA2 / NVIDIA Mellanox ConnectX-6 Lx</span></td>
    <td><span>Dual 25 GbE OCP 2.0</span><br /><span>Mellanox ConnectX-4 dual-port 25G</span></td>
  </tr>
  <tr>
    <td><span>System Management</span></td>
    <td><span>DC-SCM 2.0</span><br /><span>ASPEED AST2600 (BMC) + AST1060 (HRoT)</span></td>
    <td><span>ASPEED AST2500 (BMC)</span></td>
  </tr>
  <tr>
    <td><span>Power Supply</span></td>
    <td><span>800W - Titanium Grade</span></td>
    <td><span>650W - Titanium Grade</span></td>
  </tr>
</tbody></table></div>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ywinOgSFpevEcQSZLhESv/b61d70a1504b4d873d0bbf2e83221bf6/BLOG-2116_2.png" />
          </figure><p><sup><i>Cloudflare Gen 12 server</i></sup></p>
    <div>
      <h3>CPU</h3>
      <a href="#cpu">
        
      </a>
    </div>
    <p>During the design phase, we conducted an extensive survey of the CPU landscape; the available options give us valuable choices as we consider how to shape the future of Cloudflare's server technology to match the needs of our customers. We evaluated many candidates in the lab, and short-listed three standout candidates from the 4th generation AMD EPYC Processor lineup for production evaluation: Genoa 9654, Bergamo 9754, and Genoa-X 9684X. The table below summarizes the differences in <a href="https://www.amd.com/content/dam/amd/en/documents/products/epyc/epyc-9004-series-processors-data-sheet.pdf"><u>specifications</u></a> of the short-listed candidates for Gen 12 servers against the AMD EPYC 7713 used in our Gen 11 servers. Notably, all three candidates offer a significant increase in core count and a marked increase in all core boost clock frequency.</p>
<div><table><thead>
  <tr>
    <th><span>CPU Model</span></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 7713</span></a></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 9654</span></a></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 9754</span></a></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 9684X</span></a></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Series</span></td>
    <td><span>Milan</span></td>
    <td><span>Genoa</span></td>
    <td><span>Bergamo</span></td>
    <td><span>Genoa-X</span></td>
  </tr>
  <tr>
    <td><span># of CPU Cores</span></td>
    <td><span>64</span></td>
    <td><span>96</span></td>
    <td><span>128</span></td>
    <td><span>96</span></td>
  </tr>
  <tr>
    <td><span># of Threads</span></td>
    <td><span>128</span></td>
    <td><span>192</span></td>
    <td><span>256</span></td>
    <td><span>192</span></td>
  </tr>
  <tr>
    <td><span>Base Clock</span></td>
    <td><span>2.0 GHz</span></td>
    <td><span>2.4 GHz</span></td>
    <td><span>2.25 GHz</span></td>
    <td><span>2.4 GHz</span></td>
  </tr>
  <tr>
    <td><span>Max Boost Clock</span></td>
    <td><span>3.67 GHz</span></td>
    <td><span>3.7 GHz</span></td>
    <td><span>3.1 GHz</span></td>
    <td><span>3.7 GHz</span></td>
  </tr>
  <tr>
    <td><span>All Core Boost Clock</span></td>
    <td><span>2.7 GHz *</span></td>
    <td><span>3.55 GHz</span></td>
    <td><span>3.1 GHz</span></td>
    <td><span>3.42 GHz</span></td>
  </tr>
  <tr>
    <td><span>Total L3 Cache</span></td>
    <td><span>256 MB</span></td>
    <td><span>384 MB</span></td>
    <td><span>256 MB</span></td>
    <td><span>1152 MB</span></td>
  </tr>
  <tr>
    <td><span>L3 cache per core</span></td>
    <td><span>4MB / core</span></td>
    <td><span>4MB / core</span></td>
    <td><span>2MB / core</span></td>
    <td><span>12MB / core</span></td>
  </tr>
  <tr>
    <td><span>Maximum configurable TDP</span></td>
    <td><span>240W</span></td>
    <td><span>400W</span></td>
    <td><span>400W</span></td>
    <td><span>400W</span></td>
  </tr>
</tbody></table></div><p><sub>*Note: The AMD EPYC 7713 all core boost clock frequency of 2.7 GHz is not an official specification of the CPU, but is based on data collected from Cloudflare's production fleet.</sub></p><p>During production evaluation, the configurations of all three CPUs were optimized to the best of our knowledge, including setting the thermal design power (TDP) to 400W for maximum performance. The servers are set up to run the same processes and services as any other server we have in production, which makes for a great side-by-side comparison.</p>
<div><table><thead>
  <tr>
    <th></th>
    <th><span>Milan 7713</span></th>
    <th><span>Genoa 9654</span></th>
    <th><span>Bergamo 9754</span></th>
    <th><span>Genoa-X 9684X</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Production performance (request per second) multiplier</span></td>
    <td><span>1x</span></td>
    <td><span>2x</span></td>
    <td><span>2.15x</span></td>
    <td><span>2.45x</span></td>
  </tr>
  <tr>
    <td><span>Production efficiency (request per second per watt) multiplier</span></td>
    <td><span>1x</span></td>
    <td><span>1.33x</span></td>
    <td><span>1.38x</span></td>
    <td><span>1.63x</span></td>
  </tr>
</tbody></table></div>
    <div>
      <h4>AMD EPYC Genoa-X in Cloudflare Gen 12 server</h4>
      <a href="#amd-epyc-genoa-x-in-cloudflare-gen-12-server">
        
      </a>
    </div>
    <p>Each of these CPUs outperforms the previous generation of processors by at least 2x. AMD EPYC 9684X Genoa-X with 3D V-cache technology gave us the greatest performance improvement, at 2.45x, when compared against our Gen 11 servers with AMD EPYC 7713 Milan.</p><p>Comparing the performance of Genoa-X 9684X and Genoa 9654, we see a ~22.5% performance delta. The primary difference between the two CPUs is the amount of L3 cache available on the CPU: Genoa-X 9684X has 1152 MB of L3 cache, three times the 384 MB of Genoa 9654. Cloudflare workloads benefit from having more last-level cache accessible, avoiding the much larger latency penalty associated with fetching data from memory.</p><p>The Genoa-X 9684X delivered ~22.5% better performance than the Genoa 9654 while consuming the same 400W of power. The 3x larger L3 cache does consume additional power, but only at the cost of about 3% of the highest achievable all core boost frequency on Genoa-X 9684X, a favorable trade-off for Cloudflare workloads.</p><p>More importantly, the Genoa-X 9684X delivered a 145% performance improvement with only a 50% increase in system power, a 63% power efficiency improvement that will help drive down operational expenditure tremendously. It is important to note that even though a big portion of the power efficiency is due to the CPU, it needs to be paired with an optimal thermal-mechanical design to realize the full benefit. Earlier last year, <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor/"><u>we made the thermal-mechanical design choice to double the height of the server chassis to optimize rack density and cooling efficiency across our global data centers</u></a>. We estimated that moving from 1U to 2U would reduce fan power by 150W, which would decrease system power from 750 watts to 600 watts. Guess what? We were right: a Gen 12 server consumes 600 watts per system at a typical ambient temperature of 25°C.</p><p>While high performance often comes at a higher price, fortunately the AMD EPYC 9684X offers an excellent balance between cost and capability. A server designed with this CPU provides top-tier performance without necessitating a huge financial outlay, resulting in a good Total Cost of Ownership improvement for Cloudflare.</p>
    <div>
      <h3>Memory</h3>
      <a href="#memory">
        
      </a>
    </div>
    <p>The AMD Genoa-X CPU supports twelve channels of DDR5 memory at up to 4800 megatransfers per second (MT/s), for a per-socket memory bandwidth of 460.8 GB/s. The twelve channels are fully populated with 32 GB ECC 2Rx8 DDR5 RDIMMs in a one-DIMM-per-channel configuration, for a combined memory capacity of 384 GB.</p><p>Choosing the optimal memory capacity is a balancing act, as maintaining an optimal memory-to-core ratio is important to make sure neither CPU capacity nor memory capacity is wasted. Some may remember that our Gen 11 servers with 64-core AMD EPYC 7713 CPUs are also configured with 384 GB of memory, which is about 6 GB per core. So why did we choose to configure our Gen 12 servers with 384 GB of memory when the core count is growing to 96 cores? Great question! A lot of memory optimization work has happened since we introduced Gen 11, including some that we blogged about, like <a href="https://blog.cloudflare.com/scalable-machine-learning-at-cloudflare/"><u>Bot Management code optimization</u></a> and <a href="https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/"><u>our transition to highly efficient Pingora</u></a>. In addition, each service has a memory allocation that is sized for optimal performance. The per-service memory allocation is programmed and monitored using Linux control group resource management features. When sizing memory capacity for Gen 12, we consulted with the team that monitors resource allocation and surveyed memory utilization metrics collected from our fleet. The result of the analysis is that the optimal memory-to-core ratio is 4 GB per CPU core, or 384 GB of total memory capacity, and this configuration is validated in production. We chose dual rank memory modules over single rank memory modules because they have higher memory throughput, which improves server performance (read more about <a href="https://blog.cloudflare.com/ddr4-memory-organization-and-how-it-affects-memory-bandwidth/"><u>memory module organization and its effect on memory bandwidth</u></a>).</p><p>The table below shows the result of running the <a href="https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html"><u>Intel Memory Latency Checker (MLC)</u></a> tool to measure peak memory bandwidth for the system and to compare memory throughput between 12 channels of dual rank (2Rx8) 32 GB DIMMs and 12 channels of single rank (1Rx4) 32 GB DIMMs. Dual rank DIMMs have slightly (1.8%) higher read bandwidth, but noticeably higher write bandwidth: as the write ratio increases from 25% to 50%, the dual rank advantage grows from 7.7% to 17.8%.</p>
<div><table><thead>
  <tr>
    <th><span>Benchmark</span></th>
    <th><span>Dual rank advantage over single rank</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Intel MLC ALL Reads</span></td>
    <td><span>101.8%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC 3:1 Reads-Writes</span></td>
    <td><span>107.7%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC 2:1 Reads-Writes</span></td>
    <td><span>112.9%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC 1:1 Reads-Writes</span></td>
    <td><span>117.8%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC Stream-triad like</span></td>
    <td><span>108.6%</span></td>
  </tr>
</tbody></table></div><p>The table below shows the result of running the <a href="https://www.amd.com/en/developer/zen-software-studio/applications/spack/stream-benchmark.html"><u>AMD STREAM benchmark</u></a> to measure sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels. In all 4 types of vector kernels, dual rank DIMMs provide a noticeable advantage over single rank DIMMs.</p>
<div><table><thead>
  <tr>
    <th><span>Benchmark</span></th>
    <th><span>Dual-rank throughput (single rank = 100%)</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Stream Copy</span></td>
    <td><span>115.44%</span></td>
  </tr>
  <tr>
    <td><span>Stream Scale</span></td>
    <td><span>111.22%</span></td>
  </tr>
  <tr>
    <td><span>Stream Add</span></td>
    <td><span>109.06%</span></td>
  </tr>
  <tr>
    <td><span>Stream Triad</span></td>
    <td><span>107.70%</span></td>
  </tr>
</tbody></table></div>
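    <p>For context on what these measurements are pushing against, the platform’s theoretical ceiling follows directly from the channel count and transfer rate. The short Python sketch below (simple arithmetic, not a benchmark) reproduces the 460.8 GB/s per-socket figure quoted above:</p>
            <pre><code># Theoretical peak DRAM bandwidth = channels x transfer rate x bus width.
# Each DDR5 channel carries a 64-bit (8-byte) data bus.
channels = 12             # memory channels on AMD Genoa-X
transfers_per_s = 4800e6  # DDR5-4800: 4800 megatransfers per second
bytes_per_transfer = 8    # 64-bit data bus per channel

peak_bw = channels * transfers_per_s * bytes_per_transfer
print(f"Peak bandwidth: {peak_bw / 1e9:.1f} GB/s")  # 460.8 GB/s</code></pre>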
    <div>
      <h3>Storage</h3>
      <a href="#storage">
        
      </a>
    </div>
    <p>Cloudflare’s Gen X and Gen 11 servers support <a href="https://en.wikipedia.org/wiki/M.2"><u>M.2</u></a> form factor drives. We liked the M.2 form factor mainly because it is compact. However, the M.2 specification was introduced in 2012, and today its connector system is dated: the industry has concerns about its ability to maintain signal integrity at the high-speed signaling rates specified by the <a href="https://www.xda-developers.com/pcie-5/"><u>PCIe 5.0</u></a> and <a href="https://pcisig.com/pci-express-6.0-specification"><u>PCIe 6.0</u></a> specifications. The 8.25W thermal limit of the M.2 form factor also caps the number of flash dies that can be fitted, which limits the maximum supported capacity per drive. To address these concerns, the industry has introduced the <a href="https://americas.kioxia.com/content/dam/kioxia/en-us/business/ssd/data-center-ssd/asset/KIOXIA_Meta_Microsoft_EDSFF_E1_S_Intro_White_Paper.pdf"><u>E1.S</u></a> specification and is transitioning from M.2 to E1.S. </p><p>In Gen 12, we are making the change to the <a href="https://www.snia.org/forums/cmsi/knowledge/formfactors#EDSFF"><u>EDSFF</u></a> E1 form factor, more specifically E1.S 15mm. While still compact, E1.S 15mm provides more space to fit additional flash dies, enabling larger capacities. The form factor also has a better cooling design, supporting more than 25W of sustained power.</p><p>While the AMD Genoa-X CPU supports 128 PCIe 5.0 lanes, we continue to use NVMe devices with PCIe Gen 4.0 x4 lanes, as PCIe Gen 4.0 throughput is sufficient to meet drive bandwidth requirements while keeping server design costs optimal. The server is equipped with two 8 TB NVMe drives for a total of 16 TB of available storage. We opted for two 8 TB drives instead of four 4 TB drives because the dual-drive configuration already provides sufficient I/O bandwidth for all Cloudflare workloads that run on each server.</p>
<div><table><thead>
  <tr>
    <th><span>Specification</span></th>
    <th><span>Value</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Sequential Read (MB/s)</span></td>
    <td><span>6,700</span></td>
  </tr>
  <tr>
    <td><span>Sequential Write (MB/s)</span></td>
    <td><span>4,000</span></td>
  </tr>
  <tr>
    <td><span>Random Read IOPS</span></td>
    <td><span>1,000,000</span></td>
  </tr>
  <tr>
    <td><span>Random Write IOPS</span></td>
    <td><span>200,000</span></td>
  </tr>
  <tr>
    <td><span>Endurance</span></td>
    <td><span>1 DWPD</span></td>
  </tr>
  <tr>
    <td><span>PCIe Gen 4 x4 lane throughput (MB/s)</span></td>
    <td><span>7,880</span></td>
  </tr>
</tbody></table></div><p><sub><i>Storage device performance specification</i></sub></p>
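    <p>The 7,880 MB/s figure in the table is the effective payload bandwidth of a PCIe 4.0 x4 link, and the headroom it leaves over the drive’s 6,700 MB/s sequential read rating is why Gen 4.0 lanes are sufficient here. A rough derivation in Python (this ignores packet and protocol overhead beyond line encoding, so treat it as an upper bound):</p>
            <pre><code># PCIe 4.0 signals at 16 GT/s per lane with 128b/130b line encoding,
# so 128 of every 130 transferred bits carry data.
gt_per_s = 16e9        # transfers (bits) per second per lane
lanes = 4
encoding = 128 / 130   # 128b/130b line-code efficiency

payload_bytes_per_s = gt_per_s * lanes * encoding / 8
print(f"PCIe 4.0 x4 payload bandwidth: {payload_bytes_per_s / 1e6:.0f} MB/s")
# ~7877 MB/s, comfortably above the 6,700 MB/s sequential read rating</code></pre>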
    <div>
      <h3>Network</h3>
      <a href="#network">
        
      </a>
    </div>
    <p>Cloudflare servers and top-of-rack (ToR) network equipment operate at <a href="https://en.wikipedia.org/wiki/25_Gigabit_Ethernet"><u>25 GbE</u></a> speeds. In Gen 12, we adopted a motherboard design inspired by <a href="https://www.opencompute.org/wiki/Server/DC-MHS"><u>DC-MHS</u></a>, and upgraded from the <a href="https://drive.google.com/file/d/1VGAtABAKU9fq3KfClYhFOgGFN3oe63Uw/view?usp=sharing"><u>OCP 2.0 form factor</u></a> to the <a href="https://drive.google.com/file/d/1U3oEGiSWfupG4SnIdPuJ_8Nte2lJRqTN/view?usp=sharing"><u>OCP 3.0 form factor</u></a>, which provides tool-less serviceability of the NIC. The OCP 3.0 form factor also occupies less space in the 2U server than PCIe-attached NICs, which improves airflow and frees up space for other application-specific PCIe cards, such as GPUs.</p><p>Cloudflare has been using the Mellanox CX4-Lx EN dual port 25 GbE NIC since our <a href="https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/"><u>Gen 9 servers in 2018</u></a>. Even though the NIC has served us well over the years, it left us single-sourced. During the pandemic, we faced supply constraints and extremely long lead times, and the team scrambled to qualify the Broadcom M225P dual port 25 GbE NIC as our second-sourced NIC in 2022, ensuring we could continue to turn up servers to serve customer demand. With the lessons learned from single-sourcing the Gen 11 NIC, we are now dual-sourcing and have chosen the Intel Ethernet Network Adapter E810 and NVIDIA Mellanox ConnectX-6 Lx to support Gen 12. These two NICs are compliant with the <a href="https://www.opencompute.org/wiki/Server/NIC"><u>OCP 3.0 specification</u></a> and offer more MSI-X queues that can then be mapped to the increased core count on the AMD EPYC 9684X. The Intel Ethernet Network Adapter comes with an additional advantage, offering full Generic Segmentation Offload (GSO) support including VLAN-tagged encapsulated traffic, whereas many vendors today either support only <a href="https://netdevconf.info/1.2/papers/LCO-GSO-Partial-TSO-MangleID.pdf"><u>Partial GSO</u></a> or do not support it at all. With full GSO support, the kernel spends noticeably less time segmenting packets in softirq context, and servers with Intel E810 NICs process approximately 2% more requests per second.</p>
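    <p>On a Linux host, the segmentation offloads a NIC driver exposes are visible as feature flags via <code>ethtool -k</code>. A minimal sketch for inspecting them (assuming the <code>ethtool</code> utility is installed; the interface name is an illustrative placeholder):</p>
            <pre><code>import subprocess

# List offload feature flags for an interface and print the
# segmentation-related ones (e.g. generic-segmentation-offload,
# tcp-segmentation-offload, tx-gso-partial).
iface = "eth0"  # illustrative interface name
out = subprocess.run(["ethtool", "-k", iface],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    if "segmentation" in line or "gso" in line:
        print(line.strip())</code></pre>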
    <div>
      <h3>Improved security with DC-SCM: Project Argus</h3>
      <a href="#improved-security-with-dc-scm-project-argus">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ur3YccqqXckIL6oWKd6Lq/5352252ff8e5c1fb15eb02d1572a0689/BLOG-2116_3.png" />
          </figure><p><sup><i>DC-SCM in Gen 12 server (Project Argus)</i></sup></p><p>Gen 12 servers are integrated with <a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a>, one of the industry’s first implementations of <a href="https://drive.google.com/file/d/13BxuseSrKo647hjIXjp087ei8l5QQVb0/view"><u>Data Center Secure Control Module 2.0 (DC-SCM 2.0)</u></a>. DC-SCM 2.0 decouples server management and security functions from the motherboard. The baseboard management controller (BMC), hardware root of trust (HRoT), trusted platform module (TPM), and dual BMC/BIOS flash chips are all installed on the DC-SCM. </p><p>On our Gen X and Gen 11 servers, Cloudflare moved our secure boot trust anchor from the system Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI) firmware to hardware-rooted boot integrity: <a href="https://blog.cloudflare.com/anchoring-trust-a-hardware-secure-boot-story/"><u>AMD’s implementation of Platform Secure Boot (PSB)</u></a> or <a href="https://blog.cloudflare.com/armed-to-boot/"><u>Ampere’s implementation of Single Domain Secure Boot</u></a>. These solutions helped secure Cloudflare infrastructure against BIOS/UEFI firmware attacks. However, they left us vulnerable to out-of-band attacks that compromise the BMC firmware. The BMC is a microcontroller that provides out-of-band monitoring and management capabilities for the system; if it is compromised, attackers can, for example, read processor console logs accessible to the BMC and control server power states. On Gen 12, the HRoT on the DC-SCM serves as the trust store for cryptographic keys and is responsible for authenticating both the BIOS/UEFI firmware (independent of CPU vendor) and the BMC firmware during the secure boot process.</p><p>In addition, the DC-SCM carries additional flash storage devices that hold backup BIOS/UEFI and BMC firmware images, enabling rapid recovery if corrupted or malicious firmware is programmed, and providing resilience against flash chip failures due to aging.</p><p>These updates make our Gen 12 server more secure and more resilient to firmware attacks.</p>
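    <p>Conceptually, the HRoT’s job during secure boot is measure-then-verify: hash the firmware image and check a signature over it against a key in its trust store. The Python sketch below illustrates that flow with the <code>cryptography</code> library; it is a conceptual model only, not Cloudflare’s or any HRoT vendor’s implementation, and the file names and key type are assumptions:</p>
            <pre><code>from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

# Illustrative artifacts: a firmware image, a detached signature, and the
# signer's public key (in a real HRoT the key lives in an immutable
# trust store, not on disk).
image = open("bmc_firmware.bin", "rb").read()
signature = open("bmc_firmware.sig", "rb").read()
pub = serialization.load_pem_public_key(open("vendor_key.pem", "rb").read())

try:
    # Verify an RSA-PSS signature over the SHA-384 digest of the image.
    pub.verify(signature, image,
               padding.PSS(mgf=padding.MGF1(hashes.SHA384()),
                           salt_length=padding.PSS.MAX_LENGTH),
               hashes.SHA384())
    print("firmware authenticated: continue boot")
except InvalidSignature:
    print("verification failed: recover from the backup flash instead")</code></pre>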
    <div>
      <h3>Power</h3>
      <a href="#power">
        
      </a>
    </div>
    <p>A Gen 12 server consumes 600 watts at a typical ambient temperature of 25°C. Even though this is a 50% increase from the 400 watts consumed by a Gen 11 server, as mentioned in the CPU section above, it is a relatively small price to pay for a 145% increase in performance. We’ve paired the server with dual 800W common redundant power supplies (CRPS) with 80 PLUS Titanium grade efficiency. Both power supply units (PSUs) operate actively, sharing power and current between them. The units are hot-pluggable, allowing the server to operate with redundancy and maximize uptime.</p><p><a href="https://www.clearesult.com/80plus/program-details"><u>80 PLUS</u></a> is a PSU efficiency certification program. As the table below shows, a Titanium grade PSU is 2% more efficient than a Platinum grade PSU across the typical operating load range of 25% to 50%. 2% may not sound like a lot, but considering the size of Cloudflare’s fleet, with servers deployed worldwide, a 2% saving over the lifetime of all Gen 12 deployments is a reduction of more than 7 GWh, <a href="https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator#results"><u>equivalent to the carbon sequestered by more than 3400 acres of U.S. forests in one year</u></a>. This upgrade also means our Gen 12 server complies with <a href="https://www.unicomengineering.com/blog/eu-lot-9-update-the-coming-server-power-migration/"><u>EU Lot 9 requirements</u></a> and can be deployed in the EU region.</p>
<div><table><thead>
  <tr>
    <th><span>80 PLUS certification</span></th>
    <th><span>10% load</span></th>
    <th><span>20% load</span></th>
    <th><span>50% load</span></th>
    <th><span>100% load</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>80 PLUS Platinum</span></td>
    <td><span>-</span></td>
    <td><span>92%</span></td>
    <td><span>94%</span></td>
    <td><span>90%</span></td>
  </tr>
  <tr>
    <td><span>80 PLUS Titanium</span></td>
    <td><span>90%</span></td>
    <td><span>94%</span></td>
    <td><span>96%</span></td>
    <td><span>91%</span></td>
  </tr>
</tbody></table></div>
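    <p>The fleet-level arithmetic behind the 7 GWh figure is straightforward even though the exact inputs are internal. Here is a back-of-the-envelope version in Python; the fleet size and service lifetime are illustrative placeholders, not Cloudflare’s actual numbers:</p>
            <pre><code># Wall power drawn per server at a given PSU efficiency.
dc_load_w = 600       # Gen 12 server load (from this post)
eff_platinum = 0.94   # 80 PLUS Platinum at 50% load
eff_titanium = 0.96   # 80 PLUS Titanium at 50% load

saved_w = dc_load_w / eff_platinum - dc_load_w / eff_titanium
print(f"Savings per server: {saved_w:.1f} W")  # ~13.3 W

# Illustrative fleet assumptions (placeholders, not actual figures).
servers, years, hours_per_year = 12_000, 5, 8760
saved_gwh = saved_w * servers * years * hours_per_year / 1e9
print(f"Fleet savings: {saved_gwh:.1f} GWh")   # ~7 GWh at these inputs</code></pre>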
    <div>
      <h3>Drop-in GPU support</h3>
      <a href="#drop-in-gpu-support">
        
      </a>
    </div>
    <p>Demand for machine learning and AI workloads exploded in 2023, and Cloudflare <a href="https://blog.cloudflare.com/workers-ai/"><u>introduced Workers AI</u></a> to serve the needs of our customers. Cloudflare retrofitted or deployed GPUs worldwide in a portion of our Gen 11 server fleet to support the growth of Workers AI. Our Gen 12 server is also designed to accommodate the addition of more powerful GPUs. This gives Cloudflare the flexibility to support Workers AI in all regions of the world, and to strategically place GPUs in regions that reduce inference latency for our customers. With this design, the server can run Cloudflare’s full software stack: during times when GPUs see lower utilization, the server continues to serve general web requests and remains productive.</p><p>The motherboard’s electrical design supports up to two PCIe add-in cards, and the power distribution board is sized to supply an additional 400W of power. The mechanical design accommodates either a single FHFL (full height, full length) double-width GPU PCIe card, or two FHFL single-width GPU PCIe cards. The thermal solution, including component placement, fans, and air duct design, is sized to cool added GPUs with a TDP of up to 400W.</p>
    <div>
      <h3>Looking to the future</h3>
      <a href="#looking-to-the-future">
        
      </a>
    </div>
    <p>Gen 12 servers are currently deployed and live in multiple Cloudflare data centers worldwide, and already process millions of requests per second. Cloudflare’s EPYC journey has not ended: the 5th-generation AMD EPYC CPUs (code name “Turin”) are already available for testing, and we are very excited to start the architecture planning and design discussions for the Gen 13 server. <a href="https://www.cloudflare.com/careers/jobs/"><u>Come join us</u></a> at Cloudflare to help build a better Internet!</p>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">sdvPBDBwhcEcrODVeOE7A</guid>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Ma Xiong</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leveling up Workers AI: general availability and more new capabilities]]></title>
            <link>https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support/</link>
            <pubDate>Tue, 02 Apr 2024 13:01:00 GMT</pubDate>
            <description><![CDATA[ Today, we’re excited to make a series of announcements, including Workers AI, Cloudflare’s inference platform becoming GA and support for fine-tuned models with LoRAs and one-click deploys from HuggingFace. Cloudflare Workers now supports the Python programming language, and more ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YNXJ4s4e47U7MvddTlpz8/3a53be280a5e373b589eba37bc4740d0/Cities-with-GPUs-momentum-update.png" />
            
            </figure><p>Welcome to Tuesday – our AI day of Developer Week 2024! In this blog post, we’re excited to share an overview of our new AI announcements and vision, including news about Workers AI officially going GA with improved pricing, a GPU hardware momentum update, an expansion of our Hugging Face partnership, Bring Your Own LoRA fine-tuned inference, Python support in Workers, more providers in AI Gateway, and Vectorize metadata filtering.</p>
    <div>
      <h3>Workers AI GA</h3>
      <a href="#workers-ai-ga">
        
      </a>
    </div>
    <p>Today, we’re excited to announce that our Workers AI inference platform is now Generally Available. After months of being in open beta, we’ve improved our service with greater reliability and performance, unveiled pricing, and added many more models to our catalog.</p>
    <div>
      <h4>Improved performance &amp; reliability</h4>
      <a href="#improved-performance-reliability">
        
      </a>
    </div>
    <p>With Workers AI, our goal is to make AI inference as reliable and easy to use as the rest of Cloudflare’s network. Under the hood, we’ve upgraded the load balancing that is built into Workers AI. Requests can now be routed to more GPUs in more cities, and each city is aware of the total available capacity for AI inference. If the request would have to wait in a queue in the current city, it can instead be routed to another location, getting results back to you faster when traffic is high. With this, we’ve increased rate limits across all our models – most LLMs now have a rate limit of 300 requests per minute, up from 50 requests per minute during our beta phase. Smaller models have a limit of 1500-3000 requests per minute. Check out our <a href="https://developers.cloudflare.com/workers-ai/platform/limits/">Developer Docs for the rate limits</a> of individual models.</p>
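    <p>If a burst does exceed a model’s rate limit, the API typically responds with an HTTP 429 status, so clients should back off and retry rather than fail outright. A minimal sketch against the Workers AI REST endpoint (the account ID, API token, and backoff policy are placeholders):</p>
            <pre><code>import time
import requests

ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # placeholder
API_TOKEN = "YOUR_API_TOKEN"    # placeholder
MODEL = "@cf/meta/llama-2-7b-chat-int8"

url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
payload = {"messages": [{"role": "user", "content": "Hello world"}]}

# Retry with exponential backoff while rate limited.
for attempt in range(5):
    resp = requests.post(url, headers=headers, json=payload)
    if resp.status_code != 429:
        break
    time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
print(resp.status_code, resp.json())</code></pre>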
    <div>
      <h4>Lowering costs on popular models</h4>
      <a href="#lowering-costs-on-popular-models">
        
      </a>
    </div>
    <p>Alongside our GA of Workers AI, we published a <a href="https://ai.cloudflare.com/#pricing-calculator">pricing calculator</a> for our 10 non-beta models earlier this month. We want Workers AI to be one of the most affordable and accessible solutions to run <a href="https://www.cloudflare.com/learning/ai/inference-vs-training/">inference</a>, so we added a few optimizations to our models to make them more affordable. Now, Llama 2 is over 7x cheaper and Mistral 7B is over 14x cheaper to run than we had initially <a href="https://developers.cloudflare.com/workers-ai/platform/pricing/">published</a> on March 1. We want to continue to be the best platform for AI inference and will continue to roll out optimizations to our customers when we can.</p><p>As a reminder, our billing for Workers AI started on April 1st for our non-beta models, while beta models remain free and unlimited. We offer 10,000 <a href="/workers-ai#:~:text=may%20be%20wondering%20%E2%80%94-,what%E2%80%99s%20a%20neuron">neurons</a> per day for free to all customers. Workers Free customers will encounter a hard rate limit after 10,000 neurons in 24 hours, while Workers Paid customers will incur usage charges of $0.011 per 1,000 additional neurons. Read our <a href="https://developers.cloudflare.com/workers-ai/platform/pricing/">Workers AI Pricing Developer Docs</a> for the most up-to-date information on pricing.</p>
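    <p>To make the neuron math concrete, here is the pricing rule from the paragraph above (10,000 free neurons per day, then $0.011 per 1,000 on Workers Paid) as a tiny Python helper; this is a sketch, not an official calculator:</p>
            <pre><code># Estimate daily Workers AI cost for a Workers Paid account (non-beta models).
FREE_NEURONS_PER_DAY = 10_000
PRICE_PER_1000_NEURONS = 0.011  # USD

def daily_cost(neurons_used: int) -> float:
    """Cost in USD for one day of usage; the first 10k neurons are free."""
    billable = max(0, neurons_used - FREE_NEURONS_PER_DAY)
    return billable / 1000 * PRICE_PER_1000_NEURONS

print(daily_cost(8_000))    # 0.0 -- within the free daily allocation
print(daily_cost(110_000))  # 1.1 -- 100,000 billable neurons</code></pre>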
    <div>
      <h4>New dashboard and playground</h4>
      <a href="#new-dashboard-and-playground">
        
      </a>
    </div>
    <p>Lastly, we’ve revamped our <a href="https://dash.cloudflare.com/?to=/:account/ai/workers-ai">Workers AI dashboard</a> and <a href="https://playground.ai.cloudflare.com/">AI playground</a>. The Workers AI page in the Cloudflare dashboard now shows analytics for usage across models, including neuron calculations to help you better predict pricing. The AI playground lets you quickly test and compare different models and configure prompts and parameters. We hope these new tools help developers start building on Workers AI seamlessly – go try them out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uSiHo9pV21DreFiLpUfPX/5aa5c8a2448da881a0872e3f550c39a2/image3-3.png" />
            
            </figure>
    <div>
      <h3>Run inference on GPUs in over 150 cities around the world</h3>
      <a href="#run-inference-on-gpus-in-over-150-cities-around-the-world">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qyZ48xVu70u80swQKPUau/1e85ea2b833139d277d5769d5a8c8c66/image5-2.png" />
            
            </figure><p>When we announced Workers AI back in September 2023, we set out to deploy GPUs to our data centers around the world. We plan to deliver on that promise and deploy inference-tuned GPUs almost everywhere by the end of 2024, making us the most widely distributed cloud-AI inference platform. We have over 150 cities with GPUs today and will continue to roll out more throughout the year.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/451IgQCfz2j10xM8kXhZy0/67632b1b5f52e3387aa387e43c804324/image7-1.png" />
            
            </figure><p>We also have our next generation of compute servers with GPUs launching in Q2 2024, which means better performance, power efficiency, and improved reliability over previous generations. We provided a preview of our Gen 12 Compute server design in a <a href="/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor">December 2023 blog post</a>, with more details to come. With Gen 12 and future planned hardware launches, the next step is to support larger machine learning models and offer fine-tuning on our platform. This will allow us to achieve higher inference throughput, lower latency, and greater availability for production workloads, as well as to expand support to new categories of workloads, such as fine-tuning.</p>
    <div>
      <h3>Hugging Face Partnership</h3>
      <a href="#hugging-face-partnership">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ATIA1mxaeqHYE31cztdjg/33717b0fd20244f3089cf5d4c8a9c13f/image2-2.png" />
            
            </figure><p>We’re also excited to continue our partnership with Hugging Face in the spirit of bringing the best of open-source to our customers. Now, you can visit some of the most popular models on Hugging Face and easily click to run the model on Workers AI if it is available on our platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3geQJnlHZhMktG1eoEWYKE/36c76ee43eca9443b4e2e9c6d3e1df7e/image6-1.png" />
            
            </figure><p>We’re happy to announce that we’ve added 4 more models to our platform in conjunction with Hugging Face. You can now access the new <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2">Mistral 7B v0.2</a> model with improved context windows, <a href="https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B">Nous Research’s Hermes 2 Pro</a> fine-tuned version of Mistral 7B, <a href="https://huggingface.co/google/gemma-7b-it">Google’s Gemma 7B</a>, and <a href="https://huggingface.co/Nexusflow/Starling-LM-7B-beta">Starling-LM-7B-beta</a> fine-tuned from OpenChat. There are currently 14 models that we’ve curated with Hugging Face to be available for serverless GPU inference powered by Cloudflare’s Workers AI platform, with more coming soon. These models are all served using Hugging Face’s technology with a <a href="https://github.com/huggingface/text-generation-inference/">TGI</a> backend, and we work closely with the Hugging Face team to curate, optimize, and deploy these models.</p><blockquote><p><i>“We are excited to work with Cloudflare to make AI more accessible to developers. Offering the most popular open models with a serverless API, powered by a global fleet of GPUs is an amazing proposition for the Hugging Face community, and I can’t wait to see what they build with it.”</i>- <b>Julien Chaumond</b>, Co-founder and CTO, Hugging Face</p></blockquote><p>You can find all of the open models supported in Workers AI in this <a href="https://huggingface.co/collections/Cloudflare/hf-curated-models-available-on-workers-ai-66036e7ad5064318b3e45db6">Hugging Face Collection</a>, and the “Deploy to Cloudflare Workers AI” button is at the top of each model card. To learn more, read Hugging Face’s <a href="http://huggingface.co/blog/cloudflare-workers-ai">blog post</a> and take a look at our <a href="https://developers.cloudflare.com/workers-ai/models/">Developer Docs</a> to get started. Have a model you want to see on Workers AI? Send us a message on <a href="https://discord.cloudflare.com">Discord</a> with your request.</p>
    <div>
      <h3>Supporting fine-tuned inference - BYO LoRAs</h3>
      <a href="#supporting-fine-tuned-inference-byo-loras">
        
      </a>
    </div>
    <p>Fine-tuned inference is one of our most requested features for Workers AI, and we’re one step closer now with Bring Your Own (BYO) LoRAs. Using the popular <a href="https://www.cloudflare.com/learning/ai/what-is-lora/">Low-Rank Adaptation</a> method, researchers have figured out how to take a model and adapt <i>some</i> model parameters to the task at hand, rather than rewriting <i>all</i> model parameters like you would for a fully fine-tuned model. This means that you can get fine-tuned model outputs without the computational expense of fully fine-tuning a model.</p><p>We now support bringing trained LoRAs to Workers AI, where we apply the LoRA adapter to a base model at runtime to give you fine-tuned inference, at a fraction of the cost, size, and speed of a fully fine-tuned model. In the future, we want to be able to support fine-tuning jobs and fully fine-tuned models directly on our platform, but we’re excited to be one step closer today with LoRAs.</p>
            <pre><code>const response = await ai.run(
  "@cf/mistralai/mistral-7b-instruct-v0.2-lora", //the model supporting LoRAs
  {
      messages: [{"role": "user", "content": "Hello world"],
      raw: true, //skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000", //the finetune id OR name 
  }
);</code></pre>
            <p>BYO LoRAs is in open beta as of today for the Gemma 2B and 7B, Llama 2 7B, and Mistral 7B models, with LoRA adapters up to 100MB in size, a maximum rank of 8, and up to 30 total LoRAs per account. As always, we expect you to use Workers AI and our new BYO LoRA feature with our <a href="https://www.cloudflare.com/service-specific-terms-developer-platform/#developer-platform-terms">Terms of Service</a> in mind, including any model-specific restrictions on use contained in the models’ license terms.</p><p>Read the technical deep dive blog post on <a href="/fine-tuned-inference-with-loras">fine-tuning with LoRA</a> and our <a href="https://developers.cloudflare.com/workers-ai/fine-tunes">developer docs</a> to get started.</p>
    <div>
      <h3>Write Workers in Python</h3>
      <a href="#write-workers-in-python">
        
      </a>
    </div>
    <p>Python is the second most popular programming language in the world (after JavaScript) and the language of choice for building AI applications. Starting today, in open beta, you can <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/">write Cloudflare Workers in Python</a>. Python Workers support all <a href="https://developers.cloudflare.com/workers/configuration/bindings/">bindings</a> to resources on Cloudflare, including <a href="https://developers.cloudflare.com/vectorize/">Vectorize</a>, <a href="https://developers.cloudflare.com/d1/">D1</a>, <a href="https://developers.cloudflare.com/kv/">KV</a>, <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a>, and more.</p><p><a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/packages/langchain/">LangChain</a> is the most popular framework for building LLM-powered applications, and just as <a href="/langchain-and-cloudflare">Workers AI works with langchain-js</a>, the <a href="https://python.langchain.com/docs/get_started/introduction">Python LangChain library</a> works on Python Workers, as do <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/packages/">other Python packages</a> like FastAPI.</p><p>Workers written in Python are just as simple as Workers written in JavaScript:</p>
            <pre><code>from js import Response

async def on_fetch(request, env):
    return Response.new("Hello world!")</code></pre>
            <p>…and are configured by simply pointing at a .py file in your <code>wrangler.toml</code>:</p>
            <pre><code>name = "hello-world-python-worker"
main = "src/entry.py"
compatibility_date = "2024-03-18"
compatibility_flags = ["python_workers"]</code></pre>
            <p>There are no extra toolchain or precompilation steps needed. The <a href="https://pyodide.org/en/stable/">Pyodide</a> Python execution environment is provided for you, directly by the Workers runtime, mirroring how Workers written in JavaScript already work.</p><p>There’s lots more to dive into — take a look at the <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/">docs</a>, and check out our <a href="/python-workers">companion blog post</a> for details about how Python Workers work behind the scenes.</p>
    <div>
      <h2>AI Gateway now supports Anthropic, Azure, AWS Bedrock, Google Vertex, and Perplexity</h2>
      <a href="#ai-gateway-now-supports-anthropic-azure-aws-bedrock-google-vertex-and-perplexity">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YaNMed9Aw4YjhbZk75SGI/a53b499b743e36ad2635357118e6623f/image4-2.png" />
            
            </figure><p>Our <a href="/announcing-ai-gateway">AI Gateway</a> product helps developers better control and observe their AI applications, with analytics, caching, rate limiting, and more. We are continuing to add more providers to the product, including Anthropic, Google Vertex, and Perplexity, which we’re excited to announce today. We quietly rolled out Azure and Amazon Bedrock support in December 2023, which means that the most popular providers are now supported via AI Gateway, including Workers AI itself.</p><p>Take a look at our <a href="https://developers.cloudflare.com/ai-gateway/">Developer Docs</a> to get started with AI Gateway.</p>
    <div>
      <h4>Coming soon: Persistent Logs</h4>
      <a href="#coming-soon-persistent-logs">
        
      </a>
    </div>
    <p>In Q2 of 2024, we will be adding persistent logs so that you can push your logs (including prompts and responses) to <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>, custom metadata so that you can tag requests with user IDs or other identifiers, and secrets management so that you can securely manage your application’s API keys.</p><p>We want AI Gateway to be the control plane for your AI applications, allowing developers to dynamically evaluate and route requests to different models and providers. With our persistent logs feature, we want to enable developers to use their logged data to fine-tune models in one click, eventually running the fine-tune job and the fine-tuned model directly on our Workers AI platform. AI Gateway is just one product in our AI toolkit, but we’re excited about the workflows and use cases it can unlock for developers building on our platform, and we hope you’re excited about it too.</p>
    <div>
      <h3>Vectorize metadata filtering and future GA of million vector indexes</h3>
      <a href="#vectorize-metadata-filtering-and-future-ga-of-million-vector-indexes">
        
      </a>
    </div>
    <p>Vectorize is another component of our toolkit for AI applications. In open beta since September 2023, Vectorize allows developers to persist embeddings (vectors), like those generated from Workers AI <a href="https://developers.cloudflare.com/workers-ai/models/#text-embeddings">text embedding</a> models, and query for the closest match to support use cases like similarity search or recommendations. Without a vector database, model output is forgotten and can’t be recalled without extra costs to re-run a model.</p><p>Since Vectorize’s open beta, we’ve added <a href="https://developers.cloudflare.com/vectorize/reference/metadata-filtering/">metadata filtering</a>. Metadata filtering lets developers combine vector search with filtering on arbitrary metadata, supporting the query complexity of AI applications. We’re laser-focused on getting Vectorize ready for general availability, with a target launch date of June 2024, which will include support for multi-million vector indexes.</p>
            <pre><code>// Insert vectors with metadata
const vectors: Array&lt;VectorizeVector&gt; = [
  {
    id: "1",
    values: [32.4, 74.1, 3.2],
    metadata: { url: "/products/sku/13913913", streaming_platform: "netflix" }
  },
  {
    id: "2",
    values: [15.1, 19.2, 15.8],
    metadata: { url: "/products/sku/10148191", streaming_platform: "hbo" }
  },
...
];
let upserted = await env.YOUR_INDEX.upsert(vectors);

// Query with metadata filtering
let metadataMatches = await env.YOUR_INDEX.query(&lt;queryVector&gt;, { filter: { streaming_platform: "netflix" }} )</code></pre>
            
    <div>
      <h3>The most comprehensive Developer Platform to build AI applications</h3>
      <a href="#the-most-comprehensive-developer-platform-to-build-ai-applications">
        
      </a>
    </div>
    <p>On Cloudflare’s Developer Platform, we believe that all developers should be able to quickly build and ship full-stack applications  – and that includes AI experiences as well. With our GA of Workers AI, announcements for Python support in Workers, AI Gateway, and Vectorize, and our partnership with Hugging Face, we’ve expanded the world of possibilities for what you can build with AI on our platform. We hope you are as excited as we are – take a look at all our <a href="https://developers.cloudflare.com">Developer Docs</a> to get started, and <a href="https://discord.cloudflare.com/">let us know</a> what you build.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[General Availability]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <guid isPermaLink="false">6ItPe1u2j71C4DTSxJdccB</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>Brendan Irvine-Broque</dc:creator>
            <dc:creator>Vy Ton</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Gen 12 Server: bigger, better, cooler in a 2U1N form factor]]></title>
            <link>https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor/</link>
            <pubDate>Fri, 01 Dec 2023 18:45:57 GMT</pubDate>
            <description><![CDATA[ Cloudflare Gen 12 Compute servers are moving to 2U1N form factor to optimize the thermal design to accommodate both high-power CPUs (>350W) and GPUs effectively while maintaining performance and reliability ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4k4pAqN8lhelYK7bHj3j2n/853ec5b25a2e62e08b85f611d0f76eb6/image5.png" />
            
            </figure><p>Two years ago, Cloudflare undertook a significant upgrade to our compute server hardware as we deployed our cutting-edge <a href="/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/">11th Generation server fleet</a>, based on AMD EPYC Milan x86 processors. It's nearly time for another refresh of our x86 infrastructure, with deployment planned for 2024. This involves upgrading not only the processor itself, but many of the server's components. It must be able to accommodate the GPUs that drive inference on <a href="/workers-ai/">Workers AI</a>, and leverage the latest advances in memory, storage, and security. Every aspect of the server is rigorously evaluated, including the server form factor itself.</p><p>One crucial variable always in consideration is temperature. The latest generations of x86 processors have yielded significant leaps forward in performance, with the tradeoff of higher power draw and heat output. In this post we will explore this trend, and how it informed our decision to adopt a new physical footprint for our next-generation fleet of servers.</p><p>In preparation for the upcoming refresh, we conducted an extensive survey of the x86 CPU landscape. AMD recently introduced its latest offerings: Genoa, Bergamo, and Genoa-X, featuring the power of their innovative Zen 4 architecture. At the same time, Intel unveiled Sapphire Rapids as part of its 4th Generation Intel Xeon Scalable Processor Platform, code-named “Eagle Stream”, showcasing their own advancements. These options offer valuable choices as we consider how to shape the future of Cloudflare's server technology to match the needs of our customers.</p><p>A continuing challenge we face across x86 CPU vendors, including the new Intel and AMD chipsets, is the rapidly increasing CPU Thermal Design Point (TDP) generation over generation. TDP is the maximum heat output, generated by the CPU under load, that a cooling system should be designed to dissipate; in practice, it also describes the maximum power consumption of the CPU socket. This plot shows the CPU TDP trend for each hardware server generation since 2014:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Ae6NrHId3uKhuS6mIhW7F/055e89fef13a710b3cc45607841de8ec/image4.png" />
            
            </figure><p>At Cloudflare, our <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9 server</a> was based on Intel Skylake 6162 with a TDP of 150W, our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">Gen 10 server</a> was based on AMD Rome 7642 at 240W, and our <a href="/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/">Gen 11 server</a> was based on AMD Milan 7713 at 240W. Today, the default TDP of the <a href="https://www.amd.com/system/files/documents/epyc-9004-series-processors-data-sheet.pdf">AMD EPYC 9004 Series SKU stack</a> goes up to 360W and is configurable up to 400W, while the default TDP of the <a href="https://ark.intel.com/content/www/us/en/ark/products/codename/126212/products-formerly-sapphire-rapids.html#@Server">Intel Sapphire Rapids SKU stack</a> goes up to 350W. This trend of rising TDP is expected to continue with the next generation of x86 CPU offerings.</p>
    <div>
      <h2>Designing multi-generational cooling solutions</h2>
      <a href="#designing-multi-generational-cooling-solutions">
        
      </a>
    </div>
    <p>Cloudflare Gen 10 and Gen 11 servers were designed in a 1U1N form factor, with air cooling to maximize rack density (1U means the server form factor is 1 rack unit, which is 1.75” in height or thickness; 1N means there is one server node per chassis). However, cooling a CPU with a TDP above 350W with air in a 1U1N form factor requires the fans to spin at a 100% duty cycle (running all the time, at max speed). A single fan running at full speed consumes about 40W, so a typical configuration of 7–8 dual rotor fans per server can draw 280–320W for the fans alone. At peak loads, the total system power consumed, including the cooling fans, processor, and other components, can eclipse 750 watts per server.</p><p>The 1U form factor can fit a maximum of eight 40mm dual rotor fans, which sets an upper bound on how much heat it can remove. We first take into account ambient room temperature, which we assume to be 40°C (the maximum expected temperature under normal conditions). Under these conditions, we determined that air-cooled 1U servers, with all eight fans running at a 100% duty cycle, can support CPUs with a maximum TDP of 400W.</p><p>This poses a challenge, because the next generation of AMD CPUs, while socket compatible with the current generation, rises to 500W TDP, and we expect other vendors to follow a similar trend in subsequent generations. In order to future-proof, and to re-use as much of the Gen 12 design as possible for future generations across all x86 CPU products, we will need a scalable thermal solution. Moreover, many co-location facilities where Cloudflare deploys servers have a rack power limit. With total system power consumption north of 750 watts per node, and after accounting for space utilized by networking gear, we would have been underutilizing rack space by as much as 50%.</p>
    <div>
      <h3>We have a problem!</h3>
      <a href="#we-have-a-problem">
        
      </a>
    </div>
    <p>We do have a variety of SKU options available on each CPU generation, and if power is the primary constraint, we could choose to limit the TDP and use a lower-core-count, low-power SKU. To evaluate this, the hardware team ran a synthetic workload benchmark in the lab across several CPU SKUs. We found that Cloudflare services continue to scale effectively with core count up to 128 cores or 256 hardware threads, resulting in significant performance gains, and Total Cost of Ownership (TCO) benefits, at and above 360W TDP.</p><p>However, while the performance and TCO metrics look good on a per-server basis, this is only part of the story: servers go into a server rack when they are deployed, and server racks come with constraints and limitations that have to be taken into design consideration. The two limiting factors are rack power budget and rack height. Taking these two rack-level constraints into account, how does the combined TCO benefit scale with TDP? We ran a performance sweep across the configurable TDP range of the highest-core-count CPUs and noticed that the rack-level TCO benefit stagnates when CPU TDP rises above roughly 340W.</p><p>The TCO advantage stagnates because we hit our rack power budget limit: the incremental performance gain per server from raising CPU TDP above 340W is negated by the reduction in the number of servers that can be installed in a rack while remaining within the rack’s power budget. Even with CPU TDP capped at 340W, we are still underutilizing the rack, with 30% of the space left unused.</p><p>Thankfully, there is an alternative to power capping and compromising on potential performance gains: increasing the chassis height to a 2U form factor (from 1.75” in height to 3.5” in height). The benefits of doing this include:</p><ul><li><p>Larger fans (up to 80mm) that can move more air</p></li><li><p>Allowing for a taller and larger heatsink that can dissipate heat more effectively</p></li><li><p>Less air impedance within the chassis, since the majority of components are 1U height</p></li><li><p>Providing sufficient room to add PCIe attached accelerators / GPUs, including dual-slot form factor options</p></li></ul><p>The 2U chassis design is nothing new; it is actually very common in the industry for various reasons, one of which is better airflow to dissipate more heat. It does come with the tradeoff of taking up more space and limiting the number of servers that can be installed in a rack, but since we are power constrained rather than space constrained, the tradeoff did not negatively impact our design.</p><p>Thermal simulations provided by Cloudflare vendors showed that 4x 60mm fans or 4x 80mm fans at less than 40 watts per fan are sufficient to cool the system. That is a theoretical savings of at least 150 watts compared to 8x 40mm fans in a 1U design, which results in significant Operational Expenditure (OPEX) savings and a boost to the TCO improvement. Switching to a 2U form factor also lets us fully utilize our rack power budget and our rack space, and provides ample room for the addition of PCIe attached accelerators / GPUs, including dual-slot form factor options.</p>
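    <p>The rack-level math is worth making explicit. Under a fixed rack power budget, shaving roughly 160W of fan power per node buys back more nodes than the extra rack unit costs. A quick sketch in Python; the rack power budget and usable height are illustrative placeholders, not figures for any specific Cloudflare facility:</p>
            <pre><code># Servers per rack under both a power budget and a space budget.
rack_power_w = 12_000  # per-rack power budget (illustrative)
rack_units = 42        # usable rack height in U (illustrative)

# 1U node: ~750 W total, including 8x 40 mm fans at ~40 W each.
# 2U node: same compute, but 4 larger fans save ~160 W of fan power.
node_1u_w, node_2u_w = 750, 750 - 160

fit_1u = min(rack_power_w // node_1u_w, rack_units // 1)
fit_2u = min(rack_power_w // node_2u_w, rack_units // 2)
print(f"1U nodes per rack: {fit_1u}")  # 16 -- power-bound, most of the rack empty
print(f"2U nodes per rack: {fit_2u}")  # 20 -- more nodes and a fuller rack</code></pre>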
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>It might seem counter-intuitive, but our observations indicate that growing the server chassis and utilizing more space per node actually increases rack density and improves the overall TCO benefit over previous-generation deployments, since it allows for a better thermal design. We are very happy with the result of this technical readiness investigation, and we are actively working on validating our Gen 12 Compute servers and launching them into production soon. Stay tuned for more details on our Gen 12 designs.</p><p>If you are excited about helping build a better Internet, come join us: <a href="https://www.cloudflare.com/careers/jobs/">we are hiring</a>!</p>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <guid isPermaLink="false">2aWmwl401SPH5eZNo2wHMg</guid>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
        </item>
    </channel>
</rss>