
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 05 Apr 2026 12:24:29 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Analysis of the EPYC 145% performance gain in Cloudflare Gen 12 servers]]></title>
            <link>https://blog.cloudflare.com/analysis-of-the-epyc-145-performance-gain-in-cloudflare-gen-12-servers/</link>
            <pubDate>Tue, 15 Oct 2024 15:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s Gen 12 server is the most powerful and power efficient server that we have deployed to date. Through sensitivity analysis, we found that Cloudflare workloads continue to scale with higher core count and higher CPU frequency, as well as achieving a significant boost in performance with larger L3 cache per core. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's <a href="https://www.cloudflare.com/network"><u>network</u></a> spans more than 330 cities in over 120 countries, serving over 60 million HTTP requests per second and 39 million DNS queries per second on average. These numbers will continue to grow, and at an accelerating pace, as will Cloudflare’s infrastructure to support them. While we can continue to scale out by deploying more servers, it is also paramount for us to develop and deploy more performant and more efficient servers.</p><p>At the heart of each server is the processor (central processing unit, or CPU). Even though many aspects of a server rack can be redesigned to improve the cost to serve a request, CPU remains the biggest lever, as it is typically the primary compute resource in a server, and the primary enabler of new technologies.</p><p><a href="https://blog.cloudflare.com/gen-12-servers/"><u>Cloudflare’s 12th Generation server with AMD EPYC 9684-X (codenamed Genoa-X) is 145% more performant and 63% more efficient</u></a>. These are big numbers, but where do the performance gains come from? Cloudflare’s hardware system engineering team did a sensitivity analysis on three variants of 4th generation AMD EPYC processor to understand the contributing factors.</p><p>For the 4th generation AMD EPYC Processors, AMD offers three architectural variants: </p><ol><li><p>mainstream classic Zen 4 cores, codenamed Genoa</p></li><li><p>efficiency optimized dense Zen 4c cores, codenamed Bergamo</p></li><li><p>cache optimized Zen 4 cores with 3D V-cache, codenamed Genoa-X</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hb5yDJksMIwbWVQcCzuoC/c085edca54d3820f7e93564791b0289a/image6.png" />
            
            </figure><p><sup>Figure 1 (from left to right): AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), AMD EPYC 9684X (Genoa-X)</sup></p><p>Key features common across the 4th Generation AMD EPYC processors:</p><ul><li><p>Up to 12x Core Complex Dies (CCDs)</p></li><li><p>Each core has a private 1MB L2 cache</p></li><li><p>The CCDs connect to memory, I/O, and each other through an I/O die</p></li><li><p>Configurable Thermal Design Power (cTDP) up to 400W</p></li><li><p>Support up to 12 channels of DDR5-4800 1DPC</p></li><li><p>Support up to 128 lanes PCIe Gen 5</p></li></ul><p>Classic Zen 4 Cores (Genoa):</p><ul><li><p>Each Core Complex (CCX) has 8x Zen 4 Cores (16x Threads)</p></li><li><p>Each CCX has a shared 32 MB L3 cache (4 MB/core)</p></li><li><p>Each CCD has 1x CCX</p></li></ul><p>Dense Zen 4c Cores (Bergamo):</p><ul><li><p>Each CCX has 8x Zen 4c Cores (16x Threads)</p></li><li><p>Each CCX has a shared 16 MB L3 cache (2 MB/core)</p></li><li><p>Each CCD has 2x CCX</p></li></ul><p>Classic Zen 4 Cores with 3D V-cache (Genoa-X):</p><ul><li><p>Each CCX has 8x Zen 4 Cores (16x Threads)</p></li><li><p>Each CCX has a shared 96MB L3 cache (12 MB/core)</p></li><li><p>Each CCD has 1x CCX</p></li></ul><p>For more information on 4th generation AMD EPYC Processors architecture, see: <a href="https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf"><u>https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf</u></a> </p><p>The following table is a summary of the specification of the AMD EPYC 7713 CPU in our <a href="https://blog.cloudflare.com/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/"><u>Gen 11 server</u></a> against the three CPU candidates, one from each variant of the 4th generation AMD EPYC Processors architecture:</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>CPU Model</strong></span></span></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 7713</u></strong></span></span></a></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 9654</u></strong></span></span></a></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 9754</u></strong></span></span></a></p>
                    </td>
                    <td>
                        <p><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span><span><strong><u>AMD EPYC 9684X</u></strong></span></span></a></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Series</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>Milan</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Genoa</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Bergamo</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Genoa-X</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong># of CPU Cores</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>64</span></span></p>
                    </td>
                    <td>
                        <p><span><span>96</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>128</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>96</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong># of Threads</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>128</span></span></p>
                    </td>
                    <td>
                        <p><span><span>192</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>256</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>192</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Base Clock</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.0 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.4 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.25 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.4 GHz</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>All Core Boost Clock</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>~2.7 GHz*</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>3.55 GHz</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.1 GHz</span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.42 GHz</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Total L3 Cache</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>256 MB</span></span></p>
                    </td>
                    <td>
                        <p><span><span>384 MB</span></span></p>
                    </td>
                    <td>
                        <p><span><span>256 MB</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1152 MB</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>L3 cache per core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>4 MB / core</span></span></p>
                    </td>
                    <td>
                        <p><span><span>4 MB / core</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2 MB / core</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>12 MB / core</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Maximum configurable TDP</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>240W</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>400W</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>400W</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>400W</strong></span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p><sup>* AMD EPYC 7713 all core boost clock is based on Cloudflare production data, not the official specification from AMD</sup></p>
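<p>As a quick sanity check, the Total L3 figures in the table follow directly from the per-CCX cache sizes listed earlier, with CCX counts inferred from the core counts (8 cores per CCX):</p>

```python
# Derive Total L3 from the per-CCX figures given above.
# CCX counts are inferred from core counts (8 cores per CCX).

def total_l3_mb(cores, l3_per_ccx_mb, cores_per_ccx=8):
    ccx = cores // cores_per_ccx
    return ccx * l3_per_ccx_mb

genoa   = total_l3_mb(96, 32)    # Zen 4:  32 MB per CCX
bergamo = total_l3_mb(128, 16)   # Zen 4c: 16 MB per CCX
genoa_x = total_l3_mb(96, 96)    # Zen 4 + 3D V-cache: 96 MB per CCX

print(genoa, bergamo, genoa_x)   # 384 256 1152, matching the spec table
```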
    <div>
      <h2>cf_benchmark</h2>
      <a href="#cf_benchmark">
        
      </a>
    </div>
    <p>Readers may remember that Cloudflare introduced <a href="https://github.com/cloudflare/cf_benchmark"><u>cf_benchmark</u></a> when we evaluated <a href="https://blog.cloudflare.com/arm-takes-wing/"><u>Qualcomm's ARM chips</u></a>, using it as our first pass benchmark to shortlist <a href="https://blog.cloudflare.com/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/"><u>AMD’s Rome CPU for our Gen 10 servers</u></a> and to evaluate <a href="https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-aws-graviton2/"><u>our chosen ARM CPU Ampere Altra Max against AWS Graviton 2</u></a>. Likewise, we ran cf_benchmark against the three candidate CPUs for our 12th Gen servers: AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), and AMD EPYC 9684X (Genoa-X). The majority of cf_benchmark workloads are compute bound, and given more cores or higher CPU frequency, they score better. The graph and the table below show the benchmark performance comparison of the three CPU candidates with Genoa 9654 as the baseline, where &gt; 1.00x indicates better performance.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YNQ8YB5XV3fZEVpVFT7lM/2dadd7fda9832716f198dee2d5fbfd22/image5.png" />
          </figure><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Genoa 9654 (baseline)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo 9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X 9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>openssl_pki</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.16x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.01x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>openssl_aead</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.20x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.01x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>luajit</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.86x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>brotli</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.11x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.98x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>gzip</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.87x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.01x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>go</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.09x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>Bergamo 9754 with 128 cores scores better in the openssl_pki, openssl_aead, brotli, and go benchmark suites, and less favorably in the luajit and gzip suites. Genoa-X 9684X, despite its much larger L3 cache, doesn’t offer a significant performance boost on these compute-bound benchmarks.</p><p>These benchmarks are representative of some of the common workloads Cloudflare runs, and are useful for identifying software scaling issues, system configuration bottlenecks, and the impact of CPU design choices on workload-specific performance. However, the benchmark suite is not an exhaustive list of the workloads Cloudflare runs in production, and in reality the benchmarked workloads are almost never the exclusive workload running on a production CPU. In short, though benchmark results can be informative, they are not a good indicator of production performance when a mix of these workloads runs on the same processor.</p>
    <div>
      <h2>Performance simulation</h2>
      <a href="#performance-simulation">
        
      </a>
    </div>
    <p>To get an early indication of production performance, Cloudflare has an internal performance simulation tool that exercises our software stack to fetch a fixed asset repeatedly. The simulation tool can be configured to fetch a specified fixed-size asset and configured to include or exclude services like WAF or Workers in the request path. Below, we show the simulated performance between the three CPUs for an asset size of 10 KB, where &gt;1.00x indicates better performance.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Milan 7713</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa 9654</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo 9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X 9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Lab simulation performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.20x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.95x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>Based on these results, Bergamo 9754, which has the highest core count but the smallest L3 cache per core, is the least performant of the three candidates, followed by Genoa 9654. Genoa-X 9684X, with the largest L3 cache per core, is the most performant. This data suggests that our software stack is very sensitive to L3 cache size, in addition to core count and CPU frequency. That is interesting enough to warrant a deeper dive: a sensitivity analysis of our workload against a few high-level CPU design points, namely core scaling, frequency scaling, and L2/L3 cache size scaling.</p>
    <div>
      <h2>Sensitivity analysis</h2>
      <a href="#sensitivity-analysis">
        
      </a>
    </div>
    
    <div>
      <h3>Core sensitivity</h3>
      <a href="#core-sensitivity">
        
      </a>
    </div>
<p>Number of cores is the headline specification that practically everyone talks about, and one of the easiest improvements CPU vendors can make to increase performance per socket. The AMD Genoa 9654 has 96 cores, 50% more than the 64 cores available on the AMD Milan 7713 CPUs that we used in our Gen 11 servers. Is more always better? Does Cloudflare’s primary workload scale with core count and effectively utilize all available cores?</p><p>The figure and table below show the result of a core scaling experiment on an AMD Genoa 9654 configured with 96, 80, 64, and 48 cores, achieved by incrementally disabling 2x CCDs (8 cores/CCD) at each step. The result is great: Cloudflare’s simulated primary workload scales linearly with core count on AMD Genoa CPUs.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/f1w9JxhP8aLIoFONMq5tr/b3269fc121b9bfd8f4d9a3394a73c599/image4.png" />
          </figure><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Core count</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Core increase</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Performance increase</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>48</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>64</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.33x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.39x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>80</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.67x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.71x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>96</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.05x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
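<p>One way to see the linearity in the table above is to divide each measured speedup by the corresponding core-count ratio; an efficiency of 1.00 means perfectly linear scaling:</p>

```python
# Core-scaling results from the table above: (cores, measured speedup vs 48).
results = [(48, 1.00), (64, 1.39), (80, 1.71), (96, 2.05)]

for cores, perf in results:
    core_ratio = cores / 48
    efficiency = perf / core_ratio   # 1.00 = perfectly linear scaling
    print(f"{cores:3d} cores: {core_ratio:.2f}x cores -> {perf:.2f}x perf, "
          f"efficiency {efficiency:.2f}")
```

Every step lands at or slightly above 1.00, confirming linear (or marginally better) scaling across the measured range.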
    <div>
      <h3>TDP sensitivity</h3>
      <a href="#tdp-sensitivity">
        
      </a>
    </div>
<p><a href="https://blog.cloudflare.com/thermal-design-supporting-gen-12-hardware-cool-efficient-and-reliable/"><u>Thermal Design Power (TDP) is the maximum amount of heat generated by a CPU that the cooling system is designed to dissipate</u></a>, but it more commonly refers to the processor’s power consumption under maximum theoretical load. The AMD Genoa 9654’s default TDP is 360W, but it can be configured up to 400W. Is more always better? Does Cloudflare continue to see meaningful performance improvement up to 400W, or does performance stagnate at some point?</p><p>The chart below shows the result of sweeping the TDP of the AMD Genoa 9654 (in power determinism mode) from 240W to 400W. (Note: the x-axis step size is not linear.)</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/216pSnjCDQOLwtRS822ugu/17dc1885d34ae10a8a1cecb6584b7dd7/image3.png" />
          </figure><p>Cloudflare’s simulated primary workload continues to see incremental performance improvements up to the maximum configurable 400W, albeit at a less favorable perf/watt ratio.</p><p>Looking at TDP sensitivity data is a quick and easy way to identify if performance stagnates at some power point, but what does power sensitivity actually measure? There are several factors contributing to CPU power consumption, but let's focus on one of the primary factors: dynamic power consumption. Dynamic power consumption is approximately <i>CV</i><i><sup>2</sup></i><i>f</i>, where C is the switched load capacitance, V is the regulated voltage, and f is the frequency. In modern processors like the AMD Genoa 9654, the CPU dynamically scales its voltage along with frequency, so theoretically, CPU dynamic power is loosely proportional to f<sup>3</sup>. In other words, measuring TDP sensitivity is measuring the frequency sensitivity of a workload. Does the data agree? Yes!</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>cTDP</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>All core boost frequency (GHz)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Perf (rps) / baseline</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>240</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.47</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.78x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>280</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.87x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>320</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.93</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.93x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>340</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.13</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.97x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>360</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>380</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.4</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.03x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>390</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.465</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.04x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>400</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>3.55</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.05x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
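<p>The claim that a TDP sweep is really a frequency sweep can be checked directly against the table above: at every power point, the performance ratio tracks the all-core frequency ratio to within a few percent:</p>

```python
# cTDP sweep from the table above: (cTDP W, all-core GHz, perf vs 360 W).
sweep = [(240, 2.47, 0.78), (280, 2.75, 0.87), (320, 2.93, 0.93),
         (340, 3.13, 0.97), (360, 3.30, 1.00), (380, 3.40, 1.03),
         (390, 3.465, 1.04), (400, 3.55, 1.05)]

for tdp, freq, perf in sweep:
    freq_ratio = freq / 3.30   # relative to the 360 W baseline frequency
    print(f"{tdp} W: frequency ratio {freq_ratio:.3f}, perf ratio {perf:.2f}")
```

The largest gap between the two ratios in this data is about 4%, consistent with performance being dominated by frequency across the cTDP range.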
    <div>
      <h3>Frequency sensitivity</h3>
      <a href="#frequency-sensitivity">
        
      </a>
    </div>
    <p>Instead of relying on an indirect measure through the TDP, let’s measure frequency sensitivity directly by sweeping the maximum boost frequency.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2oobymubOiZrpJjSExQxxo/a02985506ecc803e8b09d7f999b86c55/image2.png" />
</figure><p>Above 3 GHz, the data shows that Cloudflare’s primary workload gains roughly 2% for every 0.1 GHz increase in all-core average frequency. We hit the 400W power cap at 3.545 GHz. This is notably higher than the ~2.7 GHz all-core boost that Cloudflare Gen 11 servers with AMD Milan 7713 typically see in production (or 2.4 GHz in our performance simulation), which is amazing!</p>
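<p>The ~2% per 0.1 GHz rule of thumb can be written as a simple linear model relative to the 3.3 GHz baseline. This is an illustration of the rule, not the actual fit used internally, and it reproduces the cTDP table in the previous section reasonably well:</p>

```python
# Rule of thumb from the text: ~2% performance per 0.1 GHz above 3 GHz,
# expressed relative to the 3.3 GHz (360 W) baseline. Illustrative only.
def est_perf(freq_ghz, base_ghz=3.3):
    return 1.0 + 0.02 * (freq_ghz - base_ghz) / 0.1

print(round(est_perf(3.55), 2))  # ~1.05, in line with the 400 W row
print(round(est_perf(3.13), 2))  # ~0.97, in line with the 340 W row
```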
    <div>
      <h3>L3 cache size sensitivity</h3>
      <a href="#l3-cache-size-sensitivity">
        
      </a>
    </div>
<p>What about L3 cache size sensitivity? L3 cache size is one of the primary design choices and major differences between the trio of Genoa, Bergamo, and Genoa-X: Genoa 9654 has 4 MB L3/core, Bergamo 9754 has 2 MB L3/core, and Genoa-X 9684X has 12 MB L3/core. L3 cache is the last and largest on-chip “memory” bank before accesses spill out to the memory on DIMMs outside the chip, which costs significantly more CPU cycles.</p><p>We ran an experiment on the Genoa 9654 to check how performance scales with L3 cache size. L3 cache per core is reduced through MSR writes (this could also be done using <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/use-intel-resource-director-technology-to-allocate-last-level-cache-llc.htm"><u>Intel RDT</u></a>), and increased by disabling physical cores in a CCD, which reduces the number of cores sharing the fixed-size 32 MB L3 cache per CCD, effectively growing the L3 cache per core. Below is the result of the experiment, where &gt;1.00x indicates better performance:</p><div>
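<p>For readers who want to run a cache-restriction experiment like this themselves, the generic Linux interface for cache allocation (covering Intel RDT and AMD’s equivalent QoS extensions) is resctrl, whose schemata file takes a contiguous bitmask of L3 ways. A minimal sketch of computing such a mask, assuming a 16-way L3 (the actual way count is hardware-specific):</p>

```python
# Sketch: build the contiguous way bitmask that the Linux resctrl
# schemata file expects, to cap a group at a fraction of L3 ways.
# Assumes a 16-way L3; actual way counts are hardware-specific.
def l3_way_mask(fraction, total_ways=16):
    ways = max(1, round(total_ways * fraction))  # at least one way
    return (1 << ways) - 1                       # contiguous low bits

print(hex(l3_way_mask(0.25)))  # 0xf  -> a schemata line like "L3:0=000f"
print(hex(l3_way_mask(0.5)))   # 0xff
```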
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>L3 cache size increase vs baseline 4MB per core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>0.25x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>0.5x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>0.75x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1.14x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1.33x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>1.60x</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>2.00x</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>rps/core / baseline</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.67x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.78x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.89x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.00x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.08x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.15x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.25x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.31x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>L3 cache miss rate per CCD</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>56.04%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>39.15%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>30.37%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>23.55%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>22.39%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>19.73%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>16.94%</span></span></p>
                    </td>
                    <td>
                        <p><span><span>14.28%</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
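As a quick sanity check, the table’s numbers can be replayed in a few lines (a sketch; the data is copied from the table above, and the variable names are ours):

```python
# Data from the L3 cache sweep on Genoa 9654: cache size ratio vs. the
# 4 MB/core baseline, normalized rps/core, and L3 cache miss rate per CCD.
cache_ratio = [0.25, 0.50, 0.75, 1.00, 1.14, 1.33, 1.60, 2.00]
rps         = [0.67, 0.78, 0.89, 1.00, 1.08, 1.15, 1.25, 1.31]
miss_rate   = [0.5604, 0.3915, 0.3037, 0.2355, 0.2239, 0.1973, 0.1694, 0.1428]

# Performance rises monotonically as the miss rate falls...
for i in range(1, len(rps)):
    assert rps[i] > rps[i - 1] and miss_rate[i] < miss_rate[i - 1]

# ...but the marginal gain per unit of cache-size ratio shrinks toward 2x.
slope_near_1x = (rps[3] - rps[2]) / (cache_ratio[3] - cache_ratio[2])  # ~0.44
slope_near_2x = (rps[7] - rps[6]) / (cache_ratio[7] - cache_ratio[6])  # ~0.15
assert slope_near_2x < slope_near_1x
```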
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/cjfPuhUHBfABvQ5hJGchb/e470ca9bb45b0c69b5069556ce866647/image7.png" />
          </figure><p>Even though we expected faster DDR5 and larger memory bandwidth to diminish the impact of L3 cache size, Cloudflare’s simulated primary workload remains quite sensitive to it. The L3 cache miss rate dropped from 56% with only 1 MB of L3 per core to 14.28% with 8 MB of L3 per core. Changing the L3 cache size by 25% changes performance by approximately 11%, and performance continues to improve up to 2x L3 cache per core, though with diminishing returns as we approach 2x.</p><p>Do we see the same behavior when comparing Genoa 9654, Bergamo 9754, and Genoa-X 9684X? We ran an experiment comparing the impact of L3 cache size, controlling for core count and all-core boost frequency, and again saw significant deltas. Halving the L3 cache from 4 MB/core to 2 MB/core reduces performance by 24%, roughly matching the experiment above. However, tripling the cache from 4 MB/core to 12 MB/core only increases performance by 25%, less than the earlier experiment would indicate. This is likely because part of the gain in that experiment came from reduced cache contention, since cache size was increased by disabling cores in our test setup. Nevertheless, these are significant deltas!</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>L3/core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>2MB/core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>4MB/core</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>12MB/core</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Perf (rps) / baseline</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.76x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.25x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
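To illustrate that shortfall, one can fit a simple power law to the cross-SKU halving data point and extrapolate to Genoa-X’s 3x cache (this model is our own assumption, not how the experiments were analyzed):

```python
import math

# Fit perf = ratio**k to the halving data point (2 MB/core -> 0.76x
# relative to the 4 MB/core baseline), then extrapolate to 3x cache.
k = math.log(0.76) / math.log(0.5)   # ~0.396
predicted_3x = 3 ** k                # ~1.54x, what the fit would predict
measured_3x = 1.25                   # Genoa-X 9684X, from the table above
assert predicted_3x > measured_3x    # the measured gain falls short of the fit
```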
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>The table below summarizes how each factor from the sensitivity analysis above contributes to the overall performance gain. An additional 6% to 14% of performance improvement is contributed by other factors, such as larger L2 cache, higher memory bandwidth, and miscellaneous CPU architecture changes that improve IPC.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Milan</strong></span></span></p>
                        <p><span><span><strong>7713</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa</strong></span></span></p>
                        <p><span><span><strong>9654</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo</strong></span></span></p>
                        <p><span><span><strong>9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X</strong></span></span></p>
                        <p><span><span><strong>9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Lab simulation performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.95x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to Core scaling</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.5x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.5x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to Frequency scaling</strong></span></span></p>
                        <p><span><span><strong>(*Note: Milan 7713 all core frequency is ~2.4GHz when running simulated workload at 100% CPU utilization)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.32x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.21x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.29x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to L3 cache size scaling</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>0.76x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.25x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Performance multiplier due to other factors like larger L2 cache, higher memory bandwidth, miscellaneous CPU architecture changes that improve IPC</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.11x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.06x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.14x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
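The multipliers in the table compose multiplicatively; a quick cross-check (a sketch; the dictionary layout is ours, the values are from the table) confirms the factor breakdown reproduces each CPU’s overall lab simulation number:

```python
# Per-CPU factors: (core scaling, frequency scaling, L3 scaling, other).
factors = {
    "Genoa 9654":    (1.5, 1.32, 1.00, 1.11),
    "Bergamo 9754":  (2.0, 1.21, 0.76, 1.06),
    "Genoa-X 9684X": (1.5, 1.29, 1.25, 1.14),
}
lab_sim = {"Genoa 9654": 2.2, "Bergamo 9754": 1.95, "Genoa-X 9684X": 2.75}

for cpu, fs in factors.items():
    product = 1.0
    for f in fs:
        product *= f
    # Each product lands within 1% of the overall lab simulation multiplier.
    assert abs(product - lab_sim[cpu]) / lab_sim[cpu] < 0.01
```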
    <div>
      <h2>Performance evaluation in production</h2>
      <a href="#performance-evaluation-in-production">
        
      </a>
    </div>
    <p>How do these CPU candidates perform with real-world traffic and an actual production workload mix? The table below summarizes the performance of the three CPUs in lab simulation and in production. Genoa-X 9684X continues to outperform in production.</p><p>In addition, the Gen 12 server equipped with Genoa-X offered outstanding performance while consuming only 1.5x the per-system power of our Gen 11 server with Milan 7713. In other words, we see a 63% increase in performance per watt. Genoa-X 9684X provides the best TCO improvement among the three options, and was ultimately chosen as the CPU for our Gen 12 server.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Milan 7713</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa 9654</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Bergamo 9754</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Genoa-X 9684X</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Lab simulation performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.95x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.75x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Production performance multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.15x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>2.45x</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Production performance per watt multiplier</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>1x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.33x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.38x</span></span></p>
                    </td>
                    <td>
                        <p><span><span>1.63x</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
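The efficiency row follows directly from the other two; as back-of-the-envelope arithmetic (a sketch, with the numbers taken from the table above):

```python
# Genoa-X delivers 2.45x the requests per second of Milan 7713 while the
# system draws 1.5x the power, yielding the 1.63x perf-per-watt multiplier.
perf_ratio = 2.45
power_ratio = 1.5
perf_per_watt = perf_ratio / power_ratio
print(round(perf_per_watt, 2))  # 1.63, i.e. a 63% efficiency gain
```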
</div><p>The Gen 12 server with AMD Genoa-X 9684X is the most powerful and the most power efficient server Cloudflare has built to date. It serves as the underlying platform for all the incredible services that Cloudflare offers to our customers globally, and will help power the growth of Cloudflare infrastructure for the next several years with improved cost structure. </p><p>Hardware engineers at Cloudflare work closely with our infrastructure engineering partners and externally with our vendors to design and develop world-class servers to best serve our customers. </p><p><a href="https://www.cloudflare.com/careers/jobs/"><u>Come join us</u></a> at Cloudflare to help build a better Internet!</p> ]]></content:encoded>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <guid isPermaLink="false">1Sj8x3oCQRSZqU8oNgMmkE</guid>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare’s 12th Generation servers — 145% more performant and 63% more efficient]]></title>
            <link>https://blog.cloudflare.com/gen-12-servers/</link>
            <pubDate>Wed, 25 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is thrilled to announce the general deployment of our next generation of server — Gen 12 powered by AMD Genoa-X processors. This new generation of server focuses on delivering exceptional performance across all Cloudflare services, enhanced support for AI/ML workloads, significant strides in power efficiency, and improved security features. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare is thrilled to announce the general deployment of our next generation of servers — Gen 12 powered by AMD EPYC 9684X (code name “Genoa-X”) processors. This next generation focuses on delivering exceptional performance across all Cloudflare services, enhanced support for AI/ML workloads, significant strides in power efficiency, and improved security features.</p><p>Here are some key performance indicators and feature improvements that this generation delivers as compared to the <a href="https://blog.cloudflare.com/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/"><u>prior generation</u></a>: </p><p>Beginning with performance, with close engineering collaboration between Cloudflare and AMD on optimization, Gen 12 servers can serve more than twice as many requests per second (RPS) as Gen 11 servers, resulting in lower Cloudflare infrastructure build-out costs.</p><p>Next, our power efficiency has improved significantly, by more than 60% in RPS per watt as compared to the prior generation. As Cloudflare continues to expand our infrastructure footprint, the improved efficiency helps reduce Cloudflare’s operational expenditure and carbon footprint as a percentage of our fleet size.</p><p>Third, in response to the growing demand for AI capabilities, we've updated the thermal-mechanical design of our Gen 12 server to support more powerful GPUs. This aligns with the <a href="https://www.cloudflare.com/lp/pg-ai/?utm_medium=cpc&amp;utm_source=google&amp;utm_campaign=2023-q4-acq-gbl-developers-wo-ge-general-paygo_mlt_all_g_search_bg_exp__dev&amp;utm_content=workers-ai&amp;gad_source=1&amp;gclid=CjwKCAjwl6-3BhBWEiwApN6_kjigJdDvEYqHPYi8tdXuTe4APbqX923v-CBjpGiAVwITNhp8GrW3ARoCyJ4QAvD_BwE&amp;gclsrc=aw.ds"><u>Workers AI</u></a> objective to support larger large language models and increase throughput for smaller models. 
This enhancement underscores our ongoing commitment to advancing AI inference capabilities.</p><p>Fourth, reflecting our security-first position as a company, we've integrated hardware <a href="https://trustedcomputinggroup.org/about/what-is-a-root-of-trust-rot/"><u>root of trust</u></a> (HRoT) capabilities to ensure the integrity of boot firmware and board management controller firmware. Continuing to embrace open standards, the baseboard management and security controller (Data Center Secure Control Module or <a href="https://drive.google.com/file/d/13BxuseSrKo647hjIXjp087ei8l5QQVb0/view"><u>OCP DC-SCM</u></a>) that we’ve designed into our systems is modular and vendor-agnostic, enabling a unified <a href="https://www.openbmc.org/"><u>openBMC</u></a> image, quicker prototyping, and reuse.</p><p>Finally, given the increasing importance of supply assurance and reliability in infrastructure deployments, our approach includes a robust multi-vendor strategy to mitigate supply chain risks, ensuring continuity and resiliency of our infrastructure deployment.</p><p>Cloudflare is dedicated to constantly improving our server fleet, empowering businesses worldwide with enhanced performance, efficiency, and security.</p>
    <div>
      <h2>Gen 12 Servers </h2>
      <a href="#gen-12-servers">
        
      </a>
    </div>
    <p>Let's take a closer look at our Gen 12 server. The server is powered by a 4th generation AMD EPYC Processor, paired with 384 GB of DDR5 RAM, 16 TB of NVMe storage, a dual-port 25 GbE NIC, and two 800 watt power supply units.</p>
<div><table><thead>
  <tr>
    <th><span>Generation</span></th>
    <th><span>Gen 12 Compute</span></th>
    <th><span>Previous Gen 11 Compute</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Form Factor</span></td>
    <td><span>2U1N - Single socket</span></td>
    <td><span>1U1N - Single socket</span></td>
  </tr>
  <tr>
    <td><span>Processor</span></td>
    <td><span>AMD EPYC 9684X Genoa-X 96-Core Processor</span></td>
    <td><span>AMD EPYC 7713 Milan 64-Core Processor</span></td>
  </tr>
  <tr>
    <td><span>Memory</span></td>
    <td><span>384GB of DDR5-4800</span><br /><span>x12 memory channel</span></td>
    <td><span>384GB of DDR4-3200</span><br /><span>x8 memory channel</span></td>
  </tr>
  <tr>
    <td><span>Storage</span></td>
    <td><span>x2 E1.S NVMe</span><br /><span>Samsung PM9A3 7.68TB / Micron 7450 Pro 7.68TB</span></td>
    <td><span>x2 M.2 NVMe</span><br /><span>Samsung PM9A3 1.92TB</span></td>
  </tr>
  <tr>
    <td><span>Network</span></td>
    <td><span>Dual 25 GbE OCP 3.0</span><br /><span>Intel Ethernet Network Adapter E810-XXVDA2 / NVIDIA Mellanox ConnectX-6 Lx</span></td>
    <td><span>Dual 25 GbE OCP 2.0</span><br /><span>Mellanox ConnectX-4 dual-port 25G</span></td>
  </tr>
  <tr>
    <td><span>System Management</span></td>
    <td><span>DC-SCM 2.0</span><br /><span>ASPEED AST2600 (BMC) + AST1060 (HRoT)</span></td>
    <td><span>ASPEED AST2500 (BMC)</span></td>
  </tr>
  <tr>
    <td><span>Power Supply</span></td>
    <td><span>800W - Titanium Grade</span></td>
    <td><span>650W - Titanium Grade</span></td>
  </tr>
</tbody></table></div>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ywinOgSFpevEcQSZLhESv/b61d70a1504b4d873d0bbf2e83221bf6/BLOG-2116_2.png" />
          </figure><p><sup><i>Cloudflare Gen 12 server</i></sup></p>
    <div>
      <h3>CPU</h3>
      <a href="#cpu">
        
      </a>
    </div>
    <p>During the design phase, we conducted an extensive survey of the CPU landscape, weighing the options as we considered how to shape the future of Cloudflare's server technology to match the needs of our customers. We evaluated many candidates in the lab, and short-listed three standout CPU candidates from the 4th generation AMD EPYC Processor lineup for production evaluation: Genoa 9654, Bergamo 9754, and Genoa-X 9684X. The table below summarizes the differences in <a href="https://www.amd.com/content/dam/amd/en/documents/products/epyc/epyc-9004-series-processors-data-sheet.pdf"><u>specifications</u></a> of the short-listed candidates for Gen 12 servers against the AMD EPYC 7713 used in our Gen 11 servers. Notably, all three candidates offer a significant increase in core count and a marked increase in all-core boost clock frequency.</p>
<div><table><thead>
  <tr>
    <th><span>CPU Model</span></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 7713</span></a></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 9654</span></a></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 9754</span></a></th>
    <th><a href="https://www.amd.com/en/products/specifications/server-processor.html"><span>AMD EPYC 9684X</span></a></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Series</span></td>
    <td><span>Milan</span></td>
    <td><span>Genoa</span></td>
    <td><span>Bergamo</span></td>
    <td><span>Genoa-X</span></td>
  </tr>
  <tr>
    <td><span># of CPU Cores</span></td>
    <td><span>64</span></td>
    <td><span>96</span></td>
    <td><span>128</span></td>
    <td><span>96</span></td>
  </tr>
  <tr>
    <td><span># of Threads</span></td>
    <td><span>128</span></td>
    <td><span>192</span></td>
    <td><span>256</span></td>
    <td><span>192</span></td>
  </tr>
  <tr>
    <td><span>Base Clock</span></td>
    <td><span>2.0 GHz</span></td>
    <td><span>2.4 GHz</span></td>
    <td><span>2.25 GHz</span></td>
    <td><span>2.4 GHz</span></td>
  </tr>
  <tr>
    <td><span>Max Boost Clock</span></td>
    <td><span>3.67 GHz</span></td>
    <td><span>3.7 GHz</span></td>
    <td><span>3.1 GHz</span></td>
    <td><span>3.7 GHz</span></td>
  </tr>
  <tr>
    <td><span>All Core Boost Clock</span></td>
    <td><span>2.7 GHz *</span></td>
    <td><span>3.55 GHz</span></td>
    <td><span>3.1 GHz</span></td>
    <td><span>3.42 GHz</span></td>
  </tr>
  <tr>
    <td><span>Total L3 Cache</span></td>
    <td><span>256 MB</span></td>
    <td><span>384 MB</span></td>
    <td><span>256 MB</span></td>
    <td><span>1152 MB</span></td>
  </tr>
  <tr>
    <td><span>L3 cache per core</span></td>
    <td><span>4MB / core</span></td>
    <td><span>4MB / core</span></td>
    <td><span>2MB / core</span></td>
    <td><span>12MB / core</span></td>
  </tr>
  <tr>
    <td><span>Maximum configurable TDP</span></td>
    <td><span>240W</span></td>
    <td><span>400W</span></td>
    <td><span>400W</span></td>
    <td><span>400W</span></td>
  </tr>
</tbody></table></div><p><sub>*Note: AMD EPYC 7713 all core boost clock frequency of 2.7 GHz is not an official specification of the CPU, but is based on data collected from the Cloudflare production fleet.</sub></p><p>During production evaluation, the configurations of all three CPUs were optimized to the best of our knowledge, including thermal design power (TDP) configured to 400W for maximum performance. The servers were set up to run the same processes and services as any other server we have in production, which makes for a great side-by-side comparison.</p>
<div><table><thead>
  <tr>
    <th></th>
    <th><span>Milan 7713</span></th>
    <th><span>Genoa 9654</span></th>
    <th><span>Bergamo 9754</span></th>
    <th><span>Genoa-X 9684X</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Production performance (request per second) multiplier</span></td>
    <td><span>1x</span></td>
    <td><span>2x</span></td>
    <td><span>2.15x</span></td>
    <td><span>2.45x</span></td>
  </tr>
  <tr>
    <td><span>Production efficiency (request per second per watt) multiplier</span></td>
    <td><span>1x</span></td>
    <td><span>1.33x</span></td>
    <td><span>1.38x</span></td>
    <td><span>1.63x</span></td>
  </tr>
</tbody></table></div>
    <div>
      <h4>AMD EPYC Genoa-X in Cloudflare Gen 12 server</h4>
      <a href="#amd-epyc-genoa-x-in-cloudflare-gen-12-server">
        
      </a>
    </div>
    <p>Each of these CPUs outperforms the previous generation of processors by at least 2x. AMD EPYC 9684X Genoa-X with 3D V-cache technology gave us the greatest performance improvement, at 2.45x, when compared against our Gen 11 servers with AMD EPYC 7713 Milan.</p><p>Comparing Genoa-X 9684X against Genoa 9654, we see a ~22.5% performance delta. The primary difference between the two CPUs is the amount of L3 cache. Genoa-X 9684X has 1152 MB of L3 cache, three times the 384 MB of Genoa 9654. Cloudflare workloads benefit from having more last-level cache accessible, avoiding the much larger latency penalty associated with fetching data from memory.</p><p>Genoa-X 9684X delivered this ~22.5% performance improvement while consuming the same 400W of power as Genoa 9654. The 3x larger L3 cache does consume additional power, but only at the cost of 3% of the highest achievable all-core boost frequency on Genoa-X 9684X, a favorable trade-off for Cloudflare workloads.</p><p>More importantly, Genoa-X 9684X delivered a 145% performance improvement with only a 50% increase in system power, a 63% power efficiency improvement that will help drive down operational expenditure tremendously. It is important to note that even though a big portion of the power efficiency comes from the CPU, it needs to be paired with an optimal thermal-mechanical design to realize the full benefit. Last year, <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor/"><u>we made the thermal-mechanical design choice to double the height of the server chassis to optimize rack density and cooling efficiency across our global data centers</u></a>. We estimated that moving from 1U to 2U would reduce fan power by 150W, which would decrease system power from 750 watts to 600 watts. Guess what? 
We were right: a Gen 12 server consumes 600 watts per system at a typical ambient temperature of 25°C.</p><p>While high performance often comes at a higher price, AMD EPYC 9684X fortunately offers an excellent balance between cost and capability. A server designed with this CPU provides top-tier performance without a huge financial outlay, resulting in a good Total Cost of Ownership improvement for Cloudflare.</p>
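The power budget described above works out as simple arithmetic (a sketch; the variable names are ours, the figures are from the text):

```python
# Moving from a 1U to a 2U chassis was estimated to cut fan power by 150 W.
estimated_1u_system_power = 750   # watts
fan_power_savings = 150           # watts, from the taller 2U fans
expected_2u_system_power = estimated_1u_system_power - fan_power_savings
print(expected_2u_system_power)   # 600 W, matching what Gen 12 draws at 25°C

# The headline efficiency claim: a 145% performance improvement is a 2.45x
# multiplier, achieved at 1.5x system power.
assert round((1 + 1.45) / 1.5, 2) == 1.63  # 63% performance-per-watt gain
```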
    <div>
      <h3>Memory</h3>
      <a href="#memory">
        
      </a>
    </div>
    <p>The AMD Genoa-X CPU supports twelve memory channels of DDR5 RAM at up to 4800 mega transfers per second (MT/s), for a per-socket memory bandwidth of 460.8 GB/s. The twelve channels are fully populated with 32 GB ECC 2Rx8 DDR5 RDIMMs in a one-DIMM-per-channel configuration, for a combined memory capacity of 384 GB. </p><p>Choosing the optimal memory capacity is a balancing act, as maintaining an optimal memory-to-core ratio is important to make sure neither CPU capacity nor memory capacity is wasted. Some may remember that our Gen 11 servers with 64-core AMD EPYC 7713 CPUs are also configured with 384 GB of memory, which is about 6 GB per core. So why did we choose to configure our Gen 12 servers with 384 GB of memory when the core count is growing to 96 cores? Great question! A lot of memory optimization work has happened since we introduced Gen 11, including some that we blogged about, like <a href="https://blog.cloudflare.com/scalable-machine-learning-at-cloudflare/"><u>Bot Management code optimization</u></a> and <a href="https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/"><u>our transition to highly efficient Pingora</u></a>. In addition, each service has a memory allocation that is sized for optimal performance. The per-service memory allocation is programmed and monitored utilizing Linux control group resource management features. When sizing memory capacity for Gen 12, we consulted with the team that monitors resource allocation and surveyed memory utilization metrics collected from our fleet. The result of the analysis is that the optimal memory-to-core ratio is 4 GB per CPU core, or 384 GB total memory capacity. This configuration is validated in production. 
We chose dual rank memory modules over single rank memory modules because they have higher memory throughput, which improves server performance (read more about <a href="https://blog.cloudflare.com/ddr4-memory-organization-and-how-it-affects-memory-bandwidth/"><u>memory module organization and its effect on memory bandwidth</u></a>). </p><p>The table below shows the result of running the <a href="https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html"><u>Intel Memory Latency Checker (MLC)</u></a> tool to measure peak memory bandwidth for the system and to compare memory throughput between 12 channels of dual-rank (2Rx8) 32 GB DIMM and 12 channels of single rank (1Rx4) 32 GB DIMM. Dual rank DIMMs have slightly higher (1.8%) read memory bandwidth, but noticeably higher write bandwidth. As write ratios increased from 25% to 50%, the memory throughput delta increased by 10%.</p>
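The 460.8 GB/s per-socket figure is the theoretical peak for this configuration; a quick sketch (assuming the standard 64-bit, i.e. 8-byte, data bus per DDR5 channel):

```python
# Peak memory bandwidth = channels x transfer rate x bytes per transfer.
channels = 12
transfers_per_sec = 4800e6   # DDR5-4800 = 4800 MT/s
bytes_per_transfer = 8       # 64-bit data bus per channel
peak_bw_gb_s = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(peak_bw_gb_s)          # 460.8 GB/s per socket

# Memory-to-core ratio chosen for Gen 12:
assert 384 / 96 == 4         # 4 GB per CPU core
```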
<div><table><thead>
  <tr>
    <th><span>Benchmark</span></th>
    <th><span>Dual rank advantage over single rank</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Intel MLC ALL Reads</span></td>
    <td><span>101.8%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC 3:1 Reads-Writes</span></td>
    <td><span>107.7%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC 2:1 Reads-Writes</span></td>
    <td><span>112.9%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC 1:1 Reads-Writes</span></td>
    <td><span>117.8%</span></td>
  </tr>
  <tr>
    <td><span>Intel MLC Stream-triad like</span></td>
    <td><span>108.6%</span></td>
  </tr>
</tbody></table></div><p>The table below shows the result of running the <a href="https://www.amd.com/en/developer/zen-software-studio/applications/spack/stream-benchmark.html"><u>AMD STREAM benchmark</u></a> to measure sustainable main memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. In all four vector kernels, dual rank DIMMs provide a noticeable advantage over single rank DIMMs.</p>
<div><table><thead>
  <tr>
    <th><span>Benchmark</span></th>
    <th><span>Dual rank advantage over single rank</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Stream Copy</span></td>
    <td><span>115.44%</span></td>
  </tr>
  <tr>
    <td><span>Stream Scale</span></td>
    <td><span>111.22%</span></td>
  </tr>
  <tr>
    <td><span>Stream Add</span></td>
    <td><span>109.06%</span></td>
  </tr>
  <tr>
    <td><span>Stream Triad</span></td>
    <td><span>107.70%</span></td>
  </tr>
</tbody></table></div>
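As a sanity check on the headline figures above, the quoted bandwidth and capacity follow directly from the channel and core counts. A minimal sketch of the arithmetic (the constants come from the post; the variable names are ours):

```python
# Sanity check of the Gen 12 memory figures quoted above
# (DDR5-4800, twelve channels, 32 GB DIMMs, 96 cores).

CHANNELS = 12
TRANSFER_RATE_MT_S = 4800   # mega-transfers per second per channel
BUS_WIDTH_BYTES = 8         # 64-bit data bus per channel (ECC bits excluded)
DIMM_CAPACITY_GB = 32
CORES = 96

# Peak per-socket bandwidth: channels * MT/s * bytes per transfer
bandwidth_gb_s = CHANNELS * TRANSFER_RATE_MT_S * BUS_WIDTH_BYTES / 1000
print(bandwidth_gb_s)       # 460.8, matching the quoted 460.8 GB/s

# Total capacity and memory-to-core ratio
total_gb = CHANNELS * DIMM_CAPACITY_GB
print(total_gb, total_gb / CORES)   # 384 GB total, 4.0 GB per core
```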
    <div>
      <h3>Storage</h3>
      <a href="#storage">
        
      </a>
    </div>
    <p>Cloudflare’s Gen X and Gen 11 servers use <a href="https://en.wikipedia.org/wiki/M.2"><u>M.2</u></a> form factor drives. We liked the M.2 form factor mainly because it is compact. However, the M.2 specification was introduced in 2012, its connector system is now dated, and the industry has concerns about its ability to maintain signal integrity at the high signaling speeds specified by <a href="https://www.xda-developers.com/pcie-5/"><u>PCIe 5.0</u></a> and <a href="https://pcisig.com/pci-express-6.0-specification"><u>PCIe 6.0</u></a>. The 8.25 W thermal limit of the M.2 form factor also limits the number of flash dies that can be fitted, which caps the maximum capacity per drive. To address these concerns, the industry introduced the <a href="https://americas.kioxia.com/content/dam/kioxia/en-us/business/ssd/data-center-ssd/asset/KIOXIA_Meta_Microsoft_EDSFF_E1_S_Intro_White_Paper.pdf"><u>E1.S</u></a> specification and is transitioning from the M.2 form factor to E1.S. </p><p>In Gen 12, we are changing to the <a href="https://www.snia.org/forums/cmsi/knowledge/formfactors#EDSFF"><u>EDSFF</u></a> E1 form factor, specifically E1.S 15mm. While still compact, E1.S 15mm provides more room for additional flash dies, supporting larger capacities, and its improved cooling design supports more than 25 W of sustained power.</p><p>While the AMD Genoa-X CPU supports 128 PCIe 5.0 lanes, we continue to use NVMe devices with four PCIe 4.0 lanes, as PCIe 4.0 throughput is sufficient to meet drive bandwidth requirements while keeping server design costs optimal. The server is equipped with two 8 TB NVMe drives for a total of 16 TB of storage. We opted for two 8 TB drives instead of four 4 TB drives because the dual 8 TB configuration already provides sufficient I/O bandwidth for all Cloudflare workloads that run on each server.</p>
<div><table><thead>
  <tr>
    <th><span>Specification</span></th>
    <th><span>Value</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Sequential Read (MB/s):</span></td>
    <td><span>6,700</span></td>
  </tr>
  <tr>
    <td><span>Sequential Write (MB/s) :</span></td>
    <td><span>4,000</span></td>
  </tr>
  <tr>
    <td><span>Random Read IOPS:</span></td>
    <td><span>1,000,000</span></td>
  </tr>
  <tr>
    <td><span>Random Write IOPS: </span></td>
    <td><span>200,000</span></td>
  </tr>
  <tr>
    <td><span>Endurance</span></td>
    <td><span>1 DWPD</span></td>
  </tr>
  <tr>
    <td><span>PCIe Gen 4.0 x4 link throughput</span></td>
    <td><span>7880 MB/s</span></td>
  </tr>
</tbody></table></div><p><sub><i>Storage device performance specifications</i></sub></p>
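The PCIe Gen 4.0 x4 throughput quoted in the table can be derived from the link parameters. A rough sketch (the 16 GT/s per-lane rate and 128b/130b encoding come from the PCIe 4.0 specification; the rounding is ours):

```python
# Usable throughput of a PCIe 4.0 x4 link, as listed in the table above.

RAW_GT_S = 16            # PCIe 4.0 signaling rate per lane (GT/s)
LANES = 4
ENCODING = 128 / 130     # 128b/130b line encoding efficiency

# bytes/s = GT/s * lanes * encoding efficiency / 8 bits per byte
throughput_mb_s = RAW_GT_S * 1e9 * LANES * ENCODING / 8 / 1e6
print(round(throughput_mb_s))   # 7877, i.e. the ~7880 MB/s quoted above
```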
    <div>
      <h3>Network</h3>
      <a href="#network">
        
      </a>
    </div>
    <p>Cloudflare servers and top-of-rack (ToR) network equipment operate at <a href="https://en.wikipedia.org/wiki/25_Gigabit_Ethernet"><u>25 GbE</u></a> speeds. In Gen 12, we adopted a <a href="https://www.opencompute.org/wiki/Server/DC-MHS"><u>DC-MHS</u></a>-inspired motherboard design and upgraded from the <a href="https://drive.google.com/file/d/1VGAtABAKU9fq3KfClYhFOgGFN3oe63Uw/view?usp=sharing"><u>OCP 2.0 form factor</u></a> to the <a href="https://drive.google.com/file/d/1U3oEGiSWfupG4SnIdPuJ_8Nte2lJRqTN/view?usp=sharing"><u>OCP 3.0 form factor</u></a>, which provides tool-less serviceability of the NIC. The OCP 3.0 form factor also occupies less space in the 2U server than PCIe-attached NICs, which improves airflow and frees up space for other application-specific PCIe cards, such as GPUs.</p><p>Cloudflare has been using the Mellanox ConnectX-4 Lx EN dual port 25 GbE NIC since our <a href="https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/"><u>Gen 9 servers in 2018</u></a>. Even though the NIC has served us well over the years, we were single sourced. During the pandemic, we faced supply constraints and extremely long lead times. The team scrambled to qualify the Broadcom M225P dual port 25 GbE NIC as our second-source NIC in 2022, ensuring we could continue to turn up servers to serve customer demand. With the lessons learned from single-sourcing the Gen 11 NIC, we are now dual sourcing and have chosen the Intel Ethernet Network Adapter E810 and NVIDIA Mellanox ConnectX-6 Lx for Gen 12. Both NICs are compliant with the <a href="https://www.opencompute.org/wiki/Server/NIC"><u>OCP 3.0 specification</u></a> and offer more MSI-X queues, which can be mapped to the increased core count of the AMD EPYC 9684X. 
The Intel Ethernet Network Adapter comes with an additional advantage: full Generic Segmentation Offload (GSO) support, including for VLAN-tagged encapsulated traffic, whereas many vendors today either support only <a href="https://netdevconf.info/1.2/papers/LCO-GSO-Partial-TSO-MangleID.pdf"><u>Partial GSO</u></a> or do not support it at all. With full GSO support, the kernel spends noticeably less time in softirq segmenting packets, and servers with Intel E810 NICs process approximately 2% more requests per second.</p>
    <div>
      <h3>Improved security with DC-SCM: Project Argus</h3>
      <a href="#improved-security-with-dc-scm-project-argus">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ur3YccqqXckIL6oWKd6Lq/5352252ff8e5c1fb15eb02d1572a0689/BLOG-2116_3.png" />
          </figure><p><sup><i>DC-SCM in Gen 12 server (Project Argus)</i></sup></p><p>Gen 12 servers are integrated with <a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a>, one of the industry’s first implementations of <a href="https://drive.google.com/file/d/13BxuseSrKo647hjIXjp087ei8l5QQVb0/view"><u>Data Center Secure Control Module 2.0 (DC-SCM 2.0)</u></a>. DC-SCM 2.0 decouples server management and security functions from the motherboard. The baseboard management controller (BMC), hardware root of trust (HRoT), trusted platform module (TPM), and dual BMC/BIOS flash chips are all installed on the DC-SCM. </p><p>On our Gen X and Gen 11 servers, Cloudflare moved our secure boot trust anchor from the system Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI) firmware to hardware-rooted boot integrity: <a href="https://blog.cloudflare.com/anchoring-trust-a-hardware-secure-boot-story/"><u>AMD’s implementation of Platform Secure Boot (PSB)</u></a> or <a href="https://blog.cloudflare.com/armed-to-boot/"><u>Ampere’s implementation of Single Domain Secure Boot</u></a>. These solutions helped secure Cloudflare infrastructure from BIOS/UEFI firmware attacks. However, we remained vulnerable to out-of-band attacks that compromise the BMC firmware. The BMC is a microcontroller that provides out-of-band monitoring and management capabilities for the system. If it is compromised, attackers can, for example, read processor console logs accessible to the BMC and control server power states. 
On Gen 12, the HRoT on the DC-SCM serves as the trust store for cryptographic keys and is responsible for authenticating both the BIOS/UEFI firmware (independent of CPU vendor) and the BMC firmware during the secure boot process.</p><p>In addition, the DC-SCM carries additional flash storage devices that hold backup BIOS/UEFI and BMC firmware images, allowing rapid recovery if corrupted or malicious firmware is programmed and providing resilience against flash chip failure due to aging.</p><p>These updates make our Gen 12 server more secure and more resilient to firmware attacks.</p>
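Conceptually, the HRoT's role described above amounts to measuring a firmware image and comparing it against a trusted value before allowing boot. A toy sketch of that idea (this is a conceptual illustration only, not Project Argus code; the function names and firmware strings are ours):

```python
# Toy illustration of a root-of-trust style firmware check: compare a
# firmware image's digest against a known-good measurement before
# allowing boot. This is a conceptual sketch, not the Argus implementation.

import hashlib

def measure(firmware: bytes) -> str:
    """Compute a SHA-256 measurement of a firmware image."""
    return hashlib.sha256(firmware).hexdigest()

# Hypothetical trusted measurement store held by the HRoT.
TRUSTED_MEASUREMENTS = {measure(b"known-good BMC firmware v1")}

def allow_boot(firmware: bytes) -> bool:
    """Permit boot only if the firmware matches a trusted measurement."""
    return measure(firmware) in TRUSTED_MEASUREMENTS

print(allow_boot(b"known-good BMC firmware v1"))   # True
print(allow_boot(b"tampered firmware"))            # False
```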
    <div>
      <h3>Power</h3>
      <a href="#power">
        
      </a>
    </div>
    <p>A Gen 12 server consumes 600 watts at a typical ambient temperature of 25°C. Even though this is a 50% increase over the 400 watts consumed by a Gen 11 server, as mentioned in the CPU section above, it is a relatively small price to pay for a 145% increase in performance. We paired the server with dual 800 W common redundant power supplies (CRPS) with 80 PLUS Titanium grade efficiency. Both power supply units (PSUs) operate actively with distributed power and current, and both are hot-pluggable, allowing the server to operate with redundancy and maximize uptime.</p><p><a href="https://www.clearesult.com/80plus/program-details"><u>80 PLUS</u></a> is a PSU efficiency certification program. A Titanium grade PSU is 2% more efficient than a Platinum grade PSU across the typical operating load of 25% to 50%. 2% may not sound like a lot, but considering the size of Cloudflare’s fleet, with servers deployed worldwide, a 2% saving over the lifetime of all Gen 12 deployments is a reduction of more than 7 GWh, <a href="https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator#results"><u>equivalent to the carbon sequestered by more than 3,400 acres of U.S. forests in one year</u></a>. This upgrade also means our Gen 12 server complies with <a href="https://www.unicomengineering.com/blog/eu-lot-9-update-the-coming-server-power-migration/"><u>EU Lot 9 requirements</u></a> and can be deployed in the EU region.</p>
<div><table><thead>
  <tr>
    <th><span>80 PLUS certification</span></th>
    <th><span>10%</span></th>
    <th><span>20%</span></th>
    <th><span>50%</span></th>
    <th><span>100%</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>80 PLUS Platinum</span></td>
    <td><span>-</span></td>
    <td><span>92%</span></td>
    <td><span>94%</span></td>
    <td><span>90%</span></td>
  </tr>
  <tr>
    <td><span>80 PLUS Titanium</span></td>
    <td><span>90%</span></td>
    <td><span>94%</span></td>
    <td><span>96%</span></td>
    <td><span>91%</span></td>
  </tr>
</tbody></table></div>
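The multi-GWh figure above scales with fleet size and deployment lifetime. A hedged illustration of how such an estimate is formed (the server count and lifetime below are made-up placeholders, not Cloudflare's actual numbers; only the 600 W draw and 2% efficiency delta come from this post):

```python
# Illustrative estimate of lifetime energy savings from a 2% PSU
# efficiency gain. SERVERS and YEARS are hypothetical placeholders;
# only the 600 W draw and the 2% delta come from the post.

SERVERS = 10_000          # placeholder fleet size
WATTS_PER_SERVER = 600    # Gen 12 typical draw, from the post
EFFICIENCY_DELTA = 0.02   # Titanium vs. Platinum, from the post
YEARS = 6                 # placeholder deployment lifetime

hours = YEARS * 365 * 24
# Approximate the saved wall power as 2% of the DC load.
saved_gwh = SERVERS * WATTS_PER_SERVER * EFFICIENCY_DELTA * hours / 1e9
print(f"{saved_gwh:.1f} GWh")   # ~6.3 GWh for these placeholder inputs
```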
    <div>
      <h3>Drop-in GPU support</h3>
      <a href="#drop-in-gpu-support">
        
      </a>
    </div>
    <p>Demand for machine learning and AI workloads exploded in 2023, and Cloudflare <a href="https://blog.cloudflare.com/workers-ai/"><u>introduced Workers AI</u></a> to serve the needs of our customers. Cloudflare retrofitted or deployed GPUs worldwide in a portion of our Gen 11 server fleet to support the growth of Workers AI. Our Gen 12 server is likewise designed to accommodate the addition of more powerful GPUs. This gives Cloudflare the flexibility to support Workers AI in all regions of the world and to strategically place GPUs in regions to reduce inference latency for our customers. With this design, the server can run Cloudflare’s full software stack: during times when GPUs see lower utilization, the server continues to serve general web requests and remains productive.</p><p>The motherboard’s electrical design supports up to two PCIe add-in cards, and the power distribution board is sized to supply an additional 400 W. The mechanical design accommodates either a single FHFL (full height, full length) double-width GPU PCIe card or two FHFL single-width GPU PCIe cards. The thermal solution, including component placement, fans, and air duct design, is sized to support GPUs with a TDP of up to 400 W.</p>
    <div>
      <h3>Looking to the future</h3>
      <a href="#looking-to-the-future">
        
      </a>
    </div>
    <p>Gen 12 servers are currently deployed and live in multiple Cloudflare data centers worldwide, already processing millions of requests per second. Cloudflare’s EPYC journey has not ended: the 5th generation AMD EPYC CPUs (codenamed “Turin”) are already available for testing, and we are very excited to start architecture planning and design discussions for the Gen 13 server. <a href="https://www.cloudflare.com/careers/jobs/"><u>Come join us</u></a> at Cloudflare to help build a better Internet!</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">sdvPBDBwhcEcrODVeOE7A</guid>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Ma Xiong</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
        </item>
        <item>
            <title><![CDATA[Measuring Hyper-Threading and Turbo Boost]]></title>
            <link>https://blog.cloudflare.com/measuring-hyper-threading-and-turbo-boost/</link>
            <pubDate>Tue, 05 Oct 2021 12:58:35 GMT</pubDate>
            <description><![CDATA[ Contemporary x86 processors implement some variants of Hyper-Threading and Turbo Boost. We decided to learn about the implications of these two technologies. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>We often put together experiments that measure hardware performance to improve our understanding and provide insights to our hardware partners. We recently wanted to know more about Hyper-Threading and Turbo Boost. The last time we assessed these two technologies was when we were still <a href="/a-tour-inside-cloudflares-g9-servers/">deploying the Intel Xeons</a> (Skylake/Purley), but beginning with our Gen X servers we <a href="/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/">switched over to the AMD EPYC</a> (Zen 2/Rome). This blog is about our latest attempt at quantifying the performance impact of Hyper-Threading and Turbo Boost on our AMD-based servers running our software stack.</p><p>Intel briefly introduced Hyper-Threading with NetBurst (Northwood) back in 2002, then reintroduced Hyper-Threading six years later with Nehalem along with Turbo Boost. AMD presented their own implementation of these technologies with Zen in 2017, but AMD’s version of Turbo Boost actually dates back to AMD K10 (Thuban), in 2010, when it used to be called Turbo Core. Since Zen, Hyper-Threading and Turbo Boost are known as simultaneous multithreading (SMT) and Core Performance Boost (CPB), respectively. The underlying implementation of Hyper-Threading and Turbo Boost differs between the two vendors, but the high-level concept remains the same.</p><p>Hyper-Threading or simultaneous multithreading creates a second hardware thread within a processor’s core, also known as a logical core, by duplicating various parts of the core to support the context of a second application thread. The two hardware threads execute simultaneously within the core, across their dedicated and remaining shared resources. 
If the two hardware threads do not contend over a particular shared resource, throughput can increase drastically.</p><p>Turbo Boost or Core Performance Boost opportunistically allows the processor to operate beyond its rated base frequency, as long as it stays within guidelines set by Intel or AMD. Generally speaking, the higher the frequency, the faster the processor finishes a task.</p>
    <div>
      <h2>Simulated Environment</h2>
      <a href="#simulated-environment">
        
      </a>
    </div>
    
    <div>
      <h3>CPU Specification</h3>
      <a href="#cpu-specification">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/STcQJqLbUhlxNm7hYmEc7/b33ed9909e4c56c6db4eb7d8090447dc/image2-12.png" />
            
            </figure><p>Our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">Gen X or 10th generation servers</a> are powered by the <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC 7642</a>, based on the Zen 2 microarchitecture. The vast majority of Zen 2-based processors, along with the successor Zen 3 processors that our <a href="/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/">Gen 11 servers are based on</a>, support simultaneous multithreading and Core Performance Boost.</p><p>Similar to Intel’s Hyper-Threading, AMD implemented 2-way simultaneous multithreading. The AMD EPYC 7642 has 48 cores, and with simultaneous multithreading enabled it can execute 96 hardware threads simultaneously. Core Performance Boost allows the AMD EPYC 7642 to operate anywhere between 2.3 and 3.3 GHz, depending on the workload and the limitations imposed on the processor. With Core Performance Boost disabled, the processor operates at 2.3 GHz, its rated base frequency. We took our usual simulated traffic pattern of 10 KiB cached assets over HTTPS, <a href="/keepalives-considered-harmful/">provided by our performance team</a>, to generate a sustained workload that saturated the processor at 100% CPU utilization.</p>
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>After establishing a baseline with simultaneous multithreading and Core Performance Boost disabled, we enabled one feature at a time. With Core Performance Boost enabled, the processor operated near its peak turbo frequency, hovering between 3.2 and 3.3 GHz, more than 39% above the base frequency. The higher operating frequency translated directly into 40% additional requests per second. We then disabled Core Performance Boost and enabled simultaneous multithreading. Similar to Core Performance Boost, simultaneous multithreading alone improved requests per second by 43%. Lastly, with both features enabled, we observed an 86% improvement in requests per second.</p>
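If the two gains composed independently, the combined improvement would be multiplicative, around 100%; the observed 86% falls short of that, hinting at shared-resource contention. A quick check of the percentages above:

```python
# If Core Performance Boost (+40%) and simultaneous multithreading
# (+43%) were fully independent, their combined gain would multiply.

cpb_gain = 0.40
smt_gain = 0.43

expected = (1 + cpb_gain) * (1 + smt_gain) - 1
print(f"expected if independent: {expected:.0%}")   # ~100%
print("observed: 86%")                              # per the measurements above
```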
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NfKzNxMrKrtufGQuYjcHp/483e38c99e370940454158898f7fbe76/image4-8.png" />
            
            </figure><p>Latencies were generally lowered by either or both of Core Performance Boost and simultaneous multithreading. While Core Performance Boost consistently maintained a lower latency than the baseline, simultaneous multithreading gradually took longer to process a request toward the tail latencies. Though not depicted in the figure below, when we examined beyond p9999, the 99.99th percentile, latency with simultaneous multithreading, even with the help of Core Performance Boost, climbed sharply to more than 150% over the baseline, presumably due to the two hardware threads contending over a shared resource within the core.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4aLHnhtg8eOGWoUCBe2g98/2865b9db5d5746d47c510ea6e80eec8f/image7-4.png" />
            
            </figure>
    <div>
      <h2>Production Environment</h2>
      <a href="#production-environment">
        
      </a>
    </div>
    <p>Moving into production, since our <a href="https://radar.cloudflare.com/">traffic fluctuates throughout the day</a>, we took four identical Gen X servers and measured them in parallel during peak hours. The only changes we made to the servers were enabling and disabling simultaneous multithreading and Core Performance Boost, to create a comprehensive test matrix. We conducted the experiment in two different regions to identify any anomalies or mismatched trends; the trends in both regions were alike.</p><p>Before diving into the results, we should note that the baseline server operated at a higher CPU utilization than the others. Every generation, our servers deliver a noticeable improvement in performance, so our load balancer, named Unimog, <a href="/unimog-cloudflares-edge-load-balancer/">sends a different number of connections to the target server based on its generation</a> to balance out CPU utilization. When we disabled simultaneous multithreading and Core Performance Boost, the baseline server’s performance degraded to the point where Unimog hit a “guard rail”, the lower limit on the requests sent to the server, and so its CPU utilization rose instead. Operating at a higher CPU utilization, the baseline server processed more requests per second to meet the minimum performance threshold.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4f2ViyRNLst20bSKZ5I9fe/4fddc3014f888809e9b0a9f120234004/image8-5.png" />
            
            </figure>
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>Due to the skewed baseline, we observed only 7% additional requests per second when Core Performance Boost was enabled. Next, simultaneous multithreading alone improved requests per second by 41%. Lastly, with both features enabled, we saw an 86% improvement in requests per second.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6E6GlehsjYmQmKnpjxgkut/cb7c54930c5d90c16d8ce45ecea429fe/image6-6.png" />
            
            </figure><p>Though we lack concrete baseline data, we can normalize requests per second by CPU utilization to approximate the improvement in each scenario. Once normalized, the estimated improvements in requests per second from Core Performance Boost and simultaneous multithreading were 36% and 80%, respectively. With both features enabled, requests per second improved by 136%.</p>
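The normalization described above amounts to dividing each configuration's requests per second by its CPU utilization and re-basing against the baseline. A small sketch with made-up numbers (none of these values are the actual measurements):

```python
# Normalize requests per second by CPU utilization so that servers
# running at different utilizations become comparable. All RPS and
# utilization values below are hypothetical, for illustration only.

configs = {
    "baseline": {"rps": 1070, "util": 0.95},   # skewed: ran hotter
    "cpb":      {"rps": 1150, "util": 0.75},
    "smt":      {"rps": 1510, "util": 0.80},
    "cpb+smt":  {"rps": 2000, "util": 0.75},
}

base = configs["baseline"]["rps"] / configs["baseline"]["util"]
for name, c in configs.items():
    normalized = c["rps"] / c["util"]
    gain = (normalized / base - 1) * 100
    print(f"{name}: {gain:+.0f}% vs normalized baseline")
```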
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6OGBo0BJ9POefn6afYw5Mc/97f445ce481673d379a7f52ec2907be6/image5-6.png" />
            
            </figure><p>Latency was not as interesting since the baseline server operated at a higher CPU utilization, and in turn, it produced a higher tail latency than we would have otherwise expected. All other servers maintained a lower latency due to their lower CPU utilization in conjunction with Core Performance Boost, simultaneous multithreading, or both.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JPkPm4EP6DXEGd99nSyVo/5adb7926ab3a25c02ecf4b7c5e74d14b/image1-10.png" />
            
            </figure><p>At this point, our experiment had not gone as planned: our baseline was skewed, and we only got half-useful answers. However, we find experimenting important because we usually end up discovering other helpful insights along the way.</p><p>Let’s add power data. Since our baseline server was operating at a higher CPU utilization, we knew it was serving more requests and therefore consuming more power than it needed to. Enabling Core Performance Boost allowed the processor to run up to its peak turbo frequency, increasing power consumption by 35% over the skewed baseline. More interestingly, enabling simultaneous multithreading increased power consumption by only 7%. Combining Core Performance Boost with simultaneous multithreading resulted in a 58% increase in power consumption.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Qxyi4d9qWZDxc7F09w0bv/73b88a522a15b35d5796478fdb11b6de/image3-6.png" />
            
            </figure><p>AMD’s implementation of simultaneous multithreading appears to be power efficient as it achieves 41% additional requests per second while consuming only 7% more power compared to the skewed baseline. For completeness, using the data we have, we bridged performance and power together to obtain performance per watt to summarize power efficiency. We divided the non-normalized requests per second by power consumption to produce the requests per watt figure below. Our Gen X servers attained the best performance per watt by enabling just simultaneous multithreading.</p>
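Bridging performance and power as described is a simple division. A minimal sketch with hypothetical values (a real comparison would use the measured requests per second and wall power; the numbers below are placeholders):

```python
# Performance per watt: non-normalized requests per second divided by
# measured power draw. The numbers here are illustrative placeholders.

def requests_per_watt(rps: float, watts: float) -> float:
    return rps / watts

baseline = requests_per_watt(1000.0, 400.0)   # hypothetical baseline
smt_only = requests_per_watt(1410.0, 428.0)   # +41% rps for +7% power

print(round(baseline, 2))   # 2.5
print(round(smt_only, 2))   # 3.29
```

With these placeholder inputs, enabling simultaneous multithreading alone yields the best requests per watt, mirroring the conclusion above.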
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ZY19SD0eKHUmUo6Zx3bCA/2b0250f752aea47f944a4c6d0b77bf5c/image9-4.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In our assessment of AMD’s implementation of Hyper-Threading and Turbo Boost, the original experiment we designed to measure requests per second and latency did not pan out as expected. As soon as we entered production, our baseline measurement was skewed by the imbalance in CPU utilization and only partially reproduced our lab results.</p><p>We added power to the experiment and found other meaningful insights. By analyzing the performance and power characteristics of simultaneous multithreading and Core Performance Boost, we concluded that simultaneous multithreading can be a power-efficient mechanism for attaining additional requests per second. A drawback of simultaneous multithreading is its long tail latency, which is currently curtailed by enabling Core Performance Boost. While the higher frequency enabled by Core Performance Boost reduces latency and increases requests per second, we are mindful that the increase in power consumption is quite significant.</p><p>Do you want to help shape the Cloudflare network? This blog was a glimpse of the work we do at Cloudflare. <a href="https://www.cloudflare.com/careers/">Come join us</a> and help complete the feedback loop for our developers and hardware partners.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">6eb1jfSeTaXx6866LDX8Nl</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[The EPYC journey continues to Milan in Cloudflare’s 11th generation Edge Server]]></title>
            <link>https://blog.cloudflare.com/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/</link>
            <pubDate>Tue, 31 Aug 2021 13:00:45 GMT</pubDate>
            <description><![CDATA[ At Cloudflare we aim to introduce a new server platform to our edge network every 12 to 18 months or so, to ensure that we keep up with the latest industry technologies and developments.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>When I was interviewing to join Cloudflare in 2014 as a member of the SRE team, we had just introduced our <a href="/a-tour-inside-cloudflares-latest-generation-servers/">generation 4 server</a>, and I was excited about the prospects. Since then, Cloudflare, the industry and I have all changed dramatically. The best thing about working for a rapidly growing company like Cloudflare is that as the company grows, new roles open up to enable career development. And so, having left the SRE team last year, I joined the recently formed hardware engineering team, a team that simply didn’t exist in 2014.</p><p>We aim to introduce a new server platform to our edge network every 12 to 18 months or so, to ensure that we keep up with the latest industry technologies and developments. We announced the <a href="/a-tour-inside-cloudflares-g9-servers/">generation 9 server</a> in October 2018 and we announced the <a href="/a-tour-inside-cloudflares-g9-servers/">generation 10 server</a> in February 2020. We consider this length of cycle optimal: short enough to stay nimble and take advantage of the latest technologies, but long enough to offset the time taken by our hardware engineers to test and validate the entire platform. When we are <a href="https://www.cloudflare.com/network/">shipping servers to over 200 cities</a> around the world with a variety of regulatory standards, it’s essential to get things right the first time.</p><p>We continually work with our silicon vendors to receive product roadmaps and stay on top of the latest technologies. Since mid-2020, the hardware engineering team at Cloudflare has been working on our generation 11 server.</p><p>Requests per Watt is one of our defining characteristics when testing new hardware and we use it to identify how much more efficient a new hardware generation is than the previous generation. 
We continually strive to reduce our operational costs, and <a href="/the-climate-and-cloudflare/">reducing power consumption</a> is one of the most important parts of this. It’s good for the planet, and it lets us fit more servers into a rack, reducing our physical footprint.</p><p>We designed these generation 11 x86 servers in parallel with our efforts to design next-generation edge servers using the <a href="/arms-race-ampere-altra-takes-on-aws-graviton2/">Ampere Altra</a> Arm architecture. You can read more about our tests in a blog post by my colleague Sung, and we will document our work on Arm at the edge in a subsequent blog post.</p><p>We evaluated Intel’s latest generation of “Ice Lake” Xeon processors. Although Intel’s chips could compete with AMD’s in raw performance, their power consumption was several hundred watts higher per server, which is enormous. This made Intel’s performance per watt unattractive.</p><p>We previously described how we deployed AMD EPYC 7642 processors in our generation 10 server. That processor has 48 cores and is based on AMD’s 2nd generation EPYC architecture, codenamed Rome. For our generation 11 server, we evaluated 48, 56, and 64 core samples based on AMD’s 3rd generation EPYC architecture, codenamed Milan. Comparing the two 48 core processors directly, we were interested to find a performance boost of several percent from the <a href="https://www.amd.com/en/events/epyc">3rd generation EPYC</a> architecture. We therefore had high hopes for the 56 core and 64 core chips.</p><p>So, based on the samples we received from our vendors and our subsequent testing, hardware from AMD and Ampere made the shortlist for our generation 11 server. On this occasion, we decided that Intel did not meet our requirements. However, it’s healthy that Intel and AMD compete and innovate in the x86 space, and we look forward to seeing how Intel’s next generation shapes up.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ajXt7adRxT5V2FVBVw7IJ/56872f59f7e681093f6591fa6ef42840/IMG_4118.jpeg.jpeg" />
            
            </figure>
    <div>
      <h3>Testing and validation process</h3>
      <a href="#testing-and-validation-process">
        
      </a>
    </div>
    <p>Before we go on to talk about the hardware, I’d like to say a few words about the testing process we went through to test our generation 11 servers.</p><p>As we elected to proceed with AMD chips, we were able to use our generation 10 servers as our Engineering Validation Test platform, with the only changes being the new silicon and updated firmware. We were able to perform these upgrades ourselves in our hardware validation lab.</p><p>Cloudflare’s network is built with commodity hardware and we source the hardware from multiple vendors, known as ODMs (Original Design Manufacturers), who build the servers to our specifications.</p><p>When you are working with bleeding edge silicon and experimental firmware, not everything is plain sailing. We worked with one of our ODMs to eliminate an issue which was causing the Linux kernel to panic on boot. Once resolved, we used a variety of synthetic benchmarking tools to verify the performance including <a href="https://github.com/cloudflare/cf_benchmark">cf_benchmark</a>, as well as an internal tool which applies a synthetic load to our entire software stack.</p><p>Once we were satisfied, we ordered Design Validation Test samples, which were manufactured by our ODMs with the new silicon. We continued to test these and iron out the inevitable issues that arise when you are developing custom hardware. To ensure that performance matched our expectations, we used synthetic benchmarking to test the new silicon. We also began testing the samples in our production environment by gradually introducing customer traffic to them as confidence grew.</p><p>Once the issues were resolved, we ordered the Product Validation Test samples, which were again manufactured by our ODMs, taking into account the feedback obtained in the DVT phase. As these are intended to be production grade, we work with the broader Cloudflare teams to deploy these units like a mass production order.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qZWpcqMs5De6LfHMCudjn/b7ca612abce657d53e51f2199ebbc7a9/20210722_113627.jpeg.jpeg" />
            
            </figure>
    <div>
      <h3>CPU</h3>
      <a href="#cpu">
        
      </a>
    </div>
    <p>Previously: AMD EPYC 7642 48-Core Processor</p><p>Now: AMD EPYC 7713 64-Core Processor</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6aIqXZIeXOANPykcbW8tJJ/bc57b2023460a731e92e8c2e9b924afd/IMG_4203.jpeg.jpeg" />
            
            </figure>
<div><table><thead>
  <tr>
    <th></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7642"><span>AMD EPYC 7642</span></a></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7643"><span>AMD EPYC 7643</span></a></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7663"><span>AMD EPYC 7663 </span></a></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7713"><span>AMD EPYC 7713</span></a></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Status</span></td>
    <td><span>Incumbent</span></td>
    <td><span>Candidate</span></td>
    <td><span>Candidate</span></td>
    <td><span>Candidate</span></td>
  </tr>
  <tr>
    <td><span>Core Count</span></td>
    <td><span>48</span></td>
    <td><span>48</span></td>
    <td><span>56</span></td>
    <td><span>64</span></td>
  </tr>
  <tr>
    <td><span>Thread Count</span></td>
    <td><span>96</span></td>
    <td><span>96</span></td>
    <td><span>112</span></td>
    <td><span>128</span></td>
  </tr>
  <tr>
    <td><span>Base Clock</span></td>
    <td><span>2.3GHz</span></td>
    <td><span>2.3GHz</span></td>
    <td><span>2.0GHz</span></td>
    <td><span>2.0GHz</span></td>
  </tr>
  <tr>
    <td><span>Max Boost Clock</span></td>
    <td><span>3.3GHz</span></td>
    <td><span>3.6GHz</span></td>
    <td><span>3.5GHz</span></td>
    <td><span>3.675GHz</span></td>
  </tr>
  <tr>
    <td><span>Total L3 Cache</span></td>
    <td><span>256MB</span></td>
    <td><span>256MB</span></td>
    <td><span>256MB</span></td>
    <td><span>256MB</span></td>
  </tr>
  <tr>
    <td><span>Default TDP</span></td>
    <td><span>225W</span></td>
    <td><span>225W</span></td>
    <td><span>240W</span></td>
    <td><span>225W </span></td>
  </tr>
  <tr>
    <td><span>Configurable TDP</span></td>
    <td><span>240W</span></td>
    <td><span>240W</span></td>
    <td><span>240W</span></td>
    <td><span>240W</span></td>
  </tr>
</tbody></table></div><p>In the above chart, TDP refers to Thermal Design Power, a measure of the heat dissipated. All of the above processors have a configurable TDP - assuming the cooling solution is capable - giving more performance at the expense of increased power consumption. We tested all processors configured at their highest supported TDP.</p><p>The 64 core processors have 33% more cores than the 48 core processors, so you might hypothesize that we would see a corresponding 33% increase in performance, although our benchmarks showed slightly more modest gains. This is because the 64 core processors have lower base clock frequencies to fit within the same 225W power envelope.</p><p>In production testing, we found that the 64 core EPYC 7713 gave us around a 29% performance boost over the incumbent, whilst having similar power consumption and thermal properties.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3rYE9p57gOFlQ5xwL6D29t/aaf3624f32e1bf15ecf91db8c25dd96a/IMG_4196.jpeg.jpeg" />
            
            </figure>
    <div>
      <h2>Memory</h2>
      <a href="#memory">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Z86awdPPL6Yt3G5GRZP41/c76caa9c5cbf76f718e5b3da99a989b2/_MG_4100.jpeg.jpeg" />
            
            </figure><p>Previously: 256GB DDR4-2933</p><p>Now: 384GB DDR4-3200</p><p>Having made a decision about the processor, the next step was to determine the optimal amount of memory for our workload. We ran a series of experiments with our chosen EPYC 7713 processor and 256GB, 384GB and 512GB memory configurations. We started off by running synthetic benchmarks with tools such as <a href="https://www.cs.virginia.edu/stream/">STREAM</a> to ensure that none of the configurations performed unexpectedly poorly and to generate a baseline understanding of the performance.</p><p>After the synthetic benchmarks, we proceeded to test the various configurations with production workloads to empirically determine the optimal quantity. We use Prometheus and Grafana to gather and display a rich set of metrics from all of our servers so that we can monitor and spot trends, and we re-used the same infrastructure for our performance analysis.</p><p>As well as measuring available memory, previous experience has shown us that one of the best ways to ensure that we have enough memory is to observe request latency and disk IO performance. If there is insufficient memory, we expect request latency and disk IO volume and latency to increase. The reason for this is that our core HTTP server uses memory to cache web assets and if there is insufficient memory the assets will be ejected from memory prematurely and more assets will be fetched from disk instead of memory, degrading performance.</p><p>Like most things in life, it’s a balancing act. We want enough memory to take advantage of the fact that serving web assets directly from memory is much faster than even the best NVMe disks. We also want to future proof our platform to enable new features such as the ones that we recently announced in <a href="/tag/security-week/">security week</a> and <a href="/tag/developer-week/">developer week</a>. However, we don’t want to spend unnecessarily on excess memory that will never be used. We found that the 512GB configuration did not provide a performance boost sufficient to justify the extra cost and settled on the 384GB configuration.</p><p>We also tested the performance impact of switching from DDR4-2933 to DDR4-3200 memory. We found that it provided a performance boost of several percent and the pricing has improved to the point where it is cost beneficial to make the change.</p>
    <div>
      <h2>Disk</h2>
      <a href="#disk">
        
      </a>
    </div>
    <p>Previously: 3x Samsung PM983 x 960GB</p><p>Now: 2x Samsung PM9A3 x 1.92TB</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1bzxh2LTLqrPAHlIWynt9t/10c8fe856bf9e3a6feed7a8624451c66/20210722_113829.jpeg.jpeg" />
            
            </figure><p>We validated samples by studying the manufacturer’s data sheets and <a href="https://fio.readthedocs.io/en/latest/fio_doc.html">testing using fio</a> to ensure that the results obtained in our test environment were in line with the published specifications. We also developed an automation framework to help compare different drive models using fio. The framework helps us to restore the drives close to factory settings, precondition the drives, perform the sequential and random tests in our environment, and analyze the data to evaluate bandwidth and latency. Since our SSD samples arrived in our test center in different months, the automated framework sped up evaluations by reducing the time spent testing and analyzing.</p><p>For Gen 11 we decided to move to a 2x 2TB configuration from the original 3x 1TB configuration, giving us an extra 1 TB of storage. This also meant we could use the higher performance of a 2TB drive and save around 6W of power since there is one less SSD.</p><p>After analyzing the performance, latency and endurance of various 2TB drives, we chose Samsung’s PM9A3 SSDs as our Gen11 drives. The results we obtained below were consistent with the manufacturer's claims.</p><p>Sequential performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yOK70l1XM5JJij5aeUHTQ/5653a3e28925e06407b88193780acd77/pasted-image-0-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nifB3rxB1gHWX6JBxBhfY/707192c1a778c46a15aff4b419d42c13/pasted-image-0--1--2.png" />
            
            </figure><p>Random Performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/A8p9s1IjM4UnmNmaMPx0T/358e01b8832c5e1ed7c1479a05c6cd5a/pasted-image-0--2-.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61rhIo4w8sSIYbUiGfHRn9/d07ed49c5b957e3b1dde1db58aa1e484/pasted-image-0--3-.png" />
            
            </figure><p>Compared to our previous generation drives, we could see a 1.5x - 2x improvement in read and write bandwidths. The higher values for the PM9A3 can be attributed to the fact that these are PCIe 4.0 drives, have more intelligent SSD controllers and an upgraded NAND architecture.</p>
    <div>
      <h3>Network</h3>
      <a href="#network">
        
      </a>
    </div>
    <p>Previously: Mellanox ConnectX-4 dual-port 25G</p><p>Now: Mellanox ConnectX-4 dual-port 25G</p><p>There is no change on the network front; the Mellanox ConnectX-4 is a solid performer which continues to meet our needs. We investigated higher speed Ethernet, but we do not currently see this as beneficial. Cloudflare’s network is built on cheap commodity hardware and the highly distributed nature of Cloudflare’s network means we don’t have discrete DDoS scrubbing centres. All points of presence operate as scrubbing centres. This means that we distribute the load across our entire network and do not need to employ higher speed and more expensive Ethernet devices.</p>
    <div>
      <h3>Open source firmware</h3>
      <a href="#open-source-firmware">
        
      </a>
    </div>
    <p>Transparency, security and integrity are absolutely critical to us at Cloudflare. Last year, we described how we had <a href="/anchoring-trust-a-hardware-secure-boot-story/">deployed Platform Secure Boot</a> to create trust that we were running the software that we thought we were.</p><p>Now, we are pleased to announce that we are deploying open source firmware to our servers using OpenBMC. With access to the source code, we have been able to configure BMC features such as the fan PID controller, recording and exposing BIOS POST codes, and managing networking ports and devices. Prior to OpenBMC, requesting these features from our vendors led to varying results and misunderstandings of the scope and capabilities of the BMC. After working with the BMC source code much more directly, we have the flexibility to work on features ourselves to our liking, or understand why the BMC is incapable of running our desired software.</p><p>Whilst our current BMC is an industry standard, we feel that OpenBMC better suits our needs and gives us advantages such as allowing us to deal with upstream security issues without a dependency on our vendors. Some opportunities with security include integration of desired authentication modules, usage of specific software packages, staying up to date with the latest Linux kernel, and controlling a variety of attack vectors. Because we have a kernel lockdown implemented, flashing tooling is difficult to use in our environment. With access to the source code of the flashing tools, we understand what the tools need access to, and can assess whether this meets our standard of security.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/01wz0PNamn0NdOlkgxbgUs/104ce7600fe455376a0e39538c929882/IMG_4195.jpeg.jpeg" />
            
            </figure><p>The jump between our generation 9 and generation 10 servers was enormous. To summarise, we changed from a dual-socket Intel platform to a single socket AMD platform. We upgraded the SATA SSDs to NVMe storage devices, and physically the multi-node chassis changed to a 1U form factor.</p><p>At the start of the generation 11 project we weren’t sure if we would be making such radical changes again. However, after thorough testing of the latest chips and a review of how well the generation 10 server has performed in production for over a year, our generation 11 server built upon the solid foundations of generation 10 and ended up as a refinement rather than a total revamp. Despite this, and bearing in mind that performance varies by time of day and geography, we are pleased that generation 11 is capable of serving approximately 29% more requests than generation 10 without an increase in power consumption.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1hWA14j4FrNG6KYDaHVFI4/30172744bf9b0d94a08c8a8abdb34b3f/Screenshot-2021-06-22-at-15.27.27.png" />
            
            </figure><p>Thanks to Denny Mathew and Ryan Chow for their work on benchmarking and OpenBMC, respectively.</p><p>If you are interested in working with bleeding edge hardware, open source server firmware, solving interesting problems, helping to improve our performance, and are interested in helping us work on our generation 12 server platform (amongst many other things!), <a href="https://www.cloudflare.com/careers/jobs/?department=Infrastructure&amp;location=default">we’re hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">1c9Tnv6ewSXOVymWrMDUzF</guid>
            <dc:creator>Chris Howells</dc:creator>
        </item>
        <item>
            <title><![CDATA[Branch predictor: How many "if"s are too many? Including x86 and M1 benchmarks!]]></title>
            <link>https://blog.cloudflare.com/branch-predictor/</link>
            <pubDate>Thu, 06 May 2021 13:00:00 GMT</pubDate>
            <description><![CDATA[ Is it ok to have if clauses that will basically never be run? Surely, there must be some performance cost to that... ]]></description>
            <content:encoded><![CDATA[ <p>Some time ago I was looking at a hot section in our code and I saw this:</p>
            <pre><code>
if (debug) {
    log("...");
}
    </code></pre>
            <p>This got me thinking. This code is in a performance critical loop and it looks like a waste - we never run with the "debug" flag enabled<sup>[</sup><a href="#footnotes"><sup>1</sup></a><sup>].</sup> Is it ok to have <code>if</code> clauses that will basically never be run? Surely, there must be some performance cost to that...</p>
    <div>
      <h3>Just how bad is peppering the code with avoidable <code>if</code> statements?</h3>
      <a href="#just-how-bad-is-peppering-the-code-with-avoidable-if-statements">
        
      </a>
    </div>
    <p>Back in the day, the general rule was: a fully predictable branch has close to zero CPU cost.</p><p>To what extent is this true? If one branch is fine, then how about ten? A hundred? A thousand? When does adding one more <code>if</code> statement become a bad idea?</p><p>At some point the negligible cost of simple branch instructions surely adds up to a significant amount. As another example, a colleague of mine found this snippet in our production code:</p>
            <pre><code>
const char *getCountry(int cc) {
    if(cc == 1) return "A1";
    if(cc == 2) return "A2";
    if(cc == 3) return "O1";
    if(cc == 4) return "AD";
    if(cc == 5) return "AE";
    if(cc == 6) return "AF";
    if(cc == 7) return "AG";
    if(cc == 8) return "AI";
    ...
    if(cc == 252) return "YT";
    if(cc == 253) return "ZA";
    if(cc == 254) return "ZM";
    if(cc == 255) return "ZW";
    if(cc == 256) return "XK";
    if(cc == 257) return "T1";
    return "UNKNOWN";
}
        </code></pre>
            <p>Obviously, this code could be improved<sup>[</sup><a href="#footnotes"><sup>2</sup></a><sup>]</sup>. But when I thought about it more: <i>should</i> it be improved? Is there an actual performance hit of a code that consists of a series of simple branches?</p>
    <div>
      <h3>Understanding the cost of jump</h3>
      <a href="#understanding-the-cost-of-jump">
        
      </a>
    </div>
    <p>We must start our journey with a bit of theory. We want to figure out if the CPU cost of a branch increases as we add more of them. As it turns out, assessing the cost of a branch is not trivial. On modern processors it takes between one and twenty CPU cycles. There are at least four categories of control flow instructions<sup>[</sup><a href="#footnotes"><sup>3</sup></a><sup>]</sup>: unconditional branch (jmp on x86), call/return, conditional branch (e.g. je on x86) taken and conditional branch not taken. The taken branches are especially problematic: without special care they are inherently costly - we'll explain this in the following section. To bring down the cost, modern CPUs try to predict the future and figure out the branch <b>target</b> before the branch is actually fully executed! This is done in a special part of the processor called the branch predictor unit (BPU).</p><p>The branch predictor attempts to figure out a destination of a branching instruction very early and with very little context. This magic happens <b>before</b> the "decoder" pipeline stage and the predictor has very limited data available. It only has some past history and the address of the current instruction. If you think about it - this is super powerful. Given only the current instruction pointer, it can assess, with very high confidence, where the target of the jump will be.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1K3IOo5qkOOP0hhRtwu10U/e9997acf855e96044f2684e1adae87b0/pasted-image-0-1.png" />
            
            </figure><p>Source: <a href="https://en.wikipedia.org/wiki/Branch_predictor">https://en.wikipedia.org/wiki/Branch_predictor</a></p><p>The BPU maintains a couple of data structures, but today we'll focus on Branch Target Buffer (BTB). It's a place where the BPU remembers the target instruction pointer of previously taken branches. The whole mechanism is much more complex, take a look a the <a href="http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf">Vladimir Uzelac's Master thesis</a> for details about branch prediction on CPU's from 2008 era:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FO1GCWdkQYR5I90qGopuH/9f51e10ca53eafd001f2238896103327/pasted-image-0--1-.png" />
            
            </figure><p>For the scope of this article we'll simplify and focus on the BTB only. We'll try to show how large it is and how it behaves under different conditions.</p>
    <div>
      <h3>Why is branch prediction needed?</h3>
      <a href="#why-is-branch-prediction-needed">
        
      </a>
    </div>
    <p>But first, why is branch prediction used at all? In order to get the best performance, the CPU pipeline must be fed a constant flow of instructions. Consider what happens to the multi-stage CPU pipeline on a branch instruction. To illustrate, let's consider the following ARM program:</p>
            <pre><code>
    BR label_a;
    X1
    ...
label_a:
    Y1
    </code></pre>
            <p>Assuming a simplistic CPU model, the operations would flow through the pipeline like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LYtd8rNXEQA2X2pOPwmvd/e9bbdd4d54a72dc78730684252d01a84/3P6PIWN6gPAdYzP8oDgrsaOMKgUmG51zIiFhbm071cZKM276S7vRb5atpTwlKrM1lFHRYsobw8P4e-Z9t1Vb9TGeutpBe2CkMrNGruWO8yb5Qz0vZ6Qn6RbOi5Tp.png" />
            
            </figure><p>In the first cycle the BR instruction is fetched. This is an unconditional branch instruction changing the execution flow of the CPU. At this point it's not yet decoded, but the CPU would like to fetch another instruction already! Without a branch predictor in cycle 2 the fetch unit either has to wait or simply continues to the next instruction in memory, hoping it will be the right one.</p><p>In our example, instruction X1 is fetched even though this isn't the correct instruction to run. In cycle 4, when the branch instruction finishes the execute stage, the CPU will be able to understand the mistake, and roll back the speculated instructions before they have any effect. At this point the fetch unit is updated to correctly get the right instruction - Y1 in our case.</p><p>This situation of losing a number of cycles due to fetching code from an incorrect place is called a "frontend bubble". Our theoretical CPU has a two-cycle frontend bubble when a branch target wasn’t predicted right.</p><p>In this example we see that, although the CPU does the right thing in the end, without good branch prediction it wasted effort on bad instructions. In the past, various techniques have been used to reduce this problem, such as static branch prediction and branch delay slots. But the dominant CPU designs today rely <a href="https://danluu.com/branch-prediction/#one-bit">on <i>dynamic branch prediction</i></a>. This technique is able to mostly avoid the frontend bubble problem, by predicting the correct address of the next instruction even for branches that aren’t fully decoded and executed yet.</p>
    <div>
      <h3>Playing with the BTB</h3>
      <a href="#playing-with-the-btb">
        
      </a>
    </div>
    <p>Today we're focusing on the BTB - a data structure managed by the branch predictor responsible for figuring out a target of a branch. It's important to note that the BTB is distinct from and independent of the system assessing if the branch was taken or not taken. Remember, we want to figure out if a cost of a branch increases as we run more of them.</p><p>Preparing an experiment to stress only the BTB is relatively simple (<a href="https://xania.org/201602/bpu-part-three">based on Matt Godbolt's work</a>). It turns out a sequence of unconditional <code>jmps</code> is totally sufficient. Consider this x86 code:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/259y50RQfLqaFcsT4tljGb/5010158442d4e2afb6ebb3c105ee4e5b/pasted-image-0--2-.png" />
            
            </figure><p>This code stresses the BTB to an extreme - it just consists of a chain of <code>jmp +2</code> statements (i.e. literally jumping to the next instruction). In order to avoid wasting cycles on frontend pipeline bubbles, each taken jump needs a BTB hit. This branch prediction must happen very early in the CPU pipeline, before instruction decode is finished. This same mechanism is needed for any taken branch, whether it's unconditional, conditional or a function call.</p><p>The code above was run inside a test harness that measures how many CPU cycles elapse for each instruction. For example, in this run we're measuring the timings of a dense sequence - one jmp every two bytes - of 1024 jmp instructions, one after another:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3QofFzm1kmqFLuRrFPoyjQ/857bfbfe366720bd480653388f97eac8/pasted-image-0--3-.png" />
            
            </figure><p>We’ll look at the results of experiments like this for a few different CPUs. But in this instance, it was run on a machine with an AMD EPYC 7642. Here, the cold run took 10.5 cycles per jmp, and then all subsequent runs took ~3.5 cycles per jmp. The code is prepared in such a way as to make sure it's the BTB that is slowing down the first run. Take a look at the full code; there is quite some magic to warm up the L1 cache and iTLB without priming the BTB.</p><p><b>Top tip 1. On this CPU a branch instruction that is taken but not predicted costs ~7 cycles more than one that is taken and predicted.</b> Even if the branch was unconditional.</p>
    <div>
      <h3>Density matters</h3>
      <a href="#density-matters">
        
      </a>
    </div>
    <p>To get a full picture we also need to think about the density of jmp instructions in the code. The code above did eight jmps per 16-byte code block. This is a lot. For example, the code below contains one jmp instruction in each block of 16 bytes. Notice that the <code>nop</code> opcodes are jumped over. The block size doesn't change the number of executed instructions, only the code density:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2uenA2zFAkPWQZldVX7gba/8823f9ec4052618dc06c21d0621a6581/pasted-image-0--4-.png" />
            
            </figure><p>Varying the jmp block size might be important. It allows us to control the placement of the jmp opcodes. Remember the BTB is indexed by instruction pointer address. Its value and its alignment might influence the placement in the BTB and help us reveal the BTB layout. Increasing the alignment will cause more nop padding to be added. I will call the sequence of a single measured instruction - jmp in this case - followed by zero or more nops a "block", and its size the "block size". Notice that the larger the block size, the larger the working code size for the CPU. At larger values we might see some performance drop due to exhausting L1 cache space.</p>
    <div>
      <h3>The experiment</h3>
      <a href="#the-experiment">
        
      </a>
    </div>
    <p>Our experiment is crafted to show the performance drop depending on the number of branches, across different working code sizes. Hopefully, we will be able to prove the performance is mostly dependent on the number of blocks - and therefore the BTB size, and not the working code size.</p><p>See the <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-05-branch-prediction">code on GitHub</a>. If you want to see the generated machine code, though, you need to run a special command. It's created procedurally by the code, customized by passed parameters. Here's an example <code>gdb</code> incantation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aAKMXz8aruwblN67UvZd7/1eb2aea7cc6e2e3a9c0f6823f60b8f08/pasted-image-0--5-.png" />
            
            </figure><p>Let's take this experiment further: what if we took the best times of each run - with a fully primed BTB - for varying jmp block sizes and numbers of blocks, i.e. working set sizes? Here you go:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZNeolS38Q90vdHPvehvVa/64e0c16d163d74c05cfbaee6f8d9e15c/pasted-image-0--6-.png" />
            
            </figure><p>This is an astonishing chart. First, it's obvious something happens at the 4096 jmp mark[<a href="#footnotes">4</a>] regardless of how large the jmp block size is - that is, how many nops we skip over. Reading it aloud:</p><ul><li><p>On the far left, we see that if the amount of code is small enough - less than 2048 bytes (256 times a block of 8 bytes) - it's possible to hit some kind of uop/L1 cache and get ~1.5 cycles per fully predicted branch. This is amazing.</p></li><li><p>Otherwise, if you keep your hot loop to 4096 branches then, no matter how dense your code is, you are likely to see ~3.4 cycles per fully predicted branch.</p></li><li><p>Above 4096 branches the branch predictor gives up and the cost of each branch shoots to ~10.5 cycles per jmp. This is consistent with what we saw above - an unpredicted branch on a flushed BTB took ~10.5 cycles.</p></li></ul><p>Great, so what does it mean? Well, you should avoid branch instructions if you want to avoid branch misses because you have at most 4096 fast BTB slots. This is not very pragmatic advice though - it's not like we deliberately put many unconditional <code>jmp</code>s in real code!</p><p>There are a couple of takeaways for the discussed CPU. I repeated the experiment with an always-taken conditional branch sequence and the resulting chart looks almost identical. The only difference is that the predicted-taken conditional je instruction is 2 cycles slower than an unconditional jmp.</p><p>An entry is added to the BTB wherever a branch is "taken" - that is, the jump actually happens. An unconditional "jmp" or an always-taken conditional branch will cost a BTB slot. To get the best performance, make sure not to have more than 4096 taken branches in the hot loop. The good news is that branches never-taken don't take space in the BTB. We can illustrate this with another experiment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gd2JKAwYMpgk5obd4E0Hz/29876ce9f45dc01edbb8e3966bcb9903/pasted-image-0--7-.png" />
            
            </figure><p>This boring code goes over a not-taken <code>jne</code> followed by two nops (block size=4). Armed with this test (jne never-taken), the previous one (jmp always-taken) and a conditional branch <code>je</code> always-taken, we can draw this chart:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1F2STBl6iw6OpkwqmIdlR9/1de627e490ddd75816bb3899f99f6ed7/pasted-image-0--8-.png" />
            
            </figure><p>First, without any surprise we can see the conditional 'je always-taken' is getting slightly more costly than the simple unconditional <code>jmp</code>, but only after the 4096 branches mark. This makes sense: the conditional branch is resolved later in the pipeline so the frontend bubble is longer. Then take a look at the blue line hovering near zero. This is the "jne never-taken" line flat at 0.3 clocks / block, no matter how many blocks we run in sequence. The takeaway is clear - you can have as many never-taken branches as you want, without incurring any cost. There isn't any spike at the 4096 mark, meaning the BTB is not used in this case. It seems a conditional jump not seen before is predicted to be not-taken.</p><p><b>Top tip 2: conditional branches never-taken are basically free</b> - at least on this CPU.</p><p>So far we established that branches always-taken occupy the BTB, branches never taken do not. How about other control flow instructions, like the <code>call</code>?</p><p>I haven't been able to find this in the literature, but it seems call/ret also need a BTB entry for best performance. I was able to illustrate this on our AMD EPYC. Let's take a look at this test:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fHArGj5TWvRM9AEM84NLJ/1951dfe42651b923551d90dddc63964f/pasted-image-0--9-.png" />
            
            </figure><p>This time we'll issue a number of <code>callq</code> instructions followed by <code>ret</code> - both of which should be fully predicted. The experiment is crafted so that each callq calls a unique function, to allow for retq prediction - each one returns to exactly one caller.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/677H5BdxwzZ9B6hbebRFnB/a1ac1602d5efab51152a6219fde87579/pasted-image-0--10-.png" />
            
            </figure><p>This chart confirms the theory: no matter the code density - with the exception of the 64-byte block size being notably slower - the cost of a predicted call/ret starts to deteriorate after the 2048 mark. At this point the BTB is filled with call and ret predictions and can't hold any more entries. This leads to an important conclusion:</p><p><b>Top tip 3. In the hot code you want to have fewer than 2K function calls</b> - on this CPU.</p><p>On our test CPU a sequence of fully predicted call/ret takes about 7 cycles, which is about the same as two unconditional predicted <code>jmp</code> opcodes. This is consistent with our results above.</p><p>So far we have thoroughly examined the AMD EPYC 7642. We started with this CPU because its branch predictor is relatively simple and the charts were easy to read. It turns out more recent CPUs are less clear-cut.</p>
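For the curious, the shape of these experiments can be sketched in a few lines of Python. This is my illustrative reconstruction, not the original benchmark harness: it emits the nop-padded branch blocks described above into an executable buffer and times calls to the chain. The execution part assumes an x86-64 Linux host, and the timings will only resemble the charts on a similar CPU.

```python
import ctypes
import mmap
import platform
import time

def encode_branch_chain(blocks, block_size, kind="jmp"):
    """Emit `blocks` blocks of `block_size` bytes each: a short branch
    skipping over nop padding, with a final ret. kind="jmp" is the
    always-taken unconditional branch; kind="jne" is a never-taken
    conditional one (a leading xor sets ZF, so jne always falls through)."""
    assert 2 <= block_size <= 129  # a rel8 branch can only skip 127 bytes
    code = bytearray()
    opcode = 0xEB                      # jmp rel8
    if kind == "jne":
        code += bytes([0x31, 0xC0])    # xor eax, eax -> ZF=1
        opcode = 0x75                  # jne rel8, never taken while ZF=1
    for _ in range(blocks):
        code += bytes([opcode, block_size - 2])  # branch over the padding
        code += b"\x90" * (block_size - 2)       # nop filler
    code += b"\xC3"                    # ret
    return bytes(code)

def time_chain(code, calls=1000):
    """Run the chain from an executable mapping; returns ns per call."""
    buf = mmap.mmap(-1, len(code),
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf.write(code)
    fn = ctypes.CFUNCTYPE(None)(
        ctypes.addressof(ctypes.c_char.from_buffer(buf)))
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn()
    return (time.perf_counter_ns() - start) / calls

if platform.system() == "Linux" and platform.machine() == "x86_64":
    try:  # some hardened kernels/sandboxes forbid writable+executable maps
        for blocks in (256, 2048, 8192):  # below, at, and above the BTB size
            ns = time_chain(encode_branch_chain(blocks, 8))
            print(f"{blocks:5d} taken jmps: {ns / blocks:.2f} ns/branch")
    except OSError:
        pass
```

On an EPYC 7642 the per-branch time should roughly track the three regimes described above; on other CPUs the knee points will move.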
    <div>
      <h3>AMD EPYC 7713</h3>
      <a href="#amd-epyc-7713">
        
      </a>
    </div>
    <p>The newer AMD is more complex than the previous generations. Let's run the two most important experiments. First, the <code>jmp</code> one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12g3Zg18eWyHIwyUuf63gj/0cb89d1fc7a8fedbed929aa455e11dbf/pasted-image-0--11-.png" />
            
            </figure><p>For the always-taken branches case we can see very good, sub-1-cycle timings when the number of branches doesn't exceed 1024 and the code isn't too dense.</p><p><b>Top tip 4. On this CPU it's possible to get &lt;1 cycle per predicted jmp when the hot loop fits in ~32KiB.</b></p><p>Then there is some noise starting after the 4096 jmp mark, followed by a complete drop in speed at about 6000 branches. This is in line with the theory that the BTB holds 4096 entries. We can speculate that some other prediction mechanism successfully kicks in beyond that, and keeps the performance up to the ~6k mark.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4X5KfDDT36u9u2MGb7iu7t/90c4fbf3d04cd386ed6eecae645f25f8/pasted-image-0--12-.png" />
            
            </figure><p>The call/ret chart tells a similar tale: the timings start breaking after the 2048 mark, and prediction fails completely beyond ~3000.</p>
    <div>
      <h3>Xeon Gold 6262</h3>
      <a href="#xeon-gold-6262">
        
      </a>
    </div>
    <p>The Intel Xeon looks different from the AMD CPUs:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hIriKLopeE26yIor9qGG/9ab0d56fd18cfd4a782b7270845bc66f/pasted-image-0--13-.png" />
            
            </figure><p>Our test shows the predicted taken branch costs 2 cycles. Intel has documented a clock penalty for very dense branching code - this explains the 4-byte block size line hovering at ~3 cycles. The branch cost breaks at the 4096 jmp mark, confirming the theory that the Intel BTB can hold 4096 entries. The 64-byte block size chart looks confusing, but really isn't. The branch cost stays flat at 2 cycles up to the 512 jmp count, then it increases. This is caused by the internal layout of the BTB, which is said to be 8-way associative. It seems that with the 64-byte block size we can utilize at most half of the 4096 BTB slots.</p><p><b>Top tip 5. On Intel avoid placing your jmp/call/ret instructions at regular 64-byte intervals.</b></p><p>Then the call/ret chart:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1XU4ofCRG3wfjLZlwsEhBv/2df1b19f6623a26be745caf35c28ee33/pasted-image-0--14-.png" />
            
            </figure><p>Similarly, we can see the branch predictions failing after the 2048 jmp mark - in this experiment one block uses two flow control instructions: call and ret. This again confirms the BTB size of 4K entries. The 64-byte block size is generally slower due to the nop padding, but also breaks down sooner due to the instruction alignment issue. Notice that we didn't see this effect on AMD.</p>
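Why would regular 64-byte spacing hurt? Here is a toy model of my own (the real BTB index function is undocumented; I'm assuming the set index is taken from a few low address bits, which is enough to reproduce the aliasing effect qualitatively): a 4096-entry, 8-way BTB has 512 sets, and branches placed at regular 64-byte intervals can only ever land in a fraction of them.

```python
def btb_sets_touched(spacing, num_branches, sets=512, index_shift=5):
    """Toy model of BTB set aliasing. Branch i sits at address i*spacing;
    the set index is assumed (!) to be the address bits starting at
    `index_shift`. A 4096-entry, 8-way BTB has 512 sets of 8 entries."""
    return len({((i * spacing) >> index_shift) % sets
                for i in range(num_branches)})

# Branches every 64 bytes alias into half of the sets, so only
# 256 sets * 8 ways = 2048 of the 4096 slots are ever reachable.
print(btb_sets_touched(64, 4096))   # 256
# Slightly irregular spacing spreads the same branches over all 512 sets.
print(btb_sets_touched(68, 4096))   # 512
```

The index bits are a guess, but the model matches the observation above: with 64-byte spacing only about half of the BTB capacity is usable.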
    <div>
      <h3>Apple Silicon M1</h3>
      <a href="#apple-silicon-m1">
        
      </a>
    </div>
    <p>So far we have seen examples of AMD and Intel server-grade CPUs. How does an Apple Silicon M1 fit into this picture?</p><p>We expect it to be very different - it's designed for mobile and uses the ARM64 architecture. Let's run our two experiments:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bJ9lzYsIARltRCkOEJVZZ/6b1b5edcdf00e3c0a3c353fff72c8232/pasted-image-0--15-.png" />
            
            </figure><p>The predicted <code>jmp</code> test tells an interesting story. First, when the code fits in 4096 bytes (1024*4 or 512*8, etc.) you can expect a predicted <code>jmp</code> to cost 1 clock cycle. This is an excellent score.</p><p>Beyond that, you can generally expect a cost of 3 clock cycles per predicted jmp. This is also very good. It starts to deteriorate when the working code grows beyond ~192KiB. This is visible with block size 64 breaking at the 3072 mark (3072*64 = 192KiB), and with block size 32 at 6144 (6144*32 = 192KiB). At this point the prediction seems to stop working. The documentation indicates that the M1 CPU has 192KB of L1 instruction cache - our experiment matches that.</p><p>Let's compare the "predicted jmp" chart with the "unpredicted jmp" one. Take this chart with a grain of salt, because flushing the branch predictor is notoriously difficult.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2VCtlgVdJkSB3MFuYbR5MQ/31b80611cab5b7a5ec14514dd375e7ea/pasted-image-0--16-.png" />
            
            </figure><p>However, even if we don't trust the flush-bpu code (<a href="https://xania.org/201602/bpu-part-three">adapted from Matt Godbolt</a>), this chart reveals two things. First, the "unpredicted" branch cost seems to be correlated with the branch distance: the longer the branch, the costlier it is. We haven't seen such behaviour on x86 CPUs.</p><p>Then there is the cost itself. We saw what a predicted sequence of branches costs, and what a supposedly-unpredicted jmp costs. In the first chart we saw that beyond ~192KiB of working code, the branch predictor seems to become ineffective. The supposedly-flushed BPU shows the same cost. For example, the cost of a 64-byte block size jmp with a small working set is 3 cycles, and a miss is ~8 cycles. For a large working set both times are ~8 cycles. It seems that the BTB is linked to the L1 cache state. <a href="https://www.realworldtech.com/forum/?threadid=159985&amp;curpostid=160001">Paul A. Clayton suggested</a> the possibility of such a design back in 2016.</p><p><b>Top tip 6. On M1 the predicted-taken branch generally takes 3 cycles, and an unpredicted-but-taken branch has a varying cost depending on the jmp length. The BTB is likely linked with the L1 cache.</b></p><p>The call/ret chart is funny:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/54s7q6CJamAHCjpzDHFm6V/d1db62f4d8ec351ce539709cb6a512cd/pasted-image-0--17-.png" />
            
            </figure><p>As in the chart before, we can see a big benefit if the hot code fits within 4096 bytes (512*4 or 256*8). Otherwise, you can count on 4-6 cycles per call/ret sequence (or bl/ret, as it's known on ARM). The chart shows curious alignment issues; it's unclear what causes them. Beware: comparing the numbers in this chart with x86 is unfair, since the ARM <code>call</code> operation differs substantially from the x86 variant.</p><p>The M1 seems pretty fast, with predicted branches usually at 3 clock cycles. Even unpredicted branches never cost more than 8 ticks in our benchmark. A call+ret sequence in dense code should fit under 5 cycles.</p>
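The measurements so far can be boiled down to a back-of-envelope helper. The thresholds and cycle counts below are simply the ones measured in this article for the EPYC 7642 and the M1 - they are illustrative rules of thumb, not general truths:

```python
def branch_cost_epyc7642(taken_branches, code_bytes):
    """Approximate cycles per predicted-taken branch (AMD EPYC 7642,
    per the measurements above)."""
    if code_bytes <= 2048:
        return 1.5            # fits the uop/L1 fast path
    if taken_branches <= 4096:
        return 3.4            # all taken branches fit in the BTB
    return 10.5               # BTB overflow: cost of an unpredicted branch

def branch_cost_m1(code_bytes):
    """Approximate cycles per predicted-taken branch (Apple M1)."""
    if code_bytes <= 4096:
        return 1.0
    if code_bytes <= 192 * 1024:
        return 3.0            # still within the 192KB L1 instruction cache
    return 8.0                # prediction effectively stops working

# The M1 knee we saw: 3072 blocks of 64 bytes is exactly the L1 size.
print(branch_cost_m1(3072 * 64))       # 3.0 (at the edge)
print(branch_cost_m1(3072 * 64 + 64))  # 8.0 (just past it)
```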
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>We started our journey from a piece of trivial code, and asked a basic question: how costly is adding a never-taken <code>if</code> branch in the hot portion of code?</p><p>Then we quickly dived into very low-level CPU features. By the end of this article, hopefully, an astute reader will have gained a better intuition for how a modern branch predictor works.</p><p>On x86 the hot code needs to split the BTB budget between function calls and taken branches. The BTB holds only 4096 entries. There are strong benefits in keeping the hot code under 16KiB.</p><p>On the other hand, on M1 the BTB seems to be limited by the L1 instruction cache. If you're writing super hot code, ideally it should fit in 4KiB.</p><p>Finally, can you add that one more <code>if</code> statement? If it's never-taken, it's probably OK. I found no evidence that such branches incur any extra cost. But do avoid always-taken branches and function calls.</p><p><b>Sources</b></p><p>I'm not the first person to investigate how the BTB works. I based my experiments on:</p><ul><li><p><a href="http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf">Vladimir Uzelac's thesis</a></p></li><li><p><a href="https://xania.org/201602/bpu-part-three">Matt Godbolt's work</a> - a series of five articles</p></li><li><p><a href="https://www.realworldtech.com/forum/?threadid=159985&amp;curpostid=159985">Travis Downs' BTB questions</a> on Real World Tech</p></li><li><p><a href="https://stackoverflow.com/questions/38811901/slow-jmp-instruction">various</a> <a href="https://stackoverflow.com/questions/51822731/why-did-intel-change-the-static-branch-prediction-mechanism-over-these-years">stackoverflow</a> <a href="https://stackoverflow.com/questions/38512886/btb-size-for-haswell-sandy-bridge-ivy-bridge-and-skylake">discussions</a>, especially <a href="https://stackoverflow.com/questions/31280817/what-branch-misprediction-does-the-branch-target-buffer-detect">this one</a> and <a href="https://stackoverflow.com/questions/31642902/intel-cpus-instruction-queue-provides-static-branch-prediction">this one</a></p></li><li><p><a href="https://www.agner.org/optimize/microarchitecture.pdf">Agner Fog's</a> microarchitecture guide, which has a good section on branch prediction</p></li></ul>
    <div>
      <h3>Acknowledgements</h3>
      <a href="#acknowledgements">
        
      </a>
    </div>
    <p>Thanks to <a href="/author/david-wragg/">David Wragg</a> and <a href="https://twitter.com/danluu">Dan Luu</a> for technical expertise and proofreading help.</p>
    <div>
      <h3>PS</h3>
      <a href="#ps">
        
      </a>
    </div>
    <p>Oh, oh. But this is not the whole story! Similar research was the basis of the <a href="https://spectreattack.com/spectre.pdf">Spectre v2</a> attack. The attack exploited the little-known fact that the BPU state was not cleared between context switches. With the correct technique it was possible to train the BPU - in the case of Spectre it was the indirect BTB - and force a privileged piece of code to be speculatively executed. This, combined with a cache side-channel data leak, allowed an attacker to steal secrets from the privileged kernel. Powerful stuff.</p><p>A proposed solution was to avoid using a shared BTB. This can be done in two ways: make indirect jumps always fail to predict, or fix the CPU to avoid sharing BTB state across isolation domains. This is a long story, maybe for another time...</p><hr /><p><a>Footnotes</a></p><p>1. One historical solution to this specific 'if debug' problem is called "runtime nop'ing". The idea is to modify the code at runtime and patch the never-taken branch instruction with a <code>nop</code>. For example, see the "ISENABLED" discussion on <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=370906">https://bugzilla.mozilla.org/show_bug.cgi?id=370906</a>.</p><p>2. Fun fact: modern compilers are pretty smart. New gcc (&gt;=11) and older clang (&gt;=3.7) are able to optimize it quite a lot. <a href="https://godbolt.org/z/KWYEW3d9s">See for yourself</a>. But let's not get distracted by that. This article is about low-level machine-code branch instructions!</p><p>3. This is a simplification. There are of course more control flow instructions, like software interrupts, syscalls, and VMENTER/VMEXIT.</p><p>4. OK, I'm slightly overinterpreting the chart. Maybe the 4096 jmp mark is due to a 4096-entry uop cache or some instruction decoder artifact? To prove this spike is indeed BTB related, I looked at the Intel BPU_CLEARS.EARLY and BACLEAR.CLEAR performance counters. Their value is small for block counts under 4096 and large for block counts greater than 5378. This is strong evidence that the performance drop is indeed caused by the BPU, and likely the BTB.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2pvX64jHrEfNLMSmJO1Iv4</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why Cloudflare Chose AMD EPYC for Gen X Servers]]></title>
            <link>https://blog.cloudflare.com/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/</link>
            <pubDate>Sun, 01 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Looking back at this week's posts on the design, specifications, and performance of Cloudflare’s Gen X servers using AMD CPUs. Every server can run every service. This architectural decision has helped us achieve higher efficiency across the Cloudflare network.  ]]></description>
            <content:encoded><![CDATA[ <p>From the very beginning, Cloudflare used Intel CPU-based servers (and also Intel components for things like NICs and SSDs). But we're always interested in optimizing the cost of running our service so that we can provide products at a low cost and high gross margin.</p><p>We're also mindful of events like the <a href="/meltdown-spectre-non-technical/">Spectre and Meltdown</a> vulnerabilities, and have been working with outside parties on research into mitigation and exploitation, which we hope to publish later this year.</p><p>We looked very seriously at <a href="/arm-takes-wing/">ARM-based CPUs</a> and continue to keep our software up to date for the ARM architecture, so that we can use ARM-based CPUs when the requests-per-watt numbers are interesting to us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FIEEXHXXE0TW20DbFc1qG/bd65d80a3f61062de14027374907daf5/gen-x-color-Friday--twitter_2x.png" />
            
            </figure><p>In the meantime, we've deployed AMD's EPYC processors as part of our Gen X server platform and, for the first time, are not using any Intel components at all. This week, we announced details of this tenth generation of servers. Below is a recap of why we're excited about the design, specifications, and performance of our newest hardware.</p>
    <div>
      <h2><a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">Servers for an Accelerated Future</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>Every server can run every service. This architectural decision has helped us achieve higher efficiency across the Cloudflare network. It has also given us more flexibility to build new software or adopt the newest available hardware.</p><p>Notably, Intel is not inside. We are not using their hardware for any major server components such as the CPU, board, memory, storage, network interface card (or any type of accelerator).</p><p>This time, AMD is inside.</p><p>Compared with our prior server (<a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a>), our Gen X server processes as much as 36% more requests while costing substantially less. Additionally, it enables a ~50% decrease in L3 cache miss rate and up to 50% decrease in NGINX p99 latency, powered by a CPU rated at 25% lower TDP (thermal design power) per core.</p>
    <div>
      <h2><a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">Gen X CPU benchmarks</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>To identify the most efficient CPU for our software stack, we ran several benchmarks for key workloads such as cryptography, compression, regular expressions, and LuaJIT. Then, we simulated typical requests we see, before testing servers in live production to measure requests per watt.    </p><p>Based on our findings, we selected the single socket 48-core AMD 2nd Gen EPYC 7642.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/40QrPSva2FXZVgsZTln51q/2c15c60f27009a1ad459ae390afc49c6/pasted-image-0.png" />
            
            </figure>
    <div>
      <h2><a href="/impact-of-cache-locality/">Impact of Cache Locality</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>The single AMD EPYC 7642 performed very well during our lab testing, beating our <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a> server with dual Intel Xeon Platinum 6162 with the same total number of cores. Key factors we noticed were its large L3 cache, which led to a low L3 cache miss rate, as well as a higher sustained operating frequency.</p>
    <div>
      <h2><a href="/gen-x-performance-tuning/">Gen X Performance Tuning</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>Partnering with AMD, we tuned the 2nd Gen EPYC 7642 processor to achieve an additional 6% performance. We achieved this by using power determinism and configuring the CPU's Thermal Design Power (TDP).</p>
    <div>
      <h2><a href="/securing-memory-at-epyc-scale/">Securing Memory at EPYC Scale</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>Finally, we described how we use Secure Memory Encryption (SME), an interesting security feature within the System on a Chip architecture of the AMD EPYC line. We were impressed by how we could achieve RAM encryption without a significant decrease in performance. This reduces the worry that any data could be exfiltrated from a stolen server.</p><p>We enjoy designing hardware that improves the security, performance and reliability of our global network, trusted by over 26 million Internet properties.</p><p>Want to help us evaluate new hardware? <a href="https://www.cloudflare.com/careers/">Join us</a>!</p> ]]></content:encoded>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Gen X]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">6UGZUyfF1u3jKKAHkxgDNq</guid>
            <dc:creator>Nitin Rao</dc:creator>
        </item>
        <item>
            <title><![CDATA[Securing Memory at EPYC Scale]]></title>
            <link>https://blog.cloudflare.com/securing-memory-at-epyc-scale/</link>
            <pubDate>Fri, 28 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we encrypt data both in transit (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information and these measures serve to protect sensitive data from attackers but, what about data currently in use? ]]></description>
            <content:encoded><![CDATA[ <p>Security is a serious business, one that we do not take lightly at Cloudflare. We have <a href="https://www.cloudflare.com/compliance/">invested a lot of effort</a> into ensuring that our services, both external and internal, are protected by meeting or exceeding industry best practices. Encryption is a huge part of our strategy, as it is embedded in nearly every process we have. At Cloudflare, we encrypt data both in <a href="/rfc-8446-aka-tls-1-3/">transit</a> (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information, and these measures serve to <a href="https://www.cloudflare.com/learning/security/what-is-information-security/">protect sensitive data</a> from attackers. But what about data currently in use?</p><p>Can encryption or any technology eliminate all threats? No, but as the Infrastructure Security team, it’s our job to consider worst-case scenarios. For example, what if someone were to steal a server from one of our <a href="https://www.cloudflare.com/network/">data centers</a>? How can we leverage the most reliable, cutting-edge, innovative technology to secure all data on that host if it were in the wrong hands? Would it be protected? And, in particular, what about the server’s RAM?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/F40SGb897EzCafJNfUMxo/9346da5cc95f377fd15c8ebe6d3a7077/mission-impossible-1.png" />
            
            </figure><p>Data in random access memory (RAM) is usually stored in the clear. This can leave data vulnerable to software or hardware probing by an attacker on the system. Extracting data from memory isn’t an easy task but, with the rise of <a href="https://www.snia.org/sites/default/files/SSSI/NVDIMM%20-%20Changes%20are%20Here%20So%20What's%20Next%20-%20final.pdf">persistent memory technologies</a>, additional attack vectors are possible:</p><ul><li><p>Dynamic random-access memory (<a href="https://user.eng.umd.edu/~blj/talks/DRAM-Tutorial-isca2002.pdf">DRAM</a>) interface snooping</p></li><li><p>Installation of hardware devices that access host memory</p></li><li><p><a href="https://en.wikipedia.org/wiki/Cold_boot_attack">Freezing and stealing</a> dual in-line memory module (<a href="https://en.wikipedia.org/wiki/DIMM">DIMMs</a>)</p></li><li><p>Stealing non-volatile dual in-line memory module (<a href="https://en.wikipedia.org/wiki/NVDIMM">NVDIMMs</a>)</p></li></ul><p>So, what about enclaves? Hardware manufacturers have introduced <a href="https://en.wikipedia.org/wiki/Trusted_execution_environment">Trusted Execution Environments</a> (also known as enclaves) to help create security boundaries by isolating software execution at runtime so that sensitive data can be processed in a trusted environment, such as secure area inside an existing processor or <a href="https://en.wikipedia.org/wiki/Trusted_Platform_Module">Trusted Platform Module</a>.</p><p>While this allows developers to shield applications in untrusted environments, it doesn’t effectively address all of the physical system attacks mentioned previously. Enclaves were also meant to run small pieces of code. 
You <i>could</i> run an <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-baumann.pdf">entire OS in an enclave</a>, but there are <a href="https://blog.quarkslab.com/overview-of-intel-sgx-part-1-sgx-internals.html">limitations</a> and performance issues in doing so.</p><p>This isn’t meant to bash enclave usage; we just wanted a better solution for encrypting all memory at scale. We expected performance to be compromised, and conducted tests to see just how much.</p>
    <div>
      <h2>Time to get EPYC</h2>
      <a href="#time-to-get-epyc">
        
      </a>
    </div>
    <p>Since we are using <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">AMD for our tenth generation "Gen X servers"</a>, we found an interesting security feature within the System on a Chip architecture of the AMD EPYC line. Secure Memory Encryption (SME) is an <a href="https://en.wikichip.org/wiki/x86">x86</a> instruction set <a href="https://en.wikichip.org/wiki/x86/extension">extension</a> introduced by AMD and available in the <a href="https://developer.amd.com/sev/">EPYC processor line.</a> SME provides the ability to mark individual pages of memory as encrypted using standard x86 page tables. A page that is marked encrypted will be automatically decrypted when read from DRAM and encrypted when written to DRAM. SME can therefore be used to protect the contents of DRAM from physical attacks on the system.</p><p>Sounds complicated, right? Here’s the secret: it isn’t.</p>
    <div>
      <h2>Components</h2>
      <a href="#components">
        
      </a>
    </div>
    <p>SME comprises two components:</p><ul><li><p><b>AES-128 encryption</b> <b>engine</b>: Embedded in the memory controller. It is responsible for encrypting and decrypting data in main memory when an appropriate key is provided via the Secure Processor.</p></li><li><p><b>AMD Secure Processor (AMD-SP)</b>: An on-die 32-bit <a href="https://developer.arm.com/ip-products/processors/cortex-a/cortex-a5">ARM Cortex A5 CPU</a> that provides cryptographic functionality for secure key generation and key management. Think of this like a mini hardware security module that uses a hardware random number generator to generate the 128-bit key(s) used by the encryption engine.</p></li></ul>
    <div>
      <h2>How It Works</h2>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>We had two options available to us when it came to enabling SME. The first option, regular SME, requires setting a bit in a model-specific register, <code>MSR 0xC001_0010[SMEE]</code>. This enables the page table entry encryption bit:</p><ul><li><p>0 = memory encryption features are disabled</p></li><li><p>1 = memory encryption features are enabled</p></li></ul><p>After memory encryption is enabled, a physical address bit (C-Bit) is utilized to mark whether a memory page is protected. The operating system sets the C-bit of a physical address to 1 in the page table entry (PTE) to indicate the page should be encrypted. This causes any data assigned to that memory space to be automatically encrypted and decrypted by the AES engine in the memory controller:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Y55gr55Ypmf0kCFfSBrNN/d52fa52def521555431c16c76d9a1086/enclavng-diagrams_3x.png" />
            
            </figure>
    <div>
      <h3>Becoming More Transparent</h3>
      <a href="#becoming-more-transparent">
        
      </a>
    </div>
    <p>While arbitrarily flagging which page table entries we want encrypted is nice, our objective is to ensure that we are incorporating the full physical protection of SME. This is where the second mode of SME came in, Transparent SME (TSME). In TSME, all memory is encrypted regardless of the value of the encrypt bit on any particular page. This includes both instruction and data pages, as well as the pages corresponding to the page tables themselves.</p><p>Enabling TSME is as simple as:</p><ol><li><p>Setting a BIOS flag:</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6AqdL0sTjEI4YKjHdwas11/d1f28bfac2bd91a7751fe4fe2aa25109/bios-tsme2.png" />
            
            </figure><p>2. Enabling kernel support with the following flag:</p><p><code>CONFIG_AMD_MEM_ENCRYPT=y</code></p><p>After a reboot you should see the following in <code>dmesg</code>:</p>
            <pre><code>$ sudo dmesg | grep SME
[    2.537160] AMD Secure Memory Encryption (SME) active</code></pre>
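As an extra sanity check (a sketch of our own, not from AMD's documentation), hardware SME capability - as opposed to SME being active, which is what the dmesg line above confirms - is also advertised as an <code>sme</code> flag in <code>/proc/cpuinfo</code> on recent kernels, which is easy to test for:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature-flag set from the first 'flags' line of
    /proc/cpuinfo-style text (empty set if no flags line is present)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

# A trimmed-down sample of what an EPYC host might report:
sample = "processor\t: 0\nflags\t\t: fpu msr sse2 sme sev\n"
print("sme" in cpu_flags(sample))  # True
```

On a real host you would pass in `open("/proc/cpuinfo").read()` instead of the sample string.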
            
    <div>
      <h2>Performance Testing</h2>
      <a href="#performance-testing">
        
      </a>
    </div>
    <p>To weigh the pros and cons of implementation against the potential risk of a stolen server, we had to test the performance of enabling TSME. We took a test server that mirrored a production edge metal with the following specs:</p><ul><li><p>Memory: 8 x 32GB 2933MHz</p></li><li><p>CPU: AMD 2nd Gen EPYC 7642 with SMT enabled and running NPS4 mode</p></li><li><p>OS: Debian 9</p></li><li><p>Kernel: <a href="https://lwn.net/Articles/809568/">5.4.12</a></p></li></ul><p>The performance tools we used were:</p><ul><li><p><a href="http://www.cs.virginia.edu/stream/ref.html">STREAM</a></p></li><li><p><a href="http://man7.org/linux/man-pages/man8/cryptsetup.8.html">Cryptsetup</a></p></li><li><p>Benchmarky</p></li></ul>
    <div>
      <h2>Stream</h2>
      <a href="#stream">
        
      </a>
    </div>
    <p>We used a custom STREAM binary with 24 threads, using all available cores, to measure the sustainable memory bandwidth (in MB/s). Four synthetic computational kernels are run, with the output of each kernel being used as an input to the next. The best rates observed are reported for each choice of thread count.</p>
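For reference, the four STREAM kernels are Copy, Scale, Add and Triad. A plain-Python rendition (purely illustrative - the real benchmark is threaded C operating on arrays far larger than any cache, and reports MB/s rather than values) looks like this:

```python
from array import array

def stream_kernels(n=100_000, q=3.0):
    """The four STREAM kernels; each kernel's output feeds the next one."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    for i in range(n):
        c[i] = a[i]              # Copy:  c = a
    for i in range(n):
        b[i] = q * c[i]          # Scale: b = q*c
    for i in range(n):
        c[i] = a[i] + b[i]       # Add:   c = a+b
    for i in range(n):
        a[i] = b[i] + q * c[i]   # Triad: a = b+q*c
    return a, b, c

a, b, c = stream_kernels(n=1000)
print(a[0], b[0], c[0])  # 15.0 3.0 4.0
```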
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1kwLMDcfqdWZ0dP9dgpbBO/a8ac5a90cd57a212539df64002fd1a04/Stream-SME-On-vs-Off-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3C8c6Nfn0cScGA0dt4yTaN/1bf1a1f32cceefac591aab63a1ab1e1c/Stream-Performance-delta-1.png" />
            
            </figure><p>The figures above show a 2.6% to 4.2% performance variation, with a mean of 3.7%. These were the highest numbers measured, and they still fell below the expected performance impact of &gt;5%.</p>
    <div>
      <h2>Cryptsetup</h2>
      <a href="#cryptsetup">
        
      </a>
    </div>
    <p>While cryptsetup is normally used for encrypting disk partitions, it has a benchmarking utility that will report on a host’s cryptographic performance by iterating key derivation functions using memory only:</p><p>Example:</p>
            <pre><code>$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1162501 iterations per second for 256-bit key
PBKDF2-sha256    1403716 iterations per second for 256-bit key
PBKDF2-sha512    1161213 iterations per second for 256-bit key
PBKDF2-ripemd160  856679 iterations per second for 256-bit key
PBKDF2-whirlpool  661979 iterations per second for 256-bit key</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BC1HclFkbzijWahfsXLxH/67dec2664c6a31625c99613aef6b925e/cryptsetup-benchmark-SME-Off-vs-On-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FoosofQ0iHlQbLAdi8fuS/950d23104a94ad5bd66f15614dd72b49/crypt-Performance-delta-1.png" />
            
            </figure>
    <div>
      <h2>Benchmarky</h2>
      <a href="#benchmarky">
        
      </a>
    </div>
    <p>Benchmarky is a homegrown tool <a href="/author/ivan/">provided by our Performance team</a> to run synthetic workloads against a specific target to evaluate the performance of different components. It uses <a href="https://workers.cloudflare.com/">Cloudflare Workers</a> to send requests and read stats on responses. It also reports the versions of all important stack components and their CPU usage. Each test runs 256 concurrent clients, grabbing a cached 10kB PNG image from a performance testing endpoint, and calculating the requests per second (RPS).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2uHw5bMNlZiKFLb3Ks5weQ/37cbcd116f472c6b20a35ede49d26f15/benckmarky-SME-off-and-SME-on-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3h5FsuK2Lv5xS3qUwv382A/2161213c7b0fb3715d9ab3eae66d4cc3/bench-Performance-delta-1.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In the majority of test results, performance decreased by a nominal amount, less than we expected. AMD’s official <a href="https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf">white paper</a> on SME states that encrypting and decrypting memory through the AES engine does incur a small amount of additional latency on DRAM accesses, with the exact cost dependent on the workload. Across all 11 data points, our average performance penalty was only 0.699%. Even at scale, enabling this feature greatly reduces the risk that <i>any</i> data could be exfiltrated from a stolen server.</p><p>While we wait for other hardware manufacturers to add support for <a href="https://en.wikichip.org/wiki/x86/tme">total memory encryption</a>, we are happy that AMD has set the bar high and is protecting our next generation edge hardware.</p> ]]></content:encoded>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">8ZqsGZD6Kw9pakjpzsIuP</guid>
            <dc:creator>Derek Chamorro</dc:creator>
            <dc:creator>Brian Bassett</dc:creator>
        </item>
        <item>
            <title><![CDATA[Gen X Performance Tuning]]></title>
            <link>https://blog.cloudflare.com/gen-x-performance-tuning/</link>
            <pubDate>Thu, 27 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ We have partnered with AMD to get the best performance out of this processor and today, we are highlighting our tuning efforts that led to an additional 6% performance. Thermal design power (TDP) and dynamic power, amongst others, play a critical role when tuning a system.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6TT8W5Et00ZvvobGvrgMEY/1424f7ef2b4d6cc216b4a7ababc750ee/image7-3.png" />
            
            </figure><p>We are using AMD 2nd Gen EPYC 7642 for our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">tenth generation “Gen X” servers</a>. We found <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">many aspects of this processor compelling</a> such as its <a href="/impact-of-cache-locality/">increase in performance due to its frequency bump and cache-to-core ratio</a>. We have partnered with AMD to get the best performance out of this processor and today, we are highlighting our tuning efforts that led to an additional 6% performance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/616TtlNrwVkyeHby6ReKSD/8127c8a658986889c1d54810ecd9433f/image14.png" />
            
            </figure>
    <div>
      <h2>Thermal Design Power &amp; Dynamic Power</h2>
      <a href="#thermal-design-power-dynamic-power">
        
      </a>
    </div>
    <p>Thermal design power (TDP) and dynamic power, amongst others, play a critical role when tuning a system. Many people share the common belief that thermal design power is the maximum or average power drawn by the processor. The 48-core <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC 7642</a> has a TDP rating of 225W, just as high as the 64-core <a href="https://www.amd.com/en/products/cpu/amd-epyc-7742">AMD EPYC 7742</a>. Intuitively, fewer cores should translate into lower power consumption, so why is the AMD EPYC 7642 expected to draw just as much power as the AMD EPYC 7742?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/420qeX6xoS09gLZaWH9F8p/051cc471d235fce238703d227f65a5f7/image10-2.png" />
            
            </figure><p><i>TDP Comparison between the EPYC 7642, EPYC 7742 and top-end EPYC 7H12</i></p><p>Let’s take a step back: TDP does not always mean the maximum or average power that the processor will draw. While TDP may provide a rough estimate of the processor’s power draw at a glance, it really describes how much heat the processor is expected to generate, and should be used as a guideline for designing cooling solutions with appropriate thermal capacitance. The cooling solution is expected to dissipate heat up to the TDP indefinitely, which in turn helps chip designers determine a power budget and create new processors around that constraint. In the case of the AMD EPYC 7642, the extra power budget was spent on retaining the full 256 MiB of L3 cache and on letting the cores operate at a higher sustained frequency as needed during our peak hours.</p><p>The overall power drawn by the processor depends on many different factors; dynamic (or active) power establishes the relationship between power and frequency. Dynamic power is a function of the chip’s capacitance, its supply voltage, its operating frequency, and an activity factor. The activity factor depends on the characteristics of the workloads running on the processor, and different workloads have different characteristics: the hotspots in <a href="/cloudflare-workers-unleashed/">Cloudflare Workers</a>, <a href="/seamless-remote-work-with-cloudflare-access/">Cloudflare for Teams</a>, or any other program will exercise different parts of the processor, affecting the activity factor.</p>
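<p>The relationship can be sketched with the textbook CMOS dynamic power model, P ≈ A·C·V²·f. The values below are illustrative only, not measurements of any real part:</p>

```python
def dynamic_power(activity: float, capacitance: float,
                  voltage: float, frequency: float) -> float:
    """Classic CMOS dynamic power model: P = A * C * V^2 * f."""
    return activity * capacitance * voltage**2 * frequency

# Illustrative (hypothetical) values: voltage hurts quadratically,
# while frequency and activity factor only scale power linearly.
base = dynamic_power(1.0, 1.0, 1.0, 1.0)
print(dynamic_power(1.0, 1.0, 1.1, 1.0) / base)  # +10% voltage -> 1.21x power
print(dynamic_power(1.0, 1.0, 1.0, 1.1) / base)  # +10% frequency -> 1.10x power
```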
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12AEwpVHIGoT9XGcqHo83V/319acd3abb48e14d48d5defd4dc8a0c4/image-7.png" />
            
            </figure>
    <div>
      <h2>Determinism Modes &amp; Configurable TDP</h2>
      <a href="#determinism-modes-configurable-tdp">
        
      </a>
    </div>
    <p>The latest AMD processors continue to implement AMD Precision Boost, which opportunistically allows the processor to operate at a higher frequency than its base frequency. How much higher the frequency can go depends on many different factors such as electrical, thermal, and power limitations.</p><p>AMD offers a knob known as determinism modes. This knob affects power and frequency, placing emphasis on one over the other depending on the determinism selected. <a href="https://www.amd.com/system/files/2017-06/Power-Performance-Determinism.pdf">There is a white paper</a> posted on AMD's website that goes into the nuanced details of determinism, and remembering how frequency and power are related, this was the simplest definition I took away from the paper:</p><p><b>Performance Determinism</b> - Power is a function of frequency.</p><p><b>Power Determinism</b> - Frequency is a function of power.</p><p>Another knob available to us is Configurable Thermal Design Power (cTDP), which allows the end user to reconfigure the factory-default thermal design power. The AMD EPYC 7642 is rated at 225W; however, AMD has given us guidance that this particular part can be reconfigured up to 240W. As mentioned previously, the cooling solution must support up to the TDP to avoid throttling. We did our due diligence and verified that our cooling solution can reliably dissipate up to 240W, even at higher ambient temperatures.</p><p>We gave these two knobs a try and got the results shown in the figures below, serving 10 KiB web assets over HTTPS <a href="/author/ivan/">provided by our performance team</a> over a sustained period of time to properly heat up the processor. The figures are broken down into the average operating frequency across all 48 cores, total package power drawn from the socket, and the highest reported die temperature out of the 8 dies laid out on the AMD EPYC 7642. Each figure compares the results we obtained from power and performance determinism, and finally, cTDP at 240W using power determinism.</p><p>Performance determinism clearly emphasized stabilizing operating frequency by matching its frequency to its lowest performing core over time. This mode appears to be ideal for tuners who emphasize predictability over maximizing the processor’s operating frequency; in other words, the number of cycles that the processor has at its disposal every second should be predictable. This is useful if two or more cores share data dependencies: allowing the cores to work in unison can prevent one core from stalling another. Power determinism, on the other hand, maximized power and frequency as much as it could.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yiJbib8HJNMkMZc7On1K4/19411800792deb9713cc65c943fc232c/image12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KXAIrxO2pCwpDAMlHyKW/49769b6a78d78577c13ea0109d4410ac/image11-2.png" />
            
            </figure><p>The heat generated by the processor is compounded further by ambient temperature. All components should stay within the safe operating temperatures specified by the vendor’s datasheet at all times; if unsure, reach out to your vendor as soon as possible. We put the AMD EPYC 7642 through several thermal tests and determined that it will operate within safe operating temperatures. Before diving into the figure below, it is important to mention that our fan speed ramped up over time, preventing the processor from reaching critical temperature; in other words, our cooling solution worked as intended - a combination of fans and a heatsink rated for 240W. We have yet to see the processors throttle in production. The figure below shows the highest temperature of the 8 dies laid out on the AMD EPYC 7642.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S2zzgzeBBeyhzBXxkU1Zz/8c7880845e49ceec2cf761265724d046/image8-3.png" />
            
            </figure><p>An unexpected byproduct of performance determinism was frequency jitter. It took longer for performance determinism to achieve a steady-state frequency, which contradicted the predictable performance that performance determinism mode was meant to deliver.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LK2tZgnsa8m86vPQYdjrS/87474e0b887e438d97254fde28b2f8ac/image9-1.png" />
            
            </figure><p>Finally, here are the real-world deltas, using performance determinism with the factory-default TDP of 225W as the baseline. Deltas under 10% can be influenced by a wide variety of factors, especially in production; however, none of our data showed a negative trend. Here are the averages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZT2L9MokepRw2GmZbKuxE/f3facacec8efefc02cf704137f09930e/image-5.png" />
            
            </figure>
    <div>
      <h2>Nodes Per Socket (NPS)</h2>
      <a href="#nodes-per-socket-nps">
        
      </a>
    </div>
    <p>The AMD EPYC 7642 <a href="/impact-of-cache-locality/">physically lays out its 8 dies across 4 different quadrants on a single package</a>. With such a layout, AMD supports dividing the dies into NUMA domains, a feature called Nodes Per Socket (NPS). Available NPS options differ from model to model; the AMD EPYC 7642 supports 4, 2, and 1 node(s) per socket. We thought it was worth exploring this option, even though none of these choices yields a shared last-level cache. We did not observe any significant performance deltas across the NPS options.</p>
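<p>On a Linux host, the NUMA domains that an NPS setting exposes to the operating system can be inspected via sysfs. This is a generic sketch using the standard kernel sysfs paths, not part of our tooling:</p>

```python
import glob
import os

# Each /sys/devices/system/node/node<N> directory is one NUMA domain.
# With NPS4 you would expect four such nodes per socket; with NPS1, one.
nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
for node in nodes:
    with open(os.path.join(node, "cpulist")) as f:
        print(os.path.basename(node), "CPUs:", f.read().strip())
```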
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aTakzcywLXCq9cVbJUapE/99073a5a3e205037b7561c641258091f/image15.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26jjqqzj8LFJecjFpq8ucU/4306eb572afd428965162fbe454172c4/image13-1.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Tuning the processor to yield an additional 6% throughput, or requests per second, out of the box has been a great start. Some parts will have more room for improvement than others. In production we were able to achieve an additional 2% requests per second with power determinism, and another 4% by reconfiguring the TDP to 240W. We will continue to work with AMD as well as internal teams within Cloudflare to identify additional areas of improvement. If you like the idea of supercharging the Internet on behalf of our customers, then <a href="https://www.cloudflare.com/careers/jobs/">come join us</a>.</p> ]]></content:encoded>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Gen X]]></category>
            <guid isPermaLink="false">5S0eOaziY4ST5v0PscTOAW</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[Impact of Cache Locality]]></title>
            <link>https://blog.cloudflare.com/impact-of-cache-locality/</link>
            <pubDate>Wed, 26 Feb 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Our Gen X servers process more requests per second per core than our previous fleet. The AMD 2nd Gen EPYC 7642 processor’s large L3 cache minimizes L3 cache misses, and the time-savings add up quickly. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In the past, we didn't have the opportunity to evaluate as many CPUs as we do today. The hardware ecosystem was simple – Intel had consistently delivered industry leading processors, and other vendors could not compete with them on both performance and cost. Recently it all changed: AMD has been challenging the status quo with their 2nd Gen EPYC processors.</p><p>This is not the first time that Intel has been challenged; previously <a href="/arm-takes-wing/">there was Qualcomm</a>, and we worked with AMD and considered their 1st Gen EPYC processors based on the original Zen architecture, but ultimately, <a href="/a-tour-inside-cloudflares-g9-servers/">Intel prevailed</a>. AMD did not give up and unveiled their 2nd Gen EPYC processors, codenamed Rome, based on the latest Zen 2 architecture.</p><blockquote><p>Playing with some new fun kit. <a href="https://twitter.com/hashtag/epyc?src=hash&amp;ref_src=twsrc%5Etfw">#epyc</a> <a href="https://t.co/1No8Cmfzwl">pic.twitter.com/1No8Cmfzwl</a></p><p>— Matthew Prince (@eastdakota) <a href="https://twitter.com/eastdakota/status/1192898039003766785?ref_src=twsrc%5Etfw">November 8, 2019</a></p></blockquote><p>Zen 2 brings many improvements over its predecessors, including a die shrink from 14nm to 7nm, a doubling of the top-end core count from 32 to 64, and a larger L3 cache. Let’s emphasize the size of that L3 cache again: 32 MiB of L3 cache per Core Complex Die (CCD).</p><p>This time around, we have taken steps to understand our workloads at the hardware level through the use of hardware performance counters and profiling tools. Using these specialized CPU registers and profilers, we collected data on the AMD 2nd Gen EPYC and Intel Skylake-based Xeon processors in a lab environment, then validated our observations in production against other generations of servers from the past.</p>
    <div>
      <h2>Simulated Environment</h2>
      <a href="#simulated-environment">
        
      </a>
    </div>
    <p>CPU Specifications</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42MpuiHDcFmNWT8c5GOh96/3666d84667b56b645e29aec5cfafdfdd/image-3.png" />
            
            </figure><p>We evaluated several Intel Cascade Lake and AMD 2nd Gen EPYC processors, trading off various factors between power and performance; the AMD EPYC 7642 CPU came out on top. The majority of Cascade Lake processors have 1.375 MiB of L3 cache per core, shared across all cores, a common theme that started with Skylake. The 2nd Gen EPYC processors, on the other hand, start at 4 MiB per core. The AMD EPYC 7642 is a unique SKU with 256 MiB of L3 cache, approximately 5.33 MiB per core. Having a cache this large sitting right next to each core means that a program will spend fewer cycles fetching data from RAM, since more data can be kept readily available in the L3 cache.</p>
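<p>The cache-per-core arithmetic above is easy to check, using the figures from the text:</p>

```python
# L3 cache per core: AMD EPYC 7642 vs. a typical Cascade Lake part.
epyc_l3_mib, epyc_cores = 256, 48
cascade_lake_l3_per_core_mib = 1.375  # shared across all cores

epyc_l3_per_core_mib = epyc_l3_mib / epyc_cores
print(f"EPYC 7642: {epyc_l3_per_core_mib:.2f} MiB of L3 per core")  # ~5.33
print(f"vs. Cascade Lake: {epyc_l3_per_core_mib / cascade_lake_l3_per_core_mib:.1f}x")
```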
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FVYxpe6TeVh26fV12eeBw/a285e437f90b2ac809811bb3645b1130/Infrastructure-before_2x.png" />
            
            </figure><p><i>Before (Intel)</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zMlvsRWnuL6SwUUPyLHBu/13d459bee2cf7d0c195bb74b18c49555/Infrastructure-after-_2x--1-.png" />
            
            </figure><p><i>After (AMD)</i></p><p>The traditional cache layout has also changed with the introduction of the 2nd Gen EPYC, a byproduct of AMD using a multi-chip module (MCM) design. The 256 MiB of L3 cache is split across 8 individual Core Complex Dies (CCDs); each CCD contains 2 Core Complexes (CCXs), and each CCX contains 16 MiB of L3 cache.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oqKl2iToZamCbBNQBH2uj/ba9b72846f7f81c92e0ccf705f3be9c1/image6.png" />
            
            </figure><p><i>Core Complex (CCX) - Up to four cores</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6p5FRRU9VZTMWhkdNH7g4/acb4ff79d8916a20278f8c50e21c53bb/image9.png" />
            
            </figure><p><b>Core Complex Die (CCD) - Created by combining two CCXs</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IyMxh74THzCubUs6H3SY1/a031494b9f0783406ef653b3e99ad598/image1-2.png" />
            
            </figure><p><i>AMD 2nd Gen EPYC 7642 - Created with 8x CCDs plus an I/O die in the center</i></p>
    <div>
      <h3>Methodology</h3>
      <a href="#methodology">
        
      </a>
    </div>
    <p>Our production traffic shares many characteristics of a sustained workload, which typically does not induce large variations in operating frequency nor enter periods of idle time. We picked a simulated traffic pattern that closely resembled our production traffic behavior: a cached 10 KiB PNG served via HTTPS. We were interested in assessing the CPU’s maximum throughput in requests per second (RPS), one of our key metrics. That said, we did not disable Intel Turbo Boost or AMD Precision Boost, nor did we match the frequencies clock-for-clock, while measuring requests per second, instructions retired per second (IPS), L3 cache miss rate, and sustained operating frequency.</p>
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>The 1P AMD 2nd Gen EPYC 7642 powered server took the lead, processing 50% more requests per second than our Gen 9 2P Intel Xeon Platinum 6162 server.</p><p>We are running a sustained workload, so we should end up with a sustained operating frequency higher than the base clock. The AMD EPYC 7642’s operating frequency, that is, the number of cycles the processor had at its disposal, was approximately 20% greater than the Intel Xeon Platinum 6162’s, so frequency alone was not enough to explain the 50% gain in requests per second.</p><p>Taking a closer look, the number of instructions retired over time was far greater on the AMD 2nd Gen EPYC 7642 server, thanks to its low L3 cache miss rate.</p>
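<p>A rough way to see why frequency alone falls short: if requests per second scaled purely with clock speed, a 20% frequency advantage would only buy a 20% gain, so the remainder must come from doing more work per cycle. A back-of-the-envelope decomposition using the approximate figures from the text:</p>

```python
rps_gain = 1.50   # ~50% more requests per second on the EPYC 7642 server
freq_gain = 1.20  # ~20% higher sustained operating frequency

# Gain not explained by frequency, i.e. attributable to more instructions
# retired per cycle, driven largely by the lower L3 cache miss rate.
per_cycle_gain = rps_gain / freq_gain
print(f"~{(per_cycle_gain - 1) * 100:.0f}% more work per cycle")  # ~25%
```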
    <div>
      <h3>Production Environment</h3>
      <a href="#production-environment">
        
      </a>
    </div>
    <p>CPU Specifications</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lji8wRNR827IKHgDE3asr/3180fa701066cee223dde822b313d8ec/image-2.png" />
            
            </figure>
    <div>
      <h3>Methodology</h3>
      <a href="#methodology">
        
      </a>
    </div>
    <p>Our predominant bottleneck appears to be cache memory, and we saw significant improvement in requests per second as well as <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">time to process a request</a> thanks to a low L3 cache miss rate. The data we present in this section was collected at a point of presence that spanned Gen 7 to Gen 9 servers. We also collected data from a secondary region to gain additional confidence that the data presented here was not unique to one particular environment. Gen 9 is the baseline, just as in the previous section.</p><p>We put the 2nd Gen EPYC-based Gen X into production hoping the results would closely mirror what we had seen in the lab. The requests per second did not quite match the results we had hoped for, but the AMD EPYC server still outperformed all previous generations, beating the Intel Gen 9 server by 36%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V8uzcFfnFwUIH0leecLnm/232bd88ceb2216efa4fae242a085a140/image8.png" />
            
            </figure><p>Sustained operating frequency was nearly identical to what we have seen back in the lab.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DhckmjQqo7b9JtllouvVL/19b3fda2e8a5a277c511269f4bc7a148/image10.png" />
            
            </figure><p>Due to the lower than expected requests per second, we also saw lower instructions retired over time and higher L3 cache miss rate but maintained a lead over Gen 9, with 29% better performance.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The single AMD EPYC 7642 performed very well during our lab testing, beating our Gen 9 server with dual Intel Xeon Platinum 6162 with the same total number of cores. Key factors we noticed were its large L3 cache, which led to a low L3 cache miss rate, as well as a higher sustained operating frequency. The AMD 2nd Gen EPYC 7642 did not have as big an advantage in production, but nevertheless still outperformed all previous generations. The observations we made in production were based on a PoP that could have been influenced by a number of other factors, including ambient temperature, timing, and new products that will shape our traffic patterns in the future, such as <a href="/webassembly-on-cloudflare-workers/">WebAssembly on Cloudflare Workers</a>. The AMD EPYC 7642 opens up the possibility for our upcoming Gen X server to maintain the same core count while processing more requests per second than its predecessor.</p><p>Got a passion for hardware? We should get in touch. We are always looking for talented and curious individuals to <a href="https://www.cloudflare.com/careers/departments/">join our team</a>. The data presented here would not have been possible without the teamwork of many different individuals within Cloudflare. As a team, we strive to create highly performant, reliable, and secure systems that will form the pillars of our rapidly growing <a href="https://www.cloudflare.com/network/">network</a>, which spans 200 cities in more than 90 countries, and we are just getting started.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2QRJzUFnyzSRaIXLggtDqr</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[An EPYC trip to Rome: AMD is Cloudflare's 10th-generation Edge server CPU]]></title>
            <link>https://blog.cloudflare.com/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/</link>
            <pubDate>Tue, 25 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ As we looked at production ready systems to power our Gen X solution, we’re moving on from Gen 9's 48-core setup of dual socket Intel Xeon Platinum 6162's to a 48-core single socket AMD EPYC 7642. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>More than 1 billion unique IP addresses pass through the <a href="/tag/cloudflare-network/">Cloudflare Network</a> each day, serving on average 11 million HTTP requests per second and operating within 100ms of 95% of the Internet-connected population globally. Our network spans 200 cities in more than 90 countries, and our engineering teams have built an extremely <a href="/tag/speed-and-reliability/">fast and reliable infrastructure</a>.</p><p>We’re extremely proud of our work and are determined to help make the Internet a <a href="https://www.fastcompany.com/most-innovative-companies/2019/sectors/security">better and more secure place</a>. Cloudflare engineers involved with hardware dig into servers and their components to understand and select the best hardware to maximize the performance of our stack.</p><p>Our software stack is compute intensive and very much CPU bound, driving our engineers to work continuously at optimizing Cloudflare’s performance and reliability at all layers of our stack. Within a server, a straightforward way to increase computing power is to add more CPU cores: the more cores we can include in a server, the more output we can expect. This is important for us since the diversity of our products and customers has grown over time, with increasing demand that requires our servers to do more. To help drive compute performance, we needed to increase core density, and that's what we did. 
Below is the processor detail for servers we’ve deployed since 2015, including the core counts:</p><table><tr><td><p>---</p></td><td><p><b>Gen 6</b></p></td><td><p><b>Gen 7</b></p></td><td><p><b>Gen 8</b></p></td><td><p><b>Gen 9</b></p></td></tr><tr><td><p>Start of service</p></td><td><p>2015</p></td><td><p>2016</p></td><td><p>2017</p></td><td><p>2018</p></td></tr><tr><td><p>CPU</p></td><td><p>Intel Xeon E5-2630 v3</p></td><td><p>Intel Xeon E5-2630 v4</p></td><td><p>Intel Xeon Silver 4116</p></td><td><p>Intel Xeon Platinum 6162</p></td></tr><tr><td><p>Physical Cores</p></td><td><p>2 x 8</p></td><td><p>2 x 10</p></td><td><p>2 x 12</p></td><td><p>2 x 24</p></td></tr><tr><td><p>TDP</p></td><td><p>2 x 85W</p></td><td><p>2 x 85W</p></td><td><p>2 x 85W</p></td><td><p>2 x 150W</p></td></tr><tr><td><p>TDP per Core</p></td><td><p>10.65W</p></td><td><p>8.50W</p></td><td><p>7.08W</p></td><td><p>6.25W</p></td></tr></table><p>In 2018, we made a big jump in the total number of cores per server with <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a>. Our physical footprint was reduced by 33% compared to Gen 8, giving us increased capacity and computing power per rack. Thermal Design Power (TDP, roughly typical power usage) is listed above to highlight that we've also become more power efficient over time. Power efficiency is important to us: first, because we'd like to be as <a href="/a-carbon-neutral-north-america/">carbon friendly</a> as we can; and second, so we can better utilize our provisioned power supplied by the data centers. But we know we can do better.</p><p>Our main defining metric is Requests per Watt. We can increase our Requests per Second number with more cores, but we have to stay within our power budget envelope. We are constrained by the data centers’ power infrastructure which, along with our selected power distribution units, leads to a power cap for each server rack. 
Adding servers to a rack obviously increases power consumption at the rack level. Our operational costs increase significantly if we go over a rack’s power cap and have to provision another rack. What we need is more compute power inside the same power envelope, which will drive a higher (better) Requests per Watt number – our key metric.</p><p>As you might imagine, we look at power consumption carefully in the design stage. From the above you can see that it’s not worth our time to deploy more power-hungry CPUs whose TDP per Core is higher than our current generation’s, as that would hurt our Requests per Watt metric. As we started looking at production-ready systems to power our Gen X solution, we took a long look at what is available in the market today, and we’ve made our decision. We’re moving on from Gen 9's 48-core setup of dual socket <a href="https://ark.intel.com/content/www/us/en/ark/products/136869/intel-xeon-platinum-6162-processor-33m-cache-1-90-ghz.html">Intel® Xeon® Platinum 6162</a>'s to a 48-core single socket <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC™ 7642</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Ub4NDKC4KKIMsXY2cFPJW/2bb5250e410aa4c75cad9651ab1b84f7/1.jpg" />
            
</figure><p>Gen X server setup with single socket 48-core AMD EPYC 7642.</p><table><tr><td><p></p></td><td><p><b>Intel</b></p></td><td><p><b>AMD</b></p></td></tr><tr><td><p>CPU</p></td><td><p>Xeon Platinum 6162</p></td><td><p>EPYC 7642</p></td></tr><tr><td><p>Microarchitecture</p></td><td><p>“Skylake”</p></td><td><p>“Zen 2”</p></td></tr><tr><td><p>Codename</p></td><td><p>“Skylake SP”</p></td><td><p>“Rome”</p></td></tr><tr><td><p>Process</p></td><td><p>14nm</p></td><td><p>7nm</p></td></tr><tr><td><p>Physical Cores</p></td><td><p>2 x 24</p></td><td><p>48</p></td></tr><tr><td><p>Frequency</p></td><td><p>1.9 GHz</p></td><td><p>2.4 GHz</p></td></tr><tr><td><p>L3 Cache / socket</p></td><td><p>24 x 1.375MiB</p></td><td><p>16 x 16MiB</p></td></tr><tr><td><p>Memory / socket</p></td><td><p>6 channels, up to DDR4-2400</p></td><td><p>8 channels, up to DDR4-3200</p></td></tr><tr><td><p>TDP</p></td><td><p>2 x 150W</p></td><td><p>225W</p></td></tr><tr><td><p>PCIe / socket</p></td><td><p>48 lanes</p></td><td><p>128 lanes</p></td></tr><tr><td><p>ISA</p></td><td><p>x86-64</p></td><td><p>x86-64</p></td></tr></table><p>From the specs, we see that with the AMD chip we keep the same number of cores at a lower TDP: Gen 9's TDP per core was 6.25W, while Gen X's will be 4.69W, a 25% decrease. Add in the higher frequency and a simpler single-socket setup, and we can speculate that the AMD chip will perform better. The rest of this blog walks through a series of tests, simulations, and live production results to see how much better AMD actually performs.</p><p>As a side note before we go further, TDP is a simplified metric from the manufacturers’ datasheets that we use in the early stages of our server design and CPU selection process. AMD and Intel define TDP differently, which makes the spec unreliable for comparisons across vendors. Actual CPU power draw, and more importantly server system power draw, are what we really factor into our final decisions.</p>
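<p>The TDP-per-core arithmetic is simple enough to sanity-check; a quick sketch using only the datasheet numbers from the table above:</p>

```go
package main

import "fmt"

// tdpPerCore spreads the rated TDP evenly across physical cores.
func tdpPerCore(tdpWatts float64, cores int) float64 {
	return tdpWatts / float64(cores)
}

func main() {
	gen9 := tdpPerCore(2*150, 2*24) // two Xeon Platinum 6162 sockets
	genX := tdpPerCore(225, 48)     // one EPYC 7642
	fmt.Printf("Gen 9: %.2f W/core, Gen X: %.2f W/core, decrease: %.0f%%\n",
		gen9, genX, 100*(1-genX/gen9))
	// prints: Gen 9: 6.25 W/core, Gen X: 4.69 W/core, decrease: 25%
}
```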
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ue5oOlAdewYkerMSsEPWN/5cf921da1bc31eccd575af6670af66d3/2.png" />
            
            </figure>
    <div>
      <h3>Ecosystem Readiness</h3>
      <a href="#ecosystem-readiness">
        
      </a>
    </div>
    <p>At the beginning of our journey to choose our next CPU, we got a variety of processors from different vendors that could fit well with our software stack and services, which are written in C, LuaJIT, and Go. We covered the details of benchmarking for our stack when we tested <a href="/arm-takes-wing/">Qualcomm's ARM® chip in the past</a>. This time around we ran the same suite of tests as Vlad's blog, since it is a quick and easy "sniff test" that lets us evaluate a number of CPUs within a manageable time period before we commit more engineering effort to applying our full software stack.</p><p>We tried a variety of CPUs with different numbers of cores, sockets, and frequencies. Since we're explaining how we chose the AMD EPYC 7642, all the graphs in this blog focus on how AMD compares with our <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9's Intel Xeon Platinum 6162 CPU</a> as a baseline.</p><p>Our results are per server node for both CPUs tested, meaning the numbers pertain to 2x 24-core processors for Intel and 1x 48-core processor for AMD: a two-socket Intel-based server versus a single-socket AMD EPYC powered server. Before we started our testing, we changed the Cloudflare lab test servers’ BIOS settings to match our production server settings. This yielded average CPU frequencies of 3.03 GHz for AMD and 2.50 GHz for Intel, with very little variation. As a gross simplification, we expect that with the same number of cores, AMD would perform about 21% better than Intel. Let’s start with our crypto tests.</p>
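<p>That 21% expectation is just the ratio of the measured frequencies (a gross simplification, since it assumes performance scales linearly with clock speed and nothing else differs between the chips):</p>

```go
package main

import "fmt"

// expectedSpeedup assumes performance scales linearly with CPU frequency.
func expectedSpeedup(amdGHz, intelGHz float64) float64 {
	return amdGHz/intelGHz - 1
}

func main() {
	fmt.Printf("expected AMD advantage: %.0f%%\n", 100*expectedSpeedup(3.03, 2.50))
	// prints: expected AMD advantage: 21%
}
```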
    <div>
      <h3>Cryptography</h3>
      <a href="#cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3AZYQDj2nhVofkFNuK7fhb/dea112fd9ae2b2e91514b62e21d242d6/3.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OpcfNqi2hraQrS8fZV0Lw/29fcc9dcf3ec73400a8fe784a12b26f5/4.png" />
            
            </figure><p>Looking promising for AMD. In public key cryptography, it does 18% better. Meanwhile, for symmetric key, AMD loses on AES-128-GCM, but it’s comparable overall.</p>
    <div>
      <h3>Compression</h3>
      <a href="#compression">
        
      </a>
    </div>
    <p>We do a lot of compression at the edge to save bandwidth and help deliver content faster. We use both the zlib and brotli libraries, written in C. All tests are done on the <a href="/">blog.cloudflare.com</a> HTML file in memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IVKryfCIXY2c6vXt8KsPc/eab9765ff8650fc9fbe97a82cf89f0cf/5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ycU6O5cbLREXNJV29sPDI/4472bc4ae36cfedae67535d08e5e10f8/6.png" />
            
            </figure><p>AMD wins by an average of 29% using gzip across all qualities. It does even better with brotli at qualities lower than 7, which we use for dynamic compression. There’s a throughput cliff starting at brotli-9; <a href="/arm-takes-wing/">Vlad’s explanation</a> is that Brotli consumes lots of memory and thrashes the cache. Nevertheless, AMD wins by a healthy margin.</p><p>A lot of our services are written in Go. In the following graphs, we redo the crypto and compression tests in Go, along with RegExp on 32KB strings and the strings library.</p>
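<p>All of the Go numbers in this section come from the standard library's benchmark harness. As an illustration of the pattern (the regexp and corpus here are made up, not our actual test data), a regexp benchmark over a ~32KB string looks like this:</p>

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// A hypothetical pattern; the actual test regexps are not published here.
var re = regexp.MustCompile(`[a-z]+@[a-z]+\.[a-z]+`)

// input approximates the 32KB strings used in the regexp tests.
var input = strings.Repeat("lorem ipsum user@example.com ", 32*1024/29)

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(len(input)))
		for i := 0; i < b.N; i++ {
			re.FindAllString(input, -1)
		}
	})
	fmt.Println("regexp 32KB:", res)
}
```

Running the same benchmark binary on both machines and comparing ns/op (or MB/s) per core is what produces the ratios in the graphs below.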
    <div>
      <h3>Go Cryptography</h3>
      <a href="#go-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1JRYfDXEC61iGL6Ds7LjL1/16b8980b2b2cf5bdda22769d0dad5fa6/7.png" />
            
            </figure>
    <div>
      <h3>Go Compression</h3>
      <a href="#go-compression">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5U6GEnTLU1VVvqbZtJZ1df/0e3a0fa77bbb3ce709e36503e9eaba66/Screen-Shot-2020-02-24-at-14.37.27.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YQTrE7SfEiCXjivU7NXXf/c5b22a8a38b2eb48c0658d4b9b3a1f63/9.png" />
            
            </figure>
    <div>
      <h3>Go Regexp</h3>
      <a href="#go-regexp">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZQn6rtg7fE6IJ3wNgpCf/9886faae1855b658347fccfe72802a0a/10.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vvUuNzVuRrgGqVrGOxagJ/720667e78cb5b34eda7f4ca0d445bd9f/11.png" />
            
            </figure>
    <div>
      <h3>Go Strings</h3>
      <a href="#go-strings">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BY6NTdgjUWGiDBwC0NMNv/8943902a6f047eb8d2f9079e9b1e432b/12.png" />
            
            </figure><p>AMD performs better in all of our Go benchmarks except ECDSA P256 Sign, where it loses by 38%. That is peculiar, since in the C test it does 24% better; it’s worth investigating what’s going on there. Elsewhere, AMD doesn’t win by as large a margin, but it still proves to be better.</p>
    <div>
      <h3>LuaJIT</h3>
      <a href="#luajit">
        
      </a>
    </div>
    <p>We rely heavily on LuaJIT in our stack. As <a href="/arm-takes-wing/">Vlad said</a>, it’s the glue that holds Cloudflare together. We’re glad to report that AMD wins here as well.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VM41wrmwYIpaJSvR70ztl/95f5f48fad6f48034422ea69b62c73b6/13.png" />
            
            </figure><p>Overall, our tests show a single EPYC 7642 outperforming two Xeon Platinum 6162. While there are a couple of tests where AMD loses out, such as OpenSSL AES-128-GCM and Go OpenSSL ECDSA-P256 Sign, AMD wins all the others. Treating all tests equally, AMD does on average 25% better than Intel.</p>
    <div>
      <h3>Performance Simulations</h3>
      <a href="#performance-simulations">
        
      </a>
    </div>
    <p>After our ‘sniff’ tests, we put our servers through another series of emulations that apply synthetic workloads simulating our edge software stack. Here we simulate scenarios with the different types of requests we see in production. Request types vary by asset size, and by whether they go through HTTP or HTTPS, the <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF</a>, Workers, or one of many additional variables. The graph below shows the throughput comparison between the two CPUs for the request types we see most often.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Rj1NqqXfYroVsefnGQQgS/c598bbf7b324ab6944236e91d5f5e28a/14.png" />
            
            </figure><p>The results above are ratios using Gen 9's Intel CPUs as the baseline, normalized at 1.0 on the X-axis. For example, looking at simple requests of 10KiB assets over HTTPS, we see that AMD does 1.50x better than Intel in Requests per Second. On average for the tests shown on the graph above, AMD performs 34% better than Intel. Considering that the TDP for the single AMD EPYC 7642 is 225W, compared to 300W for the two Intel processors, we're looking at AMD delivering up to 2.0x better Requests per Watt than the Intel CPUs!</p><p>By this time, we were already leaning heavily toward a single socket setup with AMD EPYC 7642 as our CPU for Gen X. We were excited to see exactly how well AMD EPYC servers would do in production, so we immediately shipped a number of the servers out to some of our data centers.</p>
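<p>The requests-per-watt claim falls out of the throughput ratio and the rated power budgets; checking the arithmetic:</p>

```go
package main

import "fmt"

// rpsPerWattGain scales a throughput ratio by the ratio of power budgets.
func rpsPerWattGain(rpsRatio, amdWatts, intelWatts float64) float64 {
	return rpsRatio * intelWatts / amdWatts
}

func main() {
	// 1.50x RPS on 10KiB-over-HTTPS, 225W EPYC 7642 vs. 2 x 150W Xeon 6162.
	fmt.Printf("requests per watt gain: %.1fx\n", rpsPerWattGain(1.50, 225, 300))
	// prints: requests per watt gain: 2.0x
}
```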
    <div>
      <h3>Live Production</h3>
      <a href="#live-production">
        
      </a>
    </div>
    <p>Step one of course was to get all our test servers set up for a production environment. All of our machines in the fleet are loaded with the same processes and services which makes for a great apples-to-apples comparison.  Like data centers everywhere, we have multiple generations of servers deployed, and we deploy our servers in clusters such that each cluster is pretty homogeneous by server generation. In some environments this can lead to varying utilization curves between clusters.  This is not the case for us. Our engineers have optimized CPU utilization across all server generations so that no matter if the machine's CPU has 8 cores or 24 cores, CPU usage is generally the same.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CiOx8dprb9NuYH2kUrZQ/ef686808c946d6cc55aa2983dc50c839/15.png" />
            
            </figure><p>As you can see above, and to illustrate our ‘similar CPU utilization’ point, there is no significant difference in CPU usage between Gen X AMD-powered servers and Gen 9 Intel-based servers. This means both test and baseline servers are equally loaded. Good. This is exactly what we want to see from our setup for a fair comparison. The two graphs below show the comparative number of requests processed at the single-core and all-core (server) level.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18XOFPOFmKWcIuheu6HWl8/e84fa7ff38c6c3f39115b60882a7c106/16.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rwWBZrTmfJvIs7Em9aCXT/19945cb240e37ca3ba61853b3e72bf5b/17.png" />
            
            </figure><p>We see that AMD does on average about 23% more requests. That's really good! We talked a lot about bringing more muscle in the <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9 blog</a>. We have the same number of cores, yet AMD does more work, and does it with less power. Having only looked at core counts and TDP at the start, it's really nice to see that, in practice, AMD also delivers significantly more performance with better power efficiency.</p><p>But as we mentioned earlier, TDP isn’t a standardized spec across manufacturers, so let’s look at real power usage. Measuring server power consumption along with requests per second (RPS) yields the graph below:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DnPzngVCU8da71wGJUleo/b132d4256abe8c60809cf7b2c6d27ea4/18.png" />
            
            </figure><p>Observing our servers' request rate against their power consumption, the AMD Gen X server performs 28% better. While we could have expected more from AMD given its 25% lower TDP, keep in mind that TDP is very ambiguous. In fact, we saw that AMD's actual power draw ran nearly at spec TDP, with its frequency well above base; Intel's was far from it. That is another reason why TDP is becoming a less reliable estimate of power draw. Moreover, the CPU is just one component contributing to the overall power of the system. Recall that the Intel CPUs are integrated in a multi-node system as described in the <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9 blog</a>, while AMD is in a regular 1U form-factor machine. That actually doesn’t favor AMD, since multi-node systems are designed for high-density capabilities at lower power per node, yet it still outperformed the Intel system on a power-per-node basis anyway.</p><p>Through the majority of comparisons from the datasheets, test simulations, and live production performance, the 1P AMD EPYC 7642 configuration performed significantly better than the 2P Intel Xeon 6162. We’ve seen in some environments that AMD can do up to 36% better in live production, and we believe we can achieve that consistently with some optimization of both our hardware and software.</p><p>So that's it. AMD wins.</p><p>The additional graphs below show the median and p99 NGINX processing (mostly on-CPU time) latencies between the two CPUs throughout 24 hours. On average, AMD processes about 25% faster. At p99, it does about 20-50% better, depending on the time of day.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Fty1XuK9w4yupvRyHzLmu/0f1e8bc02e8f04d0c33752f6ff67fd98/19.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yWj3MsDnro619KjtJwVAW/0868d32c8aa3f5c594b8e660e608eb6c/20.png" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Hardware and Performance engineers at Cloudflare do significant research and testing to figure out the best server configuration for our customers. Solving big problems like this is why <a href="https://www.fastcompany.com/best-workplaces-for-innovators/2019">we love working here</a>, and we're also helping to solve yours with our services like <a href="/introducing-cloudflare-workers/">serverless</a> edge compute and the array of security solutions such as <a href="/magic-transit-network-functions/">Magic Transit</a>, <a href="/argo-tunnel/">Argo Tunnel</a>, and <a href="/unmetered-mitigation/">DDoS protection</a>. All of our servers on the Cloudflare Network are designed to make our products work reliably, and we strive to make each new generation of our server design better than its predecessor. We believe the AMD EPYC 7642 is the answer for our Gen X's processor question.</p><p>With <a href="https://developers.cloudflare.com/workers/">Cloudflare Workers</a>, developers have enjoyed deploying their applications to our <a href="/tag/cloudflare-network/">Network</a>, which is ever expanding across the globe. We’ve been proud to empower our customers by letting them focus on writing their code while we are managing the security and reliability in the cloud. We are now even more excited to say that their work will be deployed on our Gen X servers powered by 2nd Gen AMD EPYC processors.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6yI4yio6UovXiaJi79a78l/d248e04fd2b5c98c68b7312af953c54f/21.jpg" />
            
            </figure><p>Expanding Rome to a data center near you</p><p>Thanks to AMD, the EPYC 7642 allows us to increase our capacity and expand into more cities more easily. Rome wasn’t built in a day, but it will be <a href="/scaling-the-cloudflare-global/">very close to many of you</a>.</p><p>In the last couple of years, we've been experimenting with many Intel and AMD x86 chips along with ARM CPUs. We look forward to having these CPU manufacturers partner with us on future generations so that together we can help build a better Internet.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">7dSLKQwG5XrggG5HDcZm1x</guid>
            <dc:creator>Rob Dinh</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare’s Gen X: Servers for an Accelerated Future]]></title>
            <link>https://blog.cloudflare.com/cloudflares-gen-x-servers-for-an-accelerated-future/</link>
            <pubDate>Mon, 24 Feb 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ We designed and built Cloudflare’s network to be able to grow capacity quickly and inexpensively; to allow every server, in every city, to run every service; and to allow us to shift customers and traffic across our network efficiently. ]]></description>
            <content:encoded><![CDATA[ <p></p><blockquote><p><i>“Every server can run every service.”</i></p></blockquote><p>We designed and built Cloudflare’s network to be able to grow capacity quickly and inexpensively; to allow every server, in every city, to run every service; and to allow us to shift customers and traffic across our network efficiently. We deploy standard, commodity hardware, and our product developers and customers do not need to worry about the underlying servers. Our software automatically manages the deployment and execution of our developers’ code and our customers’ code across our network. Since we manage the execution and prioritization of code running across our network, we are both able to optimize the performance of our highest tier customers and effectively leverage idle capacity across our network.</p><p>An alternative approach might have been to run several fragmented networks with specialized servers designed to run specific features, such as the <a href="https://www.cloudflare.com/waf/">Firewall</a>, <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> or <a href="https://workers.cloudflare.com/">Workers</a>. However, we believe that approach would have resulted in wasted idle resources and given us less flexibility to build new software or adopt the newest available hardware. And a single optimization target means we can provide security and performance at the same time.</p><p>We use Anycast to route a web request to the nearest Cloudflare data center (from among <a href="https://www.cloudflare.com/network/">200 cities</a>), improving performance and maximizing the surface area to fight attacks.</p><p>Once a datacenter is selected, we use Unimog, Cloudflare’s custom load balancing system, to dynamically balance requests across diverse generations of servers. 
We load balance at different layers: between cities, between physical deployments located across a city, between external Internet ports, between internal cables, between servers, and even between logical CPU threads within a server.</p><p>As demand grows, we can scale out by simply adding new servers, points of presence (PoPs), or cities to the global pool of available resources. If any server component has a hardware failure, it is gracefully de-prioritized or removed from the pool, to be batch repaired by our operations team. This architecture has enabled us to have no dedicated Cloudflare staff at any of the 200 cities, instead relying on help for infrequent physical tasks from the ISPs (or data centers) hosting our equipment.</p>
    <div>
      <h3>Gen X: Intel Not Inside</h3>
      <a href="#gen-x-intel-not-inside">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cA2rKV0dU3II6avmDY7BS/fba1453ea9bc827e399f57971ce871af/AMD_EPYC-39.jpg" />
            
            </figure><p>We recently turned up our tenth generation of servers, “Gen X”, already deployed across major US cities, and in the process of being shipped worldwide. Compared with our prior server (<a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a>), it processes as much as 36% more requests while costing substantially less. Additionally, it enables a ~50% decrease in L3 cache miss rate and up to 50% decrease in NGINX p99 latency, powered by a CPU rated at 25% lower TDP (thermal design power) per core.</p><p>Notably, for the first time, Intel is not inside. We are not using their hardware for any major server components such as the CPU, board, memory, storage, network interface card (or any type of accelerator). Given how critical Intel is to our industry, this would until recently have been unimaginable, and is in contrast with <a href="/a-tour-inside-cloudflares-g9-servers/">prior</a> <a href="/a-tour-inside-cloudflares-latest-generation-servers/">generations</a> which made extensive use of their hardware.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uCBtfJ5jawcCdsdyipghO/0d99e66b8db30d4c872d2a540cf9072c/AMD2.png" />
            
            </figure><p><i>Intel-based Gen 9 server</i></p><p>This time, AMD is inside.</p><p>We were particularly impressed by the 2nd Gen AMD EPYC processors because they proved to be far more efficient for our customers’ workloads. Since the pendulum of technology leadership swings back and forth between providers, we wouldn’t be surprised if that changes over time. However, we were happy to adapt quickly to the components that made the most sense for us.</p>
    <div>
      <h3>Compute</h3>
      <a href="#compute">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qvmFWlrsejJpb9NEG6ZqU/d830111b48ee0ed930132e88660063ba/pasted-image-0.png" />
            
            </figure><p>CPU efficiency is very important to our server design. Since we have a compute-heavy workload, our servers are typically limited by the CPU before other components. Cloudflare’s software stack scales quite well with additional cores. So, we care more about core count and power efficiency than dimensions such as clock speed.</p><p>We selected the AMD EPYC 7642 processor in a single-socket configuration for Gen X. This CPU has 48 cores (96 threads), a base clock speed of 2.4 GHz, and an L3 cache of 256 MB. While the rated power (225W) may seem high, it is lower than the combined TDP in our Gen 9 servers, and we preferred the performance of this CPU over lower power variants. Despite AMD offering a higher core count option with 64 cores, the performance gains for our software stack and usage weren’t compelling enough.</p><p>We have deployed the AMD EPYC 7642 in half a dozen Cloudflare data centers; it is considerably more powerful than a dual-socket pair of <a href="/a-tour-inside-cloudflares-g9-servers/">high-core count Intel processors</a> (Skylake as well as Cascade Lake) we used in the last generation.</p><p>Readers of our blog might remember our excitement around ARM processors. We even ported the entirety of our software stack to run on ARM, just as it does with x86, and have been maintaining that ever since, even though it calls for slightly more work for our software engineering teams. We did this leading up to the launch of <a href="/arm-takes-wing/">Qualcomm’s Centriq server CPU</a>, which eventually got shuttered. 
While none of the off-the-shelf ARM CPUs available at the moment are interesting to us, we remain optimistic about high core count offerings launching in 2020 and beyond, and look forward to a day when our servers are a mix of x86 (Intel and AMD) and ARM.</p><p>We aim to replace servers when the efficiency gains enabled by new equipment outweigh their cost.</p><p>The performance we’ve seen from the AMD EPYC 7642 processor has encouraged us to accelerate replacement of multiple generations of Intel-based servers.</p><p>Compute is our largest investment in a server. Our heaviest workloads, from the Firewall to <a href="/cloud-computing-without-containers/">Workers</a> (our serverless offering), often require more compute than other server resources. Also, the average size in kilobytes of a web request across our network tends to be small, influenced in part by the relative popularity of APIs and mobile applications. Our approach to server design is very different from that of traditional <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">content delivery networks</a> engineered to deliver large object video libraries, for whom servers focused on storage might make more sense, and for whom re-architecting to offer serverless is prohibitively capital intensive.</p><p>Our Gen X server is intentionally designed with an “empty” PCIe slot for a potential add-on card, if it can perform some functions more efficiently than the primary CPU. Would that be a GPU, FPGA, SmartNIC, custom ASIC, TPU or something else? We’re intrigued to explore the possibilities.</p><p>In accompanying blog posts over the next few days, our hardware engineers will describe how the AMD 7642 performed against the benchmarks we care about. We are thankful for their hard work.</p>
    <div>
      <h3>Memory, Storage &amp; Network</h3>
      <a href="#memory-storage-network">
        
      </a>
    </div>
    <p>Since we are typically limited by CPU, Gen X represented an opportunity to grow components such as RAM and SSD more slowly than compute.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26Sxt3ggFdoQ0GzKF3EGss/368a3e0d940daae20b6feeb5a3e4fba2/AMD_EPYC-13.jpg" />
            
            </figure><p>For memory, we continued to use 256GB of RAM, as in our prior generation, but rated higher at 2933MHz. For storage, we continue to have ~3TB, but moved to a 3x1TB form factor using NVMe flash (instead of SATA) with increased available IOPS and higher endurance, which enables <a href="https://www.usenix.org/conference/vault20/presentation/korchagin">full disk encryption using LUKS without penalty</a>. For the network card, we continue to use a Mellanox 2x25G NIC.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ss6eIWw5bu2PmPfE000Wi/131a3644d28be05b4d00a113bc9565ea/AMD_EPYC-5.jpg" />
            
            </figure><p>We moved from our multi-node chassis back to a simple 1U form factor, designed to be lighter and less error prone during operational work at the data center. We also added multiple new ODM partners to diversify how we manufacture our equipment and to take advantage of additional global warehousing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7idOzt8cy7gwxh6oeOSpI5/302f275f840d028cbd2b0bb6a1a351cc/AMD_EPYC-7.jpg" />
            
            </figure>
    <div>
      <h3>Network Expansion</h3>
      <a href="#network-expansion">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7kEiqTtyNgUY66Rh34MpuG/42da64e77acad6084872058580ed2d89/AMD_EPYC-35.jpg" />
            
            </figure><p>Our newest generation of servers give us the flexibility to continue to build out our network even closer to every user on Earth. We’re proud of the hard work from across engineering teams on Gen X, and are grateful for the support of our partners. Be on the lookout for more blogs about these servers in the coming days.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <guid isPermaLink="false">5NWs1t95pGWDeOMVAkbQ4S</guid>
            <dc:creator>Nitin Rao</dc:creator>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
    </channel>
</rss>