
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 08 Apr 2026 18:00:25 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Debugging Hardware Performance on Gen X Servers]]></title>
            <link>https://blog.cloudflare.com/debugging-hardware-performance-on-gen-x-servers/</link>
            <pubDate>Tue, 17 May 2022 12:59:23 GMT</pubDate>
            <description><![CDATA[ In Cloudflare’s global network, every server runs the whole software stack. Therefore, it's critical that every server performs to its maximum potential capacity ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5T1hkMmQNOOw0lktUx3JFF/2e5e4aa40039a6ad46c2a251eb421650/gen-x-color-Friday--twitter_2x-1.png" />
            
</figure><p>In Cloudflare’s global network, every server runs the whole software stack. Therefore, it's critical that every server performs to its maximum potential capacity. To give us greater flexibility from a supply chain perspective, we buy server hardware with the exact same configuration from multiple vendors. However, after the deployment of our Gen X AMD EPYC Zen 2 (Rome) servers, we noticed that servers from one vendor (which we’ll call SKU-B) were consistently performing 5-10% worse than servers from the second vendor (which we'll call SKU-A).</p><p>The graph below shows the performance discrepancy between the two SKUs in terms of percentage difference. Performance is gauged in requests per second, and this data is an average of observations captured over 24 hours.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/wvNDGyAxlthdlOg3DW5Ur/163ac5682c189dae77797fc9eda582f7/1-2.png" />
            
</figure><p>Requests per second for SKU-A and SKU-B machines before implementing performance improvements. The average RPS for SKU-B is approximately 10% below SKU-A.</p>
    <div>
      <h3>Compute performance via DGEMM</h3>
      <a href="#compute-performance-via-dgemm">
        
      </a>
    </div>
<p>The initial debugging efforts centered around the compute performance. We ran AMD’s <a href="http://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaeda3cbd99c8fb834a60a6412878226e1.html">DGEMM</a> high performance computing benchmark to determine whether CPU performance was the cause. DGEMM is designed to measure the sustained floating-point computation rate of a single server. Specifically, the code measures the floating point rate of execution of a real matrix–matrix multiplication with double precision. We ran a modified version of DGEMM equipped with specific AMD libraries to support the EPYC processor instruction set.</p><p>The DGEMM test brought out a few points related to the processor’s Thermal Design Power (TDP). TDP refers to the maximum power that a processor can draw for a thermally significant period while running a software application. The underperforming servers were only drawing 215 to 220 watts of power when fully stressed, whereas the max supported TDP on the processors is 240 watts. Additionally, their floating-point computation rate was ~100 gigaflops (GFLOPS) behind that of the better performing servers.</p><p>Screenshot from the DGEMM run of a good system:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2vOh2PokcYbj1v6kev1fQt/96fdec6bfa79b9cb9042c2f7d7d60119/2.png" />
            
            </figure><p>Screenshot from an underperforming system:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1OQ7wN5uXDI1Ttiv28QpzE/5389a020a89ace802506133dcaf04ce0/3.png" />
            
</figure><p>To debug the issue, we first tried disabling idle power saving mode (also known as C-states) in the CPU BIOS configuration. The servers then reported the expected GFLOPS numbers and achieved max TDP. Believing this could be the root cause, we moved the servers back into the production test environment for data collection.</p><p>However, the performance delta was still there.</p>
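For context, the GFLOPS figures DGEMM reports can be sanity-checked from first principles: a double-precision multiply of two n×n matrices performs roughly 2n³ floating-point operations. The sketch below uses illustrative sizes and timings, not the actual measurements from these servers:

```python
# A DGEMM of two n x n matrices performs ~2 * n^3 floating-point operations
# (one multiply and one add per element of each length-n dot product).
def dgemm_gflops(n: int, seconds: float) -> float:
    return 2.0 * n**3 / seconds / 1e9

# Illustrative: a 10,000 x 10,000 double-precision multiply completing in
# 2 seconds sustains 1000 GFLOPS, so a ~100 GFLOPS deficit like the one
# observed here is roughly a 10% loss in sustained compute throughput.
print(dgemm_gflops(10_000, 2.0))  # 1000.0
```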
    <div>
      <h3>Further debugging</h3>
      <a href="#further-debugging">
        
      </a>
    </div>
<p>We started the debugging process all over again by comparing the BIOS settings logs of both SKU-A and SKU-B. Once this debugging option was exhausted, the focus shifted to the network interface. We ran the open source network interface tool <a href="https://iperf.fr/iperf-download.php">iPerf</a> to check if there were any bottlenecks — no issues were observed.</p><p>During this activity, we noticed that the BIOS on both SKUs was not using the AMD Preferred I/O functionality, which allows devices on a single PCIe bus to obtain improved DMA write performance. We enabled the Preferred I/O option on SKU-B and tested the change in production. Unfortunately, there were no performance gains. After the focus on network activity, the team started looking into memory configuration and operating speed.</p>
    <div>
      <h3>AMD HSMP tool &amp; Infinity Fabric diagnosis</h3>
      <a href="#amd-hsmp-tool-infinity-fabric-diagnosis">
        
      </a>
    </div>
<p>The Gen X systems are configured with DDR4 memory modules that can support a maximum of 2933 megatransfers per second (MT/s). With the BIOS memory clock frequency set to Auto, the 2933 MT/s rating automatically configures the memory clock frequency to 1467 MHz. Double Data Rate (DDR) technology allows the memory signal to be sampled twice per clock cycle: once on the rising edge and once on the falling edge of the clock signal. Because of this, the reported memory speed rate is twice the true memory clock frequency. Memory bandwidth was validated by running a <a href="https://github.com/jeffhammond/STREAM">STREAM</a> benchmark test.</p><p>Running out of debugging options, we reached out to AMD for assistance and were provided a debug tool called <a href="https://github.com/amd/amd_hsmp">HSMP</a> that lets users access the Host System Management Port. This tool provides a wide variety of processor management options, such as reading and changing processor TDP limits, reading processor and memory temperatures, and reading memory and processor clock frequencies. When we ran the HSMP tool, we discovered that the memory was being clocked at a different rate from AMD’s Infinity Fabric system — the architecture that connects the memory to the processor core. As shown below, the memory clock frequency was set to 1467 MHz while the Infinity Fabric clock frequency was set to 1333 MHz.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Cq2VxpFjJN4fesZur1auc/979823ac8ebd8961a859b23ccbaee087/4.png" />
            
</figure><p>Ideally, the two clock frequencies should be equal for optimal performance on workloads sensitive to latency and throughput. Below is a snapshot for a SKU-A server where both the memory and Infinity Fabric frequencies are equal.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3niJPphw3HJr2Z8CoSBCzU/28b6a48b6f881983b6a41f38563559b4/5.png" />
            
            </figure>
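The clock relationships uncovered with HSMP reduce to simple arithmetic. The sketch below (a hypothetical helper, using the values from the screenshots above) shows why 2933 MT/s memory runs at a ~1467 MHz clock, and flags the memory/fabric mismatch found on SKU-B:

```python
# DDR samples data on both clock edges, so the advertised MT/s rating is
# twice the true memory clock frequency (hypothetical helper illustrating
# the arithmetic described above).
def ddr_clock_mhz(transfer_rate_mts: float) -> float:
    return transfer_rate_mts / 2

mclk_mhz = ddr_clock_mhz(2933)  # 1466.5, reported by HSMP as 1467 MHz
fclk_mhz = 1333                 # Infinity Fabric clock read on SKU-B

# Latency-sensitive workloads want the two clocks coupled 1:1; on SKU-B
# they were not, which the later BIOS fix corrected.
print(mclk_mhz == fclk_mhz)  # False on the underperforming SKU
```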
    <div>
      <h3>Improved Performance</h3>
      <a href="#improved-performance">
        
      </a>
    </div>
<p>Since the Infinity Fabric clock setting was not exposed as a tunable parameter in the BIOS, we asked the vendor to provide a new BIOS that sets the frequency to 1467 MHz at compile time. Once we deployed the new BIOS on the underperforming systems in production, we saw that all servers started performing to their expected levels. Below is a snapshot of the same set of systems, with data captured and averaged over a 24-hour window on the same day of the week as the previous dataset. Now, the requests per second performance of SKU-B equals, and in some cases exceeds, the performance of SKU-A!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sFDwVBvdiVO3gWaTtI2FG/3fa8433840f2d5bef0d8504a50a1962b/6.png" />
            
            </figure><p>Internet Requests Per Second (RPS) for four SKU-A machines and four SKU-B machines after implementing the BIOS fix on SKU-B. The RPS of SKU-B now equals the RPS of SKU-A.</p><p>Hello, I am Yasir Jamal. I recently joined Cloudflare as a Hardware Engineer with the goal of helping provide a better Internet to everyone. If you share the same interest, come <a href="https://www.cloudflare.com/careers/">join us</a>!</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Gen X]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">5YSW4T1CSOOlsvgHAheCMW</guid>
            <dc:creator>Yasir Jamal</dc:creator>
        </item>
        <item>
            <title><![CDATA[Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware]]></title>
            <link>https://blog.cloudflare.com/getting-to-the-core/</link>
            <pubDate>Fri, 20 Nov 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ A refresh of the hardware that Cloudflare uses to run analytics provided big efficiency improvements. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Maintaining a server fleet the size of Cloudflare’s is an operational challenge, to say the least. Anything we can do to lower complexity and improve efficiency has effects for our SRE (Site Reliability Engineer) and Data Center teams that can be felt throughout a server’s 4+ year lifespan.</p><p>At the Cloudflare Core, we process logs to analyze attacks and compute analytics. In 2020, our Core servers were in need of a refresh, so we decided to redesign the hardware to be more in line with our Gen X edge servers. We designed two major server variants for the core. The first is Core Compute 2020, an AMD-based server for analytics and general-purpose compute paired with solid-state storage drives. The second is Core Storage 2020, an Intel-based server with twelve spinning disks to run database workloads.</p>
    <div>
      <h2>Core Compute 2020</h2>
      <a href="#core-compute-2020">
        
      </a>
    </div>
    <p>Earlier this year, we blogged about our 10th generation edge servers or Gen X and the <a href="/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/">improvements</a> they delivered to our edge in <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">both</a> performance and <a href="/securing-memory-at-epyc-scale/">security</a>. The new Core Compute 2020 server leverages many of our learnings from the edge server. The Core Compute servers run a variety of workloads including Kubernetes, Kafka, and various smaller services.</p>
    <div>
      <h3>Configuration Changes (Kubernetes)</h3>
      <a href="#configuration-changes-kubernetes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation Compute</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Gold 6262</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>48C / 96T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>1.9 / 3.6 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2666</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>6 x 480GB SATA SSD</p></td><td><p>2 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>Configuration Changes (Kafka)</b></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation (Kafka)</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Silver 4116</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>24C / 48T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.1 / 3.0 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>6 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 1.92TB SATA SSD</p></td><td><p>10 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p>Both previous generation servers were Intel-based platforms, with the Kubernetes server based on Xeon 6262 processors, and the Kafka server based on Xeon 4116 processors. One goal with these refreshed versions was to converge the configurations in order to simplify spare parts and firmware management across the fleet.</p><p>As the above tables show, the configurations have been converged with the only difference being the number of NVMe drives installed depending on the workload running on the host. In both cases we moved from a dual-socket configuration to a single-socket configuration, and the number of cores and threads per server either increased or stayed the same. In all cases, the base frequency of those cores was significantly improved. We also moved from SATA SSDs to NVMe SSDs.</p>
    <div>
      <h3>Core Compute 2020 Synthetic Benchmarking</h3>
      <a href="#core-compute-2020-synthetic-benchmarking">
        
      </a>
    </div>
<p>The heaviest user of the SSDs was determined to be Kafka. The majority of the time, Kafka is sequentially writing 2MB blocks to the disk. We created a simple FIO script with 75% sequential write and 25% sequential read, scaling the block size from the standard 4096B page size up to Kafka’s 2MB write size. The results aligned with what we expected from an NVMe-based drive.</p>
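A fio job file approximating the described mix might look like the following. The actual script isn't published, so this is a reconstruction from the description (75% sequential write, 25% sequential read, with the block size swept between the two endpoints):

```ini
; kafka-like.fio -- a reconstruction of the benchmark described above,
; not the actual script. Run with: fio kafka-like.fio
[global]
ioengine=libaio
direct=1
; mixed sequential workload: 25% reads, 75% writes
rw=rw
rwmixread=25
size=10g
runtime=60
time_based

; two endpoints of the block-size sweep; intermediate sizes (16k, 64k,
; 256k, 1m, ...) would be added as further jobs
[kafka-bs-4k]
bs=4k

[kafka-bs-2m]
; wait for the previous job to finish before starting
stonewall
bs=2m
```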
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gqCJeXAL1sUcfVmBW3tVx/7b8a3a9a233086a321967ebb20878434/image5-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6jvSoQBw4BnPCkDYGyeqBH/b2c0a79f10afbdd73ee73d2545f5700f/image4-9.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1N6j9iEmdVaGiomZaoT1wH/cea3efb6d4781f8c0856743869feeb39/image3-8.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KfgNVcwPiNMcb9Gv3T2KU/74c8932b2f69bacfa32754c4c7c1e2d8/image6-5.png" />
            
            </figure>
    <div>
      <h3>Core Compute 2020 Production Benchmarking</h3>
      <a href="#core-compute-2020-production-benchmarking">
        
      </a>
    </div>
<p>Cloudflare runs many of our Core Compute services in Kubernetes containers, some of which are multi-core. Transitioning to a single socket eliminated the problems associated with dual sockets, and guarantees that all cores allocated to any given container are on the same socket.</p><p>Another heavy workload that is constantly running on Compute hosts is the Cloudflare <a href="/the-csam-scanning-tool/">CSAM Scanning Tool</a>. Our Systems Engineering team isolated a Compute 2020 compute host and the previous generation compute host, had them run just this workload, and measured the time to compare the fuzzy hashes for images to the NCMEC hash lists and verify that they are a “miss”.</p><p>Because the CSAM Scanning Tool is very compute intensive, we specifically isolated it to take a look at its performance with the new hardware. We’ve spent a great deal of effort on software optimization and improved algorithms for this tool, but investing in faster, better hardware is also important.</p><p>In these heatmaps, the X axis represents time, and the Y axis represents “buckets” of time taken to verify that it is not a match to one of the NCMEC hash lists. For a given time slice in the heatmap, the red point is the bucket with the most times measured, the yellow point the second most, and the green points the least. The red points on the Compute 2020 graph are all in the 5 to 8 millisecond bucket, while the red points on the previous Gen heatmap are all in the 8 to 13 millisecond bucket, which shows that on average, the Compute 2020 host is verifying hashes significantly faster.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6JYLpJjG9bAUuQLyeC2kyH/9168b62067b8776a7521e7d831472f74/image2-10.png" />
            
            </figure>
    <div>
      <h2>Core Storage 2020</h2>
      <a href="#core-storage-2020">
        
      </a>
    </div>
    <p>Another major workload we identified was <a href="/clickhouse-capacity-estimation-framework/">ClickHouse</a>, which performs analytics over large datasets. The last time we upgraded our servers running ClickHouse was back in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">2018</a>.</p>
    <div>
      <h3>Configuration Changes</h3>
      <a href="#configuration-changes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon E5-2630 v4</p></td><td><p>1 x Intel Xeon Gold 6210U</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>20C / 40T</p></td><td><p>20C / 40T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.2 / 3.1 GHz</p></td><td><p>2.5 / 3.9 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>CPU Changes</b></p><p>For ClickHouse, we use a 1U chassis with 12 x 10TB 3.5” hard drives. At the time we were designing Core Storage 2020, our server vendor did not yet have an AMD version of this chassis, so we remained on Intel. However, we moved Core Storage 2020 to a single 20 core / 40 thread Xeon processor, rather than the previous generation’s dual-socket 10 core / 20 thread processors. By moving to the single-socket Xeon 6210U processor, we were able to keep the same core count, but gained 17% higher base frequency and 26% higher max turbo frequency. Meanwhile, the total CPU thermal design power (TDP), which is an approximation of the maximum power the CPU can draw, went down from 165W to 150W.</p><p>On a dual-socket server, remote memory accesses, which are memory accesses by a process on socket 0 to memory attached to socket 1, incur a latency penalty, as seen in this table:</p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>Memory latency, socket 0 to socket 0</p></td><td><p>81.3 ns</p></td><td><p>86.9 ns</p></td></tr><tr><td><p>Memory latency, socket 0 to socket 1</p></td><td><p>142.6 ns</p></td><td><p>N/A</p></td></tr></table><p>An additional advantage of having a CPU with all 20 cores on the same socket is the elimination of these remote memory accesses, which take 76% longer than local memory accesses.</p>
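The remote-access penalty follows directly from the measured latencies in the table above:

```python
# Remote (cross-socket) memory access penalty on the previous generation,
# computed from the measured latencies above (nanoseconds).
local_ns = 81.3    # socket 0 -> socket 0
remote_ns = 142.6  # socket 0 -> socket 1

penalty_pct = (remote_ns - local_ns) / local_ns * 100
# ~75.4%, i.e. the roughly 76% penalty quoted above, eliminated entirely
# by moving to a single socket.
print(f"{penalty_pct:.1f}%")
```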
    <div>
      <h3>Memory Changes</h3>
      <a href="#memory-changes">
        
      </a>
    </div>
<p>The memory in the Core Storage 2020 host is rated for operation at 2933 MHz; however, in the 8 x 32GB configuration we need on these hosts, the Intel Xeon 6210U processor clocks them at 2666 MHz. Compared to the previous generation, this gives us roughly an 11% boost in memory speed. While we would get a slightly higher clock speed with a balanced six-DIMM configuration, we determined that we are willing to sacrifice the slightly higher clock speed in order to have the additional RAM capacity provided by the 8 x 32GB configuration.</p>
    <div>
      <h3>Storage Changes</h3>
      <a href="#storage-changes">
        
      </a>
    </div>
<p>Data capacity stayed the same, with 12 x 10TB SATA drives in a RAID 0 configuration for best throughput. Unlike the previous generation, the drives in the Core Storage 2020 host are helium filled. Helium produces less drag than air, resulting in potentially lower latency.</p>
    <div>
      <h3>Core Storage 2020 Synthetic benchmarking</h3>
      <a href="#core-storage-2020-synthetic-benchmarking">
        
      </a>
    </div>
    <p>We performed synthetic four corners benchmarking: IOPS measurements of random reads and writes using 4k block size, and bandwidth measurements of sequential reads and writes using 128k block size. We used the fio tool to see what improvements we would get in a lab environment. The results show a 10% latency improvement and 11% IOPS improvement in random read performance. Random write testing shows 38% lower latency and 60% higher IOPS. Write throughput is improved by 23%, and read throughput is improved by a whopping 90%.</p><p></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% Improvement</b></p></td></tr><tr><td><p>4k Random Reads (IOPS)</p></td><td><p>3,384</p></td><td><p>3,758</p></td><td><p>11.0%</p></td></tr><tr><td><p>4k Random Read Mean Latency (ms, lower is better)</p></td><td><p>75.4</p></td><td><p>67.8</p></td><td><p>10.1% lower</p></td></tr><tr><td><p>4k Random Writes (IOPS)</p></td><td><p>4,009</p></td><td><p>6,397</p></td><td><p>59.6%</p></td></tr><tr><td><p>4k Random Write Mean Latency (ms, lower is better)</p></td><td><p>63.5</p></td><td><p>39.7</p></td><td><p>37.5% lower</p></td></tr><tr><td><p>128k Sequential Reads (MB/s)</p></td><td><p>1,155</p></td><td><p>2,195</p></td><td><p>90.0%</p></td></tr><tr><td><p>128k Sequential Writes (MB/s)</p></td><td><p>1,265</p></td><td><p>1,558</p></td><td><p>23.2%</p></td></tr></table>
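The "% Improvement" column can be reproduced from the raw lab numbers (a quick sketch with a hypothetical helper name):

```python
# Reproduce the "% Improvement" column of the four-corners results above.
def pct_improvement(old: float, new: float, lower_is_better: bool = False) -> float:
    if lower_is_better:
        return (old - new) / old * 100   # latency: lower is better
    return (new - old) / old * 100       # IOPS / MB/s: higher is better

print(round(pct_improvement(75.4, 67.8, lower_is_better=True), 1))  # 10.1 (read latency)
print(round(pct_improvement(4009, 6397), 1))                        # 59.6 (random write IOPS)
print(round(pct_improvement(1155, 2195), 1))                        # 90.0 (sequential read MB/s)
```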
    <div>
      <h3>CPU frequencies</h3>
      <a href="#cpu-frequencies">
        
      </a>
    </div>
    <p>The higher base and turbo frequencies of the Core Storage 2020 host’s Xeon 6210U processor allowed that processor to achieve higher average frequencies while running our production ClickHouse workload. A recent snapshot of two production hosts showed the Core Storage 2020 host being able to sustain an average of 31% higher CPU frequency while running ClickHouse.</p><table><tr><td><p>
</p></td><td><p><b>Previous generation (average core frequency)</b></p></td><td><p><b>Core Storage 2020 (average core frequency)</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean Core Frequency</p></td><td><p>2441 MHz</p></td><td><p>3199 MHz</p></td><td><p>31%</p></td></tr></table>
    <div>
      <h3>Core Storage 2020 Production benchmarking</h3>
      <a href="#core-storage-2020-production-benchmarking">
        
      </a>
    </div>
<p>Our ClickHouse database hosts continually perform merge operations to optimize the database data structures. Each individual merge operation takes just a few seconds on average, but since they’re constantly running, they can consume significant resources on the host. We sampled the average merge time every five minutes over seven days, and then analyzed the data to find the average, minimum, and maximum merge times reported by a Core Storage 2020 host and by a previous generation host. Results are summarized below.</p>
    <div>
      <h3>ClickHouse merge operation performance improvement</h3>
      <a href="#clickhouse-merge-operation-performance-improvement">
        
      </a>
    </div>
<table><tr><td><p><b>Time (seconds)</b></p></td><td><p><b>Previous generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean time to merge</p></td><td><p>1.83</p></td><td><p>1.15</p></td><td><p>37% lower</p></td></tr><tr><td><p>Maximum merge time</p></td><td><p>3.51</p></td><td><p>2.35</p></td><td><p>33% lower</p></td></tr><tr><td><p>Minimum merge time</p></td><td><p>0.68</p></td><td><p>0.32</p></td><td><p>53% lower</p></td></tr></table><p>Our lab-measured CPU frequency and storage performance improvements on Core Storage 2020 have translated into significantly reduced times to perform this database operation.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With our Core 2020 servers, we were able to realize significant performance improvements, both in synthetic benchmarking outside production and in the production workloads we tested. This will allow Cloudflare to run the same workloads on fewer servers, saving CapEx costs and data center rack space. The similarity of the configuration of the Kubernetes and Kafka hosts should help with fleet management and spare parts management. For our next redesign, we will try to further converge the designs on which we run the major Core workloads to further improve efficiency.</p><p>Special thanks to Will Buckner and Chris Snook for their help in the development of these servers, and to Tim Bart for validating CSAM Scanning Tool’s performance on Compute.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[ClickHouse]]></category>
            <category><![CDATA[Gen X]]></category>
            <guid isPermaLink="false">4fzfrcZ8XekQ1ykkMxltp1</guid>
            <dc:creator>Brian Bassett</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why Cloudflare Chose AMD EPYC for Gen X Servers]]></title>
            <link>https://blog.cloudflare.com/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/</link>
            <pubDate>Sun, 01 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Looking back at this week's posts on the design, specifications, and performance of Cloudflare’s Gen X servers using AMD CPUs. Every server can run every service. This architectural decision has helped us achieve higher efficiency across the Cloudflare network.  ]]></description>
            <content:encoded><![CDATA[ <p>From the very beginning Cloudflare used Intel CPU-based servers (and, also, Intel components for things like NICs and SSDs). But we're always interested in optimizing the cost of running our service so that we can provide products at a low cost and high gross margin.</p><p>We're also mindful of events like the <a href="/meltdown-spectre-non-technical/">Spectre and Meltdown</a> vulnerabilities and have been working with outside parties on research into mitigation and exploitation which we hope to publish later this year.</p><p>We looked very seriously at <a href="/arm-takes-wing/">ARM-based CPUs</a> and continue to keep our software up to date for the ARM architecture so that we can use ARM-based CPUs when the requests per watt is interesting to us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FIEEXHXXE0TW20DbFc1qG/bd65d80a3f61062de14027374907daf5/gen-x-color-Friday--twitter_2x.png" />
            
</figure><p>In the meantime, we've deployed AMD's EPYC processors as part of our Gen X server platform, and for the first time we are not using any Intel components at all. This week, we announced details of this tenth generation of servers. Below is a recap of why we're excited about the design, specifications, and performance of our newest hardware.</p>
    <div>
      <h2><a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">Servers for an Accelerated Future</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>Every server can run every service. This architectural decision has helped us achieve higher efficiency across the Cloudflare network. It has also given us more flexibility to build new software or adopt the newest available hardware.</p><p>Notably, Intel is not inside. We are not using their hardware for any major server components such as the CPU, board, memory, storage, network interface card (or any type of accelerator).</p><p>This time, AMD is inside.</p><p>Compared with our prior server (<a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a>), our Gen X server processes as much as 36% more requests while costing substantially less. Additionally, it enables a ~50% decrease in L3 cache miss rate and up to 50% decrease in NGINX p99 latency, powered by a CPU rated at 25% lower TDP (thermal design power) per core.</p>
    <div>
      <h2><a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">Gen X CPU benchmarks</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>To identify the most efficient CPU for our software stack, we ran several benchmarks for key workloads such as cryptography, compression, regular expressions, and LuaJIT. Then, we simulated typical requests we see, before testing servers in live production to measure requests per watt.    </p><p>Based on our findings, we selected the single socket 48-core AMD 2nd Gen EPYC 7642.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/40QrPSva2FXZVgsZTln51q/2c15c60f27009a1ad459ae390afc49c6/pasted-image-0.png" />
            
            </figure>
    <div>
      <h2><a href="/impact-of-cache-locality/">Impact of Cache Locality</a></h2>
      <a href="#">
        
      </a>
    </div>
    <p>The single AMD EPYC 7642 performed very well during our lab testing, beating our <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a> server with dual Intel Xeon Platinum 6162 with the same total number of cores. Key factors we noticed were its large L3 cache, which led to a low L3 cache miss rate, as well as a higher sustained operating frequency.</p>
    <div>
      <h2><a href="/gen-x-performance-tuning/">Gen X Performance Tuning</a></h2>
      <a href="#">
        
      </a>
    </div>
<p>Partnering with AMD, we tuned the 2nd Gen EPYC 7642 processor to achieve an additional 6% performance. We achieved this by using power determinism and configuring the CPU's Thermal Design Power (TDP).</p>
    <div>
      <h2><a href="/securing-memory-at-epyc-scale/">Securing Memory at EPYC Scale</a></h2>
      <a href="#">
        
      </a>
    </div>
<p>Finally, we described how we use Secure Memory Encryption (SME), an interesting security feature within the System on a Chip architecture of the AMD EPYC line. We were impressed by how we could achieve RAM encryption without a significant decrease in performance. This reduces the worry that any data could be exfiltrated from a stolen server.</p><p>We enjoy designing hardware that improves the security, performance and reliability of our global network, trusted by over 26 million Internet properties.</p><p>Want to help us evaluate new hardware? <a href="https://www.cloudflare.com/careers/">Join us</a>!</p> ]]></content:encoded>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Gen X]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">6UGZUyfF1u3jKKAHkxgDNq</guid>
            <dc:creator>Nitin Rao</dc:creator>
        </item>
        <item>
            <title><![CDATA[Gen X Performance Tuning]]></title>
            <link>https://blog.cloudflare.com/gen-x-performance-tuning/</link>
            <pubDate>Thu, 27 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ We have partnered with AMD to get the best performance out of this processor and today, we are highlighting our tuning efforts that led to an additional 6% performance. Thermal design power (TDP) and dynamic power, amongst others, play a critical role when tuning a system.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6TT8W5Et00ZvvobGvrgMEY/1424f7ef2b4d6cc216b4a7ababc750ee/image7-3.png" />
            
            </figure><p>We are using AMD 2nd Gen EPYC 7642 for our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">tenth generation “Gen X” servers</a>. We found <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">many aspects of this processor compelling</a> such as its <a href="/impact-of-cache-locality/">increase in performance due to its frequency bump and cache-to-core ratio</a>. We have partnered with AMD to get the best performance out of this processor and today, we are highlighting our tuning efforts that led to an additional 6% performance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/616TtlNrwVkyeHby6ReKSD/8127c8a658986889c1d54810ecd9433f/image14.png" />
            
            </figure>
    <div>
      <h2>Thermal Design Power &amp; Dynamic Power</h2>
    </div>
    <p>Thermal design power (TDP) and dynamic power, amongst others, play a critical role when tuning a system. Many share a common belief that thermal design power is the maximum or average power drawn by the processor. The 48-core <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC 7642</a> has a TDP rating of 225W, just as high as the 64-core <a href="https://www.amd.com/en/products/cpu/amd-epyc-7742">AMD EPYC 7742</a>. Intuitively, fewer cores should translate into lower power consumption, so why is the AMD EPYC 7642 expected to draw just as much power as the AMD EPYC 7742?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/420qeX6xoS09gLZaWH9F8p/051cc471d235fce238703d227f65a5f7/image10-2.png" />
            
            </figure><p><i>TDP Comparison between the EPYC 7642, EPYC 7742 and top-end EPYC 7H12</i></p><p>Let’s take a step back and understand that TDP does not always mean the maximum or average power that the processor will draw. At a glance, TDP may provide a good estimate of the processor’s power draw, but TDP is really about how much heat the processor is expected to generate. TDP should be used as a guideline for designing cooling solutions with appropriate thermal capacitance: the cooling solution is expected to dissipate heat up to the TDP indefinitely. In turn, this helps chip designers determine a power budget and create new processors around that constraint. In the case of the AMD EPYC 7642, the extra power budget was spent on retaining all 256 MiB of L3 cache and on letting the cores operate at a higher sustained frequency as needed during our peak hours.</p><p>The overall power drawn by the processor depends on many factors; dynamic or active power establishes the relationship between power and frequency. Dynamic power is a function of the chip's switched capacitance, supply voltage, operating frequency, and activity factor. The activity factor depends on the characteristics of the workloads running on the processor, and different workloads have different characteristics. For example, the hotspots in <a href="/cloudflare-workers-unleashed/">Cloudflare Workers</a>, <a href="/seamless-remote-work-with-cloudflare-access/">Cloudflare for Teams</a>, or any other program will exercise different parts of the processor, affecting the activity factor.</p>
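The relationship described above is commonly summarized as P ≈ α · C · V² · f. A toy calculation (all constants below are illustrative, not measured EPYC 7642 values) shows why a frequency bump, which usually brings a voltage bump with it, raises power superlinearly:

```python
# Toy model of dynamic (active) CPU power: P = alpha * C * V^2 * f
# alpha: activity factor, C: switched capacitance (farads),
# V: supply voltage (volts), f: operating frequency (hertz).
def dynamic_power(alpha: float, capacitance: float,
                  voltage: float, frequency: float) -> float:
    return alpha * capacitance * voltage ** 2 * frequency

# Illustrative per-core-scale numbers only -- not measured EPYC values.
base = dynamic_power(alpha=0.5, capacitance=2e-9, voltage=1.1, frequency=2.3e9)
boost = dynamic_power(alpha=0.5, capacitance=2e-9, voltage=1.2, frequency=3.3e9)
print(f"base: {base:.2f} W, boost: {boost:.2f} W")
```

Because voltage enters the product squared, running a core at boost frequency costs noticeably more than the frequency ratio alone would suggest.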
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12AEwpVHIGoT9XGcqHo83V/319acd3abb48e14d48d5defd4dc8a0c4/image-7.png" />
            
            </figure>
    <div>
      <h2>Determinism Modes &amp; Configurable TDP</h2>
    </div>
    <p>The latest AMD processors continue to implement AMD Precision Boost, which opportunistically allows the processor to operate at a higher frequency than its base frequency. How much higher the frequency can go depends on many factors such as electrical, thermal, and power limitations.</p><p>AMD offers a knob known as determinism modes. This knob affects power and frequency, placing emphasis on one over the other depending on the determinism mode selected. <a href="https://www.amd.com/system/files/2017-06/Power-Performance-Determinism.pdf">There is a white paper</a> posted on AMD's website that goes into the nuanced details of determinism, and remembering how frequency and power are related - this was the simplest definition I took away from the paper:</p><p><b>Performance Determinism</b> - Power is a function of frequency.</p><p><b>Power Determinism</b> - Frequency is a function of power.</p><p>Another knob available to us is Configurable Thermal Design Power (cTDP), which allows the end user to reconfigure the factory-default thermal design power. The AMD EPYC 7642 is rated at 225W; however, AMD has given us guidance that this particular part can be reconfigured up to 240W. As mentioned previously, the cooling solution must support up to the TDP to avoid throttling. We did our due diligence and tested that our cooling solution can reliably dissipate up to 240W, even at higher ambient temperatures.</p><p>We gave these two knobs a try and got the results shown in the figures below, using 10 KiB web assets over HTTPS <a href="/author/ivan/">provided by our performance team</a> over a sustained period of time to properly heat up the processor. The figures are broken down into the average operating frequency across all 48 cores, the total package power drawn from the socket, and the highest reported die temperature out of the 8 dies laid out on the AMD EPYC 7642. Each figure compares the results we obtained from power and performance determinism, and finally, cTDP at 240W using power determinism.</p><p>Performance determinism clearly emphasized stabilizing operating frequency by matching the frequency of all cores to the lowest-performing core over time. This mode appears ideal for tuners who value predictability over maximizing the processor’s operating frequency - in other words, the number of cycles the processor has at its disposal every second should be predictable. This is useful if two or more cores share data dependencies: allowing the cores to work in unison can prevent one core from stalling another. Power determinism, on the other hand, maximized its power and frequency as much as it could.</p>
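These two definitions can be sketched as a toy model. Assuming, purely for illustration, that package power scales roughly with the cube of frequency (since voltage tends to rise along with frequency), each mode simply solves the same relationship in the opposite direction:

```python
# Toy sketch of the two determinism definitions. The constant below is an
# arbitrary fit for this sketch, not AMD's actual control algorithm.
K = 8.0e-27  # illustrative coefficient in P = K * f**3

def performance_determinism(target_freq_hz: float) -> float:
    """Power is a function of frequency: hold frequency, derive power (W)."""
    return K * target_freq_hz ** 3

def power_determinism(power_budget_w: float) -> float:
    """Frequency is a function of power: hold power, derive frequency (Hz)."""
    return (power_budget_w / K) ** (1 / 3)

print(performance_determinism(2.8e9))  # watts drawn at a fixed 2.8 GHz
print(power_determinism(225.0))        # Hz achievable within a 225 W budget
```

In this sketch, raising the power budget (as cTDP at 240W does) directly buys a higher achievable frequency under power determinism.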
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yiJbib8HJNMkMZc7On1K4/19411800792deb9713cc65c943fc232c/image12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KXAIrxO2pCwpDAMlHyKW/49769b6a78d78577c13ea0109d4410ac/image11-2.png" />
            
            </figure><p>Heat generated by power draw can be compounded even further by ambient temperature. All components should stay within the safe operating temperatures specified by the vendor’s datasheet at all times; if unsure, reach out to your vendor as soon as possible. We put the AMD EPYC 7642 through several thermal tests and determined that it will operate within safe operating temperatures. Before diving into the figure below, it is important to mention that our fan speed ramped up over time, preventing the processor from reaching its critical temperature; in other words, our cooling solution, a combination of fans and a heatsink rated for 240W, worked as intended. We have yet to see the processors throttle in production. The figure below shows the highest temperature of the 8 dies laid out on the AMD EPYC 7642.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S2zzgzeBBeyhzBXxkU1Zz/8c7880845e49ceec2cf761265724d046/image8-3.png" />
            
            </figure><p>An unexpected byproduct of performance determinism was frequency jitter. It took longer for performance determinism to reach a steady-state frequency, which contradicted the predictable performance the mode was meant to deliver.</p>
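One simple way to quantify this kind of jitter, given periodic samples of a core's frequency (on Linux these could be read from scaling_cur_freq under /sys/devices/system/cpu/), is the coefficient of variation. The sample values below are hypothetical:

```python
from statistics import mean, stdev

def frequency_jitter(samples_khz: list) -> float:
    """Coefficient of variation of frequency samples: higher = more jitter."""
    return stdev(samples_khz) / mean(samples_khz)

# Hypothetical per-second samples of one core's frequency in kHz, e.g. read
# from /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq on Linux.
steady  = [2950000, 2948000, 2952000, 2949000, 2951000]
jittery = [2600000, 3100000, 2750000, 3300000, 2850000]
print(f"steady:  {frequency_jitter(steady):.4f}")
print(f"jittery: {frequency_jitter(jittery):.4f}")
```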
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LK2tZgnsa8m86vPQYdjrS/87474e0b887e438d97254fde28b2f8ac/image9-1.png" />
            
            </figure><p>Finally, here are the deltas observed in the real world, with performance determinism and the factory-default TDP of 225W as the baseline. Deltas under 10% can be influenced by a wide variety of factors, especially in production; however, none of our data showed a negative trend. Here are the averages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZT2L9MokepRw2GmZbKuxE/f3facacec8efefc02cf704137f09930e/image-5.png" />
            
            </figure>
    <div>
      <h2>Nodes Per Socket (NPS)</h2>
    </div>
    <p>The AMD EPYC 7642 <a href="/impact-of-cache-locality/">physically lays out its 8 dies across 4 quadrants on a single package</a>. With such a layout, AMD supports dividing the dies into NUMA domains, a feature called Nodes Per Socket or NPS. Available NPS options differ from model to model; the AMD EPYC 7642 supports 4, 2, and 1 node(s) per socket. We thought it was worth exploring this option even though none of these choices yields a shared last-level cache. We did not observe any significant performance deltas across the NPS options.</p>
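As a rough sketch of what these settings mean for software, assuming an even split of cores across the package (a simplification of the real die topology), the 48 cores fall into NUMA domains like this:

```python
def nps_domains(total_cores: int = 48, nps: int = 4) -> dict:
    """Split core IDs into `nps` NUMA domains, assuming an even spread of
    cores across the package quadrants (a simplification of the real
    die layout, for illustration only)."""
    per_node = total_cores // nps
    return {node: list(range(node * per_node, (node + 1) * per_node))
            for node in range(nps)}

for nps in (1, 2, 4):  # the settings the EPYC 7642 supports
    domains = nps_domains(48, nps)
    print(f"NPS{nps}: {len(domains)} node(s), {len(domains[0])} cores each")
```

On a live system, the actual core-to-node mapping can be inspected with tools such as `numactl --hardware`.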
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aTakzcywLXCq9cVbJUapE/99073a5a3e205037b7561c641258091f/image15.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26jjqqzj8LFJecjFpq8ucU/4306eb572afd428965162fbe454172c4/image13-1.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
    </div>
    <p>Tuning the processor to yield an additional 6% throughput, or requests per second, out of the box has been a great start. Some parts will have more room for improvement than others. In production, we achieved an additional 2% requests per second with power determinism, and another 4% by reconfiguring the TDP to 240W. We will continue to work with AMD as well as internal teams within Cloudflare to identify additional areas of improvement. If you like the idea of supercharging the Internet on behalf of our customers, then <a href="https://www.cloudflare.com/careers/jobs/">come join us</a>.</p> ]]></content:encoded>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Gen X]]></category>
            <guid isPermaLink="false">5S0eOaziY4ST5v0PscTOAW</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
    </channel>
</rss>