
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 18:09:11 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware]]></title>
            <link>https://blog.cloudflare.com/getting-to-the-core/</link>
            <pubDate>Fri, 20 Nov 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ A refresh of the hardware that Cloudflare uses to run analytics provided big efficiency improvements. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Maintaining a server fleet the size of Cloudflare’s is an operational challenge, to say the least. Anything we can do to lower complexity and improve efficiency has effects for our SRE (Site Reliability Engineer) and Data Center teams that can be felt throughout a server’s 4+ year lifespan.</p><p>At the Cloudflare Core, we process logs to analyze attacks and compute analytics. In 2020, our Core servers were in need of a refresh, so we decided to redesign the hardware to be more in line with our Gen X edge servers. We designed two major server variants for the core. The first is Core Compute 2020, an AMD-based server for analytics and general-purpose compute paired with solid-state storage drives. The second is Core Storage 2020, an Intel-based server with twelve spinning disks to run database workloads.</p>
    <div>
      <h2>Core Compute 2020</h2>
      <a href="#core-compute-2020">
        
      </a>
    </div>
    <p>Earlier this year, we blogged about our 10th generation edge servers or Gen X and the <a href="/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/">improvements</a> they delivered to our edge in <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">both</a> performance and <a href="/securing-memory-at-epyc-scale/">security</a>. The new Core Compute 2020 server leverages many of our learnings from the edge server. The Core Compute servers run a variety of workloads including Kubernetes, Kafka, and various smaller services.</p>
    <div>
      <h3>Configuration Changes (Kubernetes)</h3>
      <a href="#configuration-changes-kubernetes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation Compute</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Gold 6262</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>48C / 96T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>1.9 / 3.6 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2666</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>6 x 480GB SATA SSD</p></td><td><p>2 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>Configuration Changes (Kafka)</b></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation (Kafka)</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Silver 4116</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>24C / 48T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.1 / 3.0 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>6 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 1.92TB SATA SSD</p></td><td><p>10 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p>Both previous generation servers were Intel-based platforms, with the Kubernetes server based on Xeon 6262 processors, and the Kafka server based on Xeon 4116 processors. One goal with these refreshed versions was to converge the configurations in order to simplify spare parts and firmware management across the fleet.</p><p>As the above tables show, the configurations have been converged with the only difference being the number of NVMe drives installed depending on the workload running on the host. In both cases we moved from a dual-socket configuration to a single-socket configuration, and the number of cores and threads per server either increased or stayed the same. In all cases, the base frequency of those cores was significantly improved. We also moved from SATA SSDs to NVMe SSDs.</p>
    <div>
      <h3>Core Compute 2020 Synthetic Benchmarking</h3>
      <a href="#core-compute-2020-synthetic-benchmarking">
        
      </a>
    </div>
    <p>We determined that Kafka was the heaviest user of the SSDs. Most of the time, Kafka sequentially writes 2MB blocks to the disk. We created a simple fio script with 75% sequential write and 25% sequential read, scaling the block size from the standard page size of 4096B up to Kafka’s write size of 2MB. The results aligned with what we expected from an NVMe-based drive.</p>
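    <p>As a sketch, a fio job approximating this Kafka-like pattern might look like the following (hypothetical parameters, not our actual script; the block size is swept across runs):</p>
            <pre><code>; kafka-like mixed sequential workload -- illustrative only
[kafka-sim]
rw=rw              ; mixed sequential reads and writes
rwmixwrite=75      ; 75% writes, 25% reads
bs=2m              ; repeated with bs scaled from 4k up to 2m
size=10g
directory=/data    ; hypothetical mount point of the NVMe drive
ioengine=libaio
iodepth=8
direct=1</code></pre>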
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gqCJeXAL1sUcfVmBW3tVx/7b8a3a9a233086a321967ebb20878434/image5-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6jvSoQBw4BnPCkDYGyeqBH/b2c0a79f10afbdd73ee73d2545f5700f/image4-9.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1N6j9iEmdVaGiomZaoT1wH/cea3efb6d4781f8c0856743869feeb39/image3-8.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KfgNVcwPiNMcb9Gv3T2KU/74c8932b2f69bacfa32754c4c7c1e2d8/image6-5.png" />
            
            </figure>
    <div>
      <h3>Core Compute 2020 Production Benchmarking</h3>
      <a href="#core-compute-2020-production-benchmarking">
        
      </a>
    </div>
    <p>Cloudflare runs many of our Core Compute services in Kubernetes containers, some of which are multi-core. Transitioning to a single socket eliminated the problems associated with dual sockets, and guarantees that all cores allocated to any given container are on the same socket.</p><p>Another heavy workload that is constantly running on Compute hosts is the Cloudflare <a href="/the-csam-scanning-tool/">CSAM Scanning Tool</a>. Our Systems Engineering team isolated a Core Compute 2020 host and a previous generation compute host, had them run just this workload, and measured the time taken to compare the fuzzy hashes for images against the NCMEC hash lists and verify that they are a “miss”.</p><p>Because the CSAM Scanning Tool is very compute intensive, we specifically isolated it to take a look at its performance on the new hardware. We’ve spent a great deal of effort on software optimization and improved algorithms for this tool, but investing in faster, better hardware is also important.</p><p>In these heatmaps, the X axis represents time, and the Y axis represents “buckets” of time taken to verify that an image is not a match to any of the NCMEC hash lists. For a given time slice in the heatmap, the red point is the bucket with the most times measured, the yellow point the second most, and the green points the least. The red points on the Compute 2020 graph are all in the 5 to 8 millisecond bucket, while the red points on the previous generation heatmap are all in the 8 to 13 millisecond bucket, which shows that, on average, the Compute 2020 host verifies hashes significantly faster.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6JYLpJjG9bAUuQLyeC2kyH/9168b62067b8776a7521e7d831472f74/image2-10.png" />
            
            </figure>
    <div>
      <h2>Core Storage 2020</h2>
      <a href="#core-storage-2020">
        
      </a>
    </div>
    <p>Another major workload we identified was <a href="/clickhouse-capacity-estimation-framework/">ClickHouse</a>, which performs analytics over large datasets. The last time we upgraded our servers running ClickHouse was back in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">2018</a>.</p>
    <div>
      <h3>Configuration Changes</h3>
      <a href="#configuration-changes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon E5-2630 v4</p></td><td><p>1 x Intel Xeon Gold 6210U</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>20C / 40T</p></td><td><p>20C / 40T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.2 / 3.1 GHz</p></td><td><p>2.5 / 3.9 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>CPU Changes</b></p><p>For ClickHouse, we use a 1U chassis with 12 x 10TB 3.5” hard drives. At the time we were designing Core Storage 2020, our server vendor did not yet have an AMD version of this chassis, so we remained on Intel. However, we moved Core Storage 2020 to a single 20 core / 40 thread Xeon processor, rather than the previous generation’s dual-socket 10 core / 20 thread processors. By moving to the single-socket Xeon 6210U processor, we were able to keep the same core count, but gained 14% higher base frequency and 26% higher max turbo frequency. Meanwhile, the total CPU thermal design power (TDP), which is an approximation of the maximum power the CPU can draw, went down from 165W to 150W.</p><p>On a dual-socket server, remote memory accesses, which are memory accesses by a process on socket 0 to memory attached to socket 1, incur a latency penalty, as seen in this table:</p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>Memory latency, socket 0 to socket 0</p></td><td><p>81.3 ns</p></td><td><p>86.9 ns</p></td></tr><tr><td><p>Memory latency, socket 0 to socket 1</p></td><td><p>142.6 ns</p></td><td><p>N/A</p></td></tr></table><p>An additional advantage of having a CPU with all 20 cores on the same socket is the elimination of these remote memory accesses, which take 75% longer than local memory accesses (142.6 ns versus 81.3 ns).</p>
    <div>
      <h3>Memory Changes</h3>
      <a href="#memory-changes">
        
      </a>
    </div>
    <p>The memory in the Core Storage 2020 host is rated for operation at 2933 MHz; however, in the 8 x 32GB configuration we need on these hosts, the Intel Xeon 6210U processor clocks it at 2666 MHz. Compared to the previous generation’s 2400 MHz, this gives us an 11% boost in memory speed. While we would get a slightly higher clock speed with a balanced 6-DIMM configuration, we determined that the additional RAM capacity of the 8 x 32GB configuration was worth the sacrifice.</p>
    <div>
      <h3>Storage Changes</h3>
      <a href="#storage-changes">
        
      </a>
    </div>
    <p>Data capacity stayed the same, with 12 x 10TB SATA drives in a RAID 0 configuration for best throughput. Unlike the previous generation, the drives in the Core Storage 2020 host are helium filled. Helium produces less drag than air, resulting in potentially lower latency.</p>
    <div>
      <h3>Core Storage 2020 Synthetic benchmarking</h3>
      <a href="#core-storage-2020-synthetic-benchmarking">
        
      </a>
    </div>
    <p>We performed synthetic four corners benchmarking: IOPS measurements of random reads and writes using 4k block size, and bandwidth measurements of sequential reads and writes using 128k block size. We used the fio tool to see what improvements we would get in a lab environment. The results show a 10% latency improvement and 11% IOPS improvement in random read performance. Random write testing shows 38% lower latency and 60% higher IOPS. Write throughput is improved by 23%, and read throughput is improved by a whopping 90%.</p><p></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% Improvement</b></p></td></tr><tr><td><p>4k Random Reads (IOPS)</p></td><td><p>3,384</p></td><td><p>3,758</p></td><td><p>11.0%</p></td></tr><tr><td><p>4k Random Read Mean Latency (ms, lower is better)</p></td><td><p>75.4</p></td><td><p>67.8</p></td><td><p>10.1% lower</p></td></tr><tr><td><p>4k Random Writes (IOPS)</p></td><td><p>4,009</p></td><td><p>6,397</p></td><td><p>59.6%</p></td></tr><tr><td><p>4k Random Write Mean Latency (ms, lower is better)</p></td><td><p>63.5</p></td><td><p>39.7</p></td><td><p>37.5% lower</p></td></tr><tr><td><p>128k Sequential Reads (MB/s)</p></td><td><p>1,155</p></td><td><p>2,195</p></td><td><p>90.0%</p></td></tr><tr><td><p>128k Sequential Writes (MB/s)</p></td><td><p>1,265</p></td><td><p>1,558</p></td><td><p>23.2%</p></td></tr></table>
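    <p>As an illustration, the four corners described above map onto fio jobs along these lines (hypothetical job file, not the exact one we ran; <code>stonewall</code> serializes the jobs so they don’t compete for the disks):</p>
            <pre><code>; "four corners" benchmark sketch -- illustrative parameters only
[global]
ioengine=libaio
direct=1
iodepth=32
size=10g
directory=/data    ; hypothetical mount point of the RAID 0 array

[4k-rand-read]
rw=randread
bs=4k

[4k-rand-write]
stonewall
rw=randwrite
bs=4k

[128k-seq-read]
stonewall
rw=read
bs=128k

[128k-seq-write]
stonewall
rw=write
bs=128k</code></pre>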
    <div>
      <h3>CPU frequencies</h3>
      <a href="#cpu-frequencies">
        
      </a>
    </div>
    <p>The higher base and turbo frequencies of the Core Storage 2020 host’s Xeon 6210U processor allowed that processor to achieve higher average frequencies while running our production ClickHouse workload. A recent snapshot of two production hosts showed the Core Storage 2020 host being able to sustain an average of 31% higher CPU frequency while running ClickHouse.</p><table><tr><td><p>
</p></td><td><p><b>Previous generation (average core frequency)</b></p></td><td><p><b>Core Storage 2020 (average core frequency)</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean Core Frequency</p></td><td><p>2441 MHz</p></td><td><p>3199 MHz</p></td><td><p>31%</p></td></tr></table>
    <div>
      <h3>Core Storage 2020 Production benchmarking</h3>
      <a href="#core-storage-2020-production-benchmarking">
        
      </a>
    </div>
    <p>Our ClickHouse database hosts are continually performing merge operations to optimize the database data structures. Each individual merge operation takes just a few seconds on average, but since they’re constantly running, they can consume significant resources on the host. We sampled the average merge time every five minutes over seven days, and then analyzed the data to find the average, minimum, and maximum merge times reported by a Core Storage 2020 host and by a previous generation host. Results are summarized below.</p>
    <div>
      <h3>ClickHouse merge operation performance improvement</h3>
      <a href="#clickhouse-merge-operation-performance-improvement">
        
      </a>
    </div>
    <table><tr><td><p><b>Time (seconds)</b></p></td><td><p><b>Previous generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean time to merge</p></td><td><p>1.83</p></td><td><p>1.15</p></td><td><p>37% lower</p></td></tr><tr><td><p>Maximum merge time</p></td><td><p>3.51</p></td><td><p>2.35</p></td><td><p>33% lower</p></td></tr><tr><td><p>Minimum merge time</p></td><td><p>0.68</p></td><td><p>0.32</p></td><td><p>53% lower</p></td></tr></table><p>Our lab-measured CPU frequency and storage performance improvements on Core Storage 2020 have translated into significantly reduced times to perform this database operation.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With our Core 2020 servers, we were able to realize significant performance improvements, both in synthetic benchmarking outside production and in the production workloads we tested. This will allow Cloudflare to run the same workloads on fewer servers, saving CapEx costs and data center rack space. The similarity of the configuration of the Kubernetes and Kafka hosts should help with fleet management and spare parts management. For our next redesign, we will try to further converge the designs on which we run the major Core workloads to further improve efficiency.</p><p>Special thanks to Will Buckner and Chris Snook for their help in the development of these servers, and to Tim Bart for validating CSAM Scanning Tool’s performance on Compute.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[ClickHouse]]></category>
            <category><![CDATA[Gen X]]></category>
            <guid isPermaLink="false">4fzfrcZ8XekQ1ykkMxltp1</guid>
            <dc:creator>Brian Bassett</dc:creator>
        </item>
        <item>
            <title><![CDATA[Securing Memory at EPYC Scale]]></title>
            <link>https://blog.cloudflare.com/securing-memory-at-epyc-scale/</link>
            <pubDate>Fri, 28 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we encrypt data both in transit (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information and these measures serve to protect sensitive data from attackers but, what about data currently in use? ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Security is a serious business, one that we do not take lightly at Cloudflare. We have <a href="https://www.cloudflare.com/compliance/">invested a lot of effort</a> into ensuring that our services, both external and internal, are protected by meeting or exceeding industry best practices. Encryption is a huge part of our strategy as it is embedded in nearly every process we have. At Cloudflare, we encrypt data both in <a href="/rfc-8446-aka-tls-1-3/">transit</a> (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information and these measures serve to <a href="https://www.cloudflare.com/learning/security/what-is-information-security/">protect sensitive data</a> from attackers but, what about data currently in use?</p><p>Can encryption or any technology eliminate all threats? No, but as Infrastructure Security, it’s our job to consider worst-case scenarios. For example, what if someone were to steal a server from one of our <a href="https://www.cloudflare.com/network/">data centers</a>? How can we leverage the most reliable, cutting edge, innovative technology to secure all data on that host if it were in the wrong hands? Would it be protected? And, in particular, what about the server’s RAM?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/F40SGb897EzCafJNfUMxo/9346da5cc95f377fd15c8ebe6d3a7077/mission-impossible-1.png" />
            
            </figure><p>Data in random access memory (RAM) is usually stored in the clear. This can leave data vulnerable to software or hardware probing by an attacker on the system. Extracting data from memory isn’t an easy task, but with the rise of <a href="https://www.snia.org/sites/default/files/SSSI/NVDIMM%20-%20Changes%20are%20Here%20So%20What's%20Next%20-%20final.pdf">persistent memory technologies</a>, additional attack vectors are possible:</p><ul><li><p>Dynamic random-access memory (<a href="https://user.eng.umd.edu/~blj/talks/DRAM-Tutorial-isca2002.pdf">DRAM</a>) interface snooping</p></li><li><p>Installation of hardware devices that access host memory</p></li><li><p><a href="https://en.wikipedia.org/wiki/Cold_boot_attack">Freezing and stealing</a> dual in-line memory modules (<a href="https://en.wikipedia.org/wiki/DIMM">DIMMs</a>)</p></li><li><p>Stealing non-volatile dual in-line memory modules (<a href="https://en.wikipedia.org/wiki/NVDIMM">NVDIMMs</a>)</p></li></ul><p>So, what about enclaves? Hardware manufacturers have introduced <a href="https://en.wikipedia.org/wiki/Trusted_execution_environment">Trusted Execution Environments</a> (also known as enclaves) to help create security boundaries by isolating software execution at runtime so that sensitive data can be processed in a trusted environment, such as a secure area inside an existing processor or a <a href="https://en.wikipedia.org/wiki/Trusted_Platform_Module">Trusted Platform Module</a>.</p><p>While this allows developers to shield applications in untrusted environments, it doesn’t effectively address all of the physical system attacks mentioned previously. Enclaves were also meant to run small pieces of code. 
You <i>could</i> run an <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-baumann.pdf">entire OS in an enclave</a>, but there are <a href="https://blog.quarkslab.com/overview-of-intel-sgx-part-1-sgx-internals.html">limitations</a> and performance issues in doing so.</p><p>This isn’t meant to bash enclave usage; we just wanted a better solution for encrypting all memory at scale. We expected performance to be compromised, and conducted tests to see just how much.</p>
    <div>
      <h2>Time to get EPYC</h2>
      <a href="#time-to-get-epyc">
        
      </a>
    </div>
    <p>Since we are using <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">AMD for our tenth generation "Gen X servers"</a>, we found an interesting security feature within the System on a Chip architecture of the AMD EPYC line. Secure Memory Encryption (SME) is an <a href="https://en.wikichip.org/wiki/x86">x86</a> instruction set <a href="https://en.wikichip.org/wiki/x86/extension">extension</a> introduced by AMD and available in the <a href="https://developer.amd.com/sev/">EPYC processor line</a>. SME provides the ability to mark individual pages of memory as encrypted using standard x86 page tables. A page that is marked encrypted will be automatically decrypted when read from DRAM and encrypted when written to DRAM. SME can therefore be used to protect the contents of DRAM from physical attacks on the system.</p><p>Sounds complicated, right? Here’s the secret: It isn’t.</p>
    <div>
      <h2>Components</h2>
      <a href="#components">
        
      </a>
    </div>
    <p>SME is comprised of two components:</p><ul><li><p><b>AES-128 encryption</b> <b>engine</b>: Embedded in the memory controller. It is responsible for encrypting and decrypting data in main memory when an appropriate key is provided via the Secure Processor.</p></li><li><p><b>AMD Secure Processor (AMD-SP)</b>: An on-die 32-bit <a href="https://developer.arm.com/ip-products/processors/cortex-a/cortex-a5">ARM Cortex A5 CPU</a> that provides cryptographic functionality for secure key generation and key management. Think of this like a mini hardware security module that uses a hardware random number generator to generate the 128-bit key(s) used by the encryption engine.</p></li></ul>
    <div>
      <h2>How It Works</h2>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>We had two options available to us when it came to enabling SME. The first option, regular SME, requires setting a bit in a model-specific register, <code>MSR 0xC001_0010[SMEE]</code>. This enables the ability to set a page table entry encryption bit:</p><ul><li><p>0 = memory encryption features are disabled</p></li><li><p>1 = memory encryption features are enabled</p></li></ul><p>After memory encryption is enabled, a physical address bit (the C-bit) is used to mark whether a memory page is protected. The operating system sets this bit to 1 in the page table entry (PTE) to indicate the page should be encrypted. This causes any data assigned to that memory space to be automatically encrypted and decrypted by the AES engine in the memory controller:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Y55gr55Ypmf0kCFfSBrNN/d52fa52def521555431c16c76d9a1086/enclavng-diagrams_3x.png" />
            
            </figure>
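    <p>As a sketch of what activating this non-transparent mode looks like in practice (assuming a GRUB-based system; <code>mem_encrypt</code> is the boot parameter documented in the kernel’s AMD memory encryption notes), the kernel can be told to enable SME at boot:</p>
            <pre><code># /etc/default/grub -- illustrative only
# Requires a kernel built with CONFIG_AMD_MEM_ENCRYPT=y
GRUB_CMDLINE_LINUX="mem_encrypt=on"</code></pre>
    <p>On a Debian-style system, run <code>update-grub</code> and reboot for the change to take effect.</p>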
    <div>
      <h3>Becoming More Transparent</h3>
      <a href="#becoming-more-transparent">
        
      </a>
    </div>
    <p>While arbitrarily flagging which page table entries we want encrypted is nice, our objective is to ensure that we are incorporating the full physical protection of SME. This is where the second mode of SME came in, Transparent SME (TSME). In TSME, all memory is encrypted regardless of the value of the encrypt bit on any particular page. This includes both instruction and data pages, as well as the pages corresponding to the page tables themselves.</p><p>Enabling TSME is as simple as:</p><ol><li><p>Setting a BIOS flag:</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6AqdL0sTjEI4YKjHdwas11/d1f28bfac2bd91a7751fe4fe2aa25109/bios-tsme2.png" />
            
            </figure><p>2. Enabling kernel support with the following flag:</p><p><code>CONFIG_AMD_MEM_ENCRYPT=y</code></p><p>After a reboot you should see the following in <code>dmesg</code>:</p>
            <pre><code>$ sudo dmesg | grep SME
[    2.537160] AMD Secure Memory Encryption (SME) active</code></pre>
            
    <div>
      <h2>Performance Testing</h2>
      <a href="#performance-testing">
        
      </a>
    </div>
    <p>To weigh the pros and cons of implementation against the potential risk of a stolen server, we had to test the performance of enabling TSME. We took a test server that mirrored a production edge metal with the following specs:</p><ul><li><p>Memory: 8 x 32GB 2933MHz</p></li><li><p>CPU: AMD 2nd Gen EPYC 7642 with SMT enabled and running NPS4 mode</p></li><li><p>OS: Debian 9</p></li><li><p>Kernel: <a href="https://lwn.net/Articles/809568/">5.4.12</a></p></li></ul><p>The performance tools we used were:</p><ul><li><p><a href="http://www.cs.virginia.edu/stream/ref.html">STREAM</a></p></li><li><p><a href="http://man7.org/linux/man-pages/man8/cryptsetup.8.html">Cryptsetup</a></p></li><li><p>Benchmarky</p></li></ul>
    <div>
      <h2>Stream</h2>
      <a href="#stream">
        
      </a>
    </div>
    <p>We used a custom STREAM binary with 24 threads, using all available cores, to measure the sustainable memory bandwidth (in MB/s). Four synthetic computational kernels are run, with the output of each kernel being used as an input to the next. The best rates observed are reported for each choice of thread count.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1kwLMDcfqdWZ0dP9dgpbBO/a8ac5a90cd57a212539df64002fd1a04/Stream-SME-On-vs-Off-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3C8c6Nfn0cScGA0dt4yTaN/1bf1a1f32cceefac591aab63a1ab1e1c/Stream-Performance-delta-1.png" />
            
            </figure><p>The figures above show a performance impact of 2.6% to 4.2%, with a mean of 3.7%. These were the highest impacts we measured across our tests, and they still fell below the &gt;5% impact we had expected.</p>
    <div>
      <h2>Cryptsetup</h2>
      <a href="#cryptsetup">
        
      </a>
    </div>
    <p>While cryptsetup is normally used for encrypting disk partitions, it has a benchmarking utility that will report on a host’s cryptographic performance by iterating key derivation functions using memory only:</p><p>Example:</p>
            <pre><code>$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1162501 iterations per second for 256-bit key
PBKDF2-sha256    1403716 iterations per second for 256-bit key
PBKDF2-sha512    1161213 iterations per second for 256-bit key
PBKDF2-ripemd160  856679 iterations per second for 256-bit key
PBKDF2-whirlpool  661979 iterations per second for 256-bit key</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BC1HclFkbzijWahfsXLxH/67dec2664c6a31625c99613aef6b925e/cryptsetup-benchmark-SME-Off-vs-On-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FoosofQ0iHlQbLAdi8fuS/950d23104a94ad5bd66f15614dd72b49/crypt-Performance-delta-1.png" />
            
            </figure>
    <div>
      <h2>Benchmarky</h2>
      <a href="#benchmarky">
        
      </a>
    </div>
    <p>Benchmarky is a homegrown tool <a href="/author/ivan/">provided by our Performance team</a> to run synthetic workloads against a specific target and evaluate the performance of different components. It uses <a href="https://workers.cloudflare.com/">Cloudflare Workers</a> to send requests and read stats on responses. It also reports the versions of all important stack components and their CPU usage. Each test runs 256 concurrent clients, grabbing a cached 10kB PNG image from a performance testing endpoint and calculating the requests per second (RPS).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2uHw5bMNlZiKFLb3Ks5weQ/37cbcd116f472c6b20a35ede49d26f15/benckmarky-SME-off-and-SME-on-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3h5FsuK2Lv5xS3qUwv382A/2161213c7b0fb3715d9ab3eae66d4cc3/bench-Performance-delta-1.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In the majority of test results, performance decreased by a nominal amount, less than we expected. AMD’s official <a href="https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf">white paper</a> on SME even states that encryption and decryption of memory through the AES engine incurs a small amount of additional latency for DRAM memory accesses, dependent on the workload. Across all 11 data points, our average performance penalty was only 0.699%. Even at scale, enabling this feature has reduced the worry that <i>any</i> data could be exfiltrated from a stolen server.</p><p>While we wait for other hardware manufacturers to add support for <a href="https://en.wikichip.org/wiki/x86/tme">total memory encryption</a>, we are happy that AMD has set the bar high and is protecting our next generation edge hardware.</p>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">8ZqsGZD6Kw9pakjpzsIuP</guid>
            <dc:creator>Derek Chamorro</dc:creator>
            <dc:creator>Brian Bassett</dc:creator>
        </item>
    </channel>
</rss>