ARMs Race: Ampere Altra takes on the AWS Graviton2

Over three years ago, we embraced the ARM ecosystem after evaluating the Qualcomm Centriq. The Centriq and its Falkor cores delivered a significant reduction in power consumption while maintaining performance comparable to the processor that was powering our server fleet at the time. By the time we completed porting our software stack to be compatible with ARM, Qualcomm had decided to exit the server business. Since then, we have been waiting for another server-grade ARM processor in the hope of improving power efficiency across our global network, which now spans more than 200 cities in over 100 countries.

ARM has introduced the Neoverse N1 platform, the blueprint for creating power-efficient processors licensed to institutions that can customize the original design to meet their specific requirements. Ampere licensed the Neoverse N1 platform to create the Ampere Altra, a processor that allows companies that own and manage their own fleet of servers, like ourselves, to take advantage of the expanding ARM ecosystem. We have been working with Ampere to determine whether Altra is the right processor to power our first generation of ARM edge servers.

The AWS Graviton2 is the only other publicly accessible Neoverse N1-based processor, but it is available only through Amazon’s cloud product portfolio. We wanted to understand the differences between the two, so we compared Ampere’s single-socket server, named Mt. Snow, equipped with the Ampere Altra Q80-30 against an EC2 instance of the AWS Graviton2.

The Mt. Snow 1P server equipped with the Ampere Altra Q80-30

The Ampere Altra and AWS Graviton2 alike are based on the Neoverse N1 platform by ARM, manufactured on the TSMC 7nm process. The N1 reference core features an 11-stage out-of-order execution pipeline along with the following specifications in ARM nomenclature.

Pipeline Stage	Width
Fetch	4 instructions/cycle
Decode	4 instructions/cycle
Rename	4 Mops/cycle
Dispatch	8 µops/cycle
Issue	8 µops/cycle
Commit	8 µops/cycle

An N1 core contains a 64KiB L1 instruction cache, a 64KiB L1 data cache, and a dedicated L2 cache that can be implemented with up to 1MiB per core. The L3 cache, or System Level Cache, can be implemented with up to 256MiB (up to 4MiB per slice, up to 64 slices) shared by all cores.

Cache Hierarchy	Capacity
L1i Cache	64 KiB
L1d Cache	64 KiB
L2 Cache	Up to 1 MiB
L3 Cache	Up to 256 MiB

A pair of N1 cores can optionally be combined into a dual-core configuration through the Component Aggregation Layer and then placed inside the mesh interconnect, the Coherent Mesh Network (CMN-600), which supports up to 64 pairs of cores for a total of 128 cores.

Component Aggregation Layer (CAL) supports up to two N1 cores

Coherent Hub Interface (CHI) is used to connect components to the Mesh Cross Point (XP)

Coherent Mesh Network (CMN-600) can be scaled up to 8x8 XPs supporting up to 128 cores

System Specifications

	AWS (Annapurna Labs)	Ampere
Processor Model	AWS Graviton2	Ampere Altra Q80-30
ISA	ARMv8.2	ARMv8.2+
Core / Thread Count	64C / 64T	80C / 80T
Operating Frequency	2.5 GHz	3.0 GHz
L1 Cache Size	8 MiB (64 x 128 KiB)	10 MiB (80 x 128 KiB)
L2 Cache Size	64 MiB (64 x 1 MiB)	80 MiB (80 x 1 MiB)
L3 Cache Size	32 MiB	32 MiB
System Memory	256 GB (8 x 32 GB DDR4-3200)	256 GB (8 x 32 GB DDR4-3200)

We can verify that the two processors are based on the Neoverse N1 platform by checking the CPU part field in /proc/cpuinfo, which returns 0xd0c, the designated part number for the Neoverse N1.

sung@ampere-altra:~$ cat /proc/cpuinfo | grep 'CPU part' | head -1
CPU part	: 0xd0c

sung@aws-graviton2:~$ cat /proc/cpuinfo | grep 'CPU part' | head -1
CPU part	: 0xd0c
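The same check can be scripted when many hosts need to be verified. The following is a minimal sketch, assuming the cpuinfo text has already been read into a string; the helper names `cpu_parts` and `describe` are illustrative, and the only part number taken from this article is 0xd0c.

```python
# Sketch: map the "CPU part" field of /proc/cpuinfo to a core name.
# Only the 0xd0c -> Neoverse N1 mapping comes from the article;
# the function names below are hypothetical.
KNOWN_PARTS = {0xD0C: "Neoverse N1"}


def cpu_parts(cpuinfo_text):
    """Return the set of distinct CPU part numbers found in cpuinfo text."""
    parts = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "CPU part":
            # The field is printed in hex with a 0x prefix, e.g. 0xd0c.
            parts.add(int(value.strip(), 16))
    return parts


def describe(cpuinfo_text):
    """Return known core names, or the raw hex value for unknown parts."""
    return sorted(KNOWN_PARTS.get(p, hex(p)) for p in cpu_parts(cpuinfo_text))
```

On an ARM Linux host, calling `describe(open("/proc/cpuinfo").read())` should list one entry per distinct core type. On systems whose cpuinfo has no CPU part field, the result is an empty list.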

Ampere backported various features into the Ampere Altra, notably mitigations from the ARMv8.5 instruction set architecture for speculative side-channel attacks such as Meltdown and Spectre (variants 1 and 2), giving Altra the “+” designation in its ISA (ARMv8.2+).

The Ampere Altra has 25% more physical cores and a 20% higher operating frequency than the AWS Graviton2. Because of Altra’s higher core count and frequency, we should expect the Ampere Altra to perform up to 50% better than the AWS Graviton2 in compute-bound workloads.
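The 50% figure follows from multiplying the two ratios. Here is a short sketch of that arithmetic, assuming ideal scaling with both core count and frequency (an idealization, so the result is an upper bound rather than a prediction). The core counts and frequencies come from the specification table above.

```python
# Upper bound on multi-core speedup, assuming perfect scaling with
# core count and frequency. Values come from the specification table.
GRAVITON2 = {"cores": 64, "ghz": 2.5}
ALTRA_Q80_30 = {"cores": 80, "ghz": 3.0}

core_ratio = ALTRA_Q80_30["cores"] / GRAVITON2["cores"]  # 80 / 64
freq_ratio = ALTRA_Q80_30["ghz"] / GRAVITON2["ghz"]      # 3.0 / 2.5
upper_bound = core_ratio * freq_ratio                    # product of both

print(f"cores: +{core_ratio - 1:.0%}")
print(f"frequency: +{freq_ratio - 1:.0%}")
print(f"combined upper bound: +{upper_bound - 1:.0%}")
```

Single-core workloads can only benefit from the frequency ratio, which is why 20% is the natural ceiling in that scenario.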

Both Ampere and AWS decided to implement 1MiB of L2 cache per core, but only 32MiB of L3 or System Level Cache, although the CMN-600 specification allows up to 256MiB. This leaves the Ampere Altra with a lower L3 cache-per-core ratio than the AWS Graviton2 and places Altra at a potential disadvantage in situations where the application’s working set does not fit within the L3 cache.
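To make the cache-per-core comparison concrete, the sketch below divides the shared 32MiB L3 by each processor's core count. It assumes the L3 is shared evenly, which is a simplification: a shared cache is not partitioned per core in practice.

```python
# L3 capacity per core, assuming an even split of the shared cache
# (a simplification for comparison only).
L3_MIB = 32
CORES = {"AWS Graviton2": 64, "Ampere Altra Q80-30": 80}

l3_per_core_kib = {name: L3_MIB * 1024 / n for name, n in CORES.items()}

for name, kib in l3_per_core_kib.items():
    print(f"{name}: {kib:.1f} KiB of L3 per core")
```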

Methodology

We imaged both systems with Debian Buster and downloaded our open-source benchmark suite, cf_benchmark. This benchmark suite executes 49 different workloads over approximately 15 to 30 minutes. Each workload utilizes a library that has either been used in our stack or considered at one point or another. Due to the relatively short duration of the benchmark and the libraries called by the workloads, we run cf_benchmark as our smoke test on any new system brought in for evaluation. We recommend cf_benchmark to our technology partners when we discuss new hardware, primarily processors, to ensure these workloads can compile and run successfully.

We ran cf_benchmark multiple times and observed no significant run-to-run variation. As with industry-standard benchmarks, we calculated the overall score as the geometric mean across all 49 workloads, which gives a more conservative result than the arithmetic mean, and calculated each category score the same way over its respective subset of workloads.
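The difference between the two means is easy to see with a small sketch. The ratios below are hypothetical values chosen for illustration, not measurements from cf_benchmark, and the function names are our own.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logs to avoid overflow."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical per-workload ratios (system under test / baseline).
ratios = [1.2, 1.2, 1.5, 0.6]

print(f"arithmetic mean: {arithmetic_mean(ratios):.3f}")
print(f"geometric mean:  {geometric_mean(ratios):.3f}")
```

For positive inputs the geometric mean never exceeds the arithmetic mean, and it is less sensitive to a single outlier workload, which is why it suits scores built from ratios.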

In single-core scenarios, cf_benchmark will spawn a single application thread, occupying a single hardware thread. In multi-core scenarios, cf_benchmark will spawn as many application threads as needed until all hardware threads on the processor are occupied. Neither processor implements simultaneous multithreading, so each physical core can only execute one hardware thread at a time.

Overall Performance Results

Given Altra’s advantage in both operating frequency and sheer core-count, the Ampere Altra took the lead across single and multi-core performance.

In single-core performance, the Ampere Altra outperformed the AWS Graviton2 by 16%. The differences in operating frequencies between the two processors gave the Ampere Altra a proportional advantage of up to 20% in more than half of our single-core workloads.

In multi-core performance, the Ampere Altra maintained its edge over the AWS Graviton2 with a 31% lead. We expected the Ampere Altra to attain a theoretical advantage of between 20% and 50% due to Altra’s higher operating frequency and core count, after accounting for the inherent overheads of increasing core count, primarily scaling out the mesh network. We observed that the majority of our multi-core workloads fell within this 20% to 50% range.

Categorized Results

In OpenSSL and LuaJIT, the Ampere Altra scaled proportionally with its operating frequency in single-core performance and continued to scale out with its core count in multi-core performance. The Ampere Altra maintained its lead in the vast majority of compression and Golang workloads with the exception of Brotli level 9 compression and some Golang multi-core workloads that were not able to keep the two processors busy at 100% CPU utilization.

OpenSSL

As described previously, we use a fork of OpenSSL named BoringSSL in places such as TLS handshake handling. The original OpenSSL comes with a built-in benchmark called `speed` that measures single and multi-core performance. The asymmetric and symmetric crypto workloads found here saturated the processors at 100% CPU utilization. Altra’s performance aligned with our expectations throughout this category: it maintained a consistent 20% advantage over the AWS Graviton2 in single-core performance, continued to scale with its core count, and performed, on average, 47% better in multi-core performance.
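For readers who want to reproduce this kind of measurement, the sketch below builds command lines for `openssl speed`. It relies on the tool's `-multi` option for parallel runs and `-evp` for selecting a cipher by name. The helper name `speed_command` and the algorithms shown are assumptions for illustration; the exact invocations used by cf_benchmark are not stated here.

```python
def speed_command(algorithm, processes=1, evp=False):
    """Build an argument list for `openssl speed`.

    processes > 1 adds `-multi N` to run N benchmarks in parallel;
    evp=True selects the algorithm through the EVP interface.
    """
    cmd = ["openssl", "speed"]
    if processes > 1:
        cmd += ["-multi", str(processes)]
    cmd += ["-evp", algorithm] if evp else [algorithm]
    return cmd


# Single-core public key run and an 80-process symmetric run
# (80 matches the Altra Q80-30 core count).
print(" ".join(speed_command("rsa2048")))
print(" ".join(speed_command("aes-128-gcm", processes=80, evp=True)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine with OpenSSL installed.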

Public Key Cryptography

Symmetric Key Cryptography

LuaJIT

We commonly describe LuaJIT as the glue that holds Cloudflare together. LuaJIT has a long history here at Cloudflare for the same reason it is widely used in the videogame industry, where performance and responsiveness are non-negotiable requirements.

Although we have been transitioning from LuaJIT to Rust, we welcomed the instruction cache coherency implemented into the Ampere Altra. The instruction cache coherency aims to address the drawbacks associated with LuaJIT, such as its self-modifying properties, where the next instruction that needs to be fetched is not in the instruction cache, but rather the data cache. With this feature enabled, the instruction cache becomes coherent or made aware of changes made in the data cache. The L2 cache becomes inclusive of the L1 instruction cache, meaning any cache lines found in the L1 instruction cache should also be present in the L2 cache. Furthermore, L2 cache intercepts any invalidations or stores made to any of the L2’s cache lines and propagates those changes up to the L1 instruction cache.

The Ampere Altra performed well in single-core workloads except for fasta, where both processors stalled at the front-end of the pipeline. In the multi-core version of fasta, Altra did not spend as many cycles stalled in the front-end, likely due to Ampere's implementation of instruction cache coherency. All other workloads continued to scale in multi-core performance, with binary trees coming in at 73%. Spectral was omitted since the multi-core variant failed to run on both servers.

Compression

Brotli and Gzip are the two primary types of compression we use at Cloudflare. We find value in highly efficient compression algorithms because they help us balance spending more resources on a faster processor against having a larger storage capacity, and they let us transfer content faster. The Ampere Altra performed well on both compression types with the exception of Brotli level 9 and Gzip level 4 multi-core performance.

Brotli

Out of the entire benchmark suite, Brotli level 9 caused the largest delta between the Ampere Altra and the AWS Graviton2. Although we do not use Brotli level 7 and above when performing dynamic compression, we decided to investigate further.

From historical observations, we noticed that most processors tend to experience significant performance degradation at level 9. Compression workloads generally contain a higher proportion of branch instructions, and we found approximately 15% to 20% of the dynamic instructions to be branch instructions. Intuitively, we first looked at the misprediction rate, since a high miss rate quickly adds up penalty cycles. However, to our surprise, we found the misprediction rate at level 9 to be very low.

By analyzing our dataset further, we found the common underlying cause appeared to be the high number of page faults incurred at level 9. Ampere has demonstrated that by increasing the page size from 4K to 64K bytes, we can alleviate the bottleneck and bring the Ampere Altra to parity with the AWS Graviton2. We plan to experiment with large page sizes in the future as we continue to evaluate Altra.
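A rough way to see why a larger page size helps: if each page of a working set faults once when first touched, the fault count scales inversely with page size. The sketch below uses that simplified model. The 256MiB working set is a hypothetical figure, not a measurement of Brotli, and the model ignores effects such as transparent huge pages and page cache reuse.

```python
def pages_touched(working_set_bytes, page_size_bytes):
    """Pages needed to cover a working set (ceiling division).

    Under the simplifying assumption of one fault per first-touched
    page, this is also the number of page faults.
    """
    return -(-working_set_bytes // page_size_bytes)


WORKING_SET = 256 * 1024 * 1024  # hypothetical 256 MiB working set

faults_4k = pages_touched(WORKING_SET, 4 * 1024)
faults_64k = pages_touched(WORKING_SET, 64 * 1024)

print(f"4K pages:  {faults_4k} first-touch faults")
print(f"64K pages: {faults_64k} first-touch faults")
print(f"reduction: {faults_4k // faults_64k}x")
```

On Linux, `os.sysconf("SC_PAGE_SIZE")` reports the page size the running kernel was built with, which is a quick way to confirm which configuration a host is using.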

Gzip

Golang

LuaJIT, Rust, C++, and Go are some of our primary languages; the Golang workloads found here include cryptography, compression, regular expression, and string manipulation. The Ampere Altra performed proportionally to its operating frequency against the AWS Graviton2 in the vast majority of single-core workloads. In multi-core performance, the workloads that could not saturate the processors to 100% CPU utilization did not scale proportionally to the two processors’ respective operating frequencies and core counts.

Go Cryptography

Go Compression & String

Go Regex

Frequency, Power & Thermals

We were unable to collect any telemetry on the AWS Graviton2. Power would have been an interesting data point on the AWS Graviton2, but the EC2 instance did not expose either frequency or power sensors. The Ampere model we tested was the Altra Q80-30, rated with an operating frequency of 3.0GHz and a thermal design power (TDP) of 210W.

The Ampere Altra maintained a sustained operating frequency throughout cf_benchmark, while Altra’s dynamic voltage and frequency scaling (DVFS) mechanism occasionally lowered its operating frequency in between workloads as needed to reduce power consumption. Package power varied depending on the workload, but Altra never consumed anywhere near its rated TDP, and the default fan curve kept the temperature below 70°C. At the system level, by simply querying the baseboard management controller over IPMI, we observed that the Ampere server consumed no more than 300W. More power-efficient servers translate directly to more flexibility over how we can populate our racks, and we hope this trend will continue in production.

Conclusion

In our top-down evaluation, the Ampere Altra largely performed as we had expected throughout our benchmarking suite. We regularly observed the two Neoverse N1 processors perform in proportion to their respective operating frequencies and core counts, with the Ampere Altra coming out on top of the AWS Graviton2 in these key areas. The Ampere Altra yielded better single-core performance due to its higher operating frequency and multiplied that advantage by scaling out with a higher core count in multi-core performance. We also found our Ampere evaluation server consuming less power than we had anticipated and hope this trend will continue in production.

Throughout this initial set of testing and learning more about the underlying architecture, there were many parts we found interesting when compared against contemporary x86 counterparts such as the absence of simultaneous multithreading or dynamic frequency boost. We look forward to seeing which design philosophy makes more sense for the cloud. We are currently assessing Altra’s performance against our fleet of servers and collaborating with Ampere to define an ARM edge server based on Altra that could be deployed at scale.

The Cloudflare Blog