One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.
Until recently I mostly played with Intel hardware. For example Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.
Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeons E5-2630 v4, with 10 cores each, running at 2.2GHz, with a 3.1GHz turbo boost and hyper-threading enabled, for a total of 40 threads per server.
Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.
In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with 3.0GHz turbo boost. It also has smaller last level cache: 1.375MiB/core, compared to 2.5MiB the Broadwell processors had. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.
As we head into 2018, however, change is in the air. For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (codename Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the, ahm ... ThunderX2 core?
The majestic Amberwing powered by the Falkor CPU CC BY-SA 2.0 image by DrPhotoMoto
Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.
The actual Amberwing in question
Overview
I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.
Platform
Grantley(Intel)
Purley(Intel)
Centriq(Qualcomm)
Core
Broadwell
Skylake
Falkor
Process
14nm
14nm
10nm
Issue
8 µops/cycle
8 µops/cycle
8 instructions/cycle
Dispatch
4 µops/cycle
5 µops/cycle
4 instructions/cycle
# Cores
10 x 2S + HT (40 threads)
12 x 2S + HT (48 threads)
46
Frequency
2.2GHz (3.1GHz turbo)
2.1GHz (3.0GHz turbo)
2.5 GHz
LLC
2.5 MB/core
1.35 MB/core
1.25 MB/core
Memory Channels
4
6
6
TDP
170W (85W x 2S)
170W (85W x 2S)
120W
Other features
AESCLMULAVX2
AESCLMULAVX512
AESCLMULNEONTrustzoneCRC32
Overall on paper Falkor looks very competitive. In theory a Falkor core can process 8 instructions/cycle, same as Skylake or Broadwell, and it has higher base frequency at a lower TDP rating.
Ecosystem readiness
Up until now, a major obstacle to the deployment of ARM servers was lack, or weak, support by the majority of the software vendors. In the past two years, ARM’s enablement efforts have paid off, as most Linux distros, as well as most popular libraries support the 64-bit ARM architecture. Driver availability, however, is unclear at that point.
At Cloudflare we run a complex software stack that consists of many integrated services, and running each of them efficiently is top priority.
On the edge we have the NGINX server software, that does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, therefore solid C compiler support is very important.
In addition our flavor of NGINX is highly integrated with the lua-nginx-module, and we rely a lot on LuaJIT.
Finally a lot of our services, such as our DNS server, RRDNS, are written in Go.
The good news is that both gcc and clang not only support ARMv8 in general, but have optimization profiles for the Falkor core.
Go has official support for ARMv8 as well, and they improve the arm64 backend constantly.
As for LuaJIT, the stable version, 2.0.5 does not support ARMv8, but the beta version, 2.1.0 does. Let’s hope it gets out of beta soon.
Benchmarks
OpenSSL
The first benchmark I wanted to perform, was OpenSSL version 1.1.1 (development version), using the bundled openssl speed
tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well optimized assembly code paths for both ARMv8 and the latest Intel processors.
In my opinion handcrafted assembly is the best measure of a CPU’s potential, as it bypasses the compiler bias.
Public key cryptography
Public key cryptography is all about raw ALU performance. It is interesting, but not surprising to see that in the single core benchmark, the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.
Falkor is at a disadvantage here. First, in a single core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly the ARMv8 instruction set does not have a single instruction to perform 64-bit multiplication, instead it uses a pair of MUL and UMULH instructions.
Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at an RSA2048 signature, and only because RSA2048 does not have an optimized implementation for ARM. The ECDSA performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.
It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single core benchmark, and only having 20% more cores. This can be explained by more efficient all-core turbo, and improved hyper-threading.
Symmetric key cryptography
Symmetric key performance of the Intel cores is outstanding.
AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carryless multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and every generation since they have improved their performance. ARM introduced a set of similar instructions just recently, with their 64-bit instruction set, and as an optional extension. Fortunately every hardware vendor I know of implemented those. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.
ChaCha20-Poly1305 is a more generic algorithm, designed in such a way as to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead lessens, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz---keep that in mind if you are mixing AVX-512 and other code.
The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all of the other crypto algorithms combined.
Compression
The next benchmark I wanted to see was compression. This is for two reasons. First, it is a very important workload on the edge, as having better compression saves bandwidth, and helps deliver content faster to the client. Second, it is a very demanding workload, with a high rate of branch mispredictions.
Obviously the first benchmark would be the popular zlib library. At Cloudflare we use an improved version of the library, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry, with little effort I adapted the library to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. In the process it is twice as fast as the generic library for some files.
The second benchmark is the emerging brotli library, it is written in C, and allows for a level playing field for all platforms.
All the benchmarks are performed on the HTML of blog.cloudflare.com, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, also similar to the way NGINX works.
gzip
When using gzip, at the single core level Skylake is the clear winner. Despite having lower frequency than Broadwell, it seems that having lower penalty for branch misprediction helps it pull ahead. The Falkor core is not far behind, especially with lower quality settings. At the system level Falkor performs significantly better, thanks to the higher core count. Note how well gzip scales on multiple cores.
brotli
With brotli on single core the situation is similar. Skylake is the fastest, but Falkor is not very much behind, and with quality setting 9, Falkor is actually faster. Brotli with quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).
When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At level 7 and 8 we start seeing lower performance per core, bottoming with level 9, where we get less than 3x the performance of single core, running on all cores.
My understanding is that at those quality levels Brotli consumes significantly more memory, and starts thrashing the cache. The scaling improves again at levels 10 and 11.
Bottom line for brotli, Falkor wins, since we would not consider going above quality 7 for dynamic compression.
Golang
Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.
Go crypto
I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.
As far as Go crypto is concerned ARM and Intel are not even on the same playground. Go has very optimized assembly code for ECDSA, AES-GCM and Chacha20-Poly1305 on Intel. It also has Intel optimized math functions, used in RSA computations. All those are missing for ARMv8, putting it at a big disadvantage.
Nevertheless, the gap can be bridged with a relatively small effort, and we know that with the right optimizations, performance can be on par with OpenSSL. Even a very minor change, such as implementing the function addMulVVW in assembly, lead to an over tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake, with 8,009 signatures/second.
Another interesting thing to note is that on Skylake, the Go Chacha20-Poly1305 code, that uses AVX2 performs almost identically to the OpenSSL AVX512 code, this is again due to AVX2 running at higher clock speeds.
Go gzip
Next in Go performance is gzip. Here again we have a reference point to pretty well optimized code, and we can compare it to Go. In the case of the gzip library, there are no Intel specific optimizations in place.
Gzip performance is pretty good. The single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, and lags behind Skylake. Since we already know that Falkor outperforms both when C is used, it can only mean that Go’s backend for ARMv8 is still pretty immature compared to gcc.
Go regexp
Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the builtin benchmarks on 32KB strings.
Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to the higher core count, but Skylake is significantly faster still.
Doing some profiling shows that a lot of the time is spent in the function bytes.IndexByte. This function has an assembly implementation for amd64 (runtime.indexbytebody), but generic implementation for Go. The easy regexp tests spend most of time in this function, which explains the even wider gap.
Go strings
Another important library for a webserver is the Go strings library. I only tested the basic Replacer class here.
In this test again, Falkor lags behind, and loses even to Broadwell. Profiling shows significant time is spent in the function runtime.memmove. Guess what? It has a highly optimized assembly code for amd64, that uses AVX2, but only very simple ARM assembly, that copies 8 bytes at a time. By changing three lines in that code, and using the LDP/STP instructions (load pair/store pair) to copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.
Go conclusion
Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. Is seems like the enablement effort so far was concentrated on the compiler back end, and the library was left largely untouched. There are a lot of low hanging optimization fruits out there, like my 20 minute fix for addMulVVW clearly shows. Qualcomm and other ARMv8 vendors intends to put significant engineering resources to amend this situation, but really any one can contribute to Go. So if you want to leave your mark, now is the time.
LuaJIT
Lua is the glue that holds Cloudflare together.
With the exception of the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks, and is in almost a tie in a third one.
That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.
NGINX
For the NGINX workload, I decided to generate a load that would resemble an actual server.
I set up a server that serves the HTML file used in the gzip benchmark, over https, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.
It also uses LuaJIT to redirect the incoming request, remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli with quality 5.
Each server was configured to work with as many workers as it has virtual CPUs. 40 for Broadwell, 48 for Skylake and 46 for Falkor.
As the client for this test, I used the hey program, running from 3 Broadwell servers.
Concurrently with the test, we took power readings from the respective BMC units of each server.
With the NGINX workload Falkor handled almost the same amount of requests as the Skylake server, and both significantly outperform Broadwell. The power readings, taken from the BMC show that it did so while consuming less than half the power of other processors. That means Falkor managed to get 214 requests/watt vs the Skylake’s 99 requests/watt and Broadwell’s 77.
I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given both are manufactured with the same process, and Skylake has more cores.
The low power consumption of Falkor is not surprising, Qualcomm processors are known for their great power efficiency, which has allowed them to be a dominant player in the mobile phone CPU market.
Conclusion
The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.
The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% better performance.
Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come both with a big price and over 200W TDP, whereas we are interested in improving our bang for buck metric, and performance per watt.
Currently my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.
Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.
The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the go benchmark). In comparison Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.
If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come [join us](https://www.cloudflare.com/careers/).