While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomenon.
When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the cipher ChaCha20-Poly1305 does not scale very well. On a single thread, it performed at approximately 2.89 GB/s, whereas on 24 cores and 48 threads it performed at just over 35 GB/s.
Now this is a very high number, but I would like to see something closer to 69 GB/s. 35 GB/s is just 1.46 GB/s/core, or roughly 50% of the single core performance. AES-GCM scales much better, to 80% of single core performance, which is understandable, because the CPU can sustain a higher turbo frequency on a single core than on all cores.
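The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch using the numbers quoted above; the function name `scaling_efficiency` is made up for illustration.

```python
def scaling_efficiency(aggregate_gbps, cores, single_core_gbps):
    """Return (per-core throughput, fraction of single-core speed retained)."""
    per_core = aggregate_gbps / cores
    return per_core, per_core / single_core_gbps


# Figures quoted in the text: 35 GB/s on 24 cores, 2.89 GB/s on one thread.
per_core, efficiency = scaling_efficiency(35.0, 24, 2.89)

# Throughput if every core matched the single-thread result.
ideal = 24 * 2.89

print(f"per core:   {per_core:.2f} GB/s")
print(f"efficiency: {efficiency:.0%}")
print(f"ideal:      {ideal:.0f} GB/s")
```

The per-core figure divides by physical cores rather than hardware threads, which matches the 1.46 GB/s/core number in the text.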
Why is the scaling of ChaCha20-Poly1305 so poor? Meet AVX-512. AVX-512 is a new Intel instruction set that adds many new 512-bit wide SIMD instructions and promotes most of the existing ones to 512-bit. The problem with such wide instructions is that they consume power. A lot of power. Imagine a single instruction that does the work of 64 regular byte instructions, or 8 full blown 64-bit instructions.
To keep power in check, Intel introduced something called dynamic frequency scaling. It reduces the base frequency of the processor whenever AVX2 or AVX-512 instructions are used. This is not new, and has existed since Haswell introduced AVX2 three years ago.
The scaling gets worse when more cores execute AVX-512 and when multiplication is used.
If you only run AVX-512 code, then everything is good. The frequency is lower, but your overall productivity is higher, because each instruction does more work.
OpenSSL 1.1.1dev implements several variants of ChaCha20-Poly1305, including AVX2 and AVX-512 variants. BoringSSL implements a different AVX2 version of ChaCha20-Poly1305. It is understandable then why BoringSSL achieves only 1.6GB/s on a single core, compared to the 2.89GB/s OpenSSL does.
So how does this affect you, if you mix a little AVX-512 with your real workload? We use the Xeon Silver 4116 CPUs, with a base frequency of 2.1 GHz, in a dual socket configuration. From a figure I found on wikichip it seems that running AVX-512 even just on one core on this CPU will reduce the base frequency to 1.8 GHz. Running AVX-512 on all cores will reduce it to just 1.4 GHz.
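To get a feel for why a little AVX-512 can hurt while a lot of it helps, here is a deliberately crude model. It assumes the core stays at the throttled clock for the whole run, which is pessimistic, and the 5% vectorizable share and 2x speedup are hypothetical inputs picked for illustration, not measurements. Only the 1.8 GHz and 2.1 GHz figures come from the text.

```python
def relative_throughput(vector_fraction, vector_speedup, clock_ratio):
    """Throughput relative to scalar code running at the full clock.

    vector_fraction: share of the baseline run time spent in code that
                     AVX-512 accelerates (hypothetical input).
    vector_speedup:  how many times fewer cycles that code needs with
                     AVX-512 (hypothetical input).
    clock_ratio:     throttled frequency divided by normal frequency.
    """
    # Cycles needed after vectorizing, relative to the scalar baseline.
    cycles = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    # Fewer cycles help, a slower clock hurts.
    return clock_ratio / cycles


def break_even_fraction(vector_speedup, clock_ratio):
    """vector_fraction at which the speedup exactly offsets the throttling."""
    return (1.0 - clock_ratio) / (1.0 - 1.0 / vector_speedup)


# Frequencies quoted above for a single AVX-512 core on the Silver 4116.
ratio = 1.8 / 2.1

mostly_scalar = relative_throughput(0.05, 2.0, ratio)  # 5% of time is crypto
all_vector = relative_throughput(1.0, 2.0, ratio)      # pure AVX-512 workload
threshold = break_even_fraction(2.0, ratio)

print(f"5% vectorizable:   {mostly_scalar:.2f}x")
print(f"100% vectorizable: {all_vector:.2f}x")
print(f"break-even share:  {threshold:.0%}")
```

Under these assumptions the mostly scalar workload ends up slower than before, while the fully vectorized one still comes out ahead. Real CPUs move between frequency levels dynamically, so treat this as intuition rather than a prediction.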
Now imagine you run a webserver with Apache or NGINX. In addition you have many other services, performing some real, important work. What happens if you start encrypting your traffic with ChaCha20-Poly1305 using AVX-512? That is the question I asked myself.
I compiled two versions of NGINX, one with OpenSSL 1.1.1dev and the other with BoringSSL, and installed them on our server with two Xeon Silver 4116 CPUs, for a total of 24 cores.
I configured the server to serve a medium-sized HTML page, and perform some meaningful work on it. I used LuaJIT to remove line breaks and extra spaces, and brotli to compress the file.
I then monitored the number of requests per second served under full load. This is what I got:
By using ChaCha20-Poly1305 over AES-128-GCM, the server that uses OpenSSL serves 10% fewer requests per second. And that is a huge number! It is equivalent to giving up on two cores, for nothing. One might think that this is due to ChaCha20-Poly1305 being inherently slower. But that is not the case.
First, BoringSSL performs equivalently well with AES-GCM and ChaCha20-Poly1305.
Second, even when only 20% of the requests use ChaCha20-Poly1305, the server throughput drops by more than 7%, and by 5.5% when 10% of the requests are ChaCha20-Poly1305. For reference, 15% of the TLS requests Cloudflare handles are ChaCha20-Poly1305.
Finally, according to perf, the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less than 0.3% when doing ChaCha20-Poly1305 for 10% of the requests. Regardless, the CPU throttles down, because that is what it does when it sees AVX-512 running on all cores.
It is hard to say just how much each core is throttled at any given time, but by doing some sampling with lscpu, I found that while executing the openssl speed -evp chacha20-poly1305 -multi 48 benchmark, it shows CPU MHz: 1199.963. For OpenSSL with all AES-GCM connections I got CPU MHz: 2399.926, and for OpenSSL with all ChaCha20-Poly1305 connections I saw CPU MHz: 2184.338, which is about 9% slower.
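The sampling can be scripted. The sketch below assumes lscpu prints a CPU MHz line in the form quoted above (some versions of lscpu may not), and the helper names are my own. It then reproduces the slowdown from the three readings in the text.

```python
import re
import subprocess


def parse_cpu_mhz(lscpu_output):
    """Extract the 'CPU MHz' value from lscpu output, or None if absent."""
    match = re.search(r"^CPU MHz:\s*([0-9.]+)", lscpu_output, re.MULTILINE)
    return float(match.group(1)) if match else None


def sample_cpu_mhz():
    """Run lscpu once and return the reported frequency (may be None)."""
    result = subprocess.run(
        ["lscpu"], capture_output=True, text=True, check=True
    )
    return parse_cpu_mhz(result.stdout)


def slowdown(reference_mhz, observed_mhz):
    """Fractional clock reduction relative to the reference reading."""
    return 1.0 - observed_mhz / reference_mhz


# The three readings quoted in the text.
aes_gcm = parse_cpu_mhz("CPU MHz: 2399.926")
chacha = parse_cpu_mhz("CPU MHz: 2184.338")
speed_bench = parse_cpu_mhz("CPU MHz: 1199.963")

print(f"NGINX, all ChaCha20-Poly1305: {slowdown(aes_gcm, chacha):.1%} slower")
print(f"openssl speed -multi 48:      {slowdown(aes_gcm, speed_bench):.1%} slower")
```

Calling `sample_cpu_mhz()` in a loop while the server is under load gives a rough picture of the clock. A single reading is only a snapshot, so averaging many samples is advisable.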
Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower in OpenSSL but is the same in BoringSSL. Why might that be? The reason here is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and only uses simple xor, shift and add operations for ChaCha20, which allows it to run at the base frequency.
OpenSSL 1.1.1dev is still in development, therefore I suspect no one is affected by this issue yet. We switched to BoringSSL months ago, and our server performance is not affected by this issue.
What the future holds is unclear. Intel announced very cool new ISA extensions for the future generation of CPUs, that are expected to improve crypto performance even further. Those extensions include AVX512+VAES, AVX512+VPCLMULQDQ and AVX512IFMA. But if the frequency scaling issue is not resolved by then, using those for general purpose cryptography libraries will do (much) more harm than good.
The problem is not with cryptography libraries alone. OpenSSL did nothing wrong by trying to get the best possible performance; on the contrary, I wrote a decent amount of AVX-512 code for OpenSSL myself. The observed behavior is a sad side effect. Many libraries out there use AVX and AVX2 instructions. They will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
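A first step is simply finding out whether a machine exposes AVX-512 at all. On Linux x86 systems the kernel lists CPU features on the flags line of /proc/cpuinfo, with AVX-512 features named avx512f, avx512bw and so on. The sketch below only reports what is advertised and does not disable anything. The function names and the sample text are illustrative.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX-512 feature flags listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # The first flags line is enough; every core repeats it.
            return sorted(f for f in value.split() if f.startswith("avx512"))
    return []


def read_local_flags(path="/proc/cpuinfo"):
    """Read the local machine's AVX-512 flags (Linux only)."""
    with open(path) as handle:
        return avx512_flags(handle.read())


# Made-up sample in the /proc/cpuinfo layout, for demonstration only.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx512f avx512bw avx512vl\n"
print(avx512_flags(sample))
```

An empty list from `read_local_flags()` means the kernel does not advertise AVX-512, so the throttling described here should not be triggered on that machine.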