
On the dangers of Intel's frequency scaling

2017-11-10


While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomenon.

When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the cipher ChaCha20-Poly1305 does not scale very well. On a single thread, it performed at approximately 2.89GB/s, whereas on 24 cores and 48 threads it performed at just over 35GB/s.
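These numbers come from the openssl speed benchmark. The invocations below are a sketch of the single-threaded and 48-thread runs; only the -multi 48 form is quoted verbatim later in this post, the single-threaded form is an assumption:

    openssl speed -evp chacha20-poly1305
    openssl speed -evp chacha20-poly1305 -multi 48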

CC BY-SA 2.0 image by blumblaum

Now this is a very high number, but I would like to see something closer to 69GB/s (24 × 2.89GB/s). 35GB/s is just 1.46GB/s per core, or roughly 50% of the single-core performance. AES-GCM scales much better, to 80% of single-core performance, which is understandable, because the CPU can sustain a higher turbo frequency on a single core, but not on all cores.

[Chart: single-core vs. all-core throughput scaling for ChaCha20-Poly1305 and AES-GCM]

Why is the scaling of ChaCha20-Poly1305 so poor? Meet AVX-512. AVX-512 is a new Intel instruction set that adds many new 512-bit wide SIMD instructions and promotes most of the existing ones to 512-bit. The problem with such wide instructions is that they consume power. A lot of power. Imagine a single instruction that does the work of 64 regular byte instructions, or 8 full-blown 64-bit instructions.
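To make that concrete, here is a minimal C sketch (an illustration, not code from OpenSSL) using AVX-512BW intrinsics: a single _mm512_add_epi8 adds 64 bytes at once, work that would otherwise take 64 scalar byte additions.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustration only: assumes an AVX-512BW capable CPU and a compiler
     * invoked with -mavx512bw. Each 512-bit instruction processes 64 bytes. */
    void add_bytes_avx512(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            __m512i va = _mm512_loadu_si512((const void *)(a + i));
            __m512i vb = _mm512_loadu_si512((const void *)(b + i));
            _mm512_storeu_si512((void *)(dst + i), _mm512_add_epi8(va, vb));
        }
        for (; i < n; i++)  /* scalar tail: one byte per instruction */
            dst[i] = (uint8_t)(a[i] + b[i]);
    }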

To keep power in check, Intel introduced something called dynamic frequency scaling. It reduces the base frequency of the processor whenever AVX2 or AVX-512 instructions are used. This is not new; it has existed since Haswell introduced AVX2 three years ago.

The scaling gets worse when more cores execute AVX-512 and when multiplication is used.

If you only run AVX-512 code, then everything is good. The frequency is lower, but your overall productivity is higher, because each instruction does more work.

OpenSSL 1.1.1dev implements several variants of ChaCha20-Poly1305, including AVX2 and AVX-512 variants. BoringSSL implements a different AVX2 version of ChaCha20-Poly1305. It is understandable then why BoringSSL achieves only 1.6GB/s on a single core, compared to the 2.89GB/s OpenSSL does.

So how does this affect you, if you mix a little AVX-512 with your real workload? We use the Xeon Silver 4116 CPUs, with a base frequency of 2.1GHz, in a dual-socket configuration. From a figure I found on WikiChip, it seems that running AVX-512 on even a single core of this CPU will reduce the base frequency to 1.8GHz. Running AVX-512 on all cores will reduce it to just 1.4GHz.

Now imagine you run a webserver with Apache or NGINX. In addition you have many other services, performing some real, important work. What happens if you start encrypting your traffic with ChaCha20-Poly1305 using AVX-512? That is the question I asked myself.

I compiled two versions of NGINX, one with OpenSSL 1.1.1dev and the other with BoringSSL, and installed them on our server with two Xeon Silver 4116 CPUs, for a total of 24 cores.

I configured the server to serve a medium-sized HTML page, and perform some meaningful work on it. I used LuaJIT to remove line breaks and extra spaces, and brotli to compress the file.

I then monitored the number of requests per second served under full load. This is what I got:

[Chart: requests per second served under full load, by TLS library and cipher]

By using ChaCha20-Poly1305 instead of AES-128-GCM, the server that uses OpenSSL serves 10% fewer requests per second. And that is a huge number! It is equivalent to giving up two cores, for nothing. One might think that this is due to ChaCha20-Poly1305 being inherently slower. But that is not the case.

First, BoringSSL performs equivalently well with AES-GCM and ChaCha20-Poly1305.

Second, even when only 20% of the requests use ChaCha20-Poly1305, the server throughput drops by more than 7%, and by 5.5% when 10% of the requests are ChaCha20-Poly1305. For reference, 15% of the TLS requests Cloudflare handles are ChaCha20-Poly1305.

Finally, according to perf, the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less than 0.3% when only 10% of the requests use ChaCha20-Poly1305. Regardless, the CPU throttles down, because that is what it does when it sees AVX-512 running on all cores.
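For reference, a system-wide sample along these lines (a sketch of one way to measure this, not necessarily the exact commands used) is enough to see how much CPU time the AVX-512 ChaCha20-Poly1305 routines consume:

    perf record -a -g -- sleep 30   # sample all CPUs for 30 seconds under load
    perf report --sort symbol       # look for the ChaCha20/Poly1305 symbols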

It is hard to say just how much each core is throttled at any given time, but I did some sampling using lscpu. When executing the openssl speed -evp chacha20-poly1305 -multi 48 benchmark, it showed CPU MHz: 1199.963. For OpenSSL with all AES-GCM connections I got CPU MHz: 2399.926, and for OpenSSL with all ChaCha20-Poly1305 connections I saw CPU MHz: 2184.338, which is about 9% lower.
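The sampling itself can be as simple as watching the reported frequency in a loop while the benchmark or the server is running (a sketch, not necessarily the exact commands used):

    # print the frequency reported by lscpu once per second
    while true; do lscpu | grep "CPU MHz"; sleep 1; done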

Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower in OpenSSL, but not in BoringSSL. Why might that be? The reason is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and uses only simple xor, shift and add operations for ChaCha20, which allows it to run at the base frequency.
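For context, here is the ChaCha20 quarter-round (as defined in RFC 7539) as a plain C sketch: it is built entirely from 32-bit additions, XORs and rotations, which is why a vectorized ChaCha20 needs no multiplier, unlike Poly1305.

    #include <stdint.h>

    /* ChaCha20 quarter-round: only add, xor and rotate, no multiplication. */
    #define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

    static void chacha20_quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
    {
        *a += *b; *d ^= *a; *d = ROTL32(*d, 16);
        *c += *d; *b ^= *c; *b = ROTL32(*b, 12);
        *a += *b; *d ^= *a; *d = ROTL32(*d, 8);
        *c += *d; *b ^= *c; *b = ROTL32(*b, 7);
    }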

OpenSSL 1.1.1dev is still in development, so I suspect no one is affected by this issue yet. We switched to BoringSSL months ago, and our server performance is not affected by this issue.

What the future holds is unclear. Intel announced very cool new ISA extensions for a future generation of CPUs that are expected to improve crypto performance even further. Those extensions include AVX512+VAES, AVX512+VPCLMULQDQ and AVX512IFMA. But if the frequency scaling issue is not resolved by then, using those in general-purpose cryptography libraries will do (much) more harm than good.

The problem is not with cryptography libraries alone. OpenSSL did nothing wrong by trying to get the best possible performance; on the contrary, I wrote a decent amount of AVX-512 code for OpenSSL myself. The observed behavior is a sad side effect. There are many libraries out there that use AVX and AVX2 instructions, they will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high-performance task, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
