I wouldn’t be surprised if the title of this post attracts some Bitcoin aficionados, but if you are one of them, I have to disappoint you. For me crypto means cryptography, not cybermoney, and the price we pay for it is measured in CPU cycles, not USD.
If you got to this second paragraph, you have probably heard that TLS today is very cheap to deploy. Considerable effort has gone into optimizing the cryptography stacks of OpenSSL and BoringSSL, as well as the hardware that runs them. However, aside from the occasional benchmark that tells us how many GB/s a given algorithm can encrypt, or how many signatures a certain elliptic curve can generate, I did not find much information about the cost of crypto in real-world TLS deployments.
CC BY-SA 2.0 image by Michele M. F.
As Cloudflare is the largest provider of TLS on the planet, one would think we perform a lot of cryptography related tasks, and one would be absolutely correct. More than half of our external traffic is now TLS, as well as all of our internal traffic. Being in that position means that crypto performance is critical to our success, and as it happens, every now and then we like to profile our production servers, to identify and fix hot spots.
In this post I want to share the latest profiling results that relate to crypto.
The profiled server is located in our Frankfurt data center, and sports 2 Xeon Silver 4116 processors. Every geography has a slightly different TLS usage pattern. In Frankfurt 73% of the requests are TLS, and the negotiated cipher suites break down like so:
Processing all of those different cipher suites, BoringSSL consumes just 1.8% of the CPU time. That’s right, a mere 1.8%. And that is not even pure cryptography; there is considerable overhead involved too.
Let’s take a deeper dive, shall we?
If we break down the negotiated cipher suites, by the AEAD used, we get the following breakdown:
BoringSSL speed tells us that AES-128-GCM, ChaCha20-Poly1305 and AES-128-CBC-SHA1 can achieve encryption speeds of 3,733.3 MB/s, 1,486.9 MB/s and 387.0 MB/s respectively, but these speeds vary greatly as a function of the record size. Indeed we see that GCM uses proportionately less CPU time.
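To put those throughput figures in perspective, here is a rough sketch that converts them into a best-case time per record. It assumes that "MB" means 10^6 bytes and that the peak speed applies to a full 16 KiB record, which is optimistic since the speed depends on record size. The function and variable names are made up for illustration.

```python
# Throughput figures quoted above, in MB/s (assumed here to mean 10**6 bytes/s).
THROUGHPUT_MB_S = {
    "AES-128-GCM": 3733.3,
    "ChaCha20-Poly1305": 1486.9,
    "AES-128-CBC-SHA1": 387.0,
}

RECORD_BYTES = 16 * 1024  # maximum TLS record payload size (2**14 bytes)


def best_case_record_time_us(throughput_mb_s: float,
                             record_bytes: int = RECORD_BYTES) -> float:
    """Microseconds to process one record if the peak throughput applied."""
    bytes_per_second = throughput_mb_s * 1_000_000
    return record_bytes / bytes_per_second * 1_000_000


record_times = {
    name: best_case_record_time_us(rate)
    for name, rate in THROUGHPUT_MB_S.items()
}

for name, us in record_times.items():
    print(f"{name:20s} {us:6.2f} us per 16 KiB record")
```

Under these assumptions a full record costs only a handful of microseconds with GCM, and roughly ten times that with AES-CBC. Smaller records will be less favorable for every algorithm.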
Still, the CPU time consumed by encryption and decryption depends on typical record size, as well as the amount of data processed, both metrics we don’t currently log. We do know that ChaCha20-Poly1305 is usually used by older phones, where the connections are short-lived to save power, while AES-CBC is used for … well, your guess is as good as mine as to who still uses AES-CBC and for what, but it is a good thing its usage keeps declining.
Finally, keep in mind that 6.8% of BoringSSL usage in the graph translates into 6.8% × 1.8% = 0.12% of total CPU time.
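The same conversion applies to any slice of the BoringSSL graphs. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
BORINGSSL_SHARE_OF_CPU = 0.018  # BoringSSL uses 1.8% of total CPU time


def share_of_total_cpu(share_of_boringssl: float) -> float:
    """Convert a fraction of BoringSSL time into a fraction of total CPU time."""
    return share_of_boringssl * BORINGSSL_SHARE_OF_CPU


total_share = share_of_total_cpu(0.068)  # the 6.8% slice from the text
print(f"{total_share:.2%}")  # prints 0.12%
```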
Public key algorithms in TLS serve two functions.
The first function is as a key exchange algorithm. The prevalent algorithm here is ECDHE using the NIST P256 curve; the runner-up is ECDHE using DJB’s x25519 curve. Finally, there is a small fraction that still uses RSA for key exchange, the only key exchange algorithm currently in use that does not provide Forward Secrecy guarantees.
The second function is that of a signature used to sign the handshake parameters and thus authenticate the server to the client. As a signature RSA is very much alive, present in almost one quarter of the connections, the other three quarters using ECDSA.
BoringSSL speed reports that a single core on our server can perform 1,120 RSA2048 signatures/s, 120 RSA4096 signatures/s, 18,477 P256 ECDSA signatures/s, 9,394 P256 ECDHE operations/s and 9,278 x25519 ECDHE operations/s.
Looking at the CPU consumption, it is clear that RSA is very expensive. Roughly half the time BoringSSL performs an operation related to RSA. P256 consumes twice as much CPU time as x25519, but considering that it handles twice as many key exchanges, while also being used as a signature, that is commendable.
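Operations per second can be easier to compare as a per-operation cost. The following sketch simply inverts the benchmark numbers quoted above. It assumes one core working on nothing else, and the dictionary keys are labels chosen for illustration.

```python
# Single-core rates quoted above, in operations per second.
OPS_PER_SECOND = {
    "RSA2048 sign": 1120,
    "RSA4096 sign": 120,
    "P256 ECDSA sign": 18477,
    "P256 ECDHE": 9394,
    "x25519 ECDHE": 9278,
}


def microseconds_per_op(ops_per_second: float) -> float:
    """Core time spent on one operation, in microseconds."""
    return 1_000_000 / ops_per_second


costs = {name: microseconds_per_op(rate) for name, rate in OPS_PER_SECOND.items()}

for name, us in costs.items():
    print(f"{name:16s} {us:8.1f} us")

# How many P256 ECDSA signatures fit in the time of one RSA2048 signature.
rsa_vs_ecdsa = OPS_PER_SECOND["P256 ECDSA sign"] / OPS_PER_SECOND["RSA2048 sign"]
print(f"RSA2048 costs about {rsa_vs_ecdsa:.1f}x a P256 ECDSA signature")
```

By these numbers an RSA2048 signature takes a bit under a millisecond of core time, roughly 16 times the cost of a P256 ECDSA signature.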
If you want to make the internet a better place, please get an ECDSA signed certificate next time!
Only two hash functions are currently used in TLS: SHA1 and SHA2 (including SHA384). SHA3 will probably debut with TLS1.3. Hash functions serve several purposes in TLS. First, they are used as part of the signature for both the certificate and the handshake; second, they are used for key derivation; finally, when using AES-CBC, SHA1 and SHA2 are used in HMAC to authenticate the records.
Here we see SHA1 consuming more resources than expected, but that is really because it is used as HMAC, whereas most cipher suites that negotiate SHA256 use AEADs. In terms of benchmarks BoringSSL speed reports 667.7 MB/s for SHA1, 309.0 MB/s for SHA256 and 436.0 MB/s for SHA512 (truncated to SHA384 in TLS, that is not visible in the graphs because its usage approaches 0%).
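For readers unfamiliar with HMAC, here is a minimal sketch of computing and checking an HMAC-SHA1 tag with Python's standard library. It only illustrates the primitive: the actual TLS record MAC also covers fields such as the sequence number and record header, and the key and payload below are placeholders.

```python
import hashlib
import hmac
import os

mac_key = os.urandom(20)            # placeholder key, not derived as in TLS
record = b"example record payload"  # placeholder data

# Compute the 20-byte HMAC-SHA1 tag over the data.
tag = hmac.new(mac_key, record, hashlib.sha1).digest()


def verify(key: bytes, data: bytes, received_tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, data, hashlib.sha1).digest()
    return hmac.compare_digest(expected, received_tag)


print(len(tag))                            # 20
print(verify(mac_key, record, tag))        # True
print(verify(mac_key, record + b"!", tag)) # False
```

Every byte of the record passes through the hash here, in addition to the cipher, whereas an AEAD such as AES-GCM handles encryption and authentication together.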
Using TLS is very cheap, even at the scale of Cloudflare. Modern crypto is very fast, with AES-GCM and P256 being great examples. RSA, once a staple of cryptography that truly made SSL accessible to everyone, is now a dying dinosaur. Replaced by faster and safer algorithms, it still consumes a disproportionate amount of resources, but even that is easily manageable.
The future, however, is less clear. As we approach the era of quantum computers, it is clear that TLS must adapt sooner rather than later. We already support SIDH as a key exchange algorithm for some services, and there is a NIST competition in place that will determine the most likely post-quantum candidates for TLS adoption, but none of the candidates can outperform P256. I just hope that when I profile our edge two years from now, my conclusion won’t change to “Whoa, crypto is expensive!”.