At CloudFlare, making web sites faster and safer at scale is always a driving force for innovation. We introduced “Universal SSL” to dramatically increase the size of the encrypted web. In order for that to happen we knew we needed to efficiently handle large volumes of HTTPS traffic, and give end users the fastest possible performance.
In this article, I’ll explain how we added speed to Universal SSL with session resumptions across multiple hosts, and explain the design decisions we made in this process. Currently, we use two standardized session resumption mechanisms that require two different data sharing designs: Session IDs RFC 5246, and Session Tickets RFC 5077.
Session ID Resumption
Resuming an encrypted session through a session ID means that the server keeps track of recent negotiated sessions using unique session IDs. This is done so that when a client reconnects to a server with a session ID, the server can quickly look up the session keys and resume the encrypted communication.
At each of CloudFlare’s PoPs (Point of Presence) there are multiple hosts handling HTTPS traffic. When the client attempts to resume a TLS connection with a web site, there is no guarantee that they will connect to the same physical machine that they connected to previously. Without session sharing, the success rate of session ID resumption could be as low as 1/n (when there are n hosts). That means the more hosts we have, the less likely a session can be resumed. This goes directly against our goal of scaling SSL performance!
CloudFlare’s solution to this problem is to share the sessions within the PoP, making the successful resumption rate approach 100%.
How sessions are shared
We employ a memcached cluster to cache all the recent negotiated sessions from all the hosts within the same PoP. To enhance the secrecy and security of session keys, all cached sessions are encrypted. When a new session with a session ID is negotiated, a host will encrypt the new session and insert it to memcached, indexed by the session ID. When a host needs to look up a session for session resumption, it will query memcached using the session ID as the key and decrypt the cached session to resume it. All those operations happen as non-blocking asynchronous calls thanks to the power of OpenResty, and many handy OpenResty modules such as the fully asynchronous memcached client. We also needed tweaks in OpenSSL to support asynchronous session caching.
I’d like to send a few shout-outs to my amazing colleagues Piotr Sikora and Yichun Zhang for making this project possible.
Using OpenSSL’s s_client utility, we can quickly test how a session ID is speeds up the TLS connection from the client side. We test the TLS performance of www.cloudflare.com from our office. And the result is shown below:
The overall cost of a session resumption is less than 50% of a full TLS handshake, mainly because session resumption only costs one round-trip while a full TLS handshake requires two. Moreover, a session resumption does not require any large finite field arithmetic (new sessions do), so the CPU cost for the client is almost negligible compared to that in a full TLS handshake. For mobile users, the performance improvement by session resumption means a much more reactive and battery-life-friendly surfing experience.
Session Ticket Resumption
Session resumption with session IDs has a major limitation: servers are responsible for remembering negotiated TLS sessions for a given period of time. It poses scalability issues for servers with a large load of concurrent connections per second and for servers that want to cache sessions for a long time. Session ticket resumption is designed to address this issue.
The idea is simple: outsource session storage to clients. A session ticket is a blob of a session key and associated information encrypted by a key which is only known by the server. The ticket is sent by the server at the end of the TLS handshake. Clients supporting session tickets will cache the ticket along with the current session key information. Later the client includes the session ticket in the handshake message to indicate it wishes to resume the earlier session The server on the other end will be able to decrypt this ticket, recover the session key and resume the session.
Now consider every host in the same PoP uses the same encryption key, the good news is that every host now is able to decrypt this session ticket and resume the session for the client. The not-so-good news is that this key becomes critical single point of failure for TLS security: if an adversary gets hold of it, the session key information is exposed for every session ticket! Even after the lifetime of a session ticket, such a loss would invalidate supposed “perfect forward secrecy” (as evangelized here on our blog). Therefore, it is important to:
“generate session ticket keys randomly, distribute them to the servers without ever touching persistent storage and rotate them frequently.”
How session encryption keys are encrypted, shared and rotated
To meet all these security goals, we first start an in-memory key generator daemon that generates a fresh, timestamped key every hour. Keys are encrypted so that only our nginx servers can decrypt them. Then with CloudFlare’s existing secure data propagation infrastructure, ticket keys replicate from one master instance to all of our PoPs around the world. Each host periodically queries the local copy of the database through a memcached interface for fresh encryption keys for the current hour. To summarize, the key generation daemon generates keys randomly and rotates them hourly, and keys are distributed to all hosts across the globe securely without being written to disk.
There are some technical details still worth mentioning. First, we need to tackle distributed clock synchronization. For example, there might be one host thinks it is UTC 12:01pm while other hosts still think it UTC 11:59am, the faster-clock host might start encrypting session tickets with the key of 12:00pm while other hosts could not decrypt those tickets because they don’t know the new key yet. Or the fast-clock host might find the key is not yet available due to propagation delay. Instead of dedicating efforts for synchronization, we solve the problem by breaking the synchronization requirement. The key daemon generates keys one hour ahead and each host would opportunistically save the key for the next hour (if there is any) as a decryption-only key. Now even with one or more faster-clock hosts, session resumption by ticket still works without interruption because they can still decrypt session tickets encrypted by any other.
Also we set the session ticket lifetime hint to be 18 hours, the same value for SSL session timeout. Each server also keeps ticket keys for the past 18 hours for ticket decryption.
To summarize, we support TLS session resumption globally using both sessions IDs and session tickets. For any web site on CloudFlare’s network, HTTPS performance has been made faster for every user and every device.