Initial Problem Report
Several months ago we started hearing occasional reports from .NET developers that they were having trouble maintaining HTTPS sessions with one of our customer’s websites. Establishing connections worked just fine, but they would periodically get disconnected, resulting in an exception that crashed their application. Around the same time, we started hearing reports that two other Microsoft products (Internet Explorer and its heir-apparent, Edge) were also having trouble with our edge.
Just a few weeks prior, we had updated our handling of TLS session tickets to be more performant and more secure. We suspected these improvements were the trigger and focused our investigation there. What we learned was that the problem ran much deeper than .NET or IE. It went all the way down to the SChannel security package, which provides TLS functionality for a vast array of Microsoft applications.
TLS Session Tickets
Before diving into the story further, however, it’s helpful to understand exactly what TLS session tickets are, how they’re beneficial to HTTPS, and what optimizations CloudFlare does to use them at scale. (If you’d like to skip over the primer and jump right to the punchline, go ahead and skip to "Back to Microsoft".)
First introduced in 2006, RFC 4507 (since obsoleted by RFC 5077) proposes "a mechanism that enables the Transport Layer Security (TLS) server to resume sessions and avoid keeping per-client session state". Before the server can absolve itself of maintaining state, however, it must first "encapsulate session state into a ticket and forward it to the client".
The client then saves this ticket and presents it to the server the next time it needs to make an HTTPS request, skipping most of the back-and-forth process that’s initially required to establish a TLS session. Instead of the standard protocol negotiation messages, the two parties perform what’s known as an “abbreviated handshake,” jumping right to using the previously agreed upon session details.
This shortcut cuts the connection setup time by more than 50%, since the number of round trips needed for the TLS handshake drops from two to one.
Note that the lifetime and continued validity of a ticket is up to both the server and the client: the server communicates a “session lifetime hint” to the client, and the client decides if it’s willing to re-use the session all the way until that expiration. If not, or if there’s any doubt about the session integrity, either side can force a full handshake and regenerate a ticket.
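To make the "session lifetime hint" concrete, here is a minimal sketch of how a NewSessionTicket handshake message is laid out on the wire per RFC 5077: a 4-byte lifetime hint in seconds followed by a length-prefixed opaque ticket. The function names are ours, chosen for illustration, and the record layer is left out.

```python
import struct
from typing import Tuple

HANDSHAKE_TYPE_NEW_SESSION_TICKET = 4  # handshake type assigned in RFC 5077


def build_new_session_ticket(lifetime_hint_seconds: int, ticket: bytes) -> bytes:
    """Encode a NewSessionTicket handshake message (record layer not included)."""
    # Body: uint32 ticket_lifetime_hint, then a ticket with a 2-byte length prefix.
    body = struct.pack("!IH", lifetime_hint_seconds, len(ticket)) + ticket
    # Handshake header: 1-byte type followed by a 3-byte big-endian length.
    header = bytes([HANDSHAKE_TYPE_NEW_SESSION_TICKET]) + len(body).to_bytes(3, "big")
    return header + body


def parse_new_session_ticket(message: bytes) -> Tuple[int, bytes]:
    """Return (lifetime_hint_seconds, ticket) from an encoded message."""
    if message[0] != HANDSHAKE_TYPE_NEW_SESSION_TICKET:
        raise ValueError("not a NewSessionTicket message")
    body_length = int.from_bytes(message[1:4], "big")
    body = message[4:4 + body_length]
    lifetime_hint, ticket_length = struct.unpack("!IH", body[:6])
    return lifetime_hint, body[6:6 + ticket_length]
```

The hint is only advisory: a client may discard the ticket earlier, and the server may refuse it at any time.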
Resuming the Session
Figure 1 - Full TLS handshake vs. abbreviated TLS handshake
To inform the server that the browser can utilize session tickets (not all can), the connecting user agent must first advertise its support for them—otherwise there’s no point spending cycles on creating and encrypting the ticket. To convey this capability, the client sends an empty value in the session ticket extension of the initial ClientHello message.
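As a rough illustration, the extension itself is tiny. The sketch below encodes it (type 35, as assigned by RFC 5077) and scans a ClientHello extensions block for it. The helper names are hypothetical, and the input is assumed to be the extensions block with its 2-byte total length already stripped.

```python
import struct

EXT_SESSION_TICKET = 35  # extension type assigned to SessionTicket TLS in RFC 5077


def session_ticket_extension(ticket: bytes = b"") -> bytes:
    """Encode the extension: 2-byte type, 2-byte length, then the ticket (if any)."""
    return struct.pack("!HH", EXT_SESSION_TICKET, len(ticket)) + ticket


def client_offers_tickets(extensions: bytes) -> bool:
    """Walk a ClientHello extensions block and look for the SessionTicket type."""
    offset = 0
    while offset + 4 <= len(extensions):
        ext_type, ext_len = struct.unpack_from("!HH", extensions, offset)
        if ext_type == EXT_SESSION_TICKET:
            return True
        offset += 4 + ext_len  # skip this extension's header and body
    return False
```

An empty body (`00 23 00 00`) means "I support tickets but don't have one yet"; a non-empty body carries a ticket the client wants to resume with.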
When the server sees this extension, it knows to hold on to the negotiated details of the session such as the TLS protocol version, cipher suite, and the symmetric key used to encrypt both requests and responses. These details are then packed together in a data structure, encrypted with a "session ticket key" that only the server knows, and sent to the client for safe keeping.
Because only the server can decrypt the ticket, and because any tampering (e.g., improper extension of the session lifetime) would be detected when it does, the server doesn't need to hold a copy of the session state and instead relies upon the client to store it. This distributed storage is what makes RFC 5077 session tickets much more scalable than their predecessor, session IDs.
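Here is a minimal sketch of that seal-and-verify idea, loosely following the ticket layout RFC 5077 recommends (key name, IV, encrypted state, MAC). It is not our production code. One loud assumption: to stay within the Python standard library, the cipher is a stand-in HMAC-SHA256 keystream rather than the AES the RFC suggests, so treat it as an illustration only.

```python
import hashlib
import hmac
import os
from typing import Optional


def _keystream_xor(enc_key: bytes, iv: bytes, data: bytes) -> bytes:
    """Stand-in cipher (NOT AES): XOR data with an HMAC-SHA256 keystream."""
    out = bytearray()
    for start in range(0, len(data), 32):
        counter = (start // 32).to_bytes(4, "big")
        pad = hmac.new(enc_key, iv + counter, hashlib.sha256).digest()
        out.extend(a ^ b for a, b in zip(data[start:start + 32], pad))
    return bytes(out)


def seal_ticket(key_name: bytes, enc_key: bytes, mac_key: bytes, state: bytes) -> bytes:
    """Return key_name (16 bytes) || iv (16) || encrypted_state || mac (32)."""
    iv = os.urandom(16)
    encrypted = _keystream_xor(enc_key, iv, state)
    mac = hmac.new(mac_key, key_name + iv + encrypted, hashlib.sha256).digest()
    return key_name + iv + encrypted + mac


def open_ticket(key_name: bytes, enc_key: bytes, mac_key: bytes,
                ticket: bytes) -> Optional[bytes]:
    """Return the session state, or None if the ticket is foreign or tampered with."""
    if len(ticket) < 64 or ticket[:16] != key_name:
        return None
    iv, encrypted, mac = ticket[16:32], ticket[32:-32], ticket[-32:]
    expected = hmac.new(mac_key, key_name + iv + encrypted, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None  # any modification by the client invalidates the MAC
    return _keystream_xor(enc_key, iv, encrypted)
```

The MAC check is what lets the server trust a ticket it never stored: a single flipped bit causes `open_ticket` to return `None`, at which point the server would fall back to a full handshake.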
The Challenge of Session Tickets at Scale
So, an HTTPS server encrypts session tickets with a key that only it can access. In CloudFlare’s case, however, "the server" is actually thousands of machines blanketing the world and responding to requests from as close (network-wise) to browsers as possible.
Each machine in our 36 country, 76 datacenter network is designed to handle all types of requests, including SSL/TLS handshakes and the processing of application data. Requests are routed to servers with the least load and away from degraded network conditions (or entire facility outages).
Transacting with more than one server during a single session should not (and does not) degrade performance: TLS sessions that begin with a full handshake on one server are resumable with an abbreviated handshake on any other server (assuming the client is holding the appropriate ticket, of course). This complicates our session ticket implementation, since it means each machine needs to have instant access to the same session ticket keys.
A simplistic (read: insecure) approach to this key distribution problem would be to randomly generate this 48-byte key once and add it to the configuration files replicated to each server. On boot, the server would read the key and use it indefinitely to both issue new tickets and decrypt existing ones.
The most obvious problem with this method (besides the anti-pattern of storing secrets in configuration files) is that the key rarely changes. As a result, the cryptographic property of “forward secrecy” is greatly compromised. Forward secrecy is important because it renders historical captures of encrypted traffic worthless, even if the server’s private key is compromised.
Session Tickets at CloudFlare
CloudFlare’s solution to this problem, documented in previous blog posts, is to frequently regenerate and synchronize these session ticket keys across our entire global network. We currently do this once per hour.
This means we need a mechanism for turning over session ticket keys. For instance, if a client instantiates an HTTPS session at 12:00pm and continues using that ticket past 1:00pm, our edge network will re-encrypt the ticket with a brand new session ticket key.
To accomplish this, our web servers must have both the full history of all previous keys that could have encrypted the ticket (i.e., one per hour dating back to the maximum session lifetime of 64,800 seconds, or 18 hours) as well as immediate access to each newly generated key. The previous keys are used exclusively to decrypt tickets presented by the client, while the new keys are used to "refresh" the encryption on existing tickets and encrypt tickets for entirely new sessions.
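The bookkeeping this implies can be sketched as a small key ring. This is a hypothetical illustration, not our actual implementation: the class and method names are invented, and keys are generated locally here, whereas in our network they are synchronized across machines.

```python
import os
from collections import OrderedDict
from typing import Optional

ROTATION_INTERVAL_SECONDS = 3600        # one new key per hour (from the text)
MAX_SESSION_LIFETIME_SECONDS = 64_800   # maximum session lifetime (from the text)
# Retired keys that may still have live tickets: 64,800 / 3,600 = 18.
RETIRED_KEYS_TO_KEEP = MAX_SESSION_LIFETIME_SECONDS // ROTATION_INTERVAL_SECONDS


class TicketKeyRing:
    """Hypothetical key ring: the newest key encrypts, older keys only decrypt."""

    def __init__(self) -> None:
        self._keys: "OrderedDict[bytes, bytes]" = OrderedDict()  # key_name -> key
        self.rotate()

    def rotate(self) -> bytes:
        """Add a fresh key and drop keys too old to have valid tickets."""
        name = os.urandom(16)
        self._keys[name] = os.urandom(32)
        while len(self._keys) > RETIRED_KEYS_TO_KEEP + 1:
            self._keys.popitem(last=False)  # discard the oldest key
        return name

    @property
    def current_name(self) -> bytes:
        """Name of the key used to encrypt new and refreshed tickets."""
        return next(reversed(self._keys))

    def lookup(self, key_name: bytes) -> Optional[bytes]:
        """Find the key that issued a ticket, or None if it has aged out."""
        return self._keys.get(key_name)

    def needs_renewal(self, key_name: bytes) -> bool:
        """True when a ticket was issued under a retired (but still known) key."""
        return key_name in self._keys and key_name != self.current_name
```

Calling `rotate()` once per hour keeps at most 19 keys in memory: the current one plus the 18 retired keys that could still have unexpired tickets in the wild.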
Back to Microsoft
The handling of these previously issued session tickets — re-wrapped with our newly generated key and sent back to the client in an abbreviated handshake — is where we observed every user agent built on Microsoft SChannel break down.
Renewing Session Tickets (or not)
According to the session ticket specification, the server can, for any reason, reject the session ticket presented to it in an abbreviated handshake and force a full negotiation. When it does this, it simply sends the NewSessionTicket message immediately after receiving the Finished message from the client. The client then saves the new ticket and uses it for subsequent requests. This message flow is documented below in Figure 2, and does not present any problems for our customers’ .NET and IE visitors. That is to say, it’s not part of the bug we uncovered.
Figure 2 - Server Rejecting Ticket, Performing Full Handshake, and Issuing New Session Ticket (* indicates an optional message)
A common example of this behavior is after a restart, when the server loses access to all of its existing session ticket keys. Another reason for rejecting and replacing the session ticket sent by the client is so the server can preserve forward secrecy: the shorter the lifetime of the key, the smaller the window an attacker has to brute force it and decrypt traffic.
While forcing a full handshake and sending a new session ticket is one way to cycle the session ticket key, a more efficient approach is to simply decrypt the ticket during an abbreviated handshake, re-encrypt it with the new key, and send it back to the client in the same packet as the ServerHello. From RFC 5077:
If the server successfully verifies the client's ticket, then it MAY renew the ticket by including a NewSessionTicket handshake message after the ServerHello in the abbreviated handshake. The client should start using the new ticket as soon as possible after it verifies the server's Finished message for new connections.
This use case is illustrated below, in Figure 3, with emphasis added on the NewSessionTicket message that was problematic for Microsoft:
Figure 3 - Server Issuing New Session Ticket during Abbreviated Handshake
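The three outcomes described in this section (reject, accept as-is, or renew in-line) can be summarized by what the server sends right after the ClientHello. The sketch below is our own simplification of the RFC 5077 message flows. The enum and function names are hypothetical, and extensions and record framing are ignored.

```python
from enum import Enum
from typing import List


class TicketDecision(Enum):
    REJECT = "reject"  # ticket unusable: fall back to a full handshake
    ACCEPT = "accept"  # ticket fine as-is: resume without reissuing it
    RENEW = "renew"    # ticket valid but re-wrapped under a newer key


def server_flight(decision: TicketDecision) -> List[str]:
    """Messages the server sends in its first flight after the ClientHello."""
    if decision is TicketDecision.REJECT:
        # Full handshake (Certificate and ServerKeyExchange are optional in TLS).
        # The NewSessionTicket follows later, after the client's Finished.
        return ["ServerHello", "Certificate", "ServerKeyExchange", "ServerHelloDone"]
    flight = ["ServerHello"]
    if decision is TicketDecision.RENEW:
        flight.append("NewSessionTicket")  # the in-line renewal RFC 5077 allows
    flight += ["ChangeCipherSpec", "Finished"]
    return flight
```

Only the `RENEW` case puts a NewSessionTicket inside an abbreviated handshake, which is the flow shown in Figure 3.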
The Problem with SChannel
While this "in-line" NewSessionTicket mechanism works without issue in Chrome or Firefox (Safari still does not have RFC 5077 support), it causes all sorts of trouble with user agents built using Microsoft’s SChannel (i.e., all versions of Internet Explorer, Edge, and .NET running on Windows 8, 8.1, Server 2012, Server 2012 R2, RT, RT 8.1, and 10).
What we found, with excellent troubleshooting help from our good friends mentioned at the end of this post, was that when a Microsoft UA saw the NewSessionTicket, it either immediately aborted the connection and threw an exception (.NET) or it downgraded(!) from TLS 1.2 to TLS 1.0. Clearly, neither scenario is desirable, but the latter is a legitimate security concern: SSL/TLS downgrade attacks have been exploited in the past, and they’ll almost definitely be exploited again in the future.
The Nitty Gritty Details
For those brave enough to follow along, below is a brief walk through of the packet captures. The first capture, available on CloudShark, contains a single TCP stream of the initial TLS handshake. The second capture, also available online, includes both this original stream as well as the second attempt to establish the session. In short, this is what happened:
- The client presents an existing session ticket to the server.
- The server "turns over" the session ticket key.
- The client aborts the connection when presented with the new ticket.
- The client downgrades to TLS 1.0 and tries again.
Part 1: Presenting The Ticket To The Server
As can be seen in the following pcap screenshot, the Internet Explorer 11 client (192.168.2.225) initiates an abbreviated TLS handshake by sending a ClientHello to the server (220.127.116.11) containing a non-empty session ticket extension.
Recall that the 192 bytes of this ticket represent an encrypted data structure that contains, among other details, the lifetime of the session and the cipher suite originally negotiated.
Part 2: Turning Over The Key
Upon receipt of the session ticket from the client, our edge decrypts it and checks whether the key that encrypted it is still the current one. After determining that the key has expired and should not be used again (other than to decrypt old tickets), it substitutes in a new key that’s used to re-encrypt the ticket and returns it to the client as part of a NewSessionTicket message. All of these steps take place as part of the same abbreviated handshake.
Part 3: Aborting The Connection
Unfortunately, immediately after the IE11 client ACKs the NewSessionTicket record, it responds by shutting down the connection (TCP FIN). The cause of this aborted connection, as later confirmed by Microsoft, is a bug in their underlying crypto stack, SChannel.
What’s really interesting is what happens next with the subsequent ClientHello. As shown below, Internet Explorer/SChannel re-establishes the connection by indicating TLS 1.0 (rather than TLS 1.2) is the highest version it supports. As a result, it advertises support for a significantly smaller set of cipher suites (compare the 24 originally presented with the 12 now presented).
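If you want to spot this behavior in your own captures, the version a client offers sits at a fixed offset in the ClientHello. The following is a minimal sketch (function names are ours) that assumes the ClientHello starts at the beginning of a single, unfragmented TLS record.

```python
TLS_VERSIONS = {(3, 1): "TLS 1.0", (3, 2): "TLS 1.1", (3, 3): "TLS 1.2"}


def client_hello_version(record: bytes) -> str:
    """Return the highest version a ClientHello offers (its client_version field)."""
    if len(record) < 11 or record[0] != 0x16 or record[5] != 0x01:
        raise ValueError("not a TLS handshake record carrying a ClientHello")
    # Bytes 0-4: record header; 5-8: handshake header; 9-10: client_version.
    major, minor = record[9], record[10]
    return TLS_VERSIONS.get((major, minor), f"unknown ({major}.{minor})")


def is_downgrade(first_hello: bytes, retry_hello: bytes) -> bool:
    """True when the retry offers a lower maximum version than the first attempt."""
    return tuple(retry_hello[9:11]) < tuple(first_hello[9:11])
```

In the captures described here, the first ClientHello would report TLS 1.2 and the retry TLS 1.0, so `is_downgrade` would return `True`.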
Another difference is that the TLS 1.0 handshake does not include the signature_algorithms extension (0x000d). Defined in the TLS 1.2 specification (see RFC 5246), this extension lists preference-ordered pairs of hash and signature algorithms (e.g., SHA-256/RSA, SHA-256/ECDSA) that the user agent can validate. In fact, this is the primary data point that CloudFlare looks at when determining whether to serve a SHA-2 or SHA-1 certificate to the client.
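To show why losing this extension matters, here is a rough sketch of parsing its body and making a certificate choice from it. The algorithm identifiers come from RFC 5246, but the selection policy is a hypothetical simplification, not a description of our actual logic.

```python
import struct
from typing import List, Optional, Tuple

# HashAlgorithm identifiers from RFC 5246: sha256(4), sha384(5), sha512(6).
SHA2_HASH_IDS = {4, 5, 6}


def parse_signature_algorithms(ext_data: bytes) -> List[Tuple[int, int]]:
    """Decode the extension body: a 2-byte list length, then (hash, signature) pairs."""
    (list_length,) = struct.unpack_from("!H", ext_data, 0)
    pairs = ext_data[2:2 + list_length]
    return [(pairs[i], pairs[i + 1]) for i in range(0, len(pairs) - 1, 2)]


def choose_certificate_hash(ext_data: Optional[bytes]) -> str:
    """Hypothetical policy: serve SHA-2 only if the client says it can verify it."""
    if ext_data is None:  # e.g., a TLS 1.0 ClientHello without the extension
        return "SHA-1"
    offered = parse_signature_algorithms(ext_data)
    return "SHA-2" if any(h in SHA2_HASH_IDS for h, _ in offered) else "SHA-1"
```

Under a policy like this, a client that has silently dropped to TLS 1.0 no longer advertises SHA-256 support and would be served the weaker certificate.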
Part 4: Completing a Downgraded Connection
Finally, the TLS 1.0 handshake completes, during which a new session ticket is sent back to the browser—this time as part of a full handshake. SChannel has no issue with full handshakes, so it commences sending application data (e.g., GET and POST requests).
What has Microsoft done to fix it?
After we reported the issue to them, Microsoft was extremely responsive and quick to dig in on the bug. While I’m sure it didn’t hurt that our head of engineering, Ben Fathi, connected us directly with Matt Thomlinson, Microsoft’s VP of Security—a position Ben used to hold (thanks, Ben!)—it is non-trivial to patch the crypto stack all the way back through Windows 8.
Nor, as we’ve seen from previous SChannel updates, is it an easy or necessarily error-free process. We’d like to thank Microsoft and Matt’s team for their sense of urgency in fixing the issue and rapidly communicating out the resolution to their customers.
Here are the affected platforms (and a link to the software updates):
- Windows 8 for 32-bit and x64-based Systems
- Windows 8.1 for 32-bit and x64-based Systems
- Windows Server 2012 (including Server Core install)
- Windows Server 2012 R2 (including Server Core install)
- Windows RT
- Windows RT 8.1
- Windows 10 for 32-bit and x64-based Systems
- Windows 10 Version 1511 for 32-bit and x64-based Systems
Why didn’t this get caught in testing, you might ask? While I don’t know for sure, I suspect Microsoft probably never saw a server "in the wild" (let alone in their test environment) that rotated session ticket keys as aggressively as we do. As Google crypto wiz Adam Langley writes:
So how do you run forward secrecy with several servers and support session tickets? You need to generate session ticket keys randomly, distribute them to the servers without ever touching persistent storage and rotate them frequently. However, I'm not aware of any open source servers that support anything like that.
Fortunately, back in late 2013, CloudFlare committed code to nginx that allows it to read in multiple session ticket keys at boot. We then hooked up our own internal nginx instances to a centralized key-value store that’s replicated to each datacenter so we can roll these keys over without changing configuration files (or even restarting the processes). This work allowed us to advance the forward secrecy capabilities of open-source projects handling HTTPS such as nginx.
Ferreting out the low-level technical details of this complex problem (and testing a temporary workaround while Microsoft had a chance to roll out the fix) was supported by a number of folks in the TLS community.
In particular, I’d like to thank Eric Lawrence of Fiddler fame for having the patience to induce the issue with a packet capture running, Aaron Coleman of Fitabase for providing sample code to replicate the issue in .NET, and Jeremiah Lee of Fitbit for support testing the workaround that was implemented by CloudFlare's Zi Lin.
Would you like to work on solving interesting problems like this for hundreds of millions of website visitors? If so, you're in luck: the CloudFlare Security Engineering team is hiring. Apply through our career page - we'd love to talk with you.