High-reliability OCSP stapling and why it matters

by Nick Sullivan.

At Cloudflare our focus is making the internet faster and more secure. Today we are announcing a new enhancement to our HTTPS service: High-Reliability OCSP stapling. This feature is a step towards enabling an important security feature on the web: certificate revocation checking. Reliable OCSP stapling also improves connection times by up to 30% in some cases. In this post, we’ll explore the importance of certificate revocation checking in HTTPS, the challenges involved in making it reliable, and how we built a robust OCSP stapling service.

Why revocation is hard

Digital certificates are the cornerstone of trust on the web. A digital certificate is like an identification card for a website. It contains identity information including the website’s hostname along with a cryptographic public key. In public key cryptography, each public key has an associated private key. This private key is kept secret by the site owner. For a browser to trust an HTTPS site, the site’s server must provide a certificate that is valid for the site’s hostname and a proof of control of the certificate’s private key. If someone gets access to a certificate’s private key, they can impersonate the site. Private key compromise is a serious risk to trust on the web.

Certificate revocation is a way to mitigate the risk of key compromise. A website owner can revoke a compromised certificate by informing the certificate issuer that it should no longer be trusted. For example, back in 2014, Cloudflare revoked all managed certificates after it was shown that the Heartbleed vulnerability could be used to steal private keys. There are other reasons to revoke, but key compromise is the most common.

Certificate revocation has a spotty history. Most of the revocation checking mechanisms implemented today don’t protect site owners from key compromise. If you know about why revocation checking is broken, feel free to skip ahead to the OCSP stapling section below.

Revocation checking: a history of failure

There are several ways a web browser can check whether a site’s certificate is revoked or not. The most well-known mechanisms are Certificate Revocation Lists (CRL) and Online Certificate Status Protocol (OCSP). A CRL is a signed list of serial numbers of certificates revoked by a CA. OCSP is a protocol that can be used to query a CA about the revocation status of a given certificate. An OCSP response contains signed assertions that a certificate is not revoked.

CRL OCSP

Certificates that support OCSP contain the responder's URL and those that support CRLs contain a URLs where the CRL can be obtained. When a browser is served a certificate as part of an HTTPS connection, it can use the embedded URL to download a CRL or an OCSP response and check that the certificate hasn't been revoked before rendering the web page. The question then becomes: what should the browser do if the request for a CRL or OCSP response fails? As it turns out, both answers to that question are problematic.

Hard-fail doesn’t work

When browsers encounter a web page and there’s a problem fetching revocation information, the safe option is to block the page and show a security warning. This is called a hard-fail strategy. This strategy is conservative from a security standpoint, but prone to false positives. For example, if the proof of non-revocation could not be obtained for a valid certificate, a hard-fail strategy will show a security warning. Showing a security warning when no security issue exists is dangerous because it can lead to warning fatigue and teach users to click through security warnings, which is a bad idea.

In the real world, false positives are unavoidable. OCSP and CRL endpoints subject to service outages and network errors. There are also common situations where these endpoints are completely inaccessible to the browser, such as when the browser is behind a captive portal. For some access points used in hotels and airplanes, unencrypted traffic (like OCSP endpoints) are blocked. A hard-fail strategy force users behind captive portals and other networks that block OCSP requests to click through unnecessary security warnings. This reality is unpalatable to browser vendors.

Another drawback to a hard-fail strategy is that it puts an increased burden on certificate authorities to keep OCSP and CRL endpoints available and online. A broken OCSP or CRL server becomes a central point of failure for all certificates issued by a certificate authority. If browsers followed a hard-fail strategy, an OCSP outage would be an Internet outage. Certificate authorities are organizations optimized to provide trust and accountability, not necessarily resilient infrastructure. In a hard-fail world, the availability of the web as a whole would be limited by the ability for CAs to keep their OCSP services online at all times; a dangerous systemic risk to the internet as a whole.

Soft-fail: it’s not much better

To avoid the downsides of a hard-fail strategy, most browsers take another approach to certificate revocation checking. Upon seeing a new certificate, the browser will attempt to fetch the revocation information from the CRL or OCSP endpoint embedded in the certificate. If the revocation information is available, they rely on it, and otherwise they assume the certificate is not revoked and display the page without any errors. This is called a “soft-fail” strategy.

The soft-fail strategy has a critical security flaw. An attacker with network position can block the OCSP request. If this attacker also has the private key of a revoked certificate, they can intercept the outgoing connection for the site and present the revoked certificate to the browser. Since the browser doesn’t know the certificate is revoked and is following a soft-fail strategy, the page will load without alerting the user. As Adam Langley described: “soft-fail revocation checks are like a seat-belt that snaps when you crash. Even though it works 99% of the time, it's worthless because it only works when you don't need it.”

A soft-fail strategy also makes connections slower. If revocation information for a certificate is not already cached, the browser will block the rendering of the page until the revocation information is retrieved, or a timeout occurs. This additional step causes a noticeable and unwelcome delay, with marginal security benefits. This tradeoff is a hard sell for the performance-obsessed web. Because of the limited benefit, some browsers have eliminated live revocation checking for at least some subset of certificates.

Live OCSP checking has an additional downside: it leaks private browsing information. OCSP requests are sent over unencrypted HTTP and are tied to a specific certificate. Sending an OCSP request tells the certificate authority which websites you are visiting. Furthermore, everyone on the network path between your browser and the OCSP server will also know which sites you are browsing.

Alternative revocation checking

Some client still perform soft-fail OCSP checking, but it’s becoming less common due to the performance and privacy downsides described above. To protect high-value certificates, some browsers have explored alternative mechanisms for revocation checking.

One technique is to pre-package a list of revoked certificates and distribute them through browser updates. Because the list of all revoked certificates is so large, only a few high-impact certificates are included in this list. This technique is called OneCRL by Firefox and CRLSets by Chrome. This has been effective for some high-profile revocations, but it is by no means a complete solution. Not only are not all certificates covered, this technique leaves a window of vulnerability between the time the certificate is revoked and the certificate list gets to browsers.

OCSP Stapling

OCSP stapling is a technique to get revocation information to browsers that fixes some of the performance and privacy issues associated with live OCSP fetching. In OCSP stapling, the server includes a current OCSP response for the certificate included (or "stapled") into the initial HTTPS connection. That removes the need for the browser to request the OCSP response itself. OCSP stapling is widely supported by modern browsers.

Not all servers support OCSP stapling, so browsers still take a soft-fail approach to warning the user when the OCSP response is not stapled. Some browsers (such as Safari, Edge and Firefox for now) check certificate revocation for certificates, so OCSP stapling can provide a performance boost of up to 30%. For browsers like Chrome that don’t check for revocation for all certificates, OCSP stapling provides a proof of non-revocation that they would not have otherwise.

High-reliability OCSP stapling

Cloudflare started offering OCSP stapling in 2012. Cloudflare’s original implementation relied on code from nginx that was able to provide OCSP stapling for a some, but not all connections. As Cloudflare’s network grew, the implementation wasn’t able to scale with it, resulting in a drop in the percentage of connections with OCSP responses stapled. The architecture we had chosen had served us well, but we could definitely do better.

In the last year we redesigned our OCSP stapling infrastructure to make it much more robust and reliable. We’re happy to announce that we now provide reliable OCSP stapling connections to Cloudflare. As long as the certificate authority has set up OCSP for a certificate, Cloudflare will serve a valid OCSP stapled response. All Cloudflare customers now benefit from much more reliable OCSP stapling.

OCSP stapling past

In Cloudflare’s original implementation of OCSP stapling, OCSP responses were fetched opportunistically. Given a connection that required a certificate, Cloudflare would check to see if there was a fresh OCSP response to staple. If there was, it would be included in the connection. If not, then the client would not be sent an OCSP response, and Cloudflare would send a request to refresh the OCSP response in the cache in preparation for the next request.

If a fresh OCSP response wasn’t cached, the connection wouldn’t get an OCSP staple. The next connection for that same certificate would get a OCSP staple, because the cache will have been populated.

This architecture was elegant, but not robust. First, there are several situations in which the client is guaranteed to not get an OCSP response. For example, the first request in every cache region and the first request after an OCSP response expires are guaranteed to not have an OCSP response stapled. With Cloudflare’s expansion to more locations, these failures were more common. Less popular sites would have their OCSP responses fetched less often resulting in an even lower ratio of stapled connections. Another reason that connections could be missing OCSP responses is if the OCSP request from Cloudflare to fill the cache failed. There was a lot of room for improvement.

Our solution: OCSP pre-fetching

In order to be able to reliably include OCSP staples in all connection, we decided to change the model. Instead of fetching the OCSP response when a request came in, we would fetch it in a centralized location and distribute valid responses to all our servers. When a response started getting close to expiration, we’d fetch a new one. If the OCSP request fails, we put it into a queue to re-fetch at a later time. Since most OCSP staples are valid for around 7 days, there is a lot of flexibility in term of refreshing expiring responses.

To keep our cache of OCSP responses fresh, we created an OCSP fetching service. This service ensures that there is a valid OCSP response for every certificate managed by Cloudflare. We constantly crawl our cache of OCSP responses and refresh those that are close to expiring. We also make sure to never cache invalid OCSP responses, as this can have bad consequences. This system has been running for several months now, and we are now reliably including OCSP staples for almost every HTTPS request.

Reliable stapling improves performance for browsers that would have otherwise fetched OCSP, but it also changes the optimal failure strategy for browsers. If a browser can reliably get an OCSP staple for a certificate, why not switch back from a soft-fail to a hard-fail strategy?

OCSP must-staple

As described above, the soft-fail strategy for validating OCSP responses opens up a security hole. An attacker with a revoked certificate can simply neglect to provide an OCSP response when a browser connects to it and the browser will accept their revoked certificate.

In the OCSP fetching case, a soft-fail approach makes sense. There many reasons the browser would not be able to obtain an OCSP: captive portals, broken OCSP servers, network unreliability and more. However, as we have shown with our high-reliability OCSP fetching service, it is possible for a server to fetch OCSP responses without any of these problems. OCSP responses are re-usable and are valid for several days. When one is close to expiring, the server can fetch a new one out-of-band and be able to reliably serve OCSP staples for all connections.

Public Domain

If the client knows that a server will always serve OCSP staples for every connection, it can apply a hard-fail approach, failing a connection if the OCSP response is missing. This closes the security hole introduced by the soft-fail strategy. This is where OCSP must-staple fits in.

OCSP must-staple is an extension that can be added to a certificate that tells the browser to expect an OCSP staple whenever it sees the certificate. This acts as an explicit signal to the browser that it’s safe to use the more secure hard-fail strategy.

Firefox enforces OCSP must-staple, returning the following error if such a certificate is presented without a stapled OCSP response.

Chrome provides the ability to mark a domain as “Expect-Staple”. If Chrome sees a certificate for the domain without a staple, it will send a report to a pre-configured report endpoint.

Reliability

As a part of our push to provide reliable OCSP stapling, we put our money where our mouths are and put an OCSP must-staple certificate on blog.cloudflare.com. Now if we ever don’t serve an OCSP staple, this page will fail to load on browsers like Firefox that enforce must-staple. You can identify a certificate by looking at the certificate details for the “1.3.6.1.5.5.7.1.24” OID.

Cloudflare customers can choose to upload must-staple custom certificates, but we encourage them not to do so yet because there may be a multi-second delay between the certificate being installed and our ability to populate the OCSP response cache. This will be fixed in the coming months. Other than the first few seconds after uploading the certificate, Cloudflare’s new OCSP fetching is robust enough to offer OCSP staples for every connection thereafter.

As of today, an attacker with access to the private key for a revoked certificate can still hijack the connection. All they need to do is to place themselves on the network path of the connection and block the OCSP request. OCSP must-staple prevents that, since an attacker will not be able to obtain an OCSP response that says the certificate has not been revoked.

The weird world of OCSP responders

For browsers, an OCSP failure is not the end of the world. Most browsers are configured to soft-fail when an OCSP responder returns an error, so users are unaffected by OCSP server failures. Some Certificate Authorities have had massive multi-day outages in their OCSP servers without affecting the availability of sites that use their certificates.

There’s no strong feedback mechanism for broken or slow OCSP servers. This lack of feedback has led to an ecosystem of faulty or unreliable OCSP servers. We experienced this first-hand while developing high-reliability OCSP stapling. In this section, we’ll outline half a dozen unexpected behaviors we found when deploying high-reliability OCSP stapling. A big thanks goes out to all the CAs who fixed the issues we pointed out. CA names redacted to preserve their identities.

CA #1: Non-overlapping periods

We noticed CA #1 certificates frequently missing their refresh-deadline, and during debugging we were lucky enough to see this:

$ date -u
Sat Mar  4 02:45:35 UTC 2017
$ ocspfetch <redacted, customer ID>
This Update: 2017-03-04 01:45:49 +0000 UTC
Next Update: 2017-03-04 02:45:49 +0000 UTC
$ date -u
Sat Mar  4 02:45:48 UTC 2017
$ ocspfetch <redacted, customer ID>
This Update: 2017-03-04 02:45:49 +0000 UTC
Next Update: 2017-03-04 03:45:49 +0000 UTC

It shows that CA #1 had configured their OCSP responders to use an incredibly short validity period with almost no overlap between validity periods, which makes it functionally impossible to always have a fresh OCSP response for their certificates. We contacted them, and they reconfigured the responder to produce new responses every half-interval.

CA #2: Wrong signature algorithm

Several certificates from CA #2 started failing with this error:

 bad OCSP signature: crypto/rsa: verification error

The issue is that the OCSP claims to be signed with SHA256-RSA, when it is actually signed with SHA1-RSA (and the reverse: indicates SHA1-RSA, actually signed with SHA256-RSA).

CA #3: Malformed OCSP responses

When we first started the project, our tool was unable to parse dozens of certificates in our database because of this error

asn1: structure error: integer not minimally-encoded

and many OCSP responses that we fetched from the same CA:

parsing ocsp response: bad OCSP signature: asn1: structure error: integer not minimally-encoded

What happened was that this CA had begun issuing <1% of certificates with a minor formatting error that rendered them unparseable by Golang’s x509 package. After contacting them directly, they quickly fixed the issue but then we had to patch Golang's parser to be more lenient about encoding bugs.

CA #4: Failed responder behind a load balancer

A small number of CA #4’s OCSP responders fell into a "bad state" without them knowing, and would return 404 on every request. Since the CA used load balancing to round-robin requests to a number of responders, it looked like 1 in 6 requests would fail inexplicably.

Two times between Jan 2017 and May 2017, this CA also experienced some kind of large data-loss event that caused them to return persistent "Try Later" responses for a large number of requests.

CA #5: Delay between certificate and OCSP responder

When a certificate is issued by CA #5, there is a large delay between the time a certificate is issued and the OCSP responder is able to start returning signed responses. This results in a delay between certificate issance and the availability of OCSP. It has mostly been resolved, but this general pattern is dangerous for OCSP must-staple. There have been some changes discussed by the CA/B Forum, an organization that regulates the issuance and management of certificates, to require CAs to offer OCSP soon after issuance.

CA #6: Extra certificates

It is typical to embed only one certificate in an OCSP response, if any. The one embedded certificate is supposed to be a leaf specially issued for signing an intermediate's OCSP responses. However, several CAs embed multiple certificates: the leaf they use for signing OCSP responses, the intermediate itself, and sometimes all the intermediates up to the root certificate.

Conclusions

We made OCSP stapling better and more reliable for Cloudflare customers. Despite the various strange behaviors we found in OCSP servers, we’ve been able to consistently serve OCSP responses for over 99.9% of connections since we’ve moved over to the new system. This great work was done by Brendan McMillion and Alessandro Ghedini. This is an important step in protecting the web community from attackers who have compromised certificate private keys.

comments powered by Disqus