Subscribe to receive notifications of new posts:

The Curious Case of Caching CSRF Tokens

2017-12-13

11 min read

It is now commonly accepted as fact that web performance is critical for business. Slower sites can affect conversion rates on e-commerce stores, they can affect your sign-up rate on your SaaS service and lower the readership of your content.

In the run-up to Thanksgiving and Black Friday, e-commerce sites turned to services like Cloudflare to help optimise their performance and withstand the traffic spikes of the shopping season.

In preparation, an e-commerce customer joined Cloudflare on the 9th November, a few weeks before the shopping season. Instead of joining via our Enterprise plan, they were a self-serve customer who signed-up by subscribing to our Business plan online and switching their nameservers over to us.

Their site was running Magento, a notably slow e-commerce platform - filled with lots of interesting PHP, with a considerable amount of soft code in XML. Running version 1.9, the platform was somewhat outdated (Magento was totally rewritten in version 2.0 and subsequent releases).

Despite the somewhat dated technology, the e-commerce site was "good enough" for this customer and had done it's job for many years.

They were the first to notice an interesting technical issue surrounding how performance and security can often feel at odds with each other. Although they were the first to highlight this issue, into the run-up of Black Friday, we ultimately saw around a dozen customers on Magento 1.8/1.9 have similar issues.

Initial Optimisations

After signing-up for Cloudflare, the site owners attempted to make some changes to ensure their site was loading quickly.

The website developers had already ensured the site was loading over HTTPS, in doing so they were able to ensure their site was loading over the new HTTP/2 Protocol and made some changes to ensure their site was optimised for HTTP/2 (for details, see our blog post on HTTP/2 For Web Developers).

At Cloudflare we've taken steps to ensure that there isn't a latency overhead for establishing a secure TLS connection, here is a non-complete list of optimisations we use:

Additionally, they had enabled HTTP/2 Server Push to ensure critical CSS/JS assets could be pushed to clients when they made their first request. Without Server Push, a client has to download the HTML response, interpret it and then work out assets it needs to download.

Big images were Lazy Loaded, only downloading them when they needed to be seen by the users. Additionally, they had enabled a Cloudflare feature called Polish. With this enabled, Cloudflare dynamically works out whether it's faster serve an image in WebP (a new image format developed by Google) or whether it's faster to serve it in a different format.

These optimisations did make some improvement to performance, but their site was still slow.

Respect The TTFB

In web performance, there are a few different things which can affect the response times - I've crudely summarised them into the following three categories:

  • Connection & Request Time - Before a request can be sent off for a website to load something, a few things need to happen: DNS queries, a TCP handshake to establish the connection with the web server and a TLS handshake to establish a secure connection

  • Page Render - A dynamic site needs to query databases, call APIs, write logs, render views, etc before a response can be made to a client

  • Response Speed - Downloading the response from web server, browser-side rendering of the HTML and pulling the other resources linked in the HTML

The e-commerce site had taken steps to improve their Response Speed by enabling HTTP/2 and performing other on-site optimisations. They had also optimised their Connection & Response Time by using a CDN service like Cloudflare to provide fast DNS and reduce latency when optimising TLS/TCP connections.

However, they now realised the critical step they needed to optimise was around the Page Render that would happen on their web server.

By looking at a Waterfall View of how their site loaded (similar to the one below) they could see the main constraint.

Example Waterfall view from WebSiteOptimization.com

On the initial request, you can see the green "Time to First Byte" view taking a very long time.

Many browsers have tools for viewing Waterfall Charts like the one above, Google provide some excellent documentation for Chrome on doing this: Get Started with Analyzing Network Performance in Chrome DevTools. You can also generate these graphs fairly easily from site speed test tools like WebPageTest.org.

Time to First Byte itself is an often misunderstood metric and often can't be attributed to a single fault. For example; using a CDN service like Cloudflare may increase TTFB by a few milliseconds, but do so to the benefit of an overall load time. This can be as the CDN is adding additional compression functionality to speed up the response, or simply as it has to establish a connection back to the origin web server (which isn't visible by the client).

There are instances where it is important to debug why TTFB is a problem. For example; in this instance, the e-commerce platform was taking upwards of 3 seconds just to generate the HTML response. In this case, it was clear the constraint was the server-side Page Render.

When the web server was generating dynamic content, it was having to query databases and perform logic before a request could be served. In most instances (i.e. a product page) the page would be identical to every other request. It would only be when someone would add something to their shopping cart that the site would really become dynamic.

Before someone logs into the the Magento admin panel or adds something to their shopping cart, the page view is anonymous and will be served up identically to every visitor. It will only be the when an anonymous visitor logs in or adds something to their shopping cart that they will see a page that's dynamic and unlike every other page that's been rendered.

It therefore is possible to cache those anonymous requests so that Magento on an origin server doesn't need to constantly regenerate the HTML.

Cloudflare users on our Business Plan are able to cache anonymous page views when using Magneto using our Bypass Cache on Cookie functionality. This allows for static HTML to be cached at our edge, with no need for it to be regenerated from request to request.

This provides a huge performance boost for the first few page visits of a visitor, and allows them still to interact with the dynamic site when they need to. Additionally, it helps keep load down on the origin server in the event of traffic spikes, sparing precious server CPU time for those who need it to complete dynamic actions like paying for an order.

Here's an example of how this can be configured in Cloudflare using the Page Rules functionality:

Magento Cache Cookie Bypass

The Page Rule configuration above instructs Cloudflare to "Cache Everything (including HTML), but bypass the cache when it sees a request which contains any of the cookies: external_no_cache, PHPSESSID or adminhtml. The final Edge Cache TTL setting just instructs Cloudflare to keep HTML files in cache for a month, this is necessary as Magento by default uses headers to discourage caching.

The site administrator configured their site to work something like this:

  1. On the first request, the user is anonymous and their request indistinguishable from any other - their page can be served from the Cloudflare cache

  2. When the customer adds something to their shopping cart, they do that via a POST request - as methods like POST, PUT and DELETE are intended to change a resource, they bypass the Cloudflare cache

  3. On the POST request to add something to their shopping cart, Magento will set a cookie called external_no_cache

  4. As the site owner has configured Cloudflare to bypass the cache when we see a request containing the external_no_cache cookie, all subsequent requests go direct to origin

This behaviour can be summarised in the following crude diagram:

The site administrators initially enabled this configuration on a subdomain for testing purposes, but noticed something rather strange. When they would add something to the cart on their test site, the cart would show up empty. If they then tried again to add something to the cart, the item would be added successfully.

The customer reported one additional, interesting piece of information - when they tried to mimic this cookie-based caching behaviour internally using Varnish, they faced the exact same issue.

In essence, the Add to Cart functionality would fail, but only on the first request. This was indeed odd behaviour, and the customer reached out to Cloudflare Support.

Debugging

The customer wrote in just as our Singapore office were finishing up their afternoon and was initially triaged by a Support Engineer in that office.

The Support Agent evaluated what the problem was and initially identified that if the frontend cookie was missing, the Add to Cart functionality would fail.

No matter which page you access on Magento, it will attempt to set a frontend cookie, even if it doesn't add an external_no_cache cookie

When Cloudflare caches static content, the default behaviour is to strip away any cookies coming from the server if the file is going to end up in cache - this is a security safeguard to prevent customers accidentally caching private session cookies. This applies when a cached response contains a Set-Cookie header, but does not apply when the cookie is set via JavaScript - in order to allow functionality like Google Analytics to work.

They had identified that the caching logic at our network edge was working fine, but for whatever reason Magento would refuse to add something to a shopping cart without a valid frontend cookie. Why was this?

As Singapore handed their shift work over to London, the Support Engineer working on this ticket decided to escalate the ticket up to me. This was largely as, towards the end of last year, I had owned the re-pricing of this feature (which opened it up to our self-service Business plan users, instead of being Enterprise-only). That said; I had not touched Magneto in many years, even when I was working in digital agencies I wasn't the most enthusiastic to build on it.

The Support Agent provided some internal comments that described the issue in detail and their own debugging steps, with an effective "TL;DR" summary:

Debugging these kinds of customer issues is not as simple as putting breakpoints into a codebase. Often, for our Support Engineers, the customers origin server acts as a black-box and there can be many moving parts, and they of course have to manage the expectations of a real customer at the other end. This level of problem solving fun, is one of the reasons I still like answering customer support tickets when I get a chance.

Before attempting to debug anything, I double checked that the Support Agent was correct that nothing had gone wrong on our end - I trusted their judgement and no others customers were reporting their caching functionality had broken, but it is always best to cross-check manual debugging work. I ran some checks to ensure that there were no regressions in our Lua codebase that controls caching logic:

  • Checked that there were no changes to this logic in our internal code respository

  • Check that automated tests are still in place and build successfully

  • Run checks on production to verify that caching behaviour still works as normal

As Cloudflare has customers across so many platforms, I also checked to ensure that there were no breaking changes in Magento codebase that would cause this bug to occur. Occasionally we find our customers accidentally come across bugs in CMS platforms which are unreported. This, fortunately, was not one of those instances.

The next step is to attempt to replicate the issue locally and away from the customers site. I spun up a vanilla instance of Magento 1.9 and set it up with an identical Cloudflare configuration. The experiment was successful and I was able to replicate the customer issue.

I had an instinctive feeling that it was the Cross Site Request Forgery Protection functionality that was at fault here and I started tweaking my own test Magento installation to see if this was the cases.

Cross Site Request Forgery attacks work by exploiting the fact that one site on the internet can get a client to make requests to another site.

For example; suppose you have an online bank account with the ability to send money to other accounts. Once logged in, there is a form to send money which uses the following HTML:

<form action="https://example.com/send-money">
Account Name:
<input type="text" name="account_name" />
Amount:
<input type="text" name="amount" />
<input type="submit" />
</form>

After logging in and doing your transactions, you don't log-out of the website - but you simply navigate elsewhere online. Whilst browsing around you come across a button on a website that contains the text "Click me! Why not?". You click the button, and £10,000 goes from your bank account to mine.

This happens because the button you clicked was connected to an endpoint on the banking website, and contained hidden fields instructing it to send me £10,000 of your cash:

<form action="https://example.com/send-money">
<input type="hidden" name="account_name" value="Junade Ali" />
<input type="hidden" name="amount" value="10,000" />
<input type="submit" value="Click me! Why not?" />
</form>

In order to prevent these attacks, CSRF Tokens are inserted as hidden fields into web forms:

<form action="https://example.com/send-money">
Account Name:
<input type="text" name="account_name" />
Amount:
<input type="text" name="amount" />
<input type="hidden" name="csrf_protection" value="hunter2" />
<input type="submit" />
</form>

A cookie is first set on the clients computer containing a random session cookie. When a form is served to the client, a CSRF token is generated using that cookie. The server will check that the CSRF token submitted in the HTML form actually matches the session cookie, and if it doesn't block the request.

In this instance, as there was no session cookie ever set (Cloudflare would strip it out before it entered cache), the POST request to the Add to Cart functionality could never verify the CSRF token and the request would fail.

Due to CSRF vulnerabilities, Magento applied CSRF protection to all forms; this broke Full Page Cache implementations in Magento 1.8.x/1.9.x. You can find all the details in the SUPEE-6285 patch documentation from Magento.

Caching Content with CSRF Protected Forms

To validate that CSRF Tokens were definitely at fault here, I completely disabled CSRF Protection in Magento. Obviously you should never do this in production, I found it slightly odd that they even had a UI toggle for this!

Another method which was created in the Magento Community was an extension to disable CSRF Protection just for the Add To Cart functionality (Inovarti_FixAddToCartMage18), under the argument that CSRF risks are far reduced when we're talking about Add to Cart functionality. This is still not ideal, we should ideally have CSRF Protection on every form when we're talking about actions which change site behaviour.

There is, however, a third way. I did some digging and identified a Magento plugin that effectively uses JavaScript to inject a dynamic CSRF token the moment a user clicks the Add to Cart button but just before the request is actually submitted. There's quite a lengthy Github thread which outlines this issue and references the Pull Requests which fixed this behaviour in the the Magento Turpentine plugin. I won't repeat the set-up instructions here, but they can be found in an article I've written on the Cloudflare Knowledge Base: Caching Static HTML with Magento (version 1 & 2)

Effectively what happens here is that the dynamic CSRF token is only injected into the web page the moment that it's needed. This is actually the behaviour that's implemented in other e-commerce platforms and Magento 2.0+, allowing Full Page Caching to be implemented quite easily. We had to recommend this plugin as it wouldn't be practical for the site owner to simply update to Magneto 2.

One thing to be wary of when exposing CSRF tokens via an AJAX endpoint is JSON Hijacking. There are some tips on how you can prevent this in the OWASP AJAX Security Cheat Sheet. Iain Collins has a Medium post with further discussion on the security merits of CSRF Tokens via AJAX (that said, however you're performing CSRF prevention, Same Origin Policies and HTTPOnly cookies FTW!).

There is an even cooler way you can do this using Cloudflare's Edge Workers offering. Soon this will allow you to run JavaScript at our Edge network, and you can use that to dynamically insert CSRF tokens into cached content (and, then either perform cryptographic validation of CSRF either at our Edge or the Origin itself using a shared secret).

But this has been a problem since 2015?

Another interesting observation is that the Magento patch which caused this interesting behaviour had been around since July 7, 2015. Why did our Support Team only see this issue in the run-up to Black Friday in 2017? What's more, we ultimately saw around a dozen support tickets around this exact issue on Magento 1.8/1.9 over the course over 6 weeks.

When an Enterprise customer ordinarily joins Cloudflare, there is a named Solutions Engineer who gets them up and running and ensures there is no pain; however when you sign-up online with a credit card, your forgo this privilege.

Last year, we released Bypass Cache on Cookie to self-serve users when a lot of e-commerce customers were in their Christmas/New Year release freeze and not making changes to their websites. Since then, there were no major shopping events; most the sites enabling this feature were new build websites using Magento 2 where this wasn't an issue.

In the run-up to Black Friday, performance and coping under load became a key consideration for developers working on legacy e-commerce websites - and they turned to Cloudflare. Given the large, but steady, influx of e-commerce websites joining Cloudflare - the low overall percentage of those on Magento 1.8/1.9 became noticeable.

Conclusion

Caching anonymous page views is an important, and in some cases, essential mechanism to dramatically improve site performance to substantially reduce site load, especially during traffic spikes. Whilst aggressively caching content when users are anonymous, you can bypass the cache and allow users to use the dynamic functionality your site has to offer.

When you need to insert a dynamic state into cached content, JavaScript offers a nice compromise. JavaScript allows us to cache HTML for anonymous page visits, but insert a state when the users interact in a certain way. In essence, defusing this conflict between performance and security. In the future you'll be able to run this JavaScript logic at our network edge using Cloudflare Edge Workers.

It also remains important to respect the RESTful properties of HTTP and ensure GET, OPTIONS and HEAD requests remain safe and instead using POST, PUT, PATCH and DELETE as necessary.

If you're interested in debugging interesting technical problems on a network that sees around 10% of global internet traffic, we're hiring for Support Engineers in San Francisco, London, Austin and Singapore.

Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.

Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.

To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.
Page RulesSaaSSpeed & ReliabilityHTTPSCache

Follow on X

Junade Ali|@IcyApril
Cloudflare|@cloudflare

Related posts

October 09, 2024 1:00 PM

Improving platform resilience at Cloudflare through automation

We realized that we need a way to automatically heal our platform from an operations perspective, and designed and built a workflow orchestration platform to provide these self-healing capabilities across our global network. We explore how this has helped us to reduce the impact on our customers due to operational issues, and the rich variety of similar problems it has empowered us to solve....

September 25, 2024 1:00 PM

Introducing Speed Brain: helping web pages load 45% faster

We are excited to announce the latest leap forward in speed – Speed Brain. Speed Brain uses the Speculation Rules API to prefetch content for the user's likely next navigations. The goal is to download a web page to the browser before a user navigates to it, allowing pages to load instantly. ...