
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 05 Apr 2026 18:26:10 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Speeding up your (WordPress) website is a few clicks away]]></title>
            <link>https://blog.cloudflare.com/speeding-up-your-website-in-a-few-clicks/</link>
            <pubDate>Thu, 22 Jun 2023 13:00:58 GMT</pubDate>
            <description><![CDATA[ In this blog, we will explain where the opportunities exist to improve website performance, how to check if a specific site can improve performance, and provide a small JavaScript snippet which can be used with Cloudflare Workers to do this optimization for you ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/h8hH7qWrYChGMYyrktaXj/163bb1bb8d730c069de5865285db1697/image1-31.png" />
            
            </figure><p>Every day, website visitors spend far too much time waiting for websites to load in their browsers. This waiting is partially due to browsers not knowing which resources are critically important so they can prioritize them ahead of less-critical resources. In this blog we will outline how millions of websites across the Internet can improve their performance by specifying which critical content loads first with Cloudflare Workers and what Cloudflare will do to make this easier by default in the future.</p><p>Popular Content Management Systems (CMS) like WordPress have made attempts to influence website resource priority, for example through techniques like <a href="https://make.wordpress.org/core/2020/07/14/lazy-loading-images-in-5-5/">lazy loading images</a>. When done correctly, the results are magical. Performance is optimized between the CMS and browser without needing to implement changes or code new prioritization strategies. However, we’ve seen that these default priorities still leave significant room for improvement.</p><p>In this co-authored blog with Google’s Patrick Meenan we will explain where the opportunities exist to improve website performance, how to check if a specific site can improve performance, and provide a small JavaScript snippet which can be used with Cloudflare Workers to do this optimization for you.</p>
    <div>
      <h3>What happens when a browser receives the response?</h3>
      <a href="#what-happens-when-a-browser-receives-the-response">
        
      </a>
    </div>
    <p>Before we dive into where the opportunities are to <a href="https://www.cloudflare.com/learning/performance/speed-up-a-website/">improve website performance</a>, let’s take a step back to understand how browsers load website assets by default.</p><p>After the browser sends an <a href="https://www.cloudflare.com/learning/ddos/glossary/hypertext-transfer-protocol-http/">HTTP request</a> to a server, it receives an HTTP response containing information like status codes, headers, and the requested content. The browser carefully analyzes the response's status code and response headers to ensure proper handling of the content.</p><p>Next, the browser processes the content itself. For HTML responses, the browser extracts important information from the <code>&lt;head&gt;</code> section of the HTML, such as the page title, stylesheets, and scripts. Once this information is parsed, the browser moves on to the response <code>&lt;body&gt;</code>, which has the actual page content. During this stage, the browser begins to present the webpage to the visitor.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/QXSvFomT1av14RMVga2Qf/e379b539ca63b484d47e9f14f128cada/image2-28.png" />
            
            </figure><p>If the response includes additional resources like CSS, JavaScript, or other content, the browser may need to fetch and integrate them into the webpage. Typically, browsers like Google Chrome delay loading images until after the resources in the HTML <code>&lt;head&gt;</code> have loaded. This is also known as “<a href="https://developer.chrome.com/en/docs/lighthouse/performance/render-blocking-resources/">blocking</a>” the render of the webpage. However, developers can override this blocking behavior using <a href="https://web.dev/fetch-priority/">fetch priority</a> or other methods to boost other content’s priority in the browser. By adjusting an important image's fetch priority, it can be loaded earlier, which can lead to significant improvements in crucial performance metrics like LCP (<a href="https://web.dev/optimize-lcp/#:~:text=LCP%20measures%20the%20time%20from%20when%20the%20user%20initiates%20loading%20the%20page%20until%20the%20largest%20image">Largest Contentful Paint</a>).</p><p>Images are so central to web pages that they have become an essential element in measuring website performance from <a href="https://www.cloudflare.com/learning/performance/what-are-core-web-vitals/">Core Web Vitals</a>. LCP measures the time it takes for the largest visible element, often an image, to be fully rendered on the screen. Optimizing the loading of critical images (like <a href="https://web.dev/optimize-lcp/#:~:text=LCP%20measures%20the%20time%20from%20when%20the%20user%20initiates%20loading%20the%20page%20until%20the%20largest%20image">LCP images</a>) can greatly enhance performance, improving the overall user experience and page performance.</p><p>But here's the challenge – a browser may not know which images are the most important for the visitor experience (like the LCP image) until rendering begins. 
If the developer can identify the LCP image or critical elements before the HTML reaches the browser, its priority can be increased at the server to boost website performance instead of waiting for the browser to naturally discover the critical images.</p><p>In our Smart Hints <a href="/smart-hints">blog</a>, we describe how Cloudflare will soon be able to automatically prioritize content on behalf of website developers, but what happens if there’s a need to optimize the priority of the images right now? How do you know if a website is in a suboptimal state and what can you do to improve?</p><p>Using Cloudflare, developers should be able to improve image performance with heuristics that identify likely-important images before the browser parses them, so these images can have increased priority and be loaded sooner.</p>
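<p>As a rough illustration of that server-side boost, the sketch below rewrites an image tag as a plain string: it strips any <code>loading="lazy"</code> attribute and adds <code>fetchpriority="high"</code>. The <code>boostImgTag</code> helper is hypothetical; the Worker later in this post does the same job properly with HTMLRewriter.</p>

```javascript
// Hypothetical helper: rewrite an <img> tag string so the browser
// fetches it earlier. Shown as a plain string transform for clarity.
function boostImgTag(tag) {
  return tag
    .replace(/\sloading="lazy"/, '')               // stop deferring the image
    .replace(/<img/, '<img fetchpriority="high"'); // ask for an earlier fetch
}

const hero = '<img loading="lazy" src="/hero.jpg" width="1200" height="600">';
console.log(boostImgTag(hero));
// <img fetchpriority="high" src="/hero.jpg" width="1200" height="600">
```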
    <div>
      <h3>Identifying Image Priority opportunities</h3>
      <a href="#identifying-image-priority-opportunities">
        
      </a>
    </div>
    <p>Just increasing the fetch priority of all images won't help if they are lazy-loaded or not critical/LCP images. <a href="https://www.cloudflare.com/learning/performance/what-is-lazy-loading/">Lazy-loading</a> is a method that developers use to generally improve the initial load of a webpage if it includes numerous out-of-view elements. For example, on Instagram, when you continually scroll down the application to see more images, it would only make sense to load those images when the user arrives at them; otherwise, the page load would be needlessly delayed by the browser eagerly loading these out-of-view images. Instead, the highest priority should be given to the LCP image in the viewport to improve performance.</p><p>So developers are left in a situation where they need to know which images are on users' screens/viewports to increase their priority and which are off their screens to lazy-load them.</p><p>Recently, we’ve seen attempts to influence image priority on behalf of developers. For example, by <a href="https://make.wordpress.org/core/2020/07/14/lazy-loading-images-in-5-5/">default</a>, in WordPress 5.5 all images with an <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img">IMG tag</a> and <a href="https://en.wikipedia.org/wiki/Aspect_ratio_(image)">aspect ratios</a> were directed to be lazy-loaded. 
While there are plugins and other methods WordPress developers can use to boost the priority of LCP images, lazy-loading all images by default without knowing which are LCP images can cause artificial delays in website performance (they’re <a href="https://make.wordpress.org/core/2023/05/02/proposal-for-enhancing-lcp-image-performance-with-fetchpriority/">working on this</a> though, and have partially resolved this for <a href="https://make.wordpress.org/core/2023/04/05/wordpress-6-2-performance-improvements-for-all-themes/">block themes</a>).</p><p><i>So how do we identify the LCP image and other critical assets before they get to the browser?</i></p><p>To evaluate the opportunity to improve image performance, we turned to the <a href="https://httparchive.org/">HTTP Archive</a>. Out of the approximately 22 million desktop pages tested in February 2023, 46% had an <a href="https://web.dev/optimize-lcp/">LCP element</a> with an <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img">IMG tag</a>. In other words, the LCP element was an image about half the time. Among these desktop pages, 8.5 million had the image in the <a href="https://developer.mozilla.org/en-US/docs/Web/HTML">static HTML</a> delivered with the page, indicating a <b>total potential improvement opportunity of approximately 39% of the desktop pages</b> within the dataset.</p><p>In the case of mobile pages, out of the ~28.5 million tested, 40% had an LCP element as an IMG tag. Among these mobile pages, 10.3 million had the image in the static HTML delivered with the page, suggesting a potential <b>improvement opportunity in around 36% of the mobile pages</b> within the dataset.</p><p>However, as previously discussed, prioritizing an image won't be effective if the image is lazy-loaded because the directives are contradictory. 
In the dataset, approximately 1.8 million LCP desktop images and 2.4 million LCP mobile images were lazy-loaded.</p><p>Therefore, across the Internet, the opportunity to improve image performance covers roughly 30% of pages that have an LCP image in the original HTML markup that wasn’t lazy-loaded, but with a more advanced Cloudflare Worker, the additional 9% of lazy-loaded LCP images can also be improved by removing the lazy-load attribute.</p><p>If you’d like to determine which element on your website serves as the <a href="https://web.dev/lcp/#what-elements-are-considered">LCP element</a> so you can increase the priority or remove any lazy-loading, you can use browser <a href="https://developer.chrome.com/docs/devtools/">developer tools</a>, or <a href="https://www.cloudflare.com/learning/performance/test-the-speed-of-a-website/">speed tests</a> like <a href="https://www.webpagetest.org/">Webpagetest</a> or <a href="https://dash.cloudflare.com?to=/:account/:zone/speed/test">Cloudflare Observatory</a>.</p><p>39% of desktop images seems like a lot of opportunity to improve image performance. So the next question is how can Cloudflare determine LCP images across our network and automatically prioritize them?</p>
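<p>If you prefer the browser console, a <code>PerformanceObserver</code> subscribed to <code>largest-contentful-paint</code> entries will surface the candidates directly. The browser emits a new entry each time a larger element paints, so the last entry is the final LCP element. A minimal sketch (the entry objects below are mocks standing in for real browser entries):</p>

```javascript
// In a browser console you would collect entries like this (sketch):
//   new PerformanceObserver(list => console.log(pickLcp(list.getEntries())))
//     .observe({ type: 'largest-contentful-paint', buffered: true });
//
// pickLcp is a pure helper: the browser reports a new LCP candidate each
// time a larger element renders, so the last entry is the final LCP.
function pickLcp(entries) {
  return entries.length ? entries[entries.length - 1] : null;
}

// Mock entries shaped loosely like LargestContentfulPaint records
const mockEntries = [
  { element: 'h1', size: 12000, startTime: 800 },
  { element: 'img', size: 480000, startTime: 1900 },
];
console.log(pickLcp(mockEntries).element); // img
```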
    <div>
      <h3>Image Index</h3>
      <a href="#image-index">
        
      </a>
    </div>
    <p>We thought that how soon the LCP image showed up in the HTML would serve as a useful indicator. So we analyzed the HTTP Archive dataset to see the cumulative percentage of LCP images discovered by their position in the HTML, including lazy-loaded images.</p><p>We found that approximately 25% of the pages had the LCP image as the first image in the HTML (around 10% of all pages). Another 25% had the LCP image as the second image. WordPress seemed to arrive at a similar conclusion and recently <a href="https://make.wordpress.org/core/2023/04/05/wordpress-6-2-performance-improvements-for-all-themes/">released</a> a change to remove the default lazy-load attribute from the first image on block themes, but there are opportunities to go further.</p><p>Our analysis revealed that implementing a straightforward rule like "do not lazy-load the first four images," whether through the browser, a content management system (CMS), or a Cloudflare Worker, could address approximately 75% of the issue of lazy-loading LCP images (example Worker below).</p>
    <div>
      <h3>Ignoring small images</h3>
      <a href="#ignoring-small-images">
        
      </a>
    </div>
    <p>In trying to find other ways to identify likely LCP images we next turned to the size of the image. To increase the likelihood of getting the LCP image early in the HTML, we looked into ignoring “small” images as they are unlikely to be big enough to be an LCP element. We explored several sizes, and 10,000 pixels (less than 100x100) was a pretty reliable threshold that didn’t skip many LCP images and avoided a good chunk of the non-LCP images.</p><p>By ignoring small images (&lt;10,000px), we found that the first image became the LCP image in approximately 30-34% of cases. Adding the second image increased this percentage to 56-60% of pages.</p><p>Therefore, to improve image priority, a potential approach could involve assigning a higher priority to the first four "not-small" images.</p>
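<p>Put together, the heuristic is: walk the images in document order, skip anything under 10,000px of declared area, and boost the first four that remain. A pure-function sketch (the input shape is illustrative, and unlike the Worker below it skips images without declared dimensions):</p>

```javascript
// Sketch of the heuristic above: pick the first `limit` images whose
// declared area is at least 10,000px (roughly 100x100).
function pickBoostCandidates(images, limit = 4) {
  const picked = [];
  for (const img of images) {
    if (picked.length >= limit) break;
    const area = (img.width || 0) * (img.height || 0);
    if (area >= 10000) picked.push(img.src); // big enough to be an LCP candidate
  }
  return picked;
}

const page = [
  { src: '/icon.svg', width: 24, height: 24 },     // too small, skipped
  { src: '/hero.jpg', width: 1200, height: 600 },  // boosted
  { src: '/product.jpg', width: 300, height: 300 } // boosted
];
console.log(pickBoostCandidates(page)); // [ '/hero.jpg', '/product.jpg' ]
```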
    <div>
      <h3>Chrome 114 Image Prioritization Experiment</h3>
      <a href="#chrome-114-image-prioritization-experiment">
        
      </a>
    </div>
    <p>An experiment running in Chrome 114 does exactly what we described above. Within the browser there are a few different prioritization knobs to play with that aren’t web-exposed, so we have the opportunity to assign a “medium” priority to images that we want to boost automatically (directly controlling priority with “fetch priority” lets you set high or low). This will let us move the images ahead of other images, async scripts and parser-blocking scripts late in the body, but still keep the boosted image priority below any high-priority requests, particularly dynamically-injected blocking scripts.</p><p>We are experimenting with boosting the priority of varying numbers of images (2, 5 and 10) and with allowing one of those medium-priority images to load at a time during Chrome’s “tight” mode (when it is loading the render-blocking resources in the head) to increase the likelihood that the LCP image will be available when the first paint is done.</p><p>The data is still coming in and no “ship” decisions have been made yet, but the early results are very promising, improving the LCP time across the entire web for all arms of the experiment (not by massive amounts, but moving the metrics of the whole web is notoriously difficult).</p>
    <div>
      <h3>How to use Cloudflare Workers to boost performance</h3>
      <a href="#how-to-use-cloudflare-workers-to-boost-performance">
        
      </a>
    </div>
    <p>Now that we’ve seen that there is a large opportunity across the Internet for helping prioritize images for performance and how to identify images on individual pages that are likely LCP images, the question becomes: what would the results be of implementing a network-wide rule that boosts image priority based on this study?</p><p>We built a test worker and deployed it on some WordPress test sites with our friends at <a href="https://rocket.net/">Rocket.net</a>, a WordPress hosting platform focused on performance. This worker boosts the priority of the first four images while removing the lazy-load attribute, if present. When deployed, we saw good performance results and the expected image prioritization.</p>
            <pre><code>export default {
  async fetch(request) {
    const response = await fetch(request);

    // Only rewrite HTML responses
    const contentType = response.headers.get('Content-Type');
    if (!contentType || !contentType.includes('text/html')) {
      return response;
    }

    return transformResponse(response);
  },
};

function transformResponse(response) {
  // HTMLRewriter streams the transformed HTML to the client,
  // so the full body is never buffered in the Worker
  return new HTMLRewriter()
    .on('img', new ImageElementHandler())
    .transform(response);
}
 
class ImageElementHandler {
  constructor() {
    this.imageCount = 0;
    this.processedImages = new Set();
  }
 
  element(element) {
    const imgSrc = element.getAttribute('src');
 
    // Boost the first four not-small images, skipping duplicate sources
    if (imgSrc &amp;&amp; this.imageCount &lt; 4 &amp;&amp; !this.processedImages.has(imgSrc) &amp;&amp; !isImageSmall(element)) {
      element.removeAttribute('loading');
      element.setAttribute('fetchpriority', 'high');
      this.processedImages.add(imgSrc);
      this.imageCount++;
    }
  }
}
 
function isImageSmall(element) {
  // Check if the element has width and height attributes
  const width = element.getAttribute('width');
  const height = element.getAttribute('height');
 
  // If width or height is 0, or width * height &lt; 10000, consider the image as small
  if ((width &amp;&amp; parseInt(width, 10) === 0) || (height &amp;&amp; parseInt(height, 10) === 0)) {
    return true;
  }
 
  if (width &amp;&amp; height) {
    const area = parseInt(width, 10) * parseInt(height, 10);
    if (area &lt; 10000) {
      return true;
    }
  }
 
  return false;
}</code></pre>
            <p>When testing the Worker, we saw that default image priority was boosted to “high” for the first four images and the fifth image remained “low.” This resulted in an LCP range of “<a href="https://web.dev/lcp/#:~:text=first%20started%20loading.-,What%20is%20a%20good%20LCP%20score%3F,across%20mobile%20and%20desktop%20devices.">good</a>” from a speed test. While this initial test is not a definitive indicator that the Worker will boost performance in every situation, the results are promising and we look forward to continuing to experiment with this idea.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4jHIrp0orKEbGAppkSWVXq/1d0b5704b4ae310010e25a99599dfa49/image3-21.png" />
            
            </figure><p>While we’ve experimented with WordPress sites to illustrate the issues and potential performance benefits, this issue is present across the Internet.</p><p>Website owners can help us experiment with the Worker above to improve the priority of images on their websites or edit it to be more specific by targeting likely LCP elements. Cloudflare will continue experimenting using a very similar process to understand how to safely implement a network-wide rule to ensure that images are correctly prioritized across the Internet and performance is boosted without the need to configure a specific Worker.</p>
    <div>
      <h3>Automatic Platform Optimization</h3>
      <a href="#automatic-platform-optimization">
        
      </a>
    </div>
    <p>Cloudflare’s <a href="https://developers.cloudflare.com/automatic-platform-optimization/">Automatic Platform Optimization</a> (APO) is a plugin for WordPress which allows Cloudflare to deliver your entire WordPress site from our network ensuring consistent, fast performance for visitors. By serving cached sites, APO can improve performance metrics. APO does not currently have a way to prioritize images over other assets to improve browser render metrics or dynamically rewrite HTML, techniques we’ve discussed in this post. Although this presents a potential opportunity for future development, it requires thorough testing to ensure safe and reliable support.</p><p>In the future we’ll look to include the techniques discussed today as part of APO, however in the meantime we recommend using <a href="/snippets-announcement/">Snippets</a> (and <a href="/performance-experiments-with-cloudflare/">Experiments</a>) to test with the code example above to see the performance impact on your website.</p>
    <div>
      <h3>Get in touch!</h3>
      <a href="#get-in-touch">
        
      </a>
    </div>
    <p>If you are interested in using the JavaScript above, we recommend testing with <a href="https://workers.cloudflare.com/">Workers</a> or using <a href="/snippets-announcement/">Cloudflare Snippets</a>. We’d love to hear what your results were. Get in touch via social media and share your experiences.</p> ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Automatic Platform Optimization]]></category>
            <category><![CDATA[WordPress]]></category>
            <guid isPermaLink="false">5VIbkWZzUAIMJDqTVlKR8i</guid>
            <dc:creator>Alex Krivit</dc:creator>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Better HTTP/2 Prioritization for a Faster Web]]></title>
            <link>https://blog.cloudflare.com/better-http-2-prioritization-for-a-faster-web/</link>
            <pubDate>Tue, 14 May 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ Advancing the state of the art of HTTP/2 to deliver a faster user experience. See why it matters and how much of a difference it makes. ]]></description>
            <content:encoded><![CDATA[ <p>HTTP/2 promised a much faster web and Cloudflare rolled out HTTP/2 access for all our customers long, long ago. But one feature of HTTP/2, Prioritization, didn’t live up to the hype. Not because it was fundamentally broken but because of the way browsers implemented it.</p><p>Today Cloudflare is pushing out a change to HTTP/2 Prioritization that gives our servers control of prioritization decisions that truly make the web much faster.</p><p>Historically the browser has been in control of deciding how and when web content is loaded. Today we are introducing a radical change to that model for all paid plans that puts control into the hands of the site owner directly. Customers can enable “Enhanced HTTP/2 Prioritization” in the Speed tab of the Cloudflare dashboard: this overrides the browser defaults with an improved scheduling scheme that results in a significantly faster visitor experience (we have seen 50% faster on multiple occasions). With <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a>, site owners can take this a step further and fully customize the experience to their specific needs.</p>
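<p>For the Workers customization mentioned above, the per-response override takes the form of a <code>cf-priority</code> response header whose value is a <code>weight/concurrency</code> pair (for example <code>30/1</code>). A minimal sketch, assuming Enhanced HTTP/2 Prioritization is enabled for the zone:</p>

```javascript
// Sketch: attach a cf-priority scheduling hint ("weight/concurrency")
// to a response. With Enhanced HTTP/2 Prioritization enabled, the
// Cloudflare edge can use this header when scheduling delivery.
function withPriority(response, weight, concurrency) {
  const headers = new Headers(response.headers);
  headers.set('cf-priority', `${weight}/${concurrency}`);
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}

// Inside a Worker's fetch handler this might look like (sketch):
//   const response = await fetch(request);
//   if (request.url.endsWith('.jpg')) {
//     return withPriority(response, 30, 1); // weight 30, one stream at a time
//   }
//   return response;
```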
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>Web pages are made up of <a href="https://discuss.httparchive.org/t/whats-the-distribution-of-requests-per-page/21/10?u=patmeenan">dozens (sometimes hundreds)</a> of separate resources that are loaded and assembled by the browser into the final displayed content. This includes the visible content the user interacts with (HTML, CSS, images) as well as the application logic (JavaScript) for the site itself, ads, analytics for tracking site usage and marketing tracking beacons. The sequencing of how those resources are loaded can have a significant impact on how long it takes for the user to see the content and interact with the page.</p><p>A browser is basically an HTML processing engine that goes through the HTML document and follows the instructions in order from the start of the HTML to the end, building the page as it goes along. References to stylesheets (CSS) tell the browser how to style the page content and the browser will delay displaying content until it has loaded the stylesheet (so it knows how to style the content it is going to display). Scripts referenced in the document can have several different behaviors. If the script is tagged as “async” or “defer” the browser can keep processing the document and just run the script code whenever the scripts are available. If the scripts are not tagged as async or defer then the browser <a href="https://html.spec.whatwg.org/">MUST</a> stop processing the document until the script has downloaded and executed before continuing. These are referred to as “blocking” scripts because they block the browser from continuing to process the document until they have been loaded and executed.</p><p>The HTML document is split into two parts. The <code>&lt;head&gt;</code> of the document is at the beginning and contains stylesheets, scripts and other instructions for the browser that are needed to display the content. 
The <code>&lt;body&gt;</code> of the document comes after the head and contains the actual page content that is displayed in the browser window (though scripts and stylesheets are allowed to be in the body as well). Until the browser gets to the body of the document there is nothing to display to the user and the page will remain blank, so getting through the head of the document as quickly as possible is important. “HTML5 rocks” has a <a href="https://www.html5rocks.com/en/tutorials/internals/howbrowserswork/">great tutorial</a> on how browsers work if you want to dive deeper into the details.</p><p>The browser is generally in charge of determining the order of loading the different resources it needs to build the page and to continue processing the document. In the case of HTTP/1.x, the browser is limited in how many things it can request from any one server at a time (generally 6 connections and only one resource at a time per connection), so the ordering is strictly controlled by the browser through the order in which things are requested. With HTTP/2 <a href="https://www.cloudflare.com/learning/performance/http2-vs-http1.1/">things change pretty significantly</a>. The browser can request all of the resources at once (at least as soon as it knows about them) and it provides <a href="https://hpbn.co/http2/#stream-prioritization">detailed instructions</a> to the server for how the resources should be delivered.</p>
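<p>The script-handling rules described above can be condensed into a tiny classifier. The attribute object below is an illustrative shape, not a DOM API:</p>

```javascript
// The parser-blocking rules above, as a small classifier.
// `attrs` is an illustrative { src, async, defer } shape.
function scriptBehavior(attrs) {
  if (!attrs.src) return 'inline';   // executes immediately, in place
  if (attrs.async) return 'async';   // runs whenever the download finishes
  if (attrs.defer) return 'defer';   // runs after document parsing completes
  return 'parser-blocking';          // halts parsing until downloaded and run
}

console.log(scriptBehavior({ src: '/app.js' }));                    // parser-blocking
console.log(scriptBehavior({ src: '/analytics.js', async: true })); // async
```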
    <div>
      <h3>Optimal Resource Ordering</h3>
      <a href="#optimal-resource-ordering">
        
      </a>
    </div>
    <p>For most parts of the page loading cycle there is an optimal ordering of the resources that will result in the fastest user experience (and the difference between optimal and not can be significant - as much as a 50% improvement or more).</p><p>As described above, early in the page load cycle before the browser can render any content it is blocked on the CSS and blocking JavaScript in the <code>&lt;head&gt;</code> section of the HTML. During that part of the loading cycle it is best for 100% of the connection bandwidth to be used to download the blocking resources and for them to be downloaded one at a time in the order they are defined in the HTML. That lets the browser parse and execute each item while it is downloading the next blocking resource, allowing the download and execution to be pipelined.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/s20F4kV70i7sqhrMifCLU/588f1954891895eedbc46df730d8afdc/pipeline-1.png" />
            
            </figure><p>The scripts take the same amount of time to download when downloaded in parallel or one after the other, but by downloading them sequentially, the first script can be processed and executed while the second script is downloading.</p><p>Once the render-blocking content has loaded things get a little more interesting and the optimal loading may depend on the specific site or even business priorities (user content vs ads vs analytics, etc). Fonts in particular can be difficult, as the browser only discovers what fonts it needs after the stylesheets have been applied to the content that is about to be displayed. By the time the browser knows about a font, it is needed to display text that is already ready to be drawn to the screen. Any delays in getting the font loaded end up as periods with blank text on the screen (or text displayed using the wrong font).</p><p>Generally there are some tradeoffs that need to be considered:</p><ul><li><p>Custom fonts and visible images in the visible part of the page (viewport) should be loaded as quickly as possible. They directly impact the user’s visual experience of the page loading.</p></li><li><p>Non-blocking JavaScript should be downloaded serially relative to other JavaScript resources so the execution of each can be pipelined with the downloads. The JavaScript may include user-facing application logic as well as analytics tracking and marketing beacons, and delaying them can cause a drop in the metrics that the business tracks.</p></li><li><p>Images benefit from downloading in parallel. 
The first few bytes of an image file contain the image dimensions which may be necessary for browser layout, and progressive images downloading in parallel can look visually complete with around 50% of the bytes transferred.</p></li></ul><p>Weighing the tradeoffs, one strategy that works well in most cases is:</p><ul><li><p>Custom fonts download sequentially and split the available bandwidth with visible images.</p></li><li><p>Visible images download in parallel, splitting the “images” share of the bandwidth among them.</p></li><li><p>When there are no more fonts or visible images pending:</p><ul><li><p>Non-blocking scripts download sequentially and split the available bandwidth with non-visible images</p></li><li><p>Non-visible images download in parallel, splitting the “images” share of the bandwidth among them.</p></li></ul></li></ul><p>That way the content visible to the user is loaded as quickly as possible, the application logic is delayed as little as possible and the non-visible images are loaded in such a way that layout can be completed as quickly as possible.</p>
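<p>The strategy above can be sketched as a phase table: lower phases are served first, and resources within a phase share bandwidth (sequentially or in parallel, per the rules described). The resource shapes and phase numbers here are illustrative, not a real scheduler:</p>

```javascript
// The scheduling strategy above as an ordering sketch. Lower phase
// numbers are served first; same-phase resources share bandwidth.
const PHASES = {
  'blocking-css': 0, 'blocking-js': 0, // head resources, one at a time
  'font': 1, 'visible-image': 1,       // viewport content next
  'async-js': 2, 'hidden-image': 2,    // everything else afterwards
};

function schedule(resources) {
  // Stable sort keeps document order within each phase
  return [...resources].sort((a, b) => PHASES[a.type] - PHASES[b.type]);
}

const order = schedule([
  { url: '/lazy.jpg', type: 'hidden-image' },
  { url: '/style.css', type: 'blocking-css' },
  { url: '/hero.jpg', type: 'visible-image' },
]);
console.log(order.map(r => r.url)); // [ '/style.css', '/hero.jpg', '/lazy.jpg' ]
```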
    <div>
      <h3>Example</h3>
      <a href="#example">
        
      </a>
    </div>
    <p>For illustrative purposes, we will use a simplified product category page from a typical <a href="https://www.cloudflare.com/ecommerce/">e-commerce site</a>. In this example the page has:</p><p><img src="https://images.ctfassets.net/zkvhlag99gkb/145sNNiiOcM6IHvlHCPrKN/2bf1eda5852c3cbef911230fa4c5b327/html-1.png?h=50" /> The HTML file for the page itself, represented by a blue box.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/4BsBnjaP0Lf79mAiFfAecf/1270b865cbc581082352c89f8c1ecef4/css-1.png?h=50" /> 1 external stylesheet (CSS file), represented by a green box.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/hpncpQ52hkS4CtiZuIs05/98953dd0e08b5c8d0bbb4d4cba5aa731/script-1.png" /> 4 external scripts (JavaScript), represented by orange boxes. 2 of the scripts are blocking at the beginning of the page and 2 are asynchronous. The blocking script boxes use a darker shade of orange.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/59Odb7qJYEVNtcMILE7V1l/a59365e9c52bae8607237513e4b6f583/font-1.png" /> 1 custom web font, represented by a red box.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/6Vu0MFw5Jmo9DKkBiNhTq4/dafbe4485dee9aeec87685728ec7b926/image-1.png" /> 13 images, represented by purple boxes. The page logo and 4 of the product images are visible in the viewport and 8 of the product images require scrolling to see. The 5 visible images use a darker shade of purple.</p><p>For simplicity, we will assume that all the resources are the same size and each takes 1 second to download on the visitor’s connection. Loading everything takes a total of 20 seconds, but HOW it is loaded can have a huge <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">impact</a> to the experience.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZDzhZm88h13LKAQAlmJ3W/0283bd74ba6c85042717a096a1aa00ba/optimal-1.png" />
            
            </figure><p>This is what the described optimal loading would look like in the browser as the resources load:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6VglYMpZ814iSoMmiDEoLz/a5e4849cf69b7adeb7be5bba7cb863cd/Optimal_loading-1.gif" />
            
            </figure><ul><li><p>The page is blank for the first 4 seconds while the HTML, CSS and blocking scripts load, all using 100% of the connection.</p></li><li><p>At the 4-second mark the background and structure of the page is displayed with no text or images.</p></li><li><p>One second later, at 5 seconds, the text for the page is displayed.</p></li><li><p>From 5-10 seconds the images load, starting out as blurry but sharpening very quickly. By around the 7-second mark it is almost indistinguishable from the final version.</p></li><li><p>At the 10 second mark all of the visual content in the viewport has completed loading.</p></li><li><p>Over the next 2 seconds the asynchronous JavaScript is loaded and executed, running any non-critical logic (analytics, marketing tags, etc).</p></li><li><p>For the final 8 seconds the rest of the product images load so they are ready for when the user scrolls.</p></li></ul>
    <div>
      <h3>Current Browser Prioritization</h3>
      <a href="#current-browser-prioritization">
        
      </a>
    </div>
    <p>All of the current browser engines implement <a href="https://calendar.perfplanet.com/2018/http2-prioritization/">different prioritization strategies</a>, none of which are optimal.</p><p><b>Microsoft Edge and Internet Explorer</b> <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#microsoft_edge_internet_explorer">do not support prioritization</a>, so everything falls back to the HTTP/2 default, which is to load everything in parallel, splitting the bandwidth evenly among all resources. Microsoft Edge is moving to the Chromium browser engine in future Windows releases, which will help improve the situation. In our example page this means that the browser is stuck in the document head for the majority of the loading time, since the images slow down the transfer of the blocking scripts and stylesheets.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ERv7hb1NYBHy60aJuy5jD/04cd8f5947317252fa17648d4d227216/edge-1.png" />
            
            </figure><p>Visually that results in a pretty painful experience of staring at a blank screen for 19 seconds before most of the content displays, followed by a 1-second delay for the text to display. Be patient when watching the animated progress because for the 19 seconds of blank screen it may feel like nothing is happening (even though it is):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3UcgquIFWREMgbBsEJNQk2/5caa49f6d98a28bc407bd1d6773670f8/Edge_loading-1.gif" />
            
            </figure><p><b>Safari</b> <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#safari">loads all resources in parallel</a>, splitting the bandwidth between them based on how important Safari believes they are (with render-blocking resources like scripts and stylesheets being more important than images). Images load in parallel but also load at the same time as the render-blocking content.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2L8xcpZEdYQJstyuI5ORRR/45220b046032cb8ffd1f2ce85cc1f0cd/safari-1.png" />
            
            </figure><p>While similar to Edge in that everything downloads at the same time, by allocating more bandwidth to the render-blocking resources Safari can display the content much sooner:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bN7ouLtrKcl2dWQ2fESJ5/f7fd9768b20b76deb0dcee3488fa0296/Safari_loading-1.gif" />
            
            </figure><ul><li><p>At around 8 seconds the stylesheet and scripts have finished loading so the page can start to be displayed. Since the images were loading in parallel, they can also be rendered in their partial state (blurry for progressive images). This is still twice as slow as the optimal case but much better than what we saw with Edge.</p></li><li><p>At around 11 seconds the font has loaded so the text can be displayed and more image data has been downloaded so the images will be a little sharper. This is comparable to the experience around the 7-second mark for the optimal loading case.</p></li><li><p>For the remaining 9 seconds of the load the images get sharper as more data for them downloads until it is finally complete at 20 seconds.</p></li></ul><p><b>Firefox</b> builds a dependency tree that <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#firefox">groups resources</a> and then schedules the groups to either load one after another or to share bandwidth between the groups. Within a given group the resources share bandwidth and download concurrently. The images are scheduled to load after the render-blocking stylesheets and to load in parallel but the render-blocking scripts and stylesheets also load in parallel and do not get the benefits of pipelining.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3mnypoOho1hKfUiSl834Cf/7aeef99c40884ea1c4807f7ba3cfdbc3/firefox-1.png" />
            
            </figure><p>In our example case this ends up being a slightly faster experience than with Safari since the images are delayed until after the stylesheets complete:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yjKwptC34WE2GigxNdo0R/7a7616f2c1876d62765649bd768b698a/Firefox_loading-1.gif" />
            
            </figure><ul><li><p>At the 6 second mark the initial page content is rendered with the background and blurry versions of the product images (compared to 8 seconds for Safari and 4 seconds for the optimal case).</p></li><li><p>At 8 seconds the font has loaded and the text can be displayed along with slightly sharper versions of the product images (compared to 11 seconds for Safari and 7 seconds in the Optimal case).</p></li><li><p>For the remaining 12 seconds of the loading the product images get sharper as the remaining content loads.</p></li></ul><p><b>Chrome</b> (and all Chromium-based browsers) prioritizes resources into a <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#chrome">list</a>. This works really well for the render-blocking content that benefits from loading in order but works less well for images. Each image loads to 100% completion before starting the next image.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VAHVWx6bngx7CK2px6qAe/08da953608a3d887b8ae59d68db633be/chrome-1.png" />
            
            </figure><p>In practice this is almost as good as the optimal loading case with the only difference being that the images load one at a time instead of in parallel:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4WxB31Ov3jzmpNmPRjH0A5/8e23e517276c6a13e7c7f0087f23bd65/Chrome_loading-1.gif" />
            
            </figure><ul><li><p>Up until the 5 second mark the Chrome experience is identical to the optimal case, displaying the background at 4 seconds and the text content at 5.</p></li><li><p>For the next 5 seconds the visible images load one at a time until they are all complete at the 10 second mark (compared to the optimal case where they are just slightly blurry at 7 seconds and sharpen up for the remaining 3 seconds).</p></li><li><p>After the visual part of the page is complete at 10 seconds (identical to the optimal case), the remaining 10 seconds are spent running the async scripts and loading the hidden images (just like with the optimal loading case).</p></li></ul>
    <div>
      <h3>Visual Comparison</h3>
      <a href="#visual-comparison">
        
      </a>
    </div>
    <p>Visually, the impact can be quite dramatic, even though they all take the same amount of time to technically load all of the content:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/67CSkIvoirmEfMIUAo4y07/ed87a8f8ce29a212c1cd09c2c69ec95a/compare.gif" />
            
            </figure>
    <div>
      <h3>Server-Side Prioritization</h3>
      <a href="#server-side-prioritization">
        
      </a>
    </div>
    <p>HTTP/2 prioritization is requested by the client (browser) and it is up to the server to decide what to do based on the request. A <a href="https://github.com/andydavies/http2-prioritization-issues">good number of servers don’t support doing anything at all with the prioritization</a>, but those that do all honor the client’s request. Another option is to decide on the best prioritization server-side, taking the client’s request into account.</p><p>Per the <a href="https://httpwg.org/specs/rfc9113.html#StreamPriority">specification</a>, HTTP/2 prioritization is a dependency tree that requires full knowledge of all of the in-flight requests to be able to prioritize resources against each other. That allows for incredibly complex strategies but is difficult to implement well on either the browser or server side (as evidenced by the different browser strategies and varying levels of server support). To make prioritization easier to manage, we have developed a simpler prioritization scheme that still has all of the flexibility needed for optimal scheduling.</p><p>The Cloudflare prioritization scheme consists of 64 priority “levels”, and within each priority level there are groups of resources that determine how the connection is shared between them:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/245bFqwfOBgQvwwRLWGNGE/6898590819dedda5cf30444e7014c059/priorities.png" />
            
            </figure><p>All of the resources at a higher priority level are transferred before moving on to the next lower priority level.</p><p>Within a given priority level, there are 3 different “concurrency” groups:</p><ul><li><p><b>0</b> : All of the resources in the concurrency “0” group are sent sequentially in the order they were requested, using 100% of the bandwidth. Only after all of the concurrency “0” group resources have been downloaded are other groups at the same level considered.</p></li><li><p><b>1</b> : All of the resources in the concurrency “1” group are sent sequentially in the order they were requested. The available bandwidth is split evenly between the concurrency “1” group and the concurrency “n” group.</p></li><li><p><b>n</b> : The resources in the concurrency “n” group are sent in parallel, splitting the bandwidth available to the group between them.</p></li></ul><p>Practically speaking, the concurrency “0” group is useful for critical content that needs to be processed sequentially (scripts, CSS, etc). The concurrency “1” group is useful for less-important content that can share bandwidth with other resources but where the resources themselves still benefit from processing sequentially (async scripts, non-progressive images, etc). The concurrency “n” group is useful for resources that benefit from processing in parallel (progressive images, video, audio, etc).</p>
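<p>To make the sharing rules concrete, here is a small illustrative helper (names and structure are ours, not part of any Cloudflare API) that computes each stream’s share of the bandwidth within a single priority level under the three concurrency groups:</p>

```javascript
// Illustrative only: bandwidth shares within one priority level.
// streams: array of { id, concurrency } where concurrency is '0', '1' or 'n'.
function bandwidthShares(streams) {
  const shares = Object.fromEntries(streams.map(s => [s.id, 0]));
  const g0 = streams.filter(s => s.concurrency === '0');
  const g1 = streams.filter(s => s.concurrency === '1');
  const gn = streams.filter(s => s.concurrency === 'n');

  if (g0.length > 0) {
    // Concurrency "0" is exclusive: the first-requested stream gets 100%
    // and everything else at this level waits.
    shares[g0[0].id] = 1;
    return shares;
  }
  // Bandwidth is split evenly between the "1" and "n" groups when both exist.
  const groupShare = g1.length > 0 && gn.length > 0 ? 0.5 : 1;
  if (g1.length > 0) shares[g1[0].id] = groupShare;            // sequential within "1"
  gn.forEach(s => { shares[s.id] = groupShare / gn.length; }); // parallel within "n"
  return shares;
}
```

<p>For example, one concurrency-“1” stream alongside two concurrency-“n” streams yields a 50% / 25% / 25% split, matching the description above.</p>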
    <div>
      <h3>Cloudflare Default Prioritization</h3>
      <a href="#cloudflare-default-prioritization">
        
      </a>
    </div>
    <p>When enabled, the enhanced prioritization implements the “optimal” scheduling of resources described above. The specific prioritizations applied look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36JhpUvPcpYATKuUcb0HQm/5588ef0b814df225b2f1d934f34c4638/defaults.png" />
            
            </figure><p>This prioritization scheme allows sending the render-blocking content serially, followed by the visible images in parallel, and then the rest of the page content with some level of sharing to balance application and content loading. The “* If Detectable” caveat is that not all browsers differentiate between the different types of stylesheets and scripts, but loading will still be significantly faster in all cases. A 50% improvement by default, particularly for Edge and Safari visitors, is not unusual:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3br9uRCejjByZGW4mVxQyp/d4755f71f3b5432373244dbe1ac7e89d/filmstrip2.png" />
            
            </figure>
    <div>
      <h3>Customizing Prioritization with Workers</h3>
      <a href="#customizing-prioritization-with-workers">
        
      </a>
    </div>
    <p>Faster-by-default is great, but where things get really interesting is that the ability to configure the prioritization is also exposed to Cloudflare Workers, so sites can override the default prioritization for resources or implement their own complete prioritization schemes.</p><p>If a Worker adds a “cf-priority” header to the response, Cloudflare edge servers will use the specified priority and concurrency for that response. The format of the header is <code>&lt;priority&gt;/&lt;concurrency&gt;</code>, so something like <code>response.headers.set('cf-priority', '30/0');</code> would set the priority to 30 with a concurrency of 0 for the given response. Similarly, “30/1” would set concurrency to 1 and “30/n” would set concurrency to n.</p><p>With this level of flexibility a site can tweak resource prioritization to meet its needs: boosting the priority of critical async scripts, for example, or increasing the priority of hero images before the browser has identified that they are in the viewport.</p><p>To help inform any prioritization decisions, the Workers runtime also exposes the browser-requested prioritization information in the request object passed in to the Worker’s fetch event listener (request.cf.requestPriority). The incoming requested priority is a semicolon-delimited list of attributes that looks something like this: “weight=192;exclusive=0;group=3;group-weight=127”.</p><ul><li><p><b>weight</b>: The browser-requested weight for the HTTP/2 prioritization.</p></li><li><p><b>exclusive</b>: The browser-requested HTTP/2 exclusive flag (1 for Chromium-based browsers, 0 for others).</p></li><li><p><b>group</b>: HTTP/2 stream ID for the request group (only non-zero for Firefox).</p></li><li><p><b>group-weight</b>: HTTP/2 weight for the request group (only non-zero for Firefox).</p></li></ul>
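<p>As a minimal sketch of the idea, the helper below maps a response’s content type to a <code>cf-priority</code> value. The header format is as described above; the specific priority numbers and function names are our own illustrative choices, not Cloudflare’s defaults:</p>

```javascript
// Sketch: pick a cf-priority value ("<priority>/<concurrency>") per response
// type. The numeric levels here are illustrative, not Cloudflare's defaults.
function cfPriorityFor(contentType) {
  if (contentType.startsWith('text/css')) return '30/0';               // render-blocking: sequential
  if (contentType.startsWith('application/javascript')) return '25/0'; // scripts: sequential
  if (contentType.startsWith('font/')) return '20/1';                  // fonts: sequential, shared bandwidth
  if (contentType.startsWith('image/')) return '10/n';                 // progressive images: parallel
  return '15/1';                                                       // everything else
}

// Inside a Worker it would be wired up roughly like this:
// addEventListener('fetch', event => {
//   event.respondWith(reprioritize(event.request));
// });
// async function reprioritize(request) {
//   const upstream = await fetch(request);
//   const response = new Response(upstream.body, upstream);
//   response.headers.set('cf-priority',
//     cfPriorityFor(upstream.headers.get('content-type') || ''));
//   return response;
// }
```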
    <div>
      <h3>This is Just the Beginning</h3>
      <a href="#this-is-just-the-beginning">
        
      </a>
    </div>
    <p>The ability to tune and control the prioritization of responses is the basic building block that a lot of future work will benefit from. We will be implementing our own <a href="https://www.cloudflare.com/application-services/products/website-optimization/">advanced optimizations</a> on top of it but by exposing it in Workers we have also opened it up to sites and researchers to experiment with different prioritization strategies. With the Apps Marketplace it is also possible for companies to build new optimization services on top of the Workers platform and make it available to other sites to use.</p><p>If you are on a <a href="https://www.cloudflare.com/plans/pro/">Pro plan</a> or above, head over to the speed tab in the Cloudflare dashboard and turn on “Enhanced HTTP/2 Prioritization” to accelerate your site.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36yaMZfdrrrZlYjivB5xnY/8730346008c9a991586a91b61754fd1c/speed-week-copy-2_2x.png" />
            
            </figure><p></p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">14X39RjlgOpBoyWcVxeYGf</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Welcome to Speed Week!]]></title>
            <link>https://blog.cloudflare.com/welcome-to-speed-week/</link>
            <pubDate>Mon, 13 May 2019 01:00:00 GMT</pubDate>
            <description><![CDATA[ Every year, we celebrate Cloudflare’s birthday in September when we announce products we’re releasing to help make the Internet better for everyone. We build new and innovative products throughout the year, and having to pick five announcements for just one week of the year is always challenging. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Q4ecZX1gE1xKeCEuCxBmS/a786362a8a055ed3a834bb11a20dd153/image1.png" />
            
            </figure><p>Every year, we celebrate Cloudflare’s birthday in September when we announce the products we’re releasing to help make the Internet better for everyone. We’re always building new and innovative products throughout the year, and having to pick five announcements for just one week of the year is always challenging. Last year we brought back <a href="/crypto-week-2018/">Crypto Week</a> where we shared new cryptography technologies we’re supporting and helping advance to help build a more secure Internet.</p><p>Today I’m thrilled to announce we are launching our first-ever Speed Week and we want to showcase some of the things that we’re obsessed with to make the Internet faster for everyone.</p>
    <div>
      <h3>How much faster is faster?</h3>
      <a href="#how-much-faster-is-faster">
        
      </a>
    </div>
    <p>When we built the software stack that runs our network, we knew that both security and speed are important to our customers, and they should never have to compromise one for the other. All of the products we’re announcing this week will help our customers have a better experience on the Internet with as much as a 50% improvement in page load times for websites, getting the  most out of HTTP/2’s features (while only lifting a finger to click the button that enables them), finding the optimal route across the Internet, and providing the best live streaming video experience. I am constantly amazed by the talented engineers that work on the products that we launch all year round. I wish we could have weeks like this all year round to celebrate the wins they’ve accumulated as they tackle the difficult performance challenges of the web. We’re never content to settle for the status quo, and as our network continues to grow, so does our ability to improve our flagship products like Argo, or how we support rich media sites that rely heavily on images and video. The sheer scale of our network provides rich data that we can use to make better decisions on how we support our customers’ web properties.</p><p>We also recognize that the Internet is evolving. New standards and protocols such as HTTP/2, QUIC, TLS 1.3 are great advances to improve web performance and security, but they can also be challenging for many developers to easily deploy. HTTP/2 was introduced in 2015 by the IETF, and was the first major revision of the HTTP protocol. While our customers have always been able to benefit from HTTP/2, we’re exploring how we can make that experience even faster.</p>
    <div>
      <h3>All things Speed</h3>
      <a href="#all-things-speed">
        
      </a>
    </div>
    <p>Want a sneak peek at what we’re announcing this week? I’m really excited to see this week’s announcements unfold. Each day we’ll post a new blog where we’ll share product announcements and customer stories that demonstrate how we’re making life better for our customers.</p><ul><li><p><b>Monday</b>: An inside view of how we’re making faster, smarter routing decisions</p></li><li><p><b>Tuesday</b>: HTTP/2 can be faster, we’ll show you how</p></li><li><p><b>Wednesday</b>: Simplify image management and speed up load times on any device</p></li><li><p><b>Thursday</b>: How we’re improving our network for faster video streaming</p></li><li><p><b>Friday</b>: How we’re helping make JavaScript faster</p></li></ul><p>For bonus points, sign up for a live stream webinar where Kornel Lesiński and I will be hosted by Dennis Publishing to discuss the many challenges of the modern web: “<a href="https://dennis-publishing-hvmg.brand.live/c/cloudflare-stronger-better-faster-solving-the-performance-challenges-of-the-modern-web">Stronger, Better, Faster: Solving the performance challenges of the modern web.</a>” The event will be held on Monday, May 13th at 11:00 am BST and you can either <a href="https://dennis-publishing-hvmg.brand.live/c/cloudflare-stronger-better-faster-solving-the-performance-challenges-of-the-modern-web">register</a> for the live event or sign up for one of the on-demand sessions later in the week.</p><p>I hope you’re as excited about our upcoming Speed Week as I am. Be sure to <a href="/subscribe/">subscribe to the blog</a> to get daily updates sent to your inbox, because who knows… there may even be “one last thing”</p>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">21TDqkLdsmjT0bAfvclkb1</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improving HTML Time to First Byte]]></title>
            <link>https://blog.cloudflare.com/improving-html-time-to-first-byte/</link>
            <pubDate>Mon, 24 Dec 2018 16:00:00 GMT</pubDate>
            <description><![CDATA[ Fixing slow Time To First Byte by edge-caching HTML with Cloudflare Workers and automatically purging the cache when content changes. ]]></description>
            <content:encoded><![CDATA[ <p>The Time to First Byte (TTFB) of a site is the time from when the user starts navigating until the HTML for the page they requested starts to arrive. A slow TTFB has been the bane of my existence for the more than ten years I have been running <a href="https://www.webpagetest.org/">WebPageTest</a>.</p><blockquote><p>Looking at a recent test data set (~100k pages):</p><p>20% TTFB &gt; 3s<br/>80% start render &gt; 5s (10% &gt; 10s)<br/>500 pages were &gt; 15MB</p><p>So much slow to fix</p><p>— Patrick Meenan (@patmeenan) <a href="https://twitter.com/patmeenan/status/763372155052494852?ref_src=twsrc%5Etfw">August 10, 2016</a></p></blockquote><p>There is a reason why TTFB appears as one of the few “grades” that WebPageTest scores a site on and, specifically, why it is the first grade in the list.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5iHQ0NllN5RpGlBfhbLVlf/8c2533c93b2b31493ffb242d9eb3d39c/grades.png" />
            
            </figure><p>If the first byte is slow, <b>EVERY</b> other metric will also be slow. Improving it is one of the few cases where you can predict what the impact will be on every other measurement. Every millisecond improvement in the TTFB translates directly into a millisecond of savings in every other measurement (i.e. first paint will be 500ms faster if TTFB improves by 500ms). That said, a fast TTFB doesn’t guarantee a fast experience, but a slow TTFB does guarantee a slow one. I’d estimate that roughly 50% of all requests for help with WebPageTest results come from site owners struggling with a slow TTFB.</p><p>Many things can roll up into the TTFB, including redirects, DNS, connection setup, SSL negotiation and the actual server response time. Most of them can be fixed relatively easily by using a service like Cloudflare, but the server response time for the HTML itself is often the biggest problem and the hardest one to solve.</p><p>The waterfall diagram below shows the server response time as a light blue bar on the first request and it can be painfully obvious when it is slow. Under optimal conditions, the server response time would be no longer than the orange socket connect bar right before it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1d4Qf5SYp9fq9xwMqDz0X7/c8c271c1bd0b9067a8b5dd38fe7272cb/waterfall.png" />
            
            </figure><p>Over three seconds for the server to respond.</p><p>Slow origin response times can be caused by an enormous assortment of issues, from the server configuration, system load, back-end databases and systems it talks to, to the application code itself. Getting to the root of the performance issues usually involves teams of <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a> engineers working with <a href="https://en.wikipedia.org/wiki/Application_performance_management">Application Performance Management</a> tools to track down the slowest parts of the application and improve them.</p><p>A huge portion of the site owners I have worked with don’t have the resources or expertise to handle that kind of investigation. More often than not, they paid someone a one-time development fee to build their site or built it themselves on WordPress and are hosting it on the lowest-cost hosting they could find. The hosting is generally designed to run as many sites as possible, not necessarily for the highest performance.</p>
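<p>If you want to check your own site, TTFB can be read from the browser’s Navigation Timing API, where <code>responseStart</code> marks the arrival of the first byte of the HTML. A small sketch (the helper name is ours):</p>

```javascript
// Compute TTFB from a PerformanceNavigationTiming entry: redirects, DNS,
// connection setup, TLS and server response time all roll up into responseStart.
function ttfbFrom(navEntry) {
  return navEntry.responseStart - navEntry.startTime; // milliseconds
}

// In the browser console:
// const [nav] = performance.getEntriesByType('navigation');
// console.log('TTFB:', ttfbFrom(nav), 'ms');
```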
    <div>
      <h3>Edge Caching of HTML</h3>
      <a href="#edge-caching-of-html">
        
      </a>
    </div>
    <p>The thing is, most HTML isn’t really dynamic. It needs to be able to change relatively quickly when the site is updated but for a huge portion of the web the content is static for months or years at a time.</p><p>There are special cases like when a user is logged-in (as the admin or otherwise) where the content needs to differ but the vast majority of visits are of anonymous users. If the HTML can be cached and served directly from the edge then the performance gains can be substantial (over 3 seconds faster on all metrics in this case).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/R2m58YwFuerJYN12M3hUB/37d378173e889283fea9dc67877a2258/waterfall-fast.png" />
            
            </figure><p>Much faster server response time.</p><p>There are dozens of plugins for WordPress for caching content at the origin but they require configuration (where to cache the pages) and the performance still depends heavily on the performance of the hosting itself. Pushing the content cache further out to the edge reduces the complexity, eliminates the additional time to get back to the origin and completely removes the hosting performance from the equation. It can also significantly reduce the load on the hosting systems by offloading all of the anonymous traffic.</p><p>Cloudflare supports <a href="https://support.cloudflare.com/hc/en-us/articles/236166048-Caching-Static-HTML-with-WordPress-WooCommerce">caching static HTML</a>, and business and enterprise customers can enable logged-in users to skip the cache by enabling “bypass cache on cookies”. It works in tandem with the <a href="https://wordpress.org/plugins/cloudflare/">Cloudflare plugin for WordPress</a> so the cache can be cleared whenever content is updated. There are also several other caching plugins that integrate with various CDNs but in all of these cases they need to be configured with API keys for the CDN and the implementations are specific to each <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a>.</p>
    <div>
      <h3>Zero-Config Edge Caching of HTML</h3>
      <a href="#zero-config-edge-caching-of-html">
        
      </a>
    </div>
    <p>For broad adoption, we need to make the caching of HTML happen automatically (or as close to automatically as possible). To that end, we need a way to communicate between an origin (like a WordPress site) and an edge cache (like Cloudflare’s edge nodes) for managing a remote cache that can be explicitly purged.</p><p>The Origin needs to be able to:</p><ul><li><p>Recognize when there is a supported edge cache in front of it.</p></li><li><p>Specify content that should be cached and for what visitors (i.e. visits without a login cookie).</p></li><li><p>Purge the cached content when it has changed (globally across all edges).</p></li></ul><p>Instead of requiring the origin to integrate with an API for purging on changes and requiring manual configuration for what to cache and when, we can accomplish everything using HTTP headers on the requests that travel back and forth between the edges and the origin:</p><p>1 - An HTTP header is added to requests going from the edge to the origin to advertise that there is an edge cache present and the capabilities it supports:</p>
            <pre><code>x-HTML-Edge-Cache: supports=cache|purgeall|bypass-cookies</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4tfQw4SOOYyLEH9A5vh2r4/3649c6ed8e6bbdb820c177578918df50/get.png" />
            
            </figure><p>2 - When the origin responds with a cacheable page, it adds an HTTP header to the response indicating that it should be cached, along with any rules for when the cached version shouldn’t be used (to allow bypassing the cache for logged-in users):</p>
            <pre><code>x-HTML-Edge-Cache: cache,bypass-cookies=wp-|wordpress</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1HwyqHZHGiPJ2j0gWUfJ53/5180629ff2eb00c82d818d5a2b586185/get-response.png" />
            
            </figure><p>In this case the HTML will be cached, but any request with a cookie whose name starts with “wordpress” or “wp-” will bypass the cache and go to the origin.</p><p>3 - When a request modifies the site content (updating a post, changing a theme, adding a comment), the origin adds an HTTP response header indicating that the cache needs to be purged:</p>
            <pre><code>x-HTML-Edge-Cache: purgeall</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NILAqCLxBNnIwqCTjsouk/f5c48495f50802834141c14439bcaa64/update.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50w7jP4OiHH0qUlhpDZ5cq/8fb6d2b42e2520cd56a9f3a12fc0b553/update-response.png" />
            
            </figure><p>The only tricky part to deal with is that the purge needs to clear the cache from ALL of the edges, not just the one edge that the request went through.</p><p>4 - When a new request comes in for HTML that is in the edge cache, the request cookies are checked against the rules for the cached response. If the cookies are not present then the cached version is served; otherwise, the request is passed through to the origin.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2bp0DE4i5Kk8CvfGirSeF/4dc1163e293679930e6e2459c4d1769e/cached.png" />
            
            </figure><p>With this simple header-based command and control interface we can eliminate the need for an origin to talk to an API and for any explicit configuration. It also makes the logic on the origin significantly easier to implement as there is no configuration (or UI) and no need to make any outbound requests to a vendor-specific API. The <a href="https://github.com/cloudflare/worker-examples/blob/master/examples/edge-cache-html/WordPress%20Plugin/cloudflare-page-cache/cloudflare-page-cache.php">example WordPress plugin</a> is less than 50 lines of code and the vast majority of that is hooking up callbacks for all of the events that change content.</p>
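<p>The edge side of this header protocol is straightforward to sketch. The snippet below (greatly simplified from the linked Worker script; function names are ours) parses the <code>x-HTML-Edge-Cache</code> commands and applies the cookie-bypass rule:</p>

```javascript
// Parse an x-HTML-Edge-Cache response header value, e.g.
// "cache,bypass-cookies=wp-|wordpress" or "purgeall".
function parseEdgeCacheHeader(value) {
  const result = { cache: false, purge: false, bypassCookiePrefixes: [] };
  for (const part of (value || '').split(',')) {
    const [cmd, arg] = part.trim().split('=');
    if (cmd === 'cache') result.cache = true;
    if (cmd === 'purgeall') result.purge = true;
    if (cmd === 'bypass-cookies' && arg) result.bypassCookiePrefixes = arg.split('|');
  }
  return result;
}

// Decide whether a request may be served from the edge cache: any cookie
// whose name starts with one of the bypass prefixes sends it to the origin.
function canServeFromCache(cookieHeader, bypassPrefixes) {
  const names = (cookieHeader || '').split(';').map(c => c.trim().split('=')[0]);
  return !names.some(n => bypassPrefixes.some(p => n.startsWith(p)));
}
```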
    <div>
      <h3>Start Caching today with WordPress and Workers</h3>
      <a href="#start-caching-today-with-wordpress-and-workers">
        
      </a>
    </div>
    <p>One of the things I love most about Workers is that it gives you a fully programmable edge to experiment with ideas and implement your own logic. I created a <a href="https://github.com/cloudflare/worker-examples/tree/master/examples/edge-cache-html">Worker script</a> that implements the header-based protocol and edge caching on the Cloudflare edges, and a <a href="https://github.com/cloudflare/worker-examples/tree/master/examples/edge-cache-html/WordPress%20Plugin">WordPress plugin</a> that implements the origin logic for WordPress.</p><p>The only tricky part with the Worker was finding a way to purge items from the cache globally. The Worker caches are local to each edge and don’t provide an interface for doing any global operations. One option is to use the Cloudflare API to purge the global cache, but that is a bit of a heavy hammer (it purges everything from the cache, including scripts and images) and it requires some configuration. If you know the specific URLs that will be changed by a content change, then doing a targeted purge through the API for just those URLs would probably be the best solution.</p><p>Using the new <a href="https://developers.cloudflare.com/workers/kv/">Workers KV store</a> we can purge the cache a different way. The Worker script uses a versioning scheme for the cache where every URL gets a version number appended to it (i.e. <a href="http://www.example.com/?cf_edge_cache_ver=32">http://www.example.com/?cf_edge_cache_ver=32</a>). The modified URL is only ever used locally by the worker as a key for the cached responses, and the current version number is stored in KV, which is a global store. When the cache is purged, the version number is incremented, which changes the URL for all of the resources. Old entries will age out of the cache normally since they will no longer be accessed. It requires a little configuration to set up KV for the worker, but hopefully at some point in the future that can also be automatic.</p>
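<p>The versioning scheme can be sketched as follows. This is a simplified illustration rather than the full Worker script: the KV key name is an assumption, and the in-memory <code>kv</code> object stands in for a real Workers KV namespace binding:</p>

```javascript
// Build the local cache key by appending the current version to the URL
// (i.e. http://www.example.com/?cf_edge_cache_ver=32).
function versionedCacheKey(url, version) {
  const u = new URL(url);
  u.searchParams.set('cf_edge_cache_ver', String(version));
  return u.toString();
}

// "Purge" globally by bumping the version stored in KV. Old entries are
// simply never requested again and age out of the local caches.
async function purgeAll(kv) {
  const current = parseInt(await kv.get('html_cache_version'), 10) || 0;
  await kv.put('html_cache_version', String(current + 1));
  return current + 1;
}

// In-memory stand-in for a KV namespace (same get/put shape).
const store = new Map();
const kv = {
  get: async key => store.get(key),
  put: async (key, value) => { store.set(key, value); },
};
```

<p>In the real Worker, <code>kv</code> would be the KV namespace bound to the script, and the current version would be read before checking the local cache for an HTML request.</p>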
    <div>
      <h3>What Next?</h3>
      <a href="#what-next">
        
      </a>
    </div>
    <p>I think there is huge value for the web in standardizing a way for edge caches and origins to communicate for caching of dynamic content. That would provide an incentive for content management systems to build support directly into the platforms and provide a standard interface that could be used across different providers (and even for local edge caches in load balancers or other reverse proxies). After doing some more testing with different types of sites I’m planning to bring the concept to the <a href="https://httpwg.org/">IETF HTTP Working Group</a> to see if we can come up with an official standard for the control headers (using different names). If you have opinions about how it should work or what features you’d need exposed (like purging specific URLs, varying content for mobile/desktop or region, expanding it to cover all content types, etc.), I’d love to hear about it.</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Cloudflare Workers KV]]></category>
            <category><![CDATA[WordPress]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">5laFxx6Dnn1fOzsYGSkAuv</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improving HTML time to first byte]]></title>
            <link>https://blog.cloudflare.com/improving-html-time-to-first-byte-es/</link>
            <pubDate>Mon, 24 Dec 2018 16:00:00 GMT</pubDate>
            <description><![CDATA[ A website's time to first byte (TTFB) is the time from when the user starts navigating until the HTML for the page they requested begins to arrive. ]]></description>
            <content:encoded><![CDATA[ <p>A website’s time to first byte (TTFB) is the time from when the user starts navigating until the HTML for the page they requested begins to arrive. A slow TTFB has been the bane of my existence for the more than ten years I have been running <a href="https://www.webpagetest.org/">WebPageTest</a>.</p><blockquote><p>Looking at a recent test data set (~100k pages):</p><p>20% TTFB &gt; 3s<br/>80% start render &gt; 5s (10% &gt; 10s)<br/>500 pages were &gt; 15MB</p><p>So much slow to fix</p><p>— Patrick Meenan (@patmeenan) <a href="https://twitter.com/patmeenan/status/763372155052494852?ref_src=twsrc%5Etfw">August 10, 2016</a></p></blockquote><p>There is a reason that TTFB shows up as one of the few "grades" that WebPageTest assigns to sites and, specifically, why it is the first grade on the list.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TCUV4gx3ZyNYnVjVsfYFV/32c8b2c2d7f37a783ffabd487e315654/image2.png" />
            
            </figure><p>If the first byte is slow, <b>ALL</b> of the other metrics will be slow as well. Improving it is one of the few things you can do that predictably improves every other measurement. Every millisecond of improvement in TTFB translates directly into a millisecond saved on each of the other measurements (i.e. first paint will be 500 ms faster if TTFB improves by 500 ms). That said, a fast TTFB does not guarantee a fast experience, but a slow TTFB does guarantee a slow experience. I estimate that roughly 50% of all requests for help with WebPageTest results come from site owners struggling with a slow TTFB.</p><p>A lot can go into TTFB, including redirects, DNS, connection setup, SSL negotiation, and the actual server response time. Most of those are relatively easy to fix with a service like Cloudflare, but the server response time for the HTML itself is often the biggest and hardest problem to solve.</p><p>The waterfall chart below shows the server response time as a light-blue bar on the first request, and it can be embarrassingly obvious when it is slow. Under optimal conditions the server response time would be no longer than the orange socket-connect bar right before it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7d0DNHbpLoV0EmRTHqGwa5/1e6847230080f733ddd23e2356858e45/waterfall.png" />
            
            </figure><p>Roughly three seconds for the server to respond.</p><p>A slow origin response can be caused by all sorts of issues, from server configuration and system load to back-end databases and the systems it talks to, all the way to the application code itself. Getting to the root of performance problems usually involves teams of <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a> engineers working with <a href="https://en.wikipedia.org/wiki/Application_performance_management">Application Performance Management</a> tools to track down the slowest parts of the application and improve them.</p><p>A good share of the site owners I have worked with have neither the resources nor the expertise to do that kind of investigation. In most cases they hired a developer once to build the site, or built it themselves on WordPress, and host it with the cheapest hosting they could find. Hosting is generally designed to run as many sites as possible, not necessarily to deliver the best performance.</p>
    <div>
      <h3><b>Edge caching HTML</b></h3>
      <a href="#almacenamiento-perimetral-de-html">
        
      </a>
    </div>
    <p>In reality, most HTML isn’t particularly dynamic. It needs to be able to change relatively quickly when the site is updated, but for much of the web the content is static for months or years at a time.</p><p>There are special cases, such as when a user is logged in (as an administrator or otherwise), where the content differs, but the vast majority of visits come from anonymous users. If the HTML can be cached and served directly from the edge, the performance gains can be considerable (around 3 seconds faster on every metric in this case).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/TH74oGL97R8JsgywCNHT2/79c4e69003b6ec049eefdb1c5ffda117/waterfall-fast.png" />
            
            </figure><p>A much faster server response time.</p><p>There are dozens of WordPress plugins for caching content at the origin, but they require configuration (where to store the pages) and performance still depends heavily on the performance of the hosting. Moving the cached content out to the edge reduces the complexity, removes the extra trip back to the origin, and takes hosting performance out of the equation entirely. It can also significantly reduce the load on the hosting systems by offloading all of the anonymous traffic.</p><p>Cloudflare supports <a href="https://support.cloudflare.com/hc/en-us/articles/236166048-Caching-Static-HTML-with-WordPress-WooCommerce">caching static HTML</a>, and Business and Enterprise customers can let logged-in users bypass the cache by enabling "bypass cache on cookie". This works in tandem with the <a href="https://wordpress.org/plugins/cloudflare/">Cloudflare plugin for WordPress</a> so the cache can be cleared whenever content is updated. There are also other caching plugins that integrate with various content delivery networks, but in all cases they need to be configured with API keys for the CDN and the integrations are specific to each one.</p>
    <div>
      <h3><b>Edge caching HTML with zero configuration</b></h3>
      <a href="#almacenamiento-perimetral-de-html-con-cero-configuracion">
        
      </a>
    </div>
    <p>For edge caching of HTML to be widely adopted, it needs to happen automatically (or as close to automatically as possible). To that end, we need a channel of communication between an origin (like a WordPress site) and an edge cache (like Cloudflare’s edge nodes) for managing a remote cache that can be explicitly purged.</p><p>The origin needs to be able to:</p><ul><li><p>Detect when there is a compatible edge cache in front of it.</p></li><li><p>Specify which content should be cached and for which visitors (i.e. visits without login cookies).</p></li><li><p>Purge the cached content when it changes (globally, across all edges).</p></li></ul><p>Instead of requiring the origin to integrate with an API for purging changes, and requiring manual configuration of what to cache and when, we can do it all with HTTP headers on the requests that flow back and forth between the edges and the origin:</p><p>1 - An HTTP header is added to requests going from the edge to the origin to advertise that an edge cache is present and the capabilities it supports:</p><p><code>x-HTML-Edge-Cache: supports=cache|purgeall|bypass-cookies</code></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2BQCzrj1cmT7GFXr2F3bb4/5b1abd9356ec40987740b1426b638588/get.png" />
            
            </figure><p>2 - When the origin responds with a cacheable page, it adds an HTTP header to the response to indicate that it should be cached, along with the rules for when the cached version must not be used (to allow bypassing the cache for logged-in users):</p><p><code>x-HTML-Edge-Cache: cache,bypass-cookies=wp-|wordpress</code></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7G6I4eZL3vVd7UcZbvRpf7/255a130164af75eb996aadf622ca3bdc/get-response.png" />
            
            </figure><p>In this case the HTML will be cached, but any request carrying a cookie whose name starts with "wordpress" or "wp-" will bypass the cache and go to the origin.</p><p>3 - When a request modifies the site content (updating a post, changing a theme, adding a comment) the origin adds an HTTP response header indicating that the cache needs to be purged:</p><p><code>x-HTML-Edge-Cache: purgeall</code></p>
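<p>The origin side of steps 2 and 3 boils down to choosing one response header value. The example plugin does this in PHP; it can be sketched in JavaScript with a hypothetical helper name:</p>

```javascript
// Pick the x-HTML-Edge-Cache value for an origin response.
// `edgeCacheHeader` is the request's x-HTML-Edge-Cache header (step 1);
// `contentChanged` is true when the request modified site content.
function edgeCacheResponseHeader(edgeCacheHeader, contentChanged) {
  // No compatible edge cache announced itself: send nothing.
  if (!edgeCacheHeader || !edgeCacheHeader.includes('supports=')) {
    return null;
  }
  // Step 3: a modifying request triggers a global purge.
  if (contentChanged) return 'purgeall';
  // Step 2: cacheable page, bypassed for logged-in WordPress users.
  return 'cache,bypass-cookies=wp-|wordpress';
}
```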
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2E2JOCJ9w0SWbMm4EdOx2v/0f4255a1f9be4463f2931cd4490a3054/update-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5auvvm8QWCpv35NEglJnLM/ad8b85a65cef2a124756e10028ee333a/update-response.png" />
            
            </figure><p>The only tricky part to deal with is that the purge needs to clear the cache from ALL of the edges, not just the one edge that the request went through.</p><p>4 - When a new request comes in for HTML that is in the edge cache, the request cookies are checked against the rules for the cached response. If the cookies are not present then the cached version is served; otherwise, the request is passed through to the origin.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ALpw3KENPnX4tZ9gadjcF/e9bb64e79f138db9f4eb063af9a2b201/cached.png" />
            
            </figure><p>With this simple header-based command and control interface we can eliminate the need for an origin to talk to an API and for any explicit configuration. It also makes the logic on the origin significantly easier to implement, as there is no configuration (or UI) and no need to make any outbound requests to a vendor-specific API. The <a href="https://github.com/cloudflare/worker-examples/blob/master/examples/edge-cache-html/WordPress%20Plugin/cloudflare-page-cache/cloudflare-page-cache.php">example WordPress plugin</a> is less than 50 lines of code, and the vast majority of that is hooking up callbacks for all of the events that change content.</p>
    <div>
      <h3><b>Start caching today with WordPress and Workers</b></h3>
      <a href="#empezar-a-almacenar-hoy-con-wordpress-y-workers">
        
      </a>
    </div>
    <p>One of the things I love most about Workers is that it gives you a fully programmable edge to experiment with ideas and implement your own logic. I created a <a href="https://github.com/cloudflare/worker-examples/tree/master/examples/edge-cache-html">Worker script</a> that implements the header-based protocol and edge caching on the Cloudflare edges, and a <a href="https://github.com/cloudflare/worker-examples/tree/master/examples/edge-cache-html/WordPress%20Plugin">WordPress plugin</a> that implements the origin logic for WordPress.</p><p>The only tricky part with the Worker was finding a way to purge items from the cache globally. The Worker caches are local to each edge and don’t provide an interface for doing any global operations. One option is to use the Cloudflare API to purge the global cache, but that is a bit of a heavy hammer (it purges everything from the cache, including scripts and images) and it requires some configuration. If you know the specific URLs that will be changed by a content change, doing a targeted purge through the API for just those URLs would probably be the best solution.</p><p>Using the new <a href="https://developers.cloudflare.com/workers/kv/">Workers KV store</a> we can purge the cache a different way. The Worker script uses a versioning scheme for the cache where every URL gets a version number appended to it (i.e. <a href="http://www.example.com/?cf_edge_cache_ver=32">http://www.example.com/?cf_edge_cache_ver=32</a>). The modified URL is only ever used locally by the Worker as a key for the cached responses, and the current version number is stored in KV, which is a global store. When the cache is purged, the version number is incremented, which changes the URL for all of the resources. Old entries will age out of the cache normally since they will no longer be accessed. It takes a little configuration to set up KV for the Worker, but hopefully at some point in the future that can also be automatic.</p>
    <div>
      <h3><b>What next?</b></h3>
      <a href="#y-luego">
        
      </a>
    </div>
    <p>I think there is huge value for the web in standardizing a way for edge caches and origins to communicate for caching dynamic content. That would provide an incentive for content management systems to build support directly into their platforms and provide a standard interface that could be used across different providers (and even for local edge caches in load balancers or other reverse proxies). After doing some more testing with different types of sites, I’m planning to bring the concept to the <a href="https://httpwg.org/">IETF HTTP Working Group</a> to see if we can come up with an official standard for the control headers (using different names). If you have opinions about how it should work or what features you’d need exposed (like purging specific URLs, varying content for mobile/desktop or region, expanding it to cover all content types, etc.), I’d love to hear about it.</p> ]]></content:encoded>
            <category><![CDATA[Programming (PT)]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Cloudflare Workers KV (ES)]]></category>
            <category><![CDATA[Cloudflare Workers (PT)]]></category>
            <category><![CDATA[Serverless (PT)]]></category>
            <category><![CDATA[Passwords (PT)]]></category>
            <guid isPermaLink="false">2GAPEAryWBTQQls2AKFweB</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Fast Google Fonts with Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/fast-google-fonts-with-cloudflare-workers/</link>
            <pubDate>Thu, 22 Nov 2018 07:58:00 GMT</pubDate>
            <description><![CDATA[ Code walkthrough on using Cloudflare workers to improve the performance of sites using Google Fonts. The improvements can be quite dramatic and the sample provides a real-world use case of doing streaming HTML modification in a Cloudflare Worker. ]]></description>
            <content:encoded><![CDATA[ <p>Google Fonts is one of the most common third-party resources on the web, but carries with it significant user-facing performance issues. Cloudflare Workers running at the edge is a great solution for fixing these performance issues, without having to modify the publishing system for every site using Google Fonts.</p><p>This post walks through the implementation details for how to fix the performance of Google Fonts with Cloudflare Workers. More importantly, it also provides code for doing high-performance content modification at the edge using Cloudflare Workers.</p>
    <div>
      <h3>Google fonts are SLOW</h3>
      <a href="#google-fonts-are-slow">
        
      </a>
    </div>
    <p>First, some background. <a href="https://fonts.google.com/">Google Fonts</a> provides a rich selection of royalty-free fonts for sites to use. You select the fonts you want to use, and end up with a simple stylesheet URL to include on your pages, as well as styles to use for applying the fonts to parts of the page:</p>
            <pre><code>&lt;link href="https://fonts.googleapis.com/css?family=Open+Sans|Roboto+Slab"
      rel="stylesheet"&gt;
&lt;style&gt;
body {
 font-family: 'Open Sans', sans-serif;
}
h1 {
 font-family: 'Roboto Slab', serif;
}
&lt;/style&gt;</code></pre>
            <p>Your visitor’s browser fetches the CSS file as soon as the HTML for the page is available. The browser will request the underlying font files when the browser does layout for the page and discovers that it needs fonts for different sections of text.</p><p>The way Google fonts are served, the CSS is on one domain (fonts.googleapis.com) and the font files are on a different domain (fonts.gstatic.com). Since they are on separate domains, the fetch for each resource takes a minimum of four round trips back and forth to the server: One each for the DNS lookup, establishing the socket connection, negotiating TLS encryption (for https) and a final round trip for the request itself.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2emQRH8SC0mYJpSOJ58FUK/34a15d2f34fe2782706f04d3ecc78d98/before-round-trips.png" />
            
            </figure><p>4 round trips each for the font css and font file.</p><p>The requests can’t be done in parallel because the fonts aren’t known about until after the CSS has been downloaded and the styles applied to the page. In a best-case scenario, this translates to eight round-trips before the text can be displayed (the text which was already available in the browser as soon as the HTML was available). On a slower 3G connection with a 300ms round-trip time, this adds up to a 2.4 second delay. Best case!</p><p>There is also a problem with resource prioritization. When you have all of the requests coming from the same domain on the same HTTP/2 connection they can be scheduled against each other. Critical resources (like CSS and fonts) can be pushed ahead in the queue and delivered before lower priority resources (like images). Since Google Fonts (and most third-party resources) are served from a different domain than the main page resources, they cannot be prioritized and end up competing with each other for download bandwidth. This can cause the actual fetch times to be much longer than the best-case of 8 round trips.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/hYNUX0ngMLAKMUqN8jQhj/de35321957316e751a8411df7068f851/before-competing.png" />
            
            </figure><p>CSS and font download competing with low-priority images for bandwidth.</p><p>Users will see the images and skeleton for the page <a href="https://www.webpagetest.org/video/view.php?id=181016_befafeb79a563c08a7c4012ec10379ccb1c2f755">long before</a> the actual text is displayed:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7tAfGVsUZxzYE99QUHQTlr/44bfe9947e72383443afb6f29b88e343/b1-image-paint.png" />
            
            </figure><p>The page starts painting at 3.3 seconds with partial images and no text.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1c2TrfVScCtSgq6uAquxOC/242f0831bdc00aebaca98af15e1f285e/b2-text.png" />
            
            </figure><p>The text finally displays at 6.2 seconds into the page load.</p>
    <div>
      <h3>Fixing Google Fonts performance</h3>
      <a href="#fixing-google-fonts-performance">
        
      </a>
    </div>
    <p>The paths to fixing the performance issues and making fonts lightning-fast are different for the CSS and the font files themselves. We can reduce the total number of round trips to one:</p><ol><li><p>Embed the CSS directly in the HTML.</p></li><li><p>Proxy the Google Font files through the page origin.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ImYCsOuIuMBm24oD75ZfA/89877f802875de7e320691f7464125eb/Fonts.png" />
            
            </figure><p>The embedded CSS loads immediately as part of the HTML and the fonts load before the images.</p>
    <div>
      <h4>Optimizing CSS delivery</h4>
      <a href="#optimizing-css-delivery">
        
      </a>
    </div>
    <p>For CSS, the easy answer is to just download the CSS file that Google hosts and either serve it yourself directly or place it into the HTML as an embedded stylesheet. The problem with this is that Google fonts serves a browser-specific CSS file that is different for each browser so it can serve newer formats and use newer features, when supported, while still providing custom font support for older browsers.</p><p>With Cloudflare Workers, we can dynamically replace the external CSS reference with the browser-specific stylesheet content at fetch time when the HTML is requested by the browser. This way, the embedded CSS will always be up to date and support the capabilities of whichever browser is making the request. This completely eliminates any round trips for fetching the CSS for the fonts.</p>
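<p>Because Google returns different CSS to different browsers, the Worker has to fetch the stylesheet on behalf of the visitor’s browser and then swap the <code>&lt;link&gt;</code> tag for an inline <code>&lt;style&gt;</code> block. A simplified, non-streaming sketch of that replacement (the helper names are hypothetical; the full version on GitHub processes the HTML in chunks):</p>

```javascript
// Match a stylesheet <link> that points at fonts.googleapis.com and
// capture its href.
const FONT_LINK_RE =
  /<link[^>]+href\s*=\s*['"](https:\/\/fonts\.googleapis\.com\/css[^'"]+)['"][^>]*>/i;

// Replace the external stylesheet reference with the CSS itself.
// `fetchCss(url, userAgent)` is expected to fetch the URL using the
// browser's own User-Agent so Google returns the matching stylesheet.
async function embedFontCss(html, userAgent, fetchCss) {
  const match = html.match(FONT_LINK_RE);
  if (!match) return html;   // no Google Fonts link: nothing to do
  const css = await fetchCss(match[1], userAgent);
  if (!css) return html;     // fetch failed: fail safe, leave HTML as-is
  return html.replace(match[0], '<style>' + css + '</style>');
}
```

<p>Note the fail-safe behavior: if anything goes wrong, the original HTML goes through unmodified and the page simply misses the optimization.</p>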
    <div>
      <h4>Optimizing font file delivery</h4>
      <a href="#optimizing-font-file-delivery">
        
      </a>
    </div>
    <p>For the font files themselves, we can eliminate all of the round trips except for the fetch itself by serving the font files directly from the same domain as the HTML. This brings with it the added benefit of serving fonts over the same HTTP/2 connection as the rest of the page resources, allowing them to be prioritized correctly and not compete for bandwidth.</p><p>Specifically, when we embed the CSS into the HTML response, we rewrite all of the font URLs to use the same domain as the HTML. When those rewritten requests arrive at the worker the URL is transformed back to the original URL served by Google and fetched by the worker (as a proxy). The worker fetches are routed through Cloudflare’s caching infrastructure so they will be cached automatically at the edge. The actual URL rewriting is pretty simple but effective. We take font URLs that look like this:</p>
            <pre><code>https://fonts.gstatic.com/s/...</code></pre>
            <p>and we just prepend the page domain to the front of the URL:</p>
            <pre><code>https://www.example.com/fonts.gstatic.com/s/...</code></pre>
            <p>That way, when they arrive at the edge, the worker can look at the path of a request and know right away that it is a proxy request for a Google font. At this point, rewriting the original URL is trivial. On the off chance that a page on your site actually has a path that starts with /fonts.gstatic.com/, those resources would break, and something else should be appended to the URL to make sure they are unique.</p>
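<p>Both directions of the rewrite are plain string operations, sketched here with the example domain from above:</p>

```javascript
// In the embedded CSS: prepend the page's origin so font requests come
// back through the same domain (and the same HTTP/2 connection).
function rewriteFontUrl(fontUrl, pageOrigin) {
  return fontUrl.replace('https://fonts.gstatic.com/',
                         pageOrigin + '/fonts.gstatic.com/');
}

// At the edge: recognize a proxied font request by its path prefix and
// recover the original Google-hosted URL.
function unwrapFontPath(pathname) {
  if (!pathname.startsWith('/fonts.gstatic.com/')) return null;
  // pathname starts with "/", so this yields "https://fonts.gstatic.com/..."
  return 'https:/' + pathname;
}
```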
    <div>
      <h4>Optimization Results</h4>
      <a href="#optimization-results">
        
      </a>
    </div>
    <p>In practice, the results can be quite dramatic. On <a href="https://www.perftests.com/googlefonts/">this test page</a> for example, the wait time for fonts to become available dropped from 5.5 seconds to 1 second (an 81% improvement!):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6c3mWxyg3E5LclNnSBxdCy/6c0aea538a3c2b7d7c5ecb383ec5bfe0/after.png" />
            
            </figure><p>The fonts load immediately after the HTML.</p><p>Visually, the improvement to the user experience is also <a href="https://www.webpagetest.org/video/view.php?id=181016_040e1e2365d95c4c7836281c9cf713ccc8aed3f5">quite dramatic</a>. Instead of seeing a skeleton page of images followed by the text eventually appearing, the text is available (and correctly styled) immediately and the user can start consuming the content right away:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5kWjTwTLDvAES4n8TJg2IH/e94c6635c0ba15d43b61a1a17f249efc/1-initial-text.png" />
            
            </figure><p>The first paint happens much sooner at 2.5 seconds with all of the text displayed while the original page is still blank.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/56RnWXmU9RV9zXzpCFn97p/abaf9c451e0adcec1dfdb6d2cafcefce/2-image-paint.png" />
            
            </figure><p>At 3.3 seconds the original page finally starts to paint, displaying part of the images and no text.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7A9zYZNkzElNgOPRuxYkXe/8751b7c01ef3861c72f12aefe8df056e/2.5-text.png" />
            
            </figure><p>At 4.4 seconds the optimized page is visibly complete while the original page still has not displayed any text.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qfU3rpfQedLhVuUjE4QTg/68190f6af1b60232997dced05a70d955/3-text-complete.png" />
            
            </figure><p>At 6.2 seconds the original page finally displays the text content.</p><p>One thing that I didn’t notice in the initial testing and only discovered when looking at the side-by-side video is that the fonts are only correctly styled in the optimized case. In the original case it took longer than Chrome’s 3-second timeout for the fonts to load and it fell back to the system fonts. Not only was the experience much faster; it was also styled correctly with the custom fonts right from the beginning.</p>
    <div>
      <h3>Optimizing Google fonts with a Cloudflare worker</h3>
      <a href="#optimizing-google-fonts-with-a-cloudflare-worker">
        
      </a>
    </div>
    <p>The full Cloudflare worker code for implementing the font optimization is available on <a href="https://github.com/cloudflare/worker-examples/tree/master/examples/fast-google-fonts">GitHub here</a>. Buckle in, because this is quite a bit more involved than most of the samples in the documentation.</p><p>At a high level this code:</p><ul><li><p>Filters all requests and determines if a request is for a proxied font or HTML (and passes all other requests through unmodified).</p></li><li><p>Rewrites the URL and passes the fetch request through for font requests.</p></li><li><p>For HTML requests:</p><ul><li><p>Passes the request through unmodified to the origin server.</p></li><li><p>Returns non-UTF-8 content unmodified.</p></li><li><p>Processes the HTML response in streaming chunks as it is available.</p></li><li><p>Replaces Google font stylesheet link tags with style tags containing the CSS, with the font URLs rewritten to proxy through the origin.</p></li></ul></li></ul><p>The code here is slightly simplified to make the flow clearer to understand. The full code on GitHub adds support for an in-memory worker cache for the CSS (in addition to the persistent cache API) and provides query parameters for toggling the HTML rewriting on and off (for testing).</p><p>The content modification is all done by operating on the HTML as strings (with a combination of regular expressions and string matches). This is much faster and lighter weight than parsing the HTML into a virtual DOM, operating on it and converting back to HTML. It also allows for incremental processing of the HTML as a stream.</p>
    <div>
      <h4>Entry Point</h4>
      <a href="#entry-point">
        
      </a>
    </div>
    <p>The addEventListener(“fetch”) call is the main entry point for any worker and houses the JavaScript for intercepting inbound requests. If the handler does nothing, then the requests will be passed through normally and the worker will be out of the path in processing the response. Our goal is to minimize the amount of work the worker has to do to determine whether a request is one it is interested in.</p><p>In the case of the proxied font requests, we can just look at the request URL and see that the path starts with /fonts.gstatic.com/. To identify requests for HTML content we can look at the “accept” header on the request. Every major browser I have tested includes text/html on the list of content types that it will accept when requesting a document. On the off chance that there is a browser that doesn’t include it in its accept header, the HTML will just be passed through and returned unmodified. The goal with everything here is to fail safe and just return unoptimized content for any edge cases that aren’t covered. This way nothing breaks; it just doesn’t get the added performance boost.</p>
            <pre><code>addEventListener("fetch", event =&gt; {
 
 const url = new URL(event.request.url);
 const accept = event.request.headers.get('accept');
 if (event.request.method === 'GET' &amp;&amp;
     url.pathname.startsWith('/fonts.gstatic.com/')) {
 
   // Proxy the font file requests
   event.respondWith(proxyRequest('https:/' + url.pathname,
                                  event.request));
 
 } else if (accept &amp;&amp; accept.indexOf("text/html") !== -1) {
 
   // Process the HTML
   event.respondWith(processHtmlRequest(event.request, event));
 
 }
})</code></pre>
            
    <div>
      <h4>Request Proxy</h4>
      <a href="#request-proxy">
        
      </a>
    </div>
    <p>The proxying of the font requests is pretty straightforward. Since we are crossing origins it is generally a bad idea to just reuse the existing request object with a new URL. That can leak user data like cookies to a third party. Instead, we make a new request, clone a subset of the headers and pass the new fetch request back for the Worker runtime to handle.</p><p>The fetch path between workers and the outside Internet goes through the Cloudflare cache, so the actual font files will only be fetched from Google if they aren’t already in the cache. Even in that case, the connection from Cloudflare’s edge to Google’s font servers is much faster (and more reliable) than the end user’s connection from the browser. Even on a cache miss, the added delay is insignificant.</p>
            <pre><code>async function proxyRequest(url, request) {
 
 // Only pass through a subset of request headers
 let init = {
   method: request.method,
   headers: {}
 };
 const proxyHeaders = ["Accept",
                       "Accept-Encoding",
                       "Accept-Language",
                       "Referer",
                       "User-Agent"];
 for (let name of proxyHeaders) {
   let value = request.headers.get(name);
   if (value) {
     init.headers[name] = value;
   }
 }
 const clientAddr = request.headers.get('cf-connecting-ip');
 if (clientAddr) {
   init.headers['X-Forwarded-For'] = clientAddr;
 }
 
 // Only include a strict subset of response headers
 const response = await fetch(url, init);
 if (response) {
   const responseHeaders = ["Content-Type",
                            "Cache-Control",
                            "Expires",
                            "Accept-Ranges",
                            "Date",
                            "Last-Modified",
                            "ETag"];
   let responseInit = {status: response.status,
                       statusText: response.statusText,
                       headers: {}};
   for (let name of responseHeaders) {
     let value = response.headers.get(name);
     if (value) {
       responseInit.headers[name] = value;
     }
   }
   const newResponse = new Response(response.body, responseInit);
   return newResponse;
 }
 
 return response;
}</code></pre>
            <p>In addition to filtering the request headers we also filter the response headers sent back to the browser. If you’re not careful you could end up in a situation where a third-party is setting cookies on your origin or even turning on something like <a href="https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security">HTTP Strict Transport Security</a> for your site.</p>
    <div>
      <h4>Streaming HTML Processing</h4>
      <a href="#streaming-html-processing">
        
      </a>
    </div>
    <p>The HTML path is more complicated because we are going to intercept and modify the content itself.</p><p>In processing the HTML request, the first thing we want to do is make sure it is actually an HTML response. If it is something else, then we should get out of the way and let the response stream back to the browser as it normally would. It is very possible that a PDF document, file download, or even a directly opened image has an Accept header of text/html. It is critical to check the actual content of the response to make sure it is something we want to inspect and possibly modify.</p><p>The easiest way to modify a response is to just wait for the response to be fully complete, process it as a single block of HTML, and then pass the modified HTML back to the browser:</p>
            <pre><code>async function processHtmlRequest(request) {
 
 // Fetch from origin server.
  const response = await fetch(request);
  const contentType = response.headers.get("content-type");
  if (contentType &amp;&amp; contentType.indexOf("text/html") !== -1) {
  
   let content = await response.text();
   content = await modifyHtmlChunk(content, request);
 
   // Create a cloned response with our modified HTML
   return new Response(content, response);
 }
 return response;
}</code></pre>
            <p>This works reasonably well if you are sure that all of the HTML is UTF-8 (or ASCII), and the server returns all of the HTML at once, but there are some pretty serious concerns with doing it this way:</p><ul><li><p>The memory use can be unbounded and is limited only by the size of the largest HTML response (possibly causing the worker to be terminated for using too much memory).</p></li><li><p>Significant delay will be added to any pages where the server <a href="https://www.stevesouders.com/blog/2009/05/18/flushing-the-document-early/">flushes the initial content early</a> and then does some expensive/slow work before returning the rest of the HTML.</p></li><li><p>This only works if the text content uses an encoding that JavaScript can decode directly as UTF-8. Any other character encodings will fail to decode.</p></li></ul><p>For our worker we are going to process the HTML stream incrementally as it arrives from the server, pass it through to the browser as soon as possible, and pass through any content that isn’t UTF-8 unmodified.</p>
    <div>
      <h5>Processing HTML as a stream</h5>
      <a href="#processing-html-as-a-stream">
        
      </a>
    </div>
    <p>First we are going to look at what it takes to process the HTML stream incrementally. We will leave the character encoding changes out for now to keep things (relatively) simple.</p><p>To process the stream incrementally, we need to generate a new fetch response to pass back from the worker that uses a TransformStream for its content. That will allow us to pass the response itself back immediately and write to the stream as soon as we have content to add. We pass all of the other headers through unmodified.</p>
            <pre><code>async function processHtmlRequest(request) {
 
 // Fetch from origin server.
  const response = await fetch(request);
  const contentType = response.headers.get("content-type");
  if (contentType &amp;&amp; contentType.indexOf("text/html") !== -1) {
  
   // Create an identity TransformStream (a.k.a. a pipe).
   // The readable side will become our new response body.
   const { readable, writable } = new TransformStream();
 
   // Create a cloned response with our modified stream
   const newResponse = new Response(readable, response);
 
   // Start the async processing of the response stream (NO await!)
   modifyHtmlStream(response.body, writable, request);
 
   // Return the in-process response so it can be streamed.
   return newResponse;
 }
 return response;
}</code></pre>
            <p>The key thing here is to not wait for the async modifyHtmlStream async function to complete before passing the new response back from the worker. This way the initial headers can be sent immediately and the response will continue to stream anything written into the TransformStream until it is closed.</p><p>Processing the HTML stream in chunks as it arrives is a little tricky. The HTML stream will arrive in chunks of arbitrary sizes as strings. We need to add some protection to make sure that a chunk boundary doesn’t split a link tag. If it does, and we don’t account for it, we can miss a stylesheet link (or worse, process a partial link URL with the wrong style type). To make sure we don’t split link tags, we search from the end of the string for “&lt;link “ and from the start of the last link tag we search forward for a closing “&gt;” tag. If we don’t find one, then there is a partial link tag and we split the string just before the link tag starts. We process everything up to the split link tag and keep the partial tag to prepend it to the next chunk of data that arrives.</p><p>An alternative would be to keep accumulating data and only process it when there is no split link tag at the end, but this way we can return more data to the browser sooner.</p><p>When the incoming stream is complete, we process any partial data left over from the previous chunk and close the output stream (ending the response to the browser).</p>
            <pre><code>async function modifyHtmlStream(readable, writable, request) {
 const reader = readable.getReader();
 const writer = writable.getWriter();
 const encoder = new TextEncoder();
 let decoder = new TextDecoder();
 
 let partial = '';
 let content = '';
 
 for(;;) {
   // Read the next chunk of HTML from the fetch request.
   const { done, value } = await reader.read()
 
   if (done) {
 
     // Send any remaining fragment and complete the request.
     if (partial.length) {
       partial = await modifyHtmlChunk(partial, request);
       await writer.write(encoder.encode(partial));
       partial = '';
     }
     break;
 
   }
  
   try {
     let chunk = decoder.decode(value, {stream:true});
 
     // Add the inbound chunk to the the remaining partial chunk
     // from the previous read.
     content = partial + chunk;
     partial = '';
 
     // See if there is an unclosed link tag at the end (and if so,
     // carve it out to complete when the remainder comes in).
     const linkPos = content.lastIndexOf('&lt;link');
     if (linkPos &gt;= 0) {
        const linkClose = content.indexOf('&gt;', linkPos);
       if (linkClose === -1) {
         partial = content.slice(linkPos);
         content = content.slice(0, linkPos);
       }
     }
 
     if (content.length) {
       // Do the actual HTML modifications on the current chunk.
       content = await modifyHtmlChunk(content, request);
     }
   } catch (e) {
     // Ignore the exception
   }
 
   // Send the processed HTML chunk to the requesting browser.
   if (content.length) {
     await writer.write(encoder.encode(content));
     content = '';
   }
 }
 
 await writer.close()
}</code></pre>
            <p>One thing I was initially worried about was having to modify the “content-length” response header from the original response since we are modifying the content. Luckily, the Worker runtime takes care of that automatically and it isn’t something you have to implement.</p><p>There is a try/catch handler around the processing in case something goes horribly wrong with the decode.</p><p>The actual HTML rewriting is handled in “modifyHtmlChunk”; the code above is just the logic for processing the incoming data as incremental chunks.</p>
    <div>
      <h5>Dealing with character encodings other than UTF-8</h5>
      <a href="#dealing-with-character-encodings-other-than-utf-8">
        
      </a>
    </div>
    <p>We intentionally skipped over handling character encodings other than UTF-8 up until now. To handle arbitrary pages you will need to be able to process other character encodings. The Worker runtime only supports decoding UTF-8, but we need to make sure that we don’t break any content that isn’t UTF-8 (or similar). To do this, we detect the current encoding if it is specified, and anything that isn’t UTF-8 is passed through unmodified. In the case that the content type cannot be detected, we also watch for decode errors and pass content through unmodified when they occur.</p><p>The HTML charset can be specified in the content-type response header or as a <code>&lt;meta charset&gt;</code> tag in the HTML itself.</p><p>For the response headers it is pretty simple. When we get the original response, see if there is a charset in the content-type header. If there is, extract the current value and if it isn’t a supported charset just pass the response through unmodified.</p>
            <pre><code>   // Workers can only decode utf-8. If it is anything else, pass the
   // response through unmodified
   const VALID_CHARSETS = ['utf-8', 'utf8', 'iso-8859-1', 'us-ascii'];
   const charsetRegex = /charset\s*=\s*([^\s;]+)/mgi;
   const match = charsetRegex.exec(contentType);
   if (match !== null) {
     let charset = match[1].toLowerCase();
     if (!VALID_CHARSETS.includes(charset)) {
       return response;
     }
   }
  
   // Create an identity TransformStream (a.k.a. a pipe).
   // The readable side will become our new response body.
   const { readable, writable } = new TransformStream();
 
   // Create a cloned response with our modified stream
   const newResponse = new Response(readable, response);
 
   // Start the async processing of the response stream
   modifyHtmlStream(response.body, writable, request, event);</code></pre>
            <p>For the cases where there is a <code>&lt;meta charset&gt;</code> tag in the HTML (and possibly no header) things get a bit more complicated. If at any point an unsupported charset is detected then we pipe the incoming byte stream directly into the output stream unmodified. We first decode the first chunk of the HTML response using the default decoder. Then, if a <code>&lt;meta charset&gt;</code> tag is found in the HTML we extract the charset. If it isn’t a supported charset then we enter passthrough mode. If at any point the input stream can’t be decoded (likely because of an invalid charset) we also enter passthrough mode and pipe the remaining content through unprocessed.</p>
            <pre><code>async function modifyHtmlStream(readable, writable, request, event) {
 const reader = readable.getReader();
 const writer = writable.getWriter();
 const encoder = new TextEncoder();
 let decoder = new TextDecoder("utf-8", {fatal: true});
 
 let firstChunk = true;
 let unsupportedCharset = false;
 
 let partial = '';
 let content = '';
 
 try {
   for(;;) {
     const { done, value } = await reader.read();
     if (done) {
       if (partial.length) {
         partial = await modifyHtmlChunk(partial, request, event);
         await writer.write(encoder.encode(partial));
         partial = '';
       }
       break;
     }
 
     let chunk = null;
     if (unsupportedCharset) {
       // Pass the data straight through
       await writer.write(value);
       continue;
     } else {
       try {
         chunk = decoder.decode(value, {stream:true});
       } catch (e) {
         // Decoding failed, switch to passthrough
         unsupportedCharset = true;
         if (partial.length) {
           await writer.write(encoder.encode(partial));
           partial = '';
         }
         await writer.write(value);
         continue;
       }
     }
 
     try {
       // Look inside of the first chunk for a HTML charset or
       // content-type meta tag.
       if (firstChunk) {
         firstChunk = false;
         if (chunkContainsInvalidCharset(chunk)) {
           // switch to passthrough
           unsupportedCharset = true;
           if (partial.length) {
             await writer.write(encoder.encode(partial));
             partial = '';
           }
           await writer.write(value);
           continue;
         }
       }
 
       content = partial + chunk;
       partial = '';
 
       // See if there is an unclosed link tag at the end (and if so,
       // carve it out to complete when the remainder comes in).
       const linkPos = content.lastIndexOf('&lt;link');
       if (linkPos &gt;= 0) {
          const linkClose = content.indexOf('&gt;', linkPos);
         if (linkClose === -1) {
           partial = content.slice(linkPos);
           content = content.slice(0, linkPos);
         }
       }
 
       if (content.length) {
         content = await modifyHtmlChunk(content, request, event);
       }
     } catch (e) {
       // Ignore the exception
     }
     if (content.length) {
       await writer.write(encoder.encode(content));
       content = '';
     }
   }
 } catch(e) {
   // Ignore the exception
 }
 
 try {
   await writer.close();
 } catch(e) {
   // Ignore the exception
 }
}</code></pre>
            <p>There is a helper that scans for the charset in both of the meta tag forms that can set it:</p>
            <pre><code>function chunkContainsInvalidCharset(chunk) {
 let invalid = false;
 const VALID_CHARSETS = ['utf-8', 'utf8', 'iso-8859-1', 'us-ascii'];
 
 // meta charset
 const charsetRegex = /&lt;\s*meta[^&gt;]+charset\s*=\s*['"]([^'"]*)['"][^&gt;]*&gt;/mgi;
 const charsetMatch = charsetRegex.exec(chunk);
 if (charsetMatch) {
   const docCharset = charsetMatch[1].toLowerCase();
   if (!VALID_CHARSETS.includes(docCharset)) {
     invalid = true;
   }
 }
 // content-type
 const contentTypeRegex = /&lt;\s*meta[^&gt;]+http-equiv\s*=\s*['"]\s*content-type[^&gt;]*&gt;/mgi;
 const contentTypeMatch = contentTypeRegex.exec(chunk);
 if (contentTypeMatch) {
   const metaTag = contentTypeMatch[0];
   const metaRegex = /charset\s*=\s*([^\s"]*)/mgi;
   const metaMatch = metaRegex.exec(metaTag);
   if (metaMatch) {
     const charset = metaMatch[1].toLowerCase();
     if (!VALID_CHARSETS.includes(charset)) {
       invalid = true;
     }
   }
 }
 return invalid;
}</code></pre>
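            <p>As a quick sanity check, the meta-charset detection can be exercised in isolation. This is a standalone sketch (with a hypothetical function name) that restates the same regex shape as the helper above, not the worker code itself:</p>

```javascript
// Standalone sketch of the meta-charset detection used by the helper above.
const VALID_CHARSETS = ['utf-8', 'utf8', 'iso-8859-1', 'us-ascii'];

function metaCharsetIsInvalid(chunk) {
  // Same shape as the worker helper's regex: find <meta ... charset="...">
  const m = /<\s*meta[^>]+charset\s*=\s*['"]([^'"]*)['"][^>]*>/i.exec(chunk);
  return m ? !VALID_CHARSETS.includes(m[1].toLowerCase()) : false;
}

console.log(metaCharsetIsInvalid('<meta charset="utf-8">'));      // false
console.log(metaCharsetIsInvalid('<meta charset="Shift_JIS">'));  // true
```

A chunk with no meta tag at all reports valid, matching the fail-safe behavior of the worker: when in doubt, keep processing.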
            
    <div>
      <h4>HTML Business Logic</h4>
      <a href="#html-business-logic">
        
      </a>
    </div>
    <p>Finally, we can start the actual logic for embedding the font CSS. The basic logic is:</p><ul><li><p>Use a regex to find link tags for Google fonts css.</p></li><li><p>Fetch the browser-specific version of the CSS (from cache if possible).</p><ul><li><p>The fetch logic (discussed later) modifies the font URLs in the CSS to proxy through the worker.</p></li></ul></li><li><p>Replace the link tag with a style block with the CSS content.</p></li></ul>
            <pre><code>async function modifyHtmlChunk(content, request, event) {
 const fontCSSRegex = /&lt;link\s+[^&gt;]*href\s*=\s*['"]((https?:)?\/\/fonts.googleapis.com\/css[^'"]+)[^&gt;]*&gt;/mgi;
 let match = fontCSSRegex.exec(content);
 while (match !== null) {
   const matchString = match[0];
   if (matchString.indexOf('stylesheet') &gt;= 0) {
     const fontCSS = await fetchCSS(match[1], request, event);
     if (fontCSS.length) {
       // See if there is a media type on the link tag
       let mediaStr = '';
       const mediaMatch = matchString.match(/media\s*=\s*['"][^'"]*['"]/mig);
       if (mediaMatch) {
         mediaStr = ' ' + mediaMatch[0];
       }
       // Replace the actual css
       let cssString = "&lt;style" + mediaStr + "&gt;\n";
       cssString += fontCSS;
       cssString += "\n&lt;/style&gt;\n";
       content = content.split(matchString).join(cssString);
     }
     match = fontCSSRegex.exec(content);
    } else {
      // Advance past non-stylesheet link tags to avoid an infinite loop
      match = fontCSSRegex.exec(content);
    }
 }
 
 return content;
}</code></pre>
            <p>The fetching (and modifying) of the CSS is a little more complicated than a straight passthrough because we want to cache the result when possible. We cache the responses locally using the worker’s <a href="https://developers.cloudflare.com/workers/reference/cache-api/">Cache API</a>. Since the response is browser-specific, and we don’t want to fragment the cache too crazily, we create a custom cache key based on the browser user agent string that is basically browser+version+mobile.</p><p>Some plans have access to named cache storage, but to work with all plans it is easiest if we just modify the font URL that gets stored in cache and append the cache key to the end of the URL as a query parameter. The cache URL never gets sent to a server but is useful for local caching of different content that shares the same URL. For example:</p>
            <pre><code>https://fonts.googleapis.com/css?family=Roboto&amp;chrome71</code></pre>
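            <p>The key composition itself is just string concatenation. A minimal sketch (with a hypothetical helper name, mirroring the cacheKey expression in fetchCSS below):</p>

```javascript
// Hypothetical helper: append the browser key to the CSS URL as an extra
// query parameter so browser-specific CSS variants cache under distinct URLs.
function buildCacheKey(cssUrl, browserKey) {
  return browserKey ? cssUrl + '&' + browserKey : cssUrl;
}

console.log(buildCacheKey('https://fonts.googleapis.com/css?family=Roboto',
                          'Chrome71'));
// → https://fonts.googleapis.com/css?family=Roboto&Chrome71
```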
            <p>If the CSS isn’t available in the cache then we create a fetch request for the original URL from the Google servers, passing through the HTML URL as the referer, the correct browser user agent string and the client’s IP address in a standard proxy X-Forwarded-For header. Once the response is available we store it in the cache for future requests.</p><p>For browsers that can’t be identified by user agent string, a generic request for the CSS is sent with the user agent string from Internet Explorer 8 to get the lowest-common-denominator fallback CSS.</p><p>The actual modification of the CSS just uses a regex to look for font URLs and rewrites them into paths served through the HTML origin.</p>
            <pre><code>async function fetchCSS(url, request, event) {
 let fontCSS = "";
 if (url.startsWith('/'))
   url = 'https:' + url;
 const userAgent = request.headers.get('user-agent');
 const clientAddr = request.headers.get('cf-connecting-ip');
 const browser = getCacheKey(userAgent);
 const cacheKey = browser ? url + '&amp;' + browser : url;
 const cacheKeyRequest = new Request(cacheKey);
 let cache = null;
 
 let foundInCache = false;
 // Try pulling it from the cache API (wrap it in case it's not implemented)
 try {
   cache = caches.default;
   let response = await cache.match(cacheKeyRequest);
   if (response) {
      fontCSS = await response.text();
     foundInCache = true;
   }
 } catch(e) {
   // Ignore the exception
 }
 
 if (!foundInCache) {
   let headers = {'Referer': request.url};
   if (browser) {
     headers['User-Agent'] = userAgent;
   } else {
     headers['User-Agent'] =
       "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)";
   }
   if (clientAddr) {
     headers['X-Forwarded-For'] = clientAddr;
   }
 
   try {
     const response = await fetch(url, {headers: headers});
     fontCSS = await response.text();
 
     // Rewrite all of the font URLs to come through the worker
     fontCSS = fontCSS.replace(/(https?:)?\/\/fonts\.gstatic\.com\//mgi,
                               '/fonts.gstatic.com/');
 
      // Add the css to the cache (the full version on GitHub also keeps
      // an in-memory FONT_CACHE here)
     try {
       if (cache) {
          const cacheResponse = new Response(fontCSS,
            {headers: {'Cache-Control': 'max-age=86400'}});
         event.waitUntil(cache.put(cacheKeyRequest, cacheResponse));
       }
     } catch(e) {
       // Ignore the exception
     }
   } catch(e) {
     // Ignore the exception
   }
 }
 
 return fontCSS;
}</code></pre>
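            <p>In isolation, that font-URL rewrite is a single regex replacement. A quick sketch against a made-up CSS fragment (the URL is illustrative, not a real font file):</p>

```javascript
// Rewrite absolute (and protocol-relative) fonts.gstatic.com URLs into
// same-origin paths that the worker's font proxy will intercept.
const css = '@font-face { src: url(https://fonts.gstatic.com/s/roboto/v18/abc.woff2); }';
const rewritten = css.replace(/(https?:)?\/\/fonts\.gstatic\.com\//mgi,
                              '/fonts.gstatic.com/');
console.log(rewritten);
// → @font-face { src: url(/fonts.gstatic.com/s/roboto/v18/abc.woff2); }
```

Because the replacement produces a path starting with /fonts.gstatic.com/, the rewritten URL lands back on the entry-point check shown earlier and gets proxied.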
            <p>Generating the browser-specific cache key is a little sensitive since browsers tend to clone each other’s user agent strings and add their own information to them. For example, Edge includes a Chrome identifier and Chrome includes a Safari identifier, etc. We don’t necessarily have to handle every browser string since it will fallback to the least common denominator (ttf files without unicode range support) but it is helpful to catch as many of the large mainstream browser engines as possible.</p>
            <pre><code>function getCacheKey(userAgent) {
 let os = '';
 const osRegex = /^[^(]*\(\s*(\w+)/mgi;
 let match = osRegex.exec(userAgent);
 if (match) {
   os = match[1];
 }
 
 let mobile = '';
 if (userAgent.match(/Mobile/mgi)) {
   mobile = 'Mobile';
 }
 
 // Detect Edge first since it includes Chrome and Safari
 const edgeRegex = /\s+Edge\/(\d+)/mgi;
 match = edgeRegex.exec(userAgent);
 if (match) {
   return 'Edge' + match[1] + os + mobile;
 }
 
 // Detect Chrome next (and browsers using the Chrome UA/engine)
 const chromeRegex = /\s+Chrome\/(\d+)/mgi;
 match = chromeRegex.exec(userAgent);
 if (match) {
   return 'Chrome' + match[1] + os + mobile;
 }
 
 // Detect Safari and Webview next
 const webkitRegex = /\s+AppleWebKit\/(\d+)/mgi;
  match = webkitRegex.exec(userAgent);
 if (match) {
   return 'WebKit' + match[1] + os + mobile;
 }
 
 // Detect Firefox
 const firefoxRegex = /\s+Firefox\/(\d+)/mgi;
 match = firefoxRegex.exec(userAgent);
 if (match) {
   return 'Firefox' + match[1] + os + mobile;
 }
  return null;
}</code></pre>
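            <p>For example, running a desktop Chrome user agent string through the same regexes (a condensed, standalone sketch of just the Chrome branch above) yields a key like Chrome71Windows:</p>

```javascript
// Condensed sketch of the Chrome branch of getCacheKey.
const ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
           '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36';

const osMatch = /^[^(]*\(\s*(\w+)/.exec(ua);   // first token inside the parens
const os = osMatch ? osMatch[1] : '';
const mobile = /Mobile/.test(ua) ? 'Mobile' : '';
const chrome = /\s+Chrome\/(\d+)/.exec(ua);
const key = chrome ? 'Chrome' + chrome[1] + os + mobile : null;

console.log(key);  // → Chrome71Windows
```

Only the major version number is kept, so all Chrome 71 users on Windows share one cached copy of the CSS.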
            
    <div>
      <h3>Profit!</h3>
      <a href="#profit">
        
      </a>
    </div>
    <p>Any site served through Cloudflare can implement workers to rewrite their content but for something like Google fonts or other third-party resources it gets much more interesting when someone implements it once and everyone else can benefit. With <a href="https://www.cloudflare.com/apps/developer/docs/getting-started">Cloudflare Apps’</a> new <a href="/introducing-apps-with-workers/">worker support</a> you can bundle up and deliver complex worker logic for anyone else to consume and publish it to the Apps marketplace.</p><p>If you are a third-party content provider for sites, think about what you might be able to do to leverage workers for your content for sites that are served through Cloudflare.</p><p>I get excited thinking about the performance implications of something like a tag manager running entirely on the edge without the sites having to change their published pages and without browsers having to fetch heavy JavaScript to do the page modifications. It can be done dynamically for every request directly on the edge!</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4UYfaU3hSHYBAmqQgVZm0v</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Optimizing HTTP/2 prioritization with BBR and tcp_notsent_lowat]]></title>
            <link>https://blog.cloudflare.com/http-2-prioritization-with-nginx/</link>
            <pubDate>Fri, 12 Oct 2018 12:00:00 GMT</pubDate>
            <description><![CDATA[ Getting the best end-user performance from HTTP/2 requires good support for resource prioritization.  While most web servers support HTTP/2 prioritization, getting it to work well all the way to the browser requires a fair bit of coordination across the networking stack. ]]></description>
            <content:encoded><![CDATA[ <p>Getting the best end-user performance from HTTP/2 requires good support for resource prioritization. While most web servers support HTTP/2 prioritization, getting it to work well all the way to the browser requires a fair bit of coordination across the networking stack. This article will expose some of the interactions between the web server, Operating System and network and how to tune a server to optimize performance for end users.</p>
    <div>
      <h3>tl;dr</h3>
      <a href="#tl-dr">
        
      </a>
    </div>
    <p>On Linux 4.9 kernels and later, enable BBR congestion control and set tcp_notsent_lowat to 16KB for HTTP/2 prioritization to work reliably. This can be done in /etc/sysctl.conf:</p>
            <pre><code>    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr
    net.ipv4.tcp_notsent_lowat = 16384</code></pre>
            
    <div>
      <h3>Browsers and Request Prioritization</h3>
      <a href="#browsers-and-request-prioritization">
        
      </a>
    </div>
    <p>A single web page is made up of <a href="https://httparchive.org/reports/state-of-the-web#reqTotal">dozens</a> to hundreds of separate pieces of content that a web browser pulls together to create and present to the user. The main content (HTML) for the page you are visiting is a list of instructions on how to construct the page and the browser goes through the instructions from beginning to end to figure out everything it needs to load and how to put it all together. Each piece of content requires a separate HTTP request from the browser to the server responsible for that content (or if it has been loaded before, it can be loaded from a local cache in the browser).</p><p>In a simple implementation, the web browser could wait until everything is loaded and constructed and then show the result, but that would be pretty slow. Not all of the content is critical to the user; it can include things such as images far down the page, analytics for tracking usage, ads, like buttons, etc. All modern browsers work incrementally, displaying the content as it becomes available. This results in a much faster user experience. The visible part of the page can be displayed while the rest of the content is being loaded in the background. Deciding on the best order to request the content in is where browser request prioritization comes into play. Done correctly, the visible content can display significantly faster than with a naive implementation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2XMGHLI8Lg4iw7DBN7mNAS/b1c65f7a021f93e5c81d47f1304b50de/Parser-web.png" />
            
            </figure><p>HTML Parser blocking page render for styles and scripts in the head of the document.</p><p>Most modern browsers use similar prioritization schemes which generally look like:</p><ol><li><p>Load similar resources (scripts, images, styles) in the order they were listed in the HTML.</p></li><li><p>Load styles/CSS before anything else because content cannot be displayed until styles are complete.</p></li><li><p>Load blocking scripts/JavaScript next because blocking scripts stop the browser from moving on to the next instruction in the HTML until they have been loaded and executed.</p></li><li><p>Load images and non-blocking scripts (async/defer).</p></li></ol><p>Fonts are a bit of a special case in that they are needed to draw the text on the screen but the browser won’t know that it needs to load a font until it is actually ready to draw the text to the screen. So they are discovered pretty late. As a result they are generally given a very high priority once they are discovered but aren’t known about until fairly late in the loading process.</p><p>Chrome also applies some special treatment to images that are visible in the current browser viewport (part of the page visible on the screen). Once the styles have been applied and the page has been laid out it will give visible images a much higher priority and load them in order from largest to smallest.</p>
    <div>
      <h4>HTTP/1.x prioritization</h4>
      <a href="#http-1-x-prioritization">
        
      </a>
    </div>
    <p>With HTTP/1.x, each connection to a server can support one request at a time (practically anyway, as no browser supports <a href="https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.2.2">pipelining</a>) and most browsers will open up to 6 connections at a time to each server. The browser maintains a prioritized list of the content it needs and makes the requests to each server as a connection becomes available. When a high-priority piece of content is discovered it is moved to the front of the list and requested when the next connection becomes available.</p>
    <div>
      <h4>HTTP/2 prioritization</h4>
      <a href="#http-2-prioritization">
        
      </a>
    </div>
    <p>With HTTP/2, the browser uses a single connection and the requests are multiplexed over the connection as separate “streams”. The requests are all sent to the server as soon as they are discovered along with some prioritization information to let the server know the preferred ordering of the responses. It is then up to the server to do its best to deliver the most important responses first, followed by lower priority responses. When a high-priority request comes in to the server, it should immediately jump ahead of the lower priority responses, even mid-response. The actual priority scheme implemented by HTTP/2 allows for parallel downloads with weighting between them and more complicated schemes. For now it is easiest to just think about it as a priority ordering of the resources.</p><p>Most servers that support prioritization will send data for the highest-priority responses for which they have data available. But if the most important response takes longer to generate than lower priority responses, the server may end up starting to send data for a lower priority response and then interrupt its stream when the higher priority response becomes available. That way it can avoid wasting available bandwidth and head-of-line blocking where a slow response holds everything else up.</p>
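<p>That preemption rule can be illustrated with a toy scheduler (a simplified sketch that ignores HTTP/2’s weights and dependency trees): on every write the server sends the next chunk of the highest-priority stream that has data ready, so a late-arriving critical response interrupts a lower-priority one mid-transfer.</p>

```python
# Toy model of HTTP/2 response scheduling: each write, the server sends
# the next chunk of the highest-priority stream that has data ready, so
# a late-arriving high-priority response preempts one mid-transfer.
streams = {}  # stream_id -> {"priority": int, "chunks": list}

def open_stream(stream_id, priority, chunks):
    streams[stream_id] = {"priority": priority, "chunks": list(chunks)}

def send_next_chunk():
    ready = [s for s in streams.values() if s["chunks"]]
    if not ready:
        return None
    best = min(ready, key=lambda s: s["priority"])  # lower = more important
    return best["chunks"].pop(0)

sent = []
open_stream(1, priority=3, chunks=["img-a1", "img-a2", "img-a3"])
sent.append(send_next_chunk())                     # low-priority image starts
open_stream(3, priority=1, chunks=["css-1", "css-2"])  # critical CSS arrives
sent.append(send_next_chunk())                     # CSS preempts mid-response
sent.append(send_next_chunk())
sent.append(send_next_chunk())                     # image resumes afterwards
assert sent == ["img-a1", "css-1", "css-2", "img-a2"]
```

<p>Everything that follows in this post is about what stops a real server from behaving like this sketch: once bytes leave the server’s scheduler and land in a buffer, preemption is no longer possible for them.</p>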
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59eyWItYTFdeYfulcnJGgn/80c68b2bc82e445e1c5685f602ed235d/H2Prioritization.gif" />
            
            </figure><p>Browser requesting a high-priority resource after several low-priority resources.</p><p><b>In an optimal configuration, the time to retrieve a top-priority resource on a busy connection with lots of other streams will be identical to the time to retrieve it on an empty connection.</b> Effectively that means that the server needs to be able to interrupt the response streams of all of the other responses immediately with no additional buffering to delay the high-priority response (beyond the minimal amount of data in-flight on the network to keep the connection fully utilized).</p>
    <div>
      <h3>Buffers on the Internet</h3>
      <a href="#buffers-on-the-internet">
        
      </a>
    </div>
    <p>Excessive buffering is pretty much the nemesis for HTTP/2 because it directly impacts a server’s ability to respond nimbly to priority shifts. It is not unusual for there to be megabytes-worth of buffering between the server and the browser, more than the total size of most websites. Practically that means that the responses will get delivered in whatever order they become available on the server. It is not unusual to have a critical resource (like a font or a render-blocking script in the <code>&lt;head&gt;</code> of a document) delayed by megabytes of lower priority images. For the end-user this translates to seconds or even minutes of delay rendering the page.</p>
    <div>
      <h4>TCP send buffers</h4>
      <a href="#tcp-send-buffers">
        
      </a>
    </div>
    <p>The first layer of buffering between the server and the browser is in the server itself. The operating system maintains a TCP send buffer that the server writes data into. Once the data is in the buffer, the operating system takes care of delivering it as needed (pulling from the buffer as data is sent and signaling to the server when the buffer needs more data). A large buffer also reduces CPU load because it reduces the amount of writing that the server has to do to the connection.</p><p>The actual size of the send buffer needs to be big enough to keep a copy of all of the data that has been sent to the browser but has yet to be acknowledged, in case a packet gets dropped and some data needs to be retransmitted. Too small a buffer will prevent the server from being able to max out the connection bandwidth to the client (and is a common cause of slow downloads over long distances). In the case of HTTP/1.x (and a lot of other protocols), the data is delivered in bulk in a known order and tuning the buffers to be as big as possible has no downside other than the increase in memory use (trading off memory for CPU). Increasing the TCP send buffer sizes is an effective way to increase the throughput of a web server.</p><p>For HTTP/2, the problem with large send buffers is that they limit the server’s ability to adjust the data it is sending on a connection as high-priority responses become available. Once the response data has been written into the TCP send buffer it is beyond the server’s control and has been committed to be delivered in the order it is written.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1AqTAsnXSJ1q1TQpRZh0pm/86f7438dcd64c932fe96921e811b5a95/Lowat-buff-web.png" />
            
            </figure><p>High-priority resource queued behind low-priority resources in the TCP send buffer.</p><p><b>The optimal send buffer size for HTTP/2 is the minimal amount of data required to fully utilize the available bandwidth to the browser</b> (which is different for every connection and changes over time even for a single connection). Practically you’d want the buffer to be slightly bigger to allow for some time between when the server is signaled that more data is needed and when the server writes the additional data.</p>
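<p>To put a number on the “minimal amount of data required to fully utilize the available bandwidth”: that minimum is the bandwidth-delay product (BDP) of the connection. The figures below (5 Mbps, 100 ms round-trip time) are illustrative, not measurements from this post:</p>

```python
# Rough bandwidth-delay product (BDP) arithmetic: the minimum amount of
# in-flight data needed to keep a connection fully utilized.
bandwidth_bps = 5_000_000  # 5 Mbps link (illustrative)
rtt_seconds = 0.100        # 100 ms round-trip time (illustrative)

bdp_bytes = bandwidth_bps / 8 * rtt_seconds
assert bdp_bytes == 62_500  # ~62.5 KB must be in flight

# A send buffer much larger than the BDP is pure queue: every byte
# beyond ~62.5 KB is committed data that a newly available
# high-priority response has to wait behind.
```
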
    <div>
      <h4>TCP_NOTSENT_LOWAT</h4>
      <a href="#tcp_notsent_lowat">
        
      </a>
    </div>
    <p><a href="https://lwn.net/Articles/560082/">TCP_NOTSENT_LOWAT</a> is a socket option that allows configuration of the send buffer so that it is always the optimal size plus a fixed additional amount. You provide a buffer size (X), which is the amount of additional buffering you’d like beyond the minimum needed to fully utilize the connection, and it dynamically adjusts the TCP send buffer to always be X bytes larger than the current connection congestion window. The congestion window is the TCP stack’s estimate of the amount of data that needs to be in-flight on the network to fully utilize the connection.</p><p>TCP_NOTSENT_LOWAT can be configured in code on a socket-by-socket basis if the web server software supports it, or system-wide using the <code>net.ipv4.tcp_notsent_lowat</code> sysctl:</p>
            <pre><code>    net.ipv4.tcp_notsent_lowat = 16384</code></pre>
            <p>We have a patch we are preparing to upstream for NGINX to make it configurable but it isn’t quite ready yet so configuring it system-wide is required. Experimentally, the value 16,384 (16K) has proven to be a good balance where the connections are kept fully-utilized with negligible additional CPU overhead. That will mean that at most 16KB of lower priority data will be buffered before a higher priority response can interrupt it and be delivered. As always, your mileage may vary and it is worth experimenting with.</p>
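<p>If your server software lets you set socket options directly, the same low-water mark can be applied per-socket. This is a Linux-only sketch; the fallback value 25 is an assumption taken from <code>TCP_NOTSENT_LOWAT</code>’s definition in <code>linux/tcp.h</code>, for Python builds that don’t expose the constant:</p>

```python
import socket

# Per-socket equivalent of the sysctl above (Linux-only sketch).
# 25 is TCP_NOTSENT_LOWAT's value in linux/tcp.h, used as a fallback
# when the Python build doesn't expose the constant.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 16384)

# Verify the kernel accepted the 16 KB low-water mark.
assert sock.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT) == 16384
sock.close()
```
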
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/r0NmKQUjLYjXThiREdchj/37a90992b93c43d8a31563f27c46c349/Lowat-web.png" />
            
            </figure><p>High-priority resource ready to send with minimal TCP buffering.</p>
    <div>
      <h4>Bufferbloat</h4>
      <a href="#bufferbloat">
        
      </a>
    </div>
    <p>Beyond buffering on the server, the network connection between the server and the browser can act as a buffer. It is increasingly common for networking gear to have large buffers that absorb data that is sent faster than the receiving side can consume it. This is generally referred to as <a href="https://www.bufferbloat.net/projects/bloat/wiki/Introduction/">Bufferbloat</a>. I hedged my explanation of the effectiveness of <code>tcp_notsent_lowat</code> a little bit: it is based on the current congestion window, which is an estimate of the optimal amount of in-flight data but not necessarily the actual optimal amount.</p><p>The buffers in the network can be quite large at times (megabytes) and they interact very poorly with the congestion control algorithms usually used by TCP. Most classic congestion-control algorithms determine the congestion window by watching for packet loss. Once a packet is dropped, the stack knows there was too much data on the network and it scales back from there. With Bufferbloat that limit is raised artificially high because the buffers are absorbing the extra packets beyond what is needed to saturate the connection. As a result, the TCP stack ends up calculating a congestion window that spikes to much larger than the actual size needed, then drops to significantly smaller once the buffers are saturated and a packet is dropped, and the cycle repeats.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yp3yoEG0uBWlPowSVbKLm/e6fdd79c824709f41e2993698d2f8155/h2_tcp_sawtooth_web.png" />
            
            </figure><p>Loss-based congestion control congestion window graph.</p><p>TCP_NOTSENT_LOWAT uses the calculated congestion window as a baseline for the size of the send buffer it needs to use so when the underlying calculation is wrong, the server ends up with send buffers much larger (or smaller) than it actually needs.</p><p>I like to think about Bufferbloat as being like a line for a ride at an amusement park. Specifically, one of those lines where it’s a straight shot to the ride when there are very few people in line but once the lines start to build they can divert you through a maze of zig-zags. Approaching the ride it looks like a short distance from the entrance to the ride but things can go horribly wrong.</p><p>Bufferbloat is very similar. When the data is coming into the network slower than the links can support, everything is nice and fast:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1IpqZY4r070iJKdzCndFAU/1e039adf8d8ea8f00238794b6eea0191/Bb-fast-web.png" />
            
            </figure><p>Response traveling through the network with no buffering.</p><p>Once the data comes in faster than it can go out the gates are flipped and the data gets routed through the maze of buffers to hold it until it can be sent. From the entrance to the line it still looks like everything is going fine since the network is absorbing the extra data but it also means there is a long queue of the low-priority data already absorbed when you want to send the high-priority data and it has no choice but to follow at the back of the line:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1iqGmNvNY0pOG3CJZRHT0w/4f65ec518eff246284d8f2bca373b7da/Bb-slow-web.png" />
            
            </figure><p>Responses queued in network buffers.</p>
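<p>The interaction between loss-based congestion control and a bloated buffer can be shown with a toy simulation (illustrative numbers, not a real TCP stack): the sender only backs off when the buffer finally overflows, so its congestion window, and with it TCP_NOTSENT_LOWAT’s baseline, peaks far above what the link actually needs.</p>

```python
# Toy loss-based congestion control facing a bloated network buffer
# (illustrative numbers, not a real TCP implementation).
PATH_CAPACITY = 60   # packets the link itself needs in flight to stay busy
BUFFER_DEPTH = 200   # extra packets the bloated buffer silently absorbs

cwnd = 10
peaks = []
for _ in range(2000):
    if cwnd > PATH_CAPACITY + BUFFER_DEPTH:  # buffer overflows: packet loss
        peaks.append(cwnd)
        cwnd //= 2                           # multiplicative decrease
    else:
        cwnd += 1                            # additive increase per round trip

# Every sawtooth peak overshoots the true path capacity by the whole
# buffer depth, so a send buffer sized from cwnd is far too large.
assert all(p > 4 * PATH_CAPACITY for p in peaks)
```
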
    <div>
      <h4>BBR congestion control</h4>
      <a href="#bbr-congestion-control">
        
      </a>
    </div>
    <p><a href="https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster">BBR</a> is a newer congestion control algorithm from Google that uses changes in packet delays to model congestion instead of waiting for packets to drop. Once it sees that packets are taking longer to be acknowledged it assumes it has saturated the connection and packets have started to buffer. As a result the congestion window is often very close to the optimal needed to keep the connection fully utilized while also avoiding Bufferbloat. BBR was merged into the Linux kernel in version 4.9 and can be configured through sysctl:</p>
            <pre><code>    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr</code></pre>
            <p>BBR also tends to perform better overall since it doesn’t require packet loss as part of probing for the correct congestion window and also tends to react better to random packet loss.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t9tjuLzOCF7CbuCLicM5G/a350fcbdafc39f865ac34e90f5778fce/h2_bbr_sawtooth.png" />
            
            </figure><p>BBR congestion window graph.</p><p>Back to the amusement park line, BBR is like having each person carry one of the RFID cards they use to measure the wait time. Once the wait time looks like it is getting slower the people at the entrance slow down the rate that they let people enter the line.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lAoZDM2jP8XYw9SyS5OMx/f040d3c07ff4b97eefc65cad5f89ab39/Bb-detect-web.png" />
            
            </figure><p>BBR detecting network congestion early.</p><p>This way BBR essentially keeps the line moving as fast as possible and prevents the maze of lines from being used. When a guest with a fast pass arrives (the high-priority request) they can jump into the fast-moving line and hop right onto the ride.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PIseBoVZR55F9yEe71pya/75c406b6b42272f3d02a761f7b6a7047/Bb-bbr-web.png" />
            
            </figure><p>BBR delivering responses without network buffering.</p><p>Technically, any congestion control that keeps Bufferbloat in check and maintains an accurate congestion window will work for keeping the TCP send buffers in check, BBR just happens to be one of them (with lots of good properties).</p>
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p><b><i>The combination of TCP_NOTSENT_LOWAT and BBR reduces the amount of buffering on the network to the absolute minimum and is CRITICAL for good end-user performance with HTTP/2.</i></b> This is particularly true for NGINX and other HTTP/2 servers that don’t implement their own buffer throttling.</p><p>The end-user impact of correct prioritization is huge and may not show up in most of the metrics you are used to watching (particularly any server-side metrics like requests-per-second, request response time, etc).</p><p>Even on a 5Mbps cable connection proper resource ordering can result in rendering a page significantly faster (and the difference can explode to dozens of seconds or even minutes on a slower connection). <a href="https://www.webpagetest.org/video/view.php?id=180927_1b80249b8e1a7619300d2a51533c07438e4fa5ea">Here</a> is a relatively common case of a WordPress blog served over HTTP/2:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42hSIcWY4c4t5AlEDYa1Sl/eaacdf916bba0264aa90549411b8fc22/h2_render_1.png" />
            
            </figure><p>The page from the tuned server (After) starts to render at 1.8 seconds.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DbdYJbZ12so4x2jDmnx3W/a9bb2f54dedd07bb752dddb8fcc11667/h2_render_2.png" />
            
            </figure><p>The page from the tuned server (After) is completely done rendering at 4.5 seconds, well before the default configuration (Before) even started to render.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2DFgPFFYHWh7yoPDVPvzD8/a9788b5de09169f0d9113b8255fee060/h2_render_3.png" />
            
            </figure><p>Finally, at 10.2 seconds the default configuration started to render (8.4 seconds later or 5.6 times slower than the tuned server).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Mi9As6BWT0lq9HCITHwJ5/fbd8ebf3a553691da6b53d266286652e/h2_render_4.png" />
            
            </figure><p>Visually complete on the default configuration arrives at 10.7 seconds (6.2 seconds or 2.3 times slower than the tuned server).</p><p>Both configurations served the exact same content using the exact same servers with “After” being tuned for TCP_NOTSENT_LOWAT of 16KB (both configurations used BBR).</p>
    <div>
      <h3>Identifying Prioritization Issues In The Wild</h3>
      <a href="#identifying-prioritization-issues-in-the-wild">
        
      </a>
    </div>
    <p>If you look at a network waterfall diagram of a page loading, prioritization issues will show up as high-priority requests completing much later than lower-priority requests from the same origin. Usually that will also push metrics like <a href="https://w3c.github.io/paint-timing/#sec-terminology">First Paint</a> and <a href="https://developer.mozilla.org/en-US/docs/Web/Events/DOMContentLoaded">DOM Content Loaded</a> (the vertical purple bar below) much later.</p>
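<p>This check can be automated against waterfall data. The sketch below uses hypothetical request records of the form (name, priority, completion time), with a lower priority number meaning more important, and flags any resource that finished after a less-important one did:</p>

```python
# Sketch of an automated check for priority inversions in waterfall data.
# Request records are hypothetical: (name, priority, completion_time),
# where a lower priority number means more important.
def find_inversions(requests):
    """Return (important, less_important) pairs where the important
    resource finished after the less important one."""
    inversions = []
    for name, prio, end in requests:
        for other_name, other_prio, other_end in requests:
            if other_prio > prio and other_end < end:
                inversions.append((name, other_name))
    return inversions

waterfall = [
    ("critical.css", 1, 4.2),  # render-blocking, but finished last
    ("hero.jpg", 3, 1.1),
    ("footer.jpg", 3, 1.3),
]

# The critical CSS completed after both low-priority images.
assert find_inversions(waterfall) == [
    ("critical.css", "hero.jpg"),
    ("critical.css", "footer.jpg"),
]
```
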
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Ed4mofSgwlJ77sJ46qcWf/4fa3f82440f6df9a1cab240d02c1ce4f/h2_waterfall_delayed.png" />
            
            </figure><p>Network waterfall showing critical CSS and JavaScript delayed by images.</p><p>When prioritization is working correctly you will see critical resources all completing much earlier, no longer blocked by the lower-priority requests. You may still see SOME low-priority data download before the higher-priority data starts downloading because there is still some buffering even under ideal conditions, but it should be minimal.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xpPBVSGpsy1byCuc6vGy7/2c2fe0e20129d3ce23304886674c8ef0/h2_waterfall_fast.png" />
            
            </figure><p>Network waterfall showing critical CSS and JavaScript loading quickly.</p><p>Chrome 69 and later may hide the problem a bit. Chrome holds back lower-priority requests even on HTTP/2 connections until after it has finished processing the head of the document. In a waterfall it will look like a delayed block of requests that all start at the same time after the critical requests have completed. That doesn’t mean that it isn’t a problem for Chrome, just that it isn’t as obvious. Even with the staggering of requests there are still high-priority requests outside of the head of the document that can be delayed by lower-priority requests. Most notable are any blocking scripts in the body of the page and any external fonts that were not preloaded.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/35PouKEd6DjgpSylNzuGp6/1081ad9271d8cb9acd89d1d5cce75c17/h2_waterfall_chrome.png" />
            
            </figure><p>Network waterfall showing Chrome delaying the requesting of low-priority resources.</p><p>Hopefully this post gives you a deeper understanding of how HTTP/2 prioritization works, the ability to identify prioritization issues when they happen, and some tools to fix them when they appear.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">7zRny2MvKKFGt3vBHIyM5c</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
    </channel>
</rss>