
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 08:58:56 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Go and enhance your calm: demolishing an HTTP/2 interop problem]]></title>
            <link>https://blog.cloudflare.com/go-and-enhance-your-calm/</link>
            <pubDate>Fri, 31 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ HTTP/2 implementations often respond to suspected attacks by closing the connection with an ENHANCE_YOUR_CALM error code. Learn how a common pattern of using Go's HTTP/2 client can lead to unintended errors and the solution to avoiding them. ]]></description>
            <content:encoded><![CDATA[ <p>In September 2025, a thread popped up in our internal engineering chat room asking, "Which part of our stack would be responsible for sending <code>ErrCode=ENHANCE_YOUR_CALM</code> to an HTTP/2 client?" Two internal microservices were experiencing a critical error preventing their communication, and the team needed a timely answer.</p><p>In this blog post, we describe the background of well-known HTTP/2 attacks that trigger Cloudflare defenses, which close connections. We then document an easy-to-make mistake using Go's standard library that can cause clients to send PING flood attacks and how you can avoid it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OL6xA151F9JNR4S0wE0DG/f6ae4b0a261da5d189d82ccfa401104e/image1.png" />
          </figure>
    <div>
      <h3>HTTP/2 is powerful – but it can be easy to misuse</h3>
      <a href="#http-2-is-powerful-but-it-can-be-easy-to-misuse">
        
      </a>
    </div>
    <p><a href="https://www.rfc-editor.org/rfc/rfc9113"><u>HTTP/2</u></a> defines a binary wire format for encoding <a href="https://www.rfc-editor.org/rfc/rfc9110.html"><u>HTTP semantics</u></a>. Request and response messages are encoded as a series of HEADERS and DATA frames, each associated with a logical stream, sent over a TCP connection using TLS. There are also control frames that relate to the management of streams or the connection as a whole. For example, SETTINGS frames advertise properties of an endpoint, WINDOW_UPDATE frames provide flow control credit to a peer so that it can send data, RST_STREAM can be used to cancel or reject a request or response, while GOAWAY can be used to signal graceful or immediate connection closure.</p><p>HTTP/2 provides many powerful features that have legitimate uses. However, with great power comes responsibility and opportunity for accidental or intentional misuse. The specification details a number of <a href="https://datatracker.ietf.org/doc/html/rfc9113#section-10.5"><u>denial-of-service considerations</u></a>. Implementations are advised to harden themselves: "An endpoint that doesn't monitor use of these features exposes itself to a risk of denial of service. Implementations SHOULD track the use of these features and set limits on their use."</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71AXf477FP8u9znjdqXIh9/d6d22049a3f5f5488b8b7e9ac7f832a9/image3.png" />
          </figure><p>Cloudflare implements many different HTTP/2 defenses, developed over years in order to protect our systems and our customers. Some notable examples include mitigations added in 2019 to address "<a href="https://blog.cloudflare.com/on-the-recent-http-2-dos-attacks/"><u>Netflix vulnerabilities</u></a>" and in 2023 to mitigate <a href="https://blog.cloudflare.com/technical-breakdown-http2-rapid-reset-ddos-attack/"><u>Rapid Reset</u></a> and <a href="https://blog.cloudflare.com/madeyoureset-an-http-2-vulnerability-thwarted-by-rapid-reset-mitigations/"><u>similar</u></a> style attacks.</p><p>When Cloudflare detects that HTTP/2 client behaviour is likely malicious, we close the connection using the GOAWAY frame and include the error code <a href="https://www.rfc-editor.org/rfc/rfc9113#ENHANCE_YOUR_CALM"><code><u>ENHANCE_YOUR_CALM</u></code></a>.</p><p>One of the well-known and common attacks is <a href="https://www.cve.org/CVERecord?id=CVE-2019-9512"><u>CVE-2019-9512</u></a>, aka PING flood: "The attacker sends continual pings to an HTTP/2 peer, causing the peer to build an internal queue of responses. Depending on how efficiently this data is queued, this can consume excess CPU, memory, or both." Sending a <a href="https://www.rfc-editor.org/rfc/rfc9113#section-6.7"><u>PING frame</u></a> causes the peer to respond with a PING acknowledgement (indicated by an ACK flag). This allows for checking the liveness of the HTTP connection, along with measuring the layer 7 round-trip time – both useful things. The requirement to acknowledge a PING, however, provides the potential attack vector since it generates work for the peer.</p><p>A client that PINGs the Cloudflare edge too frequently will trigger our <a href="https://www.cve.org/CVERecord?id=CVE-2019-9512"><u>CVE-2019-9512</u></a> mitigations, causing us to close the connection. 
Shortly after we <a href="https://blog.cloudflare.com/road-to-grpc/"><u>launched support for gRPC</u></a> in 2020, we encountered interoperability issues with some gRPC clients that sent many PINGs as part of a <a href="https://grpc.io/blog/grpc-go-perf-improvements/#bdp-estimation-and-dynamic-flow-control-window"><u>performance optimization for window tuning</u></a>. We also discovered that the Rust Hyper crate had a feature called Adaptive Window that emulated the design and triggered a similar <a href="https://github.com/hyperium/hyper/issues/2526"><u>problem</u></a> until Hyper made a <a href="https://github.com/hyperium/hyper/pull/2550"><u>fix</u></a>.</p>
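<p>To make the mechanics concrete, here is a minimal sketch (our illustration, not code from any implementation mentioned here) that hand-encodes a PING frame following the RFC 9113 wire format: a 9-byte frame header (24-bit payload length, frame type, flags, 31-bit stream identifier) followed by the 8 opaque payload bytes the peer must echo back with the ACK flag set:</p>
            <pre><code>package main

import (
	"encoding/binary"
	"fmt"
)

// appendPingFrame hand-encodes an HTTP/2 PING frame: a 9-byte frame header
// followed by the 8 opaque payload bytes that the peer echoes back with ACK.
func appendPingFrame(buf []byte, ack bool, payload [8]byte) []byte {
	const frameTypePing = 0x6
	var flags byte
	if ack {
		flags = 0x1 // ACK flag
	}
	var hdr [9]byte
	hdr[2] = 8 // 24-bit payload length; always 8 for PING
	hdr[3] = frameTypePing
	hdr[4] = flags
	binary.BigEndian.PutUint32(hdr[5:], 0) // PING applies to the whole connection: stream 0
	return append(append(buf, hdr[:]...), payload[:]...)
}

func main() {
	frame := appendPingFrame(nil, false, [8]byte{1, 2, 3, 4, 5, 6, 7, 8})
	fmt.Printf("% x\n", frame)
}</code></pre>
            <p>Real clients use a frame serializer, such as the <code>Framer</code> in Go's <code>golang.org/x/net/http2</code> package, but this is the layout that appears on the wire, and it shows why every PING obliges the peer to do work: it must parse the frame and write back an acknowledgement.</p>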
    <div>
      <h3>Solving a microservice miscommunication mystery</h3>
      <a href="#solving-a-microservice-miscommunication-mystery">
        
      </a>
    </div>
    <p>When that thread popped up asking which part of our stack was responsible for sending the <code>ENHANCE_YOUR_CALM</code> error code, it was regarding a client communicating over HTTP/2 between two internal microservices.</p><p>We suspected that this was an HTTP/2 mitigation issue and confirmed it was a PING flood mitigation in our logs. But taking a step back, you may wonder why two internal microservices are communicating over the Cloudflare edge at all, and therefore hitting our mitigations. In this case, communicating over the edge provides us with several advantages:</p><ol><li><p>We get to dogfood our edge infrastructure and discover issues like this!</p></li><li><p>We can use Cloudflare Access for authentication. This allows our microservices to be accessed securely by both other services (using service tokens) and engineers (which is invaluable for debugging).</p></li><li><p>Internal services that are written with Cloudflare Workers can easily communicate with services that are accessible at the edge.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6v9oBdn5spdqw1BjDW1bS3/a9b36ef9252d580a79c978eb366f7a7a/image2.png" />
          </figure><p>The question remained: Why was this client behaving this way? We traded some ideas as we attempted to get to the bottom of the issue.</p><p>The client had a configuration that would indicate that it didn't need to PING very frequently:</p>
            <pre><code>t2.PingTimeout = 2 * time.Second
t2.ReadIdleTimeout = 5 * time.Second</code></pre>
            <p>However, in situations like this it is generally a good idea to establish ground truth about what is really happening "on the wire." For instance, grabbing a packet capture that can be dissected and explored in Wireshark can provide unequivocal evidence of precisely what was sent over the network. The next best option is detailed/trace logging at the sender or receiver, although sometimes logging can be misleading, so caveat emptor.</p><p>In our particular case, it was simpler to use logging with <code>GODEBUG=http2debug=2</code>. We built a simplified minimal reproduction of the client that triggered the error, helping to eliminate other potential variables. We did some group log analysis, combined with diving into some of the Go standard library code, to understand what it was really doing. Isaac Asimov is commonly credited with the quote "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" and sure enough, within the hour someone declared:

<i>the funny part I see is this:</i></p>
            <pre><code>2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote RST_STREAM stream=9 len=4 ErrCode=CANCEL
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote PING len=8 ping="j\xe7\xd6R\xdaw\xf8+"</code></pre>
            <p><i>every ping seems to be preceded by a RST_STREAM</i></p><p>Observant readers will recall the earlier mention of Rapid Reset. However, our logs clearly indicated ENHANCE_YOUR_CALM being triggered due to the PING flood. A bit of searching landed us on this <a href="https://groups.google.com/g/grpc-io/c/sWYYQJXHCAQ/m/SWFHxw9IAgAJ"><u>mailing list thread</u></a> and the comment "Sending a PING frame along with an RST_STREAM allows a client to distinguish between an unresponsive server and a slow response." That seemed quite relevant. We also found <a href="https://go-review.googlesource.com/c/net/+/632995"><u>a change that was committed</u></a> related to this topic. This partly answered why there were so many PINGs, but it also raised a new question: Why so many stream resets?

So we went back to the logs and built up a little more context about the interaction:</p>
            <pre><code>2025/09/02 17:33:18 http2: Transport received DATA flags=END_STREAM stream=47 len=0 data=""
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote RST_STREAM stream=47 len=4 ErrCode=CANCEL
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote PING len=8 ping="\x97W\x02\xfa&gt;\xa8\xabi"</code></pre>
            <p>The interesting thing here is that the server had sent a DATA frame with the END_STREAM flag set. Per the HTTP/2 stream <a href="https://www.rfc-editor.org/rfc/rfc9113#section-5.1"><u>state machine</u></a>, the stream should have transitioned to <b>closed</b> when a frame with END_STREAM was processed. The client doesn't need to do anything in this state – sending a RST_STREAM is entirely unnecessary.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/R0Bbw1SFYwcyb280RdwjY/578628489c97f67a5ac877a55f4f3e3b/image6.png" />
          </figure><p>A little more digging and noodling and an engineer proclaimed:

<i>I noticed that the reset+ping only happens when you call </i><code><i>resp.Body.Close()</i></code></p><p><i>I believe Go's HTTP library doesn't actually read the response body automatically, but keeps the stream open for you to use until you call </i><code><i>resp.Body.Close()</i></code><i>, which you can do at any point you like.</i></p><p>The hilarious thing in our example was that there wasn't actually any HTTP body to read. From the earlier example: <code>received DATA flags=END_STREAM stream=47 len=0 data=""</code>.</p><p>Science and engineering are at times weird and counterintuitive. We decided to tweak our client to read the (absent) body via <code>io.Copy(io.Discard, resp.Body)</code> before closing it.</p><p>Sure enough, this immediately stopped the client sending both a useless RST_STREAM and, by association, a PING frame.</p><p>Mystery solved?</p><p>To prove we had fixed the root cause, the production client was updated with a similar fix. A few hours later, all the ENHANCE_YOUR_CALM closures were eliminated.</p>
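<p>As a self-contained illustration of the fix, here is a minimal sketch using only Go's standard library, with a hypothetical local <code>httptest</code> server (serving HTTP/2 over TLS) standing in for our real services:</p>
            <pre><code>package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Hypothetical local server that, like our microservice, replies with an
	// empty body.
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	srv.EnableHTTP2 = true // negotiate HTTP/2 between srv.Client() and srv
	srv.StartTLS()
	defer srv.Close()

	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		panic(err)
	}
	// Drain the (empty) body before closing so the HTTP/2 stream is fully
	// consumed and the client has no reason to send RST_STREAM + PING.
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	fmt.Println(resp.Proto, resp.StatusCode)
}</code></pre>
            <p>Draining the body before closing it lets the transport treat the stream as fully consumed, leaving nothing for the client to cancel with RST_STREAM.</p>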
    <div>
      <h3>Reading bodies in Go can be unintuitive</h3>
      <a href="#reading-bodies-in-go-can-be-unintuitive">
        
      </a>
    </div>
    <p>It’s worth noting that ensuring the response body is always read can be unintuitive in Go. For example, at first glance it appears that the response body will always be read in the following example:</p>
            <pre><code>resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer resp.Body.Close()

if err := json.NewDecoder(resp.Body).Decode(&amp;respBody); err != nil {
	return err
}</code></pre>
            <p>However, <code>json.Decoder</code> stops reading as soon as it finds a complete JSON document or errors. If the response body contains multiple JSON documents or invalid JSON, then the entire response body may still not be read.</p><p>Therefore, in our clients, we’ve started replacing <code>defer response.Body.Close()</code> with the following pattern to ensure that response bodies are always fully read:</p>
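<p>A quick way to see this behaviour is to decode a body that contains two JSON documents: <code>Decode</code> consumes only the first, and the remainder (both what the decoder buffered and what it never read) stays unconsumed. A small self-contained sketch:</p>
            <pre><code>package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

func main() {
	// Stand-in for a response body that contains two JSON documents.
	body := strings.NewReader(`{"a":1}{"b":2}`)

	first := new(map[string]int)
	dec := json.NewDecoder(body)
	if err := dec.Decode(first); err != nil {
		panic(err)
	}

	// Bytes the decoder buffered but did not consume, plus anything still in
	// the reader, are unread body.
	rest, err := io.ReadAll(io.MultiReader(dec.Buffered(), body))
	if err != nil {
		panic(err)
	}
	fmt.Printf("decoded: %v unread: %s\n", *first, rest)
}</code></pre>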
            <pre><code>resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer func() {
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}()

if err := json.NewDecoder(resp.Body).Decode(&amp;respBody); err != nil {
	return err
}</code></pre>
            
    <div>
      <h2>Actions to take if you encounter ENHANCE_YOUR_CALM</h2>
      <a href="#actions-to-take-if-you-encounter-enhance_your_calm">
        
      </a>
    </div>
    <p>HTTP/2 is a feature-rich protocol. Many implementations harden themselves against misuse of those features, and that hardening can trigger connections being closed. The recommended error code for closing connections in such conditions is ENHANCE_YOUR_CALM. There are numerous HTTP/2 implementations and APIs, and some may drive the use of HTTP/2 features in unexpected ways that look like attacks.</p><p>If you have an HTTP/2 client that encounters closures with ENHANCE_YOUR_CALM, we recommend that you try to establish ground truth with packet captures (including TLS decryption keys via mechanisms like <a href="https://wiki.wireshark.org/TLS#using-the-pre-master-secret"><u>SSLKEYLOGFILE</u></a>) and/or detailed trace logging. Look for patterns of frequent or repeated frames that might resemble malicious traffic. Adjusting your client may help it avoid being misclassified as an attacker.</p><p>If you use Go, we recommend always reading HTTP/2 response bodies (even if empty) in order to avoid sending unnecessary RST_STREAM and PING frames. This is especially important if you use a single connection for multiple requests, which can cause a high frequency of these frames.</p><p>This was also a great reminder of the advantages of dogfooding our own products within our internal services. When we run into issues like this one, our learnings can benefit our customers with similar setups.</p>
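<p>For Go clients, TLS decryption keys can be exported in SSLKEYLOGFILE format via <code>crypto/tls</code>'s <code>KeyLogWriter</code>. A minimal sketch, where the file path and the local test server are our stand-ins:</p>
            <pre><code>package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
)

func main() {
	// Write TLS session secrets in SSLKEYLOGFILE format; Wireshark can use
	// this file to decrypt a packet capture of the client's traffic.
	keyLog, err := os.Create("sslkeys.log") // path is arbitrary
	if err != nil {
		panic(err)
	}
	defer keyLog.Close()

	// Hypothetical local TLS server standing in for the real endpoint.
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	client := srv.Client()
	client.Transport.(*http.Transport).TLSClientConfig.KeyLogWriter = keyLog

	resp, err := client.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	data, err := os.ReadFile("sslkeys.log")
	if err != nil {
		panic(err)
	}
	fmt.Println("logged secrets:", len(data) != 0)
}</code></pre>
            <p>Pointing Wireshark's TLS "(Pre)-Master-Secret log filename" preference at this file then decrypts captures of the session.</p>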
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6doWYOkW3zbafkANt31knv/b00387716b1971d61eb8b4915ee58783/image5.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[DDoS]]></category>
            <guid isPermaLink="false">sucWiZHlaWXeLFddHtkk1</guid>
            <dc:creator>Lucas Pardue</dc:creator>
            <dc:creator>Zak Cutner</dc:creator>
        </item>
        <item>
            <title><![CDATA[DDoS threat report for 2023 Q3]]></title>
            <link>https://blog.cloudflare.com/ddos-threat-report-2023-q3/</link>
            <pubDate>Thu, 26 Oct 2023 13:00:58 GMT</pubDate>
            <description><![CDATA[ In the past quarter, DDoS attacks surged by 65%. Gaming and Gambling companies were the most attacked and Cloudflare mitigated thousands of hyper-volumetric DDoS attacks. The largest attacks we saw peaked at 201 million rps and 2.6 Tbps. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7vJM3Cw70UHjDC2hq8rRqv/7d62354485355a253d2b997d3249df82/image19.png" />
            
            </figure><p>Welcome to the third DDoS threat report of 2023. DDoS attacks, or <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">distributed denial-of-service attacks</a>, are a type of cyber attack that aims to disrupt websites (and other types of Internet properties) to make them unavailable for legitimate users by overwhelming them with more traffic than they can handle — similar to a driver stuck in a traffic jam on the way to the grocery store.</p><p>We see a lot of DDoS attacks of all types and sizes, and our <a href="https://www.cloudflare.com/network/">network</a> is one of the largest in the world spanning more than 300 cities in over 100 countries. Through this network we serve over 64 million HTTP requests per second at peak and about 2.3 billion DNS queries every day. On average, we mitigate 140 billion cyber threats each day. This colossal amount of data gives us a unique vantage point to understand the threat landscape and provide the community access to insightful and actionable DDoS trends.</p><p>In recent weeks, we've also observed a surge in DDoS attacks and other cyber attacks against Israeli newspaper and media websites, as well as financial institutions and government websites. Palestinian websites have also seen a significant increase in DDoS attacks. View the full coverage <a href="/cyber-attacks-in-the-israel-hamas-war/">here</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZyOmncoxhr4spfWrvzu4I/fb65af9e3364a5c9c10551e622f6acb2/pasted-image-0--7--1.png" />
            
            </figure><p>HTTP DDoS attacks against Israeli websites using Cloudflare</p>
    <div>
      <h2>The global DDoS threat landscape</h2>
      <a href="#the-global-ddos-threat-landscape">
        
      </a>
    </div>
    <p>In the third quarter of 2023, Cloudflare faced one of the most sophisticated and persistent DDoS attack campaigns in recorded history.</p><ol><li><p>Cloudflare mitigated thousands of hyper-volumetric HTTP DDoS attacks, 89 of which exceeded 100 million requests per second (rps) and with the largest peaking at 201 million rps — a figure three times higher than the previous <a href="/cloudflare-mitigates-record-breaking-71-million-request-per-second-ddos-attack/?">largest attack on record</a> (71M rps).</p></li><li><p>The campaign contributed to an overall increase of 65% in HTTP DDoS attack traffic in Q3 compared to the previous quarter. Similarly, L3/4 DDoS attacks also increased by 14% alongside numerous attacks in the terabit-per-second range — the largest attack targeted Cloudflare’s free DNS resolver <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/">1.1.1.1</a> and peaked at 2.6 Tbps.</p></li><li><p>Gaming and Gambling companies were bombarded with the largest volume of HTTP DDoS attack traffic, overtaking the Cryptocurrency industry from last quarter.</p></li></ol><p><i>Reminder: an interactive version of this report is also available as a</i> <a href="https://radar.cloudflare.com/reports/ddos-2023-q3"><i>Cloudflare Radar Report</i></a><i>. On</i> <a href="https://radar.cloudflare.com/"><i>Radar</i></a><i>, you can also dive deeper and explore traffic trends, attacks, outages and many more insights for your specific industry, network and country.</i></p>
    <div>
      <h3>HTTP DDoS attacks and hyper-volumetric attacks</h3>
      <a href="#http-ddos-attacks-and-hyper-volumetric-attacks">
        
      </a>
    </div>
    <p>An <a href="https://www.cloudflare.com/learning/ddos/http-flood-ddos-attack/">HTTP DDoS attack</a> is a DDoS attack over the <a href="https://www.cloudflare.com/learning/ddos/glossary/hypertext-transfer-protocol-http/">Hypertext Transfer Protocol (HTTP)</a>. It targets HTTP Internet properties such as mobile application servers, ecommerce websites, and API gateways.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4MzFZZekp5KTt2jQlOoG0M/69a332e07a45dcb6c922f7d7f7cc82c0/Untitled.png" />
            
            </figure><p>Illustration of an HTTP DDoS attack</p><p><a href="https://developers.cloudflare.com/support/network/understanding-cloudflare-http2-and-http3-support/#http2">HTTP/2</a>, which accounts for 62% of HTTP traffic, is a version of the protocol that’s meant to improve application performance. The downside is that HTTP/2 can also help <i>improve</i> a botnet’s performance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7gt0Zfq2TByTxHFxMJe6BP/ccba571bf7ffb3e37782e64ebaa5e0cf/pasted-image-0-1.png" />
            
            </figure><p>Distribution of HTTP versions by Radar</p>
    <div>
      <h3>Campaign of hyper-volumetric DDoS attacks exploiting HTTP/2 Rapid Resets</h3>
      <a href="#campaign-of-hyper-volumetric-ddos-attacks-exploiting-http-2-rapid-resets">
        
      </a>
    </div>
    <p>Starting in late August 2023, Cloudflare and various other vendors were subject to a sophisticated and persistent DDoS attack campaign that exploited the <a href="/zero-day-rapid-reset-http2-record-breaking-ddos-attack/">HTTP/2 Rapid Reset</a> vulnerability (<a href="https://www.cve.org/CVERecord?id=CVE-2023-44487">CVE-2023-44487</a>).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zfCtXueNXLlsmlk7A8coq/84f13ed42f5dc510dfc9bad46ed7b8b3/pasted-image-0--1--1.png" />
            
            </figure><p>Illustration of an HTTP/2 Rapid Reset DDoS attack</p><p>The DDoS campaign included thousands of hyper-volumetric DDoS attacks over HTTP/2 that peaked in the range of millions of requests per second. The average attack rate was 30M rps. Approximately 89 of the attacks peaked above 100M rps and the largest one we saw hit 201M rps.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WoXAia8UYzUqUvdwsxuXW/eccf1cdae4f0cb46d709b89395004e6a/pasted-image-0--2--1.png" />
            
            </figure><p>HTTP/2 Rapid Reset campaign of hyper-volumetric DDoS attacks</p><p>Cloudflare’s systems automatically detected and mitigated the vast majority of attacks. We deployed emergency countermeasures and improved our mitigation systems’ efficacy and efficiency to ensure the availability of our network and of our customers’.</p><p>Check out our engineering blog that <a href="/technical-breakdown-http2-rapid-reset-ddos-attack/">dives deep into the land of HTTP/2</a>, what we learned and what actions we took to make the Internet safer.</p>
    <div>
      <h3>Hyper-volumetric DDoS attacks enabled by VM-based botnets</h3>
      <a href="#hyper-volumetric-ddos-attacks-enabled-by-vm-based-botnets">
        
      </a>
    </div>
    <p>As we’ve seen in this campaign and previous <a href="/ddos-threat-report-2023-q1/">ones</a>, botnets that leverage cloud computing platforms and exploit HTTP/2 are able to generate up to <b>5,000x</b> more force per botnet node. This allowed them to launch hyper-volumetric DDoS attacks with a small botnet of just 5-20 thousand nodes. To put that into perspective, in the past, IoT-based botnets consisted of fleets of millions of nodes yet barely managed to reach a few million requests per second.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6aY3QDbXwT06ndT5ruv6Ms/4acf652835acec7f5bd28b0aca80bf32/pasted-image-0--3--1.png" />
            
            </figure><p>Comparison of an Internet of Things (IoT) based botnet and a Virtual Machine (VM) based botnet</p><p>When analyzing the two-month-long DDoS campaign, we can see that Cloudflare infrastructure was the main target of the attacks. More specifically, 19% of all attacks targeted Cloudflare websites and infrastructure. Another 18% targeted Gaming companies, and 10% targeted well known VoIP providers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5TsjdpB4iSwOB0brlTf454/b0dbbad2c5d475fc719d76423328337a/pasted-image-0--4--1.png" />
            
            </figure><p>Top industries targeted by the HTTP/2 Rapid Reset DDoS attacks</p>
    <div>
      <h3>HTTP DDoS attack traffic increased by 65%</h3>
      <a href="#http-ddos-attack-traffic-increased-by-65">
        
      </a>
    </div>
    <p>The attack campaign contributed to an overall increase in the amount of attack traffic. Last quarter, the volume of HTTP DDoS attacks increased by 15% QoQ. This quarter, it grew even more. Attack volume increased by 65% QoQ to a staggering total of 8.9 trillion HTTP DDoS requests that Cloudflare systems automatically detected and mitigated.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3GZPXFg1uZKMn9HLpcTQSS/cb9d2cd146b5a2d68baa1b94651f315e/Untitled9.png" />
            
            </figure><p>Aggregated volume of HTTP DDoS attack requests by quarter</p><p>Alongside the 65% increase in HTTP DDoS attacks, we also saw a minor increase of 14% in L3/4 DDoS attacks — similar to the figures we saw in the first quarter of this year.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/60R9Zo2MFYlW9w8KfNge4C/78c978588db6977529abebcfbb8221d3/pasted-image-0--5--1.png" />
            
            </figure><p>L3/4 DDoS attacks by quarter</p><p>A rise in large volumetric DDoS attacks contributed to this increase. In Q3, our DDoS defenses automatically detected and mitigated numerous DDoS attacks in the terabit-per-second range. The largest attack we saw peaked at 2.6 Tbps. This attack was part of a broader campaign that targeted Cloudflare’s free DNS resolver <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/">1.1.1.1</a>. It was a <a href="https://www.cloudflare.com/learning/ddos/udp-flood-ddos-attack/">UDP flood</a> that was launched by a <a href="https://www.cloudflare.com/learning/ddos/glossary/mirai-botnet/">Mirai-variant botnet</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Uxn6gNTYNEWO9KACtfKpN/8cc96fb699be51869db0691269335008/pasted-image-007.png" />
            
            </figure>
    <div>
      <h2>Top sources of HTTP DDoS attacks</h2>
      <a href="#top-sources-of-http-ddos-attacks">
        
      </a>
    </div>
    <p>When comparing the global and country-specific HTTP DDoS attack request volume, we see that the US remains the largest source of HTTP DDoS attacks. One out of every 25 HTTP DDoS requests originated from the US. China remains in second place. Brazil replaced Germany as the third-largest source of HTTP DDoS attacks, as Germany fell to fourth place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50dCj8uTPa6ovls8JnBbeV/0e84a5263c930cb59616544d82bc91fd/pasted-image-0--6-.png" />
            
            </figure><p>HTTP DDoS attacks: Top sources compared to all attack traffic</p><p>Some countries naturally receive more traffic due to various factors such as population size and Internet usage, and therefore also receive/generate more attacks. So while it’s interesting to understand the total amount of attack traffic originating from or targeting a given country, it is also helpful to remove that bias by normalizing the attack traffic by all traffic to a given country.</p><p>When doing so, we see a different pattern. The US doesn’t even make it into the top ten. Instead, Mozambique is in first place (again). One out of every five HTTP requests that originated from Mozambique was part of an HTTP DDoS attack.</p><p>Egypt remains in second place — approximately 13% of requests originating from Egypt were part of an HTTP DDoS attack. Libya and China follow as the third and fourth-largest sources of HTTP DDoS attacks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/wuw45dUbZjsMYTpxcaFrv/b065d46bbfcbff8466a570a97d3c022b/pasted-image-0--8-.png" />
            
            </figure><p>HTTP DDoS attacks: Top sources compared to their own traffic</p>
    <div>
      <h2>Top sources of L3/4 DDoS attacks</h2>
      <a href="#top-sources-of-l3-4-ddos-attacks">
        
      </a>
    </div>
    <p>When we look at the origins of L3/4 DDoS attacks, we ignore the source IP address because it can be <a href="https://www.cloudflare.com/learning/ddos/glossary/ip-spoofing/">spoofed</a>. Instead, we rely on the location of Cloudflare’s data center where the traffic was ingested. Thanks to our large network and global coverage, we’re able to achieve geographical accuracy to understand where attacks come from.</p><p>In Q3, approximately 36% of all <a href="https://www.cloudflare.com/learning/ddos/layer-3-ddos-attacks/">L3/4 DDoS attack</a> traffic that we saw originated from the US. Far behind, Germany came in second place with 8% and the UK followed in third place with almost 5%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RrBJ1xpV90NhBGOAmP3Au/bb3251e7566f5715b2e70737569efc75/pasted-image-0--9-.png" />
            
            </figure><p>L3/4 DDoS attacks: Top sources compared to all attack traffic</p><p>When normalizing the data, we see that Vietnam dropped to the second-largest source of L3/4 DDoS attacks after being first for two consecutive quarters. New Caledonia, a French territory comprising dozens of islands in the South Pacific, grabbed the first place. Half of all bytes (two out of every four) ingested in Cloudflare’s data centers in New Caledonia were attack traffic.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2m7OrC60ttRI190S1ripAi/acc86e9508f687fbb4e1d7afb2199994/pasted-image-0--10-.png" />
            
            </figure><p>L3/4 DDoS attacks: Top sources compared to their own traffic</p>
    <div>
      <h2>Top attacked industries by HTTP DDoS attacks</h2>
      <a href="#top-attacked-industries-by-http-ddos-attacks">
        
      </a>
    </div>
    <p>In terms of absolute volume of HTTP DDoS attack traffic, the Gaming and Gambling industry jumps to first place overtaking the Cryptocurrency industry. Over 5% of all HTTP DDoS attack traffic that Cloudflare saw targeted the Gaming and Gambling industry.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4g1US6Pgze6K9cktcoxExV/356942816a4b30030ba953059f5462e3/pasted-image-0--11--2.png" />
            
            </figure><p>HTTP DDoS attacks: Top attacked industries compared to all attack traffic</p><p>The Gaming and Gambling industry has long been one of the most attacked industries. But when we look at the HTTP DDoS attack traffic relative to each specific industry, we see a different picture. The Gaming and Gambling industry has so much user traffic that, despite being the most attacked industry <i>by volume</i>, it doesn’t even make it into the top ten when we put it into the per-industry context.</p><p>Instead, what we see is that the Mining and Metals industry was targeted by the most attacks compared to its total traffic — 17.46% of all traffic to Mining and Metals companies was DDoS attack traffic.</p><p>Following closely in second place, 17.41% of all traffic to Non-profits was HTTP DDoS attack traffic. Many of these attacks are directed at more than 2,400 Non-profit and independent media organizations in 111 countries that Cloudflare protects for free as part of Project Galileo, which celebrated its <a href="/nine-years-of-project-galileo-and-how-the-last-year-has-changed-it/">ninth anniversary</a> this year. Over the past quarter alone, Cloudflare mitigated an average of 180.5 million cyber threats against Galileo-protected websites every day.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4IdazsRvY8JcFF5ByZprTK/e7d3886bde7415b9ebca9509dc0d5b91/pasted-image-0--12--2.png" />
            
            </figure><p>HTTP DDoS attacks: Top attacked industries compared to their own traffic</p><p>Pharmaceuticals, Biotechnology and Health companies came in third, and US Federal Government websites in fourth place. Almost one out of every 10 HTTP requests to US Federal Government Internet properties was part of an attack. Cryptocurrency came in fifth, with Farming and Fishery not far behind.</p>
    <div>
      <h3>Top attacked industries by region</h3>
      <a href="#top-attacked-industries-by-region">
        
      </a>
    </div>
    <p>Now let’s dive deeper to understand which industries were targeted the most in each region.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1SgX9S1bbrxWXAQHtWECiz/ec7a1d4c2cfc8c65aa82943d721249e2/Top-Attacked-Industry-by-Region-Q3-2023.png" />
            
            </figure><p>HTTP DDoS attacks: Top industries targeted by HTTP DDoS attacks by region</p>
    <div>
      <h2>Regional deep dives</h2>
      <a href="#regional-deepdives">
        
      </a>
    </div>
    
    <div>
      <h3>Africa</h3>
      <a href="#africa">
        
      </a>
    </div>
    <p>After two consecutive quarters as the most attacked industry, the Telecommunications industry dropped from first place to fourth. Media Production companies were the most attacked industry in Africa, with the Banking, Financial Services and Insurance (BFSI) industry second, and Gaming and Gambling companies third.</p>
    <div>
      <h3>Asia</h3>
      <a href="#asia">
        
      </a>
    </div>
    <p>The Cryptocurrency industry remains the most attacked in APAC for the second consecutive quarter. Gaming and Gambling came in second place, and Information Technology and Services companies in third.</p>
    <div>
      <h3>Europe</h3>
      <a href="#europe">
        
      </a>
    </div>
    <p>For the fourth consecutive quarter, the Gaming and Gambling industry remains the most attacked industry in Europe. Retail companies came in second, and Computer Software companies in third.</p>
    <div>
      <h3>Latin America</h3>
      <a href="#latin-america">
        
      </a>
    </div>
    <p>Farming was the most targeted industry in Latin America in Q3, accounting for a whopping 53% of all attacks towards the region. Far behind, Gaming and Gambling companies were the second most targeted, and Civic and Social Organizations came in third.</p>
    <div>
      <h3>Middle East</h3>
      <a href="#middle-east">
        
      </a>
    </div>
    <p>Retail companies were the most targeted in the Middle East in Q3. Computer Software companies came in second and the Gaming and Gambling industry in third.</p>
    <div>
      <h3>North America</h3>
      <a href="#north-america">
        
      </a>
    </div>
    <p>After two consecutive quarters in first place, the Marketing and Advertising industry dropped to second. Computer Software took the lead, and Telecommunications companies came in third.</p>
    <div>
      <h3>Oceania</h3>
      <a href="#oceania">
        
      </a>
    </div>
    <p>The Telecommunications industry was, by far, the most targeted in Oceania in Q3 — over 45% of all attacks targeting Oceania. Cryptocurrency and Computer Software companies came in second and third places respectively.</p>
    <div>
      <h2>Top attacked industries by L3/4 DDoS attacks</h2>
      <a href="#top-attacked-industries-by-l3-4-ddos-attacks">
        
      </a>
    </div>
    <p>When descending the layers of the <a href="https://www.cloudflare.com/learning/ddos/glossary/open-systems-interconnection-model-osi/">OSI model</a>, the Internet networks and services that were most targeted belonged to the Information Technology and Internet industry. Almost 35% of all L3/4 DDoS attack traffic (in bytes) targeted the Information Technology and Internet industry.</p><p>Far behind, Telecommunication companies came in second with a mere 3% share. Gaming and Gambling came in third, and Banking, Financial Services and Insurance (BFSI) companies in fourth.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2MfrOUrxk9nvOVu3UJz4VN/f8297c3bb90969df2b1a5530621c0fd5/pasted-image-0--13--1.png" />
            
            </figure><p>L3/4 DDoS attacks: Top attacked industries compared to all attack traffic</p><p>When comparing the attacks on industries to all traffic for that specific industry, we see that the Music industry jumps to the first place, followed by Computer and Network Security companies, Information Technology and Internet companies and Aviation and Aerospace.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DfOZGUxDtZjNG32es9K4K/ecb5f8a71a998a896dfd71592c6f4896/pasted-image-0--14--1.png" />
            
            </figure><p>L3/4 DDoS attacks: Top attacked industries compared to their own traffic</p>
    <div>
      <h2>Top attacked countries by HTTP DDoS attacks</h2>
      <a href="#top-attacked-countries-by-http-ddos-attacks">
        
      </a>
    </div>
    <p>When examining the total volume of attack traffic, the US remains the main target of HTTP DDoS attacks. Almost 5% of all HTTP DDoS attack traffic targeted the US. Singapore came in second and China in third.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6UGa4GWaTX9dLLMudmE8x8/c781750cd8b72074946581454d350ed0/pasted-image-0--15--2.png" />
            
            </figure><p>HTTP DDoS attacks: Top attacked countries compared to all traffic</p><p>If we normalize the data per country and region and divide the attack traffic by the total traffic, we get a different picture. The top three most attacked countries are island nations.</p><p>Anguilla, a small set of islands east of Puerto Rico, jumps to first place as the most attacked country. Over 75% of all traffic to Anguilla websites was HTTP DDoS attack traffic. In second place, American Samoa, a group of islands east of Fiji. In third, the British Virgin Islands.</p><p>Algeria came in fourth, followed by Kenya, Russia, Vietnam, Singapore, Belize, and Japan.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4dKlIvgbduDoug2Zlxn49q/2f8fdf78b30024ddb05e7b2aedef3da3/pasted-image-0--16-.png" />
            
            </figure><p>HTTP DDoS attacks: Top attacked countries compared to their own traffic</p>
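<p>The normalization described above is simple division: a country’s attack traffic over its total traffic. A minimal sketch (with hypothetical, made-up figures) of why a small island nation can outrank the US in this per-country ranking:</p>

```python
def normalized_attack_share(attack_bytes: dict, total_bytes: dict) -> dict:
    """Rank countries by attack traffic as a share of their own total traffic."""
    shares = {
        country: attack_bytes[country] / total_bytes[country]
        for country in attack_bytes
        if total_bytes.get(country)
    }
    # Sort descending by share: the most attack-saturated country first.
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical figures: a small country whose traffic is mostly attacks
# outranks a large country that receives more attack traffic in absolute terms.
ranked = normalized_attack_share(
    {"US": 500, "Anguilla": 8},
    {"US": 100_000, "Anguilla": 10},
)
```

<p>The absolute-volume ranking would put the US first (500 vs. 8 units of attack traffic); the normalized view inverts that ordering.</p>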
    <div>
      <h3>Top attacked countries by L3/4 DDoS attacks</h3>
      <a href="#top-attacked-countries-by-l3-4-ddos-attacks">
        
      </a>
    </div>
    <p>For the second consecutive quarter, Chinese Internet networks and services remain the most targeted by L3/4 DDoS attacks. These China-bound attacks account for 29% of all attacks we saw in Q3.</p><p>Far, far behind, the US came in second place (3.5%) and Taiwan in third place (3%).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MliH893yCW3CV3jqpjZmw/0ca7b37530fc93682052961774b52c19/pasted-image-0--17-.png" />
            
            </figure><p>L3/4 DDoS attacks: Top attacked countries compared to all traffic</p><p>When normalizing the amount of attack traffic compared to all traffic to a country, China remains in first place and the US disappears from the top ten. Cloudflare saw that 73% of traffic to China Internet networks were attacks. However, the normalized ranking changes from second place on, with the Netherlands receiving the second-highest proportion of attack traffic (representing 35% of the country’s overall traffic), closely followed by Thailand, Taiwan and Brazil.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2m9M6zzn5dSwaubzw5PXQG/a3b6cdff495954f94a9173206142d8c3/pasted-image-0--18-.png" />
            
            </figure><p>L3/4 DDoS attacks: Top attacked countries compared to their own traffic</p>
    <div>
      <h2>Top attack vectors</h2>
      <a href="#top-attack-vectors">
        
      </a>
    </div>
    <p>The Domain Name System, or <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a>, serves as the phone book of the Internet. DNS helps translate the human-friendly website address (e.g., <a href="https://www.cloudflare.com/">www.cloudflare.com</a>) to a machine-friendly IP address (e.g., 104.16.124.96). By disrupting DNS servers, attackers impact the machines’ ability to connect to a website, and by doing so make websites unavailable to users.</p><p>For the second consecutive quarter, <a href="https://www.cloudflare.com/learning/ddos/dns-flood-ddos-attack/">DNS-based DDoS attacks</a> were the most common. Almost 47% of all attacks were DNS-based. This represents a 44% increase compared to the previous quarter. <a href="https://www.cloudflare.com/learning/ddos/syn-flood-ddos-attack/">SYN floods</a> remain in second place, followed by RST floods, <a href="https://www.cloudflare.com/learning/ddos/udp-flood-ddos-attack/">UDP floods</a>, and <a href="https://www.cloudflare.com/learning/ddos/glossary/mirai-botnet/">Mirai attacks</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5YGOLx5BKnhvVEXB8awpxU/2b651344c3572a0c0aa2e5fdf3dbf9f2/pasted-image-0--19-.png" />
            
            </figure><p>Top attack vectors</p>
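<p>A DNS flood ultimately consists of ordinary-looking DNS queries sent at overwhelming volume. As a rough sketch of what a single such query looks like on the wire, here is a minimal RFC 1035 A-record query built with Python’s standard library (the hostname is just an example):</p>

```python
import struct

def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query packet (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, AN/NS/AR counts = 0
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN)
    return header + qname + struct.pack(">HH", 1, 1)

query = build_dns_query("www.cloudflare.com")
```

<p>A 36-byte UDP payload like this can elicit a much larger response, which is what makes DNS such an attractive vector for amplification.</p>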
    <div>
      <h3>Emerging threats - <i>reduced, reused and recycled</i></h3>
      <a href="#emerging-threats-reduced-reused-and-recycled">
        
      </a>
    </div>
    <p>Aside from the most common attack vectors, we also saw significant increases in lesser-known attack vectors. These tend to be very volatile as threat actors try to <i>“reduce, reuse and recycle”</i> older attack vectors. They are typically UDP-based protocols that can be exploited to launch amplification and reflection DDoS attacks.</p><p>One well-known tactic that we continue to see is the use of amplification/reflection attacks. In this attack method, the attacker bounces traffic off of servers and aims the responses towards their victim. Attackers are able to aim the bounced traffic at their victim by various methods such as <a href="https://www.cloudflare.com/learning/ddos/glossary/ip-spoofing/">IP spoofing</a>.</p><p>Another form of reflection is achieved in an attack known as a ‘DNS Laundering’ attack. In a DNS Laundering attack, the attacker queries subdomains of a domain that is managed by the victim’s DNS server. The prefix that defines the subdomain is randomized and is never used more than once or twice in such an attack. Due to the randomization element, recursive DNS servers will never have a cached response and will need to forward the query to the victim’s authoritative DNS server. The authoritative DNS server is then bombarded with so many queries that it cannot serve legitimate queries, or even crashes altogether.</p>
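<p>The randomization element at the heart of a DNS Laundering attack can be illustrated in a few lines of Python. With a large enough alphabet, every generated name is effectively unique, so a recursive resolver can never answer from cache (the zone name below is a placeholder):</p>

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits

def random_subdomain(victim_zone: str, length: int = 12) -> str:
    """Generate a never-before-seen subdomain under the victim's zone.

    Because the prefix is random, recursive resolvers will not have a
    cached answer and must forward the query to the victim's
    authoritative DNS server.
    """
    prefix = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{prefix}.{victim_zone}"

# 1,000 queries produce 1,000 distinct names: a guaranteed cache miss each time.
names = {random_subdomain("example.com") for _ in range(1000)}
```

<p>This is why caching at the recursive layer offers the authoritative server no relief: every query in the flood is, by construction, a cache miss.</p>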
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7sMVuA99lvODrsKQNBXFqs/e67a7b2cc1c7f07ba0642b25768a7142/pasted-image-0--20-.png" />
            
            </figure><p>Illustration of a reflection and amplification attack</p><p>Overall in Q3, Multicast DNS (mDNS) based DDoS attacks were the attack method that increased the most. In second place were attacks that exploit the Constrained Application Protocol (CoAP), and in third, the Encapsulating Security Payload (ESP). Let’s get to know those attack vectors a little better.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7zpAljr3yzAfQXITi9rSnX/83db014c34690385fa4f09f15171b0af/pasted-image-0--21-.png" />
            
            </figure><p>Main emerging threats</p>
    <div>
      <h3>mDNS DDoS attacks increased by 456%</h3>
      <a href="#mdns-ddos-attacks-increased-by-456">
        
      </a>
    </div>
    <p>Multicast DNS (mDNS) is a UDP-based protocol that is used in local networks for service/device discovery. Vulnerable mDNS servers respond to unicast queries originating outside the local network. Attackers ‘spoof’ (forge) the source address of those queries to be the victim’s address, directing the responses at the victim and resulting in amplification attacks. In Q3, we noticed a large increase in mDNS attacks: a 456% increase compared to the previous quarter.</p>
    <div>
      <h3>CoAP DDoS attacks increased by 387%</h3>
      <a href="#coap-ddos-attacks-increased-by-387">
        
      </a>
    </div>
    <p>The Constrained Application Protocol (CoAP) is designed for use in simple electronics and enables communication between devices in a low-power and lightweight manner. However, it can be abused for DDoS attacks via <a href="https://www.cloudflare.com/learning/ddos/glossary/ip-spoofing/">IP spoofing</a> or amplification, as malicious actors exploit its multicast support or leverage poorly configured CoAP devices to generate large amounts of unwanted network traffic. This can lead to service disruption or overloading of the targeted systems, making them unavailable to legitimate users.</p>
    <div>
      <h3>ESP DDoS attacks increased by 303%</h3>
      <a href="#esp-ddos-attacks-increased-by-303">
        
      </a>
    </div>
    <p>The Encapsulating Security Payload (<a href="https://www.cloudflare.com/learning/network-layer/what-is-ipsec/#:~:text=Encapsulating%20Security%20Protocol%20(ESP)">ESP</a>) protocol is part of <a href="https://www.cloudflare.com/learning/network-layer/what-is-ipsec/">IPsec</a> and provides confidentiality, authentication, and integrity to network communications. However, it could potentially be abused in DDoS attacks if malicious actors exploit misconfigured or vulnerable systems to reflect or amplify traffic towards a target, leading to service disruption. Like with other protocols, securing and properly configuring the systems using ESP is crucial to mitigate the risks of DDoS attacks.</p>
    <div>
      <h2>Ransom DDoS attacks</h2>
      <a href="#ransom-ddos-attacks">
        
      </a>
    </div>
    <p>Occasionally, DDoS attacks are carried out to extort ransom payments. We’ve been surveying Cloudflare customers for over three years now, and have been tracking the occurrence of <a href="https://www.cloudflare.com/learning/ddos/ransom-ddos-attack/">Ransom DDoS attack</a> events.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78OHC3lsv0ffsK9Yul14GD/d7006d3f770e766373d0c334204f668b/Untitled--1--1.png" />
            
            </figure><p>Comparison of Ransomware and Ransom DDoS attacks</p><p>Unlike <a href="https://www.cloudflare.com/learning/security/ransomware/what-is-ransomware/">Ransomware</a> attacks, where victims typically fall prey to downloading a malicious file or clicking on a compromised email link which locks, deletes, or leaks their files until a ransom is paid, <a href="https://www.cloudflare.com/learning/ddos/ransom-ddos-attack/">Ransom DDoS attacks</a> can be much simpler for threat actors to execute. Ransom DDoS attacks bypass the need for deceptive tactics such as luring victims into opening dubious emails or clicking on fraudulent links, and they don't necessitate a breach into the network or access to corporate resources.</p><p>Over the past quarter, reports of Ransom DDoS attacks continue to decrease. Approximately 8% of respondents reported being threatened or subject to Ransom DDoS attacks, which continues a decline we've been tracking throughout the year. Hopefully it is because threat actors have realized that organizations will not pay them (which is our <a href="https://www.cloudflare.com/ransom-ddos/">recommendation</a>).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19HOqqLzAWDtf28tnsQg5w/03bb32015896bb211edcb3f63e142e09/pasted-image-0--22-.png" />
            
            </figure><p>Ransom DDoS attacks by quarter</p><p>However, keep in mind that this is also very seasonal, and we can expect an increase in ransom DDoS attacks during the months of November and December. If we look at Q4 numbers from the past three years, we can see that Ransom DDoS attacks have been significantly increasing YoY in November. In previous Q4s, it reached a point where one out of every four respondents reported being subject to Ransom DDoS attacks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ch5GcIORAOzfXzBwd8XsM/3d881a3b619861bb82bb1d93d11f01f4/pasted-image-0--23-.png" />
            
            </figure>
    <div>
      <h2>Improving your defenses in the era of hyper-volumetric DDoS attacks</h2>
      <a href="#improving-your-defenses-in-the-era-of-hyper-volumetric-ddos-attacks">
        
      </a>
    </div>
    <p>In the past quarter, we saw an unprecedented surge in DDoS attack traffic. This surge was largely driven by the hyper-volumetric HTTP/2 DDoS attack campaign.</p><p>Cloudflare customers using our HTTP reverse proxy, i.e. our CDN/WAF services, are <a href="https://www.cloudflare.com/h2/">already protected</a> from these and other HTTP DDoS attacks. Cloudflare customers that are using non-HTTP services and organizations that are not using Cloudflare at all are strongly encouraged to use an automated, always-on HTTP DDoS Protection service for their HTTP applications.</p><p>It’s important to remember that security is a process, not a single product or flip of a switch. On top of our automated DDoS protection systems, we offer comprehensive bundled features such as <a href="https://developers.cloudflare.com/firewall/cf-firewall-rules/">firewall</a>, <a href="https://developers.cloudflare.com/bots/">bot detection</a>, <a href="https://www.cloudflare.com/application-services/solutions/api-security/">API protection</a>, and <a href="https://developers.cloudflare.com/cache/">caching</a> to bolster your defenses. Our multi-layered approach optimizes your security posture and minimizes potential impact. We’ve also put together a <a href="https://developers.cloudflare.com/ddos-protection/best-practices/respond-to-ddos-attacks/">list of recommendations</a> to help you optimize your defenses against DDoS attacks, and you can follow our step-by-step wizards to <a href="https://developers.cloudflare.com/learning-paths/application-security/">secure your applications</a> and <a href="https://developers.cloudflare.com/learning-paths/prevent-ddos-attacks/">prevent DDoS attacks</a>.</p><p>...<b>Report methodologies</b>: Learn more about our methodologies and how we generate these insights: <a href="https://developers.cloudflare.com/radar/reference/quarterly-ddos-reports">https://developers.cloudflare.com/radar/reference/quarterly-ddos-reports</a></p>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[DDoS Reports]]></category>
            <category><![CDATA[Insights]]></category>
            <category><![CDATA[Trends]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Rapid Reset]]></category>
            <guid isPermaLink="false">M67SmSyk26u5hjiQgLBKv</guid>
            <dc:creator>Omer Yoachimik</dc:creator>
            <dc:creator>Jorge Pacheco</dc:creator>
        </item>
        <item>
            <title><![CDATA[Connection coalescing with ORIGIN Frames: fewer DNS queries, fewer connections]]></title>
            <link>https://blog.cloudflare.com/connection-coalescing-with-origin-frames-fewer-dns-queries-fewer-connections/</link>
            <pubDate>Mon, 04 Sep 2023 13:00:51 GMT</pubDate>
            <description><![CDATA[ In this blog we’re going to take a closer look at “connection coalescing”, with a specific focus on managing it at a large scale ]]></description>
            <content:encoded><![CDATA[ <p><i>This blog reports and summarizes the contents of a Cloudflare </i><a href="https://research.cloudflare.com/publications/Singanamalla2022/"><i>research paper</i></a><i> which appeared at the ACM </i><a href="https://conferences.sigcomm.org/imc/2022/program/"><i>Internet Measurement Conference</i></a><i>, that measures and prototypes connection coalescing with ORIGIN Frames.</i></p><p>Some readers might be surprised to hear that a single visit to a web page can cause a browser to make tens, sometimes even hundreds, of web connections. Take this very blog as an example. If it is your first visit to the Cloudflare blog, or it has been a while since your last visit, your browser will make multiple connections to render the page. The browser will make DNS queries to find IP addresses corresponding to blog.cloudflare.com and then subsequent requests to retrieve any necessary subresources on the web page needed to successfully render the complete page. How many? Looking below, at the time of writing, there are 32 different hostnames used to load the Cloudflare Blog. That means 32 DNS queries and <i>at least</i> 32 TCP (or QUIC) connections, unless the client is able to reuse (or coalesce) some of those connections.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5iVEUq8ZQ8HsPbg1FAP5jq/de5899ea338a7628e732558caf2f8710/Screenshot-2023-09-03-at-18.34.41.png" />
            
            </figure><p>Each new web connection not only introduces additional load on a server's processing capabilities – potentially leading to scalability challenges during peak usage hours – but also exposes client metadata to the network, such as the plaintext hostnames being accessed by an individual. Such meta information can potentially reveal a user’s online activities and browsing behaviors to on-path network adversaries and eavesdroppers!</p><p>In this blog we’re going to take a closer look at “connection coalescing”. Since our initial look at <a href="/connection-coalescing-experiments/">IP-based coalescing in 2021</a>, we have done further large-scale measurements and modeling across the Internet, to understand and predict if and where coalescing would work best. Since IP coalescing is difficult to manage at large scale, last year we implemented and experimented with a promising standard called the <a href="https://datatracker.ietf.org/doc/rfc8336/">HTTP/2 ORIGIN Frame extension</a> that we leveraged to coalesce connections to our edge without worrying about managing IP addresses.</p><p>All told, there are opportunities being missed by many large providers. We hope that this blog (and our <a href="https://research.cloudflare.com/publications/Singanamalla2022/">publication</a> at ACM IMC 2022 with full details) offers a first step that helps servers and clients take advantage of the ORIGIN Frame standard.</p>
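<p>The hostname counting described above can be approximated with a short script: parse a page’s HTML and collect every unique hostname referenced by <code>src</code> and <code>href</code> attributes. The HTML fragment below is a small, made-up example:</p>

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class HostnameCollector(HTMLParser):
    """Collect the hostnames a browser would contact for a page's subresources."""

    def __init__(self):
        super().__init__()
        self.hostnames = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only absolute URLs name a (potentially) new hostname.
            if name in ("src", "href") and value and value.startswith("http"):
                host = urlparse(value).hostname
                if host:
                    self.hostnames.add(host)

html = """
<html><head>
  <link href="https://cdnjs.cloudflare.com/ajax/libs/x.css" rel="stylesheet">
  <script src="https://static.cloudflareinsights.com/beacon.min.js"></script>
  <img src="https://cf-assets.www.cloudflare.com/logo.png">
</head></html>
"""
collector = HostnameCollector()
collector.feed(html)
```

<p>Each hostname in the resulting set represents at least one DNS query and at least one TCP (or QUIC) connection, unless the client can coalesce.</p>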
    <div>
      <h3>Setting the stage</h3>
      <a href="#setting-the-stage">
        
      </a>
    </div>
    <p>At a high level, as a user navigates the web, the browser renders web pages by retrieving dependent subresources to construct the complete web page. This process bears a striking resemblance to the way physical products are assembled in a factory. In this sense, a modern web page can be considered like an assembly plant. It relies on a ‘supply chain’ of resources that are needed to produce the final product.</p><p>An assembly plant in the physical world can place a single order for different parts and get a single shipment from the supplier (similar to the <a href="https://www.sciencedirect.com/science/article/abs/pii/092552739290109K">kitting process</a> for maximizing value and minimizing response time); no matter the manufacturer of those parts or where they are made -- one ‘connection’ to the supplier is all that is needed. Any single truck from a supplier to an assembly plant can be filled with parts from multiple manufacturers.</p><p>The design of the web causes browsers to typically do the opposite in nature. To retrieve the images, JavaScript, and other resources on a web page (the parts), web clients (assembly plants) have to make <i>at least</i> one connection to every hostname (the manufacturers) defined in the HTML that is returned by the server (the supplier). It makes no difference if the connections to those hostnames go to the same server or not, for example they could go to a <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">reverse proxy</a> like Cloudflare. For each manufacturer a ‘new’ truck would be needed to transfer the materials to the assembly plant from the same supplier, or more formally, a new connection would need to be made to request a subresource from a hostname on the same web page.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3jdJ2PTbkKJKMFxbWD6RgD/40b366c9315d9c55e2409e1654bc0a3d/pasted-image-0--4-.png" />
            
            </figure><p>Without connection coalescing</p><p>The number of connections used to load a web page can be surprisingly high. It is also common for the subresources to need yet other sub-subresources, and so new connections emerge as a result of earlier ones. Remember, too, that HTTP connections to hostnames are often preceded by DNS queries! Connection coalescing allows us to use fewer connections, or <i>‘reuse’ the same set of trucks to carry parts from multiple manufacturers from a single supplier</i>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jJx6aGfQCngN6w6D4jpeE/1dcfe2b5e3f8128d8837512ec1a959c9/pasted-image-0--5-.png" />
            
            </figure><p>With connection coalescing </p>
    <div>
      <h3>Connection coalescing in principle</h3>
      <a href="#connection-coalescing-in-principle">
        
      </a>
    </div>
    <p>Connection coalescing was <a href="https://datatracker.ietf.org/doc/html/rfc7540">introduced in HTTP/2</a>, and carried over into <a href="https://www.rfc-editor.org/rfc/rfc9114.html#name-connection-reuse">HTTP/3</a>. We’ve blogged about connection coalescing <a href="/connection-coalescing-experiments/">previously</a> (for a detailed primer we encourage going over that blog). While the idea is simple, implementing it can present a number of engineering challenges. For example, recall from above that there are 32 hostnames (at the time of writing) to load the web page you are reading right now. Among the 32 hostnames are 16 unique domains (defined as “Effective <a href="https://www.cloudflare.com/learning/dns/top-level-domain/">TLD+1</a>”). Can we create fewer connections or ‘coalesce’ existing connections for each unique domain? The answer is ‘<i>Yes, but it depends</i>’.</p><p>The exact number of connections to load the blog page is not at all obvious, and hard to know. There may be 32 hostnames attached to 16 domains but, counter-intuitively, this does not mean the answer to “how many unique connections?” is 16. The true answer could be as few as <i>one</i> connection if all the hostnames are reachable at a single server; or as many as 32 independent connections if a different and distinct server is needed to access each individual hostname.</p><p>Connection reuse comes in many forms, so it’s important to define “connection coalescing” in the HTTP space. For example, the reuse of an existing <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP</a> or TLS connection to a hostname to make multiple requests for subresources from that <b><i>same</i></b> hostname is connection reuse, but not coalescing.</p><p>Coalescing occurs when an existing TLS channel for some hostname can be repurposed or used for connecting to a <b><i>different</i></b> hostname. 
For example, upon visiting blog.cloudflare.com, the HTML points to subresources at cdnjs.cloudflare.com. To reuse the same TLS connection for the subresources, it is necessary for both hostnames to appear together in the TLS certificate's <a href="https://en.wikipedia.org/wiki/Subject_Alternative_Name">“Subject Alternative Name (SAN)”</a> list, but this step alone is not sufficient to convince browsers to coalesce. After all, the cdnjs.cloudflare.com service may or may not be reachable at the same server as blog.cloudflare.com, despite being on the same certificate. So how can the browser know? Coalescing only works if servers set up the right conditions, but clients have to decide whether to coalesce or not – thus, browsers require a signal to coalesce beyond the SANs list on the certificate. Revisiting our analogy, the assembly plant may order a part from a manufacturer directly, not knowing that the supplier already has the same part in its warehouse.</p><p>There are two explicit signals a browser can use to decide whether connections can be coalesced: one is IP-based, the other ORIGIN Frame-based. The former requires the server operators to tightly bind DNS records to the HTTP resources available on the server. This is difficult to manage and deploy, and actually creates a risky dependency, because you have to place all the resources behind a specific set or a single IP address. The way IP addresses influence coalescing decisions <a href="https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/">varies among browsers</a>, with some choosing to be more conservative and others more permissive. 
Alternatively, the HTTP ORIGIN Frame is an easier signal for servers to orchestrate; it’s also flexible, and fails gracefully with no interruption to service (for a specification-compliant implementation).</p><p>A foundational difference between these two coalescing signals is that IP-based coalescing signals are implicit, even accidental, and force clients to infer coalescing possibilities that may exist, or not. None of this is surprising since IP addresses are designed to <a href="/addressing-agility/">have no real relationship with names!</a> In contrast, ORIGIN Frame is an explicit signal from servers to clients that coalescing is available no matter what DNS says for any particular hostname.</p><p>We have experimented with <a href="/connection-coalescing-experiments/">IP-based coalescing previously</a>; for the purpose of this blog we will take a deeper look at ORIGIN Frame-based coalescing.</p>
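<p>Appearing in the certificate’s SAN list is the necessary (though, as noted, not sufficient) condition for coalescing. A simplified sketch of the check a client performs, following the usual wildcard rule (RFC 6125) that <code>*</code> matches exactly one leftmost label:</p>

```python
def san_covers(hostname: str, san_list: list[str]) -> bool:
    """Return True if any SAN entry covers the hostname.

    A wildcard ('*') matches exactly one leftmost label, so
    *.cloudflare.com covers blog.cloudflare.com but not
    a.b.cloudflare.com, nor cloudflare.com itself.
    """
    labels = hostname.lower().split(".")
    for san in san_list:
        san_labels = san.lower().split(".")
        if san_labels == labels:
            return True  # exact match
        if (
            san_labels
            and san_labels[0] == "*"
            and len(san_labels) == len(labels)
            and san_labels[1:] == labels[1:]
        ):
            return True  # wildcard match on the leftmost label
    return False
```

<p>Real clients layer further checks on top of this (certificate validity, and, for coalescing, an IP match or an ORIGIN Frame), but SAN coverage is where every coalescing decision starts.</p>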
    <div>
      <h3>What is the ORIGIN Frame standard?</h3>
      <a href="#what-is-the-origin-frame-standard">
        
      </a>
    </div>
    <p>The ORIGIN Frame is an extension to the <a href="https://www.rfc-editor.org/rfc/rfc8336">HTTP/2</a> and <a href="https://www.rfc-editor.org/rfc/rfc9412">HTTP/3</a> specifications: a special Frame sent on stream 0 or the control stream of the connection, respectively. The Frame allows servers to send an ‘origin-set’ to clients on an <i>existing</i> established TLS connection, listing hostnames that the server is authoritative for and that will not incur any <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/421">HTTP 421 errors</a>. Hostnames in the origin-set MUST also appear in the certificate SAN list for the server, even if those hostnames are announced on different IP addresses via DNS.</p><p>Specifically, two different steps are required:</p><ol><li><p>Web servers must send a list enumerating the Origin Set (the hostnames that a given connection might be used for) in the ORIGIN Frame extension.</p></li><li><p>The TLS certificate returned by the web server must cover, in its SAN entries, the additional hostnames returned in the ORIGIN Frame.</p></li></ol><p>At a high level, ORIGIN Frames are a supplement to the TLS certificate that operators can attach to say, “Psst! Hey, client, here are the names in the SANs that are available on this connection -- you can coalesce!” Since the ORIGIN Frame is not part of the certificate itself, its contents can be changed independently. No new certificate is required. There is also no dependency on IP addresses. For a coalesceable hostname, existing TCP/QUIC+TLS connections can be reused without requiring new connections or DNS queries.</p><p><a href="https://w3techs.com/technologies/overview/proxy">Many websites today</a> rely on content which is served by CDNs, like Cloudflare CDN service. 
Using an external CDN service offers websites speed and reliability, and reduces the load on their <a href="https://www.cloudflare.com/learning/cdn/glossary/origin-server/">origin servers</a>. When both the website and its resources are served by the same CDN, despite being different hostnames owned by different entities, some very interesting opportunities open up for CDN operators to allow connections to be reused and coalesced, since they control both certificate management and connection handling, and can send ORIGIN frames on behalf of the real origin server.</p><p>Unfortunately, there has been no way to turn the possibilities enabled by the ORIGIN Frame into practice. To the best of our knowledge, until today there has been no server implementation that supports ORIGIN Frames, and among browsers only Firefox supports them. Since IP coalescing is challenging and the ORIGIN Frame has no deployed support, is the engineering time and energy to better support coalescing worth the investment? We decided to find out with a large-scale, Internet-wide measurement to understand the opportunities and predict the possibilities, and then implemented the ORIGIN Frame to experiment on production traffic.</p>
    <div>
      <h3>Experiment #1: What is the scale of required changes?</h3>
      <a href="#experiment-1-what-is-the-scale-of-required-changes">
        
      </a>
    </div>
    <p>In February 2021, <a href="/connection-coalescing-experiments/">we collected data</a> for 500K of the <a href="https://radar.cloudflare.com/domains">most popular websites</a> on the Internet, using a modified <a href="https://github.com/WPO-Foundation/webpagetest">Web Page Test</a> on 100 virtual machines. An automated Chrome (v88) browser instance was launched for every visit to a web page to eliminate caching effects (because we wanted to understand coalescing, not caching). On successful completion of each session, Chrome developer tools were used to retrieve and write the page load data as an HTTP Archive format (HAR) file with a full timeline of events, as well as additional information about certificates and their validation. Additionally, we parsed the certificate chains for the root web page and new TLS connections triggered by subresource requests to (i) identify certificate issuers for the hostnames, (ii) inspect the presence of the Subject Alternative Name (SAN) extension, and (iii) validate that DNS names resolve to the IP address used. Further details about our methodology and results can be found in the technical <a href="https://research.cloudflare.com/publications/Singanamalla2022/">paper</a>.</p><p>The first step was to understand what resources are requested by web pages to successfully render the page contents, and where these resources were present on the Internet. Connection coalescing becomes possible when subresource domains are ideally co-located. We approximated the location of a domain by finding its corresponding autonomous system (AS). For example, the domain attached to <a href="https://cdnjs.cloudflare.com/https://cdnjs.cloudflare.com/">cdnjs</a> is reachable via AS 13335 in the BGP routing table, and that AS number belongs to Cloudflare. The figure below describes the percentage of web pages and the number of unique ASes needed to fully load a web page.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3H188I6GyYxiqZEYxBPMPp/4bde6b40a523d0e66207c87cee40755c/Screenshot-2023-08-31-at-1.39.16-PM.png" />
            
            </figure><p>Around 14% of the web pages need two ASes to fully load, i.e. pages that have a dependency on one additional AS for subresources. More than 50% of the web pages need to contact no more than six ASes to obtain all the necessary subresources. This finding, shown in the plot above, implies that a relatively small number of operators serve the subresource content necessary for a majority (~50%) of websites, so ORIGIN Frames would need only a few changes to have their intended impact. The potential for connection coalescing can therefore be optimistically approximated by the number of unique ASes needed to retrieve all subresources in a web page. In practice, however, this may be superseded by operational factors such as SLAs, or helped by flexible mappings between sockets, names, and IP addresses, which we worked on <a href="https://research.cloudflare.com/publications/Fayed2021/">previously at Cloudflare</a>.</p><p>We then tried to understand the impact of coalescing on connection metrics. The measured and ideal numbers of DNS queries and TLS connections needed to load a web page are summarized by their CDFs in the figure below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aCaLsYdjNqeuQAuVSfQ6c/beae2aeaf3e96830f9562c09aeb0a2cd/Screenshot-2023-08-31-at-1.39.02-PM.png" />
            
            </figure><p>Through modeling and extensive analysis, we identified that connection coalescing through ORIGIN Frames could reduce the number of DNS queries and TLS connections made by browsers by over 60% at the median. We performed this modeling by identifying the number of times the clients requested DNS records, and combined them with the ideal ORIGIN Frames to serve.</p><p>Many multi-origin servers, such as those operated by <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDNs</a>, tend to reuse certificates and serve the same certificate with multiple DNS SAN entries. This allows operators to manage fewer certificates through their creation and renewal cycles. While a certificate can theoretically carry millions of names, creating such certificates is unreasonable and they are a challenge to manage effectively. Assuming operators continue to rely on existing certificates, our modeling brings to light the scale of changes required to enable perfect coalescing, as highlighted in the figure below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5rIcRVeD6UEGNCW9dj3MYM/187f0bc53e9afedf3baca072410eb4db/Screenshot-2023-08-31-at-1.38.35-PM.png" />
            
            </figure><p>We identified that over 60% of the certificates served by websites need no modifications and could benefit from ORIGIN Frames, while with no more than 10 additions to the DNS SAN names in certificates we are able to successfully coalesce connections to over 92% of the websites in our measurement. The most effective changes could be made by CDN providers adding three or four of their most popular requested hostnames into each certificate.</p>
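<p>The client-side precondition these SAN additions satisfy is mechanical: before coalescing a request onto an existing connection, the browser checks that the connection’s certificate covers the new hostname. Below is a minimal sketch of that check using Go’s standard <code>crypto/x509</code> with a throwaway self-signed certificate; the hostnames are illustrative, and real clients perform this as part of full chain validation.</p>

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// makeCert creates a throwaway self-signed certificate whose SAN list
// contains the given DNS names.
func makeCert(names []string) *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: names[0]},
		DNSNames:     names, // the SAN entries that gate coalescing
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	// One extra SAN entry makes the shared subresource host coalesceable.
	cert := makeCert([]string{"site.example", "cdnjs.cloudflare.com"})
	fmt.Println(cert.VerifyHostname("cdnjs.cloudflare.com") == nil) // covered
	fmt.Println(cert.VerifyHostname("other.example") == nil)       // not covered
}
```

<p>An ORIGIN frame can advertise any hostname, but <code>VerifyHostname</code> failing for it means the client must fall back to a fresh connection.</p>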
    <div>
      <h3>Experiment #2: ORIGIN Frames in action</h3>
      <a href="#experiment-2-origin-frames-in-action">
        
      </a>
    </div>
    <p>In order to validate our modeling expectations, we then took a more active approach in early 2022. Our next experiment focused on 5,000 websites that make extensive use of <i>cdnjs.cloudflare.com</i> as a subresource. By modifying our experimental TLS termination endpoint we deployed HTTP/2 ORIGIN Frame support as defined in the <a href="https://datatracker.ietf.org/doc/rfc8336/">RFC standard</a>. This involved changing the internal fork of <i>net</i> and <i>http</i> dependency modules of Golang which we have open sourced (<a href="https://github.com/cloudflare/go-originframe">see here</a>, and <a href="https://github.com/cloudflare/net-originframe">here</a>).</p><p>During the experiments, connecting to a website in the experiment set would return <i>cdnjs.cloudflare.com</i> in the ORIGIN frame, while the control set returned an arbitrary (unused) hostname. All existing edge certificates for the 5000 websites were also modified. For the experimental group, the corresponding certificates were renewed with <i>cdnjs.cloudflare.com</i> added to the SAN. To ensure integrity between control and experimental sets, control group domains certificates were also renewed with a valid and identical size third party domain used by none of the control domains. This is done to ensure that the relative size changes to the certificates is kept constant avoiding potential biases due to different certificate sizes. Our results were striking!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2jGJXHliykd9CGh3lNujpq/bc1b9faf871d7d4c6ecf7041966078cb/Screenshot-2023-08-31-at-1.38.47-PM.png" />
            
            </figure><p>Sampling 1% of the requests we received from Firefox for the websites in the experiment, we identified over a <b>50% reduction in new TLS connections per second</b>, indicating fewer cryptographic verification operations by the client and reduced compute overhead on the server. As expected, there were no differences in the control set, confirming the effectiveness of connection reuse as seen by the CDN or server operators.</p>
    <div>
      <h3>Discussion and insights</h3>
      <a href="#discussion-and-insights">
        
      </a>
    </div>
    <p>While our modeling measurements indicated that we could anticipate some performance improvements, in practice performance was not significantly better, suggesting that ‘no worse’ is the appropriate mental model. The subtle interplay between resource object sizes, competing connections, and congestion control is subject to network conditions; bottleneck-share capacity, for example, diminishes as fewer connections compete for bottleneck resources on network links. It would be interesting to revisit these measurements as more operators deploy ORIGIN Frame support on their servers.</p><p>Apart from performance, one major benefit of ORIGIN Frames is privacy. How? Each coalesced connection hides client metadata that is otherwise leaked by non-coalesced connections. Certain resources on a web page are loaded depending on how one interacts with the website. This means that every new connection made to retrieve a resource exposes TLS plaintext metadata like the <a href="https://www.cloudflare.com/learning/ssl/what-is-sni/">SNI</a> (in the absence of <a href="/encrypted-client-hello/">Encrypted Client Hello</a>) and at least one plaintext DNS query (if transmitted over UDP or TCP on port 53) to the network. Coalescing removes the need for browsers to open new TLS connections or issue extra DNS queries, minimizing the cleartext information leaked to on-path eavesdroppers.</p><p>While browsers benefit from the reduced cryptographic computation needed to verify multiple certificates, a major advantage is that coalescing opens up very interesting future opportunities for resource scheduling at the endpoints (the browsers and the origin servers), such as <a href="/better-http-3-prioritization-for-a-faster-web/">prioritization</a> or recent proposals like <a href="/early-hints/">HTTP Early Hints</a>, giving clients experiences where connections are not overloaded or competing for resources. When coupled with the <a href="https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-http2-secondary-certs-06#section-3.4">CERTIFICATE Frames</a> IETF draft, we can further eliminate the need for manual certificate modifications: a server could prove its authority over hostnames after connection establishment, without any additional SAN entries on the website’s TLS certificate.</p>
    <div>
      <h3>Conclusion and call to action</h3>
      <a href="#conclusion-and-call-to-action">
        
      </a>
    </div>
    <p>In summary, the current Internet ecosystem has a lot of opportunities for connection coalescing with only a few changes to certificates and their server infrastructure. Servers can significantly reduce the number of TLS handshakes by roughly 50%, while reducing the number of render blocking DNS queries by over 60%. Clients additionally reap these benefits in privacy by reducing cleartext DNS exposure to network on-lookers.</p><p>To help make this a reality we are currently planning to add support for both HTTP/2 and HTTP/3 ORIGIN Frames for our customers. We also encourage other operators that manage third party resources to adopt support of ORIGIN Frame to improve the Internet ecosystem.Our paper submission was accepted to the ACM Internet Measurement Conference 2022 and is <a href="https://research.cloudflare.com/publications/Singanamalla2022/">available for download</a>. If you’d like to work on projects like this, where you get to see the rubber meet the road for new standards, visit our <a href="https://www.cloudflare.com/careers/">careers page</a>!</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Internship Experience]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">QjYiQB1Bf6uRL71yURBMi</guid>
            <dc:creator>Suleman Ahmad</dc:creator>
            <dc:creator>Jonathan Hoyland</dc:creator>
            <dc:creator>Sudheesh Singanamalla</dc:creator>
        </item>
        <item>
            <title><![CDATA[The state of HTTP in 2022]]></title>
            <link>https://blog.cloudflare.com/the-state-of-http-in-2022/</link>
            <pubDate>Fri, 30 Dec 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ So what happened at all of those working group meetings, specification documents, and side events in 2022? What are implementers and deployers of the web’s protocol doing? And what’s coming next? ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YG57TW3Ue3Z2iGZU7dO4u/853a3d2f76588ca638c4e24727f5eca5/http3-tube_2x-2.png" />
            
            </figure><p>At over thirty years old, HTTP is still the foundation of the web and one of the Internet’s most popular protocols—not just for browsing, watching videos and listening to music, but also for apps, machine-to-machine communication, and even as a basis for building other protocols, forming what some refer to as a “second waist” in the classic Internet hourglass diagram.</p><p>What makes HTTP so successful? One answer is that it hits a “sweet spot” for most applications that need an application protocol. “<a href="https://httpwg.org/specs/rfc9205.html">Building Protocols with HTTP</a>” (published in 2022 as a Best Current Practice RFC by the <a href="https://httpwg.org/">HTTP Working Group</a>) argues that HTTP’s success can be attributed to factors like:</p><ul><li><p>familiarity by implementers, specifiers, administrators, developers, and users;</p></li><li><p>availability of a variety of client, server, and proxy implementations;</p></li><li><p>ease of use;</p></li><li><p>availability of web browsers;</p></li><li><p>reuse of existing mechanisms like authentication and encryption;</p></li><li><p>presence of HTTP servers and clients in target deployments; and</p></li><li><p>its ability to traverse firewalls.</p></li></ul><p>Another important factor is the community of people using, implementing, and standardising HTTP. We work together to maintain and develop the protocol actively, to assure that it’s interoperable and meets today’s needs. If HTTP stagnates, another protocol will (justifiably) replace it, and we’ll lose all the community’s investment, shared understanding and interoperability.</p><p>Cloudflare and many others do this by sending engineers to <a href="/cloudflare-and-the-ietf/">participate in the IETF</a>, where most Internet protocols are discussed and standardised. 
We also attend and sponsor community events like the <a href="https://httpwork.shop">HTTP Workshop</a> to hear what problems people have, what they need, and what changes might help them.</p><p>So what happened at all of those working group meetings, specification documents, and side events in 2022? What are implementers and deployers of the web’s protocol doing? And what’s coming next?</p>
    <div>
      <h3>New Specification: HTTP/3</h3>
      <a href="#new-specification-http-3">
        
      </a>
    </div>
    <p>Specification-wise, the biggest thing to happen in 2022 was the publication of <a href="https://www.cloudflare.com/learning/performance/what-is-http3/">HTTP/3</a>, because it was an enormous step towards keeping up with the requirements of modern applications and sites by using the network more efficiently to unblock web performance.</p><p>Way back in the 90s, HTTP/0.9 and HTTP/1.0 used a new TCP connection for each request—an astoundingly inefficient use of the network. HTTP/1.1 introduced persistent connections (which were backported to HTTP/1.0 with the `Connection: Keep-Alive` header). This was an improvement that helped servers and the network cope with the explosive popularity of the web, but even back then, the community knew it had significant limitations—in particular, head-of-line blocking (where one outstanding request on a connection blocks others from completing).</p><p>That didn’t matter so much in the 90s and early 2000s, but today’s web pages and applications place demands on the network that make these limitations performance-critical. Pages often have hundreds of assets that all compete for network resources, and HTTP/1.1 wasn’t up to the task. After some <a href="https://www.w3.org/Protocols/HTTP-NG/">false starts</a>, the community finally <a href="/http-2-for-web-developers/">addressed these issues with HTTP/2 in 2015</a>.</p><p>However, removing head-of-line blocking in HTTP exposed the same problem one layer lower, in TCP. Because TCP is an in-order, reliable delivery protocol, loss of a single packet in a flow can block access to those after it—even if they’re sitting in the operating system’s buffers. This turns out to be a real issue for HTTP/2 deployment, especially on less-than-optimal networks.</p><p>The answer, of course, was to replace TCP—the venerable transport protocol that so much of the Internet is built upon. 
After much discussion and many drafts in the <a href="https://quicwg.org/">QUIC Working Group</a>, <a href="/quic-version-1-is-live-on-cloudflare/">QUIC version 1 was published as that replacement</a> in 2021.</p><p>HTTP/3 is the version of HTTP that uses QUIC. While the working group effectively finished it in 2021 along with QUIC, its publication was held until 2022 to synchronise with the publication of other documents (see below). 2022 was also a <a href="/cloudflare-view-http3-usage/">milestone year for HTTP/3 deployment</a>; Cloudflare saw <a href="https://radar.cloudflare.com/adoption-and-usage?range=28d">increasing adoption and confidence</a> in the new protocol.</p><p>While there was only a brief gap of a few years between HTTP/2 and HTTP/3, there isn’t much appetite for working on HTTP/4 in the community soon. QUIC and HTTP/3 are both new, and the world is still learning how best to implement the protocols, operate them, and build sites and applications using them. While we can’t rule out a limitation that will force a new version in the future, the IETF built these protocols based upon broad industry experience with modern networks, and have significant extensibility available to ease any necessary changes.</p>
    <div>
      <h3>New specifications: HTTP “core”</h3>
      <a href="#new-specifications-http-core">
        
      </a>
    </div>
    <p>The other headline event for HTTP specs in 2022 was the publication of its “core” documents -- the heart of HTTP’s specification. The core comprises: <a href="https://httpwg.org/specs/rfc9110.html">HTTP Semantics</a> - things like methods, headers, status codes, and the message format; <a href="https://httpwg.org/specs/rfc9111.html">HTTP Caching</a> - how HTTP caches work; <a href="https://httpwg.org/specs/rfc9112.html">HTTP/1.1</a> - mapping semantics to the wire, using the format everyone knows and loves.</p><p>Additionally, <a href="https://httpwg.org/specs/rfc9113.html">HTTP/2 was republished</a> to properly integrate with the Semantics document, and to fix a few outstanding issues.</p><p>This is the latest in a long series of revisions for these documents—in the past, we’ve had the RFC 723x series, the (perhaps most well-known) RFC 2616, RFC 2068, and the grandparent of them all, RFC 1945. Each revision has aimed to improve readability, fix errors, explain concepts better, and clarify protocol operation. Poorly specified (or implemented) features are deprecated; new features that improve protocol operation are added. See the ‘Changes from...’ appendix in each document for the details. And, importantly, always refer to the latest revisions linked above; the older RFCs are now obsolete.</p>
    <div>
      <h3>Deploying Early Hints</h3>
      <a href="#deploying-early-hints">
        
      </a>
    </div>
    <p>HTTP/2 included <i>server push</i>, a feature designed to allow servers to “push” a request/response pair to clients when they knew the client was going to need something, so it could avoid the latency penalty of making a request and waiting for the response.</p><p>After HTTP/2 was finalised in 2015, Cloudflare and many other HTTP implementations soon <a href="/announcing-support-for-http-2-server-push-2/">rolled out server push</a> in anticipation of big performance wins. Unfortunately, it turned out that’s harder than it looks; server push effectively requires the server to predict the future—not only what requests the client will send but also what the network conditions will be. And, when the server gets it wrong (“over-pushing”), the pushed requests directly compete with the real requests that the browser is making, representing a significant opportunity cost with real potential for harming performance, rather than helping it. The impact is even worse when the browser already has a copy in cache, so it doesn’t need the push at all.</p><p>As a result, <a href="https://developer.chrome.com/blog/removing-push/">Chrome removed HTTP/2 server push in 2022</a>. Other browsers and servers might still support it, but the community seems to agree that it’s only suitable for specialised uses currently, like the browser notification-specific <a href="https://www.rfc-editor.org/rfc/rfc8030.html">Web Push Protocol</a>.</p><p>That doesn’t mean that we’re giving up, however. The <a href="https://httpwg.org/specs/rfc8297.html">103 (Early Hints)</a> status code was published as an Experimental RFC by the HTTP Working Group in 2017. It allows a server to send <i>hints</i> to the browser in a non-final response, before the “real” final response. 
That’s useful if you know that the content is going to include links to resources the browser will fetch, but you need more time to get the final response to the client (because it will take time to generate, or because the server needs to fetch it from somewhere else, as a CDN does).</p><p>Early Hints can be used in many of the situations server push was designed for -- for example, when you have CSS and JavaScript that a page is going to need to load. In theory they’re not as optimal as server push, because hints can only be sent when there’s an outstanding request, and because getting the hints to the client and acting upon them adds some latency.</p><p>In practice, however, Cloudflare and our partners (like Shopify and Google) spent 2022 experimenting with Early Hints and found them much safer to use, with <a href="/early-hints-performance/">promising performance benefits</a> that include significant reductions in key web performance metrics.</p><p>We’re excited about the potential that Early Hints show; so excited that we’ve <a href="/early-hints-on-cloudflare-pages/">integrated them into Cloudflare Pages</a>. We’re also evaluating new ways to improve performance using this new capability in the protocol.</p>
    <div>
      <h3>Privacy-focused intermediation</h3>
      <a href="#privacy-focused-intermediation">
        
      </a>
    </div>
    <p>For many, the most exciting HTTP protocol extensions in 2022 focused on intermediation—the ability to insert proxies, gateways, and similar components into the protocol to achieve specific goals, often focused on improving privacy.</p><p>The <a href="https://ietf-wg-masque.github.io">MASQUE Working Group</a>, for example, is an effort to add new tunneling capabilities to HTTP, so that an intermediary can pass the tunneled traffic along to another server.</p><p>While CONNECT has enabled TCP tunnels for a long time, MASQUE enabled <a href="https://datatracker.ietf.org/doc/html/rfc9298">UDP tunnels</a>, allowing more protocols to be tunneled more efficiently–including QUIC and HTTP/3.</p><p>At Cloudflare, we’re enthusiastic to be working with Apple to use MASQUE to implement <a href="/icloud-private-relay/">iCloud Private Relay</a> and enhance their customers’ privacy without relying solely on one company. We’re also very interested in the Working Group’s future work, including <a href="/unlocking-quic-proxying-potential/">IP tunneling</a> that will enable MASQUE-based VPNs.Another intermediation-focused specification is <a href="https://www.ietf.org/archive/id/draft-ietf-ohai-ohttp-06.html">Oblivious HTTP</a> (or OHTTP). OHTTP uses sets of intermediaries to prevent the server from using connections or IP addresses to track clients, giving greater privacy assurances for things like collecting telemetry or other sensitive data. 
This specification is just finishing the standards process, and we’re using it to build an important new product, <a href="/building-privacy-into-internet-standards-and-how-to-make-your-app-more-private-today/">Privacy Gateway</a>, to protect the privacy of our customers’ customers.</p><p>We and many others in the Internet community believe that this is just the start, because intermediation can partition communication, a <a href="https://intarchboard.github.io/draft-obliviousness/draft-kpw-iab-privacy-partitioning.html">valuable tool for improving privacy</a>.</p>
    <div>
      <h3>Protocol security</h3>
      <a href="#protocol-security">
        
      </a>
    </div>
    <p>Finally, 2022 saw a lot of work on security-related aspects of HTTP. The <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-digest-headers.html">Digest Fields</a> specification is an update to the now-ancient `Digest` header field, allowing integrity digests to be added to messages. The <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-message-signatures.html">HTTP Message Signatures</a> specification enables cryptographic signatures on requests and responses -- something that has widespread ad hoc deployment, but until now has lacked a standard. Both specifications are in the final stages of standardisation.</p><p>A <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-rfc6265bis.html">revision of the Cookie specification</a> also saw a lot of progress in 2022, and should be final soon. Since it’s not possible to get rid of them completely soon, much work has taken place to limit how they operate to improve privacy and security, including a new `SameSite` attribute.</p><p>Another set of security-related specifications that Cloudflare has <a href="/cloudflare-supports-privacy-pass/">invested in for many years</a> is <a href="https://www.ietf.org/archive/id/draft-ietf-privacypass-architecture-09.html">Privacy Pass</a> also known as “Private Access Tokens.” These are cryptographic tokens that can assure clients are real people, not bots, without using an intrusive CAPTCHA, and without tracking the user’s activity online. In HTTP, they take the form of a <a href="https://www.ietf.org/archive/id/draft-ietf-privacypass-auth-scheme-07.html">new authentication scheme</a>.</p><p>While Privacy Pass is still not quite through the standards process, 2022 saw its <a href="/eliminating-captchas-on-iphones-and-macs-using-new-standard/">broad deployment by Apple</a>, a huge step forward. 
And since <a href="/turnstile-private-captcha-alternative/">Cloudflare uses it in Turnstile</a>, our CAPTCHA alternative, your users can have a better experience today.</p>
    <div>
      <h3>What about 2023?</h3>
      <a href="#what-about-2023">
        
      </a>
    </div>
    <p>So, what’s next? Besides, the specifications above that aren’t quite finished, the HTTP Working Group has a few other works in progress, including a <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-safe-method-w-body.html">QUERY method</a> (think GET but with a body), <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-resumable-upload.html">Resumable Uploads</a> (based on <a href="https://tus.io">tus</a>), <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-variants.html">Variants</a> (an improved Vary header for caching), <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-sfbis.html">improvements to Structured Fields</a> (including a new Date type), and a way to <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-retrofit.html">retrofit existing headers into Structured Fields</a>. We’ll write more about these as they progress in 2023.</p><p>At the <a href="https://github.com/HTTPWorkshop/workshop2022/blob/main/report.md">2022 HTTP Workshop</a>, the community also talked about what new work we can take on to improve the protocol. Some ideas discussed included improving our shared protocol testing infrastructure (right now we have a <a href="https://github.com/httpwg/wiki/wiki/HTTP-Testing-Resources">few resources</a>, but it could be much better), improving (or replacing) <a href="https://httpwg.org/specs/rfc7838.html">Alternative Services</a> to allow more intelligent and correct connection management, and more radical changes, like <a href="https://mnot.github.io/I-D/draft-nottingham-binary-structured-headers.html">alternative, binary serialisations of headers</a>.</p><p>There’s also a continuing discussion in the community about whether HTTP should accommodate pub/sub, or whether it should be standardised to work over WebSockets (and soon, WebTransport). 
Although it’s hard to say now, adjacent work on <a href="https://datatracker.ietf.org/group/moq/about/">Media over QUIC</a> that just started <i>might</i> provide an opportunity to push this forward.</p><p>Of course, that’s not everything, and what happens to HTTP in 2023 (and beyond) remains to be seen. HTTP is still evolving, even as it stays compatible with the largest distributed hypertext system ever conceived—the World Wide Web.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[HTTP3]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">2ChQ8N8Cg6LwDWafsFjWJN</guid>
            <dc:creator>Mark Nottingham</dc:creator>
        </item>
        <item>
            <title><![CDATA[Moving k8s communication to gRPC]]></title>
            <link>https://blog.cloudflare.com/moving-k8s-communication-to-grpc/</link>
            <pubDate>Sat, 20 Mar 2021 14:00:00 GMT</pubDate>
            <description><![CDATA[ How we use gRPC in combination with Kubernetes to improve the performance and usability of internal APIs. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Over the past year and a half, Cloudflare has been hard at work moving our back-end services running in our non-edge locations from bare metal solutions and Mesos Marathon to a more unified approach using <a href="https://kubernetes.io/">Kubernetes (K8s)</a>. We chose Kubernetes because it allowed us to split up our monolithic application into many different microservices with granular control of communication.</p><p>For example, a <a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/">ReplicaSet</a> in Kubernetes can provide high availability by ensuring that the correct number of pods are always available. A <a href="https://kubernetes.io/docs/concepts/workloads/pods/">Pod</a> in Kubernetes is similar to a container in <a href="https://www.docker.com/">Docker</a>. Both are responsible for running the actual application. These pods can then be exposed through a Kubernetes <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Service</a>, which abstracts away the number of replicas by providing a single endpoint that load balances to the pods behind it. The services can then be exposed to the Internet via an <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/">Ingress</a>. Lastly, a network policy can protect against unwanted communication by ensuring that the correct policies, which can include L3 or L4 rules, are applied to the application.</p><p>The diagram below shows a simple example of this setup.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33zWCqFZw2iuXfhllsHgdk/e290742e17a975a98a195bccc283297f/2-3.png" />
            
            </figure><p>Though Kubernetes does an excellent job of providing the tools for communication and traffic management, it does not help the developer decide the best way to communicate between the applications running on the pods. Throughout this blog post, we will look at some of the decisions we made and why we made them, discussing the pros and cons of two commonly used API architectures: REST and gRPC.</p>
    <div>
      <h3>Out with the old, in with the new</h3>
      <a href="#out-with-the-old-in-with-the-new">
        
      </a>
    </div>
    <p>When the DNS team first moved to Kubernetes, all of our pod-to-pod communication was done through REST APIs and in many cases also included Kafka. The general communication flow was as follows:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3GQIkngkqEYcBJNFzv1xNU/fe13bc8911a11a9c26dc5925b4fe6a19/1-5.png" />
            
            </figure><p>We use Kafka because it allows us to handle large spikes in volume without losing information. For example, during a Secondary DNS Zone zone transfer, Service A tells Service B that the zone is ready to be published to the edge. Service B then calls Service A’s REST API, generates the zone, and pushes it to the edge. If you want more information about how this works, I wrote an entire blog post about the <a href="/secondary-dns-deep-dive/">Secondary DNS pipeline</a> at Cloudflare.</p><p>HTTP worked well for most communication between these two services. However, as we scaled up and added new endpoints, we realized that as long as we control both ends of the communication, we could improve the usability and performance of our communication. In addition, sending large DNS zones over the network using HTTP often caused issues with sizing constraints and compression.</p><p>In contrast, gRPC can easily stream data between client and server and is commonly used in microservice architecture. These qualities made gRPC the obvious replacement for our REST APIs.</p>
    <div>
      <h3>gRPC Usability</h3>
      <a href="#grpc-usability">
        
      </a>
    </div>
    <p>Often overlooked from a developer’s perspective is how clunky HTTP client libraries can be: they require code that defines paths, handles parameters, and deals with responses as raw bytes. gRPC abstracts all of this away and makes network calls feel like any other function call defined on a struct.</p><p>The example below shows a very basic schema to set up a gRPC client/server system. Because gRPC uses <a href="https://developers.google.com/protocol-buffers">protobuf</a> for serialization, it is largely language agnostic. Once a schema is defined, the <i>protoc</i> command can be used to generate code for <a href="https://grpc.io/docs/languages/">many languages</a>.</p><p>Protocol Buffer data is structured as <i>messages</i>, with each <i>message</i> containing information stored in the form of fields. The fields are strongly typed, providing type safety unlike JSON or XML. Two messages have been defined, <i>Hello</i> and <i>HelloResponse</i>. Next, we define a service called <i>HelloWorldHandler</i>, which contains one RPC function called <i>SayHello</i> that must be implemented by any object that wants to call itself a <i>HelloWorldHandler</i>.</p><p>Simple Proto:</p>
            <pre><code>message Hello{
   string Name = 1;
}

message HelloResponse{}

service HelloWorldHandler {
   rpc SayHello(Hello) returns (HelloResponse){}
}</code></pre>
            <p>Once we run our <i>protoc</i> command, we are ready to write the server-side code. In order to implement the <i>HelloWorldHandler</i>, we must define a struct that implements all of the RPC functions specified in the protobuf schema above. In this case, the struct <i>Server</i> defines a function <i>SayHello</i> that takes in two parameters, a context and a <i>*pb.Hello</i>. <i>*pb.Hello</i> was previously specified in the schema and contains one field, <i>Name</i>. <i>SayHello</i> must also return a <i>*pb.HelloResponse</i>, which has been defined without fields for simplicity.</p><p>Inside the main function, we create a TCP listener, create a new gRPC server, and then register our handler as a <i>HelloWorldHandlerServer</i>. After calling <i>Serve</i> on our gRPC server, clients will be able to communicate with the server through the function <i>SayHello</i>.</p><p>Simple Server:</p>
            <pre><code>type Server struct{}

func (s *Server) SayHello(ctx context.Context, in *pb.Hello) (*pb.HelloResponse, error) {
    fmt.Printf("%s says hello\n", in.Name)
    return &amp;pb.HelloResponse{}, nil
}

func main() {
    lis, err := net.Listen("tcp", ":8080")
    if err != nil {
        panic(err)
    }
    gRPCServer := grpc.NewServer()
    handler := Server{}
    pb.RegisterHelloWorldHandlerServer(gRPCServer, &amp;handler)
    if err := gRPCServer.Serve(lis); err != nil {
        panic(err)
    }
}</code></pre>
            <p>Finally, we need to implement the gRPC client. First, we establish a TCP connection with the server. Then, we create a new <i>pb.HelloWorldHandlerClient</i>. The client is able to call the server's <i>SayHello</i> function by passing in a <i>*pb.Hello</i> object.</p><p>Simple Client:</p>
            <pre><code>conn, err := gRPC.Dial("127.0.0.1:8080", gRPC.WithInsecure())
if err != nil {
    panic(err)
}
client := pb.NewHelloWorldHandlerClient(conn)
client.SayHello(context.Background(), &amp;pb.Hello{Name: "alex"})</code></pre>
            <p>Though I have removed some code for simplicity, these <i>services</i> and <i>messages</i> can become quite complex if needed. The most important thing to understand is that when a server attempts to announce itself as a <i>HelloWorldHandlerServer</i>, it is required to implement the RPC functions as specified within the protobuf schema. This agreement between the client and server makes cross-language network calls feel like regular function calls.</p><p>In addition to the basic Unary server described above, gRPC lets you decide between four types of service methods:</p><ul><li><p><b>Unary</b> (example above): client sends a single request to the server and gets a single response back, just like a normal function call.</p></li><li><p><b>Server Streaming:</b> server returns a stream of messages in response to a client's request.</p></li><li><p><b>Client Streaming:</b> client sends a stream of messages to the server and the server replies in a single message, usually once the client has finished streaming.</p></li><li><p><b>Bi-directional Streaming:</b> the client and server can both send streams of messages to each other asynchronously.</p></li></ul>
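            <p>To make the four method types concrete, here is a hedged sketch extending the proto schema above (the streaming RPC names are hypothetical; only the placement of the <i>stream</i> keyword distinguishes the variants):</p>

```
service HelloStreamExamples {
   // Unary: one request, one response.
   rpc SayHello(Hello) returns (HelloResponse){}
   // Server streaming: one request, a stream of responses.
   rpc SubscribeHellos(Hello) returns (stream HelloResponse){}
   // Client streaming: a stream of requests, one response.
   rpc UploadHellos(stream Hello) returns (HelloResponse){}
   // Bi-directional streaming: both sides stream asynchronously.
   rpc ChatHellos(stream Hello) returns (stream HelloResponse){}
}
```

            <p>Server-side, <i>protoc</i> generates a stream object for each of these; a server-streaming handler, for example, receives a stream parameter with a <i>Send</i> method rather than returning a single response.</p>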
    <div>
      <h3>gRPC Performance</h3>
      <a href="#grpc-performance">
        
      </a>
    </div>
    <p>Not all HTTP connections are created equal. Though Golang natively supports HTTP/2, the HTTP/2 transport must be set by the client and the server must also support HTTP/2. Before moving to gRPC, we were still using HTTP/1.1 for client connections. We could have switched to HTTP/2 for performance gains, but we would have lost the usability improvements and the native protobuf serialization that gRPC provides.</p><p>The best option available in HTTP/1.1 is pipelining. Pipelining means that although requests can share a connection, they must queue up one after the other until the request in front completes. HTTP/2 improved on pipelining with connection multiplexing, which allows multiple requests to be sent on the same connection at the same time.</p><p>HTTP REST APIs generally use JSON for their request and response format. Protobuf is the native request/response format of gRPC because it has a standard schema agreed upon by the client and server during registration. In addition, protobuf is known to be significantly faster than JSON due to its serialization speeds. I’ve run some benchmarks on my laptop; the source code can be found <a href="https://github.com/Fattouche/protobuf-benchmark">here</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23TkQ80Iruo8IYflqIuobN/f7323dd5817ca47f9b203b8b63a3a980/image1-26.png" />
            
            </figure><p>As you can see, protobuf performs better in small, medium, and large data sizes. It is faster per operation, smaller after marshalling, and scales well with input size. This becomes even more noticeable when unmarshaling very large data sets. Protobuf takes 96.4ns/op but JSON takes 22647ns/op, a 235X reduction in time! For large DNS zones, this efficiency makes a massive difference in the time it takes us to go from record change in our API to serving it at the edge.</p><p>Combining the benefits of HTTP/2 and protobuf showed almost no performance change from our application’s point of view. This is likely due to the fact that our pods were already so close together that our connection times were already very low. In addition, most of our gRPC calls are done with small amounts of data where the difference is negligible. One thing that we did notice <b>—</b> likely related to the multiplexing of HTTP/2 <b>—</b> was greater efficiency when writing newly created/edited/deleted records to the edge. Our latency spikes dropped in both amplitude and frequency.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PDR6bZkh8zzVVv2j6tmE5/9ce516e00daa832135964246eaf7b95c/image2-19.png" />
            
            </figure>
    <div>
      <h3>gRPC Security</h3>
      <a href="#grpc-security">
        
      </a>
    </div>
    <p>One of the best features in Kubernetes is the NetworkPolicy, which lets developers control exactly what traffic is allowed into and out of their pods.</p>
            <pre><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - ipBlock:
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978</code></pre>
            <p>In this example, taken from the <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/">Kubernetes docs</a>, we can see that this creates a network policy called <i>test-network-policy</i>. This policy controls both ingress and egress communication to and from any pod that matches the label <i>role=db</i> and enforces the following rules:</p><p>Ingress connections allowed:</p><ul><li><p>Any pod in the default namespace with the label “role=frontend”</p></li><li><p>Any pod in any namespace that has the label “project=myproject”</p></li><li><p>Any source IP address in 172.17.0.0/16 except for 172.17.1.0/24</p></li></ul><p>Egress connections allowed:</p><ul><li><p>Any destination IP address in 10.0.0.0/24</p></li></ul><p>NetworkPolicies do a fantastic job of protecting APIs at the network level; however, they do nothing to protect APIs at the application level. If you wanted to control which endpoints can be accessed within the API, you would need Kubernetes to distinguish not only between pods, but also between endpoints within those pods. These concerns led us to <a href="https://grpc.io/docs/guides/auth/">per-RPC credentials</a>. Per-RPC credentials are easy to set up on top of pre-existing gRPC code. All you need to do is add interceptors to both your stream and unary handlers.</p>
            <pre><code>func (s *Server) UnaryAuthInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    // Get the targeted function
    functionInfo := strings.Split(info.FullMethod, "/")
    function := functionInfo[len(functionInfo)-1]
    md, _ := metadata.FromIncomingContext(ctx)

    // Authenticate
    err := authenticateClient(md.Get("username")[0], md.Get("password")[0], function)
    // Blocked
    if err != nil {
        return nil, err
    }
    // Verified
    return handler(ctx, req)
}</code></pre>
            <p>In this example code snippet, we grab the requested function from the info object and the username and password from the incoming metadata. We then authenticate the client to make sure that it has the correct rights to call that function. This interceptor runs before any of the other functions get called, which means one implementation protects all functions. The client would initialize its secure connection and send credentials like so:</p>
            <pre><code>transportCreds, err := credentials.NewClientTLSFromFile(certFile, "")
if err != nil {
    return nil, err
}
perRPCCreds := Creds{Password: grpcPassword, User: user}
conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(transportCreds), grpc.WithPerRPCCredentials(perRPCCreds))
if err != nil {
    return nil, err
}
client := pb.NewRecordHandlerClient(conn)
// Can now start using the client</code></pre>
            <p>Here the client first verifies that the server's certificate matches the certFile. This step ensures that the client does not accidentally send its password to a bad actor. Next, the client initializes the <i>perRPCCreds</i> struct with its username and password and dials the server with that information. Any time the client calls an RPC-defined function, its credentials are verified by the server.</p>
    <div>
      <h3>Next Steps</h3>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>Our next step is to remove the need for many applications to access the database and ultimately DRY up our codebase by pulling all DNS-related code into a single API, accessed from one gRPC interface. This removes the potential for mistakes in individual applications and makes updating our database schema easier. It also gives us more granular control over which functions can be accessed rather than which tables can be accessed.</p><p>So far, the DNS team is very happy with the results of our gRPC migration. However, we still have a long way to go before we can move entirely away from REST. We are also patiently waiting for <a href="https://github.com/grpc/grpc/issues/19126">HTTP/3 support</a> for gRPC so that we can take advantage of those super <a href="https://en.wikipedia.org/wiki/QUIC">quic</a> speeds!</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[gRPC]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[QUIC]]></category>
            <guid isPermaLink="false">6oeh7RRYqhqnS7vtJ2BpwP</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Delivering HTTP/2 upload speed improvements]]></title>
            <link>https://blog.cloudflare.com/delivering-http-2-upload-speed-improvements/</link>
            <pubDate>Mon, 24 Aug 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare recently shipped improved upload speeds across our network for clients using HTTP/2. This post describes our journey from troubleshooting an issue to fixing it and delivering faster upload speeds to the global Internet. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare recently shipped improved <i>upload</i> speeds across our network for clients using HTTP/2. This post describes our journey from troubleshooting an issue to fixing it and delivering faster upload speeds to the global Internet.</p><p>We <a href="/test-your-home-network-performance/">launched</a> <a href="/test-your-home-network-performance/">speed.cloudflare.com</a> in May 2020 to give our users insight into how well their networks perform. The test provides download, upload and latency tests. Soon after release, we received reports from a small number of users that upload speeds were sometimes underreported. Our investigation determined that it seemed to happen with end users that had high upload bandwidth available (several-hundred-Mbps cable modem or fiber service). Our speed tests are performed via browser JavaScript, and most browsers use HTTP/2 by default. We found that HTTP/2 upload speeds were sometimes much slower than HTTP/1.1 (both over TLS) when the user had high upload bandwidth available.</p><p>Upload speed is more important than ever, especially for people using home broadband connections. As many people have been forced to work from home, they’re using their broadband connections differently than before. Prior to the pandemic, broadband traffic was very asymmetric (you downloaded way more than you uploaded… think listening to music, or streaming a movie), but now we’re seeing an increase in uploading as people video conference from home or create content from their home office.</p>
    <div>
      <h3>Initial Tests</h3>
      <a href="#initial-tests">
        
      </a>
    </div>
    <p>User reports were focused on particularly fast home networks. I set up a <code>dummynet</code> network simulator to test upload speed in a controlled environment. I launched a Linux VM running our code on my MacBook Pro and set up a <a href="https://www.saturnsoft.net/tech/2020/06/07/macosx-dummynet/">dummynet between the VM and Mac host</a>. Measuring upload speed is simple: create a file and upload it using curl to an endpoint which accepts a request body. I ran the same test 20 times and took the median upload speed (Mbps).</p>
            <pre><code>% dd if=/dev/urandom of=test.dat bs=1M count=10
% curl --http1.1 -w '%{speed_upload}\n' -sf -o/dev/null --data-binary @test.dat https://edge/upload-endpoint
% curl --http2 -w '%{speed_upload}\n' -sf -o/dev/null --data-binary @test.dat https://edge/upload-endpoint</code></pre>
            <p>Stepping up to uploading a 10MB object over a network with 200Mbps of available bandwidth and a 40ms <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a>, the result was surprising. Using our production configuration, HTTP/2 upload speed tested at almost half that of HTTP/1.1 under the same conditions (higher is better).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5MPf14YbzqYkqUtiu4lNyH/a869cff834ea16aeb7d8ce659b7b352c/Screen-Shot-2020-08-18-at-4.44.56-PM.png" />
            
            </figure><p>The result may differ depending on your network, but the gap is bigger when the network is fast. On a slow network, like my home cable connection (5Mbps upload and 20ms RTT), HTTP/2 upload speed was almost identical to the performance observed with HTTP/1.1.</p>
    <div>
      <h3>Receiver Flow Control</h3>
      <a href="#receiver-flow-control">
        
      </a>
    </div>
    <p>Before we get into more detail on this topic: my intuition suggested the issue was related to receiver flow control. Usually the client (a browser or any HTTP client) is the receiver of data, but when the client is uploading content to the server, the server is the receiver. And the receiver needs some type of flow control for its receive buffer.</p><p>How we handle receiver flow control differs between HTTP/1.1 and HTTP/2. For example, HTTP/1.1 doesn’t define protocol-level receiver flow control since there is no multiplexing of requests in the connection; flow control is left to the TCP layer, which handles receiving data. Note that most modern OS TCP stacks auto-tune the receive buffer (we will revisit that later) based on the current <a href="https://en.wikipedia.org/wiki/Bandwidth-delay_product">BDP (bandwidth-delay product)</a>.</p><p>In the case of HTTP/2, there is a <a href="https://tools.ietf.org/html/rfc7540#section-5.2">stream-level flow control</a> mechanism because the protocol supports multiplexing of streams. Each HTTP/2 stream has its own flow control window, and there is connection-level flow control for all streams in the connection. If it’s too tight, the sender will be blocked by flow control. If it’s too loose, we may end up wasting memory on buffering. So keeping it optimal is important when implementing flow control, and the optimal strategy is to keep the receive buffer matching the current BDP. The BDP represents the maximum number of bytes in flight on the given network and can be used as an optimal buffer size.</p>
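    <p>As a back-of-the-envelope illustration (a sketch, not code from our stack): the BDP is just bandwidth multiplied by round-trip time, so the 200Mbps / 40ms network from the tests above keeps up to 1MB in flight, far more than a 128KB buffer can cover:</p>

```go
package main

import "fmt"

// bdpBytes returns the bandwidth-delay product in bytes: the maximum amount
// of data in flight on a link with the given bandwidth (bits/s) and RTT (ms).
func bdpBytes(bandwidthBitsPerSec, rttMillis float64) float64 {
	return bandwidthBitsPerSec * (rttMillis / 1000.0) / 8.0
}

func main() {
	// The 200Mbps / 40ms RTT network used in the upload tests above.
	fmt.Printf("%.0f bytes in flight\n", bdpBytes(200e6, 40)) // 1000000 bytes
}
```

    <p>By the same arithmetic, the slow 5Mbps / 20ms home connection has a BDP of only 12.5KB, which is why a fixed 128KB buffer was never the bottleneck there.</p>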
    <div>
      <h3>How NGINX handles the request body buffer</h3>
      <a href="#how-nginx-handles-the-request-body-buffer">
        
      </a>
    </div>
    <p>Initially, I tried to find a parameter which controls NGINX upload buffering to see if tuning its value improved the result. There are a couple of parameters related to uploading a request body.</p><ul><li><p><a href="http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_buffering"><code>proxy_buffering</code></a></p></li><li><p><a href="http://nginx.org/en/docs/http/ngx_http_core_module.html#client_body_buffer_size"><code>client_body_buffer_size</code></a></p></li></ul><p>And this one is HTTP/2 specific:</p><ul><li><p><a href="http://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_body_preread_size"><code>http2_body_preread_size</code></a></p></li></ul><p>Cloudflare does not use the <code>proxy_request_buffering</code> directive, so it can be immediately discounted. <code>client_body_buffer_size</code> is the size of the request body buffer, which is used regardless of the protocol, so it applies to both HTTP/1.1 and HTTP/2.</p><p>When looking into the code, here is how it works:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5FFejc0ITjeuf1tZChITtW/21d23eb449794b824c944062dea9d087/Screen-Shot-2020-08-10-at-3.51.05-PM.png" />
            
            </figure><ul><li><p>HTTP/1.1: use the <code>client_body_buffer_size</code> buffer as a buffer between the upstream and the client, simply repeating reads and writes through this buffer.</p></li><li><p>HTTP/2: since we need a flow control window update for the HTTP/2 DATA frame, there are two parameters:</p><ul><li><p><code>http2_body_preread_size</code>: it specifies the size of the initial request body to read before starting to send to the upstream.</p></li><li><p><code>client_body_buffer_size</code>: it specifies the size of the request body buffer.</p></li><li><p>Those two parameters are used for allocating a request body buffer during uploading. Here is a brief summary of how unbuffered upload works:</p><ul><li><p>Allocate a single request body buffer whose size is the maximum of <code>http2_body_preread_size</code> and <code>client_body_buffer_size</code>. This means if <code>http2_body_preread_size</code> is 64KB and <code>client_body_buffer_size</code> is 128KB, then a 128KB buffer is allocated. We use 128KB for <code>client_body_buffer_size</code>.</p></li><li><p>The HTTP/2 Settings <code>INITIAL_WINDOW_SIZE</code> of the stream is set to <code>http2_body_preread_size</code>, and we use 64KB as a default (<a href="https://tools.ietf.org/html/rfc7540#section-6.9.2">the RFC7540 default value</a>).</p></li><li><p>The HTTP/2 module reads up to <code>http2_body_preread_size</code> before sending to the upstream.</p></li><li><p>After flushing the preread buffer, keep reading from the client and writing to the upstream, sending <code>WINDOW_UPDATE</code> frames back to the client when necessary, until the request body is fully received.</p></li></ul></li></ul></li></ul><p>To summarise what this means: HTTP/1.1 simply uses a single buffer, so TCP socket buffers do the flow control. However, with HTTP/2, the application layer also has receiver flow control, and NGINX uses a fixed-size buffer for the receiver. This limits upload speed when the link's BDP is larger than the request body buffer size: the bottleneck is HTTP/2 flow control when the buffer is too tight.</p>
    <div>
      <h3>We're going to need a bigger buffer?</h3>
      <a href="#were-going-to-need-a-bigger-buffer">
        
      </a>
    </div>
    <p>In theory, bigger buffer sizes should avoid upload bottlenecks, so I tried a few out by running my tests again. The previous chart result is now labelled "prod" and plotted alongside HTTP/2 tests with <code>client_body_buffer_size</code> set to 256KB, 512KB and 1024KB:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ViLVb0T1Ku13wI6wjCicy/3c722510d2eb466c5685f698cf66a73c/10MB-1.png" />
            
            </figure><p>It appears 512KB is an optimal value for <code>client_body_buffer_size</code>.</p><p>What if I test with other network parameters? Here is the result when the RTT is 10ms; in this case, 256KB looks optimal.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nxstBv8ApAJdbBCOhXFSQ/5d33c532b593beb759fcece97843ec5c/10MB-2.png" />
            
            </figure><p>Both cases look much better than the current 128KB and achieve performance similar to HTTP/1.1, or even better. However, the optimal buffer size is a moving target, and having too large a buffer can hurt performance: we need a smart way to find the optimal buffer size.</p>
    <div>
      <h2>Autotuning request body buffer size</h2>
      <a href="#autotuning-request-body-buffer-size">
        
      </a>
    </div>
    <p>One of the ideas that can help this kind of situation is autotuning. For example, modern TCP stacks autotune their receive buffers automatically. In production, our edge also has <a href="https://sysctl-explorer.net/net/ipv4/tcp_moderate_rcvbuf/">TCP receiver buffer autotuning</a> enabled by default.</p>
            <pre><code>net.ipv4.tcp_moderate_rcvbuf = 1</code></pre>
            <p>But in the case of HTTP/2, TCP buffer autotuning is not very effective because the HTTP/2 layer does its own flow control, and the existing 128KB buffer was too small for a high-BDP link. At this point, I decided to pursue autotuning of the HTTP/2 receive buffer size as well, similar to what TCP does.</p><p>The basic idea is that NGINX doubles the HTTP/2 request body buffer size based on the connection's BDP. Here is the algorithm currently implemented in our version of NGINX:</p><ul><li><p>Allocate a request body buffer as explained above.</p></li><li><p>For every RTT (using Linux <code>tcp_info</code>), update the current BDP.</p></li><li><p>Double the request body buffer size when the current BDP &gt; (receiver window / 4).</p></li></ul>
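            <p>The loop below sketches that doubling rule in Go (the real patch is C inside NGINX and doubles at most once per RTT sample; the growth cap here is an illustrative assumption):</p>

```go
package main

import "fmt"

// autotune doubles the receive window whenever the measured BDP exceeds a
// quarter of the current window, mirroring the rule above. In NGINX this check
// happens once per RTT sample; here we loop until the condition clears.
// maxWindow caps growth and is an illustrative bound, not the patch's value.
func autotune(window, bdp, maxWindow int) int {
	for bdp > window/4 && window < maxWindow {
		window *= 2
	}
	if window > maxWindow {
		window = maxWindow
	}
	return window
}

func main() {
	// Start at the 128KB default; a 1MB BDP (200Mbps, 40ms) forces growth.
	w := autotune(128*1024, 1_000_000, 8<<20)
	fmt.Println(w) // 4194304: the first window where BDP <= window/4
}
```

            <p>With a 1MB BDP the window grows from 128KB to 4MB, the first size whose quarter exceeds the BDP, which matches the intuition that the receive buffer should comfortably cover the bytes in flight.</p>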
    <div>
      <h2>Test Result</h2>
      <a href="#test-result">
        
      </a>
    </div>
    
    <div>
      <h3><b>Lab Test</b></h3>
      <a href="#lab-test">
        
      </a>
    </div>
    <p>Here is a test result when HTTP/2 autotune upload is enabled (still using <code>client_body_buffer_size</code> 128KB). You can see "h2 autotune" is doing pretty well – similar or even slightly faster than HTTP/1.1 speed (that's the initial goal). It might be slightly worse than a hand-picked optimal buffer size for given conditions, but you can see now NGINX picks up optimal buffer size automatically based on network conditions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Sl1rRUKOPiCoT5YeDqU6g/19231407e0b61e6b0ddbc8a2790a772d/10MB-3.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3keSehjT2WOcQVXA4QmDiy/bad3fdb88497492a7c4bd21ac5e8cb87/10MB-4.png" />
            
            </figure>
    <div>
      <h3>Production test</h3>
      <a href="#production-test">
        
      </a>
    </div>
    <p>After we deployed this feature, I ran similar tests against our production edge, uploading a 10MB file from well-connected client nodes to our edge. I created a Linux VM instance in Google Cloud and ran the upload test on a network that is very fast (a few Gbps) with low latency (&lt;10ms).</p><p>Here is the result of running the test from the Google Cloud Belgium region to our CDG (Paris) PoP, which has a 7ms RTT. This looks very good, with almost a 3x improvement.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MgWAhBWB91UiGKOWULns6/5faa240c54fd96fbc8d1b0c8ba7dd5c5/10MB-5.png" />
            
            </figure><p>I also tested between the Google Cloud Tokyo region and our NRT (Tokyo) PoP, which had a 2.3ms RTT. Although this is not realistic for home users, the results are interesting. A 128KB fixed buffer performs well, but HTTP/2 with buffer autotune outperforms even HTTP/1.1.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33aKI4Dguh46tXuIV9INyv/c6e7a8146e32c70a2cb09cf1c009d664/10MB-6.png" />
            
            </figure>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>HTTP/2 upload buffer autotuning is now fully deployed in the Cloudflare edge. Customers should now benefit from improved upload performance for all HTTP/2 connections, including speed tests on speed.cloudflare.com. The autotuning logic works well for most cases, and HTTP/2 upload is now much faster than before! When we think about performance, we usually tend to think about download speed or latency reduction, but faster uploading can also help users working from home who need to upload large amounts of data, such as with photo/video sharing apps, content creation, video conferencing or self-broadcasting.</p><p>Many thanks to <a href="/author/lucas/">Lucas Pardue</a> and <a href="/author/rustam-lalkaka/">Rustam Lalkaka</a> for great feedback and suggestions on the article.</p><p>Update: we made this patch available to the public and sent it to NGINX upstream. You can find it <a href="http://mailman.nginx.org/pipermail/nginx-devel/2020-August/013436.html">here</a>.</p>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">77KoGnJkqSvxBbUjDx9PhH</guid>
            <dc:creator>Junho Choi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Comparing HTTP/3 vs. HTTP/2 Performance]]></title>
            <link>https://blog.cloudflare.com/http-3-vs-http-2/</link>
            <pubDate>Tue, 14 Apr 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ We announced support for HTTP/3, the successor to HTTP/2, during Cloudflare’s birthday week last year. Our goal is and has always been to help build a better Internet. Even though HTTP/3 is still in draft status, we've seen a lot of interest from our users. ]]></description>
            <content:encoded><![CDATA[ <p>We announced <a href="/http3-the-past-present-and-future/">support for HTTP/3</a>, the successor to HTTP/2, during Cloudflare’s <a href="/birthday-week-2019/">birthday week</a> last year. Our goal is and has always been to help build a better Internet. Collaborating on standards is a big part of that, and we're very fortunate to do that here.</p><p>Even though <a href="https://www.cloudflare.com/learning/performance/what-is-http3/">HTTP/3</a> is still in draft status, we've seen a lot of interest from our users. So far, over 113,000 zones have activated HTTP/3 and, if you are using an experimental browser, those zones can be accessed using the new protocol! It's been great seeing so many people enable HTTP/3: having real websites accessible through HTTP/3 means browsers have more diverse properties to test against.</p><p>When we <a href="/http3-the-past-present-and-future/">launched support</a> for HTTP/3, we did so in partnership with Google, who simultaneously launched experimental support in Google Chrome. Since then, we've seen more browsers add experimental support: Firefox in their nightly <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1581637">builds</a>, other Chromium-based browsers such as Opera and Microsoft Edge through the underlying Chrome browser engine, and Safari via their <a href="https://developer.apple.com/safari/technology-preview/release-notes/">technology preview</a>. We closely follow these developments and partner wherever we can help; having a large network with many sites that have HTTP/3 enabled gives browser implementers an excellent testbed against which to try out code.</p>
    <div>
      <h3>So, what's the status and where are we now?</h3>
      <a href="#so-whats-the-status-and-where-are-we-now">
        
      </a>
    </div>
    <p>The IETF standardization process develops protocols as a series of document draft versions with the ultimate aim of producing a final draft version that is ready to be marked as an RFC. The members of the QUIC Working Group collaborate on analyzing, implementing and interoperating the specification in order to find things that don't work quite right. We launched with support for <a href="https://tools.ietf.org/html/draft-ietf-quic-http-23">Draft-23 for HTTP/3</a> and have since kept up with each new draft, with <a href="https://tools.ietf.org/html/draft-ietf-quic-http-27">27</a> being the latest at the time of writing. With each draft, the group improves the quality of the QUIC definition and gets closer to "rough consensus" about how it behaves. In order to avoid a perpetual state of analysis paralysis and endless tweaking, the bar for proposing changes to the specification has been increasing with each new draft. This means that changes between versions are smaller, and that a final RFC should closely match the protocol that we've been running in production.</p>
    <div>
      <h3>Benefits</h3>
      <a href="#benefits">
        
      </a>
    </div>
    <p>One of the main touted advantages of HTTP/3 is increased performance, specifically around fetching multiple objects simultaneously. With HTTP/2, any interruption (packet loss) in the TCP connection blocks all streams (head-of-line blocking). Because HTTP/3 is UDP-based, a dropped packet interrupts only the one stream it belongs to, not all of them.</p><p>In addition, HTTP/3 offers <a href="/even-faster-connection-establishment-with-quic-0-rtt-resumption/">0-RTT</a> support, which means that subsequent connections can start up much faster by eliminating the TLS acknowledgement from the server when setting up the connection. The client can start requesting data much sooner than with a full TLS negotiation, so the website starts loading earlier.</p><p>The following diagrams illustrate packet loss and its impact. The first shows HTTP/2 multiplexing two requests: a request comes over HTTP/2 from the client to the server requesting two resources (we’ve colored the requests and their associated responses green and yellow). The responses are broken up into multiple packets and, alas, a packet is lost, so both requests are held up.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5MQCKq4WKwM4fCqDGYJ8Ui/1f9306b608618fe78e3e57a1392a9b11/image1-1.gif" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6DSF88gTgVSsmiVExIa3Ry/5e26b2b32f6bd00dda851caa4c9c46e6/image4-1.gif" />
            
            </figure><p>The above shows HTTP/3 multiplexing two requests. A packet is lost that affects the yellow response, but the green one proceeds just fine.</p><p>Improvements in session startup mean that ‘connections’ to servers start much faster, so the browser begins to see data more quickly. We were curious to see how much of an improvement this delivers, so to measure the gain from 0-RTT support we ran some benchmarks measuring <i>time to first byte</i> (TTFB). On average, with HTTP/3 we see the first byte appearing after 176ms. With HTTP/2 we see 201ms, meaning HTTP/3 is already performing 12.4% better!</p>
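<p>The loss scenarios in the diagrams above can be captured in a toy delivery model. The following Go sketch is purely illustrative (the names <code>finishTimes</code> and <code>rtt</code> are ours, not from any real stack): under TCP-style in-order delivery a lost packet delays every stream on the connection, while under QUIC-style per-stream delivery it delays only its own stream.</p>

```go
package main

import "fmt"

// Packet is one packet of a response stream; Lost means it must be
// retransmitted, arriving one round trip (rtt) later than scheduled.
type Packet struct {
	Stream string
	Time   int
	Lost   bool
}

const rtt = 10 // hypothetical retransmission delay, in time units

// finishTimes returns when each stream's data is fully delivered.
// With sharedOrder=true (TCP/HTTP-2 model) a lost packet delays every
// later packet on the connection; with false (QUIC/HTTP-3 model) it
// delays only later packets of the same stream.
func finishTimes(packets []Packet, sharedOrder bool) map[string]int {
	finish := map[string]int{}
	perStream := map[string]int{} // per-stream delivery floor
	connFloor := 0                // connection-wide delivery floor
	for _, p := range packets {
		arrive := p.Time
		if p.Lost {
			arrive += rtt
		}
		if sharedOrder {
			if arrive > connFloor {
				connFloor = arrive
			}
			arrive = connFloor
		} else {
			if arrive > perStream[p.Stream] {
				perStream[p.Stream] = arrive
			}
			arrive = perStream[p.Stream]
		}
		if arrive > finish[p.Stream] {
			finish[p.Stream] = arrive
		}
	}
	return finish
}

func main() {
	packets := []Packet{
		{"green", 1, false},
		{"yellow", 2, true}, // the dropped packet
		{"green", 3, false},
		{"yellow", 4, false},
	}
	fmt.Println(finishTimes(packets, true))  // map[green:12 yellow:12]
	fmt.Println(finishTimes(packets, false)) // map[green:3 yellow:12]
}
```

<p>In the HTTP/2-like model the green stream is held until the retransmission arrives at time 12; in the HTTP/3-like model it completes at time 3, unaffected by the yellow stream's loss.</p>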
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5slXFJulfUbmW5NsZqm8r/f7087c07f55ab903735524e70dd04bb0/image5-6.png" />
            
            </figure><p>Interestingly, not every aspect of the protocol is governed by the drafts or RFC. Implementation choices can affect performance, such as efficient <a href="/accelerating-udp-packet-transmission-for-quic/">packet transmission</a> and choice of congestion control algorithm. Congestion control is a technique your computer and server use to adapt to overloaded networks: by dropping packets, transmission is subsequently throttled. Because QUIC is a new protocol, getting the congestion control design and implementation right requires experimentation and tuning.</p><p>In order to provide a safe and simple starting point, the Loss Detection and Congestion Control specification recommends the <a href="https://en.wikipedia.org/wiki/TCP_congestion_control#TCP_Tahoe_and_Reno">Reno</a> algorithm but allows endpoints to choose any algorithm they might like.  We started with <a href="https://en.wikipedia.org/wiki/TCP_congestion_control#TCP_New_Reno">New Reno</a> but we know from experience that we can get better performance with something else. We have recently moved to <a href="https://en.wikipedia.org/wiki/CUBIC_TCP">CUBIC</a> and on our network with larger size transfers and packet loss, CUBIC shows improvement over New Reno. Stay tuned for more details in future.</p><p>For our existing HTTP/2 stack, we currently support <a href="https://github.com/google/bbr">BBR v1</a> (TCP). This means that in our tests, we’re not performing an exact apples-to-apples comparison as these congestion control algorithms will behave differently for smaller vs larger transfers. That being said, we can already see a speedup in smaller websites using HTTP/3 when compared to HTTP/2. With larger zones, the improved congestion control of our tuned HTTP/2 stack shines in performance.</p><p>For a small test page of 15KB, HTTP/3 takes an average of 443ms to load compared to 458ms for HTTP/2. 
However, once we increase the page size to 1MB that advantage disappears: HTTP/3 is just slightly slower than HTTP/2 on our network today, taking 2.33s to load versus 2.30s.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2UrCQVwplKQxT7aSzqwMl6/4ff62d1bdb67e54c29b553d260f11aa8/image2-11.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KPuZxBl9m6TOyvrGZfMFw/5d22e954c62325af32c99c5783f62560/image6-4.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/JDw2bdaT0Vcmd4Jyg7Yhl/af2af92b3bdc2dd1891062844027fe70/image3-11.png" />
            
            </figure><p>Synthetic benchmarks are interesting, but we wanted to know how HTTP/3 would perform in the real world.</p><p>To measure, we wanted a third party that could load websites on our network, mimicking a browser. WebPageTest is a common framework used to measure page load time, with nice waterfall charts. For analyzing the backend, we used our in-house <a href="https://support.cloudflare.com/hc/en-us/articles/360033929991-Cloudflare-Browser-Insights">Browser Insights</a> to capture timings as our edge sees them. We then tied both pieces together with bits of automation.</p><p>As a test case we decided to use <a href="/">this very blog</a> for our <a href="https://www.cloudflare.com/application-services/solutions/app-performance-monitoring/">performance monitoring</a>. We configured our own instances of WebPageTest spread over the world to load these sites over both HTTP/2 and HTTP/3. We also enabled HTTP/3 and Browser Insights. So, every time our test scripts kick off a webpage test with an HTTP/3-capable browser loading the page, browser analytics report the data back. Rinse and repeat for HTTP/2 so we can compare.</p><p>The following graph shows the page load time for a real-world page, <a href="/">blog.cloudflare.com</a>, comparing the performance of HTTP/3 and HTTP/2. We run these performance measurements from different geographical locations.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Xwp5YGAE5hCP7m0TQeGeE/52a42c9ad4a53be01e22fba4888c27de/image7-5.png" />
            
            </figure><p>As you can see, HTTP/3 performance still trails HTTP/2 performance by about 1-4% on average in North America, and similar results are seen in Europe, Asia and South America. We suspect this could be due to the difference in congestion algorithms: HTTP/2 on BBR v1 vs. HTTP/3 on CUBIC. In the future, we’ll work to support the same congestion algorithm on both to get a more accurate apples-to-apples comparison.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Overall, we’re very excited to help push this standard forward. Our implementation is holding up well, offering better performance in some cases and, at worst, performance similar to HTTP/2. As the standard finalizes, we’re looking forward to seeing browsers add support for HTTP/3 in mainstream versions. As for us, we continue to support the latest drafts while at the same time looking for more ways to leverage HTTP/3 to get even better performance, be it congestion tuning, <a href="/adopting-a-new-approach-to-http-prioritization/">prioritization</a> or system capacity (CPU and raw network throughput).</p><p>In the meantime, if you’d like to try it out, just enable HTTP/3 on our dashboard and download a nightly version of one of the major browsers. Instructions on how to enable HTTP/3 can be found in our <a href="https://developers.cloudflare.com/http3/intro/">developer documentation</a>.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[HTTP3]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">7uJaKjNWQi13PNs4O9ylJB</guid>
            <dc:creator>Sreeni Tellakula</dc:creator>
        </item>
        <item>
            <title><![CDATA[On the recent HTTP/2 DoS attacks]]></title>
            <link>https://blog.cloudflare.com/on-the-recent-http-2-dos-attacks/</link>
            <pubDate>Tue, 13 Aug 2019 17:00:00 GMT</pubDate>
            <description><![CDATA[ Today, multiple Denial of Service (DoS) vulnerabilities were disclosed for a number of HTTP/2 server implementations. Cloudflare uses NGINX for HTTP/2. Customers using Cloudflare are already protected against these attacks. ]]></description>
            <content:encoded><![CDATA[ <p>Today, multiple Denial of Service (DoS) vulnerabilities were disclosed for a number of HTTP/2 server implementations. Cloudflare uses NGINX for HTTP/2. <b>Customers using Cloudflare are already protected against these attacks.</b></p><p>The individual vulnerabilities, originally discovered by Netflix and included in <a href="https://github.com/Netflix/security-bulletins/blob/master/advisories/third-party/2019-002.md">this</a> announcement, are:</p><ul><li><p><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-9511">CVE-2019-9511</a> HTTP/2 Data Dribble</p></li><li><p><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-9512">CVE-2019-9512</a> HTTP/2 Ping Flood</p></li><li><p><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-9513">CVE-2019-9513</a> HTTP/2 Resource Loop</p></li><li><p><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-9514">CVE-2019-9514</a> HTTP/2 Reset Flood</p></li><li><p><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-9515">CVE-2019-9515</a> HTTP/2 Settings Flood</p></li><li><p><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-9516">CVE-2019-9516</a> HTTP/2 0-Length Headers Leak</p></li><li><p><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-9518">CVE-2019-9518</a> HTTP/2 Request Data/Header Flood</p></li></ul><p>As soon as we became aware of these vulnerabilities, Cloudflare’s Protocols team started working on fixing them. We first pushed a patch to detect any attack attempts and to see if any normal traffic would be affected by our mitigations. 
This was followed up with work to mitigate these vulnerabilities; we pushed the changes out a few weeks ago and continue to monitor similar attacks on our stack.</p><p>If any of our customers host web services over HTTP/2 on an alternative, publicly accessible path that is not behind Cloudflare, we recommend you apply the latest security updates to your origin servers in order to protect yourselves from these HTTP/2 vulnerabilities.</p><p>We will soon follow up with more details on these vulnerabilities and how we mitigated them.</p><p>Full credit for the discovery of these vulnerabilities goes to Jonathan Looney of Netflix and Piotr Sikora of Google and the Envoy Security Team.</p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[HTTP2]]></category>
            <guid isPermaLink="false">1RNROMoG90Eth3X6fVPj6v</guid>
            <dc:creator>Nafeez</dc:creator>
        </item>
        <item>
            <title><![CDATA[NGINX structural enhancements for HTTP/2 performance]]></title>
            <link>https://blog.cloudflare.com/nginx-structural-enhancements-for-http-2-performance/</link>
            <pubDate>Wed, 22 May 2019 17:14:56 GMT</pubDate>
            <description><![CDATA[ My team, the Cloudflare Protocols team, is responsible for terminating HTTP traffic at the edge of the Cloudflare network. We deal with features related to TCP, QUIC, TLS and secure certificate management, and HTTP/1 and HTTP/2. ]]></description>
            <content:encoded><![CDATA[
    <div>
      <h3>Introduction</h3>
      <a href="#introduction">
        
      </a>
    </div>
    <p>My team, the Cloudflare Protocols team, is responsible for terminating HTTP traffic at the edge of the Cloudflare network. We deal with features related to TCP, QUIC, TLS and secure certificate management, and HTTP/1 and HTTP/2. Over Q1, we were responsible for implementing the <a href="/better-http-2-prioritization-for-a-faster-web/">Enhanced HTTP/2 Prioritization</a> product that Cloudflare announced during Speed Week.</p><p>This was a very exciting project to be part of, and doubly exciting to see the results of, but during the course of the project we had a number of interesting realisations about NGINX: the HTTP-oriented server onto which Cloudflare currently deploys its software infrastructure. We quickly became certain that our Enhanced HTTP/2 Prioritization project could not achieve even moderate success if the internal workings of NGINX were not changed.</p><p>Due to these realisations, we embarked upon a number of significant changes to the internal structure of NGINX, in parallel with the work on the core prioritization product. This blog post describes the motivation behind the structural changes, how we approached them, and what impact they had. We also identify additional changes that we plan to add to our roadmap, which we hope will improve performance further.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>Enhanced HTTP/2 Prioritization aims to do one thing to web traffic flowing between a client and a server: it provides a means to shape the many HTTP/2 streams as they flow from upstream (server or origin side) into a single HTTP/2 connection that flows downstream (client side).</p><p>Enhanced HTTP/2 Prioritization allows site owners and the Cloudflare edge systems to dictate the rules about how various objects should combine into the single HTTP/2 connection: whether a particular object should have priority and dominate that connection and reach the client as soon as possible, or whether a group of objects should evenly share the capacity of the connection and put more emphasis on parallelism.</p><p>As a result, Enhanced HTTP/2 Prioritization allows site owners to tackle two problems that exist between a client and a server: how to control the precedence and ordering of objects, and how to make the best use of a limited connection resource, which may be constrained by a number of factors such as bandwidth, volume of traffic and CPU workload at the various stages on the path of the connection.</p>
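<p>The two modes described above can be illustrated with a hypothetical weight-based allocator (a sketch of the idea only; <code>schedule</code> and its stream names are invented, not part of the actual product): a dominant weight lets one object monopolise the connection's frame slots, while equal weights split them evenly for parallelism.</p>

```go
package main

import "fmt"

// schedule allocates n frame slots across streams in proportion to
// their weights; the split models how much of the connection each
// object receives in one scheduling round.
func schedule(weights map[string]int, n int) map[string]int {
	total := 0
	for _, w := range weights {
		total += w
	}
	out := map[string]int{}
	if total == 0 {
		return out
	}
	for name, w := range weights {
		out[name] = n * w / total
	}
	return out
}

func main() {
	// "Dominate" mode: give the critical object nearly all capacity.
	fmt.Println(schedule(map[string]int{"critical.css": 9, "hero.jpg": 1}, 100))
	// "Share" mode: equal weights emphasise parallelism.
	fmt.Println(schedule(map[string]int{"a.png": 1, "b.png": 1}, 100))
}
```
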
    <div>
      <h3>What did we see?</h3>
      <a href="#what-did-we-see">
        
      </a>
    </div>
    <p>The key to prioritisation is being able to compare two or more HTTP/2 streams in order to determine which one’s frame is to go down the pipe next. The Enhanced HTTP/2 Prioritization project necessarily drew us into the core NGINX codebase, as our intention was to fundamentally alter the way that NGINX compared and queued HTTP/2 data frames as they were written back to the client.</p><p>Very early in the analysis phase, as we rummaged through the NGINX internals to survey the site of our proposed features, we noticed a number of shortcomings in the structure of NGINX itself, in particular how it moved data from upstream (server side) to downstream (client side) and how it temporarily stored (buffered) that data in its various internal stages. The main conclusion of our early analysis was that NGINX largely failed to give the stream data frames any 'proximity'. Either streams were processed in the NGINX HTTP/2 layer in isolated succession, or frames of different streams spent very little time in the same place: a shared queue, for example. The net effect was a reduction in the opportunities for useful comparison.</p><p>We coined a new, barely scientific but useful measurement: <b>Potential</b>, to describe how effectively the Enhanced HTTP/2 Prioritization strategies (or even the default NGINX prioritization) can be applied to queued data streams. Potential is not so much a measurement of the effectiveness of prioritization per se (that metric would come later in the project) as a measurement of the levels of participation during the application of the algorithm. In simple terms, it considers the number of streams and frames thereof that are included in an iteration of prioritization, with more streams and more frames leading to higher Potential.</p><p>What we could see from early on was that, by default, NGINX displayed low Potential, rendering prioritization instructions from either the browser, as in the traditional HTTP/2 prioritization model, or from our Enhanced HTTP/2 Prioritization product fairly useless.</p>
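<p>In code, the idea behind Potential might be sketched like this (a hypothetical illustration; <code>potential</code> is not a real NGINX function): count the streams and frames that participate in a single prioritization pass over the per-stream queues.</p>

```go
package main

import "fmt"

// Frame is a stand-in for an HTTP/2 data frame queued on a stream.
type Frame struct{ Len int }

// potential reports how many streams and frames participate in one
// prioritization pass; more of both means more opportunities for
// useful comparisons between streams.
func potential(queues map[uint32][]Frame) (streams, frames int) {
	for _, q := range queues {
		if len(q) == 0 {
			continue
		}
		streams++
		frames += len(q)
	}
	return
}

func main() {
	// Streams processed in isolated succession: little to compare.
	low := map[uint32][]Frame{1: {{16384}}}
	// Several streams queued in proximity: comparisons are possible.
	high := map[uint32][]Frame{
		1: {{16384}, {16384}},
		3: {{16384}},
		5: {{8192}, {8192}, {1024}},
	}
	s, f := potential(low)
	fmt.Println(s, f) // 1 1
	s, f = potential(high)
	fmt.Println(s, f) // 3 6
}
```
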
    <div>
      <h3>What did we do?</h3>
      <a href="#what-did-we-do">
        
      </a>
    </div>
    <p>With the goal of improving the specific problems related to Potential, and also improving general throughput of the system, we identified some key pain points in NGINX. These points, which will be described below, have either been worked on and improved as part of our initial release of Enhanced HTTP/2 Prioritization, or have now branched out into meaningful projects of their own that we will put engineering effort into over the course of the next few months.</p>
    <div>
      <h3>HTTP/2 frame write queue reclamation</h3>
      <a href="#http-2-frame-write-queue-reclamation">
        
      </a>
    </div>
    <p>Write queue reclamation was successfully shipped with our release of Enhanced HTTP/2 Prioritization. Ironically, it wasn’t a change made to the original NGINX; it was in fact a change made against our Enhanced HTTP/2 Prioritization implementation when we were part way through the project, and it serves as a good example of something one might call conservation of data, which is a good way to increase Potential.</p><p>Similar to the original NGINX, our Enhanced HTTP/2 Prioritization algorithm will place a cohort of HTTP/2 data frames into a write queue as a result of an iteration of the prioritization strategies being applied to them. The contents of the write queue are destined to be written to the downstream TLS layer. Also similar to the original NGINX, the write queue may only be partially written to the TLS layer due to back-pressure from a network connection that has temporarily reached write capacity.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bBa68J0uK2BJvMi5ouTD5/92ce87c4a134c046206f2ac46ba55d2c/Write-Queue-Construction-Without-Reclamation.png" />
            
            </figure><p>Early on in our project, if the write queue was only partially written to the TLS layer, we would simply leave the frames in the write queue until the backlog was cleared, and then we would re-attempt to write that data to the network in a future write iteration, just like the original NGINX.</p><p>The original NGINX takes this approach because the write queue is the <b>only</b> place that waiting data frames are stored. However, in our NGINX modified for Enhanced HTTP/2 Prioritization, we have a unique structure that the original NGINX lacks: per-stream data frame queues where we temporarily store data frames before our prioritization algorithms are applied to them.</p><p>We came to the realisation that, in the event of a partial write, we were able to restore the unwritten frames back into their per-stream queues. If a subsequent data cohort arrived behind the partially written one, the previously unwritten frames could participate in an additional round of prioritization comparisons, thus raising the Potential of our algorithms.</p><p>The following diagram illustrates this process:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7uEXuMETgrrFmtN84ZlQEL/52b63e9fd2401ec0c7c56cd442f1c14e/Write-Queue-Construction-With-Reclamation.png" />
            
            </figure><p>We were very pleased to ship Enhanced HTTP/2 Prioritization with the reclamation feature included, as this single enhancement greatly increased Potential and made up for the fact that we had to withhold the next enhancement for Speed Week due to its delicacy.</p>
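<p>A minimal sketch of the reclamation idea, in Go for brevity (the types and names here are invented for illustration, not the actual NGINX patch): after a partial write, frames the network did not accept are restored to the head of their stream's own queue, ready to join the next prioritization pass.</p>

```go
package main

import "fmt"

// Frame is a stand-in for an HTTP/2 data frame tagged with its stream.
type Frame struct {
	Stream uint32
	Data   string
}

// Conn holds per-stream queues awaiting prioritization plus the shared
// write queue produced by a prioritization pass.
type Conn struct {
	perStream  map[uint32][]Frame
	writeQueue []Frame
}

// flush models a partial write: the network accepts up to capacity
// frames; unwritten frames are reclaimed into their per-stream queues
// so they can participate in the next prioritization pass.
func (c *Conn) flush(capacity int) int {
	written := capacity
	if written > len(c.writeQueue) {
		written = len(c.writeQueue)
	}
	// Walk the leftovers in reverse so each stream's frames keep
	// their original order when pushed back to the queue head.
	for i := len(c.writeQueue) - 1; i >= written; i-- {
		f := c.writeQueue[i]
		c.perStream[f.Stream] = append([]Frame{f}, c.perStream[f.Stream]...)
	}
	c.writeQueue = c.writeQueue[:written]
	return written
}

func main() {
	c := &Conn{perStream: map[uint32][]Frame{}}
	// One prioritization pass committed three frames for writing...
	c.writeQueue = []Frame{{1, "g1"}, {3, "y1"}, {1, "g2"}}
	// ...but back-pressure means the network accepts only one.
	fmt.Println(c.flush(1))     // 1
	fmt.Println(c.perStream[1]) // [{1 g2}]
	fmt.Println(c.perStream[3]) // [{3 y1}]
}
```
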
    <div>
      <h3>HTTP/2 frame write event re-ordering</h3>
      <a href="#http-2-frame-write-event-re-ordering">
        
      </a>
    </div>
    <p>In Cloudflare infrastructure, we map the many streams of a single HTTP/2 connection from the eyeball to multiple HTTP/1.1 connections to the upstream Cloudflare control plane.</p><p>As a note: it may seem counterintuitive that we downgrade protocols like this, and it may seem doubly counterintuitive when I reveal that we also disable HTTP keepalive on these upstream connections, resulting in only one transaction per connection. However, this arrangement offers a number of advantages, particularly in the form of improved CPU workload distribution.</p><p>When NGINX monitors its upstream HTTP/1.1 connections for read activity, it may detect readability on many of those connections and process them all in a batch. However, within that batch, each of the upstream connections is processed sequentially, one at a time, from start to finish: from HTTP/1.1 connection read, to framing in the HTTP/2 stream, to HTTP/2 connection write to the TLS layer.</p><p>The existing NGINX workflow is illustrated in this diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69wml2jrdBLGZW1nMImkL4/9583420328f40b1e521d0f0a8412614e/Upstream-Read-Event.png" />
            
            </figure><p>By committing each stream’s frames to the TLS layer one stream at a time, many frames may pass entirely through the NGINX system before back-pressure on the downstream connection allows a queue of frames to build up and give those frames the proximity needed for prioritization logic to be applied. This negatively impacts Potential and reduces the effectiveness of prioritization.</p><p>Our Enhanced HTTP/2 Prioritization modifications to NGINX aim to re-arrange the internal workflow described above into the following model:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4d0xrrRBO4J3C1HCe4Hy8M/43b971a2a8592e5708f930b1c6bfddfb/Upstream-Read-Reordered.png" />
            
            </figure><p>Although we continue to frame upstream data into HTTP/2 data frames in the separate iterations for each upstream connection, we no longer commit these frames to a single write queue within each iteration; instead, we arrange the frames into the per-stream queues described earlier. We then post a single event to the end of the per-connection iterations, and perform the prioritization, queuing and writing of the HTTP/2 data frames of all streams in that single event.</p><p>This single event finds the cohort of data conveniently stored in their respective per-stream queues, all in close proximity, which greatly increases the Potential of the Edge Prioritization algorithms.</p><p>In a form closer to actual code, the core of this modification looks a bit like this:</p>
            <pre><code>ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream-&gt;frames,
                               buffer);
    }

    ngx_http_v2_prioritise(h2_conn-&gt;queue,
                           h2_stream-&gt;frames);

    ngx_http_v2_write_queue(h2_conn-&gt;queue);
}</code></pre>
            <p>To this:</p>
            <pre><code>ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer) ) {
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream-&gt;frames,
                               buffer);
    }

    ngx_list_add(h2_conn-&gt;active_streams, h2_stream);

    ngx_call_once_async(ngx_http_v2_write_streams, h2_conn);
}</code></pre>
            
            <pre><code>ngx_http_v2_write_streams(ngx_http_v2_connection *h2_conn)
{
    ngx_http_v2_stream *h2_stream;

    while ( ! ngx_list_empty(h2_conn-&gt;active_streams)) {
        h2_stream = ngx_list_pop(h2_conn-&gt;active_streams);

        ngx_http_v2_prioritise(h2_conn-&gt;queue,
                               h2_stream-&gt;frames);
    }

    ngx_http_v2_write_queue(h2_conn-&gt;queue);
}</code></pre>
            <p>There is a high level of risk in this modification, for even though it is remarkably small, we are taking the well-established and debugged event flow in NGINX and switching it around to a significant degree. Like taking a number of Jenga pieces out of the tower and placing them in another location, we risk race conditions, event misfires and event black holes leading to lockups during transaction processing.</p><p>Because of this level of risk, we did <b>not</b> release this change in its entirety during Speed Week, but we will continue to test and refine it for future release.</p>
    <div>
      <h3>Upstream buffer partial re-use</h3>
      <a href="#upstream-buffer-partial-re-use">
        
      </a>
    </div>
    <p>NGINX has an internal buffer region to store connection data it reads from upstream. To begin with, the entirety of this buffer is <b>Ready</b> for use. When data is read from upstream into the Ready buffer, the part of the buffer that holds the data is passed to the downstream HTTP/2 layer. Since HTTP/2 takes responsibility for that data, that portion of the buffer is marked as <b>Busy</b>, and it will remain Busy for as long as it takes for the HTTP/2 layer to write the data into the TLS layer, which is a process that may take some time (in computer terms!).</p><p>During this gulf of time, the upstream layer may continue to read more data into the remaining Ready sections of the buffer and continue to pass that incremental data to the HTTP/2 layer until there are no Ready sections available.</p><p>When Busy data is finally finished in the HTTP/2 layer, the buffer space that contained that data is then marked as <b>Free</b>.</p><p>The process is illustrated in this diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rpLV7aUoGwAE6KcQmEts5/9bfa13e2ef08845172e915f01354dc63/Upstream-Buffer-Current-1.png" />
            
            </figure><p>You may ask: When the leading part of the upstream buffer is marked as Free (in blue in the diagram), even though the trailing part of the upstream buffer is still Busy, can the Free part be re-used for reading more data from upstream?</p><p>The answer to that question is: <b>NO</b>.</p><p>Because just a small part of the buffer is still Busy, NGINX refuses to allow any of the buffer space to be re-used for reads. Only when the entirety of the buffer is Free can the buffer be returned to the Ready state and used for another iteration of upstream reads. So, in summary, data can be read from upstream into Ready space at the tail of the buffer, but not into Free space at the head of the buffer.</p><p>This is a shortcoming in NGINX and is clearly undesirable, as it interrupts the flow of data into the system. We asked: what if we could cycle through this buffer region and re-use parts at the head as they became Free? We seek to answer that question in the near future by testing the following buffering model in NGINX:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5WkYGqHYhndsb6dYFqM7TT/6cf0dfe91b09ce205f512347d3427255/Upstream-Buffer-Improved.png" />
            
            </figure>
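<p>The difference between the two buffering models can be sketched as follows (a toy model with invented names, not NGINX source): under the all-or-nothing rule, a buffer whose head has been Freed but whose tail is still Busy accepts no new reads at all, while partial re-use lets the Freed head cycle back into service immediately.</p>

```go
package main

import "fmt"

// Buffer is a toy model of the upstream buffer, divided into head
// bytes already Freed, a Busy span still owned by the HTTP/2 layer,
// and Ready tail space. Sizes are in bytes.
type Buffer struct {
	Free, Busy, Ready int
}

// readableClassic models stock NGINX: new upstream reads may only use
// the Ready tail; Freed head space is wasted until the whole buffer
// returns to the Ready state.
func readableClassic(b Buffer) int { return b.Ready }

// readableCycled models the proposed partial re-use: the Freed head
// can be cycled back into service alongside the Ready tail.
func readableCycled(b Buffer) int { return b.Free + b.Ready }

func main() {
	// Head freed, tail still busy, no Ready space left.
	b := Buffer{Free: 8192, Busy: 4096, Ready: 0}
	fmt.Println(readableClassic(b)) // 0 -- upstream flow stalls
	fmt.Println(readableCycled(b))  // 8192 -- reads continue
}
```
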
    <div>
      <h3>TLS layer Buffering</h3>
      <a href="#tls-layer-buffering">
        
      </a>
    </div>
    <p>On a number of occasions in the above text, I have mentioned the TLS layer, and how the HTTP/2 layer writes data into it. In the OSI network model, TLS sits just below the protocol (HTTP/2) layer, and in many consciously designed networking software systems such as NGINX, the software interfaces are separated in a way that mimics this layering.</p><p>The NGINX HTTP/2 layer will collect the current cohort of data frames and place them in priority order into an output queue, then submit this queue to the TLS layer. The TLS layer makes use of a per-connection buffer to collect HTTP/2 layer data before performing the actual cryptographic transformations on that data.</p><p>The purpose of the buffer is to give the TLS layer a more meaningful quantity of data to encrypt, for if the buffer were too small, or if the TLS layer simply relied on the units of data from the HTTP/2 layer, then the overhead of encrypting and transmitting a multitude of small blocks could negatively impact system throughput.</p><p>The following diagram illustrates this undersize buffer situation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/21JR00rWD6FPowvk2nMNRs/79712a865fa021954e8c37f187dac002/TLS-Layer-Buffering-Undersize.png" />
            
            </figure><p>If the TLS buffer is too big, then an excessive amount of HTTP/2 data will be committed to encryption; if it then fails to write to the network due to back-pressure, it will be locked into the TLS layer and unavailable to return to the HTTP/2 layer for the reclamation process, thus reducing the effectiveness of reclamation. The following diagram illustrates this oversize buffer situation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wwWnqkWaJ0CpbGLCQWp1l/7e18f9e66c0a7ee9ed83357f52db9398/TLS-Layer-Buffering-Oversize.png" />
            
            </figure><p>In the coming months, we will attempt to find the ‘goldilocks’ spot for TLS buffering: sizing the TLS buffer so it is big enough to maintain the efficiency of encryption and network writes, but not so big as to reduce responsiveness to incomplete network writes and the efficiency of reclamation.</p>
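<p>The trade-off can be sketched with a toy accounting model (hypothetical; real TLS record sizing and flushing are more involved than this): count how many encrypt-and-write operations a given buffer size produces for a fixed set of HTTP/2 data chunks, and how much data remains buffered, and therefore locked away from reclamation, at the end.</p>

```go
package main

import "fmt"

// countWrites models the TLS buffering trade-off: HTTP/2 hands down
// chunks of data, and the TLS layer flushes whenever bufSize bytes
// have accumulated. Smaller buffers mean more (costlier) encrypt and
// write operations; larger buffers mean fewer writes but more data
// retained in the TLS layer if the network back-pressures.
func countWrites(chunks []int, bufSize int) (writes, buffered int) {
	for _, c := range chunks {
		buffered += c
		for buffered >= bufSize {
			writes++
			buffered -= bufSize
		}
	}
	return
}

func main() {
	// Six HTTP/2 data chunks of 1200 bytes each.
	chunks := []int{1200, 1200, 1200, 1200, 1200, 1200}
	w, _ := countWrites(chunks, 1400) // small buffer: frequent writes
	fmt.Println(w)                    // 5
	w2, rem := countWrites(chunks, 4096) // large buffer: few writes,
	fmt.Println(w2, rem)                 // 1 3104 -- more data retained
}
```
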
    <div>
      <h3>Thank you - Next!</h3>
      <a href="#thank-you-next">
        
      </a>
    </div>
    <p>The Enhanced HTTP/2 Prioritization project has the lofty goal of fundamentally re-shaping how we send traffic from the Cloudflare edge to clients, and as the results of our testing and the feedback from some of our customers show, we have certainly achieved that! However, one of the most important lessons we took away from the project was the critical role that internal data flow within our NGINX software infrastructure plays in the traffic observed by our end users. We found that changing a few lines of (albeit critical) code could have a significant impact on the effectiveness and performance of our prioritization algorithms. Another positive outcome is that, in addition to improving HTTP/2, we look forward to carrying our newfound skills and lessons learned over to HTTP/3 over QUIC.</p><p>We are eager to share our modifications to NGINX with the community, so we have opened <a href="https://trac.nginx.org/nginx/ticket/1763">this</a> ticket, through which we will discuss upstreaming the event re-ordering change and the buffer partial re-use change with the NGINX team.</p><p>As Cloudflare continues to grow, our requirements on our software infrastructure also shift. Cloudflare has already moved beyond proxying of HTTP/1 over TCP to supporting termination and Layer 3 and 4 protection for any UDP and TCP traffic.
Now we are moving on to other technologies and protocols such as QUIC and HTTP/3, and to full proxying of a wide range of other protocols such as messaging and streaming media.</p><p>For these endeavours we are looking at new ways to answer questions on topics such as scalability, localised performance, wide-scale performance, introspection and debuggability, release agility, and maintainability.</p><p>If you would like to help us answer these questions, and know a bit about hardware and software scalability, network programming, asynchronous event- and futures-based software design, TCP, TLS, QUIC, HTTP, RPC protocols, Rust, or maybe something else, then have a look <a href="https://www.cloudflare.com/careers/">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[HTTP2]]></category>
            <guid isPermaLink="false">tzpAmDnnJeZ1Iu6R6B9KN</guid>
            <dc:creator>Nick Jones</dc:creator>
        </item>
        <item>
            <title><![CDATA[One more thing... new Speed Page]]></title>
            <link>https://blog.cloudflare.com/new-speed-page/</link>
            <pubDate>Mon, 20 May 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ With the Speed Page redesign, we are emphasizing the performance benefits of using Cloudflare and the additional improvements possible from our features. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/XqOoxFvK0VJ3KVmpQ12uk/12abe99caca7f6677cde4f88b1c38748/new-speed-page_3x.png" />
            
            </figure><p>Congratulations on making it through Speed Week. In the last week, Cloudflare has: described how <a href="/argo-and-the-cloudflare-global-private-backbone">our global network</a> speeds up the Internet, launched an <a href="/better-http-2-prioritization-for-a-faster-web">HTTP/2 prioritisation model</a> that will improve web experiences on all browsers, launched an <a href="/announcing-cloudflare-image-resizing-simplifying-optimal-image-delivery">image resizing service</a> which will deliver the optimal image to every device, <a href="/introducing-concurrent-streaming-acceleration">optimized live video delivery</a>, detailed how to <a href="/parallel-streaming-of-progressive-images">stream progressive images</a> so that they render twice as fast - using the flexibility of our new HTTP/2 prioritisation model - and finally, prototyped a new <a href="/binary-ast">over-the-wire format for JavaScript</a> that could improve application start-up performance, especially on mobile devices. As a bonus, we’re also rolling out one more new feature: “TCP Turbo”, which automatically chooses the TCP settings to further accelerate your website.</p><p>As a company, we want to help every one of our customers improve web experiences. The growth of Cloudflare, along with the increase in features, has often made simple questions difficult to answer:</p><ul><li><p>How fast is my website?</p></li><li><p>How should I be thinking about performance features?</p></li><li><p>How much faster would the site be if I were to enable a particular feature?</p></li></ul><p>This post will describe the exciting changes we have made to the Speed Page on the Cloudflare dashboard to give our customers a much clearer understanding of how their websites are performing and how they can be made even faster.
The new Speed Page consists of:</p><ul><li><p>A visual comparison of your website loading on Cloudflare, with caching enabled, compared to connecting directly to the origin.</p></li><li><p>The measured improvement expected if any performance feature is enabled.</p></li><li><p>A report describing how fast your website is on desktop and mobile.</p></li></ul><p>We want to simplify the complexity of making web experiences fast and give our customers control. Take a look - we hope you like it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7HG9zZiFbLdNOXolin2M33/a463e5a1537c803fe5467ad2808a859e/SpeedPage--1-.png" />
            
            </figure>
    <div>
      <h3>Why do fast web experiences matter?</h3>
      <a href="#why-do-fast-web-experiences-matter">
        
      </a>
    </div>
    <p><b>Customer experience:</b> No one likes slow service. Imagine going to a restaurant where the service is slow, especially when you arrive; you are not likely to go back or recommend it to your friends. It turns out the web works in the same way, and Internet customers are even more demanding. As many as <a href="https://blog.hubspot.com/marketing/page-load-time-conversion-rates">79% of customers</a> who are “dissatisfied” with a website’s performance are less likely to buy from that site again.</p><p><b>Engagement and Revenue:</b> There are many studies explaining how speed affects customer engagement, bounce rates and revenue.</p><p><b>Reputation:</b> There is also brand reputation to consider, as customers associate an online experience with the brand. A study found that for <a href="https://www.singlestoneconsulting.com/articles/patience-is-a-virtue">66%</a> of those surveyed, website performance influences their impression of the company.</p><p><b>Diversity:</b> Mobile traffic has grown to be larger than its desktop counterpart over the last few years. Mobile customers have become increasingly demanding and expect seamless Internet access regardless of location.</p><p>Mobile provides a new set of challenges that includes the diversity of device specifications. When testing, be aware that the average mobile device is significantly less capable than the top-of-the-range models. For example, there can be orders-of-magnitude disparity in the time different mobile devices take to run JavaScript. Another challenge is the variance in mobile performance, as customers move from a strong, high quality office network to mobile networks of varying speed (3G/5G) and quality, within the same browsing session.</p>
    <div>
      <h3>New Speed Page</h3>
      <a href="#new-speed-page">
        
      </a>
    </div>
    <p>There is compelling evidence that a faster web experience is important for anyone online. Most of the major studies involve the largest tech companies, who have whole teams dedicated to measuring and improving web experiences for their own services. At Cloudflare we are on a mission to help build a better and faster Internet for everyone - not just a select few.</p><p>Delivering fast web experiences is not a simple matter. That much is clear. To know what to send and when requires a deep understanding of every layer of the stack, from TCP tuning and protocol-level prioritisation, through content delivery formats, to the intricate mechanics of browser rendering. You will also need a global network that strives to be within 10 ms of every Internet user. The intrinsic value of such a network should be clear to everyone. Cloudflare <i>has</i> this network, but it also offers many additional performance features.</p><p>With the Speed Page redesign, we are emphasizing the performance benefits of using Cloudflare and the additional improvements possible from our features.</p><p>The de facto standard for measuring website performance has been <a href="https://www.webpagetest.org/">WebPageTest</a>. Having its creator in-house at Cloudflare encouraged us to use it as the basis for website performance measurement. So, what is the easiest way to understand how a web page loads? A list of statistics does not paint a full picture of actual user experience. One of the cool features of WebPageTest is that it can generate a filmstrip of screen snapshots taken during a web page load, enabling us to quantify how a page loads, visually. This view makes it significantly easier to determine how long the page is blank for, and how long it takes for the most important content to render. Being able to look at the results in this way provides the ability to empathise with the user.</p>
    <div>
      <h3>How fast on Cloudflare?</h3>
      <a href="#how-fast-on-cloudflare">
        
      </a>
    </div>
    <p>After moving your website to Cloudflare, you may have asked: How fast did this decision make my website? Well, now we provide the answer:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ul30Mk0yn4elSNrAodrkg/2428b0205cf0c82f2cc812f04fad3638/OnlyFilmstrip-1.png" />
            
            </figure><p>Comparison of website performance using Cloudflare.</p><p>As well as the increase in speed, we provide filmstrips of before and after, so that it is easy to compare and understand how differently a user will experience the website. If our tests are unable to reach your origin and you are already set up on Cloudflare, we will test with development mode enabled, which disables caching and minification.</p><p><b>Site performance statistics</b></p><p>How can we measure the user experience of a website?</p><p>Traditionally, <i>page load</i> was the important metric. Page load is a technical measurement used by browser vendors that has no bearing on the presentation or usability of a page. The metric reports on how long it takes not only to load the important content but also all of the 3rd party content (social network widgets, advertising, tracking scripts etc.). A user may very well not see anything until after all the page content has loaded, or they may be able to interact with a page immediately, while content continues to load.</p><p>A user will not decide whether a page is fast by a single measure or moment. A user will perceive how fast a website is from a combination of factors:</p><ul><li><p>when they <i>see</i> any response</p></li><li><p>when they see the <i>content</i> they expect</p></li><li><p>when they can <i>interact</i> with the page</p></li><li><p>when they can <i>perform</i> the task they intended</p></li></ul><p>Experience has shown that if you focus on one measure, it will likely be to the detriment of the others.</p><p><b>Importance of Visual response</b></p><p>If an impatient user navigates to your site and sees no content, or no valuable content, for several seconds, they are likely to get frustrated and leave.
The <a href="https://w3c.github.io/paint-timing/">paint timing spec</a> defines a set of <a href="https://docs.google.com/document/d/1BR94tJdZLsin5poeet0XoTW60M0SjvOJQttKT-JK8HI/view">paint metrics</a> that mark when content appears on a page, measuring the key moments in how a user perceives performance.</p><p><b>First Contentful Paint</b> (FCP) is the time when the browser first renders any DOM content.</p><p><b>First Meaningful Paint</b> (FMP) is the point in time when the page’s “primary” content appears on the screen. This metric should relate to what the user has come to the site to see and is designed as the point in time when the largest visible layout change happens.</p><p><b>Speed Index</b> attempts to quantify the value of the filmstrip rather than using a single paint timing. The <a href="https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index">speed index</a> measures the rate at which content is displayed - essentially the area above the curve. In the chart below from our progressive image feature, you can see that reaching 80% happens much earlier for the parallelized (red) load than for the regular (blue) one.</p><img src="http://staging.blog.mrk.cfdata.org/content/images/2019/05/speedindex-1.png" />
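<p>Speed Index can be computed directly from filmstrip data: sample the visual completeness over time and integrate the area above the curve. Here is a small sketch in Go (the sample points below are invented for illustration, not WebPageTest output):</p>

```go
package main

import "fmt"

// sample is one filmstrip frame: elapsed milliseconds since
// navigation start and the fraction of the final page rendered.
type sample struct {
	t        float64 // time in ms
	complete float64 // visual completeness, 0.0 to 1.0
}

// speedIndex integrates the area above the visual-completeness
// curve, holding completeness constant between samples. Lower is
// better: content that appears earlier shrinks the area.
func speedIndex(samples []sample) float64 {
	si := 0.0
	for i := 1; i < len(samples); i++ {
		dt := samples[i].t - samples[i-1].t
		si += dt * (1 - samples[i-1].complete)
	}
	return si
}

func main() {
	fast := []sample{{0, 0}, {1000, 0.75}, {2000, 1}}
	slow := []sample{{0, 0}, {1000, 0.25}, {2000, 1}}
	fmt.Println(speedIndex(fast), speedIndex(slow)) // 1250 1750
}
```

<p>Both hypothetical pages finish at the same time, but the one that is 75% complete after one second scores the lower (better) Speed Index - the same effect the red parallelized curve shows in the chart.</p>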

    <div>
      <h3>Importance of interactivity</h3>
      <a href="#importance-of-interactivity">
        
      </a>
    </div>
    <p>The same impatient user is now happy that the content they want to see has appeared. They will still become frustrated if they are unable to interact with the site.</p><p><b>Time to Interactive</b> is the time it takes for content to be rendered and for the page to be ready to receive input from the user. Technically, this is defined as the point at which the browser’s main processing thread has been idle for several seconds after first meaningful paint.</p><p>The Speed Tab displays these key metrics for mobile and desktop.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50EYtQgFSjMevEDSsIk4CG/c327d48e0b7b24fe8d650de11fd88834/CriticalLoadingTime-5.png" />
            
            </figure>
    <div>
      <h3>How much faster on Cloudflare?</h3>
      <a href="#how-much-faster-on-cloudflare">
        
      </a>
    </div>
    <p>The Cloudflare Dashboard provides a list of performance features which can, admittedly, be both confusing and daunting. What would be the benefit of turning on Rocket Loader, and on which performance metrics will it have the most impact? If you upgrade to Pro, what will be the value of the enhanced HTTP/2 prioritisation? The optimization section answers these questions.</p><p>Tests are run with each performance feature turned on and off. The values of the appropriate performance metrics are displayed for each test, along with the improvement. You can enable or upgrade the feature from this view. Here are a few examples:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LkuEkCNY8bloX9JvJEAOS/1c067dbce64927affb3e4420044dd4af/RocketLoader.png" />
            
            </figure><p>If Rocket Loader were enabled for this website, the render-blocking JavaScript would be deferred, causing first paint time to drop from 1.25s to 0.81s - an improvement of 32% on desktop.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/kyJrVQRTpGxQb3HuFBIJn/82475df75c20123ec9d81a7ed2631c3f/Mirage.png" />
            
            </figure><p>Image heavy sites do not perform well on slow mobile connections. If you enable Mirage, your customers on 3G connections would see meaningful content 1s sooner - an improvement of 29.4%.</p><p>So how about our new features?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5erNk1iTWG0gx7vm8dwITX/288529be98b01aa378dc4489a8c690fd/H2.png" />
            
            </figure><p>We tested the enhanced HTTP/2 prioritisation feature on an Edge browser on desktop and saw meaningful content display 2s sooner - an improvement of 64%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5DSkRUoK50wHoHsUtWHcc1/32099750c650930c1d9c89e7a40033bb/ProImgStreaming.png" />
            
            </figure><p>This is a more interesting result taken from the blog example used to illustrate the progressive image streaming. At first glance the improvement of 29% in speed index is good. The filmstrip comparison shows a more significant difference. In this case the page with no images shown is already 43% visually complete for both scenarios after 1.5s. At 2.5s the difference is 77% compared to 50%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2tf37dOPAwHL1lKPQNkxVq/6fd8cc88184349c12f57bec4adeba5e0/filmstrip.png" />
            
            </figure><p>This is a great example of how metrics do not tell the full story. They cannot completely replace viewing the page loading flow and understanding what is important for your site.</p>
    <div>
      <h3>How to try</h3>
      <a href="#how-to-try">
        
      </a>
    </div>
    <p>This is our first iteration of the new Speed Page and we are eager to get your feedback. We will be rolling this out to beta customers who are interested in seeing how their sites perform. To be added to the queue for activation of the new Speed Page, please click on the banner on the overview page,</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1eHf4V1lYUprNu6EuYGVLc/fa2b5e64ed05508e9139cdd37e180596/BannerOverview.png" />
            
            </figure><p>or click on the banner on the existing Speed Page.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sJqPyA2OJDHOkxRDiIwvB/de77d2e82ab87960f06c3385e3accda5/BannerSpeed.png" />
            
            </figure> ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Dashboard]]></category>
            <guid isPermaLink="false">24ufNjE5kwbiozkOO2zlWM</guid>
            <dc:creator>Andrew Galloni</dc:creator>
        </item>
        <item>
            <title><![CDATA[Better HTTP/2 Prioritization for a Faster Web]]></title>
            <link>https://blog.cloudflare.com/better-http-2-prioritization-for-a-faster-web/</link>
            <pubDate>Tue, 14 May 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ Advancing the state of the art of HTTP/2 to deliver a faster user experience. See why it matters and how much of a difference it can make. ]]></description>
            <content:encoded><![CDATA[ <p>HTTP/2 promised a much faster web and Cloudflare rolled out HTTP/2 access for all our customers long, long ago. But one feature of HTTP/2, Prioritization, didn’t live up to the hype. Not because it was fundamentally broken but because of the way browsers implemented it.</p><p>Today Cloudflare is pushing out a change to HTTP/2 Prioritization that gives our servers control of prioritization decisions that truly make the web much faster.</p><p>Historically the browser has been in control of deciding how and when web content is loaded. Today we are introducing a radical change to that model for all paid plans that puts control into the hands of the site owner directly. Customers can enable “Enhanced HTTP/2 Prioritization” in the Speed tab of the Cloudflare dashboard: this overrides the browser defaults with an improved scheduling scheme that results in a significantly faster visitor experience (we have seen 50% faster on multiple occasions). With <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a>, site owners can take this a step further and fully customize the experience to their specific needs.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>Web pages are made up of <a href="https://discuss.httparchive.org/t/whats-the-distribution-of-requests-per-page/21/10?u=patmeenan">dozens (sometimes hundreds)</a> of separate resources that are loaded and assembled by the browser into the final displayed content. This includes the visible content the user interacts with (HTML, CSS, images) as well as the application logic (JavaScript) for the site itself, ads, analytics for tracking site usage and marketing tracking beacons. The sequencing of how those resources are loaded can have a significant impact on how long it takes for the user to see the content and interact with the page.</p><p>A browser is basically an HTML processing engine that goes through the HTML document and follows the instructions in order from the start of the HTML to the end, building the page as it goes along. References to stylesheets (CSS) tell the browser how to style the page content and the browser will delay displaying content until it has loaded the stylesheet (so it knows how to style the content it is going to display). Scripts referenced in the document can have several different behaviors. If the script is tagged as “async” or “defer” the browser can keep processing the document and just run the script code whenever the scripts are available. If the scripts are not tagged as async or defer then the browser <a href="https://html.spec.whatwg.org/">MUST</a> stop processing the document until the script has downloaded and executed before continuing. These are referred to as “blocking” scripts because they block the browser from continuing to process the document until they have been loaded and executed.</p><p>The HTML document is split into two parts. The <code>&lt;head&gt;</code> of the document is at the beginning and contains stylesheets, scripts and other instructions for the browser that are needed to display the content. 
The <code>&lt;body&gt;</code> of the document comes after the head and contains the actual page content that is displayed in the browser window (though scripts and stylesheets are allowed to be in the body as well). Until the browser gets to the body of the document there is nothing to display to the user and the page will remain blank so getting through the head of the document as quickly as possible is important. “HTML5 rocks” has a <a href="https://www.html5rocks.com/en/tutorials/internals/howbrowserswork/">great tutorial</a> on how browsers work if you want to dive deeper into the details.</p><p>The browser is generally in charge of determining the order of loading the different resources it needs to build the page and to continue processing the document. In the case of HTTP/1.x, the browser is limited in how many things it can request from any one server at a time (generally 6 connections and only one resource at a time per connection) so the ordering is strictly controlled by the browser by how things are requested. With HTTP/2 <a href="https://www.cloudflare.com/learning/performance/http2-vs-http1.1/">things change pretty significantly</a>. The browser can request all of the resources at once (at least as soon as it knows about them) and it provides <a href="https://hpbn.co/http2/#stream-prioritization">detailed instructions</a> to the server for how the resources should be delivered.</p>
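<p>For HTTP/2 (RFC 7540), those instructions take the form of priority information attached to each stream: a parent stream, an exclusive flag, and a weight from 1 to 256, with sibling streams sharing their parent's bandwidth in proportion to their weights. A hypothetical sketch of that weight arithmetic in Go (the stream IDs and weights are invented for the example, not taken from any browser):</p>

```go
package main

import "fmt"

// shares computes how sibling streams under one parent split the
// parent's bandwidth under RFC 7540 prioritization: each stream's
// share is weight / sum(weights), with weights from 1 to 256.
func shares(weights map[uint32]int) map[uint32]float64 {
	total := 0
	for _, w := range weights {
		total += w
	}
	out := make(map[uint32]float64, len(weights))
	for id, w := range weights {
		out[id] = float64(w) / float64(total)
	}
	return out
}

func main() {
	// Hypothetical siblings: a stylesheet (stream 3, weight 192),
	// a script (stream 5, weight 192) and an image (stream 7, weight 64).
	s := shares(map[uint32]int{3: 192, 5: 192, 7: 64})
	fmt.Printf("%.2f %.2f %.2f\n", s[3], s[5], s[7]) // 0.43 0.43 0.14
}
```

<p>With these weights, the image still receives a share of the connection while the render-blocking resources are in flight - which, as the rest of this post shows, is not always what you want.</p>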
    <div>
      <h3>Optimal Resource Ordering</h3>
      <a href="#optimal-resource-ordering">
        
      </a>
    </div>
    <p>For most parts of the page loading cycle there is an optimal ordering of the resources that will result in the fastest user experience (and the difference between optimal and not can be significant - as much as a 50% improvement or more).</p><p>As described above, early in the page load cycle before the browser can render any content it is blocked on the CSS and blocking JavaScript in the <code>&lt;head&gt;</code> section of the HTML. During that part of the loading cycle it is best for 100% of the connection bandwidth to be used to download the blocking resources and for them to be downloaded one at a time in the order they are defined in the HTML. That lets the browser parse and execute each item while it is downloading the next blocking resource, allowing the download and execution to be pipelined.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/s20F4kV70i7sqhrMifCLU/588f1954891895eedbc46df730d8afdc/pipeline-1.png" />
            
            </figure><p>The scripts take the same amount of time to download whether they are fetched in parallel or one after the other, but by downloading them sequentially the first script can be parsed and executed while the second is still downloading.</p><p>Once the render-blocking content has loaded, things get a little more interesting and the optimal loading may depend on the specific site or even business priorities (user content vs ads vs analytics, etc). Fonts in particular can be difficult, as the browser only discovers what fonts it needs after the stylesheets have been applied to the content that is about to be displayed. By the time the browser knows about a font, it is already needed to display text that is ready to be drawn to the screen. Any delays in getting the font loaded end up as periods with blank text on the screen (or text displayed using the wrong font).</p><p>Generally there are some tradeoffs that need to be considered:</p><ul><li><p>Custom fonts and visible images in the visible part of the page (viewport) should be loaded as quickly as possible. They directly impact the user’s visual experience of the page loading.</p></li><li><p>Non-blocking JavaScript should be downloaded serially relative to other JavaScript resources so the execution of each can be pipelined with the downloads. The JavaScript may include user-facing application logic as well as analytics tracking and marketing beacons, and delaying them can cause a drop in the metrics that the business tracks.</p></li><li><p>Images benefit from downloading in parallel. 
The first few bytes of an image file contain the image dimensions which may be necessary for browser layout, and progressive images downloading in parallel can look visually complete with around 50% of the bytes transferred.</p></li></ul><p>Weighing the tradeoffs, one strategy that works well in most cases is:</p><ul><li><p>Custom fonts download sequentially and split the available bandwidth with visible images.</p></li><li><p>Visible images download in parallel, splitting the “images” share of the bandwidth among them.</p></li><li><p>When there are no more fonts or visible images pending:</p><ul><li><p>Non-blocking scripts download sequentially and split the available bandwidth with non-visible images</p></li><li><p>Non-visible images download in parallel, splitting the “images” share of the bandwidth among them.</p></li></ul></li></ul><p>That way the content visible to the user is loaded as quickly as possible, the application logic is delayed as little as possible and the non-visible images are loaded in such a way that layout can be completed as quickly as possible.</p>
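<p>The strategy above can be sketched as a two-phase scheduler: while fonts or visible images are pending, serve the first font (sequential) alongside every visible image (parallel); once both are drained, serve the first non-blocking script alongside all non-visible images. The Go sketch below is an illustrative simplification, not Cloudflare's actual scheduler:</p>

```go
package main

import "fmt"

// Resource classes from the strategy described in the text.
const (
	Font = iota
	VisibleImage
	Script // non-blocking JavaScript
	HiddenImage
)

type resource struct {
	name  string
	class int
}

// nextBatch picks which pending resources download now: fonts go
// one at a time while visible images all share the link; once both
// are drained, scripts go one at a time alongside hidden images.
func nextBatch(pending []resource) []resource {
	var fonts, visImgs, scripts, hidImgs []resource
	for _, r := range pending {
		switch r.class {
		case Font:
			fonts = append(fonts, r)
		case VisibleImage:
			visImgs = append(visImgs, r)
		case Script:
			scripts = append(scripts, r)
		case HiddenImage:
			hidImgs = append(hidImgs, r)
		}
	}
	// Phase 1: fonts (sequential) share the link with visible images.
	if len(fonts) > 0 || len(visImgs) > 0 {
		batch := append([]resource{}, visImgs...)
		if len(fonts) > 0 {
			batch = append(batch, fonts[0]) // only the first font
		}
		return batch
	}
	// Phase 2: scripts (sequential) share the link with hidden images.
	batch := append([]resource{}, hidImgs...)
	if len(scripts) > 0 {
		batch = append(batch, scripts[0]) // only the first script
	}
	return batch
}

func main() {
	pending := []resource{
		{"font-a", Font}, {"font-b", Font},
		{"hero", VisibleImage}, {"logo", VisibleImage},
		{"analytics", Script}, {"footer", HiddenImage},
	}
	for _, r := range nextBatch(pending) {
		fmt.Println(r.name) // hero, logo and font-a; font-b waits its turn
	}
}
```

<p>The second font only enters a batch once the first has finished, preserving the sequential download that lets fonts be applied as they arrive, while images in the same phase always download together.</p>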
    <div>
      <h3>Example</h3>
      <a href="#example">
        
      </a>
    </div>
    <p>For illustrative purposes, we will use a simplified product category page from a typical <a href="https://www.cloudflare.com/ecommerce/">e-commerce site</a>. In this example the page has:</p><p><img src="https://images.ctfassets.net/zkvhlag99gkb/145sNNiiOcM6IHvlHCPrKN/2bf1eda5852c3cbef911230fa4c5b327/html-1.png?h=50" /> The HTML file for the page itself, represented by a blue box.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/4BsBnjaP0Lf79mAiFfAecf/1270b865cbc581082352c89f8c1ecef4/css-1.png?h=50" /> 1 external stylesheet (CSS file), represented by a green box.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/hpncpQ52hkS4CtiZuIs05/98953dd0e08b5c8d0bbb4d4cba5aa731/script-1.png" /> 4 external scripts (JavaScript), represented by orange boxes. 2 of the scripts are blocking at the beginning of the page and 2 are asynchronous. The blocking script boxes use a darker shade of orange.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/59Odb7qJYEVNtcMILE7V1l/a59365e9c52bae8607237513e4b6f583/font-1.png" /> 1 custom web font, represented by a red box.</p>
<p><img src="https://images.ctfassets.net/zkvhlag99gkb/6Vu0MFw5Jmo9DKkBiNhTq4/dafbe4485dee9aeec87685728ec7b926/image-1.png" /> 13 images, represented by purple boxes. The page logo and 4 of the product images are visible in the viewport and 8 of the product images require scrolling to see. The 5 visible images use a darker shade of purple.</p><p>For simplicity, we will assume that all the resources are the same size and each takes 1 second to download on the visitor’s connection. Loading everything takes a total of 20 seconds, but HOW it is loaded can have a huge <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">impact</a> on the experience.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZDzhZm88h13LKAQAlmJ3W/0283bd74ba6c85042717a096a1aa00ba/optimal-1.png" />
            
            </figure><p>This is what the described optimal loading would look like in the browser as the resources load:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6VglYMpZ814iSoMmiDEoLz/a5e4849cf69b7adeb7be5bba7cb863cd/Optimal_loading-1.gif" />
            
            </figure><ul><li><p>The page is blank for the first 4 seconds while the HTML, CSS and blocking scripts load, all using 100% of the connection.</p></li><li><p>At the 4-second mark the background and structure of the page are displayed with no text or images.</p></li><li><p>One second later, at 5 seconds, the text for the page is displayed.</p></li><li><p>From 5-10 seconds the images load, starting out as blurry but sharpening very quickly. By around the 7-second mark the page is almost indistinguishable from the final version.</p></li><li><p>At the 10-second mark all of the visual content in the viewport has completed loading.</p></li><li><p>Over the next 2 seconds the asynchronous JavaScript is loaded and executed, running any non-critical logic (analytics, marketing tags, etc).</p></li><li><p>For the final 8 seconds the rest of the product images load so they are ready for when the user scrolls.</p></li></ul>
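<p>Those milestones follow directly from the equal-size assumption: 20 resources at one second each, served in the described order. A quick back-of-the-envelope check in Go (the resource counts are taken from the example page above):</p>

```go
package main

import "fmt"

// milestones derives the timeline from the example's assumptions:
// 20 equal resources, each taking one second at full bandwidth.
func milestones() (firstPaint, viewportDone, logicDone, pageDone int) {
	const secPerResource = 1

	blocking := 4       // HTML, CSS and 2 blocking scripts, strictly in sequence
	fontAndVisible := 6 // 1 custom font plus 5 visible images
	asyncScripts := 2   // deferred application logic
	hiddenImages := 8   // below-the-fold product images

	firstPaint = blocking * secPerResource
	viewportDone = firstPaint + fontAndVisible*secPerResource
	logicDone = viewportDone + asyncScripts*secPerResource
	pageDone = logicDone + hiddenImages*secPerResource
	return
}

func main() {
	fmt.Println(milestones()) // 4 10 12 20
}
```

<p>The four numbers line up with the timeline: first paint at 4 seconds, viewport complete at 10, asynchronous logic done at 12, and the full page at 20.</p>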
    <div>
      <h3>Current Browser Prioritization</h3>
      <a href="#current-browser-prioritization">
        
      </a>
    </div>
    <p>All of the current browser engines implement <a href="https://calendar.perfplanet.com/2018/http2-prioritization/">different prioritization strategies</a>, none of which are optimal.</p><p><b>Microsoft Edge and Internet Explorer</b> <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#microsoft_edge_internet_explorer">do not support prioritization</a>, so everything falls back to the HTTP/2 default, which is to load everything in parallel, splitting the bandwidth evenly among all resources. Microsoft Edge is moving to use the Chromium browser engine in future Windows releases, which will help improve the situation. In our example page, this means that the browser is stuck in the head for the majority of the loading time since the images are slowing down the transfer of the blocking scripts and stylesheets.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ERv7hb1NYBHy60aJuy5jD/04cd8f5947317252fa17648d4d227216/edge-1.png" />
            
            </figure><p>Visually that results in a pretty painful experience of staring at a blank screen for 19 seconds before most of the content displays, followed by a 1-second delay for the text to display. Be patient when watching the animated progress because for the 19 seconds of blank screen it may feel like nothing is happening (even though it is):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3UcgquIFWREMgbBsEJNQk2/5caa49f6d98a28bc407bd1d6773670f8/Edge_loading-1.gif" />
            
            </figure><p><b>Safari</b> <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#safari">loads all resources in parallel</a>, splitting the bandwidth between them based on how important Safari believes they are (with render-blocking resources like scripts and stylesheets being more important than images). Images load in parallel but also load at the same time as the render-blocking content.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2L8xcpZEdYQJstyuI5ORRR/45220b046032cb8ffd1f2ce85cc1f0cd/safari-1.png" />
            
            </figure><p>While similar to Edge in that everything downloads at the same time, by allocating more bandwidth to the render-blocking resources Safari can display the content much sooner:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bN7ouLtrKcl2dWQ2fESJ5/f7fd9768b20b76deb0dcee3488fa0296/Safari_loading-1.gif" />
            
            </figure><ul><li><p>At around 8 seconds the stylesheet and scripts have finished loading so the page can start to be displayed. Since the images were loading in parallel, they can also be rendered in their partial state (blurry for progressive images). This is still twice as slow as the optimal case but much better than what we saw with Edge.</p></li><li><p>At around 11 seconds the font has loaded so the text can be displayed and more image data has been downloaded so the images will be a little sharper. This is comparable to the experience around the 7-second mark for the optimal loading case.</p></li><li><p>For the remaining 9 seconds of the load the images get sharper as more data for them downloads until it is finally complete at 20 seconds.</p></li></ul><p><b>Firefox</b> builds a dependency tree that <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#firefox">groups resources</a> and then schedules the groups to either load one after another or to share bandwidth between the groups. Within a given group the resources share bandwidth and download concurrently. The images are scheduled to load after the render-blocking stylesheets and to load in parallel but the render-blocking scripts and stylesheets also load in parallel and do not get the benefits of pipelining.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3mnypoOho1hKfUiSl834Cf/7aeef99c40884ea1c4807f7ba3cfdbc3/firefox-1.png" />
            
            </figure><p>In our example case this ends up being a slightly faster experience than with Safari since the images are delayed until after the stylesheets complete:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yjKwptC34WE2GigxNdo0R/7a7616f2c1876d62765649bd768b698a/Firefox_loading-1.gif" />
            
            </figure><ul><li><p>At the 6 second mark the initial page content is rendered with the background and blurry versions of the product images (compared to 8 seconds for Safari and 4 seconds for the optimal case).</p></li><li><p>At 8 seconds the font has loaded and the text can be displayed along with slightly sharper versions of the product images (compared to 11 seconds for Safari and 7 seconds in the Optimal case).</p></li><li><p>For the remaining 12 seconds of the loading the product images get sharper as the remaining content loads.</p></li></ul><p><b>Chrome</b> (and all Chromium-based browsers) prioritizes resources into a <a href="https://calendar.perfplanet.com/2018/http2-prioritization/#chrome">list</a>. This works really well for the render-blocking content that benefits from loading in order but works less well for images. Each image loads to 100% completion before starting the next image.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VAHVWx6bngx7CK2px6qAe/08da953608a3d887b8ae59d68db633be/chrome-1.png" />
            
            </figure><p>In practice this is almost as good as the optimal loading case with the only difference being that the images load one at a time instead of in parallel:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4WxB31Ov3jzmpNmPRjH0A5/8e23e517276c6a13e7c7f0087f23bd65/Chrome_loading-1.gif" />
            
            </figure><ul><li><p>Up until the 5 second mark the Chrome experience is identical to the optimal case, displaying the background at 4 seconds and the text content at 5.</p></li><li><p>For the next 5 seconds the visible images load one at a time until they are all complete at the 10 second mark (compared to the optimal case where they are just slightly blurry at 7 seconds and sharpen up for the remaining 3 seconds).</p></li><li><p>After the visual part of the page is complete at 10 seconds (identical to the optimal case), the remaining 10 seconds are spent running the async scripts and loading the hidden images (just like with the optimal loading case).</p></li></ul>
    <div>
      <h3>Visual Comparison</h3>
      <a href="#visual-comparison">
        
      </a>
    </div>
    <p>Visually, the impact can be quite dramatic, even though they all take the same amount of time to technically load all of the content:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/67CSkIvoirmEfMIUAo4y07/ed87a8f8ce29a212c1cd09c2c69ec95a/compare.gif" />
            
            </figure>
    <div>
      <h3>Server-Side Prioritization</h3>
      <a href="#server-side-prioritization">
        
      </a>
    </div>
    <p>HTTP/2 prioritization is requested by the client (browser) and it is up to the server to decide what to do based on the request. A <a href="https://github.com/andydavies/http2-prioritization-issues">good number of servers don’t support doing anything at all with the prioritization</a> but for those that do, they all honor the client’s request. Another option would be to decide on the best prioritization to use on the server-side, taking into account the client’s request.</p><p>Per the <a href="https://httpwg.org/specs/rfc9113.html#StreamPriority">specification</a>, HTTP/2 prioritization is a dependency tree that requires full knowledge of all of the in-flight requests to be able to prioritize resources against each other. That allows for incredibly complex strategies but is difficult to implement well on either the browser or server side (as evidenced by the different browser strategies and varying levels of server support). To make prioritization easier to manage we have developed a simpler prioritization scheme that still has all of the flexibility needed for optimal scheduling.</p><p>The Cloudflare prioritization scheme consists of 64 priority “levels” and within each priority level there are groups of resources that determine how the connection is shared between them:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/245bFqwfOBgQvwwRLWGNGE/6898590819dedda5cf30444e7014c059/priorities.png" />
            
            </figure><p>All of the resources at a higher priority level are transferred before moving on to the next lower priority level.</p><p>Within a given priority level, there are 3 different “concurrency” groups:</p><ul><li><p><b>0</b> : All of the resources in the concurrency “0” group are sent sequentially in the order they were requested, using 100% of the bandwidth. Only after all of the concurrency “0” group resources have been downloaded are other groups at the same level considered.</p></li><li><p><b>1</b> : All of the resources in the concurrency “1” group are sent sequentially in the order they were requested. The available bandwidth is split evenly between the concurrency “1” group and the concurrency “n” group.</p></li><li><p><b>n</b> : The resources in the concurrency “n” group are sent in parallel, splitting the bandwidth available to the group between them.</p></li></ul><p>Practically speaking, the concurrency “0” group is useful for critical content that needs to be processed sequentially (scripts, CSS, etc). The concurrency “1” group is useful for less-important content that can share bandwidth with other resources but where the resources themselves still benefit from processing sequentially (async scripts, non-progressive images, etc). The concurrency “n” group is useful for resources that benefit from processing in parallel (progressive images, video, audio, etc).</p>
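<p>To make the sharing rules concrete, here is a small model of how bandwidth divides at a single priority level. This is an illustrative sketch of the scheme described above, not Cloudflare's actual scheduler:</p>

```javascript
// Model the connection share each resource gets at ONE priority level,
// given [{ name, group }] in request order (group is '0', '1', or 'n').
function bandwidthShares(resources) {
  const g0 = resources.filter(r => r.group === '0');
  const g1 = resources.filter(r => r.group === '1');
  const gn = resources.filter(r => r.group === 'n');
  const shares = {};
  for (const r of resources) shares[r.name] = 0;

  if (g0.length > 0) {
    // Group "0" is strictly sequential and runs before the other groups:
    // the earliest-requested resource gets 100% of the connection.
    shares[g0[0].name] = 1;
    return shares;
  }

  // Groups "1" and "n" split the available bandwidth evenly between them.
  const g1Budget = g1.length && gn.length ? 0.5 : g1.length ? 1 : 0;
  const gnBudget = g1.length && gn.length ? 0.5 : gn.length ? 1 : 0;

  if (g1.length > 0) {
    shares[g1[0].name] = g1Budget;          // sequential within group "1"
  }
  for (const r of gn) {
    shares[r.name] = gnBudget / gn.length;  // parallel within group "n"
  }
  return shares;
}
```

<p>So a render-blocking script in group “0” starves everything else at its level, while an async script in group “1” alongside two progressive images in group “n” would get 50%, 25% and 25% of the connection respectively.</p>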
    <div>
      <h3>Cloudflare Default Prioritization</h3>
      <a href="#cloudflare-default-prioritization">
        
      </a>
    </div>
    <p>When enabled, the enhanced prioritization implements the “optimal” scheduling of resources described above. The specific prioritizations applied look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36JhpUvPcpYATKuUcb0HQm/5588ef0b814df225b2f1d934f34c4638/defaults.png" />
            
            </figure><p>This prioritization scheme allows sending the render-blocking content serially, followed by the visible images in parallel and then the rest of the page content with some level of sharing to balance application and content loading. The “* If Detectable” caveat is that not all browsers differentiate between the different types of stylesheets and scripts, but pages will still load significantly faster in all cases. A 50% improvement by default is not unusual, particularly for Edge and Safari visitors:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3br9uRCejjByZGW4mVxQyp/d4755f71f3b5432373244dbe1ac7e89d/filmstrip2.png" />
            
            </figure>
    <div>
      <h3>Customizing Prioritization with Workers</h3>
      <a href="#customizing-prioritization-with-workers">
        
      </a>
    </div>
    <p>Faster-by-default is great, but where things get really interesting is that the ability to configure the prioritization is also exposed to Cloudflare Workers, so sites can override the default prioritization for resources or implement their own complete prioritization schemes.</p><p>If a Worker adds a “cf-priority” header to the response, Cloudflare edge servers will use the specified priority and concurrency for that response. The format of the header is <code>&lt;priority&gt;/&lt;concurrency&gt;</code>, so something like <code>response.headers.set('cf-priority', '30/0');</code> would set the priority to 30 with a concurrency of 0 for the given response. Similarly, “30/1” would set concurrency to 1 and “30/n” would set concurrency to n.</p><p>With this level of flexibility a site can tweak resource prioritization to meet its needs: boosting the priority of some critical async scripts, for example, or increasing the priority of hero images before the browser has identified that they are in the viewport.</p><p>To help inform any prioritization decisions, the Workers runtime also exposes the browser-requested prioritization information in the request object passed in to the Worker’s fetch event listener (request.cf.requestPriority). The incoming requested priority is a semicolon-delimited list of attributes that looks something like this: “weight=192;exclusive=0;group=3;group-weight=127”.</p><ul><li><p><b>weight</b>: The browser-requested weight for the HTTP/2 prioritization.</p></li><li><p><b>exclusive</b>: The browser-requested HTTP/2 exclusive flag (1 for Chromium-based browsers, 0 for others).</p></li><li><p><b>group</b>: HTTP/2 stream ID for the request group (only non-zero for Firefox).</p></li><li><p><b>group-weight</b>: HTTP/2 weight for the request group (only non-zero for Firefox).</p></li></ul>
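<p>To make the header format concrete, here is a minimal sketch of a helper a Worker might use to pick a cf-priority value per response. The URL paths and priority numbers below are illustrative assumptions, not Cloudflare defaults or recommendations:</p>

```javascript
// Sketch only: choose a "<priority>/<concurrency>" value for a response
// based on its URL path. All paths and numbers here are hypothetical.
function choosePriority(pathname) {
  if (pathname.endsWith('.css')) {
    return '30/0';   // render-blocking CSS: sequential, exclusive ("0")
  }
  if (pathname.startsWith('/hero/')) {
    return '25/n';   // hero images: boosted, sharing bandwidth ("n")
  }
  if (pathname.endsWith('.js')) {
    return '20/1';   // async scripts: sequential, sharing with group "n"
  }
  return null;       // anything else keeps the default prioritization
}

// Applied inside a Worker's fetch handler it would look something like:
//   const response = new Response(origin.body, origin);
//   const p = choosePriority(new URL(request.url).pathname);
//   if (p) response.headers.set('cf-priority', p);
```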
    <div>
      <h3>This is Just the Beginning</h3>
      <a href="#this-is-just-the-beginning">
        
      </a>
    </div>
    <p>The ability to tune and control the prioritization of responses is the basic building block that a lot of future work will benefit from. We will be implementing our own <a href="https://www.cloudflare.com/application-services/products/website-optimization/">advanced optimizations</a> on top of it but by exposing it in Workers we have also opened it up to sites and researchers to experiment with different prioritization strategies. With the Apps Marketplace it is also possible for companies to build new optimization services on top of the Workers platform and make it available to other sites to use.</p><p>If you are on a <a href="https://www.cloudflare.com/plans/pro/">Pro plan</a> or above, head over to the speed tab in the Cloudflare dashboard and turn on “Enhanced HTTP/2 Prioritization” to accelerate your site.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36yaMZfdrrrZlYjivB5xnY/8730346008c9a991586a91b61754fd1c/speed-week-copy-2_2x.png" />
            
            </figure><p></p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">14X39RjlgOpBoyWcVxeYGf</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Fast Google Fonts with Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/fast-google-fonts-with-cloudflare-workers/</link>
            <pubDate>Thu, 22 Nov 2018 07:58:00 GMT</pubDate>
            <description><![CDATA[ Code walkthrough on using Cloudflare workers to improve the performance of sites using Google Fonts. The improvements can be quite dramatic and the sample provides a real-world use case of doing streaming HTML modification in a Cloudflare Worker. ]]></description>
            <content:encoded><![CDATA[ <p>Google Fonts is one of the most common third-party resources on the web, but carries with it significant user-facing performance issues. Cloudflare Workers running at the edge is a great solution for fixing these performance issues, without having to modify the publishing system for every site using Google Fonts.</p><p>This post walks through the implementation details for how to fix the performance of Google Fonts with Cloudflare Workers. More importantly, it also provides code for doing high-performance content modification at the edge using Cloudflare Workers.</p>
    <div>
      <h3>Google fonts are SLOW</h3>
      <a href="#google-fonts-are-slow">
        
      </a>
    </div>
    <p>First, some background. <a href="https://fonts.google.com/">Google Fonts</a> provides a rich selection of royalty-free fonts for sites to use. You select the fonts you want to use, and end up with a simple stylesheet URL to include on your pages, as well as styles to use for applying the fonts to parts of the page:</p>
            <pre><code>&lt;link href="https://fonts.googleapis.com/css?family=Open+Sans|Roboto+Slab"
      rel="stylesheet"&gt;
&lt;style&gt;
body {
 font-family: 'Open Sans', sans-serif;
}
h1 {
 font-family: 'Roboto Slab', serif;
}</code></pre>
            <p>Your visitor’s browser fetches the CSS file as soon as the HTML for the page is available. The browser will request the underlying font files when the browser does layout for the page and discovers that it needs fonts for different sections of text.</p><p>The way Google fonts are served, the CSS is on one domain (fonts.googleapis.com) and the font files are on a different domain (fonts.gstatic.com). Since they are on separate domains, the fetch for each resource takes a minimum of four round trips back and forth to the server: One each for the DNS lookup, establishing the socket connection, negotiating TLS encryption (for https) and a final round trip for the request itself.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2emQRH8SC0mYJpSOJ58FUK/34a15d2f34fe2782706f04d3ecc78d98/before-round-trips.png" />
            
            </figure><p>4 round trips each for the font css and font file.</p><p>The requests can’t be done in parallel because the fonts aren’t known about until after the CSS has been downloaded and the styles applied to the page. In a best-case scenario, this translates to eight round-trips before the text can be displayed (the text which was already available in the browser as soon as the HTML was available). On a slower 3G connection with a 300ms round-trip time, this adds up to a 2.4 second delay. Best case!</p><p>There is also a problem with resource prioritization. When you have all of the requests coming from the same domain on the same HTTP/2 connection they can be scheduled against each other. Critical resources (like CSS and fonts) can be pushed ahead in the queue and delivered before lower priority resources (like images). Since Google Fonts (and most third-party resources) are served from a different domain than the main page resources, they cannot be prioritized and end up competing with each other for download bandwidth. This can cause the actual fetch times to be much longer than the best-case of 8 round trips.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/hYNUX0ngMLAKMUqN8jQhj/de35321957316e751a8411df7068f851/before-competing.png" />
            
            </figure><p>CSS and font download competing with low-priority images for bandwidth.</p><p>Users will see the images and skeleton for the page <a href="https://www.webpagetest.org/video/view.php?id=181016_befafeb79a563c08a7c4012ec10379ccb1c2f755">long before</a> the actual text is displayed:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7tAfGVsUZxzYE99QUHQTlr/44bfe9947e72383443afb6f29b88e343/b1-image-paint.png" />
            
            </figure><p>The page starts painting at 3.3 seconds with partial images and no text.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1c2TrfVScCtSgq6uAquxOC/242f0831bdc00aebaca98af15e1f285e/b2-text.png" />
            
            </figure><p>The text finally displays at 6.2 seconds into the page load.</p>
    <div>
      <h3>Fixing Google Fonts performance</h3>
      <a href="#fixing-google-fonts-performance">
        
      </a>
    </div>
    <p>The paths to fixing the performance issues and making fonts lightning-fast are different for the CSS and the font files themselves. We can reduce the total number of round trips to one:</p><ol><li><p>Embed the CSS directly in the HTML.</p></li><li><p>Proxy the Google Font files through the page origin.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ImYCsOuIuMBm24oD75ZfA/89877f802875de7e320691f7464125eb/Fonts.png" />
            
            </figure><p>The embedded CSS loads immediately as part of the HTML and the fonts load before the images.</p>
    <div>
      <h4>Optimizing CSS delivery</h4>
      <a href="#optimizing-css-delivery">
        
      </a>
    </div>
    <p>For CSS, the easy answer is to just download the CSS file that Google hosts and either serve it yourself directly or place it into the HTML as an embedded stylesheet. The problem with this is that Google fonts serves a browser-specific CSS file that is different for each browser so it can serve newer formats and use newer features, when supported, while still providing custom font support for older browsers.</p><p>With Cloudflare Workers, we can dynamically replace the external CSS reference with the browser-specific stylesheet content at fetch time when the HTML is requested by the browser. This way, the embedded CSS will always be up to date and support the capabilities of whichever browser is making the request. This completely eliminates any round trips for fetching the CSS for the fonts.</p>
    <div>
      <h4>Optimizing font file delivery</h4>
      <a href="#optimizing-font-file-delivery">
        
      </a>
    </div>
    <p>For the font files themselves, we can eliminate all of the round trips except for the fetch itself by serving the font files directly from the same domain as the HTML. This brings with it the added benefit of serving fonts over the same HTTP/2 connection as the rest of the page resources, allowing them to be prioritized correctly and not compete for bandwidth.</p><p>Specifically, when we embed the CSS into the HTML response, we rewrite all of the font URLs to use the same domain as the HTML. When those rewritten requests arrive at the worker the URL is transformed back to the original URL served by Google and fetched by the worker (as a proxy). The worker fetches are routed through Cloudflare’s caching infrastructure so they will be cached automatically at the edge. The actual URL rewriting is pretty simple but effective. We take font URLs that look like this:</p>
            <pre><code>https://fonts.gstatic.com/s/...</code></pre>
            <p>and we just prepend the page domain to the front of the URL:</p>
            <pre><code>https://www.example.com/fonts.gstatic.com/s/...</code></pre>
            <p>That way, when they arrive at the edge, the worker can look at the path of a request and know right away that it is a proxy request for a Google font. At this point, rewriting the original URL is trivial. On the extremely odd chance that a page on your site actually has a path that starts with /fonts.gstatic.com/, those resources would break and something else should be appended to the URL to make sure they are unique.</p>
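<p>The round trip of that rewrite can be sketched as a pair of helpers (the function names here are ours, not from the worker sample; the <code>'https:/' + pathname</code> trick works because the proxied path always begins with a slash):</p>

```javascript
// Prefix used to recognize proxied font requests at the edge. As noted
// above, pick something more unique if your site could legitimately
// serve paths under /fonts.gstatic.com/.
const FONT_PREFIX = '/fonts.gstatic.com/';

// In the rewritten CSS: point the font file at the page's own origin.
function toProxyUrl(fontUrl, pageOrigin) {
  // "https://fonts.gstatic.com/s/..." -> "https://www.example.com/fonts.gstatic.com/s/..."
  return pageOrigin + '/' + fontUrl.replace(/^https?:\/\//, '');
}

// In the Worker: recover the original Google URL from the proxied path.
function toOriginUrl(pathname) {
  // "/fonts.gstatic.com/s/..." -> "https://fonts.gstatic.com/s/..."
  return pathname.startsWith(FONT_PREFIX) ? 'https:/' + pathname : null;
}
```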
    <div>
      <h4>Optimization Results</h4>
      <a href="#optimization-results">
        
      </a>
    </div>
    <p>In practice, the results can be quite dramatic. On <a href="https://www.perftests.com/googlefonts/">this test page</a> for example, the wait time for fonts to become available dropped from 5.5 seconds to 1 second (an 81% improvement!):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6c3mWxyg3E5LclNnSBxdCy/6c0aea538a3c2b7d7c5ecb383ec5bfe0/after.png" />
            
            </figure><p>The fonts load immediately after the HTML.</p><p>Visually, the improvement to the user experience is also <a href="https://www.webpagetest.org/video/view.php?id=181016_040e1e2365d95c4c7836281c9cf713ccc8aed3f5">quite dramatic</a>. Instead of seeing a skeleton page of images followed by the text eventually appearing, the text is available (and correctly styled) immediately and the user can start consuming the content right away:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5kWjTwTLDvAES4n8TJg2IH/e94c6635c0ba15d43b61a1a17f249efc/1-initial-text.png" />
            
            </figure><p>The first paint happens much sooner at 2.5 seconds with all of the text displayed while the original page is still blank.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/56RnWXmU9RV9zXzpCFn97p/abaf9c451e0adcec1dfdb6d2cafcefce/2-image-paint.png" />
            
            </figure><p>At 3.3 seconds the original page finally starts to paint, displaying part of the images and no text.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7A9zYZNkzElNgOPRuxYkXe/8751b7c01ef3861c72f12aefe8df056e/2.5-text.png" />
            
            </figure><p>At 4.4 seconds the optimized page is visibly complete while the original page still has not displayed any text.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qfU3rpfQedLhVuUjE4QTg/68190f6af1b60232997dced05a70d955/3-text-complete.png" />
            
            </figure><p>At 6.2 seconds the original page finally displays the text content.</p><p>One thing that I didn’t notice in the initial testing and only discovered when looking at the side-by-side video is that the fonts are only correctly styled in the optimized case. In the original case it took longer than Chrome’s 3-second timeout for the fonts to load and it fell back to the system fonts. Not only was the experience much faster; it was also styled correctly with the custom fonts right from the beginning.</p>
    <div>
      <h3>Optimizing Google fonts with a Cloudflare worker</h3>
      <a href="#optimizing-google-fonts-with-a-cloudflare-worker">
        
      </a>
    </div>
    <p>The full Cloudflare worker code for implementing the font optimization is available on <a href="https://github.com/cloudflare/worker-examples/tree/master/examples/fast-google-fonts">GitHub here</a>. Buckle in, because this is quite a bit more involved than most of the samples in the documentation.</p><p>At a high level this code:</p><ul><li><p>Filters all requests and determines if a request is for a proxied font or HTML (and passes all other requests through unmodified).</p></li><li><p>For font requests, rewrites the URL and passes the fetch request through.</p></li><li><p>For HTML requests:</p><ul><li><p>Passes the request through unmodified to the origin server.</p></li><li><p>Returns non-UTF-8 content unmodified.</p></li><li><p>Processes the HTML response in streaming chunks as it becomes available.</p></li><li><p>Replaces Google font stylesheet link tags with style tags containing the CSS and the font URLs rewritten to proxy through the origin.</p></li></ul></li></ul><p>The code here is slightly simplified to make the flow clearer to understand. The full code on GitHub adds support for an in-memory worker cache for the CSS (in addition to the persistent cache API) and provides query parameters for toggling the HTML rewriting on and off (for testing).</p><p>The content modification is all done by operating on the HTML as strings (with a combination of regular expressions and string matches). This is much faster and lighter weight than parsing the HTML into a virtual DOM, operating on it and converting back to HTML. It also allows for incremental processing of the HTML as a stream.</p>
    <div>
      <h4>Entry Point</h4>
      <a href="#entry-point">
        
      </a>
    </div>
    <p>The addEventListener(“fetch”) call is the main entry point for any worker and houses the JavaScript for intercepting inbound requests. If the handler does nothing, then the requests will be passed through normally and the worker will be out of the path in processing the response. Our goal is to minimize the amount of work that the worker has to do to determine if it is a request that it is interested in.</p><p>In the case of the proxied font requests, we can just look at the request URL and see that the path starts with /fonts.gstatic.com/. To identify requests for HTML content we can look at the “accept” header on the request. Every major browser I have tested includes text/html on the list of content types that it will accept when requesting a document. On the off chance that there is a browser that doesn’t include it as an accept header, the HTML will just be passed through and returned unmodified. The goal with everything here is to fail safe and just return unoptimized content for any edge cases that aren’t covered. This way nothing breaks; it just doesn’t get the added performance boost.</p>
            <pre><code>addEventListener("fetch", event =&gt; {
 
 const url = new URL(event.request.url);
 const accept = event.request.headers.get('accept');
 if (event.request.method === 'GET' &amp;&amp;
     url.pathname.startsWith('/fonts.gstatic.com/')) {
 
   // Proxy the font file requests
   event.respondWith(proxyRequest('https:/' + url.pathname,
                                  event.request));
 
 } else if (accept &amp;&amp; accept.indexOf("text/html") !== -1) {
 
   // Process the HTML
   event.respondWith(processHtmlRequest(event.request, event));
 
 }
})</code></pre>
            
    <div>
      <h4>Request Proxy</h4>
      <a href="#request-proxy">
        
      </a>
    </div>
    <p>The proxying of the font requests is pretty straightforward. Since we are crossing origins, it is generally a bad idea to just reuse the existing request object with a new URL. That can leak user data like cookies to a third party. Instead, we make a new request, clone a subset of the headers and pass the new fetch request back for the Worker runtime to handle.</p><p>The fetch path between workers and the outside Internet goes through the Cloudflare cache so the actual font files will only be fetched from Google if they aren’t already in the cache. Even in that case, the connection from Cloudflare’s edge to Google’s font servers is much faster (and more reliable) than the end-user’s connection from the browser. Even on a cache miss, it is an insignificant delay.</p>
            <pre><code>async function proxyRequest(url, request) {
 
 // Only pass through a subset of request headers
 let init = {
   method: request.method,
   headers: {}
 };
 const proxyHeaders = ["Accept",
                       "Accept-Encoding",
                       "Accept-Language",
                       "Referer",
                       "User-Agent"];
 for (let name of proxyHeaders) {
   let value = request.headers.get(name);
   if (value) {
     init.headers[name] = value;
   }
 }
 const clientAddr = request.headers.get('cf-connecting-ip');
 if (clientAddr) {
   init.headers['X-Forwarded-For'] = clientAddr;
 }
 
 // Only include a strict subset of response headers
 const response = await fetch(url, init);
 if (response) {
   const responseHeaders = ["Content-Type",
                            "Cache-Control",
                            "Expires",
                            "Accept-Ranges",
                            "Date",
                            "Last-Modified",
                            "ETag"];
   let responseInit = {status: response.status,
                       statusText: response.statusText,
                       headers: {}};
   for (let name of responseHeaders) {
     let value = response.headers.get(name);
     if (value) {
       responseInit.headers[name] = value;
     }
   }
   const newResponse = new Response(response.body, responseInit);
   return newResponse;
 }
 
 return response;
}</code></pre>
            <p>In addition to filtering the request headers we also filter the response headers sent back to the browser. If you’re not careful you could end up in a situation where a third-party is setting cookies on your origin or even turning on something like <a href="https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security">HTTP Strict Transport Security</a> for your site.</p>
    <div>
      <h4>Streaming HTML Processing</h4>
      <a href="#streaming-html-processing">
        
      </a>
    </div>
    <p>The HTML path is more complicated because we are going to intercept and modify the content itself.</p><p>In processing the HTML request, the first thing we want to do is make sure it is actually an HTML response. If it is something else, then we should get out of the way and let the response stream back to the browser as it does normally. It is very possible that a PDF document, file download, or even a directly opened image has an Accept header of text/html. It is critical to check the actual content that is being responded with to make sure it is something we want to inspect and possibly modify.</p><p>The easiest way to modify a response is to just wait for the response to be fully complete, process it as a single block of HTML, and then pass the modified HTML back to the browser:</p>
            <pre><code> async function processHtmlRequest(request) {
 
 // Fetch from origin server.
 const response = await fetch(request)
  // Guard against a missing content-type header before checking it.
  const contentType = response.headers.get("content-type");
  if (contentType &amp;&amp; contentType.indexOf("text/html") !== -1) {
  
   let content = await response.text();
   content = await modifyHtmlChunk(content, request);
 
   // Create a cloned response with our modified HTML
   return new Response(content, response);
 }
 return response;
}</code></pre>
            <p>This works reasonably well if you are sure that all of the HTML is UTF-8 (or ASCII), and the server returns all of the HTML at once, but there are some pretty serious concerns with doing it this way:</p><ul><li><p>The memory use can be unbounded and only limited by the size of the largest HTML response (possibly causing the worker to be terminated for using too much memory).</p></li><li><p>Significant delay will be added to any pages where the server <a href="https://www.stevesouders.com/blog/2009/05/18/flushing-the-document-early/">flushes the initial content early</a> and then does some expensive/slow work before returning the rest of the HTML.</p></li><li><p>This only works if the text content uses an encoding that JavaScript can decode directly as UTF-8. Any other character encodings will fail to decode.</p></li></ul><p>For our worker we are going to process the HTML stream incrementally as it arrives from the server and pass it through to the browser as soon as possible (and pass any content that isn’t UTF-8 through unmodified).</p>
    <div>
      <h5>Processing HTML as a stream</h5>
      <a href="#processing-html-as-a-stream">
        
      </a>
    </div>
    <p>First we are going to look at what it takes to process the HTML stream incrementally. We will leave the character encoding changes out for now to keep things (relatively) simple.</p><p>To process the stream incrementally, we need to generate a new fetch response to pass back from the worker that uses a TransformStream for its content. That will allow us to pass the response itself back immediately and write to the stream as soon as we have content to add. We pass all of the other headers through unmodified.</p>
            <pre><code>async function processHtmlRequest(request) {
 
 // Fetch from origin server.
 const response = await fetch(request)
  // Guard against a missing content-type header before checking it.
  const contentType = response.headers.get("content-type");
  if (contentType &amp;&amp; contentType.indexOf("text/html") !== -1) {
  
   // Create an identity TransformStream (a.k.a. a pipe).
   // The readable side will become our new response body.
   const { readable, writable } = new TransformStream();
 
   // Create a cloned response with our modified stream
   const newResponse = new Response(readable, response);
 
   // Start the async processing of the response stream (NO await!)
   modifyHtmlStream(response.body, writable, request);
 
   // Return the in-process response so it can be streamed.
   return newResponse;
 }
 return response;
}</code></pre>
            <p>The key thing here is to not wait for the async modifyHtmlStream function to complete before passing the new response back from the worker. This way the initial headers can be sent immediately and the response will continue to stream anything written into the TransformStream until it is closed.</p><p>Processing the HTML stream in chunks as it arrives is a little tricky. The HTML stream will arrive in chunks of arbitrary sizes as strings. We need to add some protection to make sure that a chunk boundary doesn’t split a link tag. If it does, and we don’t account for it, we can miss a stylesheet link (or worse, process a partial link URL with the wrong style type). To make sure we don’t split link tags, we search from the end of the string for “&lt;link “ and from the start of the last link tag we search forward for the closing “/&gt;”. If we don’t find one, then there is a partial link tag and we split the string just before the link tag starts. We process everything up to the split link tag and keep the partial tag to prepend it to the next chunk of data that arrives.</p><p>An alternative would be to keep accumulating data and only process it when there is no split link tag at the end, but this way we can return more data to the browser sooner.</p><p>When the incoming stream is complete, we process any partial data left over from the previous chunk and close the output stream (ending the response to the browser).</p>
            <pre><code>async function modifyHtmlStream(readable, writable, request) {
 const reader = readable.getReader();
 const writer = writable.getWriter();
 const encoder = new TextEncoder();
 let decoder = new TextDecoder();
 
 let partial = '';
 let content = '';
 
 for(;;) {
   // Read the next chunk of HTML from the fetch request.
   const { done, value } = await reader.read()
 
   if (done) {
 
     // Send any remaining fragment and complete the request.
     if (partial.length) {
       partial = await modifyHtmlChunk(partial, request);
       await writer.write(encoder.encode(partial));
       partial = '';
     }
     break;
 
   }
  
   try {
     let chunk = decoder.decode(value, {stream:true});
 
     // Add the inbound chunk to the the remaining partial chunk
     // from the previous read.
     content = partial + chunk;
     partial = '';
 
     // See if there is an unclosed link tag at the end (and if so,
     // carve it out to complete when the remainder comes in).
     const linkPos = content.lastIndexOf('&lt;link');
     if (linkPos &gt;= 0) {
       const linkClose = content.indexOf('/&gt;', linkPos);
       if (linkClose === -1) {
         partial = content.slice(linkPos);
         content = content.slice(0, linkPos);
       }
     }
 
     if (content.length) {
       // Do the actual HTML modifications on the current chunk.
       content = await modifyHtmlChunk(content, request);
     }
   } catch (e) {
     // Ignore the exception
   }
 
   // Send the processed HTML chunk to the requesting browser.
   if (content.length) {
     await writer.write(encoder.encode(content));
     content = '';
   }
 }
 
 await writer.close()
}</code></pre>
            <p>One thing I was initially worried about was having to modify the “content-length” response header from the original response since we are modifying the content. Luckily, the worker takes care of that automatically and it isn’t something you have to implement.</p><p>There is a try/catch handler around the processing in case something goes horribly wrong with the decode.</p><p>The actual HTML rewriting is handled in “modifyHtmlChunk”. This is just the logic for processing the incoming data as incremental chunks.</p>
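The chunk-boundary guard is easiest to verify in isolation. Here is a minimal sketch of that step as a pure function (the name splitPartialLinkTag is hypothetical, not part of the worker code above):

```javascript
// Split a chunk of HTML so that a link tag cut off at the end of the
// chunk is held back ("partial") until the rest of it arrives.
function splitPartialLinkTag(content) {
  const linkPos = content.lastIndexOf('<link');
  if (linkPos >= 0) {
    // Look for the end of the tag after the last "<link".
    const linkClose = content.indexOf('/>', linkPos);
    if (linkClose === -1) {
      // The tag is split across chunks: emit everything before it now
      // and carry the fragment over to the next chunk.
      return { content: content.slice(0, linkPos),
               partial: content.slice(linkPos) };
    }
  }
  // No split tag: process the whole chunk.
  return { content: content, partial: '' };
}
```

Prepending the returned partial to the next chunk reassembles the split tag before it is scanned again.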
    <div>
      <h5>Dealing with character encodings other than UTF-8</h5>
      <a href="#dealing-with-character-encodings-other-than-utf-8">
        
      </a>
    </div>
    <p>We intentionally skipped over handling character encodings other than UTF-8 until now. To handle arbitrary pages you will need to be able to process other character encodings. The Worker runtime only supports decoding UTF-8 but we need to make sure that we don’t break any content that isn’t UTF-8 (or similar). To do this, we detect the current encoding if it is specified and anything that isn’t UTF-8 is passed through unmodified. When the charset cannot be determined up front, we also watch for decode errors and pass the content through unmodified if they occur.</p><p>The HTML charset can be specified in the content-type response header or as a <code>&lt;meta charset&gt;</code> tag in the HTML itself.</p><p>For the response headers it is pretty simple. When we get the original response, see if there is a charset in the content-type header. If there is, extract the current value and if it isn’t a supported charset just pass the response through unmodified.</p>
            <pre><code>// Workers can only decode utf-8. If it is anything else, pass the
// response through unmodified
const VALID_CHARSETS = ['utf-8', 'utf8', 'iso-8859-1', 'us-ascii'];
const charsetRegex = /charset\s*=\s*([^\s;]+)/mgi;
const match = charsetRegex.exec(contentType);
if (match !== null) {
  let charset = match[1].toLowerCase();
  if (!VALID_CHARSETS.includes(charset)) {
    return response;
  }
}

// Create an identity TransformStream (a.k.a. a pipe).
// The readable side will become our new response body.
const { readable, writable } = new TransformStream();

// Create a cloned response with our modified stream
const newResponse = new Response(readable, response);

// Start the async processing of the response stream
modifyHtmlStream(response.body, writable, request, event);</code></pre>
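The charset test against the content-type header is a pure string check, so it can be sketched as a standalone predicate for testing (the helper name isSupportedCharset is an assumption for illustration):

```javascript
// Return true if the content-type header names a charset the worker
// can safely decode (or names no charset at all).
function isSupportedCharset(contentType) {
  const VALID_CHARSETS = ['utf-8', 'utf8', 'iso-8859-1', 'us-ascii'];
  const match = /charset\s*=\s*([^\s;]+)/i.exec(contentType);
  if (match !== null) {
    return VALID_CHARSETS.includes(match[1].toLowerCase());
  }
  // No explicit charset: optimistically treat it as decodable and rely
  // on the fatal TextDecoder to trigger passthrough if that was wrong.
  return true;
}
```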
            <p>For the cases where there is a <code>&lt;meta charset&gt;</code> tag in the HTML (and possibly no charset in the headers) things get a bit more complicated. If at any point an unsupported charset is detected then we pipe the incoming byte stream directly into the output stream unmodified. We first decode the first chunk of the HTML response using the default decoder. Then, if a <code>&lt;meta charset&gt;</code> or <code>&lt;meta http-equiv="content-type"&gt;</code> tag is found in the HTML we extract the charset. If it isn’t a supported charset then we enter passthrough mode. If at any point the input stream can’t be decoded (likely because of an invalid charset) we also enter passthrough mode and pipe the remaining content through unprocessed.</p>
            <pre><code>async function modifyHtmlStream(readable, writable, request, event) {
 const reader = readable.getReader();
 const writer = writable.getWriter();
 const encoder = new TextEncoder();
 let decoder = new TextDecoder("utf-8", {fatal: true});
 
 let firstChunk = true;
 let unsupportedCharset = false;
 
 let partial = '';
 let content = '';
 
 try {
   for(;;) {
     const { done, value } = await reader.read();
     if (done) {
       if (partial.length) {
         partial = await modifyHtmlChunk(partial, request, event);
         await writer.write(encoder.encode(partial));
         partial = '';
       }
       break;
     }
 
     let chunk = null;
     if (unsupportedCharset) {
       // Pass the data straight through
       await writer.write(value);
       continue;
     } else {
       try {
         chunk = decoder.decode(value, {stream:true});
       } catch (e) {
         // Decoding failed, switch to passthrough
         unsupportedCharset = true;
         if (partial.length) {
           await writer.write(encoder.encode(partial));
           partial = '';
         }
         await writer.write(value);
         continue;
       }
     }
 
     try {
       // Look inside of the first chunk for a HTML charset or
       // content-type meta tag.
       if (firstChunk) {
         firstChunk = false;
         if (chunkContainsInvalidCharset(chunk)) {
           // switch to passthrough
           unsupportedCharset = true;
           if (partial.length) {
             await writer.write(encoder.encode(partial));
             partial = '';
           }
           await writer.write(value);
           continue;
         }
       }
 
       content = partial + chunk;
       partial = '';
 
       // See if there is an unclosed link tag at the end (and if so,
       // carve it out to complete when the remainder comes in).
       const linkPos = content.lastIndexOf('&lt;link');
       if (linkPos &gt;= 0) {
         const linkClose = content.indexOf('/&gt;', linkPos);
         if (linkClose === -1) {
           partial = content.slice(linkPos);
           content = content.slice(0, linkPos);
         }
       }
 
       if (content.length) {
         content = await modifyHtmlChunk(content, request, event);
       }
     } catch (e) {
       // Ignore the exception
     }
     if (content.length) {
       await writer.write(encoder.encode(content));
       content = '';
     }
   }
 } catch(e) {
   // Ignore the exception
 }
 
 try {
   await writer.close();
 } catch(e) {
   // Ignore the exception
 }
}</code></pre>
            <p>There is a helper that scans for the charset in both meta tags that support setting the charset:</p>
            <pre><code>function chunkContainsInvalidCharset(chunk) {
 let invalid = false;
 const VALID_CHARSETS = ['utf-8', 'utf8', 'iso-8859-1', 'us-ascii'];
 
 // meta charset
 const charsetRegex = /&lt;\s*meta[^&gt;]+charset\s*=\s*['"]([^'"]*)['"][^&gt;]*&gt;/mgi;
 const charsetMatch = charsetRegex.exec(chunk);
 if (charsetMatch) {
   const docCharset = charsetMatch[1].toLowerCase();
   if (!VALID_CHARSETS.includes(docCharset)) {
     invalid = true;
   }
 }
 // content-type
 const contentTypeRegex = /&lt;\s*meta[^&gt;]+http-equiv\s*=\s*['"]\s*content-type[^&gt;]*&gt;/mgi;
 const contentTypeMatch = contentTypeRegex.exec(chunk);
 if (contentTypeMatch) {
   const metaTag = contentTypeMatch[0];
   const metaRegex = /charset\s*=\s*([^\s"]*)/mgi;
   const metaMatch = metaRegex.exec(metaTag);
   if (metaMatch) {
     const charset = metaMatch[1].toLowerCase();
     if (!VALID_CHARSETS.includes(charset)) {
       invalid = true;
     }
   }
 }
 return invalid;
}</code></pre>
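Because the helper is a pure function, it is easy to exercise against sample markup. A condensed re-implementation covering the meta charset path (the http-equiv branch works the same way):

```javascript
// Detect a meta charset tag that declares an encoding the worker
// cannot decode (condensed version of the helper above).
function metaCharsetIsInvalid(chunk) {
  const VALID_CHARSETS = ['utf-8', 'utf8', 'iso-8859-1', 'us-ascii'];
  const m = /<\s*meta[^>]+charset\s*=\s*['"]([^'"]*)['"][^>]*>/i.exec(chunk);
  // Invalid only if a charset is declared AND it is not in the list.
  return m !== null && !VALID_CHARSETS.includes(m[1].toLowerCase());
}
```

If no meta tag is present in the first chunk, processing continues and a decode failure later still flips the stream into passthrough mode.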
            
    <div>
      <h4>HTML Business Logic</h4>
      <a href="#html-business-logic">
        
      </a>
    </div>
    <p>Finally, we can start the actual logic for embedding the font CSS. The basic logic is:</p><ul><li><p>Use a regex to find link tags for Google fonts css.</p></li><li><p>Fetch the browser-specific version of the CSS (from cache if possible).</p><ul><li><p>The fetch logic (discussed later) modifies the font URLs in the CSS to proxy through the worker.</p></li></ul></li><li><p>Replace the link tag with a style block with the CSS content.</p></li></ul>
            <pre><code>async function modifyHtmlChunk(content, request, event) {
  const fontCSSRegex = /&lt;link\s+[^&gt;]*href\s*=\s*['"]((https?:)?\/\/fonts\.googleapis\.com\/css[^'"]+)[^&gt;]*&gt;/mgi;
 let match = fontCSSRegex.exec(content);
 while (match !== null) {
   const matchString = match[0];
   if (matchString.indexOf('stylesheet') &gt;= 0) {
     const fontCSS = await fetchCSS(match[1], request, event);
     if (fontCSS.length) {
       // See if there is a media type on the link tag
       let mediaStr = '';
       const mediaMatch = matchString.match(/media\s*=\s*['"][^'"]*['"]/mig);
       if (mediaMatch) {
         mediaStr = ' ' + mediaMatch[0];
       }
       // Replace the actual css
       let cssString = "&lt;style" + mediaStr + "&gt;\n";
       cssString += fontCSS;
       cssString += "\n&lt;/style&gt;\n";
       content = content.split(matchString).join(cssString);
     }
    }
    match = fontCSSRegex.exec(content);
 }
 
 return content;
}</code></pre>
            <p>The fetching (and modifying) of the CSS is a little more complicated than a straight passthrough because we want to cache the result when possible. We cache the responses locally using the worker’s <a href="https://developers.cloudflare.com/workers/reference/cache-api/">Cache API</a>. Since the response is browser-specific, and we don’t want to fragment the cache too crazily, we create a custom cache key based on the browser user agent string that is basically browser+version+mobile.</p><p>Some plans have access to named cache storage, but to work with all plans it is easiest if we just modify the font URL that gets stored in cache and append the cache key to the end of the URL as a query parameter. The cache URL never gets sent to a server but is useful for local caching of different content that shares the same URL. For example:</p>
            <pre><code>https://fonts.googleapis.com/css?family=Roboto&amp;chrome71</code></pre>
            <p>If the CSS isn’t available in the cache then we create a fetch request for the original URL from the Google servers, passing through the HTML URL as the Referer, the correct browser user agent string, and the client’s IP address in a standard X-Forwarded-For proxy header. Once the response is available we store it in the cache for future requests.</p><p>For browsers that can’t be identified by user agent string, a generic request for the CSS is sent with the Internet Explorer 8 user agent string to get the lowest-common-denominator fallback CSS.</p><p>The actual modification of the CSS just uses a regex to find the font URLs and rewrite them as relative URLs on the HTML origin so the font requests are routed back through the worker.</p>
            <pre><code>async function fetchCSS(url, request, event) {
 let fontCSS = "";
 if (url.startsWith('/'))
   url = 'https:' + url;
 const userAgent = request.headers.get('user-agent');
 const clientAddr = request.headers.get('cf-connecting-ip');
 const browser = getCacheKey(userAgent);
 const cacheKey = browser ? url + '&amp;' + browser : url;
 const cacheKeyRequest = new Request(cacheKey);
 let cache = null;
 
 let foundInCache = false;
 // Try pulling it from the cache API (wrap it in case it's not implemented)
 try {
   cache = caches.default;
   let response = await cache.match(cacheKeyRequest);
   if (response) {
      fontCSS = await response.text();
     foundInCache = true;
   }
 } catch(e) {
   // Ignore the exception
 }
 
 if (!foundInCache) {
   let headers = {'Referer': request.url};
   if (browser) {
     headers['User-Agent'] = userAgent;
   } else {
     headers['User-Agent'] =
       "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)";
   }
   if (clientAddr) {
     headers['X-Forwarded-For'] = clientAddr;
   }
 
   try {
     const response = await fetch(url, {headers: headers});
     fontCSS = await response.text();
 
     // Rewrite all of the font URLs to come through the worker
     fontCSS = fontCSS.replace(/(https?:)?\/\/fonts\.gstatic\.com\//mgi,
                               '/fonts.gstatic.com/');
 
     // Add the css info to the font caches
     FONT_CACHE[cacheKey] = fontCSS;
     try {
       if (cache) {
          const cacheResponse = new Response(fontCSS,
              {headers: {'Cache-Control': 'max-age=86400'}});
         event.waitUntil(cache.put(cacheKeyRequest, cacheResponse));
       }
     } catch(e) {
       // Ignore the exception
     }
   } catch(e) {
     // Ignore the exception
   }
 }
 
 return fontCSS;
}</code></pre>
            <p>Generating the browser-specific cache key is a little sensitive since browsers tend to clone each other’s user agent strings and add their own information to them. For example, Edge includes a Chrome identifier and Chrome includes a Safari identifier, etc. We don’t necessarily have to handle every browser string since it will fallback to the least common denominator (ttf files without unicode range support) but it is helpful to catch as many of the large mainstream browser engines as possible.</p>
            <pre><code>function getCacheKey(userAgent) {
 let os = '';
 const osRegex = /^[^(]*\(\s*(\w+)/mgi;
 let match = osRegex.exec(userAgent);
 if (match) {
   os = match[1];
 }
 
 let mobile = '';
 if (userAgent.match(/Mobile/mgi)) {
   mobile = 'Mobile';
 }
 
 // Detect Edge first since it includes Chrome and Safari
 const edgeRegex = /\s+Edge\/(\d+)/mgi;
 match = edgeRegex.exec(userAgent);
 if (match) {
   return 'Edge' + match[1] + os + mobile;
 }
 
 // Detect Chrome next (and browsers using the Chrome UA/engine)
 const chromeRegex = /\s+Chrome\/(\d+)/mgi;
 match = chromeRegex.exec(userAgent);
 if (match) {
   return 'Chrome' + match[1] + os + mobile;
 }
 
 // Detect Safari and Webview next
 const webkitRegex = /\s+AppleWebKit\/(\d+)/mgi;
  match = webkitRegex.exec(userAgent);
 if (match) {
   return 'WebKit' + match[1] + os + mobile;
 }
 
 // Detect Firefox
 const firefoxRegex = /\s+Firefox\/(\d+)/mgi;
 match = firefoxRegex.exec(userAgent);
 if (match) {
   return 'Firefox' + match[1] + os + mobile;
 }
  return null;
}</code></pre>
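The detection order matters: Edge UA strings also contain "Chrome/", and Chrome UA strings contain "Safari/". A condensed sketch of just the Edge and Chrome branches shows the ordering in action:

```javascript
// Condensed cache-key builder: Edge must be tested before Chrome
// because Edge user agent strings also contain "Chrome/".
function getCacheKey(userAgent) {
  const osMatch = /^[^(]*\(\s*(\w+)/.exec(userAgent);
  const os = osMatch ? osMatch[1] : '';
  const mobile = /Mobile/.test(userAgent) ? 'Mobile' : '';
  const edge = /\s+Edge\/(\d+)/.exec(userAgent);
  if (edge) return 'Edge' + edge[1] + os + mobile;
  const chrome = /\s+Chrome\/(\d+)/.exec(userAgent);
  if (chrome) return 'Chrome' + chrome[1] + os + mobile;
  return null; // unknown browsers fall back to the generic CSS
}
```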
            
    <div>
      <h3>Profit!</h3>
      <a href="#profit">
        
      </a>
    </div>
    <p>Any site served through Cloudflare can implement workers to rewrite their content but for something like Google fonts or other third-party resources it gets much more interesting when someone implements it once and everyone else can benefit. With <a href="https://www.cloudflare.com/apps/developer/docs/getting-started">Cloudflare Apps’</a> new <a href="/introducing-apps-with-workers/">worker support</a> you can bundle up and deliver complex worker logic for anyone else to consume and publish it to the Apps marketplace.</p><p>If you are a third-party content provider for sites, think about what you might be able to do to leverage workers for your content for sites that are served through Cloudflare.</p><p>I get excited thinking about the performance implications of something like a tag manager running entirely on the edge without the sites having to change their published pages and without browsers having to fetch heavy JavaScript to do the page modifications. It can be done dynamically for every request directly on the edge!</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4UYfaU3hSHYBAmqQgVZm0v</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Optimizing HTTP/2 prioritization with BBR and tcp_notsent_lowat]]></title>
            <link>https://blog.cloudflare.com/http-2-prioritization-with-nginx/</link>
            <pubDate>Fri, 12 Oct 2018 12:00:00 GMT</pubDate>
            <description><![CDATA[ Getting the best end-user performance from HTTP/2 requires good support for resource prioritization.  While most web servers support HTTP/2 prioritization, getting it to work well all the way to the browser requires a fair bit of coordination across the networking stack. ]]></description>
            <content:encoded><![CDATA[ <p>Getting the best end-user performance from HTTP/2 requires good support for resource prioritization. While most web servers support HTTP/2 prioritization, getting it to work well all the way to the browser requires a fair bit of coordination across the networking stack. This article will expose some of the interactions between the web server, Operating System and network and how to tune a server to optimize performance for end users.</p>
    <div>
      <h3>tl;dr</h3>
      <a href="#tl-dr">
        
      </a>
    </div>
    <p>On Linux 4.9 kernels and later, enable BBR congestion control and set tcp_notsent_lowat to 16KB for HTTP/2 prioritization to work reliably. This can be done in /etc/sysctl.conf:</p>
            <pre><code>net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_notsent_lowat = 16384</code></pre>
            
    <div>
      <h3>Browsers and Request Prioritization</h3>
      <a href="#browsers-and-request-prioritization">
        
      </a>
    </div>
    <p>A single web page is made up of <a href="https://httparchive.org/reports/state-of-the-web#reqTotal">dozens</a> to hundreds of separate pieces of content that a web browser pulls together to create and present to the user. The main content (HTML) for the page you are visiting is a list of instructions on how to construct the page and the browser goes through the instructions from beginning to end to figure out everything it needs to load and how to put it all together. Each piece of content requires a separate HTTP request from the browser to the server responsible for that content (or if it has been loaded before, it can be loaded from a local cache in the browser).</p><p>In a simple implementation, the web browser could wait until everything is loaded and constructed and then show the result but that would be pretty slow. Not all of the content is critical to the user and can include things such as images way down in the page, analytics for tracking usage, ads, like buttons, etc. All the browsers work more incrementally where they display the content as it becomes available. This results in a much faster user experience. The visible part of the page can be displayed while the rest of the content is being loaded in the background. Deciding on the best order to request the content in is where browser request prioritization comes into play. Done correctly the visible content can display significantly faster than a naive implementation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2XMGHLI8Lg4iw7DBN7mNAS/b1c65f7a021f93e5c81d47f1304b50de/Parser-web.png" />
            
            </figure><p>HTML Parser blocking page render for styles and scripts in the head of the document.</p><p>Most modern browsers use similar prioritization schemes which generally look like:</p><ol><li><p>Load similar resources (scripts, images, styles) in the order they were listed in the HTML.</p></li><li><p>Load styles/CSS before anything else because content cannot be displayed until styles are complete.</p></li><li><p>Load blocking scripts/JavaScript next because blocking scripts stop the browser from moving on to the next instruction in the HTML until they have been loaded and executed.</p></li><li><p>Load images and non-blocking scripts (async/defer).</p></li></ol><p>Fonts are a bit of a special case in that they are needed to draw the text on the screen but the browser won’t know that it needs to load a font until it is actually ready to draw the text to the screen. So they are discovered pretty late. As a result they are generally given a very high priority once they are discovered but aren’t known about until fairly late in the loading process.</p><p>Chrome also applies some special treatment to images that are visible in the current browser viewport (part of the page visible on the screen). Once the styles have been applied and the page has been laid out it will give visible images a much higher priority and load them in order from largest to smallest.</p>
    <div>
      <h4>HTTP/1.x prioritization</h4>
      <a href="#http-1-x-prioritization">
        
      </a>
    </div>
    <p>With HTTP/1.x, each connection to a server can support one request at a time (practically anyway as no browser supports <a href="https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.2.2">pipelining</a>) and most browsers will open up to 6 connections at a time to each server. The browser maintains a prioritized list of the content it needs and makes the requests to each server as a connection becomes available. When a high-priority piece of content is discovered it is moved to the front of a list and when the next connection becomes available it is requested.</p>
    <div>
      <h4>HTTP/2 prioritization</h4>
      <a href="#http-2-prioritization">
        
      </a>
    </div>
    <p>With HTTP/2, the browser uses a single connection and the requests are multiplexed over the connection as separate “streams”. The requests are all sent to the server as soon as they are discovered along with some prioritization information to let the server know the preferred ordering of the responses. It is then up to the server to do its best to deliver the most important responses first, followed by lower priority responses. When a high priority request comes in to the server, it should immediately jump ahead of the lower priority responses, even mid-response. The actual priority scheme implemented by HTTP/2 allows for parallel downloads with weighting between them and more complicated schemes. For now it is easiest to just think about it as a priority ordering of the resources.</p><p>Most servers that support prioritization will send data for the highest priority responses for which it has data available. But if the most important response takes longer to generate than lower priority responses, the server may end up starting to send data for a lower priority response and then interrupt its stream when the higher priority response becomes available. That way it can avoid wasting available bandwidth and head-of-line blocking where a slow response holds everything else up.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59eyWItYTFdeYfulcnJGgn/80c68b2bc82e445e1c5685f602ed235d/H2Prioritization.gif" />
            
            </figure><p>Browser requesting a high-priority resource after several low-priority resources.</p><p><b>In an optimal configuration, the time to retrieve a top-priority resource on a busy connection with lots of other streams will be identical to the time to retrieve it on an empty connection.</b> Effectively that means that the server needs to be able to interrupt the response streams of all of the other responses immediately with no additional buffering to delay the high-priority response (beyond the minimal amount of data in-flight on the network to keep the connection fully utilized).</p>
    <div>
      <h3>Buffers on the Internet</h3>
      <a href="#buffers-on-the-internet">
        
      </a>
    </div>
    <p>Excessive buffering is pretty much the nemesis for HTTP/2 because it directly impacts the ability for a server to be nimble in responding to priority shifts. It is not unusual for there to be megabytes-worth of buffering between the server and the browser which is larger than most websites. Practically that means that the responses will get delivered in whatever order they become available on the server. It is not unusual to have a critical resource (like a font or a render-blocking script in the <code>&lt;head&gt;</code> of a document) delayed by megabytes of lower priority images. For the end-user this translates to seconds or even minutes of delay rendering the page.</p>
    <div>
      <h4>TCP send buffers</h4>
      <a href="#tcp-send-buffers">
        
      </a>
    </div>
    <p>The first layer of buffering between the server and the browser is in the server itself. The operating system maintains a TCP send buffer that the server writes data into. Once the data is in the buffer then the operating system takes care of delivering the data as-needed (pulling from the buffer as data is sent and signaling to the server when the buffer needs more data). A large buffer also reduces CPU load because it reduces the amount of writing that the server has to do to the connection.</p><p>The actual size of the send buffer needs to be big enough to keep a copy of all of the data that has been sent to the browser but has yet to be acknowledged in case a packet gets dropped and some data needs to be retransmitted. Too small of a buffer will prevent the server from being able to max-out the connection bandwidth to the client (and is a common cause of slow downloads over long distances). In the case of HTTP/1.x (and a lot of other protocols), the data is delivered in bulk in a known-order and tuning the buffers to be as big as possible has no downside other than the increase in memory use (trading off memory for CPU). Increasing the TCP send buffer sizes is an effective way to increase the throughput of a web server.</p><p>For HTTP/2, the problem with large send buffers is that it limits the nimbleness of the server to adjust the data it is sending on a connection as high priority responses become available. Once the response data has been written into the TCP send buffer it is beyond the server’s control and has been committed to be delivered in the order it is written.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1AqTAsnXSJ1q1TQpRZh0pm/86f7438dcd64c932fe96921e811b5a95/Lowat-buff-web.png" />
            
            </figure><p>High-priority resource queued behind low-priority resources in the TCP send buffer.</p><p><b>The optimal send buffer size for HTTP/2 is the minimal amount of data required to fully utilize the available bandwidth to the browser</b> (which is different for every connection and changes over time even for a single connection). Practically you’d want the buffer to be slightly bigger to allow for some time between when the server is signaled that more data is needed and when the server writes the additional data.</p>
    <div>
      <h4>TCP_NOTSENT_LOWAT</h4>
      <a href="#tcp_notsent_lowat">
        
      </a>
    </div>
    <p><a href="https://lwn.net/Articles/560082/">TCP_NOTSENT_LOWAT</a> is a socket option that allows configuration of the send buffer so that it is always the optimal size plus a fixed additional buffer. You provide a value (X), the extra headroom you’d like beyond the minimum needed to fully utilize the connection, and the kernel dynamically adjusts the TCP send buffer to always be X bytes larger than the current connection congestion window. The congestion window is the TCP stack’s estimate of the amount of data that needs to be in-flight on the network to fully utilize the connection.</p><p>TCP_NOTSENT_LOWAT can be configured in code on a socket-by-socket basis if the web server software supports it, or system-wide using the net.ipv4.tcp_notsent_lowat sysctl:</p>
            <pre><code>    net.ipv4.tcp_notsent_lowat = 16384</code></pre>
            <p>We have a patch we are preparing to upstream for NGINX to make it configurable, but it isn’t quite ready yet, so for now configuring it system-wide is required. Experimentally, the value 16,384 (16K) has proven to be a good balance where the connections are kept fully utilized with negligible additional CPU overhead. That will mean that at most 16KB of lower-priority data will be buffered before a higher-priority response can interrupt it and be delivered. As always, your mileage may vary and it is worth experimenting with.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/r0NmKQUjLYjXThiREdchj/37a90992b93c43d8a31563f27c46c349/Lowat-web.png" />
            
            </figure><p>High-priority resource ready to send with minimal TCP buffering.</p>
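<p>Until per-server configuration is available, the option can also be set per-socket from application code. The following is a minimal Python sketch, not production code: the numeric fallback 25 is the Linux value of <code>TCP_NOTSENT_LOWAT</code>, and the call simply fails gracefully on kernels or platforms that don’t support the option:</p>

```python
import socket

# Python exposes socket.TCP_NOTSENT_LOWAT on platforms that define it;
# fall back to the raw Linux constant (25) otherwise.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

def limit_unsent_buffer(sock: socket.socket, lowat: int = 16384) -> bool:
    """Ask the kernel to keep at most `lowat` not-yet-sent bytes buffered.

    Returns True if the option was applied, False when the platform or
    kernel does not support it (the sketch degrades gracefully).
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, lowat)
        return True
    except OSError:
        return False

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = limit_unsent_buffer(s, 16384)  # the 16 KB value discussed above
s.close()
```

The 16384 default mirrors the sysctl value discussed above; the right number for your deployment is something to measure, not assume.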
    <div>
      <h4>Bufferbloat</h4>
      <a href="#bufferbloat">
        
      </a>
    </div>
    <p>Beyond buffering on the server, the network connection between the server and the browser can act as a buffer. It is increasingly common for networking gear to have large buffers that absorb data that is sent faster than the receiving side can consume it. This is generally referred to as <a href="https://www.bufferbloat.net/projects/bloat/wiki/Introduction/">Bufferbloat</a>. I hedged my explanation of the effectiveness of tcp_notsent_lowat a little: it is based on the current congestion window, which is an estimate of the optimal amount of in-flight data, not necessarily the actual optimum.</p><p>The buffers in the network can be quite large at times (megabytes) and they interact very poorly with the congestion control algorithms usually used by TCP. Most classic congestion-control algorithms determine the congestion window by watching for packet loss. Once a packet is dropped, the algorithm knows there was too much data on the network and scales back from there. With Bufferbloat that limit is raised artificially high because the buffers are absorbing the extra packets beyond what is needed to saturate the connection. As a result, the TCP stack ends up calculating a congestion window that spikes well above the size actually needed, then drops significantly below it once the buffers are saturated and a packet is dropped, and the cycle repeats.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yp3yoEG0uBWlPowSVbKLm/e6fdd79c824709f41e2993698d2f8155/h2_tcp_sawtooth_web.png" />
            
            </figure><p>Loss-based congestion control congestion window graph.</p><p>TCP_NOTSENT_LOWAT uses the calculated congestion window as a baseline for the size of the send buffer it needs to use so when the underlying calculation is wrong, the server ends up with send buffers much larger (or smaller) than it actually needs.</p><p>I like to think about Bufferbloat as being like a line for a ride at an amusement park. Specifically, one of those lines where it’s a straight shot to the ride when there are very few people in line but once the lines start to build they can divert you through a maze of zig-zags. Approaching the ride it looks like a short distance from the entrance to the ride but things can go horribly wrong.</p><p>Bufferbloat is very similar. When the data is coming into the network slower than the links can support, everything is nice and fast:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1IpqZY4r070iJKdzCndFAU/1e039adf8d8ea8f00238794b6eea0191/Bb-fast-web.png" />
            
            </figure><p>Response traveling through the network with no buffering.</p><p>Once the data comes in faster than it can go out the gates are flipped and the data gets routed through the maze of buffers to hold it until it can be sent. From the entrance to the line it still looks like everything is going fine since the network is absorbing the extra data but it also means there is a long queue of the low-priority data already absorbed when you want to send the high-priority data and it has no choice but to follow at the back of the line:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1iqGmNvNY0pOG3CJZRHT0w/4f65ec518eff246284d8f2bca373b7da/Bb-slow-web.png" />
            
            </figure><p>Responses queued in network buffers.</p>
    <div>
      <h4>BBR congestion control</h4>
      <a href="#bbr-congestion-control">
        
      </a>
    </div>
    <p><a href="https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster">BBR</a> is a new congestion control algorithm from Google that uses changes in packet delays to model congestion instead of waiting for packets to drop. Once it sees that packets are taking longer to be acknowledged, it assumes it has saturated the connection and packets have started to buffer. As a result, the congestion window is often very close to the optimum needed to keep the connection fully utilized while also avoiding Bufferbloat. BBR was merged into the Linux kernel in version 4.9 and can be configured through sysctl:</p>
            <pre><code>    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr</code></pre>
            <p>BBR also tends to perform better overall since it doesn’t require packet loss as part of probing for the correct congestion window and also tends to react better to random packet loss.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t9tjuLzOCF7CbuCLicM5G/a350fcbdafc39f865ac34e90f5778fce/h2_bbr_sawtooth.png" />
            
            </figure><p>BBR congestion window graph.</p><p>Back to the amusement park line, BBR is like having each person carry one of the RFID cards they use to measure the wait time. Once the wait time looks like it is getting longer, the people at the entrance slow down the rate at which they let people enter the line.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lAoZDM2jP8XYw9SyS5OMx/f040d3c07ff4b97eefc65cad5f89ab39/Bb-detect-web.png" />
            
            </figure><p>BBR detecting network congestion early.</p><p>This way BBR essentially keeps the line moving as fast as possible and prevents the maze of lines from being used. When a guest with a fast pass arrives (the high-priority request) they can jump into the fast-moving line and hop right onto the ride.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PIseBoVZR55F9yEe71pya/75c406b6b42272f3d02a761f7b6a7047/Bb-bbr-web.png" />
            
            </figure><p>BBR delivering responses without network buffering.</p><p>Technically, any congestion control algorithm that keeps Bufferbloat in check and maintains an accurate congestion window will work for keeping the TCP send buffers in check; BBR just happens to be one of them (with lots of good properties).</p>
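<p>The sysctl above changes the system-wide default, but on Linux the congestion control algorithm can also be selected per-socket with the <code>TCP_CONGESTION</code> option. Here is a hedged Python sketch (the fallback constant 13 is the Linux option number; the switch to BBR only succeeds when the <code>tcp_bbr</code> module is available, and the function reports the algorithm actually in effect rather than assuming success):</p>

```python
import socket

# Python exposes socket.TCP_CONGESTION on Linux; 13 is the raw option number.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)

def request_bbr(sock: socket.socket) -> str:
    """Try to switch this socket to BBR; return the algorithm actually in use."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b"bbr")
    except OSError:
        pass  # tcp_bbr not loaded, or not Linux: keep the current default
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
        return raw.split(b"\x00", 1)[0].decode()  # e.g. "bbr" or "cubic"
    except OSError:
        return "unknown"  # platform doesn't expose the option at all

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
algo = request_bbr(s)
s.close()
```

Reading the option back is the important part: a server that silently falls back to a loss-based algorithm will still exhibit the sawtooth congestion window shown above.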
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p><b><i>The combination of TCP_NOTSENT_LOWAT and BBR reduces the amount of buffering on the network to the absolute minimum and is CRITICAL for good end-user performance with HTTP/2.</i></b> This is particularly true for NGINX and other HTTP/2 servers that don’t implement their own buffer throttling.</p><p>The end-user impact of correct prioritization is huge and may not show up in most of the metrics you are used to watching (particularly any server-side metrics like requests-per-second, request response time, etc).</p><p>Even on a 5 Mbps cable connection, proper resource ordering can result in rendering a page significantly faster (and the difference can explode to dozens of seconds or even minutes on a slower connection). <a href="https://www.webpagetest.org/video/view.php?id=180927_1b80249b8e1a7619300d2a51533c07438e4fa5ea">Here</a> is a relatively common case of a WordPress blog served over HTTP/2:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42hSIcWY4c4t5AlEDYa1Sl/eaacdf916bba0264aa90549411b8fc22/h2_render_1.png" />
            
            </figure><p>The page from the tuned server (After) starts to render at 1.8 seconds.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DbdYJbZ12so4x2jDmnx3W/a9bb2f54dedd07bb752dddb8fcc11667/h2_render_2.png" />
            
            </figure><p>The page from the tuned server (After) is completely done rendering at 4.5 seconds, well before the default configuration (Before) even started to render.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2DFgPFFYHWh7yoPDVPvzD8/a9788b5de09169f0d9113b8255fee060/h2_render_3.png" />
            
            </figure><p>Finally, at 10.2 seconds the default configuration started to render (8.4 seconds later or 5.6 times slower than the tuned server).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Mi9As6BWT0lq9HCITHwJ5/fbd8ebf3a553691da6b53d266286652e/h2_render_4.png" />
            
            </figure><p>Visually complete on the default configuration arrives at 10.7 seconds (6.2 seconds or 2.3 times slower than the tuned server).</p><p>Both configurations served the exact same content using the exact same servers with “After” being tuned for TCP_NOTSENT_LOWAT of 16KB (both configurations used BBR).</p>
    <div>
      <h3>Identifying prioritization issues in the wild</h3>
      <a href="#identifying-prioritization-issues-in-the-wild">
        
      </a>
    </div>
    <p>If you look at a network waterfall diagram of a page loading, prioritization issues will show up as high-priority requests completing much later than lower-priority requests from the same origin. Usually that will also push metrics like <a href="https://w3c.github.io/paint-timing/#sec-terminology">First Paint</a> and <a href="https://developer.mozilla.org/en-US/docs/Web/Events/DOMContentLoaded">DOM Content Loaded</a> (the vertical purple bar below) much later.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Ed4mofSgwlJ77sJ46qcWf/4fa3f82440f6df9a1cab240d02c1ce4f/h2_waterfall_delayed.png" />
            
            </figure><p>Network waterfall showing critical CSS and JavaScript delayed by images.</p><p>When prioritization is working correctly, you will see critical resources all completing much earlier, not blocked by the lower-priority requests. You may still see SOME low-priority data download before the higher-priority data starts downloading, because there is still some buffering even under ideal conditions, but it should be minimal.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xpPBVSGpsy1byCuc6vGy7/2c2fe0e20129d3ce23304886674c8ef0/h2_waterfall_fast.png" />
            
            </figure><p>Network waterfall showing critical CSS and JavaScript loading quickly.</p><p>Chrome 69 and later may hide the problem a bit. Chrome holds back lower-priority requests even on HTTP/2 connections until after it has finished processing the head of the document. In a waterfall it will look like a delayed block of requests that all start at the same time after the critical requests have completed. That doesn’t mean that it isn’t a problem for Chrome, just that it isn’t as obvious. Even with the staggering of requests there are still high-priority requests outside of the head of the document that can be delayed by lower-priority requests. Most notable are any blocking scripts in the body of the page and any external fonts that were not preloaded.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/35PouKEd6DjgpSylNzuGp6/1081ad9271d8cb9acd89d1d5cce75c17/h2_waterfall_chrome.png" />
            
            </figure><p>Network waterfall showing Chrome delaying the requesting of low-priority resources.</p><p>Hopefully this post gives you the tools to be able to identify HTTP/2 prioritization issues when they happen, a deeper understanding of how HTTP/2 prioritization works and some tools to fix the issues when they appear.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">7zRny2MvKKFGt3vBHIyM5c</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Road to QUIC]]></title>
            <link>https://blog.cloudflare.com/the-road-to-quic/</link>
            <pubDate>Thu, 26 Jul 2018 15:04:36 GMT</pubDate>
            <description><![CDATA[ QUIC (Quick UDP Internet Connections) is a new encrypted-by-default Internet transport protocol that provides a number of improvements designed to accelerate HTTP traffic as well as make it more secure, with the intended goal of eventually replacing TCP and TLS on the web. ]]></description>
            <content:encoded><![CDATA[ <p>QUIC (Quick UDP Internet Connections) is a new encrypted-by-default Internet transport protocol that provides a number of improvements designed to accelerate HTTP traffic as well as make it more secure, with the intended goal of eventually replacing TCP and TLS on the web. In this blog post we are going to outline some of the key features of QUIC and how they benefit the web, and also some of the challenges of supporting this radical new protocol.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3XUtUi1ckk243XN0PWk2zo/68dcab940f1bcbb37cb3d3b7f10d1ddb/QUIC-Badge-Dark-RGB-Horiz.png" />
            
            </figure><p>There are in fact two protocols that share the same name: “Google QUIC” (“gQUIC” for short) is the original protocol, designed by Google engineers several years ago, which, after years of experimentation, has now been adopted by the <a href="https://ietf.org/">IETF</a> (Internet Engineering Task Force) for standardization.</p><p>“IETF QUIC” (just “QUIC” from now on) has already diverged from gQUIC quite significantly, such that it can be considered a separate protocol. From the wire format of the packets to the handshake and the mapping of HTTP, QUIC has improved the original gQUIC design thanks to open collaboration from many organizations and individuals, with the shared goal of making the Internet faster and more secure.</p><p>So, what are the improvements QUIC provides?</p>
    <div>
      <h3>Built-in security (and performance)</h3>
      <a href="#built-in-security-and-performance">
        
      </a>
    </div>
    <p>One of QUIC’s more radical deviations from the now venerable TCP is the stated design goal of providing a secure-by-default transport protocol. QUIC accomplishes this by providing, from the transport layer itself, security features like authentication and encryption that are typically handled by a higher-layer protocol such as TLS.</p><p>The initial QUIC handshake combines the typical three-way handshake that you get with TCP, with the TLS 1.3 handshake, which provides authentication of the end-points as well as negotiation of cryptographic parameters. For those familiar with the TLS protocol, QUIC replaces the TLS record layer with its own framing format, while keeping the same TLS handshake messages.</p><p>Not only does this ensure that the connection is always authenticated and encrypted, but it also makes the initial connection establishment faster as a result: the typical QUIC handshake only takes a single round-trip between client and server to complete, compared to the two round-trips required for the TCP and TLS 1.3 handshakes combined.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1CWL4LYn5pKIq6nJjHfPeK/0683aa058799594bf35d41605e05b4c1/http-request-over-tcp-tls_2x.png" />
            
            </figure><p> </p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5rT3ge707TKiFEIiyEL4IW/3588b3d9d434ca9b705d75c2a72de090/http-request-over-quic_2x.png" />
            
            </figure><p>But QUIC goes even further, and also encrypts additional connection metadata that could be abused by middle-boxes to interfere with connections. For example, packet numbers could be used by passive on-path attackers to correlate users’ activity over multiple network paths when connection migration is employed (see below). By encrypting packet numbers, QUIC ensures that they can't be used to correlate activity by any entity other than the end-points in the connection.</p><p>Encryption can also be an effective remedy to ossification, which makes flexibility built into a protocol (like for example being able to negotiate different versions of that protocol) impossible to use in practice due to wrong assumptions made by implementations (ossification is what <a href="/why-tls-1-3-isnt-in-browsers-yet/">delayed deployment of TLS 1.3</a> for so long, which <a href="/you-get-tls-1-3-you-get-tls-1-3-everyone-gets-tls-1-3">was only possible</a> after several changes, designed to prevent ossified middle-boxes from incorrectly blocking the new revision of the TLS protocol, were adopted).</p>
    <div>
      <h3>Head-of-line blocking</h3>
      <a href="#head-of-line-blocking">
        
      </a>
    </div>
    <p>One of the main improvements delivered by <a href="/introducing-http2/">HTTP/2</a> was the ability to multiplex different HTTP requests onto the same TCP connection. This allows HTTP/2 applications to process requests concurrently and better utilize the network bandwidth available to them.</p><p>This was a big improvement over the then status quo, which required applications to initiate multiple TCP+TLS connections if they wanted to process multiple HTTP/1.1 requests concurrently (e.g. when a browser needs to fetch both CSS and Javascript assets to render a web page). Creating new connections requires repeating the initial handshakes multiple times, as well as going through the initial congestion window ramp-up, which means that rendering of web pages is slowed down. Multiplexing HTTP exchanges avoids all that.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5FcudeYzlfQKPqRI8YXk45/d271d04cc9150d0debd301c6481157d5/multiplexing.svg" />
            
            </figure><p>This, however, has a downside: since multiple requests/responses are transmitted over the same TCP connection, they are all equally affected by packet loss (e.g. due to network congestion), even if the data that was lost only concerned a single request. This is called “head-of-line blocking”.</p><p>QUIC goes a bit deeper and provides first-class support for multiplexing: different HTTP streams are mapped to different QUIC transport streams, which still share the same QUIC connection (so no additional handshakes are required and congestion state is shared) but are delivered independently, such that in most cases packet loss affecting one stream doesn't affect the others.</p><p>This can dramatically reduce the time required to, for example, render complete web pages (with CSS, Javascript, images, and other kinds of assets) particularly when crossing highly congested networks with high packet loss rates.</p>
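<p>The difference can be made concrete with a toy model (illustrative Python, nowhere near a real protocol implementation; all names are invented). With a single ordered byte stream, as in TCP, nothing sent after a lost packet can reach the application until the loss is repaired; with per-stream ordering, as in QUIC, only the stream behind its own loss stalls:</p>

```python
def deliverable(packets, lost, per_stream_ordering):
    """Toy model of head-of-line blocking.

    `packets` is a list of (stream_id, seq) in send order; `lost` is a set
    of indices into that list. With one ordered byte stream (TCP-like),
    nothing after the first loss is deliverable; with per-stream ordering
    (QUIC-like), only streams behind their own loss are stalled.
    """
    delivered = []
    stalled_streams = set()
    for i, (stream, seq) in enumerate(packets):
        if i in lost:
            if per_stream_ordering:
                stalled_streams.add(stream)  # only this stream stalls
                continue
            break                            # the whole connection stalls
        if stream not in stalled_streams:
            delivered.append((stream, seq))
    return delivered

packets = [("css", 0), ("js", 0), ("css", 1), ("js", 1)]
lost = {1}  # the first "js" packet is dropped in the network

tcp_like = deliverable(packets, lost, per_stream_ordering=False)
quic_like = deliverable(packets, lost, per_stream_ordering=True)
```

In the TCP-like case only the first CSS packet gets through; in the QUIC-like case the entire CSS stream completes and only the JS stream waits for its retransmission.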
    <div>
      <h3>That easy, uh?</h3>
      <a href="#that-easy-uh">
        
      </a>
    </div>
    <p>In order to deliver on its promises, the QUIC protocol needs to break some of the assumptions that were taken for granted by many network applications, potentially making implementations and deployment of QUIC more difficult.</p><p>QUIC is designed to be delivered on top of UDP datagrams, to ease deployment and avoid problems coming from network appliances that drop packets from unknown protocols, since most appliances already support UDP. This also allows QUIC implementations to live in user-space, so that, for example, browsers will be able to implement new protocol features and ship them to their users without having to wait for operating system updates.</p><p>However, despite the intended goal of avoiding breakage, QUIC also makes preventing abuse and routing packets to the correct end-points more challenging.</p>
    <div>
      <h3>One NAT to bring them all and in the darkness bind them</h3>
      <a href="#one-nat-to-bring-them-all-and-in-the-darkness-bind-them">
        
      </a>
    </div>
    <p>Typical NAT routers can keep track of TCP connections passing through them by using the traditional 4-tuple (source IP address and port, and destination IP address and port), and by observing TCP SYN, ACK and FIN packets transmitted over the network, they can detect when a new connection is established and when it is terminated. This allows them to precisely manage the lifetime of NAT bindings (the association between the internal IP address and port and the external ones).</p><p>With QUIC this is not yet possible, since NAT routers deployed in the wild today do not understand QUIC yet, so they typically fall back to the default and less precise handling of UDP flows, which usually involves using <a href="https://conferences.sigcomm.org/imc/2010/papers/p260.pdf">arbitrary, and at times very short, timeouts</a>, which could affect long-running connections.</p><p>When a NAT rebinding happens (due to a timeout, for example), the end-point on the outside of the NAT perimeter will see packets coming from a different source port than the one that was observed when the connection was originally established, which makes it impossible to track connections by only using the 4-tuple.</p>
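<p>The UDP bookkeeping described above can be sketched as a table keyed by the 4-tuple with an idle timeout: once a binding expires, the next packet on the same flow gets a fresh external port, which is exactly the rebinding that breaks 4-tuple-based tracking. A toy Python model (all class names, ports, and timeout values are invented for illustration):</p>

```python
class UdpNat:
    """Toy NAT: maps internal UDP flows to external ports, expiring
    idle bindings after `timeout` seconds (simulated clock, no I/O)."""

    def __init__(self, timeout=30, first_port=50000):
        self.timeout = timeout
        self.next_port = first_port
        # (int_ip, int_port, dst_ip, dst_port) -> (ext_port, last_seen)
        self.bindings = {}

    def translate(self, flow, now):
        entry = self.bindings.get(flow)
        if entry and now - entry[1] < self.timeout:
            ext_port = entry[0]        # live binding: reuse the external port
        else:
            ext_port = self.next_port  # new or expired: allocate a fresh port
            self.next_port += 1
        self.bindings[flow] = (ext_port, now)
        return ext_port

nat = UdpNat(timeout=30)
flow = ("10.0.0.2", 40000, "203.0.113.5", 443)
p1 = nat.translate(flow, now=0)   # new binding
p2 = nat.translate(flow, now=10)  # binding still alive: same external port
p3 = nat.translate(flow, now=60)  # idle timeout hit: rebinding, new port
```

After the rebinding at <code>now=60</code> the server suddenly sees the same QUIC connection arrive from a different source port, so any server-side state keyed on the 4-tuple no longer matches.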
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2nozSxNZ6rsrj02Tb1IACy/fa90467738d8e89865b59be3ee4e5cf6/NAT-timeout-_2x.png" />
            
            </figure><p>And it's not just NAT! One of the features QUIC is intended to deliver is called “connection migration” and will allow QUIC end-points to migrate connections to different IP addresses and network paths at will. For example, a mobile client will be able to migrate QUIC connections between cellular data networks and WiFi when a known WiFi network becomes available (like when its user enters their favorite coffee shop).</p><p>QUIC tries to address this problem by introducing the concept of a connection ID: an arbitrary opaque blob of variable length, carried by QUIC packets, that can be used to identify a connection. End-points can use this ID to track connections that they are responsible for without the need to check the 4-tuple (in practice there might be multiple IDs identifying the same connection, for example to avoid linking different paths when connection migration is used, but that behavior is controlled by the end-points not the middle-boxes).</p><p>However this also poses a problem for network operators that use anycast addressing and <a href="/path-mtu-discovery-in-practice/">ECMP routing</a>, where a single destination IP address can potentially identify hundreds or even thousands of servers. Since edge routers used by these networks also don't yet know how to handle QUIC traffic, it might happen that UDP packets belonging to the same QUIC connection (that is, with the same connection ID) but with different 4-tuple (due to NAT rebinding or connection migration) might end up being routed to different servers, thus breaking the connection.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3y7Ae7Hm6zUy6SyL315Oxa/366eee8f4d5a89458556b77c53aacecb/anycast-cdn.png" />
            
            </figure><p>In order to address this, network operators might need to employ smarter layer 4 load balancing solutions, which can be implemented in software and deployed without the need to touch edge routers (see for example Facebook's <a href="https://github.com/facebookincubator/katran">Katran</a> project).</p>
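<p>What such a QUIC-aware layer 4 load balancer does differently can be sketched in a few lines (illustrative Python; the server names and hashing scheme are invented, and real balancers like Katran use far more sophisticated mechanisms): instead of hashing the 4-tuple, it routes on the connection ID carried in the packet, which survives NAT rebinding and connection migration:</p>

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c", "server-d"]

def route_by_tuple(src_ip, src_port, dst_ip, dst_port):
    """Classic ECMP-style routing: a hash of the 4-tuple picks the server,
    so a change of source port can land the flow on a different server."""
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode()
    return SERVERS[int(hashlib.sha256(key).hexdigest(), 16) % len(SERVERS)]

def route_by_connection_id(conn_id: bytes):
    """QUIC-aware routing: the stable connection ID picks the server."""
    return SERVERS[int(hashlib.sha256(conn_id).hexdigest(), 16) % len(SERVERS)]

conn_id = b"\x1f\x8b\x42\x07"
before = route_by_connection_id(conn_id)
# ...a NAT rebinding changes the source port, but not the connection ID...
after = route_by_connection_id(conn_id)
```

With 4-tuple hashing the rebinding from the previous section can move packets to a server that has no state for the connection; with connection-ID routing the mapping is unchanged.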
    <div>
      <h3>QPACK</h3>
      <a href="#qpack">
        
      </a>
    </div>
    <p>Another benefit introduced by HTTP/2 was <a href="/hpack-the-silent-killer-feature-of-http-2/">header compression (or HPACK)</a> which allows HTTP/2 end-points to reduce the amount of data transmitted over the network by removing redundancies from HTTP requests and responses.</p><p>In particular, among other techniques, HPACK employs dynamic tables populated with headers that were sent (or received) from previous HTTP requests (or responses), allowing end-points to reference previously encountered headers in new requests (or responses), rather than having to transmit them all over again.</p><p>HPACK's dynamic tables need to be synchronized between the encoder (the party that sends an HTTP request or response) and the decoder (the one that receives them), otherwise the decoder will not be able to decode what it receives.</p><p>With HTTP/2 over TCP this synchronization is transparent, since the transport layer (TCP) takes care of delivering HTTP requests and responses in the same order they were sent in, the instructions for updating the tables can simply be sent by the encoder as part of the request (or response) itself, making the encoding very simple. 
But for QUIC this is more complicated.</p><p>QUIC can deliver multiple HTTP requests (or responses) over different streams independently, which means that while it takes care of delivering data in order as far as a single stream is concerned, there are no ordering guarantees across multiple streams.</p><p>For example, if a client sends HTTP request A over QUIC stream A, and request B over stream B, it might happen, due to packet reordering or loss in the network, that request B is received by the server before request A, and if request B was encoded such that it referenced a header from request A, the server will be unable to decode it since it didn't yet see request A.</p><p>In the gQUIC protocol this problem was solved by simply serializing all HTTP request and response headers (but not the bodies) over the same gQUIC stream, which meant headers would get delivered in order no matter what. This is a very simple scheme that allows implementations to reuse a lot of their existing HTTP/2 code, but on the other hand it increases the head-of-line blocking that QUIC was designed to reduce. The IETF QUIC working group thus designed a new mapping between HTTP and QUIC (“HTTP/QUIC”) as well as a new header compression scheme called “QPACK”.</p><p>In the latest draft of the HTTP/QUIC mapping and the QPACK spec, each HTTP request/response exchange uses its own bidirectional QUIC stream, so there's no head-of-line blocking. In addition, in order to support QPACK, each peer creates two additional unidirectional QUIC streams, one used to send QPACK table updates to the other peer, and one to acknowledge updates received by the other side. This way, a QPACK encoder can use a dynamic table reference only after it has been explicitly acknowledged by the decoder.</p>
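<p>The acknowledgment rule at the heart of QPACK can be sketched as follows (a drastically simplified, illustrative Python model, not the real wire format): the encoder inserts a header into its dynamic table and announces the insertion on its encoder stream, but it only emits a table <i>reference</i>, rather than a literal, once the decoder has acknowledged the entry, so a reordered request can never point at an entry the decoder hasn't seen:</p>

```python
class ToyQpackEncoder:
    """Minimal model of QPACK's ack-before-reference rule."""

    def __init__(self):
        self.table = []     # dynamic table: list of (name, value) entries
        self.acked = set()  # indices the decoder has acknowledged

    def insert(self, name, value):
        """Add an entry; the insertion travels on the encoder stream."""
        self.table.append((name, value))
        return len(self.table) - 1

    def on_ack(self, index):
        """An acknowledgment arrived on the decoder's unidirectional stream."""
        self.acked.add(index)

    def encode(self, name, value):
        """Emit a table reference only for acknowledged entries."""
        for i, entry in enumerate(self.table):
            if entry == (name, value) and i in self.acked:
                return ("ref", i)          # safe: decoder has the entry
        return ("literal", name, value)    # not acked yet: send in full

enc = ToyQpackEncoder()
idx = enc.insert("user-agent", "curl/7.61")
first = enc.encode("user-agent", "curl/7.61")   # before the ack: literal
enc.on_ack(idx)
second = enc.encode("user-agent", "curl/7.61")  # after the ack: reference
```

The cost of this safety is that compression only kicks in a round-trip after an entry is first used, which is the trade-off QPACK accepts to avoid reintroducing head-of-line blocking.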
    <div>
      <h3>Deflecting Reflection</h3>
      <a href="#deflecting-reflection">
        
      </a>
    </div>
    <p>A common problem among <a href="/ssdp-100gbps/">UDP-based</a> <a href="/memcrashed-major-amplification-attacks-from-port-11211/">protocols</a> is their susceptibility to <a href="/reflections-on-reflections/">reflection attacks</a>, where an attacker tricks an otherwise innocent server into sending large amounts of data to a third-party victim, by spoofing the source IP address of packets targeted to the server to make them look like they came from the victim.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LfQICIemcnM4AgyhDb6kD/0649164a476896ca3a58de5390c08f09/ip-spoofing.png" />
            
            </figure><p>This kind of attack can be very effective when the response sent by the server happens to be larger than the request it received, in which case we speak of “amplification”.</p><p>TCP is not usually used for this kind of attack because the initial packets transmitted during its handshake (SYN, SYN+ACK, …) have the same length, so they don’t provide any amplification potential.</p><p>QUIC’s handshake, on the other hand, is very asymmetrical: as with TLS, in its first flight the QUIC server generally sends its own certificate chain, which can be very large, while the client only has to send a few bytes (the TLS ClientHello message embedded into a QUIC packet). For this reason, the initial QUIC packet sent by a client has to be padded to a specific minimum length (even if the actual content of the packet is much smaller). However, this mitigation is still not sufficient, since the typical server response spans multiple packets and can thus still be far larger than the padded client packet.</p><p>The QUIC protocol also defines an explicit source-address verification mechanism, in which the server, rather than sending its long response, only sends a much smaller “retry” packet, which contains a unique cryptographic token that the client will then have to echo back to the server inside a new initial packet. This way, the server has higher confidence that the client is not spoofing its own source IP address (since it received the retry packet) and can complete the handshake. The downside of this mitigation is that it increases the initial handshake duration from a single round-trip to two.</p><p>An alternative solution involves reducing the server's response to the point where a reflection attack becomes less effective, for example by using <a href="/ecdsa-the-digital-signature-algorithm-of-a-better-internet/">ECDSA certificates</a> (which are typically much smaller than their RSA counterparts). 
We have also been experimenting with a mechanism for <a href="https://tools.ietf.org/html/draft-ietf-tls-certificate-compression">compressing TLS certificates</a> using off-the-shelf compression algorithms like zlib and brotli, which is a feature originally introduced by gQUIC but not currently available in TLS.</p>
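<p>The padding mitigation can be illustrated numerically (illustrative Python; the 1200-byte minimum matches the figure used by the QUIC drafts, while the ClientHello and server-flight sizes are made-up examples): padding the client's first packet directly caps the amplification factor an attacker can obtain by spoofing the source address.</p>

```python
MIN_CLIENT_INITIAL = 1200  # minimum UDP payload for a client's first packet

def pad_initial(payload: bytes) -> bytes:
    """Pad a client Initial packet up to the minimum allowed size."""
    if len(payload) < MIN_CLIENT_INITIAL:
        payload += b"\x00" * (MIN_CLIENT_INITIAL - len(payload))
    return payload

def amplification_factor(request_len: int, response_len: int) -> float:
    """Bytes the server sends per byte the (possibly spoofed) client sent."""
    return response_len / request_len

tiny_hello = b"\x0d" * 60   # a hypothetical 60-byte unpadded ClientHello
padded = pad_initial(tiny_hello)
server_flight = 4 * 1200    # e.g. a certificate chain spanning four packets

unpadded_amp = amplification_factor(len(tiny_hello), server_flight)
padded_amp = amplification_factor(len(padded), server_flight)
```

In this example padding turns an 80x amplifier into a 4x one, which is why the retry mechanism and smaller certificates are still needed on top of it.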
    <div>
      <h3>UDP performance</h3>
      <a href="#udp-performance">
        
      </a>
    </div>
    <p>One of the recurring issues with QUIC involves existing hardware and software deployed in the wild not being able to understand it. We've already looked at how QUIC tries to address network middle-boxes like routers, but another potentially problematic area is the performance of sending and receiving data over UDP on the QUIC end-points themselves. Over the years a lot of work has gone into optimizing TCP implementations as much as possible, including building off-loading capabilities in both software (like in operating systems) and hardware (like in network interfaces), but none of that is currently available for UDP.</p><p>However, it’s only a matter of time until QUIC implementations can take advantage of these capabilities as well. Look, for example, at the recent efforts to implement <a href="https://lwn.net/Articles/752184/">Generic Segmentation Offloading for UDP on Linux</a>, which would allow applications to bundle and transfer multiple UDP segments between user-space and the kernel-space networking stack at the cost of a single one (or close enough), as well as the effort to add <a href="https://lwn.net/Articles/655299/">zerocopy socket support also on Linux</a>, which would allow applications to avoid the cost of copying user-space memory into kernel-space.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Like HTTP/2 and TLS 1.3, QUIC is set to deliver a lot of new features designed to improve the performance and security of websites, as well as other Internet-based properties. The IETF working group is currently set to deliver the first version of the QUIC specifications by the end of the year, and Cloudflare engineers are already hard at work to provide the benefits of QUIC to all of our customers.</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[IETF]]></category>
            <guid isPermaLink="false">4ZyUVtsRDEiNCkr2iwov88</guid>
            <dc:creator>Alessandro Ghedini</dc:creator>
        </item>
        <item>
            <title><![CDATA[Deprecating SPDY]]></title>
            <link>https://blog.cloudflare.com/deprecating-spdy/</link>
            <pubDate>Thu, 18 Jan 2018 15:58:00 GMT</pubDate>
            <description><![CDATA[ Participating in the Internet democracy occasionally means that technologies that were once popular lose their utility as newer technologies emerge. SPDY is one such technology.  As a result, we're announcing our intention to deprecate the use of SPDY for connections made to Cloudflare's edge. ]]></description>
            <content:encoded><![CDATA[ <p>Democratizing the Internet and making new features available to all Cloudflare customers is a core part of what we do. We're proud to be early adopters and have a long record of adopting new standards early, such as <a href="/introducing-http2/">HTTP/2</a>, as well as features that are experimental or not yet final, like <a href="/introducing-tls-1-3/">TLS 1.3</a> and <a href="/introducing-spdy/">SPDY</a>.</p><p>Participating in the Internet democracy occasionally means that ideas and technologies that were once popular or ubiquitous on the net lose their utility as newer technologies emerge. SPDY is one such technology. Several years ago, Google drafted a <a href="http://www.chromium.org/spdy/spdy-whitepaper">proprietary and experimental new protocol called SPDY</a>. SPDY offered many performance improvements over the aging HTTP/1.1 standard and these improvements resulted in significantly faster page load times for real-world websites. Stemming from its success, SPDY became the starting point for HTTP/2 and, when the new HTTP standard was finalized, the SPDY experiment came to an end and the protocol gradually fell into disuse.</p><p>As a result, we're announcing our intention to deprecate the use of SPDY for connections made to Cloudflare's edge by February 21st, 2018.</p>
    <div>
      <h3>Remembering 2012</h3>
      <a href="#remembering-2012">
        
      </a>
    </div>
    <p>Five and a half years ago, when the majority of the web was unencrypted and web developers were resorting to creative tricks to improve performance (such as domain sharding to download resources in parallel) to get around the limitations of the then thirteen-year-old HTTP/1.1 standard, Cloudflare launched support for an exciting new protocol called SPDY.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/GcM2v0PkiVsXMHYMcNMFd/d4450ed0bde7202c91677354c4e0c486/0910152_02-A4-at-144-dpi.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a> by <a href="https://cds.cern.ch/record/1211045">Maximilien Brice</a></p><p>SPDY aimed to remove many of the bottlenecks present in HTTP/1.1 by changing the way HTTP requests were sent over the wire. Using header compression, request prioritization, and multiplexing, SPDY was able to provide significant performance gains while remaining compatible with the existing HTTP/1.1 standard. This meant that server operators could place a SPDY layer, such as Cloudflare, in front of their web application and gain the performance benefits of SPDY without modifying any of their existing code. SPDY effectively became a fast tunnel for HTTP traffic.</p>
    <div>
      <h3>2015: An ever-changing landscape</h3>
      <a href="#2015-an-ever-changing-landscape">
        
      </a>
    </div>
    <p>As the HTTPbis Working Group submitted SPDY for <a href="https://tools.ietf.org/html/draft-ietf-httpbis-http2-00">standardization</a>, the protocol underwent some changes (e.g. using <a href="https://tools.ietf.org/html/rfc7541">HPACK</a> for compression), but retained its core performance optimizations and came to be known as HTTP/2 in May 2015.</p><p>With the standardization of HTTP/2, Google announced that they would cease supporting SPDY with the public release of Chrome 51 in May 2016. This signaled to other software providers that they too should abandon support for the experimental SPDY protocol in favor of the newly standardized HTTP/2. Mozilla did so with the release of Firefox 51 in January 2017 and NGINX built their HTTP/2 module so that either it or the SPDY module could be used to terminate HTTPS connections, but not both.</p><p>Cloudflare announced support for HTTP/2 in December of 2015. It was at this point that we deviated from our peers in the industry. We knew that adoption of this new standard and the migration away from SPDY would take longer than the 1-2 year timeline put forth by Google and others, so we created our own patch for NGINX so that we could support terminating both SPDY and HTTP/2. This allowed Cloudflare customers who had yet to upgrade their clients to continue to receive the performance benefits of SPDY until they were able to support HTTP/2.</p><p>When we made this decision, SPDY was used for 53.59% of TLS connections to our edge and HTTP/2 for 26.79%. Had we only adopted the standard NGINX HTTP/2 module, we would've made the internet around 20% slower for more than half of all visitors to sites on Cloudflare, definitely not a performance compromise we were willing to make!</p>
    <div>
      <h3>To 2018 and Beyond</h3>
      <a href="#to-2018-and-beyond">
        
      </a>
    </div>
    <p>Two years after we began supporting HTTP/2 (and nearly three years after standardization), the majority of web browsers now support HTTP/2, with the notable exceptions of UC Browser for Android and Opera Mini. This results in SPDY being used for only 3.83% of TLS connections to Cloudflare's edge, whereas HTTP/2 accounts for 66.88%. For several reasons, now is the time to stop supporting SPDY.</p><p><a href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a> by <a href="https://pixabay.com/en/users/JanBaby-3005373/">JanBaby</a></p><p>Looking closer at the low percentage of clients connecting with SPDY in 2018, 65% of these connections are made by older iOS and macOS apps compiled against HTTP and TLS libraries which only support SPDY and not HTTP/2. This means that these app developers need to publish an update with HTTP/2 support to enable support for this protocol. We worked closely with Apple to assess the impact of deprecating SPDY for these clients and determined that the impact of deprecation is minimal.</p><p>We mentioned earlier that we applied our own patch to NGINX in order to be able to continue to support both SPDY and HTTP/2 for TLS connections. What we didn't mention was the engineering cost associated with maintaining this patch. Every time we want to update NGINX, we also need to update and test our patch, which makes each upgrade more difficult. Further, no active development is being done on SPDY, so in the event that security issues arise, we would incur the cost of developing our own security patches.</p><p>Finally, when we do disable SPDY this February, the less than 4% of traffic that currently uses SPDY will still be able to connect to Cloudflare's edge by gracefully falling back to HTTP/1.1.</p>
    <div>
      <h3>Moving Forward While Looking Back</h3>
      <a href="#moving-forward-while-looking-back">
        
      </a>
    </div>
    <p>Part of being an innovator is knowing when it is time to move forward and put older innovations in the rear view mirror. Seeing 10% of HTTP requests made on the internet puts Cloudflare in a unique position to analyze overall adoption trends of new technologies, allowing us to make informed decisions on when to launch new features or deprecate legacy ones.</p><p>SPDY has been extraordinarily beneficial to clients connecting to Cloudflare over the years, but now that the protocol is largely abandoned and superseded by newer technologies, we recognize that 2018 is the time to say goodbye to an aging legacy protocol.</p> ]]></content:encoded>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[spdy]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">3PXXT92K6072DYydBkCKiN</guid>
            <dc:creator>Max Nystrom</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to make your site HTTPS-only]]></title>
            <link>https://blog.cloudflare.com/how-to-make-your-site-https-only/</link>
            <pubDate>Thu, 06 Jul 2017 13:35:00 GMT</pubDate>
            <description><![CDATA[ The Internet is getting more secure every day as people enable HTTPS, the secure version of HTTP, on their sites and services. ]]></description>
            <content:encoded><![CDATA[ <p>The Internet is getting more secure every day as people enable HTTPS, the secure version of HTTP, on their sites and services. Last year, Mozilla reported that the percentage of requests made by Firefox using encrypted HTTPS passed <a href="https://twitter.com/0xjosh/status/78697141295942042">50% for the first time</a>. HTTPS has numerous benefits that are not available over unencrypted HTTP, including improved performance with <a href="https://www.cloudflare.com/website-optimization/http2/">HTTP/2</a>, <a href="http://searchengineland.com/google-starts-giving-ranking-boost-secure-httpsssl-sites-199446">SEO benefits</a> for search engines like Google and the reassuring lock icon in the address bar.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2mExGu34GTxoBejOt9ef1Y/a173ce5dfc0b61d2213238afab253abe/image1.jpg" />
            
            </figure><p>So how do you add HTTPS to your site or service? That’s simple: Cloudflare offers free and automatic HTTPS support for all customers with no configuration. Sign up for any plan and Cloudflare will issue an SSL certificate for you and serve your site over HTTPS.</p>
    <div>
      <h3>HTTPS-only</h3>
      <a href="#https-only">
        
      </a>
    </div>
    <p>Enabling HTTPS does not mean that all visitors are protected. If a visitor types your website’s name into the address bar of a browser or follows an HTTP link, they will land on the insecure HTTP version of your website. In order to make your site HTTPS-only, you need to redirect visitors from the HTTP to the HTTPS version of your site.</p><p>Going HTTPS-only should be as easy as a click of a button, so we literally added one to the Cloudflare dashboard. Enable the “Always Use HTTPS” feature and all visitors of the HTTP version of your website will be redirected to the HTTPS version. You’ll find this option just above the HTTP Strict Transport Security setting and it is of course also available through our <a href="https://api.cloudflare.com/#zone-settings-change-always-use-https-setting">API</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19twvdqDrUtYthZ88NZCG1/9b888300f8ed08e480adfbcaad1c1ee0/image2-1.png" />
            
            </figure><p>If you would like to redirect only a subset of your requests, you can do so by creating a <a href="https://www.cloudflare.com/features-page-rules/">Page Rule</a>. Simply use the “Always Use HTTPS” setting on any URL pattern.</p>
    <div>
      <h3>Securing your site: next steps</h3>
      <a href="#securing-your-site-next-steps">
        
      </a>
    </div>
    <p>Once you have confirmed that your site is fully functional with HTTPS-only enabled, you can take it a step further and enable HTTP Strict Transport Security (<a href="/enforce-web-policy-with-hypertext-strict-transport-security-hsts/">HSTS</a>). HSTS is a header that tells browsers that your site is available over HTTPS and will be for a set period of time. Once a browser sees an HSTS header for a site, it will automatically fetch the HTTPS version of HTTP pages without needing to follow redirects. HSTS can be enabled in the crypto app right under the Always Use HTTPS toggle.</p><p>It's also important to secure the connection between Cloudflare and your site. To do that, you can use Cloudflare's <a href="/cloudflare-ca-encryption-origin/">Origin CA</a> to get a free certificate for your origin server. Once your origin server is set up with HTTPS and a valid certificate, change your SSL mode to Full (strict) to get the highest level of security.</p> ]]></content:encoded>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Automatic HTTPS]]></category>
            <category><![CDATA[Certificate Authority]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">2XThcNAZoHfneQaKkn6XXC</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[So you want to expose Go on the Internet]]></title>
            <link>https://blog.cloudflare.com/exposing-go-on-the-internet/</link>
            <pubDate>Mon, 26 Dec 2016 14:59:01 GMT</pubDate>
            <description><![CDATA[ Back when crypto/tls was slow and net/http young, the general wisdom was to always put Go servers behind a reverse proxy like NGINX. That's not necessary anymore! ]]></description>
            <content:encoded><![CDATA[ <p><i>This piece was </i><a href="https://blog.gopheracademy.com/advent-2016/exposing-go-on-the-internet/"><i>originally written</i></a><i> for the Gopher Academy advent series. We are grateful to them for allowing us to republish it here.</i></p><p>Back when <code>crypto/tls</code> was slow and <code>net/http</code> young, the general wisdom was to always put Go servers behind a reverse proxy like NGINX. That's not necessary anymore!</p><p>At Cloudflare we recently experimented with exposing pure Go services to the hostile wide area network. With the Go 1.8 release, <code>net/http</code> and <code>crypto/tls</code> proved to be stable, performant and flexible.</p><p>However, the defaults are tuned for local services. In this article we'll see how to tune and harden a Go server for Internet exposure.</p>
    <div>
      <h3><code>crypto/tls</code></h3>
      <a href="#crypto-tls">
        
      </a>
    </div>
    <p>You're not running an insecure HTTP server on the Internet in 2016. So you need <code>crypto/tls</code>. The good news is that it's <a href="/go-crypto-bridging-the-performance-gap/">now really fast</a> (as you've seen in a <a href="https://blog.gopheracademy.com/advent-2016/tls-termination-bench/">previous advent article</a>), and its security track record so far is excellent.</p><p>The default settings resemble the <i>Intermediate</i> recommended configuration of the <a href="https://wiki.mozilla.org/Security/Server_Side_TLS">Mozilla guidelines</a>. However, you should still set <code>PreferServerCipherSuites</code> to ensure safer and faster cipher suites are preferred, and <code>CurvePreferences</code> to avoid unoptimized curves: a client using <code>CurveP384</code> would cause up to a second of CPU to be consumed on our machines.</p>
            <pre><code>&amp;tls.Config{
	// Causes servers to use Go's default ciphersuite preferences,
	// which are tuned to avoid attacks. Does nothing on clients.
	PreferServerCipherSuites: true,
	// Only use curves which have assembly implementations
	CurvePreferences: []tls.CurveID{
		tls.CurveP256,
		tls.X25519, // Go 1.8 only
	},
}</code></pre>
            <p>If you can take the compatibility loss of the <i>Modern</i> configuration, you should then also set <code>MinVersion</code> and <code>CipherSuites</code>.</p>
            <pre><code>	MinVersion: tls.VersionTLS12,
	CipherSuites: []uint16{
		tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
		tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
		tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305, // Go 1.8 only
		tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,   // Go 1.8 only
		tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
		tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,

		// Best disabled, as they don't provide Forward Secrecy,
		// but might be necessary for some clients
		// tls.TLS_RSA_WITH_AES_256_GCM_SHA384,
		// tls.TLS_RSA_WITH_AES_128_GCM_SHA256,
	},</code></pre>
            <p>Be aware that the Go implementation of the CBC cipher suites (the ones we disabled in <i>Modern</i> mode above) is vulnerable to the <a href="https://www.imperialviolet.org/2013/02/04/luckythirteen.html">Lucky13 attack</a>, even if <a href="https://github.com/golang/go/commit/f28cf8346c4ce7cb74bf97c7c69da21c43a78034">partial countermeasures were merged in 1.8</a>.</p><p>A final caveat: all these recommendations apply only to the amd64 architecture, for which <a href="/go-crypto-bridging-the-performance-gap/">fast, constant time implementations</a> of the crypto primitives (AES-GCM, ChaCha20-Poly1305, P256) are available. Other architectures are probably not fit for production use.</p><p>Since this server will be exposed to the Internet, it will need a publicly trusted certificate. You can get one easily and for free thanks to Let's Encrypt and the <a href="https://godoc.org/golang.org/x/crypto/acme/autocert"><code>golang.org/x/crypto/acme/autocert</code></a> package’s <code>GetCertificate</code> function.</p><p>Don't forget to redirect HTTP page loads to HTTPS, and consider <a href="https://www.owasp.org/index.php/HTTP_Strict_Transport_Security_Cheat_Sheet">HSTS</a> if your clients are browsers.</p>
            <pre><code>srv := &amp;http.Server{
	ReadTimeout:  5 * time.Second,
	WriteTimeout: 5 * time.Second,
	Handler: http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		w.Header().Set("Connection", "close")
		url := "https://" + req.Host + req.URL.String()
		http.Redirect(w, req, url, http.StatusMovedPermanently)
	}),
}
go func() { log.Fatal(srv.ListenAndServe()) }()</code></pre>
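<p>The <code>GetCertificate</code> hook mentioned above plugs into the TLS configuration like this. This is a configuration sketch, not a tested server: <code>example.com</code> and the cache path are placeholders you would replace with your own values.</p>

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"

	"golang.org/x/crypto/acme/autocert"
)

func main() {
	// autocert obtains and renews Let's Encrypt certificates on demand.
	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		HostPolicy: autocert.HostWhitelist("example.com"),     // placeholder hostname
		Cache:      autocert.DirCache("/var/cache/autocert"),  // placeholder cache dir
	}
	srv := &http.Server{
		Addr:      ":443",
		TLSConfig: &tls.Config{GetCertificate: m.GetCertificate},
	}
	// Empty cert/key paths: certificates come from GetCertificate instead.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```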
            <p>You can use the <a href="https://www.ssllabs.com/ssltest/">SSL Labs test</a> to check that everything is configured correctly.</p>
    <div>
      <h3><code>net/http</code></h3>
      <a href="#net-http">
        
      </a>
    </div>
    <p><code>net/http</code> is a mature HTTP/1.1 and HTTP/2 stack. You probably know how (and have opinions about how) to use the Handler side of it, so that's not what we'll talk about. We will instead talk about the Server side and what goes on behind the scenes.</p>
    <div>
      <h3>Timeouts</h3>
      <a href="#timeouts">
        
      </a>
    </div>
    <p>Timeouts are possibly the most dangerous edge case to overlook. Your service might get away with it on a controlled network, but it will not survive on the open Internet, especially (but not only) if maliciously attacked.</p><p>Applying timeouts is a matter of resource control. Even if goroutines are cheap, file descriptors are always limited. A connection that is stuck, not making progress or is maliciously stalling should not be allowed to consume them.</p><p>A server that runs out of file descriptors will fail to accept new connections with errors like</p>
            <pre><code>http: Accept error: accept tcp [::]:80: accept: too many open files; retrying in 1s</code></pre>
            <p>A zero/default <code>http.Server</code>, like the one used by the package-level helpers <code>http.ListenAndServe</code> and <code>http.ListenAndServeTLS</code>, comes with no timeouts. You don't want that.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12k30eO5eG5vdJG4C7Yk6J/89cc5b4dc27bff41da1259bde71b278d/Timeouts.png" />
            
            </figure><p>There are three main timeouts exposed in <code>http.Server</code>: <code>ReadTimeout</code>, <code>WriteTimeout</code> and <code>IdleTimeout</code>. You set them by explicitly using a Server:</p>
            <pre><code>srv := &amp;http.Server{
    ReadTimeout:  5 * time.Second,
    WriteTimeout: 10 * time.Second,
    IdleTimeout:  120 * time.Second,
    TLSConfig:    tlsConfig,
    Handler:      serveMux,
}
log.Println(srv.ListenAndServeTLS("", ""))</code></pre>
            <p><code>ReadTimeout</code> covers the time from when the connection is accepted to when the request body is fully read (if you do read the body, otherwise to the end of the headers). It's implemented in <code>net/http</code> by calling <code>SetReadDeadline</code> <a href="https://github.com/golang/go/blob/3ba31558d1bca8ae6d2f03209b4cae55381175b3/src/net/http/server.go#L750">immediately after Accept</a>.</p><p>The problem with a <code>ReadTimeout</code> is that it doesn't allow a server to give the client more time to stream the body of a request based on the path or the content. Go 1.8 introduces <code>ReadHeaderTimeout</code>, which only covers up to the request headers. However, there's still no clear way to do reads with timeouts from a Handler. Different designs are being discussed in issue <a href="https://golang.org/issue/16100">#16100</a>.</p><p><code>WriteTimeout</code> normally covers the time from the end of the request header read to the end of the response write (a.k.a. the lifetime of the ServeHTTP), by calling <code>SetWriteDeadline</code> <a href="https://github.com/golang/go/blob/3ba31558d1bca8ae6d2f03209b4cae55381175b3/src/net/http/server.go#L753-L755">at the end of readRequest</a>.</p><p>However, when the connection is over HTTPS, <code>SetWriteDeadline</code> is called <a href="https://github.com/golang/go/blob/3ba31558d1bca8ae6d2f03209b4cae55381175b3/src/net/http/server.go#L1477-L1483">immediately after Accept</a> so that it also covers the packets written as part of the TLS handshake. 
Annoyingly, this means that (in that case only) <code>WriteTimeout</code> ends up including the header read and the first byte wait.</p><p>Similarly to <code>ReadTimeout</code>, <code>WriteTimeout</code> is absolute, with no way to manipulate it from a Handler (<a href="https://golang.org/issue/16100">#16100</a>).</p><p>Finally, Go 1.8 <a href="https://github.com/golang/go/issues/14204">introduces <code>IdleTimeout</code></a> which limits server-side the amount of time a Keep-Alive connection will be kept idle before being reused. Before Go 1.8, the <code>ReadTimeout</code> would start ticking again immediately after a request completed, making it very hostile to Keep-Alive connections: the idle time would consume time the client should have been allowed to send the request, causing unexpected timeouts also for fast clients.</p><p>You should set <code>Read</code>, <code>Write</code> and <code>Idle</code> timeouts when dealing with untrusted clients and/or networks, so that a client can't hold up a connection by being slow to write or read.</p><p>For detailed background on HTTP/1.1 timeouts (up to Go 1.7) read <a href="/the-complete-guide-to-golang-net-http-timeouts/">my post on the Cloudflare blog</a>.</p>
    <div>
      <h4>HTTP/2</h4>
      <a href="#http-2">
        
      </a>
    </div>
    <p>HTTP/2 is enabled automatically on any Go 1.6+ server if:</p><ul><li><p>the request is served over TLS/HTTPS</p></li><li><p><code>Server.TLSNextProto</code> is <code>nil</code> (setting it to an empty map is how you disable HTTP/2)</p></li><li><p><code>Server.TLSConfig</code> is set and <code>ListenAndServeTLS</code> is used <b>or</b></p></li><li><p><code>Serve</code> is used and <code>tls.Config.NextProtos</code> includes <code>"h2"</code> (like <code>[]string{"h2", "http/1.1"}</code>, since <code>Serve</code> is called <a href="https://github.com/golang/go/issues/15908">too late to auto-modify the TLS Config</a>)</p></li></ul><p>Timeouts have a slightly different meaning in HTTP/2, since the same connection can be serving different requests at the same time; however, they are abstracted to the same set of Server timeouts in Go.</p><p>Sadly, <code>ReadTimeout</code> breaks HTTP/2 connections in Go 1.7. Instead of being reset for each request it's set once at the beginning of the connection and never reset, breaking all HTTP/2 connections after the <code>ReadTimeout</code> duration. <a href="https://github.com/golang/go/issues/16450">It's fixed in 1.8</a>.</p><p>Between this and the inclusion of idle time in <code>ReadTimeout</code>, my recommendation is to <b>upgrade to 1.8 as soon as possible</b>.</p>
    <div>
      <h4>TCP Keep-Alives</h4>
      <a href="#tcp-keep-alives">
        
      </a>
    </div>
    <p>If you use <code>ListenAndServe</code> (as opposed to passing a <code>net.Listener</code> to <code>Serve</code>, which offers zero protection by default) a TCP Keep-Alive period of three minutes <a href="https://github.com/golang/go/blob/61db2e4efa2a8f558fd3557958d1c86dbbe7d3cc/src/net/http/server.go#L3023-L3039">will be set automatically</a>. That <i>will</i> help with clients that disappear completely off the face of the earth leaving a connection open forever, but I’ve learned not to trust that, and to set timeouts anyway.</p><p>To begin with, three minutes might be too high, which you can solve by implementing your own <a href="https://github.com/golang/go/blob/61db2e4efa2a8f558fd3557958d1c86dbbe7d3cc/src/net/http/server.go#L3023-L3039"><code>tcpKeepAliveListener</code></a>.</p><p>More importantly, a Keep-Alive only makes sure that the client is still responding, but does not place an upper limit on how long the connection can be held. A single malicious client can just open as many connections as your server has file descriptors, hold them half-way through the headers, respond to the rare keep-alives, and effectively take down your service.</p><p>Finally, in my experience connections tend to leak anyway until <a href="https://github.com/FiloSottile/Heartbleed/commit/4a3332ca1dc07aedf24b8540857792f72624cdf7">timeouts are in place</a>.</p>
    <div>
      <h3>ServeMux</h3>
      <a href="#servemux">
        
      </a>
    </div>
    <p>Package level functions like <code>http.Handle[Func]</code> (and maybe your web framework) register handlers on the global <code>http.DefaultServeMux</code> which is used if <code>Server.Handler</code> is nil. You should avoid that.</p><p>Any package you import, directly or through other dependencies, has access to <code>http.DefaultServeMux</code> and might register routes you don't expect.</p><p>For example, if any package somewhere in the tree imports <code>net/http/pprof</code> clients will be able to get CPU profiles for your application. You can still use <code>net/http/pprof</code> by registering <a href="https://github.com/golang/go/blob/1106512db54fc2736c7a9a67dd553fc9e1fca742/src/net/http/pprof/pprof.go#L67-L71">its handlers</a> manually.</p><p>Instead, instantiate an <code>http.ServeMux</code> yourself, register handlers on it, and set it as <code>Server.Handler</code>. Or set whatever your web framework exposes as <code>Server.Handler</code>.</p>
    <div>
      <h3>Logging</h3>
      <a href="#logging">
        
      </a>
    </div>
    <p><code>net/http</code> does a number of things before yielding control to your handlers: <a href="https://github.com/golang/go/blob/1106512db54fc2736c7a9a67dd553fc9e1fca742/src/net/http/server.go#L2631-L2653"><code>Accept</code>s the connections</a>, <a href="https://github.com/golang/go/blob/1106512db54fc2736c7a9a67dd553fc9e1fca742/src/net/http/server.go#L1718-L1728">runs the TLS Handshake</a>, ...</p><p>If any of these go wrong a line is written directly to <code>Server.ErrorLog</code>. Some of these, like timeouts and connection resets, are expected on the open Internet. It's not clean, but you can intercept most of those and turn them into metrics by matching them with regexes from the Logger Writer, thanks to this guarantee:</p><blockquote><p>Each logging operation makes a single call to the Writer's Write method.</p></blockquote><p>To abort from inside a Handler without logging a stack trace you can either <code>panic(nil)</code> or in Go 1.8 <code>panic(http.ErrAbortHandler)</code>.</p>
    <div>
      <h3>Metrics</h3>
      <a href="#metrics">
        
      </a>
    </div>
    <p>A metric you'll want to monitor is the number of open file descriptors. <a href="https://github.com/prometheus/client_golang/blob/575f371f7862609249a1be4c9145f429fe065e32/prometheus/process_collector.go">Prometheus does that by using the <code>proc</code> filesystem</a>.</p><p>If you need to investigate a leak, you can use the <code>Server.ConnState</code> hook to get more detailed metrics of what stage the connections are in. However, note that there is no way to keep a correct count of <code>StateActive</code> connections without keeping state, so you'll need to maintain a <code>map[net.Conn]ConnState</code>.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The days of needing NGINX in front of all Go services are gone, but you still need to take a few precautions on the open Internet, and probably want to upgrade to the shiny, new Go 1.8.</p><p>Happy serving!</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">5x4k8xtfrkPyeHS7HcYbF7</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
        <item>
            <title><![CDATA[HPACK: the silent killer (feature) of HTTP/2]]></title>
            <link>https://blog.cloudflare.com/hpack-the-silent-killer-feature-of-http-2/</link>
            <pubDate>Mon, 28 Nov 2016 14:10:41 GMT</pubDate>
            <description><![CDATA[ If you have experienced HTTP/2 for yourself, you are probably aware of the visible performance gains possible with HTTP/2 due to features like stream multiplexing, explicit stream dependencies, and Server Push.

 ]]></description>
            <content:encoded><![CDATA[ <p>If you have <a href="https://www.cloudflare.com/website-optimization/http2/">experienced HTTP/2</a> for yourself, you are probably aware of the visible performance gains possible with <a href="https://www.cloudflare.com/website-optimization/http2/">HTTP/2</a> due to features like stream multiplexing, explicit stream dependencies, and <a href="/announcing-support-for-http-2-server-push-2/">Server Push</a>.</p><p>There is, however, one important feature that is not obvious to the eye: HPACK header compression. The current implementation of nginx, as well as the edge networks and <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDNs</a> using it, does not support the full HPACK implementation. We have, however, implemented full HPACK in nginx, and <a href="http://mailman.nginx.org/pipermail/nginx-devel/2015-December/007682.html">upstreamed</a> the part that performs Huffman encoding.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4wGvQogXUWstwB5ETJPpyr/5423d3e1f23daed6c17cc8d79a25ac8d/6129093355_9434beed25_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/conchur/6129093355/in/photolist-akBcVn-6eEtUq-d9Xi2T-c8MwzA-mGVr3z-aVezji-mGVxov-acMJmz-6nvbCJ-6a6DGQ-5uwrr9-rRtCCp-acMKsk-cKeuXY-9F6N13-9GSuHh-acMDBk-cJYes5-cijtcj-akFua9-cK1iqS-fiAHkE-acMCDt-cJZ9Fq-5qtfNv-cJWFiw-e4pfTd-6fzyLV-8RNMTo-p7jJLQ-524dkf-dw52qG-mVDttF-duUbYU-dvXt3Z-c8NkYG-6e6K2S-duKJjZ-5DbaqV-duBiR3-dvGTsM-dvLst4-dvPfJ1-ezYL1J-7f1cMC-bqi5WN-5MMGkV-9F3C7c-eWASXx-cJXsW9">image</a> by <a href="https://www.flickr.com/photos/conchur/">Conor Lawless</a></p><p>This blog post gives an overview of the reasons for the development of HPACK, and the hidden bandwidth and latency benefits it brings.</p>
    <div>
      <h3>Some Background</h3>
      <a href="#some-background">
        
      </a>
    </div>
    <p>As you probably know, a regular HTTPS connection is in fact an overlay of several connections in the multi-layer model. The most basic connection you usually care about is the TCP connection (the transport layer); on top of that sits the TLS connection (a mix of the transport and application layers); and finally the HTTP connection (the application layer).</p><p>In the days of yore, HTTP compression was performed in the TLS layer, using gzip. Both headers and body were compressed indiscriminately, because the lower TLS layer was unaware of the transferred data type. In practice it meant both were compressed with the <a href="/results-experimenting-brotli/">DEFLATE</a> algorithm.</p><p>Then came SPDY with a new, dedicated header compression algorithm. Although specifically designed for headers, including the use of a preset <a href="/improving-compression-with-preset-deflate-dictionary/">dictionary</a>, it still used DEFLATE, including dynamic Huffman codes and string matching.</p><p>Unfortunately, both were found to be vulnerable to the <a href="https://www.ekoparty.org/archive/2012/CRIME_ekoparty2012.pdf">CRIME</a> attack, which can extract secret authentication cookies from compressed headers: because DEFLATE uses backward string matches and dynamic Huffman codes, an attacker that can control part of the request headers can gradually recover a full cookie by modifying parts of the request and observing how the total size of the compressed request changes.</p><p>Most edge networks, including Cloudflare, disabled header compression because of CRIME. That is, until HTTP/2 came along.</p>
    <div>
      <h3>HPACK</h3>
      <a href="#hpack">
        
      </a>
    </div>
    <p>HTTP/2 supports a new dedicated header compression algorithm, called HPACK. HPACK was developed with attacks like CRIME in mind, and is therefore considered safe to use.</p><p>HPACK is resilient to CRIME, because it does not use partial backward string matches and dynamic Huffman codes like DEFLATE. Instead, it uses these three methods of compression:</p><ul><li><p>Static Dictionary: A <a href="https://http2.github.io/http2-spec/compression.html#static.table.definition">predefined dictionary</a> of 61 commonly used header fields, some with predefined values.</p></li><li><p>Dynamic Dictionary: A list of actual headers that were encountered during the connection. This dictionary has limited size, and when new entries are added, old entries might be evicted.</p></li><li><p>Huffman Encoding: A <a href="https://http2.github.io/http2-spec/compression.html#huffman.code">static Huffman code</a> can be used to encode any string: name or value. This code was computed specifically for HTTP Response/Request headers - ASCII digits and lowercase letters are given shorter encodings. The shortest encoding possible is 5 bits long, therefore the highest compression ratio achievable is 8:5 (or 37.5% smaller).</p></li></ul>
    <div>
      <h4>HPACK flow</h4>
      <a href="#hpack-flow">
        
      </a>
    </div>
    <p>When HPACK needs to encode a header in the format <b><i>name:value</i></b>, it will first look in the static and dynamic dictionaries. If the full <b><i>name:value</i></b> pair is present, it will simply reference the entry in the dictionary. This will usually take one byte; in the worst case, two bytes will suffice! A whole header encoded in a single byte! How crazy is that?</p><p>Since many headers are repetitive, this strategy has a very high success rate. For example, headers like <b>:authority:</b> <a href="http://www.cloudflare.com"><b>www.cloudflare.com</b></a> or the sometimes huge <b>cookie</b> headers are the usual suspects in this case.</p><p>When HPACK can't match a whole header in a dictionary, it will attempt to find a header with the same <b><i>name</i></b>. Most of the popular header names are present in the static table, for example: <b>content-encoding</b>, <b>cookie</b>, <b>etag</b>. The rest are likely to be repetitive and therefore present in the dynamic table. For example, Cloudflare assigns a unique <b>cf-ray</b> header to each response, and while the value of this field is always different, the name can be reused!</p><p>If the name was found, it can again be expressed in one or two bytes in most cases; otherwise the name will be encoded using either raw encoding or the Huffman encoding, whichever is shorter. The same goes for the value of the header.</p><p>We found that the Huffman encoding alone saves almost 30% of header size.</p><p>Although HPACK does string matching, for the attacker to find the value of a header, they must guess the entire value, instead of the gradual, character-by-character approach that DEFLATE matching allowed and that CRIME exploited.</p>
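<p>The lookup order described above can be sketched in a few lines of Python. This is a toy model only: the static table below is a hypothetical three-entry stand-in for the spec's 61 entries, the indexes are made up, and the returned strings describe the coding decision rather than the real HPACK wire format.</p>

```python
# Toy model of HPACK's lookup order (illustration only; table contents and
# entry indexes are hypothetical, and no real wire format is produced).

STATIC_TABLE = {(":method", "GET"): 2, (":path", "/"): 4, (":scheme", "https"): 7}
STATIC_NAMES = {":authority", "cookie", "user-agent"}

class ToyHpackEncoder:
    def __init__(self):
        self.dyn_pairs = set()   # full name:value entries seen on this connection
        self.dyn_names = set()   # header names seen on this connection

    def encode(self, name, value):
        """Describe how a header would be coded: whole-entry index,
        name-only index, or literal strings (raw or Huffman)."""
        if (name, value) in STATIC_TABLE or (name, value) in self.dyn_pairs:
            return "indexed"                      # ~1-2 bytes for the whole header
        name_known = name in STATIC_NAMES or name in self.dyn_names
        self.dyn_pairs.add((name, value))         # remember for later requests
        self.dyn_names.add(name)
        return "indexed-name" if name_known else "literal"

enc = ToyHpackEncoder()
first = [enc.encode(n, v) for n, v in
         [(":method", "GET"), ("cookie", "a=1"), ("cf-ray", "2ded-DFW")]]
second = [enc.encode(n, v) for n, v in
          [(":method", "GET"), ("cookie", "a=1"), ("cf-ray", "2dee-DFW")]]
```

<p>On the second request the repeated <b>cookie</b> header is fully indexed from the dynamic table, and the <b>cf-ray</b> name is reused even though its value changed, which is exactly the behaviour described above.</p>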
    <div>
      <h4>Request Headers</h4>
      <a href="#request-headers">
        
      </a>
    </div>
    <p>The gains HPACK provides for HTTP request headers are more significant than for response headers. Request headers compress better due to the much higher duplication between them. For example, here are two requests for our own blog, using Chrome:</p><h6>Request #1:</h6><p><b>:authority:</b> blog.cloudflare.com<br><b>:method:</b> GET<br><b>:path:</b> /<br><b>:scheme:</b> https<br><b>accept:</b> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8<br><b>accept-encoding:</b> gzip, deflate, sdch, br<br><b>accept-language:</b> en-US,en;q=0.8<br><b>cookie:</b> 297 byte cookie<br><b>upgrade-insecure-requests:</b> 1<br><b>user-agent:</b> Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2853.0 Safari/537.36</p><p>I marked in red the headers that can be compressed with the use of the static dictionary. Three fields: <b>:method:</b> GET, <b>:path:</b> / and <b>:scheme:</b> https are always present in the static dictionary, and will each be encoded in a single byte. Then some fields will only have their names compressed in a byte: <b>:authority</b>, <b>accept</b>, <b>accept-encoding</b>, <b>accept-language</b>, <b>cookie</b> and <b>user-agent</b> are present in the static dictionary.</p><p>Everything else, marked in green, will be Huffman encoded.</p><p>Headers that were not matched will be inserted into the dynamic dictionary for the following requests to use.</p><p>Let's take a look at a later request:</p><h6>Request #2:</h6><p><b>:authority:</b> blog.cloudflare.com<br><b>:method:</b> GET<br><b>:path:</b> /assets/images/cloudflare-sprite-small.png<br><b>:scheme:</b> https<br><b>accept:</b> image/webp,image/*,*/*;q=0.8<br><b>accept-encoding:</b> gzip, deflate, sdch, br<br><b>accept-language:</b> en-US,en;q=0.8<br><b>cookie:</b> same 297 byte cookie<br><b>referer:</b> <a href="/assets/css/screen.css?v=2237be22c2">http://blog.cloudflare.com/assets/css/screen.css?v=2237be22c2</a><br><b>user-agent:</b> Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2853.0 Safari/537.36</p><p>Here the blue fields indicate entries that were matched from the dynamic dictionary. It is clear that most fields repeat between requests. In this case two fields are again present in the static dictionary, and five more are repeated and therefore present in the dynamic dictionary, which means they can be encoded in one or two bytes each. One of those is the ~300 byte <b>cookie</b> header, another the ~130 byte <b>user-agent</b>. That is 430 bytes encoded into a mere 4 bytes, 99% compression!</p><p>All in all, for the repeat request, only three short strings will be Huffman encoded.</p><p>This is how ingress header traffic appears on the Cloudflare edge network during a six-hour period:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5hRafOHVmAAbUcHkXmB41S/1f601ca581a9f94fb0d8f08f2966bbc7/headers-ingress-1.png" />
            
            </figure><p>On average we are seeing a 76% compression for ingress headers. As the headers represent the majority of ingress traffic, it also provides substantial savings in the total ingress traffic:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8J76zioOehggfZXNukjGQ/3e99553a7156092babe86efa68713333/headers-ingress-2-1.png" />
            
            </figure><p>We can see that the total ingress traffic is reduced by 53% as the result of HPACK compression!</p><p>In fact today, we process about the same number of requests for HTTP/1 and HTTP/2 over HTTPS, yet the ingress traffic for HTTP/2 is only half that of HTTP/1.</p>
    <div>
      <h3>Response Headers</h3>
      <a href="#response-headers">
        
      </a>
    </div>
    <p>For the response headers (egress traffic) the gains are more modest, but still substantial:</p><h6>Response #1:</h6><p><b>cache-control:</b> public, max-age=30<br><b>cf-cache-status:</b> HIT<br><b>cf-h2-pushed:</b> &lt;/assets/css/screen.css?v=2237be22c2&gt;,&lt;/assets/js/jquery.fitvids.js?v=2237be22c2&gt;<br><b>cf-ray:</b> 2ded53145e0c1ffa-DFW<br><b>content-encoding:</b> gzip<br><b>content-type:</b> text/html; charset=utf-8<br><b>date:</b> Wed, 07 Sep 2016 21:41:23 GMT<br><b>expires:</b> Wed, 07 Sep 2016 21:41:53 GMT<br><b>link:</b> &lt;//cdn.bizible.com/scripts/bizible.js&gt;; rel=preload; as=script,&lt;<a href="https://code.jquery.com/jquery-1.11.3.min.js">https://code.jquery.com/jquery-1.11.3.min.js</a>&gt;; rel=preload; as=script<br><b>server:</b> cloudflare-nginx<br><b>status:</b> 200<br><b>vary:</b> Accept-Encoding<br><b>x-ghost-cache-status:</b> From Cache<br><b>x-powered-by:</b> Express</p><p>The majority of the first response will be Huffman encoded, with some of the field names being matched from the static dictionary.</p><h6>Response #2:</h6><p><b>cache-control:</b> public, max-age=31536000<br><b>cf-bgj:</b> imgq:100<br><b>cf-cache-status:</b> HIT<br><b>cf-ray:</b> 2ded53163e241ffa-DFW<br><b>content-type:</b> image/png<br><b>date:</b> Wed, 07 Sep 2016 21:41:23 GMT<br><b>expires:</b> Thu, 07 Sep 2017 21:41:23 GMT<br><b>server:</b> cloudflare-nginx<br><b>status:</b> 200<br><b>vary:</b> Accept-Encoding<br><b>x-ghost-cache-status:</b> From Cache<br><b>x-powered-by:</b> Express</p><p>Again, the blue color indicates matches from the dynamic table, red indicates matches from the static table, and green represents Huffman encoded strings.</p><p>On the second response it is possible to fully match seven of the twelve headers. For four of the remaining five, the name can be fully matched, and six strings will be efficiently encoded using the static Huffman encoding.</p><p>Although the two <b>expires</b> headers are almost identical, they can only be Huffman compressed, because they can't be matched in full.</p><p>The more requests are processed, the bigger the dynamic table becomes, the more headers can be matched, and the higher the compression ratio climbs.</p><p>This is how egress header traffic appears on the Cloudflare edge:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7GxMxHoPdJGWpyZmqzw6lL/75874c72c47944a121e98517fba5b78b/headers-egress-1.png" />
            
            </figure><p>On average egress headers are compressed by 69%. The savings for the total egress traffic are not that significant however:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MibXK42BowSlnSdhPGZZx/9e82b02d6c1eac9474cbb4d4e55ac919/headers-egress-2-1.png" />
            
            </figure><p>It is difficult to see, but we get 1.4% savings in the total egress HTTP/2 traffic. While it does not look like much, it is still more than <a href="/results-experimenting-brotli/">increasing the compression level for data would give</a> in many cases. This number is also significantly skewed by websites that serve very large files: we measured savings of well over 15% for some websites.</p>
    <div>
      <h3>Test your HPACK</h3>
      <a href="#test-your-hpack">
        
      </a>
    </div>
    <p>If you have nghttp2 installed, you can test the efficiency of HPACK compression on your website with a bundled tool called h2load.</p><p>For example:</p>
            <pre><code>h2load https://blog.cloudflare.com | tail -6 | head -1
traffic: 18.27KB (18708) total, 538B (538) headers (space savings 27.98%), 17.65KB (18076) data</code></pre>
            <p>We see 27.98% space savings in the headers. That is for a single request, and the gains are mostly due to the Huffman encoding. To test if the website utilizes the full power of HPACK, we need to issue two requests, for example:</p>
            <pre><code>h2load https://blog.cloudflare.com -n 2 | tail -6 | head -1
traffic: 36.01KB (36873) total, 582B (582) headers (space savings 61.15%), 35.30KB (36152) data</code></pre>
            <p>If, for two similar requests, the savings are 50% or more, then it is very likely that full HPACK compression is being utilized.</p><p>Note that the compression ratio improves with additional requests:</p>
            <pre><code>h2load https://blog.cloudflare.com -n 4 | tail -6 | head -1
traffic: 71.46KB (73170) total, 637B (637) headers (space savings 78.68%), 70.61KB (72304) data</code></pre>
            
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>By implementing HPACK compression for HTTP response headers we've seen a significant drop in egress bandwidth. HPACK has been enabled for all Cloudflare customers using HTTP/2, all of whom benefit from faster, smaller HTTP responses.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">5PbF7N0blz31qVAU014p8U</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we brought HTTPS Everywhere to the cloud (part 1)]]></title>
            <link>https://blog.cloudflare.com/how-we-brought-https-everywhere-to-the-cloud-part-1/</link>
            <pubDate>Sat, 24 Sep 2016 15:46:26 GMT</pubDate>
            <description><![CDATA[ CloudFlare's mission is to make HTTPS accessible for all our customers. It provides security for their websites, improved ranking on search engines, better performance with HTTP/2, and access to browser features such as geolocation that are being deprecated for plaintext HTTP. ]]></description>
            <content:encoded><![CDATA[ <p>CloudFlare's mission is to make HTTPS accessible for all our customers. It provides security for their websites, <a href="https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html">improved ranking on search engines</a>, <a href="/introducing-http2/">better performance with HTTP/2</a>, and access to browser features such as geolocation that are being deprecated for plaintext HTTP. With <a href="https://www.cloudflare.com/ssl/">Universal SSL</a> or similar features, a simple button click can now enable encryption for a website.</p><p>Unfortunately, as described in a <a href="/fixing-the-mixed-content-problem-with-automatic-https-rewrites/">previous blog post</a>, this is only half of the problem. To make sure that a page is secure and can't be controlled or eavesdropped on by third parties, browsers must ensure that not only the page itself but also all its dependencies are loaded via secure channels. Page elements that don't fulfill this requirement are called mixed content, and can result in the entire page either being reported as insecure or being blocked completely, thus breaking the page for the end user.</p>
    <div>
      <h2>What can we do about it?</h2>
      <a href="#what-can-we-do-about-it">
        
      </a>
    </div>
    <p>When we conceived the Automatic HTTPS Rewrites project, we aimed to automatically reduce the amount of mixed content on customers' web pages without breaking their websites, and without any delay noticeable to end users receiving a page that is rewritten on the fly.</p><p>A naive way to do this would be to just rewrite <code>http://</code> links to <code>https://</code>, or let browsers do that with the <a href="https://www.w3.org/TR/upgrade-insecure-requests/"><code>Upgrade-Insecure-Requests</code></a> directive.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YhZ5thm7SrTbeJ65wiYoT/7428a3a42d5a26fe18a57c2558012046/tumblr_inline_nyupi1faxM1qkjeen_500-1.gif" />
            
            </figure><p>Unfortunately, such an approach is very fragile and unsafe unless you're sure that:</p><ol><li><p>Every single HTTP sub-resource is also available via HTTPS.</p></li><li><p>It's available at the exact same domain and path after the protocol upgrade (more often than you might think, that's <i>not</i> the case).</p></li></ol><p>If either of these conditions is unmet, you end up rewriting resources to non-existent URLs and breaking important page dependencies.</p><p>Thus we decided to take a look at the existing solutions.</p>
    <div>
      <h2>How are these problems solved already?</h2>
      <a href="#how-are-these-problems-solved-already">
        
      </a>
    </div>
    <p>Many security-aware people use the <a href="https://www.eff.org/https-everywhere">HTTPS Everywhere</a> browser extension to avoid those kinds of issues. HTTPS Everywhere ships with a well-maintained database from the <a href="https://www.eff.org/">Electronic Frontier Foundation</a> that holds all sorts of mappings for popular websites, rewriting HTTP versions of resources to HTTPS only when it can be done without breaking the page.</p><p>However, most users are either not aware of it or are simply unable to use it, for example, on mobile browsers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4nhcIelWSXtuEWz07ilh04/054f2fc8ad7d51fea105f1778e6ccbe7/4542048705_25a394a2f3_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/generated/4542048705/in/photolist-7VnbQz-9sd2LW-4EEoZv-d2T6A7-5hKfKu-8UcLHh-pBjRDg-5gCYKG-8vS5Gw-8vP6yc-bj9pgX-qaSiZi-951EJW-75Xuvx-5pft8J-eyebR1-8dyjPV-r9csMz-991WwM-a3aW4T-3JAiSH-6fGqt7-cs2ud1-nEDWYQ-bLR6yz-4JKM5j-6KMths-4eWtLa-5iij6Z-bQSzaP-dKY18j-8SU3Vr-8nGQmE-bwPWoF-323VBR-FuKadJ-p8VD7-x9knmA-hJG7Bc-3KHP4m-8YmLDZ-6CmJme-ngT44v-7ThvBy-4m9A3n-7AGkE-ogJ97T-yCChfV-ok7E25-8Nkr9w">image</a> by <a href="https://www.flickr.com/photos/generated/">Jared Tarbell</a></p><p>So we decided to flip the model around. Instead of re-writing URLs in the browser, we would re-write them inside the CloudFlare reverse proxy. By taking advantage of the existing database on the server side, website owners could turn it on and all their users would instantly benefit from HTTPS rewriting. The fact that it’s automatic is especially useful for websites with user-generated content, where it's not trivial to find and fix all the cases of inserted insecure third-party content.</p><p>At our scale, we obviously couldn't use the existing JavaScript rewriter. The performance challenges for a browser extension, which can find, match and cache rules lazily as a user opens websites, are very different from those of a CDN server that handles millions of requests per second. We usually don't get a chance to rewrite pages before they hit the cache either, as many pages are dynamically generated on the origin server and go straight through us to the client.</p><p>That means, to take advantage of the database, we needed to learn how the existing implementation works and create our own in the form of a native library that could work without delays under our load. Let's do the same here.</p>
    <div>
      <h2>How does HTTPS Everywhere know what to rewrite?</h2>
      <a href="#how-does-https-everywhere-know-what-to-rewrite">
        
      </a>
    </div>
    <p>HTTPS Everywhere rulesets can be found in the <a href="https://github.com/EFForg/https-everywhere/tree/master/src/chrome/content/rules"><code>src/chrome/content/rules</code></a> folder of the <a href="https://github.com/EFForg/https-everywhere">official repository</a>. They are organized as XML files, each for its own set of hosts (with a few exceptions). This allows users with basic technical skills to write and contribute missing rules to the database on their own.</p><p>Each ruleset is an XML file of the following structure:</p>
            <pre><code>&lt;ruleset name="example.org"&gt;
  &lt;!-- Target domains --&gt;
  &lt;target host="*.example.org" /&gt;
 
  &lt;!-- Exclusions --&gt;
  &lt;exclusion pattern="^http://example\.org/i-am-http-only" /&gt;
 
  &lt;!-- Rewrite rules --&gt;
  &lt;rule from="^http://(www\.)?example\.org/" to="https://$1example.org/" /&gt;
&lt;/ruleset&gt;</code></pre>
            <p>At the time of writing, the HTTPS Everywhere database consists of ~22K such rulesets covering ~113K domain wildcards with ~32K rewrite rules and exclusions.</p><p>For performance reasons, we can't keep all those ruleset XMLs in memory, go through their nodes, check each wildcard, perform replacements based on a specific string format, and so on. All that work would introduce significant delays in page processing and increase memory consumption on our servers. That's why we had to perform some compile-time tricks for each type of node to ensure that rewriting is smooth and fast for any user from the very first request.</p><p>Let's walk through those nodes and see what can be done in each specific case.</p>
    <div>
      <h3>Target domains</h3>
      <a href="#target-domains">
        
      </a>
    </div>
    <p>First of all, we get the target elements, which describe the domain wildcards that the current ruleset potentially covers.</p>
            <pre><code>&lt;target host="*.example.org" /&gt;</code></pre>
            <p>If a wildcard is used, it can be <a href="https://www.eff.org/https-everywhere/rulesets#wildcard-targets">either left-side or right-side</a>.</p><p>A left-side wildcard like <code>*.example.org</code> covers any hostname which has example.org as a suffix, no matter how many subdomain levels it has.</p><p>A right-side wildcard like <code>example.*</code> covers only one level instead, so that hostnames with the same beginning but one unexpected domain level are not accidentally caught. For example, the Google ruleset, among others, uses the <code>google.*</code> wildcard, and it should match <code>google.com</code>, <code>google.ru</code>, <code>google.es</code> etc., but not <code>google.mywebsite.com</code>.</p><p>Note that a single host can be covered by several different rulesets, as wildcards can overlap, so the rewriter should be given the entire database in order to find a correct replacement. Still, matching the hostname instantly reduces all ~22,000 rulesets to only 3-5, which we can deal with more easily.</p><p>Matching wildcards at runtime one-by-one is, of course, possible, but very inefficient with ~113K domain wildcards (and, as we noted above, one domain can match several rulesets, so we can't even bail out early). We need to find a better way.</p>
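<p>The two wildcard flavours can be captured in a small Python helper. This is our own illustration of the matching semantics, not code from HTTPS Everywhere:</p>

```python
# A minimal sketch of the two wildcard flavours (our own helper, not
# HTTPS Everywhere code). Left-side wildcards cover any number of extra
# subdomain levels; right-side wildcards cover exactly one label.

def matches_target(host, target):
    if target.startswith("*."):
        # *.example.org: any hostname ending in ".example.org", any depth.
        return host.endswith(target[1:])
    if target.endswith(".*"):
        # example.*: exactly one extra label on the right, so that
        # google.mywebsite.com is not caught by google.*.
        prefix = target[:-1]            # e.g. "example."
        rest = host[len(prefix):]
        return host.startswith(prefix) and rest != "" and "." not in rest
    return host == target               # plain target: exact match
```

<p>With this helper, <code>google.es</code> matches <code>google.*</code>, but <code>google.mywebsite.com</code> does not, because the remainder after <code>google.</code> contains another dot.</p>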
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/018PMe56SuXGqrCUnghAoJ/a864c686c92348249d570b30b35419f9/3901819627_c3908690a0_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/vige/3901819627/in/photolist-6WMR3c-qqw1zU-7Gsyt1-4mQ6Sr-7GxrjW-bZTMQs-6HAcEr-58J6-pJb9qT-55o5bP-4c2bxs-4MEcWm-6yf4xg-dkdJkY-crpQwG-br8o3Y-4tXRcD-a3DzL7-nAYdFT-729Vjb-d5qcf-a59ugi-AKFWW-d2g9e5-3LQJEe-fqMVts-762EoB-4Lreh9-57pKGy-wnqcdN-99jyGb-6oAMor-8U28ub-9bYp3-92DYLM-6x8aZg-4MEcLQ-7n2QqA-8pydBi-ocFj72-fAyhG7-7B9Qwt-xxknG-d3Tk63-axF8dU-o4ALKi-grY52F-9bXtY-8KRwXd-a2syrf">image</a> by <a href="https://www.flickr.com/photos/vige/">vige</a></p><p>We use <a href="http://www.colm.net/open-source/ragel/">Ragel</a> to build fast lexers in other pieces of our code. Ragel is a state machine compiler which takes grammars and actions described in its own syntax and generates source code in a given programming language as output. We decided to use it here too, and wrote a script that generates a Ragel grammar from our set of wildcards. In turn, Ragel converts it into C code of a state machine capable of going through the characters of URLs, matching hosts and invoking a custom handler on each found ruleset.</p><p>This leads us to another interesting problem. At the time of writing, among 113K domain wildcards, 4.7K have a left wildcard and fewer than 200 have a right wildcard. 
Left wildcards are expensive in state machines (including regular expressions), as they cause <a href="https://en.wikipedia.org/wiki/Combinatorial_explosion">DFA space explosion</a> during compilation, so Ragel got stuck for more than 10 minutes without producing any result, trying to analyze all the <code>*.</code> prefixes and merge all the possible states they can lead to, resulting in a complex tree.</p><p>Instead, if we choose to look from the end of the host, we can significantly simplify the state tree (as only 200 wildcards need to be checked separately now instead of 4.7K), thus reducing compile time to less than 20 seconds.</p><p>Let's take an oversimplified example to understand the difference. Say we have the following target wildcards (three left wildcards against one right wildcard and one plain host):</p>
            <pre><code>&lt;target host="*.google.com" /&gt;
&lt;target host="*.google.co.uk" /&gt;
&lt;target host="*.google.es" /&gt;
&lt;target host="google.*" /&gt;
&lt;target host="google.com" /&gt;</code></pre>
            <p>If we build a Ragel state machine directly from those:</p>
            <pre><code>%%{
    machine hosts;
 
    host_part = (alnum | [_\-])+;
 
    main := (
        any+ '.google.com' |
        any+ '.google.co.uk' |
        any+ '.google.es' |
        'google.' host_part |
        'google.com.ua'
    );
}%%</code></pre>
            <p>We will get the following state graph:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sykJ6ByovKV68tzpCF5zn/7fe8636ec9e22c152e6e8e83e007509a/1.png" />
            </figure><p>You can see that the graph is already pretty complex, as each starting character, even <code>g</code>, which is an explicit starting character of the <code>'google.'</code> and <code>'google.com.ua'</code> strings, still needs to simultaneously feed into the <code>any+</code> matches. Even when you have already parsed the <code>google.</code> part of the host name, the input can still correctly match any of the given wildcards, whether as <code>google.google.com</code>, <code>google.google.co.uk</code>, <code>google.google.es</code>, <code>google.tech</code> or <code>google.com.ua</code>. This already blows up the complexity of the state machine, and we only took an oversimplified example with three left wildcards here.</p><p>However, if we simply reverse each rule in order to feed the string starting from the end:</p>
            <pre><code>%%{
    machine hosts;
 
    host_part = (alnum | [_\-])+;
 
    main := (
        'moc.elgoog.' |
        'ku.oc.elgoog.' |
        'se.elgoog.' |
        host_part '.elgoog' |
        'au.moc.elgoog'
    );
}%%</code></pre>
            <p>we get a much simpler graph and, consequently, significantly reduced graph build and matching times:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6NfYuIYypMQPvlSOkk40so/e44c0007f0c2d621e5d755cb2e3ad099/2.png" />
            </figure><p>So now, all we need to do is go through the host part of the URL, stop on the <code>/</code> right after it, and run the machine backwards from that point. There is no need to waste time on in-memory string reversal, as Ragel provides the <code>getkey</code> instruction for custom data-access expressions, which we can use to access characters in reverse order once we have matched the ending slash.</p><p>Here is an animation of the full process:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5dcX7Zc8OH98Y9SDAIAu5E/a2cb285821563467d2ae59f365402099/third.gif" />
            
            </figure><p>After we've matched the host name and found potentially applicable rulesets, we need to ensure that we're not rewriting URLs which are not available via HTTPS.</p>
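<p>The reversal trick can be approximated in Python. The real matcher is a Ragel-generated C state machine; this sketch (with hypothetical targets and ruleset names) only shows the core idea: on the reversed hostname, every left-side wildcard becomes a plain prefix, which is what keeps the state machine small.</p>

```python
# Rough Python analogue of the reversed-host trick (the real matcher is a
# Ragel-generated C state machine; targets and ruleset names here are
# hypothetical). Reversing the hostname turns each left-side wildcard
# into a plain prefix check.

TARGETS = {
    "*.google.com": "ruleset-A",
    "*.google.co.uk": "ruleset-A",
    "google.com.ua": "ruleset-B",
}

REVERSED = {}
for target, ruleset in TARGETS.items():
    if target.startswith("*."):
        # "*.google.com" -> prefix "moc.elgoog." on the reversed host
        REVERSED[target[1:][::-1]] = (ruleset, True)
    else:
        # plain targets stay exact matches
        REVERSED[target[::-1]] = (ruleset, False)

def find_rulesets(host):
    """Return all rulesets whose targets cover this host."""
    rev = host[::-1]
    return {ruleset
            for key, (ruleset, is_prefix) in REVERSED.items()
            if (rev.startswith(key) if is_prefix else rev == key)}
```

<p>Note that the sketch still loops over the keys; the production version compiles all the reversed targets into a single DFA, so one pass over the characters visits every candidate at once.</p>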
    <div>
      <h3>Exclusions</h3>
      <a href="#exclusions">
        
      </a>
    </div>
    <p>Exclusion elements serve exactly this purpose.</p>
            <pre><code>&lt;exclusion pattern="^http://(www\.)?google\.com/analytics/" /&gt;
&lt;exclusion pattern="^http://(www\.)?google\.com/imgres/" /&gt;</code></pre>
            <p>The rewriter needs to test against all the exclusion patterns before applying any actual rules. Otherwise, paths that have issues or can't be served over HTTPS would be incorrectly rewritten, potentially breaking the website.</p><p>We don't care about matched groups, nor do we even care which particular regular expression was matched, so as an extra optimization, instead of going through them one-by-one, we merge all the exclusion patterns in the ruleset into one regular expression that can be internally optimized by the regexp engine.</p><p>For example, for the exclusions above we can create the following regular expression, whose common parts can be merged internally by the regexp engine:</p>
            <pre><code>(^http://(www\.)?google\.com/analytics/)|(^http://(www\.)?google\.com/imgres/)</code></pre>
            <p>After that, in our action we just need to call <code>pcre_exec</code> without a match data destination – we don't care about matched groups, only about the completion status. If a URL matches the merged regular expression, we bail out of this action, as the following rewrites shouldn't be applied; Ragel will then automatically invoke the next matched action (the next ruleset) on its own until one succeeds.</p><p>Finally, after we have both matched the host name and ensured that our URL is not covered by any exclusion patterns, we can go to the actual rewrite rules.</p>
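<p>The merging and boolean-only check above can be sketched in Python, with the <code>re</code> module standing in for PCRE (the groups are made non-capturing here, since they are never inspected):</p>

```python
import re

# Sketch of merging exclusion patterns into one regex (Python's re stands
# in for PCRE; groups are non-capturing since we never inspect them).

EXCLUSIONS = [
    r"^http://(?:www\.)?google\.com/analytics/",
    r"^http://(?:www\.)?google\.com/imgres/",
]
MERGED = re.compile("|".join("(?:%s)" % p for p in EXCLUSIONS))

def is_excluded(url):
    # Boolean-only check: the moral equivalent of calling pcre_exec
    # without a match data destination.
    return MERGED.match(url) is not None
```
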
    <div>
      <h3>Rewrite rules</h3>
      <a href="#rewrite-rules">
        
      </a>
    </div>
    <p>These rules are presented as JavaScript regular expressions and replacement patterns. The rewriter matches the URL against each of those regular expressions as soon as the host matches and the URL is not covered by an exclusion.</p>
            <pre><code>&lt;rule from="^http://(\w{2})\.wikipedia\.org/wiki/" to="https://secure.wikimedia.org/wikipedia/$1/wiki/" /&gt;</code></pre>
            <p>As soon as a match is found, the replacement is performed and the search can be stopped. Note: while exclusions cover dangerous replacements, it's totally possible and valid for the URL to not match any of the actual rules - in that case it should just be left intact.</p><p>After the previous steps we are usually reduced to only a couple of rules, so unlike in the case of exclusions, we don't apply any clever merging techniques to them. It turned out to be easier to go through them one-by-one rather than create a regexp engine specifically optimized for the case of multi-regexp replacements.</p><p>However, we don't want to waste time on regexp analysis and compilation on our edge servers. That would require extra time during initialization and memory for carrying unnecessary text sources of regular expressions around. PCRE allows regular expressions to be precompiled into its own format using <code>pcre_compile</code>. We then gather all these compiled regular expressions into one binary file and link it using <code>ld --format=binary</code> - a neat option that tells the linker to attach any given binary file as a named data resource available to the application.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oBelLGyvi8KGq7mQmTvFK/e5356dd36c37b29aa1fefccca167e43a/15748968831_9d97f7167f_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/sidelong/15748968831/in/photolist-pZFzKn-nCgsND-5kmFQB-bm5Ny4-3qR9NP-NfYDG-e7AwCH-eqqc2o-e3DgoN-6ZcGVn-pkmTXn-3oT9Nj-8y4HB7-H93FUT-6pSxvu-aukZ2w-2yo3n-2fTgn7-dXH6No-nBzysU-nsnMR1-dHoz6o-zXDcxE-9G5ydk-HJPTCt-qoQnCi-zmKYcs-4vwvyV-ygPe2Q-rUH8dy-dSbR9U-sc8NEN-htr2XH-uDEHXF-ehnr4K-xDLoGG-gMbuTr-bygmuu-r26oQx-bDJmuS-7WHeZ7-o5V5nL-bn3PNf-9Fr7nQ-dbbuB6-4sGsph-77HwTg-gbA7WS-27jJRy-7xGShs">image</a> by <a href="https://www.flickr.com/photos/sidelong/">DaveBleasdale</a></p><p>The second part of the rule is the replacement pattern, which uses the simplest feature of JavaScript regex replacement - numbered groups. It has the form <code>https://www.google.com.$1/</code>, meaning the resulting string is the concatenation of <code>"https://www.google.com."</code>, the matched group at position <code>1</code>, and a <code>"/"</code>.</p><p>Once again, we don't want to waste time repeatedly scanning for dollar signs and converting string indexes to numbers at runtime. Instead, it's more efficient to split this pattern at compile time into the static substrings <code>{ "https://www.google.com.", "/" }</code> plus an array of group indexes to insert in between - in our case just <code>{ 1 }</code>. Then, at runtime, we simply build the string by walking both arrays in lockstep, concatenating static substrings with found matches.</p><p>Finally, after such a string is built, it's inserted in place of the previous attribute value and sent to the client.</p>
    <div>
      <h3>Wait, but what about testing?</h3>
      <a href="#wait-but-what-about-testing">
        
      </a>
    </div>
    <p>Glad you asked.</p><p>The HTTPS Everywhere extension uses an automated checker that validates rewritten URLs on any change to a ruleset. To make that possible, rulesets are required to contain special test elements that cover all the rewrite rules.</p>
            <pre><code>&lt;test url="http://maps.google.com/" /&gt;</code></pre>
            <p>What we need to do on our side is collect those test URLs, combine them with our own auto-generated tests from wildcards, and invoke both the HTTPS Everywhere built-in JavaScript rewriter and our own, side by side, to ensure that we get the same results — URLs that should be left intact are left intact by our implementation, and URLs that are rewritten are rewritten identically.</p>
    <div>
      <h2>Can we fix even more mixed content?</h2>
      <a href="#can-we-fix-even-more-mixed-content">
        
      </a>
    </div>
    <p>After all this was done and tested, we decided to look around for other potential sources of guaranteed rewrites to extend our database.</p><p>One such source is the <a href="https://hstspreload.appspot.com/">HSTS preload list</a>, maintained by Google and used by all the major browsers. It allows website owners who want to ensure that their site is never loaded via <code>http://</code> to submit their hosts (optionally together with subdomains), opting in to automatic rewriting of any <code>http://</code> references to <code>https://</code> by a modern browser before the request even hits the origin.</p><p>This means the origin guarantees that the HTTPS version will always be available and will serve exactly the same content as HTTP - otherwise any resources referenced from it will simply break, since the browser won't fall back to HTTP once the domain is on the list. A perfect match for another ruleset!</p><p>As we already have a working solution and this list involves no complexities around regular expressions, we can download the JSON version of it <a href="https://chromium.googlesource.com/chromium/src/net/+/master/http/transport_security_state_static.json">directly from the Chromium source</a> and, as part of the build process, convert it into the same XML ruleset with wildcards and exclusions that our system already understands and handles.</p><p>This way, both databases are merged and work together, rewriting even more URLs on customer websites without any major changes to the code.</p>
    <div>
      <h2>That was quite a trip</h2>
      <a href="#that-was-quite-a-trip">
        
      </a>
    </div>
    <p>It was... but it's not really the end of the story. You see, in order to provide safe and fast rewrites for everyone, and after analyzing the alternatives, we decided to write a new streaming HTML5 parser that became the core of this feature. We intend to use it for even more tasks in the future to ensure that we can improve the security and performance of our customers' websites in even more ways.</p><p>However, it deserves a separate blog post, so stay tuned.</p><p>And remember - if you're into web performance, security or just excited about the possibility of working on features that do not break millions of pages every second - we're <a href="https://www.cloudflare.com/join-our-team/">hiring</a>!</p><p>P.S. We are incredibly grateful to the folks at the EFF who created the HTTPS Everywhere extension and worked with us on this project.</p> ]]></content:encoded>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Mixed Content Errors]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Crypto Week]]></category>
            <guid isPermaLink="false">3Zps5SYwGYawkTGZfKjlfn</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Opportunistic Encryption: Bringing HTTP/2 to the unencrypted web]]></title>
            <link>https://blog.cloudflare.com/opportunistic-encryption-bringing-http-2-to-the-unencrypted-web/</link>
            <pubDate>Wed, 21 Sep 2016 15:51:28 GMT</pubDate>
            <description><![CDATA[ Encrypting the web is not an easy task. Various complexities prevent websites from migrating from HTTP to HTTPS, including mixed content, which can prevent sites from functioning with HTTPS.  ]]></description>
            <content:encoded><![CDATA[ <p>Encrypting the web is not an easy task. Various complexities prevent websites from migrating from HTTP to HTTPS, including mixed content, which can prevent sites from functioning with HTTPS.</p><p>Opportunistic Encryption provides an additional level of security to websites that have not yet moved to HTTPS, along with the performance benefits of HTTP/2. Users will not see a security indicator for HTTPS in the address bar when visiting a site using Opportunistic Encryption, but the connection from the browser to the server is encrypted.</p><p>In December 2015, CloudFlare introduced <a href="https://www.cloudflare.com/http2/">HTTP/2</a>, the latest version of HTTP, which can result in improved performance for websites. HTTP/2 can’t be used without encryption, and before now, that meant HTTPS. Opportunistic Encryption, based on an <a href="http://httpwg.org/http-extensions/draft-ietf-httpbis-http2-encryption.html">IETF draft</a>, enables servers to accept HTTP requests over an encrypted connection, allowing HTTP/2 connections for non-HTTPS sites. This is a first.</p><p>Combined with <a href="/introducing-tls-1-3/">TLS 1.3</a> and <a href="/announcing-support-for-http-2-server-push-2/">HTTP/2 Server Push</a>, Opportunistic Encryption can result in significant performance gains, while also providing security benefits.</p><p>Opportunistic Encryption is now available to all CloudFlare customers, enabled by default for Free and Pro plans. The option is available in the Crypto tab of the CloudFlare dashboard:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/53Z7Qo7rd4AQlJC0kikjXS/6323937c3bdbbe82b69abf176dc467fb/Screen-Shot-2016-09-21-at-16.19.03.png" />
            
            </figure>
    <div>
      <h3>How it works</h3>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>Opportunistic Encryption uses <a href="http://httpwg.org/http-extensions/alt-svc.html">HTTP Alternative Services</a>, a mechanism that allows servers to tell clients that the service they are accessing is available at another network location or over another protocol. When a supporting browser makes a request to a CloudFlare site with Opportunistic Encryption enabled, CloudFlare adds an Alternative Service header to indicate that the site is available over HTTP/2 (or SPDY) on port 443.</p><p>For customers with HTTP/2 enabled:</p>
            <pre><code>Alt-Svc: h2=":443"; ma=60</code></pre>
            <p>For customers with HTTP/2 disabled:</p>
            <pre><code>Alt-Svc: spdy/3.1=":443"; ma=60</code></pre>
            <p>This header simply states that the domain can be authoritatively accessed using HTTP/2 (“h2”) or SPDY 3.1 (“spdy/3.1”) at the same network address, over port 443 (the standard HTTPS port). The field “ma” (max-age) indicates how long in seconds the client should remember the existence of the alternative service.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2QNLkeEtnUSqIk1sg0yDZn/3eb1c11409ba9f950127ffaba673447e/2829700354_4cb63ac45e_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/jabberwocky381/2829700354/in/photolist-5j3Xk9-9ozwgL-CyXzMW-b9NypP-af3UAc-c5VG2j-7tCoKg-8iRnKw-4F2xWk-4F6MDS-g5xKwj-6t8Gpg-e7gmMh-ahbP86-aheB15-aheAzE-8ucqdq-aheBcf-DETnGc-8ogV2T-9PNooC-9PKwkx-9PNots-2hpkp-F3qYFL-xdQqLr-9PKwe2-9PKwhT-Fe87Tv-Fe88gK-GnUy7Z-HigQiP-H9VVcS-HigRaP-H9VUFw-HigQPZ-H9VVdy-H9VUA1-H9VV9W-HigPYR-H9VULm-HigRat-HigQUt-H9VVRC-HigQXV-HigQxg-H9VV6u-H9VUuQ-frEFs1-HfkgMN">image</a> by <a href="https://www.flickr.com/photos/jabberwocky381/">Evan Jackson</a></p><p>When Firefox (or any other browser that supports Opportunistic Encryption) receives an “h2” Alt-Svc header, it knows that the site is available using HTTP/2 over TLS on port 443. For any subsequent HTTP requests to that domain, the browser will connect using TLS on port 443, where the server will present a certificate for the domain signed by a trusted certificate authority. The browser will then validate the certificate. If the connection succeeds, the browser will send the requests over that connection using HTTP/2.</p><p>Requests sent opportunistically will contain “http” in the <a href="https://http2.github.io/http2-spec/#HttpRequest"><code>:scheme</code> pseudo-header</a> instead of “https”. From a bit-on-the-wire perspective, this pseudo-header is the only difference between HTTP requests with Opportunistic Encryption over TLS and HTTPS. However, there is a big difference between how browsers treat assets fetched using HTTP vs. HTTPS URLs (as discussed below). </p><p>HTTP Alternative Services is a relatively new but widely used mechanism. For example, Alt-Svc is used by Google to advertise support for their experimental transport protocol, <a href="https://www.chromium.org/quic">QUIC</a>, to browsers in much the same way as we use it to advertise support for Opportunistic Encryption.</p>
    <div>
      <h2>Why not just use HTTPS?</h2>
      <a href="#why-not-just-use-https">
        
      </a>
    </div>
    <p>CloudFlare enables HTTPS by default for customers on <a href="https://www.cloudflare.com/plans/">all plans</a> using Universal SSL. However, some sites choose to continue to allow access to their sites via unencrypted HTTP. The main reason for this is mixed content. If a site contains references to HTTP images, or makes requests to HTTP URLs via embedded scripts, browsers will present warnings or even block requests outright, often breaking the functionality of the site.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/53q02AwHvDVUSNdgdmG2eX/2927aaaf2f4df9cebf7a6fcf4549c837/16467022408_789bc66bfb_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/blondinrikard/16467022408/in/photolist-r68N4s-wdqoLL-wL6trG-vTynKb-vTyjDm-wbh17E-wawCrA-vw2iz8-wbofNT-w8RLP5-wboR8p-wscuys-wbgUu9-wPHQUT-wtmaGV-wsdj2U-wbo7Fv-wbgW9m-wtm9RM-wbgfb3-wqyuDm-wboXM6-wbp7jX-wsTgrP-wbALUc-wujVgK-vea2uj-wscmum-veiEag-vea12j-veiFGV-wbgLWS-wbaXWX-veiH66-wqznYf-vea193-vTyvb1-wucvcL-AKwG91-wbaZmk-vTFwsg-vTynJC-wucAZE-wawzEJ-AKCQWi-wawoMm-wbaWhV-wbaVVH-wbb8hv-wujW96">image</a> by <a href="https://www.flickr.com/photos/blondinrikard/">Blondinrikard Fröberg</a></p><p>Making sure a site can be fully migrated to HTTPS can be a manual and time-consuming process. It can require someone to manually inspect every page on a site or set up a <a href="https://content-security-policy.com/">Content Security Policy (CSP)</a> reporting infrastructure, a complex task. Even after all this work, fixing mixed content issues may require changes in middleware or content management software that can’t be easily updated. Later this week, we’ll introduce Automatic HTTPS Rewrites, which helps fix mixed content for many sites, but not all. Some mixed content can’t be fixed because it includes third-party resources (such as ads) that are not available over HTTPS. Websites that can’t fully update to HTTPS will benefit most from Opportunistic Encryption.</p><p>With Opportunistic Encryption, supporting browsers can choose to access an HTTP site using HTTP/2 over an encrypted connection instead of HTTP/1.1 over plaintext (the default).</p>
    <div>
      <h2>Security Benefits</h2>
      <a href="#security-benefits">
        
      </a>
    </div>
    <p>It’s no secret that network operators have access to the data that travels through their equipment. This access can be used to modify data: ISPs have been caught injecting unwanted data (<a href="http://arstechnica.com/tech-policy/2013/04/how-a-banner-ad-for-hs-ok/">such as advertisements</a> and <a href="http://www.theverge.com/2016/3/7/11173010/verizon-supercookie-fine-1-3-million-fcc">tracking cookies</a>) into unencrypted requests. Countries <a href="https://en.wikipedia.org/wiki/Internet_censorship_in_India">routinely filter content</a> by inspecting HTTP headers in unencrypted traffic, and <a href="https://citizenlab.org/2015/04/chinas-great-cannon/">China’s Great Cannon</a> injected malicious code into unencrypted websites. Access to data in transit can also be used to perform dragnet surveillance, where vast swaths of data are collected and then <a href="https://www.theguardian.com/commentisfree/2013/jul/15/crux-nsa-collect-it-all">shipped to a central location for analysis</a>.</p><p>Opportunistic Encryption does not fully protect against attackers who can simply remove the header that signals support for Opportunistic Encryption to the browser. However, once an opportunistically encrypted connection is established, all requests sent over it are encrypted and cannot be read (or modified) by prying eyes.</p>
    <div>
      <h2>Terminology is hard</h2>
      <a href="#terminology-is-hard">
        
      </a>
    </div>
    <p>Tim Berners-Lee initiated the development of HTTP in the late 1980s to facilitate the transfer of documents from servers to clients. Both websites and browsers were rudimentary compared to today’s web ecosystem. The concept of web security was practically non-existent.</p><p>From <a href="https://www.w3.org/History/1989/proposal.html">the original 1989 paper</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2HwLp3ipQLBPQrABrqkM2/35384da5c3bdc5791949e85c1e5cb50c/Screen-Shot-2016-09-21-at-16.47.03.png" />
            
            </figure><p>As the use cases of the web expanded to include sensitive data transactions, some security was needed. Multiple encryption schemes were developed to help secure HTTP, including <a href="http://www.homeport.org/~adam/shttp.html">S-HTTP</a> and the eventual winner, HTTPS.</p><p>Originally, the difference between HTTP and HTTPS was one of layering. In HTTP, messages were written to the network directly, and in HTTPS, a secure connection was established between the client and server using SSL (an encryption and authentication protocol), and standard HTTP messages were written to the encrypted connection. Browsers signaled HTTP websites with an open lock icon, whereas HTTPS websites received a closed lock. Later, SSL evolved into TLS, although people sometimes <a href="https://www.google.com/trends/explore?q=ssl,tls,https">still refer to it as SSL</a>.</p><p>As websites became much more complex, embedded scripts and dynamic content became commonplace. Serving insecure content on a secure web page was identified as a risk and HTTPS started to take on a more nuanced meaning. Rather than just HTTP on an encrypted connection, HTTPS meant secure HTTP. For example, cookies became a popular way to store state in the client for managing web sessions. Cookies obtained over a secure connection were not allowed to be sent over insecure HTTP or modified by data obtained over HTTP. New privacy-sensitive features now require HTTPS (such as the <a href="https://developers.google.com/web/updates/2016/04/geolocation-on-secure-contexts-only">Location API</a>). 
This distinction between HTTP and HTTPS has been further codified by the W3C, the standards body in charge of the web, in their <a href="https://w3c.github.io/webappsec-secure-contexts/">Secure Contexts</a> document.</p><p>To break it down:</p><ul><li><p>HTTP is a protocol for transferring hypertext</p></li><li><p>TLS/SSL is a protocol for encrypted communication</p></li><li><p>HTTPS is a protocol for transferring secure hypertext</p></li></ul><table><tr><td><p></p></td><td><p><b>HTTP</b></p></td><td><p><b>HTTPS</b></p></td></tr><tr><td><p>Unencrypted</p></td><td><p><b>✔️</b></p></td><td><p>❌</p></td></tr><tr><td><p>Encrypted with TLS</p></td><td><p>✔️</p></td><td><p>✔️</p></td></tr></table><p>Opportunistic Encryption is the bottom left ✔️. While only HTTPS sites are treated as secure by the browser (as indicated by a green lock security indicator), encrypted HTTP is preferable to unencrypted HTTP.</p>
    <div>
      <h2>Browser support</h2>
      <a href="#browser-support">
        
      </a>
    </div>
    <p>All versions of Firefox since Firefox 38 (May 2015) have supported Opportunistic Encryption in its original form (without certificate validation). Firefox has recently added support for certificate validation in Firefox Nightly and will support it in an upcoming official release. We believe that Opportunistic Encryption is a meaningful advance in web security. We hope that other browsers follow Firefox’s lead and enable Opportunistic Encryption.</p><p>To be clear, Opportunistic Encryption is not a replacement for HTTPS. HTTPS should always be used when both strong encryption and authentication are required. For sites that don’t have the resources to fully move to HTTPS, Opportunistic Encryption can help, providing both added security and performance. Every bit counts.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[TLS 1.3]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Crypto Week]]></category>
            <guid isPermaLink="false">4H5mFEHNBeu1B8biX0q6UO</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
    </channel>
</rss>