
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 08 Apr 2026 01:37:01 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Code Mode: give agents an entire API in 1,000 tokens]]></title>
            <link>https://blog.cloudflare.com/code-mode-mcp/</link>
            <pubDate>Fri, 20 Feb 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ The Cloudflare API has over 2,500 endpoints. Exposing each one as an MCP tool would consume over 2 million tokens. With Code Mode, we collapsed all of it into two tools and roughly 1,000 tokens of context. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://www.cloudflare.com/learning/ai/what-is-model-context-protocol-mcp/"><u>Model Context Protocol (MCP)</u></a> has become the standard way for AI agents to use external tools. But there is a tension at its core: agents need many tools to do useful work, yet every tool added fills the model's context window, leaving less room for the actual task. </p><p><a href="https://blog.cloudflare.com/code-mode/"><u>Code Mode</u></a> is a technique we first introduced for reducing context window usage during agent tool use. Instead of describing every operation as a separate tool, let the model write code against a typed SDK and execute the code safely in a <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/worker-loader/"><u>Dynamic Worker Loader</u></a>. The code acts as a compact plan. The model can explore tool operations, compose multiple calls, and return just the data it needs. Anthropic independently explored the same pattern in their <a href="https://www.anthropic.com/engineering/code-execution-with-mcp"><u>Code Execution with MCP</u></a> post.</p><p>Today we are introducing <a href="https://github.com/cloudflare/mcp"><u>a new MCP server</u></a> for the <a href="https://developers.cloudflare.com/api/"><u>entire Cloudflare API</u></a> — from <a href="https://developers.cloudflare.com/dns/"><u>DNS</u></a> and <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Zero Trust</u></a> to <a href="https://workers.cloudflare.com/product/workers/"><u>Workers</u></a> and <a href="https://workers.cloudflare.com/product/r2/"><u>R2</u></a> — that uses Code Mode. With just two tools, search() and execute(), the server is able to provide access to the entire Cloudflare API over MCP, while consuming only around 1,000 tokens. The footprint stays fixed, no matter how many API endpoints exist.</p><p>For a large API like the Cloudflare API, Code Mode reduces the number of input tokens used by 99.9%. 
An equivalent MCP server without Code Mode would consume 1.17 million tokens — more than the entire context window of the most advanced foundation models.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KqjQiI09KubtUSe9Dgf0N/6f37896084c7f34abca7dc36ab18d8e0/image2.png" />
          </figure><p><sup><i>Code mode savings vs native MCP, measured with </i></sup><a href="https://github.com/openai/tiktoken"><sup><i><u>tiktoken</u></i></sup></a><sup></sup></p><p>You can start using this new Cloudflare MCP server today. And we are also open-sourcing a new <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><u>Code Mode SDK</u></a> in the <a href="https://github.com/cloudflare/agents"><u>Cloudflare Agents SDK</u></a>, so you can use the same approach in your own MCP servers and AI Agents.</p>
    <div>
      <h3>Server‑side Code Mode</h3>
      <a href="#server-side-code-mode">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ir1KOZHIjVNyqdC9FSuZs/334456a711fb2b5fa612b3fc0b4adc48/images_BLOG-3184_2.png" />
          </figure><p>This new MCP server applies Code Mode server-side. Instead of thousands of tools, the server exports just two: <code>search()</code> and <code>execute()</code>. Both are powered by Code Mode. Here is the full tool surface area that gets loaded into the model context:</p>
            <pre><code>[
  {
    "name": "search",
    "description": "Search the Cloudflare OpenAPI spec. All $refs are pre-resolved inline.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "code": {
          "type": "string",
          "description": "JavaScript async arrow function to search the OpenAPI spec"
        }
      },
      "required": ["code"]
    }
  },
  {
    "name": "execute",
    "description": "Execute JavaScript code against the Cloudflare API.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "code": {
          "type": "string",
          "description": "JavaScript async arrow function to execute"
        }
      },
      "required": ["code"]
    }
  }
]
</code></pre>
            <p>To discover what it can do, the agent calls <code>search()</code>. It writes JavaScript against a typed representation of the OpenAPI spec. The agent can filter endpoints by product, path, tags, or any other metadata and narrow thousands of endpoints to the handful it needs. The full OpenAPI spec never enters the model context. The agent only interacts with it through code.</p><p>When the agent is ready to act, it calls <code>execute()</code>. The agent writes code that can make Cloudflare API requests, handle pagination, check responses, and chain operations together in a single execution.</p><p>Both tools run the generated code inside a <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/worker-loader/"><u>Dynamic Worker</u></a> isolate — a lightweight V8 sandbox with no file system, no environment variables to leak through prompt injection, and external fetches disabled by default. Outbound requests can be explicitly controlled with outbound fetch handlers when needed.</p>
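<p>To make the contract concrete, here is a minimal sketch of how a server can evaluate model-written code against an injected client. This is an illustration of the pattern only, not Cloudflare's implementation: it runs in-process rather than in an isolate, and all names and data are made up for the example.</p>

```javascript
// The model sends an async arrow function as a string. Here it queries a
// stubbed client; in the real server, the injected client would make
// authenticated Cloudflare API calls from inside the sandbox.
const code = `async () => {
  const res = await cloudflare.request({ method: "GET", path: "/zones/abc/rulesets" });
  return res.result.map((rs) => rs.phase);
}`;

// Stub standing in for the authenticated API client.
const cloudflare = {
  request: async ({ method, path }) => ({
    success: true,
    result: [{ id: "ruleset-1", phase: "ddos_l7" }],
  }),
};

// Rebuild the function with `cloudflare` as its only visible binding,
// then invoke it. Nothing else from the host environment is reachable
// by name inside the evaluated code.
const fn = new Function("cloudflare", `return (${code});`)(cloudflare);
fn().then((phases) => console.log(phases)); // resolves to ["ddos_l7"]
```

The key design point is capability injection: the evaluated code can only touch what the server explicitly hands it, which is what makes the pattern safe to run server-side.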
    <div>
      <h4>Example: Protecting an origin from DDoS attacks</h4>
      <a href="#example-protecting-an-origin-from-ddos-attacks">
        
      </a>
    </div>
    <p>Suppose a user tells their agent: "protect my origin from DDoS attacks." The agent's first step is to consult documentation. It might call the <a href="https://developers.cloudflare.com/agents/model-context-protocol/mcp-servers-for-cloudflare/"><u>Cloudflare Docs MCP Server</u></a>, use a <a href="https://github.com/cloudflare/skills"><u>Cloudflare Skill</u></a>, or search the web directly. From the docs it learns: put <a href="https://www.cloudflare.com/application-services/products/waf/"><u>Cloudflare WAF</u></a> and <a href="https://www.cloudflare.com/ddos/"><u>DDoS protection</u></a> rules in front of the origin.</p><p><b>Step 1: Search for the right endpoints
</b>The <code>search</code> tool gives the model a <code>spec</code> object: the full Cloudflare OpenAPI spec with all <code>$refs</code> pre-resolved. The model writes JavaScript against it. Here the agent looks for WAF and ruleset endpoints on a zone:</p>
            <pre><code>async () =&gt; {
  const results = [];
  for (const [path, methods] of Object.entries(spec.paths)) {
    if (path.includes('/zones/') &amp;&amp;
        (path.includes('firewall/waf') || path.includes('rulesets'))) {
      for (const [method, op] of Object.entries(methods)) {
        results.push({ method: method.toUpperCase(), path, summary: op.summary });
      }
    }
  }
  return results;
}
</code></pre>
            <p>The server runs this code in a Workers isolate and returns:</p>
            <pre><code>[
  { "method": "GET",    "path": "/zones/{zone_id}/firewall/waf/packages",              "summary": "List WAF packages" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}", "summary": "Update a WAF package" },
  { "method": "GET",    "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}/rules", "summary": "List WAF rules" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}/rules/{rule_id}", "summary": "Update a WAF rule" },
  { "method": "GET",    "path": "/zones/{zone_id}/rulesets",                           "summary": "List zone rulesets" },
  { "method": "POST",   "path": "/zones/{zone_id}/rulesets",                           "summary": "Create a zone ruleset" },
  { "method": "GET",    "path": "/zones/{zone_id}/rulesets/phases/{ruleset_phase}/entrypoint", "summary": "Get a zone entry point ruleset" },
  { "method": "PUT",    "path": "/zones/{zone_id}/rulesets/phases/{ruleset_phase}/entrypoint", "summary": "Update a zone entry point ruleset" },
  { "method": "POST",   "path": "/zones/{zone_id}/rulesets/{ruleset_id}/rules",        "summary": "Create a zone ruleset rule" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/rulesets/{ruleset_id}/rules/{rule_id}", "summary": "Update a zone ruleset rule" }
]
</code></pre>
            <p>The full Cloudflare API spec has over 2,500 endpoints. The model narrowed that to the WAF and ruleset endpoints it needs, without any of the spec entering the context window. </p><p>The model can also drill into a specific endpoint's schema before calling it. Here it inspects what phases are available on zone rulesets:</p>
            <pre><code>async () =&gt; {
  const op = spec.paths['/zones/{zone_id}/rulesets']?.get;
  const items = op?.responses?.['200']?.content?.['application/json']?.schema;
  // Walk the schema to find the phase enum
  const props = items?.allOf?.[1]?.properties?.result?.items?.allOf?.[1]?.properties;
  return { phases: props?.phase?.enum };
}

{
  "phases": [
    "ddos_l4", "ddos_l7",
    "http_request_firewall_custom", "http_request_firewall_managed",
    "http_response_firewall_managed", "http_ratelimit",
    "http_request_redirect", "http_request_transform",
    "magic_transit", "magic_transit_managed"
  ]
}
</code></pre>
            <p>The agent now knows the exact phases it needs: <code>ddos_l7</code> for DDoS protection and <code>http_request_firewall_managed</code> for WAF.</p><p><b>Step 2: Act on the API
</b>The agent switches to using <code>execute</code>. The sandbox gets a <code>cloudflare.request()</code> client that can make authenticated calls to the Cloudflare API. First the agent checks what rulesets already exist on the zone:</p>
            <pre><code>async () =&gt; {
  const response = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets`
  });
  return response.result.map(rs =&gt; ({
    name: rs.name, phase: rs.phase, kind: rs.kind
  }));
}

[
  { "name": "DDoS L7",          "phase": "ddos_l7",                        "kind": "managed" },
  { "name": "Cloudflare Managed","phase": "http_request_firewall_managed", "kind": "managed" },
  { "name": "Custom rules",     "phase": "http_request_firewall_custom",   "kind": "zone" }
]
</code></pre>
            <p>The agent sees that managed DDoS and WAF rulesets already exist. It can now chain calls to inspect their rules and update sensitivity levels in a single execution:</p>
            <pre><code>async () =&gt; {
  // Get the current DDoS L7 entrypoint ruleset
  const ddos = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets/phases/ddos_l7/entrypoint`
  });

  // Get the WAF managed ruleset
  const waf = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets/phases/http_request_firewall_managed/entrypoint`
  });

  // Return both configurations so the model can inspect them next.
  return { ddos: ddos.result, waf: waf.result };
}
</code></pre>
            <p>This entire operation, from searching the spec and inspecting a schema to listing rulesets and fetching DDoS and WAF configurations, took four tool calls.</p>
    <div>
      <h3>The Cloudflare MCP server</h3>
      <a href="#the-cloudflare-mcp-server">
        
      </a>
    </div>
    <p>We started with MCP servers for individual products. Want an agent that manages DNS? Add the <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/dns-analytics"><u>DNS MCP server</u></a>. Want Workers logs? Add the <a href="https://developers.cloudflare.com/agents/model-context-protocol/mcp-servers-for-cloudflare/"><u>Workers Observability MCP server</u></a>. Each server exported a fixed set of tools that mapped to API operations. This worked when the tool set was small, but the Cloudflare API has over 2,500 endpoints. No collection of hand-maintained servers could keep up.</p><p>The Cloudflare MCP server simplifies this. Two tools, roughly 1,000 tokens, and coverage of every endpoint in the API. When we add new products, the same <code>search()</code> and <code>execute()</code> code paths discover and call them — no new tool definitions, no new MCP servers. It even has support for the <a href="https://developers.cloudflare.com/analytics/graphql-api/"><u>GraphQL Analytics API</u></a>.</p><p>Our MCP server is built on the latest MCP specifications. It is OAuth 2.1 compliant, using <a href="https://github.com/cloudflare/workers-oauth-provider"><u>Workers OAuth Provider</u></a> to downscope the token to selected permissions approved by the user when connecting. The agent  only gets the capabilities the user explicitly granted. </p><p>For developers, this means you can use a simple agent loop and still give your agent access to the full Cloudflare API with built-in progressive capability discovery.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/60ZoSFdK6t6hR6DpAn6Bub/93b86239cedb06d7fb265859be7590e8/images_BLOG-3184_4.png" />
          </figure>
    <div>
      <h3>Comparing approaches to context reduction</h3>
      <a href="#comparing-approaches-to-context-reduction">
        
      </a>
    </div>
    <p>Several approaches have emerged to reduce how many tokens MCP tools consume:</p><p><b>Client-side Code Mode</b> was our first experiment. The model writes TypeScript against typed SDKs and runs it in a Dynamic Worker Loader on the client. The tradeoff is that it requires the agent to ship with secure sandbox access. Code Mode is implemented in <a href="https://block.github.io/goose/blog/2025/12/15/code-mode-mcp/"><u>Goose</u></a> and Anthropic's Claude SDK as <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling"><u>Programmatic Tool Calling</u></a>.</p><p><b>Command-line interfaces</b> are another path. CLIs are self-documenting and reveal capabilities as the agent explores. Tools like <a href="https://openclaw.ai/"><u>OpenClaw</u></a> and <a href="https://blog.cloudflare.com/moltworker-self-hosted-ai-agent/"><u>Moltworker</u></a> convert MCP servers into CLIs using <a href="https://github.com/steipete/mcporter"><u>MCPorter</u></a> to give agents progressive disclosure. The limitation is obvious: the agent needs a shell, which not every environment provides and which introduces a much broader attack surface than a sandboxed isolate.</p><p><b>Dynamic tool search</b>, as used by <a href="https://x.com/trq212/status/2011523109871108570"><u>Anthropic in Claude Code</u></a>, surfaces a smaller set of tools that are hopefully relevant to the current task. It shrinks context use but now requires a search function that must be maintained and evaluated, and each matched tool still uses tokens.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5FPxVAuJggv7A08DbPsksb/aacb9087a79d08a1430ea87bb6960ad3/images_BLOG-3184_5.png" />
          </figure><p>Each approach solves a real problem. But for MCP servers specifically, server-side Code Mode combines their strengths: fixed token cost regardless of API size, no modifications needed on the agent side, progressive discovery built in, and safe execution inside a sandboxed isolate. The agent just calls two tools with code. Everything else happens on the server.</p>
    <div>
      <h3>Get started today</h3>
      <a href="#get-started-today">
        
      </a>
    </div>
    <p>The Cloudflare MCP server is available now. Point your MCP client at the server URL and you'll be redirected to Cloudflare to authorize and select the permissions to grant to your agent. Add this config to your MCP client: </p>
            <pre><code>{
  "mcpServers": {
    "cloudflare-api": {
      "url": "https://mcp.cloudflare.com/mcp"
    }
  }
}
</code></pre>
            <p>For CI/CD, automation, or if you prefer managing tokens yourself, create a Cloudflare API token with the permissions you need. Both user tokens and account tokens are supported and can be passed as bearer tokens in the <code>Authorization</code> header.</p><p>More information on different MCP setup configurations can be found at the <a href="https://github.com/cloudflare/mcp"><u>Cloudflare MCP repository</u></a>.</p>
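<p>For example, a token-based setup might look like the following. The <code>headers</code> field here is an assumption for illustration; field names vary by MCP client, so check your client's documentation:</p>

```json
{
  "mcpServers": {
    "cloudflare-api": {
      "url": "https://mcp.cloudflare.com/mcp",
      "headers": {
        "Authorization": "Bearer <CLOUDFLARE_API_TOKEN>"
      }
    }
  }
}
```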
    <div>
      <h3>Looking forward</h3>
      <a href="#looking-forward">
        
      </a>
    </div>
    <p>Code Mode solves context costs for a single API. But agents rarely talk to one service. A developer's agent might need the Cloudflare API alongside GitHub, a database, and an internal docs server. Each additional MCP server brings the same context window pressure we started with.</p><p><a href="https://blog.cloudflare.com/zero-trust-mcp-server-portals/"><u>Cloudflare MCP Server Portals</u></a> let you compose multiple MCP servers behind a single gateway with unified auth and access control. We are building a first-class Code Mode integration for all your MCP servers, and exposing them to agents with built-in progressive discovery and the same fixed-token footprint, regardless of how many services sit behind the gateway.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Open Source]]></category>
            <guid isPermaLink="false">2lWwgP33VT0NJjZ3pWShsw</guid>
            <dc:creator>Matt Carey</dc:creator>
        </item>
        <item>
            <title><![CDATA[A good day to trie-hard: saving compute 1% at a time]]></title>
            <link>https://blog.cloudflare.com/pingora-saving-compute-1-percent-at-a-time/</link>
            <pubDate>Tue, 10 Sep 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ Pingora handles 35M+ requests per second, so saving a few microseconds per request can translate to thousands of dollars saved on computing costs. In this post, we share how we freed up over 500 CPU  ]]></description>
            <content:encoded><![CDATA[ 
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uwVNobeSBws457ad5SoNY/080de413142fc98caffc3c0108912fe4/2442-1-hero.png" />
          </figure><p>Cloudflare’s global network handles <i>a lot</i> of HTTP requests – over 60 million per second on average. That in and of itself is not news, but it is the starting point to an adventure that started a few months ago and ends with the announcement of a new <a href="https://github.com/cloudflare/trie-hard"><u>open-source Rust crate</u></a> that we are using to reduce our CPU utilization, enabling our CDN to handle even more of the world’s ever-increasing Web traffic. </p>
    <div>
      <h2>Motivation</h2>
      <a href="#motivation">
        
      </a>
    </div>
    <p>Let’s start at the beginning. You may recall a few months ago we released <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a> (the heart of our Rust-based proxy services) as an <a href="https://github.com/cloudflare/pingora"><u>open-source project on GitHub</u></a>. I work on the team that maintains the Pingora framework, as well as Cloudflare’s production services built upon it. One of those services is responsible for the final step in transmitting users’ (non-cached) requests to their true destination. Internally, we call the request’s destination server its “origin”, so our service has the (unimaginative) name of “pingora-origin”.</p><p>One of the many responsibilities of pingora-origin is to ensure that when a request leaves our infrastructure, it has been cleaned to remove the internal information we use to route, measure, and optimize traffic for our customers. This has to be done for every request that leaves Cloudflare, and as I mentioned above, it’s <i>a lot</i> of requests. At the time of writing, the rate of requests leaving pingora-origin (globally) is 35 million requests per second. Any code that has to be run per-request is in the hottest of hot paths, and it’s in this path that we find this code and comment:</p>
            <pre><code>// PERF: heavy function: 1.7% CPU time
pub fn clear_internal_headers(request_header: &amp;mut RequestHeader) {
    INTERNAL_HEADERS.iter().for_each(|h| {
        request_header.remove_header(h);
    });
}</code></pre>
            <p>This small and pleasantly readable function consumes more than 1.7% of pingora-origin’s total CPU time. To put that in perspective, the total CPU time consumed by pingora-origin is 40,000 compute-seconds per second. You can think of this as 40,000 saturated CPU cores fully dedicated to running pingora-origin. Of those 40,000, 1.7% (680) are dedicated solely to evaluating <code>clear_internal_headers</code>. The function’s heavy usage and simplicity make it seem like a great place to start optimizing.</p>
    <div>
      <h2>Benchmarking</h2>
      <a href="#benchmarking">
        
      </a>
    </div>
    <p>Benchmarking the function shown above is straightforward because we can use the wonderful <a href="https://crates.io/crates/criterion"><u>criterion</u></a> Rust crate. Criterion provides an API for timing Rust code down to the nanosecond by aggregating multiple isolated executions. It also provides feedback on how the performance improves or regresses over time. The input for the benchmark is a large set of synthesized requests with a random number of headers and a uniform distribution of internal vs. non-internal headers. With our tooling and test data we find that our original <code>clear_internal_headers</code> function runs in an average of <b>3.65µs</b>. Now for each new method of clearing headers, we can measure against the same set of requests and get a relative performance difference.</p>
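<p>Criterion handles warm-up, statistical aggregation, and outlier detection for you; the underlying idea can be sketched with the standard library alone. The header names and the <code>retain</code>-based body below are stand-ins for illustration, not the production code or our actual harness:</p>

```rust
use std::time::Instant;

// Stand-in for the per-request work being timed; the real function
// removes internal headers from an http request header map.
fn clear_internal_headers(headers: &mut Vec<String>) {
    headers.retain(|h| !h.starts_with("x-internal-"));
}

fn main() {
    // Synthetic input: an even mix of internal and pass-through headers.
    let template: Vec<String> = (0..20)
        .map(|i| {
            if i % 2 == 0 {
                format!("x-internal-{i}")
            } else {
                format!("x-header-{i}")
            }
        })
        .collect();

    let iterations = 10_000u32;
    let start = Instant::now();
    for _ in 0..iterations {
        // Cloning inside the timed loop adds noise; criterion's
        // iter_batched separates setup from measurement properly.
        let mut headers = template.clone();
        clear_internal_headers(&mut headers);
    }
    println!("avg per call: {:?}", start.elapsed() / iterations);
}
```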
    <div>
      <h2>Reducing Reads</h2>
      <a href="#reducing-reads">
        
      </a>
    </div>
    <p>One potentially quick win is to invert how we find the headers that need to be removed from requests. If you look at the original code, you can see that we are evaluating <code>request_header.remove_header(h)</code> for each header in our list of internal headers, so 100+ times. Diagrammatically, it looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7y2qHbNfBQeoGRc8PqjBcp/9e8fccb6951a475a26def66695e47635/2442-2.png" />
          </figure><p></p><p>Since an average request has significantly fewer than 100 headers (10-30), flipping the lookup direction should reduce the number of reads while yielding the same intersection. Because we are working in Rust (and because <code>retain</code> does not exist for <code>http::HeaderMap</code> <a href="https://github.com/hyperium/http/issues/541"><u>yet</u></a>), we have to collect the identified internal headers in a separate step before removing them from the request. Conceptually, it looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6hgLavu1hZbwkw91Tee8e1/4d43b538274ae2c680236ca66791d73b/2442-3.png" />
          </figure><p></p><p>Using our benchmarking tool, we can measure the impact of this small change, and surprisingly this is already a substantial improvement. The runtime improves from <b>3.65µs</b> to <b>1.53µs</b>. That’s a 2.39x speed improvement for our function. We can calculate the theoretical CPU percentage by multiplying the starting utilization by the ratio of the new and old times: 1.71% * 1.53 / 3.65 = 0.717%. Unfortunately, if we subtract that from the original 1.71% that only equates to saving 1.71% - 0.717% = <i>0.993%</i> of the total CPU time. We should be able to do better. </p>
    <div>
      <h2>Searching Data Structures</h2>
      <a href="#searching-data-structures">
        
      </a>
    </div>
    <p>Now that we have reorganized our function to search a static set of internal headers instead of the actual request, we have the freedom to choose which data structure stores our header names simply by changing the type of <code>INTERNAL_HEADER_SET</code>.</p>
            <pre><code>pub fn clear_internal_headers(request_header: &amp;mut RequestHeader) {
   let to_remove = request_header
       .headers
       .keys()
       .filter_map(|name| INTERNAL_HEADER_SET.get(name))
       .collect::&lt;Vec&lt;_&gt;&gt;();


   to_remove.into_iter().for_each(|k| {
       request_header.remove_header(k);
   });
}</code></pre>
            <p>Our first attempt used <code>std::collections::HashMap</code>, but there may be other data structures that better suit our needs. All computer science students were taught at some point that hash tables are great because they have constant-time asymptotic behavior, or O(1), for reading. (If you are not familiar with <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation"><u>big O notation</u></a>, it is a way to express how an algorithm consumes a resource, in this case time, as the input size changes.) This means no matter how large the map gets, reads always take the same amount of time. Too bad this is only partially true. In order to read from a hash table, you have to compute the hash. Computing a hash for strings requires reading every byte, so while read time for a hashmap is constant over the table’s size, it’s linear over key length. So, our goal is to find a data structure that is better than O(L) where L is the length of the key.</p><p>There are a few common data structures whose read behavior meets our criteria. Sorted sets like <code>BTreeSet</code> use comparisons for searching, and that makes them logarithmic over key length <b>O(log(L))</b>, but they are also logarithmic in size. The net effect is that even very fast sorted sets like <a href="https://crates.io/crates/fst"><u>FST</u></a> work out to be a little (50 ns) slower in our benchmarks than the standard hashmap.</p><p>State machines like parsers and regex are another common tool for searching for strings, though it’s hard to consider them data structures. These systems work by accepting input one unit at a time and determining on each step whether or not to keep evaluating. Being able to make these determinations at every step means state machines are very fast to identify negative cases (i.e. when a string is not valid or not a match). 
This is perfect for us because only one or two headers per request on average will be internal. In fact, benchmarking an implementation of <code>clear_internal_headers</code> using regular expressions clocks in as taking about twice as long as the hashmap-based solution. This is impressively fast given that regexes, while powerful, aren't known for their raw speed. This approach feels promising – we just need something in between a data structure and a state machine. </p><p>That’s where the trie comes in.</p>
    <div>
      <h2>Don’t Just Trie</h2>
      <a href="#dont-just-trie">
        
      </a>
    </div>
    <p>A <a href="https://en.wikipedia.org/wiki/Trie"><u>trie</u></a> (pronounced like “try” or “tree”) is a type of <a href="https://en.wikipedia.org/wiki/Tree_(data_structure)"><u>tree data structure</u></a> normally used for prefix searches or auto-complete systems over a known set of strings. The structure of the trie lends itself to this because each node in the trie represents a substring of characters found in the initial set. The connections between the nodes represent the characters that can follow a prefix. Here is a small example of a trie built from the words: “and”, “ant”, “dad”, “do”, &amp; “dot”. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5wy48j3XNs9awxRNvjLljC/4e2a05b4e1802eba26f9e10e95bd843f/2442-4.png" />
          </figure><p>The root node represents an empty string prefix, so the two lettered edges directed out of it are the only letters that can appear as the first letter in the list of strings, “a” and “d”. Subsequent nodes have increasingly longer prefixes until the final valid words are reached. This layout should make it easy to see how a trie could be useful for quickly identifying strings that are not contained. Even at the root node, we can eliminate any strings that are presented that do not start with “a” or “d”. This paring down of the search space on every step gives reading from a trie the <b>O(log(L))</b> we were looking for … but only for misses. Hits within a trie are still <b>O(L)</b>, but that’s okay, because we are getting misses over 90% of the time.</p><p>Benchmarking a few trie implementations from <a href="https://crates.io/search?q=trie"><u>crates.io</u></a> was disheartening. Remember, most tries are used in response to keyboard events, so optimizing them to run in the hot path of tens of millions of requests per second is not a priority. The fastest existing implementation we found was <a href="https://crates.io/crates/radix_trie"><u>radix_trie</u></a>, but it still clocked in at a full microsecond slower than hashmap. The only thing left to do was write our own implementation of a trie that was optimized for our use case.</p>
    <div>
      <h2>Trie Hard</h2>
      <a href="#trie-hard">
        
      </a>
    </div>
    <p>And we did! Today we are announcing <a href="https://github.com/cloudflare/trie-hard"><u>trie-hard</u></a>. The repository gives a full description of how it works, but the big takeaway is that it gets its speed from storing node relationships in the bits of unsigned integers and keeping the entire tree in a contiguous chunk of memory. In our benchmarks, we found that trie-hard reduced the average runtime for <code>clear_internal_headers</code> to under a microsecond (0.93µs). We can reuse the same formula from above to calculate the expected CPU utilization for trie-hard to be 1.71% * 0.93 / 3.65 ≈ 0.43%. That means we have finally achieved and surpassed our goal by reducing the compute utilization of pingora-origin by 1.71% - 0.43% = <b>1.28%</b>! </p><p>Up until now we have been working only in theory and local benchmarking. What really matters is whether our benchmarking reflects real-life behavior. Trie-hard has been running in production since July 2024, and over the course of this project we have been collecting performance metrics from pingora-origin running in production, using a statistical sampling of its stack trace over time. Using this technique, the CPU utilization percentage of a function is estimated by the percent of samples in which the function appears. 
If we compare the sampled performance of the different versions of <code>clear_internal_headers</code>, we can see that the results from the performance sampling closely match what our benchmarks predicted.</p><table><tr><th><p>Implementation</p></th><th><p>Stack trace samples containing <code>clear_internal_headers</code></p></th><th><p>Actual CPU Usage (%)</p></th><th><p>Predicted CPU Usage (%)</p></th></tr><tr><td><p>Original </p></td><td><p>19 / 1111</p></td><td><p>1.71</p></td><td><p>n/a</p></td></tr><tr><td><p>Hashmap</p></td><td><p>9 / 1103</p></td><td><p>0.82</p></td><td><p>0.72</p></td></tr><tr><td><p>trie-hard</p></td><td><p>4 / 1171</p></td><td><p>0.34</p></td><td><p>0.43</p></td></tr></table>
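<p>The bit-packing trick mentioned above can be sketched as follows. This is an illustration of the technique with made-up field names, not trie-hard's actual data structures; see the repository for the real layout:</p>

```rust
// A node stores one u32 mask where bit i is set iff the node has a child
// for the i-th symbol of the alphabet, and children are packed densely
// in a flat array in order of set bits.
struct PackedNode {
    child_mask: u32,    // one bit per alphabet symbol
    first_child: usize, // index of this node's first child in the flat array
}

impl PackedNode {
    // Find the child slot for `symbol` (0..32), if present.
    fn child_index(&self, symbol: u8) -> Option<usize> {
        let bit = 1u32 << symbol;
        if self.child_mask & bit == 0 {
            return None; // miss decided with a single AND; no hashing, no probing
        }
        // Rank query: count the set bits below `symbol` to find this
        // child's offset within the contiguous child array.
        let rank = (self.child_mask & (bit - 1)).count_ones() as usize;
        Some(self.first_child + rank)
    }
}

fn main() {
    // Node with children for symbols 1, 4, and 9, stored at indices 7, 8, 9.
    let node = PackedNode {
        child_mask: (1 << 1) | (1 << 4) | (1 << 9),
        first_child: 7,
    };
    assert_eq!(node.child_index(1), Some(7));
    assert_eq!(node.child_index(4), Some(8));
    assert_eq!(node.child_index(9), Some(9));
    assert_eq!(node.child_index(2), None);
    println!("ok");
}
```

Because each step is a mask test plus a popcount over data sitting in one contiguous allocation, the walk stays branch-light and cache-friendly, which is where the speedup over pointer-chasing tries comes from.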
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Optimizing functions and writing new data structures is cool, but the real conclusion for this post is that knowing where your code is slow and by how much is more important than how you go about optimizing it. Take a moment to thank your observability team (if you're lucky enough to have one), and make use of flame graphs or any other profiling and benchmarking tool you can. Optimizing operations that are already measured in microseconds may seem a little silly, but these small improvements add up.</p> ]]></content:encoded>
            <category><![CDATA[Internet Performance]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Pingora]]></category>
            <guid isPermaLink="false">2CqKLNS1jaf5H2j99sDONe</guid>
            <dc:creator>Kevin Guthrie</dc:creator>
        </item>
        <item>
            <title><![CDATA[Making WAF ML models go brrr: saving decades of processing time]]></title>
            <link>https://blog.cloudflare.com/making-waf-ai-models-go-brr/</link>
            <pubDate>Thu, 25 Jul 2024 13:00:46 GMT</pubDate>
            <description><![CDATA[ In this post, we discuss performance optimizations we've implemented for our WAF ML product. We'll guide you through code examples, benchmarks, and we'll share the impressive latency reduction numbers ]]></description>
            <content:encoded><![CDATA[ <p>We made our WAF Machine Learning models <b>5.5x</b> faster, reducing execution time by approximately <b>82%</b>, from <b>1519</b> to <b>275</b> microseconds! Read on to find out how we achieved this remarkable improvement.</p><p><a href="https://developers.cloudflare.com/waf/about/waf-attack-score/">WAF Attack Score</a> is Cloudflare's machine learning (ML)-powered layer built on top of our <a href="https://developers.cloudflare.com/waf/">Web Application Firewall (WAF)</a>. Its goal is to complement the WAF and detect attack bypasses that we haven't encountered before. This has proven invaluable in <a href="/detecting-zero-days-before-zero-day">catching zero-day vulnerabilities</a>, like the one detected in <a href="/how-cloudflares-ai-waf-proactively-detected-ivanti-connect-secure-critical-zero-day-vulnerability">Ivanti Connect Secure</a>, before they are publicly disclosed and enhancing our customers' protection against emerging and unknown threats.</p><p>Since its <a href="/waf-ml">launch in 2022</a>, WAF attack score adoption has grown exponentially, now protecting millions of Internet properties and running real-time inference on tens of millions of requests per second. The feature's popularity has driven us to seek performance improvements, enabling even broader customer use and enhancing Internet security.</p><p>In this post, we will discuss the performance optimizations we've implemented for our WAF ML product. We'll guide you through specific code examples and benchmark numbers, demonstrating how these enhancements have significantly improved our system's efficiency. Additionally, we'll share the impressive latency reduction numbers observed after the rollout.</p><p>Before diving into the optimizations, let's take a moment to review the inner workings of the WAF Attack Score, which powers our WAF ML product.</p>
    <div>
      <h2>WAF Attack Score system design</h2>
      <a href="#waf-attack-score-system-design">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Bis9LE38A3aK4k7HEn7k9/44ae7b31096471a5256961715f8c7991/unnamed--4--6.png" />
            
            </figure><p>Cloudflare's WAF attack score identifies various traffic types and attack vectors (<a href="https://www.cloudflare.com/learning/security/threats/how-to-prevent-sql-injection/">SQLi</a>, <a href="https://www.cloudflare.com/learning/security/how-to-prevent-xss-attacks/">XSS</a>, Command Injection, etc.) based on structural or statistical content properties. Here's how it works during inference:</p><ol><li><p><b>HTTP Request Content</b>: Start with raw HTTP input.</p></li><li><p><b>Normalization &amp; Transformation</b>: Standardize and clean the data, applying normalization, content substitutions, and de-duplication.</p></li><li><p><b>Feature Extraction</b>: Tokenize the transformed content to generate statistical and structural data.</p></li><li><p><b>Machine Learning Model Inference</b>: Analyze the extracted features with pre-trained models, mapping content representations to classes (e.g., XSS, SQLi or <a href="https://www.cloudflare.com/learning/security/what-is-remote-code-execution/">RCE</a>) or scores.</p></li><li><p><b>Classification Output in WAF</b>: Assign a score to the input, ranging from 1 (likely malicious) to 99 (likely clean), guiding security actions.</p></li></ol>
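<p>To make the five-step flow concrete, here is a heavily simplified, illustrative sketch. Every function body below is a stand-in placeholder of our own invention (a trivial "normalizer", a toy hashed feature tensor, and a dummy "model"), not Cloudflare's actual implementation; only the shape of the pipeline mirrors the steps above.</p>

```rust
// Step 2 (placeholder): normalize the raw HTTP content.
fn normalize(raw: &[u8]) -> Vec<u8> {
    raw.to_ascii_lowercase()
}

// Step 3 (placeholder): tokenize into 3-byte ngrams and hash each into a
// tiny toy feature tensor.
fn extract_features(input: &[u8]) -> Vec<f32> {
    let mut tensor = vec![0.0f32; 8];
    for w in input.windows(3) {
        let idx = (w[0] ^ w[1] ^ w[2]) as usize % tensor.len();
        tensor[idx] += 1.0;
    }
    tensor
}

// Step 4 (placeholder): a dummy "model" returning an attack probability.
fn model_infer(tensor: &[f32]) -> f32 {
    tensor.iter().filter(|&&v| v > 0.0).count() as f32 / tensor.len() as f32
}

// Step 5: map the model output to the 1 (likely malicious) .. 99 (likely
// clean) score range used by the WAF.
fn attack_score(raw: &[u8]) -> u8 {
    let tensor = extract_features(&normalize(raw));
    let p_clean = 1.0 - model_infer(&tensor);
    1 + (p_clean * 98.0).round() as u8
}
```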
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZzHRYXU27VYB5F3F3QjXf/9e5248610a1e89ac8c73a446925abb69/cfce15fb-ce84-4489-a05a-6872b9e502b8.png" />
            
            </figure><p>Next, we will explore feature extraction and inference optimizations.</p>
    <div>
      <h2>Feature extraction optimizations</h2>
      <a href="#feature-extraction-optimizations">
        
      </a>
    </div>
    <p>In the context of the WAF Attack Score ML model, feature extraction or pre-processing is essentially a process of tokenizing the given input and producing a float tensor of 1 x m size:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DIxHJ5zLkdeknndiNGbk0/e802888a2212ddfcae688f1c4201587f/8cc41311-3a09-4c39-b47c-9dc449760ee2.png" />
            
            </figure><p>In our initial pre-processing implementation, this is achieved via a sliding window of 3 bytes over the input with the help of Rust’s <a href="https://doc.rust-lang.org/std/collections/struct.HashMap.html">std::collections::HashMap</a> to look up the tensor index for a given ngram.</p>
    <div>
      <h3>Initial benchmarks</h3>
      <a href="#initial-benchmarks">
        
      </a>
    </div>
    <p>To establish performance baselines, we've set up four benchmark cases representing example inputs of various lengths, ranging from 44 to 9482 bytes. Each case exemplifies typical input sizes, including those for a request body, user agent, and URI. We run benchmarks using the <a href="https://bheisler.github.io/criterion.rs/book/getting_started.html">Criterion.rs</a> statistics-driven micro-benchmarking tool:</p>
            <pre><code>RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo criterion</code></pre>
            <p>Here are initial numbers for these benchmarks executed on a Linux laptop with a 13th Gen Intel® Core™ i7-13800H processor:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Pre-processing time, μs</span></th>
    <th><span>Throughput, MiB/s</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td><span>36.40</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td><span>33.83</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>28.94</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>30.24</span></td>
  </tr>
</tbody></table><p>An important observation from these results is that pre-processing time correlates with the length of the input string, with throughput ranging from 28 MiB/s to 36 MiB/s. This suggests that considerable time is spent iterating over longer input strings, making that iteration a prime target for optimization. To validate this, let's examine where the processing time is spent by analyzing flamegraphs from a 100-second profiling session, visualized using <a href="https://www.honeycomb.io/blog/golang-observability-using-the-new-pprof-web-ui-to-debug-memory-usage">pprof</a>:</p>
            <pre><code>RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo criterion -- --profile-time 100
 
go tool pprof -http=: target/criterion/profile/preprocessing/avg-body-1000/profile.pb</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WGhFT3j6vn4QGFOmdyNGO/8dd1c6e4d171cd2c407af7bf4d9a9ac7/unnamed--5--6.png" />
            
            </figure><p>Looking at the pre-processing flamegraph above, it's clear that most of the time was spent on the following two operations:</p>
<table><thead>
  <tr>
    <th><span>Function name</span></th>
    <th><span>% Time spent</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>std::collections::hash::map::HashMap&lt;K,V,S&gt;::get</span></td>
    <td><span>61.8%</span></td>
  </tr>
  <tr>
    <td><span>regex::regex::bytes::Regex::replace_all</span></td>
    <td><span>18.5%</span></td>
  </tr>
</tbody></table><p>Let's tackle the HashMap lookups first. The lookups happen inside the <i>tensor_populate_ngrams</i> function, where the input is split into 3-byte windows (ngrams), each of which is then looked up in two hash maps:</p>
            <pre><code>fn tensor_populate_ngrams(tensor: &amp;mut [f32], input: &amp;[u8]) {   
   // Populate the NORM ngrams
   let mut unknown_norm_ngrams = 0;
   let norm_offset = 1;
 
   for s in input.windows(3) {
       match NORM_VOCAB.get(s) {
           Some(pos) =&gt; {
               tensor[*pos as usize + norm_offset] += 1.0f32;
           }
           None =&gt; {
               unknown_norm_ngrams += 1;
           }
       };
   }
 
   // Populate the SIG ngrams
   let mut unknown_sig_ngrams = 0;
   let sig_offset = norm_offset + NORM_VOCAB.len();
 
   let res = SIG_REGEX.replace_all(&amp;input, b"#");
 
   for s in res.windows(3) {
       match SIG_VOCAB.get(s) {
           Some(pos) =&gt; {
               // adding +1 here as the first position will be the unknown_sig_ngrams
               tensor[*pos as usize + sig_offset + 1] += 1.0f32;
           }
           None =&gt; {
               unknown_sig_ngrams += 1;
           }
       }
   }
}</code></pre>
            <p>So essentially the pre-processing function performs a ton of hash map lookups, the volume of which depends on the size of the input string, e.g. 1469 lookups for the given benchmark case <i>avg-body-1000</i>.</p>
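<p>To put a number on that: a sliding window of 3 bytes over an n-byte input produces n - 2 windows, so a 1000-byte body alone accounts for 998 NORM lookups. If the quoted 1469 total is NORM lookups plus SIG lookups over the regex-replaced input, the replaced input for this case would be about 473 bytes. A trivial helper (ours, purely for illustration):</p>

```rust
// Number of 3-byte sliding windows over an n-byte input: n - 2,
// and zero for inputs shorter than 3 bytes.
fn window_count(len: usize) -> usize {
    len.saturating_sub(2)
}
```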
    <div>
      <h3>Optimization attempt #1: HashMap → Aho-Corasick</h3>
      <a href="#optimization-attempt-1-hashmap-aho-corasick">
        
      </a>
    </div>
    <p>Rust hash maps are generally quite fast. However, when that many lookups are being performed, it's not very cache friendly.</p><p>So can we do better than hash maps, and what should we try first? The answer is the <a href="https://docs.rs/aho-corasick/latest/aho_corasick/">Aho-Corasick library</a>.</p><p>This library provides multiple pattern search principally through an implementation of the <a href="https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm">Aho-Corasick algorithm</a>, which builds a fast finite state machine for executing searches in linear time.</p><p>We can also tune Aho-Corasick settings based on this recommendation:</p><blockquote><p><i>“You might want to use</i> <a href="https://docs.rs/aho-corasick/1.1.3/aho_corasick/struct.AhoCorasickBuilder.html#method.kind"><i>AhoCorasickBuilder::kind</i></a> <i>to set your searcher to always use</i> <a href="https://docs.rs/aho-corasick/1.1.3/aho_corasick/enum.AhoCorasickKind.html#variant.DFA"><i>AhoCorasickKind::DFA</i></a> <i>if search speed is critical and memory usage isn’t a concern.”</i></p></blockquote>
            <pre><code>static ref NORM_VOCAB_AC: AhoCorasick = AhoCorasick::builder().kind(Some(AhoCorasickKind::DFA)).build(&amp;[    
    "abc",
    "def",
    "wuq",
    "ijf",
    "iru",
    "piw",
    "mjw",
    "isn",
    "od ",
    "pro",
    ...
]).unwrap();</code></pre>
            <p>Then we use the constructed AhoCorasick dictionary to lookup ngrams using its <a href="https://docs.rs/aho-corasick/latest/aho_corasick/struct.AhoCorasick.html#method.find_overlapping_iter">find_overlapping_iter</a> method:</p>
            <pre><code>for mat in NORM_VOCAB_AC.find_overlapping_iter(&amp;input) {
    tensor_input_data[mat.pattern().as_usize() + 1] += 1.0;
}</code></pre>
            <p>We ran benchmarks and compared them against the baseline times shown above:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>Aho-Corasick time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td><span>129.59</span></td>
    <td><span>-47.84% or 1.64x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td>	<span>16.47</span></td>
    <td><span>-41.56% or 1.71x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>1.01</span></td>
    <td><span>-30.38% or 1.44x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>1.90</span></td>
    <td><span>-33.60% or 1.51x</span></td>
  </tr>
</tbody></table><p>That's substantially better – Aho-Corasick DFA does wonders.</p>
    <div>
      <h3>Optimization attempt #2: Aho-Corasick → match</h3>
      <a href="#optimization-attempt-2-aho-corasick-match">
        
      </a>
    </div>
    <p>One would think the Aho-Corasick DFA optimization is enough, and that anything else is unlikely to beat it. Yet we can throw Aho-Corasick away, simply use a Rust match statement, and let the compiler do the optimization for us!</p>
            <pre><code>#[inline]
const fn norm_vocab_lookup(ngram: &amp;[u8; 3]) -&gt; usize {     
    match ngram {
        b"abc" =&gt; 1,
        b"def" =&gt; 2,
        b"wuq" =&gt; 3,
        b"ijf" =&gt; 4,
        b"iru" =&gt; 5,
        b"piw" =&gt; 6,
        b"mjw" =&gt; 7,
        b"isn" =&gt; 8,
        b"od " =&gt; 9,
        b"pro" =&gt; 10,
        ...
        _ =&gt; 0,
    }
}</code></pre>
            <p>Here's how it performs in practice, based on the assembly generated by the <a href="https://godbolt.org/z/dqTq5n5Y3">Godbolt compiler explorer</a>. The corresponding assembly code efficiently implements this lookup by employing a jump table and byte-wise comparisons to determine the return value based on input sequences, optimizing for quick decisions and minimal branching. Although the example only includes ten ngrams, it's important to note that in applications like our WAF Attack Score ML models, we deal with thousands of ngrams. This simple match-based approach outshines both HashMap lookups and the Aho-Corasick method.</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>Match time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td>	<span>112.96</span></td>
    <td><span>-54.54% or 2.20x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td>	<span>13.12</span></td>
    <td><span>-53.45% or 2.15x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>0.75</span></td>
    <td><span>-48.37% or 1.94x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>1.41</span></td>
    <td><span>-50.91% or 2.04x</span></td>
  </tr>
</tbody></table><p>Switching to match gave us another 7-18% drop in latency, depending on the case.</p>
    <div>
      <h3>Optimization attempt #3: Regex → WindowedReplacer</h3>
      <a href="#optimization-attempt-3-regex-windowedreplacer">
        
      </a>
    </div>
    <p>So, what exactly is the purpose of <i>Regex::replace_all</i> in pre-processing? Regex is defined and used like this:</p>
            <pre><code>pub static SIG_REGEX: Lazy&lt;Regex&gt; =
    Lazy::new(|| RegexBuilder::new("[a-z]+").unicode(false).build().unwrap());
    ... 
    let res = SIG_REGEX.replace_all(&amp;input, b"#");
    for s in res.windows(3) {
        tensor[sig_vocab_lookup(s.try_into().unwrap())] += 1.0;
    }</code></pre>
            <p>Essentially, all we need is to:</p><ol><li><p>Replace every sequence of lowercase letters in the input with a single byte "#".</p></li><li><p>Iterate over the replaced bytes in overlapping 3-byte windows, each representing an ngram.</p></li><li><p>Look up each ngram's index and increment the corresponding tensor entry.</p></li></ol><p>This logic seems simple enough that we could implement it more efficiently with a single pass over the input and without any allocations:</p>
            <pre><code>type Window = [u8; 3];
type Iter&lt;'a&gt; = Peekable&lt;std::slice::Iter&lt;'a, u8&gt;&gt;;

pub struct WindowedReplacer&lt;'a&gt; {
    window: Window,
    input_iter: Iter&lt;'a&gt;,
}

#[inline]
fn is_replaceable(byte: u8) -&gt; bool {
    matches!(byte, b'a'..=b'z')
}

#[inline]
fn next_byte(iter: &amp;mut Iter) -&gt; Option&lt;u8&gt; {
    let byte = iter.next().copied()?;
    if is_replaceable(byte) {
        while iter.next_if(|b| is_replaceable(**b)).is_some() {}
        Some(b'#')
    } else {
        Some(byte)
    }
}

impl&lt;'a&gt; WindowedReplacer&lt;'a&gt; {
    pub fn new(input: &amp;'a [u8]) -&gt; Option&lt;Self&gt; {
        let mut window: Window = Default::default();
        let mut iter = input.iter().peekable();
        for byte in window.iter_mut().skip(1) {
            *byte = next_byte(&amp;mut iter)?;
        }
        Some(WindowedReplacer {
            window,
            input_iter: iter,
        })
    }
}

impl&lt;'a&gt; Iterator for WindowedReplacer&lt;'a&gt; {
    type Item = Window;

    #[inline]
    fn next(&amp;mut self) -&gt; Option&lt;Self::Item&gt; {
        for i in 0..2 {
            self.window[i] = self.window[i + 1];
        }
        let byte = next_byte(&amp;mut self.input_iter)?;
        self.window[2] = byte;
        Some(self.window)
    }
}</code></pre>
            <p>By utilizing the <i>WindowedReplacer</i>, we simplify the replacement logic:</p>
            <pre><code>if let Some(replacer) = WindowedReplacer::new(&amp;input) {
    for ngram in replacer {
        tensor[sig_vocab_lookup(&amp;ngram)] += 1.0;
    }
}</code></pre>
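<p>For reference, the replacement rule the iterator implements can also be written as a straightforward allocating pass. This is a sketch of the semantics only (helper name ours), useful as a correctness oracle against the iterator version:</p>

```rust
// Collapse every run of lowercase ASCII letters into a single '#' byte,
// mirroring SIG_REGEX.replace_all(&input, b"#") without regex.
fn replace_letter_runs(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i].is_ascii_lowercase() {
            out.push(b'#');
            // skip the rest of the lowercase run
            while i < input.len() && input[i].is_ascii_lowercase() {
                i += 1;
            }
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    out
}
```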
            <p>This new approach not only eliminates the need for allocating additional buffers to store replaced content, but also leverages Rust's iterator optimizations, which the compiler can more effectively optimize. You can view an example of the assembly output for this new iterator at the provided <a href="https://godbolt.org/z/fjaoP7z6Y">Godbolt link</a>.</p><p>Now let's benchmark this and compare against the original implementation:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>WindowedReplacer time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td>	<span>51.00</span></td>
    <td><span>-79.47% or 4.87x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td>	<span>5.53</span></td>
    <td><span>-80.36% or 5.09x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>0.40</span></td>
    <td><span>-72.11% or 3.59x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>0.69</span></td>
    <td><span>-76.07% or 4.18x</span></td>
  </tr>
</tbody></table><p>The new letter-replacement implementation roughly doubles the pre-processing speed compared to the previously optimized match-based version, and it is four to five times faster than the original!</p>
    <div>
      <h3>Optimization attempt #4: Going nuclear with branchless ngram lookups</h3>
      <a href="#optimization-attempt-4-going-nuclear-with-branchless-ngram-lookups">
        
      </a>
    </div>
    <p>At this point, a 4-5x improvement might seem like plenty, with no point in pursuing further optimizations. After all, the match-based ngram lookup has beaten all of the following methods (benchmarks omitted for brevity):</p>
<table><thead>
  <tr>
    <th><span>Lookup method</span></th>
    <th><span>Description</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><a href="https://doc.rust-lang.org/std/collections/struct.HashMap.html"><span>std::collections::HashMap</span></a></td>
    <td><span>Uses </span><a href="https://github.com/rust-lang/hashbrown"><span>Google’s SwissTable</span></a><span> design with SIMD lookups to scan multiple hash entries in parallel. </span></td>
  </tr>
  <tr>
    <td><a href="https://docs.rs/aho-corasick/latest/aho_corasick/#"><span>Aho-Corasick</span></a><span> matcher with and without </span><a href="https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.DFA.html"><span>DFA</span></a></td>
    <td><span>Also utilizes SIMD instructions in some cases.</span></td>
  </tr>
  <tr>
    <td><a href="https://crates.io/crates/phf"><span>phf crate</span></a><span> </span></td>
    <td><span>A library to generate efficient lookup tables at compile time using </span><a href="https://en.wikipedia.org/wiki/Perfect_hash_function"><span>perfect hash functions</span></a><span>.</span></td>
  </tr>
  <tr>
    <td><a href="https://crates.io/crates/ph"><span>ph crate</span></a></td>
    <td><span>Another Rust library of data structures based on perfect hashing. </span></td>
  </tr>
  <tr>
    <td><a href="https://crates.io/crates/quickphf"><span>quickphf crate</span></a></td>
    <td><span>A Rust crate that allows you to use static compile-time generated hash maps and hash sets using </span><a href="https://arxiv.org/abs/2104.10402"><span>PTHash perfect hash functions</span></a><span>.</span></td>
  </tr>
</tbody></table><p>However, if we look again at <a href="https://godbolt.org/z/dqTq5n5Y3">the assembly of the norm_vocab_lookup function</a>, it is clear that the execution flow has to perform a bunch of comparisons using <i>cmp</i> instructions. This creates many branches for the CPU to handle, which can lead to branch mispredictions. Branch mispredictions occur when the CPU incorrectly guesses the path of execution, causing delays as it discards partially completed instructions and fetches the correct ones. By reducing or eliminating these branches, we can avoid these mispredictions and improve the efficiency of the lookup process. How can we get rid of those branches when there is a need to look up thousands of unique ngrams?</p><p>Since there are only 3 bytes in each ngram, we can build two lookup tables of 256 x 256 x 256 size, storing the ngram tensor index. With this naive approach, our memory requirements will be: 256 x 256 x 256 x 2 x 2 = 64 MB, which seems like a lot.</p><p>However, given that we only care about ASCII bytes 0..127, then memory requirements can be lower: 128 x 128 x 128 x 2 x 2 = 8 MB, which is better. However, we will need to check for bytes &gt;= 128, which will introduce a branch again.</p><p>So can we do better? Considering that the actual number of distinct byte values used in the ngrams is significantly less than the total possible 256 values, we can reduce memory requirements further by employing the following technique:</p><p>1. To avoid the branching caused by comparisons, we use precomputed offset lookup tables. This means instead of comparing each byte of the ngram during each lookup, we precompute the positions of each possible byte in a lookup table. This way, we replace the comparison operations with direct memory accesses, which are much faster and do not involve branching. We build an ngram bytes offsets lookup const array, storing each unique ngram byte offset position multiplied by the number of unique ngram bytes:</p>
            <pre><code>const NGRAM_OFFSETS: [[u32; 256]; 3] = [
    [
        // offsets of first byte in ngram
    ],
    [
        // offsets of second byte in ngram
    ],
    [
        // offsets of third byte in ngram
    ],
];</code></pre>
            <p>2. Then to obtain the ngram index, we can use this simple const function:</p>
            <pre><code>#[inline]
const fn ngram_index(ngram: [u8; 3]) -&gt; usize {
    (NGRAM_OFFSETS[0][ngram[0] as usize]
        + NGRAM_OFFSETS[1][ngram[1] as usize]
        + NGRAM_OFFSETS[2][ngram[2] as usize]) as usize
}</code></pre>
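<p>To make the offset scheme concrete, here is a toy version over a three-byte alphabet {a, b, c} (so N = 3). The production tables cover every byte that appears in the vocabulary; the names and sizes here are illustrative only. Note the comparisons live entirely in the compile-time table construction — the runtime lookup is just three array reads and two additions:</p>

```rust
const N: u32 = 3;

// Position of each alphabet byte; unknown bytes map to 0 in this toy.
const fn pos(b: u8) -> u32 {
    match b {
        b'a' => 0,
        b'b' => 1,
        b'c' => 2,
        _ => 0,
    }
}

// Precompute per-position offsets so the three table reads below sum to a
// unique index: first byte has stride N*N, second byte N, third byte 1.
const fn build_offsets() -> [[u32; 256]; 3] {
    let mut t = [[0u32; 256]; 3];
    let alphabet = [b'a', b'b', b'c'];
    let mut i = 0;
    while i < alphabet.len() {
        let b = alphabet[i] as usize;
        t[0][b] = pos(alphabet[i]) * N * N;
        t[1][b] = pos(alphabet[i]) * N;
        t[2][b] = pos(alphabet[i]);
        i += 1;
    }
    t
}

const NGRAM_OFFSETS: [[u32; 256]; 3] = build_offsets();

// Branchless at runtime: no comparisons, just direct memory accesses.
const fn ngram_index(ngram: [u8; 3]) -> usize {
    (NGRAM_OFFSETS[0][ngram[0] as usize]
        + NGRAM_OFFSETS[1][ngram[1] as usize]
        + NGRAM_OFFSETS[2][ngram[2] as usize]) as usize
}
```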
            <p>3. To look up the tensor index based on the ngram index, we construct another const array at compile time using a list of all ngrams, where N is the number of unique ngram bytes:</p>
            <pre><code>const NGRAM_TENSOR_IDX: [u16; N * N * N] = {
    let mut arr = [0; N * N * N];
    arr[ngram_index(*b"abc")] = 1;
    arr[ngram_index(*b"def")] = 2;
    arr[ngram_index(*b"wuq")] = 3;
    arr[ngram_index(*b"ijf")] = 4;
    arr[ngram_index(*b"iru")] = 5;
    arr[ngram_index(*b"piw")] = 6;
    arr[ngram_index(*b"mjw")] = 7;
    arr[ngram_index(*b"isn")] = 8;
    arr[ngram_index(*b"od ")] = 9;
    ...
    arr
};</code></pre>
            <p>4. Finally, to update the tensor based on given ngram, we lookup the ngram index, then the tensor index, and then increment it with help of <a href="https://doc.rust-lang.org/std/primitive.slice.html#method.get_unchecked_mut">get_unchecked_mut</a>, which avoids unnecessary (in this case) boundary checks and eliminates another source of branching:</p>
            <pre><code>#[inline]
fn update_tensor_with_ngram(tensor: &amp;mut [f32], ngram: [u8; 3]) {
    let ngram_idx = ngram_index(ngram);
    debug_assert!(ngram_idx &lt; NGRAM_TENSOR_IDX.len());
    unsafe {
        let tensor_idx = *NGRAM_TENSOR_IDX.get_unchecked(ngram_idx) as usize;
        debug_assert!(tensor_idx &lt; tensor.len());
        *tensor.get_unchecked_mut(tensor_idx) += 1.0;
    }
}</code></pre>
            <p>This logic works effectively, passes correctness tests, and most importantly, it's completely branchless! Moreover, the memory footprint of the lookup arrays is tiny – just ~500 KiB – which easily fits into modern CPU L2/L3 caches, so expensive cache misses are rare and performance stays optimal.</p><p>The last trick we will employ is loop unrolling for ngram processing. By taking 6 ngrams (corresponding to 8 bytes of input) at a time, the compiler can unroll the inner loop and auto-vectorize it, leveraging parallel execution to improve performance:</p>
            <pre><code>const CHUNK_SIZE: usize = 6;

let chunks_max_offset =
    ((input.len().saturating_sub(2)) / CHUNK_SIZE) * CHUNK_SIZE;
for i in (0..chunks_max_offset).step_by(CHUNK_SIZE) {
    for ngram in input[i..i + CHUNK_SIZE + 2].windows(3) {
        update_tensor_with_ngram(tensor, ngram.try_into().unwrap());
    }
}</code></pre>
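<p>The snippet above elides handling of the input's tail (the final stretch with fewer than 6 remaining windows). A self-contained sketch (helper name ours) shows that the chunked pass plus a plain windowed pass over the remainder visits exactly the same ngrams, in the same order, as a simple <code>input.windows(3)</code>:</p>

```rust
const CHUNK_SIZE: usize = 6;

// Process 3-byte windows in unrollable chunks of 6 (8 input bytes each),
// then finish the tail with a plain windowed pass. The tail slice starts at
// chunks_max_offset, so its windows pick up exactly where the chunks stopped.
fn chunked_ngrams(input: &[u8]) -> Vec<[u8; 3]> {
    let mut out = Vec::new();
    let n_windows = input.len().saturating_sub(2);
    let chunks_max_offset = (n_windows / CHUNK_SIZE) * CHUNK_SIZE;
    for i in (0..chunks_max_offset).step_by(CHUNK_SIZE) {
        // 8 bytes yield 6 overlapping 3-byte windows
        for w in input[i..i + CHUNK_SIZE + 2].windows(3) {
            out.push(w.try_into().unwrap());
        }
    }
    // Tail: fewer than CHUNK_SIZE windows remain
    for w in input[chunks_max_offset..].windows(3) {
        out.push(w.try_into().unwrap());
    }
    out
}
```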
            <p>Tying up everything together, our final pre-processing benchmarks show the following:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>Branchless time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td><span>21.53</span></td>
    <td><span>-91.33% or 11.54x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td><span>2.33</span></td>
    <td><span>-91.73% or 12.09x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td>	<span>0.26</span></td>
    <td><span>-82.34% or 5.66x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td>	<span>0.43</span></td>
    <td><span>-84.92% or 6.63x</span></td>
  </tr>
</tbody></table><p>The longer the input, the larger the latency reduction from branchless ngram lookups and loop unrolling, with results ranging from <b>six to twelve times faster</b> than the baseline implementation.</p><p>After trying various optimizations, the final version of pre-processing retains optimization attempts 3 and 4, using branchless ngram lookup with offset tables and a single-pass non-allocating replacement iterator.</p><p>There are potentially more CPU cycles left on the table, and techniques like memory pre-fetching and manual SIMD intrinsics could speed this up a bit further. However, let's now switch gears and look at inference latency more closely.</p>
    <div>
      <h2>Model inference optimizations</h2>
      <a href="#model-inference-optimizations">
        
      </a>
    </div>
    
    <div>
      <h3>Initial benchmarks</h3>
      <a href="#initial-benchmarks">
        
      </a>
    </div>
    <p>Let’s have a look at the original performance numbers of the WAF Attack Score ML model, which uses <a href="https://github.com/tensorflow/tensorflow/releases/tag/v2.6.0">TensorFlow Lite 2.6.0</a>:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Inference time, μs</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>inference/long-body-9482</span></td>
    <td><span>247.31</span></td>
  </tr>
  <tr>
    <td><span>inference/avg-body-1000</span></td>
    <td><span>246.31</span></td>
  </tr>
  <tr>
    <td><span>inference/avg-url-44</span></td>
    <td><span>246.40</span></td>
  </tr>
  <tr>
    <td><span>inference/avg-ua-91</span></td>
    <td><span>246.88</span></td>
  </tr>
</tbody></table><p>Model inference is actually independent of the original input length, as inputs are transformed into tensors of predetermined size during the pre-processing phase, which we optimized above. From now on, we will refer to a single inference time when benchmarking our optimizations.</p><p>Digging deeper with a profiler, we observed that most of the time is spent on the following operations:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uy64gatRk8PfdnpRz5Xm5/0d3da469c30e5941524289c1b13574c5/unnamed--6--6.png" />
            
            </figure>
<table><thead>
  <tr>
    <th><span>Function name</span></th>
    <th><span>% Time spent</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>tflite::tensor_utils::PortableMatrixBatchVectorMultiplyAccumulate</span></td>
    <td><span>42.46%</span></td>
  </tr>
  <tr>
    <td><span>tflite::tensor_utils::PortableAsymmetricQuantizeFloats</span></td>
    <td><span>30.59%</span></td>
  </tr>
  <tr>
    <td><span>tflite::optimized_ops::SoftmaxImpl</span></td>
    <td><span>12.02%</span></td>
  </tr>
  <tr>
    <td><span>tflite::reference_ops::MaximumMinimumBroadcastSlow</span></td>
    <td><span>5.35%</span></td>
  </tr>
  <tr>
    <td><span>tflite::ops::builtin::elementwise::LogEval</span></td>
    <td><span>4.13%</span></td>
  </tr>
</tbody></table><p>The most expensive operation is matrix multiplication, which boils down to iteration within <a href="https://github.com/tensorflow/tensorflow/blob/v2.6.0/tensorflow/lite/kernels/internal/reference/portable_tensor_utils.cc#L119-L136">three nested loops</a>:</p>
            <pre><code>void PortableMatrixBatchVectorMultiplyAccumulate(const float* matrix,
                                                 int m_rows, int m_cols,
                                                 const float* vector,
                                                 int n_batch, float* result) {
  float* result_in_batch = result;
  for (int b = 0; b &lt; n_batch; b++) {
    const float* matrix_ptr = matrix;
    for (int r = 0; r &lt; m_rows; r++) {
      float dot_prod = 0.0f;
      const float* vector_in_batch = vector + b * m_cols;
      for (int c = 0; c &lt; m_cols; c++) {
        dot_prod += *matrix_ptr++ * *vector_in_batch++;
      }
      *result_in_batch += dot_prod;
      ++result_in_batch;
    }
  }
}</code></pre>
            <p>This doesn’t look very efficient, and many <a href="https://en.algorithmica.org/hpc/algorithms/matmul/">blogs</a> and <a href="https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf">research papers</a> have been written on how matrix multiplication can be optimized. The techniques basically boil down to:</p><ul><li><p><b>Blocking</b>: Divide matrices into smaller blocks that fit into the cache, improving cache reuse and reducing memory access latency.</p></li><li><p><b>Vectorization</b>: Use SIMD instructions to process multiple data points in parallel, enhancing efficiency with vector registers.</p></li><li><p><b>Loop Unrolling</b>: Reduce loop control overhead and increase parallelism by executing multiple loop iterations simultaneously.</p></li></ul><p>To gain a better understanding of how these techniques work, we recommend watching this video, which brilliantly depicts the process of matrix multiplication:</p>
<p></p>
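<p>To make the blocking idea concrete, here is a minimal Python sketch of loop tiling (an illustration only, not the TensorFlow Lite implementation; the block size is an arbitrary example value):</p>

```python
# Illustrative sketch of cache blocking (loop tiling) for matrix
# multiplication. The real TensorFlow Lite kernels do this in C++
# with SIMD intrinsics; the block size here is a hypothetical value.
def matmul_blocked(a, b, n, block=4):
    # a and b are n x n matrices as lists of lists; returns a @ b.
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work on one block at a time so the operands stay
                # resident in cache while they are reused.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The same triple-nested loop as before, but iterated block by block, which is what gives the cache reuse described above.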
    <div>
      <h3>TensorFlow Lite with AVX2</h3>
      <a href="#tensorflow-lite-with-avx2">
        
      </a>
    </div>
    <p>TensorFlow Lite does, in fact, support SIMD matrix multiplication – we just need to enable it and re-compile the TensorFlow Lite library:</p>
            <pre><code>if [[ "$(uname -m)" == x86_64* ]]; then
    # On x86_64 target x86-64-v3 CPU to enable AVX2 and FMA.
    arguments+=("--copt=-march=x86-64-v3")
fi</code></pre>
            <p>After running the profiler again using the SIMD-optimized TensorFlow Lite library, we got the following results:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6NmJhJoYG42ZZhU41m0Uj5/1d9fce45d44f98b41375d6a56f1a7cac/unnamed--7--5.png" />
            
            </figure><p>Top operations as per profiler output:</p>
<table><thead>
  <tr>
    <th><span>Function name</span></th>
    <th><span>% Time spent</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>tflite::tensor_utils::SseMatrixBatchVectorMultiplyAccumulateImpl</span></td>
    <td><span>43.01%</span></td>
  </tr>
  <tr>
    <td><span>tflite::tensor_utils::NeonAsymmetricQuantizeFloats</span></td>
    <td><span>22.46%</span></td>
  </tr>
  <tr>
    <td><span>tflite::reference_ops::MaximumMinimumBroadcastSlow</span></td>
    <td><span>7.82%</span></td>
  </tr>
  <tr>
    <td><span>tflite::optimized_ops::SoftmaxImpl</span></td>
    <td><span>6.61%</span></td>
  </tr>
  <tr>
    <td><span>tflite::ops::builtin::elementwise::LogEval</span></td>
    <td><span>4.63%</span></td>
  </tr>
</tbody></table><p>Matrix multiplication now uses <a href="https://github.com/tensorflow/tensorflow/blob/15ec568b5505727c940b651aeb2a9643b504086c/tensorflow/lite/kernels/internal/optimized/sse_tensor_utils.cc#L161-L199">AVX2 instructions</a>, which use 8x8 blocks to multiply and accumulate the results.</p><p>Proportionally, matrix multiplication and <a href="https://www.cloudflare.com/learning/ai/what-is-quantization/">quantization</a> operations take a similar share of time compared to the non-SIMD version; however, in absolute numbers, inference is almost twice as fast with SIMD optimizations enabled:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>SIMD time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>inference/avg-body-1000</span></td>
    <td><span>246.31</span></td>
    <td><span>130.07</span></td>
    <td><span>-47.19% or 1.89x</span></td>
  </tr>
</tbody></table><p>Quite a nice performance boost just from a few lines of build config change!</p>
    <div>
      <h3>TensorFlow Lite with XNNPACK</h3>
      <a href="#tensorflow-lite-with-xnnpack">
        
      </a>
    </div>
    <p>TensorFlow Lite comes with a useful benchmarking tool called <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark">benchmark_model</a>, which also has a built-in profiler.</p><p>The tool can be built locally using the command:</p>
            <pre><code>bazel build -j 4 --copt=-march=native -c opt tensorflow/lite/tools/benchmark:benchmark_model</code></pre>
            <p>After building, benchmarks were run with different settings:</p>
<table><thead>
  <tr>
    <th><span>Benchmark run</span></th>
    <th><span>Inference time, μs</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>benchmark_model --graph=model.tflite --num_runs=100000 --use_xnnpack=false</span></td>
    <td><span>105.61</span></td>
  </tr>
  <tr>
    <td><span>benchmark_model --graph=model.tflite --num_runs=100000 --use_xnnpack=true --xnnpack_force_fp16=true</span></td>
    <td><span>111.95</span></td>
  </tr>
  <tr>
    <td><span>benchmark_model --graph=model.tflite --num_runs=100000 --use_xnnpack=true</span></td>
    <td><span>49.05</span></td>
  </tr>
</tbody></table><p>TensorFlow Lite with XNNPACK enabled emerges as the leader, achieving a ~50% latency reduction compared to the original TensorFlow Lite implementation.</p><p>More technical details about XNNPACK can be found in these blog posts:</p><ul><li><p><a href="https://blog.tensorflow.org/2022/06/Profiling-XNNPACK-with-TFLite.html">Profiling XNNPACK with TFLite</a></p></li><li><p><a href="https://blog.tensorflow.org/2024/04/faster-dynamically-quantized-inference-with-xnnpack.html">Faster Dynamically Quantized Inference with XNNPack</a></p></li></ul><p>Re-running our benchmarks with XNNPACK enabled, we get the following results:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span><br /><span>TFLite 2.6.0</span></th>
    <th><span>SIMD time, μs</span><br /><span>TFLite 2.6.0</span></th>
    <th><span>SIMD time, μs</span><br /><span>TFLite 2.16.1</span></th>
    <th><span>SIMD + XNNPack time, μs</span><br /><span>TFLite 2.16.1</span></th>
    <th><span>Optimization</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>inference/avg-body-1000</span></td>
    <td><span>246.31</span></td>
    <td><span>130.07</span></td>
    <td><span>115.17</span></td>
    <td><span>56.22</span></td>
    <td><span>-77.17% or 4.38x</span></td>
  </tr>
</tbody></table><p>By upgrading TensorFlow Lite from 2.6.0 to 2.16.1 and enabling SIMD optimizations along with XNNPACK, we were able to decrease WAF ML model inference time more than <b>four-fold</b>, achieving a <b>77.17%</b> reduction.</p>
    <div>
      <h2>Caching inference result</h2>
      <a href="#caching-inference-result">
        
      </a>
    </div>
    <p>While making code faster through pre-processing and inference optimizations is great, it's even better when code doesn't need to run at all. This is where caching comes in. <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's Law</a> tells us that optimizing only parts of a program yields diminishing returns, because the overall speedup is limited by the fraction of time the unoptimized parts still take. By avoiding redundant executions altogether with caching, we can achieve performance gains beyond the limits of traditional code optimization.</p><p>A simple key-value cache would quickly occupy all available memory on the server due to the high cardinality of URLs, HTTP headers, and HTTP bodies. However, because "everything on the Internet has an L-shape", or, more specifically, follows a <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a> distribution, we can optimize our caching strategy.</p><p><a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a> states that in many natural datasets, the frequency of any item is inversely proportional to its rank in the frequency table. In other words, a few items are extremely common, while the majority are rare. By analyzing our request data, we found that URLs, HTTP headers, and even HTTP bodies follow this distribution. For example, here is the user agent header frequency distribution plotted against its rank:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50OWcB7Buza1Jp77ePY75X/e25e66e7665fccc454df026e5ca37729/unnamed--8--3.png" />
            
            </figure><p>By caching the top-N most frequently occurring inputs and their corresponding inference results, we can ensure that both pre-processing and inference are skipped for the majority of requests. This is where the <a href="https://en.wikipedia.org/wiki/Cache_replacement_policies#LRU">Least Recently Used (LRU)</a> cache comes in – frequently used items stay hot in the cache, while the least recently used ones are evicted.</p><p>We use <a href="https://github.com/thibaultcha/lua-resty-mlcache">lua-resty-mlcache</a> as our caching solution, allowing us to share cached inference results between different Nginx workers via a shared memory dictionary. The LRU cache effectively exploits the <a href="https://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff">space-time trade-off</a>, where we trade a small amount of memory for significant CPU time savings.</p><p>This approach enables us to achieve a <b>~70%</b> cache hit ratio, significantly reducing latency further, as we will analyze in the final section below.</p>
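<p>To see why a modest top-N cache can absorb most requests under a Zipf distribution, consider this illustrative sketch (the item count and cache size here are hypothetical examples, not our production numbers):</p>

```python
# Under Zipf's law the item at rank r occurs with frequency ~ 1/r^s,
# so a cache holding only the top-N items covers a large share of all
# requests. Item count and cache size below are hypothetical.
def zipf_cache_hit_ratio(num_items: int, cache_size: int, s: float = 1.0) -> float:
    weights = [1.0 / rank ** s for rank in range(1, num_items + 1)]
    return sum(weights[:cache_size]) / sum(weights)

# Caching just 1% of 1,000,000 distinct items captures roughly 68%
# of requests with these parameters.
print(f"{zipf_cache_hit_ratio(1_000_000, 10_000):.0%}")
```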
    <div>
      <h2>Optimization results</h2>
      <a href="#optimization-results">
        
      </a>
    </div>
    <p>The optimizations discussed in this post were rolled out in several phases to ensure system correctness and stability.</p><p>First, we enabled SIMD optimizations for TensorFlow Lite, which reduced WAF ML total execution time by approximately <b>41.80%,</b> decreasing from <b>1519</b> ➔ <b>884 μs</b> on average.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15SMdamloYpjyUy5ZwH9o0/ab4ec787f870d27a45ff513db1f696c8/unnamed--9--3.png" />
            
            </figure><p>Next, we upgraded TensorFlow Lite from version 2.6.0 to 2.16.1, enabled XNNPACK, and implemented pre-processing optimizations. This further reduced WAF ML total execution time by <b>~40.77%</b>, bringing it down from <b>932</b> ➔ <b>552 μs</b> on average. The starting average of 932 μs was slightly higher than the previous 884 μs because more customers had begun using the feature in the months between the two changes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/01EuBB1eVopVjUjWsVvwrK/0d908285bd4296d75f5c98918cf1a561/unnamed--10--3.png" />
            
            </figure><p>Lastly, we introduced LRU caching, which led to an additional reduction in WAF ML total execution time by <b>~50.18%</b>, from <b>552</b> ➔ <b>275 μs</b> on average.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6epSwp5jz4ZMaVfdwahZnN/8c23c1b6a90bf301f9e7f6566d4f3295/unnamed--11--3.png" />
            
            </figure><p>Overall, we cut WAF ML execution time by <b>~81.90%</b>, decreasing from <b>1519</b> ➔ <b>275 μs</b>, or <b>5.5x</b> faster!</p><p>To illustrate the significance of this: with Cloudflare’s average rate of 9.5 million requests per second passing through WAF ML, saving <b>1244 microseconds</b> per request equates to saving ~<b>32 years</b> of processing time every single day! That’s in addition to the savings of <b>523 microseconds</b> per request or <b>65 years</b> of processing time per day demonstrated last year in our <a href="/scalable-machine-learning-at-cloudflare">Every request, every microsecond: scalable machine learning at Cloudflare</a> post about our Bot Management product.</p>
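<p>These headline numbers can be verified with a quick back-of-the-envelope calculation:</p>

```python
# Back-of-the-envelope check of the processing-time savings.
requests_per_second = 9_500_000   # average rate through WAF ML
saved_us_per_request = 1244       # 1519 -> 275 microseconds per request

seconds_saved_per_day = requests_per_second * saved_us_per_request / 1e6 * 86_400
years_saved_per_day = seconds_saved_per_day / (365 * 24 * 3600)
print(f"~{years_saved_per_day:.0f} years of processing time per day")  # ~32 years
```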
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We hope you enjoyed reading about how we made our WAF ML models go brrr, just as much as we enjoyed implementing these optimizations to bring scalable WAF ML to more customers on a truly global scale.</p><p>Looking ahead, we are developing even more sophisticated ML security models. These advancements aim to bring our <a href="https://www.cloudflare.com/application-services/products/waf/">WAF</a> and <a href="https://www.cloudflare.com/application-services/products/bot-management/">Bot Management</a> products to the next level, making them even more useful and effective for our customers.</p> ]]></content:encoded>
            <category><![CDATA[Machine Learning]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[AI WAF]]></category>
            <category><![CDATA[WAF Attack Score]]></category>
            <guid isPermaLink="false">6y0Im81Uj2lKntznfYHfUY</guid>
            <dc:creator>Alex Bocharov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Optimizing TCP for high WAN throughput while preserving low latency]]></title>
            <link>https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/</link>
            <pubDate>Fri, 01 Jul 2022 13:00:01 GMT</pubDate>
            <description><![CDATA[ In this post, we describe how we modified the Linux kernel to optimize for both low latency and high throughput concurrently ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Here at Cloudflare we're constantly working on improving our service. Our engineers are looking at hundreds of parameters of our traffic, making sure that we get better all the time.</p><p>One of the core numbers we keep a close eye on is HTTP request latency, which is important for many of our products. We regard latency spikes as bugs to be fixed. One example is the 2017 story of <a href="/the-sad-state-of-linux-socket-balancing/">"Why does one NGINX worker take all the load?"</a>, where we optimized our TCP Accept queues to improve overall latency of TCP sockets waiting for accept().</p><p>Performance tuning is a holistic endeavor, and we monitor and continuously improve a range of other performance metrics as well, including throughput. Sometimes, tradeoffs have to be made. Such a case occurred in 2015, when a latency spike was discovered in our processing of HTTP requests. The solution at the time was to set tcp_rmem to 4 MiB, which minimizes the amount of time the kernel spends on TCP collapse processing. It was this collapse processing that was causing the latency spikes. Later in this post we discuss TCP collapse processing in more detail.</p><p>The tradeoff is that using a low value for tcp_rmem limits TCP throughput over high latency links. The following graph shows the maximum throughput as a function of network latency for a window size of 2 MiB. Note that the 2 MiB corresponds to a tcp_rmem value of 4 MiB due to the tcp_adv_win_scale setting in effect at the time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1mRCX3pD7U0Yzn1QeHDDhL/cef49d60e7e32c23d364cbbc65dd50e7/image10-5.png" />
            
            </figure><p>For the Cloudflare products then in existence, this was not a major problem, as connections terminate and content is served from nearby servers due to our BGP anycast routing.</p><p>Since then, we have added new products, such as Magic WAN, WARP, Spectrum, Gateway, and others. These represent new types of use cases and traffic flows.</p><p>For example, imagine you're a typical Magic WAN customer. You have connected all of your worldwide offices together using the Cloudflare global network. While Time to First Byte still matters, Magic WAN office-to-office traffic also needs good throughput. For example, a lot of traffic over these corporate connections will be file sharing using protocols such as SMB. These are <a href="https://en.wikipedia.org/wiki/Elephant_flow">elephant flows</a> over <a href="https://datatracker.ietf.org/doc/html/rfc1072">long fat networks</a>. Throughput is the metric every eyeball watches as they are downloading files.</p><p>We need to continue to provide world-class low latency while simultaneously providing high throughput over high-latency connections.</p><p>Before we begin, let’s introduce the players in our game.</p><p><b>TCP receive window</b> is the maximum number of unacknowledged user payload bytes the sender should transmit (bytes-in-flight) at any point in time. The size of the receive window can and does go up and down during the course of a TCP session. It is a mechanism whereby the receiver can tell the sender to stop sending if the sent packets cannot be successfully received because the receive buffers are full. It is this receive window that often limits throughput over high-latency networks.</p><p><b>net.ipv4.tcp_adv_win_scale</b> is a (non-intuitive) number used to account for the overhead needed by Linux to process packets. The receive window is specified in terms of user payload bytes. 
Linux needs additional memory beyond that to track other data associated with packets it is processing.</p><p>The value of the receive window changes during the lifetime of a TCP session, depending on a number of factors. The maximum value that the receive window can be is limited by the amount of free memory available in the receive buffer, according to this table:</p><table><tr><td><p>tcp_adv_win_scale</p></td><td><p>TCP window size</p></td></tr><tr><td><p>4</p></td><td><p>15/16 * available memory in receive buffer</p></td></tr><tr><td><p>3</p></td><td><p>⅞ * available memory in receive buffer</p></td></tr><tr><td><p>2</p></td><td><p>¾ * available memory in receive buffer</p></td></tr><tr><td><p>1</p></td><td><p>½ * available memory in receive buffer</p></td></tr><tr><td><p>0</p></td><td><p>available memory in receive buffer</p></td></tr><tr><td><p>-1</p></td><td><p>½ * available memory in receive buffer</p></td></tr><tr><td><p>-2</p></td><td><p>¼ * available memory in receive buffer</p></td></tr><tr><td><p>-3</p></td><td><p>⅛ * available memory in receive buffer</p></td></tr></table><p>We can intuitively (and correctly) understand that the amount of available memory in the receive buffer is the difference between the used memory and the maximum limit. But what is the maximum size a receive buffer can be? The answer is sk_rcvbuf.</p><p><b>sk_rcvbuf</b> is a per-socket field that specifies the maximum amount of memory that a receive buffer can allocate. This can be set programmatically with the socket option SO_RCVBUF. This can sometimes be useful to do, for localhost TCP sessions, for example, but in general the use of SO_RCVBUF is not recommended.</p><p>So how is sk_rcvbuf set? The most appropriate value for that depends on the latency of the TCP session and other factors. This makes it difficult for L7 applications to know how to set these values correctly, as they will be different for every TCP session. The solution to this problem is Linux autotuning.</p>
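<p>The tcp_adv_win_scale table above can be reproduced with a small sketch modeled on the kernel's tcp_win_from_space() logic (a simplified illustration, not the kernel source):</p>

```python
# Sketch of how the receive window is derived from available buffer
# space, modeled on the kernel's tcp_win_from_space() logic that
# produces the tcp_adv_win_scale table above.
def win_from_space(space: int, tcp_adv_win_scale: int) -> int:
    if tcp_adv_win_scale <= 0:
        # Negative (or zero) scale: window is space >> |scale|.
        return space >> -tcp_adv_win_scale
    # Positive scale: window is space minus space >> scale.
    return space - (space >> tcp_adv_win_scale)

# E.g. with a 512 MiB buffer and tcp_adv_win_scale = -2, the window
# is 1/4 of the buffer: 128 MiB.
print(win_from_space(512 * 1024 * 1024, -2) // (1024 * 1024))  # 128
```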
    <div>
      <h2>Linux autotuning</h2>
      <a href="#linux-autotuning">
        
      </a>
    </div>
    <p>Linux autotuning is logic in the Linux kernel that adjusts the buffer size limits and the receive window based on actual packet processing. It takes into consideration a number of things including TCP session <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a>, L7 read rates, and the amount of available host memory.</p><p>Autotuning can sometimes seem mysterious, but it is actually fairly straightforward.</p><p>The central idea is that Linux can track the rate at which the local application is reading data off of the receive queue. It also knows the session RTT. Because Linux knows these things, it can automatically increase the buffers and receive window until it reaches the point at which the application layer or network bottleneck links are the constraint on throughput (and not host buffer settings). At the same time, autotuning prevents slow local readers from having excessively large receive queues. The way autotuning does that is by limiting the receive window and its corresponding receive buffer to an appropriate size for each socket.</p><p>The values set by autotuning can be seen via the Linux “<code>ss</code>” command from the <code>iproute</code> package (e.g. “<code>ss -tmi</code>”).  The relevant output fields from that command are:</p><p><b>Recv-Q</b> is the number of user payload bytes not yet read by the local application.</p><p><b>rcv_ssthresh</b> is the window clamp, a.k.a. the maximum receive window size. This value is not known to the sender. The sender receives only the current window size, via the TCP header field. A closely-related field in the kernel, tp-&gt;window_clamp, is the maximum window size allowable based on the amount of available memory. 
</p><p><b>skmem_r</b> is the actual amount of memory that is allocated, which includes not only user payload (Recv-Q) but also additional memory needed by Linux to process the packet (packet metadata). This is known within the kernel as sk_rmem_alloc.</p><p>Note that there are other buffers associated with a socket, so skmem_r does not represent the total memory that a socket might have allocated. Those other buffers are not involved in the issues presented in this post.</p><p><b>skmem_rb</b> is the maximum amount of memory that could be allocated by the socket for the receive buffer. This is higher than rcv_ssthresh to account for memory needed for packet processing that is not packet data. Autotuning can increase this value (up to tcp_rmem max) based on how fast the L7 application is able to read data from the socket and the RTT of the session. This is known within the kernel as sk_rcvbuf.</p><p><b>rcv_space</b> is the high water mark of the rate of the local application reading from the receive buffer during any RTT. This is used internally within the kernel to adjust sk_rcvbuf.</p><p>Earlier we mentioned a setting called tcp_rmem. <b>net.ipv4.tcp_rmem</b> consists of three values, but in this document we are always referring to the third value (except where noted). It is a global setting that specifies the maximum amount of memory that any TCP receive buffer can allocate, i.e. the maximum permissible value that autotuning can use for sk_rcvbuf. This is essentially just a failsafe for autotuning, and under normal circumstances should play only a minor role in TCP memory management.</p><p>It’s worth mentioning that receive buffer memory is not preallocated. Memory is allocated based on actual packets arriving and sitting in the receive queue. It’s also important to realize that filling up a receive queue is not one of the criteria that autotuning uses to increase sk_rcvbuf. 
Indeed, preventing this type of excessive buffering (<a href="https://en.wikipedia.org/wiki/Bufferbloat">bufferbloat</a>) is one of the benefits of autotuning.</p>
    <div>
      <h2>What’s the problem?</h2>
      <a href="#whats-the-problem">
        
      </a>
    </div>
    <p>The problem is that we must have a large TCP receive window for high <a href="https://en.wikipedia.org/wiki/Bandwidth-delay_product">BDP</a> sessions. This is directly at odds with the latency spike problem mentioned above.</p><p>Something has to give. The laws of physics (speed of light in glass, etc.) dictate that we must use large window sizes. There is no way to get around that. So we are forced to solve the latency spikes differently.</p>
    <div>
      <h2>A brief recap of the latency spike problem</h2>
      <a href="#a-brief-recap-of-the-latency-spike-problem">
        
      </a>
    </div>
    <p>Sometimes a TCP session will fill up its receive buffers. When that happens, the Linux kernel will attempt to reduce the amount of memory the receive queue is using by performing what amounts to a “defragmentation” of memory. This is called collapsing the queue. Collapsing the queue takes time, which is what drives up HTTP request latency.</p><p>We do not want to spend time collapsing TCP queues.</p><p>Why do receive queues fill up to the point where they hit the maximum memory limit? The usual situation is when the local application starts out reading data from the receive queue at one rate (triggering autotuning to raise the max receive window), followed by the local application slowing down its reading from the receive queue. This is valid behavior, and we need to handle it correctly.</p>
    <div>
      <h2>Selecting sysctl values</h2>
      <a href="#selecting-sysctl-values">
        
      </a>
    </div>
    <p>Before exploring solutions, let’s first decide what we need as the maximum TCP window size.</p><p>As we have seen above in the discussion about BDP, the window size is determined based upon the RTT and desired throughput of the connection.</p><p>Because Linux autotuning will adjust correctly for sessions with lower RTTs and bottleneck links with lower throughput, all we need to be concerned about are the maximums.</p><p>For latency, we have chosen 300 ms as the maximum expected latency, as that is the measured latency between our Zurich and Sydney facilities. It seems reasonable enough as a worst-case latency under normal circumstances.</p><p>For throughput, although we have very fast and modern hardware on the Cloudflare global network, we don’t expect a single TCP session to saturate the hardware. We have arbitrarily chosen 3500 mbps as the highest supported throughput for our highest latency TCP sessions.</p><p>The calculation for those numbers results in a BDP of 131MB, which we round to the more aesthetic value of 128 MiB.</p><p>Recall that allocation of TCP memory includes metadata overhead in addition to packet data. The ratio of actual amount of memory allocated to user payload size varies, depending on NIC driver settings, packet size, and other factors. For full-sized packets on some of our hardware, we have measured average allocations up to 3 times the packet data size. In order to reduce the frequency of TCP collapse on our servers, we set tcp_adv_win_scale to -2. From the table above, we know that the max window size will be ¼ of the max buffer space.</p><p>We end up with the following sysctl values:</p>
            <pre><code>net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2</code></pre>
            <p>A tcp_rmem of 512MiB and tcp_adv_win_scale of -2 results in a maximum window size that autotuning can set of 128 MiB, our desired value.</p>
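<p>The arithmetic behind these values can be double-checked with a short sketch:</p>

```python
# Worked example of the window-size calculation behind the sysctl values.
rtt_s = 0.300            # maximum expected RTT (Zurich <-> Sydney)
throughput_bps = 3500e6  # chosen maximum per-session throughput, 3500 mbps

# Bandwidth-delay product: bytes in flight needed to sustain that rate.
bdp_bytes = throughput_bps / 8 * rtt_s   # 131,250,000 bytes, ~131 MB
window_bytes = 128 * 1024 * 1024         # rounded to 128 MiB

# With tcp_adv_win_scale = -2 the window is 1/4 of the buffer, so the
# tcp_rmem maximum must be 4x the desired window size.
tcp_rmem_max = 4 * window_bytes
print(tcp_rmem_max)  # 536870912, the third tcp_rmem value above
```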
    <div>
      <h2>Disabling TCP collapse</h2>
      <a href="#disabling-tcp-collapse">
        
      </a>
    </div>
    <p>Patient: Doctor, it hurts when we collapse the TCP receive queue.</p><p>Doctor: Then don’t do that!</p><p>Generally speaking, when a packet arrives at a buffer when the buffer is full, the packet gets dropped. In the case of these receive buffers, Linux tries to “save the packet” when the buffer is full by collapsing the receive queue. Frequently this is successful, but it is not guaranteed to be, and it takes time.</p><p>There are no problems created by immediately just dropping the packet instead of trying to save it. The receive queue is full anyway, so the local receiver application still has data to read. The sender’s congestion control will notice the drop and/or ZeroWindow and will respond appropriately. Everything will continue working as designed.</p><p>At present, there is no setting provided by Linux to disable the TCP collapse. We developed an in-house patch to the kernel to disable the TCP collapse logic.</p>
    <div>
      <h2>Kernel patch – Attempt #1</h2>
      <a href="#kernel-patch-attempt-1">
        
      </a>
    </div>
    <p>The kernel patch for our first attempt was straightforward. At the top of tcp_try_rmem_schedule(), if the memory allocation fails, we simply return (after pred_flag = 0 and tcp_sack_reset()), thus completely skipping the tcp_collapse and related logic.</p><p>It didn’t work.</p><p>Although we eliminated the latency spikes while using large buffer limits, we did not observe the throughput we expected.</p><p>One of the realizations we made as we investigated the situation was that standard network benchmarking tools such as iperf3 and similar do not expose the problem we are trying to solve. iperf3 does not fill the receive queue. Linux autotuning does not open the TCP window large enough. Autotuning is working perfectly for our well-behaved benchmarking program.</p><p>We need application-layer software that is slightly less well-behaved, one that exercises the autotuning logic under test. So we wrote one.</p>
    <div>
      <h2>A new benchmarking tool</h2>
      <a href="#a-new-benchmarking-tool">
        
      </a>
    </div>
    <p>Anomalies were seen during our “Attempt #1” that negatively impacted throughput. The anomalies were seen only under certain specific conditions, and we realized we needed a better benchmarking tool to detect and measure the performance impact of those anomalies.</p><p>This tool has turned into an invaluable resource during the development of this patch and raised confidence in our solution.</p><p>It consists of two Python programs. The reader opens a TCP session to the daemon, at which point the daemon starts sending user payload as fast as it can, and never stops sending.</p><p>The reader, on the other hand, starts and stops reading in a way that opens the TCP receive window wide and then repeatedly causes the buffers to fill up completely. More specifically, the reader implements this logic:</p><ol><li><p>reads as fast as it can, for five seconds</p><ul><li><p>this is called fast mode</p></li><li><p>opens up the window</p></li></ul></li><li><p>calculates 5% of the high watermark of bytes read during any previous one-second interval</p></li><li><p>for each second of the next 15 seconds:</p><ul><li><p>this is called slow mode</p></li><li><p>reads that 5% number of bytes, then stops reading</p></li><li><p>sleeps for the remainder of that particular second</p></li><li><p>most of the second consists of no reading at all</p></li></ul></li><li><p>steps 1-3 are repeated in a loop three times, so the entire run is 60 seconds</p></li></ol><p>This has the effect of highlighting any issues in the handling of packets when the buffers repeatedly hit the limit.</p>
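<p>In Python, the reader's loop might look roughly like this (a simplified sketch of the logic above, not the actual tool; parameter names and buffer sizes are illustrative):</p>

```python
import time

def oscillating_reader(recv, cycles=3, fast_seconds=5, slow_seconds=15):
    """Simplified sketch of the benchmark reader described above.
    `recv(n)` behaves like socket.recv(): it returns up to n bytes."""
    for _ in range(cycles):
        # Fast mode: read flat out, tracking the highest number of
        # bytes read during any single second. This opens the window.
        high_watermark = 0
        for _ in range(fast_seconds):
            deadline = time.monotonic() + 1
            read_this_second = 0
            while time.monotonic() < deadline:
                read_this_second += len(recv(65536))
            high_watermark = max(high_watermark, read_this_second)

        # Slow mode: each second, read only 5% of the high watermark,
        # then stop reading and sleep out the rest of the second, so
        # the receive buffers repeatedly fill up completely.
        budget = int(high_watermark * 0.05)
        for _ in range(slow_seconds):
            deadline = time.monotonic() + 1
            remaining = budget
            while remaining > 0:
                remaining -= len(recv(min(65536, remaining)))
            time.sleep(max(0.0, deadline - time.monotonic()))
```

In the real tool the daemon side would keep writing to the socket for the whole run; the durations here are parameters only so the sketch is easy to exercise.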
    <div>
      <h2>Revisiting default Linux behavior</h2>
      <a href="#revisiting-default-linux-behavior">
        
      </a>
    </div>
    <p>Taking a step back, let’s look at the default Linux behavior. The following is kernel v5.15.16.</p><table><tr><td><p>NIC speed (mbps)</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse</p></td><td><p>TCP window (MiB)</p></td><td><p>buffer metadata to user payload ratio</p></td><td><p>Prune Called</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>Test Result</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>0</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>256</p></td><td><p>1</p></td><td><p>0</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>170</p></td><td><p>2</p></td><td><p>0</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>24</p></td><td><p>490K</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>146</p></td><td><p>3</p></td><td><p>0</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>57</p></td><td><p>616K</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>137</p></td><td><p>4</p></td><td><p>0</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>74</p></td><td><p>803K</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr></table><p>The Linux kernel is effective at freeing up space in order to make room for incoming packets when the receive buffer memory limit is hit. As documented previously, the cost for saving these packets (i.e. 
not dropping them) is latency.</p><p>However, the latency spikes, in <i>milliseconds</i>, for tcp_try_rmem_schedule(), are:</p><p>tcp_rmem 170 MiB, tcp_adv_win_scale +2 (170p2):</p>
            <pre><code>@ms:
[0]       27093 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[1]           0 |
[2, 4)        0 |
[4, 8)        0 |
[8, 16)       0 |
[16, 32)      0 |
[32, 64)     16 |</code></pre>
            <p>tcp_rmem 146 MiB, tcp_adv_win_scale +3 (146p3):</p>
            <pre><code>@ms:
(..., 16)  25984 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16, 20)       0 |
[20, 24)       0 |
[24, 28)       0 |
[28, 32)       0 |
[32, 36)       0 |
[36, 40)       0 |
[40, 44)       1 |
[44, 48)       6 |
[48, 52)       6 |
[52, 56)       3 |</code></pre>
            <p>tcp_rmem 137 MiB, tcp_adv_win_scale +4 (137p4):</p>
            <pre><code>@ms:
(..., 16)  37222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16, 20)       0 |
[20, 24)       0 |
[24, 28)       0 |
[28, 32)       0 |
[32, 36)       0 |
[36, 40)       1 |
[40, 44)       8 |
[44, 48)       2 |</code></pre>
            <p>These are the latency spikes we cannot have on the Cloudflare global network.</p>
    <div>
      <h2>Kernel patch – Attempt #2</h2>
      <a href="#kernel-patch-attempt-2">
        
      </a>
    </div>
    <p>So the “something” that was not working in Attempt #1 was that the receive queue memory limit was hit early on as the flow was just ramping up (when the values for sk_rmem_alloc and sk_rcvbuf were small, ~800KB). This occurred at about the two second mark for 137p4 test (about 2.25 seconds for 170p2).</p><p>In hindsight, we should have noticed that tcp_prune_queue() actually raises sk_rcvbuf when it can. So we modified the patch in response to that, added a guard to allow the collapse to execute when sk_rmem_alloc is less than the threshold value.</p><p><code>net.ipv4.tcp_collapse_max_bytes = 6291456</code></p><p>The next section discusses how we arrived at this value for tcp_collapse_max_bytes.</p><p>The patch is available <a href="https://github.com/cloudflare/linux/blob/master/patches/0014-add-a-sysctl-to-enable-disable-tcp_collapse-logic.patch">here</a>.</p><p>The results with the new patch are as follows:</p><p>oscil – 300ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>12</p></td><td><p>1-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>11</p></td><td><p>1-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>9</p></td><td><p>86</p></td><td><p>11</p></td><td><p>1-941</p><p>36-605</p><p>1-298</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>7</p></td><td><p>1550</p></td><td><p>16</p></td><td><p>1-940</p><p>2-82</p><p>292-395</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>10</p></td><td><p>3020</p></td><td><p>9</p></td><td><p>1-940</p><p>2-13</p><p>13-33</p></td></tr></table><p>oscil – 20ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>13</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>13</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>8</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>4</p></td><td><p>196</p></td><td><p>12</p></td><td><p>795-941</p><p>13-941</p><p>941</p></td></tr></table><p>oscil – 0ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>9</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>22</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>8</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>10</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>10</p></td><td><p>941</p><p>941</p><p>941</p></td></tr></table><p>iperf3 – 300 ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>9</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>11</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr></table><p>iperf3 – 20 ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>15</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr></table><p>iperf3 – 0ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>14</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr></table><p>All tests are successful.</p>
    <div>
      <h2>Setting tcp_collapse_max_bytes</h2>
      <a href="#setting-tcp_collapse_max_bytes">
        
      </a>
    </div>
    <p>In order to determine this setting, we need to understand the largest queue we <i>can</i> collapse without incurring unacceptable latency.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PfP9U02g39dedYXSqz0H2/8b61121e3ffe7f102fea0d682d902c56/image8-12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/14e9NHQT0aciu7bJvpLyIM/5df55a86e2a24c7c7501c2c80a38e847/image7-13.png" />
            
            </figure><p>Using 6 MiB should result in a maximum latency of no more than 2 ms.</p>
    <div>
      <h2>Cloudflare production network results</h2>
      <a href="#cloudflare-production-network-results">
        
      </a>
    </div>
    
    <div>
      <h3>Current production settings (“Old”)</h3>
      <a href="#current-production-settings-old">
        
      </a>
    </div>
    
            <pre><code>net.ipv4.tcp_rmem = 8192 2097152 16777216
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_collapse_max_bytes = 0
net.ipv4.tcp_notsent_lowat = 4294967295</code></pre>
            <p>tcp_collapse_max_bytes of 0 means that the custom feature is disabled and that the vanilla kernel logic is used for TCP collapse processing.</p>
    <div>
      <h3>New settings under test (“New”)</h3>
      <a href="#new-settings-under-test-new">
        
      </a>
    </div>
    
            <pre><code>net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_collapse_max_bytes = 6291456
net.ipv4.tcp_notsent_lowat = 131072</code></pre>
            <p>The tcp_notsent_lowat setting is discussed in the last section of this post.</p><p>The middle value of tcp_rmem was changed as a result of separate work that found that Linux autotuning was setting receive buffers too high for localhost sessions. This updated setting reduces TCP memory usage for those sessions, but does not change anything about the type of TCP sessions that is the focus of this post.</p><p>For the following benchmarks, we used non-Cloudflare host machines in Iowa, US, and Melbourne, Australia, performing data transfers to the Cloudflare data center in Marseille, France. In Marseille, we have some hosts configured with the existing production settings, and others with the system settings described in this post. Software used is iperf3 version 3.9 on kernel 5.15.32.</p>
    <div>
      <h3>Throughput results</h3>
      <a href="#throughput-results">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3GkgLi4X0vuXMl7rOEWvZK/e64146a2a197174ab5e563b34b8ba9e8/image3-36.png" />
            
            </figure><table><tr><td><p>
</p></td><td><p>RTT</p><p>(ms)</p></td><td><p>Throughput with Current Settings</p><p>(mbps)</p></td><td><p>Throughput with</p><p>New Settings</p><p>(mbps)</p></td><td><p>Increase</p><p>Factor</p></td></tr><tr><td><p>Iowa to</p><p>Marseille</p></td><td><p>121 </p></td><td><p>276</p></td><td><p>6600</p></td><td><p>24x</p></td></tr><tr><td><p>Melbourne to Marseille</p></td><td><p>282</p></td><td><p>120</p></td><td><p>3800</p></td><td><p>32x</p></td></tr></table><p><b>Iowa-Marseille throughput</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WvQmKxKD8buJU17GU15Kw/798330bea25ad3dedfe5d1beb8578363/image6-16.png" />
            
            </figure><p><b>Iowa-Marseille receive window and bytes-in-flight</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2V2iFmp0s1h8vH4SyrzFQS/5b1118af5ff1ee2116aacfe7c9d60927/image2-51.png" />
            
            </figure><p><b>Melbourne-Marseille throughput</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PnqeHPMJbmCuyGvt9UjtA/037328ab56686611fce568409e0c8336/image9-10.png" />
            
            </figure><p><b>Melbourne-Marseille receive window and bytes-in-flight</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4H5cZs3oyG7KiTKL9m4b7R/ff9563e9d1a69da0adb653587dd1f336/image5-21.png" />
            
            </figure><p>Even with the new settings in place, the Melbourne to Marseille performance is limited by the receive window on the Cloudflare host. This means that further adjustments to these settings could yield even higher throughput.</p>
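<p>The window-limited claim can be checked with the bandwidth-delay product: sustaining a given throughput requires throughput × RTT bytes in flight. A quick Python check (the function name is ours):</p>

```python
def window_needed_mib(throughput_mbps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to sustain
    the given throughput at the given round-trip time, in MiB."""
    bits_in_flight = throughput_mbps * 1e6 * (rtt_ms / 1e3)
    return bits_in_flight / 8 / 2**20

# Melbourne -> Marseille: 3800 mbps at 282 ms RTT needs ~128 MiB in
# flight, exactly the advertised TCP window -- the flow is window-limited.
print(round(window_needed_mib(3800, 282)))   # 128
# Iowa -> Marseille: 6600 mbps at 121 ms needs ~95 MiB, inside the window.
print(round(window_needed_mib(6600, 121)))   # 95
```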
    <div>
      <h3>Latency results</h3>
      <a href="#latency-results">
        
      </a>
    </div>
    <p>The Y-axis on these charts is the 99th percentile time for TCP collapse, in seconds.</p><p><b>Cloudflare hosts in Marseille running the current production settings</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jxkFikBYH12eNxfHcHwXR/c8c3e183a28b84a4f93bc725309d5eff/image11-4.png" />
            
            </figure><p><b>Cloudflare hosts in Marseille running the new settings</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6dqk8mtvEpghINGK0bfKUz/cec115db36d4a927ca4764bfe022cad8/image1-59.png" />
            
            </figure><p>The takeaway from these graphs is that the maximum TCP collapse time with the new settings is no worse than with the current production settings. This is the desired result.</p>
    <div>
      <h3>Send Buffers</h3>
      <a href="#send-buffers">
        
      </a>
    </div>
    <p>What we have shown so far is that the receiver side seems to be working well, but what about the sender side?</p><p>As part of this work, we are setting tcp_wmem max to 512 MiB. For oscillating reader flows, this can cause the send buffer to become quite large. This represents bufferbloat and wasted kernel memory, both things that nobody likes or wants.</p><p>Fortunately, there is already a solution: <b>tcp_notsent_lowat</b>. This setting limits the size of unsent bytes in the write queue. More details can be found at <a href="https://lwn.net/Articles/560082/">https://lwn.net/Articles/560082</a>.</p><p>The results are significant:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yXWQ4mWrhKWV8Tzv0bUCb/75339710d08b943a8cb884a68377d4ec/image4-29.png" />
            
            </figure><p>The RTT for these tests was 466ms. Throughput is not negatively affected; it remains at full wire speed in all cases (1 Gbps). Memory usage is as reported by /proc/net/sockstat, TCP mem.</p><p>Our web servers already set tcp_notsent_lowat to 131072 for their sockets. All other senders are using 4 GiB, the default value. We are changing the sysctl so that 131072 is in effect for all senders running on the server.</p>
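<p>Applications can also set this per socket instead of relying on the sysctl default. A sketch in Python (TCP_NOTSENT_LOWAT is option 25 in the Linux UAPI headers; newer Python versions expose it as <code>socket.TCP_NOTSENT_LOWAT</code>):</p>

```python
import socket

# Fall back to the Linux UAPI value if this Python doesn't expose it.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

def limit_unsent(sock, lowat=131072):
    """Cap the unsent bytes queued in this socket's write buffer,
    then read the value back."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, lowat)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(limit_unsent(s))   # 131072 on Linux
s.close()
```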
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The goal of this work is to open the throughput floodgates for high BDP connections while simultaneously ensuring very low HTTP request latency.</p><p>We have accomplished that goal.</p><p><i>We protect </i><a href="https://www.cloudflare.com/network-services/"><i>entire corporate networks</i></a><i>, help customers build </i><a href="https://workers.cloudflare.com/"><i>Internet-scale applications efficiently</i></a><i>, accelerate any </i><a href="https://www.cloudflare.com/performance/accelerate-internet-applications/"><i>website or Internet application</i></a><i>, ward off </i><a href="https://www.cloudflare.com/ddos/"><i>DDoS attacks</i></a><i>, keep </i><a href="https://www.cloudflare.com/application-security/"><i>hackers at bay</i></a><i>, and can help you on </i><a href="https://www.cloudflare.com/products/zero-trust/"><i>your journey to Zero Trust</i></a><i>.</i></p><p><i>Visit </i><a href="https://1.1.1.1/"><i>1.1.1.1</i></a><i> from any device to get started with our free app that makes your Internet faster and safer. To learn more about our mission to help build a better Internet, start </i><a href="https://www.cloudflare.com/learning/what-is-cloudflare/"><i>here</i></a><i>. If you’re looking for a new career direction, check out </i><a href="http://cloudflare.com/careers"><i>our open positions</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Latency]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">BcOgofewzZGwenQrFsVMq</guid>
            <dc:creator>Mike Freemon</dc:creator>
        </item>
        <item>
            <title><![CDATA[Computing Euclidean distance on 144 dimensions]]></title>
            <link>https://blog.cloudflare.com/computing-euclidean-distance-on-144-dimensions/</link>
            <pubDate>Fri, 18 Dec 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Last year we deployed a CSAM image scanning tool. This is so cool! Image processing is always hard, and deploying a real image identification system at a Cloudflare scale is no small achievement! But we hit a problem - the matching algorithm was too slow for our needs. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4rNxjRfxGzhJI3MAVokLgA/8b8be5359634c3c75ecd5cfc696b6072/image1-62.png" />
            
            </figure><p>Late last year I read a blog post about <a href="/the-csam-scanning-tool/">our CSAM image scanning tool</a>. I remember thinking: this is so cool! Image processing is always hard, and deploying a real image identification system at Cloudflare is no small achievement!</p><p>Some time later, I was chatting with Kornel: "We have all the pieces in the image processing pipeline, but we are struggling with the performance of one component." Scaling to Cloudflare needs ain't easy!</p><p>The problem was in the speed of the matching algorithm itself. Let me elaborate. As John explained in his blog post <a href="/the-csam-scanning-tool">on the CSAM Scanning Tool</a>, the image matching algorithm creates a fuzzy hash from a processed image. The hash is exactly 144 bytes long. For example, it might look like this:</p>
            <pre><code>00e308346a494a188e1043333147267a 653a16b94c33417c12b433095c318012
5612442030d14a4ce82c623f4e224733 1dd84436734e4a5d6e25332e507a8218
6e3b89174e30372d</code></pre>
            <p>The hash is designed to be used in a fuzzy matching algorithm that can find "nearby", related images. The specific algorithm is well defined, but making it fast is left to the programmer — and at Cloudflare we need the matching to be done super fast. We want to match thousands of hashes per second, of images passing through our network, against a database of millions of known images. To make this work, we need to seriously optimize the matching algorithm.</p>
    <div>
      <h3>Naive quadratic algorithm</h3>
      <a href="#naive-quadratic-algorithm">
        
      </a>
    </div>
    <p>The first algorithm that comes to mind has <code>O(K*N)</code> complexity: for each query, go through every hash in the database. In a naive implementation, this creates a lot of work. But how much work exactly?</p><p>First, we need to explain how fuzzy matching works.</p><p>Given a query hash, the fuzzy match is the "closest" hash in a database. This requires us to define a distance. We treat each hash as a vector containing 144 numbers, identifying a point in a 144-dimensional space. Given two such points, we can calculate the distance using the standard Euclidean formula.</p><p>For our particular problem, though, we are interested in the "closest" match in a database only if the distance is lower than some predefined threshold. Otherwise, when the distance is large, we can assume the images aren't similar. This is the expected result — most of our queries will not have a related image in the database.</p><p>The <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> equation used by the algorithm is standard:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5p3VcvTiFKgdwx5xLkRRWv/33fa46b931b1764cf6e51df3198e0a23/image3-41.png" />
            
            </figure><p>To calculate the distance between two 144-byte hashes, we take each byte, calculate the delta, square it, sum it to an accumulator, do a square root, and ta-dah! We have the distance!</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive.c#L11-L20">Here's how to count the squared distance in C</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4lwPTgH8PmEYB8mMyOhbsH/f660b81e92c9ff46a422bdbf1af18321/image4-24.png" />
            
            </figure><p>This function returns the squared distance. We avoid computing the actual distance to save us from running the square root function - it's slow. Inside the code, for performance and simplicity, we'll mostly operate on the squared value. We don't need the actual distance value; we just need to find the vector with the smallest one. In our case it doesn't matter whether we compare distances or squared distances!</p><p>As you can see, fuzzy matching is basically a standard problem of finding the closest point in a multi-dimensional space. Surely this has been solved in the past — but let's not jump ahead.</p><p>While this code might be simple, we expect it to be rather slow. Finding the smallest hash distance in a database of, say, 1M entries, would require going over all records, and would need at least:</p><ol><li><p>144 * 1M subtractions</p></li><li><p>144 * 1M multiplications</p></li><li><p>144 * 1M additions</p></li></ol><p>And more. This alone adds up to 432 million operations! How does it look in practice? To illustrate this blog post <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2020-12-mmdist">we prepared a full test suite</a>. The large database of known hashes can be <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/generate.c">well emulated by random data</a>. The query hashes can't be random and must be slightly more sophisticated, otherwise the exercise wouldn't be that interesting. We generated the test hashes by byte-swapping actual data from the database — this allows us to precisely control the distance between test hashes and database hashes. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/gentest.py">Take a look at the scripts for details</a>. Here's our first run of <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive.c">the first, naive, algorithm</a>:</p>
            <pre><code>$ make naive
&lt; test-vector.txt ./mmdist-naive &gt; test-vector.tmp
Total: 85261.833ms, 1536 items, avg 55.509ms per query, 18.015 qps</code></pre>
            <p>We matched 1,536 test hashes against a database of 1 million random vectors in 85 seconds. It took 55ms of CPU time on average to find the closest neighbour. This is rather slow for our needs.</p>
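<p>For reference, the scalar inner loop being timed here can be mirrored in a few lines of Python (an illustrative sketch, not the linked C implementation):</p>

```python
def distance_sqr(a: bytes, b: bytes) -> int:
    """Squared Euclidean distance between two 144-byte fuzzy hashes:
    per-byte delta, squared, summed -- no square root needed."""
    assert len(a) == len(b) == 144
    return sum((x - y) ** 2 for x, y in zip(a, b))

h1 = bytes(144)              # all zeros
h2 = bytes([1] * 144)        # all ones
print(distance_sqr(h1, h2))  # 144: a delta of 1 in each of 144 dimensions
```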
    <div>
      <h3>SIMD for help</h3>
      <a href="#simd-for-help">
        
      </a>
    </div>
    <p>An obvious improvement is to use more complex SIMD instructions. SIMD is a way to instruct the CPU to process multiple data points using one instruction. This is a perfect strategy when dealing with vector problems — as is the case for our task.</p><p>We settled on using AVX2, with 256-bit vectors. We did this for a simple reason — newer AVX versions are not supported by our AMD CPUs. Additionally, in the past, we were <a href="/on-the-dangers-of-intels-frequency-scaling/">not thrilled by the AVX-512 frequency scaling</a>.</p><p>Using AVX2 is easier said than done. There is no single instruction to compute the Euclidean distance between two uint8 vectors! The fastest way we could find of computing the full distance between two 144-byte vectors <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive-avx2.c#L13-L36">with AVX2 is authored</a> by <a href="https://twitter.com/thecomp1ler">Vlad</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7yFtHcZSRvyub5vHtjDXKE/54596189ea43193a5fd51653ec73daea/image2-39.png" />
            
            </figure><p>It’s actually simpler than it looks: load 16 bytes, convert the vector from uint8 to int16, subtract the vectors, store intermediate sums as int32, repeat. At the end, we need four extra instructions to extract the partial sums into the final sum. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive-avx2.c">This AVX2 code</a> improves the performance around 3x:</p>
            <pre><code>$ make naive-avx2 
Total: 25911.126ms, 1536 items, avg 16.869ms per query, 59.280 qps</code></pre>
            <p>We measured 17ms per item, which still falls short of our expectations. Unfortunately, we can't push it much further without major changes. The problem is that this code is limited by memory bandwidth. The measurements come from my Intel i7-5557U CPU, which has a max theoretical memory bandwidth of just 25GB/s. The database of 1 million entries takes 137MiB, so it takes at least 5ms to feed the database to my CPU. With this naive algorithm we won't be able to go below that.</p>
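<p>The back-of-the-envelope numbers behind that floor, checked in Python:</p>

```python
# Streaming the whole database through the CPU once sets a hard lower
# bound on per-query time for any brute-force scan.
db_bytes = 1_000_000 * 144     # 1M hashes, 144 bytes each
bandwidth = 25e9               # ~25 GB/s theoretical memory bandwidth
print(db_bytes / 2**20)              # ~137 MiB for the database
print(db_bytes / bandwidth * 1e3)    # ~5.8 ms lower bound per query
```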
    <div>
      <h3>Vantage Point Tree algorithm</h3>
      <a href="#vantage-point-tree-algorithm">
        
      </a>
    </div>
    <p>Since the naive brute force approach failed, we tried using more sophisticated algorithms. My colleague <a href="https://github.com/kornelski/vpsearch">Kornel Lesiński implemented</a> a super cool <a href="https://en.wikipedia.org/wiki/Vantage-point_tree">Vantage Point algorithm</a>. After a few ups and downs, optimizations and rewrites, we gave up. Our problem turned out to be unusually hard for this kind of algorithm.</p><p>We observed <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">"the curse of dimensionality"</a>. Space partitioning algorithms don't work well in problems with large dimensionality — and in our case, we have an enormous number of 144 dimensions. K-D trees are doomed. Locality-sensitive hashing is also doomed. It's a bizarre situation in which the space is unimaginably vast, but everything is close together. The volume of the space is a 347-digit-long number, but the maximum distance between points is just 3060 (sqrt(255*255*144)).</p><p>Space partitioning algorithms are fast, because they gradually narrow the search space as they get closer to finding the closest point. But in our case, the common query is never close to any point in the set, so the search space can’t be narrowed to a meaningful degree.</p><p>A VP-tree was a promising candidate, because it operates only on distances, subdividing space into near and far partitions, like a binary tree. When it has a close match, it can be very fast, and doesn't need to visit more than <code>O(log(N))</code> nodes. For non-matches, its speed drops dramatically. The algorithm ends up visiting nearly half of the nodes in the tree. Everything is close together in 144 dimensions! Even though the algorithm avoided visiting more than half of the nodes, the per-node cost was higher than in the brute-force scan, so the search ended up being slower overall.</p>
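<p>Both of those numbers are easy to verify:</p>

```python
import math

# Maximum distance: every one of the 144 coordinates differs by 255.
print(math.isqrt(255 * 255 * 144))   # 3060 (exact: 3060**2 == 9363600)

# Decimal digits in the volume of the space, 256**144.
print(len(str(256 ** 144)))          # 347
```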
    <div>
      <h3>Smarter brute force?</h3>
      <a href="#smarter-brute-force">
        
      </a>
    </div>
    <p>This experience got us thinking. Since space partitioning algorithms can't narrow down the search, and still need to go over a very large number of items, maybe we should focus on going over all the hashes, extremely quickly. We must be smarter about memory bandwidth though — it was the limiting factor in the naive brute force approach before.</p><p>Perhaps we don't need to fetch all the data from memory.</p>
    <div>
      <h3>Short distance</h3>
      <a href="#short-distance">
        
      </a>
    </div>
    <p>The breakthrough came from the realization that we don't need to compute the full distance between hashes. Instead, we can compute only a subset of dimensions, say 32 out of the total of 144. If this distance is already large, then there is no need to compute the full one! Computing more dimensions can only increase the Euclidean distance.</p><p>The proposed algorithm works as follows:</p><ol><li><p>Take the query hash and extract a 32-byte short hash from it.</p></li><li><p>Go over all the 1 million 32-byte short hashes from the database. They must be densely packed in memory to allow the CPU to perform good prefetching and avoid reading data we won't need.</p></li><li><p>If the distance of the 32-byte short hash is greater than or equal to the best score so far, move on.</p></li><li><p>Otherwise, investigate the hash thoroughly and compute the full distance.</p></li></ol><p>Even though this algorithm needs to do less arithmetic and memory work, it's not faster than the previous naive one. See <code>make short-avx2</code>. The problem is: we still need to compute a full distance for hashes that are promising, and there are quite a lot of them. Computing the full distance for promising hashes adds enough work, both in ALU and memory latency, to offset the gains of this algorithm.</p><p>There is one detail of our particular application of the image matching problem that will help us a lot moving forward. As we described earlier, the problem is less about finding the closest neighbour and more about proving that a neighbour with a reasonable distance doesn't exist. Remember — in practice, we don't expect to find many matches! We expect almost every image we feed into the algorithm to be unrelated to the image hashes stored in the database.</p><p>It's sufficient for our algorithm to prove that no neighbour exists within a predefined distance threshold. Let's assume we are not interested in hashes more distant than, say, 220, which squared is 48,400. 
<a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-avx2.c">This makes our short-distance algorithm variation</a> work much better:</p>
            <pre><code>$ make short-avx2-threshold
Total: 4994.435ms, 1536 items, avg 3.252ms per query, 307.542 qps</code></pre>
            
    <div>
      <h3>Origin distance variation</h3>
      <a href="#origin-distance-variation">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rp4oarmXrL2JtieXZ4VWB/56dd69e834fa27e7c0c5cb0eb7544a58/image6-11.png" />
            
            </figure><p>At some point, John noted that the threshold allows an additional optimization. We can order the hashes by their distance from some origin point. Given a query hash with origin distance A, we can inspect only hashes whose origin distance lies between A-threshold and A+threshold — by the triangle inequality, anything outside that band can't be within the threshold of the query. This is pretty much how each level of a Vantage Point Tree works, just simplified. This optimization — ordering items in the database by their distance from an origin point — is relatively simple and can help save us a bit of work.</p><p>While great on paper, this method doesn't introduce much gain in practice, as the vectors are not grouped in clusters — they are pretty much random! For the threshold values we are interested in, the origin distance algorithm variation gives us a ~20% speed boost, which is okay but not breathtaking. This change might bring more benefits if we ever decide to reduce the threshold value, so it might be worth doing in a production implementation. However, it doesn't work well with query batching.</p>
    <div>
      <h3>Transposing data for better AVX</h3>
      <a href="#transposing-data-for-better-avx">
        
      </a>
    </div>
    <p>But we're not done with AVX optimizations! The usual problem with AVX is that the instructions don't normally fit a specific problem. Some serious mind twisting is required to adapt the right instruction to the problem, or to reverse the problem so that a specific instruction can be used. AVX2 doesn't have useful "horizontal" uint16 subtract, multiply and add operations. For example, <code>_mm_hadd_epi16</code> exists, but it's slow and cumbersome.</p><p>Instead, we can twist the problem to make use of the fast uint16 operations that are available. For example, we can use:</p><ol><li><p>_mm256_sub_epi16</p></li><li><p>_mm256_mullo_epi16</p></li><li><p>and _mm256_add_epi16.</p></li></ol><p>The <code>add</code> would overflow in our case, but fortunately there is the add-saturate _mm256_adds_epu16.</p><p>The saturated <code>add</code> is great and saves us a conversion to uint32. It just adds a small limitation: the threshold passed to the program (i.e., the max squared distance) must fit into uint16. However, this is fine for us.</p><p>To effectively use these instructions we need to transpose the data in the database. Instead of storing hashes in rows, we can store them in columns:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29kHAIq6T1Mfif5MVP3Jol/7369ea0b4a38a416a4367bb9865996ca/image1-63.png" />
            
            </figure><p>So instead of:</p><ol><li><p>[a1, a2, a3],</p></li><li><p>[b1, b2, b3],</p></li><li><p>[c1, c2, c3],</p></li></ol><p>...</p><p>We can lay it out in memory transposed:</p><ol><li><p>[a1, b1, c1],</p></li><li><p>[a2, b2, c2],</p></li><li><p>[a3, b3, c3],</p></li></ol><p>...</p><p>Now we can load 16 first bytes of hashes using one memory operation. In the next step, we can subtract the first byte of the querying hash using a single instruction, and so on. The algorithm stays exactly the same as defined above; we just make the data easier to load and easier to process for AVX.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-inv-avx2.c#L138-L147">The hot loop code</a> even looks relatively pretty:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Xb2h3rI8mXT1yeiQcVYfn/4a7eb556d7aea0014c3944b301b52b0d/image5-26.png" />
            
            </figure><p>With the well-tuned batch size and short distance size parameters we can <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-inv-avx2.c">see the performance of this algorithm</a>:</p>
            <pre><code>$ make short-inv-avx2
Total: 1118.669ms, 1536 items, avg 0.728ms per query, 1373.062 qps</code></pre>
            <p>Whoa! This is pretty awesome. We started from 55ms per query, and we finished with just 0.73ms. There are further micro-optimizations possible, like memory prefetching or using huge pages to reduce page faults, but they have diminishing returns at this point.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4dBsIg3yHloEmyLqUyfETK/3119d166040ef5d6b96d03acc9873120/image7-8.png" />
            
            </figure><p>Roofline model from Denis Bakhvalov's book‌‌</p><p>If you are interested in architectural tuning such as this, take a look at <a href="https://book.easyperf.net/perf_book">the new performance book by Denis Bakhvalov</a>. It discusses roofline model analysis, which is pretty much what we did here.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2020-12-mmdist">Do take a look at our code</a> and tell us if we missed some optimization!</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>What an optimization journey! We jumped between memory-bottlenecked and ALU-bottlenecked code. We discussed more sophisticated algorithms, but in the end, a brute force algorithm — although tuned — gave us the best results.</p><p>To get even better numbers, I experimented with an Nvidia GPU using CUDA. The CUDA intrinsics like <code>vabsdiff4</code> and <code>dp4a</code> fit the problem perfectly. The V100 gave us some amazing numbers, but I wasn't fully satisfied with it. Considering how many AMD Ryzen cores with AVX2 we can get for the cost of a single server-grade GPU, we leaned towards general-purpose computing for this particular problem.</p><p>This is a great example of the type of complexities we deal with every day. Making even the best technologies work “at Cloudflare scale” requires thinking outside the box. Sometimes we rewrite the solution dozens of times before we find the optimal one. And sometimes we settle on a brute-force algorithm, just very, very optimized.</p><p>The computation of hashes and image matching are challenging problems that require running very CPU-intensive operations. The CPU we have available on the edge is scarce and workloads like this are incredibly expensive. Even with the optimization work described in this blog post, running the CSAM scanner at scale is a challenge and has required a huge engineering effort. And we’re not done! We need to solve more hard problems before we're satisfied. If you want to help, consider <a href="https://www.cloudflare.com/careers/">applying</a>!</p>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed]]></category>
            <guid isPermaLink="false">3xsrCKMSJ12rBqyQDNejcu</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[My internship: Brotli compression using a reduced dictionary]]></title>
            <link>https://blog.cloudflare.com/brotli-compression-using-a-reduced-dictionary/</link>
            <pubDate>Wed, 11 Nov 2020 16:32:39 GMT</pubDate>
            <description><![CDATA[ Using the brotli dictionary to improve compression of web content without sacrificing performance. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://github.com/google/brotli">Brotli</a> is a state-of-the-art lossless compression format, supported by all major browsers. It is capable of achieving considerably better compression ratios than the ubiquitous gzip, and is rapidly gaining in popularity. Cloudflare uses the Google brotli library to dynamically compress web content whenever possible. In 2015, we took <a href="/results-experimenting-brotli/">an in-depth look at how brotli works</a> and its compression advantages.</p><p>One of the more interesting features of the <a href="https://tools.ietf.org/html/rfc7932">brotli file format</a>, in the context of textual web content compression, is the inclusion of a built-in static dictionary. The dictionary is quite large, and in addition to containing various strings in multiple languages, it also supports the option to apply multiple transformations to those words, increasing its versatility.</p><p>The <a href="https://github.com/google/brotli">open-source brotli library</a>, which implements a brotli encoder and decoder, has 11 predefined quality levels for the encoder, with each higher quality level demanding more CPU in exchange for a better compression ratio. The static dictionary feature is used to a limited extent starting with level 5, and to the full extent only at levels 10 and 11, due to the high CPU cost of this feature.</p><p>We improve on the limited dictionary use approach and add optimizations that improve compression at levels 5 through 9 with negligible performance impact when compressing web content.</p>
    <div>
      <h3>Brotli Static Dictionary</h3>
      <a href="#brotli-static-dictionary">
        
      </a>
    </div>
    <p>Brotli primarily uses the LZ77 algorithm to compress its data. Our previous blog post about <a href="/results-experimenting-brotli/">brotli compression</a> provides an introduction.</p><p>To improve compression on text files and web content, brotli also includes a static, predefined dictionary. If a byte sequence cannot be matched with an earlier sequence using LZ77 the encoder will try to match the sequence with a reference to the static dictionary, possibly using one of the multiple transforms. For example, every HTML file contains the opening &lt;html&gt; tag that cannot be compressed with LZ77, as it is unique, but it is contained in the brotli static dictionary and will be replaced by a reference to it. The reference generally takes less space than the sequence itself, which decreases the compressed file size.</p><p>The dictionary contains 13,504 words in six languages, with lengths from 4 to 24 characters. To improve the compression of real-world text and web data, some dictionary words are common phrases ("The current") or strings common in web content (‘type=”text/javascript”’). Unlike usual LZ77 compression, a word from the dictionary can only be matched as a whole. Starting a match in the middle of a dictionary word, ending it before the end of a word or even extending into the next word is not supported by the brotli format.</p><p>Instead, the dictionary supports 120 transforms of dictionary words to support a larger number of matches and find longer matches. The transforms include adding suffixes (“work” becomes “working”), adding prefixes (“book” =&gt; “ the book”), making the first character uppercase ("process" =&gt; "Process"), or converting the whole word to uppercase (“html” =&gt; “HTML”). In addition to transforms that make words longer or capitalize them, the cut transform allows a shortened match (“consistently” =&gt; “consistent”), which makes it possible to find even more matches.</p>
    <div>
      <h3>Methods</h3>
      <a href="#methods">
        
      </a>
    </div>
    <p>With the transforms included, the static dictionary contains 1,633,984 different words – too many for exhaustive search, except when used with the slow brotli compression levels 10 and 11. When used at a lower compression level, brotli either disables the dictionary or only searches through a subset of roughly 5,500 words to find matches in an acceptable time frame. It also only considers matches at positions where no LZ77 match can be found and only uses the cut transform.</p><p>Our approach to the brotli dictionary uses a larger, but more specialized, subset of the dictionary than the default, using more aggressive heuristics to improve the compression ratio with negligible cost to performance. In order to provide a more specialized dictionary, we provide the compressor with a content type hint from our servers, relying on the Content-Type header to tell the compressor if it should use a dictionary for HTML, JavaScript or CSS. The dictionaries can be further refined by colocation language in the future.</p>
    <div>
      <h3>Fast dictionary lookup</h3>
      <a href="#fast-dictionary-lookup">
        
      </a>
    </div>
    <p>To improve compression without sacrificing performance, we needed a fast way to find matches if we want to search the dictionary more thoroughly than brotli does by default. Our approach uses three data structures to find a matching word directly. The radix trie is responsible for finding the word while the hash table and bloom filter are used to speed up the radix trie and quickly eliminate many words that can’t be matched using the dictionary.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3JdGRN9pPZ0GPTRvecnh0P/a3a4bb1289e25385b098c719ad865a1e/image1.png" />
            
            </figure><p>Lookup for a position starting with “type”</p><p>The <b>radix trie</b> easily finds the longest matching word without having to try matching several words. To find the match, we traverse the graph based on the text at the current position and remember the last node with a matching word. The radix trie supports compressed nodes (having more than one character as an edge label), which greatly reduces the number of nodes that need to be traversed for typical dictionary words.</p><p>The radix trie is slowed down by the large number of positions where we can’t find a match. An important finding is that most mismatching strings have a mismatching character in the first four bytes. Even for positions where a match exists, a lot of time is spent traversing nodes for the first four bytes since the nodes close to the tree root usually have many children.</p><p>Luckily, we can use a <b>hash table</b> to look up the node equivalent to four bytes, matching if it exists or reject the possibility of a match. We thus look up the first four bytes of the string, if there is a matching node we traverse the trie from there, which will be fast as each four-byte prefix usually only has a few corresponding dict words. If there is no matching node, there will not be a matching word at this position and we do not need to further consider it.</p><p>While the hash table is designed to reject mismatches quickly and avoid cache misses and high search costs in the trie, it still suffers from similar problems: We might search through several 4-byte prefixes with the hash value of the given position, only to learn that no match can be found. Additionally, hash lookups can be expensive due to cache misses.</p><p>To quickly reject words that do not match the dictionary, but might still cause cache misses, we use a <b>k=1 bloom filter</b> to quickly rule out most non-matching positions. 
In the k=1 case, the filter is simply a lookup table with one bit indicating whether any matching 4-byte prefixes exist for a given hash value. If the bit for the given hash value is 0, there won’t be a match. Since the bloom filter uses at most one bit for each four-byte prefix while the hash table requires 16 bytes, cache misses are much less likely. (The actual size of the structures is a bit different since there are many empty spaces in both structures and the bloom filter has twice as many elements to reject more non-matching positions.)</p><p>This is very useful for performance as a bloom filter lookup requires a single memory access. The bloom filter is designed to be fast and simple, but still rejects more than half of all non-matching positions and thus allows us to save a full hash lookup, which would often mean a cache miss.</p>
    <div>
      <h2>Heuristics</h2>
      <a href="#heuristics">
        
      </a>
    </div>
    <p>To improve the compression ratio without sacrificing performance, we employed a number of heuristics:</p><p><b>Only search the dictionary at some positions.</b> This is also done using the stock dictionary, but we search more aggressively. While the stock dictionary only considers positions where the LZ77 match finder did not find a match, we also consider positions that have a bad match according to the brotli cost model: LZ77 matches that are short or have a long distance between the current position and the reference usually only offer a small compression improvement, so it is worth trying to find a better match in the static dictionary.</p><p><b>Only consider the longest match and then transform it.</b> Instead of finding and transforming all matches at a position, the radix trie only gives us the longest match, which we then transform. This approach results in a vast performance improvement. In most cases, this results in finding the best match.</p><p><b>Only include some transforms.</b> While all transformations can improve the compression ratio, we only included those that work well with the data structures. The suffix transforms can easily be applied after finding a non-transformed match. For the upper case transforms, we include both the non-transformed and the upper case version of a word in the radix trie. The prefix and cut transforms do not play well with the radix trie, so prefix transforms and cuts of more than one byte are not supported.</p>
    <div>
      <h2>Generating the reduced dictionary</h2>
      <a href="#generating-the-reduced-dictionary">
        
      </a>
    </div>
    <p>At low compression levels, brotli searches a subset of ~5,500 out of 13,504 words of the dictionary, negatively impacting compression. To cover the entire dictionary, we would need to store ~31,700 words in the trie considering the upper case transformed output of ASCII sequences and ~11,000 four-byte prefixes in the hash. This would slow down the hash table and radix trie, so we needed to find a different subset of the dictionary that works well for web content.</p><p>For this purpose, we used a large data set containing representative content. We made sure to use web content from several world regions to reflect language diversity and optimize compression. Based on this data set, we identified which words are most common and result in the largest compression improvement according to the brotli cost model. We only include the most useful words based on this calculation. Additionally, we remove some words if they slow down hash table lookups of other, more common words based on their hash value.</p><p>We have generated separate dictionaries for HTML, CSS and JavaScript content and use the MIME type to identify the right dictionary to use. The dictionaries we currently use include about 15-35% of the entire dictionary including uppercase transforms. Depending on the type of data and the desired compression/speed tradeoff, different options for the size of the dictionary can be useful. We have also developed code that automatically gathers statistics about matches and generates a reduced dictionary based on this, which makes it easy to extend this approach to other textual formats, such as majority non-English data or XML, and achieve better results for that type of data.</p>
    <div>
      <h2>Results</h2>
      <a href="#results">
        
      </a>
    </div>
    <p>We tested the reduced dictionary on a large data set of HTML, CSS and JavaScript files.</p><p>The improvement is especially big for small files, as LZ77 compression is less effective on them. Since the improvement on large files is a lot smaller, we only tested files up to 256KB. We used compression level 5, the same compression level we currently use for dynamic compression on our edge, and tested on an Intel Core i7-7820HQ CPU.</p><p>Compression improvement is defined as 1 - (compressed size using the reduced dictionary / compressed size without dictionary). This ratio is then averaged for each input size range. We also provide an average value weighted by file size. Our data set mirrors typical web traffic, covering a wide range of file sizes with small files being more common, which explains the large difference between the weighted and unweighted average.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6fjLER7c308iIMA3kQx9Qs/cc51eec61ad9c4b9b3637e7cfec397cb/image-2.png" />
            
            </figure><p>With the improved dictionary approach, we are now able to compress HTML, JavaScript and CSS files as well, or sometimes even better than using a higher compression level would allow us, all while using only 1% to 3% more CPU. For reference using compression level 6 over 5 would increase CPU usage by up to 12%.</p> ]]></content:encoded>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Careers]]></category>
            <guid isPermaLink="false">34MUGp0LHm4s1IhKECGkCR</guid>
            <dc:creator>Felix Hanau</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing support for the AVIF image format]]></title>
            <link>https://blog.cloudflare.com/generate-avif-images-with-image-resizing/</link>
            <pubDate>Sat, 03 Oct 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ We're adding support for the new AVIF image format. It compresses images significantly better than older-generation formats such as WebP and JPEG. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pur7e2GCh3MalSZMrQS7q/10a180152a825d811ae832347dad8b66/BDES-1080_Cloudflare_Resize_Illustration_AV1-AVIF_Blog_Header.png" />
            
            </figure><p>We've added support for the new AVIF image format in <a href="https://developers.cloudflare.com/images/">Image Resizing</a>. It compresses images significantly better than older-generation formats such as WebP and JPEG. It's supported in Chrome desktop today, and support is coming to other Chromium-based browsers, as well as Firefox.</p>
    <div>
      <h3>What’s the benefit?</h3>
      <a href="#whats-the-benefit">
        
      </a>
    </div>
    <p>More than half of an average website's bandwidth is spent on images. Improved <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-image-compression/">image compression</a> can save bandwidth and improve the overall performance of the web. The compression in AVIF is so good that images can shrink to half the size of their JPEG and WebP equivalents.</p>
    <div>
      <h3>What is AVIF?</h3>
      <a href="#what-is-avif">
        
      </a>
    </div>
    <p>AVIF is a combination of the HEIF ISO standard and the royalty-free AV1 codec developed by <a href="https://aomedia.org/">Mozilla, Xiph, Google, Cisco, and many others</a>.</p><p>Currently, JPEG is the most popular image format on the Web. It's doing remarkably well for its age, and it will likely remain popular for years to come thanks to its excellent compatibility. There have been many previous attempts at replacing JPEG, such as JPEG 2000, JPEG XR and WebP. However, these formats offered only modest compression improvements, and didn't always beat JPEG on image quality. Compression and image quality in <a href="https://netflixtechblog.com/avif-for-next-generation-image-coding-b1d75675fe4">AVIF is better than in all of them, and by a wide margin</a>.</p><table>
<tr>
<td>
    <img src="http://staging.blog.mrk.cfdata.org/content/images/2021/11/avif-test-timothy-meinberg.jpeg" />
</td>
<td>
    
    
    <img src="https://images.ctfassets.net/slt3lc6tev37/4JGMoM6h1eV1YAv8N3QnRC/7e61ac8dfe6836ce46d347311993c50b/avif-test-timothy-meinberg.webp" />
    
</td>
<td>
    
    
    <img src="http://staging.blog.mrk.cfdata.org/content/images/2021/11/avif-test-timothy-meinberg.avif.png" />
    
</td>
</tr>
<tr>
    <td>JPEG (17KB)</td>
    <td>WebP (17KB)</td>
    <td>AVIF (17KB)</td>
</tr>
</table>
    <div>
      <h3>Why a new image format?</h3>
      <a href="#why-a-new-image-format">
        
      </a>
    </div>
    <p>One of the big things AVIF does is fix WebP's biggest flaws. WebP is over 10 years old, and AVIF is a major upgrade over the WebP format. These two formats are technically related: they're both based on a video codec from the VPx family. WebP uses the old VP8 version, while AVIF is based on AV1, which is the next generation after <a href="https://en.wikipedia.org/wiki/VP10">VP10</a>.</p><p>WebP is limited to 8-bit color depth, and in its best-compressing mode of operation, it can only store color at half of the image's resolution (known as chroma subsampling). This causes edges of saturated colors to be smudged or pixelated in WebP. AVIF supports 10- and 12-bit color at full resolution, and high dynamic range (HDR).</p><table>
<tr><td><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/10/avif-test-jpeg1.png" /></td><td>JPEG</td></tr>
<tr><td><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/10/avif-test-webp1.png" /></td><td>WebP</td></tr>
<tr><td><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/10/avif-test-webp2.png" /></td><td>WebP (sharp YUV option)</td></tr>
<tr><td><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/10/avif-test-avif1.png" /></td><td>AVIF</td></tr>
</table><p>AV1 also uses a new compression technique: chroma-from-luma. Most image formats store brightness separately from color hue. AVIF uses the brightness channel to guess what the color channel may look like. They are usually correlated, so a good guess gives smaller file sizes and sharper edges.</p>
    <div>
      <h3>Adoption of AV1 and AVIF</h3>
      <a href="#adoption-of-av1-and-avif">
        
      </a>
    </div>
    <p>VP8 and WebP suffered from reluctant industry adoption. Safari only added WebP support very recently, 10 years after Chrome.</p><p>Chrome 85 supports AVIF already. It’s a work in progress in other Chromium-based browsers, and Firefox. Apple hasn't announced whether Safari will support AVIF. However, this time Apple is one of the <a href="https://aomedia.org/membership/members/">companies in the Alliance for Open Media</a>, creators of AVIF. The AV1 codec is already seeing faster adoption than the previous royalty-free codecs. The latest GPUs from Nvidia, AMD, and Intel already have hardware decoding support for AV1.</p><p>AVIF uses the same HEIF container as the HEIC format used in iOS’s camera. AVIF and HEIC offer similar compression performance. The difference is that HEIC is based on a commercial, patent-encumbered H.265 format. In countries that allow software patents, H.265 is illegal to use without obtaining patent licenses. It's covered by thousands of patents, owned by hundreds of companies, which have fragmented into two competing licensing organizations. Costs and complexity of patent licensing used to be acceptable when videos were published by big studios, and the cost could be passed on to the customer in the price of physical discs and hardware players. Nowadays, when video is mostly consumed via free browsers and apps, <a href="https://blog.chiariglione.org/a-future-without-mpeg/">the old licensing model has become unsustainable</a>. This has created a huge incentive for Web giants like Google, Netflix, and Amazon to get behind the royalty-free alternative. AV1 is free to use by anyone, and the alliance of tech giants behind it will <a href="http://aomedia.org/press%20releases/the-alliance-for-open-media-statement/">defend it</a> from patent trolls' lawsuits.</p><p>Non-standard image formats usually can only be created using their vendor's own implementation. 
AVIF and AV1 are already ahead with multiple independent implementations: libaom, Intel SVT-AV1, libgav1, dav1d, and rav1e. This is a sign of healthy adoption and a prerequisite for becoming a Web standard. Our Image Resizing is implemented in <a href="https://www.rust-lang.org/">Rust</a>, so we've chosen the <a href="https://github.com/xiph/rav1e">rav1e</a> encoder to create a pure-Rust implementation of AVIF.</p>
    <div>
      <h3>Caveats</h3>
      <a href="#caveats">
        
      </a>
    </div>
    <p>AVIF is a feature-rich format. Most of its features are for smartphone cameras, such as "live" photos, depth maps, bursts, HDR, and lossless editing. Browsers will support only a fraction of these features. In our implementation in Image Resizing we’re supporting only still standard-range images.</p><p>The two biggest problems with AVIF are encoding speed and the lack of progressive rendering.</p><p>AVIF images don't show anything on screen until they're fully downloaded. In contrast, a progressive JPEG can display a lower-quality approximation of the image very quickly, while it's still loading. <a href="/parallel-streaming-of-progressive-images/">When progressive JPEGs are delivered well</a>, they make websites appear to load much faster. A progressive JPEG can look loaded at half of its file size. AVIF can fully load at half of JPEG's size, so it somewhat overcomes the lack of progressive rendering with its sheer compression advantage. In this case only WebP is left behind, which has neither progressive rendering nor strong compression.</p><p>Decoding AVIF images for display takes more CPU power than other codecs, but it should be fast enough in practice. Even low-end Android devices can play AV1 videos in full HD without the help of hardware acceleration, and AVIF images are just a single frame of an AV1 video. However, encoding of AVIF images is much slower. It may take a few seconds to create a single image. AVIF supports tiling, which accelerates encoding on multi-core CPUs. We have <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">lots of CPU cores</a>, so we take advantage of this to make encoding faster. Image Resizing doesn’t use the maximum compression level possible in AVIF, to further increase compression speed. 
Resized images are cached, so the encoding speed is noticeable only on a cache miss.</p><p>We will be adding AVIF support to <a href="https://support.cloudflare.com/hc/en-us/articles/360000607372-Using-Cloudflare-Polish-to-compress-images">Polish</a> as well. Polish converts images asynchronously in the background, which completely hides the encoding latency, and it will be able to compress AVIF images better than Image Resizing.</p>
    <div>
      <h3>Note about benchmarking</h3>
      <a href="#note-about-benchmarking">
        
      </a>
    </div>
    <p>It's surprisingly difficult to fairly and accurately judge which lossy codec is better. Lossy codecs are specifically designed to mainly <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-image-compression/">compress image details</a> that are too subtle for the naked eye to see, so "looks almost the same, but the file size is smaller!" is a common feature of all lossy codecs, and not specific enough to draw conclusions about their performance.</p><p>Lossy codecs can't be compared by comparing just file sizes. Lossy codecs can easily make files of any size. Their effectiveness is in the ratio between file size and visual quality they can achieve.</p><p>The best way to compare codecs is to make each compress an image to the exact same file size, and then to compare the actual visual quality they've achieved. If the file sizes differ, any judgement may be unfair, because the codec that generated the larger file may have done so only because it was asked to preserve more details, not because it can't compress better.</p>
    <div>
      <h3>How and when to enable AVIF today?</h3>
      <a href="#how-and-when-to-enable-avif-today">
        
      </a>
    </div>
    <p>We recommend AVIF for websites that need to deliver high-quality images with as little bandwidth as possible. This is important for users of slow networks and in <a href="https://whatdoesmysitecost.com/">countries where the bandwidth is expensive</a>.</p><p>Browsers that support the AVIF format announce it by adding <code>image/avif</code> to their <code>Accept</code> request header. To request the AVIF format from <a href="https://developers.cloudflare.com/images/">Image Resizing</a>, set the <code>format</code> option to <code>avif</code>.</p><p>The best method to detect and enable support for AVIF is to use <a href="https://developers.cloudflare.com/images/worker">image resizing in Workers</a>:</p>
            <pre><code>addEventListener('fetch', event =&gt; {
  const imageURL = "https://jpeg.speedcf.com/cat/4.jpg";

  const resizingOptions = {
    // You can set any options here, see:
    // https://developers.cloudflare.com/images/worker
    width: 800,
    sharpen: 1.0,
  };

  const accept = event.request.headers.get("accept");
  const isAVIFSupported = /image\/avif/.test(accept);
  if (isAVIFSupported) {
    resizingOptions.format = "avif";
  }
  event.respondWith(fetch(imageURL, {cf: {image: resizingOptions}}))
})</code></pre>
            <p>The above script will auto-detect the supported format, and serve AVIF automatically. Alternatively, you can use URL-based resizing together with the <code>&lt;picture&gt;</code> element:</p>
            <pre><code>&lt;picture&gt;
    &lt;source type="image/avif" 
            srcset="/cdn-cgi/image/format=avif/image.jpg"&gt;
    &lt;img src="/image.jpg"&gt;
&lt;/picture&gt;</code></pre>
            <p>This uses our <a href="https://developers.cloudflare.com/images/about"><code>/cdn-cgi/image/…</code></a> endpoint to perform the conversion, and the alternative source will be picked only by browsers that support the AVIF format.</p><p>We have the <code>format=auto</code> option, but it won't choose AVIF yet. We're cautious, and we'd like to test the new format more before enabling it by default.</p> ]]></content:encoded>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[WebP]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">49mNQJofaty5LTpZMvcjPb</guid>
            <dc:creator>Kornel Lesiński</dc:creator>
        </item>
        <item>
            <title><![CDATA[When Bloom filters don't bloom]]></title>
            <link>https://blog.cloudflare.com/when-bloom-filters-dont-bloom/</link>
            <pubDate>Mon, 02 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Last month finally I had an opportunity to use Bloom filters. I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4bQ9cbvVLCJTUntwCHSGSp/570583980831b19e4da88411fdff5eda/bloom-filter_2x.png" />
            
            </figure><p>I've known about <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a> (named after Burton Bloom) since university, but I hadn't had an opportunity to use them in anger. Last month this changed - I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.</p><p>While doing research about <a href="/the-root-cause-of-large-ddos-ip-spoofing/">IP spoofing</a>, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian data center. This problem might sound simple, but in the ever-evolving landscape of the Internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EyhhPLD0IymvVMr3R8pMz/2c30f18b77fdee9184c438f80e94781b/Screenshot-from-2020-03-01-23-57-10.png" />
            
            </figure><p>This reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like BGP table), etc. The same line would usually be repeated across multiple files.</p><p>I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn't working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines, took... about an eternity:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lzqpR8gHJRr6XC4fbcepJ/91185ef0b6036ecd484bd791f5a72651/Screenshot-from-2020-03-01-23-25-19a.png" />
            
            </figure><p>Suffice it to say that deduplicating lines using the usual bash commands like 'sort' in various configurations (see '--parallel', '--buffer-size' and '--unique') was not optimal for such a large data set.</p>
    <div>
      <h2>Bloom filters to the rescue</h2>
      <a href="#bloom-filters-to-the-rescue">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vR5EB9TdKmVJDjStuvpwL/fe51959f24694cc5c072f3263aa42ab0/Bloom_filter.png" />
            
            </figure><p><a href="https://en.wikipedia.org/wiki/Bloom_filter#/media/File:Bloom_filter.svg">Image</a> by <a href="https://commons.wikimedia.org/wiki/User:David_Eppstein">David Eppstein</a> Public Domain</p><p>Then I had a brainwave - it's not necessary to sort the lines! I just need to remove duplicated lines - using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (number of unique lines), and I can live with some data points being lost - using a probabilistic data structure is fine!</p><p>Bloom filters are a perfect fit!</p><p>While you should go and read <a href="https://en.wikipedia.org/wiki/Bloom_filter#Algorithm_description">Wikipedia on Bloom Filters</a>, here is how I look at this data structure.</p><p>How would you implement a "<a href="https://en.wikipedia.org/wiki/Set_(abstract_data_type)">set</a>"? Given a perfect hash function and infinite memory, we could just create an infinite bit array and set bit number 'hash(item)' for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn't exist, so in reality we have to compromise. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function and 128GiB of memory. We can calculate that the probability of the second item added to the bit array colliding is 1 in 1,099,511,627,776 - one for each of the bits in 128GiB. The probability of collision worsens as more items fill up the bit array.</p><p>Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. 
A Bloom filter is a bunch of math on top of four variables:</p><ul><li><p>'n' - The number of input elements (cardinality)</p></li><li><p>'m' - Memory used by the bit-array</p></li><li><p>'k' - Number of hash functions computed for each input</p></li><li><p>'p' - Probability of a false positive match</p></li></ul><p>Given the input cardinality 'n' and the desired false positive probability 'p', the Bloom filter math returns the memory 'm' required and the number 'k' of hash functions needed.</p><p>Check out this excellent visualization by Thomas Hurst showing how the parameters influence each other:</p><ul><li><p><a href="https://hur.st/bloomfilter/">https://hur.st/bloomfilter/</a></p></li></ul>
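<p>That sizing math can be sketched with the standard optimal-parameter formulas (a rough illustration of the relationship between the four variables, not the exact code 'mmuniq-bloom' uses):</p>

```javascript
// Standard Bloom filter sizing: given 'n' (input cardinality) and
// 'p' (desired false positive probability), compute
//   m = -n * ln(p) / (ln 2)^2    -- bits of memory required
//   k = (m / n) * ln 2           -- optimal number of hash functions
function bloomParameters(n, p) {
  const m = Math.ceil(-n * Math.log(p) / (Math.LN2 * Math.LN2));
  const k = Math.round((m / n) * Math.LN2);
  return { bits: m, mib: m / 8 / (1024 * 1024), hashes: k };
}

// 40M unique lines with a 1-in-10k false positive rate:
const params = bloomParameters(40e6, 1e-4);
// params.mib is roughly 91 MiB, params.hashes is 13
```

<p>For a 40M-line input this suggests roughly 91MiB and k=13; rounding 'm' up to a power of two (128MiB) lets a smaller k=8 still hit the target error rate while doing fewer memory operations.</p>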
    <div>
      <h2>mmuniq-bloom</h2>
      <a href="#mmuniq-bloom">
        
      </a>
    </div>
    <p>Guided by this intuition, I set out on a journey to add a new tool to my toolbox - 'mmuniq-bloom', a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than the 'sort' + 'uniq' combo!</p><p>Here it is:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-bloom.c">'mmuniq-bloom.c'</a></p></li></ul><p>For simplicity and speed I designed 'mmuniq-bloom' with a couple of assumptions. First, unless otherwise instructed, it uses 8 hash functions (k=8). This seems to be close to the optimal number for the data sizes I'm working with, and the hash function can quickly output 8 decent hashes. Second, we align 'm', the number of bits in the bit array, to be a power of two. This is to avoid the pricey '%' modulo operation, which compiles down to the slow assembly 'div' instruction. With power-of-two sizes we can just do a bitwise AND. (For a fun read, see <a href="https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi">how compilers can optimize some divisions by using multiplication by a magic constant</a>.)</p><p>We can now run it against the same data file we used before:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWA2sCxdTwbHW0JVE4dLm/ffffd6f6823305f41a15d0091d1acaef/image11.png" />
            
            </figure><p>Oh, this is so much better! 12 seconds is much more manageable than the 2 minutes before. But hold on... The program is using an optimized data structure, a relatively limited memory footprint, optimized line-parsing and good output buffering... 12 seconds is still an eternity compared to the 'wc -l' tool:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/10yS38GPKuBTqMMd4ZJVSe/199b9ceea61eb91700012362ef92e0ab/image5.png" />
            
            </figure><p>What is going on? I understand that counting lines with 'wc' is <i>easier</i> than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in 'mmuniq-bloom' go?</p><p>It must be my hash function. 'wc' doesn't need to spend all this CPU performing strange math for each of the 40M input lines. I'm using the pretty non-trivial 'siphash24' hash function, so it surely burns the CPU, right? Let's check by running code that computes the hash function but doesn't perform any Bloom filter operations:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cmgeZbT0nJNh9bwRC4gEH/b5a71c8c573d506f20f6b6bc01173028/image2.png" />
            
            </figure><p>This is strange. Computing the hash function indeed costs about 2s, but the program took 12s in the previous run. Does the Bloom filter alone take 10 seconds? How is that possible? It's such a simple data structure...</p>
    <div>
      <h2>A secret weapon - a profiler</h2>
      <a href="#a-secret-weapon-a-profiler">
        
      </a>
    </div>
    <p>It was time to use the proper tool for the task - let's fire up a profiler and see where the CPU goes. First, let's run 'strace' to confirm we are not making any unexpected syscalls:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29N25pGnOTso9WBxn4SckO/4a8bbb0dccf4dfb5fd2a248625d29bd9/image14.png" />
            
            </figure><p>Everything looks good. The 10 calls to 'mmap', each taking 4ms (3971 us), are intriguing, but that's fine. We pre-populate memory up front with 'MAP_POPULATE' to save on page faults later.</p><p>What is the next step? Of course, Linux's 'perf'!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/632ygHAtmgUaNH18WsUPAG/b9f3990786d96bf84b2bc28d5a55f0e3/image10.png" />
            
            </figure><p>Then we can see the results:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61GXvk0ujwFsP0vEDyrD6c/ab4ec96b0e748f5dd8e6e496d1b499b4/image6.png" />
            
            </figure><p>Right, so we indeed burn 87.2% of cycles in our hot code. Let's see where exactly. Doing 'perf annotate process_line --source' quickly shows something I didn't expect.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3i10E5RjlO3vvJTuA751wv/da7db1a7614e3936218206689767e752/image3.png" />
            
            </figure><p>You can see 26.90% of the CPU burned in a 'mov', but that's not all of it! The compiler correctly inlined the function and unrolled the loop 8-fold. Summed up, that 'mov' - the 'uint64_t v = *p' line - accounts for the great majority of cycles!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sTjVPwEz6BILccJz8pdDa/2e3435805a1e092460dcb916f4d6ecb2/image4.png" />
            
            </figure><p>Clearly 'perf' must be mistaken - how can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using 'google-perftools' with kcachegrind since they emit eye-candy charts:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fWWh8EZelfOMfwA5XMT16/4ded280402c936f2a6f67e2930115f5e/Screenshot-from-2020-03-02-00-08-23.png" />
            
            </figure><p>The rendered result looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2p5hkIX3zVAO7IFFWEVZuF/14f1f9505b91e15c904dab7309a11907/image13.png" />
            
            </figure><p>Allow me to summarise what we found so far.</p><p>The generic 'wc' tool takes 0.45s of CPU time to process a 600MiB file. Our optimized 'mmuniq-bloom' tool takes 12 seconds. The CPU is burned on one 'mov' instruction, dereferencing memory...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1eRwMPBq6BEAyWB0g10EDo/3802664ab1b47d94b96845f487338a08/6784957048_4661ea7dfc_c.jpg" />
            
            </figure><p><a href="https://flickr.com/photos/jonicdao/6784957048">Image</a> by <a href="https://flickr.com/photos/jonicdao/">Jose Nicdao</a> CC BY/2.0</p><p>Oh! How could I have forgotten? Random memory access <i>is</i> slow! It's very, very, very slow!</p><p>According to the general rule <a href="http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html">"latency numbers every programmer should know about"</a>, one RAM fetch is about 100ns. Let's do the math: 40 million lines, 8 hashes computed for each line. Since our Bloom filter is 128MiB, on <a href="/gen-x-performance-tuning/">our older hardware</a> it doesn't fit into the L3 cache! The hashes are uniformly distributed across the large memory range - each hash generates a memory miss. Adding it together that's...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ORPcRAqG2H2xqeEEdGbmh/ef6968c82bfe0aa44706a4f36e59bb1c/Screenshot-from-2020-03-02-00-34-29.png" />
            
            </figure><p>That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It's easy to see with 'perf stat -d':</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pgVS7zKpVhueO7rAhQDRI/5145db41e7b06f5875d1da4adfa92142/image9.png" />
            
            </figure><p>Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn't explain why the program was running only 12 seconds. But it doesn't really matter. What matters is that the number of cache misses is a real problem, and we can only fix it by reducing the number of memory accesses. Let's try tuning the Bloom filter to use only one hash function:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30BFuINOqYvVgCCdYXCzUB/0893742e7bc41b923af6e67f53629049/image12.png" />
            
            </figure><p>Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive probability of 1-error-per-10k-lines. This is terrible!</p><p>Also, it doesn't seem like we improved much. It took the OS 22 seconds to prepare memory for us, and we still burned 11 seconds in userspace. I guess this time any benefits from hitting memory less often were offset by a lower cache-hit rate due to the drastically increased memory size. In previous runs we required only 128MiB for the Bloom filter!</p>
    <div>
      <h2>Dumping Bloom filters altogether</h2>
      <a href="#dumping-bloom-filters-altogether">
        
      </a>
    </div>
    <p>This is getting ridiculous. To get the same false positive guarantees we must either use many hash functions in the Bloom filter (like 8), and therefore perform many memory operations, or use 1 hash function with enormous memory requirements.</p><p>We aren't really constrained by available memory; instead we want to optimize for fewer memory accesses. All we need is a data structure that requires at most 1 memory miss per item and uses less than 64 GiB of RAM...</p><p>While we could think of more sophisticated data structures like a <a href="https://en.wikipedia.org/wiki/Cuckoo_filter">Cuckoo filter</a>, maybe we can do something simpler. How about a good old simple hash table with linear probing?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PVz5hd2DqyiraxgMJ7KIp/37706f6ffcc1cd52f544d626381deaeb/linear-probing.png" />
            
            </figure><p><a href="https://www.sysadmins.lv/blog-en/array-search-hash-tables-behind-the-scenes.aspx">Image</a> by <a href="https://www.sysadmins.lv/about.aspx">Vadims Podāns</a></p>
    <div>
      <h2>Welcome mmuniq-hash</h2>
      <a href="#welcome-mmuniq-hash">
        
      </a>
    </div>
    <p>Here you can find a tweaked version of mmuniq-bloom, but using a hash table:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-hash.c">'mmuniq-hash.c'</a></p></li></ul><p>Instead of storing bits as for the Bloom filter, we are now storing 64-bit hashes from the <a href="https://idea.popcount.org/2013-01-24-siphash/">'siphash24' function</a>. This gives us much stronger probability guarantees, with a probability of false positives much better than one error in 10k lines.</p><p>Let's do the math. Adding a new item to a hash table already containing, say, 40M entries has a '40M/2^64' chance of hitting a hash collision. This is about one in 461 billion - a reasonably low probability. But we are not adding one item to a pre-filled set! Instead, we are adding 40M lines to an initially empty set. As per the <a href="https://en.wikipedia.org/wiki/Birthday_problem">birthday paradox</a>, this has a much higher chance of hitting a collision at some point. A decent approximation is 'n^2/2m', which in our case is '(40M^2)/(2*(2^64))'. This is a chance of one in 23 thousand. In other words, assuming we are using a good hash function, roughly one in every 23 thousand random sets of 40M items will have a hash collision. This chance of hitting a collision is non-negligible, but it's still better than the Bloom filter's and totally acceptable for my use case.</p><p>The hash table code runs faster, has better memory access patterns, and offers a better false positive probability than the Bloom filter approach.</p>
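<p>That back-of-the-envelope birthday-paradox estimate is easy to check (a sketch using the numbers from the paragraph above):</p>

```javascript
// Approximate chance of at least one collision when inserting n random
// b-bit hashes into an empty set: n^2 / (2 * 2^b), valid when the
// result is small.
function collisionChance(n, bits) {
  return (n * n) / (2 * Math.pow(2, bits));
}

const p = collisionChance(40e6, 64);
// 1/p is roughly 23000 - about one in every 23 thousand random sets
// of 40M items will contain a 64-bit hash collision.
```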
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7i3tnYhlVqJ7NEfutgywVD/dafe27be80727cd61533f548457bc4a3/image7.png" />
            
            </figure><p>Don't be scared by the "hash conflicts" line; it just indicates how full the hash table was. We are using linear probing, so when a bucket is already occupied, we just pick the next empty bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.</p><p>From the previous exercise we know our hash function accounts for about 2 seconds of this. Therefore, it's fair to say 40M memory hits take around 4 seconds.</p>
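<p>The idea can be sketched in a few lines (a toy illustration, not the actual 'mmuniq-hash' code: it substitutes a 32-bit FNV-1a hash for the real tool's 64-bit siphash24, and reserves 0 to mark an empty bucket):</p>

```javascript
// Open-addressing hash table with linear probing, storing only the
// hashes of the lines seen so far. The table size is a power of two,
// so indexing is a cheap bitwise AND instead of a '%' modulo.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h === 0 ? 1 : h; // 0 is reserved for "empty bucket"
}

class HashDedup {
  constructor(log2size) { // input must stay well under 2^log2size items
    this.mask = (1 << log2size) - 1;
    this.buckets = new Uint32Array(1 << log2size);
  }
  // Returns true if the line was (probably) not seen before.
  add(line) {
    const h = fnv1a(line);
    let i = h & this.mask;
    while (this.buckets[i] !== 0) {            // linear probing
      if (this.buckets[i] === h) return false; // hash already present
      i = (i + 1) & this.mask;
    }
    this.buckets[i] = h;
    return true;
  }
}
```

<p>Deduplicating a stream is then just <code>lines.filter(line =&gt; dedup.add(line))</code>; as with the real tool, a hash collision silently drops a unique line - the accepted probabilistic trade-off here.</p>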
    <div>
      <h2>Lessons learned</h2>
      <a href="#lessons-learned">
        
      </a>
    </div>
    <p>Modern CPUs are really good at sequential memory access when it's possible to predict memory fetch patterns (see <a href="https://en.wikipedia.org/wiki/Cache_prefetching#Methods_of_hardware_prefetching">Cache prefetching</a>). Random memory access, on the other hand, is very costly.</p><p>Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets that don't fit in L3, prefer optimizing for a reduced number of loads over optimizing the amount of memory used.</p><p>I guess it's fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news: Bloom filters optimize for memory usage, not for memory access. For example, see <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">the Cuckoo Filters paper</a>.</p><p>Another thing is the everlasting discussion about hash functions. Frankly - in most cases it doesn't matter. The cost of computing even complex hash functions like 'siphash24' is small compared to the cost of random memory access. In our case, simplifying the hash function would bring only small benefits. The CPU time is simply spent somewhere else - waiting for memory!</p><p>One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they <a href="http://www.di-srv.unisa.it/~vitsca/SC-2011/DesignPrinciplesMulticoreProcessors/Wulf1995.pdf">hit the memory wall</a>".</p><p>Finally, don't repeat my mistakes - everyone should start profiling with 'perf stat -d' and look at the "Instructions per cycle" (IPC) counter. If it's below 1, it generally means the program is stuck waiting for memory. Values above 2 would be great; they would mean the workload is mostly CPU-bound. Sadly, I've yet to see high values in the workloads I'm dealing with...</p>
    <div>
      <h2>Improved mmuniq</h2>
      <a href="#improved-mmuniq">
        
      </a>
    </div>
    <p>With the help of my colleagues I've prepared a further improved version of the 'mmuniq' hash table based tool. See the code:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq.c">'mmuniq.c'</a></p></li></ul><p>It dynamically resizes the hash table, to support inputs of unknown cardinality. Then, by using batching, it can effectively use the 'prefetch' CPU hint, speeding up the program by 35-40%. Beware: sprinkling code with 'prefetch' rarely works; instead, I specifically changed the flow of the algorithm to take advantage of this instruction. With all the improvements, I got the run time down to 2.1 seconds:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6cnwoob2q1SVG27I7zqals/87621808c31bd4707e08e816a4a58480/Screenshot-from-2020-03-01-23-52-18.png" />
            
            </figure>
    <div>
      <h2>The end</h2>
      <a href="#the-end">
        
      </a>
    </div>
    <p>Writing this basic tool, which tries to be faster than the 'sort | uniq' combo, revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency and the power of cache-friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.</p>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Tools]]></category>
            <guid isPermaLink="false">3CPWTXjZJXbtWVNIawBWsd</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Announcing Cloudflare Image Resizing: Simplifying Optimal Image Delivery]]></title>
            <link>https://blog.cloudflare.com/announcing-cloudflare-image-resizing-simplifying-optimal-image-delivery/</link>
            <pubDate>Wed, 15 May 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ In the past three years, the amount of image data on the median mobile webpage has doubled. Growing images translate directly to users hitting data transfer caps, experiencing slower websites, and even leaving if a website doesn’t load in a reasonable amount of time.  ]]></description>
            <content:encoded><![CDATA[ <p>In the past three years, the amount of image data on the median mobile webpage has doubled. Growing images translate directly to users hitting data transfer caps, experiencing slower websites, and even leaving if a website doesn’t load in a reasonable amount of time. The crime is that many of these images are slow because they are larger than they need to be, sending data over the wire which has absolutely no (positive) impact on the user’s experience.</p><p>To provide a concrete example, let’s consider this photo of Cloudflare’s Lava Lamp Wall:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4dS3q03MGE49VWNx9YFuCw/396ef448bd35a3fbbf1727630f6a182f/lava_300.jpeg.jpeg" />
            
            </figure><p> </p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ddyepBCTb0JoVBb9R1ow6/d76f2b2c743db1ea607a2f2091f936f2/lava_full.jpg" />
            
            </figure><p>On the left you see the photo, scaled to 300 pixels wide. On the right you see the same image delivered in its original high resolution, scaled in a desktop web browser. On a regular-DPI screen they both look the same, yet the image on the right takes more than <b>twenty times more data to load.</b> Even for the best and most conscientious developers, resizing every image to handle every possible device geometry consumes valuable time, and it’s exceptionally easy to forget to do this resizing altogether.</p><p>Today we are launching a new product, <a href="https://www.cloudflare.com/developer-platform/cloudflare-images/">Image Resizing</a>, to fix this problem once and for all.</p>
    <div>
      <h3>Announcing Image Resizing</h3>
      <a href="#announcing-image-resizing">
        
      </a>
    </div>
    <p>With Image Resizing, Cloudflare adds another important product to its suite of available image optimizations.  This product allows customers to perform a rich set of key actions on images.</p><ul><li><p><b>Resize</b> - The source image will be resized to the specified height and width.  This action allows multiple different sized variants to be created for each specific use.</p></li><li><p><b>Crop</b> - The source image will be resized to a new size that does not maintain the original aspect ratio, and a portion of the image will be removed.  This can be especially helpful for headshots and product images where different formats must be achieved by keeping only a portion of the image.</p></li><li><p><b>Compress</b> - The source image will have its file size reduced by applying lossy compression.  This should be used when a slight quality reduction is an acceptable trade for file size reduction.</p></li><li><p><b>Convert to</b> <a href="https://en.wikipedia.org/wiki/WebP"><b>WebP</b></a> - When the user’s browser supports it, the source image will be converted to WebP.  Delivering a WebP image takes advantage of this modern, highly optimized image format.</p></li></ul><p>By using a combination of these actions, customers store a single high quality image on their server, and Image Resizing can be leveraged to create specialized variants for each specific use case.  Without any additional effort, each variant will also automatically benefit from Cloudflare’s global caching.</p>
    <div>
      <h3>Examples</h3>
      <a href="#examples">
        
      </a>
    </div>
    
    <div>
      <h4>Ecommerce Thumbnails</h4>
      <a href="#ecommerce-thumbnails">
        
      </a>
    </div>
    <p>Ecommerce sites typically store a high-quality image of each product.  From that image, they need to <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">create different variants</a> depending on how that product will be displayed.  One example is creating thumbnails for a catalog view.  Using Image Resizing, if the high quality image is located here:</p><p><code>https://example.com/images/shoe123.jpg</code></p><p>This is how to display a 75x75 pixel thumbnail using Image Resizing:</p><p><code>&lt;img src="/cdn-cgi/image/width=75,height=75/images/shoe123.jpg"&gt;</code></p>
    <div>
      <h4>Responsive Images</h4>
      <a href="#responsive-images">
        
      </a>
    </div>
    <p>When tailoring a site to work on various device types and sizes, it’s important to always use correctly sized images.  This can be difficult when images are intended to fill a particular percentage of the screen.  To solve this problem, responsive images with the <code>srcset</code> attribute can be used.</p><p>Without Image Resizing, multiple versions of the same image would need to be created and stored.  In this example, a single high quality copy of hero.jpg is stored, and Image Resizing is used to resize for each particular size as needed.</p>
            <pre><code>&lt;img width="100%" srcset=" /cdn-cgi/image/fit=contain,width=320/assets/hero.jpg 320w, /cdn-cgi/image/fit=contain,width=640/assets/hero.jpg 640w, /cdn-cgi/image/fit=contain,width=960/assets/hero.jpg 960w, /cdn-cgi/image/fit=contain,width=1280/assets/hero.jpg 1280w, /cdn-cgi/image/fit=contain,width=2560/assets/hero.jpg 2560w, " src="/cdn-cgi/image/width=960/assets/hero.jpg"&gt;</code></pre>
            
    <div>
      <h4>Enforce Maximum Size Without Changing URLs</h4>
      <a href="#enforce-maximum-size-without-changing-urls">
        
      </a>
    </div>
    <p>Image Resizing is also <a href="https://developers.cloudflare.com/images/worker/">available</a> from within a Cloudflare Worker. <a href="https://www.cloudflare.com/developer-platform/workers/">Workers</a> allow you to write code which runs close to your users all around the world. For example, you might wish to add Image Resizing to your images <b>while keeping the same URLs</b>. Your users and client would be able to use the same image URLs as always, but the images will be transparently modified in whatever way you need.</p><p>You can install a Worker on a route which matches your image URLs, and resize any images larger than a limit:</p>
            <pre><code>addEventListener('fetch', event =&gt; {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  // Ask Cloudflare to scale down any image larger than 800x800
  return fetch(request, {
    cf: {
      image: {
        width: 800,
        height: 800,
        fit: 'scale-down'
      }
    }
  });
}</code></pre>
            <p>As a Worker is just code, it is also easy to run this worker only on URLs with image extensions, or even to only resize images being delivered to mobile clients.</p>
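<p>As a sketch of that idea (the <code>shouldResize</code> helper and extension list are ours, for illustration), a Worker might gate resizing on the request path before passing the <code>cf.image</code> options to <code>fetch</code>:</p>

```javascript
// Illustrative sketch: only route requests that look like images through
// Image Resizing, based on the URL's file extension.
const IMAGE_EXT = /\.(jpe?g|png|gif|webp)$/i;

function shouldResize(requestUrl) {
  return IMAGE_EXT.test(new URL(requestUrl).pathname);
}
```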
    <div>
      <h3>Cloudflare and Images</h3>
      <a href="#cloudflare-and-images">
        
      </a>
    </div>
    <p>Cloudflare has a long history of building tools to accelerate images. Our caching has always helped reduce latency by storing a copy of images closer to the user.  <a href="/introducing-polish-automatic-image-optimizati/">Polish</a> automates options for both <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-image-compression/">lossless and lossy image compression</a> to remove unnecessary bytes from images.  <a href="https://www.cloudflare.com/website-optimization/mirage/">Mirage</a> accelerates image delivery based on device type. We are continuing to invest in all of these tools, as they all serve a unique role in improving the image experience on the web.</p><p>Image Resizing is different because it is the first image product at Cloudflare to give developers full control over how their images are served. You should choose Image Resizing if you are comfortable defining the sizes you wish your images to be served at in advance or within a Cloudflare Worker.</p>
    <div>
      <h3>Next Steps and Simple Pricing</h3>
      <a href="#next-steps-and-simple-pricing">
        
      </a>
    </div>
    <p>Image Resizing is available today for Business and Enterprise Customers.  To enable it, log in to the Cloudflare Dashboard and navigate to the Speed Tab.  There you’ll find the section for Image Resizing which you can enable with one click.</p><p>This product is included in the <a href="https://www.cloudflare.com/plans/business/">Business</a> and <a href="https://www.cloudflare.com/plans/enterprise/">Enterprise</a> plans at no additional cost with generous usage limits.  Business Customers have a 100k request per month limit and will be charged $10 for each additional 100k requests per month.  Enterprise Customers have a 10M request per month limit with discounted tiers for higher usage.  Requests are defined as a hit on a URI that contains Image Resizing or a call to Image Resizing from a Worker.</p><p>Now that you’ve enabled Image Resizing, it’s time to resize your first image.</p><ol><li><p>Using your existing site, store an image here: <code>https://yoursite.com/images/yourimage.jpg</code></p></li><li><p>Use this URL to resize that image: <code>https://yoursite.com/cdn-cgi/image/width=100,height=100,quality=75/images/yourimage.jpg</code></p></li><li><p>Experiment with changing <code>width=</code>, <code>height=</code>, and <code>quality=</code>.</p></li></ol><p>The instructions above use the Default URL Format for Image Resizing.  For details on options, use cases, and compatibility, refer to our <a href="https://developers.cloudflare.com/images/">Developer Documentation</a>.</p>
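<p>For budgeting, the Business plan numbers above can be sketched as a quick calculation (we assume here that overage is billed in whole 100k blocks; check your plan for the exact billing rules):</p>

```javascript
// Illustrative estimate of monthly Image Resizing overage on the Business
// plan: 100k requests included, $10 per additional 100k requests.
// Rounding up to whole 100k blocks is our assumption.
function businessOverageUSD(monthlyRequests) {
  const included = 100_000;
  const blockSize = 100_000;
  const pricePerBlock = 10;
  if (monthlyRequests <= included) return 0;
  return Math.ceil((monthlyRequests - included) / blockSize) * pricePerBlock;
}
```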
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">K8YkfIc07iYdmFpZjLOAy</guid>
            <dc:creator>Isaac Specter</dc:creator>
        </item>
        <item>
            <title><![CDATA[Parallel streaming of progressive images]]></title>
            <link>https://blog.cloudflare.com/parallel-streaming-of-progressive-images/</link>
            <pubDate>Tue, 14 May 2019 16:00:00 GMT</pubDate>
            <description><![CDATA[ Image-optimized HTTP/2 multiplexing makes all progressive images across the page appear visually complete in half of the time. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/277BtPJSXREVafvPqyX3ZH/60d4273f08458c42a0190203a4bf9b7b/880BAE29-39B3-4733-96DA-735FE76443D5.png" />
            
            </figure><p>Progressive image rendering and HTTP/2 multiplexing technologies have existed for a while, but now we've combined them in a new way that makes them much more powerful. With Cloudflare progressive streaming <b>images appear to load in half of the time, and browsers can start rendering pages sooner</b>.</p>
<p>In HTTP/1.1 connections, servers didn't have any choice about the order in which resources were sent to the client; they had to send responses, as a whole, in the exact order they were requested by the web browser. HTTP/2 improved this by adding multiplexing and prioritization, which allows servers to decide exactly what data is sent and when. We’ve taken advantage of these new HTTP/2 capabilities to improve the perceived loading speed of progressive images by sending the most important fragments of image data sooner.</p>
    <div>
      <h3>What is progressive image rendering?</h3>
      <a href="#what-is-progressive-image-rendering">
        
      </a>
    </div>
    <p>Basic images load strictly from top to bottom. If a browser has received only half of an image file, it can show only the top half of the image. Progressive images have their content arranged not from top to bottom, but from a low level of detail to a high level of detail. Receiving a fraction of image data allows browsers to show the entire image, only with a lower fidelity. As more data arrives, the image becomes clearer and sharper.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Mlb3C96uo4KLWgNe835l2/31ccbe118c20ce15f4c35c5ceb0cd377/image6.jpg" />
            
            </figure><p>This works great in the JPEG format, where only about 10-15% of the data is needed to display a preview of the image, and at 50% of the data the image looks almost as good as when the whole file is delivered. Progressive JPEG images contain exactly the same data as baseline images, merely reshuffled in a more useful order, so progressive rendering doesn’t add any cost to the file size. This is possible, because JPEG doesn't store the image as pixels. Instead, it represents the image as frequency coefficients, which are like a set of predefined patterns that can be blended together, in any order, to reconstruct the original image. The inner workings of JPEG are really fascinating, and you can learn more about them from my recent <a href="https://www.youtube.com/watch?v=jTXhYj2aCDU">performance.now() conference talk</a>.</p><p>The end result is that the images can look almost fully loaded in half of the time, for free! The page appears to be visually complete and can be used much sooner. The rest of the image data arrives shortly after, upgrading images to their full quality, before visitors have time to notice anything is missing.</p>
    <div>
      <h3>HTTP/2 progressive streaming</h3>
      <a href="#http-2-progressive-streaming">
        
      </a>
    </div>
    <p>But there's a catch. Websites have more than one image (sometimes even hundreds of images). When the server sends image files naïvely, one after another, the progressive rendering doesn’t help that much, because overall the images still load sequentially:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KLYwVqXQSyp0QjnTTbCX4/b70c3a915dd7afaf7b0beec2aedf9f1d/image5.gif" />
            
            </figure><p>Having complete data for half of the images (and no data for the other half) doesn't look as good as having half of the data for all images.</p><p>And there's another problem: when the browser doesn't know image sizes yet, it lays the page out with placeholders instead, and re-lays out the page when each image loads. This can make pages jump during loading, which is inelegant, distracting and annoying for the user.</p><p>Our new progressive streaming feature greatly improves the situation: we can send all of the images at once, in parallel. This way the browser gets size information for all of the images as soon as possible, can paint a preview of all images without having to wait for a lot of data, and large images don’t delay loading of styles, scripts and other more important resources.</p><p>This idea of streaming of progressive images in parallel is as old as HTTP/2 itself, but it needs special handling in low-level parts of web servers, and so far this hasn't been implemented at a large scale.</p><p>When we were improving <a href="/better-http-2-prioritization-for-a-faster-web">our HTTP/2 prioritization</a>, we realized it can also be used to implement this feature. Image files as a whole are neither high nor low priority. The priority changes within each file, and dynamic re-prioritization gives us the behavior we want:</p><ul><li><p>The image header that contains the image size is very high priority, because the browser needs to know the size as soon as possible to do page layout. The image header is small, so it doesn't hurt to send it ahead of other data.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qRxoj4KuoVnSD7nEpWbMe/6bd553099bbb167660c6a65fb782ea7c/image7.jpg" />
            
            </figure></li><li><p>The minimum amount of data in the image required to show a preview of the image has a medium priority (we'd like to plug "holes" left for unloaded images as soon as possible, but also leave some bandwidth available for scripts, fonts and other resources)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Cm6w0yIFtMJctt42maKMF/f585afb934ef9e1f98ecd4d3750e7b09/secondhalf.png" />
            
            </figure></li><li><p>The remainder of the image data is low priority. Browsers can stream it last to refine image quality once there's no rush, since the page is already fully usable.</p></li></ul><p>Knowing the exact amount of data to send in each phase requires understanding the structure of image files, but it seemed weird to us to make our web server parse image responses and have a format-specific behavior hardcoded at a protocol level. By framing the problem as a dynamic change of priorities, we were able to elegantly separate low-level networking code from knowledge of image formats. We can use Workers or offline image processing tools to analyze the images, and instruct our server to change HTTP/2 priorities accordingly.</p><p>The great thing about parallel streaming of images is that it doesn’t add any overhead. We’re still sending the same data, the same amount of data, we’re just sending it in a smarter order. This technique takes advantage of existing web standards, so it’s compatible with all browsers.</p>
    <div>
      <h3>The waterfall</h3>
      <a href="#the-waterfall">
        
      </a>
    </div>
    <p>Here are waterfall charts from <a href="https://webpagetest.org">WebPageTest</a> showing comparison of regular HTTP/2 responses and progressive streaming. In both cases the files were exactly the same, the amount of data transferred was the same, and the overall page loading time was the same (within measurement noise). In the charts, blue segments show when data was transferred, and green shows when each request was idle.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gWR5w8eEc3VASAU81KwBS/986a63df2f51f9f88237bc1bc74d5273/image8.png" />
            
            </figure><p>The first chart shows a typical server behavior that makes images load mostly sequentially. The chart itself looks neat, but the actual experience of loading that page was not great — the last image didn't start loading until almost the end.</p><p>The second chart shows images loaded in parallel. The blue vertical streaks throughout the chart are image headers sent early followed by a couple of stages of progressive rendering. You can see that useful data arrived sooner for all of the images. You may notice that one of the images has been sent in one chunk, rather than split like all the others. That’s because at the very beginning of a TCP/IP connection we don't know the true speed of the connection yet, and we have to sacrifice some opportunity to do prioritization in order to maximize the connection speed.</p>
    <div>
      <h3>The metrics compared to other solutions</h3>
      <a href="#the-metrics-compared-to-other-solutions">
        
      </a>
    </div>
    <p>There are other techniques intended to provide image previews quickly, such as low-quality image placeholder (LQIP), but they have several drawbacks. They add unnecessary data for the placeholders, usually interfere with browsers' preload scanner, and delay loading of full-quality images due to dependence on JavaScript needed to upgrade the previews to full images.</p><ul><li><p>Our solution doesn't cause any additional requests, and doesn't add any extra data. Overall page load time is not delayed.</p></li><li><p>Our solution doesn't require any JavaScript. It takes advantage of functionality supported natively in the browsers.</p></li><li><p>Our solution doesn't require any changes to the page's markup, so it's very safe and easy to deploy site-wide.</p></li></ul><p>The improvement in user experience is reflected in performance metrics such as the <b>SpeedIndex</b> metric and time to visually complete. Notice that with regular image loading the visual progress is linear, but with the progressive streaming it quickly jumps to mostly complete:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1lKfDWRojqzIpdPg7MwFGj/70c9f7df2546b5ea50d69e4bbafde7cc/image1-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/n9HPq3RQ8Ew4CTxAf8NP4/7ed0c32f1592de89fd2f3a95b18d74e4/image4.png" />
            
            </figure>
    <div>
      <h3>Getting the most out of progressive rendering</h3>
      <a href="#getting-the-most-out-of-progressive-rendering">
        
      </a>
    </div>
    <p>Avoid ruining the effect with JavaScript. Scripts that hide images and wait until the <code>onload</code> event to reveal them (with a fade in, etc.) will defeat progressive rendering. Progressive rendering works best with the good old <code>&lt;img&gt;</code> element.</p>
    <div>
      <h3>Is it JPEG-only?</h3>
      <a href="#is-it-jpeg-only">
        
      </a>
    </div>
    <p>Our implementation is format-independent, but progressive streaming is useful only for certain file types. For example, it wouldn't make sense to apply it to scripts or stylesheets: these resources are rendered as all-or-nothing.</p><p>Prioritizing image headers (which contain the image size) works for all file formats.</p><p>The benefits of progressive rendering are unique to JPEG (supported in all browsers) and JPEG 2000 (supported in Safari). GIF and PNG have interlaced modes, but these modes come at a cost of worse compression. WebP doesn't support progressive rendering at all. This creates a dilemma: WebP is usually 20%-30% smaller than a JPEG of equivalent quality, but progressive JPEG <i>appears</i> to load 50% faster. There are next-generation image formats that support progressive rendering better than JPEG, and compress better than WebP, but they're not supported in web browsers yet. In the meantime you can choose between the bandwidth savings of WebP and the better perceived performance of progressive JPEG by changing Polish settings in your Cloudflare dashboard.</p>
    <div>
      <h3>Custom header for experimentation</h3>
      <a href="#custom-header-for-experimentation">
        
      </a>
    </div>
    <p>We also support a custom HTTP header that allows you to experiment with, and optimize, streaming of other resources on your site. For example, you could make our servers send the first frame of animated GIFs with high priority and deprioritize the rest. Or you could prioritize loading of resources mentioned in <code>&lt;head&gt;</code> of HTML documents before <code>&lt;body&gt;</code> is loaded.</p><p>The custom header can be set only from a Worker. The syntax is a comma-separated list of file positions with priority and concurrency. The priority and concurrency are the same as in the whole-file cf-priority header described in the previous blog post.</p>
            <pre><code>cf-priority-change: &lt;offset in bytes&gt;:&lt;priority&gt;/&lt;concurrency&gt;, ...</code></pre>
            <p>For example, for a progressive JPEG we use something like (this is a fragment of JS to use in a Worker):</p>
            <pre><code>let headers = new Headers(response.headers);
headers.set("cf-priority", "30/0");
headers.set("cf-priority-change", "512:20/1, 15000:10/n");
return new Response(response.body, {headers});</code></pre>
            <p>This instructs the server to use priority 30 initially, while it sends the first 512 bytes. It then switches to priority 20 with some concurrency (<code>/1</code>), and finally, after sending 15000 bytes of the file, switches to low priority and high concurrency (<code>/n</code>) to deliver the rest of the file.</p><p>We’ll try to split HTTP/2 frames to match the offsets specified in the header to change the sending priority as soon as possible. However, priorities don’t guarantee that data of different streams will be multiplexed exactly as instructed, since the server can prioritize only when it has data of multiple streams waiting to be sent at the same time. If some of the responses arrive much sooner from the upstream server or the cache, the server may send them right away, without waiting for other responses.</p>
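<p>Where do the byte offsets come from? For a progressive JPEG, natural cut points are the starts of its scans, marked by 0xFF 0xDA (Start Of Scan) bytes. The sketch below is not our production parser — a robust one must also account for restart markers and other details — but because 0xFF bytes inside entropy-coded data are stuffed with 0x00, a plain marker scan is a reasonable first approximation:</p>

```javascript
// Illustrative sketch: find byte offsets of JPEG Start Of Scan markers
// (0xFF 0xDA). A progressive JPEG has several scans; a later scan's offset
// could serve as the "preview complete" cut point in cf-priority-change.
function findScanOffsets(bytes) {
  const offsets = [];
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] === 0xff && bytes[i + 1] === 0xda) offsets.push(i);
  }
  return offsets;
}
```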
    <div>
      <h3>Try it!</h3>
      <a href="#try-it">
        
      </a>
    </div>
    <p>Enable our <a href="/better-http-2-prioritization-for-a-faster-web"><i>Enhanced HTTP/2 Prioritization</i></a> feature, and JPEG images optimized by <a href="/introducing-polish-automatic-image-optimizati/"><i>Polish</i></a> or <a href="/announcing-cloudflare-image-resizing-simplifying-optimal-image-delivery/"><i>Image Resizing</i></a> will be streamed automatically.</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">75xuDiShO45CbDQuf5hkLY</guid>
            <dc:creator>Andrew Galloni</dc:creator>
            <dc:creator>Kornel Lesiński</dc:creator>
        </item>
        <item>
            <title><![CDATA[Too Old To Rocket Load, Too Young To Die]]></title>
            <link>https://blog.cloudflare.com/too-old-to-rocket-load-too-young-to-die/</link>
            <pubDate>Wed, 04 Jul 2018 14:58:21 GMT</pubDate>
            <description><![CDATA[ Rocket Loader is in the news again. One of Cloudflare's earliest web performance products has been re-engineered for contemporary browsers and Web standards.

It controls the load and execution of your JavaScript, ensuring useful, meaningful page content is unblocked and displayed sooner. ]]></description>
            <content:encoded><![CDATA[ <p>Rocket Loader is in the news again. One of Cloudflare's earliest web performance products has been re-engineered for contemporary browsers and Web standards.</p><p>No longer a beta product, Rocket Loader controls the load and execution of your page JavaScript, ensuring useful and meaningful page content is unblocked and displayed sooner.</p><p>For a high-level discussion of Rocket Loader's aims, please refer to our sister post, <a href="/we-have-lift-off-rocket-loader-ga-is-mobile/">We have lift off - Rocket Loader GA is mobile!</a></p><p>Below, we offer a lower-level outline of how Rocket Loader actually achieves its goals.</p>
    <div>
      <h3>Prehistory</h3>
      <a href="#prehistory">
        
      </a>
    </div>
    <p>Early humans looked upon Netscape 2.0, with its new ability to script HTML using LiveScript, and <code>&lt;BLINK&gt;</code>ed to assure themselves they weren’t dreaming. They decided to use this technology, soon to be re-christened JavaScript (a story told often and elsewhere), for everything they didn’t know they needed: form input validation, image substitution, frameset manipulation, popup windows, and more. The sole requirement was a few interpreted commands enclosed in a <code>&lt;script&gt;</code> tag. The possibilities were endless.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3koyiGY3ofb0n0pf4qzuIn/abff6cdaf842ca710c41d560f4396328/prehistory-_4x.png" />
            
            </figure><p>Soon, the introduction of the <code>src</code> attribute allowed them to import a file full of JS into their pages. Little need to fiddle with the markup, when all the requisite JS for the page could be included in a single, or a few, external files, specified in the page’s <code>&lt;HEAD&gt;</code>. It didn’t take our ancestors long before they decided that the same JS file(s) should be in <i>all</i> pages, throughout their website, containing JS for the complete site. No worries about bloat; after all, the browser would cache it.</p><p>A clear, sunny, road to dynamic, interactive sites lay ahead. What could go wrong?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3kYBa0v72xkrGHmbZlVApH/de4fe237c97739dc8b8359b4ec122ff1/blockage_4x.png" />
            
            </figure>
    <div>
      <h3>Blockage</h3>
      <a href="#blockage">
        
      </a>
    </div>
    <p>Those early JS adopters deduced that when the HTML parser encountered an external script, it suspended visual rendering of the page while it went off to retrieve and execute it. Simple. The more numerous and larger the scripts, the longer the wait for the page to paint. JavaScript, therefore, was very soon, most often unnecessarily, blocking page rendering.</p><p>The solutions poured in, both from the developer community and browser vendors:</p><ul><li><p><i>Community</i>: Move script location to the end of the HTML page. A classic <i>duh!</i> moment. Amazingly, this simple suggestion helped, unless the script was required to help build the page, e.g. using <code>document.write</code> for markup.</p></li><li><p><i>Vendor</i>: Use <code>&lt;script defer&gt;</code>. It’s 1997, and IE4 introduces the <code>defer</code> attribute. Scripts that do not contribute to the initial rendering of the page should be marked with <code>defer</code>, and they will load in parallel, without blocking, and be executed in their markup order before <code>window.load</code> is fired (later, before <code>document.DOMContentLoaded</code>). Script tags could remain in the <code>&lt;head&gt;</code>, and execute as if they were at the end of the page. The main benefit to page rendering was the saving in script retrieval time.</p></li><li><p><i>Community</i>: Reduce latency by reducing actual script size. What began as script <i>obfuscation</i> for intellectual property and vanity reasons, quickly became script <a href="https://en.wikipedia.org/wiki/Minification_(programming)"><i>minification</i></a>, still used widely.</p></li><li><p><i>Community</i>: Reduce latency and HTTP handshake instances through concatenation of all scripts, delivered as one.</p></li><li><p><i>Vendor</i>: Use <code>&lt;script async&gt;</code>. In 2010, 13 years (yes, 13, thirteen) after <code>defer</code> was born, HTML5 provided <code>defer</code> with a sibling, <code>async</code>. Scripts can be loaded asynchronously, be non-blocking, and be executed when they load. Markup order is irrelevant to execution order. A clear benefit over <code>defer</code> was that <code>load/DOMContentLoaded</code> events were not delayed.</p></li><li><p><i>Community</i>: Lazy Loading. Use JS to load JS by dynamically creating non-blocking script tags.</p></li><li><p><i>Cloudflare</i>: Rocket Loader. It's 2011, and Cloudflare enters the fray, leveraging our network to reduce HTTP requests for 1st party scripts, “bag”ging 3rd party scripts into a single file, and delaying and controlling JS execution. See <a href="/we-have-lift-off-rocket-loader-ga-is-mobile/">Combining Javascript &amp; CSS, a Better Way</a></p></li><li><p><i>Vendor</i>: Use <code>&lt;link rel="preload"&gt;</code> in the <code>&lt;head&gt;</code>. Important resources like scripts, in our case, can be specified for <i>preload</i>. The browser will load scripts in parallel and not block render-parsing.</p></li></ul>
    <div>
      <h3>Rocket Loader, The Early Years</h3>
      <a href="#rocket-loader-the-early-years">
        
      </a>
    </div>
    <p>Much has been written in this blog space about <a href="/tag/rocketloader/">Rocket Loader</a>, from its <a href="/how-cloudflare-rocket-loader-redefines-the-modern-cdn/">initial launch</a>, to the <a href="/we-have-lift-off-rocket-loader-ga-is-mobile/">current one</a>.</p><p>If reading outdated blog posts is not your thing, perhaps watching an extremely short video of a high-profile early Rocket Loader success (June 9, 2011) is: <a href="https://vimeo.com/24900882">CloudFlare Rocket Loader makes the Financial Times website (FT.com) faster</a></p><p>Rocket Loader improved page load times by:</p><ol><li><p>Minimising network requests through the bundling of JS files, including third-party, speeding up page rendering</p></li><li><p>Asynchronously loading the bundles, avoiding HTML parsing blockage</p></li><li><p>Caching scripts locally (using LocalStorage), reducing refetch requests.</p></li></ol><p>As browsers matured, Rocket Loader fell behind, leading to several severe shortcomings:</p><ul><li><p>It did not honour Content-Security-Policy. Rocket Loader was unaware of CSP headers, and loaded scripts indiscriminately.</p></li><li><p>It did not honour Subresource Integrity. Rocket Loader loaded scripts through XHR, so browsers could not validate the fetched script.</p></li><li><p>It allowed for <a href="https://www.cloudflare.com/learning/security/threats/cross-site-scripting/">XSS</a> persistence. Since Rocket Loader stored scripts in LocalStorage, a site’s compromised script could exist as a trojan in a customer’s storage, loading whenever the customer visited the site.</p></li><li><p>It was just out-of-date</p><ul><li><p>Script bundling fell out of favour with the introduction of HTTP/2.</p></li><li><p>The use of <code>eval()</code> was finally recognised as <i>evil</i>.</p></li><li><p>Mobile use skyrocketed; mobile browsers became sophisticated; eventually Rocket Loader was unable to support mobile.</p></li></ul></li></ul>
    <div>
      <h3>New and Improved Rocket Loader</h3>
      <a href="#new-and-improved-rocket-loader">
        
      </a>
    </div>
    <p>We recently rebuilt Rocket Loader from the ground up.</p><p>Although our aim remains the same, to improve customer page performance, we incorporated lessons learned. Most importantly, we learned not to aim too high. In order to satisfy all permutations of page layout, the old Rocket Loader created a virtual DOM, a decision that ultimately led to fragility. We've gone the simple, elegant route, knowing full well that there will be a minority of websites that will not benefit.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/515Gnp9rXBxLS4eicMWX5A/49004b2f6d659dff5b9bf10fba657629/new-and-improved-_4x.png" />
            
            </figure><p>The main concept behind Rocket Loader is quite straightforward: execute blocking scripts after all other page assets have loaded.</p><p>The scripts need to be loaded and executed in the originally intended order. Only external blocking scripts curtail page resources, but any script may rely on another one. We must simulate the <i>loading process of scripts</i>, mimicking how the browser would handle them during page load, but do it <i>after the page is actually fully loaded</i>.</p>
    <div>
      <h3>On the Server</h3>
      <a href="#on-the-server">
        
      </a>
    </div>
    <p>Rocket Loader has both a server-side and a client-side component. The goal of the former is to</p><ol><li><p>rewrite <code>&lt;script&gt;</code> tags in the page markup to make them non-executable, and</p></li><li><p>insert the client-side component of Rocket Loader into the page.</p></li></ol><p>The server-side component is built on top of our CF-HTML pipeline. CF-HTML is an nginx module that provides streaming HTML parsing and rewriting functionality with a SAX-style (<a href="https://en.wikipedia.org/wiki/Simple_API_for_XML">Simple API for XML</a>) API on top of it.</p><p>To make the scripts non-executable, we simply prefix their <code>type</code> attribute value with a randomly generated value (<a href="https://en.wikipedia.org/wiki/Cryptographic_nonce">nonce</a>), unique for each page request. Having a unique prefix for each page prevents Rocket Loader from being used as an XSS gadget to bypass various XSS filters.</p><p>Markup that looked like this:</p>
            <pre><code>&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;script src="example.org/1.js"&gt;&lt;/script&gt;
    &lt;script src="example.org/2.js" type="text/javascript"&gt;&lt;/script&gt;
  &lt;/head&gt;
  &lt;body&gt;
    ...body markup... 
    &lt;script src="example.org/3.js" type="text/javascript"&gt;&lt;/script&gt;
    ...more body markup... 
  &lt;/body&gt;
&lt;/html&gt;</code></pre>
            <p>becomes:</p>
            <pre><code>&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;script src="example.org/1.js" type="42deadbeef-"&gt;&lt;/script&gt;
    &lt;script src="example.org/2.js" type="42deadbeef-text/javascript"&gt;&lt;/script&gt;
  &lt;/head&gt;
  &lt;body&gt;
    ...body markup... 
    &lt;script src="example.org/3.js" type="42deadbeef-text/javascript"&gt;&lt;/script&gt;
    ...more body markup... 
    &lt;script src="https://ajax.cloudflare.com/rocket-loader.js"
            data-cf-nonce="42deadbeef" defer&gt;
  &lt;/body&gt;
&lt;/html&gt;</code></pre>
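<p>The attribute transformation itself is tiny. As a sketch (the function name is ours; the real rewriting happens inside the CF-HTML nginx module):</p>

```javascript
// Illustrative sketch: compute the neutralized type attribute for a script
// tag. Browsers skip scripts with unrecognized types, so the nonce prefix
// makes the script inert until Rocket Loader re-enables it client-side.
function neutralizedType(nonce, originalType) {
  // no type attribute  -> "42deadbeef-"
  // "text/javascript"  -> "42deadbeef-text/javascript"
  return `${nonce}-${originalType ?? ""}`;
}
```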
            <p>So far, no rocket science, but by making most, or all, scripts non-executable, Rocket Loader has unblocked page-parsing. Browsers display content sooner, improving perceived page load metrics, and engaging the user.</p>
    <div>
      <h4>On The Client</h4>
      <a href="#on-the-client">
        
      </a>
    </div>
    <p>Generally, scripts can be divided into four categories, each having distinct load and execution behaviours when inserted into the DOM:</p><ol><li><p><i>Inline scripts</i> - executed immediately upon insertion.</p></li><li><p><i>External blocking scripts</i> - start loading upon insertion, preventing other scripts from loading and executing.</p></li><li><p><i>External</i> <code>defer</code> <i>scripts</i> - start loading upon insertion, without preventing other scripts from loading and executing. Execution should happen right before <code>DOMContentLoaded</code> event.</p></li><li><p><i>External </i><code><i>async</i></code><i> scripts</i> - start loading upon insertion, without preventing other scripts from loading and executing. Executed when loaded.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4jwj7cYA0lu9sgSsf9boRx/86993ee16e0cd1eda9a3178017961b53/loadExecute1.png" />
            
            </figure><p>Modified diagram from <a href="https://html.spec.whatwg.org/#attr-script-defer">HTML Standard</a></p><p>To handle load and execution of all script types, Rocket Loader needs two passes.</p>
    <div>
      <h4>Pass One</h4>
      <a href="#pass-one">
        
      </a>
    </div>
    <p>On the first pass, we collect all scripts carrying our nonce onto a stack, then re-insert them into the DOM with the nonce removed, each wrapped in a comment node. These serve as our placeholders.</p>
            <pre><code>&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;!--&lt;script src="example.org/1.js"&gt;&lt;/script&gt;--&gt; 
    &lt;!--&lt;script src="example.org/2.js" type="text/javascript"&gt;&lt;/script&gt;--&gt;
  &lt;/head&gt;
  &lt;body&gt;
    ...body markup...
    &lt;!--&lt;script src="example.org/3.js" type="text/javascript"&gt;&lt;/script&gt;--&gt;
    ...more body markup... 
    &lt;script src="https://ajax.cloudflare.com/rocket-loader.js"
            data-cf-nonce="42deadbeef" defer&gt;
  &lt;/body&gt;
&lt;/html&gt;</code></pre>
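<p>At the markup level, the collection step can be sketched like this. This is a hypothetical, string-based rendition purely for illustration; the real client works on live DOM nodes, and it assumes the nonce contains no regex metacharacters:</p>

```javascript
// Sketch of pass one's collection step: find scripts carrying the nonce
// prefix, push the original (nonce-free) form onto a stack, and leave an
// HTML comment behind as a position-preserving placeholder.
function collectAndComment(html, nonce) {
  const stack = [];
  const re = new RegExp(
    `<script([^>]*) type="${nonce}-([^"]*)"([^>]*)></script>`, 'g');
  const out = html.replace(re, (m, pre, type, post) => {
    const original =
      `<script${pre}${type ? ` type="${type}"` : ''}${post}></script>`;
    stack.push(original);       // awaiting activation, in document order
    return `<!--${original}-->`; // placeholder keeps the DOM position
  });
  return { out, stack };
}

const { out, stack } = collectAndComment(
  '<script src="example.org/1.js" type="42deadbeef-"></script>',
  '42deadbeef'
);
// out === '<!--<script src="example.org/1.js"></script>-->'
```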
            <p>Rocket Loader now iterates through the scripts in our stack and re-inserts them, maintaining their intended position in relevant DOM collections (<code>document.scripts</code>, <code>document.querySelectorAll("script")</code>, <code>document.getElementsByTagName("script")</code>, etc.).</p><p>This process of script insertion and execution differs for each script category:</p><p><i>Inline scripts</i> - Placeholder is replaced with the original script element, without nonce, making the script executable. Browsers execute such scripts immediately upon insertion, <i>in the same execution tick</i>.</p><p><i>External blocking scripts</i> - As above, but Rocket Loader waits for the script’s <code>load</code> event before unwinding the script stack further. This delay simulates the script's blocking behaviour <b><i>manually</i></b>. Only parser-inserted external scripts (i.e. scripts present in the original HTML markup) are naturally blocking. External scripts inserted or created via a DOM API are considered async. This behaviour can’t be overridden, so we need our simulation.</p><p><i>External </i><code><i>async</i></code><i> scripts</i> - The same insertion procedure as inline scripts. Browsers treat all inserted external scripts as async, so the default behaviour suits us.</p><p><i>External </i><code><i>defer</i></code><i> scripts</i> - These are not executed during the first pass, since in the simulated environment we haven’t reached the <code>DOMContentLoaded</code> event yet. If we encounter a <code>defer</code> script on the stack we re-insert it, as is, without removing the nonce prefix. It remains non-executable, but in the correct DOM position.</p>
    <div>
      <h4>Pass Two</h4>
      <a href="#pass-two">
        
      </a>
    </div>
    <p>The second pass loads the <code>defer</code> scripts. Again, Rocket Loader collects all scripts with the nonce prefix (these are now just <code>defer</code> scripts) onto the execution stack, but does not replace them with placeholders. They remain in the DOM, since at this point in our simulated environment the complete document has loaded. We then activate them by replacing the <code>&lt;script&gt;</code> elements with themselves, nonce removed, and let the browser do the rest.</p>
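<p>The activation in both passes boils down to stripping the nonce prefix from the <code>type</code> attribute before swapping the element in. A minimal sketch of just that attribute transform (the element cloning and replacement are omitted):</p>

```javascript
// Strip the nonce prefix from a script's type attribute value. A result of
// null means the attribute existed purely for neutralisation and should be
// dropped entirely when the script element is re-created.
function activatedType(type, nonce) {
  const prefix = `${nonce}-`;
  if (!type.startsWith(prefix)) return type; // not one of ours: leave alone
  const rest = type.slice(prefix.length);
  return rest === '' ? null : rest;
}
```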
    <div>
      <h4>Quirks I: Taming the Waterfall</h4>
      <a href="#quirks-i-taming-the-waterfall">
        
      </a>
    </div>
    <p>Ostensibly, we have now simulated browser script loading and execution behaviours. However, there are some one-off issues we must deal with, quirks if you will.</p><p>There is one not-so-obvious difference between our algorithm and the real behaviour of browsers. Modern browsers try to be clever with the way they manage page resources, engaging various heuristics to improve performance during page load. These are, generally, implementation-specific and not set-in-stone by any specification.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UoWRV4FFktlhVFLGetJeC/093faf054eb7ba695d3f1ef05a184d3a/noRocketCroppedSm.png" />
            
            </figure><p>One such optimisation that affects us is <i>speculative parsing</i>. Despite the official specification requiring a browser to block parsing on script execution, browsers continue parsing received HTML markup speculatively, and prefetch any external resources they find. For example, even with blocking scripts on a page, Chrome loads them in parallel.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1uUQ3PUV8f5FCh9SCY2Y6V/370f69793a57e917b9765e28a7a99406/prevRocketCroppedSm-1.png" />
            
            </figure><p>With Rocket Loader, browsers don’t prefetch scripts, as our nonce makes them non-executable during page load. Later, when we re-insert activated scripts one by one, we witness a sequential “waterfall” graph.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tjsG9hdMmd8QDZduzsrXF/bfbd69cda8f237c8eed1826a1196b58f/withRocketCroppedSm.png" />
            
            </figure><p>In our attempt to improve page load performance, we significantly slowed down some script loading. Ironic. Fortunately, we have a workaround: we can insert <i>preload hints</i> (see <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Preloading_content">Preloading content with rel="preload"</a>) before we begin unwinding our script stack, giving the browser notice that we’ll soon be requiring these scripts. It begins fetching them as it would do during speculative parsing.</p><p>Our waterfall is replaced with improved parallel loading and better load metrics.</p>
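<p>Conceptually, the fix is just one preload hint per external script on the stack, emitted before unwinding begins. A hedged sketch of building those hints (the insertion into the live document is omitted):</p>

```javascript
// Build <link rel="preload"> hints for every external script on the stack,
// so the browser starts fetching them in parallel, restoring the effect of
// the speculative parsing that the nonce suppressed.
function preloadHints(scriptUrls) {
  return scriptUrls.map(
    (url) => `<link rel="preload" as="script" href="${url}">`
  );
}

const hints = preloadHints(['example.org/1.js', 'example.org/2.js']);
// hints[0] === '<link rel="preload" as="script" href="example.org/1.js">'
```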
    <div>
      <h4>Quirks II: <code>document.write()</code> is not dead yet</h4>
      <a href="#quirks-ii-document-write-is-not-dead-yet">
        
      </a>
    </div>
    <p>We've simulated script execution and insertion. We still need to deal with dynamic markup insertion. We can’t use <code>document.write()</code> directly since the document is already parsed and <code>document.close()</code> has been implicitly executed. Calling <code>write()</code> will create a new document, erasing the entire current document. We must manually parse content created by the <code>document.write</code> function and insert it in the intended location.</p><p>Not so simple when one considers that <code>document.write</code> can insert partial markup. In the following example, if we parse and insert content on the first <code>document.write</code> call, we’ll completely ignore the completion of the <code>id</code> attribute that should be inserted with the second call:</p>
            <pre><code>document.write('&lt;div id="elm');
document.write(Date.now());
document.write('"&gt;some content&lt;/div&gt;');</code></pre>
            <p>So, we have a hard choice:</p><ul><li><p>We can buffer <i>all</i> content inserted via <code>document.write</code> during script execution and flush it afterwards, in which case already executed code expecting elements to be in the DOM will fail, or</p></li><li><p>We can flush inserted markup immediately, but not handle partial markup writes.</p></li></ul><p>Choosing the lesser of two evils, we decided to go with the first option: our observations showed cases like these are more common.</p><p>(Actually, there is a third option that allows for handling of both cases, but it requires proxying of a significant number of DOM APIs, a rabbit hole that we don’t want to dive into, KISS FTW, you know…).</p>
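<p>A minimal sketch of the buffering approach, modelled here without a DOM. The idea is that <code>document.write</code> is swapped for a buffering function while an activated script executes, and the buffer is flushed afterwards:</p>

```javascript
// Buffer document.write calls during script execution and flush afterwards.
// Joining the chunks at flush time is what makes partial markup writes,
// like the split <div id="elm..."> example above, come out whole.
class WriteBuffer {
  constructor() { this.chunks = []; }
  write(markup) { this.chunks.push(markup); } // stands in for document.write
  flush() {
    const markup = this.chunks.join('');
    this.chunks = [];
    return markup; // ready to be parsed and inserted after the script runs
  }
}

const buf = new WriteBuffer();
buf.write('<div id="elm');
buf.write('123'); // stands in for Date.now()
buf.write('">some content</div>');
// buf.flush() === '<div id="elm123">some content</div>'
```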
    <div>
      <h4>Quirks III: I ain't got no <code>&lt;body&gt;</code></h4>
      <a href="#quirks-iii-i-aint-got-no-body">
        
      </a>
    </div>
    <p>As mentioned, it’s not enough to just insert parsed markup. There are various modifications of the DOM performed by the parser during full document parsing that contend with <i>malformed markup</i>. We felt we should simulate at least some of them, because, well… scripts may rely on malformed markup.</p><p>Our initial implementation even included simulation of relatively exotic mechanisms such as <a href="https://html.spec.whatwg.org/multipage/parsing.html#unexpected-markup-in-tables"><i>foster parenting</i></a>, but eventually we decided to keep things simple: the only thing Rocket Loader simulates is the squeezing out of disallowed content from the <code>&lt;head&gt;</code> element.</p><p>To perform this simulation we wrap our <code>document.write</code> buffer in a <code>&lt;head&gt;</code> element and feed this markup to the <a href="https://developer.mozilla.org/en-US/docs/Web/API/DOMParser">DOM Parser</a>.</p><p>Using the resulting document from the parser, we identify all nodes in its <code>&lt;head&gt;</code> and move them into the page, immediately following the script that performed the <code>document.write</code>. If we encounter any nodes in the parsed document's <code>&lt;body&gt;</code> element, we copy all nodes that follow the current script to the <code>&lt;body&gt;</code> element, prepended with the nodes in the parsed document.</p><p>To illustrate this simulation, consider the following page markup:</p>
            <pre><code>&lt;!DOCTYPE&gt;
&lt;head&gt;
  &lt;script&gt;
    document.write('&lt;link rel="stylesheet" href="1.css"&gt;');
    document.write('&lt;div&gt;&lt;/div&gt;');
    document.write('&lt;link rel="stylesheet" href="2.css"&gt;');
  &lt;/script&gt;
  &lt;link rel="stylesheet" href="3.css"&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;div&gt;Hey!&lt;/div&gt;
&lt;/body&gt;</code></pre>
            <p>The buffered, dynamically inserted, markup after script execution will be</p>
            <pre><code>&lt;link rel="stylesheet" href="1.css"&gt;
&lt;div&gt;&lt;/div&gt;
&lt;link rel="stylesheet" href="2.css"&gt;</code></pre>
            <p>and the string that we’ll feed to the DOMParser will be</p>
            <pre><code>&lt;!DOCTYPE&gt;
&lt;head&gt;
  &lt;link rel="stylesheet" href="1.css"&gt;
  &lt;div&gt;&lt;/div&gt;
  &lt;link rel="stylesheet" href="2.css"&gt;
&lt;/head&gt;</code></pre>
            <p>The parser will produce the following document structure from the provided markup (note that <code>&lt;div&gt;</code> is not allowed in <code>&lt;head&gt;</code> and was squeezed out to the <code>&lt;body&gt;</code>):</p>
            <pre><code>&lt;!DOCTYPE&gt;
&lt;html&gt;
&lt;head&gt;
  &lt;link rel="stylesheet" href="1.css"&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;div&gt;&lt;/div&gt;
  &lt;link rel="stylesheet" href="2.css"&gt;
&lt;/body&gt;
&lt;/html&gt;</code></pre>
            <p>Now we move all nodes that we found in parsed document's <code>&lt;head&gt;</code> to the original document:</p>
            <pre><code>&lt;!DOCTYPE&gt;
&lt;head&gt;
  &lt;script&gt;
    document.write('&lt;link rel="stylesheet" href="1.css"&gt;');
    document.write('&lt;div&gt;&lt;/div&gt;');
    document.write('&lt;link rel="stylesheet" href="2.css"&gt;');
  &lt;/script&gt;
  &lt;link rel="stylesheet" href="1.css"&gt;
  &lt;link rel="stylesheet" href="3.css"&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;div&gt;Hey!&lt;/div&gt;
&lt;/body&gt;</code></pre>
            <p>We see that parsed document's <code>&lt;body&gt;</code> contains some nodes, so we prepend them to the original document’s <code>&lt;body&gt;</code>:</p>
            <pre><code>&lt;!DOCTYPE&gt;
&lt;head&gt;
  &lt;script&gt;
    document.write('&lt;link rel="stylesheet" href="1.css"&gt;');
    document.write('&lt;div&gt;&lt;/div&gt;');
    document.write('&lt;link rel="stylesheet" href="2.css"&gt;');
  &lt;/script&gt;
  &lt;link rel="stylesheet" href="1.css"&gt;
  &lt;link rel="stylesheet" href="3.css"&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;div&gt;&lt;/div&gt;
  &lt;link rel="stylesheet" href="2.css"&gt;
  &lt;div&gt;Hey!&lt;/div&gt;
&lt;/body&gt;</code></pre>
            <p>And as a final step, we move all nodes in the <code>&lt;head&gt;</code> that initially followed the current script to after the nodes we’ve just inserted in the <code>&lt;body&gt;</code>:</p>
            <pre><code>&lt;!DOCTYPE&gt;
&lt;head&gt;
  &lt;script&gt;
    document.write('&lt;link rel="stylesheet" href="1.css"&gt;');
    document.write('&lt;div&gt;&lt;/div&gt;');
    document.write('&lt;link rel="stylesheet" href="2.css"&gt;');
  &lt;/script&gt;
  &lt;link rel="stylesheet" href="1.css"&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;div&gt;&lt;/div&gt;
  &lt;link rel="stylesheet" href="2.css"&gt;
  &lt;link rel="stylesheet" href="3.css"&gt;
  &lt;div&gt;Hey!&lt;/div&gt;
&lt;/body&gt;</code></pre>
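<p>The squeeze-out behaviour can be approximated with a simple partition rule: elements stay in the <code>&lt;head&gt;</code> until the first disallowed element appears, after which everything (including otherwise head-legal elements, such as the second stylesheet above) goes to the <code>&lt;body&gt;</code>. A much-simplified sketch of that rule, operating on pre-tokenised top-level elements rather than using the real DOMParser:</p>

```javascript
// Elements that the HTML parser allows inside <head>.
const HEAD_ALLOWED = new Set(
  ['title', 'base', 'link', 'meta', 'style', 'script', 'noscript', 'template']);

// Partition a buffer of top-level elements the way head parsing would:
// once a disallowed element is seen, it and everything after it belong to
// <body>, which is why the 2.css link above ends up in <body> too.
function squeezeHead(elements) {
  const head = [], body = [];
  let inBody = false;
  for (const el of elements) {
    const tag = ((el.match(/^<([a-z]+)/i) || [])[1] || '').toLowerCase();
    if (!inBody && HEAD_ALLOWED.has(tag)) head.push(el);
    else { inBody = true; body.push(el); }
  }
  return { head, body };
}

const { head, body } = squeezeHead([
  '<link rel="stylesheet" href="1.css">',
  '<div></div>',
  '<link rel="stylesheet" href="2.css">',
]);
// head: just the 1.css link; body: the div followed by the 2.css link
```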
            
    <div>
      <h4>Quirks IV: Handling handlers</h4>
      <a href="#quirks-iv-handling-handlers">
        
      </a>
    </div>
    <p>There is one edge case which drastically changes the behaviour of our script-loading simulation. If we encounter elements with inline event handlers in the HTML markup, we need to execute <i>all scripts that precede such elements</i>, since the handlers may rely on them.</p><p>We insert the Rocket Loader client-side script in a special "bailout" mode immediately before such elements. In bailout mode, we activate scripts the same way as in regular mode, except we do it in a blocking manner (remember, we need to prevent the element from being parsed while we activate all preceding scripts).</p><p>As noted, it’s impossible to dynamically create blocking external scripts using DOM APIs such as <code>document.appendChild</code>. However, we have a solution to overcome this limitation.</p><p>Since the page is still loading, we can <code>document.write</code> the <code>outerHTML</code> of the activatable script into the document, forcing the browser to mark it as parser-inserted and, thus, blocking. However, the script will be inserted in a DOM position different from its original, intended position, which may break traversal of surrounding nodes from within the script (e.g. using <code>document.currentScript</code> as a starting point).</p><p>There is a trick. We leverage browser behaviour which parses generated content in the same execution tick as the <code>document.write</code> that produced it, so we have immediate access to the written element. The <i>execution</i> of the external script is always scheduled for one of the <i>next</i> execution ticks. So, we can just move the script to its original position right after we write it and have it in the correct DOM position, awaiting its execution.</p>
    <div>
      <h4>"I can resist everything except temptation"<a href="#fn1">[1]</a></h4>
      <a href="#i-can-resist-everything-except-temptation">
        
      </a>
    </div>
    <p>The temptation to account for every quirk, every variation in browser parsing, is strong, but implementing them all would eventually only weaken our product. We've dealt with the better part of browser parser behaviours, enough to benefit the majority of our customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5vV1rYeaoGyrndNCqb0qyq/733478aacd708279692d6dc75d4d085b/rock-house-_4x.png" />
            
            </figure>
    <div>
      <h3>What's Next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>As Rocket Loader matures, and inevitably is affected by changes in Web technologies, it may be expanded and improved. For now, we're monitoring its use, identifying issues, and ensuring that it's worthy of its predecessor, which lasted through so many advances and changes in Web technology.</p><hr /><ol><li><p>Oscar Wilde, Lady Windermere's Fan (1892), and apologies to <a href="http://jethrotull.com/too-old/">Jethro Tull</a> for the blog post title. <a href="#fnref1">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Rocket Loader]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">6cHm33ksyTK14oTt1047Nr</guid>
            <dc:creator>Peter Belesis</dc:creator>
            <dc:creator>Ivan Nikulin</dc:creator>
        </item>
        <item>
            <title><![CDATA[We have lift off - Rocket Loader GA is mobile!]]></title>
            <link>https://blog.cloudflare.com/we-have-lift-off-rocket-loader-ga-is-mobile/</link>
            <pubDate>Fri, 01 Jun 2018 16:31:00 GMT</pubDate>
            <description><![CDATA[ Today we’re excited to announce the official GA of Rocket Loader, our JavaScript optimisation feature that will prioritise getting your content in front of your visitors faster than ever before with improved mobile device support.  ]]></description>
            <content:encoded><![CDATA[ <p>Today we’re excited to announce the official GA of Rocket Loader, our JavaScript optimisation feature that will prioritise getting your content in front of your visitors faster than ever before with improved mobile device support. In tests on <a href="http://www.cloudflare.com">www.cloudflare.com</a> we saw a reduction of 45% (almost 1 second) in First Contentful Paint times on our pages for visitors.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/25aMZaHJf0XUvF13Sao55h/704e6dd9f7ae3502d77dbedb1f9e2ae5/photo-1457364887197-9150188c107b" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@spacex?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">SpaceX</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><p>We initially launched Rocket Loader as a beta in June 2011, to asynchronously load a website’s JavaScript to dramatically improve the page load time. Since then, hundreds of thousands of our customers have benefited from a one-click option to boost the speed of your content.</p><p>With this release, we’ve vastly improved and streamlined Rocket Loader so that it works in conjunction with mobile &amp; desktop browsers to prioritise what matters most when loading a webpage: your content.</p>
    <div>
      <h3>Visitors don’t wait for page “load”</h3>
      <a href="#visitors-dont-wait-for-page-load">
        
      </a>
    </div>
    <p>To put it very simplistically - load time is a measure of when the browser has finished loading the document (HTML) and all assets referenced by that document.</p><p>When you clicked to visit this blog post, did you wait for the spinning wheel on your browser tab to start reading this content? You probably didn’t, and neither do your visitors. We’re conditioned to start consuming content as soon as it appears. However, the industry has focused too much on load event timing, ignoring user perception &amp; behaviour. Data from Google Analytics has shown that 53% of visits are abandoned if a mobile site takes longer than 3 seconds to load. This makes sense if you think about the last time you browsed onto a website in a hurry - if nothing is rendered quickly on screen you’re much more likely to go elsewhere.</p><p>Paint timing metrics are a closer approximation of how your users perceive speed. Put simply, paint timing measures when something is displayed on screen, and there are various stages of paint that can be measured &amp; reported, which we’ll explain as we go.</p>
    <div>
      <h3>Analysing your performance</h3>
      <a href="#analysing-your-performance">
        
      </a>
    </div>
    <p>One of the ways in which you can learn more about your website’s performance is to use one of the many great synthetic analysis tools out there. The <a href="https://developers.google.com/web/tools/lighthouse/">Lighthouse</a> tool in Chrome can run a performance audit on your page simulating a typical mobile device &amp; connection. I ran this on Cloudflare’s homepage to illustrate the way the page loads over time:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48XQjYN7Z7oEu95mvSXfYM/10aa074fb1393452f26e120460cc6d6b/Lighthouse-performance-filmstrip-without-Rocket-Loader.png" />
            
            </figure><p>The First Meaningful Paint (FMP) takes 4.8 seconds on this mobile device simulation. FMP doesn’t measure the time of the first paint, but it waits for any web fonts to render and for the biggest above-the-fold layout change to happen. The red line drawn at 4.8 seconds shows our FMP in this test. So what can we do to improve?</p>
    <div>
      <h3>Render blocking scripts are a problem for paint times</h3>
      <a href="#render-blocking-scripts-are-a-problem-for-paint-times">
        
      </a>
    </div>
    <p>Most tools will give you suggestions; Lighthouse calls these “opportunities” and orders them by their estimated time saving:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79GA4jBO0JSKSVstyCGs6y/2c97d235cbc1482beb916dcc875beb2b/Lighthouse-performance-opportunities-without-Rocket-Loader.png" />
            
            </figure><p>Try running a Lighthouse audit on your website and see what opportunities you get.</p>
    <div>
      <h4>The march of the scripts</h4>
      <a href="#the-march-of-the-scripts">
        
      </a>
    </div>
    <p>Now is a good time to quantify the spread of JavaScript on the web. Using the excellent <a href="https://httparchive.org">HTTP Archive</a> project we can review the make-up of the Alexa top 500k websites over the years. Using <a href="https://httparchive.org/reports/state-of-javascript?start=2011_06_01&amp;end=2018_05_01">their data</a>, we can see that the median number of JavaScripts on mobile has nearly quadrupled, from 5 at Rocket Loader’s launch in June 2011 to a whopping 19 in May 2018. So most of us have plenty of JavaScript on our websites and there’s a good chance it will be very high on the list of performance opportunities you can seize to improve your visitors’ experience.</p><p>Implementing this recommendation would require you to make changes to your origin application’s code to asynchronously load, defer or inline your scripts. In some cases this might not be possible because you don’t control your application’s code or have the expertise to implement these strategies. Rocket Loader to the rescue!</p>
    <div>
      <h3>How Rocket Loader works</h3>
      <a href="#how-rocket-loader-works">
        
      </a>
    </div>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73cenr7paH30Yn2VWxALxW/49b4759f66a080b839a9a4b976cc97df/Rocket-before.svg" />
            
            </figure><p>Without Rocket Loader</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6TVRmx5nlCPzaPampT8xDJ/7d60135dd5e75ca49673fe8d14b7fc70/Rocket-after.svg" />
            
            </figure><p>With Rocket Loader</p><p>New Rocket Loader prioritises paint time by locating the JavaScripts inside your HTML page and temporarily hiding them from the browser during page load. This allows the browser to continue parsing the rest of your HTML and begin discovering other assets, such as CSS &amp; images, that are required to render your page. Once that has completed, Rocket Loader dynamically inserts the scripts back into the page so the browser can load them.</p>
    <div>
      <h4>Enabling Rocket Loader</h4>
      <a href="#enabling-rocket-loader">
        
      </a>
    </div>
    <p>This bit is quite labour-intensive, so watch closely:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5cElUV4Id3F8ybKjRpaGxS/81d040e078fadebb4ccb7d0718438600/Enabling-Rocket-Loader-animation.gif" />
            
            </figure>
    <div>
      <h3>Measuring the impact</h3>
      <a href="#measuring-the-impact">
        
      </a>
    </div>
    <p>Let’s run Lighthouse again on our homepage now that we have Rocket Loader enabled:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1LWqQJ9m4mSVMHm0n5YYQh/bac12594f3ee8a6a24a607dc1669ca3e/Lighthouse-performance-fimstrip-with-Rocket-Loader.png" />
            
            </figure><p>So Lighthouse has detected that First Meaningful Paint is just over 1.5 seconds faster in this test - a really impressive improvement delivered from a single click!</p><p>To drive this home, the opportunity Lighthouse identified is now officially a “passed audit”:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48gHKjbXHuGJ4m2pn1yOBx/85ebab62116894cf848d044d1999344f/Lighthouse-performance-audit-with-Rocket-Loader.png" />
            
            </figure>
    <div>
      <h4>Let’s get Real (User Measurement)</h4>
      <a href="#lets-get-real-user-measurement">
        
      </a>
    </div>
    <p>To measure this with real users &amp; devices, last week we ran a simple A/B test on <a href="http://www.cloudflare.com">www.cloudflare.com</a> where 50% of page views were optimised using Rocket Loader so we could compare performance with and without it enabled. As shown by the Lighthouse audits above, our main website is a great use-case for Rocket Loader because while we do use a lot of JavaScript for some interactive aspects of our pages, what matters most to our visitors is reading information about Cloudflare’s network, products &amp; features. So in short, content should be prioritised over JavaScript.</p><p>To illustrate the changes we observed, below is a graph of the Time To First Contentful Paint (TTFCP) for <a href="http://www.cloudflare.com">www.cloudflare.com</a> visits by real users during our test. TTFCP measures the first time something in the Document Object Model (DOM) is painted on the page. For websites that are primarily for consumption of content rather than heavy interactions, this is a closer representation of a user’s perception of your website speed than measuring load time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69COKNcTYBszg9BxaQNTZM/0fef1b1230c18d1c1c216de0cf7cfaef/Distribution-of-Time-To-First-Contentful-Paint-in-Baseline-vs-Rocket-Loader.png" />
            
            </figure><p>With Rocket Loader you can see those orange bars are grouped more to the left (faster) and higher, meaning many more of our visitors were getting content on screen in under a second. In fact, the median improvement delivered by Rocket Loader during our <a href="http://www.cloudflare.com">www.cloudflare.com</a> test lands at a 0.93 second reduction in Time To First Contentful Paint, or around a 45% improvement over the baseline. Boom!</p>
    <div>
      <h3>What’s new in Rocket Loader</h3>
      <a href="#whats-new-in-rocket-loader">
        
      </a>
    </div>
    <p>So Rocket Loader continues to drive performance improvements on JavaScript-heavy websites, but lots has changed under the hood. Here is a summary of the key changes:</p><ul><li><p>Improves time to first paint, not just load time</p></li><li><p>Now compatible with over 93% of mobile devices<a href="#fn1">[1]</a></p></li><li><p>Tiny! Less than 10% of the size of prior version</p></li><li><p>Reduced complexity &amp; better compatibility with your on-site &amp; 3rd party JavaScripts</p></li><li><p>Compliant with stricter Content Security Policies (CSP)</p></li></ul>
    <div>
      <h4>More mobile users now get optimised</h4>
      <a href="#more-mobile-users-now-get-optimised">
        
      </a>
    </div>
    <p>We’ve already predicted that by the end of 2018 mobile usage will reach 60%. Additionally, as of July 2018 Google will begin using page speed in <a href="https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html">mobile search ranking</a>. With that in mind, providing a fast experience for mobile devices is more important than ever.</p><p>Rocket Loader beta was first launched back in June 2011, at a time when mobile device usage on the web was around 15%. That version of Rocket Loader intercepted JavaScript on the page, and executed it in a virtual sandbox: a world familiar, but with behavior changed behind the scenes. Unfortunately, the technique we chose to create this virtual sandbox didn't work well on all mobile browsers. That was considered an undesirable but acceptable trade-off at the time, but today it’s vital that customers can leverage this technique on mobile. Thanks to the reduced complexity of our approach, new Rocket Loader works on over 93%<a href="#fn1">[1:1]</a> of mobile devices in use today. For those devices that are not compatible, we simply deliver the website normally without this optimisation.</p>
    <div>
      <h4>Leaner &amp; Meaner</h4>
      <a href="#leaner-meaner">
        
      </a>
    </div>
    <p>New Rocket Loader’s JS weighs in at a lightweight 2.3KB. We did some <a href="/making-page-load-even-faster/">extensive refactoring</a> during 2017 that reduced the size of the JS required to run Rocket Loader from 47KB to 32KB and saved a staggering 213 terabytes of transfer across the globe. Due to the simplicity of the way new Rocket Loader works, rocket-loader.min.js is now less than 10% of the previous size, saving approximately another 417 terabytes of transfer each year when Rocket Loader does its thing.<a href="#fn2">[2]</a></p>
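<p>The yearly figure follows directly from the numbers in the footnote:</p>

```javascript
// Back-of-the-envelope check of the transfer saving quoted above:
// 270 million rocket-loader.min.js responses per week, with a 29.7KB
// saving per response (the previous 32KB script minus the new 2.3KB one).
const responsesPerWeek = 270e6;
const savedBytesPerResponse = 29.7e3;
const savedTBPerYear =
  (responsesPerWeek * savedBytesPerResponse * 52) / 1e12;
// savedTBPerYear ≈ 417
```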
    <div>
      <h4>Compatibility with Content Security Policy</h4>
      <a href="#compatibility-with-content-security-policy">
        
      </a>
    </div>
    <p>New Rocket Loader does not modify the content of your JavaScript, it only changes the time at which it is loaded, which means it plays nicely with any Content Security Policy (CSP) you have defined. With Rocket Loader beta, if you wanted to set a CSP that only allowed execution of scripts hosted on your domain, you would need to disable Rocket Loader, as it would also combine &amp; load external JavaScripts through your domain. New Rocket Loader does not use this approach and instead lets the browser load &amp; cache the files normally. As we also enable HTTP/2 for all of our customers, any first party scripts will load over a single TCP connection, and 3rd party scripts are still asynchronously loaded, meaning we can optimise their loading without proxying this content. All of this means modifying your CSP to accommodate Rocket Loader is as simple as allowing <code>script-src</code> for <code>https://ajax.cloudflare.com</code> so that Rocket Loader itself can load.</p>
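<p>For example, a minimal policy that permits first-party scripts plus Rocket Loader itself might look like this (adjust the source list for any other third-party scripts your pages load):</p>

```
Content-Security-Policy: script-src 'self' https://ajax.cloudflare.com
```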
    <div>
      <h3>How can I enable new Rocket Loader?</h3>
      <a href="#how-can-i-enable-new-rocket-loader">
        
      </a>
    </div>
    <p>If you already have Rocket Loader enabled as of today your site is using the new version. You can modify your settings at any time by visiting the Speed section of your Cloudflare settings.</p><p>If you had Rocket Loader disabled or in Manual mode just click the button in the Speed section to turn Rocket Loader on.</p>
    <div>
      <h3>What else can I do with Cloudflare to optimise my website?</h3>
      <a href="#what-else-can-i-do-with-cloudflare-to-optimise-my-website">
        
      </a>
    </div>
    <p>As always, achieving good performance typically takes a variety of approaches, and Rocket Loader tackles JavaScript specifically. There are some other very simple optimisations you should also ensure are enabled:</p><ul><li><p><b>Caching</b> - cache everything you can so content is served directly from any one of our 150+ data centres without waiting for your origin.</p></li><li><p><b>Minify &amp; Compress</b> - Enable minification of your HTML, <a href="https://www.cloudflare.com/learning/performance/how-to-minify-css/">CSS</a> &amp; JS in your Speed settings to losslessly reduce the total byte size of your web pages, and enable Brotli compression so browsers that support this new compression method receive smaller responses.</p></li><li><p><b>Optimise your images</b> - Polish will automatically reduce the size of images on your website with support for highly efficient formats such as WebP. You can also turn on Mirage to optimise images for mobile devices with poor connectivity.</p></li><li><p><b>Use HTTP/2</b> - You get HTTP/2 support automatically as long as your site is served over HTTPS. Move as much of your content as you can onto your Cloudflare enabled URLs so all of that content can be multiplexed down a single TCP connection.</p></li><li><p><b>Use Argo &amp; Railgun</b> - For dynamic content, Argo and Railgun can help optimise the connection &amp; transfer between Cloudflare and your origin server.</p></li></ul><hr /><ol><li><p>Rocket Loader utilises a browser API called <code>document.currentScript</code>, which is currently supported by 93.7% of mobile devices and growing: <a href="https://caniuse.com/#feat=document-currentscript">https://caniuse.com/#feat=document-currentscript</a> <a href="#fnref1">↩︎</a> <a href="#fnref1:1">↩︎</a></p></li><li><p>Back-of-the-envelope calculations based on 270 million rocket-loader.min.js responses served per week with a 29.7KB saving per serving. <a href="#fnref2">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Rocket Loader]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">2nGGFnv0vU43zJm2VyHzAK</guid>
            <dc:creator>Simon Moore</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing the Cloudflare Warp Ingress Controller for Kubernetes]]></title>
            <link>https://blog.cloudflare.com/cloudflare-ingress-controller/</link>
            <pubDate>Tue, 05 Dec 2017 14:00:00 GMT</pubDate>
            <description><![CDATA[ It’s ironic that the one thing most programmers would really rather not have to spend time dealing with is... a computer.  ]]></description>
            <content:encoded><![CDATA[ <p><i>NOTE: Prior to launch, this product was renamed Argo Tunnel. Read more in the </i><a href="/argo-tunnel/"><i>launch announcement</i></a><i>.</i></p><p>It’s ironic that the one thing most programmers would really rather not have to spend time dealing with is... a computer. When you write code it’s written in your head, transferred to a screen with your fingers and then it has to be run. On. A. Computer. Ugh.</p><p>Of course, code has to be run and typed on a computer so programmers spend hours configuring and optimizing shells, window managers, editors, build systems, IDEs, compilation times and more so they can minimize the friction all those things introduce. Optimizing your editor’s macros, fonts or colors is a battle to find the most efficient path to go from idea to running code.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/313EyDPITzKIECDYuj2qzq/a5eec59921a13d8aadfe7ed0e6ccc305/4532962327_c5a219d992_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/ivyfield/4532962327/in/photolist-7UyBLT-87ERdK-cCZVyC-8E715f-dbdKCZ-nGKair-cQay4s-ebwmMy-nAuTVv-jw9hxd-nqxc9h-nH1hJw-cp4c1Q-8B3PLE-PUxit-6gY6pQ-4P2q52-cCZVWL-6eRJAH-kNHY-nj1peY-nqxyHa-iNw9jP-5boJ6P-J3KVad-nj1hAZ-7yYuBu-8PCwt2-aJptFP-b4WLoM-nysiQJ-b8kxAV-BtcWbK-7yKiEj-cABXZ1-b8RR72-9LbLum-a6n7fX-X3SERX-br1nSQ-qdLBYQ-4XJsbd-5zXtUQ-dWePHa-qAi9Jt-awuoCM-cACicL-cA43Y1-nGQWPs-dotR4Y">image</a> by <a href="https://www.flickr.com/photos/ivyfield/">Yutaka Tsutano</a></p><p>Once the developer is managing their own universe they can write code at the speed of their mind. But when it comes to putting their code into production (which necessarily requires running their programs on machines that they don’t control) things inevitably go wrong. Production machines are never the same as developer machines.</p><p>If you’re not a developer, here’s an analogy. Imagine carefully writing an essay on a subject dear to your heart and then publishing it only to be told “unfortunately, the word ‘the’ is not available in the version of English the publisher uses and so your essay is unreadable”. That’s the sort of problem developers face when putting their code into production.</p><p>Over time different technologies have tried to deal with this problem: dual booting, different sorts of isolation (e.g. 
virtualenv, chroot), totally static binaries, virtual machines running on a developer desktop, elastic computing resources in clouds, and more recently <a href="https://en.wikipedia.org/wiki/Operating-system-level_virtualization">containers</a>.</p><p>Ultimately, using containers is all about a developer being able to say “it ran on my machine” and be sure that it’ll run in production, because fighting incompatibilities between operating systems, libraries and runtimes that differ from development to production is a waste of time (in particular developer brain time).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3w1Pjk0GrYgcVwz3LiliHI/9c14d99ad7b4366a4db6d961f6a5e472/14403331148_bf25864944_k-1.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/jumilla/14403331148/in/photolist-nWLQxE-3eHezK-a85qp6-iEoK39-iYjss-9at6Z7-iYjsr-9gYg6a-bW3NeL-6AMx91-6rJaN3-28rg3L-TvqJmB-GjMFdK-2UrVHv-fs2WCj-4LGK2-2UrYz4-2Us2tK-Eeqeo-85HX8j-rF6SGG-o9rBXe-fWrkwA-dcGZAo-aoHuTF-SGpPT3-boaQy8-u8Bei-62JAWa-s9fFGo-61fNWq-fJYrjR-axYxm-2h42pU-2h42w7-rRNyES-fKfUKQ-6YXYGU-VjiSN1-4Xcg61-7YmaKY-WyF1oq-bE83qB-dvQoQw-CRQx6-82fwLo-fvJhXq-gmkcM-U3mP5E">image</a> by <a href="https://www.flickr.com/photos/jumilla/">Jumilla</a></p><p>In parallel, the rise of microservices is also a push to optimize developer brain time. The reality is that we all have limited brain power and ability to comprehend the complex systems that we build in their entirety and so we break them down into small parts that we can understand and test: functions, modules and services.</p><p>A microservice with a well-defined API and related tests running in a container is the ultimate developer fantasy. An entire program, known to operate correctly, that runs on their machine and in production.</p><p>Of course, no silver lining is without its cloud and containers beget a coordination problem: how do all these little programs find each other, scale, handle failure, log messages, communicate and remain secure. The answer, of course, is a coordination system like <a href="https://kubernetes.io/">Kubernetes</a>.</p><p>Kubernetes completes the developer fantasy by allowing them to write and deploy a service and have it take part in a whole.</p><p>Sadly, these little programs have one last hurdle before they turn into useful Internet services: they have to be connected to the brutish outside world. Services must be safely and scalably exposed to the Internet.</p><p>Recently, Cloudflare introduced a new service that can be used to connect a web server to Cloudflare without needing to have a public IP address for it. 
That service, <a href="https://warp.cloudflare.com">Cloudflare Warp</a>, maintains a connection from the server into the Cloudflare network. The server is then only exposed to the Internet through Cloudflare with no way for attackers to reach the server directly.</p><p>That means that any connection to it is protected and accelerated by Cloudflare’s service.</p>
    <div>
      <h3>Cloudflare Warp Ingress Controller and StackPointCloud</h3>
      <a href="#cloudflare-warp-ingress-controller-and-stackpointcloud">
        
      </a>
    </div>
    <p>Today, we are extending Warp’s reach by announcing the Cloudflare Warp <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/">Ingress Controller</a> for Kubernetes (it’s an open source project and can be found <a href="https://github.com/cloudflare/cloudflare-warp-ingress">here</a>). We worked closely with the team at <a href="https://stackpoint.io/">StackPointCloud</a> to integrate Warp, Kubernetes and their universal control plane for Kubernetes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/P1ACOSWhcrBdZU7WlNcHi/798b6421302ef4c4c070607c2df97fb8/Screen-Shot-2017-12-04-at-5.28.18-PM-2.png" />
            
            </figure><p>Within Kubernetes, creating an ingress with the annotation <code>kubernetes.io/ingress.class: cloudflare-warp</code> will automatically create secure Warp tunnels to Cloudflare for any service using that ingress. The entire lifecycle of the tunnels is transparently managed by the ingress controller, making it trivially easy to expose Kubernetes-managed services securely via Cloudflare Warp.</p><p>The Warp Ingress Controller is responsible for finding Warp-enabled services and registering them with Cloudflare using the hostname(s) specified in the Ingress resource. It is added to a Kubernetes cluster by creating a file called warp-controller.yaml with the content below:</p>
            <pre><code>apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: null
  generation: 1
  labels:
    run: warp-controller
  name: warp-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      run: warp-controller
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: warp-controller
    spec:
      containers:
      - command:
        - /warp-controller
        - -v=6
        image: quay.io/stackpoint/warp-controller:beta
        imagePullPolicy: Always
        name: warp-controller
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - name: cloudflare-warp-cert
          mountPath: /etc/cloudflare-warp
          readOnly: true
      volumes:
        - name: cloudflare-warp-cert
          secret:
            secretName: cloudflare-warp-cert
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30</code></pre>
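            <p>For illustration, an Ingress resource that uses the controller might look like the following sketch (the hostname, service name and port here are placeholders):</p>
            <pre><code>apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    kubernetes.io/ingress.class: cloudflare-warp
spec:
  rules:
  - host: example.mydomain.com
    http:
      paths:
      - backend:
          serviceName: example-service
          servicePort: 80</code></pre>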
            <p>The full documentation is <a href="https://developers.cloudflare.com/argo-tunnel/reference/kubernetes/">here</a> and shows how to get up and running with Kubernetes and Cloudflare Warp on StackPointCloud, Google GKE, Amazon EKS or even <a href="https://kubernetes.io/docs/getting-started-guides/minikube/">minikube</a>.</p>
    <div>
      <h3>One Click with StackPointCloud</h3>
      <a href="#one-click-with-stackpointcloud">
        
      </a>
    </div>
    <p>Within StackPointCloud, adding the Cloudflare Warp Ingress Controller requires just <i>a single click</i>. One more click and you've deployed a Kubernetes cluster.</p><p>The connection between the Kubernetes cluster and Cloudflare is made using a TLS tunnel, ensuring that all communication between the cluster and the outside world is secure.</p><p>Once connected, the cluster and its services benefit from Cloudflare's DDoS protection, WAF, global load balancing, health checks, and huge global network.</p><p>The combination of Kubernetes and Cloudflare makes managing, scaling, accelerating and protecting Internet-facing services simple and fast.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Argo Smart Routing]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <guid isPermaLink="false">3JzDlyg9g51wqTsZdMuHW3</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Delivering Dot]]></title>
            <link>https://blog.cloudflare.com/f-root/</link>
            <pubDate>Sun, 10 Sep 2017 17:04:26 GMT</pubDate>
            <description><![CDATA[ Since March 30, 2017, Cloudflare has been providing DNS Anycast service as additional F-Root instances under contract with ISC (the F-Root operator).


F-Root is a single IPv4 address plus a single IPv6 address which both ISC and Cloudflare announce to the global Internet as a shared Anycast. This document reviews how F-Root has performed since that date in March 2017.


The DNS root servers are an important utility provided to all clients on the Internet for free - all F root instances includin ]]></description>
            <content:encoded><![CDATA[ <p>Since March 30, 2017, Cloudflare has been providing DNS Anycast service as additional F-Root instances under contract with ISC (the F-Root operator).</p><p>F-Root is a single IPv4 address plus a single IPv6 address which both ISC and Cloudflare announce to the global Internet as a shared Anycast. This document reviews how F-Root has performed since that date in March 2017.</p><p>The DNS root servers are an important utility provided to all clients on the Internet for free - all F-Root instances, including those hosted on the Cloudflare network, are a free service provided by both ISC and Cloudflare for public benefit. Because every online request begins with a DNS lookup, and every DNS lookup requires the retrieval of information stored on the DNS root servers, the DNS root servers play an invaluable role in the functioning of the Internet.</p><p>At Cloudflare, we were excited to work with ISC to bring greater security, speed and new software diversity to the root server system. First, the root servers, because of their crucial role, are often the subject of large scale volumetric DDoS attacks, which Cloudflare specializes in mitigating (Cloudflare is mitigating two ongoing DDoS attacks as we write this). Second, with a distributed network of data centers in well over 100 global cities, Cloudflare DNS is close to the end client, which reduces <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">round trip times</a>. And lastly, the F-Root nodes hosted by Cloudflare run Cloudflare’s in-house DNS software, written in Go, which brings new code diversity to the root server system.</p><p>Throughout the deployment, ISC and Cloudflare paid close attention to telemetry measurements to ensure a positive impact on the global DNS and root server system. 
Here is what both organizations observed when transit was enabled to Cloudflare DNS servers for F-Root.</p><p>Using <a href="https://atlas.ripe.net/">RIPE atlas probe measurements</a>, we can see an immediate performance benefit to the F-Root server, from 8.24 median RTT to 4.24 median RTT.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/13L4zY61yMlB1DhhOhsUZk/f0ad9663fd5784f54d843cf7f0f5dfd1/Screen-Shot-2017-03-30-at-5.03.20-PM.png" />
            
            </figure><p>F-Root actually became one of the fastest performing root servers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RcvHB8xBIzPqC6ImemPj2/170f12f49f51779083c0fa174d26efb6/Screen-Shot-2017-03-30-at-4.54.03-PM.png" />
            
            </figure><p>The biggest performance improvement was in the 90th percentile, that is, the 10% of queries that received the slowest replies. The graph below shows the 90th percentile response time for each RIPE atlas probe. Each probe is represented by two markers: a red X for before Cloudflare enabled transit, and a blue X for after Cloudflare began announcing. You can see a drop in 90th percentile response times; the blue X’s are much lower than the red X’s.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3PtFnGp2WgTaocUCbcGDsT/b4bf46ac1949b419db1c6902045e4cde/image2017-4-4-14-11-9.png" />
            
            </figure><p>One of the optimizations that DNS resolvers do is preferring the faster root servers. As F-Root picked up speed, DNS resolvers also started sending it more traffic. Here you can see the aggregate number of queries received by each root letter per day, with an increase to F starting on March 30th.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3g8Idpfyj3eTGjHg6oICIo/037be9c789d68d5c5d9c199618b239bf/root-servers---stacked.png" />
            
            </figure><p>One large public DNS resolver shared with us their internal metrics, where you can also see a large shift of traffic to F-Root as F-Root increased in speed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KpIbZqEsaMnJReKarZRs3/bd29090c946e5f274fe9121f104816a8/resolver-traffictocfroot.png" />
            
            </figure><p>When one external DNS monitor published their <a href="https://blog.thousandeyes.com/2017-update-comparing-root-server-performance-globally/">Root Server measurement report</a> in June 2017, they mentioned F-Root’s increased performance. “9 more countries than in 2015 observed F-Root as the fastest — this is the biggest change across all of the root servers, so in this sense F-Root is “most improved.” F-Root has increasingly become the fastest root server in significant portions of Asia Pacific, Latin America and Eastern Europe.” They noted that “F-Root is now the fastest for roughly one quarter of the countries we tested from.”</p><p>We are happy to be working with ISC on delivering answers for F-Root and aim in the process to improve the speed and security of the F-Root server.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">6HZGHulKO43flc4dFeD4pv</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we built rate limiting capable of scaling to millions of domains]]></title>
            <link>https://blog.cloudflare.com/counting-things-a-lot-of-different-things/</link>
            <pubDate>Wed, 07 Jun 2017 12:47:51 GMT</pubDate>
            <description><![CDATA[ Back in April we announced Rate Limiting of requests for every Cloudflare customer. Being able to rate limit at the edge of the network has many advantages: it’s easier for customers to set up and operate, their origin servers are not bothered by excessive traffic or layer 7 attacks. ]]></description>
            <content:encoded><![CDATA[ <p>Back in April we announced <a href="/rate-limiting/">Rate Limiting</a> of requests for every Cloudflare customer. Being able to rate limit at the edge of the network has many advantages: it’s easier for customers to set up and operate, their origin servers are not bothered by excessive traffic or layer 7 attacks, the performance and memory cost of rate limiting is offloaded to the edge, and more.</p><p>In a nutshell, <a href="https://www.cloudflare.com/learning/bots/what-is-rate-limiting/">rate limiting</a> works like this:</p><ul><li><p>Customers can define one or more rate limit rules that match particular HTTP requests (failed login attempts, expensive API calls, etc.)</p></li><li><p>Every request that matches the rule is counted per client IP address</p></li><li><p>Once that counter exceeds a threshold, further requests are not allowed to reach the origin server and an error page is returned to the client instead</p></li></ul><p>This is a simple yet effective protection against brute force attacks on login pages and other sorts of abusive traffic like <a href="https://www.cloudflare.com/learning/ddos/application-layer-ddos-attack/">L7 DoS attacks</a>.</p><p>Doing this with possibly millions of domains and even more millions of rules immediately becomes a bit more complicated. This article is a look at how we implemented a rate limiter that runs quickly and accurately at the edge of the network and copes with the colossal volume of traffic we see at Cloudflare.</p>
    <div>
      <h3>Let’s just do this locally!</h3>
      <a href="#lets-just-do-this-locally">
        
      </a>
    </div>
    <p>As the Cloudflare edge servers are running NGINX, let’s first see how the stock <a href="https://nginx.org/en/docs/http/ngx_http_limit_req_module.html">rate limiting</a> module works:</p>
            <pre><code>http {
    limit_req_zone $binary_remote_addr zone=ratelimitzone:10m rate=15r/m;
    ...
    server {
        ...
        location /api/expensive_endpoint {
            limit_req zone=ratelimitzone;
        }
    }
}</code></pre>
            <p>This module works great: it is reasonably simple to use (though it requires a config reload for each change), and very efficient. The only problem is that if the incoming requests are spread across a large number of servers, this doesn’t work any more. The obvious alternative is to use some kind of centralized data store. Thanks to NGINX’s Lua scripting module, which we already use extensively, we could easily implement similar logic using any kind of central data backend.</p><p>But then another problem arises: how do we make this fast and efficient?</p>
    <div>
      <h3>All roads lead to Rome? Not with anycast!</h3>
      <a href="#all-roads-lead-to-rome-not-with-anycast">
        
      </a>
    </div>
    <p>Since Cloudflare has a vast and diverse network, reporting all counters to a single central point is not a realistic solution: the latency is far too high, and guaranteeing the availability of a central service poses further challenges.</p><p>First, let’s take a look at how traffic is routed in the Cloudflare network. All the traffic going to our edge servers is <a href="https://en.wikipedia.org/wiki/Anycast">anycast</a> traffic. This means that we announce the same IP address for a given web application, site or API worldwide, and traffic will be automatically and consistently routed to the closest live data center.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3MXm1hRKiJKtqZuSG8Ie0j/9b1869712dd6b0538cf7e3615fe4136d/world.png" />
            
            </figure><p>This property is extremely valuable: we are sure that, under normal conditions<a href="#fn1">[1]</a>, the traffic from a single IP address will always reach the same PoP. Unfortunately each new TCP connection might hit a different server inside that PoP. But we can still narrow down our problem: we can actually create an isolated counting system inside each PoP. This mostly solves the latency problem and greatly improves the availability as well.</p>
    <div>
      <h3>Storing counters</h3>
      <a href="#storing-counters">
        
      </a>
    </div>
    <p>At Cloudflare, each server in our edge network is as independent as possible to keep administration simple. Unfortunately for rate limiting, we saw that we do need to share data across many different servers.</p><p>We actually had a similar problem in the past with <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security#Session_IDs">SSL session IDs</a>: each server needed to fetch TLS connection data about past connections. To solve that problem we created a <a href="https://github.com/twitter/twemproxy">Twemproxy</a> cluster inside each of our PoPs: this allows us to split a memcache<a href="#fn2">[2]</a> database across many servers. A <a href="https://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a> algorithm ensures that when the cluster is resized, only a small number of keys are hashed differently.</p><p>In our architecture, each server hosts a shard of the database. As we already had experience with this system, we wanted to leverage it for rate limiting as well.</p>
    <div>
      <h3>Algorithms</h3>
      <a href="#algorithms">
        
      </a>
    </div>
    <p>Now let’s take a deeper look at how the different rate limit algorithms work. What we call the <i>sampling period</i> in the next paragraphs is the reference unit of time for the counter (1 second for a 10 req/sec rule, 1 minute for a 600 req/min rule, ...).</p><p>The most naive implementation is to simply increment a counter that we reset at the start of each sampling period. This works, but is not terribly accurate, as the counter is arbitrarily reset at regular intervals, allowing regular traffic spikes to slip through the rate limiter. This can be a problem for resource intensive endpoints.</p><p>Another solution is to store the timestamp of every request and count how many were received during the last sampling period. This is more accurate, but has huge processing and memory requirements, as checking the state of the counter requires reading and processing a lot of data, especially if you want to rate limit over a long period of time (for instance 5,000 req per hour).</p><p>The <a href="https://en.wikipedia.org/wiki/Leaky_bucket">leaky bucket</a> algorithm allows a great level of accuracy while being nicer on resources (this is what the stock NGINX module uses). Conceptually, it works by incrementing a counter when each request comes in. That same counter is also decremented over time based on the allowed rate of requests until it reaches zero. The capacity of the bucket is what you are ready to accept as “burst” traffic (important given that legitimate traffic is not always perfectly regular). If the bucket is full despite its decay, further requests are mitigated.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nur2wjEAyjE0S6VYblkr3/080ad1cbb09aa1bf6a294ffb4955e9e0/leakybucket.svg.png" />
            
            </figure><p>However, in our case, this approach has two drawbacks:</p><ul><li><p>It has two parameters (average rate and burst) that are not always easy to tune properly</p></li><li><p>We were constrained to use the memcached protocol and this algorithm requires multiple distinct operations that we cannot do atomically<a href="#fn3">[3]</a></p></li></ul><p>So the situation was that the only operations available were <code>GET</code>, <code>SET</code> and <code>INCR</code> (atomic increment).</p>
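    <p>To make the mechanics concrete, here is a minimal single-node leaky bucket in Python. This is a sketch of the general algorithm, not our edge implementation; as the constraints above show, its read-modify-write cycle is exactly what memcache cannot do atomically for us:</p>
            <pre><code>import time

class LeakyBucket:
    """Leaky bucket: the counter fills by 1 per request and drains
    continuously at `rate` requests per second, up to `burst` capacity."""

    def __init__(self, rate, burst):
        self.rate = float(rate)    # drain rate (allowed req/sec)
        self.burst = float(burst)  # bucket capacity (tolerated burst)
        self.level = 0.0
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the bucket for the time elapsed since the last request.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1.0 > self.burst:
            return False           # bucket overflows: mitigate this request
        self.level += 1.0
        return True</code></pre>
            <p>With <code>rate=1, burst=3</code>, three back-to-back requests are allowed and the fourth is mitigated until the bucket has had time to drain.</p>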
    <div>
      <h3>Sliding windows to the rescue</h3>
      <a href="#sliding-windows-to-the-rescue">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2gvU2RabELQEWsauNMm8cu/ac296e8bf4e122f18935500ef0bd1c2e/14247536929_1a6315311d_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/halfrain/14247536929/">image</a> by <a href="https://www.flickr.com/photos/halfrain/">halfrain</a></p><p>The naive fixed window algorithm is actually not that bad: we just have to solve the problem of completely resetting the counter for each sampling period. But actually, can’t we just use the information from the previous counter in order to extrapolate an accurate approximation of the request rate?</p><p>Let’s say I set a limit of 50 requests per minute on an API endpoint. The counter can be thought of like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4SfLGYqNjjjxfJo0C9rU4H/e000ba22680e27910c8963a7111a97c8/sliding.svg.png" />
            
            </figure><p>In this situation, I did 18 requests during the current minute, which started 15 seconds ago, and 42 requests during the entire previous minute. Based on this information, the rate approximation is calculated like this:</p>
            <pre><code>rate = 42 * ((60-15)/60) + 18
     = 42 * 0.75 + 18
     = 49.5 requests</code></pre>
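            <p>The same computation can be sketched in a few lines of Python (the function name is ours, and the counter values are taken from the worked example above):</p>
            <pre><code>def approx_rate(prev_count, curr_count, elapsed, period=60):
    """Sliding window estimate: weight the previous period's counter by
    the fraction of the window it still covers, then add the current one."""
    return prev_count * ((period - elapsed) / period) + curr_count

# Worked example: 42 requests during the entire previous minute,
# 18 requests so far in the current minute, which started 15 seconds ago.
rate = approx_rate(prev_count=42, curr_count=18, elapsed=15)
assert rate == 49.5  # just under a 50 req/min threshold</code></pre>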
            <p>One more request during the next second and the rate limiter will start being very angry!</p><p>This algorithm assumes a constant rate of requests during the previous sampling period (which can be any time span); this is why the result is only an approximation of the actual rate. The algorithm can be improved, but in practice it proved to be good enough:</p><ul><li><p>It smoothes the traffic spike issue that the fixed window method has</p></li><li><p>It is very easy to understand and configure: there is no average vs. burst traffic to tune, and longer sampling periods can be used to achieve the same effect</p></li><li><p>It is still very accurate, as an analysis of 400 million requests from 270,000 distinct sources showed:</p><ul><li><p>0.003% of requests were wrongly allowed or rate limited</p></li><li><p>The average difference between the real rate and the approximate rate was 6%</p></li><li><p>3 sources were allowed despite generating traffic slightly above the threshold (false negatives); their actual rate was less than 15% above the threshold rate</p></li><li><p>None of the mitigated sources was below the threshold (no false positives)</p></li></ul></li></ul><p>Moreover, it offers interesting properties in our case:</p><ul><li><p>Tiny memory usage: only two numbers per counter</p></li><li><p>Incrementing a counter can be done by sending a single <code>INCR</code> command</p></li><li><p>Calculating the rate is reasonably easy: one <code>GET</code> command<a href="#fn4">[4]</a> and some very simple, fast math</p></li></ul><p>So here we are: we can finally implement a good counting system using only a few memcache primitives and without much contention. Still, we were not happy: this approach requires a memcached query to get the rate. At Cloudflare we’ve seen a few of the largest L7 attacks ever, and we knew that an attack at that scale would crush the memcached cluster if we queried it for every request. 
More importantly, such queries would slow down legitimate requests a little, even under normal conditions. This is not acceptable.</p><p>This is why the increment jobs are run asynchronously, without slowing down requests. If the request rate is above the threshold, another piece of data is stored asking all servers in the PoP to start applying the mitigation for that client. Only this bit of information is checked during request processing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3pCPtEGvM5WPv1vUPzc0MH/ba54eae2f0aeaa2ae960c84bfe32be8c/el-sequence.svg.png" />
            
            </figure><p>Even more interesting: once a mitigation has started, we know exactly when it will end. This means that we can cache that information in the server memory itself. Once a server starts to mitigate a client, it will not even run another query for the subsequent requests it might see from that source!</p><p>This last tweak allowed us to efficiently mitigate large L7 attacks without noticeably penalizing legitimate requests.</p>
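Because a mitigation's end time is known as soon as it starts, the per-server cache described above can be as simple as a map from client to expiry time. A hypothetical sketch (names invented; the production system stores and shares this state differently):

```python
import time

_mitigated_until = {}  # client key -> absolute time at which its mitigation ends

def start_mitigation(client, duration_s, now=None):
    # Called once when the threshold is exceeded; the end time is fixed up front.
    now = time.time() if now is None else now
    _mitigated_until[client] = now + duration_s

def is_mitigated(client, now=None):
    # Checked on every request: a pure in-memory lookup, no memcached query.
    now = time.time() if now is None else now
    end = _mitigated_until.get(client)
    if end is None:
        return False
    if now >= end:
        del _mitigated_until[client]  # mitigation expired; forget the entry
        return False
    return True
```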
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Despite being a young product, the rate limiter is already being used by many customers to control the rate of requests that their origin servers receive. The rate limiter already handles several billion requests per day, and we recently mitigated attacks with as many as 400,000 requests per second to a single domain without degrading service for legitimate users.</p><p>We have just started to explore how we can efficiently protect our customers with this new tool. We are looking into more advanced optimizations and creating new features on top of the existing work.</p><p>Interested in working on high-performance code running on thousands of servers at the edge of the network? Consider <a href="https://www.cloudflare.com/careers/">applying</a> to one of our open positions!</p><hr /><ol><li><p>The inner workings of anycast route changes are outside of the scope of this article, but we can assume that they are rare enough in this case. <a href="#fnref1">↩︎</a></p></li><li><p>Twemproxy also supports Redis, but our existing infrastructure was backed by <a href="https://github.com/twitter/twemcache">Twemcache</a> (a Memcached fork). <a href="#fnref2">↩︎</a></p></li><li><p>Memcache does support <a href="https://github.com/memcached/memcached/wiki/Commands#cas">CAS</a> (Compare-And-Set) operations, so optimistic transactions are possible, but they are hard to use in our case: during attacks we will have a lot of requests, creating a lot of contention, in turn resulting in a lot of CAS transactions failing. <a href="#fnref3">↩︎</a></p></li><li><p>The counters for the previous and current minute can be retrieved with a single GET command. <a href="#fnref4">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Rate Limiting]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">3zVAZ4bZB7IdYAVjbqN8RP</guid>
            <dc:creator>Julien Desgats</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Accelerated Mobile Links: Making the Mobile Web App-Quick]]></title>
            <link>https://blog.cloudflare.com/accelerated-mobile/</link>
            <pubDate>Thu, 12 Jan 2017 06:00:00 GMT</pubDate>
            <description><![CDATA[ We predict that in 2017 more than half of the traffic to Cloudflare's network will come from mobile devices. Even when pages are formatted for small screens, the mobile web is built on traditional web protocols and technologies that were designed for the desktop. ]]></description>
            <content:encoded><![CDATA[ <p>We predict that in 2017 more than half of the traffic to Cloudflare's network will come from mobile devices. Even when pages are formatted for small screens, the mobile web is built on traditional web protocols and technologies that were designed for desktop CPUs, network connections, and displays. As a result, browsing the mobile web feels sluggish compared with using native mobile apps.</p><p>In October 2015, the team at Google announced <a href="http://ampproject.org">Accelerated Mobile Pages (AMP)</a>, a new, open technology to make the mobile web as fast as native apps. Since then, a large number of publishers have adopted AMP. Today, 600 million pages across 700,000 different domains are available in the AMP format.</p><p>The majority of traffic to this AMP content comes from people running searches on Google.com. If a visitor finds content through some source other than a Google search, even if the content can be served from AMP, it typically won't be. As a result, the mobile web continues to be slower than it needs to be.</p>
    <div>
      <h4>Making the Mobile Web App-Quick</h4>
      <a href="#making-the-mobile-web-app-quick">
        
      </a>
    </div>
    <p>Cloudflare's <a href="https://www.cloudflare.com/website-optimization/accelerated-mobile-links/">Accelerated Mobile Links</a> helps solve this problem, making content, regardless of how it's discovered, app-quick. Once enabled, Accelerated Mobile Links automatically identifies links on a Cloudflare customer's site to content with an <a href="https://www.cloudflare.com/website-optimization/accelerated-mobile-links/">AMP</a> version available. If a link is clicked from a mobile device, the AMP content will be loaded nearly instantly.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ieCvJzcYWwJ2YOKbUrKFZ/b68843dc767ab04a1d6e1d5ccb08b44f/amp-configuration-1.png" />
            
            </figure><p>To see how it works, try viewing this post from your mobile device and clicking any of these links:</p><ul><li><p><b>TechCrunch:</b> <a href="https://techcrunch.com/2017/01/11/cloudflare-explains-how-fbi-gag-order-impacted-business/">Cloudflare explains how FBI gag order impacted business</a></p></li><li><p><b>ZDNet:</b> <a href="http://www.zdnet.com/article/cloudflare-offers-http2-server-push-to-boost-internet-speeds/">CloudFlare figured out how to make the Web one second faster</a></p></li><li><p><b>The Register:</b> <a href="http://www.theregister.co.uk/2016/06/21/cloudflare_apologizes_for_telia_screwing_you_over/">CloudFlare apologizes for Telia screwing you over</a></p></li><li><p><b>AMP 8 Ball:</b> <a href="https://amp8ball.com/">Ask The Amp Magic 8-Ball A Yes Or No Question.</a></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4LVE0NPtwr0c0kluYKwhAR/506cdb4a92c3c32b632e5cfdfc03d429/AML_animated_demo.gif" />
            
            </figure></li></ul>
    <div>
      <h4>Increasing User Engagement</h4>
      <a href="#increasing-user-engagement">
        
      </a>
    </div>
    <p>One of the benefits of Accelerated Mobile Links is that AMP content is loaded in a viewer directly on the site that linked to the content. As a result, when a reader is done consuming the AMP content, closing the viewer returns them to the original source of the link. In that way, every Cloudflare customer's site can be more like a native mobile app, with the corresponding increase in user engagement.</p><p>For large publishers that want an even more branded experience, Cloudflare will offer the ability to customize the domain of the viewer to match the publisher's domain. This, for the first time, provides a seamless experience where AMP content can be consumed without having to send visitors to a Google-owned domain. If you're a large publisher interested in customizing the Accelerated Mobile Links viewer, you can contact <a>Cloudflare's team</a>.</p>
    <div>
      <h4>Innovating on AMP</h4>
      <a href="#innovating-on-amp">
        
      </a>
    </div>
    <p>While Google was the initial champion of AMP, the technologies involved are <a href="https://github.com/ampproject/amphtml">open</a>. We worked closely with the Google team in developing Cloudflare's Accelerated Mobile Links as well as our own AMP cache. Malte Ubl, the technical lead for the AMP Project at Google, said of our collaboration:</p><p><i>"Working with Cloudflare on its AMP caching solution was as seamless as open-source development can be. Cloudflare has become a regular contributor on the project and made the code base better for all users of AMP. It is always a big step for a software project to go from supporting specific caches to many, and it is awesome to see Cloudflare’s elegant solution for this."</i></p><p>Cloudflare now powers the only <a href="https://github.com/ampproject/amphtml/blob/master/spec/amp-cache-guidelines.md">compliant</a> non-Google AMP cache with all the same performance and security benefits as Google.</p><p>In the spirit of open source, we're working to help develop updates to the project to address some of publishers' and end users' concerns. Specifically, here are some features we're developing to address concerns that have been expressed about AMP:</p><ul><li><p>Easier ways to share AMP content using publishers' original domains</p></li><li><p>Automatically redirecting desktop visitors from the AMP version back to the original version of the content</p></li><li><p>A way for end users who would prefer not to be redirected to the AMP version of content to opt out</p></li><li><p>The ability for publishers to brand the AMP viewer and serve it from their own domain</p></li></ul><p>Cloudflare is committed to the AMP project. Accelerated Mobile Links is the first AMP feature we're releasing, but we'll be doing more over the months to come. As of today, Accelerated Mobile Links is available to all Cloudflare customers for free. 
You can enable it in your <a href="https://www.cloudflare.com/a/performance/">Cloudflare Performance dashboard</a>. Stay tuned for more AMP features that will continue to increase the speed of the mobile web.</p> ]]></content:encoded>
            <category><![CDATA[Mobile]]></category>
            <category><![CDATA[Google]]></category>
            <category><![CDATA[AMP]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">7ztp6Ye5JzaL00PeVFH5va</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Very WebP New Year from Cloudflare]]></title>
            <link>https://blog.cloudflare.com/a-very-webp-new-year-from-cloudflare/</link>
            <pubDate>Wed, 21 Dec 2016 14:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare has an automatic image optimization feature called Polish, available to customers on paid plans. It recompresses images and strips excess data, speeding up delivery to browsers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare has an automatic image optimization feature called <a href="/introducing-polish-automatic-image-optimizati/">Polish</a>, available to customers on paid plans. It recompresses images and removes unnecessary data so that they are delivered to browsers more quickly.</p><p>Up until now, Polish has not changed image types when optimizing (even if, for example, a PNG might sometimes have been smaller than the equivalent JPEG). But a new feature in Polish allows us to swap out an image for an equivalent image compressed using Google’s WebP format when the browser is capable of handling WebP and delivering that type of image would be quicker.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6RCs2VzrEL7pYO3RNWBRYa/5434d04bd47f702aa548ea48985c2aff/holly.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC-BY 2.0</a> <a href="https://www.flickr.com/photos/john47kent/5307525503/">image</a> by <a href="https://www.flickr.com/photos/john47kent/">John Stratford</a></p>
    <div>
      <h3>What is WebP?</h3>
      <a href="#what-is-webp">
        
      </a>
    </div>
    <p>The main image formats used on the web haven’t changed much since the early days (apart from the SVG vector format, PNG was the last one to establish itself, <a href="https://en.wikipedia.org/wiki/Portable_Network_Graphics#History_and_development">almost two decades ago</a>).</p><p><a href="https://en.wikipedia.org/wiki/WebP">WebP</a> is a newer image format for the web, proposed by Google. It takes advantage of progress in <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-image-compression/">image compression techniques</a> since formats such as JPEG and PNG were designed. It is often able to compress the images into a significantly smaller amount of data than the older formats.</p><p>WebP is versatile and able to replace the three main raster image formats used on the web today:</p><ul><li><p>WebP can do lossy compression, so it can be used instead of JPEG for photographic and photo-like images.</p></li><li><p>WebP can do lossless compression, and supports an alpha channel meaning images can have transparent regions. So it can be used instead of PNG, such as for images with sharp transitions that should be reproduced exactly (e.g. line art and graphic design elements).</p></li><li><p>WebP images can be animated, so it can be used as a replacement for animated GIF images.</p></li></ul><p>Currently, the main browser that supports WebP is Google’s Chrome (both on desktop and mobile devices). See <a href="http://caniuse.com/#feat=webp">the WebP page on caniuse.com</a> for more details.</p>
    <div>
      <h3>Polish WebP conversion</h3>
      <a href="#polish-webp-conversion">
        
      </a>
    </div>
    <p>Customers on the Pro, Business, and Enterprise plans can enable the automatic creation of WebP images by checking the WebP box in the Polish settings for a zone (these are found on the “Speed” page of the dashboard):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Beloplqsl1SNVyjFSOyl4/d6853afc514b5a6fc9cf94758d629695/polish-webp.png" />
            
            </figure><p>When this is enabled, Polish will optimize images just as it always has. But it will also convert the image to WebP if WebP can shrink the image data more than the original format. These WebP images are only returned to web browsers that indicate they support WebP (e.g. Google Chrome), so most websites using Polish should be able to benefit from WebP conversion.</p><p>(Although Polish can now produce WebP images by converting them from other formats, it can't consume WebP images to optimize them. If you put a WebP image on an origin site, Polish won't do anything with it. Until the WebP ecosystem grows and matures, it is unclear whether attempting to optimize WebP input would be worthwhile.)</p><p>Polish has two modes: <i>lossless</i> and <i>lossy</i>. In lossless mode, JPEG images are optimized to remove unnecessary data, but the image displayed is unchanged. In lossy mode, Polish reduces the quality of JPEG images in a way that should not have a significant visible effect, but allows it to further reduce the size of the image data.</p><p>These modes are respected when JPEG images are converted to WebP. In lossless mode, the conversion is done in a way that preserves the image as faithfully as possible (due to the nature of the conversion, the resulting WebP might not be exactly identical, but there are unlikely to be any visible differences). In lossy mode, the conversion sacrifices a little quality in order to shrink the image data further, but as before, there should not be a significant visible effect.</p><p>These modes do not affect PNGs and GIFs, as these are lossless formats and so Polish will preserve images in those formats exactly.</p><p>Note that WebP conversion does not change the URLs of images, even if the file extension in the URL implies a different format. For example, a JPEG image at <code>https://example.com/picture.jpg</code> that has been converted to WebP will still have that same URL. The “Content-Type” HTTP header tells the browser the true format of an image.</p>
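Browsers that can decode WebP advertise it in the Accept request header (Chrome sends <code>image/webp</code>). The negotiation just described can be sketched in Python; this is an illustration with invented names, not Polish's actual code:

```python
def choose_variant(accept_header, original_bytes, original_type, webp_bytes=None):
    """Serve the WebP variant only if the client advertises support in its
    Accept header AND the WebP encoding is actually smaller; otherwise fall
    back to the original. The URL never changes; Content-Type reports the
    true format of the bytes served."""
    client_ok = "image/webp" in (accept_header or "")
    if client_ok and webp_bytes is not None and len(webp_bytes) < len(original_bytes):
        return webp_bytes, "image/webp"
    return original_bytes, original_type
```

Responses negotiated this way should also carry a `Vary: Accept` header so shared caches keep the two variants apart.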
    <div>
      <h3>By the Numbers</h3>
      <a href="#by-the-numbers">
        
      </a>
    </div>
    <p>A few studies have been published of how well WebP compresses images compared with established formats. These studies provide a useful picture of how WebP performs. But before we released our WebP support, we decided to do a survey based on the context in which we planned to use WebP:</p><ul><li><p>We evaluated WebP based on a collection of images gathered from the websites of our customers. The corpus consisted of 23,500 images (JPEG, PNG and GIFs).</p></li><li><p>Some studies compare WebP with JPEG by taking uncompressed images and compressing them to JPEG and WebP directly. But we wanted to know what happens when we convert an image that has already been compressed as a JPEG. In a sense this is an unfair test, because a JPEG may contain artifacts due to compression that would not be present in the original raw image, and conversion to WebP may try to retain those artifacts. But it is such conversions that matter for our use of WebP (this consideration does not apply to PNG and GIF conversions, because they are lossless).</p></li><li><p>We’re not just interested in whether WebP conversion can shrink images found on the web. We want to know how much WebP allows Polish to reduce the size further than it already does, thus providing a real end-user benefit. So our survey also includes the results of Polish without WebP.</p></li><li><p>In some cases, converting to WebP does not produce a result smaller than the optimized image in the original format. In such cases, we discard the WebP image. So the figures presented below do not penalize WebP for such cases.</p></li></ul><p>Here is a chart showing the results of Polish, with and without WebP conversion. For each format, the average original image size is normalized to 100%, and the average sizes after Polishing are shown relative to that.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5YazeQmjUBmLEir39hYrUh/bab2f4768bf57f5a1f8aecd3f07d9d9a/webp-chart.png" />
            
            </figure><p>Here are the average savings corresponding to the chart:</p><table><tr><td><p><b>Original Format</b></p></td><td><p><b>Polish without WebP</b></p></td><td><p><b>Polish using WebP</b></p></td></tr><tr><td><p>JPEG (with Polish lossless mode)</p></td><td><p>9%</p></td><td><p>19%</p></td></tr><tr><td><p>JPEG (with Polish lossy mode)</p></td><td><p>34%</p></td><td><p>47%</p></td></tr><tr><td><p>PNG</p></td><td><p>16%</p></td><td><p>38%</p></td></tr><tr><td><p>GIF</p></td><td><p>3%</p></td><td><p>16%</p></td></tr></table><p>(The saving is calculated as 100% - (polished size) / (original size).)</p><p>As you can see, WebP conversion achieves significant size improvements not only for JPEG images, but also for PNG and GIF images. We believe supporting WebP will result in lower bandwidth and faster website delivery.</p>
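As a quick check of the formula in the caption, the saving can be computed directly:

```python
def saving_percent(original_size, polished_size):
    # Saving as defined above: 100% minus the polished-to-original size ratio.
    return 100.0 * (1 - polished_size / original_size)

# e.g. a 100 KB JPEG polished (losslessly, with WebP) down to 81 KB
# is a 19% saving, matching the first row of the table.
```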
    <div>
      <h3>… and a WebP New Year</h3>
      <a href="#and-a-webp-new-year">
        
      </a>
    </div>
    <p>WebP does not yet have the same level of browser support as JPEG, PNG and GIF, but we are excited about its potential to streamline web pages. Polish WebP conversion allows our customers to adopt WebP with a simple change to the settings in the Cloudflare dashboard. So, if you are on one of our <a href="https://www.cloudflare.com/plans/">paid plans</a>, we encourage you to try it out today.</p><p>PS — Want to help optimize the web? We’re <a href="https://www.cloudflare.com/join-our-team/">hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[WebP]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">7M6kU8b93kMUiHWhgQTkjv</guid>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[You Can Finally Get More Page Rules For 5 For $5]]></title>
            <link>https://blog.cloudflare.com/5-more-page-rules-for-5-dollars/</link>
            <pubDate>Thu, 25 Aug 2016 16:01:53 GMT</pubDate>
            <description><![CDATA[ Since CloudFlare launched Page Rules in 2012, our Free, Pro and Business users have been asking for a way to get more Page Rules without committing to the next plan up. Starting today, anyone on CloudFlare can add 5 additional Page Rules for just $5/month. ]]></description>
            <content:encoded><![CDATA[ <p>Since CloudFlare <a href="/introducing-pagerules-fine-grained-feature-co/">launched Page Rules in 2012</a>, our Free, Pro and Business users have been asking for a way to get more Page Rules without committing to the next plan up. Starting today, anyone on CloudFlare can <a href="https://cloudflare.com/a/page-rules">add 5 additional Page Rules</a> for just $5/month.</p><p>Page Rules allow you to fine-tune your site speed and to apply CloudFlare’s wide range of features to specific parts of your site. Page Rules are also <a href="https://api.cloudflare.com/#page-rules-for-a-zone-properties">accessible over our API</a>, so you can integrate them into your build process or sync them across your domains.</p><p>To help you get the most out of Page Rules, we’re also launching a <a href="https://www.cloudflare.com/features-page-rules/">tutorial site</a> that features videos to help you set up Page Rules for specific content management systems like <a href="https://www.cloudflare.com/features-page-rules/optimize-wordpress/">WordPress</a>, <a href="https://www.cloudflare.com/features-page-rules/optimize-magento/">Magento</a> and <a href="https://www.cloudflare.com/features-page-rules/optimize-drupal/">Drupal</a>, and for specific goals like <a href="https://www.cloudflare.com/features-page-rules/optimize-speed/">optimizing your website's speed</a>, <a href="https://www.cloudflare.com/features-page-rules/harden-security/">increasing security</a>, and <a href="https://www.cloudflare.com/features-page-rules/maximize-bandwidth/">saving on your bandwidth costs</a>.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Page Rules]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">3tobeIKGZHstJAvZtu3t5R</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[Optimizing TLS over TCP to reduce latency]]></title>
            <link>https://blog.cloudflare.com/optimizing-tls-over-tcp-to-reduce-latency/</link>
            <pubDate>Fri, 10 Jun 2016 13:08:49 GMT</pubDate>
            <description><![CDATA[ The layered nature of the Internet (HTTP on top of some reliable transport (e.g. TCP), TCP on top of some datagram layer (e.g. IP), IP on top of some link (e.g. Ethernet)) has been very important in its development.  ]]></description>
            <content:encoded><![CDATA[ <p>The layered nature of the Internet (HTTP on top of some reliable transport (e.g. TCP), TCP on top of some datagram layer (e.g. IP), IP on top of some link (e.g. Ethernet)) has been very important in its development. Different link layers have come and gone over time (any readers still using 802.5?) and this flexibility also means that a connection from your web browser might traverse your home network over WiFi, then down a DSL line, across fiber and finally be delivered over Ethernet to the web server. Each layer is blissfully unaware of the implementation of the layer below it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NXh661kZVo5dom0ouMvxC/1d6ad85f6e85cf9e17c32d992d23e57a/640px-EISA_TokenRing_NIC.jpeg.jpeg" />
            
            </figure><p>But there are some disadvantages to this model. In the case of TLS (the most common standard used for sending encrypted data across the Internet, and the protocol your browser uses when visiting an https:// web site), the layering of TLS on top of TCP can cause delays to the delivery of a web page.</p><p>That’s because TLS divides the data being transmitted into records of a fixed (maximum) size and then hands those records to TCP for transmission. TCP promptly divides those records up into segments which are then transmitted. Ultimately, those segments are sent inside IP packets which traverse the Internet.</p><p>In order to prevent congestion on the Internet and to ensure reliable delivery, TCP will only send a limited number of segments before waiting for the receiver to acknowledge that the segments have been received. In addition, TCP guarantees that segments are delivered in order to the application. Thus if a packet is dropped somewhere between sender and receiver, it’s possible for a whole bunch of segments to be held in a buffer waiting for the missing segment to be retransmitted before the buffer can be released to the application.</p>
    <div>
      <h3>TLS and TCP</h3>
      <a href="#tls-and-tcp">
        
      </a>
    </div>
    <p>What this means for TLS is that a large record that is split across multiple TCP segments can encounter unexpected delays. TLS can only handle <i>complete</i> records and so a missing TCP segment delays the whole TLS record.</p><p>At the start of a TCP connection as the TCP slow start occurs the record could be split across multiple segments that are delivered relatively slowly. During a TCP connection one of the segments that a TLS record has been split into may get lost causing the record to be delayed until the missing segment is retransmitted.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6F7nfcHkR6p8ECHSkpKBtx/8b38f045585fd3b17a4176d600ade7bc/Screen-Shot-2016-06-10-at-10-57-21.png" />
            
            </figure><p>Thus it’s preferable to not use a fixed TLS record size but <a href="http://chimera.labs.oreilly.com/books/1230000000545/ch04.html#TLS_RECORD_SIZE">adjust the record size</a> as the underlying TCP connection spins up (and down in the case of congestion). Starting with a small record size helps match the record size to the segments that TCP is sending at the start of a connection. Once the connection is running the record size can be increased.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12L5mXigeSmG8mAExKMVe5/c975187228d920fd3d64748a13e6d484/Screen-Shot-2016-06-10-at-11-37-25.png" />
            
            </figure><p>CloudFlare uses NGINX to handle web requests. By default NGINX does not support dynamic TLS record sizes. NGINX has a fixed TLS record size with a default of 16KB that can be adjusted with the <a href="http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_buffer_size"><code>ssl_buffer_size</code></a> parameter.</p>
    <div>
      <h3>Dynamic TLS Records in NGINX</h3>
      <a href="#dynamic-tls-records-in-nginx">
        
      </a>
    </div>
    <p>We modified NGINX to add support for dynamic TLS record sizes and are open-sourcing our patch. You can find it <a href="https://github.com/cloudflare/sslconfig/blob/master/patches/nginx__dynamic_tls_records.patch">here</a>. The patch adds parameters to the NGINX <a href="http://nginx.org/en/docs/http/ngx_http_ssl_module.html">ssl</a> module.</p><p><code>ssl_dyn_rec_size_lo</code>: the TLS record size to start with. Defaults to 1369 bytes (designed to fit the entire record in a single TCP segment: 1369 = 1500 - 40 (IPv6) - 20 (TCP) - 10 (Time) - 61 (Max TLS overhead))</p><p><code>ssl_dyn_rec_size_hi</code>: the TLS record size to grow to. Defaults to 4229 bytes (designed to fit the entire record in 3 TCP segments)</p><p><code>ssl_dyn_rec_threshold</code>: the number of records to send before changing the record size.</p><p>Each connection starts with records of the size <code>ssl_dyn_rec_size_lo</code>. After sending <code>ssl_dyn_rec_threshold</code> records, the record size is increased to <code>ssl_dyn_rec_size_hi</code>. After sending an additional <code>ssl_dyn_rec_threshold</code> records with size <code>ssl_dyn_rec_size_hi</code>, the record size is increased to <code>ssl_buffer_size</code>.</p><p><code>ssl_dyn_rec_timeout</code>: if the connection idles for longer than this time (in seconds), the TLS record size is reduced to <code>ssl_dyn_rec_size_lo</code> and the logic above is repeated. If this value is set to 0 then dynamic TLS record sizes are disabled and the fixed <code>ssl_buffer_size</code> will be used instead.</p>
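The schedule these parameters describe can be modeled as a small state machine. Below is a Python sketch of the logic only (the real patch is C inside NGINX); the record sizes are the defaults quoted above, while the threshold and timeout values are illustrative assumptions:

```python
class DynRecSizer:
    """Toy model of dynamic TLS record sizing: start small, grow after a
    number of records, and fall back to small after an idle period."""

    def __init__(self, size_lo=1369, size_hi=4229, threshold=40,
                 timeout_s=1.0, buffer_size=16 * 1024):
        self.size_lo, self.size_hi = size_lo, size_hi
        self.threshold, self.timeout_s = threshold, timeout_s
        self.buffer_size = buffer_size
        self.sent = 0        # records sent since the last (re)start
        self.last = None     # timestamp of the previous record

    def next_size(self, now):
        if self.last is not None and now - self.last > self.timeout_s:
            self.sent = 0    # idled too long: restart from the small size
        self.last = now
        self.sent += 1
        if self.sent <= self.threshold:
            return self.size_lo          # fits a single TCP segment
        if self.sent <= 2 * self.threshold:
            return self.size_hi          # fits three TCP segments
        return self.buffer_size          # warmed up: full ssl_buffer_size
```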
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We hope people find our NGINX patch useful and would be very happy to hear from people who use it and/or improve it.</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">3ygn5wGPwWpBV2zm7jADnA</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[HTTP/2 For Web Developers]]></title>
            <link>https://blog.cloudflare.com/http-2-for-web-developers/</link>
            <pubDate>Thu, 10 Dec 2015 12:10:51 GMT</pubDate>
            <description><![CDATA[ HTTP/2 revolutionizes web optimization. Unlike HTTP/1.1's reliance on hacks like spriting and inlining for speed, HTTP/2 offers inherent performance improvements, simplifying development. ]]></description>
            <content:encoded><![CDATA[ <p>HTTP/2 changes the way web developers optimize their websites. In HTTP/1.1, it’s become common practice to eke out an extra 5% of page load speed by hacking away at your TCP connections and HTTP requests with techniques like spriting, inlining, domain sharding, and concatenation.</p><p>Life’s a little bit easier in HTTP/2. It gives the typical website a <a href="http://blog.chromium.org/2013/11/making-web-faster-with-spdy-and-http2.html">30% performance gain</a> without a complicated build and deploy process. In this article, we’ll discuss the new best practices for website optimization in HTTP/2.</p>
    <div>
      <h3>Web Optimization in HTTP/1.1</h3>
      <a href="#web-optimization-in-http-1-1">
        
      </a>
    </div>
    <p>Most of the <a href="https://www.cloudflare.com/application-services/products/website-optimization/">website optimization techniques</a> in HTTP/1.1 revolved around minimizing the number of HTTP requests to an origin server. A browser can only open a limited number of simultaneous TCP connections to an origin, and downloading assets over each of those connections is a serial process: the response for one asset has to be returned before the next one can be sent. This is called head-of-line blocking.</p><p>As a result, web developers began squeezing as many assets as they could into a single connection and finding other ways to trick browsers into avoiding head-of-line blocking. In HTTP/2, some of these practices can actually hurt page load times.</p>
    <div>
      <h3>The New Web Optimization Mindset for HTTP/2</h3>
      <a href="#the-new-web-optimization-mindset-for-http-2">
        
      </a>
    </div>
    <p>Optimizing for HTTP/2 requires a different mindset. Instead of worrying about reducing HTTP requests, web developers should focus on tuning their website’s caching behavior. The general rule is to <b>ship small, granular resources</b> so they can be cached independently and transferred in parallel.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/24AGzHZfCsK26xRbyOolht/4231f1d4ffc688eb34df4e9fa7f12cfd/http-2-multiplexing.png" />
            
            </figure><p>This shift occurs because of the <b>multiplexing</b> and <b>header compression</b> features of HTTP/2. Multiplexing lets multiple requests share a single TCP connection, allowing several assets to download in parallel without the unnecessary overhead of establishing multiple connections. This eliminates the head-of-line blocking issue of HTTP/1.1. Header compression further reduces the penalty of multiple HTTP requests, since the overhead for each request is smaller than the uncompressed HTTP/1.1 equivalent.</p><p>There are two other features of HTTP/2 that also change how you approach web optimization: <b>stream prioritization</b> and <b>server push</b>. The former lets browsers specify what order they want to receive resources, and the latter lets the server send extra resources that the browser doesn’t yet know it needs. To make the most of these new features, web developers need to unlearn some of the HTTP/1.1 best practices that have become second nature to them.</p>
    <div>
      <h3>Best Practices for HTTP/2 Web Optimization</h3>
      <a href="#best-practices-for-http-2-web-optimization">
        
      </a>
    </div>
    
    <div>
      <h4>Stop Concatenating Files</h4>
      <a href="#stop-concatenating-files">
        
      </a>
    </div>
    <p>In HTTP/1.1, web developers often combined all of the CSS for their entire website into a single file. Similarly, JavaScript was also condensed into a single file, and images were combined into a spritesheet. Having a single file for your CSS, JavaScript, and images vastly reduced the number of HTTP requests, which made for a significant performance gain in HTTP/1.1.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Iecd2TKCbAeC1psMbKgkm/f1174abd7cf8789ae31bd8fb6b46bdf4/http-1-1-file-concatenation.png" />
            
            </figure><p>However, concatenating files is no longer a best practice in HTTP/2. While concatenation can still improve compression ratios, it forces expensive cache invalidations. Even if only a single line of CSS is changed, browsers are forced to reload <i>all</i> of your CSS declarations.</p><p>In addition, not every page in your website uses all of the declarations or functions in your concatenated CSS or <a href="https://www.cloudflare.com/learning/performance/why-minify-javascript-code/">JavaScript</a> files. After it’s been cached, this isn’t a big deal, but it means that unnecessary bytes are being transferred, processed, and executed in order to render the first page a user visits. While the overhead of a request in HTTP/1.1 made this a worthwhile tradeoff, it can actually slow down time-to-first-paint in HTTP/2.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5vGVOAARP8UNVvPIB70fyb/24482128b09677fc85c025401d584692/http-2-file-concatenation.png" />
            
            </figure><p>Instead of concatenating files, web developers should focus more on optimizing caching policy. By isolating files that change frequently from ones that rarely change, it’s possible to serve as much content as possible from your <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a> or the user’s browser cache.</p>
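    <p>As a sketch of this approach, you might serve a rarely-changing framework stylesheet separately from your frequently-changing site styles, giving the former a fingerprinted filename so it can be cached for a long time (the filenames and lifetimes here are illustrative, not a prescription):</p>
            <pre><code>&lt;!-- fingerprinted filename: safe to cache for a year --&gt;
&lt;link rel="stylesheet" href="/css/framework.9f2b1c.css"&gt;

&lt;!-- changes frequently: cache briefly and revalidate --&gt;
&lt;link rel="stylesheet" href="/css/site.css"&gt;</code></pre>
            <p>Paired with a response header like <code>Cache-Control: max-age=31536000</code> on the fingerprinted file, each resource can then be cached and invalidated independently.</p>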
    <div>
      <h4>Stop Inlining Assets</h4>
      <a href="#stop-inlining-assets">
        
      </a>
    </div>
    <p>Inlining assets is a special case of file concatenation. It’s the practice of embedding CSS stylesheets, external JavaScript files, and images directly into an HTML page. For example, if you have a web page that looks like this:</p>
            <pre><code>&lt;html&gt;
  &lt;head&gt;
	&lt;link rel="stylesheet" href="/style.css"&gt;
  &lt;/head&gt;
  &lt;body&gt;
	&lt;img src="logo.png"&gt;
	&lt;script src="scripts.min.js"&gt;&lt;/script&gt;
  &lt;/body&gt;
&lt;/html&gt;</code></pre>
            <p>You could run it through an inlining tool to get something like this:</p>
            <pre><code>&lt;html&gt;
  &lt;head&gt;
	&lt;style&gt;
	  body {
		font-size: 18px;
		color: #999;
	  }
	&lt;/style&gt;
  &lt;/head&gt;
  &lt;body&gt;
	&lt;img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAABA..."&gt;
	&lt;script&gt;console.log('Hello, World!');&lt;/script&gt;
  &lt;/body&gt;
&lt;/html&gt;</code></pre>
            <p>In extreme cases, this can actually reduce the number of HTTP requests for a given web page to one. But, like file concatenation, you shouldn’t inline files you’re trying to optimize for HTTP/2.</p><p>Inlining means that browsers can’t cache individual assets. If you have CSS declarations that apply to your entire website embedded into every single HTML file, those declarations get sent back from the server every time. This results in redundant bytes being sent over the wire for every page that a user visits.</p><p>Inlining also has the unfortunate side effect of breaking stream prioritization. Because your CSS, JavaScript, and images are now embedded in your HTML, you’re effectively raising their priority to the same level as HTML content. This means that browsers can’t request assets in their preferred order, potentially increasing time-to-first-paint.</p><p>Instead of inlining assets, web developers should leverage HTTP/2’s server push functionality. Server push lets your web server say stuff like, "Hold on, you’re going to need this image and this CSS file to render that HTML page you just requested." Conceptually, this is the same as inlining an asset, but it doesn’t break stream prioritization, and it allows you to leverage a CDN and the user’s local browser cache, too.</p>
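    <p>Where server push is supported, one common way to trigger it is a preload <code>Link</code> header on the HTML response (the paths below are illustrative, and exact behavior varies by server and CDN):</p>
            <pre><code>Link: &lt;/css/site.css&gt;; rel=preload; as=style
Link: &lt;/img/logo.png&gt;; rel=preload; as=image</code></pre>
            <p>A push-capable server can read these hints and begin sending the referenced assets alongside the HTML, while a server that doesn’t push simply leaves them as ordinary preload hints for the browser.</p>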
    <div>
      <h4>Stop Sharding Domains</h4>
      <a href="#stop-sharding-domains">
        
      </a>
    </div>
    <p>Domain sharding is a common technique for tricking browsers into opening more TCP connections than they’re supposed to. Browsers limit the number of connections to a single origin server, but by splitting your website’s assets across multiple domains, you can get it to open an extra set of TCP connections. This helps avoid head-of-line blocking, but it comes with significant tradeoffs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15BHvjDZ3lyXVv5ADPOI1a/3b95cf0335a377f66c18517cf0bf22c1/domain-sharding-1.png" />
            
            </figure><p>Domain sharding should be avoided in HTTP/2. Each shard results in an unnecessary DNS query, TCP connection, and <a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/">TLS handshake</a> (assuming the servers use different <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS certificates</a>). In HTTP/1.1, that overhead was compensated for by allowing several assets to download in parallel. But, this is no longer the case in HTTP/2: multiplexing allows multiple assets to download in parallel over a single connection. And, similar to asset inlining, domain sharding breaks HTTP/2 stream prioritization because browsers can’t prioritize streams across multiple domains.</p><p>If you’re currently sharding but want to leverage HTTP/2, you don’t necessarily need to go about <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactoring</a> your entire codebase. As discussed in our blog post on <a href="/using-cloudflare-to-mix-domain-sharding-and-spdy/">mixing domain sharding and SPDY</a>, browsers recognize when multiple origin servers use the same TLS certificate. When they do, the browser will reuse the same SPDY or HTTP/2 connection for both servers. This still results in multiple DNS queries, but it’s a decent compromise solution if you want the best performance under <a href="https://www.cloudflare.com/learning/performance/http2-vs-http1.1/">both HTTP/1.1 and HTTP/2</a>.</p>
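    <p>For this connection reuse to kick in, the certificate presented by one shard has to cover the other shards’ hostnames as well, typically via Subject Alternative Name entries (browsers may additionally require the shards to resolve to the same IP address). A certificate that allows coalescing across shards might list something like this (hostnames illustrative):</p>
            <pre><code>X509v3 Subject Alternative Name:
    DNS:example.com, DNS:static1.example.com, DNS:static2.example.com</code></pre>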
    <div>
      <h3>Some Best Practices Are Still Best Practices</h3>
      <a href="#some-best-practices-are-still-best-practices">
        
      </a>
    </div>
    <p>Fortunately, not everything about web optimization changes in HTTP/2. There are still several HTTP/1.1 best practices that carry over to HTTP/2. The rest of this article discusses techniques that you should be leveraging regardless of whether you’re optimizing for HTTP/1.1 or HTTP/2.</p>
    <div>
      <h4>Keep Reducing DNS Lookups</h4>
      <a href="#keep-reducing-dns-lookups">
        
      </a>
    </div>
    <p>Before a browser can request your website’s resources, it needs to find the IP address of your server using the <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">domain name system (DNS)</a>. Until the DNS responds, the user is left staring at a blank screen. HTTP/2 is designed to optimize how web browsers communicate with web servers—it doesn’t affect the performance of the domain name system.</p><p>Since DNS queries can be expensive, especially if you have to start your query at the root nameservers, it’s still prudent to minimize the number of DNS lookups that your website uses. DNS prefetching via a <code>&lt;link rel='dns-prefetch' href='...' /&gt;</code> line in your document’s <code>&lt;head&gt;</code> can help, but it’s not a one-size-fits-all solution.</p>
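    <p>As a minimal example, a prefetch hint for a third-party hostname looks like this (the hostname is illustrative):</p>
            <pre><code>&lt;head&gt;
  &lt;!-- resolve this hostname early, before any asset on it is requested --&gt;
  &lt;link rel="dns-prefetch" href="//cdn.example.com"&gt;
&lt;/head&gt;</code></pre>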
    <div>
      <h4>Keep Using a Content Delivery Network</h4>
      <a href="#keep-using-a-content-delivery-network">
        
      </a>
    </div>
    <p>It takes around 130ms for light to travel around the circumference of the earth. That’s latency that you <i>can’t</i> get rid of—it’s just physics. The imperfect nature of fiber optic cables and wireless connections, as well as the topology of the global Internet, means that it actually takes closer to 300-400ms to transmit a network packet from your computer to a server that’s halfway around the world. Users can perceive <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a> as small as 100ms, and the only way to get around the laws of physics is to put your web assets geographically closer to your visitors via a CDN.</p>
    <div>
      <h4>Keep Leveraging Browser Caching</h4>
      <a href="#keep-leveraging-browser-caching">
        
      </a>
    </div>
    <p>You can take content delivery networks one step further by caching assets in the user’s local browser cache. This avoids any kind of data transfer over the network, aside from a 304 Not Modified response.</p>
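    <p>A conditional revalidation round trip looks roughly like this (the ETag value and lifetime are illustrative): the browser presents the validator it cached, and the server answers with an empty 304 instead of resending the body:</p>
            <pre><code>GET /css/site.css HTTP/1.1
If-None-Match: "33a64df5"

HTTP/1.1 304 Not Modified
ETag: "33a64df5"
Cache-Control: max-age=86400</code></pre>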
    <div>
      <h4>Keep Minimizing the Size of HTTP Requests</h4>
      <a href="#keep-minimizing-the-size-of-http-requests">
        
      </a>
    </div>
    <p>Even though HTTP/2 requests are multiplexed, it takes time to transmit data over the wire. Accordingly, it’s still beneficial to reduce the amount of data you need to transfer. On the request end, this means minimizing the size of cookies, URLs, query strings, and referrer URLs as much as possible.</p>
    <div>
      <h4>Keep Minimizing the Size of HTTP Responses</h4>
      <a href="#keep-minimizing-the-size-of-http-responses">
        
      </a>
    </div>
    <p>Of course, this holds true in the opposite direction. As a web developer, you want to make your server’s response as small as possible by minifying HTML, <a href="https://www.cloudflare.com/learning/performance/how-to-minify-css/">CSS</a>, and JavaScript files, optimizing images for the web, and compressing resources with gzip.</p>
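    <p>Compression is negotiated with a pair of headers: the browser advertises what it can decode, and the server labels the compressed response accordingly (a sketch, with most headers trimmed):</p>
            <pre><code>GET /index.html HTTP/1.1
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
Content-Encoding: gzip
Vary: Accept-Encoding</code></pre>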
    <div>
      <h4>Keep Eliminating Unnecessary Redirects</h4>
      <a href="#keep-eliminating-unnecessary-redirects">
        
      </a>
    </div>
    <p>HTTP 301 and 302 redirects are sometimes a fact of life when migrating to a new platform or re-designing your website, but they should be eliminated wherever possible. Redirects result in an extra round trip from the browser to the server, and this adds unnecessary latency. You should be particularly wary of redirect chains where it takes more than a single redirect to get to the destination URL.</p><p>Server-side redirects like 301 and 302 responses aren’t ideal, but they aren’t the worst thing in the world, either. They can still be cached locally, so a browser can recognize the URL as a redirect and avoid an unnecessary round trip. Meta refreshes (e.g., <code>&lt;meta http-equiv="refresh"...</code>), on the other hand, are much more costly because they can’t be cached, and they can have performance issues in certain browsers.</p>
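    <p>If a redirect is unavoidable, making it explicitly cacheable lets returning visitors skip the extra round trip entirely (the URL and lifetime are illustrative):</p>
            <pre><code>HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/new-page/
Cache-Control: max-age=86400</code></pre>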
    <div>
      <h3>Conclusion (and Caveats)</h3>
      <a href="#conclusion-and-caveats">
        
      </a>
    </div>
    <p>And that’s HTTP/2 web optimization in a nutshell. Avoiding file concatenation, asset inlining, and domain sharding not only speeds up your website, but also makes for a much simpler build and deploy process.</p><p>There are, however, a few caveats to this discussion. Most servers, content delivery networks (including CloudFlare), and existing applications don’t support server push. Servers and CDNs will catch up soon enough, but for your application to benefit from server push, you’re going to have to make some changes to your codebase.</p><p>Also keep in mind that HTTP/2 performance gains depend on the type of content you serve. For example, websites that rely on more external assets will see larger performance gains from HTTP/2 multiplexing than ones that have fewer HTTP requests.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[spdy]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">7ASnS08pg4sCtRL32v054u</guid>
            <dc:creator>Ryan Hodson</dc:creator>
        </item>
    </channel>
</rss>