
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 06:53:20 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Cloudy Summarizations of Email Detections: Beta Announcement]]></title>
            <link>https://blog.cloudflare.com/cloudy-driven-email-security-summaries/</link>
            <pubDate>Fri, 29 Aug 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We're now leveraging our internal LLM, Cloudy, to generate automated summaries within our Email Security product, helping SOC teams better understand what's happening within flagged messages. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>Organizations face continuous threats from <a href="https://www.cloudflare.com/learning/access-management/phishing-attack/"><u>phishing</u></a>,<a href="https://www.cloudflare.com/learning/email-security/business-email-compromise-bec/"><u> business email compromise (BEC)</u></a>, and other advanced email attacks. Attackers <a href="https://www.cloudflare.com/the-net/multichannel-phishing/"><u>adapt their tactics</u></a> daily, forcing defenders to move just as quickly to keep inboxes safe.</p><p>Cloudflare’s visibility across a large portion of the Internet gives us an unparalleled view of malicious campaigns. We process billions of email threat signals every day, feeding them into multiple AI and machine learning models. This lets our detection team create and deploy new rules at high speed, blocking malicious and unwanted emails before they reach the inbox.</p><p>But rapid protection introduces a new challenge: making sure security teams understand exactly what we blocked — and why.</p>
    <div>
      <h2>The Challenge</h2>
      <a href="#the-challenge">
        
      </a>
    </div>
    <p>Cloudflare’s fast-moving detection pipeline is one of our greatest strengths — but it also creates a communication gap for customers. Every day, our detection analysts publish new rules to block phishing, BEC, and other unwanted messages. These rules often blend signals from multiple AI and machine learning models, each looking at different aspects of a message like its content, headers, links, attachments, and sender reputation.</p><p>While this layered approach catches threats early, SOC teams don’t always have insight into the specific combination of factors that triggered a detection. Instead, they see a rule name in the investigation tab with little explanation of what it means.</p><p>Take the rule <i>BEC.SentimentCM_BEC.SpoofedSender</i> as an example. Internally, we know this indicates:</p><ul><li><p>The email contained no unique links or attachments a common BEC pattern</p></li><li><p>It was flagged as highly likely to be BEC by our Churchmouse sentiment analysis models</p></li><li><p>Spoofing indicators were found, such as anomalies in the envelope_from header</p></li></ul><p>Those details are second nature to our detection team, but without that context, SOC analysts are left to reverse-engineer the logic from opaque labels. They don’t see the nuanced ML outputs (like Churchmouse’s sentiment scoring) or the subtle header anomalies, or the sender IP/domain reputation data that factored into the decision.</p><p>The result is time lost to unclear investigations or the risk of mistakenly releasing malicious emails. For teams operating under pressure, that’s more than just an inconvenience, it's a security liability.</p><p>That’s why we extended Cloudy (our AI-powered agent) to translate complex detection logic into clear explanations, giving SOC teams the context they need without slowing them down.</p>
    <div>
      <h2>Enter Cloudy Summaries</h2>
      <a href="#enter-cloudy-summaries">
        
      </a>
    </div>
    <p>Several weeks ago, we launched Cloudy within our Cloudflare One product suite to help customers understand gateway policies and their impacts (you can read more about the launch here: https://blog.cloudflare.com/introducing-ai-agent/).</p><p>We began testing Cloudy's ability to explain the detections and updates we continuously deploy. Our first attempt revealed significant challenges.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/63bsCRl2hKUyECh1vJND5k/a033fce3c95a635ede07e1fd03a9edf5/image3.png" />
          </figure>
    <div>
      <h3>The Hallucination Problem</h3>
      <a href="#the-hallucination-problem">
        
      </a>
    </div>
    <p>We observed frequent LLM <a href="https://www.cloudflare.com/learning/ai/what-are-ai-hallucinations/"><u>hallucinations</u></a>, the model generating inaccurate information about messages. While this might be acceptable when analyzing logs, it's dangerous for email security detections. A hallucination claiming a malicious message is clean could lead SOC analysts to release it from quarantine, potentially causing a security breach.</p><p>These hallucinations occurred because email detections involve numerous and complex inputs. Our scanning process runs messages through multiple ML algorithms examining different components: body content, attachments, links, IP reputation, and more. The same complexity that makes manual detection explanation difficult also caused our initial LLM implementation to produce inconsistent and sometimes inaccurate outputs.</p>
    <div>
      <h3>Building Guardrails</h3>
      <a href="#building-guardrails">
        
      </a>
    </div>
    <p>To minimize hallucination risk while maintaining inbox security, we implemented several manual safeguards:</p><p><b>Step 1: RAG Implementation</b></p><p>We ensured Cloudy only accessed information from our detection dataset corpus, creating a <a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/"><u>Retrieval-Augmented Generation (RAG)</u></a> system. This significantly reduced hallucinations by grounding the LLM's assessments in actual detection data.</p><p><b>Step 2: Model Context Enhancement</b></p><p>We added crucial context about our internal models. For example, the "Churchmouse" designation refers to a group of sentiment detection models, not a single algorithm. Without this context, Cloudy attempted to define "churchmouse" using the common idiom "poor as a church mouse" referencing starving church mice because holy bread never falls to the floor. While historically interesting, this was completely irrelevant to our security context.</p>
    <div>
      <h3>Current Results</h3>
      <a href="#current-results">
        
      </a>
    </div>
    <p>Our testing shows Cloudy now produces more stable explanations with minimal hallucinations. For example, the detection <i>SPAM.ASNReputation.IPReputation_Scuttle.Anomalous_HC</i> now generates this summary:</p><p>"This rule flags email messages as spam if they come from a sender with poor Internet reputation, have been identified as suspicious by a blocklist, and have unusual email server setup, indicating potential malicious activity."</p><p>This strikes the right balance. Customers can quickly understand what the detection found and why we classified the message accordingly.</p>
    <div>
      <h2>Beta Program</h2>
      <a href="#beta-program">
        
      </a>
    </div>
    <p>We're opening Cloudy email detection summaries to a select group of beta users. Our primary goal is ensuring our guardrails prevent hallucinations that could lead to security compromises. During this beta phase, we'll rigorously test outputs and verify their quality before expanding access to all customers.</p>
    <div>
      <h2>Ready to enhance your email security?</h2>
      <a href="#ready-to-enhance-your-email-security">
        
      </a>
    </div>
    <p>We provide all organizations (whether a Cloudflare customer or not) with free access to our Retro Scan tool, allowing them to use our predictive AI models to scan existing inbox messages. Retro Scan will detect and highlight any threats found, enabling organizations to remediate them directly in their email accounts. With these insights, organizations can implement further controls, either using <a href="https://www.cloudflare.com/zero-trust/products/email-security/"><u>Cloudflare Email Security</u></a> or their preferred solution, to prevent similar threats from reaching their inboxes in the future.</p><p>If you are interested in how Cloudflare can help secure your inboxes, sign up for a phishing risk assessment <a href="https://www.cloudflare.com/lp/email-security-self-guided-demo-request/?utm_medium=referral&amp;utm_source=blog&amp;utm_campaign=2025-q3-acq-gbl-modernsec-es-ge-general-ai_week_blog"><u>here</u></a>. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lV6mxQTYwaS6j0n0e8arE/fd62cf8032b15780690f4ed48578d3fc/image2.png" />
          </figure><div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[AI Week]]></category>
            <category><![CDATA[Cloud Email Security]]></category>
            <category><![CDATA[LLM]]></category>
            <guid isPermaLink="false">hzXLKdI5wqNlvwd0JKzXS</guid>
            <dc:creator>Ayush Kumar</dc:creator>
            <dc:creator>Nick Blazier</dc:creator>
            <dc:creator>Phil Syme</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we built the most efficient inference engine for Cloudflare’s network ]]></title>
            <link>https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/</link>
            <pubDate>Wed, 27 Aug 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Infire is an LLM inference engine that employs a range of techniques to maximize resource utilization, allowing us to serve AI models more efficiently with better performance for Cloudflare workloads. ]]></description>
            <content:encoded><![CDATA[ <p>Inference powers some of today’s most powerful AI products: chat bot replies, <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/"><u>AI agents</u></a>, autonomous vehicle decisions, and fraud detection. The problem is, if you’re building one of these products on top of a hyperscaler, you’ll likely need to rent expensive GPUs from large centralized data centers to run your inference tasks. That model doesn’t work for Cloudflare — there’s a mismatch between Cloudflare’s globally-distributed network and a typical centralized AI deployment using large multi-GPU nodes. As a company that operates our own compute on a lean, fast, and widely distributed network within 50ms of 95% of the world’s Internet-connected population, we need to be running inference tasks more efficiently than anywhere else.</p><p>This is further compounded by the fact that AI models are getting larger and more complex. As we started to support these models, like the Llama 4 herd and gpt-oss, we realized that we couldn’t just throw money at the scaling problems by buying more GPUs. We needed to utilize every bit of idle capacity and be agile with where each model is deployed. </p><p>After running most of our models on the widely used open source inference and serving engine <a href="https://github.com/vllm-project/vllm"><u>vLLM</u></a>, we figured out it didn’t allow us to fully utilize the GPUs at the edge. Although it can run on a very wide range of hardware, from personal devices to data centers, it is best optimized for large data centers. When run as a dedicated inference server on powerful hardware serving a specific model, vLLM truly shines. However, it is much less optimized for dynamic workloads, distributed networks, and for the unique security constraints of running inference at the edge alongside other services.</p><p>That’s why we decided to build something that will be able to meet the needs of Cloudflare inference workloads for years to come. Infire is an LLM inference engine, written in Rust, that employs a range of techniques to maximize memory, network I/O, and GPU utilization. It can serve more requests with fewer GPUs and significantly lower CPU overhead, saving time, resources, and energy across our network. </p><p>Our initial benchmarking has shown that Infire completes inference tasks up to 7% faster than vLLM 0.10.0 on unloaded machines equipped with an H100 NVL GPU. On infrastructure under real load, it performs significantly better. </p><p>Currently, Infire is powering the Llama 3.1 8B model for <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a>, and you can test it out today at <a href="https://developers.cloudflare.com/workers-ai/models/llama-3.1-8b-instruct-fast/"><u>@cf/meta/llama-3.1-8b-instruct</u></a>!</p>
    <div>
      <h2>The Architectural Challenge of LLM Inference at Cloudflare </h2>
      <a href="#the-architectural-challenge-of-llm-inference-at-cloudflare">
        
      </a>
    </div>
    <p>Thanks to industry efforts, inference has improved a lot over the past few years. vLLM has led the way here with the recent release of the vLLM V1 engine with features like an optimized KV cache, improved batching, and the implementation of Flash Attention 3. vLLM is great for most inference workloads — we’re currently using it for several of the models in our <a href="https://developers.cloudflare.com/workers-ai/models/"><u>Workers AI catalog</u></a> — but as our AI workloads and catalog has grown, so has our need to optimize inference for the exact hardware and performance requirements we have. </p><p>Cloudflare is writing much of our <a href="https://blog.cloudflare.com/rust-nginx-module/"><u>new infrastructure in Rust</u></a>, and vLLM is written in Python. Although Python has proven to be a great language for prototyping ML workloads, to maximize efficiency we need to control the low-level implementation details. Implementing low-level optimizations through multiple abstraction layers and Python libraries adds unnecessary complexity and leaves a lot of CPU performance on the table, simply due to the inefficiencies of Python as an interpreted language.</p><p>We love to contribute to open-source projects that we use, but in this case our priorities may not fit the goals of the vLLM project, so we chose to write a server for our needs. For example, vLLM does not support co-hosting multiple models on the same GPU without using Multi-Instance GPU (MIG), and we need to be able to dynamically schedule multiple models on the same GPU to minimize downtime. We also have an in-house AI Research team exploring unique features that are difficult, if not impossible, to upstream to vLLM. </p><p>Finally, running code securely is our top priority across our platform and <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Workers AI</u></a> is no exception. We simply can’t trust a 3rd party Python process to run on our edge nodes alongside the rest of our services without strong sandboxing. We are therefore forced to run vLLM via <a href="https://gvisor.dev"><u>gvisor</u></a>. Having an extra virtualization layer adds another performance overhead to vLLM. More importantly, it also increases the startup and tear downtime for vLLM instances — which are already pretty long. Under full load on our edge nodes, vLLM running via gvisor consumes as much as 2.5 CPU cores, and is forced to compete for CPU time with other crucial services, that in turn slows vLLM down and lowers GPU utilization as a result.</p><p>While developing Infire, we’ve been incorporating the latest research in inference efficiency — let’s take a deeper look at what we actually built.</p>
    <div>
      <h2>How Infire works under the hood </h2>
      <a href="#how-infire-works-under-the-hood">
        
      </a>
    </div>
    <p>Infire is composed of three major components: an OpenAI compatible HTTP server, a batcher, and the Infire engine itself.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BypYSG9QFsPjPFhjlOEsa/6ef5d4ccaabcd96da03116b7a14e8439/image2.png" />
          </figure><p><i><sup>An overview of Infire’s architecture </sup></i></p>
    <div>
      <h2>Platform startup</h2>
      <a href="#platform-startup">
        
      </a>
    </div>
    <p>When a model is first scheduled to run on a specific node in one of our data centers by our auto-scaling service, the first thing that has to happen is for the model weights to be fetched from our <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2 object storage</u></a>. Once the weights are downloaded, they are cached on the edge node for future reuse.</p><p>As the weights become available either from cache or from R2, Infire can begin loading the model onto the GPU. </p><p>Model sizes vary greatly, but most of them are <b>large, </b>so transferring them into GPU memory can be a time-consuming part of Infire’s startup process. For example, most non-quantized models store their weights in the BF16 floating point format. This format has the same dynamic range as the 32-bit floating format, but with reduced accuracy. It is perfectly suited for inference providing the sweet spot of size, performance and accuracy. As the name suggests, the BF16 format requires 16 bits, or 2 bytes per weight. The approximate in-memory size of a given model is therefore double the size of its parameters. For example, LLama3.1 8B has approximately 8B parameters, and its memory footprint is about 16 GB. A larger model, like LLama4 Scout, has 109B parameters, and requires around 218 GB of memory. Infire utilizes a combination of <a href="https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/#pinned_host_memory"><u>Page Locked</u></a> memory with CUDA asynchronous copy mechanism over multiple streams to speed up model transfer into GPU memory.</p><p>While loading the model weights, Infire begins just-in-time compiling the required kernels based on the model's parameters, and loads them onto the device. Parallelizing the compilation with model loading amortizes the latency of both processes. The startup time of Infire when loading the Llama-3-8B-Instruct model from disk is just under 4 seconds. </p>
    <div>
      <h3>The HTTP server</h3>
      <a href="#the-http-server">
        
      </a>
    </div>
    <p>The Infire server is built on top of <a href="https://docs.rs/hyper/latest/hyper/"><u>hyper</u></a>, a high performance HTTP crate, which makes it possible to handle hundreds of connections in parallel – while consuming a modest amount of CPU time. Because of ChatGPT’s ubiquity, vLLM and many other services offer OpenAI compatible endpoints out of the box. Infire is no different in that regard. The server is responsible for handling communication with the client: accepting connections, handling prompts and returning responses. A prompt will usually consist of some text, or a "transcript" of a chat session along with extra parameters that affect how the response is generated. Some parameters that come with a prompt include the temperature, which affects the randomness of the response, as well as other parameters that affect the randomness and length of a possible response.</p><p>After a request is deemed valid, Infire will pass it to the tokenizer, which transforms the raw text into a series of tokens, or numbers that the model can consume. Different models use different kinds of tokenizers, but the most popular ones use byte-pair encoding. For tokenization, we use HuggingFace's tokenizers crate. The tokenized prompts and params are then sent to the batcher, and scheduled for processing on the GPU, where they will be processed as vectors of numbers, called <a href="https://www.cloudflare.com/learning/ai/what-are-embeddings/"><u>embeddings</u></a>.</p>
    <div>
      <h2>The batcher</h2>
      <a href="#the-batcher">
        
      </a>
    </div>
    <p>The most important part of Infire is in how it does batching: by executing multiple requests in parallel. This makes it possible to better utilize memory bandwidth and caches. </p><p>In order to understand why batching is so important, we need to understand how the inference algorithm works. The weights of a model are essentially a bunch of two-dimensional matrices (also called tensors). The prompt represented as vectors is passed through a series of transformations that are largely dominated by one operation: vector-by-matrix multiplication. The model weights are so large, that the cost of the multiplication is dominated by the time it takes to fetch it from memory. In addition, modern GPUs have hardware units dedicated to matrix-by-matrix multiplications (called Tensor Cores on Nvidia GPUs). In order to amortize the cost of memory access and take advantage of the Tensor Cores, it is necessary to aggregate multiple operations into a larger matrix multiplication.</p><p>Infire utilizes two techniques to increase the size of those matrix operations. The first one is called prefill: this technique is applied to the prompt tokens. Because all the prompt tokens are available in advance and do not require decoding, they can all be processed in parallel. This is one reason why input tokens are often cheaper (and faster) than output tokens.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1pqyNSzgWLcgrV3urpCvA0/e204ac477992d591a7368632c36e97eb/image1.png" />
          </figure><p><sup><i>How Infire enables larger matrix multiplications via batching</i></sup></p><p>The other technique is called batching: this technique aggregates multiple prompts into a single decode operation.</p><p>Infire mixes both techniques. It attempts to process as many prompts as possible in parallel, and fills the remaining slots in a batch with prefill tokens from incoming prompts. This is also known as continuous batching with chunked prefill.</p><p>As tokens get decoded by the Infire engine, the batcher is also responsible for retiring prompts that reach an End of Stream token, and sending tokens back to the decoder to be converted into text. </p><p>Another job the batcher has is handling the KV cache. One demanding operation in the inference process is called <i>attention</i>. Attention requires going over the KV values computed for all the tokens up to the current one. If we had to recompute those previously encountered KV values for every new token we decode, the runtime of the process would explode for longer context sizes. However, using a cache, we can store all the previous values and re-read them for each consecutive token. Potentially the KV cache for a prompt can store KV values for as many tokens as the context window allows. In LLama 3, the maximal context window is 128K tokens. If we pre-allocated the KV cache for each prompt in advance, we would only have enough memory available to execute 4 prompts in parallel on H100 GPUs! The solution for this is paged KV cache. With paged KV caching, the cache is split into smaller chunks called pages. When the batcher detects that a prompt would exceed its KV cache, it simply assigns another page to that prompt. Since most prompts rarely hit the maximum context window, this technique allows for essentially unlimited parallelism under typical load.</p><p>Finally, the batcher drives the Infire forward pass by scheduling the needed kernels to run on the GPU.</p>
    <div>
      <h2>CUDA kernels</h2>
      <a href="#cuda-kernels">
        
      </a>
    </div>
    <p>Developing Infire gives us the luxury of focusing on the exact hardware we use, which is currently Nvidia Hopper GPUs. This allowed us to improve performance of specific compute kernels using low-level PTX instructions for this specific architecture.</p><p>Infire just-in-time compiles its kernel for the specific model it is running, optimizing for the model’s parameters, such as the hidden state size, dictionary size and the GPU it is running on. For some operations, such as large matrix multiplications, Infire will utilize the high performance cuBLASlt library, if it would deem it faster.</p><p>Infire also makes use of very fine-grained CUDA graphs, essentially creating a dedicated CUDA graph for every possible batch size on demand. It then stores it for future launch. Conceptually, a CUDA graph is another form of just-in-time compilation: the CUDA driver replaces a series of kernel launches with a single construct (the graph) that has a significantly lower amortized kernel launch cost, thus kernels executed back to back will execute faster when launched as a single graph as opposed to individual launches.</p>
    <div>
      <h2>How Infire performs in the wild </h2>
      <a href="#how-infire-performs-in-the-wild">
        
      </a>
    </div>
    <p>We ran synthetic benchmarks on one of our edge nodes with an H100 NVL GPU.</p><p>The benchmark we ran was on the widely used ShareGPT v3 dataset. We ran the benchmark on a set of 4,000 prompts with a concurrency of 200. We then compared Infire and vLLM running on bare metal as well as vLLM running under gvisor, which is the way we currently run in production. In a production traffic scenario, an edge node would be competing for resources with other traffic. To simulate this, we benchmarked vLLM running in gvisor with only one CPU available.</p><table><tr><td><p>
</p></td><td><p>requests/s</p></td><td><p>tokens/s</p></td><td><p>CPU load</p></td></tr><tr><td><p>Infire</p></td><td><p>40.91</p></td><td><p>17224.21</p></td><td><p>25%</p></td></tr><tr><td><p>vLLM 0.10.0</p></td><td><p>38.38</p></td><td><p>16164.41</p></td><td><p>140%</p></td></tr><tr><td><p>vLLM under gvisor</p></td><td><p>37.13</p></td><td><p>15637.32</p></td><td><p>250%</p></td></tr><tr><td><p>vLLM under gvisor with CPU constraints</p></td><td><p>22.04</p></td><td><p>9279.25</p></td><td><p>100%</p></td></tr></table><p>As evident from the benchmarks we achieved our initial goal of matching and even slightly surpassing vLLM performance, but more importantly, we’ve done so at a significantly lower CPU usage, in large part because we can run Infire as a trusted bare-metal process. Inference no longer takes away precious resources from our other services and we see GPU utilization upward of 80%, reducing our operational costs.</p><p>This is just the beginning. There are still multiple proven performance optimizations yet to be implemented in Infire – for example, we’re integrating Flash Attention 3, and most of our kernels don’t utilize kernel fusion. Those and other optimizations will allow us to unlock even faster inference in the near future.</p>
    <div>
      <h2>What’s next </h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Running AI inference presents novel challenges and demands to our infrastructure. Infire is how we’re running AI efficiently — close to users around the world. By building upon techniques like continuous batching, a paged KV-cache, and low-level optimizations tailored to our hardware, Infire maximizes GPU utilization while minimizing overhead. Infire completes inference tasks faster and with a fraction of the CPU load of our previous vLLM-based setup, especially under the strict security constraints we require. This allows us to serve more requests with fewer resources, making requests served via Workers AI faster and more efficient.</p><p>However, this is just our first iteration — we’re excited to build in multi-GPU support for larger models, quantization, and true multi-tenancy into the next version of Infire. This is part of our goal to make Cloudflare the best possible platform for developers to build AI applications.</p><p>Want to see if your AI workloads are faster on Cloudflare? <a href="https://developers.cloudflare.com/workers-ai/"><u>Get started</u></a> with Workers AI today. </p> ]]></content:encoded>
            <category><![CDATA[AI Week]]></category>
            <category><![CDATA[LLM]]></category>
            <category><![CDATA[Workers AI]]></category>
            <guid isPermaLink="false">7Li4fkq9b4B8QlgwSmZrqE</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
            <dc:creator>Mari Galicer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Block unsafe prompts targeting your LLM endpoints with Firewall for AI]]></title>
            <link>https://blog.cloudflare.com/block-unsafe-llm-prompts-with-firewall-for-ai/</link>
            <pubDate>Tue, 26 Aug 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare's AI security suite now includes unsafe content moderation, integrated into the Application Security Suite via Firewall for AI.  ]]></description>
            <content:encoded><![CDATA[ <p>Security teams are racing to <a href="https://www.cloudflare.com/the-net/vulnerable-llm-ai/"><u>secure a new attack surface</u></a>: AI-powered applications. From chatbots to search assistants, LLMs are already shaping customer experience, but they also open the door to new risks. A single malicious prompt can exfiltrate sensitive data, <a href="https://www.cloudflare.com/learning/ai/data-poisoning/"><u>poison a model</u></a>, or inject toxic content into customer-facing interactions, undermining user trust. Without guardrails, even the best-trained model can be turned against the business.</p><p>Today, as part of AI Week, we’re expanding our <a href="https://www.cloudflare.com/ai-security/">AI security offerings</a> by introducing unsafe content moderation, now integrated directly into Cloudflare <a href="https://developers.cloudflare.com/waf/detections/firewall-for-ai/"><u>Firewall for AI</u></a>. Built with Llama, this new feature allows customers to leverage their existing Firewall for AI engine for unified detection, analytics, and topic enforcement, providing real-time protection for <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/"><u>Large Language Models (LLMs)</u></a> at the network level. Now with just a few clicks, security and application teams can detect and block harmful prompts or topics at the edge — eliminating the need to modify application code or infrastructure.

This feature is immediately available to current Firewall for AI users. Those not yet onboarded can contact their account team to participate in the beta program.</p>
    <div>
      <h2>AI protection in application security</h2>
      <a href="#ai-protection-in-application-security">
        
      </a>
    </div>
    <p>Cloudflare's Firewall for AI <a href="https://blog.cloudflare.com/best-practices-sase-for-ai/">protects user-facing LLM applications</a> from abuse and data leaks, addressing several of the <a href="https://www.cloudflare.com/learning/ai/owasp-top-10-risks-for-llms/"><u>OWASP Top 10 LLM risks</u></a> such as prompt injection, PII disclosure, and unbound consumption. It also extends protection to other risks such as unsafe or harmful content.</p><p>Unlike built-in controls that vary between model providers, Firewall for AI is model-agnostic. It sits in front of any model you choose, whether it’s from a third party like OpenAI or Gemini, one you run in-house, or a custom model you have built, and applies the same consistent protections.</p><p>Just like our origin-agnostic <a href="https://www.cloudflare.com/application-services/#application-services-case-products"><u>Application Security suite</u></a>, Firewall for AI enforces policies at scale across all your models, creating a unified security layer. That means you can define guardrails once and apply them everywhere. For example, a financial services company might require its LLM to only respond to finance-related questions, while blocking prompts about unrelated or sensitive topics, enforced consistently across every model in use.</p>
    <div>
      <h2>Unsafe content moderation protects businesses and users</h2>
      <a href="#unsafe-content-moderation-protects-businesses-and-users">
        
      </a>
    </div>
    <p>Effective AI moderation is more than blocking “bad words”, it’s about setting boundaries that protect users, meeting legal obligations, and preserving brand integrity, without over-moderating in ways that silence important voices.</p><p>Because LLMs cannot be fully scripted, their interactions are inherently unpredictable. This flexibility enables rich user experiences but also opens the door to abuse.</p><p>Key risks from unsafe prompts include misinformation, biased or offensive content, and model poisoning, where repeated harmful prompts degrade the quality and safety of future outputs. Blocking these prompts aligns with the OWASP Top 10 for LLMs, preventing both immediate misuse and long-term degradation.</p><p>One example of this is<a href="https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist"> <b><u>Microsoft’s Tay chatbot</u></b></a>. Trolls deliberately submitted toxic, racist, and offensive prompts, which Tay quickly began repeating. The failure was not only in Tay’s responses; it was in the lack of moderation on the inputs it accepted.</p>
    <div>
      <h2>Detecting unsafe prompts before reaching the model</h2>
      <a href="#detecting-unsafe-prompts-before-reaching-the-model">
        
      </a>
    </div>
    <p>Cloudflare has integrated <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B"><u>Llama Guard</u></a> directly into Firewall for AI. This brings AI input moderation into the same rules engine our customers already use to protect their applications. It uses the same approach that we created for developers building with AI in our <a href="https://blog.cloudflare.com/guardrails-in-ai-gateway/"><u>AI Gateway</u></a> product.</p><p>Llama Guard analyzes prompts in real time and flags them across multiple safety categories, including hate, violence, sexual content, criminal planning, self-harm, and more.</p><p>With this integration, Firewall for AI not only <a href="https://blog.cloudflare.com/take-control-of-public-ai-application-security-with-cloudflare-firewall-for-ai/#discovering-llm-powered-applications"><u>discovers LLM traffic</u></a> endpoints automatically, but also enables security and AI teams to take immediate action. Unsafe prompts can be blocked before they reach the model, while flagged content can be logged or reviewed for oversight and tuning. Content safety checks can also be combined with other Application Security protections, such as <a href="https://www.cloudflare.com/application-services/products/bot-management/"><u>Bot Management</u> </a>and <a href="https://www.cloudflare.com/application-services/products/rate-limiting/"><u>Rate Limiting</u></a>, to create layered defenses when protecting your model.</p><p>The result is a single, edge-native policy layer that enforces guardrails before unsafe prompts ever reach your infrastructure — without needing complex integrations.</p>
    <div>
      <h2>How it works under the hood</h2>
      <a href="#how-it-works-under-the-hood">
        
      </a>
    </div>
    <p>Before diving into the architecture of Firewall for AI engine and how it fits within our previously mentioned module to detect <a href="https://blog.cloudflare.com/take-control-of-public-ai-application-security-with-cloudflare-firewall-for-ai/#using-workers-ai-to-deploy-presidio"><u>PII in the prompts</u></a>, let’s start with how we detect unsafe topics.</p>
    <div>
      <h3>Detection of unsafe topics</h3>
      <a href="#detection-of-unsafe-topics">
        
      </a>
    </div>
    <p>A key challenge in building safety guardrails is balancing a good detection with model helpfulness. If detection is too broad, it can prevent a model from answering legitimate user questions, hurting its utility. This is especially difficult for topic detection because of the ambiguity and dynamic nature of human language, where context is fundamental to meaning. </p><p>Simple approaches like keyword blocklists are interesting for precise subjects — but insufficient. They are easily bypassed and fail to understand the context in which words are used, leading to poor recall. Older probabilistic models such as <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation"><u>Latent Dirichlet Allocation (LDA)</u></a> were an improvement, but did not properly account for word ordering and other contextual nuances. 

Recent advancements in LLMs introduced a new paradigm. Their ability to perform zero-shot or few-shot classification is uniquely suited for the task of topic detection. For this reason, we chose <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B"><u>Llama Guard 3</u></a>, an open-source model based on the Llama architecture that is specifically fine-tuned for content safety classification. When it analyzes a prompt, it answers whether the text is safe or unsafe, and provides a specific category. We are showing the default categories, as listed <a href="http://developers.cloudflare.com/ruleset-engine/rules-language/fields/reference/cf.llm.prompt.unsafe_topic_categories/"><u>here</u></a>. Because Llama 3 has a fixed knowledge cutoff, certain categories — like defamation or elections — are time-sensitive. As a result, the model may not fully capture events or context that emerged after it was trained, and that’s important to keep in mind when relying on it.</p><p>For now, we cover the 13 default categories. We plan to expand coverage in the future, leveraging the model’s zero-shot capabilities.</p>
    <div>
      <h3>A scalable architecture for future detections</h3>
      <a href="#a-scalable-architecture-for-future-detections">
        
      </a>
    </div>
    <p>We designed Firewall for AI to scale without adding noticeable latency, including Llama Guard, and this remains true even as we add new detection models.</p><p>To achieve this, we built a new asynchronous architecture. When a request is sent to an application protected by Firewall for AI, a Cloudflare Worker makes parallel, non-blocking requests to our different detection modules — one for PII, one for unsafe topics, and others as we add them. </p><p>Thanks to the Cloudflare network, this design scales to handle high request volumes out of the box, and latency does not increase as we add new detections. It will only be bounded by the slowest model used. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Y2gTP6teVR2263UIEWHc9/9a31fb394cee6c437c1d4af6f71d867c/image3.png" />
          </figure><p>We optimize to keep the model utility at its maximum while keeping the guardrail detection broad enough.</p><p>Llama Guard is a rather large model, so running it at scale with minimal latency is a challenge. We deploy it on <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Workers AI</u></a>, leveraging our large fleet of high performance GPUs. This infrastructure ensures we can offer fast, reliable inference throughout our network.</p><p>To ensure the system remains fast and reliable as adoption grows, we ran extensive load tests simulating the requests per second (RPS) we anticipate, using a wide range of prompt sizes to prepare for real-world traffic. To handle this, the number of model instances deployed on our network scales automatically with the load. We employ concurrency to minimize latency and optimize for hardware utilization. We also enforce a hard 2-second threshold for each analysis; if this time limit is reached, we fall back to any detections already completed, ensuring your application's requests latency is never further impacted.</p>
    <div>
      <h3>From detection to security rules enforcement</h3>
      <a href="#from-detection-to-security-rules-enforcement">
        
      </a>
    </div>
    <p>Firewall for AI follows the same familiar pattern as other Application Security features like Bot Management and WAF Attack Score, making it easy to adopt.</p><p>Once enabled, the <a href="https://developers.cloudflare.com/waf/detections/firewall-for-ai/#fields"><u>new fields</u></a> appear in <a href="https://developers.cloudflare.com/waf/analytics/security-analytics/"><u>Security Analytics</u></a> and expanded logs. From there, you can filter by unsafe topics, track trends over time, and drill into the results of individual requests to see all detection outcomes, for example: did we detect unsafe topics, and what are the categories. The request body itself (the prompt text) is not stored or exposed; only the results of the analysis are logged.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/722JxyLvT6DFQxFpQhHMYP/3f1a6aa8ef1dafe4ad1a8277578fd7ae/image2.png" />
          </figure><p>After reviewing the analytics, you can enforce unsafe topic moderation by creating rules to log or block based on prompt categories in <a href="https://developers.cloudflare.com/waf/custom-rules/"><u>Custom rules</u></a>.</p><p>For example, you might log prompts flagged as sexual content or hate speech for review. </p><p>You can use this expression: 
<code>If (any(cf.llm.prompt.unsafe_topic_categories[*] in {"S10" "S12"})) then Log</code>

Or deploy the rule with the categories field in the dashboard as in the below screenshot.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CUsVjjpCEqv2UQMU6cMmt/5307235338c1b58856c0685585347537/image4.png" />
          </figure><p>You can also take a broader approach by blocking all unsafe prompts outright:
<code>If (cf.llm.prompt.unsafe_topic_detected)then Block</code></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uRT9YlRlRPsL5bNyBFA3i/54eb171ecb48aaecc7876b972789bf15/image5.png" />
          </figure><p>These rules are applied automatically to all discovered HTTP requests containing prompts, ensuring guardrails are enforced consistently across your AI traffic.</p>
    <div>
      <h2>What’s Next</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>In the coming weeks, Firewall for AI will expand to detect prompt injection and jailbreak attempts. We are also exploring how to add more visibility in the analytics and logs, so teams can better validate detection results. A major part of our roadmap is adding model response handling, giving you control over not only what goes into the LLM but also what comes out. Additional abuse controls, such as rate limiting on tokens and support for more safety categories, are also on the way.</p><p>Firewall for AI is available in beta today. If you’re new to Cloudflare and want to explore how to implement these AI protections, <a href="https://www.cloudflare.com/plans/enterprise/contact/?utm_medium=referral&amp;utm_source=blog&amp;utm_campaign=2025-q3-acq-gbl-connectivity-ge-ge-general-ai_week_blog"><u>reach out for a consultation</u></a>. If you’re already with Cloudflare, contact your account team to get access and start testing with real traffic.</p><p>Cloudflare is also opening up a user research program focused on <a href="https://www.cloudflare.com/learning/ai/what-is-ai-security/">AI security</a>. If you are curious about previews of new functionality or want to help shape our roadmap, <a href="https://www.cloudflare.com/lp/ai-security-user-research-program-2025"><u>express your interest here</u></a>.</p><div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[AI Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[LLM]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">59hk6A3nH3YcLMjXhYnNof</guid>
            <dc:creator>Radwa Radwan</dc:creator>
            <dc:creator>Mathias Deschamps</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Cloudy, Cloudflare’s AI agent for simplifying complex configurations]]></title>
            <link>https://blog.cloudflare.com/introducing-ai-agent/</link>
            <pubDate>Thu, 20 Mar 2025 13:10:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s first AI agent, Cloudy, helps make complicated configurations easy to understand for Cloudflare administrators. ]]></description>
            <content:encoded><![CDATA[ <p>It’s a big day here at Cloudflare! Not only is it Security Week, but today marks Cloudflare’s first step into a completely new area of functionality, intended to improve how our users both interact with, and get value from, all of our products.</p><p>We’re excited to share a first glance of how we’re embedding <a href="https://www.cloudflare.com/learning/ai/what-is-artificial-intelligence/">AI</a> features into the management of Cloudflare products you know and love. Our first mission? Focus on security and streamline the rule and policy management experience. The goal is to automate away the time-consuming task of manually reviewing and contextualizing Custom Rules in <a href="https://www.cloudflare.com/application-services/products/waf/">Cloudflare WAF</a>, and Gateway policies in Cloudflare One, so you can instantly understand what each policy does, what gaps they have, and what you need to do to fix them.</p>
    <div>
      <h3>Meet Cloudy, Cloudflare’s first AI agent</h3>
      <a href="#meet-cloudy-cloudflares-first-ai-agent">
        
      </a>
    </div>
    <p>Our initial step toward a fully AI-enabled product experience is the introduction of <i>Cloudy</i>, the first version of Cloudflare AI agents, assistant-like functionality designed to help users quickly understand and improve their Cloudflare configurations in multiple areas of the product suite. You’ll start to see Cloudy functionality seamlessly embedded into two Cloudflare products across the dashboard, which we’ll talk about below.</p><p>And while the name <i>Cloudy</i> may be fun and light-hearted, our goals are more serious: Bring Cloudy and AI-powered functionality to every corner of Cloudflare, and optimize how our users operate and manage their favorite Cloudflare products. Let’s start with two places where Cloudy is now live and available to all customers using the WAF and Gateway products.</p>
    <div>
      <h3>WAF Custom Rules</h3>
      <a href="#waf-custom-rules">
        
      </a>
    </div>
    <p>Let’s begin with AI-powered overviews of <a href="https://developers.cloudflare.com/waf/custom-rules/"><u>WAF Custom Rules</u></a>. For those unfamiliar, Cloudflare’s Web Application Firewall (WAF) helps protect web applications from attacks like <a href="https://www.cloudflare.com/learning/security/threats/sql-injection/">SQL injection</a>, <a href="https://www.cloudflare.com/learning/security/threats/cross-site-scripting/">cross-site scripting (XSS)</a>, and other vulnerabilities. </p><p>One specific feature of the WAF is the ability to create WAF Custom Rules. These allow users to tailor security policies to block, challenge, or allow traffic based on specific attributes or security criteria.</p><p>However, for customers with dozens or even hundreds of rules deployed across their organization, it can be challenging to maintain a clear understanding of their security posture. Rule configurations evolve over time, often managed by different team members, leading to potential inefficiencies and security gaps. What better problem for Cloudy to solve?</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4zcFRfhRWGQWhoza9TolDu/25e1357540db32e59150609e6eddd1e0/BLOG-2692_2.png" />
          </figure><p>Powered by <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a>, today we’ll share how Cloudy will help review your WAF Custom Rules and provide a summary of what's configured across them. Cloudy will also help you identify and solve issues such as:</p><ul><li><p><b>Identifying redundant rules</b>: Identify when multiple rules are performing the same function, or using similar fields, helping you streamline your configuration.</p></li><li><p><b>Optimising execution order</b>: Spot cases where rules ordering affects functionality, such as when a terminating rule (block/challenge action) prevents subsequent rules from executing.</p></li><li><p><b>Analysing conflicting rules</b>: Detect when rules counteract each other, such as one rule blocking traffic that another rule is designed to allow or log.</p></li><li><p><b>Identifying disabled rules</b>: Highlight potentially important security rules that are in a disabled state, helping ensure that critical protections are not accidentally left inactive.</p></li></ul><p>Cloudy won't just summarize your rules, either. It will analyze the relationships and interactions between rules to provide actionable recommendations. For security teams managing complex sets of Custom Rules, this means less time spent auditing configurations and more confidence in your security coverage.</p><p>Available to all users, we’re excited to show how Cloudflare AI Agents can enhance the usability of our products, starting with WAF Custom Rules. But this is just the beginning.</p>
    <div>
      <h3>Cloudflare One Firewall policies</h3>
      <a href="#cloudflare-one-firewall-policies">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4CXHQVlO3GGqwp6DGyOklJ/3068c434c4a303cf22c328c302947fcb/BLOG-2692_3.png" />
          </figure><p>We've also added Cloudy to <a href="https://www.cloudflare.com/static/e9ea5dfaa69c554cc1cbaa7f3e441acf/Cloudflare_One_at_a_glance.pdf"><u>Cloudflare One</u></a>, our SASE platform, where enterprises manage the security of their employees and tools from a single dashboard.</p><p>In <a href="https://www.cloudflare.com/zero-trust/products/gateway/"><u>Cloudflare Gateway</u></a>, our Secure Web Gateway offering, customers can configure policies to manage how employees do their jobs on the Internet. These Gateway policies can block access to malicious sites, prevent data loss violations, and control user access, among other things.</p><p>But similar to WAF Custom Rules, Gateway policy configurations can become overcomplicated and bogged down over time, with old, forgotten policies that do who-knows-what. Multiple selectors and operators working in counterintuitive ways. Some blocking traffic, others allowing it. Policies that include several user groups, but carve out specific employees. We’ve even seen policies that block hundreds of URLs in a single step. All to say, managing years of Gateway policies can become overwhelming.</p><p>So, why not have Cloudy summarize Gateway policies in a way that makes their purpose clear and concise?</p><p>Available to all Cloudflare Gateway users (create a free Cloudflare One account <a href="https://www.cloudflare.com/zero-trust/products/"><u>here</u></a>), Cloudy will now provide a quick summary of any Gateway policy you view. It’s now easier than ever to get a clear understanding of each policy at a glance, allowing admins to spot misconfigurations, redundant controls, or other areas for improvement, and move on with confidence.</p>
    <div>
      <h3>Built on Workers AI</h3>
      <a href="#built-on-workers-ai">
        
      </a>
    </div>
    <p>At the heart of our new functionality is <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Cloudflare Workers AI</u></a> (yes, the same version that everyone uses!) that leverages advanced <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/">large language models (LLMs) </a>to process vast amounts of information; in this case, policy and rules data. Traditionally, manually reviewing and contextualizing complex configurations is a daunting task for any security team. With Workers AI, we automate that process, turning raw configuration data into consistent, clear summaries and actionable recommendations.</p>
    <div>
      <h4><b>How it works</b></h4>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>Cloudflare Workers AI ingests policy and rule configurations from your Cloudflare setup and combines them with a purpose-built LLM prompt. We leverage the same <a href="https://developers.cloudflare.com/workers-ai/models/"><u>publicly-available LLM models</u></a> that we offer our customers, and then further enrich the prompt with some additional data to provide it with context. For this specific task of analyzing and summarizing policy and rule data, we provided the LLM:</p><ul><li><p><b>Policy &amp; rule data</b>: This is the primary data itself, including the current configuration of policies/rules for Cloudy to summarize and provide suggestions against.</p></li><li><p><b>Documentation on product abilities:</b> We provide the model with additional technical details on the policy/rule configurations that are possible with each product, so that the model knows what kind of recommendations are within its bounds.</p></li><li><p><b>Enriched datasets</b>: Where WAF Custom Rules or CF1 Gateway policies leverage other ‘lists’ (e.g., a WAF rule referencing multiple countries, a Gateway policy leveraging a specific content category), the list item(s) selected must be first translated from an ID to plain-text wording so that the LLM can interpret which policy/rule values are actually being used.</p></li><li><p><b>Output instructions</b>: We specify to the model which format we’d like to receive the output in. In this case, we use JSON for easiest handling.</p></li><li><p><b>Additional clarifications</b>: Lastly, we explicitly instruct the LLM to be sure about its output, valuing that aspect above all else. Doing this helps us ensure that no hallucinations make it to the final output.</p></li></ul><p>By automating the analysis of your WAF Custom Rules and Gateway policies, Cloudflare Workers AI not only saves you time but also enhances security by reducing the risk of human error. You get clear, actionable insights that allow you to streamline your configurations, quickly spot anomalies, and maintain a strong security posture—all without the need for labor-intensive manual reviews.</p>
    <div>
      <h4>What’s next for Cloudy</h4>
      <a href="#whats-next-for-cloudy">
        
      </a>
    </div>
    <p>Beta previews of Cloudy are live for all Cloudflare customers today. But this is just the beginning of what we envision for AI-powered functionality across our entire product suite.</p><p>Throughout the rest of 2025, we plan to roll out additional <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/">AI agent capabilities</a> across other areas of Cloudflare. These new features won’t just help customers manage security more efficiently, but they’ll also provide intelligent recommendations for optimizing performance, streamlining operations, and enhancing overall user experience.</p><p>We’re excited to hear your thoughts as you get to meet Cloudy and try out these new AI features – send feedback to us at <a><u>cloudyfeedback@cloudflare.com</u></a>, or post your thoughts on X, LinkedIn, or Mastodon tagged with #SecurityWeek! Your feedback will help shape our roadmap for AI enhancement, and bring our users smarter, more efficient tooling that helps everyone get more secure.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gGseiyO6pbddpdSVQ5wfJ/ae1d0d5a2f8ec01f571de7a85b655370/BLOG-2692_4.png" />
          </figure>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[LLM]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <category><![CDATA[SASE]]></category>
            <category><![CDATA[Secure Web Gateway]]></category>
            <category><![CDATA[Beta]]></category>
            <category><![CDATA[Network Services]]></category>
            <guid isPermaLink="false">7ywSxti5U7fxjKbqmVXpGW</guid>
            <dc:creator>Alex Dunbrack</dc:creator>
            <dc:creator>Harsh Saxena</dc:creator>
        </item>
        <item>
            <title><![CDATA[Take control of public AI application security with Cloudflare's Firewall for AI]]></title>
            <link>https://blog.cloudflare.com/take-control-of-public-ai-application-security-with-cloudflare-firewall-for-ai/</link>
            <pubDate>Wed, 19 Mar 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Firewall for AI discovers and protects your public LLM-powered applications, and is seamlessly integrated with Cloudflare WAF. Join the beta now and take control of your generative AI security. ]]></description>
            <content:encoded><![CDATA[ <p>Imagine building an LLM-powered assistant trained on your developer documentation and some internal guides to quickly help customers, reduce support workload, and improve user experience. Sounds great, right? But what if sensitive data, such as employee details or internal discussions, is included in the data used to train the LLM? Attackers could manipulate the assistant into exposing sensitive data or exploit it for social engineering attacks, where they deceive individuals or systems into revealing confidential details, or use it for targeted phishing attacks. Suddenly, your helpful AI tool turns into a serious security liability. </p>
    <div>
      <h3>Introducing Firewall for AI: the easiest way to discover and protect LLM-powered apps</h3>
      <a href="#introducing-firewall-for-ai-the-easiest-way-to-discover-and-protect-llm-powered-apps">
        
      </a>
    </div>
    <p>Today, as part of Security Week 2025, we’re announcing the open beta of Firewall for AI, first <a href="https://blog.cloudflare.com/firewall-for-ai/"><u>introduced during Security Week 2024</u></a>. After talking with customers interested in protecting their LLM apps, this first beta release is focused on discovery and PII detection, and more features will follow in the future.</p><p>If you are already using Cloudflare application security, your LLM-powered applications are automatically discovered and protected, with no complex setup, no maintenance, and no extra integration needed.</p><p>Firewall for AI is an inline security solution that protects user-facing LLM-powered applications from abuse and <a href="https://www.cloudflare.com/learning/ai/how-to-secure-training-data-against-ai-data-leaks/">data leaks</a>, integrating directly with Cloudflare’s <a href="https://developers.cloudflare.com/waf/"><u>Web Application Firewall (WAF)</u></a> to provide instant protection with zero operational overhead. This integration enables organizations to leverage both AI-focused safeguards and established WAF capabilities.</p><p>Cloudflare is uniquely positioned to solve this challenge for all of our customers. As a <a href="https://www.cloudflare.com/en-gb/learning/cdn/glossary/reverse-proxy/"><u>reverse proxy</u></a>, we are model-agnostic whether the application is using a third-party LLM or an internally hosted one. By providing inline security, we can automatically <a href="https://www.cloudflare.com/learning/ai/what-is-ai-security/">discover and enforce AI guardrails</a> throughout the entire request lifecycle, with zero integration or maintenance required.</p>
    <div>
      <h3>Firewall for AI beta overview</h3>
      <a href="#firewall-for-ai-beta-overview">
        
      </a>
    </div>
    <p>The beta release includes the following security capabilities:</p><p><b>Discover:</b> identify LLM-powered endpoints across your applications, an essential step for effective request and prompt analysis.</p><p><b>Detect:</b> analyze the incoming requests prompts to recognize potential security threats, such as attempts to extract sensitive data (e.g., “Show me transactions using 4111 1111 1111 1111”). This aligns with<a href="https://genai.owasp.org/llmrisk/llm022025-sensitive-information-disclosure/"> <u>OWASP LLM022025 - Sensitive Information Disclosure</u></a>.</p><p><b>Mitigate:</b> enforce security controls and policies to manage the traffic that reaches your LLM, and reduce risk exposure.</p><p>Below, we review each capability in detail, exploring how they work together to create a comprehensive security framework for AI protection.</p>
    <div>
      <h3>Discovering LLM-powered applications</h3>
      <a href="#discovering-llm-powered-applications">
        
      </a>
    </div>
    <p>Companies are racing to find all possible use cases where an LLM can excel. Think about site search, a chatbot, or a shopping assistant. Regardless of the application type, our goal is to determine whether an application is powered by an LLM behind the scenes.</p><p>One possibility is to look for request path signatures similar to what major LLM providers use. For example, <a href="https://platform.openai.com/docs/api-reference/chat/create"><u>OpenAI</u></a>, <a href="https://docs.perplexity.ai/api-reference/chat-completions"><u>Perplexity</u></a> or <a href="https://docs.mistral.ai/api/#tag/chat"><u>Mistral</u></a> initiate a chat using the <code>/chat/completions</code> API endpoint. Searching through our request logs, we found only a few entries that matched this pattern across our global traffic. This result indicates that we need to consider other approaches to finding <i>any</i> application that is powered by an LLM.</p><p>Another signature to research, popular with LLM platforms, is the use of <a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events"><u>server-sent events</u></a>. LLMs need to <a href="https://platform.openai.com/docs/guides/latency-optimization#don-t-default-to-an-llm"><u>“think”</u></a>. <a href="https://platform.openai.com/docs/api-reference/streaming"><u>Using server-sent events</u></a> improves the end user’s experience by sending over each <a href="https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them"><u>token</u></a> as soon as it is ready, creating the perception that an LLM is “thinking” like a human being. Matching on requests of server-sent events is straightforward using the response header content type of <code>text/event-stream</code>. This approach expands the coverage further, but does not yet cover the <a href="https://stackoverflow.blog/2022/06/02/a-beginners-guide-to-json-the-data-format-for-the-internet/"><u>majority of applications</u></a> that are using JSON format for data exchanges. Continuing the journey, our next focus is on the responses having header content type of <code>application/json</code>.</p><p>No matter how fast LLMs can be optimized to respond, when chatting with major LLMs, we often perceive them to be slow, as we have to wait for them to “think”. By plotting on how much time it takes for the origin server to respond over identified LLM endpoints (blue line) versus the rest (orange line), we can see in the left graph that origins serving LLM endpoints mostly need more than 1 second to respond, while the majority of the rest takes less than 1 second. Would we also see a clear distinction between origin server response body sizes, where the majority of LLM endpoints would respond with smaller sizes because major LLM providers <a href="https://platform.openai.com/docs/guides/safety-best-practices#constrain-user-input-and-limit-output-tokens"><u>limit output tokens</u></a>? Unfortunately not. The right graph shows that LLM response size largely overlaps with non-LLM traffic.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4CkowlKelGlYueNzSrbsGn/f8091d66a6c0eb8b884c7cc6f2a128ab/1.png" />
          </figure><p>By dividing origin response size over origin response duration to calculate an effective bitrate, the distinction is even clearer that 80% of LLM endpoints operate slower than 4 KB/s.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sJKUnLmWwnTWzUOyWVRKO/c98c8fc32dbafa5d79f21effdfc58f34/2.png" />
          </figure><p>Validating this assumption by using bitrate as a heuristic across Cloudflare’s traffic, we found that roughly 3% of all origin server responses have a bitrate lower than 4 KB/s. Are these responses all powered by LLMs? Our gut feeling tells us that it is unlikely that 3% of origin responses are LLM-powered! </p><p>Among the paths found in the 3% of matching responses, there are few patterns that stand out: 1) GraphQL endpoints, 2) device heartbeat or health check, 3) generators (for QR codes, one time passwords, invoices, etc.). Noticing this gave us the idea to filter out endpoints that have a low variance of response size over time — for instance, invoice generation is mostly based on the same template, while conversations in the LLM context have a higher variance.</p><p>A combination of filtering out known false positive patterns and low variance in response size gives us a satisfying result. These matching endpoints, approximately 30,000 of them, labelled <code>cf-llm</code>, can now be found in API Shield or Web assets, depending on your dashboard’s version, for all customers. Now you can review your endpoints and decide how to best protect them.</p>
    <div>
      <h3>Detecting prompts designed to leak PII</h3>
      <a href="#detecting-prompts-designed-to-leak-pii">
        
      </a>
    </div>
    <p>There are multiple methods to detect PII in LLM prompts. A common method relies on <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions"><u>regular expressions (“regexes”)</u></a>, which is a method we have been using in the WAF for <a href="https://developers.cloudflare.com/waf/managed-rules/reference/sensitive-data-detection/"><u>Sensitive Data Detection</u></a> on the body of the HTTP response from the web server Regexes offer low latency, easy customization, and straightforward implementation. However, regexes alone have limitations when applied to LLM prompts. They require frequent updates to maintain accuracy, and may struggle with more complex or implicit PII, where the information is spread across text rather than a fixed format. </p><p>For example, regexes work well for structured data like credit card numbers and addresses, but struggle with PII is embedded in natural language. For instance, “I just booked a flight using my Chase card, ending in 1111” wouldn’t trigger a regex match as it lacks the expected pattern, even though it reveals a partial credit card number and financial institution.</p><p>To enhance detection, we rely on a <a href="https://www.ibm.com/think/topics/named-entity-recognition"><u>Named Entity Recognition (NER)</u></a> model, which adds a layer of intelligence to complement regex-based detection. NER models analyze text to identify contextual PII data types, such as names, phone numbers, email addresses, and credit card numbers, making detection more flexible and accurate. Cloudflare’s detection utilizes <a href="https://microsoft.github.io/presidio/"><u>Presidio</u></a>, an open-source PII detection framework, to further strengthen this approach.</p>
    <div>
      <h4>Using Workers AI to deploy Presidio</h4>
      <a href="#using-workers-ai-to-deploy-presidio">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ruqPkxJBgFCRdsoft1TO1/aa327b069569a0f952c8baea102955b8/3.png" />
          </figure><p>In our design, we leverage Cloudflare <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> as the fastest way to deploy <a href="https://microsoft.github.io/presidio/"><u>Presidio</u></a>. This integration allows us to process LLM app requests inline, ensuring that sensitive data is flagged before it reaches the model.</p><p>Here’s how it works:</p><p>When Firewall for AI is enabled on an application and an end user sends a request to an LLM-powered application, we pass the request to Cloudflare Workers AI which runs the request through Presidio’s NER-based detection model to identify any potential PII from the available <a href="https://microsoft.github.io/presidio/supported_entities/"><u>entities</u></a>. The output includes metadata like “Was PII found?” and “What type of PII entity?”. This output is then processed in our Firewall for AI module, and handed over to other systems, like <a href="https://developers.cloudflare.com/waf/analytics/security-analytics/"><u>Security Analytics</u></a> for visibility, and the rules like <a href="https://developers.cloudflare.com/waf/custom-rules/"><u>Custom rules</u></a> for enforcement. Custom rules allow customers to take appropriate actions on the requests based on the provided metadata. </p><p>If no terminating action, like blocking, is triggered, the request proceeds to the LLM. Otherwise, it gets blocked or the appropriate action is applied before reaching the origin.</p>
    <div>
      <h3>Integrating AI security into the WAF and Analytics</h3>
      <a href="#integrating-ai-security-into-the-waf-and-analytics">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/ai-security/">Securing AI interactions</a> shouldn't require complex integrations. Firewall for AI is seamlessly built into Cloudflare’s WAF, allowing customers to enforce security policies before prompts reach LLM endpoints. With this integration, there are <a href="https://developers.cloudflare.com/waf/detections/firewall-for-ai/#fields"><u>new fields available</u></a> in Custom and Rate limiting rules. The rules can be used to take immediate action, such as blocking or logging risky prompts in real time.</p><p>For example, security teams can filter LLM traffic to analyze requests containing PII-related prompts. Using Cloudflare’s WAF rules engine, they can create custom security policies tailored to their AI applications.</p><p>Here’s what a rule to block detected PII prompts looks like:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cvlFU0sia6dZly2LZGG8l/670dbb1ad5068f0fd5d8f4afde9e9e02/4.png" />
          </figure><p>Alternatively, if an organization wants to allow certain PII categories, such as location data, they can create an exception rule:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/wYFkoQyHFoFNwHmKtaaG3/94c7ae78dbabacf5dd8583af9e8eb071/5.png" />
          </figure><p>In addition to the rules, users can gain visibility into LLM interactions, detect potential risks, and enforce security controls using <a href="https://developers.cloudflare.com/waf/analytics/security-analytics/"><u>Security Analytics</u></a> and <a href="https://developers.cloudflare.com/waf/analytics/security-events/"><u>Security Events</u></a>. You can find more details in our <a href="https://developers.cloudflare.com/waf/detections/firewall-for-ai/"><u>documentation</u></a>.</p>
    <div>
      <h3>What's next: token counting, guardrails, and beyond</h3>
      <a href="#whats-next-token-counting-guardrails-and-beyond">
        
      </a>
    </div>
    <p>Beyond PII detection and creating security rules, we’re developing additional capabilities to strengthen AI security for our customers. The next feature we’ll release is token counting, which analyzes prompt structure and length. Customers can use the token count field in Rate Limiting and WAF Custom rules to prevent their users from sending very long prompts, which can impact third party model bills, or allow users to abuse the models. This will be followed by using AI to detect and allow content moderation, which will provide more flexibility in building guardrails in the rules.</p><p>If you're an enterprise customer, join the Firewall for AI beta today! Contact your customer team to start monitoring traffic, building protection rules, and taking control of your LLM traffic.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[LLM]]></category>
            <category><![CDATA[Web Asset Discovery]]></category>
            <guid isPermaLink="false">5XoyHPSrtBH8pPvUJkOXMD</guid>
            <dc:creator>Radwa Radwan</dc:creator>
            <dc:creator>Zhiyuan Zheng</dc:creator>
        </item>
        <item>
            <title><![CDATA[Making Workers AI faster and more efficient: Performance optimization with KV cache compression and speculative decoding]]></title>
            <link>https://blog.cloudflare.com/making-workers-ai-faster/</link>
            <pubDate>Thu, 26 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ With a new generation of data center accelerator hardware and using optimization techniques such as KV cache compression and speculative decoding, we’ve made large language model (LLM) ]]></description>
            <content:encoded><![CDATA[ <p>During Birthday Week 2023, <a href="https://blog.cloudflare.com/workers-ai/"><u>we launched Workers AI</u></a>. Since then, we have been listening to your feedback, and one thing we’ve heard consistently is that our customers want Workers AI to be faster. In particular, we hear that large language model (LLM) generation needs to be faster. Users want their interactive chat and agents to go faster, developers want faster help, and users do not want to wait for applications and generated website content to load. Today, we’re announcing three upgrades we’ve made to Workers AI to bring faster and more efficient inference to our customers: upgraded hardware, KV cache compression, and speculative decoding.</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div>
  
</div><p>Thanks to Cloudflare’s <a href="https://blog.cloudflare.com/gen-12-servers/"><u>12th generation compute servers</u></a>, our network now supports a newer generation of GPUs capable of supporting larger models and faster inference. Customers can now use <a href="https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct"><u>Meta Llama 3.2 11B</u></a>, Meta’s <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"><u>newly released</u></a> multi-modal model with vision support, as well as Meta Llama 3.1 70B on Workers AI. Depending on load and time of day, customers can expect to see two to three times the throughput for Llama 3.1 and 3.2 compared to our previous generation Workers AI hardware. More performance information for these models can be found in today’s post: <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster"><i><u>Cloudflare’s Bigger, Better, Faster AI platform</u></i></a>.</p>
    <div>
      <h2>New KV cache compression methods, now open source</h2>
      <a href="#new-kv-cache-compression-methods-now-open-source">
        
      </a>
    </div>
    <p>In our effort to deliver low-cost low-latency inference to the world, Workers AI has been developing novel methods to boost efficiency of LLM inference. Today, we’re excited to announce a technique for KV cache compression that can help increase throughput of an inference platform. And we’ve made it open source too, so that everyone can benefit from our research.</p>
    <div>
      <h3>It’s all about memory</h3>
      <a href="#its-all-about-memory">
        
      </a>
    </div>
    <p>One of the main bottlenecks when running LLM inference is the amount of vRAM (memory) available. Every word that an LLM processes generates a set of vectors that encode the meaning of that word in the context of any earlier words in the input that are used to generate new tokens in the future. These vectors are stored in the <i>KV cache</i>, causing the memory required for inference to scale linearly with the total number of tokens of all sequences being processed. This makes memory a bottleneck for a lot of transformer-based models. Because of this, the amount of memory an instance has available limits the number of sequences it can generate concurrently, as well as the maximum token length of sequences it can generate.</p>
    <div>
      <h3>So what is the KV cache anyway?</h3>
      <a href="#so-what-is-the-kv-cache-anyway">
        
      </a>
    </div>
    <p>LLMs are made up of layers, with an <a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)"><u>attention</u></a> operation occurring in each layer. Within each layer’s attention operation, information is collected from the representations of all previous tokens that are stored in cache. This means that vectors in the KV cache are organized into layers, so that the active layer’s attention operation can only query vectors from the corresponding layer of KV cache. Furthermore, since attention within each layer is parallelized across multiple attention “heads”, the KV cache vectors of a specific layer are further subdivided into groups corresponding to each attention head of that layer.</p><p>The diagram below shows the structure of an LLM’s KV cache for a single sequence being generated. Each cell represents a KV and the model’s representation for a token consists of all KV vectors for that token across all attention heads and layers. As you can see, the KV cache for a single layer is allocated as an M x N matrix of KV vectors where M is the number of attention heads and N is the sequence length. This will be important later!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ZagFp9yy3E55SR8GKRBvh/9e37f5890165e758ccaebf77464be483/BLOG-2571_2.png" />
          </figure><p>For a deeper look at attention, see the original “<a href="https://arxiv.org/abs/1706.03762"><u>Attention is All You Need</u></a>” paper. </p>
    <div>
      <h3>KV-cache compression — “use it or lose it”</h3>
      <a href="#kv-cache-compression-use-it-or-lose-it">
        
      </a>
    </div>
    <p>Now that we know what the KV cache looks like, let’s dive into how we can shrink it!</p><p>The most common approach to compressing the KV cache involves identifying vectors within it that are unlikely to be queried by future attention operations and can therefore be removed without impacting the model’s outputs. This is commonly done by looking at the past attention weights for each pair of key and value vectors (a measure of the degree with which that KV’s representation has been queried during past attention operations) and selecting the KVs that have received the lowest total attention for eviction. This approach is conceptually similar to a LFU (least frequently used) cache management policy: the less a particular vector is queried, the more likely it is to be evicted in the future.</p>
    <div>
      <h3>Different attention heads need different compression rates</h3>
      <a href="#different-attention-heads-need-different-compression-rates">
        
      </a>
    </div>
    <p>As we saw earlier, the KV cache for each sequence in a particular layer is allocated on the GPU as a <i># attention heads X sequence length</i> tensor. This means that the total memory allocation scales with the <i>maximum</i> sequence length for all attention heads of the KV cache. Usually this is not a problem, since each sequence generates the same number of KVs per attention head.</p><p>When we consider the problem of eviction-based KV cache compression, however, this forces us to remove an equal number of KVs from each attention head when doing the compression. If we remove more KVs from one attention head alone, those removed KVs won’t actually contribute to lowering the memory footprint of the KV cache on GPU, but will just add more empty “padding” to the corresponding rows of the tensor. You can see this in the diagram below (note the empty cells in the second row below):</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/68Q5hVbfRF1vhqyeGNzB1Q/91056db8208c5e74be00e0add147b3e9/BLOG-2571_3.png" />
          </figure><p>The extra compression along the second head frees slots for two KVs, but the cache’s shape (and memory footprint) remains the same.</p><p>This forces us to use a fixed compression rate for all attention heads of KV cache, which is very limiting on the compression rates we can achieve before compromising performance.</p>
    <div>
      <h3>Enter PagedAttention</h3>
      <a href="#enter-pagedattention">
        
      </a>
    </div>
    <p>The solution to this problem is to change how our KV cache is represented in physical memory. <a href="https://arxiv.org/abs/2309.06180"><u>PagedAttention</u></a> can represent N x M tensors with padding efficiently by using an N x M block table to index into a series of “blocks”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Sia3ZKKzBaHEfI8qLYr8o/57edb68d61ff916d322502aeb406c88c/BLOG-2571_4.png" />
          </figure><p>This lets us retrieve the i<sup>th</sup> element of a row by taking the i<sup>th</sup> block number from that row in the block table and using the block number to lookup the corresponding block, so we avoid allocating space to padding elements in our physical memory representation. In our case, the elements in physical memory are the KV cache vectors, and the <i>M </i>and <i>N</i> that define the shape of our block table are the number of attention heads and sequence length, respectively. Since the block table is only storing integer indices (rather than high-dimensional KV vectors), its memory footprint is negligible in most cases.</p>
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>Using paged attention lets us apply different rates of compression to different heads in our KV cache, giving our compression strategy more flexibility than other methods. We tested our compression algorithm on <a href="https://arxiv.org/abs/2308.14508"><u>LongBench</u></a> (a collection of long-context LLM benchmarks) with Llama-3.1-8B and found that for most tasks we can retain over 95% task performance while reducing cache size by up to 8x (left figure below). Over 90% task performance can be retained while further compressing up to 64x. That means you have room in memory for 64 times as many tokens!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/pdz5rPhYdfnMmn6cxhczo/29b69bb65aea8989fc1f50283e8ecbc5/BLOG-2571_5.png" />
          </figure><p>This lets us increase the number of requests we can process in parallel, increasing the total throughput (total tokens generated per second) by 3.44x and 5.18x for compression rates of 8x and 64x, respectively (right figure above).</p>
    <div>
      <h3>Try it yourself!</h3>
      <a href="#try-it-yourself">
        
      </a>
    </div>
    <p>If you’re interested in taking a deeper dive check out our <a href="https://github.com/IsaacRe/vllm-kvcompress"><u>vLLM fork</u></a> and get compressing!!</p>
    <div>
      <h2>Speculative decoding for faster throughput</h2>
      <a href="#speculative-decoding-for-faster-throughput">
        
      </a>
    </div>
    <p>A new inference strategy that we implemented is speculative decoding, which is a very popular way to get faster throughput (measured in tokens per second). LLMs work by predicting the next expected token (a token can be a word, word fragment or single character) in the sequence with each call to the model, based on everything that the model has seen before. For the first token generated, this means just the initial prompt, but after that each subsequent token is generated based on the prompt plus all other tokens that have been generated. Typically, this happens one token at a time, generating a single word, or even a single letter, depending on what comes next.</p><p>But what about this prompt:</p><blockquote><p><i>Knock, knock!</i></p></blockquote><p>If you are familiar with knock-knock jokes, you could very accurately predict more than one token ahead. For an English language speaker, what comes next is a very specific sequence that is four to five tokens long: “Who’s there?” or “Who is there?” Human language is full of these types of phrases where the next word has only one, or a few, high probability choices. Idioms, common expressions, and even basic grammar are all examples of this. So for each prediction the model makes, we can take it a step further with speculative decoding to predict the next <i>n</i> tokens. This allows us to speed up inference, as we’re not limited to predicting one token at a time.</p><p>There are several different implementations of speculative decoding, but each in some way uses a smaller, faster-to-run model to generate more than one token at a time. For Workers AI, we have applied <a href="https://github.com/apoorvumang/prompt-lookup-decoding"><u>prompt-lookup decoding</u></a> to some of the LLMs we offer. This simple method matches the last <i>n </i>tokens of generated text against text in the prompt/output and predicts candidate tokens that continue these identified patterns as candidates for continuing the output. In the case of knock-knock jokes, it can predict all the tokens for <i>“Who’s there</i>” at once after seeing “<i>Knock, knock!</i>”, as long as this setup occurs somewhere in the prompt or previous dialogue already. Once these candidate tokens have been predicted, the model can verify them all with a single forward-pass and choose to either accept or reject them. This increases the generation speed of llama-3.1-8b-instruct by up to 40% and the 70B model by up to 70%.</p><p>Speculative decoding has tradeoffs, however. Typically, the results of a model using speculative decoding have a lower quality, both when measured using benchmarks like <a href="https://paperswithcode.com/dataset/mmlu"><u>MMLU</u></a> as well as when compared by humans. More aggressive speculation can speed up sequence generation, but generally comes with a greater impact to the quality of the result. Prompt lookup decoding offers one of the smallest overall quality impacts while still providing performance improvements, and we will be adding it to some language models on Workers AI including <a href="https://developers.cloudflare.com/workers-ai/models/llama-3-8b-instruct"><u>@cf/meta/llama-3.1-8b-instruct</u></a>.</p><p>And, by the way, here is one of our favorite knock-knock jokes, can you guess the punchline?</p><blockquote><p><i>Knock, knock!</i></p><p><i>Who’s there?</i></p><p><i>Figs!</i></p><p><i>Figs who?</i></p><p><i>Figs the doorbell, it’s broken!</i></p></blockquote>
    <div>
      <h2>Keep accelerating</h2>
      <a href="#keep-accelerating">
        
      </a>
    </div>
    <p>As the AI industry continues to evolve, there will be new hardware and software that allows customers to get faster inference responses. Workers AI is committed to researching, implementing, and making upgrades to our services to help you get fast inference. As an Inference-as-a-Service platform, you’ll be able to benefit from all the optimizations we apply, without having to hire your own team of ML researchers and SREs to manage inference software and hardware deployments.

We’re excited for you to try out some of these new releases we have and let us know what you think! Check out our full-suite of AI announcements <a href="https://blog.cloudflare.com/tag/ai/"><u>here</u></a> and check out the <a href="https://developers.cloudflare.com/workers-ai/"><u>developer docs</u></a> to get started.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[LLM]]></category>
            <guid isPermaLink="false">29PAMer5L0do12OtNa557I</guid>
            <dc:creator>Isaac Rehg</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Start auditing and controlling the AI models accessing your content]]></title>
            <link>https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-content-crawlers/</link>
            <pubDate>Mon, 23 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare customers on any plan can now audit and control how AI models access the content on their site.
 ]]></description>
            <content:encoded><![CDATA[ <p>Site owners have lacked the ability to determine how AI services use their content for training or other purposes. Today, Cloudflare is releasing a set of tools to make it easy for site owners, creators, and publishers to take back control over how their content is made available to AI-related bots and crawlers. All Cloudflare customers can now audit and control how AI models access the content on their site.</p><p>This launch starts with a detailed analytics view of the AI services that crawl your site and the specific content they access. Customers can review activity by AI provider, by type of bot, and which sections of their site are most popular. This data is available to every site on Cloudflare and does not require any configuration.</p><p>We expect that this new level of visibility will prompt teams to make a decision about their exposure to AI crawlers. To help give them time to make that decision, Cloudflare now provides <a href="https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/"><u>a one-click option</u></a> in our dashboard to immediately <a href="https://www.cloudflare.com/learning/ai/how-to-block-ai-crawlers/">block</a> any AI crawlers from accessing any site. Teams can then use this “pause” to decide if they want to allow specific AI providers or types of bots to proceed. Once that decision is made, those administrators can use new filters in the Cloudflare dashboard to enforce those policies in just a couple of clicks.</p><p>Some customers have already made decisions to negotiate deals directly with AI companies. Many of those contracts include terms about the frequency of scanning and the type of content that can be accessed. We want those publishers to have the tools to measure the implementation of these deals.  As part of today’s announcement, Cloudflare customers can now generate a report with a single click that can be used to audit the activity allowed in these arrangements.</p><p>We also think that sites of any size should be able to determine how they want to be compensated for the usage of their content by AI models. Today’s announcement previews a new Cloudflare monetization feature which will give site owners the tools to set prices, control access, and capture value for the scanning of their content.</p>
    <div>
      <h3>What is the problem?</h3>
      <a href="#what-is-the-problem">
        
      </a>
    </div>
    <p>Until recently, bots and scrapers on the Internet mostly fell into two clean categories: good and bad. Good bots, like search engine crawlers, helped audiences discover your site and drove traffic to you. Bad bots tried to take down your site, jump the queue ahead of your customers, or scrape competitive data. We built the <a href="https://www.cloudflare.com/application-services/products/bot-management/"><u>Cloudflare Bot Management</u></a> platform to give you the ability to distinguish between those two broad categories and to allow or block them.</p><p>The rise of AI Large Language Models (LLMs) and other generative tools created a murkier third category. Unlike malicious bots, the crawlers associated with these platforms are not actively trying to knock your site offline or to get in the way of your customers. They are not trying to steal sensitive data; they just want to scan what is already public on your site.</p><p>However, unlike helpful bots, these AI-related crawlers do not necessarily drive traffic to your site. AI Data Scraper bots scan the content on your site to train new LLMs. Your material is then put into a kind of blender, mixed up with other content, and used to answer questions from users without attribution or the need for users to visit your site. Another type of crawler, AI Search Crawler bots, scan your content and attempt to cite it when responding to a user’s search. The downside is that those users might just stay inside of that interface, rather than visit your site, because an answer is assembled on the page in front of them.</p><p>This murkiness leaves site owners with a hard decision to make. The value exchange is unclear. And site owners are at a disadvantage while they play catch up. Many sites allowed these AI crawlers to scan their content because these crawlers, for the most part, looked like “good” bots — only for the result to mean less traffic to their site as their content is repackaged in AI-written answers.</p><p>We believe this poses a risk to an open Internet. Without the ability to control scanning and realize value, site owners will be discouraged to launch or maintain Internet properties. Creators will stash more of their content behind paywalls and the largest publishers will strike direct deals. AI model providers will in turn struggle to find and access the long tail of high-quality content on smaller sites.</p><p>Both sides lack the tools to create a healthy, transparent exchange of permissions and value. Starting today, Cloudflare equips site owners with the services they need to begin fixing this. We have broken out a series of steps we recommend all of our customers follow to get started.</p>
    <div>
      <h3>Step 1: Understand how AI models use your site</h3>
      <a href="#step-1-understand-how-ai-models-use-your-site">
        
      </a>
    </div>
    <p>Every site on Cloudflare now has access to a new analytics view that summarizes the crawling behavior of popular and known AI services. You can begin reviewing this information to understand the AI scanning of your content by selecting a site in your dashboard and navigating to the <b>AI Crawl Control </b><i><b>(formerly AI Audit)</b></i> tab in the left-side navigation bar.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FknDZw445xutqps2fSSJt/597fca585cf0e7086ea5db567f258714/BLOG-2509_2.png" />
          </figure><p>When AI model providers access content on your site, they rely on automated tools called “bots” or “crawlers” to scan pages. The bot will request the content of your page, capture the response, and store it as part of a future data training set or remember it for AI search engine results in the future.</p><p>These bots often identify themselves to your site (and Cloudflare’s network) by including an HTTP header in their request called a <code>User Agent</code>. Although, in some cases, a bot from one of these AI services might not send the header and Cloudflare instead relies on other heuristics like IP address or behavior to identify them.</p><p>When the bot does identify itself, the header will contain a string of text with the bot name. For example, <a href="https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler"><u>Anthropic sometimes crawls sites</u></a> on the Internet with a bot called <code>ClaudeBot</code>. When that service requests the content of a page from your site on Cloudflare, Cloudflare logs the <code>User Agent</code> as <code>ClaudeBot</code>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/55IhilRHrLYZo4kvLSPBuI/7a88476fb443e09a28fdbf7e9abd5b8d/BLOG-2509_3.png" />
          </figure><p>Cloudflare takes the logs gathered from visits to your site and looks for user agents that match known AI bots and crawlers. We summarize the activity of individual crawlers and also provide you with filters to review just the activities of specific AI platforms. Many AI firms rely on multiple crawlers that serve distinct purposes. When <a href="https://platform.openai.com/docs/bots"><u>OpenAI scans sites</u></a> for data scraping, they rely on <code>GPTBot</code>, but when they crawl sites for their new AI search engine, they use <code>OAI-SearchBot</code>.</p><p>And those differences matter. Scanning from different bot types can impact traffic to your site or the attribution of your content. AI search engines will often link to sites as part of their response, potentially sending visitors to your destination. In that case, you might be open to those types of bots crawling your Internet property. AI Data Scrapers, on the other hand, just exist to read as much of the Internet as possible to train future models or improve existing ones.</p><p>We think that you deserve to know why a bot is crawling your site in addition to when and how often. Today’s release gives you a filter to review bot activity by categories like AI Data Scraper, AI Search Crawler, and Archiver.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vd66ddf4Lii8LEr8Tt3Nt/ff85253f1d6894d4086a6696e14d250e/BLOG-2509_4.png" />
          </figure><p>With this data, you can begin analyzing how AI models access your site. That information might be overwhelming, especially if your team has not had time yet to decide how you want to handle AI scanning of your content. If you find yourself unsure on how to respond, proceed to Step 2.</p>
    <div>
      <h3>Step 2: Give yourself a pause to decide what to do next</h3>
      <a href="#step-2-give-yourself-a-pause-to-decide-what-to-do-next">
        
      </a>
    </div>
    <p>We talked to several organizations who know their sites are valuable destinations for AI crawlers, but they do not yet know what to do about it. These teams need a “time out” so they can make an informed decision about how they make their data available to these services.</p><p>Cloudflare gives you that easy button right now. Any customer on any plan can choose to block all AI bots and crawlers to give yourself a pause while you decide what you do want to allow.</p><p>To implement that option, navigate to the Bots section under the Security tab of the Cloudflare Dashboard. Follow the blue link in the top right corner to configure how Cloudflare’s proxy handles bot traffic. Next, toggle the button in the “Block AI Scrapers and Crawlers” card to the “On” position.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5RQm7Zofbmkzekclq5OWsa/921660fa9b45018ab2ecf402c104960b/BLOG-2509_5.png" />
          </figure><p>The one-click option blocks known AI-related bots and crawlers from accessing your site based on a list that Cloudflare maintains. With a block in place, you and your team can make a less rushed decision about what to do next with your content.</p>
    <div>
      <h3>Step 3: Control the bots you do want to allow</h3>
      <a href="#step-3-control-the-bots-you-do-want-to-allow">
        
      </a>
    </div>
    <p>The pause button buys time for your team to decide what you want the relationship to be between these crawlers and your content. Once your team has reached a decision, you can begin relying on Cloudflare’s network to implement that policy.</p><p>If that decision is “we are not going to allow any crawling,” then you can leave the block button discussed above toggled to “On”. If you want to allow some selective scanning, today’s release provides you with options to permit certain types of bots, or just bots from certain providers, to access your content.</p><p>For some teams, the decision will be to allow the bots associated with AI search engines to scan their Internet properties because those tools can still drive traffic to the site. Other organizations might sign deals with a specific model provider, and they want to allow any type of bot from that provider to access their content. Customers can now navigate to the WAF section of the Cloudflare dashboard to implement these types of policies.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7rtP61D3cneqk5pqQn1lFO/d297e6a10354b4dab193c67549252cb6/BLOG-2509_6.png" />
          </figure><p>Administrators can also create rules that would, for example, block all AI bots except for those from a specific platform. Teams can deploy these types of filters if they are skeptical of most AI platforms but comfortable with one AI model provider and its policies. These types of rules can also be used to implement contracts where a site owner has negotiated to allow scanning from a single provider. The site administrator would need to create a rule to block all types of AI-related bots and then add an exception that allows the specific bot or bots from their AI partner.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3dc8LlDlamjuWXsRQgx45p/9212bba47cf6df8da8b57b3bac7d38fe/BLOG-2509_7.png" />
          </figure><p>We also recommend that customers consider updating their Terms of Service to cover this new use case in addition to applying these new filters. We have <a href="https://developers.cloudflare.com/bots/reference/sample-terms/"><u>documented the steps</u></a> we suggest that “good citizen” bots and crawlers take with respect to robots.txt files. As an extension of those best practices, we are adding a new section to that documentation where we provide a sample Terms of Service section that site owners can consider using to establish that AI scanning needs to follow the policies you have defined in your robots.txt file.</p>
    <div>
      <h3>Step 4: Audit your existing scanning arrangements</h3>
      <a href="#step-4-audit-your-existing-scanning-arrangements">
        
      </a>
    </div>
    <p>An increasing number of sites are signing agreements directly with model providers to license consumption of their content in exchange for payment. Many of those deals contain provisions that determine the rate of crawling for certain sections or entire sites. Cloudflare’s AI Crawl Control tab provides you with the tools to monitor those kinds of contracts.</p><p>The table at the bottom of the AI Crawl Control tool now lists the most popular content on your site ranked by the count of scans in the time period from the filter set at the top of the page. You can click the <code>Export to CSV</code> button to quickly download a file with the details presented here that you can use to discuss any discrepancies with the AI platform that you are allowing to access your content.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3e0OE5WAARgyjcJKOOmnFT/f427935bd50b215bb603a96e3d529743/BLOG-2509_8.png" />
          </figure><p>Today, the data available to you represents key metrics we have heard from customers in these kinds of arrangements: requests against certain pages and requests against the entire site.</p>
    <div>
      <h3>Step 5: Prepare your site to capture value from AI scanning</h3>
      <a href="#step-5-prepare-your-site-to-capture-value-from-ai-scanning">
        
      </a>
    </div>
    <p>Not everyone has the time or contacts to negotiate deals with AI companies. Up to this point, only the largest publishers on the Internet have the resources to set those kinds of terms and get paid for their content.</p><p>Everyone else has been left with two basic choices on how to handle their data: block all scanning or allow unrestricted access. Today’s releases give content creators more visibility and control than just those two options, but the long tail of sites on the Internet still lack a pathway to monetization.</p><p>We think that sites of any size should be fairly compensated for the use of their content. Cloudflare plans to launch a new component of our dashboard that goes beyond just blocking and analyzing crawls. Site owners will have the ability to set a price for their site, or sections of their site, and to then charge model providers based on their scans and the price you have set. We’ll handle the rest so that you can focus on creating great content for your audience.</p><p>The fastest way to get ready to capture value through this new component is to make sure your sites use Cloudflare’s network. We plan to invite sites to participate in the beta based on the date they first joined Cloudflare. Interested in being notified when this is available? <a href="http://www.cloudflare.com/lp/ai-value-tool-waitlist"><u>Let us know here</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/581ZPDNlMumBmFzrVgnG9D/d2b9d5a2b96d572239d00a39da79c77a/BLOG-2509_9.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[AI Bots]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[LLM]]></category>
            <guid isPermaLink="false">47pmgthPjmg2ZeYqTNmU8f</guid>
            <dc:creator>Sam Rhea</dc:creator>
        </item>
        <item>
            <title><![CDATA[Mitigating a token-length side-channel attack in our AI products]]></title>
            <link>https://blog.cloudflare.com/ai-side-channel-attack-mitigated/</link>
            <pubDate>Thu, 14 Mar 2024 12:30:30 GMT</pubDate>
            <description><![CDATA[ The Workers AI and AI Gateway team recently collaborated closely with security researchers at Ben Gurion University regarding a report submitted through our Public Bug Bounty program. Through this process, we discovered and fully patched a vulnerability affecting all LLM providers. Here’s the story ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5do9zHtgVCZfCILMjoXAmV/0f7e2e3b4bdb298d7fd8c0a97d3b2a19/Mitigating-a-Token-Length-Side-Channel-attack-in-our-AI-products.png" />
            
            </figure><p>Since the discovery of <a href="https://en.wikipedia.org/wiki/CRIME">CRIME</a>, <a href="https://breachattack.com/">BREACH</a>, <a href="https://media.blackhat.com/eu-13/briefings/Beery/bh-eu-13-a-perfect-crime-beery-wp.pdf">TIME</a>, <a href="https://en.wikipedia.org/wiki/Lucky_Thirteen_attack">LUCKY-13</a> etc., length-based side-channel attacks have been considered practical. Even though packets were encrypted, attackers were able to infer information about the underlying plaintext by analyzing metadata like the packet length or timing information.</p><p>Cloudflare was recently contacted by a group of researchers at <a href="https://cris.bgu.ac.il/en/">Ben Gurion University</a> who wrote a paper titled “<a href="https://cdn.arstechnica.net/wp-content/uploads/2024/03/LLM-Side-Channel.pdf">What Was Your Prompt? A Remote Keylogging Attack on AI Assistants</a>” that describes “a novel side-channel that can be used to read encrypted responses from AI Assistants over the web”.</p><p>The Workers AI and AI Gateway team collaborated closely with these security researchers through our <a href="/cloudflare-bug-bounty-program/">Public Bug Bounty program</a>, discovering and fully patching a vulnerability that affects LLM providers. You can read the detailed research paper <a href="https://cdn.arstechnica.net/wp-content/uploads/2024/03/LLM-Side-Channel.pdf">here</a>.</p><p>Since being notified about this vulnerability, we've implemented a mitigation to help secure all Workers AI and AI Gateway customers. As far as we could assess, there was no outstanding risk to Workers AI and AI Gateway customers.</p>
    <div>
      <h3>How does the side-channel attack work?</h3>
      <a href="#how-does-the-side-channel-attack-work">
        
      </a>
    </div>
    <p>In the paper, the authors describe a method in which they intercept the stream of a chat session with an LLM provider, use the network packet headers to infer the length of each token, extract and segment their sequence, and then use their own dedicated LLMs to infer the response.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EeuXpPSqqvqIZKZUFPKEY/951a777d273caf172933639d9f5d6f12/pasted-image-0--2--3.png" />
            
            </figure><p>The two main requirements for a successful attack are an AI chat client running in <b>streaming</b> mode and a malicious actor capable of capturing network traffic between the client and the AI chat service. In streaming mode, the LLM tokens are emitted sequentially, introducing a token-length side-channel. Malicious actors could eavesdrop on packets via public networks or within an ISP.</p><p>An example request vulnerable to the side-channel attack looks like this:</p>
            <pre><code>curl -X POST \
https://api.cloudflare.com/client/v4/accounts/&lt;account-id&gt;/ai/run/@cf/meta/llama-2-7b-chat-int8 \
  -H "Authorization: Bearer &lt;Token&gt;" \
  -d '{"stream":true,"prompt":"tell me something about portugal"}'</code></pre>
            <p>Let’s use <a href="https://www.wireshark.org/">Wireshark</a> to inspect the network packets on the LLM chat session while streaming:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6sII07hkJGaVXBKlWoBoEW/a1c3be395e0bee3ec5ed690947737d51/media.png" />
            
            </figure><p>The first packet has a length of 95 and corresponds to the token "Port" which has a length of four. The second packet has a length of 93 and corresponds to the token "ug" which has a length of two, and so on. By removing the likely token envelope from the network packet length, it is easy to infer how many tokens were transmitted and their sequence and individual length just by sniffing encrypted network data.</p><p>Since the attacker needs the sequence of individual token length, this vulnerability only affects text generation models using streaming. This means that AI inference providers that use streaming — the most common way of interacting with LLMs — like Workers AI, are potentially vulnerable.</p><p>This method requires that the attacker is on the same network or in a position to observe the communication traffic and its accuracy depends on knowing the target LLM’s writing style. In ideal conditions, the researchers claim that their system “can reconstruct 29% of an AI assistant’s responses and successfully infer the topic from 55% of them”. It’s also important to note that unlike other side-channel attacks, in this case the attacker has no way of evaluating its prediction against the ground truth. That means that we are as likely to get a sentence with near perfect accuracy as we are to get one where only things that match are conjunctions.</p>
    <div>
      <h3>Mitigating LLM side-channel attacks</h3>
      <a href="#mitigating-llm-side-channel-attacks">
        
      </a>
    </div>
    <p>Since this type of attack relies on the length of tokens being inferred from the packet, it can be just as easily mitigated by obscuring token size. The researchers suggested a few strategies to mitigate these side-channel attacks, one of which is the simplest: padding the token responses with random length noise to obscure the length of the token so that responses can not be inferred from the packets. While we immediately added the mitigation to our own inference product — Workers AI, we wanted to help customers secure their LLMs regardless of where they are running them by adding it to our AI Gateway.</p><p>As of today, all users of Workers AI and AI Gateway are now automatically protected from this side-channel attack.</p>
    <div>
      <h3>What we did</h3>
      <a href="#what-we-did">
        
      </a>
    </div>
    <p>Once we got word of this research work and how exploiting the technique could potentially impact our AI products, we did what we always do in situations like this: we assembled a team of systems engineers, security engineers, and product managers and started discussing risk mitigation strategies and next steps. We also had a call with the researchers, who kindly attended, presented their conclusions, and answered questions from our teams.</p><p>The research team provided a testing notebook that we could use to validate the attack's results. While we were able to reproduce the results for the notebook's examples, we found that the accuracy varied immensely with our tests using different prompt responses and different LLMs. Nonetheless, the paper has merit, and the risks are not negligible.</p><p>We decided to incorporate the first mitigation suggestion in the paper: including random padding to each message to hide the actual length of tokens in the stream, thereby complicating attempts to infer information based solely on network packet size.</p>
    <div>
      <h3>Workers AI, our inference product, is now protected</h3>
      <a href="#workers-ai-our-inference-product-is-now-protected">
        
      </a>
    </div>
    <p>With our inference-as-a-service product, anyone can use the <a href="https://developers.cloudflare.com/workers-ai/">Workers AI</a> platform and make API calls to our supported AI models. This means that we oversee the inference requests being made to and from the models. As such, we have a responsibility to ensure that the service is secure and protected from potential vulnerabilities. We immediately rolled out a fix once we were notified of the research, and all Workers AI customers are now automatically protected from this side-channel attack. We have not seen any malicious attacks exploiting this vulnerability, other than the ethical testing from the researchers.</p><p>Our solution for Workers AI is a variation of the mitigation strategy suggested in the research document. Since we stream JSON objects rather than the raw tokens, instead of padding the tokens with whitespace characters, we added a new property, "p" (for padding) that has a string value of variable random length.</p><p>Example streaming response using the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events">SSE</a> syntax:</p>
            <pre><code>data: {"response":"portugal","p":"abcdefghijklmnopqrstuvwxyz0123456789a"}
data: {"response":" is","p":"abcdefghij"}
data: {"response":" a","p":"abcdefghijklmnopqrstuvwxyz012"}
data: {"response":" southern","p":"ab"}
data: {"response":" European","p":"abcdefgh"}
data: {"response":" country","p":"abcdefghijklmno"}
data: {"response":" located","p":"abcdefghijklmnopqrstuvwxyz012345678"}</code></pre>
            <p>This has the advantage that no modifications are required in the SDK or the client code, the changes are invisible to the end-users, and no action is required from our customers. By adding random variable length to the JSON objects, we introduce the same network-level variability, and the attacker essentially loses the required input signal. Customers can continue using Workers AI as usual while benefiting from this protection.</p>
    <div>
      <h3>One step further: AI Gateway protects users of any inference provider</h3>
      <a href="#one-step-further-ai-gateway-protects-users-of-any-inference-provider">
        
      </a>
    </div>
    <p>We added protection to our AI inference product, but we also have a product that proxies requests to any provider — <a href="https://developers.cloudflare.com/ai-gateway/">AI Gateway</a>. AI Gateway acts as a proxy between a user and supported inference providers, helping developers gain control, performance, and <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> over their AI applications. In line with our mission to help build a better Internet, we wanted to quickly roll out a fix that can help all our customers using text generation AIs, regardless of which provider they use or if they have mitigations to prevent this attack. To do this, we implemented a similar solution that pads all streaming responses proxied through AI Gateway with random noise of variable length.</p><p>Our AI Gateway customers are now automatically protected against this side-channel attack, even if the upstream inference providers have not yet mitigated the vulnerability. If you are unsure if your inference provider has patched this vulnerability yet, use AI Gateway to proxy your requests and ensure that you are protected.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>At Cloudflare, our mission is to help build a better Internet – that means that we care about all citizens of the Internet, regardless of what their tech stack looks like. We are proud to be able to improve the security of our AI products in a way that is transparent and requires no action from our customers.</p><p>We are grateful to the researchers who discovered this vulnerability and have been very collaborative in helping us understand the problem space. If you are a security researcher who is interested in helping us make our products more secure, check out our Bug Bounty program at <a href="http://hackerone.com/cloudflare">hackerone.com/cloudflare</a>.</p> ]]></content:encoded>
            <category><![CDATA[Bug Bounty]]></category>
            <category><![CDATA[LLM]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[AI Gateway]]></category>
            <category><![CDATA[SASE]]></category>
            <guid isPermaLink="false">1R32EruY6C8Pu6LrFCGXwy</guid>
            <dc:creator>Celso Martinho</dc:creator>
            <dc:creator>Michelle Chen</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare announces Firewall for AI]]></title>
            <link>https://blog.cloudflare.com/firewall-for-ai/</link>
            <pubDate>Mon, 04 Mar 2024 14:02:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is one of the first providers to safeguard LLM models and users in the era of AI ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5EdkABBZiYVEgdEQ42QaUf/c73d263a1fbae676983868a314e2acf5/WAF-for-AI.png" />
            
            </figure><p>Today, Cloudflare is announcing the development of Firewall for AI, a protection layer that can be deployed in front of <a href="https://www.cloudflare.com/en-gb/learning/ai/what-is-large-language-model/">Large Language Models (LLMs)</a> to identify abuses before they reach the models.</p><p>While AI models, and specifically LLMs, are surging, customers tell us that they are concerned about the <a href="https://blog.cloudflare.com/best-practices-sase-for-ai/">best strategies to secure their own LLMs</a>. Using LLMs as part of Internet-connected applications introduces new vulnerabilities that can be exploited by bad actors.</p><p>Some of the vulnerabilities affecting traditional web and API applications apply to the LLM world as well, including injections or <a href="https://www.cloudflare.com/learning/security/what-is-data-exfiltration/">data exfiltration</a>. However, there is a new set of threats that are now relevant because of the way LLMs work. For example, researchers have <a href="https://thehackernews.com/2024/02/new-hugging-face-vulnerability-exposes.html">recently discovered</a> a vulnerability in an AI collaboration platform that allows them to hijack models and perform unauthorized actions.</p><p>Firewall for AI is an advanced <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">Web Application Firewall (WAF)</a> specifically tailored for applications using LLMs. It will comprise a set of tools that can be deployed in front of applications to detect vulnerabilities and provide visibility to model owners. The tool kit will include products that are already part of WAF, such as Rate Limiting and Sensitive Data Detection, and a new protection layer which is currently under development. This new validation analyzes the prompt submitted by the end user to identify attempts to exploit the model to extract data and other abuse attempts. Leveraging the size of Cloudflare network, Firewall for AI runs as close to the user as possible, allowing us to identify attacks early and protect both end user and models from abuses and attacks.</p><p>Before we dig into how Firewall for AI works and its full feature set, let’s first examine what makes LLMs unique, and the <a href="https://www.cloudflare.com/learning/security/what-is-an-attack-surface/">attack surfaces</a> they introduce. We’ll use the <a href="https://www.cloudflare.com/learning/ai/owasp-top-10-risks-for-llms/">OWASP Top 10 for LLMs</a> as a reference.</p>
    <div>
      <h2>Why are LLMs different from traditional applications?</h2>
      <a href="#why-are-llms-different-from-traditional-applications">
        
      </a>
    </div>
    <p>When considering LLMs as Internet-connected applications, there are two main differences compared with more traditional web apps.</p><p>First, the way users interact with the product. Traditional apps are deterministic in nature. Think about a bank application — it’s defined by a set of operations (check my balance, make a transfer, etc.). The security of the business operation (and data) can be obtained by controlling the fine set of operations accepted by these endpoints: “GET /balance” or “POST /transfer”.</p><p>LLM operations are non-deterministic by design. To start with, LLM interactions are based on natural language, which makes identifying problematic requests harder than matching attack signatures. Additionally, unless a response is cached, LLMs typically provide a different response every time — even if the same input prompt is repeated. This makes limiting the way a user interacts with the application much more difficult. This poses a threat to the user as well, in terms of being exposed to misinformation that weakens the trust in the model.</p><p>Second, a big difference is how the application control plane interacts with the data. In traditional applications, the control plane (code) is well separated from the data plane (database). The defined operations are the only way to interact with the underlying data (e.g. show me the history of my payment transactions). This allows security practitioners to focus on adding checks and guardrails to the control plane and thus protecting the database indirectly.</p><p>LLMs are different in that the training data becomes part of the model itself through the training process, making it extremely difficult to control how that data is shared as a result of a user prompt. Some architectural solutions are being explored, such as separating LLMs into different levels and segregating data. However, no silver bullet has yet been found.</p><p>From a security perspective, these differences allow attackers to craft new attack vectors that can target LLMs and fly under the radar of existing security tools designed for traditional web applications.</p>
    <div>
      <h3>OWASP LLM Vulnerabilities</h3>
      <a href="#owasp-llm-vulnerabilities">
        
      </a>
    </div>
    <p>The <a href="https://owasp.org/">OWASP</a> foundation <a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/">released a list of</a> the top 10 classes of vulnerabilities for LLMs, providing a useful framework for thinking about <a href="https://www.cloudflare.com/learning/ai/what-is-ai-security/">how to secure language models</a>. Some of the threats are reminiscent of the <a href="https://owasp.org/www-project-top-ten/">OWASP top 10 for web applications</a>, while others are specific to language models.</p><p>Similar to web applications, some of these vulnerabilities can be best addressed when the LLM application is designed, developed, and trained. For example, <a href="https://www.cloudflare.com/learning/ai/data-poisoning/"><i>Training Data Poisoning</i></a> can be carried out by introducing vulnerabilities in the training data set used to train new models. Poisoned information is then presented to the user when the model is live. <i>Supply Chain Vulnerabilities</i> and <i>Insecure Plugin Design</i> are vulnerabilities introduced in components added to the model, like third-party software packages. Finally, managing authorization and permissions is crucial when dealing with <i>Excessive Agency</i>, where unconstrained models can perform unauthorized actions within the broader application or infrastructure.</p><p>Conversely, <i>Prompt Injection</i>, <i>Model Denial of Service</i>, and <i>Sensitive Information Disclosure</i> can be mitigated by adopting a proxy security solution like Cloudflare Firewall for AI. In the following sections, we will give more details about these vulnerabilities and discuss how Cloudflare is optimally positioned to mitigate them.</p>
    <div>
      <h3>LLM deployments</h3>
      <a href="#llm-deployments">
        
      </a>
    </div>
    <p>Language model risks also depend on the deployment model. Currently, we see three main deployment approaches: internal, public, and product LLMs. In all three scenarios, you need to protect models from abuses, protect any proprietary data stored in the model, and protect the end user from misinformation or from exposure to inappropriate content.</p><ul><li><p><b>Internal LLMs:</b> Companies develop LLMs to support the workforce in their daily tasks. These are considered corporate assets and shouldn’t be accessed by non-employees. Examples include an AI co-pilot trained on sales data and customer interactions used to generate tailored proposals, or an LLM trained on an internal knowledge base that can be queried by engineers.</p></li><li><p><b>Public LLMs:</b> These are LLMs that can be accessed outside the boundaries of a corporation. Often these solutions have free versions that anyone can use and they are often trained on general or public knowledge. Examples include <a href="https://openai.com/gpt-4">GPT</a> from OpenAI or <a href="https://www.anthropic.com/product">Claude</a> from Anthropic.</p></li><li><p><b>Product LLM:</b> From a corporate perspective, LLMs can be part of a product or service offered to their customers. These are usually self-hosted, tailored solutions that can be made available as a tool to interact with the company resources. Examples include customer support chatbots or <a href="/security-analytics-ai-assistant/">Cloudflare AI Assistant</a>.</p></li></ul><p>From a risk perspective, the difference between Product and Public LLMs is about who carries the impact of successful attacks. Public LLMs are considered a threat to data because data that ends up in the model can be accessed by virtually anyone. This is one of the reasons many corporations advise their employees not to use confidential information in prompts for publicly available services. Product LLMs can be considered a threat to companies and their intellectual property if models had access to proprietary information during training (by design or by accident).</p>
    <div>
      <h2>Firewall for AI</h2>
      <a href="#firewall-for-ai">
        
      </a>
    </div>
    <p>Cloudflare Firewall for AI will be deployed like a traditional WAF, where every API request with an LLM prompt is scanned for patterns and signatures of possible attacks.</p><p>Firewall for AI can be deployed in front of models hosted on the Cloudflare <a href="/workers-ai">Workers AI</a> platform or models hosted on any other third party infrastructure. It can also be used alongside Cloudflare <a href="https://developers.cloudflare.com/ai-gateway/">AI Gateway</a>, and customers will be able to control and set up Firewall for AI using the WAF control plane.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3kwApAqMHjSChjXkdc3H89/09efa7f0ed81746bf77c62376457d1c8/image1-1.png" />
            
            </figure><p><i>Firewall for AI works like a traditional web application firewall. It is deployed in front of an LLM application and scans every request to identify attack signatures</i></p>
    <div>
      <h3>Prevent volumetric attacks</h3>
      <a href="#prevent-volumetric-attacks">
        
      </a>
    </div>
    <p>One of the threats listed by OWASP is Model Denial of Service. Similar to traditional applications, a <a href="https://www.cloudflare.com/learning/ddos/glossary/denial-of-service/">DoS attack</a> is carried out by consuming an exceptionally high amount of resources, resulting in reduced service quality or potentially increasing the costs of running the model. Given the amount of resources LLMs require to run, and the unpredictability of user input, this type of attack can be detrimental.</p><p>This risk can be mitigated by adopting rate limiting policies that control the rate of requests from individual sessions, therefore limiting the context window. By proxying your model through Cloudflare today, you get <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> out of the box. You can also use Rate Limiting and <a href="/advanced-rate-limiting/">Advanced Rate Limiting</a> to manage the rate of requests allowed to reach your model by setting a maximum rate of request performed by an individual IP address or API key during a session.</p>
    <div>
      <h3>Identify sensitive information with Sensitive Data Detection</h3>
      <a href="#identify-sensitive-information-with-sensitive-data-detection">
        
      </a>
    </div>
    <p>There are two use cases for sensitive data, depending on whether you own the model and data, or you want to prevent users from sending data into public LLMs.</p><p>As defined by <a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/">OWASP</a>, <i>Sensitive Information Disclosure</i> happens when LLMs inadvertently reveal confidential data in the responses, leading to unauthorized data access, privacy violations, and security breaches. One way to prevent this is to add strict prompt validations. Another approach is to identify when personally identifiable information (PII) leaves the model. This is relevant, for example, when a model was trained with a company knowledge base that may include sensitive information, such asPII (like social security number), proprietary code, or algorithms.</p><p>Customers using LLM models behind Cloudflare WAF can employ the Sensitive Data Detection (SDD) WAF managed ruleset to identify certain PII being returned by the model in the response. Customers can review the SDD matches on WAF Security Events. Today, SDD is offered as a set of managed rules designed to scan for financial information (such as credit card numbers) as well as secrets (API keys). As part of the roadmap, we plan to allow customers to create their own custom fingerprints.</p><p>The other use case is intended to prevent users from sharing PII or other sensitive information with external LLM providers, such as OpenAI or Anthropic. To protect from this scenario, we plan to expand SDD to scan the request prompt and integrate its output with AI Gateway where, alongside the prompt's history, we detect if certain sensitive data has been included in the request. We will start by using the existing SDD rules, and we plan to allow customers to write their own custom signatures. Relatedly, obfuscation is another feature we hear a lot of customers talk about. Once available, the expanded SDD will allow customers to obfuscate certain sensitive data in a prompt before it reaches the model. SDD on the request phase is being developed.</p>
    <div>
      <h2>Preventing model abuses</h2>
      <a href="#preventing-model-abuses">
        
      </a>
    </div>
    <p>Model abuse is a broader category of abuse. It includes approaches like “prompt injection” or submitting requests that generate hallucinations or lead to responses that are inaccurate, offensive, inappropriate, or simply off-topic.</p><p>Prompt Injection is an attempt to manipulate a language model through specially crafted inputs, causing unintended responses by the LLM. The results of an injection can vary, from extracting sensitive information to influencing decision-making by mimicking normal interactions with the model. A classic example of prompt injection is manipulating a CV to affect the output of <a href="https://kai-greshake.de/posts/inject-my-pdf/">resume screening tools</a>.</p><p>A common use case we hear from customers of our AI Gateway is that they want to avoid their application generating toxic, offensive, or problematic language. The risks of not controlling the outcome of the model include reputational damage and harming the end user by providing an unreliable response.</p><p>These types of abuse can be managed by adding an additional layer of protection that sits in front of the model. This layer can be trained to block injection attempts or block prompts that fall into categories that are inappropriate.</p>
    <div>
      <h3>Prompt and response validation</h3>
      <a href="#prompt-and-response-validation">
        
      </a>
    </div>
    <p>Firewall for AI will run a series of detections designed to identify prompt injection attempts and other abuses, such as making sure the topic stays within the boundaries defined by the model owner. Like other existing WAF features, Firewall for AI will automatically look for prompts embedded in HTTP requests or allow customers to create rules based on where in the JSON body of the request the prompt can be found.</p><p>Once enabled, the Firewall will analyze every prompt and provide a score based on the likelihood that it’s malicious. It will also tag the prompt based on predefined categories. The score ranges from 1 to 99 which indicates the likelihood of a prompt injection, with 1 being the most likely.</p><p>Customers will be able to create WAF rules to block or handle requests with a particular score in one or both of these dimensions. You’ll be able to combine this score with other existing signals (like bot score or attack score) to determine whether the request should reach the model or should be blocked. For example, it could be combined with a bot score to identify if the request was malicious and generated by an automated source.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4dxfO29U9BurRgBykPOao0/5aa1619fa5ea3414b7954c78771d1360/Slice-1.png" />
            
            </figure><p><i>Detecting prompt injections and prompt abuse is part of the scope of Firewall for AI. Early iteration of the product design</i></p><p>Besides the score, we will assign tags to each prompt that can be used when creating rules to prevent prompts belonging to any of these categories from reaching their model. For example, customers will be able to create rules to block specific topics. This includes prompts using words categorized as offensive, or linked to religion, sexual content, or politics, for example.</p>
    <div>
      <h2>How can I use Firewall for AI? Who gets this?</h2>
      <a href="#how-can-i-use-firewall-for-ai-who-gets-this">
        
      </a>
    </div>
    <p>Enterprise customers on the Application Security Advanced offering can immediately start using Advanced Rate Limiting and Sensitive Data Detection (on the response phase). Both products can be found in the WAF section of the Cloudflare dashboard. Firewall for AI’s prompt validation feature is currently under development and a beta version will be released in the coming months to all Workers AI users. Sign up to <a href="https://cloudflare.com/lp/firewall-for-ai/">join the waiting list</a> and get notified when the feature becomes available.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare is one of the first security providers launching a set of <a href="https://www.cloudflare.com/ai-security/">tools to secure AI applications</a>. Using Firewall for AI, customers can control what prompts and requests reach their language models, reducing the risk of abuses and data exfiltration. Stay tuned to learn more about how AI application security is evolving.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[LLM]]></category>
            <category><![CDATA[Application Services]]></category>
            <guid isPermaLink="false">6mqhKmVt1dGOhO5xNsli3k</guid>
            <dc:creator>Daniele Molteni</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare R2 and MosaicML enable training LLMs on any compute, anywhere in the world, with zero switching costs]]></title>
            <link>https://blog.cloudflare.com/cloudflare-r2-mosaicml-train-llms-anywhere-faster-cheaper/</link>
            <pubDate>Tue, 16 May 2023 13:00:54 GMT</pubDate>
            <description><![CDATA[ Together, Cloudflare and MosaicML give customers the freedom to train LLMs on any compute, anywhere in the world, with zero switching costs. That means faster, cheaper training runs, and no vendor lock in. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BYHRApu97ZOlXvHf0cIMs/4f30b2009add920757a79e9f6c6ca1b5/111.png" />
            
            </figure><p>Building the large language models (LLMs) and diffusion models that power <a href="https://www.cloudflare.com/learning/ai/what-is-generative-ai/">generative AI</a> requires massive infrastructure. The most obvious component is compute – hundreds to thousands of GPUs – but an equally critical (and often overlooked) component is the <b>data storage infrastructure.</b> Training datasets can be terabytes to petabytes in size, and this data needs to be read in parallel by thousands of processes. In addition, model checkpoints need to be saved frequently throughout a training run, and for LLMs these checkpoints can each be hundreds of gigabytes!</p><p>To <a href="https://r2-calculator.cloudflare.com/">manage storage costs</a> and scalability, many machine learning teams have been moving to <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a> to host their datasets and checkpoints. Unfortunately, most <a href="https://www.cloudflare.com/developer-platform/products/r2/">object store providers</a> use egress fees to “lock in” users to their platform. This makes it very difficult to leverage GPU capacity across multiple cloud providers, or take advantage of lower / dynamic pricing elsewhere, since the data and model checkpoints are too expensive to move. At a time when cloud GPUs are scarce, and new hardware options are entering the market, it’s more important than ever to stay flexible.</p><p>In addition to high egress fees, there is a technical barrier to object-store-centric machine learning training. Reading and writing data between object storage and compute clusters requires high throughput, efficient use of network bandwidth, determinism, and elasticity (the ability to train on different #s of GPUs). Building training software to handle all of this correctly and reliably is hard!</p><p>Today, we’re excited to show how MosaicML’s tools and Cloudflare R2 can be used together to address these challenges. First, with MosaicML’s open source <a href="https://github.com/mosaicml/streaming">StreamingDataset</a> and <a href="https://github.com/mosaicml/composer">Composer</a> libraries, you can easily stream in training data and read/write model checkpoints back to R2. All you need is an Internet connection. Second, thanks to R2’s zero-egress pricing, you can start/stop/move/resize jobs in response to GPU availability and prices across compute providers, without paying any data transfer fees. The MosaicML training platform makes it dead simple to orchestrate such training jobs across multiple clouds.</p><p>Together, Cloudflare and MosaicML give you the freedom to train LLMs on <i>any</i> compute, <i>anywhere</i> in the world, with zero switching costs. That means faster, cheaper training runs, and no vendor lock in :)</p><blockquote><p><i>“With the MosaicML training platform, customers can efficiently use R2 as the durable storage backend for training LLMs on any compute provider with zero egress fees. AI companies are facing outrageous cloud costs, and they are on the hunt for the tools that can provide them with the speed and flexibility to train their best model at the best price.”</i> – <b>Naveen Rao, CEO and co-founder, MosaicML</b></p></blockquote>
    <div>
      <h3>Reading data from R2 using StreamingDataset</h3>
      <a href="#reading-data-from-r2-using-streamingdataset">
        
      </a>
    </div>
    <p>To read data from R2 efficiently and deterministically, you can use the MosaicML <a href="https://github.com/mosaicml/streaming">StreamingDataset</a> library. First, write your training data (images, text, video, anything!) into <code>.mds</code> shard files using the provided Python API:</p>
            <pre><code>import numpy as np
from PIL import Image
from streaming import MDSWriter

# Local or remote directory in which to store the compressed output files
data_dir = 'path-to-dataset'

# A dictionary mapping input fields to their data types
columns = {
    'image': 'jpeg',
    'class': 'int'
}

# Shard compression, if any
compression = 'zstd'

# Save the samples as shards using MDSWriter
with MDSWriter(out=data_dir, columns=columns, compression=compression) as out:
    for i in range(10000):
        sample = {
            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
            'class': np.random.randint(10),
        }
        out.write(sample)</code></pre>
            <p>After your dataset has been converted, you can upload it to R2. Below we demonstrate this with the <code>awscli</code> command line tool, but you can also use `wrangler `or any <a href="https://www.cloudflare.com/developer-platform/solutions/s3-compatible-object-storage/">S3-compatible tool</a> of your choice. StreamingDataset will also support direct cloud writing to R2 soon!</p>
            <pre><code>$ aws s3 cp --recursive path-to-dataset s3://my-bucket/folder --endpoint-url $S3_ENDPOINT_URL</code></pre>
            <p>Finally, you can read the data into any device that has read access to your R2 bucket. You can fetch individual samples, loop over the dataset, and feed it into a standard PyTorch dataloader.</p>
            <pre><code>from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Make sure that R2 credentials and $S3_ENDPOINT_URL are set in your environment    
# e.g. export S3_ENDPOINT_URL="https://[uid].r2.cloudflarestorage.com"

# Remote path where full dataset is persistently stored
remote = 's3://my-bucket/folder'

# Local working dir where dataset is cached during operation
local = '/tmp/path-to-dataset'

# Create streaming dataset
dataset = StreamingDataset(local=local, remote=remote, shuffle=True)

# Let's see what is in sample #1337...
sample = dataset[1337]
img = sample['image']
cls = sample['class']

# Create PyTorch DataLoader
dataloader = DataLoader(dataset)</code></pre>
            <p>StreamingDataset comes out of the box with high performance, elastic determinism, fast resumption, and multi-worker support. It also uses smart shuffling and distribution to ensure that download bandwidth is minimized. Across a variety of workloads such as LLMs and diffusion models, we find that there is no impact on training throughput (no dataloader bottleneck) when training from object stores like R2. For more information, check out the StreamingDataset <a href="https://www.mosaicml.com/blog/mosaicml-streamingdataset">announcement blog</a>!</p>
    <div>
      <h3>Reading/writing model checkpoints to R2 using Composer</h3>
      <a href="#reading-writing-model-checkpoints-to-r2-using-composer">
        
      </a>
    </div>
    <p>Streaming data into your training loop solves half of the problem, but how do you load/save your model checkpoints? Luckily, if you use a training library like <a href="https://github.com/mosaicml/composer">Composer</a>, it’s as easy as pointing at an R2 path!</p>
            <pre><code>from composer import Trainer
...

# Make sure that R2 credentials and $S3_ENDPOINT_URL are set in your environment
# e.g. export S3_ENDPOINT_URL="https://[uid].r2.cloudflarestorage.com"

trainer = Trainer(
        run_name='mpt-7b',
        model=model,
        train_dataloader=train_loader,
        ...
        save_folder=s3://my-bucket/mpt-7b/checkpoints,
        save_interval='1000ba',
        # load_path=s3://my-bucket/mpt-7b-prev/checkpoints/ep0-ba100-rank0.pt,
    )</code></pre>
            <p>Composer uses asynchronous uploads to minimize wait time as checkpoints are being saved during training. It also works out of the box with multi-GPU and multi-node training, and <b>does not require a shared file system.</b> This means you can skip setting up an expensive EFS/NFS system for your compute cluster, saving thousands of dollars or more per month on public clouds. All you need is an Internet connection and appropriate credentials – your checkpoints arrive safely in your R2 bucket giving you scalable and secure storage for your private models.</p>
    <div>
      <h3>Using MosaicML and R2 to train anywhere efficiently</h3>
      <a href="#using-mosaicml-and-r2-to-train-anywhere-efficiently">
        
      </a>
    </div>
    <p>Using the above tools together with Cloudflare R2 enables users to run training workloads on any compute provider, with total freedom and zero switching costs.</p><p>As a demonstration, in Figure X we use the MosaicML training platform to launch an LLM training job starting on Oracle Cloud Infrastructure, with data streaming in and checkpoints uploaded back to R2. Part way through, we pause the job and seamlessly resume on a different set of GPUs on Amazon Web Services. Composer loads the model weights from the last saved checkpoint in R2, and the streaming dataloader instantly resumes to the correct batch. Training continues deterministically. Finally, we move again to Google Cloud to finish the run.</p><p>As we train our LLM across three cloud providers, the only costs we pay are for GPU compute and data storage. No egress fees or lock in thanks to Cloudflare R2!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6d7rJLiQ1wI12thUIr792s/2eacae82cc201106faed1af1b079b4d2/image2-19.png" />
            
            </figure><p><i>Using the MosaicML training platform with Cloudflare R2 to run an LLM training job across three different cloud providers, with zero egress fees.</i></p>
            <pre><code>$ mcli get clusters
NAME            PROVIDER      GPU_TYPE   GPUS             INSTANCE                   NODES
mml-1            MosaicML   │  a100_80gb  8             │  mosaic.a100-80sxm.1        1    
                            │  none       0             │  cpu                        1    
gcp-1            GCP        │  t4         -             │  n1-standard-48-t4-4        -    
                            │  a100_40gb  -             │  a2-highgpu-8g              -    
                            │  none       0             │  cpu                        1    
aws-1            AWS        │  a100_40gb  ≤8,16,...,32  │  p4d.24xlarge               ≤4   
                            │  none       0             │  cpu                        1    
oci-1            OCI        │  a100_40gb  8,16,...,64   │  oci.bm.gpu.b4.8            ≤8  
                            │  none       0             │  cpu                        1    

$ mcli create secret s3 --name r2-creds --config-file path/to/config --credentials-file path/to/credentials
✔  Created s3 secret: r2-creds      

$ mcli create secret env S3_ENDPOINT_URL="https://[uid].r2.cloudflarestorage.com"
✔  Created environment secret: s3-endpoint-url      
               
$ mcli run -f mpt-125m-r2.yaml --follow
✔  Run mpt-125m-r2-X2k3Uq started                                                                                    
i  Following run logs. Press Ctrl+C to quit.                                                                            
                                                                                                                        
Cloning into 'llm-foundry'...</code></pre>
            <p><i>Using the MCLI command line tool to manage compute clusters, secrets, and submit runs.</i></p>
            <pre><code>### mpt-125m-r2.yaml ###
# set up secrets once with `mcli create secret ...`
# and they will be present in the environment in any subsequent run

integrations:
- integration_type: git_repo
  git_repo: mosaicml/llm-foundry
  git_branch: main
  pip_install: -e .[gpu]

image: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04

command: |
  cd llm-foundry/scripts
  composer train/train.py train/yamls/mpt/125m.yaml \
    data_remote=s3://bucket/path-to-data \
    max_duration=100ba \
    save_folder=s3://checkpoints/mpt-125m \
    save_interval=20ba

run_name: mpt-125m-r2

gpu_num: 8
gpu_type: a100_40gb
cluster: oci-1  # can be any compute cluster!</code></pre>
            <p><i>An MCLI job template. Specify a run name, a Docker image, a set of commands, and a compute cluster to run on.</i></p>
    <div>
      <h3>Get started today!</h3>
      <a href="#get-started-today">
        
      </a>
    </div>
    <p>The MosaicML platform is an invaluable tool to take your training to the next level, and in this post, we explored how Cloudflare R2 empowers you to train models on your own data, with any compute provider – or all of them. By eliminating egress fees, R2’s storage is an exceptionally cost-effective complement to MosaicML training, providing maximum autonomy and control. With this combination, you can switch between cloud service providers to fit your organization’s needs over time.</p><p>To learn more about using MosaicML to train custom state-of-the-art AI on your own data visit <a href="https://www.mosaicml.com/">here</a> or <a href="https://docs.google.com/forms/d/e/1FAIpQLSepW7QB3Xkv6T7GJRwrR9DmGAEjm5G2lBxJC7PUe3JXcBZYbw/viewform">get in touch</a>.</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div></div><p></p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Egress]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[LLM]]></category>
            <guid isPermaLink="false">4ETryNsT8L8QFX8tPNzeye</guid>
            <dc:creator>Abhinav Venigalla (Guest Author)</dc:creator>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Abhi Das</dc:creator>
        </item>
    </channel>
</rss>