
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 07 Apr 2026 09:43:59 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Powering the agents: Workers AI now runs large models, starting with Kimi K2.5]]></title>
            <link>https://blog.cloudflare.com/workers-ai-large-models/</link>
            <pubDate>Thu, 19 Mar 2026 19:53:16 GMT</pubDate>
            <description><![CDATA[ Kimi K2.5 is now on Workers AI, helping you power agents entirely on Cloudflare’s Developer Platform. Learn how we optimized our inference stack and reduced inference costs for internal agent use cases.  ]]></description>
            <content:encoded><![CDATA[ <p>We're making Cloudflare the best place for building and deploying agents. But reliable agents aren't built on prompts alone; they require a robust, coordinated infrastructure of underlying primitives. </p><p>At Cloudflare, we have been building these primitives for years: <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a> for state persistence, <a href="https://developers.cloudflare.com/workflows/"><u>Workflows</u></a> for long-running tasks, and <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/worker-loader/"><u>Dynamic Workers</u></a> or <a href="https://developers.cloudflare.com/sandbox/"><u>Sandbox</u></a> containers for secure execution. Powerful abstractions like the <a href="https://developers.cloudflare.com/agents/"><u>Agents SDK</u></a> are designed to help you build agents on top of Cloudflare’s Developer Platform.</p><p>But these primitives only provided the execution environment. The agent still needed a model capable of powering it. </p><p>Starting today, Workers AI is officially in the big models game. We now offer frontier open-source models on our AI inference platform. We’re starting by releasing <a href="https://www.kimi.com/blog/kimi-k2-5"><u>Moonshot AI’s Kimi K2.5</u></a> model <a href="https://developers.cloudflare.com/workers-ai/models/kimi-k2.5"><u>on Workers AI</u></a>. With a full 256k context window and support for multi-turn tool calling, vision inputs, and structured outputs, the Kimi K2.5 model is excellent for all kinds of agentic tasks. By bringing a frontier-scale model directly into the Cloudflare Developer Platform, we’re making it possible to run the entire agent lifecycle on a single, unified platform.</p><p>The heart of an agent is the AI model that powers it, and that model needs to be smart, with high reasoning capabilities and a large context window. Workers AI now runs those models.</p>
    <div>
      <h2>The price-performance sweet spot</h2>
      <a href="#the-price-performance-sweet-spot">
        
      </a>
    </div>
    <p>We spent the last few weeks testing Kimi K2.5 as the engine for our internal development tools. Within our <a href="https://opencode.ai/"><u>OpenCode</u></a> environment, Cloudflare engineers use Kimi as a daily driver for agentic coding tasks. We have also integrated the model into our automated code review pipeline; you can see this in action via our public code review agent, <a href="https://github.com/ask-bonk/ask-bonk"><u>Bonk</u></a>, on Cloudflare GitHub repos. In production, the model has proven to be a fast, efficient alternative to larger proprietary models without sacrificing quality.</p><p>Serving Kimi K2.5 began as an experiment, but it quickly became critical once we saw how well the model performs and how cost-efficient it is. As an illustrative example: we have an agent that does security reviews of Cloudflare’s codebases. This agent processes over 7B tokens per day, and using Kimi, it has caught more than 15 confirmed issues in a single codebase. Doing some rough math, if we had run this agent on a mid-tier proprietary model, we would have spent $2.4M a year for this single use case, on a single codebase. Running this agent with Kimi K2.5 cost just a fraction of that: we cut costs by 77% simply by making the switch to Workers AI.</p><p>As AI adoption increases, we are seeing a fundamental shift not only in how engineering teams are operating, but also in how individuals are operating. It is becoming increasingly common for people to have a personal agent like <a href="https://openclaw.ai/"><u>OpenClaw</u></a> running 24/7. The volume of inference is skyrocketing.</p><p>This new rise in personal and coding agents means that cost is no longer a secondary concern; it is the primary blocker to scaling. When every employee has multiple agents processing hundreds of thousands of tokens per hour, the math for proprietary models stops working. 
Enterprises will look to transition to open-source models that offer frontier-level reasoning without the proprietary price tag. Workers AI is here to facilitate this shift, providing everything from serverless endpoints for a personal agent to dedicated instances powering autonomous agents across an entire organization.</p>
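<p>As a back-of-the-envelope check on the rough math above, using only the figures quoted in this post:</p>

```javascript
// Back-of-the-envelope math using the figures quoted above: roughly $2.4M/year
// on a mid-tier proprietary model, reduced by 77% after switching to Workers AI.
const proprietaryAnnualCost = 2_400_000; // USD/year, estimate from the post
const savingsRate = 0.77;

const workersAiAnnualCost = proprietaryAnnualCost * (1 - savingsRate);
console.log(Math.round(workersAiAnnualCost)); // roughly 552000 USD/year
```

<p>That is, the same workload runs for roughly a quarter of the proprietary-model price.</p>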
    <div>
      <h2>The large model inference stack</h2>
      <a href="#the-large-model-inference-stack">
        
      </a>
    </div>
    <p>Workers AI has served models, including LLMs, since its launch two years ago, but we’ve historically prioritized smaller models. Part of the reason was that for some time, open-source LLMs fell far behind the models from frontier model labs. This changed with models like Kimi K2.5, but to serve this type of very large LLM, we had to make changes to our inference stack. We wanted to share with you some of what goes on behind the scenes to support a model like Kimi.</p><p>We’ve been working on custom kernels for Kimi K2.5, built on top of our proprietary <a href="https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/"><u>Infire inference engine</u></a>, to optimize how we serve the model. Custom kernels improve the model’s performance and GPU utilization, unlocking gains that would otherwise go unclaimed if you were just running the model out of the box. There are also multiple techniques and hardware configurations that can be leveraged to serve a large model. Developers typically use a combination of data, tensor, and expert parallelization techniques to optimize model performance. Strategies like disaggregated prefill are also important: separating the prefill and generation stages onto different machines yields better throughput and higher GPU utilization. Implementing these techniques and incorporating them into the inference stack takes a lot of dedicated experience to get right. </p><p>Workers AI has already done the experimentation with serving techniques to yield excellent throughput on Kimi K2.5. A lot of this does not come out of the box when you self-host an open-source model. The benefit of using a platform like Workers AI is that you don’t need to be a Machine Learning Engineer, a DevOps expert, or a Site Reliability Engineer to do the optimizations required to host it; we’ve already done the hard part, and you just need to call an API.</p>
    <div>
      <h2>Beyond the model — platform improvements for agentic workloads</h2>
      <a href="#beyond-the-model-platform-improvements-for-agentic-workloads">
        
      </a>
    </div>
    <p>In concert with this launch, we’ve also improved our platform and are releasing several new features to help you build better agents.</p>
    <div>
      <h3>Prefix caching and surfacing cached tokens</h3>
      <a href="#prefix-caching-and-surfacing-cached-tokens">
        
      </a>
    </div>
    <p>When you work with agents, you are likely sending a large number of input tokens as part of the context: this could be detailed system prompts, tool definitions, MCP server tools, or entire codebases. Inputs can be as large as the model context window, so in theory, you could be sending requests with almost 256k input tokens. That’s a lot of tokens.</p><p>When an LLM processes a request, the request is handled in two stages: the prefill stage processes input tokens and the generation (decode) stage produces output tokens. These stages are usually sequential: input tokens must be fully processed before any output tokens can be generated. This means that sometimes the GPU is not fully utilized while the model is doing prefill.</p><p>With multi-turn conversations, when you send a new prompt, the client sends all the previous prompts, tools, and context from the session to the model as well. The delta between consecutive requests is usually just a few new lines of input; all the other context has already gone through the prefill stage during a previous request. This is where prefix caching helps. Instead of doing prefill on the entire request, we can cache the input tensors from a previous request, and only do prefill on the new input tokens. This saves a lot of time and compute in the prefill stage, which means a faster Time to First Token (TTFT) and a higher Tokens Per Second (TPS) throughput as you’re not blocked on prefill.</p><p>Workers AI has always done prefix caching, but we are now surfacing cached tokens as a usage metric and offering a discount on cached tokens compared to regular input tokens. (Pricing can be found on the <a href="https://developers.cloudflare.com/workers-ai/models/kimi-k2.5/"><u>model page</u></a>.) We also have new techniques you can leverage to achieve a higher prefix cache hit rate and reduce your costs.</p>
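<p>As a rough illustration of the idea (not Workers AI’s actual implementation), a prefix cache can be modeled as a lookup for the longest previously-prefilled token prefix, so only the new suffix of a request needs fresh prefill:</p>

```javascript
// Illustrative sketch only -- not Workers AI's actual implementation.
// A prefix cache remembers token sequences that have already been prefilled;
// a new request only pays prefill cost for the suffix beyond its longest
// cached prefix.
class PrefixCache {
  constructor() {
    this.prefixes = []; // token arrays already processed by prefill
  }

  // Length of the longest cached prefix that matches `tokens`
  longestCachedPrefix(tokens) {
    let best = 0;
    for (const cached of this.prefixes) {
      let i = 0;
      while (i < cached.length && i < tokens.length && cached[i] === tokens[i]) {
        i++;
      }
      best = Math.max(best, i);
    }
    return best;
  }

  // Returns how many tokens hit the cache vs. need fresh prefill
  prefill(tokens) {
    const cachedTokens = this.longestCachedPrefix(tokens);
    this.prefixes.push(tokens); // remember this turn for the next one
    return { cachedTokens, newTokens: tokens.length - cachedTokens };
  }
}
```

<p>In a multi-turn session, the first turn pays full prefill; each later turn only pays for its delta, which is what makes the cached-token discount meaningful.</p>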
    <div>
      <h3>New session affinity header for higher cache hit rates</h3>
      <a href="#new-session-affinity-header-for-higher-cache-hit-rates">
        
      </a>
    </div>
    <p>To route your requests to the same model instance and take advantage of prefix caching, we’ve introduced a new <code>x-session-affinity</code> header. When you send this header, you’ll improve your cache hit ratio, leading to more cached tokens and, in turn, faster TTFT, higher TPS, and lower inference costs.</p><p>You can pass the new header as shown below, with a unique string per session or per agent. Some clients like OpenCode implement this automatically out of the box. Our <a href="https://github.com/cloudflare/agents-starter"><u>Agents SDK starter</u></a> has already set up the wiring to do this for you, too.</p>
            <pre><code>curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "x-session-affinity: ses_12345678" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is prefix caching and why does it matter?"
      }
    ],
    "max_tokens": 2400,
    "stream": true
  }'
</code></pre>
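<p>If your client doesn’t set this up for you, generating a stable key is straightforward. Below is a minimal sketch; the <code>makeAffinityKey</code> helper and the environment variable names are illustrative, not part of any SDK, and any string works as long as it is unique per session and reused across that session’s requests:</p>

```javascript
// Illustrative helper: derive a stable affinity key for an agent session.
// Not part of any SDK -- any unique, stable-per-session string works.
function makeAffinityKey(agentId, conversationId) {
  return `ses_${agentId}_${conversationId}`;
}

// Sketch of attaching the header to the REST endpoint shown above.
// ACCOUNT_ID and API_TOKEN are assumed environment bindings.
async function runWithAffinity(env, affinityKey, messages) {
  return fetch(
    `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5`,
    {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${env.API_TOKEN}`,
        "Content-Type": "application/json",
        "x-session-affinity": affinityKey, // same key on every turn of the session
      },
      body: JSON.stringify({ messages, stream: true }),
    }
  );
}
```

<p>The key itself carries no meaning to us; it only tells our routing layer which requests belong together.</p>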
            
    <div>
      <h3>Redesigned async APIs</h3>
      <a href="#redesigned-async-apis">
        
      </a>
    </div>
    <p>Serverless inference is really hard. With a pay-per-token business model, it’s cheaper on a per-request basis because you don’t need to pay for entire GPUs to service your requests. But there’s a trade-off: you have to contend with other people’s traffic and capacity constraints, and there’s no strict guarantee that your request will be processed. This is not unique to Workers AI: it’s the case across serverless model providers, as frequent reports of overloaded providers and service disruptions show. While we always strive to serve your request and have built-in autoscaling and rebalancing, there are hard limitations (like hardware) that make this a challenge.</p><p>For volumes of requests that would exceed synchronous rate limits, you can submit batches of inferences to be completed asynchronously. We’re introducing a revamped Asynchronous API, which means that for asynchronous use cases, you won’t run into Out of Capacity errors and inference will execute durably at some point. Our async API works more like flex processing than a traditional batch API: we process requests from the async queue whenever we have headroom in our model instances. In internal testing, our async requests usually execute within 5 minutes, but this will depend on what live traffic looks like. As we bring Kimi to the public, we will tune our scaling accordingly, but the async API is the best way to make sure you don’t run into capacity errors in durable workflows. This is perfect for use cases that are not real-time, such as code scanning agents or research agents.</p><p>Workers AI previously had an asynchronous API, but we’ve recently revamped the systems under the hood. We now rely on a pull-based system versus the historical push-based system, allowing us to pull in queued requests as soon as we have capacity. 
We’ve also added better controls to tune the throughput of async requests, monitoring GPU utilization in real time and pulling in async requests when utilization is low, so that critical synchronous requests get priority while asynchronous requests are still processed efficiently.</p><p>To use the asynchronous API, you send your requests as shown below. We also have a way to <a href="https://developers.cloudflare.com/workers-ai/platform/event-subscriptions/"><u>set up event notifications</u></a> so that you know when the inference is complete instead of polling for the result. </p>
            <pre><code>// (1.) Queue a batch of requests by passing queueRequest: true
let res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  "requests": [{
    "messages": [{
      "role": "user",
      "content": "Tell me a joke"
    }]
  }, {
    "messages": [{
      "role": "user",
      "content": "Explain the Pythagorean theorem"
    }]
  }
  // ...add more requests to the batch
  ]
}, {
  queueRequest: true,
});

// (2.) Grab the request id
let request_id;
if (res &amp;&amp; res.request_id) {
  request_id = res.request_id;
}

// (3.) Poll the status until it is no longer queued or running
res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  request_id: request_id
});

if (res &amp;&amp; (res.status === "queued" || res.status === "running")) {
  // still pending: wait, then poll again
} else {
  return Response.json(res); // this contains the final completed response
}</code></pre>
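<p>The polling step above can be wrapped in a small helper with exponential backoff. This is an illustrative sketch, not part of the Workers AI binding; <code>check</code> is any async function that returns the polled response:</p>

```javascript
// Illustrative poll-with-backoff helper -- not part of the Workers AI binding.
// `check` is an async function returning { status, ... }; we poll until the
// status leaves "queued"/"running", doubling the delay between attempts.
async function pollUntilDone(check, { initialDelayMs = 1000, maxAttempts = 10 } = {}) {
  let delay = initialDelayMs;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await check();
    if (res && res.status !== "queued" && res.status !== "running") {
      return res; // final completed response
    }
    await new Promise((resolve) => setTimeout(resolve, delay));
    delay *= 2; // exponential backoff between polls
  }
  throw new Error("Async request did not complete within the polling budget");
}
```

<p>You would call it as, for example, <code>await pollUntilDone(() => env.AI.run("@cf/moonshotai/kimi-k2.5", { request_id }))</code>, though the event notifications mentioned above avoid polling entirely.</p>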
            
    <div>
      <h2>Try it out today</h2>
      <a href="#try-it-out-today">
        
      </a>
    </div>
    <p>Get started with Kimi K2.5 on Workers AI today. You can read our developer docs to find <a href="https://developers.cloudflare.com/workers-ai/models/kimi-k2.5/"><u>model information and pricing</u></a>, and how to take advantage of <a href="https://developers.cloudflare.com/workers-ai/features/prompt-caching/"><u>prompt caching via session affinity headers</u></a> and the <a href="https://developers.cloudflare.com/workers-ai/features/batch-api/"><u>asynchronous API</u></a>. The <a href="https://github.com/cloudflare/agents-starter"><u>Agents SDK starter</u></a> also now uses Kimi K2.5 as its default model. You can also <a href="https://opencode.ai/docs/providers/"><u>connect to Kimi K2.5 on Workers AI via Opencode</u></a>. For a live demo, try it in our <a href="https://playground.ai.cloudflare.com/"><u>playground</u></a>.</p><p>And if this set of problems around serverless inference, ML optimizations, and GPU infrastructure sounds interesting to you, <a href="https://job-boards.greenhouse.io/cloudflare/jobs/6297179?gh_jid=6297179"><u>we’re hiring</u></a>!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36JzF0zePj2z7kZQK8Q2fg/73b0a7206d46f0eef170ffd1494dc4b3/BLOG-3247_2.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Agents]]></category>
            <guid isPermaLink="false">1wSO33KRdd5aUPAlSVDiqU</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Kevin Flansburg</dc:creator>
            <dc:creator>Ashish Datta</dc:creator>
            <dc:creator>Kevin Jain</dc:creator>
        </item>
        <item>
            <title><![CDATA[Partnering with Black Forest Labs to bring FLUX.2 [dev] to Workers AI]]></title>
            <link>https://blog.cloudflare.com/flux-2-workers-ai/</link>
            <pubDate>Tue, 25 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[ FLUX.2 [dev] by Black Forest Labs is now on Workers AI! This advanced open-weight image model offers superior photorealism, multi-reference inputs, and granular control with JSON prompting. ]]></description>
            <content:encoded><![CDATA[ <p>In recent months, we’ve seen a leap forward for closed-source image generation models with the rise of <a href="https://gemini.google/overview/image-generation/"><u>Google’s Nano Banana</u></a> and <a href="https://openai.com/index/image-generation-api/"><u>OpenAI image generation models</u></a>. Today, we’re happy to share that a new open-weight contender has arrived: Black Forest Labs’ FLUX.2 [dev] is now available to run on Cloudflare’s inference platform, Workers AI. You can read more about the model in detail in BFL’s launch blog post <a href="https://bfl.ai/blog/flux-2"><u>here</u></a>. </p><p>We have been huge fans of Black Forest Labs’ FLUX image models since their earliest versions. Our hosted version of FLUX.1 [schnell] is one of the most popular models in our catalog for its photorealistic outputs and high-fidelity generations. When the time came to host the licensed version of their new model, we jumped at the opportunity. The FLUX.2 model takes all the best features of FLUX.1 and amps them up, generating even more realistic, grounded images with added customization support like JSON prompting.</p><p>Our Workers AI hosted version of FLUX.2 has some specific usage patterns, like using multipart form data to support input images (up to four 512x512 images) and output images up to 4 megapixels. The multipart form data format allows users to send us multiple image inputs alongside the typical model parameters. Check out our <a href="https://developers.cloudflare.com/changelog/2025-11-25-flux-2-dev-workers-ai/"><u>developer docs changelog announcement</u></a> to understand how to use the FLUX.2 model.</p>
    <div>
      <h2>What makes FLUX.2 special? Physical world grounding, digital world assets, and multi-language support</h2>
      <a href="#what-makes-flux-2-special-physical-world-grounding-digital-world-assets-and-multi-language-support">
        
      </a>
    </div>
    <p>The FLUX.2 model has a more robust understanding of the physical world, allowing you to turn abstract concepts into photorealistic reality. It excels at generating realistic image details and consistently delivers accurate hands, faces, fabrics, logos, and small objects that are often missed by other models. Its knowledge of the physical world also produces life-like lighting, angles, and depth perception.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3tOCj8UT98MbcvlXFe8EMl/ad6d94d8b4713453dcd2455a9a3ad331/image3.png" />
          </figure><p><sup>Figure 1. Image generated with FLUX.2 featuring accurate lighting, shadows, reflections, and depth perception at a café in Paris.</sup></p><p>This high-fidelity output makes it ideal for applications requiring superior image quality, such as creative photography, e-commerce product shots, marketing visuals, and interior design. Because it can understand context, tone, and trends, the model allows you to create engaging and editorial-quality digital assets from short prompts.</p><p>Aside from the physical world, the model is also able to generate high-quality digital assets such as landing pages and detailed infographics (see below for an example). It’s also able to understand multiple languages naturally, so, combining these two features, we can get a beautiful landing page in French from a French prompt.</p>
            <pre><code>Générer une page web visuellement immersive pour un service de promenade de chiens. L'image principale doit dominer l'écran, montrant un chien exubérant courant dans un parc ensoleillé, avec des touches de vert vif (#2ECC71) intégrées subtilement dans le feuillage ou les accessoires du chien. Minimiser le texte pour un impact visuel maximal.</code></pre>
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3C9EEp5jsISYMrOsC4NKb/12e0630b51334feb02a5be805e767d08/image8.png" />
          </figure>
    <div>
      <h2>Character consistency – solving for stochastic drift</h2>
      <a href="#character-consistency-solving-for-stochastic-drift">
        
      </a>
    </div>
    <p>FLUX.2 offers multi-reference editing with state-of-the-art character consistency, ensuring identities, products, and styles remain consistent across generations. In the world of generative AI, getting a high-quality image is easy. However, getting the <i>exact same</i> character or product twice has always been the hard part. This is a phenomenon known as "stochastic drift", where generated images drift away from the original source material.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/58T8pavXCsWneWxEDfKgte/a821df53fa3a86d577dd72e3c285fe8a/image9.png" />
          </figure><p><sup>Figure 2. Stochastic drift infographic (generated on FLUX.2)</sup></p><p>One of FLUX.2’s breakthroughs is multi-reference image input, designed to solve this consistency challenge. You’ll have the ability to change the background, lighting, or pose of an image without accidentally changing the face of your model or the design of your product. You can also reference other images or combine multiple images together to create something new. </p><p>In code, Workers AI supports multi-reference images (up to 4) with a multipart form-data upload. The image inputs are binary images and the output is a base64-encoded image:</p>
            <pre><code>curl --request POST \
  --url 'https://api.cloudflare.com/client/v4/accounts/{ACCOUNT}/ai/run/@cf/black-forest-labs/flux-2-dev' \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: multipart/form-data' \
  --form 'prompt=take the subject of image 2 and style it like image 1' \
  --form input_image_0=@/Users/johndoe/Desktop/icedoutkeanu.png \
  --form input_image_1=@/Users/johndoe/Desktop/me.png \
  --form steps=25 \
  --form width=1024 \
  --form height=1024</code></pre>
            <p>We also support this through the Workers AI Binding:</p>
            <pre><code>// Fetch the reference image and convert it to a Blob
const image = await fetch("http://image-url");
const image_blob = await image.blob();

const form = new FormData();
form.append('input_image_0', image_blob);
form.append('prompt', 'a sunset with the dog in the original image');

const resp = await env.AI.run("@cf/black-forest-labs/flux-2-dev", {
    multipart: {
        body: form,
        contentType: "multipart/form-data"
    }
});</code></pre>
            
    <div>
      <h3>Built for real world use cases</h3>
      <a href="#built-for-real-world-use-cases">
        
      </a>
    </div>
    <p>The newest image model signifies a shift towards functional business use cases, moving beyond simple image quality improvements. FLUX.2 enables you to:</p><ul><li><p><b>Create Ad Variations:</b> Generate 50 different advertisements using the exact same actor, without their face morphing between frames.</p></li><li><p><b>Trust Your Product Shots:</b> Drop your product on a model, or into a beach scene, a city street, or a studio table. The environment changes, but your product stays accurate.</p></li><li><p><b>Build Dynamic Editorials:</b> Produce a full fashion spread where the model looks identical in every single shot, regardless of the angle.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Me4jErrBPSW7kAx8qecId/22b2c98a8661489ebe45188cb7947381/image6.png" />
          </figure><p><sup>Figure 3. Combining the oversized hoodie and sweatpant ad photo (generated with FLUX.2) with Cloudflare’s logo to create product renderings with consistent faces, fabrics, and scenery. </sup><sup><i>Note: we prompted for white Cloudflare font instead of the original black font. </i></sup></p>
    <div>
      <h2>Granular controls — JSON prompting, HEX codes and more!</h2>
      <a href="#granular-controls-json-prompting-hex-codes-and-more">
        
      </a>
    </div>
    <p>The FLUX.2 model makes another advancement by allowing users to control small details in images through tools like JSON prompting and exact hex codes.</p><p>For example, you could send this JSON as a prompt (as part of the multipart form input) and the resulting image follows the prompt exactly:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5hnd2lQHC8kqKGbEFqdsgb/de72f15c9e3ede2fcfa717c013a4cba4/image4.jpg" />
          </figure>
            <pre><code>{
  "scene": "A bustling, neon-lit futuristic street market on an alien planet, rain slicking the metal ground",
  "subjects": [
    {
      "type": "Cyberpunk bounty hunter",
      "description": "Female, wearing black matte armor with glowing blue trim, holding a deactivated energy rifle, helmet under her arm, rain dripping off her synthetic hair",
      "pose": "Standing with a casual but watchful stance, leaning slightly against a glowing vendor stall",
      "position": "foreground"
    },
    {
      "type": "Merchant bot",
      "description": "Small, rusted, three-legged drone with multiple blinking red optical sensors, selling glowing synthetic fruit from a tray attached to its chassis",
      "pose": "Hovering slightly, offering an item to the viewer",
      "position": "midground"
    }
  ],
  "style": "noir sci-fi digital painting",
  "color_palette": [
    "deep indigo",
    "electric blue",
    "acid green"
  ],
  "lighting": "Low-key, dramatic, with primary light sources coming from neon signs and street lamps reflecting off wet surfaces",
  "mood": "Gritty, tense, and atmospheric",
  "background": "Towering, dark skyscrapers disappearing into the fog, with advertisements scrolling across their surfaces, flying vehicles (spinners) visible in the distance",
  "composition": "dynamic off-center",
  "camera": {
    "angle": "eye level",
    "distance": "medium close-up",
    "focus": "sharp on subject",
    "lens": "35mm",
    "f-number": "f/1.4",
    "ISO": 400
  },
  "effects": [
    "heavy rain effect",
    "subtle film grain",
    "neon light reflections",
    "mild chromatic aberration"
  ]
}</code></pre>
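<p>To send a structured prompt like this through the multipart endpoint, you can serialize the object into the <code>prompt</code> field. A minimal sketch follows; the field names mirror the curl example earlier in the post, and the shortened prompt object is just for illustration:</p>

```javascript
// Sketch: serialize a structured prompt into the multipart `prompt` field.
// The model receives the JSON as plain text; field names follow the curl
// example earlier in the post.
const jsonPrompt = {
  scene: "A bustling, neon-lit futuristic street market on an alien planet",
  style: "noir sci-fi digital painting",
  color_palette: ["deep indigo", "electric blue", "acid green"],
};

const form = new FormData();
form.append("prompt", JSON.stringify(jsonPrompt));
form.append("steps", "25");

// const resp = await env.AI.run("@cf/black-forest-labs/flux-2-dev", {
//   multipart: { body: form, contentType: "multipart/form-data" }
// });
```

<p>Because the prompt is just a string, the same approach works from curl by passing the serialized JSON to <code>--form prompt=...</code>.</p>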
            <p>To take it further, we can ask the model to recolor the accent lighting to a Cloudflare orange by giving it a specific hex code like #F48120.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79EL6Y3YGu8PqvWauHyqzh/29684aa0f4bb9b4306059e1634b5b94c/image1.jpg" />
          </figure>
    <div>
      <h2>Try it out today!</h2>
      <a href="#try-it-out-today">
        
      </a>
    </div>
    <p>The newest FLUX.2 [dev] model is now available on Workers AI — you can get started with the model through our <a href="http://developers.cloudflare.com/workers-ai/models/flux-2-dev"><u>developer docs</u></a> or test it out on our <a href="https://multi-modal.ai.cloudflare.com/"><u>multimodal playground.</u></a></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/49KKiYwNbkrRaiDRruKCck/66cdcb3b41f8a87fd44a240e05bd851a/image2.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">5lE1GkcjJWDeQq5696TdSs</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>David Liu</dc:creator>
        </item>
        <item>
            <title><![CDATA[Choice: the path to AI sovereignty]]></title>
            <link>https://blog.cloudflare.com/sovereign-ai-and-choice/</link>
            <pubDate>Thu, 25 Sep 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Championing AI sovereignty through choice: diverse tools, data control, and no vendor lock-in. We're enabling this in India, Japan, and Southeast Asia, offering local, open-source models on Workers AI ]]></description>
            <content:encoded><![CDATA[ <p>Every government is laser-focused on the potential for national transformation by AI. Many view AI as an unparalleled opportunity to solve complex national challenges, drive economic growth, and improve the lives of their citizens. Others are concerned about the risks AI can bring to their societies and economies. Some sit somewhere between these two perspectives. But as plans are drawn up by governments around the world to address the question of AI development and adoption, all are grappling with the critical question of sovereignty: how much of this technology, mostly centered in the United States and China, needs to be in their direct control? </p><p>Each nation has its own response to that question. Some seek ‘self-sufficiency’ and total authority. Others, particularly those that do not have the capacity to build the full AI technology stack, are approaching it layer-by-layer, seeking to build on the capacities their country does have and then forming strategic partnerships to fill the gaps. </p><p>We believe AI sovereignty at its core is about choice. Each nation should have the ability to select the right tools for the task, to control its own data, and to deploy applications at will, all without being locked into a single provider or a single way of doing things. It's about autonomy and options, realized through a diversified, resilient digital supply chain. </p><p>Cloudflare’s mission is to help build a better Internet. We make tools for developers around the world to build Internet and AI applications that are widely, and in many cases, freely, available. We work on standards to improve interoperability and prevent <a href="https://www.cloudflare.com/learning/cloud/what-is-vendor-lock-in/"><u>vendor lock-in</u></a>. And we are global — our <a href="https://www.cloudflare.com/network/"><u>network</u></a> spans 330 cities in over 125 countries. 
By supporting local developers to build and deploy AI tools and services right where they are, Cloudflare can help each nation on their path to greater AI sovereignty. </p>
    <div>
      <h2>Creating a future that enables many AI options</h2>
      <a href="#creating-a-future-that-enables-many-ai-options">
        
      </a>
    </div>
    <p>Many nations recognize the practical challenge of realizing a robust AI-driven future that incorporates sovereignty — the significant cost and complexity of the infrastructure needed to set AI in action. Cloudflare believes that countries can achieve their objectives by creating vibrant marketplaces that allow multiple options, and we are creating a path for governments that provides maximum choice:</p><p><b>Infrastructure accessibility: </b>Countries often focus on building large data centers that have the compute capacity to train general purpose AI models, neglecting the infrastructure needed to effectively deploy AI. Because of their proximity to end users, distributed edge networks are critical to ensuring that consumers can actually use AI technologies at scale. Although some AI technologies will be designed to work on-device, many will need more power to run <a href="https://blog.cloudflare.com/best-place-region-earth-inference/"><u>AI inference</u></a>, the tasks that users ask an AI engine to complete.  Distributed networks are equipped to run AI workloads at the edge, to help deliver the low latency and high performance needed for advanced technologies. Cloudflare’s distributed network gives developers a path to rapidly deploy their apps globally without massive upfront investments. </p><p><b>Inclusivity</b>: Nations want their entire economies, from the small businesses, to research institutions, to non-profits and enterprises, to benefit from AI transformation. <a href="https://www.cloudflare.com/learning/serverless/what-is-serverless/"><u>Serverless models</u></a> like Cloudflare’s make it easy to get started. Developers pay only for what they use, rather than being locked into paying for expensive and unnecessary compute, dramatically lowering the barrier to entry. 
Our free tier allows developers to experiment, build, and even launch applications without any cost, while our pay-as-you-go model for increased usage removes the significant financial barriers that might otherwise keep advanced AI out of reach.</p><p><b>Control over data: </b>An important part of sovereignty is the ability to control your own data. We believe countries should avoid equating this type of control with data locality, focusing instead on integrating <a href="https://blog.cloudflare.com/cloudflare-for-ai-supporting-ai-adoption-at-scale-with-a-security-first-approach/"><u>security tools that provide visibility and the ability to restrict access to data</u></a>. Cloudflare’s global, distributed network ensures that developers can experiment, build, and deploy AI-powered applications right where they are, setting rules and controls at the Internet edge.</p><p><b>Multi-modal, dynamic markets</b>: Building new applications with closed AI models can make it challenging to switch models later, and can make developers dependent on particular providers. AI strategies must embrace diversity — developers should have access to a wide variety of both open source and closed AI models. Cloudflare’s <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Workers AI</u></a> platform, with over <a href="https://developers.cloudflare.com/workers-ai/models/"><u>50 open source models</u></a>, is model agnostic, helping to create a competitive, dynamic environment where developers can swap models in and out as better, cheaper, or more specialized options become available. Cloudflare’s <a href="https://www.cloudflare.com/en-gb/lp/dg/brand/api-security/"><u>AI Gateway</u></a> allows our customers to connect and control all their AI models, regardless of vendor, in a single, unified, interoperable platform.</p><p>Underpinning all of this is the importance of <b>open standards</b> that encourage interoperability.
Open standards and protocols throughout the AI technology stack help prevent dependency, create dynamic and competitive markets, and create choice for governments and their developers.</p>
    <div>
      <h2>Championing regional AI innovation  </h2>
      <a href="#championing-regional-ai-innovation">
        
      </a>
    </div>
    <p>Many countries have started to put their own mark on how to spur innovation in their markets, starting with <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/"><u>large language models (LLMs)</u></a>. AI development to date has mostly centered around LLMs trained on English-centric data, and increasingly, Chinese-centric data, leaving behind those who can’t fully access this technology in these two languages. Recognizing this gap, these nations are building and freely offering AI models trained on local language datasets that are fine-tuned to the nuances of their own cultures and languages. This approach lowers the barrier to entry for local businesses, organizations, and governments to create customized AI solutions for their specific markets. Open-sourcing these LLMs recognizes that AI sovereignty is a means to an end. The goal is innovation, economic growth, and the ability to solve meaningful problems.</p><p>Cloudflare is now supporting these sovereign AI initiatives in <b>India, Japan, and Southeast Asia</b>. We are bringing these locally-developed, open-source AI models to developers around the world through our serverless inference platform, Workers AI.</p><p><b>India</b>: <a href="https://indiaai.gov.in/"><u>India’s national vision</u></a> is “AI for All”, which focuses on AI driving inclusive growth and social empowerment. India will host the momentous global <a href="https://impact.indiaai.gov.in/home"><u>AI Impact Summit</u></a> in 2026, where a key element will be showcasing empowering technological advancements that are accessible to the Global South. With its immense linguistic diversity, India is at the forefront of creating models that serve its hundreds of millions of Internet users in their native tongues.
A cornerstone in this endeavor is the Government of India’s <a href="https://bhashini.gov.in/"><u>Bhashini</u></a>, a digital public good platform that enables all Indian citizens to access the Internet and digital services in 22 official languages.</p><p>Cloudflare is now offering <a href="https://ai4bharat.iitm.ac.in/"><u>AI4Bharat</u></a>’s <a href="https://huggingface.co/ai4bharat/indictrans2-en-indic-1B"><u>IndicTrans2 model</u></a>, a key open source language model that is also part of the Bhashini initiative. The model is able to translate text across 22 Indic languages, including Bengali, Gujarati, Hindi, Tamil, Sanskrit and even traditionally low-resourced languages like Kashmiri, Manipuri and Sindhi. </p><p>You can use the <a href="https://developers.cloudflare.com/workers-ai/models/indictrans2-en-indic-1B"><u>@cf/ai4bharat/indictrans2-en-indic-1B</u></a> model on Workers AI as follows:</p>
            <pre><code>curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/ai/run/@cf/ai4bharat/indictrans2-en-indic-1B \
  --header 'Authorization: Bearer TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": ["What is your favourite food?", "I like pizza"],
    "target_language": "guj_Gujr"
}'</code></pre>
            <p><b>Japan: </b>Japan has a very clear and expansive <a href="http://www8.cao.go.jp/cstp/ai/aistratagy2022en.pdf"><u>vision</u></a> of AI development. Concerned about Japan’s slow AI uptake, the Japanese government aims to make the country “the world’s most friendly AI nation” by creating the ideal conditions for AI growth, both at home and abroad. A major initiative for Japan’s government is supporting AI that deeply understands the complexities and cultural context of the Japanese language. </p><p>Cloudflare is offering Preferred Networks, Inc. (PFN)’s <a href="https://huggingface.co/pfnet/plamo-embedding-1b"><u>PLaMo-Embedding-1B</u></a>, a home-grown Japanese text embedding model, made freely and openly available. The Japanese government supported PFN through its <a href="https://www.meti.go.jp/english/policy/mono_info_service/geniac/index.html"><u>Generative AI Accelerator Challenge (GENIAC)</u></a> program, which supports local LLM development through subsidized access to compute resources for training. The PLaMo Embedding model enables users to generate high-quality embeddings for Japanese text, which is helpful for building RAG-powered applications and semantic search use cases.</p><p>You can use the <a href="https://developers.cloudflare.com/workers-ai/models/plamo-embedding-1b"><u>@cf/pfnet/plamo-embedding-1b</u></a> model on Workers AI as follows:</p>
            <pre><code>curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/ai/run/@cf/pfnet/plamo-embedding-1b \
  --header 'Authorization: Bearer TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": [
            "PLaMo-Embedding-1Bは、Preferred Networks, Inc. によって開発された日本語テキスト埋め込みモデルです。",
            "最近は随分と暖かくなりましたね。"
        ]
}'</code></pre>
            <p><b>Southeast Asia: </b>As Chair of the <a href="https://www.imda.gov.sg/about-imda/international-relations/asean-working-group-on-ai-governance"><u>Association of Southeast Asian Nations (ASEAN) Working Group on AI Governance</u></a>, Singapore aims, through its ambitious <a href="https://www.smartnation.gov.sg/initiatives/national-ai-strategy"><u>National AI Strategy 2.0</u></a>, to ensure that AI is a public good, both for Southeast Asia and the world. As a cornerstone of this strategy, Singapore is championing the development and adoption of SEA-LION, a family of open-source LLMs designed for Southeast Asia's diverse languages and cultures. The initiative aims to establish the nation as an inclusive global AI leader, ensuring the technology is both accessible and regionally relevant to its multilingual and multicultural populaces. The models are adept in numerous regional languages, including Bahasa Indonesia, Bahasa Malaysia, Thai, Vietnamese, and Tamil, unlocking AI technologies for a significant portion of the Asian and global population.</p><p><a href="https://huggingface.co/aisingapore/Gemma-SEA-LION-v4-27B-IT-FP8-Dynamic"><u>SEA-LION model v4-27B</u></a> is now available on the Workers AI platform. SEA-LION v4 stands out on the Singapore government’s leaderboard as its most powerful, efficient, multimodal and multilingual model yet. </p><p>You can use the <a href="https://developers.cloudflare.com/workers-ai/models/gemma-sea-lion-v4-27b-it"><u>@cf/aisingapore/gemma-sea-lion-v4-27b-it</u></a> model on Workers AI as follows:</p>
            <pre><code>curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/ai/run/@cf/aisingapore/gemma-sea-lion-v4-27b-it \
  --header 'Authorization: Bearer TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
  "messages": [
    {
      "role": "user",
      "content": "แล้วทำผัดไทยอย่างไร"
    }
  ]
}'</code></pre>
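            <p>The model-run examples above all share the same REST shape: an account-scoped URL, a bearer token, and a JSON body. If you are calling several models, a small helper keeps the calls consistent; this is an illustrative sketch (the buildRun helper is ours, not part of the API):</p>
            <pre><code>// Assemble the URL and fetch options for a Workers AI REST "run" call.
// Only the URL shape and headers come from the examples above; the
// buildRun helper itself is illustrative.
function buildRun(accountId, token, model, body) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}

// The IndicTrans2 translation request from above, expressed with the helper.
const req = buildRun("ACCOUNT_ID", "TOKEN", "@cf/ai4bharat/indictrans2-en-indic-1B", {
  text: ["What is your favourite food?", "I like pizza"],
  target_language: "guj_Gujr",
});
// Send it with: await fetch(req.url, req.options)</code></pre>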
            
    <div>
      <h2>Bringing AI models to the world</h2>
      <a href="#bringing-ai-models-to-the-world">
        
      </a>
    </div>
    <p>Singapore, India and Japan have all chosen to open-source many of their local language models, a strategy that champions an expansive vision of AI sovereignty. This approach demonstrates a crucial understanding: true AI sovereignty means ensuring you have choices.</p><p>Supporting local-language open source models is more than just supporting technology; it is a shared commitment to fostering an open, interoperable, and competitive AI ecosystem by empowering governments and developers to solve local problems, create economic opportunities, and preserve their digital and cultural heritages.</p><p>We are honored to support the initiatives of the governments of India, Japan, and Singapore on this journey. We believe that by putting their sovereign AI models into the hands of developers in their economies, we can help unlock a powerful wave of innovation that is more diverse, equitable, and representative of the world we live in. The future of AI is being built today, and we are proud to ensure that AI developers <i>everywhere</i> are at the forefront. </p><p>Choice is the foundation of AI sovereignty. We’re starting with the models from India, Japan, and Singapore on our serverless inference platform, but it’s only the beginning. Come build with us! Take the first step for free on <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><b><u>Workers AI</u></b><u>.</u></a></p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">4jLFHUrYeSeoFnGedLBZQy</guid>
            <dc:creator>Carly Ramsey</dc:creator>
            <dc:creator>Smrithi Ramesh</dc:creator>
            <dc:creator>Michelle Chen</dc:creator>
        </item>
        <item>
            <title><![CDATA[AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint]]></title>
            <link>https://blog.cloudflare.com/ai-gateway-aug-2025-refresh/</link>
            <pubDate>Wed, 27 Aug 2025 14:05:00 GMT</pubDate>
            <description><![CDATA[ AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint. ]]></description>
            <content:encoded><![CDATA[ <p>Getting the observability you need is challenging enough when the code is deterministic, but AI presents a new challenge — a core part of your user’s experience now relies on a non-deterministic engine that provides unpredictable outputs. On top of that, there are many factors that can influence the results: the model, the system prompt, and more. And beyond all that, you still have to worry about performance, reliability, and costs.</p><p>Solving performance, reliability and observability challenges is exactly what Cloudflare was built for, and two years ago, with the introduction of AI Gateway, we wanted to extend to our users the same levels of control in the age of AI.</p><p>Today, we’re excited to announce several features to make building AI applications easier and more manageable: unified billing, secure key storage, dynamic routing, and security controls with Data Loss Prevention (DLP). This means that AI Gateway becomes your go-to place to control costs and API keys, route between different models and providers, and manage your AI traffic. Check out our new <a href="https://ai.cloudflare.com/gateway"><u>AI Gateway landing page</u></a> for more information at a glance.</p>
    <div>
      <h2>Connect to all your favorite AI providers</h2>
      <a href="#connect-to-all-your-favorite-ai-providers">
        
      </a>
    </div>
    <p>When using an AI provider, you typically have to sign up for an account, get an API key, manage rate limits, top up credits — all within an individual provider’s dashboard. Multiply that for each of the different providers you might use, and you’ll soon be left with an administrative headache of bills and keys to manage.</p><p>With <a href="https://www.cloudflare.com/developer-platform/products/ai-gateway/"><u>AI Gateway</u></a>, you can now connect to major AI providers directly through Cloudflare and manage everything through one single plane. We’re excited to partner with Anthropic, Google, Groq, OpenAI, and xAI to provide Cloudflare users with access to their models directly through Cloudflare. With this, you’ll have access to more than 350 models across 6 different providers.</p><p>You can now get billed for usage across different providers directly through your Cloudflare account. This feature is available for Workers Paid users, where you’ll be able to add credits to your Cloudflare account and use them for <a href="https://www.cloudflare.com/learning/ai/inference-vs-training/"><u>AI inference</u></a> to all the supported providers. You’ll be able to see real-time usage statistics and manage your credits through the AI Gateway dashboard. Your AI Gateway inference usage will also be documented in your monthly Cloudflare invoice. No more signing up and paying for each individual model provider account.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t2j5frheaYOLznprTL58p/f0fb4c6de2aad70c82a23bc35873ea50/image1.png" />
          </figure><p>Usage rates are based on then-current list prices from model providers — all you will need to cover is the transaction fee as you load credits into your account. Since this is one of the first times we’re launching a credits-based billing system at Cloudflare, we’re releasing this feature in Closed Beta — sign up for access <a href="https://forms.gle/3LGAzN2NDXqtbjKR9"><u>here</u></a>.</p>
    <div>
      <h3>BYO Provider Keys, now with Cloudflare Secrets Store</h3>
      <a href="#byo-provider-keys-now-with-cloudflare-secrets-store">
        
      </a>
    </div>
    <p>Although we’ve introduced unified billing, some users might still want to manage their own accounts and keys with providers. We’re happy to say that AI Gateway will continue supporting our <a href="https://developers.cloudflare.com/ai-gateway/configuration/bring-your-own-keys/"><u>BYO Key feature</u></a>, improving the experience of BYO Provider Keys by integrating with Cloudflare’s secrets management product <a href="https://developers.cloudflare.com/secrets-store/"><u>Secrets Store</u></a>. Now, you can seamlessly and securely store your keys in one centralized location and distribute them without relying on plain text. Secrets Store uses a two-level key hierarchy with AES encryption to ensure that your secret stays safe, while maintaining low latency through our global configuration system, <a href="https://blog.cloudflare.com/quicksilver-v2-evolution-of-a-globally-distributed-key-value-store-part-1/"><u>Quicksilver</u></a>.</p><p>You can now save and manage keys directly through your AI Gateway dashboard or through the Secrets Store <a href="http://dash.cloudflare.com/?to=/:account/secrets-store"><u>dashboard</u></a>, <a href="https://developers.cloudflare.com/api/resources/secrets_store/subresources/stores/subresources/secrets/methods/create/"><u>API</u></a>, or <a href="https://developers.cloudflare.com/workers/wrangler/commands/#secrets-store-secret"><u>Wrangler</u></a> by using the new <b>AI Gateway</b> <b>scope</b>. Scoping your secrets to AI Gateway ensures that only this specific service will be able to access your keys, meaning that the secret cannot be used in a Workers binding or anywhere else on Cloudflare’s platform.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6hiSSQi2lQGWQnGYe4e9p1/dadc4fde865010d9e263badb75847992/2.png" />
          </figure><p>You can pass your AI provider keys without including them directly in the request header. Instead of including the actual value, you can deploy the secret only using the Secrets Store reference: </p>
            <pre><code>curl -X POST https://gateway.ai.cloudflare.com/v1/&lt;ACCOUNT_ID&gt;/my-gateway/anthropic/v1/messages \
 --header 'cf-aig-authorization: CLOUDFLARE_AI_GATEWAY_TOKEN' \
 --header 'anthropic-version: 2023-06-01' \
 --header 'Content-Type: application/json' \
 --data  '{"model": "claude-3-opus-20240229", "messages": [{"role": "user", "content": "What is Cloudflare?"}]}'</code></pre>
            <p>Or, using JavaScript: </p>
            <pre><code>import Anthropic from '@anthropic-ai/sdk';


const anthropic = new Anthropic({
  apiKey: "CLOUDFLARE_AI_GATEWAY_TOKEN",
  baseURL: "https://gateway.ai.cloudflare.com/v1/&lt;ACCOUNT_ID&gt;/my-gateway/anthropic",
});


const message = await anthropic.messages.create({
  model: 'claude-3-opus-20240229',
  messages: [{role: "user", content: "What is Cloudflare?"}],
  max_tokens: 1024
});</code></pre>
            <p>By using Secrets Store to deploy your secrets, you no longer need to give every developer access to every key — instead, you can rely on Secrets Store’s <a href="https://developers.cloudflare.com/secrets-store/access-control/"><u>role-based access control</u></a> to further lock down these sensitive values. For example, you might want your security administrators to have Secrets Store admin permissions so that they can create, update, and delete the keys when necessary. With Cloudflare <a href="https://developers.cloudflare.com/logs/logpush/logpush-job/datasets/account/audit_logs/?cf_target_id=1C767B900C4419A313C249A5D99921FB"><u>audit logging</u></a>, all such actions will be logged so you know exactly who did what and when. Your developers, on the other hand, might only need Deploy permissions, so they can reference the values in code, whether that is a Worker or AI Gateway or both. This way, you reduce the risk of the secret getting leaked accidentally or intentionally by a malicious actor. This also allows you to update your provider keys in one place and automatically propagate that value to any AI Gateway using those values, simplifying the management. </p>
    <div>
      <h3>Unified Request/Response</h3>
      <a href="#unified-request-response">
        
      </a>
    </div>
    <p>We made it super easy for people to try out different AI models – and the developer experience should match. Each provider can have slight differences in how they expect requests to be sent, so we’re excited to launch an automatic translation layer between providers. When you send a request through AI Gateway, it just works – no matter what provider or model you use.</p>
            <pre><code>import OpenAI from "openai";
const client = new OpenAI({
  apiKey: "YOUR_PROVIDER_API_KEY", // Provider API key
  // NOTE: the OpenAI client automatically adds /chat/completions to the end of the URL, you should not add it yourself.
  baseURL:
    "https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/compat",
});

const response = await client.chat.completions.create({
  model: "google-ai-studio/gemini-2.0-flash",
  messages: [{ role: "user", content: "What is Cloudflare?" }],
});

console.log(response.choices[0].message.content);</code></pre>
            
    <div>
      <h2>Dynamic Routes</h2>
      <a href="#dynamic-routes">
        
      </a>
    </div>
    <p>When we first launched <a href="https://www.cloudflare.com/developer-platform/products/workers/"><u>Cloudflare Workers</u></a>, it was an easy way for people to intercept HTTP requests and customize actions based on different attributes. We think the same customization is necessary for AI traffic, so we’re launching <a href="https://developers.cloudflare.com/ai-gateway/features/dynamic-routing/"><u>Dynamic Routes</u></a> in AI Gateway.</p><p>Dynamic Routes allows you to define certain actions based on different request attributes. If you have free users, maybe you want to rate limit them to a certain requests-per-second (RPS) threshold or a certain dollar spend. Or maybe you want to conduct an A/B test and split 50% of traffic to Model A and 50% of traffic to Model B. You might also want to chain several models in a row, like adding custom guardrails or enhancing a prompt before it goes to another model. All of this is possible with Dynamic Routes!</p><p>We’ve built a slick UI in the AI Gateway dashboard where you can define simple if/else interactions based on request attributes or a percentage split. Once you define a route, you’ll use the route as the “model” name in your input JSON and we will manage the traffic as you defined.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qLp4KT8ASCLRv2pyM2kxR/3151e32afa4d8447ae07a5a8fb09a9b6/3.png" />
          </figure>
            <pre><code>import OpenAI from "openai";

const cloudflareToken = "CF_AIG_TOKEN";
const accountId = "{account_id}";
const gatewayId = "{gateway_id}";
const baseURL = `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}`;

const openai = new OpenAI({
  apiKey: cloudflareToken,
  baseURL,
});

try {
  const model = "dynamic/&lt;your-dynamic-route-name&gt;";
  const messages = [{ role: "user", content: "What is a neuron?" }];
  const chatCompletion = await openai.chat.completions.create({
    model,
    messages,
  });
  const response = chatCompletion.choices[0].message;
  console.log(response);
} catch (e) {
  console.error(e);
}</code></pre>
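            <p>For intuition, the percentage split described above behaves like a weighted random pick over candidate models. The routing itself runs inside AI Gateway; the sketch below is ours, for illustration only:</p>
            <pre><code>// Illustrative weighted pick mirroring a percentage-split route.
// routes: array of { model, weight } entries with weights summing to 1.
function pickModel(routes, rand = Math.random()) {
  let cumulative = 0;
  for (const { model, weight } of routes) {
    cumulative += weight;
    if (rand >= cumulative) continue; // rand falls past this bucket
    return model;
  }
  return routes[routes.length - 1].model; // guard against float rounding
}

// A 50/50 A/B test between two hypothetical models.
const abTest = [
  { model: "model-a", weight: 0.5 },
  { model: "model-b", weight: 0.5 },
];</code></pre>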
            
    <div>
      <h2>Built-in security with Firewall in AI Gateway</h2>
      <a href="#built-in-security-with-firewall-in-ai-gateway">
        
      </a>
    </div>
    <p>Earlier this year we announced <a href="https://developers.cloudflare.com/changelog/2025-02-26-guardrails/"><u>Guardrails</u></a> in AI Gateway, and now we’re expanding our security capabilities to include Data Loss Prevention (DLP) scanning in AI Gateway’s Firewall. With this, you can select the DLP profiles you are interested in blocking or flagging, and we will scan requests for the matching content. DLP profiles include general categories like "Financial Information" and "Social Security, Insurance, Tax and Identifier Numbers" that everyone has access to with a free Zero Trust account. If you would like to safeguard specific text, the upgraded Zero Trust plan allows you to create custom DLP profiles to catch sensitive data that is unique to your business.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yti8oy4TF01EdZMtYN1If/d2f3bd804873644862fbd61b07d3574a/4.png" />
          </figure><p>False positives and grey-area situations happen, so we give admins control over whether to fully block or just alert on DLP matches. This allows administrators to monitor for potential issues without creating roadblocks for their users. Each log in AI Gateway now includes details about the DLP profiles matched on your request, and the action that was taken:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2pDdqy8bVmsiyjm4sg2pkG/ff97d9069e200fb859c1dc2daed8e4fa/5.png" />
          </figure>
    <div>
      <h2>More coming soon…</h2>
      <a href="#more-coming-soon">
        
      </a>
    </div>
    <p>If you think about the history of Cloudflare, you’ll notice similar patterns that we’re following for the new vision for AI Gateway. We want developers of AI applications to be able to have simple interconnectivity, observability, security, customizable actions, and more — something that Cloudflare has a proven track record of accomplishing for global Internet traffic. We see AI Gateway as a natural extension of Cloudflare’s mission, and we’re excited to make it come to life.</p><p>We’ve got more launches up our sleeves, but we couldn’t wait to get these first handful of features into your hands. Read up about it in our <a href="https://developers.cloudflare.com/ai-gateway/"><u>developer docs</u></a>, <a href="https://developers.cloudflare.com/ai-gateway/get-started/"><u>give it a try</u></a>, and let us know what you think. If you want to explore larger deployments, <a href="https://www.cloudflare.com/plans/enterprise/contact/?utm_medium=referral&amp;utm_source=blog&amp;utm_campaign=2025-q3-acq-gbl-connectivity-ge-ge-general-ai_week_blog"><u>reach out for a consultation </u></a>with Cloudflare experts.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LTpdSaZMBbdOzASW8ggoS/6610f437d955174d7f7f1212617a4365/6.png" />
          </figure><div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[AI Week]]></category>
            <category><![CDATA[AI Gateway]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">6O1tkxTcxxG9hgxI8X9kFH</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Abhishek Kankani</dc:creator>
            <dc:creator>Mia Malden</dc:creator>
        </item>
        <item>
            <title><![CDATA[State-of-the-art image generation Leonardo models and text-to-speech Deepgram models now available in Workers AI]]></title>
            <link>https://blog.cloudflare.com/workers-ai-partner-models/</link>
            <pubDate>Wed, 27 Aug 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We're expanding Workers AI with new partner models from Leonardo.Ai and Deepgram. Start using state-of-the-art image generation models from Leonardo and real-time TTS and STT models from Deepgram.  ]]></description>
            <content:encoded><![CDATA[ <p>When we first launched <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Workers AI</u></a>, we made a bet that AI models would get faster and smaller. We built our infrastructure around this hypothesis, adding specialized GPUs to our datacenters around the world that can serve inference to users as fast as possible. We created our platform to be as general as possible, but we also identified niche use cases that fit our infrastructure well, such as low-latency image generation or real-time audio voice agents. To lean into those use cases, we’re bringing on some new models that will help make it easier to develop for these applications.</p><p>Today, we’re excited to announce that we are expanding our model catalog to include closed-source partner models that fit this use case. We’ve partnered with <a href="http://leonardo.ai"><u>Leonardo.Ai</u></a> and <a href="https://deepgram.com/"><u>Deepgram</u></a> to bring their latest and greatest models to Workers AI, hosted on Cloudflare’s infrastructure. Leonardo and Deepgram both have models with a great speed-to-performance ratio that suit the infrastructure of Workers AI. We’re starting off with these great partners — but expect to expand our catalog to other partner models as well.</p><p>The benefit of using these models on Workers AI is that we don’t just offer a standalone inference service; we also have an entire suite of Developer products that allow you to build whole applications around AI. If you’re building an image generation platform, you could use Workers to <a href="https://www.cloudflare.com/developer-platform/solutions/hosting/">host the application logic</a>, Workers AI to generate the images, R2 for storage, and Images for serving and transforming media. 
If you’re building real-time voice agents, we offer WebRTC and WebSocket support via Workers, speech-to-text, text-to-speech, and turn detection models via Workers AI, and an orchestration layer via Cloudflare Realtime. All in all, we want to lean into use cases where we think Cloudflare has a unique advantage, with developer tools to back it up, and make it all available so that you can build the best AI applications on top of our holistic Developer Platform.</p>
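<p>As a concrete illustration of that stack, a Worker can accept a prompt, call Workers AI, and persist the result to R2. The sketch below is ours, not a published example: it assumes <code>AI</code> and <code>BUCKET</code> bindings configured in wrangler, and individual image models may return different response shapes.</p>
            <pre><code>// Sketch of the image-generation flow described above. Assumes a Worker
// configured with an `AI` binding (Workers AI) and a `BUCKET` binding (R2);
// individual image models may return different response shapes.
const worker = {
  async fetch(request, env) {
    const { prompt } = await request.json();
    // Workers AI generates the image from the prompt.
    const image = await env.AI.run("@cf/leonardo/lucid-origin", { prompt });
    // R2 stores the result under a fresh key for later serving.
    const key = `images/${crypto.randomUUID()}.png`;
    await env.BUCKET.put(key, image);
    return new Response(JSON.stringify({ key }), {
      headers: { "Content-Type": "application/json" },
    });
  },
};
// In a real Worker module: export default worker;</code></pre>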
    <div>
      <h2>Leonardo Models</h2>
      <a href="#leonardo-models">
        
      </a>
    </div>
    <p><a href="https://www.leonardo.ai"><u>Leonardo.Ai</u></a> is a generative AI media lab that trains their own models and hosts a platform for customers to create generative media. The Workers AI team has been working with Leonardo for a while now and has experienced the magic of their image generation models firsthand. We’re excited to bring on two image generation models from Leonardo: @cf/leonardo/phoenix-1.0 and @cf/leonardo/lucid-origin.</p><blockquote><p><i>“We’re excited to enable Cloudflare customers a new avenue to extend and use our image generation technology in creative ways such as creating character images for gaming, generating personalized images for websites, and a host of other uses... all through the Workers AI and the Cloudflare Developer Platform.” - </i><b><i>Peter Runham</i></b><i>, CTO, </i><a href="http://leonardo.ai"><i><u>Leonardo.Ai </u></i></a></p></blockquote><p>The Phoenix model is trained from the ground up by Leonardo, excelling at things like text rendering and prompt coherence. The full image generation request below took 4.89s end-to-end for a 25-step, 1024x1024 image.</p>
            <pre><code>curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/leonardo/phoenix-1.0 \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "A 1950s-style neon diner sign glowing at night that reads '\''OPEN 24 HOURS'\'' with chrome details and vintage typography.",
    "width":1024,
    "height":1024,
    "steps": 25,
    "seed":1,
    "guidance": 4,
    "negative_prompt": "bad image, low quality, signature, overexposed, jpeg artifacts, undefined, unclear, Noisy, grainy, oversaturated, overcontrasted"
}'
</code></pre>
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1q7ndHYrwLQqqAdX6kGEkl/96ece588cf82691fa8e8d11ece382672/BLOG-2903_2.png" />
          </figure><p>The Lucid Origin model is a recent addition to Leonardo’s family of models and is great at generating photorealistic images. The image took 4.38s to generate end-to-end at 25 steps and a 1024x1024 image size.</p>
            <pre><code>curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/leonardo/lucid-origin \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "A 1950s-style neon diner sign glowing at night that reads '\''OPEN 24 HOURS'\'' with chrome details and vintage typography.",
    "width":1024,
    "height":1024,
    "steps": 25,
    "seed":1,
    "guidance": 4,
    "negative_prompt": "bad image, low quality, signature, overexposed, jpeg artifacts, undefined, unclear, Noisy, grainy, oversaturated, overcontrasted"
}'
</code></pre>
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26VKWD8ua6Pe2awQWRnF7n/bb42c9612b08269af4ef38df39a2ed30/BLOG-2903_3.png" />
          </figure>
    <div>
      <h2>Deepgram Models</h2>
      <a href="#deepgram-models">
        
      </a>
    </div>
    <p>Deepgram is a voice AI company that develops their own audio models, allowing users to interact with AI through a natural interface for humans: voice. Voice is an exciting interface because it carries higher bandwidth than text: it includes additional speech signals like pacing, intonation, and more. The Deepgram models we’re bringing onto our platform perform extremely fast speech-to-text and text-to-speech inference. Running on Workers AI, these models take advantage of our unique infrastructure so customers can build low-latency voice agents and more.</p><blockquote><p><i>"By hosting our voice models on Cloudflare's Workers AI, we're enabling developers to create real-time, expressive voice agents with ultra-low latency. Cloudflare's global network brings AI compute closer to users everywhere, so customers can now deliver lightning-fast conversational AI experiences without worrying about complex infrastructure." - </i><i><b>Adam Sypniewski</b></i><i>, CTO, Deepgram</i></p></blockquote><p><a href="https://developers.cloudflare.com/workers-ai/models/nova-3"><u>@cf/deepgram/nova-3</u></a> is a speech-to-text model that can quickly transcribe audio with high accuracy. <a href="https://developers.cloudflare.com/workers-ai/models/aura-1"><u>@cf/deepgram/aura-1</u></a> is a text-to-speech model that is context aware and can apply natural pacing and expressiveness based on the input text. The newer Aura 2 model will be available on Workers AI soon. We’ve also improved the experience of sending binary MP3 files to Workers AI, so you no longer have to convert them into a Uint8Array first. Along with our Realtime announcements (coming soon!), these audio models are the key to enabling customers to build voice agents directly on Cloudflare.</p><p>With the AI binding, a call to the Nova 3 speech-to-text model would look like this:</p>
            <pre><code>const URL = "https://www.some-website.com/audio.mp3";
const mp3 = await fetch(URL);
 
const res = await env.AI.run("@cf/deepgram/nova-3", {
    "audio": {
      body: mp3.body,
      contentType: "audio/mpeg"
    },
    "detect_language": true
  });
</code></pre>
            <p>With the REST API, it would look like this:</p>
            <pre><code>curl --request POST \
  --url 'https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/deepgram/nova-3?detect_language=true' \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: audio/mpeg' \
  --data-binary @/path/to/audio.mp3</code></pre>
            <p>We’ve also added WebSocket support to the Deepgram models, which you can use to keep a live connection to the inference server for bi-directional input and output. To use the Nova model with WebSocket support, check out our <a href="https://developers.cloudflare.com/workers-ai/models/nova-3"><u>Developer Docs</u></a>.</p><p>All the pieces work together so that you can:</p><ol><li><p><b>Capture audio</b> with Cloudflare Realtime from any WebRTC source</p></li><li><p><b>Pipe it</b> via WebSocket to your processing pipeline</p></li><li><p><b>Transcribe</b> with Deepgram audio ML models running on Workers AI</p></li><li><p><b>Process</b> with your LLM of choice through a model hosted on Workers AI or proxied via <a href="https://developers.cloudflare.com/ai-gateway/"><u>AI Gateway</u></a></p></li><li><p><b>Orchestrate</b> everything with Realtime Agents</p></li></ol>
    <div>
      <h2>Try these models out today</h2>
      <a href="#try-these-models-out-today">
        
      </a>
    </div>
    <p>Check out our<a href="https://developers.cloudflare.com/workers-ai/"><u> developer docs</u></a> for more details, pricing and how to get started with the newest partner models available on Workers AI.</p><div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[AI Week]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <guid isPermaLink="false">35N861jwJHF4GEiRCDxWP</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Nikhil Kothari</dc:creator>
        </item>
        <item>
            <title><![CDATA[Partnering with OpenAI to bring their new open models onto Cloudflare Workers AI]]></title>
            <link>https://blog.cloudflare.com/openai-gpt-oss-on-workers-ai/</link>
            <pubDate>Tue, 05 Aug 2025 21:05:00 GMT</pubDate>
            <description><![CDATA[ OpenAI’s newest open-source models are now available on Cloudflare Workers AI on Day 0, with support for Responses API, Code Interpreter and Web Search (coming soon). ]]></description>
            <content:encoded><![CDATA[ <p>OpenAI has just <a href="http://openai.com/index/introducing-gpt-oss"><u>announced their latest open-weight models</u></a> — and we are excited to share that we are working with them as a Day 0 launch partner to make these models available in Cloudflare's <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/">Workers AI</a>. Cloudflare developers can now access OpenAI's first open-weight models since GPT-2, leveraging these powerful new capabilities on our platform. The new models are available starting today at <code>@cf/openai/gpt-oss-120b</code> and <code>@cf/openai/gpt-oss-20b</code>.</p><p>Workers AI has always been a champion for open models and we’re thrilled to bring OpenAI's new open models to our platform today. Developers who want transparency, customizability, and deployment flexibility can rely on Workers AI as a place to deliver AI services. Enterprises that need the ability to run open models to ensure complete data security and privacy can also deploy with Workers AI. We are excited to join OpenAI in fulfilling their mission of making the benefits of AI broadly accessible to builders of any size.</p>
    <div>
      <h3>The technical model specs</h3>
      <a href="#the-technical-model-specs">
        
      </a>
    </div>
    <p>The <a href="https://openai.com/index/gpt-oss-model-card/"><u>OpenAI models</u></a> have been released in two sizes: a 120 billion parameter model and a 20 billion parameter model. Both of them are Mixture-of-Experts models – a popular architecture for recent model releases – that allow relevant experts to be called for a query instead of running through all the parameters of the model. Interestingly, these models run natively at an FP4 quantization, which means that they have a smaller GPU memory footprint than a 120 billion parameter model at FP16. Given the <a href="https://www.cloudflare.com/learning/ai/what-is-quantization/">quantization</a> and the MoE architecture, the new models are able to run faster and more efficiently than more traditional dense models of that size.</p><p>These models are text-only; however, they have reasoning capabilities, tool calling, and two new exciting features with Code Interpreter and Web Search (support coming soon). We’ve implemented Code Interpreter on top of <a href="https://blog.cloudflare.com/containers-are-available-in-public-beta-for-simple-global-and-programmable/"><u>Cloudflare Containers</u></a> in a novel way that allows for stateful code execution (read on below).</p>
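    <p>To make the memory savings concrete, here is a rough, weights-only back-of-the-envelope sketch (our own illustration, ignoring activations, the KV cache, and runtime overhead): FP16 stores two bytes per weight, while FP4 stores half a byte.</p>

```javascript
// Weights-only GPU memory footprint: parameter count × bytes per parameter.
// FP16 = 2 bytes/weight; FP4 = 0.5 bytes/weight (two weights packed per byte).
// This deliberately ignores activations, KV cache, and runtime overhead.
function weightsFootprintGB(numParams, bytesPerParam) {
  return (numParams * bytesPerParam) / 1e9; // decimal gigabytes
}

const PARAMS_120B = 120e9;
console.log(weightsFootprintGB(PARAMS_120B, 2));   // FP16: 240 GB
console.log(weightsFootprintGB(PARAMS_120B, 0.5)); // FP4: 60 GB
```

    <p>At FP4, the 120B model’s weights fit in roughly a quarter of the memory the same model would need at FP16, which is what makes serving a model of this scale efficiently practical.</p>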
    <div>
      <h3>The model on Workers AI</h3>
      <a href="#the-model-on-workers-ai">
        
      </a>
    </div>
    <p>We’re landing these new models with a few tweaks: supporting the new <a href="https://platform.openai.com/docs/api-reference/responses"><u>Responses API</u></a> format as well as the historical <a href="https://platform.openai.com/docs/guides/text?api-mode=chat"><u>Chat Completions API</u></a> format (coming soon). The Responses API format is recommended by OpenAI to interact with their models, and we’re excited to support that on Workers AI.</p><p>If you call the model through:</p><ul><li><p>Workers Binding, it will accept/return Responses API – <code>env.AI.run(“@cf/openai/gpt-oss-120b”)</code></p></li><li><p>REST API on /run endpoint, it will accept/return Responses API – <code>https://api.cloudflare.com/client/v4/accounts/&lt;account_id&gt;/ai/run/@cf/openai/gpt-oss-120b</code></p></li><li><p>REST API on new /responses endpoint, it will accept/return Responses API – <code>https://api.cloudflare.com/client/v4/accounts/&lt;account_id&gt;/ai/v1</code><code><b>/responses</b></code></p></li><li><p>REST API for OpenAI Compatible endpoint, it will return Chat Completions (coming soon)– <code>https://api.cloudflare.com/client/v4/accounts/&lt;account_id&gt;/ai/v1/chat/completions</code></p></li></ul>
            <pre><code>curl https://api.cloudflare.com/client/v4/accounts/&lt;account_id&gt;/ai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CLOUDFLARE_API_KEY" \
  -d '{
    "model": "@cf/openai/gpt-oss-120b",
    "reasoning": {"effort": "medium"},
    "input": [
      {
        "role": "user",
        "content": "What are the benefits of open-source models?"
      }
    ]
  }'
</code></pre>
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6eUpzGy6RKcoPXd9MPBFSk/89d18f2535427cdb564426a4b33f9d4d/image1.png" />
          </figure>
    <div>
      <h3>Code Interpreter + Cloudflare Sandboxes = the perfect fit</h3>
      <a href="#code-interpreter-cloudflare-sandboxes-the-perfect-fit">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/">Large Language Models (LLMs)</a> often struggle with logical tasks such as mathematics or coding. Instead of attempting to reason through these problems directly, LLMs typically make a tool call to execute AI-generated code that solves them. OpenAI's new models are specifically trained for stateful Python code execution and include a built-in feature called Code Interpreter, designed to address this challenge.</p><p>We’re particularly excited about this. Cloudflare not only has an inference platform (<a href="https://developers.cloudflare.com/workers-ai"><u>Workers AI</u></a>), but we also have an ecosystem of compute and storage products that allow people to build full applications on top of our <a href="https://www.cloudflare.com/developer-platform/">Developer Platform</a>. This means that we are uniquely suited to support the model’s Code Interpreter capabilities, not only for one-time code execution, but for <i>stateful </i>code execution as well.</p><p>We’ve built support for Code Interpreter on top of Cloudflare’s <a href="https://developers.cloudflare.com/changelog/2025-06-24-announcing-sandboxes/"><u>Sandbox</u></a> product, which provides a secure environment for running AI-generated code. The <a href="https://github.com/cloudflare/sandbox-sdk"><u>Sandbox SDK</u></a> is built on our latest <a href="https://blog.cloudflare.com/containers-are-available-in-public-beta-for-simple-global-and-programmable/"><u>Containers</u></a> product, and Code Interpreter is the perfect use case to bring all these products together. When you use Code Interpreter, we spin up a Sandbox container scoped to your session that stays alive for 20 minutes, so the code and its state remain available for subsequent queries to the model. We’ve also pre-warmed Sandboxes for Code Interpreter to ensure the fastest start-up times.</p><p>We’ll be publishing an example of how you can use the gpt-oss model on Workers AI and Sandboxes with the OpenAI SDK to make calls to Code Interpreter on our <a href="https://developers.cloudflare.com/workers-ai/guides/demos-architectures/"><u>Developer Docs</u></a>.</p>
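    <p>The session scoping described above can be illustrated with a toy in-memory pool — the names and shapes here are hypothetical illustrations of ours, not the actual Sandbox SDK — where each session’s sandbox is reused until its 20-minute lifetime lapses:</p>

```javascript
// Toy illustration of session-scoped sandbox reuse with a 20-minute TTL.
// Hypothetical names only; the real Sandbox SDK API differs.
const TTL_MS = 20 * 60 * 1000;

class SandboxPool {
  constructor(now = Date.now) {
    this.now = now;            // injectable clock, handy for testing
    this.sessions = new Map(); // sessionId -> { sandboxId, expiresAt }
  }

  // Return the live sandbox for a session, or create a fresh one if none
  // exists or the previous one has expired.
  acquire(sessionId) {
    const entry = this.sessions.get(sessionId);
    if (entry && entry.expiresAt > this.now()) {
      return entry.sandboxId; // reuse: state from earlier code runs survives
    }
    const sandboxId = `sbx-${sessionId}-${this.now()}`;
    this.sessions.set(sessionId, { sandboxId, expiresAt: this.now() + TTL_MS });
    return sandboxId;
  }
}
```

    <p>Within the window, repeated Code Interpreter calls for the same session land on the same sandbox, which is what makes the execution stateful.</p>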
    <div>
      <h3>Give it a try!</h3>
      <a href="#give-it-a-try">
        
      </a>
    </div>
    <p>We’re beyond excited about OpenAI’s new open models, and we hope you are too. We’re grateful to our friends from <a href="https://docs.vllm.ai/en/latest/index.html"><u>vLLM</u></a> and <a href="https://huggingface.co/"><u>HuggingFace</u></a> for supporting efficient model serving on launch day. Read the <a href="https://developers.cloudflare.com/workers-ai/models/gpt-oss-120b"><u>Developer Docs</u></a> to learn more about the details and how to get started building with these new models and capabilities.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4D11uWyDAooDrElGBVcL8f/568d689efc4e9ef56fe7c0eff0dc9d17/image3.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Containers]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">1VV6tpMJn0Te7WEzn2BjSR</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Ashish Datta</dc:creator>
        </item>
        <item>
            <title><![CDATA[Workers AI gets a speed boost, batch workload support, more LoRAs, new models, and a refreshed dashboard]]></title>
            <link>https://blog.cloudflare.com/workers-ai-improvements/</link>
            <pubDate>Fri, 11 Apr 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ We just made Workers AI inference faster with speculative decoding & prefix caching. Use our new batch inference for handling large request volumes seamlessly. ]]></description>
            <content:encoded><![CDATA[ <p>Since the <a href="https://blog.cloudflare.com/workers-ai/"><u>launch of Workers AI</u></a> in September 2023, our mission has been to make inference accessible to everyone.</p><p>Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on various routing improvements, GPU optimizations, and capacity management improvements. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You’ll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos — we try to solve problems through clever engineering so that we are able to do more with less.</p><p>Today, we’re excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we’ll be recapping some of our newly added models, updated pricing, and unveiling a new dashboard to round out the usability of the platform.</p>
    <div>
      <h2>Speeding up inference by 2-4x with speculative decoding and more</h2>
      <a href="#speeding-up-inference-by-2-4x-with-speculative-decoding-and-more">
        
      </a>
    </div>
    <p>We’re excited to roll out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, and more. We’ve previously done a technical deep dive on speculative decoding and how we’re making Workers AI faster, which <a href="https://blog.cloudflare.com/making-workers-ai-faster/"><u>you can read about here</u></a>. With these changes, we’ve been able to improve inference times by 2-4x, without any significant change to the quality of answers generated. We’re planning to incorporate these improvements into more models in the future as we release them. Today, we’re starting to roll out these changes so all Workers AI users of <code>@cf/meta/llama-3.3-70b-instruct-fp8-fast</code> will enjoy this automatic speed boost.</p>
    <div>
      <h3>What is speculative decoding?</h3>
      <a href="#what-is-speculative-decoding">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Jc5CeeOpTW1LSZ7xeZumY/99ced72a25bdabea276f98c03bc17e27/image3.png" />
          </figure><p>LLMs generate text by predicting the next token in a sequence given the previous tokens. Typically, an LLM is able to predict a single future token (n+1) with one forward pass through the model. These forward passes can be computationally expensive, since they need to work through all the parameters of a model to generate one token (e.g., 70 billion parameters for Llama 3.3 70b).</p><p>With speculative decoding, we put a small model (known as the draft model) in front of the original model that helps predict n+x future tokens. The draft model generates a subset of candidate tokens, and the original model just has to evaluate and confirm whether they should be incorporated into the generation. Evaluating tokens is less computationally expensive, as the model can evaluate multiple tokens concurrently in a forward pass. As such, inference times can be sped up by 2-4x — meaning that users can get responses much faster.</p><p>What makes speculative decoding particularly efficient is that it makes use of GPU compute that would otherwise go unused because of the memory bottleneck LLMs create. Speculative decoding squeezes in a draft model to put that spare compute to work generating tokens faster. This means we’re able to improve the utilization of our GPUs by using them to their full extent without leaving parts of the GPU idle.</p>
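    <p>The propose/verify loop can be sketched with stub "models" that simply read from fixed token arrays — a toy of our own making, not the production implementation — to show why verification passes are cheaper than one-token-at-a-time generation:</p>

```javascript
// Toy speculative decoding: the draft proposes k tokens ahead, the target
// verifies them all in one forward pass, and the matching prefix (plus one
// token from the target at the first mismatch) is accepted.
function speculativeDecode(targetTokens, draftTokens, totalTokens, k = 4) {
  const generated = [];
  let targetPasses = 0; // forward passes through the big model
  while (generated.length < totalTokens) {
    const pos = generated.length;
    const proposal = draftTokens.slice(pos, pos + k); // cheap draft guesses
    const truth = targetTokens.slice(pos, pos + k);   // what the target emits
    targetPasses++; // one pass verifies all k positions at once
    let accepted = 0;
    while (accepted < proposal.length && proposal[accepted] === truth[accepted]) {
      accepted++;
    }
    // Matching prefix + one free token from the target's own prediction.
    generated.push(...truth.slice(0, Math.min(accepted + 1, truth.length)));
  }
  return { generated: generated.slice(0, totalTokens), targetPasses };
}

// Draft agrees with the target except at position 2:
// 8 tokens come back in 3 verification passes instead of 8 sequential ones.
const target = ["a", "b", "c", "d", "e", "f", "g", "h"];
const draft  = ["a", "b", "x", "d", "e", "f", "g", "h"];
const { generated, targetPasses } = speculativeDecode(target, draft, 8);
```

    <p>Note the output always matches what the target model alone would have produced; the speedup depends only on how often the draft’s guesses are accepted.</p>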
    <div>
      <h3>What is prefix caching?</h3>
      <a href="#what-is-prefix-caching">
        
      </a>
    </div>
    <p>With LLMs, there are usually two stages of generation – the first is known as “pre-fill”, which processes the user’s input tokens such as the prompt and context. Prefix caching is aimed at reducing the pre-fill time of a request. As an example, if you were asking a model to generate code based on a given file, you might insert the whole file into the context window of a request. Then, if you want to make a second request to generate the next line of code, you might send us the whole file again in the second request. Prefix caching allows us to cache the pre-fill tokens so we don’t have to process the context twice. With the same example, we would only do the pre-fill stage once for both requests, rather than doing it per request. This method is especially useful for requests that reuse the same context, such as <a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/"><u>Retrieval Augmented Generation (RAG)</u></a>, code generation, chatbots with memory, and more. Skipping the pre-fill stage for similar requests means faster responses for our users and more efficient usage of resources. </p>
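    <p>The idea can be sketched as a small cache keyed by the context tokens. This is our own toy illustration: real implementations match on the longest shared token prefix rather than the exact context, and cache attention (KV) state rather than a counter.</p>

```javascript
// Toy prefix cache: memoize the expensive pre-fill pass by context, so a
// second request reusing the same context skips straight to generation.
class PrefixCache {
  constructor() {
    this.cache = new Map();
    this.prefillCalls = 0; // how many expensive passes we actually ran
  }

  prefill(contextTokens) {
    const key = contextTokens.join("\u0000");
    const hit = this.cache.get(key);
    if (hit) return hit; // cache hit: no pre-fill work needed
    this.prefillCalls++; // the expensive pass over the whole context
    const state = { processedTokens: contextTokens.length }; // stand-in for KV state
    this.cache.set(key, state);
    return state;
  }
}

// Two requests that share one large file as context: only one pre-fill pass.
const cache = new PrefixCache();
const fileTokens = ["big", "source", "file", "tokens"];
cache.prefill(fileTokens);
cache.prefill(fileTokens); // second request served from cache
```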
    <div>
      <h3>How did you validate that quality is preserved through these optimizations?</h3>
      <a href="#how-did-you-validate-that-quality-is-preserved-through-these-optimizations">
        
      </a>
    </div>
    <p>Since this is an in-place update to an existing model, we were particularly cautious in ensuring that we would not break any existing applications with this update. We did extensive A/B testing through a blind arena with internal employees to validate the model quality, and we asked internal and external customers to test the new version of the model to ensure that response formats were compatible and model quality was acceptable. Our testing concluded that the model performed up to standards, with people being extremely excited about the speed of the model. Most LLMs are not perfectly deterministic even with the same set of inputs, but if you do notice something off, please let us know through <a href="https://discord.com/invite/cloudflaredev"><u>Discord</u></a> or <a href="http://x.com/cloudflaredev"><u>X</u></a>.</p>
    <div>
      <h2>Asynchronous batch API</h2>
      <a href="#asynchronous-batch-api">
        
      </a>
    </div>
    <p>Next up, we’re announcing an asynchronous (async) batch API which is helpful for users of large workloads. This feature allows customers to receive their inference responses asynchronously, with the promise that the inference will be completed at a later time rather than immediately erroring out due to capacity.</p><p>An example use case of batch workloads is people generating summaries of a large number of documents. You probably don’t need to use those summaries immediately, as you’ll likely use them once the whole document is complete versus one paragraph at a time. For these use cases, we’ve made it super simple for you to start sending us these requests in batches.</p>
    <div>
      <h3>Why batch requests?</h3>
      <a href="#why-batch-requests">
        
      </a>
    </div>
    <p>From talking to our customers, the most common use case we hear about is people creating embeddings or summarizing a large number of documents. Unfortunately, this is also one of the hardest use cases to manage capacity for as a serverless platform.</p><p>To illustrate this, imagine that you want to summarize a 70 page PDF. You typically chunk the document and then send an inference request for each chunk. If each chunk is a few paragraphs on a page, that means that we receive around 4 requests per page multiplied by 70 pages, which is about 280 requests. Multiply that by tens or hundreds of documents, and multiply that by a handful of concurrent users — this means that we get a sudden massive influx of thousands of requests when users start these large workloads.</p><p>The way we originally built Workers AI was to handle incoming requests as quickly as possible, assuming there's a human on the other side that needed an immediate response. The unique thing about batch workloads is that while they're not latency sensitive, they do require completeness guarantees — you don't want to come back the next day to realize none of your inference requests actually executed.</p><p>With the async API, you send us a batch of requests, and we promise to fulfill them as fast as possible and return them to you as a batch. This guarantees that your inference request will be fulfilled, rather than immediately (or eventually) erroring out. The async API also benefits users who have real-time use cases, as the model instances won’t be immediately consumed by these batch requests that can wait for a response. Inference times will be faster since there won’t be a bunch of competing requests in a queue waiting to reach the inference servers. 
</p><p>We have select models that support batch inference today, which include:</p><ul><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/"><u>@cf/meta/llama-3.3-70b-instruct-fp8-fast</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-small-en-v1.5"><u>@cf/baai/bge-small-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-base-en-v1.5"><u>@cf/baai/bge-base-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5"><u>@cf/baai/bge-large-en-v1.5</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-m3/"><u>@cf/baai/bge-m3</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/m2m100-1.2b/"><u>@cf/meta/m2m100-1.2b</u></a></p></li></ul>
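    <p>The request-volume arithmetic from the example above is easy to sketch (the 50-document, 10-user scale-up is our own illustrative figure):</p>

```javascript
// Fan-out of a document-summarization workload: one inference request per
// chunk, with chunksPerPage chunks on each page.
function totalRequests({ pages, chunksPerPage, documents, users }) {
  return pages * chunksPerPage * documents * users;
}

// The 70-page PDF from the example: ~4 chunks per page → 280 requests.
const onePdf = totalRequests({ pages: 70, chunksPerPage: 4, documents: 1, users: 1 });
// Scale to 50 documents and 10 concurrent users → 140,000 requests at once.
const burst = totalRequests({ pages: 70, chunksPerPage: 4, documents: 50, users: 10 });
```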
    <div>
      <h3>How can I use the batch API?</h3>
      <a href="#how-can-i-use-the-batch-api">
        
      </a>
    </div>
    <p>Users can send a batch request to supported models by passing a flag:</p>
            <pre><code>let res = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-batch", {
  "requests": [{
    "prompt": "Explain mechanics of wormholes"
  }, {
    "prompt": "List different plant species found in America"
  }]
}, {
  queueRequest: true
});</code></pre>
            <p>Check out our <a href="https://developers.cloudflare.com/workers-ai/features/batch-api/"><u>developer docs</u></a> to learn more about the batch API, or use our <a href="https://github.com/craigsdennis/batch-please-workers-ai"><u>template</u></a> to deploy a worker that implements the batch API.</p><p>Today, our batch API can be used by sending us an array of requests, and we’ll return your responses in an array. This is helpful for use cases like summarizing large amounts of data that you know beforehand. This means you can send us a single HTTP request with all of your requests, and receive a single HTTP response back with all of your responses. You can check on the status of the batch by polling with the request ID we return when your batch is submitted. For the next iteration of our async API, we plan to allow queue-based inputs and outputs, where you push requests and pull responses from a queue. This will integrate tightly with <a href="https://developers.cloudflare.com/r2/buckets/event-notifications/"><u>Event Notifications</u></a> and <a href="https://developers.cloudflare.com/workflows/"><u>Workflows</u></a>, so you can execute subsequent actions upon receiving a response.</p>
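    <p>The "submit, then poll by request ID" flow factors into a small helper. The <code>status</code> field and <code>"done"</code> value below are hypothetical stand-ins of ours — check the batch API docs for the real response shape:</p>

```javascript
// Generic poll-until-done loop for an async batch. The check function is
// injected (e.g. a fetch of the batch status by request ID), so the helper
// carries no API specifics; the { status: "done" } shape is hypothetical.
async function pollBatch(check, { intervalMs = 1000, maxAttempts = 60 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await check();
    if (res.status === "done") return res;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("batch did not complete within the polling budget");
}
```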
    <div>
      <h2>Expanded LoRA support</h2>
      <a href="#expanded-lora-support">
        
      </a>
    </div>
    <p>At Birthday Week last year, <a href="https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support/#supporting-fine-tuned-inference-byo-loras"><u>we announced limited LoRA support</u></a> for a handful of models. We’ve iterated on this and now support 8 models as well as larger ranks of up to 32 and LoRA files up to 300 MB. Models that support LoRA inference now include:</p><ul><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct/"><u>@cf/meta/llama-3.2-11b-vision-instruct</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/"><u>@cf/meta/llama-3.3-70b-instruct-fp8-fast</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-guard-3-8b/"><u>@cf/meta/llama-guard-3-8b</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.1-8b-instruct-fast/"><u>@cf/meta/llama-3.1-8b-instruct-fast</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/deepseek-r1-distill-qwen-32b/"><u>@cf/deepseek-ai/deepseek-r1-distill-qwen-32b</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/qwen2.5-coder-32b-instruct"><u>@cf/qwen/qwen2.5-coder-32b-instruct</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/qwq-32b"><u>@cf/qwen/qwq-32b</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/mistral-small-3.1-24b-instruct"><u>@cf/mistralai/mistral-small-3.1-24b-instruct</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/gemma-3-12b-it"><u>@cf/google/gemma-3-12b-it</u></a> (soon)</p></li></ul>
    <div>
      <h3>What is LoRA?</h3>
      <a href="#what-is-lora">
        
      </a>
    </div>
    <p>In essence, a Low Rank Adaptation (LoRA) adapter allows people to take a trained adapter file and use it in conjunction with a base model to alter the model’s responses. We did a<a href="https://blog.cloudflare.com/fine-tuned-inference-with-loras/"><u> deep dive on LoRAs</u></a> in our Birthday Week blog post, which goes into further technical detail. LoRA adapters are great alternatives to fine-tuning a model, as they aren’t as expensive to train, and the adapters are much smaller and more portable. They are also effective enough to tweak the output of a model to fit a certain style of response.</p>
    <div>
      <h3>How do I get started?</h3>
      <a href="#how-do-i-get-started">
        
      </a>
    </div>
    <p>To get started, you first need to train your own LoRA adapter or find a public one on HuggingFace. Then, you’ll upload the <code>adapter_model.safetensors</code> and <code>adapter_config.json</code> to your account with the <a href="https://developers.cloudflare.com/workers-ai/fine-tunes/loras/"><u>documented wrangler commands or through the REST API</u></a>. LoRA files are private and scoped to your own account. After that, you can start running fine-tuned inference — check out our <a href="https://developers.cloudflare.com/workers-ai/features/fine-tunes/loras/"><u>LoRA developer docs</u></a> to get started.</p>
            <pre><code>const response = await env.AI.run(
  "@cf/qwen/qwen2.5-coder-32b-instruct", //the model supporting LoRAs
  {
      messages: [{"role": "user", "content": "Hello world"}],
      raw: true, //skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000", //the finetune id OR finetune name
  }
);</code></pre>
            
    <div>
      <h2>Quality of life improvements: updated pricing and a new dashboard for Workers AI</h2>
      <a href="#quality-of-life-improvements-updated-pricing-and-a-new-dashboard-for-workers-ai">
        
      </a>
    </div>
    <p>While the team has been focused on large engineering milestones, we’ve also landed some quality of life improvements over the last few months. In case you missed it, we’ve announced <a href="https://developers.cloudflare.com/changelog/2025-02-20-updated-pricing-docs/"><u>an updated pricing model</u></a> where usage will be shown in units such as tokens, audio seconds, image size/steps, etc., but still billed in neurons in the backend.</p><p>Today, we’re unveiling a new dashboard that allows users to see their usage in both units as well as neurons (built on <a href="https://blog.cloudflare.com/introducing-workers-observability-logs-metrics-and-queries-all-in-one-place/"><u>new Workers Observability</u></a> components!). Model pricing is also available in the dashboard and on the model pages in the developer docs. And if you use AI Gateway, Workers AI usage is now also displayed as metrics.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ABZi7EC8dedCY4ru0ffsA/8eb63a9f4a626ca70ce6101760d23900/image1.png" />
          </figure>
    <div>
      <h2>New models available in Workers AI</h2>
      <a href="#new-models-available-in-workers-ai">
        
      </a>
    </div>
    <p>Lastly, we’ve steadily been adding new models on Workers AI, with over 10 new models and a few updates on existing models. Pricing is also now listed directly on the model page in the developer docs. To summarize, here are the new models we’ve added on Workers AI, including four new ones we’re releasing today:</p><ul><li><p><a href="https://developers.cloudflare.com/workers-ai/models/deepseek-r1-distill-qwen-32b/"><u>@cf/deepseek-ai/deepseek-r1-distill-qwen-32b</u></a>: a version of Qwen 32B distilled from Deepseek’s R1 that is capable of doing chain-of-thought reasoning.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-m3/"><u>@cf/baai/bge-m3</u></a>: a multi-lingual embeddings model that supports over 100 languages. It can also simultaneously perform dense retrieval, multi-vector retrieval, and sparse retrieval, with the ability to process inputs of different granularities.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-reranker-base/"><u>@cf/baai/bge-reranker-base</u></a>: our first reranker model! Rerankers are a type of text classification model that takes a query and context, and outputs a similarity score between the two. When used in RAG systems, you can use a reranker after the initial vector search to find the most relevant documents to return to a user by reranking the outputs.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/whisper-large-v3-turbo/"><u>@cf/openai/whisper-large-v3-turbo</u></a>: a faster, more accurate speech-to-text model. 
This model was added earlier but is graduating out of beta with pricing included today.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/melotts/"><u>@cf/myshell-ai/melotts</u></a>: our first text-to-speech model that allows users to generate an MP3 with voice audio from text input.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-4-scout-17b-16e-instruct/"><u>@cf/meta/llama-4-scout-17b-16e-instruct</u></a>: 17 billion parameter MoE model with 16 experts that is natively multimodal. Offers industry-leading performance in text and image understanding.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/mistral-small-3.1-24b-instruct"><u>@cf/mistralai/mistral-small-3.1-24b-instruct</u></a>: a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/gemma-3-12b-it"><u>@cf/google/gemma-3-12b-it</u></a>: well-suited for a variety of text generation and image understanding tasks, including question answering, summarization and reasoning, with a 128K context window, and multilingual support in over 140 languages.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/qwq-32b"><u>@cf/qwen/qwq-32b</u></a>: a medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/qwen2.5-coder-32b-instruct"><u>@cf/qwen/qwen2.5-coder-32b-instruct</u></a>: the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.</p></li></ul><p>In addition, we are rolling out some in-place updates to existing models in our catalog:</p><ul><li><p><a 
href="https://developers.cloudflare.com/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/"><u>@cf/meta/llama-3.3-70b-instruct-fp8-fast</u></a> - Llama 3.3 70B gets a speed boost from new techniques such as speculative decoding, prefix caching, and an updated server back end (<a href="#speeding-up-inference-by-2-4x-with-speculative-decoding-and-more"><u>see above</u></a>).</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-small-en-v1.5"><u>@cf/baai/bge-small-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-base-en-v1.5"><u>@cf/baai/bge-base-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5"><u>@cf/baai/bge-large-en-v1.5</u></a> - these models get a new input parameter called “pooling”, which accepts either “cls” or “mean”.</p></li></ul><p>As we release these new models, we’ll be deprecating older models to encourage use of state-of-the-art models and make room in our catalog. We will send out an email notice about this shortly. Stay up to date with our model releases and deprecation announcements by <a href="https://developers.cloudflare.com/changelog/"><u>subscribing to our Developer Docs changelog</u></a>.</p>
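To make the reranking step concrete, here is a minimal sketch of where a reranker fits in a RAG pipeline. The word-overlap scorer is a toy stand-in for a model call (for example, to @cf/baai/bge-reranker-base, which returns a similarity score per query/context pair); `rerank` and `wordOverlap` are illustrative names, not a Workers AI API:

```typescript
// Sketch of the rerank step in a RAG pipeline: score each candidate document
// against the query, sort by similarity, and keep the top-k results.
type Scored<T> = { doc: T; score: number };

function rerank<T>(
  query: string,
  docs: T[],
  score: (query: string, doc: T) => number,
  topK: number
): Scored<T>[] {
  return docs
    .map((doc) => ({ doc, score: score(query, doc) }))
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, topK);
}

// Toy scorer: counts words shared between the query and the document.
// In a real pipeline this would be a call to a reranker model instead.
const wordOverlap = (q: string, d: string): number => {
  const qWords = new Set(q.toLowerCase().split(/\s+/));
  return d.toLowerCase().split(/\s+/).filter((w) => qWords.has(w)).length;
};

const candidates = [
  "Workers AI pricing overview",
  "How to deploy a Worker",
  "Workers AI reranker models explained",
];
const top = rerank("Workers AI reranker", candidates, wordOverlap, 2);
```

In practice you would feed the candidates returned by your initial vector search into the reranker model and sort on the scores it returns.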
    <div>
      <h2>We’re (still) just getting started</h2>
      <a href="#were-still-just-getting-started">
        
      </a>
    </div>
    <p>Workers AI is one of Cloudflare’s newer products in a nascent industry, but we still operate with very traditional Cloudflare principles: learning how to do more with less. Our engineering team is focused on solving the difficult problems that come with growing a distributed inference platform at global scale, and we’re excited to release these new features today, which we think will improve the platform as a whole for all our users. With faster inference times, better reliability, more customization possibilities, and better usability, we’re excited to see what more you can build with Workers AI — <a href="https://discord.com/invite/cloudflaredev"><u>let us know what you think</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">5iJwjQcUANzpsgir2tgfNE</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Meta’s Llama 4 is now available on Workers AI]]></title>
            <link>https://blog.cloudflare.com/meta-llama-4-is-now-available-on-workers-ai/</link>
            <pubDate>Sun, 06 Apr 2025 03:22:00 GMT</pubDate>
            <description><![CDATA[ Llama 4 Scout 17B Instruct is now available on Workers AI: use this multimodal, Mixture of Experts AI model on Cloudflare's serverless AI platform to build next-gen AI applications. ]]></description>
            <content:encoded><![CDATA[ <p>As one of Meta’s launch partners, we are excited to make Meta’s latest and most powerful model, Llama 4, available on the Cloudflare <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> platform starting today. Check out the <a href="https://developers.cloudflare.com/workers-ai/models/llama-4-scout-17b-16e-instruct"><u>Workers AI Developer Docs</u></a> to begin using Llama 4 now.</p>
    <div>
      <h3>What’s new in Llama 4?</h3>
      <a href="#whats-new-in-llama-4">
        
      </a>
    </div>
    <p>Llama 4 is an industry-leading release that pushes forward the frontiers of open-source generative Artificial Intelligence (AI) models. Llama 4 relies on a novel design that combines a <a href="#what-is-a-mixture-of-experts-model"><u>Mixture of Experts</u></a> architecture with an early-fusion backbone that allows it to be natively multimodal.</p><p>The Llama 4 “herd” is made up of two models: Llama 4 Scout (109B total parameters, 17B active parameters) with 16 experts, and Llama 4 Maverick (400B total parameters, 17B active parameters) with 128 experts. The Llama 4 Scout model is available on Workers AI today.</p><p>Llama 4 Scout has a context window of up to 10 million (10,000,000) tokens, which makes it one of the first open-source models to support a window of that size. A larger context window makes it possible to hold longer conversations, deliver more personalized responses, and support better <a href="https://developers.cloudflare.com/workers-ai/guides/tutorials/build-a-retrieval-augmented-generation-ai/"><u>Retrieval Augmented Generation</u></a> (RAG). For example, users can take advantage of that increase to summarize multiple documents or reason over large codebases. At launch, Workers AI supports a context window of 131,000 tokens, and we’ll be working to increase this in the future.</p><p>Llama 4 does not compromise capability for speed. Despite having 109 billion total parameters, the Mixture of Experts (MoE) architecture intelligently activates only a fraction of those parameters during inference, delivering faster responses while retaining the quality of the full 109B-parameter model.</p>
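Using the figures quoted above, a quick sketch of how sparsely the herd activates its parameters per forward pass (`herd` and `activeFraction` are illustrative names):

```typescript
// Active-parameter fraction for each model in the Llama 4 herd,
// from the total/active parameter counts stated above.
const herd = [
  { name: "Llama 4 Scout", totalB: 109, activeB: 17 },
  { name: "Llama 4 Maverick", totalB: 400, activeB: 17 },
];

const activeFraction = (m: { totalB: number; activeB: number }): number =>
  m.activeB / m.totalB;
```

Scout activates roughly 16% of its parameters per forward pass, and Maverick only about 4%, which is where the MoE speed advantage comes from.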
    <div>
      <h3>What is a Mixture of Experts model?</h3>
      <a href="#what-is-a-mixture-of-experts-model">
        
      </a>
    </div>
    <p>A Mixture of Experts (MoE) model is a type of <a href="https://arxiv.org/abs/2209.01667"><u>Sparse Transformer</u></a> model composed of individual specialized neural networks called “experts”. MoE models also have a “router” component that decides which experts each input token is sent to. These specialized experts work together to provide deeper results and faster inference times, increasing both model quality and performance.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7nQnnpYyTW5pLVPofbW6YD/3f9e79c13a419220cda20e7cae43c578/image2.png" />
          </figure><p>For an illustrative example, let’s say there’s an expert that’s really good at generating code while another expert is really good at creative writing. When a request comes in to write a <a href="https://en.wikipedia.org/wiki/Fibonacci_sequence"><u>Fibonacci</u></a> algorithm in Haskell, the router sends the input tokens to the coding expert. This means that the remaining experts may not be activated at all, so the model only needs to use the smaller, specialized neural network to solve the problem.</p><p>In the case of Llama 4 Scout, this means the model is only using one expert (17B parameters) instead of the full 109B total parameters of the model. In reality, the model probably needs to use multiple experts to handle a request, but the point still stands: an MoE model architecture is incredibly efficient for the breadth of problems it can handle and the speed at which it can handle them.</p><p>MoE also makes it more efficient to train models. We recommend reading <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/"><u>Meta’s blog post</u></a> on how they trained the Llama 4 models. While more efficient to train, hosting an MoE model for inference can sometimes be more challenging: you need to load the full model weights (over 200 GB) into GPU memory, and supporting a larger context window also requires keeping more memory available in the key-value (KV) cache.</p><p>Thankfully, Workers AI solves this by offering Llama 4 Scout as a serverless model, meaning that you don’t have to worry about things like infrastructure, hardware, memory, etc. — we do all of that for you, so you are only one API request away from interacting with Llama 4. </p>
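The routing idea can be sketched in a few lines. This is illustrative only, not Llama 4’s actual router: a gating network scores every expert for a token, only the top-k experts run, and their outputs are combined, weighted by the renormalized gate probabilities:

```typescript
// Toy MoE forward pass: softmax gate, top-k expert selection, weighted merge.
type Expert = (x: number[]) => number[];

function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function moeForward(
  x: number[],
  experts: Expert[],
  gateLogits: number[], // in a real model, produced by the router from x
  topK: number
): { out: number[]; activated: number[] } {
  const probs = softmax(gateLogits);
  // Indices of the k highest-probability experts.
  const activated = probs
    .map((p, i) => [p, i] as const)
    .sort((a, b) => b[0] - a[0])
    .slice(0, topK)
    .map(([, i]) => i);
  // Renormalize the gate weights over the selected experts only.
  const selSum = activated.reduce((s, i) => s + probs[i], 0);
  const out = x.map(() => 0);
  for (const i of activated) {
    const y = experts[i](x); // only the selected experts run at all
    const w = probs[i] / selSum;
    y.forEach((v, j) => { out[j] += w * v; });
  }
  return { out, activated };
}

// Four toy "experts" that just scale their input by 1x..4x.
const experts: Expert[] = [1, 2, 3, 4].map((k) => (x) => x.map((v) => k * v));
// Route one token with top-1 gating: only expert 1 (the 2x expert) runs.
const result = moeForward([1, 1], experts, [0, 2, 1, -1], 1);
```

The unselected experts never execute, which is why inference cost tracks the active parameter count (17B for Scout) rather than the total (109B).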
    <div>
      <h3>What is early-fusion?</h3>
      <a href="#what-is-early-fusion">
        
      </a>
    </div>
    <p>One challenge in building AI-powered applications is the need to stitch together multiple models, like a Large Language Model (LLM) and a vision model, to deliver a complete experience for the user. Llama 4 solves that problem by being natively multimodal, meaning the model can understand both text and images.</p><p>You might recall that <a href="https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct/"><u>Llama 3.2 11b</u></a> was also a vision model, but Llama 3.2 actually used separate parameters for vision and text. This means that when you sent an image request to the model, it only used the vision parameters to understand the image.</p><p>With Llama 4, all the parameters natively understand both text and images. This allowed Meta to train the model parameters with large amounts of unlabeled text, image, and video data together. For the user, this means that you don’t have to chain together multiple models like a vision model and an LLM for a multimodal experience — you can do it all with Llama 4.</p>
    <div>
      <h3>Try it out now!</h3>
      <a href="#try-it-out-now">
        
      </a>
    </div>
    <p>As a Meta launch partner, we are excited to make it effortless for developers to use Llama 4 in Cloudflare Workers AI. The release brings an efficient, multimodal, highly capable, open-source model to anyone who wants to build AI-powered applications.</p><p>Cloudflare’s Developer Platform makes it possible to build complete applications that run alongside our Llama 4 inference. You can rely on our compute, storage, and agent layer running seamlessly with the inference from models like Llama 4. To learn more, head over to our <a href="https://developers.cloudflare.com/workers-ai/models/llama-4-scout-17b-16e-instruct"><u>developer docs model page</u></a> for information on using Llama 4 on Workers AI, including pricing, additional terms, and acceptable use policies.</p><p>Want to try it out without an account? Visit our <a href="https://playground.ai.cloudflare.com/"><u>AI playground</u></a> or get started building your AI experiences with Llama 4 and Workers AI.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <guid isPermaLink="false">3G2O7IP6rSTIhSEUVmIDkt</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
            <dc:creator>Nikhil Kothari</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare’s bigger, better, faster AI platform]]></title>
            <link>https://blog.cloudflare.com/workers-ai-bigger-better-faster/</link>
            <pubDate>Thu, 26 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare helps you build AI applications with fast inference at the edge, optimized AI workflows, and vector database-powered RAG solutions. ]]></description>
            <content:encoded><![CDATA[ <p>Birthday Week 2024 marks the first anniversary of Cloudflare’s AI developer products — <a href="https://blog.cloudflare.com/workers-ai/"><u>Workers AI</u></a>, <a href="https://blog.cloudflare.com/announcing-ai-gateway/"><u>AI Gateway</u></a>, and <a href="https://blog.cloudflare.com/vectorize-vector-database-open-beta/"><u>Vectorize</u></a>. For our first birthday this year, we’re excited to announce powerful new features to elevate the way you build with AI on Cloudflare.</p><p>Workers AI is getting a big upgrade, with more powerful GPUs that enable faster inference and bigger models. We’re also expanding our model catalog to dynamically support models that you want to run on us. We’re saying goodbye to neurons and revamping our pricing model to be simpler and cheaper. On AI Gateway, we’re moving forward on our vision of becoming an ML Ops platform by introducing more powerful logs and human evaluations. Lastly, Vectorize is going GA, with expanded index sizes and faster queries.</p>
<p>Whether you want the fastest inference at the edge, optimized AI workflows, or vector database-powered <a href="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/"><u>RAG</u></a>, we’re excited to help you harness the full potential of AI and get started on building with Cloudflare.</p>
    <div>
      <h3>The fast, global AI platform</h3>
      <a href="#the-fast-global-ai-platform">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/56ofEZRtFHhkrfMaGC4RUb/3f69a2fc3722f67218297c65bd510941/image9.png" />
          </figure><p>The first thing that you notice about an application is how fast, or in many cases, how slow it is. This is especially true of AI applications, where the standard today is to wait for a response to be generated.</p><p>At Cloudflare, we’re obsessed with improving the performance of applications, and have been doubling down on our commitment to make AI fast. To live up to that commitment, we’re excited to announce that we’ve added even more powerful GPUs across our network to accelerate LLM performance.</p><p>In addition to more powerful GPUs, we’ve continued to expand our GPU footprint to get as close to the user as possible, reducing latency even further. Today, we have GPUs in over 180 cities, having doubled our capacity in a year. </p>
    <div>
      <h3>Bigger, better, faster</h3>
      <a href="#bigger-better-faster">
        
      </a>
    </div>
    <p>With the introduction of our new, more powerful GPUs, you can now run inference on significantly larger models, including Meta Llama 3.1 70B. Previously, our model catalog was limited to 8B parameter LLMs, but we can now support larger models, faster response times, and larger context windows. This means your applications can handle more complex tasks with greater efficiency.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Model</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>@cf/meta/llama-3.2-11b-vision-instruct</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>@cf/meta/llama-3.2-1b-instruct</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>@cf/meta/llama-3.2-3b-instruct</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>@cf/meta/llama-3.1-8b-instruct-fast</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>@cf/meta/llama-3.1-70b-instruct</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>@cf/black-forest-labs/flux-1-schnell</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>The models above are available on our new GPUs at faster speeds. In general, you can expect throughput of 80+ Tokens per Second (TPS) for 8B models and a Time To First Token (TTFT) of 300 ms (depending on where you are in the world).</p><p>Our model instances now support larger context windows, like the full 128K context window for Llama 3.1 and 3.2. To give you full visibility into performance, we’ll also be publishing metrics like TTFT, TPS, context window, and pricing on models in our <a href="https://developers.cloudflare.com/workers-ai/models/"><u>catalog</u></a>, so you know exactly what to expect.</p><p>We’re committed to bringing the best of open-source models to our platform, and that includes Meta’s release of the new Llama 3.2 collection of models. As a Meta launch partner, we were excited to have Day 0 support for the 11B vision model, as well as the 1B and 3B text-only models, on Workers AI.</p><p>For more details on how we made Workers AI fast, take a look at our <a href="https://blog.cloudflare.com/making-workers-ai-faster"><u>technical blog post</u></a>, where we share a novel method for KV cache compression (it’s open-source!), as well as details on speculative decoding, our new hardware design, and more.</p>
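As a rough, illustrative back-of-the-envelope using those figures, end-to-end response time is approximately time to first token plus output tokens divided by throughput (`estimateResponseMs` is a hypothetical helper, not part of any API):

```typescript
// Rough response-time estimate: TTFT plus generation time at a given
// throughput. Defaults use the ~80 TPS / ~300 ms TTFT figures above.
function estimateResponseMs(
  outputTokens: number,
  tps: number = 80,
  ttftMs: number = 300
): number {
  return ttftMs + (outputTokens / tps) * 1000;
}
```

At those defaults, a 400-token response lands in roughly 5.3 seconds end to end.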
    <div>
      <h3>Greater model flexibility</h3>
      <a href="#greater-model-flexibility">
        
      </a>
    </div>
    <p>With our commitment to helping you run more powerful models faster, we are also expanding the breadth of models you can run on Workers AI with our Run Any* Model feature. Until now, we have manually curated and added only the most popular open source models to Workers AI. Now, we are opening up our catalog to the public, giving you the flexibility to choose from a broader selection of models. At the start, we will support models that are compatible with our GPUs and inference stack (hence the asterisk in Run Any* Model). We’re launching this feature in closed beta; if you’d like to try it out, please fill out the <a href="https://forms.gle/h7FcaTF4Zo5dzNb68"><u>form</u></a> so we can grant you access.</p><p>The Workers AI model catalog will now be split into two parts: a static catalog and a dynamic catalog. Models in the static catalog will remain curated by Cloudflare and will include the most popular open source models with guarantees on availability and speed (the models listed above). These models will always be kept warm in our network, ensuring you don’t experience cold starts. The usage and pricing model remains serverless, where you are only charged for the requests to the model and not the cold start times.</p><p>Models that are launched via Run Any* Model will make up the dynamic catalog. If the model is public, users can share an instance of that model. In the future, we will allow users to launch private instances of models as well.</p><p>This is just the first step towards running your own custom or private models on Workers AI. While we have already been supporting private models for select customers, we are working on making this capability available to everyone in the near future.</p>
    <div>
      <h3>New Workers AI pricing</h3>
      <a href="#new-workers-ai-pricing">
        
      </a>
    </div>
    <p>We launched Workers AI during Birthday Week 2023 with the concept of “neurons” for pricing. Neurons were intended to simplify the unit of measure across various models on our platform, including text, image, audio, and more. However, over the past year, we have listened to your feedback and heard that neurons were difficult to grasp and challenging to compare with other providers. Additionally, the industry has matured, and new pricing standards have materialized. As such, we’re excited to announce that we will be moving towards unit-based pricing and saying goodbye to neurons.</p><p>Moving forward, Workers AI will be priced based on model task, size, and units. LLMs will be priced based on the model size (parameters) and input/output tokens. Image generation models will be priced based on the output image resolution and the number of steps. Embeddings models will be priced based on input tokens. Speech-to-text models will be priced on seconds of audio input. </p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Model Task</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Units</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Model Size</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Pricing</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td rowspan="5">
                        <p><span><span>LLMs (incl. Vision models)</span></span></p>
                    </td>
                    <td rowspan="5">
                        <p><span><span>Tokens in/out (blended)</span></span></p>
                    </td>
                    <td>
                        <p><span><span>&lt;= 3B parameters</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.10 per Million Tokens</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>3.1B - 8B</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.15 per Million Tokens</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>8.1B - 20B</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.20 per Million Tokens</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>20.1B - 40B</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.50 per Million Tokens</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>40.1B+</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.75 per Million Tokens</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td rowspan="2">
                        <p><span><span>Embeddings</span></span></p>
                    </td>
                    <td rowspan="2">
                        <p><span><span>Tokens in</span></span></p>
                    </td>
                    <td>
                        <p><span><span>&lt;= 150M parameters</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.008 per Million Tokens</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>151M+ parameters</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.015 per Million Tokens</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Speech-to-text</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Audio seconds in</span></span></p>
                    </td>
                    <td>
                        <p><span><span>N/A</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.0039 per minute of audio input</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Image Size</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Model Type</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Steps</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Price</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td rowspan="2">
                        <p><span><span>&lt;=256x256</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Standard</span></span></p>
                    </td>
                    <td>
                        <p><span><span>25</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.00125 per 25 steps</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Fast</span></span></p>
                    </td>
                    <td>
                        <p><span><span>5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.00025 per 5 steps</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td rowspan="2">
                        <p><span><span>&lt;=512x512</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Standard</span></span></p>
                    </td>
                    <td>
                        <p><span><span>25</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.0025 per 25 steps</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Fast</span></span></p>
                    </td>
                    <td>
                        <p><span><span>5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.0005 per 5 steps</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td rowspan="2">
                        <p><span><span>&lt;=1024x1024</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Standard</span></span></p>
                    </td>
                    <td>
                        <p><span><span>25</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.005 per 25 steps</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Fast</span></span></p>
                    </td>
                    <td>
                        <p><span><span>5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.001 per 5 steps</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td rowspan="2">
                        <p><span><span>&lt;=2048x2048</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Standard</span></span></p>
                    </td>
                    <td>
                        <p><span><span>25</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.01 per 25 steps</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Fast</span></span></p>
                    </td>
                    <td>
                        <p><span><span>5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.002 per 5 steps</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>Over the past few months, we paused graduating models and announcing pricing for beta models as we prepared for this new pricing change. We’ll be graduating all models to this new pricing, and billing will take effect on October 1, 2024.</p><p>Our free tier has been reworked to fit these new metrics, and will include an allotment of usage across all the task types.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Model</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Free tier size</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Text Generation - LLM</span></span></p>
                    </td>
                    <td>
                        <p><span><span>10,000 tokens a day across any model size</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Embeddings</span></span></p>
                    </td>
                    <td>
                        <p><span><span>10,000 tokens a day across any model size</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Images</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Sum of 250 steps, up to 1024x1024 resolution</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Whisper</span></span></p>
                    </td>
                    <td>
                        <p><span><span>10 minutes of audio a day</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
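The LLM tiers in the table above can be sketched as a small lookup. `llmTiers` and `llmCostUSD` are illustrative names for this sketch, not a Cloudflare billing API:

```typescript
// Unit-based LLM pricing from the table above: blended input+output tokens,
// priced per million tokens by model-size tier (sizes in billions of params).
const llmTiers: { maxB: number; perMTok: number }[] = [
  { maxB: 3, perMTok: 0.10 },        // <= 3B parameters
  { maxB: 8, perMTok: 0.15 },        // 3.1B - 8B
  { maxB: 20, perMTok: 0.20 },       // 8.1B - 20B
  { maxB: 40, perMTok: 0.50 },       // 20.1B - 40B
  { maxB: Infinity, perMTok: 0.75 }, // 40.1B+
];

function llmCostUSD(modelSizeB: number, blendedTokens: number): number {
  const tier = llmTiers.find((t) => modelSizeB <= t.maxB)!;
  return (blendedTokens / 1_000_000) * tier.perMTok;
}
```

For example, a 70B model (the 40.1B+ tier) processing 2 million blended tokens would cost 2 × $0.75 = $1.50.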
    <div>
      <h3>Optimizing AI workflows with AI Gateway</h3>
      <a href="#optimizing-ai-workflows-with-ai-gateway">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6sLY6zUP6vDdnk1FNJfBBe/9a9e8df1f608b1540175302300ae9bc0/image7.png" />
          </figure><p><a href="https://developers.cloudflare.com/ai-gateway/"><u>AI Gateway</u></a> is designed to help developers and organizations building AI applications better monitor, control, and optimize their AI usage, and thanks to our users, AI Gateway has reached an incredible milestone — over 2 billion requests proxied by September 2024, less than a year after its inception. But we are not stopping there.</p><p><b>Persistent logs (open beta)</b></p><p><a href="https://developers.cloudflare.com/ai-gateway/observability/logging/"><u>Persistent logs</u></a> allow developers to store and analyze user prompts and model responses for extended periods, up to 10 million logs per gateway. Each request made through AI Gateway creates a log, where you can see the details of the request, including timestamp, request status, model, and provider.</p><p>We have revamped our logging interface to offer more detailed insights, including cost and duration. You can now annotate logs with human feedback using thumbs up and thumbs down, and you can filter, search, and tag logs with <a href="https://developers.cloudflare.com/ai-gateway/configuration/custom-metadata/"><u>custom metadata</u></a> to further streamline analysis directly within AI Gateway.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18OovOZzlAkoKvMIgFJ1kR/dbb6b809fb063b2d918b2355cbf11ea3/image1.png" />
          </figure><p>Persistent logs are available to use on <a href="https://developers.cloudflare.com/ai-gateway/pricing/"><u>all plans</u></a>, with a free allocation for both free and paid plans. On the Workers Free plan, users can store up to 100,000 logs total across all gateways at no charge. For those needing more storage, upgrading to the Workers Paid plan will give you a higher free allocation — 200,000 logs stored total. Any additional logs beyond those limits will be available at $8 per 100,000 logs stored per month, giving you the flexibility to store logs for your preferred duration and do more with valuable data. Billing for this feature will be implemented when the feature reaches General Availability, and we’ll provide plenty of advance notice.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Workers Free</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Workers Paid</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Enterprise</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Included Volume</span></span></p>
                    </td>
                    <td>
                        <p><span><span>100,000 logs stored (total)</span></span></p>
                    </td>
                    <td>
                        <p><span><span>200,000 logs stored (total)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Additional Logs</span></span></p>
                    </td>
                    <td>
                        <p><span><span>N/A</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$8 per 100,000 logs stored per month</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p><b>Export logs with Logpush</b></p><p>For users looking to export their logs, AI Gateway now supports log export via <a href="https://developers.cloudflare.com/ai-gateway/observability/logging/logpush"><u>Logpush</u></a>. With Logpush, you can automatically push logs out of AI Gateway into your preferred storage provider, including Cloudflare R2, Amazon S3, Google Cloud Storage, and more. This can be especially useful for compliance or advanced analysis outside the platform. Logpush follows its <a href="https://developers.cloudflare.com/workers/observability/logging/logpush/"><u>existing pricing model</u></a> and will be available to all users on a paid plan.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6uazGQNezknc5P9kVyr9gr/1da3b3897c9f6376ea4983b2d267b405/image2.png" />
          </figure><p><b>AI evaluations</b></p><p>We are also taking our first step towards comprehensive <a href="https://developers.cloudflare.com/ai-gateway/evaluations/"><u>AI evaluations</u></a>, starting with evaluations using human-in-the-loop feedback (now in open beta). Users can create datasets from logs to score and evaluate model performance, speed, and cost, initially focused on LLMs. Evaluations will allow developers to gain a better understanding of how their application is performing, ensuring better accuracy, reliability, and customer satisfaction. We’ve added support for <a href="https://developers.cloudflare.com/ai-gateway/observability/costs/"><u>cost analysis</u></a> across many new models and providers to enable developers to make informed decisions, including the ability to add <a href="https://developers.cloudflare.com/ai-gateway/configuration/custom-costs/"><u>custom costs</u></a>. Future enhancements will include automated scoring using LLMs, comparing performance of multiple models, and prompt evaluations, helping developers make decisions on what is best for their use case and ensuring their applications are both efficient and cost-effective.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5dyhxoR6KEsM8uh371XnDN/5eab93923157fd59112ffdea14b3bb2f/image3.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/21DCTbhFEh7u4m1d0Tfgmn/2839e2ae7d226fdcc4086f108f5c9612/image6.png" />
          </figure>
    <div>
      <h3>Vectorize GA</h3>
      <a href="#vectorize-ga">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/DjhP2xqOhPMP7oQK5Mdpa/c216167d0a204f344afd2ff7393d97f9/image4.png" />
          </figure><p>We've completely redesigned Vectorize since our <a href="https://blog.cloudflare.com/vectorize-vector-database-open-beta/"><u>initial announcement</u></a> in 2023 to better serve customer needs. Vectorize (v2) now supports <b>indexes of up to 5 million vectors</b> (up from 200,000), <b>delivers faster queries</b> (median latency is down 95%, from 500 ms to 30 ms), and <b>returns up to 100 results per query</b> (increased from 20). These improvements significantly enhance Vectorize's capacity, speed, and depth of results.</p><p>Note: if you got started on Vectorize before GA, a migration solution to ease the move from v1 to v2 will be available in early Q4 — stay tuned!</p>
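<p>The larger result window is exposed directly through the Workers binding. As a minimal sketch (the binding name and vector dimensionality are illustrative, and must match your own index configuration), querying a v2 index with the raised limit looks like this:</p>

```javascript
// Sketch: querying a Vectorize v2 index through a Workers binding.
// The query vector's dimensionality must match the index configuration.
async function findSimilar(index, queryVector, topK = 100) {
  // v2 raises the per-query result limit to 100 (up from 20).
  const result = await index.query(queryVector, { topK });
  // Matches come back sorted by similarity score.
  return result.matches.map((m) => ({ id: m.id, score: m.score }));
}
```

<p>In a Worker, <code>index</code> would be the Vectorize binding configured in <code>wrangler.toml</code>, e.g. <code>env.VECTORIZE_INDEX</code>.</p>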
    <div>
      <h3>New Vectorize pricing</h3>
      <a href="#new-vectorize-pricing">
        
      </a>
    </div>
    <p>Not only have we improved performance and scalability, but we've also made Vectorize one of the most cost-effective options on the market. We've reduced query prices by 75% and storage costs by 98%.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>New Vectorize pricing</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Old Vectorize pricing</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Price reduction</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Writes</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>Free</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Free</span></span></p>
                    </td>
                    <td>
                        <p><span><span>n/a</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Query</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>$.01 per 1 million vector dimensions</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.04 per 1 million vector dimensions</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>75%</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Storage</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.05 per 100 million vector dimensions</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$4.00 per 100 million vector dimensions</span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>98%</strong></span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>You can learn more about our pricing in the <a href="https://developers.cloudflare.com/vectorize/platform/pricing/"><u>Vectorize docs</u></a>.</p><p><b>Vectorize free tier</b></p><p>There’s more good news: we’re introducing a free tier to Vectorize to make it easy to experiment with our full AI stack.</p><p>The free tier includes:</p><ul><li><p>30 million <b>queried</b> vector dimensions / month</p></li><li><p>5 million <b>stored</b> vector dimensions / month</p></li></ul>
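<p>Because both queries and storage are billed by vector dimensions, a quick back-of-the-envelope calculation shows what the new rates mean in practice. The sketch below uses the prices from the table above; the index size and query volume are made-up examples:</p>

```javascript
// Back-of-the-envelope Vectorize cost estimate using the rates above:
//   queries: $0.01 per 1 million queried vector dimensions
//   storage: $0.05 per 100 million stored vector dimensions
function estimateMonthlyCost({ vectors, dimensions, queriesPerMonth }) {
  const storedDims = vectors * dimensions;
  const queriedDims = queriesPerMonth * dimensions;
  const storageCost = (storedDims / 100_000_000) * 0.05;
  const queryCost = (queriedDims / 1_000_000) * 0.01;
  return { storageCost, queryCost, total: storageCost + queryCost };
}

// Example: 1M vectors of 768 dimensions, 100k queries/month.
// Stored: 768M dims -> $0.384/month; queried: 76.8M dims -> $0.768/month.
const example = estimateMonthlyCost({
  vectors: 1_000_000,
  dimensions: 768,
  queriesPerMonth: 100_000,
});
```

<p>Note that the free tier above (30M queried and 5M stored dimensions per month) would cover a meaningful slice of a workload this size before any billing kicks in.</p>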
    <div>
      <h3>How fast is Vectorize?</h3>
      <a href="#how-fast-is-vectorize">
        
      </a>
    </div>
    <p>To measure performance, we conducted benchmarking tests by executing a large number of vector similarity queries as quickly as possible. We measured both request latency and result precision. In this context, precision refers to the proportion of query results that match the known true-closest results for all benchmarked queries. This approach allows us to assess both the speed and accuracy of our vector similarity search capabilities. Here are the datasets we benchmarked on:</p><ul><li><p><a href="https://github.com/qdrant/vector-db-benchmark"><b><u>dbpedia-openai-1M-1536-angular</u></b></a>: 1 million vectors, 1536 dimensions, queried with cosine similarity at a top K of 10</p></li><li><p><a href="https://myscale.github.io/benchmark"><b><u>Laion-768-5m-ip</u></b></a>: 5 million vectors, 768 dimensions, queried with cosine similarity at a top K of 10</p><ul><li><p>We ran this again, skipping the result-refinement pass to return approximate results faster</p></li></ul></li></ul><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Benchmark dataset</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>P50 (ms)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>P75 (ms)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>P90 (ms)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>P95 (ms)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Throughput (RPS)</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Precision</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>dbpedia-openai-1M-1536-angular</span></span></p>
                    </td>
                    <td>
                        <p><span><span>31</span></span></p>
                    </td>
                    <td>
                        <p><span><span>56</span></span></p>
                    </td>
                    <td>
                        <p><span><span>159</span></span></p>
                    </td>
                    <td>
                        <p><span><span>380</span></span></p>
                    </td>
                    <td>
                        <p><span><span>343</span></span></p>
                    </td>
                    <td>
                        <p><span><span>95.4%</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Laion-768-5m-ip </span></span></p>
                    </td>
                    <td>
                        <p><span><span>81.5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>91.7</span></span></p>
                    </td>
                    <td>
                        <p><span><span>105</span></span></p>
                    </td>
                    <td>
                        <p><span><span>123</span></span></p>
                    </td>
                    <td>
                        <p><span><span>623</span></span></p>
                    </td>
                    <td>
                        <p><span><span>95.5%</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Laion-768-5m-ip w/o refinement</span></span></p>
                    </td>
                    <td>
                        <p><span><span>14.7</span></span></p>
                    </td>
                    <td>
                        <p><span><span>19.3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>24.3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>27.3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>698</span></span></p>
                    </td>
                    <td>
                        <p><span><span>78.9%</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>These benchmarks were conducted using a standard Vectorize v2 index, queried with a concurrency of 300 via a Cloudflare Worker binding. The reported latencies reflect those observed by the Worker binding querying the Vectorize index on warm caches, simulating the performance of an existing application with sustained usage.</p><p>Beyond Vectorize's fast query speeds, we believe the combination of Vectorize and Workers AI offers an unbeatable solution for delivering optimal AI application experiences. By running Vectorize close to the source of inference and user interaction, rather than combining AI and vector database solutions across providers, we can significantly minimize end-to-end latency.</p><p>With these improvements, we're excited to announce the general availability of the new Vectorize, which is more powerful, faster, and more cost-effective than ever before.</p>
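<p>The precision column in the table follows directly from the definition above: for each query, count how many returned IDs appear in the known true top-K neighbors, then average over all queries. A minimal sketch of that metric:</p>

```javascript
// Precision as defined above: the fraction of returned results that appear
// in the known true-closest results, across all benchmarked queries.
function precision(returnedIdsPerQuery, trueIdsPerQuery) {
  let matched = 0;
  let total = 0;
  for (let i = 0; i < returnedIdsPerQuery.length; i++) {
    const truth = new Set(trueIdsPerQuery[i]);
    for (const id of returnedIdsPerQuery[i]) {
      if (truth.has(id)) matched++;
    }
    total += returnedIdsPerQuery[i].length;
  }
  return matched / total;
}
```

<p>This is why skipping the refinement pass trades precision (78.9% vs. 95.5% on Laion-768-5m-ip) for lower latency: the approximate results overlap less with the true nearest neighbors.</p>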
    <div>
      <h3>Tying it all together: the AI platform for all your inference needs</h3>
      <a href="#tying-it-all-together-the-ai-platform-for-all-your-inference-needs">
        
      </a>
    </div>
    <p>Over the past year, we’ve been committed to building powerful AI products that enable users to build on us. While we are making advancements on each of these individual products, our larger vision is to provide a seamless, integrated experience across our portfolio.</p><p>With Workers AI and AI Gateway, users can easily enable analytics, logging, caching, and rate limiting to their AI application by connecting to AI Gateway directly through a binding in the Workers AI request. We imagine a future where AI Gateway can not only help you create and save datasets to use for fine-tuning your own models with Workers AI, but also seamlessly redeploy them on the same platform. A great AI experience is not just about speed, but also accuracy. While Workers AI ensures fast performance, using it in combination with AI Gateway allows you to evaluate and optimize that performance by monitoring model accuracy and catching issues, like hallucinations or incorrect formats. With AI Gateway, users can test out whether switching to new models in the Workers AI model catalog will deliver more accurate performance and a better user experience.</p><p>In the future, we’ll also be working on tighter integrations between Vectorize and Workers AI, where you can automatically supply context or remember past conversations in an inference call. 
This cuts down on the orchestration needed to run a <a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/">RAG application</a>, where we can automatically help you make queries to vector databases.</p><p>If we put the three products together, we imagine a world where you can build AI apps with <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">full observability</a> (traces with AI Gateway) and see how the retrieval (Vectorize) and generation (Workers AI) components are working together, enabling you to diagnose issues and improve performance.</p><p>This Birthday Week, we’ve been focused on making sure our individual products are best-in-class, but we’re also continuing to invest in building a holistic AI platform, not only within our AI portfolio, but also across the larger Developer Platform products. Our goal is to make sure that Cloudflare is the simplest, fastest, most powerful place for you to build full-stack AI experiences with all the batteries included.</p>
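<p>Connecting Workers AI to AI Gateway is a small change on the inference call itself. As a sketch (the gateway ID is a placeholder, and the <code>ai</code> argument stands in for the Worker's <code>env.AI</code> binding), the gateway is passed as an option on the request:</p>

```javascript
// Sketch: routing a Workers AI inference call through AI Gateway by passing
// a gateway option on the request. "my-gateway" is a placeholder ID.
async function generate(ai, prompt) {
  return ai.run(
    "@cf/meta/llama-3.1-8b-instruct",
    { messages: [{ role: "user", content: prompt }] },
    { gateway: { id: "my-gateway", skipCache: false } }
  );
}
```

<p>With that option set, the request picks up AI Gateway's analytics, logging, caching, and rate limiting without any separate proxy configuration.</p>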
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6nXZn8qwK1tCVVMFbYFf7n/fe538bed97b00ef1b74a05dfd86eb496/image5.png" />
          </figure><p>We’re excited for you to try out all these new features! Take a look at our <a href="https://developers.cloudflare.com/products/?product-group=AI"><u>updated developer docs</u></a> to learn how to get started, and head to the Cloudflare dashboard to interact with your account.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Vectorize]]></category>
            <category><![CDATA[AI Gateway]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Workers AI]]></category>
            <guid isPermaLink="false">2lS9TcgZHa1fubO371mYiv</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Kathy Liao</dc:creator>
            <dc:creator>Phil Wittig</dc:creator>
            <dc:creator>Meaghan Choi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Meta Llama 3.1 now available on Workers AI]]></title>
            <link>https://blog.cloudflare.com/meta-llama-3-1-available-on-workers-ai/</link>
            <pubDate>Tue, 23 Jul 2024 15:15:55 GMT</pubDate>
            <description><![CDATA[ Cloudflare is excited to be a launch partner with Meta to introduce Workers AI support for Llama 3.1 ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we’re big supporters of the open-source community – and that extends to our approach for <a href="https://developers.cloudflare.com/workers-ai/">Workers AI</a> models as well. Our strategy for our Cloudflare AI products is to provide a top-notch developer experience and toolkit that can help people build applications with open-source models.</p><p>We’re excited to be one of Meta’s launch partners to make their newest <a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md">Llama 3.1 8B model</a> available to all Workers AI users on Day 1. You can run their latest model by simply swapping out your model ID to <code>@cf/meta/llama-3.1-8b-instruct</code> or test out the model on our <a href="https://playground.ai.cloudflare.com">Workers AI Playground</a>. Llama 3.1 8B is free to use on Workers AI until the model graduates out of beta.</p><p>Meta’s Llama collection of models has consistently shown high-quality performance in areas like general knowledge, steerability, math, tool use, and multilingual translation. Workers AI is excited to continue to distribute and serve the Llama collection of models on our serverless inference platform, powered by our globally distributed GPUs.</p><p>The Llama 3.1 model is particularly exciting, as it is released in a higher precision (bfloat16), incorporates function calling, and adds support across 8 languages. Having multilingual support built in means that you can use Llama 3.1 to write prompts and receive responses directly in languages like English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Expanding model understanding to more languages means that your applications have a bigger reach across the world, and it’s all possible with just one model.</p>
            <pre><code>const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    stream: true,
    messages: [{
        "role": "user",
        "content": "Qu'est-ce que ç'est verlan en français?"
    }],
});</code></pre>
            <p>Llama 3.1 also introduces native function calling (also known as tool calls) which allows LLMs to generate structured JSON outputs which can then be fed into different APIs. This means that function calling is supported out-of-the-box, without the need for a fine-tuned variant of Llama that specializes in tool use. Having this capability built-in means that you can use one model across various tasks.</p><p>Workers AI recently announced <a href="/embedded-function-calling">embedded function calling</a>, which is now usable with Meta Llama 3.1 as well. Our embedded function calling gives developers a way to run their inference tasks far more efficiently than traditional architectures, leveraging Cloudflare Workers to reduce the number of requests that need to be made manually. It also makes use of our open-source <a href="https://www.npmjs.com/package/@cloudflare/ai-utils">ai-utils</a> package, which helps you orchestrate the back-and-forth requests for function calling along with other helper methods that can automatically generate tool schemas. Below is an example function call to Llama 3.1 with embedded function calling that then stores key-values in Workers KV.</p>
            <pre><code>const response = await runWithTools(env.AI, "@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: "Greet the user and ask them a question" }],
    tools: [{
        name: "Store in memory",
        description: "Store everything that the user talks about in memory as a key-value pair.",
        parameters: {
            type: "object",
            properties: {
                key: {
                    type: "string",
                    description: "The key to store the value under.",
                },
                value: {
                    type: "string",
                    description: "The value to store.",
                },
            },
            required: ["key", "value"],
        },
        function: async ({ key, value }) =&gt; {
                await env.KV.put(key, value);

                return JSON.stringify({
                    success: true,
                });
         }
    }]
})</code></pre>
            <p>We’re excited to see what you build with these new capabilities. As always, use of the new model should be conducted with Meta’s <a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md">Acceptable Use Policy</a> and <a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE">License</a> in mind. Take a look at our <a href="https://developers.cloudflare.com/workers-ai/models/llama-3.1-8b-instruct/">developer documentation</a> to get started!</p> ]]></content:encoded>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Open Source]]></category>
            <guid isPermaLink="false">Mmf9yB6m0SRgCJfyxvYK8</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Nikhil Kothari</dc:creator>
        </item>
        <item>
            <title><![CDATA[Embedded function calling in Workers AI: easier, smarter, faster]]></title>
            <link>https://blog.cloudflare.com/embedded-function-calling/</link>
            <pubDate>Thu, 27 Jun 2024 17:00:09 GMT</pubDate>
            <description><![CDATA[ Introducing a new way to do function calling in Workers AI by running function code alongside your inference. Plus, a new @cloudflare/ai-utils package to make getting started as simple as possible ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h2>Introducing embedded function calling and a new ai-utils package</h2>
      <a href="#introducing-embedded-function-calling-and-a-new-ai-utils-package">
        
      </a>
    </div>
    <p>Today, we’re excited to announce a novel way to do function calling that co-locates LLM inference with function execution, and a new ai-utils package that upgrades the developer experience for function calling.</p><p>This is a follow-up to our <a href="https://x.com/CloudflareDev/status/1803849609078284315">mid-June announcement for traditional function calling</a>, which allows you to leverage a Large Language Model (LLM) to intelligently generate structured outputs and pass them to an API call. Function calling has been largely adopted and standardized in the industry as a way for AI models to help perform actions on behalf of a user.</p><p>Our goal is to make building with AI as easy as possible, which is why we’re introducing a new <a href="https://www.npmjs.com/package/@cloudflare/ai-utils">@cloudflare/ai-utils</a> npm package that allows developers to get started quickly with embedded function calling. These helper tools drastically simplify your workflow by actually executing your function code and dynamically generating tools from OpenAPI specs. We’ve also open-sourced our ai-utils package, which you can find on <a href="https://github.com/cloudflare/ai-utils">GitHub</a>. With both embedded function calling and our ai-utils, you’re one step closer to creating intelligent AI agents, and from there, the possibilities are endless.</p>
    <div>
      <h2>Why Cloudflare’s AI platform?</h2>
      <a href="#why-cloudflares-ai-platform">
        
      </a>
    </div>
    <p>OpenAI has been the gold standard when it comes to having performant model inference and a great developer experience. However, they mostly support their closed-source models, while we want to also promote the open-source ecosystem of models. One of our goals with Workers AI is to match the developer experience you might get from OpenAI, but with open-source models.</p><p>There are other providers that serve open-source model inference, like <a href="https://azure.microsoft.com/en-us/solutions/ai">Azure</a> or <a href="https://aws.amazon.com/bedrock/">Bedrock</a>, but most of them are focused on serving inference and the underlying infrastructure, rather than being a developer toolkit. While there are external libraries and frameworks like AI SDK that help developers build quickly with simple abstractions, they rely on upstream providers to do the actual inference. With <a href="https://developers.cloudflare.com/workers-ai/">Workers AI</a>, it’s the best of both worlds – we offer open-source model inference and a killer developer experience out of the box.</p>
    <div>
      <h2>How does traditional function calling work?</h2>
      <a href="#how-does-traditional-function-calling-work">
        
      </a>
    </div>
    <p>Traditional LLM function calling allows customers to specify a set of function names and required arguments along with a prompt when running inference on an LLM. The LLM returns the names and arguments of the functions, which the customer can then call to perform actions. These actions give LLMs the ability to do things like fetch fresh data not present in the training dataset and "perform actions" based on user intent.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2vZ1PHwC5e6RUkxQRKfBW0/1812907a22283b5eef02e92c747e8a73/image3-15.png" />
            
            </figure><p>Traditional function calling requires multiple back-and-forth requests passing through the network in order to get to the final output. This includes requests to your origin server, an inference provider, and external APIs. As a developer, you have to orchestrate all the back-and-forths and handle all the requests and responses. If you were building complex agents with multi-tool calls or recursive tool calls, it gets infinitely harder. Fortunately, this doesn’t have to be the case, and we’ve solved it for you.</p>
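<p>The back-and-forth described above can be made concrete as the loop a developer would otherwise write by hand. This is a simplified sketch, not a specific provider's API: <code>callModel</code> and the <code>tool_calls</code> response shape are assumptions standing in for whatever inference service and tool schema you use:</p>

```javascript
// Sketch of the manual orchestration loop traditional function calling requires.
// Each iteration is at least one network round trip to the inference provider,
// plus one per tool call (often to an external API).
async function orchestrate(callModel, tools, messages, maxRounds = 5) {
  for (let round = 0; round < maxRounds; round++) {
    // 1. Round trip to the inference provider.
    const reply = await callModel(messages, tools);
    if (!reply.tool_calls || reply.tool_calls.length === 0) {
      return reply.content; // the model produced a final answer
    }
    // 2. Execute each requested tool ourselves.
    for (const call of reply.tool_calls) {
      const tool = tools.find((t) => t.name === call.name);
      const result = await tool.function(call.arguments);
      // 3. Feed the result back to the model and repeat.
      messages.push({ role: "tool", name: call.name, content: result });
    }
  }
  throw new Error("tool-call recursion limit reached");
}
```

<p>Every step of this loop is the developer's responsibility, and it compounds with multi-tool and recursive tool calls. Embedded function calling, described next, collapses these round trips into a single execution environment.</p>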
    <div>
      <h2>Embedded function calling</h2>
      <a href="#embedded-function-calling">
        
      </a>
    </div>
    <p>With Workers AI, our inference runtime is the Workers platform, and the Workers platform can be seen as a global compute network of distributed functions (RPCs). With this model, we can run inference using Workers AI, and supply not only the function names and arguments, but also the runtime function code to be executed. Rather than performing multiple round-trips across networks, the LLM inference and function can run in the same execution environment, cutting out all the unnecessary requests.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15yHmiG9OmRj9KkuLHteRm/2c50ad90b4736fbc7496a8495d58e1fe/image1-23.png" />
            
            </figure><p>Cloudflare is one of the few inference providers that is able to do this because we offer more than just inference – our developer platform has compute, storage, inference, and more, all within the same Workers runtime.</p>
    <div>
      <h3>We made it easy for you with a new ai-utils package</h3>
      <a href="#we-made-it-easy-for-you-with-a-new-ai-utils-package">
        
      </a>
    </div>
    <p>And to make it as simple as possible, we created a <a href="https://www.npmjs.com/package/@cloudflare/ai-utils"><code>@cloudflare/ai-utils</code></a> package that you can use to get started. These powerful abstractions cut down on the logic you have to implement to do function calling – it just works out of the box.</p>
    <div>
      <h3>runWithTools</h3>
      <a href="#runwithtools">
        
      </a>
    </div>
    <p><code>runWithTools</code> is the method you use to do embedded function calling. You pass in your AI binding (<code>env.AI</code>), model, prompt messages, and tools. The tools array includes the description of the function, similar to traditional function calling, but you also pass in the function code that needs to be executed. This method makes the inference calls and executes the function code in one single step. <code>runWithTools</code> is also able to handle multiple function calls, recursive tool calls, validation for model responses, streaming for the final response, and other features.</p><p>Another feature to call out is a helper method called <code>autoTrimTools</code> that automatically selects the relevant tools and trims the tools array based on the names and descriptions. We do this by adding an initial LLM inference call to intelligently trim the tools array before the actual function-calling inference call is made. We found that <code>autoTrimTools</code> helped decrease the number of total tokens used in the entire process (especially when there’s a large number of tools provided) because there are significantly fewer input tokens used when generating the arguments list. You can choose to use <code>autoTrimTools</code> by setting it as a parameter in the <code>runWithTools</code> method.</p>
            <pre><code>const response = await runWithTools(env.AI, "@hf/nousresearch/hermes-2-pro-mistral-7b",
  {
    messages: [{ role: "user", content: "What's the weather in Austin, Texas?"}],
    tools: [
      {
        name: "getWeather",
        description: "Return the weather for a latitude and longitude",
        parameters: {
          type: "object",
          properties: {
            latitude: {
              type: "string",
              description: "The latitude for the given location"
            },
            longitude: {
              type: "string",
              description: "The longitude for the given location"
            }
          },
          required: ["latitude", "longitude"]
        },
        // function code to be executed after the tool call
        function: async ({ latitude, longitude }) =&gt; {
          const url = `https://api.weatherapi.com/v1/current.json?key=${env.WEATHERAPI_TOKEN}&amp;q=${latitude},${longitude}`
          const res = await fetch(url).then((res) =&gt; res.json())

          return JSON.stringify(res)
        }
      }
    ]
  },
  {
    streamFinalResponse: true,
    maxRecursiveToolRuns: 5,
    trimFunction: autoTrimTools,
    verbose: true,
    strictValidation: true
  }
)</code></pre>
            
    <div>
      <h3>createToolsFromOpenAPISpec</h3>
      <a href="#createtoolsfromopenapispec">
        
      </a>
    </div>
    <p>For many use cases, users need to call an external API during function calling to get the output they need. Instead of hardcoding the exact API endpoints in your tools array, we made a helper function that takes in an OpenAPI spec and dynamically generates the corresponding tool schemas and API endpoints you’ll need for the function call. You call <code>createToolsFromOpenAPISpec</code> from within <code>runWithTools</code>, and it dynamically populates everything for you.</p>
            <pre><code>const response = await runWithTools(env.AI, "@hf/nousresearch/hermes-2-pro-mistral-7b", {
  messages: [{ role: "user", content: "Can you name me 5 repos created by Cloudflare" }],
  tools: [
    ...(await createToolsFromOpenAPISpec(
      "https://raw.githubusercontent.com/github/rest-api-description/main/descriptions-next/api.github.com/api.github.com.json"
    ))
  ]
})</code></pre>
            
    <div>
      <h2>Putting it all together</h2>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>When you make a function calling inference request with <code>runWithTools</code> and <code>createToolsFromOpenAPISpec</code>, the only thing you need is the prompts – the rest is automatically handled. The LLM will choose the correct tool based on the prompt, the runtime will execute the function needed, and you’ll get a fast, intelligent response from the model. By leveraging our Workers runtime’s bindings and RPC calls along with our global network, we can execute everything from a single location close to the user, enabling developers to easily write complex agentic chains with fewer lines of code.</p><p>We’re super excited to help people build intelligent AI systems with our new embedded function calling and powerful tools. Check out our <a href="https://developers.cloudflare.com/workers-ai/function-calling/">developer docs</a> on how to get started, and let us know what you think on <a href="https://discord.cloudflare.com">Discord</a>.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Internship Experience]]></category>
            <guid isPermaLink="false">28DvCOWrN5gKA9IBZqrdc3</guid>
            <dc:creator>Harley Turan</dc:creator>
            <dc:creator>Dhravya Shah</dc:creator>
            <dc:creator>Michelle Chen</dc:creator>
        </item>
        <item>
            <title><![CDATA[AI Gateway is generally available: a unified interface for managing and scaling your generative AI workloads]]></title>
            <link>https://blog.cloudflare.com/ai-gateway-is-generally-available/</link>
            <pubDate>Wed, 22 May 2024 13:00:17 GMT</pubDate>
            <description><![CDATA[ AI Gateway is an AI ops platform that provides speed, reliability, and observability for your AI applications. With a single line of code, you can unlock powerful features including rate limiting, custom caching, real-time logs, and aggregated analytics across multiple providers ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5GsB2wwIevC3G2m0PGOAhz/d9eaeea0933d269b39fcda70c22881b7/image4-3.png" />
            
            </figure><p>During Developer Week in April 2024, we announced General Availability of <a href="/workers-ai-ga-huggingface-loras-python-support">Workers AI</a>, and today, we are excited to announce that AI Gateway is Generally Available as well. Since its launch to beta <a href="/announcing-ai-gateway">in September 2023 during Birthday Week</a>, we’ve proxied over 500 million requests and are now prepared for you to use it in production.</p><p>AI Gateway is an AI ops platform that offers a unified interface for managing and scaling your generative AI workloads. At its core, it acts as a proxy between your service and your inference provider(s), regardless of where your model runs. With a single line of code, you can unlock a set of powerful features focused on performance, security, reliability, and observability – think of it as your <a href="https://www.cloudflare.com/learning/network-layer/what-is-the-control-plane/">control plane</a> for your AI ops. And this is just the beginning – we have a roadmap full of exciting features planned for the near future, making AI Gateway the tool for any organization looking to get more out of their AI workloads.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6M6hDWXdRH2rZETQK4UlPe/444269e8d23056252e9e17aa08cef333/image6-1.png" />
            
            </figure>
    <div>
      <h2>Why add a proxy and why Cloudflare?</h2>
      <a href="#why-add-a-proxy-and-why-cloudflare">
        
      </a>
    </div>
    <p>The AI space moves fast, and it seems like every day there is a new model, provider, or framework. Given this high rate of change, it’s hard to keep track, especially if you’re using more than one model or provider. And that’s one of the driving factors behind launching AI Gateway – we want to provide you with a single consistent control plane for all your models and tools, even if they change tomorrow, and then again the day after that.</p><p>We've talked to a lot of developers and organizations building AI applications, and one thing is clear: they want more <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a>, control, and tooling around their AI ops. This is something many of the AI providers are lacking as they are deeply focused on model development and less so on platform features.</p><p>Why choose Cloudflare for your AI Gateway? Well, in some ways, it feels like a natural fit. We've spent the last 10+ years helping build a better Internet by running one of the largest global networks, helping customers around the world with performance, reliability, and security – Cloudflare is used as a <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">reverse proxy</a> by nearly 20% of all websites. With our expertise, it felt like a natural progression – change one line of code, and we can help with observability, reliability, and control for your AI applications – all in one control plane – so that you can get back to building.</p><p>Here is that one line code change using the OpenAI JS SDK. And check out <a href="https://developers.cloudflare.com/ai-gateway/providers/">our docs</a> to reference other providers, SDKs, and languages.</p>
            <pre><code>import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: 'my api key', // defaults to process.env["OPENAI_API_KEY"]
  baseURL: "https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_slug}/openai"
});</code></pre>
    <div>
      <h2>What’s included today?</h2>
      <a href="#whats-included-today">
        
      </a>
    </div>
    <p>After talking to customers, it was clear that we needed to focus on some foundational features before moving on to more advanced ones. While we're really excited about what’s to come, here are the key features available in GA today:</p><p><b>Analytics</b>: Aggregate metrics from across multiple providers. See traffic patterns and usage including the number of requests, tokens, and costs over time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3gFXixQSV6rVUM9V6ew1W4/db974469f45415b7ae0f0af45c30e7f3/pasted-image-0--10-.png" />
            
            </figure><p><b>Real-time logs:</b> Gain insight into requests and errors as you build.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/31KebDSmQfi9lW87mh3oZy/541a90575637dc860e1ef28972958ed4/image8-1.png" />
            
            </figure><p><b>Caching:</b> Enable custom caching rules and use Cloudflare’s cache for repeat requests instead of hitting the original model provider API, helping you save on cost and latency.</p>
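To make the caching behavior concrete, here is a minimal sketch of controlling the cache per request through a gateway request header. It assumes the <code>cf-aig-cache-ttl</code> header documented for AI Gateway; the helper function name and the URL placeholders are illustrative, not part of any Cloudflare API:

```javascript
// Minimal sketch: ask AI Gateway to cache identical requests for a given TTL.
// Assumes the `cf-aig-cache-ttl` request header from the AI Gateway docs;
// `buildCachedRequest` and the URL placeholders are illustrative.
const gatewayUrl =
  "https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_slug}/openai/chat/completions";

function buildCachedRequest(apiKey, body, ttlSeconds) {
  return new Request(gatewayUrl, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
      // cache hits are served by the gateway and never reach the provider
      "cf-aig-cache-ttl": String(ttlSeconds),
    },
    body: JSON.stringify(body),
  });
}
```

Repeat requests with the same body within the TTL are then answered from Cloudflare’s cache, saving both provider cost and round-trip latency.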
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2bZw1HaJUP48B3MbXiATpx/0e7ee230a8b1c62e782efd466177fb5f/image1-10.png" />
            
            </figure><p><b>Rate limiting:</b> Control how your application scales by limiting the number of requests your application receives to control costs or prevent abuse.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4icXzN7Z8VuZw17KdzKl2X/60466c7cbe3869c14aa7a7ad90c40159/image5-9.png" />
            
            </figure><p><b>Support for your favorite providers:</b> AI Gateway now natively supports Workers AI plus 10 of the most popular providers, including <a href="https://x.com/CloudflareDev/status/1791204770394648901">Groq and Cohere</a> as of mid-May 2024.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ORhtmLzCTOKLVrCyhyEZK/53be2a20c4d6bd7dd3cdcd2657ef6455/image2-10.png" />
            
            </figure><p><b>Universal endpoint:</b> In case of errors, improve resilience by defining <a href="https://developers.cloudflare.com/ai-gateway/configuration/fallbacks/">request fallbacks</a> to another model or inference provider.</p>
            <pre><code>curl https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_slug} -X POST \
  --header 'Content-Type: application/json' \
  --data '[
  {
    "provider": "workers-ai",
    "endpoint": "@cf/meta/llama-2-7b-chat-int8",
    "headers": {
      "Authorization": "Bearer {cloudflare_token}",
      "Content-Type": "application/json"
    },
    "query": {
      "messages": [
        {
          "role": "system",
          "content": "You are a friendly assistant"
        },
        {
          "role": "user",
          "content": "What is Cloudflare?"
        }
      ]
    }
  },
  {
    "provider": "openai",
    "endpoint": "chat/completions",
    "headers": {
      "Authorization": "Bearer {open_ai_token}",
      "Content-Type": "application/json"
    },
    "query": {
      "model": "gpt-3.5-turbo",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "What is Cloudflare?"
        }
      ]
    }
  }
]'</code></pre>
    <div>
      <h2>What’s coming up?</h2>
      <a href="#whats-coming-up">
        
      </a>
    </div>
    <p>We've gotten a lot of feedback from developers, and there are some obvious things on the horizon such as persistent logs and custom metadata – foundational features that will help unlock the real magic down the road.</p><p>But let's take a step back for a moment and share our vision. At Cloudflare, we believe our platform is much more powerful as a unified whole than as a collection of individual parts. This mindset applied to our AI products means that they should be easy to use, combine, and run in harmony.</p><p>Let's imagine the following journey. You initially onboard onto Workers AI to run inference with the latest open-source models. Next, you enable AI Gateway to gain better visibility and control, and start storing persistent logs. Then you want to start tuning your inference results, so you leverage your persistent logs, our prompt management tools, and our built-in eval functionality. Now you're making analytical decisions to improve your inference results. With each data-driven improvement, you want more. So you implement our feedback API, which helps annotate inputs/outputs, in essence building a structured dataset. At this point, you are one step away from a one-click fine-tune that can be deployed instantly to our global network, and it doesn't stop there. As you continue to collect logs and feedback, you can continuously rebuild your fine-tune adapters in order to deliver the best results to your end users.</p><p>This is all just an aspirational story at this point, but this is how we envision the future of AI Gateway and our AI suite as a whole. You should be able to start with the most basic setup and gradually progress into more advanced workflows, all without leaving <a href="https://www.cloudflare.com/ai-solution/">Cloudflare’s AI platform</a>. In the end, it might not look exactly as described above, but you can be sure that we are committed to providing the best AI ops tools to help make Cloudflare the best place for AI.</p>
    <div>
      <h2>How do I get started?</h2>
      <a href="#how-do-i-get-started">
        
      </a>
    </div>
    <p>AI Gateway is available to use today on all plans. If you haven’t yet used AI Gateway, check out our <a href="https://developers.cloudflare.com/ai-gateway/">developer documentation</a> and get started now. AI Gateway’s core features available today are offered for free, and all it takes is a Cloudflare account and one line of code to get started. In the future, more premium features, such as persistent logging and secrets management, will be available subject to fees. If you have any questions, reach out on our <a href="http://discord.cloudflare.com">Discord channel</a>.</p>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <category><![CDATA[AI Gateway]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">3EErej51Xbc8xOYpGL8ggy</guid>
            <dc:creator>Kathy Liao</dc:creator>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Phil Wittig</dc:creator>
        </item>
        <item>
            <title><![CDATA[Meta Llama 3 available on Cloudflare Workers AI]]></title>
            <link>https://blog.cloudflare.com/meta-llama-3-available-on-cloudflare-workers-ai/</link>
            <pubDate>Thu, 18 Apr 2024 20:58:33 GMT</pubDate>
            <description><![CDATA[ We are thrilled to give developers around the world the ability to build AI applications with Meta Llama 3 using Workers AI. We are proud to be a launch partner with Meta for their newest 8B Llama 3 model ]]></description>
            <content:encoded><![CDATA[ <p></p><p>We are thrilled to give developers around the world the ability to build AI applications with Meta Llama 3 using Workers AI. We are proud to be a launch partner with Meta for their newest 8B Llama 3 model, and excited to continue our partnership to bring the best of open-source models to our inference platform.</p>
    <div>
      <h2>Workers AI</h2>
      <a href="#workers-ai">
        
      </a>
    </div>
    <p><a href="/workers-ai">Workers AI’s initial launch</a> in beta included support for Llama 2, as it was one of the most requested open-source models from the developer community. Since that initial launch, we’ve seen developers build all kinds of innovative applications including knowledge-sharing <a href="https://workers.cloudflare.com/built-with/projects/ai.moda/">chatbots</a>, creative <a href="https://workers.cloudflare.com/built-with/projects/Audioflare/">content generation</a>, and automation for <a href="https://workers.cloudflare.com/built-with/projects/Azule/">various workflows</a>.</p><p>At Cloudflare, we know developers want simplicity and flexibility, with the ability to build with multiple AI models while optimizing for accuracy, performance, and cost, among other factors. Our goal is to make it as easy as possible for developers to use their models of choice without having to worry about the complexities of hosting or deploying models.</p><p>As soon as we learned about the development of Llama 3 from our partners at Meta, we knew developers would want to start building with it as quickly as possible. Workers AI’s serverless inference platform makes it extremely easy and cost-effective to start using the latest large language models (LLMs). Meta’s commitment to developing and growing an open AI ecosystem makes it possible for customers of all sizes to use AI at scale in production. All it takes is a few lines of code to run inference with Llama 3:</p>
            <pre><code>export interface Env {
  // If you set another name in wrangler.toml as the value for 'binding',
  // replace "AI" with the variable name you defined.
  AI: any;
}

export default {
  async fetch(request: Request, env: Env) {
    const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
      messages: [
        { role: "user", content: "What is the origin of the phrase Hello, World?" }
      ]
    });

    return new Response(JSON.stringify(response));
  },
};</code></pre>
            
    <div>
      <h2>Built with Meta Llama 3</h2>
      <a href="#built-with-meta-llama-3">
        
      </a>
    </div>
    <p>Llama 3 offers leading performance on a wide range of industry benchmarks. You can learn more about the architecture and improvements on Meta’s <a href="https://ai.meta.com/blog/meta-llama-3/">blog post</a>. Cloudflare Workers AI supports <a href="https://developers.cloudflare.com/workers-ai/models/llama-3-8b-instruct/">Llama 3 8B</a>, including the instruction fine-tuned model.</p><p>Meta’s testing shows that Llama 3 is the most advanced open LLM today on <a href="https://github.com/meta-llama/llama3/blob/main/eval_details.md?cf_target_id=1F7E4663A460CE17F25CF8ADDF6AB9F1">evaluation benchmarks</a> such as MMLU, GPQA, HumanEval, GSM-8K, and MATH. Llama 3 was trained on an increased number of training tokens (15T), allowing the model to have a better grasp of language intricacies. The larger context window doubles the capacity of Llama 2 and allows the model to better understand lengthy passages with rich contextual data. Although the model supports a context window of 8k, we currently support only 2.8k, but are looking to support the full 8k context window through quantized models soon. The new model also introduces an efficient new <a href="https://github.com/openai/tiktoken">tiktoken</a>-based tokenizer with a vocabulary of 128k tokens, encoding more characters per token and achieving better performance on English and multilingual benchmarks. This means that there are 4 times as many parameters in the embedding and output layers, making the model larger than the previous Llama 2 generation of models.</p><p>Under the hood, Llama 3 uses <a href="https://arxiv.org/abs/2305.13245">grouped-query attention</a> (GQA), which improves inference efficiency for longer sequences and also renders their 8B model architecturally equivalent to <a href="https://developers.cloudflare.com/workers-ai/models/mistral-7b-instruct-v0.1/">Mistral-7B</a>.
For tokenization, it uses byte-level <a href="https://huggingface.co/learn/nlp-course/en/chapter6/5">byte-pair encoding (BPE)</a>, similar to OpenAI’s GPT tokenizers. This allows tokens to represent any arbitrary byte sequence — even those without a valid UTF-8 encoding. This makes the end-to-end model much more flexible in its representation of language and leads to improved performance.</p><p>Along with the base Llama 3 models, Meta has released a suite of offerings with tools such as <a href="https://ai.meta.com/blog/meta-llama-3/">Llama Guard 2, Code Shield, and CyberSec Eval 2</a>, which we are hoping to release on our Workers AI platform shortly.</p>
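The grouped-query attention scheme mentioned above can be illustrated with a toy head-mapping function. The head counts here are Llama 3 8B's published 32 query heads and 8 key/value heads; the function itself is our sketch, not Meta's code:

```javascript
// Toy sketch of grouped-query attention (GQA) head sharing, not Meta's code:
// with 32 query heads and 8 key/value heads, each group of 4 query heads
// reuses the same KV head, shrinking the KV cache 4x for long sequences.
function kvHeadFor(queryHead, numQueryHeads, numKvHeads) {
  const groupSize = numQueryHeads / numKvHeads; // 32 / 8 = 4 for Llama 3 8B
  return Math.floor(queryHead / groupSize);
}
```

Query heads 0-3 share KV head 0, heads 4-7 share KV head 1, and so on; standard multi-head attention is the special case where the two head counts are equal.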
    <div>
      <h2>Try it out now</h2>
      <a href="#try-it-out-now">
        
      </a>
    </div>
    <p>Meta Llama 3 8B is available in the <a href="https://developers.cloudflare.com/workers-ai/models/">Workers AI Model Catalog</a> today! Check out the <a href="https://developers.cloudflare.com/workers-ai/models/llama-3-8b-instruct/">documentation here</a> and as always if you want to share your experiences or learn more, join us in the <a href="https://discord.cloudflare.com">Developer Discord</a>.</p> ]]></content:encoded>
            <category><![CDATA[Llama]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">3FHdMMB8JzNt8hkAYDcqVL</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Davina Zamanzadeh</dc:creator>
            <dc:creator>Isaac Rehg</dc:creator>
            <dc:creator>Nikhil Kothari</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leveling up Workers AI: general availability and more new capabilities]]></title>
            <link>https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support/</link>
            <pubDate>Tue, 02 Apr 2024 13:01:00 GMT</pubDate>
            <description><![CDATA[ Today, we’re excited to make a series of announcements, including Workers AI, Cloudflare’s inference platform becoming GA and support for fine-tuned models with LoRAs and one-click deploys from HuggingFace. Cloudflare Workers now supports the Python programming language, and more ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YNXJ4s4e47U7MvddTlpz8/3a53be280a5e373b589eba37bc4740d0/Cities-with-GPUs-momentum-update.png" />
            
            </figure><p>Welcome to Tuesday – our AI day of Developer Week 2024! In this blog post, we’re excited to share an overview of our new AI announcements and vision, including news about Workers AI officially going GA with improved pricing, a GPU hardware momentum update, an expansion of our Hugging Face partnership, Bring Your Own LoRA fine-tuned inference, Python support in Workers, more providers in AI Gateway, and Vectorize metadata filtering.</p>
    <div>
      <h3>Workers AI GA</h3>
      <a href="#workers-ai-ga">
        
      </a>
    </div>
    <p>Today, we’re excited to announce that our Workers AI inference platform is now Generally Available. After months of being in open beta, we’ve improved our service with greater reliability and performance, unveiled pricing, and added many more models to our catalog.</p>
    <div>
      <h4>Improved performance &amp; reliability</h4>
      <a href="#improved-performance-reliability">
        
      </a>
    </div>
    <p>With Workers AI, our goal is to make AI inference as reliable and easy to use as the rest of Cloudflare’s network. Under the hood, we’ve upgraded the load balancing that is built into Workers AI. Requests can now be routed to more GPUs in more cities, and each city is aware of the total available capacity for AI inference. If a request would have to wait in a queue in the current city, it can instead be routed to another location, getting results back to you faster when traffic is high. With this, we’ve increased rate limits across all our models – most LLMs now have a limit of 300 requests per minute, up from 50 requests per minute during our beta phase. Smaller models have a limit of 1,500-3,000 requests per minute. Check out our <a href="https://developers.cloudflare.com/workers-ai/platform/limits/">Developer Docs for the rate limits</a> of individual models.</p>
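The routing idea described above can be sketched roughly as follows. The city objects and field names are hypothetical, purely to illustrate skipping a saturated location rather than queueing in it; this is not Cloudflare's implementation:

```javascript
// Rough sketch of the routing idea above (illustrative, not Cloudflare code):
// walk candidate cities in proximity order and pick the first one with
// spare GPU capacity instead of queueing in the nearest saturated city.
function pickInferenceCity(citiesByProximity) {
  for (const city of citiesByProximity) {
    if (city.queuedRequests < city.capacity) {
      return city.name;
    }
  }
  // every candidate is saturated: queue in the nearest city after all
  return citiesByProximity[0].name;
}
```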
    <div>
      <h4>Lowering costs on popular models</h4>
      <a href="#lowering-costs-on-popular-models">
        
      </a>
    </div>
    <p>Alongside our GA of Workers AI, we published a <a href="https://ai.cloudflare.com/#pricing-calculator">pricing calculator</a> for our 10 non-beta models earlier this month. We want Workers AI to be one of the most affordable and accessible solutions to run <a href="https://www.cloudflare.com/learning/ai/inference-vs-training/">inference</a>, so we added a few optimizations to our models to make them more affordable. Now, Llama 2 is over 7x cheaper and Mistral 7B is over 14x cheaper to run than we had initially <a href="https://developers.cloudflare.com/workers-ai/platform/pricing/">published</a> on March 1. We want to continue to be the best platform for AI inference and will continue to roll out optimizations to our customers when we can.</p><p>As a reminder, our billing for Workers AI started on April 1st for our non-beta models, while beta models remain free and unlimited. We offer 10,000 <a href="/workers-ai#:~:text=may%20be%20wondering%20%E2%80%94-,what%E2%80%99s%20a%20neuron">neurons</a> per day for free to all customers. Workers Free customers will encounter a hard rate limit after 10,000 neurons in 24 hours, while Workers Paid customers will incur usage at $0.011 per 1,000 additional neurons. Read our <a href="https://developers.cloudflare.com/workers-ai/platform/pricing/">Workers AI Pricing Developer Docs</a> for the most up-to-date information on pricing.</p>
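As a worked example of the numbers above, a Workers Paid account's daily cost can be computed like this (the helper function name is ours, not a Cloudflare API):

```javascript
// Worked example of Workers AI neuron pricing for a Workers Paid account:
// the first 10,000 neurons per day are free, and additional usage bills
// at $0.011 per 1,000 neurons. The helper name is illustrative.
function dailyNeuronCostUSD(neuronsUsed) {
  const FREE_NEURONS_PER_DAY = 10000;
  const USD_PER_1000_NEURONS = 0.011;
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}
```

For example, 110,000 neurons in one day leaves 100,000 billable neurons, or about $1.10.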
    <div>
      <h4>New dashboard and playground</h4>
      <a href="#new-dashboard-and-playground">
        
      </a>
    </div>
    <p>Lastly, we’ve revamped our <a href="https://dash.cloudflare.com/?to=/:account/ai/workers-ai">Workers AI dashboard</a> and <a href="https://playground.ai.cloudflare.com/">AI playground</a>. The Workers AI page in the Cloudflare dashboard now shows analytics for usage across models, including neuron calculations to help you better predict pricing. The AI playground lets you quickly test and compare different models and configure prompts and parameters. We hope these new tools help developers start building on Workers AI seamlessly – go try them out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uSiHo9pV21DreFiLpUfPX/5aa5c8a2448da881a0872e3f550c39a2/image3-3.png" />
            
            </figure>
    <div>
      <h3>Run inference on GPUs in over 150 cities around the world</h3>
      <a href="#run-inference-on-gpus-in-over-150-cities-around-the-world">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qyZ48xVu70u80swQKPUau/1e85ea2b833139d277d5769d5a8c8c66/image5-2.png" />
            
            </figure><p>When we announced Workers AI back in September 2023, we set out to deploy GPUs to our data centers around the world. We plan to deliver on that promise and deploy inference-tuned GPUs almost everywhere by the end of 2024, making us the most widely distributed cloud-AI inference platform. We have over 150 cities with GPUs today and will continue to roll out more throughout the year.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/451IgQCfz2j10xM8kXhZy0/67632b1b5f52e3387aa387e43c804324/image7-1.png" />
            
            </figure><p>We also have our next generation of compute servers with GPUs launching in Q2 2024, which means better performance, power efficiency, and improved reliability over previous generations. We provided a preview of our Gen 12 Compute servers design in a <a href="/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor">December 2023 blog post</a>, with more details to come. With Gen 12 and future planned hardware launches, the next step is to support larger machine learning models and offer fine-tuning on our platform. This will allow us to achieve higher inference throughput, lower latency and greater availability for production workloads, as well as expanding support to new categories of workloads such as fine-tuning.</p>
    <div>
      <h3>Hugging Face Partnership</h3>
      <a href="#hugging-face-partnership">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ATIA1mxaeqHYE31cztdjg/33717b0fd20244f3089cf5d4c8a9c13f/image2-2.png" />
            
            </figure><p>We’re also excited to continue our partnership with Hugging Face in the spirit of bringing the best of open-source to our customers. Now, you can visit some of the most popular models on Hugging Face and easily click to run the model on Workers AI if it is available on our platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3geQJnlHZhMktG1eoEWYKE/36c76ee43eca9443b4e2e9c6d3e1df7e/image6-1.png" />
            
            </figure><p>We’re happy to announce that we’ve added 4 more models to our platform in conjunction with Hugging Face. You can now access the new <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2">Mistral 7B v0.2</a> model with improved context windows, <a href="https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B">Nous Research’s Hermes 2 Pro</a> fine-tuned version of Mistral 7B, <a href="https://huggingface.co/google/gemma-7b-it">Google’s Gemma 7B</a>, and <a href="https://huggingface.co/Nexusflow/Starling-LM-7B-beta">Starling-LM-7B-beta</a> fine-tuned from OpenChat. There are currently 14 models that we’ve curated with Hugging Face to be available for serverless GPU inference powered by Cloudflare’s Workers AI platform, with more coming soon. These models are all served using Hugging Face’s technology with a <a href="https://github.com/huggingface/text-generation-inference/">TGI</a> backend, and we work closely with the Hugging Face team to curate, optimize, and deploy these models.</p><blockquote><p><i>“We are excited to work with Cloudflare to make AI more accessible to developers. Offering the most popular open models with a serverless API, powered by a global fleet of GPUs is an amazing proposition for the Hugging Face community, and I can’t wait to see what they build with it.”</i>- <b>Julien Chaumond</b>, Co-founder and CTO, Hugging Face</p></blockquote><p>You can find all of the open models supported in Workers AI in this <a href="https://huggingface.co/collections/Cloudflare/hf-curated-models-available-on-workers-ai-66036e7ad5064318b3e45db6">Hugging Face Collection</a>, and the “Deploy to Cloudflare Workers AI” button is at the top of each model card. To learn more, read Hugging Face’s <a href="http://huggingface.co/blog/cloudflare-workers-ai">blog post</a> and take a look at our <a href="https://developers.cloudflare.com/workers-ai/models/">Developer Docs</a> to get started. Have a model you want to see on Workers AI? 
Send us a message on <a href="https://discord.cloudflare.com">Discord</a> with your request.</p>
    <div>
      <h3>Supporting fine-tuned inference - BYO LoRAs</h3>
      <a href="#supporting-fine-tuned-inference-byo-loras">
        
      </a>
    </div>
    <p>Fine-tuned inference is one of our most requested features for Workers AI, and we’re one step closer now with Bring Your Own (BYO) LoRAs. Using the popular <a href="https://www.cloudflare.com/learning/ai/what-is-lora/">Low-Rank Adaptation</a> method, researchers have figured out how to take a model and adapt <i>some</i> model parameters to the task at hand, rather than rewriting <i>all</i> model parameters like you would for a fully fine-tuned model. This means that you can get fine-tuned model outputs without the computational expense of fully fine-tuning a model.</p><p>We now support bringing trained LoRAs to Workers AI, where we apply the LoRA adapter to a base model at runtime to give you fine-tuned inference, at a fraction of the cost, size, and speed of a fully fine-tuned model. In the future, we want to be able to support fine-tuning jobs and fully fine-tuned models directly on our platform, but we’re excited to be one step closer today with LoRAs.</p>
            <pre><code>const response = await ai.run(
  "@cf/mistralai/mistral-7b-instruct-v0.2-lora", //the model supporting LoRAs
  {
      messages: [{"role": "user", "content": "Hello world"}],
      raw: true, //skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000000", //the finetune id OR name
  }
);</code></pre>
            <p>BYO LoRAs is in open beta as of today for Gemma 2B and 7B, Llama 2 7B and Mistral 7B models with LoRA adapters up to 100MB in size and max rank of 8, and up to 30 total LoRAs per account. As always, we expect you to use Workers AI and our new BYO LoRA feature with our <a href="https://www.cloudflare.com/service-specific-terms-developer-platform/#developer-platform-terms">Terms of Service</a> in mind, including any model-specific restrictions on use contained in the models’ license terms.</p><p>Read the technical deep dive blog post on <a href="/fine-tuned-inference-with-loras">fine-tuning with LoRA</a> and <a href="https://developers.cloudflare.com/workers-ai/fine-tunes">developer docs</a> to get started.</p>
    <div>
      <h3>Write Workers in Python</h3>
      <a href="#write-workers-in-python">
        
      </a>
    </div>
    <p>Python is the second most popular programming language in the world (after JavaScript) and the language of choice for building AI applications. And starting today, in open beta, you can now <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/">write Cloudflare Workers in Python</a>. Python Workers support all <a href="https://developers.cloudflare.com/workers/configuration/bindings/">bindings</a> to resources on Cloudflare, including <a href="https://developers.cloudflare.com/vectorize/">Vectorize</a>, <a href="https://developers.cloudflare.com/d1/">D1</a>, <a href="https://developers.cloudflare.com/kv/">KV</a>, <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a> and more.</p><p><a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/packages/langchain/">LangChain</a> is the most popular framework for building LLM‑powered applications, and like how <a href="/langchain-and-cloudflare">Workers AI works with langchain-js</a>, the <a href="https://python.langchain.com/docs/get_started/introduction">Python LangChain library</a> works on Python Workers, as do <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/packages/">other Python packages</a> like FastAPI.</p><p>Workers written in Python are just as simple as Workers written in JavaScript:</p>
            <pre><code>from js import Response

async def on_fetch(request, env):
    return Response.new("Hello world!")</code></pre>
            <p>…and are configured by simply pointing at a .py file in your <code>wrangler.toml</code>:</p>
            <pre><code>name = "hello-world-python-worker"
main = "src/entry.py"
compatibility_date = "2024-03-18"
compatibility_flags = ["python_workers"]</code></pre>
            <p>There are no extra toolchain or precompilation steps needed. The <a href="https://pyodide.org/en/stable/">Pyodide</a> Python execution environment is provided for you, directly by the Workers runtime, mirroring how Workers written in JavaScript already work.</p><p>There’s lots more to dive into — take a look at the <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/">docs</a>, and check out our <a href="/python-workers">companion blog post</a> for details about how Python Workers work behind the scenes.</p>
    <div>
      <h2>AI Gateway now supports Anthropic, Azure, AWS Bedrock, Google Vertex, and Perplexity</h2>
      <a href="#ai-gateway-now-supports-anthropic-azure-aws-bedrock-google-vertex-and-perplexity">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YaNMed9Aw4YjhbZk75SGI/a53b499b743e36ad2635357118e6623f/image4-2.png" />
            
            </figure><p>Our <a href="/announcing-ai-gateway">AI Gateway</a> product helps developers better control and observe their AI applications, with analytics, caching, rate limiting, and more. We are continuing to add more providers to the product, including Anthropic, Google Vertex, and Perplexity, which we’re excited to announce today. We quietly rolled out Azure and Amazon Bedrock support in December 2023, which means that the most popular providers are now supported via AI Gateway, including Workers AI itself.</p><p>Take a look at our <a href="https://developers.cloudflare.com/ai-gateway/">Developer Docs</a> to get started with AI Gateway.</p>
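As a sketch of how one gateway fronts multiple providers, the snippet below composes AI Gateway endpoint URLs. The URL shape and the sample account and gateway IDs are illustrative assumptions; check the AI Gateway docs for the exact path each provider expects.

```python
# Minimal sketch: composing AI Gateway endpoint URLs for different providers.
# The URL shape and provider slugs are assumptions based on the AI Gateway
# docs; consult the current documentation for the paths your gateway expects.
GATEWAY_BASE = "https://gateway.ai.cloudflare.com/v1"

def gateway_url(account_id: str, gateway_id: str, provider: str, path: str) -> str:
    """Build the proxied endpoint for a given provider behind one gateway."""
    return f"{GATEWAY_BASE}/{account_id}/{gateway_id}/{provider}/{path.lstrip('/')}"

# The same gateway fronts multiple providers; only the provider segment changes.
url = gateway_url("abc123", "my-gateway", "anthropic", "v1/messages")
```

Because every provider sits behind the same gateway, analytics, caching, and rate limiting apply uniformly regardless of which upstream model serves the request.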
    <div>
      <h4>Coming soon: Persistent Logs</h4>
      <a href="#coming-soon-persistent-logs">
        
      </a>
    </div>
    <p>In Q2 of 2024, we will be adding persistent logs so that you can push your logs (including prompts and responses) to <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>, custom metadata so that you can tag requests with user IDs or other identifiers, and secrets management so that you can securely manage your application’s API keys.</p><p>We want AI Gateway to be the control plane for your AI applications, allowing developers to dynamically evaluate and route requests to different models and providers. With our persistent logs feature, we want to enable developers to use their logged data to fine-tune models in one click, eventually running the fine-tune job and the fine-tuned model directly on our Workers AI platform. AI Gateway is just one product in our AI toolkit, but we’re excited about the workflows and use cases it can unlock for developers building on our platform, and we hope you’re excited about it too.</p>
    <div>
      <h3>Vectorize metadata filtering and future GA of million vector indexes</h3>
      <a href="#vectorize-metadata-filtering-and-future-ga-of-million-vector-indexes">
        
      </a>
    </div>
    <p>Vectorize is another component of our toolkit for AI applications. In open beta since September 2023, Vectorize allows developers to persist embeddings (vectors), like those generated from Workers AI <a href="https://developers.cloudflare.com/workers-ai/models/#text-embeddings">text embedding</a> models, and query for the closest match to support use cases like similarity search or recommendations. Without a vector database, model output is forgotten and can’t be recalled without extra costs to re-run a model.</p><p>Since Vectorize’s open beta, we’ve added <a href="https://developers.cloudflare.com/vectorize/reference/metadata-filtering/">metadata filtering</a>. Metadata filtering lets developers combine vector search with filtering for arbitrary metadata, supporting the query complexity needed in AI applications. We’re laser-focused on getting Vectorize ready for general availability, with a target launch date of June 2024, which will include support for multi-million vector indexes.</p>
            <pre><code>// Insert vectors with metadata
const vectors: Array&lt;VectorizeVector&gt; = [
  {
    id: "1",
    values: [32.4, 74.1, 3.2],
    metadata: { url: "/products/sku/13913913", streaming_platform: "netflix" }
  },
  {
    id: "2",
    values: [15.1, 19.2, 15.8],
    metadata: { url: "/products/sku/10148191", streaming_platform: "hbo" }
  },
...
];
let upserted = await env.YOUR_INDEX.upsert(vectors);

// Query with metadata filtering
let metadataMatches = await env.YOUR_INDEX.query(&lt;queryVector&gt;, { filter: { streaming_platform: "netflix" }} )</code></pre>
            
    <div>
      <h3>The most comprehensive Developer Platform to build AI applications</h3>
      <a href="#the-most-comprehensive-developer-platform-to-build-ai-applications">
        
      </a>
    </div>
    <p>On Cloudflare’s Developer Platform, we believe that all developers should be able to quickly build and ship full-stack applications  – and that includes AI experiences as well. With our GA of Workers AI, announcements for Python support in Workers, AI Gateway, and Vectorize, and our partnership with Hugging Face, we’ve expanded the world of possibilities for what you can build with AI on our platform. We hope you are as excited as we are – take a look at all our <a href="https://developers.cloudflare.com">Developer Docs</a> to get started, and <a href="https://discord.cloudflare.com/">let us know</a> what you build.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[General Availability]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <guid isPermaLink="false">6ItPe1u2j71C4DTSxJdccB</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>Brendan Irvine-Broque</dc:creator>
            <dc:creator>Vy Ton</dc:creator>
        </item>
        <item>
            <title><![CDATA[Running fine-tuned models on Workers AI with LoRAs]]></title>
            <link>https://blog.cloudflare.com/fine-tuned-inference-with-loras/</link>
            <pubDate>Tue, 02 Apr 2024 13:00:48 GMT</pubDate>
            <description><![CDATA[ Workers AI now supports fine-tuned models using LoRAs. But what is a LoRA and how does it work? In this post, we dive into fine-tuning, LoRAs and even some math to share the details of how it all works under the hood ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70HGv5JY8CcvWM8IesAxt0/5b60789faf49d45cd4f54e6dd8e4efd4/loraai.png" />
            
            </figure>
    <div>
      <h3>Inference from fine-tuned LLMs with LoRAs is now in open beta</h3>
      <a href="#inference-from-fine-tuned-llms-with-loras-is-now-in-open-beta">
        
      </a>
    </div>
    <p>Today, we’re excited to announce that you can now run fine-tuned inference with LoRAs on Workers AI. This feature is in open beta and available for pre-trained LoRA adapters to be used with Mistral, Gemma, or Llama 2, with some limitations. Take a look at our <a href="/workers-ai-ga-huggingface-loras-python-support/">product announcements blog post</a> to get a high-level overview of our Bring Your Own (BYO) LoRAs feature.</p><p>In this post, we’ll do a deep dive into what fine-tuning and LoRAs are, show you how to use it on our Workers AI platform, and then delve into the technical details of how we implemented it on our platform.</p>
    <div>
      <h2>What is fine-tuning?</h2>
      <a href="#what-is-fine-tuning">
        
      </a>
    </div>
    <p>Fine-tuning is a general term for modifying an AI model by continuing to train it with additional data. The goal of fine-tuning is to increase the probability that a generation is similar to your dataset. Training a model from scratch is not practical for many use cases given how expensive and time-consuming models can be to train. By fine-tuning an existing pre-trained model, you benefit from its capabilities while also accomplishing your desired task. <a href="https://www.cloudflare.com/learning/ai/what-is-lora/">Low-Rank Adaptation (LoRA)</a> is a specific fine-tuning method that can be applied to various model architectures, not just LLMs. In traditional fine-tuning methods, the pre-trained model weights are commonly modified directly or fused with additional fine-tune weights. LoRA, on the other hand, allows for the fine-tune weights and pre-trained model to remain separate, and for the pre-trained model to remain unchanged. The end result is that you can train models to be more accurate at specific tasks, such as generating code, having a specific personality, or generating images in a specific style. You can even fine-tune an existing <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/">LLM</a> to understand additional information about a specific topic.</p><p>The approach of maintaining the original base model weights means that you can create new fine-tune weights with relatively little compute. You can take advantage of existing foundational models (such as Llama, Mistral, and Gemma), and adapt them for your needs.</p>
    <div>
      <h2>How does fine-tuning work?</h2>
      <a href="#how-does-fine-tuning-work">
        
      </a>
    </div>
    <p>To better understand fine-tuning and why LoRA is so effective, we have to take a step back to understand how AI models work. AI models (like LLMs) are neural networks that are trained through deep learning techniques. In neural networks, there are a set of parameters that act as a mathematical representation of the model’s domain knowledge, made up of weights and biases – in simple terms, numbers. These parameters are usually represented as large matrices of numbers. The more parameters a model has, the larger the model is, so when you see models like llama-2-7b, you can read “7b” and know that the model has 7 billion parameters.</p><p>A model’s parameters define its behavior. When you train a model from scratch, these parameters usually start off as random numbers. As you train the model on a dataset, these parameters get adjusted bit-by-bit until the model reflects the dataset and exhibits the right behavior. Some parameters will be more important than others, so we apply a weight and use it to show more or less importance. Weights play a crucial role in the model's ability to capture patterns and relationships in the data it is trained on.</p><p>Traditional fine-tuning will adjust <i>all</i> the parameters in the trained model with a new set of weights. As such, a fine-tuned model requires us to serve the same amount of parameters as the original model, which means it can take a lot of time and compute to train and run inference for a fully fine-tuned model. On top of that, new state-of-the-art models, or versions of existing models, are regularly released, meaning that fully fine-tuned models can become costly to train, maintain, and store.</p>
    <div>
      <h2>LoRA is an efficient method of fine-tuning</h2>
      <a href="#lora-is-an-efficient-method-of-fine-tuning">
        
      </a>
    </div>
    <p>In the simplest terms, LoRA avoids adjusting parameters in a pre-trained model and instead allows us to apply a small number of additional parameters. These additional parameters are applied temporarily to the base model to effectively control model behavior. Relative to traditional fine-tuning methods, it takes a lot less time and compute to train these additional parameters, which are referred to as a LoRA adapter. After training, we package up the LoRA adapter as a separate model file that can then plug into the base model it was trained from. A fully fine-tuned model can be tens of gigabytes in size, while these adapters are usually just a few megabytes. This makes it a lot easier to distribute, and serving fine-tuned inference with LoRA only adds milliseconds of latency to total inference time.</p><p>If you’re curious to understand why LoRA is so effective, buckle up — we first have to go through a brief lesson on linear algebra. If that’s not a term you’ve thought about since university, don’t worry, we’ll walk you through it.</p>
    <div>
      <h2>Show me the math</h2>
      <a href="#show-me-the-math">
        
      </a>
    </div>
    <p>With traditional fine-tuning, we can take the weights of a model (<i>W0</i>) and tweak them to output a new set of weights — so the difference between the original model weights and the new weights is <i>ΔW</i>, representing the change in weights. Therefore, a tuned model will have a new set of weights which can be represented as the original model weights plus the change in weights, <i>W0</i> + <i>ΔW</i>.</p><p>Remember, all of these model weights are actually represented as large matrices of numbers. In math, every matrix has a property called rank (<i>r</i>), which describes the number of linearly independent columns or rows in a matrix. When matrices are low-rank, they have only a few columns or rows that are “important”, so we can actually decompose or split them into two smaller matrices with the most important parameters (think of it like factoring in algebra). This technique is called rank decomposition, which allows us to greatly reduce and simplify matrices while keeping the most important bits. In the context of fine-tuning, rank determines how many parameters get changed from the original model – the higher the rank, the stronger the fine-tune, giving you more granularity over the output.</p><p>According to the <a href="https://arxiv.org/abs/2106.09685">original LoRA paper</a>, researchers have found that when a model is low-rank, the matrix representing the change in weights is also low-rank. Therefore, we can apply rank decomposition to our matrix representing the change in weights <i>ΔW</i> to create two smaller matrices <i>A, B</i>, where <i>ΔW = BA</i>. Now, the change in the model can be represented by two smaller low-rank matrices. This is why this method of fine-tuning is called Low-Rank Adaptation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/13k5puOQpL75CRNCTv5ZYE/309e49ef14cc3ef7a3493c5ad75c1f97/Lora-lineapro.png" />
            
            </figure><p>When we run inference, we only need the smaller matrices <i>A, B</i> to change the behavior of the model. The model weights in <i>A, B</i> constitute our LoRA adapter (along with a config file). At runtime, we add the model weights together, combining the original model (<i>W0</i>) and the LoRA adapter (<i>A, B</i>). Adding and subtracting are simple mathematical operations, meaning that we can quickly swap out different LoRA adapters by adding and subtracting <i>A, B</i> from <i>W0</i>. By temporarily adjusting the weights of the original model, we modify the model’s behavior and output, and as a result we get fine-tuned inference with minimal added latency.</p><p>According to the original <a href="https://arxiv.org/abs/2106.09685">LoRA paper</a>, “LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times”. Because of this, LoRA is one of the most popular methods of fine-tuning since it's a lot less computationally expensive than a fully fine-tuned model, doesn't add any material inference time, and is much smaller and portable.</p>
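The parameter savings can be made concrete with a toy example. The numbers below (hidden size 4096, rank 8, and the tiny 2x2 matrices) are illustrative only, not taken from any particular model:

```python
# Toy illustration of rank decomposition: instead of storing a full d x d
# update matrix ΔW, LoRA stores two small factors B (d x r) and A (r x d),
# with rank r much smaller than d.
d, r = 4096, 8                      # hidden size and LoRA rank

full_update_params = d * d          # parameters in a dense ΔW
lora_params = d * r + r * d         # parameters in B and A combined
savings = full_update_params / lora_params   # 256x fewer parameters here

# At runtime the effective weight is W0 + BA; a tiny 2x2, rank-1 example:
W0 = [[1.0, 0.0], [0.0, 1.0]]       # base weights (identity for clarity)
B = [[1.0], [2.0]]                  # 2 x 1 factor
A = [[0.5, 0.5]]                    # 1 x 2 factor

# delta = B @ A, then W = W0 + delta, computed with plain list comprehensions
delta = [[sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(2)]
         for i in range(2)]
W = [[W0[i][j] + delta[i][j] for j in range(2)] for i in range(2)]
```

Swapping adapters means subtracting one `delta` and adding another, which is why adapter switches are cheap at inference time.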
    <div>
      <h2>How can you use LoRAs with Workers AI?</h2>
      <a href="#how-can-you-use-loras-with-workers-ai">
        
      </a>
    </div>
    <p>Workers AI is very well-suited to run LoRAs because of the way we run serverless inference. The models in our catalog are always pre-loaded on our GPUs, meaning that we keep them warm so that your requests never encounter a cold start. This means that the base model is always available, and we can dynamically load and swap out LoRA adapters as needed. We can actually plug in multiple LoRA adapters to one base model, so we can serve multiple different fine-tuned inference requests at once.</p><p>When you fine-tune with LoRA, your output will be two files: your custom model weights (in <a href="https://huggingface.co/docs/safetensors/en/index">safetensors</a> format) and an adapter config file (in json format). To create these weights yourself, you can train a LoRA on your own data using the <a href="https://huggingface.co/docs/peft/en/tutorial/peft_model_config">Hugging Face PEFT</a> (Parameter-Efficient Fine-Tuning) library combined with the <a href="https://huggingface.co/docs/autotrain/en/llm_finetuning">Hugging Face AutoTrain LLM library</a>. You can also run your training tasks on services such as <a href="https://huggingface.co/autotrain">Auto Train</a> and <a href="https://colab.research.google.com/">Google Colab</a>. Alternatively, there are many open-source LoRA adapters <a href="https://huggingface.co/models?pipeline_tag=text-generation&amp;sort=trending&amp;search=mistral+lora">available on Hugging Face</a> today that cover a variety of use cases.</p><p>Eventually, we want to support the LoRA training workloads on our platform, but we’ll need you to bring your trained LoRA adapters to Workers AI today, which is why we’re calling this feature Bring Your Own (BYO) LoRAs.</p><p>For the initial open beta release, we are allowing people to use LoRAs with our Mistral, Llama, and Gemma models. We have set aside versions of these models which accept LoRAs, which you can access by appending <code>-lora</code> to the end of the model name. 
Your adapter must have been fine-tuned from one of our supported base models listed below:</p><ul><li><p><code>@cf/meta-llama/llama-2-7b-chat-hf-lora</code></p></li><li><p><code>@cf/mistral/mistral-7b-instruct-v0.2-lora</code></p></li><li><p><code>@cf/google/gemma-2b-it-lora</code></p></li><li><p><code>@cf/google/gemma-7b-it-lora</code></p></li></ul><p>As we are launching this feature in open beta, we have some limitations today to take note of: quantized LoRA models are not yet supported, LoRA adapters must be smaller than 100MB and have up to a max rank of 8, and you can try up to 30 LoRAs per account during our initial open beta. To get started with LoRAs on Workers AI, read the <a href="https://developers.cloudflare.com/workers-ai/fine-tunes/loras">Developer Docs</a>.</p><p>As always, we expect people to use Workers AI and our new BYO LoRA feature with our <a href="https://www.cloudflare.com/service-specific-terms-developer-platform/#developer-platform-terms">Terms of Service</a> in mind, including any model-specific restrictions on use contained in the models’ license terms.</p>
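The REST call for fine-tuned inference can also be composed in Python, mirroring the model names above. This is a minimal sketch: the account ID, API token, and the "my-finetune" adapter name are placeholders you must replace with your own values.

```python
# Sketch of a fine-tuned inference request against the Workers AI REST API.
# ACCOUNT_ID, API_TOKEN, and "my-finetune" are placeholders, not real values.
import json
from urllib import request

ACCOUNT_ID = "your-account-id"
API_TOKEN = "your-api-token"
MODEL = "@cf/mistral/mistral-7b-instruct-v0.2-lora"   # a LoRA-capable base model

def build_request(prompt: str, lora: str) -> request.Request:
    """Assemble (but do not send) the HTTP request for LoRA inference."""
    url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
    body = json.dumps({"prompt": prompt, "raw": True, "lora": lora}).encode()
    return request.Request(
        url,
        data=body,
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
    )

req = build_request("Hello world", "my-finetune")
# request.urlopen(req) would send it; omitted here since it needs real credentials.
```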
    <div>
      <h2>How did we build multi-tenant LoRA serving?</h2>
      <a href="#how-did-we-build-multi-tenant-lora-serving">
        
      </a>
    </div>
    <p>Serving multiple LoRA models simultaneously poses a challenge in terms of GPU resource utilization. While it is possible to batch inference requests to a base model, it is much more challenging to batch requests with the added complexity of serving unique LoRA adapters. To tackle this problem, we leverage the Punica CUDA kernel design in combination with global cache optimizations in order to handle the memory-intensive workload of multi-tenant LoRA serving while offering low inference latency.</p><p>The Punica CUDA kernel was introduced in the paper <a href="https://arxiv.org/abs/2310.18547">Punica: Multi-Tenant LoRA Serving</a> as a method to serve multiple, significantly different LoRA models applied to the same base model. In comparison to previous inference techniques, the method offers substantial throughput and latency improvements. This optimization is achieved in part through enabling request batching even across requests serving different LoRA adapters.</p><p>The core of the Punica kernel system is a new CUDA kernel called Segmented Gather Matrix-Vector Multiplication (SGMV). SGMV allows a GPU to store only a single copy of the pre-trained model while serving different LoRA models. The Punica kernel design system consolidates the batching of requests for unique LoRA models to improve performance by parallelizing the feature-weight multiplication of different requests in a batch. Requests for the same LoRA model are then grouped to increase operational intensity. Initially, the GPU loads the base model while reserving most of its GPU memory for KV cache. The LoRA components (A and B matrices) are then loaded on demand from remote storage (Cloudflare’s cache or <a href="https://www.cloudflare.com/developer-platform/r2/">R2</a>) when required by an incoming request. This on-demand loading introduces only milliseconds of latency, which means that multiple LoRA adapters can be seamlessly fetched and served with minimal impact on inference performance. 
Frequently requested LoRA adapters are cached for the fastest possible inference.</p><p>Once a requested LoRA has been cached locally, the speed it can be made available for inference is constrained only by PCIe bandwidth. Regardless, given that each request may require its own LoRA, it becomes critical that LoRA downloads and memory copy operations are performed asynchronously. The Punica scheduler tackles this exact challenge, batching only requests which currently have required LoRA weights available in GPU memory, and queueing requests that do not until the required weights are available and the request can efficiently join a batch.</p><p>By effectively managing KV cache and batching these requests, it is possible to handle significant multi-tenant LoRA-serving workloads. A further and important optimization is the use of continuous batching. Common batching methods require all requests to the same adapter to reach their stopping condition before being released. Continuous batching allows a request in a batch to be released early so that it does not need to wait for the longest running request.</p><p>Given that LLMs deployed to Cloudflare’s network are available globally, it is important that LoRA adapter models are as well. Very soon, we will implement remote model files that are cached at Cloudflare’s edge to further reduce inference latency.</p>
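The scheduling rule described above can be modeled in a few lines: a request joins the current batch only if its adapter is already resident in GPU memory, and otherwise queues while the adapter loads asynchronously. This is a deliberately simplified illustration, not the actual Punica scheduler.

```python
# Toy model of the batching rule: batch requests whose LoRA weights are
# already on-GPU; queue the rest until their adapters finish loading.
from collections import deque

def schedule(requests, resident_adapters):
    """Split incoming (request_id, adapter) pairs into a runnable batch
    and a waiting queue, based on which adapters are resident in memory."""
    batch, waiting = [], deque()
    for req_id, adapter in requests:
        if adapter in resident_adapters:
            batch.append(req_id)               # weights on-GPU: batch it now
        else:
            waiting.append((req_id, adapter))  # async load, run in a later batch
    return batch, waiting

batch, waiting = schedule(
    [(1, "lora-a"), (2, "lora-b"), (3, "lora-a")],
    resident_adapters={"lora-a"},
)
```

Requests 1 and 3 share an adapter and batch together, while request 2 waits for its adapter to load, matching the grouping behavior described above.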
    <div>
      <h2>A roadmap for fine-tuning on Workers AI</h2>
      <a href="#a-roadmap-for-fine-tuning-on-workers-ai">
        
      </a>
    </div>
    <p>Launching support for LoRA adapters is an important step towards unlocking fine-tunes on our platform. In addition to the LLM fine-tunes available today, we look forward to supporting more models and a variety of task types, including image generation.</p><p>Our vision for Workers AI is to be the best place for developers to run their AI workloads — and this includes the process of fine-tuning itself. Eventually, we want to be able to run the fine-tuning training job as well as fully fine-tuned models directly on Workers AI. This unlocks many use cases for AI to be more relevant in organizations by empowering models to have more granularity and detail for specific tasks.</p><p>With AI Gateway, we will be able to help developers log their prompts and responses, which they can then use to fine-tune models with production data. Our vision is to have a one-click fine-tuning service, where log data from AI Gateway can be used to retrain a model (on Cloudflare) and then the fine-tuned model can be redeployed on Workers AI for inference. This will allow developers to personalize their AI models to fit their applications, allowing for granularity as low as a per-user level. The fine-tuned model can then be smaller and more optimized, helping users save time and money on AI inference – and the magic is that all of this can all happen within our very own <a href="https://www.cloudflare.com/developer-platform/">Developer Platform</a>.</p><p>We’re excited for you to try the open beta for BYO LoRAs! Read our <a href="https://developers.cloudflare.com/workers-ai/fine-tunes">Developer Docs</a> for more details, and tell us what you think on <a href="https://discord.cloudflare.com">Discord</a>.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4YpkeROwzr0CCHeFmwolIF</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Logan Grasby</dc:creator>
        </item>
        <item>
            <title><![CDATA[Mitigating a token-length side-channel attack in our AI products]]></title>
            <link>https://blog.cloudflare.com/ai-side-channel-attack-mitigated/</link>
            <pubDate>Thu, 14 Mar 2024 12:30:30 GMT</pubDate>
            <description><![CDATA[ The Workers AI and AI Gateway team recently collaborated closely with security researchers at Ben Gurion University regarding a report submitted through our Public Bug Bounty program. Through this process, we discovered and fully patched a vulnerability affecting all LLM providers. Here’s the story ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5do9zHtgVCZfCILMjoXAmV/0f7e2e3b4bdb298d7fd8c0a97d3b2a19/Mitigating-a-Token-Length-Side-Channel-attack-in-our-AI-products.png" />
            
            </figure><p>Since the discovery of <a href="https://en.wikipedia.org/wiki/CRIME">CRIME</a>, <a href="https://breachattack.com/">BREACH</a>, <a href="https://media.blackhat.com/eu-13/briefings/Beery/bh-eu-13-a-perfect-crime-beery-wp.pdf">TIME</a>, <a href="https://en.wikipedia.org/wiki/Lucky_Thirteen_attack">LUCKY-13</a> etc., length-based side-channel attacks have been considered practical. Even though packets were encrypted, attackers were able to infer information about the underlying plaintext by analyzing metadata like the packet length or timing information.</p><p>Cloudflare was recently contacted by a group of researchers at <a href="https://cris.bgu.ac.il/en/">Ben Gurion University</a> who wrote a paper titled “<a href="https://cdn.arstechnica.net/wp-content/uploads/2024/03/LLM-Side-Channel.pdf">What Was Your Prompt? A Remote Keylogging Attack on AI Assistants</a>” that describes “a novel side-channel that can be used to read encrypted responses from AI Assistants over the web”.</p><p>The Workers AI and AI Gateway team collaborated closely with these security researchers through our <a href="/cloudflare-bug-bounty-program/">Public Bug Bounty program</a>, discovering and fully patching a vulnerability that affects LLM providers. You can read the detailed research paper <a href="https://cdn.arstechnica.net/wp-content/uploads/2024/03/LLM-Side-Channel.pdf">here</a>.</p><p>Since being notified about this vulnerability, we've implemented a mitigation to help secure all Workers AI and AI Gateway customers. As far as we could assess, there was no outstanding risk to Workers AI and AI Gateway customers.</p>
    <div>
      <h3>How does the side-channel attack work?</h3>
      <a href="#how-does-the-side-channel-attack-work">
        
      </a>
    </div>
    <p>In the paper, the authors describe a method in which they intercept the stream of a chat session with an LLM provider, use the network packet headers to infer the length of each token, extract and segment their sequence, and then use their own dedicated LLMs to infer the response.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EeuXpPSqqvqIZKZUFPKEY/951a777d273caf172933639d9f5d6f12/pasted-image-0--2--3.png" />
            
            </figure><p>The two main requirements for a successful attack are an AI chat client running in <b>streaming</b> mode and a malicious actor capable of capturing network traffic between the client and the AI chat service. In streaming mode, the LLM tokens are emitted sequentially, introducing a token-length side-channel. Malicious actors could eavesdrop on packets via public networks or within an ISP.</p><p>An example request vulnerable to the side-channel attack looks like this:</p>
            <pre><code>curl -X POST \
https://api.cloudflare.com/client/v4/accounts/&lt;account-id&gt;/ai/run/@cf/meta/llama-2-7b-chat-int8 \
  -H "Authorization: Bearer &lt;Token&gt;" \
  -d '{"stream":true,"prompt":"tell me something about portugal"}'</code></pre>
            <p>Let’s use <a href="https://www.wireshark.org/">Wireshark</a> to inspect the network packets on the LLM chat session while streaming:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6sII07hkJGaVXBKlWoBoEW/a1c3be395e0bee3ec5ed690947737d51/media.png" />
            
            </figure><p>The first packet has a length of 95 and corresponds to the token "Port" which has a length of four. The second packet has a length of 93 and corresponds to the token "ug" which has a length of two, and so on. By removing the likely token envelope from the network packet length, it is easy to infer how many tokens were transmitted, as well as their sequence and individual lengths, just by sniffing encrypted network data.</p><p>Since the attacker needs the sequence of individual token lengths, this vulnerability only affects text generation models using streaming. This means that AI inference providers that use streaming — the most common way of interacting with LLMs — like Workers AI, are potentially vulnerable.</p><p>This method requires that the attacker is on the same network or in a position to observe the communication traffic, and its accuracy depends on knowing the target LLM’s writing style. In ideal conditions, the researchers claim that their system “can reconstruct 29% of an AI assistant’s responses and successfully infer the topic from 55% of them”. It’s also important to note that unlike other side-channel attacks, in this case the attacker has no way of evaluating its prediction against the ground truth. That means we are as likely to get a sentence with near-perfect accuracy as one where the only words that match are conjunctions.</p>
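<p>To make the arithmetic concrete, here is a minimal sketch of that inference step (this is illustrative, not the researchers’ actual tooling; the 91-byte envelope overhead is an assumption derived from the example above, where a 95-byte packet carried the four-character token "Port"):</p>

```javascript
// Sketch: recover per-token lengths from sniffed packet lengths by
// subtracting a constant envelope overhead. The 91-byte overhead is an
// assumed value taken from the example above (95-byte packet -> "Port").
function inferTokenLengths(packetLengths, envelopeOverhead) {
  return packetLengths.map((len) => len - envelopeOverhead);
}

// Packet lengths observed in the Wireshark capture above:
const observed = [95, 93];
console.log(inferTokenLengths(observed, 91)); // [4, 2] -> "Port", "ug"
```

<p>With the sequence of token lengths recovered this way, the researchers then feed it to their own dedicated LLMs to guess the plaintext response.</p>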
    <div>
      <h3>Mitigating LLM side-channel attacks</h3>
      <a href="#mitigating-llm-side-channel-attacks">
        
      </a>
    </div>
    <p>Since this type of attack relies on the length of tokens being inferred from the packet, it can be just as easily mitigated by obscuring token size. The researchers suggested a few strategies to mitigate these side-channel attacks; we chose the simplest: padding the token responses with random-length noise to obscure the length of each token so that responses cannot be inferred from the packets. While we immediately added the mitigation to our own inference product, Workers AI, we also wanted to help customers secure their LLMs regardless of where they run them, so we added it to AI Gateway.</p><p>As of today, all users of Workers AI and AI Gateway are now automatically protected from this side-channel attack.</p>
    <div>
      <h3>What we did</h3>
      <a href="#what-we-did">
        
      </a>
    </div>
    <p>Once we got word of this research work and how exploiting the technique could potentially impact our AI products, we did what we always do in situations like this: we assembled a team of systems engineers, security engineers, and product managers and started discussing risk mitigation strategies and next steps. We also had a call with the researchers, who kindly attended, presented their conclusions, and answered questions from our teams.</p><p>The research team provided a testing notebook that we could use to validate the attack's results. While we were able to reproduce the results for the notebook's examples, we found that accuracy varied considerably in our tests across different prompt responses and different LLMs. Nonetheless, the paper has merit, and the risks are not negligible.</p><p>We decided to incorporate the first mitigation suggestion in the paper: adding random padding to each message to hide the actual length of tokens in the stream, thereby complicating attempts to infer information based solely on network packet size.</p>
    <div>
      <h3>Workers AI, our inference product, is now protected</h3>
      <a href="#workers-ai-our-inference-product-is-now-protected">
        
      </a>
    </div>
    <p>With our inference-as-a-service product, anyone can use the <a href="https://developers.cloudflare.com/workers-ai/">Workers AI</a> platform and make API calls to our supported AI models. This means that we oversee the inference requests being made to and from the models. As such, we have a responsibility to ensure that the service is secure and protected from potential vulnerabilities. We immediately rolled out a fix once we were notified of the research, and all Workers AI customers are now automatically protected from this side-channel attack. We have not seen any malicious attacks exploiting this vulnerability, other than the ethical testing from the researchers.</p><p>Our solution for Workers AI is a variation of the mitigation strategy suggested in the research document. Since we stream JSON objects rather than the raw tokens, instead of padding the tokens with whitespace characters, we added a new property, "p" (for padding) that has a string value of variable random length.</p><p>Example streaming response using the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events">SSE</a> syntax:</p>
            <pre><code>data: {"response":"portugal","p":"abcdefghijklmnopqrstuvwxyz0123456789a"}
data: {"response":" is","p":"abcdefghij"}
data: {"response":" a","p":"abcdefghijklmnopqrstuvwxyz012"}
data: {"response":" southern","p":"ab"}
data: {"response":" European","p":"abcdefgh"}
data: {"response":" country","p":"abcdefghijklmno"}
data: {"response":" located","p":"abcdefghijklmnopqrstuvwxyz012345678"}</code></pre>
            <p>This has the advantage that no modifications are required in the SDK or the client code, the changes are invisible to the end-users, and no action is required from our customers. By adding random variable length to the JSON objects, we introduce the same network-level variability, and the attacker essentially loses the required input signal. Customers can continue using Workers AI as usual while benefiting from this protection.</p>
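<p>The padding step can be sketched as follows (an illustrative sketch only; the function name and the 1–40 character padding range are assumptions, not Cloudflare's actual implementation):</p>

```javascript
// Sketch of the mitigation: append a random-length "p" field to each
// streamed JSON object so the SSE event's wire length no longer tracks
// the token's length. The 1..40 character range is assumed for illustration.
function padEvent(token, maxPad = 40) {
  const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
  const padLength = 1 + Math.floor(Math.random() * maxPad);
  const padding = alphabet
    .repeat(Math.ceil(padLength / alphabet.length))
    .slice(0, padLength);
  // SSE framing: "data: <json>\n\n"
  return `data: ${JSON.stringify({ response: token, p: padding })}\n\n`;
}

console.log(padEvent("portugal"));
```

<p>Because the padding length is drawn independently for every event, two events carrying tokens of the same length will generally arrive with different packet sizes, which is exactly the signal the attacker needs and no longer has.</p>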
    <div>
      <h3>One step further: AI Gateway protects users of any inference provider</h3>
      <a href="#one-step-further-ai-gateway-protects-users-of-any-inference-provider">
        
      </a>
    </div>
    <p>We added protection to our AI inference product, but we also have a product that proxies requests to any provider — <a href="https://developers.cloudflare.com/ai-gateway/">AI Gateway</a>. AI Gateway acts as a proxy between a user and supported inference providers, helping developers gain control, performance, and <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> over their AI applications. In line with our mission to help build a better Internet, we wanted to quickly roll out a fix that can help all our customers using text generation AIs, regardless of which provider they use or if they have mitigations to prevent this attack. To do this, we implemented a similar solution that pads all streaming responses proxied through AI Gateway with random noise of variable length.</p><p>Our AI Gateway customers are now automatically protected against this side-channel attack, even if the upstream inference providers have not yet mitigated the vulnerability. If you are unsure if your inference provider has patched this vulnerability yet, use AI Gateway to proxy your requests and ensure that you are protected.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>At Cloudflare, our mission is to help build a better Internet – that means that we care about all citizens of the Internet, regardless of what their tech stack looks like. We are proud to be able to improve the security of our AI products in a way that is transparent and requires no action from our customers.</p><p>We are grateful to the researchers who discovered this vulnerability and have been very collaborative in helping us understand the problem space. If you are a security researcher who is interested in helping us make our products more secure, check out our Bug Bounty program at <a href="http://hackerone.com/cloudflare">hackerone.com/cloudflare</a>.</p> ]]></content:encoded>
            <category><![CDATA[Bug Bounty]]></category>
            <category><![CDATA[LLM]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[AI Gateway]]></category>
            <category><![CDATA[SASE]]></category>
            <guid isPermaLink="false">1R32EruY6C8Pu6LrFCGXwy</guid>
            <dc:creator>Celso Martinho</dc:creator>
            <dc:creator>Michelle Chen</dc:creator>
        </item>
        <item>
            <title><![CDATA[Unlocking new use cases with 17 new models in Workers AI, including new LLMs, image generation models, and more]]></title>
            <link>https://blog.cloudflare.com/february-28-2024-workersai-catalog-update/</link>
            <pubDate>Wed, 28 Feb 2024 20:00:00 GMT</pubDate>
            <description><![CDATA[ In February 2024 we added 8 models for text generation, classification, and code generation use cases. Today, we’re back with 17 more models, focused on enabling new types of tasks and use cases. ]]></description>
            <content:encoded><![CDATA[ <p>On February 6th, 2024 we <a href="/february-2024-workersai-catalog-update">announced eight new models</a> that we added to our catalog for text generation, classification, and code generation use cases. Today, we’re back with seventeen (17!) more models, focused on enabling new types of tasks and use cases with Workers AI. Our catalog is now nearing 40 models, so we also decided to introduce a revamp of our developer documentation that enables users to easily search and discover new models.</p><p>The new models are listed below, and the full Workers AI catalog can be found on our <a href="https://developers.cloudflare.com/workers-ai/models/">new developer documentation</a>.</p><p><b>Text generation</b></p><ul><li><p><b>@cf/deepseek-ai/deepseek-math-7b-instruct</b></p></li><li><p><b>@cf/openchat/openchat-3.5-0106</b></p></li><li><p><b>@cf/microsoft/phi-2</b></p></li><li><p><b>@cf/tinyllama/tinyllama-1.1b-chat-v1.0</b></p></li><li><p><b>@cf/thebloke/discolm-german-7b-v1-awq</b></p></li><li><p><b>@cf/qwen/qwen1.5-0.5b-chat</b></p></li><li><p><b>@cf/qwen/qwen1.5-1.8b-chat</b></p></li><li><p><b>@cf/qwen/qwen1.5-7b-chat-awq</b></p></li><li><p><b>@cf/qwen/qwen1.5-14b-chat-awq</b></p></li><li><p><b>@cf/tiiuae/falcon-7b-instruct</b></p></li><li><p><b>@cf/defog/sqlcoder-7b-2</b></p></li></ul><p><b>Summarization</b></p><ul><li><p><b>@cf/facebook/bart-large-cnn</b></p></li></ul><p><b>Text-to-image</b></p><ul><li><p><b>@cf/lykon/dreamshaper-8-lcm</b></p></li><li><p><b>@cf/runwayml/stable-diffusion-v1-5-inpainting</b></p></li><li><p><b>@cf/runwayml/stable-diffusion-v1-5-img2img</b></p></li><li><p><b>@cf/bytedance/stable-diffusion-xl-lightning</b></p></li></ul><p><b>Image-to-text</b></p><ul><li><p><b>@cf/unum/uform-gen2-qwen-500m</b></p></li></ul>
    <div>
      <h3>New language models, fine-tunes, and quantizations</h3>
      <a href="#new-language-models-fine-tunes-and-quantizations">
        
      </a>
    </div>
    <p>Today’s catalog update includes a number of new language models so that developers can pick and choose the best LLMs for their use cases. Although most LLMs can be generalized to work in any instance, there are many benefits to choosing models that are tailored for a specific use case. We are excited to bring you some new large language models (LLMs), small language models (SLMs), and multi-language support, as well as some fine-tuned and <a href="https://www.cloudflare.com/learning/ai/what-is-quantization/">quantized</a> models.</p><p>Our latest LLM additions include <code>falcon-7b-instruct</code>, which is particularly exciting because of its innovative use of multi-query attention to generate high-precision responses. There’s also better language support with <code>discolm_german_7b</code> and the <code>qwen1.5</code> models, which are trained on multilingual data and boast impressive LLM outputs not only in English, but also in German (<code>discolm</code>) and Chinese (<code>qwen1.5</code>). The Qwen models range from 0.5B to 14B parameters and have shown particularly impressive accuracy in our testing. We’re also releasing a few new SLMs, which are growing in popularity because of their ability to do inference faster and cheaper without sacrificing accuracy. For SLMs, we’re introducing small but performant models like a 1.1B parameter version of Llama (<code>tinyllama-1.1b-chat-v1.0</code>) and a 1.3B parameter model from Microsoft (<code>phi-2</code>).</p><p>As the AI industry continues to accelerate, talented people have found ways to improve and optimize the performance and accuracy of models. 
We’ve added a fine-tuned model (openchat-3.5) which implements <a href="https://arxiv.org/abs/2309.11235">Conditioned Reinforcement Learning Fine-Tuning (C-RLFT)</a>, a technique that enables open-source language model development through the use of easily collectable mixed quality data.</p><p>We’re really excited to be bringing all these new text generation models onto our platform today. The open-source community has been incredible at developing new AI breakthroughs, and we’re grateful for everyone’s contributions to training, fine-tuning, and quantizing these models. We’re thrilled to be able to host these models and make them accessible to all so that developers can quickly and easily build new applications with AI. You can check out the new models and their API schemas on <a href="https://developers.cloudflare.com/workers-ai/models/">our developer docs</a>.</p>
    <div>
      <h3>New image generation models</h3>
      <a href="#new-image-generation-models">
        
      </a>
    </div>
    <p>We are adding new Stable Diffusion pipelines and optimizations to enable powerful new image editing and generation use cases. We’ve added support for Stable Diffusion XL Lightning which generates high quality images in just two inference steps. Text-to-image is a really popular task for folks who want to take a text prompt and have the model generate an image based on the input, but Stable Diffusion is actually capable of much more. With this new Workers AI release, we’ve unlocked new pipelines so that you can experiment with different modalities of input and tasks with Stable Diffusion.</p><p>You can now use Stable Diffusion on Workers AI for image-to-image and inpainting use cases. Image-to-image allows you to transform an input image into a different image – for example, you can ask Stable Diffusion to generate a cartoon version of a portrait. Inpainting allows users to upload an image and transform the same image into something new – examples of inpainting include “expanding” the background of photos or colorizing black-and-white photos.</p><p>To use inpainting, you’ll need to input an image, a mask, and a prompt. The image is the original picture that you want modified, the mask is a monochrome screen that highlights the area that you want to be painted over, and the prompt tells the model what to generate in that space. Below is an example of the inputs and the request template to perform inpainting.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RdRXcGbrGjlfhlZ3iF5Wj/f122e6a9405cfcfc1f8e8faa5fd6ec50/RIgkPzrxrrjSz2YPCYJItmJrOftfe5ZgcA-oS2mvIYI3L7enm62_lmSV9ua2d663tj2kpXSDUf__lVJAsaU2XOhlnUw-XS9Kt8CiwygsD30Mptndu1vQDrhafph6.png" />
            
            </figure>
            <pre><code>import { Ai } from '@cloudflare/ai';

export default {
    async fetch(request, env) {
        const formData = await request.formData();
        const prompt = formData.get("prompt")
        const imageFile = formData.get("image")
        const maskFile = formData.get("mask")

        const imageArrayBuffer = await imageFile.arrayBuffer();
        const maskArrayBuffer = await maskFile.arrayBuffer();

        const ai = new Ai(env.AI);
        const inputs = {
            prompt,
            image: [...new Uint8Array(imageArrayBuffer)],
            mask: [...new Uint8Array(maskArrayBuffer)],  
            strength: 0.8, // Adjust the strength of the transformation
            num_steps: 10, // Number of inference steps for the diffusion process
        };

        const response = await ai.run("@cf/runwayml/stable-diffusion-v1-5-inpainting", inputs);

        return new Response(response, {
            headers: {
                "content-type": "image/png",
            },
        });
    }
}</code></pre>
            
    <div>
      <h3>New use cases</h3>
      <a href="#new-use-cases">
        
      </a>
    </div>
    <p>We’ve also added new models to Workers AI that allow for various specialized tasks and use cases, such as LLMs specialized in solving math problems (<code>deepseek-math-7b-instruct</code>), generating SQL code (<code>sqlcoder-7b-2</code>), summarizing text (<code>bart-large-cnn</code>), and image captioning (<code>uform-gen2-qwen-500m</code>).</p><p>We wanted to release these to the public, so you can start building with them, but we’ll be releasing more demos and tutorial content over the next few weeks. Stay tuned to our <a href="https://twitter.com/CloudflareDev">X account</a> and <a href="https://developers.cloudflare.com/workers-ai/models/">Developer Documentation</a> for more information on how to use these new models.</p>
    <div>
      <h3>Optimizing our model catalog</h3>
      <a href="#optimizing-our-model-catalog">
        
      </a>
    </div>
    <p>AI model innovation is advancing rapidly, and so are the tools and techniques for fast and efficient inference. We’re excited to be incorporating new tools that help us optimize our models so that we can offer the best inference platform for everyone. Typically, when optimizing AI inference it is useful to serialize the model into a format such as <a href="https://onnxruntime.ai/">ONNX</a>, one of the most generally applicable options for this use case with broad hardware and model architecture support. An ONNX model can be further optimized by being converted to a <a href="https://github.com/NVIDIA/TensorRT">TensorRT</a> engine. This format, designed specifically for Nvidia GPUs, can result in lower inference latency and higher total throughput from LLMs. Choosing the right format usually comes down to what is best supported by specific model architectures and the hardware available for inference. We decided to leverage both TensorRT and ONNX formats for our new Stable Diffusion pipelines, which represent a series of models applied for a specific task.</p>
    <div>
      <h3>Explore more on our new developer docs</h3>
      <a href="#explore-more-on-our-new-developer-docs">
        
      </a>
    </div>
    <p>You can explore all these new models in our <a href="https://developers.cloudflare.com/workers-ai/models/">new developer docs</a>, where you can learn more about individual models, their prompt templates, as well as properties like context token limits. We’ve redesigned the model page to be simpler for developers to explore new models and learn how to use them. You’ll now see all the models on one page for searchability, with the task type on the right-hand side. Then, you can click into individual model pages to see code examples on how to use those models.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6vsVFSfZ9pvpWkTXdl6iWw/9b03d7b28a3420f378564da0eae5fe44/image3.png" />
            
            </figure><p>We hope you try out these new models and build something new on Workers AI! We have more updates coming soon, including more demos, tutorials, and Workers AI pricing. Let us know what you’re working on and other models you’d like to see on our <a href="https://discord.cloudflare.com">Discord</a>.</p> ]]></content:encoded>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">Q57bXuDbwJJ9BWzWqsgEB</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Logan Grasby</dc:creator>
        </item>
        <item>
            <title><![CDATA[Adding new LLMs, text classification and code generation models to the Workers AI catalog]]></title>
            <link>https://blog.cloudflare.com/february-2024-workersai-catalog-update/</link>
            <pubDate>Tue, 06 Feb 2024 20:00:10 GMT</pubDate>
            <description><![CDATA[ Workers AI is now bigger and better with 8 new models and improved model performance ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bUPtXgJU5a797f2LMkI7k/b830752f7f52d7ab46bf396e00c93ea7/image2-1.png" />
            
            </figure><p>Over the last few months, the Workers AI team has been hard at work making improvements to our AI platform. We launched back in September, and in November, we added more models like Code Llama, Stable Diffusion, Mistral, as well as improvements like streaming and longer context windows.</p><p>Today, we’re excited to announce the release of eight new models.</p><p>The new models are highlighted below, but check out our full model catalog with over 20 models <a href="https://developers.cloudflare.com/workers-ai/">in our developer docs</a>.</p><p><b>Text generation</b></p><ul><li><p><b>@hf/thebloke/llama-2-13b-chat-awq</b></p></li><li><p><b>@hf/thebloke/zephyr-7b-beta-awq</b></p></li><li><p><b>@hf/thebloke/mistral-7b-instruct-v0.1-awq</b></p></li><li><p><b>@hf/thebloke/openhermes-2.5-mistral-7b-awq</b></p></li><li><p><b>@hf/thebloke/neural-chat-7b-v3-1-awq</b></p></li><li><p><b>@hf/thebloke/llamaguard-7b-awq</b></p></li></ul><p><b>Code generation</b></p><ul><li><p><b>@hf/thebloke/deepseek-coder-6.7b-base-awq</b></p></li><li><p><b>@hf/thebloke/deepseek-coder-6.7b-instruct-awq</b></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/693UrRZZSZJo8omR5ex2Vk/ac76b952bac6fd613cb8e2c79b7e2f10/image1.png" />
            
            </figure>
    <div>
      <h3>Bringing you the best of open source</h3>
      <a href="#bringing-you-the-best-of-open-source">
        
      </a>
    </div>
    <p>Our mission is to support a wide array of open source models and tasks. In line with this, we're excited to announce a preview of the latest models and features available for deployment on Cloudflare's network.</p><p>One of the standout models is <code>deepseek-coder-6.7b</code>, which notably scores <a href="https://github.com/deepseek-ai/deepseek-coder">approximately 15% higher</a> on popular benchmarks against comparable Code Llama models. This performance advantage is attributed to its diverse training data, which includes both English and Chinese code generation datasets. In addition, the <code>openhermes-2.5-mistral-7b</code> model showcases how high-quality fine-tuning datasets can improve the accuracy of base models. This Mistral 7b fine-tune outperforms the base model by <a href="https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B#benchmark-results">approximately 10% on many LLM benchmarks</a>.</p><p>We're also introducing innovative models that incorporate Activation-aware Weight Quantization (AWQ), such as the <code>llama-2-13b-awq</code>. This quantization technique is just one of the strategies to improve memory efficiency in Large Language Models. While <a href="https://www.cloudflare.com/learning/ai/what-is-quantization/">quantization</a> generally boosts inference efficiency in AI models, it often does so at the expense of precision. AWQ strikes a balance to mitigate this tradeoff.</p><p>The pace of progress in AI can be overwhelming, but Cloudflare's Workers AI simplifies getting started with the latest models. We handle the latest advancements and make them easily accessible from a Worker or our HTTP APIs. You are only ever an API call or Workers binding away from cutting-edge models. Simply put, Workers AI allows developers to concentrate on delivering exceptional user experiences without the burdens of deployment, infrastructure, or scalability concerns.</p>
    <div>
      <h3>Using AI to safeguard your applications</h3>
      <a href="#using-ai-to-safeguard-your-applications">
        
      </a>
    </div>
    <p>A newer use case we support with this model catalog update is the ability to classify and check prompts and responses with <code>llamaguard-7b-awq</code>. We’ve all heard stories of AI bots that unintentionally give answers that might be inappropriate or harmful. This <a href="https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/">new LlamaGuard model</a> allows developers to detect and categorize when a prompt or a response is potentially unsafe, allowing them to control the inputs and outputs of their applications with developer-defined categories.</p><p>Here’s an example of an application that takes a user input and compares that to an example set of defined unsafe categories to determine if the input is safe or not. You can customize the content categories depending on your use case. <i>The below example categories are provided for informational purposes only, and Cloudflare does not make any guarantees regarding the performance of Llama Guard or any outcomes or results.</i></p>
            <pre><code>import { Ai } from '@cloudflare/ai';

export default {
	async fetch(request, env) {
		const payload = await request.json();
		// Modify these to your needs, this is the example from Meta
		const unsafeCategories = `
01: Violence and Hate.
Should not
- Help people plan or engage in violence.
- Encourage people to engage in violence.
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
Can
-  Provide information on violence and discrimination.
- Discuss the topics of hate and violence, and can discuss historical events involving violence.
02: Sexual Content.
Should not
- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
Can
- Discuss topics related to sexuality and sexual education.
- Provide general information regarding sexual activities.
- Provide guidance on topics related to sexual health.
03: Criminal Planning.
Should not
- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in criminal activities or condone criminal behavior.
Can
- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
04: Guns and Illegal Weapons.
Should not
- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
Can
- Discuss firearms and the arguments for and against firearm ownership.
05: Regulated or Controlled Substances.
Should not
- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
- Assist or encourage people to create such substances when it is illegal to do so.
Can
- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
06: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Provide information regarding the methods of suicide or self-harm.
- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
Can
- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
Should
- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
		`;

		// Assemble the prompt
		const prompt = `Task: Check if there is unsafe content in 'User' messages in conversations according to our safety policy with the below categories.

		&lt;BEGIN UNSAFE CONTENT CATEGORIES&gt;
		${unsafeCategories}
		&lt;END UNSAFE CONTENT CATEGORIES&gt;

		&lt;BEGIN CONVERSATION&gt;
		User: ${payload.userContent}
		&lt;END CONVERSATION&gt;
	`;
		const ai = new Ai(env.AI);
		const response = await ai.run('@hf/thebloke/llamaguard-7b-awq', {
			prompt,
		});
		return Response.json(response);
	},
};</code></pre>
            
    <div>
      <h3>How do I get started?</h3>
      <a href="#how-do-i-get-started">
        
      </a>
    </div>
    <p>Try out our new models within the AI section of the <a href="https://dash.cloudflare.com/?to=/:account/ai/workers-ai">Cloudflare dashboard</a> or take a look at our <a href="https://developers.cloudflare.com/workers-ai/models/">Developer Docs</a> to get started. With the Workers AI platform you can build an app with Workers and Pages, store data with R2, D1, Workers KV, or Vectorize, and run model inference with Workers AI – all in one place. Having more models allows developers to build all different kinds of applications, and we plan to continually update our model catalog to bring you the best of open-source.</p><p>We’re excited to see what you build! If you’re looking for inspiration, take a look at our <a href="https://workers.cloudflare.com/built-with/collections/ai-workers/">collection of “Built-with” stories</a> that highlight what others are building on Cloudflare’s Developer Platform. Stay tuned for a pricing announcement and higher usage limits coming in the next few weeks, as well as more models coming soon. <a href="https://discord.cloudflare.com/">Join us on Discord</a> to share what you’re working on and any feedback you might have.</p> ]]></content:encoded>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">58duMHip3s7DGNRo47fweV</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Logan Grasby</dc:creator>
        </item>
        <item>
            <title><![CDATA[Announcing AI Gateway: making AI applications more observable, reliable, and scalable]]></title>
            <link>https://blog.cloudflare.com/announcing-ai-gateway/</link>
            <pubDate>Wed, 27 Sep 2023 13:00:35 GMT</pubDate>
            <description><![CDATA[ AI Gateway helps developers have greater control and visibility in their AI apps, so that you can focus on building without worrying about observability, reliability, and scaling. AI Gateway handles the things that nearly all AI applications need ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KExrqxBeL4yGLYsMeeZ2j/26e12b714f8653f8132aeaf14c0a1e78/image4-10.png" />
            
            </figure><p>Today, we’re excited to announce our beta of <b>AI Gateway</b> – the portal to making your AI applications more observable, reliable, and scalable.</p><p>AI Gateway sits between your application and the AI APIs that your application makes requests to (like OpenAI) – so that we can cache responses, limit and retry requests, and provide analytics to help you monitor and track usage. AI Gateway handles the things that nearly all AI applications need, saving you engineering time, so you can focus on what you're building.</p>
    <div>
      <h3>Connecting your app to AI Gateway</h3>
      <a href="#connecting-your-app-to-ai-gateway">
        
      </a>
    </div>
    <p>It only takes one line of code for developers to get started with Cloudflare’s AI Gateway. All you need to do is replace the URL in your API calls with your unique AI Gateway endpoint. For example, with OpenAI you would define your baseURL as <code>"https://gateway.ai.cloudflare.com/v1/ACCOUNT_TAG/GATEWAY/openai"</code> instead of <code>"https://api.openai.com/v1"</code> – and that’s it. You can keep your tokens in your code environment, and we’ll log the request through AI Gateway before letting it pass through to the final API with your token.</p>
            <pre><code>// configuring AI gateway with the dedicated OpenAI endpoint

const openai = new OpenAI({
  apiKey: env.OPENAI_API_KEY,
  baseURL: "https://gateway.ai.cloudflare.com/v1/ACCOUNT_TAG/GATEWAY/openai",
});</code></pre>
            <p>We currently support model providers such as OpenAI, Hugging Face, and Replicate, with plans to add more in the future. We support all the various endpoints within providers and also response streaming, so everything should work out-of-the-box once you have the gateway configured. The dedicated endpoint for these providers allows you to connect your apps to AI Gateway by changing one line of code, without touching your original payload structure.</p><p>We also have a universal endpoint that you can use if you’d like more flexibility with your requests. With the universal endpoint, you can define fallback models and handle request retries. For example, let’s say a request was made to OpenAI GPT-3, but the API was down – with the universal endpoint, you could define Hugging Face GPT-2 as your fallback model and the gateway will automatically resend that request to Hugging Face. This is really helpful for improving your app’s resiliency when you’re seeing unusual errors or getting rate limited, or when one provider’s bill is getting costly and you want to diversify to other models. With the universal endpoint, you’ll just need to tweak your payload to specify the provider and endpoint, so we can properly route requests for you. Check out the example request below and <a href="https://developers.cloudflare.com/ai-gateway">the docs</a> for more details on the universal endpoint schema.</p>
            <pre><code># Using the Universal Endpoint to first try OpenAI, then Hugging Face

curl https://gateway.ai.cloudflare.com/v1/ACCOUNT_TAG/GATEWAY  -X POST \
  --header 'Content-Type: application/json' \
  --data '[
  {
    "provider": "openai",
    "endpoint": "chat/completions",
    "headers": { 
      "Authorization": "Bearer $OPENAI_TOKEN",
      "Content-Type": "application/json"
    },
    "query": {
      "model": "gpt-3.5-turbo",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "What is Cloudflare?"
        }
      ]
    }
  },
  {
    "provider": "huggingface",
    "endpoint": "gpt2",
    "headers": { 
      "Authorization": "Bearer $HF_TOKEN",
      "Content-Type": "application/json"
    },
    "query": {
      "inputs": "What is Cloudflare?"
    }
  }
]'</code></pre>
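<p>The same ordered fallback list can be assembled in a Worker and sent to the universal endpoint with <code>fetch</code>. A minimal sketch (<code>ACCOUNT_TAG</code> and <code>GATEWAY</code> are placeholders for your own values, and <code>buildFallbackPayload</code> is our own helper, not part of the gateway API):</p>
<pre><code>// Sketch: calling the AI Gateway universal endpoint from a Worker.
// ACCOUNT_TAG and GATEWAY are placeholders for your own values;
// buildFallbackPayload is an illustrative helper, not a gateway API.
const GATEWAY_URL = 'https://gateway.ai.cloudflare.com/v1/ACCOUNT_TAG/GATEWAY';

// The gateway tries each entry in order, falling back on failure.
function buildFallbackPayload(question, openaiToken, hfToken) {
  return [
    {
      provider: 'openai',
      endpoint: 'chat/completions',
      headers: {
        Authorization: 'Bearer ' + openaiToken,
        'Content-Type': 'application/json',
      },
      query: {
        model: 'gpt-3.5-turbo',
        messages: [{ role: 'user', content: question }],
      },
    },
    {
      provider: 'huggingface',
      endpoint: 'gpt2',
      headers: {
        Authorization: 'Bearer ' + hfToken,
        'Content-Type': 'application/json',
      },
      query: { inputs: question },
    },
  ];
}

// Inside a Worker fetch handler:
// const res = await fetch(GATEWAY_URL, {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(
//     buildFallbackPayload('What is Cloudflare?', env.OPENAI_API_KEY, env.HF_TOKEN)
//   ),
// });
</code></pre>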
            
    <div>
      <h3>Gaining visibility into your app’s usage</h3>
      <a href="#gaining-visibility-into-your-apps-usage">
        
      </a>
    </div>
    <p>Now that your app is connected to Cloudflare, we can help you gather analytics and give you insight into, and control over, the traffic passing through your apps. Regardless of what model or infrastructure you use in the backend, we can help you log requests and analyze data like the number of requests, the number of users, the cost of running the app, and the duration of requests. Although these seem like basic analytics that model providers should expose, it’s surprisingly difficult to get visibility into these metrics from most providers. AI Gateway takes it one step further and lets you aggregate analytics across multiple providers too.</p>
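<p>Because the gateway sits in the request path, those raw per-request metrics can be rolled up across providers. A toy illustration of that kind of aggregation – the log-entry shape used here is hypothetical, not a Cloudflare API:</p>
<pre><code>// Illustration only: rolling up per-request metrics (counts, cost,
// duration) by provider. The log-entry shape is hypothetical and not
// part of any Cloudflare API.
function summarizeByProvider(logs) {
  const summary = {};
  for (const { provider, cost, durationMs } of logs) {
    if (!summary[provider]) {
      summary[provider] = { requests: 0, totalCost: 0, totalDurationMs: 0 };
    }
    summary[provider].requests += 1;
    summary[provider].totalCost += cost;
    summary[provider].totalDurationMs += durationMs;
  }
  return summary;
}
</code></pre>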
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7yPlnJxKwRTlaUEeU9qvpg/706604d295c464f7d1608df10311e1d4/image3-24.png" />
            
            </figure>
    <div>
      <h3>Controlling how your app scales</h3>
      <a href="#controlling-how-your-app-scales">
        
      </a>
    </div>
    <p>One of the pain points we often hear is how much it costs to build and run AI apps. Each API call can be unpredictably expensive and costs can rack up quickly, preventing developers from scaling their apps to their full potential. At the speed that the industry is moving, you don’t want to be limited by your scale and left behind – and that’s where <a href="https://www.cloudflare.com/learning/cdn/what-is-caching/">caching</a> and rate limiting can help. We allow developers to cache their API calls so that new requests can be served from our cache rather than the original API – making it cheaper and faster. <a href="https://www.cloudflare.com/learning/bots/what-is-rate-limiting/">Rate limiting</a> can also help control costs by throttling the number of requests and preventing excessive or suspicious activity. Developers have full flexibility to define caching and rate limiting rules, so that apps can scale at a sustainable pace of your choosing.</p>
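<p>For per-request control, the AI Gateway developer docs describe <code>cf-aig-*</code> request headers for tuning cache behavior. The header names below are taken from those docs and should be treated as assumptions for this beta; the helper itself is purely illustrative:</p>
<pre><code>// Sketch: opting a single request in or out of gateway caching.
// Header names (cf-aig-cache-ttl, cf-aig-skip-cache) come from the
// AI Gateway docs and are assumptions here, not part of this post.
function cacheHeaders({ ttlSeconds, skipCache = false }) {
  const headers = { 'Content-Type': 'application/json' };
  if (skipCache) {
    // Bypass the cache entirely for this request.
    headers['cf-aig-skip-cache'] = 'true';
  } else if (ttlSeconds) {
    // Cache this response for the given number of seconds.
    headers['cf-aig-cache-ttl'] = String(ttlSeconds);
  }
  return headers;
}
</code></pre>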
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EgZOiCNWYJkkqeEjkV9Ew/cad537af80e68a5eac2b1f6c8cc977d0/image1-20.png" />
            
            </figure>
    <div>
      <h3>The Workers AI Platform</h3>
      <a href="#the-workers-ai-platform">
        
      </a>
    </div>
    <p>AI Gateway pairs perfectly with our new <a href="/workers-ai">Workers AI</a> and <a href="/vectorize-vector-database-open-beta">Vectorize</a> products, so you can build full-stack AI applications all within the Workers ecosystem. From deploying applications with Workers, running model inference on the edge with Workers AI, storing <a href="https://www.cloudflare.com/learning/ai/what-are-embeddings/">vector embeddings</a> on Vectorize, to gaining visibility into your applications with AI Gateway – the Workers platform is your one-stop shop to bring your AI applications to life. To learn how to use AI Gateway with Workers AI or the different providers, check out <a href="https://developers.cloudflare.com/ai-gateway/">the docs</a>.</p>
    <div>
      <h3>Next up: the enterprise use case</h3>
      <a href="#next-up-the-enterprise-use-case">
        
      </a>
    </div>
    <p>We are shipping v1 of AI Gateway with a few core features, but we have plans to expand the product to cover more advanced use cases as well – usage alerts, jailbreak protection, dynamic model routing with A/B testing, and advanced cache rules. But what we’re really excited about are the other ways you can apply AI Gateway…</p><p>In the future, we want to develop AI Gateway into a product that helps organizations monitor and observe how their users or employees are using AI. This way, you can flip a switch and have all requests within your network to providers (like OpenAI) pass through Cloudflare first – so that you can log user requests, apply access policies, and enable rate limiting and <a href="https://www.cloudflare.com/learning/access-management/what-is-dlp/">data loss prevention (DLP)</a> strategies. A powerful example: if an employee accidentally pastes an API key into ChatGPT, AI Gateway can be configured to see the outgoing request and redact the API key or block the request entirely, preventing it from ever reaching OpenAI or any other provider. We can also log and alert on suspicious requests, so that organizations can proactively investigate and control certain types of activity. AI Gateway then becomes a really powerful tool for organizations that might be excited about the efficiency that <a href="https://www.cloudflare.com/learning/ai/what-is-artificial-intelligence/">AI</a> unlocks, but hesitant about trusting AI when <a href="https://www.cloudflare.com/learning/privacy/what-is-data-privacy/">data privacy</a> and user error are really critical threats. We hope that AI Gateway can alleviate these concerns and make adopting AI tools a lot easier for organizations.</p><p>Whether you’re a developer building applications or a company that’s interested in how employees are using AI, our hope is that AI Gateway can help you demystify what’s going on inside your apps – because once you understand how your users are using AI, you can make decisions on how you actually want them to use it. Some of these features are still in development, but we hope this illustrates the power of AI Gateway and our vision for the future.</p><p>At Cloudflare, we live and breathe innovation (as you can tell by our Birthday Week announcements!) and the pace of innovation in AI is incredible to witness. We’re thrilled that we can not only help people build and use apps, but actually help <i>accelerate</i> the adoption and development of AI with greater control and visibility. We can’t wait to hear what you build – head to the Cloudflare dashboard to <a href="https://dash.cloudflare.com/?to=/:account/ai/ai-gateway/general/">try out AI Gateway</a> and let us know what you think!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3HQSDDrg5pgwRfiz2kAOXP/dca3745aa7d061a0b37d36de6537cf65/image2-17.png" />
            
            </figure> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[AI Gateway]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">7aWkvjGXI3nxWNZsx759Q5</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Yo'av Moshe</dc:creator>
            <dc:creator>Meaghan Choi</dc:creator>
        </item>
    </channel>
</rss>