The Cloudflare Blog

Growing the Cloudflare AI team with talent from Ensemble AI

Alex Reneau — Mon, 15 Jun 2026 13:00:00 GMT

Today, we’re excited to share that key members of the team at Ensemble AI are joining Cloudflare to help accelerate our work in AI infrastructure and make it easier for developers to run powerful AI models efficiently at scale.

Ensemble AI, founded in 2023 in San Francisco, has spent the last few years focused on one of the most important challenges in AI: making large models faster, smaller, and more cost-effective to serve, without sacrificing quality. The team has developed new approaches to model compression and efficient inference that are designed to reduce the memory, compute, and deployment overhead of large language models and multimodal architectures.

As AI becomes a core part of how developers build applications, the economics of inference matter more than ever. Models are getting larger; workloads are becoming more dynamic. And customers increasingly expect AI to be available everywhere: globally distributed, fast, reliable, and affordable. Bringing the Ensemble AI team into Cloudflare strengthens our ability to make that possible.

Incorporating Ensemble’s expertise

The team at Ensemble AI has focused on preserving the structure inside modern AI models while reducing the cost of running them. Instead of treating model efficiency as only a quantization or hardware problem, Ensemble has explored new model building blocks that can make neural networks more compact and efficient at the architectural level.

A core part of this work is NdLinear, a drop-in replacement for standard linear layers in transformer models that operates directly on multidimensional activations rather than flattening structure away. This enables models to preserve meaningful axes, such as heads, channels, spatial dimensions, or other structured representations, while reducing parameter count and compute. Ensemble has also developed NdLinear-LoRA, an efficient adaptation method designed to reduce the trainable parameters required for fine-tuning large models.

These approaches complement other efficiency techniques, including quantization and vector quantization. Together, they point toward a future where developers can run capable AI models with substantially lower memory, compute, and cost requirements.

Making AI inference more efficient

Cloudflare Workers AI gives developers access to serverless GPU-powered inference on Cloudflare’s global network. As developers build more AI-native applications, the ability to serve models efficiently becomes a critical part of the platform.

Inference cost is one of the biggest barriers to scaling AI applications. Every improvement in model size, memory footprint, throughput, and GPU utilization can make AI more accessible to developers and more economical for customers. This is especially important as AI workloads expand beyond simple text generation into agents, multimodal models, personalization, fine-tuning, retrieval, and reinforcement learning.

We are deepening our investment in the core machine learning capabilities needed to make Workers AI faster, more flexible, and more cost-efficient. This builds on top of our existing work on improving model efficiency, including our inference engine Infire, tensor compression techniques like Unweight, and our platform for running extra large language models. The team will focus on improving the economics of serving large language models and other advanced AI architectures, with an emphasis on model efficiency, GPU utilization, and scalable deployment.

Building for the next generation of AI workloads

AI infrastructure is entering a new phase. Developers no longer need only access to models; they need infrastructure that can run models reliably, affordably, and close to users. They need the ability to experiment with different model sizes, fine-tuning approaches, and deployment patterns without being blocked by cost or operational complexity.

Cloudflare is uniquely positioned to help solve this. Our global network, developer platform, and serverless architecture give us the foundation to bring AI closer to where applications already run. The Workers AI Machine Learning Engineering team will help us improve the efficiency layer underneath that experience.

By combining Cloudflare’s global infrastructure with Ensemble’s work in model compression and efficient architectures, we can continue building a platform where developers can deploy AI applications with lower cost, better performance, and less operational overhead.

What’s next

Together, we will continue building the infrastructure needed to make AI more efficient, accessible, and useful for developers everywhere. Our goal is simple: help developers run powerful AI workloads at global scale while improving the economics of inference across the Cloudflare platform. If you want to join us in our mission, check out our careers page.

Cloudflare’s AI Platform: an inference layer designed for agents

Ming Lu — Thu, 16 Apr 2026 14:05:00 GMT

AI models are changing quickly: the best model to use for agentic coding today might in three months be a completely different model from a different provider. On top of this, real-world use cases often require calling more than one model. Your customer support agent might use a fast, cheap model to classify a user's message; a large, reasoning model to plan its actions; and a lightweight model to execute individual tasks.

This means you need access to all the models, without tying yourself financially and operationally to a single provider. You also need the right systems in place to monitor costs across providers, ensure reliability when one of them has an outage, and manage latency no matter where your users are.

These challenges are present whenever you’re building with AI, but they get even more pressing when you’re building agents. A simple chatbot might make one inference call per user prompt. An agent might chain ten calls together to complete a single task and suddenly, a single slow provider doesn't add 50ms, it adds 500ms. One failed request isn't a retry, but suddenly a cascade of downstream failures.

Since launching AI Gateway and Workers AI, we’ve seen incredible adoption from developers building AI-powered applications on Cloudflare and we’ve been shipping fast to keep up! In just the past few months, we've refreshed the dashboard, added zero-setup default gateways, automatic retries on upstream failures, and more granular logging controls. Today, we’re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.

One catalog, one unified endpoint

Starting today, you can call third-party models using the same AI.run() binding you already use for Workers AI. If you’re using Workers, switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or any other provider is a one-line change.

const response = await env.AI.run('anthropic/claude-opus-4-6',{
input: 'What is Cloudflare?',
}, {
gateway: { id: "default" },
});

For those who don’t use Workers, we’ll be releasing REST API support in the coming weeks, so you can access the full model catalog from any environment.

We’re also excited to share that you'll now have access to 70+ models across 12+ providers — all through one API, one line of code to switch between them, and one set of credits to pay for them. And we’re quickly expanding this as we go.

You can browse through our model catalog to find the best model for your use case, from open-source models hosted on Cloudflare Workers AI to proprietary models from the major model providers. We’re excited to be expanding access to models from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu — who will provide their models through AI Gateway. Notably, we’re expanding our model offerings to include image, video, and speech models so that you can build multimodal applications

Accessing all your models through one API also means you can manage all your AI spend in one place. Most companies today are calling an average of 3.5 models across multiple providers, which means no one provider is able to give you a holistic view of your AI usage. With AI Gateway, you’ll get one centralized place to monitor and manage AI spend.

By including custom metadata with your requests, you can get a breakdown of your costs on the attributes that you care about most, like spend by free vs. paid users, by individual customers, or by specific workflows in your app.

const response = await env.AI.run('@cf/moonshotai/kimi-k2.5',
      {
prompt: 'What is AI Gateway?'
      },
      {
metadata: { "teamId": "AI", "userId": 12345 }
      }
    );

Bring your own model

AI Gateway gives you access to models from all the providers through one API. But sometimes you need to run a model you've fine-tuned on your own data or one optimized for your specific use case. For that, we are working on letting users bring their own model to Workers AI.

The overwhelming majority of our traffic comes from dedicated instances for Enterprise customers who are running custom models on our platform, and we want to bring this to more customers. To do this, we leverage Replicate’s Cog technology to help you containerize machine learning models.

Cog is designed to be quite simple: all you need to do is write down dependencies in a cog.yaml file, and your inference code in a Python file. Cog abstracts away all the hard things about packaging ML models, such as CUDA dependencies, Python versions, weight loading, etc.

Example of a cog.yaml file:

build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"

Example of a predict.py file, which has a function to set up the model and a function that runs when you receive an inference request (a prediction):

from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(self,
            image: Path = Input(description="Image to enlarge"),
            scale: float = Input(description="Factor to scale image by", default=1.5)
    ) -> Path:
        """Run a single prediction on the model"""
        # ... pre-processing ...
        output = self.net(input)
        # ... post-processing ...
        return output

Then, you can run cog build to build your container image, and push your Cog container to Workers AI. We will deploy and serve the model for you, which you then access through your usual Workers AI APIs.

We’re working on some big projects to be able to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting. We’ve been testing this internally with Cloudflare teams and some external customers who are guiding our vision. If you’re interested in being a design partner with us, please reach out! Soon, anyone will be able to package their model and use it through Workers AI.

The fast path to first token

Using Workers AI models with AI Gateway is particularly powerful if you’re building live agents – where a user's perception of speed hinges on time to first token or how quickly the agent starts responding, rather than how long the full response takes. Even if total inference is 3 seconds, getting that first token 50ms faster makes the difference between an agent that feels zippy and one that feels sluggish.

Cloudflare's network of data centers in 330 cities around the world means AI Gateway is positioned close to both users and inference endpoints, minimizing the network time before streaming begins.

Workers AI also hosts open-source models on its public catalog, which now includes large models purpose-built for agents, including Kimi K2.5 and real-time voice models. When you call these Cloudflare-hosted models through AI Gateway, there's no extra hop over the public Internet since your code and inference run on the same global network, giving your agents the lowest latency possible.

Built for reliability with automatic failover

When building agents, speed is not the only factor that users care about – reliability matters too. Every step in an agent workflow depends on the steps before it. Reliable inference is crucial for agents because one call failing can affect the entire downstream chain.

Through AI Gateway, if you're calling a model that's available on multiple providers and one provider goes down, we'll automatically route to another available provider without you having to write any failover logic of your own.

If you’re building long-running agents with Agents SDK, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they’re generated, independently of your agent's lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or paying twice for the same output tokens. Combined with the Agents SDK's built-in checkpointing, the end user never notices.

Replicate

The Replicate team has officially joined our AI Platform team, so much so that we don’t even consider ourselves separate teams anymore. We’ve been hard at work on integrations between Replicate and Cloudflare, which include bringing all the Replicate models onto AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. Soon, you’ll be able to access the models you loved on Replicate through AI Gateway, and host the models you deployed on Replicate on Workers AI as well.

Get started

To get started, check out our documentation for AI Gateway or Workers AI. Learn more about building agents on Cloudflare through Agents SDK.

Watch on Cloudflare TV

Building the foundation for running extra-large language models

Michelle Chen — Thu, 16 Apr 2026 14:00:00 GMT

An agent needs to be powered by a large language model. A few weeks ago, we announced that Workers AI is officially entering the arena for hosting large open-source models like Moonshot’s Kimi K2.5. Since then, we’ve made Kimi K2.5 3x faster and have more model additions in-flight. These models have been the backbone of a lot of the agentic products, harnesses, and tools that we have been launching this week.

Hosting AI models is an interesting challenge: it requires a delicate balance between software and very, very expensive hardware. At Cloudflare, we’re good at squeezing every bit of efficiency out of our hardware through clever software engineering. This is a deep dive on how we’re laying the foundation to run extra-large language models.

Hardware configurations

As we mentioned in our previous Kimi K2.5 blog post, we’re using a variety of hardware configurations in order to best serve models. A lot of hardware configurations depend on the size of inputs and outputs that users are sending to the model. For example, if you are using a model to write fanfiction, you might give it a few small prompts (input tokens) while asking it to generate pages of content (output tokens).

Conversely, if you are running a summarization task, you might be sending in hundreds of thousands of input tokens, but only generating a small summary with a few thousand output tokens. Presented with these opposing use cases, you have to make a choice — should you tune your model configuration so it’s faster at processing input tokens, or faster at generating output tokens?

When we launched large language models on Workers AI, we knew that most of the use cases would be used for agents. With agents, you send in a large number of input tokens. It starts off with a large system prompt, all the tools, MCPs. With the first user prompt, that context keeps growing. Each new prompt from the user sends a request to the model, which consists of everything that was said before — all the previous user prompts, assistant messages, code generated, etc. For Workers AI, that means we had to focus on two things: fast input token processing and fast tool calling.

Prefill decode (PD) disaggregation

One hardware configuration that we use to improve performance and efficiency is disaggregated prefill. There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound. This means that the parts of the GPU that are used in each stage are different, and since prefill is always done before decode, the stages block one another. Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine.

With prefill decode disaggregation, separate inference servers are run for each stage. First, a request is sent to the prefill stage which performs prefill and stores it in its KV cache. Then the same request is sent to the decode server, with information about how to transfer the KV cache from the prefill server and begin decoding. This has a number of advantages, because it allows the servers to be tuned independently for the role they are performing, scaled to account for more input-heavy or output-heavy traffic, or even to run on heterogeneous hardware.

This architecture requires a relatively complex load balancer to achieve. Beyond just routing the requests as described above, it must rewrite the responses (including streaming SSE) of the decode server to include information from the prefill server such as cached tokens. To complicate matters, different inference servers require different information to initiate the KV cache transfer. We extended this to implement token-aware load balancing, in which there is a pool of prefill and decode endpoints, and the load balancer estimates how many prefill or decode tokens are in-flight to each endpoint in the pool and attempts to spread this load evenly.

After our public model launch, our input/output patterns changed drastically again. We took the time to analyze our new usage patterns and then tuned our configuration to fit our customer’s use cases.

Here’s a graph of our p90 Time to First Token drop after shifting traffic to our new PD disaggregated architecture, whilst request volume increased, using the same quantity of GPUs. We see a significant improvement in the tail latency variance.

Similarly, p90 time per token went from ~100 ms with high variance to 20-30 ms, a 3x improvement in intertoken latency.

Prompt Caching

Since agentic use cases usually have long contexts, we optimize for efficient prompt caching in order to not recompute input tensors on every turn. We leverage a header called x-session-affinity in order to help requests route to the right region that previously had the computed input tensors. We wrote about this in our original blog post about launching large LLMs on Workers AI. We added session affinity headers to popular agent harnesses like OpenCode, where we noticed a significant increase in total throughput. A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model. While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching. We incentivize the use of the header by offering discounted cached tokens. We highly encourage users to leverage prompt caching in order to have faster inference and cheaper pricing.

We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times. This significantly increases the request throughput that we can handle, while offering better performance for interactive or time-sensitive sessions like OpenCode or AI code reviews.

KV-cache optimization

As we’re serving larger models now, one instance can span multiple GPUs. This means that we had to find an efficient way to share KV cache across GPUs. KV cache is where all the input tensors from prefill (result of prompts in a session) are stored, and initially lives in the VRAM of a GPU. Every GPU has a fixed VRAM size, but if your model instance requires multiple GPUs, there needs to be a way for the KV cache to live across GPUs and talk to each other. To achieve this for Kimi, we leveraged Moonshot AI’s Mooncake Transfer Engine and Mooncake Store.

Mooncake’s Transfer Engine is a high-performance data transfer framework. It works with different Remote Direct Memory Access (RDMA) protocols such as NVLink and NVMe over Fabric, which enables direct memory-to-memory data transfer without involving the CPU. It improves the speed of transferring data across multiple GPU machines, which is particularly important in multi-GPU and multi-node configurations for models.

When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node. This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly. Mooncake Store also allows us to extend the cache beyond GPU VRAM, and leverage NVMe storage. This extends the time that sessions remain in cache, improving our cache hit ratio and allowing us to handle more traffic and offer better performance to users.

Speculative decoding

LLMs work by predicting the next token in a sequence, based on the tokens that came before it. With a naive implementation, models only predict the next n token, but we can actually make it predict the next n+1, n+2... tokens in a single forward pass of the model. This popular technique is known as speculative decoding, which we’ve written about in a previous post on Workers AI.

With speculative decoding, we leverage a smaller LLM (the draft model) to generate a few candidate tokens for the target model to choose from. The target model then just has to select from a small pool of candidate tokens in a single forward pass. Validating the tokens is faster and less computationally expensive than using the larger target model to generate the tokens. However, quality is still upheld as the target model ultimately has to accept or reject the draft tokens.

In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it’s wrapped in a JSON envelope.

To do this with Kimi K2.5, we leverage NVIDIA’s EAGLE-3 (Extrapolation Algorithm for Greater Language-model Efficiency) draft model. The levers for tuning speculative decoding include the number of future tokens to generate. As a result, we’re able to achieve high-quality inference while speeding up tokens per second throughput.

Infire: our proprietary inference engine

As we announced during Birthday Week in 2025, Cloudflare has a proprietary inference engine, Infire, that makes machine learning models faster. Infire is an inference engine written in Rust, designed to support Cloudflare’s unique challenges with inference given our distributed global network. We’ve extended Infire support for this new class of large language models we are planning to run, which meant we had to build a few new features to make it all work.

Multi-GPU support

Large language models like Kimi K2.5 are over 1 trillion parameters, which is about 560GB of model weights. A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that’s not even including the extra VRAM you would need for KV Cache, which includes your context window.

Since we initially launched Infire, we had to add support for multi-GPU, letting the inference engine run across multiple GPUs in either pipeline-parallel or tensor-parallel modes with expert-parallelism supported as well.

For pipeline parallelism, Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing. On the other hand, for tensor parallelism, Infire optimizes for reducing cross-GPU communication, making it as fast as possible. For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency.

Even lower memory overhead

While already having much lower GPU memory overhead than vLLM, we optimized Infire even further, tightening the memory required for internal state like activations. Currently Infire is capable of running Llama 4 Scout on just two H200 GPUs with more than 56 GiB remaining for KV-cache, sufficient for more than 1.2m tokens. Infire is also capable of running Kimi K2.5 on 8 H100 GPUs (yes that is H100), with more than 30 GiB still available for KV-cache. In both cases you would have trouble even booting vLLM in the first place.

Faster cold-starts

While adding multi-GPU support, we identified additional opportunities to improve boot times. Even for the largest models, such as Kimi K2.5, Infire can begin serving requests in under 20 seconds. The load times are only bounded by the drive speed.

Maximizing our hardware for faster throughput

Investing in our proprietary inference engine enables us to maximize our hardware by getting up to 20% higher tokens per second throughput on unconstrained systems, and also enabling us to use lower-end hardware to run the latest models, where it was previously completely infeasible.

The journey doesn’t end

New technologies, research, and models come out on a weekly basis for the machine learning community. We’re continuously optimizing our technology stack in order to provide high-quality, performant inference for our customers while operating our GPUs efficiently. If these sound like interesting challenges for you – we’re hiring!

Powering the agents: Workers AI now runs large models, starting with Kimi K2.5

Michelle Chen — Thu, 19 Mar 2026 19:53:16 GMT

We're making Cloudflare the best place for building and deploying agents. But reliable agents aren't built on prompts alone; they require a robust, coordinated infrastructure of underlying primitives.

At Cloudflare, we have been building these primitives for years: Durable Objects for state persistence, Workflows for long running tasks, and Dynamic Workers or Sandbox containers for secure execution. Powerful abstractions like the Agents SDK are designed to help you build agents on top of Cloudflare’s Developer Platform.

But these primitives only provided the execution environment. The agent still needed a model capable of powering it.

Starting today, Workers AI is officially in the big models game. We now offer frontier open-source models on our AI inference platform. We’re starting by releasing Moonshot AI’s Kimi K2.5 model on Workers AI. With a full 256k context window and support for multi-turn tool calling, vision inputs, and structured outputs, the Kimi K2.5 model is excellent for all kinds of agentic tasks. By bringing a frontier-scale model directly into the Cloudflare Developer Platform, we’re making it possible to run the entire agent lifecycle on a single, unified platform.

The heart of an agent is the AI model that powers it, and that model needs to be smart, with high reasoning capabilities and a large context window. Workers AI now runs those models.

The price-performance sweet spot

We spent the last few weeks testing Kimi K2.5 as the engine for our internal development tools. Within our OpenCode environment, Cloudflare engineers use Kimi as a daily driver for agentic coding tasks. We have also integrated the model into our automated code review pipeline; you can see this in action via our public code review agent, Bonk, on Cloudflare GitHub repos. In production, the model has proven to be a fast, efficient alternative to larger proprietary models without sacrificing quality.

Serving Kimi K2.5 began as an experiment, but it quickly became critical after reviewing how the model performs and how cost-efficient it is. As an illustrative example: we have an agent that does security reviews of Cloudflare’s codebases. This agent processes over 7B tokens per day, and using Kimi, it has caught more than 15 confirmed issues in a single codebase. Doing some rough math, if we had run this agent on a mid-tier proprietary model, we would have spent $2.4M a year for this single use case, on a single codebase. Running this agent with Kimi K2.5 cost just a fraction of that: we cut costs by 77% simply by making the switch to Workers AI.

As AI adoption increases, we are seeing a fundamental shift not only in how engineering teams are operating, but how individuals are operating. It is becoming increasingly common for people to have a personal agent like OpenClaw running 24/7. The volume of inference is skyrocketing.

This new rise in personal and coding agents means that cost is no longer a secondary concern; it is the primary blocker to scaling. When every employee has multiple agents processing hundreds of thousands of tokens per hour, the math for proprietary models stops working. Enterprises will look to transition to open-source models that offer frontier-level reasoning without the proprietary price tag. Workers AI is here to facilitate this shift, providing everything from serverless endpoints for a personal agent to dedicated instances powering autonomous agents across an entire organization.

The large model inference stack

Workers AI has served models, including LLMs, since its launch two years ago, but we’ve historically prioritized smaller models. Part of the reason was that for some time, open-source LLMs fell far behind the models from frontier model labs. This changed with models like Kimi K2.5, but to serve this type of very large LLM, we had to make changes to our inference stack. We wanted to share with you some of what goes on behind the scenes to support a model like Kimi.

We’ve been working on custom kernels for Kimi K2.5 to optimize how we serve the model, which is built on top of our proprietary Infire inference engine. Custom kernels improve the model’s performance and GPU utilization, unlocking gains that would otherwise go unclaimed if you were just running the model out of the box. There are also multiple techniques and hardware configurations that can be leveraged to serve a large model. Developers typically use a combination of data, tensor, and expert parallelization techniques to optimize model performance. Strategies like disaggregated prefill are also important, in which you separate the prefill and generation stages onto different machines in order to get better throughput or higher GPU utilization. Implementing these techniques and incorporating them into the inference stack takes a lot of dedicated experience to get right.

Workers AI has already done the experimentation with serving techniques to yield excellent throughput on Kimi K2.5. A lot of this does not come out of the box when you self-host an open-source model. The benefit of using a platform like Workers AI is that you don’t need to be a Machine Learning Engineer, a DevOps expert, or a Site Reliability Engineer to do the optimizations required to host it: we’ve already done the hard part, you just need to call an API.

Beyond the model — platform improvements for agentic workloads

In concert with this launch, we’ve also improved our platform and are releasing several new features to help you build better agents.

Prefix caching and surfacing cached tokens

When you work with agents, you are likely sending a large number of input tokens as part of the context: this could be detailed system prompts, tool definitions, MCP server tools, or entire codebases. Inputs can be as large as the model context window, so in theory, you could be sending requests with almost 256k input tokens. That’s a lot of tokens.

When an LLM processes a request, the request is broken down into two stages: the prefill stage processes input tokens and the output stage generates output tokens. These stages are usually sequential, where input tokens have to be fully processed before you can generate output tokens. This means that sometimes the GPU is not fully utilized while the model is doing prefill.

With multi-turn conversations, when you send a new prompt, the client sends all the previous prompts, tools, and context from the session to the model as well. The delta between consecutive requests is usually just a few new lines of input; all the other context has already gone through the prefill stage during a previous request. This is where prefix caching helps. Instead of doing prefill on the entire request, we can cache the input tensors from a previous request, and only do prefill on the new input tokens. This saves a lot of time and compute from the prefill stage, which means a faster Time to First Token (TTFT) and a higher Tokens Per Second (TPS) throughput as you’re not blocked on prefill.

Workers AI has always done prefix caching, but we are now surfacing cached tokens as a usage metric and offering a discount on cached tokens compared to input tokens. (Pricing can be found on the model page.) We also have new techniques for you to leverage in order to get a higher prefix cache hit rate, reducing your costs.

New session affinity header for higher cache hit rates

In order to route to the same model instance and take advantage of prefix caching, we use a new x-session-affinity header. When you send this header, you’ll improve your cache hit ratio, leading to more cached tokens and subsequently, faster TTFT, TPS, and lower inference costs.

You can pass the new header like below, with a unique string per session or per agent. Some clients like OpenCode implement this automatically out of the box. Our Agents SDK starter has already set up the wiring to do this for you, too.

curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "x-session-affinity: ses_12345678" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is prefix caching and why does it matter?"
      }
    ],
    "max_tokens": 2400,
    "stream": true
  }'

Redesigned async APIs

Serverless inference is really hard. With a pay-per-token business model, it’s cheaper on a single request basis because you don’t need to pay for entire GPUs to service your requests. But there’s a trade-off: you have to contend with other people’s traffic and capacity constraints, and there’s no strict guarantee that your request will be processed. This is not unique to Workers AI — it’s evidently the case across serverless model providers, given the frequent news reports of overloaded providers and service disruptions. While we always strive to serve your request and have built-in autoscaling and rebalancing, there are hard limitations (like hardware) that make this a challenge.

For volumes of requests that would exceed synchronous rate limits, you can submit batches of inferences to be completed asynchronously. We’re introducing a revamped Asynchronous API, which means that for asynchronous use cases, you won’t run into Out of Capacity errors and inference will execute durably at some point. Our async API looks more like flex processing than a batch API, where we process requests in the async queue as long as we have headroom in our model instances. With internal testing, our async requests usually execute within 5 minutes, but this will depend on what live traffic looks like. As we bring Kimi to the public, we will tune our scaling accordingly, but the async API is the best way to make sure you don’t run into capacity errors in durable workflows. This is perfect for use cases that are not real-time, such as code scanning agents or research agents.

Workers AI previously had an asynchronous API, but we’ve recently revamped the systems under the hood. We now rely on a pull-based system versus the historical push-based system, allowing us to pull in queued requests as soon as we have capacity. We’ve also added better controls to tune the throughput of async requests, monitoring GPU utilization in real-time and pulling in async requests when utilization is low, so that critical synchronous requests get priority while still processing asynchronous requests efficiently.

To use the asynchronous API, you would send your requests as seen below. We also have a way to set up event notifications so that you can know when the inference is complete instead of polling for the request.

// (1.) Push a request in queue
// pass queueRequest: true
let res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  "requests": [{
    "messages": [{
      "role": "user",
      "content": "Tell me a joke"
    }]
  }, {
    "messages": [{
      "role": "user",
      "content": "Explain the Pythagoras theorem"
    }]
  }, ...{} ];
}, {
  queueRequest: true,
});


// (2.) grab the request id
let request_id;
if(res && res.request_id){
  request_id = res.request_id;
}
// (3.) poll the status
let res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  request_id: request_id
});

if(res && res.status === "queued" || res.status === "running") {
 // retry by polling again
 ...
}
else 
 return Response.json(res); // This will contain the final completed response

Try it out today

Get started with Kimi K2.5 on Workers AI today. You can read our developer docs to find out model information and pricing, and how to take advantage of prompt caching via session affinity headers and asynchronous API. The Agents SDK starter also now uses Kimi K2.5 as its default model. You can also connect to Kimi K2.5 on Workers AI via Opencode. For a live demo, try it in our playground.

And if this set of problems around serverless inference, ML optimizations, and GPU infrastructure sound interesting to you — we’re hiring!

Partnering with Black Forest Labs to bring FLUX.2 [dev] to Workers AI

Michelle Chen — Tue, 25 Nov 2025 00:00:00 GMT

In recent months, we’ve seen a leap forward for closed-source image generation models with the rise of Google’s Nano Banana and OpenAI image generation models. Today, we’re happy to share that a new open-weight contender is back with the launch of Black Forest Lab’s FLUX.2 [dev] and available to run on Cloudflare’s inference platform, Workers AI. You can read more about this new model in detail on BFL’s blog post about their new model launch here.

We have been huge fans of Black Forest Lab’s FLUX image models since their earliest versions. Our hosted version of FLUX.1 [schnell] is one of the most popular models in our catalog for its photorealistic outputs and high-fidelity generations. When the time came to host the licensed version of their new model, we jumped at the opportunity. The FLUX.2 model takes all the best features of FLUX.1 and amps it up, generating even more realistic, grounded images with added customization support like JSON prompting.

Our Workers AI hosted version of FLUX.2 has some specific patterns, like using multipart form data to support input images (up to 4 512x512 images), and output images up to 4 megapixels. The multipart form data format allows users to send us multiple image inputs alongside the typical model parameters. Check out our developer docs changelog announcement to understand how to use the FLUX.2 model.

What makes FLUX.2 special? Physical world grounding, digital world assets, and multi-language support

The FLUX.2 model has a more robust understanding of the physical world, allowing you to turn abstract concepts into photorealistic reality. It excels at generating realistic image details and consistently delivers accurate hands, faces, fabrics, logos, and small objects that are often missed by other models. Its knowledge of the physical world also generates life-like lighting, angles and depth perception.

^{Figure 1. Image generated with FLUX.2 featuring accurate lighting, shadows, reflections and depth perception at a café in Paris.}

This high-fidelity output makes it ideal for applications requiring superior image quality, such as creative photography, e-commerce product shots, marketing visuals, and interior design. Because it can understand context, tone, and trends, the model allows you to create engaging and editorial-quality digital assets from short prompts.

Aside from the physical world, the model is also able to generate high-quality digital assets such as designing landing pages or generating detailed infographics (see below for example). It’s also able to understand multiple languages naturally, so combining these two features – we can get a beautiful landing page in French from a French prompt.

Générer une page web visuellement immersive pour un service de promenade de chiens. L'image principale doit dominer l'écran, montrant un chien exubérant courant dans un parc ensoleillé, avec des touches de vert vif (#2ECC71) intégrées subtilement dans le feuillage ou les accessoires du chien. Minimiser le texte pour un impact visuel maximal.

Character consistency – solving for stochastic drift

FLUX.2 offers multi-reference editing with state-of-the-art character consistency, ensuring identities, products, and styles remain consistent for tasks. In the world of generative AI, getting a high-quality image is easy. However, getting the exact same character or product twice has always been the hard part. This is a phenomenon known as "stochastic drift", where generated images drift away from the original source material.

^{Figure 2. Stochastic drift infographic (generated on FLUX.2)}

One of FLUX.2’s breakthroughs is multi-reference image inputs designed to solve this consistency challenge. You’ll have the ability to change the background, lighting, or pose of an image without accidentally changing the face of your model or the design of your product. You can also reference other images or combine multiple images together to create something new.

In code, Workers AI supports multi-reference images (up to 4) with a multipart form-data upload. The image inputs are binary images and output is a base64 encoded image:

curl --request POST \
  --url 'https://api.cloudflare.com/client/v4/accounts/{ACCOUNT}/ai/run/@cf/black-forest-labs/flux-2-dev' \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: multipart/form-data' \
  --form 'prompt=take the subject of image 2 and style it like image 1' \
  --form input_image_0=@/Users/johndoe/Desktop/icedoutkeanu.png \
  --form input_image_1=@/Users/johndoe/Desktop/me.png \
  --form steps=25
  --form width=1024
  --form height=1024

We also support this through the Workers AI Binding:

const image = await fetch("http://image-url");
const form = new FormData();
 
const image_blob = await streamToBlob(image.body, "image/png");
form.append('input_image_0', image_blob)
form.append('prompt', 'a sunset with the dog in the original image')
 
const resp = await env.AI.run("@cf/black-forest-labs/flux-2-dev", {
    multipart: {
        body: form,
        contentType: "multipart/form-data"
    }
})

Built for real world use cases

The newest image model signifies a shift towards functional business use cases, moving beyond simple image quality improvements. FLUX.2 enables you to:

Create Ad Variations: Generate 50 different advertisements using the exact same actor, without their face morphing between frames.
Trust Your Product Shots: Drop your product on a model, or into a beach scene, a city street, or a studio table. The environment changes, but your product stays accurate.
Build Dynamic Editorials: Produce a full fashion spread where the model looks identical in every single shot, regardless of the angle.

^{Figure 3. Combining the oversized hoodie and sweatpant ad photo (generated with FLUX.2) with Cloudflare’s logo to create product renderings with consistent faces, fabrics, and scenery. **}^{Note: we prompted for white Cloudflare font as well instead of the original black font.}

Granular controls — JSON prompting, HEX codes and more!

The FLUX.2 model makes another advancement by allowing users to control small details in images through tools like JSON prompting and specifying specific hex codes.

For example, you could send this JSON as a prompt (as part of the multipart form input) and the resulting image follows the prompt exactly:

{
  "scene": "A bustling, neon-lit futuristic street market on an alien planet, rain slicking the metal ground",
  "subjects": [
    {
      "type": "Cyberpunk bounty hunter",
      "description": "Female, wearing black matte armor with glowing blue trim, holding a deactivated energy rifle, helmet under her arm, rain dripping off her synthetic hair",
      "pose": "Standing with a casual but watchful stance, leaning slightly against a glowing vendor stall",
      "position": "foreground"
    },
    {
      "type": "Merchant bot",
      "description": "Small, rusted, three-legged drone with multiple blinking red optical sensors, selling glowing synthetic fruit from a tray attached to its chassis",
      "pose": "Hovering slightly, offering an item to the viewer",
      "position": "midground"
    }
  ],
  "style": "noir sci-fi digital painting",
  "color_palette": [
    "deep indigo",
    "electric blue",
    "acid green"
  ],
  "lighting": "Low-key, dramatic, with primary light sources coming from neon signs and street lamps reflecting off wet surfaces",
  "mood": "Gritty, tense, and atmospheric",
  "background": "Towering, dark skyscrapers disappearing into the fog, with advertisements scrolling across their surfaces, flying vehicles (spinners) visible in the distance",
  "composition": "dynamic off-center",
  "camera": {
    "angle": "eye level",
    "distance": "medium close-up",
    "focus": "sharp on subject",
    "lens": "35mm",
    "f-number": "f/1.4",
    "ISO": 400
  },
  "effects": [
    "heavy rain effect",
    "subtle film grain",
    "neon light reflections",
    "mild chromatic aberration"
  ]
}

To take it further, we can ask the model to recolor the accent lighting to a Cloudflare orange by giving it a specific hex code like #F48120.

Try it out today!

The newest FLUX.2 [dev] model is now available on Workers AI — you can get started with the model through our developer docs or test it out on our multimodal playground.

Choice: the path to AI sovereignty

Carly Ramsey — Thu, 25 Sep 2025 14:00:00 GMT

Every government is laser-focused on the potential for national transformation by AI. Many view AI as an unparalleled opportunity to solve complex national challenges, drive economic growth, and improve the lives of their citizens. Others are concerned about the risks AI can bring to its society and economy. Some sit somewhere between these two perspectives. But as plans are drawn up by governments around the world to address the question of AI development and adoption, all are grappling with the critical question of sovereignty — how much of this technology, mostly centered in the United States and China, needs to be in their direct control?

Each nation has their own response to that question — some seek ‘self-sufficiency’ and total authority. Others, particularly those that do not have the capacity to build the full AI technology stack, are approaching it layer-by-layer, seeking to build on the capacities their country does have and then forming strategic partnerships to fill the gaps.

We believe AI sovereignty at its core is about choice. Each nation should have the ability to select the right tools for the task, to control its own data, and to deploy applications at will, all without being locked into a single provider or a single way of doing things. It's about autonomy and options, realized through a diversified, resilient digital supply chain.

Cloudflare’s mission is to help build a better Internet. We make tools for developers around the world to build Internet and AI applications that are widely, and in many cases, freely, available. We work on standards to improve interoperability and prevent vendor lock-in. And we are global — our network spans 330 cities in over 125 countries. By supporting local developers to build and deploy AI tools and services right where they are, Cloudflare can help each nation on their path to greater AI sovereignty.

Creating a future that enables many AI options

Many nations recognize the practical challenge of realizing a robust AI-driven future that incorporates sovereignty — the significant cost and complexity of the infrastructure needed to set AI in action. Cloudflare believes that countries can achieve their objectives by creating vibrant marketplaces that allow multiple options, and we are creating a path for governments that provides maximum choice:

Infrastructure accessibility: Countries often focus on building large data centers that have the compute capacity to train general purpose AI models, neglecting the infrastructure needed to effectively deploy AI. Because of their proximity to end users, distributed edge networks are critical to ensuring that consumers can actually use AI technologies at scale. Although some AI technologies will be designed to work on-device, many will need more power to run AI inference, the tasks that users ask an AI engine to complete. Distributed networks are equipped to run AI workloads at the edge, to help deliver the low latency and high performance needed for advanced technologies. Cloudflare’s distributed network gives developers a path to rapidly deploy their apps globally without massive upfront investments.

Inclusivity: Nations want their entire economies, from the small businesses, to research institutions, to non-profits and enterprises, to benefit from AI transformation. Serverless models like Cloudflare’s make it easy to get started. Developers pay only for what they use, rather than being locked into paying for expensive and unnecessary compute, dramatically lowering the barrier to entry. Our free tier allows developers to experiment, build, and even launch applications without any cost, while our pay-as-you-go model for increased usage removes the significant financial barriers that might otherwise keep advanced AI out of reach.

Control over data: An important part of sovereignty is the ability to control your own data. We believe countries should avoid equating this type of control with data locality, focusing instead on integrating security tools that provide visibility and the ability to restrict access to data. Cloudflare’s global, distributed network ensures that developers can experiment, build, and deploy AI-powered applications right where they are, setting rules and controls at the Internet edge.

Multi-modal, dynamic markets: Building new applications with closed AI models can make it challenging to switch models later, and can make developers dependent on particular providers. AI strategies must embrace diversity — developers should have access to a wide variety of both open source and closed AI models. Cloudflare’s Workers AI platform, with over 50 open source models, is model agnostic, helping to create a competitive, dynamic environment where developers can swap models in and out as better, cheaper, or more specialized options become available. Cloudflare’s AI Gateway allows our customers to connect and control all their AI models, regardless of vendor, in a single, unified, interoperable platform.

Underpinning all of this is the importance of open standards that encourage interoperability. Open standards and protocols throughout the AI technology stack help prevent dependency, create dynamic and competitive markets, and create choice for governments and their developers.

Championing regional AI innovation

Many countries have started to put their own mark on how to spur innovation in their markets, starting with large language models (LLM). AI development to date has mostly centered around LLMs trained on English-centric data, and increasingly, Chinese-centric data, leaving behind those who can’t fully access this technology in these two languages. Recognizing this gap, these nations are building and freely offering AI models trained on local language datasets that are fine-tuned to the nuances of their own cultures and languages. This approach lowers the barrier to entry for local businesses, organizations, and governments to create customized AI solutions for their specific markets. Open-sourcing these LLMs is to recognize that AI sovereignty is a means to an end. The goal is innovation, economic growth, and the ability to solve meaningful problems.

Cloudflare is now supporting these sovereign AI initiatives in India, Japan, and Southeast Asia. We are bringing these locally-developed, open-source AI models to developers around the world through our serverless inference platform, Workers AI.

India: India’s national vision is “AI for All”, which focuses on AI driving inclusive growth and social empowerment. India will host the momentous global AI Impact Summit in 2026, and a key element of showcasing empowering technological advancements that are accessible to the Global South. With its immense linguistic diversity, India is at the forefront of creating models that serve its hundreds of millions of Internet users in their native tongues. A cornerstone in this endeavor is the Government of India’s Bhashini, a digital public good platform that enables all Indian citizens to access the Internet and digital services in 22 official languages.

Cloudflare is now offering AI4Bharat’s IndicTrans2 model, a key open source language model that is also part of the Bhashini initiative. The model is able to translate text across 22 Indic languages, including Bengali, Gujarati, Hindi, Tamil, Sanskrit and even traditionally low-resourced languages like Kashmiri, Manipuri and Sindhi.

You can use the @cf/ai4bharat/indictrans2-en-indic-1B model on Workers AI as follows:

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/ai/run/@cf/ai4bharat/indictrans2-en-indic-1B \
  --header 'Authorization: Bearer TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": ["What is your favourite food?", "I like pizza"],
    "target_language": "guj_Gujr"
}'

Japan: Japan has a very clear and expansive vision of AI development. Concerned about Japan’s slow AI uptake, the Japanese government aims to make the country “the world’s most friendly AI nation” by creating the ideal conditions for AI growth, both at home and abroad. A major initiative for Japan’s government is supporting AI that deeply understands the complexities and cultural context of the Japanese language.

Cloudflare is offering Preferred Networks, Inc.(PFN) PLaMo-Embedding-1B, a home-grown Japanese text embedding model, made freely and openly available. The Japanese government supported PFN through its Generative AI Accelerator Challenge (GENIAC) program, which supports local LLM development through subsidized access to compute resources for training. The PLaMo Embedding model enables users to generate high-quality embeddings for Japanese text, which is helpful for building RAG-powered applications and semantic search use cases.

You can use the @cf/pfnet/plamo-embedding-1b model on Workers AI as follows:

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/ai/run/@cf/pfnet/plamo-embedding-1b \
  --header 'Authorization: Bearer TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
  	"text": [
            "PLaMo-Embedding-1Bは、Preferred Networks, Inc. によって開発された日本語テキスト埋め込みモデルです。",
            "最近は随分と暖かくなりましたね。"
        ]
}'

Southeast Asia: As Chair of the Association of Southeast Asian Nations (ASEAN) Working Group on AI Governance, Singapore’s ambitious National AI Strategy 2.0 aims to ensure that AI is a public good, both for Southeast Asia and the world. As a cornerstone of this strategy, Singapore is championing the development and adoption of SEA-LION, a family of open-source LLMs designed for Southeast Asia's diverse languages and cultures. The initiative aims to establish the nation as an inclusive global AI leader, ensuring the technology is both accessible and regionally relevant to its multilingual and multicultural populaces. The models are adept in numerous regional languages, including Bahasa Indonesia, Bahasa Malaysia, Thai, Vietnamese, and Tamil, unlocking AI technologies for a significant portion of the Asian and global population.

SEA-LION model v4-27B is now available on the Workers AI platform. SEA-LION v4 stands out on the Singapore government’s leaderboard as its most powerful, efficient, multimodal and multilingual model yet.

You can use the @cf/aisingapore/gemma-sea-lion-v4-27b-it model on Workers AI as follows:

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/ai/run/@cf/aisingapore/gemma-sea-lion-v4-27b-it \
  --header 'Authorization: Bearer TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
  "messages": [
    {
      "role": "user",
      "content": "แล้วทำผัดไทยอย่างไร"
    }
  ]
}'

Bringing AI models to the world

Singapore, India and Japan have all chosen to open-source many of their local language models, a strategy that champions an expansive vision of AI sovereignty. This approach demonstrates a crucial understanding: true AI sovereignty is ensuring you have choices.

Supporting local language open source models is more than just supporting technology; this is a shared commitment to fostering an open, interoperable, and competitive AI ecosystem by empowering governments and developers to solve local problems, create economic opportunities, and preserve their digital and cultural heritages.

We are honored to support the initiatives of the governments of India, Japan, and Singapore on this journey. We believe that by putting their sovereign AI models into the hands of developers in their economies, we can help unlock a powerful wave of innovation that is more diverse, equitable, and representative of the world we live in. The future of AI is being built today, and we are proud to ensure that AI developers everywhere are at the forefront.

Choice is the foundation of AI sovereignty. We’re starting with the models from India, Japan, and Singapore on our serverless inference platform, but it’s only the start. Come build with us! Take the first step for free on Workers AI.

AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint

Michelle Chen — Wed, 27 Aug 2025 14:05:00 GMT

Getting the observability you need is challenging enough when the code is deterministic, but AI presents a new challenge — a core part of your user’s experience now relies on a non-deterministic engine that provides unpredictable outputs. On top of that, there are many factors that can influence the results: the model, the system prompt. And on top of that, you still have to worry about performance, reliability, and costs.

Solving performance, reliability and observability challenges is exactly what Cloudflare was built for, and two years ago, with the introduction of AI Gateway, we wanted to extend to our users the same levels of control in the age of AI.

Today, we’re excited to announce several features to make building AI applications easier and more manageable: unified billing, secure key storage, dynamic routing, security controls with Data Loss Prevention (DLP). This means that AI Gateway becomes your go-to place to control costs and API keys, route between different models and providers, and manage your AI traffic. Check out our new AI Gateway landing page for more information at a glance.

Connect to all your favorite AI providers

When using an AI provider, you typically have to sign up for an account, get an API key, manage rate limits, top up credits — all within an individual provider’s dashboard. Multiply that for each of the different providers you might use, and you’ll soon be left with an administrative headache of bills and keys to manage.

With AI Gateway, you can now connect to major AI providers directly through Cloudflare and manage everything through one single plane. We’re excited to partner with Anthropic, Google, Groq, OpenAI, and xAI to provide Cloudflare users with access to their models directly through Cloudflare. With this, you’ll have access to over 350+ models across 6 different providers.

You can now get billed for usage across different providers directly through your Cloudflare account. This feature is available for Workers Paid users, where you’ll be able to add credits to your Cloudflare account and use them for AI inference to all the supported providers. You’ll be able to see real-time usage statistics and manage your credits through the AI Gateway dashboard. Your AI Gateway inference usage will also be documented in your monthly Cloudflare invoice. No more signing up and paying for each individual model provider account.

Usage rates are based on then-current list prices from model providers — all you will need to cover is the transaction fee as you load credits into your account. Since this is one of the first times we’re launching a credits based billing system at Cloudflare, we’re releasing this feature in Closed Beta — sign up for access here.

BYO Provider Keys, now with Cloudflare Secrets Store

Although we’ve introduced unified billing, some users might still want to manage their own accounts and keys with providers. We’re happy to say that AI Gateway will continue supporting our BYO Key feature, improving the experience of BYO Provider Keys by integrating with Cloudflare’s secrets management product Secrets Store. Now, you can seamlessly and securely store your keys in one centralized location and distribute them without relying on plain text. Secrets Store uses a two level key hierarchy with AES encryption to ensure that your secret stays safe, while maintaining low latency through our global configuration system, Quicksilver.

You can now save and manage keys directly through your AI Gateway dashboard or through the Secrets Store dashboard, API, or Wrangler by using the new AI Gateway scope. Scoping your secrets to AI Gateway ensures that only this specific service will be able to access your keys, meaning that secret could not be used in a Workers binding or anywhere else on Cloudflare’s platform.

You can pass your AI provider keys without including them directly in the request header. Instead of including the actual value, you can deploy the secret only using the Secrets Store reference:

curl -X POST https://gateway.ai.cloudflare.com/v1//my-gateway/anthropic/v1/messages \
 --header 'cf-aig-authorization: CLOUDFLARE_AI_GATEWAY_TOKEN \
 --header 'anthropic-version: 2023-06-01' \
 --header 'Content-Type: application/json' \
 --data  '{"model": "claude-3-opus-20240229", "messages": [{"role": "user", "content": "What is Cloudflare?"}]}'

Or, using Javascript:

import Anthropic from '@anthropic-ai/sdk';


const anthropic = new Anthropic({
  apiKey: "CLOUDFLARE_AI_GATEWAY_TOKEN",
  baseURL: "https://gateway.ai.cloudflare.com/v1//my-gateway/anthropic",
});


const message = await anthropic.messages.create({
  model: 'claude-3-opus-20240229',
  messages: [{role: "user", content: "What is Cloudflare?"}],
  max_tokens: 1024
});

By using Secrets Store to deploy your secrets, you no longer need to give every developer access to every key — instead, you can rely on Secrets Store’s role-based access control to further lock down these sensitive values. For example, you might want your security administrators to have Secrets Store admin permissions so that they can create, update, and delete the keys when necessary. With Cloudflare audit logging, all such actions will be logged so you know exactly who did what and when. Your developers, on the other hand, might only need Deploy permissions, so they can reference the values in code, whether that is a Worker or AI Gateway or both. This way, you reduce the risk of the secret getting leaked accidentally or intentionally by a malicious actor. This also allows you to update your provider keys in one place and automatically propagate that value to any AI Gateway using those values, simplifying the management.

Unified Request/Response

We made it super easy for people to try out different AI models – but the developer experience should match that as well. We found that each provider can have slight differences in how they expect people to send their requests, so we’re excited to launch an automatic translation layer between providers. When you send a request through AI Gateway, it just works – no matter what provider or model you use.

import OpenAI from "openai";
const client = new OpenAI({
  apiKey: "YOUR_PROVIDER_API_KEY", // Provider API key
  // NOTE: the OpenAI client automatically adds /chat/completions to the end of the URL, you should not add it yourself.
  baseURL:
    "https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/compat",
});

const response = await client.chat.completions.create({
  model: "google-ai-studio/gemini-2.0-flash",
  messages: [{ role: "user", content: "What is Cloudflare?" }],
});

console.log(response.choices[0].message.content);

Dynamic Routes

When we first launched Cloudflare Workers, it was an easy way for people to intercept HTTP requests and customize actions based on different attributes. We think the same customization is necessary for AI traffic, so we’re launching Dynamic Routes in AI Gateway.

Dynamic Routes allows you to define certain actions based on different request attributes. If you have free users, maybe you want to ratelimit them to a certain request per second (RPS) or a certain dollar spend. Or maybe you want to conduct an A/B test and split 50% of traffic to Model A and 50% of traffic to Model B. You could also want to chain several models in a row, like adding custom guardrails or enhancing a prompt before it goes to another model. All of this is possible with Dynamic Routes!

We’ve built a slick UI in the AI Gateway dashboard where you can define simple if/else interactions based on request attributes or a percentage split. Once you define a route, you’ll use the route as the “model” name in your input JSON and we will manage the traffic as you defined.

import OpenAI from "openai";

const cloudflareToken = "CF_AIG_TOKEN";
const accountId = "{account_id}";
const gatewayId = "{gateway_id}";
const baseURL = `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}`;

const openai = new OpenAI({
  apiKey: cloudflareToken,
  baseURL,
});

try {
  const model = "dynamic/";
  const messages = [{ role: "user", content: "What is a neuron?" }];
  const chatCompletion = await openai.chat.completions.create({
    model,
    messages,
  });
  const response = chatCompletion.choices[0].message;
  console.log(response);
} catch (e) {
  console.error(e);
}

Built-in security with Firewall in AI Gateway

Earlier this year we announced Guardrails in AI Gateway and now we’re expanding our security capabilities and include Data Loss Prevention (DLP) scanning in AI Gateway’s Firewall. With this, you can select the DLP profiles you are interested in blocking or flagging, and we will scan requests for the matching content. DLP profiles include general categories like “Financial Information”, “Social Security, Insurance, Tax and Identifier Numbers” that everyone has access to with a free Zero Trust account. If you would like to create a custom DLP profile to safeguard specific text, the upgraded Zero Trust plan allows you to create custom DLP profiles to catch sensitive data that is unique to your business.

False positives and grey area situations happen, we give admins controls on whether to fully block or just alert on DLP matches. This allows administrators to monitor for potential issues without creating roadblocks for their users.. Each log on AI gateway now includes details about the DLP profiles matched on your request, and the action that was taken:

More coming soon…

If you think about the history of Cloudflare, you’ll notice similar patterns that we’re following for the new vision for AI Gateway. We want developers of AI applications to be able to have simple interconnectivity, observability, security, customizable actions, and more — something that Cloudflare has a proven track record of accomplishing for global Internet traffic. We see AI Gateway as a natural extension of Cloudflare’s mission, and we’re excited to make it come to life.

We’ve got more launches up our sleeves, but we couldn’t wait to get these first handful of features into your hands. Read up about it in our developer docs, give it a try, and let us know what you think. If you want to explore larger deployments, reach out for a consultation with Cloudflare experts.

State-of-the-art image generation Leonardo models and text-to-speech Deepgram models now available in Workers AI

Michelle Chen — Wed, 27 Aug 2025 14:00:00 GMT

When we first launched Workers AI, we made a bet that AI models would get faster and smaller. We built our infrastructure around this hypothesis, adding specialized GPUs to our datacenters around the world that can serve inference to users as fast as possible. We created our platform to be as general as possible, but we also identified niche use cases that fit our infrastructure well, such as low-latency image generation or real-time audio voice agents. To lean in on those use cases, we’re bringing on some new models that will help make it easier to develop for these applications.

Today, we’re excited to announce that we are expanding our model catalog to include closed-source partner models that fit this use case. We’ve partnered with Leonardo.Ai and Deepgram to bring their latest and greatest models to Workers AI, hosted on Cloudflare’s infrastructure. Leonardo and Deepgram both have models with a great speed-to-performance ratio that suit the infrastructure of Workers AI. We’re starting off with these great partners — but expect to expand our catalog to other partner models as well.

The benefits of using these models on Workers AI is that we don’t only have a standalone inference service, we also have an entire suite of Developer products that allow you to build whole applications around AI. If you’re building an image generation platform, you could use Workers to host the application logic, Workers AI to generate the images, R2 for storage, and Images for serving and transforming media. If you’re building Realtime voice agents, we offer WebRTC and WebSocket support via Workers, speech-to-text, text-to-speech, and turn detection models via Workers AI, and an orchestration layer via Cloudflare Realtime. All in all, we want to lean into use cases that we think Cloudflare has a unique advantage in, with developer tools to back it up, and make it all available so that you can build the best AI applications on top of our holistic Developer Platform.

Leonardo Models

Leonardo.Ai is a generative AI media lab that trains their own models and hosts a platform for customers to create generative media. The Workers AI team has been working with Leonardo for a while now and have experienced the magic of their image generation models firsthand. We’re excited to bring on two image generation models from Leonardo: @cf/leonardo/phoenix-1.0 and @cf/leonardo/lucid-origin.

“We’re excited to enable Cloudflare customers a new avenue to extend and use our image generation technology in creative ways such as creating character images for gaming, generating personalized images for websites, and a host of other uses... all through the Workers AI and the Cloudflare Developer Platform.” - Peter Runham, CTO, Leonardo.Ai

The Phoenix model is trained from the ground up by Leonardo, excelling at things like text rendering and prompt coherence. The full image generation request took 4.89s end-to-end for a 25 step, 1024x1024 image.

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/leonardo/phoenix-1.0 \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "A 1950s-style neon diner sign glowing at night that reads '\''OPEN 24 HOURS'\'' with chrome details and vintage typography.",
    "width":1024,
    "height":1024,
    "steps": 25,
    "seed":1,
    "guidance": 4,
    "negative_prompt": "bad image, low quality, signature, overexposed, jpeg artifacts, undefined, unclear, Noisy, grainy, oversaturated, overcontrasted"
}'

The Lucid Origin model is a recent addition to Leonardo’s family of models and is great at generating photorealistic images. The image took 4.38s to generate end-to-end at 25 steps and a 1024x1024 image size.

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/leonardo/lucid-origin \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "A 1950s-style neon diner sign glowing at night that reads '\''OPEN 24 HOURS'\'' with chrome details and vintage typography.",
    "width":1024,
    "height":1024,
    "steps": 25,
    "seed":1,
    "guidance": 4,
    "negative_prompt": "bad image, low quality, signature, overexposed, jpeg artifacts, undefined, unclear, Noisy, grainy, oversaturated, overcontrasted"
}'

Deepgram Models

Deepgram is a voice AI company that develops their own audio models, allowing users to interact with AI through a natural interface for humans: voice. Voice is an exciting interface because it carries higher bandwidth than text, because it has other speech signals like pacing, intonation, and more. The Deepgram models that we’re bringing on our platform are audio models which perform extremely fast speech-to-text and text-to-speech inference. Combined with the Workers AI infrastructure, the models showcase our unique infrastructure so customers can build low-latency voice agents and more.

"By hosting our voice models on Cloudflare's Workers AI, we're enabling developers to create real-time, expressive voice agents with ultra-low latency. Cloudflare's global network brings AI compute closer to users everywhere, so customers can now deliver lightning-fast conversational AI experiences without worrying about complex infrastructure." - Adam Sypniewski, CTO, Deepgram

@cf/deepgram/nova-3 is a speech-to-text model that can quickly transcribe audio with high accuracy. @cf/deepgram/aura-1 is a text-to-speech model that is context aware and can apply natural pacing and expressiveness based on the input text. The newer Aura 2 model will be available on Workers AI soon. We’ve also improved the experience of sending binary mp3 files to Workers AI, so you don’t have to convert it into an Uint8 array like you had to previously. Along with our Realtime announcements (coming soon!), these audio models are the key to enabling customers to build voice agents directly on Cloudflare.

With the AI binding, a call to the Nova 3 speech-to-text model would look like this:

const URL = "https://www.some-website.com/audio.mp3";
const mp3 = await fetch(URL);
 
const res = await env.AI.run("@cf/deepgram/nova-3", {
    "audio": {
      body: mp3.body,
      contentType: "audio/mpeg"
    },
    "detect_language": true
  });

With the REST API, it would look like this:

curl --request POST \
  --url 'https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/deepgram/nova-3?detect_language=true' \
  --header 'Authorization: Bearer {TOKEN}' \
  --header 'Content-Type: audio/mpeg' \
  --data-binary @/path/to/audio.mp3

As well, we’ve added WebSocket support to the Deepgram models, which you can use to keep a connection to the inference server live and use it for bi-directional input and output. To use the Nova model with WebSocket support, check out our Developer Docs.

All the pieces work together so that you can:

Capture audio with Cloudflare Realtime from any WebRTC source
Pipe it via WebSocket to your processing pipeline
Transcribe with audio ML models Deepgram running on Workers AI
Process with your LLM of choice through a model hosted on Workers AI or proxied via AI Gateway
Orchestrate everything with Realtime Agents

Try these models out today

Check out our developer docs for more details, pricing and how to get started with the newest partner models available on Workers AI.

Partnering with OpenAI to bring their new open models onto Cloudflare Workers AI

Michelle Chen — Tue, 05 Aug 2025 21:05:00 GMT

OpenAI has just announced their latest open-weight models — and we are excited to share that we are working with them as a Day 0 launch partner to make these models available in Cloudflare's Workers AI. Cloudflare developers can now access OpenAI's first open model, leveraging these powerful new capabilities on our platform. The new models are available starting today at @cf/openai/gpt-oss-120b and @cf/openai/gpt-oss-20b.

Workers AI has always been a champion for open models and we’re thrilled to bring OpenAI's new open models to our platform today. Developers who want transparency, customizability, and deployment flexibility can rely on Workers AI as a place to deliver AI services. Enterprises that need the ability to run open models to ensure complete data security and privacy can also deploy with Workers AI. We are excited to join OpenAI in fulfilling their mission of making the benefits of AI broadly accessible to builders of any size.

The technical model specs

The OpenAI models have been released in two sizes: a 120 billion parameter model and a 20 billion parameter model. Both of them are Mixture-of-Experts models – a popular architecture for recent model releases – that allow relevant experts to be called for a query instead of running through all the parameters of the model. Interestingly, these models run natively at an FP4 quantization, which means that they have a smaller GPU memory footprint than a 120 billion parameter model at FP16. Given the quantization and the MoE architecture, the new models are able to run faster and more efficiently than more traditional dense models of that size.

These models are text-only; however, they have reasoning capabilities, tool calling, and two new exciting features with Code Interpreter and Web Search (support coming soon). We’ve implemented Code Interpreter on top of Cloudflare Containers in a novel way that allows for stateful code execution (read on below).

The model on Workers AI

We’re landing these new models with a few tweaks: supporting the new Responses API format as well as the historical Chat Completions API format (coming soon). The Responses API format is recommended by OpenAI to interact with their models, and we’re excited to support that on Workers AI.

If you call the model through:

Workers Binding, it will accept/return Responses API – env.AI.run(“@cf/openai/gpt-oss-120b”)
REST API on /run endpoint, it will accept/return Responses API – https://api.cloudflare.com/client/v4/accounts//ai/run/@cf/openai/gpt-oss-120b
REST API on new /responses endpoint, it will accept/return Responses API – https://api.cloudflare.com/client/v4/accounts//ai/v1/responses
REST API for OpenAI Compatible endpoint, it will return Chat Completions (coming soon)– https://api.cloudflare.com/client/v4/accounts//ai/v1/chat/completions

curl https://api.cloudflare.com/client/v4/accounts//ai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CLOUDFLARE_API_KEY" \
  -d '{
    "model": "@cf/openai/gpt-oss-120b",
    "reasoning": {"effort": "medium"},
    "input": [
      {
        "role": "user",
        "content": "What are the benefits of open-source models?"
      }
    ]
  }'

Code Interpreter + Cloudflare Sandboxes = the perfect fit

To effectively answer user queries, Large Language Models (LLMs) often struggle with logical tasks such as mathematics or coding. Instead of attempting to reason through these problems, LLMs typically utilize a tool call to execute AI-generated code that solves these problems. OpenAI's new models are specifically trained for stateful Python code execution and include a built-in feature called Code Interpreter, designed to address this challenge.

We’re particularly excited about this. Cloudflare not only has an inference platform (Workers AI), but we also have an ecosystem of compute and storage products that allow people to build full applications on top of our Developer Platform. This means that we are uniquely suited to support the model’s Code Interpreter capabilities, not only for one-time code execution, but for stateful code execution as well.

We’ve built support for Code Interpreter on top of Cloudflare’s Sandbox product that allows for a secure environment to run AI-generated code. The Sandbox SDK is built on our latest Containers product and Code Interpreter is the perfect use case to bring all these products together. When you use Code Interpreter, we spin up a Sandbox container scoped to your session that stays alive for 20 minutes, so the code can be edited for subsequent queries to the model. We’ve also pre-warmed Sandboxes for Code Interpreter to ensure the fastest start up times.

We’ll be publishing an example of how you can use the gpt-oss model on Workers AI and Sandboxes with the OpenAI SDK to make calls to Code Interpreter on our Developer Docs.

Give it a try!

We’re beyond excited for OpenAI’s new open models, and we hope you are too. Super grateful to our friends from vLLM and HuggingFace for supporting efficient model serving on launch day. Read up on the Developer Docs to learn more about the details and how to get started on building with these new models and capabilities.

Workers AI gets a speed boost, batch workload support, more LoRAs, new models, and a refreshed dashboard

Michelle Chen — Fri, 11 Apr 2025 13:00:00 GMT

Since the launch of Workers AI in September 2023, our mission has been to make inference accessible to everyone.

Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on various routing improvements, GPU optimizations, and capacity management improvements. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You’ll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos — we try to solve problems through clever engineering so that we are able to do more with less.

Today, we’re excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we’ll be recapping some of our newly added models, updated pricing, and unveiling a new dashboard to round out the usability of the platform.

Speeding up inference by 2-4x with speculative decoding and more

We’re excited to roll out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, and more. We’ve previously done a technical deep dive on speculative decoding and how we’re making Workers AI faster, which you can read about here. With these changes, we’ve been able to improve inference times by 2-4x, without any significant change to the quality of answers generated. We’re planning to incorporate these improvements into more models in the future as we release them. Today, we’re starting to roll out these changes so all Workers AI users of @cf/meta/llama-3.3-70b-instruct-fp8-fast will enjoy this automatic speed boost.

What is speculative decoding?

The way LLMs work is by generating text by predicting the next token in a sentence given the previous tokens. Typically, an LLM is able to predict a single future token (n+1) with one forward pass through the model. These forward passes can be computationally expensive, since they need to work through all the parameters of a model to generate one token (e.g., 70 billion parameters for Llama 3.3 70b).

With speculative decoding, we put a small model (known as the draft model) in front of the original model that helps predict n+x future tokens. The draft model generates a subset of candidate tokens, and the original model just has to evaluate and confirm if they should be incorporated into the generation. Evaluating tokens is less computationally expensive, as the model can evaluate multiple tokens concurrently in a forward pass. As such, inference times can be sped up by 2-4x — meaning that users can get responses much faster.

What makes speculative decoding particularly efficient is that it’s able to use unused GPU compute left behind due to the GPU memory bottleneck LLMs create. Speculative decoding takes advantage of the unused compute by squeezing in a draft model to generate tokens faster. This means we’re able to improve the utilization of our GPUs by using them to their full extent without having parts of the GPU sit idle.

What is prefix caching?

With LLMs, there are usually two stages of generation – the first is known as “pre-fill”, which processes the user’s input tokens such as the prompt and context. Prefix caching is aimed at reducing the pre-fill time of a request. As an example, if you were asking a model to generate code based on a given file, you might insert the whole file into the context window of a request. Then, if you want to make a second request to generate the next line of code, you might send us the whole file again in the second request. Prefix caching allows us to cache the pre-fill tokens so we don’t have to process the context twice. With the same example, we would only do the pre-fill stage once for both requests, rather than doing it per request. This method is especially useful for requests that reuse the same context, such as Retrieval Augmented Generation (RAG), code generation, chatbots with memory, and more. Skipping the pre-fill stage for similar requests means faster responses for our users and more efficient usage of resources.

How did you validate that quality is preserved through these optimizations?

Since this is an in-place update to an existing model, we were particularly cautious in ensuring that we would not break any existing applications with this update. We did extensive A/B testing through a blind arena with internal employees to validate the model quality, and we asked internal and external customers to test the new version of the model to ensure that response formats were compatible and model quality was acceptable. Our testing concluded that the model performed up to standards, with people being extremely excited about the speed of the model. Most LLMs are not perfectly deterministic even with the same set of inputs, but if you do notice something off, please let us know through Discord or X.

Asynchronous batch API

Next up, we’re announcing an asynchronous (async) batch API which is helpful for users of large workloads. This feature allows customers to receive their inference responses asynchronously, with the promise that the inference will be completed at a later time rather than immediately erroring out due to capacity.

An example use case of batch workloads is people generating summaries of a large number of documents. You probably don’t need to use those summaries immediately, as you’ll likely use them once the whole document is complete versus one paragraph at a time. For these use cases, we’ve made it super simple for you to start sending us these requests in batches.

Why batch requests?

From talking to our customers, the most common use case we hear about is people creating embeddings or summarizing a large number of documents. Unfortunately, this is also one of the hardest use cases to manage capacity for as a serverless platform.

To illustrate this, imagine that you want to summarize a 70 page PDF. You typically chunk the document and then send an inference request for each chunk. If each chunk is a few paragraphs on a page, that means that we receive around 4 requests per page multiplied by 70 pages, which is about 280 requests. Multiply that by tens or hundreds of documents, and multiply that by a handful of concurrent users — this means that we get a sudden massive influx of thousands of requests when users start these large workloads.

The way we originally built Workers AI was to handle incoming requests as quickly as possible, assuming there's a human on the other side that needed an immediate response. The unique thing about batch workloads is that while they're not latency sensitive, they do require completeness guarantees — you don't want to come back the next day to realize none of your inference requests actually executed.

With the async API, you send us a batch of requests, and we promise to fulfill them as fast as possible and return them to you as a batch. This guarantees that your inference request will be fulfilled, rather than immediately (or eventually) erroring out. The async API also benefits users who have real-time use cases, as the model instances won’t be immediately consumed by these batch requests that can wait for a response. Inference times will be faster since there won’t be a bunch of competing requests in a queue waiting to reach the inference servers.

We have select models that support batch inference today, which include:

How can I use the batch API?

Users can send a batch request to supported models by passing a flag:

let res = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-batch", {
  "requests": [{
    "prompt": "Explain mechanics of wormholes"
  }, {
    "prompt": "List different plant species found in America"
  }]
}, {
  queueRequest: true
});

Check out our developer docs to learn more about the batch API, or use our template to deploy a worker that implements the batch API.

Today, our batch API can be used by sending us an array of requests, and we’ll return your responses in an array. This is helpful for use cases like summarizing large amounts of data that you know beforehand. This means you can send us a single HTTP request with all of your requests, and receive a single HTTP request back with your responses. You can check on the status of the batch by polling it with the request ID we return when your batch is submitted. For the next iteration of our async API, we plan to allow queue-based inputs and outputs, where you push requests and pull responses from a queue. This will integrate tightly with Event Notifications and Workflows, so you can execute subsequent actions upon receiving a response.

Expanded LoRA support

At Birthday Week last year, we announced limited LoRA support for a handful of models. We’ve

iterated on this and now support 8 models as well as larger ranks of up to 32 and LoRA files up to 300 MB. Models that support LoRA inference now include:

What is LoRA?

In essence, a Low Rank Adaptation (LoRA) adapter allows people to take a trained adapter file and use it in conjunction with a model to alter the response of a model. We did a deep dive on LoRAs in our Birthday Week blog post, which goes into further technical detail. LoRA adapters are great alternatives to fine-tuning a model, as it isn’t as expensive to train and adapters are much smaller and more portable. They are also effective enough to tweak the output of a model to fit a certain style of response.

How do I get started?

To get started, you first need to train your own LoRA adapter or find a public one on HuggingFace. Then, you’ll upload the adapter_model.safetensors and adapter_config.json to your account with the documented wrangler commands or through the REST API. LoRA files are private and scoped to your own account. After that, you can start running fine-tuned inference — check out our LoRA developer docs to get started.

const response = await env.AI.run(
  "@cf/qwen/qwen2.5-coder-32b-instruct", //the model supporting LoRAs
  {
      messages: [{"role": "user", "content": "Hello world"}],
      raw: true, //skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000", //the finetune id OR finetune name
  }
);

Quality of life improvements: updated pricing and a new dashboard for Workers AI

While the team has been focused on large engineering milestones, we’ve also landed some quality of life improvements over the last few months. In case you missed it, we’ve announced an updated pricing model where usage will be shown in units such as tokens, audio seconds, image size/steps, etc. but still billed in neurons in the backend.

Today, we’re unveiling a new dashboard that allows users to see their usage in both units as well as neurons (built on new Workers Observability components!). Model pricing is also available via dashboard and developer docs on the models page. And if you use AI Gateway, Workers AI usage will also be displayed as metrics now.

New models available in Workers AI

Lastly, we’ve steadily been adding new models on Workers AI, with over 10 new models and a few updates on existing models. Pricing is also now listed directly on the model page in the developer docs. To summarize, here are the new models we’ve added on Workers AI, including four new ones we’re releasing today:

@cf/deepseek-ai/deepseek-r1-distill-qwen-32b: a version of Qwen 32B distilled from Deepseek’s R1 that is capable of doing chain-of-thought reasoning.
@cf/baai/bge-m3: a multi-lingual embeddings model that supports over 100 languages. It can also simultaneously perform dense retrieval, multi-vector retrieval, and sparse retrieval, with the ability to process inputs of different granularities.
@cf/baai/bge-reranker-base: our first reranker model! Rerankers are a type of text classification model that takes a query and context, and outputs a similarity score between the two. When used in RAG systems, you can use a reranker after the initial vector search to find the most relevant documents to return to a user by reranking the outputs.
@cf/openai/whisper-large-v3-turbo: a faster, more accurate speech-to-text model. This model was added earlier but is graduating out of beta with pricing included today.
@cf/myshell-ai/melotts: our first text-to-speech model that allows users to generate an MP3 with voice audio from text input.
@cf/meta/llama-4-scout-17b-16e-instruct: 17 billion parameter MoE model with 16 experts that is natively multimodal. Offers industry-leading performance in text and image understanding.
[NEW] @cf/mistralai/mistral-small-3.1-24b-instruct: a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling.
[NEW] @cf/google/gemma-3-12b-it: well-suited for a variety of text generation and image understanding tasks, including question answering, summarization and reasoning, with a 128K context window, and multilingual support in over 140 languages.
[NEW] @cf/qwen/qwq-32b: a medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.
[NEW] @cf/qwen/qwen2.5-coder-32b-instruct: the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.

In addition, we are rolling out some in-place updates to existing models in our catalog:

@cf/meta/llama-3.3-70b-instruct-fp8-fast - Llama 3.3 70b gets a speed boost with new techniques such as speculative decoding, prefix caching, and an updated server back end (see above).
@cf/baai/bge-small-en-v1.5, @cf/baai/bge-base-en-v1.5, @cf/baai/bge-large-en-v1.5 - get a new input parameter called “pooling” which takes either “cls” or “mean”

As we release these new models, we’ll be deprecating old models to encourage use of the state-of-the-art models and make room in our catalog. We will send out an email notice on this shortly. Stay up to date with our model releases and deprecation announcements by subscribing to our Developer Docs changelog.

We’re (still) just getting started

Workers AI is one of Cloudflare’s newer products in a nascent industry, but we still operate with very traditional Cloudflare principles – learning how we can do more with less. Our engineering team is focused on solving the difficult problems that come with growing a distributed inference platform at a global scale, and we’re excited to release these new features today that we think will improve the platform as a whole for all our users. With faster inference times, better reliability, more customization possibilities, and better usability, we’re excited to see what you can do with more Workers AI — let us know what you think!

Meta’s Llama 4 is now available on Workers AI

Michelle Chen — Sun, 06 Apr 2025 03:22:00 GMT

As one of Meta’s launch partners, we are excited to make Meta’s latest and most powerful model, Llama 4, available on the Cloudflare Workers AI platform starting today. Check out the Workers AI Developer Docs to begin using Llama 4 now.

What’s new in Llama 4?

Llama 4 is an industry-leading release that pushes forward the frontiers of open-source generative Artificial Intelligence (AI) models. Llama 4 relies on a novel design that combines a Mixture of Experts architecture with an early-fusion backbone that allows it to be natively multimodal.

The Llama 4 “herd” is made up of two models: Llama 4 Scout (109B total parameters, 17B active parameters) with 16 experts, and Llama 4 Maverick (400B total parameters, 17B active parameters) with 128 experts. The Llama Scout model is available on Workers AI today.

Llama 4 Scout has a context window of up to 10 million (10,000,000) tokens, which makes it one of the first open-source models to support a window of that size. A larger context window makes it possible to hold longer conversations, deliver more personalized responses, and support better Retrieval Augmented Generation (RAG). For example, users can take advantage of that increase to summarize multiple documents or reason over large codebases. At launch, Workers AI is supporting a context window of 131,000 tokens to start and we’ll be working to increase this in the future.

Llama 4 does not compromise parameter depth for speed. Despite having 109 billion total parameters, the Mixture of Experts (MoE) architecture can intelligently use only a fraction of those parameters during active inference. This delivers a faster response that is made smarter by the 109B parameter size.

What is a Mixture of Experts model?

A Mixture of Experts (MoE) model is a type of Sparse Transformer model that is composed of individual specialized neural networks called “experts”. MoE models also have a “router” component that manages input tokens and which experts they get sent to. These specialized experts work together to provide deeper results and faster inference times, increasing both model quality and performance.

For an illustrative example, let’s say there’s an expert that’s really good at generating code while another expert is really good at creative writing. When a request comes in to write a Fibonacci algorithm in Haskell, the router sends the input tokens to the coding expert. This means that the remaining experts might remain unactivated, so the model only needs to use the smaller, specialized neural network to solve the problem.

In the case of Llama 4 Scout, this means the model is only using one expert (17B parameters) instead of the full 109B total parameters of the model. In reality, the model probably needs to use multiple experts to handle a request, but the point still stands: an MoE model architecture is incredibly efficient for the breadth of problems it can handle and the speed at which it can handle it.

MoE also makes it more efficient to train models. We recommend reading Meta’s blog post on how they trained the Llama 4 models. While more efficient to train, hosting an MoE model for inference can sometimes be more challenging. You need to load the full model weights (over 200 GB) into GPU memory. Supporting a larger context window also requires keeping more memory available in a Key Value cache.

Thankfully, Workers AI solves this by offering Llama 4 Scout as a serverless model, meaning that you don’t have to worry about things like infrastructure, hardware, memory, etc. — we do all of that for you, so you are only one API request away from interacting with Llama 4.

What is early-fusion?

One challenge in building AI-powered applications is the need to grab multiple different models, like a Large Language Model (LLM) and a visual model, to deliver a complete experience for the user. Llama 4 solves that problem by being natively multimodal, meaning the model can understand both text and images.

You might recall that Llama 3.2 11b was also a vision model, but Llama 3.2 actually used separate parameters for vision and text. This means that when you sent an image request to the model, it only used the vision parameters to understand the image.

With Llama 4, all the parameters natively understand both text and images. This allowed Meta to train the model parameters with large amounts of unlabeled text, image, and video data together. For the user, this means that you don’t have to chain together multiple models like a vision model and an LLM for a multimodal experience — you can do it all with Llama 4.

Try it out now!

We are excited to partner with Meta as a launch partner to make it effortless for developers to use Llama 4 in Cloudflare Workers AI. The release brings an efficient, multimodal, highly-capable and open-source model to anyone who wants to build AI-powered applications.

Cloudflare’s Developer Platform makes it possible to build complete applications that run alongside our Llama 4 inference. You can rely on our compute, storage, and agent layer running seamlessly with the inference from models like Llama 4. To learn more, head over to our developer docs model page for more information on using Llama 4 on Workers AI, including pricing, additional terms, and acceptable use policies.

Want to try it out without an account? Visit our AI playground or get started with building your AI experiences with Llama 4 and Workers AI.

Cloudflare’s bigger, better, faster AI platform

Michelle Chen — Thu, 26 Sep 2024 13:00:00 GMT

Birthday Week 2024 marks our first anniversary of Cloudflare’s AI developer products — Workers AI, AI Gateway, and Vectorize. For our first birthday this year, we’re excited to announce powerful new features to elevate the way you build with AI on Cloudflare.

Workers AI is getting a big upgrade, with more powerful GPUs that enable faster inference and bigger models. We’re also expanding our model catalog to be able to dynamically support models that you want to run on us. Finally, we’re saying goodbye to neurons and revamping our pricing model to be simpler and cheaper. On AI Gateway, we’re moving forward on our vision of becoming an ML Ops platform by introducing more powerful logs and human evaluations. Lastly, Vectorize is going GA, with expanded index sizes and faster queries.

Watch on Cloudflare TV

Whether you want the fastest inference at the edge, optimized AI workflows, or vector database-powered RAG, we’re excited to help you harness the full potential of AI and get started on building with Cloudflare.

The fast, global AI platform

The first thing that you notice about an application is how fast, or in many cases, how slow it is. This is especially true of AI applications, where the standard today is to wait for a response to be generated.

At Cloudflare, we’re obsessed with improving the performance of applications, and have been doubling down on our commitment to make AI fast. To live up to that commitment, we’re excited to announce that we’ve added even more powerful GPUs across our network to accelerate LLM performance.

In addition to more powerful GPUs, we’ve continued to expand our GPU footprint to get as close to the user as possible, reducing latency even further. Today, we have GPUs in over 180 cities, having doubled our capacity in a year.

Bigger, better, faster

With the introduction of our new, more powerful GPUs, you can now run inference on significantly larger models, including Meta Llama 3.1 70B. Previously, our model catalog was limited to 8B parameter LLMs, but we can now support larger models, faster response times, and larger context windows. This means your applications can handle more complex tasks with greater efficiency.

Model

@cf/meta/llama-3.2-11b-vision-instruct

@cf/meta/llama-3.2-1b-instruct

@cf/meta/llama-3.2-3b-instruct

@cf/meta/llama-3.1-8b-instruct-fast

@cf/meta/Llama-3.1-70b-instruct

@cf/black-forest-labs/flux-1-schnell

The set of models above are available on our new GPUs at faster speeds. In general, you can expect throughput of 80+ Tokens per Second (TPS) for 8b models and a Time To First Token of 300 ms (depending on where you are in the world).

Our model instances now support larger context windows, like the full 128K context window for Llama 3.1 and 3.2. To give you full visibility into performance, we’ll also be publishing metrics like TTFT, TPS, Context Window, and pricing on models in our catalog, so you know exactly what to expect.

We’re committed to bringing the best of open-source models to our platform, and that includes Meta’s release of the new Llama 3.2 collection of models. As a Meta launch partner, we were excited to have Day 0 support for the 11B vision model, as well as the 1B and 3B text-only model on Workers AI.

For more details on how we made Workers AI fast, take a look at our technical blog post, where we share a novel method for KV cache compression (it’s open-source!), as well as details on speculative decoding, our new hardware design, and more.

Greater model flexibility

With our commitment to helping you run more powerful models faster, we are also expanding the breadth of models you can run on Workers AI with our Run Any* Model feature. Until now, we have manually curated and added only the most popular open source models to Workers AI. Now, we are opening up our catalog to the public, giving you the flexibility to choose from a broader selection of models. We will support models that are compatible with our GPUs and inference stack at the start (hence the asterisk on Run Any* Model). We’re launching this feature in closed beta and if you’d like to try it out, please fill out the form, so we can grant you access to this new feature.

The Workers AI model catalog will now be split into two parts: a static catalog and a dynamic catalog. Models in the static catalog will remain curated by Cloudflare and will include the most popular open source models with guarantees on availability and speed (the models listed above). These models will always be kept warm in our network, ensuring you don’t experience cold starts. The usage and pricing model remains serverless, where you will only be charged for the requests to the model and not the cold start times.

Models that are launched via Run Any* Model will make up the dynamic catalog. If the model is public, users can share an instance of that model. In the future, we will allow users to launch private instances of models as well.

This is just the first step towards running your own custom or private models on Workers AI. While we have already been supporting private models for select customers, we are working on making this capacity available to everyone in the near future.

New Workers AI pricing

We launched Workers AI during Birthday Week 2023 with the concept of “neurons” for pricing. Neurons were intended to simplify the unit of measure across various models on our platform, including text, image, audio, and more. However, over the past year, we have listened to your feedback and heard that neurons were difficult to grasp and challenging to compare with other providers. Additionally, the industry has matured, and new pricing standards have materialized. As such, we’re excited to announce that we will be moving towards unit-based pricing and saying goodbye to neurons.

Moving forward, Workers AI will be priced based on model task, size, and units. LLMs will be priced based on the model size (parameters) and input/output tokens. Image generation models will be priced based on the output image resolution and the number of steps. Embeddings models will be priced based on input tokens. Speech-to-text models will be priced on seconds of audio input.

Model Task	Units	Model Size	Pricing
LLMs (incl. Vision models)	Tokens in/out (blended)	<= 3B parameters	$0.10 per Million Tokens
3.1B - 8B	$0.15 per Million Tokens
8.1B - 20B	$0.20 per Million Tokens
20.1B - 40B	$0.50 per Million Tokens
40.1B+	$0.75 per Million Tokens
Embeddings	Tokens in	<= 150M parameters	$0.008 per Million Tokens
151M+ parameters	$0.015 per Million Tokens
Speech-to-text	Audio seconds in	N/A	$0.0039 per minute of audio input

Image Size	Model Type	Steps	Price
<=256x256	Standard	25	$0.00125 per 25 steps
Fast	5	$0.00025 per 5 steps
<=512x512	Standard	25	$0.0025 per 25 steps
Fast	5	$0.0005 per 5 steps
<=1024x1024	Standard	25	$0.005 per 25 steps
Fast	5	$0.001 per 5 steps
<=2048x2048	Standard	25	$0.01 per 25 steps
Fast	5	$0.002 per 5 steps

We paused graduating models and announcing pricing for beta models over the past few months as we prepared for this new pricing change. We’ll be graduating all models to this new pricing, and billing will take effect on October 1, 2024.

Our free tier has been redone to fit these new metrics, and will include a monthly allotment of usage across all the task types.

Model	Free tier size
Text Generation - LLM	10,000 tokens a day across any model size
Embeddings	10,000 tokens a day across any model size
Images	Sum of 250 steps, up to 1024x1024 resolution
Whisper	10 minutes of audio a day

Optimizing AI workflows with AI Gateway

AI Gateway is designed to help developers and organizations building AI applications better monitor, control, and optimize their AI usage, and thanks to our users, AI Gateway has reached an incredible milestone — over 2 billion requests proxied by September 2024, less than a year after its inception. But we are not stopping there.

Persistent logs (open beta)

Persistent logs allow developers to store and analyze user prompts and model responses for extended periods, up to 10 million logs per gateway. Each request made through AI Gateway will create a log. With a log, you can see details of a request, including timestamp, request status, model, and provider.

We have revamped our logging interface to offer more detailed insights, including cost and duration. Users can now annotate logs with human feedback using thumbs up and thumbs down. Lastly, you can now filter, search, and tag logs with custom metadata to further streamline analysis directly within AI Gateway.

Persistent logs are available to use on all plans, with a free allocation for both free and paid plans. On the Workers Free plan, users can store up to 100,000 logs total across all gateways at no charge. For those needing more storage, upgrading to the Workers Paid plan will give you a higher free allocation — 200,000 logs stored total. Any additional logs beyond those limits will be available at $8 per 100,000 logs stored per month, giving you the flexibility to store logs for your preferred duration and do more with valuable data. Billing for this feature will be implemented when the feature reaches General Availability, and we’ll provide plenty of advance notice.

	Workers Free	Workers Paid	Enterprise
Included Volume	100,000 logs stored (total)	200,000 logs stored (total)
Additional Logs	N/A	$8 per 100,000 logs stored per month

Export logs with Logpush

For users looking to export their logs, AI Gateway now supports log export via Logpush. With Logpush, you can automatically push logs out of AI Gateway into your preferred storage provider, including Cloudflare R2, Amazon S3, Google Cloud Storage, and more. This can be especially useful for compliance or advanced analysis outside the platform. Logpush follows its existing pricing model and will be available to all users on a paid plan.

AI evaluations

We are also taking our first step towards comprehensive AI evaluations, starting with evaluation using human in the loop feedback (this is now in open beta). Users can create datasets from logs to score and evaluate model performance, speed, and cost, initially focused on LLMs. Evaluations will allow developers to gain a better understanding of how their application is performing, ensuring better accuracy, reliability, and customer satisfaction. We’ve added support for cost analysis across many new models and providers to enable developers to make informed decisions, including the ability to add custom costs. Future enhancements will include automated scoring using LLMs, comparing performance of multiple models, and prompt evaluations, helping developers make decisions on what is best for their use case and ensuring their applications are both efficient and cost-effective.

Vectorize GA

We've completely redesigned Vectorize since our initial announcement in 2023 to better serve customer needs. Vectorize (v2) now supports indexes of up to 5 million vectors (up from 200,000), delivers faster queries (median latency is down 95% from 500 ms to 30 ms), and returns up to 100 results per query (increased from 20). These improvements significantly enhance Vectorize's capacity, speed, and depth of results.

Note: if you got started on Vectorize before GA, to ease the move from v1 to v2, a migration solution will be available in early Q4 — stay tuned!

New Vectorize pricing

Not only have we improved performance and scalability, but we've also made Vectorize one of the most cost-effective options on the market. We've reduced query prices by 75% and storage costs by 98%.

	New Vectorize pricing	Old Vectorize pricing	Price reduction
Writes	Free	Free	n/a
Query	$.01 per 1 million vector dimensions	$0.04 per 1 million vector dimensions	75%
Storage	$0.05 per 100 million vector dimensions	$4.00 per 100 million vector dimensions	98%

You can learn more about our pricing in the Vectorize docs.

Vectorize free tier

There’s more good news: we’re introducing a free tier to Vectorize to make it easy to experiment with our full AI stack.

The free tier includes:

30 million queried vector dimensions / month
5 million stored vector dimensions / month

How fast is Vectorize?

To measure performance, we conducted benchmarking tests by executing a large number of vector similarity queries as quickly as possible. We measured both request latency and result precision. In this context, precision refers to the proportion of query results that match the known true-closest results for all benchmarked queries. This approach allows us to assess both the speed and accuracy of our vector similarity search capabilities. Here are the following datasets we benchmarked on:

dbpedia-openai-1M-1536-angular: 1 million vectors, 1536 dimensions, queried with cosine similarity at a top K of 10
Laion-768-5m-ip: 5 million vectors, 768 dimensions, queried with cosine similarity at a top K of 10
- We ran this again skipping the result-refinement pass to return approximate results faster

Benchmark dataset	P50 (ms)	P75 (ms)	P90 (ms)	P95 (ms)	Throughput (RPS)	Precision
dbpedia-openai-1M-1536-angular	31	56	159	380	343	95.4%
Laion-768-5m-ip	81.5	91.7	105	123	623	95.5%
Laion-768-5m-ip w/o refinement	14.7	19.3	24.3	27.3	698	78.9%

These benchmarks were conducted using a standard Vectorize v2 index, queried with a concurrency of 300 via a Cloudflare Worker binding. The reported latencies reflect those observed by the Worker binding querying the Vectorize index on warm caches, simulating the performance of an existing application with sustained usage.

Beyond Vectorize's fast query speeds, we believe the combination of Vectorize and Workers AI offers an unbeatable solution for delivering optimal AI application experiences. By running Vectorize close to the source of inference and user interaction, rather than combining AI and vector database solutions across providers, we can significantly minimize end-to-end latency.

With these improvements, we're excited to announce the general availability of the new Vectorize, which is more powerful, faster, and more cost-effective than ever before.

Tying it all together: the AI platform for all your inference needs

Over the past year, we’ve been committed to building powerful AI products that enable users to build on us. While we are making advancements on each of these individual products, our larger vision is to provide a seamless, integrated experience across our portfolio.

With Workers AI and AI Gateway, users can easily enable analytics, logging, caching, and rate limiting to their AI application by connecting to AI Gateway directly through a binding in the Workers AI request. We imagine a future where AI Gateway can not only help you create and save datasets to use for fine-tuning your own models with Workers AI, but also seamlessly redeploy them on the same platform. A great AI experience is not just about speed, but also accuracy. While Workers AI ensures fast performance, using it in combination with AI Gateway allows you to evaluate and optimize that performance by monitoring model accuracy and catching issues, like hallucinations or incorrect formats. With AI Gateway, users can test out whether switching to new models in the Workers AI model catalog will deliver more accurate performance and a better user experience.

In the future, we’ll also be working on tighter integrations between Vectorize and Workers AI, where you can automatically supply context or remember past conversations in an inference call. This cuts down on the orchestration needed to run a RAG application, where we can automatically help you make queries to vector databases.

If we put the three products together, we imagine a world where you can build AI apps with full observability (traces with AI Gateway) and see how the retrieval (Vectorize) and generation (Workers AI) components are working together, enabling you to diagnose issues and improve performance.

This Birthday Week, we’ve been focused on making sure our individual products are best-in-class, but we’re continuing to invest in building a holistic AI platform within our AI portfolio, but also with the larger Developer Platform Products. Our goal is to make sure that Cloudflare is the simplest, fastest, more powerful place for you to build full-stack AI experiences with all the batteries included.

We’re excited for you to try out all these new features! Take a look at our updated developer docs on how to get started and the Cloudflare dashboard to interact with your account.

Meta Llama 3.1 now available on Workers AI

Michelle Chen — Tue, 23 Jul 2024 15:15:55 GMT

At Cloudflare, we’re big supporters of the open-source community – and that extends to our approach for Workers AI models as well. Our strategy for our Cloudflare AI products is to provide a top-notch developer experience and toolkit that can help people build applications with open-source models.

We’re excited to be one of Meta’s launch partners to make their newest Llama 3.1 8B model available to all Workers AI users on Day 1. You can run their latest model by simply swapping out your model ID to @cf/meta/llama-3.1-8b-instruct or test out the model on our Workers AI Playground. Llama 3.1 8B is free to use on Workers AI until the model graduates out of beta.

Meta’s Llama collection of models have consistently shown high-quality performance in areas like general knowledge, steerability, math, tool use, and multilingual translation. Workers AI is excited to continue to distribute and serve the Llama collection of models on our serverless inference platform, powered by our globally distributed GPUs.

The Llama 3.1 model is particularly exciting, as it is released in a higher precision (bfloat16), incorporates function calling, and adds support across 8 languages. Having multilingual support built-in means that you can use Llama 3.1 to write prompts and receive responses directly in languages like English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Expanding model understanding to more languages means that your applications have a bigger reach across the world, and it’s all possible with just one model.

const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    stream: true,
    messages: [{
        "role": "user",
        "content": "Qu'est-ce que ç'est verlan en français?"
    }],
});

Llama 3.1 also introduces native function calling (also known as tool calls) which allows LLMs to generate structured JSON outputs which can then be fed into different APIs. This means that function calling is supported out-of-the-box, without the need for a fine-tuned variant of Llama that specializes in tool use. Having this capability built-in means that you can use one model across various tasks.

Workers AI recently announced embedded function calling, which is now usable with Meta Llama 3.1 as well. Our embedded function calling gives developers a way to run their inference tasks far more efficiently than traditional architectures, leveraging Cloudflare Workers to reduce the number of requests that need to be made manually. It also makes use of our open-source ai-utils package, which helps you orchestrate the back-and-forth requests for function calling along with other helper methods that can automatically generate tool schemas. Below is an example function call to Llama 3.1 with embedded function calling that then stores key-values in Workers KV.

const response = await runWithTools(env.AI, "@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: "Greet the user and ask them a question" }],
    tools: [{
        name: "Store in memory",
        description: "Store everything that the user talks about in memory as a key-value pair.",
        parameters: {
            type: "object",
            properties: {
                key: {
                    type: "string",
                    description: "The key to store the value under.",
                },
                value: {
                    type: "string",
                    description: "The value to store.",
                },
            },
            required: ["key", "value"],
        },
        function: async ({ key, value }) => {
                await env.KV.put(key, value);

                return JSON.stringify({
                    success: true,
                });
         }
    }]
})

We’re excited to see what you build with these new capabilities. As always, use of the new model should be conducted with Meta’s Acceptable Use Policy and License in mind. Take a look at our developer documentation to get started!

Embedded function calling in Workers AI: easier, smarter, faster

Harley Turan — Thu, 27 Jun 2024 17:00:09 GMT

Introducing embedded function calling and a new ai-utils package

Today, we’re excited to announce a novel way to do function calling that co-locates LLM inference with function execution, and a new ai-utils package that upgrades the developer experience for function calling.

This is a follow-up to our mid-June announcement for traditional function calling, which allows you to leverage a Large Language Model (LLM) to intelligently generate structured outputs and pass them to an API call. Function calling has been largely adopted and standardized in the industry as a way for AI models to help perform actions on behalf of a user.

Our goal is to make building with AI as easy as possible, which is why we’re introducing a new @cloudflare/ai-utils npm package that allows developers to get started quickly with embedded function calling. These helper tools drastically simplify your workflow by actually executing your function code and dynamically generating tools from OpenAPI specs. We’ve also open-sourced our ai-utils package, which you can find on GitHub. With both embedded function calling and our ai-utils, you’re one step closer to creating intelligent AI agents, and from there, the possibilities are endless.

Why Cloudflare’s AI platform?

OpenAI has been the gold standard when it comes to having performant model inference and a great developer experience. However, they mostly support their closed-source models, while we want to also promote the open-source ecosystem of models. One of our goals with Workers AI is to match the developer experience you might get from OpenAI, but with open-source models.

There are other open-source inference providers out there like Azure or Bedrock, but most of them are focused on serving inference and the underlying infrastructure, rather than being a developer toolkit. While there are external libraries and frameworks like AI SDK that help developers build quickly with simple abstractions, they rely on upstream providers to do the actual inference. With Workers AI, it’s the best of both worlds – we offer open-source model inference and a killer developer experience out of the box.

With the release of embedded function calling and ai-utils today, we’ve advanced how we do inference for function calling and improved the developer experience by making it dead simple for developers to start building AI experiences.

How does traditional function calling work?

Traditional LLM function calling allows customers to specify a set of function names and required arguments along with a prompt when running inference on an LLM. The LLM returns the names and arguments for the functions that the customer can then make to perform actions. These actions give LLMs the ability to do things like fetch fresh data not present in the training dataset and "perform actions" based on user intent.

Traditional function calling requires multiple back-and-forth requests passing through the network in order to get to the final output. This includes requests to your origin server, an inference provider, and external APIs. As a developer, you have to orchestrate all the back-and-forths and handle all the requests and responses. If you were building complex agents with multi-tool calls or recursive tool calls, it gets infinitely harder. Fortunately, this doesn’t have to be the case, and we’ve solved it for you.

Embedded function calling

With Workers AI, our inference runtime is the Workers platform, and the Workers platform can be seen as a global compute network of distributed functions (RPCs). With this model, we can run inference using Workers AI, and supply not only the function names and arguments, but also the runtime function code to be executed. Rather than performing multiple round-trips across networks, the LLM inference and function can run in the same execution environment, cutting out all the unnecessary requests.

Cloudflare is one of the few inference providers that is able to do this because we offer more than just inference – our developer platform has compute, storage, inference, and more, all within the same Workers runtime.

We made it easy for you with a new ai-utils package

And to make it as simple as possible, we created a @cloudflare/ai-utils package that you can use to get started. These powerful abstractions cut down on the logic you have to implement to do function calling – it just works out of the box.

runWithTools

runWithTools is our method that you use to do embedded function calling. You pass in your AI binding (env.AI), model, prompt messages, and tools. The tools array includes the description of the function, similar to traditional function calling, but you also pass in the function code that needs to be executed. This method makes the inference calls and executes the function code in one single step. runWithTools is also able to handle multiple function calls, recursive tool calls, validation for model responses, streaming for the final response, and other features.

Another feature to call out is a helper method called autoTrimTools that automatically selects the relevant tools and trims the tools array based on the names and descriptions. We do this by adding an initial LLM inference call to intelligently trim the tools array before the actual function-calling inference call is made. We found that autoTrimTools helped decrease the number of total tokens used in the entire process (especially when there’s a large number of tools provided) because there’s significantly fewer input tokens used when generating the arguments list. You can choose to use autoTrimTools by setting it as a parameter in the runWithTools method.

const response = await runWithTools(env.AI,"@hf/nousresearch/hermes-2-pro-mistral-7b",
  {
    messages: [{ role: "user", content: "What's the weather in Austin, Texas?"}],
    tools: [
      {
        name: "getWeather",
        description: "Return the weather for a latitude and longitude",
        parameters: {
          type: "object",
          properties: {
            latitude: {
              type: "string",
              description: "The latitude for the given location"
            },
            longitude: {
              type: "string",
              description: "The longitude for the given location"
            }
          },
          required: ["latitude", "longitude"]
        },
	 // function code to be executed after tool call
        function: async ({ latitude, longitude }) => {
          const url = `https://api.weatherapi.com/v1/current.json?key=${env.WEATHERAPI_TOKEN}&q=${latitude},${longitude}`
          const res = await fetch(url).then((res) => res.json())

          return JSON.stringify(res)
        }
      }
    ]
  },
  {
    streamFinalResponse: true,
    maxRecursiveToolRuns: 5,
    trimFunction: autoTrimTools,
    verbose: true,
    strictValidation: true
  }
)

createToolsFromOpenAPISpec

For many use cases, users will need to make a request to an external API call during function calling to get the output needed. Instead of having to hardcode the exact API endpoints in your tools array, we made a helper function that takes in an OpenAPI spec and dynamically generates the corresponding tool schemas and API endpoints you’ll need for the function call. You call createToolsFromOpenAPISpec from within runWithTools and it’ll dynamically populate everything for you.

const response = await runWithTools(env.AI, "@hf/nousresearch/hermes-2-pro-mistral-7b", {
  messages: [{ role: "user",content: "Can you name me 5 repos created by Cloudflare"}],
  tools: [
    ...(await createToolsFromOpenAPISpec(  "https://raw.githubusercontent.com/github/rest-api-description/main/descriptions-next/api.github.com/api.github.com.json"
    ))
  ]
})

Putting it all together

When you make a function calling inference request with runWithTools and createToolsFromOpenAPISpec, the only thing you need is the prompts – the rest is automatically handled. The LLM will choose the correct tool based on the prompt, the runtime will execute the function needed, and you’ll get a fast, intelligent response from the model. By leveraging our Workers runtime’s bindings and RPC calls along with our global network, we can execute everything from a single location close to the user, enabling developers to easily write complex agentic chains with fewer lines of code.

We’re super excited to help people build intelligent AI systems with our new embedded function calling and powerful tools. Check out our developer docs on how to get started, and let us know what you think on Discord.

AI Gateway is generally available: a unified interface for managing and scaling your generative AI workloads

Kathy Liao — Wed, 22 May 2024 13:00:17 GMT

During Developer Week in April 2024, we announced General Availability of Workers AI, and today, we are excited to announce that AI Gateway is Generally Available as well. Since its launch to beta in September 2023 during Birthday Week, we’ve proxied over 500 million requests and are now prepared for you to use it in production.

AI Gateway is an AI ops platform that offers a unified interface for managing and scaling your generative AI workloads. At its core, it acts as a proxy between your service and your inference provider(s), regardless of where your model runs. With a single line of code, you can unlock a set of powerful features focused on performance, security, reliability, and observability – think of it as your control plane for your AI ops. And this is just the beginning – we have a roadmap full of exciting features planned for the near future, making AI Gateway the tool for any organization looking to get more out of their AI workloads.

Why add a proxy and why Cloudflare?

The AI space moves fast, and it seems like every day there is a new model, provider, or framework. Given this high rate of change, it’s hard to keep track, especially if you’re using more than one model or provider. And that’s one of the driving factors behind launching AI Gateway – we want to provide you with a single consistent control plane for all your models and tools, even if they change tomorrow, and then again the day after that.

We've talked to a lot of developers and organizations building AI applications, and one thing is clear: they want more observability, control, and tooling around their AI ops. This is something many of the AI providers are lacking as they are deeply focused on model development and less so on platform features.

Why choose Cloudflare for your AI Gateway? Well, in some ways, it feels like a natural fit. We've spent the last 10+ years helping build a better Internet by running one of the largest global networks, helping customers around the world with performance, reliability, and security – Cloudflare is used as a reverse proxy by nearly 20% of all websites. With our expertise, it felt like a natural progression – change one line of code, and we can help with observability, reliability, and control for your AI applications – all in one control plane – so that you can get back to building.

Here is that one line code change using the OpenAI JS SDK. And check out our docs to reference other providers, SDKs, and languages.

import OpenAI from 'openai';

const openai = new OpenAI({
apiKey: 'my api key', // defaults to process.env["OPENAI_API_KEY"]
	baseURL: "https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_slug}/openai"
});

What’s included today?

After talking to customers, it was clear that we needed to focus on some foundational features before moving onto some of the more advanced ones. While we're really excited about what’s to come, here are the key features available in GA today:

Analytics: Aggregate metrics from across multiple providers. See traffic patterns and usage including the number of requests, tokens, and costs over time.

Real-time logs: Gain insight into requests and errors as you build.

Caching: Enable custom caching rules and use Cloudflare’s cache for repeat requests instead of hitting the original model provider API, helping you save on cost and latency.

Rate limiting: Control how your application scales by limiting the number of requests your application receives to control costs or prevent abuse.

Support for your favorite providers: AI Gateway now natively supports Workers AI plus 10 of the most popular providers, including Groq and Cohere as of mid-May 2024.

Universal endpoint: In case of errors, improve resilience by defining request fallbacks to another model or inference provider.

curl https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_slug} -X POST \
  --header 'Content-Type: application/json' \
  --data '[
  {
    "provider": "workers-ai",
    "endpoint": "@cf/meta/llama-2-7b-chat-int8",
    "headers": {
      "Authorization": "Bearer {cloudflare_token}",
      "Content-Type": "application/json"
    },
    "query": {
      "messages": [
        {
          "role": "system",
          "content": "You are a friendly assistant"
        },
        {
          "role": "user",
          "content": "What is Cloudflare?"
        }
      ]
    }
  },
  {
    "provider": "openai",
    "endpoint": "chat/completions",
    "headers": {
      "Authorization": "Bearer {open_ai_token}",
      "Content-Type": "application/json"
    },
    "query": {
      "model": "gpt-3.5-turbo",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "What is Cloudflare?"
        }
      ]
    }
  }
]'

What’s coming up?

We've gotten a lot of feedback from developers, and there are some obvious things on the horizon such as persistent logs and custom metadata – foundational features that will help unlock the real magic down the road.

But let's take a step back for a moment and share our vision. At Cloudflare, we believe our platform is much more powerful as a unified whole than as a collection of individual parts. This mindset applied to our AI products means that they should be easy to use, combine, and run in harmony.

Let's imagine the following journey. You initially onboard onto Workers AI to run inference with the latest open source models. Next, you enable AI Gateway to gain better visibility and control, and start storing persistent logs. Then you want to start tuning your inference results, so you leverage your persistent logs, our prompt management tools, and our built in eval functionality. Now you're making analytical decisions to improve your inference results. With each data driven improvement, you want more. So you implement our feedback API which helps annotate inputs/outputs, in essence building a structured data set. At this point, you are one step away from a one-click fine tune that can be deployed instantly to our global network, and it doesn't stop there. As you continue to collect logs and feedback, you can continuously rebuild your fine tune adapters in order to deliver the best results to your end users.

This is all just an aspirational story at this point, but this is how we envision the future of AI Gateway and our AI suite as a whole. You should be able to start with the most basic setup and gradually progress into more advanced workflows, all without leaving Cloudflare’s AI platform. In the end, it might not look exactly as described above, but you can be sure that we are committed to providing the best AI ops tools to help make Cloudflare the best place for AI.

How do I get started?

AI Gateway is available to use today on all plans. If you haven’t yet used AI Gateway, check out our developer documentation and get started now. AI Gateway’s core features available today are offered for free, and all it takes is a Cloudflare account and one line of code to get started. In the future, more premium features, such as persistent logging and secrets management will be available subject to fees. If you have any questions, reach out on our Discord channel.

Meta Llama 3 available on Cloudflare Workers AI

Michelle Chen — Thu, 18 Apr 2024 20:58:33 GMT

We are thrilled to give developers around the world the ability to build AI applications with Meta Llama 3 using Workers AI. We are proud to be a launch partner with Meta for their newest 8B Llama 3 model, and excited to continue our partnership to bring the best of open-source models to our inference platform.

Workers AI

Workers AI’s initial launch in beta included support for Llama 2, as it was one of the most requested open source models from the developer community. Since that initial launch, we’ve seen developers build all kinds of innovative applications including knowledge sharing chatbots, creative content generation, and automation for various workflows.

At Cloudflare, we know developers want simplicity and flexibility, with the ability to build with multiple AI models while optimizing for accuracy, performance, and cost, among other factors. Our goal is to make it as easy as possible for developers to use their models of choice without having to worry about the complexities of hosting or deploying models.

As soon as we learned about the development of Llama 3 from our partners at Meta, we knew developers would want to start building with it as quickly as possible. Workers AI’s serverless inference platform makes it extremely easy and cost-effective to start using the latest large language models (LLMs). Meta’s commitment to developing and growing an open AI-ecosystem makes it possible for customers of all sizes to use AI at scale in production. All it takes is a few lines of code to run inference to Llama 3:

export interface Env {
  // If you set another name in wrangler.toml as the value for 'binding',
  // replace "AI" with the variable name you defined.
  AI: any;
}

export default {
  async fetch(request: Request, env: Env) {
    const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
        messages: [
{role: "user", content: "What is the origin of the phrase Hello, World?"}
	 ]
      }
    );

    return new Response(JSON.stringify(response));
  },
};

Built with Meta Llama 3

Llama 3 offers leading performance on a wide range of industry benchmarks. You can learn more about the architecture and improvements on Meta’s blog post. Cloudflare Workers AI supports Llama 3 8B, including the instruction fine-tuned model.

Meta’s testing shows that Llama 3 is the most advanced open LLM today on evaluation benchmarks such as MMLU, GPQA, HumanEval, GSM-8K, and MATH. Llama 3 was trained on an increased number of training tokens (15T), allowing the model to have a better grasp on language intricacies. Larger context windows doubles the capacity of Llama 2, and allows the model to better understand lengthy passages with rich contextual data. Although the model supports a context window of 8k, we currently only support 2.8k but are looking to support 8k context windows through quantized models soon. As well, the new model introduces an efficient new tiktoken-based tokenizer with a vocabulary of 128k tokens, encoding more characters per token, and achieving better performance on English and multilingual benchmarks. This means that there are 4 times as many parameters in the embedding and output layers, making the model larger than the previous Llama 2 generation of models.

Under the hood, Llama 3 uses grouped-query attention (GQA), which improves inference efficiency for longer sequences and also renders their 8B model architecturally equivalent to Mistral-7B. For tokenization, it uses byte-level byte-pair encoding (BPE), similar to OpenAI’s GPT tokenizers. This allows tokens to represent any arbitrary byte sequence — even those without a valid utf-8 encoding. This makes the end-to-end model much more flexible in its representation of language, and leads to improved performance.

Along with the base Llama 3 models, Meta has released a suite of offerings with tools such as Llama Guard 2, Code Shield, and CyberSec Eval 2, which we are hoping to release on our Workers AI platform shortly.

Try it out now

Meta Llama 3 8B is available in the Workers AI Model Catalog today! Check out the documentation here and as always if you want to share your experiences or learn more, join us in the Developer Discord.

Leveling up Workers AI: general availability and more new capabilities

Michelle Chen — Tue, 02 Apr 2024 13:01:00 GMT

Welcome to Tuesday – our AI day of Developer Week 2024! In this blog post, we’re excited to share an overview of our new AI announcements and vision, including news about Workers AI officially going GA with improved pricing, a GPU hardware momentum update, an expansion of our Hugging Face partnership, Bring Your Own LoRA fine-tuned inference, Python support in Workers, more providers in AI Gateway, and Vectorize metadata filtering.

Workers AI GA

Today, we’re excited to announce that our Workers AI inference platform is now Generally Available. After months of being in open beta, we’ve improved our service with greater reliability and performance, unveiled pricing, and added many more models to our catalog.

Improved performance & reliability

With Workers AI, our goal is to make AI inference as reliable and easy to use as the rest of Cloudflare’s network. Under the hood, we’ve upgraded the load balancing that is built into Workers AI. Requests can now be routed to more GPUs in more cities, and each city is aware of the total available capacity for AI inference. If the request would have to wait in a queue in the current city, it can instead be routed to another location, getting results back to you faster when traffic is high. With this, we’ve increased rate limits across all our models – most LLMs now have a of 300 requests per minute, up from 50 requests per minute during our beta phase. Smaller models have a limit of 1500-3000 requests per minute. Check out our Developer Docs for the rate limits of individual models.

Lowering costs on popular models

Alongside our GA of Workers AI, we published a pricing calculator for our 10 non-beta models earlier this month. We want Workers AI to be one of the most affordable and accessible solutions to run inference, so we added a few optimizations to our models to make them more affordable. Now, Llama 2 is over 7x cheaper and Mistral 7B is over 14x cheaper to run than we had initially published on March 1. We want to continue to be the best platform for AI inference and will continue to roll out optimizations to our customers when we can.

As a reminder, our billing for Workers AI started on April 1st for our non-beta models, while beta models remain free and unlimited. We offer 10,000 neurons per day for free to all customers. Workers Free customers will encounter a hard rate limit after 10,000 neurons in 24 hours while Workers Paid customers will incur usage at $0.011 per 1000 additional neurons. Read our Workers AI Pricing Developer Docs for the most up-to-date information on pricing.

New dashboard and playground

Lastly, we’ve revamped our Workers AI dashboard and AI playground. The Workers AI page in the Cloudflare dashboard now shows analytics for usage across models, including neuron calculations to help you better predict pricing. The AI playground lets you quickly test and compare different models and configure prompts and parameters. We hope these new tools help developers start building on Workers AI seamlessly – go try them out!

Run inference on GPUs in over 150 cities around the world

When we announced Workers AI back in September 2023, we set out to deploy GPUs to our data centers around the world. We plan to deliver on that promise and deploy inference-tuned GPUs almost everywhere by the end of 2024, making us the most widely distributed cloud-AI inference platform. We have over 150 cities with GPUs today and will continue to roll out more throughout the year.

We also have our next generation of compute servers with GPUs launching in Q2 2024, which means better performance, power efficiency, and improved reliability over previous generations. We provided a preview of our Gen 12 Compute servers design in a December 2023 blog post, with more details to come. With Gen 12 and future planned hardware launches, the next step is to support larger machine learning models and offer fine-tuning on our platform. This will allow us to achieve higher inference throughput, lower latency and greater availability for production workloads, as well as expanding support to new categories of workloads such as fine-tuning.

Hugging Face Partnership

We’re also excited to continue our partnership with Hugging Face in the spirit of bringing the best of open-source to our customers. Now, you can visit some of the most popular models on Hugging Face and easily click to run the model on Workers AI if it is available on our platform.

We’re happy to announce that we’ve added 4 more models to our platform in conjunction with Hugging Face. You can now access the new Mistral 7B v0.2 model with improved context windows, Nous Research’s Hermes 2 Pro fine-tuned version of Mistral 7B, Google’s Gemma 7B, and Starling-LM-7B-beta fine-tuned from OpenChat. There are currently 14 models that we’ve curated with Hugging Face to be available for serverless GPU inference powered by Cloudflare’s Workers AI platform, with more coming soon. These models are all served using Hugging Face’s technology with a TGI backend, and we work closely with the Hugging Face team to curate, optimize, and deploy these models.

“We are excited to work with Cloudflare to make AI more accessible to developers. Offering the most popular open models with a serverless API, powered by a global fleet of GPUs is an amazing proposition for the Hugging Face community, and I can’t wait to see what they build with it.”- Julien Chaumond, Co-founder and CTO, Hugging Face

You can find all of the open models supported in Workers AI in this Hugging Face Collection, and the “Deploy to Cloudflare Workers AI” button is at the top of each model card. To learn more, read Hugging Face’s blog post and take a look at our Developer Docs to get started. Have a model you want to see on Workers AI? Send us a message on Discord with your request.

Supporting fine-tuned inference - BYO LoRAs

Fine-tuned inference is one of our most requested features for Workers AI, and we’re one step closer now with Bring Your Own (BYO) LoRAs. Using the popular Low-Rank Adaptation method, researchers have figured out how to take a model and adapt some model parameters to the task at hand, rather than rewriting all model parameters like you would for a fully fine-tuned model. This means that you can get fine-tuned model outputs without the computational expense of fully fine-tuning a model.

We now support bringing trained LoRAs to Workers AI, where we apply the LoRA adapter to a base model at runtime to give you fine-tuned inference, at a fraction of the cost, size, and speed of a fully fine-tuned model. In the future, we want to be able to support fine-tuning jobs and fully fine-tuned models directly on our platform, but we’re excited to be one step closer today with LoRAs.

const response = await ai.run(
  "@cf/mistralai/mistral-7b-instruct-v0.2-lora", //the model supporting LoRAs
  {
      messages: [{"role": "user", "content": "Hello world"],
      raw: true, //skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000", //the finetune id OR name 
  }
);

BYO LoRAs is in open beta as of today for Gemma 2B and 7B, Llama 2 7B and Mistral 7B models with LoRA adapters up to 100MB in size and max rank of 8, and up to 30 total LoRAs per account. As always, we expect you to use Workers AI and our new BYO LoRA feature with our Terms of Service in mind, including any model-specific restrictions on use contained in the models’ license terms.

Read the technical deep dive blog post on fine-tuning with LoRA and developer docs to get started.

Write Workers in Python

Python is the second most popular programming language in the world (after JavaScript) and the language of choice for building AI applications. And starting today, in open beta, you can now write Cloudflare Workers in Python. Python Workers support all bindings to resources on Cloudflare, including Vectorize, D1, KV, R2 and more.

LangChain is the most popular framework for building LLM‑powered applications, and like how Workers AI works with langchain-js, the Python LangChain library works on Python Workers, as do other Python packages like FastAPI.

Workers written in Python are just as simple as Workers written in JavaScript:

from js import Response

async def on_fetch(request, env):
    return Response.new("Hello world!")

…and are configured by simply pointing at a .py file in your wrangler.toml:

name = "hello-world-python-worker"
main = "src/entry.py"
compatibility_date = "2024-03-18"
compatibility_flags = ["python_workers"]

There are no extra toolchain or precompilation steps needed. The Pyodide Python execution environment is provided for you, directly by the Workers runtime, mirroring how Workers written in JavaScript already work.

There’s lots more to dive into — take a look at the docs, and check out our companion blog post for details about how Python Workers work behind the scenes.

AI Gateway now supports Anthropic, Azure, AWS Bedrock, Google Vertex, and Perplexity

Our AI Gateway product helps developers better control and observe their AI applications, with analytics, caching, rate limiting, and more. We are continuing to add more providers to the product, including Anthropic, Google Vertex, and Perplexity, which we’re excited to announce today. We quietly rolled out Azure and Amazon Bedrock support in December 2023, which means that the most popular providers are now supported via AI Gateway, including Workers AI itself.

Take a look at our Developer Docs to get started with AI Gateway.

Coming soon: Persistent Logs

In Q2 of 2024, we will be adding persistent logs so that you can push your logs (including prompts and responses) to object storage, custom metadata so that you can tag requests with user IDs or other identifiers, and secrets management so that you can securely manage your application’s API keys.

We want AI Gateway to be the control plane for your AI applications, allowing developers to dynamically evaluate and route requests to different models and providers. With our persistent logs feature, we want to enable developers to use their logged data to fine-tune models in one click, eventually running the fine-tune job and the fine-tuned model directly on our Workers AI platform. AI Gateway is just one product in our AI toolkit, but we’re excited about the workflows and use cases it can unlock for developers building on our platform, and we hope you’re excited about it too.

Vectorize metadata filtering and future GA of million vector indexes

Vectorize is another component of our toolkit for AI applications. In open beta since September 2023, Vectorize allows developers to persist embeddings (vectors), like those generated from Workers AI text embedding models, and query for the closest match to support use cases like similarity search or recommendations. Without a vector database, model output is forgotten and can’t be recalled without extra costs to re-run a model.

Since Vectorize’s open beta, we’ve added metadata filtering. Metadata filtering lets developers combine vector search with filtering for arbitrary metadata, supporting the query complexity in AI applications. We’re laser-focused on getting Vectorize ready for general availability, with an target launch date of June 2024, which will include support for multi-million vector indexes.

// Insert vectors with metadata
const vectors: Array = [
  {
    id: "1",
    values: [32.4, 74.1, 3.2],
    metadata: { url: "/products/sku/13913913", streaming_platform: "netflix" }
  },
  {
    id: "2",
    values: [15.1, 19.2, 15.8],
    metadata: { url: "/products/sku/10148191", streaming_platform: "hbo" }
  },
...
];
let upserted = await env.YOUR_INDEX.upsert(vectors);

// Query with metadata filtering
let metadataMatches = await env.YOUR_INDEX.query(, { filter: { streaming_platform: "netflix" }} )

The most comprehensive Developer Platform to build AI applications

On Cloudflare’s Developer Platform, we believe that all developers should be able to quickly build and ship full-stack applications – and that includes AI experiences as well. With our GA of Workers AI, announcements for Python support in Workers, AI Gateway, and Vectorize, and our partnership with Hugging Face, we’ve expanded the world of possibilities for what you can build with AI on our platform. We hope you are as excited as we are – take a look at all our Developer Docs to get started, and let us know what you build.

Running fine-tuned models on Workers AI with LoRAs

Michelle Chen — Tue, 02 Apr 2024 13:00:48 GMT

Inference from fine-tuned LLMs with LoRAs is now in open beta

Today, we’re excited to announce that you can now run fine-tuned inference with LoRAs on Workers AI. This feature is in open beta and available for pre-trained LoRA adapters to be used with Mistral, Gemma, or Llama 2, with some limitations. Take a look at our product announcements blog post to get a high-level overview of our Bring Your Own (BYO) LoRAs feature.

In this post, we’ll do a deep dive into what fine-tuning and LoRAs are, show you how to use it on our Workers AI platform, and then delve into the technical details of how we implemented it on our platform.

What is fine-tuning?

Fine-tuning is a general term for modifying an AI model by continuing to train it with additional data. The goal of fine-tuning is to increase the probability that a generation is similar to your dataset. Training a model from scratch is not practical for many use cases given how expensive and time consuming they can be to train. By fine-tuning an existing pre-trained model, you benefit from its capabilities while also accomplishing your desired task. Low-Rank Adaptation (LoRA) is a specific fine-tuning method that can be applied to various model architectures, not just LLMs. It is common that the pre-trained model weights are directly modified or fused with additional fine-tune weights in traditional fine-tuning methods. LoRA, on the other hand, allows for the fine-tune weights and pre-trained model to remain separate, and for the pre-trained model to remain unchanged. The end result is that you can train models to be more accurate at specific tasks, such as generating code, having a specific personality, or generating images in a specific style. You can even fine-tune an existing LLM to understand additional information about a specific topic.

The approach of maintaining the original base model weights means that you can create new fine-tune weights with relatively little compute. You can take advantage of existing foundational models (such as Llama, Mistral, and Gemma), and adapt them for your needs.

How does fine-tuning work?

To better understand fine-tuning and why LoRA is so effective, we have to take a step back to understand how AI models work. AI models (like LLMs) are neural networks that are trained through deep learning techniques. In neural networks, there are a set of parameters that act as a mathematical representation of the model’s domain knowledge, made up of weights and biases – in simple terms, numbers. These parameters are usually represented as large matrices of numbers. The more parameters a model has, the larger the model is, so when you see models like llama-2-7b, you can read “7b” and know that the model has 7 billion parameters.

A model’s parameters define its behavior. When you train a model from scratch, these parameters usually start off as random numbers. As you train the model on a dataset, these parameters get adjusted bit-by-bit until the model reflects the dataset and exhibits the right behavior. Some parameters will be more important than others, so we apply a weight and use it to show more or less importance. Weights play a crucial role in the model's ability to capture patterns and relationships in the data it is trained on.

Traditional fine-tuning will adjust all the parameters in the trained model with a new set of weights. As such, a fine-tuned model requires us to serve the same amount of parameters as the original model, which means it can take a lot of time and compute to train and run inference for a fully fine-tuned model. On top of that, new state-of-the-art models, or versions of existing models, are regularly released, meaning that fully fine-tuned models can become costly to train, maintain, and store.

LoRA is an efficient method of fine-tuning

In the simplest terms, LoRA avoids adjusting parameters in a pre-trained model and instead allows us to apply a small number of additional parameters. These additional parameters are applied temporarily to the base model to effectively control model behavior. Relative to traditional fine-tuning methods it takes a lot less time and compute to train these additional parameters, which are referred to as a LoRA adapter. After training, we package up the LoRA adapter as a separate model file that can then plug in to the base model it was trained from. A fully fine-tuned model can be tens of gigabytes in size, while these adapters are usually just a few megabytes. This makes it a lot easier to distribute, and serving fine-tuned inference with LoRA only adds ms of latency to total inference time.

If you’re curious to understand why LoRA is so effective, buckle up — we first have to go through a brief lesson on linear algebra. If that’s not a term you’ve thought about since university, don’t worry, we’ll walk you through it.

Show me the math

With traditional fine-tuning, we can take the weights of a model (W0) and tweak them to output a new set of weights — so the difference between the original model weights and the new weights is ΔW, representing the change in weights_._ Therefore, a tuned model will have a new set of weights which can be represented as the original model weights plus the change in weights, W0 + ΔW.

Remember, all of these model weights are actually represented as large matrices of numbers. In math, every matrix has a property called rank (r), which describes the number of linearly independent columns or rows in a matrix. When matrices are low-rank, they have only a few columns or rows that are “important”, so we can actually decompose or split them into two smaller matrices with the most important parameters (think of it like factoring in algebra). This technique is called rank decomposition, which allows us to greatly reduce and simplify matrices while keeping the most important bits. In the context of fine-tuning, rank determines how many parameters get changed from the original model – the higher the rank, the stronger the fine-tune, giving you more granularity over the output.

According to the original LoRA paper, researchers have found that when a model is low-rank, the matrix representing the change in weights is also low-rank. Therefore, we can apply rank decomposition to our matrix representing the change in weights ΔW to create two smaller matrices A, B, where ΔW = BA. Now, the change in the model can be represented by two smaller low-rank matrices_._ This is why this method of fine-tuning is called Low-Rank Adaptation.

When we run inference, we only need the smaller matrices A, B to change the behavior of the model. The model weights in A, B constitute our LoRA adapter (along with a config file). At runtime, we add the model weights together, combining the original model (W0) and the LoRA adapter (A, B). Adding and subtracting are simple mathematical operations, meaning that we can quickly swap out different LoRA adapters by adding and subtracting A, B from W0.. By temporarily adjusting the weights of the original model, we modify the model’s behavior and output and as a result, we get fine-tuned inference with minimal added latency.

According to the original LoRA paper, “LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times”. Because of this, LoRA is one of the most popular methods of fine-tuning since it's a lot less computationally expensive than a fully fine-tuned model, doesn't add any material inference time, and is much smaller and portable.

How can you use LoRAs with Workers AI?

Workers AI is very well-suited to run LoRAs because of the way we run serverless inference. The models in our catalog are always pre-loaded on our GPUs, meaning that we keep them warm so that your requests never encounter a cold start. This means that the base model is always available, and we can dynamically load and swap out LoRA adapters as needed. We can actually plug in multiple LoRA adapters to one base model, so we can serve multiple different fine-tuned inference requests at once.

When you fine-tune with LoRA, your output will be two files: your custom model weights (in safetensors format) and an adapter config file (in json format). To create these weights yourself, you can train a LoRA on your own data using the Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library combined with the Hugging Face AutoTrain LLM library. You can also run your training tasks on services such as Auto Train and Google Colab. Alternatively, there are many open-source LoRA adapters available on Hugging Face today that cover a variety of use cases.

Eventually, we want to support the LoRA training workloads on our platform, but we’ll need you to bring your trained LoRA adapters to Workers AI today, which is why we’re calling this feature Bring Your Own (BYO) LoRAs.

For the initial open beta release, we are allowing people to use LoRAs with our Mistral, Llama, and Gemma models. We have set aside versions of these models which accept LoRAs, which you can access by appending -lora to the end of the model name. Your adapter must have been fine-tuned from one of our supported base models listed below:

@cf/meta-llama/llama-2-7b-chat-hf-lora
@cf/mistral/mistral-7b-instruct-v0.2-lora
@cf/google/gemma-2b-it-lora
@cf/google/gemma-7b-it-lora

As we are launching this feature in open beta, we have some limitations today to take note of: quantized LoRA models are not yet supported, LoRA adapters must be smaller than 100MB and have up to a max rank of 8, and you can try up to 30 LoRAs per account during our initial open beta. To get started with LoRAs on Workers AI, read the Developer Docs.

As always, we expect people to use Workers AI and our new BYO LoRA feature with our Terms of Service in mind, including any model-specific restrictions on use contained in the models’ license terms.

How did we build multi-tenant LoRA serving?

Serving multiple LoRA models simultaneously poses a challenge in terms of GPU resource utilization. While it is possible to batch inference requests to a base model, it is much more challenging to batch requests with the added complexity of serving unique LoRA adapters. To tackle this problem, we leverage the Punica CUDA kernel design in combination with global cache optimizations in order to handle the memory intensive workload of multi-tenant LoRA serving while offering low inference latency.

The Punica CUDA kernel was introduced in the paper Punica: Multi-Tenant LoRA Serving as a method to serve multiple, significantly different LoRA models applied to the same base model. In comparison to previous inference techniques, the method offers substantial throughput and latency improvements. This optimization is achieved in part through enabling request batching even across requests serving different LoRA adapters.

The core of the Punica kernel system is a new CUDA kernel called Segmented Gather Matrix-Vector Multiplication (SGMV). SGMV allows a GPU to store only a single copy of the pre-trained model while serving different LoRA models. The Punica kernel design system consolidates the batching of requests for unique LoRA models to improve performance by parallelizing the feature-weight multiplication of different requests in a batch. Requests for the same LoRA model are then grouped to increase operational intensity. Initially, the GPU loads the base model while reserving most of its GPU memory for KV Cache. The LoRA components (A and B matrices) are then loaded on demand from remote storage (Cloudflare’s cache or R2) when required by an incoming request. This on demand loading introduces only milliseconds of latency, which means that multiple LoRA adapters can be seamlessly fetched and served with minimal impact on inference performance. Frequently requested LoRA adapters are cached for the fastest possible inference.

Once a requested LoRA has been cached locally, the speed it can be made available for inference is constrained only by PCIe bandwidth. Regardless, given that each request may require its own LoRA, it becomes critical that LoRA downloads and memory copy operations are performed asynchronously. The Punica scheduler tackles this exact challenge, batching only requests which currently have required LoRA weights available in GPU memory, and queueing requests that do not until the required weights are available and the request can efficiently join a batch.

By effectively managing KV cache and batching these requests, it is possible to handle significant multi-tenant LoRA-serving workloads. A further and important optimization is the use of continuous batching. Common batching methods require all requests to the same adapter to reach their stopping condition before being released. Continuous batching allows a request in a batch to be released early so that it does not need to wait for the longest running request.

Given that LLMs deployed to Cloudflare’s network are available globally, it is important that LoRA adapter models are as well. Very soon, we will implement remote model files that are cached at Cloudflare’s edge to further reduce inference latency.

A roadmap for fine-tuning on Workers AI

Launching support for LoRA adapters is an important step towards unlocking fine-tunes on our platform. In addition to the LLM fine-tunes available today, we look forward to supporting more models and a variety of task types, including image generation.

Our vision for Workers AI is to be the best place for developers to run their AI workloads — and this includes the process of fine-tuning itself. Eventually, we want to be able to run the fine-tuning training job as well as fully fine-tuned models directly on Workers AI. This unlocks many use cases for AI to be more relevant in organizations by empowering models to have more granularity and detail for specific tasks.

With AI Gateway, we will be able to help developers log their prompts and responses, which they can then use to fine-tune models with production data. Our vision is to have a one-click fine-tuning service, where log data from AI Gateway can be used to retrain a model (on Cloudflare) and then the fine-tuned model can be redeployed on Workers AI for inference. This will allow developers to personalize their AI models to fit their applications, allowing for granularity as low as a per-user level. The fine-tuned model can then be smaller and more optimized, helping users save time and money on AI inference – and the magic is that all of this can all happen within our very own Developer Platform.

We’re excited for you to try the open beta for BYO LoRAs! Read our Developer Docs for more details, and tell us what you think on Discord.

Mitigating a token-length side-channel attack in our AI products

Celso Martinho — Thu, 14 Mar 2024 12:30:30 GMT

Since the discovery of CRIME, BREACH, TIME, LUCKY-13 etc., length-based side-channel attacks have been considered practical. Even though packets were encrypted, attackers were able to infer information about the underlying plaintext by analyzing metadata like the packet length or timing information.

Cloudflare was recently contacted by a group of researchers at Ben Gurion University who wrote a paper titled “What Was Your Prompt? A Remote Keylogging Attack on AI Assistants” that describes “a novel side-channel that can be used to read encrypted responses from AI Assistants over the web”.

The Workers AI and AI Gateway team collaborated closely with these security researchers through our Public Bug Bounty program, discovering and fully patching a vulnerability that affects LLM providers. You can read the detailed research paper here.

Since being notified about this vulnerability, we've implemented a mitigation to help secure all Workers AI and AI Gateway customers. As far as we could assess, there was no outstanding risk to Workers AI and AI Gateway customers.

How does the side-channel attack work?

In the paper, the authors describe a method in which they intercept the stream of a chat session with an LLM provider, use the network packet headers to infer the length of each token, extract and segment their sequence, and then use their own dedicated LLMs to infer the response.

The two main requirements for a successful attack are an AI chat client running in streaming mode and a malicious actor capable of capturing network traffic between the client and the AI chat service. In streaming mode, the LLM tokens are emitted sequentially, introducing a token-length side-channel. Malicious actors could eavesdrop on packets via public networks or within an ISP.

An example request vulnerable to the side-channel attack looks like this:

curl -X POST \
https://api.cloudflare.com/client/v4/accounts//ai/run/@cf/meta/llama-2-7b-chat-int8 \
  -H "Authorization: Bearer " \
  -d '{"stream":true,"prompt":"tell me something about portugal"}'

Let’s use Wireshark to inspect the network packets on the LLM chat session while streaming:

The first packet has a length of 95 and corresponds to the token "Port" which has a length of four. The second packet has a length of 93 and corresponds to the token "ug" which has a length of two, and so on. By removing the likely token envelope from the network packet length, it is easy to infer how many tokens were transmitted and their sequence and individual length just by sniffing encrypted network data.

Since the attacker needs the sequence of individual token length, this vulnerability only affects text generation models using streaming. This means that AI inference providers that use streaming — the most common way of interacting with LLMs — like Workers AI, are potentially vulnerable.

This method requires that the attacker is on the same network or in a position to observe the communication traffic and its accuracy depends on knowing the target LLM’s writing style. In ideal conditions, the researchers claim that their system “can reconstruct 29% of an AI assistant’s responses and successfully infer the topic from 55% of them”. It’s also important to note that unlike other side-channel attacks, in this case the attacker has no way of evaluating its prediction against the ground truth. That means that we are as likely to get a sentence with near perfect accuracy as we are to get one where only things that match are conjunctions.

Mitigating LLM side-channel attacks

Since this type of attack relies on the length of tokens being inferred from the packet, it can be just as easily mitigated by obscuring token size. The researchers suggested a few strategies to mitigate these side-channel attacks, one of which is the simplest: padding the token responses with random length noise to obscure the length of the token so that responses can not be inferred from the packets. While we immediately added the mitigation to our own inference product — Workers AI, we wanted to help customers secure their LLMs regardless of where they are running them by adding it to our AI Gateway.

As of today, all users of Workers AI and AI Gateway are now automatically protected from this side-channel attack.

What we did

Once we got word of this research work and how exploiting the technique could potentially impact our AI products, we did what we always do in situations like this: we assembled a team of systems engineers, security engineers, and product managers and started discussing risk mitigation strategies and next steps. We also had a call with the researchers, who kindly attended, presented their conclusions, and answered questions from our teams.

The research team provided a testing notebook that we could use to validate the attack's results. While we were able to reproduce the results for the notebook's examples, we found that the accuracy varied immensely with our tests using different prompt responses and different LLMs. Nonetheless, the paper has merit, and the risks are not negligible.

We decided to incorporate the first mitigation suggestion in the paper: including random padding to each message to hide the actual length of tokens in the stream, thereby complicating attempts to infer information based solely on network packet size.

Workers AI, our inference product, is now protected

With our inference-as-a-service product, anyone can use the Workers AI platform and make API calls to our supported AI models. This means that we oversee the inference requests being made to and from the models. As such, we have a responsibility to ensure that the service is secure and protected from potential vulnerabilities. We immediately rolled out a fix once we were notified of the research, and all Workers AI customers are now automatically protected from this side-channel attack. We have not seen any malicious attacks exploiting this vulnerability, other than the ethical testing from the researchers.

Our solution for Workers AI is a variation of the mitigation strategy suggested in the research document. Since we stream JSON objects rather than the raw tokens, instead of padding the tokens with whitespace characters, we added a new property, "p" (for padding) that has a string value of variable random length.

Example streaming response using the SSE syntax:

data: {"response":"portugal","p":"abcdefghijklmnopqrstuvwxyz0123456789a"}
data: {"response":" is","p":"abcdefghij"}
data: {"response":" a","p":"abcdefghijklmnopqrstuvwxyz012"}
data: {"response":" southern","p":"ab"}
data: {"response":" European","p":"abcdefgh"}
data: {"response":" country","p":"abcdefghijklmno"}
data: {"response":" located","p":"abcdefghijklmnopqrstuvwxyz012345678"}

This has the advantage that no modifications are required in the SDK or the client code, the changes are invisible to the end-users, and no action is required from our customers. By adding random variable length to the JSON objects, we introduce the same network-level variability, and the attacker essentially loses the required input signal. Customers can continue using Workers AI as usual while benefiting from this protection.

One step further: AI Gateway protects users of any inference provider

We added protection to our AI inference product, but we also have a product that proxies requests to any provider — AI Gateway. AI Gateway acts as a proxy between a user and supported inference providers, helping developers gain control, performance, and observability over their AI applications. In line with our mission to help build a better Internet, we wanted to quickly roll out a fix that can help all our customers using text generation AIs, regardless of which provider they use or if they have mitigations to prevent this attack. To do this, we implemented a similar solution that pads all streaming responses proxied through AI Gateway with random noise of variable length.

Our AI Gateway customers are now automatically protected against this side-channel attack, even if the upstream inference providers have not yet mitigated the vulnerability. If you are unsure if your inference provider has patched this vulnerability yet, use AI Gateway to proxy your requests and ensure that you are protected.

Conclusion

At Cloudflare, our mission is to help build a better Internet – that means that we care about all citizens of the Internet, regardless of what their tech stack looks like. We are proud to be able to improve the security of our AI products in a way that is transparent and requires no action from our customers.

We are grateful to the researchers who discovered this vulnerability and have been very collaborative in helping us understand the problem space. If you are a security researcher who is interested in helping us make our products more secure, check out our Bug Bounty program at hackerone.com/cloudflare.

Unlocking new use cases with 17 new models in Workers AI, including new LLMs, image generation models, and more

Michelle Chen — Wed, 28 Feb 2024 20:00:00 GMT

On February 6th, 2024 we announced eight new models that we added to our catalog for text generation, classification, and code generation use cases. Today, we’re back with seventeen (17!) more models, focused on enabling new types of tasks and use cases with Workers AI. Our catalog is now nearing almost 40 models, so we also decided to introduce a revamp of our developer documentation that enables users to easily search and discover new models.

The new models are listed below, and the full Workers AI catalog can be found on our new developer documentation.

Text generation

@cf/deepseek-ai/deepseek-math-7b-instruct
******@**cf/openchat/openchat-3.5-0106
@cf/microsoft/phi-2
@cf/tinyllama/tinyllama-1.1b-chat-v1.0
@cf/thebloke/discolm-german-7b-v1-awq
@cf/qwen/qwen1.5-0.5b-chat
@cf/qwen/qwen1.5-1.8b-chat
@cf/qwen/qwen1.5-7b-chat-awq
@cf/qwen/qwen1.5-14b-chat-awq
@cf/tiiuae/falcon-7b-instruct
@cf/defog/sqlcoder-7b-2

Summarization

@cf/facebook/bart-large-cnn

Text-to-image

@cf/lykon/dreamshaper-8-lcm
@cf/runwayml/stable-diffusion-v1-5-inpainting
@cf/runwayml/stable-diffusion-v1-5-img2img
@cf/bytedance/stable-diffusion-xl-lightning

Image-to-text

@cf/unum/uform-gen2-qwen-500m

New language models, fine-tunes, and quantizations

Today’s catalog update includes a number of new language models so that developers can pick and choose the best LLMs for their use cases. Although most LLMs can be generalized to work in any instance, there are many benefits to choosing models that are tailored for a specific use case. We are excited to bring you some new large language models (LLMs), small language models (SLMs), and multi-language support, as well as some fine-tuned and quantized models.

Our latest LLM additions include falcon-7b-instruct, which is particularly exciting because of its innovative use of multi-query attention to generate high-precision responses. There’s also better language support with discolm_german_7b and the qwen1.5 models, which are trained on multilingual data and boast impressive LLM outputs not only in English, but also in German (discolm) and Chinese (qwen1.5). The Qwen models range from 0.5B to 14B parameters and have shown particularly impressive accuracy in our testing. We’re also releasing a few new SLMs, which are growing in popularity because of their ability to do inference faster and cheaper without sacrificing accuracy. For SLMs, we’re introducing small but performant models like a 1.1B parameter version of Llama (tinyllama-1.1b-chat-v1.0) and a 1.3B parameter model from Microsoft (phi-2).

As the AI industry continues to accelerate, talented people have found ways to improve and optimize the performance and accuracy of models. We’ve added a fine-tuned model (openchat-3.5) which implements Conditioned Reinforcement Learning Fine-Tuning (C-RLFT), a technique that enables open-source language model development through the use of easily collectable mixed quality data.

We’re really excited to be bringing all these new text generation models onto our platform today. The open-source community has been incredible at developing new AI breakthroughs, and we’re grateful for everyone’s contributions to training, fine-tuning, and quantizing these models. We’re thrilled to be able to host these models and make them accessible to all so that developers can quickly and easily build new applications with AI. You can check out the new models and their API schemas on our developer docs.

New image generation models

We are adding new Stable Diffusion pipelines and optimizations to enable powerful new image editing and generation use cases. We’ve added support for Stable Diffusion XL Lightning which generates high quality images in just two inference steps. Text-to-image is a really popular task for folks who want to take a text prompt and have the model generate an image based on the input, but Stable Diffusion is actually capable of much more. With this new Workers AI release, we’ve unlocked new pipelines so that you can experiment with different modalities of input and tasks with Stable Diffusion.

You can now use Stable Diffusion on Workers AI for image-to-image and inpainting use cases. Image-to-image allows you to transform an input image into a different image – for example, you can ask Stable Diffusion to generate a cartoon version of a portrait. Inpainting allows users to upload an image and transform the same image into something new – examples of inpainting include “expanding” the background of photos or colorizing black-and-white photos.

To use inpainting, you’ll need to input an image, a mask, and a prompt. The image is the original picture that you want modified, the mask is a monochrome screen that highlights the area that you want to be painted over, and the prompt tells the model what to generate in that space. Below is an example of the inputs and the request template to perform inpainting.

import { Ai } from '@cloudflare/ai';

export default {
    async fetch(request, env) {
        const formData = await request.formData();
        const prompt = formData.get("prompt")
        const imageFile = formData.get("image")
        const maskFile = formData.get("mask")

        const imageArrayBuffer = await imageFile.arrayBuffer();
        const maskArrayBuffer = await maskFile.arrayBuffer();

        const ai = new Ai(env.AI);
        const inputs = {
            prompt,
            image: [...new Uint8Array(imageArrayBuffer)],
            mask: [...new Uint8Array(maskArrayBuffer)],  
            strength: 0.8, // Adjust the strength of the transformation
            num_steps: 10, // Number of inference steps for the diffusion process
        };

        const response = await ai.run("@cf/runwayml/stable-diffusion-v1-5-inpainting", inputs);

        return new Response(response, {
            headers: {
                "content-type": "image/png",
            },
        });
    }
}

New use cases

We’ve also added new models to Workers AI that allow for various specialized tasks and use cases, such as LLMs specialized in solving math problems (deepseek-math-7b-instruct), generating SQL code (sqlcoder-7b-2), summarizing text (bart-large-cnn), and image captioning (uform-gen2-qwen-500m).

We wanted to release these to the public, so you can start building with them, but we’ll be releasing more demos and tutorial content over the next few weeks. Stay tuned to our X account and Developer Documentation for more information on how to use these new models.

Optimizing our model catalog

AI model innovation is advancing rapidly, and so are the tools and techniques for fast and efficient inference. We’re excited to be incorporating new tools that help us optimize our models so that we can offer the best inference platform for everyone. Typically, when optimizing AI inference it is useful to serialize the model into a format such as ONNX, one of the most generally applicable options for this use case with broad hardware and model architecture support. An ONNX model can be further optimized by being converted to a TensorRT engine. This format, designed specifically for Nvidia GPUs, can result in faster inference latency and higher total throughput from LLMs. Choosing the right format usually comes down to what is best supported by specific model architectures and the hardware available for inference. We decided to leverage both TensorRT and ONNX formats for our new Stable Diffusion pipelines, which represent a series of models applied for a specific task.

Explore more on our new developer docs

You can explore all these new models in our new developer docs, where you can learn more about individual models, their prompt templates, as well as properties like context token limits. We’ve redesigned the model page to be simpler for developers to explore new models and learn how to use them. You’ll now see all the models on one page for searchability, with the task type on the right-hand side. Then, you can click into individual model pages to see code examples on how to use those models.

We hope you try out these new models and build something new on Workers AI! We have more updates coming soon, including more demos, tutorials, and Workers AI pricing. Let us know what you’re working on and other models you’d like to see on our Discord.