The Cloudflare Blog

Cloudflare’s bigger, better, faster AI platform

Michelle Chen — Thu, 26 Sep 2024 13:00:00 GMT

Birthday Week 2024 marks the first anniversary of Cloudflare’s AI developer products: Workers AI, AI Gateway, and Vectorize. To celebrate, we’re excited to announce powerful new features to elevate the way you build with AI on Cloudflare.

Workers AI is getting a big upgrade, with more powerful GPUs that enable faster inference and bigger models. We’re also expanding our model catalog to be able to dynamically support models that you want to run on us. Finally, we’re saying goodbye to neurons and revamping our pricing model to be simpler and cheaper. On AI Gateway, we’re moving forward on our vision of becoming an ML Ops platform by introducing more powerful logs and human evaluations. Lastly, Vectorize is going GA, with expanded index sizes and faster queries.

Whether you want the fastest inference at the edge, optimized AI workflows, or vector database-powered RAG, we’re excited to help you harness the full potential of AI and get started on building with Cloudflare.

The fast, global AI platform

The first thing that you notice about an application is how fast, or in many cases, how slow it is. This is especially true of AI applications, where the standard today is to wait for a response to be generated.

At Cloudflare, we’re obsessed with improving the performance of applications, and have been doubling down on our commitment to make AI fast. To live up to that commitment, we’re excited to announce that we’ve added even more powerful GPUs across our network to accelerate LLM performance.

In addition to more powerful GPUs, we’ve continued to expand our GPU footprint to get as close to the user as possible, reducing latency even further. Today, we have GPUs in over 180 cities, having doubled our capacity in a year.

Bigger, better, faster

With the introduction of our new, more powerful GPUs, you can now run inference on significantly larger models, including Meta Llama 3.1 70B. Previously, our model catalog was limited to 8B parameter LLMs, but we can now support larger models, faster response times, and larger context windows. This means your applications can handle more complex tasks with greater efficiency.

Model

@cf/meta/llama-3.2-11b-vision-instruct

@cf/meta/llama-3.2-1b-instruct

@cf/meta/llama-3.2-3b-instruct

@cf/meta/llama-3.1-8b-instruct-fast

@cf/meta/Llama-3.1-70b-instruct

@cf/black-forest-labs/flux-1-schnell

The models above are available on our new GPUs at faster speeds. In general, you can expect throughput of 80+ tokens per second (TPS) for 8B models and a time to first token (TTFT) of 300 ms (depending on where you are in the world).
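To get a feel for what those two numbers mean for a whole response, here is a minimal sketch. It assumes a simple model in which total time is roughly TTFT plus output tokens divided by TPS; the function name and the defaults are illustrative, and real latency will vary with location and load.

```typescript
// Rough latency model: total ≈ TTFT + outputTokens / TPS.
// The defaults are the ballpark figures quoted above, not guarantees.
function estimateResponseMs(
  outputTokens: number,
  ttftMs: number = 300,
  tokensPerSecond: number = 80,
): number {
  if (outputTokens < 0 || tokensPerSecond <= 0) {
    throw new RangeError("outputTokens must be >= 0 and tokensPerSecond > 0");
  }
  const generationMs = (outputTokens / tokensPerSecond) * 1000;
  return ttftMs + generationMs;
}

// A 400-token answer: 300 ms to first token + 5,000 ms of generation.
console.log(estimateResponseMs(400)); // 5300
```

With streaming, the user starts reading after the TTFT, so the perceived wait is closer to 300 ms than to the full figure.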

Our model instances now support larger context windows, like the full 128K context window for Llama 3.1 and 3.2. To give you full visibility into performance, we’ll also be publishing metrics like TTFT, TPS, Context Window, and pricing on models in our catalog, so you know exactly what to expect.

We’re committed to bringing the best of open-source models to our platform, and that includes Meta’s release of the new Llama 3.2 collection of models. As a Meta launch partner, we were excited to have Day 0 support for the 11B vision model, as well as the 1B and 3B text-only models on Workers AI.

For more details on how we made Workers AI fast, take a look at our technical blog post, where we share a novel method for KV cache compression (it’s open-source!), as well as details on speculative decoding, our new hardware design, and more.

Greater model flexibility

With our commitment to helping you run more powerful models faster, we are also expanding the breadth of models you can run on Workers AI with our Run Any* Model feature. Until now, we have manually curated and added only the most popular open source models to Workers AI. Now, we are opening up our catalog to the public, giving you the flexibility to choose from a broader selection of models. We will support models that are compatible with our GPUs and inference stack at the start (hence the asterisk on Run Any* Model). We’re launching this feature in closed beta and if you’d like to try it out, please fill out the form, so we can grant you access to this new feature.

The Workers AI model catalog will now be split into two parts: a static catalog and a dynamic catalog. Models in the static catalog will remain curated by Cloudflare and will include the most popular open source models with guarantees on availability and speed (the models listed above). These models will always be kept warm in our network, ensuring you don’t experience cold starts. The usage and pricing model remains serverless, where you will only be charged for the requests to the model and not the cold start times.

Models that are launched via Run Any* Model will make up the dynamic catalog. If the model is public, users can share an instance of that model. In the future, we will allow users to launch private instances of models as well.

This is just the first step towards running your own custom or private models on Workers AI. While we have already been supporting private models for select customers, we are working on making this capacity available to everyone in the near future.

New Workers AI pricing

We launched Workers AI during Birthday Week 2023 with the concept of “neurons” for pricing. Neurons were intended to simplify the unit of measure across various models on our platform, including text, image, audio, and more. However, over the past year, we have listened to your feedback and heard that neurons were difficult to grasp and challenging to compare with other providers. Additionally, the industry has matured, and new pricing standards have materialized. As such, we’re excited to announce that we will be moving towards unit-based pricing and saying goodbye to neurons.

Moving forward, Workers AI will be priced based on model task, size, and units. LLMs will be priced based on the model size (parameters) and input/output tokens. Image generation models will be priced based on the output image resolution and the number of steps. Embeddings models will be priced based on input tokens. Speech-to-text models will be priced on seconds of audio input.

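Model Task	Units	Model Size	Pricing
LLMs (incl. Vision models)	Tokens in/out (blended)	<= 3B parameters	$0.10 per Million Tokens
LLMs (incl. Vision models)	Tokens in/out (blended)	3.1B - 8B parameters	$0.15 per Million Tokens
LLMs (incl. Vision models)	Tokens in/out (blended)	8.1B - 20B parameters	$0.20 per Million Tokens
LLMs (incl. Vision models)	Tokens in/out (blended)	20.1B - 40B parameters	$0.50 per Million Tokens
LLMs (incl. Vision models)	Tokens in/out (blended)	40.1B+ parameters	$0.75 per Million Tokens
Embeddings	Tokens in	<= 150M parameters	$0.008 per Million Tokens
Embeddings	Tokens in	151M+ parameters	$0.015 per Million Tokens
Speech-to-text	Audio seconds in	N/A	$0.0039 per minute of audio input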
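As a quick way to reason about the LLM rows, here is a small sketch that looks up the tier for a model size and prices a blended token count. The helper name and data structure are our own illustration, and the sketch ignores the free tier.

```typescript
// Price tiers copied from the LLM rows of the table above
// (USD per million blended input/output tokens).
const LLM_TIERS: { maxParamsB: number; usdPerMillion: number }[] = [
  { maxParamsB: 3, usdPerMillion: 0.1 },
  { maxParamsB: 8, usdPerMillion: 0.15 },
  { maxParamsB: 20, usdPerMillion: 0.2 },
  { maxParamsB: 40, usdPerMillion: 0.5 },
  { maxParamsB: Infinity, usdPerMillion: 0.75 },
];

// Hypothetical helper: cost of a workload on a model with `paramsB` billion parameters.
function llmCostUsd(paramsB: number, inputTokens: number, outputTokens: number): number {
  const tier = LLM_TIERS.find((t) => paramsB <= t.maxParamsB);
  if (!tier) throw new RangeError("paramsB must be a number");
  return ((inputTokens + outputTokens) / 1e6) * tier.usdPerMillion;
}

// 1.5M input + 0.5M output tokens on a 70B model: 2M tokens at $0.75 per million.
console.log(llmCostUsd(70, 1_500_000, 500_000)); // 1.5
```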

Image Size	Model Type	Steps	Price
<=256x256	Standard	25	$0.00125 per 25 steps
<=256x256	Fast	5	$0.00025 per 5 steps
<=512x512	Standard	25	$0.0025 per 25 steps
<=512x512	Fast	5	$0.0005 per 5 steps
<=1024x1024	Standard	25	$0.005 per 25 steps
<=1024x1024	Fast	5	$0.001 per 5 steps
<=2048x2048	Standard	25	$0.01 per 25 steps
<=2048x2048	Fast	5	$0.002 per 5 steps
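Dividing each price by its step count shows that Standard and Fast models work out to the same per-step rate at a given resolution. The sketch below uses that observation to estimate the cost of one image. It assumes cost scales linearly with the number of steps, which the table implies but does not state outright, and the helper name is illustrative.

```typescript
// Per-step prices derived from the table above (e.g. $0.005 / 25 steps at 1024x1024).
const IMAGE_TIERS: { maxSide: number; usdPerStep: number }[] = [
  { maxSide: 256, usdPerStep: 0.00125 / 25 },
  { maxSide: 512, usdPerStep: 0.0025 / 25 },
  { maxSide: 1024, usdPerStep: 0.005 / 25 },
  { maxSide: 2048, usdPerStep: 0.01 / 25 },
];

// Hypothetical helper: assumes linear per-step billing (an assumption, see above).
function imageCostUsd(width: number, height: number, steps: number): number {
  const longestSide = Math.max(width, height);
  const tier = IMAGE_TIERS.find((t) => longestSide <= t.maxSide);
  if (!tier) throw new RangeError("resolutions above 2048x2048 are not in the price table");
  return steps * tier.usdPerStep;
}

// One 1024x1024 image at 25 steps, and one 512x512 image at 5 steps.
console.log(imageCostUsd(1024, 1024, 25).toFixed(4)); // "0.0050"
console.log(imageCostUsd(512, 512, 5).toFixed(4)); // "0.0005"
```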

We paused graduating models and announcing pricing for beta models over the past few months as we prepared for this new pricing change. We’ll be graduating all models to this new pricing, and billing will take effect on October 1, 2024.

Our free tier has been redone to fit these new metrics, and will include a monthly allotment of usage across all the task types.

Model	Free tier size
Text Generation - LLM	10,000 tokens a day across any model size
Embeddings	10,000 tokens a day across any model size
Images	Sum of 250 steps, up to 1024x1024 resolution
Whisper	10 minutes of audio a day

Optimizing AI workflows with AI Gateway

AI Gateway is designed to help developers and organizations building AI applications better monitor, control, and optimize their AI usage, and thanks to our users, AI Gateway has reached an incredible milestone — over 2 billion requests proxied by September 2024, less than a year after its inception. But we are not stopping there.

Persistent logs (open beta)

Persistent logs allow developers to store and analyze user prompts and model responses for extended periods, up to 10 million logs per gateway. Each request made through AI Gateway will create a log. With a log, you can see details of a request, including timestamp, request status, model, and provider.

We have revamped our logging interface to offer more detailed insights, including cost and duration. Users can now annotate logs with human feedback using thumbs up and thumbs down. Lastly, you can now filter, search, and tag logs with custom metadata to further streamline analysis directly within AI Gateway.

Persistent logs are available to use on all plans, with a free allocation for both free and paid plans. On the Workers Free plan, users can store up to 100,000 logs total across all gateways at no charge. For those needing more storage, upgrading to the Workers Paid plan will give you a higher free allocation — 200,000 logs stored total. Any additional logs beyond those limits will be available at $8 per 100,000 logs stored per month, giving you the flexibility to store logs for your preferred duration and do more with valuable data. Billing for this feature will be implemented when the feature reaches General Availability, and we’ll provide plenty of advance notice.

	Workers Free	Workers Paid	Enterprise
Included Volume	100,000 logs stored (total)	200,000 logs stored (total)
Additional Logs	N/A	$8 per 100,000 logs stored per month
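To make the Workers Paid numbers concrete, here is a small sketch of the monthly storage charge once billing begins. It assumes additional logs are prorated rather than billed in whole blocks of 100,000, which is our assumption and not something the table specifies; the helper name is illustrative.

```typescript
const PAID_INCLUDED_LOGS = 200000; // free allocation on Workers Paid
const USD_PER_100K_EXTRA_LOGS = 8; // per 100,000 additional logs stored per month

// Hypothetical helper: assumes prorated billing for logs beyond the included volume.
function paidPlanLogStorageUsd(logsStored: number): number {
  if (logsStored < 0) throw new RangeError("logsStored must be >= 0");
  const extraLogs = Math.max(0, logsStored - PAID_INCLUDED_LOGS);
  return (extraLogs / 100000) * USD_PER_100K_EXTRA_LOGS;
}

// 500,000 stored logs: 300,000 beyond the allocation, so 3 x $8.
console.log(paidPlanLogStorageUsd(500000)); // 24
```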

Export logs with Logpush

For users looking to export their logs, AI Gateway now supports log export via Logpush. With Logpush, you can automatically push logs out of AI Gateway into your preferred storage provider, including Cloudflare R2, Amazon S3, Google Cloud Storage, and more. This can be especially useful for compliance or advanced analysis outside the platform. Logpush follows its existing pricing model and will be available to all users on a paid plan.

AI evaluations

We are also taking our first step towards comprehensive AI evaluations, starting with evaluation using human-in-the-loop feedback (now in open beta). Users can create datasets from logs to score and evaluate model performance, speed, and cost, initially focused on LLMs. Evaluations will allow developers to gain a better understanding of how their application is performing, ensuring better accuracy, reliability, and customer satisfaction. We’ve added support for cost analysis across many new models and providers to enable developers to make informed decisions, including the ability to add custom costs. Future enhancements will include automated scoring using LLMs, comparing performance of multiple models, and prompt evaluations, helping developers make decisions on what is best for their use case and ensuring their applications are both efficient and cost-effective.

Vectorize GA

We've completely redesigned Vectorize since our initial announcement in 2023 to better serve customer needs. Vectorize (v2) now supports indexes of up to 5 million vectors (up from 200,000), delivers faster queries (median latency is down 95% from 500 ms to 30 ms), and returns up to 100 results per query (increased from 20). These improvements significantly enhance Vectorize's capacity, speed, and depth of results.

Note: if you got started on Vectorize before GA, to ease the move from v1 to v2, a migration solution will be available in early Q4 — stay tuned!

New Vectorize pricing

Not only have we improved performance and scalability, but we've also made Vectorize one of the most cost-effective options on the market. We've reduced query prices by 75% and storage costs by 98%.

	New Vectorize pricing	Old Vectorize pricing	Price reduction
Writes	Free	Free	n/a
Query	$0.01 per 1 million vector dimensions	$0.04 per 1 million vector dimensions	75%
Storage	$0.05 per 100 million vector dimensions	$4.00 per 100 million vector dimensions	98%

You can learn more about our pricing in the Vectorize docs.
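As a rough illustration of the new rates, the sketch below turns queried and stored vector dimensions into a monthly figure. It takes the dimension counts as given (how queried dimensions are counted is covered in the docs, not here) and ignores any free allowance; the helper name is our own.

```typescript
const USD_PER_MILLION_QUERIED_DIMS = 0.01;
const USD_PER_100M_STORED_DIMS = 0.05;

// Hypothetical helper: monthly cost from raw dimension counts, no free tier applied.
function vectorizeMonthlyUsd(queriedDimensions: number, storedDimensions: number): number {
  const queryUsd = (queriedDimensions / 1e6) * USD_PER_MILLION_QUERIED_DIMS;
  const storageUsd = (storedDimensions / 1e8) * USD_PER_100M_STORED_DIMS;
  return queryUsd + storageUsd;
}

// 100M queried dimensions ($1.00) plus 1B stored dimensions ($0.50).
console.log(vectorizeMonthlyUsd(100e6, 1e9).toFixed(2)); // "1.50"
```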

Vectorize free tier

There’s more good news: we’re introducing a free tier to Vectorize to make it easy to experiment with our full AI stack.

The free tier includes:

30 million queried vector dimensions / month
5 million stored vector dimensions / month

How fast is Vectorize?

To measure performance, we conducted benchmarking tests by executing a large number of vector similarity queries as quickly as possible. We measured both request latency and result precision. In this context, precision refers to the proportion of query results that match the known true-closest results for all benchmarked queries. This approach allows us to assess both the speed and accuracy of our vector similarity search capabilities. We benchmarked on the following datasets:

dbpedia-openai-1M-1536-angular: 1 million vectors, 1536 dimensions, queried with cosine similarity at a top K of 10
Laion-768-5m-ip: 5 million vectors, 768 dimensions, queried with cosine similarity at a top K of 10
- We ran this again skipping the result-refinement pass to return approximate results faster

Benchmark dataset	P50 (ms)	P75 (ms)	P90 (ms)	P95 (ms)	Throughput (RPS)	Precision
dbpedia-openai-1M-1536-angular	31	56	159	380	343	95.4%
Laion-768-5m-ip	81.5	91.7	105	123	623	95.5%
Laion-768-5m-ip w/o refinement	14.7	19.3	24.3	27.3	698	78.9%

These benchmarks were conducted using a standard Vectorize v2 index, queried with a concurrency of 300 via a Cloudflare Worker binding. The reported latencies reflect those observed by the Worker binding querying the Vectorize index on warm caches, simulating the performance of an existing application with sustained usage.
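For readers who want to reproduce the precision column on their own data, here is a minimal sketch of the metric as defined above: the share of returned results that also appear in the known true-closest set, pooled over all queries. The function name and the string IDs are illustrative, and this is not the benchmark harness itself.

```typescript
// results[i] holds the IDs returned for query i;
// groundTruth[i] holds the exact top-K IDs for the same query.
function precisionAtK(results: string[][], groundTruth: string[][]): number {
  if (results.length !== groundTruth.length) {
    throw new RangeError("results and groundTruth must cover the same queries");
  }
  let matched = 0;
  let returned = 0;
  results.forEach((ids, i) => {
    const truth = new Set(groundTruth[i]);
    for (const id of ids) {
      if (truth.has(id)) matched += 1;
    }
    returned += ids.length;
  });
  if (returned === 0) throw new RangeError("no results to score");
  return matched / returned;
}

// Two queries at top K = 3: five of the six returned IDs are true neighbours.
const benchmarkPrecision = precisionAtK(
  [["a", "b", "c"], ["d", "e", "f"]],
  [["a", "b", "x"], ["d", "e", "f"]],
);
console.log(benchmarkPrecision.toFixed(3)); // "0.833"
```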

Beyond Vectorize's fast query speeds, we believe the combination of Vectorize and Workers AI offers an unbeatable solution for delivering optimal AI application experiences. By running Vectorize close to the source of inference and user interaction, rather than combining AI and vector database solutions across providers, we can significantly minimize end-to-end latency.

With these improvements, we're excited to announce the general availability of the new Vectorize, which is more powerful, faster, and more cost-effective than ever before.

Tying it all together: the AI platform for all your inference needs

Over the past year, we’ve been committed to building powerful AI products that enable users to build on us. While we are making advancements on each of these individual products, our larger vision is to provide a seamless, integrated experience across our portfolio.

With Workers AI and AI Gateway, users can easily enable analytics, logging, caching, and rate limiting for their AI application by connecting to AI Gateway directly through a binding in the Workers AI request. We imagine a future where AI Gateway can not only help you create and save datasets to use for fine-tuning your own models with Workers AI, but also seamlessly redeploy them on the same platform. A great AI experience is not just about speed, but also accuracy. While Workers AI ensures fast performance, using it in combination with AI Gateway allows you to evaluate and optimize that performance by monitoring model accuracy and catching issues, like hallucinations or incorrect formats. With AI Gateway, users can test out whether switching to new models in the Workers AI model catalog will deliver more accurate performance and a better user experience.
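Here is a minimal sketch of what routing a Workers AI request through AI Gateway from the binding can look like. It assumes the binding's run method accepts a third options argument with a gateway ID; the AiBinding interface below is a hand-written stand-in so the example is self-contained, and the gateway name "my-gateway" is hypothetical. Check the docs for the exact options your account supports.

```typescript
// Hand-written stand-in for the Workers AI binding type (an assumption for this sketch).
interface AiBinding {
  run(
    model: string,
    input: unknown,
    options?: { gateway?: { id: string } },
  ): Promise<unknown>;
}

// Sends the prompt to Workers AI and asks for the request to go through the named gateway.
function runThroughGateway(ai: AiBinding, gatewayId: string, prompt: string): Promise<unknown> {
  return ai.run(
    "@cf/meta/llama-3.1-8b-instruct-fast",
    { prompt },
    { gateway: { id: gatewayId } },
  );
}

// Inside a Worker this would be called as:
//   const output = await runThroughGateway(env.AI, "my-gateway", "Hello!");
```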

In the future, we’ll also be working on tighter integrations between Vectorize and Workers AI, where you can automatically supply context or remember past conversations in an inference call. This cuts down on the orchestration needed to run a RAG application, where we can automatically help you make queries to vector databases.

If we put the three products together, we imagine a world where you can build AI apps with full observability (traces with AI Gateway) and see how the retrieval (Vectorize) and generation (Workers AI) components are working together, enabling you to diagnose issues and improve performance.

This Birthday Week, we’ve been focused on making sure our individual products are best-in-class, and we’re continuing to invest in building a holistic AI platform, both within our AI portfolio and with the larger Developer Platform products. Our goal is to make sure that Cloudflare is the simplest, fastest, and most powerful place for you to build full-stack AI experiences with all the batteries included.

We’re excited for you to try out all these new features! Take a look at our updated developer docs on how to get started and the Cloudflare dashboard to interact with your account.

Workers AI: serverless GPU-powered inference on Cloudflare’s global network

Phil Wittig — Wed, 27 Sep 2023 13:00:47 GMT

If you're anywhere near the developer community, it's almost impossible to avoid the impact that AI’s recent advancements have had on the ecosystem. Whether you're using AI in your workflow to improve productivity, or you’re shipping AI-based features to your users, it’s everywhere. The focus on AI improvements is extraordinary, and we’re super excited about the opportunities that lie ahead, but it's not enough.

Not too long ago, if you wanted to leverage the power of AI, you needed to know the ins and outs of machine learning, and be able to manage the infrastructure to power it.

As a developer platform with over one million active developers, we believe there is so much potential yet to be unlocked, so we’re changing the way AI is delivered to developers. Many of the current solutions, while powerful, are based on closed, proprietary models and don't address privacy needs that developers and users demand. Alternatively, the open source scene is exploding with powerful models, but they’re simply not accessible enough to every developer. Imagine being able to run a model, from your code, wherever it’s hosted, and never needing to find GPUs or deal with setting up the infrastructure to support it.

That's why we are excited to launch Workers AI - an AI inference-as-a-service platform, empowering developers to run AI models with just a few lines of code, all powered by our global network of GPUs. It's open and accessible, serverless, privacy-focused, runs near your users, pay-as-you-go, and it's built from the ground up for a best-in-class developer experience.

Workers AI - making inference just work

We’re launching Workers AI to put AI inference in the hands of every developer, and to actually deliver on that goal, it should just work out of the box. How do we achieve that?

At the core of everything, it runs on the right infrastructure - our world-class network of GPUs
We provide off-the-shelf models that run seamlessly on our infrastructure
Finally, we deliver it to the end developer in a way that’s delightful. A developer should be able to build their first Workers AI app in minutes, and say “Wow, that’s kinda magical!”.

So what exactly is Workers AI? It’s another building block that we’re adding to our developer platform - one that helps developers run well-known AI models on serverless GPUs, all on Cloudflare’s trusted global network. As one of the latest additions to our developer platform, it works seamlessly with Workers + Pages, but to make it truly accessible, we’ve made it platform-agnostic, so it also works everywhere else, made available via a REST API.

Models you know and love

We’re launching with a curated set of popular, open source models, that cover a wide range of inference tasks:

Text generation (large language model): meta/llama-2-7b-chat-int8
Automatic speech recognition (ASR): openai/whisper
Translation: meta/m2m100-1.2b
Text classification: huggingface/distilbert-sst-2-int8
Image classification: microsoft/resnet-50
Embeddings: baai/bge-base-en-v1.5

You can browse all available models in your Cloudflare dashboard, and soon you’ll be able to dive into logs and analytics on a per model basis!

This is just the start, and we’ve got big plans. After launch, we’ll continue to expand based on community feedback. Even more exciting - in an effort to take our catalog from zero to sixty, we’re announcing a partnership with Hugging Face, a leading AI community + hub. The partnership is multifaceted, and you can read more about it here, but soon you’ll be able to browse and run a subset of the Hugging Face catalog directly in Workers AI.

Accessible to everyone

Part of the mission of our developer platform is to provide all the building blocks that developers need to build the applications of their dreams. Having access to the right blocks is just one part of it — as a developer your job is to put them together into an application. Our goal is to make that as easy as possible.

To make sure you could use Workers AI easily regardless of entry point, we wanted to provide access via: Workers or Pages to make it easy to use within the Cloudflare ecosystem, and via REST API if you want to use Workers AI with your current stack.

Here’s a quick curl example that translates some text from English to French:

curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/m2m100-1.2b \
  -H "Authorization: Bearer {API_TOKEN}" \
  -d '{ "text": "I'\''ll have an order of the moule frites", "target_lang": "french" }'

And here is what the response looks like:

{
  "result": {
    "answer": "Je vais commander des moules frites"
  },
  "success": true,
  "errors":[],
  "messages":[]
}
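If your stack is TypeScript or JavaScript, the same request can be sketched with fetch. The helper below only builds the URL and request options, so it can be inspected without credentials; its name and shape are our own illustration, and letting JSON.stringify build the body sidesteps the shell quoting that the apostrophe in "I'll" otherwise requires.

```typescript
interface TranslationRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: mirrors the curl call above for the m2m100 translation model.
function buildTranslationRequest(
  accountId: string,
  apiToken: string,
  text: string,
  targetLang: string,
): TranslationRequest {
  const url =
    `https://api.cloudflare.com/client/v4/accounts/${accountId}` +
    `/ai/run/@cf/meta/m2m100-1.2b`;
  return {
    url,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiToken}` },
      body: JSON.stringify({ text, target_lang: targetLang }),
    },
  };
}

// Usage (needs a real account ID and API token):
//   const { url, init } = buildTranslationRequest(ACCOUNT_ID, API_TOKEN,
//     "I'll have an order of the moule frites", "french");
//   const data = await (await fetch(url, init)).json();
```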

Use it with any stack, anywhere - your favorite Jamstack framework, Python + Django/Flask, Node.js, Ruby on Rails, the possibilities are endless. And deploy.

Designed for developers

Developer experience is really important to us. In fact, most of this post has been about just that. Making sure it works out of the box. Providing popular models that just work. Being accessible to all developers whether you build and deploy with Cloudflare or elsewhere. But it’s more than that - the experience should be frictionless, zero to production should be fast, and it should feel good along the way.

Let’s walk through another example to show just how easy it is to use! We’ll run Llama 2, a popular large language model open sourced by Meta, in a worker.

We’ll assume you have some of the basics already complete (Cloudflare account, Node, NPM, etc.), but if you don’t, this guide will get you properly set up!

1. Create a Workers project

Create a new project named workers-ai by running:

$ npm create cloudflare@latest

When setting up your workers-ai worker, answer the setup questions as follows:

Enter workers-ai for the app name
Choose Hello World script for the type of application
Select yes to using TypeScript
Select yes to using Git
Select no to deploying

Lastly navigate to your new app directory:

cd workers-ai

2. Connect Workers AI to your worker

Create a Workers AI binding, which allows your worker to access the Workers AI service without having to manage an API key yourself.

To bind Workers AI to your worker, add the following to the end of your wrangler.toml file:

[ai]
binding = "AI" #available in your worker via env.AI

You can also bind Workers AI to a Pages Function. For more information, refer to Functions Bindings.

3. Install the Workers AI client library

npm install @cloudflare/ai

4. Run an inference task in your worker

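Update src/index.ts with the following code:

import { Ai } from '@cloudflare/ai'

export interface Env {
  // Workers AI binding configured under [ai] in wrangler.toml
  AI: any;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ai = new Ai(env.AI);
    const input = { prompt: "What's the origin of the phrase 'Hello, World'" };
    // Run the Llama 2 chat model on Cloudflare's GPUs and return the result as JSON
    const output = await ai.run('@cf/meta/llama-2-7b-chat-int8', input);
    return new Response(JSON.stringify(output));
  },
};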

5. Develop locally with Wrangler

While in your project directory, test Workers AI locally by running:

$ npx wrangler dev --remote

Note - These models currently only run on Cloudflare’s network of GPUs (and not locally), so setting --remote above is a must, and you’ll be prompted to log in at this point.

Wrangler will give you a URL (most likely localhost:8787). Visit that URL, and you’ll see a response like this:

{
  "response": "Hello, World is a common phrase used to test the output of a computer program, particularly in the early stages of programming. The phrase \"Hello, World!\" is often the first program that a beginner learns to write, and it is included in many programming language tutorials and textbooks as a way to introduce basic programming concepts. The origin of the phrase \"Hello, World!\" as a programming test is unclear, but it is believed to have originated in the 1970s. One of the earliest known references to the phrase is in a 1976 book called \"The C Programming Language\" by Brian Kernighan and Dennis Ritchie, which is considered one of the most influential books on the development of the C programming language."
}

6. Deploy your worker

Finally, deploy your worker to make your project accessible on the Internet:

$ npx wrangler deploy
# Outputs: https://workers-ai..workers.dev

And that’s it. You can literally go from zero to deployed AI in minutes. This is obviously a simple example, but shows how easy it is to run Workers AI from any project.

Privacy by default

When Cloudflare was founded, our value proposition had three pillars: more secure, more reliable, and more performant. Over time, we’ve realized that a better Internet is also a more private Internet, and we want to play a role in building it.

That’s why Workers AI is private by default - we don’t train our models, LLM or otherwise, on your data or conversations, and our models don’t learn from your usage. You can feel confident using Workers AI in both personal and business settings, without having to worry about leaking your data. Other providers only offer this fundamental feature with their enterprise version. With us, it’s built in for everyone.

We’re also excited to support data localization in the future. To make this happen, we have an ambitious GPU rollout plan - we’re launching with seven sites today, roughly 100 by the end of 2023, and nearly everywhere by the end of 2024. Ultimately, this will empower developers to keep delivering killer AI features to their users, while staying compliant with their end users’ data localization requirements.

The power of the platform

Vector database - Vectorize

Workers AI is all about running inference, and making it really easy to do so, but sometimes inference is only part of the equation. Large language models are trained on a fixed set of data, based on a snapshot at a specific point in the past, and have no context on your business or use case. When you submit a prompt, information specific to you can increase the quality of results, making it more useful and relevant. That’s why we’re also launching Vectorize, our vector database that’s designed to work seamlessly with Workers AI. Here’s a quick overview of how you might use Workers AI + Vectorize together.

Example: Use your data (knowledge base) to provide additional context to an LLM when a user is chatting with it.

Generate initial embeddings: run your data through Workers AI using an embedding model. The output will be embeddings, which are numerical representations of those words.
Insert those embeddings into Vectorize: this essentially seeds the vector database with your data, so we can later use it to retrieve embeddings that are similar to your users’ query
Generate embedding from user question: when a user submits a question to your AI app, first, take that question, and run it through Workers AI using an embedding model.
Get context from Vectorize: use that embedding to query Vectorize. This should output embeddings that are similar to your user’s question.
Create context aware prompt: Now take the original text associated with those embeddings, and create a new prompt combining the text from the vector search, along with the original question
Run prompt: run this prompt through Workers AI using an LLM model to get your final result
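The steps above can be sketched end to end in a few lines. To keep the example runnable anywhere, it uses stand-ins that are entirely our own: a toy word-count "embedding" instead of a real embedding model, and an in-memory array with cosine similarity instead of Vectorize. In a real app, those pieces would be calls to Workers AI and a Vectorize index, and the final prompt would be sent to an LLM.

```typescript
type Vector = number[];
interface Doc { id: string; text: string; embedding: Vector }

// Stand-in for an embedding model: counts occurrences of a few vocabulary words.
const VOCAB = ["workers", "r2", "bucket", "vectorize", "index", "pricing"];
function toyEmbed(text: string): Vector {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map((term) => words.filter((w) => w === term).length);
}

function cosine(a: Vector, b: Vector): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: Vector) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Steps 1-2: embed the knowledge base and store it (an array stands in for Vectorize).
const knowledgeBase = [
  "Write to an R2 bucket from Workers using a binding.",
  "Vectorize stores embeddings in an index.",
  "Workers AI pricing depends on model size.",
];
const store: Doc[] = knowledgeBase.map((text, i) => ({
  id: `doc-${i}`,
  text,
  embedding: toyEmbed(text),
}));

// Steps 3-4: embed the question and retrieve the most similar documents.
function retrieve(question: string, topK: number): Doc[] {
  const q = toyEmbed(question);
  return [...store]
    .sort((a, b) => cosine(q, b.embedding) - cosine(q, a.embedding))
    .slice(0, topK);
}

// Step 5: build a context-aware prompt. Step 6 would send this prompt to an LLM.
function buildPrompt(question: string, topK: number = 1): string {
  const context = retrieve(question, topK).map((d) => d.text).join("\n");
  return `Context:\n${context}\n\nQuestion: ${question}`;
}

console.log(buildPrompt("How do I write to R2 from Workers?"));
```

The retrieval step returns the original text alongside each embedding, which is why the store keeps both: the LLM needs the text, not the numbers.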

AI Gateway

That covers a more advanced use case. On the flip side, if you are running models elsewhere, but want to get more out of the experience, you can run those APIs through our AI Gateway to get features like caching, rate-limiting, analytics and logging. These features can be used to protect your endpoint, monitor and optimize costs, and also help with data loss prevention. Learn more about AI Gateway here.

Start building today

Try it out for yourself, and let us know what you think. Today we’re launching Workers AI as an open Beta for all Workers plans - free or paid. That said, it’s super early, so…

Warning - It’s an early beta

Usage is not currently recommended for production apps, and limits + access are subject to change.

Limits

We’re initially launching with limits on a per-model basis

@cf/meta/llama-2-7b-chat-int8: 50 reqs/min globally

Check out our docs for a full overview of our limits.

Pricing

What we released today is just a small preview to give you a taste of what’s coming (we simply couldn’t hold back), but we’re looking forward to putting the full-throttle version of Workers AI in your hands.

We realize that as you approach building something, you want to understand how much it is going to cost you, especially since AI costs can so easily get out of hand. So we wanted to share the upcoming pricing of Workers AI with you.

While we won’t be billing on day one, we are announcing what we expect our pricing will look like.

Users will be able to choose from two ways to run Workers AI:

Regular Twitch Neurons (RTN) - running wherever there's capacity at $0.01 / 1k neurons
Fast Twitch Neurons (FTN) - running at nearest user location at $0.125 / 1k neurons

You may be wondering — what’s a neuron?

Neurons are a way to measure AI output that always scales down to zero (if you get no usage, you will be charged for 0 neurons). To give you a sense of what you can accomplish with a thousand neurons, you can: generate 130 LLM responses, 830 image classifications, or 1,250 embeddings.
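To translate that into dollars, here is a small sketch built only from the figures above: the two per-thousand-neuron prices and the rough number of tasks a thousand neurons buys. The helper name is illustrative, and since the task counts are approximations, treat the output as a ballpark.

```typescript
const USD_PER_1K_NEURONS = { regular: 0.01, fast: 0.125 } as const;
const TASKS_PER_1K_NEURONS = {
  llmResponse: 130,
  imageClassification: 830,
  embedding: 1250,
} as const;

type NeuronTier = keyof typeof USD_PER_1K_NEURONS;
type NeuronTask = keyof typeof TASKS_PER_1K_NEURONS;

// Hypothetical helper: approximate cost of running `count` tasks on a given tier.
function estimatedNeuronCostUsd(task: NeuronTask, count: number, tier: NeuronTier): number {
  const thousandsOfNeurons = count / TASKS_PER_1K_NEURONS[task];
  return thousandsOfNeurons * USD_PER_1K_NEURONS[tier];
}

// 1,300 LLM responses is about 10k neurons: roughly $0.10 regular, $1.25 fast.
console.log(estimatedNeuronCostUsd("llmResponse", 1300, "regular").toFixed(2)); // "0.10"
console.log(estimatedNeuronCostUsd("llmResponse", 1300, "fast").toFixed(2)); // "1.25"
```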

Our goal is to help our customers pay only for what they use, and choose the pricing that best matches their use case, whether it’s price or latency that is top of mind.

What’s on the roadmap?

Workers AI is just getting started, and we want your feedback to help us make it great. That said, there are some exciting things on the roadmap.

More models, please

We're launching with a solid set of models that just work, but will continue to roll out new models based on your feedback. If there’s a particular model you'd love to see on Workers AI, pop into our Discord and let us know!

In addition to that, we're also announcing a partnership with Hugging Face, and soon you'll be able to access and run a subset of the Hugging Face catalog directly from Workers AI.

Analytics + observability

Up to this point, we’ve been hyper-focused on one thing - making it really easy for any developer to run powerful AI models in just a few lines of code. But that’s only one part of the story. Up next, we’ll be working on some analytics and observability capabilities to give you insights into your usage + performance + spend on a per-model basis, plus the ability to dig into your logs if you want to do some exploring.

A road to global GPU coverage

Our goal is to be the best place to run inference on Region: Earth, so we're adding GPUs to our data centers as fast as we can.

We plan to be in 100 data centers by the end of this year

And nearly everywhere by the end of 2024

We’re really excited to see you build - head over to our docs to get started.

If you need inspiration, want to share something you’re building, or have a question - pop into our Developer Discord.

Vectorize: a vector database for shipping AI-powered applications to production, fast

Matt Silverlock — Wed, 27 Sep 2023 13:00:31 GMT

Vectorize is our brand-new vector database offering, designed to let you build full-stack, AI-powered applications entirely on Cloudflare’s global network, and you can start building with it right away. Vectorize is in open beta, and is available to any developer using Cloudflare Workers.

You can use Vectorize with Workers AI to power semantic search, classification, recommendation and anomaly detection use-cases directly with Workers, improve the accuracy and context of answers from LLMs (Large Language Models), and/or bring-your-own embeddings from popular platforms, including OpenAI and Cohere.

Visit Vectorize’s developer documentation to get started, or read on if you want to better understand what vector databases do and how Vectorize is different.

Why do I need a vector database?

Machine learning models can’t remember anything beyond what they were trained on.

Vector databases are designed to solve this, by capturing how an ML model represents data — including structured and unstructured text, images and audio — and storing it in a way that allows you to compare against future inputs. This allows us to leverage the power of existing machine-learning models and LLMs (Large Language Models) for content they haven’t been trained on: which, given the tremendous cost of training models, turns out to be extremely powerful.

To better illustrate why a vector database like Vectorize is useful, let’s pretend they don’t exist, and see how painful it is to give context to an ML model or LLM for a semantic search or recommendation task. Our goal is to understand what content is similar to our query and return it: based on our own dataset.

Our user query comes in: they’re searching for “how to write to R2 from Cloudflare Workers”
We load up our entire documentation dataset — a thankfully “small” dataset at about 65,000 sentences, or 2.1 GB — and provide it alongside the query from our user. This allows the model to have the context it needs, based on our data.
We wait.
(A long time)
We get our similarity scores back, with the sentences most similar to the user’s query, and then work to map those back to URLs before we return our search results.

… and then another query comes in, and we have to start this all over again.

In practice, this isn’t really possible: we can’t pass that much context in an API call (prompt) to most machine learning models, and even if we could, it’d take tremendous amounts of memory and time to process our dataset over and over again.

With a vector database, we don’t have to repeat step 2: we perform it once, or as our dataset updates, and use our vector database to provide a form of long-term memory for our machine learning model. Our workflow looks a little more like this:

We load up our entire documentation dataset, run it through our model, and store the resulting vector embeddings in our vector database (just once).
For each user query (and only the query) we ask the same model and retrieve a vector representation.
We query our vector database with that query vector, which returns the vectors closest to our query vector.

If we look at these two flows side by side, we can quickly see how inefficient and impractical it is to use our own dataset with an existing model without a vector database:

Using a vector database to help machine learning models remember.
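To make the second flow concrete, here is a minimal in-memory sketch. It is an illustration only, not how Vectorize works internally: toyEmbed is a hypothetical stand-in for a real embedding model (it just counts letters), and InMemoryIndex is an invented name. The point it shows is that the dataset is embedded once, while each query only embeds the query text.

```typescript
// Hypothetical stand-in for an embedding model: counts the letters a-z.
// A real model returns learned features, not letter counts.
function toyEmbed(text: string): number[] {
	const counts: number[] = [];
	for (let i = 0; i < 26; i++) counts.push(0);
	const lower = text.toLowerCase();
	for (let i = 0; i < lower.length; i++) {
		const letter = lower.charCodeAt(i) - 97;
		if (letter >= 0 && letter < 26) counts[letter]++;
	}
	return counts;
}

// Cosine similarity between two vectors of the same length.
function cosine(a: number[], b: number[]): number {
	let dot = 0;
	let normA = 0;
	let normB = 0;
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i];
		normA += a[i] * a[i];
		normB += b[i] * b[i];
	}
	return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class InMemoryIndex {
	private entries: { id: string; values: number[] }[] = [];
	embedCalls = 0; // counts how often the "model" is invoked

	// Embed the whole dataset once and keep the vectors.
	load(docs: { id: string; text: string }[]): void {
		for (const doc of docs) {
			this.entries.push({ id: doc.id, values: this.embed(doc.text) });
		}
	}

	// Embed only the query, then return the topK closest stored vectors.
	query(text: string, topK: number): { id: string; score: number }[] {
		const queryVector = this.embed(text);
		return this.entries
			.map((entry) => ({ id: entry.id, score: cosine(queryVector, entry.values) }))
			.sort((a, b) => b.score - a.score)
			.slice(0, topK);
	}

	private embed(text: string): number[] {
		this.embedCalls++;
		return toyEmbed(text);
	}
}

const index = new InMemoryIndex();
index.load([
	{ id: 'r2', text: 'write objects to R2 from a Worker' },
	{ id: 'kv', text: 'read a key from Workers KV' },
	{ id: 'dns', text: 'configure DNS records' },
]);
const results = index.query('how to write to R2 from Cloudflare Workers', 2);
console.log(results, index.embedCalls);
```

After loading three documents and running one query, the model has been called four times; every further query adds exactly one call, no matter how large the dataset is.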

From this simple example, it’s probably starting to make some sense: but you might also be wondering why you need a vector database instead of just a regular database.

Vectors are the model’s representation of an input: how it maps that input to its internal structure, or “features”. Broadly, the more similar vectors are, the more similar the model believes those inputs to be based on how it extracts features from an input.

This is seemingly easy when we look at example vectors of only a handful of dimensions. But with real-world outputs, searching across 10,000 to 250,000 vectors, each potentially 1,536 dimensions wide, is non-trivial. This is where vector databases come in: to make search work at scale, vector databases use a specific class of algorithm, such as k-nearest neighbors (kNN) or approximate nearest neighbor (ANN) algorithms, to determine vector similarity.

And although vector databases are extremely useful when building AI and machine learning powered applications, they’re not only useful in those use-cases: they can be used for a multitude of classification and anomaly detection tasks. Knowing whether a query input is similar (or dissimilar) to other inputs can power content moderation (does this match known-bad content?) and security alerting (have I seen this before?) tasks as well.

Building a recommendation engine with vector search

We built Vectorize to be a powerful partner to Workers AI: enabling you to run vector search tasks as close to users as possible, and without having to think about how to scale it for production.

We’re going to take a real world example — building a (product) recommendation engine for an e-commerce store — and simplify a few things.

Our goal is to show a list of “relevant products” on each product listing page: a perfect use-case for vector search. Our input vectors in the example are placeholders, but in a real world application we would generate them based on product descriptions and/or cart data by passing them through a sentence similarity model (such as Workers AI’s text embedding model).

Each vector represents a product across our store, and we associate the URL of the product with it. We could also set the ID of each vector to the product ID: both approaches are valid. Our query vector represents the product description and content for the product the user is currently viewing.

Let’s step through what this looks like in code: this example is pulled straight from our developer documentation:

export interface Env {
	// This makes our vector index methods available on env.TUTORIAL_INDEX.*
	// e.g. env.TUTORIAL_INDEX.insert() or .query()
	TUTORIAL_INDEX: VectorizeIndex;
}

// Sample vectors: 3 dimensions wide.
//
// Vectors from a machine-learning model are typically ~100 to 1536 dimensions
// wide (or wider still).
const sampleVectors: Array<VectorizeVector> = [
	{ id: '1', values: [32.4, 74.1, 3.2], metadata: { url: '/products/sku/13913913' } },
	{ id: '2', values: [15.1, 19.2, 15.8], metadata: { url: '/products/sku/10148191' } },
	{ id: '3', values: [0.16, 1.2, 3.8], metadata: { url: '/products/sku/97913813' } },
	{ id: '4', values: [75.1, 67.1, 29.9], metadata: { url: '/products/sku/418313' } },
	{ id: '5', values: [58.8, 6.7, 3.4], metadata: { url: '/products/sku/55519183' } },
];

export default {
	async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
		if (new URL(request.url).pathname !== '/') {
			return new Response('', { status: 404 });
		}
		// Insert some sample vectors into our index
		// In a real application, these vectors would be the output of a machine learning (ML) model,
		// such as Workers AI, OpenAI, or Cohere.
		let inserted = await env.TUTORIAL_INDEX.insert(sampleVectors);

		// Log the number of IDs we successfully inserted
		console.info(`inserted ${inserted.count} vectors into the index`);

		// In a real application, we would take a user query - e.g. "durable
		// objects" - and transform it into a vector embedding first.
		//
		// In our example, we're going to construct a simple vector that should
		// match vector id #5
		let queryVector: Array<number> = [54.8, 5.5, 3.1];

		// Query our index and return the three (topK = 3) most similar vector
		// IDs with their similarity score.
		//
		// By default, vector values are not returned, as in many cases the
		// vectorId and scores are sufficient to map the vector back to the
		// original content it represents.
		let matches = await env.TUTORIAL_INDEX.query(queryVector, { topK: 3, returnVectors: true });

		// We map over our results to find the most similar vector result.
		//
		// Since our index uses the 'cosine' distance metric, scores will range
		// from 1 to -1. A score of 1 means the vectors point in the same
		// direction; the closer to 1, the more similar. A score of 0 means no
		// similarity, and -1 is the least similar.
		// let closestScore = 0;
		// let mostSimilarId = '';
		// matches.matches.map((match) => {
		// 	if (match.score > closestScore) {
		// 		closestScore = match.score;
		// 		mostSimilarId = match.vectorId;
		// 	}
		// });

		return Response.json({
			// This will return the closest vectors: we'll see that the vector
			// with id = 5 has the highest score (closest to 1.0) as the
			// distance between it and our query vector is the smallest.
			// Return the full set of matches so we can see the possible scores.
			matches: matches,
		});
	},
};

The code above is intentionally simple, but illustrates vector search at its core: we insert vectors into our database, and query it for vectors with the smallest distance to our query vector.

Here are the results, with the values included, so we can visually observe that our query vector [54.8, 5.5, 3.1] is similar to our highest scoring match: [58.799, 6.699, 3.400] returned from our search. This index uses cosine similarity to calculate the distance between vectors, which means that the closer the score is to 1, the more similar a match is to our query vector.

{
  "matches": {
    "count": 3,
    "matches": [
      {
        "score": 0.999909,
        "vectorId": "5",
        "vector": {
          "id": "5",
          "values": [
            58.79999923706055,
            6.699999809265137,
            3.4000000953674316
          ],
          "metadata": {
            "url": "/products/sku/55519183"
          }
        }
      },
      {
        "score": 0.789848,
        "vectorId": "4",
        "vector": {
          "id": "4",
          "values": [
            75.0999984741211,
            67.0999984741211,
            29.899999618530273
          ],
          "metadata": {
            "url": "/products/sku/418313"
          }
        }
      },
      {
        "score": 0.611976,
        "vectorId": "2",
        "vector": {
          "id": "2",
          "values": [
            15.100000381469727,
            19.200000762939453,
            15.800000190734863
          ],
          "metadata": {
            "url": "/products/sku/10148191"
          }
        }
      }
    ]
  }
}
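The ranking in this response can be checked locally. The sketch below computes plain cosine similarity between the query vector and the five sample vectors from the example; cosineSimilarity and ranked are names chosen for illustration and are not part of the Vectorize API. Because the index stores values at lower floating-point precision, the last digits of the scores may differ slightly from the response.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
	let dot = 0;
	let normA = 0;
	let normB = 0;
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i];
		normA += a[i] * a[i];
		normB += b[i] * b[i];
	}
	return dot / Math.sqrt(normA * normB);
}

// The same sample vectors and query vector used in the Worker above.
const stored = [
	{ id: '1', values: [32.4, 74.1, 3.2] },
	{ id: '2', values: [15.1, 19.2, 15.8] },
	{ id: '3', values: [0.16, 1.2, 3.8] },
	{ id: '4', values: [75.1, 67.1, 29.9] },
	{ id: '5', values: [58.8, 6.7, 3.4] },
];
const query = [54.8, 5.5, 3.1];

// Score every vector, sort from most to least similar, keep the top three.
const ranked = stored
	.map((v) => ({ vectorId: v.id, score: cosineSimilarity(query, v.values) }))
	.sort((a, b) => b.score - a.score)
	.slice(0, 3);

for (const r of ranked) {
	console.log(`${r.vectorId}: ${r.score.toFixed(4)}`);
}
```

The top three come back as ids 5, 4 and 2, in the same order as the response, with scores of roughly 0.9999, 0.7898 and 0.6120. This brute-force scan compares the query against every vector, which is fine for five vectors but is exactly the work a vector database is built to avoid at scale.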

In a real application, we could now quickly return product recommendation URLs based on the most similar products, sorting them by their score (highest to lowest), and increasing the topK value if we want to show more. The metadata stored alongside each vector could also embed a path to an R2 object, a UUID for a row in a D1 database, or a key-value pair from Workers KV.
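As a rough sketch of that last step, the function below turns a query response into an ordered list of recommendation URLs. The interfaces mirror only the fields visible in the JSON response above, and recommendationUrls is a hypothetical helper. It also skips the page the user is already on, which is an assumption on our part: the product being viewed is usually its own closest match.

```typescript
// Shape of the query response shown above (only the fields used here).
interface Match {
	score: number;
	vectorId: string;
	vector?: { metadata?: { url?: string } };
}
interface QueryResult {
	count: number;
	matches: Match[];
}

// Returns product URLs ordered from most to least similar, leaving out
// matches without a URL and the product currently being viewed.
function recommendationUrls(result: QueryResult, currentUrl: string): string[] {
	const urls: string[] = [];
	const sorted = result.matches.slice().sort((a, b) => b.score - a.score);
	for (const match of sorted) {
		const url = match.vector?.metadata?.url;
		if (url !== undefined && url !== currentUrl) {
			urls.push(url);
		}
	}
	return urls;
}

// Scores and URLs copied from the response above, deliberately out of order.
const response: QueryResult = {
	count: 3,
	matches: [
		{ score: 0.611976, vectorId: '2', vector: { metadata: { url: '/products/sku/10148191' } } },
		{ score: 0.999909, vectorId: '5', vector: { metadata: { url: '/products/sku/55519183' } } },
		{ score: 0.789848, vectorId: '4', vector: { metadata: { url: '/products/sku/418313' } } },
	],
};

const urls = recommendationUrls(response, '/products/sku/55519183');
console.log(urls); // [ '/products/sku/418313', '/products/sku/10148191' ]
```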

Workers AI + Vectorize: full stack vector search on Cloudflare

In a real application, we need a machine learning model that can both generate vector embeddings from our original dataset (to seed our database) and quickly turn user queries into vector embeddings too. These need to be from the same model, as each model represents features differently.

Here’s a compact example building an entire end-to-end vector search pipeline on Cloudflare:

import { Ai } from '@cloudflare/ai';
export interface Env {
	TEXT_EMBEDDINGS: VectorizeIndex;
	AI: any;
}
interface EmbeddingResponse {
	shape: number[];
	data: number[][];
}

export default {
	async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
		const ai = new Ai(env.AI);
		let path = new URL(request.url).pathname;
		if (path.startsWith('/favicon')) {
			return new Response('', { status: 404 });
		}

		// We only need to generate vector embeddings just the once (or as our
		// data changes), not on every request
		if (path === '/insert') {
			// In a real-world application, we could read in content from R2 or
			// a SQL database (like D1) and pass it to Workers AI
			const stories = ['This is a story about an orange cloud', 'This is a story about a llama', 'This is a story about a hugging emoji'];
			const modelResp: EmbeddingResponse = await ai.run('@cf/baai/bge-base-en-v1.5', {
				text: stories,
			});

			// We need to convert the vector embeddings into a format Vectorize can accept.
			// Each vector needs an id, a value (the vector) and optional metadata.
			// In a real app, our ID would typically be bound to the ID of the source
			// document.
			let vectors: VectorizeVector[] = [];
			let id = 1;
			modelResp.data.forEach((vector) => {
				vectors.push({ id: `${id}`, values: vector });
				id++;
			});

			await env.TEXT_EMBEDDINGS.upsert(vectors);
		}

		// Our query: we expect this to match vector id: 1 in this simple example
		let userQuery = 'orange cloud';
		const queryVector: EmbeddingResponse = await ai.run('@cf/baai/bge-base-en-v1.5', {
			text: [userQuery],
		});

		let matches = await env.TEXT_EMBEDDINGS.query(queryVector.data[0], { topK: 1 });
		return Response.json({
			// We expect vector id: 1 to be our top match with a score of
			// ~0.896888444
			// We are using a cosine distance metric, where the closer to one,
			// the more similar.
			matches: matches,
		});
	},
};

The code above does four things:

It passes the three sentences to Workers AI’s text embedding model (@cf/baai/bge-base-en-v1.5) and retrieves their vector embeddings.
It inserts those vectors into our Vectorize index.
It takes the user query and transforms it into a vector embedding via the same Workers AI model.
It queries our Vectorize index for matches.

This example might look “too” simple, but in a production application, we’d only have to change two things: just insert our vectors once (or periodically via Cron Triggers), and replace our three example sentences with real data stored in R2, a D1 database, or another storage provider.
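One way to prepare real data for that insert step is to bind each vector’s ID to the ID of its source document, as the comment in the code above suggests. The sketch below is a hypothetical helper: SourceDoc, VectorRecord and toVectorRecords are illustrative names, the document shape is an assumption, and the embedding values are short placeholders rather than real model output.

```typescript
// Assumed shape of a row read from R2, D1 or another store.
interface SourceDoc {
	id: string;
	url: string;
	text: string;
}

// An id, the vector values, and metadata, as in the examples above.
interface VectorRecord {
	id: string;
	values: number[];
	metadata: { url: string };
}

// Pairs each document with its embedding, reusing the document's own ID.
// Embeddings are expected in the same order as the documents.
function toVectorRecords(docs: SourceDoc[], embeddings: number[][]): VectorRecord[] {
	if (docs.length !== embeddings.length) {
		throw new Error(`expected ${docs.length} embeddings, got ${embeddings.length}`);
	}
	return docs.map((doc, i) => ({
		id: doc.id,
		values: embeddings[i],
		metadata: { url: doc.url },
	}));
}

const docs: SourceDoc[] = [
	{ id: 'story-orange-cloud', url: '/stories/orange-cloud', text: 'This is a story about an orange cloud' },
	{ id: 'story-llama', url: '/stories/llama', text: 'This is a story about a llama' },
];
// Placeholder values: a real embedding would be far wider.
const placeholderEmbeddings = [
	[0.1, 0.2, 0.3],
	[0.4, 0.5, 0.6],
];

const records = toVectorRecords(docs, placeholderEmbeddings);
console.log(records.map((r) => r.id)); // [ 'story-orange-cloud', 'story-llama' ]
```

With stable IDs like these, re-running the seeding step through upsert() replaces the vectors for documents that already exist instead of adding duplicates.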

In fact, this is incredibly similar to how we run Cursor, the AI assistant that can answer questions about Cloudflare Workers: we migrated Cursor to run on Workers AI and Vectorize. We generate text embeddings from our developer documentation using its built-in text embedding model, insert them into a Vectorize index, and transform user queries on the fly via that same model.

BYO embeddings from your favorite AI API

Vectorize isn’t just limited to Workers AI, though: it’s a fully-fledged, standalone vector database.

If you’re already using OpenAI’s Embedding API, Cohere’s multilingual model, or any other embedding API, then you can easily bring-your-own (BYO) vectors to Vectorize.

It works just the same: generate your embeddings, insert them into Vectorize, and pass your queries through the model before you query your index. Vectorize includes a few shortcuts for some of the most popular embedding models.

# Vectorize has ready-to-go presets that set the dimensions and distance metric for popular embeddings models
$ wrangler vectorize create openai-index-example --preset=openai-text-embedding-ada-002

This can be particularly useful if you already have an existing workflow around an existing embeddings API, and/or have validated a specific multimodal or multilingual embeddings model for your use-case.

Making the cost of AI predictable

There’s a tremendous amount of excitement around AI and ML, but there’s also one big concern: that it’s too expensive to experiment with, and hard to predict at scale.

With Vectorize, we wanted to bring a simpler pricing model to vector databases. Have an idea for a proof-of-concept at work? That should fit into our free-tier limits. Scaling up and optimizing your embedding dimensions for performance vs. accuracy? It shouldn’t break the bank.

Importantly, Vectorize aims to be predictable: you don’t need to estimate CPU and memory consumption, which can be hard when you’re just starting out, and made even harder when trying to plan for your peak vs. off-peak hours in production for a brand new use-case. Instead, you’re charged based on the total number of vector dimensions you store, and the number of queries against them each month. It’s our job to take care of scaling up to meet your query patterns.

Here’s the pricing for Vectorize — and if you have a Workers paid plan now, Vectorize is entirely free to use until 2024:

	Workers Free (coming soon)	Workers Paid ($5/month)
Queried vector dimensions included	30M total queried dimensions / month	50M total queried dimensions / month
Stored vector dimensions included	5M stored dimensions / month	10M stored dimensions / month
Additional cost	$0.04 / 1M vector dimensions queried or stored	$0.04 / 1M vector dimensions queried or stored

Pricing is based entirely on what you store and query: (vectors queried + vectors stored) * dimensions_per_vector * price. Query more? Easy to predict. Optimizing for smaller dimensions per vector to improve speed and reduce overall latency? Cost goes down. Have a few indexes for prototyping or experimenting with new use-cases? We don’t charge per-index.

Create as many indexes as you need to prototype new ideas and/or separate production from dev.
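To get a feel for how the numbers move, the pricing can be turned into a small estimator. This is a rough sketch, not a billing calculator: the allowances and price are copied from the Workers Paid column of the table above, and it assumes that each query counts as one vector’s worth of dimensions, which may not match how usage is actually metered. The workloads are made-up examples.

```typescript
// Allowances and price copied from the Workers Paid column of the table above.
const WORKERS_PAID = {
	includedQueriedDimensions: 50_000_000,
	includedStoredDimensions: 10_000_000,
	pricePerMillionDimensions: 0.04,
};

interface Workload {
	vectorsStored: number;
	queriesPerMonth: number;
	dimensionsPerVector: number;
}

// ASSUMPTION: each query is metered as `dimensionsPerVector` queried
// dimensions. Returns the estimated monthly charge beyond the included
// allowances, in US dollars.
function estimateOverageUsd(w: Workload, plan = WORKERS_PAID): number {
	const queried = w.queriesPerMonth * w.dimensionsPerVector;
	const stored = w.vectorsStored * w.dimensionsPerVector;
	const billable =
		Math.max(0, queried - plan.includedQueriedDimensions) +
		Math.max(0, stored - plan.includedStoredDimensions);
	return (billable / 1_000_000) * plan.pricePerMillionDimensions;
}

// Hypothetical workloads, chosen only to show both sides of the allowance.
const small = estimateOverageUsd({ vectorsStored: 10_000, queriesPerMonth: 30_000, dimensionsPerVector: 384 });
const large = estimateOverageUsd({ vectorsStored: 100_000, queriesPerMonth: 300_000, dimensionsPerVector: 768 });
console.log(small.toFixed(2), large.toFixed(2));
```

Under these assumptions the small workload stays inside the included allowances and costs nothing extra, while the large one comes to about $9.89 a month. Halving dimensionsPerVector halves both the queried and the stored totals, which is why smaller embeddings reduce cost.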

As an example: if you load 10,000 Workers AI vectors (384 dimensions each) and make 5,000 queries against your index each day, it’d result in 49 million total vector dimensions queried and still fit into what we include in the Workers Paid plan ($5/month). Better still: we don’t delete your indexes due to inactivity.

Note that while this pricing isn’t final, we expect few changes going forward. We want to avoid the element of surprise: there’s nothing worse than starting to build on a platform and realizing the pricing is untenable after you’ve invested the time writing code, tests and learning the nuances of a technology.

Vectorize!

Every Workers developer on a paid plan can start using Vectorize immediately: the open beta is available right now, and you can visit our developer documentation to get started.

This is also just the beginning of the vector database story for us at Cloudflare. Over the next few weeks and months, we intend to land a new query engine that should further improve query performance, support even larger indexes, introduce sub-index filtering capabilities, increase metadata limits, and add per-index analytics.

If you’re looking for inspiration on what to build, see the semantic search tutorial that combines Workers AI and Vectorize for document search, running entirely on Cloudflare, or the example of how to combine OpenAI and Vectorize to give an LLM more context and dramatically improve the accuracy of its answers.

And if you have questions for our product & engineering teams about how to use Vectorize, or just want to bounce an idea off other developers building on Workers AI, join the #vectorize and #workers-ai channels on our Developer Discord.