
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built and the technologies we use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 08:12:29 GMT</lastBuildDate>
        <item>
            <title><![CDATA[R2 Data Catalog: Managed Apache Iceberg tables with zero egress fees]]></title>
            <link>https://blog.cloudflare.com/r2-data-catalog-public-beta/</link>
            <pubDate>Thu, 10 Apr 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ R2 Data Catalog is now in public beta: a managed Apache Iceberg data catalog built directly into your R2 bucket. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://iceberg.apache.org/"><u>Apache Iceberg</u></a> is quickly becoming the standard table format for querying large analytic datasets in <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>. We’re seeing this trend firsthand as more and more developers and data teams adopt Iceberg on <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>Cloudflare R2</u></a>. But until now, using Iceberg with R2 meant managing additional infrastructure or relying on external data catalogs.</p><p>So we’re fixing this. Today, we’re launching the <a href="https://developers.cloudflare.com/r2/data-catalog/"><u>R2 Data Catalog</u></a> in public beta, a managed Apache Iceberg catalog built directly into your Cloudflare R2 bucket.</p><p>If you’re not already familiar with it, Iceberg is an open table format built for large-scale analytics on datasets stored in object storage. With R2 Data Catalog, you get the database-like capabilities Iceberg is known for – <a href="https://en.wikipedia.org/wiki/ACID"><u>ACID</u></a> transactions, schema evolution, and efficient querying – without the overhead of managing your own external catalog.</p><p>R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect the engines you already use, like <a href="https://py.iceberg.apache.org/"><u>PyIceberg</u></a>, <a href="https://www.snowflake.com/"><u>Snowflake</u></a>, and <a href="https://spark.apache.org/"><u>Spark</u></a>. And, as always with R2, there are no egress fees, meaning that no matter which cloud or region your data is consumed from, you won’t have to worry about growing data transfer costs.</p><p>Ready to query data in R2 right now? Jump into the <a href="https://developers.cloudflare.com/r2/data-catalog/"><u>developer docs</u></a> and enable a data catalog on your R2 bucket in just a few clicks. 
Or keep reading to learn more about Iceberg, data catalogs, how metadata files work under the hood, and how to create your first Iceberg table.</p>
    <div>
      <h2>What is Apache Iceberg?</h2>
      <a href="#what-is-apache-iceberg">
        
      </a>
    </div>
    <p><a href="https://iceberg.apache.org/"><u>Apache Iceberg</u></a> is an open table format for analyzing large datasets in object storage. It brings database-like features – ACID transactions, time travel, and schema evolution – to files stored in formats like <a href="https://parquet.apache.org/"><u>Parquet</u></a> or <a href="https://orc.apache.org/"><u>ORC</u></a>.</p><p>Historically, data lakes were just collections of raw files in object storage. However, without a unified metadata layer, datasets could easily become corrupted, were difficult to evolve, and queries often required expensive full-table scans.</p><p>Iceberg solves these problems by:</p><ul><li><p>Providing ACID transactions for reliable, concurrent reads and writes.</p></li><li><p>Maintaining optimized metadata, so engines can skip irrelevant files and avoid unnecessary full-table scans.</p></li><li><p>Supporting schema evolution, allowing columns to be added, renamed, or dropped without rewriting existing data.</p></li></ul><p>Iceberg is already <a href="https://iceberg.apache.org/vendors/"><u>widely supported</u></a> by engines like Apache Spark, Trino, Snowflake, DuckDB, and ClickHouse, with a fast-growing community behind it.</p>
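<p>As a toy illustration of the partition-pruning idea (the file layout below is hypothetical, not Iceberg's actual metadata structures), an engine can compare a query predicate against per-file partition values and skip non-matching files without reading them:</p>

```python
# Hypothetical sketch of partition pruning: each data file records the
# partition values it holds, so a query engine can discard files that
# cannot possibly match the predicate before reading any data.
data_files = [
    {"path": "data/day=2025-04-01/f1.parquet", "partition": {"day": "2025-04-01"}},
    {"path": "data/day=2025-04-02/f2.parquet", "partition": {"day": "2025-04-02"}},
    {"path": "data/day=2025-04-03/f3.parquet", "partition": {"day": "2025-04-03"}},
]

def prune(files, column, value):
    """Return only the files whose partition value matches the predicate."""
    return [f["path"] for f in files if f["partition"].get(column) == value]

# A query like `WHERE day = '2025-04-02'` now touches one file, not three.
matching = prune(data_files, "day", "2025-04-02")
```

With thousands of files, this pre-read filtering is what turns a full-table scan into a targeted read.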
    <div>
      <h3>How Iceberg tables are stored</h3>
      <a href="#how-iceberg-tables-are-stored">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/779M4zsH5QnpDwlTORk1fo/38e7732ca0e20645507bdc0c628f671b/1.png" />
          </figure><p>Internally, an Iceberg table is a collection of data files (typically stored in columnar formats like Parquet or ORC) and metadata files (typically stored in JSON or <a href="https://avro.apache.org/"><u>Avro</u></a>) that describe table snapshots, schemas, and partition layouts.</p><p>To understand how query engines interact efficiently with Iceberg tables, it helps to look at an Iceberg metadata file (simplified):</p>
            <pre><code>{
  "format-version": 2,
  "table-uuid": "0195e49b-8f7c-7933-8b43-d2902c72720a",
  "location": "s3://my-bucket/warehouse/0195e49b-79ca/table",
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        { "id": 1, "name": "id", "required": false, "type": "long" },
        { "id": 2, "name": "data", "required": false, "type": "string" }
      ]
    }
  ],
  "current-snapshot-id": 3567362634015106507,
  "snapshots": [
    {
      "snapshot-id": 3567362634015106507,
      "sequence-number": 1,
      "timestamp-ms": 1743297158403,
      "manifest-list": "s3://my-bucket/warehouse/0195e49b-79ca/table/metadata/snap-3567362634015106507-0.avro",
      "summary": {},
      "schema-id": 0
    }
  ],
  "partition-specs": [{ "spec-id": 0, "fields": [] }]
}</code></pre>
            <p>A few of the important components are:</p><ul><li><p><code>schemas</code>: Iceberg tracks schema changes over time. Engines use schema information to safely read and write data without needing to rewrite underlying files.</p></li><li><p><code>snapshots</code>: Each snapshot references a specific set of data files that represent the state of the table at a point in time. This enables features like time travel.</p></li><li><p><code>partition-specs</code>: These define how the table is logically partitioned. Query engines leverage this information during planning to skip unnecessary partitions, greatly improving query performance.</p></li></ul><p>By reading Iceberg metadata, query engines can efficiently prune partitions, load only the relevant snapshots, and fetch only the data files they need, resulting in faster queries.</p>
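<p>To make this concrete, here is a small sketch in plain Python (using an abbreviated copy of the metadata document above) of how an engine resolves the current schema and snapshot before planning a query:</p>

```python
import json

# Abbreviated copy of the example table-metadata document above.
metadata = json.loads("""{
  "current-schema-id": 0,
  "schemas": [{"schema-id": 0, "type": "struct",
               "fields": [{"id": 1, "name": "id", "required": false, "type": "long"},
                          {"id": 2, "name": "data", "required": false, "type": "string"}]}],
  "current-snapshot-id": 3567362634015106507,
  "snapshots": [{"snapshot-id": 3567362634015106507,
                 "manifest-list": "s3://my-bucket/warehouse/0195e49b-79ca/table/metadata/snap-3567362634015106507-0.avro"}]
}""")

# The current schema tells the engine which columns exist and their types.
schema = next(s for s in metadata["schemas"]
              if s["schema-id"] == metadata["current-schema-id"])
columns = [f["name"] for f in schema["fields"]]

# The current snapshot's manifest list points at the data files to read.
snapshot = next(s for s in metadata["snapshots"]
                if s["snapshot-id"] == metadata["current-snapshot-id"])
manifest_list = snapshot["manifest-list"]
```

Time travel works the same way: the engine simply resolves an older snapshot ID instead of the current one.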
    <div>
      <h3>Why do you need a data catalog?</h3>
      <a href="#why-do-you-need-a-data-catalog">
        
      </a>
    </div>
    <p>Although the Iceberg data and metadata files themselves live directly in object storage (like <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>), the list of tables and pointers to the current metadata need to be tracked centrally by a data catalog.</p><p>Think of a data catalog as a library's index system. While books (your data) are physically distributed across shelves (object storage), the index provides a single source of truth about what books exist, their locations, and their latest editions. Without this index, readers (query engines) would waste time searching for books, might access outdated versions, or could accidentally shelve new books in ways that make them unfindable.</p><p>Similarly, data catalogs ensure consistent, coordinated access, allowing multiple query engines to safely read from and write to the same tables without conflicts or data corruption.</p>
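<p>A toy sketch (illustrative only, not how R2 Data Catalog is implemented) captures the core idea: the catalog holds one authoritative pointer per table and commits updates with a compare-and-swap, so a stale writer fails loudly instead of silently overwriting someone else's commit:</p>

```python
# Toy catalog: a mapping from table name to its current metadata file,
# updated with compare-and-swap semantics for safe concurrent commits.
class ToyCatalog:
    def __init__(self):
        self._tables = {}  # table name -> current metadata location

    def register(self, name, metadata_location):
        self._tables[name] = metadata_location

    def current(self, name):
        return self._tables[name]

    def commit(self, name, expected, new):
        """Swap the metadata pointer only if no one committed in between."""
        if self._tables[name] != expected:
            raise RuntimeError("concurrent update detected; retry on latest metadata")
        self._tables[name] = new

catalog = ToyCatalog()
catalog.register("default.my_table", "metadata/v1.json")
catalog.commit("default.my_table", "metadata/v1.json", "metadata/v2.json")
# A second writer still holding v1 would now get an error rather than
# clobbering v2 -- this is the coordination a catalog provides.
```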
    <div>
      <h2>Create your first Iceberg table on R2</h2>
      <a href="#create-your-first-iceberg-table-on-r2">
        
      </a>
    </div>
    <p>Ready to try it out? Here’s a quick example using <a href="https://py.iceberg.apache.org/"><u>PyIceberg</u></a> and Python to get you started. For a detailed step-by-step guide, check out our <a href="https://developers.cloudflare.com/r2/data-catalog/get-started/"><u>developer docs</u></a>.</p><p>1. Enable R2 Data Catalog on your bucket:
</p>
            <pre><code>npx wrangler r2 bucket catalog enable my-bucket</code></pre>
            <p>Or use the Cloudflare dashboard: Navigate to <b>R2 Object Storage</b> &gt; <b>Settings</b> &gt; <b>R2 Data Catalog</b> and click <b>Enable</b>.</p><p>2. Create a <a href="https://developers.cloudflare.com/r2/api/s3/tokens/"><u>Cloudflare API token</u></a> with permissions for both R2 storage and the data catalog.</p><p>3. Install <a href="https://py.iceberg.apache.org/"><u>PyIceberg</u></a> and <a href="https://arrow.apache.org/docs/index.html"><u>PyArrow</u></a>, then open a Python shell or notebook:</p>
            <pre><code>pip install pyiceberg pyarrow</code></pre>
            <p>4. Connect to the catalog and create a table:</p>
            <pre><code>import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog

# Define catalog connection details (replace variables)
WAREHOUSE = "&lt;WAREHOUSE&gt;"
TOKEN = "&lt;TOKEN&gt;"
CATALOG_URI = "&lt;CATALOG_URI&gt;"

# Connect to R2 Data Catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Create default namespace
catalog.create_namespace("default")

# Create simple PyArrow table
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

# Create an Iceberg table
table = catalog.create_table(
    ("default", "my_table"),
    schema=df.schema,
)</code></pre>
            <p>You can now append more data or run queries, just as you would with any Apache Iceberg table.</p>
    <div>
      <h2>Pricing</h2>
      <a href="#pricing">
        
      </a>
    </div>
    <p>While R2 Data Catalog is in public beta, there will be no additional charges beyond standard R2 storage and operations costs incurred by query engines accessing data. <a href="https://r2-calculator.cloudflare.com/"><u>Storage pricing</u></a> for buckets with R2 Data Catalog enabled remains the same as standard R2 buckets – $0.015 per GB-month. As always, egress directly from R2 buckets remains $0.</p><p>In the future, we plan to introduce pricing for catalog operations (e.g., creating tables, retrieving table metadata, etc.) and data compaction.</p><p>Below is our current thinking on future pricing. We’ll communicate more details around timing well before billing begins, so you can confidently plan your workloads.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Pricing</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>R2 storage</span></span></p>
                        <p><span><span>For standard storage class</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.015 per GB-month (no change)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>R2 Class A operations</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$4.50 per million operations (no change)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>R2 Class B operations</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.36 per million operations (no change)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Data Catalog operations</span></span></p>
                        <p><span><span>e.g., create table, get table metadata, update table properties</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$9.00 per million catalog operations</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Data Catalog compaction data processed</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.05 per GB processed</span></span></p>
                        <p><span><span>$4.00 per million objects processed</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Data egress</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0 (no change, always free)</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
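<p>As a back-of-the-envelope sketch using the proposed numbers above (the workload figures here are made up for illustration, and catalog operations are not billed during the beta):</p>

```python
# Illustrative monthly estimate from the proposed pricing table above.
storage_gb  = 1_000       # 1 TB stored for a month
class_a_ops = 2_000_000   # writes, lists
class_b_ops = 10_000_000  # reads
catalog_ops = 500_000     # create table, get metadata, update properties

cost = (
    storage_gb * 0.015          # $0.015 per GB-month
    + class_a_ops / 1e6 * 4.50  # $4.50 per million Class A operations
    + class_b_ops / 1e6 * 0.36  # $0.36 per million Class B operations
    + catalog_ops / 1e6 * 9.00  # $9.00 per million catalog operations (proposed)
    + 0.0                       # egress: always $0
)
# Roughly $32.10/month for this hypothetical workload.
```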
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re excited to see how you use R2 Data Catalog! If you’ve never worked with Iceberg – or even analytics data – before, we think this is the easiest way to get started.</p><p>Next on our roadmap is tackling compaction and table optimization. Query engines typically perform better when dealing with fewer, larger data files. We will automatically rewrite collections of small data files into larger files to deliver even faster query performance.</p><p>We’re also collaborating with the broader Apache Iceberg community to expand query-engine compatibility with the Iceberg REST Catalog spec.</p><p>We’d love your feedback. Join the <a href="https://discord.cloudflare.com/"><u>Cloudflare Developer Discord</u></a> to ask questions and share your thoughts during the public beta. For more details, examples, and guides, visit our <a href="https://developers.cloudflare.com/r2/data-catalog/get-started/"><u>developer documentation</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[Data Catalog]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">6JFB9cHUOoMZnVmYIuTLzd</guid>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Garvit Gupta</dc:creator>
            <dc:creator>Alex Graham</dc:creator>
            <dc:creator>Garrett Gu</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare incident on March 21, 2025]]></title>
            <link>https://blog.cloudflare.com/cloudflare-incident-march-21-2025/</link>
            <pubDate>Tue, 25 Mar 2025 01:40:38 GMT</pubDate>
            <description><![CDATA[ On March 21, 2025, multiple Cloudflare services, including R2 object storage, experienced an elevated rate of error responses. Here’s what caused the incident, the impact, and how we’re making sure it doesn’t happen again. ]]></description>
            <content:encoded><![CDATA[ <p>Multiple Cloudflare services, including <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2 object storage</u></a>, experienced an elevated rate of errors for 1 hour and 7 minutes on March 21, 2025 (starting at 21:38 UTC and ending 22:45 UTC). During the incident window, 100% of write operations failed and approximately 35% of read operations to R2 failed globally. Although this incident started with R2, it impacted other Cloudflare services including <a href="https://www.cloudflare.com/developer-platform/products/cache-reserve/"><u>Cache Reserve</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/cloudflare-images/"><u>Images</u></a>, <a href="https://developers.cloudflare.com/logs/edge-log-delivery/"><u>Log Delivery</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/cloudflare-stream/"><u>Stream</u></a>, and <a href="https://developers.cloudflare.com/vectorize/"><u>Vectorize</u></a>.</p><p>While rotating credentials used by the R2 Gateway service (R2's API frontend) to authenticate with our storage infrastructure, the R2 engineering team inadvertently deployed the new credentials (ID and key pair) to a development instance of the service instead of production. When the old credentials were deleted from our storage infrastructure (as part of the key rotation process), the production R2 Gateway service did not have access to the new credentials. This ultimately resulted in R2’s Gateway service not being able to authenticate with our storage backend. There was no data loss or corruption that occurred as part of this incident: any in-flight uploads or mutations that returned successful HTTP status codes were persisted.</p><p>Once the root cause was identified and we realized we hadn’t deployed the new credentials to the production R2 Gateway service, we deployed the updated credentials and service availability was restored. 
</p><p>This incident happened because of human error and lasted longer than it should have because we didn’t have proper visibility into which credentials were being used by the Gateway Worker to authenticate with our storage infrastructure. </p><p>We’re deeply sorry for this incident and the disruption it may have caused you or your users. We hold ourselves to a high standard and this is not acceptable. This blog post explains the impact, exactly what happened and when, and the steps we are taking to make sure this failure (and others like it) doesn’t happen again.</p>
    <div>
      <h2>What was impacted?</h2>
      <a href="#what-was-impacted">
        
      </a>
    </div>
    <p><b>The primary incident window occurred between 21:38 UTC and 22:45 UTC.</b></p><p>The following table details the specific impact to R2 and Cloudflare services that depend on, or interact with, R2:</p>
<div><table><thead>
  <tr>
    <th><span>Product/Service</span></th>
    <th><span>Impact</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>R2</span></td>
    <td><span>All customers using Cloudflare R2 would have experienced an elevated error rate during the primary incident window. Specifically:</span><br /><br /><span>* Object write operations had a 100% error rate.</span><br /><br /><span>* Object reads had an approximate error rate of 35% globally. Individual customer error rate varied during this window depending on access patterns. Customers accessing public assets through custom domains would have seen a reduced error rate as cached object reads were not impacted.</span><br /><br /><span>* Operations involving metadata only (e.g., head and list operations) were not impacted.</span><br /><br /><span>There was no data loss or risk to data integrity within R2's storage subsystem. This incident was limited to a temporary authentication issue between R2's API frontend and our storage infrastructure.</span></td>
  </tr>
  <tr>
    <td><span>Billing</span></td>
    <td><span>Billing uses R2 to store customer invoices. During the primary incident window, customers may have experienced errors when attempting to download/access past Cloudflare invoices.</span></td>
  </tr>
  <tr>
    <td><span>Cache Reserve</span></td>
    <td><span>Cache Reserve customers observed an increase in requests to their origin during the incident window as an increased percentage of reads to R2 failed. This resulted in an increase in requests to origins to fetch assets unavailable in Cache Reserve during this period.</span><br /><br /><span>User-facing requests for assets to sites with Cache Reserve did not observe failures as cache misses failed over to the origin.</span></td>
  </tr>
  <tr>
    <td><span>Email Security</span></td>
    <td><span>Email Security depends on R2 for customer-facing metrics. During the primary incident window, customer-facing metrics would not have updated.</span></td>
  </tr>
  <tr>
    <td><span>Images</span></td>
    <td><span>All (100% of) uploads failed during the primary incident window. Successful delivery of stored images dropped to approximately 25%.</span></td>
  </tr>
  <tr>
    <td><span>Key Transparency Auditor</span></td>
    <td><span>All (100% of) operations failed during the primary incident window due to dependence on R2 writes and/or reads. Once the incident was resolved, service returned to normal operation immediately.</span></td>
  </tr>
  <tr>
    <td><span>Log Delivery</span></td>
    <td><span>Log delivery (for Logpush and Logpull) was delayed during the primary incident window, resulting in significant delays (up to 70 minutes) in log processing. All logs were delivered after incident resolution.</span></td>
  </tr>
  <tr>
    <td><span>Stream</span></td>
    <td><span>All (100% of) uploads failed during the primary incident window. Successful Stream video segment delivery dropped to 94%. Viewers may have seen video stalls every minute or so, although actual impact would have varied.</span><br /><br /><span>Stream Live was down during the primary incident window as it depends on object writes.</span></td>
  </tr>
  <tr>
    <td><span>Vectorize</span></td>
    <td><span>Queries and operations against Vectorize indexes were impacted during the incident window. Vectorize customers would have seen an increased error rate for read queries to indexes, and all (100% of) insert and upsert operations failed, as Vectorize depends on R2 for persistent storage.</span></td>
  </tr>
</tbody></table></div>
    <div>
      <h2>Incident timeline</h2>
      <a href="#incident-timeline">
        
      </a>
    </div>
    <p><b>All timestamps referenced are in Coordinated Universal Time (UTC).</b></p>
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Time</span></th>
    <th><span>Event</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Mar 21, 2025 - 19:49 UTC</span></td>
    <td><span>The R2 engineering team started the credential rotation process. A new set of credentials (ID and key pair) for storage infrastructure was created. Old credentials were maintained to avoid downtime during credential change over.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 20:19 UTC</span></td>
    <td><span>Set updated production secret (</span><span>wrangler secret put</span><span>) and executed </span><span>wrangler deploy</span><span> command to deploy R2 Gateway service with updated credentials. </span><br /><br /><span>Note: We later discovered the </span><span>--env</span><span> parameter was inadvertently omitted for both Wrangler commands. This resulted in credentials being deployed to the Worker assigned to the </span><span>default</span><span> environment instead of the Worker assigned to the </span><span>production</span><span> environment.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 20:20 UTC</span></td>
    <td><span>The R2 Gateway service Worker assigned to the </span><span>default</span><span> environment is now using the updated storage infrastructure credentials.</span><br /><br /><span>Note: This was the wrong Worker; the </span><span>production</span><span> environment should have been explicitly set. But, at this point, we incorrectly believed the credentials were updated on the correct production Worker.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 20:37 UTC</span></td>
    <td><span>Old credentials were removed from our storage infrastructure to complete the credential rotation process.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:38 UTC</span></td>
    <td><span>– IMPACT BEGINS –</span><br /><br /><span>R2 availability metrics begin to show signs of service degradation. The impact to R2 availability metrics was gradual and not immediately obvious because there was a delay in the propagation of the previous credential deletion to storage infrastructure.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:45 UTC</span></td>
    <td><span>R2 global availability alerts are triggered (indicating 2% of error budget burn rate).</span><br /><br /><span>The R2 engineering team began looking at operational dashboards and logs to understand impact.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:50 UTC</span></td>
    <td><span>Internal incident declared.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:51 UTC</span></td>
    <td><span>R2 engineering team observes gradual but consistent decline in R2 availability metrics for both read and write operations. Operations involving metadata only (e.g., head and list operations) were not impacted.</span><br /><br /><span>Given gradual decline in availability metrics, R2 engineering team suspected a potential regression in propagation of new credentials in storage infrastructure.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:05 UTC</span></td>
    <td><span>Public incident status page published.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:15 UTC</span></td>
    <td><span>R2 engineering team created a new set of credentials (ID and key pair) for storage infrastructure in an attempt to force re-propagation.</span><br /><br /><span>Continued monitoring operational dashboards and logs.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:20 UTC</span></td>
    <td><span>R2 engineering team saw no improvement in availability metrics. Continued investigating other potential root causes.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:30 UTC</span></td>
    <td><span>R2 engineering team deployed a new set of credentials (ID and key pair) to the R2 Gateway service Worker. This was to validate whether there was an issue with the credentials we had pushed to the gateway service.</span><br /><br /><span>The environment parameter was still omitted in the </span><span>deploy</span><span> and </span><span>secret put</span><span> commands, so this deployment was still to the wrong non-production Worker.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:36 UTC</span></td>
    <td><span>– ROOT CAUSE IDENTIFIED –</span><br /><br /><span>The R2 engineering team discovered that credentials had been deployed to a non-production Worker by reviewing production Worker release history.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:45 UTC</span></td>
    <td><span>– IMPACT ENDS –</span><br /><br /><span>Deployed credentials to correct production Worker. R2 availability recovered.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:54 UTC</span></td>
    <td><span>The incident is considered resolved.</span></td>
  </tr>
</tbody></table></div>
    <div>
      <h2>Analysis</h2>
      <a href="#analysis">
        
      </a>
    </div>
    <p>R2’s architecture is primarily composed of three parts: the R2 production gateway Worker (which serves requests from the S3 API, REST API, and Workers API), the metadata service, and the storage infrastructure (which stores encrypted object data).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1G9gfdEE4RIuMIUeNz42RN/25d1b4cca187a4a24c600a43ba51fb71/BLOG-2793_2.png" />
          </figure><p>The R2 Gateway Worker uses credentials (ID and key pair) to securely authenticate with our distributed storage infrastructure. We rotate these credentials regularly as a best-practice security precaution.</p><p>Our key rotation process involves the following high-level steps:</p><ol><li><p>Create a new set of credentials (ID and key pair) for our storage infrastructure. At this point, the old credentials are maintained to avoid downtime during credential changeover.</p></li><li><p>Set the new credential secret for the R2 production gateway Worker using the <code>wrangler secret put</code> command.</p></li><li><p>Set the new updated credential ID as an environment variable in the R2 production gateway Worker using the <code>wrangler deploy</code> command. At this point, new storage credentials start being used by the gateway Worker.</p></li><li><p>Remove previous credentials from our storage infrastructure to complete the credential rotation process.</p></li><li><p>Monitor operational dashboards and logs to validate changeover.</p></li></ol><p>The R2 engineering team uses <a href="https://developers.cloudflare.com/workers/wrangler/environments/"><u>Workers environments</u></a> to separate production and development environments for the R2 Gateway Worker. Each environment defines a separate isolated Cloudflare Worker with separate environment variables and secrets. </p><p>Critically, both <code>wrangler secret put</code> and <code>wrangler deploy</code> commands default to the <code>default</code> environment if the <code>--env</code> command-line parameter is not included. In this case, due to human error, we inadvertently omitted the <code>--env</code> parameter and deployed the new storage credentials to the wrong Worker (<code>default</code> environment instead of <code>production</code>). 
To correctly deploy storage credentials to the production R2 Gateway Worker, we need to specify <code>--env production</code>.</p><p>The action we took on step 4 above to remove the old credentials from our storage infrastructure caused authentication errors, as the R2 Gateway production Worker still had the old credentials. This is ultimately what resulted in degraded availability.</p><p>The decline in R2 availability metrics was gradual and not immediately obvious because there was a delay in the propagation of the previous credential deletion to storage infrastructure. This accounted for a delay in our initial discovery of the problem. Instead of relying on availability metrics after updating the old set of credentials, we should have explicitly validated which token was being used by the R2 Gateway service to authenticate with R2's storage infrastructure.</p><p>Overall, the impact on read availability was significantly mitigated by our intermediate cache that sits in front of storage and continued to serve requests.</p>
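<p>One way to view the fix is as a guard rail: a hypothetical pre-flight check (a sketch, not Cloudflare's actual tooling) that refuses to run a deploy or secret command without an explicit <code>--env</code> would have stopped the mistaken deployment before any credentials moved:</p>

```python
# Hypothetical guard rail: reject wrangler deploy/secret commands that
# omit an explicit --env, so secrets can never silently land in the
# default environment. Not Cloudflare's real tooling; illustration only.
REQUIRES_ENV = {"deploy", "secret"}

def check_wrangler_command(argv):
    """Raise if a wrangler deploy/secret command omits an explicit --env."""
    if len(argv) >= 2 and argv[0] == "wrangler" and argv[1] in REQUIRES_ENV:
        if not any(a == "--env" or a.startswith("--env=") for a in argv):
            raise ValueError(
                "refusing %r: missing explicit --env" % " ".join(argv))
    return argv

# The incident's command form would be rejected:
#   check_wrangler_command(["wrangler", "deploy"])  -> ValueError
# The correct form passes through unchanged:
ok = check_wrangler_command(["wrangler", "deploy", "--env", "production"])
```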
    <div>
      <h2>Resolution</h2>
      <a href="#resolution">
        
      </a>
    </div>
    <p>Once we identified the root cause, we were able to resolve the incident quickly by deploying the new credentials to the production R2 Gateway Worker. This resulted in an immediate recovery of R2 availability.</p>
    <div>
      <h2>Next steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>This incident happened because of human error and lasted longer than it should have because we didn’t have proper visibility into which credentials were being used by the R2 Gateway Worker to authenticate with our storage infrastructure.</p><p>We have taken immediate steps to prevent this failure (and others like it) from happening again:</p><ul><li><p>Added logging tags that include the suffix of the credential ID the R2 Gateway Worker uses to authenticate with our storage infrastructure. With this change, we can explicitly confirm which credential is being used.</p></li><li><p>Related to the above step, our internal processes now require explicit confirmation that the suffix of the new token ID matches logs from our storage infrastructure before deleting the previous token.</p></li><li><p>Require that key rotation takes place through our hotfix release tooling instead of relying on manual <code>wrangler</code> command entry, which introduces human error. Our hotfix release deploy tooling explicitly enforces the environment configuration and contains other safety checks.</p></li><li><p>While it has been an implicit standard that at least two people validate these changes as the process progresses, we’ve updated our relevant SOPs (standard operating procedures) to make this requirement explicit.</p></li><li><p><b>In Progress</b>: Extend our existing closed-loop health check system that monitors our endpoints to test new keys, automate reporting of their status through our alerting platform, and ensure global propagation prior to releasing the gateway Worker. 
</p></li><li><p><b>In Progress</b>: To expedite triage on any future issues with our distributed storage endpoints, we are updating our <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> platform to include views of upstream success rates that bypass caching to give clearer indication of issues serving requests for any reason.</p></li></ul><p>The list above is not exhaustive: as we work through the above items, we will likely uncover other improvements to our systems, controls, and processes that we’ll be applying to improve R2’s resiliency, on top of our business-as-usual efforts. We are confident that this set of changes will prevent this failure, and related credential rotation failure modes, from occurring again. Again, we sincerely apologize for this incident and deeply regret any disruption it has caused you or your users.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">4I4XNCQlRirlf9SaA9ySTS</guid>
            <dc:creator>Phillip Jones</dc:creator>
        </item>
        <item>
            <title><![CDATA[Developer Week 2024 wrap-up]]></title>
            <link>https://blog.cloudflare.com/developer-week-2024-wrap-up/</link>
            <pubDate>Mon, 08 Apr 2024 13:00:02 GMT</pubDate>
            <description><![CDATA[ Developer Week 2024 has officially come to a close. Here’s a quick recap of the announcements and in-depth technical explorations that went out last week ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fwPu75tSubJgSS8nJ5gOt/6e2fd9b7cc6f9dcd7b86d73988a6e5fb/Dev-week-wrap-up-1.jpg" />
            
            </figure><p>Developer Week 2024 has officially come to a close. Each day last week, we shipped new products and functionality geared towards giving developers the components they need to build full-stack applications on Cloudflare.</p><p>Even though Developer Week is now over, we are continuing to innovate with the over two million developers who build on our platform. Building a platform is only as exciting as seeing what developers build on it. Before we dive into a recap of the announcements, to send off the week, we wanted to share how a couple of companies are using Cloudflare to power their applications:</p><blockquote><p><i>We have been using Workers for image delivery using R2 and have been able to maintain stable operations for a year after implementation. The speed of deployment and the flexibility of detailed configurations have greatly reduced the time and effort required for traditional server management. In particular, we have seen a noticeable cost savings and are deeply appreciative of the support we have received from Cloudflare Workers.</i>- <a href="http://www.fancs.com/">FAN Communications</a></p></blockquote><blockquote><p><i>Milkshake helps creators, influencers, and business owners create engaging web pages directly from their phone, to simply and creatively promote their projects and passions. Cloudflare has helped us migrate data quickly and affordably with R2. We use Workers as a routing layer between our users' websites and their images and assets, and to build a personalized analytics offering affordably. 
Cloudflare’s innovations have consistently allowed us to run infrastructure at a fraction of the cost of other developer platforms and we have been eagerly awaiting updates to D1 and Queues to sustainably scale Milkshake as the product continues to grow.</i>- <a href="https://milkshake.app/">Milkshake</a></p></blockquote><p>In case you missed anything, here’s a quick recap of the announcements and in-depth technical explorations that went out last week:</p>
    <div>
      <h2>Summary of announcements</h2>
      <a href="#summary-of-announcements">
        
      </a>
    </div>
    
    <div>
      <h3>Monday</h3>
      <a href="#monday">
        
      </a>
    </div>
    
<table>
<thead>
  <tr>
    <th><span>Announcement</span></th>
    <th><span>Summary</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><a href="https://blog.cloudflare.com/making-full-stack-easier-d1-ga-hyperdrive-queues/"><span>Making state easy with D1 GA, Hyperdrive, Queues and Workers Analytics Engine updates</span></a></td>
    <td><span>A core part of any full-stack application is storing and persisting data! We kicked off the week with announcements that help developers build stateful applications on top of Cloudflare, including making D1, Cloudflare’s SQL database, and Hyperdrive, our database-accelerating service, generally available.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/building-d1-a-global-database/"><span>Building D1: a Global Database</span></a></td>
    <td><span>D1, Cloudflare’s SQL database, is now generally available. With new support for 10GB databases, data export, and enhanced query debugging, we empower developers to build production-ready applications with D1 to meet all their relational SQL needs. To support Workers in global applications, we’re sharing a sneak peek of our design and API for D1 global read replication to demonstrate how developers scale their workloads with D1.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/workers-environment-live-object-bindings/"><span>Why Workers environment variables contain live objects</span></a></td>
    <td><span>Bindings don't just reduce boilerplate. They are a core design feature of the Workers platform which simultaneously improve developer experience and application security in several ways. Usually these two goals are in opposition to each other, but bindings elegantly solve for both at the same time.</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Tuesday</h3>
      <a href="#tuesday">
        
      </a>
    </div>
    
<table>
<thead>
  <tr>
    <th><span>Announcement</span></th>
    <th><span>Summary</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><a href="https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support/"><span>Leveling up Workers AI: General Availability and more new capabilities</span></a></td>
    <td><span>We made a series of AI-related announcements, including Workers AI (Cloudflare’s inference platform) becoming GA, support for fine-tuned models with LoRAs, one-click deploys from HuggingFace, Python support for Cloudflare Workers, and more.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/fine-tuned-inference-with-loras/"><span>Running fine-tuned models on Workers AI with LoRAs</span></a></td>
    <td><span>Workers AI now supports fine-tuned models using LoRAs. But what is a LoRA and how does it work? In this post, we dive into fine-tuning, LoRAs and even some math to share the details of how it all works under the hood.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/python-workers/"><span>Bringing Python to Workers using Pyodide and WebAssembly</span></a></td>
    <td><span>We introduced Python support for Cloudflare Workers, now in open beta. We've revamped our systems to support Python, from the Workers runtime itself to the way Workers are deployed to Cloudflare’s network. Learn about a Python Worker's lifecycle, Pyodide, dynamic linking, and memory snapshots in this post.</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Wednesday</h3>
      <a href="#wednesday">
        
      </a>
    </div>
    
<table>
<thead>
  <tr>
    <th><span>Announcement</span></th>
    <th><span>Summary</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><a href="https://blog.cloudflare.com/r2-events-gcs-migration-infrequent-access/"><span>R2 adds event notifications, support for migrations from Google Cloud Storage, and an infrequent access storage tier</span></a></td>
    <td><span>We announced three new features for Cloudflare R2: event notifications, support for migrations from Google Cloud Storage, and an infrequent access storage tier.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/data-anywhere-events-pipelines-durable-execution-workflows/"><span>Data Anywhere with Pipelines, Event Notifications, and Workflows</span></a></td>
    <td><span>We’re making it easier to build scalable, reliable, data-driven applications on top of our global network, and so we announced a new Event Notifications framework; our take on durable execution, Workflows; and an upcoming streaming ingestion service, Pipelines.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/prisma-orm-and-d1/"><span>Improving Cloudflare Workers and D1 developer experience with Prisma ORM</span></a></td>
    <td><span>Together, Cloudflare and Prisma make it easier than ever to deploy globally available apps with a focus on developer experience. To further that goal, Prisma ORM now natively supports Cloudflare Workers and D1 in Preview. With version 5.12.0 of Prisma ORM you can now interact with your data stored in D1 from your Cloudflare Workers with the convenience of the Prisma Client API. Learn more and try it out now.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/picsart-move-to-workers-huge-performance-gains/"><span>How Picsart leverages Cloudflare's Developer Platform to build globally performant services</span></a></td>
    <td><span>Picsart, one of the world’s largest digital creation platforms, encountered performance challenges in catering to its global audience. Adopting Cloudflare's global-by-default Developer Platform emerged as the optimal solution, empowering Picsart to enhance performance and scalability substantially.</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Thursday</h3>
      <a href="#thursday">
        
      </a>
    </div>
    
<table>
<thead>
  <tr>
    <th><span>Announcement</span></th>
    <th><span>Summary</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><a href="https://blog.cloudflare.com/pages-workers-integrations-monorepos-nextjs-wrangler/"><span>Announcing Pages support for monorepos, wrangler.toml, database integrations and more!</span></a></td>
    <td><span>We launched four improvements to Pages that bring functionality previously restricted to Workers, with the goal of unifying the development experience between the two. Support for monorepos, wrangler.toml, new additions to Next.js support and database integrations!</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/workers-production-safety/"><span>New tools for production safety — Gradual Deployments, Stack Traces, Rate Limiting, and API SDKs</span></a></td>
    <td><span>Production readiness isn’t just about scale and reliability of the services you build with. We announced five updates that put more power in your hands – Gradual Deployments, Source mapped stack traces in Tail Workers, a new Rate Limiting API, brand-new API SDKs, and updates to Durable Objects – each built with mission-critical production services in mind.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/whats-next-for-cloudflare-media/"><span>What’s new with Cloudflare Media: updates for Calls, Stream, and Images</span></a></td>
    <td><span>With Cloudflare Calls in open beta, you can build real-time, serverless video and audio applications. Cloudflare Stream lets your viewers instantly clip from ongoing streams. Finally, Cloudflare Images now supports automatic face cropping and has an upload widget that lets you easily integrate into your application.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/cloudflare-calls-anycast-webrtc/"><span>Cloudflare Calls: millions of cascading trees all the way down</span></a></td>
    <td><span>Cloudflare Calls is a serverless SFU and TURN service running at Cloudflare’s edge. It’s now in open beta and costs $0.05 per real-time GB. It’s 100% anycast WebRTC.</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Friday</h3>
      <a href="#friday">
        
      </a>
    </div>
    
<table>
<thead>
  <tr>
    <th><span>Announcement</span></th>
    <th><span>Summary</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><a href="https://blog.cloudflare.com/browser-rendering-api-ga-rolling-out-cloudflare-snippets-swr-and-bringing-workers-for-platforms-to-our-paygo-plans/"><span>Browser Rendering API GA, rolling out Cloudflare Snippets, SWR, and bringing Workers for Platforms to all users</span></a></td>
    <td><span>Browser Rendering API is now available to all paid Workers customers with improved session management.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/cloudflare-acquires-baselime-expands-observability-capabilities/"><span>Cloudflare acquires Baselime to expand serverless application observability capabilities</span></a></td>
    <td><span>We announced that Cloudflare has acquired Baselime, a serverless observability company.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/cloudflare-acquires-partykit/"><span>Cloudflare acquires PartyKit to allow developers to build real-time multi-user applications</span></a></td>
    <td><span>We announced that PartyKit, a trailblazer in enabling developers to craft ambitious real-time, collaborative, multiplayer applications, is now a part of Cloudflare. This acquisition marks a significant milestone in our journey to redefine the boundaries of serverless computing, making it more dynamic, interactive, and, importantly, stateful.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/blazing-fast-development-with-full-stack-frameworks-and-cloudflare/"><span>Blazing fast development with full-stack frameworks and Cloudflare</span></a></td>
    <td><span>Full-stack web development with Cloudflare is now faster and easier! You can now use your framework’s development server while accessing D1 databases, R2 object stores, AI models, and more. Iterate locally in milliseconds to build sophisticated web apps that run on Cloudflare. Let’s dev together!</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/javascript-native-rpc/"><span>We've added JavaScript-native RPC to Cloudflare Workers</span></a></td>
    <td><span>Cloudflare Workers now features a built-in RPC (Remote Procedure Call) system for use in Worker-to-Worker and Worker-to-Durable Object communication, with absolutely minimal boilerplate. We've designed an RPC system so expressive that calling a remote service can feel like using a library.</span></td>
  </tr>
  <tr>
    <td><a href="https://blog.cloudflare.com/2024-community-update/"><span>Community Update: empowering startups building on Cloudflare and creating an inclusive community</span></a></td>
    <td><span>We closed out Developer Week by sharing updates on our Workers Launchpad program, our latest Developer Challenge, and the work we’re doing to ensure our community spaces – like our Discord and Community forums – are safe and inclusive for all developers.</span></td>
  </tr>
</tbody>
</table><p>Here's a video summary, by Craig Dennis, Developer Educator, AI:</p><blockquote><p>🏃<a href="https://twitter.com/CloudflareDev?ref_src=twsrc%5Etfw">@CloudflareDev</a> Developer Week 2024 🧡 ICYMI 🧡 Speed run <a href="https://t.co/0uzPJshC93">pic.twitter.com/0uzPJshC93</a></p>— Craig Dennis (@craigsdennis) <a href="https://twitter.com/craigsdennis/status/1778875721575989734?ref_src=twsrc%5Etfw">April 12, 2024</a></blockquote> 
    <div>
      <h3>Continue the conversation</h3>
      <a href="#continue-the-conversation">
        
      </a>
    </div>
    <p>Thank you for being a part of Developer Week! Want to continue the conversation and share what you’re building? Join us on <a href="https://discord.com/invite/cloudflaredev">Discord</a>. To get started building on Workers, check out our <a href="https://developers.cloudflare.com/workers/">developer documentation</a>.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Cloudflare Pages]]></category>
            <category><![CDATA[Rate Limiting]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[D1]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <guid isPermaLink="false">VNnYecAmN7CpST4nBbas0</guid>
            <dc:creator>Phillip Jones</dc:creator>
        </item>
        <item>
            <title><![CDATA[Sippy helps you avoid egress fees while incrementally migrating data from S3 to R2]]></title>
            <link>https://blog.cloudflare.com/sippy-incremental-migration-s3-r2/</link>
            <pubDate>Tue, 26 Sep 2023 13:00:44 GMT</pubDate>
            <description><![CDATA[ Use Sippy to incrementally migrate data from S3 to R2 as it’s requested and avoid migration-specific egress fees ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4whKfl7szHhdKNWCiCETR3/d65f22ad8c25598a41426bd1de59188e/image1-16.png" />
            
            </figure><p>Earlier in 2023, we announced <a href="/r2-super-slurper-ga/">Super Slurper</a>, a data migration tool that makes it easy to copy large amounts of data to <a href="https://developers.cloudflare.com/r2/">R2</a> from other <a href="https://www.cloudflare.com/developer-platform/products/r2/">cloud object storage</a> providers. Since the announcement, developers have used Super Slurper to run thousands of successful migrations to R2!</p><p>While Super Slurper is perfect for cases where you want to move all of your data to R2 at once, there are scenarios where you may want to migrate your data incrementally over time. Maybe you want to avoid the one time upfront <a href="https://www.cloudflare.com/learning/cloud/what-is-aws-data-transfer-pricing/">AWS data transfer bill</a>? Or perhaps you have legacy data that may never be accessed, and you only want to migrate what’s required?</p><p>Today, we’re announcing the open beta of <a href="https://developers.cloudflare.com/r2/data-migration/sippy/">Sippy</a>, an incremental migration service that copies data from S3 (other cloud providers coming soon!) to R2 as it’s requested, without paying unnecessary cloud egress fees typically associated with moving large amounts of data. On top of addressing <a href="https://www.cloudflare.com/learning/cloud/what-is-vendor-lock-in/">vendor lock-in</a>, Sippy makes stressful, time-consuming migrations a thing of the past. All you need to do is replace the S3 endpoint in your application or attach your domain to your new R2 bucket and data will start getting copied over.</p>
    <div>
      <h2>How does it work?</h2>
      <a href="#how-does-it-work">
        
      </a>
    </div>
    <p>Sippy is an incremental migration service built directly into your R2 bucket. It reduces migration-specific <a href="https://www.cloudflare.com/learning/cloud/what-are-data-egress-fees/">egress fees</a> by piggybacking on requests already flowing through your application: on reads where you’d be paying egress fees anyway, Sippy simultaneously copies the requested objects to R2. Here is how it works:</p><p>When an object is requested through <a href="https://developers.cloudflare.com/r2/api/workers/">Workers</a>, the <a href="https://developers.cloudflare.com/r2/api/s3/">S3 API</a>, or a <a href="https://developers.cloudflare.com/r2/buckets/public-buckets/">public bucket</a>, it is served from your R2 bucket if it is found there.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2QSl0q22LEdoOUp6BzkWhE/12f69030f3c5a4d53edf0a7d62af8b43/image2-16.png" />
            
            </figure><p>If the object is not found in R2, it will simultaneously be returned from your S3 bucket and copied to R2.</p><p>Note: Some large objects may take multiple requests to copy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ok5dS498ocJgCBRR1V5HU/683129c203b3343af7a4aa1c29e15b8a/image3-22.png" />
            
            </figure><p>That means after objects are copied, subsequent requests will be served from R2, and you’ll begin saving on egress fees immediately.</p>
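<p>The read-through flow described above can be sketched in a few lines of Python. This is an illustration of the pattern only, not Cloudflare’s implementation; the two dictionaries stand in for your R2 and S3 buckets:</p>

```python
# Read-through copy, the pattern Sippy applies (illustrative sketch only).
def read_through(key, r2_bucket, s3_bucket):
    """Serve from R2 if present; otherwise serve from S3 and copy to R2."""
    if key in r2_bucket:
        return r2_bucket[key], "r2"
    body = s3_bucket[key]      # one last egress-billed read from S3
    r2_bucket[key] = body      # copy, so later reads are served by R2
    return body, "s3"

s3_bucket = {"photo.png": b"\x89PNG..."}
r2_bucket = {}

first, origin1 = read_through("photo.png", r2_bucket, s3_bucket)   # served from S3, copied
second, origin2 = read_through("photo.png", r2_bucket, s3_bucket)  # now served from R2
```

<p>As noted above, a real object larger than a single request can carry may take multiple requests before the copy completes; the sketch ignores that detail.</p>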
    <div>
      <h2>Start incrementally migrating data from S3 to R2</h2>
      <a href="#start-incrementally-migrating-data-from-s3-to-r2">
        
      </a>
    </div>
    
    <div>
      <h3>Create an R2 bucket</h3>
      <a href="#create-an-r2-bucket">
        
      </a>
    </div>
    <p>To get started with incremental migration, you’ll first need to create an R2 bucket if you don’t already have one. To create a new R2 bucket from the Cloudflare dashboard:</p><ol><li><p>Log in to the <a href="https://dash.cloudflare.com/">Cloudflare dashboard</a> and select <b>R2</b>.</p></li><li><p>Select <b>Create bucket</b>.</p></li><li><p>Give your bucket a name and select <b>Create bucket</b>.</p></li></ol><p>To learn more about other ways to create R2 buckets, refer to the documentation on <a href="https://developers.cloudflare.com/r2/buckets/create-buckets/">creating buckets</a>.</p>
    <div>
      <h3>Enable Sippy on your R2 bucket</h3>
      <a href="#enable-sippy-on-your-r2-bucket">
        
      </a>
    </div>
    <p>Next, you’ll enable Sippy for the R2 bucket you created. During the beta, you can do this by using the API. Here’s an example of how to enable Sippy for an R2 bucket with cURL:</p>
            <pre><code>curl -X PUT https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/buckets/{bucket_name}/sippy \
--header "Authorization: Bearer &lt;API_TOKEN&gt;" \
--data '{"provider": "AWS", "bucket": "&lt;AWS_BUCKET_NAME&gt;", "zone": "&lt;AWS_REGION&gt;","key_id": "&lt;AWS_ACCESS_KEY_ID&gt;", "access_key":"&lt;AWS_SECRET_ACCESS_KEY&gt;", "r2_key_id": "&lt;R2_ACCESS_KEY_ID&gt;", "r2_access_key": "&lt;R2_SECRET_ACCESS_KEY&gt;"}'</code></pre>
            <p>For more information on getting started, please refer to the <a href="https://developers.cloudflare.com/r2/data-migration/sippy/">documentation</a>. Once enabled, requests to your bucket will now start copying data over from S3 if it’s not already present in your R2 bucket.</p>
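<p>The same call can also be issued from Python using only the standard library. This is a sketch with the same placeholder values as the cURL example; the helper name <code>build_sippy_request</code> is introduced here purely for illustration:</p>

```python
# Build the same PUT request as the cURL example above, using only the
# standard library. Placeholder values (<API_TOKEN>, bucket names, AWS
# and R2 keys) must be replaced with real ones before sending.
import json
import urllib.request

def build_sippy_request(account_id, bucket_name, api_token, config):
    url = (f"https://api.cloudflare.com/client/v4/accounts/"
           f"{account_id}/r2/buckets/{bucket_name}/sippy")
    return urllib.request.Request(
        url,
        data=json.dumps(config).encode(),
        headers={"Authorization": f"Bearer {api_token}",
                 "Content-Type": "application/json"},
        method="PUT",
    )

config = {
    "provider": "AWS",
    "bucket": "<AWS_BUCKET_NAME>",
    "zone": "<AWS_REGION>",
    "key_id": "<AWS_ACCESS_KEY_ID>",
    "access_key": "<AWS_SECRET_ACCESS_KEY>",
    "r2_key_id": "<R2_ACCESS_KEY_ID>",
    "r2_access_key": "<R2_SECRET_ACCESS_KEY>",
}

req = build_sippy_request("{account_id}", "{bucket_name}", "<API_TOKEN>", config)
# Send with: urllib.request.urlopen(req)
```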
    <div>
      <h3>Finish your migration with Super Slurper</h3>
      <a href="#finish-your-migration-with-super-slurper">
        
      </a>
    </div>
    <p>You can run your incremental migration for as long as you want, but eventually you may want to complete the migration to R2. To do this, you can pair Sippy with <a href="https://developers.cloudflare.com/r2/data-migration/super-slurper/">Super Slurper</a> to easily migrate the remaining data that hasn’t yet been accessed over to R2.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re excited about the open beta, but it’s only the starting point. Next, we plan on making incremental migration configurable from the Cloudflare dashboard, complete with analytics that show you the progress of your migration and how much you are saving by not paying egress fees for objects that have been copied over so far.</p><p>If you are looking to start incrementally migrating your data to R2 and have any questions or feedback on what we should build next, we encourage you to join our <a href="https://discord.com/invite/cloudflaredev">Discord community</a> to share!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5OxNi1UZsYR9ITa7KbaQ11/753b4c01ddbed65d637c55b0b91a47de/Kfb7dwYUUzfrLKPH_ukrJRTvRlfl4E8Uy00vwEQPCTiW0IQ--fxpikjv1p0afm4A5J3JfVjQiOVjN3RMNeMcu3vhnz97pEmENCkNIuwdW_m-aW7ABfZnmUpJB_jh.png" />
            
            </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <category><![CDATA[R2]]></category>
            <guid isPermaLink="false">4Oc5iRahgLMh31qeAxVZPt</guid>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare R2 and MosaicML enable training LLMs on any compute, anywhere in the world, with zero switching costs]]></title>
            <link>https://blog.cloudflare.com/cloudflare-r2-mosaicml-train-llms-anywhere-faster-cheaper/</link>
            <pubDate>Tue, 16 May 2023 13:00:54 GMT</pubDate>
            <description><![CDATA[ Together, Cloudflare and MosaicML give customers the freedom to train LLMs on any compute, anywhere in the world, with zero switching costs. That means faster, cheaper training runs, and no vendor lock in. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BYHRApu97ZOlXvHf0cIMs/4f30b2009add920757a79e9f6c6ca1b5/111.png" />
            
            </figure><p>Building the large language models (LLMs) and diffusion models that power <a href="https://www.cloudflare.com/learning/ai/what-is-generative-ai/">generative AI</a> requires massive infrastructure. The most obvious component is compute – hundreds to thousands of GPUs – but an equally critical (and often overlooked) component is the <b>data storage infrastructure.</b> Training datasets can be terabytes to petabytes in size, and this data needs to be read in parallel by thousands of processes. In addition, model checkpoints need to be saved frequently throughout a training run, and for LLMs these checkpoints can each be hundreds of gigabytes!</p><p>To <a href="https://r2-calculator.cloudflare.com/">manage storage costs</a> and scalability, many machine learning teams have been moving to <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a> to host their datasets and checkpoints. Unfortunately, most <a href="https://www.cloudflare.com/developer-platform/products/r2/">object store providers</a> use egress fees to “lock in” users to their platform. This makes it very difficult to leverage GPU capacity across multiple cloud providers, or take advantage of lower / dynamic pricing elsewhere, since the data and model checkpoints are too expensive to move. At a time when cloud GPUs are scarce, and new hardware options are entering the market, it’s more important than ever to stay flexible.</p><p>In addition to high egress fees, there is a technical barrier to object-store-centric machine learning training. Reading and writing data between object storage and compute clusters requires high throughput, efficient use of network bandwidth, determinism, and elasticity (the ability to train on different #s of GPUs). Building training software to handle all of this correctly and reliably is hard!</p><p>Today, we’re excited to show how MosaicML’s tools and Cloudflare R2 can be used together to address these challenges. 
First, with MosaicML’s open source <a href="https://github.com/mosaicml/streaming">StreamingDataset</a> and <a href="https://github.com/mosaicml/composer">Composer</a> libraries, you can easily stream in training data and read/write model checkpoints back to R2. All you need is an Internet connection. Second, thanks to R2’s zero-egress pricing, you can start/stop/move/resize jobs in response to GPU availability and prices across compute providers, without paying any data transfer fees. The MosaicML training platform makes it dead simple to orchestrate such training jobs across multiple clouds.</p><p>Together, Cloudflare and MosaicML give you the freedom to train LLMs on <i>any</i> compute, <i>anywhere</i> in the world, with zero switching costs. That means faster, cheaper training runs, and no vendor lock in :)</p><blockquote><p><i>“With the MosaicML training platform, customers can efficiently use R2 as the durable storage backend for training LLMs on any compute provider with zero egress fees. AI companies are facing outrageous cloud costs, and they are on the hunt for the tools that can provide them with the speed and flexibility to train their best model at the best price.”</i> – <b>Naveen Rao, CEO and co-founder, MosaicML</b></p></blockquote>
    <div>
      <h3>Reading data from R2 using StreamingDataset</h3>
      <a href="#reading-data-from-r2-using-streamingdataset">
        
      </a>
    </div>
    <p>To read data from R2 efficiently and deterministically, you can use the MosaicML <a href="https://github.com/mosaicml/streaming">StreamingDataset</a> library. First, write your training data (images, text, video, anything!) into <code>.mds</code> shard files using the provided Python API:</p>
            <pre><code>import numpy as np
from PIL import Image
from streaming import MDSWriter

# Local or remote directory in which to store the compressed output files
data_dir = 'path-to-dataset'

# A dictionary mapping input fields to their data types
columns = {
    'image': 'jpeg',
    'class': 'int'
}

# Shard compression, if any
compression = 'zstd'

# Save the samples as shards using MDSWriter
with MDSWriter(out=data_dir, columns=columns, compression=compression) as out:
    for i in range(10000):
        sample = {
            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
            'class': np.random.randint(10),
        }
        out.write(sample)</code></pre>
            <p>After your dataset has been converted, you can upload it to R2. Below we demonstrate this with the <code>awscli</code> command line tool, but you can also use <code>wrangler</code> or any <a href="https://www.cloudflare.com/developer-platform/solutions/s3-compatible-object-storage/">S3-compatible tool</a> of your choice. StreamingDataset will also support direct cloud writing to R2 soon!</p>
            <pre><code>$ aws s3 cp --recursive path-to-dataset s3://my-bucket/folder --endpoint-url $S3_ENDPOINT_URL</code></pre>
            <p>Finally, you can read the data into any device that has read access to your R2 bucket. You can fetch individual samples, loop over the dataset, and feed it into a standard PyTorch dataloader.</p>
            <pre><code>from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Make sure that R2 credentials and $S3_ENDPOINT_URL are set in your environment    
# e.g. export S3_ENDPOINT_URL="https://[uid].r2.cloudflarestorage.com"

# Remote path where full dataset is persistently stored
remote = 's3://my-bucket/folder'

# Local working dir where dataset is cached during operation
local = '/tmp/path-to-dataset'

# Create streaming dataset
dataset = StreamingDataset(local=local, remote=remote, shuffle=True)

# Let's see what is in sample #1337...
sample = dataset[1337]
img = sample['image']
cls = sample['class']

# Create PyTorch DataLoader
dataloader = DataLoader(dataset)</code></pre>
            <p>StreamingDataset comes out of the box with high performance, elastic determinism, fast resumption, and multi-worker support. It also uses smart shuffling and distribution to ensure that download bandwidth is minimized. Across a variety of workloads such as LLMs and diffusion models, we find that there is no impact on training throughput (no dataloader bottleneck) when training from object stores like R2. For more information, check out the StreamingDataset <a href="https://www.mosaicml.com/blog/mosaicml-streamingdataset">announcement blog</a>!</p>
    <div>
      <h3>Reading/writing model checkpoints to R2 using Composer</h3>
      <a href="#reading-writing-model-checkpoints-to-r2-using-composer">
        
      </a>
    </div>
    <p>Streaming data into your training loop solves half of the problem, but how do you load/save your model checkpoints? Luckily, if you use a training library like <a href="https://github.com/mosaicml/composer">Composer</a>, it’s as easy as pointing at an R2 path!</p>
            <pre><code>from composer import Trainer
...

# Make sure that R2 credentials and $S3_ENDPOINT_URL are set in your environment
# e.g. export S3_ENDPOINT_URL="https://[uid].r2.cloudflarestorage.com"

trainer = Trainer(
        run_name='mpt-7b',
        model=model,
        train_dataloader=train_loader,
        ...
        save_folder='s3://my-bucket/mpt-7b/checkpoints',
        save_interval='1000ba',
        # load_path='s3://my-bucket/mpt-7b-prev/checkpoints/ep0-ba100-rank0.pt',
    )</code></pre>
            <p>Composer uses asynchronous uploads to minimize wait time as checkpoints are being saved during training. It also works out of the box with multi-GPU and multi-node training, and <b>does not require a shared file system.</b> This means you can skip setting up an expensive EFS/NFS system for your compute cluster, saving thousands of dollars or more per month on public clouds. All you need is an Internet connection and appropriate credentials – your checkpoints arrive safely in your R2 bucket giving you scalable and secure storage for your private models.</p>
    <div>
      <h3>Using MosaicML and R2 to train anywhere efficiently</h3>
      <a href="#using-mosaicml-and-r2-to-train-anywhere-efficiently">
        
      </a>
    </div>
    <p>Using the above tools together with Cloudflare R2 enables users to run training workloads on any compute provider, with total freedom and zero switching costs.</p><p>As a demonstration, in the figure below we use the MosaicML training platform to launch an LLM training job starting on Oracle Cloud Infrastructure, with data streaming in and checkpoints uploaded back to R2. Partway through, we pause the job and seamlessly resume on a different set of GPUs on Amazon Web Services. Composer loads the model weights from the last saved checkpoint in R2, and the streaming dataloader instantly resumes to the correct batch. Training continues deterministically. Finally, we move again to Google Cloud to finish the run.</p><p>As we train our LLM across three cloud providers, the only costs we pay are for GPU compute and data storage. No egress fees or lock-in thanks to Cloudflare R2!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6d7rJLiQ1wI12thUIr792s/2eacae82cc201106faed1af1b079b4d2/image2-19.png" />
            
            </figure><p><i>Using the MosaicML training platform with Cloudflare R2 to run an LLM training job across three different cloud providers, with zero egress fees.</i></p>
            <pre><code>$ mcli get clusters
NAME            PROVIDER      GPU_TYPE   GPUS             INSTANCE                   NODES
mml-1            MosaicML   │  a100_80gb  8             │  mosaic.a100-80sxm.1        1    
                            │  none       0             │  cpu                        1    
gcp-1            GCP        │  t4         -             │  n1-standard-48-t4-4        -    
                            │  a100_40gb  -             │  a2-highgpu-8g              -    
                            │  none       0             │  cpu                        1    
aws-1            AWS        │  a100_40gb  ≤8,16,...,32  │  p4d.24xlarge               ≤4   
                            │  none       0             │  cpu                        1    
oci-1            OCI        │  a100_40gb  8,16,...,64   │  oci.bm.gpu.b4.8            ≤8  
                            │  none       0             │  cpu                        1    

$ mcli create secret s3 --name r2-creds --config-file path/to/config --credentials-file path/to/credentials
✔  Created s3 secret: r2-creds      

$ mcli create secret env S3_ENDPOINT_URL="https://[uid].r2.cloudflarestorage.com"
✔  Created environment secret: s3-endpoint-url      
               
$ mcli run -f mpt-125m-r2.yaml --follow
✔  Run mpt-125m-r2-X2k3Uq started                                                                                    
i  Following run logs. Press Ctrl+C to quit.                                                                            
                                                                                                                        
Cloning into 'llm-foundry'...</code></pre>
            <p><i>Using the MCLI command line tool to manage compute clusters and secrets, and to submit runs.</i></p>
            <pre><code>### mpt-125m-r2.yaml ###
# set up secrets once with `mcli create secret ...`
# and they will be present in the environment in any subsequent run

integrations:
- integration_type: git_repo
  git_repo: mosaicml/llm-foundry
  git_branch: main
  pip_install: -e .[gpu]

image: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04

command: |
  cd llm-foundry/scripts
  composer train/train.py train/yamls/mpt/125m.yaml \
    data_remote=s3://bucket/path-to-data \
    max_duration=100ba \
    save_folder=s3://checkpoints/mpt-125m \
    save_interval=20ba

run_name: mpt-125m-r2

gpu_num: 8
gpu_type: a100_40gb
cluster: oci-1  # can be any compute cluster!</code></pre>
            <p><i>An MCLI job template. Specify a run name, a Docker image, a set of commands, and a compute cluster to run on.</i></p>
    <div>
      <h3>Get started today!</h3>
      <a href="#get-started-today">
        
      </a>
    </div>
    <p>The MosaicML platform is an invaluable tool to take your training to the next level, and in this post, we explored how Cloudflare R2 empowers you to train models on your own data, with any compute provider – or all of them. By eliminating egress fees, R2’s storage is an exceptionally cost-effective complement to MosaicML training, providing maximum autonomy and control. With this combination, you can switch between cloud service providers to fit your organization’s needs over time.</p><p>To learn more about using MosaicML to train custom state-of-the-art AI on your own data, visit <a href="https://www.mosaicml.com/">here</a> or <a href="https://docs.google.com/forms/d/e/1FAIpQLSepW7QB3Xkv6T7GJRwrR9DmGAEjm5G2lBxJC7PUe3JXcBZYbw/viewform">get in touch</a>.</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
     ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Egress]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[LLM]]></category>
            <guid isPermaLink="false">4ETryNsT8L8QFX8tPNzeye</guid>
            <dc:creator>Abhinav Venigalla (Guest Author)</dc:creator>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Abhi Das</dc:creator>
        </item>
        <item>
            <title><![CDATA[The S3 to R2 Super Slurper is now Generally Available]]></title>
            <link>https://blog.cloudflare.com/r2-super-slurper-ga/</link>
            <pubDate>Tue, 16 May 2023 13:00:50 GMT</pubDate>
            <description><![CDATA[ Use Super Slurper to quickly, securely, and easily migrate data from S3 to R2. ]]></description>
            <content:encoded><![CDATA[ <p>R2 is Cloudflare’s zero <a href="https://www.cloudflare.com/learning/cloud/what-are-data-egress-fees/">egress fee</a> <a href="https://www.cloudflare.com/developer-platform/products/r2/">object storage platform</a>. One of the things that developers love about R2 is how easy it is to get started. With R2’s <a href="https://www.cloudflare.com/developer-platform/solutions/s3-compatible-object-storage/">S3-compatible API</a>, integrating R2 into existing applications only requires changing a couple of lines of code.</p><p>However, migrating data from other <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a> providers into R2 can still be a challenge. To address this issue, we introduced the beta of <a href="/cloudflare-r2-super-slurper/">R2 Super Slurper</a> late last year. During the beta period, we’ve been able to partner with early adopters on hundreds of successful migrations from S3 to <a href="https://www.cloudflare.com/developer-platform/r2/">Cloudflare R2</a>. We’ve made many improvements during the beta, including speed (up to 5x increase in the number of objects copied per second!), reliability, and the ability to copy data between R2 buckets. Today, we’re proud to announce the general availability of Super Slurper for one-time migration, making <a href="https://www.cloudflare.com/learning/cloud/what-is-data-migration/">data migration</a> a breeze!</p>
    <div>
      <h2>Data migration that’s fast, reliable, and easy to use</h2>
      <a href="#data-migration-thats-fast-reliable-and-easy-to-use">
        
      </a>
    </div>
    <p>R2 Super Slurper one-time migration allows you to quickly and easily copy objects from S3 to an R2 bucket of your choice.</p>
    <div>
      <h3>Fast</h3>
      <a href="#fast">
        
      </a>
    </div>
    <p>Super Slurper copies objects from your S3 buckets in parallel and uses Cloudflare’s global network to tap into vast amounts of bandwidth to ensure migrations finish fast.</p><blockquote><p>This migration tool is impressively fast! We expected our migration to take a day to complete, but we were able to move all of our data in less than half an hour. - <b>Nick Inhofe</b>, Engineering Manager at <a href="https://www.pdq.com/">PDQ</a></p></blockquote>
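    <p>The parallel copying described above can be sketched with a thread pool. This is an illustrative Python sketch, not Super Slurper’s actual implementation; <code>copy_one</code> is a hypothetical per-object copy, and in-memory dicts stand in for the buckets:</p>
            <pre><code>from concurrent.futures import ThreadPoolExecutor

def copy_one(key, source, dest):
    """Hypothetical per-object copy: read from the source bucket, write to dest."""
    dest[key] = source[key]
    return key

def migrate(source, dest, workers=8):
    """Copy every object from source to dest, many objects in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda k: copy_one(k, source, dest), source))

source = {f"obj-{i}": bytes([i]) for i in range(100)}
dest = {}
migrate(source, dest)
assert dest == source</code></pre>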
    <div>
      <h3>Reliable</h3>
      <a href="#reliable">
        
      </a>
    </div>
    <p>Sending objects through the Internet can sometimes fail. R2 Super Slurper accounts for that, and is capable of driving multi-terabyte migrations to completion with robust retries of failed transfers. Additionally, larger objects are transferred in chunks, so if something goes wrong, it only retries the portion of the object that’s needed. This means faster migrations and <a href="https://r2-calculator.cloudflare.com/">lower cost</a>. And if for some reason an object just won’t transfer, it gets logged, so you can keep track and sort it out later.</p>
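    <p>The chunked-retry approach described above can be sketched as follows. This is an illustrative Python sketch, not Super Slurper’s actual implementation; <code>FlakySource</code> simulates a source bucket whose ranged reads fail intermittently:</p>
            <pre><code>class FlakySource:
    """Simulates a source bucket where every other ranged read fails."""
    def __init__(self, data):
        self.data = data
        self.calls = 0

    def read_range(self, offset, size):
        self.calls += 1
        if self.calls % 2 == 1:  # every odd-numbered call fails
            raise ConnectionError("transient network failure")
        return self.data[offset:offset + size]

def copy_object(source, length, chunk_size=4, max_retries=5):
    """Copy an object chunk by chunk, retrying only the chunk that failed."""
    copied = bytearray()
    for offset in range(0, length, chunk_size):
        for attempt in range(max_retries):
            try:
                copied += source.read_range(offset, chunk_size)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise  # in a real migration: log this object and move on
    return bytes(copied)

src = FlakySource(b"0123456789abcdef")
assert copy_object(src, 16) == b"0123456789abcdef"</code></pre>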
    <div>
      <h3>Easy to use</h3>
      <a href="#easy-to-use">
        
      </a>
    </div>
    <p>R2 Super Slurper simplifies the process of copying objects and their associated metadata from S3 to your R2 buckets. Point Super Slurper to your S3 buckets and an asynchronous task will handle the rest. While the migration is taking place, you can follow along from the dashboard.</p><blockquote><p>R2 has saved us both time and money. We migrated millions of images in a short period of time. It wouldn't have been possible for us to build a tool to migrate our data in this amount of time in a cost-effective way. - <b>Damien Capocchi</b>, Backend Engineering Manager at <a href="https://reelgood.com/">Reelgood</a></p></blockquote>
    <div>
      <h2>Migrate your S3 data into R2</h2>
      <a href="#migrate-your-s3-data-into-r2">
        
      </a>
    </div>
    <ol><li><p>From the Cloudflare dashboard, expand <b>R2</b> and select <b>Data Migration</b>.</p></li><li><p>Select <b>Migrate files</b>.</p></li><li><p>Enter your Amazon S3 bucket name, optional bucket prefix, and associated credentials and select <b>Next</b>.</p></li><li><p>Enter your R2 bucket name and associated credentials and select <b>Next</b>.</p></li><li><p>After you finish reviewing the details of your migration, select <b>Migrate files</b>.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qo4qeK1OqM0xTKm6RbBWy/0ac3f85a6833194a585a66a1d3880524/image1-32.png" />
            
            </figure><p>You can view the status of your migration job at any time on the dashboard. If you want to copy data from one R2 bucket to another R2 bucket, you can select Cloudflare R2 as the source bucket provider and follow the same process. For more information on how to use Super Slurper, please see the documentation <a href="https://developers.cloudflare.com/r2/r2-migrator/">here</a>.</p>
    <div>
      <h3>Next up: Incremental migration</h3>
      <a href="#next-up-incremental-migration">
        
      </a>
    </div>
    <p>For the majority of cases, a one-time migration of data from your previous object storage bucket to R2 is sufficient; complete the switch from S3 to R2 and immediately watch egress fees go to zero.</p><p>However, in some cases you may want to migrate data to R2 incrementally over time (sip by sip if you will). Enter incremental migration, allowing you to do just that.</p><p>The goal of incremental migration is to copy files from your origin bucket to R2 as they are requested. When a requested object is not already in the R2 bucket, it is downloaded - one last time - from your origin bucket and then copied to R2. From then on, every request for this object is served by R2, which means lower egress fees!</p><p>Because data is migrated within the normal flow of data access and application logic, there is no extra cost from unnecessary egress. Previously complicated migrations become as easy as replacing the S3 endpoint in your application.</p>
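    <p>The pull-through flow described above can be sketched in a few lines. This is an illustrative Python sketch, with in-memory dicts standing in for the origin and R2 buckets; it is not how the service itself is implemented:</p>
            <pre><code>def get_object(key, r2_bucket, origin_bucket):
    """Serve a read, migrating the object from origin to R2 on first access."""
    if key in r2_bucket:
        return r2_bucket[key]      # already migrated: served by R2
    value = origin_bucket[key]     # one last read (and egress fee) from origin
    r2_bucket[key] = value         # copy into R2 as part of the same request
    return value

origin = {"logo.png": b"..."}
r2 = {}
assert get_object("logo.png", r2, origin) == b"..."  # pulled from origin
assert "logo.png" in r2                              # now lives in R2</code></pre>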
    <div>
      <h3>Join the private beta waitlist for incremental migration</h3>
      <a href="#join-the-private-beta-waitlist-for-incremental-migration">
        
      </a>
    </div>
    <p>We’re excited about our progress making data migration easier, but we’re just getting started. If you’re interested in participating in the private beta for Super Slurper incremental migration, let us know by joining the waitlist <a href="https://forms.gle/9xvDLR8LL1Pt8rF58">here</a>.</p><p>We encourage you to join our <a href="https://discord.cloudflare.com/">Discord community</a> to share your R2 experiences, questions, and feedback!</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
     ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[R2 Super Slurper]]></category>
            <guid isPermaLink="false">5rDAs1l1glaZBHaVpPlbSu</guid>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Jérôme Schneider</dc:creator>
        </item>
        <item>
            <title><![CDATA[Use Snowflake with R2 to extend your global data lake]]></title>
            <link>https://blog.cloudflare.com/snowflake-r2-global-data-lake/</link>
            <pubDate>Tue, 16 May 2023 13:00:42 GMT</pubDate>
            <description><![CDATA[ Joint customers can now use Snowflake with Cloudflare R2 to extend your global data lake without having to worry about egress fees ]]></description>
            <content:encoded><![CDATA[ <p>R2 is the ideal <a href="https://www.cloudflare.com/developer-platform/products/r2/">object storage platform</a> to build data lakes. It’s infinitely scalable, highly durable (eleven 9's of annual durability), and has no <a href="https://www.cloudflare.com/learning/cloud/what-are-data-egress-fees/">egress fees</a>. Zero egress fees mean zero <a href="https://www.cloudflare.com/learning/cloud/what-is-vendor-lock-in/">vendor lock-in</a>. You are free to use the tools you want to <a href="https://r2-calculator.cloudflare.com/">get the maximum value</a> from your data.</p><p>Today we’re excited to announce our partnership with Snowflake so that you can use <a href="https://www.snowflake.com/">Snowflake</a> to <a href="https://docs.snowflake.com/en/user-guide/querying-stage">query data</a> stored in your R2 data lake and <a href="https://docs.snowflake.com/en/user-guide/data-load-overview">load data</a> from R2 into Snowflake. Organizations use Snowflake's Data Cloud to unite siloed data, discover and securely share data, and execute diverse analytic workloads across multiple clouds.</p><p>One challenge of loading data into Snowflake database tables and querying external data lakes is the cost of data transfer. If your data is coming from a different cloud or even a different region within the same cloud, this typically means you are paying an additional tax for each byte going into Snowflake. Pairing R2 and Snowflake lets you focus on getting valuable insights from your data, <a href="https://www.cloudflare.com/the-net/cloud-egress-fees-challenge-future-ai/">without having to worry about egress fees piling up</a>.</p>
    <div>
      <h2>Getting started</h2>
      <a href="#getting-started">
        
      </a>
    </div>
    
    <div>
      <h3>Sign up for R2 and create an API token</h3>
      <a href="#sign-up-for-r2-and-create-an-api-token">
        
      </a>
    </div>
    <p>If you haven’t already, you’ll need to <a href="https://dash.cloudflare.com/sign-up">sign up</a> for R2 and <a href="https://developers.cloudflare.com/r2/buckets/create-buckets/">create a bucket</a>. You’ll also need to create R2 security credentials for Snowflake following the steps below.</p>
    <div>
      <h3>Generate an R2 token</h3>
      <a href="#generate-an-r2-token">
        
      </a>
    </div>
    <p>1. In the Cloudflare dashboard, select <b>R2</b>.</p><p>2. Select <b>Manage R2 API Tokens</b> on the right side of the dashboard.</p><p>3. Select <b>Create API token</b>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2E7iZxChjSMK43LTcMx8r8/a8c039e842e45de4e98a93a94a98b430/image1-33.png" />
            
            </figure><p>4. Optionally select the pencil icon or <b>R2 Token</b> text to edit your API token name.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/EsbH4I6yGVnYigCps5zLT/81794bb9022eed5899a07bf0c0ed94b9/image4-9.png" />
            
            </figure><p>5. Under Permissions, select <b>Edit</b>.</p><p>6. Select <b>Create API Token</b>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3o7B7Z2iSUFqDluD9NfJE3/731dd54187da8e892ba7d3c8f884d150/image2-18.png" />
            
            </figure><p>You’ll need the <b>Secret Access Key</b> and <b>Access Key ID</b> to create an external stage in Snowflake.</p>
    <div>
      <h3>Creating external stages in Snowflake</h3>
      <a href="#creating-external-stages-in-snowflake">
        
      </a>
    </div>
    <p>In Snowflake, stages refer to the location of data files in <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>. To create an <a href="https://docs.snowflake.com/en/user-guide/data-load-overview.html#label-data-load-overview-external-stages">external stage</a>, you’ll need your bucket name and R2 credentials. Find your <a href="https://developers.cloudflare.com/fundamentals/get-started/basic-tasks/find-account-and-zone-ids/">Cloudflare account ID in the dashboard</a>.</p>
            <pre><code>CREATE STAGE my_r2_stage
  URL = 's3compat://my_bucket/files/'
  ENDPOINT = 'cloudflare_account_id.r2.cloudflarestorage.com'
  CREDENTIALS = (AWS_KEY_ID = '1a2b3c...' AWS_SECRET_KEY = '4x5y6z...')</code></pre>
            <p>Note: You may need to contact your Snowflake account team to enable S3-compatible endpoints in Snowflake. Get more information <a href="https://docs.snowflake.com/en/user-guide/tables-external-s3-compatible">here</a>.</p>
    <div>
      <h3>Loading data into Snowflake</h3>
      <a href="#loading-data-into-snowflake">
        
      </a>
    </div>
    <p>To load data from your R2 data lake into Snowflake, use the <a href="https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html">COPY INTO</a> command.</p>
            <pre><code>COPY INTO t1
  FROM @my_r2_stage/load/;</code></pre>
            <p>You can flip the table and external stage parameters in the example above to unload data from Snowflake into R2.</p>
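    <p>For example, a sketch of the reverse direction using the stage created earlier (the <code>unload/</code> path is an illustrative choice):</p>
            <pre><code>COPY INTO @my_r2_stage/unload/
  FROM t1;</code></pre>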
    <div>
      <h3>Querying data in R2 with Snowflake</h3>
      <a href="#querying-data-in-r2-with-snowflake">
        
      </a>
    </div>
    <p>You’ll first need to <a href="https://docs.snowflake.com/en/user-guide/tables-external-s3-compatible#creating-external-tables">create an external table</a> in Snowflake. Once you’ve done that you’ll be able to query your data stored in R2.</p>
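    <p>For instance, a minimal external table over the stage from earlier might look like the sketch below. The column-free definition and Parquet file format are illustrative assumptions for your data; note that auto-refresh is not supported for S3-compatible stages, so <code>AUTO_REFRESH</code> must be disabled:</p>
            <pre><code>CREATE EXTERNAL TABLE external_table
  LOCATION = @my_r2_stage/files/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;</code></pre>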
            <pre><code>SELECT * FROM external_table;</code></pre>
            <p>For more information on how to use R2 and Snowflake together, refer to documentation <a href="https://docs.snowflake.com/en/user-guide/tables-external-s3-compatible">here</a>.</p><blockquote><p>“Data is becoming increasingly the center of every application, process, and business metrics, and is the cornerstone of digital transformation. Working with partners like Cloudflare, we are unlocking value for joint customers around the world by helping save costs and helping maximize customers data investments,” <b>– James Malone, Director of Product Management at Snowflake</b></p></blockquote>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UDRzvX4xvqdqbrQarqqOz/2a0ef8335a691a40d3421299c7de001b/Screenshot-2023-05-16-at-12.57.57.png" />
            
            </figure>
    <div>
      <h2>Have any feedback?</h2>
      <a href="#have-any-feedback">
        
      </a>
    </div>
    <p>We want to hear from you! If you have any feedback on the integration between Cloudflare R2 and Snowflake, please let us know by filling <a href="https://forms.gle/vJc5yjCLEbT7ybWi9">this form</a>.</p><p>Be sure to check our <a href="https://discord.com/invite/cloudflaredev">Discord server</a> to discuss everything R2!</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
     ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <category><![CDATA[R2]]></category>
            <guid isPermaLink="false">1nXEIJ7KGAZCaOBWo5mUYZ</guid>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Abhi Das</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Object Lifecycle Management for Cloudflare R2]]></title>
            <link>https://blog.cloudflare.com/introducing-object-lifecycle-management-for-cloudflare-r2/</link>
            <pubDate>Wed, 10 May 2023 13:00:39 GMT</pubDate>
            <description><![CDATA[ We’re excited to announce that Object Lifecycle Management for R2 is generally available, allowing you to effectively manage object expiration, all from the R2 dashboard or via our API ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2gYUTS0mGft4UNdyl6azxW/13b7fd11df446836b17c2060d637f36f/R21.png" />
            
            </figure><p>Last year, <a href="/r2-ga/">R2 made its debut</a>, providing developers with <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a> while eliminating the burden of <a href="https://www.cloudflare.com/learning/cloud/what-are-data-egress-fees/">egress fees</a>. (For many, egress costs account for <a href="https://twitter.com/CloudflareDev/status/1639281667973033988?s=20">over half of their object storage bills</a>!) Since R2’s launch, tens of thousands of developers have chosen it to store data for many different types of applications.</p><p>But for some applications, data stored in <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a> doesn’t need to be retained forever. Over time, as this data grows, it can <a href="https://r2-calculator.cloudflare.com/">unnecessarily lead to higher storage costs</a>. Today, we’re excited to announce that Object Lifecycle Management for R2 is generally available, allowing you to effectively manage object expiration, all from the R2 dashboard or via our API.</p>
    <div>
      <h3>Object Lifecycle Management</h3>
      <a href="#object-lifecycle-management">
        
      </a>
    </div>
    <p>Object lifecycles give you the ability to define rules (up to 1,000) that determine how long objects uploaded to your bucket are kept. For example, by implementing an object lifecycle rule that deletes objects after 30 days, you could automatically delete outdated logs or temporary files. You can also define rules to abort unfinished multipart uploads that are sitting around and contributing to storage costs.</p>
    <div>
      <h3>Getting started with object lifecycles in R2</h3>
      <a href="#getting-started-with-object-lifecycles-in-r2">
        
      </a>
    </div>
    
    <div>
      <h4>Cloudflare dashboard</h4>
      <a href="#cloudflare-dashboard">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NokpyjDvm1FLghbgfaXLs/48ce0dd4d55e12e942c1d9c3f7603f3a/R22.png" />
            
            </figure><ol><li><p>From the Cloudflare dashboard, select <b>R2</b>.</p></li><li><p>Select your R2 bucket.</p></li><li><p>Navigate to the <b>Settings</b> tab and find the <b>Object lifecycle rules</b> section.</p></li><li><p>Select <b>Add rule</b> to define the name, relevant prefix, and lifecycle actions: delete uploaded objects or abort incomplete multipart uploads.</p></li><li><p>Select <b>Add rule</b> to complete the process.</p></li></ol>
    <div>
      <h4>S3 Compatible API</h4>
      <a href="#s3-compatible-api">
        
      </a>
    </div>
    <p>With R2’s <a href="https://www.cloudflare.com/developer-platform/solutions/s3-compatible-object-storage/">S3-compatible</a> API, it’s easy to apply any existing object lifecycle rules to your R2 buckets.</p><p>Here’s an example of how to configure your R2 bucket’s lifecycle policy using the AWS SDK for JavaScript. To try this out, you’ll need to generate an <a href="https://developers.cloudflare.com/r2/api/s3/tokens/">Access Key</a>.</p>
            <pre><code>import S3 from "aws-sdk/clients/s3.js";

const client = new S3({
  endpoint: `https://${ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: ACCESS_KEY_ID, //  fill in your own
    secretAccessKey: SECRET_ACCESS_KEY, // fill in your own
  },
  region: "auto",
});

await client
  .putBucketLifecycleConfiguration({
    Bucket: "testBucket",
    LifecycleConfiguration: {
      Rules: [
        // Example: deleting objects by age
        // Delete logs older than 90 days
        {
          ID: "Delete logs after 90 days",
          Status: "Enabled",
          Filter: {
            Prefix: "logs/",
          },
          Expiration: {
            Days: 90,
          },
        },
        // Example: abort all incomplete multipart uploads after a week
        {
          ID: "Abort incomplete multipart uploads",
          Status: "Enabled",
          Filter: {},
          AbortIncompleteMultipartUpload: {
            DaysAfterInitiation: 7,
          },
        },
      ],
    },
  })
  .promise();</code></pre>
            <p>For more information on how object lifecycle policies work and how to configure them in the dashboard or API, see the documentation <a href="https://developers.cloudflare.com/r2/buckets/object-lifecycles/">here</a>.</p><p>Speaking of documentation, if you’d like to provide feedback for R2’s documentation, fill out our <a href="https://docs.google.com/forms/d/e/1FAIpQLScaVrdZh2PoZFvJGFPyMthuGVvKpQvoPfZ-BxIJ4Q5zsQebDA/viewform">documentation survey</a>!</p>
    <div>
      <h3>What’s next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Creating object lifecycle rules to delete ephemeral objects is a great way to reduce storage costs, but what if you need to keep objects around to access in the future? We’re working on new, lower cost ways to store objects in R2 that aren’t frequently accessed, like long tail user-generated content, archive data, and more. If you’re interested in providing feedback and gaining early access, let us know by joining the waitlist <a href="https://forms.gle/JLLzQtqSuLAQNRPX7">here</a>.</p>
    <div>
      <h3>Join the conversation: share your feedback and experiences</h3>
      <a href="#join-the-conversation-share-your-feedback-and-experiences">
        
      </a>
    </div>
    <p>If you have any questions or feedback relating to R2, we encourage you to join our <a href="https://discord.gg/cloudflaredev">Discord community</a> to share! Stay tuned for more exciting R2 updates in the future.</p> ]]></content:encoded>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[R2]]></category>
            <guid isPermaLink="false">2toJKgkVGnaE3gFJUCqFbf</guid>
            <dc:creator>Harshal Brahmbhatt</dc:creator>
            <dc:creator>Phillip Jones</dc:creator>
        </item>
    </channel>
</rss>