
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built and the technologies we use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Thu, 09 Apr 2026 14:26:36 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Cloudflare service outage June 12, 2025]]></title>
            <link>https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/</link>
            <pubDate>Thu, 12 Jun 2025 22:00:00 GMT</pubDate>
            <description><![CDATA[ Multiple Cloudflare services, including Workers KV, Access, WARP and the Cloudflare dashboard, experienced an outage for up to 2 hours and 28 minutes on June 12, 2025. ]]></description>
            <content:encoded><![CDATA[ <p>On June 12, 2025, Cloudflare suffered a significant service outage that affected a large set of our critical services, including Workers KV, WARP, Access, Gateway, Images, Stream, Workers AI, Turnstile and Challenges, AutoRAG, Zaraz, and parts of the Cloudflare Dashboard.</p><p>This outage lasted 2 hours and 28 minutes, and globally impacted all Cloudflare customers using the affected services. The cause of this outage was a failure in the underlying storage infrastructure used by our Workers KV service, which is a critical dependency for many Cloudflare products and relied upon for configuration, authentication and asset delivery across the affected services. Part of this infrastructure is backed by a third-party cloud provider, which experienced an outage today and directly impacted the availability of our KV service.</p><p>We’re deeply sorry for this outage: this was a failure on our part, and while the proximate cause (or trigger) for this outage was a third-party vendor failure, we are ultimately responsible for our chosen dependencies and how we choose to architect around them.</p><p>This was not the result of an attack or other security event. No data was lost as a result of this incident. Cloudflare Magic Transit and Magic WAN, DNS, Cache, proxy, WAF and related services were not directly impacted by this incident.</p>
    <div>
      <h2>What was impacted?</h2>
    </div>
    <p>As a rule, Cloudflare designs and builds our services on our own platform building blocks, and as such many of Cloudflare’s products are built to rely on the Workers KV service. </p><p>The following table details the impacted services, including the user-facing impact, operation failures, and increases in error rates observed:</p><table><tr><th><p><b>Product/Service</b></p></th><th><p><b>Impact</b></p></th></tr><tr><td><p><b>Workers KV</b></p><p>

</p></td><td><p>Workers KV saw 90.22% of requests fail: any request for a key-value pair that was not cached, and that therefore had to retrieve the value from Workers KV's origin storage backends, failed with a 503 or 500 response code. </p><p>The remaining requests were successfully served from Workers KV's cache (status code 200 and 404) or returned errors within our expected limits and/or error budget.</p><p>This did not impact data stored in Workers KV.</p></td></tr><tr><td><p><b>Access</b></p><p>

</p></td><td><p>Access uses Workers KV to store application and policy configuration along with user identity information.</p><p>During the incident, Access failed 100% of identity-based logins for all application types, including Self-Hosted, SaaS and Infrastructure. User identity information was unavailable to other services like WARP and Gateway during this incident. Access is designed to fail closed when it cannot successfully fetch policy configuration or a user’s identity. </p><p>Active Infrastructure Application SSH sessions with command logging enabled failed to save logs due to a Workers KV dependency. </p><p>Access’ System for Cross-domain Identity Management (SCIM) service was also impacted due to its reliance on Workers KV and Durable Objects (which depended on KV) to store user information. During this incident, user identities were not updated due to Workers KV update failures. These failures resulted in a 500 being returned to identity providers. Some providers may require a manual re-synchronization, but most customers would have seen immediate service restoration once Access’ SCIM service was restored, thanks to retry logic on the identity provider’s side.</p><p>Service-authentication-based logins (e.g. service token, Mutual TLS, and IP-based policies) and Bypass policies were unaffected. No Access policy edits or changes were lost during this time.</p></td></tr><tr><td><p><b>Gateway</b></p><p>

</p></td><td><p>This incident did not affect most Gateway DNS queries, including those over IPv4, IPv6, DNS over TLS (DoT), and DNS over HTTPS (DoH).</p><p>However, there were two exceptions:</p><p>DoH queries with identity-based rules failed. This happened because Gateway couldn't retrieve the required user identity information.</p><p>Authenticated DoH was disrupted for some users. Users with active sessions with valid authentication tokens were unaffected, but those needing to start new sessions or refresh authentication tokens could not.</p><p>Users of Gateway proxy, egress, and TLS decryption were unable to connect, register, proxy, or log traffic.</p><p>This was due to our reliance on Workers KV to retrieve up-to-date identity and device posture information. Each of these actions requires a call to Workers KV, and when Workers KV is unavailable, Gateway is designed to fail closed to prevent traffic from bypassing customer-configured rules.</p></td></tr><tr><td><p><b>WARP</b></p><p>


</p></td><td><p>The WARP client was impacted due to core dependencies on Access and Workers KV, which are required for device registration and authentication. As a result, no new clients were able to connect or sign up during the incident.</p><p>Existing WARP client sessions that were routed through the Gateway proxy experienced disruptions, as Gateway was unable to perform its required policy evaluations.</p><p>Additionally, the WARP emergency disconnect override was rendered unavailable because of a failure in its underlying dependency, Workers KV.</p><p>Consumer WARP saw sporadic impact similar to the Zero Trust version.</p></td></tr><tr><td><p><b>Dashboard</b></p><p>

</p></td><td><p>Dashboard user logins and most of the existing dashboard sessions were unavailable. This was due to an outage affecting Turnstile, Durable Objects, Workers KV, and Access. The specific causes for login failures were:</p><p>Standard Logins (User/Password): Failed due to Turnstile unavailability.</p><p>Sign-in with Google (OIDC) Logins: Failed due to a KV dependency issue.</p><p>SSO Logins: Failed due to a full dependency on Access.</p><p>The Cloudflare v4 API was not impacted during this incident.</p></td></tr><tr><td><p><b>Challenges and Turnstile</b></p><p>

</p></td><td><p>The Challenge platform that powers Cloudflare Challenges and Turnstile saw a high rate of failure and timeout for siteverify API requests during the incident window due to its dependencies on Workers KV and Durable Objects (a sketch of a typical siteverify call appears after this table).</p><p>We have kill switches in place to disable these calls in case of incidents and outages such as this. We activated these kill switches as a mitigation so that eyeballs are not blocked from proceeding. Notably, while these kill switches were active, Turnstile’s siteverify API (the API that validates issued tokens) could redeem valid tokens multiple times, potentially allowing for attacks where a bad actor might try to use a previously valid token to bypass a challenge. </p><p>There was no impact to Turnstile’s ability to detect bots. A bot attempting to solve a challenge would still have failed the challenge and thus would not have received a token. </p></td></tr><tr><td><p><b>Browser Isolation</b></p><p>

</p></td><td><p>Existing Browser Isolation sessions via Link-based isolation were impacted due to a reliance on Gateway for policy evaluation.</p><p>New link-based Browser Isolation sessions could not be initiated due to a dependency on Cloudflare Access. All Gateway-initiated isolation sessions failed due to the Gateway dependency.</p></td></tr><tr><td><p><b>Images</b></p><p>

</p></td><td><p>Batch uploads to Cloudflare Images were impacted during the incident window, with a 100% failure rate at the peak of the incident. Other uploads were not impacted.</p><p>Overall image delivery dipped to around 97% success rate. Image Transformations were not significantly impacted, and Polish was not impacted.</p></td></tr><tr><td><p><b>Stream</b></p><p>

</p></td><td><p>Stream’s error rate exceeded 90% during the incident window as video playlists were unable to be served. Stream Live observed a 100% error rate.</p><p>Video uploads were not impacted.</p></td></tr><tr><td><p><b>Realtime</b></p><p>

</p></td><td><p>The Realtime TURN (Traversal Using Relays around NAT) service uses KV and was heavily impacted. Error rates were near 100% for the duration of the incident window.</p><p>The Realtime SFU service (Selective Forwarding Unit) was unable to create new sessions, although existing connections were maintained. This caused a reduction to 20% of normal traffic during the impact window. </p></td></tr><tr><td><p><b>Workers AI</b></p><p>

</p></td><td><p>All inference requests to Workers AI failed for the duration of the incident. Workers AI depends on Workers KV for distributing configuration and routing information for AI requests globally.</p></td></tr><tr><td><p><b>Pages &amp; Workers Assets</b></p><p>

</p></td><td><p>Static assets served by Cloudflare Pages and Workers Assets (such as HTML, JavaScript, CSS, images, etc) are stored in Workers KV, cached, and retrieved at request time. Workers Assets saw an average error rate increase of around 0.06% of total requests during this time. </p><p>During the incident window, Pages error rate peaked to ~100% and all Pages builds could not complete. </p></td></tr><tr><td><p><b>AutoRAG</b></p><p>

</p></td><td><p>AutoRAG relies on Workers AI models for both document conversion and generating vector embeddings during indexing, as well as LLM models for querying and search. AutoRAG was unavailable during the incident window because of the Workers AI dependency.</p></td></tr><tr><td><p><b>Durable Objects</b></p><p>

</p></td><td><p>SQLite-backed Durable Objects share the same underlying storage infrastructure as Workers KV. The average error rate during the incident window peaked at 22%, and dropped to 2% as services started to recover.</p><p>Durable Object namespaces using the legacy key-value storage were not impacted.</p></td></tr><tr><td><p><b>D1</b></p><p>
</p></td><td><p>D1 databases share the same underlying storage infrastructure as Workers KV and Durable Objects.</p><p>Similar to Durable Objects, the average error rate during the incident window peaked at 22%, and dropped to 2% as services started to recover.</p></td></tr><tr><td><p><b>Queues &amp; Event Notifications</b></p><p>
</p></td><td><p>Queues message operations, including pushing and consuming, were unavailable during the incident window.</p><p>Queues uses KV to map each Queue to underlying Durable Objects that contain queued messages.</p><p>Event Notifications use Queues as their underlying delivery mechanism.</p></td></tr><tr><td><p><b>AI Gateway</b></p><p>

</p></td><td><p>AI Gateway is built on top of Workers and relies on Workers KV for client and internal configurations. During the incident window, AI Gateway saw error rates peak at 97% of requests until dependencies recovered.</p></td></tr><tr><td><p><b>CDN</b></p><p>

</p></td><td><p>Automated traffic management infrastructure was operational but acted with reduced efficacy during the impact period. In particular, registration requests from Zero Trust clients increased substantially as a result of the outage.</p><p>The increase in requests imposed additional load on several Cloudflare locations, triggering a response from automated traffic management. In response to these conditions, systems rerouted incoming CDN traffic to nearby locations, reducing impact to customers. A portion of traffic was not rerouted as expected; this is under investigation. CDN requests impacted by this issue would experience elevated latency, HTTP 499 errors, and/or HTTP 503 errors. Impacted Cloudflare service areas included São Paulo, Philadelphia, Atlanta, and Raleigh.</p></td></tr><tr><td><p><b>Workers / Workers for Platforms</b></p><p>

</p></td><td><p>Workers and Workers for Platforms rely on a third party service for uploads. During the incident window, Workers saw an overall error rate peak to ~2% of total requests. Workers for Platforms saw an overall error rate peak to ~10% of total requests during the same time period. </p></td></tr><tr><td><p><b>Workers Builds (CI/CD)
 </b></p><p>
</p></td><td><p>Starting at 18:03 UTC, Workers Builds could not receive new source code management push events because Access was down.</p><p>100% of new Workers Builds failed during the incident window.</p></td></tr><tr><td><p><b>Browser Rendering</b></p><p>

</p></td><td><p>Browser Rendering depends on Browser Isolation for browser instance infrastructure.</p><p>Requests to both the REST API and via the Workers Browser Binding were 100% impacted during the incident window.</p></td></tr><tr><td><p><b>Zaraz</b></p><p>
</p></td><td><p>100% of requests were impacted during the incident window, as Zaraz relies on Workers KV to load website configurations when handling eyeball traffic. Due to the same dependency, attempts to save updates to Zaraz configs were unsuccessful during this period, but our monitoring shows that only a single user was affected.</p></td></tr></table>
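    <p>As noted in the Challenges and Turnstile row above, token validation happens through the public siteverify API. The following is a minimal, hypothetical sketch of a server-side validation call; the function name and the way the secret is passed in are illustrative assumptions, not Cloudflare’s internal implementation:</p>
            <pre><code>// Minimal sketch of a server-side Turnstile siteverify call (TypeScript).
// How the secret is stored and how the result is used are assumptions.
export async function verifyTurnstileToken(token: string, remoteIp: string, secret: string) {
  const body = new URLSearchParams();
  body.set("secret", secret);     // secret key for the Turnstile widget
  body.set("response", token);    // token produced by the widget on the client
  body.set("remoteip", remoteIp); // optional, helps correlate validations

  const res = await fetch("https://challenges.cloudflare.com/turnstile/v0/siteverify", {
    method: "POST",
    body,
  });
  const outcome = await res.json() as { success: boolean };

  // Under normal operation a token can only be redeemed once; during the
  // incident the kill switch relaxed that guarantee, as described above.
  return outcome.success;
}</code></pre>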
    <div>
      <h2>Background</h2>
    </div>
    <p>Workers KV is built as what we call a “coreless” service, which means there should be no single point of failure as the service runs independently in each of our locations worldwide. However, Workers KV today relies on a central data store to provide a source of truth for data. A failure of that store caused a complete outage for cold reads and writes to the KV namespaces used by services across Cloudflare.</p><p>Workers KV is in the process of being transitioned to significantly more resilient infrastructure for its central store: regrettably, we had a gap in coverage which was exposed during this incident. We had removed a storage provider from Workers KV as we worked to re-architect KV’s backend, including migrating it to Cloudflare R2, to prevent data consistency issues (caused by the original data syncing architecture), and to improve support for data residency requirements.</p><p>One of our principles is to build Cloudflare services on our own platform as much as possible, and Workers KV is no exception. Many of our internal and external services rely heavily on Workers KV, which under normal circumstances helps us deliver the most robust services possible, instead of service teams attempting to build their own storage services. In this case, the cascading impact of the Workers KV failure exacerbated the issue and significantly broadened the blast radius. </p>
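    <p>To illustrate the dependency pattern described above, here is a minimal, hypothetical Worker that reads configuration from a KV namespace the way many of the affected services do: cached keys keep being served during an origin-storage outage, but cold reads fail and the service fails closed. The binding name, key layout and error handling are illustrative assumptions, not Cloudflare’s internal code:</p>
            <pre><code>// Hypothetical sketch (TypeScript): a service that loads per-hostname
// configuration from Workers KV and fails closed when it cannot.
// CONFIG_KV is an illustrative binding name.
interface Env {
  CONFIG_KV: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env) {
    const hostname = new URL(request.url).hostname;
    let config: string | null = null;
    try {
      // Warm keys are answered from cache; cold reads go to KV's central
      // origin storage, which is the part that failed in this incident.
      config = await env.CONFIG_KV.get("config:" + hostname, { cacheTtl: 300 });
    } catch {
      config = null;
    }
    if (config === null) {
      // Fail closed: without configuration we cannot safely evaluate policy.
      return new Response("Service temporarily unavailable", { status: 503 });
    }
    return new Response(config, { headers: { "content-type": "application/json" } });
  },
};</code></pre>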
    <div>
      <h2>Incident timeline and impact</h2>
    </div>
    <p>The incident timeline, including the initial impact, investigation, root cause, and remediation, is detailed below. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CBPPVgr3GroJP2EcD3yvB/6073457ce6263e7e05f6eb3d796ddd48/BLOG-2847_2.png" />
          </figure><p><i><sub>Workers KV error rates to storage infrastructure. 91% of requests to KV failed during the incident window.</sub></i></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sGFmSsNHD9yXwJ9Ea5jov/78349be4318cb738cdd72643e69f7bdb/BLOG-2847_1.png" />
          </figure><p><i><sub>Cloudflare Access percentage of successful requests. Cloudflare Access relies directly on Workers KV and serves as a good proxy to measure Workers KV availability over time.</sub></i></p><p>All timestamps referenced are in Coordinated Universal Time (UTC).</p><table><tr><th><p><b>Time</b></p></th><th><p><b>Event</b></p></th></tr><tr><td><p>2025-06-12 17:52</p></td><td><p><b>INCIDENT START
</b>The Cloudflare WARP team begins to see registrations of new devices fail, begins investigating these failures, and declares an incident.</p></td></tr><tr><td><p>2025-06-12 18:05</p></td><td><p>The Cloudflare Access team receives an alert due to a rapid increase in error rates.</p><p>Service Level Objectives for multiple services drop below targets and trigger alerts across those teams.</p></td></tr><tr><td><p>2025-06-12 18:06</p></td><td><p>Multiple service-specific incidents are combined into a single incident as we identify a shared cause (Workers KV unavailability). Incident priority upgraded to P1.</p></td></tr><tr><td><p>2025-06-12 18:21</p></td><td><p>Incident priority upgraded to P0 from P1 as severity of impact becomes clear.</p></td></tr><tr><td><p>2025-06-12 18:43</p></td><td><p>Cloudflare Access begins exploring options to remove the Workers KV dependency by migrating to a different backing datastore with the Workers KV engineering team. This was a proactive measure in case the storage infrastructure remained down.</p></td></tr><tr><td><p>2025-06-12 19:09</p></td><td><p>Zero Trust Gateway begins working to remove dependencies on Workers KV by gracefully degrading rules that referenced Identity or Device Posture state.</p></td></tr><tr><td><p>2025-06-12 19:32</p></td><td><p>Access and Device Posture force drop identity and device posture requests to shed load on Workers KV until the third-party service comes back online.</p></td></tr><tr><td><p>2025-06-12 19:45</p></td><td><p>Cloudflare teams continue to work on a path to deploying a Workers KV release against an alternative backing datastore and having critical services write configuration data to that store.</p></td></tr><tr><td><p>2025-06-12 20:23</p></td><td><p>Services begin to recover as storage infrastructure begins to recover. We continue to see a non-negligible error rate and infrastructure rate limits due to the influx of services repopulating caches.</p></td></tr><tr><td><p>2025-06-12 20:25</p></td><td><p>Access and Device Posture restore calling Workers KV as the third-party service is restored.</p></td></tr><tr><td><p>2025-06-12 20:28</p></td><td><p><b>IMPACT END 
</b>Service Level Objectives return to pre-incident level. Cloudflare teams continue to monitor systems to ensure services do not degrade as dependent systems recover.</p></td></tr><tr><td><p>
</p></td><td><p><b>INCIDENT END
</b>Cloudflare teams see all affected services return to normal function. Service Level Objective alerts recover.</p></td></tr></table>
    <div>
      <h2>Remediation and follow-up steps</h2>
    </div>
    <p>We’re taking immediate steps to improve the resiliency of services that depend on Workers KV and our storage infrastructure. This includes existing planned work that we are accelerating as a result of this incident.</p><p>This encompasses several workstreams, including efforts to avoid singular dependencies on storage infrastructure we do not own and to improve our ability to recover critical services (including Access, Gateway and WARP).</p><p>Specifically:</p><ul><li><p>(Actively in-flight): Bringing forward our work to improve the redundancy within Workers KV’s storage infrastructure, removing the dependency on any single provider. During the incident window we began work to cut over and backfill critical KV namespaces to our own infrastructure, in the event the incident continued. </p></li><li><p>(Actively in-flight): Short-term blast radius remediations for individual products that were impacted by this incident so that each product becomes resilient to any loss of service caused by any single point of failure, including third party dependencies.</p></li><li><p>(Actively in-flight): Implementing tooling that allows us to progressively re-enable namespaces during storage infrastructure incidents. This will allow us to ensure that key dependencies, including Access and WARP, are able to come up without risking a denial-of-service against our own infrastructure as caches are repopulated.</p></li></ul><p>This list is not exhaustive: our teams continue to revisit design decisions and assess the infrastructure changes we need to make in both the near (immediate) term and long term to mitigate incidents like this going forward.</p><p>This was a serious outage, and we understand that organizations and institutions, large and small, depend on us to protect and/or run their websites, applications, zero trust and network infrastructure. Again, we are deeply sorry for the impact and are working diligently to improve our service resiliency. </p>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">2dN6tdkhtvWTTgAgbfzaSX</guid>
            <dc:creator>Jeremy Hartman</dc:creator>
            <dc:creator>CJ Desai</dc:creator>
        </item>
        <item>
            <title><![CDATA[Major data center power failure (again): Cloudflare Code Orange tested]]></title>
            <link>https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested/</link>
            <pubDate>Mon, 08 Apr 2024 13:00:15 GMT</pubDate>
            <description><![CDATA[ Less than five months after a complete power outage at a critical data center, we were hit with the exact same scenario. Here’s how we did this time, and what’s next ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4fn80cCKCVWYn0XOOh3eX2/e23f4144cdb106dc80bd3b8a27f27254/image3-11.png" />
            
            </figure><p>Here's a post we never thought we'd need to write: less than five months after one of our major data centers lost power, it happened again to the exact same data center. That sucks and, if you're thinking "why do they keep using this facility??," I don't blame you. We're thinking the same thing. But, here's the thing, while a lot may not have changed at the data center, a lot changed over those five months at Cloudflare. So, while five months ago a major data center going offline was really painful, this time it was much less so.</p><p>This is a little bit about how a high availability data center lost power for the second time in five months. But, more so, it's the story of how our team worked to ensure that even if one of our critical data centers lost power it wouldn't impact our customers.</p><p>On November 2, 2023, one of our critical facilities in the Portland, Oregon region lost power for an extended period of time. It happened because of a cascading series of faults that appears to have been caused by maintenance by the electrical grid provider, climaxing with a ground fault at the facility, and was made worse by a series of unfortunate incidents that prevented the facility from getting back online in a timely fashion.</p><p>If you want to read all the gory details, they're available <a href="/post-mortem-on-cloudflare-control-plane-and-analytics-outage/">here</a>.</p><p>It's painful whenever a data center has a complete loss of power, but it's something that we were supposed to expect. Unfortunately, in spite of that expectation, we hadn't enforced a number of requirements on our products that would ensure they continued running in spite of a major failure.</p><p>That was a mistake we were never going to allow to happen again.</p>
    <div>
      <h3>Code Orange</h3>
    </div>
    <p>The incident was painful enough that we declared what we called Code Orange. We borrowed the idea from Google which, when they have an existential threat to their business, reportedly declares a Code Yellow or Code Red. Our logo is orange, so we altered the formula a bit.</p><p>Our conception of Code Orange was that the person who led the incident, in this case our SVP of Technical Operations, Jeremy Hartman, would be empowered to charge any engineer on our team to work on what he deemed the highest priority project. (Unless we declared a Code Red, which we actually ended up doing due to a hacking incident, and which would then take even higher priority. If you're interested, you can read more about that <a href="/thanksgiving-2023-security-incident/">here</a>.)</p><p>After getting through the immediate incident, Jeremy quickly triaged the most important work that needed to be done in order to ensure we'd be highly available even in the case of another catastrophic failure of a major data center facility. And the team got to work.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Q7F31g2w6xPxdlq39dpDW/ad9a106fed84e8fcd728e165bfd2767a/image2-15.png" />
            
            </figure>
    <div>
      <h3>How'd we do?</h3>
    </div>
    <p>We didn’t expect such an extensive real-world test so quickly, but the universe works in mysterious ways. On Tuesday, March 26, 2024, just shy of five months after the initial incident, the same facility had another major power outage. Below, we'll get into what caused the outage this time, but what is most important is that it provided a perfect test for the work our team had done under Code Orange. So, what were the results?</p><p>First, let’s revisit what functions the Portland data centers at Cloudflare provide. As described in the November 2, 2023, <a href="/post-mortem-on-cloudflare-control-plane-and-analytics-outage/">post</a>, the control plane of Cloudflare primarily consists of the customer-facing interface for all of our services including our website and API. Additionally, the underlying services that provide the Analytics and Logging pipelines are primarily served from these facilities.</p><p>Just like in November 2023, we were alerted immediately that we had lost connectivity to our PDX01 data center. Unlike in November, we very quickly knew with certainty that we had once again lost all power, putting us in the exact same situation as five months prior. We also knew, based on a successful internal cut test in February, how our systems should react. We had spent months preparing, updating countless systems and activating huge amounts of network and server capacity, culminating with a test to prove the work was having the intended effect, which in this case was an automatic failover to the redundant facilities.</p><p>Our Control Plane consists of hundreds of internal services, and the expectation is that when we lose one of the three critical data centers in Portland, these services continue to operate normally in the remaining two facilities, and we continue to operate primarily in Portland. We have the capability to fail over to our European data centers in case our Portland centers are completely unavailable. However, that is a secondary option, and not something we pursue immediately.</p><p>On March 26, 2024, at 14:58 UTC, PDX01 lost power and our systems began to react. By 15:05 UTC, our APIs and Dashboards were operating normally, all without human intervention. Our primary focus over the past few months has been to make sure that our customers would still be able to configure and operate their Cloudflare services in case of a similar outage. There were a few specific services that required human intervention and therefore took a bit longer to recover; however, the primary interface mechanism was operating as expected.</p><p>To put a finer point on this, during the November 2, 2023, incident the following services had at least six hours of control plane downtime, with several of them functionally degraded for days.</p><ul><li><p>API and Dashboard</p></li><li><p>Zero Trust</p></li><li><p>Magic Transit</p></li><li><p>SSL</p></li><li><p>SSL for SaaS</p></li><li><p>Workers</p></li><li><p>KV</p></li><li><p>Waiting Room</p></li><li><p>Load Balancing</p></li><li><p>Zero Trust Gateway</p></li><li><p>Access</p></li><li><p>Pages</p></li><li><p>Stream</p></li><li><p>Images</p></li></ul><p>During the March 26, 2024, incident, all of these services were up and running within minutes of the power failure, and many of them did not experience any impact at all during the failover.</p><p>The data plane, which handles the traffic that Cloudflare customers pass through our data centers in over 300 cities worldwide, was not impacted.</p><p>Our Analytics platform, which provides a view into customer traffic, was impacted and wasn’t fully restored until later that day. This was expected behavior as the Analytics platform is reliant on the PDX01 data center. Just like the Control Plane work, we began building new Analytics capacity immediately after the November 2, 2023, incident. However, the scale of the work requires that it will take a bit more time to complete. We have been working as fast as we can to remove this dependency, and we expect to complete this work in the near future.</p><p>Once we had validated the functionality of our Control Plane services, we were faced yet again with the cold start of a very large data center. This activity took roughly 72 hours in November 2023, but this time around we were able to complete this in roughly 10 hours. There is still work to be done to make that even faster, and we will continue to refine our procedures in case we have a similar incident in the future.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7cu18EGvfdwXuXIr81qHN8/eaa05db6a5944d0270ed685ce558b070/Incident-inspection.png" />
            
            </figure>
    <div>
      <h3>How did we get here?</h3>
    </div>
    <p>As mentioned above, the power outage event from last November led us to introduce Code Orange, a process where we shift most or all engineering resources to addressing the issue at hand when there’s a significant event or crisis. Over the past five months, we shifted all non-critical engineering functions to focus on ensuring high reliability of our control plane.</p><p>Teams across our engineering departments rallied to ensure our systems would be more resilient in the face of a similar failure in the future. Though the March 26, 2024, incident was unexpected, it was something we’d been preparing for.</p><p>The most obvious difference is the speed at which the control plane and APIs regained service. Without human intervention, the ability to log in and make changes to Cloudflare configuration was possible seven minutes after PDX01 was lost. This is due to our efforts to move all of our configuration databases to a Highly Available (HA) topology, and pre-provision enough capacity that we could absorb the capacity loss. More than 100 databases across over 20 different database clusters simultaneously failed out of the affected facility and restored service automatically. This was actually the culmination of over a year’s worth of work, and we prove our ability to fail over properly with weekly tests.</p><p>Another significant improvement is the updates to our Logpush infrastructure. In November 2023, the loss of the PDX01 data center meant that we were unable to push logs to our customers. During Code Orange, we invested in making the Logpush infrastructure HA in Portland, and additionally created an active failover option in Amsterdam. Logpush took advantage of our massively expanded Kubernetes cluster that spans all of our Portland facilities and provides a seamless way for service owners to deploy HA compliant services that have resiliency baked in. In fact, during our February chaos exercise, we found a flaw in our Portland HA deployment, but customers were not impacted because the Amsterdam Logpush infrastructure took over successfully. During this event, we saw that the fixes we’d made since then worked, and we were able to push logs from the Portland region.</p><p>A number of other improvements in our Stream and Zero Trust products resulted in little to no impact to their operation. Our Stream products, which use a lot of compute resources to transcode videos, were able to seamlessly hand off to our Amsterdam facility to continue operations. Teams were given specific availability targets for the services and were provided several options to achieve those targets. Stream is a good example of a service that chose a different resiliency architecture but was able to seamlessly deliver their service during this outage. Zero Trust, which was also impacted in November 2023, has since moved the vast majority of its functionality to our hundreds of data centers, which kept working seamlessly throughout this event. Ultimately this is the strategy we are pushing all Cloudflare products to adopt as our data centers in <a href="https://www.cloudflare.com/network">over 300 cities worldwide</a> provide the highest level of availability possible.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hnYtkVM6JHuvAOD3HGNmq/239ae0443184a22761245b4458e15ead/image1-12.png" />
            
            </figure>
    <div>
      <h3>What happened to the power in the data center?</h3>
    </div>
    <p>On March 26, 2024, at 14:58 UTC, PDX01 experienced a total loss of power to Cloudflare’s physical infrastructure following a reportedly simultaneous failure of four Flexential-owned and operated switchboards serving all of Cloudflare’s cages. This meant both primary and redundant power paths were deactivated across the entire environment. During the Flexential investigation, engineers focused on a set of equipment known as Circuit Switch Boards, or CSBs. A CSB can be likened to an electrical panel board, consisting of a main input circuit breaker and a series of smaller output breakers. Flexential engineers reported that infrastructure upstream of the CSBs (power feed, generator, UPS &amp; PDU/transformer) was not impacted and continued to act normally. Similarly, infrastructure downstream from the CSBs such as Remote Power Panels and connected switchgear was not impacted – thus implying the outage was isolated to the CSBs themselves.</p><p>Initial assessment of the root cause of Flexential’s CSB failures points to incorrectly set breaker coordination settings within the four CSBs as one contributing factor. Trip settings which are too restrictive can result in overly sensitive overcurrent protection and the potential nuisance tripping of devices. In our case, Flexential’s breaker settings within the four CSBs were reportedly too low in relation to the downstream provisioned power capacities. When one or more of these breakers tripped, a cascading failure of the remaining active CSB boards resulted, thus causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure. During the triage of the incident, we were told that the Flexential facilities team noticed the incorrect trip settings, reset the CSBs and adjusted them to the expected values, enabling our team to power up our servers in a staged and controlled fashion. We do not know when these settings were established – typically, these would be set/adjusted as part of a data center commissioning process and/or breaker coordination study before customer critical loads are installed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lJDAVlXMNrU7Eyp7PP0lF/db9a86dfa40f4ca85965d8af8b36c634/Incident-inspection-3.png" />
            
            </figure>
    <div>
      <h3>What’s next?</h3>
    </div>
    <p>Our top priority is completing the resilience program for our Analytics platform. Analytics aren’t simply pretty charts in a dashboard. When you want to check the status of attacks, activities a firewall is blocking, or even the status of Cloudflare Tunnels, you need analytics. We have evidence that the resiliency pattern we are adopting works as expected, so this remains our primary focus, and we will progress as quickly as possible.</p><p>There were some services that still required manual intervention to properly recover, and we have collected data and action items for each of them to ensure that further manual action is not required. We will continue to use production cut tests to prove all of these changes and enhancements provide the resiliency that our customers expect.</p><p>We will continue to work with Flexential on follow-up activities to expand our understanding of their operational and review procedures to the greatest extent possible. While this incident was limited to a single facility, we will turn this exercise into a process that ensures we have a similar view into all of our critical data center facilities.</p><p>Once again, we are very sorry for the impact to our customers, particularly those that rely on the Analytics engine and were unable to access that product feature during the incident. Our work over the past four months has yielded the results that we expected, and we will stay absolutely focused on completing the remaining body of work.</p>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">3jSHB2RGdy2XNScvpyF1oX</guid>
            <dc:creator>Matthew Prince</dc:creator>
            <dc:creator>John Graham-Cumming</dc:creator>
            <dc:creator>Jeremy Hartman</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare erroneously throttled a customer’s web traffic]]></title>
            <link>https://blog.cloudflare.com/how-cloudflare-erroneously-throttled-a-customers-web-traffic/</link>
            <pubDate>Tue, 07 Feb 2023 18:20:49 GMT</pubDate>
            <description><![CDATA[ Today’s post is a little different. It’s about a single customer’s website not working correctly because of incorrect action taken by Cloudflare. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6TlF3dN51Im2Zyzwy04oyr/8fbcd20d22abe9ba7b78dd673e9f04a1/BLOG-1707-header-1.png" />
            
            </figure><p>Over the years, when Cloudflare has had an <a href="/tag/outage/">outage</a> that affected our customers, we have very quickly blogged about what happened, why, and what we are doing to address the causes of the outage. Today’s post is a little different. It’s about a single customer’s website <a href="https://news.ycombinator.com/item?id=34639212">not working correctly</a> because of incorrect action taken by Cloudflare.</p><p>Although the customer was not in any way banned from Cloudflare, and did not lose access to their account, their website didn’t work. And it didn’t work because Cloudflare applied a bandwidth throttle between us and their origin server. The effect was that the website was unusable.</p><p>Because of this unusual throttle, there was some internal confusion for our customer support team about what had happened. They, incorrectly, believed that the customer had been limited because of a breach of section 2.8 of our <a href="https://www.cloudflare.com/terms/">Self-Serve Subscription Agreement</a>, which prohibits use of our self-service CDN to serve excessive non-HTML content, such as images and video, without a paid plan that includes those services (this is, for example, designed to prevent someone building an image-hosting service on Cloudflare and consuming a huge amount of bandwidth; for that sort of use case we have paid <a href="https://www.cloudflare.com/products/cloudflare-images/">image</a> and <a href="https://www.cloudflare.com/products/cloudflare-stream/">video</a> plans).</p><p>However, this customer wasn’t breaking section 2.8, and they were both a paying Cloudflare customer and a paying customer of Cloudflare Workers, through which the throttled traffic was passing. This throttle should not have happened. In addition, there is and was no need for the customer to upgrade to some other plan level.</p><p>This incident has set off a number of workstreams inside Cloudflare to ensure better communication between teams, to prevent such an incident from happening again, and to ensure that communications between Cloudflare and our customers are much clearer.</p><p>Before we explain our own mistake and how it came to be, we’d like to apologize to the customer. We realize the serious impact this had, and how we fell short of expectations. In this blog post, we want to explain what happened, and more importantly what we’re going to change to make sure it does not happen again.</p>
    <div>
      <h3>Background</h3>
    </div>
    <p>On February 2, an on-call network engineer received an alert for a congesting interface with Equinix IX in our Ashburn data center. While this is not an unusual alert, this one stood out for two reasons. First, it was the second day in a row that it happened, and second, the congestion was due to a sudden and extreme spike of traffic.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2BHPQTMGbXizjZHfUEwf7S/7b9665371ed4e291c5df992066bc3c82/image2-1.png" />
            
            </figure><p>The engineer in charge identified the customer’s domain, tardis.dev, as being responsible for this sudden spike of traffic between Cloudflare and their origin network, a storage provider. Because this congestion happened on a physical interface connected to external peers, there was an immediate impact on many of our customers and peers. Port congestion like this typically incurs packet loss, slow throughput and higher than usual latency. While we have automatic mitigation in place for congesting interfaces, in this case the mitigation was unable to resolve the impact completely.</p><p>The traffic from this customer suddenly went from an average of 1,500 requests per second, and a 0.5 MB payload per request, to 3,000 requests per second (2x) and a payload of more than 12 MB per request (25x).</p>
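            <p>As a back-of-the-envelope check on the scale involved, that is a jump from roughly 1,500 × 0.5 MB ≈ 0.75 GB/s of origin pull to roughly 3,000 × 12 MB ≈ 36 GB/s, close to a 50x increase in bandwidth toward a single origin.</p>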
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ROLTRXsuoeYEpw0ewWDFX/c4af0bfaf7ef4c3c3966058153416209/image1-4.png" />
            
            </figure><p>The congestion happened between Cloudflare and the origin network. Caching did not help because the requests were all for unique URLs going to the origin, so we had no ability to serve the responses from cache.</p><p><b>A Cloudflare engineer decided to apply a throttling mechanism to prevent the zone from pulling so much traffic from their origin. Let's be very clear on this action: Cloudflare does not have an established process to throttle customers that consume large amounts of bandwidth, and does not intend to have one. This remediation was a mistake: it was not sanctioned, and we deeply regret it.</b></p><p>We lifted the throttle through internal escalation 12 hours and 53 minutes after having set it up.</p>
    <div>
      <h3>What's next</h3>
    </div>
    <p>To make sure a similar incident does not happen again, we are establishing clear rules to mitigate issues like this one. Any action taken against a customer domain, paying or not, will require multiple levels of approval and clear communication to the customer. Our tooling will be improved to reflect this. We have many ways of shaping traffic when a huge spike affects a link, and we could have applied a different mitigation in this instance.</p><p>We are in the process of rewriting our terms of service to better reflect the type of services that our customers deliver on our platform today. We are also committed to explaining to our users in plain language what is permitted under self-service plans. As a developer-first company with transparency as one of its core principles, we know we can do better here. We will follow up with a blog post dedicated to these changes later.</p><p>Once again, we apologize to the customer for this action and for the confusion it created for other Cloudflare customers.</p>
            <category><![CDATA[Customers]]></category>
            <category><![CDATA[Transparency]]></category>
            <guid isPermaLink="false">5Ulx28kIpVehkdG8jDUoLB</guid>
            <dc:creator>Jeremy Hartman</dc:creator>
            <dc:creator>Jérôme Fleury</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare outage on June 21, 2022]]></title>
            <link>https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/</link>
            <pubDate>Tue, 21 Jun 2022 12:38:05 GMT</pubDate>
            <description><![CDATA[ Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LzOzlg7mmUmt62U2SepS2/7d09343a7115e0eb064b482a6dc3df14/BLOG-1226-Global-Latency-1.png" />
            
            </figure>
    <div>
      <h3>Introduction</h3>
    </div>
    <p>Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately, these 19 locations handle a significant proportion of our global traffic. This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations. A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.</p><p>Depending on your location in the world you may have been unable to access websites and services that rely on Cloudflare. In other locations, Cloudflare continued to operate normally.</p><p>We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.</p>
    <div>
      <h3>Background</h3>
    </div>
    <p>Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture. In this time, we’ve converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo.</p><p>A critical part of this new architecture, which is designed as a <a href="https://en.wikipedia.org/wiki/Clos_network">Clos network</a>, is an added layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem. This layer is represented by the spines in the following diagram.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3wHXMwydO0BJOwgL8myKrl/4c603771d020ef604f23bcbfd9b44649/image2-27.png" />
            
            </figure><p>This new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic. As these locations also carry a significant proportion of the Cloudflare traffic, any problem here can have a very wide impact, and unfortunately, that’s what happened today.</p>
    <div>
      <h3>Incident timeline and impact</h3>
    </div>
    <p>In order to be reachable on the Internet, networks like Cloudflare make use of a protocol called <a href="https://www.cloudflare.com/en-gb/learning/security/glossary/what-is-bgp/">BGP</a>. As part of this protocol, operators define policies which decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.</p><p>These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being "withdrawn", and those IP addresses will no longer be reachable on the Internet.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oS7n2Z7dAA9LNGAsz23rf/b7e176a7985fbebef456f7fad251de24/image1-27.png" />
            
            </figure><p>While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.</p><p>Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.</p><ul><li><p><b>03:56 UTC</b>: We deploy the change to our first location. None of our locations are impacted by the change, as these are using our older architecture.</p></li><li><p><b>06:17</b>: The change is deployed to our busiest locations, but not the locations with the MCP architecture.</p></li><li><p><b>06:27</b>: The rollout reached the MCP-enabled locations, and the change is deployed to our spines. <b>This is when the incident started</b>, as this swiftly took these 19 locations offline.</p></li><li><p><b>06:32</b>: Internal Cloudflare incident declared.</p></li><li><p><b>06:51</b>: First change made on a router to verify the root cause.</p></li><li><p><b>06:58</b>: Root cause found and understood. Work begins to revert the problematic change.</p></li><li><p><b>07:42</b>: The last of the reverts has been completed. This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically.</p></li><li><p><b>08:00</b>: Incident closed.</p></li></ul><p>The criticality of these data centers can clearly be seen in the volume of successful HTTP requests we handled globally:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1bgGYgoUmRFFbay19ThYit/74168099ecf004a6dc1bbda9aa81919f/image3-19.png" />
            
            </figure><p>Even though these locations are only 4% of our total network, the outage impacted 50% of total requests. The same can be seen in our egress bandwidth:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gWQTaPkyGxypCw4LBdZML/df1e2bbe7ca4a06f5746f7be42700c37/image6-11.png" />
            
            </figure>
    <div>
      <h3>Technical description of the error and how it happened</h3>
    </div>
    <p>As part of our continued effort to standardize our infrastructure configuration, we were rolling out a change to standardize the BGP communities we attach to a subset of the prefixes we advertise. Specifically, we were adding informational communities to our site-local prefixes.</p><p>These prefixes allow our metals to communicate with each other, as well as connect to customer origins. As part of the change procedure at Cloudflare, a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure. Before it was allowed to go out, it was also peer reviewed by multiple engineers. Unfortunately, in this case, the steps weren’t small enough to catch the error before it hit all of our spines.</p><p>The change looked like this on one of the routers:</p>
            <pre><code>[edit policy-options policy-statement 4-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 4-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
[edit policy-options policy-statement 6-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;</code></pre>
            <p>This was harmless, and just added some additional information to these prefix advertisements. The change on the spines was the following:</p>
            <pre><code>[edit policy-options policy-statement AGGREGATES-OUT]
term 6-DISABLED_PREFIXES { ... }
!    term 6-ADV-TRAFFIC-PREDICTOR { ... }
!    term 4-ADV-TRAFFIC-PREDICTOR { ... }
!    term ADV-FREE { ... }
!    term ADV-PRO { ... }
!    term ADV-BIZ { ... }
!    term ADV-ENT { ... }
!    term ADV-DNS { ... }
!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }
[edit policy-options policy-statement AGGREGATES-OUT term 4-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;
[edit policy-options policy-statement AGGREGATES-OUT term 6-ADV-SITE-LOCALS then]
community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;</code></pre>
            <p>An initial glance at this diff might give the impression that this change is identical to the first one, but unfortunately, that’s not the case. If we focus on one part of the diff, it might become clear why:</p>
            <pre><code>!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }</code></pre>
            <p>In this diff format, the exclamation marks in front of the terms indicate a re-ordering of the terms. In this case, multiple terms moved up, and two terms were added to the bottom. Specifically, the 4-ADV-SITE-LOCALS and 6-ADV-SITE-LOCALS terms moved from the top to the bottom. These terms were now behind the REJECT-THE-REST term, and as might be clear from the name, this term is an explicit reject:</p>
            <pre><code>term REJECT-THE-REST {
    then reject;
} </code></pre>
            <p>As this term is now before the site-local terms, we immediately stopped advertising our site-local prefixes, removing our direct access to all the impacted locations, as well as removing the ability of our servers to reach origin servers.</p><p>On top of the inability to contact origins, the removal of these site-local prefixes also caused our internal load balancing system Multimog (a variation of our <a href="/unimog-cloudflares-edge-load-balancer/">Unimog load-balancer</a>) to stop working, as it could no longer forward requests between the servers in our MCPs. This meant that our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload.</p>
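            <p>To make the effect of the re-ordering concrete, the following is a small, hypothetical model of this first-match evaluation in TypeScript. It is a simplification for illustration only, not router code: the match condition and example prefix are placeholders.</p>
            <pre><code>// Simplified, hypothetical model of sequential export-policy evaluation:
// the first term whose condition matches decides the fate of a prefix.
type Term = { name: string; matches: (prefix: string) => boolean; advertise: boolean };

function evaluate(policy: Term[], prefix: string): boolean {
  for (const term of policy) {
    if (term.matches(prefix)) {
      return term.advertise; // first matching term wins; later terms never run
    }
  }
  return false;
}

const isSiteLocal = (p: string) => p.startsWith("10."); // placeholder match condition

// Intended ordering: site-local terms sit above the catch-all reject.
const intended: Term[] = [
  { name: "4-ADV-SITE-LOCALS", matches: isSiteLocal, advertise: true },
  { name: "REJECT-THE-REST", matches: () => true, advertise: false },
];

// Deployed ordering: the catch-all reject ended up above the site-local terms,
// so site-local prefixes stopped being advertised (were withdrawn).
const deployed: Term[] = [
  { name: "REJECT-THE-REST", matches: () => true, advertise: false },
  { name: "4-ADV-SITE-LOCALS", matches: isSiteLocal, advertise: true },
];

console.log(evaluate(intended, "10.0.0.0/8")); // true  (advertised)
console.log(evaluate(deployed, "10.0.0.0/8")); // false (withdrawn)</code></pre>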
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jBcPxvumdeOx5Ffojnwuw/d34adc458e95b357c521733a933d73bf/image4-15.png" />
            
            </figure>
    <div>
      <h3>Remediation and follow-up steps</h3>
    </div>
    <p>This incident had widespread impact, and we take availability very seriously. We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.</p><p>Here is what we are working on immediately:</p><p><b>Process</b>: While the MCP program was designed to improve availability, a procedural gap in how we updated these data centers ultimately caused a broader impact in MCP locations specifically. Although we did use a stagger procedure for this change, the stagger policy did not include an MCP data center until the final step. Change procedures and automation need to include MCP-specific test and deploy procedures to ensure there are no unintended consequences.</p><p><b>Architecture</b>: The incorrect router configuration prevented the proper routes from being announced, which in turn stopped traffic from flowing to our infrastructure. The policy statement that caused the incorrect routing advertisement will be redesigned to prevent an unintentional re-ordering of its terms.</p><p><b>Automation</b>: There are several opportunities in our automation suite that would mitigate some or all of the impact seen from this event. Primarily, we will be concentrating on automation improvements that enforce an improved stagger policy for rollouts of network configuration and provide an automated “commit-confirm” rollback, as sketched below. The former enhancement would have significantly lessened the overall impact, and the latter would have greatly reduced the Time-to-Resolve during the incident.</p>
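    <p>For context, Junos routers already support this pattern manually. A minimal sketch, assuming a ten-minute confirmation window (the window length and prompt are illustrative), looks like this:</p>
    <pre><code>[edit]
user@router# commit confirmed 10

(verify that prefixes are still advertised and traffic is healthy)

user@router# commit</code></pre>
    <p>If the second <code>commit</code> is not issued within the window, the router automatically rolls back to the previously committed configuration. The automation work described above is about enforcing this kind of safety net across a staggered, fleet-wide rollout rather than relying on an operator to remember it.</p>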
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Although Cloudflare has invested significantly in our MCP design to improve service availability, we clearly fell short of our customer expectations with this very painful incident. We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage. We have already started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">5rERCOYkkrYTVgFUr46dlN</guid>
            <dc:creator>Tom Strickx</dc:creator>
            <dc:creator>Jeremy Hartman</dc:creator>
        </item>
        <item>
            <title><![CDATA[Incorrect proxying of 24 hostnames on January 24, 2022]]></title>
            <link>https://blog.cloudflare.com/incorrect-proxying-of-24-hostnames-on-january-24-2022/</link>
            <pubDate>Wed, 26 Jan 2022 18:34:08 GMT</pubDate>
            <description><![CDATA[ On January 24, 2022, as a result of an internal product migration, 24 hostnames (including www.cloudflare.com) that were actively proxied through the Cloudflare global network were mistakenly redirected to the wrong origin ]]></description>
            <content:encoded><![CDATA[ <p>On January 24, 2022, as a result of an internal Cloudflare product migration, 24 hostnames (including <a href="http://www.cloudflare.com">www.cloudflare.com</a>) that were actively proxied through the Cloudflare global network were mistakenly redirected to the wrong origin. During this incident, traffic destined for these hostnames was passed through to the clickfunnels.com origin and may have resulted in a <code>clickfunnels.com</code> page being displayed instead of the intended website. This was our doing and clickfunnels.com was unaware of our error until traffic started to reach their origin.</p><p>API calls or other expected responses to and from these hostnames may not have responded properly, or may have failed completely. For example, if you were making an API call to api.example.com, and api.example.com was an impacted hostname, you likely would not have received the response you would have expected.</p><p>Here is what happened:</p><p>At 2022-01-24 22:24 UTC we started a migration of hundreds of thousands of custom hostnames to the <a href="https://www.cloudflare.com/application-services/products/ssl-for-saas-providers/">Cloudflare for SaaS product.</a> Cloudflare for SaaS allows SaaS providers to manage their customers’ websites and <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL certificates</a> at scale - more information is available <a href="/cloudflare-for-saas/">here</a>. This <a href="https://developers.cloudflare.com/ssl/ssl-for-saas/reference/versioning">migration</a> was intended to be completely seamless, with the outcome being enhanced features and security for our customers. The migration process was designed to read the custom hostname configuration from a database and migrate it from SaaS v1 (the old system) to SaaS v2 (the current version) automatically.</p><p>To better understand what happened next, it’s important to explain a bit more about how custom hostnames are configured.</p><p>First, Cloudflare for SaaS customers can configure any hostname; but before we will proxy traffic to them, they must prove (via DNS validation) that they actually are allowed to handle that hostname’s traffic.</p><p>When the Cloudflare for SaaS customer first configures the hostname, it is marked as pending until DNS validation has occurred. Pending hostnames are very common for Cloudflare for SaaS customers as the hostname gets provisioned, and then the SaaS provider will typically work with their customer to put in place the appropriate DNS validation that proves ownership.</p><p>Once a hostname passes DNS validation, it moves from a pending state to an active state and can be proxied. Except in one case: there’s a special check for whether the hostname is marked as blocked within Cloudflare’s system. A blocked hostname is one that can’t be activated without manual approval by our Trust &amp; Safety team. Some scenarios that could lead to a hostname being blocked include when the hostname is a Cloudflare-owned property, a well known brand, or a hostname in need of additional review for a variety of reasons.</p><p>During this incident, a very small number of blocked hostnames were erroneously moved to the active state while migrating clickfunnels.com’s customers. Once that occurred, traffic destined for those previously blocked hostnames was then processed by a configuration belonging to clickfunnels.com, sending traffic to the clickfunnels.com’s origin. 
One of those hostnames was <a href="http://www.cloudflare.com">www.cloudflare.com</a>. Note that it was <a href="http://www.cloudflare.com">www.cloudflare.com</a> and not cloudflare.com, so subdomains like dash.cloudflare.com, api.cloudflare.com, cdnjs.cloudflare.com, and so on were unaffected by this problem.</p><p>As the migration process continued down the list of hostnames, additional traffic was re-routed to the clickfunnels.com origin. At 23:06 UTC <a href="http://www.cloudflare.com">www.cloudflare.com</a> was affected. At 23:15 UTC an incident was declared internally. Since the first alert we received was for <a href="http://www.cloudflare.com">www.cloudflare.com</a>, we started our investigation there. In the next 19 minutes, the team restored <a href="http://www.cloudflare.com">www.cloudflare.com</a> to its correct origin, determined the breadth of the impact and the root cause of the incident, and began remediation for the remaining affected hostnames.</p><p>By 2022-01-25 00:13 UTC, all custom hostnames had been restored to their proper configuration and the incident was closed. We have contacted all the customers who were affected by this error. We have worked with ClickFunnels to delete logs of this event to ensure that no data erroneously sent to the clickfunnels.com origin is retained by them, and we are very grateful for their speedy assistance.</p><p>Here is a graph (on a log scale) of requests to clickfunnels.com during the event. Out of a total of 268,430,157 requests redirected, 268,220,296 (99.92%) were for <a href="http://www.cloudflare.com">www.cloudflare.com</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZJtL8tRo16K0DbDL5D2lj/720a2b3ff7593778925e94f6d0af1084/unnamed-16.png" />
            
            </figure><p>At Cloudflare, we take these types of incidents very seriously, dedicating massive amounts of resources to preventative action and follow-up engineering. In this case, there are both procedural and technical follow-ups to prevent recurrence. Here are our next steps:</p><ul><li><p>No more blocked hostname overrides. All blocked hostname changes will route through our verification pipeline as part of the migration process.</p></li><li><p>All migrations will require explicit validation and approval from SaaS customers for a blocked hostname to be considered for activation.</p></li><li><p>Additional monitoring will be added to the hostnames being migrated to spot potential erroneous traffic patterns and alert the migration team.</p></li><li><p>Additional monitoring will be added for <a href="http://www.cloudflare.com">www.cloudflare.com</a>.</p></li><li><p>Staging hostname activations on non-production elements prior to promoting them to production will let us verify that the new hostname state is as expected. This will allow us to catch issues before they hit production traffic.</p></li></ul>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>This event exposed previously unknown gaps in our process and technology that directly impacted our customers. We are truly sorry for the disruption to our customers and any potential visitor to the impacted properties. Our commitment is to provide fully reliable and secure products, and we will continue to make every effort possible to deliver just that for our customers and partners.</p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">2ZWbB5QAV3XoI1iJe1T0dz</guid>
            <dc:creator>Jeremy Hartman</dc:creator>
        </item>
    </channel>
</rss>