
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 12 Apr 2026 18:27:51 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Shifting left at enterprise scale: how we manage Cloudflare with Infrastructure as Code]]></title>
            <link>https://blog.cloudflare.com/shift-left-enterprise-scale/</link>
            <pubDate>Tue, 09 Dec 2025 06:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare has shifted to Infrastructure as Code and policy enforcement to manage internal Cloudflare accounts. This new architecture uses Terraform, custom tooling, and Open Policy Agent to enforce security baselines and increase engineering velocity. ]]></description>
            <content:encoded><![CDATA[ <p>The Cloudflare platform is a critical system for Cloudflare itself. We are our own Customer Zero – using our products to secure and optimize our own services. </p><p>Within our security division, a dedicated Customer Zero team uses its unique position to provide a constant, high-fidelity feedback loop to product and engineering that drives continuous improvement of our products. And we do this at a global scale — where a single misconfiguration can propagate across our edge in seconds and lead to unintended consequences. If you've ever hesitated before pushing a change to production, sweating because you know one small mistake could lock every employee out of a critical application or take down a production service, you know the feeling. The risk of unintended consequences is real, and it keeps us up at night.</p><p>This presents an interesting challenge: How do we ensure hundreds of internal production Cloudflare accounts are secured consistently while minimizing human error?</p><p>While the Cloudflare dashboard is excellent for <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> and analytics, manually clicking through hundreds of accounts to ensure security settings are identical is a recipe for mistakes. To keep our sanity and our security intact, we stopped treating our configurations as manual point-and-click tasks and started treating them like code. We adopted “shift left” principles to move security checks to the earliest stages of development. </p><p>This wasn't an abstract corporate goal for us. It was a survival mechanism to catch errors before they caused an incident, and it required a fundamental change in our governance architecture.</p>
    <div>
      <h2>What Shift Left means to us</h2>
      <a href="#what-shift-left-means-to-us">
        
      </a>
    </div>
    <p>"Shifting left" refers to moving validation steps earlier in the software development lifecycle (SDLC). In practice, this means integrating testing, security audits, and policy compliance checks directly into the <a href="https://www.cloudflare.com/learning/serverless/glossary/what-is-ci-cd/">continuous integration and continuous deployment (CI/CD) pipeline</a>. By catching issues or misconfigurations at the merge request stage, we identify issues when the cost of remediation is lowest, rather than discovering them after deployment.</p><p>When we think about applying shift left principles at Cloudflare, four key principles stand out:</p><ul><li><p><b>Consistency</b>: Configurations must be easily copied and reused across accounts.</p></li><li><p><b>Scalability</b>: Large changes can be applied rapidly across multiple accounts.</p></li><li><p><b>Observability</b>: Configurations must be auditable by anyone for current state, accuracy, and security.</p></li><li><p><b>Governance</b>: Guardrails must be proactive — enforced before deployment to avoid incidents.</p></li></ul>
    <div>
      <h2>A production IaC operating model</h2>
      <a href="#a-production-iac-operating-model">
        
      </a>
    </div>
    <p>To support this model, we transitioned all production accounts to being managed with Infrastructure as Code (IaC). Every modification is tracked, tied to a user, commit, and an internal ticket. Teams still use the dashboard for analytics and insights, but critical production changes are all done in code.</p><p>This model ensures every change is peer-reviewed, and policies, though set by the security team, are implemented by the owning engineering teams themselves.</p><p>This setup is grounded in two major technologies: <a href="https://developer.hashicorp.com/terraform"><u>Terraform</u></a> and a custom CI/CD pipeline.</p>
    <div>
      <h2>Our enterprise IaC stack</h2>
      <a href="#our-enterprise-iac-stack">
        
      </a>
    </div>
    <p>We chose Terraform for its mature open-source ecosystem, strong community support, and deep integration with Policy as Code tooling. Furthermore, using the <a href="https://registry.terraform.io/providers/cloudflare/cloudflare/latest/docs"><u>Cloudflare Terraform Provider</u></a> internally allows us to actively <a href="https://blog.cloudflare.com/tag/dogfooding/"><u>dogfood</u></a> the experience and improve it for our customers.</p><p>To manage the scale of hundreds of accounts and around 30 merge requests per day, our CI/CD pipeline runs on <a href="https://www.runatlantis.io/"><u>Atlantis</u></a>, integrated with <a href="https://about.gitlab.com/"><u>GitLab</u></a>. We also use a custom Go program, tfstate-butler, that acts as a broker to securely store state files.</p><p>tfstate-butler operates as an HTTP backend for Terraform. The primary design driver was security: it ensures unique encryption keys per state file to limit the blast radius of any potential compromise. </p><p>All internal account configurations are defined in a centralized <a href="https://developers.cloudflare.com/pages/configuration/monorepos/"><u>monorepo</u></a>. Individual teams own and deploy their specific configurations and are the designated code owners for their sections of this centralized repository, ensuring accountability. To read more about this configuration, check out <a href="https://blog.cloudflare.com/terraforming-cloudflare-at-cloudflare/"><u>How Cloudflare uses Terraform to manage Cloudflare</u></a>.</p>
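<p>As an illustration of that design, here is a minimal, in-memory sketch of a state broker that keeps a unique key per state file. This is our own hedged approximation, not tfstate-butler's actual code: the function names, the maps standing in for real storage, and the choice of AES-256-GCM are all assumptions.</p>

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical sketch: every Terraform state file gets its own random
// 256-bit key, so compromising one key exposes only one state file.
const keys = new Map<string, Buffer>();  // stateId -> per-state key
const blobs = new Map<string, Buffer>(); // stateId -> iv || authTag || ciphertext

function keyFor(stateId: string): Buffer {
  let key = keys.get(stateId);
  if (!key) {
    key = randomBytes(32); // unique key per state file
    keys.set(stateId, key);
  }
  return key;
}

// Terraform's HTTP backend writes the state document after an apply...
export function putState(stateId: string, state: string): void {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", keyFor(stateId), iv);
  const ciphertext = Buffer.concat([cipher.update(state, "utf8"), cipher.final()]);
  blobs.set(stateId, Buffer.concat([iv, cipher.getAuthTag(), ciphertext]));
}

// ...and reads it back on the next plan.
export function getState(stateId: string): string | undefined {
  const blob = blobs.get(stateId);
  if (!blob) return undefined;
  const decipher = createDecipheriv("aes-256-gcm", keyFor(stateId), blob.subarray(0, 12));
  decipher.setAuthTag(blob.subarray(12, 28));
  return Buffer.concat([decipher.update(blob.subarray(28)), decipher.final()]).toString("utf8");
}
```

<p>With this layout, leaking the key for one account's state file cannot decrypt any other account's state.</p>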
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17DDCeUrkEeWqtqpIPZoV4/114b63e0b8c408843b14c447ed27ed97/image1.png" />
          </figure><p><sup>Infrastructure as Code Data Flow Diagram</sup></p>
    <div>
      <h2>Baselines and Policy as Code</h2>
      <a href="#baselines-and-policy-as-code">
        
      </a>
    </div>
    <p>The entire shift left strategy hinges on establishing a strong security baseline for all internal production Cloudflare accounts. The baseline is a collection of security policies that are defined in code (Policy as Code). This baseline is not merely a set of guidelines but rather a required security configuration we enforce across the platform — e.g., maximum session length, required logs, specific WAF configurations, etc. </p><p>This setup is where policy enforcement shifts from manual audits to automated gates. We use the <a href="https://www.openpolicyagent.org/"><u>Open Policy Agent (OPA)</u></a> framework and its policy language, <a href="https://www.openpolicyagent.org/docs/policy-language#what-is-rego"><u>Rego</u></a>, via the Atlantis Conftest Policy Checking feature.</p>
    <div>
      <h2>Defining policies as code</h2>
      <a href="#defining-policies-as-code">
        
      </a>
    </div>
    <p>Rego policies define the specific security requirements that make up the baseline for all Cloudflare provider resources. We currently maintain approximately 50 policies.</p><p>For example, here is a Rego policy that validates that only @cloudflare.com email addresses are used in an Access policy: </p>
            <pre><code># validate no use of non-cloudflare email
warn contains reason if {
    r := tfplan.resource_changes[_]
    r.mode == "managed"
    r.type == "cloudflare_access_policy"

    include := r.change.after.include[_]
    email_address := include.email[_]
    not endswith(email_address, "@cloudflare.com")

    reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}
warn contains reason if {
    r := tfplan.resource_changes[_]
    r.mode == "managed"
    r.type == "cloudflare_access_policy"

    require := r.change.after.require[_]
    email_address := require.email[_]
    not endswith(email_address, "@cloudflare.com")

    reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}</code></pre>
            
    <div>
      <h2>Enforcing the baseline</h2>
      <a href="#enforcing-the-baseline">
        
      </a>
    </div>
    <p>The policy check runs on every merge request (MR), ensuring configurations are compliant <i>before</i> deployment. Policy check output is shown directly in the GitLab MR comment thread. </p><p>Policy enforcement operates in two modes:</p><ol><li><p><b>Warning:</b> Leaves a comment on the MR, but allows the merge.</p></li><li><p><b>Deny:</b> Blocks the deployment outright.</p></li></ol><p>If the policy check determines the configuration being applied in the MR deviates from the baseline, the output will return which resources are out of compliance.</p><p>The example below shows an output from a policy check identifying 3 discrepancies in a merge request:</p>
            <pre><code>WARN - cloudflare_zero_trust_access_application.app_saas_xxx :: "session_duration" must be less than or equal to 10h

WARN - cloudflare_zero_trust_access_application.app_saas_xxx_pay_per_crawl :: "session_duration" must be less than or equal to 10h

WARN - cloudflare_zero_trust_access_application.app_saas_ms :: you must have at least one require statement of auth_method = "swk"

41 tests, 38 passed, 3 warnings, 0 failures, 0 exception</code></pre>
            
    <div>
      <h2>Handling policy exceptions</h2>
      <a href="#handling-policy-exceptions">
        
      </a>
    </div>
    <p>We understand that exceptions are necessary, but they must be managed with the same rigor as the policy itself. When a team requires an exception, they submit a request via Jira.</p><p>Once approved by the Customer Zero team, the exception is formalized by submitting a merge request that updates the central exceptions.rego file. Exceptions can be made at various levels:</p><ul><li><p><b>Account:</b> Exclude account_x from policy_y.</p></li><li><p><b>Resource Category:</b> Exclude all resource_a’s in account_x from policy_y.</p></li><li><p><b>Specific Resource:</b> Exclude resource_a_1 in account_x from policy_y.</p></li></ul><p>This example shows a session length exception for five specific applications under two separate Cloudflare accounts: </p>
            <pre><code>{
    "exception_type": "session_length",
    "exceptions": [
        {
            "account_id": "1xxxx",
            "tf_addresses": [
                "cloudflare_access_application.app_identity_access_denied",
                "cloudflare_access_application.enforcing_ext_auth_worker_bypass",
                "cloudflare_access_application.enforcing_ext_auth_worker_bypass_dev"
            ]
        },
        {
            "account_id": "2xxxx",
            "tf_addresses": [
                "cloudflare_access_application.extra_wildcard_application",
                "cloudflare_access_application.wildcard"
            ]
        }
    ]
}</code></pre>
            
    <div>
      <h2>Challenges and lessons learned</h2>
      <a href="#challenges-and-lessons-learned">
        
      </a>
    </div>
    <p>Our journey wasn't without obstacles. We had years of clickops (manual changes made directly in the dashboard) scattered across hundreds of accounts. Trying to import the existing chaos into a strict Infrastructure as Code system felt like trying to change the tires on a moving car. Importing existing resources remains an ongoing process to this day.</p><p>We also ran into limitations of our own tools. We found edge cases in the Cloudflare Terraform provider that only appear when you try to manage infrastructure at this scale. These weren't just minor speed bumps. They were hard lessons on the necessity of eating our own dogfood, so we could build even better solutions.</p><p>That friction clarified exactly what we were up against, leading us to three hard-earned lessons.</p>
    <div>
      <h2>Lesson 1: high barriers to entry stall adoption </h2>
      <a href="#lesson-1-high-barriers-to-entry-stall-adoption">
        
      </a>
    </div>
    <p>The first hurdle for any large-scale IaC rollout is onboarding existing, manually configured resources. We gave teams two options: manually creating Terraform resources and import blocks, or using <a href="https://github.com/cloudflare/cf-terraforming"><u>cf-terraforming</u></a>.</p><p>We quickly discovered that Terraform fluency varies across teams, and the learning curve for manually importing existing resources proved to be much steeper than we anticipated.</p><p>Luckily, the cf-terraforming command-line utility uses the Cloudflare API to automatically generate the necessary Terraform code and import statements, significantly accelerating the migration process. </p><p>We also formed an internal community where experienced engineers could guide teams through the nuances of the provider and help unblock complex imports.</p>
    <div>
      <h2>Lesson 2: drift happens </h2>
      <a href="#lesson-2-drift-happens">
        
      </a>
    </div>
    <p>We also had to tackle configuration drift, which occurs when the IaC process is bypassed to expedite urgent changes. While making edits directly in the dashboard is faster during an incident, it leaves the Terraform state out of sync with reality.</p><p>We implemented a custom drift detection service that constantly compares the state defined by Terraform with the actual deployed state via the Cloudflare API. When drift is detected, an automated system creates an internal ticket and assigns it to the owning team with varying Service Level Agreements (SLAs) for remediation. </p>
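<p>The drift check itself reduces to a diff between two views of the same resource: what Terraform believes is deployed versus what the API actually returns. The sketch below is a hedged simplification of that comparison; the types and function names are invented for illustration, not our actual drift detection service:</p>

```typescript
// Hypothetical sketch: compare desired configuration (from Terraform
// state) against the live configuration (from the Cloudflare API).
type Config = Record<string, unknown>;

export interface Drift {
  resource: string;
  key: string;
  expected: unknown;
  actual: unknown;
}

export function detectDrift(
  desired: Record<string, Config>, // resource address -> config in Terraform state
  actual: Record<string, Config>   // resource address -> config reported by the API
): Drift[] {
  const drifts: Drift[] = [];
  for (const [resource, want] of Object.entries(desired)) {
    const live = actual[resource];
    if (!live) {
      // Resource was deleted outside of Terraform.
      drifts.push({ resource, key: "<resource>", expected: "present", actual: "missing" });
      continue;
    }
    for (const [key, expected] of Object.entries(want)) {
      // Structural comparison via JSON; real code would diff more carefully.
      if (JSON.stringify(expected) !== JSON.stringify(live[key])) {
        drifts.push({ resource, key, expected, actual: live[key] });
      }
    }
  }
  return drifts;
}
```

<p>Each reported drift can then be turned into a ticket for the owning team, as described above.</p>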
    <div>
      <h2>Lesson 3: automation is key</h2>
      <a href="#lesson-3-automation-is-key">
        
      </a>
    </div>
    <p>Cloudflare innovates quickly, so our set of products and APIs is ever-growing. Unfortunately, that meant our Terraform provider often lagged behind the product in feature parity. </p><p>We solved that issue with the release of our <a href="https://blog.cloudflare.com/automatically-generating-cloudflares-terraform-provider/"><u>v5 provider</u></a>, which automatically generates the Terraform provider based on the OpenAPI specification. This transition wasn’t without bumps as we hardened our approach to code generation, but generating the provider from the specification ensures that the API and Terraform stay in sync, reducing the chance of capability drift. </p>
    <div>
      <h2>The core lesson: proactive &gt; reactive</h2>
      <a href="#the-core-lesson-proactive-reactive">
        
      </a>
    </div>
    <p>By centralizing our security baselines, mandating peer reviews, and enforcing policies before any change hits production, we minimize the possibility of configuration errors, accidental deletions, or policy violations. The architecture not only helps to prevent manual mistakes, but actually increases engineering velocity because teams are confident their changes are compliant. </p><p>The key lesson from our work with Customer Zero is this: While the Cloudflare dashboard is excellent for day-to-day operations, achieving enterprise-level scale and consistent governance requires a different approach. When you treat your Cloudflare configurations as living code, you can scale securely and confidently. </p><p>Have thoughts on Infrastructure as Code? Keep the conversation going and share your experiences over at <a href="http://community.cloudflare.com"><u>community.cloudflare.com</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Infrastructure as Code]]></category>
            <category><![CDATA[Customer Zero]]></category>
            <category><![CDATA[Terraform]]></category>
            <category><![CDATA[Dogfooding]]></category>
            <guid isPermaLink="false">6Kx2Ob32avdygaasok3ZCr</guid>
            <dc:creator>Chase Catelli</dc:creator>
            <dc:creator>Ryan Pesek</dc:creator>
            <dc:creator>Derek Pitts</dc:creator>
        </item>
        <item>
            <title><![CDATA[Let’s DO this: detecting Workers Builds errors across 1 million Durable Objects]]></title>
            <link>https://blog.cloudflare.com/detecting-workers-builds-errors-across-1-million-durable-durable-objects/</link>
            <pubDate>Thu, 29 May 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Workers Builds, our CI/CD product for deploying Workers, monitors build issues by analyzing build failure metadata spread across over one million Durable Objects. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare Workers Builds is our <a href="https://en.wikipedia.org/wiki/CI/CD"><u>CI/CD</u></a> product that makes it easy to build and deploy Workers applications every time code is pushed to GitHub or GitLab. What makes Workers Builds special is that projects can be built and deployed with minimal configuration.<a href="https://developers.cloudflare.com/workers/ci-cd/builds/#get-started"> <u>Just hook up your project and let us take care of the rest!</u></a></p><p>But what happens when things go wrong, such as failing to install tools or dependencies? What usually happens is that we don’t fix the problem until a customer contacts us about it, at which point many other customers have likely faced the same issue. This can be a frustrating experience for both us and our customers because of the lag time between issues occurring and us fixing them.</p><p>We want Workers Builds to be reliable, fast, and easy to use so that developers can focus on building, not dealing with our bugs. 
That’s why we recently started building an error detection system that can detect, categorize, and surface all build issues occurring on Workers Builds, enabling us to proactively fix issues and add missing features.</p><p>It’s also no secret that we’re big fans of being “<a href="https://www.cloudflare.com/the-net/top-of-mind-security/customer-zero/">Customer Zero</a>” at Cloudflare, and Workers Builds is itself a product that’s built end-to-end on our <a href="https://www.cloudflare.com/developer-platform/"><u>Developer Platform</u></a> using <a href="https://developers.cloudflare.com/workers/"><u>Workers</u></a>, <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a>, <a href="https://developers.cloudflare.com/hyperdrive/"><u>Hyperdrive</u></a>, <a href="https://blog.cloudflare.com/cloudflare-containers-coming-2025/"><u>Containers</u></a>, <a href="https://developers.cloudflare.com/queues/"><u>Queues</u></a>, <a href="https://developers.cloudflare.com/kv/"><u>Workers KV</u></a>, <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>, and <a href="https://developers.cloudflare.com/workers/observability/"><u>Workers Observability</u></a>.</p><p>In this post, we will dive into how we used the <a href="https://www.cloudflare.com/developer-platform/">Cloudflare Developer Platform</a> to check for issues across more than <b>1 million Durable Objects</b>.</p>
    <div>
      <h2>Background: Workers Builds architecture</h2>
      <a href="#background-workers-builds-architecture">
        
      </a>
    </div>
    <p>Back in October 2024, we wrote about<a href="https://blog.cloudflare.com/workers-builds-integrated-ci-cd-built-on-the-workers-platform/"> <u>how we built Workers Builds entirely on the Workers platform</u></a>. To recap, Builds is built using Workers, Durable Objects, Workers KV, R2, Queues, Hyperdrive, and a Postgres database. Some of these pieces were not yet in place when we launched back in October (for example, Queues and KV), but the core of the architecture is the same.</p><p>A client Worker receives GitHub/GitLab webhooks and stores build metadata in Postgres (via Hyperdrive). A build management Worker uses two Durable Object classes: a Scheduler class to find builds in Postgres that need scheduling, and a class called BuildBuddy to manage the lifecycle of a build. When a build needs to be started, Scheduler creates a new BuildBuddy instance which is responsible for creating a container for the build (using<a href="https://blog.cloudflare.com/container-platform-preview/"> <u>Cloudflare Containers</u></a>), monitoring the container with health checks, and receiving build logs so that they can be viewed in the Cloudflare Dashboard.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Zf6QSXafUJOxn6isLsqar/fd8eaa3428185c3da2ef96ddd1fdc43c/image2.png" />
          </figure><p>In addition to this core scheduling logic, we have several Workers Queues for background work such as sending PR comments to GitHub/GitLab.</p>
    <div>
      <h2>The problem: builds are failing</h2>
      <a href="#the-problem-builds-are-failing">
        
      </a>
    </div>
    <p>While this architecture has worked well for us so far, we found ourselves with a problem: compared to<a href="https://developers.cloudflare.com/pages/"> <u>Cloudflare Pages</u></a>, a concerning percentage of builds were failing. We needed to dig deeper and figure out what was wrong, and understand how we could improve Workers Builds so that developers can focus more on shipping instead of build failures.</p>
    <div>
      <h2>Types of build failures</h2>
      <a href="#types-of-build-failures">
        
      </a>
    </div>
    <p>Not all build failures are the same. We have several categories of failures that we monitor:</p><ul><li><p>Initialization failures: when the container fails to start.</p></li><li><p>Clone failures: failing to clone the repository from GitHub/GitLab.</p></li><li><p>Build timeouts: builds that ran past the limit and were terminated by BuildBuddy.</p></li><li><p>Builds failing health checks: the container stopped responding to health checks, e.g. the container crashed for an unknown reason.</p></li><li><p>Failure to install tools or dependencies.</p></li><li><p>Failed user build/deploy commands.</p></li></ul><p>The first few failure types were straightforward, and we’ve been able to track down and fix issues in our build system and control plane to improve what we call “build completion rate”. We define build completion as the following:</p><ol><li><p>We successfully started the build.</p></li><li><p>We attempted to install tools/dependencies (considering failures as “user error”).</p></li><li><p>We attempted to run the user-defined build/deploy commands (again, considering failures as “user error”).</p></li><li><p>We successfully marked the build as stopped in our database.</p></li></ol><p>For example, we had a bug where builds for a deleted Worker would attempt to run and continuously fail, which affected our build completion rate metric.</p>
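<p>The four-step definition above can be expressed as a simple predicate. This is an illustrative sketch with field names we invented, not our actual metrics code:</p>

```typescript
// Hypothetical sketch of the "build completion" metric: a build counts
// as complete if we started it, reached the tool-install and
// user-command phases (failures there count as user error), and
// recorded a final stopped state in the database.
interface BuildRecord {
  started: boolean;
  attemptedToolInstall: boolean;
  attemptedUserCommands: boolean;
  markedStopped: boolean;
}

export function isCompleted(b: BuildRecord): boolean {
  return b.started && b.attemptedToolInstall && b.attemptedUserCommands && b.markedStopped;
}

export function completionRate(builds: BuildRecord[]): number {
  if (builds.length === 0) return 1;
  return builds.filter(isCompleted).length / builds.length;
}
```

<p>Under this definition, the deleted-Worker bug mentioned above lowered the rate because those builds never reached the user-command phase.</p>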
    <div>
      <h3>User error</h3>
      <a href="#user-error">
        
      </a>
    </div>
    <p>We’ve made a lot of progress improving the reliability of build and container orchestration, but we had a significant percentage of build failures in the “user error” metric. We started asking ourselves “is this actually user error? Or is there a problem with the product itself?”</p><p>This presented a challenge because questions like “did the build command fail due to a bug in the build system, or user error?” are a lot harder to answer than pass/fail issues like failing to create a container for the build. To answer these questions, we had to build something new, something smarter.</p>
    <div>
      <h3>Build logs</h3>
      <a href="#build-logs">
        
      </a>
    </div>
    <p>The most obvious way to determine why a build failed is to look at its logs. When spot-checking build failures, we can typically identify what went wrong. For example, some builds fail to install dependencies because of an out of date lockfile (e.g. package-lock.json out of date with package.json). But looking through build failures one by one doesn’t scale. We didn’t want engineers looking through customer build logs without at least suspecting that there was an issue with our build system that we could fix.</p>
    <div>
      <h2>Automating error detection</h2>
      <a href="#automating-error-detection">
        
      </a>
    </div>
    <p>At this point, the next steps were clear: we needed an automated way to identify why a build failed based on its logs, and a way for engineers to see what the top issues were while ensuring privacy (e.g. removing account-specific identifiers and file paths from the aggregate data).</p>
    <div>
      <h3>Detecting errors in build logs using Workers Queues</h3>
      <a href="#detecting-errors-in-build-logs-using-workers-queues">
        
      </a>
    </div>
    <p>The first thing we needed was a way to categorize build errors after a build fails. To do this, we created a queue named BuildErrorsQueue to process builds and look for errors. After a build fails, BuildBuddy will send the build ID to BuildErrorsQueue which fetches the logs, checks for issues, and saves results to Postgres.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3423WCenScTudEv27TCMnJ/86b621a957d4249449c99db43a43bb9a/image7.png" />
          </figure><p>We started out with a few static patterns to match things like Wrangler errors in log lines:</p>
            <pre><code>export const DetectedErrorCodes = {
  wrangler_error: {
    detect: async (lines: LogLines) =&gt; {
      const errors: DetectedError[] = []
      for (const line of lines) {
        if (line[2].trim().startsWith('✘ [ERROR]')) {
          errors.push({
            error_code: 'wrangler_error',
            error_group: getWranglerLogGroupFromLogLine(line, wranglerRegexMatchers),
            detected_on: new Date(),
            lines_matched: [line],
          })
        }
      }
      return errors
    },
  },
  installing_tools_or_dependencies_failed: { ... },
}</code></pre>
            <p>It wouldn’t be useful if all Wrangler errors were grouped under a single generic “wrangler_error” code, so we further grouped them by normalizing the log lines into groups:</p>
            <pre><code>function getWranglerLogGroupFromLogLine(
  logLine: LogLine,
  regexMatchers: RegexMatcher[]
): string {
  const original = logLine[2].trim().replaceAll(/[\t\n\r]+/g, ' ')
  let message = original
  let group = original
  for (const { mustMatch, patterns, stopOnMatch, name, useNameAsGroup } of regexMatchers) {
    if (mustMatch !== undefined) {
      const matched = matchLineToRegexes(message, mustMatch)
      if (!matched) continue
    }
    if (patterns) {
      for (const [pattern, mask] of patterns) {
        message = message.replaceAll(pattern, mask)
      }
    }
    if (useNameAsGroup === true) {
      group = name
    } else {
      group = message
    }
    if (Boolean(stopOnMatch) &amp;&amp; message !== original) break
  }
  return group
}

const wranglerRegexMatchers: RegexMatcher[] = [
  {
    name: 'could_not_resolve',
    // ✘ [ERROR] Could not resolve "./balance"
    // ✘ [ERROR] Could not resolve "node:string_decoder" (originally "string_decoder/")
    mustMatch: [/^✘ \[ERROR\] Could not resolve "[@\w :/\\.-]*"/i],
    stopOnMatch: true,
    patterns: [
      [/(?&lt;=^✘ \[ERROR\] Could not resolve ")[@\w :/\\.-]*(?=")/gi, '&lt;MODULE&gt;'],
      [/(?&lt;=\(originally ")[@\w :/\\.-]*(?=")/gi, '&lt;MODULE&gt;'],
    ],
  },
  {
    name: 'no_matching_export_for_import',
    // ✘ [ERROR] No matching export in "src/db/schemas/index.ts" for import "someCoolTable"
    mustMatch: [/^✘ \[ERROR\] No matching export in "/i],
    stopOnMatch: true,
    patterns: [
      [/(?&lt;=^✘ \[ERROR\] No matching export in ")[@~\w:/\\.-]*(?=")/gi, '&lt;MODULE&gt;'],
      [/(?&lt;=" for import ")[\w-]*(?=")/gi, '&lt;IMPORT&gt;'],
    ],
  },
  // ...many more added over time
]</code></pre>
            <p>Once we had our error detection matchers and normalizing logic in place, implementing the BuildErrorsQueue consumer was easy:</p>
            <pre><code>export async function handleQueue(
  batch: MessageBatch,
  env: Bindings,
  ctx: ExecutionContext
): Promise&lt;void&gt; {
  ...
  await pMap(batch.messages, async (msg) =&gt; {
    try {
      const { build_id } = BuildErrorsQueueMessageBody.parse(msg.body)
      await store.buildErrors.deleteErrorsByBuildId({ build_id })
      const bb = getBuildBuddy(env, build_id)
      const errors: DetectedError[] = []
      let cursor: LogsCursor | undefined
      let hasMore = false

      do {
        using maybeNewLogs = await bb.getLogs(cursor, false)
        const newLogs = LogsWithCursor.parse(maybeNewLogs)
        cursor = newLogs.cursor
        const newErrors = await detectErrorsInLogLines(newLogs.lines)
        errors.push(...newErrors)
        hasMore = Boolean(cursor) &amp;&amp; newLogs.lines.length &gt; 0
      } while (hasMore)

      if (errors.length &gt; 0) {
        await store.buildErrors.insertErrors(
          errors.map((e) =&gt; ({
            build_id,
            error_code: e.error_code,
            error_group: e.error_group,
          }))
        )
      }
      msg.ack()
    } catch (e) {
      msg.retry()
      sentry.captureException(e)
    }
  })
}</code></pre>
            <p>Here, we’re fetching logs from each build’s BuildBuddy Durable Object, detecting why it failed using the matchers we wrote, and saving errors to the Postgres DB. We also delete any existing errors for a build before reprocessing it, so that when we improve our error detection patterns, subsequent runs don’t add duplicate data to our database.</p>
    <div>
      <h2>What about historical builds?</h2>
      <a href="#what-about-historical-builds">
        
      </a>
    </div>
    <p>The BuildErrorsQueue was great for new builds, but we still didn’t know why all the previous build failures happened, other than “user error”. We considered tracking errors only in new builds, but that was unacceptable: it would significantly slow down our ability to improve our error detection system, since each iteration would require waiting days to identify the issues we needed to prioritize.</p>
    <div>
      <h3>Problem: logs are stored across one million+ Durable Objects</h3>
      <a href="#problem-logs-are-stored-across-one-million-durable-objects">
        
      </a>
    </div>
    <p>Remember how every build has an associated BuildBuddy DO to store logs? This is a great design for ensuring our logging pipeline scales with our customers, but it presented a challenge when trying to aggregate issues based on logs because something would need to go through all historical builds (&gt;1 million at the time) to fetch logs and detect why they failed.</p><p>If we were using Go and Kubernetes, we might solve this using a long-running container that goes through all builds and runs our error detection. But how do we solve this in Workers?</p>
    <div>
      <h3>How do we backfill errors for historical builds?</h3>
      <a href="#how-do-we-backfill-errors-for-historical-builds">
        
      </a>
    </div>
    <p>At this point, we already had the Queue to process new builds. If we could somehow send all of the old build IDs to the queue, it could use<a href="https://developers.cloudflare.com/queues/configuration/consumer-concurrency/"> <u>Queues concurrent consumers</u></a> to work through all builds quickly. We thought about hacking together a local script to fetch all of the build IDs and send them to an API that would put them on a queue. But we wanted something more secure and easier to use, so that running a new backfill was as simple as an API call.</p><p>That’s when an idea hit us: what if we used a Durable Object with alarms to fetch a range of builds and send them to BuildErrorsQueue? At first, it seemed far-fetched, given that Durable Object alarms have a limited amount of work they can do per invocation. But if<a href="https://agents.cloudflare.com/"> <u>AI Agents built on Durable Objects</u></a> can manage background tasks, why can’t we fetch millions of build IDs and forward them to queues?</p>
    <div>
      <h3>Building a Build Errors Agent with Durable Objects</h3>
      <a href="#building-a-build-errors-agent-with-durable-objects">
        
      </a>
    </div>
    <p>The idea was simple: create a Durable Object class named BuildErrorsAgent and run a single instance that loops through the specified range of builds in the database and sends them to BuildErrorsQueue.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3kmsS4LACzLUUoECSJT08g/b6a9ccffcbe8a41c300a74546a17ba85/image5.png" />
          </figure><p>The first thing we did was set up an RPC method to start a backfill and save the parameters in<a href="https://developers.cloudflare.com/durable-objects/api/storage-api/#kv-api"> <u>Durable Object KV storage</u></a> so that they can be read each time the alarm executes:</p>
            <pre><code>async start({
  min_build_id,
  max_build_id,
}: {
  min_build_id: BuildRecord['build_id']
  max_build_id: BuildRecord['build_id']
}): Promise&lt;void&gt; {
  logger.setTags({ handler: 'start', environment: this.env.ENVIRONMENT })
  try {
    if (min_build_id &lt; 0) throw new Error('min_build_id cannot be negative')
    if (max_build_id &lt; min_build_id) {
      throw new Error('max_build_id cannot be less than min_build_id')
    }
    const [started_on, stopped_on] = await Promise.all([
      this.kv.get('started_on'),
      this.kv.get('stopped_on'),
    ])
    await match({ started_on, stopped_on })
      .with({ started_on: P.not(null), stopped_on: P.nullish }, () =&gt; {
        throw new Error('BuildErrorsAgent is already running')
      })
      .otherwise(async () =&gt; {
        // delete all existing data and start queueing failed builds
        await this.state.storage.deleteAlarm()
        await this.state.storage.deleteAll()
        this.kv.put('started_on', new Date())
        this.kv.put('config', { min_build_id, max_build_id })
        void this.state.storage.setAlarm(this.getNextAlarmDate())
      })
  } catch (e) {
    this.sentry.captureException(e)
    throw e
  }
}</code></pre>
            <p>The most important part of the implementation is the alarm that runs every second until the job is complete. Each alarm invocation has the following steps:</p><ol><li><p>Set a new alarm (always first, so that an error doesn’t cause the loop to stop).</p></li><li><p>Retrieve state from KV.</p></li><li><p>Validate that the agent should still be running:</p><ol><li><p>Ensure the agent has been started and not stopped.</p></li><li><p>Ensure we haven’t reached the max build ID set in the config.</p></li></ol></li><li><p>Finally, queue up another batch of builds by querying Postgres and sending them to the BuildErrorsQueue.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Ab6VC49luyio3t5QamgMD/273c77158ff4ac7af662669360d5f485/image6.png" />
          </figure>
            <pre><code>async alarm(): Promise&lt;void&gt; {
  logger.setTags({ handler: 'alarm', environment: this.env.ENVIRONMENT })
  try {
    void this.state.storage.setAlarm(Date.now() + 1000)
    const kvState = await this.getKVState()
    this.sentry.setContext('BuildErrorsAgent', kvState)
    const ctxLogger = logger.withFields({ state: JSON.stringify(kvState) })

    await match(kvState)
      .with({ started_on: P.nullish }, async () =&gt; {
        ctxLogger.info('BuildErrorsAgent is not started, cancelling alarm')
        await this.state.storage.deleteAlarm()
      })
      .with({ stopped_on: P.not(null) }, async () =&gt; {
        ctxLogger.info('BuildErrorsAgent is stopped, cancelling alarm')
        await this.state.storage.deleteAlarm()
      })
      .with(
        // we should never have started_on set without config set, but just in case
        { started_on: P.not(null), config: P.nullish },
        async () =&gt; {
          const msg =
            'BuildErrorsAgent started but config is empty, stopping and cancelling alarm'
          ctxLogger.error(msg)
          this.sentry.captureException(new Error(msg))
          this.kv.put('stopped_on', new Date())
          await this.state.storage.deleteAlarm()
        }
      )
      .when(
        // make sure there are still builds to enqueue
        (s) =&gt;
          s.latest_build_id !== null &amp;&amp;
          s.config !== null &amp;&amp;
          s.latest_build_id &gt;= s.config.max_build_id,
        async () =&gt; {
          ctxLogger.info('BuildErrorsAgent job complete, cancelling alarm')
          this.kv.put('stopped_on', new Date())
          await this.state.storage.deleteAlarm()
        }
      )
      .with(
        {
          started_on: P.not(null),
          stopped_on: P.nullish,
          config: P.not(null),
          latest_build_id: P.any,
        },
        async ({ config, latest_build_id }) =&gt; {
          // 1. select batch of ~1000 builds
          // 2. send them to Queues 100 at a time, updating
          //    latest_build_id after each batch is sent
          const failedBuilds = await this.store.builds.selectFailedBuilds({
            min_build_id: latest_build_id !== null ? latest_build_id + 1 : config.min_build_id,
            max_build_id: config.max_build_id,
            limit: 1000,
          })
          if (failedBuilds.length === 0) {
            ctxLogger.info(`BuildErrorsAgent: ran out of builds, stopping and cancelling alarm`)
            this.kv.put('stopped_on', new Date())
            await this.state.storage.deleteAlarm()
          }

          for (
            let i = 0;
            i &lt; BUILDS_PER_ALARM_RUN &amp;&amp; i &lt; failedBuilds.length;
            i += QUEUES_BATCH_SIZE
          ) {
            const batch = failedBuilds
              .slice(i, i + QUEUES_BATCH_SIZE)
              .map((build) =&gt; ({ body: build }))

            if (batch.length === 0) {
              ctxLogger.info(`BuildErrorsAgent: ran out of builds in current batch`)
              break
            }
            ctxLogger.info(
              `BuildErrorsAgent: sending ${batch.length} builds to build errors queue`
            )
            await this.env.BUILD_ERRORS_QUEUE.sendBatch(batch)
            this.kv.put(
              'latest_build_id',
              Math.max(...batch.map((m) =&gt; m.body.build_id).concat(latest_build_id ?? 0))
            )

            this.kv.put(
              'total_builds_processed',
              ((await this.kv.get('total_builds_processed')) ?? 0) + batch.length
            )
          }
        }
      )
      .otherwise(() =&gt; {
        const msg = 'BuildErrorsAgent has nothing to do - this should never happen'
        this.sentry.captureException(new Error(msg))
        ctxLogger.info(msg)
      })
  } catch (e) {
    this.sentry.captureException(e)
    throw e
  }
}</code></pre>
            <p>Using pattern matching with <a href="https://github.com/gvergnaud/ts-pattern"><u>ts-pattern</u></a> made it much easier to understand what states we were expecting and what would happen in each, compared to procedural code. We considered using a more powerful library like <a href="https://stately.ai/docs/xstate"><u>XState</u></a>, but decided on ts-pattern due to its simplicity.</p>
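<p>For readers unfamiliar with the library, the alarm’s branches boil down to a small state machine. Here is a hedged distillation of that decision logic as a plain function; the field names mirror the KV state above, but the <code>Action</code> names are ours for illustration, and the real handler uses ts-pattern and performs side effects rather than returning a value:</p>

```typescript
// Distillation of the alarm's state checks into a pure decision function.
// Illustrative only; the actual handler also sets alarms and writes KV.
interface AgentState {
  started_on: Date | null
  stopped_on: Date | null
  config: { min_build_id: number; max_build_id: number } | null
  latest_build_id: number | null
}

type Action = 'cancel_alarm' | 'stop_and_cancel' | 'enqueue_batch'

function decide(s: AgentState): Action {
  if (s.started_on === null) return 'cancel_alarm' // never started
  if (s.stopped_on !== null) return 'cancel_alarm' // already stopped
  if (s.config === null) return 'stop_and_cancel' // inconsistent state
  if (s.latest_build_id !== null && s.latest_build_id >= s.config.max_build_id) {
    return 'stop_and_cancel' // reached the end of the requested range
  }
  return 'enqueue_batch' // still builds left to queue
}
```

<p>Writing the branches as ordered, mutually exclusive patterns is what made the ts-pattern version easy to audit: each reachable state maps to exactly one action.</p>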
    <div>
      <h3>Running the backfill</h3>
      <a href="#running-the-backfill">
        
      </a>
    </div>
    <p>Once everything rolled out, we were able to trigger an errors backfill for over a million failed builds in a couple of hours with a single API call, categorizing 80% of failed builds on the first run. With a fast backfill process, we were able to iterate on our regex matchers to further refine our error detection and improve error grouping. Here’s what the error list looks like in our staging environment:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5rdNvB1SpjGpeiCOCs86Tj/74141402e67fbd9ced673a98cb3c57f6/image4.png" />
          </figure>
    <div>
      <h2>Fixes and improvements</h2>
      <a href="#fixes-and-improvements">
        
      </a>
    </div>
    <p>Having a better understanding of what’s going wrong has already enabled us to make several improvements:</p><ul><li><p>Wrangler now shows a<a href="https://github.com/cloudflare/workers-sdk/pull/8534"> <u>clearer error message when no config file is found</u></a>.</p></li><li><p>Fixed multiple edge cases where the wrong package manager was used in TypeScript/JavaScript projects.</p></li><li><p>Added support for bun.lock (previously, only bun.lockb was checked for).</p></li><li><p>Fixed several edge cases where build caching did not work in monorepos.</p></li><li><p>Projects that use a runtime.txt file to specify a Python version no longer fail.</p></li><li><p>…and more!</p></li></ul><p>We’re still working on fixing other bugs we’ve found, but we’re making steady progress. Reliability is a feature we’re striving for in Workers Builds, and this project has helped us make meaningful progress towards that goal. Instead of waiting for people to contact support, we’re able to proactively identify and fix issues (and catch regressions more easily).</p><p>One of the great things about building on the Developer Platform is how easy it is to ship things. The core of this error detection pipeline (the Queue and Durable Object) <b>only took two days to build</b>, which meant we could spend more time improving Workers Builds instead of spending weeks on the error detection pipeline itself.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>In addition to continuing to improve build reliability and speed, we’ve also started thinking about other ways to help developers build their applications on Workers. For example, we built a<a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/workers-builds"> <u>Builds MCP server</u></a> that allows users to debug builds directly in Cursor/Claude/etc. We’re also thinking about ways we can expose these detected issues in the Cloudflare Dashboard so that users can identify issues more easily without scrolling through hundreds of logs.</p>
    <div>
      <h2>Ready to get started?</h2>
      <a href="#ready-to-get-started">
        
      </a>
    </div>
    <p>Building applications on Workers has never been easier! Try deploying a Durable Object-backed <a href="https://github.com/cloudflare/templates/tree/main/durable-chat-template"><u>chat application</u></a> with Workers Builds: </p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/templates/tree/main/durable-chat-template"><img src="https://deploy.workers.cloudflare.com/button" /></a><p></p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[Dogfooding]]></category>
            <guid isPermaLink="false">2dJV7VMudIGAhdS2pL32lv</guid>
            <dc:creator>Jacob Hands</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare Security does Zero Trust]]></title>
            <link>https://blog.cloudflare.com/how-cloudflare-security-does-zero-trust/</link>
            <pubDate>Fri, 24 Jun 2022 14:15:31 GMT</pubDate>
            <description><![CDATA[ How Cloudflare’s security team implemented Zero Trust controls ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Throughout Cloudflare One week, we provided playbooks on how to replace your legacy appliances with Zero Trust services. Using our own products is part of our team’s culture, and we want to share our experiences when we <a href="https://www.cloudflare.com/learning/access-management/how-to-implement-zero-trust/">implemented Zero Trust</a>.</p><p>Our journey was similar to many of our customers. Not only did we want better <a href="https://www.cloudflare.com/security/">security solutions</a>, but the tools we were using made our work more difficult than it needed to be. This started with just a search for an alternative to remotely connecting on a clunky VPN, but soon we were deploying <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust solutions</a> to protect our employees’ web browsing and email. Next, we are looking forward to upgrading our SaaS security with our new <a href="https://www.cloudflare.com/products/zero-trust/casb/">CASB</a> product.</p><p>We know that getting started with Zero Trust can seem daunting, so we hope that you can learn from our own journey and see how it benefited us.</p>
    <div>
      <h3>Replacing a VPN: launching Cloudflare Access</h3>
      <a href="#replacing-a-vpn-launching-cloudflare-access">
        
      </a>
    </div>
    <p>Back in 2015, all of Cloudflare’s internally hosted applications were reached via a hardware-based VPN. On-call engineers would fire up a client on their laptop, connect to the VPN, and log on to Grafana. This process was frustrating and slow.</p><p>Many of the products we build are a direct result of the challenges our own team is facing, and Access is a perfect example. Launched as an internal project in 2015, Access enabled employees to access internal applications through our identity provider. We started with just one application behind Access with the goal of improving incident response times. Engineers who received a notification on their phones could tap a link and, after authenticating via their browser, would immediately have the access they needed. As soon as people started working with the new authentication flow, they wanted it everywhere. Eventually our security team mandated that we move our apps behind Access, but for a long time it was totally organic: teams were eager to use it.</p><p>With authentication occurring at our network edge, we were able to support a globally distributed workforce without the latency of a VPN, and we were able to do so securely. Moreover, our team is committed to protecting our internal applications with the most secure and usable authentication mechanisms, and two-factor authentication is one of the most important security controls that can be implemented. With Cloudflare Access, we’re able to rely on the strong two-factor authentication mechanisms of our identity provider.</p><p>Not all second factors of authentication deliver the same level of security. Some methods are still vulnerable to man-in-the-middle (MITM) attacks. These attacks often feature bad actors stealing one-time passwords, commonly through phishing, to gain access to private resources. 
To eliminate that possibility, we implemented <a href="https://fidoalliance.org/specs/fido-v2.0-rd-20161004/fido-client-to-authenticator-protocol-v2.0-rd-20161004.html">FIDO2</a> supported security keys. FIDO2 is an authenticator protocol designed to <a href="https://www.cloudflare.com/learning/email-security/how-to-prevent-phishing/">prevent phishing</a>, and we saw it as an improvement to our reliance on soft tokens at the time.</p><p>While the implementation of FIDO2 can present compatibility challenges, we were enthusiastic to improve our security posture. Cloudflare Access enabled us to limit access to our systems to only FIDO2. Cloudflare employees are now required to use their hardware keys to reach our applications. The onboarding of Access was not only a huge win for ease of use, the enforcement of security keys was a massive improvement to our security posture.</p>
    <div>
      <h3>Mitigate threats &amp; prevent data exfiltration: Gateway and Remote Browser Isolation</h3>
      <a href="#mitigate-threats-prevent-data-exfiltration-gateway-and-remote-browser-isolation">
        
      </a>
    </div>
    
    <div>
      <h4>Deploying secure DNS in our offices</h4>
      <a href="#deploying-secure-dns-in-our-offices">
        
      </a>
    </div>
    <p>A few years later, in 2020, many customers’ security teams were struggling to extend the controls they had enabled in the office to their remote workers. In response, we launched Cloudflare Gateway, offering customers protection from malware, ransomware, phishing, command &amp; control, shadow IT, and other Internet risks over all ports and protocols. Gateway directs and filters traffic according to the policies implemented by the customer.</p><p>Our security team started with Gateway to implement DNS filtering in all of our offices. Since Gateway was built on top of the same network as 1.1.1.1, the world’s fastest DNS resolver, any current or future Cloudflare office will have DNS filtering without incurring additional latency. Each office connects to the nearest data center and is protected.</p>
    <div>
      <h4>Deploying secure DNS for our remote users</h4>
      <a href="#deploying-secure-dns-for-our-remote-users">
        
      </a>
    </div>
    <p>Cloudflare’s WARP client was also built on top of our 1.1.1.1 DNS resolver. It extends the <a href="https://www.cloudflare.com/products/zero-trust/remote-workforces/">security and performance</a> offered in offices to remote corporate devices. With the WARP client deployed, corporate devices connect to the nearest Cloudflare data center and are routed to Cloudflare Gateway. By sitting between the corporate device and the Internet, the entire connection from the device is secure, while also offering improved speed and privacy.</p><p>We sought to extend secure DNS filtering to our remote workforce and deployed the Cloudflare WARP client to our fleet of endpoint devices. The deployment enabled our security teams to better preserve our privacy by encrypting DNS traffic over DNS over HTTPS (DoH). Meanwhile, Cloudflare Gateway categorizes domains based on <a href="https://radar.cloudflare.com/">Radar</a>, our own threat intelligence platform, enabling us to block high risk and suspicious domains for users everywhere around the world.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yfcgPWAmGAzf01s7NMBM0/fbe39dcf04d00f61570528c68fa907d1/pasted-image-0.png" />
            
            </figure>
    <div>
      <h4>Adding on HTTPS filtering and Browser Isolation</h4>
      <a href="#adding-on-https-filtering-and-browser-isolation">
        
      </a>
    </div>
    <p>DNS filtering is a valuable security tool, but it is limited to blocking entire domains. Our team wanted a more precise instrument to block only malicious URLs, not the full domain. Since Cloudflare One is an integrated platform, most of the deployment was already complete. All we needed was to add the Cloudflare Root CA to our endpoints and then enable HTTP filtering in the Zero Trust dashboard. With those few simple steps, we were able to implement more granular blocking controls.</p><p>In addition to precision blocking, HTTP filtering enables us to implement <a href="https://developers.cloudflare.com/cloudflare-one/policies/filtering/http-policies/tenant-control/">tenant control</a>. With tenant control, Gateway HTTP policies regulate access to corporate SaaS applications. Policies are implemented using custom HTTP headers. If the custom request header is present and the request is headed to an organizational account, access is granted. If the request header is present and the request goes to a non-organizational account, such as a personal account, the request can be blocked or opened in an isolated browser.</p><p>After protecting our users’ traffic at the <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> and HTTP layers, we implemented Browser Isolation. When Browser Isolation is implemented, all browser code executes in the cloud on Cloudflare’s network. This isolates our endpoints from malicious attacks and <a href="https://www.cloudflare.com/learning/security/what-is-data-exfiltration/">common data exfiltration techniques</a>. Some <a href="https://www.cloudflare.com/learning/access-management/what-is-browser-isolation/">remote browser isolation</a> products introduce latency and frustrate users. Cloudflare’s Browser Isolation uses the power of our network to offer a seamless experience for our employees. It quickly improved our security posture without compromising user experience.</p>
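<p>Conceptually, a tenant-control decision looks something like the sketch below. The header name, account matching, and verdict names are placeholders for illustration, not Cloudflare’s actual configuration:</p>

```typescript
// Hedged sketch of the tenant-control logic described above: requests carrying
// the custom header and bound for the organizational account are allowed;
// anything else (e.g. a personal account) is blocked or opened in an
// isolated browser.
type Verdict = 'allow' | 'block_or_isolate'

function tenantControlVerdict(
  headers: { [name: string]: string },
  targetAccount: string,
  orgAccount: string
): Verdict {
  const hasHeader = 'X-Tenant-Control' in headers // placeholder header name
  if (hasHeader && targetAccount === orgAccount) return 'allow'
  return 'block_or_isolate'
}
```

<p>Because the proxy injects the header only for managed traffic, a SaaS login from an unmanaged device or to a personal tenant never sees the “allow” path.</p>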
    <div>
      <h3>Preventing phishing attacks: Onboarding Area 1 email security</h3>
      <a href="#preventing-phishing-attacks-onboarding-area-1-email-security">
        
      </a>
    </div>
    <p>Also in early 2020, we saw an uptick in employee-reported phishing attempts. Our cloud-based email provider had strong spam filtering, but it fell short at blocking malicious threats and other advanced attacks. As we experienced increasing phishing attack volume and frequency, we felt it was time to explore more thorough email protection options.</p><p>The team looked for four main things in a vendor: the ability to scan email attachments, the ability to analyze suspected malicious links, business email compromise protection, and strong APIs into cloud-native email providers. After testing many vendors, Area 1 became the clear choice to protect our employees. We implemented Area 1’s solution in early 2020, and the results have been fantastic.</p><p>Given the overwhelmingly positive response to the product and the desire to build out our Zero Trust portfolio, <a href="/why-we-are-acquiring-area-1/">Cloudflare acquired Area 1 Email Security</a> in April 2022. We are excited to offer the same protections we use to our customers.</p>
    <div>
      <h3>What’s next: Getting started with Cloudflare’s CASB</h3>
      <a href="#whats-next-getting-started-with-cloudflares-casb">
        
      </a>
    </div>
    <p><a href="/cloudflare-acquires-vectrix-to-expand-zero-trust-saas-security/">Cloudflare acquired Vectrix</a> in February 2022. Vectrix’s CASB offers functionality we are excited to add to Cloudflare One. SaaS security is an increasing concern for many security teams. SaaS tools are storing more and more sensitive corporate data, so misconfigurations and external access can be a significant threat. However, securing these platforms can present a significant resource challenge. Manual reviews for misconfigurations or externally shared files are time-consuming, yet necessary processes for many customers. <a href="https://www.cloudflare.com/learning/access-management/what-is-a-casb/">CASB</a> reduces the burden on teams by ensuring security standards by scanning SaaS instances and identifying vulnerabilities with just a few clicks.</p><p>We want to ensure we maintain the best practices for SaaS security, and like many of our customers, we have many SaaS applications to secure. We are always seeking opportunities to make our processes more efficient, so we are excited to onboard one of our newest Zero Trust products.</p>
    <div>
      <h3>Always striving for improvement</h3>
      <a href="#always-striving-for-improvement">
        
      </a>
    </div>
    <p>Cloudflare takes pride in deploying and testing our own products. Our security team works directly with Product to “dog food” our own products first. It’s our mission to help build a better Internet — and that means providing valuable feedback from our internal teams. As the number one consumer of Cloudflare’s products, the Security team is not only helping keep the company safer, but also contributing to build better products for our customers.</p><p>We hope you have enjoyed Cloudflare One week. We really enjoyed sharing our stories with you. To check out our recap of the week, please visit our <a href="https://gateway.on24.com/wcc/eh/2153307/lp/3824611/?_gl=1%2a1gzme6u%2a_ga%2aMTkxODk3NTg2MC4xNjMyMTUzNjc4%2a_gid%2aNjI2NDA3OTcxLjE2NTQ1MzM5MjQ">Cloudflare TV segment</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare One Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Cloudflare Access]]></category>
            <category><![CDATA[Dogfooding]]></category>
            <guid isPermaLink="false">3se8QfBZjWolhVUDZFngeV</guid>
            <dc:creator>Noelle Kagan</dc:creator>
            <dc:creator>Tim Obezuk</dc:creator>
            <dc:creator>Derek Pitts</dc:creator>
        </item>
        <item>
            <title><![CDATA[Securing Cloudflare Using Cloudflare]]></title>
            <link>https://blog.cloudflare.com/securing-cloudflare-using-cloudflare/</link>
            <pubDate>Fri, 18 Mar 2022 21:04:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is committed to bolstering our security posture with best-in-class solutions — which is why we often turn to our own products as any other Cloudflare customer would. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>When a new security threat arises — a publicly exploited vulnerability (like <a href="/tag/log4j/">log4j</a>) or the shift from corporate-controlled environments to remote work or a potential threat actor — it is the Security team’s job to respond to protect Cloudflare’s network, customers, and employees. And as security threats evolve, so should our defense system. Cloudflare is committed to bolstering our security posture with best-in-class solutions — which is why we often turn to our own products as any other Cloudflare customer would.</p><p>We’ve written about using <a href="/dogfooding-from-home/">Cloudflare Access to replace our VPN</a>, <a href="/access-purpose-justification/">Purpose Justification</a> to create granular access controls, and <a href="/using-cloudflare-one-to-secure-iot-devices/">Magic + Gateway</a> to prevent lateral movement from in-house. We experience the same security needs, wants, and concerns as security teams at enterprises worldwide, so we rely on the same solutions as the Fortune 500 companies that trust Cloudflare for improved security, performance, and speed. Using our own products is embedded in our team’s culture.</p>
    <div>
      <h3>Security Challenges, Cloudflare Solutions</h3>
      <a href="#security-challenges-cloudflare-solutions">
        
      </a>
    </div>
    <p>We’ve built the muscle to think Cloudflare-first when we encounter a security threat. In fact, many security problems we encounter have a Cloudflare solution.</p><ul><li><p><b>Problem</b>: <a href="https://www.cloudflare.com/products/zero-trust/remote-workforces/">Remote work</a> creates a security blind spot of remote devices and networks.</p></li><li><p><b>Solution</b>: Deploying Cloudflare Gateway and WARP to shield users and devices from threats, no matter their device or network connection.</p></li></ul><p>Our Detection &amp; Response team has regained visibility into security threats by connecting corporate devices to Gateway through our Cloudflare WARP application. These <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> products shield users and devices from security threats like malware, phishing, shadow IT and more, as well as enable our Security team to instantly block threats and prevent sensitive data from leaving our organization — with no performance impact for our employees, no matter their location.</p><ul><li><p><b>Problem</b>: Larger company footprint (in size and location) increases the complexity of internal tooling.</p></li><li><p><b>Solution</b>: Deploying Cloudflare Access and Purpose Justification to give network administrators granular and contextual application access controls.</p></li></ul><p>Our Identity and Access Management team uses Access to create policies that minimize data access, ensuring that our employees only have access to what they need. Flexibility and instantaneous application of policies allow for ease of scalability as our internal tooling and teams evolve. 
With Purpose Justification Capture, employees must also justify their use case for visiting domains with particularly sensitive data — which not only solves an internal need for Cloudflare, but helps our customers meet data policy requirements (like GDPR).</p><ul><li><p><b>Problem</b>: Engineering and Product teams move at a rapid pace. Conducting a manual review of every pull request is not scalable.</p></li><li><p><b>Solution</b>: A tool built on top of Workers that enables scanning of PRs for security bugs.</p></li></ul><p>Our Product Security Engineering team uses Cloudflare’s development platform, Workers, to seamlessly deploy a code review assist framework that flags secrets, vulnerable dependencies, and binary security flags. The flexibility of Workers enables the Security team to evolve the tool as security concerns change and scale it to the hundreds of PRs the company generates per week.</p><p>These are just some of the ways the Security team has used Cloudflare’s products to block malicious domains, streamline access management, provide visibility into threats, and harden our overall security posture. To give a sense of how we think about these challenges technically, we will dive into the implementation of one way we use Cloudflare to secure Cloudflare.</p>
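<p>To make the PR-scanning idea concrete, a check of that shape might flag likely secrets in the added lines of a diff. This is a hedged sketch under assumptions of ours: the patterns, function name, and diff handling are illustrative, not the team’s actual framework.</p>

```typescript
// Illustrative secret scan over a unified diff: only added lines are checked,
// and each match reports the name of the pattern that fired. The patterns
// here are examples only, not a complete or production rule set.
const SECRET_PATTERNS: { name: string; re: RegExp }[] = [
  { name: 'aws_access_key_id', re: /AKIA[0-9A-Z]{16}/ },
  { name: 'private_key_block', re: /-----BEGIN (RSA |EC )?PRIVATE KEY-----/ },
]

function scanDiff(diff: string): string[] {
  const findings: string[] = []
  for (const line of diff.split('\n')) {
    if (!line.startsWith('+')) continue // scan only lines the PR adds
    for (const { name, re } of SECRET_PATTERNS) {
      if (re.test(line)) findings.push(name)
    }
  }
  return findings
}
```

<p>Restricting the scan to added lines keeps the signal focused on what the PR introduces, rather than re-flagging context or deletions on every review.</p>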
    <div>
      <h3>Phish-proof websites using Cloudflare Access</h3>
      <a href="#phish-proof-websites-using-cloudflare-access">
        
      </a>
    </div>
    <p>Two-factor authentication is one of the most important security controls that can be implemented. Not all second factors provide the same level of security, though. Time-based one-time password (TOTP) apps like Google Authenticator are a strong second factor, but are <a href="https://www.cloudflare.com/learning/email-security/what-is-email-fraud/">vulnerable to phishing</a> via man-in-the-middle attacks. A successful phishing attack on an employee with privileged access is a terrifying thought, and it is a risk we wanted to completely eliminate.</p><p><a href="https://fidoalliance.org/specs/fido-v2.0-rd-20161004/fido-client-to-authenticator-protocol-v2.0-rd-20161004.html">FIDO2</a> was developed to provide simple UX and complete protection against phishing attacks. We decided to fully embrace FIDO2-supported security keys in all contexts, but FIDO2 support is not yet ubiquitous and there are many challenging compatibility issues. Cloudflare Access allowed us to enforce that FIDO2 was the only second factor that can be used when reaching systems protected by Cloudflare Access.</p><p>To manage our <a href="https://registry.terraform.io/providers/cloudflare/cloudflare/latest/docs/resources/access_policy">Cloudflare Access policies</a>, we check each one into source control as Terraform code. Group-based access control is enforced for our applications.</p>
            <pre><code>resource "cloudflare_access_policy" "prod_cloudflare_users" {
  application_id = cloudflare_access_application.prod_sandbox_access_application.id
  zone_id        = cloudflare_zone.prod_sandbox.id
  name           = "Allow "
  decision       = "allow"

  include {
    email_domain = ["cloudflare.com"]
    okta {
      # Require membership in Sandbox group
      name                 = ["ACL-ACCESS-sandbox"]
      identity_provider_id = cloudflare_access_identity_provider.okta_prod.id
    }
  }

  # Require a security key
  require {
    auth_method = "swk"
  }
}</code></pre>
            <p>The <code>require</code> section enforces that Cloudflare employees use their FIDO2-supported security keys to access all of our internal and external applications that are protected by Access. As described more deeply in <a href="https://datatracker.ietf.org/doc/rfc8176/">RFC 8176</a>, <code>auth_method</code> enforces that specific values are returned by our OIDC provider within the <code>amr</code> field during the login flow. The enforced <code>swk</code> value, “proof-of-possession of a software-secured key”, corresponds to logins using a security key.</p><p>The ability to enforce the use of security keys when accessing internal sites led to a massive improvement in our security posture. Prior to this change, many of our internal services accepted both soft tokens (like TOTP with Google Authenticator) and WebAuthn, because a small number of systems still didn’t support FIDO2. As the figure below shows, our use of soft tokens dropped to near zero after we enforced this change.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2v54nwYAAdgGpHURpaCgUs/dab84ed2d142d749cc47442c1ae06e60/QyHA5DUMOGgeLpD3qvfe-Qr4OnMSMi8n6-hbaeWVvsgZx6ZN0OpDtf72CwAbNY1t1Q1RrRgCdsHuFhJjfkX9WXYr8IM2ctzZPqFVz1o6zGMpCou2bH8l6Ze0ijoX.png" />
            
            </figure>
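            <p>Conceptually, the <code>amr</code> check described above reduces to inspecting one claim of the identity provider’s token. This is an illustrative sketch only — Access performs this enforcement for you, and the real implementation is not this code:</p><pre><code>// Illustrative sketch: RFC 8176 defines the "amr"
// (Authentication Methods References) values; per the policy above,
// "swk" must be present for the login to be allowed.
interface IdTokenClaims {
  email: string;
  amr?: string[]; // e.g. ["pwd", "swk"] for password + security key
}

function satisfiesSecurityKeyRequirement(claims: IdTokenClaims): boolean {
  return Array.isArray(claims.amr) &amp;&amp; claims.amr.includes("swk");
}</code></pre><p>A login backed only by a password and TOTP would carry something like <code>amr: ["pwd", "otp"]</code> and be rejected, while a security-key login includes <code>"swk"</code> and passes.</p>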
    <div>
      <h3>A Continued Practice</h3>
      <a href="#a-continued-practice">
        
      </a>
    </div>
    <p>Not only does the Security team deploy Cloudflare’s products, but we test them first too. We work directly with Product to “dogfood” our own products. It’s our mission to help build a better Internet — and that means testing our products and collecting valuable feedback from our internal teams before every launch. As the number one consumer of Cloudflare’s products, the Security team is not only helping keep the company safer, but also helping build better products for our customers.</p><p>For more examples and technical details of how we use Cloudflare products at Cloudflare, please check out this recent Cloudflare TV segment on <a href="https://cloudflare.tv/event/3XVzDuzBL7TZ5HMB9Ni4b0">Securing Cloudflare with Cloudflare</a>.</p><p>And for more information on the products mentioned in this document, reach out to our Sales team to find out how we can help you secure your business, teams, and users.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Dogfooding]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[LavaRand]]></category>
            <guid isPermaLink="false">7fbUy117ZxyBwWgcprLz1l</guid>
            <dc:creator>Molly Cinnamon</dc:creator>
            <dc:creator>Evan Johnson</dc:creator>
        </item>
    </channel>
</rss>