
Shifting left at enterprise scale: how we manage Cloudflare with Infrastructure as Code

2025-12-09

6 min read

The Cloudflare platform is a critical system for Cloudflare itself. We are our own Customer Zero – using our products to secure and optimize our own services. 

Within our security division, a dedicated Customer Zero team uses its unique position to provide a constant, high-fidelity feedback loop to product and engineering that drives continuous improvement of our products. And we do this at a global scale — where a single misconfiguration can propagate across our edge in seconds and lead to unintended consequences. If you've ever hesitated before pushing a change to production, sweating because you know one small mistake could lock every employee out of a critical application or take down a production service, you know the feeling. The risk is real, and it keeps us up at night.

This presents an interesting challenge: How do we ensure hundreds of internal production Cloudflare accounts are secured consistently while minimizing human error?

While the Cloudflare dashboard is excellent for observability and analytics, manually clicking through hundreds of accounts to ensure security settings are identical is a recipe for mistakes. To keep our sanity and our security intact, we stopped treating our configurations as manual point-and-click tasks and started treating them like code. We adopted “shift left” principles to move security checks to the earliest stages of development. 

This wasn't an abstract corporate goal for us. It was a survival mechanism to catch errors before they caused an incident, and it required a fundamental change in our governance architecture.

What Shift Left means to us

"Shifting left" refers to moving validation steps earlier in the software development lifecycle (SDLC). In practice, this means integrating testing, security audits, and policy compliance checks directly into the continuous integration and continuous deployment (CI/CD) pipeline. By catching issues or misconfigurations at the merge request stage, we identify issues when the cost of remediation is lowest, rather than discovering them after deployment.

When we think about applying shift left principles at Cloudflare, four key principles stand out:

  • Consistency: Configurations must be easily copied and reused across accounts.

  • Scalability: Large changes can be applied rapidly across multiple accounts.

  • Observability: Configurations must be auditable by anyone for current state, accuracy, and security.

  • Governance: Guardrails must be proactive — enforced before deployment to avoid incidents.

A production IaC operating model

To support this model, we transitioned all production accounts to being managed with Infrastructure as Code (IaC). Every modification is tracked and tied to a user, a commit, and an internal ticket. Teams still use the dashboard for analytics and insights, but critical production changes are all done in code.
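
To make this concrete, here is a minimal sketch of what a small slice of such a configuration could look like. The provider block, module path, and variable names are hypothetical illustrations, not our actual setup:

# A minimal, hypothetical sketch of an account configuration managed in code.
# Module paths, variable names, and values are illustrative placeholders.
provider "cloudflare" {
  api_token = var.cloudflare_api_token # scoped token injected by CI, never hard-coded
}

# Shared baseline module reused across every account's configuration.
module "account_security_baseline" {
  source     = "../../modules/security-baseline"
  account_id = var.account_id
}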

This model ensures every change is peer-reviewed, and policies, though set by the security team, are implemented by the owning engineering teams themselves.

This setup is grounded in two major technologies: Terraform and a custom CI/CD pipeline.

Our enterprise IaC stack

We chose Terraform for its mature open-source ecosystem, strong community support, and deep integration with Policy as Code tooling. Furthermore, using the Cloudflare Terraform Provider internally allows us to actively dogfood the experience and improve it for our customers.

To manage the scale of hundreds of accounts and around 30 merge requests per day, our CI/CD pipeline runs on Atlantis, integrated with GitLab. We also use a custom Go program, tfstate-butler, which acts as a broker to securely store state files.

tfstate-butler operates as an HTTP backend for Terraform. The primary design driver was security: it ensures a unique encryption key per state file to limit the blast radius of any potential compromise.
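
This relies on Terraform's standard HTTP state backend. As a rough sketch (the hostnames, paths, and lock methods below are placeholders, not our real endpoints), a workspace could point at a broker like tfstate-butler as follows:

# Hedged sketch of an HTTP state backend configuration; the URLs and HTTP
# methods are placeholder assumptions. Credentials would be supplied by CI.
terraform {
  backend "http" {
    address        = "https://tfstate-butler.example.internal/states/account-x"
    lock_address   = "https://tfstate-butler.example.internal/states/account-x/lock"
    unlock_address = "https://tfstate-butler.example.internal/states/account-x/lock"
    lock_method    = "POST"
    unlock_method  = "DELETE"
  }
}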

All internal account configurations are defined in a centralized monorepo. Individual teams own and deploy their specific configurations and are the designated code owners for their sections of this centralized repository, ensuring accountability. To read more about this configuration, check out How Cloudflare uses Terraform to manage Cloudflare.

Infrastructure as Code Data Flow Diagram

Baselines and Policy as Code

The entire shift left strategy hinges on establishing a strong security baseline for all internal production Cloudflare accounts. The baseline is a collection of security policies defined in code (Policy as Code). It is not merely a set of guidelines but a required security configuration we enforce across the platform — for example, maximum session length, required logs, and specific WAF configurations.

This setup is where policy enforcement shifts from manual audits to automated gates. We use the Open Policy Agent (OPA) framework and its policy language, Rego, via the Atlantis Conftest Policy Checking feature.

Defining policies as code

Rego policies define the specific security requirements that make up the baseline for all Cloudflare provider resources. We currently maintain approximately 50 policies.

For example, here is a Rego policy that validates that only @cloudflare.com email addresses are used in an Access policy:

# Conftest-style policy excerpt; the "main" package name and the aliasing of
# the Terraform plan input as tfplan are assumptions added to make this
# example self-contained.
package main

import rego.v1

import input as tfplan

# validate no use of non-cloudflare email
warn contains reason if {
    r := tfplan.resource_changes[_]
    r.mode == "managed"
    r.type == "cloudflare_access_policy"

    include := r.change.after.include[_]
    email_address := include.email[_]
    not endswith(email_address, "@cloudflare.com")

    reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}
warn contains reason if {
    r := tfplan.resource_changes[_]
    r.mode == "managed"
    r.type == "cloudflare_access_policy"

    require := r.change.after.require[_]
    email_address := require.email[_]
    not endswith(email_address, "@cloudflare.com")

    reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}
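
For context, the resource this policy inspects is a Cloudflare Access policy whose include and require rules can reference email addresses. A compliant example, using the v4-style provider schema that matches the plan structure above (names, IDs, and variables are placeholders):

# Hypothetical, compliant Access policy: every email in the include rule uses
# the @cloudflare.com domain, so neither warn rule fires.
resource "cloudflare_access_policy" "allow_cloudflare_staff" {
  application_id = var.application_id # placeholder reference to the Access application
  account_id     = var.account_id
  name           = "Allow Cloudflare staff"
  precedence     = 1
  decision       = "allow"

  include {
    email = ["jdoe@cloudflare.com"] # a non-@cloudflare.com address here would trigger a warning
  }
}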

Enforcing the baseline

The policy check runs on every merge request (MR), ensuring configurations are compliant before deployment. Policy check output is shown directly in the GitLab MR comment thread. 

Policy enforcement operates in two modes:

  1. Warning: Leaves a comment on the MR, but allows the merge.

  2. Deny: Blocks the deployment outright.

If the policy check determines the configuration being applied in the MR deviates from the baseline, the output will return which resources are out of compliance.

The example below shows an output from a policy check identifying 3 discrepancies in a merge request:

WARN - cloudflare_zero_trust_access_application.app_saas_xxx :: "session_duration" must be less than or equal to 10h

WARN - cloudflare_zero_trust_access_application.app_saas_xxx_pay_per_crawl :: "session_duration" must be less than or equal to 10h

WARN - cloudflare_zero_trust_access_application.app_saas_ms :: you must have at least one require statement of auth_method = "swk"

41 tests, 38 passed, 3 warnings, 0 failures, 0 exception
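
Remediation is then a small, reviewable diff against the flagged resources. A minimal sketch of the kind of change that clears the session-length warnings, with hypothetical names and a simplified self-hosted application:

# Illustrative fix: cap the session length at the baseline maximum of 10h.
# Names, domain, and variables are placeholders.
resource "cloudflare_zero_trust_access_application" "app_example" {
  account_id       = var.account_id
  name             = "Example application"
  domain           = "app.example.com"
  type             = "self_hosted"
  session_duration = "10h" # previously exceeded 10h and was flagged by the policy check
}

The third warning would be addressed in the corresponding Access policy by adding a require rule with auth_method = "swk".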

Handling policy exceptions

We understand that exceptions are necessary, but they must be managed with the same rigor as the policy itself. When a team requires an exception, they submit a request via Jira.

Once approved by the Customer Zero team, the exception is formalized by submitting a merge request to the central exceptions.rego repository. Exceptions can be made at various levels:

  • Account: Exclude account_x from policy_y.

  • Resource Category: Exclude all resource_a’s in account_x from policy_y.

  • Specific Resource: Exclude resource_a_1 in account_x from policy_y.

This example shows a session length exception for five specific applications under two separate Cloudflare accounts: 

{
    "exception_type": "session_length",
    "exceptions": [
        {
            "account_id": "1xxxx",
            "tf_addresses": [
                "cloudflare_access_application.app_identity_access_denied",
                "cloudflare_access_application.enforcing_ext_auth_worker_bypass",
                "cloudflare_access_application.enforcing_ext_auth_worker_bypass_dev"
            ]
        },
        {
            "account_id": "2xxxx",
            "tf_addresses": [
                "cloudflare_access_application.extra_wildcard_application",
                "cloudflare_access_application.wildcard"
            ]
        }
    ]
}

Challenges and lessons learned

Our journey wasn't without obstacles. We had years of clickops (manual changes made directly in the dashboard) scattered across hundreds of accounts. Trying to import that existing chaos into a strict Infrastructure as Code system felt like trying to change the tires on a moving car. Importing existing resources remains an ongoing process to this day.

We also ran into limitations of our own tools. We found edge cases in the Cloudflare Terraform provider that only appear when you try to manage infrastructure at this scale. These weren't just minor speed bumps. They were hard lessons on the necessity of eating our own dogfood, so we could build even better solutions.

That friction clarified exactly what we were up against, leading us to three hard-earned lessons.

Lesson 1: high barriers to entry stall adoption 

The first hurdle for any large-scale IaC rollout is onboarding existing, manually configured resources. We gave teams two options: manually creating Terraform resources and import blocks, or using cf-terraforming.
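
For the first option, Terraform's declarative import blocks (Terraform 1.5 and later) let a team adopt a dashboard-created resource through the same reviewed MR flow. A hedged sketch with placeholder IDs; the exact import ID format varies by resource type, so check the provider documentation:

# Hypothetical example: adopt a DNS record that was originally created by hand
# in the dashboard. The import ID shown is a placeholder format.
import {
  to = cloudflare_dns_record.legacy_app_cname
  id = "<zone_id>/<dns_record_id>"
}

resource "cloudflare_dns_record" "legacy_app_cname" {
  zone_id = var.zone_id
  name    = "legacy-app.example.com"
  type    = "CNAME"
  content = "origin.example.com"
  proxied = true
  ttl     = 1 # 1 means "automatic" TTL
}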

We quickly discovered that Terraform fluency varies across teams, and the learning curve for manually importing existing resources proved to be much steeper than we anticipated.

Luckily, the cf-terraforming command-line utility uses the Cloudflare API to automatically generate the necessary Terraform code and import statements, significantly accelerating the migration process.

We also formed an internal community where experienced engineers could guide teams through the nuances of the provider and help unblock complex imports.

Lesson 2: drift happens 

We also had to tackle configuration drift, which occurs when the IaC process is bypassed to expedite urgent changes. While making edits directly in the dashboard is faster during an incident, it leaves the Terraform state out of sync with reality.

We implemented a custom drift detection service that constantly compares the state defined by Terraform with the actual deployed state via the Cloudflare API. When drift is detected, an automated system creates an internal ticket and assigns it to the owning team with varying Service Level Agreements (SLAs) for remediation. 

Lesson 3: automation is key

Cloudflare innovates quickly, so our set of products and APIs is ever-growing. Unfortunately, that meant our Terraform provider often lagged behind the product in feature parity.

We solved that with the release of our v5 provider, which is automatically generated from the OpenAPI specification. The transition wasn't without bumps as we hardened our approach to code generation, but it ensures that the API and Terraform stay in sync, reducing the chance of capability drift.

The core lesson: proactive > reactive

By centralizing our security baselines, mandating peer reviews, and enforcing policies before any change hits production, we minimize the possibility of configuration errors, accidental deletions, or policy violations. The architecture not only helps to prevent manual mistakes, but actually increases engineering velocity because teams are confident their changes are compliant. 

The key lesson from our work with Customer Zero is this: While the Cloudflare dashboard is excellent for day-to-day operations, achieving enterprise-level scale and consistent governance requires a different approach. When you treat your Cloudflare configurations as living code, you can scale securely and confidently. 

Have thoughts on Infrastructure as Code? Keep the conversation going and share your experiences over at community.cloudflare.com.
