The Cloudflare Blog

Route public traffic to private applications with Cloudflare

Enrique Somoza — Wed, 10 Jun 2026 13:00:00 GMT

For most of the Internet’s history, public and private infrastructure operated as separate worlds. Public applications lived behind content delivery networks (CDNs) and web application firewalls (WAFs). Private applications lived behind virtual private networks (VPNs), firewalls, and separate operational stacks. We think that distinction is becoming obsolete.

Many of the applications organizations care about are not public websites. They are internal APIs, AI agent backends, MCP servers, operational tools, and services that were never designed to be exposed to the public Internet. Yet these applications still need modern security, performance, and programmability services. Security should be a property of the traffic reaching an application, not an accident of where the application happens to sit.

Until now, applying those services to private applications often required public IPs, firewall exceptions, connector software, or complex networking. As a result, many private applications missed out on capabilities such as WAF, bot management, rate limiting, caching, traffic acceleration, rewrites, and Workers, despite needing the same protections and controls as public-facing applications.

Today, we're launching Application Services for Private Origins in closed beta for eligible Enterprise customers. Customers can now securely route traffic to private origins without exposing those origins to the public Internet. This allows Cloudflare's security, performance, and programmability services to protect applications running on private networks, just as they do for public Internet applications.

WAF rules, bot management, rate limiting, caching, rewrites, and Workers can now sit in front of private origins without requiring public IP exposure, inbound firewall rules, or cloudflared running on the origin.

Four use cases, one application layer

This routing model builds on connectivity patterns Cloudflare already supports today through Cloudflare Tunnel, Cloudflare One Client, and private network integrations. For years, Cloudflare Tunnel has allowed customers to route public traffic to private applications through cloudflared. This new capability extends the same model to existing Cloudflare WAN or Cloudflare Mesh connectivity without requiring connector software running on the origin.

Much of that connectivity is orchestrated through Cloudflare’s private networking routing layer that determines how traffic reaches private destinations across Cloudflare Tunnels, Virtual Networks, Cloudflare Mesh, and other connectivity models. Customers can define their routing behavior through APIs and the dashboard instead of managing separate networking stacks for each product.

We have extended Cloudflare’s private networking layer directly into the application services stack, allowing security and performance proxy infrastructure to treat private IPs as valid origin targets for public hostnames. As a result, the same private IPs previously reachable only through Cloudflare Tunnel, Cloudflare One, Cloudflare Mesh, or Cloudflare WAN can now sit behind Cloudflare’s security, performance, and programmability services the same way public origins already do.

This also creates a more unified model across Cloudflare products. Workers VPC bindings and Spectrum private origin routing now rely on the same underlying private connectivity layer, giving customers a single source of truth for controlling how private traffic moves through their Cloudflare environment.

Application traffic now falls into four combinations based on where users come from and where applications live:

The combination on the upper right is what Cloudflare has always done: users on the Internet reach applications on the Internet, with Cloudflare in the middle. The bottom right is Cloudflare One: users on private networks reach public services securely.

The upper left is what we are shipping today. The bottom left, private-to-private, is what we are building toward next.

What is shipping today

Until now, getting public traffic to a private origin often meant making tradeoffs. Customers could use Cloudflare Tunnel, which runs cloudflared, our connector software, on or near the origin, or Cloudflare Load Balancing with private origin pools for health checks and failover. In many cases, organizations also maintained parallel infrastructure such as public-facing load balancers, reverse proxies, mTLS between hops, and TLS termination across multiple layers. As a result, applying Cloudflare's full Application Services stack to private applications often required additional complexity, operational overhead, or separate products. Application Services for Private Origins removes those tradeoffs.

What was missing was a path for customers who already operate Cloudflare WAN (IPsec tunnels, GRE tunnels, CNI links) or Cloudflare Mesh. They had built private connectivity into Cloudflare for site-to-site networking and Zero Trust, and they wanted to use that same connectivity for public traffic to private origins. That is what Application Services for Private Origins delivers.

When you toggle Use private network routing on a proxied A or AAAA record, Cloudflare's WAF, rate limiting, caching, bot management, and transform rules all run as normal on Cloudflare’s network. The only difference is the final hop: instead of reaching the origin over the public Internet, Cloudflare routes the connection through your existing private network connectivity.

The toggle is enabled automatically for RFC 1918 private IPv4 ranges (10.x.x.x, 172.16.x.x–172.31.x.x, and 192.168.x.x), RFC 6598 CGNAT ranges (100.64.x.x–100.127.x.x), and RFC 4193 Unique Local IPv6 Addresses (FC00::/7), since these addresses are only reachable within private networks. For public IP addresses that are reachable only through your private network or tunnel, you can enable the toggle manually.

What the API looks like

For customers automating deployments through the API, private routing is simply an additional attribute on a standard DNS record.

POST /zones/{zone_id}/dns_records
{
 "type": "A",
 "name": "app.example.com",
 "content": "10.0.0.50",
 "ttl": 300,
 "proxied": true,
 "use_private_routing": true
}

Behind the scenes, Cloudflare's proxy platform determines where to send traffic for app.example.com by querying Cloudflare's Origin API. The response includes metadata indicating that the destination should be reached through a private network path:

{
 "zone_name": "example.com",
 "ipv4_addresses": ["10.0.0.50"],
 "use_private_routing": true
}

The use_private_routing flag is the key signal. When our proxy sees it, instead of attempting to connect directly to the private IP address over the public Internet, it hands the request to our private networking layer, which then routes the connection across the customer's existing private network connectivity, whether that's IPsec, GRE, Cloudflare Tunnel, CNI, or Cloudflare Mesh.

Beyond HTTP: Spectrum and Workers VPC

The same routing model now extends beyond HTTP applications. The origin does not have to be a web server. It can be a TCP database, a UDP logging endpoint, or a private API that Workers call directly. The common thread is that Cloudflare sits between your traffic and your private network, applying the same security, performance, and routing layer regardless of protocol or where the request originated.

Spectrum, Cloudflare's Layer 4 proxy, can now sit in front of TCP and UDP services running on private IPs. Instead of creating a load balancer pool as an intermediary, Spectrum applications can specify a virtual_network_id directly on the origin configuration. When you create a Spectrum application, you can include the virtual network ID alongside your private origin IP:

{
 "protocol": "tcp/22",
 "dns": {
   "type": "CNAME",
   "name": "ssh.example.com"
 },
 "origin_direct": ["tcp://10.0.0.50:22"],
 "virtual_network_id": "fab9ac85-491b-44c8-b7ae-dd44d4f4672e"
}

When you create or update a Spectrum application with a private origin and virtual network, Cloudflare verifies that the IP address matches a route in your Cloudflare Tunnel before the configuration is saved. If no matching route exists, the API rejects the request and the application is not created. Once saved, Spectrum hands the connection to your virtual network, which routes it through the associated tunnel, via the same path that HTTP traffic uses when you enable private network routing on a DNS record. In this initial release, Spectrum private origins are supported through Cloudflare Tunnel. Support for additional private network connectivity options will follow in future releases.

This means you can now put Spectrum in front of any TCP/UDP service running on a private IP. The service stays private. No public IP, connector software, or load balancer required.

Workers VPC closes the loop for code running on Cloudflare. A binding tells the Workers runtime to route through the same private path as DNS records. Browsers, mobile apps, Workers, and AI agents all reach your private origins through Cloudflare: DNS records for Internet traffic, bindings for Workers.

What comes next

Public-to-private routing is in closed beta today, and we are targeting GA (General Availability) in Q4 2026.

Beyond GA, we are building toward private-to-private traffic flows: users, services, and AI agents on private networks securely reaching applications on other private networks, with Cloudflare’s application services sitting in the middle.

We are moving toward a model where the same Cloudflare infrastructure can secure traffic regardless of whether the user or the origin is public.

The end state is a world where an employee on Cloudflare One Client accessing wiki.company.internal gets the same WAF, rate limiting, and bot management protections as a customer accessing a public API. An AI agent consuming a proprietary internal API runs through the same security stack as a browser. Service-to-service traffic across clouds and data centers gets the same controls as Internet traffic, even when neither the user nor the server sits on the public Internet.

Get started today

Routing to private origins is available today in closed beta for eligible Enterprise customers. Reach out to your Cloudflare account team to request access. Once enabled, follow our developer documentation, which walks through the full setup. You will need Cloudflare One connectivity (IPsec, GRE, CNI, or Cloudflare Mesh) and a return route for Cloudflare’s source IP range 100.64.0.0/12 in your private network.

Questions or feedback? Join the conversation in our community forums or reach out to your account team.

Shared Dictionaries: compression that keeps up with the agentic web

Alex Krivit — Fri, 17 Apr 2026 13:02:00 GMT

Web pages have grown 6-9% heavier every year for the past decade, spurred by the web becoming more framework-driven, interactive, and media-rich. Nothing about that trajectory is changing. What is changing is how often those pages get rebuilt and how many clients request them. Both are skyrocketing because of agents.

Shared dictionaries shrink asset transfers from servers to browsers so pages load faster with less bloat on the wire, especially for returning users or visitors on a slow connection. Instead of re-downloading entire JavaScript bundles after every deploy, the browser tells the server what it already has cached, and the server only sends the file diffs.

Today, we’re excited to give you a sneak peek of our support for shared compression dictionaries, show you what we’ve seen in early testing, and reveal when you’ll be able to try the beta yourself (hint: it’s April 30, 2026!).

The problem: more shipping = less caching

Agentic crawlers, browsers, and other tools hit endpoints repeatedly, fetching full pages, often to extract a fragment of information. Agentic actors represented just under 10% of total requests across Cloudflare's network during March 2026¹, up ~60% year-over-year.

Every page shipped is heavier than last year and read more often by machines than ever before. But agents aren’t just consuming the web, they’re helping to build it. AI-assisted development means teams ship faster. Increasing the frequency of deploys, experiments, and iterations is great for product velocity, but terrible for caching.

As agents push a one-line fix, the bundler re-chunks, filenames change, and every user on earth could re-download the entire application. Not because the code is meaningfully any different, but because the browser/client has no way to know specifically what changed. It sees a new URL and starts from zero. Traditional compression helps with the size of each download, but it can't help with the redundancy. It doesn't know the client already has 95% of the file cached. So every deploy, across every user, across every bot, sends redundant bytes again and again. Ship ten small changes a day and you've effectively opted out of caching. This wastes bandwidth and CPU in a web where hardware is quickly becoming the bottleneck.

In order to scale with more requests hitting heavier pages that are re-deployed more often, compression has to get smarter.

What are shared dictionaries?

A compression dictionary is a shared reference between server and client that works like a cheat sheet. Instead of compressing a response from scratch, the server says "you already know this part of the file because you’ve cached it before" and only sends what's new. The client holds the same reference and uses it to reconstruct the full response during decompression. The more the dictionary can reference content in the file, the smaller the compressed output that is transferred to the client.

This principle of compressing against what's already known is how modern compression algorithms pull ahead of their predecessors. Brotli ships with a built-in dictionary of common web patterns like HTML attributes and common phrases; Zstandard is purpose-built for custom dictionaries: you can feed it representative content samples and it generates an optimized dictionary for the kind of content you serve. Gzip has neither; it must build dictionaries by finding patterns in real-time as it’s compressing. These “traditional compression” algorithms are already available on Cloudflare today.

Shared dictionaries take this principle a step further: the previously cached version of the resource becomes the dictionary. Remember the deploy problem where a team ships a one-line fix and every user re-downloads the full bundle? With shared dictionaries, the browser already has the old version cached. The server compresses against it, sending only the diff. That 500KB bundle with a one-line change becomes only a few kilobytes on the wire. At 100K daily users and 10 deploys a day, that's the difference between 500GB of transfer and a few hundred megabytes.

Delta compression

Delta compression is what turns the version the browser already has into the dictionary. The protocol looks to when the server first serves a resource, it attaches a Use-As-Dictionary response header, telling the browser to essentially hold onto the file because it’ll be useful later. On the next request for that resource, the browser sends an Available-Dictionary header back, telling the server, "here's what I've got." The server then proceeds to compress the new version against the old one and sends only the diff. No separate dictionary file needed.

This is where the payoff lands for real applications. Versioned JS bundles, CSS files, framework updates, and anything that changes incrementally between releases. The browser has app.bundle.v1.js cached already and the developer makes an update and deploys app.bundle.v2.js. Delta compression only sends the diff between these versions. Every subsequent version after is also just a diff. Version three compresses against version two. Version 47 compresses against version 46. The savings don't reset, they persist across the entire release history.

There's also active discussion in the community about custom and dynamic dictionaries for non-static content. That's future work, but the implications are significant. We'll save that for another post.

So why the wait?

If shared dictionaries are so powerful, why doesn't everyone use them already?

Because the last time they were tried, the implementation couldn't survive contact with the open web.

Google shipped Shared Dictionary Compression for HTTP (SDCH) in Chrome in 2008. It worked well with some early adopters reporting double-digit improvements in page load times. But SDCH accumulated problems faster than anyone was able to fix them.

The most memorable was a class of compression side-channel attacks (CRIME, BREACH). Researchers showed that if an attacker could inject content alongside something sensitive that gets compressed (like a session cookie, token, etc.) the size of the compressed output could leak information about the secret. The attacker could guess a byte at a time, watch whether the asset size shrank, and repeat until they extracted the whole secret.

But security wasn't the only problem, or even the main reason why adoption didn’t happen. SDCH surfaced a few architectural problems like violating the Same-Origin Policy (which ironically is partially why it performed so well). Its cross-origin dictionary model couldn't be reconciled with CORS, and it lacked some specification regarding interactions with things like the Cache API. After a while it became clear that adoption wasn’t ready, so in 2017 Chrome (the only browser supporting at the time) unshipped it.

Getting the web community to pick up the baton took a decade, but it was worth it.

The modern standard, RFC 9842: Compression Dictionary Transport, closes key design gaps that made SDCH untenable. For example, it enforces that an advertised dictionary is only usable on responses from the same-origin, preventing many conditions that made side-channel compression attacks possible.

Chrome and Edge have shipped support with Firefox working to follow. The standard is moving toward broad adoption, but complete cross-browser support is still catching up.

The RFC mitigates the security problems but dictionary transport has always been complex to implement. An origin may have to generate dictionaries, serve them with the right headers, check every request for an Available-Dictionary match, delta-compress the response on the fly, and fall back gracefully when a client doesn't have a dictionary. Caching gets complex too. Responses vary on both encoding and dictionary hash, so every dictionary version creates a separate cache variant. Mid-deploy, you have clients with the old dictionary, clients with the new one, and clients with none. Your cache is storing separate copies for each. Hit rates drop, storage climbs, and the dictionaries themselves have to stay fresh under normal HTTP caching rules.

This complexity is a coordination problem. And exactly the kind of thing that belongs at the edge. A CDN already sits in front of every request, already manages compression, and already handles cache variants (watch this space for a soon-to-come announcement blog).

How Cloudflare is building shared dictionary support

Shared dictionary compression touches every layer of the stack between the browser and the origin. We've seen strong customer interest: some people have already built their own implementations like RFC author Patrick Meenan's dictionary-worker, which runs the full dictionary lifecycle inside a Cloudflare Worker using WASM-compiled Zstandard (as an example). We want to make this accessible to everyone and as easy as possible to implement. So we’re rolling it out across the platform in three phases, starting with the plumbing.

Phase 1: Passthrough support is currently in active development. Cloudflare forwards the headers and encodings that shared dictionaries require like Use-As-Dictionary, Available-Dictionary, and the dcb and dcz content encodings, without stripping, modifying, or recompressing them. The Cache keys are extended to vary on Available-Dictionary and Accept-Encoding so dictionary-compressed responses are cached correctly. This phase serves customers who manage their own dictionaries at the origin.

We plan to have an open beta of Phase 1 ready by April 30, 2026. To use it, you'll need to be on a Cloudflare zone with the feature enabled, have an origin that serves dictionary-compressed responses with the correct headers (Use-As-Dictionary, Content-Encoding: dcb or dcz, Vary on Accept-Encoding, Available-Dictionary), and your visitors need to be on a browser that can use dictionary transport. Today, that means Chrome 130+ and Edge 130+, with Firefox support in progress.

Keep your eyes fixed on the changelog for when this becomes available and more documentation for how to use it.

We’ve already started testing passthrough internally. In a controlled test, we deployed two js bundles in sequence. They were nearly identical except for a few localized changes between the versions representing successive deploys of the same web application. 200 tests were run hitting our San Jose, California PoP with an origin located in Council Bluffs, Iowa serving the dictionary, JS bundle, and dictionary-compressed bundle (note that the dcz compressed bundle was precomputed). For each request we captured TTFB and total download time via curl.

Uncompressed, the asset is 272KB. Gzip brought that down to 92.2KB, a solid 66% reduction. With shared dictionary compression over DCZ, using the previous version as the dictionary, that same asset dropped to 2.6KB. That's a 99% reduction from the uncompressed asset, and still 97% smaller than gzip.

In the same lab test, we measured two timing milestones from the client: time to first byte (TTFB) and full download completion. The timing difference is stark. At P99 on a cache miss, DCZ TTFB is roughly 60 ms faster than gzip. On a cache hit the TTFB gap narrows to a negligible 10ms.

Download completion is where it matters. "Transfer" in the graphs indicates total time curl spent receiving the response minus TTFB. On a cache miss, this time for the DCZ response is 1ms versus 161ms for gzip (99.4% faster - the body essentially arrived alongside the headers). On a cache hit, 1ms versus 54ms (98% faster). The localized changes between versions were already captured by the dictionary, which is exactly the point: for successive deploys of roughly the same application, shared dictionaries eliminate nearly all redundant transfer.

Initial lab results simulating minimal JS bundle diffs, results will vary based on the actual delta between the dictionary and the asset.

Phase 2: This is where Cloudflare starts doing the work for you. Instead of handling dictionary headers, compression, and fallback logic on the origin, in this phase you tell Cloudflare which assets should be used as dictionaries via a rule and we manage the rest for you. We inject the Use-As-Dictionary headers, store the dictionary bytes, delta-compress new versions against old ones, and serve the right variant to each client. Your origin serves normal responses. Any dictionary complexity moves off your infrastructure and onto ours.

To demonstrate this, we've built a live demo to show what this looks like in practice. Try it here: Can I Compress (with Dictionaries)?

The demo deploys a new ~94KB JavaScript bundle every minute, meant to mimic a typical production single page application bundle. The bulk of the code is static between deploys; only a small configuration block changes each time, which also mirrors real-world deploys where most of the bundle is unchanged framework and library code. When the first version loads, Cloudflare's edge stores it as a dictionary. When the next deploy arrives, the browser sends the hash of the version it already has, and the edge delta-compresses the new bundle against it. The result: 94KB compresses to roughly 450 bytes. That's a 99.5% reduction over gzip, because the only thing on the wire is the actual diff.

The demo site includes walkthroughs so you can verify the compression ratios on your own via curl, your browser, or your agent of choice.

Phase 3: The dictionary is automatically generated on behalf of the website. Instead of customers specifying which assets to use as dictionaries, Cloudflare identifies them automatically. Our network already sees every version of every resource that flows through it, which includes millions of sites, billions of requests, and every new deployment. The idea is that when the network observes a URL pattern where successive responses share most of their content, it has a strong signal that the resource is a good candidate for delta compression. If safe to do so, it stores the previous version as a dictionary and compresses subsequent versions against it. No customer configuration. No maintenance.

This is a simple idea, but is genuinely hard. Safely generating dictionaries that avoid revealing private data and identifying traffic for which dictionaries will offer the most benefit are real engineering problems. But Cloudflare has the right pieces: we see the traffic patterns across the entire network, we already manage the cache layer where dictionaries need to live, and our RUM beacon to clients can help give us a validation loop to confirm that a dictionary actually improves compression before we commit to serving it. The combination of traffic visibility, edge storage, and synthetic testing is what makes automatic generation feasible, though there are still many pieces to figure out.

The performance and bandwidth benefits of phase 3 are the crux of our motivation. This is what makes shared dictionaries accessible to everyone using Cloudflare, including the millions of zones that would never have had the engineering time to implement custom dictionaries manually.

The bigger picture

For most of the web's history, compression was stateless. Every response was compressed as if the client had never seen anything before. Shared dictionaries change that: they give compression a memory.

That matters more now than it would have five years ago. Agentic coding tools are compressing the interval between deploys, while also driving a growing share of the traffic that consumes them. While today AI tools can produce massive diffs, agents are gaining more context and becoming surgical in their code changes. This, coupled with more frequent releases and more automated clients means more redundant bytes on every request. Delta compression helps both sides of that equation by reducing the number of bytes per transfer, and the number of transfers that need to happen at all.

Shared Dictionaries took decades to standardize. Cloudflare is helping to build the infrastructure to make it work for every client that touches your site, human or not. Phase 1 beta opens April 30 and we’re excited for you to try it.

_____

^{1Bots =}^~31.3%^{of all HTTP requests. AI = ~}^29-30%^{of all Bot traffic (March 2026).}

Fixing request smuggling vulnerabilities in Pingora OSS deployments

Edward Wang — Mon, 09 Mar 2026 14:00:00 GMT

In December 2025, Cloudflare received reports of HTTP/1.x request smuggling vulnerabilities in the Pingora open source framework when Pingora is used to build an ingress proxy. Today we are discussing how these vulnerabilities work and how we patched them in Pingora 0.8.0.

The vulnerabilities are CVE-2026-2833, CVE-2026-2835, and CVE-2026-2836. These issues were responsibly reported to us by Rajat Raghav (xclow3n) through our Bug Bounty Program.

Cloudflare’s CDN and customer traffic were not affected, our investigation found. No action is needed for Cloudflare customers, and no impact was detected.

Due to the architecture of Cloudflare’s network, these vulnerabilities could not be exploited: Pingora is not used as an ingress proxy in Cloudflare’s CDN.

However, these issues impact standalone Pingora deployments exposed to the Internet, and may enable an attacker to:

Bypass Pingora proxy-layer security controls
Desync HTTP request/responses with backends for cross-user hijacking attacks (session or credential theft)
Poison Pingora proxy-layer caches retrieving content from shared backends

We have released Pingora 0.8.0 with fixes and hardening. While Cloudflare customers were not affected, we strongly recommend users of the Pingora framework to upgrade as soon as possible.

What was the vulnerability?

The reports described a few different HTTP/1 attack payloads that could cause desync attacks. Such requests could cause the proxy and backend to disagree about where the request body ends, allowing a second request to be “smuggled” past proxy‑layer checks. The researcher provided a proof-of-concept to validate how a basic Pingora reverse proxy misinterpreted request body lengths and forwarded those requests to server backends such as Node/Express or uvicorn.

Upon receiving the reports, our engineering team immediately investigated and validated that, as the reporter also confirmed, the Cloudflare CDN itself was not vulnerable. However, the team did also validate that vulnerabilities exist when Pingora acts as the ingress proxy to shared backends.

By design, the Pingora framework does allow edge case HTTP requests or responses that are not strictly RFC compliant, because we must accept this sort of traffic for customers with legacy HTTP stacks. But this leniency has limits to avoid exposing Cloudflare itself to vulnerabilities.

In this case, Pingora had non-RFC-compliant interpretations of request bodies within its HTTP/1 stack that allowed these desync attacks to exist. Pingora deployments within Cloudflare are not directly exposed to ingress traffic, and we found that production traffic that arrived at Pingora services were not subject to these misinterpretations. Thus, the attacks were not exploitable on Cloudflare traffic itself, unlike a previous Pingora smuggling vulnerability disclosed in May 2025.

We’ll explain, case-by-case, how these attack payloads worked.

1. Premature upgrade without 101 handshake

The first report showed that a request with an Upgrade header value would cause Pingora to pass through subsequent bytes on the HTTP connection immediately, before the backend had accepted an upgrade (by returning 101 Switching Protocols). The attacker could thus pipeline a second HTTP request after the upgrade request on the same connection:

GET / HTTP/1.1
Host: example.com
Upgrade: foo


GET /admin HTTP/1.1
Host: example.com

Pingora would parse only the initial request, then treat the remaining buffered bytes as the “upgraded” stream and forward them directly to the backend in a “passthrough” mode due to the Upgrade header (until the response was received).

This is not at all how the HTTP/1.1 Upgrade process per RFC 9110 is intended to work. The subsequent bytes should only be interpreted as part of an upgraded stream if a 101 Switching Protocols header is received, and if a 200 OK response is received instead, the subsequent bytes should continue to be interpreted as HTTP.

^{An attacker that sends an Upgrade request, then pipelines a partial HTTP request may cause a desync attack. Pingora will incorrectly interpret both as the same upgraded request, even if the backend server declines the upgrade with a 200.}

Via the improper pass-through, a Pingora deployment that received a non-101 response could still forward the second partial HTTP request to the upstream as-is, bypassing any Pingora user‑defined ACL-handling or WAF logic, and poison the connection to the upstream so that a subsequent request from a different user could improperly receive the /admin response.

^{After the attack payload, Pingora and the backend server are now “desynced.” The backend server will wait until it thinks the rest of the partial /attack request header that Pingora forwarded is complete. When Pingora forwards a different user’s request, the two headers are combined from the backend server’s perspective, and the attacker has now poisoned the other user’s response.}

We’ve since patched Pingora to switch the interpretation of subsequent bytes only once the upstream responds with 101 Switching Protocols.

We verified Cloudflare was not affected for two reasons:

The ingress CDN proxies do not have this improper behavior.
The clients to our internal Pingora services do not attempt to pipeline HTTP/1 requests. Furthermore, the Pingora service these clients talk directly with disables keep-alive on these Upgrade requests by injecting a Connection: close header; this prevents additional requests that would be sent — and subsequently smuggled — over the same connection.

2. HTTP/1.0, close-delimiting, and transfer-encoding

The reporter also demonstrated what appeared to be a more classic “CL.TE” desync-type attack, where the Pingora proxy would use Content-Length as framing while the backend would use Transfer-Encoding as framing:

GET / HTTP/1.0
Host: example.com
Connection: keep-alive
Transfer-Encoding: identity, chunked
Content-Length: 29

0

GET /admin HTTP/1.1
X:

In the reporter’s example, Pingora would treat all subsequent bytes after the first GET / request header as part of that request’s body, but the node.js backend server would interpret the body as chunked and ending at the zero-length chunk. There are actually a few things going on here:

Pingora’s chunked encoding recognition was quite barebones (only checking for whether Transfer-Encoding was “chunked”) and assumed that there could only be one encoding or Transfer-Encoding header. But the RFC only mandates that the final encoding must be chunked to apply chunked framing. So per RFC, this request should have a chunked message body (if it were not HTTP/1.0 — more on that below).
Pingora was also not actually using the Content-Length (because the Transfer-Encoding overrode the Content-Length per RFC). Because of the unrecognized Transfer-Encoding and the HTTP/1.0 version, the request body was instead treated as close-delimited (which means that the response body’s end is marked by closure of the underlying transport connection). An absence of framing headers would also trigger the same misinterpretation on HTTP/1.0. Although response bodies are allowed to be close-delimited, request bodies are never close-delimited. In fact, this clarification is now explicitly called out as a separate note in RFC 9112.
This is an HTTP/1.0 request that did not define Transfer-Encoding. The RFC mandates that HTTP/1.0 requests containing Transfer-Encoding must “treat the message as if the framing is faulty” and close the connection. Parsers such as the ones in nginx and hyper just reject these requests to avoid ambiguous framing.

^{When an attacker pipelines a partial HTTP request header after the HTTP/1.0 + Transfer-Encoding request, Pingora would incorrectly interpret that partial header as part of the same request, rather than as a distinct request. This enables the same kind of desync attack as described in the premature Upgrade example.}

This spoke to a more fundamental misreading of the RFC particularly in terms of response vs. request message framing. We’ve since fixed the improper multiple Transfer-Encoding parsing, adhere strictly to the request length guidelines such that HTTP request bodies can never be considered close-delimited, and reject invalid Content-Length and HTTP/1.0 + Transfer-Encoding request messages. Further protections we’ve added include rejecting CONNECT requests by default because the HTTP proxy logic doesn’t currently treat CONNECT as special for the purposes of CONNECT upgrade proxying, and these requests have special message framing rules. (Note that incoming CONNECT requests are rejected by the Cloudflare CDN.)

When we investigated and instrumented our services internally, we found no requests arriving at our Pingora services that would have been misinterpreted. We found that downstream proxy layers in the CDN would forward as HTTP/1.1 only, reject ambiguous framing such as invalid Content-Length, and only forward a single Transfer-Encoding: chunked header for chunked requests.

3. Cache key construction

The researcher also reported one other cache poisoning vulnerability regarding default CacheKey construction. The naive default implementation factored in only the URI path (without other factors such as host header or upstream server HTTP scheme), which meant different hosts using the same HTTP path could collide and poison each other’s cache.

This would affect users of the alpha proxy caching feature who chose to use the default CacheKey implementation. We have since removed that default, because while using something like HTTP scheme + host + URI makes sense for many applications, we want users to be careful when constructing their cache keys for themselves. If their proxy logic will conditionally adjust the URI or method on the upstream request, for example, that logic likely also must be factored into the cache key scheme to avoid poisoning.

Internally, Cloudflare’s default cache key uses a number of factors to prevent cache key poisoning, and never made use of the previously provided default.

Recommendation

If you use Pingora as a proxy, upgrade to Pingora 0.8.0 at your earliest convenience.

We apologize for the impact this vulnerability may have had on Pingora users. As Pingora earns its place as critical Internet infrastructure beyond Cloudflare, we believe it’s important for the framework to promote use of strict RFC compliance by default and will continue this effort. Very few users of the framework should have to deal with the same “wild Internet” that Cloudflare does. Our intention is that stricter adherence to the latest RFC standards by default will harden security for Pingora users and move the Internet as a whole toward best practices.

Disclosure and response timeline

- 2025‑12‑02: Upgrade‑based smuggling reported via bug bounty.

- 2026‑01‑13: Transfer‑Encoding / HTTP/1.0 parsing issues reported.

- 2026-01-18: Default cache key construction issue reported.

- 2026‑01‑29 to 2026‑02‑13: Fixes validated with the reporter. Work on more RFC-compliance checks continues.

- 2026-02-25: Cache key default removal and additional RFC checks validated with researcher.

- 2026‑03-02: Pingora 0.8.0 released.

- 2026-03-04: CVE advisories published.

Acknowledgements

We thank Rajat Raghav (xclow3n) for the report, detailed reproductions, and verification of the fixes through our bug bounty program. Please see the researcher's corresponding blog post for more information.

We would also extend a heartfelt thank you to the Pingora open source community for their active engagement, issue reports, and contributions to the framework. You truly help us build a better Internet.

Resolving a request smuggling vulnerability in Pingora

Edward Wang — Thu, 22 May 2025 13:00:00 GMT

On April 11, 2025 09:20 UTC, Cloudflare was notified via its Bug Bounty Program of a request smuggling vulnerability (CVE-2025-4366) in the Pingora OSS framework discovered by a security researcher experimenting to find exploits using Cloudflare’s Content Delivery Network (CDN) free tier which serves some cached assets via Pingora.

Customers using the free tier of Cloudflare’s CDN or users of the caching functionality provided in the open source pingora-proxy and pingora-cache crates could have been exposed. Cloudflare’s investigation revealed no evidence that the vulnerability was being exploited, and was able to mitigate the vulnerability by April 12, 2025 06:44 UTC within 22 hours after being notified.

What was the vulnerability?

The bug bounty report detailed that an attacker could potentially exploit an HTTP/1.1 request smuggling vulnerability on Cloudflare’s CDN service. The reporter noted that via this exploit, they were able to cause visitors to Cloudflare sites to make subsequent requests to their own server and observe which URLs the visitor was originally attempting to access.

We treat any potential request smuggling or caching issue with extreme urgency. After our security team escalated the vulnerability, we began investigating immediately, took steps to disable traffic to vulnerable components, and deployed a patch.

We are sharing the details of the vulnerability, how we resolved it, and what we can learn from the action. No action is needed from Cloudflare customers, but if you are using the Pingora OSS framework, we strongly urge you to upgrade to a version of Pingora 0.5.0 or later.

What is request smuggling?

Request smuggling is a type of attack where an attacker can exploit inconsistencies in the way different systems parse HTTP requests. For example, when a client sends an HTTP request to an application server, it typically passes through multiple components such as load balancers, reverse proxies, etc., each of which has to parse the HTTP request independently. If two of the components the request passes through interpret the HTTP request differently, an attacker can craft a request that one component sees as complete, but the other continues to parse into a second, malicious request made on the same connection.

Request smuggling vulnerability in Pingora

In the case of Pingora, the reported request smuggling vulnerability was made possible due to a HTTP/1.1 parsing bug when caching was enabled.

The pingora-cache crate adds an HTTP caching layer to a Pingora proxy, allowing content to be cached on a configured storage backend to help improve response times, and reduce bandwidth and load on backend servers.

HTTP/1.1 supports “persistent connections”, such that one TCP connection can be reused for multiple HTTP requests, instead of needing to establish a connection for each request. However, only one request can be processed on a connection at a time (with rare exceptions such as HTTP/1.1 pipelining). The RFC notes that each request must have a “self-defined message length” for its body, as indicated by headers such as Content-Length or Transfer-Encoding to determine where one request ends and another begins.

Pingora generally handles requests on HTTP/1.1 connections in an RFC-compliant manner, either ensuring the downstream request body is properly consumed or declining to reuse the connection if it encounters an error. After the bug was filed, we discovered that when caching was enabled, this logic was skipped on cache hits (i.e. when the service’s cache backend can serve the response without making an additional upstream request).

This meant on a cache hit request, after the response was sent downstream, any unread request body left in the HTTP/1.1 connection could act as a vector for request smuggling. When formed into a valid (but incomplete) header, the request body could “poison” the subsequent request. The following example is a spec-compliant HTTP/1.1 request which exhibits this behavior:

GET /attack/foo.jpg HTTP/1.1
Host: example.com

content-length: 79

GET / HTTP/1.1
Host: attacker.example.com
Bogus: foo

Let’s say there is a different request to victim.example.com that will be sent after this one on the reused HTTP/1.1 connection to a Pingora reverse proxy. The bug means that a Pingora service may not respect the Content-Length header and instead misinterpret the smuggled request as the beginning of the next request:

GET /attack/foo.jpg HTTP/1.1
Host: example.com

content-length: 79

GET / HTTP/1.1 // <- “smuggled” body start, interpreted as next request
Host: attacker.example.com
Bogus: fooGET /victim/main.css HTTP/1.1 // <- actual next valid req start
Host: victim.example.com

Thus, the smuggled request could inject headers and its URL into a subsequent valid request sent on the same connection to a Pingora reverse proxy service.

CDN request smuggling and hijacking

On April 11, 2025, Cloudflare was in the process of rolling out a Pingora proxy component with caching support enabled to a subset of CDN free plan traffic. This component was vulnerable to this request smuggling attack, which could enable modifying request headers and/or URL sent to customer origins.

As previously noted, the security researcher reported that they were also able to cause visitors to Cloudflare sites to make subsequent requests to their own malicious origin and observe which site URLs the visitor was originally attempting to access. During our investigation, Cloudflare found that certain origin servers would be susceptible to this secondary attack effect. The smuggled request in the example above would be sent to the correct origin IP address per customer configuration, but some origin servers would respond to the rewritten attacker Host header with a 301 redirect. Continuing from the prior example:

GET / HTTP/1.1 // <- “smuggled” body start, interpreted as next request
Host: attacker.example.com
Bogus: fooGET /victim/main.css HTTP/1.1 // <- actual next valid req start
Host: victim.example.com


HTTP/1.1 301 Moved Permanently // <- susceptible victim origin response
Location: https://attacker.example.com/

When the client browser followed the redirect, it would trigger this attack by sending a request to the attacker hostname, along with a Referrer header indicating which URL was originally visited, making it possible to load a malicious asset and observe what traffic a visitor was trying to access.

GET / HTTP/1.1 // <- redirect-following request
Host: attacker.example.com
Referrer: https://victim.example.com/victim/main.css

Upon verifying the Pingora proxy component was susceptible, the team immediately disabled CDN traffic to the vulnerable component on 2025-04-12 06:44 UTC to stop possible exploitation. By 2025-04-19 01:56 UTC and prior to re-enablement of any traffic to the vulnerable component, a patch fix to the component was released, and any assets cached on the component’s backend were invalidated in case of possible cache poisoning as a result of the injected headers.

Remediation and next steps

If you are using the caching functionality in the Pingora framework, you should update to the latest version of 0.5.0. If you are a Cloudflare customer with a free plan, you do not need to do anything, as we have already applied the patch for this vulnerability.

Timeline

All timestamps are in UTC.

2025-04-11 09:20 – Cloudflare is notified of a CDN request smuggling vulnerability via the Bug Bounty Program.
2025-04-11 17:16 to 2025-04-12 03:28 – Cloudflare confirms vulnerability is reproducible and investigates which component(s) require necessary changes to mitigate.
2025-04-12 04:25 – Cloudflare isolates issue to roll out of a Pingora proxy component with caching enabled and prepares release to disable traffic to this component.
2025-04-12 06:44 – Rollout to disable traffic complete, vulnerability mitigated.

Conclusion

We would like to sincerely thank James Kettle & Wannes Verwimp, who responsibly disclosed this issue via our Cloudflare Bug Bounty Program, allowing us to identify and mitigate the vulnerability. We welcome further submissions from our community of researchers to continually improve the security of all of our products and open source projects.

Whether you are a customer of Cloudflare or just a user of our Pingora framework, or both, we know that the trust you place in us is critical to how you connect your properties to the rest of the Internet. Security is a core part of that trust and for that reason we treat these kinds of reports and the actions that follow with serious urgency. We are confident about this patch and the additional safeguards that have been implemented, but we know that these kinds of issues can be concerning. Thank you for your continued trust in our platform. We remain committed to building with security as our top priority and responding swiftly and transparently whenever issues arise.

A good day to trie-hard: saving compute 1% at a time

Kevin Guthrie — Tue, 10 Sep 2024 14:00:00 GMT

Cloudflare’s global network handles a lot of HTTP requests – over 60 million per second on average. That in and of itself is not news, but it is the starting point to an adventure that started a few months ago and ends with the announcement of a new open-source Rust crate that we are using to reduce our CPU utilization, enabling our CDN to handle even more of the world’s ever-increasing Web traffic.

Motivation

Let’s start at the beginning. You may recall a few months ago we released Pingora (the heart of our Rust-based proxy services) as an open-source project on GitHub. I work on the team that maintains the Pingora framework, as well as Cloudflare’s production services built upon it. One of those services is responsible for the final step in transmitting users’ (non-cached) requests to their true destination. Internally, we call the request’s destination server its “origin”, so our service has the (unimaginative) name of “pingora-origin”.

One of the many responsibilities of pingora-origin is to ensure that when a request leaves our infrastructure, it has been cleaned to remove the internal information we use to route, measure, and optimize traffic for our customers. This has to be done for every request that leaves Cloudflare, and as I mentioned above, it’s a lot of requests. At the time of writing, the rate of requests leaving pingora-origin (globally) is 35 million requests per second. Any code that has to be run per-request is in the hottest of hot paths, and it’s in this path that we find this code and comment:

// PERF: heavy function: 1.7% CPU time
pub fn clear_internal_headers(request_header: &mut RequestHeader) {
    INTERNAL_HEADERS.iter().for_each(|h| {
        request_header.remove_header(h);
    });
}

This small and pleasantly-readable function consumes more than 1.7% of pingora-origin’s total cpu time. To put that in perspective, the total cpu time consumed by pingora-origin is 40,000 compute-seconds per second. You can think of this as 40,000 saturated CPU cores fully dedicated to running pingora-origin. Of those 40,000, 1.7% (680) are only dedicated to evaluating clear_internal_headers. The function’s heavy usage and simplicity make it seem like a great place to start optimizing.

Benchmarking

Benchmarking the function shown above is straightforward because we can use the wonderful criterion Rust crate. Criterion provides an api for timing rust code down to the nanosecond by aggregating multiple isolated executions. It also provides feedback on how the performance improves or regresses over time. The input for the benchmark is a large set of synthesized requests with a random number of headers with a uniform distribution of internal vs. non-internal headers. With our tooling and test data we find that our original clear_internal_headers function runs in an average of 3.65µs. Now for each new method of clearing headers, we can measure against the same set of requests and get a relative performance difference.

Reducing Reads

One potentially quick win is to invert how we find the headers that need to be removed from requests. If you look at the original code, you can see that we are evaluating request_header.remove_header(h) for each header in our list of internal headers, so 100+ times. Diagrammatically, it looks like this:

Since an average request has significantly fewer than 100 headers (10-30), flipping the lookup direction should reduce the number of reads while yielding the same intersection. Because we are working in Rust (and because retain does not exist for http::HeaderMap yet), we have to collect the identified internal headers in a separate step before removing them from the request. Conceptually, it looks like this:

Using our benchmarking tool, we can measure the impact of this small change, and surprisingly this is already a substantial improvement. The runtime improves from 3.65µs to 1.53µs. That’s a 2.39x speed improvement for our function. We can calculate the theoretical CPU percentage by multiplying the starting utilization by the ratio of the new and old times: 1.71% * 1.53 / 3.65 = 0.717%. Unfortunately, if we subtract that from the original 1.71% that only equates to saving 1.71% - 0.717% = 0.993% of the total CPU time. We should be able to do better.

Searching Data Structures

Now that we have reorganized our function to search a static set of internal headers instead of the actual request, we have the freedom to choose what data structure we store our header name in simply by changing the type of INTERNAL_HEADER_SET.

pub fn clear_internal_headers(request_header: &mut RequestHeader) {
   let to_remove = request_header
       .headers
       .keys()
       .filter_map(|name| INTERNAL_HEADER_SET.get(name))
       .collect::>();


   to_remove.into_iter().for_each(|k| {
       request_header.remove_header(k);
   });

Our first attempt used std::HashMap, but there may be other data structures that better suit our needs. All computer science students were taught at some point that hash tables are great because they have constant-time asymptotic behavior, or O(1), for reading. (If you are not familiar with big O notation, it is a way to express how an algorithm consumes a resource, in this case time, as the input size changes.) This means no matter how large the map gets, reads always take the same amount of time. Too bad this is only partially true. In order to read from a hash table, you have to compute the hash. Computing a hash for strings requires reading every byte, so while read time for a hashmap is constant over the table’s size, it’s linear over key length. So, our goal is to find a data structure that is better than O(L) where L is the length of the key.

There are a few common data structures that provide for reads that have read behavior that meets our criteria. Sorted sets like BTreeSet use comparisons for searching, and that makes them logarithmic over key length O(log(L)), but they are also logarithmic in size too. The net effect is that even very fast sorted sets like FST work out to be a little (50 ns) slower in our benchmarks than the standard hashmap.

State machines like parsers and regex are another common tool for searching for strings, though it’s hard to consider them data structures. These systems work by accepting input one unit at a time and determining on each step whether or not to keep evaluating. Being able to make these determinations at every step means state machines are very fast to identify negative cases (i.e. when a string is not valid or not a match). This is perfect for us because only one or two headers per request on average will be internal. In fact, benchmarking an implementation of clear_internal_headers using regular expressions clocks in as taking about twice as long as the hashmap-based solution. This is impressively fast given that regexes, while powerful, aren't known for their raw speed. This approach feels promising – we just need something in between a data structure and a state machine.

That’s where the trie comes in.

Don’t Just Trie

A trie (pronounced like “try” or “tree”) is a type of tree data structure normally used for prefix searches or auto-complete systems over a known set of strings. The structure of the trie lends itself to this because each node in the trie represents a substring of characters found in the initial set. The connections between the nodes represent the characters that can follow a prefix. Here is a small example of a trie built from the words: “and”, “ant”, “dad”, “do”, & “dot”.

The root node represents an empty string prefix, so the two lettered edges directed out of it are the only letters that can appear as the first letter in the list of strings, “a” and “d”. Subsequent nodes have increasingly longer prefixes until the final valid words are reached. This layout should make it easy to see how a trie could be useful for quickly identifying strings that are not contained. Even at the root node, we can eliminate any strings that are presented that do not start with “a” or “d”. This paring down of the search space on every step gives reading from a trie the O(log(L)) we were looking for … but only for misses. Hits within a trie are still O(L), but that’s okay, because we are getting misses over 90% of the time.

Benchmarking a few trie implementations from crates.io was disheartening. Remember, most tries are used in response to keyboard events, so optimizing them to run in the hot path of tens of millions of requests per second is not a priority. The fastest existing implementation we found was radix_trie, but it still clocked in at a full microsecond slower than hashmap. The only thing left to do was write our own implementation of a trie that was optimized for our use case.

Trie Hard

And we did! Today we are announcing trie-hard. The repository gives a full description of how it works, but the big takeaway is that it gets its speed from storing node relationships in the bits of unsigned integers and keeping the entire tree in a contiguous chunk of memory. In our benchmarks, we found that trie-hard reduced the average runtime for clear_internal_headers to under a microsecond (0.93µs). We can reuse the same formula from above to calculate the expected CPU utilization for trie-hard to be 1.71% * 3.65 / 0.93 = 0.43% That means we have finally achieved and surpassed our goal by reducing the compute utilization of pingora-origin by 1.71% - 0.43% = 1.28%!

Up until now we have been working only in theory and local benchmarking. What really matters is whether our benchmarking reflects real-life behavior. Trie-hard has been running in production since July 2024, and over the course of this project we have been collecting performance metrics from the running production of pingora-origin using a statistical sampling of its stack trace over time. Using this technique, the CPU utilization percentage of a function is estimated by the percent of samples in which the function appears. If we compare the sampled performance of the different versions of clear_internal_headers, we can see that the results from the performance sampling closely match what our benchmarks predicted.

Implementation	Stack trace samples containing `clear_internal_headers`	Actual CPU Usage (%)	Predicted CPU Usage (%)
Original	19 / 1111	1.71	n/a
Hashmap	9 / 1103	0.82	0.72
trie-hard	4 / 1171	0.34	0.43

Conclusion

Optimizing functions and writing new data structures is cool, but the real conclusion for this post is that knowing where your code is slow and by how much is more important than how you go about optimizing it. Take a moment to thank your observability team (if you're lucky enough to have one), and make use of flame graphs or any other profiling and benchmarking tool you can. Optimizing operations that are already measured in microseconds may seem a little silly, but these small improvements add up.

Open sourcing Pingora: our Rust framework for building programmable network services

Yuchen Wu — Wed, 28 Feb 2024 15:00:11 GMT

Today, we are proud to open source Pingora, the Rust framework we have been using to build services that power a significant portion of the traffic on Cloudflare. Pingora is released under the Apache License version 2.0.

As mentioned in our previous blog post, Pingora is a Rust async multithreaded framework that assists us in constructing HTTP proxy services. Since our last blog post, Pingora has handled nearly a quadrillion Internet requests across our global network.

We are open sourcing Pingora to help build a better and more secure Internet beyond our own infrastructure. We want to provide tools, ideas, and inspiration to our customers, users, and others to build their own Internet infrastructure using a memory safe framework. Having such a framework is especially crucial given the increasing awareness of the importance of memory safety across the industry and the US government. Under this common goal, we are collaborating with the Internet Security Research Group (ISRG) Prossimo project to help advance the adoption of Pingora in the Internet’s most critical infrastructure.

In our previous blog post, we discussed why and how we built Pingora. In this one, we will talk about why and how you might use Pingora.

Pingora provides building blocks for not only proxies but also clients and servers. Along with these components, we also provide a few utility libraries that implement common logic such as event counting, error handling, and caching.

What’s in the box

Pingora provides libraries and APIs to build services on top of HTTP/1 and HTTP/2, TLS, or just TCP/UDS. As a proxy, it supports HTTP/1 and HTTP/2 end-to-end, gRPC, and websocket proxying. (HTTP/3 support is on the roadmap.) It also comes with customizable load balancing and failover strategies. For compliance and security, it supports both the commonly used OpenSSL and BoringSSL libraries, which come with FIPS compliance and post-quantum crypto.

Besides providing these features, Pingora provides filters and callbacks to allow its users to fully customize how the service should process, transform and forward the requests. These APIs will be especially familiar to OpenResty and NGINX users, as many map intuitively onto OpenResty's "*_by_lua" callbacks.

Operationally, Pingora provides zero downtime graceful restarts to upgrade itself without dropping a single incoming request. Syslog, Prometheus, Sentry, OpenTelemetry and other must-have observability tools are also easily integrated with Pingora as well.

Who can benefit from Pingora

You should consider Pingora if:

Security is your top priority: Pingora is a more memory safe alternative for services that are written in C/C++. While some might argue about memory safety among programming languages, from our practical experience, we find ourselves way less likely to make coding mistakes that lead to memory safety issues. Besides, as we spend less time struggling with these issues, we are more productive implementing new features.

Your service is performance-sensitive: Pingora is fast and efficient. As explained in our previous blog post, we saved a lot of CPU and memory resources thanks to Pingora’s multi-threaded architecture. The saving in time and resources could be compelling for workloads that are sensitive to the cost and/or the speed of the system.

Your service requires extensive customization: The APIs that the Pingora proxy framework provides are highly programmable. For users who wish to build a customized and advanced gateway or load balancer, Pingora provides powerful yet simple ways to implement it. We provide examples in the next section.

Let’s build a load balancer

Let's explore Pingora's programmable API by building a simple load balancer. The load balancer will select between https://1.1.1.1/ and https://1.0.0.1/ to be the upstream in a round-robin fashion.

First let’s create a blank HTTP proxy.

pub struct LB();

#[async_trait]
impl ProxyHttp for LB {
    async fn upstream_peer(...) -> Result> {
        todo!()
    }
}

Any object that implements the ProxyHttp trait (similar to the concept of an interface in C++ or Java) is an HTTP proxy. The only required method there is upstream_peer(), which is called for every request. This function should return an HttpPeer which contains the origin IP to connect to and how to connect to it.

Next let’s implement the round-robin selection. The Pingora framework already provides the LoadBalancer with common selection algorithms such as round robin and hashing, so let’s just use it. If the use case requires more sophisticated or customized server selection logic, users can simply implement it themselves in this function.

pub struct LB(Arc>);

#[async_trait]
impl ProxyHttp for LB {
    async fn upstream_peer(...) -> Result> {
        let upstream = self.0
            .select(b"", 256) // hash doesn't matter for round robin
            .unwrap();

        // Set SNI to one.one.one.one
        let peer = Box::new(HttpPeer::new(upstream, true, "one.one.one.one".to_string()));
        Ok(peer)
    }
}

Since we are connecting to an HTTPS server, the SNI also needs to be set. Certificates, timeouts, and other connection options can also be set here in the HttpPeer object if needed.

Finally, let's put the service in action. In this example we hardcode the origin server IPs. In real life workloads, the origin server IPs can also be discovered dynamically when the upstream_peer() is called or in the background. After the service is created, we just tell the LB service to listen to 127.0.0.1:6188. In the end we created a Pingora server, and the server will be the process which runs the load balancing service.

fn main() {
    let mut upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"]).unwrap();

    let mut lb = pingora_proxy::http_proxy_service(&my_server.configuration, LB(upstreams));
    lb.add_tcp("127.0.0.1:6188");

    let mut my_server = Server::new(None).unwrap();
    my_server.add_service(lb);
    my_server.run_forever();
}

Let’s try it out:

curl 127.0.0.1:6188 -svo /dev/null
> GET / HTTP/1.1
> Host: 127.0.0.1:6188
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 403 Forbidden

We can see that the proxy is working, but the origin server rejects us with a 403. This is because our service simply proxies the Host header, 127.0.0.1:6188, set by curl, which upsets the origin server. How do we make the proxy correct that? This can simply be done by adding another filter called upstream_request_filter. This filter runs on every request after the origin server is connected and before any HTTP request is sent. We can add, remove or change http request headers in this filter.

async fn upstream_request_filter(…, upstream_request: &mut RequestHeader, …) -> Result<()> {
    upstream_request.insert_header("Host", "one.one.one.one")
}

Let’s try again:

curl 127.0.0.1:6188 -svo /dev/null
< HTTP/1.1 200 OK

This time it works! The complete example can be found here.

Below is a very simple diagram of how this request flows through the callback and filter we used in this example. The Pingora proxy framework currently provides more filters and callbacks at different stages of a request to allow users to modify, reject, route and/or log the request (and response).

Behind the scenes, the Pingora proxy framework takes care of connection pooling, TLS handshakes, reading, writing, parsing requests and any other common proxy tasks so that users can focus on logic that matters to them.

Open source, present and future

Pingora is a library and toolset, not an executable binary. In other words, Pingora is the engine that powers a car, not the car itself. Although Pingora is production-ready for industry use, we understand a lot of folks want a batteries-included, ready-to-go web service with low or no-code config options. Building that application on top of Pingora will be the focus of our collaboration with the ISRG to expand Pingora's reach. Stay tuned for future announcements on that project.

Other caveats to keep in mind:

Today, API stability is not guaranteed. Although we will try to minimize how often we make breaking changes, we still reserve the right to add, remove, or change components such as request and response filters as the library evolves, especially during this pre-1.0 period.
Support for non-Unix based operating systems is not currently on the roadmap. We have no immediate plans to support these systems, though this could change in the future.

How to contribute

Feel free to raise bug reports, documentation issues, or feature requests in our GitHub issue tracker. Before opening a pull request, we strongly suggest you take a look at our contribution guide.

Conclusion

In this blog we announced the open source of our Pingora framework. We showed that Internet entities and infrastructure can benefit from Pingora’s security, performance and customizability. We also demonstrated how easy it is to use Pingora and how customizable it is.

Whether you're building production web services or experimenting with network technologies we hope you find value in Pingora. It's been a long journey, but sharing this project with the open source community has been a goal from the start. We'd like to thank the Rust community as Pingora is built with many great open-sourced Rust crates. Moving to a memory safe Internet may feel like an impossible journey, but it's one we hope you join us on.

How Pingora keeps count

Yuchen Wu — Fri, 12 May 2023 13:00:56 GMT

A while ago we shared how we replaced NGINX with our in-house proxy, Pingora. We promised to share more technical details as well as our open sourcing plan. This blog post will be the first of a series that shares both the code libraries that power Pingora and the ideas behind them.

Today, we take a look at one of Pingora’s libraries: pingora-limits.

pingora-limits provides the functionality to count inflight events and estimate the rate of events over time. These functions are commonly used to protect infrastructure and services from being overwhelmed by certain types of malicious or misbehaving requests.

For example, when an origin server becomes slow or unresponsive, requests will accumulate on our servers, which adds pressure on both our servers and our customers’ servers. With this library, we are able to identify which origins have issues, so that action can be taken without affecting other traffic.

The problem can be abstracted in a very simple way. The input is a (never ending) stream of different types of events. At any point, the system should be able to tell the number of appearances (or the rate) of a certain type of event.

In a simple example, colors are used as the type of event. The following is one possible example of a sequence of events:

red, blue, red, orange, green, brown, red, blue,...

In this example, the system should report that “red” appears three times.

The corresponding algorithms are straightforward to design. One obvious answer is to use a hash table, where the keys are the colors and the values are their corresponding appearances. Whenever a new event appears, the algorithm looks up the hash table and increases the appearance counter. It is not hard to tell that this algorithm’s time complexity is O(1) (per event) and the space complexity O(n) where n is the number of the types of events.

How Pingora does it

The hash table solution is fine in common scenarios, but we believe there are a few things that can be improved.

We observe traffic to millions of different servers when the misbehaving ones are only a few at a given time. It seems a bit wasteful to require space (memory) that holds the counter for all the keys.
Concurrently updating the hash table (especially when adding new keys) requires a lock. This behavior potentially forces all concurrent event processing to go through our system serialized. In other words, when lock contention is severe, the lock slows down the system.

The motivation to improve the above algorithm is even stronger considering such algorithms need to be deployed at scale. This algorithm operates on tens of thousands of machines. It handles more than twenty million requests per second. The benefits of efficiency improvement can be significant.

pingora-limits adopts a different approach: count–min sketch (CM sketch) estimation. CM sketch estimates the counts of events in O(1) (per event) but only using O(log(n)) of space (polylogarithmic, to be precise, more details here). Because of the simplicity of this algorithm, which we will discuss in a bit, it can be implemented without locks. Therefore, pingora-limits runs much faster and more efficiently compared to the hash table approach discussed earlier.

CM sketch

The idea of a CM sketch is similar to a Bloom filter. The mathematical details of the CM sketch can be found in this paper. In this section, we will just illustrate how it works.

A CM sketch data structure takes two parameters, H: number of hashes (rows) and N number of counters (columns) per hash (row). The rows and columns form a matrix. The space they take is H*N. Each row has its own independent hash function (hash_i()).

For this example, we use H=3 and N=4:

0	0	0	0
0	0	0	0
0	0	0	0

When an event, "red", arrives, it is counted by every row independently. Each row will use its own hashing function ( hash_i(“red”) ) to choose a column. The counter of the column is increased without worrying about collisions (see the end of this section).

The table below illustrates a possible state of the matrix after a single “red” event:

0	1	0	0
0	0	1	0
1	0	0	0

Then, let’s assume the event "blue" arrives, and we assume it collides with "red" at row 2: both hash to the third slot:

1	1	0	0
0	0	2	0
1	0	0	1

Let’s say after another series of events, “blue, red, red, red, blue, red”, So far the algorithm observed 5 “red”s and 3 “blue”s in total. Following the algorithm, the estimator eventually becomes:

3	5	0	0
0	0	8	0
5	0	0	3

Now, let’s see how the matrix reports the occurrence of each event. In order to retrieve the count of keys, the estimator just returns the minimal value of all the columns to which that key belongs. So the count of red is min(5, 8, 5) = 5 and blue is min(3, 8, 3) = 3.

This algorithm chooses the cells with the least collisions (via the min() operations). Therefore, collisions between events in single cells are acceptable because as long as there are collision free cells for a given type of event, the counting for that event is accurate.

The estimator can overestimate when two (or more) keys collide on all slots. Assuming there are only two keys, the probability of their total collision is 1/ N^H (1/64 in this example). On the other hand, it never underestimates because it never loses count of any events.

Practical implementation

Because the algorithm only requires hashing, array index and counter increment, it can be implemented in a few lines of code and lock-free.

The following is a code snippet of how it is implemented in Rust.

pub struct Estimator {
    estimator: Box<[(Box<[AtomicIsize]>, RandomState)]>,
}
 
impl Estimator {
    /// Increment `key` by the value given. Return the new estimated value as a result.
    pub fn incr(&self, key: T, value: isize) -> isize {
        let mut min = isize::MAX;
        for (slot, hasher) in self.estimator.iter() {
            let hash = hash(&key, hasher) as usize;
            let counter = &slot[hash % slot.len()];
            let current = counter.fetch_add(value, Ordering::Relaxed);
            min = std::cmp::min(min, current + value);
        }
        min
    }
}

Performance

We compare the design above with the two hash table based approaches.

naive: Mutex>. This approach references the simple hash table approach mentioned above. This design requires a lock on every operation.
optimized: DashMap. DashMap leverages multiple hash tables in order to shard the keys to reduce contentions across different keys. We also use atomic counters here so that counting existing keys won't need a write lock.

We have two test cases, one that is single threaded and another that is multi-threaded. In both cases, we have one million keys. We generate 100 million events from the keys. The keys are uniformly distributed among the events.

The results below are performed on Debian VM running on M1 MacBook Pro.

SpeedPer event (the incr() function above) timing, lower is better:

	pingora-limits	naive	optimized
Single thread	10ns	51ns	43ns
Eight threads	212ns	1505ns	212ns

In the single thread case, where there is no lock contention, our approach is 5x faster than the naive one and 4x faster than the optimized one. With multiple threads, there is a high amount of contention. Our approach is similar to the optimized version. Both are 7x faster than the naive one. The reason the performance of pingora-limits and the optimized hash table are similar is because in both approaches the hot path is just updating the atomic counter.

Memory consumptionLower is better. The numbers are collected only from the single threaded test cases for simplicity.

	peak memory bytes	total allocations	total allocated bytes
pingora-limits	26,184	9	26,184
naive	53,477,392	20	71,303,260
optimized	36,211,208	491	71,307,722

Pingora-limits at peak requires 1/2000 of the memory compared to the naive one and 1/1300 of the memory of the optimized one.

From the data above, pingora-limits is both CPU and memory efficient.

The estimator provided by Pingora-limits is a biased estimator because it is possible for it to overestimate the appearance of events.

In the case of accurate counting, where false positives are absolutely unacceptable, pingora-limits can still be very useful. It can work as a first stage filter where only the events beyond a certain threshold are fed to a hash table to perform accurate counting. In this case, the majority of low frequency event types are filtered out by the filter so that the hash table also consumes little memory without losing any accuracy.

How it is used in production

In production, pingora uses this library in a few places. The most common one is the connection limit feature. When our servers try to establish too many connections to a single origin server, in order to protect the server and our infrastructure from becoming overloaded, this feature will start rejecting new requests with 503 errors.

In this feature every incoming request increases a counter, shared by all other requests with the same customer ID, server IP and the server hostname. When the request finishes, the counter decreases accordingly. If the value of the counter is beyond a certain threshold, the request is rejected with a 503 error response. In our production environment we choose the parameters of the library so that a theoretical collision chance between two unrelated customers is about 1 / 2 ^ 52. Additionally, the rejection threshold is significantly higher than what a healthy customer’s traffic would reach. Therefore, even if multiple customers’ counters collide, it is not likely that the overestimated value would reach the threshold. So a false positive on the connection limit is not likely to happen.

Conclusion

Pingora-limits crate is available now on GitHub. Both the core functionality and the performance benchmark performed above can be found there.

In this blog post, we introduced pingora-limits, a library that counts events efficiently. We explained the core idea, which is based on a probabilistic data structure. We also showed through a performance benchmark that the pingora-limits implementation is fast and very efficient for memory consumption.

Not only that, but we will continue introducing and open sourcing Pingora components and libraries because we believe that sharing the idea behind the code is equally important as sharing the code itself.

Interested in joining us to help build a better Internet? Our engineering teams are hiring.

How we built Pingora, the proxy that connects Cloudflare to the Internet

Yuchen Wu — Wed, 14 Sep 2022 13:00:00 GMT

Introduction

Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using Rust that serves over 1 trillion requests a day, boosts our performance, and enables many new features for Cloudflare customers, all while requiring only a third of the CPU and memory resources of our previous proxy infrastructure.

As Cloudflare has scaled we’ve outgrown NGINX. It was great for many years, but over time its limitations at our scale meant building something new made sense. We could no longer get the performance we needed nor did NGINX have the features we needed for our very complex environment.

Many Cloudflare customers and users use the Cloudflare global network as a proxy between HTTP clients (such as web browsers, apps, IoT devices and more) and servers. In the past, we’ve talked a lot about how browsers and other user agents connect to our network, and we’ve developed a lot of technology and implemented new protocols (see QUIC and optimizations for http2) to make this leg of the connection more efficient.

Today, we’re focusing on a different part of the equation: the service that proxies traffic between our network and servers on the Internet. This proxy service powers our CDN, Workers fetch, Tunnel, Stream, R2 and many, many other features and products.

Let’s dig in on why we chose to replace our legacy service and how we developed Pingora, our new system designed specifically for Cloudflare’s customer use cases and scale.

Why build yet another proxy

Over the years, our usage of NGINX has run up against limitations. For some limitations, we optimized or worked around them. But others were much harder to overcome.

Architecture limitations hurt performance

The NGINX worker (process) architecture has operational drawbacks for our use cases that hurt our performance and efficiency.

First, in NGINX each request can only be served by a single worker. This results in unbalanced load across all CPU cores, which leads to slowness.

Because of this request-process pinning effect, requests that do CPU heavy or blocking IO tasks can slow down other requests. As those blog posts attest we’ve spent a lot of time working around these problems.

The most critical problem for our use cases is poor connection reuse. Our machines establish TCP connections to origin servers to proxy HTTP requests. Connection reuse speeds up TTFB (time-to-first-byte) of requests by reusing previously established connections from a connection pool, skipping TCP and TLS handshakes required on a new connection.

However, the NGINX connection pool is per worker. When a request lands on a certain worker, it can only reuse the connections within that worker. When we add more NGINX workers to scale up, our connection reuse ratio gets worse because the connections are scattered across more isolated pools of all the processes. This results in slower TTFB and more connections to maintain, which consumes resources (and money) for both us and our customers.

As mentioned in past blog posts, we have workarounds for some of these issues. But if we can address the fundamental issue: the worker/process model, we will resolve all these problems naturally.

Some types of functionality are difficult to add

NGINX is a very good web server, load balancer or a simple gateway. But Cloudflare does way more than that. We used to build all the functionality we needed around NGINX, which is not easy to do while trying not to diverge too much from NGINX upstream codebase.

For example, when retrying/failing over a request, sometimes we want to send a request to a different origin server with a different set of request headers. But that is not something NGINX allows us to do. In cases like this, we spend time and effort on working around the NGINX constraints.

Meanwhile, the programming languages we had to work with didn’t provide help alleviating the difficulties. NGINX is purely in C, which is not memory safe by design. It is very error-prone to work with such a 3rd party code base. It is quite easy to get into memory safety issues, even for experienced engineers, and we wanted to avoid these as much as possible.

The other language we used to complement C is Lua. It is less risky but also less performant. In addition, we often found ourselves missing static typing when working with complicated Lua code and business logic.

And the NGINX community is not very active, and development tends to be “behind closed doors”.

Choosing to build our own

Over the past few years, as we’ve continued to grow our customer base and feature set, we continually evaluated three choices:

Continue to invest in NGINX and possibly fork it to tailor it 100% to our needs. We had the expertise needed, but given the architecture limitations mentioned above, significant effort would be required to rebuild it in a way that fully supported our needs.
Migrate to another 3rd party proxy codebase. There are definitely good projects, like envoy and others. But this path means the same cycle may repeat in a few years.
Start with a clean slate, building an in-house platform and framework. This choice requires the most upfront investment in terms of engineering effort.

We evaluated each of these options every quarter for the past few years. There is no obvious formula to tell which choice is the best. For several years, we continued with the path of the least resistance, continuing to augment NGINX. However, at some point, building our own proxy’s return on investment seemed worth it. We made a call to build a proxy from scratch, and began designing the proxy application of our dreams.

The Pingora Project

Design decisions

To make a proxy that serves millions of requests per second fast, efficient and secure, we have to make a few important design decisions first.

We chose Rust as the language of the project because it can do what C can do in a memory safe way without compromising performance.

Although there are some great off-the-shelf 3rd party HTTP libraries, such as hyper, we chose to build our own because we want to maximize the flexibility in how we handle HTTP traffic and to make sure we can innovate at our own pace.

At Cloudflare, we handle traffic across the entire Internet. We have many cases of bizarre and non-RFC compliant HTTP traffic that we have to support. This is a common dilemma across the HTTP community and web, where there is tension between strictly following HTTP specifications and accommodating the nuances of a wide ecosystem of potentially legacy clients or servers. Picking one side can be a tough job.

HTTP status codes are defined in RFC 9110 as a three digit integer, and generally expected to be in the range 100 through 599. Hyper was one such implementation. However, many servers support the use of status codes between 599 and 999. An issue had been created for this feature, which explored various sides of the debate. While the hyper team did ultimately accept that change, there would have been valid reasons for them to reject such an ask, and this was only one of many cases of noncompliant behavior we needed to support.

In order to satisfy the requirements of Cloudflare's position in the HTTP ecosystem, we needed a robust, permissive, customizable HTTP library that can survive the wilds of the Internet and support a variety of noncompliant use cases. The best way to guarantee that is to implement our own.

The next design decision was around our workload scheduling system. We chose multithreading over multiprocessing in order to share resources, especially connection pools, easily. We also decided that work stealing was required to avoid some classes of performance problems mentioned above. The Tokio async runtime turned out to be a great fit for our needs.

Finally, we wanted our project to be intuitive and developer friendly. What we build is not the final product, and should be extensible as a platform as more features are built on top of it. We decided to implement a “life of a request” event based programmable interface similar to NGINX/OpenResty. For example, the “request filter” phase allows developers to run code to modify or reject the request when a request header is received. With this design, we can separate our business logic and generic proxy logic cleanly. Developers who previously worked on NGINX can easily switch to Pingora and quickly become productive.

Pingora is faster in production

Let’s fast-forward to the present. Pingora handles almost every HTTP request that needs to interact with an origin server (for a cache miss, for example), and we’ve collected a lot of performance data in the process.

First, let’s see how Pingora speeds up our customer’s traffic. Overall traffic on Pingora shows 5ms reduction on median TTFB and 80ms reduction on the 95th percentile. This is not because we run code faster. Even our old service could handle requests in the sub-millisecond range.

The savings come from our new architecture which can share connections across all threads. This means a better connection reuse ratio, which spends less time on TCP and TLS handshakes.

Across all customers, Pingora makes only a third as many new connections per second compared to the old service. For one major customer, it increased the connection reuse ratio from 87.1% to 99.92%, which reduced new connections to their origins by 160x. To present the number more intuitively, by switching to Pingora, we save our customers and users 434 years of handshake time every day.

More features

Having a developer friendly interface engineers are familiar with while eliminating the previous constraints allows us to develop more features, more quickly. Core functionality like new protocols act as building blocks to more products we can offer to customers.

As an example, we were able to add HTTP/2 upstream support to Pingora without major hurdles. This allowed us to offer gRPC to our customers shortly afterwards. Adding this same functionality to NGINX would have required significantly more engineering effort and might not have materialized.

More recently we've announced Cache Reserve where Pingora uses R2 storage as a caching layer. As we add more functionality to Pingora, we’re able to offer new products that weren’t feasible before.

More efficient

In production, Pingora consumes about 70% less CPU and 67% less memory compared to our old service with the same traffic load. The savings come from a few factors.

Our Rust code runs more efficiently compared to our old Lua code. On top of that, there are also efficiency differences from their architectures. For example, in NGINX/OpenResty, when the Lua code wants to access an HTTP header, it has to read it from the NGINX C struct, allocate a Lua string and then copy it to the Lua string. Afterwards, Lua has to garbage-collect its new string as well. In Pingora, it would just be a direct string access.

The multithreading model also makes sharing data across requests more efficient. NGINX also has shared memory but due to implementation limitations, every shared memory access has to use a mutex lock and only strings and numbers can be put into shared memory. In Pingora, most shared items can be accessed directly via shared references behind atomic reference counters.

Another significant portion of CPU saving, as mentioned above, is from making fewer new connections. TLS handshakes are expensive compared to just sending and receiving data via established connections.

Safer

Shipping features quickly and safely is difficult, especially at our scale. It's hard to predict every edge case that can occur in a distributed environment processing millions of requests a second. Fuzzing and static analysis can only mitigate so much. Rust's memory-safe semantics guard us from undefined behavior and give us confidence our service will run correctly.

With those assurances we can focus more on how a change to our service will interact with other services or a customer's origin. We can develop features at a higher cadence and not be burdened by memory safety and hard to diagnose crashes.

When crashes do occur an engineer needs to spend time to diagnose how it happened and what caused it. Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.

In fact, Pingora crashes are so rare we usually find unrelated issues when we do encounter one. Recently we discovered a kernel bug soon after our service started crashing. We've also discovered hardware issues on a few machines, in the past ruling out rare memory bugs caused by our software even after significant debugging was nearly impossible.

Conclusion

To summarize, we have built an in-house proxy that is faster, more efficient and versatile as the platform for our current and future products.

We will be back with more technical details regarding the problems we faced, the optimizations we applied and the lessons we learned from building Pingora and rolling it out to power a significant portion of the Internet. We will also be back with our plan to open source it.

Pingora is our latest attempt at rewriting our system, but it won’t be our last. It is also only one of the building blocks in the re-architecting of our systems.

Interested in joining us to help build a better Internet? Our engineering teams are hiring.