
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Thu, 09 Apr 2026 21:52:02 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How Picsart leverages Cloudflare's Developer Platform to build globally performant services]]></title>
            <link>https://blog.cloudflare.com/picsart-move-to-workers-huge-performance-gains/</link>
            <pubDate>Wed, 03 Apr 2024 13:00:02 GMT</pubDate>
            <description><![CDATA[ Picsart, one of the world’s largest digital creation platforms, encountered performance challenges in catering to its global audience. ]]></description>
            <content:encoded><![CDATA[ <p>Delivering great user experiences with a global user base can be challenging. While serving requests quickly when you start out in a local market is straightforward, doing so for a global audience is much more difficult. Why? Even under optimal conditions, you cannot be faster than the speed of light, which brings single data center solutions to their performance limits.</p><p>In this post, we will cover how Picsart improved the performance of one of its most critical services by moving from a centralized architecture to a globally distributed service built on Cloudflare. Our serverless compute platform, <a href="https://developers.cloudflare.com/workers/">Workers</a>, distributed throughout <a href="https://www.cloudflare.com/network/">310+ cities</a> around the world, and our globally distributed <a href="https://developers.cloudflare.com/kv/">Workers KV</a> storage allowed them to improve their performance significantly and drive real business impact.</p>
    <div>
      <h2>Success driven by data-driven insights</h2>
      <a href="#success-driven-by-data-driven-insights">
        
      </a>
    </div>
    <p><a href="https://picsart.com">Picsart</a> is one of the world’s largest digital creation platforms and a long-standing Cloudflare partner. At its core, an advanced tech stack powers its comprehensive features, including AI-driven photo and video editing tools and community-driven content sharing. With its infrastructure spanning across multiple cloud environments and on-prem deployments, Picsart is engineered to handle billions of daily requests from its huge mobile and web user base and API integrations. For over a decade, Cloudflare has been integral to Picsart, providing support for performant content delivery and securing its digital ecosystem.  </p><p>Similar to many other tech giants, Picsart approaches product development in a data-driven way. At the core of the innovation is Picsart's remote configuration and experimentation platform, which enables product managers, UX researchers, and others to segment their user base into different test groups. These test groups might get to see slightly different implementations of features or designs of the Picsart app. Users might also get early access to experimental features or see different in-app promotions. In combination with constant monitoring of relevant KPIs, this allows for informed product decisions based on user preference and business impact.</p><p>On each app start, the client device sends a request to the remote configuration service for the latest setup tailored to the user's session. The assignment of experiments relies on factors like the operating system and previous sessions, making each request unique and uncachable. Picsart's app showcases extensive remote configuration capabilities, enabling adjustments to nearly every element. This results in a response containing a 1.5 MB configuration file for mobile clients. 
While the long-term solution is to reduce the file size, which has grown over time as more teams adopted the powerful service, this is not possible in the near or mid-term as it requires a major rewrite of all clients.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5xgUdo4wBKIbu0YzxXotHi/3e855c5adc807cb115a3a59fcaf1893b/Screenshot-2024-04-01-at-2.47.52-PM.png" />
            
            </figure><p>This setup request is blocking in the "hot path" during app start, as the results of this request will decide how the app itself looks and behaves. Hence, performance is critical. To ensure users do not wait too long, Picsart apps will wait for 1500ms on mobile for the request to complete – if it does not, the user will not be assigned a test group and the app will fall back to default settings.</p>
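<p>The timeout-and-fallback behavior described above could be sketched for a web client roughly like this; the endpoint URL and default configuration are illustrative placeholders, not Picsart's actual values:</p>
<p></p>

```javascript
// Minimal sketch of the 1500 ms time budget: abort the configuration
// request when the budget is exceeded and fall back to default settings.
// DEFAULT_CONFIG and the endpoint URL are hypothetical placeholders.
const DEFAULT_CONFIG = { experiments: [] };

async function loadConfig(url, timeoutMs = 1500) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
        const response = await fetch(url, { signal: controller.signal });
        return await response.json();
    } catch (error) {
        // Timeout or network failure: no test group is assigned,
        // and the app continues with its default settings.
        return DEFAULT_CONFIG;
    } finally {
        clearTimeout(timer);
    }
}
```

<p>The <code>AbortController</code> ensures the request itself is cancelled once the budget is spent, rather than lingering in the background.</p>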
    <div>
      <h2>The clock is ticking</h2>
      <a href="#the-clock-is-ticking">
        
      </a>
    </div>
    <p>While a 1500ms round trip time seems like a sufficiently large time budget, the data suggested otherwise. Before the improvements were implemented, a staggering 50% of devices could not complete the requests in time. How come? Within these 1.5 seconds, the following steps need to complete:</p><ol><li><p>The request must travel from the users’ devices to the centralized backend servers.</p></li><li><p>The server processes the request based on dozens of user attributes provided in the request and thousands of defined remote configuration variations, running experiments, and segment metadata. Using all of this information, the server selects the right variation of each remote setting entry and builds the response payload.</p></li><li><p>The response must travel from the centralized backend servers to the user devices.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/54ZFnOPI3zMI2FNmf5uOwU/18cc25a5ee55887e590128be733b7774/image1.png" />
            
            </figure><p>Looking at the data, it was clear to the Picsart team that their backend service was already well-optimized, taking only 30 milliseconds, a tiny fraction of the available time budget, to process each of the billions of monthly requests. The bulk of the request time came from <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">network latency</a>. Especially with mobile devices, last mile performance can be very volatile, eating away a significant amount of the available time budget. Not only that, but the data was clear: users closer to the origin server had a much higher chance of making the round trip in time versus users out of region. It quickly became obvious that Picsart, fueled by its global success, had outgrown a single-region setup.</p>
    <div>
      <h2>To the drawing board</h2>
      <a href="#to-the-drawing-board">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4XTH17l3JlPKoeV7i0MO5p/501ffe57026f80797089df6ddd627892/image6.png" />
            
            </figure><p>A solution that comes to mind would be to replicate the existing cloud infrastructure in multiple regions and use global load balancing to minimize the distance a request needs to travel. However, this introduces significant overhead and cost. On the infrastructure side, it is not only the additional compute instances and database clusters that incur cost, but also cross-region data transfer to keep data in sync. Moreover, technical teams would need to operate and monitor infrastructure in multiple regions, which can add a lot to the complexity and cognitive load, leading to decreased development velocity and productivity loss.</p><p>Picsart instead looked to Cloudflare – we already had a long-lasting relationship for <a href="https://www.cloudflare.com/application-services/">Application Performance and Security</a>, and they aimed to use our <a href="https://www.cloudflare.com/developer-platform/">Developer Platform</a> to tackle the problem.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7mRjLlmOaeyNOM6eUUudvZ/fc39dee9770bf76bef154efd52cdb48b/image2.png" />
            
            </figure><p>Workers and Workers KV seemed like the ideal solution. Both compute and data are globally distributed in <a href="https://www.cloudflare.com/network">310+ locations around the world</a>, resulting in a shorter distance between end users and the experimentation service. Not only that, but Cloudflare's global-by-default approach allows for deployment with minimal overhead, and in contrast to other considered solutions, no additional fees to distribute the data around the globe.</p>
    <div>
      <h2>No race without a clock</h2>
      <a href="#no-race-without-a-clock">
        
      </a>
    </div>
    <p>The objective for the refactor of the experimentation service was to increase the share of devices that successfully receive experimentation configuration within the set time budget.</p><p>But how to measure success? While synthetic testing can be useful in many situations, Picsart opted for a clever alternative:</p><p>During development, the Picsart engineers had already added a testing endpoint to the web and mobile versions of their app that sends a duplicate request to the new endpoint, discarding the response and swallowing all potential errors. This allowed them to collect timing data based on real-user metrics without impacting the app's performance and reliability.</p><p>A simplified version of this pattern for a web client could look like this:</p>
            <pre><code>// API endpoint URLs
const prodUrl = 'https://prod.example.com/';
const devUrl = 'https://new.example.com/';

// Function to collect metrics
const collectMetrics = (duration) =&gt; {
    console.log('Request duration:', duration);
    // …
};

// Function to fetch data from an endpoint and call collectMetrics
const fetchData = async (url, options) =&gt; {
    const startTime = performance.now();
    
    try {
        const response = await fetch(url, options);
        const endTime = performance.now();
        const duration = endTime - startTime;
        collectMetrics(duration);
        return await response.json();
    } catch (error) {
        console.error('Error fetching data:', error);
    }
};

// Fetching data from both endpoints
async function fetchDataFromBothEndpoints() {
    try {
        const result1 = await fetchData(prodUrl, { method: 'POST', /* … */ });
        console.log('Result from endpoint 1:', result1);

        // Fetching data from the second endpoint without awaiting its completion
        fetchData(devUrl, { method: 'POST', /* … */ });
    } catch (error) {
        console.error('Error fetching data from both endpoints:', error);
    }
}

fetchDataFromBothEndpoints();</code></pre>
            <p>Using existing data analytics tools, Picsart was able to analyze the performance of the new service from day one, starting with a dummy endpoint and a 'hello world' response. With that, a v0 was created that did not yet contain the real logic, but simulated reading multiple values from KV and returning a response of realistic size to the end user.</p>
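<p>Such a v0 can be approximated by a handler that skips the real logic and simply returns a payload of realistic size; the sizes and names here are assumptions for illustration, not Picsart's actual values:</p>
<p></p>

```javascript
// Hypothetical v0: no experiment logic yet, just a response body of
// roughly the same size as the real 1.5 MB mobile configuration, so that
// real-user timing data reflects realistic network transfer costs.
function dummyPayload(bytes) {
    return JSON.stringify({ config: 'x'.repeat(bytes) });
}

const body = dummyPayload(1.5 * 1024 * 1024);
// In a Worker, this body would be returned as, e.g.,
// new Response(body, { headers: { 'content-type': 'application/json' } });
```
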
    <div>
      <h2>The need for a do-over</h2>
      <a href="#the-need-for-a-do-over">
        
      </a>
    </div>
    <p>In the initial phase, outcomes fell short of expectations. Surprisingly, requests were slower despite the service's proximity to end users. What caused this setback? Subsequent investigation revealed multiple culprits and design patterns in need of optimization.</p>
    <div>
      <h3>Data segmentation</h3>
      <a href="#data-segmentation">
        
      </a>
    </div>
    <p>The previous, stateful solution operated on a single massive "blob" of data exceeding 100MB in value. Loading this into memory incurred a few seconds of initial startup time, but once the VM completed the task, request handling was fast, benefiting from the readily available data in memory.</p><p>However, this approach doesn't seamlessly transition to the serverless realm. Unlike long-running VMs, Worker isolates have short lifespans. Repeatedly parsing large JSON objects led to prolonged compute durations. Simply parsing four KV entries of 25MB each (KV maximum value size is 25MB) on each request was not a feasible option.</p><p>The Picsart team went back to solution design and embarked on a journey to optimize their system's execution time, resulting in a series of impactful improvements.</p><p>The fundamental insight that guided the solution was the unnecessary overhead involved in loading and parsing data irrelevant to the user's specific context. The 100MB configuration file contained configurations for all platforms and locations worldwide – a setup that was far from efficient in a globally distributed, serverless compute environment. For instance, when processing requests from users in the United States, there was no need to fetch configurations targeted at users in other countries, or at different platforms.</p><p>To address this inefficiency, the Picsart team stored the configuration of each platform and country in separate KV records. This targeted strategy meant that for a request originating from a US user on an Android device, the system would only fetch and parse the KV record specific to Android users in the US, thereby excluding all irrelevant data. This resulted in approximately 600 KV records, each with a maximum size of 10MB. While this leads to data duplication on the KV storage side, it decreases the amount of data that needs to be parsed upon request. 
As Cloudflare operates in over 120 countries around the world, only a subset of records were needed in each location. Hence, the increase in cardinality had minimal impact on KV cache performance, as demonstrated by more than 99.5% of KV reads being served from local cache.</p><table>
  <tr>
    <td><b>Key</b></td>
    <td><b>Size</b></td>
  </tr>
  <tr>
    <td>com.picsart.studio_apple_us.json</td>
    <td>6.1MB</td>
  </tr>
  <tr>
    <td>com.picsart.studio_apple_de.json</td>
    <td>6.1MB</td>
  </tr>
  <tr>
    <td>com.picsart.studio_android_us.json</td>
    <td>5.9MB</td>
  </tr>  
  <tr>
    <td>…</td>
    <td>…</td>
  </tr>  
</table>
<small>After (simplified)</small><br /><br /><p>This approach was a significant move for Picsart as they transitioned from a regional cloud setup to Cloudflare's globally distributed connectivity cloud. By serving data from close proximity to end user locations, they were able to combat the high network latency of their previous setup. This strategy radically transformed the data-handling process, unlocking two major benefits:</p><ul><li><p>Performance gains: by ensuring that only the relevant subset of data is fetched and parsed based on the user's platform and geographical location, the wall time and compute resources required for these operations were significantly reduced.</p></li><li><p>Scalability and flexibility: the granular segmentation of data enables effortless scaling of the service for new features or regional content. Adding support for new applications now only requires inserting new, standalone KV records, in contrast to the previous solution, where this would have required growing the single record.</p></li></ul>
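<p>Under this segmentation scheme, the Worker only needs to derive the record key for the caller's platform and country before reading from KV. A sketch of that lookup, with the key format assumed from the example keys above and the function names being illustrative:</p>
<p></p>

```javascript
// Build the KV key for one platform/country segment (format assumed
// from the examples above, e.g. "com.picsart.studio_apple_us.json").
function configKey(appId, platform, country) {
    return `${appId}_${platform}_${country.toLowerCase()}.json`;
}

// In a Worker, the country is available on the incoming request, so the
// read becomes a single small fetch instead of parsing one 100MB blob:
//   const key = configKey('com.picsart.studio', platform, request.cf.country);
//   const config = await env.CONFIG_KV.get(key, { type: 'json' });
console.log(configKey('com.picsart.studio', 'apple', 'US'));
```
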
    <div>
      <h3>Immutable updates</h3>
      <a href="#immutable-updates">
        
      </a>
    </div>
    <p>Now that changes to the configuration were segmented by app, country, and platform, this also allowed for individual updates of the configuration in KV. KV storage showcases its best performance when records are updated infrequently but read very often. This pattern leverages <a href="https://developers.cloudflare.com/kv/reference/how-kv-works/">KV's fundamental design</a> to cache values at edge locations upon reads, ensuring that subsequent queries for the same record are swiftly served by local caches rather than requiring a trip back to KV's centralized data centers. This architecture is fundamental for minimizing latency and maximizing the speed of data retrieval across a globally distributed platform.</p><p>A crucial requirement for Picsart’s experimentation system was the ability to propagate updates of remote configuration values immediately. Updating existing records would require very short cache TTLs and even the minimum KV cache TTL of 60 seconds was considered unacceptable for the dynamic nature of the feature flagging. Moreover, setting short <a href="https://www.cloudflare.com/learning/cdn/glossary/time-to-live-ttl/">TTLs</a> also impacts the <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cache-hit-ratio/">cache hit ratio</a> and the overall KV performance, specifically in regions with low traffic.</p><p>To reconcile the need for both rapid updates and efficient caching, Picsart adopted an innovative approach: making KV records immutable. Instead of modifying existing records, they opted to create new records with each configuration change. By appending the content hash to the KV key and writing new records after each update, Picsart ensured that each record was unique and immutable. This allowed them to leverage higher cache TTLs, as these records would never be updated.</p><table>
  <tr>
    <td><strong>Key</strong></td>
    <td><strong>TTL</strong></td>
  </tr>
  <tr>
    <td>com.picsart.studio_apple_us_b58b59.json</td>
    <td>86400s</td>
  </tr>
  <tr>
    <td>com.picsart.studio_apple_us_273678.json</td>
    <td>86400s</td>
  </tr>
  <tr>
    <td>…</td>
    <td>…</td>
  </tr>  
</table>
<small>After (simplified)</small><br /><br /><p>There was a catch, though. The service must now keep track of the correct KV keys to use. The Picsart team addressed this challenge by storing references to the latest KV keys in the environment variables of the Worker.</p><p>Each configuration change triggers a new KV pair to be written and the service's environment variables to be updated. As global Workers deployments take mere seconds, changes to the experimentation and configuration data are globally available near-instantaneously.</p>
    <div>
      <h3>JSON serialization &amp; alternatives</h3>
      <a href="#json-serialization-alternatives">
        
      </a>
    </div>
    <p>Following the previous improvements, the Picsart team made another significant discovery: only a small fraction of configuration data is needed to assign the experiments, while the remaining majority of the data comprises JSON values for the remote configuration payloads. While the service must deliver the latter in the response, the data is not required during the initial processing phase.</p><p>The initial implementation used <a href="https://developers.cloudflare.com/kv/api/read-key-value-pairs/">KV's get()</a> method to retrieve the configuration data with the parameter type=<code>json</code>, which converts the KV value to an object. This process is very compute-intensive compared to using the <code>get()</code> method with parameter type=<code>text</code>, which simply returns the value as a string. In the context of Picsart's data, the bulk of the CPU cycles were wasted on deserializing JSON data that is not needed to perform the required business logic.</p><p>What if the data structure and code could be changed in such a way that only the data needed to assign experiments was parsed as JSON, while the configuration values were treated as text? Picsart went ahead with a new approach: splitting the KV records into two, creating a small 300 KB record for the metadata, which can be quickly parsed to an object, and a 9.7 MB record of configuration values. The extracted configuration values are delimited by newline characters. The respective line number is used as a reference in the metadata entry, so that the configuration value for an experiment can be merged back into the response payload later.</p><div>
    <figure>
        <pre><code>{
  "name": "shape_replace_items",
  "default_value": "&lt;large json object&gt;",
  "segments": [
    {
      "id": "f1244",
      "value": "&lt;Another json object&gt;"
    },
    {
      "id": "a2lfd",
      "value": "&lt;Yet another large json object&gt;"
    }
  ]
}</code></pre>
        <figcaption><i>Before: metadata and values in one JSON object (simplified)</i></figcaption>
        <pre><code>// com.picsart.studio_apple_am_metadata
{
  "name": "shape_replace_items",
  "default_value": 1,
  "segments": [
    {
      "id": "f1244",
      "value": 2
    },
    {
      "id": "a2lfd",
      "value": 3
    }
  ]
}

// com.picsart.studio_apple_am_values
1 "&lt;large json object&gt;"
2 "&lt;Another json object&gt;"
3 "&lt;Yet another json object&gt;"</code></pre>
        <figcaption><i>After: metadata and values are split (simplified)</i></figcaption>
    </figure>
</div><p>After calculating the experiments and selecting the correct variants based solely on the small metadata entry, the service constructs a JSON string for the response containing placeholders for the actual values that reference the corresponding line numbers in the separate text file. To finalize the response, the server replaces the placeholders with the corresponding serialized JSON strings from the text file. This approach circumvents the need for parsing and re-serializing large JSON objects and helps to avoid a significant computational overhead.</p><p>Note that the process of parsing the metadata JSON and determining the correct experiments as well as the loading of the large file with configuration values are executed in parallel, saving additional milliseconds.</p><p>By minimizing the size of the JSON data that needed to be parsed and leveraging a more efficient method for constructing the final response, the Picsart team managed to not only reduce the response times but also optimize the compute resource usage. This approach reflects a broader principle applicable across the tech industry: that efficiency, particularly in serverless architectures, can often be dramatically improved by rethinking data structure and utilization.</p>
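<p>A simplified sketch of that split-and-merge step follows; the record contents mirror the simplified example above, and the merge logic is an illustration of the technique, not Picsart's actual code:</p>
<p></p>

```javascript
// Metadata is small and parsed as JSON; values stay as raw,
// newline-delimited JSON strings that are never deserialized.
const metadataText =
    '{"name":"shape_replace_items","default_value":1,' +
    '"segments":[{"id":"f1244","value":2},{"id":"a2lfd","value":3}]}';
const valuesText =
    '"<large json object>"\n"<Another json object>"\n"<Yet another json object>"';

const metadata = JSON.parse(metadataText); // cheap: a few hundred KB in practice
const lines = valuesText.split('\n');      // raw strings, no JSON.parse

// Select a variant (here simply the default) and splice the
// already-serialized value into the response string by line number.
const lineNo = metadata.default_value;     // 1-based line reference
const responseBody = `{"name":"${metadata.name}","value":${lines[lineNo - 1]}}`;
// responseBody is valid JSON, produced without re-serializing the large values
```

<p>Because the large values are concatenated as strings rather than parsed and re-stringified, the expensive part of the request is reduced to one small <code>JSON.parse</code> call.</p>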
    <div>
      <h2>Getting a head start</h2>
      <a href="#getting-a-head-start">
        
      </a>
    </div>
    <p>The server-side changes, moving from a single-region setup to Cloudflare’s global architecture, paid off massively. Median response times globally dropped by more than 1 second, which was already a huge improvement for the team. However, looking at the new data, two more paths for client-side optimization were found.</p><p>As the web and mobile apps call the service at startup, most of the time no active connection to the servers is alive, and establishing that connection at request time costs valuable milliseconds.</p><p>For the web version, adding a <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/rel/preconnect">preconnect</a> resource hint on initial page load showed a positive impact. For the mobile app version, the Picsart team took it a step further. Investigation showed that before the connection could be established, three modules had to complete initialization: the error tracker, the HTTP client, and the SDK. Reordering the modules to initialize the HTTP client first allowed the connection to be established in parallel with the initialization of the SDK and error tracker, again saving time. This resulted in another 200ms improvement for end users.</p>
    <div>
      <h2>Setting a new personal best</h2>
      <a href="#setting-a-new-personal-best">
        
      </a>
    </div>
    <p>The day had come, and it was time for the phased rollout: web first, mobile apps second. The team watched the dashboards in suspense and was pleasantly surprised: the rollout was successful, and billions of requests were handled with ease.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1aNPXwkM6Nk2lWZaDmJ4NV/e9b3268e3d395a34295ff72231b60e39/image4.png" />
            
            </figure><p><i>Share of successfully delivered experiments</i></p><p>The result? The Picsart apps are loading faster than ever for millions of users worldwide, while the share of successfully delivered experiments increased from 50% to 85%. Median response time dropped from 1500 ms to 280 ms on mobile, and to 70 ms on the web, where the response payload is smaller. This translates to real business impact for Picsart, as they can now deliver more personalized, data-driven experiences to even more of their users.</p>
    <div>
      <h2>A bright future ahead</h2>
      <a href="#a-bright-future-ahead">
        
      </a>
    </div>
    <p>Picsart is already thinking of the next generation of experimentation. To integrate with Cloudflare even further, the plan is to use Durable Objects to store hundreds of millions of user data records in a decentralized fashion, enabling even more powerful experiments without impacting performance. This is possible thanks to Durable Objects' underlying architecture that stores the user data in-region, close to the respective end user device.</p><p>Beyond that, Picsart’s experimentation team is also planning to onboard external B2B customers to their experimentation platform as Cloudflare's developer platform provides them with the scale and global network to handle more traffic and data with ease.</p>
    <div>
      <h2>Get started yourself</h2>
      <a href="#get-started-yourself">
        
      </a>
    </div>
    <p>If you’re also working on or with an application that would benefit from Cloudflare’s speed and scale, check out our developer <a href="https://developers.cloudflare.com/workers/">documentation</a> and <a href="https://developers.cloudflare.com/workers/tutorials/">tutorials</a>, and join our developer <a href="https://discord.cloudflare.com/">Discord</a> to get community support and inspiration.</p> ]]></content:encoded>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Cloudflare Workers KV]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <guid isPermaLink="false">2gXle7Kq9oC5BbCmX6jlwD</guid>
            <dc:creator>Mark Dembo</dc:creator>
            <dc:creator>Shant Marouti</dc:creator>
        </item>
        <item>
            <title><![CDATA[No, AI did not break post-quantum cryptography]]></title>
            <link>https://blog.cloudflare.com/kyber-isnt-broken/</link>
            <pubDate>Thu, 16 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ The recent news reports of AI cracking post-quantum cryptography are greatly exaggerated. In this blog, we take a deep dive into the world of side-channel attacks and how AI has been used for more than a decade already to aid it ]]></description>
            <content:encoded><![CDATA[ <p></p><p><a href="https://www.securityweek.com/ai-helps-crack-a-nist-recommended-post-quantum-encryption-algorithm/">News coverage</a> of a recent paper caused a bit of a stir with this headline: “<a href="https://www.securityweek.com/ai-helps-crack-a-nist-recommended-post-quantum-encryption-algorithm/">AI Helps Crack NIST-Recommended Post-Quantum Encryption Algorithm</a>”. The news article claimed that <a href="https://pq-crystals.org/kyber/index.shtml">Kyber</a>, the encryption algorithm in question, <a href="/post-quantum-for-all/">which we have deployed world-wide</a>, had been “broken.” Even more dramatically, the news article claimed that “the revolutionary aspect of the research was to apply deep learning analysis to side-channel differential analysis”, which seems aimed to scare the reader into wondering what <a href="https://www.cloudflare.com/learning/ai/what-is-artificial-intelligence/">Artificial Intelligence (AI)</a> will break next.</p><p>Reporting on the paper has been wildly inaccurate: <b>Kyber is not broken</b> and AI has been used for more than a decade now to aid side-channel attacks. To be crystal clear: our concern is with the news reporting around the paper, not the quality of the paper itself. In this blog post, we will explain how AI is actually helpful in cryptanalysis and dive into the <a href="https://eprint.iacr.org/2022/1713">paper</a> by Dubrova, Ngo, and Gärtner (DNG), which has been misrepresented by the news coverage. We’re honored to have Prof. Dr. <a href="https://www.cs.ru.nl/~lejla/">Lejla Batina</a> and Dr. <a href="https://www.ru.nl/en/people/picek-s">Stjepan Picek</a>, world-renowned experts in the field of applying AI to side-channel attacks, join us on this blog.</p><p>We start with some background, first on side-channel attacks and then on Kyber, before we dive into the paper.</p>
    <div>
      <h2>Breaking cryptography</h2>
      <a href="#breaking-cryptography">
        
      </a>
    </div>
    <p>When one thinks of breaking cryptography, one imagines a room full of mathematicians puzzling over minute patterns in intercepted messages, aided by giant computers, until they figure out the key. Famously, in World War II the Nazis’ Enigma cipher was broken this way, allowing the Allied forces to read along with their communications.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/oJxQ8srTTh2HafYmNbKaP/c07d942fdda653c963fd047756e8b6f4/image7-5.png" />
            
            </figure><p>It’s exceedingly rare for modern established cryptography to get broken head-on in this way. The last catastrophically broken cipher was RC4, designed in 1987, while AES, designed in 1998, stands proud with barely a scratch. The last big break of a cryptographic hash was on SHA-1, designed in 1995, while SHA-2, published in 2001, remains untouched in practice.</p><p>So what to do if you can’t break the cryptography head-on? Well, you get clever.</p>
    <div>
      <h2>Side-channel attacks</h2>
      <a href="#side-channel-attacks">
        
      </a>
    </div>
    <p>Can you guess the pin code for this gate?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UhJuQgQaoWjZ4JHSAHff6/0bebc7553f411c4246a750e39b1f627f/image4-1.jpg" />
            
            </figure><p>You can clearly see that some of the keys are more worn than the others, suggesting heavy use. This observation gives us some insight into the correct PIN, namely which digits it contains. The correct order is not immediately clear: it might be 1580, 8510, or even 115085, but guessing among these is a lot easier than trying every possible PIN. This is an example of a <i>side-channel attack</i>. Using the security feature (entering the PIN) had an unintended consequence (abrading the paint), which leaked information.</p><p>There are many different types of side channels, and which one you should worry about depends on the context. For instance, the sounds your keyboard makes as you type <a href="https://dl.acm.org/doi/abs/10.1145/1609956.1609959">leak what you write</a>, but you should not worry about that if no one is listening in.</p>
    <div>
      <h3>Remote timing side channel</h3>
      <a href="#remote-timing-side-channel">
        
      </a>
    </div>
    <p>When writing cryptography in software, one of the best-known side channels is the time it takes for an algorithm to run. For example, let’s take the classic example of creating an RSA signature. Grossly simplified, to sign a message <i>m</i> with private key <i>d</i>, we compute the signature <i>s</i> as m<sup>d</sup> (mod n). Computing such a large exponentiation directly is expensive, but luckily, because we’re doing modular arithmetic, there is the <a href="https://youtu.be/cbGB__V8MNk">square-and-multiply</a> trick. Here is a naive implementation in pseudocode:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ugGnB4EmViLt8e9E97YyJ/d204b867a50f5470d0ae680f4eea47fd/Screenshot-2023-03-16-at-12.35.40.png" />
            
            </figure><p>The algorithm loops over the bits of the secret key, and does a <i>multiply</i> step if the current bit is a 1. Clearly, the runtime depends on the secret key. Not great, but if the attacker can only time the full run, then they only learn the number of 1s in the secret key. The typical catastrophic timing attack against RSA instead is hidden behind the “<b>mod</b> n”. In a naive implementation this modular reduction is slower if the number being reduced is larger than or equal to <i>n</i>. This <a href="https://www.cs.sjsu.edu/faculty/stamp/students/article.html#:~:text=Timing%20attacks%20are%20a%20form,performed%20or%20the%20media%20used.">allows</a> an attacker to send specially crafted messages to tease out the secret key bit by bit, and similar attacks are <a href="https://crypto.stanford.edu/~dabo/papers/ssl-timing.pdf">surprisingly practical</a>.</p><p>Because of this, the mantra is: cryptography should run in “constant time”. This means that the runtime does not depend on any secret information. In our example, to remove the first timing issue, one would replace the if-statement with something equivalent to:</p>
            <pre><code>	s = ((s * powerOfM) mod n) * bit(d, i) + s * (1 - bit(d, i))</code></pre>
            <p>This ensures that the multiplication is always done. Similar countermeasures prevent practically all remote timing attacks.</p>
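<p>To make the difference concrete, here is a small Python sketch of both variants (our own illustration, not from the original post). Python integers are not actually constant-time, so this only shows the structure of the countermeasure:</p>

```python
def sign_naive(m, d, n):
    """Naive square-and-multiply: the multiply step only runs for 1-bits
    of the secret exponent d, so the runtime leaks information about d."""
    s = 1
    power = m % n
    while d:
        if d & 1:
            s = (s * power) % n   # only taken when the key bit is 1
        power = (power * power) % n
        d >>= 1
    return s

def sign_branchless(m, d, n):
    """Branchless variant: always multiply, then select the result
    arithmetically based on the key bit, removing the data-dependent branch."""
    s = 1
    power = m % n
    while d:
        t = (s * power) % n
        bit = d & 1
        s = t * bit + s * (1 - bit)   # constant-time select
        power = (power * power) % n
        d >>= 1
    return s

# Both match Python's built-in modular exponentiation:
assert sign_naive(42, 0b101101, 10007) == pow(42, 0b101101, 10007)
assert sign_branchless(42, 0b101101, 10007) == pow(42, 0b101101, 10007)
```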
    <div>
      <h3>Power side-channel</h3>
      <a href="#power-side-channel">
        
      </a>
    </div>
    <p>The story is quite different for power side-channel attacks. Again, the classic example is RSA signatures. If we hook up an oscilloscope to a smartcard that uses the naive algorithm from before, and measure the power usage while it signs, we can read off the private key by eye:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BgQG934ghuxV9D6JoSF0H/14c882de669ff23990f96a1281304e3f/image1-33.png" />
            
            </figure><p>Even if we use a constant-time implementation, there are still minute changes in power usage that can be detected. The underlying issue is that hardware gates that switch use more power than those that don’t. For instance, computing 127 + 64 takes more energy than 64 + 64.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3U58Xc7isXGpdg2bHxhTRA/c8b3ea9d01d38f9af644a2cd51079caf/image3-25.png" />
            
            </figure><p><i>127+64 and 64+64 in binary. There are more switched bits in the first.</i></p><p><b>Masking</b></p><p>A common countermeasure against power side-channel leakage is <i>masking</i>. This means that before using the secret information, it is split randomly into <i>shares</i>. Then, the brunt of the computation is done on the shares, which are finally recombined.</p><p>In the case of RSA, before creating a new signature, one can generate a random <i>r</i> and compute m<sup>d+r</sup> (mod n) and m<sup>r</sup> (mod n) separately. From these, the final signature m<sup>d</sup> (mod n) can be computed with some extra care.</p><p>Masking is not a perfect defense. The parts where shares are created or recombined into the final value are especially vulnerable. It does make it harder for the attacker: they will need to collect more power traces to cut through the noise. In our example we used two shares, but we could bump that up even higher. There is a trade-off between power side-channel resistance and implementation cost.</p>
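<p>The RSA masking trick can be sketched in a few lines of Python. This is our own illustrative toy, not a hardened implementation: the "extra care" in the recombination step is done here with a modular inverse, and Python arithmetic is of course not leakage-free:</p>

```python
import secrets

def masked_rsa_sign(m, d, n):
    """Split the secret exponent d into two shares, d + r and r,
    exponentiate with each share separately, then recombine."""
    r = secrets.randbelow(n)              # fresh random mask per signature
    s1 = pow(m, d + r, n)                 # share 1: m^(d+r) mod n
    s2 = pow(m, r, n)                     # share 2: m^r mod n
    return (s1 * pow(s2, -1, n)) % n      # m^(d+r) * (m^r)^-1 = m^d mod n

# Toy parameters (p=61, q=53): the masked result equals the plain signature.
m, d, n = 65, 2753, 3233
assert masked_rsa_sign(m, d, n) == pow(m, d, n)
```

<p>The secret exponent d is never used directly; each run touches only the randomized shares, which is what forces an attacker to average over many traces.</p>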
    <div>
      <h3>Machine learning: extracting the key from the traces</h3>
      <a href="#machine-learning-extracting-the-key-from-the-traces">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">Machine learning</a>, of which deep learning is a part, represents the capability of a system to acquire its knowledge by extracting patterns from data —  in this case, the secrets from the power traces. Machine learning algorithms can be divided into several categories based on their learning style. The most popular machine learning algorithms in side-channel attacks follow the supervised learning approach. In supervised learning, there are two phases: 1) training, where a machine learning model is trained based on known labeled examples (e.g., side-channel measurements where we know the key) and 2) testing, where, based on the trained model and additional side-channel measurements (now, with an unknown key), the attacker guesses the secret key. A common depiction of such attacks is given in the figure below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V2bcSsbSJtCmorq8HaUzc/1545a94c79aef5d72b6acd8573cf5f07/image5-3.png" />
            
            </figure><p>While the threat model may sound counterintuitive, it is actually not difficult to imagine that the attacker will have access to (and control of) a device similar to the one being attacked.</p><p>In side-channel analysis, the attacks following those two phases (training and testing) are called profiling attacks.</p><p>Profiling attacks are not new. The first such attack, called the <a href="https://link.springer.com/chapter/10.1007/3-540-36400-5_3">template attack</a>, appeared in 2002. Diverse <a href="https://link.springer.com/article/10.1007/s13389-011-0023-x">machine learning techniques</a> have been used since around 2010, all reporting good results and the ability to break various targets. The big breakthrough came in 2016, when the side-channel community started using <a href="https://www.cloudflare.com/learning/ai/what-is-deep-learning/">deep learning</a>. It greatly increased the effectiveness of power side-channel attacks against both symmetric-key and public-key cryptography, even if the targets were protected with, for instance, masking or some other countermeasures. To be clear: it doesn’t magically figure out the key, but it gets much better at extracting the leaked bits from a smaller number of power traces.</p><p>While machine learning-based side-channel attacks are powerful, they have limitations. Carefully implemented countermeasures make the attacks more difficult to conduct. Finding a good machine learning model that can break a target can be far from trivial: this phase, commonly called tuning, can last weeks on powerful clusters.</p><p>What will the future bring for machine learning/AI in side-channel analysis? Counterintuitively, we would like to see more powerful and easy-to-use attacks. You’d think that would make us worse off, but on the contrary, it will allow us to better estimate how much information is actually leaked by a device. We also hope to better understand why certain attacks work (or not), so that more cost-effective countermeasures can be developed. As such, the future for AI in side-channel analysis is bright, especially for security evaluators, but we are still far from being able to break most of the targets in real-world applications.</p>
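<p>A toy simulation can make the two-phase profiling setup concrete. The sketch below is our own illustration, not from the paper: it simulates noisy single-sample "traces" that leak a secret bit, builds a simple template (a per-class mean) from labeled traces, and then classifies fresh traces of an unknown bit:</p>

```python
import random

random.seed(1)

def trace(bit, noise=0.4):
    """Simulated power sample: the leakage is the secret bit plus Gaussian noise."""
    return bit + random.gauss(0, noise)

# Training phase: traces from a device we control, with known bits.
labeled = [(b, trace(b)) for b in (random.randint(0, 1) for _ in range(1000))]
template = {
    v: sum(t for b, t in labeled if b == v) / sum(1 for b, _ in labeled if b == v)
    for v in (0, 1)
}

# Attack phase: average several traces of the same unknown bit and pick the
# nearest template (a miniature version of the classic template attack).
def recover(secret_bit, n_traces=16):
    avg = sum(trace(secret_bit) for _ in range(n_traces)) / n_traces
    return min((0, 1), key=lambda v: abs(avg - template[v]))

assert recover(1) == 1
assert recover(0) == 0
```

<p>Deep learning replaces the hand-built template with a neural network, which copes far better with masking, jitter, and high-dimensional traces, but the same two-phase structure remains.</p>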
    <div>
      <h2>Kyber</h2>
      <a href="#kyber">
        
      </a>
    </div>
    <p>Kyber is a <a href="/the-quantum-menace/">post-quantum</a> (PQ) key encapsulation method (KEM). After a six-year worldwide competition, the <a href="https://www.nist.gov/">National Institute of Standards and Technology</a> (NIST) <a href="/nist-post-quantum-surprise/">selected</a> Kyber as the post-quantum key agreement they will standardize. The goal of a key agreement is for two parties that haven’t talked to each other before to agree securely on a <i>shared key</i> they can use for symmetric encryption (such as <a href="/it-takes-two-to-chacha-poly/">Chacha20Poly1305</a>). As a KEM, it works slightly differently, and with different terminology, than a traditional <a href="https://developers.cloudflare.com/internet/protocols/tls#ephemeral-diffie-hellman-handshake">Diffie–Hellman</a> key agreement (such as X25519):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Dbor76rFhzvg1PhuCI0xT/9bbd95bd542a065a679b5b27e2c188cb/image9-3.png" />
            
            </figure><p>When connecting to a website, the client first generates a new <i>ephemeral</i> keypair that consists of a <i>private</i> and <i>public key</i>. It sends the public key to the server. The server then <i>encapsulates</i> a shared key with that public key: this gives it a random <i>shared key</i>, which it keeps, and a ciphertext (in which the shared key is hidden), which it returns to the client. The client can then use its private key to <i>decapsulate</i> the shared key from the ciphertext. Now the server and client can communicate with each other using the shared key.</p><p>It is particularly important to secure key agreement against quantum computers. The reason is that an attacker can store traffic today and crack the key agreement in the future, revealing the shared key and all communication encrypted with it. That is why we have already <a href="/post-quantum-for-all/">deployed support</a> for Kyber across our network.</p>
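<p>The encapsulate/decapsulate flow can be mirrored in a short sketch. The toy KEM below is our own illustration and is <i>not</i> secure and <i>not</i> Kyber (it uses plain modular exponentiation in place of Kyber's lattice arithmetic), but the interface is the same:</p>

```python
import hashlib
import secrets

# Toy KEM -- illustrative only, NOT secure and NOT Kyber. It merely mirrors
# the keygen / encapsulate / decapsulate interface described above.
P = (1 << 127) - 1   # a Mersenne prime, used as a toy group modulus
G = 3

def keygen():
    sk = secrets.randbelow(P - 2) + 1
    pk = pow(G, sk, P)
    return sk, pk

def encapsulate(pk):
    y = secrets.randbelow(P - 2) + 1
    ciphertext = pow(G, y, P)                      # the shared key is hidden in here
    shared = hashlib.sha256(str(pow(pk, y, P)).encode()).digest()
    return ciphertext, shared

def decapsulate(sk, ciphertext):
    return hashlib.sha256(str(pow(ciphertext, sk, P)).encode()).digest()

sk, pk = keygen()                 # client generates an ephemeral keypair
ct, k_server = encapsulate(pk)    # server encapsulates against the public key
k_client = decapsulate(sk, ct)    # client recovers the same shared key
assert k_client == k_server
```

<p>Note how, unlike in a classic Diffie–Hellman exchange, the server never publishes a long-lived public value of its own: it derives the shared key and a ciphertext directly from the client's public key.</p>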
    <div>
      <h2>The DNG paper</h2>
      <a href="#the-dng-paper">
        
      </a>
    </div>
    <p>With all the background under our belt, we’re ready to take a look at the <a href="https://eprint.iacr.org/2022/1713.pdf">DNG paper</a>. The authors perform a power side-channel attack on their own masked implementation of Kyber with six shares.</p>
    <div>
      <h3>Point of attack</h3>
      <a href="#point-of-attack">
        
      </a>
    </div>
    <p>They attack the <i>decapsulation</i> step. In the decapsulation step, after the shared key is extracted, it’s encapsulated again, and compared against the original ciphertext to detect tampering. For this <i>re-encryption</i> step, the precursor of the shared key (let’s call it the secret) is encoded bit by bit into a polynomial. To be precise, the 256-bit <i>secret</i> needs to be converted to a polynomial with 256 coefficients <i>modulo</i> q=3329, where the i<sup>th</sup> coefficient is (q+1)/2 if the i<sup>th</sup> bit of the secret is 1, and zero otherwise.</p><p>This function sounds simple enough, but creating a masked version is tricky. The rub is that the natural way to create shares of the <i>secret</i> is to have shares that <i>xor</i> together to be the secret, and that the natural way to share polynomials is to have shares that <i>add</i> together to get to the intended polynomial.</p><p>This is the two-shares implementation of the conversion that the DNG paper attacks:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7GjnFQyhXsLRqHJqgypOpx/c93d159488b4ba8ef043e1049eb21649/image8.png" />
            
            </figure><p>The code loops over the bits of the two shares. For each bit, it creates a mask that is <code>0xffff</code> if the bit is 1 and 0 otherwise. Then this mask is used to add (q+1)/2 to the polynomial share if appropriate. Processing a 1 will use a bit more power. It doesn’t take an AI to figure out that this will be a leaky function. In fact, this pattern was pointed out to be weak <a href="https://eprint.iacr.org/2016/923">back in 2016</a>, and explicitly mentioned to be a risk for masked Kyber <a href="https://eprint.iacr.org/2016/923">in 2020</a>. Apropos, one way to mitigate this is to process multiple bits at once — for the state of the art, tune into <a href="https://csrc.nist.gov/Projects/post-quantum-cryptography/workshops-and-timeline/pqc-seminars">April 2023’s NIST PQC seminar</a>. For the moment, let’s allow the paper its weak target.</p><p>The authors do not claim any fundamentally new attack here. Instead, they improve the effectiveness of the attack in two ways: the way they train the <a href="https://www.cloudflare.com/learning/ai/what-is-neural-network/">neural network</a>, and how to use multiple traces more effectively by changing the ciphertext sent. So, what did they achieve?</p>
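<p>For reference, the unmasked version of this encoding is only a couple of lines. The Python below is our own sketch of the function described above, not the paper's masked code; the masked variants operate on boolean shares of the secret, and that share-handling is where the leak lives:</p>

```python
Q = 3329  # the Kyber modulus

def poly_from_secret(secret_bits):
    """Map each bit of the 256-bit secret to a polynomial coefficient:
    (q+1)/2 if the bit is 1, and zero otherwise."""
    assert len(secret_bits) == 256
    return [(Q + 1) // 2 if bit else 0 for bit in secret_bits]

secret = [1, 0, 1, 1] + [0] * 252
poly = poly_from_secret(secret)
assert poly[:4] == [1665, 0, 1665, 1665]   # (3329 + 1) // 2 == 1665
```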
    <div>
      <h3>Effectiveness</h3>
      <a href="#effectiveness">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1nVauArEr9f940eJstkDu5/8c793320acb617a3ae0436908196e546/image6-7.png" />
            
            </figure><p>To test the attack, they use a <a href="https://www.newae.com/products/NAE-CW1173#:~:text=The%20ChipWhisperer%2DLite%20integrates%20hardware,all%20into%20a%20single%20board.">ChipWhisperer-Lite</a> board, which has a Cortex-M4 CPU that they downclock to 24 MHz. Power usage is sampled at 24 MHz with 10-bit precision.</p><p>To train the neural networks, 150,000 power traces are collected for decapsulation of different ciphertexts (with known shared key) for the same KEM keypair. This is already a somewhat unusual situation for a real-world attack: for key agreement, KEM keypairs are ephemeral, generated and used only once. Still, there are certainly legitimate use cases for long-term KEM keypairs, such as <a href="/kemtls-post-quantum-tls-without-signatures/">for authentication</a>, <a href="/hybrid-public-key-encryption/">HPKE</a>, and in particular <a href="/encrypted-client-hello/">ECH</a>.</p><p>The training is a key step: different devices, even from the same manufacturer, can have wildly different power traces running the same code. Even if two devices are of the same model, their power traces might still differ significantly.</p><p>The main contribution highlighted by the authors is that they train their neural networks to attack an implementation with 6 shares by starting with a neural network trained to attack an implementation with 5 shares. That one can be trained from a model to attack 4 shares, and so on. Thus, to apply their method, of these 150,000 power traces, one-fifth must be from an implementation with 6 shares, another one-fifth from one with 5 shares, et cetera. It seems unlikely that anyone will deploy a device where an attacker can switch between the number of shares used in the masking on demand.</p><p>Given these affordances, the attack proper can commence. 
The authors report that, from a single power trace of a two-share decapsulation, they could recover the shared key under these ideal circumstances with probability… 0.12%. They do not report the numbers for single trace attacks on more than two shares.</p><p>When we’re allowed multiple traces of the same decapsulation, side-channel attacks become much more effective. The second trick is a clever twist on this: instead of creating a trace of decapsulation of exactly the same message, the authors <i>rotate</i> the ciphertext to move bits of the shared key in more favorable positions. With 4 traces that are rotations of the same message, the success probability against the two-shares implementation goes up to 78%. The six-share implementation stands firm at 0.5%. When allowing 20 traces from the six-share implementation, the shared key can be recovered with an 87% chance.</p>
    <div>
      <h3>In practice</h3>
      <a href="#in-practice">
        
      </a>
    </div>
    <p>The hardware used in the demonstration might be somewhat comparable to a smart card, but it is very different from high-end devices such as smartphones, desktop computers and servers. Simple power analysis side-channel attacks on even just embedded 1 GHz processors are much more challenging, requiring <a href="https://eprint.iacr.org/2015/727.pdf">tens of thousands of traces</a> using a high-end oscilloscope connected close to the processor. There are much better avenues for attack with this kind of physical access to a server: just connect the oscilloscope to the memory bus.</p><p>Except for especially vulnerable applications, such as smart cards and HSMs, power side-channel attacks are widely considered infeasible. Although sometimes, when the planets align, an especially potent power side-channel attack can be turned into a remote timing attack due to throttling, as demonstrated by <a href="/hertzbleed-explained/">Hertzbleed</a>. To be clear: the present attack does not even come close.</p><p>And even for these vulnerable applications, such as smart cards, this attack is not particularly potent or surprising. In the field, it is not a question of whether a masked implementation leaks its secrets, because it always does. It’s a question of how hard the attack is to actually pull off. Papers such as the DNG paper contribute by helping manufacturers estimate how many countermeasures to put in place to make attacks too costly. It is not the first paper studying power side-channel attacks on Kyber, and it will not be the last.</p>
    <div>
      <h2>Wrapping up</h2>
      <a href="#wrapping-up">
        
      </a>
    </div>
    <p>AI did not completely undermine a new wave of cryptography, but instead is a helpful tool to deal with noisy data and discover the vulnerabilities within it. There is a big difference between a direct break of cryptography and a power side-channel attack. Kyber is not broken, and the presented power side-channel attack is not cause for alarm.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Post-Quantum]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">5nkxnUeSeZlhciKkqTAbah</guid>
            <dc:creator>Lejla Batina (Guest Author)</dc:creator>
            <dc:creator>Stjepan Picek (Guest Author)</dc:creator>
            <dc:creator>Bas Westerbaan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Zero Trust security with Ping Identity and Cloudflare Access]]></title>
            <link>https://blog.cloudflare.com/cloudflare-ping/</link>
            <pubDate>Tue, 14 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Access and Ping Identity offer a powerful solution for organizations looking to implement Zero Trust security controls to protect their applications and data. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In today's digital landscape, traditional perimeter based security models are no longer enough to protect sensitive data and applications. As cyber threats become increasingly sophisticated, it's essential to adopt a security approach that assumes that all access is unauthorized, rather than relying on <a href="https://www.cloudflare.com/learning/access-management/what-is-the-network-perimeter/">network perimeter-based security</a>.</p><p><a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> is a security model that requires all users and devices to be authenticated and authorized before being granted access to applications and data. This approach offers a <a href="https://www.cloudflare.com/application-services/solutions/">comprehensive security solution</a> that is particularly effective in today's distributed and cloud-based environments. In this context, Cloudflare Access and Ping Identity offer a powerful solution for organizations looking to <a href="https://www.cloudflare.com/learning/access-management/how-to-implement-zero-trust/">implement Zero Trust security controls</a> to <a href="https://www.cloudflare.com/products/zero-trust/threat-defense/">protect their applications and data</a>.</p>
    <div>
      <h3>Enforcing strong authentication and access controls</h3>
      <a href="#enforcing-strong-authentication-and-access-controls">
        
      </a>
    </div>
    <p>Web applications provide businesses with enhanced scalability, flexibility, and cost savings, but they can also create vulnerabilities that malicious actors can exploit. Ping Identity and Cloudflare Access can be used together to secure applications by enforcing strong authentication and access controls.</p><p>One of the key features of Ping Identity is its ability to provide single sign-on (SSO) capabilities, allowing users to log in once and be granted access to all applications they are authorized to use. This feature streamlines the authentication process, reducing the risk of password fatigue and making it easier for organizations to manage access to multiple applications.</p><p>Cloudflare Access, on the other hand, provides Zero Trust access to applications, ensuring that only authorized users can access sensitive information. With Cloudflare Access, policies can be easily created and managed in one place, making it easier to ensure clear and consistent policy enforcement across all applications. Policies can include specific <a href="https://developers.cloudflare.com/cloudflare-one/policies/access/mfa-requirements/">types of MFA</a>, <a href="https://developers.cloudflare.com/cloudflare-one/identity/devices/">device posture</a> and <a href="https://developers.cloudflare.com/cloudflare-one/policies/access/external-evaluation/">even custom logic</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lHZq8Le2pjp6fSePPNfbs/0f05ae20b0b8301e66dd2b005ca323b7/image2-12.png" />
            
            </figure>
    <div>
      <h3>Securing custom applications with Access and Ping</h3>
      <a href="#securing-custom-applications-with-access-and-ping">
        
      </a>
    </div>
    <p>Legacy applications pose a significant security risk to organizations, as they may contain vulnerabilities that are no longer patched or updated. However, businesses can use Cloudflare and Ping Identity to help secure legacy applications and reduce the risk of cyberattacks.</p><p>Legacy applications may not support modern authentication methods such as SAML or OIDC, which make security controls like MFA easier to enforce; without them, these applications are vulnerable to unauthorized access. By integrating Ping Identity with Cloudflare Access, businesses can enforce MFA and SSO for users accessing legacy applications. This can help ensure that only authorized users have access to sensitive data and reduce the risk of credential theft and account takeover.</p><p>For example, many organizations have legacy applications that lack modern security features like MFA or SSO. This is because direct code modifications were previously required to implement such features, and code modifications of legacy applications can be risky, difficult, or even impossible in some situations. By integrating these applications with Ping Identity and Cloudflare Access, organizations can enforce stronger security controls, making it harder for unauthorized users to gain access to sensitive information, all without requiring changes to the underlying application itself.</p>
    <div>
      <h3>Full integration support for PingOne and PingFederate customers</h3>
      <a href="#full-integration-support-for-pingone-and-pingfederate-customers">
        
      </a>
    </div>
    <p>We are excited to announce that Cloudflare is now offering <a href="https://developers.cloudflare.com/cloudflare-one/identity/idp-integration/pingone-oidc/">full integration support for PingOne</a> customers. This means that Ping Identity customers can now easily integrate their identity management solutions with Cloudflare Access to provide a comprehensive security solution for their applications.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1fqdiY66KbQ10s35cF8g6J/b79b38033c033dd434589f5c22521d96/image1-21.png" />
            
            </figure>
    <div>
      <h3>User and group synchronization via SCIM</h3>
      <a href="#user-and-group-synchronization-via-scim">
        
      </a>
    </div>
    <p>In addition to this announcement, we are also excited to share our plans to add user and group synchronization via SCIM in the near future. This will allow organizations to easily synchronize user and group data between Ping Identity and Cloudflare Access, streamlining access management and improving the overall user experience.</p><blockquote><p><i>“A cloud-native Zero Trust security model has become an absolute necessity as enterprises continue to adopt a cloud-first strategy. Cloudflare and Ping Identity have robust product integrations in place to help security and IT leaders prevent attacks proactively and increase alignment with zero trust best practices.”</i><i>– </i><b><i>Loren Russon</i></b><i>, SVP of Product &amp; Technology, Ping Identity</i></p></blockquote>
    <div>
      <h3>A powerful solution for Zero Trust security controls</h3>
      <a href="#a-powerful-solution-for-zero-trust-security-controls">
        
      </a>
    </div>
    <p>We believe that these integrations will provide a powerful solution for organizations looking to implement Zero Trust security controls to protect their applications and data. By combining Ping Identity's identity management capabilities with Cloudflare Access's Zero Trust access controls and MFA capabilities, organizations can ensure that only authorized users are granted access to sensitive information. This approach provides a comprehensive security solution that is particularly effective in today's distributed and cloud-based environments.</p><p>We look forward to continuing to improve our integration capabilities with Ping Identity and other <a href="https://www.cloudflare.com/learning/access-management/what-is-identity-and-access-management/">identity management solutions</a>, to provide organizations with the best possible security solution for their applications and data.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">2xH5c704HEL0jq6TT2Kqvw</guid>
            <dc:creator>Deeksha Lamba</dc:creator>
            <dc:creator>Kenny Johnson</dc:creator>
            <dc:creator>Peter Holko (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Adding Zero Trust signals to Sumo Logic for better security insights]]></title>
            <link>https://blog.cloudflare.com/zero-trust-signals-to-sumo-logic/</link>
            <pubDate>Tue, 14 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ The Cloudflare App for Sumo Logic now supports Zero Trust logs for out of the box, ready-made security dashboards ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3hVaAd6pJq09fsIs5GU53e/9b7d0ab2abdc7dcb528072566797c9ab/Dashboard_-Quick-Navigation-1.png" />
            
            </figure><p>A picture is worth a thousand words and the same is true when it comes to getting visualizations, trends, and data in the form of a ready-made security dashboard.</p><p>Today we’re excited to announce the expansion of support for automated normalization and correlation of Zero Trust logs for <a href="https://developers.cloudflare.com/logs/about/">Logpush</a> in Sumo Logic’s <a href="https://www.sumologic.com/solutions/cloud-siem-enterprise/">Cloud SIEM</a>. As a Cloudflare <a href="https://www.cloudflare.com/partners/technology-partners/sumo-logic/">technology partner</a>, Sumo Logic is the pioneer in continuous intelligence, a new category of software which enables organizations of all sizes to address the data challenges and opportunities presented by digital transformation, modern applications, and cloud computing.</p><p>The updated content in Sumo Logic Cloud SIEM helps joint Cloudflare customers reduce alert fatigue tied to Zero Trust logs and accelerates the triage process for security analysts by converging security and network data into high-fidelity insights. This new functionality complements the existing <a href="https://www.sumologic.com/application/cloudflare/">Cloudflare App for Sumo Logic</a> designed to help IT and security teams gain insights, understand anomalous activity, and track security and network performance trends over time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4v27Z4MHNSnxP9M0Rs8F75/f0ebc2440edf9807d87a0081777f9463/Cloud-SIEM-HUD.png" />
            
            </figure>
    <div>
      <h3>Deeper integration to deliver Zero Trust insights</h3>
      <a href="#deeper-integration-to-deliver-zero-trust-insights">
        
      </a>
    </div>
    <p>Using Cloudflare Zero Trust helps protect users, devices, and data, and in the process can create a large volume of logs. These logs are helpful and important because they provide the who, what, when, and where for activity happening within and across an organization. They contain <a href="https://www.cloudflare.com/learning/security/what-is-siem/">information</a> such as what website was accessed, who signed in to an application, or what data may have been shared from a SaaS service.</p><p>Until now, our integrations with Sumo Logic only supported automated correlation of security signals for Cloudflare’s core services. While it’s critical to ensure collection of <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF</a> and bot detection events across your fabric, extended visibility into <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust components</a> has become more important than ever with the explosion of distributed work and the adoption of hybrid and multi-cloud infrastructure architectures.</p><p>With the expanded Zero Trust logs now available in Sumo Logic Cloud SIEM, customers get deeper context into security insights thanks to the broad set of network and security logs produced by Cloudflare products:</p><ul><li><p><a href="https://www.cloudflare.com/products/zero-trust/gateway/">Cloudflare Gateway</a> (Network, DNS, HTTP)</p></li><li><p><a href="https://www.cloudflare.com/products/zero-trust/browser-isolation/">Cloudflare Remote Browser Isolation</a> (included with Gateway logs)</p></li><li><p><a href="https://www.cloudflare.com/products/zero-trust/dlp/">Cloudflare Data Loss Prevention</a> (included with Gateway logs)</p></li><li><p><a href="https://www.cloudflare.com/products/zero-trust/access/">Cloudflare Access</a> (Access audit logs)</p></li><li><p><a href="https://www.cloudflare.com/products/zero-trust/casb/">Cloudflare Cloud Access Security Broker</a> (Findings logs)</p></li></ul><blockquote><p>“As a long-time Cloudflare partner, we’ve worked together to help joint customers analyze events and trends from their websites and applications to provide end-to-end visibility and improve digital experiences. We’re excited to expand this partnership to provide real-time insights into the Zero Trust security posture of mutual customers in Sumo Logic’s Cloud SIEM.” - <b>John Coyle</b>, Vice President of Business Development, Sumo Logic</p></blockquote>
    <div>
      <h3>How to get started</h3>
      <a href="#how-to-get-started">
        
      </a>
    </div>
    <p>To take advantage of the suite of integrations for Sumo Logic and the Cloudflare logs available via Logpush, first enable Logpush to Sumo Logic, which will ship logs directly to Sumo Logic’s cloud-native platform. Then, install the Cloudflare App and (for Cloud SIEM customers) enable forwarding of these logs to Cloud SIEM for automated normalization and correlation of security insights.</p><p>Note that Cloudflare’s Logpush service is only available to Enterprise customers. If you are interested in upgrading, <a href="https://www.cloudflare.com/lp/cio-week-2023-cloudflare-one-contact-us/">please contact us here</a>.</p><ol><li><p><a href="https://developers.cloudflare.com/logs/get-started/enable-destinations/sumo-logic/"><b>Enable Logpush to Sumo Logic</b></a>: Cloudflare Logpush supports pushing logs directly to Sumo Logic via the Cloudflare dashboard or via API.</p></li><li><p><b>Install the </b><a href="https://www.sumologic.com/application/cloudflare/"><b>Cloudflare App for Sumo Logic</b></a>: Locate and install the Cloudflare app from the App Catalog, linked above. If you want to see a preview of the dashboards included with the app before installing, click Preview Dashboards. Once installed, you can view key information in the Cloudflare Dashboards for all core services.</p></li><li><p><b>(Cloud SIEM Customers) Forward logs to Cloud SIEM</b>: After the steps above, enable the updated parser for Cloudflare logs by <a href="https://help.sumologic.com/docs/integrations/saas-cloud/cloudflare/#prerequisites">adding the _parser field</a> to your S3 source created when installing the Cloudflare App.</p></li></ol>
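    <p>For the API route, the shape of a Logpush job is sketched below. This is an illustrative TypeScript fragment, not code from Cloudflare or Sumo Logic: the dataset name, collector URL, and helper function are placeholders, and the exact job fields should be checked against the Logpush API documentation.</p>

```typescript
// Hypothetical sketch of the payload for creating a Logpush job that
// ships a Zero Trust dataset to a Sumo Logic hosted collector.
interface LogpushJob {
  name: string;
  dataset: string;
  destination_conf: string;
  enabled: boolean;
}

function buildSumoLogpushJob(dataset: string, collectorUrl: string): LogpushJob {
  return {
    name: `${dataset}-to-sumo`,
    dataset,
    // Sumo Logic destinations use the sumo:// scheme in destination_conf
    destination_conf: `sumo://${collectorUrl}`,
    enabled: true,
  };
}

// The job would then be POSTed to the account-scoped endpoint, e.g.
// https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/logpush/jobs
const job = buildSumoLogpushJob(
  "gateway_http", // one of the Zero Trust datasets listed above
  "endpoint.collection.sumologic.com/receiver/v1/http/<TOKEN>"
);
```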
    <div>
      <h3>What’s next</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>As more organizations move towards a Zero Trust model for security, it's increasingly important to have visibility into every aspect of the network with logs playing a crucial role in this effort.</p><p>If your organization is just getting started and not already using a tool like Sumo Logic, <a href="/announcing-logs-engine/">Cloudflare R2 for log storage</a> is worth considering. <a href="https://www.cloudflare.com/developer-platform/r2/">Cloudflare R2</a> offers a scalable, cost-effective solution for log storage.</p><p>We’re excited to continue closely working with technology partners to expand existing and create new integrations that help customers on their Zero Trust journey.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Logs]]></category>
            <category><![CDATA[Dashboard]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Sumo Logic]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">1LqRHHF3LAkWnMU7gmt639</guid>
            <dc:creator>Corey Mahan</dc:creator>
            <dc:creator>Drew Horn (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Twilio Segment Edge SDK powered by Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/twilio-segment-sdk-powered-by-cloudflare-workers/</link>
            <pubDate>Fri, 18 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ With Segment Edge SDK, built on Cloudflare Workers, developers can collect high-quality first-party data and use Segment Edge SDK to access real-time user profiles and state, to deliver personalized  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The Cloudflare team was so excited to hear how Twilio Segment solved problems they encountered with tracking first-party data and personalization using Cloudflare Workers. We are happy to have guest bloggers Pooya Jaferian and Tasha Alfano from Twilio Segment to share their story.</p>
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>Twilio Segment is a customer data platform that collects, transforms, and activates first-party customer data. Segment helps developers collect user interactions within an application, form a unified customer record, and sync it to hundreds of different marketing, product, analytics, and data warehouse integrations.</p><p>There are two “unsolved” problems with app instrumentation today:</p><p><b>Problem #1:</b> Many important events that you want to track happen on the “wild-west” of the client, but collecting those events via the client can lead to <i>low data quality,</i> as events are dropped due to user configurations, browser limitations, and network connectivity issues.</p><p><b>Problem #2:</b> Applications need access to real-time (&lt;50ms) user state to personalize the application experience based on advanced computations and <a href="https://www.cloudflare.com/learning/access-management/what-is-network-segmentation/">segmentation logic</a> that <i>must</i> be executed on the cloud.</p><p>The Segment Edge SDK – built on Cloudflare Workers – solves for both. With Segment Edge SDK, developers can collect high-quality first-party data. Developers can also use Segment Edge SDK to access real-time user profiles and state, to deliver personalized app experiences without managing a ton of infrastructure.</p><p>This post goes deep on how and why we built the Segment Edge SDK. We chose the Cloudflare Workers platform as the runtime for our SDK for a few reasons. First, we needed a scalable platform to collect billions of events per day. Workers run with no cold start, which made them the right choice. Second, our SDK needed a fast storage solution, and Workers KV fit our needs perfectly. Third, we wanted our SDK to be easy to use and deploy, and Workers’ ease and speed of deployment was a great fit.</p><p>It is important to note that the Segment Edge SDK is in early development stages, and any features mentioned are subject to change.</p>
    <div>
      <h2>Serving a JavaScript library 700M+ times per day</h2>
      <a href="#serving-a-javascript-library-700m-times-per-day">
        
      </a>
    </div>
    <p>analytics.js is our core JavaScript UI SDK that allows web developers to send data to any tool without having to learn, test, or use a new API every time.</p><p>Figure 1 illustrates how Segment can be used to collect data on a web application. Developers add Segment’s web SDK, <a href="https://segment.com/docs/connections/sources/catalog/libraries/website/javascript/"><b>analytics.js</b></a>, to their websites by <a href="https://segment.com/docs/connections/sources/catalog/libraries/website/javascript/quickstart/#step-2-add-the-segment-snippet">including a JavaScript snippet</a> in the <code>HEAD</code> of their web pages. The snippet can immediately collect and buffer events while it also loads the full library asynchronously from the Segment CDN. Developers can then use analytics.js to identify visitors, e.g., <code>analytics.identify('john')</code>, and track user behavior, e.g., <code>analytics.track('Order Completed')</code>. Calling analytics.js methods such as <i>identify</i> or <i>track</i> sends data to Segment’s API (<code>api.segment.io</code>). Segment’s platform can then deliver the events to different tools, as well as create a profile for the user (e.g., build a profile for user “john”, associate the “Order Completed” event with it, and add all future activities of john to the profile).</p><p>Analytics.js also stores state in the browser as first-party cookies (e.g., storing an <code>ajs_user_id</code> cookie with the value of john, with the cookie scoped to the example.com domain) so that when the user visits the website again, the user identifier stored in the cookie can be used to recognize the user.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3cFwIwnC5QZKPZ0617hgDb/91afc6d5094f5d60288778ddfb3c7e97/image2-49.png" />
            
            </figure><p>Figure 1- How analytics.js loads on a website and tracks events</p><p>While analytics.js only tracks first-party data (i.e., the data is collected and used by the website that the user is visiting), certain browser controls incorrectly identify analytics.js as a third-party tracker, because the SDK is loaded from a third-party domain (cdn.segment.com) and the data is going to a third-party domain (api.segment.io). Furthermore, despite using first-party cookies to store user identity, some browsers such as Safari have limited the TTL for non-HTTPOnly cookies to seven days, making it challenging to maintain state for long periods of time.</p><p>To overcome these limitations, we have built a Segment Edge SDK (currently in early development) that can automatically add Segment’s library to a web application, eliminate the use of third-party domains, and maintain user identity using HTTPOnly cookies. In the process of solving the first-party data problem, we realized that the Edge SDK is best positioned to act as a personalization library, given it has access to the user identity on every request (in the form of cookies), and it can resolve such identity to a full user profile stored in Segment. The user profile information can be used to deliver personalized content to users directly from the Cloudflare Workers platform.</p><p>The remaining portions of this post will cover how we solved the above problems. We first explain how the Edge SDK helps with first-party collection. Then we cover how the Segment profiles database becomes available on the Cloudflare Workers platform, and how to use such data to drive personalization.</p>
    <div>
      <h2>Segment Edge SDK and first-party data collection</h2>
      <a href="#segment-edge-sdk-and-first-party-data-collection">
        
      </a>
    </div>
    <p>Developers can set up the Edge SDK by creating a Cloudflare Worker sitting in front of their web application (via <a href="https://developers.cloudflare.com/workers/platform/routing/routes">Routes</a>) and importing the Edge SDK via npm. The Edge SDK handles requests and automatically injects the analytics.js snippet into every webpage. It also configures first-party endpoints for downloading the SDK assets and sending tracking data. The Edge SDK also captures user identity by inspecting Segment events and instructs the browser to store that identity as HTTPOnly cookies.</p>
            <pre><code>import { Segment } from "@segment/edge-sdk-cloudflare";

export default {
   async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
       const segment = new Segment(env.SEGMENT_WRITE_KEY); 

       const resp = await segment.handleEvent(request, env);

       return resp;
   }
};</code></pre>
            
    <div>
      <h2>How the Edge SDK works under the hood to enable first-party data collection</h2>
      <a href="#how-the-edge-sdk-works-under-the-hood-to-enable-first-party-data-collection">
        
      </a>
    </div>
    <p>The Edge SDK's internal router checks the inbound request URL against predefined patterns. If the URL matches a route, the router runs the route's chain of handlers to process the request, fetch the origin, or modify the response.</p>
            <pre><code>export interface HandlerFunction {
 (
   request: Request,
   response: Response | undefined,
   context: RouterContext
 ): Promise&lt;[Request, Response | undefined, RouterContext]&gt;;
}</code></pre>
            <p>Figure 2 demonstrates the routing of incoming requests. The Worker calls  <code>segment.handleEvent</code> method with the request object (step 1), then the router matches the <code>request.url</code> and <code>request.method</code> against a set of predefined routes:</p><ul><li><p>GET requests with <code>/seg/assets/*</code> path are proxied to Segment CDN (step 2a)</p></li><li><p>POST requests with <code>/seg/events/*</code> path are proxied to Segment tracking API (step 2b)</p></li><li><p>Other requests are proxied to the origin (step 2c) and the HTML responses are enriched with the analytics.js snippet (step 3)</p></li></ul><p>Regardless of the route, the router eventually returns a response to the browser (step 4) containing data from the origin, the response from Segment tracking API, or analytics.js assets. When Edge SDK detects the user identity in an incoming request (more on that later), it sets an HTTPOnly cookie in the response headers to persist the user identity in the browser.</p>
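    <p>The dispatch described above can be approximated with a few prefix checks. This is our own illustrative sketch, not the SDK’s internal router; the route labels are made up for the example:</p>

```typescript
// Illustrative route matching for the three cases in Figure 2.
type Route = "segmentCDN" | "segmentAPI" | "origin";

function matchRoute(method: string, pathname: string): Route {
  if (method === "GET" && pathname.startsWith("/seg/assets/")) {
    return "segmentCDN"; // proxy to Segment CDN (step 2a)
  }
  if (method === "POST" && pathname.startsWith("/seg/events/")) {
    return "segmentAPI"; // proxy to Segment tracking API (step 2b)
  }
  return "origin"; // fetch the origin, then enrich the HTML (steps 2c and 3)
}
```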
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7cKXch0aUGqaIrvETjhR3K/3d9ae7aea0999ed55a00c30bb65fa03b/image1-65.png" />
            
            </figure><p>Figure 2- Edge SDK router flow‌‌</p><p>In the subsequent three sections, we explain how we inject analytics.js, proxy Segment endpoints, and set server-side cookies.</p>
    <div>
      <h3>Injecting Segment SDK on requests to origin</h3>
      <a href="#injecting-segment-sdk-on-requests-to-origin">
        
      </a>
    </div>
    <p>For all the incoming requests routed to the origin, the Edge SDK fetches the HTML page and then adds the analytics.js snippet to the <code>&lt;HEAD&gt;</code> tag, embeds the <a href="https://segment.com/docs/getting-started/02-simple-install/#find-your-write-key">write key</a>, and configures the snippet to download subsequent JavaScript bundles from the first-party domain <code>([first-party host]/seg/assets/*)</code> and to send data to the first-party domain as well <code>([first-party host]/seg/events/*)</code>. This is accomplished using the <a href="https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/">HTMLRewriter</a> API.</p>
            <pre><code>import snippet from "@segment/snippet"; // Existing Segment package that generates snippet

class ElementHandler {
   constructor(private host: string, private writeKey: string) {}

   element(element: Element) {
     // generate Segment snippet and configure it with first-party host info
     const snip = snippet.min({
         host: `${this.host}/seg`,
         apiKey: this.writeKey,
       })
     element.append(`&lt;script&gt;${snip}&lt;/script&gt;`, { html: true });
   }
 }
  
export const enrichWithAJS: HandlerFunction = async (
   request,
   response,
   context
 ) =&gt; {
   const {
     settings: { writeKey },
   } = context;
   const host = request.headers.get("host") || "";
    return [
     request,
     new HTMLRewriter().on("head",
         new ElementHandler(host, writeKey))
       .transform(response),
     context,
   ];
 };</code></pre>
            
    <div>
      <h3>Proxy SDK bundles and Segment API</h3>
      <a href="#proxy-sdk-bundles-and-segment-api">
        
      </a>
    </div>
    <p>The Edge SDK proxies the Segment CDN and API under the first-party domain. For example, when the browser loads a page with the injected analytics.js snippet, the snippet loads the full analytics.js bundle from <code>https://example.com/seg/assets/sdk.js</code>, and the Edge SDK will proxy that request to the Segment CDN:</p>
            <pre><code>https://cdn.segment.com/analytics.js/v1/&lt;WRITEKEY&gt;/analytics.min.js</code></pre>
            
            <pre><code>export const proxyAnalyticsJS: HandlerFunction = async (request, response, ctx) =&gt; {
 const url = `https://cdn.segment.com/analytics.js/v1/${ctx.params.writeKey}/analytics.min.js`;
 const resp = await fetch(url);
 return [request, resp, ctx];
};</code></pre>
            <p>Similarly, analytics.js collects events and sends them via a POST request to <code>https://example.com/seg/events/[method]</code> and the Edge SDK will proxy such requests to the Segment tracking API:</p>
            <pre><code>https://api.segment.io/v1/[method]</code></pre>
            
            <pre><code>export const handleAPI: HandlerFunction = async (request, response, context) =&gt; {
 const url = new URL(request.url);
 const parts = url.pathname.split("/");
 const method = parts.pop();
 let body: { [key: string]: any } = await request.json();

 const init = {
   method: "POST",
   headers: request.headers,
   body: JSON.stringify(body),
 };

 const resp = await fetch(`https://api.segment.io/v1/${method}`, init);

 return [request, resp, context];
};</code></pre>
            
    <div>
      <h3>First party server-side cookies</h3>
      <a href="#first-party-server-side-cookies">
        
      </a>
    </div>
    <p>The Edge SDK also rewrites existing client-side analytics.js cookies as HTTPOnly cookies. When the Edge SDK intercepts an <code>identify</code> event, e.g., <code>analytics.identify('john')</code>, it extracts the user identity (“john”) and then sets a server-side cookie when sending a response back to the user. Therefore, any subsequent request to the Edge SDK can be associated with “john” using request cookies.</p>
            <pre><code>export const enrichResponseWithIdCookies: HandlerFunction = async (
 request, response, context) =&gt; {


 const host = request.headers.get("host") || "";
 const body = await request.json();
 const userId = body.userId;

 […]

  const headers = new Headers(response.headers);
  const idCookie = cookie.stringify("ajs_user_id", userId, {
    httponly: true,
    path: "/",
    maxage: 31536000,
    domain: host,
  });
  headers.append("Set-Cookie", idCookie);
 
 const newResponse = new Response(response.body, {
   ...response,
   headers,
 });

  return [request, newResponse, context];
};</code></pre>
            <p>Intercepting the <code>ajs_user_id</code> on the Workers, and using the cookie identifier to associate each request to a user, is quite powerful, and it opens the door for delivering personalized content to users. The next section covers how Edge SDK can drive personalization.</p>
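    <p>As a rough illustration, recovering that identity from a request boils down to reading the <code>ajs_user_id</code> cookie; the helper below is our own sketch, not the SDK’s implementation:</p>

```typescript
// Parse the ajs_user_id value out of a request's Cookie header.
function getUserIdFromCookie(cookieHeader: string | null): string | null {
  if (!cookieHeader) return null;
  for (const pair of cookieHeader.split(";")) {
    const [name, ...rest] = pair.trim().split("=");
    if (name === "ajs_user_id") return decodeURIComponent(rest.join("="));
  }
  return null; // anonymous visitor: no identity cookie set yet
}
```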
    <div>
      <h2>Personalization on the Supercloud</h2>
      <a href="#personalization-on-the-supercloud">
        
      </a>
    </div>
    <p>The Edge SDK offers a <code>registerVariation</code> method that can customize how a request to a given route should be fetched from the origin. For example, let's assume we have three versions of a landing page in the origin: <code>/red</code>, <code>/green</code>, and  <code>/</code> (default), and we want to deliver one of the three versions based on the visitor traits. We can use Edge SDK as follows:</p>
            <pre><code>   const segment = new Segment(env.SEGMENT_WRITE_KEY); 
   segment.registerVariation("/", (profile) =&gt; {
     if (profile.red_group) {
       return "/red"
     } else if (profile.green_group) {
       return "/green"
     }
   });

   const resp = await segment.handleEvent(request, env);

   return resp</code></pre>
            <p>The <code>registerVariation</code> method accepts two inputs: the path that displays the personalized content, and a decision function that should return the origin address for the personalized content. The decision function receives the visitor’s profile object from Segment. In the example, when users visit <code>example.com/</code> (the root path), personalized content is delivered by checking whether the visitor has a <code>red_group</code> or <code>green_group</code> trait and then requesting the content from either the <code>/red</code> or <code>/green</code> path at the origin.</p><p>We already explained that the Edge SDK knows the identity of the user via the <code>ajs_user_id</code> cookie, but we haven’t covered how the Edge SDK has access to the full profile object. The next section explains how the full profile becomes available on the Cloudflare Workers platform.</p>
    <div>
      <h3>How does personalization work under the hood?</h3>
      <a href="#how-does-personalization-work-under-the-hood">
        
      </a>
    </div>
    <p>The Personalization feature of the Edge SDK requires storage of profiles on the Cloudflare Workers platform. A <a href="https://developers.cloudflare.com/workers/runtime-apis/kv/">Workers KV</a> namespace should be created for the Worker running the Edge SDK and passed to the Edge SDK during initialization. The Edge SDK stores profiles in KV, where the key is the <code>ajs_user_id</code> value and the value is the serialized profile object. To move <a href="https://segment.com/docs/profiles/">Profiles</a> data from Segment to the KV, the SDK uses two methods:</p><ul><li><p><i>Profiles data push from Segment to the Cloudflare Workers platform:</i> The Segment product <a href="https://segment.com/docs/engage/using-engage-data/">can sync its user profiles database with different tools</a>, including pushing the data to a webhook. The Edge SDK automatically exposes a webhook endpoint under the first-party domain (e.g., <code>example.com/seg/profiles-webhook</code>) that Segment can call periodically to sync user profiles. The webhook handler receives incoming sync calls from Segment, and writes profiles to the KV.</p></li><li><p><i>Pulling data from Segment by the Edge SDK:</i> If the Edge SDK queries the KV for a user ID and doesn’t find the profile (i.e., data hasn’t synced yet), it requests the user profile from the <a href="https://segment.com/docs/profiles/profile-api/">Segment API</a> and stores it in the KV.</p></li></ul><p>Figure 3 demonstrates how the personalization flow works. In <i>step 1</i>, the user requests content for the root path ( / ), and the Worker sends the request to the Edge SDK <i>(step 2)</i>. The Edge SDK router determines that a variation is registered on the route; it therefore extracts the <code>ajs_user_id</code> from the request cookies and goes through full profile extraction <i>(step 3)</i>. The SDK first checks the KV for a record keyed by the <code>ajs_user_id</code> value and, if not found, queries the Segment API to fetch the profile and stores it in the KV. The profile is then passed into the decision function to decide which path should be served to the user <i>(step 4)</i>. The router finally fetches the variation from the origin <i>(step 5)</i> and returns the response under the <code>/</code> path to the browser <i>(step 6)</i>.</p>
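    <p>The two sync methods combine into a simple read-through lookup. The sketch below is illustrative only; the store interface and function names are ours, assuming a KV-like get/put API and a stand-in for the Segment Profile API call:</p>

```typescript
// Minimal read-through cache over a KV-like store, keyed by ajs_user_id.
interface ProfileStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

type Profile = Record<string, unknown>;

async function getProfile(
  store: ProfileStore,
  userId: string,
  fetchFromSegment: (id: string) => Promise<Profile>
): Promise<Profile> {
  // 1. Check the KV namespace for an already-synced profile
  const cached = await store.get(userId);
  if (cached !== null) return JSON.parse(cached) as Profile;

  // 2. Cache miss: pull the profile from Segment's Profile API
  const profile = await fetchFromSegment(userId);

  // 3. Store it in KV so subsequent requests hit the cache
  await store.put(userId, JSON.stringify(profile));
  return profile;
}
```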
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1KzFSDxZVXQ1P7YN7IeyKC/a56a6d820e939601ff6ca4d6385308c4/image4-28.png" />
            
            </figure><p>Figure 3- Personalization flow</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this post we covered how the Cloudflare Workers platform can help with tracking first-party data and personalization. We also explained how we built a Segment Edge SDK to enable Segment customers to get those benefits out of the box, without having to create their own DIY solution. The Segment Edge SDK is currently in early development, and we are planning to launch a private pilot and open-source it in the near future.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">6xXo9Rwt825JWe0JE45bbO</guid>
            <dc:creator>Pooya Jaferian (Guest Blogger)</dc:creator>
            <dc:creator>Tasha Alfano (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Xata Workers: client-side database access without client-side secrets]]></title>
            <link>https://blog.cloudflare.com/xata-customer-story/</link>
            <pubDate>Wed, 16 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ Xata is on a mission to solve the industry’s hardest data problems with their Serverless Data Platform. We’re excited to have Xata building their serverless functions product – Xata Workers – on top of Workers for Platforms. Here is Xata’s story. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Er0jWZdhwSZumjJxjt1yW/bfc90b6c901f250dfa64557f5d5ad6c1/Customer-Story_-Xata_light.png" />
            
            </figure><p>We’re excited to have Xata building their serverless functions product – Xata Workers – on top of <a href="https://developers.cloudflare.com/cloudflare-for-platforms/workers-for-platforms/">Workers for Platforms</a>. Xata Workers act as middleware to simplify database access and allow their developers to deploy functions that sit in front of their databases. Workers for Platforms opens up a whole suite of use cases for Xata developers, all while providing the security, scalability, and performance of Cloudflare Workers.</p><p>Now, handing it over to Alexis, a Senior Software Engineer at Xata, to tell us more.</p>
    <div>
      <h3>Introduction</h3>
      <a href="#introduction">
        
      </a>
    </div>
    <p>In the last few years, there's been the rise of the Jamstack and of new ways of thinking about the cloud, sometimes called serverless or edge computing. Instead of maintaining dedicated servers to run a single service, these architectures split applications into smaller services or functions.</p><p>By simplifying the state and context of our applications, we can benefit from external providers deploying these functions on dozens of servers across the globe. This architecture benefits the developer and user experience alike. Developers don’t have to manage servers, and users experience lower latency. Your application simply scales, even if you receive hundreds of thousands of unexpected visitors.</p><p>When it comes to databases though, we still struggle with the complexity of orchestrating replication, failover and high availability. Traditional databases are difficult to scale horizontally and usually require a lot of infrastructure maintenance or learning complex database optimization concepts.</p><p>At Xata we are building a modern data platform designed for scalable applications. It allows you to focus on your application logic, instead of having to worry about how the data is stored.</p>
    <div>
      <h3>Making databases accessible to everyone</h3>
      <a href="#making-databases-accessible-to-everyone">
        
      </a>
    </div>
    <p>We started Xata with the mission of helping developers build their applications and forget about maintaining the underlying infrastructure.</p><p>With that mission in mind, we asked ourselves: how can we make databases accessible to everyone? How can we provide a delightful experience for a frontend engineer or designer using a database?</p><p>To begin with, we built an appealing web dashboard that any developer — no matter their experience — can be comfortable using to work with their data, whether they’re defining or refining their schema, viewing or adding records, fine-tuning search results with query boosters, or getting code snippets to speed up development with our SDK. We believe that only the best user experience can provide the best development experience.</p><p>We identified a major challenge amongst our user base early on: many front-end developers want to access their database from client-side code.</p><p>Allowing access to a database from client-side code comes with several security risks if not done properly. For example, someone could inspect the code, find the credentials and, if they weren’t scoped to a single operation, potentially query or modify other parts of the database. Unfortunately, this is a common cause of data leaks and security breaches.</p><p>It was a hard problem to solve, and after plenty of brainstorming, we agreed on two potential ways forward: implementing row-level access rules, or writing API routes that talked to the database from server code.</p><p>Row-level access rules are a good way to solve this, but they would have required us to define our own opinionated language, and it would have been hard for our users to migrate away once they outgrew that solution. Instead, we preferred to focus on making serverless functions easier for our users.</p><p>Typically, serverless functions require you either to adopt a full-stack framework or to write, compile, deploy, and wire them up manually; just choosing the right approach adds cognitive overhead. We wanted to simplify accessing the database from the frontend without sacrificing flexibility for developers. This is why we decided to build Xata Workers.</p>
    <div>
      <h3>Xata Workers</h3>
      <a href="#xata-workers">
        
      </a>
    </div>
    <p>A Xata Worker is a function that a developer can write in JavaScript or TypeScript as client-side code. Rather than being executed client-side, it will actually be executed on Cloudflare’s global network.</p><p>You can think of a Xata Worker as a <code>getServerSideProps</code> function in Next.js or a <code>loader</code> function in Remix. You write your server logic in a function and our tooling takes care of deploying and running it server-side for you (yes, it’s that easy).</p><p>The main difference with other alternatives is that Xata Workers are, by design, agnostic to a framework or hosting provider. You can use them to build any kind of application or website, and even upload it as static HTML in a service like GitHub Pages or S3. You don't need a full stack web framework to use Xata Workers.</p><p>With our command line tool, we handle the build and deployment process. When the function is invoked, the Xata Worker actually makes a request to a serverless function over the network.</p>
            <pre><code>import { useQuery } from '@tanstack/react-query';
import { xataWorker } from '~/xata';

const listProducts = xataWorker('listProducts', async ({ xata }) =&gt; {
  return await xata.db.products.sort('popularity', 'desc').getMany();
});

export const Home = () =&gt; {
  const { data = [] } = useQuery(['products'], listProducts);

  return (
    &lt;Grid&gt;
      {data.map((product) =&gt; (
        &lt;ProductCard key={product.id} product={product} /&gt;
      ))}
    &lt;/Grid&gt;
  );
};</code></pre>
            <p>In the code snippet above, you can see a React component that connects to an e-commerce database with products on sale. Inside the UI component, a popular client-side data fetching library retrieves data from the serverless function and renders a card component for each product in a grid.</p><p>As you can see, a Xata Worker is a function that wraps any user-defined code and receives an instance of our SDK as a parameter. This instance has access to the database and, given that the code no longer runs in the browser, your secrets are not exposed to malicious usage.</p><p>When using a Xata Worker in TypeScript, our command line tool also generates custom types based on your schema. These types offer type safety for your queries and mutations, and improve your developer experience by adding extra IntelliSense to your IDE.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/unkthtmJ4gjYyXKpnabLY/cbf58292a45bd344e045d4969c937f35/Untitled.png" />
            
            </figure><p>A Xata Worker, like any other function, can receive additional parameters that pass application state, context or user tokens. The code you write in the function can either return an object that will be serialized over the network with a superset of JSON to support dates and other non-primitive data types, or a full response with a custom status code and headers.</p><p>Developers can write all their logic, including their own authentication and authorization. Unlike complex row level access control rules, you can easily express your logic without constraints and even unit test it with the rest of your code.</p>
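            <p>The post doesn’t spell out the wire format; as a rough sketch of the idea (the <code>$date:</code> tag here is an illustrative assumption, not Xata’s actual encoding), a JSON replacer/reviver pair is one way such a superset of JSON can round-trip <code>Date</code> values over the network:</p>

```typescript
// Illustrative sketch of JSON-with-dates serialization; the "$date:"
// tag is an assumption, not Xata's actual wire format.
const DATE_TAG = "$date:";

function serialize(value: unknown): string {
  return JSON.stringify(value, function (this: any, key: string, v: unknown) {
    // `this[key]` is the pre-toJSON value, so Dates are still Dates here
    const raw = this[key];
    return raw instanceof Date ? DATE_TAG + raw.toISOString() : v;
  });
}

function deserialize(text: string): unknown {
  return JSON.parse(text, (_key: string, v: unknown) =>
    // note: a plain string that happens to start with the tag would
    // also be revived, so a real format needs an escaping rule
    typeof v === "string" && v.startsWith(DATE_TAG)
      ? new Date(v.slice(DATE_TAG.length))
      : v
  );
}
```

            <p>A full <code>Response</code> with custom status code and headers, the other return shape mentioned above, would bypass this encoding entirely.</p>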
    <div>
      <h3>How we use Cloudflare</h3>
      <a href="#how-we-use-cloudflare">
        
      </a>
    </div>
    <p>We are happy to join the <a href="/welcome-to-the-supercloud-and-developer-week-2022/">Supercloud</a> movement. Cloudflare has an excellent track record, and we are using <a href="https://developers.cloudflare.com/cloudflare-for-platforms/workers-for-platforms/">Cloudflare Workers for Platforms</a> to host our serverless functions. By using Workers’ isolated execution contexts, we reduce the security risks of running untrusted code on our own infrastructure while staying close to our users, resulting in very low latency.</p><p>All of this works without us deploying extra infrastructure to handle our users’ application load, or asking them to configure how the serverless functions should be deployed. It really feels like magic! Now, let's dive into the internals of how we use Cloudflare to run our Xata Workers.</p><p>For every workspace in Xata we create a Worker Namespace, a Workers for Platforms concept for organizing Workers and restricting the routing between them. We use Namespaces to group and encapsulate the functions coming from all the databases built by a client or team.</p><p>When a user deploys a Xata Worker, we create a new Worker Script and upload the compiled code to its Namespace. Each script has a unique name with a compilation identifier, so a user can have multiple versions of the same function running at the same time.</p><p>During compilation, we inject the database connection details and the database name. This way, the function can connect to the database without leaking secrets, while restricting its scope of access to the database, all of it transparent to the developer.</p><p>When the client-side application runs a Xata Worker, it calls a Dispatcher function that processes the request and calls the correct Worker Script. The Dispatcher function is also responsible for configuring the CORS headers and the log drain that developers can use to debug their functions.</p>
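    <p>The Dispatcher itself isn’t shown in this post; a minimal sketch of what such a function might look like on Workers for Platforms follows (the <code>DISPATCH_NAMESPACE</code> binding name, the URL scheme, and the permissive CORS policy are assumptions for illustration, not Xata’s implementation):</p>

```typescript
// Hypothetical dispatcher sketch for Workers for Platforms; binding
// name, routing scheme, and CORS policy are illustrative assumptions.
export default {
  async fetch(request: Request, env: any): Promise<Response> {
    // e.g. route /workers/<scriptName> to the user Worker uploaded
    // to this namespace under that name
    const scriptName = new URL(request.url).pathname.split("/")[2];
    const userWorker = env.DISPATCH_NAMESPACE.get(scriptName);
    return withCors(await userWorker.fetch(request));
  },
};

// Copy the response and attach CORS headers before it reaches the browser.
export function withCors(response: Response): Response {
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", "*");
  return new Response(response.body, { status: response.status, headers });
}
```

    <p>The real Dispatcher would also handle preflight requests and forward logs to the configured log drain.</p>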
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5KExHCF6dxsH9hlKzg5d5h/f3bb4ad510696f30312267df487ae3c1/Untitled--1-.png" />
            
            </figure><p>By using Cloudflare, we are also able to benefit from other products in the Workers ecosystem. For example, we can provide an easy way to cache the results of a query in Cloudflare's global network. That way, we can serve the results for read-only queries directly from locations closer to the end users, without having to call the database again and again for every Worker invocation. For the developer, it's only a matter of adding a "cache" parameter to their query with the number of milliseconds they want to cache the results in a KV Namespace.</p>
            <pre><code>import { xataWorker } from '~/xata';

const listProducts = xataWorker('listProducts', async ({ xata }) =&gt; {
  return await xata.db.products.sort('popularity', 'desc').getMany({
    cache: 15 * 60 * 1000 // TTL
  });
});</code></pre>
            <p>In development mode, we provide a command to run the functions locally and test them before deploying them to production. This enables rapid development workflows with real-time filesystem change monitoring and hot reloading of the Worker code. Internally, we use the latest version of Miniflare to emulate the Cloudflare Workers runtime, mimicking the real production environment.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Xata is now out of beta and available for everyone. We offer a generous free tier that allows you to build and deploy your applications and only pay to scale them when you actually need it.</p><p>You can <a href="https://app.xata.io">sign up for free</a>, create a database in seconds and enjoy features such as branching with zero downtime migrations, search and analytics, transactions, and many others. Check out our <a href="https://xata.io">website</a> to learn more!</p><p>Xata Workers are currently in private beta. If you are interested in trying them out, you can sign up for the <a href="https://xata.io/beta/workers">waitlist</a> and talk us through your use case. We are looking for developers that are willing to provide feedback and shape this new feature for our serverless data platform.</p><p>We are very proud of our collaboration with Cloudflare for this new feature. Processing data closer to where it's being requested is the future of computing and we are excited to be part of this movement. We look forward to seeing what you build with Xata Workers.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">lbsVJBHHzqUTlcfGAM1kD</guid>
            <dc:creator>Alexis Rico (Guest Author)</dc:creator>
            <dc:creator>Tanushree Sharma</dc:creator>
        </item>
        <item>
            <title><![CDATA[Easy Postgres integration on Cloudflare Workers with Neon.tech]]></title>
            <link>https://blog.cloudflare.com/neon-postgres-database-from-workers/</link>
            <pubDate>Tue, 15 Nov 2022 13:44:00 GMT</pubDate>
            <description><![CDATA[ Neon.tech is a database platform that allows you to branch your (Postgres compatible) database exactly like your code repository. And with the release of their Cloudflare Workers integration you can get started in minutes ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5xCkAJAUyFmN2b0MizGCP0/e191956551ac3604f322f51bee54dc02/image1-28.png" />
            
            </figure><p>It’s no wonder that Postgres is one of the world’s favorite databases. It’s easy to learn, a pleasure to use, and can scale all the way up from your first database in an early-stage startup to the system of record for giant organizations. Postgres has been an integral part of Cloudflare’s journey, so we know this fact well. But when it comes to connecting to Postgres from environments like Cloudflare Workers, there are unfortunately a bunch of challenges, as we mentioned in our <a href="/relational-database-connectors/">Relational Database Connector post</a>.</p><p>Neon.tech not only solves these problems; it also has other cool features such as <a href="https://neon.tech/docs/conceptual-guides/branching/">branching databases</a> — being able to branch your database in exactly the same way you branch your code: instant, cheap and completely isolated.</p>
    <div>
      <h3>How to use it</h3>
      <a href="#how-to-use-it">
        
      </a>
    </div>
    <p>It’s easy to get started. Neon’s client library <code>@neondatabase/serverless</code> is a drop-in replacement for <a href="https://node-postgres.com/">node-postgres</a>, the npm <code>pg</code> package with which you may already be familiar. After going through the <a href="https://neon.tech/docs/get-started-with-neon/signing-up/">getting started</a> process to set up your Neon database, you can easily create a Worker to ask Postgres for the current time like so:</p><ol><li><p><b>Create a new Worker</b> — Run <code>npx wrangler init neon-cf-demo</code> and accept all the defaults. Enter the new folder with <code>cd neon-cf-demo</code>.</p></li><li><p><b>Install the Neon package</b> — Run <code>npm install @neondatabase/serverless</code>.</p></li><li><p><b>Provide connection details</b> — For deployment, run <code>npx wrangler secret put DATABASE_URL</code> and paste in your connection string when prompted (you’ll find this in your Neon dashboard: something like <code>postgres://user:password@project-name-1234.cloud.neon.tech/main</code>). For development, create a new file <code>.dev.vars</code> with the contents <code>DATABASE_URL=</code> plus the same connection string.</p></li><li><p><b>Write the code</b> — Lastly, replace <code>src/index.ts</code> with the following code:</p></li></ol>
            <pre><code>import { Client } from '@neondatabase/serverless';
interface Env { DATABASE_URL: string; }

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext) {
    const client = new Client(env.DATABASE_URL);
    await client.connect();
    const { rows: [{ now }] } = await client.query('select now();');
    ctx.waitUntil(client.end());  // this doesn’t hold up the response
    return new Response(now);
  }
}</code></pre>
            <p>To try this locally, type <code>npm start</code>. To deploy it around the globe, type <code>npx wrangler publish</code>.</p><p>You can also <a href="https://github.com/neondatabase/serverless-cfworker-demo">check out the source</a> for a slightly more complete demo app. This shows <a href="https://neon-cf-pg-test.pages.dev/">your nearest UNESCO World Heritage sites</a> using IP geolocation in Cloudflare Workers and nearest-neighbor sorting in PostGIS.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cdG0zVBCrFyiBilGo94o8/1d33ccaca61c6efec303aaa763ddb1e8/image3-18.png" />
            
            </figure><p>How does this work? In this case, we take the coordinates supplied to our Worker in <code>request.cf.longitude</code> and <code>request.cf.latitude</code>. We then feed these coordinates to a SQL query that uses the <a href="https://postgis.net/docs/geometry_distance_knn.html">PostGIS distance operator <code>&lt;-&gt;</code></a> to order our results:</p>
            <pre><code>const { longitude, latitude } = request.cf
const { rows } = await client.query(`
  select 
    id_no, name_en, category,
    st_makepoint($1, $2) &lt;-&gt; location as distance
  from whc_sites_2021
  order by distance limit 10`,
  [longitude, latitude]
);</code></pre>
            <p>Since we created a <a href="http://postgis.net/workshops/postgis-intro/indexing.html">spatial index</a> on the location column, the query is blazing fast. The result (<code>rows</code>) looks like this:</p>
            <pre><code>[{
  "id_no": 308,
  "name_en": "Yosemite National Park",
  "category": "Natural",
  "distance": 252970.14782223428
},
{
  "id_no": 134,
  "name_en": "Redwood National and State Parks",
  "category": "Natural",
  "distance": 416334.3926827573
},
/* … */
]</code></pre>
            <p>For even lower latencies, we could cache these results at a slightly coarser geographical resolution — rounding, say, to one sixtieth of a degree (one <a href="https://en.wikipedia.org/wiki/Minute_and_second_of_arc">arc minute</a>) of longitude and latitude, which is a little under a mile.</p><p><a href="https://console.neon.tech/?invite=serverless">Sign up to Neon</a> using the invite code <i>serverless</i> and try the @neondatabase/serverless driver with Cloudflare Workers.</p>
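            <p>As a concrete sketch of that idea (the helper and key format below are illustrative, not part of the demo app), rounding both coordinates to the nearest arc minute produces a cache key that nearby visitors share:</p>

```typescript
// Illustrative sketch: quantize coordinates to 1/60 of a degree so
// requests from nearby visitors map onto one cached query result.
function coarseCacheKey(longitude: number, latitude: number): string {
  const toArcMinute = (deg: number) => Math.round(deg * 60) / 60;
  return `${toArcMinute(longitude).toFixed(4)},${toArcMinute(latitude).toFixed(4)}`;
}
```

            <p>Two visitors a few hundred meters apart then read the same cache entry instead of each issuing the PostGIS query.</p>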
    <div>
      <h3>Why we did it</h3>
      <a href="#why-we-did-it">
        
      </a>
    </div>
    <p>Cloudflare Workers has enormous potential to improve back-end development and deployment. It’s cost-effective, admin-free, and radically scalable.</p><p>The use of V8 isolates means Workers are now fast and lightweight enough for nearly any use case. But it has a key drawback: Cloudflare Workers don’t yet support raw TCP communication, which has made database connections a challenge.</p><p>Even when Workers eventually support raw TCP communication, we will not have fully solved our problem, because database connections are expensive to set up and also have quite a bit of memory overhead.</p><p>This is what the solution looks like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CWb6OEZuNu8mg1YJsQTzY/3c14f184ab034be30f106f921f2485c7/image2-23.png" />
            
            </figure><p>It consists of three parts:</p><ol><li><p><b>Connection pooling built into the platform</b> — Given Neon’s serverless compute model, which separates storage and compute, it is not recommended to rely on a one-to-one mapping between external clients and Postgres connections. Instead, you can turn on connection pooling simply by flicking a switch (it’s in the <i>Settings</i> area of your Neon dashboard).</p></li><li><p><b>WebSocket proxy</b> — We deploy our <a href="https://github.com/neondatabase/wsproxy">own WebSocket-to-TCP proxy</a>, written in Go. The proxy simply accepts WebSocket connections from Cloudflare Worker clients, relays message payloads to a requested (Neon-only) host over plain TCP, and relays back the responses.</p></li><li><p><b>Client library</b> — Our driver library is based on node-postgres but provides the necessary <a href="https://github.com/neondatabase/serverless">shims for Node.js features</a> that aren’t present in Cloudflare Workers. Crucially, we replace Node’s <code>net.Socket</code> and <code>tls.connect</code> with code that redirects network reads and writes via the WebSocket connection. To support end-to-end TLS encryption between Workers and the database, we compile <a href="https://www.wolfssl.com/">WolfSSL</a> to WebAssembly with <a href="https://emscripten.org/">emscripten</a>. Then we use <a href="https://esbuild.github.io/">esbuild</a> to bundle it all together into an easy-to-use npm package.</p></li></ol><p>The <code>@neondatabase/serverless</code> package is currently in public beta. We have plans to improve, extend, and explain it further in the near future <a href="https://neon.tech/blog/">on the Neon blog</a>. 
In line with our commitment to open source, you can configure our serverless driver and/or run our WebSocket proxy to provide access to Postgres databases hosted anywhere — just see the respective repos for details.</p><p>So <a href="https://console.neon.tech/?invite=serverless">try Neon</a> using invite code <code><i>serverless</i></code>, <a href="https://workers.cloudflare.com/">sign up and connect to it with Cloudflare Workers</a>, and you’ll have a fully flexible back-end service running in next to no time.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">7yvumN0c8zRWHVBbQAkbLc</guid>
            <dc:creator>Erwin van der Koogh</dc:creator>
            <dc:creator>George MacKerron (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Workers scale too well and broke our infrastructure, so we are rebuilding it on Workers]]></title>
            <link>https://blog.cloudflare.com/devcycle-customer-story/</link>
            <pubDate>Mon, 14 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ DevCycle re-architected their feature management tool using Workers for improved scalability ]]></description>
            <content:encoded><![CDATA[ <p>While scaling our new feature flagging product <a href="https://devcycle.com/">DevCycle</a>, we’ve encountered an interesting challenge: our Cloudflare Workers-based infrastructure can handle far more instantaneous load than our traditional AWS infrastructure. This led us to rethink our infrastructure design and move towards using Cloudflare Workers for everything.</p>
    <div>
      <h3>The origin of DevCycle</h3>
      <a href="#the-origin-of-devcycle">
        
      </a>
    </div>
    <p>For almost 10 years, Taplytics has been a leading provider of no-code A/B testing and feature flagging solutions for product and marketing teams across a wide range of use cases at some of the largest consumer-facing companies in the world. So when we set out to build a new engineering-focused feature management product, DevCycle, we built upon our experience using Workers, which have served over 140 billion requests for Taplytics customers.</p><p>The inspiration behind DevCycle is to build a focused feature management tool for engineering teams, empowering them to build their software more efficiently, deploy it faster, and reach their goals, whether that’s continuous deployment, a lower change failure rate, or a faster recovery time. DevCycle is the culmination of our vision of how teams should use feature management to build high-quality software faster. We’ve used DevCycle to <i>build</i> DevCycle, enabling us to implement continuous deployment successfully.</p>
    <div>
      <h3>DevCycle architecture</h3>
      <a href="#devcycle-architecture">
        
      </a>
    </div>
    <p>One of the first things we asked ourselves when ideating DevCycle was how we could get out of the business of managing thousands of vCPUs’ worth of AWS instances and move our core business logic closer to our end-user devices. Based on our experience with Cloudflare Workers at Taplytics, we knew we wanted them to be a core part of our future infrastructure for DevCycle.</p><p>By using the global computing power of Workers, and by moving as much logic as possible into our local bucketing server-side SDKs, we were able to massively reduce or eliminate the latency of fetching feature flag configurations for our users. In addition, we used a shared WASM library across our Workers and local bucketing SDKs to dramatically reduce the amount of code we need to maintain per SDK and increase the consistency of our platform. This architecture has also fundamentally changed our business’s cost structure, letting us easily serve customers of any scale.</p><p>The core architecture of DevCycle revolves around publishing and consuming JSON configuration files per project environment. The publishing side is managed by our AWS services, while Cloudflare manages the consumption of these config files at scale. This split in responsibilities allows all high-scale requests to be managed by Cloudflare, while keeping our AWS services simple and low-scale.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5u1ZlzkOHcB4ImfB5t9Nw7/95a30c622c0b646be1472e5619f8fc1d/image1-22.png" />
            
            </figure>
    <div>
      <h3>“Workers are breaking our events pipeline”</h3>
      <a href="#workers-are-breaking-our-events-pipeline">
        
      </a>
    </div>
    <p>One of the primary challenges of a feature management platform is that we don’t have direct control over the load from our customers’ applications using our SDKs; our systems need the ability to scale instantly to match their load. For example, we have a couple of large customers whose mobile traffic is primarily driven by push notifications, which causes massive instantaneous spikes in traffic to our APIs, in the range of 10x increases in load. As you can imagine, no traditional auto-scaled API service and load balancer can manage that type of increase in load. That leaves us with three choices: dramatically increase the minimum size of our cluster and load balancer to handle these unpredictable spikes, accept that some requests will be rate-limited, or move to an architecture that can handle the load.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wv1AhdzZCGrR1nX0x0BBm/3788b48880257b4346e1ec58fda2e9a6/image4-10.png" />
            
            </figure><p>Given that all our SDK API requests are already served by Workers, they have no problem scaling instantly to 10x+ their base load. Sadly, we can’t say the same about the traditional parts of our infrastructure.</p><p>For each feature flag configuration request to a Worker, a corresponding events request is sent to our AWS events infrastructure. The events are received by our events API container in Kubernetes, where they are published to Kafka and eventually ingested by Snowflake. While Cloudflare Workers have no problem handling instantaneous spikes in feature flag requests, the events system can’t keep up: our cluster and events API containers can’t scale up fast enough to keep the existing instances from being overwhelmed, and even the load balancer has issues accepting the sudden increase. Cloudflare Workers just work too well in comparison to EC2 instances + EKS.</p><p>To solve this issue, we are moving towards a new events Cloudflare Worker that can handle the instantaneous events load from these requests and uses Kinesis Data Firehose to write events to our existing S3 bucket, which is ingested by Snowflake. In the future, we look forward to testing out Cloudflare Queues writing to R2 once a Snowflake connector has been created. This architecture should allow us to ingest events at almost any scale and withstand instantaneous traffic spikes with a predictable and efficient cost structure.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3sOm1ZrZvJGgcHWtGvzPB8/2872f752d67f87eb2cbfa95b6ca0cffe/image2-21.png" />
            
            </figure>
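            <p>The new events Worker isn’t shown here; as an illustrative sketch (the event shape and newline-delimited JSON framing are assumptions, not DevCycle’s code), the Worker could frame each batch as NDJSON, a format Snowflake can ingest once Firehose delivers the concatenated records to S3:</p>

```typescript
// Illustrative sketch, not DevCycle's code: frame a batch of SDK
// events as newline-delimited JSON for delivery via Kinesis Data
// Firehose to S3, where Snowflake ingests them.
interface FlagEvent {
  userId: string;
  feature: string;
  variation: string;
  timestamp: number;
}

function encodeBatch(events: FlagEvent[]): string {
  // one JSON object per line; the trailing newline keeps records
  // separable when Firehose concatenates adjacent batches in S3
  return events.map((e) => JSON.stringify(e)).join("\n") + "\n";
}
```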
    <div>
      <h3>Building without a database next to your code</h3>
      <a href="#building-without-a-database-next-to-your-code">
        
      </a>
    </div>
    <p>Workers provide many benefits, including fast response times, infinite scalability, serverless architecture, and excellent uptime. However, if you want to see all these benefits, you need to architect your Workers to assume that you don’t have direct access to a centralized SQL / NoSQL database (or <a href="/whats-new-with-d1/">D1</a>) like you would with a traditional API service. For example, suppose you build your Workers to reach out to a database to fetch and update user data every time a request is made. In that case, your request latency will be tied to the geographic distance between your Worker and the database, plus the latency of the database itself. In addition, your Workers will be able to scale significantly beyond the number of database connections you can support, and your uptime will be tied to the uptime of your external database. Therefore, when architecting your systems to use Workers, we advise relying primarily on data sent as part of the API request and on cacheable data on Cloudflare’s global network.</p><p>Cloudflare provides multiple products and services to help with data on their global network:</p><ul><li><p><a href="https://developers.cloudflare.com/workers/learning/how-kv-works/">KV</a>: “global, low-latency, key-value data store.”</p><ul><li><p>However, the lowest-latency way of retrieving data from within a Worker is subject to a minimum 60-second TTL, so you need to be comfortable with cached data that may be up to 60 seconds stale.</p></li></ul></li><li><p><a href="https://developers.cloudflare.com/workers/learning/using-durable-objects/">Durable Objects</a>: “provide low-latency coordination and consistent storage for the Workers platform through two features: global uniqueness and a transactional storage API.”</p><ul><li><p>Ability to store user-level information closer to the end user.</p></li><li><p>The storage interface may be unfamiliar to developers with SQL / NoSQL experience.</p></li></ul></li><li><p><a href="https://developers.cloudflare.com/r2/data-access/workers-api/workers-api-usage/">R2</a>: “store large amounts of unstructured data.”</p><ul><li><p>Ability to store arbitrarily large amounts of unstructured data using familiar S3 APIs.</p></li><li><p>Cloudflare’s cache can be used to provide low-latency access within Workers.</p></li></ul></li><li><p><a href="/introducing-d1/">D1</a>: “<a href="https://www.cloudflare.com/developer-platform/products/d1/">serverless SQLite database</a>”</p></li></ul><p>Each of these tools has made building APIs far more accessible than when Workers launched initially; however, each service has aspects that need to be accounted for when architecting your systems. Being an open platform, you can also access any publicly available database you want from a Worker. For example, we are making use of <a href="https://www.macrometa.com/platform/global-data-mesh">Macrometa</a> for our <a href="https://devcycle.com/features/edge-flags">EdgeDB</a> product built into our Workers to help customers access their user data.</p>
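    <p>As a small sketch of the KV trade-off above (the binding interface is stubbed and the <code>config:</code> key scheme is an assumption, so this also runs outside the Workers runtime), reads can opt into edge caching with <code>cacheTtl</code>, whose floor is 60 seconds:</p>

```typescript
// Illustrative sketch: read per-project config through Workers KV,
// trading up to 60 seconds of staleness for edge-cached reads.
// KvLike stubs the KV binding so the helper runs outside Workers.
interface KvLike {
  get(key: string, opts: { type: "json"; cacheTtl: number }): Promise<unknown>;
}

async function readConfig(kv: KvLike, projectId: string): Promise<unknown> {
  // cacheTtl below 60 is rejected by KV, so 60 is the floor
  return kv.get(`config:${projectId}`, { type: "json", cacheTtl: 60 });
}
```

    <p>Inside a Worker, the real KV binding from <code>env</code> would be passed in place of the stub.</p>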
    <div>
      <h3>The predictable cost structure of Workers</h3>
      <a href="#the-predictable-cost-structure-of-workers">
        
      </a>
    </div>
    <p>One of the greatest advantages of moving most of our workloads to Cloudflare Workers is a predictable cost structure that scales 1:1 with our request loads and maps easily to usage-based billing for our customers. In addition, we no longer have to run excess EC2 instances just in case random spikes in load happen.</p><p>Too many SaaS services have opaque billing based on max usage or other metrics that don’t relate directly to their costs. Moving from our legacy AWS architecture, with high fixed costs like databases and caching layers, to Workers means our infrastructure spending is directly tied to usage of our APIs and SDKs. For DevCycle, this architecture has been roughly 5x more cost-efficient to operate.</p>
    <div>
      <h3>The future of DevCycle and Cloudflare</h3>
      <a href="#the-future-of-devcycle-and-cloudflare">
        
      </a>
    </div>
    <p>With DevCycle we will continue to invest in leveraging serverless computing and moving our core business logic as close to our users as possible, either on Cloudflare’s global network or locally within our SDKs. We’re excited to integrate even more deeply with the Cloudflare developer platform as new services evolve. We already see future use cases for R2, Queues and Durable Objects and look forward to what’s coming next from Cloudflare.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">3O0ZAqf35EBtm8oisold23</guid>
            <dc:creator>Jonathan Norris (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Envoy Media: using Cloudflare's Bot Management & ML]]></title>
            <link>https://blog.cloudflare.com/envoy-media-machine-learning-bot-management/</link>
            <pubDate>Wed, 16 Mar 2022 12:59:05 GMT</pubDate>
            <description><![CDATA[ How Envoy Media uses Bot Score and Machine Learning to improve business outcomes ]]></description>
            <content:encoded><![CDATA[ <p><i>This is a guest post by Ryan Marlow, CTO, and Michael Taggart, Co-founder of Envoy Media Group.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MWfcHv6qxlb5dJ6UNj55O/2a2442c49f4a845468cfc5a1fa338661/image1-54.png" />
            
            </figure><p>My name is Ryan Marlow, and I’m the CTO of Envoy Media Group. I’m excited to share a story with you about Envoy, Cloudflare, and how we use <a href="https://www.cloudflare.com/products/bot-management/">Bot Management</a> to monitor automated traffic.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>Envoy Media Group is a digital marketing and lead generation company. The aim of our work is simple: we use marketing to connect customers with financial services. For people who are experiencing a particular financial challenge, Envoy provides informative videos, money management tools, and other resources. Along the way, we bring customers through an online experience, so we can better understand their needs and educate them on their options. With that information, we check our database of highly vetted partners to see which programs may be useful, and then match them up with the best company to serve them.</p><p>As you can imagine, it’s important for us to responsibly match engaged customers to the right financial services. Envoy develops its own brands that guide customers throughout the process. We spend our own advertising dollars, work purely on a performance basis, and choose partners we know will do right by customers. Envoy serves as a trusted guide to help customers get their financial lives back on track.</p>
    <div>
      <h3>A bit of technical detail</h3>
      <a href="#a-bit-of-technical-detail">
        
      </a>
    </div>
    <p>We often say that Envoy offers a “sophisticated online experience.” This is not your average lead generation engine. We’ve built our own multichannel marketing platform called Revstr, which handles content management, marketing automation, and business intelligence. Envoy has an in-house technology team that develops and maintains Revstr's multi-million line PHP application, cloud computing services, and infrastructure. With Revstr's systems, we are able to A/B test any combination of designs with any set of business rules. As a result, Envoy shows the right experience to the right customer every time, and even adapts to the responses of each individual.</p><p>Revstr tracks each aspect of the customer's progress through our pages and forms. It also integrates with advertising platforms, client companies' <a href="https://www.salesforce.com/crm/what-is-crm/">CRM</a> systems, and third-party marketing tools. Revstr creates a 360° view of our performance and the customer's experience. All this information goes into our proprietary data warehouse and is served to the business team. This warehouse also provides data — already cleaned, normalized, and labeled — to our machine learning pipeline. Where needed, we can perform quick and easy testing, training, and deployment of ML models. Both our business team and our marketing automation rely heavily on guidance from these reports and models. And that’s where Cloudflare comes into the picture…</p>
    <div>
      <h3>Why we care about automated traffic</h3>
      <a href="#why-we-care-about-automated-traffic">
        
      </a>
    </div>
    <p>One of our key challenges is evaluating the quality of our traffic. Bad actors are always advancing and proliferating. Importantly: any fake traffic to our sites has a direct impact on our business, including wasted resources in advertising, UX optimization, and customer service.</p><p>In our world of digital marketing, especially in the industries we compete in, each click is costly and must be treated like a precious commodity. The saying goes that "he or she who is able to spend the most for a click wins." We spend our own money on those clicks and are only paid when we deliver real leads who consistently convert to enrollments for our clients.</p><p>Any money spent on illegitimate clicks will hurt our bottom line — and bots tend to be the lead culprits. Bots hurt our standing in auctions and reduce our ability to buy. The media buyers on our business team are always watching statistics on the cost, quantity, engagement, and conversion rates of our ad traffic. They look for anomalies that might represent fraudulent clicks to ensure we trim out any wasted spend.</p><p>Cloudflare offers <a href="https://www.cloudflare.com/products/bot-management/">Bot Management</a>, which spots the exact traffic we need to look out for.</p>
    <div>
      <h3>How we use Cloudflare’s Bot Score to filter out bad traffic</h3>
      <a href="#how-we-use-cloudflares-bot-score-to-filter-out-bad-traffic">
        
      </a>
    </div>
    <p>We solved our problem by using Cloudflare. For each request that reaches Envoy, Cloudflare calculates a “bot score” that ranges from 1 (automated) to 99 (human). It’s generated using a number of sophisticated methods, including <a href="https://developers.cloudflare.com/bots/plans/bm-subscription/#machine-learning">machine learning</a>. Because Cloudflare sees traffic from millions of requests every second, they have an enormous training set to work with. The bot scores are incredibly accurate. By leveraging the bot score to evaluate the legitimacy of an ad click and switching experiences accordingly, we can make the most of every click we pay for.</p><p>Because of the high cost of a click, we cannot afford to completely block that click even if Cloudflare indicates it might be a bot. Instead, we ingest the bot score as a custom header and use it as an input to our rules engine. For example, we can put much longer, more qualifying forms in front of traffic that looks suspicious, and render more streamlined forms to higher scoring visitors. In an extreme case, we can even require suspect visitors to contact us by phone instead of completing an online form. This allows us to convert leads which may have a more dubious origin but still prove to be legitimate, while maintaining a pleasant experience for the best leads.</p><p>We also pull the bot score into our data warehouse and provide it to our marketing team. Over the long term, if they see that any ad campaign or traffic source has a low average bot score, we can reduce or eliminate spend on that traffic source, seek refunds from providers, and refocus our efforts on more profitable segments.</p>
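<p>To make the flow above concrete, here is a hedged sketch of the kind of score-based branching described: read the bot score from a custom header and pick a form experience. The thresholds, the <code>x-bot-score</code> header name, and the variant names are invented for illustration; they are not Envoy’s actual rules.</p>

```typescript
// Hypothetical experience variants, in the spirit of the rules described
// above. Thresholds and names are invented for illustration.
type Experience = "streamlined-form" | "qualifying-form" | "phone-only";

// Cloudflare's bot score ranges from 1 (likely automated) to 99 (likely human).
function chooseExperience(botScore: number): Experience {
  if (botScore < 10) {
    // Extreme case: require suspect visitors to contact us by phone.
    return "phone-only";
  }
  if (botScore < 50) {
    // Suspicious traffic gets a longer, more qualifying form.
    return "qualifying-form";
  }
  // High-scoring (likely human) visitors get the streamlined form.
  return "streamlined-form";
}

// Reading the score from a custom header, as the post describes
// (the header name here is an assumption, not Envoy's actual one):
function experienceFromHeaders(headers: Map<string, string>): Experience {
  const raw = headers.get("x-bot-score");
  const score = raw ? Number.parseInt(raw, 10) : NaN;
  // Treat a missing or unparsable score as suspicious, not as blocked.
  return Number.isNaN(score) ? "qualifying-form" : chooseExperience(score);
}
```

<p>Note how no score leads to an outright block — even a low-scoring click still gets a path to convert, which matches the cost reasoning above.</p>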
    <div>
      <h3>Using the Bot Score to predict conversion rate</h3>
      <a href="#using-the-bot-score-to-predict-conversion-rate">
        
      </a>
    </div>
    <p>Envoy also leverages the bot score by integrating it into our ML models. For most lead generation companies, it would be sufficient to track lead volume and profit margins. For Envoy, it's part of our DNA to go beyond making a sale and really assess the lifetime value of those leads for our clients. So we turn to <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a>. We use ML to predict the lifetime value of new visitors based on known data about past leads, and then pass that signal on to our advertising vendors. By skewing the conversion value higher for leads with a better predicted lifetime value, we can influence those pay-per-click (PPC) platforms' own smart bidding algorithms to search for the best qualified leads.</p><p>One of the models we use in this process predicts backend conversion rate — how likely a given lead is to become an enrollment for the client company. When we added the bot score and behavioral metrics to this model, we saw a significant increase in its accuracy. When we can better predict conversion rate, we get better leads. This accuracy boost is a force multiplier for our whole platform; it makes an impact not only in media management but also in form design, lead delivery integrations, and email automation.</p>
    <div>
      <h3>Why is Bot Score so valuable for Envoy Media?</h3>
      <a href="#why-is-bot-score-so-valuable-for-envoy-media">
        
      </a>
    </div>
    <p>At Envoy, we take pride in being analytical and data driven. Here are some of the insights we found by combining Cloudflare’s Bot Score with our own internal data:</p><p>1. When we added bot score along with behavioral metrics to our conversion rate prediction ML model, its precision increased by 15%. Getting even a 1% improvement in such a carefully tuned model is difficult; a 15% improvement is a huge win.</p><p>2. Bot score is included in 76 different reports used by our media buying and UX optimization teams, and in 9 different ML models. It is now a standard component of all new UX reports.</p><p>3. Because bot score is so accurate and because bot score is now broadly available within our organization, it is driving organizational performance in ways that we didn’t expect. For example, here is a testimonial from our UX Optimization Team:</p><blockquote><p>I use bot score in PPC search reports. Before I had access to the bot score our PPC reports were muddied with automated traffic - one day conversion rates are 11%, the next day they are at 5%. That is no way to run a business! I spent a lot of time investigating to understand and justify these differences - and many times there just wasn’t a satisfactory answer, and we had to throw the analysis out. <b>Today I have access to bot score data, and it prevents data dilution and gives me a much higher degree of confidence in my analysis.</b></p></blockquote>
    <div>
      <h3>Thank You and More to Come!</h3>
      <a href="#thank-you-and-more-to-come">
        
      </a>
    </div>
    <p>Thanks to the Cloudflare team for giving us the opportunity to share our story. We’re constantly innovating and hope that we can share more of our developments with you in the future.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Bot Management]]></category>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">3Ihm6QbOoli1ZUHFAwZtSA</guid>
            <dc:creator>Ryan Marlow (Guest Author)</dc:creator>
            <dc:creator>Michael Taggart (Guest Author)</dc:creator>
            <dc:creator>Jim Berry</dc:creator>
        </item>
        <item>
            <title><![CDATA[Get full observability into your Cloudflare logs with New Relic]]></title>
            <link>https://blog.cloudflare.com/announcing-the-new-relic-direct-log-integration/</link>
            <pubDate>Mon, 14 Mar 2022 12:59:17 GMT</pubDate>
            <description><![CDATA[ Correlating Cloudflare logs across your stack in New Relic One is powerful for monitoring and debugging in order to keep services safe and reliable. We’re excited to have partnered with New Relic to create a direct integration that provides this visibility ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Building a great customer experience is at the heart of any business. Building resilient products is half the battle — teams also need observability into their applications and services that are running across their stack.</p><p>Cloudflare provides analytics and logs for our products in order to give our customers the visibility to extract insights. Many of our customers use Cloudflare along with other applications and network services and want to be able to correlate data across all of their systems.</p><p>Understanding normal traffic patterns and the causes of latency and errors helps improve performance and, ultimately, the customer experience. For example, for websites behind Cloudflare, analyzing application logs and origin server logs along with Cloudflare’s HTTP request logs gives our customers end-to-end visibility into the journey of a request.</p><p>We’re excited to have partnered with New Relic to create a direct integration that provides this visibility. The direct integration with our logging product, Logpush, means customers no longer need to pay for middleware to get their Cloudflare data into New Relic. The result is faster log delivery and lower costs for our mutual customers!</p><p>We’ve invited the New Relic team to dig into how New Relic One can be used to provide insights into Cloudflare.</p>
    <div>
      <h3>New Relic Log Management</h3>
      <a href="#new-relic-log-management">
        
      </a>
    </div>
    <p>New Relic provides an open, extensible, cloud-based observability platform that gives visibility into your entire stack. Logs, metrics, events, and traces are automatically correlated to help our customers improve user experience, accelerate time to market, and reduce MTTR.</p><p>Deploying <a href="https://newrelic.com/products/log-management">log management</a> in context and at scale has never been faster, easier, or more attainable. With New Relic One, you can collect, search, and correlate logs and other telemetry data from applications, infrastructure, network devices, and more for faster troubleshooting and investigation.</p><p>New Relic correlates events from your applications, infrastructure, serverless environments, along with mobile errors, traces and spans to your logs — so you find exactly what you need with less toil. All your logs are only a click away, so there’s no need to dig through logs in a separate tool to manually correlate them with errors and traces.</p><p>See how engineers have used logs in New Relic to better serve their customers in the short video below.</p>
    <div>
      <h3>A quickstart for Cloudflare Logpush and New Relic</h3>
      <a href="#a-quickstart-for-cloudflare-logpush-and-new-relic">
        
      </a>
    </div>
    <p>To help you get the most out of the new Logpush integration with New Relic, we’ve created the <a href="https://newrelic.com/instant-observability/cloudflare-network-logs/fc2bb0ac-6622-43c6-8c1f-6a4c26ab5434/?utm_source=external_partners&amp;utm_medium=referral&amp;utm_campaign=global-ever-green-io-partner">Cloudflare Logpush quickstart for New Relic</a>. The Cloudflare quickstart will enable you to monitor and analyze web traffic metrics on a single pre-built dashboard, integrating with New Relic’s database to provide an at-a-glance overview of the most important logs and metrics from your websites and applications.</p><p>Getting started is simple:</p><ul><li><p>First, ensure that you have enabled pushing logs directly into New Relic by following the documentation <a href="https://developers.cloudflare.com/logs/get-started/enable-destinations/new-relic/">“Enable Logpush to New Relic”</a>.</p></li><li><p>You’ll also need a New Relic account. If you don’t have one yet, get a free-forever account <a href="https://newrelic.com/signup?utm_source=external_partners&amp;utm_medium=content&amp;utm_campaign=global-fy22-q4-io-partner-cloudflare">here</a>.</p></li><li><p>Next, visit the <a href="https://newrelic.com/instant-observability/cloudflare-network-logs/fc2bb0ac-6622-43c6-8c1f-6a4c26ab5434/?utm_source=external_partners&amp;utm_medium=referral&amp;utm_campaign=global-ever-green-io-partner">Cloudflare quickstart in New Relic</a>, click “Install quickstart”, and follow the guided click-through.</p></li></ul><p>For full instructions to set up the integration and quickstart, read the <a href="http://newrelic.com/blog/how-to-relic/cloudflare-log-integration">New Relic blog post</a>.</p>
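<p>For reference, the first step above can also be done programmatically: Cloudflare’s public Logpush API creates jobs via a POST to the zone-scoped <code>/logpush/jobs</code> endpoint. The sketch below is illustrative only — verify the exact <code>destination_conf</code> format and key parameter against the “Enable Logpush to New Relic” documentation linked above, as the field values here are assumptions.</p>

```typescript
// Sketch of a Logpush job request body for shipping HTTP request logs
// to New Relic. Field values are illustrative; check the linked docs.
interface LogpushJobRequest {
  name: string;
  dataset: string;
  destination_conf: string;
  enabled: boolean;
}

function buildNewRelicJob(newRelicKey: string): LogpushJobRequest {
  return {
    name: "http-requests-to-new-relic",
    // "http_requests" is the Logpush dataset for HTTP request logs.
    dataset: "http_requests",
    // New Relic's Log API endpoint, keyed by your New Relic API key
    // (format assumed from the integration docs -- verify before use).
    destination_conf: `https://log-api.newrelic.com/log/v1?Api-Key=${newRelicKey}&format=cloudflare`,
    enabled: true,
  };
}

// The zone-scoped endpoint the job body is POSTed to, as JSON, with an
// "Authorization: Bearer <API token>" header.
function jobEndpoint(zoneId: string): string {
  return `https://api.cloudflare.com/client/v4/zones/${zoneId}/logpush/jobs`;
}
```

<p>The quickstart UI does the equivalent of this for you; the sketch is only meant to show what the integration configures under the hood.</p>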
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/686cjyzad7TdJGQzRMobPY/2ddb0260e727059f8c327f7fa7d77f57/image1-16.png" />
            
            </figure><p>As a result, you’ll get a rich, ready-made dashboard with key metrics about your Cloudflare logs!</p><p>Correlating Cloudflare logs across your stack in New Relic One is powerful for monitoring and debugging in order to keep services safe and reliable. Cloudflare customers get access to logs as part of the Enterprise plan; if you aren’t using Cloudflare Enterprise, <a href="https://www.cloudflare.com/plans/enterprise/contact/">contact us</a>. If you’re not already a New Relic user, <a href="https://newrelic.com/signup">sign up for New Relic</a> to get a free account which includes this new experience and all of our product capabilities.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Logs]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">2PBumqKZPAc7RpJjXlMoZa</guid>
            <dc:creator>Tanushree Sharma</dc:creator>
            <dc:creator>Mike Neville-O'Neill (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leverage IBM QRadar SIEM to get insights from Cloudflare logs]]></title>
            <link>https://blog.cloudflare.com/announcing-the-ibm-qradar-direct-log-integration/</link>
            <pubDate>Mon, 14 Mar 2022 12:59:13 GMT</pubDate>
            <description><![CDATA[ We’re excited to announce that Cloudflare customers are now able to push their logs directly to QRadar. This direct integration leads to cost savings and faster log delivery for Cloudflare and QRadar SIEM customers ]]></description>
            <content:encoded><![CDATA[ <p></p><p>It’s just past midnight, and you’ve been notified that there is a malicious IP hitting your servers. You need to triage the situation: find the who, what, where, when, and why as quickly and in as much detail as possible.</p><p>Based on what you find out, your next steps could fall anywhere from classifying the alert as a false positive to escalating the situation and alerting on-call staff from around your organization with a middle-of-the-night wake-up.</p><p>For anyone who’s gone through a similar situation, you’re aware that the security tools you have on hand can make the situation infinitely easier. It’s invaluable to have one platform that provides complete visibility of all the endpoints, systems and operations that are running at your company.</p><p>Cloudflare protects customers’ applications through application services: DNS, CDN and WAF to name a few. We also have products that protect corporate applications, like our <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> offerings Access and Gateway. Each of these products generates logs that provide customers visibility into what’s happening in their environments. Many of our customers use Cloudflare’s services along with other network or application services, such as endpoint management, containerized systems and their own servers.</p><p>We’re excited to announce that Cloudflare customers are now able to push their logs directly to IBM Security QRadar SIEM. This direct integration leads to cost savings and faster log delivery for Cloudflare and QRadar SIEM customers because there is no intermediary cloud storage required.</p><p>Cloudflare has invited our partner from the IBM QRadar SIEM team to speak to the capabilities this unlocks for our mutual customers.</p>
    <div>
      <h3>IBM QRadar SIEM</h3>
      <a href="#ibm-qradar-siem">
        
      </a>
    </div>
    <p>QRadar SIEM provides security teams centralized visibility and insights across users, endpoints, clouds, applications, and networks – helping you detect, investigate, and respond to threats enterprise-wide. QRadar SIEM helps security teams work quickly and efficiently by turning thousands to millions of events into a manageable number of prioritized alerts and accelerating investigations with automated, AI-driven enrichment and root cause analysis. With QRadar SIEM, you can increase the productivity of your team, address critical use cases, and mature your security operation.</p><p>Cloudflare’s reverse proxy and enterprise security products are a key part of customers’ environments. <a href="https://www.cloudflare.com/learning/security/what-is-siem/">Security analysts</a> can gain visibility into logs from these products along with data from tools that span their network to build out detections and response workflows.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/U23IR6bdNFpsSfnjbBRWg/7d63da63229f0c4889eba72b4d21bc9d/image1-15.png" />
            
            </figure><p>IBM and Cloudflare have partnered for years to provide a single-pane-of-glass view for our customers. This new, enhanced integration means that QRadar SIEM customers can ingest Cloudflare logs directly from Cloudflare’s Logpush product. QRadar SIEM also continues to support customers who are leveraging the existing integration via S3 storage.</p><p>For more information about how to use this new integration, refer to the <a href="https://www.ibm.com/docs/en/dsm?topic=configuration-cloudflare-logs">Cloudflare Logs DSM guide</a>. Also, check out the blog post on the <a href="https://community.ibm.com/community/user/security/blogs/gaurav-sharma/2022/03/07/beyondthedsmguide-qradar-cloudflare-integration">QRadar Community blog</a> for more details!</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Logs]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">6GxLJgfX1YTdqaAlUPMHoi</guid>
            <dc:creator>Tanushree Sharma</dc:creator>
            <dc:creator>Gaurav Gyan Sharma (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Guest Blog: k8s tunnels with Kudelski Security]]></title>
            <link>https://blog.cloudflare.com/guest-blog-zero-trust-access-kubernetes/</link>
            <pubDate>Wed, 08 Dec 2021 13:59:22 GMT</pubDate>
            <description><![CDATA[ At Kudelski Security, we've been working on implementing our Zero Trust strategy for the last two years. In many aspects, it's been an incredible journey, and although we're not quite finished yet, we're excited by the progress made so far with Cloudflare. ]]></description>
            <content:encoded><![CDATA[ <p></p><p><i>Today, we’re excited to publish a blog post written by our friends at Kudelski Security, a managed security services provider. A few weeks back, Romain Aviolat, the Principal Cloud and Security Engineer at Kudelski Security approached our Zero Trust team with a unique solution to a difficult problem that was powered by Cloudflare’s Identity-aware Proxy, which we call Cloudflare Tunnel, to ensure secure application access in remote working environments.</i></p><p><i>We enjoyed learning about their solution so much that we wanted to amplify their story. In particular, we appreciated how Kudelski Security’s engineers took full advantage of the flexibility and scalability of our technology to automate workflows for their end users. If you’re interested in learning more about Kudelski Security, check out their work below or their</i> <a href="https://research.kudelskisecurity.com/"><i>research blog</i></a><i>.</i></p>
    <div>
      <h3>Zero Trust Access to Kubernetes</h3>
      <a href="#zero-trust-access-to-kubernetes">
        
      </a>
    </div>
    <p>Over the past few years, Kudelski Security’s engineering team has prioritized migrating our infrastructure to multi-cloud environments. Our internal cloud migration mirrors what our end clients are pursuing and has equipped us with expertise and tooling to enhance our services for them. Moreover, this transition has provided us an opportunity to reimagine our own security approach and embrace the best practices of <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a>.</p><p>So far, one of the most challenging facets of our Zero Trust adoption has been securing access to our different Kubernetes (K8s) control-plane (APIs) across multiple cloud environments. Initially, our infrastructure team struggled to gain visibility and apply consistent, identity-based controls to the different APIs associated with different K8s clusters. Additionally, when interacting with these APIs, our developers were often left blind as to which clusters they needed to access and how to do so.</p><p>To address these frictions, we designed an in-house solution leveraging Cloudflare to automate how developers could securely authenticate to K8s clusters sitting across public cloud and on-premise environments. Specifically, for a given developer, we can now surface all the K8s services they have access to in a given cloud environment, authenticate an access request using Cloudflare’s Zero Trust rules, and establish a connection to that cluster via Cloudflare’s Identity-aware proxy, Cloudflare Tunnel.</p><p>Most importantly, this automation tool has enabled Kudelski Security as an organization to enhance our security posture and improve our developer experience at the same time. 
We estimate that this tool saves a new developer at least two hours of time otherwise spent reading documentation, submitting IT service tickets, and manually deploying and configuring the different tools needed to access different K8s clusters.</p><p>In this blog, we detail the specific pain points we addressed, how we designed our automation tool, and how Cloudflare helped us progress on our Zero Trust journey in a work-from-home friendly way.</p>
    <div>
      <h3>Challenges securing multi-cloud environments</h3>
      <a href="#challenges-securing-multi-cloud-environments">
        
      </a>
    </div>
    <p>As Kudelski Security has expanded our client services and internal development teams, we have inherently expanded our footprint of applications within multiple K8s clusters and multiple cloud providers. For our infrastructure engineers and developers, the K8s cluster API is a crucial entry point for troubleshooting. We work in GitOps and all our application deployments are automated, but we still frequently need to connect to a cluster to pull logs or debug an issue.</p><p>However, maintaining this diversity creates complexity and pressure for infrastructure administrators. For end users, sprawling infrastructure can translate to different credentials, different access tools for each cluster, and different configuration files to keep track of.</p><p>Such a complex access experience can make real-time troubleshooting particularly painful. For example, on-call engineers trying to make sense of an unfamiliar K8s environment may dig through dense documentation or be forced to wake up other colleagues to ask a simple question. All this is error-prone and a waste of precious time.</p><p>Common, traditional approaches to securing access to K8s APIs presented challenges we knew we wanted to avoid. For example, we felt that exposing the API to the public internet would inherently increase our attack surface, a risk we couldn’t afford. Moreover, we did not want to provide broad-based access to our clusters’ APIs via our internal networks and accept the risks of lateral movement. As Kudelski continues to grow, the operational costs and complexity of deploying VPNs across our workforce and different cloud environments would lead to scaling challenges as well.</p><p>Instead, we wanted an approach that would allow us to maintain small, micro-segmented environments, small failure domains, and no more than one way to give access to a service.</p>
    <div>
      <h3>Leveraging Cloudflare’s Identity-aware Proxy for Zero Trust access</h3>
      <a href="#leveraging-cloudflares-identity-aware-proxy-for-zero-trust-access">
        
      </a>
    </div>
    <p>To do this, Kudelski Security’s engineering team opted for a more modern approach: creating connections between users and each of our K8s clusters via an Identity-aware proxy (IAP). IAPs are flexible to deploy and add an additional layer of security in front of our applications by verifying the identity of a user when an access request is made. Further, they support our Zero Trust approach by creating connections from users to individual applications — not entire networks.</p><p>Each cluster has its own IAP and its own sets of policies, which check for identity (via our corporate SSO) and other contextual factors like the device posture of a developer’s laptop. The IAP doesn't replace the K8s cluster authentication mechanism; it adds a new one on top of it, and thanks to identity federation and SSO, this process is completely transparent for our end users.</p><p>In our setup, Kudelski Security is using Cloudflare’s IAPs as a component of Cloudflare Access, a ZTNA solution and one of several security services unified by Cloudflare’s Zero Trust platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qcVhWCQ9738SLDgAWt1fL/0e6e328791b8d4c0c51deecd4a76ba43/image1-39.png" />
            
            </figure><p>For many web-based apps, IAPs help create a frictionless experience for end users requesting access via a browser. Users authenticate via their corporate SSO or identity provider before reaching the secured app, while the IAP works in the background.</p><p>That user flow looks different for CLI-based applications because we cannot redirect CLI network flows like we do in a browser. In our case, our engineers want to use their favorite K8s clients which are CLI-based like <a href="https://kubernetes.io/docs/reference/kubectl/overview/">kubectl</a> or <a href="https://github.com/derailed/k9s">k9s</a>. This means our Cloudflare IAP needs to act as a SOCKS5 proxy between the CLI client and each K8s cluster.</p><p>To create this IAP connection, Cloudflare provides a lightweight server-side daemon called <i>cloudflared</i> that connects infrastructure with applications. This encrypted connection runs on Cloudflare’s global network where Zero Trust policies are applied with single-pass inspection.</p><p>Without any automation, however, Kudelski Security’s infrastructure team would need to distribute the daemon on end user devices, provide guidance on how to set up those encrypted connections, and take other manual, hands-on configuration steps and maintain them over time. Plus, developers would still lack a single pane of visibility across the different K8s clusters that they would need to access in their regular work.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GgyiIhSs1biHl2F16tpVf/8290e0e8c79254bc85493ec24ff88ad6/image3-14.png" />
            
            </figure>
    <div>
      <h3>Our automated solution: k8s-tunnels!</h3>
      <a href="#our-automated-solution-k8s-tunnels">
        
      </a>
    </div>
    <p>To solve these challenges, our infrastructure engineering team developed an internal tool — called ‘k8s-tunnels’ — that handles these complex configuration steps and makes life easier for our developers. Moreover, this tool automatically discovers all the K8s clusters that a given user has access to based on the Zero Trust policies created. To enable this functionality, we embedded the SDKs of some major public cloud providers that Kudelski Security uses. The tool also embeds the <i>cloudflared</i> daemon, meaning that we only need to distribute a single tool to our users.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZQvU7tCb7pVwuA4XxkEFs/2f73cfaf44a385c1ec12f8042292ddb4/image7-3.png" />
            
</figure><p>Altogether, a developer who launches the tool goes through the following workflow (we assume that the user already has valid credentials; otherwise, the tool opens a browser on our IDP to obtain them):</p><p>1. The user selects one or more clusters to connect to</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FRjPPfs08sQT8liAnUYJx/6985e5ee3b7c6f44de8911f9cd000ffe/image2-22.png" />
            
</figure><p>2. k8s-tunnels automatically opens the connection with Cloudflare and exposes a local SOCKS5 proxy on the developer's machine</p><p>3. k8s-tunnels amends the user's local Kubernetes client configuration by pushing the necessary information to go through the local SOCKS5 proxy</p><p>4. k8s-tunnels switches the Kubernetes client context to the current connection</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5c4A7fNyziO8je6kIFL2yl/f89cd117fc0ac71a2d21a59faca9e44c/image6-8.png" />
            
</figure><p>5. The user can now use their favorite CLI client to access the K8s cluster</p><p>The whole process is really straightforward and is being used on a daily basis by our engineering team. And, of course, all this magic is made possible through the auto-discovery mechanism we’ve built into k8s-tunnels. Whenever new engineers join our team, we simply ask them to launch the auto-discovery process and get started.</p><p>Here is an example of the auto-discovery process in action.</p><ol><li><p>k8s-tunnels connects to our different cloud providers' APIs and lists the K8s clusters the user has access to</p></li><li><p>k8s-tunnels maintains a local config file of those clusters on the user's machine, so this process does not need to be run more than once</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cjUsp5ExD6qFFwAKHfpF6/b3a91c280d4cdfa0782e4d90ef133d59/image5-6.png" />
            
            </figure>
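<p>Step 3 boils down to pointing the Kubernetes client at the local SOCKS5 proxy. A minimal sketch of the kind of entry k8s-tunnels could write into a kubeconfig, assuming a kubectl version with SOCKS5 <code>proxy-url</code> support (the cluster name, server URL, and port below are illustrative, not taken from the actual tool):</p>

```yaml
# Illustrative kubeconfig fragment: route this cluster's API traffic
# through the local SOCKS5 proxy that k8s-tunnels exposes.
apiVersion: v1
kind: Config
clusters:
- name: prod-cluster                          # example cluster name
  cluster:
    server: https://kubernetes.example.com:6443
    proxy-url: socks5://localhost:1080        # example local proxy port
```

<p>With this in place, switching clusters is just a matter of switching kubeconfig contexts, which is exactly what step 4 automates.</p>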
    <div>
      <h3>Automation enhancements</h3>
      <a href="#automation-enhancements">
        
      </a>
    </div>
<p>For on-premises deployments, it was a bit trickier, as we didn't have a simple way to store K8s cluster metadata the way we do with resource tags on public cloud providers. We decided to use <a href="https://www.vaultproject.io/">Vault</a> as a key-value store to mimic public-cloud resource tags for on-prem. This way we can achieve auto-discovery of on-prem clusters following the same process as with a public-cloud provider.</p><p>You may have noticed in the previous CLI screenshot that the user can select multiple clusters at the same time! We quickly realized that our developers often needed to access multiple environments at the same time to compare a workload running in production and in staging. So instead of opening and closing tunnels every time they needed to switch clusters, we designed our tool such that they can now simply open multiple tunnels in parallel within a single k8s-tunnels instance and just switch the destination K8s cluster on their laptop.</p><p>Last but not least, we've also added support for favorites and notifications on new releases, leveraging Cloudflare Workers, but that's for another blog post.</p>
    <div>
      <h3>What’s Next</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>In designing this tool, we’ve identified a couple of issues inside Kubernetes client libraries when used in conjunction with SOCKS5 proxies, and we’re <a href="https://github.com/kubernetes/kubernetes/pull/105632">working with the Kubernetes community</a> to fix those issues, so everybody should benefit from those patches in the near future.</p><p>With this blog post, we wanted to highlight how it is possible to apply Zero Trust security for complex workloads running on multi-cloud environments, while simultaneously improving the end user experience.</p><p>Although today our ‘k8s-tunnels’ code is too specific to Kudelski Security, our goal is to share what we’ve created back with the Kubernetes community, so that other organizations and Cloudflare customers can benefit from it.</p> ]]></content:encoded>
            <category><![CDATA[CIO Week]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <category><![CDATA[Cloudflare Access]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">5vGKajT907o4VLlWFduSua</guid>
            <dc:creator>Romain Aviolat (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Launching a Startup on Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/developer-spotlight-tejas-metha-cclip/</link>
            <pubDate>Fri, 19 Nov 2021 13:59:27 GMT</pubDate>
            <description><![CDATA[ One of the most exciting things about serverless is not that you can now build everything in serverless, but that it becomes really easy to connect multiple managed services. Tejas Mehta built cClip by connecting services like RevenueCat, Stripe and Firebase into a fully fledged product. ]]></description>
<content:encoded><![CDATA[ <p></p><p>Closing out the <a href="/tag/developer-spotlight/">Developer Spotlight</a> series for this week is Tejas Mehta, who shares how he built his startup, <a href="https://cclip.app/">cClip</a>.</p><p>cClip is a great tool that allows you to “copy/paste” and transfer files between any of your devices, regardless of what OS they run.</p><p>What is so interesting about cClip though is that it is a fully serverless application built on top of Workers and KV, but not exclusively. It uses Firebase for authentication, RevenueCat for a consolidated view over the Apple and Google Play stores, and Stripe for all other billing-related work.</p><p>This is a peek into the future of application development. This is a future where we will be “importing” other SaaS applications as easily as we currently import a package from a package manager. And not only unidirectionally, by calling APIs on that external application, but bidirectionally, communicating through events with webhooks.</p><p>Here is Tejas telling his story.</p>
    <div>
      <h2>The origins of cClip</h2>
      <a href="#the-origins-of-cclip">
        
      </a>
    </div>
    <p>The abrupt transition to virtual schooling last year led to all my school communications and assignments transitioning online. With a MacBook laptop and an Android phone, submitting my precalculus homework meant I had to take a picture of each page, email each picture to myself, access the email on my computer, and finally upload each file to the submission portal. Doing this once or twice? It was no issue. Doing this every day for the rest of my sophomore and junior years? Just the thought of that got on my nerves.</p><p>This led me to make cClip, a simple transfer tool that sends images and links from my phone to my laptop. After a few friends saw it, they wanted it too, and that was when I realized it could bring enhanced productivity and platform-agnostic fluidity to a multitude of virtual workspaces.</p><p>After adding in end-to-end encryption, live-time updates, and a subscription model, I debuted cClip to the public.</p>
    <div>
      <h2>Architecture</h2>
      <a href="#architecture">
        
      </a>
    </div>
    <p>As Erwin mentioned, the architecture for cClip isn’t just Workers, but it integrates with a few other SaaS products as well. I use <a href="https://www.revenuecat.com/">RevenueCat</a> to easily integrate with the mobile app stores and Stripe via Webhooks. Firebase is used for both authentication and messaging, and, lastly, B2 for cloud storage. (At least until R2 is available!)</p><p>And of course we need clients for every single platform.</p><p>In a diagram it looks something like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YmrCkc9RMowQMOhhUI9rE/0cb9554d8b5dc090ed3c8ebf41c7c3d3/image1-54.png" />
            
            </figure>
    <div>
      <h3>The Clients</h3>
      <a href="#the-clients">
        
      </a>
    </div>
<p>One of the major challenges was to create clients for every single OS and platform out there. At a minimum, we would need an iOS and an Android app, plus a Windows, a macOS, and a web application.</p>
    <div>
      <h4>Flutter to the rescue</h4>
      <a href="#flutter-to-the-rescue">
        
      </a>
    </div>
<p><a href="https://flutter.dev/">Flutter</a> is a cross-platform application development framework. It allowed me to have a single codebase for the UI and a majority of the logic, while allowing some specific portions, like encryption, to interface with the platform-specific libraries.</p><p>Both iOS and Android are well-supported by Flutter, but Flutter Web was still relatively new. I had to use a bunch of JavaScript for tasks such as encryption and file uploads and downloads, because Flutter Web doesn’t have WebWorker support yet. I wanted to use StreamSaver.js for streamed decryption on download, to save on memory consumption for larger files.</p><p>Additionally, using Flutter Web allowed me to create an Electron app for Windows or Linux users if they desired a native app for cClip instead of accessing the website.</p>
    <div>
      <h4><b>macOS</b></h4>
      <a href="#macos">
        
      </a>
    </div>
    <p>macOS has its own application, written in Swift, which was built before the web one and further maintained for its performance benefits. The macOS and iOS applications share code for password hashing, file uploads and downloads, and encryption.</p>
    <div>
      <h3>Backend</h3>
      <a href="#backend">
        
      </a>
    </div>
    <p>cClip uses five main backend technologies: Backblaze B2, Firebase Auth, Firebase Database, RevenueCat, and Cloudflare Workers. Workers acts as the main bridge between all backend services, whether in the contexts of access control or subscription management.</p>
    <div>
      <h4><b>Authentication and Authorization</b></h4>
      <a href="#authentication-and-authorization">
        
      </a>
    </div>
<p>The authentication flow starts by using Firebase Authentication to validate a user’s identity through a sign-in with Google, Apple, or email/password, and uses Workers for all sensitive API calls. Workers validates the identity of the caller using the ‘idToken’ provided by Firebase, which is a JWT that contains the user’s UID when decoded successfully. Each sensitive operation, such as file upload/download, subscribe/unsubscribe, etc., is put behind Workers, which validates the JWT provided as a ‘Token’ header in the corresponding API call before performing the operation itself.</p><p>Workers handles all access control after a user has been authenticated, such as ensuring the user can access a file at a given location or updating a user’s subscription status when notified by RevenueCat.</p><p>Workers acts as a gatekeeper for a user’s upload or download authorizations. When the request is sent, Workers verifies the user’s identity, the requested location, and the user’s subscription status. If all three check out, Workers authorizes the requested action with B2 and returns the corresponding information.</p>
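<p>The token check at the heart of that flow can be sketched in a few lines. This is illustrative only: the helper names are my own, and a production Worker must also verify the JWT's RS256 signature against Google's published public keys, which is omitted here.</p>

```typescript
// Sketch of the gatekeeper token check (signature verification omitted!).
interface FirebaseClaims {
  user_id: string; // the Firebase UID
  exp: number;     // expiry, in seconds since the Unix epoch
}

// Decode a base64url string (atob is available in Workers and modern Node).
function base64UrlDecode(s: string): string {
  const b64 = s.replace(/-/g, "+").replace(/_/g, "/");
  return atob(b64 + "=".repeat((4 - (b64.length % 4)) % 4));
}

// Extract the claims from the JWT's payload segment (header.payload.signature).
function decodeJwtPayload(idToken: string): FirebaseClaims {
  return JSON.parse(base64UrlDecode(idToken.split(".")[1])) as FirebaseClaims;
}

// Reject expired tokens; otherwise return the caller's UID for access checks.
function uidFromToken(idToken: string, nowSeconds: number): string | null {
  const claims = decodeJwtPayload(idToken);
  return claims.exp > nowSeconds ? claims.user_id : null;
}
```

<p>Once the UID is known, the Worker can look up the file location and subscription status before touching B2.</p>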
    <div>
      <h3>Storage and Messaging</h3>
      <a href="#storage-and-messaging">
        
      </a>
    </div>
<p>Because files could be larger than 25MB, it wasn’t practical to store them in Workers KV, so I chose B2 Cloud Storage. B2 Cloud Storage is part of the Bandwidth Alliance with Cloudflare, which meant I wouldn’t be paying any egress fees for those files, which is a nice bonus.</p><p>As mentioned above, the Worker in front of B2 does all the authentication and authorization for uploads, downloads, and any other B2 operation.</p><p>Firebase is used for real-time communication with all clients. Subscription information is stored there, for example, and we use the Cloud Messaging APIs to send notifications to different clients when necessary.</p>
    <div>
      <h3>Subscription Management</h3>
      <a href="#subscription-management">
        
      </a>
    </div>
<p>Of course, a large part of building any SaaS application is dealing with billing and subscriptions. Luckily, RevenueCat simplifies that a lot and gives you one place for all information related to the mobile app stores and Stripe.</p><p>That meant I only had to implement one set of webhooks to receive events from all three providers, and all received events followed the same format.</p><p>Most of the outgoing subscription API calls also go through RevenueCat, but there are a few edge-cases where we need a direct connection to Stripe and the Google Play Store.</p>
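<p>Those direct Stripe calls can be made with plain fetch against Stripe's REST API, which expects form-encoded bodies. A sketch, with customer creation as an illustrative example of such a call (not necessarily one cClip makes):</p>

```typescript
// Stripe's REST API takes application/x-www-form-urlencoded bodies.
function formEncode(params: Record<string, string>): string {
  return Object.entries(params)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
}

// Example direct call: create a customer with the secret API key.
async function createCustomer(apiKey: string, email: string) {
  const res = await fetch("https://api.stripe.com/v1/customers", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: formEncode({ email }),
  });
  return res.json();
}
```

<p>This works from a Worker without any SDK, since fetch is a platform built-in.</p>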
    <div>
      <h2>Biggest Challenges</h2>
      <a href="#biggest-challenges">
        
      </a>
    </div>
    
    <div>
      <h3>Cross-platform subscriptions</h3>
      <a href="#cross-platform-subscriptions">
        
      </a>
    </div>
<p>Cross-platform subscriptions, or the synchronization of a user’s subscription across every platform, were a HUGE challenge. Everything had to be integrated perfectly, from ensuring double subscriptions don’t occur to live-time status changes. RevenueCat simplified the process substantially, but I still had to manage cross-platform cancellations and updates. I managed this by storing the status in Firebase Database and having the user listen for updates on their end. The information was write-locked so that only the verified Worker could modify statuses, preventing any status spoofing.</p>
    <div>
      <h3>File Quotas</h3>
      <a href="#file-quotas">
        
      </a>
    </div>
<p>Another tough challenge was file storage tracking. In my initial planning, I learned that B2’s API doesn’t support any getters for a directory’s storage total, so I would need to sum up the sizes of all files returned by the ‘b2_list_file_names’ API. This was costly, in terms of time and money, as the API is classed as one of the more expensive operations. On top of that, the API would’ve been called each time a user uploads or deletes a file. As a result, I created a counter that synchronizes with B2 on file modifications (uploads or deletions), which is stored in Firebase's Database by the Worker to allow for real-time updates on a user’s device through Firebase’s SDK change listeners.</p><p>However, I wanted to ensure users couldn’t modify their used storage count and that every upload operation was validated to verify the user had enough storage available for the file. The app interacts with Workers first, and the Worker, upon determining that the user has enough storage, returns an authorization for the upload.</p>
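<p>That upload gate reduces to a small pure check the Worker can run before authorizing anything with B2. A sketch (field names and the returned counter value are illustrative, not cClip's actual schema):</p>

```typescript
// Per-user storage state, as the Worker might read it from the database.
interface UserStorage {
  usedBytes: number;  // counter kept in sync with B2 on upload/delete
  quotaBytes: number; // limit from the user's subscription tier
}

// Decide whether an upload may proceed, and the counter's next value if so.
function authorizeUpload(
  storage: UserStorage,
  fileSizeBytes: number
): { allowed: boolean; newUsedBytes: number } {
  const newUsed = storage.usedBytes + fileSizeBytes;
  if (newUsed > storage.quotaBytes) {
    // Over quota: deny and leave the counter untouched.
    return { allowed: false, newUsedBytes: storage.usedBytes };
  }
  return { allowed: true, newUsedBytes: newUsed };
}
```

<p>Because only the Worker can write the counter, a client cannot spoof its way past this check.</p>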
    <div>
      <h2>Overall experience</h2>
      <a href="#overall-experience">
        
      </a>
    </div>
<p>Workers &amp; KV acted as a core solution to save on costly API calls and provide a completely serverless backend for cClip. On top of the quick nature of the platform, testing changes was extremely easy: a quick preview or staging environment was all I needed to test the newest changes to my API.</p><p>My biggest challenge with Workers was the lack of a Stripe library, as the Node.js Stripe library isn’t Workers-compatible yet, but I used the Stripe API endpoints directly with fetch, which worked without any issues. It is great, however, to see that Stripe launched their Workers-compatible SDK earlier today!</p><p>I am currently working on a revamped implementation of cClip Direct, the peer-to-peer transfer protocol that allows you to share any file across devices. Using Workers to manage notifications and signaling, the initial handshake process has been reworked, and I am looking to move the real-time communication from Firebase to Durable Objects and WebSockets. And of course, I can’t wait to try out R2 when that becomes available.</p><p>It has been so great to see that an application I made to scratch my own itch was able to scale into a fully fledged product without any problems whatsoever.</p><p><i>Thank you for following along with our Developer Spotlight series this week. We hope you enjoyed reading about how our customers are using the Cloudflare ecosystem to solve interesting problems. We are looking to do more spotlights in the future, so if you want to read more of these, or have a use-case you want us to write about, feel free to let me know on Twitter or any of the discussions linked below.</i></p> ]]></content:encoded>
            <category><![CDATA[Full Stack Week]]></category>
            <category><![CDATA[Developer Spotlight]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">4mbYh1dWA2N2zbAqYxMmvp</guid>
            <dc:creator>Erwin van der Koogh</dc:creator>
            <dc:creator>Tejas Mehta (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Developer Spotlight: Winning a Game Jam with Jamstack and Durable Objects]]></title>
            <link>https://blog.cloudflare.com/developer-spotlight-guido-zuidhof-full-tilt/</link>
            <pubDate>Mon, 15 Nov 2021 13:59:19 GMT</pubDate>
            <description><![CDATA[ Guido won the innovative category in one of the biggest Game Jams with a combination of Pages, Durable Objects, modern browser APIs, and a hefty amount of ingenuity. ]]></description>
<content:encoded><![CDATA[ <p></p><p>Welcome to a new blog post series called Developer Spotlight. In this series we will be showcasing interesting applications built on top of the Cloudflare Workers Ecosystem.</p><p>And to celebrate <a href="/durable-objects-ga/">Durable Objects going GA</a>, what better way to kick off the series than with a really cool tech demo of Durable Objects called Full Tilt?</p><p>Full Tilt by <a href="https://github.com/gzuidhof">Guido Zuidhof</a> is a game jam entry for Ludum Dare, one of the biggest and oldest game jams around, where he won first place in the innovation category. A game jam is like a hackathon for games, where you have a very short amount of time (usually 48-72 hours) to create a game from scratch.</p><p>We love Full Tilt, not just because Guido used Workers and Durable Objects to build a cool game where you control the action on your computer via your phone, but especially because it shows how powerful Durable Objects are. In less than 48 hours Guido was able to write all the logic to instantly spin up a personal gaming server anywhere in the world, as close to that player as possible. And it is so fast that you feel like you are controlling the computer directly.</p><p>But here is Guido, to tell you about his experience developing Full Tilt.</p><div></div><p>Last October I participated in the Ludum Dare 49 game jam. The thing I love about game jams is that the time constraint forces you to work quickly, iterate often, and, most importantly, cut scope.</p><p>In this game jam I created <a href="https://ldjam.com/events/ludum-dare/49/full-tilt">Full Tilt</a>, a game in which you seamlessly use your phone as a Wiimote-like motion controller for a game that is run on your laptop in a web browser. The secret sauce to making it so smooth was in combining a game jam with Jamstack and mixing in some Durable Objects.</p><p>You can try the game <a href="https://ld49.pages.dev">right here</a>.</p>
    <div>
      <h3>Phone as a controller</h3>
      <a href="#phone-as-a-controller">
        
      </a>
    </div>
    <p>Full Tilt is a browser game in which you move the player character by moving your hand around. There are a few different ways we can do that. One idea would be to use your computer's webcam to track a marker object through 3D space. While that is possible, it is tricky to get it working in any situation (a dark room can be problematic!) and also some players may feel uncomfortable turning on their webcam for a simple web game.</p><p>A smartphone has a bunch of sensors built in, including a magnetometer and gyroscope — those sensors are exactly what we need, and we can assume that the majority of our potential players have a smartphone within arm's reach. When a native mobile application uses these sensors, it adds a lot of friction (users now have to install an app), as well as a bunch of implementation work. Fortunately modern web browsers allow you to read from these sensors through the <a href="https://developer.mozilla.org/en-US/docs/Web/API/DeviceMotionEvent">DeviceMotion API</a>: a small web app will do the job perfectly!</p><p>The next challenge is communicating the sensor readings from the phone to the game running on the main computer. For that we will use a combination of Cloudflare Workers and Durable Objects. There needs to be some shared contact (i.e., a game server) that both the main computer and the smartphone talk to. Using a serverless solution makes a lot of sense for a web game. And as we only have 48 hours here, less stuff to worry about is a big selling point too. Workers and Durable Objects also make it easy to keep the game running after the game jam, without having to pay for — and more importantly worry about — keeping servers running.</p>
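<p>Those sensor readings can be mapped to a movement vector in just a few lines. A sketch of the idea (the clamping range and the mapping itself are my own illustrative assumptions, not Full Tilt's actual code):</p>

```typescript
// Tilt angles in degrees, as exposed by the DeviceOrientation events:
// beta is front/back tilt, gamma is left/right tilt.
interface Tilt { beta: number; gamma: number }

// Map a tilt reading to an x/y direction, each clamped to [-1, 1].
function tiltToDirection(t: Tilt, maxDegrees = 45): { x: number; y: number } {
  const clamp = (v: number) => Math.max(-1, Math.min(1, v / maxDegrees));
  return { x: clamp(t.gamma), y: clamp(t.beta) };
}

// In the browser, readings arrive via an event listener, e.g.:
// window.addEventListener("deviceorientation", (e) => {
//   const dir = tiltToDirection({ beta: e.beta ?? 0, gamma: e.gamma ?? 0 });
//   socket.send(JSON.stringify(dir)); // forward to the game via the room
// });
```

<p>The resulting small JSON messages are what the smartphone sends through the game room described next.</p>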
    <div>
      <h3>Setting up a line of communication</h3>
      <a href="#setting-up-a-line-of-communication">
        
      </a>
    </div>
<p>There are roughly two ways for browsers to communicate: through a shared connection (a game server somewhere) or through a peer-to-peer connection, where browsers talk directly to each other without a middleman by leveraging <a href="https://webrtc.org/">WebRTC</a> DataChannels.</p><p>WebRTC can be very challenging to make reliable and may still need a proxy server for some especially problematic network setups behind multiple NATs. Setting up a WebRTC connection may have been the perfect solution for the lowest possible latency, but it was out of scope for such a short game jam.</p><p>The game we are creating has low latency requirements: when I tilt my phone, the game should respond quickly. We can probably get away with anything under 100ms, although low double digits or less is definitely preferred! If I'm geographically located in one place and the game server is geographically close enough, then this game server will be able to pass messages from my smartphone to my laptop without too much delay. Edge computing is hot nowadays, and for this gaming use case it is not just a nice-to-have but also a necessity.</p><p>Enough background; let's architect the system.</p>
    <div>
      <h3>On-demand game rooms</h3>
      <a href="#on-demand-game-rooms">
        
      </a>
    </div>
<p>The first thing I needed to set up was game rooms: something that both the phone and the computer browsers could connect to, so they can pass messages between them.</p><p>Durable Objects allow us to create a large number of small stateful "mini servers" or "rooms" that multiple clients can connect to simultaneously over WebSockets, providing perfect small on-demand game servers. Think of Durable Objects as a single thread of JavaScript that runs in a data center close to where the user first asked for it to be created.</p><p>After the browser on the computer starts the game, it asks our Cloudflare Worker API to create a room for the current game session. This API request is a simple POST request; the server responds with a four-character room code that uniquely identifies the room that was created. The reason we want the room code to be short is because the user may need to copy this code and type it into their smartphone if they are unable to scan a QR code.</p><p>Humans are notoriously bad at copying random character strings as different characters look alike, so we use a limited character set that excludes the most commonly confused characters:</p>
            <pre><code>const DICTIONARY = "2345679ADEFGHJKLMNPQRSTUVWXYZ"; // 29 chars (no 0, O, I, 1, B, 8)</code></pre>
<p>A four-character code allows us to uniquely point to ~700,000 different rooms (29^4 = 707,281), which seems like enough even if our game got quite popular! What's more: these game sessions don't last forever. After some period of time (say 24 hours) we can be certain enough that the game session has ended, and we can reuse that room code.</p>
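<p>One possible implementation of the <code>generateRoomCode()</code> helper used by the handler below, repeating the dictionary above for completeness (a sketch; the actual jam code may differ):</p>

```typescript
// Reduced character set: no 0/O, I/1, or B/8 look-alikes (29 characters).
const DICTIONARY = "2345679ADEFGHJKLMNPQRSTUVWXYZ";

// Draw each character uniformly from the dictionary.
function generateRoomCode(length = 4): string {
  let code = "";
  for (let i = 0; i < length; i++) {
    code += DICTIONARY[Math.floor(Math.random() * DICTIONARY.length)];
  }
  return code;
}
```

<p>Collisions are handled by the caller, which simply retries until it finds a code that is free or expired.</p>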
    <div>
      <h3>Room code coordination</h3>
      <a href="#room-code-coordination">
        
      </a>
    </div>
<p>In your Cloudflare Worker script, there are two ways to create a Durable Object: either we ask for one to be created with an ID generated from a name, or we ask for one to be created with a random unique ID. The naive solution would be to create a Durable Object with an ID generated from the room code. However, that is not a good idea here because a Durable Object gets created in a data center that is geographically close to the end user.</p><p>A problematic situation would be if a user in Mumbai requests a room and gets room code ABCD. Initially, they play the game for a bit, and it works great.</p><p>The issue comes a week later when that room code is reused for another player based in Los Angeles. The game room Durable Object will be revived in Mumbai, and latency will be awful for our Los Angeles player. In the future, Durable Objects may get migrated between data centers, but that's not yet guaranteed.</p><p>Instead, what we can do is create a new Durable Object with a random ID for every new game session and keep a mapping from the four-character room code to this random ID. We are introducing some state in our system: we will need a central source of truth, which is where Durable Objects can come to the rescue again.</p><p>We will solve this by creating a single "Room Hub" Durable Object that keeps track of this mapping from room code to Durable Object ID. This Durable Object will have two endpoints: one to request a new room and another to look up a room's information.</p><p>Here's our request handler for the Room Request endpoint (the argument is a <a href="https://sunderjs.com/docs/context">Sunder Context</a>; Sunder is the web framework I used for this project):</p>
            <pre><code>export async function handleRoomRequest(ctx: Context&lt;Env&gt;) {
    const now = Date.now();    
    const reqBody = await ctx.request.json();

    // We make some attempts to find a room that is available..
    const attempts = 5

    let roomCode: string;
    let roomStorageKey: string;

    for (let i = 0; i &lt; attempts; i++) {
        roomCode = generateRoomCode();
        roomStorageKey = ROOM_STATE_PREFIX + roomCode;
        const room = await ctx.state.storage.get&lt;RoomData&gt;(roomStorageKey);
        if (room === undefined) {
            break;
        } else if (now - room.createdAt &gt; MAX_ROOM_AGE) {
            await ctx.state.storage.delete(roomStorageKey);
            break;
        }
        if (i === attempts-1) {
            return ctx.throw("Couldn't find available room code :(");
        }
    }

    const roomData: RoomData = {
        roomCode: roomCode,
        durableObjectId: reqBody.durableObjectId,
        createdAt: now,
    }

    await ctx.state.storage.put&lt;RoomData&gt;(roomStorageKey, roomData);

    ctx.response.body = {
        room: roomData
    };
    ctx.response.status = HttpStatus.Created;
}</code></pre>
            <p>In a nutshell, we generate a few room codes until we find one that has never been used or hasn't been used in a long enough time.</p><p>There is a subtle but important nuance in this code: the Durable Object gets created in the Cloudflare Worker that talks to our Room Hub, not in the Room Hub itself. Our Room Hub will run in a single data center somewhere on Cloudflare's network. If we create the game room from there it may still be far away from our end user!</p><p>Looking up a room's information is simpler, we return either the room data or status 404.</p>
            <pre><code>export async function handleRoomLookup(ctx: Context&lt;Env, {roomCode: string}&gt;) {
    const now = Date.now();

    let roomStorageKey = ROOM_STATE_PREFIX + ctx.params.roomCode;
    const roomData = await ctx.state.storage.get&lt;RoomData&gt;(roomStorageKey);

    if (roomData === undefined) {
        ctx.throw(404, "Room not found");
        return;
    }

    if (now - roomData.createdAt &gt; MAX_ROOM_AGE) {
        // About time we cleaned it up.
        await ctx.state.storage.delete(roomStorageKey);
        ctx.response.status = HttpStatus.NotFound;
        return;
    }

    ctx.response.body = {
        room: roomData
    };
}</code></pre>
            <p>The smartphone browser needs to connect to that same room. To make things easy for the user we generate a four character room code which points at that specific "Game Room" Durable Object. This way the user can take their smartphone, navigate to the website address <code>https://ld49.pages.dev</code>, and enter the code "ABCD" (or more commonly, they scan a QR code with the link <code>https://ld49.pages.dev?room=ABCD</code>).</p>
    <div>
      <h3>Game Room Durable Object</h3>
      <a href="#game-room-durable-object">
        
      </a>
    </div>
<p>Our game room Durable Object can be pretty simple since it is only responsible for passing along messages from the smartphone to the laptop with the latest sensor readings. I was able to modify the <a href="https://github.com/cloudflare/workers-chat-demo">Durable Objects chat room example</a> to do exactly this — a good time saver for a game jam!</p><p>When a browser connects, it connects in either a "peer" or a "host" role. Any messages sent by peers are forwarded to the host, and all messages from the host get forwarded to all peers. In this case, our host is the laptop browser running the game and the peer is the smartphone controller. Implementation-wise this means that we keep two lists of users: the peers and the hosts. Whenever a message comes in, we loop over the list to broadcast it to all connections of the other role. In practice, <a href="https://github.com/gzuidhof/LD49/blob/master/beamerserver/src/durableObjects/room/handler.ts">the code</a> is a bit more involved to deal with users disconnecting.</p><p>Full Tilt is a singleplayer game, but adapting it to be a multiplayer game would be easy with this setup. Imagine a Mario Kart-like game that runs in your browser in which multiple friends can join using their smartphones as their steering wheel controller! Unfortunately there was not enough time in this game jam to make a polished game.</p>
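<p>The host/peer forwarding described above can be sketched as follows. The session bookkeeping is deliberately simplified (the real handler also manages WebSocket lifecycles and disconnects, and the socket type here is a stand-in):</p>

```typescript
type Role = "host" | "peer";

interface Session {
  role: Role;
  socket: { send(msg: string): void }; // stand-in for a WebSocket
}

// Minimal relay: messages from one role go to every connection of the other.
class RelayRoom {
  sessions: Session[] = [];

  join(session: Session) {
    this.sessions.push(session);
  }

  broadcast(from: Role, message: string) {
    const targetRole: Role = from === "peer" ? "host" : "peer";
    for (const s of this.sessions) {
      if (s.role === targetRole) s.socket.send(message);
    }
  }
}
```

<p>With the smartphone joining as a peer and the game as the host, every sensor reading sent by the phone lands on the laptop, and nothing echoes back to other peers.</p>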
    <div>
      <h3>Front end</h3>
      <a href="#front-end">
        
      </a>
    </div>
<p>With the backend ready, we still need to create the actual game and the controller web app.</p><p>My initial plan was to create a game in which you fly a 3D plane by tilting your phone, collecting stars and completing aerobatic tricks. It would fit the theme "Unstable" as I expected this control method to be pretty flimsy! Since there was no time left to create anything close to that, it was time to cut scope.</p><p>I ended up using the Phaser game engine and put the entire system together in a Svelte app. This was certainly a challenge, as I had only used Phaser once before many years ago and had never used Svelte. Luckily, I was able to put together something simple quickly enough: a snake-like game in which you move a thingy to collect blips that appear randomly on the screen.</p><p>In order for a game to be a game there needs to be some goal for the user, usually some game over condition. I changed the game to go faster and faster over time, added a score counter, and added a game over condition where you lose the game if you touch the edge of the screen. I made my "art" in MS Paint, generated some sound effects using the online tool <a href="https://sfxr.me/">sfxr</a>, and "composed" the music soundtrack in <a href="https://musiclab.chromeexperiments.com/Song-Maker">Chrome's Music Lab Song Maker</a>.</p><p>Simultaneously I wrote a small client for my game server and duct-taped together the <a href="https://github.com/gzuidhof/LD49/blob/master/beamer-gamepad/src/components/Rotato.svelte">smartphone controller app</a> which is powered by the browser's DeviceMotion APIs. To distribute my game I used Cloudflare Pages, which worked like a charm the first time.</p>
    <div>
      <h3>All done</h3>
      <a href="#all-done">
        
      </a>
    </div>
    <p>And then the deadline arrived — I barely made it, but I submitted something I was proud of. A not-so-great game, but with an interesting backend system and a novel input method. Do try the game yourself <a href="https://ld49.pages.dev">here</a>, and the source code is available <a href="https://github.com/gzuidhof/LD49">here</a> (<b>warning</b>: it's pretty hacky code of course!).</p><p>The reception from my co-jammers was great. While everybody agreed the game itself and its graphics were awful, it was novel. The game ended up rated first in the <i>Innovation</i> category for this game jam, out of hundreds of other games!</p><p>Finally, is this the future of game servers? For indie developers and smaller studios, setting up and maintaining a globally distributed fleet of game servers is both a huge distraction and a significant cost. Serverless computing, and in particular Durable Objects, can provide an amazing solution.</p><p>But not every game is well-suited for a WebSocket-based backend. In some real-time games you are not interested in what happened a second ago: only the latest information matters. This is where the reliable, ordered nature of WebSockets can get in the way.</p><p>All in all, my first impressions of Durable Objects are very positive; it is a great tool to have in your toolbelt for all kinds of web projects. You can now tackle problems in mere minutes that would otherwise take days. I am very excited to see what other problems will be made easy by Durable Objects, even ones I haven’t thought of yet.</p> ]]></content:encoded>
            <category><![CDATA[Full Stack Week]]></category>
            <category><![CDATA[Developer Spotlight]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">6Wem9UEIPhVV2YuN0Wt3im</guid>
            <dc:creator>Erwin van der Koogh</dc:creator>
            <dc:creator>Guido Zuidhof (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Hackers Spill the Wine: Lockdown Led to Rise in Wine Domains and Wine Scammers]]></title>
            <link>https://blog.cloudflare.com/hackers-spill-the-wine-lockdown-led-to-rise-in-wine-domains-and-wine-scammers/</link>
            <pubDate>Wed, 07 Apr 2021 15:12:00 GMT</pubDate>
            <description><![CDATA[ Area 1 Security and Recorded Future researchers report a significant increase in the number of new wine-themed phishing domains registered starting in April 2020, and continuing through at least March 2021. ]]></description>
            <content:encoded><![CDATA[ <p><i>This blog originally appeared in April 2021 on the Area 1 Security website, and was issued in advance of Cloudflare's acquisition of Area 1 Security on April 1, 2022. </i><a href="/why-we-are-acquiring-area-1/"><i>Learn more</i></a><i>.</i></p><p>This report was produced jointly with researchers from Area 1 Security. Area 1 Security preemptively stops Business Email Compromise, malware, ransomware and targeted phishing attacks, and as such has a multi-petabyte corpus of real-time active phishing campaigns. Recorded Future thanks the team at Area 1 for their support on this research.</p><p>Staying connected with friends, family, and <a href="https://www.sandiegouniontribune.com/twp/story/2020-11-15/cooking-lessons-care-packages-virtual-happy-hours-how-companies-are-helping-employees-cope">even co-workers</a> during the pandemic, and especially during lockdown periods, is important. Often, this connection came in the form of <a href="https://www.nytimes.com/2020/03/20/well/virus-virtual-happy-hour.html">virtual happy hours</a> where friends and family could get together over Zoom, or another video conferencing platform, to catch up over drinks. Companies were quick to jump on this trend, with <a href="https://www.marketwatch.com/story/walmart-to-host-its-latest-tiktok-event-as-more-companies-use-the-platform-to-tap-young-shoppers-11615490294">established companies hosting virtual livestreams</a> and startups helping you plan the best virtual happy hour, whether that means inviting a <a href="https://www.sweetfarm.org/goat-2-meeting">goat to your virtual call</a> or providing you <a href="https://richmondbizsense.com/2020/07/13/pandemic-pivot-richmond-startup-brings-cocktails-to-your-living-room/">everything you need to make the perfect drink</a>.</p><p>As interest in virtual happy hours and get-togethers increased, so did wine-themed domain registrations. 
Recorded Future noted a significant increase in the number of new wine-themed domains registered starting in April 2020 and continuing through at least March 2021.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7vtxPpNvps2YNcU1sDcsl/ef39faae4d8e1570cc3420e6a0c28910/image2-23.png" />
            
            </figure><p>To conduct this search, we queried new domain registrations containing one or more of the following words:</p><ul><li><p>Wine</p></li><li><p>Vino</p></li><li><p>Champagne</p></li><li><p>Bordeaux</p></li><li><p>Burgundy</p></li><li><p>Chardonnay</p></li><li><p>Merlot</p></li><li><p>Cabernet</p></li><li><p>Sauvignon</p></li><li><p>Pinot</p></li></ul><p>The total number of wine-themed domains registered during this period is higher, as some search terms were intentionally excluded, either to avoid too many false positives (for example, Rosé or Napa) or because they were unlikely to be used by scammers (less popular grapes such as Riesling, Carménère, or Tannat did not significantly add to the total number of domain registrations or generate any potentially malicious matches).</p><p>New wine-themed domain registrations hovered between 3,000 and 4,000 per month until March 2020. In March 2020, Recorded Future noted a small uptick in new domain registrations, at almost 5,500. In April 2020, the number jumped to almost 7,200, and in May 2020 it skyrocketed to 12,400. From June 2020 onward, the number of new wine-themed domain registrations fluctuated between 7,000 and 9,500, or two to three times the number registered pre-COVID-19. The total number of wine-related domains registered between April 2020 and March 2021 containing the above keywords was 96,489.</p>
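<p>As a sketch of how such a keyword screen might work (the actual Recorded Future query infrastructure is not public; this is just an illustrative case-insensitive substring filter over newly registered domain names):</p>

```javascript
// Term list mirroring the report's keywords; matching is a plain
// substring test, which is also why ambiguous terms like "rosé" or
// "napa" were excluded upstream to limit false positives.
const WINE_TERMS = [
  "wine", "vino", "champagne", "bordeaux", "burgundy",
  "chardonnay", "merlot", "cabernet", "sauvignon", "pinot",
];

function isWineThemed(domain) {
  const name = domain.toLowerCase();
  return WINE_TERMS.some((term) => name.includes(term));
}

// Screen a batch of hypothetical new registrations.
const flagged = [
  "lueriawinery.com",
  "example.org",
  "best-merlot-deals.net",
].filter(isWineThemed);
```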
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4XyeLm6k49cUxmfAjDRGtO/9c52a8840e54777809dee3742fa9e17b/image1-29.png" />
            
            </figure><p>Malicious (for the purpose of this report, the term malicious encompasses domains that Recorded Future scored as Suspicious, Malicious and Very Malicious) wine-themed domains followed a similar, though slightly delayed, timeline. Recorded Future tracked 278 malicious wine-themed domains registered in April 2020; the number jumped to 668 in May 2020 before falling back to 473 in June 2020. Since then, the number has fluctuated between 230 and 430 new malicious wine-themed domains registered each month. The total number of malicious wine-themed domains identified from April 2020 through the end of March 2021 was 4,389.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Wa5S5ZlVGO30wliEPZ8gp/7f197e2ec6e970ae14e247ec21d76804/image3-12.png" />
            
            </figure><p>It appears that it took some time for cybercriminals to catch on to the idea of using wine in malicious activities. Tracking malicious wine-themed domains as a percentage of all wine-themed domains registered shows that the peak came in June 2020, at 7%. Since then, the percentage has remained between 3% and 5%, which is relatively low compared to other domain themes commonly used for malicious activity, such as <a href="https://www.recordedfuture.com/opportunism-behind-cyberattacks-during-pandemic/">COVID-19</a>.</p>
    <div>
      <h3>Partnering with Area 1</h3>
      <a href="#partnering-with-area-1">
        
      </a>
    </div>
    <p>Working with Area 1 Security, we wanted to understand how these newly registered and newly observed wine-related domain names were being weaponized in phishing campaigns. From April 2020 through April 1, 2021, Area 1 caught over 25,000 examples of the above domains being used in email campaigns targeting companies ranging from the Fortune 500 to small and medium-sized businesses, across all major industries from consumer products to financial services, healthcare, aerospace and beyond. 74.71% of the email campaigns were classified as SPAM, meaning that while they were merely unsolicited or unwanted at face value, they might have been early-stage reconnaissance campaigns.</p><p>More seriously, about 13.5% of the emails associated with the identified domains contained suspicious or malicious content (links or files), and 11.74% were Type 1 Business Email Compromise phishing emails (Basic) that attempt to trick the recipient into believing they came from a recognized sender. Finally, a very small percentage, 0.03%, of the emails were tied to Type 3 and Type 4 Business Email Compromise phishing (Sophisticated), a type of attack that is increasing in volume and is directly responsible for significant business losses.</p><p>Overall, these results show that spam and phishing campaigns are aware of the growing interest in wine and are using that interest to push malicious activity.</p><p>One example is a voicemail phishing campaign, detected by Area 1 Security but missed by Microsoft Office 365 ATP, in which the malicious link leads to a subdomain of <code><i>lueriawinery[dot]com/</i></code><i>.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4jami2o4uJOeiBTVD49s48/f98fdeed394c61417c96485459164008/image4-12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JkD3IzL2WxFBVtsegrtgu/3983ed4c04ddd12cfad9821757e1808a/image5-5.png" />
            
            </figure>
    <div>
      <h3>Actionable Advice</h3>
      <a href="#actionable-advice">
        
      </a>
    </div>
    <ul><li><p>Protect your organization against the domains listed in the appendix, as these domains represent the biggest threat from our findings.</p></li><li><p>While this report focuses on wine, spam and phishing campaigns increasingly use trends and current events, <a href="https://www.recordedfuture.com/coronavirus-panic-exploit/">such as COVID-19</a> or <a href="https://www.recordedfuture.com/pandemic-retail-phishing-campaigns/">increased interest in online retail and food delivery apps</a>, as lures. Ensure that any phishing awareness programs run by your organization emphasize the adaptability of attackers and their ability to quickly switch out topics in their email lures.</p></li><li><p>Whenever possible, filter potentially malicious emails at the edge. Employee training is important, but preventing the email from ever reaching an employee is a much better <a href="https://www.cloudflare.com/learning/email-security/how-to-prevent-phishing/">deterrent</a>.</p></li><li><p>Running your own custom email infrastructure requires network administrators to be perfect every single day. We recommend the use of cloud email infrastructure such as Google’s GSuite or Microsoft’s Office 365 in combination with a <a href="https://www.cloudflare.com/zero-trust/products/email-security/">cloud email security solution</a>.</p></li><li><p>Augment cloud inbox providers such as GSuite and O365 with intelligence from Recorded Future and Area 1 Security. As with any other aspect of <a href="https://www.cloudflare.com/network-security/">network security</a>, a defense in depth strategy for email protection offers the best chance for success.</p></li></ul><p><i>Oren J. Falkowitz, Founder and CEO of Area 1 Security, is a serial entrepreneur and cybersecurity industry visionary. He led Area 1 Security as its co-founder and CEO for the first seven years, with the mission to discover and eliminate targeted phishing attacks before they cause damage. 
Previously, Oren held senior positions at the National Security Agency (NSA) and United States Cyber Command (USCYBERCOM), where he focused on Computer Network Operations &amp; Big Data. That’s where he realized the immense need for preemptive cybersecurity.</i></p><p><i>Allan Liska is an intelligence analyst at Recorded Future. Allan has more than 15 years’ experience in information security and has worked as both a blue teamer and a red teamer for the intelligence community and the private sector. Allan has helped countless organizations improve their security posture using more effective and integrated intelligence.</i></p> ]]></content:encoded>
            <category><![CDATA[Email Security]]></category>
            <category><![CDATA[Cloud Email Security]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">4XFnkUbtSFm07DZnYGypZB</guid>
            <dc:creator>Allan Liska (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Addressing the Web’s Client-Side Security Challenge]]></title>
            <link>https://blog.cloudflare.com/addressing-the-webs-client-side-security-challenge/</link>
            <pubDate>Tue, 03 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Modern web architecture relies heavily on JavaScript and enabling third-party code to make client-side network requests. These innovations are built on client-heavy frameworks such as Angular, Ember, React, and Backbone that leverage the processing power of the browser ]]></description>
            <content:encoded><![CDATA[ <p>Modern web architecture relies heavily on JavaScript and enabling third-party code to make client-side network requests. These innovations are built on client-heavy frameworks such as Angular, Ember, React, and Backbone that leverage the processing power of the browser to enable the execution of code directly on the client interface/web browser. These third-party integrations provide richness (chat tools, images, fonts) or extract analytics (Google Analytics). Today, up to 70% of the code executing and rendering in your customer’s browser comes from these integrations. All of these software integrations provide avenues for potential vulnerabilities.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7xMumJ99IceKS6OvahVWev/a65476fbf1ee5a22eb883b7d6c165cc4/pasted-image-0-2.png" />
            
            </figure><p>Unfortunately, these unmanaged, unmonitored integrations operate without security consideration, providing an expansive attack surface that attackers have routinely exploited to compromise websites. Today, only 2% of the Alexa 1000 global websites have been <a href="https://go.talasecurity.io/state-of-the-web-report-2019">found</a> to deploy client-side security measures to protect websites and web applications against attacks such as Magecart, XSS, credit card skimming, session redirects and website defacement.</p>
    <div>
      <h3>Improving website security and ensuring performance with Cloudflare Workers</h3>
      <a href="#improving-website-security-and-ensuring-performance-with-cloudflare-workers">
        
      </a>
    </div>
    <p>In this post, we focus on how Cloudflare Workers can be used to improve security and ensure the high performance of web applications. Tala has joined Cloudflare’s marketplace to further our common goals of ensuring website security, preserving data privacy and assuring the integrity of web commerce. Tala’s innovative and unobtrusive solution, coupled with Cloudflare’s global reach, offers a compelling, highly effective solution for combatting the acceleration of client-side website attacks.</p>
    <div>
      <h4>About Cloudflare Workers</h4>
      <a href="#about-cloudflare-workers">
        
      </a>
    </div>
    <p>Cloudflare Workers is a globally distributed serverless compute platform that runs across Cloudflare’s network of 200+ locations worldwide. Workers is designed for flexibility, with multiple use cases ranging from customizing configuration of Cloudflare services and features to building full, independent applications.</p>
    <div>
      <h4>Cloudflare &amp; Tala</h4>
      <a href="#cloudflare-tala">
        
      </a>
    </div>
    <p>Tala has integrated its "web module" capabilities into Cloudflare’s Service Worker platform to enable a serverless, instantaneous deployment. This allows customers to activate enterprise-grade website security quickly and efficiently from <a href="https://www.cloudflare.com/network/">Cloudflare's 200+ reliable and redundant edge locations</a> around the world. Tala automates the activation of standards-based, browser-native security controls to deliver highly effective security, without impacting website performance or user experience.</p>
    <div>
      <h4>About Tala</h4>
      <a href="#about-tala">
        
      </a>
    </div>
    <p>Tala secures millions of web sessions for large providers in verticals such as financial services, online retail, payment processing, tech, fintech and education. We secure websites and web applications by continuously interrogating application architecture to enable the automation and continuous deployment of precise, browser-native, standards-based policies &amp; controls. Our technology allows organizations to deploy standards-based website security with near-zero impact to performance and without the operational burdens associated with the application and administration of these policies.</p>
    <div>
      <h4>How Tala Works</h4>
      <a href="#how-tala-works">
        
      </a>
    </div>
    <p>Tala’s solution is powered by an analytics engine that evaluates over 150 unique indicators of a web page’s behavior and integrations. This dynamic analytics engine scans continuously, working in conjunction with an AI-assisted automation engine that activates and tunes standards-based security capabilities, like Content Security Policy (CSP), Subresource Integrity (SRI), Strict Transport (HSTS), Sandboxing (iFrame rules), Referrer Policy, Trusted Types, Certificate Stapling, Clear Site Data and others.</p><p>The automation of browser-native security controls provides comprehensive security without requiring any changes to application code and has near-zero impact on website performance. Tala’s solution can be installed via the Cloudflare Workers Integration to deliver instantaneous client-side security.</p><p>With Tala, rich website analytics remain available without the risk of client-side website attacks. Website performance is preserved, administration is accelerated, and the need for costly and continuous remediation or incident response is minimized.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/cUtPnfPXMvbY2cCDoK10n/0f634b49f892a09613103d330aaef8ed/pasted-image-0--1-.png" />
            
            </figure>
    <div>
      <h4>How Tala Integrates with Cloudflare Workers</h4>
      <a href="#how-tala-integrates-with-cloudflare-workers">
        
      </a>
    </div>
    <p>Customers can deploy Tala-generated security policies (discussed in the section above) on their website using Cloudflare’s Service Workers. The customer installs the Tala Service Worker on their Cloudflare account using Tala’s installation scripts. These scripts invoke Cloudflare’s APIs to upload and enable the Tala Service Worker, as well as upload the customized Tala security policies to Cloudflare’s KV store.</p><p>Once the installation is complete, the Tala Service Worker is invoked every time an end user requests the customer’s site. During the response from Cloudflare, the Tala Service Worker applies the appropriate Tala security policies. Here are the steps involved:</p><ul><li><p>The Tala Service Worker sees the HTML content coming from the origin web server</p></li><li><p>The Tala Service Worker parses the HTML page</p></li><li><p>Based on the content of the page, the Tala Service Worker inserts the appropriate security controls (e.g., CSP, SRI), which could include a combination of HTTP security headers (e.g., referrer policy, CSP, HSTS) as well as page insertions (e.g., nonces, SRI hashes)</p></li></ul><p>Periodically, the Tala Service Worker polls the Tala cloud service to check for any security policy updates and, if required, fetches the latest policies. For more details on how to install Tala into Cloudflare’s Service Workers, please read the <a href="https://talasec.atlassian.net/wiki/spaces/ECD/pages/242483573/Tala+Deployment+through+Cloudflare">installation manual</a>.</p>
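<p>The header-insertion step above can be sketched as a plain function over a response's headers. This is a simplified, hypothetical stand-in for the Tala Service Worker: a real Cloudflare Worker would do this on a fetch event by cloning the Response, and the policy values below are made up for illustration:</p>

```javascript
// Given a response's headers and a policy, return a new header map with
// the policy's security headers added; existing headers are preserved.
function applySecurityPolicy(headers, policy) {
  const out = { ...headers };
  if (policy.csp) out["Content-Security-Policy"] = policy.csp;
  if (policy.hsts) out["Strict-Transport-Security"] = policy.hsts;
  if (policy.referrer) out["Referrer-Policy"] = policy.referrer;
  return out;
}

// Example: secure an HTML response with hypothetical policy values.
const secured = applySecurityPolicy(
  { "Content-Type": "text/html" },
  {
    csp: "default-src 'self'; script-src 'self' 'nonce-abc123'",
    hsts: "max-age=31536000; includeSubDomains",
    referrer: "strict-origin-when-cross-origin",
  }
);
```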
    <div>
      <h4>Deploy Client-Side Website Security</h4>
      <a href="#deploy-client-side-website-security">
        
      </a>
    </div>
    <p>Client-side vulnerability is a significant and accelerating problem. Workers can provide the speed and capability to ensure your organization isn’t the next victim of a growing volume of successful attacks targeting widespread website and web application vulnerabilities. Standards-based security offers the most effective, comprehensive solution to safeguard against these attacks.</p><p>The combination of Cloudflare and Tala can help you expedite deployment. We’d love to hear from you and explore a Workers deployment!</p><p>The <a href="https://go.talasecurity.io/tala-cloudflare-integration">Tala solution</a> is available today!</p><ul><li><p>Cloudflare Enterprise customers: Reach out to your dedicated Cloudflare account manager to learn more and start the process.</p></li><li><p>Tala customers and Cloudflare customers: Reach out to <a href="https://www.talasecurity.io/get-started-2/">Tala</a> to learn more and start the process. You can sign up for and learn more about using Cloudflare Workers <a href="https://workers.cloudflare.com/">here</a>!</p></li></ul> ]]></content:encoded>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">2fOLQafTdt0Jrz5GOKjr5c</guid>
            <dc:creator>Swapnil Bhalode (Guest Author) </dc:creator>
        </item>
        <item>
            <title><![CDATA[Today, Chrome Takes Another Step Forward in Addressing the Design Flaw That is an Unencrypted Web]]></title>
            <link>https://blog.cloudflare.com/today-chrome-takes-another-step-forward-in-addressing-the-design-flaw-that-is-an-unencrypted-web/</link>
            <pubDate>Tue, 24 Jul 2018 15:04:00 GMT</pubDate>
            <description><![CDATA[ I still remember my first foray onto the internet as a university student back in the mid 90's. It was a simpler time back then, of course; we weren't doing our personal banking or our tax returns or handling our medical records so encrypting the transport layer wasn't exactly a high priority.  ]]></description>
            <content:encoded><![CDATA[ <p><i>The following is a guest post by Troy Hunt, award-winning </i><a href="https://www.troyhunt.com/about/"><i>security expert</i></a><i>, </i><a href="https://www.troyhunt.com/"><i>blogger</i></a><i>, and Pluralsight author. He’s also the creator of the popular </i><a href="https://haveibeenpwned.com/"><i>Have I been pwned?</i></a><i>, the free aggregation service that helps the owners of over 5 billion accounts impacted by data breaches.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5KzdaWJzeTAjCK93nkKTlP/47b50fe081c28d84805f57ff82036494/chrome-68-troy-hhunt-quote_0.75x.png" />
            
            </figure><p>I still clearly remember my first foray onto the internet as a university student back in the mid 90's. It was a simpler online time back then, of course; we weren't doing our personal banking or our tax returns or handling our medical records so the whole premise of encrypting the transport layer wasn't exactly a high priority. In time, those services came along and so did the need to have some assurances about the confidentiality of the material we were sending around over other people's networks and computers. SSL as it was at the time was costly, but hey, banks and the like could absorb that given the nature of their businesses. However, at the time, there were all sorts of problems with the premise of serving traffic securely ranging from the cost of certs to the effort involved in obtaining and configuring them through to the performance hit on the infrastructure. We've spent the last couple of decades fixing these shortcomings and subsequently, driving site owners towards a more secure web. Today represents just one more step in that journey: as of today, Chrome is flagging all non-secure connections as... not secure!</p><p>I want to delve into the premise of this a little deeper because certainly there are those who question the need for the browser to be so shouty about a lack of encryption. I particularly see this point of view expressed as it relates to sites without the need for confidentiality, for example a static site that collects no personal data. But let me set the stage for this blog post because we're actually addressing a very fundamental problem here:</p>
    <div>
      <h3>The push for HTTPS is merely addressing a design flaw with the original, unencrypted web.</h3>
      <a href="#the-push-for-https-is-merely-addressing-a-design-flaw-with-the-original-unencrypted-web">
        
      </a>
    </div>
    <p>I mean think about it - we've been plodding along standing up billions of websites and usually having no idea whether requests are successfully reaching the correct destination, whether they've been observed, tampered with, logged or otherwise mishandled somewhere along the way. We'd <i>never</i> sit down and design a network like this today but as with so many aspects of the web, we're still dealing with the legacy of decisions made in a very different time.</p><p>So back to Chrome for a moment and the "Not secure" visual indicator. When I run training on HTTPS, I load up a website in the browser over a secure connection and I ask the question - "How do we know this connection is secure"? It's a question usually met by confused stares as we literally see the word "Secure" sitting up next to the address bar. We know the connection is secure because the browser tells us this explicitly. Now, let's try it with a site loaded over an insecure connection - "How do we know this connection is not secure"? And the penny drops because the answer is always "We know it's not secure because it doesn't tell us that it is secure"! Isn't that an odd inversion? <i>Was</i> an odd inversion because as of today, both secure and non-secure connections get the same visual treatment so finally, we have parity.</p><p>But is parity what we actually want? Think back to the days when Chrome didn't tell you an insecure connection wasn't secure (ah, isn't it nice that's in the past already?!); browsers could get away with this <i>because that was the normal state!</i> Why explicitly say anything when the connection is "normal"? But now we're changing what "normal" means and in the future that means we'll be able to apply the same logic as Chrome used to: visual indicators for the normal state won't be necessary or in other words, we won't need to say "secure" any more. Instead, we can focus on the messaging around deviations from normal, namely connections that aren't secure. 
Google has already flagged that we'll see this behaviour in the future too; it's just a matter of time.</p><p>Let's take a moment to reflect on what that word "normal" means as it relates to secure comms on the internet because it's something that changes over time. A perfect example of that is <a href="https://scotthelme.co.uk/how-widely-used-are-security-based-http-response-headers/">Scott Helme's six-monthly Alexa Top 1M report</a>. A couple of times a year, Scott publishes stats on the adoption of a range of different security constructs by the world's largest websites. One of those security constructs is the use of HTTPS or more specifically, sites that automatically redirect non-secure requests to the secure scheme. In that report above, he found that 6.7% of sites did this in August 2015. Let's have a look at just how quickly that number has changed and for ease of legibility, I'll list them all below followed by the change from the previous scan 6 months earlier:</p><ul><li><p>Aug 2015: 6.7%</p></li><li><p>Feb 2016: 9.4% (+42%)</p></li><li><p>Aug 2016: 13.8% (+46%)</p></li><li><p>Feb 2017: 20.0% (+45%)</p></li><li><p>Aug 2017: 30.8% (+48%)</p></li><li><p>Feb 2018: 38.4% (+32%)</p></li></ul><p>That's an <i>astonishingly</i> high growth rate, pretty much doubling every 12 months. We can't sustain that rate forever, of course, but depending on how you look at it, the numbers are even higher than that. <a href="https://docs.telemetry.mozilla.org/datasets/other/ssl/reference.html">Firefox's telemetry</a> suggests that as of today, 73% of all requests are served over a secure HTTPS connection. That number is much higher than Scott's because the world's largest websites implement HTTPS more frequently than the smaller ones, and those large sites account for a disproportionate share of requests. In fact, Scott's own figures graphically illustrate this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5vArpeYnYlrbUxOeOqrOP9/ac69cc49464fb3b115d13f51098a5451/Troy_Hunt_Image.png" />
            
            </figure><p>Each point on the graph is a cluster of 4,000 websites with the largest ones on the left and the smallest on the right. It's clear that well over half of the largest sites are doing HTTPS by default whilst the smallest ones are much closer to one quarter. This can be explained by the fact that larger services tend to be those that we've traditionally expected higher levels of security on; they're <a href="https://www.cloudflare.com/ecommerce/">e-commerce sites</a>, social media platforms, banks and so on. Paradoxically, those sites are also the ones that are less trivial to roll over to HTTPS whilst the ones to the right of the graph are more likely to literally be lunchtime jobs. Last month I produced <a href="https://httpsiseasy.com/">a free 4-part series called "HTTPS Is Easy"</a> and part 1 literally went from zero HTTPS to full HTTPS across the entire site in 5 minutes. It took another 5 minutes to get a higher grade than what most banks have for their transport layer encryption. HTTPS really <i>is</i> easy!</p><p>Yet still, there remain those who are unconvinced that secure connections are always necessary. Content integrity, they argue, is really not that important, what can a malicious party actually do with a static site such as a blog anyway? Good question! In no particular order, <a href="https://securitywarrior9.blogspot.com/2018/06/cross-site-request-forgery-intex-router.html">they can inject script to modify the settings of vulnerable routers and hijack DNS</a>, <a href="https://blog.torproject.org/egypt-internet-censorship">inject cryptominers into the browser</a>, <a href="https://citizenlab.ca/2015/04/chinas-great-cannon/">weaponise people's browsers into a DDoS cannon</a> or <a href="https://beefproject.com/">serve malware or phishing pages to unsuspecting victims</a>. 
Just to really drive home the real-world risks, <a href="https://www.troyhunt.com/heres-why-your-static-website-needs-https/">I demo'd all those in a single video a couple of weeks ago</a>. Mind you, the sorts of sites whose owners are questioning the need for HTTPS are precisely the sorts of sites that tend to be 5-minute exercises to put behind Cloudflare, so regardless of debates about how necessary it is, the actual effort involved in doing it is usually negligible. Oh - and it'll give you access to HTTP/2 and Brotli compression, which are both great for <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">performance</a> <i>and</i> only work over HTTPS, plus enable you to access <a href="https://developer.mozilla.org/en-US/docs/Web/Security/Secure_Contexts/features_restricted_to_secure_contexts">a whole range of browser features that are only available in secure contexts</a>.</p><p>Today is just one more correction in a series that's been running for some time now. In January last year it was both Chrome and Firefox flagging insecure pages accepting passwords or credit cards as not secure. In October Chrome began showing the same visual indicator when entering data into <i>any</i> non-secure form. In March this year Safari on iOS began showing "Not Secure" when entering text into an insecure login form. We all know what's happened today and, as I flagged earlier, the future holds yet more changes as we move towards a more "secure by default" web. (Incidentally, note how it's multiple browser vendors driving this change; it's by no means solely Google's doing.)</p><p>Bit by bit, we're fixing the design flaws of the web.</p><hr /><p><b>A Note from Cloudflare: </b><i>In June, Troy authored a post entitled “</i><a href="https://www.troyhunt.com/https-is-easy/"><i>HTTPS is Easy!</i></a><i>,” which highlights the simplicity of converting a site to HTTPS with Cloudflare. 
It’s worth noting that, as indicated in his post, we were (pleasantly) surprised to see this series.</i></p><p><i>At Cloudflare, it’s our mission to build a better Internet, and a part of that is democratizing modern web technologies to everyone. This was the motivation for launching Universal SSL in 2014 - a move that made us the first company to offer SSL for free to anyone. With the release of Chrome 68, we want to continue making HTTPS easy, and have launched a free tool to help any website owner troubleshoot common problems with HTTPS configuration.</i></p>
    <div>
      <h4>Are you Chrome 68 ready? Check your website with our <a href="https://cfl.re/ssl-test">free SSL Test</a>.</h4>
      <a href="#are-you-chome-68-ready-check-your-website-with-our">
        
      </a>
    </div>
     ]]></content:encoded>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Chrome]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">5l5nzMLUg5cHi67NAHy0wK</guid>
            <dc:creator>Troy Hunt (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Proxying traffic to Report URI with Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/proxying-traffic-to-report-uri-with-cloudflare-workers/</link>
            <pubDate>Tue, 17 Jul 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ With Report URI growing, we're seeing a larger variety of sites use the service. With that diversity comes additional requirements that need to be met, some simple and some of them less so.  ]]></description>
            <content:encoded><![CDATA[ <p><i>The following is a guest post by Scott Helme, a Security Researcher, international speaker, and </i><a href="https://scotthelme.co.uk/"><i>blogger</i></a><i>. He's also the founder of the popular </i><a href="http://securityheaders.com"><i>securityheaders.com</i></a><i> and </i><a href="http://report-uri.com"><i>report-uri.com</i></a><i>, free tools to help people deploy better security.</i></p><p>With the continued growth of Report URI we're seeing a larger and larger variety of sites use the service. With that diversity comes additional requirements that need to be met, some of them simple and some of them less so. Here's a quick look at those challenges and how they can be solved easily with a Cloudflare Worker.</p>
    <div>
      <h4>Sending CSP Reports</h4>
      <a href="#sending-csp-reports">
        
      </a>
    </div>
    <p>When a browser sends a CSP report for us to collect at Report URI, we receive the JSON payload, but we also have access to two other pieces of information: the client IP and the User Agent string. We never store, collect or analyse the client IP (we simply don't need or want to), and all we do with the UA string is extract the browser name, like Chrome or Firefox. Most site operators are perfectly happy with our approach here and things work just fine. There are, however, some cases where the site operator simply doesn't want us to have this information, and others where they can't allow us access to it because of restrictions placed on them by a regulator. The other common thing that comes up, which I honestly never anticipated, was simply the perception of the reporting endpoint being a 3rd party address. There are various ways we can and do tackle these problems.</p>
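To make the mechanics concrete, the browser delivers a CSP violation report as a JSON payload POSTed to the reporting endpoint, shaped roughly like this (the site and blocked URIs here are purely illustrative):

```
{
  "csp-report": {
    "document-uri": "https://example.com/page",
    "violated-directive": "script-src 'self'",
    "effective-directive": "script-src",
    "blocked-uri": "https://evil.example/inject.js",
    "original-policy": "script-src 'self'; report-uri https://scotthelme.report-uri.com/r/d/csp/enforce"
  }
}
```

Alongside that payload come the client IP and the User-Agent header discussed above.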
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OgeJOCwWvsrbooIrPeE6x/192aae816ffa13cf69636d171762bd63/logo-2.png" />
            
            </figure>
    <div>
      <h4>CNAME</h4>
      <a href="#cname">
        
      </a>
    </div>
    <p>Up until now, if a client didn't want to report to a 3rd party address we would ask them to CNAME their subdomain to us and run a dedicated instance that would ingest reports using their subdomain. We take control of certificate issuance and renewal and the customer doesn't need to do anything further. This is a fairly common approach across many different technical requirements, and it's something that has worked well for us. The problem is that it does come with some administrative overheads for both parties. From our side, managing separate infrastructure is an additional burden: we're responsible for a subdomain belonging to someone else, and there are more moving parts in the system, increasing complexity. I was curious if there was another way.</p>
    <div>
      <h4>HTTP Proxy</h4>
      <a href="#http-proxy">
        
      </a>
    </div>
    <p>One idea that we discussed with a customer a while back, but never deployed, was for them to run a proxy on premise. They could report to their own endpoint under their own control and simply forward reports to their Report URI reporting address. This means they could shield the client IP from us, mask the User Agent string if required and generally do any sanitisation they liked on the payload. The problem with this was that it just seemed like an awful lot of work; I'd much rather have discussed deploying Report URI on premise instead. The client is also still at risk of things like accidentally DDoSing their endpoint, which removes one of the good reasons to use Report URI.</p>
    <div>
      <h4>Finding Another Way</h4>
      <a href="#finding-another-way">
        
      </a>
    </div>
    <p>For the most part our current model was working really well, but there were some customers who had a hard requirement to not send reports directly to us. Our on premise solution also isn't ready for prime time just yet, so we needed something that we could offer, without it requiring too much of the overhead mentioned above. That's when I had an idea that might just cut it.</p>
    <div>
      <h4>JavaScript On A Plane</h4>
      <a href="#javascript-on-a-plane">
        
      </a>
    </div>
    <p>I was sat on a flight just a few days ago, and I never like to waste time. When I'm sat in the car on the way to the airport, sat in the airport or sat on my flight, I'm working. Those are precious hours that can't be wasted and during a recent flight between Manchester and Vienna I was playing around with Cloudflare Workers in their playground. I was tinkering with a worker to add Security Headers to websites, which has since <a href="https://scotthelme.co.uk/security-headers-cloudflare-worker/">been launched</a>, and whilst inspecting the headers object and looking through the headers that were in the request I saw the User Agent string. "Oh hey, I could remove that if I wanted to" I thought to myself, and then the rapid fire series of events triggered in my brain when you're in the process of realising a great idea. I could remove the UA header... From the request... Then the worker can make any subrequests it likes... Requests to a different origin... THE WORKER CAN RECEIVE A REPORT AND FORWARD IT!!!</p><p>I realised that (of course) a Cloudflare Worker could be used to receive reports on a subdomain of your own site and then forward them to your reporting address at Report URI.</p>
    <div>
      <h4>Using Cloudflare Workers As A Report Proxy</h4>
      <a href="#using-cloudflare-workers-as-a-report-proxy">
        
      </a>
    </div>
    <p>One of the main benefits of using Report URI is just how simple everything is to do, and all the solutions mentioned at the start of this post compromised that. With a Cloudflare Worker we could keep the absolute simplicity of deploying Report URI while also giving you an easy way to shield your clients' IP addresses, or any other information in the payload, from us.</p>
            <pre><code>
// The subdomain of your Report URI reporting address,
// e.g. 'scotthelme' for https://scotthelme.report-uri.com
let subdomain = 'scotthelme'

addEventListener('fetch', event =&gt; {
  event.respondWith(forwardReport(event.request))
})

async function forwardReport(req) {
  // Build a fresh set of headers so only what we choose is forwarded.
  // Delete the User-Agent line if you don't want us to see it either.
  let newHdrs = new Headers()
  newHdrs.set('User-Agent', req.headers.get('User-Agent'))

  const init = {
    body: req.body,
    headers: newHdrs,
    method: 'POST'
  }

  // Forward the report to the same path on your Report URI subdomain.
  let path = new URL(req.url).pathname
  let address = 'https://' + subdomain + '.report-uri.com' + path
  let response = await fetch(address, init)

  // Relay Report URI's response back to the browser.
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText
  })
}</code></pre>
            <p>This simple worker, deployed on your own site, provides a solution to all of the above problems. All you need to do is configure your subdomain in the variable at the top of the script and everything else will be taken care of for you. Deploy this worker onto the subdomain you want to send reports to, follow the same naming convention for the path when sending reports, and everything will Just Work™.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qROepXu7FHMn3OmnuuHf5/58d2e7fb716ceeb751db79266e2b39d3/pic2.png" />
            
            </figure><p>The script above is configured for my subdomain, so if I wanted to deploy this on any site, say <code>example.com</code>, I'd choose the subdomain on my site where I wanted to send reports <code>report-uri.example.com</code> and off we go.</p>
            <pre><code>https://scotthelme.report-uri.com/r/d/csp/enforce
becomes
https://report-uri.example.com/r/d/csp/enforce</code></pre>
            <p>The reports are now being sent to a subdomain under your own control; the worker will intercept the request and forward it to the destination at Report URI for you. In the process you shield the client IP, because the only source IP we see is that of the Cloudflare Worker, while in the example above the UA string is still forwarded for browser identification.</p>
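Wiring the browser up to the proxied endpoint is then just a policy change. A sketch of the response header, using the hypothetical report-uri.example.com subdomain from above:

```
Content-Security-Policy: default-src 'self'; report-uri https://report-uri.example.com/r/d/csp/enforce
```

Violation reports now POST to a first-party address, and the worker handles the rest.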
    <div>
      <h4>Amazingly Simple</h4>
      <a href="#amazingly-simple">
        
      </a>
    </div>
    <p>With the worker above we don't need to worry about setting up a CNAME, certificate provisioning, separate infrastructure or anything else that goes with it. You also don't need to worry about setting up and managing a proxy to forward the reports to us, or the traffic and processing power required to do so. The worker will take care of all of that, and what's best is that it does so with minimal overhead, taking only a few minutes to set up and costing only $0.50 for every 1 million reports it processes.</p>
    <div>
      <h4>Taking It One Step Further</h4>
      <a href="#taking-it-one-step-further">
        
      </a>
    </div>
    <p>The great thing about this is that once the worker is set up and processing reports, you can start to do some pretty awesome things beyond just proxying reports, workers are incredibly powerful.</p>
    <div>
      <h5>Downsample report volume</h5>
      <a href="#downsample-report-volume">
        
      </a>
    </div>
    <p>If you want to save your quota on Report URI (maybe you're early in the process of deploying CSP and it's quite noisy), no problem. The worker can select a random sample of reports to forward on, so you still receive reports but don't eat your quota quite as quickly. Make the following change to the start of the <code>forwardReport()</code> function to randomly drop 50% of reports.</p>
            <pre><code>async function forwardReport(req) {
  // Discard roughly half of all incoming reports at random.
  if(Math.floor((Math.random() * 100) + 1) &lt;= 50) {
    return new Response("discarded")
  }</code></pre>
            
    <div>
      <h5>Hide the UA string</h5>
      <a href="#hide-the-ua-string">
        
      </a>
    </div>
    <p>If you did want to hide the UA string from Report URI and not let us see that either, you simply need to remove the following line of code.</p>
            <pre><code>newHdrs.set('User-Agent', req.headers.get('User-Agent'))</code></pre>
            
    <div>
      <h5>Advanced work</h5>
      <a href="#advanced-work">
        
      </a>
    </div>
    <p>The worker can pretty much do anything you like. Maybe there are sections of your site that you don't want to send reports from: you could parse the JSON, check which page triggered the report, and discard those reports. You could also run a regex match on the JSON payload to make sure no sensitive tokens or information get sent. The possibilities are basically endless, and if you need to do it, it's easy and cheap enough to do in a Cloudflare Worker.</p>
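As a sketch of those last two ideas, here's a small helper the worker could call before forwarding. The function name, ignored paths, and token pattern are all hypothetical choices for illustration, not part of Report URI's API:

```javascript
// Hypothetical filter for CSP reports before forwarding to Report URI.
// IGNORED_PATHS and SENSITIVE_TOKEN are illustrative values.
const IGNORED_PATHS = ['/admin/', '/internal/'];
const SENSITIVE_TOKEN = /(?:api[_-]?key|session)=[\w-]+/i;

function shouldForward(reportText) {
  let report;
  try {
    report = JSON.parse(reportText)['csp-report'];
  } catch (e) {
    return false; // malformed payload: drop it
  }
  if (!report) return false;

  // Drop reports triggered from pages we don't want to report on.
  let pathname = '';
  try {
    pathname = new URL(report['document-uri']).pathname;
  } catch (e) {
    // unparseable document-uri: fall through with an empty pathname
  }
  if (IGNORED_PATHS.some(prefix => pathname.startsWith(prefix))) {
    return false;
  }

  // Drop reports whose blocked URI carries a secret-looking token.
  if (SENSITIVE_TOKEN.test(report['blocked-uri'] || '')) {
    return false;
  }
  return true;
}
```

In the worker you'd read the body with `await req.text()`, call `shouldForward()` on it, and either forward the report as before or return a short "discarded" response, mirroring the downsampling example above.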
    <div>
      <h4>Pricing</h4>
      <a href="#pricing">
        
      </a>
    </div>
    <p>Speaking of cheap, I thought I'd actually quantify that and quote the Cloudflare pricing for Workers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6qJn1MqvluG0CTRlLxJ5K6/980dcf1d948032d0379fd94933a35c38/workers-pricing-text.png" />
            
            </figure><p>Starting at $5 per month and covering your first 10 million requests is an amazingly cheap deal. Most websites that report through us wouldn't even come close to sending 10 million reports, so you'd probably never pay any more than $5 for your Cloudflare Worker. That's it, $5 per month... By the time you've even thought about creating a CNAME or standing up your own proxy you've probably blown through more than Cloudflare Workers would ever cost you. What's best is that if you already use Cloudflare Workers then you can roll this into your existing usage, and it might not even increase the cost if you have some of your initial 10 million requests spare. If you don't use Cloudflare on your site already then you could just as easily grab a new domain name exclusively for reporting (that'd cost just a few dollars) and stand that up behind Cloudflare too. One way or another this is insanely easy and insanely cheap.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">10iuHWJJbVc60ynFL5KgFm</guid>
            <dc:creator>Scott Helme (Guest Author)</dc:creator>
        </item>
    </channel>
</rss>