
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 14 Apr 2026 03:35:07 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Over 700 million events/second: How we make sense of too much data]]></title>
            <link>https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/</link>
            <pubDate>Mon, 27 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Here we explain how we made our data pipeline scale to 700 million events per second while becoming more resilient than ever before. We share some math behind our approach and some of the designs of  ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's network provides an enormous array of services to our customers. We collect and deliver associated data to customers in the form of event logs and aggregated analytics. As of December 2024, our data pipeline is ingesting up to 706M events per second generated by Cloudflare's services, and that represents 100x growth since our <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/"><u>2018 data pipeline blog post</u></a>. </p><p>At peak, we are moving 107 <a href="https://simple.wikipedia.org/wiki/Gibibyte"><u>GiB</u></a>/s of compressed data, either pushing it directly to customers or subjecting it to additional queueing and batching.</p><p>All of these data streams power things like <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a>, <a href="https://developers.cloudflare.com/analytics/"><u>Analytics</u></a>, and billing, as well as other products, such as training machine learning models for bot detection. This blog post is focused on techniques we use to efficiently and accurately deal with the high volume of data we ingest for our Analytics products. A previous <a href="https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/"><u>blog post</u></a> provides a deeper dive into the data pipeline for Logs. </p><p>The pipeline can be roughly described by the following diagram.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ihv6JXx19nJiEyfCaCg8V/ad7081720514bafd070cc38a04bc7097/BLOG-2486_2.jpg" />
          </figure><p>The data pipeline has multiple stages, and each can and will naturally break or slow down because of hardware failures or misconfiguration. And when that happens, there is just too much data to be able to buffer it all for very long. Eventually some will get dropped, causing gaps in analytics and a degraded product experience unless proper mitigations are in place.</p>
    <div>
      <h3>Dropping data to retain information</h3>
      <a href="#dropping-data-to-retain-information">
        
      </a>
    </div>
    <p>How does one retain valuable information from more than half a billion events per second, when some must be dropped? Drop it in a controlled way, by downsampling.</p><p>Here is a visual analogy showing the difference between uncontrolled data loss and downsampling. In both cases the same number of pixels were delivered. One is a higher resolution view of just a small portion of a popular painting, while the other shows the full painting, albeit blurry and highly pixelated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kUGB4RLQzFb7cphMpHqAg/e7ccf871c73e0e8ca9dcac32fe265f18/Screenshot_2025-01-24_at_10.57.17_AM.png" />
          </figure><p>As we noted above, any point in the pipeline can fail, so we want the ability to downsample at any point as needed. Some services proactively downsample data at the source before it even hits Logfwdr. This makes the information extracted from that data a little bit blurry, but much more useful than what otherwise would be delivered: random chunks of the original with gaps in between, or even nothing at all. The amount of "blur" is outside our control (we make our best effort to deliver full data), but there is a robust way to estimate it, as discussed in the <a href="/how-we-make-sense-of-too-much-data/#extracting-value-from-downsampled-data"><u>next section</u></a>.</p><p>Logfwdr can decide to downsample data sitting in the buffer when it overflows. Logfwdr handles many data streams at once, so we need to prioritize them by assigning each data stream a weight and then applying <a href="https://en.wikipedia.org/wiki/Max-min_fairness"><u>max-min fairness</u></a> to better utilize the buffer. It allows each data stream to store as much as it needs, as long as the whole buffer is not saturated. Once it is saturated, streams divide it fairly according to their weighted size.</p><p>In our implementation (Go), each data stream is driven by a goroutine, and they cooperate via channels. They consult a single tracker object every time they allocate and deallocate memory. The tracker uses a <a href="https://en.wikipedia.org/wiki/Heap_(data_structure)"><u>max-heap</u></a> to always know who the heaviest participant is and what the total usage is. Whenever the total usage goes over the limit, the tracker repeatedly sends the "please shed some load" signal to the heaviest participant, until the usage is again under the limit.</p><p>The effect of this is that healthy streams, which buffer a tiny amount, allocate whatever they need without losses. 
But any lagging streams split the remaining memory allowance fairly.</p><p>We downsample more or less uniformly, by always taking some of the least-downsampled batches from the buffer (using a min-heap to find those) and merging them together upon downsampling.</p>
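<p>The tracker described above can be sketched as follows. This is a minimal illustration in Python (the class and method names, and the halve-on-signal shedding policy, are ours for illustration; the real implementation is in Go and the streams downsample their own batches):</p>

```python
import heapq

class BufferTracker:
    """Illustrative sketch of a weighted fair-share buffer tracker.

    Streams report (de)allocations; whenever the total usage goes over
    the limit, the stream with the largest weighted usage is signalled
    to shed load until the total fits again.
    """

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.usage = {}   # stream name -> bytes buffered
        self.weight = {}  # stream name -> fairness weight
        self.shed_signals = []

    def register(self, stream, weight=1.0):
        self.usage[stream] = 0
        self.weight[stream] = weight

    def update(self, stream, delta_bytes):
        self.usage[stream] += delta_bytes
        while sum(self.usage.values()) > self.limit:
            # Max-heap keyed by weighted usage: the heaviest stream sits on top.
            # (A real tracker would maintain the heap incrementally.)
            heap = [(-u / self.weight[s], s) for s, u in self.usage.items()]
            heapq.heapify(heap)
            _, heaviest = heap[0]
            self.shed_signals.append(heaviest)
            # Stand-in for "please shed some load": halve that stream's buffer
            # (the real stream would downsample and recompress batches instead).
            self.usage[heaviest] //= 2
```

<p>With two equally weighted streams and a 100-byte limit, a healthy stream that buffered 10 bytes keeps all of it, while a lagging stream that buffered 200 bytes is repeatedly signalled until the total fits.</p>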
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15VP0VYkrvkQboX9hrOy0q/e3d087fe704bd1b0ee41eb5b7a24b899/BLOG-2486_4.png" />
          </figure><p><sup><i>Merging keeps the batches roughly the same size and their number under control.</i></sup></p><p>Downsampling is cheap, but since data in the buffer is compressed, it causes recompression, which is the single most expensive thing we do to the data. And using extra CPU time is the last thing you want when the system is under heavy load! We compensate for the recompression costs by starting to downsample the fresh data as well (before it gets compressed for the first time) whenever the stream is in the "shed the load" state.</p><p>We called this approach "bottomless buffers", because you can squeeze effectively infinite amounts of data in there, and it will just automatically be thinned out. Bottomless buffers resemble <a href="https://en.wikipedia.org/wiki/Reservoir_sampling"><u>reservoir sampling</u></a>, where the buffer is the reservoir and the population comes as the input stream. But there are some differences. The first is that in our pipeline the input stream of data never ends, while reservoir sampling assumes it ends in order to finalize the sample. The second is that the resulting sample also never ends.</p><p>Let's look at the next stage in the pipeline: Logreceiver. It sits in front of a distributed queue. The purpose of Logreceiver is to partition each stream of data by a key that makes it easier for Logpush, Analytics inserters, or some other process to consume.</p><p>Logreceiver proactively performs adaptive sampling of analytics. This improves the accuracy of analytics for small customers (receiving on the order of 10 events per day), while more aggressively downsampling large customers (millions of events per second). Logreceiver then pushes the same data at multiple resolutions (100%, 10%, 1%, etc.) into different topics in the distributed queue. 
This allows it to keep pushing something rather than nothing when the queue is overloaded, by just skipping writing the high-resolution samples of data.</p><p>The same goes for Inserters: they can skip <i>reading or writing</i> high-resolution data. The Analytics APIs can skip <i>reading</i> high resolution data. The analytical database might be unable to read high resolution data because of overload or degraded cluster state or because there is just too much to read (very wide time range or very large customer). Adaptively dropping to lower resolutions allows the APIs to return <i>some</i> results in all of those cases.</p>
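<p>The multi-resolution write path can be sketched like this (a simplification under our own assumptions: independent per-event Bernoulli trials for each resolution; the real Logreceiver also applies the per-customer adaptive logic described above):</p>

```python
import random

def fan_out(events, sample_intervals=(1, 10, 100)):
    """Thin one batch into several resolutions (1 = full data, 10 = ~1 in 10, ...).

    Each kept event carries its sample interval so downstream consumers can
    reweight it; in the real pipeline each resolution goes to its own topic.
    """
    topics = {r: [] for r in sample_intervals}
    for event in events:
        for r in sample_intervals:
            # Independent Bernoulli trial with inclusion probability 1/r.
            if random.random() < 1.0 / r:
                topics[r].append({"event": event, "_sample_interval": r})
    return topics
```

<p>When the queue is overloaded, the writer can simply skip the low-interval (high-resolution) topics and still deliver the 1-in-100 stream.</p>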
    <div>
      <h3>Extracting value from downsampled data</h3>
      <a href="#extracting-value-from-downsampled-data">
        
      </a>
    </div>
    <p>Okay, we have some downsampled data in the analytical database. It looks like the original data, but with some rows missing. How do we make sense of it? How do we know if the results can be trusted?</p><p>Let's look at the math.</p>Since the amount of sampling can vary over time and between nodes in the distributed system, we need to store this information along with the data. With each event $x_i$ we store its sample interval, which is the reciprocal of its inclusion probability $\pi_i = \frac{1}{\text{sample interval}}$. For example, if we sample 1 in every 1,000 events, each of the events included in the resulting sample will have $\pi_i = 0.001$, so the sample interval will be 1,000. When we further downsample that batch of data, the inclusion probabilities (and the sample intervals) multiply together: a 1 in 1,000 sample from a 1 in 1,000 sample is a 1 in 1,000,000 sample of the original population. The sample interval of an event can also be interpreted roughly as the number of original events that this event represents, so in the literature it is known as the weight $w_i = \frac{1}{\pi_i}$.
<p></p>
We rely on the <a href="https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator">Horvitz-Thompson estimator</a> (HT, <a href="https://www.stat.cmu.edu/~brian/905-2008/papers/Horvitz-Thompson-1952-jasa.pdf">paper</a>) to derive analytics about $x_i$. It gives two estimates: the analytical estimate (e.g. the population total or size) and the estimate of the variance of that estimate. The latter enables us to figure out how accurate the results are by building <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>. These define ranges that cover the true value with a given probability <i>(confidence level)</i>. A typical confidence level is 0.95, at which a confidence interval (a, b) tells you that you can be 95% sure the true SUM or COUNT is between a and b.
<p></p><p>So far, we know how to use the HT estimator for doing SUM, COUNT, and AVG.</p>Given a sample of size $n$, consisting of values $x_i$ and their inclusion probabilities $\pi_i$, the HT estimator for the population total (i.e. SUM) would be

$$\widehat{T}=\sum_{i=1}^n{\frac{x_i}{\pi_i}}=\sum_{i=1}^n{x_i w_i}.$$

The variance of $\widehat{T}$ is:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij} \pi_i \pi_j}},$$

where $\pi_{ij}$ is the probability of both $i$-th and $j$-th events being sampled together.
<p></p>
We use <a href="https://en.wikipedia.org/wiki/Poisson_sampling">Poisson sampling</a>, where each event is subjected to an independent <a href="https://en.wikipedia.org/wiki/Bernoulli_trial">Bernoulli trial</a> ("coin toss") which determines whether the event becomes part of the sample. Since each trial is independent, we can equate $\pi_{ij} = \pi_i \pi_j$, which, when plugged into the variance estimator above, turns the right-hand sum to zero:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{0}{\pi_{ij} \pi_i \pi_j}},$$

thus

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{x_i^2 w_i (w_i-1)}.$$

For COUNT we use the same estimator, but plug in $x_i = 1$. This gives us:

$$\begin{align}
\widehat{C} &amp;= \sum_{i=1}^n{\frac{1}{\pi_i}} = \sum_{i=1}^n{w_i},\\
\widehat{V}(\widehat{C}) &amp;= \sum_{i=1}^n{\frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{w_i (w_i-1)}.
\end{align}$$

For AVG we would use

$$\begin{align}
\widehat{\mu} &amp;= \frac{\widehat{T}}{N},\\
\widehat{V}(\widehat{\mu}) &amp;= \frac{\widehat{V}(\widehat{T})}{N^2},
\end{align}$$

if we could, but the original population size $N$ is not known: it is not stored anywhere, and it is not even possible to store it, because of custom filtering at query time. Plugging in $\widehat{C}$ instead of $N$ only partially works. It gives a valid estimator for the mean itself, but not for its variance, so the constructed confidence intervals are unusable.
<p></p>
In all cases the corresponding pair of estimates is used as the $\mu$ and $\sigma^2$ of the normal distribution (because of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>), and then the bounds for the confidence interval (of confidence level $\alpha$) are:

$$\Big( \mu - \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma, \quad \mu + \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma\Big).$$<p>We do not know $N$, but there is a workaround: simultaneous confidence intervals. Construct confidence intervals for SUM and COUNT independently, and then combine them into a confidence interval for AVG. This is known as the <a href="https://www.sciencedirect.com/topics/mathematics/bonferroni-method"><u>Bonferroni method</u></a>. It requires generating wider intervals for SUM and COUNT, each carrying half the "inconfidence" (e.g. 97.5% confidence instead of 95%). Here is a simplified visual representation, but the actual estimator will have to take into account the possibility of the orange area going below zero.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69Vvi2CHSW8Gew0TWHSndj/1489cfe1ff57df4e7e1ca3c31a8444a5/BLOG-2486_5.png" />
          </figure><p>In SQL, the estimators and confidence intervals look like this:</p>
            <pre><code>WITH sum(x * _sample_interval)                              AS t,
     sum(x * x * _sample_interval * (_sample_interval - 1)) AS vt,
     sum(_sample_interval)                                  AS c,
     sum(_sample_interval * (_sample_interval - 1))         AS vc,
     -- ClickHouse does not expose the erf⁻¹ function, so we precompute some magic numbers,
     -- (only for 95% confidence, will be different otherwise):
     --   1.959963984540054 = Φ⁻¹((1+0.950)/2) = √2 * erf⁻¹(0.950)
     --   2.241402727604945 = Φ⁻¹((1+0.975)/2) = √2 * erf⁻¹(0.975)
     1.959963984540054 * sqrt(vt) AS err950_t,
     1.959963984540054 * sqrt(vc) AS err950_c,
     2.241402727604945 * sqrt(vt) AS err975_t,
     2.241402727604945 * sqrt(vc) AS err975_c
SELECT t - err950_t AS lo_total,
       t            AS est_total,
       t + err950_t AS hi_total,
       c - err950_c AS lo_count,
       c            AS est_count,
       c + err950_c AS hi_count,
       (t - err975_t) / (c + err975_c) AS lo_average,
       t / c                           AS est_average,
       (t + err975_t) / (c - err975_c) AS hi_average
FROM ...</code></pre>
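<p>As a sanity check of these formulas (a simulation of ours, not part of the pipeline), we can repeatedly Poisson-sample a synthetic population and verify that the 95% interval for SUM covers the true total about 95% of the time:</p>

```python
import math
import random

def ht_sum(sample):
    """HT estimate of the population total and its variance, under Poisson
    sampling, from (value, sample_interval) pairs."""
    t = sum(x * w for x, w in sample)
    vt = sum(x * x * w * (w - 1) for x, w in sample)
    return t, vt

random.seed(7)
population = [random.expovariate(1 / 100) for _ in range(50_000)]
true_total = sum(population)
interval = 50  # keep roughly 1 in 50 events

covered = 0
trials = 100
for _ in range(trials):
    # Poisson sampling: an independent coin toss per event.
    sample = [(x, interval) for x in population if random.random() < 1 / interval]
    t, vt = ht_sum(sample)
    err = 1.959963984540054 * math.sqrt(vt)  # same magic number as the SQL above
    covered += t - err <= true_total <= t + err

coverage = covered / trials  # expected to land close to 0.95
```

<p>The same exercise with a skewed population, or with a mix of sample intervals, is a cheap way to catch implementation mistakes before they reach a dashboard.</p>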
            <p>Construct a confidence interval for each timeslot on the timeseries, and you get a confidence band, clearly showing the accuracy of the analytics. The figure below shows an example of such a band in shading around the line.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JEnnC6P4BhM8qB8J5yKqt/3635835967085f9b24f64a5731457ddc/BLOG-2486_6.png" />
          </figure>
    <div>
      <h3>Sampling is easy to screw up</h3>
      <a href="#sampling-is-easy-to-screw-up">
        
      </a>
    </div>
    <p>We started using confidence bands on our internal dashboards, and after a while noticed something scary: a systematic error! For one particular website the "total bytes served" estimate was higher than the true control value obtained from rollups, and the confidence bands were way off. See the figure below, where the true value (blue line) is outside the yellow confidence band at all times.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CHCyKyXqPMj8DnMpBUf3N/772fb61f02b79c59417f66d9dc0b5d19/BLOG-2486_7.png" />
          </figure><p>We checked the stored data for corruption; it was fine. We checked the math in the queries; it was fine. It was only after reading through the source code for all of the systems responsible for sampling that we found a candidate for the root cause.</p><p>We used simple random sampling everywhere, basically "tossing a coin" for each event, but in Logreceiver sampling was done differently. Instead of sampling <i>randomly</i>, it would perform <i>systematic sampling</i>: picking events at equal intervals, starting from the first one in the batch.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUwjxdylG5ARlFlDtv1OC/76db68677b7ae072b0a065f59d82c6f2/BLOG-2486_8.png" />
          </figure><p>Why would that be a problem?</p>There are two reasons. The first is that we can no longer claim $\pi_{ij} = \pi_i \pi_j$, so the simplified variance estimator stops working and confidence intervals cannot be trusted. But even worse, the estimator for the total becomes biased. To understand why exactly, we wrote a short repro code in Python:
<br /><p></p>
            <pre><code>import itertools

def take_every(src, period):
    for i, x in enumerate(src):
        if i % period == 0:
            yield x

pattern = [10, 1, 1, 1, 1, 1]
sample_interval = 10 # bad if it has common factors with len(pattern)
true_mean = sum(pattern) / len(pattern)

orig = itertools.cycle(pattern)
sample_size = 10000
sample = itertools.islice(take_every(orig, sample_interval), sample_size)

sample_mean = sum(sample) / sample_size

print(f"{true_mean=} {sample_mean=}")</code></pre>
            <p>After playing with different values for <code><b>pattern</b></code> and <code><b>sample_interval</b></code> in the code above, we realized where the bias was coming from.</p><p>Imagine a person opening a huge generated HTML page with many small/cached resources, such as icons. The first response will be big, immediately followed by a burst of small responses. If the website is not visited that much, responses will tend to end up all together at the start of a batch in Logfwdr. Logreceiver does not cut batches, only concatenates them. The first response remains first, so it always gets picked and skews the estimate up.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WZUzqCwr2A6WgX1T5UE8z/7a2e08b611fb64e64a61e3d5c792fe23/BLOG-2486_9.png" />
          </figure><p>We checked the hypothesis against the raw unsampled data that we happened to have because that particular website was also using one of the <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a> products. We took all events in a given time range, and grouped them by cutting at gaps of at least one minute. In each group, we ranked all events by time and looked at the variable of interest (response size in bytes), and put it on a scatter plot against the rank inside the group.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IXtqGkRjV0xs3wvwx609A/81e67736cacbccdd839c2177769ee4fe/BLOG-2486_10.png" />
          </figure><p>A clear pattern! The first response is much more likely to be larger than average.</p><p>We fixed the issue by making Logreceiver shuffle the data before sampling. As we rolled out the fix, the estimation and the true value converged.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4TL1pKDLw7MA6yGMSCahJN/227cb22054e0e8fe65c7766aa6e4b541/BLOG-2486_11.png" />
          </figure><p>Now, after battle testing it for a while, we are confident the HT estimator is implemented properly and we are using the correct sampling process.</p>
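<p>The fix is easy to illustrate by extending the repro above (a sketch: the helper name and <code>batch_size</code> are ours): shuffle each batch before taking every period-th event, and the positional bias disappears.</p>

```python
import itertools
import random

def take_every_shuffled(src, period, batch_size=1000):
    """Systematic sampling after an in-batch shuffle: position in the
    batch no longer correlates with the sampling pattern."""
    it = iter(src)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        random.shuffle(batch)
        yield from batch[::period]

random.seed(1)
pattern = [10, 1, 1, 1, 1, 1]  # the same pathological pattern as the repro above
true_mean = sum(pattern) / len(pattern)  # 2.5

orig = itertools.cycle(pattern)
sample = list(itertools.islice(take_every_shuffled(orig, 10), 10_000))
sample_mean = sum(sample) / len(sample)
# Without the shuffle, this setup converges to a mean of about 4.0
# (positions 0, 4, 2 of the pattern get picked forever); with it,
# sample_mean lands close to the true 2.5.
```
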
    <div>
      <h3>Using Cloudflare's analytics APIs to query sampled data</h3>
      <a href="#using-cloudflares-analytics-apis-to-query-sampled-data">
        
      </a>
    </div>
    <p>We already power most of our analytics datasets with sampled data. For example, the <a href="https://developers.cloudflare.com/analytics/analytics-engine/"><u>Workers Analytics Engine</u></a> exposes the <a href="https://developers.cloudflare.com/analytics/analytics-engine/sql-api/#sampling"><u>sample interval</u></a> in SQL, allowing our customers to build their own dashboards with confidence bands. In the GraphQL API, all of the data nodes that have "<a href="https://developers.cloudflare.com/analytics/graphql-api/sampling/#adaptive-sampling"><u>Adaptive</u></a>" in their name are based on sampled data, and the sample interval is exposed as a field there as well, though it is not possible to build confidence intervals from that alone. We are working on exposing confidence intervals in the GraphQL API, and as an experiment have added them to the count and edgeResponseBytes (sum) fields on the httpRequestsAdaptiveGroups nodes. This is available under <code><b>confidence(level: X)</b></code>.</p><p>Here is a sample GraphQL query:</p>
            <pre><code>query HTTPRequestsWithConfidence(
  $zoneTag: string
  $datetimeStart: string
  $datetimeEnd: string
) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequestsAdaptiveGroups(
        filter: {
          datetime_geq: $datetimeStart
          datetime_leq: $datetimeEnd
        }
        limit: 100
      ) {
        confidence(level: 0.95) {
          level
          count {
            estimate
            lower
            upper
            sampleSize
          }
          sum {
            edgeResponseBytes {
              estimate
              lower
              upper
              sampleSize
            }
          }
        }
      }
    }
  }
}
</code></pre>
            <p>The query above asks for the estimates and the 95% confidence intervals for <code><b>SUM(edgeResponseBytes)</b></code> and <code><b>COUNT</b></code>. The results will also show the sample size, which is good to know, as we rely on the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><u>central limit theorem</u></a> to build the confidence intervals, so small samples don't work very well.</p><p>Here is the response from this query:</p>
            <pre><code>{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "confidence": {
                "level": 0.95,
                "count": {
                  "estimate": 96947,
                  "lower": "96874.24",
                  "upper": "97019.76",
                  "sampleSize": 96294
                },
                "sum": {
                  "edgeResponseBytes": {
                    "estimate": 495797559,
                    "lower": "495262898.54",
                    "upper": "496332219.46",
                    "sampleSize": 96294
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "errors": null
}
</code></pre>
            <p>The response shows the estimated count is 96947, and we are 95% confident that the true count lies in the range 96874.24 to 97019.76. Similarly, the estimate and range for the sum of response bytes are provided.</p><p>The estimates are based on a sample size of 96294 rows, which is plenty of samples to calculate good confidence intervals.</p>
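<p>A convenient way to read such a response (back-of-the-envelope arithmetic of ours, using the numbers above) is to turn the interval into a relative margin of error:</p>

```python
# Numbers copied from the GraphQL response above.
count_est, count_lo, count_hi = 96947, 96874.24, 97019.76

half_width = (count_hi - count_lo) / 2   # the interval's half-width, ~72.76
relative_error = half_width / count_est  # ~0.00075, i.e. roughly ±0.075%
```

<p>A relative error on the order of ±0.075% means the downsampling costs almost nothing in accuracy here, which is what a sample of roughly 96k rows buys.</p>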
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We have discussed what kept our data pipeline scalable and resilient despite doubling in size every 1.5 years, how the math works, and how it is easy to mess up. We are constantly working on better ways to keep the data pipeline, and the products based on it, useful to our customers. If you are interested in doing things like that and want to help us build a better Internet, check out our <a href="http://www.cloudflare.com/careers"><u>careers page</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Sampling]]></category>
            <guid isPermaLink="false">64DSvKdN853gq5Bx3Cyfij</guid>
            <dc:creator>Constantin Pan</dc:creator>
            <dc:creator>Jim Hawkridge</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Timing Insights: new performance metrics via our GraphQL API]]></title>
            <link>https://blog.cloudflare.com/introducing-timing-insights/</link>
            <pubDate>Tue, 20 Jun 2023 13:00:47 GMT</pubDate>
            <description><![CDATA[ If you care about the performance of your website or APIs, it’s critical to understand why things are slow.  Today we're introducing new analytics tools to help you understand what is contributing to "Time to First Byte" (TTFB) of Cloudflare and your origin ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1OastOs9G4xxa9jwrnKl3V/b50435c06e0316b8bf78cef88bce6888/xi31J0JmYNP5dcB-gEsTujRYG1gyFUsod_Fx7XPsjPwwxTQBIOFwTy9m0jPNQabe0bi5oUSwJHo5ubAq9rcAhgTXsqlTcoi9rpLM5pwoAwY-Yj8vuothGdHJHJbz.png" />
            
            </figure><p>If you care about the performance of your website or APIs, it’s critical to understand why things are slow.</p><p>Today we're introducing new analytics tools to help you understand what is contributing to "Time to First Byte" (TTFB) of Cloudflare and your origin. TTFB is just a simple timer from when a client sends a request until it receives the first byte in response. Timing Insights breaks down TTFB from the perspective of our servers to help you understand <i>what</i> is slow, so that you can begin addressing it.</p><p>But wait – maybe you've heard that <a href="/ttfb-time-to-first-byte-considered-meaningles/">you should stop worrying about TTFB</a>? Isn't Cloudflare <a href="/ttfb-is-not-what-it-used-to-be/">moving away</a> from TTFB as a metric? Read on to understand why there are still situations where TTFB matters.</p>
    <div>
      <h3>Why you may need to care about TTFB</h3>
      <a href="#why-you-may-need-to-care-about-ttfb">
        
      </a>
    </div>
    <p>It's true that TTFB on its own can be a misleading metric. When measuring web applications, metrics like <a href="https://web.dev/vitals/">Web Vitals</a> provide a more holistic view into user experience. That's why we offer Web Analytics and Lighthouse within <a href="/cloudflare-observatory-generally-available/">Cloudflare Observatory</a>.</p><p>But there are two reasons why you still may need to pay attention to TTFB:</p><p><b>1. Not all applications are websites</b></p><p>More than half of Cloudflare traffic is for APIs, and many customers with API traffic don't control the environments where those endpoints are called. In those cases, there may not be anything you can <a href="https://www.cloudflare.com/application-services/solutions/app-performance-monitoring/">monitor</a> or improve besides TTFB.</p><p><b>2. Sometimes TTFB is the problem</b></p><p>Even if you are measuring Web Vitals metrics like LCP, sometimes the reason your site is slow is because TTFB is slow! And when that happens, you need to know why, and what you can do about it.</p><p>When you need to know why TTFB is slow, we’re here to help.</p>
    <div>
      <h3>How Timing Insights can help</h3>
      <a href="#how-timing-insights-can-help">
        
      </a>
    </div>
    <p>We now expose performance data through our <a href="https://developers.cloudflare.com/analytics/graphql-api/">GraphQL Analytics API</a> that will let you query TTFB performance, and start to drill into what contributes to TTFB.</p><p>Specifically, customers on our Pro, Business, and Enterprise plans can now query for the following fields in the <code>httpRequestsAdaptiveGroups</code> dataset:</p><p><b>Time to First Byte</b> (edgeTimeToFirstByteMs)</p><p>How much time elapsed between when Cloudflare started processing the first byte of the request received from an end user and when we started sending a response?</p><p><b>Origin DNS lookup time</b> (edgeDnsResponseTimeMs)</p><p>If Cloudflare had to <a href="https://developers.cloudflare.com/dns/zone-setups/partial-setup/setup/">resolve a CNAME</a> to reach your origin, how long did this take?</p><p><b>Origin Response Time</b> (originResponseDurationMs)</p><p>How long did it take to reach, and receive a response from, your origin?</p><p>We expose each metric as an average, as well as the median, 95th, and 99th percentiles (P50 / P95 / P99).</p><p>The <code>httpRequestsAdaptiveGroups</code> dataset powers the <a href="https://dash.cloudflare.com/?to=/:account/:zone/analytics/traffic">Traffic</a> analytics page in our dashboard, and represents all of the HTTP requests that flow through our network. The upshot is that this dataset gives you the ability to filter and “group by” any aspect of the HTTP request.</p>
    <div>
      <h3>An example of how to use Timing Insights</h3>
      <a href="#an-example-of-how-to-use-timing-insights">
        
      </a>
    </div>
    <p>Let’s walk through an example of how you’d actually use this data to pinpoint a problem.</p><p>To start with, I want to understand the lay of the land by querying TTFB at various quantiles:</p>
            <pre><code>query TTFBQuantiles($zoneTag: string) {
  viewer {
    zones(filter: {zoneTag: $zoneTag}) {
      httpRequestsAdaptiveGroups(limit: 1) {
        quantiles {
          edgeTimeToFirstByteMsP50
          edgeTimeToFirstByteMsP95
          edgeTimeToFirstByteMsP99
        }
      }
    }
  }
}

Response:
{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "quantiles": {
                "edgeTimeToFirstByteMsP50": 32,
                "edgeTimeToFirstByteMsP95": 1392,
                "edgeTimeToFirstByteMsP99": 3063,
              }
            }
          ]
        }
      ]
    }
  }
}</code></pre>
            <p>This shows that TTFB is over 1.3 seconds at P95 – that’s fairly slow, given that <a href="https://web.dev/lcp/">best practices</a> are for 75% of pages to <i>finish rendering</i> within 2.5 seconds, and TTFB is just one component of LCP.</p><p>If I want to dig into why TTFB is slow, it would be helpful to understand <i>which URLs</i> are slowest. In this query I’ll filter to that slowest 5% of page loads, and now look at the <i>aggregate</i> time taken – this helps me understand which pages contribute most to slow loads:</p>
            <pre><code>query slowestURLs($zoneTag: string) {
  viewer {
    zones(filter: {zoneTag: $zoneTag}) {
      httpRequestsAdaptiveGroups(limit: 3, filter: {edgeTimeToFirstByteMs_gt: 1392}, orderBy: [sum_edgeTimeToFirstByteMs_DESC]) {
        sum {
          edgeTimeToFirstByteMs
        }
        dimensions {
          clientRequestPath
        }
      }
    }
  }
}

Response:
{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "dimensions": {
                "clientRequestPath": "/api/v2"
              },
              "sum": {
                "edgeTimeToFirstByteMs": 1655952
              }
            },
            {
              "dimensions": {
                "clientRequestPath": "/blog"
              },
              "sum": {
                "edgeTimeToFirstByteMs": 167397
              }
            },
            {
              "dimensions": {
                "clientRequestPath": "/"
              },
              "sum": {
                "edgeTimeToFirstByteMs": 118542
              }
            }
          ]
        }
      ]
    }
  }
}</code></pre>
            <p>Based on this query, it looks like the <code>/api/v2</code> path is most often responsible for these slow requests. In order to know how to fix the problem, we need to know <i>why</i> these pages are slow. To do this, we can query for the average (mean) DNS and origin response time for queries on these paths, where TTFB is above our P95 threshold:</p>
            <pre><code>query originAndDnsTiming($zoneTag: string, $paths: [string]) {
  viewer {
    zones(filter: {zoneTag: $zoneTag}) {
      httpRequestsAdaptiveGroups(filter: {edgeTimeToFirstByteMs_gt: 1392, clientRequestPath_in: $paths}) {
        avg {
          originResponseDurationMs
          edgeDnsResponseTimeMs
        }
      }
    }
  }
}

Response:
{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "average": {
                "originResponseDurationMs": 4955,
                "edgeDnsResponseTimeMs": 742,
              }
            }
          ]
        }
      ]
    }
  }
}</code></pre>
            <p>According to this, most of the long TTFB values are actually due to resolving DNS! The good news is that’s something we can fix – for example, by setting longer TTLs with our DNS provider.</p>
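The queries in this walkthrough can be sent to Cloudflare's GraphQL endpoint with any HTTP client. Below is a rough sketch of building such a request; `buildGraphQLRequest` is an illustrative helper of ours (not part of any official SDK), and it assumes an API token with Analytics read permission:

```javascript
// Build a POST payload for Cloudflare's GraphQL Analytics endpoint.
// buildGraphQLRequest is an illustrative helper, not an official SDK function.
const GRAPHQL_URL = "https://api.cloudflare.com/client/v4/graphql";

function buildGraphQLRequest(query, variables, apiToken) {
  return {
    url: GRAPHQL_URL,
    options: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ query, variables }),
    },
  };
}

// Example: the TTFB quantiles query from above.
const query = `query TTFBQuantiles($zoneTag: string) {
  viewer {
    zones(filter: {zoneTag: $zoneTag}) {
      httpRequestsAdaptiveGroups(limit: 1) {
        quantiles { edgeTimeToFirstByteMsP50 }
      }
    }
  }
}`;

const req = buildGraphQLRequest(query, { zoneTag: "YOUR_ZONE_ID" }, "YOUR_API_TOKEN");
// A runtime with fetch() could then send it:
// const res = await fetch(req.url, req.options).then((r) => r.json());
```

The same helper works for every query in this post; only the query string and variables change.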
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Coming soon, we’ll be bringing this to Cloudflare Observatory in the dashboard so that you can easily explore timing data via the UI.</p><p>And we’ll be adding even more granular metrics so you can see exactly which components are contributing to high TTFB. For example, we plan to separate out the difference between origin “connection time” (how long it took to establish a TCP and/or TLS connection) vs “application response time” (how long it took an HTTP server to respond).</p><p>We’ll also be making improvements to our GraphQL API to allow more flexible querying – for example, the ability to query arbitrary percentiles, not just 50th, 95th, or 99th.</p><p>Start using the <a href="https://developers.cloudflare.com/analytics/graphql-api/">GraphQL API</a> today to get Timing Insights, or hop on the discussion about our Analytics products in <a href="https://discord.com/channels/595317990191398933/1115387663982276648">Discord</a>.</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div></div> ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[GraphQL]]></category>
            <guid isPermaLink="false">1vitH7RVlSZWDB2VUyPmVA</guid>
            <dc:creator>Jon Levine</dc:creator>
            <dc:creator>Miki Mokrysz</dc:creator>
        </item>
        <item>
            <title><![CDATA[Announcing Network Analytics]]></title>
            <link>https://blog.cloudflare.com/announcing-network-analytics/</link>
            <pubDate>Mon, 16 Mar 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Back in March 2019, we released Firewall Analytics which provides insights into HTTP security events across all of Cloudflare's protection suite; Firewall rule matches, HTTP DDoS Attacks, Site Security Level which harnesses Cloudflare's threat intelligence, and more. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Our Analytics Platform</h2>
      <a href="#our-analytics-platform">
        
      </a>
    </div>
    <p>Back in March 2019, we released <a href="/new-firewall-tab-and-analytics/">Firewall Analytics</a>, which provides insights into HTTP security events across all of Cloudflare's protection suite: Firewall rule matches, HTTP DDoS attacks, Site Security Level (which harnesses Cloudflare's threat intelligence), and more. It helps customers tailor their security configurations more effectively. The initial release was for Enterprise customers; however, we believe that everyone should have access to powerful tools, not just large enterprises, and so in December 2019 we <a href="/updates-to-firewall-analytics/">extended those same enterprise-level analytics to our Business and Pro customers</a>.</p><p>Since then, we’ve built on top of our analytics platform: improved its usability, added more functionality, and extended it to additional Cloudflare services in the form of Account Analytics, DNS Analytics, Load Balancing Analytics, Monitoring Analytics and more.</p><p>Our entire <a href="https://www.cloudflare.com/analytics/">analytics platform</a> harnesses the powerful <a href="https://developers.cloudflare.com/analytics/graphql-api/">GraphQL framework</a>, which is also available to customers who want to build, export and share their own custom reports and dashboards.</p>
    <div>
      <h2>Extending Visibility From L7 To L3</h2>
      <a href="#extending-visibility-from-l7-to-l3">
        
      </a>
    </div>
    <p>Until recently, all of our dashboards were mostly HTTP-oriented and provided visibility into HTTP attributes such as the user agent, hosts, cached resources, etc. This is valuable to customers that use Cloudflare to protect and accelerate HTTP applications, mobile apps, or similar. We’re able to provide them visibility into the application layer (Layer 7 in the OSI model) because we proxy their traffic at L7.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZBntqdxMb0c6QOFu3vBmu/fbf5d06627ddde8184a357f27f927742/pasted-image-0-2.png" />
            
            </figure><p>DDoS Protection for Layer 3-7</p><p>However with <a href="https://www.cloudflare.com/magic-transit/">Magic Transit</a>, we don’t proxy traffic at L7 but rather route it at L3 (network layer). Using BGP Anycast, customer traffic is routed to the closest point of presence of Cloudflare’s network edge where it is filtered by customer-defined network firewall rules and automatic DDoS mitigation systems. Clean traffic is then routed via dynamic GRE Anycast tunnels to the customer's data-centers. Routing at L3 means that we have limited visibility into the higher layers. So in order to provide Magic Transit customers visibility into traffic and attacks, we needed to extend our analytics platform to the packet-layer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4av9KajzHgNLjrfDbhGD08/363c783d5a4ce6261c200c7e10bd23ff/image-2-1.png" />
            
            </figure><p>Magic Transit Traffic Flow</p><p>On January 16, 2020, we released the <a href="https://support.cloudflare.com/hc/en-us/articles/360038696631-Understanding-Cloudflare-Network-Analytics">Network Analytics</a> dashboard for <a href="https://www.cloudflare.com/magic-transit/">Magic Transit</a> customers and Bring Your Own IP (BYOIP) customers. This packet and bit oriented dashboard provides near real-time visibility into network- and transport-layer traffic patterns and DDoS attacks that are blocked at the Cloudflare edge in over 200 cities around the world.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42tmqoCWcdfUsI7Lguko4c/93872eb5e86eefb3177b5c2b615e36e7/image-4-1.png" />
            
            </figure><p>Network Analytics - Packets &amp; Bits Over Time</p>
    <div>
      <h3>Analytics For A Year</h3>
      <a href="#analytics-for-a-year">
        
      </a>
    </div>
    <p>The way we've architected the analytics data-stores enables us to provide our customers one year's worth of insights. Traffic is sampled at the edge data-centers. From those samples we structure IP flow logs, which are similar to sFlow, and bucket one minute's worth of traffic, grouped by destination IP, port and protocol. IP flows include multiple packet attributes such as TCP flags, source IPs and ports, Cloudflare data-center, etc. The source IP is considered PII data and is therefore only stored for 30 days, after which the source IPs are discarded and logs are rolled up into one-hour groups, and then one-day groups. The one-hour roll-ups are stored for 6 months and the one-day roll-ups for 1 year.</p><p>Similarly, attack logs are also stored efficiently. Attacks are stored as summaries with start/end timestamps, min/max/average/total bits and packets per second, attack type, action taken and more. A DDoS attack could easily consist of billions of packets, which could impact performance due to the number of read/write calls to the data-store. By storing attacks as summary logs, we're able to overcome these challenges and therefore provide attack logs for up to 1 year back.</p>
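Conceptually, each roll-up is a group-by-and-sum over coarser time buckets, with the PII column dropped. Here is a simplified sketch of that idea; the field names are illustrative and not the actual pipeline's schema:

```javascript
// Roll one-minute flow buckets up into one-hour buckets, summing packets and
// bits and dropping the source IP (treated as PII). Field names are illustrative.
function rollUpToHours(minuteBuckets) {
  const hours = new Map();
  for (const b of minuteBuckets) {
    const hour = b.timestamp.slice(0, 13) + ":00:00Z"; // truncate ISO time to the hour
    const key = `${hour}|${b.destIP}|${b.destPort}|${b.protocol}`;
    const agg = hours.get(key) ?? {
      timestamp: hour, destIP: b.destIP, destPort: b.destPort,
      protocol: b.protocol, packets: 0, bits: 0,
    };
    agg.packets += b.packets;
    agg.bits += b.bits;
    hours.set(key, agg);
  }
  return [...hours.values()];
}

const rolled = rollUpToHours([
  { timestamp: "2020-03-16T12:01:00Z", destIP: "203.0.113.1", destPort: 443, protocol: "tcp", packets: 100, bits: 80000, sourceIP: "198.51.100.7" },
  { timestamp: "2020-03-16T12:02:00Z", destIP: "203.0.113.1", destPort: 443, protocol: "tcp", packets: 50, bits: 40000, sourceIP: "198.51.100.9" },
]);
// One hourly bucket: 150 packets, 120000 bits, and no sourceIP field.
```

The one-day roll-up is the same operation again, keyed on the date instead of the hour.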
    <div>
      <h3>Network Analytics via GraphQL API</h3>
      <a href="#network-analytics-via-graphql-api">
        
      </a>
    </div>
    <p>We built this dashboard on the same analytics platform, meaning that our packet-level analytics are also available by GraphQL. As an example, below is an attack report query that would show the top attacker IPs, the data-center cities and countries where the attack was observed, the IP version distribution, the ASNs that were used by the attackers and the ports. The query is done at the account level, meaning it would provide a report for all of your IP ranges. In order to narrow the report down to a specific destination IP or port range, you can simply add additional filters. The same filters also exist in the UI.</p>
            <pre><code>{
  viewer {
    accounts(filter: { accountTag: $accountTag }) {
      topNPorts: ipFlows1mGroups(
        limit: 5
        filter: $portFilter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: sourcePort
          ipProtocol
          __typename
        }
        __typename
      }
      topNASN: ipFlows1mGroups(
        limit: 5
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: sourceIPAsn
          description: sourceIPASNDescription
          __typename
        }
        __typename
      }
      topNIPs: ipFlows1mGroups(
        limit: 5
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: sourceIP
          __typename
        }
        __typename
      }
      topNColos: ipFlows1mGroups(
        limit: 10
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: coloCity
          coloCode
          __typename
        }
        __typename
      }
      topNCountries: ipFlows1mGroups(
        limit: 10
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: coloCountry
          __typename
        }
        __typename
      }
      topNIPVersions: ipFlows1mGroups(
        limit: 2
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: ipVersion
          __typename
        }
        __typename
      }
      __typename
    }
    __typename
  }
}</code></pre>
            <p>Attack Report Query Example</p><p>After running the query using Altair GraphQL Client, the response is returned in a JSON format:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Di3dzwDX6jtShRjooRjud/ddf2420c5fecaf290782c80a9809cc1a/image-8.png" />
            
            </figure>
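The attack report query above takes `$accountTag`, `$filter`, and `$portFilter` as variables. A hypothetical set of variables might look like the following; the exact filter field names (e.g. `date_geq`, `destinationPort`) are assumptions based on the API's general filter conventions, so check the schema for your dataset before relying on them:

```javascript
// Hypothetical variables for the attack report query. The filter field names
// (date_geq, date_leq, destinationPort) are assumptions, not copied from the schema.
const variables = {
  accountTag: "YOUR_ACCOUNT_ID",
  filter: {
    AND: [
      { date_geq: "2020-03-01" },
      { date_leq: "2020-03-15" },
    ],
  },
  portFilter: {
    AND: [
      { date_geq: "2020-03-01" },
      { date_leq: "2020-03-15" },
      { destinationPort: 443 }, // narrow the report to a specific destination port
    ],
  },
};
```

Adding further entries to the `AND` arrays is how you'd narrow the report down to a specific destination IP or port range, mirroring the filters in the UI.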
    <div>
      <h2>What Do Customers Want?</h2>
      <a href="#what-do-customers-want">
        
      </a>
    </div>
    <p>As part of our product definition and design research stages, we interviewed internal customer-facing teams including Customer Support, Solution Engineering and more. I consider these stakeholders super-user-aggregators because they're customer-facing teams and are constantly engaging and helping our users. After the internal research phase, we expanded externally to customers and prospects, particularly network and security engineers and leaders. We wanted to know how they expect the dashboard to fit into their workflow, what their use cases are, and how we can tailor the dashboard to their needs. Long story short, we identified two main use cases: Incident Response and Reporting. Let's go into each of these use cases in more detail.</p>
    <div>
      <h2>Incident Response</h2>
      <a href="#incident-response">
        
      </a>
    </div>
    <p>I started off by asking them a simple question - "what do you do when you're paged?" We wanted to better understand their incident response process; specifically, how they’d expect to use this dashboard when responding to an incident, and which metrics matter most to them when making quick, calculated decisions.</p>
    <div>
      <h3>You’ve Just Been Paged</h3>
      <a href="#youve-just-been-paged">
        
      </a>
    </div>
    <p>Let’s say that you’re a security operations engineer. It’s Black Friday. You’re on call. You’ve just been paged. Traffic levels to one of your data-centers have exceeded a safe threshold. Boom. What do you do? Responding quickly and resolving the issue as soon as possible is key.</p><p>If your workflows are similar to our customers’, then your objective is to resolve the page as soon as possible. However, before you can resolve it, you need to determine if there is any action that you need to take. For instance, is this a legitimate rise in traffic from excited Black Friday shoppers, perhaps a new game release or maybe an attack that hasn’t been mitigated? Do you need to shift traffic to another data-center or are levels still stable? Our customers tell us that these are the metrics that matter the most:</p><ol><li><p>Top Destination IP and port - helps understand what services are being impacted</p></li><li><p>Top source IPs, port, ASN, data-center and data-center Country - helps identify the source of the traffic</p></li><li><p>Real-time packet and bit rates - helps understand the traffic levels</p></li><li><p>Protocol distribution - helps understand what type of traffic is abnormal</p></li><li><p>TCP flag distribution - an abnormal distribution could indicate an attack</p></li><li><p>Attack Log - shows what types of traffic are being dropped/rate-limited</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3jXNAgf5eCl4TZHjSum69h/fd1ae990776eacc8c42a4a71c93a440f/pasted-image-0--1-.png" />
            
            </figure><p>Customizable DDoS Attack Log</p><p>As network and transport layer attacks can be highly distributed and the packet attributes can be easily spoofed, it's usually not practical to block IPs. Instead, the dashboard enables you to quickly identify patterns such as an increased frequency of a specific TCP flag or increased traffic from a specific country. Identifying these patterns brings you one step closer to resolving the issue. After you’ve identified the patterns, packet-level filtering can be applied to drop or rate-limit the malicious traffic. If the attack was automatically mitigated by Cloudflare’s systems, you’ll be able to see it immediately along with the attack attributes in the activity log. By filtering by the Attack ID, the entire dashboard becomes your attack report.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/anQdgAoCWBPEpXqxwsIlu/ca684ea2d4a28572c257a418885a6f2a/pasted-image-0--2-.png" />
            
            </figure><p>Packet/Bit Distribution by Source &amp; Destination</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gZCFJPtI8Ca5LzU1hObkq/c7d8e4fb66ff42d53266ee3f8a642052/pasted-image-0--3-.png" />
            
            </figure><p>TCP Flag Distribution</p>
    <div>
      <h2>Reporting</h2>
      <a href="#reporting">
        
      </a>
    </div>
    <p>During our interviews with security and network engineers, we also asked them what metrics and insights they need when creating reports for their managers, C-levels, colleagues and providing evidence to law-enforcement agencies. After all, processing data and <a href="https://www.helpnetsecurity.com/2019/06/24/enterprise-visibility-concerns/">creating reports can consume over a third (36%) of a security team’s time</a> (~3 hours a day) and is also one of the most frequent DDoS asks by our customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2R2VjjzdLkYfA6Rra6yNjE/776dccff0e59dcdf61f56e498fa11651/image-5-1.png" />
            
            </figure><p>Add filters, select-time range, print and share</p><p>On top of all of these customizable insights, we wanted to also provide a one-line summary that would reflect your recent activity. The one-liner is dynamic and changes based on your activity. It tells you whether you're currently <a href="https://www.cloudflare.com/ddos/under-attack/">under attack</a>, and how many attacks were blocked. If your <a href="https://www.cloudflare.com/ciso/">CISO</a> is asking for a security update, you can simply copy-paste it and convey the efficiency of the service:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4bLPVC0BSMDOIak6KM1NRv/4451204a172dda3bead45dec75951523/image-7-1.png" />
            
            </figure><p>Dynamic Summary</p><p>Our customers say that they want to reflect the value of Cloudflare to their managers and peers:</p><ol><li><p>How much potential downtime and bandwidth did Cloudflare spare me?</p></li><li><p>What are my top attacked IPs and ports?</p></li><li><p>Where are the attacks coming from? What types and what are the trends?</p></li></ol>
    <div>
      <h3>The Secret To Creating Good Reports</h3>
      <a href="#the-secret-to-creating-good-reports">
        
      </a>
    </div>
    <p>What does everyone love? Cool Maps! The key to a good report is adding a map; showing where the attack came from. But given that packet attributes can be easily spoofed, including the source IP, it won't do us any good to plot a map based on the locations of the source IPs. It would result in a spoofed source country and is therefore useless data. Instead, we decided to show the geographic distribution of packets and bits based on the Cloudflare data-center in which they were ingested. As opposed to legacy scrubbing center solutions with limited network infrastructures, Cloudflare has data-centers in more than 200 cities around the world. This enables us to provide precise geographic distribution with high confidence, making your reports accurate.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2oK50PYyfJ1YpCsoToPxUJ/8d2bc811cec25424abcd4cc9f004bcf0/pasted-image-0--4-.png" />
            
            </figure><p>Packet/Bit Distribution by geography: Data-center City &amp; Country</p>
    <div>
      <h3>Tailored For You</h3>
      <a href="#tailored-for-you">
        
      </a>
    </div>
    <p>One of the main challenges both our customers and we struggle with is how to process and generate actionable insights from all of the data points. This is especially critical when responding to an incident. Under this assumption, we built this dashboard with the purpose of speeding up your reporting and investigation processes. By tailoring it to your needs, we hope to make you more efficient and make the most out of Cloudflare’s services. Got any feedback or questions? Post them below in the comments section.</p><p>If you’re an existing Magic Transit or BYOIP customer, then the dashboard is already available to you. Not a customer yet? Click <a href="https://www.cloudflare.com/magic-transit/">here</a> to learn more.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">46aPVEQ93JQ7TaQwqLXXgv</guid>
            <dc:creator>Omer Yoachimik</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we used our new GraphQL Analytics API to build Firewall Analytics]]></title>
            <link>https://blog.cloudflare.com/how-we-used-our-new-graphql-api-to-build-firewall-analytics/</link>
            <pubDate>Thu, 12 Dec 2019 15:41:20 GMT</pubDate>
            <description><![CDATA[ Firewall Analytics is the first product in the Cloudflare dashboard to utilize the new GraphQL API. All Cloudflare dashboard products are built using the same public APIs that we provide to our customers, allowing us to understand the challenges they face when interfacing with our APIs. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Firewall Analytics is the first product in the Cloudflare dashboard to utilize the new GraphQL Analytics API. All Cloudflare dashboard products are built using the same public APIs that we provide to our customers, allowing us to understand the challenges they face when interfacing with our APIs. This parity helps us build and shape our products, most recently the new GraphQL Analytics API that we’re thrilled to release today.</p><p>By defining the data we want, along with the response format, our GraphQL Analytics API has enabled us to prototype new functionality and iterate quickly from our beta user feedback. It is helping us deliver more insightful analytics tools within the Cloudflare dashboard to our customers.</p><p>Our user research and testing for <a href="/new-firewall-tab-and-analytics/#new-firewall-analytics-for-analysing-events-and-maintaining-optimal-configurations">Firewall Analytics</a> surfaced common use cases in our customers' workflow:</p><ul><li><p>Identifying spikes in firewall activity over time</p></li><li><p>Understanding the common attributes of threats</p></li><li><p>Drilling down into granular details of an individual event to identify potential false positives</p></li></ul><p>We can address all of these use cases using our new GraphQL Analytics API.</p>
    <div>
      <h3>GraphQL Basics</h3>
      <a href="#graphql-basics">
        
      </a>
    </div>
    <p>Before we look into how to address each of these use cases, let's take a look at the format of a GraphQL query and how our schema is structured.</p><p>A GraphQL query is composed of a structured set of fields, for which the server provides corresponding values in its response. The schema defines which fields are available and their type. You can find more information about the GraphQL query syntax and format in the <a href="https://graphql.org/learn/queries/">official GraphQL documentation</a>.</p><p>To run some GraphQL queries, we recommend downloading a GraphQL client, such as <a href="https://electronjs.org/apps/graphiql">GraphiQL</a>, to explore our schema and run some queries. You can find documentation on getting started with this in our <a href="https://developers.cloudflare.com/analytics/graphql-api/getting-started/">developer docs</a>.</p><p>At the top level of the schema is the <code>viewer</code> field. This represents the top level node of the user running the query. Within this, we can query the <code>zones</code> field to find zones the current user has access to, providing a <code>filter</code> argument with a <code>zoneTag</code> of the identifier of the zone we'd like to narrow down to.</p>
            <pre><code>{
  viewer {
    zones(filter: { zoneTag: "YOUR_ZONE_ID" }) {
      # Here is where we'll query our firewall events
    }
  }
}</code></pre>
            <p>Now that we have a query that finds our zone, we can start querying the firewall events which have occurred in that zone, to help solve some of the use cases we’ve identified.</p>
    <div>
      <h3>Visualising spikes in firewall activity</h3>
      <a href="#visualising-spikes-in-firewall-activity">
        
      </a>
    </div>
    <p>It's important for customers to be able to visualise and understand anomalies and spikes in their firewall activity, as these could indicate an attack or be the result of a misconfiguration.</p><p>Plotting events in a timeseries chart, by their respective action, provides users with a visual overview of the trend of their firewall events.</p><p>Within the <code>zones</code> field in the query we created earlier, we can query our firewall event aggregates using the <code>firewallEventsAdaptiveGroups</code> field, providing arguments for:</p><ul><li><p>A <code>limit</code> for the count of groups</p></li><li><p>A <code>filter</code> for the date range we're looking for (combined with any user-entered filters)</p></li><li><p>A list of fields to <code>orderBy</code> (in this case, just the <code>datetimeHour</code> field that we're grouping by).</p></li></ul><p>By adding the <code>dimensions</code> field, we're querying for groups of firewall events, aggregated by the fields nested within <code>dimensions</code>. In this case, our query includes the <code>action</code> and <code>datetimeHour</code> fields, meaning the response will be groups of firewall events which share the same action, and fall within the same hour. We also add a <code>count</code> field, to get a numeric count of how many events fall within each group.</p>
            <pre><code>query FirewallEventsByTime($zoneTag: string, $filter: FirewallEventsAdaptiveGroupsFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      firewallEventsAdaptiveGroups(
        limit: 576
        filter: $filter
        orderBy: [datetimeHour_DESC]
      ) {
        count
        dimensions {
          action
          datetimeHour
        }
      }
    }
  }
}</code></pre>
            <p><i>Note - Each of our group queries requires a limit to be set. A firewall event can have one of 8 possible actions, and we are querying over a 72 hour period. At most, we’ll end up with 576 groups, so we can set that as the limit for our query.</i></p><p>This query would return a response in the following format:</p>
            <pre><code>{
  "viewer": {
    "zones": [
      {
        "firewallEventsAdaptiveGroups": [
          {
            "count": 5,
            "dimensions": {
              "action": "jschallenge",
              "datetimeHour": "2019-09-12T18:00:00Z"
            }
          }
          ...
        ]
      }
    ]
  }
}</code></pre>
            <p>We can then take these groups and plot each as a point on a time series chart. Mapping over the <code>firewallEventsAdaptiveGroups</code> array, we can use the group’s <code>count</code> property on the y-axis for our chart, then use the nested fields within the <code>dimensions</code> object, using <code>action</code> as unique series and the <code>datetimeHour</code> as the time stamp on the x-axis.</p>
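That mapping step could look something like the following sketch, which turns the response's groups into one series per action (chart-library specifics omitted):

```javascript
// Transform firewallEventsAdaptiveGroups into one time series per action,
// ready to feed a charting library. A sketch, not the dashboard's actual code.
function toSeries(groups) {
  const byAction = new Map();
  for (const g of groups) {
    const { action, datetimeHour } = g.dimensions;
    if (!byAction.has(action)) byAction.set(action, []);
    // y-axis: the group's count; x-axis: the hour bucket it falls in
    byAction.get(action).push({ x: datetimeHour, y: g.count });
  }
  return byAction;
}

const series = toSeries([
  { count: 5, dimensions: { action: "jschallenge", datetimeHour: "2019-09-12T18:00:00Z" } },
  { count: 2, dimensions: { action: "block", datetimeHour: "2019-09-12T18:00:00Z" } },
]);
// series.get("jschallenge") → [{ x: "2019-09-12T18:00:00Z", y: 5 }]
```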
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RH0u6Cc6dIk9ztyZ3E0Ag/21256377669133e2416150a4b1798e85/pasted-image-0--1--3.png" />
            
            </figure>
    <div>
      <h3>Top Ns</h3>
      <a href="#top-ns">
        
      </a>
    </div>
    <p>After identifying a spike in activity, our next step is to highlight events with commonality in their attributes. For example, if a certain IP address or individual user agent is causing many firewall events, this could be a sign of an individual attacker, or could be surfacing a false positive.</p><p>Similarly to before, we can query aggregate groups of firewall events using the <code>firewallEventsAdaptiveGroups</code> field. However, in this case, instead of supplying <code>action</code> and <code>datetimeHour</code> to the group’s <code>dimensions</code>, we can add individual fields that we want to find common groups of.</p><p>By ordering by descending count, we’ll retrieve groups with the highest commonality first, limiting to the top 5 of each. We can add a single field nested within <code>dimensions</code> to group by it. For example, adding <code>clientIP</code> will give five groups with the IP addresses causing the most events.</p><p>We can also add a <code>firewallEventsAdaptiveGroups</code> field with no nested <code>dimensions</code>. This will create a single group which allows us to find the total count of events matching our filter.</p>
            <pre><code>query FirewallEventsTopNs($zoneTag: string, $filter: FirewallEventsAdaptiveGroupsFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      topIPs: firewallEventsAdaptiveGroups(
        limit: 5
        filter: $filter
        orderBy: [count_DESC]
      ) {
        count
        dimensions {
          clientIP
        }
      }
      topUserAgents: firewallEventsAdaptiveGroups(
        limit: 5
        filter: $filter
        orderBy: [count_DESC]
      ) {
        count
        dimensions {
          userAgent
        }
      }
      total: firewallEventsAdaptiveGroups(
        limit: 1
        filter: $filter
      ) {
        count
      }
    }
  }
}</code></pre>
            <p><i>Note - we can add the </i><code><i>firewallEventsAdaptiveGroups</i></code><i> field multiple times within a single query, each aliased differently. This allows us to fetch multiple different groupings by different fields, or with no groupings at all. In this case, getting a list of top IP addresses, top user agents, and the total events.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48ybDlyb9qiOXRcqbxYMBT/ef8646776579eae9468db6b0e34dd7cc/pasted-image-0--2--1.png" />
            
            </figure><p>We can then reference each of these aliases in the UI, mapping over their respective groups to render each row with its count, and a bar which represents the proportion of total events, showing the proportion of all events each row equates to.</p>
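Computing each row's bar width is then a simple proportion of the aliased <code>total</code> group. A sketch of that calculation:

```javascript
// Given one of the top-N group arrays and the single "total" group from the
// aliased query, attach each row's share of all events for the proportion bar.
function withProportions(topGroups, totalGroup) {
  const total = totalGroup.count;
  return topGroups.map((g) => ({
    ...g.dimensions,
    count: g.count,
    percent: total === 0 ? 0 : (g.count / total) * 100,
  }));
}

const rows = withProportions(
  [{ count: 48, dimensions: { clientIP: "198.51.100.7" } },
   { count: 16, dimensions: { clientIP: "198.51.100.9" } }],
  { count: 64 }
);
// rows[0].percent === 75, rows[1].percent === 25
```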
    <div>
      <h3>Are these firewall events false positives?</h3>
      <a href="#are-these-firewall-events-false-positives">
        
      </a>
    </div>
    <p>After users have identified spikes, anomalies and common attributes, we wanted to surface more information as to whether these have been caused by malicious traffic, or are false positives.</p><p>To do this, we wanted to provide additional context on the events themselves, rather than just counts. We can do this by querying the <code>firewallEventsAdaptive</code> field for these events.</p><p>Our GraphQL schema uses the same filter format for both the aggregate <code>firewallEventsAdaptiveGroups</code> field and the raw <code>firewallEventsAdaptive</code> field. This allows us to use the same filters to fetch the individual events which sum to the counts and aggregates in the visualisations above.</p>
            <pre><code>query FirewallEventsList($zoneTag: string, $filter: FirewallEventsAdaptiveFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      firewallEventsAdaptive(
        filter: $filter
        limit: 10
        orderBy: [datetime_DESC]
      ) {
        action
        clientAsn
        clientCountryName
        clientIP
        clientRequestPath
        clientRequestQuery
        datetime
        rayName
        source
        userAgent
      }
    }
  }
}</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BmadcWdFvLxALf8fO3MbW/3d8878f086b962e671162b6d62e4fc7f/pasted-image-0--3--1.png" />
            
            </figure><p>Once we have our individual events, we can render all of the individual fields we’ve requested, providing users the additional context on each event they need to determine whether it is a false positive or not.</p><p>That’s how we used our new GraphQL Analytics API to build Firewall Analytics, helping solve some of our customers’ most common security workflow use cases. We’re excited to see what you build with it, and the problems you can help tackle.</p><p>You can find out how to get started querying our GraphQL Analytics API using GraphiQL in our <a href="https://developers.cloudflare.com/analytics/graphql-api/getting-started/">developer documentation</a>, or learn more about writing GraphQL queries in the official GraphQL Foundation <a href="https://graphql.org/learn/queries/">documentation</a>.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[GraphQL]]></category>
            <guid isPermaLink="false">3I5BKE6KqU328LA4QoiQQX</guid>
            <dc:creator>Nick Downie</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing the GraphQL Analytics API: exactly the data you need, all in one place]]></title>
            <link>https://blog.cloudflare.com/introducing-the-graphql-analytics-api-exactly-the-data-you-need-all-in-one-place/</link>
            <pubDate>Thu, 12 Dec 2019 15:41:04 GMT</pubDate>
            <description><![CDATA[ With our new GraphQL Analytics API, all of your performance, security, and reliability data is available from one endpoint, and you can select exactly what you need. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Today we’re excited to announce a powerful and flexible new way to explore your Cloudflare metrics and logs, with an API conforming to the industry-standard <a href="https://graphql.org/">GraphQL specification</a>. With our new GraphQL Analytics API, all of your performance, security, and reliability data is available from one endpoint, and you can select exactly what you need, whether it’s one metric for one domain or multiple metrics aggregated for all of your domains. You can ask questions like <i>“How many cached bytes have been returned for these three domains?”</i> Or, <i>“How many requests have all the domains under my account received?”</i> Or even, <i>“What effect did changing my firewall rule an hour ago have on the responses my users were seeing?”</i></p><p>The GraphQL standard also has strong <a href="https://graphql.org/community/">community resources</a>, from extensive documentation to front-end clients, making it easy to start creating simple queries and progress to building your own sophisticated analytics dashboards.</p>
    <div>
      <h3>From many APIs...</h3>
      <a href="#from-many-apis">
        
      </a>
    </div>
    <p>Providing insights has always been a core part of Cloudflare’s offering. After all, by using Cloudflare, you’re relying on us for key parts of your infrastructure, and so we need to make sure you have the data to manage, <a href="https://www.cloudflare.com/application-services/solutions/app-performance-monitoring/">monitor</a>, and troubleshoot your website, app, or service. Over time, we developed a few key data APIs, including ones providing information regarding your domain’s traffic, DNS queries, and firewall events. This multi-API approach was acceptable while we had only a few products, but we started to run into some challenges as we added more products and analytics. We couldn’t expect users to adopt a new analytics API every time they started using a new product. In fact, some of the customers and partners that were relying on many of our products were already becoming confused by the various APIs.</p><p>Following the multi-API approach was also affecting how quickly we could develop new analytics within the Cloudflare dashboard, which is used by more people for data exploration than our APIs. Each time we built a new product, our product engineering teams had to implement a corresponding analytics API, which our user interface engineering team then had to learn to use. This process could take up to several months for each new set of analytics dashboards.</p>
    <div>
      <h3>...to one</h3>
      <a href="#to-one">
        
      </a>
    </div>
    <p>Our new GraphQL Analytics API solves these problems by providing access to all Cloudflare analytics. It offers a standard, flexible syntax for describing exactly the data you need and provides predictable, matching responses. This approach makes it an ideal tool for:</p><ol><li><p>Data exploration. You can think of it as a way to query your own virtual data warehouse, full of metrics and logs regarding the performance, security, and reliability of your Internet property.</p></li><li><p>Building amazing dashboards, which allow for flexible filtering, sorting, and drilling down or rolling up. Creating these kinds of dashboards would normally require paying thousands of dollars for a specialized analytics tool. You get them as part of our product and can customize them for yourself using the API.</p></li></ol><p>In a companion post that was also published today, my colleague Nick discusses using the GraphQL Analytics API to build dashboards. So, in this post, I’ll focus on examples of how you can use the API to explore your data. To make the queries, I’ll be using <a href="https://electronjs.org/apps/graphiql"><i>GraphiQL</i></a>, a popular open-source querying tool that takes advantage of GraphQL’s capabilities.</p>
    <div>
      <h3>Introspection: what data is available?</h3>
      <a href="#introspection-what-data-is-available">
        
      </a>
    </div>
    <p>The first thing you may be wondering: if the GraphQL Analytics API offers access to so much data, how do I figure out what exactly is available, and how to ask for it? GraphQL makes this easy by offering “introspection,” meaning you can query the API itself to see the available data sets, the fields and their types, and the operations you can perform. <i>GraphiQL</i> uses this functionality to provide a “Documentation Explorer,” query auto-completion, and syntax validation. For example, here is how I can see all the data sets available for a zone (domain):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6v9HqDr6UGaifduopecAMw/6f89f52d3ca080c7385710598bd8b26e/image1.gif" />
            
            </figure><p>If I’m writing a query, and I’m interested in data on firewall events, auto-complete will help me quickly find relevant data sets and fields:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5RSkIKlSwBMy7BVZWPWp0q/c605fa07bb8997758e2e84cbb47406f9/image6.gif" />
            
            </figure>
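<p>Introspection also works outside of <i>GraphiQL</i>: you can POST an introspection query to the API directly. A sketch of building such a request in JavaScript, assuming Bearer-token authentication against the documented Analytics endpoint (the token value is a placeholder, and the request is constructed but not sent here):</p>
<pre><code>```javascript
// Build (but don't send) an introspection request for the GraphQL
// Analytics API. The query asks the schema itself which top-level
// fields (data sets) are available.
const INTROSPECTION_QUERY = `
  {
    __schema {
      queryType {
        fields {
          name
          description
        }
      }
    }
  }
`;

function buildRequest(apiToken) {
  return {
    url: "https://api.cloudflare.com/client/v4/graphql",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiToken}`,
      },
      body: JSON.stringify({ query: INTROSPECTION_QUERY }),
    },
  };
}

const req = buildRequest("YOUR_API_TOKEN");
// Send with: fetch(req.url, req.options).then((r) => r.json())
```</code></pre>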
    <div>
      <h3>Querying: examples of questions you can ask</h3>
      <a href="#querying-examples-of-questions-you-can-ask">
        
      </a>
    </div>
    <p>Let’s say you’ve made a major product announcement and expect a surge in requests to your blog, your application, and several other zones (domains) under your account. You can check if this surge materializes by asking for the requests aggregated under your account, in the 30 minutes after your announcement post, broken down by the minute:</p>
            <pre><code>{
  viewer {
    accounts(filter: { accountTag: $accountTag }) {
      httpRequests1mGroups(
        limit: 30
        filter: { datetime_geq: "2019-09-16T20:00:00Z", datetime_lt: "2019-09-16T20:30:00Z" }
        orderBy: [datetimeMinute_ASC]
      ) {
        dimensions {
          datetimeMinute
        }
        sum {
          requests
        }
      }
    }
  }
}</code></pre>
            <p>Here is the first part of the response, showing requests for your account, by the minute:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hsWUNy86iymsBlbiOfjL5/054c2bce4d1fef4090e56f264fee7aec/Screen-Shot-2019-09-17-at-2.21.41-PM.png" />
            
            </figure><p>Now, let’s say you want to compare the traffic coming to your blog versus your marketing site over the last hour. You can do this in one query, asking for the number of requests to each zone:</p>
            <pre><code>{
  viewer {
    zones(filter: { zoneTag_in: [$zoneTag1, $zoneTag2] }) {
      httpRequests1hGroups(
        limit: 2
        filter: { datetime_geq: "2019-09-16T20:00:00Z", datetime_lt: "2019-09-16T21:00:00Z" }
      ) {
        sum {
          requests
        }
      }
    }
  }
}</code></pre>
            <p>Here is the response:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1piQ5uNUOVcsX3AqEADplu/08f469a6a92b8df8a2e59e523aa497b0/Screen-Shot-2019-12-11-at-4.04.37-PM.png" />
            
            </figure><p>Finally, let’s say you’re seeing an increase in error responses. Could this be correlated to an attack? You can look at error codes and firewall events over the last 15 minutes, for example:</p>
            <pre><code>{
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequests1mGroups(
        limit: 100
        filter: { datetime_geq: "2019-09-16T21:00:00Z", datetime_lt: "2019-09-16T21:15:00Z" }
      ) {
        sum {
          responseStatusMap {
            edgeResponseStatus
            requests
          }
        }
      }
      firewallEventsAdaptiveGroups(
        limit: 100
        filter: { datetime_geq: "2019-09-16T21:00:00Z", datetime_lt: "2019-09-16T21:15:00Z" }
      ) {
        dimensions {
          action
        }
        count
      }
    }
  }
}</code></pre>
            <p>Notice that, in this query, we’re looking at multiple datasets at once, using a common zone identifier to “join” them. Here are the results:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7G5to39LSIQSOtv9KgIzIL/78f15ef423f901c4e027ef7cb67cddd8/Screen-Shot-2019-12-11-at-4.28.28-PM.png" />
            
            </figure><p>By examining both data sets in parallel, we can see a correlation: 31 requests were “dropped” or blocked by the Firewall, which is exactly the same as the number of “403” responses. So, the 403 responses were a result of Firewall actions.</p>
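<p>A sketch of that correlation check in JavaScript, assuming the combined response shape produced by the query above (the sample counts below are illustrative, chosen to mirror the 31-event match described in the text):</p>
<pre><code>```javascript
// Given one zone's combined response, compare the number of 403
// responses with the number of firewall events whose action blocked
// the request. Sample numbers are illustrative.
const zone = {
  httpRequests1mGroups: [
    {
      sum: {
        responseStatusMap: [
          { edgeResponseStatus: 200, requests: 1200 },
          { edgeResponseStatus: 403, requests: 31 },
        ],
      },
    },
  ],
  firewallEventsAdaptiveGroups: [
    { dimensions: { action: "drop" }, count: 31 },
    { dimensions: { action: "log" }, count: 5 },
  ],
};

// Sum all 403 responses across the per-minute groups.
const forbidden = zone.httpRequests1mGroups
  .flatMap((g) => g.sum.responseStatusMap)
  .filter((s) => s.edgeResponseStatus === 403)
  .reduce((acc, s) => acc + s.requests, 0);

// Sum all firewall events whose action blocked the request.
const blocked = zone.firewallEventsAdaptiveGroups
  .filter((g) => ["drop", "block"].includes(g.dimensions.action))
  .reduce((acc, g) => acc + g.count, 0);

// If the two totals match, the 403s are explained by Firewall actions.
const firewallExplains403s = forbidden === blocked;
```</code></pre>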
    <div>
      <h3>Try it today</h3>
      <a href="#try-it-today">
        
      </a>
    </div>
    <p>To learn more about the GraphQL Analytics API and start exploring your Cloudflare data, follow the “Getting started” guide in our <a href="https://developers.cloudflare.com/analytics/graphql-api/getting-started/">developer documentation</a>, which also has details regarding the current data sets and time periods available. We’ll be adding more data sets over time, so take advantage of the introspection feature to see the latest available.</p><p>Finally, to make way for the new API, the Zone Analytics API is now deprecated and will be sunset on May 31, 2020. The data that Zone Analytics provides is available from the GraphQL Analytics API. If you’re currently using the API directly, please follow our <a href="https://developers.cloudflare.com/analytics/migration-guides/zone-analytics/">migration guide</a> to change your API calls. If you get your analytics using the Cloudflare dashboard or our <a href="https://docs.datadoghq.com/integrations/cloudflare/">Datadog integration</a>, you don’t need to take any action.</p>
    <div>
      <h3>One more thing....</h3>
      <a href="#one-more-thing">
        
      </a>
    </div>
    <p>In the API examples above, if you find it helpful to get analytics aggregated for all the domains under your account, we have something else you may like: a brand new <a href="https://dash.cloudflare.com/?account=analytics">Analytics dashboard</a> (in beta) that provides this same information. If your account has many zones, the dashboard is helpful for knowing summary information on metrics such as requests, bandwidth, cache rate, and error rate. Give it a try and let us know what you think using the feedback link above the new dashboard.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4pDXIFEVbZmJSNDNttfleG</guid>
            <dc:creator>Filipp Nisenzoun</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building a GraphQL server on the edge with Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/building-a-graphql-server-on-the-edge-with-cloudflare-workers/</link>
            <pubDate>Wed, 14 Aug 2019 03:21:00 GMT</pubDate>
            <description><![CDATA[ Today, we're open-sourcing an exciting project that showcases the strengths of our Cloudflare Workers platform: workers-graphql-server is a batteries-included Apollo GraphQL server, designed to get you up and running quickly with GraphQL. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Today, we're open-sourcing an exciting project that showcases the strengths of our Cloudflare Workers platform: <code>workers-graphql-server</code> is a batteries-included <a href="https://apollographql.com">Apollo GraphQL</a> server, designed to get you up and running quickly with <a href="https://graphql.com">GraphQL</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1AJH4z33GBgzCVmgEAuaSQ/e7643db81ee32fef75113d0bc22df3e8/Screen-Shot-2019-08-14-at-11.05.06-AM.png" />
            
            </figure><p>Testing GraphQL queries in the GraphQL Playground</p><p>As a full-stack developer, I’m really excited about GraphQL. I love building user interfaces with <a href="https://reactjs.org">React</a>, but as a project gets more complex, it can become really difficult to manage how data flows through an application. GraphQL makes that really easy — instead of having to recall the REST URL structure of your backend API, or remember when your backend server doesn't <i>quite</i> follow REST conventions — you just tell GraphQL what data you want, and it takes care of the rest.</p><p>Cloudflare Workers is uniquely suited to <a href="https://www.cloudflare.com/developer-platform/solutions/hosting/">hosting</a> a GraphQL server. Because your code is running on Cloudflare's servers around the world, the average latency for your requests is extremely low, and by using <a href="https://github.com/cloudflare/wrangler">Wrangler</a>, our open-source command line tool for building and managing Workers projects, you can deploy new versions of your GraphQL server around the world within seconds.</p><p>If you'd like to try the GraphQL server, check out a demo GraphQL playground, <a href="https://graphql-on-workers.signalnerve.com/___graphql">deployed on Workers.dev</a>. 
This optional add-on to the GraphQL server allows you to experiment with GraphQL queries and mutations, giving you a super powerful way to understand how to interface with your data, without having to hop into a codebase.</p><p>If you're ready to get started building your own GraphQL server with our new open-source project, we've added a new tutorial to our <a href="https://workers.cloudflare.com/docs">Workers documentation</a> to help you get up and running — <a href="https://developers.cloudflare.com/workers/get-started/quickstarts#frameworks">check it out here</a>!</p><p>Finally, if you're interested in <i>how</i> the project works, or want to help contribute — it's open-source! We'd love to hear your feedback and see your contributions. Check out the project <a href="https://github.com/signalnerve/workers-graphql-server">on GitHub</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">l68Fsw4jkO2iFbriaKvyD</guid>
            <dc:creator>Kristian Freeman</dc:creator>
        </item>
    </channel>
</rss>