Workers Analytics Engine (Analytics Engine for short) is a new way for developers to store and analyze time-series analytics about anything using Cloudflare Workers, and it’s now in open beta! Analytics Engine is designed to gather time-series data from high-cardinality, high-volume data sets in Cloudflare Workers. At Cloudflare, we use Analytics Engine to provide insight into how our customers use Cloudflare products.
Log, log, logging!
As an example, Analytics Engine is used to observe the backend that powers Instant Logs. Instant Logs allows Cloudflare customers to stream a live session of the HTTP logs for their domain to the Cloudflare dashboard. The backend for Instant Logs is built on Cloudflare Workers.
Briefly, the Instant Logs backend works by receiving requests from each Cloudflare server that processes a customer's HTTP traffic. These requests contain the HTTP logs for the customer’s HTTP traffic. The Instant Logs backend then forwards these HTTP logs to the customer’s browser via a WebSocket.
In order to ensure that the HTTP logs are being delivered smoothly to a customer's browser, we need to track the request rates across all active Instant Logs sessions. We also need to track the request rates across all Cloudflare data centers, since Instant Logs is built on Cloudflare Workers, and Cloudflare Workers is built on Cloudflare’s massive network. As a result, the data set for the Instant Logs backend has really massive cardinality!
“Traditional” metrics systems like Prometheus are poorly suited to serving high cardinality data. Fortunately, this is exactly the problem that Analytics Engine is designed to solve. So, we sent all the Instant Logs backend request logs to Analytics Engine. Log, log, logging!
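As a rough illustration of what sending these logs can look like from a Worker, the sketch below records one data point per incoming request. It assumes a dataset binding that exposes writeDataPoint with string blobs and numeric doubles. The function name, the field layout, and the in-memory stub used for local testing are hypothetical, and this is not the actual Instant Logs code.

```typescript
// Minimal shape of an Analytics Engine dataset binding (assumption:
// writeDataPoint accepts string "blobs" and numeric "doubles").
interface DataPoint {
  blobs: string[];
  doubles: number[];
}

interface DatasetBinding {
  writeDataPoint(point: DataPoint): void;
}

// Hypothetical field layout: blob1 = session ID, blob2 = data center,
// double1 = number of log lines carried by the request.
function recordRequest(
  dataset: DatasetBinding,
  sessionId: string,
  dataCenter: string,
  logLines: number
): void {
  dataset.writeDataPoint({
    blobs: [sessionId, dataCenter],
    doubles: [logLines],
  });
}

// In-memory stand-in for the real binding, for local testing only.
class InMemoryDataset implements DatasetBinding {
  points: DataPoint[] = [];
  writeDataPoint(point: DataPoint): void {
    this.points.push(point);
  }
}

const stubDataset = new InMemoryDataset();
recordRequest(stubDataset, "session-123", "SJC", 42);
```

In a deployed Worker, the stub would be replaced by the dataset binding configured for that Worker.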
Using the Analytics Engine API (which has a SQL interface), we can visualize the Instant Logs backend request rates for the top sessions and top data centers over the previous month. “Zooming in” to an interesting period is also really fast. We’ve designed Analytics Engine so that queries always respond within the window of interactivity (more on this later). This makes it well-suited for interactive debugging with a dashboard tool (in this case we’re using Grafana).
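To make this concrete, here is a minimal sketch of building a "top sessions" query and the HTTP request that carries it. It assumes the SQL API is reached by POSTing the query text to the account's analytics_engine/sql endpoint with a bearer token. The dataset name, the placeholder credentials, and the assumption that blob1 holds the session ID are all hypothetical.

```typescript
// Description of an HTTP request to the SQL API. Kept as a plain object so
// the sketch does not depend on any particular runtime's fetch types.
interface SqlApiRequest {
  url: string;
  method: string;
  headers: { Authorization: string };
  body: string;
}

// Top sessions by estimated request count over the last `days` days.
// Assumes blob1 holds the session ID (hypothetical field layout).
function topSessionsQuery(datasetName: string, days: number): string {
  return [
    "SELECT blob1 AS session_id, SUM(_sample_interval) AS requests",
    "FROM " + datasetName,
    "WHERE timestamp > NOW() - INTERVAL '" + days + "' DAY",
    "GROUP BY session_id",
    "ORDER BY requests DESC",
    "LIMIT 10",
  ].join("\n");
}

// Wrap a SQL string in a request description for the SQL API.
function buildSqlApiRequest(
  accountId: string,
  apiToken: string,
  sql: string
): SqlApiRequest {
  return {
    url:
      "https://api.cloudflare.com/client/v4/accounts/" +
      accountId +
      "/analytics_engine/sql",
    method: "POST",
    headers: { Authorization: "Bearer " + apiToken },
    body: sql,
  };
}

// Placeholder credentials and a hypothetical dataset name.
const sqlRequest = buildSqlApiRequest(
  "ACCOUNT_ID",
  "API_TOKEN",
  topSessionsQuery("instant_logs_backend", 30)
);
```

The resulting object can be handed to whichever HTTP client is available, for example fetch(sqlRequest.url, sqlRequest) in a runtime that provides it.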
What we learned in closed beta
We received a lot of great feedback during the closed beta. Developers were excited about the SQL API, ease of integration with Workers, the ability to query data in Grafana (with more integrations in future), and our simple pricing model (free!). However, there were a number of things that we needed to fix before moving on to the open beta phase.
Developers were supportive of our choice to use SQL (the world’s language for data) as the interface for the Analytics Engine API. However, when developers used the Analytics Engine API, they found that the error messages were opaque and difficult to debug. For the open beta, we have rewritten the API from the ground up to provide much-improved error messaging.
Before:
> SELECT column_that_does_not_exist FROM your_dataset FORMAT JSON
Sorry, we were unable to evaluate your query
After:
> SELECT column_that_does_not_exist FROM your_dataset FORMAT JSON
cannot select unknown column: "column_that_does_not_exist"
In addition to understanding what went wrong, developers also wanted to understand what the API is capable of doing. For the open beta, we’ve written a comprehensive SQL reference for Analytics Engine. We also have a few “How To” guides, including information on how to hook up the API to Grafana.
ABR and Analytics Engine
Analytics Engine uses Cloudflare’s ABR technology to make queries fast. This means that every query is satisfied by a resolution of the data that matches the query. For example, if we are looking at data for the last month, we might use a lower resolution version of the Analytics Engine data than if we are looking at the last hour. The lower resolution data will still provide the correct answer, and the query will respond within the window of interactivity. By using multiple, different resolutions of the same data, ABR provides consistent response times.
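The idea of matching a resolution to a query can be sketched as follows. This is an illustrative toy, not Cloudflare's actual ABR algorithm: the set of stored resolutions, the row budget, and the selection rule are all assumptions made for the example.

```typescript
// Hypothetical set of stored resolutions, expressed as sample intervals:
// 1 = all events, 10 = 10% of events, 100 = 1%, 1000 = 0.1%.
const SAMPLE_INTERVALS: number[] = [1, 10, 100, 1000];

// Pick the finest resolution whose expected row count fits within a budget,
// so that wide time ranges are served from coarser data.
function pickSampleInterval(
  eventsPerHour: number,
  hours: number,
  rowBudget: number
): number {
  const totalEvents = eventsPerHour * hours;
  let chosen = 1;
  for (const interval of SAMPLE_INTERVALS) {
    chosen = interval;
    if (totalEvents / interval <= rowBudget) {
      break;
    }
  }
  // Falls back to the coarsest resolution if nothing fits the budget.
  return chosen;
}

// With one million events per hour and a budget of ten million rows:
const lastHourInterval = pickSampleInterval(1000000, 1, 10000000); // 1
const lastMonthInterval = pickSampleInterval(1000000, 720, 10000000); // 100
```

In this toy model, the one-hour query reads every event while the one-month query reads 1% of them, so both scan a bounded number of rows.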
To account for the different resolutions of data, each event carries with it information about the resolution of data that the event comes from. This information is encoded in the _sample_interval column. For example, if an event comes from a resolution of the data which is 1% of the original data, its _sample_interval will be set to 100. To reconstruct the number of events in the original data, we can use the query:
SELECT sum(_sample_interval) AS count FROM dataset
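The arithmetic behind this query can be sketched in a few lines. The rows below are made-up sample data, and double1 stands for a hypothetical numeric metric. Each stored row stands in for _sample_interval original events, so counts and sums are both weighted by that column.

```typescript
interface SampledRow {
  _sample_interval: number;
  double1: number; // hypothetical metric, e.g. log lines per request
}

// Equivalent of SELECT sum(_sample_interval): each stored row represents
// `_sample_interval` events in the original data.
function estimateCount(rows: SampledRow[]): number {
  return rows.reduce((total, row) => total + row._sample_interval, 0);
}

// Equivalent of SELECT sum(_sample_interval * double1): the metric of each
// stored row is scaled by the number of events it represents.
function estimateSum(rows: SampledRow[]): number {
  return rows.reduce(
    (total, row) => total + row._sample_interval * row.double1,
    0
  );
}

// Two rows sampled at 1% and one row stored at full resolution.
const sampledRows: SampledRow[] = [
  { _sample_interval: 100, double1: 5 },
  { _sample_interval: 100, double1: 7 },
  { _sample_interval: 1, double1: 3 },
];

const estimatedCount = estimateCount(sampledRows); // 201
const estimatedSum = estimateSum(sampledRows); // 1203
```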
For the open beta, we are exposing _sample_interval directly to developers. In the future, we’ll make it easier to work with this field by providing convenience functions which automatically take into account varying resolutions of the data. We also want to provide the ability to understand the confidence level of the estimates that these functions return.
Coming soon
This is just the beginning for Workers Analytics Engine. Internally, there has been high demand for the ability to define alerts based on the data captured by Analytics Engine. This is also something that we want developers to be able to do.
As in the closed beta, fields are accessed via names that have 1-based indexing (blob1, blob2, double1, double2, etc.). In the future, we will allow developers to attach names to fields, and these names will be available to use to retrieve data via the SQL API.
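Until named fields are available, one option is to keep the mapping in application code. The sketch below is a hypothetical helper, not part of Analytics Engine: it translates a friendly field name into the 1-based blobN or doubleN column to use in a SQL query, based on the order in which the fields are written.

```typescript
// Friendly names, listed in the same order as the values are written.
interface FieldNames {
  blobs: string[];
  doubles: string[];
}

// Translate a friendly name into its 1-based column name.
function columnFor(names: FieldNames, field: string): string {
  const blobIndex = names.blobs.indexOf(field);
  if (blobIndex >= 0) {
    return "blob" + (blobIndex + 1);
  }
  const doubleIndex = names.doubles.indexOf(field);
  if (doubleIndex >= 0) {
    return "double" + (doubleIndex + 1);
  }
  throw new Error("unknown field: " + field);
}

// Hypothetical layout for an Instant Logs style dataset.
const instantLogsFields: FieldNames = {
  blobs: ["session_id", "data_center"],
  doubles: ["log_lines"],
};

const dataCenterColumn = columnFor(instantLogsFields, "data_center"); // "blob2"
const logLinesColumn = columnFor(instantLogsFields, "log_lines"); // "double1"
```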
We also want to provide a rich UX in the Cloudflare dashboard (imagine something like Grafana in the Cloudflare dashboard). Ultimately, we don’t want developers to have to set up their own infrastructure for exploring data captured with Analytics Engine.
Conclusion
Try Workers Analytics Engine today! Please let us know if you have any ideas or more advanced use cases that aren’t supported. We’re also discussing everything about Analytics Engine in our Discord channel, so join the conversation!