March 18, 2022

Cloudflare Observability

Tanushree Sharma, Natasha Wissmann, Ashcon Partovi, and Michael Tremante

7 minute read

This post is also available in 日本語 and 简体中文.

Whether you’re a software engineer deploying a new feature, network engineer updating routes, or a security engineer configuring a new firewall rule: You need visibility to know if your system is behaving as intended — and if it’s not, to know how to fix it.

Cloudflare is committed to helping our customers get visibility into the services they have protected behind Cloudflare. Being a single pane of glass for all network activity has always been one of Cloudflare’s goals. Today, we’re outlining the future vision for Cloudflare observability.

What is observability?

Observability means gaining visibility into the internal state of a system. It’s used to give users the tools to figure out what’s happening, where it’s happening, and why.

At Cloudflare, we believe that observability has three core components: monitoring, analytics, and forensics. Monitoring measures the health of a system - it tells you when something is going wrong. Analytics give you the tools to visualize data to identify patterns and insights. Forensics helps you answer very specific questions about an event.

Observability becomes particularly important in the context of security to validate that any mitigating actions performed by our security products, such as Firewall or Bot Management, are not false positives. Was that request correctly classified as malicious? And if it wasn’t, which detection system classified it as such?

Cloudflare, additionally, has products to improve performance of applications and corporate networks and allow developers to write lightning fast code that runs on our global network. We want to be able to provide our customers with insights into every request, packet, and fetch that goes through Cloudflare’s network.

Monitoring and Notifying

Analytics are fantastic for summarizing data, but how do you know when to look at them? No one wants to sit on the dashboard clicking refresh over and over again just in case something looks off. That’s where notifications come in.

When we talk about something “looking off” on an analytics page, what we really mean is that there’s a significant change in your traffic or network which is reflected by spikes or drops in our analytics. Availability and performance directly affect end users, and our goal is to monitor and notify our customers as soon as we see things going wrong.

Graph showing a spike in Firewall block events

Today, we have many different types of notifications from Origin Error Rates, Security Events, and Advanced Security Events to Usage Based Billing and Health Checks. We’re continuously adding more notification types to have them correspond with our awesome analytics. As our analytics get more customizable, our notifications will as well.

There’s tons of different algorithms that can be used to detect spikes, including using burn rates and z-scores. We’re continuing to iterate on the algorithms that we use for detections to offer more variations, make them smarter, and make sure that our notifications are both accurate and not too noisy.

Analytics

So, you’ve received an alert from Cloudflare. What comes next?

Analytics can be used to get a birds eye view of traffic or focus on specific types of events by adding filters and time ranges. After you receive an alert, we want to show you exactly what’s been triggered through graphs, high level metrics, and top Ns on the Cloudflare dashboard.

Whether you’re a developer, security analyst, or network engineer, the Cloudflare dashboard should be the spot for you to see everything you need. We want to make the dashboard more customizable to serve the diverse use cases of our customers. Analyze data by specifying a timeframe and filter through dropdowns on the dashboard, or build your own metrics and graphs that work alongside the raw logs to give you a clear picture of what's happening.

Focusing on security, we believe analytics are the best tool to build confidence before deploying security policies. Moving forward, we plan to layer all of our security related detection signals on top of HTTP analytics so you can use the dashboard to answer questions such as: if I were to block all requests that the WAF identifies as an XSS attack, what would I block?

Customers using our enterprise Bot Management may already be familiar with this experience, and as we improve it and build upon it further, all of our other security products will follow.

Bot Management requests by bot score graph

Analytics are a powerful tool to see high level patterns and identify anomalies that indicate that something unusual is happening. We’re working on new dashboards, customizations, and features that widen the use cases for our customers. Stay tuned!

Logs

Logs are used when you want to examine specific details about an event. They consist of a timestamp and fields that describe the event and are used to get visibility on a granular level when you need a play-by-play.

In each of our datasets, an event measures something different. For example, in HTTP request logs, an event is when an end user requests content from or sends content to a server. For Firewall logs, an event occurs when the Firewall takes an action on an HTTP request. There can be multiple Firewall events for each HTTP request.

Today, our customers access logs using Logpull, Logpush, or Instant Logs. Logpull and Logpush are great for customers that want to send their logs to third parties (like our Analytics Partners) to store, analyze, and correlate with other data sources. With Instant Logs, our customers can monitor and troubleshoot their traffic in real-time straight from the dashboard or CLI. We’re planning on building out more capabilities to dig into logs on Cloudflare. We’re hard at work on building log storage on R2 - but what’s next?

We’ve heard from customers that the activity log on the Firewall analytics dashboard is incredibly useful. We want to continue to bring the power of logs to the dashboard by adding the same functionality across our products. For customers that will store their logs on Cloudflare R2, this means that we can minimize the use of sampled data.

If you’re looking for something very specific, querying logs is also important, which is where forensics comes in. The goal is to let you investigate from high level analytics all the way down to individual logs lines that make them up. Given a unique identifier, such as the ray ID, you should be able to look up a single request, and then correlate it with all other related activity. Find out the client IP of that ray ID and from there, use cases are plentiful: what other requests from this IP are malicious? What paths did the client follow?

Tracing

Logs are really useful, but they don’t capture the context around a request. Traces show the end-to-end life cycle of a request from when a user requests a resource to each of the systems that are involved in its delivery. They’re another way of applying forensics to help you find something very specific.

These are used to differentiate each part of the application to identify where errors or bottlenecks are occurring. Let's say that you have a Worker that performs a fetch event to your origin and a third party API. Analytics can show you average execution times and error rates for your Worker, but it doesn’t give you visibility into each of these operations.

Using wrangler dev and console.log statements are really helpful ways to test and debug your code. They bring some of the visibility that’s needed, but it can be tedious to instrument your code like this.

As a developer, you should have the tools to understand what’s going on in your applications so you can deliver the best experience to your end users. We can help you answer questions like: Where is my Worker execution failing? Which operation is causing a spike in latency in my application?

Putting it all together

Notifications, analytics, logs, and tracing each have their distinct use cases, but together, these are powerful tools to provide analysts and developers visibility. Looking forward, we’re excited to bring more and more of these capabilities on the Cloudflare dashboard.

We would love to hear from you as we build these features out. If you’re interested in sharing use cases and helping shape our roadmap, contact your account team!