
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 08:45:13 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Troubleshooting network connectivity and performance with Cloudflare AI]]></title>
            <link>https://blog.cloudflare.com/AI-troubleshoot-warp-and-network-connectivity-issues/</link>
            <pubDate>Fri, 29 Aug 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Troubleshoot network connectivity issues by using Cloudflare's AI-powered tools to quickly self-diagnose and resolve WARP client and network issues. ]]></description>
            <content:encoded><![CDATA[ <p>Monitoring a corporate network and troubleshooting any performance issues across that network is a hard problem, and it has become increasingly complex over time. Imagine that you’re maintaining a corporate network, and you get the dreaded IT ticket. An executive is having a performance issue with an application, and they want you to look into it. The ticket doesn’t have a lot of details. It simply says: “Our internal documentation is taking forever to load. PLS FIX NOW”.</p><p>In the early days of IT, a corporate network was built on-premises. It provided network connectivity between employees that worked in person and a variety of corporate applications that were hosted locally.</p><p>The shift to cloud environments, the rise of SaaS applications, and a “work from anywhere” model have made IT environments significantly more complex in the past few years. Today, it’s hard to know if a performance issue is the result of:</p><ul><li><p>An employee’s device</p></li><li><p>Their home or corporate Wi-Fi</p></li><li><p>The corporate network</p></li><li><p>A cloud network hosting a SaaS app</p></li><li><p>An intermediary ISP</p></li></ul><p>A performance ticket submitted by an employee might even be a combination of multiple performance issues all wrapped together into one nasty problem.</p><p>Cloudflare built <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Cloudflare One</u></a>, our <a href="https://www.cloudflare.com/learning/access-management/what-is-sase/">Secure Access Service Edge (SASE)</a> platform, to protect enterprise applications, users, devices, and networks. In particular, this platform relies on two capabilities to simplify troubleshooting performance issues:</p><ul><li><p>Cloudflare’s Zero Trust client, also known as <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-devices/warp/"><u>WARP</u></a>, forwards and encrypts traffic from devices to Cloudflare’s edge.</p></li><li><p>Digital Experience Monitoring (<a href="https://developers.cloudflare.com/cloudflare-one/insights/dex/"><u>DEX</u></a>) works alongside WARP to monitor device, network, and application performance.</p></li></ul><p>We’re excited to announce two new AI-powered tools that will make it easier to troubleshoot WARP client connectivity and performance issues. We’re releasing a new WARP diagnostic analyzer in the <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> dashboard and an <a href="https://www.cloudflare.com/learning/ai/what-is-model-context-protocol-mcp/"><u>MCP (Model Context Protocol)</u></a> server for DEX. Today, every Cloudflare One customer has free access to both of these new features by default.</p>
    <div>
      <h2>WARP diagnostic analyzer</h2>
      <a href="#warp-diagnostic-analyzer">
        
      </a>
    </div>
    <p>The WARP client provides diagnostic logs that can be used to troubleshoot connectivity issues on a device. For desktop clients, the most common issues can be investigated with the information captured in logs called <a href="https://developers.cloudflare.com/learning-paths/warp-overview-course/series/warp-basics-2/"><u>WARP diagnostic</u></a>. Each WARP diagnostic log contains an extensive amount of information spanning days of captured events occurring on the client. It takes expertise to manually go through all of this information and understand the full picture of what is occurring on a client that is having issues. In the past, we’ve advised customers having issues to send their WARP diagnostic log straight to us so that our trained support experts can do a root cause analysis for them. While this is effective, we want to give our customers the tools to decipher common issues themselves for even quicker resolution.</p><p>Enter the WARP diagnostic analyzer, a new AI-powered tool, available for free in the Cloudflare One dashboard as of today! It demystifies the information in the WARP diagnostic log so you can better understand events impacting the performance of your clients and network connectivity. Now, when you run a <a href="https://developers.cloudflare.com/cloudflare-one/insights/dex/remote-captures/"><u>remote capture for WARP diagnostics</u></a> in the Cloudflare One dashboard, you can generate an <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-devices/warp/troubleshooting/warp-logs/#view-warp-diagnostics-summary-beta"><u>AI analysis of the WARP diagnostic file</u></a>. Simply go to your organization’s Zero Trust dashboard and select DEX &gt; Remote Captures from the side navigation bar. After you successfully run diagnostics and produce a WARP diagnostic file, you can open the status details and select View WARP Diag to generate your AI analysis.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50lz9CFKKJJjL5GpppLu8V/4b404a2ec700713579b3ec9a616ee4c4/image4.png" />
          </figure><p>In the WARP Diag analysis, you will find a Cloudy summary of events that we recommend diving deeper into.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6rV0XPL9aayuljbw9X46bQ/6fd046dfcf6d882948d1a98912cf7cab/image1.png" />
          </figure><p>Below this summary is an events section, where the analyzer highlights events that commonly occur when there are client and connectivity issues.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4OxLtM2CQ4SSs8NTGUdcpn/b7e4f0e3eb519838d50759e6d1decf75/image7.png" />
          </figure><p>Expanding any of the detected events reveals a detailed page explaining the event, recommended resources to help troubleshoot, and a list of time-stamped recent occurrences of the event on the device.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ceezR6L1MybxhMtJGuL5U/31f24b0a057871a1f4330ea87f050873/Screenshot_2025-09-03_at_4.20.27%C3%A2__PM.png" />
          </figure><p>To further help with troubleshooting, we’ve added a Device and WARP details section at the bottom of this page with a quick view of device specifications and WARP configurations, such as operating system, WARP version, and device profile ID.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/41N2iTeHQ9JfrOOsqG8MY5/550fa7573a6d4ed61479679cb4e954d3/image6.png" />
          </figure><p>Finally, we’ve made it easy to take all the information created in your AI summary with you by navigating to the JSON file tab and copying the contents. Your WARP Diag file is also available to download from this screen for any further analysis.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Sha8rpC7XwSkCvBWt6lv2/2702873ce14fe80904d4f0886e6f3528/image2.png" />
          </figure>
    <div>
      <h2>MCP server for DEX</h2>
      <a href="#mcp-server-for-dex">
        
      </a>
    </div>
    <p>Alongside the new WARP Diagnostic Analyzer, we’re excited to announce that all Cloudflare One customers have access to an MCP (Model Context Protocol) server for our Digital Experience Monitoring (DEX) product. Let’s dive into how this will save our customers time and money.</p><p>Cloudflare One customers use Digital Experience Monitoring (DEX) to monitor devices across their employee network and troubleshoot any connectivity or performance issues. Like many products at Cloudflare, every data point generated by DEX is available to customers via Cloudflare’s API or log ingestion. DEX API and log data is valuable because it enables enterprises to create custom analytics for their devices’ connectivity and performance in a SIEM (<a href="https://www.cloudflare.com/learning/security/what-is-siem/"><u>Security Information and Event Management</u></a>). Building new data pipelines and various dashboards can take a lot of time and can be expensive. Some Cloudflare One customers aren’t able to dedicate the engineering time to build custom analytics, whether due to budget, resource constraints, time constraints, or other factors.</p><p>Model Context Protocol (MCP) is an open standard that allows AI assistants to connect to other applications. In this instance, using MCP together with DEX gives users a custom analytics experience — without all the work of integrating log data into a SIEM. Network security practitioners can type out a question (like “Show me the performance data for alice@acme.com’s device in a few graphs”), and the MCP server will answer that question with customized graphs built from Alice’s device data in the DEX API.</p>
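    <p>For teams that do want to script against the raw data, the DEX API can also be queried directly. Below is a minimal Python sketch of such a query; the endpoint path and response fields are illustrative assumptions, so check the DEX API reference for the exact routes available on your account:</p>
            <pre><code># Minimal sketch: pulling DEX device data straight from the Cloudflare API.
# The endpoint path and response fields are illustrative assumptions —
# consult the DEX API reference for the exact routes on your account.
import os

import requests

API_BASE = "https://api.cloudflare.com/client/v4"
ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
TOKEN = os.environ["CF_API_TOKEN"]  # API token with DEX read permission


def fleet_status():
    """Fetch a snapshot of device health from DEX (assumed endpoint)."""
    resp = requests.get(
        f"{API_BASE}/accounts/{ACCOUNT_ID}/dex/fleet-status/devices",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"per_page": 50},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["result"]


if __name__ == "__main__":
    for device in fleet_status():
        print(device)</code></pre>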
    <div>
      <h2>Troubleshooting DNS query performance with the DEX MCP server</h2>
      <a href="#troubleshooting-dns-query-performance-with-the-dex-mcp-server">
        
      </a>
    </div>
    <p>Any IT admin, security engineer, or network engineer can ask the MCP server a question like: “I received an IT ticket from bob@acme.com who is an enterprise employee. He is complaining about the performance of his device. Can you investigate this for me?”.</p><p>After you submit this question, the DEX MCP server will start by reviewing the device’s overall health.</p>
            <pre><code>Overall Device Health:
✅ Connection Status: Consistently connected for the past 24 hours with 100% availability
✅ WARP Mode: Running in optimal "warp+doh" (WARP + DNS over HTTPS) mode
✅ Client Version: Up-to-date with version 2025.1.7.2</code></pre>
            <p>So far, the device’s health looks great. Next, the DEX MCP server will begin analyzing the results of the performance tests that are configured on Bob’s device.</p>
            <pre><code>📊 Performance Metrics Analysis

HTTP Response Times:
Google.com test: Average 247ms (Range: 81-649ms)
  • Shows significant improvement throughout the day: Started at ~317ms, dropped to ~211ms by evening
Wiki.internal.acme.com: Average 1,374ms (Range: 877-2,431ms)
  • This is notably high and could be the source of performance complaints
  • DNS resolution particularly slow: Average 616ms with spikes up to 1,600ms</code></pre>
            <p>Now we’re making progress in our troubleshooting effort. The DEX MCP server identified that the HTTP GET test for wiki.internal.acme.com has a high average resource fetch time when compared to other websites (like google.com). In particular, high resource fetch time is the result of slow DNS resolution. Now, the DEX MCP server will summarize its findings.</p>
            <pre><code>🚨 Performance Issues Identified
Extremely Slow DNS Resolution:
  • Average DNS response time of 616ms is significantly high
  • Normal DNS should be &lt;50ms for good performance
High Resource Fetch Times on wiki.internal.acme.com:
   • 1.3+ second response times indicate potential application-level issues
   • Could be related to DNS performance or server-side delays
HTTP Status Code Issues:
        • wiki.internal.acme.com is showing consistent HTTP 400 errors
        • This suggests potential configuration or routing issues</code></pre>
            <p>Now, we’ve successfully identified that Bob’s slow wiki performance is the result of a high average DNS response time. Slow DNS resolution increases the average resource fetch time, and the consistent HTTP 400 errors suggest a separate configuration or routing issue for enterprise employees attempting to access wiki.internal.acme.com. Notably, if the wiki is seeing slow DNS response times, other internal applications are likely experiencing performance issues as well.</p><p>We can ask the DEX MCP server to suggest potential solutions that would fix this wider DNS performance issue.</p>
            <pre><code>🔧 DNS Optimization Strategies

Check WARP DNS configuration and settings
  • Since the device is using WARP+DoH mode, you can optimize the DNS settings
Implement or improve local DNS caching
  • Implement or improve local DNS caching to reduce repeated lookups
Network stack tuning
  • Optimize TCP / UDP settings for DNS</code></pre>
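            <p>If you want to sanity-check DNS latency on a device before and after applying these fixes, a quick measurement outside of DEX can be useful. Here is a minimal sketch using only the Python standard library; the hostname is a placeholder:</p>
            <pre><code># Minimal sketch: measure name-resolution latency with the system resolver.
# The hostname is a placeholder. Note that repeated lookups may be answered
# from a local cache, so the first sample is often the most interesting.
import socket
import statistics
import time


def resolve_ms(hostname: str) -&gt; float:
    """Time one resolution through the OS resolver, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, 443)
    return (time.perf_counter() - start) * 1000


samples = [resolve_ms("wiki.internal.acme.com") for _ in range(5)]
print(f"avg {statistics.mean(samples):.0f} ms, max {max(samples):.0f} ms")</code></pre>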
            
    <div>
      <h2>Try out the DEX MCP server today</h2>
      <a href="#try-out-the-dex-mcp-server-today">
        
      </a>
    </div>
    
    <div>
      <h3>Fast and easy option for testing an MCP server</h3>
      <a href="#fast-and-easy-option-for-testing-an-mcp-server">
        
      </a>
    </div>
    <p>Any Cloudflare One customer with a Free, PayGo, or ENT plan can start using the DEX MCP server in less than one minute. The fastest and easiest way to try out the DEX MCP server is to visit <a href="http://playground.ai.cloudflare.com"><u>playground.ai.cloudflare.com</u></a>. There are five steps to get started:</p><ol><li><p>Copy the URL for the DEX MCP server: https://dex.mcp.cloudflare.com/sse</p></li><li><p>Open <a href="http://playground.ai.cloudflare.com"><u>playground.ai.cloudflare.com</u></a> in a browser</p></li><li><p>Find the section in the left sidebar titled <b>MCP Servers</b></p></li><li><p>Paste the URL for the DEX MCP server into the URL input box and click <b>Connect</b></p></li><li><p>Authenticate your Cloudflare account, and then start asking the DEX MCP server questions</p></li></ol><p>It’s worth noting that end users will need to ask the DEX MCP server specific and explicit questions to get a response. For example, you may need to say, “Set my production account as the active account”, and then give the separate command, “Fetch the DEX test results for the user bob@acme.com over the past 24 hours”.</p>
    <div>
      <h3>A better MCP server experience that requires additional steps</h3>
      <a href="#better-experience-for-mcp-servers-that-requires-additional-steps">
        
      </a>
    </div>
    <p>Customers will get a more flexible prompt experience by configuring the DEX MCP server with their preferred AI assistant (Claude, Gemini, ChatGPT, etc.), provided the assistant has MCP server support. MCP server support may require a subscription for some AI assistants. You can read the <a href="https://developers.cloudflare.com/cloudflare-one/insights/dex/dex-mcp-server"><u>Digital Experience Monitoring - MCP server documentation</u></a> for step-by-step instructions on how to get set up with each of the major AI assistants that are available today.</p><p>As an example, you can configure the DEX MCP server in Claude by downloading the Claude Desktop client, then selecting Claude Code &gt; Developer &gt; Edit Config. You will be prompted to open “claude_desktop_config.json” in a code editor of your choice. Simply add the following JSON configuration, and you’re ready to use Claude to call the DEX MCP server.</p>
            <pre><code>{
  "globalShortcut": "",
  "mcpServers": {
    "cloudflare-dex-analysis": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://dex.mcp.cloudflare.com/sse"
      ]
    }
  }
}</code></pre>
            
    <div>
      <h2>Get started with Cloudflare One today</h2>
      <a href="#get-started-with-cloudflare-one-today">
        
      </a>
    </div>
    <p>Are you ready to secure your Internet traffic, employee devices, and private resources without compromising speed? You can get started with our new Cloudflare One AI-powered tools today.</p><p>The WARP diagnostic analyzer and the DEX MCP server are generally available to all customers. Head to the Zero Trust dashboard to run a WARP diagnostic and learn more about your client’s connectivity with the WARP diagnostic analyzer. You can test out the new DEX MCP server (https://dex.mcp.cloudflare.com/sse) in less than one minute at <a href="http://playground.ai.cloudflare.com"><u>playground.ai.cloudflare.com</u></a>, and you can also configure an AI assistant like Claude to use the new <a href="https://developers.cloudflare.com/cloudflare-one/insights/dex/dex-mcp-server"><u>DEX MCP server</u></a>.</p><p>If you don’t have a Cloudflare account, and you want to try these new features, you can create a free account for up to 50 users. If you’re an Enterprise customer, and you’d like a demo of these new Cloudflare One AI features, you can reach out to your account team to set up a demo anytime.</p><p>You can stay up to date on the latest feature releases across the Cloudflare One platform by following the <a href="https://developers.cloudflare.com/cloudflare-one/changelog/"><u>Cloudflare One changelogs</u></a> and joining the conversation in the <a href="https://community.cloudflare.com/"><u>Cloudflare community hub</u></a> or on our <a href="https://discord.cloudflare.com/"><u>Discord Server</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CvbpyPLYM62H7B0GhGqcZ/79317635029a9d09d31dacbec6793887/image5.png" />
          </figure><div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[AI Week]]></category>
            <category><![CDATA[Monitoring]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[WARP]]></category>
            <category><![CDATA[Device Security]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Dashboard]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">7vSTlKJvMibVnsLp1YLWKe</guid>
            <dc:creator>Chris Draper</dc:creator>
            <dc:creator>Koko Uko</dc:creator>
        </item>
        <item>
            <title><![CDATA[Searching for the cause of hung tasks in the Linux kernel]]></title>
            <link>https://blog.cloudflare.com/searching-for-the-cause-of-hung-tasks-in-the-linux-kernel/</link>
            <pubDate>Fri, 14 Feb 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ The Linux kernel can produce a hung task warning. Searching the Internet and the kernel docs, you can find a brief explanation that the process is stuck in the uninterruptible state. ]]></description>
            <content:encoded><![CDATA[ <p>Depending on your configuration, the Linux kernel can produce a hung task warning message in its log. Searching the Internet and the kernel documentation, you can find a brief explanation that the kernel process is stuck in the uninterruptible state and hasn’t been scheduled on the CPU for an unexpectedly long period of time. That explains the warning’s meaning, but doesn’t provide the reason it occurred. In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or the application itself, and whether it is worth monitoring at all.</p>
    <div>
      <h3>INFO: task XXX:1495882 blocked for more than YYY seconds.</h3>
      <a href="#info-task-xxx-1495882-blocked-for-more-than-yyy-seconds">
        
      </a>
    </div>
    <p>The hung task message in the kernel log looks like this:</p>
            <pre><code>INFO: task XXX:1495882 blocked for more than YYY seconds.
     Tainted: G          O       6.6.39-cloudflare-2024.7.3 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:XXX         state:D stack:0     pid:1495882 ppid:1      flags:0x00004002
. . .</code></pre>
            <p>Processes in Linux can be in different states. Some of them are running or ready to run on the CPU — they are in the <a href="https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/sched.h#L99"><code><u>TASK_RUNNING</u></code></a> state. Others are waiting for some signal or event to happen, e.g. network packets to arrive or terminal input from a user. They are in a <code>TASK_INTERRUPTIBLE</code> state and can spend an arbitrary length of time in this state until being woken up by a signal. The most important thing about this state is that the process can still receive signals, and can be terminated by a signal. In contrast, a process in the <code>TASK_UNINTERRUPTIBLE</code> state is waiting only for certain special classes of events to wake it up, and can’t be interrupted by a signal. Signals are not delivered until the process emerges from this state, and if it never does, only a system reboot can clear it. It’s marked with the letter <code>D</code> in the log shown above.</p><p>What if this wake-up event doesn’t happen or happens with a significant delay? (A “significant delay” may be on the order of seconds or minutes, depending on the system.) Then our dependent process is hung in this state. What if this dependent process holds some lock and prevents other processes from acquiring it? Or if we see many processes in the D state? Then it might tell us that some of the system resources are overwhelmed or are not working correctly. At the same time, this state is very valuable, especially if we want to preserve the process memory. It might be useful if part of the data is written to disk and another part is still in the process memory — we don’t want inconsistent data on disk. Or maybe we want a snapshot of the process memory when the bug is hit. To preserve this behaviour, but make it more controlled, a new state was introduced in the kernel: <a href="https://lwn.net/Articles/288056/"><code><u>TASK_KILLABLE</u></code></a> — it still protects a process, but allows termination with a fatal signal.</p>
    <div>
      <h3>How Linux identifies the hung process</h3>
      <a href="#how-linux-identifies-the-hung-process">
        
      </a>
    </div>
    <p>The Linux kernel has a special thread called <code>khungtaskd</code>. It runs periodically, depending on the settings, iterating over all processes in the <code>D</code> state. If a process is in this state for more than YYY seconds, we’ll see a message in the kernel log. There are settings for this daemon that can be changed according to your needs:</p>
            <pre><code>$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 10
kernel.hung_task_warnings = 200</code></pre>
            <p>At Cloudflare, we changed the notification threshold <code>kernel.hung_task_timeout_secs</code> from the default 120 seconds to 10 seconds. You can adjust the value for your system depending on configuration and how critical this delay is for you. If the process spends more than <code>hung_task_timeout_secs</code> seconds in the D state, a log entry is written, and our internal monitoring system emits an alert based on this log. Another important setting here is <code>kernel.hung_task_warnings</code> — the total number of messages that will be sent to the log. We limit it to 200 messages and reset it every 15 minutes. It allows us not to be overwhelmed by the same issue, and at the same time doesn’t stop our monitoring for too long. You can make it unlimited by <a href="https://docs.kernel.org/admin-guide/sysctl/kernel.html#hung-task-warnings"><u>setting the value to "-1"</u></a>.</p><p>To better understand the root causes of the hung tasks and how a system can be affected, we’re going to review more detailed examples. </p>
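            <p>To make the khungtaskd behavior more concrete, you can enumerate D-state tasks yourself from user space. The following is a minimal Python sketch that scans /proc; unlike the kernel thread, it only sees a point-in-time snapshot and cannot tell how long a task has been blocked:</p>
            <pre><code># Minimal sketch: list tasks currently in the uninterruptible (D) state
# by scanning /proc. Unlike khungtaskd, this is only a point-in-time
# snapshot and cannot tell how long a task has been blocked.
import os


def d_state_tasks():
    """Yield (pid, comm) for tasks whose state letter is D."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        name = state = None
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("Name:"):
                        name = line.split(None, 1)[1].strip()
                    elif line.startswith("State:"):
                        state = line.split()[1]  # "D" from "D (disk sleep)"
                        break
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            yield int(pid), name


for pid, comm in d_state_tasks():
    print(f"{pid}\t{comm}")</code></pre>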
    <div>
      <h3>Example #1 or XFS</h3>
      <a href="#example-1-or-xfs">
        
      </a>
    </div>
    <p>Typically, there is a meaningful process or application name in the log, but sometimes you might see something like this:</p>
            <pre><code>INFO: task kworker/13:0:834409 blocked for more than 11 seconds.
 	Tainted: G      	O   	6.6.39-cloudflare-2024.7.3 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/13:0	state:D stack:0 	pid:834409 ppid:2   flags:0x00004000
Workqueue: xfs-sync/dm-6 xfs_log_worker</code></pre>
            <p>In this log, <code>kworker</code> is a kernel thread. It’s used as a deferring mechanism, meaning a piece of work will be scheduled to be executed in the future. Under <code>kworker</code>, the work is aggregated from different tasks, which makes it difficult to tell which application is experiencing a delay. Luckily, the <code>kworker</code> is accompanied by the <a href="https://docs.kernel.org/core-api/workqueue.html"><code><u>Workqueue</u></code></a> line. <code>Workqueue</code> is a linked list, usually predefined in the kernel, where these pieces of work are added and performed by the <code>kworker</code> in the order they were added to the queue. The <code>Workqueue</code> name <code>xfs-sync</code> and <a href="https://elixir.bootlin.com/linux/v6.12.6/source/kernel/workqueue.c#L6096"><u>the function which it points to</u></a>, <code>xfs_log_worker</code>, might give a good clue where to look. Here we can make an assumption that <a href="https://en.wikipedia.org/wiki/XFS"><u>XFS</u></a> is under pressure and check the relevant metrics. This helped us discover that, due to some configuration changes, we had forgotten the <code>no_read_workqueue</code> / <code>no_write_workqueue</code> flags that were introduced some time ago to <a href="https://blog.cloudflare.com/speeding-up-linux-disk-encryption/"><u>speed up Linux disk encryption</u></a>.</p><p><i>Summary</i>: In this case, nothing critical happened to the system, but the hung task warnings gave us an alert that our file system had slowed down.</p>
    <div>
      <h3>Example #2 or Coredump</h3>
      <a href="#example-2-or-coredump">
        
      </a>
    </div>
    <p>Let’s take a look at the next hung task log and its decoded stack trace:</p>
            <pre><code>INFO: task test:964 blocked for more than 5 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:964   ppid:916    flags:0x00004000
Call Trace:
&lt;TASK&gt;
__schedule (linux/kernel/sched/core.c:5378 linux/kernel/sched/core.c:6697) 
schedule (linux/arch/x86/include/asm/preempt.h:85 (discriminator 13) linux/kernel/sched/core.c:6772 (discriminator 13)) 
[do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)) 
? finish_task_switch.isra.0 (linux/arch/x86/include/asm/irqflags.h:42 linux/arch/x86/include/asm/irqflags.h:77 linux/kernel/sched/sched.h:1385 linux/kernel/sched/core.c:5132 linux/kernel/sched/core.c:5250) 
do_group_exit (linux/kernel/exit.c:1005) 
get_signal (linux/kernel/signal.c:2869) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_try_to_cancel.part.0 (linux/kernel/time/hrtimer.c:1347) 
arch_do_signal_or_restart (linux/arch/x86/kernel/signal.c:310) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_nanosleep (linux/kernel/time/hrtimer.c:2105) 
exit_to_user_mode_prepare (linux/kernel/entry/common.c:176 linux/kernel/entry/common.c:210) 
syscall_exit_to_user_mode (linux/arch/x86/include/asm/entry-common.h:91 linux/kernel/entry/common.c:141 linux/kernel/entry/common.c:304) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
do_syscall_64 (linux/arch/x86/entry/common.c:88) 
entry_SYSCALL_64_after_hwframe (linux/arch/x86/entry/entry_64.S:121) 
&lt;/TASK&gt;</code></pre>
            <p>The stack trace says that the process or application <code>test</code> was blocked <code>for more than 5 seconds</code>. We might recognise this user space application by the name, but why is it blocked? It’s always helpful to check the stack trace when looking for a cause. The most interesting line here is <code>do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4))</code>. The <a href="https://elixir.bootlin.com/linux/v6.6.67/source/kernel/exit.c#L825"><u>source code</u></a> points to the <code>coredump_task_exit</code> function. Additionally, checking the process metrics revealed that the application crashed during the time when the warning message appeared in the log. When a process is terminated abnormally by certain signals, <a href="https://man7.org/linux/man-pages/man5/core.5.html"><u>the Linux kernel can provide a core dump file</u></a>, if enabled. The mechanism works as follows: when a process terminates, the kernel makes a snapshot of the process memory before exiting and either writes it to a file or sends it through a socket to another handler — which can be <a href="https://systemd.io/COREDUMP/"><u>systemd-coredump</u></a> or your custom one. When this happens, the kernel moves the process to the <code>D</code> state to preserve its memory and protect it from early termination. The higher the process memory usage, the longer it takes to get a core dump file, and the higher the chance of getting a hung task warning.</p><p>Let’s check our hypothesis by triggering it with a small Go program. We’ll use the default Linux coredump handler and will decrease the hung task threshold to 1 second.</p><p>Coredump settings:</p>
            <pre><code>$ sudo sysctl -a --pattern kernel.core
kernel.core_pattern = core
kernel.core_pipe_limit = 16
kernel.core_uses_pid = 1</code></pre>
            <p>You can make changes with <a href="https://man7.org/linux/man-pages/man8/sysctl.8.html"><u>sysctl</u></a>:</p>
            <pre><code>$ sudo sysctl -w kernel.core_uses_pid=1</code></pre>
            <p>Hung task settings:</p>
            <pre><code>$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 1
kernel.hung_task_warnings = -1</code></pre>
            <p>Go program:</p>
            <pre><code>$ cat main.go
package main

import (
	"os"
	"time"
)

func main() {
	// Read a large file into process memory so the core dump takes a while.
	_, err := os.ReadFile("test.file")
	if err != nil {
		panic(err)
	}
	// Keep the process alive so we can crash it manually.
	time.Sleep(8 * time.Minute)
}</code></pre>
            <p>This program reads a 10 GB file into process memory. Let’s create the file:</p>
            <pre><code>$ yes this is 10GB file | head -c 10GB &gt; test.file</code></pre>
            <p>The last step is to build the Go program, crash it, and watch our kernel log:</p>
            <pre><code>$ go mod init test
$ go build .
$ GOTRACEBACK=crash ./test
$ (Ctrl+\)</code></pre>
            <p>Hooray! We can see our hung task warning:</p>
            <pre><code>$ sudo dmesg -T | tail -n 31
INFO: task test:8734 blocked for more than 22 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
      Blocked by coredump.
"echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:8734  ppid:8406   task_flags:0x400448 flags:0x00004000</code></pre>
            <p>By the way, have you noticed the <code>Blocked by coredump.</code> line in the log? It was recently added to the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-nonmm-stable&amp;id=23f3f7625cfb55f92e950950e70899312f54afb7"><u>upstream</u></a> code to improve visibility and remove the blame from the process itself. The patch also added the <code>task_flags</code> information, as <code>Blocked by coredump</code> is detected via the flag <a href="https://elixir.bootlin.com/linux/v6.13.1/source/include/linux/sched.h#L1675"><code><u>PF_POSTCOREDUMP</u></code></a>, and knowing all the task flags is useful for further root-cause analysis.</p><p><i>Summary</i>: This example showed that even if everything suggests that the application is the problem, the real root cause can be something else — in this case, <code>coredump</code>.</p>
    <div>
      <h3>Example #3 or rtnl_mutex</h3>
      <a href="#example-3-or-rtnl_mutex">
        
      </a>
    </div>
    <p>This one was tricky to debug. Usually, the alerts are limited to one or two different processes, meaning only a certain application or subsystem experiences an issue. In this case, we saw dozens of unrelated tasks hanging for minutes with no improvement over time. Nothing else was in the log, most of the system metrics were fine, and existing traffic was being served, but it was not possible to ssh to the server. New Kubernetes container creations were also stalling. Analyzing the stack traces of different tasks initially revealed that all the traces were limited to just three functions:</p>
            <pre><code>rtnetlink_rcv_msg+0x9/0x3c0
dev_ethtool+0xc6/0x2db0 
bonding_show_bonds+0x20/0xb0</code></pre>
            <p>Further investigation showed that all of these functions were waiting for <a href="https://elixir.bootlin.com/linux/v6.6.74/source/net/core/rtnetlink.c#L76"><code><u>rtnl_lock</u></code></a> to be acquired. It looked like some application acquired the <code>rtnl_mutex</code> and didn’t release it. All other processes were in the <code>D</code> state waiting for this lock.</p><p>The RTNL lock is primarily used by the kernel networking subsystem for any network-related config, for both writing and reading. The RTNL is a global <b>mutex</b> lock, although <a href="https://lpc.events/event/18/contributions/1959/"><u>upstream efforts</u></a> are being made for splitting up RTNL per network namespace (netns).</p><p>From the hung task reports, we can observe the “victims” that are being stalled waiting for the lock, but how do we identify the task that is holding this lock for too long? For troubleshooting this, we leveraged <code>BPF</code> via a <code>bpftrace</code> script, as this allows us to inspect the running kernel state. The <a href="https://elixir.bootlin.com/linux/v6.6.75/source/include/linux/mutex.h#L67"><u>kernel's mutex implementation</u></a> has a struct member called <code>owner</code>. It contains a pointer to the <a href="https://elixir.bootlin.com/linux/v6.6.75/source/include/linux/sched.h#L746"><code><u>task_struct</u></code></a> from the mutex-owning process, except it is encoded as type <code>atomic_long_t</code>. This is because the mutex implementation stores some state information in the lower 3-bits (mask 0x7) of this pointer. Thus, to read and dereference this <code>task_struct</code> pointer, we must first mask off the lower bits (0x7).</p><p>Our <code>bpftrace</code> script to determine who holds the mutex is as follows:</p>
            <pre><code>#!/usr/bin/env bpftrace
interval:s:10 {
  $rtnl_mutex = (struct mutex *) kaddr("rtnl_mutex");
  $owner = (struct task_struct *) ($rtnl_mutex-&gt;owner.counter &amp; ~0x07);
  if ($owner != 0) {
    printf("rtnl_mutex-&gt;owner = %u %s\n", $owner-&gt;pid, $owner-&gt;comm);
  }
}</code></pre>
            <p>In this script, the <code>rtnl_mutex</code> lock is a global lock whose address can be exposed via <code>/proc/kallsyms</code> – using <code>bpftrace</code> helper function <code>kaddr()</code>, we can access the struct mutex pointer from the <code>kallsyms</code>. Thus, we can periodically (via <code>interval:s:10</code>) check if someone is holding this lock.</p><p>In the output we had this:</p>
            <pre><code>rtnl_mutex-&gt;owner = 3895365 calico-node</code></pre>
            <p>This allowed us to quickly identify <code>calico-node</code> as the process holding the RTNL lock for too long. To quickly observe where this process itself is stalled, the call stack is available via <code>/proc/3895365/stack</code>. This showed us that the root cause was a Wireguard config change, with function <code>wg_set_device()</code> holding the RTNL lock, and <code>peer_remove_after_dead()</code> waiting too long for a <code>napi_disable()</code> call. We continued debugging via a tool called <a href="https://drgn.readthedocs.io/en/latest/user_guide.html#stack-traces"><code><u>drgn</u></code></a>, which is a programmable debugger that can debug a running kernel via a Python-like interactive shell. We still haven’t discovered the root cause for the Wireguard issue and have <a href="https://lore.kernel.org/lkml/CALrw=nGoSW=M-SApcvkP4cfYwWRj=z7WonKi6fEksWjMZTs81A@mail.gmail.com/"><u>asked the upstream</u></a> for help, but that is another story.</p><p><i>Summary</i>: The hung task messages were the only ones which we had in the kernel log. Each stack trace of these messages was unique, but by carefully analyzing them, we could spot similarities and continue debugging with other instruments.</p>
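            <p>As a footnote to this example: because <code>drgn</code> is scriptable in Python, the same owner lookup can also be done there. The following is a minimal sketch of the idea (run inside the drgn shell, where <code>prog</code> represents the running kernel), rather than the exact session we used:</p>
            <pre><code># Minimal drgn sketch: find the task currently holding rtnl_mutex.
# Run inside the drgn shell, where `prog` is the running kernel.
from drgn import Object
from drgn.helpers.linux.sched import task_state_to_char

owner = prog["rtnl_mutex"].owner.counter.value_()
if owner:
    # The lower 3 bits carry mutex state flags, so mask them off before
    # treating the value as a task_struct pointer.
    task = Object(prog, "struct task_struct *", value=owner &amp; ~0x7)
    print(task.pid.value_(), task.comm.string_().decode(),
          task_state_to_char(task))
    print(prog.stack_trace(task))  # where the lock owner itself is stalled</code></pre>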
    <div>
      <h3>Epilogue</h3>
      <a href="#epilogue">
        
      </a>
    </div>
    <p>Your system might have different hung task warnings, and we have many others not mentioned here. Each case is unique, and there is no standard approach to debug them. But hopefully this blog post helps you better understand why it’s good to have these warnings enabled, how they work, and what they mean. We tried to provide some navigation guidance for the debugging process as well:</p><ul><li><p>analyzing the stack trace might be a good starting point for debugging, even if all the messages look unrelated, like we saw in example #3</p></li><li><p>keep in mind that the alert might be misleading, pointing to the victim and not the offender, as we saw in example #2 and example #3</p></li><li><p>if the kernel doesn’t schedule your application on the CPU, puts it in the D state, and emits the warning – the real problem might exist in the application code</p></li></ul><p>Good luck with your debugging, and hopefully this material will help you on this journey!</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">3UHNgpNPKn2IAwDUzD4m3a</guid>
            <dc:creator>Oxana Kharitonova</dc:creator>
            <dc:creator>Jesper Brouer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Free network flow monitoring for all enterprise customers]]></title>
            <link>https://blog.cloudflare.com/free-network-monitoring-for-enterprise/</link>
            <pubDate>Thu, 07 Mar 2024 14:00:43 GMT</pubDate>
            <description><![CDATA[ Today, we’re excited to announce that a free version of Cloudflare’s network flow monitoring product, Magic Network Monitoring, is now available to all Enterprise Customers ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JDYfP3P5eCvUA6t2OVQRW/9623c870e75e74e813cd0abc5a9b8da9/image1-24.png" />
            
            </figure><p>A key component of <a href="https://www.cloudflare.com/network-security/">effective corporate network security</a> is establishing end-to-end visibility across all traffic that flows through the network. Every network engineer needs a complete overview of their network traffic to confirm their security policies work, to identify new vulnerabilities, and to analyze any shifts in traffic behavior. Often, it’s difficult to build out effective network monitoring as teams struggle with problems like configuring and tuning data collection, managing storage costs, and analyzing traffic across multiple visibility tools.</p><p>Today, we’re excited to announce that a free version of Cloudflare’s <a href="https://www.cloudflare.com/network-services/solutions/network-monitoring-tools/">network flow monitoring</a> product, Magic Network Monitoring, is available to all Enterprise Customers. Every Enterprise Customer can configure Magic Network Monitoring and immediately improve their network visibility in as little as 30 minutes via our self-serve onboarding process.</p><p>Enterprise Customers can visit the <a href="https://www.cloudflare.com/network-services/products/magic-network-monitoring/">Magic Network Monitoring product page</a>, click “Talk to an expert”, and fill out the form. You’ll receive access within 24 hours of submitting the request. Over the next month, the free version of Magic Network Monitoring will be rolled out to all Enterprise Customers. The product will automatically be available by default without the need to submit a form.</p>
    <div>
      <h3>How it works</h3>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>Cloudflare customers can send their network flow data (either NetFlow or sFlow) from their routers to Cloudflare’s network edge.</p><p>Magic Network Monitoring will pick up this data, parse it, and instantly provide insights and analytics on your network traffic. These analytics include traffic volume over time in bytes and packets, top protocols, sources, destinations, ports, and TCP flags.</p>
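    <p>As a rough illustration of what that parsing step involves, here is a toy Python decoder for the fixed-layout NetFlow v5 format. Cloudflare’s actual pipeline is of course far more involved; this sketch only shows how the flow fields above (sources, destinations, ports, protocols, TCP flags) are laid out on the wire:</p>
            <pre><code># Toy decoder for a NetFlow v5 datagram, illustrating the kind of parsing
# a flow collector performs. A sketch only, not Cloudflare's pipeline.
import socket
import struct


def parse_netflow_v5(datagram: bytes):
    # 24-byte header starts with the version and the flow record count.
    version, count = struct.unpack_from("!HH", datagram, 0)
    assert version == 5, "not a NetFlow v5 datagram"
    flows = []
    for i in range(count):
        offset = 24 + i * 48  # each v5 flow record is 48 bytes
        (src, dst, _nexthop, _in_if, _out_if, pkts, octets, _first, _last,
         sport, dport, _pad1, tcp_flags, proto, _tos, _src_as, _dst_as,
         _src_mask, _dst_mask, _pad2) = struct.unpack_from(
            "!IIIHHIIIIHHBBBBHHBBH", datagram, offset)
        flows.append({
            "src": socket.inet_ntoa(struct.pack("!I", src)),
            "dst": socket.inet_ntoa(struct.pack("!I", dst)),
            "proto": proto, "sport": sport, "dport": dport,
            "packets": pkts, "bytes": octets, "tcp_flags": tcp_flags,
        })
    return flows</code></pre>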
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6AZo6eGAZteqDAzTz8JSzc/83dd771696c34386f200144f95fb8207/image3-20.png" />
            
            </figure>
    <div>
      <h3>Dogfooding Magic Network Monitoring during the remediation of the Thanksgiving 2023 security incident</h3>
      <a href="#dogfooding-magic-network-monitoring-during-the-remediation-of-the-thanksgiving-2023-security-incident">
        
      </a>
    </div>
    <p>Let’s review a recent example of how Magic Network Monitoring improved Cloudflare’s own network security and traffic visibility during the <a href="/thanksgiving-2023-security-incident">Thanksgiving 2023 security incident</a>. Our security team needed a lightweight method to identify malicious packet characteristics in our core data center traffic. We monitored for any network traffic sourced from or destined to a list of ASNs associated with the bad actor. Our security team set up Magic Network Monitoring and established visibility into our first core data center within 24 hours of the project kick-off. Today, Cloudflare continues to use Magic Network Monitoring to monitor for traffic related to bad actors and to provide real-time traffic analytics on more than 1 Tbps of core data center traffic.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15lzmYJOViOw36MiEo7RR1/e3252df389014e6a52aad4eab6a78b7c/Screenshot-2024-03-07-at-10.55.47.png" />
            
            </figure><p><i>Magic Network Monitoring - Traffic Analytics</i></p>
    <div>
      <h3>Monitoring local network traffic from IoT devices</h3>
      <a href="#monitoring-local-network-traffic-from-iot-devices">
        
      </a>
    </div>
    <p>Magic Network Monitoring also improves visibility on any network traffic that doesn’t go through Cloudflare. Imagine that you’re a network engineer at ACME Corporation, and it’s your job to manage and troubleshoot IoT devices in a factory that are connected to the factory’s internal network. The traffic generated by these IoT devices doesn’t go through Cloudflare because it is destined to other devices and endpoints on the internal network. Nonetheless, you still need to establish network visibility into device traffic over time to monitor and troubleshoot the system.</p><p>To solve the problem, you configure a router or other network device to securely send encrypted traffic flow summaries to Cloudflare via an IPSec tunnel. Magic Network Monitoring parses the data, and instantly provides you with insights and analytics on your network traffic. Now, when an IoT device goes down, or a connection between IoT devices is unexpectedly blocked, you can analyze historical network traffic data in Magic Network Monitoring to speed up the troubleshooting process.</p>
    <div>
      <h3>Monitoring cloud network traffic</h3>
      <a href="#monitoring-cloud-network-traffic">
        
      </a>
    </div>
    <p>As <a href="https://www.cloudflare.com/learning/cloud/what-is-cloud-networking/">cloud networking</a> becomes increasingly prevalent, it is essential for enterprises to <a href="https://www.cloudflare.com/network-services/solutions/enterprise-network-security/">invest in visibility</a> across their cloud environments. Let’s say you’re responsible for monitoring and troubleshooting your corporation's cloud network operations which are spread across multiple public cloud providers. You need to improve visibility into your cloud network traffic to analyze and troubleshoot any unexpected traffic patterns like configuration drift that leads to an exposed network port.</p><p>To improve traffic visibility across different cloud environments, you can export cloud traffic flow logs from any virtual device that supports NetFlow or sFlow to Cloudflare. In the future, we are building support for native cloud VPC flow logs in conjunction with <a href="/introducing-magic-cloud-networking">Magic Cloud Networking</a>. Cloudflare will parse this traffic flow data and provide alerts plus analytics across all your cloud environments in a single pane of glass on the Cloudflare dashboard.</p>
    <div>
      <h3>Improve your security posture today in less than 30 minutes</h3>
      <a href="#improve-your-security-posture-today-in-less-than-30-minutes">
        
      </a>
    </div>
    <p>If you’re an existing Enterprise customer, and you want to improve your corporate network security, you can get started right away. Visit the <a href="https://www.cloudflare.com/network-services/products/magic-network-monitoring/">Magic Network Monitoring product page</a>, click “Talk to an expert”, and fill out the form. You’ll receive access within 24 hours of submitting the request. You can begin the self-serve onboarding tutorial, and start monitoring your first batch of network traffic in less than 30 minutes.</p><p>Over the next month, the free version of Magic Network Monitoring will be rolled out to all Enterprise Customers. The product will be automatically available by default without the need to submit a form.</p><p>If you’re interested in becoming an Enterprise Customer, and have more questions about Magic Network Monitoring, you can <a href="https://www.cloudflare.com/network-services/products/magic-network-monitoring/">talk with an expert</a>. If you’re a free customer, and you’re interested in testing a limited beta of Magic Network Monitoring, you can <a href="https://docs.google.com/forms/d/1umsmwHmXgMesP2t4wH94uVExHaT60tb5RTeawqR_9Cg/edit">fill out this form to request access</a>.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Magic Network Monitoring]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Monitoring]]></category>
            <category><![CDATA[IoT]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Magic WAN]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <guid isPermaLink="false">5kxbpURa5uO3pOPgoXY9Ga</guid>
            <dc:creator>Chris Draper</dc:creator>
        </item>
        <item>
            <title><![CDATA[Keeping the Cloudflare API 'all green' using Python-based testing]]></title>
            <link>https://blog.cloudflare.com/keeping-cloudflare-api-all-green-using-python-based-testing/</link>
            <pubDate>Tue, 07 Mar 2023 19:00:00 GMT</pubDate>
            <description><![CDATA[ Scout is an automated system providing constant end to end testing and monitoring of live APIs over different environments and resources. Scout does it by periodically running self explanatory Python tests ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we reuse existing core systems to power multiple products, and testing of these core systems is essential. In particular, we need wide and thorough visibility into our live APIs’ behavior. We want to be able to detect regressions, prevent incidents, and maintain healthy APIs. That is why we built Scout.</p><p>Scout is an automated system periodically running Python tests verifying the end-to-end behavior of our APIs. Scout allows us to evaluate APIs in production-like environments and thus ensures we can green-light a production deployment, while also monitoring the behavior of APIs in production.</p>
    <div>
      <h2>Why Scout?</h2>
      <a href="#why-scout">
        
      </a>
    </div>
    <p>Before Scout, we were using an automated test system leveraging the Robot Framework. This older system was limiting our testing capabilities. In fact, we could not easily match JSON responses against the keys we were looking for. We would abandon covering different behaviors of our APIs, as it was impossible to decide which resources a given test suite would run on. Two different test suites would create false negatives as they were running on the same account.</p><p>Regarding schema validation, only API responses were validated against a JSON schema, and tests would not fail if the response did not match the schema. Moreover, it was impossible to validate API requests.</p><p>Test suites were run in a queue, making the delay before a new feature assessment dependent on the number of test suites to run. The queue could also push newer test suites to run the following day. Hence, we often ended up with a mismatch between test and API versions. Test steps could not be run in parallel either.</p><p>We could not split test suites between different environments. If a new API feature was being developed, it was impossible to write a test without first needing the actual feature to be released to production.</p><p>We built Scout to overcome all these difficulties. We wanted the developer experience to be easy, and we wanted Scout to be fast and reliable while spotting any live API issue.</p>
    <div>
      <h3>A Scout test example</h3>
      <a href="#a-scout-test-example">
        
      </a>
    </div>
    <p>Scout is built in Python and leverages the functionalities of Pytest. Before diving into the exact capabilities of Scout and its architecture, let’s have a quick look at how to use it!</p><p>Following is an example of a Scout test on the Rulesets API (the docs are available <a href="https://developers.cloudflare.com/ruleset-engine/about/">here</a>):</p>
            <pre><code>from scout import requires, validate, Account, Zone

@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])
@requires(
    account=Account(
        entitlements={"rulesets.max_rules_per_ruleset": 2),
    zone=Zone(plan="ENT",
        entitlements={"rulesets.firewall_custom_phase_allowed": True},
        account_entitlements={"rulesets.max_rules_per_ruleset": 2 }))
class TestZone:
    def test_create_custom_ruleset(self, cfapi):
        response = cfapi.zone.request(
            "POST",
            "rulesets",
            payload=f"""{{
            "name": "My zone ruleset",
            "description": "My ruleset description",
            "phase": "http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\""
                }}
            ]
        }}""")
        response.expect_json_success(
            200,
            result=f"""{{
            "name": "My zone ruleset",
            "version": "1",
            "source": "firewall_custom",
            "phase": "http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\"",
                    "enabled": true,
                    ...
                }}
            ],
            ...
        }}""")</code></pre>
            <p>A Scout test is a succession of roundtrips of requests and responses against a given API. We use Pytest fixtures and marks to target specific resources while validating requests and responses. Pytest marks in Scout provide an extra set of information to test suites. Pytest fixtures are contexts with information and methods which can be used across tests to enhance their capabilities. Hence, the conjunction of marks with fixtures allows Scout to build the whole harness required to run a test suite against APIs.</p><p>Being able to exactly describe the resources against which a given test will run gives us confidence that the live API behaves as expected under various conditions.</p><p>The <b><i>cfapi</i></b> fixture provides the capability to target different resources such as a Cloudflare account or a zone. In the test above, we use a Pytest mark <b><i>@requires</i></b> to describe the characteristics of the resources we want, e.g., here we need an account with a flag allowing 2 rules per ruleset. This will allow the test to only be run against accounts with such entitlements.</p><p>The <b><i>@validate</i></b> mark provides the capability to validate requests and responses against a given OpenAPI schema (here the rulesets OpenAPI schema). Any validation failure will be reported and flagged as a test failure.</p><p>Regarding the actual requests and responses, their payloads are described as f-strings; in particular, the response f-string can be written as a “semi-json”:</p>
            <pre><code> response.expect_json_success(
            200,
            result=f"""{{
            "name": "My zone ruleset",
            "version": "1",
            "source": "firewall_custom",
            "phase": "phase_http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\"",
                    "enabled": true,
                    ...
                }}
            ],
            ...
        }}""")</code></pre>
            <p>Among the many possible test assertions, Scout can assert the validity of a partial JSON response, and it will log the information. We added the handling of the ellipsis <b>…</b> as an indication for Scout not to care about any further fields at a given JSON nesting level. Hence, we are able to do partial matching on JSON API responses, thus focusing only on what matters the most in each test.</p><p>Once a test suite run is complete, the results are pushed by the service and stored using Cloudflare Workers KV. They are displayed via a Cloudflare Worker.</p>
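            <p>To illustrate the ellipsis-based partial matching described above, here is a minimal, self-contained sketch of the idea. It is an illustration only, not Scout’s actual implementation; here the ellipsis is modeled as a literal <code>"..."</code> value:</p>
            <pre><code># Minimal sketch of ellipsis-based partial JSON matching, in the spirit of
# Scout's "semi-json" expectations. Illustrative only, not Scout's code:
# "..." as a dict key or as the last list item means "ignore the rest".
ELLIPSIS = "..."


def matches(expected, actual):
    """Return True if `actual` contains everything `expected` requires."""
    if expected == ELLIPSIS:
        return True  # "don't care" placeholder for a whole value
    if isinstance(expected, dict) and isinstance(actual, dict):
        keys = [k for k in expected if k != ELLIPSIS]
        if ELLIPSIS not in expected and set(actual) != set(keys):
            return False  # without "...", extra fields are a mismatch
        return all(k in actual and matches(expected[k], actual[k])
                   for k in keys)
    if isinstance(expected, list) and isinstance(actual, list):
        if expected and expected[-1] == ELLIPSIS:
            # Trailing "...": ignore any items beyond the ones listed.
            expected, actual = expected[:-1], actual[:len(expected) - 1]
        return len(expected) == len(actual) and all(
            matches(e, a) for e, a in zip(expected, actual))
    return expected == actual


# Only the fields that matter are asserted; everything else is ignored.
assert matches(
    {"name": "My zone ruleset",
     "rules": [{"action": "block", ELLIPSIS: True}], ELLIPSIS: True},
    {"name": "My zone ruleset", "version": "1",
     "rules": [{"action": "block", "enabled": True}]},
)</code></pre>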
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2iDExs9LHFvCjTMC12zsSp/f7fcbbf636cb5a959981fbdfad3fb94f/image6-2.png" />
            
            </figure><p>Scout is run in separate environments such as production-like and production environments. It is part of our deployment process to verify Scout is <i>green</i> in our production-like environment prior to deploying to production where Scout is also used for monitoring purposes.</p>
    <div>
      <h2>How we built it</h2>
      <a href="#how-we-built-it">
        
      </a>
    </div>
    <p>The core of Scout is written in Python and it is a combination of three components interacting together:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4WgzfD8Wk6Vyd0uTE7hfNk/ccea6ac729b0e35906088af25b601cb9/image4-3.png" />
            
            </figure><ul><li><p><b>The Scout plugin</b>: a Pytest plugin to write tests easily</p></li><li><p><b>The Scout service</b>: a scheduler service to run the test suites periodically</p></li><li><p><b>The Scout Worker</b>: a collector and presenter of test reports</p></li></ul>
    <div>
      <h3>The Scout plugin</h3>
      <a href="#the-scout-plugin">
        
      </a>
    </div>
    <p>This is the core component of the Scout system. It allows us to write self-explanatory tests while ensuring a high level of compliance with OpenAPI schemas and verifying API behavior.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IAEu33ULvtQorn3fZWBrn/0ce45e8506e6b37293cdd90ac68da651/image1-2.png" />
            
            </figure><p>The Scout plugin architecture can be split into three parts: setup, the resource allocator, and runners. Setup comprises several subcomponents in charge of initializing the plugin.</p><p>The Registry holds all the information about the pool of accounts and zones we use for testing. For example, entitlements are flags gating which product features a customer can use; the Registry can describe entitlements per account and zone so that Scout can run a test against a specific setup.</p><p>As explained earlier, Scout can validate requests and responses against OpenAPI schemas. This is the responsibility of validators. A validator is built per OpenAPI schema and is selected via the <b><i>@validate</i></b> mark we saw above.</p>
            <pre><code>@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])</code></pre>
            <p>As soon as a validator is selected, all of a given test's interactions with an API are validated. Any validation failure is marked as a test failure.</p><p>The last element of the setup is the config reader, the subcomponent in charge of providing all the URLs and authentication elements the Scout plugin needs to communicate with APIs.</p><p>Next in the chain comes the resource allocator. This component consumes the configuration and objects from the setup to build multiple runners: it is a factory that makes the runners available through the <b><i>cfapi</i></b> fixture.</p>
            <pre><code>response = cfapi.zone.request(method, path, payload)</code></pre>
            <p>When such a line of code is processed, it is the <b><i>request</i></b> method of the zone runner allocated for that test which is executed. The resource allocator can provide specialized runners (account, zone, or default), which makes it possible to target specific API endpoints for a given account or zone.</p><p>Runners handle the execution of requests, manage the test expectations, and use the validators for request/response schema validation.</p><p>Any expectation or validation failure, as well as any exception, is recorded in the stash. The stash is shared across all runners, so as a test setup, run, or cleanup is processed, the timeline of execution and any retries are logged in the stash. The stash contents are later used to build the test suite reports.</p><p>Scout can run multiple test steps in parallel: each resource couple (account runner, zone runner) is associated with a Pytest-xdist worker which runs test steps independently. There can be as many workers as there are resource couples. An extra “default” runner is provided for reaching our different APIs and/or URLs with or without authentication.</p><p>Testing a test system was not the easiest part. We had to build a fake API and assert that the Scout plugin behaves as it should in different situations. We reached and maintain test coverage close to 90%, which gives us enough confidence to rely on the Scout plugin permanently.</p>
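            <p>To picture how these pieces fit together, here is a hedged sketch of how <b><i>@requires</i></b> characteristics might be matched against Registry entitlements. The names and structure are invented for illustration and do not reflect Scout’s real code:</p>
            <pre><code># Invented illustration of matching a test's requirements against a pool of
# accounts described in the Registry; not Scout's actual implementation.
REGISTRY = {
    "account-a": {"entitlements": {"custom_rulesets": True, "max_custom_rules": 2}},
    "account-b": {"entitlements": {"custom_rulesets": False}},
}

def matching_accounts(required):
    """Yield ids of accounts whose entitlements satisfy every requirement."""
    for account_id, entry in REGISTRY.items():
        entitlements = entry["entitlements"]
        if all(entitlements.get(key) == value for key, value in required.items()):
            yield account_id

# A test marked with @requires-style characteristics would then only be
# scheduled against accounts that satisfy them, here "account-a".
print(list(matching_accounts({"custom_rulesets": True, "max_custom_rules": 2})))</code></pre>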
    <div>
      <h3>The Scout service</h3>
      <a href="#the-scout-service">
        
      </a>
    </div>
    <p>The Scout service schedules test suites periodically. It is a configurable scheduler that provides a reporting harness for the test suites as well as multiple metrics. Building a scheduler instead of using cron jobs was a deliberate design decision.</p><p>We wanted to be aware of scheduling issues as well as run issues, and we use Prometheus metrics for this. The catch is that Prometheus, in its default configuration, periodically scrapes the metrics advertised by services; we were concerned about missing metrics if a cron job finished before the next Prometheus scrape. A small, long-running scheduler was therefore better suited to overall observability of the test runs. Among the metrics the Scout service provides are network failures, general test failures, reporting failures, lagging tests, and more.</p>
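    <p>As a rough illustration of that design decision, a minimal scheduler along these lines can be sketched in a few lines of Python. This uses the prometheus_client library, the metric names are invented, and it is not the actual Scout service code:</p>
            <pre><code># Minimal sketch of a long-lived scheduler exposing Prometheus metrics; the
# metric names are invented and this is not the actual Scout service code.
import subprocess
import threading
import time

from prometheus_client import Counter, Gauge, start_http_server

RUNS = Counter("scout_runs_total", "Test suite runs", ["suite"])
FAILURES = Counter("scout_run_failures_total", "Failed test suite runs", ["suite"])
LAST_RUN = Gauge("scout_last_run_timestamp", "Unix time of the last run", ["suite"])

def run_suite_forever(suite, period):
    while True:
        started = time.time()
        RUNS.labels(suite).inc()
        # Each run is a separate pytest (with the Scout plugin) process.
        if subprocess.run(["pytest", suite]).returncode != 0:
            FAILURES.labels(suite).inc()
        LAST_RUN.labels(suite).set_to_current_time()
        # Because the process never exits, metrics remain scrapeable between
        # runs, unlike a short-lived cron job.
        time.sleep(max(0.0, period - (time.time() - started)))

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes this endpoint
    threading.Thread(target=run_suite_forever, args=("tests/rulesets", 300.0)).start()</code></pre>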
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42kpAT26a338B07zUR59ke/ff248a59d66687ac0d31ef7c4d2544a9/image3-3.png" />
            
            </figure><p>The Scout service runs threads on configured periods. Each thread runs a test suite as a separate Pytest-with-Scout-plugin process, followed by a reporting step that consumes the results and publishes them to the relevant parties.</p><p>The reporting component provided to each thread publishes the report to Workers KV and notifies us on chat if there is a failure. Reporting also takes care of publishing the information needed to build API testing coverage. It is in fact mandatory for us to cover all API endpoints and their possible methods, so that we achieve wide and thorough visibility of our live APIs.</p><p>As a fallback, if there is any thread, test, or reporting failure, we are alerted through the Prometheus metrics updated throughout the service's execution. The logs of the Scout service, as well as the logs of each Pytest-Scout plugin execution, are the last-resort source of information if no metrics are available and reporting is failing.</p><p>The service can be deployed with a minimal YAML configuration and set up for different environments. We can, for example, run different test suites depending on the environment, publish to Cloudflare Workers or not, set different periods and retry mechanisms, and so on.</p><p>We keep the tests in our code base alongside the configuration of the Scout service, and that's about it: the Scout service is a separate entity.</p>
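            <p>Publishing a report comes down to a single write against the Workers KV REST API. A hedged sketch of that step, where the account id, namespace id, token, and key are placeholders:</p>
            <pre><code># Sketch of pushing a JSON test report to Workers KV through the Cloudflare
# API; account id, namespace id, and token are placeholders.
import json
import requests

ACCOUNT_ID = "ACCOUNT_ID"
NAMESPACE_ID = "NAMESPACE_ID"

def publish_report(report, key):
    url = (f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
           f"/storage/kv/namespaces/{NAMESPACE_ID}/values/{key}")
    response = requests.put(
        url,
        headers={"Authorization": "Bearer API_TOKEN"},
        data=json.dumps(report),
    )
    response.raise_for_status()

publish_report({"suite": "rulesets", "status": "passed"}, "report-latest")</code></pre>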
    <div>
      <h3>The Scout Worker</h3>
      <a href="#the-scout-worker">
        
      </a>
    </div>
    <p>The Scout Worker is a Cloudflare Worker in charge of fetching the most recent entries from Workers KV and displaying them in an eye-pleasing manner. The Scout service publishes each test report as JSON, so the Worker parses the report and displays its content according to the status of the test suite run.</p><p>For example, an authentication failure during a test resulted in the following display in the Worker:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42dMShTD7SrsgASIKbpCcZ/71e84ef0e898fcad0327d495aeb8cfe5/image5.png" />
            
            </figure>
    <div>
      <h2>What does Scout let us do</h2>
      <a href="#what-does-scout-let-us-do">
        
      </a>
    </div>
    <p>By leveraging the capabilities of Pytest and Cloudflare Workers, we have built a configurable, robust, and reliable system that lets us easily write self-explanatory tests for our APIs.</p><p>We can validate requests and responses against OpenAPI schemas and test behaviors over specific resources, while getting alerted through multiple channels if something goes wrong.</p><p>For specific use cases, we can write a test verifying that the API behaves as it should, that the configuration pushed to the edge is valid, and that a given zone will react as it should to security threats, thus going beyond a simple end-to-end API test.</p><p>Scout quickly became our permanent live tester and monitor of APIs. We wrote tests for all endpoints to maintain wide coverage of all our APIs. Scout has since been used to verify an API version prior to its deployment to production. After a deployment to a production-like environment, we can know within a couple of minutes whether a new feature is good to go to production and assess whether it is behaving correctly.</p><p>We hope you enjoyed this deep dive into one of our systems!</p>
            <category><![CDATA[Monitoring]]></category>
            <category><![CDATA[API]]></category>
            <guid isPermaLink="false">2rnwmGJBmwG1wAJdzvcSG1</guid>
            <dc:creator>Elie Mitrani</dc:creator>
        </item>
        <item>
            <title><![CDATA[Monitoring our monitoring: how we validate our Prometheus alert rules]]></title>
            <link>https://blog.cloudflare.com/monitoring-our-monitoring/</link>
            <pubDate>Thu, 19 May 2022 15:39:19 GMT</pubDate>
            <description><![CDATA[ Pint is a tool we developed to validate our Prometheus alerting rules and ensure they are always working ]]></description>
            <content:encoded><![CDATA[
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>We use <a href="https://prometheus.io/">Prometheus</a> as our core monitoring system. We’ve been heavy Prometheus users since 2017, when we migrated off our previous monitoring system, which used a customized <a href="https://www.nagios.org/">Nagios</a> setup. Despite growing our infrastructure a lot, adding tons of new products and learning some hard lessons about operating Prometheus at scale, our original Prometheus architecture (see <a href="https://www.youtube.com/watch?v=ypWwvz5t_LE">Monitoring Cloudflare's Planet-Scale Edge Network with Prometheus</a> for an in-depth walkthrough) remains virtually unchanged, proving that Prometheus is a solid foundation for building <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> into your services.</p><p>One of the key responsibilities of Prometheus is to alert us when something goes wrong, and in this blog post we’ll talk about how we make those alerts more reliable. We’ll introduce an open source tool we’ve developed to help us with that, and share how you can use it too. If you’re not familiar with Prometheus you might want to start by watching <a href="https://www.youtube.com/watch?v=PzFUwBflXYc">this video</a> to better understand the topic we’ll be covering here.</p><p>Prometheus works by collecting metrics from our services and storing those metrics inside its database, called TSDB. We can then <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">query these metrics</a> using Prometheus’s query language, PromQL, either with ad-hoc queries (for example, to power <a href="https://grafana.com/grafana/">Grafana dashboards</a>) or via <a href="https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/">alerting</a> or <a href="https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/">recording</a> rules. A rule is basically a query that Prometheus runs for us in a loop; when that query returns any results it will either be recorded as new metrics (with recording rules) or trigger alerts (with alerting rules).</p>
    <div>
      <h3>Prometheus alerts</h3>
      <a href="#prometheus-alerts">
        
      </a>
    </div>
    <p>Since we’re talking about improving our alerting, we’ll be focusing on alerting rules.</p><p>To create alerts we first need some metrics collected. For the purposes of this blog post, let’s assume we’re working with the http_requests_total metric, which is used on the Prometheus examples <a href="https://prometheus.io/docs/prometheus/latest/querying/examples/">page</a>. Here are some examples of how our metrics will look:</p>
            <pre><code>http_requests_total{job="myserver", handler="/", method="get", status="200"}
http_requests_total{job="myserver", handler="/", method="get", status="500"}
http_requests_total{job="myserver", handler="/posts", method="get", status="200"}
http_requests_total{job="myserver", handler="/posts", method="get", status="500"}
http_requests_total{job="myserver", handler="/posts/new", method="post", status="201"}
http_requests_total{job="myserver", handler="/posts/new", method="post", status="401"}</code></pre>
            <p>Let’s say we want to alert if our HTTP server is returning errors to customers.</p><p>Since all we need to do is check the metric that tracks how many responses had HTTP status code 500, a simple alerting rule could look like this:</p>
            <pre><code>- alert: Serving HTTP 500 errors
  expr: http_requests_total{status="500"} &gt; 0</code></pre>
            <p>This will alert us if we serve any 500 errors to our customers. Prometheus will run our query looking for time series named http_requests_total that also have a <b>status</b> label with value <b>“500”</b>. Then it will filter those matched time series and only return the ones with a value greater than zero.</p><p>If our alert rule returns any results, an alert will fire, one for each returned result.</p><p>If our rule doesn’t return anything, meaning there are no matched time series, then the alert will not trigger.</p><p>The whole flow from metric to alert is pretty simple here, as we can see on the diagram below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57mqOPxVQ847pKAkMden9K/8ba472d0359bc3fcfd09535cc04e82b7/1-3.png" />
            
            </figure><p>If we want to provide more information in the alert we can do so by setting additional labels and annotations, but the alert and expr fields are all we need to get a working rule.</p><p>The problem with the above rule, though, is that our alert starts firing when we see our first error, and then it never goes away.</p><p>After all, http_requests_total is a counter, so it gets incremented every time there’s a new request and will keep growing as we receive more requests. What this means for us is that our alert is really answering “was there ever a 500 error?”, and even if we fix the problem causing 500 errors we’ll keep getting this alert.</p><p>A better alert would be one that tells us if we’re serving errors <b>right now</b>.</p><p>For that we can use the <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#rate">rate()</a> function to calculate the per-second rate of errors.</p><p>Our modified alert would be:</p>
            <pre><code>- alert: Serving HTTP 500 errors
  expr: rate(http_requests_total{status="500"}[2m]) &gt; 0</code></pre>
            <p>The query above will calculate the rate of 500 errors in the last two minutes. If we start responding with errors to customers our alert will fire, but once errors stop so will this alert.</p><p>This is great because if the underlying issue is resolved the alert will resolve too.</p><p>We can improve our alert further by, for example, alerting on the percentage of errors rather than absolute numbers, or even calculating an error budget, but let’s stop here for now.</p><p>It’s all very simple, so what do we mean when we talk about improving the reliability of alerting? What could go wrong here?</p>
    <div>
      <h3>What could go wrong?</h3>
      <a href="#what-could-go-wrong">
        
      </a>
    </div>
    <p>We can craft a valid YAML file with a rule definition whose perfectly valid query simply doesn’t work the way we expect. For alerting rules, that might mean the alert we rely on to tell us when something is broken will fail to fire when it should. To better understand why that might happen, let’s first explain how querying works in Prometheus.</p>
    <div>
      <h3>Prometheus querying basics</h3>
      <a href="#prometheus-querying-basics">
        
      </a>
    </div>
    <p>There are two basic types of queries we can run against Prometheus. The first one is an <b>instant query</b>. It allows us to ask Prometheus for the point-in-time value of some time series. If we write our query as http_requests_total we’ll get all time series named http_requests_total, along with the most recent value for each of them. We can further customize the query and filter results by adding label matchers, like http_requests_total{status="500"}.</p><p>Let’s say we have two instances of our server, green and red, each scraped by Prometheus (that is, its metrics are collected) every minute, independently of the other.</p><p>This is what happens when we issue an instant query:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KyKw0PTiNywwYY6euDbuH/7910f018372888b78a4609d16395f1d4/2-fixed.png" />
            
            </figure><p>There’s obviously more to it as we can use <a href="https://prometheus.io/docs/prometheus/latest/querying/functions/">functions</a> and build complex queries that utilize multiple metrics in one expression. But for the purposes of this blog post we’ll stop here.</p><p>The important thing to know about instant queries is that they return the most recent value of a matched time series, and they will look back for up to five minutes (by default) into the past to find it. If the last value is older than five minutes then it’s considered stale and Prometheus won’t return it anymore.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V3s3dupV4RHtWcAwbPpYd/bd14209ea0ebb73807da506422ccc5f3/3-fixed.png" />
            
            </figure><p>The second type of query is a <b>range query</b> - it works similarly to an instant query; the difference is that instead of returning the most recent value it gives us a list of values from the selected time range. That time range is always relative, so instead of providing two timestamps we provide a range, like “20 minutes”. When we issue a range query with a 20-minute range it returns all values collected for matching time series from 20 minutes ago until now.</p><p>An important distinction between the two types of queries is that range queries don’t have the same “look back for up to five minutes” behavior as instant queries. If Prometheus cannot find any values collected in the provided time range then it doesn’t return anything.</p><p>If we modify our example to request a [3m] range query we should expect Prometheus to return three data points for each time series:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1dBwnYThdf3ioljbPdmSKV/54b428335adffe68f87cd40ff80e3ecc/4-fixed.png" />
            
            </figure>
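            <p>Both query types are exposed through Prometheus’s HTTP API, which makes the difference easy to observe for yourself. A small sketch, assuming a Prometheus server on localhost:9090:</p>
            <pre><code># Compare an instant query with a range query via the Prometheus HTTP API;
# assumes a Prometheus server listening on localhost:9090.
import time
import requests

BASE = "http://localhost:9090/api/v1"

# Instant query: one (timestamp, value) pair per matching series, looking
# back up to five minutes (by default) for the most recent sample.
instant = requests.get(f"{BASE}/query",
                       params={"query": 'http_requests_total{status="500"}'})
print(instant.json()["data"]["result"])

# Range query: a list of (timestamp, value) pairs per series over the last
# 20 minutes, with no five-minute look-back behavior.
now = time.time()
rng = requests.get(f"{BASE}/query_range",
                   params={"query": "http_requests_total",
                           "start": now - 20 * 60, "end": now, "step": "60s"})
print(rng.json()["data"]["result"])</code></pre>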
    <div>
      <h3>When queries don’t return anything</h3>
      <a href="#when-queries-dont-return-anything">
        
      </a>
    </div>
    <p>Knowing a bit more about how queries work in Prometheus, we can go back to our alerting rules and spot a potential problem: queries that don’t return anything.</p><p>If our query doesn’t match any time series, or if the matches are considered stale, then Prometheus will return an empty result. This might be because we’ve made a typo in the metric name or label filter, because the metric we ask for is no longer being exported (or was never there in the first place), or because we’ve added some condition that wasn’t satisfied, like requiring a non-zero value in our http_requests_total{status="500"} &gt; 0 example.</p><p>Prometheus will not return an error in any of the scenarios above because none of them are really problems; it’s just how querying works. If you ask for something that doesn’t match your query then you get empty results. This means that there’s no distinction between “all systems are operational” and “you’ve made a typo in your query”. So if you’re not receiving any alerts from your service, it’s either a sign that everything is working fine, or that you’ve made a typo and have no working monitoring at all, and it’s up to you to verify which one it is.</p><p>For example, we could be trying to query for http_requests_totals instead of http_requests_total (an extra “s” at the end) and although our query will look fine it won’t ever produce any alert.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2p2hNra24jqNgnfeaaLZ6S/9a07d508dcf2c6a1417fe5b3dc4ad46b/5-1.png" />
            
            </figure><p>Range queries can add another twist - they’re mostly used in Prometheus functions like rate(), which we used in our example. This function will only work correctly if it receives a range query expression that returns at least two data points for each time series; after all, it’s impossible to calculate a rate from a single number.</p><p>Since the number of data points depends on the time range we pass to the range query, which we then pass to our rate() function, providing a time range that only contains a single value means rate() won’t be able to calculate anything, and once again we’ll get empty results.</p><p>The number of values collected in a given time range depends on the interval at which Prometheus collects all metrics, so to use rate() correctly you need to know how your Prometheus server is configured. You can read more about this <a href="https://www.robustperception.io/what-range-should-i-use-with-rate">here</a> and <a href="https://promlabs.com/blog/2021/01/29/how-exactly-does-promql-calculate-rates">here</a> if you want to better understand how rate() works in Prometheus.</p><p>For example, if we collect our metrics every minute, then a range query http_requests_total[1m] will be able to find only one data point. Here’s a reminder of how this looks:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ghp7yB5qlTe23SxtTKEm3/c66a1cbe0687303ea58e672783bfdef9/6-fixed.png" />
            
            </figure><p>Since, as we mentioned before, we can only calculate rate() if we have at least two data points, calling rate(http_requests_total[1m]) will never return anything and so our alerts will never work.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xBQkUPr4OsGRpAbeXXtmc/093488f9d4c3c129897cb0dc62190031/7.png" />
            
            </figure><p>There are more potential problems we can run into when writing Prometheus queries; for example, any operation between two metrics will only work if both have the same set of labels, and you can read about this <a href="https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching">here</a>. But we’ll stop here for now, since listing all the gotchas could take a while. The point to remember is simple: if your alerting query doesn’t return anything, it might be that everything is OK and there’s no need to alert, but it might also be that you’ve mistyped your metric name, your label filter cannot match anything, your metric disappeared from Prometheus, you are using too small a time range for your range queries, and so on.</p>
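            <p>To make the two-data-point requirement concrete, here is a simplified sketch of what rate() fundamentally computes. The real implementation also extrapolates to the window boundaries and compensates for counter resets, which this sketch skips:</p>
            <pre><code># Simplified illustration of rate(): the per-second increase across the
# samples found in a range window. Real Prometheus additionally extrapolates
# to the window boundaries and handles counter resets.
def simple_rate(samples):
    """samples: (unix_timestamp, counter_value) pairs within the range."""
    if len(samples) &lt; 2:
        return None  # a single data point: no rate, the query returns nothing
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)

print(simple_rate([(0, 100)]))             # None: [1m] found only one sample
print(simple_rate([(0, 100), (60, 160)]))  # 1.0 requests per second</code></pre>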
    <div>
      <h3>Renaming metrics can be dangerous</h3>
      <a href="#renaming-metrics-can-be-dangerous">
        
      </a>
    </div>
    <p>We’ve been running Prometheus for a few years now and during that time we’ve grown our collection of alerting rules a lot. Plus we keep adding new products or modifying existing ones, which often includes adding and removing metrics, or modifying existing metrics, which may include renaming them or changing what labels are present on these metrics.</p><p>A lot of metrics come from metrics exporters maintained by the Prometheus community, like <a href="https://github.com/prometheus/node_exporter">node_exporter</a>, which we use to gather some operating system metrics from all of our servers. Those exporters also undergo changes which might mean that some metrics are deprecated and removed, or simply renamed.</p><p>A problem we’ve run into a few times is that sometimes our alerting rules wouldn’t be updated after such a change, for example when we upgraded node_exporter across our fleet. Other times, the addition of a new label on some metrics would suddenly cause Prometheus to no longer return anything for some of our alerting queries, making such a rule useless.</p><p>It’s worth noting that Prometheus does have a way of <a href="https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/">unit testing rules</a>, but since it works on mocked data it’s mostly useful to validate the logic of a query. Unit testing won’t tell us if, for example, a metric we rely on suddenly disappeared from Prometheus.</p>
    <div>
      <h3>Chaining rules</h3>
      <a href="#chaining-rules">
        
      </a>
    </div>
    <p>When writing alerting rules we try to limit <a href="https://en.wikipedia.org/wiki/Alarm_fatigue">alert fatigue</a> by ensuring that, among other things, alerts are only generated when action is needed, they clearly describe the problem that needs addressing, they have a link to a runbook and a dashboard, and finally that we aggregate them as much as possible. This means that a lot of the alerts we have won’t trigger for each individual instance of a service that’s affected, but rather once per data center or even globally.</p><p>For example, we might alert if the rate of HTTP errors in a data center is above 1% of all requests. To do that we first need to calculate the overall rate of errors across all instances of our server. For that we would use a <a href="https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/">recording rule</a>:</p>
            <pre><code>- record: job:http_requests_total:rate2m
  expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

- record: job:http_requests_status500:rate2m
  expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)</code></pre>
            <p>The first rule tells Prometheus to calculate the per-second rate of all requests and sum it across all instances of our server. The second rule does the same but only sums time series whose status label equals “500”. Both rules produce new metrics named after the value of the <b>record</b> field.</p><p>Now we can modify our alert rule to use the new metrics generated by our recording rules:</p>
            <pre><code>- alert: Serving HTTP 500 errors
  expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m &gt; 0.01</code></pre>
            <p>If we have a data center-wide problem then we will raise just one alert, rather than one per instance of our server, which can be a great quality-of-life improvement for our on-call engineers.</p><p>But at the same time we’ve added two new rules that we need to maintain, and we must ensure they keep producing results. To make things more complicated, we could have recording rules producing metrics based on other recording rules, and then even more rules that we need to ensure are working correctly.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3j3cCrMguixvR5m9ZTnhPH/b38180795e05fd9d266f3e90d0e731d7/8.png" />
            
            </figure><p>What if all those rules in our chain are maintained by different teams? What if the rule in the middle of the chain suddenly gets renamed because one of the teams needs that? Problems like that can easily crop up now and then if your environment is sufficiently complex, and when they do, they’re not always obvious. After all, the only sign that something stopped working is, well, silence: your alerts no longer trigger. If you’re lucky, you’re plotting your metrics on a dashboard somewhere, and hopefully someone will notice if they go empty, but it’s risky to rely on that.</p><p>We definitely felt we needed something better than hope.</p>
    <div>
      <h3>Introducing pint: a Prometheus rule linter</h3>
      <a href="#introducing-pint-a-prometheus-rule-linter">
        
      </a>
    </div>
    <p>To avoid running into such problems in the future we decided to write a tool that would help us do a better job of testing our alerting rules against live Prometheus servers, so we can spot missing metrics or typos more easily. We also wanted new engineers, who might not yet have in-depth knowledge of how Prometheus works, to be able to write rules with confidence without having to get feedback from more experienced team members.</p><p>Since we believe such a tool will have value for the entire Prometheus community we’ve open-sourced it, and it’s available for anyone to use. Say hello to pint!</p><p>You can find the sources on <a href="https://github.com/cloudflare/pint">GitHub</a>, and there’s also <a href="https://cloudflare.github.io/pint/">online documentation</a> that should help you get started.</p><p>Pint works in three different ways:</p><ul><li><p>You can run it against files with Prometheus rules</p></li><li><p>It can run as part of your CI pipeline</p></li><li><p>Or you can deploy it as a side-car to all your Prometheus servers</p></li></ul><p>It doesn’t require any configuration to run, but in most cases it provides the most value if you create a configuration file and define the Prometheus servers it should validate all rules against. Running without any configured Prometheus servers limits pint to static analysis of the rules, which can identify a range of problems but won’t tell you if your rules are trying to query non-existent metrics.</p><p>In the first mode, pint reads a file (or a directory containing multiple files), parses it, does all the basic syntax checks and then runs a series of checks for all Prometheus rules in those files.</p><p>The second mode is optimized for validating git-based pull requests. Instead of testing all rules from all files, pint only tests the rules that were modified and reports only problems affecting modified lines.</p><p>In the third mode, pint runs as a daemon and tests all rules on a regular basis. If it detects any problems it exposes them as metrics. You can then collect those metrics using Prometheus and alert on them as you would for any other problem. This way you can basically use Prometheus to monitor itself.</p><p>What kind of checks can it run for us, and what kind of problems can it detect?</p><p>All the checks are documented <a href="https://cloudflare.github.io/pint/checks/">here</a>, along with some tips on how to deal with any detected problems. Let’s cover the most important ones briefly.</p><p>As mentioned above, the main motivation was to catch rules that try to query metrics that are missing or that were simply mistyped. To do that, pint runs each query from every alerting and recording rule to see if it returns any result; if it doesn’t, pint breaks the query down to identify all individual metrics and checks for the existence of each of them. If any of them is missing, or if the query tries to filter using labels that aren’t present on any time series for a given metric, pint reports that back to us.</p><p>So if someone tries to add a new alerting rule with an http_requests_totals typo in it, pint will detect that when running CI checks on the pull request and stop it from being merged, which takes care of validating rules as they are added to our configuration management system.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6pVy8R1PRoDvIt8jgooQJu/0bcf4d8f15e4eec4a61a854603153ffc/9.png" />
            
            </figure><p>Another useful check tries to estimate the number of times a given alerting rule would trigger an alert. This is useful when raising a pull request that adds new alerting rules - nobody wants to be flooded with alerts from a rule that’s too sensitive, so having this information on a pull request allows us to spot rules that could lead to alert fatigue.</p><p>Similarly, another check provides information on how many new time series a recording rule adds to Prometheus. In our setup a single unique time series uses, on average, 4KiB of memory. So if a recording rule generates 10 thousand new time series it will increase Prometheus server memory usage by 10000*4KiB=40MiB. Forty megabytes might not sound like much, but our peak time series usage in the last year was around 30 million time series in a single Prometheus server, so we pay attention to anything that might add a substantial number of new time series, and pint helps us notice that before such a rule gets added to Prometheus.</p><p>On top of all the Prometheus query checks, pint also allows us to ensure that all the alerting rules comply with policies we’ve set for ourselves. For example, we require everyone to write a runbook for their alerts and link to it in the alerting rule using annotations.</p><p>We also require all alerts to have priority labels, so that high priority alerts generate pages for the responsible teams, while low priority ones are only routed to the <a href="https://github.com/prymitive/karma">karma dashboard</a> or create tickets using <a href="https://github.com/prometheus-community/jiralert">jiralert</a>. It’s easy to forget one of these required fields, and that’s not something which can be enforced using unit testing, but pint lets us do it with a few configuration lines.</p><p>With pint running at all stages of our Prometheus rule life cycle, from initial pull request to monitoring rules deployed in our many data centers, we can rely on our Prometheus alerting rules to always work and notify us of any incident, large or small.</p><p>GitHub: <a href="https://github.com/cloudflare/pint">https://github.com/cloudflare/pint</a></p>
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Let’s see how we can use pint to validate our rules as we work on them.</p><p>We can begin by creating a file called “rules.yml” and adding both recording rules there.</p><p>The goal is to write new rules that we want to add to Prometheus, but before we actually add those, we want pint to validate it all for us.</p>
            <pre><code>groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)</code></pre>
            <p>Next we’ll download the latest version of pint from <a href="https://github.com/cloudflare/pint/releases">GitHub</a> and run it to check our rules.</p>
            <pre><code>$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:8: syntax error: unclosed left parenthesis (promql/syntax)
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)

level=info msg="Problems found" Fatal=1
level=fatal msg="Execution completed with error(s)" error="problems found"</code></pre>
            <p>Whoops, we have “sum(rate(...)” and so we’re missing one of the closing brackets. Let’s fix that and try again.</p>
            <pre><code>groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)</code></pre>
            
            <pre><code>$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2</code></pre>
            <p>Our rule now passes the most basic checks, so we know it’s valid. But to know if it works with a real Prometheus server we need to tell pint how to talk to Prometheus. For that we’ll need a config file that defines the Prometheus server to test our rule against; it should be the same server we’re planning to deploy our rule to. Here we’ll be using a test instance running on localhost. Let’s create a “pint.hcl” file and define our Prometheus server there:</p>
            <pre><code>prometheus "prom1" {
  uri     = "http://localhost:9090"
  timeout = "1m"
}</code></pre>
            <p>Now we can re-run our check using this configuration file:</p>
            <pre><code>$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:5: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

rules.yml:8: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

level=info msg="Problems found" Bug=2
level=fatal msg="Execution completed with error(s)" error="problems found"</code></pre>
            <p>Yikes! It’s a test Prometheus instance, and we forgot to collect any metrics from it.</p><p>Let’s fix that by starting our server locally on port 8080 and configuring Prometheus to collect metrics from it:</p>
            <pre><code>scrape_configs:
  - job_name: webserver
    static_configs:
      - targets: ['localhost:8080']</code></pre>
            <p>Let’s re-run our checks once more:</p>
            <pre><code>$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2</code></pre>
            <p>This time everything works!</p><p>Now let’s add our alerting rule to our file, so it now looks like this:</p>
            <pre><code>groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

- name: Demo alerting rules
  rules:
  - alert: Serving HTTP 500 errors
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m &gt; 0.01</code></pre>
            <p>And let’s re-run pint once again:</p>
            <pre><code>$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m &gt; 0.01

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_total:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m &gt; 0.01

level=info msg="Problems found" Information=2</code></pre>
            <p>It all works according to pint, and so we can now safely deploy our new rules file to Prometheus.</p><p>Notice that pint recognised that both metrics used in our alert come from recording rules, which haven’t been added to Prometheus yet, so there’s no point querying Prometheus to verify whether they exist there.</p><p>Now what happens if we deploy a new version of our server that renames the “status” label to something else, like “code”?</p>
            <pre><code>$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:8: prometheus "prom1" at http://localhost:9090 has "http_requests_total" metric but there are no series with "status" label in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m &gt; 0.01

level=info msg="Problems found" Bug=1 Information=1
level=fatal msg="Execution completed with error(s)" error="problems found"</code></pre>
            <p>Luckily pint will notice this and report it, so we can adapt our rule to match the new name.</p><p>But what if that happens after we deploy our rule? For that we can use the “pint watch” command that runs pint as a daemon, periodically checking all rules.</p><p>Please note that validating all metrics used in a query will eventually produce some false positives. In our example, metrics with the status="500" label might not be exported by our server until there’s at least one request ending in an HTTP 500 error.</p><p>The <a href="https://cloudflare.github.io/pint/checks/promql/series.html">promql/series</a> check responsible for validating the presence of all metrics has some documentation on how to deal with this problem. In most cases you’ll want to add a comment that instructs pint to ignore some missing metrics entirely or stop checking label values (only check that a “status” label is present, without checking whether there are time series with status="500").</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>Prometheus metrics don’t follow any strict schema; whatever services expose will be collected. At the same time, a lot of query problems hide behind empty results, which makes noticing them non-trivial.</p><p>We use pint to find such problems and report them to engineers, so that our global network is always monitored correctly and we can be confident that a lack of alerts really means our infrastructure is reliable.</p>
            <category><![CDATA[Monitoring]]></category>
            <category><![CDATA[Prometheus]]></category>
            <category><![CDATA[Observability]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">YZO15EI2Aw87tTU31VtF0</guid>
            <dc:creator>Lukasz Mierzwa</dc:creator>
        </item>
        <item>
            <title><![CDATA[Automatic Remediation of Kubernetes Nodes]]></title>
            <link>https://blog.cloudflare.com/automatic-remediation-of-kubernetes-nodes/</link>
            <pubDate>Thu, 15 Jul 2021 12:59:42 GMT</pubDate>
            <description><![CDATA[ In Cloudflare’s core data centers, we are using Kubernetes to run many of the diverse services that help us control Cloudflare’s edge. We are automating some aspects of node remediation to keep the Kubernetes clusters healthy. ]]></description>
            <content:encoded><![CDATA[ <p>We use <a href="https://kubernetes.io/">Kubernetes</a> to run many of the diverse services that help us control Cloudflare’s edge. We have five geographically diverse clusters, with hundreds of nodes in our largest cluster. These clusters are self-managed on bare-metal machines which gives us a good amount of power and flexibility in the software and integrations with Kubernetes. However, it also means we don’t have a cloud provider to rely on for virtualizing or managing the nodes. This distinction becomes even more prominent when considering all the different reasons that nodes degrade. With self-managed bare-metal machines, the list of reasons that cause a node to become unhealthy includes:</p><ul><li><p>Hardware failures</p></li><li><p>Kernel-level software failures</p></li><li><p>Kubernetes cluster-level software failures</p></li><li><p>Degraded network communication</p></li><li><p>Required software updates</p></li><li><p>Resource exhaustion<sup>1</sup></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59IDEspjCRavbrtH1Rtu4r/a5f057aa79b2aae10fc0073b8f3b5573/image2-5.png" />
            
            </figure>
    <div>
      <h2>Unhappy Nodes</h2>
      <a href="#unhappy-nodes">
        
      </a>
    </div>
    <p>We have plenty of examples of failures in the aforementioned categories, but one example has been particularly tedious to deal with. It starts with the following log line from the kernel:</p>
            <pre><code>unregister_netdevice: waiting for lo to become free. Usage count = 1</code></pre>
            <p>The issue is further observed when the number of network interfaces on the node owned by the Container Network Interface (CNI) plugin gets out of proportion with the number of running pods:</p>
            <pre><code>$ ip link | grep cali | wc -l
1088</code></pre>
            <p>This is unexpected, as the interface count shouldn't exceed the maximum number of pods allowed on a node (we use the default limit of 110). While this issue is interesting and perhaps worthy of a whole separate blog post, the short of it is that the Linux network interfaces owned by the CNI are not getting cleaned up after a pod terminates.</p><p>Some history on this can be read in a <a href="https://github.com/moby/moby/issues/5618">Docker GitHub issue</a>. We found this seems to plague nodes with longer uptimes; after rebooting, a node would be fine for about a month. However, with a significant number of nodes, this was happening multiple times per day. Each instance would need rebooting, which means going through our worker reboot procedure, which looked like this (a naive scripted sketch follows the list):</p><ol><li><p>Cordon off the affected node to prevent new workloads from scheduling on it.</p></li><li><p>Collect any diagnostic information for later investigation.</p></li><li><p>Drain the node of current workloads.</p></li><li><p>Reboot and wait for the node to come back.</p></li><li><p>Verify the node is healthy.</p></li><li><p>Re-enable scheduling of new workloads to the node.</p></li></ol><p>While solving the underlying issue would be ideal, we needed a mitigation to avoid toil in the meantime — an automated node remediation process.</p>
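            <p>Scripted naively, that procedure looks something like the following Python sketch. This is a simplified illustration, not Cloudflare’s actual tooling; the real process involves more careful diagnostics and health verification:</p>
            <pre><code># Naive sketch of the worker reboot procedure, driven through kubectl and
# ssh; simplified for illustration, not Cloudflare's actual tooling.
import subprocess
import time

def sh(*args):
    subprocess.run(args, check=True)

def node_is_ready(node):
    out = subprocess.run(
        ["kubectl", "get", "node", node, "-o",
         'jsonpath={.status.conditions[?(@.type=="Ready")].status}'],
        capture_output=True, text=True)
    return out.stdout == "True"

def remediate(node):
    sh("kubectl", "cordon", node)                 # 1. keep new workloads away
    sh("ssh", node, "dmesg &gt; /tmp/diagnostics")   # 2. collect diagnostics (simplified)
    sh("kubectl", "drain", node,                  # 3. evict current workloads
       "--ignore-daemonsets", "--delete-emptydir-data")
    sh("ssh", node, "sudo reboot")                # 4. reboot...
    time.sleep(60)
    while not node_is_ready(node):                # ...and wait for it to come back
        time.sleep(10)                            # 5. health verification would go here
    sh("kubectl", "uncordon", node)               # 6. re-enable scheduling</code></pre>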
    <div>
      <h2>Existing Detection and Remediation Solutions</h2>
      <a href="#existing-detection-and-remediation-solutions">
        
      </a>
    </div>
    <p>While not complicated, the manual remediation process outlined previously became tedious and distracting, as we had to reboot nodes multiple times a day. Some manual intervention is unavoidable, but for cases matching the following, we wanted automation:</p><ul><li><p>Generic worker nodes</p></li><li><p>Software issues confined to a given node</p></li><li><p>Already researched and diagnosed issues</p></li></ul><p>Limiting automatic remediation to generic worker nodes is important as there are other node types in our clusters where more care is required. For example, for control-plane nodes the process has to be augmented to check <a href="https://etcd.io/">etcd</a> cluster health and ensure proper redundancy for components servicing the Kubernetes API. We are also going to limit the problem space to known software issues confined to a node where we expect automatic remediation to be the right answer (as in our ballooning network interface problem). With that in mind, we took a look at existing solutions that we could use.</p>
    <div>
      <h3>Node Problem Detector</h3>
      <a href="#node-problem-detector">
        
      </a>
    </div>
    <p><a href="https://github.com/kubernetes/node-problem-detector">Node problem detector</a> is a daemon that runs on each node that detects problems and reports them to the Kubernetes API. It has a pluggable problem daemon system such that one can add their own logic for detecting issues with a node. Node problems are distinguished between temporary and permanent problems, with the latter being persisted as status conditions on the Kubernetes node resources.<sup>2</sup></p>
    <div>
      <h3>Draino and Cluster-Autoscaler</h3>
      <a href="#draino-and-cluster-autoscaler">
        
      </a>
    </div>
    <p><a href="https://github.com/planetlabs/draino">Draino</a>, as its name implies, drains nodes, but does so based on Kubernetes node conditions. It is meant to be used with <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">cluster-autoscaler</a>, which can then add or remove nodes via the cluster plugins to scale node groups.</p>
    <div>
      <h3>Kured</h3>
      <a href="#kured">
        
      </a>
    </div>
    <p><a href="https://github.com/weaveworks/kured">Kured</a> is a daemon that looks at the presence of a file on the node to initiate a drain, reboot and uncordon of the given node. It uses a locking mechanism via the Kubernetes API to ensure only a single node is acted upon at a time.</p>
    <div>
      <h3>Cluster-API</h3>
      <a href="#cluster-api">
        
      </a>
    </div>
    <p>The <a href="https://github.com/kubernetes/community/tree/master/sig-cluster-lifecycle">Kubernetes cluster-lifecycle SIG</a> has been working on the <a href="https://github.com/kubernetes-sigs/cluster-api">cluster-api</a> project to enable declaratively defining clusters to simplify provisioning, upgrading, and operating multiple Kubernetes clusters. It has a concept of machine resources which back Kubernetes node resources, and furthermore has a concept of <a href="https://cluster-api.sigs.k8s.io/tasks/healthcheck.html">machine health checks</a>. Machine health checks use node conditions to identify unhealthy nodes; the cluster-api provider is then delegated to replace that machine via create and delete operations.</p>
    <div>
      <h2>Proof of Concept</h2>
      <a href="#proof-of-concept">
        
      </a>
    </div>
    <p>Interestingly, with all the above except for Kured, there is a theme of pluggable components centered around Kubernetes node conditions. We wanted to see if we could build a proof of concept using the existing theme and solutions. Among the existing solutions, draino with cluster-autoscaler didn’t make sense in a non-cloud environment like our bare-metal setup. The cluster-api health checks are interesting; however, they require a more complete investment in the cluster-api project to really make sense. That left us with node-problem-detector and kured. Deploying node-problem-detector was simple, and we ended up testing a <a href="https://github.com/kubernetes/node-problem-detector/blob/master/docs/custom_plugin_monitor.md">custom-plugin-monitor</a> like the following:</p>
            <pre><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
data:
  check_calico_interfaces.sh: |
    #!/bin/bash
    set -euo pipefail
    
    count=$(nsenter -n/proc/1/ns/net ip link | grep cali | wc -l)
    
    if (( $count &gt; 150 )); then
      echo "Too many calico interfaces ($count)"
      exit 1
    else
      exit 0
    fi
  cali-monitor.json: |
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "30s",
        "timeout": "5s",
        "max_output_length": 80,
        "concurrency": 3,
        "enable_message_change_based_condition_update": false
      },
      "source": "calico-custom-plugin-monitor",
      "metricsReporting": false,
      "conditions": [
        {
          "type": "NPDCalicoUnhealthy",
          "reason": "CalicoInterfaceCountOkay",
          "message": "Normal amount of interfaces"
        }
      ],
      "rules": [
        {
          "type": "permanent",
          "condition": "NPDCalicoUnhealthy",
          "reason": "TooManyCalicoInterfaces",
          "path": "/bin/bash",
          "args": [
            "/config/check_calico_interfaces.sh"
          ],
          "timeout": "3s"
        }
      ]
    }</code></pre>
            <p>Testing showed that when the condition became true, a condition would be updated on the associated Kubernetes node like so:</p>
            <pre><code>kubectl get node -o json worker1a | jq '.status.conditions[] | select(.type | test("^NPD"))'
{
  "lastHeartbeatTime": "2020-03-20T17:05:17Z",
  "lastTransitionTime": "2020-03-20T17:05:16Z",
  "message": "Too many calico interfaces (154)",
  "reason": "TooManyCalicoInterfaces",
  "status": "True",
  "type": "NPDCalicoUnhealthy"
}</code></pre>
            <p>With that in place, the actual remediation needed to happen. Kured seemed to do almost everything we needed, except that it looks at a file rather than Kubernetes node conditions. We hacked together a patch to change that and tested it successfully end to end — we had a working proof of concept!</p>
    <div>
      <h2>Revisiting Problem Detection</h2>
      <a href="#revisiting-problem-detection">
        
      </a>
    </div>
    <p>While the above worked, we found that node-problem-detector was unwieldy because we were replicating our existing monitoring into shell scripts and node-problem-detector configuration. A <a href="https://www.infoq.com/news/2017/10/monitoring-cloudflare-prometheus/">2017 blog post</a> describes Cloudflare’s monitoring stack, although some things have changed since then. What hasn’t changed is our extensive usage of <a href="https://prometheus.io/">Prometheus</a> and <a href="https://github.com/prometheus/alertmanager">Alertmanager</a>.</p><p>For the network interface issue and other issues we wanted to address, we already had the necessary exported metrics and alerting to go with them. Here is what our already existing alert looked like<sup>3</sup>:</p>
            <pre><code>- alert: CalicoTooManyInterfaces
  expr: sum(node_network_info{device=~"cali.*"}) by (node) &gt;= 200
  for: 1h
  labels:
    priority: "5"
    notify: chat-sre-core chat-k8s</code></pre>
            <p>Note that we use a “notify” label to drive the routing logic in Alertmanager. However, that got us asking, could we just route this to a Kubernetes node condition instead?</p>
    <div>
      <h2>Introducing Sciuro</h2>
      <a href="#introducing-sciuro">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Hx2xX6FXjjTsCdZZFmn7O/1e8be40972f81d59b9be18a1e60c68c4/image1-6.png" />
            
            </figure><p>Sciuro is our open-source replacement for node-problem-detector that has one simple job: synchronize Kubernetes node conditions with currently firing alerts in Alertmanager. Node problems can be defined with a more holistic view, using already existing exporters such as <a href="https://github.com/prometheus/node_exporter">node exporter</a>, <a href="https://github.com/google/cadvisor">cadvisor</a> or <a href="https://github.com/google/mtail">mtail</a>. It also doesn’t run on the affected nodes, which allows us to rely on out-of-band remediation techniques. Here is a high level diagram of how Sciuro works:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TF20JXIsXxWApKN9yyEcy/fe4d2244e860fe8e0c1477e06ae24c30/image4-3.png" />
            
            </figure><p>Starting from the top, nodes are scraped by Prometheus, which collects those metrics and fires relevant alerts to Alertmanager. Sciuro polls Alertmanager for alerts with a matching receiver, matches them with a corresponding node resource in the Kubernetes API and updates that node’s conditions accordingly.</p><p>In more detail, we can start by defining an alert in Prometheus like the following:</p>
            <pre><code>- alert: CalicoTooManyInterfacesEarly
  expr: sum(node_network_info{device=~"cali.*"}) by (node) &gt;= 150
  labels:
    priority: "6"
    notify: node-condition-k8s</code></pre>
            <p>Note the two differences from the previous alert. First, we use a new name with a more sensitive trigger. The idea is that we want automatic node remediation to try fixing the node as soon as possible, but if the problem worsens or automatic remediation is failing, humans will still get notified to act. The second difference is that instead of notifying chat rooms, we route to a target called “node-condition-k8s”.</p><p>Sciuro then comes into play, polling the Alertmanager API for alerts matching the “node-condition-k8s” receiver. The following shows the equivalent using <a href="https://github.com/prometheus/alertmanager/tree/master/cmd/amtool">amtool</a>:</p>
            <pre><code>$ amtool alert query -r node-condition-k8s
Alertname                 	Starts At            	Summary                                                               	 
CalicoTooManyInterfacesEarly  2021-05-11 03:25:21 UTC  Kubernetes node worker1a has too many Calico interfaces  </code></pre>
            <p>We can also check the labels for this alert:</p>
            <pre><code>$ amtool alert query -r node-condition-k8s -o json | jq '.[] | .labels'
{
  "alertname": "CalicoTooManyInterfacesEarly",
  "cluster": "a.k8s",
  "instance": "worker1a",
  "node": "worker1a",
  "notify": "node-condition-k8s",
  "priority": "6",
  "prometheus": "k8s-a"
}</code></pre>
            <p>Note the node and instance labels, which Sciuro will use for matching with the corresponding Kubernetes node. Sciuro uses the excellent <a href="https://github.com/kubernetes-sigs/controller-runtime">controller-runtime</a> to keep track of and update node resources in the Kubernetes API. We can observe the updated node condition on the status field via kubectl:</p>
            <pre><code>$ kubectl get node worker1a -o json | jq '.status.conditions[] | select(.type | test("^AlertManager"))'
{
  "lastHeartbeatTime": "2021-05-11T03:31:20Z",
  "lastTransitionTime": "2021-05-11T03:26:53Z",
  "message": "[P6] Kubernetes node worker1a has too many Calico interfaces",
  "reason": "AlertIsFiring",
  "status": "True",
  "type": "AlertManager_CalicoTooManyInterfacesEarly"
}</code></pre>
            <p>One important note: Sciuro adds the AlertManager_ prefix to the node condition type to prevent conflicts with other node condition types. For example, DiskPressure, a kubelet-managed condition, could also be an alert name. Sciuro also properly updates the heartbeat and transition times to reflect when it first saw the alert and its last update. With node conditions synchronized by Sciuro, remediation can take place via one of the existing tools. As mentioned previously, we are using a modified version of Kured for now.</p><p>We’re happy to announce that we’ve open sourced Sciuro, and it can be found on <a href="https://github.com/cloudflare/sciuro">GitHub</a>, where you can read the code, find the deployment instructions, or open a Pull Request for changes.</p>
    <div>
      <h2>Managing Node Uptime</h2>
      <a href="#managing-node-uptime">
        
      </a>
    </div>
    <p>While we began using automatic node remediation for obvious problems, we’ve expanded its purpose to additionally keep node uptime low. Low node uptime is desirable to further reduce drift on nodes, keep the node initialization process well-oiled, and encourage the best deployment practices on the Kubernetes clusters. To expand on the last point, services that are deployed with best practices and in a high availability fashion should see negligible impact when a single node leaves the cluster. However, services that are not deployed with best practices will most likely have problems, especially if they rely on singleton pods. Draining nodes more frequently introduces regular chaos that encourages best practices. To enable this with automatic node remediation, we defined the following alert:</p>
            <pre><code>- alert: WorkerUptimeTooHigh
  expr: |
    (
      (
        (
              max by(node) (kube_node_role{role="worker"})
            - on(node) group_left()
              (max by(node) (kube_node_role{role!="worker"}))
          or on(node)
            max by(node) (kube_node_role{role="worker"})
        ) == 1
      )
    * on(node) group_left()
      (
        (time() - node_boot_time_seconds) &gt; (60 * 60 * 24 * 7)
      )
    )
  labels:
    priority: "9"
    notify: node-condition-k8s</code></pre>
            <p>There is a bit of juggling with the kube_node_role metric in the above to isolate the alert to generic worker nodes, but at a high level it looks at node_boot_time_seconds, a metric from the <a href="https://github.com/prometheus/node_exporter">prometheus node_exporter</a>. Again, the notify label is configured to route to node conditions, which kicks off the automatic node remediation. One further detail: the priority here is set to “9”, which is of lower precedence than our other alerts. Note that the message field of the node condition is prefixed with the alert priority in brackets. This allows the remediation process to take priority into account when choosing which node to remediate first, which is important because Kured uses a lock to act on a single node at a time.</p>
    <div>
      <h2>Wrapping Up</h2>
      <a href="#wrapping-up">
        
      </a>
    </div>
    <p>In the past 30 days, we’ve used the above automatic node remediation process to remediate 571 nodes. That has saved our humans a considerable amount of time. We’ve also been able to reduce the time to repair for some issues, as automatic remediation can act at all times of the day and with a faster response time.</p><p>As mentioned before, we’re open sourcing Sciuro, and its code can be found on <a href="https://github.com/cloudflare/sciuro">GitHub</a>. We’re open to issues, suggestions, and pull requests. We do have some ideas for future improvements. For Sciuro, we may look to reduce latency, which is mainly due to polling, and potentially add a push model from Alertmanager, although this isn’t a need we’ve had yet. For the larger node remediation story, we hope to do an overhaul of the remediating component. As mentioned previously, we are currently using a fork of Kured, but a future replacement component should include the following:</p><ul><li><p>Use out-of-band management interfaces to be able to shut down and power on nodes without a functional operating system.</p></li><li><p>Move from a decentralized architecture to a centralized one that can integrate more complicated logic. This might include being able to act on entire failure domains in parallel.</p></li><li><p>Handle specialized nodes such as masters or storage nodes.</p></li></ul><p>Finally, we’re looking for more people passionate about Kubernetes to <a href="https://boards.greenhouse.io/cloudflare/jobs/816059?gh_jid=816059">join our team</a>. Come help us push Kubernetes to the next level to serve Cloudflare’s many needs!</p><hr /><p><sup>1</sup>Exhaustion can apply to hardware resources, kernel resources, or logical resources like the amount of logging being produced.</p><p><sup>2</sup>Nearly all Kubernetes objects have <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/">spec and status fields</a>. The status field is used to describe the current state of an object. For node resources, the kubelet typically manages a conditions field under the status field for reporting things like whether the node is ready to serve pods.</p>
<p><sup>3</sup>The format of these alert definitions is documented in the Prometheus <a href="https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/">Alerting Rules</a> documentation.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">76X316XLhCYnEUO7ZQR8CO</guid>
            <dc:creator>Andrew DeMaria</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improving your monitoring setup by integrating Cloudflare’s analytics data into Prometheus and Grafana]]></title>
            <link>https://blog.cloudflare.com/improving-your-monitoring-setup-by-integrating-cloudflares-analytics-data-into-prometheus-and-grafana/</link>
            <pubDate>Thu, 20 May 2021 13:00:15 GMT</pubDate>
            <description><![CDATA[ Here at Labyrinth Labs, we put great emphasis on monitoring. Having a working monitoring setup is a critical part of the work we do for our clients.
Improving your monitoring setup by integrating Cloudflare’s analytics data into Prometheus and Grafana ]]></description>
            <content:encoded><![CDATA[ <p><i>The following is a guest post by Martin Hauskrecht, DevOps Engineer at Labyrinth Labs.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1No9XvnguRN5bpcRgbSZBe/5e776b72c64573b51921171da88be6c1/image1-7.png" />
            
            </figure><p>Here at Labyrinth Labs, we put great emphasis on monitoring. Having a working monitoring setup is a critical part of the work we do for our clients.</p><p>Cloudflare's Analytics dashboard provides a lot of useful information for debugging and analytics purposes for our customer Pixel Federation. However, it doesn’t automatically integrate with existing monitoring tools such as Grafana and Prometheus, which our DevOps engineers use every day to monitor our infrastructure.</p><p>Cloudflare provides a Logs API, but the amount of logs we’d need to analyze is so vast, it would be simply inefficient and too pricey to do so. Luckily, Cloudflare already does the hard work of aggregating our thousands of events per second and exposes them in an <a href="https://developers.cloudflare.com/analytics/graphql-api/">easy-to-use API</a>.</p><p>Having Cloudflare’s data from our zones integrated with other systems’ metrics would give us a better understanding of our systems and the ability to correlate metrics and create more useful alerts, making our Day-2 operations (e.g. debugging incidents or analyzing the usage of our systems) more efficient.</p><p>Since our monitoring stack is primarily based on Prometheus and Grafana, we decided to implement our own Prometheus exporter that pulls data from Cloudflare’s GraphQL Analytics API.</p>
    <div>
      <h3>Design</h3>
      <a href="#design">
        
      </a>
    </div>
    <p>Based on current cloud trends and our intention to use the exporter in Kubernetes, writing the code in Go was the obvious choice. Cloudflare provides an <a href="https://github.com/cloudflare/cloudflare-go">API SDK for Golang</a>, which made the common API tasks easy to start with.</p><p>We take advantage of Cloudflare’s GraphQL API to obtain analytics data about each of our zones and transform it into Prometheus metrics that are then exposed on a metrics endpoint.</p><p>We are able to obtain data about the total number and rate of requests, bandwidth, cache utilization, threats, SSL usage, and HTTP response codes. In addition, we are also able to monitor what type of content is being transmitted and what countries and locations the requests originate from.</p><p>All of this information is provided through the <i>httpRequests1mGroups</i> node in Cloudflare’s GraphQL API. If you want to see what datasets are available, you can find a brief description at <a href="https://developers.cloudflare.com/analytics/graphql-api/features/data-sets">https://developers.cloudflare.com/analytics/graphql-api/features/data-sets</a>.</p><p>On top of all of these, we can also obtain data for Cloudflare’s data centers. Our graphs can easily show the distribution of traffic among them, further helping in our evaluations. This data is obtained from the <code><i>httpRequestsAdaptiveGroups</i></code> node in GraphQL.</p><p>After running the queries against the GraphQL API, we simply format the results to follow the Prometheus metrics format and expose them on the /metrics endpoint. To make things faster, we use goroutines and make the requests in parallel.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1d6eIT9Tv7F4p5iLP4bYQR/1fc81995d3304539456ace57170af022/image4-9.png" />
            
            </figure>
    <div>
      <h3>Deployment</h3>
      <a href="#deployment">
        
      </a>
    </div>
    <p>Our primary intention was to use the exporter in Kubernetes. Therefore, it comes with a <a href="https://hub.docker.com/repository/docker/lablabs/cloudflare_exporter">Docker image</a> and <a href="https://github.com/lablabs/cloudflare-exporter/tree/master/charts/cloudflare-exporter">Helm chart</a> to make deployments easier. You might need to adjust the Service annotations to match your Prometheus configuration.</p><p>The exporter itself exposes the gathered metrics on the /metrics endpoint, so setting the Prometheus annotations on either the pod or a Kubernetes Service will do the job.</p>
            <pre><code>apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"</code></pre>
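            <p>These annotations are only a convention, however: the Prometheus server has to be configured to act on them. A typical annotation-based discovery setup looks roughly like the following sketch (the common relabeling pattern, not taken from the exporter’s docs):</p>
            <pre><code>scrape_configs:
  - job_name: kubernetes-services
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only targets whose Service carries prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Respect a custom metrics path set via prometheus.io/path.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)</code></pre>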
            <p>We plan on adding a Prometheus ServiceMonitor to the Helm chart to make scraping the exporter even easier for those who use the Prometheus operator in Kubernetes; a sketch of what that could look like follows the configuration example below.</p><p>The configuration is quite easy: you just provide your API email and key. Optionally, you can limit the scraping to selected zones only. Refer to our docs in the <a href="https://github.com/lablabs/cloudflare-exporter">GitHub repo</a> or see the example below.</p>
            <pre><code> env:
   - name: CF_API_EMAIL
     value: &lt;YOUR_API_EMAIL&gt;
   - name: CF_API_KEY
     value: &lt;YOUR_API_KEY&gt;

  # Optionally, you can filter zones by adding IDs following the example below.
  # - name: ZONE_XYZ
  #   value: &lt;zone_id&gt;</code></pre>
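            <p>Until the chart ships one, a ServiceMonitor for this exporter could look something like the following hypothetical sketch (the label selector and port name are assumptions that depend on how the chart names its Service):</p>
            <pre><code>apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cloudflare-exporter
spec:
  selector:
    matchLabels:
      app: cloudflare-exporter   # assumed Service label
  endpoints:
    - port: http                 # assumed port name on the Service
      path: /metrics
      interval: 60s</code></pre>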
            <p>To deploy the exporter with Helm, you simply need to run:</p>
            <pre><code>helm repo add lablabs-cloudflare-exporter https://lablabs.github.io/cloudflare-exporter
helm repo update

helm install cloudflare-exporter lablabs-cloudflare-exporter/cloudflare-exporter \
--set env[0].name=CF_API_EMAIL,env[0].value=&lt;API_EMAIL&gt; \
--set env[1].name=CF_API_KEY,env[1].value=&lt;API_KEY&gt;</code></pre>
            <p>We also provide a <a href="https://github.com/lablabs/cloudflare-exporter/blob/master/examples/helmfile/cloudflare-exporter.yaml">Helmfile</a> in our repo to make deployments easier; you just need to add your credentials to make it work.</p>
    <div>
      <h3>Visualizing the data</h3>
      <a href="#visualizing-the-data">
        
      </a>
    </div>
    <p>I’ve already explained how the exporter works and how you can get it running. As I mentioned before, we use Grafana to visualize our metrics from Prometheus. We’ve created a <a href="https://grafana.com/grafana/dashboards/13133">dashboard</a> that takes the data from Prometheus and puts it to use.</p><p>The dashboard is divided into several rows, which group individual panels for easier navigation. It allows you to target individual zones for metrics visualization.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4zGurfsIFldJAl7JWn1Ugm/f7e1173d0d3ccadb91de95d49902b6fc/image2-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Wui3JqzgAU1GAU5J7XXj3/b5fa5aab63702997b97d7be3cc5ed7f0/image3-6.png" />
            
            </figure><p>To make things even more beneficial for the operations team, you can use the gathered metrics to create alerts. These can be created either in Grafana directly or using Prometheus alert rules.</p><p>Furthermore, if you integrate <a href="https://github.com/thanos-io/thanos">Thanos</a> or <a href="https://grafana.com/oss/cortex/">Cortex</a> into your monitoring setup, you can store these metrics indefinitely.</p>
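            <p>For example, a Prometheus alerting rule built on the exporter’s data might look like the sketch below. The metric names here are illustrative placeholders; check the exporter’s README for the exact names it exposes:</p>
            <pre><code># Sketch: alert when a zone serves an elevated share of 5xx responses.
- alert: CloudflareZoneHighErrorRate
  expr: |
    sum by (zone) (rate(cloudflare_zone_requests_status{status=~"5.."}[5m]))
      /
    sum by (zone) (rate(cloudflare_zone_requests_total[5m])) &gt; 0.05
  for: 10m
  labels:
    severity: warning</code></pre>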
    <div>
      <h3>Future work</h3>
      <a href="#future-work">
        
      </a>
    </div>
    <p>We’d like to integrate even more analytics data into our exporter, eventually covering every metric that Cloudflare’s GraphQL API can provide. We plan on creating new metrics for firewall analytics, DoS analytics, and Network analytics soon.</p><p>Feel free to create a GitHub issue if you have any questions, problems, or suggestions. Any pull request is greatly appreciated.</p>
    <div>
      <h3>About us</h3>
      <a href="#about-us">
        
      </a>
    </div>
    <p><a href="https://lablabs.io/">Labyrinth Labs</a> helps companies build, run, deploy and scale software and infrastructure by embracing the right technologies and principles.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Customers]]></category>
            <category><![CDATA[Prometheus]]></category>
            <category><![CDATA[Grafana]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">4IQoisV5GHmCHWVLWy6LNK</guid>
            <dc:creator>Martin Hauskrecht</dc:creator>
        </item>
        <item>
            <title><![CDATA[Helping sites get back online: the origin monitoring intern project]]></title>
            <link>https://blog.cloudflare.com/helping-sites-get-back-online-the-origin-monitoring-intern-project/</link>
            <pubDate>Mon, 13 Apr 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Over the course of ten weeks, our team of three interns (two engineering, one product management) went from a problem statement to a new feature, which is still working in production for all Cloudflare customers. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The most impactful internship experiences involve building something meaningful from scratch and learning along the way. Those can be tough goals to accomplish during a short summer internship, but our experience with Cloudflare’s 2019 intern program met both of them and more! Over the course of ten weeks, our team of three interns (two engineering, one product management) went from a problem statement to a new feature, which is still working in production for all Cloudflare customers.</p>
    <div>
      <h2>The project</h2>
      <a href="#the-project">
        
      </a>
    </div>
    <p>Cloudflare sits between customers’ origin servers and end users. This means that all traffic to the origin server runs through Cloudflare, so we know when something goes wrong with a server and sometimes reflect that status back to users. For example, if an origin is refusing connections and there’s no cached version of the site available, Cloudflare will display a <a href="https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#521error">521 error</a>. If customers don’t have monitoring systems configured to detect and notify them when failures like this occur, their websites may go down silently, and they may hear about the issue for the first time from angry users.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2kbCMvc5jyJkluEFbAIT3b/0134bb038cbb7958001da92f67435a66/image4-11.png" />
            
            </figure><p>When a customer’s origin server is unreachable, Cloudflare sends a 5xx error back to the visitor.</p><p>This problem became the starting point for our summer internship project: since Cloudflare knows when customers' origins are down, let’s send them a notification when it happens so they can take action to get their sites back online and reduce the impact to their users! This work became Cloudflare’s <a href="https://support.cloudflare.com/hc/en-us/articles/360037465932-Preventing-site-downtime#5Y1o5Sk9v44PGWnjnfvgh">passive origin monitoring</a> feature, which is currently available on all Cloudflare plans.</p><p>Over the course of our internship, we ran into lots of interesting technical and product problems, like:</p>
    <div>
      <h3>Making big data small</h3>
      <a href="#making-big-data-small">
        
      </a>
    </div>
    <p>Working with data from all requests going through Cloudflare’s 26 million+ Internet properties to look for unreachable origins is unrealistic from a data volume and performance perspective. Figuring out what datasets were available to analyze for the errors we were looking for, and how to adapt our whiteboarded algorithm ideas to use this data, was a challenge in itself.</p>
    <div>
      <h3>Ensuring high alert quality</h3>
      <a href="#ensuring-high-alert-quality">
        
      </a>
    </div>
    <p>Because only a fraction of requests show up in the sampled timing and error dataset we chose to use, false positives/negatives were disproportionately likely to occur for low-traffic sites. These are the sites that are least likely to have sophisticated monitoring systems in place (and therefore are most in need of this feature!). In order to make the notifications as accurate and actionable as possible, we analyzed patterns of failed requests throughout different types of Cloudflare Internet properties. We used this data to determine thresholds that would maximize the number of true positive notifications, while making sure they weren’t so sensitive that we end up spamming customers with emails about sporadic failures.</p>
    <div>
      <h3>Designing actionable notifications</h3>
      <a href="#designing-actionable-notifications">
        
      </a>
    </div>
    <p>Cloudflare has lots of different kinds of customers, from people running personal blogs with interest in DDoS mitigation to large enterprise companies with extremely sophisticated monitoring systems and global teams dedicated to incident response. We wanted to make sure that our notifications were understandable and actionable for people with varying technical backgrounds, so we enabled the feature for small samples of customers and tested many variations of the “origin monitoring email”. Customers responded right back to our notification emails, sent in support questions, and posted on our community forums. These were all great sources of feedback that helped us improve the message’s clarity and actionability.</p><p>We frontloaded our internship with lots of research (both digging into request data to understand patterns in origin unreachability problems and talking to customers/poring over support tickets about origin unreachability) and then spent the next few weeks iterating. We enabled passive origin monitoring for all customers with some time remaining before the end of our internships, so we could spend time improving the supportability of our product, documenting our design decisions, and working with the team that would be taking ownership of the project.</p><p>We were also able to develop some smaller internal capabilities that built on the work we’d done for the customer-facing feature, like notifications on origin outage events for larger sites to help our account teams provide proactive support to customers. It was super rewarding to see our work in production, helping Cloudflare users get their sites back online faster after receiving origin monitoring notifications.</p>
    <div>
      <h2>Our internship experience</h2>
      <a href="#our-internship-experience">
        
      </a>
    </div>
    <p>The Cloudflare internship program was a whirlwind ten weeks, with each day presenting new challenges and learnings! Some factors that led to our productive and memorable summer included:</p>
    <div>
      <h3>A well-scoped project</h3>
      <a href="#a-well-scoped-project">
        
      </a>
    </div>
    <p>It can be tough to find a project that’s meaningful enough to make an impact but still doable within the short time period available for summer internships. We’re grateful to our managers and mentors for identifying an interesting problem that was the perfect size for us to work on, and for keeping us on the rails if the technical or product scope started to creep beyond what would be realistic for the time we had left.</p>
    <div>
      <h3>Working as a team of interns</h3>
      <a href="#working-as-a-team-of-interns">
        
      </a>
    </div>
    <p>The immediate team working on the origin monitoring project consisted of three interns: Annika in product management and Ilya and Zhengyao in engineering. Having a dedicated team with similar goals and perspectives on the project helped us stay focused and work together naturally.</p>
    <div>
      <h3>Quick, agile cycles</h3>
      <a href="#quick-agile-cycles">
        
      </a>
    </div>
    <p>Since our project faced strict time constraints and our team was distributed across two offices (Champaign and San Francisco), it was critical for us to communicate frequently and work in short, iterative sprints. Daily standups, weekly planning meetings, and frequent feedback from customers and internal stakeholders helped us stay on track.</p>
    <div>
      <h3>Great mentorship &amp; lots of freedom</h3>
      <a href="#great-mentorship-lots-of-freedom">
        
      </a>
    </div>
    <p>Our managers challenged us, but also gave us room to explore our ideas and develop our own work process. Their trust encouraged us to set ambitious goals for ourselves and enabled us to accomplish way more than we may have under strict process requirements.</p>
    <div>
      <h2>After the internship</h2>
      <a href="#after-the-internship">
        
      </a>
    </div>
    <p>In the last week of our internships, the engineering interns, who were based in the Champaign, IL office, visited the San Francisco office to meet with the team that would be taking over the project when we left and to present our work to the company at our all hands meeting. The most exciting aspect of the visit: our presentation was preempted by Cloudflare’s co-founders announcing the company’s public S-1 filing at the all hands! :)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Fo3hKDoRLIKjjEmB6HHSf/30dc74250dd5588e9fb20d528c9187db/image5-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7q6XuSukWKdp5CmOEieU9/3762c15a4fb0438a54f14aefbf76c34c/image1-11.png" />
            
            </figure><p>Over the next few months, Cloudflare added a notifications page for easy configurability and <a href="/new-tools-to-monitor-your-server-and-avoid-downtime/">announced</a> the availability of passive origin monitoring along with some other tools to help customers monitor their servers and avoid downtime.</p><p>Ilya is working for Cloudflare part-time during the school semester and heading back for another internship this summer, and Annika is joining the team full-time after graduation this May. We’re excited to keep working on tools that help make the Internet a better place!</p><p>Also, Cloudflare is <a href="/cloudflare-doubling-size-of-2020-summer-intern-class/">doubling the size of the 2020 intern class</a>—if you or someone you know is interested in an internship like this one, check out the <a href="https://boards.greenhouse.io/cloudflare/jobs/2156436?gh_jid=2156436&amp;gh_src=d193c1b71us">open positions</a> in software engineering, security engineering, and product management.</p> ]]></content:encoded>
            <category><![CDATA[Internship Experience]]></category>
            <category><![CDATA[Monitoring]]></category>
            <category><![CDATA[Life at Cloudflare]]></category>
            <guid isPermaLink="false">4LahDyjkddXNdKsN2e2q9r</guid>
            <dc:creator>Annika Garbers</dc:creator>
            <dc:creator>Ilya Andreev</dc:creator>
            <dc:creator>Zhengyao Lin</dc:creator>
        </item>
        <item>
            <title><![CDATA[CloudFlare’s Newest App Partner: Verelo]]></title>
            <link>https://blog.cloudflare.com/cloudflares-newest-app-partner-verelo/</link>
            <pubDate>Tue, 31 Jul 2012 13:58:00 GMT</pubDate>
            <description><![CDATA[ A group of CloudFlare customers has recently been testing Verelo, our latest app integration, during a silent launch. Deployment has been very successful, with over 100 CloudFlare customers now using Verelo. ]]></description>
            <content:encoded><![CDATA[ <p><i>EDIT: Verelo was a previous app partner, but is not currently participating in the Cloudflare Apps program. The link to the app has been removed from the end of the post.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71EsY6hbQJyDgMji64lUxW/2c271d2485dc15a3540915a64eaf43ad/verelo-200.png.scaled500.png" />
            
            </figure><p><i>Verelo is a website monitoring service that provides sub-minute checks, downtime alerts by SMS, phone or email, and malware detection.</i></p><p>A group of CloudFlare customers has recently been testing Verelo, our latest app integration, during a silent launch. Deployment has been very successful, and with over 100 CloudFlare customers using Verelo, we have decided it's time to publicly announce them as our newest app partner.</p><p>Verelo is a unique company. Started by two individuals (Mike and Andrew) who previously worked in the hosting space, Verelo provides webmasters constant monitoring of their sites. Because of their background, Mike and Andrew were able to identify a need that hosting providers weren't able to meet for their customers, and so they launched Verelo as an answer to the site and server monitoring need.</p>
    <div>
      <h3>Features</h3>
      <a href="#features">
        
      </a>
    </div>
    <p>Verelo offers prevention, detection, response and reporting for malicious activity on any website, making running a site that much easier.</p><p>Features include:</p><ul><li><p>Downtime alerts by SMS, Phone or Email</p></li><li><p>Malware detection</p></li><li><p>Response time and uptime graphs</p></li><li><p>Sub-minute monitoring</p></li><li><p>911 conference call system</p></li></ul>
    <div>
      <h3>Plans</h3>
      <a href="#plans">
        
      </a>
    </div>
    <p>Starting at free, Verelo offers a variety of plans to suit your needs:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ATXZeGMWJyp4m9QJLdWz7/5dfeec57d87f5fd59fbfea5024c813b4/plans.png.scaled500.png" />
            
            </figure>
    <div>
      <h3>Get started!</h3>
      <a href="#get-started">
        
      </a>
    </div>
    <p>Verelo is now available to all CloudFlare customers. For more information on the app and how to get started, visit the app detail page today: [REMOVED]</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Apps]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">58REsEmfFzQeZ1PycD8ldq</guid>
            <dc:creator>Kristin Tarr</dc:creator>
        </item>
        <item>
            <title><![CDATA[App: Panopta Provides Advanced Server Monitoring and Outage Management Services]]></title>
            <link>https://blog.cloudflare.com/app-panopta-provides-advanced-server-monitori/</link>
            <pubDate>Wed, 25 Apr 2012 22:58:00 GMT</pubDate>
            <description><![CDATA[ CloudFlare customers have websites of all sizes, so we're pleased to introduced a new advanced server monitoring and outage management service from Panopta as a CloudFlare App.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>CloudFlare customers have websites of all sizes, so we're pleased to introduce a new advanced server monitoring and outage management service from <a href="https://www.cloudflare.com/apps/panopta">Panopta</a> as a CloudFlare App. Appealing to both enterprises and SMBs, Panopta will be the first to tell you if your infrastructure is down and provide you with tools to fix it.</p><p>What makes Panopta different? They offer three simple and unique capabilities that separate them from the rest of the market.</p><ul><li><p>Deep and Wide Monitoring</p></li><li><p>No False Alerts</p></li><li><p>Intelligent Alerting</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30hPk9AvARNuzoirE84QUa/fc0adf7daae45ece22a6ada8ef7ce41d/panopta.png.scaled500.png" />
            
            </figure><p>Deep and Wide Monitoring</p><hr /><p>Panopta gives you in-depth checks every 60 seconds from their global monitoring network, the Panopta Monitoring Agent and the Panopta Monitoring Appliance.</p><p>No False Alerts</p><hr /><p>Panopta guarantees no false alerts. No more chasing problems that aren't there.</p><p>Intelligent Alerting</p><hr /><p>Alerts are escalated to the right people at the right time, so you can rest assured that problems will be fixed as soon as possible.</p>
    <div>
      <h4>Getting Started</h4>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>Panopta offers four different plans:</p><ul><li><p>$15/month for Solo</p></li><li><p>$50/month for Basic</p></li><li><p>$100/month for Intermediate</p></li><li><p>$250/month for Advanced</p></li></ul><p>See all the details and sign up for Panopta's <a href="https://www.cloudflare.com/apps/panopta">advanced server monitoring service</a> now, via CloudFlare Apps.</p><p>Immediately, the first monitor will be set up for the home page of your site, with opportunity for detailed customization and additional monitors within the Panopta control panel. Try it now!</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Apps]]></category>
            <category><![CDATA[Monitoring]]></category>
            <category><![CDATA[Analytics]]></category>
            <guid isPermaLink="false">6kVjaTWxw0oIGUqhvied75</guid>
            <dc:creator>Kristin Tarr</dc:creator>
        </item>
        <item>
            <title><![CDATA[App #16 - CodeGuard Website Backup Service]]></title>
            <link>https://blog.cloudflare.com/app-16-codeguard-website-backup-service/</link>
            <pubDate>Wed, 29 Jun 2011 03:13:00 GMT</pubDate>
            <description><![CDATA[ CodeGuard takes a compelling idea -- automatic website backup -- and makes it happen without further thought. That peace of mind keeps site owners sleeping easily, and it's just the most compelling of the service's offerings.  ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>A Time Machine for Your Website</h3>
      <a href="#a-time-machine-for-your-website">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/apps/codeguard">CodeGuard</a> takes a compelling idea -- automatic website backup -- and makes it happen without further thought. That peace of mind keeps site owners sleeping easily, and it's just the most compelling of the service's offerings. We're thrilled to add this <a href="https://www.cloudflare.com/apps">CloudFlare App</a> to cover another angle of smooth site operation.</p><p>CodeGuard takes backups only when the code on your site changes, and gives you the chance to scroll through old copies of site code. In conventional "Backup Mode," CodeGuard can capture your existing live site on an hourly, daily, or monthly basis, storing each sweep as a different version. They call this "<b>Undo Power</b>": helping you recover from lost data or untoward changes.</p>
    <div>
      <h3>Special Offering for CloudFlare Customers</h3>
      <a href="#special-offering-for-cloudflare-customers">
        
      </a>
    </div>
    <p>CodeGuard created a special <b>free plan for CloudFlare customers: Basic Plus</b>. Without spending a cent, you may back up a single site up to 1GB in size. For additional sites, or for larger sites, this valuable service costs only $10/month. Thank you, CodeGuard!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/GxUswRim90Tqr061Mmy4e/9143ced6cedc550f7ad328ec09365762/codeguard-image.png.scaled500.png" />
            
            </figure>
    <div>
      <h3>How to Start Using CodeGuard</h3>
      <a href="#how-to-start-using-codeguard">
        
      </a>
    </div>
    <p>In your CloudFlare dashboard, <a href="https://www.cloudflare.com/apps/codeguard">CodeGuard</a> requires just a single click to create an account. For your first site, that's it: follow the instructions in your welcome email, and have your FTP instructions handy.</p><p>Go get your <a href="https://www.cloudflare.com/apps/codeguard">free website backup</a> started now, and relax.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Apps]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">2MQHKhUhX4TedG2tJfe8cc</guid>
            <dc:creator>John Roberts</dc:creator>
        </item>
        <item>
            <title><![CDATA[App #15 - GlobalSign HackAlert Malware Detection]]></title>
            <link>https://blog.cloudflare.com/app-15-globalsign-hackalert-malware-detection/</link>
            <pubDate>Tue, 28 Jun 2011 03:38:00 GMT</pubDate>
            <description><![CDATA[ Your website security comes in many layers, against many threats. But how do you know if your site has a problem? Once you know, how do you solve the problem, to protect your customers and your reputation?  ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>Cloud-based detection</h3>
      <a href="#cloud-based-detection">
        
      </a>
    </div>
    <p>Your website security comes in many layers, against many threats. But how do you know if your site has a problem? Once you know, how do you solve the problem, to protect your customers and your reputation? Today's <a href="https://www.cloudflare.com/apps">CloudFlare App</a> answers those questions.</p><p><a href="https://www.cloudflare.com/apps/hackalert">GlobalSign HackAlert</a> is a website malware detection and monitoring solution that immediately emails you if your website is infected. The service scans your site, checking the content for signs of compromise. If found, you get instant notification with steps to resolution. This detection -- and details about how to fix the problem -- helps you avoid scary browser or search engine warnings to your site's visitors, beyond the possibility of infecting your customers with a drive-by download.</p>
    <div>
      <h3>Yes, That's All You Have To Do</h3>
      <a href="#yes-thats-all-you-have-to-do">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5exfvc6nnwjR2tEvMr6qiC/a9b97ba7788c3b28f6e853c16fdb26ed/cloud-based.gif.scaled500.gif" />
            
            </figure><p>One elegant benefit of GlobalSign HackAlert is zero configuration required. <a href="https://www.cloudflare.com/apps/hackalert">Sign up</a> for the service via CloudFlare Apps, and the monitoring kicks in. It's automatic, with no extra dashboard required. Silently scanning your site, GlobalSign HackAlert doesn't trouble you unless there's a problem. Then, the service immediately emails you with notification of the discovered malware, including clear snippets of the infected code to make identification and removal a simple matter.</p><p>With relevance and simplicity, GlobalSign HackAlert makes sure your protection isn't noisy, just effective. Try the <a href="https://www.cloudflare.com/apps/hackalert">HackAlert Basic Plan</a> today, covering up to 100 pages of your site for just $5/month/site, and rest assured.</p> ]]></content:encoded>
            <category><![CDATA[Malware]]></category>
            <category><![CDATA[Cloudflare Apps]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">7CEmQpavmkFi9p8jsVrox0</guid>
            <dc:creator>John Roberts</dc:creator>
        </item>
        <item>
            <title><![CDATA[App a Day #9 - Monitis for Server Monitoring]]></title>
            <link>https://blog.cloudflare.com/app-a-day-9-monitis-for-server-monitoring/</link>
            <pubDate>Mon, 13 Jun 2011 04:10:00 GMT</pubDate>
            <description><![CDATA[ Our first new CloudFlare App this week is Monitis, a server monitoring solution. Monitis allows you to monitor websites, of course. But Monitis's cloud-based, all-in-one monitoring service recognizes that the internet is far more than the web alone. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>Server Monitoring</h3>
      <a href="#server-monitoring">
        
      </a>
    </div>
    <p>Our first new <a href="https://www.cloudflare.com/apps">CloudFlare App</a> this week is <a href="https://www.cloudflare.com/apps/monitis">Monitis</a>, a server monitoring solution. Monitis allows you to monitor websites, of course. But Monitis's cloud-based, all-in-one monitoring service recognizes that the internet is far more than the web alone.</p><p>With Monitis, you can monitor server CPU, applications, databases, memory, processes, end-user experience, uptime, page load, transactions, and lots more. All of this is available via an intuitive dashboard, but these monitors may also be extended via API and command line tools.</p>
    <div>
      <h3>Special Offering for CloudFlare Customers</h3>
      <a href="#special-offering-for-cloudflare-customers">
        
      </a>
    </div>
    <p>Monitis is offering CloudFlare customers their Basic Plan for $8.99/month, which includes 8 external monitors at a 20-minute frequency. That's a <b>10% discount</b> for CloudFlare customers. Thank you, Monitis!</p>
    <div>
      <h3>How to Start Using Monitis</h3>
      <a href="#how-to-start-using-monitis">
        
      </a>
    </div>
    <p>Like all CloudFlare Apps, <a href="https://www.cloudflare.com/apps/monitis">Monitis</a> is simple to set up. Go to your CloudFlare Apps dashboard, and click the green Setup button. You'll be asked to confirm the subscription, entering billing information if your CloudFlare account doesn't already have it, and then your Monitis account will be automatically created, with a link to the Monitis dashboard.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Apps]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">4NKrnEMjm4Zos1zQnYK0sS</guid>
            <dc:creator>John Roberts</dc:creator>
        </item>
        <item>
            <title><![CDATA[App a Day #5 - Pingdom free account]]></title>
            <link>https://blog.cloudflare.com/app-a-day-5-pingdom-free-trial/</link>
            <pubDate>Tue, 07 Jun 2011 00:36:00 GMT</pubDate>
            <description><![CDATA[ You use CloudFlare because you want high performance, reliability, and security. Website owners deserve nothing less. Because we believe in transparency, we provide insight into our system status. Still, it's important to have third-party verification, so we encourage the use of outside monitors. ]]></description>
            <content:encoded><![CDATA[ <p><i>EDIT: Pingdom was a previous app partner, but is not currently participating in the Cloudflare Apps program. The link to the app has been removed from the end of the post.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DyzWakjGWewirH6TEzBth/e5c42391d4290acbd33254345f444604/pingdom-200.png.scaled500.png" />
            
            </figure>
    <div>
      <h3>Be The First To Know If Your Website Goes Down</h3>
      <a href="#be-the-first-to-know-if-your-website-goes-down">
        
      </a>
    </div>
    <p>You use CloudFlare because you want high performance, reliability, and security. Website owners deserve nothing less. Because we believe in transparency, we provide insight into our <a href="http://www.cloudflare.com/system-status.html">system status</a>. Still, it's important to have third-party verification, so we encourage the use of outside monitors.</p><p>Today's App can help. Pingdom is a leader in uptime monitoring. From your CloudFlare dashboard, you can create a Pingdom account with free monitoring for a single domain. For those with multiple domains or more complex requirements, the upgrade is easy.</p>
    <div>
      <h3>Pingdom Features</h3>
      <a href="#pingdom-features">
        
      </a>
    </div>
    <p>There's more detail available, but the core benefits of Pingdom include:</p><ul><li><p>Alerts on downtime</p></li><li><p>Performance and uptime stats</p></li><li><p>Helps you troubleshoot downtime</p></li></ul>
    <div>
      <h3>How to Create Your Free Pingdom Account</h3>
      <a href="#how-to-create-your-free-pingdom-account">
        
      </a>
    </div>
    <p>Getting started with Pingdom is simple. Visit the Pingdom [link removed] page in CloudFlare Apps, continue to the CloudFlare Apps page for your preferred domain, and toggle the switch ON... your free account will be created instantly, with a link to configure your settings.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Apps]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">6QeptnqmS6l5k4AGV5Eh2p</guid>
            <dc:creator>John Roberts</dc:creator>
        </item>
    </channel>
</rss>