Subscribe to receive notifications of new posts:

New tools to monitor your server and avoid downtime

2019-12-11

4 min read
This post is also available in Français, Español and Deutsch.

When your server goes down, it’s a big problem. Today, Cloudflare is introducing two new tools to help you understand and respond faster to origin downtime — plus, a new service to automatically avoid downtime.

The new features are:

  • Standalone Health Checks, which notify you as soon as we detect problems at your origin server, without needing a Cloudflare Load Balancer.

  • Passive Origin Monitoring, which lets you know when your origin cannot be reached, with no configuration required.

  • Zero-Downtime Failover, which can automatically avert failures by retrying requests to origin.

Standalone Health Checks

Our first new tool is Standalone Health Checks, which will notify you as soon as we detect problems at your origin server -- without needing a Cloudflare Load Balancer.

A Health Check is a service that runs on our edge network to monitor whether your origin server is online. Health Checks are a key part of our load balancing service because they allow us to quickly and actively route traffic to origin servers that are live and ready to serve requests. Standalone Health Checks allow you to monitor the health of your origin even if you only have one origin or do not yet need to balance traffic across your infrastructure.

We’ve provided many dimensions for you to hone in on exactly what you’d like to check, including response code, protocol type, and interval. You can specify a particular path if your origin serves multiple applications, or you can check a larger subset of response codes for your staging environment. All of these options allow you to properly target your Health Check, giving you a precise picture of what is wrong with your origin.

If one of your origin servers becomes unavailable, you will receive a notification letting you know of the health change, along with detailed information about the failure so you can take action to restore your origin’s health.  

Lastly, once you’ve set up your Health Checks across the different origin servers, you may want to see trends or the top unhealthy origins. With Health Check Analytics, you’ll be able to view all the change events for a given health check, isolate origins that may be top offenders or not performing at par, and move forward with a fix. On top of this, in the near future, we are working to provide you with access to all Health Check raw events to ensure you have the detailed lens to compare Cloudflare Health Check Event logs against internal server logs.

Users on the Pro, Business, or Enterprise plan will have access to Standalone Health Checks and Health Check Analytics to promote top-tier application reliability and help maximize brand trust with their customers. You can access Standalone Health Checks and Health Check Analytics through the Traffic app in the dashboard.

Passive Origin Monitoring

Standalone Health Checks are a super flexible way to understand what’s happening at your origin server. However, they require some forethought to configure before an outage happens. That’s why we’re excited to introduce Passive Origin Monitoring, which will automatically notify you when a problem occurs -- no configuration required.

Cloudflare knows when your origin is down, because we’re the ones trying to reach it to serve traffic! When we detect downtime lasting longer than a few minutes, we’ll send you an email.

Starting today, you can configure origin monitoring alerts to go to multiple email addresses. Origin Monitoring alerts are available in the new Notification Center (more on that below!) in the Cloudflare dashboard:

Passive Origin Monitoring is available to customers on all Cloudflare plans.

Zero-Downtime Failover

What’s better than getting notified about downtime? Never having downtime in the first place! With Zero-Downtime Failover, we can automatically retry requests to origin, even before Load Balancing kicks in.

How does it work? If a request to your origin fails, and Cloudflare has another record for your origin server, we’ll just try another origin within the same HTTP request. The alternate record could be either an A/AAAA record configured via Cloudflare DNS, or another origin server in the same Load Balancing pool.

Consider an website, example.com, that has web servers at two different IP addresses: 203.0.113.1 and 203.0.113.2. Before Zero-Downtime Failover, if 203.0.113.1 becomes unavailable, Cloudflare would attempt to connect, fail, and ultimately serve an error page to the user. With Zero-Downtime Failover, if 203.0.113.1 cannot be reached, then Cloudflare’s proxy will seamlessly attempt to connect to 203.0.113.2. If the second server can respond, then Cloudflare can avert serving an error to example.com’s user.

Since we rolled Zero-Downtime Failover a few weeks ago, we’ve prevented tens of millions of requests per day from failing!

Zero-Downtime Failover works in conjunction with Load Balancing, Standalone Health Checks, and Passive Origin Monitoring to keep your website running without a hitch. Health Checks and Load Balancing can avert failure, but take time to kick in. Zero-Downtime failover works instantly, but adds latency on each connection attempt. In practice, Zero-Downtime Failover is helpful at the start of an event, when it can instantly recover from errors; once a Health Check has detected a problem, a Load Balancer can then kick in and properly re-route traffic. And if no origin is available, we’ll send an alert via Passive Origin Monitoring.

To see an example of this in practice, consider an incident from a recent customer. They saw a spike in errors at their origin that would ordinarily cause availability to plummet (red line), but thanks to Zero-Downtime failover, their actual availability stayed flat (blue line).

During a 30 minute time period, Zero-Downtime Failover improved overall availability from 99.53% to 99.98%, and prevented 140,000 HTTP requests from resulting in an error.

It’s important to note that we only attempt to retry requests that have failed during the TCP or TLS connection phase, which ensures that HTTP headers and payload have not been transmitted yet. Thanks to this safety mechanism, we're able to make Zero-Downtime Failover Cloudflare's default behavior for Pro, Business, and Enterprise plans. In other words, Zero-Downtime Failover makes connections to your origins more reliable with no configuration or action required.

Coming soon: more notifications, more flexibility

Our customers are always asking us for more insights into the health of their critical edge infrastructure. Health Checks and Passive Origin monitoring are a significant step towards Cloudflare taking a proactive instead of reactive approach to insights.

To support this work, today we’re announcing the Notification Center as the central place to manage notifications. This is available in the dashboard today, accessible from your Account Home.

From here, you can create new notifications, as well as view any existing notifications you’ve already set up. Today’s release allows you to configure  Passive Origin Monitoring notifications, and set multiple email recipients.

We’re excited about today’s launches to helping our customers avoid downtime. Based on your feedback, we have lots of improvements planned that can help you get the timely insights you need:

  • New notification delivery mechanisms

  • More events that can trigger notifications

  • Advanced configuration options for Health Checks, including added protocols, threshold based notifications, and threshold based status changes.

  • More ways to configure Passive Health Checks, like the ability to add thresholds, and filter to specific status codes

Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.

Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.

To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.
InsightsAnalyticsProduct News

Follow on X

Jon Levine|@jplevine
Steven Pack|@steven_pack
Cloudflare|@cloudflare

Related posts

October 24, 2024 1:00 PM

Durable Objects aren't just durable, they're fast: a 10x speedup for Cloudflare Queues

Learn how we built Cloudflare Queues using our own Developer Platform and how it evolved to a geographically-distributed, horizontally-scalable architecture built on Durable Objects. Our new architecture supports over 10x more throughput and over 3x lower latency compared to the previous version....

October 08, 2024 1:00 PM

Cloudflare acquires Kivera to add simple, preventive cloud security to Cloudflare One

The acquisition and integration of Kivera broadens the scope of Cloudflare’s SASE platform beyond just apps, incorporating increased cloud security through proactive configuration management of cloud services. ...

September 27, 2024 1:00 PM

AI Everywhere with the WAF Rule Builder Assistant, Cloudflare Radar AI Insights, and updated AI bot protection

This year for Cloudflare’s birthday, we’ve extended our AI Assistant capabilities to help you build new WAF rules, added new AI bot & crawler traffic insights to Radar, and given customers new AI bot blocking capabilities...