
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Thu, 09 Apr 2026 20:58:25 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How Waiting Room makes queueing decisions on Cloudflare's highly distributed network]]></title>
            <link>https://blog.cloudflare.com/how-waiting-room-queues/</link>
            <pubDate>Wed, 20 Sep 2023 13:00:58 GMT</pubDate>
            <description><![CDATA[ We want to give you a behind the scenes look at how we have evolved the core mechanism of our product–namely, exactly how it kicks in to queue traffic in response to spikes ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/308ogSo9qOQp2ng3ve7spq/163e8201c4d426710ef2211237699c08/image3-7.png" />
            
            </figure><p>Almost three years ago, we <a href="/cloudflare-waiting-room/">launched Cloudflare Waiting Room</a> to protect our customers’ sites from overwhelming spikes in legitimate traffic that could bring down their sites. Waiting Room gives customers control over user experience even in times of high traffic by placing excess traffic in a customizable, on-brand waiting room, dynamically admitting users as spots become available on their sites. Since the launch of Waiting Room, we’ve continued to expand its functionality based on customer feedback with features like <a href="/waiting-room-random-queueing-and-custom-web-mobile-apps/">mobile app support</a>, <a href="/understand-the-impact-of-your-waiting-rooms-settings-with-waiting-room-analytics/">analytics</a>, <a href="/waiting-room-bypass-rules/">Waiting Room bypass rules</a>, and <a href="/tag/waiting-room/">more</a>.</p><p>We love announcing new features and solving problems for our customers by expanding the capabilities of Waiting Room. But, today, we want to give you a behind the scenes look at how we have evolved the core mechanism of our product–namely, exactly how it kicks in to queue traffic in response to spikes.</p>
    <div>
      <h2>How was the Waiting Room built, and what are the challenges?</h2>
      <a href="#how-was-the-waiting-room-built-and-what-are-the-challenges">
        
      </a>
    </div>
    <p>The diagram below shows a quick overview of where Waiting Room sits when a customer enables it for their website.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/e3jZQvH7ed5YAFbSYk5km/8866b3fe3f89ed1ac3f0b536d30b9d30/Waiting-Room-overview.png" />
            
            </figure><p>Waiting Room is built on <a href="https://workers.cloudflare.com/">Workers</a> running across a global network of Cloudflare data centers. Requests to a customer’s website can go to many different Cloudflare data centers. To minimize <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a> and enhance performance, each request is routed to the geographically closest data center. When a new user makes a request to the host/path covered by a waiting room, the Waiting Room worker decides whether to send the user to the origin or to the waiting room. It makes this decision using the waiting room state, which gives an idea of how many users are on the origin.</p><p>The waiting room state changes continuously based on traffic around the world. This information could be stored in a central location, or changes could be propagated around the world eventually. Storing it in a central location can add significant latency to each request, as the central location can be very far from where the request originates. So every data center works with its own waiting room state: a snapshot of the traffic pattern for the website around the world, as available at that point in time. Before letting a user into the website, we do not want to wait for information from everywhere else in the world, as that adds significant latency to the request. This is why we chose not to have a central location, but rather a pipeline through which changes in traffic propagate around the world eventually.</p><p>This pipeline, which aggregates the waiting room state in the background, is built on Cloudflare <a href="/introducing-workers-durable-objects/">Durable Objects</a>. 
In 2021, we wrote about <a href="/building-waiting-room-on-workers-and-durable-objects/">how the aggregation pipeline works</a> and the design decisions behind it, if you are interested. This pipeline ensures that every data center gets updated information about changes in traffic within a few seconds.</p><p>Waiting Room has to decide whether to send users to the website or queue them based on the state it currently sees. This has to be done at the right time, so that the customer's website does not get overloaded. We also have to make sure we do not queue too early, since that might mean queueing for a falsely suspected spike in traffic, and being in a queue could cause some users to abandon the website. Waiting Room runs on every server in <a href="https://www.cloudflare.com/network/">Cloudflare’s network</a>, which spans over 300 cities in more than 100 countries. For every new user, we want the decision between the website and the queue to be made with minimal latency. This is what makes the question of when to queue a hard one for Waiting Room. In this blog, we will cover how we approached that tradeoff. Our algorithm has evolved to decrease false positives while continuing to respect the customer’s set limits.</p>
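<p>Concretely, the per-request decision a worker makes can be sketched as follows. This is an illustrative TypeScript sketch, not Cloudflare's actual implementation; the state fields and function names are simplified assumptions.</p>

```typescript
// Illustrative sketch only -- not Cloudflare's actual code. It shows the
// shape of the per-request decision a worker makes from its local state
// snapshot, with simplified, assumed field names.
interface LocalWaitingRoomState {
  activeUsers: number;      // users on the origin, as last propagated here
  totalActiveUsers: number; // the customer-configured limit
}

// Decide using only locally available information, so no cross-region call
// happens while the request is in flight.
function decide(
  state: LocalWaitingRoomState,
  slotsUsedByThisWorker: number,
  slotsForThisWorker: number
): "origin" | "waiting-room" {
  if (state.activeUsers >= state.totalActiveUsers) return "waiting-room";
  return slotsUsedByThisWorker < slotsForThisWorker ? "origin" : "waiting-room";
}
```

<p>Because everything the function needs is already in the data center, the decision adds no cross-region round trip to the request.</p>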
    <div>
      <h2>How a waiting room decides when to queue users</h2>
      <a href="#how-a-waiting-room-decides-when-to-queue-users">
        
      </a>
    </div>
    <p>The most important factor that determines when your waiting room will start queuing is how you configured the traffic settings. There are two traffic limits that you will set when configuring a waiting room: <i>total active users</i> and <i>new users per minute</i>. The <i>total active users</i> is a target threshold for how many simultaneous users you want to allow on the pages covered by your waiting room. <i>New users per minute</i> defines the target threshold for the maximum rate of user influx to your website per minute. A sharp spike in either of these values might result in queuing. Another configuration that affects how we calculate the <i>total active users</i> is <i>session duration</i>. A user is considered active for <i>session duration</i> minutes after their most recent request to any page covered by the waiting room.</p><p>The graph below, from one of our internal monitoring tools, shows a customer's traffic pattern over two days. This customer has set their limits, <i>new users per minute</i> and <i>total active users</i>, to 200 and 200 respectively.</p>
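<p>As a rough sketch of how these three settings interact (the field names here are our own, illustrative choices, not the actual configuration schema):</p>

```typescript
// Hedged sketch: hypothetical field names for the two traffic limits and
// the session duration discussed above. Not the real configuration schema.
interface TrafficSettings {
  totalActiveUsers: number;       // target ceiling on simultaneous active users
  newUsersPerMinute: number;      // target ceiling on user influx per minute
  sessionDurationMinutes: number; // how long a user stays "active" after a request
}

// A spike in either measured value can trigger queueing.
function overLimits(cfg: TrafficSettings, activeUsers: number, newUsersThisMinute: number): boolean {
  return activeUsers >= cfg.totalActiveUsers || newUsersThisMinute >= cfg.newUsersPerMinute;
}

// A user counts toward totalActiveUsers while their most recent request was
// within the session duration.
function isActive(cfg: TrafficSettings, lastSeenMs: number, nowMs: number): boolean {
  return nowMs - lastSeenMs < cfg.sessionDurationMinutes * 60_000;
}
```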
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ldSHbmG07jQ9qRikiy9Yh/500ddc98a204322415592acdddefb91d/Screen-Shot-2023-09-11-at-10.30.21-AM.png" />
            
            </figure><p>If you look at their traffic, you can see that users were queued on September 11th around 11:45. At that point in time, the <i>total active users</i> was around 200. As the <i>total active users</i> ramped down (around 12:30), the queued users progressed to 0. The queueing started again on September 11th around 15:00, when <i>total active users</i> got to 200. The users that were queued around this time ensured that the traffic going to the website stayed around the limits set by the customer.</p><p>Once a user gets access to the website, we give them an encrypted <a href="https://www.cloudflare.com/learning/privacy/what-are-cookies/">cookie</a> which indicates they have already gained access. The contents of the cookie can look like this.</p>
            <pre><code>{  
  "bucketId": "Mon, 11 Sep 2023 11:45:00 GMT",
  "lastCheckInTime": "Mon, 11 Sep 2023 11:45:54 GMT",
  "acceptedAt": "Mon, 11 Sep 2023 11:45:54 GMT"
}</code></pre>
            <p>The cookie is like a ticket which indicates entry to the waiting room. The <i>bucketId</i> indicates which cluster of users this user is part of. The <i>acceptedAt</i> time and <i>lastCheckInTime</i> indicate when the last interaction with the workers was. Comparing this information with the <i>session duration</i> value that the customer set while configuring the waiting room tells us whether the ticket is still valid for entry. If the cookie is valid, we let the user through, which ensures that users who are on the website continue to be able to browse it. If the cookie is invalid, we create a new cookie, treating the user as a new user; if there is queueing happening on the website, they go to the back of the queue. In the next section, let us see how we decide when to queue those users.</p><p>To understand this further, let's see what the contents of the waiting room state are. For the customer we discussed above, at the time "Mon, 11 Sep 2023 11:45:54 GMT", the state could look like this.</p>
            <pre><code>{
  "activeUsers": 50
}</code></pre>
            <p>As mentioned above, the customer’s configuration has <i>new users per minute</i> and <i>total active users</i> equal to 200 and 200 respectively.</p><p>So the state indicates that there is space for new users: there are only 50 active users out of a possible 200, leaving room for another 150 users. Let's assume those 50 users came from two data centers: San Jose (20 users) and London (30 users). We also keep track of the number of workers that are active across the globe, as well as the number of workers active at the data center in which the state is calculated. The state key below could be the one calculated at San Jose.</p>
            <pre><code>{  
  "activeUsers": 50,
  "globalWorkersActive": 10,
  "dataCenterWorkersActive": 3,
  "trafficHistory": {
    "Mon, 11 Sep 2023 11:44:00 GMT": {
       San Jose: 20/200, // 10%
       London: 30/200, // 15%
       Anywhere: 150/200 // 75%
    }
  }
}</code></pre>
            <p>Imagine at the time "<code>Mon, 11 Sep 2023 11:45:54 GMT</code>", we get a request to that waiting room at a data center in San Jose.</p><p>To see if the user that reached San Jose can go to the origin, we first check the traffic history for the past minute to see the distribution of traffic at that time. This is because a lot of websites are popular in certain parts of the world, and for many of these websites the traffic tends to come from the same data centers.</p><p>Looking at the traffic history for the minute "<code>Mon, 11 Sep 2023 11:44:00 GMT</code>", we see San Jose had 20 out of 200 users (10%) at that time. For the current time, "<code>Mon, 11 Sep 2023 11:45:54 GMT</code>", we divide the slots available at the website in the same ratio as the traffic history from the past minute. So San Jose can send 10% of the 150 available slots, which is 15 users. We also know that there are three active workers, as "<code>dataCenterWorkersActive</code>" is <code>3</code>.</p><p>The number of slots available for the data center is divided evenly among the workers in the data center. So every worker in San Jose can send 15/3 users to the website. If the worker that received the traffic has not sent any users to the origin for the current minute, it can send up to <i>five</i> users (15/3).</p><p>At the same time ("<code>Mon, 11 Sep 2023 11:45:54 GMT</code>"), imagine a request goes to a data center in Delhi. The worker at the data center in Delhi checks the trafficHistory and sees that there are no slots allotted for it. For traffic like this, we have reserved the Anywhere slots, since we are still far away from the limit.</p>
            <pre><code>{  
  "activeUsers":50,
  "globalWorkersActive": 10,
  "dataCenterWorkersActive": 1,
  "trafficHistory": {
    "Mon, 11 Sep 2023 11:44:00 GMT": {
       San Jose: 20/200, // 10%
       London: 30/200, // 15%
       Anywhere: 150/200 // 75%
    }
  }
}</code></pre>
            <p>The <code>Anywhere</code> slots are divided among all the active workers in the globe as any worker around the world can take a part of this pie. 75% of the remaining 150 slots which is 113.</p><p>The state key also keeps track of the number of workers (<code>globalWorkersActive</code>) that have spawned around the world. The Anywhere slots allotted are divided among all the active workers in the world if available. <code>globalWorkersActive</code> is 10 when we look at the waiting room state. So every active worker can send as many as 113/10 which is approximately 11 users. So the first 11 users that come to a worker in the minute <code>Mon, 11 Sep 2023 11:45:00 GMT</code> gets admitted to the origin. The extra users get queued. The extra reserved slots (5) in San Jose for minute  <code>Mon, 11 Sep 2023 11:45:00 GMT</code> discussed before ensures that we can admit up to 16(5 + 11) users from a worker from San Jose to the website.</p>
    <div>
      <h2>Queuing at the worker level can cause users to get queued before the data center's slots are used up</h2>
      <a href="#queuing-at-the-worker-level-can-cause-users-to-get-queued-before-the-slots-available-for-the-data-center">
        
      </a>
    </div>
    <p>As we can see from the example above, we decide whether to queue or not at the worker level. The number of new users that go to workers around the world can be non-uniform. To understand what can happen when there is non-uniform distribution of traffic to two workers, let us look at the diagram below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2I3CYq6IbnjE7T1A4yEOC1/513183137efaa278314be172c9ba549d/Side-effect-of-dividing-slots-at-worker-level.png" />
            
            </figure><p>Imagine the slots available for a data center in San Jose are <i>ten</i>. There are two workers running in San Jose. <i>Seven</i> users go to worker1 and <i>one</i> user goes to worker2. In this situation, worker1 will let <i>five</i> of its <i>seven</i> users into the website, and <i>two</i> of them get queued, as worker1 only has <i>five</i> slots available. The <i>one</i> user that shows up at worker2 also gets to go to the origin. So we queue <i>two</i> users even though only <i>eight</i> users showed up, when in reality the San Jose data center could have sent <i>ten</i>.</p><p>This issue with dividing slots evenly among workers results in queueing before a waiting room’s configured traffic limits, typically within 20-30% of the limits set. This approach has advantages which we will discuss next. We have made changes to decrease the frequency with which queuing occurs outside that 20-30% range, queuing as close to limits as possible, while still ensuring Waiting Room is prepared to catch spikes. Later in this blog, we will cover how we achieved this by updating how we allocate and count slots.</p>
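<p>A tiny simulation makes the early queueing concrete (illustrative only):</p>

```typescript
// Tiny, illustrative simulation: a data center's slots are split evenly
// among its workers, and each worker queues whatever exceeds its own share,
// even when the data center as a whole still has room.
function queuedWithEvenSplit(dcSlots: number, workerCount: number, arrivalsPerWorker: number[]): number {
  const perWorker = Math.floor(dcSlots / workerCount); // 5 each in the example
  return arrivalsPerWorker.reduce((queued, n) => queued + Math.max(0, n - perWorker), 0);
}
```

<p>With ten slots, two workers, and the 7/1 arrival pattern above, two users get queued even though only eight of the ten slots were needed.</p>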
    <div>
      <h3>What is the advantage of workers making these decisions?</h3>
      <a href="#what-is-the-advantage-of-workers-making-these-decisions">
        
      </a>
    </div>
    <p>The example above talked about how workers in San Jose and Delhi make decisions to let users through to the origin. The advantage of making decisions at the worker level is that we can do so without adding any significant latency to the request. To make the decision, there is no need to leave the data center to get information about the waiting room, as we are always working with the state that is currently available in the data center. The queueing starts when the slots run out within the worker. The absence of added latency enables customers to keep the waiting room on all the time without worrying about extra latency for their users.</p><p>Waiting Room’s number one priority is to ensure that customers’ sites remain up and running at all times, even in the face of unexpected and overwhelming traffic surges. To that end, it is critical that a waiting room prioritizes staying near or below the traffic limits set by the customer for that room. When a spike happens at one data center, say San Jose, the local state from that data center will take a few seconds to reach Delhi.</p><p>Splitting the slots among workers ensures that working with slightly outdated data does not cause the overall limit to be exceeded by an impactful amount. For example, the <code>activeUsers</code> value seen in Delhi can still be 26 while it has already reached 100 at the data center where the spike is happening. At that point in time, sending extra users from Delhi may not overshoot the overall limit by much, as Delhi only has a part of the pie to start with. Therefore, queueing before overall limits are reached is part of the design, to make sure your overall limits are respected. In the next section, we will cover the approaches we implemented to queue as close to limits as possible without increasing the risk of exceeding traffic limits.</p>
    <div>
      <h2>Allocating more slots when traffic is low relative to waiting room limits</h2>
      <a href="#allocating-more-slots-when-traffic-is-low-relative-to-waiting-room-limits">
        
      </a>
    </div>
    <p>The first case we wanted to address was queuing that occurs when traffic is far from limits. While rare, and typically lasting only one refresh interval (20s) for the end users who are queued, this was our first priority when updating our queuing algorithm. To solve this, while allocating slots we looked at the utilization (how far you are from the traffic limits) and allotted more slots per worker when traffic was far away from the limits. The idea was to prevent the queueing that happens at lower utilization while still being able to readjust the slots available per worker when there are more users on the origin.</p><p>To understand this, let's revisit the example where there is a non-uniform distribution of traffic to two workers. Two workers similar to the ones we discussed before are shown below. In this case the utilization is low (10%), meaning we are far from the limits. So the slots allocated (8) are closer to the <code>slotsAvailable</code> for the San Jose data center, which is 10. As you can see in the diagram below, all eight users that go to either worker reach the website with this modified slot allocation, as we provide more slots per worker at lower utilization levels.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7zNp1o6DEAbGL4GSYjUkeJ/1bc9bdf44861eb0025a0fe49ce45ae2a/Division-of-slots-among-workers-at-lower-utilization.png" />
            
            </figure><p>The diagram below shows how the slots allocated per worker changes with utilization (how far you are away from limits). As you can see here, we are allocating more slots per worker at lower utilization. As the utilization increases, the slots allocated per worker decrease as it’s getting closer to the limits, and we are better prepared for spikes in traffic. At 10% utilization every worker gets close to the slots available for the data center. As the utilization is close to 100% it becomes close to the slots available divided by worker count in the data center.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2qy2IXaKqoER8zkDRL9mAZ/2de5767fa86bc57ccc1d428b26269cef/Alloting-more-slots-at-lower-limits.png" />
            
            </figure>
    <div>
      <h3>How do we achieve more slots at lower utilization?</h3>
      <a href="#how-do-we-achieve-more-slots-at-lower-utilization">
        
      </a>
    </div>
    <p>This section delves into the mathematics which helps us get there. If you are not interested in these details, meet us at the “Risk of over provisioning” section.</p><p>To understand this further, let's revisit the previous example where requests come to the Delhi data center. The <code>activeUsers</code> value is 50, so utilization is 50/200 which is around 25%.</p>
            <pre><code>{
  "activeUsers": 50,
  "globalWorkersActive": 10,
  "dataCenterWorkersActive": 1,
  "trafficHistory": {
    "Mon, 11 Sep 2023 11:44:00 GMT": {
       San Jose: 20/200, // 10%
       London: 30/200, // 15%
       Anywhere: 150/200 // 75%
    }
  }
}</code></pre>
            <p>The idea is to allocate more slots at lower utilization levels. This ensures that customers do not see unexpected queueing behaviors when traffic is far away from limits. At time <code>Mon, 11 Sep 2023 11:45:54 GMT</code> requests to Delhi are at 25% utilization based on the local state key.</p><p>To allocate more slots to be available at lower utilization we added a <code>workerMultiplier</code> which moves proportionally to the utilization. At lower utilization the multiplier is lower and at higher utilization it is close to one.</p>
            <pre><code>workerMultiplier = (utilization)^curveFactor
adaptedWorkerCount = actualWorkerCount * workerMultiplier</code></pre>
            <p><code>utilization</code> - how far away from the limits you are.</p><p><code>curveFactor</code> - an adjustable exponent that decides how aggressively we distribute extra budgets at lower worker counts. To understand this, let's look at how the graphs of y = x and y = x^2 look between the values 0 and 1.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4wgTahQXjs2Heu9fW2jOgG/56f26b2e0bcd048dd4a2a7b28bdaef38/Graph-for-y-x-curveFactor.png" />
            
            </figure><p>The graph for y = x is a straight line passing through (0, 0) and (1, 1).</p><p>The graph for <code>y = x^2</code> is a curved line where y increases slower than <code>x</code> when <code>x &lt; 1</code>, and it also passes through (0, 0) and (1, 1).</p><p>Using how these curves behave, we derived the formula for the <code>workerMultiplier</code>, where <code>y = workerMultiplier</code>, <code>x = utilization</code>, and <code>curveFactor</code> is the adjustable power that decides how aggressively we distribute extra budgets at lower worker counts. When <code>curveFactor</code> is 1, the <code>workerMultiplier</code> is equal to the utilization.</p><p>Let's come back to the example we discussed before and see how this plays out. At time <code>Mon, 11 Sep 2023 11:45:54 GMT</code>, requests to Delhi are at 25% utilization based on the local state key. The Anywhere slots are divided among all the active workers around the globe, as any worker in the world can take a part of this pie, i.e. 75% of the remaining 150 slots (113).</p><p><code>globalWorkersActive</code> is 10 when we look at the waiting room state. In this case we do not divide the 113 slots by 10, but instead by the adapted worker count, which is <code>globalWorkersActive * workerMultiplier</code>. If <code>curveFactor</code> is <code>1</code>, the <code>workerMultiplier</code> is equal to the utilization, which is 25% or 0.25.</p><p>So the effective <code>workerCount</code> = 10 * 0.25 = 2.5</p><p>So, every active worker can send as many as 113/2.5, which is approximately 45 users. The first 45 users that come to a worker in the minute <code>Mon, 11 Sep 2023 11:45:00 GMT</code> get admitted to the origin. The extra users get queued.</p><p>Therefore, at lower utilization (when traffic is farther from the limits), each worker gets more slots. But if the slots allocated to all the workers are added up, the sum can exceed the overall limit.</p>
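<p>Putting the formula and the worked example together (an illustrative sketch; the rounding and constants in production may differ):</p>

```typescript
// The workerMultiplier formula above as runnable code. Illustrative sketch;
// the flooring is an assumption, not necessarily the production behavior.
function adaptedWorkerCount(actualWorkerCount: number, utilization: number, curveFactor: number): number {
  const workerMultiplier = Math.pow(utilization, curveFactor);
  return actualWorkerCount * workerMultiplier;
}

// Slots a single worker may use when dividing by the adapted worker count.
function slotsPerWorkerAdapted(slots: number, actualWorkerCount: number, utilization: number, curveFactor: number): number {
  return Math.floor(slots / adaptedWorkerCount(actualWorkerCount, utilization, curveFactor));
}
```

<p>At 25% utilization with a curveFactor of 1, ten global workers adapt down to 2.5, so each worker may send 113 / 2.5 ≈ 45 users; at 100% utilization the division falls back to the plain worker count.</p>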
    <div>
      <h3>Risk of over provisioning</h3>
      <a href="#risk-of-over-provisioning">
        
      </a>
    </div>
    <p>The method of giving more slots at lower utilization decreases the chances of queuing when traffic is low relative to the traffic limits. However, at lower utilization levels, a uniform spike happening around the world could cause more users to go to the origin than expected. The diagram below shows a case where this can be an issue. The slots available for the data center are <i>ten</i>. At the 10% utilization we discussed before, each worker can have <i>eight</i> slots. If <i>eight</i> users show up at one worker and <i>seven</i> show up at another, we will be sending <i>fifteen</i> users to the website when the data center has a maximum of only <i>ten</i> available slots.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3aKrVwAvh40LjZGBcGl43R/2736425861cbc5ff8266c0aa7aa64a88/Risk-of-over-provisioning-at-lower-utilization.png" />
            
            </figure><p>With the range of customers and types of traffic we have, we saw cases where this became a problem: a traffic spike from low utilization levels could cause the global limits to be overshot. Because we over-provision at lower utilization, the risk of significantly exceeding traffic limits increases. We needed a safer approach that would not cause limits to be exceeded, while also decreasing the chance of queueing when traffic is low relative to the traffic limits.</p><p>Taking a step back, one of the assumptions we had made was that the traffic in a data center directly correlates with the worker count found in that data center. In practice, we found that this was not true for all customers. Even when the traffic correlates with the worker count, the new users going to the workers in the data center may not. This is because the slots we allocate are for new users, while the traffic a data center sees consists of both users who are already on the website and new users trying to reach it.</p><p>In the next section, we describe an approach in which worker counts are not used, and workers instead communicate with other workers in the data center. For that, we introduced a new service: a durable object counter.</p>
    <div>
      <h2>Decrease the number of times we divide the slots by introducing Data Center Counters</h2>
      <a href="#decrease-the-number-of-times-we-divide-the-slots-by-introducing-data-center-counters">
        
      </a>
    </div>
    <p>From the example above, we can see that over-provisioning at the worker level carries the risk of using up more slots than are allotted for a data center. If we do not over-provision at low utilization, we risk queuing users well before their configured limits are reached, which we discussed first. So there has to be a solution that can achieve both things.</p><p>The over-provisioning was done so that the workers do not run out of slots quickly when an uneven number of new users reach a bunch of workers. If there is a way for two workers in a data center to communicate, we do not need to divide slots among workers based on the worker count. For that communication to take place, we introduced counters. Counters are a bunch of small durable object instances that do the counting for a set of workers in the data center.</p><p>To understand how this helps avoid using worker counts, let's check the diagram below, which shows two workers talking to a <i>Data Center Counter</i>. Just as we discussed before, the workers let users through to the website based on the waiting room state. Previously, the count of users let through was stored in the memory of each worker. By introducing counters, that count is tracked in the <i>Data Center Counter</i>. Whenever a new user makes a request to the worker, the worker asks the counter for its current value. In the example below, for the first new request to the worker, the counter value received is 9. With 10 slots available at the data center, that means the user can go to the website. If the next worker receives a new user and makes a request just after that, it will get the value 10 and, based on the slots available for the worker, the user will get queued.</p>
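<p>In code, the ticketing idea looks roughly like this. It is a simplified in-memory sketch: a real implementation would live in a Durable Object reached over HTTP, and the method names here are our own assumptions.</p>

```typescript
// Simplified, in-memory sketch of the Data Center Counter idea. A real
// implementation would run inside a Durable Object; method names are
// illustrative assumptions, not Cloudflare's actual API.
class DataCenterCounter {
  private count = 0;

  // Return the current ticket number, then increment, so no two workers
  // can ever receive the same ticket.
  issueTicket(): number {
    return this.count++;
  }
}

// A worker admits a new user only while its ticket is below the slots
// available for the whole data center.
function admitNewUser(counter: DataCenterCounter, slotsAvailable: number): boolean {
  return counter.issueTicket() < slotsAvailable;
}
```

<p>Because the counter hands out each ticket number exactly once, the number of admitted users is bounded by the data center's slots no matter how unevenly users arrive at the workers.</p>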
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ttIdi7gwzmK4ENeqkIGO0/108c8831de2691504c4667bbc6180fcc/Counters-helping-workers-communicate-with-each-other.png" />
            
            </figure><p>The <i>Data Center Counter</i> acts as a point of synchronization for the workers in the waiting room. Essentially, it enables the workers to talk to each other without really talking to each other directly. This is similar to how a ticketing counter works. Whenever one worker lets someone in, it requests a ticket from the counter, so another worker requesting a ticket from the counter will not get the same ticket number. If the ticket value is valid, the new user gets to go to the website. So when different numbers of new users show up at the workers, we will not over-allocate or under-allocate slots for a worker, as the number of slots used is calculated by the counter for the whole data center.</p><p>The diagram below shows the behavior when an uneven number of new users reach the workers: one gets <i>seven</i> new users and the other gets <i>one</i>. All <i>eight</i> users that show up at the workers in the diagram below get to the website, as eight is below the <i>ten</i> slots available for the data center.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bais5OnxQpewoeNM2T81H/b2693f9d527c65183299e5bbe778d976/Uneven-number-of-requests-to-Workers-does-not-cause-queueing.png" />
            
            </figure><p>This also does not cause excess users to get sent to the website as we do not send extra users when the counter value equals the <code>slotsAvailable</code> for the data center. Out of the <i>fifteen</i> users that show up at the workers in the diagram below <i>ten</i> will get to the website and <i>five</i> will get queued which is what we would expect.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IF1Ze4p1TvZM9YUfCLyla/ba620e498641a3006eedc6200e0e3c55/Risk-of-over-provisioning-at-lower-utilization-also-does-not-exist-as-counters-help-Workers-communicate-with-each-other.png" />
            
            </figure><p>The risk of over-provisioning at lower utilization also goes away, as the counters help the workers communicate with each other.</p><p>To understand this further, let's look at the previous example and see how it works with the actual waiting room state.</p><p>The waiting room state for the customer is as follows.</p>
            <pre><code>{  
  "activeUsers": 50,
  "globalWorkersActive": 10,
  "dataCenterWorkersActive": 3,
  "trafficHistory": {
    "Mon, 11 Sep 2023 11:44:00 GMT": {
        "San Jose": "20/200", // 10%
        "London": "30/200", // 15%
        "Anywhere": "150/200" // 75%
    }
  }
}</code></pre>
            <p>The objective is to avoid dividing the slots among workers, so we don’t need that information from the state. At time <code>Mon, 11 Sep 2023 11:45:54 GMT</code>, requests come to San Jose. Since San Jose handled 10% of the traffic in the last recorded minute, we can send it 10% of the 150 available slots, which is 15.</p><p>The durable object counter at San Jose returns its current counter value for every new user that reaches the data center, incrementing the value by 1 after each response to a worker. So the first 15 new users that reach the workers get unique counter values. If the value received for a user is less than 15, that user gets one of the data center's slots.</p><p>Once the slots available for the data center run out, users can make use of the slots allocated to the Anywhere data centers, as these are not reserved for any particular data center. When a worker in San Jose gets a ticket value of 15, it knows that it is no longer possible to go to the website using a San Jose slot.</p><p>The Anywhere slots are available to all the active workers around the globe, i.e., 75% of the remaining 150 slots (113). The Anywhere slots are handled by a durable object that workers from different data centers can talk to when they want to use them. Even if all 128 (113 + 15) of these users end up going to the same worker for this customer, we will not queue them. This improves Waiting Room's ability to handle uneven numbers of new users arriving at workers around the world, which in turn helps customers queue close to their configured limits.</p>
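<p>To make the ticketing behavior concrete, here is a minimal sketch in plain JavaScript, rather than an actual Durable Object, of how a counter hands out monotonically increasing tickets and how a worker could use them. The names (<code>admit</code>, <code>dcSlots</code>) are illustrative, not Cloudflare's internals.</p>

```javascript
// Sketch only (not Cloudflare's actual implementation): a ticket counter
// in the spirit of the Data Center Counter durable object described above.
class Counter {
  constructor() {
    this.value = 0; // next ticket number to hand out
  }
  // Return the current value, then increment, so every caller
  // gets a unique ticket number.
  ticket() {
    return this.value++;
  }
}

// A worker admits a new user if their ticket is below the slots available
// for the data center, falling back to the shared Anywhere counter.
function admit(dcCounter, anywhereCounter, dcSlots, anywhereSlots) {
  if (dcCounter.ticket() < dcSlots) return "website";
  if (anywhereCounter.ticket() < anywhereSlots) return "website";
  return "queue";
}

// Numbers from the example above: 15 San Jose slots, 113 Anywhere slots.
const sanJose = new Counter();
const anywhere = new Counter();
const results = [];
for (let i = 0; i < 130; i++) {
  results.push(admit(sanJose, anywhere, 15, 113));
}
// The first 128 users are admitted; the remaining 2 are queued.
```

<p>Because every admission consumes a unique ticket, it does not matter how unevenly the 130 users are spread across workers; the same 128 get in.</p>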
    <div>
      <h3>Why do counters work well for us?</h3>
      <a href="#why-do-counters-work-well-for-us">
        
      </a>
    </div>
    <p>When we built Waiting Room, we wanted entry decisions to be made at the worker level itself, without talking to other services while the request is in flight to the website. We made that choice to avoid adding latency to user requests. Introducing a synchronization point at a durable object counter deviates from that, since it adds a call for each new user.</p><p>However, the durable object for a data center stays within the same data center, so the additional latency is minimal, usually less than 10 ms. For calls to the durable object that handles the Anywhere data centers, the request may have to cross oceans and long distances, which can push the latency to around 60 or 70 ms. The 95th percentile values shown below are higher because of calls that go to farther data centers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LC6pdXV2rZxZAntvU36Bs/fa9c714cd8bbad0d2b6fed0a35873f7d/1Dr9LuESmHPXU4nQlr_9WIncDSE9uXcbHA6qevspl.png" />
            
            </figure><p>The design decision to add counters introduces slight extra latency for new users going to the website. We deemed the trade-off acceptable because it reduces the number of users that get queued before limits are reached. In addition, the counters are only consulted when new users try to go to the website. Once users reach the origin, they get entry directly from the workers, as the proof of entry is available in the cookies they come with, and we can let them in based on that.</p><p>Counters are really simple services that do simple counting and nothing else, which keeps their memory and CPU footprint minimal. Moreover, we have many counters around the world, each handling coordination between a subset of workers. This helps the counters successfully handle the synchronization load from the workers. These factors add up to make counters a viable solution for our use case.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Waiting Room was designed with our number one priority in mind–to ensure that our customers’ sites remain up and running, no matter the volume or ramp up of legitimate traffic. Waiting Room runs on every server in Cloudflare’s network, which spans over 300 cities in more than 100 countries. We want the decision of whether each new user goes to the website or the queue to be made with minimal latency, at the right time. This decision is a hard one, as queuing too early at a data center can cause us to queue before the customer set limits are reached, while queuing too late can cause us to overshoot them.</p><p>With our initial approach, where we divided slots among our workers evenly, we sometimes queued too early but were pretty good at respecting customer set limits. Our next approach of giving out more slots at low utilization (low traffic levels compared to customer limits) did better in the cases where we previously queued too early, as every worker had more slots to work with. But as we have seen, this made us more likely to overshoot when a sudden spike in traffic occurred after a period of low utilization.</p><p>With counters we get the best of both worlds, as we avoid dividing slots by worker counts. Using counters, we can ensure that we do not queue too early or too late relative to the customer set limits. This comes at the cost of a little latency on every request from a new user, which we have found to be negligible and creates a better user experience than getting queued early.</p><p>We keep iterating on our approach to make sure we are always queuing people at the right time and, above all, protecting your website. As more and more customers use Waiting Room, we learn more about different types of traffic, and that helps make the product better for everyone.</p> ]]></content:encoded>
            <category><![CDATA[Waiting Room]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">378OAooWIn9X080mpldET2</guid>
            <dc:creator>George Thomas</dc:creator>
        </item>
        <item>
            <title><![CDATA[Get updates on the health of your origin where you need them]]></title>
            <link>https://blog.cloudflare.com/get-updates-on-the-health-of-your-origin-where-you-need-them/</link>
            <pubDate>Tue, 22 Mar 2022 15:08:45 GMT</pubDate>
            <description><![CDATA[ We are thrilled to announce Health Checks availability within Cloudflare’s centralized Notification tab, available to all Pro, Business, and Enterprise customers. Now, you can get critical alerts on the health of your origin without checking your inbox! ]]></description>
            <content:encoded><![CDATA[ <p></p><p>We are thrilled to announce the availability of Health Checks in the Cloudflare Dashboard’s Notifications tab, available to all Pro, Business, and Enterprise customers. Now, you can get critical alerts on the health of your origin without checking your inbox! Keep reading to learn more about how this update streamlines notification management and unlocks countless ways to stay informed on the health of your servers.</p>
    <div>
      <h2>Keeping your site reliable</h2>
      <a href="#keeping-your-site-reliable">
        
      </a>
    </div>
    <p>We first announced Health Checks when we realized some customers were setting up Load Balancers for their origins to monitor the origins’ availability and responsiveness. The Health Checks product provides a similarly powerful interface to Load Balancing, offering users the ability to ensure their origins meet criteria such as reachability, responsiveness, correct HTTP status codes, and correct HTTP body content. Customers can also receive email alerts when a Health Check finds their origin is unhealthy based on their custom criteria. In building a more focused product, we’ve added a slimmer, monitoring-based configuration, Health Check Analytics, and made it available for all paid customers. Health Checks run in multiple locations within Cloudflare’s edge network, meaning customers can monitor site performance across geographic locations.</p>
    <div>
      <h2>What’s new with Health Checks Notifications</h2>
      <a href="#whats-new-with-health-checks-notifications">
        
      </a>
    </div>
    <p>Health Checks email alerts have allowed customers to respond quickly to incidents and guarantee minimum disruption for their users. As before, Health Checks users can still select up to 20 email recipients to notify if a Health Check finds their site to be unhealthy. And if email is the right tool for your team, we’re excited to share that we have jazzed up our notification emails and added details on which Health Check triggered the notification, as well as the status of the Health Check:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wQaM5dANtGNveurNe8YL8/ff5aba1af881fa0fcde15b0067cde725/image9-1.png" />
            
            </figure><p><i>New Health Checks email format with Time, Health Status, and Health Check details</i></p><p>That being said, monitoring an inbox is not ideal for many customers needing to stay proactively informed. If email is not the communication channel your team typically relies upon, checking emails can at best be inconvenient and, at worst, allow critical updates to be missed. That's where the Notifications dashboard comes in.</p><p>Users can now create Health Checks notifications within the Cloudflare Notifications dashboard. By integrating Health Checks with Cloudflare's powerful Notification platform, we have unlocked myriad new ways to ensure customers receive critical updates on their origin health. One of the key benefits of Cloudflare's Notifications is webhooks, which give customers the flexibility to sync notifications with various external services. Webhook responses contain JSON-encoded information, allowing users to ingest them into their internal or third-party systems.</p><p>For real-time updates, users can now use webhooks to send Health Check alerts directly to their preferred instant messaging platforms such as Slack, Google Chat, or Microsoft Teams, to name a few. Beyond instant messaging, customers can also use webhooks to send Health Checks notifications to their internal APIs or their <a href="https://www.cloudflare.com/learning/security/what-is-siem/">Security Information and Event Management (SIEM)</a> platforms, such as DataDog or Splunk, giving customers a single pane of glass for all Cloudflare activity, including notifications and event logs. For more on how to configure popular webhooks, <a href="https://developers.cloudflare.com/fundamentals/notifications/create-notifications/configure-webhooks">check out our developer docs</a>. Below, we'll walk you through a couple of powerful webhook applications. But first, let's highlight a couple of other ways the update to Health Checks notifications improves the user experience.</p><p>By including Health Checks in the Notifications tab, users need only access one page for a single source of truth where they can manage their account notifications. For added ease of use, users can also migrate to Notification setup directly from the Health Check creation page. From within a Health Check, users can also see which Notifications are tied to it.</p><p>Additionally, configuring notifications for multiple Health Checks is now simplified. Instead of configuring notifications one Health Check at a time, users can now set up notifications for multiple or even all Health Checks from a single workflow.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5KkiuxbDOJYQxkDjWPkkeu/74d1bfbb4303e7f4cab1b656721a392e/image6-8.png" />
            
            </figure><p>Also, users can now access Health Checks notification history, and Enterprise customers can integrate Health Checks <a href="https://developers.cloudflare.com/fundamentals/notifications/create-notifications/create-pagerduty">directly with PagerDuty</a>. Last but certainly not least, as Cloudflare’s Notifications infrastructure grows in capabilities, Health Checks will be able to leverage all of these improvements. This guarantees Health Checks users the most timely and versatile notification capabilities that Cloudflare offers now and into the future.</p>
    <div>
      <h2>Setting up a Health Checks webhook</h2>
      <a href="#setting-up-a-health-checks-webhook">
        
      </a>
    </div>
    <p>To get notifications for health changes at an origin, <a href="https://support.cloudflare.com/hc/en-us/articles/4404867308429-About-Health-Checks">we first need to set up a Health Check for it.</a> In this example, we'll monitor HTTP responses: leave the Type and adjacent fields at their defaults. We will also monitor HTTP response codes: add <code>200</code> as the expected code, which will cause any other HTTP response code to trigger an unhealthy status.</p>
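<p>To make the "expected codes" criterion concrete, here is a small sketch, purely illustrative and not the product's code, of how a response code can be classified against a list of expected codes. The trailing-<code>x</code> wildcard handling (for patterns like <code>2xx</code>) is our assumption for illustration:</p>

```javascript
// Illustrative only: classify an origin's HTTP status code against a
// list of expected codes, treating anything else as unhealthy.
function isHealthy(statusCode, expectedCodes) {
  return expectedCodes.some((expected) => {
    if (expected.includes("x")) {
      // A pattern like "2xx" matches any code starting with "2".
      const prefix = expected.replace(/x/g, "");
      return String(statusCode).startsWith(prefix);
    }
    return String(statusCode) === expected;
  });
}

isHealthy(200, ["200"]); // true  -> healthy
isHealthy(503, ["200"]); // false -> unhealthy, triggers a notification
```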
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6x9auU4cTDBaE0jmDBrhxZ/3aa5c3f88ab7ea8914e83c770b19c548/image13.png" />
            
            </figure>
    <div>
      <h3>Creating the webhook notification policy</h3>
      <a href="#creating-the-webhook-notification-policy">
        
      </a>
    </div>
    <p>Once we’ve got our Health Check set up, we can create a webhook to link it to. Let’s start with a popular use case and send our Health Checks to a Slack channel. Before creating the webhook in the Cloudflare Notifications dashboard, we <a href="https://api.slack.com/messaging/webhooks">enable webhooks in our Slack workspace and retrieve the webhook URL of the Slack channel we want to send notifications to.</a> Next, we navigate to our account’s Notifications tab to add the Slack webhook as a Destination, entering the name and URL of our webhook — the secret will populate automatically.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/tZHXzAsptIeB2zPRASSXc/2d98e2b05b16508f8a5942bf090214e9/image2-91.png" />
            
            </figure><p><i>Webhook creation page with the following user input: Webhook name, Slack webhook url, and auto-populated Secret</i></p><p>Once we hit <b>Save and Test</b>, we will receive a message in the designated Slack channel verifying that our webhook is working.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Is2c2YurjdOtduCuqNJB1/b50c6e395d4407023a1484945e1751f7/image5-23.png" />
            
            </figure><p><i>Message sent in Slack via the configured webhook, verifying the webhook is working</i></p><p>This webhook can now be used for any notification type available in the Notifications dashboard. To have a Health Check notification sent to this Slack channel, simply add a new Health Check notification from the Notifications dashboard, selecting the Health Check(s) to tie to this webhook and the Slack webhook we just created. And, voilà! Anytime our Health Check detects a response status code other than 200 or goes from unhealthy to healthy, this Slack channel will be notified.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7vlEAwkWNFpTbQBd3h8nj3/a9ed32ad15594f48f9b0b49059223053/image10-1.png" />
            
            </figure><p><i>Health Check notification sent to Slack, indicating our server is online and Healthy.</i></p>
    <div>
      <h3>Create an Origin Health Status Page</h3>
      <a href="#create-an-origin-health-status-page">
        
      </a>
    </div>
    <p>Let’s walk through another powerful webhooks implementation with Health Checks. Using the Health Check we configured in our last example, let’s create a simple status page using Cloudflare Workers and Durable Objects that stores an origin’s health, updates it upon receiving a webhook request, and displays a status page to visitors.</p><p><b>Writing our worker</b></p><p>You can find the code for this example <a href="https://github.com/georgethomas111/status-page">in this GitHub repository</a>, if you want to clone it and try it out.</p><p>We’ve got our Health Check set up, and we’re ready to write our worker and durable object. To get started, we first need to <a href="https://developers.cloudflare.com/workers/cli-wrangler/install-update">install</a> wrangler, our CLI tool for testing and deploying more complex worker setups.</p>
            <pre><code>$ wrangler -V
wrangler 1.19.8</code></pre>
            <p>The examples in this blog were tested in this wrangler version.</p><p>Then, to speed up writing our worker, we will use a template to generate a project:</p>
            <pre><code>$ wrangler generate status-page https://github.com/cloudflare/durable-objects-template
$ cd status-page</code></pre>
            <p>The template has a Durable Object with the name <code>Counter</code>. We’ll rename that to <code>Status</code>, as it will store and update the current state of the page.</p><p>For that, we update <code>wrangler.toml</code> to use the correct name and type, and rename the <code>Counter</code> class in <code>index.mjs</code>.</p>
            <pre><code>name = "status-page"
# type = "javascript" is required to use the `[build]` section
type = "javascript"
workers_dev = true
account_id = "&lt;Cloudflare account-id&gt;"
route = ""
zone_id = ""
compatibility_date = "2022-02-11"
 
[build.upload]
# Upload the code directly from the src directory.
dir = "src"
# The "modules" upload format is required for all projects that export a Durable Objects class
format = "modules"
main = "./index.mjs"
 
[durable_objects]
bindings = [{name = "status", class_name = "Status"}]</code></pre>
            <p>Now, we’re ready to fill in our logic. We want to serve two different kinds of requests: one at <code>/webhook</code> that we will pass to the Notification system for updating the status, and another at <code>/</code> for a rendered status page.</p><p>First, let’s write the <code>/webhook</code> logic. We will receive a JSON object with a <code>data</code> and a <code>text</code> field. The <code>data</code> object contains the following fields:</p>
            <pre><code>time - The time when the Health Check status changed.
status - The status of the Health Check.
reason - The reason why the Health Check failed.
name - The Health Check name.
expected_codes - The status code the Health Check is expecting.
actual_code - The actual code received from the origin.
health_check_id - The id of the Health Check pushing the webhook notification. </code></pre>
            <p>For the status page we are using the Health Check <code>name</code>, <code>status</code>, and <code>reason</code> (the reason a Health Check became unhealthy, if any) fields. The <code>text</code> field contains a user-friendly version of this data, but it is more complex to parse.</p>
            <pre><code> async handleWebhook(request) {
  const json = await request.json();
 
  // Ignore webhook test notification upon creation
  if ((json.text || "").includes("Hello World!")) return;
 
  let healthCheckName = json.data?.name || "Unknown"
  let details = {
    status: json.data?.status || "Unknown",
    failureReason: json.data?.reason || "Unknown"
  }
  await this.state.storage.put(healthCheckName, details)
 }</code></pre>
            <p>Now that we can store status changes, we can use our state to render a status page:</p>
            <pre><code> async statusHTML() {
  const statuses = await this.state.storage.list()
  let statHTML = ""
 
   for (let [hcName, details] of statuses) {
   const status = details.status || ""
   const failureReason = details.failureReason || ""
   let hc = `&lt;p&gt;HealthCheckName: ${hcName} &lt;/p&gt;
         &lt;p&gt;Status: ${status} &lt;/p&gt;
         &lt;p&gt;FailureReason: ${failureReason}&lt;/p&gt; 
         &lt;br/&gt;`
   statHTML = statHTML + hc
  }
 
  return statHTML
 }
 
 async handleRoot() {
  // Default of healthy for before any notifications have been triggered
  const statuses = await this.statusHTML()
 
  return new Response(`
     &lt;!DOCTYPE html&gt;
     &lt;head&gt;
      &lt;title&gt;Status Page&lt;/title&gt;
      &lt;style&gt;
       body {
        font-family: Courier New;
        padding-left: 10vw;
        padding-right: 10vw;
        padding-top: 5vh;
       }
      &lt;/style&gt;
     &lt;/head&gt;
     &lt;body&gt;
       &lt;h1&gt;Status of Production Servers&lt;/h1&gt;
       &lt;p&gt;${statuses}&lt;/p&gt;
     &lt;/body&gt;
     `,
   {
    headers: {
     'Content-Type': "text/html"
    }
   })
 }</code></pre>
            <p>Then, we can direct requests to our two paths while also returning an error for invalid paths within our durable object with a <code>fetch</code> method:</p>
            <pre><code>async fetch(request) {
  const url = new URL(request.url)
  switch (url.pathname) {
   case "/webhook":
    await this.handleWebhook(request);
    return new Response()
   case "/":
    return await this.handleRoot();
   default:
    return new Response('Path not found', { status: 404 })
  }
 }</code></pre>
            <p>Finally, we can call that <code>fetch</code> method from our worker, allowing the external world to access our durable object.</p>
            <pre><code>export default {
 async fetch(request, env) {
  return await handleRequest(request, env);
 }
}
async function handleRequest(request, env) {
 let id = env.status.idFromName("A");
 let obj = env.status.get(id);
 
 return await obj.fetch(request);
}</code></pre>
            
    <div>
      <h3>Testing and deploying our worker</h3>
      <a href="#testing-and-deploying-our-worker">
        
      </a>
    </div>
    <p>When we’re ready to deploy, it’s as simple as running the following command:</p><p><code>$ wrangler publish --new-class Status</code></p><p>To test the change, we create a webhook pointing to the path the worker was deployed to. On the Cloudflare account dashboard, navigate to Notifications &gt; Destinations, and create a webhook.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DXxm3RcVgUe1EWhozBBqR/8d1436e9588ce2de2967b150ea8a2f08/image11.png" />
            
            </figure><p><i>Webhook creation page, with the following user input: name of webhook, destination URL, and optional secret is left blank.</i></p><p>Then, while still in the Notifications dashboard, create a Health Check notification tied to the status page webhook and Health Check.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ltPFjk9hURZNEk1DfzXSm/d2432cc32fb4cf5a9c1294a3673a4a38/image4-26.png" />
            
            </figure><p><i>Notification creation page, with the following user input: Notification name and the Webhook created in the previous step added.</i></p><p>Before getting any updates the <code>status-page</code> worker should look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5vSrpTmAC3habEEBTyvuzu/437509c50d8649a232e4cf4e05aff26e/image1-103.png" />
            
            </figure><p><i>Status page without updates reads: “Status of Production Servers”</i></p><p>Webhooks get triggered when the Health Check status changes. To simulate the change of Health Check status we take the origin down, which will send an update to the worker through the webhook. This causes the status page to get updated.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Bv5XIq7lDRy5KdGMNxs9i/5422ab4e4091c0237c76dc445cd3e823/image8-3.png" />
            
            </figure><p><i>Status page showing the name of the Health Check as "configuration-service", Status as "Unhealthy", and the failure reason as "TCP connection failed".</i> </p><p>Next, we simulate a return to normal by changing the Health Check expected status back to 200. This will make the status page show the origin as healthy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6lyFS1mDPReF5CPG4ZU6jK/1b07d824de54eebce83879300820ccb0/image7-3.png" />
            
            </figure><p><i>Status page showing the name of the Health Check as "configuration-service", Status as "Healthy", and the failure reason as "No failures".</i> </p><p>If you add more Health Checks and tie them to the webhook durable object, you will see the data being added to your account.</p>
    <div>
      <h3>Authenticating webhook requests</h3>
      <a href="#authenticating-webhook-requests">
        
      </a>
    </div>
    <p>We already have a working status page! However, anyone possessing the webhook URL would be able to forge a request, pushing arbitrary data to an externally-visible dashboard. Obviously, that’s not ideal. Thankfully, webhooks provide the ability to authenticate these requests by supplying a secret. Let’s create a new webhook that will provide a secret on every request. We can generate a secret by generating a pseudo-random UUID with Python:</p>
            <pre><code>$ python3
&gt;&gt;&gt; import uuid
&gt;&gt;&gt; uuid.uuid4()
UUID('4de28cf2-4a1f-4abd-b62e-c8d69569a4d2')</code></pre>
            <p>Now we create a new webhook with the secret.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DvDHuoGw8qliF2bsYP7rZ/38e29d91df8569ded442bd7bd4bb95b8/image3-42.png" />
            
            </figure><p><i>Webhook creation page, with the following user input: name of webhook, destination URL, and now with a secret added.</i></p><p>We also want to provide this secret to our worker. Wrangler has a command that lets us save the secret.</p>
            <pre><code>$ wrangler secret put WEBHOOK_SECRET
Enter the secret text you'd like assigned to the variable WEBHOOK_SECRET on the script named status-page:
4de28cf2-4a1f-4abd-b62e-c8d69569a4d2
🌀 Creating the secret for script name status-page
✨ Success! Uploaded secret WEBHOOK_SECRET.</code></pre>
            <p>Wrangler will prompt us for the secret, then provide it to our worker. Now we can check for the token upon every webhook notification, as the secret is provided by the header <code>cf-webhook-auth</code>. By checking the header’s value against our secret, we can authenticate incoming webhook notifications as genuine. To do that, we modify <code>handleWebhook</code>:</p>
            <pre><code> async handleWebhook(request) {
  // ensure the request we receive is from the Webhook destination we created
  // by examining its secret value, and rejecting it if it's incorrect
  if (request.headers.get('cf-webhook-auth') != this.env.WEBHOOK_SECRET) {
    return
  }
  ...old code here
 }</code></pre>
            <p>This origin health status page is just one example of the versatility of webhooks, which allowed us to leverage Cloudflare Workers and Durable Objects to support a custom Health Checks application. From highly custom use cases such as this to more straightforward, out-of-the-box solutions, pairing webhooks and Health Checks empowers users to respond to critical origin health updates effectively by delivering that information where it will be most impactful.</p>
    <div>
      <h2>Migrating to the Notifications Dashboard</h2>
      <a href="#migrating-to-the-notifications-dashboard">
        
      </a>
    </div>
    <p>The Notifications dashboard is now the centralized location for most Cloudflare services. In the interest of consistency and streamlined administration, we will soon be limiting Health Checks notification setup to the Notifications dashboard. Many existing Health Checks customers have emails configured via our legacy alerting system. Over the next three months, we will support the legacy system, giving customers time to transition their Health Checks notifications to the Notifications dashboard. Customers can expect the following timeline for the phasing out of existing Health Checks notifications:</p><ul><li><p>For now, customers subscribed to legacy emails will continue to receive them unchanged, so any parsing infrastructure will still work. From within a Health Check, you will see two options for configuring notifications: the legacy format and a deep link to the Notifications dashboard.</p></li><li><p>On <b>May 24, 2022,</b> we will disable the legacy method for configuring email notifications from the Health Checks dashboard.</p></li><li><p>On <b>June 28, 2022,</b> we will stop sending legacy emails, and adding new emails at the <code>/healthchecks</code> endpoint will no longer send email notifications.</p></li></ul><p>We strongly encourage all our users to migrate existing Health Checks notifications to the Notifications dashboard within this timeframe to avoid lapses in alerts.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Notifications]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">5bgBhuIdy5M2zE86JGhvp7</guid>
            <dc:creator>Darius Jankauskas</dc:creator>
            <dc:creator>Arielle Olache</dc:creator>
            <dc:creator>George Thomas</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building Waiting Room on Workers and Durable Objects]]></title>
            <link>https://blog.cloudflare.com/building-waiting-room-on-workers-and-durable-objects/</link>
            <pubDate>Wed, 16 Jun 2021 13:43:09 GMT</pubDate>
            <description><![CDATA[ Our product is specifically built for customers who experience high volumes of traffic, so we needed to run code at the edge in a highly scalable manner. Cloudflare has a great culture of building upon its own products, so we naturally thought of Workers. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In January, we announced the Cloudflare <a href="/cloudflare-waiting-room/">Waiting Room</a>, which has been available to select customers through <a href="/project-fair-shot/">Project Fair Shot</a> to help COVID-19 vaccination web applications handle demand. Back then, we mentioned that our system was built on top of Cloudflare Workers and the then brand new <a href="/introducing-workers-durable-objects/">Durable Objects</a>. In the coming days, we are making Waiting Room available to customers on our Business and Enterprise plans. As we are expanding availability, we are taking this opportunity to share how we came up with this design.</p>
    <div>
      <h2>What does the Waiting Room do?</h2>
      <a href="#what-does-the-waiting-room-do">
        
      </a>
    </div>
    <p>You may have seen lines of people queueing in front of stores or other buildings during sales for a new sneaker or phone. That is because stores have restrictions on how many people can be inside at the same time. Every store has its own limit based on the size of the building and other factors. If more people want to get inside than the store can hold, a queue forms outside.</p><p>The same situation applies to web applications. When you build a web application, you have to budget for the infrastructure to run it. You make that decision according to how many users you think the site will have. But sometimes, the site can see surges of users above what was initially planned. This is where the Waiting Room can help: it stands between users and the web application and automatically creates an orderly queue during traffic spikes.</p><p>The main job of the Waiting Room is to protect a customer’s application while providing a good user experience. To do that, it must make sure that the number of users of the application around the world does not exceed limits set by the customer. Using this product should not degrade performance for end users, so it should not add significant latency and should admit them automatically. In short, this product has three main requirements: respect the customer’s limits for users on the web application, keep latency low, and provide a seamless end user experience.</p><p>When there are more users trying to access the web application than the limits the customer has configured, new users are given a cookie and greeted with a waiting room page. This page displays their estimated wait time and automatically refreshes until the user is admitted to the web application.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/hNkOnNnIvELvBuHVYlTWz/8871260737d007f14de2d61e3a6f371b/image2-3.png" />
            
            </figure>
    <div>
      <h2>Configuring Waiting Rooms</h2>
      <a href="#configuring-waiting-rooms">
        
      </a>
    </div>
    <p>The important configurations that define how the waiting room operates are:</p><ol><li><p><i>Total Active Users</i> - the total number of active users that can be using the application at any given time</p></li><li><p><i>New Users Per Minute</i> - how many new users per minute are allowed into the application, and</p></li><li><p><i>Session Duration</i> - how long a user session lasts. Note: the session is renewed as long as the user is active. We terminate it after <i>Session Duration</i> minutes of inactivity.</p></li></ol>
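<p>Taken together, these settings can be modeled as a small record. The sketch below is purely illustrative: the field names are ours, not Cloudflare's API, and the session duration value is an arbitrary example.</p>

```typescript
// Illustrative model of a waiting room configuration (field names are
// assumptions, not Cloudflare's actual API).
interface WaitingRoomConfig {
  totalActiveUsers: number;       // max users on the application at once
  newUsersPerMinute: number;      // admission rate limit
  sessionDurationMinutes: number; // inactivity timeout for a session
}

const config: WaitingRoomConfig = {
  totalActiveUsers: 200,
  newUsersPerMinute: 200,
  sessionDurationMinutes: 5, // arbitrary example value
};

// User slots available = configured limit minus users currently active.
function userSlotsAvailable(cfg: WaitingRoomConfig, activeUsers: number): number {
  return cfg.totalActiveUsers - activeUsers;
}
```

<p>With this model, 148 active users against a limit of 200 leaves 52 slots; 201 active users leaves a negative number, meaning no one new can be admitted.</p>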
    <div>
      <h2>How does the waiting room work?</h2>
      <a href="#how-does-the-waiting-room-work">
        
      </a>
    </div>
    <p>If a web application is behind Cloudflare, every request from an end user to the web application will go to a Cloudflare data center close to them. If the web application enables the waiting room, Cloudflare issues a ticket to this user in the form of an encrypted cookie.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CXv57H2vr7pw2HBQtvvJN/ff8fb77e20f48dab2bccea0a0e947b42/image6-1.png" />
            
</figure><p>Waiting Room Overview</p><p>At any given moment, every waiting room has a limit on the number of users that can go to the web application. This limit is based on the customer configuration and the number of users currently on the web application. We refer to the number of users that can go into the web application at any given time as the number of user slots. The total number of user slots is equal to the limit configured by the customer minus the total number of users that have been let through.</p><p>When a traffic surge happens on the web application, the number of user slots available keeps decreasing. Current user sessions need to end before new users go in, so user slots keep decreasing until there are none left. At this point the waiting room starts queueing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YupAjYjEvXX15xjhGJQQH/af04e39e638aef93901e4b6fa4195600/image1-2.png" />
            
            </figure><p>The chart above is a customer's traffic to a web application between 09:40 and 11:30. The configuration for total active users is set to 250 users (yellow line). As time progresses there are more and more users on the application. The number of user slots available (orange line) in the application keeps decreasing as more users get into the application (green line). When there are more users on the application, the number of slots available decreases and eventually users start queueing (blue line). Queueing users ensures that the total number of active users stays around the configured limit.</p><p>To effectively calculate the user slots available, every service at the edge data centers should let its peers know how many users it lets through to the web application.</p><p>Coordination within a data center is faster and more reliable than coordination between many different data centers. So we decided to divide the user slots available on the web application to individual limits for each data center. The advantage of doing this is that only the data center limits will get exceeded if there is a delay in traffic information getting propagated. This ensures we don’t overshoot by much even if there is a delay in getting the latest information.</p><p>The next step was to figure out how to divide this information between data centers. For this we decided to use the historical traffic data on the web application. More specifically, we track how many different users tried to access the application across every data center in the preceding few minutes. The great thing about historical traffic data is that it's historical and cannot change anymore. So even with a delay in propagation, historical traffic data will be accurate even when the current traffic data is not.</p><p>Let's see an actual example: the current time is <code>Thu, 27 May 2021 16:33:20 GMT</code>. 
For the minute <code>Thu, 27 May 2021 16:31:00 GMT</code> there were <code>50</code> users in Nairobi and <code>50</code> in Dublin. For the minute <code>Thu, 27 May 2021 16:32:00 GMT</code> there were <code>45</code> users in Nairobi and <code>55</code> in Dublin. This was the only traffic on the application during that time.</p><p>Every data center looks at the share of traffic each data center received two minutes in the past. For <code>Thu, 27 May 2021 16:33:20 GMT</code> that minute is <code>Thu, 27 May 2021 16:31:00 GMT</code>.</p>
            <pre><code>Thu, 27 May 2021 16:31:00 GMT: 
{
  Nairobi: 0.5, //50/100(total) users
  Dublin: 0.5,  //50/100(total) users
},
Thu, 27 May 2021 16:32:00 GMT: 
{
  Nairobi: 0.45, //45/100(total) users
  Dublin: 0.55,  //55/100(total) users
}</code></pre>
            <p>For the minute <code>Thu, 27 May 2021 16:33:00 GMT</code>, the number of user slots available will be divided equally between Nairobi and Dublin as the traffic ratio for <code>Thu, 27 May 2021 16:31:00 GMT</code> is <code>0.5</code> and <code>0.5</code>. So, if there are 1000 slots available, Nairobi will be able to send 500 and Dublin can send 500.</p><p>For the minute <code>Thu, 27 May 2021 16:34:00 GMT</code>, the number of user slots available will be divided using the ratio 0.45 (Nairobi) to 0.55 (Dublin). So if there are 1000 slots available, Nairobi will be able to send 450 and Dublin can send 550.</p>
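<p>The division of slots by historical traffic share can be sketched as below. The function name and the decision to round down are our assumptions for illustration, not the production logic.</p>

```typescript
// Divide the user slots available globally between data centers using the
// traffic share observed two minutes earlier (illustrative sketch).
function divideSlots(
  totalSlots: number,
  trafficShare: Record<string, number>, // e.g. { Nairobi: 0.45, Dublin: 0.55 }
): Record<string, number> {
  const perDataCenter: Record<string, number> = {};
  for (const [dataCenter, share] of Object.entries(trafficShare)) {
    // Round down so the per-data-center allocations never sum above the total.
    perDataCenter[dataCenter] = Math.floor(totalSlots * share);
  }
  return perDataCenter;
}

// The minute Thu, 27 May 2021 16:34:00 GMT uses the 16:32:00 ratios:
const allocation = divideSlots(1000, { Nairobi: 0.45, Dublin: 0.55 });
// allocation.Nairobi === 450, allocation.Dublin === 550
```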
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2pT9KvjCcU8g9kTWHMU2Qt/ae0d2abbede2b05b269f751c607517e8/image8.png" />
            
</figure><p>The service at the edge data centers counts the number of users it lets into the web application and starts queueing when the data center limit is approached. Per-data-center limits that adjust based on historical traffic give us a system that does not need to communicate often between data centers.</p>
    <div>
      <h3>Clustering</h3>
      <a href="#clustering">
        
      </a>
    </div>
<p>In order to let people access the application fairly we need a way to keep track of their position in the queue. It's not practical to track every end user in the waiting room individually, so we cluster users who arrive around the same time into a single time bucket. A bucket has an identifier (bucketId) calculated based on the time the user tried to visit the waiting room for the first time. All the users who visited the waiting room between 19:51:00 and 19:51:59 are assigned the bucketId 19:51:00.</p>
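<p>A minimal sketch of this clustering, assuming the bucketId is simply the first-visit time truncated to the minute and formatted as an HTTP date:</p>

```typescript
// Sketch: cluster users into one-minute buckets by truncating their
// first-visit time to the minute (illustrative, not Cloudflare's code).
function bucketId(firstVisit: Date): string {
  const truncated = new Date(firstVisit.getTime());
  truncated.setUTCSeconds(0, 0); // drop seconds and milliseconds
  return truncated.toUTCString();
}

// Two users arriving within the same minute share a bucket:
bucketId(new Date("2021-05-26T19:51:07Z")); // "Wed, 26 May 2021 19:51:00 GMT"
bucketId(new Date("2021-05-26T19:51:59Z")); // same bucketId
```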
    <div>
      <h3>What is in the cookie returned to an end user?</h3>
      <a href="#what-is-in-the-cookie-returned-to-an-end-user">
        
      </a>
    </div>
    <p>We mentioned an encrypted cookie that is assigned to the user when they first visit the waiting room. Every time the user comes back, they bring this cookie with them. The cookie is a ticket for the user to get into the web application. The content below is the typical information the cookie contains when visiting the web application. This user first visited around <code>Wed, 26 May 2021 19:51:00 GMT</code>, waited for around 10 minutes and got accepted on <code>Wed, 26 May 2021 20:01:13 GMT</code>.</p>
            <pre><code>{
  "bucketId": "Wed, 26 May 2021 19:51:00 GMT",
  "lastCheckInTime": "Wed, 26 May 2021 20:01:13 GMT",
  "acceptedAt": "Wed, 26 May 2021 20:01:13 GMT",
 }</code></pre>
<p>In this cookie:</p><p><i>bucketId</i> - the cluster the ticket is assigned to. This tracks the user's position in the queue.</p><p><i>acceptedAt</i> - the time when the user got accepted to the web application for the first time.</p><p><i>lastCheckInTime</i> - the time when the user was last seen in the waiting room or the web application.</p><p>Once a user has been let through to the web application, we have to check how long they are eligible to spend there. Our customers can customize how long a user spends on the web application using <i>Session Duration</i>. Whenever we see an accepted user we set the cookie to <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie">expire</a> <i>Session Duration</i> minutes from when we last saw them.</p>
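<p>The renewal logic can be sketched in a couple of lines; the function name is ours, and the sketch ignores the rest of the Set-Cookie header:</p>

```typescript
// Sketch: push the cookie expiry out by Session Duration minutes from the
// moment the user was last seen (illustrative).
function cookieExpiry(lastSeen: Date, sessionDurationMinutes: number): Date {
  return new Date(lastSeen.getTime() + sessionDurationMinutes * 60_000);
}
```

<p>A user last seen at 20:01:13 with a five-minute Session Duration would get a cookie expiring at 20:06:13.</p>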
    <div>
      <h3>Waiting Room State</h3>
      <a href="#waiting-room-state">
        
      </a>
    </div>
<p>Previously we talked about the concept of user slots and how we can function even when there is a delay in communication between data centers. The waiting room state helps to accomplish this. It is formed from historical data of events happening in different data centers. So when a waiting room is first created, there is no waiting room state as there is no recorded traffic. The only information available is the customer’s configured limits. Based on that we start letting users in. In the background, the service running in each data center (introduced later in this post as the Data Center Durable Object) periodically reports the tickets it has issued to a coordinating service and periodically gets a response back about things happening around the world.</p><p>As time progresses more and more users with different bucketIds show up in different parts of the globe. Aggregating this information from the different data centers gives the waiting room state.</p><p>Let's look at an example: there are two data centers, one in Nairobi and the other in Dublin. When there are no user slots available for a data center, users start getting queued. Different users who were assigned different bucketIds get queued. The data center state from Dublin looks like this:</p>
            <pre><code>activeUsers: 50,
buckets: 
[  
  {
    key: "Thu, 27 May 2021 15:55:00 GMT",
    data: 
    {
      waiting: 20,
    }
  },
  {
    key: "Thu, 27 May 2021 15:56:00 GMT",
    data: 
    {
      waiting: 40,
    }
  }
]</code></pre>
            <p>The same thing is happening in Nairobi and the data from there looks like this:</p>
            <pre><code>activeUsers: 151,
buckets: 
[ 
  {
    key: "Thu, 27 May 2021 15:54:00 GMT",
    data: 
    {
      waiting: 2,
    },
  } 
  {
    key: "Thu, 27 May 2021 15:55:00 GMT",
    data: 
    {
      waiting: 30,
    }
  },
  {
    key: "Thu, 27 May 2021 15:56:00 GMT",
    data: 
    {
      waiting: 20,
    }
  }
]</code></pre>
<p>This information from the data centers is reported in the background and aggregated into a data structure similar to the one below:</p>
            <pre><code>activeUsers: 201, // 151(Nairobi) + 50(Dublin)
buckets: 
[  
  {
    key: "Thu, 27 May 2021 15:54:00 GMT",
    data: 
    {
      waiting: 2, // 2 users from (Nairobi)
    },
  }
  {
    key: "Thu, 27 May 2021 15:55:00 GMT", 
    data: 
    {
      waiting: 50, // 20 from Nairobi and 30 from Dublin
    }
  },
  {
    key: "Thu, 27 May 2021 15:56:00 GMT",
    data: 
    {
      waiting: 60, // 20 from Nairobi and 40 from Dublin
    }
  }
]</code></pre>
<p>The data structure above is a sorted list of all the bucketIds in the waiting room. The <code>waiting</code> field has information about how many people are waiting with a particular bucketId. The <code>activeUsers</code> field has information about the number of users who are active on the web application.</p><p>Imagine for this customer, the limits they have set in the dashboard are:</p><ul><li><p><i>Total Active Users</i> - 200</p></li><li><p><i>New Users Per Minute</i> - 200</p></li></ul><p>As per their configuration only 200 users can be on the web application at any time. So user slots available for the waiting room state above are 200 - 201(activeUsers) = -1. So no one can go in and users get queued.</p><p>Now imagine that some users have finished their session and activeUsers is now 148.</p><p>Now userSlotsAvailable = 200 - 148 = 52 users. We should let 52 of the users who have been waiting the longest into the application. We achieve this by giving the eligible slots to the oldest buckets in the queue. In the example below 2 users are waiting from bucket <code>Thu, 27 May 2021 15:54:00 GMT</code> and 50 users are waiting from bucket <code>Thu, 27 May 2021 15:55:00 GMT</code>. These are the oldest buckets in the queue, so they get the eligible slots.</p>
            <pre><code>activeUsers: 148,
buckets: 
[  
  {
    key: "Thu, 27 May 2021 15:54:00 GMT",
    data: 
    {
      waiting: 2,
      eligibleSlots: 2,
    },
  }
  {
    key: "Thu, 27 May 2021 15:55:00 GMT",
    data: 
    {
      waiting: 50,
      eligibleSlots: 50,
    }
  },
  {
    key: "Thu, 27 May 2021 15:56:00 GMT",
    data: 
    {
      waiting: 60,
      eligibleSlots: 0,
    }
  }
]</code></pre>
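<p>The assignment of eligible slots to the oldest buckets can be sketched as a single pass over the sorted list; this is an illustrative reimplementation, not Cloudflare's actual code.</p>

```typescript
interface Bucket {
  key: string;           // bucketId, e.g. "Thu, 27 May 2021 15:55:00 GMT"
  waiting: number;       // users queued with this bucketId
  eligibleSlots: number; // slots granted to this bucket
}

// Hand available user slots to the oldest buckets first. Assumes `buckets`
// is sorted oldest-first, as in the waiting room state above.
function assignEligibleSlots(buckets: Bucket[], slotsAvailable: number): void {
  let remaining = Math.max(0, slotsAvailable);
  for (const bucket of buckets) {
    bucket.eligibleSlots = Math.min(bucket.waiting, remaining);
    remaining -= bucket.eligibleSlots;
  }
}
```

<p>Running this over the three buckets above with 52 available slots reproduces the 2/50/0 split shown.</p>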
            <p>If there are eligible slots available for all the users in their bucket, then they can be sent to the web application from any data center. This ensures the fairness of the waiting room.</p><p>There is another case that can happen where we do not have enough eligible slots for a whole bucket. When this happens things get a little more complicated as we cannot send everyone from that bucket to the web application. Instead, we allocate a share of eligible slots to each data center.</p>
            <pre><code>key: "Thu, 27 May 2021 15:56:00 GMT",
data: 
{
  waiting: 60,
  eligibleSlots: 20,
}</code></pre>
            <p>As we did before, we use the ratio of past traffic from each data center to decide how many users it can let through. So if the current time is <code>Thu, 27 May 2021 16:34:10 GMT</code> both data centers look at the traffic ratio in the past at <code>Thu, 27 May 2021 16:32:00 GMT</code> and send a subset of users from those data centers to the web application.</p>
            <pre><code>Thu, 27 May 2021 16:32:00 GMT: 
{
  Nairobi: 0.25, // 0.25 * 20 = 5 eligibleSlots
  Dublin: 0.75,  // 0.75 * 20 = 15 eligibleSlots
}</code></pre>
            
    <div>
      <h3>Estimated wait time</h3>
      <a href="#estimated-wait-time">
        
      </a>
    </div>
<p>When a request comes from a user, we look at their bucketId. Based on the bucketId, the sorted list tells us how many people are in front of the user. Similar to how we track activeUsers, we also calculate the average number of users going to the web application per minute. Dividing the number of people in front of the user by that average rate gives us the estimated wait time, which is what is shown to the user who visits the waiting room.</p>
            <pre><code>avgUsersToWebApplication:  30,
activeUsers: 148,
buckets: 
[  
  {
    key: "Thu, 27 May 2021 15:54:00 GMT",
    data: 
    {
      waiting: 2,
      eligibleSlots: 2,
    },
  }
  {
    key: "Thu, 27 May 2021 15:55:00 GMT",
    data: 
    {
      waiting: 50,
      eligibleSlots: 50,
    }
  },
  {
    key: "Thu, 27 May 2021 15:56:00 GMT",
    data: 
    {
      waiting: 60,
      eligibleSlots: 0,
    }
  }
]</code></pre>
<p>In the case above, for a user with bucketId <code>Thu, 27 May 2021 15:56:00 GMT</code>, there are <code>60</code> users ahead of them. With an <code>avgUsersToWebApplication</code> of <code>30</code> per minute, the estimated time to get into the web application is <code>60/30</code>, which is <code>2</code> minutes.</p>
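<p>The estimate can be sketched as below; we assume (consistent with the worked example) that users who already hold eligible slots are no longer counted as waiting ahead of you.</p>

```typescript
interface QueueBucket {
  key: string;
  waiting: number;
  eligibleSlots: number;
}

// Sketch: count queued users at or ahead of the user's bucket who do not yet
// hold an eligible slot, then divide by the average admission rate per minute.
function estimatedWaitMinutes(
  buckets: QueueBucket[],          // sorted oldest-first
  userBucketId: string,
  avgUsersToWebApplication: number,
): number {
  let ahead = 0;
  for (const bucket of buckets) {
    ahead += bucket.waiting - bucket.eligibleSlots;
    if (bucket.key === userBucketId) break;
  }
  return Math.ceil(ahead / avgUsersToWebApplication);
}
```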
    <div>
      <h2>Implementation with Workers and Durable Objects</h2>
      <a href="#implementation-with-workers-and-durable-objects">
        
      </a>
    </div>
    <p>Now that we have talked about the user experience and the algorithm, let’s focus on the implementation. Our product is specifically built for customers who experience high volumes of traffic, so we needed to run code at the edge in a highly scalable manner. Cloudflare has a great culture of building upon its own products, so we naturally thought of <a href="/introducing-cloudflare-workers/">Workers</a>. The Workers platform uses <a href="/cloud-computing-without-containers/">Isolates</a> to scale up and can scale horizontally as there are more requests.</p><p>The Workers product has an ecosystem of tools like <a href="https://github.com/cloudflare/wrangler">wrangler</a> which help us to iterate and debug things quickly.</p><p>Workers also reduce long-term operational work.</p><p>For these reasons, the decision to build on Workers was easy. The more complex choice in our design was for the coordination. As we have discussed before, our workers need a way to share the waiting room state. We need every worker to be aware of changes in traffic patterns quickly in order to respond to sudden traffic spikes. We use the proportion of traffic from two minutes before to allocate user slots among data centers, so we need a solution to aggregate this data and make it globally available within this timeframe. Our design also relies on having fast coordination within a data center to react quickly to changes. We considered a few different solutions before settling on <a href="https://developers.cloudflare.com/workers/runtime-apis/cache">Cache</a> and <a href="/introducing-workers-durable-objects/">Durable Objects</a>.</p>
    <div>
      <h3>Idea #1: Workers KV</h3>
      <a href="#idea-1-workers-kv">
        
      </a>
    </div>
<p>We started to work on the project around March 2020. At that point, Workers offered two options for storage: the <a href="https://developers.cloudflare.com/workers/runtime-apis/cache">Cache API</a> and <a href="https://www.cloudflare.com/products/workers-kv/">KV</a>. Cache is shared only at the data center level, so for global coordination we had to use KV. Each worker writes its own key to KV that describes the requests it received and how it processed them. Each key is set to expire after a few minutes if the worker stops writing. To create a workerState, the worker periodically does a list operation on the KV namespace to get the state around the world.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fcnX6Cjf5L7E8lDjOBQGD/4398b5634d38d3d7b88985dc4c5fa8e4/image7.png" />
            
            </figure><p>Design using KV</p><p>This design has some flaws because KV <a href="https://developers.cloudflare.com/workers/learning/how-kv-works">wasn’t built for a use case like this</a>. The state of a waiting room changes all the time to match traffic patterns. Our use case is write intensive and KV is intended for read-intensive workflows. As a consequence, our proof of concept implementation turned out to be more expensive than expected. Moreover, KV is eventually consistent: it takes time for information written to KV to be available in all of our data centers. This is a problem for Waiting Room because we need fine-grained control to be able to react quickly to traffic spikes that may be happening simultaneously in several locations across the globe.</p>
    <div>
      <h3>Idea #2: Centralized Database</h3>
      <a href="#idea-2-centralized-database">
        
      </a>
    </div>
    <p>Another alternative was to run our own databases in our core data centers. The <a href="https://developers.cloudflare.com/workers/runtime-apis/cache">Cache API</a> in Workers lets us use the cache directly within a data center. If there is frequent communication with the core data centers to get the state of the world, the cached data in the data center should let us respond with minimal latency on the request hot path. There would be fine-grained control on when the data propagation happens and this time can be kept low.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ySnARioXWtR5DkkjSyH9b/9e8856d4b0fcf7f629d8e5284ccc4b6c/image3-1.png" />
            
</figure><p>Design using core data centers</p><p>As noted before, this application is very write-heavy and the data is rather short-lived. For these reasons, a standard relational database would not be a good fit. This meant we could not leverage the existing database clusters maintained by our in-house specialists. Rather, we would need to use an in-memory data store such as Redis, and we would have to set it up and maintain it ourselves. We would have to install a data store cluster in each of our core locations, fine-tune our configuration, and make sure data is replicated between them. We would also have to create a proxy service running in our core data centers to gate access to that database and validate data before writing to it.</p><p>We could likely have made it work, at the cost of substantial operational overhead. While that is not insurmountable, this design would introduce a strong dependency on the availability of core data centers. If there were issues in the core data centers, it would affect the product globally, whereas an edge-based solution would be more resilient. If an edge data center goes offline, Anycast takes care of routing the traffic to nearby data centers, ensuring the web application is not affected.</p>
    <div>
      <h3>The Scalable Solution: Durable Objects</h3>
      <a href="#the-scalable-solution-durable-objects">
        
      </a>
    </div>
<p>Around that time, we learned about <a href="/introducing-workers-durable-objects/">Durable Objects</a>. The product was in closed beta back then, but we decided to embrace Cloudflare’s thriving dogfooding culture and did not let that deter us. With Durable Objects, we could create one global Durable Object instance per waiting room instead of maintaining a single database. This object can exist anywhere in the world and handles redundancy and availability for us, so Durable Objects give us sharding for free. Durable Objects gave us fine-grained control as well as better availability as they run in our edge data centers. Additionally, each waiting room is isolated from the others: adverse events affecting one customer are less likely to spill over to other customers.</p><p><b>Implementation with Durable Objects</b></p><p>Based on these advantages, we decided to build our product on Durable Objects.</p><p>As mentioned above, we use a worker to decide whether to send users to the Waiting Room or the web application. That worker periodically sends a request to a Durable Object saying how many users it sent to the Waiting Room and how many it sent to the web application. A Durable Object instance is created on the first request and remains active as long as it is receiving requests. The Durable Object aggregates the counters sent by every worker to create a count of users sent to the Waiting Room and a count of users on the web application.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uo0AeQpWFcgVEQJSyf4F2/adc1cca869b02b92b06e26a98cb281d0/image10.png" />
            
</figure><p>A Durable Object instance is only active as long as it is receiving requests and can be restarted during maintenance. When a Durable Object instance is restarted, its in-memory state is cleared. To preserve the in-memory data on Durable Object restarts, we back up the data using the Cache API. This offers weaker guarantees than using the <a href="https://developers.cloudflare.com/workers/learning/using-durable-objects#accessing-persistent-storage-from-a-durable-object">Durable Object persistent storage</a> as data may be evicted from cache, or the Durable Object can be moved to a different data center. If that happens, the Durable Object will have to start without cached data. On the other hand, persistent storage at the edge still has limited capacity. Since we can rebuild state very quickly from worker updates, we decided that cache is enough for our use case.</p><p><b>Scaling up</b></p><p>When traffic spikes happen around the world, new workers are created. Every worker needs to communicate how many users have been queued and how many have been let through to the web application. However, while workers automatically scale horizontally when traffic increases, Durable Objects do not. By design, there is only one instance of any Durable Object. This instance runs on a single thread so if it receives requests more quickly than it can respond, it can become overloaded. To avoid that, we cannot let every worker send its data directly to the same Durable Object. The way we achieve scalability is by sharding: we create per data center Durable Object instances that report up to one global instance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/746CExfEYweWJtQf9fxk1z/81c666b4eca2f963375084a2fce26334/image5.png" />
            
</figure><p>Durable Objects implementation</p><p>The aggregation is done in two stages: at the data-center level and at the global level.</p><p><b>Data Center Durable Object</b></p><p>When a request comes to a particular location, we can see the corresponding data center by looking at the <code>cf.colo</code> field on the <a href="https://developers.cloudflare.com/workers/runtime-apis/request#incomingrequestcfproperties">request</a>. The Data Center Durable Object keeps track of the number of workers in the data center and aggregates the state from all those workers. It also responds to workers with important information within a data center, like the number of users making requests to a waiting room or the number of workers. It frequently updates the Global Durable Object and receives information about other data centers in the response.</p>
    <div>
      <h4><b>Worker User Slots</b></h4>
      <a href="#worker-user-slots">
        
      </a>
    </div>
    <p>Above we talked about how a data center gets user slots allocated to it based on the past traffic patterns. If every worker in the data center talks to the Data Center Durable Object on every request, the Durable Object could get overwhelmed. Worker User Slots help us to overcome this problem.</p><p>Every worker keeps track of the number of users it has let through to the web application and the number of users that it has queued. The worker user slots are the number of users a worker can send to the web application at any point in time. This is calculated from the user slots available for the data center and the worker count in the data center. We divide the total number of user slots available for the data center by the number of workers in the data center to get the user slots available for each worker. If there are two workers and 10 users that can be sent to the web application from the data center, then we allocate five as the budget for each worker. This division is needed because every worker makes its own decisions on whether to send the user to the web application or the waiting room without talking to anyone else.</p>
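<p>The per-worker budget is then a simple division; rounding down is our assumption, chosen so the workers collectively stay within the data center's limit.</p>

```typescript
// Sketch: split a data center's available user slots evenly between its
// workers (illustrative).
function workerUserSlots(dataCenterSlots: number, workerCount: number): number {
  if (workerCount === 0) return 0;
  // Round down so the sum across workers never exceeds the data center limit.
  return Math.floor(dataCenterSlots / workerCount);
}
```

<p>With 10 slots for the data center and two workers, each worker gets a budget of five, matching the example above.</p>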
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7GhRGXwUGAzkSXs5e2ytwF/48b8aff1f9057f84e4cd8731c38e531c/Waiting-room-diagram-4.png" />
            
            </figure><p>Waiting room inside a data center</p><p>When the traffic changes, new workers can spin up or old workers can die. The worker count in a data center is dynamic as the traffic to the data center changes. Here we make a trade off similar to the one for inter data center coordination: there is a risk of overshooting the limit if many more workers are created between calls to the Data Center Durable Object. But too many calls to the Data Center Durable Object would make it hard to scale. In this case though, we can use <a href="https://developers.cloudflare.com/workers/runtime-apis/cache">Cache</a> for faster synchronization within the data center.</p>
    <div>
      <h4><b>Cache</b></h4>
      <a href="#cache">
        
      </a>
    </div>
    <p>On every interaction to the Data Center Durable Object, the worker saves a copy of the data it receives to the cache. Every worker frequently talks to the cache to update the state it has in memory with the state in cache. We also adaptively adjust the rate of writes from the workers to the Data Center Durable Object based on the number of workers in the data center. This helps to ensure that we do not take down the Data Center Durable Object when traffic changes.</p>
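<p>One hypothetical way to adapt the write rate: have each worker report less often as the worker count grows, keeping the aggregate request rate to the Data Center Durable Object roughly constant. The target rate here is an assumed tuning knob, not a real Cloudflare parameter.</p>

```typescript
// Sketch: if N workers each report once every N/target seconds, the Durable
// Object sees about `targetRequestsPerSecond` requests per second in aggregate.
function reportIntervalMs(workerCount: number, targetRequestsPerSecond: number): number {
  return Math.ceil((workerCount / targetRequestsPerSecond) * 1000);
}
```

<p>So with ten workers and a target of ten requests per second, each worker reports once a second; with a hundred workers, each reports once every ten seconds.</p>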
    <div>
      <h3>Global Durable Object</h3>
      <a href="#global-durable-object">
        
      </a>
    </div>
    <p>The Global Durable Object is designed to be simple and stores the information it receives from any data center in memory. It responds with the information it has about all data centers. It periodically saves its in-memory state to cache using the Workers Cache API so that it can withstand restarts as mentioned above.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4IciXNUkxrxLFeZ6qo02UP/af58b5e686ed1dc2ef4af9c9ad89c94b/image11.png" />
            
            </figure><p>Components of waiting room</p>
    <div>
      <h2>Recap</h2>
      <a href="#recap">
        
      </a>
    </div>
<p>This is how the waiting room works right now. Every request to a web application with a waiting room enabled goes to a worker at a Cloudflare edge data center. When this happens, the worker looks for the state of the waiting room in the <a href="https://developers.cloudflare.com/workers/runtime-apis/cache">Cache</a> first. We use cache here instead of the Data Center Durable Object so that we do not overwhelm the Durable Object instance when there is a spike in traffic. Plus, reading data from cache is faster. The workers periodically make a request to the Data Center Durable Object to get the waiting room state, which they then write to the cache. The idea here is that the cache should have a recent copy of the waiting room state.</p><p>Workers can examine the <a href="https://developers.cloudflare.com/workers/runtime-apis/request#incomingrequestcfproperties">request</a> to know which data center they are in. Every worker periodically makes a request to the corresponding Data Center Durable Object. This interaction updates the worker state in the Data Center Durable Object. In return, the workers get the waiting room state from the Data Center Durable Object. The Data Center Durable Object sends the data center state to the Global Durable Object periodically. In the response, the Data Center Durable Object receives all data center states globally. It then calculates the waiting room state and returns that state to a worker in its response.</p><p>The advantage of this design is that it's possible to adjust the rate of writes from workers to the Data Center Durable Object and from the Data Center Durable Object to the Global Durable Object based on the traffic received in the waiting room. This helps us respond to requests during high traffic without overloading the individual Durable Object instances.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>By using Workers and Durable Objects, Waiting Room was able to scale up to keep web application servers online for many of our early customers during large spikes of traffic. It helped keep vaccination sign-ups online for companies and governments around the world for free through <a href="https://www.cloudflare.com/fair-shot/">Project Fair Shot</a>: <a href="https://www.cloudflare.com/case-studies/verto/">Verto Health</a> was able to serve over 4 million customers in Canada; <a href="https://www.cloudflare.com/case-studies/tickettailor/">Ticket Tailor</a> reduced their peak resource utilization from 70% down to 10%; the <a href="https://www.cloudflare.com/case-studies/county-of-san-luis-obispo/">County of San Luis Obispo</a> was able to stay online during traffic surges of up to 23,000 users; and the country of <a href="https://www.cloudflare.com/case-studies/latvia-ministry-of-health/">Latvia</a> was able to stay online during surges of thousands of requests per second. These are just a few of the customers we served and will continue to serve until Project Fair Shot ends.</p><p>In the coming days, we are rolling out the Waiting Room to customers on our business plan. <a href="https://www.cloudflare.com/plans/business/">Sign up</a> today to prevent spikes of traffic to your web application. If you are interested in access to Durable Objects, it’s currently available to try out in <a href="/durable-objects-open-beta/">Open Beta</a>.</p> ]]></content:encoded>
            <category><![CDATA[Waiting Room]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">xO8XUNhlWLIGxJXKlV5Bs</guid>
            <dc:creator>Fabienne Semeria</dc:creator>
            <dc:creator>George Thomas</dc:creator>
            <dc:creator>Mathew Jacob</dc:creator>
        </item>
        <item>
            <title><![CDATA[Health Check Analytics and how you can use it]]></title>
            <link>https://blog.cloudflare.com/health-check-analytics-and-how-you-can-use-it/</link>
            <pubDate>Fri, 12 Jun 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Health Check Analytics is now live and available to all Pro, Business and Enterprise customers!  We are very excited to help decrease your time to resolution and ensure your application reliability is maximised. ]]></description>
            <content:encoded><![CDATA[ <p>At the end of last year, we introduced <a href="/new-tools-to-monitor-your-server-and-avoid-downtime/">Standalone Health Checks</a> - a service that lets you monitor the health of your origin servers and avoid the need to purchase additional third party services. The more that can be controlled from Cloudflare, the lower your maintenance costs, vendor-management overhead, and infrastructure complexity. This is important because it ensures you can scale your infrastructure seamlessly as your company grows. Today, we are introducing Standalone Health Check Analytics to help decrease your time to resolution for any potential issues. You can find Health Check Analytics in the sub-menu under the Traffic tab in your Cloudflare Dashboard.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7f25fu7kADPkUPmpNrPqbV/efa9b95ac2310e7abd23ea9ee79be535/image6-2.png" />
            
            </figure><p>As a refresher, Standalone Health Checks is a service that monitors an IP address or hostname for your origin servers or application and notifies you in near real-time if there happens to be a problem. These Health Checks support fine-tuned configurations based on expected codes, interval, protocols, timeout and <a href="/new-tools-to-monitor-your-server-and-avoid-downtime/">more</a>. These configurations let you properly target your checks based on the unique setup of your infrastructure. An example Health Check can be seen below, monitoring an origin server in a staging environment with a notification set via email.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1rPMDAiXMEg7DUPWHSI33a/74089596eb49bfde3619b08fa3b72941/image7.png" />
            
            </figure><p>Once you set up a notification, you will be alerted when there is a change in the health of your origin server. In the example above, if your staging environment starts responding with anything other than a 200 OK response code, we will send you an email within seconds so you can take the necessary action before customers are impacted.</p>
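<p>The alerting behaviour described above - notifying on a <em>change</em> in health rather than on every failing probe - can be sketched as follows. This is a minimal, hypothetical model; the class and function names are illustrative, not Cloudflare's implementation.</p>

```typescript
// Edge-triggered alerting: record an alert only when the computed
// health status changes, not on every unhealthy observation.

type Health = "healthy" | "unhealthy";

function classify(responseCode: number, expectedCodes: number[]): Health {
  return expectedCodes.includes(responseCode) ? "healthy" : "unhealthy";
}

class HealthNotifier {
  private last: Health | null = null;
  readonly alerts: string[] = [];

  observe(responseCode: number, expectedCodes: number[]): void {
    const current = classify(responseCode, expectedCodes);
    if (this.last !== null && current !== this.last) {
      // In the real product, this is where the notification email goes out.
      this.alerts.push(`status changed: ${this.last} -> ${current}`);
    }
    this.last = current;
  }
}
```

<p>Because alerts fire only on transitions, a flapping origin produces one alert per change rather than one per probe.</p>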
    <div>
      <h3>Introducing Standalone Health Check Analytics</h3>
      <a href="#introducing-standalone-health-check-analytics">
        
      </a>
    </div>
    <p>Once you get the notification email, we provide tools that help you quickly debug the possible cause of the issue, with detailed logs as well as data visualizations that give you better context around the issue. Let’s walk through a real-world scenario and see how Health Check Analytics helps decrease our time to resolution.</p><p>A notification email has been sent to you letting you know that Staging is unhealthy. You log into your dashboard and go into Health Check Analytics for this particular Health Check. In the screenshot below, you can see that Staging is up 76% of the time vs 100% of the time for Production. Now that the graph validates the email notification that there is indeed a problem, we need to dig in further. Below the graph you can see a breakdown of the types of errors that have taken place in both the Staging and Production addresses over the specified time period. We see there is only one error taking place, but very frequently, in the staging environment - a TCP Connection Failed error, leading to the lower availability.</p><p>This starts to narrow the funnel for what the issue could be. We know that there is something wrong with either the Staging server's ability to receive connections - perhaps a failure during the TCP handshake - or possibly an intermediate router, in which case the origin server itself is fine and we are seeing a downstream consequence. With this information, you can quickly make the necessary checks to validate your hypothesis and minimize your time to resolution. Instead of having to dig through endless logs, or making educated guesses at where the issue could stem from, Health Check Analytics allows you to quickly home in on the areas that could be the root cause. This in turn maximizes your application reliability and, more importantly, keeps the trust and brand expectations of your customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5u9fZspxov0t9Zv6MpDz07/f0e81032ac1c5cc5c7dbba46cb104195/image1-5.png" />
            
            </figure><p>Being able to quickly understand an overview of your infrastructure is important, but sometimes being able to dig deeper into each Health Check is more valuable for understanding what is happening at a granular level. In addition to general information like address, response code, <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">round trip time (RTT)</a> and failure reason, we are adding more features to help you understand the Health Check result(s). We have also added extra information to the event table so you can quickly understand a given problem. In the case of a Response Code Mismatch Error, we now provide the expected response code for a given Health Check along with the received code. This removes the need to go back and check a configuration that may have been set up long ago, keeping the focus on the problem at hand.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1nItEajnHqqIPpyOC2rxBk/3485f38a0c9b7ac77fa827ea4a8fef49/image4.png" />
            
            </figure><p>The availability of different portions of your infrastructure is very important, but it does not provide the complete view. Performance continues to skyrocket in importance and value to customers. If an application is not performant, they will quickly go to a competitor without a second thought. Sometimes RTT alone is not enough to understand why requests have higher latency and where the root of the issue may reside. To better understand where time is spent for a given request, we are introducing the waterfall view of a request within the Event Log. With this view, you can see the time taken for the TCP connection, the time taken for the TLS handshake, and the time to first byte (TTFB) for the request. The waterfall gives you a chronological idea of the time spent in the different stages of the request.</p><ol><li><p>Time taken to establish the initial TCP connection (in dark blue, approx 41ms).</p></li><li><p>Once the TCP connection is established, time is spent on the TLS handshake. This is another component that takes up time for HTTPS websites (in light blue, approx 80ms).</p></li><li><p>Once the TLS handshake is complete, the time taken for the first byte to be received is also exposed (in dark orange, approx 222ms).</p></li><li><p>The total round trip time (RTT) is the time taken to load the complete page once the TLS handshake is complete (in light orange, approx 302ms). The difference between the RTT and the TTFB gives you the time spent downloading content from the page; if your page has a large amount of content, this difference will be high. Here the page load time is approximately 80ms.</p></li></ol><p>The information above suggests a number of steps the website owner can take. The delay in the initial TCP connection time could be decreased by making the website available in different geographic locations around the world. This could also reduce the time for the TLS handshake, as each round trip would be faster. Another thing that is visible is the page load time of 80ms. This might be due to the contents of the page; perhaps compression can be applied on the server side to improve the load time, or unnecessary content can be removed. The waterfall view can also tell you whether an additional external library increased the time to load the website after a release.</p><p>Cloudflare has over 200 edge locations around the world, making it one of the largest Anycast networks on the planet. When a Health Check is configured, it can be run across the different regions of the Cloudflare infrastructure, enabling you to see the variation in latency around the world for specific Health Checks.</p>
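<p>The arithmetic behind the numbered waterfall stages above can be made explicit. A small helper using the approximate sample values quoted in the text; the function name is ours, not the dashboard's.</p>

```typescript
// Content download time is the difference between the total round trip
// time (RTT) and the time to first byte (TTFB).
function downloadTimeMs(rttMs: number, ttfbMs: number): number {
  return rttMs - ttfbMs;
}

// With the approximate values from the waterfall above:
const pageLoadMs = downloadTimeMs(302, 222); // 302 - 222 = 80 ms
```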
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2mvXsdyNuJ4mV3QPhcbjag/a0043711628a194fddb8ce60c6aab2a2/image2-4.png" />
            
            </figure><p>Waterfall from India</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ehZAiYItuf69fyMo7Gful/d723821ec1d4f7f2ac662a94e7d29965/image5-2.png" />
            
            </figure><p>Waterfall from Western North America‌‌</p><p>Based on the new information provided from Health Check Analytics, users can definitively validate that the address performs better from Western North America compared to India due to the page load time and overall RTT.</p>
    <div>
      <h3>How do health checks run?</h3>
      <a href="#how-do-health-checks-run">
        
      </a>
    </div>
    <p>To understand and decipher the logs found in the analytics dashboard, it is important to understand how Cloudflare runs the Health Checks. Cloudflare has data centers in more than 200 cities across 90+ countries throughout the world [<a href="/cloudflare-expanded-to-200-cities-in-2019/">more</a>]. We don’t run health checks from every single one of these data centers (that would be a lot of requests to your servers!). Instead, we let you pick between one and thirteen regions from which to run health checks [<a href="https://api.cloudflare.com/#health-checks-create-health-check">Regions</a>].</p><p>The Internet is not the same everywhere around the world, so your users may not have the same experience on your application depending on where they are. Running Health Checks from different regions lets you know the health of your application from the point of view of the Cloudflare network in each of these regions.</p><p>Imagine you configure a Health Check from two regions, Western North America and South East Asia, at an interval of 10 seconds. You may expect to see two requests to your origin server every 10 seconds, but if you look at your server’s logs you will see that you are actually getting six. That is because we send Health Checks not just from one location in each region but three.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6thlddjekU628u1WI3wPrY/204fd64bba493ca2d3b3969ca9a8ddff/image3-1.png" />
            
            </figure><p>For a health check configured from All Regions (thirteen regions) there will be 39 requests to your server per configured interval.</p><p>You may wonder: ‘Why do you probe from multiple locations within a region?’ We do this to make sure the health we report represents the overall performance of your service as seen from that region. Before we report a change, we check that at least two locations agree on the status. We added a third one to make sure that the system keeps running even if there is an issue at one of our locations.</p>
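<p>The fan-out and agreement rules described above are easy to express in code. A hypothetical sketch; the constants and names are illustrative, not Cloudflare's implementation.</p>

```typescript
// Each configured region probes from three locations per interval,
// and a region's reported status follows the majority of those probes.

const LOCATIONS_PER_REGION = 3;

function probesPerInterval(regionCount: number): number {
  return regionCount * LOCATIONS_PER_REGION;
}

function regionHealthy(locationResults: boolean[]): boolean {
  // At least two of the three locations must agree before a status
  // is reported, so one misbehaving location cannot flip it on its own.
  const healthyVotes = locationResults.filter((r) => r).length;
  return healthyVotes >= 2;
}
```

<p>This is why two regions produce six requests per interval, "All Regions" produces 39, and a single flaky probe location does not trigger a status change.</p>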
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Health Check Analytics is now live and available to all Pro, Business and Enterprise customers!  We are very excited to help decrease your time to resolution and ensure your application reliability is maximised.</p> ]]></content:encoded>
            <category><![CDATA[Insights]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">6xUAxLdz1IsT1SOEn5ErYy</guid>
            <dc:creator>Fabienne Semeria</dc:creator>
            <dc:creator>Brian Batraski</dc:creator>
            <dc:creator>George Thomas</dc:creator>
        </item>
    </channel>
</rss>