In January, we announced the Cloudflare Waiting Room, which has been available to select customers through Project Fair Shot to help COVID-19 vaccination web applications handle demand. Back then, we mentioned that our system was built on top of Cloudflare Workers and the then brand new Durable Objects. In the coming days, we are making Waiting Room available to customers on our Business and Enterprise plans. As we are expanding availability, we are taking this opportunity to share how we came up with this design.
What does the Waiting Room do?
You may have seen lines of people queueing in front of stores or other buildings during sales for a new sneaker or phone. That is because stores have restrictions on how many people can be inside at the same time. Every store has its own limit based on the size of the building and other factors. If more people want to get inside than the store can hold, a line forms outside and people wait their turn to enter.
The same situation applies to web applications. When you build a web application, you have to budget for the infrastructure to run it. You make that decision according to how many users you think the site will have. But sometimes, the site can see surges of users above what was initially planned. This is where the Waiting Room can help: it stands between users and the web application and automatically creates an orderly queue during traffic spikes.
The main job of the Waiting Room is to protect a customer’s application while providing a good user experience. To do that, it must make sure that the number of users of the application around the world does not exceed limits set by the customer. Using this product should not degrade performance for end users, so it should not add significant latency and should admit them automatically. In short, this product has three main requirements: respect the customer’s limits for users on the web application, keep latency low, and provide a seamless end user experience.
When there are more users trying to access the web application than the limits the customer has configured, new users are given a cookie and greeted with a waiting room page. This page displays their estimated wait time and automatically refreshes until the user is admitted to the web application.
Configuring Waiting Rooms
The important configurations that define how the waiting room operates are:
Total Active Users - the total number of active users that can be using the application at any given time
New Users Per Minute - how many new users per minute are allowed into the application, and
Session Duration - how long a user session lasts. Note: the session is renewed as long as the user is active. We terminate it after Session Duration minutes of inactivity.
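As a rough sketch, these settings could be modeled as a small configuration object in TypeScript. The field names and the sample values below are illustrative, not the product's actual API.

// Illustrative shape of a waiting room configuration (field names and
// values are assumptions for this sketch, not the real API).
interface WaitingRoomConfig {
  totalActiveUsers: number;       // max users on the application at any time
  newUsersPerMinute: number;      // max new users admitted per minute
  sessionDurationMinutes: number; // inactivity window before a session ends
}

const config: WaitingRoomConfig = {
  totalActiveUsers: 250,
  newUsersPerMinute: 200,
  sessionDurationMinutes: 5, // value chosen only for illustration
};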
How does the waiting room work?
If a web application is behind Cloudflare, every request from an end user to the web application goes to a Cloudflare data center close to them. If the waiting room is enabled for the web application, Cloudflare issues a ticket to this user in the form of an encrypted cookie.
Waiting Room Overview
At any given moment, every waiting room has a limit on the number of users that can go to the web application. This limit is based on the customer configuration and the number of users currently on the web application. We refer to the number of users that can go into the web application at any given time as the number of user slots. The total number of user slots is equal to the limit configured by the customer minus the total number of users that have been let through.
When a traffic surge happens on the web application, the number of user slots available keeps decreasing. Current user sessions need to end before new users go in, so user slots keep decreasing until there are none left. At this point the waiting room starts queueing.
The chart above shows a customer's traffic to a web application between 09:40 and 11:30. The configuration for total active users is set to 250 users (yellow line). As time progresses, more and more users arrive on the application (green line), so the number of user slots available (orange line) keeps decreasing. Once no slots are left, users start queueing (blue line). Queueing users ensures that the total number of active users stays around the configured limit.
To effectively calculate the user slots available, every service at the edge data centers should let its peers know how many users it lets through to the web application.
Coordination within a data center is faster and more reliable than coordination between many different data centers. So we decided to divide the user slots available on the web application into individual limits for each data center. The advantage of doing this is that only a data center's limit will get exceeded if there is a delay in propagating traffic information. This ensures we don't overshoot by much even if the latest information arrives late.
The next step was to figure out how to divide the user slots between data centers. For this we decided to use the historical traffic data on the web application. More specifically, we track how many different users tried to access the application through each data center in the preceding few minutes. The great thing about historical traffic data is that it is historical and cannot change anymore. So even with a delay in propagation, historical traffic data will be accurate even when the current traffic data is not.
Let's see an actual example: the current time is Thu, 27 May 2021 16:33:20 GMT. For the minute Thu, 27 May 2021 16:31:00 GMT there were 50 users in Nairobi and 50 in Dublin. For the minute Thu, 27 May 2021 16:32:00 GMT there were 45 users in Nairobi and 55 in Dublin. This was the only traffic on the application during that time.
Every data center looks at what the share of traffic to each data center was two minutes in the past. For Thu, 27 May 2021 16:33:20 GMT that is the minute Thu, 27 May 2021 16:31:00 GMT.
Thu, 27 May 2021 16:31:00 GMT:
{
Nairobi: 0.5, //50/100(total) users
Dublin: 0.5, //50/100(total) users
},
Thu, 27 May 2021 16:32:00 GMT:
{
Nairobi: 0.45, //45/100(total) users
Dublin: 0.55, //55/100(total) users
}
For the minute Thu, 27 May 2021 16:33:00 GMT, the number of user slots available will be divided equally between Nairobi and Dublin, as the traffic ratio for Thu, 27 May 2021 16:31:00 GMT is 0.5 and 0.5. So, if there are 1000 slots available, Nairobi can send 500 users and Dublin can send 500.
For the minute Thu, 27 May 2021 16:34:00 GMT, the number of user slots available will be divided using the ratio 0.45 (Nairobi) to 0.55 (Dublin). So if there are 1000 slots available, Nairobi can send 450 users and Dublin can send 550.
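Here is a minimal sketch of that allocation in TypeScript, assuming we already have the per-data-center user counts from two minutes ago. The function name and shapes are illustrative, not the production code.

// Split the globally available user slots between data centers in
// proportion to their traffic two minutes ago. Illustrative sketch only.
function allocateSlots(
  slotsAvailable: number,
  historicalUsers: Record<string, number>, // e.g. { Nairobi: 50, Dublin: 50 }
): Record<string, number> {
  const total = Object.values(historicalUsers).reduce((a, b) => a + b, 0);
  const colos = Object.keys(historicalUsers);
  const allocation: Record<string, number> = {};
  for (const [colo, users] of Object.entries(historicalUsers)) {
    // With no history at all, fall back to an even split.
    const ratio = total > 0 ? users / total : 1 / colos.length;
    allocation[colo] = Math.floor(slotsAvailable * ratio);
  }
  return allocation;
}

// 16:33 uses the 16:31 history (50/50): 1000 slots split 500/500.
allocateSlots(1000, { Nairobi: 50, Dublin: 50 }); // { Nairobi: 500, Dublin: 500 }
// 16:34 uses the 16:32 history (45/55): 1000 slots split 450/550.
allocateSlots(1000, { Nairobi: 45, Dublin: 55 }); // { Nairobi: 450, Dublin: 550 }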
The service at the edge data centers counts the number of users it lets into the web application and starts queueing when the data center limit is approached. Having per-data-center limits that adjust based on historical traffic lets us run a system that does not need to communicate often between data centers.
Clustering
In order to let people access the application fairly, we need a way to keep track of their position in the queue. It's not practical to track every end user in the waiting room individually, so we cluster users who arrive around the same time into one time bucket. A bucket has an identifier (bucketId) calculated based on the time the user tried to visit the waiting room for the first time. All the users who visited the waiting room between 19:51:00 and 19:51:59 are assigned the bucketId 19:51:00.
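A minimal sketch of deriving such a bucketId, by truncating the time of the first visit to the start of the minute (illustrative only, not the exact production code):

// Derive a bucketId by truncating the user's first-visit time to the
// start of the minute. Illustrative sketch only.
function bucketIdFor(firstVisit: Date): string {
  const truncated = new Date(firstVisit);
  truncated.setUTCSeconds(0, 0); // drop seconds and milliseconds
  return truncated.toUTCString(); // e.g. "Wed, 26 May 2021 19:51:00 GMT"
}

// Everyone arriving between 19:51:00 and 19:51:59 lands in the same bucket.
bucketIdFor(new Date("2021-05-26T19:51:42Z")); // "Wed, 26 May 2021 19:51:00 GMT"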
What is in the cookie returned to an end user?
We mentioned an encrypted cookie that is assigned to the user when they first visit the waiting room. Every time the user comes back, they bring this cookie with them. The cookie is a ticket for the user to get into the web application. The content below is the typical information the cookie contains when visiting the web application. This user first visited around Wed, 26 May 2021 19:51:00 GMT, waited for around 10 minutes and got accepted on Wed, 26 May 2021 20:01:13 GMT.
{
"bucketId": "Wed, 26 May 2021 19:51:00 GMT",
"lastCheckInTime": "Wed, 26 May 2021 20:01:13 GMT",
"acceptedAt": "Wed, 26 May 2021 20:01:13 GMT",
}
Here:
bucketId - the bucketId is the cluster the ticket is assigned to. This tracks the position in the queue.
acceptedAt - the time when the user got accepted to the web application for the first time.
lastCheckInTime - the time when the user was last seen in the waiting room or the web application.
Once a user has been let through to the web application, we have to check how long they are eligible to spend there. Our customers can customize how long a user spends on the web application using Session Duration. Whenever we see an accepted user we set the cookie to expire Session Duration minutes from when we last saw them.
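As a rough illustration of that renewal, the sketch below pushes the cookie expiry out by Session Duration minutes from the current request. The cookie name and the encryptTicket helper are hypothetical stand-ins, not the real implementation.

// Renew an accepted user's session: expire the cookie Session Duration
// minutes after we last saw them. Illustrative sketch only.
function encryptTicket(ticket: unknown): string {
  return btoa(JSON.stringify(ticket)); // hypothetical stand-in, NOT real encryption
}

function renewSessionCookie(
  ticket: { bucketId: string; acceptedAt: string; lastCheckInTime: string },
  sessionDurationMinutes: number,
): string {
  const now = new Date();
  ticket.lastCheckInTime = now.toUTCString();
  const expires = new Date(now.getTime() + sessionDurationMinutes * 60_000);
  // "__waiting_room" is an illustrative cookie name, not the product's.
  return `__waiting_room=${encryptTicket(ticket)}; Expires=${expires.toUTCString()}; Secure; HttpOnly`;
}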
Waiting Room State
Previously we talked about the concept of user slots and how we can function even when there is a delay in communication between data centers. The waiting room state helps to accomplish this. It is formed from historical data of events happening in different data centers. So when a waiting room is first created, there is no waiting room state as there is no recorded traffic. The only information available is the customer's configured limits, and based on that we start letting users in. In the background, the service running in the data center (introduced later in this post as the Data Center Durable Object) periodically reports the tickets it has issued to a coordinating service and periodically gets a response back about what is happening around the world.
As time progresses more and more users with different bucketIds show up in different parts of the globe. Aggregating this information from the different data centers gives the waiting room state.
Let's look at an example: there are two data centers, one in Nairobi and the other in Dublin. When there are no user slots available for a data center, users start getting queued. Different users who were assigned different bucketIds get queued. The data center state from Dublin looks like this:
activeUsers: 50,
buckets:
[
{
key: "Thu, 27 May 2021 15:55:00 GMT",
data:
{
waiting: 20,
}
},
{
key: "Thu, 27 May 2021 15:56:00 GMT",
data:
{
waiting: 40,
}
}
]
The same thing is happening in Nairobi and the data from there looks like this:
activeUsers: 151,
buckets:
[
{
key: "Thu, 27 May 2021 15:54:00 GMT",
data:
{
waiting: 2,
},
},
{
key: "Thu, 27 May 2021 15:55:00 GMT",
data:
{
waiting: 30,
}
},
{
key: "Thu, 27 May 2021 15:56:00 GMT",
data:
{
waiting: 20,
}
}
]
This information from the data centers is reported in the background and aggregated to form a data structure similar to the one below:
activeUsers: 201, // 151(Nairobi) + 50(Dublin)
buckets:
[
{
key: "Thu, 27 May 2021 15:54:00 GMT",
data:
{
waiting: 2, // 2 users from (Nairobi)
},
},
{
key: "Thu, 27 May 2021 15:55:00 GMT",
data:
{
waiting: 50, // 20 from Nairobi and 30 from Dublin
}
},
{
key: "Thu, 27 May 2021 15:56:00 GMT",
data:
{
waiting: 60, // 20 from Nairobi and 40 from Dublin
}
}
]
The data structure above is a sorted list of all the bucketIds in the waiting room. The waiting field has information about how many people are waiting with a particular bucketId. The activeUsers field has information about the number of users who are active on the web application.
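A minimal sketch of that aggregation, merging per-data-center reports into a single sorted bucket list (the types and names are illustrative, not the exact production code):

// Merge per-data-center reports into one waiting room state: sum
// activeUsers and sum the waiting count per bucketId, keeping the buckets
// sorted oldest first. Illustrative sketch only.
interface Bucket { key: string; data: { waiting: number } }
interface DataCenterState { activeUsers: number; buckets: Bucket[] }

function aggregate(states: DataCenterState[]): DataCenterState {
  const waitingByBucket = new Map<string, number>();
  let activeUsers = 0;
  for (const state of states) {
    activeUsers += state.activeUsers;
    for (const bucket of state.buckets) {
      waitingByBucket.set(
        bucket.key,
        (waitingByBucket.get(bucket.key) ?? 0) + bucket.data.waiting,
      );
    }
  }
  const buckets = [...waitingByBucket.entries()]
    .map(([key, waiting]) => ({ key, data: { waiting } }))
    .sort((a, b) => Date.parse(a.key) - Date.parse(b.key)); // oldest first
  return { activeUsers, buckets };
}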
Imagine for this customer, the limits they have set in the dashboard are:
Total Active Users - 200
New Users Per Minute - 200
As per their configuration, only 200 users can be on the web application at any time. So the user slots available for the waiting room state above are 200 - 201 (activeUsers) = -1. With no slots available, no one can go in and new users get queued.
Now imagine that some users have finished their session and activeUsers is now 148.
Now userSlotsAvailable = 200 - 148 = 52 users. We should let the 52 users who have been waiting the longest into the application. We achieve this by giving the eligible slots to the oldest buckets in the queue. In the example below, 2 users are waiting from bucket Thu, 27 May 2021 15:54:00 GMT and 50 users are waiting from bucket Thu, 27 May 2021 15:55:00 GMT. These are the oldest buckets in the queue, so they get the eligible slots.
activeUsers: 148,
buckets:
[
{
key: "Thu, 27 May 2021 15:54:00 GMT",
data:
{
waiting: 2,
eligibleSlots: 2,
},
},
{
key: "Thu, 27 May 2021 15:55:00 GMT",
data:
{
waiting: 50,
eligibleSlots: 50,
}
},
{
key: "Thu, 27 May 2021 15:56:00 GMT",
data:
{
waiting: 60,
eligibleSlots: 0,
}
}
]
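Handing out those slots could look roughly like the sketch below. It assumes the buckets are already sorted oldest first; the function name is illustrative.

// Give the available user slots to the oldest buckets first; a bucket that
// cannot be fully covered gets whatever remains. Illustrative sketch only.
function assignEligibleSlots(
  buckets: { key: string; data: { waiting: number; eligibleSlots?: number } }[],
  userSlotsAvailable: number,
): void {
  let remaining = Math.max(userSlotsAvailable, 0);
  for (const bucket of buckets) { // sorted oldest first
    const slots = Math.min(bucket.data.waiting, remaining);
    bucket.data.eligibleSlots = slots;
    remaining -= slots;
  }
}

// With 52 slots: the 15:54 bucket (2 waiting) and the 15:55 bucket
// (50 waiting) are fully covered; the 15:56 bucket gets 0.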
If there are eligible slots available for all the users in their bucket, then they can be sent to the web application from any data center. This ensures the fairness of the waiting room.
There is another case that can happen where we do not have enough eligible slots for a whole bucket. When this happens things get a little more complicated as we cannot send everyone from that bucket to the web application. Instead, we allocate a share of eligible slots to each data center.
key: "Thu, 27 May 2021 15:56:00 GMT",
data:
{
waiting: 60,
eligibleSlots: 20,
}
As we did before, we use the ratio of past traffic from each data center to decide how many users it can let through. So if the current time is Thu, 27 May 2021 16:34:10 GMT, both data centers look at the traffic ratio in the past at Thu, 27 May 2021 16:32:00 GMT and send a subset of users from those data centers to the web application.
Thu, 27 May 2021 16:32:00 GMT:
{
Nairobi: 0.25, // 0.25 * 20 = 5 eligibleSlots
Dublin: 0.75, // 0.75 * 20 = 15 eligibleSlots
}
Estimated wait time
When a request comes from a user, we look at their bucketId. Based on the bucketId, it is possible to know how many people are in front of the user's bucketId in the sorted list. Similar to how we track activeUsers, we also calculate the average number of users going to the web application per minute. Dividing the number of people in front of the user by the average number of users going to the web application per minute gives us the estimated wait time. This is what is shown to the user who visits the waiting room.
avgUsersToWebApplication: 30,
activeUsers: 148,
buckets:
[
{
key: "Thu, 27 May 2021 15:54:00 GMT",
data:
{
waiting: 2,
eligibleSlots: 2,
},
},
{
key: "Thu, 27 May 2021 15:55:00 GMT",
data:
{
waiting: 50,
eligibleSlots: 50,
}
},
{
key: "Thu, 27 May 2021 15:56:00 GMT",
data:
{
waiting: 60,
eligibleSlots: 0,
}
}
]
In the case above, for a user with bucketId Thu, 27 May 2021 15:56:00 GMT, there are 60 users ahead of them. With 30 avgUsersToWebApplication per minute, the estimated time to get into the web application is 60/30, which is 2 minutes.
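One plausible way to compute this is sketched below. It assumes "users ahead" means everyone still waiting without an eligible slot in buckets up to and including the user's own bucket, which matches the 60 / 30 = 2 minute example above; the exact counting in production may differ.

// Estimate the wait time from the sorted waiting room state.
// Illustrative sketch only; types and names are assumptions.
function estimatedWaitMinutes(
  state: {
    avgUsersToWebApplication: number;
    buckets: { key: string; data: { waiting: number; eligibleSlots: number } }[];
  },
  userBucketId: string,
): number {
  let ahead = 0;
  for (const bucket of state.buckets) { // sorted oldest first
    ahead += Math.max(bucket.data.waiting - bucket.data.eligibleSlots, 0);
    if (bucket.key === userBucketId) break;
  }
  return Math.ceil(ahead / state.avgUsersToWebApplication);
}

// For the state above and bucketId "Thu, 27 May 2021 15:56:00 GMT":
// ahead = 0 + 0 + 60, so 60 / 30 = 2 minutes.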
Implementation with Workers and Durable Objects
Now that we have talked about the user experience and the algorithm, let’s focus on the implementation. Our product is specifically built for customers who experience high volumes of traffic, so we needed to run code at the edge in a highly scalable manner. Cloudflare has a great culture of building upon its own products, so we naturally thought of Workers. The Workers platform uses isolates to scale up and can scale horizontally as more requests arrive.
The Workers product has an ecosystem of tools like wrangler which help us to iterate and debug things quickly.
Workers also reduce long-term operational work.
For these reasons, the decision to build on Workers was easy. The more complex choice in our design was for the coordination. As we have discussed before, our workers need a way to share the waiting room state. We need every worker to be aware of changes in traffic patterns quickly in order to respond to sudden traffic spikes. We use the proportion of traffic from two minutes before to allocate user slots among data centers, so we need a solution to aggregate this data and make it globally available within this timeframe. Our design also relies on having fast coordination within a data center to react quickly to changes. We considered a few different solutions before settling on Cache and Durable Objects.
Idea #1: Workers KV
We started to work on the project around March 2020. At that point, Workers offered two options for storage: the Cache API and KV. Cache is shared only at the data center level, so for global coordination we had to use KV. Each worker writes its own key to KV describing the requests it received and how it processed them. Each key is set to expire after a few minutes if the worker stops writing. To create a workerState, the worker periodically does a list operation on the KV namespace to get the state around the world.
Design using KV
This design has some flaws because KV wasn’t built for a use case like this. The state of a waiting room changes all the time to match traffic patterns. Our use case is write intensive and KV is intended for read-intensive workflows. As a consequence, our proof of concept implementation turned out to be more expensive than expected. Moreover, KV is eventually consistent: it takes time for information written to KV to be available in all of our data centers. This is a problem for Waiting Room because we need fine-grained control to be able to react quickly to traffic spikes that may be happening simultaneously in several locations across the globe.
Idea #2: Centralized Database
Another alternative was to run our own databases in our core data centers. The Cache API in Workers lets us use the cache directly within a data center. If there is frequent communication with the core data centers to get the state of the world, the cached data in the data center should let us respond with minimal latency on the request hot path. There would be fine-grained control on when the data propagation happens and this time can be kept low.
Design using Core Data centers
As noted before, this application is very write-heavy and the data is rather short-lived. For these reasons, a standard relational database would not be a good fit. This meant we could not leverage the existing database clusters maintained by our in-house specialists. Rather, we would need to use an in-memory data store such as Redis, and we would have to set it up and maintain it ourselves. We would have to install a data store cluster in each of our core locations, fine tune our configuration, and make sure data is replicated between them. We would also have to create a proxy service running in our core data centers to gate access to that database and validate data before writing to it.
We could likely have made it work, at the cost of substantial operational overhead. While that is not insurmountable, this design would introduce a strong dependency on the availability of core data centers. If there were issues in the core data centers, it would affect the product globally, whereas an edge-based solution would be more resilient: if an edge data center goes offline, Anycast takes care of routing the traffic to nearby data centers, and the web application is not affected.
The Scalable Solution: Durable Objects
Around that time, we learned about Durable Objects. The product was in closed beta back then, but we decided to embrace Cloudflare’s thriving dogfooding culture and did not let that deter us. With Durable Objects, we could create one global Durable Object instance per waiting room instead of maintaining a single database. This object can exist anywhere in the world and handle redundancy and availability. So Durable Objects give us sharding for free. Durable Objects gave us fine-grained control as well as better availability as they run in our edge data centers. Additionally, each waiting room is isolated from the others: adverse events affecting one customer are less likely to spill over to other customers.
Implementation with Durable Objects
Based on these advantages, we decided to build our product on Durable Objects.
As mentioned above, we use a worker to decide whether to send users to the Waiting Room or the web application. That worker periodically sends a request to a Durable Object saying how many users it sent to the Waiting Room and how many it sent to the web application. A Durable Object instance is created on the first request and remains active as long as it is receiving requests. The Durable Object aggregates the counters sent by every worker to create a count of users sent to the Waiting Room and a count of users on the web application.
A Durable Object instance is only active as long as it is receiving requests and can be restarted during maintenance. When a Durable Object instance is restarted, its in-memory state is cleared. To preserve the in-memory data on Durable Object restarts, we back up the data using the Cache API. This offers weaker guarantees than using the Durable Object persistent storage as data may be evicted from cache, or the Durable Object can be moved to a different data center. If that happens, the Durable Object will have to start without cached data. On the other hand, persistent storage at the edge still has limited capacity. Since we can rebuild state very quickly from worker updates, we decided that cache is enough for our use case.
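A rough sketch of that backup using the Workers Cache API is below. The cache key URL and the state shape are assumptions made for this sketch.

// Back up the Durable Object's in-memory state to the per-data-center cache
// so a restart can start from a recent snapshot. Illustrative sketch only.
const STATE_KEY = "https://waiting-room.internal/state"; // synthetic cache key

async function backupState(state: unknown): Promise<void> {
  await caches.default.put(
    STATE_KEY,
    new Response(JSON.stringify(state), {
      headers: { "Cache-Control": "max-age=60" }, // short TTL; state is rebuilt quickly anyway
    }),
  );
}

async function restoreState<T>(): Promise<T | null> {
  const cached = await caches.default.match(STATE_KEY);
  return cached ? ((await cached.json()) as T) : null; // may miss after eviction or a move
}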
Scaling up
When traffic spikes happen around the world, new workers are created. Every worker needs to communicate how many users have been queued and how many have been let through to the web application. However, while workers automatically scale horizontally when traffic increases, Durable Objects do not. By design, there is only one instance of any Durable Object. This instance runs on a single thread, so if it receives requests more quickly than it can respond, it can become overloaded. To avoid that, we cannot let every worker send its data directly to the same Durable Object. The way we achieve scalability is by sharding: we create per data center Durable Object instances that report up to one global instance.
Durable Objects implementation
The aggregation is done in two stages: at the data-center level and at the global level.
Data Center Durable Object
When a request comes to a particular location, we can see the corresponding data center by looking at the cf.colo field on the request. The Data Center Durable Object keeps track of the number of workers in the data center and aggregates the state from all those workers. It also responds to workers with important information within a data center, like the number of users making requests to a waiting room or the number of workers. Frequently, it updates the Global Durable Object and receives information about other data centers as the response.
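A sketch of how a worker might address its data center's Durable Object, deriving the instance name from request.cf.colo. The binding name, URL, and report shape are assumptions for this sketch, not the production code.

// Route a worker's periodic report to the Durable Object for its own data
// center, named after the waiting room and the colo the request landed in.
interface Env { DATA_CENTER_DO: DurableObjectNamespace } // illustrative binding name

async function reportToDataCenterDO(
  request: Request,
  env: Env,
  waitingRoomId: string,
  report: { usersAdmitted: number; usersQueued: number },
): Promise<Response> {
  const colo = (request.cf as { colo?: string } | undefined)?.colo ?? "unknown";
  const id = env.DATA_CENTER_DO.idFromName(`${waitingRoomId}-${colo}`);
  const stub = env.DATA_CENTER_DO.get(id);
  // The Durable Object replies with the latest waiting room state for this data center.
  return stub.fetch("https://do.internal/report", {
    method: "POST",
    body: JSON.stringify(report),
  });
}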
Worker User Slots
Above we talked about how a data center gets user slots allocated to it based on the past traffic patterns. If every worker in the data center talks to the Data Center Durable Object on every request, the Durable Object could get overwhelmed. Worker User Slots help us to overcome this problem.
Every worker keeps track of the number of users it has let through to the web application and the number of users that it has queued. The worker user slots are the number of users a worker can send to the web application at any point in time. This is calculated from the user slots available for the data center and the worker count in the data center. We divide the total number of user slots available for the data center by the number of workers in the data center to get the user slots available for each worker. If there are two workers and 10 users that can be sent to the web application from the data center, then we allocate five as the budget for each worker. This division is needed because every worker makes its own decisions on whether to send the user to the web application or the waiting room without talking to anyone else.
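The division itself is simple; as a sketch:

// Divide the data center's available user slots evenly between the workers
// currently running there. Illustrative sketch only.
function workerUserSlots(dataCenterSlots: number, workerCount: number): number {
  if (workerCount <= 0) return 0;
  return Math.floor(dataCenterSlots / workerCount); // 10 slots, 2 workers -> 5 each
}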
Waiting room inside a data center
When the traffic changes, new workers can spin up or old workers can die; the worker count in a data center is dynamic. Here we make a trade-off similar to the one for inter-data-center coordination: there is a risk of overshooting the limit if many more workers are created between calls to the Data Center Durable Object, but too many calls to the Data Center Durable Object would make it hard to scale. In this case though, we can use Cache for faster synchronization within the data center.
Cache
On every interaction with the Data Center Durable Object, the worker saves a copy of the data it receives to the cache. Every worker frequently reads the cache to update the state it has in memory. We also adaptively adjust the rate of writes from the workers to the Data Center Durable Object based on the number of workers in the data center. This helps ensure that we do not take down the Data Center Durable Object when traffic changes.
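One plausible shape for that adjustment, purely illustrative (the real tuning is certainly different): space the writes out as the worker count grows, so the combined request rate at the Durable Object stays roughly constant.

// Purely illustrative: keep the aggregate write rate at the Data Center
// Durable Object roughly constant by spreading a target requests-per-second
// budget across however many workers are currently running.
function reportIntervalMs(workerCount: number, targetDoRequestsPerSecond = 10): number {
  const workers = Math.max(workerCount, 1);
  return Math.ceil((workers / targetDoRequestsPerSecond) * 1000); // more workers -> longer interval per worker
}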
Global Durable Object
The Global Durable Object is designed to be simple and stores the information it receives from any data center in memory. It responds with the information it has about all data centers. It periodically saves its in-memory state to cache using the Workers Cache API so that it can withstand restarts as mentioned above.
Components of waiting room
Recap
This is how the waiting room works right now. Every request to a web application with the waiting room enabled goes to a worker at a Cloudflare edge data center. When this happens, the worker looks for the state of the waiting room in the Cache first. We use cache here instead of the Data Center Durable Object so that we do not overwhelm the Durable Object instance when there is a spike in traffic. Plus, reading data from cache is faster. The workers periodically make a request to the Data Center Durable Object to get the waiting room state, which they then write to the cache. The idea here is that the cache should have a recent copy of the waiting room state.
Workers can examine the request to know which data center they are in. Every worker periodically makes a request to the corresponding Data Center Durable Object. This interaction updates the worker state in the Data Center Durable Object. In return, the workers get the waiting room state from the Data Center Durable Object. The Data Center Durable Object sends the data center state to the Global Durable Object periodically. In the response, the Data Center Durable Object receives all data center states globally. It then calculates the waiting room state and returns that state to a worker in its response.
The advantage of this design is that it's possible to adjust the rate of writes from workers to the Data Center Durable Object and from the Data Center Durable Object to the Global Durable Object based on the traffic received in the waiting room. This helps us respond to requests during high traffic without overloading the individual Durable Object instances.
Conclusion
By using Workers and Durable Objects, Waiting Room was able to scale up to keep web application servers online for many of our early customers during large spikes of traffic. It helped keep vaccination sign-ups online for companies and governments around the world for free through Project Fair Shot: Verto Health was able to serve over 4 million customers in Canada; Ticket Tailor reduced their peak resource utilization from 70% down to 10%; the County of San Luis Obispo was able to stay online during traffic surges of up to 23,000 users; and the country of Latvia was able to stay online during surges of thousands of requests per second. These are just a few of the customers we served and will continue to serve until Project Fair Shot ends.
In the coming days, we are rolling out the Waiting Room to customers on our business plan. Sign up today to prevent spikes of traffic to your web application. If you are interested in access to Durable Objects, it’s currently available to try out in Open Beta.