Upgrading Cloud Infrastructure Made Easier and Safer Using Cloudflare Workers and Workers KV

This is a guest post by Ben Chartrand, who is a Development Manager at Timely. You can check out some of Ben's other Workers projects on his GitHub and his blog.
At Timely we started a project to migrate our web applications from legacy Azure services to a modern PaaS offering. In theory, this meant no code changes.
We decided to start with our webhooks. All our endpoints can be grouped into four categories:
Integrations with internal tools, e.g. HelpScout, and a monitoring endpoint for PagerDuty
Payment confirmations
Calendar integrations, e.g. Google Calendar
SMS confirmations
Despite their limited number, these are vitally important. We did a lot of testing, but it was clear we'd only really know whether everything was working once we had production traffic. How could we migrate that traffic?
Option 1
Change the CNAME to point to the new hosting infrastructure. This is high risk: DNS changes take time to propagate, so if we needed to roll back, that would take time too. We would also be shifting everything over at once.
Option 2
Use a traffic manager to shift a percentage of traffic to the new infrastructure using Cloudflare Load Balancing. We could start at, say, 5% of traffic and, assuming everything looked OK, slowly increase it.
In our case the vast majority of our traffic goes to our calendar integration endpoints. The other endpoints were unlikely to receive any traffic at all, especially if we started with just 5%. This wasn't the best option.
Enter Option 3: Cloudflare Workers and Workers KV
I remember thinking: wouldn’t it be great if we could migrate traffic one endpoint at a time? We have about 20. We could start at the low risk endpoints and progressively move our way up.
We were able to write a Cloudflare Worker script that:
Detected the path, e.g. /webhooks/paypal
If the path matched one of our endpoints, checked Workers KV (key-value storage) to see whether that endpoint was enabled. This was our feature flag / setting
If the endpoint was enabled, redirected the request to the new infrastructure. This involved changing the domain but otherwise keeping the request as-is, e.g. webhooks.currentdomain.com/webhooks/paypal became webhooks.newinfrastructure.com/webhooks/paypal
The first step was to add passThroughOnException, as mentioned in this post.
addEventListener('fetch', event => {
  event.passThroughOnException()
  event.respondWith(handleRequest(event))
})
Next, in the handleRequest method, I created a map of each endpoint (the path) to the corresponding Workers KV key, so I knew where to look for the setting.
const endpoints = new Map()
endpoints.set('/monitoring', 'monitoring')
endpoints.set('/paypal', 'payPalIpnWebHook')
// more endpoints
endpoints.set('/helpscout', 'helpScoutWebHook')
Next I inspect the path for each request. If the path matches one of our endpoints and its KV setting is enabled, we set a redirect flag.
for (const [path, settingKey] of endpoints.entries()) {
  if (currentUrl.pathname.startsWith(path)) {
    // KV returns values as strings
    const flag = await WEBHOOK_SETTINGS.get(settingKey)
    if (flag === '1') {
      console.log(`redirected: ${path}`)
      redirect = true
      break
    }
  }
}
If the redirect flag is true, we change the hostname in the request but leave everything else as-is. This involves creating a new Request object. If we are not redirecting, we simply fetch the original request.
// Handle the request
let response = null
if (redirect) {
  // Redirect to the new infra
  const newUrl = request.url.replace(currentHost, newHost)
  const init = {
    method: request.method,
    headers: request.headers,
    body: request.body
  }
  console.log(newUrl)
  const redirectedRequest = new Request(newUrl, init)
  console.log(redirectedRequest)
  response = await fetch(redirectedRequest)
} else {
  // Handle with the existing infra
  response = await fetch(request)
}
Complete Code
addEventListener('fetch', event => {
  event.passThroughOnException()
  event.respondWith(handleRequest(event))
})

function postLog(data) {
  return fetch("http://logs-01.loggly.com/inputs/<my id>/tag/http/", {
    method: "POST",
    body: data
  })
}

async function handleRequest(event) {
  try {
    const request = event.request
    const currentHost = 'webhooks.currentdomain.com'
    const newHost = 'webhooks.newinfrastructure.com'
    const currentUrl = new URL(request.url)
    let redirect = false

    // This is a map of the paths and the corresponding KV entry
    const endpoints = new Map()
    endpoints.set('/monitoring', 'monitoring')
    endpoints.set('/paypal', 'payPalIpnWebHook')
    // more endpoints
    endpoints.set('/helpscout', 'helpScoutWebHook')

    for (const [path, settingKey] of endpoints.entries()) {
      if (currentUrl.pathname.startsWith(path)) {
        // KV returns values as strings
        const flag = await WEBHOOK_SETTINGS.get(settingKey)
        if (flag === '1') {
          console.log(`redirected: ${path}`)
          redirect = true
          break
        }
      }
    }

    // Handle the request
    let response = null
    if (redirect) {
      // Redirect to the new infra
      const newUrl = request.url.replace(currentHost, newHost)
      const init = {
        method: request.method,
        headers: request.headers,
        body: request.body
      }
      console.log(newUrl)
      const redirectedRequest = new Request(newUrl, init)
      console.log(redirectedRequest)
      response = await fetch(redirectedRequest)
    } else {
      // Handle with the existing infra
      response = await fetch(request)
    }
    return response
  } catch (error) {
    // Log the exception, then re-throw; passThroughOnException
    // lets the original infrastructure handle the request
    event.waitUntil(postLog(error))
    throw error
  }
}
Why use Workers KV?
We could have hard-coded everything in the script and updated it each time we wanted to enable or disable redirection of traffic. But that would have required the team to make code changes and deploy the Worker every time.
Using Workers KV, any member of the team can enable or disable endpoints using the Cloudflare API. To make things easier, I created a Postman collection and shared it.
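For example, here's a minimal sketch of flipping a flag through the Cloudflare API's KV endpoint. The account ID, namespace ID, and API token are placeholders, and setEndpointFlag is a helper name of my own:

// Placeholders: fill in your own account details
const ACCOUNT_ID = '<account id>'
const NAMESPACE_ID = '<namespace id>'
const API_TOKEN = '<api token>'

// Write a raw value to a KV key via the Cloudflare REST API
async function setEndpointFlag(key, value) {
  const url = `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/storage/kv/namespaces/${NAMESPACE_ID}/values/${encodeURIComponent(key)}`
  const response = await fetch(url, {
    method: 'PUT',
    headers: { 'Authorization': `Bearer ${API_TOKEN}` },
    body: value
  })
  return response.json()
}

// e.g. enable redirection for the PayPal endpoint:
// await setEndpointFlag('payPalIpnWebHook', '1')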
Go Live Problems - and Solutions!
We went live with our first endpoint. The Worker script and KV worked fine, but I noticed a small number of exceptions being reported in Workers > Worker Status.
Cloudflare provides Debugging Tips. I followed the section “Make subrequests to your debug server” and decided to incorporate Loggly. I could now catch the exceptions and send them to Loggly by POSTing to the URL Loggly provides, using fetch. With this I quickly determined what the problem was and corrected the issue.
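One detail worth noting: POSTing an Error object directly sends only its string form, so it pays to serialize it first. A variation on the catch block above (the JSON shape here is my own, not anything Loggly requires):

} catch (error) {
  // Serialize the error so the stack trace survives the POST
  event.waitUntil(postLog(JSON.stringify({
    message: error.message,
    stack: error.stack,
    url: event.request.url
  })))
  throw error
}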
Another problem that came up was a plethora of 403s. This was highly visible in the Workers > Status Code graph (the green area).
It turned out our IIS box had rate limiting set up. Instead of returning a 429 (Too Many Requests), it returned a 403 (Forbidden). Phew: it wasn't an issue with my Worker or the new infrastructure!
We could have set up rate limiting on the new infrastructure, but we opted instead for Cloudflare Rate Limiting. It was cheap, easy to set up, and meant the blocked requests never hit our infrastructure in the first place.
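Rate Limiting rules can be created in the dashboard, or managed via the API. A rough sketch of the latter; the zone ID, token, URL pattern, and thresholds are all placeholders, and you should check Cloudflare's Rate Limiting API documentation for the exact schema:

const ZONE_ID = '<zone id>'
const API_TOKEN = '<api token>'

// Create a rule that blocks clients exceeding 60 requests per minute
// to the webhook endpoints (all values here are illustrative)
async function createRateLimit() {
  const url = `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/rate_limits`
  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      match: { request: { url: 'webhooks.currentdomain.com/webhooks/*' } },
      threshold: 60,
      period: 60,
      action: { mode: 'ban', timeout: 60 }
    })
  })
  return response.json()
}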
Where to From Here?
As I write this, we've transitioned all traffic and all endpoints are enabled. Once we're ready to decommission the old infrastructure we will:
Change the CNAME to point to the new infrastructure
Disable the worker
Celebrate!
We'll then move on to our other web applications, such as our API and main web app. We're likely to use one of two options:
Use the traffic manager to migrate a percentage of traffic
Migrate traffic on a per-customer basis. This would be similar to the above, except KV would store a setting per customer, and we would identify the customer from a request header containing the customer ID. We could, for example, start with internal test accounts, then our beta users and, at the very end, migrate our VIPs. A sketch of this lookup follows the list.
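For the per-customer variant, the check might look something like this. The CUSTOMER_SETTINGS KV binding and the X-Customer-Id header are assumptions for illustration:

// Decide whether to redirect based on a per-customer KV flag.
// CUSTOMER_SETTINGS would be a KV namespace bound to the Worker.
async function shouldRedirect(request) {
  // Identify the customer from a request header carrying their ID
  const customerId = request.headers.get('X-Customer-Id')
  if (customerId === null) {
    return false
  }
  // One KV entry per customer: '1' means they've been migrated
  const migrated = await CUSTOMER_SETTINGS.get(customerId)
  return migrated === '1'
}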