This is a guest post by Jamie Mason, who is the Head of Test Servers at SamKnows. This post originally appears on the SamKnows Megablog.
We leveraged Cloudflare Workers to expand the SamKnows measurement infrastructure.
At SamKnows, we run lots of tests to measure internet performance. Actually, that’s an understatement. Our software is embedded on tens of millions of devices, and that number grows daily.
We measure performance between the user’s home and the internet, across dozens of metrics. Some of these metrics measure the performance of major video-streaming services, popular games, or large websites. Others focus on the more traditional ‘quality of service’ metrics: speed, latency, and packet loss.
In order to measure speed, latency, and packet loss, SamKnows needs test servers to carry out the measurements against. These servers should be relatively near to the user’s home - this ensures that we’re measuring solely the user’s internet connection (i.e. what their Internet Service Provider sells them) and not some external factor.
As a result, we manage high-capacity test servers all over the world. Some are donated by research groups, some we host ourselves in major data centers, and still others are run inside ISPs’ own networks.
Customers often ask us why we don’t make use of cloud hosting providers to host our test servers. The response is always the same: bandwidth is far too expensive (it is typically 10-20x the cost of a dedicated/colocated server) and their global coverage is too limited. Whilst we revisit this topic regularly, it has never made sense for us to use a cloud provider so far.
Despite the size of our testing infrastructure, we’re always on the lookout for ways to improve and extend our platform. Having to regularly source reliable colocation with 10Gbps connectivity in emerging markets is a nice problem to have, but a problem nonetheless!
Cloudflare launches “Workers”
Cloudflare is one of the world’s largest CDNs, with an extensive network spanning 165 data centres across 6 continents. This means that most internet users will be within a few tens of milliseconds to the nearest Cloudflare location.
In March 2018, Cloudflare launched a new product called Workers, based on the W3C Service Workers standard. This allows developers to run code directly on Cloudflare’s network. The typical use case for this is for applying complex HTTP request filtering or caching logic, before the request hits the origin server.
However, the potential to run code at 165 well-connected locations globally gave us other ideas:
Could we write the SamKnows measurement server software in a Workers script?
The benefit would be significant: we would immediately gain the ability to run measurements in many new locations. Cloudflare also regularly adds new locations — they’ve added 10 since the first draft of this blog post — which would automatically be available for us as soon as they were added. Moreover, Cloudflare’s position as a CDN means that they will likely have good connectivity to most ISPs in major markets and are incentivised to keep improving that connectivity, as that improves service for their customers. In short, our interests are well-aligned.
Please note: the idea being discussed here should not be confused with our CDN test, which seeks to measure TCP connection times from a user’s home to major CDNs, including Akamai, Apple, Microsoft, Google, and — of course — Cloudflare.
Implementing a measurement server in Workers
Implementing the first prototype of the measurement server on Workers was surprisingly easy. We didn’t even have a Cloudflare account at this point, so we signed up for the free plan and paid the $5 for the Workers upgrade. Within a couple of hours, we had an early prototype working, which initially only supported a download speed test.
Of course, the 80/20 rule applies, and significantly more work was required to make it scalable, perform well under all situations, and support other metrics. But the early signs were promising.
Unlike most Workers scripts, our code does not make any requests to any origin servers. We generate content from directly inside the Worker for a download speed test and read content from the client for an upload speed test. Care had to be taken to ensure that the content we are generating is sufficiently random (some middleboxes will transparently compress data, which would interfere with our measurements). We also must be mindful of the resource limits in place on Cloudflare workers: 5–50ms of CPU time and 128MB RAM. Luckily, our experience in writing software for resource-constrained embedded devices served us well here!
We also reached out to Cloudflare at this point to let them know what we were up to, and they were very supportive and interested in our unusual use of Workers. We even met the team behind Workers at the launch party at Cloudflare’s office in London, which is just around the corner from ours.
Cloudflare Workers is based upon the W3C Service Workers standard. This means that we are subject to the limitations of this standard: we can only exchange HTTP traffic (over TCP), we cannot use UDP. This means that none of our UDP measurements can run to our Workers server.
As a result, our Workers server can support the following measurements:
- TCP download speed test (single or multiple concurrent connections)
- TCP upload speed test (single or multiple concurrent connections)
- Round-trip latency (ICMP)
- Packet loss (ICMP)
Before we could consider Cloudflare as a viable measurement server host, we needed to test their performance. We did this by configuring Whiteboxes to run measurements to both our existing servers and our Cloudflare Workers server, and then comparing the performance.
In the autumn of 2018, we configured 10,000 Whiteboxes to run measurements to both Cloudflare and their existing measurement server. These 10,000 Whiteboxes were distributed across Europe, North America, Asia, and South America.
On average, the difference in measured speed to Cloudflare and to our existing servers was 0.1%.
There were certainly outliers though, sometimes with Cloudflare vastly underperforming compared to our existing infrastructure, and sometimes with it vastly outperforming it. So, we decided to do a deeper dive into two markets.
We chose Germany — with a variety of internet access technologies and speeds — and Singapore, whose citizens are blessed with high-speed connections of 500Mbps or more, with 1Gbps being quite common. Crucially, we have existing servers located at major internet exchanges in both Germany and Singapore, that are capable of dealing with very high-speed connections.
Deep dive: Germany
In Germany, we saw quite a few instances of peak hour performance dips to both Cloudflare and our existing infrastructure. As an example, here’s the average performance of Deutsche Telekom’s 100Mbps users during early November 2018:
As you can see in this graph, there’s a lot of deep troughs, mostly correlating with peak hours. This is indicative of peak-hour congestion somewhere on the path. The interesting aspect of this from our perspective is that devices testing to Cloudflare (in blue) not only see similar speeds, but also see less impact from this congestion. This suggests that routing is playing a large part in this, and that Deutsche Telekom’s routes to Cloudflare are shorter and/or less congested than those to our three existing servers provided by Measurement Lab.
On an individual device level, we can see that most German devices report very comparable speeds between Cloudflare and M-Lab. Here’s an example, a 200Mbps Vodafone Kabel Deutschland 200Mbps user, who sees barely any difference between M-Lab and Cloudflare:
Deep dive: Singapore
Many of our Singaporean users are lucky enough to have a 1Gbps fibre-to-the-home service. However, we know from experience that the peering issues Singapore can mean that performance can be extremely variable, depending on the route your traffic takes. What we found was rather interesting.
The above graph plots MyRepublic’s 1Gbps download speeds in early November. Their performance to our existing server (in red), hosted in Equinix’s facility in Singapore, is fantastic — reaching the limits of what could be reliably squeezed out of a 1Gbps connection. This speaks volumes as to the quality of the connectivity to the existing server in Singapore. The Cloudflare results (in blue) are a bit lower and more variable. Whilst speeds are still high, it’s clear that there’s a difference.
MyRepublic users have a much better time of it than Singtel’s 1Gbps users, however, who provide us with a great example of what things look like when there’s suboptimal connectivity between two parties. In this chart, we show Singtel’s 1Gbps performance to Cloudflare (blue) and our existing measurement server (red):
Ouch! We can see that speeds to Cloudflare are almost 600Mbps lower than those of our existing server for Singtel users, and far worse than those of MyRepublic.
What’s the cause of this? That’s a big can of very controversial worms. In short, incumbent ISPs (Singtel in this instance) in some markets can charge other networks large amounts for ‘paid peering’ (paid-for connectivity between their two networks). If the provider opts not to pay for such connectivity, then traffic between them and the ISP could take a less optimal route. The concept of peering disputes is not new, over time there have been disputes that even caused parts of the Internet to become segmented and unreachable from some other networks.
The implication of the chart above is that Cloudflare does not have peering with Singtel, so the measurements for Singtel users are relatively poor compared to MyRepublic (who connect with Cloudflare via SGIX).
Shining a light on such issues is beneficial to users, as it helps demonstrate that speed tests to servers hosted by your ISP are not necessarily representative of performance to other networks.
Launching SamKnows cloud measurement servers
As a result of our testing, and with some tweaks in place as a result of this, we’re happy to announce that we are now offering Cloudflare Workers powered measurement servers as part of our product portfolio! This is now in place in the Cloud section of our Product Map, and allows us to instantly provide low-cost testing infrastructure for new customers in a large number of locations across the world. To visualise what we’ve just added to our fleet of test servers, Cloudflare have a lovely map showing just how widespread their presence is:
Ready to launch
So, what’s next? Well, we have already started using our new Workers-based infrastructure for several of our own measurement studies. This is transparent to users but will increase our capacity and allow us to grow into new markets.
In the future, we hope to be able to expand the set of measurements we can offer using Workers. The addition of WebAssembly support in late 2018 is a positive step towards this, but the killer application for us would be if Workers were able to terminate WebSockets and even arbitrary TCP/UDP connections in the future.
We’re excited to add a major CDN to our infrastructure, as CDNs represent an ever more important aspect of the modern-day internet experience. This is also just the first of our cloud offerings, with work continuing with other providers and solutions - it’s an exciting time for the test infrastructure team at SamKnows. And in addition to expanding our testing infrastructure, we’re also looking forward to releasing our new Rapid Build Framework for iOS and Android, greater mapping functionality in SamKnows One, trigger testing, and further test servers to measure and visualise internet performance worldwide - all by the end of Q1 2019.Lastly, we are currently using our new Workers infrastructure to test our next generation Whitebox, at speeds way above 1Gbps. The results are looking very promising so far — keep your eyes peeled for an exciting blog post on this soon!