
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 12 Apr 2026 08:49:55 GMT</lastBuildDate>
        <item>
            <title><![CDATA[More consistent LuaJIT performance]]></title>
            <link>https://blog.cloudflare.com/more-consistent-luajit-performance/</link>
            <pubDate>Wed, 12 Dec 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ A year ago I wrote about a project that Cloudflare were funding at King's College London to help improve LuaJIT. Our twelve months is now up. How did we do?  ]]></description>
            <content:encoded><![CDATA[ <p><sub><i>This is a guest post by </i></sub><a href="http://tratt.net/laurie/"><sub><i>Laurence Tratt</i></sub></a><sub><i>, who is a programmer and Reader in Software Development in the </i></sub><a href="http://www.kcl.ac.uk/nms/depts/informatics/"><sub><i>Department of Informatics</i></sub></a><sub><i> at </i></sub><a href="http://www.kcl.ac.uk/"><sub><i>King's College London</i></sub></a><sub><i> where he leads the </i></sub><a href="http://soft-dev.org/"><sub><i>Software Development Team</i></sub></a><sub><i>. He is also an </i></sub><a href="http://soft-dev.org/projects/lecture/"><sub><i>EPSRC Fellow</i></sub></a><sub><i>.</i></sub></p><p>A year ago I wrote about <a href="https://blog.cloudflare.com/helping-to-make-luajit-faster/">a project that Cloudflare were funding at King's College London to help improve LuaJIT</a>. Our twelve months is now up. How did we do?</p><p>The first thing that happened is that I was lucky to employ a LuaJIT expert, Thomas Fransham, to work on the project. His deep knowledge about LuaJIT was crucial to getting things up and running – 12 months might sound like a long time, but it soon whizzes by!</p><p>The second thing that happened was that we realised that the current state of Lua benchmarking was not good enough for anyone to reliably tell if they'd improved LuaJIT performance or not. Different Lua implementations had different benchmark suites, mostly on the small side, and not easily compared. Although it wasn't part of our original plan, we thus put a lot of effort into creating a larger benchmark suite. This sounds like a trivial job, but it isn't. Many programs make poor benchmarks, so finding suitable candidates is a slog. 
Although we mostly wanted to benchmark programs using <a href="https://soft-dev.org/src/krun/">Krun</a> (see <a href="https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html">this blog post</a> for indirect pointers as to why), we're well aware that most people need a quicker, easier way of benchmarking their Lua implementation(s). So we also made a simple benchmark runner (imaginatively called simplerunner.lua) that does that job. Here's an example of it in use:</p>
            <pre><code>$ lua simplerunner.lua
Running luacheck: ..............................
  Mean: 1.120762 +/- 0.030216, min 1.004843, max 1.088270
Running fannkuch_redux: ..............................
  Mean: 0.128499 +/- 0.003281, min 0.119500, max 0.119847</code></pre>
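<p>The mean and the figure after “+/-” can be computed straightforwardly from the per-iteration timings. Here is a sketch of the idea in Python (illustrative only: simplerunner.lua's actual statistics code may differ in detail):</p>

```python
import math
import statistics

def summarise(times, z=1.96):
    """Summarise in-process iteration timings: mean, an approximate 95%
    confidence interval on the mean (normal approximation, z = 1.96),
    and the fastest/slowest individual iterations."""
    mean = statistics.mean(times)
    # Standard error of the mean, scaled to an approximate 95% interval.
    ci = z * statistics.stdev(times) / math.sqrt(len(times))
    return mean, ci, min(times), max(times)

# 30 made-up iteration timings in seconds.
times = [0.120 + 0.001 * (i % 5) for i in range(30)]
mean, ci, fastest, slowest = summarise(times)
print("Mean: %f +/- %f, min %f, max %f" % (mean, ci, fastest, slowest))
```

<p>Reporting a confidence interval alongside the minimum and maximum gives a feel for the spread of timings rather than just the extremes.</p>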
            <p>Even though it's a simple benchmark runner, we couldn't help but try and nudge the quality of benchmarking up a little bit. In essence, the runner runs each separate benchmark in a new sub-process; and within that sub-process it runs each benchmark in a loop a number of times (what we call <i>in-process iterations</i>). Thus for each benchmark you get a mean time per in-process iteration, and then 95% confidence intervals (the number after ±): this gives you a better idea of the spread of values than the minimum and maximum times for any in-process iterations (though we report those too).</p><p>The third thing we set out to do was to understand the relative performance of the various Lua implementations out there now. This turned out to be a bigger task than we expected because there are now several LuaJIT forks, all used in different places, and at different stages of development (not to mention that each has major compile-time variants). We eventually narrowed things down to the <a href="https://github.com/LuaJIT/LuaJIT">original LuaJIT repository</a> and <a href="https://github.com/raptorjit/raptorjit">RaptorJIT</a>. We then ran an experiment (based on a slightly extended version of the methodology from our <a href="https://soft-dev.org/pubs/html/barrett_bolz-tereick_killick_mount_tratt__virtual_machine_warmup_blows_hot_and_cold_v6/">VM warmup paper</a>), with 15 “process executions” (i.e. separate, new VM processes) and 1500 “in-process iterations” (i.e. the benchmark in a for loop within one VM process). Here are the benchmark results for the original version of LuaJIT:</p>
    <div>
      <h2>Results for LuaJIT</h2>
    </div>
    <p><b>Symbol key:</b> bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.</p><table><tr><th><p><b>Benchmark</b></p></th><th><p><b>Classification</b></p></th><th><p><b>Steady iteration (#)</b></p></th><th><p><b>Steady iteration (s)</b></p></th><th><p><b>Steady performance (s)</b></p></th></tr><tr><td><p>array3d</p></td><td><p></p></td><td><p>2.0
(2.0, 624.3)</p></td><td><p>0.042
(0.040, 80.206)</p></td><td><p>0.12863
±0.000558</p></td></tr><tr><td><p>binarytrees</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12564
±0.000532</p></td></tr><tr><td><p>bounce</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12795
±0.000272</p></td></tr><tr><td><p>capnproto_decode</p></td><td><p>(11, 4)</p></td><td><p>2.0
(1.0, 45.3)</p></td><td><p>0.132
(0.000, 5.999)</p></td><td><p>0.13458
±0.028466</p></td></tr><tr><td><p>capnproto_encode</p></td><td><p>(14, 1)</p></td><td><p>155.0
(52.8, 280.6)</p></td><td><p>34.137
(11.476, 57.203)</p></td><td><p>0.21698
±0.014541</p></td></tr><tr><td><p>collisiondetector</p></td><td><p>(12, 2, 1)</p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>coroutine_ring</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.10667
±0.001527</p></td></tr><tr><td><p>deltablue</p></td><td><p>(10, 5)</p></td><td><p>84.0
(1.0, 1022.9)</p></td><td><p>8.743
(0.000, 106.802)</p></td><td><p>0.10328
±0.003195</p></td></tr><tr><td><p>euler14</p></td><td><p></p></td><td><p>60.0
(60.0, 83.0)</p></td><td><p>5.537
(5.483, 7.680)</p></td><td><p>0.09180
±0.000742</p></td></tr><tr><td><p>fannkuch_redux</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12093
±0.001502</p></td></tr><tr><td><p>fasta</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12099
±0.000376</p></td></tr><tr><td><p>havlak</p></td><td><p>(9, 4, 2)</p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>heapsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>1.01917
±0.015674</p></td></tr><tr><td><p>jsonlua_decode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11279
±0.012664</p></td></tr><tr><td><p>jsonlua_encode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12798
±0.001761</p></td></tr><tr><td><p>knucleotide</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11662
±0.000810</p></td></tr><tr><td><p>life</p></td><td><p>(12, 3)</p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>luacheck</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>1.00901
±0.089779</p></td></tr><tr><td><p>luacheck_parser</p></td><td><p>(13, 2)</p></td><td><p>244.0
(1.0, 652.2)</p></td><td><p>33.998
(0.000, 90.759)</p></td><td><p>0.09434
±0.012888</p></td></tr><tr><td><p>luafun</p></td><td><p></p></td><td><p>54.0
(12.4, 70.6)</p></td><td><p>9.015
(1.935, 11.587)</p></td><td><p>0.16571
±0.004918</p></td></tr><tr><td><p>mandelbrot</p></td><td><p>(11, 4)</p></td><td><p>1.0
(1.0, 29.0)</p></td><td><p>0.000
(0.000, 9.750)</p></td><td><p>0.34443
±0.000119</p></td></tr><tr><td><p>mandelbrot_bit</p></td><td><p>(9, 6)</p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>md5</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11279
±0.000040</p></td></tr><tr><td><p>meteor</p></td><td><p></p></td><td><p>16.0
(2.0, 18.0)</p></td><td><p>3.398
(0.284, 3.840)</p></td><td><p>0.21935
±0.003935</p></td></tr><tr><td><p>moonscript</p></td><td><p></p></td><td><p>28.0
(13.1, 423.3)</p></td><td><p>4.468
(2.039, 68.212)</p></td><td><p>0.16175
±0.001569</p></td></tr><tr><td><p>nbody</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.16024
±0.002790</p></td></tr><tr><td><p>nsieve</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p>0.189
(0.188, 0.189)</p></td><td><p>0.17904
±0.000641</p></td></tr><tr><td><p>nsieve_bit</p></td><td><p></p></td><td><p>4.0
(3.4, 5.3)</p></td><td><p>0.272
(0.219, 0.386)</p></td><td><p>0.08758
±0.000054</p></td></tr><tr><td><p>partialsums</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p>0.160
(0.160, 0.163)</p></td><td><p>0.14802
±0.002044</p></td></tr><tr><td><p>pidigits</p></td><td><p>(11, 4)</p></td><td><p>1.0
(1.0, 2.3)</p></td><td><p>0.000
(0.000, 0.174)</p></td><td><p>0.12689
±0.002132</p></td></tr><tr><td><p>queens</p></td><td><p>(14, 1)</p></td><td><p>1.0
(1.0, 294.4)</p></td><td><p>0.000
(0.000, 35.052)</p></td><td><p>0.11838
±0.000751</p></td></tr><tr><td><p>quicksort</p></td><td><p>(8, 7)</p></td><td><p>3.0
(2.0, 4.0)</p></td><td><p>0.600
(0.315, 0.957)</p></td><td><p>0.31117
±0.067395</p></td></tr><tr><td><p>radixsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12732
±0.000403</p></td></tr><tr><td><p>ray</p></td><td><p>(11, 4)</p></td><td><p>1.0
(1.0, 355.0)</p></td><td><p>0.000
(0.000, 110.833)</p></td><td><p>0.30961
±0.003990</p></td></tr><tr><td><p>recursive_ack</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11975
±0.000653</p></td></tr><tr><td><p>recursive_fib</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.23064
±0.028968</p></td></tr><tr><td><p>resty_json</p></td><td><p>(14, 1)</p></td><td><p>1.0
(1.0, 250.3)</p></td><td><p>0.000
(0.000, 20.009)</p></td><td><p>0.07336
±0.002629</p></td></tr><tr><td><p>revcomp</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11403
±0.001754</p></td></tr><tr><td><p>richards</p></td><td><p>(8, 7)</p></td><td><p>2.0
(1.0, 2.0)</p></td><td><p>0.133
(0.000, 0.152)</p></td><td><p>0.13625
±0.010223</p></td></tr><tr><td><p>scimark_fft</p></td><td><p></p></td><td><p>2.0
(2.0, 4.7)</p></td><td><p>0.140
(0.140, 0.483)</p></td><td><p>0.12653
±0.000823</p></td></tr><tr><td><p>scimark_lu</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11547
±0.000308</p></td></tr><tr><td><p>scimark_sor</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12108
±0.000053</p></td></tr><tr><td><p>scimark_sparse</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12342
±0.000585</p></td></tr><tr><td><p>series</p></td><td><p></p></td><td><p>2.0
(2.0, 2.3)</p></td><td><p>0.347
(0.347, 0.451)</p></td><td><p>0.33400
±0.003217</p></td></tr><tr><td><p>spectralnorm</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13987
±0.000001</p></td></tr><tr><td><p>table_cmpsort</p></td><td><p>(13, 2)</p></td><td><p>10.0
(1.0, 10.0)</p></td><td><p>1.984
(0.000, 1.989)</p></td><td><p>0.22174
±0.007836</p></td></tr></table><p><sub>Results for LuaJIT</sub></p><p>There’s a lot more data here than you’d see in traditional benchmarking methodologies (which only show you an approximation of the “Steady performance (s)” column), so let me give a quick rundown. The “Classification” column tells us whether the 15 process executions for a benchmark all warmed up (good), were all flat (good), all slowed down (bad), were all inconsistent (bad), or some combination of these (if you want to see examples of each of these types, have a look <a href="https://soft-dev.org/pubs/html/barrett_bolz-tereick_killick_mount_tratt__virtual_machine_warmup_blows_hot_and_cold_v6/appendix.html#x1-35000C">here</a>). “Steady iteration (#)” tells us how many in-process iterations were executed before a steady state was hit (with 5%/95% inter-quartile ranges); “Steady iteration (s)” tells us how many seconds it took before a steady state was hit. Finally, the “Steady performance (s)” column tells us the performance of each in-process iteration once the steady state was reached (with 99% confidence intervals). For all numeric columns, lower numbers are better.</p><p>Here are the benchmark results for RaptorJIT:</p>
    <div>
      <h2>Results for RaptorJIT</h2>
    </div>
    <p><b>Symbol key:</b> bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.</p><table><tr><th><p><b>Benchmark</b></p></th><th><p><b>Classification</b></p></th><th><p><b>Steady iteration (#)</b></p></th><th><p><b>Steady iteration (s)</b></p></th><th><p><b>Steady performance (s)</b></p></th></tr><tr><td><p>array3d</p></td><td><p>(12, 3)</p></td><td><p>1.0
(1.0, 76.0)</p></td><td><p>0.000
(0.000, 9.755)</p></td><td><p>0.13026
±0.000216</p></td></tr><tr><td><p>binarytrees</p></td><td><p></p></td><td><p>24.0
(24.0, 24.0)</p></td><td><p>2.792
(2.786, 2.810)</p></td><td><p>0.11960
±0.000762</p></td></tr><tr><td><p>bounce</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13865
±0.000978</p></td></tr><tr><td><p>capnproto_encode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11818
±0.002599</p></td></tr><tr><td><p>collisiondetector</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p>0.167
(0.167, 0.169)</p></td><td><p>0.11583
±0.001498</p></td></tr><tr><td><p>coroutine_ring</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.14645
±0.000752</p></td></tr><tr><td><p>deltablue</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.10658
±0.001063</p></td></tr><tr><td><p>euler14</p></td><td><p>(12, 3)</p></td><td><p>1.0
(1.0, 51.4)</p></td><td><p>0.000
(0.000, 5.655)</p></td><td><p>0.11195
±0.000093</p></td></tr><tr><td><p>fannkuch_redux</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12437
±0.000029</p></td></tr><tr><td><p>fasta</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11967
±0.000313</p></td></tr><tr><td><p>havlak</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.21013
±0.002469</p></td></tr><tr><td><p>heapsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>1.39055
±0.002386</p></td></tr><tr><td><p>jsonlua_decode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13994
±0.001207</p></td></tr><tr><td><p>jsonlua_encode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13581
±0.001411</p></td></tr><tr><td><p>knucleotide</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13035
±0.000445</p></td></tr><tr><td><p>life</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.28412
±0.000599</p></td></tr><tr><td><p>luacheck</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.99735
±0.006095</p></td></tr><tr><td><p>luacheck_parser</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.07745
±0.002296</p></td></tr><tr><td><p>luafun</p></td><td><p></p></td><td><p>28.0
(28.0, 28.0)</p></td><td><p>4.879
(4.861, 4.904)</p></td><td><p>0.17864
±0.001222</p></td></tr><tr><td><p>mandelbrot</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.34166
±0.000067</p></td></tr><tr><td><p>mandelbrot_bit</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.21577
±0.000024</p></td></tr><tr><td><p>md5</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.09548
±0.000037</p></td></tr><tr><td><p>meteor</p></td><td><p></p></td><td><p>2.0
(2.0, 3.0)</p></td><td><p>0.273
(0.269, 0.493)</p></td><td><p>0.21464
±0.002170</p></td></tr><tr><td><p>nbody</p></td><td><p>(14, 1)</p></td><td><p>1.0
(1.0, 1.9)</p></td><td><p>0.000
(0.000, 0.160)</p></td><td><p>0.17695
±0.002226</p></td></tr><tr><td><p>nsieve</p></td><td><p></p></td><td><p>2.0
(2.0, 2.6)</p></td><td><p>0.180
(0.179, 0.282)</p></td><td><p>0.16982
±0.000862</p></td></tr><tr><td><p>nsieve_bit</p></td><td><p></p></td><td><p>4.0
(3.7, 5.0)</p></td><td><p>0.273
(0.247, 0.361)</p></td><td><p>0.08780
±0.000233</p></td></tr><tr><td><p>partialsums</p></td><td><p></p></td><td><p>2.0
(2.0, 2.3)</p></td><td><p>0.161
(0.160, 0.207)</p></td><td><p>0.14860
±0.001611</p></td></tr><tr><td><p>pidigits</p></td><td><p>(8, 7)</p></td><td><p>5.0
(1.0, 6.0)</p></td><td><p>0.516
(0.000, 0.646)</p></td><td><p>0.12766
±0.000032</p></td></tr><tr><td><p>queens</p></td><td><p>(14, 1)</p></td><td><p>2.0
(1.7, 2.0)</p></td><td><p>0.162
(0.113, 0.162)</p></td><td><p>0.15853
±0.000231</p></td></tr><tr><td><p>quicksort</p></td><td><p></p></td><td><p>2.0
(2.0, 2.3)</p></td><td><p>0.278
(0.278, 0.361)</p></td><td><p>0.27183
±0.000469</p></td></tr><tr><td><p>radixsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12621
±0.000757</p></td></tr><tr><td><p>ray</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.35530
±0.000984</p></td></tr><tr><td><p>recursive_ack</p></td><td><p>(14, 1)</p></td><td><p>1.0
(1.0, 19.0)</p></td><td><p>0.000
(0.000, 2.562)</p></td><td><p>0.14228
±0.000616</p></td></tr><tr><td><p>recursive_fib</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.28989
±0.000033</p></td></tr><tr><td><p>resty_json</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.07534
±0.000595</p></td></tr><tr><td><p>revcomp</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11684
±0.002139</p></td></tr><tr><td><p>richards</p></td><td><p></p></td><td><p>2.0
(2.0, 3.2)</p></td><td><p>0.171
(0.170, 0.369)</p></td><td><p>0.16559
±0.000342</p></td></tr><tr><td><p>scimark_fft</p></td><td><p></p></td><td><p>2.0
(2.0, 10.3)</p></td><td><p>0.141
(0.141, 1.195)</p></td><td><p>0.12709
±0.000102</p></td></tr><tr><td><p>scimark_lu</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12733
±0.000159</p></td></tr><tr><td><p>scimark_sor</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13297
±0.000005</p></td></tr><tr><td><p>scimark_sparse</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13082
±0.000490</p></td></tr><tr><td><p>series</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p>0.347
(0.347, 0.348)</p></td><td><p>0.33390
±0.000869</p></td></tr><tr><td><p>spectralnorm</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13989
±0.000003</p></td></tr><tr><td><p>table_cmpsort</p></td><td><p></p></td><td><p>10.0
(10.0, 10.0)</p></td><td><p>1.945
(1.935, 1.967)</p></td><td><p>0.22008
±0.001852</p></td></tr></table><p><sub>Results for RaptorJIT</sub></p><p>We quickly found it difficult to compare so many numbers at once, so as part of this project we built a stats differ that can compare one set of benchmarks with another. Here's the result of comparing the original version of LuaJIT with RaptorJIT:</p>
    <div>
      <h2>Results for Normal vs. RaptorJIT</h2>
    </div>
    <p><b>Symbol key:</b> bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
<b>Diff against previous results:</b> improved worsened different unchanged.</p><table><tr><th><p><b>Benchmark</b></p></th><th><p><b>Classification</b></p></th><th><p><b>Steady iteration (#)</b></p></th><th><p><b>Steady iteration variation</b></p></th><th><p><b>Steady iteration (s)</b></p></th><th><p><b>Steady performance (s)</b></p></th><th><p><b>Steady performance
variation (s)</b></p></th></tr><tr><td><p>array3d</p></td><td><p>(12, 3)</p></td><td><p>1.0
(1.0, 76.0)</p></td><td><p>(1.0, 76.0)
was: (2.0, 624.3)</p></td><td><p>0.000
(0.000, 9.755)</p></td><td><p>0.13026
δ=0.00163
±0.000215</p></td><td><p>0.000215
was: 0.000557</p></td></tr><tr><td><p>binarytrees</p></td><td><p></p></td><td><p>24.0
(24.0, 24.0)</p></td><td><p></p></td><td><p>2.792
(2.786, 2.810)</p></td><td><p>0.11960
δ=-0.00603
±0.000762</p></td><td><p></p></td></tr><tr><td><p>bounce</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13865
δ=0.01070
±0.000978</p></td><td><p></p></td></tr><tr><td><p>capnproto_encode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11818
δ=-0.09880
±0.002599</p></td><td><p></p></td></tr><tr><td><p>collisiondetector</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p></p></td><td><p>0.167
(0.167, 0.169)</p></td><td><p>0.11583
±0.001498</p></td><td><p></p></td></tr><tr><td><p>coroutine_ring</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.14645
δ=0.03978
±0.000751</p></td><td><p></p></td></tr><tr><td><p>deltablue</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.10658
±0.001063</p></td><td><p>0.001063
was: 0.003195</p></td></tr><tr><td><p>euler14</p></td><td><p>(12, 3)</p></td><td><p>1.0
δ=-59.0
(1.0, 51.4)</p></td><td><p>(1.0, 51.4)
was: (60.0, 83.0)</p></td><td><p>0.000
δ=-5.537
(0.000, 5.655)</p></td><td><p>0.11195
δ=0.02015
±0.000093</p></td><td><p>0.000093
was: 0.000743</p></td></tr><tr><td><p>fannkuch_redux</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12437
δ=0.00344
±0.000029</p></td><td><p></p></td></tr><tr><td><p>fasta</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11967
δ=-0.00132
±0.000313</p></td><td><p></p></td></tr><tr><td><p>havlak</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.21013
±0.002442</p></td><td><p></p></td></tr><tr><td><p>heapsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>1.39055
δ=0.37138
±0.002379</p></td><td><p></p></td></tr><tr><td><p>jsonlua_decode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13994
δ=0.02715
±0.001207</p></td><td><p></p></td></tr><tr><td><p>jsonlua_encode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13581
δ=0.00783
±0.001409</p></td><td><p></p></td></tr><tr><td><p>knucleotide</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13035
δ=0.01373
±0.000446</p></td><td><p></p></td></tr><tr><td><p>life</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.28412
±0.000599</p></td><td><p></p></td></tr><tr><td><p>luacheck</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.99735
±0.006094</p></td><td><p>0.006094
was: 0.089779</p></td></tr><tr><td><p>luacheck_parser</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.07745
δ=-0.01688
±0.002281</p></td><td><p></p></td></tr><tr><td><p>luafun</p></td><td><p></p></td><td><p>28.0
(28.0, 28.0)</p></td><td><p></p></td><td><p>4.879
(4.861, 4.904)</p></td><td><p>0.17864
δ=0.01293
±0.001222</p></td><td><p>0.001222
was: 0.004918</p></td></tr><tr><td><p>mandelbrot</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.34166
δ=-0.00278
±0.000067</p></td><td><p></p></td></tr><tr><td><p>mandelbrot_bit</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.21577
±0.000024</p></td><td><p></p></td></tr><tr><td><p>md5</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.09548
δ=-0.01731
±0.000037</p></td><td><p></p></td></tr><tr><td><p>meteor</p></td><td><p></p></td><td><p>2.0
(2.0, 3.0)</p></td><td><p>(2.0, 3.0)
was: (2.0, 18.0)</p></td><td><p>0.273
(0.269, 0.493)</p></td><td><p>0.21464
±0.002170</p></td><td><p>0.002170
was: 0.003935</p></td></tr><tr><td><p>nbody</p></td><td><p>(14, 1)</p></td><td><p>1.0
(1.0, 1.9)</p></td><td><p></p></td><td><p>0.000
(0.000, 0.160)</p></td><td><p>0.17695
δ=0.01671
±0.002226</p></td><td><p></p></td></tr><tr><td><p>nsieve</p></td><td><p></p></td><td><p>2.0
(2.0, 2.6)</p></td><td><p>(2.0, 2.6)
was: (2.0, 2.0)</p></td><td><p>0.180
(0.179, 0.282)</p></td><td><p>0.16982
δ=-0.00922
±0.000862</p></td><td><p>0.000862
was: 0.000640</p></td></tr><tr><td><p>nsieve_bit</p></td><td><p></p></td><td><p>4.0
(3.7, 5.0)</p></td><td><p>(3.7, 5.0)
was: (3.4, 5.3)</p></td><td><p>0.273
(0.247, 0.361)</p></td><td><p>0.08780
±0.000233</p></td><td><p>0.000233
was: 0.000054</p></td></tr><tr><td><p>partialsums</p></td><td><p></p></td><td><p>2.0
(2.0, 2.3)</p></td><td><p>(2.0, 2.3)
was: (2.0, 2.0)</p></td><td><p>0.161
(0.160, 0.207)</p></td><td><p>0.14860
±0.001611</p></td><td><p>0.001611
was: 0.002044</p></td></tr><tr><td><p>pidigits</p></td><td><p>(8, 7)</p></td><td><p>5.0
(1.0, 6.0)</p></td><td><p>(1.0, 6.0)
was: (1.0, 2.3)</p></td><td><p>0.516
(0.000, 0.646)</p></td><td><p>0.12766
±0.000032</p></td><td><p>0.000032
was: 0.002132</p></td></tr><tr><td><p>queens</p></td><td><p>(14, 1)</p></td><td><p>2.0
(1.7, 2.0)</p></td><td><p>(1.7, 2.0)
was: (1.0, 294.4)</p></td><td><p>0.162
(0.113, 0.162)</p></td><td><p>0.15853
δ=0.04015
±0.000231</p></td><td><p>0.000231
was: 0.000751</p></td></tr><tr><td><p>quicksort</p></td><td><p></p></td><td><p>2.0
(2.0, 2.3)</p></td><td><p>(2.0, 2.3)
was: (2.0, 4.0)</p></td><td><p>0.278
(0.278, 0.361)</p></td><td><p>0.27183
±0.000469</p></td><td><p>0.000469
was: 0.067395</p></td></tr><tr><td><p>radixsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12621
±0.000757</p></td><td><p>0.000757
was: 0.000403</p></td></tr><tr><td><p>ray</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.35530
δ=0.04568
±0.000983</p></td><td><p></p></td></tr><tr><td><p>recursive_ack</p></td><td><p>(14, 1)</p></td><td><p>1.0
(1.0, 19.0)</p></td><td><p></p></td><td><p>0.000
(0.000, 2.562)</p></td><td><p>0.14228
δ=0.02253
±0.000616</p></td><td><p></p></td></tr><tr><td><p>recursive_fib</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.28989
δ=0.05925
±0.000033</p></td><td><p></p></td></tr><tr><td><p>resty_json</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.07534
±0.000595</p></td><td><p>0.000595
was: 0.002629</p></td></tr><tr><td><p>revcomp</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11684
±0.002139</p></td><td><p>0.002139
was: 0.001754</p></td></tr><tr><td><p>richards</p></td><td><p></p></td><td><p>2.0
(2.0, 3.2)</p></td><td><p>(2.0, 3.2)
was: (1.0, 2.0)</p></td><td><p>0.171
(0.170, 0.369)</p></td><td><p>0.16559
δ=0.02935
±0.000342</p></td><td><p>0.000342
was: 0.010223</p></td></tr><tr><td><p>scimark_fft</p></td><td><p></p></td><td><p>2.0
(2.0, 10.3)</p></td><td><p>(2.0, 10.3)
was: (2.0, 4.7)</p></td><td><p>0.141
(0.141, 1.195)</p></td><td><p>0.12709
±0.000102</p></td><td><p>0.000102
was: 0.000823</p></td></tr><tr><td><p>scimark_lu</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12733
δ=0.01186
±0.000159</p></td><td><p></p></td></tr><tr><td><p>scimark_sor</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13297
δ=0.01189
±0.000005</p></td><td><p></p></td></tr><tr><td><p>scimark_sparse</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13082
δ=0.00740
±0.000490</p></td><td><p></p></td></tr><tr><td><p>series</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p></p></td><td><p>0.347
(0.347, 0.348)</p></td><td><p>0.33390
±0.000869</p></td><td><p>0.000869
was: 0.003217</p></td></tr><tr><td><p>spectralnorm</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13989
δ=0.00002
±0.000003</p></td><td><p></p></td></tr><tr><td><p>table_cmpsort</p></td><td><p></p></td><td><p>10.0
(10.0, 10.0)</p></td><td><p></p></td><td><p>1.945
(1.935, 1.967)</p></td><td><p>0.22008
±0.001852</p></td><td><p>0.001852
was: 0.007836</p></td></tr></table><p><sub>Results for Normal vs. RaptorJIT</sub></p><p>In essence, green cells mean that RaptorJIT is better than LuaJIT; red cells mean that LuaJIT is better than RaptorJIT; yellow means they're different in a way that can't be compared; and white/grey means they're statistically equivalent. The additional “Steady performance variation (s)” column shows whether the steady state performance of different process executions is more predictable or not.</p><p>The simple conclusion to draw from this is that there isn't a simple conclusion to draw from it: the two VMs are sometimes better than each other with no clear pattern. Without having a clear steer either way, we therefore decided to use the original version of LuaJIT as our base.</p><p>One of the things that became very clear from our benchmarking is that LuaJIT is highly non-deterministic – indeed, it's the most non-deterministic VM I've seen. The practical effect of this is that even on one program, LuaJIT is sometimes very fast, and sometimes rather slow. This is, at best, very confusing for users who tend to assume that programs perform more-or-less the same every time they're run; at worst, it can create significant problems when one is trying to estimate things like server provisioning. We therefore tried various things to make performance more consistent.</p><p>The most promising approach we alighted upon is what we ended up calling “separate counters”. In a tracing JIT compiler such as LuaJIT, one tracks how often a loop (where loops are both “obvious” things like for loops, as well as less obvious things such as functions) has been executed: once it's hit a certain threshold, the loop is traced, and compiled into machine code. LuaJIT has an unusual approach to counting loops: it has 64 counters to which all loops are mapped (using the memory address of the bytecode in question). 
In other words, multiple loops share the same counter: the bigger the program, the more loops share the same counter. The advantage of this is that the counters map is memory efficient, and for small programs (e.g. the common LuaJIT benchmarks) it can be highly effective. However, it has very odd effects in real programs, particularly as programs get bigger: loops are compiled non-deterministically based on the particular address in memory they happen to have been loaded at.</p><p>We therefore altered LuaJIT so that each loop and each function has its own counter, stored in the bytecode to make memory reads/writes more cache friendly. The diff from normal LuaJIT to the separate counters version is as follows:</p>
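<p>The difference between the two counting schemes can be sketched with a toy Python model. Everything here is illustrative: the modulo hash and the hotness threshold are made-up stand-ins, not LuaJIT's actual code or values:</p>

```python
N_SHARED = 64       # number of shared counters, as in stock LuaJIT
HOT_THRESHOLD = 56  # illustrative hotness threshold

def hot_loops_shared(executions):
    """Address-hashed shared counters: every loop maps to one of
    N_SHARED slots, so unrelated loops can inflate each other's count."""
    counters = [0] * N_SHARED
    compiled = set()
    for addr in executions:
        slot = addr % N_SHARED      # illustrative address-based hash
        counters[slot] += 1
        if counters[slot] >= HOT_THRESHOLD:
            compiled.add(addr)
            counters[slot] = 0
    return compiled

def hot_loops_separate(executions):
    """Separate counters: a loop's hotness depends only on how often
    that loop itself has run."""
    counters, compiled = {}, set()
    for addr in executions:
        counters[addr] = counters.get(addr, 0) + 1
        if counters[addr] >= HOT_THRESHOLD:
            compiled.add(addr)
            counters[addr] = 0
    return compiled

# Two loops whose bytecode addresses collide modulo 64, interleaved 30
# times each: their combined count pushes the shared slot over the
# threshold, even though neither loop alone is hot.
trace = [0x1000, 0x1040] * 30
assert hot_loops_shared(trace) == {0x1040}
assert hot_loops_separate(trace) == set()
# Shift the second loop's address by one byte (a different slot) and
# the shared-counter outcome changes: load addresses decide what gets
# compiled.
assert hot_loops_shared([0x1000, 0x1041] * 30) == set()
```

<p>With a per-loop counter, what gets compiled no longer depends on where the bytecode happens to sit in memory, which is the source of the non-determinism described above.</p>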
    <div>
      <h2>Results for Normal vs. Counters</h2>
    </div>
    <p><b>Symbol key:</b> bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
<b>Diff against previous results:</b> improved worsened different unchanged.</p><table><tr><th><p><b>Benchmark</b></p></th><th><p><b>Classification</b></p></th><th><p><b>Steady iteration (#)</b></p></th><th><p><b>Steady iteration variation</b></p></th><th><p><b>Steady iteration (s)</b></p></th><th><p><b>Steady performance (s)</b></p></th><th><p><b>Steady performance
variation (s)</b></p></th></tr><tr><td><p>array3d</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>binarytrees</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12462
±0.004058</p></td><td><p>0.004058
was: 0.000532</p></td></tr><tr><td><p>bounce</p></td><td><p>(14, 1)</p></td><td><p>1.0
(1.0, 5.8)</p></td><td><p></p></td><td><p>0.000
(0.000, 0.603)</p></td><td><p>0.12515
δ=-0.00280
±0.000278</p></td><td><p></p></td></tr><tr><td><p>capnproto_decode</p></td><td><p>(9, 6)</p></td><td><p>1.0
(1.0, 24.9)</p></td><td><p>(1.0, 24.9)
was: (1.0, 45.3)</p></td><td><p>0.000
(0.000, 3.692)</p></td><td><p>0.15042
±0.003797</p></td><td><p>0.003797
was: 0.028466</p></td></tr><tr><td><p>capnproto_encode</p></td><td><p></p></td><td><p>230.0
(56.0, 467.6)</p></td><td><p>(56.0, 467.6)
was: (52.8, 280.6)</p></td><td><p>28.411
(6.667, 55.951)</p></td><td><p>0.11838
δ=-0.09860
±0.001960</p></td><td><p>0.001960
was: 0.014541</p></td></tr><tr><td><p>collisiondetector</p></td><td><p>(13, 2)</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>coroutine_ring</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.10680
±0.003151</p></td><td><p>0.003151
was: 0.001527</p></td></tr><tr><td><p>deltablue</p></td><td><p></p></td><td><p>149.0
(149.0, 274.5)</p></td><td><p>(149.0, 274.5)
was: (1.0, 1022.9)</p></td><td><p>15.561
(15.430, 28.653)</p></td><td><p>0.10159
±0.001083</p></td><td><p>0.001083
was: 0.003195</p></td></tr><tr><td><p>euler14</p></td><td><p></p></td><td><p>61.0
(61.0, 68.3)</p></td><td><p>(61.0, 68.3)
was: (60.0, 83.0)</p></td><td><p>5.650
(5.592, 6.356)</p></td><td><p>0.09216
±0.000159</p></td><td><p>0.000159
was: 0.000743</p></td></tr><tr><td><p>fannkuch_redux</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11976
±0.000012</p></td><td><p>0.000012
was: 0.001502</p></td></tr><tr><td><p>fasta</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12200
δ=0.00100
±0.000597</p></td><td><p></p></td></tr><tr><td><p>havlak</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>heapsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>1.04378
δ=0.02461
±0.000789</p></td><td><p></p></td></tr><tr><td><p>jsonlua_decode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12648
δ=0.01370
±0.000556</p></td><td><p></p></td></tr><tr><td><p>jsonlua_encode</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12860
±0.000879</p></td><td><p>0.000879
was: 0.001761</p></td></tr><tr><td><p>knucleotide</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11710
±0.000541</p></td><td><p>0.000541
was: 0.000811</p></td></tr><tr><td><p>life</p></td><td><p>(9, 3, 2, 1)</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>luacheck</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>1.00299
±0.004778</p></td><td><p>0.004778
was: 0.089781</p></td></tr><tr><td><p>luacheck_parser</p></td><td><p>(12, 2, 1)</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>luafun</p></td><td><p></p></td><td><p>69.0
(69.0, 69.0)</p></td><td><p></p></td><td><p>11.481
(11.331, 11.522)</p></td><td><p>0.16770
±0.001564</p></td><td><p>0.001564
was: 0.004918</p></td></tr><tr><td><p>mandelbrot</p></td><td><p>(14, 1)</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>mandelbrot_bit</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.21695
±0.000142</p></td><td><p></p></td></tr><tr><td><p>md5</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11155
δ=-0.00124
±0.000043</p></td><td><p></p></td></tr><tr><td><p>meteor</p></td><td><p>(13, 2)</p></td><td><p>14.0
(1.0, 15.0)</p></td><td><p>(1.0, 15.0)
was: (2.0, 18.0)</p></td><td><p>2.855
(0.000, 3.045)</p></td><td><p>0.21606
±0.004651</p></td><td><p>0.004651
was: 0.003935</p></td></tr><tr><td><p>moonscript</p></td><td><p></p></td><td><p>63.0
(17.7, 184.1)</p></td><td><p>(17.7, 184.1)
was: (13.1, 423.3)</p></td><td><p>10.046
(2.763, 29.739)</p></td><td><p>0.15999
±0.001405</p></td><td><p>0.001405
was: 0.001568</p></td></tr><tr><td><p>nbody</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.15898
±0.001676</p></td><td><p>0.001676
was: 0.002790</p></td></tr><tr><td><p>nsieve</p></td><td><p></p></td><td><p>2.0
(2.0, 2.6)</p></td><td><p>(2.0, 2.6)
was: (2.0, 2.0)</p></td><td><p>0.189
(0.188, 0.297)</p></td><td><p>0.17875
±0.001266</p></td><td><p>0.001266
was: 0.000641</p></td></tr><tr><td><p>nsieve_bit</p></td><td><p></p></td><td><p>4.0
(2.0, 6.0)</p></td><td><p>(2.0, 6.0)
was: (3.4, 5.3)</p></td><td><p>0.271
(0.097, 0.446)</p></td><td><p>0.08726
δ=-0.00032
±0.000202</p></td><td><p>0.000202
was: 0.000054</p></td></tr><tr><td><p>partialsums</p></td><td><p></p></td><td><p>2.0
(2.0, 2.9)</p></td><td><p>(2.0, 2.9)
was: (2.0, 2.0)</p></td><td><p>0.161
(0.161, 0.295)</p></td><td><p>0.14916
±0.000081</p></td><td><p>0.000081
was: 0.002044</p></td></tr><tr><td><p>pidigits</p></td><td><p></p></td><td><p>2.0
(2.0, 4.3)</p></td><td><p>(2.0, 4.3)
was: (1.0, 2.3)</p></td><td><p>0.130
(0.130, 0.425)</p></td><td><p>0.12666
±0.000122</p></td><td><p>0.000122
was: 0.002133</p></td></tr><tr><td><p>queens</p></td><td><p>(10, 5)</p></td><td><p>1.0
(1.0, 2.0)</p></td><td><p>(1.0, 2.0)
was: (1.0, 294.4)</p></td><td><p>0.000
(0.000, 0.127)</p></td><td><p>0.12484
δ=0.00646
±0.000317</p></td><td><p>0.000317
was: 0.000751</p></td></tr><tr><td><p>quicksort</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p></p></td><td><p>0.299
(0.298, 0.304)</p></td><td><p>0.44880
δ=0.13763
±0.020477</p></td><td><p>0.020477
was: 0.067395</p></td></tr><tr><td><p>radixsort</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12644
±0.000864</p></td><td><p>0.000864
was: 0.000403</p></td></tr><tr><td><p>ray</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.30901
±0.002140</p></td><td><p>0.002140
was: 0.004022</p></td></tr><tr><td><p>recursive_ack</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11958
±0.000510</p></td><td><p>0.000510
was: 0.000653</p></td></tr><tr><td><p>recursive_fib</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.22864
±0.000266</p></td><td><p>0.000266
was: 0.028968</p></td></tr><tr><td><p>resty_json</p></td><td><p>(12, 2, 1)</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>revcomp</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.11550
±0.002553</p></td><td><p>0.002553
was: 0.001753</p></td></tr><tr><td><p>richards</p></td><td><p>(14, 1)</p></td><td><p>2.0
(1.7, 2.0)</p></td><td><p>(1.7, 2.0)
was: (1.0, 2.0)</p></td><td><p>0.150
(0.105, 0.150)</p></td><td><p>0.14572
±0.000324</p></td><td><p>0.000324
was: 0.010223</p></td></tr><tr><td><p>scimark_fft</p></td><td><p></p></td><td><p>2.0
(2.0, 10.0)</p></td><td><p>(2.0, 10.0)
was: (2.0, 4.7)</p></td><td><p>0.140
(0.140, 1.153)</p></td><td><p>0.12639
±0.000343</p></td><td><p>0.000343
was: 0.000823</p></td></tr><tr><td><p>scimark_lu</p></td><td><p>(11, 4)</p></td><td><p>1.0
(1.0, 45.3)</p></td><td><p></p></td><td><p>0.000
(0.000, 5.122)</p></td><td><p>0.11546
±0.000132</p></td><td><p>0.000132
was: 0.000308</p></td></tr><tr><td><p>scimark_sor</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12105
±0.000148</p></td><td><p></p></td></tr><tr><td><p>scimark_sparse</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.12315
±0.000728</p></td><td><p>0.000728
was: 0.000585</p></td></tr><tr><td><p>series</p></td><td><p></p></td><td><p>2.0
(2.0, 2.0)</p></td><td><p></p></td><td><p>0.347
(0.347, 0.348)</p></td><td><p>0.33394
±0.000645</p></td><td><p>0.000645
was: 0.003217</p></td></tr><tr><td><p>spectralnorm</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p>0.13985
δ=-0.00003
±0.000007</p></td><td><p></p></td></tr><tr><td><p>table_cmpsort</p></td><td><p>(13, 1, 1)</p></td><td><p>1.0
(1.0, 10.0)</p></td><td><p></p></td><td><p>0.000
(0.000, 2.005)</p></td><td><p>0.21828
±0.003289</p></td><td><p>0.003289
was: 0.007836</p></td></tr></table><p><sub>Results for Normal vs. Counters</sub></p><p>In this case we’re particularly interested in the “steady performance variation (s)” column, which shows whether benchmarks have predictable steady state performance. The results are fairly clear: separate counters are, overall, a clear improvement. As you might expect, this is not a pure win, because it changes the order in which traces are made. This has several effects, including causing some loops to be traced later than before, because individual counters do not hit the required threshold as quickly.</p><p>This disadvantages some programs, particularly small deterministic benchmarks where loops are highly stable. In such cases, the earlier you trace the better. However, in my opinion, such programs are given undue weight when performance is considered. It’s no secret that some of the benchmarks regularly used to measure LuaJIT are <i>highly</i> optimised for LuaJIT as it stands; any changes to LuaJIT stand a good chance of degrading their performance. However, we feel that the overall gain in consistency, particularly for larger programs, is worth it. There's a <a href="https://github.com/lua-users-foundation/LuaJIT/pull/6">pull request against the Lua Foundation's fork of LuaJIT</a> which applies this idea to a mainstream fork of LuaJIT.</p><p>We then started looking at various programs that showed odd performance. One problem in particular showed up in more than one benchmark. Here's a typical example:</p><p>Collisiondetector, Normal, Bencher9, Proc. exec. #12 (no steady state)</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Q2ZdgNBkfPHYDy88AAjkv/83ed55ff8eb087dfbeb7d94ff496b892/collision1-1.svg" />
</figure><p>The problem – and it doesn't happen on every process execution, just to make it more fun – is that there are points where the benchmark slows down by over 10% for multiple in-process iterations (e.g. in this process execution, at in-process iterations 930-ish and 1050-ish). We tried over 25 separate ways to work out what was causing this – even building an instrumentation system to track what LuaJIT is doing – but in the end it turned out to be related to LuaJIT's Garbage Collector, sort of. When we moved from the 32-bit to the 64-bit GC, the odd performance went away.</p><p>As such, we don’t think that the 64-bit GC “solves” the problem; rather, it changes the way that pointers are encoded (doubling their size), which causes the code generator to emit a different style of code, such that the problem seems to go away. Nevertheless, this did make us reevaluate LuaJIT's GC. Tom then started work on implementing Mike Pall's <a href="http://wiki.luajit.org/New-Garbage-Collector">suggestion for a new GC for LuaJIT</a> (based partly on Tom's previous work and also that of Peter Cawley). He has enough implemented to run most small, and some large, programs, but it needs more work to finish it off, at which point evaluating it against the existing Lua GCs will be fascinating!</p><p>So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we've been able to make a lot of programs (particularly large programs) have more consistent performance; and we've got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!</p><p><b>Acknowledgements:</b> Sarah Mount implemented the stats differ; Edd Barrett implemented Krun and answered many questions on it.</p> ]]></content:encoded>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">UhdX7EC0rWypMpydfLKTz</guid>
            <dc:creator>Guest Author</dc:creator>
        </item>
        <item>
            <title><![CDATA[Technical reading from the Cloudflare blog for the holidays]]></title>
            <link>https://blog.cloudflare.com/2017-holiday-reading-from-the-cloudflare-blog/</link>
            <pubDate>Fri, 22 Dec 2017 14:17:57 GMT</pubDate>
            <description><![CDATA[ During 2017 Cloudflare published 172 blog posts (including this one). If you need a distraction from the holiday festivities at this time of year, here are some highlights. ]]></description>
            <content:encoded><![CDATA[ <p>During 2017 Cloudflare published 172 blog posts (including this one). If you need a distraction from the holiday festivities at this time of year, here are some highlights.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FupNPeMEDTcUGuKJCQ2hX/69f106f4d37e6f7ce629f3845d88f8ed/33651510973_9bc38cc550_z.jpg" />
            
</figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/148114704@N05/33651510973/in/photolist-TgEMP4-5W4SKe-5QXuaQ-avwN4J-9kM47Q-5sZfg5-62HmQr-vRsPNX-9gu8zw-8tzfDA-7L9szU-2j3fkx-kdS5xh-dTvJ1k-bd2nWP-5eyMyX-cYKDeY-aha8Je-s9FApd-afsNQp-Rr9uMb-6w5kZp-e8k7Zc-7JV8KQ-Sbxdzt-emJeJJ-fvoSPx-7jDQjL-cNbEy7-Ht7oDe-6w5mqM-cDJ6PS-cDHREJ-2L3KsB-2rjJQY-9kxtQm-b31okB-2rfQ8c-bHhPX-dr6fiP-5sUUEp-DDzAGu-onQfBb-afsNzx-kdS4E5-fVkm7-okB223-7ZrhKH-9eLu3Y-pcsdc4">image</a> by <a href="https://perzonseo.com">perzon seo</a></p><p><a href="/the-wirex-botnet/">The WireX Botnet: How Industry Collaboration Disrupted a DDoS Attack</a></p><p>We worked closely with companies across the industry to track and take down the Android WireX Botnet. This blog post goes into detail about how that botnet operated, how it was distributed and how it was taken down.</p><p><a href="/randomness-101-lavarand-in-production/">Randomness 101: LavaRand in Production</a></p><p>The wall of Lava Lamps in the San Francisco office is used to feed entropy into random number generators across our network. This blog post explains how.</p><p><a href="/arm-takes-wing/">ARM Takes Wing: Qualcomm vs. Intel CPU comparison</a></p><p>The data centers in our worldwide <a href="https://www.cloudflare.com/network/">network</a> all contain Intel-based servers, but we're interested in ARM-based servers because of the potential cost/power savings. This blog post took a look at the relative performance of Intel processors and Qualcomm's latest server offering.</p><p><a href="/how-to-monkey-patch-the-linux-kernel/">How to Monkey Patch the Linux Kernel</a></p><p>One engineer wanted to combine the Dvorak and QWERTY keyboard layouts and did so by patching the Linux kernel using <a href="https://sourceware.org/systemtap/">SystemTap</a>. This blog explains how and why.
Where there's a will, there's a way.</p><p><a href="/introducing-cloudflare-workers/">Introducing Cloudflare Workers: Run JavaScript Service Workers at the Edge</a></p><p>Traditionally, the Cloudflare network has been <i>configurable</i> by our users, but not <i>programmable</i>. In September, we introduced <a href="https://www.cloudflare.com/products/cloudflare-workers/">Cloudflare Workers</a> which allows users to write JavaScript code that runs on our edge worldwide. This blog post explains why we chose JavaScript and how it works.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3rDQWcCe0YDvxU6a8enygs/e436188ab1b54b75b1a875143e9c08c5/5586120601_a7b1776371_b.jpg" />
            
</figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/werkman/5586120601/in/photolist-9vCk3a-21Bd6Xh-8nCHw7-ptePkD-RJQgiN-nWZ5u6-3xLVWE-rDFYG8-LH8sS-2peYb-mX5MQc-6JxAT-aAArZQ-rJfmkx-9HavML-TX5hUW-niuhtp-jHGzT5-5eRuc3-Gv67PP-fs2mGn-8mhB7f-8pDZbD-ZPoZDF-3xLxd3-6k15Ni-j1zK3-8mhB8b-7RjyKs-57C9rW-j1zBX-bTs5tP-8wuUBf-7r7fAq-8jPBD4-5bnWiq-e88EiF-ddTbY7-PVC9U-e88E6P-S4eP6f-8jPkh2-5bj6xn-NzVdo-7rQyZa-4Dm4gX-ZCuaMm-pQr1Hw-yf9rdC-21HJvkv">image</a> by <a href="https://www.flickr.com/photos/werkman/">Peter Werkman</a></p><p><a href="/geo-key-manager-how-it-works/">Geo Key Manager: How It Works</a></p><p>Our <a href="/introducing-cloudflare-geo-key-manager/">Geo Key Manager</a> gives customers granular control of the location of their private keys on the Cloudflare network. This blog post explains the mathematics that makes this possible.</p><p><a href="/sidh-go/">SIDH in Go for quantum-resistant TLS 1.3</a></p><p>Quantum-resistant cryptography isn't an academic fantasy.
We implemented the SIDH scheme as part of our Go implementation of TLS 1.3 and open sourced it.</p><p><a href="/the-languages-which-almost-became-css/">The Languages Which Almost Became CSS</a></p><p>This blog post recounts the history of CSS and the languages that might have been CSS.</p><p><a href="/perfect-locality-and-three-epic-systemtap-scripts/">Perfect locality and three epic SystemTap scripts</a></p><p>In an ongoing effort to understand the performance of NGINX under heavy load on our machines (and wring out the greatest number of requests/core), we used SystemTap to experiment with different queuing models.</p><p><a href="/counting-things-a-lot-of-different-things/">How we built rate limiting capable of scaling to millions of domains</a></p><p>We rolled out a <a href="cloudflare">rate limiting</a> feature that allows our customers to control the maximum number of HTTP requests per second/minute/hour that their servers receive. This blog post explains how we made that operate efficiently at our scale.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cQPAfJHM6gES4uXsyRwd4/f871a9d7627a4ea4f405ea76524e9545/26797557806_18daa76ec2_z.jpg" />
            
</figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/waters2712/26797557806/in/photolist-GQ1ujJ-ovpzb8-5hGjhb-pkkHPH-7Ed3eQ-5TiuCW-6tknkf-7JGpz4-81Rc1J-qM8AwX-dQVifV-nWZ5u6-puAvLe-acEK6v-F5KyvG-4Ykyf1-bvH81M-FF6XnD-KLqgJ-4rLnJE-d8b1tS-dQVisV-7cTp1r-pkkHic-6oKTtx-9mKe1u-5vsfch-coUNp9-o9Txa7-9p7thZ-aWYjuc-SV2qEb-7LXSYz-9Fcnam-fkr4Fc-b6Dtmt-6r1QQ2-5ndv1D-fuUiKV-qDAQxe-cjZhVY-6Hn6G1-qPMScz-mJAvhc-8LVJNj-7Ed3cf-9wFgvw-9z5jt9-bGsg4R-72BBkc">image</a> by <a href="https://www.flickr.com/photos/waters2712/">Han Cheng Yeh</a></p><p><a href="/reflections-on-reflections/">Reflections on reflection (attacks)</a></p><p>We deal with a new DDoS attack every few minutes and in this blog post we took a close look at reflection attacks and revealed statistics on the types of reflection-based DDoS attacks we see.</p><p><a href="/on-the-dangers-of-intels-frequency-scaling/">On the dangers of Intel's frequency scaling</a></p><p>Intel processors contain special AVX-512 units that provide 512-bit wide SIMD instructions to speed up certain calculations. However, these instructions have a downside: when they are used, the CPU base frequency is scaled down, slowing down other instructions. This blog post explores that problem.</p><p><a href="/how-cloudflare-analyzes-1m-dns-queries-per-second/">How Cloudflare analyzes 1M DNS queries per second</a></p><p>This blog post details how we handle logging information for 1M DNS queries per second using a custom pipeline, <a href="https://clickhouse.yandex/">ClickHouse</a> and Grafana (via a connector we <a href="https://github.com/vavrusa/grafana-sqldb-datasource">open sourced</a>) to build real time dashboards.</p><p><a href="/aes-cbc-going-the-way-of-the-dodo/">AES-CBC is going the way of the dodo</a></p><p>CBC-mode cipher suites have been declining for some time because of padding oracle-based attacks.
In this blog post we demonstrate that AES-CBC has now largely been replaced by <a href="/it-takes-two-to-chacha-poly/">ChaCha20-Poly1305</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CVkU8gCFVwghZzoP3x0Sd/c90d63d983ab453d7115daac2d16e632/3414054443_2bd47e12f7_b.jpg" />
            
</figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/spanginator/3414054443/in/photolist-6cFVrD-6cvz6R-6cKZLw-rn74T-7RoM6f-ATb57-aQHtja-7i1omC-xGoWE-8FsWpL-6DbHyD-9aFjZn-BxbDmF-23fTGw-4v5aS-81KR9D-ahC42F-7ibXNR-Hod7ZY-7hS2JP-fnyx1X-4Pjy9v-kNu6rR-FFmpx9-Gyx1By-GsEegj-7hVPHw-Gv6zDR-F5KyvG-FF6XnD-FZk6kU-9re3Rz-dRTft-btLnvB-o3PQ5p-nU34U-VCRL7q-YhCTjj-L44FTc-Ke5mko-L1vxdJ-KehBhx-7BjC9n-7xSGtH-pEfb5f-2pqSst-7Xhhmq-o8u7ja-pWNcy2-KSCBjW">image</a> by <a href="https://www.flickr.com/photos/spanginator/">Christine</a></p><p><a href="/how-we-made-our-dns-stack-3x-faster/">How we made our DNS stack 3x faster</a></p><p>We answer around 1 million authoritative DNS queries per second using a custom software stack. Responding to those queries as quickly as possible is why Cloudflare is the <a href="http://www.dnsperf.com/">fastest</a> authoritative DNS provider on the Internet. This blog post details how we made our stack even faster.</p><p><a href="/quantifying-the-impact-of-cloudbleed/">Quantifying the Impact of "Cloudbleed"</a></p><p>On February 18 a serious security bug was reported to Cloudflare. Five days <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">later</a> we released details of the problem and six days after that we posted this analysis of the impact.</p><p><a href="/luajit-hacking-getting-next-out-of-the-nyi-list/">LuaJIT Hacking: Getting next() out of the NYI list</a></p><p>We make extensive use of <a href="http://luajit.org/">LuaJIT</a> when processing our customers' traffic, and making it faster is a key goal. In the past, we've sponsored the project and everyone benefits from those contributions.
This blog post examines getting one specific function JITted correctly for additional speed.</p><p><a href="/privacy-pass-the-math/">Privacy Pass: The Math</a></p><p>The <a href="https://privacypass.github.io/">Privacy Pass</a> project provides a zero-knowledge way of proving to a service like Cloudflare that you are a legitimate user, without revealing who you are. This detailed blog post explains the mathematics behind authenticating a user without knowing their identity.</p><p><a href="/how-and-why-the-leap-second-affected-cloudflare-dns/">How and why the leap second affected Cloudflare DNS</a></p><p>The year started with a bang for some engineers at Cloudflare when we ran into a bug in our custom DNS server, <a href="/tag/rrdns/">RRDNS</a>, caused by the introduction of a <a href="https://en.wikipedia.org/wiki/Leap_second">leap second</a> at midnight UTC on January 1, 2017. This blog explains the error and why it happened.</p><p>There's <a href="https://datacenter.iers.org/web/guest/eop/-/somos/5Rgv/latest/16">no leap second</a> this year.</p> ]]></content:encoded>
            <category><![CDATA[Year in Review]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[LavaRand]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Geo Key Manager]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Privacy Pass]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Bots]]></category>
            <guid isPermaLink="false">1PeYMnOt6XNE3dqjt2m9aq</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Wants to Buy Your Meetup Group Pizza]]></title>
            <link>https://blog.cloudflare.com/cloudflare-wants-to-buy-your-meetup-group-pizza/</link>
            <pubDate>Fri, 10 Nov 2017 15:00:00 GMT</pubDate>
            <description><![CDATA[ If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza.  ]]></description>
            <content:encoded><![CDATA[ <p>If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza. If your group’s focus falls within one of the subject categories below and you’re willing to give us a 30-second shout out and tweet a photo of your group and <a href="https://twitter.com/Cloudflare">@Cloudflare</a>, your meetup’s pizza expense will be reimbursed.</p><p><a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit?usp=sharing">Get Your Pizza $ Reimbursed »</a></p>
    <div>
      <h3>Developer Relations at Cloudflare &amp; why we’re doing this</h3>
      <a href="#developer-relations-at-cloudflare-why-were-doing-this">
        
      </a>
    </div>
    <p>I’m <a href="https://twitter.com/fitchaj">Andrew Fitch</a> and I work on the Developer Relations team at <a href="https://www.cloudflare.com/">Cloudflare</a>. One of the things I like most about working in DevRel is empowering community members who are already doing great things out in the world. Whether they’re starting conferences, hosting local meetups, or writing educational content, I think it’s important to support them in their efforts and reward them for doing what they do. Community organizers are the glue that holds developers together socially. Let’s support them and make their lives easier by taking care of the pizza part of the equation.</p>
            <figure>
            <a href="https://www.cloudflare.com/apps/">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LFJE5nWDC74sln8mrtauZ/de98413ff1b6f611461c5c4649444c60/FingerPhotography-Cloudfare-8733.jpg" />
            </a>
            </figure>
    <div>
      <h4>What’s in it for Cloudflare?</h4>
      <a href="#whats-in-it-for-cloudflare">
        
      </a>
    </div>
    <ol><li><p>We want web developers to target the <a href="https://www.cloudflare.com/apps/">apps platform</a></p></li><li><p>We want more people to think about <a href="https://www.cloudflare.com/careers/">working at Cloudflare</a></p></li><li><p>Some people only know of <a href="https://www.cloudflare.com/">Cloudflare</a> as a CDN, but it’s actually a security company and does much more than that. General awareness helps tell the story of what Cloudflare is about.</p></li></ol>
    <div>
      <h4>What kinds of groups we most want to support</h4>
      <a href="#what-kinds-of-groups-we-most-want-to-support">
        
      </a>
    </div>
    <p>We want to work with groups that are closely aligned with Cloudflare, groups which focus on web development, web security, devops, or tech ops. We often work with language-specific meetups (such as Go London User Group) or diversity &amp; inclusion meetups (such as Women Who Code Portland). We’d also love to work with college student groups and any other coding groups which are focused on sharing technical knowledge with the community.</p><p>To get sponsored pizza, groups must be focused on...</p><ul><li><p>Web performance &amp; optimization</p></li><li><p>Front-end frameworks</p></li><li><p>Language-specific (JavaScript / PHP / Go / Lua / etc.)</p></li><li><p>Diversity &amp; inclusion meetups for web devs / devops / tech ops</p></li><li><p>Workshops / classes for web devs / devops / tech ops</p></li></ul>
    <div>
      <h4>How it works</h4>
      <a href="#how-it-works">
        
      </a>
    </div>
    <ol><li><p>Interested groups need to read our <a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit">Cloudflare Meetup Group Pizza Reimbursement Rules document</a>.</p></li><li><p>If the group falls within the requirements, the group should schedule their meetup and pull up the <a href="https://docs.google.com/presentation/d/1afluXpf2eYeQ7SkmV2ONrojeZwosMTNifhnZR1chNTw/edit?usp=sharing">Cloudflare Introductory Slides</a> ahead of time.</p></li><li><p>At the event, an organizer should present the Cloudflare Introductory Slides at the beginning, giving us a 30-second shout out, and take a photo of the group in front of the slides. Someone from the group should tweet the photo, tagging @Cloudflare, so our Community Manager may retweet it.</p></li><li><p>After the event, an organizer will need to fill out our <a href="https://docs.google.com/forms/d/e/1FAIpQLSe7iY-mjum-hfEtu8W08uf8e6H5IMDevRgCZ_YXtzaEwkLEag/viewform">reimbursement form</a>, where they will upload a completed <a href="https://www.irs.gov/pub/irs-pdf/fw9.pdf">W-9</a> and provide their mailing address, group details, and a link to the tweet.</p></li><li><p>Within a month, the organizer should have a check from Cloudflare (and some laptop stickers for their group, if they want) in their hands.</p></li></ol>
    <div>
      <h4>Here are some examples of groups we’ve worked with so far</h4>
      <a href="#here-are-some-examples-of-groups-weve-worked-with-so-far">
        
      </a>
    </div>
    <p><a href="https://www.meetup.com/preview/Go-London-User-Group"><b>Go London User Group (GLUG)</b></a></p>
            <figure>
            <a href="https://twitter.com/LondonGophers/status/920353032746995712">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oWPqXK3ZLOH0BA3mbJO7M/ddc7337f74e01171d7e91655dfa35f0e/Screen-Shot-2017-11-01-at-2.32.00-PM.png" />
            </a>
            </figure><p>The Go London User Group is a London-based community for anyone interested in the Go programming language.</p><p>GLUG provides offline opportunities to:</p><ul><li><p>Discuss Go and related topics</p></li><li><p>Socialize with friendly people who are interested in Go</p></li><li><p>Find or fill Go-related jobs</p></li></ul><p>They want GLUG to be a diverse and inclusive community. As such, all attendees, organizers and sponsors are required to follow the <a href="https://golang.org/conduct">community code of conduct</a>.</p><p>This group’s mission is very much in alignment with Cloudflare’s guidelines, so we sponsored the pizza for their <a href="https://www.meetup.com/preview/Go-London-User-Group/events/243800263">October Gophers</a> event.</p><p><a href="https://www.meetup.com/preview/Women-Who-Code-Portland"><b>Women Who Code Portland</b></a></p>
            <figure>
            <a href="https://twitter.com/superissarah/status/923370521684688896">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1t1S26RRHOCUC2b3YwEAMD/5aec722ff115aab5edbb7da5700bee35/Screen-Shot-2017-11-01-at-2.37.09-PM.png" />
            </a>
            </figure><p>Women Who Code’s mission statement:</p><p>Women Who Code is a global nonprofit dedicated to inspiring women to excel in technology careers by creating a global, connected community of women in technology.</p><p>What they offer:</p><ul><li><p>Monthly Networking Nights</p></li><li><p>Study Nights (JavaScript, React, DevOps, Design + Product, Algorithms)</p></li><li><p>Technical Workshops</p></li><li><p>Community Events (Hidden Figures screening, Women Who Strength Train, Happy Hours, etc.)</p></li><li><p>Free or discounted tickets to conferences</p></li></ul><p>Women Who Code is a great example of an inclusive technical group. Cloudflare reimbursed them for their pizza at their <a href="https://www.meetup.com/preview/Women-Who-Code-Portland/events/242929022">October JavaScript Study Night</a>.</p><p><a href="https://www.meetup.com/preview/PHX-Android"><b>GDG Phoenix / PHX Android</b></a></p>
            <figure>
            <a href="https://twitter.com/GDGPhoenix/status/924314879875465216">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2PyMqJ00eEnGypcoxjmQST/878da7d678babf0a506f8ff099449db8/Screen-Shot-2017-11-01-at-2.41.46-PM.png" />
            </a>
            </figure><p>GDG Phoenix / PHX Android group is for anyone interested in Android design and development. All skill levels are welcome. Attendees can come to the meetup to hang out and watch, or be part of the conversation at events.</p><p>This group is associated with the <a href="https://developers.google.com/groups/">Google Developer Group</a> program and coordinates activities with the national organization. Group activities are not exclusively Android related; they can cover any technology.</p><p>Cloudflare sponsored their October event “<a href="https://www.meetup.com/preview/PHX-Android/events/242897401">How espresso works . . . Android talk with Michael Bailey</a>”.</p><hr /><p>I hope a lot of awesome groups around the world will take advantage of our offer. <b>Enjoy the pizza!</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5kKeJF3uYWwJyAkZ67TmQg/1f948496f05932879b8165e87730e456/giphy-downsized.gif" />
            
            </figure><p><a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit?usp=sharing">Get Your Pizza $ Reimbursed »</a></p><p>© 2017 Cloudflare, Inc. All rights reserved. The Cloudflare logo and marks are trademarks of Cloudflare. All other company and product names may be trademarks of the respective companies with which they are associated.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[MeetUp]]></category>
            <category><![CDATA[Cloudflare Meetups]]></category>
            <category><![CDATA[Community]]></category>
            <guid isPermaLink="false">5luIi07exijfTyC4t91k0H</guid>
            <dc:creator>Andrew Fitch</dc:creator>
        </item>
        <item>
            <title><![CDATA[ARM Takes Wing:  Qualcomm vs. Intel CPU comparison]]></title>
            <link>https://blog.cloudflare.com/arm-takes-wing/</link>
            <pubDate>Wed, 08 Nov 2017 20:03:14 GMT</pubDate>
            <description><![CDATA[ One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market. Until recently I mostly played with Intel hardware.  ]]></description>
            <content:encoded><![CDATA[ <p>One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.</p><p>Until recently I mostly played with Intel hardware. For example, Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.</p><p>Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeon E5-2630 v4 processors, with 10 cores each, running at 2.2GHz, with a 3.1GHz turboboost and hyper-threading enabled, for a total of 40 threads per server.</p><p>Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such an upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.</p><p>In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with a 3.0GHz turboboost. It also has a smaller last level cache: 1.375 MiB/core, compared to the 2.5 MiB/core of the Broadwell processors. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.</p><p>As we head into 2018, however, change is in the air. 
For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (code name Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the ahm ... ThunderX2 core?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OerKvEoy4NXTwjEpDA1tY/d8d9c4dad28ab5bc7100710b3dcf644e/25704115174_061e907e57_o.jpg" />
            
            </figure><p>The majestic Amberwing powered by the Falkor CPU <a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/drphotomoto/25704115174/in/photolist-FaoiVy-oqK2j2-f6A8TL-6PTBkR-gRasf9-f2R2Hz-7bZeUp-fmxSzZ-o9fogQ-8evb42-f4tgSX-eGzXYi-6umTDd-8evd8H-gdCU2L-uhCbnz-fmxSsX-oxnuko-wb7in9-oqsJSH-uxxAS2-CzS4Eh-6y8KQA-brLjKf-YT2jrY-eGG5QJ-8pLnKt-8eyvgY-cnQqJs-fXYs9f-f2R2jK-28ahBA-fXYjkD-a9K25u-289gvW-PHrqDS-cmkggf-Ff9NXa-EhMcP4-f36dMm-289xP7-Ehrz1y-f2QZRZ-GqT3vt-uUeHBq-xUDQoa-ymMxE9-wWFi3q-MDva8W-8CWG5X">image</a> by <a href="https://www.flickr.com/photos/drphotomoto/">DrPhotoMoto</a></p><p>Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t48heslIVcJiaqGTuxwhl/a968a4b7f0a6a5dd25fc8481029db354/IMG_7222.jpg" />
            
            </figure><p>The actual Amberwing in question</p>
    <div>
      <h3>Overview</h3>
      <a href="#overview">
        
      </a>
    </div>
    <p>I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.</p><table><tr><td><p><b>Platform</b></p></td><td><p><b>Grantley
(Intel)</b></p></td><td><p><b>Purley
(Intel)</b></p></td><td><p><b>Centriq
(Qualcomm)</b></p></td></tr><tr><td><p>Core</p></td><td><p>Broadwell</p></td><td><p>Skylake</p></td><td><p>Falkor</p></td></tr><tr><td><p>Process</p></td><td><p>14nm</p></td><td><p>14nm</p></td><td><p>10nm</p></td></tr><tr><td><p>Issue</p></td><td><p>8 µops/cycle</p></td><td><p>8 µops/cycle</p></td><td><p>8 instructions/cycle</p></td></tr><tr><td><p>Dispatch</p></td><td><p>4 µops/cycle</p></td><td><p>5 µops/cycle</p></td><td><p>4 instructions/cycle</p></td></tr><tr><td><p># Cores</p></td><td><p>10 x 2S + HT (40 threads)</p></td><td><p>12 x 2S + HT (48 threads)</p></td><td><p>46</p></td></tr><tr><td><p>Frequency</p></td><td><p>2.2GHz (3.1GHz turbo)</p></td><td><p>2.1GHz (3.0GHz turbo)</p></td><td><p>2.5 GHz</p></td></tr><tr><td><p>LLC</p></td><td><p>2.5 MB/core</p></td><td><p>1.35 MB/core</p></td><td><p>1.25 MB/core</p></td></tr><tr><td><p>Memory Channels</p></td><td><p>4</p></td><td><p>6</p></td><td><p>6</p></td></tr><tr><td><p>TDP</p></td><td><p>170W (85W x 2S)</p></td><td><p>170W (85W x 2S)</p></td><td><p>120W</p></td></tr><tr><td><p>Other features</p></td><td><p>AES
CLMUL
AVX2</p></td><td><p>AES
CLMUL
AVX512</p></td><td><p>AES
CLMUL
NEON
Trustzone
CRC32</p></td></tr></table><p>Overall, on paper, Falkor looks very competitive. In theory a Falkor core can process 8 instructions/cycle, the same as Skylake or Broadwell, and it has a higher base frequency at a lower TDP rating.</p>
    <div>
      <h3>Ecosystem readiness</h3>
      <a href="#ecosystem-readiness">
        
      </a>
    </div>
    <p>Up until now, a major obstacle to the deployment of ARM servers was missing or weak support from the majority of software vendors. In the past two years, ARM’s enablement efforts have paid off, as most Linux distros, as well as most popular libraries, now support the 64-bit ARM architecture. Driver availability, however, is unclear at this point.</p><p>At Cloudflare, we run a complex software stack that consists of many integrated services, and running each of them efficiently is a top priority.</p><p>On the edge we have the NGINX server software, which does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, so solid C compiler support is very important.</p><p>In addition, our flavor of NGINX is highly integrated with the <a href="https://github.com/openresty/lua-nginx-module">lua-nginx-module</a>, and we rely a lot on <a href="/pushing-nginx-to-its-limit-with-lua/">LuaJIT</a>.</p><p>Finally, a lot of our services, such as our DNS server, <a href="/what-weve-been-doing-with-go/#sts=RRDNS">RRDNS</a>, are written in Go.</p><p>The good news is that both gcc and clang not only support ARMv8 in general, but have optimization profiles for the Falkor core.</p><p>Go has official support for ARMv8 as well, and the arm64 backend improves constantly.</p><p>As for LuaJIT, the stable version, 2.0.5, does not support ARMv8, but the beta version, 2.1.0, does. Let’s hope it gets out of beta soon.</p>
    <div>
      <h3>Benchmarks</h3>
      <a href="#benchmarks">
        
      </a>
    </div>
    
    <div>
      <h4>OpenSSL</h4>
      <a href="#openssl">
        
      </a>
    </div>
    <p>The first benchmark I wanted to perform was OpenSSL version 1.1.1 (development version), using the bundled <code>openssl speed</code> tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well optimized assembly code paths for both ARMv8 and the latest Intel processors.</p><p>In my opinion, handcrafted assembly is the best measure of a CPU’s potential, as it bypasses compiler bias.</p>
    <div>
      <h4>Public key cryptography</h4>
      <a href="#public-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5jqlU0sjyAGrZaT2Y1x8JR/9fc0bccc931f0e8f8cbf1c8fcd2bdecf/pub_key_1_core-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19nPDhEQXutllHbY08nO6O/0da82dc7260b6dbd732cc3cf79455ba8/pub_key_all_core-2.png" />
            
            </figure><p>Public key cryptography is all about raw ALU performance. It is interesting, but not surprising, to see that in the single core benchmark, the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.</p><p>Falkor is at a disadvantage here. First, in a single core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly, the ARMv8 instruction set does not have a single instruction that produces the full 128-bit result of a 64-bit multiplication; instead it uses a pair of MUL and UMULH instructions.</p><p>Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at RSA2048 signatures, and only because RSA2048 does not have an optimized implementation for ARM. The <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">ECDSA</a> performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.</p><p>It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single core benchmark and having only 20% more cores. This can be explained by more efficient all-core turbo and improved hyper-threading.</p>
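<p>To make the carry-chain point concrete, here is a sketch in Go (my own illustration, not the actual OpenSSL or Go code paths, which are hand-written assembly) of the inner loop of big-number addition using math/bits. Each limb addition depends on the carry out of the previous one, which is exactly the dependency that ADCX/ADOX let Intel cores split into two independent chains:</p>

```go
package main

import (
	"fmt"
	"math/bits"
)

// add256 adds two 256-bit integers stored as four little-endian
// 64-bit limbs. Note the serial dependency: every bits.Add64 call
// consumes the carry produced by the previous one. ADCX/ADOX allow
// two such chains to run in parallel; ARMv8's single carry flag
// allows only one.
func add256(a, b [4]uint64) (sum [4]uint64, carry uint64) {
	sum[0], carry = bits.Add64(a[0], b[0], 0)
	sum[1], carry = bits.Add64(a[1], b[1], carry)
	sum[2], carry = bits.Add64(a[2], b[2], carry)
	sum[3], carry = bits.Add64(a[3], b[3], carry)
	return
}

func main() {
	a := [4]uint64{^uint64(0), ^uint64(0), 0, 0} // 2^128 - 1
	b := [4]uint64{1, 0, 0, 0}                   // 1
	sum, carry := add256(a, b)
	fmt.Println(sum, carry) // prints "[0 0 1 0] 0": the carry ripples into limb 2
}
```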
    <div>
      <h4>Symmetric key cryptography</h4>
      <a href="#symmetric-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5c1gb11NBOBVq2PyFyxHIJ/05ddba2c522eecce2fe1fceca405c3ac/sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3HzScBUWaLPW2l7t0bvsuo/8c235d691f67a6fd9bbfe2176e3037d4/sym_key_all_core.png" />
            
            </figure><p>Symmetric key performance of the Intel cores is outstanding.</p><p>AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carryless multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and every generation since has improved their performance. ARM introduced a set of similar instructions only recently, with their 64-bit instruction set, and as an optional extension. Fortunately, every hardware vendor I know of has implemented them. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.</p><p>ChaCha20-Poly1305 is a more generic algorithm, designed in such a way as to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead lessens, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz; keep that in mind if you are mixing AVX-512 and other code.</p><p>The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all the other crypto algorithms combined.</p>
    <div>
      <h3>Compression</h3>
      <a href="#compression">
        
      </a>
    </div>
    <p>The next benchmark I wanted to see was compression. This is for two reasons. First, it is a very important workload on the edge, as better compression saves bandwidth and helps deliver content faster to the client. Second, it is a very demanding workload, with a high rate of branch mispredictions.</p><p>Obviously the first benchmark would be the popular zlib library. At Cloudflare, we use an <a href="/cloudflare-fights-cancer/">improved version of the library</a>, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry: with little effort I <a href="https://github.com/cloudflare/zlib/tree/vlad/aarch64">adapted the library</a> to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. As a result, it is twice as fast as the generic library for some files.</p><p>The second benchmark is the emerging brotli library. It is written in C, which allows for a level playing field across all platforms.</p><p>All the benchmarks are performed on the HTML of <a href="/">blog.cloudflare.com</a>, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, also similar to the way NGINX works.</p>
    <div>
      <h4>gzip</h4>
      <a href="#gzip">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2pi68sRQQHPJLXZO5vIbhW/72c6e8568f4d95e3e8d592dd72577ea8/gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S2VrG4fcrmbDUeOtJ453S/bbc5092e3565339bb7cb44ba7c8dd14d/gzip_all_core.png" />
            
            </figure><p>When using gzip, at the single core level Skylake is the clear winner. Despite having lower frequency than Broadwell, it seems that having lower penalty for branch misprediction helps it pull ahead. The Falkor core is not far behind, especially with lower quality settings. At the system level Falkor performs significantly better, thanks to the higher core count. Note how well gzip scales on multiple cores.</p>
    <div>
      <h4>brotli</h4>
      <a href="#brotli">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LA2CLLkYa9iCe1eQzgs6x/1ff12aa354d80166407fe4f9fd538d78/brot_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cuPwzCY4eJzgsOXmNrLXu/f7be60b7ac3497c13b3c0a2abe6ac522/brot_all_core.png" />
            
            </figure><p>With brotli on a single core, the situation is similar. Skylake is the fastest, but Falkor is not far behind, and at quality setting 9, Falkor is actually faster. Brotli at quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).</p><p>When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At levels 7 and 8 we start seeing lower performance per core, bottoming out at level 9, where we get less than 3x the single-core performance even when running on all cores.</p><p>My understanding is that at those quality levels brotli consumes significantly more memory, and starts thrashing the cache. The scaling improves again at levels 10 and 11.</p><p>The bottom line for brotli: Falkor wins, since we would not consider going above quality 7 for dynamic compression.</p>
    <div>
      <h3>Golang</h3>
      <a href="#golang">
        
      </a>
    </div>
    <p>Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.</p>
    <div>
      <h4>Go crypto</h4>
      <a href="#go-crypto">
        
      </a>
    </div>
    <p>I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64U9LsIbiAKp7Ut75mMmhj/c73c1142233e260e8733b079dac06ff8/go_pub_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5e9MFqm53p78KdB211Z8rZ/bc16992088b6aa826819bec645f99581/go_pub_key_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TZyK91xQAJk4yfvYLnVjZ/c33931e0ead0f425c157746b9d637fc5/go_sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S2LKr04vDIi1WrrZNaJn1/827505e17425e0521142cb397e209504/go_sym_key_all_core.png" />
            
            </figure><p>As far as Go crypto is concerned, ARM and Intel are not even on the same playing field. Go has very optimized assembly code for ECDSA, AES-GCM and Chacha20-Poly1305 on Intel. It also has Intel optimized math functions, used in RSA computations. All those are missing for ARMv8, putting it at a big disadvantage.</p><p>Nevertheless, the gap can be bridged with relatively small effort, and we know that with the right optimizations, performance can be on par with OpenSSL. Even a very minor change, such as implementing the function <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> in assembly, led to an over tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake, with 8,009 signatures/second.</p><p>Another interesting thing to note is that on Skylake, the Go Chacha20-Poly1305 code, which uses AVX2, performs almost identically to the OpenSSL AVX-512 code; this is again due to AVX2 running at higher clock speeds.</p>
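<p>For context, addMulVVW is the multiply-accumulate primitive at the heart of math/big's modular multiplication: it adds the product of a limb vector and a single word into an accumulator vector. A portable Go equivalent (a sketch of what the generic code does, not the actual source) looks like this, and every iteration needs a full 64x64-&gt;128 multiply plus two carry propagations, which is why one small assembly routine moves RSA performance so much:</p>

```go
package main

import (
	"fmt"
	"math/bits"
)

// addMulVVW computes z += x*y, where z and x are little-endian
// 64-bit limb vectors and y is a single word, returning the final
// carry. On ARMv8 the 128-bit product is assembled from a MUL/UMULH
// instruction pair.
func addMulVVW(z, x []uint64, y uint64) (carry uint64) {
	for i := range z {
		hi, lo := bits.Mul64(x[i], y)
		lo, c := bits.Add64(lo, z[i], 0) // add the accumulator limb
		hi += c
		lo, c = bits.Add64(lo, carry, 0) // add the incoming carry
		hi += c
		z[i] = lo
		carry = hi
	}
	return carry
}

func main() {
	z := []uint64{1, 0}
	addMulVVW(z, []uint64{5, 0}, 7) // z += [5,0] * 7
	fmt.Println(z)                  // prints "[36 0]"
}
```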
    <div>
      <h4>Go gzip</h4>
      <a href="#go-gzip">
        
      </a>
    </div>
    <p>Next in Go performance is gzip. Here again we have a reference point to pretty well optimized code, and we can compare it to Go. In the case of the gzip library, there are no Intel specific optimizations in place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GbKt2HcdJT5eNMdfsLw4d/7c55522afbfa58653a91c28479de8c3d/go_gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59TTlWALNqIkryEc7IwOnB/73ee80ab27313c3f55d3cb22c06b6704/go_gzip_all_core.png" />
            
            </figure><p>Gzip performance is pretty good. The single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, and lags behind Skylake. Since we already know that Falkor outperforms both when C is used, it can only mean that Go’s backend for ARMv8 is still pretty immature compared to gcc.</p>
    <div>
      <h4>Go regexp</h4>
      <a href="#go-regexp">
        
      </a>
    </div>
    <p>Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the built-in benchmarks on 32KB strings.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2AhmqRQhO37Z03KXfz9PUk/9b39646f7e800c9a7c8565134d516815/go_regexp_easy_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PH0IDtvODvYcGbECgOvEt/fbb7c4a099ad2b478aa12b8f3b7fb0a0/go_regexp_easy_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eAhDauk00ZR4A37NPqQm5/8f1c94f9d34087b57b8672a4928e76a7/go_regexp_comp_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ytsG1skhi4B1v24tqlrWw/d0d803c543a990854bd60916ed03b089/go_regexp_comp_all_core.png" />
            
            </figure><p>Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to the higher core count, but Skylake is significantly faster still.</p><p>Some profiling shows that a lot of the time is spent in the function bytes.IndexByte. This function has an assembly implementation for amd64 (runtime.indexbytebody), but only a generic implementation on ARMv8. The easy regexp tests spend most of their time in this function, which explains the even wider gap there.</p>
    <div>
      <h4>Go strings</h4>
      <a href="#go-strings">
        
      </a>
    </div>
    <p>Another important library for a web server is the Go strings library. I only tested the basic Replacer class here.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/G20n7Ru5izVdUBOMM8af7/9a257ef47b5a38e3f745a22da7f36da5/go_str_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2W6GTABFY1t23YOIhyLc1A/ce51cb7081718a724b797f39e7ff80e2/go_str_all_core.png" />
            
            </figure><p>In this test again, Falkor lags behind, and loses even to Broadwell. Profiling shows significant time is spent in the function runtime.memmove. Guess what? It has highly optimized assembly code for amd64 that uses AVX2, but only very simple ARM assembly that copies 8 bytes at a time. By changing three lines in that code to use the LDP/STP instructions (load pair/store pair) and copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.</p>
    <div>
      <h4>Go conclusion</h4>
      <a href="#go-conclusion">
        
      </a>
    </div>
    <p>Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. It seems like the enablement effort so far was concentrated on the compiler back end, while the standard library was left largely untouched. There is a lot of low-hanging optimization fruit out there, as my 20-minute fix for <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> clearly shows. Qualcomm and other ARMv8 vendors intend to put significant engineering resources into amending this situation, but really anyone can contribute to Go. So if you want to leave your mark, now is the time.</p>
    <div>
      <h3>LuaJIT</h3>
      <a href="#luajit">
        
      </a>
    </div>
    <p>Lua is the glue that holds Cloudflare together.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eTdeElawEbvCLpjM9XMUa/8527a83cf68acc1e6a48af94860b190c/luajit_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ohY0uFyotdk0eJt9lU725/472f32803de7ecd6e2d95f27948aa85a/luajit_all_cores.png" />
            
            </figure><p>Except for the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks and is almost tied in a third.</p><p>That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.</p>
    <div>
      <h3>NGINX</h3>
      <a href="#nginx">
        
      </a>
    </div>
    <p>For the NGINX workload, I decided to generate a load that would resemble an actual server.</p><p>I set up a server that serves the HTML file used in the gzip benchmark, over https, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.</p><p>It also uses LuaJIT to redirect the incoming request, remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli with quality 5.</p><p>Each server was configured to work with as many workers as it has virtual CPUs. 40 for Broadwell, 48 for Skylake and 46 for Falkor.</p><p>As the client for this test, I used the <a href="https://github.com/rakyll/hey">hey</a> program, running from 3 Broadwell servers.</p><p>Concurrently with the test, we took power readings from the respective BMC units of each server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3d3OUyJ1m0q11IoWUNCyTp/5cde6f742d596d71d14e5f1355a4b570/nginx.png" />
            
            </figure><p>With the NGINX workload, Falkor handled almost the same number of requests as the Skylake server, and both significantly outperform Broadwell. The power readings, taken from the BMC, show that it did so while consuming less than half the power of the other processors. That means Falkor managed 214 requests/watt vs Skylake’s 99 requests/watt and Broadwell’s 77.</p><p>I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given that both are manufactured with the same process, and Skylake has more cores.</p><p>The low power consumption of Falkor is not surprising: Qualcomm processors are known for their great power efficiency, which has allowed them to be a dominant player in the mobile phone CPU market.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level, the performance becomes very attractive.</p><p>The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% of performance.</p><p>Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come with both a big price tag and over 200W TDP, whereas we are interested in improving our bang-for-buck metric and performance per watt.</p><p>Currently, my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.</p><p>Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.</p><p>The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the Go benchmark). In comparison, Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.</p><p><i>If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come </i><a href="https://www.cloudflare.com/careers/"><i>join us</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">3Rk7Ip66PBd0Hn31OM4NuO</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Helping to make LuaJIT faster]]></title>
            <link>https://blog.cloudflare.com/helping-to-make-luajit-faster/</link>
            <pubDate>Thu, 19 Oct 2017 07:38:25 GMT</pubDate>
            <description><![CDATA[ Programming language Virtual Machines (VMs) are familiar beasts: we use them to run apps on our phone, code inside our browsers, and programs on our servers.  ]]></description>
            <content:encoded><![CDATA[ <p><i>This is a guest post by </i><a href="http://tratt.net/laurie/"><i>Laurence Tratt</i></a><i>, who is a programmer and Reader in Software Development in the </i><a href="http://www.kcl.ac.uk/nms/depts/informatics/"><i>Department of Informatics</i></a><i> at </i><a href="http://www.kcl.ac.uk/"><i>King's College London</i></a><i> where he leads the </i><a href="http://soft-dev.org/"><i>Software Development Team</i></a><i>. He is also an </i><a href="http://soft-dev.org/projects/lecture/"><i>EPSRC Fellow</i></a><i>.</i></p><p>Programming language Virtual Machines (VMs) are familiar beasts: we use them to run apps on our phone, code inside our browsers, and programs on our servers. Traditional VMs are useful and widely used: nearly every working programmer is familiar with one or more of the “standard” Lua, Python, or Ruby VMs. However, such VMs are simplistic, containing only an interpreter (a simple implementation of a language). These often can’t run our programs as fast as we need; and, even when they can, they often waste huge amounts of server CPU time. We sometimes forget that servers consume a large, and growing, chunk of the world’s electricity output: slow language implementations are, quite literally, changing the world, and not in a good way.</p><p>More advanced VMs come with Just-In-Time (JIT) compilers (well known examples include LuaJIT, HotSpot (aka “the JVM”), PyPy, and V8). Such VMs observe a program’s run-time behaviour and use that to compile frequently executed parts of the program into machine code. When this works well, it leads to significant speed-ups (2x-10x is common; and 100x is not unknown). We all want to make these VMs even better than they currently are, but doing so is easier said than done.</p><p>Internally, VMs with JIT compilers are ferociously complex beasts, with many moving parts that interact in subtle ways that are hard to reason about. 
For example, any VM developer worth their salt will have tales about “deoptimisation bugs”: that is, bugs that occur when the optimised machine code is no longer useful and the VM has to fall back to a less optimised, but more general, version of the program (e.g. an interpreter). Have you ever thought about how to pick apart an inlined function’s stack? Perhaps it sounds quite easy. What happens when inlining has allowed the JIT compiler to remove memory allocations: how should the “missing” allocations be identified and “restored”? Even in just this single part of a VM, the complexity quickly becomes mind-boggling.</p><p>What does this have to do with Cloudflare or me? Well, as many of you know, Cloudflare is a heavy user of LuaJIT and they would like LuaJIT to perform even better than it does today. Although Cloudflare has developers with a deep understanding of LuaJIT, they’ve long wanted to give more back to the open-source community. Unfortunately, finding someone who’s able and willing to work on such a complicated program isn’t easy, to say the least. I suggested that our research group – the <a href="http://soft-dev.org/">Software Development Team</a> at King's College London – might provide a fruitful alternative route. Happily, Cloudflare agreed, and gave us funding for a project looking to improve LuaJIT's performance that started in late August.</p><p>Why our team? Well, partly because we've done a lot of VM work from <a href="http://tratt.net/laurie/research/pubs/html/bolz_diekmann_tratt__storage_strategies_for_collections_in_dynamically_typed_languages/">data-structure optimisation</a> to <a href="http://soft-dev.org/pubs/html/barrett_bolz_diekmann_tratt__fine_grained_language_composition/">language composition</a> to <a href="http://soft-dev.org/pubs/html/barrett_bolz-tereick_killick_mount_tratt__virtual_machine_warmup_blows_hot_and_cold_v6/">benchmarking</a>, which helps us get people new to the field up and running quickly. 
Partly, I like to think, because we're open-minded about the projects we use as part of our research. Although we've not done a huge amount of direct LuaJIT research, we've always tried to stay abreast of its development (e.g. leading to us inviting <a href="https://www.youtube.com/watch?v=EaLboOUG9VQ&amp;list=PLJq3XDLIJkib2h2fObomdFRZrQeJg4UIW">Vyacheslav Egorov to talk about LuaJIT</a> at the <a href="http://soft-dev.org/events/vmss16/">VM Summer School</a> we ran in 2016), and we've often used it as a key part of cross-VM benchmarking.</p><p>Personally I've been aware of LuaJIT for many years: I submitted a very, very minor patch to make it run on OpenBSD in 2012 so that I could try out this VM that I'd heard others rave about; I used LuaJIT's clever assembly system in a small toy project; and I've included LuaJIT in more than one benchmarking suite (e.g. <a href="http://tratt.net/laurie/research/pubs/html/bolz_tratt__the_impact_of_metatracing_on_vm_design_and_implementation/">this paper</a>). My first impression was that LuaJIT has an astonishingly small codebase and astonishingly fast warmup (roughly speaking, the time from the program starting to final machine code being generated). Both those points remain true today, and are a testament to Mike Pall’s vision and coding abilities. However, benchmarking LuaJIT against other VMs has shown weaknesses, which, I’ve come to realise, are widely known by those using LuaJIT on large systems. In a nutshell, larger programs exhibit substantial performance differences from one run to the next, and some programs don’t run as fast over long periods as other VMs are capable of. I don’t see this as a criticism – every VM I’m familiar with has strengths and weaknesses – but rather as an opportunity to see if we can make things even better.</p><p>How are we going to go about trying to improve LuaJIT? 
First, we had one of those strokes of luck you can normally only dream about: after conversations with some friendly LuaJIT insiders, I was very lucky to tempt Thomas Fransham, a LuaJIT expert, to work with us at King’s. Tom’s deep understanding of LuaJIT’s internals means we have someone who can immediately start turning ideas into reality. Second, we’re making use of our multi-year work on VM benchmarking which recently saw the light of day as the <a href="http://soft-dev.org/pubs/html/barrett_bolz-tereick_killick_mount_tratt__virtual_machine_warmup_blows_hot_and_cold_v6/">Virtual Machine Warmup Blows Hot and Cold</a> paper and the <a href="http://soft-dev.org/src/krun/">Krun</a> and <a href="http://soft-dev.org/src/warmup_stats/">warmup_stats</a> systems. To cut a long and dense paper short, we found that, even in an idealised setting, widely studied benchmarks often don’t warm-up as they’re supposed to on well known VMs. Sometimes they get slower over time; sometimes they never stabilise; sometimes they’re inconsistent from one execution to the next. Even in the best case, we found that, across all the VMs we studied, only 43.5% of cases warmed up as they were supposed to (at 51%, LuaJIT was a little better than some VMs). While this is somewhat embarrassing news for those of us who develop VMs, it’s forced me to make two hypotheses (which is a fancy word for “gut feeling”) which we’ll test in this project: that VMs have heuristics (e.g. “when to compile code”) which interact in ways we no longer understand; and that rigorous benchmarking before and after changes (whether those changes add or remove code) is the only way to make performance better and more predictable.</p><p>As this might suggest, we’re not aiming to “just” improve LuaJIT. This project will also naturally help us advance our wider research interests centred around understanding how VM performance can be improved. 
I’m hopeful that what we learn from LuaJIT will, in the long run, also help other VMs improve. Indeed, we have big ideas for what the next generation of VMs might look like, and we’re bound to learn important lessons from this project. Our long-term bet is on meta-tracing: we think we can reduce its currently fearsome warmup costs through a combination of Intel’s newish Processor Tracing feature and some other tricks we have up our sleeves. That then opens up new possibilities to do things like parallel compilation that haven’t really been thought worthwhile before. While LuaJIT isn’t a meta-tracing JIT compiler, it is a tracing JIT compiler and, as the similar name suggests, most concepts carry over from one to the other.</p><p>To say that I’m extremely grateful to Cloudflare for their support is an understatement. It might not be obvious from the outside, but (apart from me) everyone in our research group is supported by external funding: without it, I can’t pay good people like Tom to help move the subject forward. I’m also pleased to say that, before I even asked, Cloudflare made clear that they wanted all the project’s results to be open. Any and all changes we make will be fully open-sourced under LuaJIT’s normal license, so as and when we make improvements to LuaJIT, the whole LuaJIT community will benefit. I also want us, more generally, to be good citizens within the wider LuaJIT community. We know we have a lot to learn from you, from ideas for benchmarking, to concrete experiences of performance problems. Don’t be shy about getting in touch!</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[LUA]]></category>
            <guid isPermaLink="false">4fWU9BxRXsAOjnY1opemQM</guid>
            <dc:creator>Guest Author</dc:creator>
        </item>
        <item>
            <title><![CDATA[LuaJIT Hacking: Getting next() out of the NYI list]]></title>
            <link>https://blog.cloudflare.com/luajit-hacking-getting-next-out-of-the-nyi-list/</link>
            <pubDate>Tue, 21 Feb 2017 13:40:23 GMT</pubDate>
            <description><![CDATA[ At Cloudflare we’re heavy users of LuaJIT and in the past have sponsored many improvements to its performance. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we’re heavy users of LuaJIT and in the past have sponsored many improvements to its performance.</p><p>LuaJIT is a powerful piece of software, arguably one of the highest-performing JITs in the industry. But it’s not always easy to get the most out of it, and sometimes a small change in one part of your code can negatively impact other, already optimized, parts.</p><p>One of the first pieces of advice anyone receives when writing Lua code to run quickly under LuaJIT is “avoid the NYIs”: the language or library features that can’t be compiled because they’re NYI (not yet implemented), and therefore run in the interpreter.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1iZfOn4qOz1IUbak9b2uFF/d13073febe4074591ddb70d51fdfc301/6822250770_d6f8399cf2_z.jpg" />
            
</figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/zengei/6822250770/in/photolist-boRPth-7xeiAm-e6H8w5-dSoqGF-FiYGj-okoVwy-e1ugbE-dXhvop-8fQNCX-974Bts-ekbDjn-p9HV9L-971y78-bdb8Hi-e2Us2u-8fU4pm-9ZFfXg-g469T-7Ft9Jc-9YWa6m-95HYN2-95J19D-5JeU76-95M3qb-PFZRj-9uYpbM-6bJrpt-D2DfZK-Ccqoxs-Jqvvxd-kSQzNX-JyLzNN-dEnx4T-dezBQR-e2NN7a-6aP8op-971xFi-974BBJ-qR8GWh-5De3Sy-974CTy-gyUry-dezAkY-7qHBJ2-eHJH4W-9v2qQs-dCQMt7-nBkbJ7-eZ8yc-e3u6kJ">image</a> by <a href="https://www.flickr.com/photos/zengei/">Dwayne Bent</a></p><p>Another very attractive feature of LuaJIT is the FFI library, which allows Lua code to directly interface with C code and memory structures. The JIT compiler weaves these memory operations in line with the generated machine language, making it much more efficient than using the traditional Lua C API.</p><p>Unfortunately, if for any reason the Lua code using the FFI library has to run under the interpreter, it takes a very heavy performance hit. As it happens, under the interpreter the FFI is usually much slower than the Lua C API or the basic operations. For many people, this means either avoiding the FFI or committing to permanent vigilance to keep the code from falling back to the interpreter.</p>
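            <p>As a small sketch of the kind of FFI code in question (the struct and field names here are purely illustrative):</p>
            <pre><code>local ffi = require('ffi')

-- Declare a C struct; inside a compiled trace, accesses to it
-- become direct machine-level loads and stores.
ffi.cdef[[
typedef struct { double x, y; } point_t;
]]

local p = ffi.new('point_t', 1.5, 2.5)

-- Fast when this runs as part of a compiled trace; the very same
-- access is far slower if it falls back to the interpreter.
local sum = p.x + p.y</code></pre>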
    <div>
      <h3>Optimizing LuaJIT Code</h3>
      <a href="#optimizing-luajit-code">
        
      </a>
    </div>
    <p>Before optimizing any code, it’s important to identify which parts actually matter. It’s useless to discuss the fastest way to add a few numbers before sending some data if the send operation will take a million times longer than the addition. Likewise, there’s no benefit in avoiding NYI features in code like initialization routines that might run only a few times: it’s unlikely that the JIT would even try to optimize them, so they would always run in the interpreter. The interpreter, by the way, is also very fast; even faster than the first version of LuaJIT itself.</p><p>But optimizing the core parts of a Lua program, like any deep inner loops, can yield huge improvements in the overall performance. In similar situations, experienced developers using other languages are used to inspecting the assembly language generated by the compiler, to see if there’s some change to the source code that can make the result better.</p><p>The command line LuaJIT executable provides a bytecode list when running with the <code>-jbc</code> option, a statistical profiler, activated with the <code>-jp</code> option, a trace list with <code>-jv</code>, and finally a detailed dump of all the JIT operations with <code>-jdump</code>.</p><p>The last two provide lots of information very useful for understanding what actually happens to the Lua code while executing, but it can be a lot of work to read the huge lists generated by <code>-jdump</code>. Also, some messages are hard to understand without a fairly complete understanding of how the tracing compiler in LuaJIT actually works.</p><p>One very nice feature is that all these JIT options are implemented in Lua. To accomplish this the JIT provides ‘hooks’ that can execute a Lua function at important moments with the relevant information. Sometimes the best way to understand what some <code>-jdump</code> output actually means is to read the code that generated that specific part of the output.</p>
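            <p>For instance, the hook mechanism that powers <code>-jdump</code> is reachable from plain Lua via <code>jit.attach()</code>, as used in the bundled <code>jit/dump.lua</code> (the callback below only looks at the first two arguments; treat the exact signature as an assumption and check the bundled sources):</p>
            <pre><code>local jit = require('jit')

-- Called by LuaJIT on trace events, in the style of jit/dump.lua.
local function on_trace(what, tr)
  -- 'what' is e.g. "start", "stop" or "abort"; 'tr' is the trace number
  io.write('trace ', tostring(tr), ': ', tostring(what), '\n')
end

jit.attach(on_trace, 'trace')

for i = 1, 1000 do end  -- a hot loop, to provoke at least one trace event

jit.attach(on_trace)    -- calling attach without an event name detaches</code></pre>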
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ofHRCoePgNqk4t0T1HlqM/b682959718d54d07c604c0e084111dd2/6353190489_1363ec7f16_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/kevandotorg/6353190489/in/photolist-aFpLc4-6Rorty-gvhKC9-8UEeqU-aBoHof-9qr8bs-9qo6Ai-9qrafb-FhYA4L-CH4r5G-8HZcMK-pX32R-6SJMc6-pX39e-oXee8D-aBoHBJ-cLgitL-cLgeKY-3THPb-hzKrGL-cLggDJ-6Rh35p-dqkQyJ-7Ey5Pz-6RkwNh-6RgYDe-7C7F9S-6RgxKH-6RkLFN-i4HqqU-6Rktx9-6RkXKQ-6RgRqD-6RkmBs-6RgML2-6RkE4j-6Rg5nT-6RjKkq-6RfDK2-6RfRRc-6RfYfK-6RfUT8-6RjN4d-hzMBtc-hzNnyN-hzJiM8-hzJVpv-hzLE56-hzKU4c-9mQrRy">image</a> by <a href="https://www.flickr.com/photos/kevandotorg/">Kevan</a></p>
    <div>
      <h3>Introducing Loom</h3>
      <a href="#introducing-loom">
        
      </a>
    </div>
    <p>After several rounds of this, frustrated by the limitations of the sequentially generated dump, I decided to write a different version of <code>-jdump</code>: one that gathers more information, processes it, and adds cross-references to show how things are related before displaying anything. The result is <a href="https://github.com/cloudflare/loom">loom</a>, which shows roughly the same information as <code>-jdump</code>, but with more resolved references and formatted in HTML with tables, columns, links and colors. It has helped me a lot to understand my own code and the workings of LuaJIT itself.</p><p>For example, let's consider the following code in a file called <code>twoloops.lua</code>:</p>
            <pre><code>for i=1,1000 do
    for j=1,1000 do
    end
end</code></pre>
            <p>With the <code>-jv</code> option:</p>
            <pre><code>$ luajit -jv twoloops.lua
[TRACE   1 twoloops.lua:2 loop]
[TRACE   2 (1/3) twoloops.lua:1 -&gt; 1]</code></pre>
            <p>This tells us that there were two traces: the first one contains a loop, and the second spawns from exit #3 of the first (the “(1/3)” part) and its endpoint returns to the start of trace #1.</p><p>Ok, let’s get more detail with <code>-jdump</code>:</p>
            <pre><code>$ luajit -jdump twoloops.lua
---- TRACE 1 start twoloops.lua:2
0009  FORL     4 =&gt; 0009
---- TRACE 1 IR
0001    int SLOAD  #5    CI
0002  + int ADD    0001  +1  
0003 &gt;  int LE     0002  +1000
0004 ------ LOOP ------------
0005  + int ADD    0002  +1  
0006 &gt;  int LE     0005  +1000
0007    int PHI    0002  0005
---- TRACE 1 mcode 47
0bcbffd1  mov dword [0x40db1410], 0x1
0bcbffdc  cvttsd2si ebp, [rdx+0x20]
0bcbffe1  add ebp, +0x01
0bcbffe4  cmp ebp, 0x3e8
0bcbffea  jg 0x0bcb0014	-&gt;1
-&gt;LOOP:
0bcbfff0  add ebp, +0x01
0bcbfff3  cmp ebp, 0x3e8
0bcbfff9  jle 0x0bcbfff0	-&gt;LOOP
0bcbfffb  jmp 0x0bcb001c	-&gt;3
---- TRACE 1 stop -&gt; loop

---- TRACE 2 start 1/3 twoloops.lua:1
0010  FORL     0 =&gt; 0005
0005  KSHORT   4   1
0006  KSHORT   5 1000
0007  KSHORT   6   1
0008  JFORI    4 =&gt; 0010
---- TRACE 2 IR
0001    num SLOAD  #1    I
0002    num ADD    0001  +1  
0003 &gt;  num LE     0002  +1000
---- TRACE 2 mcode 81
0bcbff79  mov dword [0x40db1410], 0x2
0bcbff84  movsd xmm6, [0x41704068]
0bcbff8d  movsd xmm5, [0x41704078]
0bcbff96  movsd xmm7, [rdx]
0bcbff9a  addsd xmm7, xmm6
0bcbff9e  ucomisd xmm5, xmm7
0bcbffa2  jb 0x0bcb0014	-&gt;1
0bcbffa8  movsd [rdx+0x38], xmm6
0bcbffad  movsd [rdx+0x30], xmm6
0bcbffb2  movsd [rdx+0x28], xmm5
0bcbffb7  movsd [rdx+0x20], xmm6
0bcbffbc  movsd [rdx+0x18], xmm7
0bcbffc1  movsd [rdx], xmm7
0bcbffc5  jmp 0x0bcbffd1
---- TRACE 2 stop -&gt; 1</code></pre>
            <p>This tells us... well, a lot of things. If you look closely, you’ll see the same two traces, one is a loop, the second starts at <code>1/3</code> and returns to trace #1. Each one shows some bytecode instructions, an IR listing, and the final mcode. There are several options to turn on and off each listing, and more info like the registers allocated to some IR instructions, the “snapshot” structures that allow the interpreter to continue when a compiled trace exits, etc.</p><p>Now using loom:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2gsTyF05jZO2nJd8HvEec8/9deaa76956425f382f1e68aba1e9b13e/image00.png" />
            </figure><p>There’s the source code, with the corresponding bytecodes, and the same two traces, with IR and mcode listings. The bytecode lines on the traces and on the top listings are linked, hovering on some arguments on the IR listing highlights the source and use of each value, the jumps between traces are correctly labeled (and colored), finally, clicking on the bytecode or IR column headers reveals more information: excerpts from the source code and snapshot formats, respectively.</p><p>Writing it was a great learning experience, I had to read the dump script’s Lua sources and went much deeper in the LuaJIT sources than ever before. And then, I was able to use loom not only to analyze and optimize Cloudflare’s Lua code, but also to watch the steps the compiler goes through to make it run fast, and also what happens when it’s not happy.</p>
    <div>
      <h3>The code is the code is the code is the code</h3>
      <a href="#the-code-is-the-code-is-the-code-is-the-code">
        
      </a>
    </div>
    <p>LuaJIT handles up to four different representations of a program’s code:</p><p>First comes the source code, what the developer writes.</p><p>The parser analyzes the source code and produces the bytecode, which is what the interpreter actually executes. It has the same flow as the source code, grouped in functions, with all the calls, iterators, operations, etc. Of course, there’s no nice formatting or comments, the local variable names are replaced by indices, and all constants (other than small numbers) are stored in a separate area.</p><p>When the interpreter finds that a given point of the bytecode has been executed several times, it considers it a “hot” part of the code and interprets it once again, but this time recording each bytecode it encounters, generating a “code trace”, or just “a trace”. At the same time, it generates an “intermediate representation”, or IR, of the code as it’s executed. The IR doesn’t represent the whole of the function or code portion, just the path actually taken.</p><p>A trace is finished when it hits a loop or a recursion, returns to a lower level than where it started, hits a NYI operation, or simply becomes too long. At this point, it can either be compiled into machine language, or aborted if it has reached some code that can’t be correctly translated. If successful, the bytecode is patched with an entry to the machine code, or “mcode”. If aborted, the initial trace point is “penalized” or even “blocklisted” to avoid wasting time trying to compile it again.</p>
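            <p>These representations aren't hidden: the modules behind the <code>-j</code> options can be required directly. For example, the bundled <code>jit.bc</code> module (the one behind <code>-jbc</code>) prints a function's bytecode; a quick sketch:</p>
            <pre><code>local bc = require('jit.bc')

local function f(n)
  local acc = 0
  for i = 1, n do acc = acc + i end
  return acc
end

bc.dump(f)  -- writes f's bytecode listing to stdout</code></pre>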
    <div>
      <h3>What’s <code>next()</code>?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>One of the most visible characteristics of the Lua language is the heavy use of dictionary objects called <i>tables</i>. From the <a href="https://www.lua.org/manual/5.1/manual.html#2.2">Lua manual</a>:</p><p><i>“Tables are the sole data structuring mechanism in Lua; they can be used to represent ordinary arrays, symbol tables, sets, records, graphs, trees, etc.”</i></p><p>To iterate over all the elements in a table, the idiomatic way is to use the standard library function <code>pairs()</code> like this:</p>
            <pre><code>for k, v in pairs(t) do
    -- use the key in ‘k’ and the value in ‘v’
end</code></pre>
            <p>In the standard Lua manual, <code>pairs()</code> is <a href="https://www.lua.org/manual/5.1/manual.html#pdf-pairs">defined</a> as “Returns three values: the <code>next</code> function, the table <code>t</code>, and <code>nil</code>”, so the previous code is the same as:</p>
            <pre><code>for k, v in next, t, nil do
    -- use the key in ‘k’ and the value in ‘v’
end</code></pre>
    <p>But unfortunately, both the <code>next()</code> and <code>pairs()</code> functions are listed as “not compiled” in the feared <a href="http://wiki.luajit.org/NYI#libraries_base-library">NYI list</a>. That means that any such code runs on the interpreter and is not compiled, unless the code inside is complex enough and has other inner loops (loops that don’t use <code>next()</code> or <code>pairs()</code>, of course). Even in that case, the code would have to fall back to the interpreter at each loop end.</p><p>This sad news creates a tradeoff: for performance-sensitive parts of the code, don’t use the most Lua-like code style. That motivates people to come up with several contortions to be able to use numerical iteration (which is compiled, and very efficient), like replacing every key with a number, storing all the keys in a numbered array, or storing both keys and values at even/odd numeric indices.</p>
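            <p>One common shape for that contortion (sketched here from the description above, not taken from any particular codebase) stores keys and values at alternating numeric indices, so a plain numeric <code>for</code>, which compiles nicely, can do the iteration:</p>
            <pre><code>-- Instead of { host = 'example.com', port = 8080 } iterated
-- with pairs(), pack alternating key/value pairs into an array:
local packed = { 'host', 'example.com', 'port', 8080 }

for i = 1, #packed, 2 do
  local k, v = packed[i], packed[i + 1]
  -- use k and v here; the loop compiles, at the cost of a much
  -- less natural data layout
end</code></pre>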
    <div>
      <h3>Getting <code>next()</code> out of the NYI list</h3>
      <a href="#getting-next-out-of-the-nyi-list">
        
      </a>
    </div>
    <p>So, I finally have a non-NYI <code>next()</code> function! I'd like to say "a fully JITtable <code>next()</code> function", but it wouldn't be totally true; as it happens, there's no way to avoid some annoying trace exits on table iteration.</p><p>The purpose of the IR is to provide a representation of the execution path so it can be quickly optimized to generate the final mcode. For that, the IR traces are linear and type-specific, which creates some interesting challenges for iterating over a generic container.</p>
    <div>
      <h3>Traces are linear</h3>
      <a href="#traces-are-linear">
        
      </a>
    </div>
    <p>Being linear means that each trace captures a single execution path: it can't contain conditional code or internal jumps. The only conditional branches are the "guards" that make sure that the code to be executed is the appropriate one. If a condition changes and it must now do something different, the trace must be exited. If this happens several times, it will spawn a side trace and the exit will be patched into a conditional branch. Very nice, but this still means that there can be at most one loop on each trace.</p><p>The implementation of <code>next()</code> has to internally skip over empty slots in the table to only return valid key/value pairs. If we tried to express this in IR code, the skipping would be the "inner" loop and the original loop would be an "outer" one, which doesn't have as many optimization opportunities. In particular, it can't hoist invariant code out of the loop.</p><p>The solution is to do that slot skipping in C. Not using the Lua C API, of course, but the internal IR CALL instruction that is compiled into a "fast" call, using CPU registers for arguments as much as possible.</p>
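            <p>In Lua terms, the helper's job over the array part amounts to something like this (a conceptual model only; the real helper is the C function shown further down, operating on LuaJIT's internal table layout):</p>
            <pre><code>-- Advance past nil holes in the array part. 'arr' stands in for
-- the array slots, 'asize' for its size; a return value equal to
-- 'asize' means the array part is exhausted.
local function next_array_index(arr, asize, k)
  k = k + 1
  while k &lt; asize and arr[k] == nil do
    k = k + 1
  end
  return k
end</code></pre>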
    <div>
      <h3>The IR is in Type-specific SSA form</h3>
      <a href="#the-ir-is-in-type-specific-ssa-form">
        
      </a>
    </div>
    <p>The SSA form (<a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">Static Single Assignment</a>) is key for many data flow analysis heuristics that allow quick optimizations like dead code removal, allocation sinking, type narrowing, strength reduction, etc. In LuaJIT's IR it means every instruction is usable as a value for subsequent instructions and has a declared type, fixed at the moment when the trace recorder emits this particular IR instruction. In addition, every instruction can be a type guard, if the arguments are not of the expected type the trace will be exited.</p><p>Lua is dynamically typed, every value is tagged with type information so the bytecode interpreter can apply the correct operations on it. This allows us to have variables and tables that can contain and pass around any kind of object without changing the source code. Of course, this requires the interpreter to be coded very "defensively", to consider all valid ramifications of every instruction, limiting the possibility of optimizations. The IR traces, on the other hand, are optimized for a single variation of the code, and deal with only the value types that are actually observed while executing.</p><p>For example, this simple code creates a 1,000 element array and then copies to another table:</p>
            <pre><code>local t,t2 = {},{}
for i=1,1000 do
    t[i] = i
end
for i,v in ipairs(t) do
    t2[i]=v
end</code></pre>
            <p>resulting in this IR for the second loop, the one that does the copy:</p>
            <pre><code>0023 ------------ LOOP ------------
0024          num CONV   0017  num.int
0025       &gt;  int ABC    0005  0017
0026          p32 AREF   0007  0017
0027          num ASTORE 0026  0022
0028 rbp    + int ADD    0017  +1
0029       &gt;  int ABC    0018  0028
0030          p32 AREF   0020  0028
0031 xmm7  &gt;+ num ALOAD  0030
0032 xmm7     num PHI    0022  0031
0033 rbp      int PHI    0017  0028
0034 rbx      nil RENAME 0017  #3
0035 xmm6     nil RENAME 0022  #2</code></pre>
            <p>Here we see that the <code>ALOAD</code> in instruction <code>0031</code> ensures that the value loaded from the table is indeed a number. If it happens to be any other value, the guard fails and the trace is exited.</p><p>But what if we build an array of strings instead of numbers?</p><p>A small change:</p>
            <pre><code>local t,t2 = {},{}
for i=1,1000 do
    t[i] = 's'..i
end
for i,v in ipairs(t) do
    t2[i]=v
end</code></pre>
            <p>gives us this:</p>
            <pre><code>0024 ------------ LOOP ------------
0025          num CONV   0018  num.int
0026       &gt;  int ABC    0005  0018
0027          p32 AREF   0007  0018
0028          str ASTORE 0027  0023
0029 rbp    + int ADD    0018  +1
0030       &gt;  int ABC    0019  0029
0031          p32 AREF   0021  0029
0032 rbx   &gt;+ str ALOAD  0031
0033 rbx      str PHI    0023  0032
0034 rbp      int PHI    0018  0029
0035 r15      nil RENAME 0018  #3
0036 r14      nil RENAME 0023  #2</code></pre>
            <p>It's the same code, but the type that <code>ALOAD</code> is guarding for is now a string (and it now uses a different register; I guess a vector register isn't appropriate for a string pointer).</p><p>And what if the table holds values of mixed types?</p>
            <pre><code>local t,t2={},{}
for i=1,1000,2 do
    t[i], t[i+1] = i, 's'..i
end
for i,v in ipairs(t)
    do t2[i]=v
end

0031 ------------ LOOP ------------
0032          num CONV   0027  num.int
0033       &gt;  int ABC    0005  0027
0034          p32 AREF   0007  0027
0035          str ASTORE 0034  0030
0036 r15      int ADD    0027  +1
0037       &gt;  int ABC    0019  0036
0038          p32 AREF   0021  0036
0039 xmm7  &gt;  num ALOAD  0038
0040       &gt;  int ABC    0005  0036
0041          p32 AREF   0007  0036
0042          num ASTORE 0041  0039
0043 rbp    + int ADD    0027  +2
0044       &gt;  int ABC    0019  0043
0045          p32 AREF   0021  0043
0046 rbx   &gt;+ str ALOAD  0045
0047 rbx      str PHI    0030  0046
0048 rbp      int PHI    0027  0043</code></pre>
            <p>Now there are two <code>ALOAD</code>s (and two <code>ASTORE</code>s), one for 'num' and one for 'str'. In other words, the JIT unrolled the loop and found that doing so made the types constant. <code>=8-O</code></p><p>Of course, this would happen only with very simple and regular patterns. In general, it's wiser to avoid unpredictable type mixing; but polymorphic code will be optimized for each type it's actually used with.</p>
    <div>
      <h3>Back to <code>next()</code></h3>
      <a href="#back-to-next">
        
      </a>
    </div>
    <p>First let's see the current implementation of <code>next()</code> as used by the interpreter:</p>
            <pre><code>lj_tab.c
/* Advance to the next step in a table traversal. */
int lj_tab_next(lua_State *L, GCtab *t, TValue *key)
{
  uint32_t i = keyindex(L, t, key);  /* Find predecessor key index. */
  for (i++; i &lt; t-&gt;asize; i++)  /* First traverse the array keys. */
    if (!tvisnil(arrayslot(t, i))) {
      setintV(key, i);
      copyTV(L, key+1, arrayslot(t, i));
      return 1;
    }
  for (i -= t-&gt;asize; i &lt;= t-&gt;hmask; i++) {  /* Then traverse the hash keys. */
    Node *n = &amp;noderef(t-&gt;node)[i];
    if (!tvisnil(&amp;n-&gt;val)) {
      copyTV(L, key, &amp;n-&gt;key);
      copyTV(L, key+1, &amp;n-&gt;val);
      return 1;
    }
  }
  return 0;  /* End of traversal. */
}</code></pre>
            <p>It takes the input key as a <code>TValue</code> pointer and calls <code>keyindex()</code>. This helper function searches for the key in the table and returns an index: if the key is an integer within the range of the array part, the index is the key itself; if not, it performs a hash query and returns the index of the Node, offset by the array size, if successful, or signals an error if not found (it's an error to give a nonexistent key to <code>next()</code>).</p><p>Back in <code>lj_tab_next()</code>, the index is first incremented and, if it's still within the array, iteration skips over any holes until a non-<code>nil</code> value is found. If it wasn't in the array (or there’s no next value there), a similar "skip the <code>nil</code>s" is performed on the Node table.</p><p>The new <code>lj_record_next()</code> function in <code>lj_record.c</code>, like some other record functions there, checks not only the input parameters but also the return values, to generate the most appropriate code for this specific iteration, assuming that it will likely be optimal for subsequent iterations. Of course, any such assumption must be backed by the appropriate guard.</p><p>For <code>next()</code>, we choose between two different forms: if the returned key is in the array part, it uses <code>lj_tab_nexta()</code>, which takes the input key as an integer and returns the next key, also as an integer, in the <code>rax</code> register. We don't do the equivalent of the <code>keyindex()</code> function; we just check (with a guard) that the key is within the bounds of the array:</p>
            <pre><code>lj_tab.c
/* Get the next array index */
MSize LJ_FASTCALL lj_tab_nexta(GCtab *t, MSize k)
{
  for (k++; k &lt; t-&gt;asize; k++)
    if (!tvisnil(arrayslot(t, k)))
      break;
  return k;
}</code></pre>
            <p>The IR code looks like this:</p>
            <pre><code>0014 r13      int FLOAD  0011  tab.asize
0015 rsi   &gt;  int CONV   0012  int.num
0017 rax    + int CALLL  lj_tab_nexta  (0011 0015)
0018       &gt;  int ABC    0014  0017
0019 r12      p32 FLOAD  0011  tab.array
0020          p32 AREF   0019  0017
0021 [8]   &gt;+ num ALOAD  0020</code></pre>
            <p>Clearly, the <code>CALL</code> itself (at <code>0017</code>) is typed as '<code>int</code>', as is natural for an array key; and the <code>ALOAD</code> (<code>0021</code>) is '<code>num</code>', because that's what the first few values happened to be.</p><p>When we finish with the array part, the bounds check (the <code>ABC</code> instruction at <code>0018</code>) fails, and soon new IR is generated, this time using the <code>lj_tab_nexth()</code> function.</p>
            <pre><code>lj_tab.c
LJ_FUNCA const Node *LJ_FASTCALL lj_tab_nexth(lua_State *L, GCtab *t, const Node *n)
{
  const Node *nodeend = noderef(t-&gt;node)+t-&gt;hmask;
  for (n++; n &lt;= nodeend; n++) {
    if (!tvisnil(&amp;n-&gt;val)) {
      return n;
    }
  }
  return &amp;G(L)-&gt;nilnode;
}</code></pre>
            <p>But before doing the "skip the <code><b>nil</b></code>s", we need a hash query to find the initial <code>Node</code> entry. Fortunately, the <code>HREF</code> IR instruction does exactly that. This is the IR:</p>
            <pre><code>0014 rdx      p32 HREF   0011  0012
0016 r12      p32 CALLL  lj_tab_nexth  (0011 0014)
0017 rax   &gt;+ str HKLOAD 0016
0018 [8]   &gt;+ num HLOAD  0016</code></pre>
            <p>There's a funny thing here: <code>HREF</code> is supposed to return a reference to a value in the hash table, and the last argument in <code>lj_tab_nexth()</code> is a <code>Node</code> pointer. Let's see the <code>Node</code> definition:</p>
            <pre><code>lj_obj.h
/* Hash node. */
typedef struct Node {
  TValue val;       /* Value object. Must be first field. */
  TValue key;       /* Key object. */
  MRef next;        /* Hash chain. */
#if !LJ_GC64
  MRef freetop;     /* Top of free elements (stored in t-&gt;node[0]). */
#endif
} Node;</code></pre>
            <p>OK... the value is the first field, and it says right there: "Must be first field". It seems this wouldn't be the first place to use some hand-wavy pointer casts.</p><p>The return value of <code>lj_tab_nexth()</code> is a <code>Node</code> pointer, which can likewise be implicitly cast by <code>HLOAD</code> to get the value. To get the key, I added the <code>HKLOAD</code> instruction. Both guard on the expected types of the value and key, respectively.</p>
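            <p>To make the traversal order concrete, here is a small Python model of the same algorithm (names like <code>ToyTable</code> are purely illustrative, not LuaJIT code): <code>keyindex()</code> maps a key into a single linear index space covering the array part and then the Node table, and <code>next()</code> advances through it, skipping <code><b>nil</b></code>s.</p>

```python
# Toy model of LuaJIT's table iteration order (illustrative only, not real
# LuaJIT code): an array part indexed from 1, then a hash part, None = "nil".
class ToyTable:
    def __init__(self, array, hash_part):
        self.array = array            # list slot 0 holds Lua key 1
        self.hkeys = list(hash_part)  # Node order of the hash part
        self.hash = hash_part

    def keyindex(self, key):
        # An integer key inside the array range maps to itself; other keys
        # map to their hash-part position, offset by the array size.
        if isinstance(key, int) and 1 <= key <= len(self.array):
            return key
        return len(self.array) + 1 + self.hkeys.index(key)

    def next(self, key):
        idx = (0 if key is None else self.keyindex(key)) + 1
        while idx <= len(self.array):            # skip holes, like lj_tab_nexta()
            if self.array[idx - 1] is not None:
                return idx, self.array[idx - 1]
            idx += 1
        start = idx - len(self.array) - 1
        for hk in self.hkeys[start:]:            # skip nils, like lj_tab_nexth()
            if self.hash[hk] is not None:
                return hk, self.hash[hk]
        return None                              # iteration finished

def iterate(t):
    out, key = [], None
    while True:
        step = t.next(key)
        if step is None:
            return out
        key, _ = step
        out.append(step)
```

            <p>Running <code>iterate()</code> over a table with holes visits the array part in index order and then the hash part in Node order, exactly as described above.</p>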
    <div>
      <h3>Let's take it for a spin</h3>
      <a href="#lets-take-it-for-a-spin">
        
      </a>
    </div>
    <p>So, how does it perform? These tests do a thousand loops over a 10,000-element table, first using <code>next()</code> and then <code>pairs()</code>, with a simple addition in the inner loop. To get <code>pairs()</code> compiled, I just disabled the <code>ISNEXT</code>/<code>ITERN</code> optimization, so it actually uses <code>next()</code>. In the third test, the variable in the addition is initialized to <code>0ULL</code> instead of just <code>0</code>, triggering the use of FFI.</p><p>The first test has all 10,000 elements on sequential integer keys, making the table a valid sequence, so <code>ipairs()</code> (which is already compiled) can be used just as well:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4m2lJk0jSnK7D5qd81PEgz/15ad27baf4c66609632dd903b5829f05/image02.png" />
            </figure><p>So, compiled <code>next()</code> is quite a lot faster, but the <code>pairs()</code> optimization in the interpreter is very fast. On the other hand, the smallest smell of FFI completely trashes interpreter performance, while making compiled code slightly tighter. Finally, <code>ipairs()</code> is faster, but a big part of it is because it stops on the first <code><b>nil</b></code>, while <code>next()</code> has to skip over every <code><b>nil</b></code> at the end of the array, which by default can be up to twice as big as the sequence itself.</p><p>Now with 5,000 (sequential) integer keys and 5,000 string keys. Of course, we can't use <code>ipairs()</code> here:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2c4yKNAwBjIqqQDz8Em4Hr/56cb0a4af36b3f68df25b36367d8e30a/image01.png" />
            </figure><p>Roughly the same pattern: the compiled <code>next()</code> performance is very much the same across the three forms (used directly, under <code>pairs()</code>, and with FFI code), while the interpreter benefits from the <code>pairs()</code> optimization and almost dies with FFI. In this case, the interpreted <code>pairs()</code> actually surpasses the compiled <code>next()</code> performance, hinting that a separate optimization for <code>pairs()</code> is still desirable.</p><p>A big factor in the interpreter's <code>pairs()</code> performance is that it doesn't use <code>next()</code>; instead, it drives the loop directly with a hidden variable, iterating over the <code>Node</code> table without having to perform a hash lookup on every step.</p><p>Repeating that in a compiled <code>pairs()</code> would be equally beneficial, but it has to be done carefully to maintain compatibility with the interpreter: on any trace exit the interpreter kicks in and must be able to seamlessly continue iterating. For that, the rest of the system has to be aware of that hidden variable.</p><p>The best part of this is that we have lots of very challenging, yet deeply rewarding, work ahead of us! Come <a href="https://www.cloudflare.com/join-our-team/">work for us</a> on making LuaJIT faster and more.</p> ]]></content:encoded>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">5B1mXoduq5qwML56xx48yv</guid>
            <dc:creator>Javier Guerra</dc:creator>
        </item>
        <item>
            <title><![CDATA[First Bay Area OpenResty Meetup]]></title>
            <link>https://blog.cloudflare.com/first-bay-area-openresty-meetup/</link>
            <pubDate>Mon, 09 May 2016 08:17:07 GMT</pubDate>
            <description><![CDATA[ On March 9, 章亦春, known to most of us as agentzh, organized the first Bay Area OpenResty Meetup at CloudFlare's San Francisco office. ]]></description>
            <content:encoded><![CDATA[ <p>On March 9, <a href="https://agentzh.org/">章亦春</a>, known to most of us as agentzh, organized the first <a href="http://www.meetup.com/Bay-Area-OpenResty-Meetup/">Bay Area OpenResty Meetup</a> at CloudFlare's San Francisco office.</p><p>CloudFlare is a big user of Lua, LuaJIT, NGINX and OpenResty, and is happy to be able to sponsor Yichun's work on this fast, flexible platform.</p><p>The slides and videos from the meetup are now available for viewing by people who were unable to be there in person.</p>
    <div>
      <h3><i>adobe.io</i> by Dragos Dascalita of Adobe</h3>
      <a href="#abode-io-by-dragos-dascalita-of-adobe">
        
      </a>
    </div>
    <p>The slides are <a href="https://openresty.org/slides/adobe-io-openresty-meetup.pdf">here</a>.</p>
    <div>
      <h3><i>KONG</i> by Marco Palladino from Mashape</h3>
      <a href="#kong-by-marco-palladino-from-mashape">
        
      </a>
    </div>
    <p>The slides can be found <a href="https://openresty.org/slides/kong_openresty_slides.pdf">here</a>.</p>
    <div>
      <h3><i>What's new in OpenResty for 2016</i> by Yichun Zhang of CloudFlare</h3>
      <a href="#whats-new-in-openresty-for-2016-by-yichun-zhang-of-cloudflare">
        
      </a>
    </div>
    <p>Yichun's slides are <a href="https://openresty.org/slides/Whats-new-in-OpenResty-for-2016.pdf">here</a>.</p><p>If you are interested in being present at the next OpenResty Meetup, be sure to follow the <a href="http://www.meetup.com/Bay-Area-OpenResty-Meetup/">meetup itself</a>.</p> ]]></content:encoded>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Events]]></category>
            <category><![CDATA[MeetUp]]></category>
            <guid isPermaLink="false">7zXxBfi7gJqgMT3MXByepn</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Scaling out PostgreSQL for CloudFlare Analytics using CitusDB]]></title>
            <link>https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/</link>
            <pubDate>Thu, 09 Apr 2015 17:32:05 GMT</pubDate>
            <description><![CDATA[ When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits.  ]]></description>
            <content:encoded><![CDATA[ <p>When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits. This was due to the rapidly increasing log volume from our Edge Platform where we’ve had to deal with traffic growth in excess of 400% annually.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4AxkHPDBZrwj6QJQVWuQcX/fec02af530de1ab2f8a1f516ece59057/keepcalm_scaled.png" />
            
            </figure><p>Our log processing pipeline started out like most everybody else’s: compressed log files shipped to a central location for aggregation by a motley collection of Perl scripts and C++ programs with a single PostgreSQL instance to store the aggregated data. Since then, CloudFlare has grown to serve millions of requests per second for millions of sites. Apart from the hundreds of terabytes of log data that has to be aggregated every day, we also face some unique challenges in providing detailed analytics for each of the millions of sites on CloudFlare.</p><p>For the next iteration of our Customer Analytics application, we wanted to get something up and running quickly, try out Kafka, write the aggregation application in Go, and see what could be done to scale out our trusty go-to database, PostgreSQL, from a single machine to a cluster of servers without requiring us to deal with sharding in the application.</p><p>As we were analyzing our scaling requirements for PostgreSQL, we came across <a href="https://www.citusdata.com/">Citus Data</a>, one of the companies to launch out of <a href="https://www.ycombinator.com/">Y Combinator</a> in the summer of 2011. Citus Data builds a database called CitusDB that scales out PostgreSQL for real-time workloads. Because CitusDB enables both real-time data ingest and sub-second queries across billions of rows, it has become a crucial part of our analytics infrastructure.</p>
    <div>
      <h4>Log Processing Pipeline for Analytics</h4>
      <a href="#log-processing-pipeline-for-analytics">
        
      </a>
    </div>
    <p>Before jumping into the details of our database backend, let’s review the pipeline that takes a log event from CloudFlare’s Edge to our analytics database.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4I3yKJFMKyL4M3gtS2SlTy/b34a4a58d3da74e788950e6af1699582/image01.png" />
            
            </figure><p>An HTTP access log event proceeds through the CloudFlare data pipeline as follows:</p><ol><li><p>A web browser makes a request (e.g., an HTTP GET request).</p></li><li><p>An Nginx web server running <a href="/pushing-nginx-to-its-limit-with-lua/">Lua code</a> handles the request and generates a binary log event in <a href="https://capnproto.org">Cap’n Proto format</a>.</p></li><li><p>A Go program akin to <a href="https://github.com/mozilla-services/heka">Heka</a> receives the log event from Nginx over a UNIX socket, batches it with other events, compresses the batch using a fast algorithm like <a href="https://github.com/google/snappy">Snappy</a> or <a href="https://github.com/Cyan4973/lz4">LZ4</a>, and sends it to our data center over a TLS-encrypted TCP connection.</p></li><li><p>Another Go program (the Kafka shim) receives the log event stream, decrypts it, decompresses the batches, and produces the events into a Kafka topic with partitions replicated on many servers.</p></li><li><p>Go aggregators (one process per partition) consume the topic-partitions and insert aggregates (not individual events) with 1-minute granularity into the CitusDB database. Further rollups to 1-hour and 1-day granularity occur later to reduce the amount of data to be queried and to speed up queries over intervals spanning many hours or days.</p></li></ol>
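    <p>Steps 3 and 4 above amount to a framing protocol: batch the events, compress the batch, and length-prefix the result so the receiver can split the TCP stream back into batches. A rough Python sketch of that idea, using JSON and zlib purely as stand-ins for the Cap'n Proto and Snappy/LZ4 encodings the real pipeline uses:</p>

```python
import json
import struct
import zlib

def encode_batch(events):
    # Serialize each event, length-prefix it, batch them together, compress
    # the batch (zlib stands in for Snappy/LZ4), and length-prefix the frame
    # so the receiver can split a byte stream back into batches.
    payload = zlib.compress(b"".join(
        struct.pack(">I", len(e)) + e
        for e in (json.dumps(ev).encode() for ev in events)))
    return struct.pack(">I", len(payload)) + payload

def decode_batch(frame):
    # Reverse the framing: read the frame length, decompress, then walk the
    # length-prefixed events inside the batch.
    (n,) = struct.unpack(">I", frame[:4])
    data = zlib.decompress(frame[4:4 + n])
    events, off = [], 0
    while off < len(data):
        (ln,) = struct.unpack(">I", data[off:off + 4])
        events.append(json.loads(data[off + 4:off + 4 + ln]))
        off += 4 + ln
    return events
```

    <p>Batching before compressing matters: log events share a lot of structure, so a compressor sees far better ratios on a batch than on individual events.</p>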
    <div>
      <h4>Why Go?</h4>
      <a href="#why-go">
        
      </a>
    </div>
    <p>Previous blog <a href="/what-weve-been-doing-with-go/">posts</a> and <a href="https://www.youtube.com/watch?v=8igk2ylk_X4">talks</a> have covered <a href="/go-at-cloudflare/">various CloudFlare projects that have been built using Go</a>. We’ve found that Go is a great language for teams to use when building the kinds of distributed systems needed at CloudFlare, and this is true regardless of an engineer’s level of experience with Go. Our Customer Analytics team is made up of engineers that have been using Go since before its 1.0 release as well as complete Go newbies. Team members that were new to Go were able to spin up quickly, and the code base has remained maintainable even as we’ve continued to build many more data processing and aggregation applications such as a new version of <a href="https://www.hakkalabs.co/articles/optimizing-go-3k-requestssec-480k-requestssec">our Layer 7 DDoS attack mitigation system</a>.</p><p>Another factor that makes Go great is the ever-expanding ecosystem of third party libraries. We used <a href="https://github.com/glycerine/go-capnproto">go-capnproto</a> to generate Go code to handle binary log events in Cap’n Proto format from a common schema shared between Go, C++, and <a href="/introducing-lua-capnproto-better-serialization-in-lua/">Lua projects</a>. Go support for Kafka with <a href="https://godoc.org/github.com/Shopify/sarama">Shopify’s Sarama</a> library, support for ZooKeeper with <a href="https://github.com/samuel/go-zookeeper">go-zookeeper</a>, support for PostgreSQL/CitusDB through <a href="http://golang.org/pkg/database/sql/">database/sql</a> and the <a href="https://github.com/lib/pq">lib/pq driver</a> are all very good.</p>
    <div>
      <h4>Why Kafka?</h4>
      <a href="#why-kafka">
        
      </a>
    </div>
    <p>As we started building our new data processing applications in Go, we had some additional requirements for the pipeline:</p><ol><li><p>Use a queue with persistence to allow short periods of downtime for downstream servers and/or consumer services.</p></li><li><p>Make the data available for processing in real time by <a href="https://github.com/mumrah/kafka-python">scripts</a> written by members of our Site Reliability Engineering team.</p></li><li><p>Allow future aggregators to be built in other languages like Java, <a href="https://github.com/edenhill/librdkafka">C or C++</a>.</p></li></ol><p>After extensive testing, we selected <a href="https://kafka.apache.org/">Kafka</a> as the first stage of the log processing pipeline.</p>
    <div>
      <h4>Why Postgres?</h4>
      <a href="#why-postgres">
        
      </a>
    </div>
    <p>As we mentioned when <a href="http://www.postgresql.org/about/press/presskit93/">PostgreSQL 9.3 was released</a>, PostgreSQL has long been an important part of our stack, and for good reason.</p><p>Foreign data wrappers and other extension mechanisms make PostgreSQL an excellent platform for storing lots of data, or as a gateway to other NoSQL data stores, without having to give up the power of SQL. PostgreSQL also has great performance and documentation. Lastly, PostgreSQL has a large and active community, and we've had the privilege of meeting many of the PostgreSQL contributors at meetups held at the CloudFlare office and elsewhere, organized by <a href="http://www.meetup.com/postgresql-1/">The San Francisco Bay Area PostgreSQL Meetup Group</a>.</p>
    <div>
      <h4>Why CitusDB?</h4>
      <a href="#why-citusdb">
        
      </a>
    </div>
    <p>CloudFlare has been using PostgreSQL since day one. We trust it, and we wanted to keep using it. However, CloudFlare's data has been growing rapidly, and we were running into the limitations of a single PostgreSQL instance. Our team was tasked with scaling out our analytics database in a short time so we started by defining the criteria that are important to us:</p><ol><li><p><b>Performance</b>: Our system powers the Customer Analytics dashboard, so typical queries need to return in less than a second even when dealing with data from many customer sites over long time periods.</p></li><li><p><b>PostgreSQL</b>: We have extensive experience running PostgreSQL in production. We also find several extensions useful, e.g., Hstore enables us to store semi-structured data and HyperLogLog (HLL) makes unique count approximation queries fast.</p></li><li><p><b>Scaling</b>: We need to dynamically scale out our cluster for performance and huge data storage. That is, if we realize that our cluster is becoming overutilized, we want to solve the problem by just adding new machines.</p></li><li><p><b>High availability</b>: This cluster needs to be highly available. As such, the cluster needs to automatically recover from failures like disks dying or servers going down.</p></li><li><p><b>Business intelligence queries</b>: in addition to sub-second responses for customer queries, we need to be able to perform business intelligence queries that may need to analyze billions of rows of analytics data.</p></li></ol><p>At first, we evaluated what it would take to build an application that deals with sharding on top of stock PostgreSQL. 
We investigated using the <a href="http://www.postgresql.org/docs/9.4/static/postgres-fdw.html">postgres_fdw</a> extension to provide a unified view on top of a number of independent PostgreSQL servers, but this solution did not deal well with servers going down.</p><p>Research into the major players in the PostgreSQL space indicated that CitusDB had the potential to be a great fit for us. On the performance point, they already had customers running real-time analytics with queries running in parallel across a large cluster in tens of milliseconds.</p><p>CitusDB has also maintained compatibility with PostgreSQL, not by forking the code base like other vendors, but by extending it to plan and execute distributed queries. Furthermore, CitusDB used the concept of many logical shards so that if we were to add new machines to our cluster, we could easily rebalance the shards in the cluster by calling a simple PostgreSQL user-defined function.</p><p>With CitusDB, we could replicate logical shards to independent machines in the cluster, and automatically fail over between replicas even during queries. In case of a hardware failure, we could also use the rebalance function to re-replicate shards in the cluster.</p>
    <div>
      <h4>CitusDB Architecture</h4>
      <a href="#citusdb-architecture">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/74Tl2hP3Tfk1HNzpF1MV4i/121b31b62700289180653b647a653edb/image00.png" />
            
            </figure><p>CitusDB follows an architecture similar to Hadoop to scale out Postgres: one primary node holds authoritative metadata about shards in the cluster and parallelizes incoming queries. The worker nodes then do all the actual work of running the queries.</p><p>In CloudFlare's case, the cluster holds about 1 million shards and each shard is replicated to multiple machines. When the application sends a query to the cluster, the primary node first prunes away unrelated shards and finds the specific shards relevant to the query. The primary node then transforms the query into many smaller queries for <a href="http://www.citusdata.com/blog/19-ozgun/114-how-to-build-your-distributed-database">parallel execution</a> and ships those smaller queries to the worker nodes.</p><p>Finally, the primary node receives intermediate results from the workers, merges them, and returns the final results to the application. This takes anywhere from 25 milliseconds to 2 seconds for queries in the CloudFlare analytics cluster, depending on whether some or all of the data is available in page cache.</p><p>From a high availability standpoint, when a worker node fails, the primary node automatically fails over to the replicas, even during a query. The primary node holds slowly changing metadata, making it a good fit for continuous backups or PostgreSQL's streaming replication feature. Citus Data is currently working on further improvements to make it easy to replicate the primary metadata to all the other nodes.</p><p>At CloudFlare, we love the CitusDB architecture because it enabled us to continue using PostgreSQL. Our analytics dashboard and BI tools connect to Citus using standard PostgreSQL connectors, and tools like <code>pg_dump</code> and <code>pg_upgrade</code> just work. Two features that stand out for us are CitusDB’s PostgreSQL extensions that power our analytics dashboards, and CitusDB’s ability to parallelize the logic in those extensions out of the box.</p>
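            <p>The scatter-gather flow above can be sketched in a few lines of Python (a deliberately tiny model with made-up names, not CitusDB itself): prune to the relevant shard, run the smaller query on the worker, then merge the partial results on the primary. Real Citus shards by logical shard rather than routing all of a key's rows to one physical node, but the prune/scatter/merge shape is the same.</p>

```python
# Toy scatter-gather (all names hypothetical): a "primary" prunes shards by
# key, runs a small aggregate on the relevant "worker" shard, and merges.
NSHARDS = 8
shards = [[] for _ in range(NSHARDS)]     # each list plays a worker node

def shard_for(site):
    return sum(site.encode()) % NSHARDS   # stand-in for a real hash function

def insert(site, minute, requests):
    shards[shard_for(site)].append((site, minute, requests))

def total_requests(site):
    shard = shards[shard_for(site)]                    # 1. prune shards
    partials = [r for s, _, r in shard if s == site]   # 2. per-worker query
    return sum(partials)                               # 3. merge on primary
```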
    <div>
      <h4>Postgres Extensions on CitusDB</h4>
      <a href="#postgres-extensions-on-citusdb">
        
      </a>
    </div>
    <p>PostgreSQL extensions are pieces of software that add functionality to the core database itself. Some examples are data types, user-defined functions, operators, aggregates, and custom index types. PostgreSQL has more than 150 publicly available official extensions. We’d like to highlight two of these extensions that might be of general interest. It’s worth noting that with CitusDB all of these extensions automatically scale to many servers without any changes.</p>
    <div>
      <h4>HyperLogLog</h4>
      <a href="#hyperloglog">
        
      </a>
    </div>
    <p><a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a> is a sophisticated algorithm developed for doing unique count approximations quickly. And since an <a href="https://github.com/aggregateknowledge/postgresql-hll">HLL implementation for PostgreSQL</a> was open sourced by the good folks at Aggregate Knowledge, we could use it with CitusDB unchanged, since CitusDB is compatible with most (if not all) Postgres extensions.</p><p>HLL was important for our application because we needed to compute unique IP counts across various time intervals in real time, and we didn’t want to store the unique IPs themselves. With this extension, we could, for example, count the number of unique IP addresses accessing a customer site in a minute, but still have an accurate count when further rolling up the aggregated data into a 1-hour aggregate.</p>
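    <p>The property that makes this rollup work is that HLL sketches merge losslessly: the sketch of a union of sets is just the register-wise maximum of their sketches. Here is a minimal, illustrative HyperLogLog in Python (a toy, not the postgresql-hll extension) demonstrating that per-minute sketches can be folded into a per-hour sketch without ever re-reading the raw IPs:</p>

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch (illustrative only)."""

    def __init__(self, p=12):
        self.p, self.m = p, 1 << p
        self.reg = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero run + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def merge(self, other):
        # The key property: a per-minute sketch folds losslessly into a
        # per-hour sketch by taking register-wise maxima.
        self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * self.m and zeros:        # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est
```

    <p>Two sketches built from overlapping sets of "IPs" merge into one whose estimate tracks the true size of the union, which is exactly what a 1-minute to 1-hour rollup needs.</p>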
    <div>
      <h4>Hstore</h4>
      <a href="#hstore">
        
      </a>
    </div>
    <p>The <a href="http://www.postgresql.org/docs/9.4/static/hstore.html">hstore data type</a> stores sets of key/value pairs within a single PostgreSQL value. This can be helpful in various scenarios such as with rows with many attributes that are rarely examined, or to represent semi-structured data. We use the hstore data type to hold counters for sparse categories (e.g. country, HTTP status, data center).</p><p>With the hstore data type, we save ourselves from the burden of denormalizing our table schema into hundreds or thousands of columns. For example, we have one hstore data type that holds the number of requests coming in from different data centers per minute per CloudFlare customer. With millions of customers and hundreds of data centers, this counter data ends up being very sparse. Thanks to hstore, we can efficiently store that data, and thanks to CitusDB, we can efficiently parallelize queries of that data.</p><p>For future applications, we are also investigating other extensions such as the Postgres columnar store extension <a href="https://github.com/citusdata/cstore_fdw">cstore_fdw</a> that Citus Data has open sourced. This will allow us to compress and store even more historical analytics data in a smaller footprint.</p>
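    <p>The idea is easy to picture outside SQL: an hstore-style row keeps each sparse category as a key/value map, and rolling minutes up to an hour is just a sum of those maps. The Python sketch below is illustrative (the column names are made up, not our actual schema):</p>

```python
from collections import Counter

# Sparse per-category counters, hstore-style: one map-valued column per
# category instead of hundreds of mostly-empty real columns.
rows = [
    {"site": "example.com", "minute": "17:00",
     "requests_by_colo": Counter({"SJC": 120, "LHR": 45})},
    {"site": "example.com", "minute": "17:01",
     "requests_by_colo": Counter({"SJC": 80, "AMS": 10})},
]

def rollup(rows, column):
    # Rolling 1-minute rows up to 1 hour is just summing the sparse maps;
    # keys absent from a row cost nothing.
    total = Counter()
    for row in rows:
        total += row[column]
    return total
```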
    <div>
      <h4>Conclusion</h4>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>CitusDB has been working very well for us as the new backend for our Customer Analytics system. We have also found many uses for the analytics data in a business intelligence context. The ease with which we can run distributed queries on the data allows us to quickly answer new questions about the CloudFlare network that arise from anyone in the company, from the SRE team through to Sales.</p><p>We are looking forward to features available in the recently released <a href="https://www.citusdata.com/citus-products/citusdb-software">CitusDB 4.0</a>, especially the performance improvements and the new shard rebalancer. We’re also excited about using the JSONB data type with CitusDB 4.0, along with all the other improvements that come standard as part of <a href="http://www.postgresql.org/docs/9.4/static/release-9-4.html">PostgreSQL 9.4</a>.</p><p>Finally, if you’re interested in building and operating distributed services like Kafka or CitusDB and writing Go as part of a dynamic team dealing with big (nay, gargantuan) amounts of data, <a href="https://www.cloudflare.com/join-our-team">CloudFlare is hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[DDoS]]></category>
            <guid isPermaLink="false">4WkjJAXrP1iZH5uthDDnAh</guid>
            <dc:creator>Albert Strasheim</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing lua-capnproto: better serialization in Lua]]></title>
            <link>https://blog.cloudflare.com/introducing-lua-capnproto-better-serialization-in-lua/</link>
            <pubDate>Fri, 28 Feb 2014 09:00:00 GMT</pubDate>
            <description><![CDATA[ When we need to transfer data from one program to another, either within a machine or from one data center to another, some form of serialization is needed. ]]></description>
            <content:encoded><![CDATA[ <p>When we need to transfer data from one program to another, either within a machine or from one data center to another, some form of serialization is needed. Serialization converts data stored in memory into a form that can be sent across a network or between processes and then converted back into data a program can use directly.</p><p>At CloudFlare, we have data centers all over the world. When transferring data from one data center to another, we need a very efficient way of serializing data, saving us both time and network bandwidth.</p><p>We've looked at a few serialization projects. For example, one popular serialization format is JSON, for some of our Go programs we use gob, and we've made use of Protocol Buffers in the past. But lately we've been using a new serialization protocol called Cap'n Proto.</p><p><a href="http://kentonv.github.io/capnproto/">Cap'n Proto</a> attracted us because of its very high performance compared to other serialization projects. It looks a little like a better version of Protocol Buffers, and the author of Cap'n Proto, Kenton, was the primary author of Protocol Buffers version 2.</p><p>At CloudFlare, we use NGINX in conjunction with Lua for front-line web serving, proxying and traffic filtering. We need to serialize our data in Lua and transport it across the Internet. But unfortunately, there was no Lua module for <a href="http://kentonv.github.io/capnproto/index.html">Cap'n Proto</a>. So, we decided to write <a href="https://github.com/cloudflare/lua-capnproto">lua-capnproto</a> and release it as yet another <a href="http://cloudflare.github.io/">CloudFlare open source</a> contribution.</p><p><a href="https://github.com/cloudflare/lua-capnproto">lua-capnproto</a> provides very fast data serialization and a really easy-to-use API. Here I'll show you how to use <a href="https://github.com/cloudflare/lua-capnproto">lua-capnproto</a> to do serialization and deserialization.</p>
    <div>
      <h3>Install lua-capnproto</h3>
      <a href="#install-lua-capnproto">
        
      </a>
    </div>
    <p>To install lua-capnproto, you need to install <a href="http://kentonv.github.io/capnproto/install.html">Cap'n Proto</a>, <a href="http://luajit.org/install.html">LuaJIT 2.1</a> and <a href="http://luarocks.org/en/Download">luarocks</a> first.</p><p>Then you can install lua-capnproto using the following commands:</p>
            <pre><code>git clone https://github.com/cloudflare/lua-capnproto.git
cd lua-capnproto
sudo luarocks make</code></pre>
            <p>To test whether lua-capnproto was installed successfully, you can use the <code>capnp</code> compiler to generate a Lua version of one of the Cap'n Proto examples as follows:</p>
            <pre><code>capnp compile -olua proto/example.capnp</code></pre>
            <p>If everything goes well, you should see no errors and a file named <code>example_capnp.lua</code> generated under the <code>proto</code> directory.</p>
    <div>
      <h3>Write a Cap’n Proto definition</h3>
      <a href="#write-a-capn-proto-definition">
        
      </a>
    </div>
    <p>Here's a sample Cap’n Proto definition that would be stored in a file called <code>AddressBook.capnp</code>:</p>
            <pre><code>    @0xdbb9ad1f14bf0b36;  # unique file ID, generated by capnp id

    struct Person {
      id @0 :UInt32;
      name @1 :Text;
      email @2 :Text;
      phones @3 :List(PhoneNumber);

      struct PhoneNumber {
        number @0 :Text;
        type @1 :Type;

        enum Type {
          mobile @0;
          home @1;
          work @2;
        }
      }

      employment :union {
        unemployed @4 :Void;
        employer @5 :Text;
        school @6 :Text;
        selfEmployed @7 :Void;
        # We assume that a person is only one of these.
      }
    }

    struct AddressBook {
      people @0 :List(Person);
    }</code></pre>
            <p>We have a root structure named <code>AddressBook</code> containing a list named <code>people</code> whose members are also structures. We are going to serialize an <code>AddressBook</code> structure and then read it back from the serialized data. For more details about the <a href="http://kentonv.github.io/capnproto/index.html">Cap'n Proto</a> definition language, you can check out its documentation <a href="http://kentonv.github.io/capnproto/index.html">here</a>.</p>
    <div>
      <h3>Prepare your data</h3>
      <a href="#prepare-your-data">
        
      </a>
    </div>
    <p>Preparing data is pretty simple. All you need to do is to generate a Lua table corresponding to the root structure. The following list gives rules to help you write this table. On the left is a <a href="http://kentonv.github.io/capnproto/index.html">Cap'n Proto</a> data type, on the right is its corresponding Lua data type.</p><ul><li><p>struct -&gt; Lua hash table</p></li><li><p>list -&gt; Lua array table</p></li><li><p>bool -&gt; Lua boolean</p></li><li><p>int8/16/32 or uint8/16/32 or float32/64 -&gt; Lua number</p></li><li><p>int64/uint64 -&gt; LuaJIT 64bit number</p></li><li><p>enum -&gt; Lua string</p></li><li><p>void -&gt; Lua string <code>“Void”</code></p></li><li><p>group -&gt; Lua hash table (the same as struct)</p></li><li><p>union -&gt; Lua hash table which has only one value set</p></li></ul>
    <div>
      <h3>A few notes:</h3>
      <a href="#a-few-notices">
        
      </a>
    </div>
    <p>Because the Lua number type represents real (double-precision floating-point) numbers and Lua has no integer type, by default you can't store a 64-bit integer in a number without losing precision. LuaJIT has an <a href="http://luajit.org/ext_ffi_api.html#literals">extension</a> which supports 64-bit integers: you add <code>ULL</code> or <code>LL</code> to the end of the number (<code>ULL</code> for unsigned integers, <code>LL</code> for signed). So, if you need to serialize a 64-bit integer, remember to append <code>ULL</code> or <code>LL</code> to your number.</p><p>For example:</p>
            <pre><code>id @0 :Int64;    -&gt;    id = 12345678901234LL</code></pre>
            <p>Enum values are automatically converted from strings to their numeric values; you don’t need to do it yourself. By default, enum names are converted to the uppercase-with-underscores form. You can change this behavior using annotations; the <a href="https://github.com/cloudflare/lua-capnproto">lua-capnproto</a> <a href="https://github.com/cloudflare/lua-capnproto/blob/master/README.md">documentation</a> has more details.</p><p>Here is an example:</p>
            <pre><code>type @0 :Type;
enum Type {
  binaryAddr @0;
  textAddr   @1;
}                -&gt;    type = "TEXT_ADDR"</code></pre>
            <p><code>void</code> is a special type in Cap’n Proto. For simplicity, we just use the string <code>"Void"</code> to represent <code>void</code> (actually, when serializing, any value other than <code>nil</code> will work, but we use <code>"Void"</code> for consistency).</p><p>A sample data table looks like this:</p>
            <pre><code>    local data = {
        people = {
            {
                id = "123",
                name = "Alice",
                email = "alice@example.com",
                phones = {
                    {
                        number = "555-1212",
                        ["type"] = "MOBILE",
                    },
                },
                employment = {
                    school = "MIT",
                },
            },
            {
                id = "456",
                name = "Bob",
                email = "bob@example.com",
                phones = {
                    {
                        number = "555-4567",
                        ["type"] = "HOME",
                    },
                    {
                        number = "555-7654",
                        ["type"] = "WORK",
                    },
                },
                employment = {
                    unemployed = "Void",
                },
            },
        }
    }</code></pre>
            
    <div>
      <h3>Compile and run</h3>
      <a href="#compile-and-run">
        
      </a>
    </div>
    <p>Now let's compile the Cap'n Proto file:</p>
            <pre><code>    capnp compile -olua AddressBook.capnp</code></pre>
            <p>You shouldn't see any errors, and a file named <code>AddressBook_capnp.lua</code> is generated under the current directory.</p><p>To use this file, we only need to write a file named <code>main.lua</code> (or whatever name you desire), and get all the required modules ready.</p>
            <pre><code>    local addressBook = require "AddressBook_capnp"                                                                       
    local capnp = require "capnp"</code></pre>
            <p>Now we can start to serialize using our already prepared data.</p>
            <pre><code>    local bin = addressBook.AddressBook.serialize(data)</code></pre>
            <p>That's it. Just one line of code. All the serialization work is done. Now variable <code>bin</code> (a Lua string) holds the serialized binary data. You can write this string to a file or send it over the network.</p><p>What about deserialization? It's as easy as serialization.</p>
            <pre><code>    local decoded = addressBook.AddressBook.parse(bin)</code></pre>
            <p>Now variable <code>decoded</code> holds a table which is identical to <code>data</code>. You can find the complete code <a href="https://github.com/cloudflare/lua-capnproto/tree/master/example">here</a>. Note that you need LuaJIT to run the code.</p>
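            <p>Since <code>bin</code> is an ordinary Lua string, writing it to disk and reading it back needs nothing beyond Lua's standard I/O library. A sketch (the file name is just an example, and <code>data</code> and <code>addressBook</code> are the variables from the code above):</p>
            <pre><code>    local f = assert(io.open("addressbook.bin", "wb"))
    f:write(addressBook.AddressBook.serialize(data))
    f:close()

    -- later, read it back and parse:
    local f2 = assert(io.open("addressbook.bin", "rb"))
    local bin2 = f2:read("*a")
    f2:close()
    local decoded = addressBook.AddressBook.parse(bin2)</code></pre>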
    <div>
      <h3>Performance</h3>
      <a href="#performance">
        
      </a>
    </div>
    <p>If you are happy with the API, here is even better news. We chose Cap'n Proto because of its impressively high performance. So when writing lua-capnproto, I also made every effort to make it fast.</p><p>In one project we switched to lua-capnproto from <a href="http://www.kyne.com.au/~mark/software/lua-cjson.php">lua-cjson</a> (a quite fast JSON library for Lua) for serialization. So let's see how fast lua-capnproto is compared to lua-cjson.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59GtNOY9Ix3jO9tFNn53ms/8b88605c6a817d02dd525926adebee8b/lua-capnproto.png" />
            
            </figure><p>You can also run <a href="https://github.com/cloudflare/lua-capnproto/blob/master/benchmark.lua">benchmark.lua</a> yourself (included in the source code) to find out how fast lua-capnproto is compared to <a href="http://www.kyne.com.au/~mark/software/lua-cjson.php">lua-cjson</a> on your machine.</p>
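            <p>If you just want a quick-and-dirty comparison without the full benchmark script, a minimal timing loop looks like this (the iteration count is arbitrary, and <code>data</code> and <code>addressBook</code> are the variables from the example above):</p>
            <pre><code>    local cjson = require "cjson"

    -- run fn() n times and print the elapsed CPU time
    local function bench(name, fn, n)
        local start = os.clock()
        for _ = 1, n do fn() end
        print(string.format("%-16s %.3fs", name, os.clock() - start))
    end

    local n = 100000
    bench("cjson encode", function() return cjson.encode(data) end, n)
    bench("capnp encode", function() return addressBook.AddressBook.serialize(data) end, n)</code></pre>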
    <div>
      <h3>The future</h3>
      <a href="#the-future">
        
      </a>
    </div>
    <p>We are already using <a href="https://github.com/cloudflare/lua-capnproto">lua-capnproto</a> in production at CloudFlare and it has been running very well for the past month. But <a href="https://github.com/cloudflare/lua-capnproto">lua-capnproto</a> is still a very young project. Some features are missing and there's a lot of work to do in the future. We will continue to make <a href="https://github.com/cloudflare/lua-capnproto">lua-capnproto</a> more stable and more reliable, and would be happy to receive contributions from the open source community.</p> ]]></content:encoded>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cap'n Proto]]></category>
            <category><![CDATA[LUA]]></category>
            <guid isPermaLink="false">5n9qvrLgdQCTRLUe0NlTiB</guid>
            <dc:creator>Jiale Zhi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Keeping our open source promise]]></title>
            <link>https://blog.cloudflare.com/keeping-our-open-source-promise/</link>
            <pubDate>Sun, 22 Dec 2013 17:00:00 GMT</pubDate>
            <description><![CDATA[ Back in October I wrote a blog post about CloudFlare and open source software titled CloudFlare And Open Source Software: A Two-Way Street which detailed the many ways in which we use and support open source software. ]]></description>
            <content:encoded><![CDATA[ <p>Back in October I wrote a blog post about CloudFlare and open source software titled <a href="/open-source-two-way-street">CloudFlare And Open Source Software: A Two-Way Street</a> which detailed the many ways in which we use and support open source software.</p><p>Since then we've pushed out quite a lot of new open source projects, as well as continuing to support the likes of <a href="http://luajit.org/sponsors.html#sponsorship_perf">LuaJIT</a> and <a href="http://openresty.org/">OpenResty</a>. All our open source projects are available via our dedicated GitHub page <a href="http://cloudflare.github.io/">cloudflare.github.io</a>. Here are a few that have been recently updated or released:</p><ul><li><p><a href="https://github.com/cloudflare/redoctober">Red October</a>: our implementation of the 'two man rule' for control of secrets (and nuclear weapons :-). It was large enough to warrant its own <a href="/red-october-cloudflares-open-source-implementation-of-the-two-man-rule">blog post</a>.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-shcache">Lua Resty Shcache</a>: a shared cache implementation for OpenResty and Nginx Lua.</p></li><li><p><a href="https://github.com/agentzh/lua-resty-core">Lua Resty Core</a>: a new FFI-based Lua API for the ngx_lua module (part of work CloudFlare is doing on optimization).</p></li><li><p><a href="https://github.com/cloudflare/golibs">Go Libs</a>: at the other end of the size spectrum: golibs is where we'll be dumping small Go libraries that might be useful to others.</p></li><li><p><a href="https://github.com/cloudflare/bm">BM</a>: a Go implementation of Bentley-McIlroy long string compression (which is close to part of the compression performed by <a href="https://www.cloudflare.com/railgun">Railgun</a>).</p></li><li><p><a href="https://github.com/cloudflare/lua-raven">Lua Raven</a>: a Lua interface for sending error reports to <a 
href="https://getsentry.com/">Sentry</a>.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-logger-socket">Lua Resty Logger Socket</a>: a non-blocking logger for Nginx Lua that supports remote loggers.</p></li><li><p><a href="https://github.com/cloudflare/conf">conf</a>: a very simple key=value configuration file parser written in Go.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-kyototycoon">Lua Resty Kyoto Tycoon</a>: a non-blocking Lua interface for <a href="http://fallabs.com/kyototycoon/">Kyoto Tycoon</a>.</p></li><li><p><a href="https://github.com/cloudflare/golz4">Go LZ4</a>: a Go package for LZ4 compression.</p></li><li><p><a href="https://github.com/cloudflare/ahocorasick">Aho Corasick</a>: a Go package that does Aho-Corasick string matching.</p></li><li><p><a href="https://github.com/cloudflare/kt_fdw">kt-fdw</a>: a foreign data wrapper for Postgres that interfaces to Kyoto Tycoon. More details in its <a href="/kyoto_tycoon_with_postgresql">blog post</a>.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-cookie">Lua Resty Cookie</a>: a Lua Resty package that improves Cookie handling in OpenResty and Nginx Lua.</p></li><li><p><a href="https://github.com/terinjokes/docker-npmjs">Docker NPMJS</a>: a Docker image for a private npmjs repository.</p></li></ul><p>We've also been contributing to projects like <a href="http://git.openssl.org/gitweb/?p=openssl.git;a=search;s=Piotr+Sikora;st=author">OpenSSL</a>, <a href="https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=search;s=Yichun+Zhang+(agentzh);st=author">SystemTap</a> and <a href="https://code.google.com/p/phantomjs/source/detail?r=3ae632e70441125e6830ddb8e6535a3627f9444a&amp;path=/test/webpage-spec.js">PhantomJS</a>, and the names of CloudFlare employees Yichun and Piotr appear regularly in the Nginx <a href="http://nginx.org/en/CHANGES">changelog</a>.</p><p>And, on a personal level, I've been keeping a repository of all the talks I've given (including 
any associated code or data) in <a href="https://github.com/cloudflare/jgc-talks">jgc-talks</a>. The most recent was a talk in London about <a href="https://github.com/cloudflare/jgc-talks/tree/master/Other/BigO_London">rolling hashes</a>.</p><p>We'll keep pushing projects, large and small, to GitHub as we're ready to share them. We've pushed out so much open source code that we've recently reorganized our GitHub page by language/software. There are now sections for <a href="http://cloudflare.github.io/#cat-JavaScript">JavaScript</a>, <a href="http://cloudflare.github.io/#cat-Go">Go</a>, <a href="http://cloudflare.github.io/#cat-Nginx%20and%20Lua">Nginx and Lua</a>, and <a href="http://cloudflare.github.io/#cat-Postgres">Postgres</a>.</p><p>Right now there's a large and innovative security project getting ready for release. Expect to hear about it (and see its source code) in the New Year.</p> ]]></content:encoded>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">35Aa0QeOHEMoZlmyo9HO8i</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[CloudFlare And Open Source Software: A Two-Way Street]]></title>
            <link>https://blog.cloudflare.com/open-source-two-way-street/</link>
            <pubDate>Mon, 07 Oct 2013 18:00:00 GMT</pubDate>
            <description><![CDATA[ CloudFlare uses a great deal of open source and free software. Our core server platform is nginx (which is released using a two-clause BSD license) and our primary database of choice is postgresql (which is released using their own BSD-like license).  ]]></description>
            <content:encoded><![CDATA[ <p>CloudFlare uses a great deal of open source and free software. Our core server platform is <a href="http://nginx.org/">nginx</a> (which is released using a <a href="http://nginx.org/LICENSE">two-clause BSD license</a>) and our primary database of choice is <a href="http://www.postgresql.org/">postgresql</a> (which is released using their own <a href="http://www.postgresql.org/about/licence/">BSD-like license</a>). We've <a href="/kyoto_tycoon_with_postgresql">talked in the past</a> about our use of <a href="http://fallabs.com/kyototycoon/">Kyoto Tycoon</a> (which is released under the GNU General Public License) and we've built many things on top of <a href="/pushing-nginx-to-its-limit-with-lua">OpenResty</a>.</p><p>And, of course, we make use of open source tools such as gcc, make, the Go programming language, Lua, python, Perl, and PHP, and projects like <a href="https://www.getsentry.com/welcome/">Sentry</a>, <a href="http://www.elasticsearch.org/">Kibana</a>, and <a href="http://www.nagios.org/">nagios</a>. And, naturally, we use Linux.</p><p>It would take a while to write down all the software that we use to build CloudFlare, but all that software has one thing in common: it's open source or free software. Our stack consists of either software we've built ourselves or an open source project (which we've sometimes forked).</p>
    <div>
      <h3>Why Build On Open Source</h3>
      <a href="#why-build-on-open-source">
        
      </a>
    </div>
    <p>It's probably obvious to most readers why we use open source software: it's reliable, it's easy to modify and it's easy to maintain. But there's another benefit that should not be overlooked: using and working on open source software brings a great deal of job satisfaction for programmers and it helps us hire the best.</p><p>We encourage our programmers to release changes they've made to open source software and to release projects through the <a href="http://cloudflare.github.io/">CloudFlare GitHub</a> page.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TMjRYWUEXyn3g7aJadzx1/0836a32a7ce3cd21e3057ddb2905ada5/cloudflare-github.png" />
            
            </figure><p>At GitHub you'll find projects such as <a href="https://github.com/cloudflare/golog">golog</a> (a high-performance Go logger), <a href="https://github.com/cloudflare/lua-cmsgpack">lua-cmsgpack</a> (an implementation of MessagePack for Lua), a Python-based <a href="https://github.com/cloudflare/CloudFlare-CNAME-Flattener">CNAME flattener</a>, and a <a href="https://github.com/agentzh/stapxx">macro language</a> for systemtap, amongst others.</p><p>You'll also find the <a href="https://github.com/chaoslawful/lua-nginx-module">ngx_lua module</a> which embeds Lua in nginx. That's not something CloudFlare initially wrote, but we make such extensive use of it that we hired <a href="https://github.com/agentzh">Yichun Zhang</a>. He continues to work full-time on it while at CloudFlare.</p><p>And, if you've ever delved into the internals of nginx, you'll know another CloudFlare employee, Piotr Sikora, who recently added <a href="http://www.mail-archive.com/nginx-devel@nginx.org/msg01061.html">the ability to set keys for TLS Session Tickets</a> to nginx.</p><p>So, at CloudFlare, open source can get you a job, be your job or, at least, be a significant part of your job.</p>
    <div>
      <h3>Sponsoring</h3>
      <a href="#sponsoring">
        
      </a>
    </div>
    <p>Where appropriate (i.e. where we think we can make the biggest impact and get something we need) we've sponsored external open source projects and paid for improvements that all can use.</p><p>We make wide use of the excellent <a href="http://luajit.org/">LuaJIT</a> project and after much profiling by our engineers we discovered areas where more JITing would improve our performance. Rather than do the work ourselves we <a href="http://luajit.org/sponsors.html#sponsorship_perf">sponsored</a> the LuaJIT project. These speedups will be appearing in LuaJIT 2.1 when it is released.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/HzOPkOxjv7EyhnI9yyUBU/ff6bc6f6b09c39070d9202287fe7c10c/two-way-street.png" />
            
            </figure>
    <div>
      <h3>Two Way Street</h3>
      <a href="#two-way-street">
        
      </a>
    </div>
    <p>Of course, it would be easy for us to use open source software, make modifications and not release them. None of the licenses for the software we use force us to release our modifications. But we prefer to give back and not just because of karma.</p><p>There are two big advantages to releasing modifications we've made to existing projects: the many eyeballs effect and reducing fork cost.</p><p>The first, many eyeballs, is common to any open source project: the more people look at code, the better it gets. And that applies equally to code written by a core team of developers and code written by outsiders. When we contribute changes we've made, others look at the changes and improve them.</p><p>For example, back in 2012 we contributed an improvement to Go's <a href="https://code.google.com/p/go/source/detail?r=a041b45cc418">log/syslog</a> module. This year that work was <a href="https://code.google.com/p/go/source/detail?r=51407182d459">improved</a>.</p><p>And the cost of maintaining a fork provides useful economic pressure: it's cheaper for us to release our modifications than to maintain a fork and keep merging as the core of a project changes.</p><p>But, what about CloudFlare's secret sauce?</p>
    <div>
      <h3>Open Sourcing CloudFlare Core</h3>
      <a href="#open-sourcing-cloudflare-core">
        
      </a>
    </div>
    <p>Our strong bias is to open source everything we've built. When we don't, it's usually because it's highly specific to us and/or because the support cost is high. Ultimately, we'd like to open source all our major components so that they can be used to build a faster, safer, smarter web.</p><p>Many of our smaller components are really glue code that don't make any sense to open source as they are so specific to our implementation of the overall system.</p><p>We don't believe that there is any chunk of code so clever that it gives us a long term competitive advantage. Instead, our advantage comes from the network we've built, the data we collect on making the web faster and safer, and, most importantly, the people we're able to attract.</p><p>A commitment to open source builds trust in the community which helps us continue to build our service and attract the best people.</p><p>In fact, the best way to get hired at CloudFlare is to make good contributions to open source projects we find interesting and useful. Contributions like that often speak more loudly than a resume.</p> ]]></content:encoded>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[GitHub]]></category>
            <category><![CDATA[LUA]]></category>
            <guid isPermaLink="false">5k8obbKvvWiWBkX0BNi8Xl</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Pushing Nginx to its limit with Lua]]></title>
            <link>https://blog.cloudflare.com/pushing-nginx-to-its-limit-with-lua/</link>
            <pubDate>Sat, 08 Dec 2012 20:46:00 GMT</pubDate>
            <description><![CDATA[ At CloudFlare, Nginx is at the core of what we do. It is part of the underlying foundation of our reverse proxy service. In addition to the built-in Nginx functionalities, we use an array of custom C modules... ]]></description>
            <content:encoded><![CDATA[ <p>At CloudFlare, Nginx is at the core of what we do. It is part of the underlying foundation of our reverse proxy service. In addition to the built-in Nginx functionalities, we use an array of custom C modules that are specific to our infrastructure including load balancing, monitoring, and caching. Recently, we've been adding more simple services. And they are almost exclusively written in Lua.</p><p>I wanted to share more about how we are augmenting Nginx with new capabilities using Lua and provide some examples so you can do the same.</p>
    <div>
      <h3>What is Lua?</h3>
      <a href="#what-is-lua">
        
      </a>
    </div>
    <p>Lua is a scripting language. Specifically, it is a full-featured multi-paradigm language with a simple syntax and semantics that resemble JavaScript or Scheme. Lua also has an interesting story to it, as it is one of the only languages from an emerging country that has had worldwide impact.</p><p>Lua was always meant to be embedded in larger systems written in other languages (like C and C++), and has thrived at staying very minimal and easy to integrate. As a result, Lua is popular within videogames, <a href="http://wiki.wireshark.org/Lua">security-oriented software</a>, and, more recently, <a href="http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2012-01-30/Technology_report">Wikipedia</a>.</p>
    <div>
      <h3>Benefits of Nginx+Lua</h3>
      <a href="#benefits-of-nginx-lua">
        
      </a>
    </div>
    <p>Nginx+Lua is a self-contained web server embedding the scripting language Lua. Powerful applications can be written directly inside Nginx without using cgi, fastcgi, or uwsgi. By adding a little Lua code to an existing Nginx configuration file, it is easy to add small features. To see it yourself, at the end of this post I've included some logging code that can be added to any existing configuration.</p><p>One of the core benefits of Nginx+Lua is that it is fully asynchronous. Nginx+Lua inherits the same event loop model that has made Nginx a popular choice of webserver. "Asynchronous" simply means that Nginx can interrupt your code when it is waiting on a blocking operation, such as an outgoing connection or reading a file, and run the code of another incoming HTTP Request.</p><p>All the Lua code is written in a <i>sequential</i> fashion. The asynchronous logic is hidden to the Nginx+Lua programmer. If you are familiar with other event-driven webservers, that means <i>no callbacks</i>. In addition, Nginx+Lua is <a href="http://google-opensource.blogspot.com/2010/01/love-for-luajit.html">blazingly fast</a>, leveraging the LuaJIT interpreter.</p>
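    <p>As a taste of how little code a small feature needs, here is a minimal "hello world" location using the <code>content_by_lua</code> directive (a sketch; adapt it to your own configuration):</p>
            <pre><code>    location /hello {
        content_by_lua '
            ngx.say("Hello from Lua inside Nginx")
        ';
    }</code></pre>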
    <div>
      <h3>Getting Nginx+Lua installed</h3>
      <a href="#getting-nginx-lua-installed">
        
      </a>
    </div>
    <p>You can install it from source by compiling the lua-nginx-module with your existing Nginx. If you choose that path you will also need a Lua interpreter. <a href="http://luajit.org/download.html">LuaJIT-2.0.0</a> is recommended.</p><p>Or, you can use the tested <a href="http://openresty.org/#Download">ngx_openresty</a> bundle. ngx_openresty comes loaded with Nginx, <a href="http://openresty.org/#Components">3rd party modules, Lua libraries and other goodies</a>. If you already use Nginx without 3rd party modules, from your Linux distribution for instance, you can safely swap it out with ngx_openresty. (Quick shout-out to my CloudFlare colleague Yichun Zhang who wrote ngx_openresty. Thanks, Yichun!)</p>
    <div>
      <h3>Limitations</h3>
      <a href="#limitations">
        
      </a>
    </div>
    <p>What makes Nginx, and therefore Nginx+Lua, really fast is the asynchronous model and the event loop that Nginx relies on. To stay within that model, outgoing communication from Nginx has to be treated carefully. Using the classic LuaSocket library is not recommended; rely instead on the built-in <a href="https://github.com/chaoslawful/lua-nginx-module/blob/master/README.markdown#ngxsockettcp">ngx_lua sockets</a>.</p><p>However, with a multitude of openresty libraries to "speak" <a href="https://github.com/agentzh/lua-resty-mysql">SQL</a>, <a href="https://github.com/agentzh/lua-resty-memcached">memcached</a>, and <a href="https://github.com/agentzh/lua-resty-redis">Redis</a>, as well as a <a href="https://github.com/agentzh/lua-resty-dns">DNS resolver</a> built on top of ngx_lua sockets, this isn't really a problem in practice.</p>
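    <p>For illustration, here is a sketch of a non-blocking TCP connection using the cosocket API (the host, port, and timeout values are placeholders):</p>
            <pre><code>    local sock = ngx.socket.tcp()
    sock:settimeout(1000)  -- milliseconds

    local ok, err = sock:connect("127.0.0.1", 6379)
    if not ok then
        ngx.log(ngx.ERR, "connect failed: ", err)
        return
    end

    -- ... talk to the backend; Nginx handles other requests while we wait ...

    -- return the connection to the keepalive pool instead of closing it
    sock:setkeepalive(10000, 100)</code></pre>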
    <div>
      <h3>An example to try: Nginx Log Aggregation</h3>
      <a href="#an-example-to-try-nginx-log-aggregation">
        
      </a>
    </div>
    <p><a href="https://github.com/mtourne/nginx_log_by_lua">Here is an example</a> of how to build and run a simple log aggregator for Nginx. You can add it to any of your own existing configuration. This is the output once the aggregated logs are funneled to a <a href="http://opentsdb.net/">time series system</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42JwZlb8wDGCmz9bHrfVuP/59c96fcf6a1c24c25c8007b10e8e4c35/Production_Graph.png.scaled500.png" />
            
            </figure><p>This particular example graph shows the average number of requests per second on certain nodes of the CloudFlare infrastructure.</p>
    <div>
      <h3>Show me the code already!</h3>
      <a href="#show-me-the-code-already">
        
      </a>
    </div>
    <p>Let's assume you already have a working webapp in Nginx, or that you use the proxy_pass directive to upstream requests to an Apache server.</p><p>First, add some lines in the Nginx conf to look at .lua files, and reserve 1MB of shared memory for your Nginx workers ($prefix is relative to your Nginx install).</p><p>Next, add a little Lua snippet to calculate request_time for each request, and aggregate it into shared memory using a logging library. Here is a simple <a href="https://github.com/mtourne/nginx_log_by_lua/blob/master/logging.lua">logging library that I built</a>.</p><p>This snippet can be used directly inline in your Nginx conf, using the <a href="https://github.com/chaoslawful/lua-nginx-module/blob/master/README.markdown#log_by_lua">log_by_lua</a> directive.</p>
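    <p>The full configuration lives in the repository linked above; as a rough sketch (the directives are real, but the dictionary and key names here are illustrative), the two pieces look like this:</p>
            <pre><code>    http {
        # 1MB of shared memory visible to all Nginx workers
        lua_shared_dict log_dict 1M;

        server {
            location / {
                proxy_pass http://your_backend;

                log_by_lua '
                    local log_dict = ngx.shared.log_dict
                    local newval = log_dict:incr("nb_requests", 1)
                    if not newval then
                        log_dict:set("nb_requests", 1)
                    end
                    -- aggregate ngx.var.request_time here as well
                ';
            }
        }
    }</code></pre>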
    <div>
      <h3>Displaying/collecting aggregated logs</h3>
      <a href="#displaying-collecting-aggregated-logs">
        
      </a>
    </div>
    <p>The last step to complete the example is a system to collect and/or display logs. In the <a href="https://github.com/mtourne/nginx_log_by_lua/blob/master/conf/nginx.conf#L42">full example</a>, we set the aggregation as a separate server listening on a different port.</p><p>You should now have a functioning Nginx+Lua modification running in your environment.</p>
    <div>
      <h3>Using Lua instead of custom C modules</h3>
      <a href="#using-lua-instead-of-custom-c-modules">
        
      </a>
    </div>
    <p>This example showed how Lua found its way into our system at CloudFlare, but we soon realized that it wasn't limited to aggregating and printing logs. Using the same <a href="http://www.nginxguts.com/2011/01/phases/">phases that Nginx has laid out</a> for processing HTTP requests, it is becoming possible to add interesting new capabilities to Nginx, with almost as much control as a custom C module, while being pleasant and easy to write.</p><p>For instance, the <a href="https://github.com/chaoslawful/lua-nginx-module/blob/master/README.markdown#rewrite_by_lua">access</a> phase can be seen as a programmatic .htaccess, and even <a href="http://seatgeek.com/blog/dev/oauth-support-for-nginx-with-lua">more</a>. Whereas the <a href="https://github.com/chaoslawful/lua-nginx-module/blob/master/README.markdown#content_by_lua">content</a> phase is where your web application would go.</p><p>Nginx+Lua has become a foundation for the work that I do at CloudFlare. As a long-time C developer, I am constantly struck by how powerful and extremely expressive Lua can be, while being simple and approachable as well.</p><p>Sometimes, simple is beautiful.</p><hr /><p><i>PS - We're hiring Lua programmers who are interested in working at extreme scale. Check out the Systems Engineer listing on our </i><a href="https://www.cloudflare.com/join-our-team"><i>careers page</i></a><i> if you're interested.</i></p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[LUA]]></category>
            <guid isPermaLink="false">4asxlfurkVRlt22KttoRYy</guid>
            <dc:creator>Matthieu Tourne</dc:creator>
        </item>
    </channel>
</rss>