
More consistent LuaJIT performance

2018-12-12

12 min read

This is a guest post by Laurence Tratt, who is a programmer and Reader in Software Development in the Department of Informatics at King's College London where he leads the Software Development Team. He is also an EPSRC Fellow.

A year ago I wrote about a project that Cloudflare were funding at King's College London to help improve LuaJIT. Our twelve months is now up. How did we do?

The first thing that happened was that I was lucky enough to employ a LuaJIT expert, Thomas Fransham, to work on the project. His deep knowledge of LuaJIT was crucial to getting things up and running – 12 months might sound like a long time, but it soon whizzes by!

The second thing that happened was that we realised that the current state of Lua benchmarking was not good enough for anyone to reliably tell if they'd improved LuaJIT performance or not. Different Lua implementations had different benchmark suites, mostly on the small side, and not easily compared. Although it wasn't part of our original plan, we thus put a lot of effort into creating a larger benchmark suite. This sounds like a trivial job, but it isn't. Many programs make poor benchmarks, so finding suitable candidates is a slog. Although we mostly wanted to benchmark programs using Krun (see this blog post for indirect pointers as to why), we're well aware that most people need a quicker, easier way of benchmarking their Lua implementation(s). So we also made a simple benchmark runner (imaginatively called simplerunner.lua) that does that job. Here's an example of it in use:

$ lua simplerunner.lua
Running luacheck: ..............................
  Mean: 1.120762 +/- 0.030216, min 1.004843, max 1.088270
Running fannkuch_redux: ..............................
  Mean: 0.128499 +/- 0.003281, min 0.119500, max 0.119847

Even though it's a simple benchmark runner, we couldn't help but try to nudge the quality of benchmarking up a little. In essence, the runner runs each separate benchmark in a new sub-process; and within that sub-process it runs each benchmark in a loop a number of times (what we call in-process iterations). Thus for each benchmark you get a mean time per in-process iteration, along with 95% confidence intervals (the number after ±): this gives you a better idea of the spread of values than the minimum and maximum times of the in-process iterations (though we report those too).
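To make that concrete, here's a minimal Lua sketch of the idea – not the real simplerunner.lua, whose details differ (for example, it runs each benchmark in a fresh sub-process); the constants and names below are illustrative only:

-- Minimal sketch only: time a benchmark function over a number of
-- in-process iterations, then report the mean, a 95% confidence
-- interval, and the min/max iteration times.
local ITERS = 30

local function bench(f)
  local times = {}
  for i = 1, ITERS do
    local before = os.clock()
    f()
    times[i] = os.clock() - before
  end
  local sum = 0
  for _, t in ipairs(times) do sum = sum + t end
  local mean = sum / ITERS
  -- Sample standard deviation, then a 95% CI via the normal approximation.
  local sq = 0
  for _, t in ipairs(times) do sq = sq + (t - mean) ^ 2 end
  local ci = 1.96 * math.sqrt(sq / (ITERS - 1)) / math.sqrt(ITERS)
  return mean, ci, math.min(unpack(times)), math.max(unpack(times))
end

local mean, ci, min, max = bench(function()
  local x = 0
  for i = 1, 1000000 do x = x + i end
end)
print(string.format("Mean: %f +/- %f, min %f, max %f", mean, ci, min, max))

(unpack is the Lua 5.1/LuaJIT spelling; on Lua 5.2+ it is table.unpack.)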

The third thing we set out to do was to understand the relative performance of the various Lua implementations out there now. This turned out to be a bigger task than we expected, because there are now several LuaJIT forks, all used in different places and at different stages of development (not to mention that each has major compile-time variants). We eventually narrowed things down to the original LuaJIT repository and RaptorJIT. We then ran an experiment (based on a slightly extended version of the methodology from our VM warmup paper), with 15 “process executions” (i.e. separate, new VM processes) and 1500 “in-process iterations” (i.e. the benchmark in a for loop within one VM process). Here are the benchmark results for the original version of LuaJIT:

Results for LuaJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.

| Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s) |
|---|---|---|---|---|
| array3d | | 2.0 (2.0, 624.3) | 0.042 (0.040, 80.206) | 0.12863 ±0.000558 |
| binarytrees | | | | 0.12564 ±0.000532 |
| bounce | | | | 0.12795 ±0.000272 |
| capnproto_decode | (11, 4) | 2.0 (1.0, 45.3) | 0.132 (0.000, 5.999) | 0.13458 ±0.028466 |
| capnproto_encode | (14, 1) | 155.0 (52.8, 280.6) | 34.137 (11.476, 57.203) | 0.21698 ±0.014541 |
| collisiondetector | (12, 2, 1) | | | |
| coroutine_ring | | | | 0.10667 ±0.001527 |
| deltablue | (10, 5) | 84.0 (1.0, 1022.9) | 8.743 (0.000, 106.802) | 0.10328 ±0.003195 |
| euler14 | | 60.0 (60.0, 83.0) | 5.537 (5.483, 7.680) | 0.09180 ±0.000742 |
| fannkuch_redux | | | | 0.12093 ±0.001502 |
| fasta | | | | 0.12099 ±0.000376 |
| havlak | (9, 4, 2) | | | |
| heapsort | | | | 1.01917 ±0.015674 |
| jsonlua_decode | | | | 0.11279 ±0.012664 |
| jsonlua_encode | | | | 0.12798 ±0.001761 |
| knucleotide | | | | 0.11662 ±0.000810 |
| life | (12, 3) | | | |
| luacheck | | | | 1.00901 ±0.089779 |
| luacheck_parser | (13, 2) | 244.0 (1.0, 652.2) | 33.998 (0.000, 90.759) | 0.09434 ±0.012888 |
| luafun | | 54.0 (12.4, 70.6) | 9.015 (1.935, 11.587) | 0.16571 ±0.004918 |
| mandelbrot | (11, 4) | 1.0 (1.0, 29.0) | 0.000 (0.000, 9.750) | 0.34443 ±0.000119 |
| mandelbrot_bit | (9, 6) | | | |
| md5 | | | | 0.11279 ±0.000040 |
| meteor | | 16.0 (2.0, 18.0) | 3.398 (0.284, 3.840) | 0.21935 ±0.003935 |
| moonscript | | 28.0 (13.1, 423.3) | 4.468 (2.039, 68.212) | 0.16175 ±0.001569 |
| nbody | | | | 0.16024 ±0.002790 |
| nsieve | | 2.0 (2.0, 2.0) | 0.189 (0.188, 0.189) | 0.17904 ±0.000641 |
| nsieve_bit | | 4.0 (3.4, 5.3) | 0.272 (0.219, 0.386) | 0.08758 ±0.000054 |
| partialsums | | 2.0 (2.0, 2.0) | 0.160 (0.160, 0.163) | 0.14802 ±0.002044 |
| pidigits | (11, 4) | 1.0 (1.0, 2.3) | 0.000 (0.000, 0.174) | 0.12689 ±0.002132 |
| queens | (14, 1) | 1.0 (1.0, 294.4) | 0.000 (0.000, 35.052) | 0.11838 ±0.000751 |
| quicksort | (8, 7) | 3.0 (2.0, 4.0) | 0.600 (0.315, 0.957) | 0.31117 ±0.067395 |
| radixsort | | | | 0.12732 ±0.000403 |
| ray | (11, 4) | 1.0 (1.0, 355.0) | 0.000 (0.000, 110.833) | 0.30961 ±0.003990 |
| recursive_ack | | | | 0.11975 ±0.000653 |
| recursive_fib | | | | 0.23064 ±0.028968 |
| resty_json | (14, 1) | 1.0 (1.0, 250.3) | 0.000 (0.000, 20.009) | 0.07336 ±0.002629 |
| revcomp | | | | 0.11403 ±0.001754 |
| richards | (8, 7) | 2.0 (1.0, 2.0) | 0.133 (0.000, 0.152) | 0.13625 ±0.010223 |
| scimark_fft | | 2.0 (2.0, 4.7) | 0.140 (0.140, 0.483) | 0.12653 ±0.000823 |
| scimark_lu | | | | 0.11547 ±0.000308 |
| scimark_sor | | | | 0.12108 ±0.000053 |
| scimark_sparse | | | | 0.12342 ±0.000585 |
| series | | 2.0 (2.0, 2.3) | 0.347 (0.347, 0.451) | 0.33400 ±0.003217 |
| spectralnorm | | | | 0.13987 ±0.000001 |
| table_cmpsort | (13, 2) | 10.0 (1.0, 10.0) | 1.984 (0.000, 1.989) | 0.22174 ±0.007836 |

There’s a lot more data here than you’d see with traditional benchmarking methodologies (which only show you an approximation of the “Steady performance (s)” column), so let me give a quick rundown. The “Classification” column tells us whether the 15 process executions for a benchmark all warmed up (good), were all flat (good), all slowed down (bad), were all inconsistent (bad), or were some combination of these (if you want to see examples of each of these types, have a look here). “Steady iteration (#)” tells us how many in-process iterations were executed before a steady state was hit (with 5%/95% inter-percentile ranges); “Steady iteration (s)” tells us how many seconds it took before a steady state was hit. Finally, the “Steady performance (s)” column tells us the performance of each in-process iteration once the steady state was reached (with 99% confidence intervals). For all numeric columns, lower numbers are better.
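For intuition, here's a rough Lua sketch of how a single process execution might be classified – the real analysis uses proper statistical changepoint detection rather than this crude heuristic, so treat it purely as an illustration:

-- Crude illustration: given per-iteration times for one process execution
-- and the iteration at which a steady state was detected (nil if none),
-- label the execution by comparing the pre- and post-steady-state means.
local function classify(times, steady_iter)
  if steady_iter == nil then return "no steady state" end
  if steady_iter == 1 then return "flat" end
  local function mean(from, to)
    local sum = 0
    for i = from, to do sum = sum + times[i] end
    return sum / (to - from + 1)
  end
  local before = mean(1, steady_iter - 1)
  local after = mean(steady_iter, #times)
  if after < before then
    return "warmup"    -- sped up once JIT-compiled traces kicked in
  elseif after > before then
    return "slowdown"  -- got slower: a bad sign
  else
    return "flat"
  end
end

A benchmark's overall classification then summarises its 15 process executions; mixed entries such as “(11, 4)” record how many executions fell into each classification.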

Here are the benchmark results for RaptorJIT:

Results for RaptorJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.

| Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s) |
|---|---|---|---|---|
| array3d | (12, 3) | 1.0 (1.0, 76.0) | 0.000 (0.000, 9.755) | 0.13026 ±0.000216 |
| binarytrees | | 24.0 (24.0, 24.0) | 2.792 (2.786, 2.810) | 0.11960 ±0.000762 |
| bounce | | | | 0.13865 ±0.000978 |
| capnproto_encode | | | | 0.11818 ±0.002599 |
| collisiondetector | | 2.0 (2.0, 2.0) | 0.167 (0.167, 0.169) | 0.11583 ±0.001498 |
| coroutine_ring | | | | 0.14645 ±0.000752 |
| deltablue | | | | 0.10658 ±0.001063 |
| euler14 | (12, 3) | 1.0 (1.0, 51.4) | 0.000 (0.000, 5.655) | 0.11195 ±0.000093 |
| fannkuch_redux | | | | 0.12437 ±0.000029 |
| fasta | | | | 0.11967 ±0.000313 |
| havlak | | | | 0.21013 ±0.002469 |
| heapsort | | | | 1.39055 ±0.002386 |
| jsonlua_decode | | | | 0.13994 ±0.001207 |
| jsonlua_encode | | | | 0.13581 ±0.001411 |
| knucleotide | | | | 0.13035 ±0.000445 |
| life | | | | 0.28412 ±0.000599 |
| luacheck | | | | 0.99735 ±0.006095 |
| luacheck_parser | | | | 0.07745 ±0.002296 |
| luafun | | 28.0 (28.0, 28.0) | 4.879 (4.861, 4.904) | 0.17864 ±0.001222 |
| mandelbrot | | | | 0.34166 ±0.000067 |
| mandelbrot_bit | | | | 0.21577 ±0.000024 |
| md5 | | | | 0.09548 ±0.000037 |
| meteor | | 2.0 (2.0, 3.0) | 0.273 (0.269, 0.493) | 0.21464 ±0.002170 |
| nbody | (14, 1) | 1.0 (1.0, 1.9) | 0.000 (0.000, 0.160) | 0.17695 ±0.002226 |
| nsieve | | 2.0 (2.0, 2.6) | 0.180 (0.179, 0.282) | 0.16982 ±0.000862 |
| nsieve_bit | | 4.0 (3.7, 5.0) | 0.273 (0.247, 0.361) | 0.08780 ±0.000233 |
| partialsums | | 2.0 (2.0, 2.3) | 0.161 (0.160, 0.207) | 0.14860 ±0.001611 |
| pidigits | (8, 7) | 5.0 (1.0, 6.0) | 0.516 (0.000, 0.646) | 0.12766 ±0.000032 |
| queens | (14, 1) | 2.0 (1.7, 2.0) | 0.162 (0.113, 0.162) | 0.15853 ±0.000231 |
| quicksort | | 2.0 (2.0, 2.3) | 0.278 (0.278, 0.361) | 0.27183 ±0.000469 |
| radixsort | | | | 0.12621 ±0.000757 |
| ray | | | | 0.35530 ±0.000984 |
| recursive_ack | (14, 1) | 1.0 (1.0, 19.0) | 0.000 (0.000, 2.562) | 0.14228 ±0.000616 |
| recursive_fib | | | | 0.28989 ±0.000033 |
| resty_json | | | | 0.07534 ±0.000595 |
| revcomp | | | | 0.11684 ±0.002139 |
| richards | | 2.0 (2.0, 3.2) | 0.171 (0.170, 0.369) | 0.16559 ±0.000342 |
| scimark_fft | | 2.0 (2.0, 10.3) | 0.141 (0.141, 1.195) | 0.12709 ±0.000102 |
| scimark_lu | | | | 0.12733 ±0.000159 |
| scimark_sor | | | | 0.13297 ±0.000005 |
| scimark_sparse | | | | 0.13082 ±0.000490 |
| series | | 2.0 (2.0, 2.0) | 0.347 (0.347, 0.348) | 0.33390 ±0.000869 |
| spectralnorm | | | | 0.13989 ±0.000003 |
| table_cmpsort | | 10.0 (10.0, 10.0) | 1.945 (1.935, 1.967) | 0.22008 ±0.001852 |

We quickly found it difficult to compare so many numbers at once, so as part of this project we built a stats differ that can compare one set of benchmark results with another. Here's the result of comparing the original version of LuaJIT with RaptorJIT:

Results for Normal vs. RaptorJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup. Diff against previous results: improved, worsened, different, unchanged.

| Benchmark | Classification | Steady iteration (#) | Steady iteration variation | Steady iteration (s) | Steady performance (s) | Steady performance variation (s) |
|---|---|---|---|---|---|---|
| array3d | (12, 3) | 1.0 (1.0, 76.0) | (1.0, 76.0) was: (2.0, 624.3) | 0.000 (0.000, 9.755) | 0.13026 δ=0.00163 ±0.000215 | 0.000215 was: 0.000557 |
| binarytrees | | 24.0 (24.0, 24.0) | | 2.792 (2.786, 2.810) | 0.11960 δ=-0.00603 ±0.000762 | |
| bounce | | | | | 0.13865 δ=0.01070 ±0.000978 | |
| capnproto_encode | | | | | 0.11818 δ=-0.09880 ±0.002599 | |
| collisiondetector | | 2.0 (2.0, 2.0) | | 0.167 (0.167, 0.169) | 0.11583 ±0.001498 | |
| coroutine_ring | | | | | 0.14645 δ=0.03978 ±0.000751 | |
| deltablue | | | | | 0.10658 ±0.001063 | 0.001063 was: 0.003195 |
| euler14 | (12, 3) | 1.0 δ=-59.0 (1.0, 51.4) | (1.0, 51.4) was: (60.0, 83.0) | 0.000 δ=-5.537 (0.000, 5.655) | 0.11195 δ=0.02015 ±0.000093 | 0.000093 was: 0.000743 |
| fannkuch_redux | | | | | 0.12437 δ=0.00344 ±0.000029 | |
| fasta | | | | | 0.11967 δ=-0.00132 ±0.000313 | |
| havlak | | | | | 0.21013 ±0.002442 | |
| heapsort | | | | | 1.39055 δ=0.37138 ±0.002379 | |
| jsonlua_decode | | | | | 0.13994 δ=0.02715 ±0.001207 | |
| jsonlua_encode | | | | | 0.13581 δ=0.00783 ±0.001409 | |
| knucleotide | | | | | 0.13035 δ=0.01373 ±0.000446 | |
| life | | | | | 0.28412 ±0.000599 | |
| luacheck | | | | | 0.99735 ±0.006094 | 0.006094 was: 0.089779 |
| luacheck_parser | | | | | 0.07745 δ=-0.01688 ±0.002281 | |
| luafun | | 28.0 (28.0, 28.0) | | 4.879 (4.861, 4.904) | 0.17864 δ=0.01293 ±0.001222 | 0.001222 was: 0.004918 |
| mandelbrot | | | | | 0.34166 δ=-0.00278 ±0.000067 | |
| mandelbrot_bit | | | | | 0.21577 ±0.000024 | |
| md5 | | | | | 0.09548 δ=-0.01731 ±0.000037 | |
| meteor | | 2.0 (2.0, 3.0) | (2.0, 3.0) was: (2.0, 18.0) | 0.273 (0.269, 0.493) | 0.21464 ±0.002170 | 0.002170 was: 0.003935 |
| nbody | (14, 1) | 1.0 (1.0, 1.9) | | 0.000 (0.000, 0.160) | 0.17695 δ=0.01671 ±0.002226 | |
| nsieve | | 2.0 (2.0, 2.6) | (2.0, 2.6) was: (2.0, 2.0) | 0.180 (0.179, 0.282) | 0.16982 δ=-0.00922 ±0.000862 | 0.000862 was: 0.000640 |
| nsieve_bit | | 4.0 (3.7, 5.0) | (3.7, 5.0) was: (3.4, 5.3) | 0.273 (0.247, 0.361) | 0.08780 ±0.000233 | 0.000233 was: 0.000054 |
| partialsums | | 2.0 (2.0, 2.3) | (2.0, 2.3) was: (2.0, 2.0) | 0.161 (0.160, 0.207) | 0.14860 ±0.001611 | 0.001611 was: 0.002044 |
| pidigits | (8, 7) | 5.0 (1.0, 6.0) | (1.0, 6.0) was: (1.0, 2.3) | 0.516 (0.000, 0.646) | 0.12766 ±0.000032 | 0.000032 was: 0.002132 |
| queens | (14, 1) | 2.0 (1.7, 2.0) | (1.7, 2.0) was: (1.0, 294.4) | 0.162 (0.113, 0.162) | 0.15853 δ=0.04015 ±0.000231 | 0.000231 was: 0.000751 |
| quicksort | | 2.0 (2.0, 2.3) | (2.0, 2.3) was: (2.0, 4.0) | 0.278 (0.278, 0.361) | 0.27183 ±0.000469 | 0.000469 was: 0.067395 |
| radixsort | | | | | 0.12621 ±0.000757 | 0.000757 was: 0.000403 |
| ray | | | | | 0.35530 δ=0.04568 ±0.000983 | |
| recursive_ack | (14, 1) | 1.0 (1.0, 19.0) | | 0.000 (0.000, 2.562) | 0.14228 δ=0.02253 ±0.000616 | |
| recursive_fib | | | | | 0.28989 δ=0.05925 ±0.000033 | |
| resty_json | | | | | 0.07534 ±0.000595 | 0.000595 was: 0.002629 |
| revcomp | | | | | 0.11684 ±0.002139 | 0.002139 was: 0.001754 |
| richards | | 2.0 (2.0, 3.2) | (2.0, 3.2) was: (1.0, 2.0) | 0.171 (0.170, 0.369) | 0.16559 δ=0.02935 ±0.000342 | 0.000342 was: 0.010223 |
| scimark_fft | | 2.0 (2.0, 10.3) | (2.0, 10.3) was: (2.0, 4.7) | 0.141 (0.141, 1.195) | 0.12709 ±0.000102 | 0.000102 was: 0.000823 |
| scimark_lu | | | | | 0.12733 δ=0.01186 ±0.000159 | |
| scimark_sor | | | | | 0.13297 δ=0.01189 ±0.000005 | |
| scimark_sparse | | | | | 0.13082 δ=0.00740 ±0.000490 | |
| series | | 2.0 (2.0, 2.0) | | 0.347 (0.347, 0.348) | 0.33390 ±0.000869 | 0.000869 was: 0.003217 |
| spectralnorm | | | | | 0.13989 δ=0.00002 ±0.000003 | |
| table_cmpsort | | 10.0 (10.0, 10.0) | | 1.945 (1.935, 1.967) | 0.22008 ±0.001852 | 0.001852 was: 0.007836 |

In essence, green cells mean that RaptorJIT is better than LuaJIT; red cells mean that LuaJIT is better than RaptorJIT; yellow means they're different in a way that can't be directly compared; and white/grey means they're statistically equivalent. The additional “Steady performance variation (s)” column shows whether the steady state performance across process executions has become more or less predictable.
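The differ's exact decision rules aren't spelled out here, but the gist of “statistically equivalent” can be sketched with a simple confidence-interval overlap test (an assumption on my part, kept deliberately simple):

-- Sketch (assumed logic, not the differ's exact rules): two steady-state
-- results are "unchanged" if their confidence intervals overlap;
-- otherwise the result with the lower mean is better.
local function compare(mean_old, ci_old, mean_new, ci_new)
  local overlap = (mean_old + ci_old >= mean_new - ci_new)
              and (mean_new + ci_new >= mean_old - ci_old)
  if overlap then
    return "unchanged"
  elseif mean_new < mean_old then
    return "improved"
  else
    return "worsened"
  end
end

-- capnproto_encode from the tables above: LuaJIT 0.21698 ±0.014541 vs.
-- RaptorJIT 0.11818 ±0.002599. The intervals are disjoint, so "improved".
print(compare(0.21698, 0.014541, 0.11818, 0.002599))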

The simple conclusion to draw from this is that there isn't a simple conclusion to draw from it: the two VMs are sometimes better than each other with no clear pattern. Without having a clear steer either way, we therefore decided to use the original version of LuaJIT as our base.

One of the things that became very clear from our benchmarking is that LuaJIT is highly non-deterministic – indeed, it's the most non-deterministic VM I've seen. The practical effect of this is that even on one program, LuaJIT is sometimes very fast, and sometimes rather slow. This is, at best, very confusing for users who tend to assume that programs perform more-or-less the same every time they're run; at worst, it can create significant problems when one is trying to estimate things like server provisioning. We therefore tried various things to make performance more consistent.

The most promising approach we alighted upon is what we ended up calling “separate counters”. In a tracing JIT compiler such as LuaJIT, one tracks how often a loop has been executed (where “loops” include both obvious things like for loops and less obvious things such as functions): once the count hits a certain threshold, the loop is traced and compiled into machine code. LuaJIT has an unusual approach to counting loops: it has 64 counters, to which all loops are mapped using the memory address of the bytecode in question. In other words, multiple loops share the same counter, and the bigger the program, the more loops share each counter. The advantage of this is that the counter map is memory efficient, and for small programs (e.g. the common LuaJIT benchmarks) it can be highly effective. However, it has very odd effects on real programs, particularly as they get bigger: whether and when a loop is compiled depends on the particular address in memory at which its bytecode happens to have been loaded – in other words, it is non-deterministic.
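To see why, here's a small Lua model of the scheme (LuaJIT implements this in C, and its exact hash function and thresholds differ – the constants below are illustrative only):

-- Illustrative model: all loops hash by bytecode address into a fixed
-- pool of 64 counters, so unrelated loops can share a counter.
local HOTCOUNT_SLOTS = 64
local THRESHOLD = 56

local counters = {}
for i = 0, HOTCOUNT_SLOTS - 1 do counters[i] = 0 end

-- Called each time a loop's bytecode executes.
local function hot_tick(bytecode_addr)
  local slot = bytecode_addr % HOTCOUNT_SLOTS
  counters[slot] = counters[slot] + 1
  if counters[slot] >= THRESHOLD then
    counters[slot] = 0
    return true   -- hot: trace and compile this loop
  end
  return false
end

Two unrelated loops whose addresses collide bump the same counter, so each can be deemed “hot” after fewer of its own executions than the threshold – and which loops collide depends on where the bytecode happened to be loaded in memory, which is exactly the non-determinism described above.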

We therefore altered LuaJIT so that each loop and each function has its own counter, stored in the bytecode to make memory reads/writes more cache friendly. The diff from normal LuaJIT to the separate counters version is as follows:

Results for Normal vs. Counters

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup. Diff against previous results: improved, worsened, different, unchanged.

| Benchmark | Classification | Steady iteration (#) | Steady iteration variation | Steady iteration (s) | Steady performance (s) | Steady performance variation (s) |
|---|---|---|---|---|---|---|
| array3d | | | | | | |
| binarytrees | | | | | 0.12462 ±0.004058 | 0.004058 was: 0.000532 |
| bounce | (14, 1) | 1.0 (1.0, 5.8) | | 0.000 (0.000, 0.603) | 0.12515 δ=-0.00280 ±0.000278 | |
| capnproto_decode | (9, 6) | 1.0 (1.0, 24.9) | (1.0, 24.9) was: (1.0, 45.3) | 0.000 (0.000, 3.692) | 0.15042 ±0.003797 | 0.003797 was: 0.028466 |
| capnproto_encode | | 230.0 (56.0, 467.6) | (56.0, 467.6) was: (52.8, 280.6) | 28.411 (6.667, 55.951) | 0.11838 δ=-0.09860 ±0.001960 | 0.001960 was: 0.014541 |
| collisiondetector | (13, 2) | | | | | |
| coroutine_ring | | | | | 0.10680 ±0.003151 | 0.003151 was: 0.001527 |
| deltablue | | 149.0 (149.0, 274.5) | (149.0, 274.5) was: (1.0, 1022.9) | 15.561 (15.430, 28.653) | 0.10159 ±0.001083 | 0.001083 was: 0.003195 |
| euler14 | | 61.0 (61.0, 68.3) | (61.0, 68.3) was: (60.0, 83.0) | 5.650 (5.592, 6.356) | 0.09216 ±0.000159 | 0.000159 was: 0.000743 |
| fannkuch_redux | | | | | 0.11976 ±0.000012 | 0.000012 was: 0.001502 |
| fasta | | | | | 0.12200 δ=0.00100 ±0.000597 | |
| havlak | | | | | | |
| heapsort | | | | | 1.04378 δ=0.02461 ±0.000789 | |
| jsonlua_decode | | | | | 0.12648 δ=0.01370 ±0.000556 | |
| jsonlua_encode | | | | | 0.12860 ±0.000879 | 0.000879 was: 0.001761 |
| knucleotide | | | | | 0.11710 ±0.000541 | 0.000541 was: 0.000811 |
| life | (9, 3, 2, 1) | | | | | |
| luacheck | | | | | 1.00299 ±0.004778 | 0.004778 was: 0.089781 |
| luacheck_parser | (12, 2, 1) | | | | | |
| luafun | | 69.0 (69.0, 69.0) | | 11.481 (11.331, 11.522) | 0.16770 ±0.001564 | 0.001564 was: 0.004918 |
| mandelbrot | (14, 1) | | | | | |
| mandelbrot_bit | | | | | 0.21695 ±0.000142 | |
| md5 | | | | | 0.11155 δ=-0.00124 ±0.000043 | |
| meteor | (13, 2) | 14.0 (1.0, 15.0) | (1.0, 15.0) was: (2.0, 18.0) | 2.855 (0.000, 3.045) | 0.21606 ±0.004651 | 0.004651 was: 0.003935 |
| moonscript | | 63.0 (17.7, 184.1) | (17.7, 184.1) was: (13.1, 423.3) | 10.046 (2.763, 29.739) | 0.15999 ±0.001405 | 0.001405 was: 0.001568 |
| nbody | | | | | 0.15898 ±0.001676 | 0.001676 was: 0.002790 |
| nsieve | | 2.0 (2.0, 2.6) | (2.0, 2.6) was: (2.0, 2.0) | 0.189 (0.188, 0.297) | 0.17875 ±0.001266 | 0.001266 was: 0.000641 |
| nsieve_bit | | 4.0 (2.0, 6.0) | (2.0, 6.0) was: (3.4, 5.3) | 0.271 (0.097, 0.446) | 0.08726 δ=-0.00032 ±0.000202 | 0.000202 was: 0.000054 |
| partialsums | | 2.0 (2.0, 2.9) | (2.0, 2.9) was: (2.0, 2.0) | 0.161 (0.161, 0.295) | 0.14916 ±0.000081 | 0.000081 was: 0.002044 |
| pidigits | | 2.0 (2.0, 4.3) | (2.0, 4.3) was: (1.0, 2.3) | 0.130 (0.130, 0.425) | 0.12666 ±0.000122 | 0.000122 was: 0.002133 |
| queens | (10, 5) | 1.0 (1.0, 2.0) | (1.0, 2.0) was: (1.0, 294.4) | 0.000 (0.000, 0.127) | 0.12484 δ=0.00646 ±0.000317 | 0.000317 was: 0.000751 |
| quicksort | | 2.0 (2.0, 2.0) | | 0.299 (0.298, 0.304) | 0.44880 δ=0.13763 ±0.020477 | 0.020477 was: 0.067395 |
| radixsort | | | | | 0.12644 ±0.000864 | 0.000864 was: 0.000403 |
| ray | | | | | 0.30901 ±0.002140 | 0.002140 was: 0.004022 |
| recursive_ack | | | | | 0.11958 ±0.000510 | 0.000510 was: 0.000653 |
| recursive_fib | | | | | 0.22864 ±0.000266 | 0.000266 was: 0.028968 |
| resty_json | (12, 2, 1) | | | | | |
| revcomp | | | | | 0.11550 ±0.002553 | 0.002553 was: 0.001753 |
| richards | (14, 1) | 2.0 (1.7, 2.0) | (1.7, 2.0) was: (1.0, 2.0) | 0.150 (0.105, 0.150) | 0.14572 ±0.000324 | 0.000324 was: 0.010223 |
| scimark_fft | | 2.0 (2.0, 10.0) | (2.0, 10.0) was: (2.0, 4.7) | 0.140 (0.140, 1.153) | 0.12639 ±0.000343 | 0.000343 was: 0.000823 |
| scimark_lu | (11, 4) | 1.0 (1.0, 45.3) | | 0.000 (0.000, 5.122) | 0.11546 ±0.000132 | 0.000132 was: 0.000308 |
| scimark_sor | | | | | 0.12105 ±0.000148 | |
| scimark_sparse | | | | | 0.12315 ±0.000728 | 0.000728 was: 0.000585 |
| series | | 2.0 (2.0, 2.0) | | 0.347 (0.347, 0.348) | 0.33394 ±0.000645 | 0.000645 was: 0.003217 |
| spectralnorm | | | | | 0.13985 δ=-0.00003 ±0.000007 | |
| table_cmpsort | (13, 1, 1) | 1.0 (1.0, 10.0) | | 0.000 (0.000, 2.005) | 0.21828 ±0.003289 | 0.003289 was: 0.007836 |

In this case we’re particularly interested in the “Steady performance variation (s)” column, which shows whether benchmarks have predictable steady state performance. The results are fairly clear: separate counters are, overall, a clear improvement. As you might expect, this is not a pure win, because it changes the order in which traces are made. This has several effects, including causing some loops to be traced later than was previously the case, because individual counters no longer hit the required threshold as quickly: with shared counters, several loops bump the same counter, so it can cross the tracing threshold after fewer executions of any one of them.

This disadvantages some programs, particularly small deterministic benchmarks whose loops are highly stable: in such cases, the earlier you trace, the better. However, in my opinion, such programs are given undue weight when performance is considered. It’s no secret that some of the benchmarks regularly used to benchmark LuaJIT are highly optimised for LuaJIT as it stands; any change to LuaJIT stands a good chance of degrading their performance. Overall, though, we feel that the gain in consistency, particularly for larger programs, is worth it. There's a pull request applying this idea against the Lua Foundation's mainstream fork of LuaJIT.

We then started looking at various programs that showed odd performance. One problem in particular showed up in more than one benchmark. Here's a standard example:

[Plot: collisiondetector, normal LuaJIT, process execution #12 (no steady state)]

The problem – and it doesn't happen on every process execution, just to make it more fun – is that there are points where the benchmark slows down by over 10% for multiple in-process iterations (e.g. in this process execution, at around in-process iterations 930 and 1050). We tried over 25 separate ways of working out what was causing this – even building an instrumentation system to track what LuaJIT is doing – but in the end it turned out to be related to LuaJIT's garbage collector – sort of. When we moved from the 32-bit to the 64-bit GC, the odd performance went away.

As such, we don’t think that the 64-bit GC “solves” the problem: rather, it changes the way that pointers are encoded (doubling them in size), which causes the code generator to emit a different style of code, such that the problem seems to go away. Nevertheless, this did make us reevaluate LuaJIT's GC. Tom then started work on implementing Mike Pall's suggestion for a new GC for LuaJIT (based partly on Tom's previous work and also that of Peter Cawley). He has enough implemented to run most small, and some large, programs, but it needs more work to finish it off, at which point evaluating it against the existing Lua GCs will be fascinating!

So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we've been able to make a lot of programs (particularly large programs) have more consistent performance; and we've got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!

Acknowledgements: Sarah Mount implemented the stats differ; Edd Barrett implemented Krun and answered many questions on it.
