This is a guest post by Laurence Tratt, who is a programmer and Reader in Software Development in the Department of Informatics at King's College London where he leads the Software Development Team. He is also an EPSRC Fellow.
A year ago I wrote about a project that Cloudflare were funding at King's College London to help improve LuaJIT. Our twelve months is now up. How did we do?
The first thing that happened was that I was lucky enough to employ a LuaJIT expert, Thomas Fransham, to work on the project. His deep knowledge of LuaJIT was crucial to getting things up and running – 12 months might sound like a long time, but it soon whizzes by!
The second thing that happened was that we realised that the current state of Lua benchmarking was not good enough for anyone to reliably tell if they'd improved LuaJIT performance or not. Different Lua implementations had different benchmark suites, mostly on the small side, and not easily compared. Although it wasn't part of our original plan, we thus put a lot of effort into creating a larger benchmark suite. This sounds like a trivial job, but it isn't. Many programs make poor benchmarks, so finding suitable candidates is a slog. Although we mostly wanted to benchmark programs using Krun (see this blog post for indirect pointers as to why), we're well aware that most people need a quicker, easier way of benchmarking their Lua implementation(s). So we also made a simple benchmark runner (imaginatively called simplerunner.lua) that does that job. Here's an example of it in use:
$ lua simplerunner.lua
Running luacheck: ..............................
Mean: 1.120762 +/- 0.030216, min 1.004843, max 1.088270
Running fannkuch_redux: ..............................
Mean: 0.128499 +/- 0.003281, min 0.119500, max 0.119847
Even though it's a simple benchmark runner, we couldn't help but try and nudge the quality of benchmarking up a little bit. In essence, the runner runs each benchmark in a new sub-process; and within that sub-process it runs the benchmark in a loop a number of times (what we call in-process iterations). Thus for each benchmark you get a mean time per in-process iteration, along with 95% confidence intervals (the number after ±): this gives you a better idea of the spread of values than the minimum and maximum times of any in-process iteration (though we report those too).
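To make that concrete, the heart of such a runner can be sketched in a few lines of Lua. This is only an illustrative sketch under my own naming (the real simplerunner.lua does more, e.g. launching each benchmark in a fresh sub-process and using a better clock):

```lua
-- Illustrative sketch only: time `bench_fn` over `iters` in-process
-- iterations, then report the mean with an approximate 95% confidence
-- interval, plus the minimum and maximum iteration times.
local function run_benchmark(bench_fn, iters)
  local times = {}
  for i = 1, iters do
    local start = os.clock()   -- a real runner would prefer a monotonic wall clock
    bench_fn()
    times[i] = os.clock() - start
    io.write(".") io.flush()
  end
  io.write("\n")

  -- Mean time per in-process iteration.
  local sum = 0
  for _, t in ipairs(times) do sum = sum + t end
  local mean = sum / iters

  -- Sample standard deviation and a normal-approximation 95% confidence interval.
  local sq = 0
  for _, t in ipairs(times) do sq = sq + (t - mean) ^ 2 end
  local sd = math.sqrt(sq / (iters - 1))
  local ci95 = 1.96 * sd / math.sqrt(iters)

  local min, max = math.huge, -math.huge
  for _, t in ipairs(times) do
    if t < min then min = t end
    if t > max then max = t end
  end
  print(string.format("Mean: %f +/- %f, min %f, max %f", mean, ci95, min, max))
end

-- Hypothetical usage: run_benchmark(some_benchmark_function, 30)
```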
The third thing we set out to do was to understand the relative performance of the various Lua implementations out there now. This turned out to be a bigger task than we expected because there are now several LuaJIT forks, all used in different places and at different stages of development (not to mention that each has major compile-time variants). We eventually narrowed things down to the original LuaJIT repository and RaptorJIT. We then ran an experiment (based on a slightly extended version of the methodology from our VM warmup paper), with 15 “process executions” (i.e. separate, new VM processes) and 1500 “in-process iterations” (i.e. the benchmark in a for loop within one VM process). Here are the benchmark results for the original version of LuaJIT:
Results for LuaJIT
Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s) |
---|---|---|---|---|
array3d | | 2.0 (2.0, 624.3) | 0.042 (0.040, 80.206) | 0.12863 ±0.000558 |
binarytrees | | | | 0.12564 ±0.000532 |
bounce | | | | 0.12795 ±0.000272 |
capnproto_decode | (11, 4) | 2.0 (1.0, 45.3) | 0.132 (0.000, 5.999) | 0.13458 ±0.028466 |
capnproto_encode | (14, 1) | 155.0 (52.8, 280.6) | 34.137 (11.476, 57.203) | 0.21698 ±0.014541 |
collisiondetector | (12, 2, 1) | | | |
coroutine_ring | | | | 0.10667 ±0.001527 |
deltablue | (10, 5) | 84.0 (1.0, 1022.9) | 8.743 (0.000, 106.802) | 0.10328 ±0.003195 |
euler14 | | 60.0 (60.0, 83.0) | 5.537 (5.483, 7.680) | 0.09180 ±0.000742 |
fannkuch_redux | | | | 0.12093 ±0.001502 |
fasta | | | | 0.12099 ±0.000376 |
havlak | (9, 4, 2) | | | |
heapsort | | | | 1.01917 ±0.015674 |
jsonlua_decode | | | | 0.11279 ±0.012664 |
jsonlua_encode | | | | 0.12798 ±0.001761 |
knucleotide | | | | 0.11662 ±0.000810 |
life | (12, 3) | | | |
luacheck | | | | 1.00901 ±0.089779 |
luacheck_parser | (13, 2) | 244.0 (1.0, 652.2) | 33.998 (0.000, 90.759) | 0.09434 ±0.012888 |
luafun | | 54.0 (12.4, 70.6) | 9.015 (1.935, 11.587) | 0.16571 ±0.004918 |
mandelbrot | (11, 4) | 1.0 (1.0, 29.0) | 0.000 (0.000, 9.750) | 0.34443 ±0.000119 |
mandelbrot_bit | (9, 6) | | | |
md5 | | | | 0.11279 ±0.000040 |
meteor | | 16.0 (2.0, 18.0) | 3.398 (0.284, 3.840) | 0.21935 ±0.003935 |
moonscript | | 28.0 (13.1, 423.3) | 4.468 (2.039, 68.212) | 0.16175 ±0.001569 |
nbody | | | | 0.16024 ±0.002790 |
nsieve | | 2.0 (2.0, 2.0) | 0.189 (0.188, 0.189) | 0.17904 ±0.000641 |
nsieve_bit | | 4.0 (3.4, 5.3) | 0.272 (0.219, 0.386) | 0.08758 ±0.000054 |
partialsums | | 2.0 (2.0, 2.0) | 0.160 (0.160, 0.163) | 0.14802 ±0.002044 |
pidigits | (11, 4) | 1.0 (1.0, 2.3) | 0.000 (0.000, 0.174) | 0.12689 ±0.002132 |
queens | (14, 1) | 1.0 (1.0, 294.4) | 0.000 (0.000, 35.052) | 0.11838 ±0.000751 |
quicksort | (8, 7) | 3.0 (2.0, 4.0) | 0.600 (0.315, 0.957) | 0.31117 ±0.067395 |
radixsort | | | | 0.12732 ±0.000403 |
ray | (11, 4) | 1.0 (1.0, 355.0) | 0.000 (0.000, 110.833) | 0.30961 ±0.003990 |
recursive_ack | | | | 0.11975 ±0.000653 |
recursive_fib | | | | 0.23064 ±0.028968 |
resty_json | (14, 1) | 1.0 (1.0, 250.3) | 0.000 (0.000, 20.009) | 0.07336 ±0.002629 |
revcomp | | | | 0.11403 ±0.001754 |
richards | (8, 7) | 2.0 (1.0, 2.0) | 0.133 (0.000, 0.152) | 0.13625 ±0.010223 |
scimark_fft | | 2.0 (2.0, 4.7) | 0.140 (0.140, 0.483) | 0.12653 ±0.000823 |
scimark_lu | | | | 0.11547 ±0.000308 |
scimark_sor | | | | 0.12108 ±0.000053 |
scimark_sparse | | | | 0.12342 ±0.000585 |
series | | 2.0 (2.0, 2.3) | 0.347 (0.347, 0.451) | 0.33400 ±0.003217 |
spectralnorm | | | | 0.13987 ±0.000001 |
table_cmpsort | (13, 2) | 10.0 (1.0, 10.0) | 1.984 (0.000, 1.989) | 0.22174 ±0.007836 |
There’s a lot more data here than you’d see in traditional benchmarking methodologies (which only show you an approximation of the “Steady performance (s)” column), so let me give a quick rundown. The “Classification” column tells us whether the 15 process executions for a benchmark all warmed up (good), were all flat (good), all slowed down (bad), were all inconsistent (bad), or were some combination of these (if you want to see examples of each of these types, have a look here). “Steady iteration (#)” tells us how many in-process iterations were executed before a steady state was hit (with 5%/95% inter-quartile ranges); “Steady iteration (s)” tells us how many seconds it took before a steady state was hit. Finally, the “Steady performance (s)” column tells us the performance of each in-process iteration once the steady state was reached (with 99% confidence intervals). For all numeric columns, lower numbers are better.
Here are the benchmark results for RaptorJIT:
Results for RaptorJIT
Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s) |
---|---|---|---|---|
array3d | (12, 3) | 1.0 (1.0, 76.0) | 0.000 (0.000, 9.755) | 0.13026 ±0.000216 |
binarytrees | | 24.0 (24.0, 24.0) | 2.792 (2.786, 2.810) | 0.11960 ±0.000762 |
bounce | | | | 0.13865 ±0.000978 |
capnproto_encode | | | | 0.11818 ±0.002599 |
collisiondetector | | 2.0 (2.0, 2.0) | 0.167 (0.167, 0.169) | 0.11583 ±0.001498 |
coroutine_ring | | | | 0.14645 ±0.000752 |
deltablue | | | | 0.10658 ±0.001063 |
euler14 | (12, 3) | 1.0 (1.0, 51.4) | 0.000 (0.000, 5.655) | 0.11195 ±0.000093 |
fannkuch_redux | | | | 0.12437 ±0.000029 |
fasta | | | | 0.11967 ±0.000313 |
havlak | | | | 0.21013 ±0.002469 |
heapsort | | | | 1.39055 ±0.002386 |
jsonlua_decode | | | | 0.13994 ±0.001207 |
jsonlua_encode | | | | 0.13581 ±0.001411 |
knucleotide | | | | 0.13035 ±0.000445 |
life | | | | 0.28412 ±0.000599 |
luacheck | | | | 0.99735 ±0.006095 |
luacheck_parser | | | | 0.07745 ±0.002296 |
luafun | | 28.0 (28.0, 28.0) | 4.879 (4.861, 4.904) | 0.17864 ±0.001222 |
mandelbrot | | | | 0.34166 ±0.000067 |
mandelbrot_bit | | | | 0.21577 ±0.000024 |
md5 | | | | 0.09548 ±0.000037 |
meteor | | 2.0 (2.0, 3.0) | 0.273 (0.269, 0.493) | 0.21464 ±0.002170 |
nbody | (14, 1) | 1.0 (1.0, 1.9) | 0.000 (0.000, 0.160) | 0.17695 ±0.002226 |
nsieve | | 2.0 (2.0, 2.6) | 0.180 (0.179, 0.282) | 0.16982 ±0.000862 |
nsieve_bit | | 4.0 (3.7, 5.0) | 0.273 (0.247, 0.361) | 0.08780 ±0.000233 |
partialsums | | 2.0 (2.0, 2.3) | 0.161 (0.160, 0.207) | 0.14860 ±0.001611 |
pidigits | (8, 7) | 5.0 (1.0, 6.0) | 0.516 (0.000, 0.646) | 0.12766 ±0.000032 |
queens | (14, 1) | 2.0 (1.7, 2.0) | 0.162 (0.113, 0.162) | 0.15853 ±0.000231 |
quicksort | | 2.0 (2.0, 2.3) | 0.278 (0.278, 0.361) | 0.27183 ±0.000469 |
radixsort | | | | 0.12621 ±0.000757 |
ray | | | | 0.35530 ±0.000984 |
recursive_ack | (14, 1) | 1.0 (1.0, 19.0) | 0.000 (0.000, 2.562) | 0.14228 ±0.000616 |
recursive_fib | | | | 0.28989 ±0.000033 |
resty_json | | | | 0.07534 ±0.000595 |
revcomp | | | | 0.11684 ±0.002139 |
richards | | 2.0 (2.0, 3.2) | 0.171 (0.170, 0.369) | 0.16559 ±0.000342 |
scimark_fft | | 2.0 (2.0, 10.3) | 0.141 (0.141, 1.195) | 0.12709 ±0.000102 |
scimark_lu | | | | 0.12733 ±0.000159 |
scimark_sor | | | | 0.13297 ±0.000005 |
scimark_sparse | | | | 0.13082 ±0.000490 |
series | | 2.0 (2.0, 2.0) | 0.347 (0.347, 0.348) | 0.33390 ±0.000869 |
spectralnorm | | | | 0.13989 ±0.000003 |
table_cmpsort | | 10.0 (10.0, 10.0) | 1.945 (1.935, 1.967) | 0.22008 ±0.001852 |
We quickly found it difficult to compare so many numbers at once, so as part of this project we built a stats differ that can compare one set of benchmark results with another. Here's the result of comparing the original version of LuaJIT with RaptorJIT:
Results for Normal vs. RaptorJIT
Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup. Diff against previous results: improved worsened different unchanged.
Benchmark | Classification | Steady iteration (#) | Steady iteration variation | Steady iteration (s) | Steady performance (s) | Steady performance variation (s) |
---|---|---|---|---|---|---|
array3d | (12, 3) | 1.0 (1.0, 76.0) | (1.0, 76.0) was: (2.0, 624.3) | 0.000 (0.000, 9.755) | 0.13026 δ=0.00163 ±0.000215 | 0.000215 was: 0.000557 |
binarytrees | | 24.0 (24.0, 24.0) | | 2.792 (2.786, 2.810) | 0.11960 δ=-0.00603 ±0.000762 | |
bounce | | | | | 0.13865 δ=0.01070 ±0.000978 | |
capnproto_encode | | | | | 0.11818 δ=-0.09880 ±0.002599 | |
collisiondetector | | 2.0 (2.0, 2.0) | | 0.167 (0.167, 0.169) | 0.11583 ±0.001498 | |
coroutine_ring | | | | | 0.14645 δ=0.03978 ±0.000751 | |
deltablue | | | | | 0.10658 ±0.001063 | 0.001063 was: 0.003195 |
euler14 | (12, 3) | 1.0 δ=-59.0 (1.0, 51.4) | (1.0, 51.4) was: (60.0, 83.0) | 0.000 δ=-5.537 (0.000, 5.655) | 0.11195 δ=0.02015 ±0.000093 | 0.000093 was: 0.000743 |
fannkuch_redux | | | | | 0.12437 δ=0.00344 ±0.000029 | |
fasta | | | | | 0.11967 δ=-0.00132 ±0.000313 | |
havlak | | | | | 0.21013 ±0.002442 | |
heapsort | | | | | 1.39055 δ=0.37138 ±0.002379 | |
jsonlua_decode | | | | | 0.13994 δ=0.02715 ±0.001207 | |
jsonlua_encode | | | | | 0.13581 δ=0.00783 ±0.001409 | |
knucleotide | | | | | 0.13035 δ=0.01373 ±0.000446 | |
life | | | | | 0.28412 ±0.000599 | |
luacheck | | | | | 0.99735 ±0.006094 | 0.006094 was: 0.089779 |
luacheck_parser | | | | | 0.07745 δ=-0.01688 ±0.002281 | |
luafun | | 28.0 (28.0, 28.0) | | 4.879 (4.861, 4.904) | 0.17864 δ=0.01293 ±0.001222 | 0.001222 was: 0.004918 |
mandelbrot | | | | | 0.34166 δ=-0.00278 ±0.000067 | |
mandelbrot_bit | | | | | 0.21577 ±0.000024 | |
md5 | | | | | 0.09548 δ=-0.01731 ±0.000037 | |
meteor | | 2.0 (2.0, 3.0) | (2.0, 3.0) was: (2.0, 18.0) | 0.273 (0.269, 0.493) | 0.21464 ±0.002170 | 0.002170 was: 0.003935 |
nbody | (14, 1) | 1.0 (1.0, 1.9) | | 0.000 (0.000, 0.160) | 0.17695 δ=0.01671 ±0.002226 | |
nsieve | | 2.0 (2.0, 2.6) | (2.0, 2.6) was: (2.0, 2.0) | 0.180 (0.179, 0.282) | 0.16982 δ=-0.00922 ±0.000862 | 0.000862 was: 0.000640 |
nsieve_bit | | 4.0 (3.7, 5.0) | (3.7, 5.0) was: (3.4, 5.3) | 0.273 (0.247, 0.361) | 0.08780 ±0.000233 | 0.000233 was: 0.000054 |
partialsums | | 2.0 (2.0, 2.3) | (2.0, 2.3) was: (2.0, 2.0) | 0.161 (0.160, 0.207) | 0.14860 ±0.001611 | 0.001611 was: 0.002044 |
pidigits | (8, 7) | 5.0 (1.0, 6.0) | (1.0, 6.0) was: (1.0, 2.3) | 0.516 (0.000, 0.646) | 0.12766 ±0.000032 | 0.000032 was: 0.002132 |
queens | (14, 1) | 2.0 (1.7, 2.0) | (1.7, 2.0) was: (1.0, 294.4) | 0.162 (0.113, 0.162) | 0.15853 δ=0.04015 ±0.000231 | 0.000231 was: 0.000751 |
quicksort | | 2.0 (2.0, 2.3) | (2.0, 2.3) was: (2.0, 4.0) | 0.278 (0.278, 0.361) | 0.27183 ±0.000469 | 0.000469 was: 0.067395 |
radixsort | | | | | 0.12621 ±0.000757 | 0.000757 was: 0.000403 |
ray | | | | | 0.35530 δ=0.04568 ±0.000983 | |
recursive_ack | (14, 1) | 1.0 (1.0, 19.0) | | 0.000 (0.000, 2.562) | 0.14228 δ=0.02253 ±0.000616 | |
recursive_fib | | | | | 0.28989 δ=0.05925 ±0.000033 | |
resty_json | | | | | 0.07534 ±0.000595 | 0.000595 was: 0.002629 |
revcomp | | | | | 0.11684 ±0.002139 | 0.002139 was: 0.001754 |
richards | | 2.0 (2.0, 3.2) | (2.0, 3.2) was: (1.0, 2.0) | 0.171 (0.170, 0.369) | 0.16559 δ=0.02935 ±0.000342 | 0.000342 was: 0.010223 |
scimark_fft | | 2.0 (2.0, 10.3) | (2.0, 10.3) was: (2.0, 4.7) | 0.141 (0.141, 1.195) | 0.12709 ±0.000102 | 0.000102 was: 0.000823 |
scimark_lu | | | | | 0.12733 δ=0.01186 ±0.000159 | |
scimark_sor | | | | | 0.13297 δ=0.01189 ±0.000005 | |
scimark_sparse | | | | | 0.13082 δ=0.00740 ±0.000490 | |
series | | 2.0 (2.0, 2.0) | | 0.347 (0.347, 0.348) | 0.33390 ±0.000869 | 0.000869 was: 0.003217 |
spectralnorm | | | | | 0.13989 δ=0.00002 ±0.000003 | |
table_cmpsort | | 10.0 (10.0, 10.0) | | 1.945 (1.935, 1.967) | 0.22008 ±0.001852 | 0.001852 was: 0.007836 |
In essence, green cells mean that RaptorJIT is better than LuaJIT; red cells mean that LuaJIT is better than RaptorJIT; yellow means they're different in a way that can't be directly compared; and white/grey means they're statistically equivalent. The additional “Steady performance variation (s)” column shows whether the steady state performance across different process executions has become more or less predictable.
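As a rough sketch of the kind of test involved (the real differ is more sophisticated, and also diffs classifications and warmup data; the function and field names below are purely illustrative), deciding whether a steady-state time has improved or worsened can be thought of as a confidence-interval overlap check:

```lua
-- Illustrative sketch: classify a benchmark's steady-state time as improved,
-- worsened, or statistically equivalent, by checking whether the 99%
-- confidence intervals of the two result sets overlap.
-- `old` and `new` are tables of the form {mean = <seconds>, ci = <seconds>}.
local function diff_steady_perf(old, new)
  local old_lo, old_hi = old.mean - old.ci, old.mean + old.ci
  local new_lo, new_hi = new.mean - new.ci, new.mean + new.ci
  if new_hi < old_lo then
    return "improved"     -- green: the new interval lies wholly below the old one
  elseif new_lo > old_hi then
    return "worsened"     -- red: the new interval lies wholly above the old one
  else
    return "unchanged"    -- white/grey: the intervals overlap
  end
end

-- e.g. capnproto_encode, LuaJIT vs. RaptorJIT, using the numbers from the tables above:
print(diff_steady_perf({mean = 0.21698, ci = 0.014541},
                       {mean = 0.11818, ci = 0.002599}))  --> improved
```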
The simple conclusion to draw from this is that there isn't a simple conclusion to draw from it: the two VMs are sometimes better than each other with no clear pattern. Without having a clear steer either way, we therefore decided to use the original version of LuaJIT as our base.
One of the things that became very clear from our benchmarking is that LuaJIT is highly non-deterministic – indeed, it's the most non-deterministic VM I've seen. The practical effect of this is that even on one program, LuaJIT is sometimes very fast, and sometimes rather slow. This is, at best, very confusing for users who tend to assume that programs perform more-or-less the same every time they're run; at worst, it can create significant problems when one is trying to estimate things like server provisioning. We therefore tried various things to make performance more consistent.
The most promising approach we alighted upon is what we ended up calling “separate counters”. In a tracing JIT compiler such as LuaJIT, one tracks how often each loop has been executed (where “loops” include both obvious things like for loops and less obvious things such as functions): once a loop's counter hits a certain threshold, the loop is traced and compiled into machine code. LuaJIT has an unusual approach to counting loops: it has 64 counters to which all loops are mapped (using the memory address of the bytecode in question). In other words, multiple loops share the same counter: the bigger the program, the more loops share each counter. The advantage of this is that the counter table is memory efficient, and for small programs (e.g. the common LuaJIT benchmarks) it can be highly effective. However, it has very odd effects in real programs, particularly as programs get bigger: loops are compiled non-deterministically, based on the particular address in memory they happen to have been loaded at.
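A much-simplified Lua model of the shared-counter idea is shown below (the constants and names are illustrative, not LuaJIT's actual C implementation): because a loop's counter slot depends only on the address its bytecode happens to occupy, unrelated loops can collide on a slot, and which loops collide depends purely on where the code was loaded in memory.

```lua
-- Illustrative model of hot-counter sharing; HOT_SLOTS and HOT_THRESHOLD are
-- made-up values, not LuaJIT's exact configuration.
local HOT_SLOTS, HOT_THRESHOLD = 64, 56

local counters = {}
for i = 0, HOT_SLOTS - 1 do counters[i] = 0 end

-- `addr` stands in for the memory address of a loop's bytecode.
local function slot_for(addr)
  return math.floor(addr / 4) % HOT_SLOTS
end

-- Called (conceptually) every time a loop iterates; returns true when the
-- shared counter for that loop's slot crosses the tracing threshold.
local function loop_hit(addr)
  local slot = slot_for(addr)
  counters[slot] = counters[slot] + 1
  if counters[slot] >= HOT_THRESHOLD then
    counters[slot] = 0
    return true   -- at this point the loop would be traced and compiled
  end
  return false
end

-- Two unrelated loops whose bytecode addresses happen to collide share a
-- counter, so each looks "hot" sooner than it would with a private counter:
print(slot_for(0x1040) == slot_for(0x2040))  --> true
```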
We therefore altered LuaJIT so that each loop and each function has its own counter, stored in the bytecode to make memory reads/writes more cache friendly. The diff from normal LuaJIT to the separate counters version is as follows:
Results for Normal vs. Counters
Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup. Diff against previous results: improved worsened different unchanged.
Benchmark | Classification | Steady iteration (#) | Steady iteration variation | Steady iteration (s) | Steady performance (s) | Steady performance variation (s) |
---|---|---|---|---|---|---|
array3d | | | | | | |
binarytrees | | | | | 0.12462 ±0.004058 | 0.004058 was: 0.000532 |
bounce | (14, 1) | 1.0 (1.0, 5.8) | | 0.000 (0.000, 0.603) | 0.12515 δ=-0.00280 ±0.000278 | |
capnproto_decode | (9, 6) | 1.0 (1.0, 24.9) | (1.0, 24.9) was: (1.0, 45.3) | 0.000 (0.000, 3.692) | 0.15042 ±0.003797 | 0.003797 was: 0.028466 |
capnproto_encode | | 230.0 (56.0, 467.6) | (56.0, 467.6) was: (52.8, 280.6) | 28.411 (6.667, 55.951) | 0.11838 δ=-0.09860 ±0.001960 | 0.001960 was: 0.014541 |
collisiondetector | (13, 2) | | | | | |
coroutine_ring | | | | | 0.10680 ±0.003151 | 0.003151 was: 0.001527 |
deltablue | | 149.0 (149.0, 274.5) | (149.0, 274.5) was: (1.0, 1022.9) | 15.561 (15.430, 28.653) | 0.10159 ±0.001083 | 0.001083 was: 0.003195 |
euler14 | | 61.0 (61.0, 68.3) | (61.0, 68.3) was: (60.0, 83.0) | 5.650 (5.592, 6.356) | 0.09216 ±0.000159 | 0.000159 was: 0.000743 |
fannkuch_redux | | | | | 0.11976 ±0.000012 | 0.000012 was: 0.001502 |
fasta | | | | | 0.12200 δ=0.00100 ±0.000597 | |
havlak | | | | | | |
heapsort | | | | | 1.04378 δ=0.02461 ±0.000789 | |
jsonlua_decode | | | | | 0.12648 δ=0.01370 ±0.000556 | |
jsonlua_encode | | | | | 0.12860 ±0.000879 | 0.000879 was: 0.001761 |
knucleotide | | | | | 0.11710 ±0.000541 | 0.000541 was: 0.000811 |
life | (9, 3, 2, 1) | | | | | |
luacheck | | | | | 1.00299 ±0.004778 | 0.004778 was: 0.089781 |
luacheck_parser | (12, 2, 1) | | | | | |
luafun | | 69.0 (69.0, 69.0) | | 11.481 (11.331, 11.522) | 0.16770 ±0.001564 | 0.001564 was: 0.004918 |
mandelbrot | (14, 1) | | | | | |
mandelbrot_bit | | | | | 0.21695 ±0.000142 | |
md5 | | | | | 0.11155 δ=-0.00124 ±0.000043 | |
meteor | (13, 2) | 14.0 (1.0, 15.0) | (1.0, 15.0) was: (2.0, 18.0) | 2.855 (0.000, 3.045) | 0.21606 ±0.004651 | 0.004651 was: 0.003935 |
moonscript | | 63.0 (17.7, 184.1) | (17.7, 184.1) was: (13.1, 423.3) | 10.046 (2.763, 29.739) | 0.15999 ±0.001405 | 0.001405 was: 0.001568 |
nbody | | | | | 0.15898 ±0.001676 | 0.001676 was: 0.002790 |
nsieve | | 2.0 (2.0, 2.6) | (2.0, 2.6) was: (2.0, 2.0) | 0.189 (0.188, 0.297) | 0.17875 ±0.001266 | 0.001266 was: 0.000641 |
nsieve_bit | | 4.0 (2.0, 6.0) | (2.0, 6.0) was: (3.4, 5.3) | 0.271 (0.097, 0.446) | 0.08726 δ=-0.00032 ±0.000202 | 0.000202 was: 0.000054 |
partialsums | | 2.0 (2.0, 2.9) | (2.0, 2.9) was: (2.0, 2.0) | 0.161 (0.161, 0.295) | 0.14916 ±0.000081 | 0.000081 was: 0.002044 |
pidigits | | 2.0 (2.0, 4.3) | (2.0, 4.3) was: (1.0, 2.3) | 0.130 (0.130, 0.425) | 0.12666 ±0.000122 | 0.000122 was: 0.002133 |
queens | (10, 5) | 1.0 (1.0, 2.0) | (1.0, 2.0) was: (1.0, 294.4) | 0.000 (0.000, 0.127) | 0.12484 δ=0.00646 ±0.000317 | 0.000317 was: 0.000751 |
quicksort | | 2.0 (2.0, 2.0) | | 0.299 (0.298, 0.304) | 0.44880 δ=0.13763 ±0.020477 | 0.020477 was: 0.067395 |
radixsort | | | | | 0.12644 ±0.000864 | 0.000864 was: 0.000403 |
ray | | | | | 0.30901 ±0.002140 | 0.002140 was: 0.004022 |
recursive_ack | | | | | 0.11958 ±0.000510 | 0.000510 was: 0.000653 |
recursive_fib | | | | | 0.22864 ±0.000266 | 0.000266 was: 0.028968 |
resty_json | (12, 2, 1) | | | | | |
revcomp | | | | | 0.11550 ±0.002553 | 0.002553 was: 0.001753 |
richards | (14, 1) | 2.0 (1.7, 2.0) | (1.7, 2.0) was: (1.0, 2.0) | 0.150 (0.105, 0.150) | 0.14572 ±0.000324 | 0.000324 was: 0.010223 |
scimark_fft | | 2.0 (2.0, 10.0) | (2.0, 10.0) was: (2.0, 4.7) | 0.140 (0.140, 1.153) | 0.12639 ±0.000343 | 0.000343 was: 0.000823 |
scimark_lu | (11, 4) | 1.0 (1.0, 45.3) | | 0.000 (0.000, 5.122) | 0.11546 ±0.000132 | 0.000132 was: 0.000308 |
scimark_sor | | | | | 0.12105 ±0.000148 | |
scimark_sparse | | | | | 0.12315 ±0.000728 | 0.000728 was: 0.000585 |
series | | 2.0 (2.0, 2.0) | | 0.347 (0.347, 0.348) | 0.33394 ±0.000645 | 0.000645 was: 0.003217 |
spectralnorm | | | | | 0.13985 δ=-0.00003 ±0.000007 | |
table_cmpsort | (13, 1, 1) | 1.0 (1.0, 10.0) | | 0.000 (0.000, 2.005) | 0.21828 ±0.003289 | 0.003289 was: 0.007836 |
In this case we’re particularly interested in the “Steady performance variation (s)” column, which shows whether benchmarks have predictable steady state performance. The results are fairly clear: separate counters are, overall, a clear improvement. As you might expect, this is not a pure win, because it changes the order in which traces are made. This has several effects, including causing some loops to be traced later than was previously the case, because a counter that is no longer shared with other loops does not hit the required threshold as quickly.
This disadvantages some programs, particularly small, deterministic benchmarks whose loops are highly stable: in such cases, the earlier you trace the better. However, in my opinion, such programs are given undue weight when performance is considered. It’s no secret that some of the benchmarks regularly used to benchmark LuaJIT are highly optimised for LuaJIT as it stands; any changes to LuaJIT stand a good chance of degrading their performance. Overall, though, we feel that the gain in consistency, particularly for larger programs, is worth it. There's a pull request that applies this idea to the Lua Foundation's fork of LuaJIT.
We then started looking at various programs that showed odd performance. One problem in particular showed up in more than one benchmark. Here's a standard example:
Collisiondetector, Normal, Bencher9, Proc. exec. #12 (no steady state)
The problem – and it doesn't happen on every process execution, just to make it more fun – is that there are points where the benchmark slows down by over 10% for multiple in-process iterations (e.g. in this process execution, at in-process iterations 930-ish and 1050-ish). We tried over 25 separate ways to work out what was causing this — even building an instrumentation system to track what LuaJIT is doing — but in the end it turned out to be related to LuaJIT's Garbage Collector – sort of. When we moved from the 32-bit to 64-bit GC, the odd performance went away.
As such, we don’t think that the 64-bit GC “solves” the problem: however, it changes the way that pointers are encoded (doubling in size), which causes the code generator to emit a different style of code, such that the problem seems to go away. Nevertheless, this did make us reevaluate LuaJIT's GC. Tom then started work on implementing Mike Pall's suggestion for a new GC for LuaJIT (based partly on Tom's previous work and also that of Peter Cawley). He has enough implemented to run most small, and some large, programs, but it needs more work to finish it off, at which point evaluating it against the existing Lua GCs will be fascinating!
So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we've been able to make a lot of programs (particularly large programs) have more consistent performance; and we've got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!
Acknowledgements: Sarah Mount implemented the stats differ; Edd Barrett implemented Krun and answered many questions on it.