
More consistent LuaJIT performance

by Guest Author.

This is a guest post by Laurence Tratt, who is a programmer and Reader in Software Development in the Department of Informatics at King's College London where he leads the Software Development Team. He is also an EPSRC Fellow.

A year ago I wrote about a project that Cloudflare were funding at King's College London to help improve LuaJIT. Our twelve months is now up. How did we do?

The first thing that happened is that I was lucky to employ a LuaJIT expert, Thomas Fransham, to work on the project. His deep knowledge about LuaJIT was crucial to getting things up and running – 12 months might sound like a long time, but it soon whizzes by!

The second thing that happened was that we realised that the current state of Lua benchmarking was not good enough for anyone to reliably tell if they'd improved LuaJIT performance or not. Different Lua implementations had different benchmark suites, mostly on the small side, and not easily compared. Although it wasn't part of our original plan, we thus put a lot of effort into creating a larger benchmark suite. This sounds like a trivial job, but it isn't. Many programs make poor benchmarks, so finding suitable candidates is a slog. Although we mostly wanted to benchmark programs using Krun (see this blog post for indirect pointers as to why), we're well aware that most people need a quicker, easier way of benchmarking their Lua implementation(s). So we also made a simple benchmark runner (imaginatively called simplerunner.lua) that does that job. Here's an example of it in use:

$ lua simplerunner.lua
Running luacheck: ..............................
  Mean: 1.120762 +/- 0.030216, min 1.004843, max 1.088270
Running fannkuch_redux: ..............................
  Mean: 0.128499 +/- 0.003281, min 0.119500, max 0.119847

Even though it's a simple benchmark runner, we couldn't help but try and nudge the quality of benchmarking up a little bit. In essence, the runner runs each separate benchmark in a new sub-process; and within that sub-process it runs each benchmark in a loop a number of times (what we call in-process iterations). Thus for each benchmark you get a mean time per in-process iteration, and then 95% confidence intervals (the number after +/-): this gives you a better idea of the spread of values than the minimum and maximum times for any in-process iterations (though we report those too).
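To make the summary line concrete, here is a minimal Python sketch of how such a line could be computed from one sub-process's timings. It is not taken from simplerunner.lua: the function name `summarise`, the sample timings, and the use of a normal approximation for the 95% confidence interval are all assumptions made for illustration.

```python
import math
import statistics


def summarise(times, z=1.96):
    """Return (mean, ci_half_width, minimum, maximum) for a list of timings.

    The half-width uses a normal approximation (z * s / sqrt(n)). That is an
    assumption of this sketch, not necessarily what simplerunner.lua does.
    """
    n = len(times)
    mean = statistics.fmean(times)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(times) if n > 1 else 0.0
    half_width = z * stdev / math.sqrt(n)
    return mean, half_width, min(times), max(times)


# Hypothetical in-process iteration times (seconds) from one sub-process.
times = [0.120, 0.119, 0.121, 0.120, 0.122, 0.118]
mean, ci, lo, hi = summarise(times)
print(f"  Mean: {mean:.6f} +/- {ci:.6f}, min {lo:.6f}, max {hi:.6f}")
```

The confidence interval shrinks as more in-process iterations are collected, whereas the minimum and maximum can only move further apart, which is why the interval is the more useful indicator of spread.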

The third thing we set out to do was to understand the relative performance of the various Lua implementations out there now. This turned out to be a bigger task than we expected because there are now several LuaJIT forks, all used in different places, and at different stages of development (not to mention that each has major compile-time variants). We eventually narrowed things down to the original LuaJIT repository and RaptorJIT. We then ran an experiment (based on a slightly extended version of the methodology from our VM warmup paper), with 15 “process executions” (i.e. separate, new VM processes) and 1500 “in-process iterations” (i.e. the benchmark in a for loop within one VM process). Here are the benchmark results for the original version of LuaJIT:

Results for LuaJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.

| Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s) |
| --- | --- | --- | --- | --- |
| array3d | slowdown | 2.0 (2.0, 624.3) | 0.042 (0.040, 80.206) | 0.12863 ±0.000558 |
| binarytrees | flat | | | 0.12564 ±0.000532 |
| bounce | flat | | | 0.12795 ±0.000272 |
| capnproto_decode | good inconsistent (11 warmup, 4 flat) | 2.0 (1.0, 45.3) | 0.132 (0.000, 5.999) | 0.13458 ±0.028466 |
| capnproto_encode | good inconsistent (14 warmup, 1 flat) | 155.0 (52.8, 280.6) | 34.137 (11.476, 57.203) | 0.21698 ±0.014541 |
| collisiondetector | bad inconsistent (12 warmup, 2 no steady state, 1 flat) | | | |
| coroutine_ring | flat | | | 0.10667 ±0.001527 |
| deltablue | good inconsistent (10 warmup, 5 flat) | 84.0 (1.0, 1022.9) | 8.743 (0.000, 106.802) | 0.10328 ±0.003195 |
| euler14 | warmup | 60.0 (60.0, 83.0) | 5.537 (5.483, 7.680) | 0.09180 ±0.000742 |
| fannkuch_redux | flat | | | 0.12093 ±0.001502 |
| fasta | flat | | | 0.12099 ±0.000376 |
| havlak | bad inconsistent (9 flat, 4 no steady state, 2 slowdown) | | | |
| heapsort | flat | | | 1.01917 ±0.015674 |
| jsonlua_decode | flat | | | 0.11279 ±0.012664 |
| jsonlua_encode | flat | | | 0.12798 ±0.001761 |
| knucleotide | flat | | | 0.11662 ±0.000810 |
| life | bad inconsistent (12 no steady state, 3 flat) | | | |
| luacheck | flat | | | 1.00901 ±0.089779 |
| luacheck_parser | good inconsistent (13 warmup, 2 flat) | 244.0 (1.0, 652.2) | 33.998 (0.000, 90.759) | 0.09434 ±0.012888 |
| luafun | warmup | 54.0 (12.4, 70.6) | 9.015 (1.935, 11.587) | 0.16571 ±0.004918 |
| mandelbrot | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 29.0) | 0.000 (0.000, 9.750) | 0.34443 ±0.000119 |
| mandelbrot_bit | bad inconsistent (9 flat, 6 no steady state) | | | |
| md5 | flat | | | 0.11279 ±0.000040 |
| meteor | warmup | 16.0 (2.0, 18.0) | 3.398 (0.284, 3.840) | 0.21935 ±0.003935 |
| moonscript | warmup | 28.0 (13.1, 423.3) | 4.468 (2.039, 68.212) | 0.16175 ±0.001569 |
| nbody | flat | | | 0.16024 ±0.002790 |
| nsieve | warmup | 2.0 (2.0, 2.0) | 0.189 (0.188, 0.189) | 0.17904 ±0.000641 |
| nsieve_bit | warmup | 4.0 (3.4, 5.3) | 0.272 (0.219, 0.386) | 0.08758 ±0.000054 |
| partialsums | warmup | 2.0 (2.0, 2.0) | 0.160 (0.160, 0.163) | 0.14802 ±0.002044 |
| pidigits | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 2.3) | 0.000 (0.000, 0.174) | 0.12689 ±0.002132 |
| queens | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 294.4) | 0.000 (0.000, 35.052) | 0.11838 ±0.000751 |
| quicksort | bad inconsistent (8 warmup, 7 slowdown) | 3.0 (2.0, 4.0) | 0.600 (0.315, 0.957) | 0.31117 ±0.067395 |
| radixsort | flat | | | 0.12732 ±0.000403 |
| ray | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 355.0) | 0.000 (0.000, 110.833) | 0.30961 ±0.003990 |
| recursive_ack | flat | | | 0.11975 ±0.000653 |
| recursive_fib | flat | | | 0.23064 ±0.028968 |
| resty_json | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 250.3) | 0.000 (0.000, 20.009) | 0.07336 ±0.002629 |
| revcomp | flat | | | 0.11403 ±0.001754 |
| richards | good inconsistent (8 warmup, 7 flat) | 2.0 (1.0, 2.0) | 0.133 (0.000, 0.152) | 0.13625 ±0.010223 |
| scimark_fft | warmup | 2.0 (2.0, 4.7) | 0.140 (0.140, 0.483) | 0.12653 ±0.000823 |
| scimark_lu | flat | | | 0.11547 ±0.000308 |
| scimark_sor | flat | | | 0.12108 ±0.000053 |
| scimark_sparse | flat | | | 0.12342 ±0.000585 |
| series | warmup | 2.0 (2.0, 2.3) | 0.347 (0.347, 0.451) | 0.33400 ±0.003217 |
| spectralnorm | flat | | | 0.13987 ±0.000001 |
| table_cmpsort | bad inconsistent (13 slowdown, 2 flat) | 10.0 (1.0, 10.0) | 1.984 (0.000, 1.989) | 0.22174 ±0.007836 |

There’s a lot more data here than you’d see in traditional benchmarking methodologies (which only show you an approximation of the “Steady performance (s)” column), so let me give a quick rundown. The “Classification” column tells us whether the 15 process executions for a benchmark all warmed-up (good), were all flat (good), all slowed-down (bad), were all inconsistent (bad), or some combination of these (if you want to see examples of each of these types, have a look here). “Steady iteration (#)” tells us how many in-process iterations were executed before a steady state was hit (with 5%/95% inter-quartile ranges); “Steady iteration (s)” tells us how many seconds it took before a steady state was hit. Finally, the “Steady performance (s)” column tells us the performance of each in-process iteration once the steady state was reached (with 99% confidence intervals). For all numeric columns, lower numbers are better.
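As a rough illustration of how the per-process classifications roll up into a single label, here is a small Python sketch. The rule it uses is inferred from the tables in this post, where “good inconsistent” only ever mixes warmup and flat while “bad inconsistent” always includes a slowdown or a no steady state; the function name `classify` and the constant `GOOD` are hypothetical, and this is not the code the experiment actually ran.

```python
from collections import Counter

# Classifications that are harmless when mixed together (inferred from the tables).
GOOD = {"flat", "warmup"}


def classify(process_executions):
    """Combine one classification per process execution into a benchmark label."""
    counts = Counter(process_executions)
    if len(counts) == 1:
        # Every process execution agreed, so the benchmark takes that label.
        return next(iter(counts))
    kind = "good inconsistent" if set(counts) <= GOOD else "bad inconsistent"
    # most_common() lists the most frequent classification first.
    breakdown = ", ".join(f"{n} {name}" for name, n in counts.most_common())
    return f"{kind} ({breakdown})"


print(classify(["flat"] * 15))
print(classify(["warmup"] * 11 + ["flat"] * 4))
print(classify(["warmup"] * 12 + ["no steady state"] * 2 + ["flat"]))
```

The three calls reproduce the style of label seen for benchmarks such as binarytrees, capnproto_decode and collisiondetector in the LuaJIT results.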

Here are the benchmark results for RaptorJIT:

Results for RaptorJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.

| Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s) |
| --- | --- | --- | --- | --- |
| array3d | bad inconsistent (12 flat, 3 slowdown) | 1.0 (1.0, 76.0) | 0.000 (0.000, 9.755) | 0.13026 ±0.000216 |
| binarytrees | warmup | 24.0 (24.0, 24.0) | 2.792 (2.786, 2.810) | 0.11960 ±0.000762 |
| bounce | flat | | | 0.13865 ±0.000978 |
| capnproto_encode | flat | | | 0.11818 ±0.002599 |
| collisiondetector | warmup | 2.0 (2.0, 2.0) | 0.167 (0.167, 0.169) | 0.11583 ±0.001498 |
| coroutine_ring | flat | | | 0.14645 ±0.000752 |
| deltablue | flat | | | 0.10658 ±0.001063 |
| euler14 | good inconsistent (12 flat, 3 warmup) | 1.0 (1.0, 51.4) | 0.000 (0.000, 5.655) | 0.11195 ±0.000093 |
| fannkuch_redux | flat | | | 0.12437 ±0.000029 |
| fasta | flat | | | 0.11967 ±0.000313 |
| havlak | flat | | | 0.21013 ±0.002469 |
| heapsort | flat | | | 1.39055 ±0.002386 |
| jsonlua_decode | flat | | | 0.13994 ±0.001207 |
| jsonlua_encode | flat | | | 0.13581 ±0.001411 |
| knucleotide | flat | | | 0.13035 ±0.000445 |
| life | flat | | | 0.28412 ±0.000599 |
| luacheck | flat | | | 0.99735 ±0.006095 |
| luacheck_parser | flat | | | 0.07745 ±0.002296 |
| luafun | warmup | 28.0 (28.0, 28.0) | 4.879 (4.861, 4.904) | 0.17864 ±0.001222 |
| mandelbrot | flat | | | 0.34166 ±0.000067 |
| mandelbrot_bit | flat | | | 0.21577 ±0.000024 |
| md5 | flat | | | 0.09548 ±0.000037 |
| meteor | warmup | 2.0 (2.0, 3.0) | 0.273 (0.269, 0.493) | 0.21464 ±0.002170 |
| nbody | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 1.9) | 0.000 (0.000, 0.160) | 0.17695 ±0.002226 |
| nsieve | warmup | 2.0 (2.0, 2.6) | 0.180 (0.179, 0.282) | 0.16982 ±0.000862 |
| nsieve_bit | warmup | 4.0 (3.7, 5.0) | 0.273 (0.247, 0.361) | 0.08780 ±0.000233 |
| partialsums | warmup | 2.0 (2.0, 2.3) | 0.161 (0.160, 0.207) | 0.14860 ±0.001611 |
| pidigits | good inconsistent (8 warmup, 7 flat) | 5.0 (1.0, 6.0) | 0.516 (0.000, 0.646) | 0.12766 ±0.000032 |
| queens | good inconsistent (14 warmup, 1 flat) | 2.0 (1.7, 2.0) | 0.162 (0.113, 0.162) | 0.15853 ±0.000231 |
| quicksort | warmup | 2.0 (2.0, 2.3) | 0.278 (0.278, 0.361) | 0.27183 ±0.000469 |
| radixsort | flat | | | 0.12621 ±0.000757 |
| ray | flat | | | 0.35530 ±0.000984 |
| recursive_ack | bad inconsistent (14 flat, 1 slowdown) | 1.0 (1.0, 19.0) | 0.000 (0.000, 2.562) | 0.14228 ±0.000616 |
| recursive_fib | flat | | | 0.28989 ±0.000033 |
| resty_json | flat | | | 0.07534 ±0.000595 |
| revcomp | flat | | | 0.11684 ±0.002139 |
| richards | warmup | 2.0 (2.0, 3.2) | 0.171 (0.170, 0.369) | 0.16559 ±0.000342 |
| scimark_fft | warmup | 2.0 (2.0, 10.3) | 0.141 (0.141, 1.195) | 0.12709 ±0.000102 |
| scimark_lu | flat | | | 0.12733 ±0.000159 |
| scimark_sor | flat | | | 0.13297 ±0.000005 |
| scimark_sparse | flat | | | 0.13082 ±0.000490 |
| series | warmup | 2.0 (2.0, 2.0) | 0.347 (0.347, 0.348) | 0.33390 ±0.000869 |
| spectralnorm | flat | | | 0.13989 ±0.000003 |
| table_cmpsort | slowdown | 10.0 (10.0, 10.0) | 1.945 (1.935, 1.967) | 0.22008 ±0.001852 |

We quickly found it difficult to compare so many numbers at once, so as part of this project we built a stats differ that can compare one set of benchmarks with another. Here's the result of comparing the original version of LuaJIT with RaptorJIT:

Results for Normal vs. RaptorJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
Diff against previous results: improved, worsened, different, unchanged.

| Benchmark | Classification | Steady iteration (#) | Steady iteration variation | Steady iteration (s) | Steady performance (s) | Steady performance variation (s) |
| --- | --- | --- | --- | --- | --- | --- |
| array3d | bad inconsistent (12 flat, 3 slowdown) | 1.0 (1.0, 76.0) | (1.0, 76.0) was: (2.0, 624.3) | 0.000 (0.000, 9.755) | 0.13026 δ=0.00163 ±0.000215 | 0.000215 was: 0.000557 |
| binarytrees | warmup | 24.0 (24.0, 24.0) | | 2.792 (2.786, 2.810) | 0.11960 δ=-0.00603 ±0.000762 | |
| bounce | flat | | | | 0.13865 δ=0.01070 ±0.000978 | |
| capnproto_encode | flat | | | | 0.11818 δ=-0.09880 ±0.002599 | |
| collisiondetector | warmup | 2.0 (2.0, 2.0) | | 0.167 (0.167, 0.169) | 0.11583 ±0.001498 | |
| coroutine_ring | flat | | | | 0.14645 δ=0.03978 ±0.000751 | |
| deltablue | flat | | | | 0.10658 ±0.001063 | 0.001063 was: 0.003195 |
| euler14 | good inconsistent (12 flat, 3 warmup) | 1.0 δ=-59.0 (1.0, 51.4) | (1.0, 51.4) was: (60.0, 83.0) | 0.000 δ=-5.537 (0.000, 5.655) | 0.11195 δ=0.02015 ±0.000093 | 0.000093 was: 0.000743 |
| fannkuch_redux | flat | | | | 0.12437 δ=0.00344 ±0.000029 | |
| fasta | flat | | | | 0.11967 δ=-0.00132 ±0.000313 | |
| havlak | flat | | | | 0.21013 ±0.002442 | |
| heapsort | flat | | | | 1.39055 δ=0.37138 ±0.002379 | |
| jsonlua_decode | flat | | | | 0.13994 δ=0.02715 ±0.001207 | |
| jsonlua_encode | flat | | | | 0.13581 δ=0.00783 ±0.001409 | |
| knucleotide | flat | | | | 0.13035 δ=0.01373 ±0.000446 | |
| life | flat | | | | 0.28412 ±0.000599 | |
| luacheck | flat | | | | 0.99735 ±0.006094 | 0.006094 was: 0.089779 |
| luacheck_parser | flat | | | | 0.07745 δ=-0.01688 ±0.002281 | |
| luafun | warmup | 28.0 (28.0, 28.0) | | 4.879 (4.861, 4.904) | 0.17864 δ=0.01293 ±0.001222 | 0.001222 was: 0.004918 |
| mandelbrot | flat | | | | 0.34166 δ=-0.00278 ±0.000067 | |
| mandelbrot_bit | flat | | | | 0.21577 ±0.000024 | |
| md5 | flat | | | | 0.09548 δ=-0.01731 ±0.000037 | |
| meteor | warmup | 2.0 (2.0, 3.0) | (2.0, 3.0) was: (2.0, 18.0) | 0.273 (0.269, 0.493) | 0.21464 ±0.002170 | 0.002170 was: 0.003935 |
| nbody | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 1.9) | | 0.000 (0.000, 0.160) | 0.17695 δ=0.01671 ±0.002226 | |
| nsieve | warmup | 2.0 (2.0, 2.6) | (2.0, 2.6) was: (2.0, 2.0) | 0.180 (0.179, 0.282) | 0.16982 δ=-0.00922 ±0.000862 | 0.000862 was: 0.000640 |
| nsieve_bit | warmup | 4.0 (3.7, 5.0) | (3.7, 5.0) was: (3.4, 5.3) | 0.273 (0.247, 0.361) | 0.08780 ±0.000233 | 0.000233 was: 0.000054 |
| partialsums | warmup | 2.0 (2.0, 2.3) | (2.0, 2.3) was: (2.0, 2.0) | 0.161 (0.160, 0.207) | 0.14860 ±0.001611 | 0.001611 was: 0.002044 |
| pidigits | good inconsistent (8 warmup, 7 flat) | 5.0 (1.0, 6.0) | (1.0, 6.0) was: (1.0, 2.3) | 0.516 (0.000, 0.646) | 0.12766 ±0.000032 | 0.000032 was: 0.002132 |
| queens | good inconsistent (14 warmup, 1 flat) | 2.0 (1.7, 2.0) | (1.7, 2.0) was: (1.0, 294.4) | 0.162 (0.113, 0.162) | 0.15853 δ=0.04015 ±0.000231 | 0.000231 was: 0.000751 |
| quicksort | warmup | 2.0 (2.0, 2.3) | (2.0, 2.3) was: (2.0, 4.0) | 0.278 (0.278, 0.361) | 0.27183 ±0.000469 | 0.000469 was: 0.067395 |
| radixsort | flat | | | | 0.12621 ±0.000757 | 0.000757 was: 0.000403 |
| ray | flat | | | | 0.35530 δ=0.04568 ±0.000983 | |
| recursive_ack | bad inconsistent (14 flat, 1 slowdown) | 1.0 (1.0, 19.0) | | 0.000 (0.000, 2.562) | 0.14228 δ=0.02253 ±0.000616 | |
| recursive_fib | flat | | | | 0.28989 δ=0.05925 ±0.000033 | |
| resty_json | flat | | | | 0.07534 ±0.000595 | 0.000595 was: 0.002629 |
| revcomp | flat | | | | 0.11684 ±0.002139 | 0.002139 was: 0.001754 |
| richards | warmup | 2.0 (2.0, 3.2) | (2.0, 3.2) was: (1.0, 2.0) | 0.171 (0.170, 0.369) | 0.16559 δ=0.02935 ±0.000342 | 0.000342 was: 0.010223 |
| scimark_fft | warmup | 2.0 (2.0, 10.3) | (2.0, 10.3) was: (2.0, 4.7) | 0.141 (0.141, 1.195) | 0.12709 ±0.000102 | 0.000102 was: 0.000823 |
| scimark_lu | flat | | | | 0.12733 δ=0.01186 ±0.000159 | |
| scimark_sor | flat | | | | 0.13297 δ=0.01189 ±0.000005 | |
| scimark_sparse | flat | | | | 0.13082 δ=0.00740 ±0.000490 | |
| series | warmup | 2.0 (2.0, 2.0) | | 0.347 (0.347, 0.348) | 0.33390 ±0.000869 | 0.000869 was: 0.003217 |
| spectralnorm | flat | | | | 0.13989 δ=0.00002 ±0.000003 | |
| table_cmpsort | slowdown | 10.0 (10.0, 10.0) | | 1.945 (1.935, 1.967) | 0.22008 ±0.001852 | 0.001852 was: 0.007836 |

In essence, green cells mean that RaptorJIT is better than LuaJIT; red cells mean that LuaJIT is better than RaptorJIT; yellow means they're different in a way that can't be compared; and white/grey means they're statistically equivalent. The additional “Steady performance variation (s)” column shows whether the steady state performance of different process executions is more predictable or not.
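One plausible way to decide whether two steady-state results are “statistically equivalent” is to check whether their confidence intervals overlap. The Python sketch below does exactly that. It is an assumption about the approach rather than the stats differ's actual code, and the function name `compare` is hypothetical, though the rule does look consistent with which rows above carry a δ value. The figures are taken from the array3d and deltablue rows of the LuaJIT and RaptorJIT tables.

```python
def compare(old_mean, old_ci, new_mean, new_ci):
    """Compare two steady-state means given their confidence-interval half-widths.

    Returns ("equivalent", None) when the intervals overlap; otherwise
    ("improved" or "worsened", delta), where lower times are better.
    """
    old_lo, old_hi = old_mean - old_ci, old_mean + old_ci
    new_lo, new_hi = new_mean - new_ci, new_mean + new_ci
    if new_lo <= old_hi and old_lo <= new_hi:
        return "equivalent", None
    delta = new_mean - old_mean
    return ("improved" if delta < 0 else "worsened"), delta


# array3d: LuaJIT 0.12863 ±0.000558, RaptorJIT 0.13026 ±0.000216
print(compare(0.12863, 0.000558, 0.13026, 0.000216))
# deltablue: LuaJIT 0.10328 ±0.003195, RaptorJIT 0.10658 ±0.001063
print(compare(0.10328, 0.003195, 0.10658, 0.001063))
```

For array3d the intervals do not overlap and RaptorJIT's time is higher, so the sketch reports a worsening; for deltablue the intervals overlap, so it reports the two as equivalent even though the means differ.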

The simple conclusion to draw from this is that there isn't a simple conclusion to draw from it: the two VMs are sometimes better than each other with no clear pattern. Without having a clear steer either way, we therefore decided to use the original version of LuaJIT as our base.

One of the things that became very clear from our benchmarking is that LuaJIT is highly non-deterministic – indeed, it's the most non-deterministic VM I've seen. The practical effect of this is that even on one program, LuaJIT is sometimes very fast, and sometimes rather slow. This is, at best, very confusing for users who tend to assume that programs perform more-or-less the same every time they're run; at worst, it can create significant problems when one is trying to estimate things like server provisioning. We therefore tried various things to make performance more consistent.

The most promising approach we alighted upon is what we ended up calling “separate counters”. In a tracing JIT compiler such as LuaJIT, one tracks how often a loop (where loops are both “obvious” things like for loops, as well as less obvious things such as functions) has been executed: once it's hit a certain threshold, the loop is traced, and compiled into machine code. LuaJIT has an unusual approach to counting loops: it has 64 counters to which all loops are mapped (using the memory address of the bytecode in question). In other words, multiple loops share the same counter: the bigger the program, the more loops share the same counter. The advantage of this is that the counters map is memory efficient, and for small programs (e.g. the common LuaJIT benchmarks) it can be highly effective. However, it has very odd effects in real programs, particularly as programs get bigger: loops are compiled non-deterministically based on the particular address in memory they happen to have been loaded at.

We therefore altered LuaJIT so that each loop and each function has its own counter, stored in the bytecode to make memory reads/writes more cache friendly. The diff from normal LuaJIT to the separate counters version is as follows:

Results for Normal vs. Counters

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
Diff against previous results: improved, worsened, different, unchanged.

| Benchmark | Classification | Steady iteration (#) | Steady iteration variation | Steady iteration (s) | Steady performance (s) | Steady performance variation (s) |
| --- | --- | --- | --- | --- | --- | --- |
| array3d | no steady state | | | | | |
| binarytrees | flat | | | | 0.12462 ±0.004058 | 0.004058 was: 0.000532 |
| bounce | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 5.8) | | 0.000 (0.000, 0.603) | 0.12515 δ=-0.00280 ±0.000278 | |
| capnproto_decode | good inconsistent (9 flat, 6 warmup) | 1.0 (1.0, 24.9) | (1.0, 24.9) was: (1.0, 45.3) | 0.000 (0.000, 3.692) | 0.15042 ±0.003797 | 0.003797 was: 0.028466 |
| capnproto_encode | warmup | 230.0 (56.0, 467.6) | (56.0, 467.6) was: (52.8, 280.6) | 28.411 (6.667, 55.951) | 0.11838 δ=-0.09860 ±0.001960 | 0.001960 was: 0.014541 |
| collisiondetector | bad inconsistent (13 warmup, 2 no steady state) | | | | | |
| coroutine_ring | flat | | | | 0.10680 ±0.003151 | 0.003151 was: 0.001527 |
| deltablue | warmup | 149.0 (149.0, 274.5) | (149.0, 274.5) was: (1.0, 1022.9) | 15.561 (15.430, 28.653) | 0.10159 ±0.001083 | 0.001083 was: 0.003195 |
| euler14 | warmup | 61.0 (61.0, 68.3) | (61.0, 68.3) was: (60.0, 83.0) | 5.650 (5.592, 6.356) | 0.09216 ±0.000159 | 0.000159 was: 0.000743 |
| fannkuch_redux | flat | | | | 0.11976 ±0.000012 | 0.000012 was: 0.001502 |
| fasta | flat | | | | 0.12200 δ=0.00100 ±0.000597 | |
| havlak | no steady state | | | | | |
| heapsort | flat | | | | 1.04378 δ=0.02461 ±0.000789 | |
| jsonlua_decode | flat | | | | 0.12648 δ=0.01370 ±0.000556 | |
| jsonlua_encode | flat | | | | 0.12860 ±0.000879 | 0.000879 was: 0.001761 |
| knucleotide | flat | | | | 0.11710 ±0.000541 | 0.000541 was: 0.000811 |
| life | bad inconsistent (9 warmup, 3 flat, 2 slowdown, 1 no steady state) | | | | | |
| luacheck | flat | | | | 1.00299 ±0.004778 | 0.004778 was: 0.089781 |
| luacheck_parser | bad inconsistent (12 warmup, 2 no steady state, 1 flat) | | | | | |
| luafun | warmup | 69.0 (69.0, 69.0) | | 11.481 (11.331, 11.522) | 0.16770 ±0.001564 | 0.001564 was: 0.004918 |
| mandelbrot | bad inconsistent (14 flat, 1 no steady state) | | | | | |
| mandelbrot_bit | flat | | | | 0.21695 ±0.000142 | |
| md5 | flat | | | | 0.11155 δ=-0.00124 ±0.000043 | |
| meteor | good inconsistent (13 warmup, 2 flat) | 14.0 (1.0, 15.0) | (1.0, 15.0) was: (2.0, 18.0) | 2.855 (0.000, 3.045) | 0.21606 ±0.004651 | 0.004651 was: 0.003935 |
| moonscript | warmup | 63.0 (17.7, 184.1) | (17.7, 184.1) was: (13.1, 423.3) | 10.046 (2.763, 29.739) | 0.15999 ±0.001405 | 0.001405 was: 0.001568 |
| nbody | flat | | | | 0.15898 ±0.001676 | 0.001676 was: 0.002790 |
| nsieve | warmup | 2.0 (2.0, 2.6) | (2.0, 2.6) was: (2.0, 2.0) | 0.189 (0.188, 0.297) | 0.17875 ±0.001266 | 0.001266 was: 0.000641 |
| nsieve_bit | warmup | 4.0 (2.0, 6.0) | (2.0, 6.0) was: (3.4, 5.3) | 0.271 (0.097, 0.446) | 0.08726 δ=-0.00032 ±0.000202 | 0.000202 was: 0.000054 |
| partialsums | warmup | 2.0 (2.0, 2.9) | (2.0, 2.9) was: (2.0, 2.0) | 0.161 (0.161, 0.295) | 0.14916 ±0.000081 | 0.000081 was: 0.002044 |
| pidigits | warmup | 2.0 (2.0, 4.3) | (2.0, 4.3) was: (1.0, 2.3) | 0.130 (0.130, 0.425) | 0.12666 ±0.000122 | 0.000122 was: 0.002133 |
| queens | good inconsistent (10 flat, 5 warmup) | 1.0 (1.0, 2.0) | (1.0, 2.0) was: (1.0, 294.4) | 0.000 (0.000, 0.127) | 0.12484 δ=0.00646 ±0.000317 | 0.000317 was: 0.000751 |
| quicksort | slowdown | 2.0 (2.0, 2.0) | | 0.299 (0.298, 0.304) | 0.44880 δ=0.13763 ±0.020477 | 0.020477 was: 0.067395 |
| radixsort | flat | | | | 0.12644 ±0.000864 | 0.000864 was: 0.000403 |
| ray | flat | | | | 0.30901 ±0.002140 | 0.002140 was: 0.004022 |
| recursive_ack | flat | | | | 0.11958 ±0.000510 | 0.000510 was: 0.000653 |
| recursive_fib | flat | | | | 0.22864 ±0.000266 | 0.000266 was: 0.028968 |
| resty_json | bad inconsistent (12 flat, 2 warmup, 1 no steady state) | | | | | |
| revcomp | flat | | | | 0.11550 ±0.002553 | 0.002553 was: 0.001753 |
| richards | good inconsistent (14 warmup, 1 flat) | 2.0 (1.7, 2.0) | (1.7, 2.0) was: (1.0, 2.0) | 0.150 (0.105, 0.150) | 0.14572 ±0.000324 | 0.000324 was: 0.010223 |
| scimark_fft | warmup | 2.0 (2.0, 10.0) | (2.0, 10.0) was: (2.0, 4.7) | 0.140 (0.140, 1.153) | 0.12639 ±0.000343 | 0.000343 was: 0.000823 |
| scimark_lu | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 45.3) | | 0.000 (0.000, 5.122) | 0.11546 ±0.000132 | 0.000132 was: 0.000308 |
| scimark_sor | flat | | | | 0.12105 ±0.000148 | |
| scimark_sparse | flat | | | | 0.12315 ±0.000728 | 0.000728 was: 0.000585 |
| series | warmup | 2.0 (2.0, 2.0) | | 0.347 (0.347, 0.348) | 0.33394 ±0.000645 | 0.000645 was: 0.003217 |
| spectralnorm | flat | | | | 0.13985 δ=-0.00003 ±0.000007 | |
| table_cmpsort | bad inconsistent (13 flat, 1 warmup, 1 slowdown) | 1.0 (1.0, 10.0) | | 0.000 (0.000, 2.005) | 0.21828 ±0.003289 | 0.003289 was: 0.007836 |

In this case we’re particularly interested in the “Steady performance variation (s)” column, which shows whether benchmarks have predictable steady state performance. The results are fairly clear: separate counters are, overall, a clear improvement. As you might expect, this is not a pure win, because it changes the order in which traces are made. This has several effects, including delaying some loops to be traced later than was previously the case, because counters do not hit the required threshold as quickly. This disadvantages some programs, particularly small deterministic benchmarks where loops are highly stable. In such cases, the earlier you trace the better. However, in my opinion, such programs are given undue weight when performance is considered. It’s no secret that some of the benchmarks regularly used to benchmark LuaJIT are highly optimised for LuaJIT as it stands; any changes to LuaJIT stand a good chance of degrading their performance. Nevertheless, we feel that the overall gain in consistency, particularly for larger programs, is worth it. There's a pull request against the Lua Foundation's fork of LuaJIT which applies this idea to a mainstream fork of LuaJIT.

We then started looking at various programs that showed odd performance. One problem in particular showed up in more than one benchmark. Here's a standard example:

Collisiondetector, Normal, Bencher9, Proc. exec. #12 (no steady state)

The problem – and it doesn't happen on every process execution, just to make it more fun – is that there are points where the benchmark slows down by over 10% for multiple in-process iterations (e.g. in this process execution, at in-process iterations 930-ish and 1050-ish). We tried over 25 separate ways to work out what was causing this – even building an instrumentation system to track what LuaJIT is doing – but in the end it turned out to be related to LuaJIT's Garbage Collector – sort of. When we moved from the 32-bit to 64-bit GC, the odd performance went away.

As such, we don’t think that the 64-bit GC “solves” the problem: however, it changes the way that pointers are encoded (doubling in size), which causes the code generator to emit a different style of code, such that the problem seems to go away. Nevertheless, this did make us reevaluate LuaJIT's GC. Tom then started work on implementing Mike Pall's suggestion for a new GC for LuaJIT (based partly on Tom's previous work and also that of Peter Cawley). He has enough implemented to run most small, and some large, programs, but it needs more work to finish it off, at which point evaluating it against the existing Lua GCs will be fascinating!

So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we've been able to make a lot of programs (particularly large programs) have more consistent performance; and we've got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!

Acknowledgements: Sarah Mount implemented the stats differ; Edd Barrett implemented Krun and answered many questions on it.
