At Cloudflare we’re heavy users of LuaJIT and in the past have sponsored many improvements to its performance.
LuaJIT is a powerful piece of software, maybe the highest performing JIT in the industry. But it’s not always easy to get the most out of it, and sometimes a small change in one part of your code can negatively impact other, already optimized, parts.
One of the first pieces of advice anyone receives when writing Lua code to run quickly using LuaJIT is “avoid the NYIs”: the language or library features that can’t be compiled because they’re NYI (not yet implemented). And that means they run in the interpreter.
Another very attractive feature of LuaJIT is the FFI library, which allows Lua code to directly interface with C code and memory structures. The JIT compiler weaves these memory operations in line with the generated machine language, making it much more efficient than using the traditional Lua C API.
Unfortunately, if for any reason Lua code using the FFI library has to run under the interpreter, it takes a very heavy performance hit. As it happens, under the interpreter the FFI is usually much slower than the Lua C API or plain Lua operations. For many people, this means either avoiding the FFI or committing to permanent vigilance to keep the code from falling back to the interpreter.
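As a hedged illustration (the cdata buffer and the checksum loop here are made up for the example, not from any real codebase), this is the kind of code that flies when compiled and crawls when interpreted:
local ffi = require("ffi")
local buf = ffi.new("uint8_t[?]", 256)   -- a cdata array managed by the FFI

local function checksum()
  local sum = 0
  for i = 0, 255 do
    -- in a compiled trace this is a direct memory load woven into the mcode;
    -- in the interpreter every buf[i] goes through generic cdata handling
    sum = sum + buf[i]
  end
  return sum
end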
Optimizing LuaJIT Code
Before optimizing any code, it’s important to identify which parts are actually important. It’s useless to discuss the fastest way to add a few numbers before sending some data if the send operation will take a million times longer than the addition. Likewise, there’s no benefit in avoiding NYI features in code like initialization routines that might run only a few times: it’s unlikely that the JIT would even try to optimize them, so they would always run in the interpreter. The interpreter, by the way, is also very fast; even faster than the first version of LuaJIT itself.
But optimizing the core parts of a Lua program, like any deep inner loops, can yield huge improvements in the overall performance. In similar situations, experienced developers using other languages are used to inspecting the assembly language generated by the compiler, to see if there’s some change to the source code that can make the result better.
The command line LuaJIT executable provides a bytecode listing when running with the -jbc option, a statistical profiler activated with the -jp option, a trace list with -jv, and finally a detailed dump of all the JIT operations with -jdump.
The last two provide lots of information that is very useful for understanding what actually happens to the Lua code while executing, but it can be a lot of work to read the huge lists generated by -jdump. Also, some messages are hard to understand without a fairly complete picture of how the tracing compiler in LuaJIT actually works.
One very nice feature is that all these JIT options are implemented in Lua. To accomplish this, the JIT provides ‘hooks’ that can execute a Lua function at important moments, with the relevant information. Sometimes the best way to understand what some -jdump output actually means is to read the code that generated that specific part of the output.
Introducing Loom
After several rounds there, and being frustrated by the limitations of the sequentially generated dump, I decided to write a different version of -jdump, one that gathers more information and adds cross-references to help see how things are related, before displaying anything. The result is loom, which shows roughly the same information as -jdump, but with more references resolved and formatted in HTML with tables, columns, links and colors. It has helped me a lot to understand my own code and the workings of LuaJIT itself.
For example, let's consider the following code in a file called twoloops.lua:
for i=1,1000 do
for j=1,1000 do
end
end
With the -jv option:
$ luajit -jv twoloops.lua
[TRACE 1 twoloops.lua:2 loop]
[TRACE 2 (1/3) twoloops.lua:1 -> 1]
This tells us that there were two traces: the first one contains a loop, and the second one spawns from exit #3 of the first (the “(1/3)” part) and its end point returns to the start of trace #1.
Ok, let’s get more detail with -jdump:
$ luajit -jdump twoloops.lua
---- TRACE 1 start twoloops.lua:2
0009 FORL 4 => 0009
---- TRACE 1 IR
0001 int SLOAD #5 CI
0002 + int ADD 0001 +1
0003 > int LE 0002 +1000
0004 ------ LOOP ------------
0005 + int ADD 0002 +1
0006 > int LE 0005 +1000
0007 int PHI 0002 0005
---- TRACE 1 mcode 47
0bcbffd1 mov dword [0x40db1410], 0x1
0bcbffdc cvttsd2si ebp, [rdx+0x20]
0bcbffe1 add ebp, +0x01
0bcbffe4 cmp ebp, 0x3e8
0bcbffea jg 0x0bcb0014 ->1
->LOOP:
0bcbfff0 add ebp, +0x01
0bcbfff3 cmp ebp, 0x3e8
0bcbfff9 jle 0x0bcbfff0 ->LOOP
0bcbfffb jmp 0x0bcb001c ->3
---- TRACE 1 stop -> loop
---- TRACE 2 start 1/3 twoloops.lua:1
0010 FORL 0 => 0005
0005 KSHORT 4 1
0006 KSHORT 5 1000
0007 KSHORT 6 1
0008 JFORI 4 => 0010
---- TRACE 2 IR
0001 num SLOAD #1 I
0002 num ADD 0001 +1
0003 > num LE 0002 +1000
---- TRACE 2 mcode 81
0bcbff79 mov dword [0x40db1410], 0x2
0bcbff84 movsd xmm6, [0x41704068]
0bcbff8d movsd xmm5, [0x41704078]
0bcbff96 movsd xmm7, [rdx]
0bcbff9a addsd xmm7, xmm6
0bcbff9e ucomisd xmm5, xmm7
0bcbffa2 jb 0x0bcb0014 ->1
0bcbffa8 movsd [rdx+0x38], xmm6
0bcbffad movsd [rdx+0x30], xmm6
0bcbffb2 movsd [rdx+0x28], xmm5
0bcbffb7 movsd [rdx+0x20], xmm6
0bcbffbc movsd [rdx+0x18], xmm7
0bcbffc1 movsd [rdx], xmm7
0bcbffc5 jmp 0x0bcbffd1
---- TRACE 2 stop -> 1
This tells us... well, a lot of things. If you look closely, you’ll see the same two traces: one is a loop, the second starts at exit 1/3 and returns to trace #1. Each one shows some bytecode instructions, an IR listing, and the final mcode. There are several options to turn each listing on and off, and to show more info, like the registers allocated to some IR instructions, the “snapshot” structures that allow the interpreter to continue when a compiled trace exits, etc.
Now using loom, the same run shows the source code with the corresponding bytecodes, and the same two traces with IR and mcode listings. The bytecode lines in the traces and in the top listing are linked; hovering over the arguments in the IR listing highlights the source and the uses of each value; the jumps between traces are correctly labeled (and colored); and clicking on the bytecode or IR column headers reveals more information: excerpts from the source code and snapshot formats, respectively.
Writing it was a great learning experience: I had to read the dump script’s Lua sources and went much deeper into the LuaJIT sources than ever before. And then I was able to use loom not only to analyze and optimize Cloudflare’s Lua code, but also to watch the steps the compiler goes through to make it run fast, and what happens when it’s not happy.
The code is the code is the code is the code
LuaJIT handles up to four different representations of a program’s code:
First comes the source code, what the developer writes.
The parser analyzes the source code and produces the bytecode, which is what the interpreter actually executes. It has the same flow as the source code, grouped in functions, with all the calls, iterators, operations, etc. Of course, there’s no nice formatting or comments, the local variable names are replaced by indices, and all constants (other than small numbers) are stored in a separate area.
When the interpreter finds that a given point in the bytecode has been passed several times, that part of the code is considered “hot”, and the interpreter runs it once again, but this time it records each bytecode it encounters, generating a “code trace”, or just “a trace”. At the same time, it generates an “intermediate representation”, or IR, of the code as it’s executed. The IR doesn’t represent the whole of the function or code portion, just the path it actually takes.
A trace is finished when it hits a loop or a recursion, returns to a lower level than where it started, hits an NYI operation, or simply becomes too long. At this point, it can either be compiled into machine language, or aborted if it has reached some code that can’t be correctly translated. If successful, the bytecode is patched with an entry into the machine code, or “mcode”. If aborted, the initial trace point is “penalized” or even “blocklisted” to avoid wasting time trying to compile it again.
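Both the hotness threshold and the side-trace limit can be tweaked through the jit.opt module; a minimal sketch, using what I believe are the documented defaults:
-- iterations before a loop is considered hot, and taken exits before
-- a side trace is compiled (56 and 10 are, I believe, the defaults)
jit.opt.start("hotloop=56", "hotexit=10")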
What’s wrong with next()?
One of the most visible characteristics of the Lua language is the heavy use of dictionary objects called tables. From the Lua manual:
“Tables are the sole data structuring mechanism in Lua; they can be used to represent ordinary arrays, symbol tables, sets, records, graphs, trees, etc.”
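A tiny illustration of that versatility (the values are arbitrary):
local array  = { 10, 20, 30 }                  -- ordinary array, keys 1..n
local record = { name = "loom", open = true }  -- record/struct-like table
local set    = { red = true, blue = true }     -- set: membership test is set[k]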
To iterate over all the elements in a table, the idiomatic way is to use the standard library function pairs(), like this:
for k, v in pairs(t) do
-- use the key in ‘k’ and the value in ‘v’
end
In the standard Lua manual, pairs() is defined as “Returns three values: the next function, the table t, and nil”, so the previous code is the same as:
for k, v in next, t, nil do
-- use the key in ‘k’ and the value in ‘v’
end
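Both forms just call next() repeatedly; written out by hand, the loop looks like this:
local k, v = next(t)      -- a nil key means “give me the first pair”
while k ~= nil do
  -- use the key in ‘k’ and the value in ‘v’
  k, v = next(t, k)       -- pass the previous key to get the following pair
end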
But unfortunately, both the next() and pairs() functions are listed as “not compiled” in the feared NYI list. That means that any such code runs on the interpreter and is not compiled, unless the code inside the loop is complex enough and has other inner loops (loops that don’t use next() or pairs(), of course). Even in that case, the code would have to fall back to the interpreter at the end of each loop.
This sad news creates a tradeoff: for performance-sensitive parts of the code, don’t use the most Lua-like code style. That motivates people to come up with several contortions to be able to use numerical iteration (which is compiled, and very efficient), like replacing any key with a number, storing all the keys in a numbered array, or storing both keys and values at even/odd numeric indices.
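As a rough sketch of one such contortion (the table t and the loop body are hypothetical), the keys can be collected once into a numbered array so that the hot loop only uses numeric iteration:
-- done once, outside the hot path (this pairs() loop runs interpreted, but rarely)
local keys, n = {}, 0
for k in pairs(t) do
  n = n + 1
  keys[n] = k
end

-- the hot loop uses only numeric indexing, which compiles nicely
for i = 1, n do
  local k = keys[i]
  local v = t[k]
  -- use ‘k’ and ‘v’
end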
Getting next() compiled
So, I finally have a non-NYI next() function! I'd like to say “a fully JITtable next() function”, but it wouldn't be totally true; as it happens, there's no way to avoid some annoying trace exits on table iteration.
The purpose of the IR is to provide a representation of the execution path so it can be quickly optimized to generate the final mcode. For that, the IR traces are linear and type-specific, which creates some interesting challenges for iteration over a generic container.
Traces are linear
Being linear means that each trace captures a single execution path; it can't contain conditional code or internal jumps. The only conditional branches are the "guards" that make sure the code to be executed is the appropriate one. If a condition changes and the code must now do something different, the trace must be exited. If that happens several times, a side trace will be spawned and the exit will be patched into a conditional branch. Very nice, but this still means that there can be at most one loop in each trace.
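For instance, in a purely illustrative loop like this one, the recorder captures only the side of the branch it actually sees; the other side starts out as a guarded trace exit and, if taken often enough, gets compiled as a side trace:
local sum = 0
for i = 1, 10000 do
  if i % 2 == 0 then    -- the recorded trace keeps one side of this branch,
    sum = sum + i       -- protected by a guard; the other side is a trace exit
  else                  -- that may later be patched to jump to a side trace
    sum = sum - i
  end
end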
The implementation of next() has to internally skip over empty slots in the table to return only valid key/value pairs. If we tried to express this in IR code, this skipping would be the "inner" loop and the original loop would become an "outer" one, which doesn't have as many optimization opportunities. In particular, it can't hoist invariant code out of the loop.
The solution is to do that slot skipping in C. Not using the Lua C API, of course, but the internal IR CALL instruction, which is compiled into a "fast" call, using CPU registers for arguments as much as possible.
The IR is in Type-specific SSA form
The SSA form (Static Single Assignment) is key for many data-flow analysis heuristics that allow quick optimizations like dead code removal, allocation sinking, type narrowing, strength reduction, etc. In LuaJIT's IR it means that every instruction is usable as a value by subsequent instructions and has a declared type, fixed at the moment the trace recorder emits that particular IR instruction. In addition, every instruction can be a type guard: if the arguments are not of the expected type, the trace is exited.
Lua is dynamically typed: every value is tagged with type information so the bytecode interpreter can apply the correct operations to it. This allows us to have variables and tables that can contain and pass around any kind of object without changing the source code. Of course, it also requires the interpreter to be coded very "defensively", to consider all valid ramifications of every instruction, which limits the possibility of optimizations. The IR traces, on the other hand, are optimized for a single variation of the code, and deal only with the value types that are actually observed while executing.
For example, this simple code creates a 1,000 element array and then copies it to another table:
local t,t2 = {},{}
for i=1,1000 do
t[i] = i
end
for i,v in ipairs(t) do
t2[i]=v
end
resulting in this IR for the second loop, the one that does the copy:
0023 ------------ LOOP ------------
0024 num CONV 0017 num.int
0025 > int ABC 0005 0017
0026 p32 AREF 0007 0017
0027 num ASTORE 0026 0022
0028 rbp + int ADD 0017 +1
0029 > int ABC 0018 0028
0030 p32 AREF 0020 0028
0031 xmm7 >+ num ALOAD 0030
0032 xmm7 num PHI 0022 0031
0033 rbp int PHI 0017 0028
0034 rbx nil RENAME 0017 #3
0035 xmm6 nil RENAME 0022 #2
Here we see that the ALOAD in instruction 0031 asserts that the value loaded from the table is indeed a number. If it happens to be any other type, the guard fails and the trace is exited.
But what if we use an array of strings instead of numbers? A small change:
local t,t2 = {},{}
for i=1,1000 do
t[i] = 's'..i
end
for i,v in ipairs(t) do
t2[i]=v
end
gives us this:
0024 ------------ LOOP ------------
0025 num CONV 0018 num.int
0026 > int ABC 0005 0018
0027 p32 AREF 0007 0018
0028 str ASTORE 0027 0023
0029 rbp + int ADD 0018 +1
0030 > int ABC 0019 0029
0031 p32 AREF 0021 0029
0032 rbx >+ str ALOAD 0031
0033 rbx str PHI 0023 0032
0034 rbp int PHI 0018 0029
0035 r15 nil RENAME 0018 #3
0036 r14 nil RENAME 0023 #2
It's the same code, but the type that ALOAD is guarding is now a string (and it now uses a different register; I guess a vector register isn't appropriate for a string pointer).
And what if the table holds a mix of value types?
local t,t2={},{}
for i=1,1000,2 do
t[i], t[i+1] = i, 's'..i
end
for i,v in ipairs(t) do
t2[i]=v
end
0031 ------------ LOOP ------------
0032 num CONV 0027 num.int
0033 > int ABC 0005 0027
0034 p32 AREF 0007 0027
0035 str ASTORE 0034 0030
0036 r15 int ADD 0027 +1
0037 > int ABC 0019 0036
0038 p32 AREF 0021 0036
0039 xmm7 > num ALOAD 0038
0040 > int ABC 0005 0036
0041 p32 AREF 0007 0036
0042 num ASTORE 0041 0039
0043 rbp + int ADD 0027 +2
0044 > int ABC 0019 0043
0045 p32 AREF 0021 0043
0046 rbx >+ str ALOAD 0045
0047 rbx str PHI 0030 0046
0048 rbp int PHI 0027 0043
Now there are two ALOADs (and two ASTOREs), one for 'num' and one for 'str'. In other words, the JIT unrolled the loop and found that this made the types constant. =8-O
Of course, this would happen only on very simple and regular patterns. In general, it's wiser to avoid unpredictable type mixing; but polymorphic code will be optimized for each type that it's actually used with.
Back to next()
First, let's see the current implementation of next() as used by the interpreter:
lj_tab.c
/* Advance to the next step in a table traversal. */
int lj_tab_next(lua_State *L, GCtab *t, TValue *key)
{
uint32_t i = keyindex(L, t, key); /* Find predecessor key index. */
for (i++; i < t->asize; i++) /* First traverse the array keys. */
if (!tvisnil(arrayslot(t, i))) {
setintV(key, i);
copyTV(L, key+1, arrayslot(t, i));
return 1;
}
for (i -= t->asize; i <= t->hmask; i++) { /* Then traverse the hash keys. */
Node *n = &noderef(t->node)[i];
if (!tvisnil(&n->val)) {
copyTV(L, key, &n->key);
copyTV(L, key+1, &n->val);
return 1;
}
}
return 0; /* End of traversal. */
}
It takes the input key as a TValue pointer and calls keyindex(). This helper function searches for the key in the table and returns an index: if the key is an integer within the range of the array part, the index is the key itself; if not, it performs a hash query and returns the index of the Node, offset by the array size, if successful, or signals an error if the key is not found (it's an error to pass a nonexistent key to next()).
Back in lj_tab_next(), the index is first incremented, and if it's still within the array, it iterates over any holes until a non-nil value is found. If it wasn't in the array (or there's no next value there), it performs a similar "skip the nils" over the Node table.
The new lj_record_next() function in lj_record.c, like some other record functions there, checks not only the input parameters but also the return values, to generate the most appropriate code for this specific iteration, assuming it will likely be optimal for subsequent iterations too. Of course, any such assumption must be backed by the appropriate guard.
For next(), we choose between two different forms. If the returned key is in the array part, we use lj_tab_nexta(), which takes the input key as an integer and returns the next key, also as an integer, in the rax register. We don't do the equivalent of the keyindex() function; we just check (with a guard) that the key is within the bounds of the array:
lj_tab.c
/* Get the next array index */
MSize LJ_FASTCALL lj_tab_nexta(GCtab *t, MSize k)
{
for (k++; k < t->asize; k++)
if (!tvisnil(arrayslot(t, k)))
break;
return k;
}
The IR code looks like this:
0014 r13 int FLOAD 0011 tab.asize
0015 rsi > int CONV 0012 int.num
0017 rax + int CALLL lj_tab_nexta (0011 0015)
0018 > int ABC 0014 0017
0019 r12 p32 FLOAD 0011 tab.array
0020 p32 AREF 0019 0017
0021 [8] >+ num ALOAD 0020
Clearly, the CALL itself (at 0017) is typed as 'int', as is natural for an array key; and the ALOAD (0021) is 'num', because that's what the first few values happened to be.
When we finish with the array part, the bounds check (the ABC instruction at 0018) will fail, and soon new IR will be generated, this time using the lj_tab_nexth() function.
lj_tab.c
LJ_FUNCA const Node *LJ_FASTCALL lj_tab_nexth(lua_State *L, GCtab *t, const Node *n)
{
const Node *nodeend = noderef(t->node)+t->hmask;
for (n++; n <= nodeend; n++) {
if (!tvisnil(&n->val)) {
return n;
}
}
return &G(L)->nilnode;
}
But before doing the "skip the nils", we need to do a hash query to find the initial Node entry. Fortunately, the HREF IR instruction does exactly that. This is the IR:
0014 rdx p32 HREF 0011 0012
0016 r12 p32 CALLL lj_tab_nexth (0011 0014)
0017 rax >+ str HKLOAD 0016
0018 [8] >+ num HLOAD 0016
There's a funny thing here: HREF is supposed to return a reference to a value in the hash table, but the last argument of lj_tab_nexth() is a Node pointer. Let's see the Node definition:
lj_obj.h
/* Hash node. */
typedef struct Node {
TValue val; /* Value object. Must be first field. */
TValue key; /* Key object. */
MRef next; /* Hash chain. */
#if !LJ_GC64
MRef freetop; /* Top of free elements (stored in t->node[0]). */
#endif
} Node;
Ok... the value is the first field, and it says right there "Must be first field". Looks like it's not the first place with some hand-wavy pointer casts.
The return value of lj_tab_nexth() is a Node pointer, which can likewise be implicitly cast by HLOAD to get the value. To get the key, I added the HKLOAD instruction. Both guard for the expected types of the value and the key, respectively.
Let's take it for a spin
So, how does it perform? These tests do a thousand loops over a 10,000 element table, first using next() directly and then pairs(), with a simple addition in the inner loop. To get pairs() compiled, I just disabled the ISNEXT/ITERN optimization, so it actually uses next(). In the third test the variable in the addition is initialized to 0ULL instead of just 0, triggering the use of the FFI.
The first test is with all 10,000 elements at sequential integer keys, making the table a valid sequence, so ipairs() (which is already compiled) can be used just as well.
So, compiled next() is quite a lot faster, but the pairs() optimization in the interpreter is very fast too. On the other hand, the smallest smell of FFI completely trashes interpreter performance, while making the compiled code slightly tighter. Finally, ipairs() is faster, but a big part of that is because it stops at the first nil, while next() has to skip over every nil at the end of the array, which by default can be up to twice as big as the sequence itself.
Now with 5,000 (sequential) integer keys and 5,000 string keys. Of course, we can't use ipairs() here.
Roughly the same pattern: the compiled next() performance is very much the same in all three forms (used directly, under pairs(), and with FFI code), while the interpreter benefits from the pairs() optimization and almost dies with FFI. In this case, the interpreted pairs() actually surpasses the compiled next() performance, hinting that separately optimizing pairs() is still desirable.
A big factor in the interpreter's pairs() is that it doesn't use next(): instead, it drives the loop directly, with a hidden variable to iterate over the Node table without having to perform a hash lookup at every step.
Repeating that in a compiled pairs() would be equally beneficial, but it has to be done carefully to maintain compatibility with the interpreter: on any trace exit the interpreter kicks in and must be able to seamlessly continue iterating, so the rest of the system has to be aware of that hidden variable.
The best part of this is that we have lots of very challenging, yet deeply rewarding, work ahead of us! Come work for us on making LuaJIT faster and more.