Performance
Currently this guide is specifically for considerations you may have to make with the SLua language in mind, rather than general programming advice.
For detailed technical information refer to https://luau.org/performance/
Here are the top level insights that may have an impact on how you write code:
- Compiler uses type information to make further optimisations
- So make sure to use type refinement / typing to avoid ambuigity (e.g.
numeric | boolean->numeric)
- So make sure to use type refinement / typing to avoid ambuigity (e.g.
- Table access for field lookup is optimised and very fast provided that:
- The field name is known at compilete time (e.g.
table.fieldortable["field"]where expression is a constant string) - Field access does not use metatables. The fastest way to work with tables is to store fields directly inside the table and methods in the metatable (static fields access through a class than object instance)
- Object structure is usually uniform. While it is possible to use a generic function to access tables of different shapes — e.g.
function getX(obj) return obj.x endused on any table that has a field"x"— it is best not to vary the keys used in tables too much, as it defeats this optimisation
- The field name is known at compilete time (e.g.
- Fast method calls
- Compiler emits a specialised instruction sequence when methods are called through
obj:Method(aka colon syntax) instead ofobj.Method(obj). For this to be effective, it is crucial that__indexin a metatable points to a table directly instead of a function. It is strongly recommended to avoid__indexfunctions as well as deep__indexchains; an ideal object is a table with a metatable that points to itself through__index. - As a result of optimisations, common Lua tricks of caching a method in a local variable are not productive
- Compiler emits a specialised instruction sequence when methods are called through
- Fastcall mechanism of builtin functions
- Some functions in Luau are optimised through a special fastcall mechanism but for this to work the function call must be “obvious” to the compiler — it needs to call a builtin function directly, e.g.
math.max(x, 1), although this also works if the function is “localised” (local max = math.max); this does not work for indirect function calls and does not work for method calls (so callingstring.byteis more efficient thans:byte) - Additionally some fastcall specialisations are partial in that they don’t support all types of arguments, for example all
mathlibrary builtins are specialised for numeric arguments, so callingmath.abswith a string will back to a slower implementation that does string to number coercion
- Some functions in Luau are optimised through a special fastcall mechanism but for this to work the function call must be “obvious” to the compiler — it needs to call a builtin function directly, e.g.
- Optimised table iteration
- Iteration through tables typically doesn’t result in function calls for every iteration; The performance of iteration using generalised iteration,
pairsandipairsis comparable, so generalised iteration (without the use ofpairs/ipairs) is recommended unless code needs to be compatible with vanilla Lua or specifics ofipairsis required (which stops at firstnil). Additionally generalised iteration avoids callingpairswhen the loop starts which can be noticeable on very short tables - Iterating through array-like tables using
for i=1,#t(instead of generalised iteration) tends to be slightly slower because of extra cost incurred when reading elements from the table
- Iteration through tables typically doesn’t result in function calls for every iteration; The performance of iteration using generalised iteration,
- Creating and modifying tables
- SLua implements several optimisations for table creation. When creating object-like tables, it is recommended to use table literals (
{...}) and to specify all table fields in the literal in one go instead of assigning them later; This triggers and optimisation inspired by LuaJIT called “table templates” and results in higher performance when creating objects. When creating array-like tables on the other hand, if the maximum size of the table is known up front, it is recommended to usetable.createwhich can create an empty table with preallocated storage, and optionally fill it with a given value. - When the exact table shape is not known, the compiler can still predict the table capacity required in case the table is initialised with an empty literal (
{}) and filled with fields subsequently. For example, the following code creates a correctly sized table implicitly:local v = {}v.x = 1v.y = 2v.z = 3return v - When appending elements to tables, it is recommend to use
table.insert, which is the fastest method to append an element to a table if the table size is not known. In cases when a table is filled sequentially, however, it can be more efficient to use known index for insertion — together with preallocating tables usingtable.createthis can result in much faster code, for example this is the fastest way to build a table of squares:local t = table.create(N)for i=1,N dot[i] = i * iend
- SLua implements several optimisations for table creation. When creating object-like tables, it is recommended to use table literals (
- Native vector math
- Vectors in Luau are optimised as 32-bit floating point vectors with 3 components with first class support for all math operations and component manipulation, which means native 3-wide SIMD support. For code that uses a lot of vector values this means smaller GC pressure and significantly faster execution.
- Closure caching
- Creating new closures (function objects) requires allocating a new object every time. This can be problematic for cases when functions are passed to algorithms like
table.sortor functions likepcallwhich can lead to more work for the garbage collector. To make closure creation cheaper, the compiler implements closure caching — when multiple executions of the same function expression are guaranteed to result in the function object that is semantically identical, the compiler may cache the closure and always return the same object. This changes the function identity which may affect code that uses function objects as table keys.
- Creating new closures (function objects) requires allocating a new object every time. This can be problematic for cases when functions are passed to algorithms like
- Fast memory allocator
- A custom allocator is implemented that is highly specialised and tuned to common allocation workloads. This does not mean memory allocation si free — it is carefully optimised but still carries a cost, and a high rate of allocations require more work from the garbage collector. The GC is incremental. Thus for high performance code it is recommend to avoid allocating memory in tight loops, by avoiding temporary table and userdata creation.
- Optimised libraries
- While the best performing code performs most of the time in the interpreter, the performance of the standard library functions is critical to some applications. In addition to specialising many small and simple functionds using the builtin fastcall mechanism, we spend extra care on optimising all library functions and providing additional functions beyond Lua’s standard library to help achieve good performance
- Functions from the
tablelibrary likeinsert,removeandmovehave been tuned for performance on array-like tables. Luau provides additional functions likecreateandfindto achieve further speedups.sortuses anintrosortalgorithm which results in guarnateed worst caseNlogNcomplexity regardless of input - For
stringlibrary a carefully tuned dynamic string buffer implementation is used. It is optimised for smaller strings to reduce garbage during string manipulation, and for larger strings it allows to produce a large string without extra copies, especially where the resulting size is known ahead of time. Additionally functions likeformathave been tuned to avoid the overhead ofsprintfwhere possible
- Loop unrolling
- Only loops with loop bounds known at compile time, such as for
i=1,4 do, can be unrolled. The loop body must be simple enough for the optimisation to be profitable; compiler uses heuristics to estimate the performance benefit and automatically decide if unrolling should be performed
- Only loops with loop bounds known at compile time, such as for
- Function inlining
- Only local functions (defined either as
local function fooorlocal foo = function) can be inlined. The function body must be simple enough for the optimization to be profitable; compiler uses heuristics to estimate the performance benefit and automatically decide if each call to the function should be inlined instead. Additionally recursive invocations of a function can’t be inlined at this time.
- Only local functions (defined either as