Hello!
On Thu, Sep 24, 2015 at 2:05 AM, bjoe2k4 wrote:
> when i execute this code in a location via content_by_lua_file, then i get
> the following results:
>
> string_find: 0.002s
> string_match: 0.684s
>
The main difference here is that the string.find() builtin function
has been JIT compilable since LuaJIT v2.1 when the pattern is searched
for as a plain string [1]. string.match(), on the other hand, is still
NYI ("Not Yet Implemented") [2], so it cannot be JIT compiled (yet).
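To make the difference concrete, below is a minimal hypothetical pair
of loops of the kind presumably being timed (not your exact code; the
string s and the loop counts are made up). Only the first loop is
eligible for JIT compilation:
local s = ("x"):rep(100) .. "needle"
-- plain-string search: the 4th "plain" argument (or a pattern without
-- magic characters) lets LuaJIT 2.1 compile this loop
for i = 1, 1e6 do
    local from = string.find(s, "needle", 1, true)
end
-- string.match() is an NYI fast function, so this loop stays interpreted
for i = 1, 1e6 do
    local cap = string.match(s, "needle")
end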
It's easiest to see this by using the jit.v and/or jit.dump modules
shipped with LuaJIT. For example, we can add the following lines of
Lua code to the beginning of your Lua file:
local jit_v = require "jit.v"
jit_v.on("/dev/stderr")
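Or, if you want the full dump including the bytecode, IR, and machine
code of each trace, the jit.dump module can be used in the same way
(just a sketch; the output path is up to you):
local jit_dump = require "jit.dump"
-- passing nil keeps the default dump flags (trace events, bytecode, IR, mcode)
jit_dump.on(nil, "/dev/stderr")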
Then running your sample Lua code with the "resty" command-line
utility (or the full-blown OpenResty nginx) produces the following
output from the LuaJIT JIT compiler:
$ resty a.lua
[TRACE 1 a.lua:21 loop]
string_find: 0.002s
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE 2 (1/0) a.lua:22 -- fallback to interpreter]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
string_match: 0.269s
It's obvious that the string.find() loop gets JIT compiled, producing
trace #1, while the string.match() loop repeatedly aborts JIT
compilation and falls back to the (slower) LuaJIT interpreter.
Alternatively, we can also see all of this in on-CPU flame graphs:
1) on-CPU flame graph for the case running string.find:
http://agentzh.org/misc/flamegraph/2015-09-25/jit-string-find.svg
Here we have no meaningful backtraces at the bottom of the graph,
which means the samples are initiated by JIT-compiled code [3].
Another interesting thing worth noting here is that there are also no
visible frames for the lj_str_find() C function, which does the plain
substring search for string.find(). This means that we are *not*
really testing the performance of string.find() here! What's
happening? Let's check out the actual IR code [4] generated by the
LuaJIT JIT compiler:
---- TRACE 1 IR
0001 > int SLOAD #7 CRI
0002 > int LE 0001 +2147483646
0003 int SLOAD #6 CI
0004 > fun SLOAD #3 T
0005 > str SLOAD #4 T
0006 > fun EQ 0004 a.lua:45
0007 int FLOAD 0005 str.len
0008 p32 STRREF 0005 +0
0012 p32 CALLN lj_str_find (0008 [0x41f18748] 0007 +1 )
0013 > p32 NE 0012 [NULL]
0016 + int ADD 0003 +1
0017 > int LE 0016 0001
0018 ------ LOOP ------------
0019 + int ADD 0016 +1
0020 > int LE 0019 0001
0021 int PHI 0016 0019
Okay, there is a lj_str_find C function call, but it is placed outside
the loop! Yeah, the JIT compiler of LuaJIT is definitely smart enough
to notice that this call produces loop-invariant (constant) results
here [5].
So your naive benchmark code is not testing what it is supposed to
test! It is really just timing an empty loop! This is why there are so
many misleading benchmark cases out there in the wild [6].
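If the goal really is to measure string.find() itself, one way (just a
hypothetical sketch, not the only one) is to make the call's arguments
depend on the loop counter and to consume its result, so that neither
loop-invariant code motion nor dead-code elimination can throw the
call away:
local s = ("abcd"):rep(1000) .. "xyz"
local total = 0
for i = 1, 1e7 do
    -- the varying init position keeps the call inside the loop
    local from = string.find(s, "xyz", (i % 100) + 1, true)
    total = total + (from or 0)
end
-- using the accumulated result keeps the loop body from being eliminated
print(total)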
2) on-CPU flame graph for the case running string.match:
http://agentzh.org/misc/flamegraph/2015-09-25/interp-string-match.svg
So almost all the samples in the graph have complete backtraces
(starting from the C main function). We also see frames inside the
LuaJIT interpreter itself, like lj_BC_FUNCC, which is the interpreter's
handler for one of the LuaJIT bytecode instructions.
Interpreted code is definitely slower than compiled code. Another hot
spot is the lj_str_new function call, which creates new Lua strings.
This is because string.match() needs to return a (new) Lua string
object, and creating a Lua string is definitely more expensive than
directly using primitive, unboxed Lua numbers (which are not GC
objects at all).
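To make that concrete (a trivial illustration, not from the benchmark
code):
-- string.find() returns plain numbers (no GC objects involved):
local from, to = string.find("hello, world", "world", 1, true)
-- from == 8, to == 12
-- string.match() returns a Lua string object, which has to be
-- created/interned via lj_str_new in the interpreter:
local word = string.match("hello, world", "w%a+")
-- word == "world"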
At this point, we can make perfect sense of the benchmark results by
using the right tools (on-CPU flame graphs and the LuaJIT compiler's
trace dumps).
> However, when i reverse the two 'bench' functions (or reload the previous
> page with 'lua_code_cache on;') i get the following:
>
> string_match: 0.632s
> string_find: 2.416s
>
This cannot be explained either without applying the same set of
tools. So let me elaborate on the details below.