Re: [openresty-en] Odd Behaviour executing string functions in a tight loop

Yichun Zhang (agentzh)

Sep 25, 2015, 8:54:38 AM
to openresty-en
Hello!

On Thu, Sep 24, 2015 at 2:05 AM, bjoe2k4 wrote:
> when i execute this code in a location via content_by_lua_file, then i get
> the following results:
>
> string_find: 0.002s
> string_match: 0.684s
>

The main difference here is that the string.find() builtin function
can be JIT compiled when the pattern is treated as a plain string [1],
since LuaJIT v2.1. string.match(), on the other hand, is NYI ("not yet
implemented") [2], so it cannot be JIT compiled (yet).
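For instance (a minimal sketch of my own, not taken from the original
benchmark), the two forms look like this:

```lua
local s = "hello, world"

-- Plain-text search: the fourth argument (true) disables Lua patterns,
-- so LuaJIT v2.1 can compile this call as part of a trace.
local from, to = string.find(s, "world", 1, true)

-- string.match is NYI: any trace reaching this call aborts, and the
-- surrounding code falls back to the interpreter.
local m = string.match(s, "world")
```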

It's easiest to see this by using the jit.v and/or jit.dump modules
shipped with LuaJIT. For example, we can add the following lines of
Lua code to the beginning of your Lua file:

local jit_v = require "jit.v"
jit_v.on("/dev/stderr")

Then running your sample Lua code with the "resty" command-line
utility (or the full-blown OpenResty nginx) produces the following
output from the LuaJIT JIT compiler:

$ resty a.lua
[TRACE 1 a.lua:21 loop]
string_find: 0.002s
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- (1/0) a.lua:22 -- NYI: FastFunc string.match at a.lua:51]
[TRACE 2 (1/0) a.lua:22 -- fallback to interpreter]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
[TRACE --- a.lua:50 -- NYI: FastFunc string.match at a.lua:51]
string_match: 0.269s

It's obvious that string.find() is JIT compiled, producing trace #1,
while string.match() repeatedly aborts JIT compilation and falls back
to the (slower) LuaJIT interpreter.

Alternatively, we can also see these in the on-CPU flamegraphs:

1) on-CPU flame graph for the case running string.find:

http://agentzh.org/misc/flamegraph/2015-09-25/jit-string-find.svg

Here we have no meaningful backtraces at the bottom of the graph,
which means they are initiated by JITted code [3].

Another interesting thing worth noting is that we also see no visible
frames for the lj_str_find() C function that implements
string.find(). This means that we are *not* really testing the
performance of string.find() here! What's happening? Let's check the
actual IR code [4] generated by the LuaJIT JIT compiler:

---- TRACE 1 IR
0001 > int SLOAD #7 CRI
0002 > int LE 0001 +2147483646
0003 int SLOAD #6 CI
0004 > fun SLOAD #3 T
0005 > str SLOAD #4 T
0006 > fun EQ 0004 a.lua:45
0007 int FLOAD 0005 str.len
0008 p32 STRREF 0005 +0
0012 p32 CALLN lj_str_find (0008 [0x41f18748] 0007 +1 )
0013 > p32 NE 0012 [NULL]
0016 + int ADD 0003 +1
0017 > int LE 0016 0001
0018 ------ LOOP ------------
0019 + int ADD 0016 +1
0020 > int LE 0019 0001
0021 int PHI 0016 0019

Okay, there is a lj_str_find C function call, but it has been hoisted
out of the loop! Yeah, the JIT compiler of LuaJIT is definitely smart
enough to notice that this call produces a loop-invariant result here [5].

So your naive benchmark code is not testing what it is supposed to
test! It's really timing an empty loop here! This is why there are so
many misleading benchmark cases out there in the wild [6].
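One hypothetical way to defeat such hoisting (my own sketch; the
string and iteration count below are made-up stand-ins, not the
original benchmark's inputs) is to make the call's arguments depend on
the loop variable, so its result is no longer loop-invariant:

```lua
-- Assumed stand-ins for the benchmark's inputs:
local str = string.rep("a", 1000) .. "b"
local n = 100

local total = 0
for i = 1, n do
  -- The start offset varies per iteration, so the call's result is
  -- not loop-invariant and cannot be hoisted out of the loop.
  local pos = string.find(str, "b", i, true)
  total = total + (pos or 0)
end
```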

2) on-CPU flame graph for the case running string.match:

http://agentzh.org/misc/flamegraph/2015-09-25/interp-string-match.svg

Almost all the samples in this graph have complete backtraces
(starting from the C main function). We also see the code frames of
the LuaJIT interpreter, like lj_BC_FUNCC, which is for interpreting
LuaJIT bytecode instructions.

Interpreted code is definitely slower than compiled code. Another hot
spot is the lj_str_new function call, which creates new Lua strings.
This is because string.match() needs to return a (new) Lua string
object. Creating a Lua string is definitely more expensive than
directly using primitive unboxed Lua numbers (which are not GC
objects at all).
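The difference in return values is easy to see (a small illustration
of my own):

```lua
local s = "abc"

-- string.find with a plain pattern returns only integer positions,
-- which are unboxed numbers and never touch the garbage collector:
local from, to = string.find(s, "b", 1, true)  -- 2, 2

-- string.match returns a Lua string object, which must be created
-- (and interned via lj_str_new) and is managed by the GC:
local m = string.match(s, "(b)")  -- "b"
```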

At this point, we can make sense of the benchmark results perfectly by
using the right tools (on-CPU flame graphs and LuaJIT's JIT compiler's
trace dump).

> However, when i reverse the two 'bench' functions (or reload the previous
> page with 'lua_code_cache on;') i get the following:
>
> string_match: 0.632s
> string_find: 2.416s
>

This, too, cannot be explained without using the same set of tools.
So let me elaborate on the details below.

From the flame graph for the second part of the code, which runs string.find(),

http://agentzh.org/misc/flamegraph/2015-09-25/interp-string-find-loop.svg

we can see that lj_str_find() finally dominates the on-CPU time!
Obviously the memchr() libc call is doing the hard work [7].

But we may wonder why LuaJIT does not hoist the string.find() call out
of the loop as in the previous benchmark. The LuaJIT IR code dump
confirms this and gives us more details:

---- TRACE 1 IR
0001 fun SLOAD #0 R
0002 > fun EQ 0001 a.lua:45
0003 > str SLOAD #1 T
0004 int FLOAD 0003 str.len
0005 p32 STRREF 0003 +0
0009 p32 CALLN lj_str_find (0005 [0x40ac0748] 0004 +1 )
0010 > p32 NE 0009 [NULL]
0011 int SUB 0009 0005
0012 int ADD 0011 +1
0013 num CONV 0012 num.int

Okay, so the loop is not compiled this time! Only the string_find
function call is compiled; the loop is still interpreted. So there's
no way for the JIT compiler to analyze and optimize the loop this
time.

Then why does LuaJIT not include the loop in the compiled trace in the
first place? Because the benchmark itself is at fault. It uses the
same bench Lua function for two different sets of runs: the loop is
used for both string.match and string.find. When string.match is
exercised in the loop, the JIT compiler keeps hitting a wall (the NYI
string.match) while trying to compile that loop, so it eventually
gives up and blacklists the loop. Thus, when the loop is later used
for string.find(), the JIT compiler does not even bother trying to
compile the blacklisted loop, even though this time it MAY succeed.

This can be fixed by introducing another Lua function, bench2(),
duplicating the bench function's code. Then we get the expected
results, as in the previous benchmark:

$ resty a.lua
string_match: 0.279s
string_find: 0.002s
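A sketch of the workaround (the function names and the input string
here are assumed, since the original benchmark code is not shown):

```lua
local str = string.rep("a", 1000) .. "b"

-- Each callee gets its own copy of the driver loop, so blacklisting
-- one loop while running string.match cannot poison the other loop
-- used for string.find.
local function bench(n, f)   -- used only for the string.match run
  for i = 1, n do f() end
end

local function bench2(n, f)  -- an identical copy, used only for string.find
  for i = 1, n do f() end
end

bench(1000, function() return string.match(str, "b") end)
bench2(1000, function() return string.find(str, "b", 1, true) end)
```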

In general, it's a bad idea to pass Lua functions as arguments to
other Lua functions to do "higher-order" functional programming. This
can confuse the JIT compiler, resulting in missed optimizations and
fallbacks to the (slower) interpreter.

>
> Can anyone explain me, why in the second case, the string.find function
> apparently executes ~1000 times slower?
>

Again, you're not really testing what you're supposed to test :)

Regards,
-agentzh

P.S. This is really a question about LuaJIT, not OpenResty (in
particular), despite the fact that you are using the ngx.* API in
(unimportant places of) your benchmark Lua program.

[1] CloudFlare sponsored this work in the LuaJIT v2.1 branch.
[2] http://wiki.luajit.org/NYI
[3] or static code without debug symbols, which is definitely not the
case in this example.
[4] We can get the IR code dump by using the jit.dump module instead
of jit.v. For example, just add the following code to the beginning of
the .lua file:

local jit_dump = require "jit.dump"
jit_dump.on("Xtbim", "/dev/stderr")

[5] Well, if the JIT compiler were more intelligent, it could remove
the string.find() call completely.
[6] We are actually using the "active benchmarking" approach proposed
by Brendan Gregg to validate the benchmark itself. See
http://www.brendangregg.com/activebenchmarking.html for more details.
[7] One may wonder why a JITted string.find is slower than a fully
interpreted string.match in this very benchmark. The answer is simple:
they are operating on very different input strings! The one scanned by
string_find, str, is much, much longer than the one for string_match,
str2.