lj_debug_funcname 中 terminated with signal 11

152 views
Skip to first unread message

秦晋

unread,
Apr 10, 2014, 12:48:36 AM4/10/14
to open...@googlegroups.com
春哥你好

首先得感谢带来openresty这样优秀的开发平台。点赞!

我现在工作在centos版本1.5.11.1上。使用openresty+redis做了一个简单的queue service。

设计和实现如下:

生产者由用户request产生,将task加入redis list。

消费者由生产者通过resty_http调用127.0.0.1上的请求启动一个ngx.timer.at定时器。定时器从redis list中取出task后处理。并且视有无后续任务等情况重启定时器。当然,从redis list中弹出处理过的task。

生产者和消费者分别运行在同一个服务器的两个通过nginx -c 启动的实例中。 实现了进程隔离。

当redis list中请求不多的时候。这个queue service运行的不错。
但是当我密集的将task送到list中时。消费者进程偶尔会 exited on signal 11。
我运行的测试任务大概每个消耗400ms时间,同时有8个队列并发进行。4个ngx_workprocess。

我反复检查代码后比较怀疑是连接redis的tcp_socket带来的问题。

下面是两次core 的 dump,希望对解决问题有帮助

Core was generated by `nginx: worker process                                                         '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f280016c8ce in lj_debug_funcname (L=0x45c5efa8, 
    frame=<value optimized out>, name=0x7fff20b466a8) at lj_debug.c:303
303     MMS mm = bcmode_mm(bc_op(*ip));
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.6.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 nss-softokn-freebl-3.14.3-9.el6.x86_64 openssl-1.0.1e-16.el6_5.4.x86_64 pcre-7.8-6.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) where
#0  0x00007f280016c8ce in lj_debug_funcname (L=0x45c5efa8, 
    frame=<value optimized out>, name=0x7fff20b466a8) at lj_debug.c:303
#1  0x00007f280016d191 in lj_debug_getinfo (L=0x45c5efa8, what=0x4d4d98 "nl", 
    ar=0x7fff20b466a0, ext=0) at lj_debug.c:483
#2  0x000000000048b9a7 in log_wrapper (log=0x182ebf0, ident=0x4d4dda "[lua] ", 
    level=2, L=0x45c5efa8) at ../ngx_lua-0.9.6/src/ngx_http_lua_log.c:110
#3  0x000000000048be3c in ngx_http_lua_ngx_log (L=0x45c5efa8)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_log.c:58
#4  0x00007f2800164371 in lj_BC_FUNCC ()
   from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2
#5  0x00007f2800167d51 in lj_err_run (L=0x45c5efa8) at lj_err.c:565
#6  0x00007f2800168352 in err_msgv (L=0x45c5efa8, em=<value optimized out>)
    at lj_err.c:580
#7  0x00007f28001684a5 in lj_err_optype (L=0x45c5efa8, 
    o=<value optimized out>, opm=<value optimized out>) at lj_err.c:616
#8  0x00007f280016be8f in lj_meta_tget (L=0x45c5efa8, o=0x47df8c08, 
    k=0x7fff20b46940) at lj_meta.c:141
#9  0x00007f28001647cd in lj_vmeta_tgetv ()
   from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2
#10 0x00000000004a0361 in ngx_http_lua_run_thread (L=0x41fc3378, r=0x182ec40, 
    ctx=0x1832dc8, nrets=1) at ../ngx_lua-0.9.6/src/ngx_http_lua_util.c:1103
#11 0x00000000004a753c in ngx_http_lua_socket_tcp_resume (r=0x182ec40)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:4277
#12 0x00000000004a19ab in ngx_http_lua_content_wev_handler (
    r=<value optimized out>)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_contentby.c:128
#13 0x00000000004a74a1 in ngx_http_lua_socket_handle_success (
    r=<value optimized out>, u=<value optimized out>)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:2329
#14 0x00000000004aa587 in ngx_http_lua_socket_tcp_read (r=0x182ec40, 
    u=0x45f1cce8) at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:1513
#15 0x00000000004aa9fb in ngx_http_lua_socket_read_handler (r=0x182ec40, 
    u=0x45f1cce8) at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:2167
#16 0x00000000004a7389 in ngx_http_lua_socket_tcp_handler (
    ev=<value optimized out>)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:2119
#17 0x00000000004343ab in ngx_epoll_process_events (
    cycle=<value optimized out>, timer=<value optimized out>, 
    flags=<value optimized out>) at src/event/modules/ngx_epoll_module.c:691

====================================================

#0  0x00007f34d0ac18ce in lj_debug_funcname (L=0x405705d8, 
    frame=<value optimized out>, name=0x7fff354de208) at lj_debug.c:303
303     MMS mm = bcmode_mm(bc_op(*ip));
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.6.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 nss-softokn-freebl-3.14.3-9.el6.x86_64 openssl-1.0.1e-16.el6_5.4.x86_64 pcre-7.8-6.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) where
#0  0x00007f34d0ac18ce in lj_debug_funcname (L=0x405705d8, 
    frame=<value optimized out>, name=0x7fff354de208) at lj_debug.c:303
#1  0x00007f34d0ac2191 in lj_debug_getinfo (L=0x405705d8, what=0x4d4d98 "nl", 
    ar=0x7fff354de200, ext=0) at lj_debug.c:483
#2  0x000000000048b9a7 in log_wrapper (log=0xbfdb70, ident=0x4d4dda "[lua] ", 
    level=2, L=0x405705d8) at ../ngx_lua-0.9.6/src/ngx_http_lua_log.c:110
#3  0x000000000048be3c in ngx_http_lua_ngx_log (L=0x405705d8)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_log.c:58
#4  0x00007f34d0ab9371 in lj_BC_FUNCC ()
   from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2
#5  0x00007f34d0abcd51 in lj_err_run (L=0x405705d8) at lj_err.c:565
#6  0x00007f34d0abd352 in err_msgv (L=0x405705d8, em=<value optimized out>)
    at lj_err.c:580
#7  0x00007f34d0abd4a5 in lj_err_optype (L=0x405705d8, 
    o=<value optimized out>, opm=<value optimized out>) at lj_err.c:616
#8  0x00007f34d0abd522 in lj_err_optype_call (L=0x410b1db0, o=0x400000310)
    at lj_err.c:643
#9  0x00007f34d0ac0188 in lj_meta_call (L=0x405705d8, func=0x44923bf8, 
    top=0x44923c08) at lj_meta.c:433
#10 0x00007f34d0ab9a43 in lj_vmeta_call ()
   from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2
#11 0x00000000004a0361 in ngx_http_lua_run_thread (L=0x406e8378, r=0xbfdbc0, 
    ctx=0xbdd148, nrets=1) at ../ngx_lua-0.9.6/src/ngx_http_lua_util.c:1103
#12 0x00000000004a753c in ngx_http_lua_socket_tcp_resume (r=0xbfdbc0)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:4277
#13 0x00000000004a19ab in ngx_http_lua_content_wev_handler (
    r=<value optimized out>)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_contentby.c:128
#14 0x00000000004a74a1 in ngx_http_lua_socket_handle_success (
    r=<value optimized out>, u=<value optimized out>)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:2329
#15 0x00000000004aa587 in ngx_http_lua_socket_tcp_read (r=0xbfdbc0, 
    u=0x4472b860) at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:1513
#16 0x00000000004aa9fb in ngx_http_lua_socket_read_handler (r=0xbfdbc0, 
    u=0x4472b860) at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:2167
#17 0x00000000004a7389 in ngx_http_lua_socket_tcp_handler (
    ev=<value optimized out>)
    at ../ngx_lua-0.9.6/src/ngx_http_lua_socket_tcp.c:2119
#18 0x000000000042d5aa in ngx_event_process_posted (

哦。对了。我在释放redis连接的时候使用的是redis:set_keepalive(0, 128)
但其实设置成其它参数也会导致问题。甚至redis:close()也一样。

谢谢

Yichun Zhang (agentzh)

unread,
Apr 10, 2014, 12:57:45 AM4/10/14
to openresty
Hello!

On Wed, Apr 9, 2014 at 9:48 PM, 秦晋 wrote:
> 我现在工作在centos版本1.5.11.1上。使用openresty+redis做了一个简单的queue service。
[...]
> 当redis list中请求不多的时候。这个queue service运行的不错。
> 但是当我密集的将task送到list中时。消费者进程偶尔会 exited on signal 11。

请尝试使用最新的 ngx_openresty 1.5.12.1rc1:

http://openresty.org/download/ngx_openresty-1.5.12.1rc1.tar.gz

这个版本的 openresty 包含了 luajit 中的一个重要 bug fix.

> 下面是两次core 的 dump,希望对解决问题有帮助
>
> Core was generated by `nginx: worker process

从你提供的 C backtrace 上看崩溃在了 LuaJIT VM 内部。请首先尝试 openresty 1.5.12.1rc1
是否仍有问题。如果仍有问题,使用我的 gdb 扩展工具对 core dump 进行分析:

https://github.com/agentzh/nginx-gdb-utils#lbt

以及

https://github.com/agentzh/nginx-gdb-utils#lgc

其中 lbt 工具可以确认崩溃发生的 Lua 代码位置(通过 Lua 调用栈),而 lgc 可以确认当前 LuaJIT VM 是否分配了过多的内存而导致崩溃。

当然,我的这些 gdb 扩展命令依赖较新版本的启用了 python 支持的 gdb.

另外,除了尝试 1.5.12.1rc1 以外,还可以尝试禁用 LuaJIT 的 JIT 编译器,看是否仍能复现问题,即在
nginx.conf 中添加下面这个片段:

init_by_lua 'jit.off()';

最后一条建议是:确认你没有使用 openresty 以外的第三方 nginx 模块或者 Lua C 库。

Regards,
-agentzh

Yichun Zhang (agentzh)

unread,
Apr 10, 2014, 1:06:26 AM4/10/14
to openresty
Hello!

On Wed, Apr 9, 2014 at 9:48 PM, 秦晋 wrote:
> (gdb) where
> #0 0x00007f280016c8ce in lj_debug_funcname (L=0x45c5efa8,
> frame=<value optimized out>, name=0x7fff20b466a8) at lj_debug.c:303
> #1 0x00007f280016d191 in lj_debug_getinfo (L=0x45c5efa8, what=0x4d4d98
> "nl",

另外,请在 gdb 里面使用 bt full 命令得到更加完整的 C 调用栈(包含各函数帧里的局部变量的值)。另外建议你使用
./configure --with-debug 重新编译 openresty. 这样我们可以看到更完整的 C 调试信息,同时也可以启用
LuaJIT 的内部断言检查。

多谢!

Regards,
-agentzh

秦晋

unread,
Apr 10, 2014, 2:43:27 AM4/10/14
to open...@googlegroups.com
应该能确定是jit的问题了:

升级到1.5.12.1rc1 --with-debug 之后问题依旧  core dump有些变化
尝试jit.off()后问题消失。

gdb工具等我晚上有空了搞一下。

万分感谢你的帮助。

新的core dump如下:
或者你们需要core文件么?我也可以打个包发过去

(gdb) bt full
#0  0x00007f755d3b6925 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f755d3b8105 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f755d3afa4e in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007f755d3afb10 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007f755e63fa40 in debug_framepc (L=<value optimized out>, 
    fn=<value optimized out>, nextframe=<value optimized out>)
    at lj_debug.c:103
        T = <value optimized out>
        ins = <value optimized out>
        pos = <value optimized out>
        __PRETTY_FUNCTION__ = "debug_framepc"
#5  0x00007f755e63fa49 in debug_frameline (L=<value optimized out>, 
    fn=0x41a66fc0, nextframe=<value optimized out>) at lj_debug.c:133
        pc = <value optimized out>
        __PRETTY_FUNCTION__ = "debug_frameline"
#6  0x00007f755e64130b in lj_debug_addloc (L=0x46b362a8, 
    msg=0x430a4b10 "attempt to index a nil value", 
    frame=<value optimized out>, nextframe=<value optimized out>)
    at lj_debug.c:356
        line = <value optimized out>
        fn = 0x41a66fc0
        __PRETTY_FUNCTION__ = "lj_debug_addloc"
#7  0x00007f755e63af4a in err_msgv (L=0x46b362a8, em=<value optimized out>)
    at lj_err.c:579
        msg = <value optimized out>
        argp = {{gp_offset = 32, fp_offset = 48, 
            overflow_arg_area = 0x7ffffee86900, 
            reg_save_area = 0x7ffffee86840}}
#8  0x00007f755e63b0a5 in lj_err_optype (L=0x46b362a8, 
    o=<value optimized out>, opm=<value optimized out>) at lj_err.c:616
        tname = 0x7f755e69eb17 "nil"
        opname = 0x7f755e69da23 "index"
#9  0x00007f755e63f746 in lj_meta_tget (L=0x46b362a8, o=0x451dfbb0, 
    k=0x7ffffee86970) at lj_meta.c:141
        mo = 0x0
        loop = <value optimized out>
#10 0x00007f755e6368a1 in lj_vmeta_tgetv ()
   from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2
No symbol table info available.
#11 0x00000000004ab97a in ngx_http_lua_run_thread (L=0x4128e378, r=0x111d500, 
    ctx=0x1121688, nrets=1) at ../ngx_lua-0.9.7/src/ngx_http_lua_util.c:1105
        next_coctx = <value optimized out>
        parent_coctx = <value optimized out>
        orig_coctx = 0x11216b0
        rv = <value optimized out>
        success = <value optimized out>
        next_co = <value optimized out>
        old_co = <value optimized out>
        err = <value optimized out>
        msg = <value optimized out>
        trace = <value optimized out>
        rc = <value optimized out>
        old_pool = 0x0
#12 0x00000000004b45b4 in ngx_http_lua_socket_tcp_resume (r=0x111d500)
    at ../ngx_lua-0.9.7/src/ngx_http_lua_socket_tcp.c:4316
        nret = <value optimized out>
        vm = 0x4128e378
        rc = <value optimized out>
        c = 0x7f755ab0cbe0
        ctx = 0x1121688
        coctx = <value optimized out>
        u = <value optimized out>
#13 0x00000000004ae02b in ngx_http_lua_content_wev_handler (
    r=<value optimized out>)
    at ../ngx_lua-0.9.7/src/ngx_http_lua_contentby.c:128
No locals.
#14 0x00000000004b44b1 in ngx_http_lua_socket_handle_success (r=0x111d500, 
    u=<value optimized out>)
    at ../ngx_lua-0.9.7/src/ngx_http_lua_socket_tcp.c:2333
        ctx = <value optimized out>
        coctx = <value optimized out>
        __PRETTY_FUNCTION__ = "ngx_http_lua_socket_handle_success"
#15 0x00000000004b82e6 in ngx_http_lua_socket_tcp_read (r=0x111d500, 
    u=0x4579d378) at ../ngx_lua-0.9.7/src/ngx_http_lua_socket_tcp.c:1515
        rc = <value optimized out>
        c = 0x7f755ab0cb08
        b = 0x4579d450
        rev = 0x7f755a903558
        size = <value optimized out>
        n = <value optimized out>
        read = 1
        preread = <value optimized out>
        llcf = <value optimized out>
#16 0x00000000004b93c1 in ngx_http_lua_socket_read_handler (r=0x111d500, 
    u=0x4579d378) at ../ngx_lua-0.9.7/src/ngx_http_lua_socket_tcp.c:2169
        c = <value optimized out>
#17 0x00000000004b42a6 in ngx_http_lua_socket_tcp_handler (ev=0x7f755a903558)
    at ../ngx_lua-0.9.7/src/ngx_http_lua_socket_tcp.c:2121
        c = 0x7f755ab0cbe0
        r = 0x111d500
        ctx = <value optimized out>
        u = 0x4579d378
#18 0x0000000000437397 in ngx_epoll_process_events (cycle=0x10dd7b0, 
    timer=<value optimized out>, flags=<value optimized out>)
    at src/event/modules/ngx_epoll_module.c:691
        events = 4
        revents = 5
        instance = <value optimized out>
        i = <value optimized out>
        level = <value optimized out>
        err = <value optimized out>
        rev = 0x7f755a903558
        wev = <value optimized out>
        queue = <value optimized out>
        c = 0x7f755ab0cb08
#19 0x000000000042ed23 in ngx_process_events_and_timers (cycle=0x10dd7b0)
    at src/event/ngx_event.c:248
        flags = 1
        timer = 500
        delta = 1397107393325
#20 0x0000000000435bde in ngx_worker_process_cycle (cycle=0x10dd7b0, 
    data=<value optimized out>) at src/os/unix/ngx_process_cycle.c:816
        worker = <value optimized out>
        i = <value optimized out>
        c = <value optimized out>
#21 0x00000000004342fc in ngx_spawn_process (cycle=0x10dd7b0, 
    proc=0x435ae8 <ngx_worker_process_cycle>, data=0x2, 
    name=0x4d952b "worker process", respawn=-3)
    at src/os/unix/ngx_process.c:198
        on = 1
        pid = 0
        s = 2
#22 0x0000000000434fea in ngx_start_worker_processes (cycle=0x10dd7b0, n=3, 
    type=-3) at src/os/unix/ngx_process_cycle.c:364
        i = <value optimized out>
        ch = {command = 1, pid = 7720, slot = 1, fd = 8}
#23 0x000000000043625f in ngx_master_process_cycle (cycle=0x10dd7b0)
    at src/os/unix/ngx_process_cycle.c:136
        title = 0x10ff3ac "master process /usr/local/openresty/nginx/sbin/nginx -c /usr/local/openresty/nginx/conf/nginx.queue.conf"
        p = <value optimized out>
        size = 105
        i = 3
        n = <value optimized out>
        sigio = <value optimized out>
        set = {__val = {0 <repeats 16 times>}}
        itv = {it_interval = {tv_sec = 17822440, tv_usec = 0}, it_value = {
            tv_sec = 0, tv_usec = 0}}
        live = <value optimized out>
        delay = <value optimized out>
        ls = <value optimized out>
        ccf = 0x10de7e0
#24 0x0000000000418deb in main (argc=<value optimized out>, 
    argv=<value optimized out>) at src/core/nginx.c:407
        i = <value optimized out>
        log = 0x71eb20
        cycle = 0x10dd7b0
        init_cycle = {conf_ctx = 0x0, pool = 0x10dc9e0, log = 0x71eb20, 
          new_log = {log_level = 0, file = 0x0, connection = 0, handler = 0, 
            data = 0x0, action = 0x0, next = 0x0}, log_use_stderr = 0, 
          files = 0x0, free_connections = 0x0, free_connection_n = 0, 
          reusable_connections_queue = {prev = 0x0, next = 0x0}, listening = {
            elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, paths = {
            elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, 
          open_files = {last = 0x0, part = {elts = 0x0, nelts = 0, 
              next = 0x0}, size = 0, nalloc = 0, pool = 0x0}, shared_memory = {
            last = 0x0, part = {elts = 0x0, nelts = 0, next = 0x0}, size = 0, 
            nalloc = 0, pool = 0x0}, connection_n = 0, files_n = 0, 
          connections = 0x0, read_events = 0x0, write_events = 0x0, 
          old_cycle = 0x0, conf_file = {len = 48, data = 0x7ffffee88764 ""}, 
          conf_param = {len = 0, data = 0x0}, conf_prefix = {len = 32, 
            data = 0x7ffffee88764 ""}, prefix = {len = 27, 
            data = 0x4d4f8a "/usr/local/openresty/nginx/"}, lock_file = {
            len = 0, data = 0x0}, hostname = {len = 0, data = 0x0}}
        ccf = <value optimized out>


在 2014年4月10日星期四UTC+8下午1时06分26秒,agentzh写道:

秦晋

unread,
Apr 10, 2014, 8:39:17 AM4/10/14
to open...@googlegroups.com
hi 在osx上同样的代码发生另一个错误

osx 10.9.2
openresty 1.5.12.1rc1 --with-debug
luajit off

terminated with signal 5 

core dump 如下
GNU gdb 6.3.50-20050815 (Apple version gdb-1824) (Wed Feb  6 22:51:23 UTC 2013)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries ....... done

Reading symbols for shared libraries . done
Reading symbols for shared libraries .............................................. done
#0  0x00000001041b5bd1 in lj_BC_FUNCC () at lj_vm.s:588
588     call _lj_state_growstack
(gdb) where
#0  0x00000001041b5bd1 in lj_BC_FUNCC () at lj_vm.s:588
#1  0x000000010421a370 in lj_alloc_destroy () at lj_alloc.c:1172
Current language:  auto; currently asm
(gdb) bt full
#0  0x00000001041b5bd1 in lj_BC_FUNCC () at lj_vm.s:588
No symbol table info available.
#1  0x000000010421a370 in lj_alloc_destroy () at lj_alloc.c:1172
No symbol table info available.

这个错误看来更棘手。

在 2014年4月10日星期四UTC+8下午1时06分26秒,agentzh写道:

Yichun Zhang (agentzh)

unread,
Apr 10, 2014, 3:19:59 PM4/10/14
to openresty
Hello!

On Wed, Apr 9, 2014 at 11:43 PM, 秦晋 wrote:
> 应该能确定是jit的问题了:
>
> 升级到1.5.12.1rc1 --with-debug 之后问题依旧 core dump有些变化
> 尝试jit.off()后问题消失。
>

看起来像是 JIT 编译器相关的一个问题。

> gdb工具等我晚上有空了搞一下。
>

你能运行 nginx-gdb-utils 里面的工具了吗?

>
> 新的core dump如下:
> 或者你们需要core文件么?我也可以打个包发过去
>

原始的 core dump 文件对我来说没有什么用处,因为我需要一个和你的一模一样的操作系统环境。

你能尝试构造一个最小化的完整用例吗?以便我可以在我本地复现你看到的问题。这样可以节约我们之间的沟通成本。尽量去除不必要的依赖项。

如果你不能提供一个最小化的完整用例的话,我们需要你配合我们做一些调试工作。我会为你提供一个特殊版本的 OpenResty,其 LuaJIT
会内建 gdb 调试所需的相关支持。

> (gdb) bt full
> #0 0x00007f755d3b6925 in raise () from /lib64/libc.so.6
> No symbol table info available.
> #1 0x00007f755d3b8105 in abort () from /lib64/libc.so.6
> No symbol table info available.
> #2 0x00007f755d3afa4e in __assert_fail_base () from /lib64/libc.so.6
> No symbol table info available.
> #3 0x00007f755d3afb10 in __assert_fail () from /lib64/libc.so.6
> No symbol table info available.
> #4 0x00007f755e63fa40 in debug_framepc (L=<value optimized out>,
> fn=<value optimized out>, nextframe=<value optimized out>)

从这个 C 调用栈看,是下面这个 LuaJIT 内部断言失败了:

bc_isret(bc_op(ins[-1]))

一个可能的原因是 LuaJIT VM 离开的最后一个 trace 时,LuaJIT 的 PC 寄存器(在 x86_64 上为 ebx)和
BASE 寄存器(在 x86_64 上为 edx)出现了不一致。

> 在osx上同样的代码发生另一个错误
>

Mac OS X 的开发和调试工具链远不及 Linux,让我们先专注于 Linux 环境。

Regards,
-agentzh

Yichun Zhang (agentzh)

unread,
Apr 10, 2014, 3:59:29 PM4/10/14
to openresty
Hello!

2014-04-10 12:19 GMT-07:00 Yichun Zhang (agentzh):
> 如果你不能提供一个最小化的完整用例的话,我们需要你配合我们做一些调试工作。我会为你提供一个特殊版本的 OpenResty,其 LuaJIT
> 会内建 gdb 调试所需的相关支持。
>

OK,我已经为你准备了一个特殊版本的 openresty:

http://openresty.org/download/ngx_openresty-1.5.12.1rc1-dbg.tar.gz

解压后使用类似下面的命令构造之:

./configure --with-debug -j2 ...
make -j2
sudo make install

重要的是 --with-debug 选项。

然后使用该版本的 openresty 复现你的问题。在得到 core dump 之后,使用 nginx-gdb-utils
项目(https://github.com/agentzh/nginx-gdb-utils )里的工具对 core 进行分析:

(gdb) source luajit21.py
(gdb) lbt
(gdb) lringbuf

请把这些命令的输出放在 github gist 网站上面(这样方便后面我们不断追加新的信息):

https://gist.github.com/

最好是一个 core 样本对应一个 gist. 建议在 Linux 上面采集一到两个样本。

这些样本对应的 core 文件以及对应的 nginx/luajit 可执行文件请都予以保留,因为我们还会基于你提供的 lringbuf
的输出从它们采样更多的细节信息。

>> 在osx上同样的代码发生另一个错误
>>
>
> Mac OS X 的开发和调试工具链远不及 Linux,让我们先专注于 Linux 环境。
>

你也可以在 Mac OS X 上使用上面的步骤得到对应的调试信息,并把输出集中放在单独的 gist 里面。

不过我还是建议先集中力量于 Linux,以避免混乱 :)

Thanks!
-agentzh

秦晋

unread,
Apr 11, 2014, 6:02:28 AM4/11/14
to open...@googlegroups.com
hi 

抱歉经过了这么久。因为我一直在尝试制作一个最小化的用例。但尝试了几次都不太成功。
所以最后去建立了GitGist

使用这个特殊版本之后情况依然是原来的样子
打开JIT的话会崩溃,如果关闭jit就一切OK

Gist链接如下:
core.1812

core.1813

在 2014年4月11日星期五UTC+8上午3时59分29秒,agentzh写道:

Yichun Zhang (agentzh)

unread,
Apr 11, 2014, 4:11:42 PM4/11/14
to openresty
Hello!

2014-04-11 3:02 GMT-07:00 秦晋:
> 抱歉经过了这么久。因为我一直在尝试制作一个最小化的用例。但尝试了几次都不太成功。
> 所以最后去建立了GitGist
>

OK,没问题。那就让我们直接在你那边通过我们的 gdb 工具来采集修复问题所需要的所有信息 :)

> 使用这个特殊版本之后情况依然是原来的样子
> 打开JIT的话会崩溃,如果关闭jit就一切OK
>

好极,特殊版本就需要能复现崩溃,因为它只是用来提供更多的调试信息的,特别是 LuaJIT 内部的相关执行轨迹的记录。

> Gist链接如下:
> core.1812
> https://gist.github.com/tinyqin/10454187
>

对于这个样本,lringbuf 命令输出的最后几行是

->166 L=0x429d97b8 Context.lua:105 < Schema.lua:64
<-166 L=0x429d97b8 @0x49ac16 < Context.lua:106 (direct exit, exitno=0)
->209 L=0x429d9578 redis.lua:166 < TaskQueueBase.lua:182
Assertion failed: bc_isret(bc_op(ins[-1])) pt=0x40282da0 pc=0x410c3ccc
<-209 L=0x429d9578 redis.lua:165 < TaskQueueBase.lua:182 (exitno=1)

以 -> 起始的行对应 LuaJIT 进入某个 trace(即被编译的最小单位的 Lua 代码路径)的事件,而以 <- 起始的行则对应离开
trace 的事件。而 -> 和 <- 后面的数字则代表当前进入或离开的 trace 对象的编号(从 1 开始严格递增)。

从上面的输出可以看到,在崩溃发生之前 LuaJIT 执行的最后两个 trace 是 #166 和 #209. 而且是在 LuaJIT
离开最后一个 trace(#209)时,断言 bc_isret(bc_op(ins[-1])) 就已经不再成立了。所以很可能是 #209
这个 trace 有问题。

为进一步确认这一点,根据这一行输出

Assertion failed: bc_isret(bc_op(ins[-1])) pt=0x40282da0 pc=0x410c3ccc

请在该样本对应的 core dump 文件的 gdb session 里执行下面的命令:

(gdb) lpc 0x410c3ccc
(gdb) lval (GCproto*)0x40282da0

看看 LuaJIT VM 的 PC 究竟指向哪个 Lua 函数的原型。(请把这两条 gdb 命令的输出追加到对应的 gist 中去。)

进一步地,我们需要崩溃前执行过的 #209 和 #166 这两个 trace 里面的 IR 中间码的完整列表。可以使用我们的 lir 命令:

(gdb) lir 209
(gdb) lir 166

进一步地,我们需要这两个 trace 对应的机器指令。可以使用 ltrace 命令得到它们的机器码的起始地址和结束地址:

(gdb) ltrace 209
(gdb) ltrace 166

然后分别使用 gdb 自己的 disas 命令得到它们的机器码的完整列表:

(gdb) disas <begin>, <end>

具体示例,可以参见 ltrace 命令的文档:

https://github.com/agentzh/nginx-gdb-utils#ltrace

请把上面所有命令的原始输出都追加到对应的 gist 里去 :)

为清楚起见,在 gist 里最好把各个 trace 对应的 IR 码和 mcode 码放在一起。

> core.1813
> https://gist.github.com/tinyqin/10454460
>

对于此样本,也如上面那样分析。也请把分析的结果追加进它对应的 gist :)

多谢!
-agentzh

Yichun Zhang (agentzh)

unread,
Apr 11, 2014, 4:17:55 PM4/11/14
to openresty
Hello!

2014-04-11 13:11 GMT-07:00 Yichun Zhang (agentzh):
> 对于这个样本,lringbuf 命令输出的最后几行是
>
> ->166 L=0x429d97b8 Context.lua:105 < Schema.lua:64
> <-166 L=0x429d97b8 @0x49ac16 < Context.lua:106 (direct exit, exitno=0)
> ->209 L=0x429d9578 redis.lua:166 < TaskQueueBase.lua:182
> Assertion failed: bc_isret(bc_op(ins[-1])) pt=0x40282da0 pc=0x410c3ccc
> <-209 L=0x429d9578 redis.lua:165 < TaskQueueBase.lua:182 (exitno=1)
>

另外,值得一提的是,你最好能提供这里涉及到的 .lua 文件的源码,即 Context.lua 和TaskQueueBase.lua,如果你方便的话。

多谢!
-agentzh

Yichun Zhang (agentzh)

unread,
Apr 11, 2014, 5:04:37 PM4/11/14
to openresty
Hello!

2014-04-11 13:11 GMT-07:00 Yichun Zhang (agentzh):
> 然后分别使用 gdb 自己的 disas 命令得到它们的机器码的完整列表:
>
> (gdb) disas <begin>, <end>
>
> 具体示例,可以参见 ltrace 命令的文档:
>
> https://github.com/agentzh/nginx-gdb-utils#ltrace
>

另外,这里需要提醒一下在运行 disas 命令之前,需要先设置 Intel 风格的汇编记法:

set disassembly-flavor intel

这样看起来方便。

多谢!

Regards,
-agentzh

秦晋

unread,
Apr 11, 2014, 11:35:35 PM4/11/14
to open...@googlegroups.com
嗯。两个gist都更新了。文件源码我整理一下给你
问题是不好放在这里。我要怎么提给你?


在 2014年4月12日星期六UTC+8上午5时04分37秒,agentzh写道:

Yichun Zhang (agentzh)

unread,
Apr 11, 2014, 11:40:42 PM4/11/14
to openresty
Hello!

2014-04-11 20:35 GMT-07:00 秦晋:
> 嗯。两个gist都更新了。

太好了!我会转给 Mike Pall 进行分析 :)

> 文件源码我整理一下给你
> 问题是不好放在这里。我要怎么提给你?
>

你可以直接作为邮件附件发到我的个人信箱(age...@gmail.com),我也会转给 Mike Pall.

Thanks!
-agentzh
Reply all
Reply to author
Forward
0 new messages