openresty 做的websocket push存在一天几百MB的内存泄漏

oysoft

unread,

Mar 8, 2016, 8:26:19 PM3/8/16

to openresty

因为工作需要，我用 openresty 做了一个websocket推送.在线上环境中存在一天几百MB的内存泄漏，尝试定位问题多次都失败了。

我的设计是：客户端 websocket 协议连上来，lua-resty-websocket 完成 http 到 websocket 的协议转换。

在解析请求后创建一个轻量级线程 (ngx.thread.spawn)，轻量级线程连接redis负责接受redis推送并直接把数据push给客户端，

如果客户端断开连接，就用ngx.thread.kill结束对应lightthread。

一天内存使用曲线大致为：

0:00到8:00 1.39G

8:00到11:00 随着连接量上升而上升，最高连接量6000，内存到1.81G

11:00到17:00 连接量上下波动，内存在随之波动。

17:00之后连接量逐渐下降到100以下，内存逐步下降，最后却稳定在1.66G。

第二天类似，不过起始内存确是1.66G，一天增长270M。

2w连接压测时，使用nginx-gdb-utils分析：

(gdb) lgcstat

42778 str objects: max=851, avg = 41, min=18, sum=1781624

2202 upval objects: max=24, avg = 24, min=24, sum=52848

36031 thread objects: max=976, avg = 748, min=424, sum=26963272

73 proto objects: max=2579, avg = 336, min=80, sum=24543

20479 func objects: max=144, avg = 21, min=20, sum=430920

60 trace objects: max=2308, avg = 614, min=160, sum=36892

0 cdata objects: max=0, avg = 0, min=0, sum=0

173891 tab objects: max=524328, avg = 121, min=32, sum=21211768

94408 udata objects: max=1376, avg = 239, min=32, sum=22592249

sizeof strhash 262144

sizeof g->tmpbuf 1024

sizeof ctype_state 0

sizeof jit_state 3864

total sz 73366380

g->strnum 42778, g->gc.total 73366380

elapsed: 37.780000 sec

有两个问题：

1)压测结束后，内存实际增长700M多，而lgcstat显示的g->gc.total不超过100M，其他内存是谁分配的，为什么没有释放。

2)压测结束后，使用lgcstat获得thread还有3万多个，感觉lightthread并没有收回。

DeJiang Zhu

unread,

Mar 8, 2016, 9:34:25 PM3/8/16

to open...@googlegroups.com

Hello,

在 2016年3月9日上午9:26，oysoft <oycw...@163.com>写道：

因为工作需要，我用 openresty 做了一个websocket推送.在线上环境中存在一天几百MB的内存泄漏，尝试定位问题多次都失败了。

建议你描述一下，你线上环境的泄露情况（貌似你下面的描述里只是压测的？）

我的设计是：客户端 websocket 协议连上来，lua-resty-websocket 完成 http 到 websocket 的协议转换。
在解析请求后创建一个轻量级线程 (ngx.thread.spawn)，轻量级线程连接redis负责接受redis推送并直接把数据push给客户端，
如果客户端断开连接，就用ngx.thread.kill结束对应lightthread。

也可以不用自己 kill，直接 ngx.exit 就都退出了

一天内存使用曲线大致为：
0:00到8:00 1.39G
8:00到11:00 随着连接量上升而上升，最高连接量6000，内存到1.81G
11:00到17:00 连接量上下波动，内存在随之波动。
17:00之后连接量逐渐下降到100以下，内存逐步下降，最后却稳定在1.66G。
第二天类似，不过起始内存确是1.66G，一天增长270M。

这是压测，还是线上，这些时间发生了什么呢？

2w连接压测时，使用nginx-gdb-utils分析：
(gdb) lgcstat
42778 str objects: max=851, avg = 41, min=18, sum=1781624
2202 upval objects: max=24, avg = 24, min=24, sum=52848
36031 thread objects: max=976, avg = 748, min=424, sum=26963272
73 proto objects: max=2579, avg = 336, min=80, sum=24543
20479 func objects: max=144, avg = 21, min=20, sum=430920
60 trace objects: max=2308, avg = 614, min=160, sum=36892
0 cdata objects: max=0, avg = 0, min=0, sum=0
173891 tab objects: max=524328, avg = 121, min=32, sum=21211768
94408 udata objects: max=1376, avg = 239, min=32, sum=22592249

sizeof strhash 262144
sizeof g->tmpbuf 1024
sizeof ctype_state 0
sizeof jit_state 3864

total sz 73366380
g->strnum 42778, g->gc.total 73366380
elapsed: 37.780000 sec

73M，这个值看来比较正常

有两个问题：
1)压测结束后，内存实际增长700M多，而lgcstat显示的g->gc.total不超过100M，其他内存是谁分配的，为什么没有释放。

你是通过什么看的内存呢？

2)压测结束后，使用lgcstat获得thread还有3万多个，感觉lightthread并没有收回。

Lua 的 gc 默认是比较懒的，最靠谱的是，你自己手动 gc 一遍，或者 collectgarbage("setpause", 90) 让 gc 更加勤快一些

The garbage-collector pause controls how long the collector waits before starting a new cycle. Larger values make the collector less aggressive. Values smaller than 100 mean the collector will not wait to start a new cycle. A value of 200 means that the collector waits for the total memory in use to double before starting a new cycle.

http://www.lua.org/manual/5.1/manual.html#2.10

--
--
邮件来自列表“openresty”,专用于技术讨论!
订阅: 请发空白邮件到 openresty...@googlegroups.com
发言: 请发邮件到 open...@googlegroups.com
退订: 请发邮件至 openresty+...@googlegroups.com
归档: http://groups.google.com/group/openresty
官网: http://openresty.org/
仓库: https://github.com/agentzh/ngx_openresty
教程: http://openresty.org/download/agentzh-nginx-tutorials-zhcn.html

oysoft

unread,

Mar 8, 2016, 10:50:57 PM3/8/16

to openresty

一天内存使用曲线大致为：

0:00到8:00 1.39G

8:00到11:00 随着连接量上升而上升，最高连接量6000，内存到1.81G

11:00到17:00 连接量上下波动，内存在随之波动。

17:00之后连接量逐渐下降到100以下，内存逐步下降，最后却稳定在1.66G。

第二天类似，不过起始内存确是1.66G，一天增长270M。

这是线上环境，平均6000长连接，凌晨比较，一天增长200多M，密切关注两周一直这种情况。

后面的是压测定位问题，2万压测，压测结束后稳定后内存增长700M多，分析结果如下，gc大小不超过70M。

oysoft

unread,

Mar 9, 2016, 8:46:29 PM3/9/16

to openresty

我用valgrind跟踪了一下。发现nginx有明确内存泄漏：

在ngx_http_lua_util.c文件ngx_http_lua_run_thread函数中的

rv = lua_resume(orig_coctx->co, nrets);

行。

最后应该调到ngx_http_lua_coroutine.c文件ngx_http_lua_coroutine_resume函数中。

不知道是否ngx_http_lua_coroutine_resume函数存在泄漏？

==00:00:06:47.124 4664== 250,432 bytes in 3,913 blocks are definitely lost in loss record 242 of 245

==00:00:06:47.124 4664== at 0x4C2828A: malloc (vg_replace_malloc.c:299)

==00:00:06:47.124 4664== by 0x9F51D0D: ???

==00:00:06:47.124 4664== by 0x9F4FAA4: ???

==00:00:06:47.124 4664== by 0x9F50036: ???

==00:00:06:47.124 4664== by 0x528E886: lj_BC_FUNCC (in /data/usr_local/lib/libluajit-5.1.so.2.1.0)

==00:00:06:47.124 4664== by 0x4C1C48: ngx_http_lua_run_thread (ngx_http_lua_util.c:1015)

==00:00:06:47.124 4664== by 0x4CBDB1: ngx_http_lua_socket_tcp_resume_helper (ngx_http_lua_socket_tcp.c:5190)

==00:00:06:47.124 4664== by 0x4D0B71: ngx_http_lua_socket_handle_read_success (ngx_http_lua_socket_tcp.c:2973)

==00:00:06:47.124 4664== by 0x4D0B71: ngx_http_lua_socket_tcp_read (ngx_http_lua_socket_tcp.c:2092)

==00:00:06:47.124 4664== by 0x4CBC49: ngx_http_lua_socket_tcp_handler (ngx_http_lua_socket_tcp.c:2724)

==00:00:06:47.124 4664== by 0x4341E5: ngx_event_process_posted (ngx_event_posted.c:33)

==00:00:06:47.124 4664== by 0x43B0A4: ngx_worker_process_cycle (ngx_process_cycle.c:753)

==00:00:06:47.124 4664== by 0x4394E3: ngx_spawn_process (ngx_process.c:198)

DeJiang Zhu

unread,

Mar 9, 2016, 9:03:44 PM3/9/16

to open...@googlegroups.com

Hello

你是否有打上 no-pool 的补丁

首先，我认为泄露的可能性不大（我也有类似的系统，测试和线上都没有发现这个问题）

建议你先这样测试：

用新的 1.9.7.3 的版本（优化了请求常驻下 cosocket 的内存使用 [1] ）

多轮压测，看你的内存是不是会一直涨（我不大清楚你怎么看的内存，或许是系统层面有些内存释放不及时）

1. https://github.com/openresty/lua-nginx-module/pull/590

--

oysoft

unread,

Mar 10, 2016, 1:13:22 AM3/10/16

to openresty

Hi，多谢您一再的回复，还得继续麻烦您。

我用1.9.7.3版本打上no-pool补丁，修改nginx.conf如附件。里面只有几行lua代码：

location /push {

content_by_lua_block{

local wserver = require "resty.websocket.server"

local wb, err = wserver:new {

timeout = 60000,

max_payload_len = 65535

}

ngx.sleep(60)

wb:send_close()

}

在1万并发压测下，也发生内存泄漏，使用pmap获取压测前后内存映射（附件中18433.1.map是压测前，18433.2.map是之后)。比较差异发现：

压测后多了不少anon段，应该就是没有释放的内存。

在 2016年3月10日星期四 UTC+8上午10:03:44，doujiang写道：

nginx.conf

18433.1.map

18433.2.map

DeJiang Zhu

unread,

Mar 10, 2016, 10:52:08 PM3/10/16

to open...@googlegroups.com

Hello,

在 2016年3月10日下午2:13，oysoft <oycw...@163.com>写道：

Hi，多谢您一再的回复，还得继续麻烦您。
我用1.9.7.3版本打上no-pool补丁

在使用 valgrind 跟踪的时候，才需要打上这个补丁

修改nginx.conf如附件。里面只有几行lua代码：
location /push {
content_by_lua_block{
local wserver = require "resty.websocket.server"
local wb, err = wserver:new {
timeout = 60000,
max_payload_len = 65535
}
ngx.sleep(60)
wb:send_close()
}
}
在1万并发压测下，也发生内存泄漏

如我上一篇提到的，最好看下是不是内存有持续增长

使用pmap获取压测前后内存映射（附件中18433.1.map是压测前，18433.2.map是之后)。比较差异发现：
压测后多了不少anon段，应该就是没有释放的内存。

这个我也不清楚了

Yichun Zhang (agentzh)

unread,

Mar 16, 2016, 3:34:12 PM3/16/16

to openresty

Hello!

On Wed, Mar 9, 2016 at 5:46 PM, oysoft wrote:
> 我用valgrind跟踪了一下。发现nginx有明确内存泄漏：
> 在ngx_http_lua_util.c文件ngx_http_lua_run_thread函数中的
> rv = lua_resume(orig_coctx->co, nrets);
> 行。
> 最后应该调到ngx_http_lua_coroutine.c文件ngx_http_lua_coroutine_resume函数中。
> 不知道是否ngx_http_lua_coroutine_resume函数存在泄漏？
> ==00:00:06:47.124 4664== 250,432 bytes in 3,913 blocks are definitely lost
> in loss record 242 of 245
> ==00:00:06:47.124 4664== at 0x4C2828A: malloc (vg_replace_malloc.c:299)
> ==00:00:06:47.124 4664== by 0x9F51D0D: ???
> ==00:00:06:47.124 4664== by 0x9F4FAA4: ???
> ==00:00:06:47.124 4664== by 0x9F50036: ???
> ==00:00:06:47.124 4664== by 0x528E886: lj_BC_FUNCC (in
> /data/usr_local/lib/libluajit-5.1.so.2.1.0)
> ==00:00:06:47.124 4664== by 0x4C1C48: ngx_http_lua_run_thread
> (ngx_http_lua_util.c:1015)
> ==00:00:06:47.124 4664== by 0x4CBDB1:
> ngx_http_lua_socket_tcp_resume_helper (ngx_http_lua_socket_tcp.c:5190)
> ==00:00:06:47.124 4664== by 0x4D0B71:

这个看起来是你自己的 Lua 代码调用的外部 C 函数触发的 malloc 调用（假设你使用的是
LuaJIT）。在这个位置，cosocket 操作其实已经结束了，之所以会出现在调用栈中，是因为 continuation 编程方式的缘故。

你是否有使用第三方或自己的 Lua C 库？建议为你使用的 Lua C 库启用调试符号（注意你上面的调用栈有三个帧缺少调试符号），这样你可以看到。

Regards,
-agentzh

oysoft

unread,

Mar 17, 2016, 12:39:29 AM3/17/16

to openresty

多谢春哥，我的确使用直接的Lua C模块，我跟进一下。

我加上--with-debug，发现大量nginx worker异常退出记录：

worker process 24935 exited on signal 6

../ngx_lua-0.10.0/src/ngx_http_lua_util.c:1013: ngx_http_lua_run_thread: Assertion `orig_coctx->co_top + nrets == lua_g

ettop(orig_coctx->co)' failed.

不知道怎么回事。

在 2016年3月17日星期四 UTC+8上午3:34:12，agentzh写道：

Yichun Zhang (agentzh)

unread,

Mar 24, 2016, 3:49:22 PM3/24/16

to openresty

Hello!

2016-03-16 21:39 GMT-07:00 oysoft:
> 多谢春哥，我的确使用直接的Lua C模块，我跟进一下。
> 我加上--with-debug，发现大量nginx worker异常退出记录：
> worker process 24935 exited on signal 6
> ../ngx_lua-0.10.0/src/ngx_http_lua_util.c:1013: ngx_http_lua_run_thread:
> Assertion `orig_coctx->co_top + nrets == lua_g
> ettop(orig_coctx->co)' failed.

看起来你使用的自己的 Lua C 模块有内存问题，把堆给搞坏了，于是进入不确定性的世界，一切奇怪的事情都会发生……直到崩溃。

Regards,
-agentzh

wd

unread,

Mar 24, 2016, 10:06:10 PM3/24/16

to open...@googlegroups.com

咦？不加 debug 编译的话，worker 退出记录应该也会在 error_log 里面？

--

oysoft

unread,

Mar 25, 2016, 1:30:43 AM3/25/16

to openresty

在尝试解决问题时：

我使用--with-debug --with-lua51编译openresty。

而--with-debug开启了NGX_LUA_USE_ASSERT宏，这个宏和lua5.1配合好像有问题，才这样，这个过程我已经剔除了自己的C扩展。

春哥看看这个组合有问题吗？

在 2016年3月25日星期五 UTC+8上午3:49:22，agentzh写道：

Yichun Zhang (agentzh)

unread,

Mar 26, 2016, 1:20:30 PM3/26/16

to openresty

Hello!

2016-03-24 22:30 GMT-07:00 oysoft:
> 在尝试解决问题时：
> 我使用--with-debug --with-lua51编译openresty。
> 而--with-debug开启了NGX_LUA_USE_ASSERT宏，这个宏和lua5.1配合好像有问题，才这样，这个过程我已经剔除了自己的C扩展。
> 春哥看看这个组合有问题吗？
>

这个组合没有问题。NGX_LUA_USE_ASSERT 只是启用 ngx_lua 自己的 assert() 断言，和 Lua 语言的实现无关。

你调试内存问题应该使用 Valgrind 或者 Addr essSanitizer 这样的工具进行分析。

Regards,
-agentzh

oysoft

unread,

Apr 7, 2016, 8:50:45 PM4/7/16

to openresty

问题解决了，是C扩展写得有问题，在某些场景下发生一些奇怪的问题。多谢agentzh。

在 2016年3月27日星期日 UTC+8上午1:20:30，agentzh写道：

mjay...@gmail.com

unread,

Jun 17, 2016, 8:17:10 PM6/17/16

to openresty

hi oysoft

我们也有类似的内存泄露情况，还没找到原因。。

请教下，你是怎么找到原因的，方便解释下是哪个c扩展在哪种情况下发生什么样的问题吗，避免其他小伙伴踩坑:)

在 2016年4月8日星期五 UTC+8上午8:50:45，oysoft写道：

oysoft

unread,

Jun 22, 2016, 3:47:15 AM6/22/16

to openresty

C扩展中使用cJSON库，有些条件下没有进行cJSON_Delete。造成了内存泄漏。

在 2016年6月18日星期六 UTC+8上午8:17:10，mjay...@gmail.com写道：

Reply all

Reply to author

Forward