Coroutines entering dead state on creation

264 views
Skip to first unread message

James Hurst

unread,
Dec 11, 2013, 11:57:15 AM12/11/13
to openre...@googlegroups.com
Hi, I'm having a weird issue with coroutines, and am struggling to debug it further so I thought I'd explain what I know in case anyone can help.

I'm using coroutines to stream http responses in http://github.com/pintsized/lua-resty-http, which under normal circumstances work fine. However if you have several upstream http requests, sometimes a coroutine will enter the dead state without having ever run. This appears to be quite random, but the more upstream requests you have in a given request the more likely you are to see it (say 10 upstream requests to the same resource, 3 or 4 of them may be dead on arrival).

Additionally, the ones which are dead change every time... It's rarely any of the first 4 requests, but otherwise it could be any combination of the remaining requests, including occasionally none.

I'd suspected there was just something silly wrong with my code, but it's definitely weirder than that. The coroutine is created (successfully, I get a thread value back), and then on the next line I call coroutine.status(co) and it returns "dead", which as I understand should be impossible?

I can't seem to recreate this with simpler code though, I have to use lua-resty-http and in a loop create connections. I can't help wondering if it's something to do with the fact that my coroutines operate on open cosockets. That is, if I take that factor out of the equation I can't replicate it.

But just to reiterate, when a given request fails the coroutine doesn't execute a single line (it can't be resumed to do so), and the socket that the coroutine would have read from is still perfectly fine and usable afterwards. That is, the error is definitely that I've created a coroutine but it has changed to "dead" immediately, rather than the coroutine tries to run and fails in some other way.

I went through the ngx_lua source and placed debug statements around every time the status is set to NGX_HTTP_LUA_CO_DEAD, and it would appear that none of them are responsible! The coroutine is created as suspended, but then immediately after, apparently not set by C at any point, the status has become dead.

Clutching at straws, but could the threads be garbage collected by accident or something?

I've created a Gist with example code to recreate and example output. I became aware of this because of this bug report regarding excessive memory usage: https://github.com/pintsized/lua-resty-http/issues/4 - which happens because if a coroutine dies when using coroutine.wrap, we blindly try to resume it and get a string error back, causing an infinite loop. So in my Gist you can see how I've reimplemented coroutine.wrap to stop this.


I also tried removing wrap from the equation completely, and using create/resume directly, but the effect is the same.

Any ideas on what I can try to narrow this down further?

Thanks.

--
James Hurst

Yichun Zhang (agentzh)

unread,
Dec 11, 2013, 2:14:28 PM12/11/13
to openresty-en
Hello!

On Wed, Dec 11, 2013 at 8:57 AM, James Hurst wrote:
>
> I'm using coroutines to stream http responses in
> http://github.com/pintsized/lua-resty-http, which under normal circumstances
> work fine.

Just as a side note: if you really want performance, then you should
avoid using coroutines as much as possible because the coroutine API
is kinda expensive and cannot be JIT compiled. It is recommended to
use simple stream API as in lua-resty-upload:

https://github.com/agentzh/lua-resty-upload

If you want parallelism in multiple backend requests, then you should
use the ngx.thread API (aka "light threads") instead:

https://github.com/chaoslawful/lua-nginx-module#ngxthreadspawn

> However if you have several upstream http requests, sometimes a
> coroutine will enter the dead state without having ever run.

Which version of ngx_lua (or ngx_openresty) are you using? Looks like
you're running into an old bug that was fixed in ngx_lua 0.9.1 (or
ngx_openresty 1.4.3.1):

* bugfix: the bookkeeping state for already-freed user threads might
be incorrectly used by newly-created threads that were completely
different, which could lead to bad results. thanks Sam Lawrence for
the report.

>
> I'd suspected there was just something silly wrong with my code, but it's
> definitely weirder than that. The coroutine is created (successfully, I get
> a thread value back), and then on the next line I call coroutine.status(co)
> and it returns "dead", which as I understand should be impossible?

Right, this is impossible.

>
> Any ideas on what I can try to narrow this down further?
>

If you're not using the latest ngx_lua, please consider upgrading first :)

Regards,
-agentzh

James Hurst

unread,
Dec 11, 2013, 6:30:00 PM12/11/13
to openre...@googlegroups.com
Hey,

On 11 December 2013 19:14, Yichun Zhang (agentzh) <age...@gmail.com> wrote:
Just as a side note: if you really want performance, then you should
avoid using coroutines as much as possible because the coroutine API
is kinda expensive and cannot be JIT compiled. It is recommended to
use simple stream API as in lua-resty-upload:

    https://github.com/agentzh/lua-resty-upload


Sure, I'll have a look at that. Coroutines make things very elegant in my opinion, especially in code paths that may dynamically introduce different filters to a stream. Perhaps there's a way to achieve the same thing by just maintaining state manually though... Initial indications are that lua-resty-http benches faster than subrequests + proxy module, so I've not been too worried! But I take your point.


> However if you have several upstream http requests, sometimes a
> coroutine will enter the dead state without having ever run.

Which version of ngx_lua (or ngx_openresty) are you using? Looks like
you're running into an old bug that was fixed in ngx_lua 0.9.1 (or
ngx_openresty 1.4.3.1):

 * bugfix: the bookkeeping state for already-freed user threads might
 be incorrectly used by newly-created threads that were completely
different, which could lead to bad results. thanks Sam Lawrence for
the report.

Hmm, well I'm on 1.4.3.3. Sorry I did mean to include that. Is it worth me reverting to pre 1.4.3.1 just to see if the behaviour is different in any way?

 
> I'd suspected there was just something silly wrong with my code, but it's
> definitely weirder than that. The coroutine is created (successfully, I get
> a thread value back), and then on the next line I call coroutine.status(co)
> and it returns "dead", which as I understand should be impossible?

Right, this is impossible.

Good, so I'm not crazy ;)

 
> Any ideas on what I can try to narrow this down further?
>

If you're not using the latest ngx_lua, please consider upgrading first :)

Any other ideas? :)

Thanks,

-- 
James Hurst

Yichun Zhang (agentzh)

unread,
Dec 13, 2013, 1:04:40 PM12/13/13
to openresty-en
Hello James!

On Wed, Dec 11, 2013 at 3:30 PM, James Hurst wrote:
>
> Hmm, well I'm on 1.4.3.3. Sorry I did mean to include that.

Interesting. Then there might be other remaining issues :)

> Is it worth me
> reverting to pre 1.4.3.1 just to see if the behaviour is different in any
> way?
>

No, it's not worth it at all. Let's focus on the issue you're having :)

>> > The coroutine is created (successfully, I
>> > get
>> > a thread value back), and then on the next line I call
>> > coroutine.status(co)
>> > and it returns "dead", which as I understand should be impossible?
>>

It's very likely that ngx_http_lua_get_co_ctx() called by
ngx_http_lua_coroutine_status() (for coroutine.status) got a bad
ngx_http_lua_co_ctx_t pointer. Could you add debugging code to check
if the resulting "coctx" pointer remains the same as the original
"coctx" created by ngx_http_lua_coroutine_create_helper?

>
> Any other ideas? :)
>

It'll be great if you can use valgrind's memcheck and/or clang ASan to
check if there are any memory issues in nginx running your code.
You're recommended to enable the no-pool patch when checking memory
issues. Because you're using the openresty bundle, just rebuild it
with the ./configure option --with-no-pool-patch. But don't enable the
no-pool patch for production use :)

I'm currently busy with some company work these days. I'll try to run
your code sample in order to reproduce the issue on my side later :)

Thanks!
-agentzh

James Hurst

unread,
Dec 13, 2013, 1:55:58 PM12/13/13
to openre...@googlegroups.com
Ok, great I'll try those suggestions next week. Incidentally I also see the issue on 1.4.3.6, but a colleague failed to recreate it, so I'll try and figure out what the difference is there too.

Thanks,

James.


-agentzh

--
You received this message because you are subscribed to the Google Groups "openresty-en" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openresty-en...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--

James Hurst

unread,
Dec 18, 2013, 5:46:19 AM12/18/13
to openre...@googlegroups.com
Hi,
 
It's very likely that ngx_http_lua_get_co_ctx() called by
ngx_http_lua_coroutine_status() (for coroutine.status) got a bad
ngx_http_lua_co_ctx_t pointer. Could you add debugging code to check
if the resulting "coctx" pointer remains the same as the original
"coctx" created by ngx_http_lua_coroutine_create_helper?

I've just checked this, and yes in every case where the coroutine is prematurely dead, the coctx pointer in ngx_http_lua_coroutine_create_helper and ngx_http_lua_coroutine_status are different to each other, whereas in the requests which work they are the same. 

I had a quick look at ngx_http_lua_get_co_ctx, it looks like there are multiple possible cases in the loop where coctx[i].co == L, and we're picking the first. Incidentally, if you pick the last the problem goes away but from what I can gather that just seems like a happy accident. The assumption presumably is that L should be a unique state in the request context?

I'm out of my depth here, but please let me know if there's anything else I can verify for you to help with this.
 
>
> Any other ideas? :)
>

It'll be great if you can use valgrind's memcheck and/or clang ASan to
check if there are any memory issues in nginx running your code.
You're recommended to enable the no-pool patch when checking memory
issues. Because you're using the openresty bundle, just rebuild it
with the ./configure option --with-no-pool-patch. But don't enable the
no-pool patch for production use :)

I ran valgrind, with the no-pool-patch enabled. init_by_lua requires "resty.http", and content_by_lua is as per the example in the gist. A ran this and then a single curl against the server which triggered the error. The output is here: https://gist.github.com/pintsized/7913674#file-valgrind-output

Cheers,

James.

Yichun Zhang (agentzh)

unread,
Dec 19, 2013, 12:48:34 AM12/19/13
to openresty-en
Hello James!

On Wed, Dec 18, 2013 at 2:46 AM, James Hurst wrote:
> I've just checked this, and yes in every case where the coroutine is
> prematurely dead, the coctx pointer in ngx_http_lua_coroutine_create_helper
> and ngx_http_lua_coroutine_status are different to each other, whereas in
> the requests which work they are the same.
>

This information is invaluable! I finally reproduce this issue with
the following minimal Lua code example:

local function f()
return 3
end

for i = 1, 10 do
collectgarbage()
local c = coroutine.create(f)
if coroutine.status(c) == "dead" then
ngx.say("found a dead coroutine")
return
end
coroutine.resume(c)
end
ngx.say("ok")

I've just committed a patch to ngx_lua's git master:

https://github.com/chaoslawful/lua-nginx-module/commit/8f4c485

The patch includes the example above as a test case in the test suite.

Will you try out the latest master of ngx_lua on your side?

Thank you for your report and analysis!

> I had a quick look at ngx_http_lua_get_co_ctx, it looks like there are
> multiple possible cases in the loop where coctx[i].co == L, and we're
> picking the first. Incidentally, if you pick the last the problem goes away
> but from what I can gather that just seems like a happy accident. The
> assumption presumably is that L should be a unique state in the request
> context?
>

It should. But the old buggy implementation did not guarantee that.
Hopefully the new implementation no longer has this issue :)

> I ran valgrind, with the no-pool-patch enabled. init_by_lua requires
> "resty.http", and content_by_lua is as per the example in the gist. A ran
> this and then a single curl against the server which triggered the error.
> The output is here:
> https://gist.github.com/pintsized/7913674#file-valgrind-output
>

Just a quick note: all the things in your valgrind output are false
positives. You should have passed the valgrind option

--suppressions=/path/to/luajit-2.0/src/lj.supp

to suppress these :)

Best regards,
-agentzh

James Hurst

unread,
Dec 19, 2013, 5:16:58 AM12/19/13
to openre...@googlegroups.com
Hi,
 
I've just committed a patch to ngx_lua's git master:

    https://github.com/chaoslawful/lua-nginx-module/commit/8f4c485

The patch includes the example above as a test case in the test suite.

Will you try out the latest master of ngx_lua on your side?

Thank you for your report and analysis!

Brilliant, thanks! The latest master is working fine on my side now. Thanks for the speedy response as always! I'll ask the original reporter to confirm in their environment too.
 
> I ran valgrind, with the no-pool-patch enabled. init_by_lua requires
> "resty.http", and content_by_lua is as per the example in the gist. A ran
> this and then a single curl against the server which triggered the error.
> The output is here:
> https://gist.github.com/pintsized/7913674#file-valgrind-output
>

Just a quick note: all the things in your valgrind output are false
positives. You should have passed the valgrind option

    --suppressions=/path/to/luajit-2.0/src/lj.supp

to suppress these :)

Ah yes, that makes a bit more sense!

James.
Reply all
Reply to author
Forward
0 new messages