[BUG] Killing the event loop

18 views
Skip to first unread message

Felix Geisendörfer

unread,
Dec 27, 2009, 4:09:30 PM12/27/09
to nodejs
Pretty: http://github.com/felixge/node/commit/b035844a5084da9b85d55c61360d63bc99adedfb
Patch: http://github.com/felixge/node/commit/b035844a5084da9b85d55c61360d63bc99adedfb.patch

This test case demonstrates a race-condition that causes the event
loop to

freeze under a combination of circumstances:

* A heavy I/O operation such as reading or writing a big file
* Two nested posix.stat calls, the inner one using .wait()

My guess is that this is due to a bug in node's co-routine
implementation
or
the posix module. It could also be directly related to the event loop.

As I understand the limitations in node's co-routine implementation,
it
should
be safe to use .wait() when guaranteeing that no other .wait() calls
will
occur
until the first one has finished. If that assumption is correct, this
test
case
will hopefully be useful to discover the underlaying issue.

PS: I came across this issue while testing my hot code reloading patch
in
a
real world application. I reduced this test case as far as it was
possible
for
me.

Ciaran

unread,
Dec 27, 2009, 5:37:33 PM12/27/09
to nod...@googlegroups.com
2009/12/27 Felix Geisendörfer <fe...@debuggable.com>:
Is it possible that this is a test-case for my issue raised as :

http://github.com/ry/node/issues/#issue/27

It certainly sounds very similar, but my testcase is worryingly more trivial ;)
-cj.


> --
>
> You received this message because you are subscribed to the Google Groups "nodejs" group.
> To post to this group, send email to nod...@googlegroups.com.
> To unsubscribe from this group, send email to nodejs+un...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/nodejs?hl=en.
>
>
>

Felix Geisendörfer

unread,
Dec 28, 2009, 4:17:10 AM12/28/09
to nodejs
> Is it possible that this is a test-case for my issue raised as :
>
> http://github.com/ry/node/issues/#issue/27
>
> It certainly sounds very similar, but my testcase is worryingly more trivial ;)
> -cj.

Yes, this seems very related! I just wasn't able to demonstrate the
problem without doing some heavy I/O in parallel. Maybe the mkdir call
is more likely to trigger the issue than a second stat call is?

I actually had the misfortune of running into this while doing hot
code reloading for transload.it and suddenly saw my uploads stopping
without knowing why. I traced it down through the module / require()
system and found the network I/O to be completely unrelated. I really
wish I had run into this in a more trivial way like you : ).

Anyway, the attached test case does reliably trigger the issue. So
hopefully it will be good enough so that ryan can do some
investigation when he gets chance (he's pretty busy right now afaik).

--fg

On Dec 27, 11:37 pm, Ciaran <ciar...@gmail.com> wrote:
> 2009/12/27 Felix Geisendörfer <fe...@debuggable.com>:
>
>
>

> > Pretty:http://github.com/felixge/node/commit/b035844a5084da9b85d55c61360d63b...
> > Patch:http://github.com/felixge/node/commit/b035844a5084da9b85d55c61360d63b...

Felix Geisendörfer

unread,
Dec 28, 2009, 5:05:00 PM12/28/09
to nodejs
I spend some more time trying to debug this. As far as I can tell, it
seems like libeio properly queues the last stat request, but for some
reason never invokes EIOWantPoll / the After function for it.

That's probably not a great insight, but I learned a few things along
the way, hopefully I'll be able to find out more. Ryan, if you have
good tips on debugging libeio, I'd love to hear them - right now I'm
mostly working with fprintf as I don't know what I'm looking for so
gdb isn't all that helpful - right?

--fg

Felix Geisendörfer

unread,
Dec 29, 2009, 6:39:54 AM12/29/09
to nodejs
Got a little bit further this morning. It seems like the response for
the stat call gets pushed onto the response queue, but is never
shifted from it. I also observed that this always seems to happen when
there already were other responses in the queue, the problem never
occurs when there is just 1 response waiting to be shifted.

I'd like to dig further into this and try if I can come up with a
purse libeio test case for this. However, I'm not sure how to build
the demo.c that ships with it - it seems like it could also be
outdated and not working anymore. Let me know if you have a suggestion
on that Ryan.

--fg

Felix Geisendörfer

unread,
Dec 29, 2009, 6:42:51 AM12/29/09
to nodejs
Forgot to mention one more thing: I was able to reproduce Ciarans
issue and discovered that it only occurs when running the stat/mkdir
call right after node has loaded - it is very likely a side effect
from the I/O performed during the module loading.

--fg

r...@tinyclouds.org

unread,
Dec 29, 2009, 3:03:41 PM12/29/09
to nod...@googlegroups.com
Hi Felix,

I fixed one bug with this commit
http://github.com/ry/node/commit/0d7e88a4298740751bf0e8dabf9015baf88f455a
I already showed you this on IRC, and you said the problem continues;
but please "rebase" your debugging on to that so that we're not
looking at two issues.

Sorry for the slow response and bug fixes - I'm without computer
access at the moment.

Ryan

Felix Geisendörfer

unread,
Dec 29, 2009, 6:26:51 PM12/29/09
to nodejs
On Dec 29, 9:03 pm, r...@tinyclouds.org wrote:
> Hi Felix,
>
> I fixed one bug with this commithttp://github.com/ry/node/commit/0d7e88a4298740751bf0e8dabf9015baf88f...

> I already showed you this on IRC, and you said the problem continues;
> but please "rebase" your debugging on to that so that we're not
> looking at two issues.

Will do first thing tomorrow. Maybe this fixes the issue on Linux bot
not OSX, I'll check that specifically.

> Sorry for the slow response and bug fixes - I'm without computer
> access at the moment.

No worries, I love learning more about libeio and all the magic that
makes node awesome, so its great that I'm forced to dig into this
myself. Any help is of course much appreciated, but you being busy
right now does encourage me to figure out as much as I can, instead
of just asking you for help right away : ). I'm also learning more
about gdb which already helped me with some iPhone dev stuff I had to
do today!

Anyway, will post an update tomorrow.

--fg

On Dec 29, 9:03 pm, r...@tinyclouds.org wrote:
> Hi Felix,
>

> I fixed one bug with this commithttp://github.com/ry/node/commit/0d7e88a4298740751bf0e8dabf9015baf88f...

Felix Geisendörfer

unread,
Jan 1, 2010, 7:29:51 PM1/1/10
to nodejs
Ok, so it seems like your patch did indeed fix my initial eio test
case. However, I'm still experiencing issues with it.

Could you please try this test here and let me know if you get 'mkdir'
to never complete as well?

http://github.com/felixge/node/commit/b7f0c2d76eddcc61bbb5b7a05392e40510912b4b

Unfortunately this test case does not always demonstrate the problem,
but it's ~9/10 times for me locally on both OSX as well as Linux.

Thanks,
--fg

On Dec 30 2009, 12:26 am, Felix Geisendörfer <fe...@debuggable.com>

Ryan Dahl

unread,
Jan 5, 2010, 2:37:59 AM1/5/10
to nod...@googlegroups.com
2010/1/1 Felix Geisendörfer <fe...@debuggable.com>:

> Ok, so it seems like your patch did indeed fix my initial eio test
> case. However, I'm still experiencing issues with it.

Has 04dd2d5 fixed this?

Felix Geisendörfer

unread,
Jan 5, 2010, 6:28:24 AM1/5/10
to nodejs
On Jan 5, 8:37 am, Ryan Dahl <coldredle...@gmail.com> wrote:
> Has 04dd2d5 fixed this?

Just tested it on OS 10.6.2 / Ubuntu 9.10 - the problem still shows up
in ~90% of all runs : /.

How does 'test-mkdir-eio-race.js' behave on your end?

--fg

Ryan Dahl

unread,
Jan 5, 2010, 12:39:16 PM1/5/10
to nod...@googlegroups.com
2010/1/5 Felix Geisendörfer <fe...@debuggable.com>:

> On Jan 5, 8:37 am, Ryan Dahl <coldredle...@gmail.com> wrote:
>> Has 04dd2d5 fixed this?
>
> Just tested it on OS 10.6.2 / Ubuntu 9.10 - the problem still shows up
> in ~90% of all runs : /.
>
> How does 'test-mkdir-eio-race.js' behave on your end?

I think there an an error in your script because there is no call to
mkdir. Could you fix it? (Also, be sure to set the Errbacks.)

Please try the attached patch - this maybe the solution...

Ryan Dahl

unread,
Jan 5, 2010, 12:41:32 PM1/5/10
to nod...@googlegroups.com
Err.. the patch didn't get attached.
use_eio_done_poll.diff

Ryan Dahl

unread,
Jan 6, 2010, 5:12:33 AM1/6/10
to nod...@googlegroups.com
We think this is finally resolved in b1e126f4152c41d1879182f72741778657377c57.

Felix Geisendörfer

unread,
Jan 6, 2010, 9:55:57 AM1/6/10
to nodejs
I hope I am doing something terribly wrong, but the hot code reloading
I'm working on still fails under heavy disc I/O. Unfortunately I
wasn't able to come up with an independent test case, which increases
the chances of me doing something wrong or the problem being in the co-
routine implementation.

Basically this patch here adds support for require.hot(), a function
that behaves like require.aysnc(). The only difference is that this
patch replaces the global module cache with a module cache that can be
shared between one or more Module instances (and is usually inherited
from the parent module). require.hot() is different in that it creates
a new Module that has an empty cache, so all child-modules are forced
to reload. The patch also makes sure that require.hot() queues the
loading of modules, this is to avoid possible mind-boggling race-
conditions with the co-routine implementation:

http://github.com/felixge/node/commit/78fd926ce1f93e55187bb60a66d3689736a9a532

As demonstrated by the attached test case, hot reloading seems to work
just fine. However, when I perform hot code reloading during heavy I/
O, some require.hot() calls always get stuck to what could be yet
another libeio race condition:

http://github.com/felixge/node/commit/7bebc0f6ed2e24ec6e8a130dbd2bcf8e25112656

Hopefully there is something wrong with my approach to hot code
reloading, but the fact that this seems to be another I/O related race-
condition points towards libeio.

Ryan or anybody: Please try to reproduce the issue and post your
results here. Maybe we see some pattern that will allow for a more
simplified test case, but currently this is the best I got.

--fg

Reply all
Reply to author
Forward
0 new messages