This test case demonstrates a race-condition that causes the event
loop to
freeze under a combination of circumstances:
* A heavy I/O operation such as reading or writing a big file
* Two nested posix.stat calls, the inner one using .wait()
My guess is that this is due to a bug in node's co-routine
implementation
or
the posix module. It could also be directly related to the event loop.
As I understand the limitations in node's co-routine implementation,
it
should
be safe to use .wait() when guaranteeing that no other .wait() calls
will
occur
until the first one has finished. If that assumption is correct, this
test
case
will hopefully be useful to discover the underlaying issue.
PS: I came across this issue while testing my hot code reloading patch
in
a
real world application. I reduced this test case as far as it was
possible
for
me.
http://github.com/ry/node/issues/#issue/27
It certainly sounds very similar, but my testcase is worryingly more trivial ;)
-cj.
> --
>
> You received this message because you are subscribed to the Google Groups "nodejs" group.
> To post to this group, send email to nod...@googlegroups.com.
> To unsubscribe from this group, send email to nodejs+un...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/nodejs?hl=en.
>
>
>
Yes, this seems very related! I just wasn't able to demonstrate the
problem without doing some heavy I/O in parallel. Maybe the mkdir call
is more likely to trigger the issue than a second stat call is?
I actually had the misfortune of running into this while doing hot
code reloading for transload.it and suddenly saw my uploads stopping
without knowing why. I traced it down through the module / require()
system and found the network I/O to be completely unrelated. I really
wish I had run into this in a more trivial way like you : ).
Anyway, the attached test case does reliably trigger the issue. So
hopefully it will be good enough so that ryan can do some
investigation when he gets chance (he's pretty busy right now afaik).
--fg
On Dec 27, 11:37 pm, Ciaran <ciar...@gmail.com> wrote:
> 2009/12/27 Felix Geisendörfer <fe...@debuggable.com>:
>
>
>
> > Pretty:http://github.com/felixge/node/commit/b035844a5084da9b85d55c61360d63b...
> > Patch:http://github.com/felixge/node/commit/b035844a5084da9b85d55c61360d63b...
That's probably not a great insight, but I learned a few things along
the way, hopefully I'll be able to find out more. Ryan, if you have
good tips on debugging libeio, I'd love to hear them - right now I'm
mostly working with fprintf as I don't know what I'm looking for so
gdb isn't all that helpful - right?
--fg
I'd like to dig further into this and try if I can come up with a
purse libeio test case for this. However, I'm not sure how to build
the demo.c that ships with it - it seems like it could also be
outdated and not working anymore. Let me know if you have a suggestion
on that Ryan.
--fg
--fg
I fixed one bug with this commit
http://github.com/ry/node/commit/0d7e88a4298740751bf0e8dabf9015baf88f455a
I already showed you this on IRC, and you said the problem continues;
but please "rebase" your debugging on to that so that we're not
looking at two issues.
Sorry for the slow response and bug fixes - I'm without computer
access at the moment.
Ryan
Will do first thing tomorrow. Maybe this fixes the issue on Linux bot
not OSX, I'll check that specifically.
> Sorry for the slow response and bug fixes - I'm without computer
> access at the moment.
No worries, I love learning more about libeio and all the magic that
makes node awesome, so its great that I'm forced to dig into this
myself. Any help is of course much appreciated, but you being busy
right now does encourage me to figure out as much as I can, instead
of just asking you for help right away : ). I'm also learning more
about gdb which already helped me with some iPhone dev stuff I had to
do today!
Anyway, will post an update tomorrow.
--fg
On Dec 29, 9:03 pm, r...@tinyclouds.org wrote:
> Hi Felix,
>
> I fixed one bug with this commithttp://github.com/ry/node/commit/0d7e88a4298740751bf0e8dabf9015baf88f...
Could you please try this test here and let me know if you get 'mkdir'
to never complete as well?
http://github.com/felixge/node/commit/b7f0c2d76eddcc61bbb5b7a05392e40510912b4b
Unfortunately this test case does not always demonstrate the problem,
but it's ~9/10 times for me locally on both OSX as well as Linux.
Thanks,
--fg
On Dec 30 2009, 12:26 am, Felix Geisendörfer <fe...@debuggable.com>
Has 04dd2d5 fixed this?
Just tested it on OS 10.6.2 / Ubuntu 9.10 - the problem still shows up
in ~90% of all runs : /.
How does 'test-mkdir-eio-race.js' behave on your end?
--fg
I think there an an error in your script because there is no call to
mkdir. Could you fix it? (Also, be sure to set the Errbacks.)
Please try the attached patch - this maybe the solution...
Basically this patch here adds support for require.hot(), a function
that behaves like require.aysnc(). The only difference is that this
patch replaces the global module cache with a module cache that can be
shared between one or more Module instances (and is usually inherited
from the parent module). require.hot() is different in that it creates
a new Module that has an empty cache, so all child-modules are forced
to reload. The patch also makes sure that require.hot() queues the
loading of modules, this is to avoid possible mind-boggling race-
conditions with the co-routine implementation:
http://github.com/felixge/node/commit/78fd926ce1f93e55187bb60a66d3689736a9a532
As demonstrated by the attached test case, hot reloading seems to work
just fine. However, when I perform hot code reloading during heavy I/
O, some require.hot() calls always get stuck to what could be yet
another libeio race condition:
http://github.com/felixge/node/commit/7bebc0f6ed2e24ec6e8a130dbd2bcf8e25112656
Hopefully there is something wrong with my approach to hot code
reloading, but the fact that this seems to be another I/O related race-
condition points towards libeio.
Ryan or anybody: Please try to reproduce the issue and post your
results here. Maybe we see some pattern that will allow for a more
simplified test case, but currently this is the best I got.
--fg