experiencing infinite loop in net.js list.callback at line 141

49 views
Skip to first unread message

Woody Anderson

unread,
Jan 26, 2011, 2:28:36 PM1/26/11
to nodejs
My app runs for a while, i have 200 client connection objects that
each have a 10 sec timeout, at some point the application goes
completely unresponsive and cpu spikes.
I attached the debugger and have found that the inner 'timeout' object
defined in net.js defines a callback which loops through a linked list
of sockets and either delivers a timeout event and removes the socket
from the list or (if the socket was active) exits the loop.

What i'm seeing is that the list is not shortened by the
remove(socket) method. By inspection it seems that the _idleNext
member of each socket points directly to the timer object (when i
attach to a working process, only the first element has this
characteristic, in general it seems to be a dual-linked-list). This
basically breaks the dual-link-list aspect of the list, and as such
the length of the list seems unchanged when determined using the
following code. Note that the first element does change, but it simply
loops over the same set of elements over and over.

I wrote a quick/dirty loop scanner based on a rather uninformed
interpretation of what the timeout lists are supposed to look like
when they are well formed, i then defined that in the debugger and
used it to inspect the list as the loop progresses etc.

Important information about what my callbacks do when handling the
timeout event, which may be against some unknown to me rules of
behavior for handling timeouts.

when the client's socket times out, i destroy the client and then i
emit an error using the client event emitter to hook some application
bookeeping.

request.connection._x_to = function() {
client.destroy();
client.emit('error', 'timeout');
};

I'm wondering if it's potentially a problem that an event is emitted
inside an event handler.. but either way i'm having a hard time
figuring out how the linked list gets so corrupted that it causes an
infinite loop by virtue of remove(socket) not being effective.

I'm quite concerned about this inf loop scenario, as It's not
predictable, but does occur a lot when the remote server is slow to
respond to my application.
Thoughts? anyone else run across this issue? this is using node 2.1,
though the timeout is unchanged in 2.5.
-w


here's the code i used to count and show the client sockets in the
debugger (all of which my app has elsewhere tagged with _x_client_id
to help me debug):
list.count_nodes = function() {
var c = 0, end = this._idlePrev, n = this._idlePrev, t =
end._idleNext;
var ids = [];
ids.push(n._x_client_id || 'LIST');
while( n._idlePrev != end && n._idlePrev != this ) {
++c;
ids.push(n._idlePrev._x_client_id || 'LIST');
n = n._idlePrev;
if( c > 10000 ) {
sys.puts(c + ': aborted scan: ' + c + ' ' +
JSON.stringify(ids));
return -c;
}
}
sys.puts(c + ': hit loop: ' + c + ' ' + JSON.stringify(ids));
return c;
}

Ryan Dahl

unread,
Jan 26, 2011, 5:26:55 PM1/26/11
to nod...@googlegroups.com
On Wed, Jan 26, 2011 at 11:28 AM, Woody Anderson
<woody.a...@gmail.com> wrote:
> My app runs for a while, i have 200 client connection objects that
> each have a 10 sec timeout, at some point the application goes
> completely unresponsive and cpu spikes.
> I attached the debugger and have found that the inner 'timeout' object
> defined in net.js defines a callback which loops through a linked list
> of sockets and either delivers a timeou

Is it possible to run this on v0.3.6? There have been significant
changes to this code - the link list is now a separate module and
tested independently.

Ryan

Woody Anderson

unread,
Jan 26, 2011, 7:06:48 PM1/26/11
to nodejs
unfortunately i only see this erratic behavior when running in wild/
production.
i have just gotten a sanctioned 2.5 based build for prod (no real
change in the timeout/lists code for 2.5 tho).
I'll try and see if i can get the ok to try out 'unstable' in that
environment for a short run, otherwise.. i'm stuck on the 2.x train
until there is a 'stable' 3.x.

I have changed the timeout callback to look like this:
request.connection._x_to = function() {
setTimeout(function() {
client.destroy();
client.emit('error', 'timeout');
}, 1);
};

time will tell if this version of the callback chain also inf-loops,
though for the last few hour or so it hasn't borked.

still seeing plenty of the timeout (net.js:149) assert error issues
i've posted about earlier, so a rewritten and testable cnxn timeout
system is very welcome news.
-w


On Jan 26, 2:26 pm, Ryan Dahl <r...@tinyclouds.org> wrote:

Woody Anderson

unread,
Jan 27, 2011, 3:30:40 PM1/27/11
to nodejs
Ry,
Ok. I think I now understand the issue quite a bit better.

The timer.remove(socket) method does not properly reset the state of
the socket.[next/prev], such that if client.destroy() [which in turn
calls timer.unenroll(socket)] is called, then the socket is removed
from the list again (b/c socket.next is still set).

if this occurs immediately (ie called from within timeout callback)
the double remove is a no-op b/c the next/prev pointers are still in
the same place as before. However, if the client.destroy() is deferred
(using setTimeout etc) then the list order can change due to socket
activity which would move active sockets to the end of the list. Thus
there is an arbitrary shuffling of the order of the list before
timeout.unenroll() is called. This would cause the second pass of the
removal logic to corrupt the linked list. The nature of the corruption
is a function of the shuffling that occurred; it could be a loop, it
could orphan part of the list but still look valid etc.

The solution (i think) is to change the logic of remove() to include:
socket._idlePrev = socket._idleNext = socket; (or null if you prefer)
and to call remove from unenroll and other places where removal is the
intention, or to add such state reset logic to all places that are
inlining the remove() code.

This is a pretty major issue. I know it has been addressed in 3.6 (by
code inspection), but i think it should be fixed in the 2.x branch in
preparation for 2.7 etc.
the fix is fairly simple and does not require the full _linkedlist
module lift (tho that would be cool too).
agree?
-w

Woody Anderson

unread,
Feb 9, 2011, 4:03:13 PM2/9/11
to nodejs
i was able to resolve (mostly) this issue in my code by controlling
how/when client.destroy() is called.
rather than simply calling client.destroy(), i have a wrapper that i
use to avoid the internal node bugs.

function destroyClient(client) {
if( client._idleNext ) {
if( client._idleNext._idlePrev === this ) {
client.destroy();
}
client._idleNext = client._idlePrev = undefined;
client._idleTimeout = undefined;
}
}

this doesn't solve every timer assert issue, b/c there are code paths
inside timer/net that call detroy() or unenroll directly.
but my error rate fell dramatically with this change.
-w
Reply all
Reply to author
Forward
0 new messages