process._tickCallback and where dispatch happens

503 views
Skip to first unread message

Matt Ginzton

unread,
Oct 7, 2014, 5:54:24 PM10/7/14
to nod...@googlegroups.com
I've got a strange problem with an app running on NodeJS, which I'd appreciate any help with.

The problem manifests as either TypeError trying to access objects that haven't been initialized consistently, or AssertionError due to asserts in my code finding violations of invariants that should hold -- in both cases, it looks like the code was invoked with other data structures in an inconsistent state. This could just be a bug in my code, sure, except for a peculiarity with _tickCallback getting called seemingly out of nowhere as you'll see below.

My understanding of the NodeJS concurrency model is that what's going on here should be impossible -- I'd explain the concurrency model as, there's only one thread and no preemption, scheduling happens only at the level of the event loop, synchronous code runs until it decides to yield by returning to the event loop, so the only boundary where code needs to worry about data structures changing out from under it is when scheduling a callback asynchronously (with a trip back to the event loop dispatch mechanism in between). Or said more shortly, every synchronous invocation is essentially a critical section until it returns all the way back to Node's event loop.

But then I get an exception with a call stack like:

TypeError: Cannot call method 'end' of undefined
      at Request.end (/opt/buildbot-slave/tree/build/node_modules/request/request.js:1320:12)
      at /opt/buildbot-slave/tree/build/node_modules/request/request.js:418:14
      at process._tickCallback (node.js:415:13)
      at /opt/buildbot-slave/tree/build/node_modules/request/lib/copy.js:5:10
      at Array.forEach (native)
      at copy (/opt/buildbot-slave/tree/build/node_modules/request/lib/copy.js:4:20)
      at Request.start (/opt/buildbot-slave/tree/build/node_modules/request/request.js:668:20)
      at Request.end (/opt/buildbot-slave/tree/build/node_modules/request/request.js:1319:28)
      at Object.exports.request (/opt/buildbot-slave/tree/build/node_modules/soap/lib/http.js:55:11)

Note the call to process._tickCallback in the middle of this stack trace, 3 frames down.

Is this actually legal and possible? According to my understanding of the NodeJS concurrency model (see above), no; if _tickCallback can be called from arbitrary points (and then dispatch into other arbitrary code), it seems like this would break the world. But I haven't found any official documentation that explicitly rules this out.

I see 3 possibilities:

- this is happening and is legal: then I need to completely rethink my understanding of data structure consistency in NodeJS.
- this is happening and is not legal: through what bug is it happening, then?
- it's not actually happening and the call stack is a lie: but then I need some other explanation for the data structure inconsistency that my assertions are observing.

Clearly, none of these possibilities is attractive :)

For what it's worth, the code at node_modules/request/lib/copy.js:5:10 which is alleged to have triggered a call to _tickCallback is the second line of

  Object.keys(obj).forEach(function (i) {
    o[i] = obj[i]

that is, just a property lookup against a JS object.

I did dig into the NodeJS source trying to find all callers of _tickCallback; it seems mostly it's used by the C++ code in MakeCallback() which calls it every time after invoking the JS code "callback". MakeCallback has a ton of callers, both direct and via a mechanism where it's registered with libuv as uv_check_immediate callback; I think I read somewhere that MakeCallback is how native code always invokes JS code, and so nextTick callbacks are processed (by running process._tickCallback) on every transition from JS code back to native code.

So my question: is it possible for _tickCallback to be called (and then dispatch into arbitrary JS code, whatever was registered as a nextTick callback) from arbitrary JS code, and under what conditions? If it's true that we invoke _tickCallback via MakeCallback() when leaving JS code back to native code, does this happen only when the entire synchronous execution (driven by the last async event) is done and we're going back to sleep in the event loop, or also if the JS code calls some C++ code in a native-code extension which itself calls back into JS code (so then unwinding, we go from JS to C++ but not back to the event loop; we have more JS to run still on the stack as we unwind; does this scenario also invoke MakeCallback and possibly _tickCallback)?

Or am I overthinking all this; my initial hunch that _tickCallback should never show up in the middle of a call stack was true, and this is just a bug that shouldn't happen?

I'd appreciate knowing where to concentrate my search to understand this better.

I'm running on Node stable releases (0.10.x; I think I've seen this behavior since 0.10.16 and it continues in 0.10.26; I haven't tried newer versions yet). I am using the node-fibers module [1] but I don't think it's implicated here; fibers have to choose to yield, and if they do yield then either we go back to the event loop or another fiber starts running from where it left off; any such fiber interactions would be visible in the call stack; it should be easy to distinguish between these cases and the one I'm asking about here. Clearly, that line of code quoted above from request/lib/copy.js, "o[i] = obj[i]", should not be triggering a fiber yield, and if it did (say in fancy situations with proxies or getters are involved), that should not manifest as a call to _tickCallback.

thanks,

Matt


Forrest Norvell

unread,
Oct 7, 2014, 11:06:55 PM10/7/14
to nod...@googlegroups.com
  • this is happening and is legal: then I need to completely rethink my understanding of data structure consistency in NodeJS.

  • it’s not actually happening and the call stack is a lie: but then I need some other explanation for the data structure inconsistency that my assertions are observing.

    It’s a little bit of #1 and #3.

    Your basic understanding of how asynchronous concurrency works in Node is sound, but there’s one piece you seem to be missing: when you call an asynchronous function (like request()) and pass it a callback, that callback is going to be called on the other side of some kind of asynchronous I/O operation (in request’s case, most likely reading the response stream after making an external HTTP request). At some point, a callback function (in this case, probably created by request) got handed off to a lower-level operation (perhaps within fibers) that handed it over to MakeCallback on the C++ side of node, which interfaces to the JS side via _tickCallback when the runtime indicates that there are data available to be read back from libuv.

    My guess is that something about how node-fibers handles suspension and resumption of its continuations / coroutines is interacting with the V8 stack in a funky way, but that the code is basically executing as expected. I don’t use fibers, and maybe one of the people here who does can pitch in here, but I think that’s not a deal breaker here – the only real peculiarity they cause here is that because of how they staple call stacks together, you’re seeing a stack trace that bridges an async call boundary.

    Does that make sense?

    F

    Matt Ginzton

    unread,
    Oct 10, 2014, 5:30:23 AM10/10/14
    to nod...@googlegroups.com
    Thanks for engaging on this.

    That does make sense, and yeah, my model for when JS code runs in Node is basically "in response to some asynchronous event": which could be I/O, or also things like setTimeout or nextTick (or the timers module). The node process will be asleep in the libuv event loop, wake up in order to run some callback, and the code in that callback will get to run until it's done (including anything it calls synchronously, which doesn't require I/O or get deferred via nextTick or timers); it shouldn't get preempted; eventually this sequence of synchronous execution will be done and Node will find something else to do (another timer or event handler) or go back to sleep.

    Upon more experimentation, I don't think this has anything to do with node-fibers (more on this below); I just mentioned that because we are using it and it might affect things, but I no longer think it's relevant here. I should also have mentioned we're using node-weak, which is going to be relevant here (it's one way but not the only way of getting the behavior in question). What this boils down to is, I have a fundamental question about the design and intended use of MakeCallback, whether I'm right about the behavior I'm going to describe, and whether this is intentional.

    I think I can further boil down my understanding of Node's asynchronous concurrency model and consistency guarantees and need for locking (or not) to: "no preemption". At least, that's my read on what people want from Node, and what I want from Node. (Following on what I said before about every synchronous invocation being a critical section until it chooses to yield: State can't change out from under you in the middle of one synchronous invocation. If you yield and then resume execution, "on the other side of an asynchronous I/O operation" as you said, state can and does change out from under you, because if code A schedules callback B, it can't know what will run in between A and B. And that's fine, because we all know it works that way, and B can't make any assumptions about the state of the world being exactly as A left it. But each statement in A should be able to assume that the state of the world is exactly as the previous statement left it.)

    This "no preemption" property would require that if I'm running some synchronous code, it gets to decide what other code runs, until it decides to yield execution by returning to the event dispatcher. In the middle of what I'm calling a flow of synchronous execution, some other piece of code it doesn't know about can't jump on the CPU, change some data structures, and return control to the first code in an inconsistent state. For this to be true, we should only do event dispatch (calling callbacks) from the toplevel event loop; there shouldn't be contexts nested within arbitrary application code that can dispatch into other arbitrary code.

    MakeCallback is where Node invokes nextTick callbacks, and it is called from the toplevel event loop (well, right before returning to the libuv event loop), but if my understanding is correct, it's also called on every transition from JS code back to C++ code -- which is a superset of "returning to the libuv event loop". That is, if "outer" JS code calls into a native-code Node extension, and that native-code extension calls back into "inner" JS via MakeCallback, when that inner JS returns to C++, MakeCallback invokes process._tickCallback before returning, and the outer JS has been preempted by any nextTick callbacks that get invoked here, and it can't really reason about the state of the world (specifically, any consistency guarantees on shared data structures) after that. (Side question here: is MakeCallback *a* way for extension/C++ code to invoke JS code, or *the* way for extension/C++ code to invoke JS code? That is, from a native-code extension that wants to invoke JS code, is all JS code "callbacks" or only some of it?)

    This, to me, seems to break the entire Node concurrency model, and I'm surprised more people haven't been bitten by it, but maybe I'm misunderstanding things, maybe there's a good reason for it, maybe MakeCallback isn't supposed to get used this way, or maybe it's just a bug that's enough of a corner case that it hasn't been noticed.

    Let me restate the problematic flow, and then I'll get into examples. A normal Node app might start off as pure JS; outside Node's own codebase all the app code is JS running as event handlers, and the flow is (1) event loop -> (2) handlers in JS -> return to (1). Then you add some native-code Node extensions, and JS code can call out to C++ code in these extensions, and now the flow is (1) event loop -> (2) handlers in JS -> (3) callouts in C++ extension -> return to (2) -> return to (1). And say some of these native-code Node extensions want to execute JS and use MakeCallback to do it (again, I don't know that this is the only way native code gets to execute JS code, but I do know it's one way); and now the flow is (1) event loop -> (2) handlers in JS -> (3) callouts in C++ extension -> (4) more JS code -> return to (3) -> return to (2) -> return to (1). And obviously more complex flows are possible; there could be even deeper nested transitions between JS and native code, or bounce between (2)<>(3) multiple times before returning to (1), but the 1234321 case is already enough to illustrate the preemption problem. Every time we invoke MakeCallback, on the way out it executes process._tickCallback. This happens at the boundary from (2) back to (1), which is not a problem, but also at the boundary from (4) back to (3), which is a problem.

    Here's a demo of the problem. (Note that this doesn't use node-fibers. It does use node-weak, but just as a way of forcing invocation of MakeCallback. Any other extension that calls MakeCallback could exhibit the same behavior.)

    Here are examples of node-fibers usage, including examples of what the call stacks look like. (Again, not relevant to the preemption issue I'm harping on; just following up on what would otherwise be a loose end from my original post, if anyone's curious. The point is that the stack trace inside a fiber doesn't see outside the fiber boundary, unless you write your own stack-stitching code, so you don't see _tickCallback in the call stack even if it was used to invoke the fiber. But you still wouldn't expect to see _tickCallback higher in the call stack.) Using fibers, one certainly ends up with a different concurrency model than the normal Node one, and you can bring all sorts of problems on yourself this way if you do it wrong, but I don't consider it to break my "no preemption" rule because it's entirely under your control; you can get descheduled and then find the world has changed out from under you when you get rescheduled, but you actually have to ask for this to happen by calling Fiber.yield or Fiber.run.

    Boiling all this back down to one question, it's this: is it intentional that inner invocations of MakeCallback do dispatch to _tickCallback (thus preempting any code that was running higher on the call stack), or can this be changed? IMHO it would be preferable, and would solve this problem, if we called process._tickCallback only from the very outermost transition from JS back to C++, not on every such transition. But maybe this causes other problems I'm not seeing. Or, should extensions that are invoked as callouts from JS code (as opposed to registering event handlers, which seems to be how the Node source uses MakeCallback) not be using MakeCallback in the first place?

    As things stand, it seems like any extension that uses MakeCallback should be labeled "somewhat dangerous", because it can preempt the code that calls it. (But if this is documented and advertised properly, I as a client of that extension can code around it; I just have to realize that certain entry points into the extension are a yield/redispatch point from the point of view of the calling code, and if I know where those are, I can code around it.) Meanwhile node-weak (and I don't mean to malign it, it's really cool and useful functionality) should be labeled "quite a lot more dangerous", because not only can it cause preemption for those same reasons, but you can't document any specific entry points or surface area that can cause that preemption: that surface area is GC, which can happen at any time.

    thanks,

    Matt

    Bruno Jouhier

    unread,
    Oct 10, 2014, 12:11:14 PM10/10/14
    to nod...@googlegroups.com
    Your problem seems to originate from node-weak.

    With "normal" node modules, the code that gets executed during one turn of the event loop is 100% related to the JS calls that you are making in the response to the event. Nothing interferes. If other I/O operations are pending they won't interfere with your event; their callbacks will be called on another turn of the event loop. This is completely non-preemptive.

    But node-weak is a very odd beast. It introduces a callback which is called by the garbage collector and the garbage collector may run at any time, and will preempt what your JS code is doing. As a result, you have some kind of odd stack that pops up at random places in your code.

    You should be very cautious about what you do in a node-weak callback. Handling such a callback is very much like handling an interrupt in driver code: do as little as possible: incrementing a counter or setting a flag is fine but you should not be calling complex code that's going to interact with the event loop.

    Bruno

    Matt Ginzton

    unread,
    Oct 10, 2014, 3:00:18 PM10/10/14
    to nod...@googlegroups.com
    Yeah, I know (and hopefully anyone using node-weak knows) that the node-weak callback is a dangerous place to do anything complicated, as you say, like interrupt handlers or signal handlers.

    The problem I have isn't what my node-weak callback itself is doing, because I know how to play by those rules -- the problem is that because node-weak used MakeCallback() to invoke *that* callback, it also has the side effect of invoking *all previously requested nextTick callbacks*, which could be anything -- and certainly includes a bunch of code that doesn't know, can't know, to play by those stricter rules.

    Basically, the MakeCallback behavior I'm observing, when called from a nested context (like what node-weak does when invoking callbacks from GC context), could turn any nextTick callback into an interrupt handler, and I can't possibly write all my nextTick callbacks to be safe enough to run at such times.

    I agree node-weak is an odd beast, and it's certainly possible that all my troubles originate from my use of it. I'm still curious whether what it's doing is the right way for it and other extensions to go back into V8 and invoke JS code from native code, or is there some other way. If nothing else uses MakeCallback this way, and there's another way for node-weak to invoke just its own callbacks (the ones written to the safety standard of interrupt handlers), I'd call this a node-weak bug and ask it to change to invoke just its own callbacks. If nothing else uses MakeCallback this way but there's no other way for node-weak to invoke just its own callbacks, then the problem would be limited to node-weak users but I'd consider node-weak too dangerous to use. If other things use MakeCallback this way, then the problem isn't limited to node-weak anyway (though as I mentioned in my previous post, I think the problem is much more dangerous the way node-weak exposes it, because if your API function or extension calls MakeCallback and all nextTick callbacks, at least you can tell me to watch out for that API function and I can try to call it only at safe times, but I can't control when GC happens so I can't restrict MakeCallback to safe times).

    I tried looking around the ~45 uses of MakeCallback in the node codebase per se, outside of node.cc; the majority seem like they're using it perfectly safely (from an OnFoo callback which was registered with libuv directly). Some of the ones in node_crypto.cc look like they might be capable of running in what I'm calling a nested context (with user JS code already on the call stack), but I haven't had time to prove or disprove this hypothesis yet.

    I also wanted to audit all the other native-code Node extensions I'm using to see whether/how they invoke MakeCallback, but at this point this is largely obscured by nan (even node-weak doesn't call MakeCallback directly), and these codebases are all new to me so this isn't something I can do quickly and I haven't been able to glean anything useful there either yet.

    I'm pretty sure I could repro the same behavior without node-weak but with the extension in https://github.com/joyent/node/issues/8231, which does directly expose MakeCallback in what I'm calling a nested context. That extension was just to demo/repro some other bug and not real production code; again I don't know whether you're supposed to write code like this, but I take this as some indication that this way of using MakeCallback is not purely limited to node-weak.

    Thanks again to anyone looking into this -- I am very eager to get to the bottom of it.

    Matt

    Bruno Jouhier

    unread,
    Oct 10, 2014, 4:19:43 PM10/10/14
    to nod...@googlegroups.com
    Looks like node-weak should use a different mechanism to invoke the callback. It should call the function directly instead of going through MakeCallback.
    Reply all
    Reply to author
    Forward
    0 new messages