Thanks for engaging on this.
That does make sense, and yeah, my model for when JS code runs in Node is basically "in response to some asynchronous event": which could be I/O, or also things like setTimeout or nextTick (or the timers module). The node process will be asleep in the libuv event loop, wake up in order to run some callback, and the code in that callback will get to run until it's done (including anything it calls synchronously, which doesn't require I/O or get deferred via nextTick or timers); it shouldn't get preempted; eventually this sequence of synchronous execution will be done and Node will find something else to do (another timer or event handler) or go back to sleep.
Upon more experimentation, I don't think this has anything to do with node-fibers (more on this below); I just mentioned that because we are using it and it might affect things, but I no longer think it's relevant here. I should also have mentioned we're using
node-weak, which is going to be relevant here (it's one way but not the only way of getting the behavior in question). What this boils down to is, I have a fundamental question about the design and intended use of MakeCallback, whether I'm right about the behavior I'm going to describe, and whether this is intentional.
I think I can further boil down my understanding of Node's asynchronous concurrency model and consistency guarantees and need for locking (or not) to: "no preemption". At least, that's my read on what people want from Node, and what I want from Node. (Following on what I said before about every synchronous invocation being a critical section until it chooses to yield: State can't change out from under you in the middle of one synchronous invocation. If you yield and then resume execution, "on the other side of an asynchronous I/O operation" as you said, state can and does change out from under you, because if code A schedules callback B, it can't know what will run in between A and B. And that's fine, because we all know it works that way, and B can't make any assumptions about the state of the world being exactly as A left it. But each statement in A should be able to assume that the state of the world is exactly as the previous statement left it.)
This "no preemption" property would require that if I'm running some synchronous code, it gets to decide what other code runs, until it decides to yield execution by returning to the event dispatcher. In the middle of what I'm calling a flow of synchronous execution, some other piece of code it doesn't know about can't jump on the CPU, change some data structures, and return control to the first code in an inconsistent state. For this to be true, we should only do event dispatch (calling callbacks) from the toplevel event loop; there shouldn't be contexts nested within arbitrary application code that can dispatch into other arbitrary code.
MakeCallback is where Node invokes nextTick callbacks, and it is called from the toplevel event loop (well, right before returning to the libuv event loop), but if my understanding is correct, it's also called on every transition from JS code back to C++ code -- which is a superset of "returning to the libuv event loop". That is, if "outer" JS code calls into a native-code Node extension, and that native-code extension calls back into "inner" JS via MakeCallback, when that inner JS returns to C++, MakeCallback invokes process._tickCallback before returning, and the outer JS has been preempted by any nextTick callbacks that get invoked here, and it can't really reason about the state of the world (specifically, any consistency guarantees on shared data structures) after that. (Side question here: is MakeCallback *a* way for extension/C++ code to invoke JS code, or *the* way for extension/C++ code to invoke JS code? That is, from a native-code extension that wants to invoke JS code, is all JS code "callbacks" or only some of it?)
This, to me, seems to break the entire Node concurrency model, and I'm surprised more people haven't been bitten by it, but maybe I'm misunderstanding things, maybe there's a good reason for it, maybe MakeCallback isn't supposed to get used this way, or maybe it's just a bug that's enough of a corner case that it hasn't been noticed.
Let me restate the problematic flow, and then I'll get into examples. A normal Node app might start off as pure JS; outside Node's own codebase all the app code is JS running as event handlers, and the flow is (1) event loop -> (2) handlers in JS -> return to (1). Then you add some native-code Node extensions, and JS code can call out to C++ code in these extensions, and now the flow is (1) event loop -> (2) handlers in JS -> (3) callouts in C++ extension -> return to (2) -> return to (1). And say some of these native-code Node extensions want to execute JS and use MakeCallback to do it (again, I don't know that this is the only way native code gets to execute JS code, but I do know it's one way); and now the flow is (1) event loop -> (2) handlers in JS -> (3) callouts in C++ extension -> (4) more JS code -> return to (3) -> return to (2) -> return to (1). And obviously more complex flows are possible; there could be even deeper nested transitions between JS and native code, or bounce between (2)<>(3) multiple times before returning to (1), but the 1234321 case is already enough to illustrate the preemption problem. Every time we invoke MakeCallback, on the way out it executes process._tickCallback. This happens at the boundary from (2) back to (1), which is not a problem, but also at the boundary from (4) back to (3), which is a problem.
Here's a
demo of the problem. (Note that this doesn't use node-fibers. It does use node-weak, but just as a way of forcing invocation of MakeCallback. Any other extension that calls MakeCallback could exhibit the same behavior.)
Here are
examples of node-fibers usage, including examples of what the call stacks look like. (Again, not relevant to the preemption issue I'm harping on; just following up on what would otherwise be a loose end from my original post, if anyone's curious. The point is that the stack trace inside a fiber doesn't see outside the fiber boundary, unless you write your own stack-stitching code, so you don't see _tickCallback in the call stack even if it was used to invoke the fiber. But you still wouldn't expect to see _tickCallback higher in the call stack.) Using fibers, one certainly ends up with a different concurrency model than the normal Node one, and you can bring all sorts of problems on yourself this way if you do it wrong, but I don't consider it to break my "no preemption" rule because it's entirely under your control; you can get descheduled and then find the world has changed out from under you when you get rescheduled, but you actually have to ask for this to happen by calling Fiber.yield or Fiber.run.
Boiling all this back down to one question, it's this: is it intentional that inner invocations of MakeCallback do dispatch to _tickCallback (thus preempting any code that was running higher on the call stack), or can this be changed? IMHO it would be preferable, and would solve this problem, if we called process._tickCallback only from the very outermost transition from JS back to C++, not on every such transition. But maybe this causes other problems I'm not seeing. Or, should extensions that are invoked as callouts from JS code (as opposed to registering event handlers, which seems to be how the Node source uses MakeCallback) not be using MakeCallback in the first place?
As things stand, it seems like any extension that uses MakeCallback should be labeled "somewhat dangerous", because it can preempt the code that calls it. (But if this is documented and advertised properly, I as a client of that extension can code around it; I just have to realize that certain entry points into the extension are a yield/redispatch point from the point of view of the calling code, and if I know where those are, I can code around it.) Meanwhile node-weak (and I don't mean to malign it, it's really cool and useful functionality) should be labeled "quite a lot more dangerous", because not only can it cause preemption for those same reasons, but you can't document any specific entry points or surface area that can cause that preemption: that surface area is GC, which can happen at any time.
thanks,
Matt