Hi folks,
I'm Attila, and I'm an engineer at Datadog working on our Node.js profiler[0]. (As a side note, I'd be happy to give a broader overview of how we do V8 profiling at Datadog some time.)
TL;DR: if we interrupt V8 with a signal while it's executing HandleScope::DeleteExtensions(), a call to Isolate::GetCurrentContext() will crash even though Isolate::InContext() returns true.
The whole story:
We mostly rely on V8's built-in CpuProfiler APIs, but we also try to gather some extra data on each sample, so we intercept the PROF signal handler and do some work before and after delegating to V8's original PROF handler; a rough sketch of the interception follows. For example, we also gather elapsed thread CPU time, and recently we started gathering the value of node::AsyncHooksGetExecutionAsyncId(Isolate*). (I realize this is a V8 list and not a Node one – my question ultimately has nothing to do with Node.js per se.)
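This isn't our exact code, just a minimal sketch of the usual sigaction chaining; the names dd_handle_prof and g_old_prof_action are made up for illustration:

#include <signal.h>

static struct sigaction g_old_prof_action;

static void dd_handle_prof(int sig, siginfo_t* info, void* context) {
  // ... gather our extra per-sample data (thread CPU time, async id, ...) ...
  // Delegate to the handler V8 had installed for SIGPROF.
  if (g_old_prof_action.sa_flags & SA_SIGINFO) {
    g_old_prof_action.sa_sigaction(sig, info, context);
  } else if (g_old_prof_action.sa_handler != SIG_DFL &&
             g_old_prof_action.sa_handler != SIG_IGN) {
    g_old_prof_action.sa_handler(sig);
  }
  // ... any post-delegation bookkeeping ...
}

static void install_prof_handler() {
  struct sigaction sa = {};
  sa.sa_sigaction = dd_handle_prof;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGPROF, &sa, &g_old_prof_action);  // saves V8's handler for chaining
}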
Now, I know you'll say that invoking random methods from a signal handler is obviously a recipe for disaster, but in this case the API is just a few dependent loads through a chain of pointers that rarely, if ever, changes. If I pseudo-unrolled what this Node.js method does, it'd be something like this (omitting a bunch of defensive checks for IsEmpty() etc.):
if (isolate->InContext()) {
  auto ctx = isolate->GetCurrentContext();
  // The node::Environment* is stashed in the context's embedder data.
  auto env = static_cast<node::Environment*>(
      ctx->GetAlignedPointerFromEmbedderData(NodeEnvironmentIndex));
  // kExecutionAsyncId is the currently executing async id
  // (kAsyncIdCounter is just the id allocation counter).
  return env->async_hooks_->async_id_fields[AsyncHooks::kExecutionAsyncId];
}
The thing is, it does work. One very obvious case where it wouldn't is if the PROF signal handler interrupted V8 while it's collecting garbage, but we guard against that by installing GC prologue and epilogue callbacks and not touching the engine when we know it's GCing (see the sketch below).
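The guard is roughly this (a minimal sketch assuming a simple atomic flag; g_in_gc and the callback names are made up):

#include <atomic>
#include <v8.h>

static std::atomic<bool> g_in_gc{false};

static void OnGCPrologue(v8::Isolate*, v8::GCType, v8::GCCallbackFlags) {
  g_in_gc.store(true, std::memory_order_release);
}

static void OnGCEpilogue(v8::Isolate*, v8::GCType, v8::GCCallbackFlags) {
  g_in_gc.store(false, std::memory_order_release);
}

// At startup:
//   isolate->AddGCPrologueCallback(OnGCPrologue);
//   isolate->AddGCEpilogueCallback(OnGCEpilogue);
// In the signal handler, before touching the engine:
//   if (g_in_gc.load(std::memory_order_acquire)) return;  // skip this sample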
However, we see one case where it doesn't work, and I can't figure it out on my own, so I'm kindly asking for some eyeballs here. Namely, we're seeing a crash with this stack trace:
Isolate::GetCurrentContext()
node::AsyncHooksGetExecutionAsyncId(Isolate*)
dd::SignalHandler::HandleProfilerSignal(…)
v8::internal::HandleScope::DeleteExtensions(Isolate*)
node::Environment::RunTimers(…)
uv__run_timers
uv_run
So node::Environment::RunTimers has a HandleScope instance[1]; this HandleScope is destroyed when it goes out of C++ scope, which should be just before the method returns. As part of its destruction, DeleteExtensions is invoked. Our PROF handler interrupts there and calls node::AsyncHooksGetExecutionAsyncId(), which invokes Isolate::InContext() and receives true, but then crashes in Isolate::GetCurrentContext().
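To make the timing concrete, this is the shape of the race as I understand it (heavily abridged, not the real RunTimers code):

void Environment::RunTimers(uv_timer_t* handle) {
  // ...
  HandleScope handle_scope(env->isolate());
  // ... loop that Call()s into the JS timer callbacks ...
}  // ~HandleScope() runs here and calls DeleteExtensions();
   // the PROF signal lands inside DeleteExtensions(), where
   // InContext() still returns true but GetCurrentContext() crashes.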
We see repeated crashes with this exact stack trace. Something in HandleScope::DeleteExtensions() causes Isolate::GetCurrentContext() to crash even though Isolate::InContext() returned true. If anyone knows off the top of their head what causes this, I'd appreciate it. If we can fix it, even better; I'd settle for being able to observe this state and bail out without crashing.
Again, I do understand that signal handlers can't expect to observe data in an internally consistent state. FWIW, I have a hunch that Isolate::InContext() should really be returning false in this situation: at the time the handle scope is destroyed, there's no JS code executing at all, and we're about to return to libuv's run_timers. There are a bunch of Call()s[2] into JS functions in a loop in that method, but by the time the handle scope is destroyed, execution is outside them all. The code of RunTimers seems sufficiently complex that this shouldn't be explainable by compiler inlining and reordering.
Again, any insight appreciated, and I'm happy to work more on this with anyone.
Thanks,
Attila.
--