Hi folks,
I'm Attila, and I'm an engineer at Datadog working on our Node.js profiler[0]. (As a side note, I'd be happy to give a broader overview of how we do V8 profiling at Datadog some time.)
TL;DR: if we interrupt V8 with a signal while it's executing HandleScope::DeleteExtensions(), a call to Isolate::GetCurrentContext() will crash even though Isolate::InContext() returns true.
The whole story:
We mostly rely on V8's built-in CpuProfiler APIs, but we also try to gather some extra data on each sample, so we intercept the PROF signal handler and do some work before and after delegating to V8's original PROF handler; a rough sketch of the interception follows. For example, we also gather elapsed thread CPU time, and recently we started gathering the value of node::AsyncHooksGetExecutionAsyncId(Isolate*). (I realize this is a V8 list and not a Node one – my question ultimately has nothing to do with Node.js per se.)
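This isn't our exact code, just a minimal sketch of the usual sigaction chaining; the names dd_handle_prof and g_old_prof_action are made up for illustration:

#include <signal.h>

static struct sigaction g_old_prof_action;

static void dd_handle_prof(int sig, siginfo_t* info, void* context) {
  // ... gather our extra per-sample data (thread CPU time, async id, ...) ...
  // Delegate to the handler V8 had installed for SIGPROF.
  if (g_old_prof_action.sa_flags & SA_SIGINFO) {
    g_old_prof_action.sa_sigaction(sig, info, context);
  } else if (g_old_prof_action.sa_handler != SIG_DFL &&
             g_old_prof_action.sa_handler != SIG_IGN) {
    g_old_prof_action.sa_handler(sig);
  }
  // ... any post-delegation bookkeeping ...
}

static void install_prof_handler() {
  struct sigaction sa = {};
  sa.sa_sigaction = dd_handle_prof;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGPROF, &sa, &g_old_prof_action);  // saves V8's handler for chaining
}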
Now, I know you'll say that invoking random methods from a signal handler is obviously a recipe for disaster, but in this case the API is just a few dependent loads through a chain of pointers that rarely, if ever, changes. If I pseudo-unrolled what this Node.js method does, it'd be something like this (omitting a bunch of defensive checks for IsEmpty() etc.):
if (isolate->InContext()) {
  auto ctx = isolate->GetCurrentContext();
  // The node::Environment* is stashed in the context's embedder data.
  auto env = static_cast<node::Environment*>(
      ctx->GetAlignedPointerFromEmbedderData(NodeEnvironmentIndex));
  // kExecutionAsyncId is the currently executing async id
  // (kAsyncIdCounter is just the id allocation counter).
  return env->async_hooks_->async_id_fields[AsyncHooks::kExecutionAsyncId];
}
The thing is, it does work. One very obvious case where it wouldn't is if the PROF signal handler interrupted V8 while it's collecting garbage, but we guard against that by installing GC prologue and epilogue callbacks and not touching the engine when we know it's GCing (see the sketch below).
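The guard is roughly this (a minimal sketch assuming a simple atomic flag; g_in_gc and the callback names are made up):

#include <atomic>
#include <v8.h>

static std::atomic<bool> g_in_gc{false};

static void OnGCPrologue(v8::Isolate*, v8::GCType, v8::GCCallbackFlags) {
  g_in_gc.store(true, std::memory_order_release);
}

static void OnGCEpilogue(v8::Isolate*, v8::GCType, v8::GCCallbackFlags) {
  g_in_gc.store(false, std::memory_order_release);
}

// At startup:
//   isolate->AddGCPrologueCallback(OnGCPrologue);
//   isolate->AddGCEpilogueCallback(OnGCEpilogue);
// In the signal handler, before touching the engine:
//   if (g_in_gc.load(std::memory_order_acquire)) return;  // skip this sample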
However, we see one case where it doesn't work, and I can't figure it out on my own, so I'm kindly asking for some eyeballs here. Namely, we're seeing a crash with this stack trace:
Isolate::GetCurrentContext()
node::AsyncHooksGetExecutionAsyncId(Isolate*)
dd::SignalHandler::HandleProfilerSignal(…)
v8::internal::HandleScope::DeleteExtensions(Isolate*)
node::Environment::RunTimers(…)
uv__run_timers
uv_run
So node::Environment::RunTimers has a HandleScope instance[1]; this HandleScope is destroyed when it goes out of C++ scope, which should be just before the method returns. As part of its destruction, DeleteExtensions is invoked. Our PROF handler interrupts there and calls node::AsyncHooksGetExecutionAsyncId(), which invokes Isolate::InContext() and receives true, but then crashes in Isolate::GetCurrentContext().
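To make the timing concrete, this is the shape of the race as I understand it (heavily abridged, not the real RunTimers code):

void Environment::RunTimers(uv_timer_t* handle) {
  // ...
  HandleScope handle_scope(env->isolate());
  // ... loop that Call()s into the JS timer callbacks ...
}  // ~HandleScope() runs here and calls DeleteExtensions();
   // the PROF signal lands inside DeleteExtensions(), where
   // InContext() still returns true but GetCurrentContext() crashes.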
We see repeated crashes with this exact stack trace. Something in HandleScope::DeleteExtensions() causes Isolate::GetCurrentContext() to crash even though Isolate::InContext() returned true. If anyone knows off the top of their head what causes this, I'd appreciate it. If we can fix it, even better; I'd settle for being able to observe this state and bail out without crashing.
Again, I do understand that signal handlers can't expect to observe data in an internally consistent state. FWIW, I have a hunch that Isolate::InContext() should really be returning false in this situation: at the time the handle scope is destroyed, there's no JS code executing at all, and we're about to return to libuv's run_timers. There are a bunch of Call()s[2] into JS functions in a loop in that method, but by the time the handle scope is destroyed, execution is outside them all. The code of RunTimers seems sufficiently complex that this shouldn't be explainable by compiler inlining and reordering.
Again, any insight appreciated, and I'm happy to work more on this with anyone.
Thanks,
Attila.
--