Fatal error in ScavengerCollector::CollectGarbage(), what did I do wrong?

516 views
Skip to first unread message

Kenton Varda

unread,
Dec 5, 2018, 6:37:20 PM12/5/18
to v8-u...@googlegroups.com
Hi v8-users,

I recently refactored the way that I integrate with V8's garbage collection in my application, and after the refactor I am (very rarely) hitting a fatal error that I don't understand:

#
# Fatal error in , line 0
# Check failed: IsEmpty().
#

stack:
____ base/platform/platform-posix.cc:402 v8::base::OS::Abort()
____ base/logging.cc:171 V8_Fatal(char const*, int, char const*, ...)
____ heap/worklist.h:72 v8::internal::Worklist<std::__1::pair<v8::internal::HeapObject*, int>, 256>::~Worklist()
____ heap/scavenger.cc:288 v8::internal::ScavengerCollector::CollectGarbage()
____ heap/heap.cc:1931 v8::internal::Heap::Scavenge()
____ heap/heap.cc:1649 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags)
____ heap/heap.cc:1292 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags)
____ heap/heap.cc:4264 v8::internal::Heap::AllocateRawWithLightRetry(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment)
____ heap/heap.cc:4278 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment)
____ heap/factory.cc:120 AllocateRawWithImmortalMap
____ heap/factory.cc:972  (inlined by) NewRawTwoByteString
____ heap/factory.cc:636 v8::internal::Factory::NewStringFromUtf8(v8::internal::Vector<char const>, v8::internal::PretenureFlag)
____ api.cc:6459 NewString
____ api.cc:6519  (inlined by) NewFromUtf8
(... application code ...)

It looks like the Worklist being destroyed here is Scavenger::PromotionList::regular_object_promotion_list_. This Worklist is being destroyed at the end of ScavengerCollector::CollectGarbage(), but is non-empty when destroyed.

I don't understand how this code works, so I don't really have any idea what might lead to that. Any hints on what I should be looking for in my code that might be causing this?

-Kenton

Kenton Varda

unread,
Dec 5, 2018, 8:23:03 PM12/5/18
to v8-users
Whoops, correction: It's actually copied_list that has work left, not promotion_list.

Kenton Varda

unread,
Dec 6, 2018, 4:43:04 PM12/6/18
to v8-u...@googlegroups.com
Seems to have something to do with this block in ScavengerCollector::CollectGarbage():


    {
      // Scavenge weak global handles.
      TRACE_GC(heap_->tracer(),
               GCTracer::Scope::SCAVENGER_SCAVENGE_WEAK_GLOBAL_HANDLES_PROCESS);
      isolate_->global_handles()->MarkNewSpaceWeakUnmodifiedObjectsPending(
          &IsUnscavengedHeapObject);
      isolate_->global_handles()
          ->IterateNewSpaceWeakUnmodifiedRootsForFinalizers(
              &root_scavenge_visitor);
      scavengers[kMainThreadId]->Process();

      DCHECK(copied_list.IsEmpty());
      DCHECK(promotion_list.IsEmpty());
      isolate_->global_handles()
          ->IterateNewSpaceWeakUnmodifiedRootsForPhantomHandles(
              &root_scavenge_visitor, &IsUnscavengedHeapObject);
    }

The call to IterateNewSpaceWeakUnmodifiedRootsForPhantomHandles() at the end of the block appears to cause the crash by visiting a handle for an object that is already in ToSpace. This triggers a DCHECK at the start of ScavengeObject(), or in release builds adds the handle to copied_list which triggers the aforementioned CHECK in ~Worklist().

The following patch appears to stop the crash:

diff --git a/src/heap/scavenger.cc b/src/heap/scavenger.cc
index 4c63ed099a..b2f0991350 100644
--- a/src/heap/scavenger.cc
+++ b/src/heap/scavenger.cc
@@ -443,6 +443,7 @@ void RootScavengeVisitor::ScavengePointer(Object** p) {
   Object* object = *p;
   DCHECK(!HasWeakHeapObjectTag(object));
   if (!Heap::InNewSpace(object)) return;
+  if (!Heap::InFromSpace(object)) return;
 
   scavenger_->ScavengeObject(reinterpret_cast<HeapObjectReference**>(p),
                              reinterpret_cast<HeapObject*>(object));

But I have no idea if this is really correct, nor do I understand how my refactoring caused this problem.

Under what situations could the last line of the block above end up visiting an object in ToSpace?

One idea I had was that maybe a handle was created (or marked weak?) during scavenging, but I don't know how I could have done that: AFAICT, the only callbacks into into my application code that happen during scavenging are the weak callbacks, but I've verified they aren't doing anything with handles other than resetting the handle they were called on.

Any ideas?

-Kenton

--
--
v8-users mailing list
v8-u...@googlegroups.com
http://groups.google.com/group/v8-users
---
You received this message because you are subscribed to a topic in the Google Groups "v8-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/v8-users/sdU232XmyOw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to v8-users+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Lippautz

unread,
Dec 11, 2018, 4:01:58 AM12/11/18
to v8-users
Thanks for the thorough description. There's indeed something off here.

It could either be filtering in the roots visitor (like the patch you suggested) or that we need another processing phase after iterating those roots. 

I've opened https://crbug.com/v8/8571 to track this. Will investigate and report back there.

-Michael

You received this message because you are subscribed to the Google Groups "v8-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-users+u...@googlegroups.com.

Michael Lippautz

unread,
Dec 11, 2018, 7:30:10 AM12/11/18
to Michael Lippautz, v8-users
I think this issue is a symptom of visiting a root (global handle) twice during a Scavenge. I've checked all paths in global handles iteration and it looks like they are mutually exclusive, as they should be.

Is there  any chance you can provide some sort of repro?

Also, which kind of API calls are involved? E.g.: Weak callbacks, finalizers, GC callbacks, MarkActive, MarkIndependent, etc

-Michael

Kenton Varda

unread,
Dec 11, 2018, 12:27:04 PM12/11/18
to v8-u...@googlegroups.com, Michael Lippautz
Hi Michael,

Thanks for looking at this!

I wish I could come up with a simple repro. Currently my only repro involves running load test traffic (captured from production) through a test instance for about a minute before the crash happens.

My refactoring does a bunch of things differently from before and it's hard to revert them one by one. There's one big main thing that is different: Previously, for each wrapper, I'd create one weak persistent handle for the purpose of receiving a callback to delete the object, and then additional persistent handles for each reference held by some other native object (which may or may not be weak depending on whether the holder is itself GC traceable). In the new code, I instead keep only one persistent handle per wrapper, and I separately track a "strong reference count"; when the strong refcount becomes zero I call SetWeak(), and when it becomes non-zero I call ClearWeak(). So the same handle may switch between weak and non-weak.

This transitioning between weak and non-weak seems like it could confuse the scavenger *if* it happened during scavenging. However, AFAICT, there's no way this could happen, since there are no callbacks into application code during scavenging except for weak callbacks... and in my case, those weak callbacks are actually never called during scavenging, because I have a scavenge prologue callback that iterates over all weak handles to wrapper objects and calls MarkActive() on them.

I don't expect you to be able to figure this out for me, but do you have any hints for debugging strategies I should try?

Thanks,
-Kenton

Michael Lippautz

unread,
Dec 11, 2018, 2:13:50 PM12/11/18
to ken...@cloudflare.com, v8-users, Michael Lippautz
On Tue, Dec 11, 2018 at 6:27 PM Kenton Varda <ken...@cloudflare.com> wrote:
Hi Michael,

Thanks for looking at this!

I wish I could come up with a simple repro. Currently my only repro involves running load test traffic (captured from production) through a test instance for about a minute before the crash happens.

My refactoring does a bunch of things differently from before and it's hard to revert them one by one. There's one big main thing that is different: Previously, for each wrapper, I'd create one weak persistent handle for the purpose of receiving a callback to delete the object, and then additional persistent handles for each reference held by some other native object (which may or may not be weak depending on whether the holder is itself GC traceable). In the new code, I instead keep only one persistent handle per wrapper, and I separately track a "strong reference count"; when the strong refcount becomes zero I call SetWeak(), and when it becomes non-zero I call ClearWeak(). So the same handle may switch between weak and non-weak.

This transitioning between weak and non-weak seems like it could confuse the scavenger *if* it happened during scavenging. However, AFAICT, there's no way this could happen, since there are no callbacks into application code during scavenging except for weak callbacks... and in my case, those weak callbacks are actually never called during scavenging, because I have a scavenge prologue callback that iterates over all weak handles to wrapper objects and calls MarkActive() on them.


I really hope that handles are only access from the same thread the GC is running on. Otherwise, you would need a v8::Locker to synchronize that access. 

Also, you are marking *all* handles with MarkActive()? That should essentially make all of the handles strong for the Scavenger. 

That contradicts with the crash happening after IterateNewSpaceWeakUnmodifiedRootsForPhantomHandles which deals with weak handles that are not marked as active. (Assuming you did not use MarkIndependent()).
 
I don't expect you to be able to figure this out for me, but do you have any hints for debugging strategies I should try?

Essentially, it looks like you are scavenging a root node twice, which is why you already see it in ToSpace when it's getting added through the visitor.

The relevant functions are in src/global-handles.{h,cc} and called throughout scavenging in this order: 
1. IdentifyWeakUnmodifiedObjects: Makes nodes active that V8 thinks should be active in addition to those that already had MarkActive() called on them.
2. IterateNewSpaceStrongAndDependentRoots: Scavenges strong handles and those that are active.
3. MarkNewSpaceWeakUnmodifiedObjectsPending: Identifies nodes that require finalizers.
4. IterateNewSpaceWeakUnmodifiedRootsForFinalizers: Scavengers nodes that require finalizers.
5. IterateNewSpaceWeakUnmodifiedRootsForPhantomHandles: Resets weak handles or scavenges them.  Scavenging here just means updating the forwarding pointer from FromSpace to ToSpace as the object itself has already been scavenged if we don't need to reset it.

The sets of handles that are scavenged should be disjoint in 2), 4) and 5). Whenever a node is already pointing directly to ToSpace that should be a bug and some other method probably scavenged it in the same run before. 

The interesting part is then why multiple methods match the same node for scavenging.

Cheers, -Michael



Kenton Varda

unread,
Dec 11, 2018, 4:03:20 PM12/11/18
to Michael Lippautz, v8-u...@googlegroups.com
On Tue, Dec 11, 2018 at 11:13 AM Michael Lippautz <mlip...@chromium.org> wrote:
I really hope that handles are only access from the same thread the GC is running on. Otherwise, you would need a v8::Locker to synchronize that access. 

Right, the isolate mutex is locked any time we're manipulating handles.
 
Also, you are marking *all* handles with MarkActive()? That should essentially make all of the handles strong for the Scavenger. 

Yes, because the scavenger doesn't trace objects with EmbedderHeapTracer, I think we have to mark all handles active (as visited via VisitWeakHandles()).
 
That contradicts with the crash happening after IterateNewSpaceWeakUnmodifiedRootsForPhantomHandles which deals with weak handles that are not marked as active. (Assuming you did not use MarkIndependent()).

There is a version of my change which uses MarkIndependent() when we know that there are *no* references from C++ objects (i.e. when the handle exists purely to register the weak callback to delete the object). However, I still observe this crash even when I comment out the call to MarkIndependent().

I agree that this seems inconsistent.
 
I don't expect you to be able to figure this out for me, but do you have any hints for debugging strategies I should try?

Essentially, it looks like you are scavenging a root node twice, which is why you already see it in ToSpace when it's getting added through the visitor.

The relevant functions are in src/global-handles.{h,cc} and called throughout scavenging in this order: 
1. IdentifyWeakUnmodifiedObjects: Makes nodes active that V8 thinks should be active in addition to those that already had MarkActive() called on them.
2. IterateNewSpaceStrongAndDependentRoots: Scavenges strong handles and those that are active.
3. MarkNewSpaceWeakUnmodifiedObjectsPending: Identifies nodes that require finalizers.
4. IterateNewSpaceWeakUnmodifiedRootsForFinalizers: Scavengers nodes that require finalizers.
5. IterateNewSpaceWeakUnmodifiedRootsForPhantomHandles: Resets weak handles or scavenges them.  Scavenging here just means updating the forwarding pointer from FromSpace to ToSpace as the object itself has already been scavenged if we don't need to reset it.

The sets of handles that are scavenged should be disjoint in 2), 4) and 5). Whenever a node is already pointing directly to ToSpace that should be a bug and some other method probably scavenged it in the same run before. 

The interesting part is then why multiple methods match the same node for scavenging.

Thanks, I think this helps. I'll do some more digging.

-Kenton

Kenton Varda

unread,
Dec 12, 2018, 8:28:34 PM12/12/18
to Michael Lippautz, v8-u...@googlegroups.com
On Tue, Dec 11, 2018 at 1:02 PM Kenton Varda <ken...@cloudflare.com> wrote:
On Tue, Dec 11, 2018 at 11:13 AM Michael Lippautz <mlip...@chromium.org> wrote:
I really hope that handles are only access from the same thread the GC is running on. Otherwise, you would need a v8::Locker to synchronize that access. 

Right, the isolate mutex is locked any time we're manipulating handles.

... or so I thought.

It turns out in some buggy cases a destructor for a C++ object was executing without locking the isolate mutex. Before my refactoring, this resulted in the destruction of a persistent handle, which apparently didn't cause any visible concurrency issues. After my refactoring, this ended up invoking SetWeak() on a handle without destroying it. If this happened to occur at the wrong time during a scavenge, it could cause the object to be scavenged twice.

In retrospect I obviously should have thought of this earlier.

I've now added some asserts and template hacks to catch this class of error.

Thanks for the hint and sorry for wasting your time!

-Kenton

Michael Lippautz

unread,
Dec 13, 2018, 1:11:27 AM12/13/18
to Kenton Varda, Michael Lippautz, v8-users
No worries, running with TSAN for such repros would've immediately flushed it out as the global handle system is designed for single-threaded execution and intentionally avoids use of atomics.

Cheers, -Michael
Reply all
Reply to author
Forward
0 new messages