Helping with a memory leak issue

51 views
Skip to first unread message

Xida Chen

unread,
May 31, 2021, 3:33:12 PM5/31/21
to memor...@chromium.org
Hello memory-dev,

I have been debugging a memory leak issue for some time and I could not find the root cause of it.

Here is a summary of what the code looks like:
1. I create a worker thread:

2. Then there is a task scheduled to run on the worker_thread_, the task is running Foo() which looks like this:

Foo() {
  auto* rendering_context = MakeGarbageCollected<PaintRenderingContext2D>(...);
DCHECK(rendering_context);
return nullptr;
}

Here are my investigations:
1. I put a static instance counter in PaintRenderingContext2D, and found that the counter goes > 1000 within 30 seconds.
2. I put CHECK(false) in the PaintRenderingContext2D::Trace() and found that Trace() is never getting called.

Can someone help me and let me know what I should do next to debug this?

Thank you. Any suggestions are greatly appreciated!

Michael Lippautz

unread,
May 31, 2021, 3:47:22 PM5/31/21
to Xida Chen, memor...@chromium.org
Is the repro above part of some unit test set up or production code? Some unit test environments (e.g. blink_platform_unittests) only allow for manually executing GC and will not trigger it automatically.

If it's production, can you share the CL and test?
 

Thank you. Any suggestions are greatly appreciated!

--
You received this message because you are subscribed to the Google Groups "memory-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to memory-dev+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/memory-dev/CALRxhAsHybh-Zuexx15fyTPBVX1Qp63hnHyygb5PfDcufuk8jg%40mail.gmail.com.

Xida Chen

unread,
May 31, 2021, 3:50:45 PM5/31/21
to Michael Lippautz, memor...@chromium.org
Hi Michael,

Thank you for your quick response.

This is production code, not a testing environment.


And this is the test case: https://output.jsbin.com/xocoroh/4. Please run chrome with --enable-blink-features=CompositeBGColorAnimation to observe the leak.

Michael Lippautz

unread,
May 31, 2021, 6:49:07 PM5/31/21
to Xida Chen, Michael Lippautz, memor...@chromium.org
On Mon, May 31, 2021 at 9:50 PM Xida Chen <xida...@google.com> wrote:
Hi Michael,

Thank you for your quick response.

This is production code, not a testing environment.


And this is the test case: https://output.jsbin.com/xocoroh/4. Please run chrome with --enable-blink-features=CompositeBGColorAnimation to observe the leak.


Thanks a lot for the repro. Can you file a bug against "Blink>MemoryAllocator>GarbageCollection" and assign it to me?

The problem is that the paint worklet runs on a separate thread that does not have a V8 Isolate attached to it. Since unified heap all garbage collections are scheduled through V8's GC which considers both C++ and JS memory to avoid ping/pong effects. (The setup covers the main thread and workers but not worklets.)

Michael Lippautz

unread,
May 31, 2021, 6:55:35 PM5/31/21
to Michael Lippautz, Xida Chen, memor...@chromium.org
Maybe sent too early :)

Why is the native paint worklet not using a WorkletThreadHolder but rather creating its own thread with only minor setup?

Using the AnimationAndPaintWorkletThread would provide the right setup.

Xida Chen

unread,
May 31, 2021, 10:43:21 PM5/31/21
to Michael Lippautz, memor...@chromium.org, Robert Flack
In the native paint worklet case, all I need is a worker thread that can support GC, and that should be a much simpler setup compared with AnimationAndPaintWorkletThread. So that's why I just created a dedicated worker thread and set that to support GC, and I thought that should be enough.

So based on what you said, a worker thread that supports GC needs to have a V8 Isolate attached to it?

Michael Lippautz

unread,
Jun 1, 2021, 4:17:11 AM6/1/21
to Xida Chen, Michael Lippautz, memor...@chromium.org, Robert Flack
On Tue, Jun 1, 2021 at 4:43 AM Xida Chen <xida...@chromium.org> wrote:
In the native paint worklet case, all I need is a worker thread that can support GC, and that should be a much simpler setup compared with AnimationAndPaintWorkletThread. So that's why I just created a dedicated worker thread and set that to support GC, and I thought that should be enough.

So based on what you said, a worker thread that supports GC needs to have a V8 Isolate attached to it?

Yes, support without V8 is limited to merely allocation that leaks until the heap gets destroyed upon which memory is released.

The reasoning there is that GC scheduling is complex and we converged to a single implementation in V8 that we maintain for both JS and C++.

Xida Chen

unread,
Jun 1, 2021, 10:00:40 AM6/1/21
to Michael Lippautz, memor...@chromium.org, Robert Flack
Hi Michael,

Thank you for your help.

I forgot to mention, we also have memory leaks even with AnimationAndPaintWorkletThread.

This is the test case: https://output.jsbin.com/vafewif, and you can run it without any flag. In this test case, my logging shows that the number of PaintRenderingContext2D instances climbs > 1000 within 30 seconds, and also the PaintRenderingContext2D::Trace() function is never called.

The PaintRenderingContext2D instance is created here in CSSPaintDefinition::Paint(), which runs on a worker thread. The worker thread comes from here, which is the thread where global_scope is created. The global scope is created here, and as you can see, it is actually using the AnimationAndPaintWorkletThread.

Do you think this problem is the same as the native paint worklet? Or is this a separate issue?

Michael Lippautz

unread,
Jun 1, 2021, 2:47:03 PM6/1/21
to Xida Chen, Michael Lippautz, memor...@chromium.org, Robert Flack
On Tue, Jun 1, 2021 at 4:00 PM Xida Chen <xida...@chromium.org> wrote:
Hi Michael,

Thank you for your help.

I forgot to mention, we also have memory leaks even with AnimationAndPaintWorkletThread.

This is the test case: https://output.jsbin.com/vafewif, and you can run it without any flag. In this test case, my logging shows that the number of PaintRenderingContext2D instances climbs > 1000 within 30 seconds, and also the PaintRenderingContext2D::Trace() function is never called.

The PaintRenderingContext2D instance is created here in CSSPaintDefinition::Paint(), which runs on a worker thread. The worker thread comes from here, which is the thread where global_scope is created. The global scope is created here, and as you can see, it is actually using the AnimationAndPaintWorkletThread.

Do you think this problem is the same as the native paint worklet? Or is this a separate issue?

The setup is actually correct. With local profiling I see that all allocations are properly reported to V8 which makes scheduling decisions.

The schedule in this scenario is very loose though allowing to accumulate ~30MB of C++ memory until the first GC is performed (~13500 PaintRenderingContext2D instances). The GC schedule then stabilizes around ~10MB C++ memory (3000-4000 instances accumulated till a GC is triggered).

Feel free to file an issue against "Blink>MemoryAllocator>GarbageCollection" or "Blink>JavaScript>GC" and cc me. 

Although there's technically no leak here, there's certainly room for improvement. The background is that we need to be careful with scheduling the initial GC as it has consequences around startup (the earlier the first GC, the likelier we have many more until the system stabilizes). 

Robert Flack

unread,
Jun 1, 2021, 3:33:25 PM6/1/21
to Michael Lippautz, Xida Chen, memor...@chromium.org
Thanks for the information and for digging in to the test case!

As not being able to schedule GC without an isolate, as far as I know we can't use GarbageCollected types without GC. NativePaintWorklet, so requiring an isolate to schedule GC means that the only way we can instantiate / use Blink GC classes on another thread is to also instantiate an isolate on that thread, which IIUC has non-trivial performance implications (e.g. runtime and memory). Is this correct? If so, could we revisit this decision / consider other ways of scheduling GC?

Michael Lippautz

unread,
Jun 1, 2021, 3:59:38 PM6/1/21
to Robert Flack, Michael Lippautz, Xida Chen, memor...@chromium.org
On Tue, Jun 1, 2021 at 9:33 PM Robert Flack <fla...@chromium.org> wrote:
Thanks for the information and for digging in to the test case!

As not being able to schedule GC without an isolate, as far as I know we can't use GarbageCollected types without GC. NativePaintWorklet, so requiring an isolate to schedule GC means that the only way we can instantiate / use Blink GC classes on another thread is to also instantiate an isolate on that thread, which IIUC has non-trivial performance implications (e.g. runtime and memory). Is this correct? If so, could we revisit this decision / consider other ways of scheduling GC?


The setup is currently optimized for main thread, workers, and worklets with Isolates. The overhead should be mostly on the memory side (the last I heard was <1.5MB for about:blank) per Isolate with a one-time setup cost. There is only negligible V8 runtime overhead if no JS is executed. So for a C++ native worklet that is using a single backing thread the cost should be ~1.5MB in memory.

GC scheduling is a never-ending optimization story and prone to corner cases, which is why we migrated to a single scheduler for C++/JS. 

That said, we are finalizing the Oilpan library in V8 which allows for creating stand-alone C++ garbage collection heaps that could in theory leverage some of the (V8-internal) scheduling algorithms without actually instantiating an Isolate. I filed an issue to track this request.

Kentaro Hara

unread,
Jun 1, 2021, 9:43:34 PM6/1/21
to Michael Lippautz, Robert Flack, Xida Chen, memory-dev
Some users disable JavaScript (this is a legitimate use case). We need to make sure that Oilpan is triggered even if no v8::Isolate is created.

I thought it was working with the existing in-Blink Oilpan. Is it not the case with the Oilpan library?






--
Kentaro Hara, Tokyo

Michael Lippautz

unread,
Jun 2, 2021, 11:06:55 AM6/2/21
to Kentaro Hara, Michael Lippautz, Robert Flack, Xida Chen, memory-dev
On Wed, Jun 2, 2021 at 3:43 AM Kentaro Hara <har...@chromium.org> wrote:
Some users disable JavaScript (this is a legitimate use case). We need to make sure that Oilpan is triggered even if no v8::Isolate is created.


I think we always initialize an Isolate when we initialize a Blink platform. Where would the JS-disabled code go through to initialize a Blink platform?
 
I thought it was working with the existing in-Blink Oilpan. Is it not the case with the Oilpan library?


We removed stand-alone scheduling shortly after landing the unified heap as it wasted around ~20% GC time while not providing any clear memory benefit. The stand-alone schedule was incomplete compared to what we have with V8.

Kentaro Hara

unread,
Jun 2, 2021, 11:22:38 AM6/2/21
to Michael Lippautz, Robert Flack, Xida Chen, memory-dev
I read the code and realized my memory was wrong!

You're right -- we always create an isolate even when JavaScript is disabled. Sorry about the noise :)

--
Kentaro Hara, Tokyo

Xida Chen

unread,
Jun 3, 2021, 11:31:59 AM6/3/21
to Kentaro Hara, Michael Lippautz, Robert Flack, memory-dev
I would like to thank you all for your help on this issue, it was really helpful.

I have started working on a CL to address the problem of native paint worklet. I have some notes in here describing what I have tried so far. 

Right now I still see memory leaks with that CL. I am out of ideas, any suggestions from you would be really helpful to me.

Bruce Dawson

unread,
Jun 3, 2021, 11:54:33 AM6/3/21
to Xida Chen, Kentaro Hara, Michael Lippautz, Robert Flack, memory-dev
Is it possible that there is a generic problem of objects that are supposed to be garbage collected being missed under some circumstances, such as when running on secondary threads? The (now fixed) issue with IDBFactory related leaks (crbug.com/1183081) and some still open similar issues (crbug.com/1208061). Is there enough commonality here that we should be looking at automated ways to either fix or detect these?

Apologies if I am conflating different issues - this is out of my area of expertise.



--
Bruce Dawson, he/him

Michael Lippautz

unread,
Jun 3, 2021, 1:00:45 PM6/3/21
to Bruce Dawson, Xida Chen, Kentaro Hara, Michael Lippautz, Robert Flack, memory-dev


On Thu, Jun 3, 2021, 5:54 PM Bruce Dawson <bruce...@google.com> wrote:
Is it possible that there is a generic problem of objects that are supposed to be garbage collected being missed under some circumstances, such as when running on secondary threads?

There's per thread heaps and we never had such issues. 

We had very rarely issues with different threads keeping roots to other heap's objects causing leaks. I have not seen those in years now.

The (now fixed) issue with IDBFactory related leaks (crbug.com/1183081) and some still open similar issues (crbug.com/1208061). Is there enough commonality here that we should be looking at automated ways to either fix or detect these?

GC is the automated way :)

Fwiw, we have leak detectors that highlight the most prominent leaks of interesting node types.

Bruce Dawson

unread,
Jun 3, 2021, 1:56:33 PM6/3/21
to Michael Lippautz, Xida Chen, Kentaro Hara, Michael Lippautz, Robert Flack, memory-dev
I'm confused by the claim that we never had such issues. Isn't crbug.com/1183081 an example of such an issue? That was causing 50+ MB/day of leaks on my home laptop and was only noticed because there was an associated handle leak, and my (vague) understanding was that it was somehow GC related.
--
Bruce Dawson, he/him

Michael Lippautz

unread,
Jun 3, 2021, 2:40:56 PM6/3/21
to Bruce Dawson, Xida Chen, Kentaro Hara, Michael Lippautz, Robert Flack, memory-dev


On Thu, Jun 3, 2021, 7:56 PM Bruce Dawson <bruce...@google.com> wrote:
I'm confused by the claim that we never had such issues. Isn't crbug.com/1183081 an example of such an issue? That was causing 50+ MB/day of leaks on my home laptop and was only noticed because there was an associated handle leak, and my (vague) understanding was that it was somehow GC related.

From skimming the bug this was caused by a self-owned and thus rooting receiver. It's fair assessment that it was technically GCed. There's little actual GC involved there if the object holds a root to itself though. I am probably biased not counting these issues directly towards GC as there's nothing the GC could've helped you with in first place. (Self rooting is actually discouraged because of the leak potential and often hints to some design issue.)

Xida Chen

unread,
Jun 4, 2021, 1:22:49 PM6/4/21
to Michael Lippautz, Bruce Dawson, Kentaro Hara, Michael Lippautz, Robert Flack, memory-dev
Hi everyone,

Thank you all for your help on this issue.

Please allow me to re-state the problem:
We have a memory leak with native paint worklet. Test case here: https://output.jsbin.com/xocoroh/4, and running chrome with --enable-blink-features=CompositeBGColorAnimation should repro it.
The CSS paint worklet doesn't have a memory leak. Test case here: https://output.jsbin.com/vafewif

My attempts:
I tried to fix the memory leak with this CL: https://chromium-review.googlesource.com/c/chromium/src/+/2933677, which uses AnimationAndPaintWorkletThread, but that didn't work and memory leak issue persists.

I did more investigation: I found that for the CSS paint worklet case, the ThreadState::gc_phase_ is set to GCPhase::kSweeping at some point. But that didn't happen at all for native paint worklet. Here is the stack trace for setting the gc_phase_:
  • #0 0x560846ee7b29 base::debug::CollectStackTrace()

  • #1 0x560846e149a3 base::debug::StackTrace::StackTrace()

  • #2 0x5608466bb4f3 blink::ThreadState::SetGCPhase()

  • #3 0x5608466bf342 blink::ThreadState::AtomicPauseSweepAndCompact()

  • #4 0x5608466c0eb2 blink::UnifiedHeapController::TraceEpilogue()

  • #5 0x560845644b48 v8::internal::LocalEmbedderHeapTracer::TraceEpilogue()

  • #6 0x56084569aec0 v8::internal::Heap::PerformGarbageCollection()

  • #7 0x560845697e3f v8::internal::Heap::CollectGarbage()

  • #8 0x5608456a224d v8::internal::Heap::FinalizeIncrementalMarkingIfComplete()

  • #9 0x5608456b6bcb v8::internal::IncrementalMarkingJob::Task::RunInternal()


Any thoughts on why gc_phase_ isn't set for native paint worklet? And how can we let gc run in native paint worklet case?

Michael Lippautz

unread,
Jun 7, 2021, 8:28:40 AM6/7/21
to Xida Chen, Bruce Dawson, Kentaro Hara, Michael Lippautz, Robert Flack, memory-dev
Have been debugging this now, thanks for the elaborate description.

The setup using AnimationAndPaintWorkletThread is correct.

This is indeed an issue with the scheduler based on the "new" use case of never allocating JS but merely C++ which should only happen in this worklet case. I filed an issue and we will fix this asap.

-Michael
Reply all
Reply to author
Forward
0 new messages