Why is concurrent marking worker number limited to no more than 7?


Jianxiao Lu

Jun 21, 2022, 2:33:03 AM
to v8-dev
https://source.chromium.org/chromium/chromium/src/+/main:v8/src/heap/concurrent-marking.h;l=59?q=kMaxTasks&sq=&ss=chromium

The code comment above seems out of date, so I wonder whether this limitation is intentional or the comment just hasn't been updated in time.
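For context, the cap in question looks roughly like the sketch below (my paraphrase, not the verbatim header; the exact comment and surrounding code may differ):

// Paraphrased sketch of the cap in question, not the real header.
class ConcurrentMarking {
 public:
  // Background marking never uses more than this many tasks,
  // regardless of how many cores the machine has.
  static constexpr int kMaxTasks = 7;
};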

Here are snapshots from webtooling (d8).

[screenshot: 7.png]

[screenshot: 15.png]

It seems that we could benefit from more workers?

Michael Lippautz

Jun 21, 2022, 3:08:10 AM
to v8-dev
This really depends on the average size of the heap. The numbers were chosen as a compromise between small and large heaps, and between low-end and desktop devices. Also, the algorithm doesn't scale linearly: there are small trade-offs here and there that add up as the number of tasks increases.

Did the absolute time actually improve? I see 1280ms-1340ms in the first screenshot vs 1520ms-1580ms. You could check marked bytes/s as a proxy for whether the helper tasks are still efficient.

-Michael
 


Jianxiao Lu

Jun 21, 2022, 5:36:21 AM
to v8-dev
Thanks for the explanation. The second screenshot is actually 1520ms~1560ms. The snapshots were taken from the first major GC while running webtooling, because I believe the first major GCs are relatively predictable and consistent (maybe I am wrong).

I will check your suggestion later. Does the gc-tracer already record marked bytes/ms, or do I need to implement that myself?

Thanks,
Jianxiao 

Michael Lippautz

Jun 21, 2022, 2:03:30 PM
to v8-dev
On Tue, Jun 21, 2022 at 11:36 AM Jianxiao Lu <jianx...@intel.com> wrote:
> Thanks for the explanation. The second screenshot is actually 1520ms~1560ms. The snapshots were taken from the first major GC while running webtooling, because I believe the first major GCs are relatively predictable and consistent (maybe I am wrong).

> I will check your suggestion later. Does the gc-tracer already record marked bytes/ms, or do I need to implement that myself?

--trace-concurrent-marking will log concurrently marked bytes and timespans, which can be used to compute the speed. Any custom logging will also do, though.
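For example, something like the following can aggregate that output into a single KB/ms number. This is only a sketch: the sscanf pattern assumes a log line shaped like "Task 1 concurrently marked 2048kB in 3.50ms", so adapt it to whatever your build actually prints.

// Sketch: sum marked KB and elapsed ms from --trace-concurrent-marking
// output on stdin and print the aggregate speed. The exact log-line
// format is an assumption here -- adjust the sscanf pattern as needed.
//
//   out/x64.release/d8 --trace-concurrent-marking your_benchmark.js 2>&1 \
//     | ./marking_speed
#include <cstdio>
#include <iostream>
#include <string>

int main() {
  double total_kb = 0, total_ms = 0;
  std::string line;
  while (std::getline(std::cin, line)) {
    int task = 0;
    double kb = 0, ms = 0;
    if (std::sscanf(line.c_str(),
                    "Task %d concurrently marked %lfkB in %lfms",
                    &task, &kb, &ms) == 3) {
      total_kb += kb;
      total_ms += ms;
    }
  }
  if (total_ms > 0) {
    std::printf("concurrent marking: %.2f KB/ms\n", total_kb / total_ms);
  }
  return 0;
}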

-Michael 
 

Jianxiao Lu

Jun 23, 2022, 4:06:21 AM
to v8-dev
Tested on an AWS server:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz (16 cores)

Running webtooling with d8

Here is the v8.gc tracing data:
[screenshot: baseline.png]

[screenshot: 15worker.png]


For the concurrent marking speed, I simply summed up the KB and ms in the trace log and divided them (KB/ms).
For each worker:
baseline:   640.5, 669.86, 684.72
15 workers: 508.91, 515.51, 503.51

Maybe it would be better to replace the fixed kMaxTasks with the core count, just like parallel compaction does?
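As a rough sketch of what I mean (v8::Platform::NumberOfWorkerThreads() is an existing embedder API; the policy below is only illustrative, not a proposed patch):

// Illustrative only: derive the marking-task cap from the platform's
// worker-thread count instead of a fixed kMaxTasks = 7.
#include <algorithm>
#include "include/v8-platform.h"

int MaxConcurrentMarkingTasks(v8::Platform* platform) {
  // Keep at least one helper; the point is that the cap would track
  // the hardware rather than a hard-coded constant.
  return std::max(1, platform->NumberOfWorkerThreads());
}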

Regards,
Jianxiao

Michael Lippautz

Jun 23, 2022, 11:10:12 AM
to v8-dev
On Thu, Jun 23, 2022 at 10:06 AM Jianxiao Lu <jianx...@intel.com> wrote:
> Tested on an AWS server:
> Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz (16 cores)

> Running webtooling with d8

> Here is the v8.gc tracing data:
> [screenshot: baseline.png]

> [screenshot: 15worker.png]

> For the concurrent marking speed, I simply summed up the KB and ms in the trace log and divided them (KB/ms).
> For each worker:
> baseline:   640.5, 669.86, 684.72
> 15 workers: 508.91, 515.51, 503.51

So that's 20-25% less efficient in terms of cycles. It's hard to say whether the trade-off in terms of absolute time is worth it.

Can you relate the extra time spent in concurrent marking to the time saved on the main thread (the incremental steps above)?
Parallel compaction is more coarse-grained and scales a little better, I think.

I am curious whether you can quantify that trade-off (see above). What we can always do is expose a flag for users that don't care about this trade-off.
 

Jianxiao Lu

Jun 23, 2022, 10:47:14 PM
to v8-dev
> Can you relate the extra time spent in concurrent marking to the time saved on the main thread (the incremental steps above)?

I am not 100% sure about that, but here is my understanding:
When more workers are activated, the per-worker speed may decrease, but the total speed will increase.
Rough calculation:
508.91 KB/ms/worker * 15 workers = 7633.65 KB/ms
669.86 KB/ms/worker * 7 workers = 4689.02 KB/ms

The concurrent marking tasks are scheduled when incremental marking starts. Each incremental marking step needs to mark "bytes_to_process" heap objects. When the local worklist is empty, incremental marking completes and invokes the major GC. That means the main thread's local worklist can only become empty once concurrent marking has completed most of the marking work; otherwise the main thread can always steal another segment from the global worklist. If concurrent marking has more workers and can mark heap objects faster, incremental marking can also complete earlier. This may explain why the occurrence and duration of V8.GC_MC_INCREMENTAL decreased, and execution also benefits because the write barrier introduced by incremental marking goes away sooner.
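As pseudocode, my mental model is roughly the following (names and data structures are made up, not V8's actual code):

// Pseudocode of my mental model, not the real V8 code: an incremental
// step on the main thread marks up to bytes_to_process, refilling its
// local worklist from the shared global worklist when the local one
// runs dry. It can only report "done" once the concurrent workers have
// left nothing to steal.
#include <cstddef>
#include <deque>
#include <vector>

struct HeapObject { size_t size; };

struct MarkingWorklists {
  std::vector<HeapObject*> local;               // main-thread local worklist
  std::deque<std::vector<HeapObject*>> global;  // shared segments

  bool StealSegmentFromGlobal() {
    if (global.empty()) return false;
    local = std::move(global.front());
    global.pop_front();
    return true;
  }
};

// Returns true if marking can be finalized after this step.
bool IncrementalMarkingStep(MarkingWorklists& w, size_t bytes_to_process) {
  size_t marked = 0;
  while (marked < bytes_to_process) {
    if (w.local.empty() && !w.StealSegmentFromGlobal()) {
      return true;  // both worklists empty: concurrent marking caught up
    }
    HeapObject* obj = w.local.back();
    w.local.pop_back();
    marked += obj->size;  // stand-in for visiting the object's fields
  }
  return false;  // budget used up; more steps (or faster workers) needed
}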

> What we can always do is expose a flag for users that don't care about this trade-off.
I suggest exposing such a flag.
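For illustration, something in the style of src/flags/flag-definitions.h (the flag name and default here are made up, not an existing flag):

// Illustrative only -- a hypothetical flag, not part of V8 today.
DEFINE_INT(max_concurrent_marking_tasks, 0,
           "override the cap on concurrent marking tasks (0 = default)")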

This is not urgent for me; I just wanted to share some findings and questions with the community. Thanks!

Regards,
Jianxiao

Michael Lippautz

Jul 11, 2022, 12:05:41 PM
to v8-dev
Sorry for not coming back to this thread earlier.

On Fri, Jun 24, 2022 at 4:47 AM Jianxiao Lu <jianx...@intel.com> wrote:
> > Can you relate the extra time spent in concurrent marking to the time saved on the main thread (the incremental steps above)?

> I am not 100% sure about that, but here is my understanding:
> When more workers are activated, the per-worker speed may decrease, but the total speed will increase.
> Rough calculation:
> 508.91 KB/ms/worker * 15 workers = 7633.65 KB/ms
> 669.86 KB/ms/worker * 7 workers = 4689.02 KB/ms

That's a 25% drop in throughput per worker.

The overall time improves, but that comes at quite some CPU (and possibly battery) cost.
 

> The concurrent marking tasks are scheduled when incremental marking starts. Each incremental marking step needs to mark "bytes_to_process" heap objects. When the local worklist is empty, incremental marking completes and invokes the major GC. That means the main thread's local worklist can only become empty once concurrent marking has completed most of the marking work; otherwise the main thread can always steal another segment from the global worklist. If concurrent marking has more workers and can mark heap objects faster, incremental marking can also complete earlier. This may explain why the occurrence and duration of V8.GC_MC_INCREMENTAL decreased, and execution also benefits because the write barrier introduced by incremental marking goes away sooner.

> > What we can always do is expose a flag for users that don't care about this trade-off.
> I suggest exposing such a flag.

That's always an option.
 
> This is not urgent for me; I just wanted to share some findings and questions with the community. Thanks!

 
Thanks a lot for doing these measurements. The trade-off here is definitely non-trivial, and we need to find the right balance between CPU/battery usage and main-thread speed.
 