Why is concurrent marking worker number limited to no more than 7?


Jianxiao Lu

Jun 21, 2022, 2:33:03 AM
to v8-dev
https://source.chromium.org/chromium/chromium/src/+/main:v8/src/heap/concurrent-marking.h;l=59?q=kMaxTasks&sq=&ss=chromium

The code comment above seems out of date, so I wonder whether this limitation is intentional or the comment just hasn't been updated in time.
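For context, the cap in question looks roughly like the sketch below (my paraphrase, not the verbatim header; the exact comment and surrounding code may differ):

// Paraphrased sketch of the cap in question, not the real header.
class ConcurrentMarking {
 public:
  // Background marking never uses more than this many tasks,
  // regardless of how many cores the machine has.
  static constexpr int kMaxTasks = 7;
};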

Here are snapshots from webtooling (d8).

[screenshot: 7.png]

[screenshot: 15.png]

It seems that we could benefit from more workers?

Michael Lippautz

Jun 21, 2022, 3:08:10 AM
to v8-dev
This really depends on the average size of the heap. The numbers were chosen as a compromise between small and large heaps, and between low-end and desktop devices. Also, the algorithm doesn't scale linearly: there are small trade-offs here and there that add up as the number of tasks increases.

Did the absolute time actually improve? I see 1280ms-1340ms in the first screenshot vs 1520ms-1580ms. You could check marked bytes/s as a proxy for whether the helper tasks are still efficient.

-Michael
 


Jianxiao Lu

Jun 21, 2022, 5:36:21 AM
to v8-dev
Thanks for the explanation. The second screenshot is actually 1520ms~1560ms. The snapshots were taken from the first major GC while running webtooling, because I believe the first major GCs are relatively predictable and consistent (maybe I am wrong).

I will check your suggestion later. Does the gc-tracer already record marked bytes/ms, or do I need to implement that myself?

Thanks,
Jianxiao 

Michael Lippautz

Jun 21, 2022, 2:03:30 PM
to v8-dev
On Tue, Jun 21, 2022 at 11:36 AM Jianxiao Lu <jianx...@intel.com> wrote:
> Thanks for the explanation. The second screenshot is actually 1520ms~1560ms. The snapshots were taken from the first major GC while running webtooling, because I believe the first major GCs are relatively predictable and consistent (maybe I am wrong).

> I will check your suggestion later. Does the gc-tracer already record marked bytes/ms, or do I need to implement that myself?

--trace-concurrent-marking will log concurrently marked bytes and timespans, which can be used to compute the speed. Any custom logging will also do, though.
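For example, something like the following can aggregate that output into a single KB/ms number. This is only a sketch: the sscanf pattern assumes a log line shaped like "Task 1 concurrently marked 2048kB in 3.50ms", so adapt it to whatever your build actually prints.

// Sketch: sum marked KB and elapsed ms from --trace-concurrent-marking
// output on stdin and print the aggregate speed. The exact log-line
// format is an assumption here -- adjust the sscanf pattern as needed.
//
//   out/x64.release/d8 --trace-concurrent-marking your_benchmark.js 2>&1 \
//     | ./marking_speed
#include <cstdio>
#include <iostream>
#include <string>

int main() {
  double total_kb = 0, total_ms = 0;
  std::string line;
  while (std::getline(std::cin, line)) {
    int task = 0;
    double kb = 0, ms = 0;
    if (std::sscanf(line.c_str(),
                    "Task %d concurrently marked %lfkB in %lfms",
                    &task, &kb, &ms) == 3) {
      total_kb += kb;
      total_ms += ms;
    }
  }
  if (total_ms > 0) {
    std::printf("concurrent marking: %.2f KB/ms\n", total_kb / total_ms);
  }
  return 0;
}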

-Michael 
 

Jianxiao Lu

Jun 23, 2022, 4:06:21 AM
to v8-dev
Tested on an AWS server:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz (16 cores)

Running webtooling with d8

Here is the v8.gc tracing data:
[screenshot: baseline.png]

[screenshot: 15worker.png]


For the concurrent marking speed, I simply summed up the KB and ms in the trace log and divided them (KB/ms).
For each worker:
baseline:   640.5, 669.86, 684.72
15 workers: 508.91, 515.51, 503.51

Maybe it would be better to replace the fixed kMaxTasks with the core count, just like parallel compaction does?
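As a rough sketch of what I mean (v8::Platform::NumberOfWorkerThreads() is an existing embedder API; the policy below is only illustrative, not a proposed patch):

// Illustrative only: derive the marking-task cap from the platform's
// worker-thread count instead of a fixed kMaxTasks = 7.
#include <algorithm>
#include "include/v8-platform.h"

int MaxConcurrentMarkingTasks(v8::Platform* platform) {
  // Keep at least one helper; the point is that the cap would track
  // the hardware rather than a hard-coded constant.
  return std::max(1, platform->NumberOfWorkerThreads());
}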

Regards,
Jianxiao

Michael Lippautz

Jun 23, 2022, 11:10:12 AM
to v8-dev
On Thu, Jun 23, 2022 at 10:06 AM Jianxiao Lu <jianx...@intel.com> wrote:
> Tested on an AWS server:
> Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz (16 cores)

> Running webtooling with d8

> Here is the v8.gc tracing data:
> [screenshot: baseline.png]

> [screenshot: 15worker.png]

> For the concurrent marking speed, I simply summed up the KB and ms in the trace log and divided them (KB/ms).
> For each worker:
> baseline:   640.5, 669.86, 684.72
> 15 workers: 508.91, 515.51, 503.51

So that's 20-25% less efficient in terms of cycles. It's hard to say whether the trade-off in terms of absolute time is worth it.

Can you relate the extra time spent in concurrent marking to the time saved on the main thread (the incremental steps above)?
Parallel compaction is more coarse-grained and scales a little better, I think.

I am curious whether you can quantify that trade-off (see above). What we can always do is expose a flag for users that don't care about this trade-off.
 

Jianxiao Lu

Jun 23, 2022, 10:47:14 PM
to v8-dev
> Can you relate the extra time spent in concurrent marking to the time saved on the main thread (the incremental steps above)?

I am not 100% sure about that, but here is my understanding:
When more workers are activated, the per-worker speed may decrease, but the total speed will increase.
Rough calculation:
508.91 KB/ms/worker * 15 workers = 7633.65 KB/ms
669.86 KB/ms/worker * 7 workers = 4689.02 KB/ms

The concurrent marking tasks are scheduled when incremental marking starts. Each incremental marking step needs to mark "bytes_to_process" heap objects. When the local worklist is empty, incremental marking completes and invokes the major GC. That means the main thread's local worklist can only become empty once concurrent marking has completed most of the marking work; otherwise the main thread can always steal another segment from the global worklist. If concurrent marking has more workers and can mark heap objects faster, incremental marking can also complete earlier. This may explain why the occurrence and duration of V8.GC_MC_INCREMENTAL decreased, and execution also benefits because the write barrier introduced by incremental marking goes away sooner.
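As pseudocode, my mental model is roughly the following (names and data structures are made up, not V8's actual code):

// Pseudocode of my mental model, not the real V8 code: an incremental
// step on the main thread marks up to bytes_to_process, refilling its
// local worklist from the shared global worklist when the local one
// runs dry. It can only report "done" once the concurrent workers have
// left nothing to steal.
#include <cstddef>
#include <deque>
#include <vector>

struct HeapObject { size_t size; };

struct MarkingWorklists {
  std::vector<HeapObject*> local;               // main-thread local worklist
  std::deque<std::vector<HeapObject*>> global;  // shared segments

  bool StealSegmentFromGlobal() {
    if (global.empty()) return false;
    local = std::move(global.front());
    global.pop_front();
    return true;
  }
};

// Returns true if marking can be finalized after this step.
bool IncrementalMarkingStep(MarkingWorklists& w, size_t bytes_to_process) {
  size_t marked = 0;
  while (marked < bytes_to_process) {
    if (w.local.empty() && !w.StealSegmentFromGlobal()) {
      return true;  // both worklists empty: concurrent marking caught up
    }
    HeapObject* obj = w.local.back();
    w.local.pop_back();
    marked += obj->size;  // stand-in for visiting the object's fields
  }
  return false;  // budget used up; more steps (or faster workers) needed
}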

> What we can always do is expose a flag for users that don't care about this trade-off.
I suggest exposing such a flag.
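For illustration, something in the style of src/flags/flag-definitions.h (the flag name and default here are made up, not an existing flag):

// Illustrative only -- a hypothetical flag, not part of V8 today.
DEFINE_INT(max_concurrent_marking_tasks, 0,
           "override the cap on concurrent marking tasks (0 = default)")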

This is not urgent for me; I just wanted to share some findings and questions with the community. Thanks!

Regards,
Jianxiao

Michael Lippautz

Jul 11, 2022, 12:05:41 PM
to v8-dev
Sorry for not coming back to this thread earlier.

On Fri, Jun 24, 2022 at 4:47 AM Jianxiao Lu <jianx...@intel.com> wrote:
> > Can you relate the extra time spent in concurrent marking to the time saved on the main thread (the incremental steps above)?

> I am not 100% sure about that, but here is my understanding:
> When more workers are activated, the per-worker speed may decrease, but the total speed will increase.
> Rough calculation:
> 508.91 KB/ms/worker * 15 workers = 7633.65 KB/ms
> 669.86 KB/ms/worker * 7 workers = 4689.02 KB/ms

That's a 25% drop in throughput per worker.

The overall time improves, but that comes at quite some CPU (and possibly battery) cost.
 

> The concurrent marking tasks are scheduled when incremental marking starts. Each incremental marking step needs to mark "bytes_to_process" heap objects. When the local worklist is empty, incremental marking completes and invokes the major GC. That means the main thread's local worklist can only become empty once concurrent marking has completed most of the marking work; otherwise the main thread can always steal another segment from the global worklist. If concurrent marking has more workers and can mark heap objects faster, incremental marking can also complete earlier. This may explain why the occurrence and duration of V8.GC_MC_INCREMENTAL decreased, and execution also benefits because the write barrier introduced by incremental marking goes away sooner.

> > What we can always do is expose a flag for users that don't care about this trade-off.
> I suggest exposing such a flag.

That's always an option.
 
> This is not urgent for me; I just wanted to share some findings and questions with the community. Thanks!

 
Thanks a lot for doing these measurements. The trade-off here is definitely non-trivial, and we need to find the right balance between CPU/battery usage and main-thread speed.
 