Experiment: multi-threading Document::recalcStyle

richard....@arm.com

Jul 15, 2018, 7:40:34 PM

to blink-dev

Hi blink-dev,

A little while ago, I wrote about an experiment that broke style recalculation up into many incremental tasks and posted them back onto the main thread. A natural follow-on question is what happens when you post the tasks onto worker threads and recalculate styles in parallel, so I did some experimental hacking and got a prototype working well enough to run through some MotionMark benchmarks.
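To make the fan-out idea concrete, here is a minimal sketch of splitting per-element work across worker threads and joining before the main thread continues. It is not taken from the patch: `FakeElement`, `FakeStyle`, `ResolveStyle` and `RecalcStylesInParallel` are hypothetical names, and it assumes each element's style can be resolved independently, which real style recalc does not guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins; these are not Blink's Element or ComputedStyle.
struct FakeElement {
  int width;
  int height;
};

struct FakeStyle {
  int area;
};

// A pure function of one element, so calls for different elements cannot race.
FakeStyle ResolveStyle(const FakeElement& element) {
  return FakeStyle{element.width * element.height};
}

// Splits `elements` into contiguous chunks, resolves each chunk on its own
// thread, and joins before returning. Each worker writes only to its own
// slice of `styles`, so this sketch needs no locking.
std::vector<FakeStyle> RecalcStylesInParallel(
    const std::vector<FakeElement>& elements, std::size_t worker_count) {
  assert(worker_count > 0);
  std::vector<FakeStyle> styles(elements.size());
  if (elements.empty()) return styles;

  worker_count = std::min(worker_count, elements.size());
  const std::size_t chunk = (elements.size() + worker_count - 1) / worker_count;

  std::vector<std::thread> workers;
  for (std::size_t begin = 0; begin < elements.size(); begin += chunk) {
    const std::size_t end = std::min(begin + chunk, elements.size());
    workers.emplace_back([&elements, &styles, begin, end] {
      for (std::size_t i = begin; i < end; ++i)
        styles[i] = ResolveStyle(elements[i]);
    });
  }
  for (std::thread& worker : workers) worker.join();
  return styles;
}
```

Calling `RecalcStylesInParallel(elements, 4)` returns one `FakeStyle` per input element in the original order. Spawning threads per call keeps the sketch short; a real implementation would presumably post tasks to an existing thread pool instead.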

The results on the affected subset of MotionMark benchmarks (these are Animometer – Leaves, HTML Suite – CSS bouncing tagged images and HTML Suite – Leaves 2.0, which all heavily manipulate <img> tags) are encouraging given the limited tuning and optimization I've done so far:

On a powerful desktop system (without GPU raster):

FPS improved by up to 37%
Ramp@30FPS complexity increased by up to 39%
Ramp@60FPS complexity improved by up to 25%

I also tried it on several Arm Chromebooks:

On a Chromebook without GPU raster:

Ramp@30FPS complexity improved by roughly 8-16%

On a Chromebook with GPU raster:

Ramp@30FPS improved by 0-8% (GPU limited)

And also on a Pixel 2 Android device:

Animometer – Leaves, 30FPS complexity improved by 21%
Animometer – Leaves, at fixed complexity (702), frame-rate improved by 28%

There are some good reasons why this can't ship in its current state. I've only implemented support for <img> tags so far, and the browser crashes (a lot) on everything apart from MotionMark and Speedometer, which makes it difficult to assess how this might affect page loading etc. Performance on Speedometer was surprisingly similar (but Speedometer doesn't spend much time styling images). Most of the crashes appear to be due to memory ownership issues (e.g. objects created on a worker thread being cleaned up incorrectly), which I think are solvable with enough effort. There are also lots of test failures, and TSAN is often unhappy. However, I think this offers an interesting preview of what the pros and cons of such an approach might be, how it might scale, and what the challenges could be with multi-threading parts of Blink. If you're interested in the idea, feel free to try out the patch, take a look at the full results so far, and ask me any questions here or via richard....@arm.com.

Thanks

Richard

richard....@arm.com

Jul 17, 2018, 7:59:31 PM

to blink-dev

An additional thing I noticed: if you ramp up the number of style recalculation worker threads to (say) 8 or more, the most significant CPU overhead comes from trying to acquire a spin-lock in the partition allocator. If ComputedStyle::Create and so on could allocate from a thread-local Oilpan heap (and thus not require a lock), it would probably go quite a bit faster.
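The contrast between a lock-protected shared allocator and a per-thread one can be sketched as below. This is only an illustration of the principle: `BumpArena`, `LockedArena`, `ThreadLocalArena` and `AllocateOnWorkers` are hypothetical names, `std::mutex` stands in for the spin-lock, and nothing here reflects how PartitionAlloc or Oilpan are actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical bump allocator; not PartitionAlloc and not Oilpan.
class BumpArena {
 public:
  explicit BumpArena(std::size_t capacity)
      : buffer_(new unsigned char[capacity]), capacity_(capacity) {}

  // Returns `size` bytes (rounded up to max alignment), or nullptr when full.
  void* Allocate(std::size_t size) {
    const std::size_t align = alignof(std::max_align_t);
    const std::size_t rounded = (size + align - 1) / align * align;
    if (rounded > capacity_ - used_) return nullptr;
    void* result = buffer_.get() + used_;
    used_ += rounded;
    return result;
  }

  std::size_t used() const { return used_; }

 private:
  std::unique_ptr<unsigned char[]> buffer_;
  std::size_t capacity_;
  std::size_t used_ = 0;
};

// A single arena shared by all threads must serialize its callers.
class LockedArena {
 public:
  explicit LockedArena(std::size_t capacity) : arena_(capacity) {}
  void* Allocate(std::size_t size) {
    std::lock_guard<std::mutex> lock(mutex_);
    return arena_.Allocate(size);
  }

 private:
  std::mutex mutex_;
  BumpArena arena_;
};

// One arena per thread: allocation touches no shared state, so no lock.
// Caveat: the memory is released when the owning thread exits.
BumpArena& ThreadLocalArena() {
  thread_local BumpArena arena(1 << 16);
  return arena;
}

// Each worker makes `allocations` lock-free allocations from its own arena
// and reports how many bytes that arena has handed out.
std::vector<std::size_t> AllocateOnWorkers(std::size_t threads,
                                           std::size_t allocations,
                                           std::size_t size) {
  std::vector<std::size_t> used(threads, 0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < threads; ++t) {
    workers.emplace_back([&used, t, allocations, size] {
      for (std::size_t i = 0; i < allocations; ++i) {
        void* block = ThreadLocalArena().Allocate(size);
        assert(block != nullptr);
        (void)block;
      }
      used[t] = ThreadLocalArena().used();
    });
  }
  for (std::thread& worker : workers) worker.join();
  return used;
}
```

The caveat in the comment matters for any real design: objects that must outlive the worker thread would need some form of handover, which the sketch does not attempt.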

Best

Richard

Hayato Ito

Jul 19, 2018, 1:29:52 AM

to richard....@arm.com, blink-dev

Thank you for uploading the CL. I think we can get a lot of insight from it.

We, the dom-team, will take a closer look at the CL later and follow up there.

In any case, please feel free to contact dom...@chromium.org. We are happy to discuss the best approach to making Blink's core/ codebase ready for the many-core era.

That has been a challenging project for us, and at this point it needs someone to do a lot of work to get it done.

--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/1bddcb82-ee51-4cde-a80d-47b589e267f4%40chromium.org.

--

Hayato

Rune Lillesveen

Jul 25, 2018, 3:43:49 AM

to richard....@arm.com, blink-dev

Hi Richard,

Thanks for exploring this.

There are several aspects that make this non-trivial with the current design. We should consider the following (non-exhaustive):

* Immutable DOM during style recalc

We currently set flags during selector matching.

* Immutable ComputedStyle once set for an element

We currently set flags on parent ComputedStyle during style recalc. There has been work done to remedy this lately, but not completely finished.

* MatchedPropertiesCache and thread safety

This is a global cache. I couldn't see that you have done anything in your patch to make it thread-safe.
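As a rough illustration of the last point, one common way to make a shared cache safe for concurrent lookups is to split it into independently locked shards. The sketch below is an assumption-laden example, not MatchedPropertiesCache's real interface: `ShardedStyleCache`, `CachedStyle`, `GetOrCompute` and `CountComputations` are hypothetical names, and the cache key is simplified to an integer.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical cached value; immutable once published through the cache.
struct CachedStyle {
  std::string description;
};

class ShardedStyleCache {
 public:
  // Returns the entry for `key`, running `compute` only on a miss. The shard
  // lock is held while computing, which is simple but serializes that shard.
  std::shared_ptr<const CachedStyle> GetOrCompute(
      std::size_t key, const std::function<CachedStyle()>& compute) {
    assert(compute != nullptr);
    Shard& shard = shards_[key % kShardCount];
    std::lock_guard<std::mutex> lock(shard.mutex);
    auto it = shard.entries.find(key);
    if (it != shard.entries.end()) return it->second;
    std::shared_ptr<const CachedStyle> value =
        std::make_shared<CachedStyle>(compute());
    shard.entries.emplace(key, value);
    return value;
  }

  std::size_t Size() {
    std::size_t total = 0;
    for (Shard& shard : shards_) {
      std::lock_guard<std::mutex> lock(shard.mutex);
      total += shard.entries.size();
    }
    return total;
  }

 private:
  static constexpr std::size_t kShardCount = 8;
  struct Shard {
    std::mutex mutex;
    std::unordered_map<std::size_t, std::shared_ptr<const CachedStyle>> entries;
  };
  std::array<Shard, kShardCount> shards_;
};

// Has `threads` workers all request keys [0, key_count) and returns how many
// times the compute callback actually ran.
std::size_t CountComputations(ShardedStyleCache& cache, std::size_t threads,
                              std::size_t key_count) {
  std::atomic<std::size_t> computations{0};
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < threads; ++t) {
    workers.emplace_back([&cache, &computations, key_count] {
      for (std::size_t key = 0; key < key_count; ++key) {
        cache.GetOrCompute(key, [&computations, key] {
          ++computations;
          return CachedStyle{"style-" + std::to_string(key)};
        });
      }
    });
  }
  for (std::thread& worker : workers) worker.join();
  return computations.load();
}
```

Handing out `shared_ptr<const CachedStyle>` also touches on the immutability point above: readers on other threads can hold an entry without being able to mutate it. Whether locking, per-thread caches, or something else suits the real cache is a design question the sketch does not answer.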

If you're willing to spend more time on this, it would be great. If so, we should explore this through a design document and discussion.


richard....@arm.com

Jul 30, 2018, 8:09:05 AM

to blink-dev, richard....@arm.com

(Back from holiday). Interesting, so there are a few more things for the task list I think are necessary too:

* Migrate more of the style classes (ComputedStyle etc) to Oilpan to avoid the spinlock in the partition allocator.
* Figure out the best way to get Oilpan working with fan-out multithreading (it might work by adopting newly-allocated objects back onto the main thread).
* There are currently things like the TaskSchedulerForegroundWorker thread used for V8 tasks; it would be good if Blink could use those threads too if it needs to multi-thread stuff, which may require some re-plumbing (I think those threads are currently owned by V8).
* Get the breadth-first selector logic working and well-debugged (foundational task).
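One possible reading of the breadth-first item is sketched below: because a child's style depends on its parent's, nodes can be grouped by tree depth and each depth level processed in parallel once the level above has finished. This is an assumption about what "breadth-first" means here, not a description of the patch; `Node`, `GroupByDepth` and `ResolveBreadthFirst` are hypothetical names, and inheritance is reduced to adding integers.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical flattened tree node.
struct Node {
  int parent;     // Index of the parent node, or -1 for the root.
  int own_value;  // Stand-in for the node's non-inherited contribution.
};

// Groups node indices by depth. Assumes parents precede children in `nodes`.
std::vector<std::vector<std::size_t>> GroupByDepth(
    const std::vector<Node>& nodes) {
  std::vector<std::size_t> depth(nodes.size(), 0);
  std::vector<std::vector<std::size_t>> levels;
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    if (nodes[i].parent >= 0) {
      assert(static_cast<std::size_t>(nodes[i].parent) < i);
      depth[i] = depth[nodes[i].parent] + 1;
    }
    if (depth[i] >= levels.size()) levels.resize(depth[i] + 1);
    levels[depth[i]].push_back(i);
  }
  return levels;
}

// Resolves one level at a time. Tasks within a level only read results from
// the previous level and write to their own slot, so they do not race.
std::vector<int> ResolveBreadthFirst(const std::vector<Node>& nodes) {
  std::vector<int> resolved(nodes.size(), 0);
  for (const std::vector<std::size_t>& level : GroupByDepth(nodes)) {
    std::vector<std::future<void>> pending;
    for (std::size_t index : level) {
      pending.push_back(
          std::async(std::launch::async, [&nodes, &resolved, index] {
            const Node& node = nodes[index];
            const int inherited =
                node.parent >= 0 ? resolved[node.parent] : 0;
            resolved[index] = inherited + node.own_value;
          }));
    }
    // Barrier: the next level must not start until this one is complete.
    for (std::future<void>& task : pending) task.get();
  }
  return resolved;
}
```

One task per node is far too fine-grained for real use; batching nodes within a level would be the obvious refinement.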

If that sounds good, let me know. I guess the next stage would be a tracking bug and some sort of high-level design document?

Best

Richard

Rune Lillesveen

Aug 1, 2018, 3:38:55 AM

to richard....@arm.com, blink-dev

On Mon, Jul 30, 2018 at 2:09 PM <richard....@arm.com> wrote:

> (Back from holiday). Interesting, so there are a few more things for the task list I think are necessary too:
> * Migrate more of the style classes (ComputedStyle etc) to Oilpan to avoid the spinlock in the partition allocator

Moving ComputedStyle to Oilpan would affect the layout tree and the layout fragments in LayoutNG, as they reference ComputedStyle (currently kept alive with a ref counter). It was at some point decided not to move the layout structure to Oilpan for performance reasons. LayoutNG objects were originally on Oilpan but were moved off Oilpan for the same reasons.

> * Figure out the best way to get Oilpan working with fan-out multithreading (it might work by adopting newly-allocated objects back onto the main thread).

Servo's Stylo engine uses a work-stealing strategy. It would make sense to study what they've done.
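For readers unfamiliar with the term, a generic work-stealing loop looks roughly like the sketch below: each worker drains its own queue first and takes work from the other queues only when it runs dry. This is a textbook-style illustration, not Stylo's implementation; `WorkerQueue`, `Task` and `RunWithStealing` are hypothetical names, the queues are mutex-protected rather than lock-free, and tasks are assumed not to spawn further tasks.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Hypothetical per-worker queue, guarded by a mutex for simplicity.
class WorkerQueue {
 public:
  void Push(Task task) {
    assert(task != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(std::move(task));
  }

  // The owning worker takes the newest task (back of the deque).
  bool PopLocal(Task& out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    out = std::move(tasks_.back());
    tasks_.pop_back();
    return true;
  }

  // A thief takes the oldest task (front of the deque).
  bool Steal(Task& out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    out = std::move(tasks_.front());
    tasks_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::deque<Task> tasks_;
};

// Runs every task in `initial`, one worker per inner vector, and returns how
// many tasks were executed by a worker other than the one they were given to.
std::size_t RunWithStealing(std::vector<std::vector<Task>> initial) {
  std::vector<WorkerQueue> queues(initial.size());
  for (std::size_t i = 0; i < initial.size(); ++i)
    for (Task& task : initial[i]) queues[i].Push(std::move(task));

  std::atomic<std::size_t> stolen{0};
  std::vector<std::thread> workers;
  for (std::size_t self = 0; self < queues.size(); ++self) {
    workers.emplace_back([&queues, &stolen, self] {
      Task task;
      for (;;) {
        if (queues[self].PopLocal(task)) {
          task();
          continue;
        }
        bool found = false;
        for (std::size_t offset = 1; offset < queues.size() && !found; ++offset)
          found = queues[(self + offset) % queues.size()].Steal(task);
        // No new tasks are ever added, so empty queues mean we are done.
        if (!found) return;
        ++stolen;
        task();
      }
    });
  }
  for (std::thread& worker : workers) worker.join();
  return stolen.load();
}
```

The early-exit rule only works because the task set is fixed up front; a scheduler whose tasks can enqueue more work needs a proper termination protocol.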

> * There are currently things like the TaskSchedulerForegroundWorker thread used for V8 tasks; it would be good if Blink could use those threads too if it needs to multi-thread stuff, which may require some re-plumbing (I think those threads are currently owned by V8).

I haven't worked with threads in Chromium/Blink, so I don't have any useful input here.

> * Get the breadth-first selector logic working and well-debugged (foundational task).
> If that sounds good, let me know. I guess the next stage would be a tracking bug and some sort of high-level design document?

Yes.

richard....@arm.com

Aug 1, 2018, 4:56:29 AM

to blink-dev, richard....@arm.com

On Wednesday, August 1, 2018 at 8:38:55 AM UTC+1, Rune Lillesveen wrote:

> On Mon, Jul 30, 2018 at 2:09 PM <richard....@arm.com> wrote:
>> (Back from holiday). Interesting, so there are a few more things for the task list I think are necessary too:
>> * Migrate more of the style classes (ComputedStyle etc) to Oilpan to avoid the spinlock in the partition allocator
> Moving ComputedStyle to Oilpan would affect the layout tree and the layout fragments in LayoutNG, as they reference ComputedStyle (currently kept alive with a ref counter). It was at some point decided not to move the layout structure to Oilpan for performance reasons. LayoutNG objects were originally on Oilpan but were moved off Oilpan for the same reasons.

I should clarify that moving stuff to Oilpan shouldn't be necessary (since there are performance benefits without doing so), but spinning in the partition allocator does waste energy (important for us as we're focussed mostly on mobile devices, Chromebooks etc), so it'd be a nice-to-have in the long term if I can figure out how to make this thing work.
