Making StringImpl::ref_count_ atomic

63 views
Skip to first unread message

Kevin Babbitt

unread,
Dec 6, 2021, 6:36:28 PM12/6/21
to platform-architecture-dev, Luis Fernando Pardo Sixtos
Hi all,

I wanted to start a conversation on landing this change: https://crrev.com/c/3152354

Luis and I have been working on reducing the performance impact of it, and we're at a point where I'd like to start getting broad testing on it to identify additional areas that need to be addressed. My hope is to land just after the branch for M98 this week and use the extended M99 milestone to follow up as needed; please let me know your thoughts.

Here's how the performance currently stands. Pinpoint results below are from patchset 8 of the change; the latest patchset adds back some DCHECK-only verification and should not affect the release build.

Android binary size: +5,896 bytes. We attempted to reduce this impact by out-of-lining StringImpl::AddRef and Release but found it produced unacceptable speed regressions. See comment 48 in the bug for details.

Speedometer2:
No significant regressions on Windows or Android.

Micro-benchmarks: In short, there are some regressions. A lot of these are cost-of-doing-business on heavily exercised getAttribute and setAttribute paths; see comment 50 in the bug for details. There are also regressions CSS which will need follow-up during M99. Detailed results:

blink_perf:
bindings: Windows 14 regressions, 26 insignificant, 11 improvements. Android 12 regressions, 25 insignificant, 14 improvements.
css: Windows 14 regressions, 29 insignificant, 2 improvements. Android 16 regressions, 24 insignificant, 4 improvements.
dom: Windows 4 regressions, 5 insignificant, 2 improvements. Android 5 regressions, 5 insignificant, 1 improvement.
parser: Windows 6 regressions, 18 insignificant, 7 improvements. Android 18 regressions, 4 insignificant, 9 improvements.

dromaeo:
dom-attr: Windows 5 regressions, 3 insignificant, 0 improvements. Android 4 regressions, 4 insignificant, 0 improvements.
dom-modify: Windows 1 regression, 7 insignificant, 0 improvements. Android 0 regressions, 8 insignificant, 0 improvements.
dom-query: Windows 6 regressions, 5 insignificant, 0 improvements. Android 1 regression, 4 insignificant, 6 improvements.
dom-traverse: Windows 0 regressions, 7 insignificant, 0 improvements. Android 1 regression, 6 insignificant, 0 improvements.

Thanks,
Kevin

Kevin Babbitt

unread,
Dec 6, 2021, 7:28:24 PM12/6/21
to platform-architecture-dev, Luis Fernando Pardo Sixtos
One correction to the below, thanks Luis for pointing it out: Regarding the Android binary size reduction, I linked to bug comment 48, but that comment was talking about a different investigation. Luis has added comment 54 which captures the findings on Android binary size.

Kevin



From: Kevin Babbitt <kbab...@microsoft.com>
Sent: Monday, December 6, 2021 3:36 PM
To: platform-architecture-dev <platform-arc...@chromium.org>
Cc: Luis Fernando Pardo Sixtos <lpardo...@microsoft.com>
Subject: Making StringImpl::ref_count_ atomic
 

Kentaro Hara

unread,
Dec 6, 2021, 7:53:56 PM12/6/21
to Kevin Babbitt, Yaron Friedman, Andrew Grieve, Jeremy Roman, Kent Tamura, platform-architecture-dev, Luis Fernando Pardo Sixtos
Thanks Kevin and Luis for all the hard work to reduce AddRef/Release thrashing on StringImpls!

Android binary size: +5,896 bytes. We attempted to reduce this impact by out-of-lining StringImpl::AddRef and Release but found it produced unacceptable speed regressions. See comment 48 in the bug for details.

@Yaron Friedman @Andrew Grieve Do you think this binary size increase is acceptable?

Note that the benefit of making StringImpl thread-safe is significant. It massively simplifies String handling in multi-threaded Blink, reduces bugs and enables future optimizations like parallel text shaping.

Speedometer2:
No significant regressions on Windows or Android.

Great!

blink_perf:
bindings: Windows 14 regressions, 26 insignificant, 11 improvements. Android 12 regressions, 25 insignificant, 14 improvements.
css: Windows 14 regressions, 29 insignificant, 2 improvements. Android 16 regressions, 24 insignificant, 4 improvements.
dom: Windows 4 regressions, 5 insignificant, 2 improvements. Android 5 regressions, 5 insignificant, 1 improvement.
parser: Windows 6 regressions, 18 insignificant, 7 improvements. Android 18 regressions, 4 insignificant, 9 improvements.
dromaeo:
dom-attr: Windows 5 regressions, 3 insignificant, 0 improvements. Android 4 regressions, 4 insignificant, 0 improvements.
dom-modify: Windows 1 regression, 7 insignificant, 0 improvements. Android 0 regressions, 8 insignificant, 0 improvements.
dom-query: Windows 6 regressions, 5 insignificant, 0 improvements. Android 1 regression, 4 insignificant, 6 improvements.
dom-traverse: Windows 0 regressions, 7 insignificant, 0 improvements. Android 1 regression, 6 insignificant, 0 improvements.

I still see a few percent regressions on benchmarks around query selectors and get/setAttribute. However, I'm okay with accepting these regressions because:

1) The regressions are very well understood and optimized over 1+ year of work. Albert, Kevin and Luis did everything we could do to minimize the overhead.
2) To minimize the overhead, we've implemented fairly complex optimizations to AttributeCollection. I'm worried about making it more complicated.
3) The regressions are observed only on micro benchmarks.

@Jeremy Roman @Kent Tamura for the second input :)


--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architect...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/platform-architecture-dev/SJ0PR00MB1333F1F63C4D9B9DDAAE8B23C06E9%40SJ0PR00MB1333.namprd00.prod.outlook.com.


--
Kentaro Hara, Tokyo

Andrew Grieve

unread,
Dec 6, 2021, 9:53:45 PM12/6/21
to Kentaro Hara, Kevin Babbitt, Yaron Friedman, Jeremy Roman, Kent Tamura, platform-architecture-dev, Luis Fernando Pardo Sixtos
Binary-size-wise, we won't notice 5kb :) 

Kentaro Hara

unread,
Dec 7, 2021, 1:30:29 AM12/7/21
to Andrew Grieve, Kevin Babbitt, Yaron Friedman, Jeremy Roman, Kent Tamura, platform-architecture-dev, Luis Fernando Pardo Sixtos
Sorry, I misread it as +5,896 kbytes, not +5,896 bytes. Fully agreed that 5 KB is not a problem :)


--
Kentaro Hara, Tokyo

TAMURA, Kent

unread,
Dec 7, 2021, 1:46:31 AM12/7/21
to Kentaro Hara, Andrew Grieve, Kevin Babbitt, Yaron Friedman, Jeremy Roman, platform-architecture-dev, Luis Fernando Pardo Sixtos
I found some notable regressions such as modify-element-classname in blink_perf.dom and AttributeDescendantSelector in blink_perf.css.  However I don't think the microbenchmark regressions are blockers.  We may proceed with the StringImpl change as is, and should address performance regressions if they affect real sites.


--
TAMURA Kent
Software Engineer, Google


Jeremy Roman

unread,
Dec 7, 2021, 11:07:36 AM12/7/21
to TAMURA, Kent, Kentaro Hara, Andrew Grieve, Kevin Babbitt, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
I left a comment on the CL -- I'm concerned that the use of atomics isn't correct at the moment, and that making it correct might affect the performance and size numbers.

Very much appreciate the work that's gone into this and look forward to it, of course.

Kevin Babbitt

unread,
Dec 8, 2021, 1:28:23 PM12/8/21
to Jeremy Roman, TAMURA, Kent, Kentaro Hara, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos

Thanks everyone for your support!

 

After addressing Jeremy’s issue and rerunning jobs: The bad news is that Android binary size did increase quite a bit. The good news is that the speed numbers don’t look significantly different to me.

 

Android binary size: +8,776 bytes.

 

Speedometer: still no significant regressions (Windows, Android)

 

blink_perf:

bindings: Windows 10 regressions, 25 insignificant, 16 improvements. Android 19 regressions, 24 insignificant, 8 improvements.

css: Windows 8 regressions, 25 insignificant, 12 improvements. Android 6 regressions, 19 insignificant, 20 improvements.

dom: Windows 5 regressions, 3 insignificant, 3 improvements. Android 5 regressions, 4 insignificant, 2 improvements.

parser: Windows results generation failed. Android 11 regressions, 8 insignificant, 12 improvements.

dromaeo:

dom-attr: Windows 1 regression, 7 insignificant, 0 improvements. Android 3 regressions, 4 insignificant, 1 improvement.

dom-modify: Windows 1 regression, 7 insignificant, 0 improvements. Android 0 regressions, 7 insignificant, 1 improvement.

dom-query: Windows 2 regressions, 9 insignificant, 0 improvements. Android 7 regressions, 4 insignificant, 0 improvements.

dom-traverse: Windows 0 regressions, 7 insignificant, 0 improvements. Android 5 regressions, 2 insignificant, 0 improvements.

Jeremy Roman

unread,
Dec 8, 2021, 3:33:53 PM12/8/21
to Kevin Babbitt, Jeremy Roman, TAMURA, Kent, Kentaro Hara, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
Thanks for addressing that. I doubt the Android binary size change will be a dealbreaker.

Once this is thread-safe, I think we can simplify static strings, because they can just bump ref counts like everything else, right? I think it being thread-safe would enable that and allow us to either remove the bit altogether or at least use it only for a (D)CHECK in DestroyIfNotStatic. If so we might even recover a bit of the performance and binary size hit here by being able to elide the IsStatic branches in AddRef and Release. I'm curious if that's been measured (even though we might want to land it separately).

I'm personally excited to see this work continue to progress. :)

Kentaro Hara

unread,
Dec 8, 2021, 8:27:30 PM12/8/21
to Jeremy Roman, Kevin Babbitt, Jeremy Roman, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
It looks like there's no objection on this thread for a couple of days. I think you can move forward -- thanks for all the hard work to remove unnecessary AddRef/Release from the codebase!


--
Kentaro Hara, Tokyo

Kentaro Hara

unread,
Jan 28, 2022, 2:30:12 AM1/28/22
to Jeremy Roman, Kevin Babbitt, Jeremy Roman, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
Kevin and Luis landed a change to make StringImpl thread-safe -- congratulations! :D

Now they are working to make AtomicStringTable thread-safe (thanks!) but it is hitting performance regressions on a few benchmarks. I have one question here: If we do NOT make AtomicStringTable thread-safe, what will be the consequence? (i.e., we ban using StringImpls with an Atomic flag across threads.) I'm wondering if it's a plausible end state or we have to mitigate the performance regressions and make AtomicStringTable thread-safe.



--
Kentaro Hara, Tokyo

Daniel Cheng

unread,
Jan 28, 2022, 10:12:25 AM1/28/22
to Kentaro Hara, Jeremy Roman, Kevin Babbitt, Jeremy Roman, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
I think a regular WTF::String can still reference a StringImpl in the atomic string table, right? Would this lead to surprising runtime errors?

Daniel

Jeremy Roman

unread,
Jan 28, 2022, 10:21:53 AM1/28/22
to Daniel Cheng, Kentaro Hara, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
And StringImpls can become atomic after the fact upon insertion into the AtomicStringTable at present. Such strings would need to be copied.

I think the large qualitative benefits emerge mainly when an arbitrary String/AtomicString, possibly embedded in another object, can safely be touched from multiple threads.

Kentaro Hara

unread,
Jan 28, 2022, 10:51:46 AM1/28/22
to Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
Yeah... agreed. So we need to find a magic to reduce the thread-safe AtomicStringTable performance overhead. I welcome your brain on this CL :)


--
Kentaro Hara, Tokyo

dan...@chromium.org

unread,
Jan 28, 2022, 10:55:02 AM1/28/22
to Kentaro Hara, Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
FWIW, most of the regressions Albert found and that I heard about were simply wasteful refcount churn that he had to find with profiling and fix.

Kentaro Hara

unread,
Jan 28, 2022, 10:57:57 AM1/28/22
to dan...@chromium.org, Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos
FWIW, most of the regressions Albert found and that I heard about were simply wasteful refcount churn that he had to find with profiling and fix.

This was already fixed by Kevin and Luis (with lots of work to reduce the refcount churn) and the change to make StringImpl thread-safe was successfully landed.

Now the remaining work is to change the per-thread AtomicStringTable to one thread-safe AtomicStringTable.

--
Kentaro Hara, Tokyo

Luis Fernando Pardo Sixtos

unread,
Jan 28, 2022, 11:22:33 AM1/28/22
to platform-architecture-dev, Kentaro, Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, Luis Fernando Pardo Sixtos, danakj
A little more context about the regressions. The solution we are working on is based on Albert's prototype of adding a mutex lock to the AtomiStringTable Add and Remove operations. I split the original CL that Kentaro pointed to into this CL that makes the necessary logic changes and this other CL that introduces the shared table and the lock. I expected to see the regressions on the lock CL, but we are getting regressions on the first one too. Hopefully that is just bad coding on my part, but I just don't see it, any feedback on that CL would be highly appreciated.

 Because some regressions in the microbenchmarks come and go, I ran many pinpoint jobs to get a sense of which are "the real regressions", I will parse and summarize the results soon and post them here and in their respective CLs so we can agree on how much of them are we willing to accept.

Luis Fernando Pardo Sixtos

unread,
Jan 28, 2022, 3:05:14 PM1/28/22
to platform-architecture-dev, Luis Fernando Pardo Sixtos, Kentaro, Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, platform-architecture-dev, danakj
Here is a summary of the results showing the top 5 significant regressions for each benchmark. To get some more consistency, I ran each pinpoint twice with the same configuration. The regressions only appear hear if they were marked as "significant" in both runs. All of these jobs ran on win-10-perf.

Note: Overall the regressions that appear in the preparation CL do not worsen by introducing the lock, but this could vary across platforms.


Speedometer2
No significant regression appeared in both pinpoint runs.

blink_perf.parser
query-selector-all-id-last  -->  8.85%
query-selector-all-class-first  -->  7.90%
query-selector-all-first  -->  7.10%
query-selector-all-id-first  --> 7.25%
innerHTML-setter  -->  3.80%


blink_perf.css
ClassInvalidation  -->  6.15%
PseudoClassSelectors  -->  2.80%

blink_perf.layout

character_fallback  -->  387.9%
css-contain-change-text-different-subtree-root  -->  3.3%
ArabicLineLayout  -->  3.8%
css-contain-change-text-without-subtree-root  -->  3.7%
css-contain-change-text  -->  3.2%


blink_perf.dom
select-sinlge-add  -->  6.40%
insert-text-with-dir-auto  --> 5.25%
select-multiple-add  -->  3.75%

blink_perf.bindings

worker-structured-clone-workerDOM-Map-from-worker  -->  74.60%
set-attribute-rare -->  8.45%
set-attribute  -->  8.25%
gc-mini-tree  -->  4.50%
create-element  -->  2.45%



Speedometer2
No significant regression appeared in both pinpoint runs.

blink_perf.parser
query-selector-all-id-last  -->  7.75%
query-selector-all-class-first -->  7.20%
query-selector-all-first  -->  8.05%
query-selector-all-id-first  -->  8.75%
innerHTML-setter  -->  3.65%


blink_perf.css
ClassInvalidation  -->  4.90%
CustomPropertiesPendingSubstitution  -->  3.80%
AttributeDescendantSelector  -->  3.50%
HighlightInheritanceRecalc  -->  3.45%
CSSPropertySetterGetterMethods  -->  1.30%


blink_perf.layout

css-contain-change-text-different-subtree-root  -->  1.7%
ArabicLineLayout  -->  17.8%
css-contain-change-text-without-subtree-root  -->  2.3%
css-contain-change-text  -->  2.0%
contain-content-style-change  -->  1.6%


blink_perf.dom
No significant regression appeared in both pinpoint runs.

blink_perf.bindings

set-attribute-rare  -->  9.75%
set-attribute  -->  12.05%
gc-mini-tree  -->  1.75%
create-element -->  3.55%
append-child  -->  7.50%
get-element-by-id  -->  7.80%
get-attribute-rare -->  7.15%


Kentaro Hara

unread,
Jan 30, 2022, 8:09:50 PM1/30/22
to Luis Fernando Pardo Sixtos, platform-architecture-dev, Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, danakj
Thanks Luis for the data!

Does it mean that making AtomicStringTable thread-safe will regress query-selector-* benchmarks by 15% in total...? That sounds a lot to me... :/

Crazy idea: Do we know how often in practice AtomicStringTable::Add/Remove is called from non-main threads? If it's very limited (i.e., in common cases the main thread passes AtomicStrings to non-main threads and the non-main threads just look up them in the table), I'm wondering if there's an option to ban AtomicStringTable::Add/Remove from non-main threads and eliminate the lock.


Luis Fernando Pardo Sixtos

unread,
Jan 30, 2022, 10:13:33 PM1/30/22
to platform-architecture-dev, Kentaro, platform-architecture-dev, Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, danakj, Luis Fernando Pardo Sixtos
Sorry I forgot to clarify. The first results are for the logic change only, and the others are for the logic + lock change. So it wouldn't be a 15% regression, just an 8% regression. From the results I saw, the worst regressions happen in the first CL, so eliminating the lock might not be a priority for now.

Albert Wong

unread,
Jan 31, 2022, 1:45:07 AM1/31/22
to platform-architecture-dev, Luis Fernando Pardo Sixtos, Kentaro, platform-architecture-dev, Jeremy Roman, Daniel Cheng, Jeremy Roman, Kevin Babbitt, TAMURA, Kent, Andrew Grieve, Yaron Friedman, danakj
Whoa... just checked in and have to say congrats to Kevin and Luis!  This is amazing!!!!

Regarding the selectors, I skimmed the code and I wonder how much you're paying for the construction of a temporary String from an AtomicString in CSSTokenizer here:

It's a little unfortunate that CSSTokenizer (and CSSTokenizerInputStream), when used a stack variable, still retains a reference the String it's parsing as opposed to, say, just a StringView.

Reading the actual test:

and the implementation of ContainerNode::QuerySelectorAll() which quickly goes here:

the test is basically doing 1 CSS query parse + 1 Hashtable insert followed by 99 lookups of a hashtable so I could see a single unnecessary AddRef/Release() having a few percentage point increase in the microbenchmark.

...of course this is all visual code analysis so take with a large grain of salt.

-Albert
Reply all
Reply to author
Forward
0 new messages