Is there memory optimization potential in CSS computed styles?


Jann Horn

Nov 23, 2021, 11:53:19 AM
to memor...@chromium.org, and...@chromium.org, ericwi...@chromium.org, fut...@chromium.org, xiaoc...@chromium.org
Quick summary:
I think there are potential memory savings in CSS computed styles; how about adding a UMA metric to see how much potential exists there? I have a hacky demo CL that reduces renderer private RAM usage by up to ~6.8% in extreme cases and ~0.47% on a Google search results page, at the cost of slightly increased CPU usage.


Hi!

I've been tinkering with some tooling (still unfinished, and I probably won't get around to finishing it anytime soon) and noticed that on some websites, objects related to CSS computed styles consume a non-negligible chunk of RAM, and that many of these objects have mostly or completely identical contents.

A very extreme example is https://elixir.bootlin.com/linux/v5.10/source/mm/filemap.c (a source cross-referencing website), where there are e.g. 20755 allocations of blink::ComputedStyleBase::StyleBoxData (plus a lot more in other blink::ComputedStyleBase::*). The contents of these objects look as follows; each "TOP VALUES" block is a hex dump of the (up to) three most frequent values of that field, each prefixed with the number of instances that hold that value:

              0x000 0x004 <inheritance>  typeunit:WTF::RefCounted<[...]>
[...]
                #### TOP VALUES:
                ####  20743 01 00 00 00
                ####     12 02 00 00 00
              0x004 0x00c aspect_ratio_  typeunit:blink::StyleAspectRatio
                  0x004 0x004 type_  typeunit:unsigned int
                    #### TOP VALUES:
                    ####  20755 00 07 5f c0
                  0x008 0x008 ratio_  typeunit:blink::FloatSize
[...]
                    #### TOP VALUES:
                    ####  20755 00 00 00 00 00 00 00 00
                #### TOP VALUES:
                ####  20755 00 07 5f c0 00 00 00 00 00 00 00 00
              0x010 0x010 contain_intrinsic_height_  typeunit:absl::optional<blink::StyleIntrinsicLength>
[...]
                #### TOP VALUES:
                ####  20755 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
              0x020 0x010 contain_intrinsic_width_  typeunit:absl::optional<blink::StyleIntrinsicLength>
[...]
                              0x020 0x001 engaged_  typeunit:bool
                                #### TOP VALUES:
                                ####  20755 00
                              0x024 0x00c <unnamed member>  typeunit:absl::optional_internal::[2:optional_data_dtor_base<blink::StyleIntrinsicLength, false>]<unnamed>
                                #### TOP VALUES:
                                ####  20755 00 00 00 00 00 00 00 00 00 00 00 00
[...]
                #### TOP VALUES:
                ####  20755 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
              0x030 0x008 height_  typeunit:blink::Length
                  0x030 0x004 <unnamed member>  typeunit:blink::[2:Length]<unnamed>
                    #### TOP VALUES:
                    ####  20741 00 00 00 00
                    ####      8 00 00 c8 42
                    ####      1 66 66 26 42
                  0x034 0x001 quirk_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                  0x035 0x001 type_  typeunit:unsigned char
                    #### TOP VALUES:
                    ####  20741 00
                    ####      7 02
                    ####      7 01
                  0x036 0x001 is_float_  typeunit:bool
                    #### TOP VALUES:
                    ####  20741 00
                    ####     14 01
                #### TOP VALUES:
                ####  20741 00 00 00 00 00 00 00 00
                ####      7 00 00 c8 42 00 01 01 00
                ####      1 66 66 26 42 00 02 01 00
              0x038 0x008 max_height_  typeunit:blink::Length
                  0x038 0x004 <unnamed member>  typeunit:blink::[2:Length]<unnamed>
                    #### TOP VALUES:
                    ####  20755 00 00 00 00
                  0x03c 0x001 quirk_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                  0x03d 0x001 type_  typeunit:unsigned char
                    #### TOP VALUES:
                    ####  20755 0c
                  0x03e 0x001 is_float_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                #### TOP VALUES:
                ####  20755 00 00 00 00 00 0c 00 00
              0x040 0x008 max_width_  typeunit:blink::Length
                  0x040 0x004 <unnamed member>  typeunit:blink::[2:Length]<unnamed>
                    #### TOP VALUES:
                    ####  20753 00 00 00 00
                    ####      2 00 00 a5 43
                  0x044 0x001 quirk_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                  0x045 0x001 type_  typeunit:unsigned char
                    #### TOP VALUES:
                    ####  20753 0c
                    ####      2 02
                  0x046 0x001 is_float_  typeunit:bool
                    #### TOP VALUES:
                    ####  20753 00
                    ####      2 01
                #### TOP VALUES:
                ####  20753 00 00 00 00 00 0c 00 00
                ####      2 00 00 a5 43 00 02 01 00
              0x048 0x008 min_height_  typeunit:blink::Length
                  0x048 0x004 <unnamed member>  typeunit:blink::[2:Length]<unnamed>
                    #### TOP VALUES:
                    ####  20751 00 00 00 00
                    ####      2 00 00 c8 42
                    ####      1 00 80 37 44
                  0x04c 0x001 quirk_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                  0x04d 0x001 type_  typeunit:unsigned char
                    #### TOP VALUES:
                    ####  20751 00
                    ####      2 02
                    ####      2 01
                  0x04e 0x001 is_float_  typeunit:bool
                    #### TOP VALUES:
                    ####  20751 00
                    ####      4 01
                #### TOP VALUES:
                ####  20751 00 00 00 00 00 00 00 00
                ####      2 00 00 c8 42 00 01 01 00
                ####      1 00 80 37 44 00 02 01 00
              0x050 0x008 min_width_  typeunit:blink::Length
                  0x050 0x004 <unnamed member>  typeunit:blink::[2:Length]<unnamed>
                    #### TOP VALUES:
                    ####  20755 00 00 00 00
                  0x054 0x001 quirk_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                  0x055 0x001 type_  typeunit:unsigned char
                    #### TOP VALUES:
                    ####  20753 00
                    ####      2 02
                  0x056 0x001 is_float_  typeunit:bool
                    #### TOP VALUES:
                    ####  20753 00
                    ####      2 01
                #### TOP VALUES:
                ####  20753 00 00 00 00 00 00 00 00
                ####      2 00 00 00 00 00 02 01 00
              0x058 0x008 width_  typeunit:blink::Length
                  0x058 0x004 <unnamed member>  typeunit:blink::[2:Length]<unnamed>
                    #### TOP VALUES:
                    ####  16773 00 00 00 00
                    ####   3975 00 00 c8 42
                    ####      3 00 00 52 43
                  0x05c 0x001 quirk_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                  0x05d 0x001 type_  typeunit:unsigned char
                    #### TOP VALUES:
                    ####  16773 00
                    ####   3976 01
                    ####      6 02
                  0x05e 0x001 is_float_  typeunit:bool
                    #### TOP VALUES:
                    ####  16773 00
                    ####   3982 01
                #### TOP VALUES:
                ####  16773 00 00 00 00 00 00 00 00
                ####   3975 00 00 c8 42 00 01 01 00
                ####      3 00 00 52 43 00 02 01 00
              0x060 0x008 vertical_align_length_  typeunit:blink::Length
                  0x060 0x004 <unnamed member>  typeunit:blink::[2:Length]<unnamed>
                    #### TOP VALUES:
                    ####  20724 00 00 00 00
                    ####     31 cb 10 c7 3f
                  0x064 0x001 quirk_  typeunit:bool
                    #### TOP VALUES:
                    ####  20755 00
                  0x065 0x001 type_  typeunit:unsigned char
                    #### TOP VALUES:
                    ####  20724 00
                    ####     31 02
                  0x066 0x001 is_float_  typeunit:bool
                    #### TOP VALUES:
                    ####  20724 00
                    ####     31 01
                #### TOP VALUES:
                ####  20724 00 00 00 00 00 00 00 00
                ####     31 cb 10 c7 3f 00 02 01 00
              0x068 0x004 z_index_  typeunit:int
                #### TOP VALUES:
                ####  20747 00 00 00 00
                ####      3 1d 00 00 00
                ####      2 02 00 00 00
              0x06c 0x004 box_decoration_break_  typeunit:unsigned int
              0x06c 0x004 box_sizing_  typeunit:unsigned int
              0x06c 0x004 has_auto_z_index_  typeunit:unsigned int
                #### TOP VALUES:
                ####  20742 05 00 00 00
                ####      8 01 00 00 00
                ####      3 2d 1c 00 00


Many of these objects have identical contents. This is a very extreme example; most websites have far fewer of these objects.


I don't really understand how CSS works, and there's probably (?) a more intelligent way to do this, but I decided to write a (very hacky) CL that caches transitions between computed CSS styles to see how much memory this could save (https://chromium-review.googlesource.com/c/chromium/src/+/3275697). (It's a really gross CL, I know, but it's good enough for some experimentation to get some numbers on potential memory savings.) I've included before/after comparisons for some random websites below, although of course this isn't a very representative sample, and the results vary a lot between websites. The graphs show renderer private dirty memory over time across multiple runs; red is old, green is new. Some of the tests were done on MHTML downloads to reduce variability.

If you think that this is worth investigating, maybe it would be a good idea to add a UMA metric to track how much memory is spent on blink::ComputedStyleBase::* and blink::ComputedStyleBase instances, in absolute terms and as a fraction of renderer memory usage, or something along those lines? That might help determine how much effort it would be worth to design a proper optimization for this.
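For a sense of scale: the dump above ends with a 4-byte member at offset 0x06c, so each StyleBoxData appears to occupy at least 0x70 (112) bytes, and 20755 of them come to 2,324,560 bytes (roughly 2.3 MB) before allocator overhead. The sketch below shows the two numbers such a metric would report; the function names are hypothetical, and the renderer total is left as an input because it is not part of the dump.

```cpp
#include <cassert>
#include <cstddef>

// Bytes spent on `count` live objects of `object_size` bytes each.
// Allocator overhead is ignored in this sketch.
constexpr std::size_t BytesUsed(std::size_t count, std::size_t object_size) {
  return count * object_size;
}

// Share of renderer memory in percent. `renderer_bytes` must be non-zero.
double PercentOfRenderer(std::size_t style_bytes, std::size_t renderer_bytes) {
  assert(renderer_bytes != 0);
  return 100.0 * static_cast<double>(style_bytes) /
         static_cast<double>(renderer_bytes);
}
```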


Measurements
elixir.bootlin.com displaying mm/filemap.c:
mem_over_time.png
source.chromium.org displaying mm/filemap.c:
mem_over_time.png

some random article on the Chromium blog:
mem_over_time_full.png
same, but zoomed in without the outliers:
mem_over_time.png

Some random public Google Doc from chromium.org, viewed with read-only access (https://docs.google.com/document/d/1Iqe_CzFVA6hyxe7h2bUKusxsjB6frXfdAYLerM3JjPo/edit) - the ~20 seconds without data are because there were multiple renderers at that point and my script can't deal with that:
mem_over_time.png
Google web search results for "chromium":

mem_over_time.png
Zoomed in on the center part:
mem_over_time_zoom.png

And here are some basic graphs on CPU usage impact of my patch - although that form of the patch isn't really usable as-is anyway, so this is mostly just to get a very rough idea of the performance impact:


Here is the before-and-after comparison of CPU usage for loading mm/filemap.c in codesearch:

cpu-branch-misses.png
cpu-cycles.png
cpu-LLC-load-misses.png
cpu-LLC-loads.png
cpu-ref-cycles.png

On elixir.bootlin.com, the number of LLC-load-misses (in other words, accesses to main memory) seems to have gone down a lot, which suggests that the smaller memory footprint has improved cache efficiency, although it doesn't seem to result in an actual speedup:

cpu-branch-misses.png
cpu-cycles.png
cpu-LLC-load-misses.png
cpu-LLC-loads.png
cpu-ref-cycles.png

The "ref-cycles" graphs basically measure how long the CPU was in use; that seems to have gotten slightly worse in both cases, I guess?

Anders Hartvoll Ruud

Nov 23, 2021, 4:35:45 PM
to Jann Horn, memor...@chromium.org, ericwi...@chromium.org, fut...@chromium.org, xiaoc...@chromium.org
Can you describe what the idea behind the transitions cache is? I don't quite understand what I'm looking at.

General comment: my main concern is (as usual) complexity. A memory win is still a net loss if it creates a disproportionate maintenance burden forever. (Knee-jerk reaction when I see the word "cache"). But that's a general concern, I'd need to understand your proposal to get a better picture of what applies here.

Adding metrics sounds reasonable regardless.

> slightly increased CPU usage

What exactly did you measure?

Jann Horn

Nov 24, 2021, 5:51:09 AM
to Anders Hartvoll Ruud, memor...@chromium.org, ericwi...@chromium.org, fut...@chromium.org, xiaoc...@chromium.org
On Tue, Nov 23, 2021 at 10:35 PM Anders Hartvoll Ruud <and...@chromium.org> wrote:
> Can you describe what the idea behind the transitions cache is? I don't quite understand what I'm looking at.

It consists of two parts.

Basically every blink::ComputedStyleBase::* object gets a unique 64-bit ID, and the entries in the TransitionCache mean "if you have an object with ID $id, and then set its property $name to $value, you get an object with ID $new_id". Then there is the IncarnationCache that maps IDs to object pointers.

So e.g. in a case where you have a ComputedStyle, and then repeatedly create copies of it with max_width=100, you'll usually have a hit in the TransitionCache that gives you a preexisting ID for the modified version, and then the IncarnationCache will usually give you an instance corresponding to that, so you don't have to actually copy the entire object.

The two layers of caching become useful in cases where you e.g. repeatedly create copies on which you set both max_width=100 and max_height=100 - in that case, the first modification will usually have a hit in the TransitionCache, but miss in the IncarnationCache, so a new copy will be created for that, but it'll have the preexisting ID. And then the second modification will hit in the TransitionCache and the IncarnationCache, and so you can then again return a reference to the preexisting object and delete the new copy. Essentially it allows caching transitions through intermediate states for which there aren't actually any full objects.
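The two-layer scheme described above can be sketched roughly as follows. This is not the code from the CL: `Style`, `StyleCache` and the integer-valued property map are hypothetical stand-ins, and the sketch never purges expired cache entries.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <tuple>

// Hypothetical stand-in for a computed-style object.
struct Style {
  std::uint64_t id = 0;
  std::map<std::string, int> properties;
};

class StyleCache {
 public:
  // Creates a fresh root style with a new unique ID.
  std::shared_ptr<const Style> NewRoot() {
    auto style = std::make_shared<Style>();
    style->id = next_id_++;
    incarnations_[style->id] = style;
    return style;
  }

  // Returns a style equal to `base` with `name` set to `value`.
  std::shared_ptr<const Style> Set(const std::shared_ptr<const Style>& base,
                                   const std::string& name, int value) {
    assert(base != nullptr);
    // TransitionCache: (id, name, value) -> ID of the resulting style.
    auto [it, inserted] = transitions_.try_emplace(
        std::make_tuple(base->id, name, value), next_id_);
    if (inserted) ++next_id_;
    const std::uint64_t new_id = it->second;

    // IncarnationCache: ID -> live object, if one still exists.
    if (auto live = incarnations_[new_id].lock()) return live;

    // Miss: materialize a copy that carries the (possibly preexisting) ID.
    auto copy = std::make_shared<Style>(*base);
    copy->id = new_id;
    copy->properties[name] = value;
    incarnations_[new_id] = copy;
    ++copies_made_;
    return copy;
  }

  int copies_made() const { return copies_made_; }

 private:
  using TransitionKey = std::tuple<std::uint64_t, std::string, int>;
  std::uint64_t next_id_ = 1;
  int copies_made_ = 0;
  std::map<TransitionKey, std::uint64_t> transitions_;
  std::map<std::uint64_t, std::weak_ptr<const Style>> incarnations_;
};
```

Calling `Set(Set(root, "max_width", 100), "max_height", 100)` a second time re-creates the intermediate style (its incarnation has expired, but it gets the same ID back), and the second step then returns the object produced by the first call, so the temporary copy is dropped again.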
 
> General comment: my main concern is (as usual) complexity. A memory win is still a net loss if it creates a disproportionate maintenance burden forever. (Knee-jerk reaction when I see the word "cache"). But that's a general concern, I'd need to understand your proposal to get a better picture of what applies here.

Oh, I agree - that CL is just the hack I chose to implement because it didn't require me to actually understand how the CSS engine works. And the things I'm doing with memcpy() and memcmp() in there are an atrocity that doesn't belong in production code.

I imagine that if you think that it's worth investing time into reducing memory usage here, you can probably come up with much better ideas for dealing with the problem at a higher level, in a completely different way, which would probably end up being cleaner and maybe also faster. I just wanted something I could use to prove that there is at least one way to reduce the CSS engine's memory usage.
 
> Adding metrics sounds reasonable regardless.
>
> > slightly increased CPU usage
>
> What exactly did you measure?

ref-cycles (meaning elapsed time) spent in userspace, as reported by the CPU's performance counters, for a renderer from execve until 26 seconds later, using Chrome with --disable-features=SpareRendererForSitePerProcess --renderer-cmd-prefix="perf stat --all-user -e $EVENTS -x, -o/h/chromium/cpu_usage_$1_a.log --append --timeout=26000 taskset -c 27,55". I didn't do a proper statistical analysis or anything though, I just plotted the before/after comparison graphs at the bottom of the mail. (I also plotted some other perf counters, since I wanted to see whether there'd be a big spike in the number of branch misses or something like that.)
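As a rough illustration of how such logs could be summarized beyond plotting, here is a sketch that reads the appended `perf stat -x,` output and averages each counter across runs. It assumes the CSV layout documented for `perf stat -x` without `-I` or per-CPU options (counter value first, event name third) and that `--append` separates runs with `#` comment lines; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits one line on commas (the -x, output is assumed to have no quoting).
std::vector<std::string> SplitCsv(const std::string& line) {
  std::vector<std::string> fields;
  std::string field;
  std::istringstream stream(line);
  while (std::getline(stream, field, ',')) fields.push_back(field);
  return fields;
}

// Collects the counter values per event name across all appended runs.
std::map<std::string, std::vector<std::uint64_t>> ReadCounters(
    std::istream& log) {
  std::map<std::string, std::vector<std::uint64_t>> runs;
  std::string line;
  while (std::getline(log, line)) {
    if (line.empty() || line[0] == '#') continue;  // per-run header lines
    const std::vector<std::string> fields = SplitCsv(line);
    if (fields.size() < 3) continue;
    // Skip placeholders such as "<not counted>".
    if (fields[0].empty() ||
        fields[0].find_first_not_of("0123456789") != std::string::npos) {
      continue;
    }
    runs[fields[2]].push_back(std::stoull(fields[0]));
  }
  return runs;
}

// Arithmetic mean; returns 0 for an empty sample.
double Mean(const std::vector<std::uint64_t>& values) {
  if (values.empty()) return 0.0;
  double sum = 0.0;
  for (std::uint64_t v : values) sum += static_cast<double>(v);
  return sum / static_cast<double>(values.size());
}
```

Comparing the per-event means of the "before" and "after" logs gives a first-order number; a proper comparison would also look at the spread between runs.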

Rune Lillesveen

Nov 24, 2021, 7:16:03 AM
to Anders Hartvoll Ruud, Jann Horn, memor...@chromium.org, ericwi...@chromium.org, xiaoc...@chromium.org
We used to share ComputedStyle objects, but I believe that was removed at some point because it was error-prone and the gain wasn't considered large enough. There is an interesting document in https://crbug.com/721517 which has some background.

One thing to consider is the subgroups: they are copy-on-write objects, and grouping properties differently might reduce their memory use.
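To illustrate why the grouping matters, here is a minimal copy-on-write sketch. The groups and the `GroupedStyle` class are hypothetical and much simpler than Blink's generated code; the point is only that changing one property clones the group it lives in and nothing else.

```cpp
#include <cassert>
#include <memory>

// Hypothetical property groups; the real grouping is generated and differs.
struct BoxGroup {
  int width = 0;
  int height = 0;
};
struct RareGroup {
  int z_index = 0;
};

// Copying a GroupedStyle copies only the group pointers, so both styles
// share the groups until one of them writes. Single-threaded sketch.
class GroupedStyle {
 public:
  int width() const { return box_->width; }
  int z_index() const { return rare_->z_index; }

  // Copy-on-write: clone the group only if it is shared and the value changes.
  void SetWidth(int width) {
    if (box_->width == width) return;
    if (box_.use_count() > 1) box_ = std::make_shared<BoxGroup>(*box_);
    box_->width = width;
  }
  void SetZIndex(int z_index) {
    if (rare_->z_index == z_index) return;
    if (rare_.use_count() > 1) rare_ = std::make_shared<RareGroup>(*rare_);
    rare_->z_index = z_index;
  }

  bool SharesBoxWith(const GroupedStyle& other) const {
    return box_ == other.box_;
  }
  bool SharesRareWith(const GroupedStyle& other) const {
    return rare_ == other.rare_;
  }

 private:
  std::shared_ptr<BoxGroup> box_ = std::make_shared<BoxGroup>();
  std::shared_ptr<RareGroup> rare_ = std::make_shared<RareGroup>();
};
```

With this layout, a child that only overrides `width` pays for one extra `BoxGroup`; if a frequently changed property sat in a large group together with rarely changed ones, every such change would duplicate the whole group.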


--
Rune Lillesveen

Chris Hamilton

Nov 24, 2021, 10:04:05 AM
to Rune Lillesveen, Etienne Bergeron, Anders Hartvoll Ruud, Jann Horn, memor...@chromium.org, ericwi...@chromium.org, xiaoc...@chromium.org
Regarding adding metrics to quantify the size of the problem to begin with, we likely already have the data you want from the heap profiler, which samples allocation stack traces on end user machines (+Etienne Bergeron, who knows all about it and can point you to where that data can be accessed).

Cheers,

Chris


Xiaocheng Hu

Nov 24, 2021, 1:43:14 PM
to Chris Hamilton, Rune Lillesveen, Etienne Bergeron, Anders Hartvoll Ruud, Jann Horn, memor...@chromium.org, ericwi...@chromium.org, xiaoc...@chromium.org
Hi Jann,

Thanks for the explanation! It looks much simpler than the style sharing that we used to have, so to me the extra complexity seems easier to justify. But we still need more real world data.

Before adding UMA, have you tried collecting some data from Cluster Telemetry first? And (in case you are not aware of it) here's a doc on how to test rough prototypes and measure ad hoc metrics on CT.