static_library vs source

Primiano Tucci

unread,

Jul 26, 2016, 9:27:53 AM7/26/16

to gn-dev, Bruce Dawson, Perf Sheriffs

Hi GN experts (CC: perf-sheriffs@)

I'm perf-sheriffing today and we have a bunch of perf regressions on Android (e.g., crbug.com/631421) which seem to be caused by https://codereview.chromium.org/2172033002. Even without the bisect results, in a lot of cases that was the only realistic CL in the range (the other ones were tests changes, like here), so I fell pretty confident this is not just a bisect flake.

If I am reading that code correctly, the only side effect on Android of that CL is turning some targets that were previously source_sets into static_libraries.

The thing that I cannot figure out is: how can that have possibly an impact on performance?

Android perf builders build with buildtype=Official and component=static_library. I thought that for static builds, at the end of the day, the major difference was just how do we end up invoking the linker (if pre-grouping in .a files or appending to the rsp), but clearly I am missing something.

Would you expect a perf impact by that change? If so why?

Any help would be appreciated.

Bruce Dawson

unread,

Jul 26, 2016, 12:05:43 PM7/26/16

to Primiano Tucci, gn-dev, Perf Sheriffs

Brett and I discussed this. We believe that the change of some targets to static_library perturbed the binary layout, and that caused the regression. There is nothing inherent to libraries that should cause a regression, and in fact they should help performance slightly (by making the binary smaller) in some cases.

In short, we believe that this is due to linking being a sort-of-random process. If you can see anything else let us know.

This was a small regression, correct?

Primiano Tucci

unread,

Jul 27, 2016, 8:52:49 AM7/27/16

to Bruce Dawson, gn-dev, Perf Sheriffs

They are all regressions on micro-benchmarks (Full list here), but some of them are in the 10-20% ballpark and quite visible (read: not just lost in the noise).

For instance this or this

I'll try to take a look to both binaries and see if anything big bumps to my attention, but will be quite hard to reverse eng the root cause from the binaries.

Daniel Bratell

unread,

Jul 27, 2016, 9:07:02 AM7/27/16

to Bruce Dawson, Primiano Tucci, gn-dev, Perf Sheriffs

On Wed, 27 Jul 2016 14:52:39 +0200, Primiano Tucci <prim...@chromium.org> wrote:

They are all regressions on micro-benchmarks (Full list here), but some of them are in the 10-20% ballpark and quite visible (read: not just lost in the noise).
For instance this or this
I'll try to take a look to both binaries and see if anything big bumps to my attention, but will be quite hard to reverse eng the root cause from the binaries.

Check the size of those tests. I've encountered micro benchmarks that after optimiztions have ended up with inner loops of a few hundred or a few thousand clock cycles. Then some change in alignment can actually make the difference 10-20% without there being anything particularly wrong.

/Daniel

--

/* Opera Software, Linköping, Sweden: CEST (UTC+2) */

Primiano Tucci

unread,

Jul 27, 2016, 9:46:24 AM7/27/16

to Daniel Bratell, Bruce Dawson, to...@chromium.org, gn-dev, Perf Sheriffs

So, I did diff the symbols of the build artifacts produced by the waterfall before and actual Bruce's CL.

I was expecting the binaries to be identical % symbol address reshuffling. Instead I see that some symbols got somehow compiled differently. I wonder if this interferring with link time optimization.

The list of the ~200 different symbols is the following: http://pastebin.com/ih73BBNv (the last argument is the size of the symbol).

It's not "lots of symbols" if you consider that in total there are ~812272 symbols and only ~200 changed. But I suppose enough to make a difference.

The produced assembly looks quite different for those, for instance:

+somebody who stared at arm assembly for longer than me (torne@)

--
You received this message because you are subscribed to the Google Groups "gn-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gn-dev+un...@chromium.org.

Torne (Richard Coles)

unread,

Jul 28, 2016, 8:58:55 AM7/28/16

to Primiano Tucci, Daniel Bratell, Bruce Dawson, gn-dev, Perf Sheriffs

On Wed, 27 Jul 2016 at 14:46 Primiano Tucci <prim...@chromium.org> wrote:

So, I did diff the symbols of the build artifacts produced by the waterfall before and actual Bruce's CL.
I was expecting the binaries to be identical % symbol address reshuffling. Instead I see that some symbols got somehow compiled differently. I wonder if this interferring with link time optimization.
The list of the ~200 different symbols is the following: http://pastebin.com/ih73BBNv (the last argument is the size of the symbol).
It's not "lots of symbols" if you consider that in total there are ~812272 symbols and only ~200 changed. But I suppose enough to make a difference.

The produced assembly looks quite different for those, for instance:
WTF::Vector<unsigned int, 0u, WTF::PartitionAllocator>::appendSlowCase

blink::protocol::Accessibility::AXValue::~AXValue()
WTF::VectorBufferBase<blink::Member<blink::MessagePort>, false, blink::HeapAllocator>::allocateBuffer(unsigned int)

In the latter version of this one, WTF::MemoryBarrier looks to have been placed too far away from the function to call it with a direct thumb branch, and so it has to load a PC-relative address from the literal pool and blx to it. It also inlined more functions. I would guess that the function it inlines here are too far away to reach with a direct branch and the resulting difference in instruction count/register usage for the branch causes it to be worth inlining when it wasn't in the case where a direct branch worked?

The other two look like they could be similar.

Different branch instructions (conditional vs nonconditional, etc) have different ranges in Thumb2; I forget all the combinations but there are branches that are +/- 4MiB and 16MiB? Both of these are smaller than our library :)

When not doing link-time optimisation, the compiler likely ends up generating pessimistic sequences with relocations that allow very large ranges, and at link time the linker simply fills in the relocs, and you do the big/slow branch sequence "every time" (or else, it generates short branches and then the linker has to generate branch islands to extend the range - both techniques have been used but I don't know what's common in modern toolchains). This mostly means that the layout of the code in the binary has no effect, because the compiler is generating the same instructions regardless of the final layout.

LTO means it can optimise for the layout it actually got and actually use efficient short branches where possible, but it means that the codegen is going to depend on the exact layout and that's likely to change wildly if you change the input object file order.

All speculative, but seems plausible?

Bruce Dawson

unread,

Jul 28, 2016, 1:36:23 PM7/28/16

to Torne (Richard Coles), Primiano Tucci, Daniel Bratell, gn-dev, Perf Sheriffs

It's a bit terrifying that randomizing the layout can cause such significant regressions. The 'fix' is easy enough (disable these changes for Android) but it's way too ad-hoc/random too make me feel happy.

But yeah, it sounds plausible, I think? Thanks for the analysis.

James Robinson

unread,

Jul 28, 2016, 1:40:27 PM7/28/16

to Bruce Dawson, Torne (Richard Coles), Primiano Tucci, Daniel Bratell, gn-dev, Perf Sheriffs

I tried to repro this locally because it seemed like an interesting issue but I got the exact same disassembly before/after Bruce's patch even after doing a clean build. I'm building android release static official builds and looking at the disassembly in wtf_unittests. I don't believe this toolchain does any further optimization that would differ between different binaries linking in the same symbol.

Primiano - have you been able to reproduce this? I'm dubious that the change here is actually due to that patch.

- James

Torne (Richard Coles)

unread,

Jul 28, 2016, 1:45:17 PM7/28/16

to James Robinson, Bruce Dawson, Primiano Tucci, Daniel Bratell, gn-dev, Perf Sheriffs

Don't we do LTO on android?

Primiano Tucci

unread,

Jul 28, 2016, 1:51:11 PM7/28/16

to Bruce Dawson, Torne (Richard Coles), Daniel Bratell, gn-dev, Perf Sheriffs, Egor Pasko

> The 'fix' is easy enough (disable these changes for Android) but it's way too ad-hoc/random too make me feel happy.

To be honest if this is the case I don't think is even worth it % owners of those benchmark being okay with that. If this is the cause, this will change at any time by introducing some extra files.

Also, the official Chrome.apk has an extra step which reorders all the symbols so I suspect that this is really not impacting the final binary we ship.

+pasko, I assume we don't have the orderfile step on ChromePublic and hence all the tests on the public waterfall?

>Primiano - have you been able to reproduce this?

Hmm I didn't try locally. I downloaded the build artifacts from perf bots, just before and after bruce CL. Lemme double check that, from my history I have:

$ gsutil cp gs://chrome-perf/android_perf_rel/full-build-linux_dd5bcd3ee0e38d0622de580545fdfe43f4131395.zip .

$ gsutil cp gs://chrome-perf/android_perf_rel/full-build-linux_6e53a2e16dc26dec41494b02da83a1221b72e7c4.zip .

6e53a is Bruce's CL, dd5bcd the one immediately before. So they should be the right ones.

From there I extracted the unstripped libchrome.so, ran arm-linux-androideabi-readelf -Ws libchrome.so and diffed the two symbols tables, after having removed the addressed with cut.

I imagine that the set of symbols that change depend exactly on the gn flags.

Here's what the perf builder use (from here):

ffmpeg_branding = "Chrome"

is_chrome_branded = true

is_debug = false

is_official_build = true

proprietary_codecs = true

symbol_level = 1

target_os = "android"

use_goma = true

Primiano Tucci

unread,

Jul 29, 2016, 9:33:33 AM7/29/16

to Bruce Dawson, Torne (Richard Coles), Daniel Bratell, gn-dev, Perf Sheriffs, Egor Pasko, Annie Sullivan

> Don't we do LTO on android?

Depends on which kind of LTO you mean :)

Pasko might be the best person to answer this. The build files are quite tricky to read. According to build/config/compiler/BUILD.gn it seems that we do:

--icf=all if is_android but not if use_order_profiling (which is what we use for the official Chrome.apk)
We don't use -flto / -Wl,--lto on Android yet. This is the link time codegen pcc@ and krasyn@ are working on and announced on chromium-dev recently
In the official Chrome.apk we use order profiling. I have no idea what effect it has on the final binary. My impression is that all those microbenchmarks are only running in the upstream waterfall (At least I couldn't find any other instance running internally) and we don't have any coverage of what happens on the official binary (+sullivan) post-linking.

Torne (Richard Coles)

unread,

Jul 29, 2016, 9:52:59 AM7/29/16

to Primiano Tucci, Bruce Dawson, Daniel Bratell, gn-dev, Perf Sheriffs, Egor Pasko, Annie Sullivan

If we aren't actually using -flto then I think my hypothesis is wrong, or else I don't understand how the linker works. I don't think anything short of -flto causes the linker to be able to rewrite branch sequences/change inlining decisions.

Primiano Tucci

unread,

Jul 29, 2016, 10:00:43 AM7/29/16

to Torne (Richard Coles), Bruce Dawson, Daniel Bratell, gn-dev, Perf Sheriffs, Egor Pasko, Annie Sullivan

Can it be something similar due to --icf? maybe when it ICFs a function is smart enough to change the jump type if the other identical function is closer?

Egor Pasko

unread,

Jul 29, 2016, 2:14:17 PM7/29/16

to Primiano Tucci, Torne (Richard Coles), Annie Sullivan, gn-dev, Bruce Dawson, Daniel Bratell, Perf Sheriffs

quick thoughts from the bus, no links :)

we are not using lto on android

we are not using orderfile for chrome public afair

shuffling the binary does affect performance a lot, down to masking out slews of compiler optimizations, there was a paper published about this a couple you years ago or so

modern toolchains work in a very entertaining multi-pass way, search for "instruction relaxation"

yay performance, as non-intuitive as possible, but explainable after some time digging :)

Bruce Dawson

unread,

Jul 29, 2016, 2:30:56 PM7/29/16

to Egor Pasko, Primiano Tucci, Torne (Richard Coles), Annie Sullivan, gn-dev, Daniel Bratell, Perf Sheriffs

My take on this is that we should not revert or special-case the source-set to static-library changes.

However we should file a bug (or find an existing bug?) for somebody to investigate whether we can make this more predictable, preferably in a performance-positive way.

On Xbox 360 I created a tool that took trace data, converted that to a call graph (with function call frequencies, so sampled data is not appropriate) and used that to create a link order file by placing functions adjacent/close to the functions that they frequently called. This typically gave a ~7% speedup across the board. On Android it sounds like it could give a much larger increase on some micro-benchmarks.

As with any profiled guided optimizations there is the risk that we would make those scenarios that are not part of the training slower, but it feels like it couldn't be worse than the current situation, where random changes are rewarded or punished.

BTW, changing the link order should cause random regressions and random improvements. Is that what we are seeing? I haven't heard of random improvements in benchmarks, but is that just reporting bias?

Dirk Pranke

unread,

Jul 29, 2016, 6:18:42 PM7/29/16

to Bruce Dawson, Egor Pasko, Primiano Tucci, Torne (Richard Coles), Annie Sullivan, gn-dev, Daniel Bratell, Perf Sheriffs

In my experience w/ the perf builders, one never hears about random improvements, just regressions.

(This is not meant as a criticism of the perf-sheriffs, either, whose job is hard enough as it is.)

-- Dirk

Annie Sullivan

unread,

Jul 30, 2016, 3:12:31 PM7/30/16

to Dirk Pranke, Bruce Dawson, Egor Pasko, Primiano Tucci, Torne (Richard Coles), gn-dev, Daniel Bratell, Perf Sheriffs

It's true that we don't have a great way of touting performance improvements. Let me know if you have thoughts on how we could do better there!

For any specific CL you can see all the regressions and improvements that were detected with the CL in the revision range with a url like this:

https://chromeperf.appspot.com/group_report?rev=407372

You'd need to bisect to ensure that any improvement was caused by that CL and not some other commit in the range or noise though.

Reply all

Reply to author

Forward

static_library vs source_set and perf

Primiano Tucci

Bruce Dawson

Primiano Tucci

Daniel Bratell

Primiano Tucci

Torne (Richard Coles)

Bruce Dawson

James Robinson

Torne (Richard Coles)

Primiano Tucci

Primiano Tucci

Torne (Richard Coles)

Primiano Tucci

Egor Pasko

Bruce Dawson

Dirk Pranke

Annie Sullivan