Compiler optimization level per file/library on Android?


Egor Pasko

Dec 18, 2013, 1:14:01 PM
to chromium-dev
TL;DR: optimize some files with -O2/-O3 manually as a temporary hack on Android or not? 


Some Background:

On Android we care a lot about executable binary size (the binary is libchromeview.so). We build all object files with compiler options to "optimize for size" (gcc -Os). Also, we remove unreferenced functions via a compiler+linker trick: "-ffunction-sections -Wl,--gc-sections" (13GB RAM at link-time ftw). For quite a few performance-critical areas this setup is very far from optimal.


Some Immediate Questions:

1. Can we use aggressive optimization flags on small parts of the code while keeping the rest optimized for size?
2. What is the balance? How should the trade-off be coordinated across teams? (Is a 100K increase in binary size worth a 5% improvement in page load time on *some* websites?)
3. Is it appropriate to have ('release_optimize%': '3') sprinkled across gyp files for individual .cc sources? How about temporarily?


Some More Background

I looked at potential speedups on blink_perf tests [1]. The results are encouraging: for example, we could increase DOM traversal performance by up to 70%. Hopefully this would require only a small number of files to be recompiled with aggressive optimizations. Similar observations showed that a higher optimization level also makes compositing faster [2].

Typically this problem is solved with Feedback-Driven Optimization (FDO), a.k.a. Profile-Guided Optimization (PGO) [3]. ChromeOS is already using these techniques, as mentioned in [4], but on Android we are not there yet.

There has been some recent work in GCC to save binary size while having more aggressive optimization flags [5] [6], but it will take some time to have it tuned for Android/ARM/etc.


Discussion

I personally think the hack is worth doing now: select the ~20 most performance-critical files, build them with -O3, win!
There are surely concerns with this approach: what do we do if critical functionality migrates from one file to another? One-file libraries would look ugly in gyp?

I am seeing this as a few gradual steps:
1. select files to optimize manually, estimate cost/benefit -> short-term wins
2. select optimization level in more clever ways: take feedback from cygprofile after running perf tests
3. update compilers, maybe tune optimization flags; PGO would be based on tests, and this could already be pretty good and final!
4. follow the ChromeOS lead, use PGO from low-overhead sampling (not sure about this)


References






Jonathan Dixon

Dec 18, 2013, 2:24:24 PM
to Egor Pasko, chromium-dev
>There are surely concerns with this approach: what do we do if critical functionality migrates from one file to another? One-file libraries would look ugly in gyp?

A partial mitigation to both of these is to use #pragma optimize("O3") inside the .cc file instead of hacking .gyp.
You could wrap it in e.g. a PRAGMA_OPTIMIZE_FOR_SPEED() define, which is slightly more readable and more likely to migrate with the code in question. It can also be applied at function (or even finer) granularity with a pragma push/pop.
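For illustration, a rough sketch of what this could look like (PRAGMA_OPTIMIZE_FOR_SPEED is just the name suggested above, not something that exists in the tree; the GCC spelling of the pragma is "#pragma GCC optimize", available since GCC 4.4, and HotTraversalStep is a made-up example):

  // Sketch only. With GCC, everything after this pragma in the translation
  // unit is compiled at -O3 even if the file default is -Os.
  #if defined(__GNUC__) && !defined(__clang__)
  #define PRAGMA_OPTIMIZE_FOR_SPEED() _Pragma("GCC optimize (\"O3\")")
  #else
  #define PRAGMA_OPTIMIZE_FOR_SPEED()  // clang/MSVC: no-op.
  #endif

  // Or scoped to a single hot function with push/pop, so the rest of the
  // file keeps the default -Os:
  #pragma GCC push_options
  #pragma GCC optimize ("O3")
  int HotTraversalStep(int node_count) {
    int work = 0;
    for (int i = 0; i < node_count; ++i)
      work += i * 31;  // Stand-in for the real hot loop body.
    return work;
  }
  #pragma GCC pop_options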




Tony Gentilcore

Dec 18, 2013, 2:32:04 PM
to Egor Pasko, chromium-dev
I think your investigation here will be quite fruitful!

But I think that per-file compilation flags will lead to short-term
gains that will quickly rot (or be a terrible maintenance burden). For
instance, over time, a particular file may evolve gradually from
something that would benefit from -O2 to something that would benefit
from -Os.

That being said, the reason I think this will be fruitful is that:
* As you know, we already compile with different flags on a
per-library basis, and I'd support continuing to do a better job of
that.
* Finding the files that benefit from greater optimization points out
where we can (and should) do better. I suspect we can often get the
same gains by looking at the asm differences between the two and then
writing more compiler-friendly c++ that allows -Os to do a better job.

-Tony

Chris Hamilton

Dec 18, 2013, 2:55:04 PM
to Tony Gentilcore, Egor Pasko, chromium-dev
How about driving the selection of optimized files based on profiling
and some standard test suite? Or even better, based on actual
real-world profile data? If the build of libchromeview for Android is
not too far removed from the one built for ChromeOS, you could even be
driving this using real data collected by ChromeOS Wide Profiling
(go/cwp). This would hopefully prevent bit rot issues...

Unfortunately this approach conflates "heavily used translation units"
with "translation units that would most benefit from optimization".
The only way to truly measure the latter in an automated way would be
to run benchmarks while turning on and off -O2 for individual .cc
files.

As a total thought experiment:

Maybe you could build an automated system that randomly sets -O2 on
some percentage (say 15-25%) of .cc files, runs a suite of unittests,
and records speed-ups vs everything compiled for min-size. After a
reasonable number of runs every .cc file would be (likely) covered
multiple times, and you'd be able to make a good guess at percentage
improvement on a benchmark associated with enabling aggressive
optimizations on a .cc file. I guess the goal would be to get cycle
times down to a reasonable level...

Scott Graham

Dec 18, 2013, 3:48:28 PM
to chr...@chromium.org, Tony Gentilcore, Egor Pasko, chromium-dev
I started doing this on Windows with the PGO tools in VS2013. I just used an arbitrary set of synthetic benchmarks as test data for now (dromaeo, robohornet, etc).

We're waiting on toolchain fixes from MSFT in the next update, but it appears that it could be quite beneficial. The tools suggest that only ~2% of code benefits from speed optimization but that the rest should be size-optimized. Right now on Windows, we have ~35% set to speed opt (all of blink, v8, base, and allocator, but nothing else), which can surely be improved.

Figuring out how to include profile data from real users on sundry hardware platforms would obviously be great. Presumably "this file is hot" would map fine across platforms for most code based on go/cwp. Alternatively, we might be able to determine what PGO decided to optimize, and export that elsewhere.

Reid Kleckner

Dec 18, 2013, 3:53:18 PM
to jo...@chromium.org, Egor Pasko, chromium-dev
On Wed, Dec 18, 2013 at 11:24 AM, Jonathan Dixon <jo...@chromium.org> wrote:
>There are surely concerns with this approach: what do we do if critical functionality migrates from one file to another? One-file libraries would look ugly in gyp?

A partial mitigation to both of these is to use #pragma optimize("O3") inside the .cc file instead of hacking .gyp.
You could wrap it in e.g. a PRAGMA_OPTIMIZE_FOR_SPEED() define, which is slightly more readable and more likely to migrate with the code in question. It can also be applied at function (or even finer) granularity with a pragma push/pop.

FYI, clang ignores these pragmas today.  If you only care about gcc from the Android NDK, then maybe that's OK.  I've also been burned by bugs in gcc's implementation of these pragmas in the past, and I wouldn't recommend them.

Egor Pasko

Dec 19, 2013, 8:41:06 AM
to Tony Gentilcore, chromium-dev
On Wed, Dec 18, 2013 at 8:32 PM, Tony Gentilcore <to...@google.com> wrote:
I think your investigation here will be quite fruitful!

But I think that per-file compilation flags will lead to short-term
gains that will quickly rot (or be a terrible maintenance burden).

That's my main concern too. On the other hand, we would be able to spot regressions like these at the cost of extra perf sheriffing. I am looking at it as a short-term measure only.
 
For instance, over time, a particular file may evolve gradually from
something that would benefit from -O2 to something that would benefit
from -Os.

100% agreed. So we should not keep the file-level manual optimization hack for long.
 
That being said, the reason I think this will be fruitful is that:
* As you know, we already compile with different flags on a
per-library basis, and I'd support continuing to do a better job of
that.

I think per-library flags are too coarse-grained. The reason I want PGO is because even function-level size-vs-perf decisions are too coarse-grained.

* Finding the files that benefit from greater optimization points out
where we can (and should) do better. I suspect we can often get the
same gains by looking at the asm differences between the two and then
writing more compiler-friendly c++ that allows -Os to do a better job.

This approach historically helped to optimize some outstanding and localized performance issues. We should continue doing that for localized bottlenecks. The biggest concern is that hand-optimizing does not scale to the long tail of semi-hot functions. Also, balancing speed vs. size with individual C++ CLs would be impossible.

One entertaining example: the "branch prediction" builtins. We use them heavily in Blink (LIKELY()/UNLIKELY()) while they are *ignored* by GCC at the -Os optimization level. This may no longer be true on newer versions of GCC (I did not verify), but if it is still the case, this seriously limits optimization opportunities.
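For context, these macros boil down to GCC's __builtin_expect; roughly the following (the exact definitions in WTF may differ slightly, and AllocateNode is just a made-up example):

  // Approximately how such macros are defined; __builtin_expect only gives
  // the compiler a static hint for branch layout, and at -Os GCC may not
  // act on it at all.
  #include <cstddef>
  #include <cstdlib>

  #if defined(__GNUC__)
  #define LIKELY(x) __builtin_expect(!!(x), 1)
  #define UNLIKELY(x) __builtin_expect(!!(x), 0)
  #else
  #define LIKELY(x) (x)
  #define UNLIKELY(x) (x)
  #endif

  void* AllocateNode(std::size_t size) {
    void* p = std::malloc(size);
    if (UNLIKELY(!p))  // Mark the failure path as cold.
      std::abort();
    return p;
  }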



--
Egor Pasko

Egor Pasko

Dec 19, 2013, 8:46:00 AM
to Reid Kleckner, jo...@chromium.org, chromium-dev
Thanks! That's a good point for consideration. I tend to *not* like the idea of using per-function optimization pragmas because they are one of those non-obvious things that catch developers off guard when they try to reason about performance.

--
Egor Pasko

Egor Pasko

Dec 19, 2013, 8:51:52 AM
to Scott Graham, chr...@chromium.org, Tony Gentilcore, chromium-dev
On Wed, Dec 18, 2013 at 9:48 PM, Scott Graham <sco...@chromium.org> wrote:
I started doing this on Windows with the PGO tools in VS2013. I just used an arbitrary set of synthetic benchmarks as test data for now (dromaeo, robohornet, etc).

We're waiting on toolchain fixes from MSFT in the next update, but it appears that it could be quite beneficial. The tools suggest that only ~2% of code benefits from speed optimization but that the rest should be size-optimized. Right now on Windows, we have ~35% set to speed opt (all of blink, v8, base, and allocator, but nothing else), which can surely be improved.

PGO is the ultimate goal. I believe there would be serious issues blocking us from applying PGO straight from the NDK, so, as in the Windows case, it's rather long-term.
 
Figuring out how to include profile data from real users on sundry hardware platforms would obviously be great. Presumably "this file is hot" would map fine across platforms for most code based on go/cwp. Alternatively, we might be able to determine what PGO decided to optimize, and export that elsewhere.

Profile data from real users (on different hardware) is the really ultimate, forward-looking goal. I would prefer not to think about it now for Android. I'm just not ready yet.



--
Egor Pasko

Egor Pasko

Dec 19, 2013, 9:07:03 AM
to Chris Hamilton, Tony Gentilcore, chromium-dev
On Wed, Dec 18, 2013 at 8:55 PM, Chris Hamilton <chr...@chromium.org> wrote:
How about driving the selection of optimized files based on profiling
and some standard test suite? Or even better, based on actual
real-world profile data? If the build of libchromeview for Android is
not too far removed from the one built for ChromeOS, you could even be
driving this using real data collected by ChromeOS Wide Profiling
(go/cwp). This would hopefully prevent bit rot issues...

I was thinking of a way to deal with rot automagically. I believe getting PGO-like data from test runs will already provide a *huge* performance boost! It is much simpler than ChromeOS-like wide profiling, and I think the latter would have a hard time improving significantly over local test runs. Certainly not a 1-3 person project.
 
Unfortunately this approach conflates "heavily used translation units"
with "translation units that would most benefit from optimization".
The only way to truly measure the latter in an automated way would be
to run benchmarks while turning on and off -O2 for individual .cc
files.

So this hack should work pretty well in the medium term:
1. collect some coarse-grained function-level call counts (with cygprofile) on local perf tests
2. take the top-N functions
3. use a GCC plugin to optimize these functions aggressively

WDYT?
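To sketch what step 1 relies on (a simplified illustration, not the actual tools/cygprofile code): building with -finstrument-functions makes GCC call these hooks on every function entry/exit, and a minimal call counter looks roughly like this:

  // Sketch only: compile the instrumented target with -finstrument-functions.
  // Counts entries per function address; not thread-safe as written (real
  // code would use atomics), and the real cygprofile tooling records more.
  #include <cstdint>
  #include <cstdio>

  namespace {
  const int kBuckets = 1 << 20;
  void* g_fn[kBuckets];
  std::uint64_t g_count[kBuckets];
  }  // namespace

  // no_instrument_function keeps the hooks themselves from being
  // instrumented (which would recurse).
  extern "C" __attribute__((no_instrument_function))
  void __cyg_profile_func_enter(void* this_fn, void* /*call_site*/) {
    // Open-addressed table keyed by the entered function's address.
    std::uintptr_t h = reinterpret_cast<std::uintptr_t>(this_fn) >> 2;
    for (int i = 0; i < kBuckets; ++i) {
      int slot = static_cast<int>((h + i) & (kBuckets - 1));
      if (g_fn[slot] == this_fn || g_fn[slot] == 0) {
        g_fn[slot] = this_fn;
        ++g_count[slot];
        return;
      }
    }
  }

  extern "C" __attribute__((no_instrument_function))
  void __cyg_profile_func_exit(void* /*this_fn*/, void* /*call_site*/) {}

  // Dump "address count" pairs at shutdown; symbolize offline and feed the
  // top-N functions back into the build.
  __attribute__((no_instrument_function))
  void DumpCounts() {
    for (int i = 0; i < kBuckets; ++i) {
      if (g_fn[i])
        std::printf("%p %llu\n", g_fn[i],
                    static_cast<unsigned long long>(g_count[i]));
    }
  }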

I want to experiment with this approach. I was thinking though that manually setting aggressive optimization flags at the file level would be feasible in the shorter term.
 
As a total thought experiment:

Maybe you could build an automated system that randomly sets -O2 on
some percentage (say 15-25%) of .cc files, runs a suite of unittests,
and records speed-ups vs everything compiled for min-size. After a
reasonable number of runs every .cc file would be (likely) covered
multiple times, and you'd be able to make a good guess at percentage
improvement on a benchmark associated with enabling aggressive
optimizations on a .cc file. I guess the goal would be to get cycle
times down to a reasonable level...

By unittests you probably mean microbenchmarks? That's an interesting way to experiment. If only the variance of perf scores on Android were low enough to be sure about sub-1% improvements in metrics, we could then perform a search over the optimal set of files to optimize for speed. Unfortunately, fighting the variance will keep us busy for some time.



--
Egor Pasko

Tony Gentilcore

Dec 19, 2013, 9:56:57 AM
to Egor Pasko, chromium-dev
>> That being said, the reason I think this will be fruitful is that:
>> * As you know, we already compile with different flags on a
>> per-library basis, and I'd support continuing to do a better job of
>> that.
>
> I think per-library flags are too coarse-grained. The reason I want PGO is
> because even function-level size-vs-perf decisions are too coarse-grained.

That is true too. Again, you are right that it is a balance between
maintainability and benchmarks.

>> * Finding the files that benefit from greater optimization points out
>> where we can (and should) do better. I suspect we can often get the
>> same gains by looking at the asm differences between the two and then
>> writing more compiler-friendly c++ that allows -Os to do a better job.
>
>
> This approach historically helped to optimize some outstanding and localized
> performance issues. We should continue doing that for localized bottlenecks.
> The biggest concern is that hand-optimizing does not scale to the long tail
> of semi-hot functions. Also, balancing speed vs. size with individual C++
> CLs would be impossible.

Do you have any idea how much we care about the red hot spots vs the
long tail of semi hot stuff? My suspicion is that we'll get the most
bang for our buck at this point by focusing on the red hot ones. Once
we hit diminishing returns there, maybe we can get together a plan for
the long tail.

> One entertaining example: the "branch prediction" builtins. We use them
> heavily in Blink (LIKELY()/UNLIKELY()) while they are *ignored* by GCC at
> the -Os optimization level. This may no longer be true on newer versions of GCC
> (I did not verify), but if it is still the case, this seriously limits
> optimization opportunities.

I already really disliked LIKELY/UNLIKELY because they are hard to
maintain and don't work on Windows. I had no idea they didn't work on
Android either (due to -Os). That's probably the nail in the coffin to
continue discouraging folks from relying on this crutch.

Chris Hamilton

Dec 19, 2013, 10:02:15 AM
to Egor Pasko, Tony Gentilcore, chromium-dev, etie...@chromium.org
>> How about driving the selection of optimized files based on profiling
>> and some standard test suite? Or even better, based on actual
>> real-world profile data? If the build of libchromeview for Android is
>> not too far removed from the one built for ChromeOS, you could even be
>> driving this using real data collected by ChromeOS Wide Profiling
>> (go/cwp). This would hopefully prevent bit rot issues...
>
>
> I was thinking of the way to deal with rot automagically. I believe getting
> PGO-like data from test runs will already provide a *huge* performance
> boost! It is much simpler than chromeos-like wide profiling, and I think the
> latter would have a hard time to improve significantly over local test runs.
> Certainly not a 1-3 person project.
>
>>
>> Unfortunately this approach conflates "heavily used translation units"
>> with "translation units that would most benefit from optimization".
>> The only way to truly measure the latter in an automated way would be
>> to run benchmarks while turning on and off -O2 for individual .cc
>> files.
>
>
> So this hack should work pretty well in the medium term:
> 1. collect some coarse-grained function-level call counts (with cygprofile)
> on local perf tests
> 2. take the top-N functions
> 3. use a GCC plugin to optimize these functions aggressively

Yup, seems reasonable. In fact, my team implemented this exact
approach for Windows last quarter. We saw modest gains on some
benchmarks, but we found that it was too easy to optimize for a
specific benchmark. Picking a representative set of benchmarks is
key... hence we were looking at moving to using real world perf data.
The other missing piece is build infrastructure. I hacked together a
GYP extension that allowed per .cc file command-line flags to be
specified additionally in a whole separate config file. This file was
then automatically generated from the perf data, and fed into the
build. This helped to decouple build config from profiling and perf
optimizations, and is effectively a poor man's translation unit level
PGO.

We initially implemented this before the Chrome DLL split. The motivation for
the split was that the linker was fast approaching its memory limits.
Rather than disabling optimizations wholesale (freeing up linker
memory), we were pursuing this as a more directed way of choosing the
files to optimize, with the side effect of reducing linker memory.
However, this became a moot point once the Chrome DLL split landed.
It's been on the back burner since then. If you are pursuing this, it
would be great if we could make the build infrastructure integration
common such that it can be leveraged across platforms.

> I want to experiment with this approach. I was thinking though that manually
> setting aggressive optimization flags at the file level would be feasible in
> the shorter term.
>
>>
>> As a total thought experiment:
>>
>> Maybe you could build an automated system that randomly sets -O2 on
>> some percentage (say 15-25%) of .cc files, runs a suite of unittests,
>> and records speed-ups vs everything compiled for min-size. After a
>> reasonable number of runs every .cc file would be (likely) covered
>> multiple times, and you'd be able to make a good guess at percentage
>> improvement on a benchmark associated with enabling aggressive
>> optimizations on a .cc file. I guess the goal would be to get cycle
>> times down to a reasonable level...
>
>
> By unittests you probably mean microbenchmarks? That's an interesting way to
> experiment. If we only had variance from perf scores on Android low enough
> to be sure about sub-1% improvements in metrics, we could then perform a
> search over the optimal set of files to optimize for speed. Unfortunately,
> fighting the variance will keep us busy for some time.

I do indeed mean benchmarks :) And the reason for this thought
experiment is that simply measuring hot code does not find the code
that would most benefit from optimization (although it's a reasonable
proxy). We have also been fighting against variance in the measurement
process, and have a reasonably good handle on it now for the Windows
platform. We do all sorts of tricks to make sure caches are flushed,
to make sure the OS sees the executables as brand new and never loaded
before, pinning processes to a processor, increasing process
priorities, making sure as little else as possible is running on the
system, etc. Additionally, we've found that CPU perf counters are a way
more stable measure than wall time. Etienneb@ might have more comments
on this, as he's been chasing down the variance for the last couple quarters.
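On Linux/Android a couple of those tricks translate roughly to the following (a hypothetical helper, just for illustration, not code we ship):

  // Pin the current process to one core and raise its priority before
  // measuring. Needs appropriate privileges (root for a negative nice value).
  #include <sched.h>
  #include <sys/resource.h>
  #include <cstdio>

  bool PrepareForStableMeasurement(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0 /* current process */, sizeof(set), &set) != 0) {
      std::perror("sched_setaffinity");
      return false;
    }
    if (setpriority(PRIO_PROCESS, 0 /* current process */, -10) != 0) {
      std::perror("setpriority");
      return false;
    }
    return true;
  }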

Egor Pasko

Dec 19, 2013, 11:07:06 AM
to Tony Gentilcore, chromium-dev
On Thu, Dec 19, 2013 at 3:56 PM, Tony Gentilcore <to...@google.com> wrote:
>> That being said, the reason I think this will be fruitful is that:
>> * As you know, we already compile with different flags on a
>> per-library basis, and I'd support continuing to do a better job of
>> that.
>
> I think per-library flags are too coarse-grained. The reason I want PGO is
> because even function-level size-vs-perf decisions are too coarse-grained.

That is true too. Again, you are right that it is a balance between
maintainability and benchmarks.

>> * Finding the files that benefit from greater optimization points out
>> where we can (and should) do better. I suspect we can often get the
>> same gains by looking at the asm differences between the two and then
>> writing more compiler-friendly c++ that allows -Os to do a better job.
>
>
> This approach historically helped to optimize some outstanding and localized
> performance issues. We should continue doing that for localized bottlenecks.
> The biggest concern is that hand-optimizing does not scale to the long tail
> of semi-hot functions. Also, balancing speed vs. size with individual C++
> CLs would be impossible.

Do you have any idea how much we care about the red hot spots vs the
long tail of semi hot stuff? My suspicion is that we'll get the most
bang for our buck at this point by focusing on the red hot ones. Once
we hit diminishing returns there, maybe we can get together a plan for
the long tail.

Hard to say without experimenting. Higher-level optimizations applied to all code are expected to provide up to a 20% speedup over only "slightly" optimized code (code size is a challenge though). Compiler engineers have worked hard for decades to achieve this, and in theory we can use it almost for free. If we look at a perf profile of a page load, the best individual functions that we could optimize each consume about 1% of the time. It would require many weeks of work to get a 20% speedup using the manual optimization approach.

What is good about hand-optimizing hot paths vs. long-tail optimizations is that they seem to be orthogonal for Chromium: we can do both and get the bang multiplied.
 
> One entertaining example: the "branch prediction" builtins. We use them
> heavily in Blink (LIKELY()/UNLIKELY()) while they are *ignored* by GCC at
> the -Os optimization level. This may no longer be true on newer versions of GCC
> (I did not verify), but if it is still the case, this seriously limits
> optimization opportunities.

I already really disliked LIKELY/UNLIKELY because they are hard to
maintain and don't work on Windows. I had no idea they didn't work on
Android either (due to -Os). That's probably the nail in the coffin to
continue discouraging folks from relying on this crutch.

With PGO the branch prediction builtins should be ignored by the compiler. I don't think LIKELY/UNLIKELY are hard to maintain if they are limited to marking the sort of branching that happens only on slow paths, like exception paths.

--
Egor Pasko

Egor Pasko

Dec 19, 2013, 11:32:04 AM
to Chris Hamilton, Tony Gentilcore, chromium-dev, etie...@chromium.org
My main benchmark would be our list of top mobile websites (running as page_cycler). This could possibly be combined with more precise Speed Index measurements. On Android we don't have the luxury of verifying real-world improvements by deploying differently-built binaries. So I'd prefer to stick with improving metrics that are verifiable. Does that sound convincing?
 
The other missing piece is build infrastructure. I hacked together a
GYP extension that allowed per .cc file command-line flags to be
specified additionally in a whole separate config file. This file was
then automatically generated from the perf data, and fed into the
build. This helped to decouple build config from profiling and perf
optimizations, and is effectively a poor man's translation unit level
PGO.

File-level is still awesome! Can you point me at historical CLs where this worked?
 
We initially implemented this pre Chrome-DLL split. The motivation for
the split was due to the linker fast approaching its memory limits.
Rather than disabling optimizations wholesale (freeing up linker
memory), we were pursuing this as a more directed way of choosing the
files to optimize, with the side effect of reducing linker memory.
However, this became a moot point once the Chrome DLL split landed.
It's been on the back burner since then. If you are pursuing this, it
would be great if we could make the build infrastructure integration
common such that it can be leveraged across platforms.

Got it. These days I care about Android most, and I realize that the "GCC plugin" approach would not work for all platforms; I'll keep that in mind.
 
> I want to experiment with this approach. I was thinking though that manually
> setting aggressive optimization flags at the file level would be feasible in
> the shorter term.
>
>>
>> As a total thought experiment:
>>
>> Maybe you could build an automated system that randomly sets -O2 on
>> some percentage (say 15-25%) of .cc files, runs a suite of unittests,
>> and records speed-ups vs everything compiled for min-size. After a
>> reasonable number of runs every .cc file would be (likely) covered
>> multiple times, and you'd be able to make a good guess at percentage
>> improvement on a benchmark associated with enabling aggressive
>> optimizations on a .cc file. I guess the goal would be to get cycle
>> times down to a reasonable level...
>
>
> By unittests you probably mean microbenchmarks? That's an interesting way to
> experiment. If we only had variance from perf scores on Android low enough
> to be sure about sub-1% improvements in metrics, we could then perform a
> search over the optimal set of files to optimize for speed. Unfortunately,
> fighting the variance will keep us busy for some time.

I do indeed mean benchmarks :) And the reason for this thought
experiment is that simply measuring hot code does not find the code
that would most benefit from optimization (although it's a reasonable
proxy).

I understand and agree. This is simply because the hottest code is highly optimized already (memcpy and such).
 
We have also been fighting against variance in the measurement
process, and have a reasonably good handle on it now for the Windows
platform. We do all sorts of tricks to make sure caches are flushed,
to make sure the OS sees the executables as brand new and never loaded
before, pinning processes to a processor, increasing process
priorities, making sure as little else is running on the system, etc.
Additionally, we've found that CPU perf counters are a way more stable
measure than wall time. Etienneb@ might have more comments on this, as
he's been chasing down the variance for the last couple quarters.

Maybe we should report perf counters as scores too. That's a great idea!!

--
Egor Pasko

Nico Weber

Dec 19, 2013, 12:27:02 PM
to Egor Pasko, Chris Hamilton, Tony Gentilcore, chromium-dev, etie...@chromium.org
Is it possible to rewrite functions that are a) very hot and b) very reliant on compiler optimizations to be less reliant on compiler optimizations? Then they'll be faster with all compilers, and your gains are less likely to be lost when you update or switch compilers. It'd keep the build files simpler too.



Egor Pasko

Dec 19, 2013, 12:48:25 PM
to Nico Weber, Chris Hamilton, Tony Gentilcore, chromium-dev, etie...@chromium.org
On Thu, Dec 19, 2013 at 6:27 PM, Nico Weber <tha...@chromium.org> wrote:
Is it possible to rewrite functions that are a) very hot and b) very reliant on compiler optimizations to be less reliant on compiler optimizations? Then they'll be faster with all compilers, and your gains are less likely to be lost when you update or switch compilers. It'd keep the build files simpler too.

We should hand-optimize to some extent for very hot functions. Often that would make the functions less reliant on compiler optimizations. I believe all teams are already doing this to some extent. It's a good question whether we should do anything systematic to optimize more than we are doing now. I don't know. There would certainly be some opposition if these rewrites make the code less readable.

Manual loop unrolling would be a perfect demonstration of how this approach can go wrong. Very often, if you unroll a tight loop by hand, it does get less reliant on compiler unrolling (because very few compilers deal well with bad manual unrolling, ha ha). But is that what we want to do? That would be platform-dependent ugliness with ifdefs, detection of CPU features, etc.
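To make the "ugliness" concrete, a made-up illustration (not real Chromium code) of what a hand-unrolled version tends to grow into:

  // Before: readable, and something -O2/-O3 can unroll/vectorize on its own.
  float SumPlain(const float* v, int n) {
    float sum = 0.f;
    for (int i = 0; i < n; ++i)
      sum += v[i];
    return sum;
  }

  // After: hand-unrolled 4x, plus the inevitable per-platform branches.
  float SumUnrolled(const float* v, int n) {
  #if defined(__ARM_NEON__)
    // ... a NEON intrinsics version would go here, behind runtime CPU checks ...
  #endif
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
      s0 += v[i];
      s1 += v[i + 1];
      s2 += v[i + 2];
      s3 += v[i + 3];
    }
    float sum = s0 + s1 + s2 + s3;
    for (; i < n; ++i)  // Remainder loop.
      sum += v[i];
    return sum;
  }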

--
Egor Pasko

Scott Hess

Dec 19, 2013, 1:35:37 PM
to Nico Weber, Egor Pasko, Chris Hamilton, Tony Gentilcore, chromium-dev, etie...@chromium.org
I would prefer to have sane code with complicated systems to detect and optimize hotspots over insane code.

I don't think that's the choice, of course, but it is true that complexifying code is generally easier than simplifying it.  A given engineer might be able to fully comprehend the simple code and see how it could be rewritten to be faster, but they may not be able to fully comprehend the complex code to see through to the underlying simplicity.

-scott

Nico Weber

Dec 19, 2013, 1:37:03 PM
to Egor Pasko, Chris Hamilton, Tony Gentilcore, chromium-dev, etie...@chromium.org
I don't mean doing things like loop unrolling manually; I mean more trying to come up with ways for functions to do less work.

It's difficult to talk about this in the abstract. For the files you looked at, do you know which of the O3 passes are the ones that make a difference? Do you have examples of the ways the generated code differs between the slow and fast cases?

Egor Pasko

Dec 19, 2013, 1:36:50 PM
to Scott Hess, Nico Weber, Chris Hamilton, Tony Gentilcore, chromium-dev, etie...@chromium.org
On Thu, Dec 19, 2013 at 7:35 PM, Scott Hess <sh...@chromium.org> wrote:
I would prefer to have sane code with complicated systems to detect and optimize hotspots over insane code.

I don't think that's the choice, of course, but it is true that complexifying code is generally easier than simplifying it.  A given engineer might be able to fully comprehend the simple code and see how it could be rewritten to be faster, but they may not be able to fully comprehend the complex code to see through to the underlying simplicity.

+1. I could not say it better.



--
Egor Pasko

Egor Pasko

Dec 19, 2013, 2:06:08 PM
to Nico Weber, Chris Hamilton, Tony Gentilcore, chromium-dev, etie...@chromium.org
On Thu, Dec 19, 2013 at 7:37 PM, Nico Weber <tha...@chromium.org> wrote:
On Thu, Dec 19, 2013 at 9:48 AM, Egor Pasko <pa...@google.com> wrote:



On Thu, Dec 19, 2013 at 6:27 PM, Nico Weber <tha...@chromium.org> wrote:
Is it possible to rewrite functions that are a) very hot and b) very reliant on compiler optimizations to be less reliant on compiler optimizations? Then they'll be faster with all compilers, and your gains are less likely to be lost when you update or switch compilers. It'd keep the build files simpler too.

We should hand-optimize for some extent for very hot functions. Often it would make the functions less reliant on compiler optimizations. I believe all teams are doing it already at some extent. It's a good question whether we should do anything systematically to optimize more than we are doing now. I don't know. There would certainly be some opposition if these rewrites make the code less readable.

Manual loop unrolling would be a perfect demonstration of how this approach can go wrong. Very often, if you unroll a tight loop by hand, it would get less reliant on compiler unrolling (because very few compilers deal well with bad manual unrolling, ha ha). But is it what we want to do? That would be platform-dependent ugliness with ifdefs, detection of CPU features etc.

I don't mean doing things like loop unrolling manually, I mean more trying to come up with ways that functions have to do less work.

It's difficult to talk about this in abstract – for the files you looked at, do you know which of the O3 passes are the ones that make a difference? Do you have examples for in what ways the generated code is different in the slow and fast cases? 

I did not experiment with individual passes (yet); that's more complex. Let's look at the example in this thread of ignoring/not-ignoring the branch prediction builtins. It fits your suggestion quite well. If we are able to measure the benefit of inserting one branch prediction builtin into a hot function, I am all for making this change. I think we may *not* be able to measure the benefit of this LIKELY-enabling for many individual functions from the long tail, but by enabling the optimization everywhere we may see the difference. This latter case should rather be done via PGO to be clean.

--
Egor Pasko

Eric Penner

Jan 9, 2014, 7:35:02 PM
to chromi...@chromium.org, Nico Weber, Chris Hamilton, Tony Gentilcore, etie...@chromium.org
Shame on me for digesting chromium-dev :P, I didn't see this discussion until now.

Some extra data in this area: I saw a 10% reduction in total CPU time across all render-critical threads on Android, by using -O2 instead of -Os, on this benchmark:

I was starting to consider some of the ideas in this thread (in terms of how to not increase binary size but still get this big perf boost).

If one of the gyp solutions is implemented, then I think the above benchmark would be a good one to measure the final outcome, since it covers _everything_ required to get a frame on the screen during accelerated gestures. Conversely it's probably a bit noisier than some micro-benchmarks.

Cheers!

Eric

Egor Pasko

Jan 13, 2014, 5:58:28 AM
to epe...@chromium.org, chromium-dev, Nico Weber, Chris Hamilton, Tony Gentilcore, etie...@chromium.org
On Fri, Jan 10, 2014 at 1:35 AM, Eric Penner <epe...@chromium.org> wrote:
Shame on me for digesting chromium-dev :P, I didn't see this discussion until now.

:)
 
Some extra data in this area: I saw a 10% reduction in total CPU time across all render critical threads on Android, by using O2 instead of Os, on this benchmark:

That's an interesting metric, and we probably want to bring it down with some specific CPU optimizations.

The metric confuses me though in a few ways:

1. bringing down the cpu utilization %-age on per-render-frame activities is a sign of a good thing, but utilization can also go down if new stalls are introduced and the whole workload takes more time. I am not sure compiler optimizations can lead us to introducing more stalls like that, but generally I would expect this metric would change with unrelated commits in ways that are hard to explain.

2. it does not feel comfy to average %-ages, and then to average them more with results acquired from different inputs. This makes it very hard to perform any meaningful interpretation of the numbers. For example, huge relative changes in small numbers won't affect the averages much. Geomean is preferred on absolute numbers, imho.

3. each time results from 5 experiments have a standard deviation of about 25%, for example:
What can we conclude by observing that averaged results are more "stable" (i.e. less variability)? Probably not much. If we repeatedly observe averaged results improve by 10% with a change X while there was stddev of 25% before averaging? Probably we can conclude that we improved the metric, but it immediately raises questions like "do we observe noise because perf-testing conditions change for each run? what if they change identically each time when run on a bot? what if these conditions are not representative to what users observe?"
 
I was starting to consider some of the ideas in this thread (in terms of how to not increase binary size but still get this big perf boost).

If one of the gyp solutions is implemented, then I think the above benchmark would be a good one to measure the final outcome, since it covers _everything_ required to get a frame on the screen during accelerated gestures. Conversely it's probably a bit noisier than some micro-benchmarks.

I'd like to look at this too. I think we should also look at PLT and smoothness benchmarks. If we can make significant improvements on them, i.e. on what users directly observe, that'd be pretty exciting, right?

--
Egor Pasko

Eric Penner

Jan 13, 2014, 4:55:10 PM
to Egor Pasko, chromium-dev, Nico Weber, Chris Hamilton, Tony Gentilcore, etie...@chromium.org
Sorry if this is a duplicate. The last attempt got bounced by chromium-dev.


On Mon, Jan 13, 2014 at 2:58 AM, Egor Pasko <pa...@google.com> wrote:



On Fri, Jan 10, 2014 at 1:35 AM, Eric Penner <epe...@chromium.org> wrote:
Shame on me for digesting chromium-dev :P, I didn't see this discussion until now.

:)
 
Some extra data in this area: I saw a 10% reduction in total CPU time across all render critical threads on Android, by using O2 instead of Os, on this benchmark:

That's an interesting metric, and we probably want to bring it down with some specific CPU optimizations.

The metric confuses me though in a few ways:

1. bringing down the cpu utilization %-age on per-render-frame activities is a sign of a good thing, but utilization can also go down if new stalls are introduced and the whole workload takes more time. I am not sure compiler optimizations can lead us to introducing more stalls like that, but generally I would expect this metric would change with unrelated commits in ways that are hard to explain.


Yeah, we need to deal with big stalls somehow. As some extra context, this was meant for getting an understanding of our CPU usage on pages where we are already hitting 60Hz, so there shouldn't be big stalls. It's also meant for tracking progress over time and not necessarily closing the tree just yet. However, it should still handle stalls, as those will make the result incorrect, like you say. I think choosing the correct time span during a 60Hz gesture and normalizing by frames-produced might work. Alternatively we could hand-pick traces and normalize by very specific action counts.
 
2. it does not feel comfy to average %-ages, and then to average them more with results acquired from different inputs. This makes it very hard to perform any meaningful interpretation of the numbers. For example, huge relative changes in small numbers won't affect the averages much. Geomean is preferred on absolute numbers, imho.


It sounds like you want to focus on relative time changes, but I think the opposite is true. A 1000% improvement in something that takes 0.01ms doesn't help us much. This test is meant to keep us honest that such relative changes make a real-world impact in absolute terms.

If we are at 1.2 cores of fast-path CPU time while hitting 60Hz today, and we are at 0.6 cores of fast-path CPU time while still hitting 60Hz in the future, I think that would be easy enough to interpret, at a minimum.
 
 
3. each time results from 5 experiments have a standard deviation of about 25%, for example:
What can we conclude by observing that averaged results are more "stable" (i.e. less variability)? Probably not much. If we repeatedly observe averaged results improve by 10% with a change X while there was stddev of 25% before averaging? Probably we can conclude that we improved the metric, but it immediately raises questions like "do we observe noise because perf-testing conditions change for each run? what if they change identically each time when run on a bot? what if these conditions are not representative to what users observe?"
 

I think the 25% variance you are seeing is in clock-time, not CPU-time. Clock-time includes time when the CPU is descheduled and varies widely due to blocking calls etc. The variance for the same page in CPU-time is usually pretty small. There is also some variance from some pages having strange stalls.

 
I was starting to consider some of the ideas in this thread (in terms of how to not increase binary size but still get this big perf boost).

If one of the gyp solutions is implemented, then I think the above benchmark would be a good one to measure the final outcome, since it covers _everything_ required to get a frame on the screen during accelerated gestures. Conversely it's probably a bit noisier than some micro-benchmarks.

I'd like to look at this too. I think we should also look at PLT and smoothness benchmarks. If we can make significant improvements on them, i.e. on what users directly observe, that'd be pretty exciting, right?

The smoothness benchmarks cap at 60Hz, so there is a possibility that a large improvement won't show up there. However, the difference will still be directly observed by users in their battery life, and better blink performance, if our fast-path scales below 16ms.
 

--
Egor Pasko

Egor Pasko

Jan 14, 2014, 10:00:17 AM
to Eric Penner, chromium-dev, Nico Weber, Chris Hamilton, Tony Gentilcore, etie...@chromium.org



On Mon, Jan 13, 2014 at 10:50 PM, Eric Penner <epe...@google.com> wrote:
On Mon, Jan 13, 2014 at 2:58 AM, Egor Pasko <pa...@google.com> wrote:



On Fri, Jan 10, 2014 at 1:35 AM, Eric Penner <epe...@chromium.org> wrote:
Shame on me for digesting chromium-dev :P, I didn't see this discussion until now.

:)
 
Some extra data in this area: I saw a 10% reduction in total CPU time across all render critical threads on Android, by using O2 instead of Os, on this benchmark:

That's an interesting metric, and we probably want to bring it down with some specific CPU optimizations.

The metric confuses me though in a few ways:

1. bringing down the cpu utilization %-age on per-render-frame activities is a sign of a good thing, but utilization can also go down if new stalls are introduced and the whole workload takes more time. I am not sure compiler optimizations can lead us to introducing more stalls like that, but generally I would expect this metric would change with unrelated commits in ways that are hard to explain.


Yeah we need to deal with big stalls somehow. As some extra context, this was meant for getting an understanding of our CPU usage on pages where we are already hitting 60Hz, so there shouldn't be big stalls. It's also meant for tracking progress over time and not necessarily closing the tree just yet. However it should still handle stalls as that will make the result incorrect like you say. I think choosing the correct time span during a 60Hz gesture, and normalizing by frames-produced might work. Alternatively we could hand-pick traces and normalize by very specific action counts.
 
2. it does not feel comfy to average %-ages, and then to average them more with results acquired from different inputs. This makes it very hard to perform any meaningful interpretation of the numbers. For example, huge relative changes in small numbers won't affect the averages much. Geomean is preferred on absolute numbers, imho.

It sounds like you want to focus on relative time changes, but I think the opposite is true. A 1000% improvement in something that takes 0.01ms doesn't help us much. This test is meant to keep us honest that such relative changes make a real-world impact in absolute terms.

Hm, you are right, geomean does not work here, and it's difficult to define what we want. In this example I am worried about weighting an X% improvement in cpu_time_percentage equally for a page that loads in 0.5 sec (page A) and another one (page B) that loads in 10 sec. Improving on page B probably improves battery life, while a similar percentage regression on A will probably not affect battery life noticeably. More devastating: improvements in CPU utilization on page A will likely make it faster to "load", which would be noticeable, but whatever happens to PLT for page B in the range of +-2 sec will probably look the same to the user. I am not suggesting averaging the numbers in a different way (yet); I am saying that averaging some numbers does not make them easier to interpret.

If we are at 1.2 cores of fast-path CPU time while hitting 60Hz today, and we are at 0.6 cores of fast-path CPU time while still hitting 60Hz in the future, I think that would be easy enough to interpret, at a minimum. 

I think it is hard to interpret because it was averaged a few times in non-absolute terms, so how do we model it to build confidence intervals? This method also forgets the distribution of results very early; I am sorry to bring this up now, since we are doing similar things in many other performance measurements against Chrome. Suppose we had CPU utilization at 0.6 and "improved" it to 0.3 half of the time and 0.7 the other half, which averages to 0.5. That should not be considered an improvement. Sorry again for diverting the discussion with particular problems without a clear suggestion on alternatives.
 

3. each time results from 5 experiments have a standard deviation of about 25%, for example:
What can we conclude by observing that averaged results are more "stable" (i.e. less variability)? Probably not much. If we repeatedly observe averaged results improve by 10% with a change X while there was stddev of 25% before averaging? Probably we can conclude that we improved the metric, but it immediately raises questions like "do we observe noise because perf-testing conditions change for each run? what if they change identically each time when run on a bot? what if these conditions are not representative to what users observe?"
 

I think the 25% variance you are seeing is in clock-time, not CPU-time. Clock-time includes time when the CPU is descheduled and varies widely due to blocking calls etc. The variance for the same page in CPU-time is usually pretty small. There is also some variance from some pages having strange stalls.

In the link I provided:
Avg thread_total_fast_path_cpu_time_percentage: 72.542536%
Sd  thread_total_fast_path_cpu_time_percentage: 25.572983%

Where do you observe pretty small variance for that? How many runs is it based on?
(I am guessing the *_total_* should exclude the pages with strange stalls, but I did not look at what they are)
 
I was starting to consider some of the ideas in this thread (in terms of how to not increase binary size but still get this big perf boost).

If one of the gyp solutions is implemented, then I think the above benchmark would be a good one to measure the final outcome, since it covers _everything_ required to get a frame on the screen during accelerated gestures. Conversely it's probably a bit noisier than some micro-benchmarks.

I'd like to look at this too. I think we should also look at PLT and smoothness benchmarks. If we can make significant improvements on them, i.e. on what users directly observe, that'd be pretty exciting, right?


The smoothness benchmarks cap at 60Hz, so there is a possibility that a large improvement won't show up there.

Right. It would be natural to look at CPU utilization. If the current cpu_time_percentage does not work in a stable way, we could look at getrlimit or at hardware counters for instructions retired; we could even do something like "instructions retired per frame".
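For illustration, on Linux/Android the instructions-retired counter can be read through perf_event_open; a rough sketch (hypothetical code, not something we have in the tree, and it needs the kernel to allow access to perf events):

  // Count retired instructions for a block of work via perf_event_open.
  #include <cstdint>
  #include <cstdio>
  #include <cstring>
  #include <linux/perf_event.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static long PerfEventOpen(perf_event_attr* attr) {
    // pid=0, cpu=-1: measure this process on any CPU; no group, no flags.
    return syscall(__NR_perf_event_open, attr, 0, -1, -1, 0);
  }

  int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = static_cast<int>(PerfEventOpen(&attr));
    if (fd < 0) {
      std::perror("perf_event_open");
      return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... produce one frame (or run the workload being measured) ...
    volatile std::uint64_t sink = 0;
    for (int i = 0; i < 1000000; ++i) sink += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t instructions = 0;
    if (read(fd, &instructions, sizeof(instructions)) != sizeof(instructions))
      instructions = 0;
    std::printf("instructions retired: %llu\n",
                static_cast<unsigned long long>(instructions));
    close(fd);
    return 0;
  }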
 
However, the difference will still be directly observed by users in their battery life, and better blink performance, if our fast-path scales below 16ms.

Agreed.

--
Egor Pasko

Eric Penner

Jan 14, 2014, 3:23:13 PM
to Egor Pasko, chromium-dev, Nico Weber, Chris Hamilton, Tony Gentilcore, etie...@chromium.org
Summarizing my responses:

I think some of your concerns will be addressed when we normalize by frames-produced, which we are about to do.  Then we'll have an absolute number per page measured (cpu-ms-per-frame). Variance across frames shouldn't be a big concern given the nature of the code being targeted, but maybe that could be addressed elsewhere. Variance within a single page is already low and I think that will bring it down further. Load times are intentionally not measured here (or if they are that's a bug). We only want to measure the steady and repetitive creation of frames during a short user gesture.

Now I understand the large variance you are seeing there. A large variance between pages is expected, since those different pages do arbitrarily different amounts of CPU work per frame. Variance within a single page is way lower. But the same should be true of load times on different pages and frame rates on different pages, right? The perf dashboard does the averaging. Perhaps a geometric mean might be a better way to do that in the dashboard?

Thankfully, in this case we could glean a lot by literally using a single page, or eventually a small set (not averaged), if the mean doesn't work between pages.

Eric