> There are surely concerns with this approach: what do we do if critical
> functionality migrates from one file to another? One-file libraries would
> look ugly in gyp?

A partial mitigation to both of these is to use #pragma optimize("O3")
inside the .cc file instead of hacking .gyp. You could wrap it in e.g. a
PRAGMA_OPTIMIZE_FOR_SPEED() define, which is slightly more readable and
more likely to migrate with the code in question. It can also be applied
at a function (or even finer) level with a pragma push/pop.
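A rough sketch of what that could look like with GCC's optimize pragmas
(the macro names and exact pragma set here are illustrative assumptions,
not existing Chromium macros):

  // Hypothetical helper macros; GCC >= 4.4 honors these pragmas, and other
  // compilers would need their own branch (e.g. MSVC's #pragma optimize).
  // Clang does not implement the GCC optimize pragma, hence the guard.
  #if defined(__GNUC__) && !defined(__clang__)
  #define PRAGMA_OPTIMIZE_FOR_SPEED_BEGIN() \
    _Pragma("GCC push_options")             \
    _Pragma("GCC optimize(\"O3\")")
  #define PRAGMA_OPTIMIZE_FOR_SPEED_END() \
    _Pragma("GCC pop_options")
  #else
  #define PRAGMA_OPTIMIZE_FOR_SPEED_BEGIN()
  #define PRAGMA_OPTIMIZE_FOR_SPEED_END()
  #endif

  PRAGMA_OPTIMIZE_FOR_SPEED_BEGIN()
  // Functions defined between the two macros get -O3 even in an -Os build.
  void HotInnerLoop(const int* data, int n) {
    // ...
  }
  PRAGMA_OPTIMIZE_FOR_SPEED_END()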
I think your investigation here will be quite fruitful!
But I think that per-file compilation flags will lead to short-term
gains that will quickly rot (or be a terrible maintenance burden).
For instance, over time, a particular file may evolve gradually from
something that would benefit from -O2 to something that would benefit
from -Os.
That being said, the reason I think this will be fruitful is that:
* As you know, we already compile with different flags on a
per-library basis, and I'd support continuing to do a better job of
that.
* Finding the files that benefit from greater optimization points out
where we can (and should) do better. I suspect we can often get the
same gains by looking at the asm differences between the two and then
writing more compiler-friendly C++ that allows -Os to do a better job.
I started doing this on Windows with the PGO tools in VS2013. I just used an arbitrary set of synthetic benchmarks as test data for now (dromaeo, robohornet, etc.). We're waiting on toolchain fixes from MSFT in the next update, but it appears that it could be quite beneficial. The tools suggest that only ~2% of code benefits from speed optimization but that the rest should be size-optimized. Right now on Windows, we have ~35% set to speed opt (all of blink, v8, base, and allocator, but nothing else), which can surely be improved.
Figuring out how to include profile data from real users on sundry hardware platforms would obviously be great. Presumably "this file is hot" would map fine across platforms for most code based on go/cwp. Alternatively, we might be able to determine what PGO decided to optimize, and export that elsewhere.
How about driving the selection of optimized files based on profiling
and some standard test suite? Or even better, based on actual
real-world profile data? If the build of libchromeview for Android is
not too far removed from the one built on ChromeOS, you could even be
driving this using real data collected by ChromeOS Wide Profiling
(go/cwp). This would hopefully prevent bit-rot issues...
Unfortunately this approach conflates "heavily used translation units"
with "translation units that would most benefit from optimization".
The only way to truly measure the latter in an automated way would be
to run benchmarks while turning on and off -O2 for individual .cc
files.
As a total thought experiment:
Maybe you could build an automated system that randomly sets -O2 on
some percentage (say 15-25%) of .cc files, runs a suite of unittests,
and records speed-ups vs everything compiled for min-size. After a
reasonable number of runs every .cc file would be (likely) covered
multiple times, and you'd be able to make a good guess at percentage
improvement on a benchmark associated with enabling aggressive
optimizations on a .cc file. I guess the goal would be to get cycle
times down to a reasonable level...
>> That being said, the reason I think this will be fruitful is that:
>> * As you know, we already compile with different flags on a
>> per-library basis, and I'd support continuing to do a better job of
>> that.
>
> I think per-library flags are too coarse-grained. The reason I want PGO is
> because even function-level size-vs-perf decisions are too coarse-grained.

That is true too. Again, you are right that it is a balance between
maintainability and benchmarks.

>> * Finding the files that benefit from greater optimization points out
>> where we can (and should) do better. I suspect we can often get the
>> same gains by looking at the asm differences between the two and then
>> writing more compiler-friendly C++ that allows -Os to do a better job.
>
> This approach historically helped to optimize some outstanding and localized
> performance issues. We should continue doing that for localized bottlenecks.
> The biggest concern is that hand-optimizing does not scale to the long tail
> of semi-hot functions. Also, balancing speed vs. size with individual C++
> CLs would be impossible.

Do you have any idea how much we care about the red hot spots vs. the
long tail of semi-hot stuff? My suspicion is that we'll get the most
bang for our buck at this point by focusing on the red hot ones. Once
we hit diminishing returns there, maybe we can get together a plan for
the long tail.
> One entertaining example: the "branch prediction" builtins. We use them
> heavily in Blink (LIKELY()/UNLIKELY()) while they are *ignored* by GCC at
> -Os optimization level. This could be no longer true on new versions of GCC
> (I did not verify), but if it is still the case, this seriously limits
> optimization opportunities.

I already really disliked LIKELY/UNLIKELY because they are hard to
maintain and don't work on Windows. I had no idea they didn't work on
Android either (due to -Os). That's probably the nail in the coffin to
continue discouraging folks from relying on this crutch.
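For context, these macros are usually thin wrappers over __builtin_expect;
a sketch of the common pattern (not the exact Blink definition):

  // Branch-prediction hints: compilers without __builtin_expect (e.g. MSVC)
  // get a no-op, which is part of why the hints do nothing on Windows.
  #if defined(__GNUC__)
  #define LIKELY(x) __builtin_expect(!!(x), 1)
  #define UNLIKELY(x) __builtin_expect(!!(x), 0)
  #else
  #define LIKELY(x) (x)
  #define UNLIKELY(x) (x)
  #endif

  // Typical use: mark the error path as rarely taken.
  // if (UNLIKELY(!buffer))
  //   return false;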
The other missing piece is build infrastructure. I hacked together a
GYP extension that allowed per .cc file command-line flags to be
specified additionally in a whole separate config file. This file was
then automatically generated from the perf data, and fed into the
build. This helped to decouple build config from profiling and perf
optimizations, and is effectively a poor man's translation unit level
PGO.
We initially implemented this before the Chrome DLL split. The motivation
for the split was that the linker was fast approaching its memory limits.
Rather than disabling optimizations wholesale (freeing up linker
memory), we were pursuing this as a more directed way of choosing the
files to optimize, with the side effect of reducing linker memory.
However, this became a moot point once the Chrome DLL split landed.
It's been on the back burner since then. If you are pursuing this, it
would be great if we could make the build infrastructure integration
common, such that it can be leveraged across platforms.
> I want to experiment with this approach. I was thinking though that manual
> setting of aggressive optimization flags at file level would be feasible in
> the shorter term.
>
>> As a total thought experiment:
>>
>> Maybe you could build an automated system that randomly sets -O2 on
>> some percentage (say 15-25%) of .cc files, runs a suite of unittests,
>> and records speed-ups vs everything compiled for min-size. After a
>> reasonable number of runs every .cc file would be (likely) covered
>> multiple times, and you'd be able to make a good guess at percentage
>> improvement on a benchmark associated with enabling aggressive
>> optimizations on a .cc file. I guess the goal would be to get cycle
>> times down to a reasonable level...
>
> By unittests you probably mean microbenchmarks? That's an interesting way to
> experiment. If we only had variance from perf scores on Android low enough
> to be sure about sub-1% improvements in metrics, we could then perform a
> search over the optimal set of files to optimize for speed. Unfortunately,
> fighting the variance will keep us busy for some time.

I do indeed mean benchmarks :) And the reason for this thought
experiment is that simply measuring hot code does not find the code
that would most benefit from optimization (although it's a reasonable
proxy).
We have also been fighting against variance in the measurement
process, and have a reasonably good handle on it now for the Windows
platform. We do all sorts of tricks: making sure caches are flushed,
making sure the OS sees the executables as brand new and never loaded
before, pinning processes to a processor, increasing process
priorities, making sure as little else as possible is running on the
system, etc.
Additionally, we've found that CPU perf counters are a way more stable
measure than wall time. Etienneb@ might have more comments on this, as
he's been chasing down the variance for the last couple quarters.
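To illustrate the wall-time vs. CPU-time distinction, a measurement along
these lines can be taken on Windows (a minimal sketch, not the actual
harness used here):

  #include <windows.h>
  #include <stdio.h>

  // Compare wall time with CPU cycle time for the current process.
  // Cycle counts exclude time spent descheduled, which is a big part of
  // why they are more stable than wall-clock measurements.
  int main() {
    LARGE_INTEGER freq, start, end;
    ULONG64 cycles_start = 0, cycles_end = 0;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    QueryProcessCycleTime(GetCurrentProcess(), &cycles_start);

    // ... run the workload being measured ...

    QueryProcessCycleTime(GetCurrentProcess(), &cycles_end);
    QueryPerformanceCounter(&end);

    printf("wall: %.3f ms, cycles: %llu\n",
           1000.0 * (end.QuadPart - start.QuadPart) / freq.QuadPart,
           cycles_end - cycles_start);
    return 0;
  }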
Is it possible to rewrite functions that are a) very hot and b) very reliant on compiler optimizations to be less reliant on compiler optimizations? Then they'll be faster with all compilers, and your gains are less likely to be lost when you update or switch compilers. It'd keep the build files simpler too.
I would prefer to have sane code with complicated systems to detect and optimize hotspots over insane code.

I don't think that's the choice, of course, but it is true that complexifying code is generally easier than simplifying it. A given engineer might be able to fully comprehend the simple code and see how it could be rewritten to be faster, but they may not be able to fully comprehend the complex code to see through to the underlying simplicity.
On Thu, Dec 19, 2013 at 9:48 AM, Egor Pasko <pa...@google.com> wrote:
On Thu, Dec 19, 2013 at 6:27 PM, Nico Weber <tha...@chromium.org> wrote:
>> Is it possible to rewrite functions that are a) very hot and b) very
>> reliant on compiler optimizations to be less reliant on compiler
>> optimizations? Then they'll be faster with all compilers, and your gains
>> are less likely to be lost when you update or switch compilers. It'd keep
>> the build files simpler too.
>
> We should hand-optimize to some extent for very hot functions. Often that
> would make the functions less reliant on compiler optimizations. I believe
> all teams are doing it already to some extent. It's a good question whether
> we should do anything systematically to optimize more than we are doing
> now. I don't know. There would certainly be some opposition if these
> rewrites make the code less readable.
>
> Manual loop unrolling would be a perfect demonstration of how this approach
> can go wrong. Very often, if you unroll a tight loop by hand, it would get
> less reliant on compiler unrolling (because very few compilers deal well
> with bad manual unrolling, ha ha). But is that what we want to do? That
> would be platform-dependent ugliness with ifdefs, detection of CPU
> features, etc.

I don't mean doing things like loop unrolling manually, I mean more trying
to come up with ways that functions have to do less work.

It's difficult to talk about this in the abstract – for the files you looked
at, do you know which of the O3 passes are the ones that make a difference?
Do you have examples of the ways in which the generated code differs in the
slow and fast cases?
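To make the loop-unrolling example concrete, here is a toy (hypothetical)
illustration of the kind of hand-unrolling that tends to age badly compared
with leaving the decision to the compiler:

  #include <stddef.h>

  // Straightforward version: the compiler is free to unroll, vectorize,
  // or leave it alone depending on the -O level and target CPU.
  int SumSimple(const int* v, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
      sum += v[i];
    return sum;
  }

  // Hand-unrolled version: bakes in one unroll factor, adds a tail loop
  // to maintain, and can fight the compiler's own unrolling and
  // vectorization decisions on some targets.
  int SumUnrolled(const int* v, size_t n) {
    int sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
      sum += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; ++i)
      sum += v[i];
    return sum;
  }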
Shame on me for digesting chromium-dev :P, I didn't see this discussion until now.
Some extra data in this area: I saw a 10% reduction in total CPU time across all render critical threads on Android, by using O2 instead of Os, on this benchmark:
I was starting to consider some of the ideas in this thread (in terms of how to not increase binary size but still get this big perf boost). If one of the gyp solutions is implemented, then I think the above benchmark would be a good one to measure the final outcome, since it covers _everything_ required to get a frame on the screen during accelerated gestures. Conversely it's probably a bit noisier than some micro-benchmarks.
On Fri, Jan 10, 2014 at 1:35 AM, Eric Penner <epe...@chromium.org> wrote:
> Shame on me for digesting chromium-dev :P, I didn't see this discussion
> until now.

:)

> Some extra data in this area: I saw a 10% reduction in total CPU time
> across all render critical threads on Android, by using O2 instead of Os,
> on this benchmark:

That's an interesting metric, and we probably want to bring it down with
some specific CPU optimizations. The metric confuses me though in a few
ways:

1. Bringing down the CPU utilization percentage on per-render-frame
activities is a sign of a good thing, but utilization can also go down if
new stalls are introduced and the whole workload takes more time. I am not
sure compiler optimizations can lead us to introducing more stalls like
that, but generally I would expect this metric to change with unrelated
commits in ways that are hard to explain.

2. It does not feel comfy to average percentages, and then to average them
further with results acquired from different inputs. This makes it very
hard to perform any meaningful interpretation of the numbers. For example,
huge relative changes in small numbers won't affect the averages much.
Geomean on absolute numbers is preferred, imho.

3. Each time, results from 5 experiments have a standard deviation of about
25%, for example:

What can we conclude by observing that averaged results are more "stable"
(i.e. less variability)? Probably not much. If we repeatedly observe
averaged results improve by 10% with a change X while there was a stddev of
25% before averaging? Probably we can conclude that we improved the metric,
but it immediately raises questions like: do we observe noise because
perf-testing conditions change for each run? What if they change
identically each time when run on a bot? What if these conditions are not
representative of what users observe?

> I was starting to consider some of the ideas in this thread (in terms of
> how to not increase binary size but still get this big perf boost). If one
> of the gyp solutions is implemented, then I think the above benchmark
> would be a good one to measure the final outcome, since it covers
> _everything_ required to get a frame on the screen during accelerated
> gestures. Conversely it's probably a bit noisier than some
> micro-benchmarks.

I'd like to look at this too. I think we should also look at PLT and
smoothness benchmarks. If we can make significant improvements on them,
i.e. on what users directly observe, that'd be pretty exciting, right?
--
Egor Pasko
On Mon, Jan 13, 2014 at 2:58 AM, Egor Pasko <pa...@google.com> wrote:
On Fri, Jan 10, 2014 at 1:35 AM, Eric Penner <epe...@chromium.org> wrote:
>> Shame on me for digesting chromium-dev :P, I didn't see this discussion
>> until now.
>
> :)
>
>> Some extra data in this area: I saw a 10% reduction in total CPU time
>> across all render critical threads on Android, by using O2 instead of Os,
>> on this benchmark:
>
> That's an interesting metric, and we probably want to bring it down with
> some specific CPU optimizations. The metric confuses me though in a few
> ways:
>
> 1. Bringing down the CPU utilization percentage on per-render-frame
> activities is a sign of a good thing, but utilization can also go down if
> new stalls are introduced and the whole workload takes more time. I am not
> sure compiler optimizations can lead us to introducing more stalls like
> that, but generally I would expect this metric to change with unrelated
> commits in ways that are hard to explain.
Yeah, we need to deal with big stalls somehow. As some extra context, this was meant for getting an understanding of our CPU usage on pages where we are already hitting 60Hz, so there shouldn't be big stalls. It's also meant for tracking progress over time and not necessarily closing the tree just yet. However, it should still handle stalls, since those would make the result incorrect, like you say. I think choosing the correct time span during a 60Hz gesture and normalizing by frames produced might work. Alternatively we could hand-pick traces and normalize by very specific action counts.
> 2. It does not feel comfy to average percentages, and then to average them
> further with results acquired from different inputs. This makes it very
> hard to perform any meaningful interpretation of the numbers. For example,
> huge relative changes in small numbers won't affect the averages much.
> Geomean on absolute numbers is preferred, imho.
It sounds like you want to focus on relative time changes, but I think the opposite is true. A 1000% improvement in something that takes 0.01ms doesn't help us much. This test is meant to keep us honest about whether such relative changes make a real-world impact in absolute terms.
If we are at 1.2 cores of fast-path CPU time while hitting 60Hz today, and we are at 0.6 cores of fast-path CPU time while still hitting 60Hz in the future, I think that would be easy enough to interpret, at a minimum.
> 3. Each time, results from 5 experiments have a standard deviation of about
> 25%, for example:
>
> What can we conclude by observing that averaged results are more "stable"
> (i.e. less variability)? Probably not much. If we repeatedly observe
> averaged results improve by 10% with a change X while there was a stddev of
> 25% before averaging? Probably we can conclude that we improved the metric,
> but it immediately raises questions like: do we observe noise because
> perf-testing conditions change for each run? What if they change
> identically each time when run on a bot? What if these conditions are not
> representative of what users observe?
I think the 25% variance you are seeing is in clock time, not CPU time. Clock time includes time when the process is descheduled and varies widely due to blocking calls, etc. The variance for the same page in CPU time is usually pretty small. There is also some variance from some pages having strange stalls.
>> I was starting to consider some of the ideas in this thread (in terms of
>> how to not increase binary size but still get this big perf boost). If one
>> of the gyp solutions is implemented, then I think the above benchmark
>> would be a good one to measure the final outcome, since it covers
>> _everything_ required to get a frame on the screen during accelerated
>> gestures. Conversely it's probably a bit noisier than some
>> micro-benchmarks.
>
> I'd like to look at this too. I think we should also look at PLT and
> smoothness benchmarks. If we can make significant improvements on them,
> i.e. on what users directly observe, that'd be pretty exciting, right?
The smoothness benchmarks cap at 60Hz, so there is a possibility that a large improvement won't show up there.
However, the difference will still be directly observed by users in their battery life, and in better Blink performance, if our fast path scales below 16ms.