CC: cfe-dev
Thanks for sharing. We also noticed this internally, and I know that Bruno and Chris are working on some infrastructure and tooling to help closely track compile-time regressions.
We had this conversation internally about the tradeoff between compile time and runtime performance, and I had planned to bring up the topic on the list in the coming months; this looks like a good occasion to plant the seed. Apparently, in the past (years/a decade ago?) the project was very conservative about adding any optimizations that would impact compile time; however, there is no explicit policy (that I know of) addressing this tradeoff.
The closest I could find would be what Chandler wrote in: http://reviews.llvm.org/D12826 ; for instance for O2 he stated that "if an optimization increases compile time by 5% or increases code size by 5% for a particular benchmark, that benchmark should also be one which sees a 5% runtime improvement".
My hope is that with better tooling for tracking compile time in the future, we'll reach a state where we'll be able to consider "breaking" the compile-time regression test as important as breaking any test: i.e. the offending commit should be reverted unless it has been shown to significantly (hand wavy...) improve the runtime performance.
<troll>
With the current trend, the Polly developers don't have to worry about improving their compile time, we'll catch up with them ;)
</troll>
--
Mehdi
My two largest pet peeves in this area are:
1. We often use functions from ValueTracking (to get known bits, the number of sign bits, etc.) as though they're low cost. They're not really low cost. The problem is that they *should* be. These functions do bottom-up walks, and could cache their results. Instead, they do a limited walk and recompute everything each time. This is expensive; a significant amount of our InstCombine time goes to ValueTracking, and that shouldn't be the case. The more we add to InstCombine (and related passes), and the more we run InstCombine, the worse this gets. On the other hand, fixing this will help both compile time and code quality (a toy sketch of the caching idea follows below).
Furthermore, BasicAA has the same problem.
2. We have "cleanup" passes in the pipeline, such as those that run after loop unrolling and/or vectorization, that run regardless of whether the preceding pass actually did anything. We've been adding more of these, and they catch important use cases, but we need a better infrastructure for this (either with the new pass manager or otherwise).
Also, I'm very hopeful that as our new MemorySSA and GVN improvements materialize, we'll see large compile-time improvements from that work. We spend a huge amount of time in GVN computing memory-dependency information (this dwarfs the time spent by GVN doing actual value numbering by an order of magnitude or more).
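To make the caching idea from point 1 concrete, here is a minimal toy sketch (deliberately not LLVM's actual ValueTracking interface; the node type and names are invented for illustration) of a memoized bottom-up known-bits walk, where each value is computed once instead of being re-walked on every query:

  #include <cstdint>
  #include <unordered_map>

  // Toy expression DAG: a node is either a constant or a bitwise AND.
  struct Node {
    bool IsConst;
    uint64_t ConstVal;  // meaningful when IsConst
    const Node *Ops[2]; // meaningful when !IsConst (an AND node)
  };

  class KnownZeroCache {
    std::unordered_map<const Node *, uint64_t> Cache; // node -> known-zero mask

  public:
    // Memoized bottom-up walk: each node is visited once per analysis
    // scope, instead of being recomputed (to a depth limit) per query.
    uint64_t knownZero(const Node *N) {
      auto It = Cache.find(N);
      if (It != Cache.end())
        return It->second;
      // A constant's zero bits are fully known; for AND, a result bit is
      // zero if it is known zero in either operand.
      uint64_t KZ = N->IsConst
                        ? ~N->ConstVal
                        : (knownZero(N->Ops[0]) | knownZero(N->Ops[1]));
      Cache.emplace(N, KZ);
      return KZ;
    }

    // The hard part in a transforming pass like InstCombine: when a value
    // changes, cached facts for it and its transitive users go stale.
    void invalidate(const Node *N) { Cache.erase(N); /* ...and users */ }
  };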
-Hal
>
> --
> Mehdi
>
> > On Mar 8, 2016, at 8:13 AM, Rafael Espíndola via llvm-dev
> > <llvm...@lists.llvm.org> wrote:
> >
> > I have just benchmarked building trunk llvm and clang in Debug,
> > Release and LTO modes (see the attached script for the cmake lines).
> >
> > The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
> > cases I used the system libgcc and libstdc++.
> >
> > For release builds there is a monotonic increase in each version. From
> > 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
> > 5.3.2 takes 205 minutes.
> >
> > Debug and LTO show an improvement in 3.7, but have regressed again
> > in 3.8.
> >
> > Cheers,
> > Rafael
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
> Also, I'm very hopeful that as our new MemorySSA and GVN improvements materialize, we'll see large compile-time improvements from that work. We spend a huge amount of time in GVN computing memory-dependency information (this dwarfs the time spent by GVN doing actual value numbering by an order of magnitude or more).

I'm working on it ;)
I have noticed that LLVM doesn't seem to "like" large functions, as a general rule. Admittedly, my experience is similar with gcc, so I'm not sure it's something that can be easily fixed. And I'm probably sounding like a broken record, because I have said this before. My experience is that the time it takes to compile something grows more than linearly with the size of the function.

On Tue, Mar 8, 2016 at 8:13 AM, Rafael Espíndola
<llvm...@lists.llvm.org> wrote:
> I have just benchmarked building trunk llvm and clang in Debug,
> Release and LTO modes (see the attached script for the cmake lines).
>
> The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
> cases I used the system libgcc and libstdc++.
>
> For release builds there is a monotonic increase in each version. From
> 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
> 5.3.2 takes 205 minutes.
>
> Debug and LTO show an improvement in 3.7, but have regressed again in 3.8.
I'm curious how these times divide across Clang and various parts of
LLVM; rerunning with -ftime-report and summing the numbers across all
compiles could be interesting.
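For anyone wanting to reproduce that, the per-TU numbers would come from an invocation along these lines (illustrative; -ftime-report is the real flag, which prints per-pass timings for a single compile):

  clang++ -O2 -ftime-report -c SomeFile.cpp

The per-compile reports would then need to be summed over every TU in the build.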
On Tue, Mar 8, 2016 at 10:42 AM, Richard Smith via llvm-dev <llvm...@lists.llvm.org> wrote:
> On Tue, Mar 8, 2016 at 8:13 AM, Rafael Espíndola
> <llvm...@lists.llvm.org> wrote:
> I have just benchmarked building trunk llvm and clang in Debug,
> Release and LTO modes (see the attached script for the cmake lines).
>
> The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
> cases I used the system libgcc and libstdc++.
>
> For release builds there is a monotonic increase in each version. From
> 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
> 5.3.2 takes 205 minutes.
>
> Debug and LTO show an improvement in 3.7, but have regressed again in 3.8.
> I'm curious how these times divide across Clang and various parts of
> LLVM; rerunning with -ftime-report and summing the numbers across all
> compiles could be interesting.

Based on the results I posted upthread about the relative time spent in the backend for debug vs. release, we can estimate this. To summarize:
On Tue, Mar 8, 2016 at 10:55 AM, mats petersson via llvm-dev <llvm...@lists.llvm.org> wrote:
> I have noticed that LLVM doesn't seem to "like" large functions, as a general rule. Admittedly, my experience is similar with gcc, so I'm not sure it's something that can be easily fixed. And I'm probably sounding like a broken record, because I have said this before. My experience is that the time it takes to compile something grows more than linearly with the size of the function.

The number of BBs -- Kostya can point you to the compile-time bug that is exposed by asan.
Correct. The machine has no swap :-)
But some targets (clang) are much larger and I have the impression
that the last minute or so of the build is just finishing that one
link.
Cheers,
Rafael
On 9 Mar 2016 1:22 a.m., "Adam Nemet via cfe-dev" <cfe...@lists.llvm.org> wrote:
> A related issue is that if an analysis is not preserved by a pass, it gets invalidated *even if* the pass doesn’t end up modifying the code. Because of this for example we invalidate SCEV’s cache unnecessarily. The new pass manager should fix this.
+1


--Mehdi
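To make Adam's point concrete: with the new pass manager, a pass that can tell it changed nothing can return PreservedAnalyses::all(), so cached analyses such as SCEV's survive. A minimal sketch, assuming the new pass manager's function-pass interface (the pass and its doCleanup helper are hypothetical):

  #include "llvm/IR/PassManager.h"

  using namespace llvm;

  // Hypothetical cleanup pass, for illustration only.
  struct ExampleCleanupPass : PassInfoMixin<ExampleCleanupPass> {
    PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM) {
      bool Changed = doCleanup(F); // assumed transform; true if IR changed
      // If nothing was modified, keep *all* cached analyses (SCEV
      // included) instead of invalidating them wholesale.
      return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
    }
    static bool doCleanup(Function &) { return false; } // stub
  };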
The LTO time could be explained by second-order effects due to increased dcache/dTLB pressure from the increased memory footprint and poor locality.
On Wed, Mar 9, 2016 at 12:38 PM, Xinliang David Li <xinli...@gmail.com> wrote:
> The LTO time could be explained by second-order effects due to increased dcache/dTLB pressure from the increased memory footprint and poor locality.

Actually, thinking more about this, I was totally wrong. Mehdi said that we LTO ~56 binaries. If we naively assume that each binary is like clang and links in "everything", and that the LTO process takes CPU time equivalent to "-O3 for every TU", then we would expect that *for each binary* we would see +33% (a total increase of >1800% vs. Release). Clearly that is not happening, since the actual overhead is only 50%-100%, so we need a more refined explanation.

There are a couple of factors that I can think of:

a) there are 56 binaries being LTO'd (this will tend to increase our estimate)
b) not all 56 binaries are the size of clang (this will tend to decrease our estimate)
c) per-TU processing only does mid-level optimizations and no codegen (this will tend to decrease our estimate)
d) IR seen during LTO has already been "cleaned up" and has less overall size/fewer optimizations that will apply during the LTO process (this will tend to decrease our estimate)
e) comdat folding in the linker means that we only codegen each comdat function once (this will tend to decrease our estimate)

Starting from a (normalized) release build with

  releaseBackend = .33
  releaseFrontend = .67
  release = releaseBackend + releaseFrontend = 1

let us try to obtain

  LTO = (some expression involving releaseFrontend and releaseBackend) = 1.5-2

For starters, let us apply a), with the naive assumption that for each of the numBinaries = 62 binaries we add the cost of releaseBackend (I just checked, and 62 is the exact number for LLVM+Clang+LLD+clang-tools-extra, ignoring symlinks). This gives

  LTO = release + 62 * releaseBackend = 21.46,

which is way too high.
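The same back-of-the-envelope model in compilable form (a sketch: the b)-e) discounts aren't quantified above, so the 0.03-0.05 factor below is a guess chosen to land in the observed 1.5-2 window, not a measurement):

  #include <cstdio>

  int main() {
    const double ReleaseBackend = 0.33;  // backend share of a Release build
    const double ReleaseFrontend = 0.67; // frontend share
    const double Release = ReleaseBackend + ReleaseFrontend; // normalized: 1.0
    const int NumBinaries = 62;          // LLVM+Clang+LLD+clang-tools-extra

    // Naive model a): every binary pays the full backend cost again at link time.
    double NaiveLTO = Release + NumBinaries * ReleaseBackend;
    std::printf("naive LTO estimate: %.2f (observed: 1.5-2.0)\n", NaiveLTO);

    // Factors b)-e) shrink the per-binary backend cost; a combined discount
    // of roughly 0.03-0.05 reproduces the observed range.
    std::printf("discount 0.03: %.2f\n", Release + NumBinaries * ReleaseBackend * 0.03);
    std::printf("discount 0.05: %.2f\n", Release + NumBinaries * ReleaseBackend * 0.05);
    return 0;
  }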
Back in the pre-Clang LLVM 1.x dark ages you could, if you
pressed the right buttons, run LLVM as a very fast portable
codegen. MB/s was a reasonable measure as the speed was (or
could be made to be) fairly independent of the input structure.
Since ~2006, as LLVM has shifted from "awesome research
plaything" to "compiler people depend on", there has been a
focus on ensuring that typical software compiles quickly and
well. Many good things have followed as a result, but you are
certainly correct that LLVM doesn't handle large input
particularly well. Having said that, some projects (the Gambit
Scheme->C and Verilator Verilog->C compilers come to mind)
routinely see runtimes 10~100x that of GCC in typical use. So
perhaps we are thinking of different things if you're seeing
similar issues with GCC.
I suspect that despite the passage of time the problem remains
solvable - there's probably *more* work to be done now, but I
don't think there are any massively *difficult* problems to be
solved. Properly quantifying/tracking the problem would be a
good first step.
Best,
Duraid
Bug reports (with pre-processed source files preferably) are always welcome.
Collecting the test cases in a "compile time test suite" is what should follow naturally.
Best,
--
Mehdi
> There is a possibility that r259673 could play a role here.
>
> For the buildSchedGraph() method, there is the -dag-maps-huge-region that
> has the default value of 1000. When I committed the patch, I was expecting
> people to lower this value as needed and also suggested this, but this has
> not happened. 1000 is very high, basically "unlimited".
>
> It would be interesting to see what results you get with e.g. -mllvm
> -dag-maps-huge-region=50. Of course, since this is a trade-off between
> compile time and scheduler freedom, some care should be taken before
> lowering this in trunk.
Indeed we hit this internally, filed a PR:
https://llvm.org/bugs/show_bug.cgi?id=26940
As a general comment on this thread, and as mentioned by Mehdi, we care
a lot about compile time, and we're looking forward to contributing more
in this area in the following months; by collecting compile-time
testcases into a testsuite and publicly tracking results on those, we
should be able to start an RFC on a tradeoff policy.
--
Bruno Cardoso Lopes
http://www.brunocardoso.cc
http://hubicka.blogspot.nl/2016/03/building-libreoffice-with-gcc-6-and-lto.html#more
Compared to llvm 3.5.0, the builds with llvm 3.9.0 svn were 24% slower.
On Tue, Mar 8, 2016 at 11:13 AM, Rafael Espíndola
<llvm...@lists.llvm.org> wrote:
> I have just benchmarked building trunk llvm and clang in Debug,
> Release and LTO modes (see the attached script for the cmake lines).
>
> The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
> cases I used the system libgcc and libstdc++.
>
> For release builds there is a monotonic increase in each version. From
> 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
> 5.3.2 takes 205 minutes.
>
> Debug and LTO show an improvement in 3.7, but have regressed again in 3.8.
>
> Cheers,
> Rafael
> There is a possibility that r259673 could play a role here.
>
> For the buildSchedGraph() method, there is the -dag-maps-huge-region that
> has the default value of 1000. When I committed the patch, I was expecting
> people to lower this value as needed and also suggested this, but this has
> not happened. 1000 is very high, basically "unlimited".
>
> It would be interesting to see what results you get with e.g. -mllvm
> -dag-maps-huge-region=50. Of course, since this is a trade-off between
> compile time and scheduler freedom, some care should be taken before
> lowering this in trunk.
Indeed we hit this internally, filed a PR:
https://llvm.org/bugs/show_bug.cgi?id=26940
+1. Reverting is easy when a commit is fresh, but gets rapidly more
difficult as other changes (related or not) come after it, whereas
re-applying a commit later is usually straightforward.
Keeping the top of tree compiler in good shape improves everyone's
lives.
+1
> On Mon, Mar 14, 2016 at 12:14 PM Bruno Cardoso Lopes via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>>
>> > There is a possibility that r259673 could play a role here.
>> >
>> > For the buildSchedGraph() method, there is the -dag-maps-huge-region
>> > that
>> > has the default value of 1000. When I committed the patch, I was
>> > expecting
>> > people to lower this value as needed and also suggested this, but this
>> > has
>> > not happened. 1000 is very high, basically "unlimited".
>> >
>> > It would be interesting to see what results you get with e.g. -mllvm
>> > -dag-maps-huge-region=50. Of course, since this is a trade-off between
>> > compile time and scheduler freedom, some care should be taken before
>> > lowering this in trunk.
>>
>> Indeed we hit this internally, filed a PR:
>> https://llvm.org/bugs/show_bug.cgi?id=26940
>
>
> I think we should have rolled back r259673 as soon as the test case was
> available.
I agree, but since we didn't have a policy about it, I was kind of
unsure what to do about it. Glad you've begun this discussion :-)
> Thoughts?
Ideally it would be good to have more compile-time-sensitive
benchmarks in the test-suite to detect those. We're working on
collecting what we have internally and upstreaming it to help track
the results in a public way.
Me too.
I also agree that reverting fresh and reapplying is *much* easier than
trying to revert late.
But I'd like to avoid dubious metrics.
> The closest I could find would be what Chandler wrote in:
> http://reviews.llvm.org/D12826 ; for instance for O2 he stated that "if an
> optimization increases compile time by 5% or increases code size by 5% for a
> particular benchmark, that benchmark should also be one which sees a 5%
> runtime improvement".
I think this is a bit limited and can lead to witch hunts, especially
wrt performance measurements.
Chandler's title is perfect though... Large can be vague, but
"super-linear" is not. We used to have the concept that any large
super-linear (quadratic+) compile time introductions had to be in O3
or, for really bad cases, behind additional flags. I think we should
keep that mindset.
> My hope is that with better tooling for tracking compile time in the future,
> we'll reach a state where we'll be able to consider "breaking" the
> compile-time regression test as important as breaking any test: i.e. the
> offending commit should be reverted unless it has been shown to
> significantly (hand wavy...) improve the runtime performance.
In order to have any kind of threshold, we'd have to monitor with some
accuracy the performance of both compiler and compiled code for the
main platforms. We do that to a certain extent with the test-suite bots,
but that's very far from ideal.
So, I'd recommend we steer away from any kind of percentage or ratio
and keep at least the quadratic changes and beyond on special flags
(n log n is ok for most cases).
> Since you raise the discussion now, I take the opportunity to push on the
> "more aggressive" side: I think the policy should be a balance between the
> improvement the commit brings compared to the compile time slow down.
This is a fallacy.
Compile time often regresses across all targets, while execution
improvements are focused on specific targets and can have negative
effects on those that were not benchmarked. Overall, though,
compile-time regressions are diluted by the improvements, but not on
a commit-per-commit basis. That's what I meant by witch hunt.
I think we should keep an eye on those changes, ask for numbers in
code review and even maybe do some benchmarking on our own before
accepting it. Also, we should not commit code that we know hurts
performance that badly, even if we believe people will replace it in
the future. It always takes too long. I myself have done that last
year, and I learnt my lesson.
Metrics are often more dangerous than helpful, as they tend to be used
as a substitute for thinking.
My tuppence.
--renato
> On Mar 31, 2016, at 2:46 PM, Renato Golin <renato...@linaro.org> wrote:
>
> On 31 March 2016 at 21:41, Mehdi Amini via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>> TLDR: I totally support considering compile time regression as bug.
>
> Me too.
>
> I also agree that reverting fresh and reapplying is *much* easier than
> trying to revert late.
>
> But I'd like to avoid dubious metrics.
I'm not sure how "this commit regresses compile time by 2%" is a dubious metric.
The metric is not dubious IMO, it is what it is: a measurement.
You just have to cast a good process around it to exploit this measurement in a useful way for the project.
>> The closest I could find would be what Chandler wrote in:
>> http://reviews.llvm.org/D12826 ; for instance for O2 he stated that "if an
>> optimization increases compile time by 5% or increases code size by 5% for a
>> particular benchmark, that benchmark should also be one which sees a 5%
>> runtime improvement".
>
> I think this is a bit limited and can lead to witch hunts, especially
> wrt performance measurements.
>
> Chandler's title is perfect though... Large can be vague, but
> "super-linear" is not. We used to have the concept that any large
> super-linear (quadratic+) compile time introductions had to be in O3
> or, for really bad cases, behind additional flags. I think we should
> keep that mindset.
>
>
>> My hope is that with better tooling for tracking compile time in the future,
>> we'll reach a state where we'll be able to consider "breaking" the
>> compile-time regression test as important as breaking any test: i.e. the
>> offending commit should be reverted unless it has been shown to
>> significantly (hand wavy...) improve the runtime performance.
>
> In order to have any kind of threshold, we'd have to monitor with some
> accuracy the performance of both compiler and compiled code for the
> main platforms. We do that to a certain extent with the test-suite bots,
> but that's very far from ideal.
I agree. Did you read the part where I mentioned that we're working on the tooling, and that I was waiting for it to be done before starting this thread?
>
> So, I'd recommend we steer away from any kind of percentage or ratio
> and keep at least the quadratic changes and beyond on special flags
> (n log n is ok for most cases).
How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
Because there *is* a problem here, and I'd really like someone to come up with a solution for that.
>> Since you raise the discussion now, I take the opportunity to push on the
>> "more aggressive" side: I think the policy should be a balance between the
>> improvement the commit brings compared to the compile time slow down.
>
> This is a fallacy.
Not sure why, or what you mean? The fact that an optimization improves only some targets does not invalidate the point.
>
> Compile time often regresses across all targets, while execution
> improvements are focused on specific targets and can have negative
> effects on those that were not benchmarked.
Yeah, as usual in LLVM: if you care about something on your platform, set up a bot and track trunk closely; otherwise you're less of a priority.
> Overall, though,
> compile-time regressions are diluted by the improvements, but not on
> a commit-per-commit basis. That's what I meant by witch hunt.
There is no "witch hunt", at least that's not my objective.
I think everyone is pretty enthusiastic about every new perf improvement (I am), but, just like without bots (and policy) in general, we would break things all the time unintentionally.
I'm talking about chasing and tracking every single commit where a developer would regress compile time *without even being aware*.
I'd personally love to have a bot or someone emailing me about any compile-time regression I introduce.
>
> I think we should keep an eye on those changes, ask for numbers in
> code review and even maybe do some benchmarking on our own before
> accepting it. Also, we should not commit code that we know hurts
> performance that badly, even if we believe people will replace it in
> the future. It always takes too long. I myself have done that last
> year, and I learnt my lesson.
Agree.
> Metrics are often more dangerous than helpful, as they tend to be used
> as a substitute for thinking.
I can't relate this sentence to anything concrete at stake here.
I think this list is full of people that are very good at thinking and won't substitute it :)
Best,
--
Mehdi
Ignoring for a moment the slippery slope we recently had on compile
time performance, 2% is an acceptable regression for a change that
improves execution time by around 2% on most targets, more so than if
only one target were affected.
Different people see performance with different eyes, and companies
have different expectations about it, too, so those percentages can
have different impact on different people for the same change.
I guess my point is that no threshold will please everybody, and
people are more likely to "abuse" the metric if the results are far
from what they see as acceptable, even if everyone else is ok with it.
My point about substituting metrics for thinking is not aimed at lazy
programmers (of which there are very few here), but at how far the
encoded threshold falls from your own. Bias is a *very* hard thing
to remove, even for extremely smart and experienced people.
So, while "which hunt" is a very strong term for the mild bias we'll
all have personally, we have seen recently how some discussions end up
in rage when a group of people strongly disagree with the rest,
self-reinforcing their bias to levels that they would never reach
alone. In those cases, the term stops being strong, and may be
fitting... Makes sense?
> I agree. Did you read the part where I mentioned that we're working on the tooling, and that I was waiting for it to be done before starting this thread?
I did, and should have mentioned it in my reply. I think you guys (and
ARM) are doing an amazing job at quality measurement. I wasn't trying
to reduce your efforts, but IMHO, the relationship between effort and
bias removal is not linear, i.e. you'll have to improve quality
exponentially to remove bias linearly. So, the threshold at which we're
prepared to stop might not remove all the problems, and metrics could
still play a negative role.
I think I'm just asking for us to be aware of the fact, not to stop
any attempt to introduce metrics. If they remain relevant to the final
objective, and we're allowed to break them with enough arguments, it
should work fine.
> How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
> Because there *is* a problem here, and I'd really like someone to come up with a solution for that.
Indeed, we're now slower than GCC, and that's a place that looked
impossible two years ago. But I doubt reverting a few patches will
help. For this problem, we'll need a task force to hunt for all the
dragons, and surgically alter them, since at this time, all relevant
patches are too far in the past.
For the future, emailing on compile time regressions (as well as run
time) is a good thing to have and I vouch for it. But I don't want
that to become a tool that will increase stress in the community.
> Not sure why, or what you mean? The fact that an optimization improves only some targets does not invalidate the point.
Sorry, I seem to have misinterpreted your point.
The fallacy is about the measurement of "benefit" versus the
regression "effect". The former is very hard to measure, while the
latter is very precise. Comparisons with radically different standard
deviations can easily fall into "undefined behaviour" land, and be
the seed for rage threads.
> I'm talking about chasing and tracking every single commit where a developer would regress compile time *without even being aware*.
That's a goal worth pursuing, regardless of the patch's benefit, I
agree wholeheartedly. And for that, I'm very grateful of the work you
guys are doing.
cheers,
--renato
Sure, I don't think I have suggested anything else; if I did, it is because I didn't express myself correctly :)
I'm excited about runtime performance improvements, and I'm willing to spend compile-time budget to achieve them.
I'd even say that my view is that tracking compile time on other things will help preserve more compile-time budget for the kind of commit you mention above.
>
> Different people see performance with different eyes, and companies
> have different expectations about it, too, so those percentages can
> have different impact on different people for the same change.
>
> I guess my point is that no threshold
I don't suggest a threshold that says "a commit can't regress x%" and that would be set in stone.
What I have in mind is more: if a commit regresses the build above a threshold (1% on average, for instance), then we should be able to have a discussion about this commit, to evaluate whether it belongs in O2 or should go to O3, for instance.
Also, if the commit is about refactoring or introducing a new feature, the regression might not be intended by the author at all!
> will please everybody, and
> people are more likely to "abuse" the metric if the results are far
> from what they see as acceptable, even if everyone else is ok with it.
The metric is "the commit regressed 1%". The natural thing that follows is what happens usually in the community: we look at the data (what is the performance improvement), and decide on a case by case if it is fine as is or not.
I feel like you're talking about the "metric" as an automatic threshold that triggers an automatic revert and blocks things; this is not the goal, and that is not what I mean when I use the word metric (but hey, I'm not a native speaker!).
As I said before, I'm mostly chasing *untracked* and *unintentional* compile-time regressions.
> My point about substituting metrics for thinking is not aimed at lazy
> programmers (of which there are very few here), but at how far the
> encoded threshold falls from your own. Bias is a *very* hard thing
> to remove, even for extremely smart and experienced people.
>
> So, while "which hunt" is a very strong term for the mild bias we'll
> all have personally, we have seen recently how some discussions end up
> in rage when a group of people strongly disagree with the rest,
> self-reinforcing their bias to levels that they would never reach
> alone. In those cases, the term stops being strong, and may be
> fitting... Makes sense?
>
>
>> I agree. Did you read the part where I mentioned that we're working on the tooling, and that I was waiting for it to be done before starting this thread?
>
> I did, and should have mentioned on my reply. I think you guys (and
> ARM) are doing an amazing job at quality measurement. I wasn't trying
> to reduce your efforts, but IMHO, the relationship between effort and
> bias removal is not linear, i.e. you'll have to improve quality
> exponentially to remove bias linearly. So, the threshold at which we're
> prepared to stop might not remove all the problems, and metrics could
> still play a negative role.
I'm not sure I really totally understand everything you mean.
>
> I think I'm just asking for us to be aware of the fact, not to stop
> any attempt to introduce metrics. If they remain relevant to the final
> objective, and we're allowed to break them with enough arguments, it
> should work fine.
>
>
>> How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
>> Because there *is* a problem here, and I'd really like someone to come up with a solution for that.
>
> Indeed, we're now slower than GCC, and that's a place that looked
> impossible two years ago. But I doubt reverting a few patches will
> help. For this problem, we'll need a task force to hunt for all the
> dragons, and surgically alter them, since at this time, all relevant
> patches are too far in the past.
Obviously, my immediate concern is "what tools and process to make sure it does not get worse", and starting with "community awareness" is not bad. Improving and recovering from the current state is valuable, but orthogonal to what I'm trying to achieve.
Another thing is the complaints from multiple people who are trying to JIT with LLVM: we know LLVM is not designed in a way that helps with latency and memory consumption, but getting worse is not nice.
> For the future, emailing on compile time regressions (as well as run
> time) is a good thing to have and I vouch for it. But I don't want
> that to become a tool that will increase stress in the community.
Sure, I'm glad you're stepping up to make sure it does not happen. So please continue to speak up in the future as we try to roll things out.
I hope we're on the same track past the initial misunderstanding we had with each other?
What I'd really like is to have a consensus on the goal to pursue (knowing I'm not alone in caring about compile time is a great start!), so that the tooling can be set up to serve this goal the best way possible (and decrease stress instead of increasing it).
Best,
--
Mehdi
Thresholds as a trigger for discussion are exactly what I was looking for.
But Chandler goes further (or so I gathered), that some commits are
really bad and could be candidates for reversion before discussion.
Those more extreme measures may be justified if, for example, the
commit is quadratic or worse in a core part of the compiler, or doubles
the testing time, etc.
I agree with both proposals, but we have to make sure what goes where,
to avoid (unintentionally) heavy-handing other people's work.
> The metric is "the commit regressed 1%". The natural thing that follows is what happens usually in the community: we look at the data (what is the performance improvement), and decide on a case by case if it is fine as is or not.
> I feel like you're talking about the "metric" as an automatic threshold that triggers an automatic revert and blocks things; this is not the goal, and that is not what I mean when I use the word metric (but hey, I'm not a native speaker!).
I wasn't talking about automatic reversal, but about pre-discussion
reversal, as I mention above.
> As I said before, I'm mostly chasing *untracked* and *unintentional* compile-time regressions.
That's is obviously good. :)
> I'm not sure I really totally understand everything you mean.
It's about the threshold between what promotes discussion and what
promotes pre-discussion reverts. This is a hard line to draw with so
many people (and companies) involved.
> Sure, I'm glad you're stepping up to make sure it does not happen. So please continue to speak up in the future as we try to roll things out.
> I hope we're on the same track past the initial misunderstanding we had with each other?
Yes. :)