[llvm-dev] Saving Compile Time in InstCombine


Mikhail Zolotukhin via llvm-dev

Mar 17, 2017, 3:00:23 PM
to llvm-dev
Hi,

One of the most time-consuming passes in LLVM middle-end is InstCombine (see e.g. [1]). It is a very powerful pass capable of doing all the crazy stuff, and new patterns are being constantly introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. It sounds ok for O3, where we try to squeeze as much performance as possible, but it is too excessive for other opt-levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.
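For illustration, gating a fold under that parameter looks roughly like the following. This is a schematic sketch rather than the actual patch; apart from the foldICmpUsingKnownBits call, the surrounding code is paraphrased:

// Schematic sketch only -- not the actual upstream code. A pattern that is
// rarely hit or costly to check is only tried when the pass was created
// with ExpensiveCombines set (i.e. at -O3 in the default pipelines).
Instruction *InstCombiner::visitICmpInst(ICmpInst &I) {
  // ... cheap, frequently-hit matchers run unconditionally ...

  if (ExpensiveCombines)
    if (Instruction *Res = foldICmpUsingKnownBits(I))
      return Res;

  return nullptr;
}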

Trying to find out which patterns are important and which are rare, I profiled clang using CTMark and got the following coverage report:
InstCombine_covreport.html
0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch

Vedant Kumar via llvm-dev

Mar 17, 2017, 5:02:34 PM
to Mikhail Zolotukhin, llvm-dev
> <InstCombine_covreport.html>
> (beware, the file is ~6MB).
>
> Guided by this profile I moved some patterns under the "if (ExpensiveCombines)" check, which expectedly happened to be neutral for runtime performance, but improved compile-time. The testing results are below (measured for Os).

It'd be nice to double-check that any runtime performance loss at -O2 is negligible. But this sounds like a great idea!

vedant

> Performance Improvements - Compile Time Δ Previous Current σ
> CTMark/sqlite3/sqlite3 -1.55% 6.8155 6.7102 0.0081
> CTMark/mafft/pairlocalalign -1.05% 8.0407 7.9559 0.0193
> CTMark/ClamAV/clamscan -1.02% 11.3893 11.2734 0.0081
> CTMark/lencod/lencod -1.01% 12.8763 12.7461 0.0244
> CTMark/SPASS/SPASS -1.01% 12.5048 12.3791 0.0340
>
> Performance Improvements - Compile Time Δ Previous Current σ
> External/SPEC/CINT2006/403.gcc/403.gcc -1.64% 54.0801 53.1930 -
> External/SPEC/CINT2006/400.perlbench/400.perlbench -1.25% 19.1481 18.9091 -
> External/SPEC/CINT2006/445.gobmk/445.gobmk -1.01% 15.2819 15.1274 -
>
>
> Do such changes make sense? The patch doesn't change O3, but it does change Os and potentially can change performance there (though I didn't see any changes in my tests).
>
> The patch is attached for reference; if we decide to go for it, I'll upload it to phab:
>
> <0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch>
>
>
> Thanks,
> Michael
>
> [1]: http://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html
>

Mikhail Zolotukhin via llvm-dev

Mar 17, 2017, 5:22:50 PM
to Vedant Kumar, llvm-dev
On Mar 17, 2017, at 2:02 PM, Vedant Kumar <v...@apple.com> wrote:


On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi,

One of the most time-consuming passes in LLVM middle-end is InstCombine (see e.g. [1]). It is a very powerful pass capable of doing all the crazy stuff, and new patterns are being constantly introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. It sounds ok for O3, where we try to squeeze as much performance as possible, but it is too excessive for other opt-levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.

Trying to find out, which patterns are important, and which are rare, I profiled clang using CTMark and got the following coverage report:
<InstCombine_covreport.html>
(beware, the file is ~6MB).

Guided by this profile I moved some patterns under the "if (ExpensiveCombines)" check, which expectedly happened to be neutral for runtime performance, but improved compile-time. The testing results are below (measured for Os).

It'd be nice to double-check that any runtime performance loss at -O2 is negligible. But this sounds like a great idea!
I forgot to mention that I ran SPEC2006/INT with "-Os" on ARM64 and didn't see any changes in runtime performance. I can run O2 testing as well over the weekend.

Michael

Mehdi Amini via llvm-dev

Mar 17, 2017, 5:30:06 PM
to Mikhail Zolotukhin, llvm-dev
On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi,

One of the most time-consuming passes in LLVM middle-end is InstCombine (see e.g. [1]). It is a very powerful pass capable of doing all the crazy stuff, and new patterns are being constantly introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. It sounds ok for O3, where we try to squeeze as much performance as possible, but it is too excessive for other opt-levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.

Yes, “ExpensiveCombines” was added recently (4.0? 3.9?), but I believe it has always been intended to be extended in the way you’re doing it. So I support this effort :)

CC: David for the general direction on InstCombine though.


— 
Mehdi




Trying to find out, which patterns are important, and which are rare, I profiled clang using CTMark and got the following coverage report:
<InstCombine_covreport.html>
(beware, the file is ~6MB).

Guided by this profile I moved some patterns under the "if (ExpensiveCombines)" check, which expectedly happened to be neutral for runtime performance, but improved compile-time. The testing results are below (measured for Os).

Matthias Braun via llvm-dev

Mar 17, 2017, 5:36:10 PM
to Mehdi Amini, llvm-dev
In general it is great that we investigate these things! We have been liberally adding pass invocations and patterns for years without checking the compile-time consequences.

However, intuitively it feels wrong to disable some patterns completely (there will always be that one program that gets so much better when you have a certain pattern).
- Do you have an idea what would happen if we only disabled them in 5 of the 6 invocations?
- Or, alternatively, what happens if we just don't put as many InstCombine instances into the pass pipeline at -Os?
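For reference, that kind of experiment could be sketched against the legacy pass-manager APIs of that era roughly as follows. The factory name, its ExpensiveCombines parameter, and the header location are written from memory and should be double-checked against the tree; the pipeline shown is illustrative, not the real one:

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/InstCombine/InstCombine.h" // assumed location of the factory

// Illustrative only: run the full pattern set once, and make the later
// "clean-up" invocations cheaper (or drop some of them) at -Os.
static void addCombinePasses(llvm::legacy::PassManagerBase &PM, bool OptForSize) {
  PM.add(llvm::createInstructionCombiningPass(/*ExpensiveCombines=*/true));
  // ... other passes ...
  // Subsequent clean-up runs: either use the reduced pattern set, or simply
  // schedule fewer of them when optimizing for size.
  if (!OptForSize)
    PM.add(llvm::createInstructionCombiningPass(/*ExpensiveCombines=*/false));
}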

- Matthias

Hal Finkel via llvm-dev

Mar 17, 2017, 8:49:32 PM
to Mehdi Amini, Mikhail Zolotukhin, llvm-dev


On 03/17/2017 04:30 PM, Mehdi Amini via llvm-dev wrote:

On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi,

One of the most time-consuming passes in LLVM middle-end is InstCombine (see e.g. [1]). It is a very powerful pass capable of doing all the crazy stuff, and new patterns are being constantly introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. It sounds ok for O3, where we try to squeeze as much performance as possible, but it is too excessive for other opt-levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.

Yes, the “ExpensiveCombines” has been added recently (4.0? 3.9?) but I believe has always been intended to be extended the way you’re doing it. So I support this effort :)

+1

Also, did your profiling reveal why the other combines are expensive? Among other things, I'm curious if the expensive ones tend to spend a lot of time in ValueTracking (getting known bits and similar)?

 -Hal
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

David Majnemer via llvm-dev

Mar 17, 2017, 9:13:14 PM
to Hal Finkel, llvm-dev
Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange and many would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.

I'd be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn't change but is not intelligently reused. I imagine that the performance win might be quite comparable. Such a patch would have the benefit of keeping the set of available transforms constant throughout the pipeline while bringing execution time down; I wouldn't be at all surprised if caching the ValueTracking functions resulted in a bigger time savings.
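As a rough illustration of what such a cache might look like, here is a minimal sketch; it is not an actual LLVM patch, and the result type and invalidation policy are placeholders:

// Minimal sketch, not an actual LLVM patch: a per-pass cache of known-bits
// results keyed on the Value they were computed for. Entries must be
// invalidated whenever the value (or something it depends on) is changed,
// e.g. when a combine strips nsw/nuw flags.
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Value.h"

struct CachedKnownBits {
  llvm::APInt Zero, One; // bits known to be zero / known to be one
};

class KnownBitsCache {
  llvm::DenseMap<const llvm::Value *, CachedKnownBits> Map;

public:
  const CachedKnownBits *lookup(const llvm::Value *V) const {
    auto It = Map.find(V);
    return It == Map.end() ? nullptr : &It->second;
  }
  void remember(const llvm::Value *V, CachedKnownBits KB) { Map[V] = std::move(KB); }
  void invalidate(const llvm::Value *V) { Map.erase(V); }
};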

Hal Finkel via llvm-dev

Mar 18, 2017, 10:43:31 AM
to David Majnemer, llvm-dev


On 03/17/2017 08:12 PM, David Majnemer wrote:
Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange and many would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.

I agree with this up to a point. If we have these kinds of canonicalization dependencies that depend on ValueTracking's depth, this seems very fragile. Even if we introduce caching, thus often making the effective depth infinite, it will still be finite in the face of updates, so we still need to be careful (plus, if we're worried about being able to understand the canonical form, then depending on known-bits analysis can make defining this form subtle).


I'd be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn't change but is not intelligently reused. I imagine that the performance win might be quite comparable. Such a patch would have the benefit of keeping the set of available transforms constant throughout the pipeline while bringing execution time down; I wouldn't be at all surprised if caching the ValueTracking functions resulted in a bigger time savings.

I'd started working on this a few months ago; I didn't finish the patch (mostly because I discovered that there's also a need to invalidate the cache whenever we perform a transformation that drops nsw/nuw flags, and I never got to that part), but I'm happy to provide my work-in-progress to anyone interested. cc'ing Davide, who had also expressed an interest in this.

 -Hal

Sanjay Patel via llvm-dev

Mar 20, 2017, 9:35:23 AM
to Hal Finkel, Michael Zolotukhin, Davide Italiano, llvm-dev
The goal is to improve compile-time, so I think we should measure what's taking the most time and work from there. I'm not sure how the code coverage report was made, but that seems like it is based on execution frequency rather than time?

Here are some Instruments (macOS) screenshots of a time profile of a release build of opt running on a Haswell machine:
$ opt -O2 row_common.bc -S -o /dev/null

The input is derived from the Chrome source file attached to:
https://bugs.llvm.org//show_bug.cgi?id=32037

It's an implicit assumption in this experiment that 'opt' time is important to overall compile-time, but as you can see in the bug report, that's not really true for this case. Also, this profile may not be representative of other programs or important vs. other programs, but to summarize:

1. 123 ms out of 326 ms (38%) of the -O2 opt time is going into InstCombine. So yes, InstCombine is a big chunk of opt time.
2. Focusing on computeKnownBits() within the InstCombine subtree: > 48% of the time consumed by InstCombine is spent in ValueTracking.

We know that most InstCombines don't use ValueTracking, so I think transforms that use computeKnownBits are the ones that should be grouped under "ExpensiveCombines". Maybe there's another category for "RareCombines", or we split InstCombine into multiple passes to further reduce compile-time?

In any case, the profile lines up with David's comment about simplifyDemanded / computeKnownBits and previous anecdotes in the "llvm is getting slower" threads. Simple matchers are relatively short strands of compare and branch spaghetti. It's the few transforms that use ValueTracking that have a high relative cost. If we attack that, the profile says we could recover much more than 1% if -O2 middle-end time is significant for most compiles.
optO2profile.png
computeKnownBIts.png

Mehdi Amini via llvm-dev

Mar 20, 2017, 11:37:47 AM
to David Majnemer, llvm-dev
On Mar 17, 2017, at 6:12 PM, David Majnemer <david.m...@gmail.com> wrote:

Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange and many would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

OK, but is it true of *all* the combines today?


InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.

If a canonicalization is too costly to achieve, maybe it is not a reasonable one?
It is also not clear to me that canonicalizations that are using complex analyses (ValueTracking / computeKnownBits) are really making it easy to "understand what is canonical” anyway. This is my impression in general as the scope of what is needed to achieve the transformation gets larger: the more context needed the less it looks like a “canonicalization” to me.
WDYT?

— 
Mehdi

Amara Emerson via llvm-dev

Mar 20, 2017, 12:37:02 PM
to Mehdi Amini, llvm-dev
Agree, it would be nice if instcombine was more modular, such that we could choose whether or not we wanted a canonicalisation run or wanted optimizations. And of the optimisations, it would also be good to be able to select which categories of combines to run. Splitting out canonicalisation would also make it clearer to all what the standard forms are for certain constructs.

Amara

Mikhail Zolotukhin via llvm-dev

Mar 20, 2017, 9:45:48 PM
to Amara Emerson, llvm-dev
Thanks for the interest, everyone! The intention of this patch was exactly to ignite the discussion, I expected that it should be changed/reworked before going in (if ever).

The profile I used was not a time profile; it was a frequency-based report, as Sanjay noticed. The idea was to find patterns that are (almost) unused, and those were the patterns I later moved under 'ExpensiveCombines' - I admit it was kind of arbitrary on my side: I kept the ones which looked pretty cheap and moved the others.

Let me try to keep only the ValueTracking related patterns and remeasure the numbers for now, and then we can decide if we want to do anything about the others. I think it still would be nice to avoid checking for them - maybe we can keep one 'full' InstCombine run for canonicalization, and in subsequent runs only check frequent patterns.

Thanks,
Michael

Gerolf Hoflehner via llvm-dev

Mar 21, 2017, 12:52:05 AM
to David Majnemer, Hal Finkel, llvm-dev
On Mar 17, 2017, at 6:12 PM, David Majnemer via llvm-dev <llvm...@lists.llvm.org> wrote:

Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange and many would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

As we get further along with compile-time improvements one question we need to ask ourselves more frequently is about the effectiveness of optimizations/passes. For example -  in this case - how can we make an educated assessment that running the combiner N times is a good cost/benefit investment of compute resources? The questions below are meant to figure out what technologies/instrumentations/etc could help towards a more data-driven decision process when it comes to the effectiveness of optimizations. Instcombiner might just be an inspirational use case to see what is possible in that direction.

The combiner is invoked in full multiple times. But is it really necessary to run all of it for that purpose? After instcombine is run once, is there a mapping from transformation -> combines? I suspect most transformations could invoke a subset of combines to re-canonicalize. Or, if there were a (cheap) verifier for canonical IR, it could invoke a specific canonicalization routine. Instrumenting the instcombiner and checking which patterns actually kick in (for different invocations) might give insight into how the combiner could be structured so that only a subset of patterns needs to be checked.
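One low-cost way to get that kind of per-pattern data is LLVM's statistics machinery, which InstCombine already uses for coarse counters. A sketch; the specific counter below is hypothetical, not an existing statistic:

#include "llvm/ADT/Statistic.h"

#define DEBUG_TYPE "instcombine"

// Hypothetical per-pattern counter; InstCombine's existing statistics are
// much coarser (e.g. total instructions combined).
STATISTIC(NumICmpKnownBitsFolds, "Number of icmps folded via known bits");

// Inside the transform of interest one would bump the counter:
//   if (Instruction *Res = foldICmpUsingKnownBits(I)) {
//     ++NumICmpKnownBitsFolds;
//     return Res;
//   }
//
// Running opt with -stats (in a build with statistics enabled) then prints
// all counters at the end of the run, per invocation of the tool.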


InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Can you elaborate on this “duel endlessly” with specific examples? This is out of curiosity. There must be verifiers that check that this cannot happen. Or an implementation strategy that guarantees that. Global isel will run into the same/similar question when it gets far enough to replace SD.


Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.


I'd be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn't change but is not intelligently reused. I imagine that the performance win might be quite comparable.

Can you back this up with measurements? Caching schemes are tricky. Is there a way to evaluate when the results of ComputeKnownBits etc. are actually effective, meaning the result is used and gives faster instructions? E.g. it might well be that only the first instance of inst_combine benefits from computing the bits.

Davide Italiano via llvm-dev

Mar 21, 2017, 2:13:06 PM
to Mikhail Zolotukhin, llvm-dev
> (beware, the file is ~6MB).
>
> Guided by this profile I moved some patterns under the "if (ExpensiveCombines)" check, which expectedly happened to be neutral for runtime performance, but improved compile-time. The testing results are below (measured for Os).
>

As somebody who brought up this problem at least once in the mailing
lists, I'm in agreement with David Majnemer here.
I think we should consider a caching strategy before going this route.
FWIW, I'm not a big fan of `ExpensiveCombines` at all; I can see the
reason why it was introduced, but in my experience the "expensive"
bits of InstCombine come from the implementation of the bitwise domain,
i.e. known bits & friends, so at least evaluating caching is something
I would try earlier.

Something else that can be tried (even if it doesn't improve compile
time, it is still a nice cleanup) is moving combines that don't create
new instructions from instcombine to instsimplify. Many passes use
instruction simplify, so that might result in the amount of code that's
processed by instcombine being smaller and/or could result in improved
code quality. Just speculation, but a decent experiment if somebody
has time to take a look.
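For context, the kind of "fold without creating instructions" service that InstructionSimplify provides to other passes looks roughly like the sketch below; the signatures are as of roughly that time and worth re-checking against the headers:

#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/Local.h"

// Sketch of a cheap clean-up loop built on InstSimplify: it only replaces an
// instruction with an existing value and never creates new instructions.
static bool simplifyFunction(llvm::Function &F) {
  const llvm::DataLayout &DL = F.getParent()->getDataLayout();
  bool Changed = false;
  for (llvm::BasicBlock &BB : F)
    for (auto It = BB.begin(); It != BB.end();) {
      llvm::Instruction &I = *It++;
      if (llvm::Value *V = llvm::SimplifyInstruction(&I, DL)) {
        I.replaceAllUsesWith(V);
        if (llvm::isInstructionTriviallyDead(&I))
          I.eraseFromParent();
        Changed = true;
      }
    }
  return Changed;
}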

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

Daniel Berlin via llvm-dev

Mar 21, 2017, 2:45:37 PM
to Davide Italiano, llvm-dev
So, just a thought:
"The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses."
That sounds sane.

So, are the expensive things canonicalization?
If that is the case, why are we doing such expensive canonicalization?  That seems strange to me.

If they are not canonicalization, should they really not be separated out (into some pass that possibly shares infrastructure)?

No compiler is going to get everything anyway, and instcombine needs to decide what "good enough" really means.

I would rather see us understand what we want out of instcombine, precisely, before we try to decide how to make it faster at doing whatever that thing is :)


--Dan


Hal Finkel via llvm-dev

Mar 22, 2017, 9:36:44 AM
to Gerolf Hoflehner, David Majnemer, llvm-dev


On 03/20/2017 11:51 PM, Gerolf Hoflehner wrote:

On Mar 17, 2017, at 6:12 PM, David Majnemer via llvm-dev <llvm...@lists.llvm.org> wrote:

Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange and many would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

As we get further along with compile-time improvements one question we need to ask ourselves more frequently is about the effectiveness of optimizations/passes. For example -  in this case - how can we make an educated assessment that running the combiner N times is a good cost/benefit investment of compute resources? The questions below are meant to figure out what technologies/instrumentations/etc could help towards a more data-driven decision process when it comes to the effectiveness of optimizations. Instcombiner might just be an inspirational use case to see what is possible in that direction.

The combiner is invoked in full multiple times. But is it really necessary to run all of it for that purpose? After instcombine is run once is there a mapping from transformation -> combines? I suspect most transformations could invoke a subset of combines to re-canonicalize. Or, if there was a (cheap) verifier for canonical IR, it could invoke a specific canonicalization routine. Instrumenting the instcombiner and checking which patterns actually kick in (for different invocations)  might give insight into how the combiner could be structured and so that only a subset of pattern need to be checked.

InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Can you elaborate on this “duel endlessly” with specific examples? This is out of curiosity. There must be verifiers that check that this cannot happen. Or an implementation strategy that guarantees that. Global isel will run into the same/similar question when it gets far enough to replace SD.

Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.


I'd be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn't change but is not intelligently reused. I imagine that the performance win might be quite comparable.

Can you back this up with measurements? Caching schemes are tricky. Is there a way to evaluate when the results of ComputeKnownBits etc. are actually effective, meaning the result is used and gives faster instructions? E.g. it might well be that only the first instance of inst_combine benefits from computing the bits.

To (hopefully) make it easier to answer this question, I've posted my (work-in-progress) patch which adds a known-bits cache to InstCombine. I rebased it yesterday, so it should be fairly easy to apply: https://reviews.llvm.org/D31239 - Seeing what this does to the performance of the benchmarks mentioned in this thread (among others) would certainly be interesting.

 -Hal

Sanjay Patel via llvm-dev

Mar 22, 2017, 2:34:24 PM
to Hal Finkel, llvm-dev
> To (hopefully) make it easier to answer this question, I've posted my (work-in-progress) patch which adds a known-bits cache to InstCombine.
> I rebased it yesterday, so it  should be fairly easy to apply: https://reviews.llvm.org/D31239 - Seeing what this does to the performance of the
> benchmarks mentioned in this thread (among others) would certainly be interesting.

Thanks! I only have the one rough data point based on PR32037, but caching does good things for compile-time on that example.

Trunk r298514 compiled release on macOS running on Haswell 4GHz:
$ time ./opt -O2  row_common.bc -S -o /dev/null

user    0m0.302s
user    0m0.300s
user    0m0.296s
user    0m0.299s
user    0m0.296s

With your patch applied:

user    0m0.264s
user    0m0.269s
user    0m0.269s
user    0m0.268s
user    0m0.268s

So the time for all of -O2 has dropped to 89.6% of the baseline (improvement of 11.5%).
A profile shows time spent in InstCombine->computeKnownBits dropped from 58 ms to 15 ms (lines up with the overall time drop), so we're about 4x faster in ValueTracking with the caching.

Hal Finkel via llvm-dev

Mar 22, 2017, 5:23:09 PM
to Sanjay Patel, llvm-dev


On 03/22/2017 01:34 PM, Sanjay Patel wrote:
> To (hopefully) make it easier to answer this question, I've posted my (work-in-progress) patch which adds a known-bits cache to InstCombine.
> I rebased it yesterday, so it  should be fairly easy to apply: https://reviews.llvm.org/D31239 - Seeing what this does to the performance of the
> benchmarks mentioned in this thread (among others) would certainly be interesting.

Thanks! I only have the one rough data point based on PR32037, but caching does good things for compile-time on that example.

Trunk r298514 compiled release on macOS running on Haswell 4GHz:
$ time ./opt -O2  row_common.bc -S -o /dev/null

user    0m0.302s
user    0m0.300s
user    0m0.296s
user    0m0.299s
user    0m0.296s

With your patch applied:

user    0m0.264s
user    0m0.269s
user    0m0.269s
user    0m0.268s
user    0m0.268s

So the time for all of -O2 has dropped to 89.6% of the baseline (improvement of 11.5%).
A profile shows time spent in InstCombine->computeKnownBits dropped from 58 ms to 15 ms (lines up with the overall time drop), so we're about 4x faster in ValueTracking with the caching.

Yay :-) -- Unfortunately, I won't have time to work on this in the near future, but if someone would like to pick this up and fix the nsw/nuw invalidation issue (which is exposed in a few of the regression-tests which fail with the patch applied), that would be great.

 -Hal

Mikhail Zolotukhin via llvm-dev

Mar 22, 2017, 9:28:56 PM
to Hal Finkel, llvm-dev
In my testing the results are not that impressive, but that's because I'm now focusing on Os. For me, even completely disabling all KnownBits-related patterns in InstCombine places the results very close to the noise level. In my original patch I also had some extra patterns moved under ExpensiveCombines - and that seems to make a difference too (without this part, or without the KnownBits part, I get results below 1%, which are not reported as regressions/improvements).

Personally I think caching results of KnownBits is a good idea, and should probably help O3 compile time (and obviously the cases from bug reports, like PR32037).

But I also think that the way we're currently doing combining/canonicalization is unnecessarily complicated. Do we really need to try canonicalizing all of these patterns? What happens if we don't? Has anyone tried replacing some of the InstCombine invocations with InstSimplify? Do we have a justification for the invocations we currently have? I realize that InstCombine doesn't usually do any harm if we don't care about compile time, but that's only the case for O3 (to some extent), not for other optimization levels. I think it's equally important to 1) make our implementations faster, and 2) not perform unnecessary work when possible. Adding caching for known bits makes InstCombine faster, but we can get even more if we don't invoke it when it's not needed.

Of course, we don't want to blindly disable the patterns that just happen not to be used in some (admittedly small) test runs that I made. But what I'd like to do is to make a deliberate decision on what's critical and what's optional here, and what can be disabled to save some compile time. I'd be happy to carry out whatever experiments we need to run to figure this out, and take part in implementing any missing parts or necessary refactoring. My current plan is to experiment with replacing some InstCombine invocations with InstSimplify and see what regresses after it, but if you know that's already been done by someone, or you have better ideas - I'd be happy to hear them!

Thanks,
Michael

Davide Italiano via llvm-dev

Mar 22, 2017, 9:46:45 PM
to Mikhail Zolotukhin, llvm-dev
On Wed, Mar 22, 2017 at 6:29 PM, Mikhail Zolotukhin via llvm-dev
<llvm...@lists.llvm.org> wrote:
>
> In my testing results are not that impressive, but that's because I'm now focusing on Os. For me even complete disabling of all KnownBits-related patterns in InstCombine places the results very close to the noise level. In my original patch I also had some extra patterns moved under ExpensiveCombines - and that seems to make a difference too (without this part, or without the KnownBits part I get results below 1%, which are not reported as regressions/improvements).
>

Have you profiled a single InstCombine run to see where we actually
spend our cycles (as Sanjay did for his reduced testcase)?

> I realize that InstCombine doesn't usually do any harm, if we don't care about compile time, but that's only the case for O3 (to some extent), not for other optimization levels.

Independently of the optimization level, I think compile-time
is important. Note, for example, that we run a (kinda) similar
pipeline at O3 and LTO (full, that is), where the impact of compile
time is much more evident. Also, while people are not generally bitten
by O3 compilation time, you may end up with terrible compile times for
large TUs (and I unfortunately learned this the hard way).

--
Davide

Sean Silva via llvm-dev

Mar 24, 2017, 5:58:53 PM
to Michael Zolotukhin, llvm-dev
An ALIVE-like matching automaton could algorithmically mitigate the costs of large numbers of matchers.

I don't know if we're at the point where we need to take that leap, but we should probably answer dannyb's comments before doing that. Once we know what we want we can then look for solutions (possibly using new/different algorithms, such as what we are doing for NewGVN)

Honestly I think that finding out what we want is going to be as hard as, or harder than, solving it. As Gerolf mentions, there's still quite a bit of analysis to be done to better understand InstCombine's role in our current pipeline.

-- Sean Silva

Mikulin, Dmitry via llvm-dev

Apr 13, 2017, 8:18:49 PM
to Davide Italiano, llvm-dev
I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~ 30M. Here are my observations so far.

Interestingly, caching produced a slight but measurable performance degradation of -O3 compile time.

InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.

ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine use of KnownBits. This may explain limited gain or even slight degradation if KnownBits are not re-computed as often as we thought.

Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.

ICmpInst 6.1%
LoadInst 5.5%
CallInst 2.1%
GetElementPtrInst 2.1%
StoreInst 1.6%

Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.

I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine" on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large-scale problems? If not, it's possibly another item to add to the to-do list.

Back to InstCombine, from the profile it does not appear there’s an obvious magic bullet that can help drastically improve performance. I will take a closer look at visitor functions and see if there’s anything that can be done.

Dmitry.

Davide Italiano via llvm-dev

Apr 13, 2017, 10:44:38 PM
to Mikulin, Dmitry, llvm-dev
On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry
<dmitry....@sony.com> wrote:
> I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~ 30M. Here are my observations so far.
>
> Interestingly, caching produced a slight but measurable performance degradation of -O3 compile time.
>
> InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.
>

It's because we run instcombine as we inline (see
addFunctionSimplificationPasses()) IIRC. We don't quite do this at LTO
time (FullLTO) because it's too expensive compile-time wise. ThinLTO
runs it.

> ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine use of KnownBits. This may explain limited gain or even slight degradation if KnownBits are not re-computed as often as we thought.
>
> Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.
>
> ICmpInst 6.1%
> LoadInst 5.5%
> CallInst 2.1%
> GetElementPtrInst 2.1%
> StoreInst 1.6%
>
> Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.
>

So walking the matchers seems to be expensive from your preliminary
analysis, at least, this is what you're saying?
Is this a run with debug info? i.e. are you passing -g to the per-TU
pipeline? I'm inclined to think this is mostly an additive effect
adding matchers here and there that don't really hurt small testcases
but we pay the debt over time (in particular for LTO). Side note, I
noticed (and others did as well) that instcombine is way slower with
`-g` on (one of the reasons could be we walking much longer use lists,
due to the dbg use). Do you have numbers of instcombine run on IR with
and without debug info?

> I wanted to see what transformations InstCombine actually performs. Using -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine” on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large scale problem? If not, it’s possibly another item to add to the to-do list.
>

You may consider adding statistics (those should be much more
scalable) although more coarse.

Thanks!

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

Daniel Berlin via llvm-dev

Apr 13, 2017, 11:27:39 PM
to Mikulin, Dmitry, llvm-dev
On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry via llvm-dev <llvm...@lists.llvm.org> wrote:
I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~ 30M. Here are my observations so far.

Interestingly, caching produced a slight but measurable performance degradation of -O3 compile time.

InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.

ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine use of KnownBits. This may explain limited gain or even slight degradation if KnownBits are not re-computed as often as we thought.
....
Not entirely shocking. 
I continue to believe we could compute it essentially once or twice, in the right iteration order, and have a performance gain here.
 

Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.

ICmpInst          6.1%
LoadInst          5.5%
CallInst          2.1%
GetElementPtrInst 2.1%
StoreInst         1.6%

Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.

I wanted to see what transformations InstCombine actually performs. Using -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine”

on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem?

Yes

The problem is *every* value print call builds an assembly writer, which  calls the module typefinder to be able to print the types.  This walks everything in the module to find the types.
For large ir with a lot of types, this is *ridiculously* slow.

IE it's basically processing a large part of the module for *every operand* it prints.

You can, for something like what you are doing, just hack it up to build it once or not at all.

I've never understood why this doesn't annoy people more :)

As a hack, you can comment out AsmWriter.cpp:2137

It should fix what you are seeing.

In practice, most types seem to print fine even without doing this, so ... 

Craig Topper via llvm-dev

Apr 14, 2017, 12:46:32 AM
to Daniel Berlin, llvm-dev
I've looked a little on where time is spent. I can provide more info when I'm back at work tomorrow.

visitGetElementPtr spends most of its time in SimplifyGEPInst. I'll have to look back at what it's doing in there.

Loads and stores spend much of their time in getOrEnforceKnownAlignment which uses computeKnownBits. I think loads also spend time in FindAvailableLoadedValue.

I haven't looked at calls or icmps in much detail.



~Craig

Mikulin, Dmitry via llvm-dev

Apr 14, 2017, 1:39:39 PM
to Davide Italiano, llvm-dev

> On Apr 13, 2017, at 7:43 PM, Davide Italiano <dav...@freebsd.org> wrote:
>
> On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry
> <dmitry....@sony.com> wrote:
>> I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~ 30M. Here are my observations so far.
>>
>> Interestingly, caching produced a slight but measurable performance degradation of -O3 compile time.
>>
>> InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.
>>
>
> It's because we run instcombine as we inline (see
> addFunctionSimplificationPasses()) IIRC. We don't quite do this at LTO
> time (FullLTO) because it's too expensive compile-time wise. ThinLTO
> runs it.
>
>> ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine use of KnownBits. This may explain limited gain or even slight degradation if KnownBits are not re-computed as often as we thought.
>>
>> Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.
>>
>> ICmpInst 6.1%
>> LoadInst 5.5%
>> CallInst 2.1%
>> GetElementPtrInst 2.1%
>> StoreInst 1.6%
>>
>> Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.
>>
>
> So walking the matchers seems to be expensive from your preliminary
> analysis, at least, this is what you're saying?

Looks like it. Other than computeKnownBits, most other functions at the top of the profile for InstCombine are instruction visitors.

> Is this a run with debug info? i.e. are you passing -g to the per-TU
> pipeline? I'm inclined to think this is mostly an additive effect
> adding matchers here and there that don't really hurt small testcases
> but we pay the debt over time (in particular for LTO). Side note, I
> noticed (and others did as well) that instcombine is way slower with
> `-g` on (one of the reasons could be we walking much longer use lists,
> due to the dbg use). Do you have numbers of instcombine ran on IR with
> and without debug info?

I do have the numbers for the same app with and without debug info. The results above are for the no-debug version.

Total execution time of -O3 is 34% slower with debug info. The size of the debug IR is 162M vs 39M no-debug. Both profiles look relatively similar with the exception of bit code writer and verifier taking a larger share in the -g case.

Looking at InstCombine, it's 23% slower. One notable thing is that CallInst takes a significantly larger share with -g: 5s vs 13s, which translates to about half of the InstCombine slowdown. Need to understand why. ComputeKnownBits takes about the same time, and other visitors have elevated times, I would guess due to the need to propagate debug info.

Mikulin, Dmitry via llvm-dev

Apr 14, 2017, 5:19:48 PM
to Mikulin, Dmitry, llvm-dev

>> Is this a run with debug info? i.e. are you passing -g to the per-TU
>> pipeline? I'm inclined to think this is mostly an additive effect
>> adding matchers here and there that don't really hurt small testcases
>> but we pay the debt over time (in particular for LTO). Side note, I
>> noticed (and others did as well) that instcombine is way slower with
>> `-g` on (one of the reasons could be we walking much longer use lists,
>> due to the dbg use). Do you have numbers of instcombine ran on IR with
>> and without debug info?
>
> I do have the numbers for the same app with and without debug info. The results above are for the no-debug version.
>
> Total execution time of -O3 is 34% slower with debug info. The size of the debug IR is 162M vs 39M no-debug. Both profiles look relatively similar with the exception of bit code writer and verifier taking a larger share in the -g case.
>
> Looking at InstCombine, it’s 23% slower. One notable thing is that CallInst takes significantly larger share with -g: 5s vs 13s, which translates to about half of the InstCombine slowdown. Need to understand why.

Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.

Davide Italiano via llvm-dev

Apr 14, 2017, 5:23:45 PM
to Mikulin, Dmitry, llvm-dev
On Fri, Apr 14, 2017 at 2:19 PM, Mikulin, Dmitry
<dmitry....@sony.com> wrote:
>
>>> Is this a run with debug info? i.e. are you passing -g to the per-TU
>>> pipeline? I'm inclined to think this is mostly an additive effect
>>> adding matchers here and there that don't really hurt small testcases
>>> but we pay the debt over time (in particular for LTO). Side note, I
>>> noticed (and others did as well) that instcombine is way slower with
>>> `-g` on (one of the reasons could be we walking much longer use lists,
>>> due to the dbg use). Do you have numbers of instcombine ran on IR with
>>> and without debug info?
>>
>> I do have the numbers for the same app with and without debug info. The results above are for the no-debug version.
>>
>> Total execution time of -O3 is 34% slower with debug info. The size of the debug IR is 162M vs 39M no-debug. Both profiles look relatively similar with the exception of bit code writer and verifier taking a larger share in the -g case.
>>
>> Looking at InstCombine, it’s 23% slower. One notable thing is that CallInst takes significantly larger share with -g: 5s vs 13s, which translates to about half of the InstCombine slowdown. Need to understand why.
>
> Ah, it’s all those calls to @llvm.dbg.* functions. I’ll explore if they can be safely ignored by InstCombine.
>
>

I took a look and saw no immediate problems. I also discussed with David
Majnemer on IRC, who thinks we should just bail out early.

Craig Topper via llvm-dev

Apr 14, 2017, 5:31:16 PM
to Davide Italiano, llvm-dev
I think the easy way to ignore them is to add visitDbgDeclareInst and visitDbgValueInst to the InstCombine visitor tree and then just return nullptr from them.

~Craig

Reid Kleckner via llvm-dev

Apr 15, 2017, 11:38:58 AM
to Mikulin, Dmitry, llvm-dev
I had an idea that llvm.dbg.value should be variadic. I was staring at some program output, and I noticed that debug values tend to group together around inline call sites. It might be interesting to shorten the instruction stream by extending the dbg.value operand list to describe multiple variables and expressions.

Chris Lattner via llvm-dev

Apr 15, 2017, 2:53:17 PM
to Daniel Berlin, llvm-dev
On Apr 13, 2017, at 8:27 PM, Daniel Berlin via llvm-dev <llvm...@lists.llvm.org> wrote:
Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.

I wanted to see what transformations InstCombine actually performs. Using -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine”

on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem?

Yes

The problem is *every* value print call builds an assembly writer, which  calls the module typefinder to be able to print the types.  This walks everything in the module to find the types.
For large ir with a lot of types, this is *ridiculously* slow.

IE it's basically processing a large part of the module for *every operand* it prints.

You can, for something like what you are doing, just hack it up to build it once or not at all.

I've never understood why this doesn't annoy people more :)

As a hack, you can comment out AsmWriter.cpp:2137

You can just create a ModuleSlotTracker and pass it back into the print functions.
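A sketch of that suggestion (API names from memory; worth verifying): build the slot tracker once and hand it to each print call, instead of letting every operator<< re-scan the module:

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/ModuleSlotTracker.h"
#include "llvm/Support/raw_ostream.h"

// Reuse one ModuleSlotTracker for all debug printing of values in M, so the
// module is only walked once instead of on every printed operand.
void printInstructions(const llvm::Module &M) {
  llvm::ModuleSlotTracker MST(&M);
  for (const llvm::Function &F : M)
    for (const llvm::BasicBlock &BB : F)
      for (const llvm::Instruction &I : BB) {
        I.print(llvm::errs(), MST);
        llvm::errs() << "\n";
      }
}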

-Chris



Daniel Berlin via llvm-dev

Apr 15, 2017, 3:29:50 PM
to Chris Lattner, llvm-dev
It's not the slot tracker (which does not waste any time, AFAICT), it's the type incorporation of the typeprinter, which walks the entire module, and is getting recreated on every single operator << call.

But yes, you could create a typeprinter and pass it back in (if typeprinter wasn't internal to AsmWriter.cpp).
 
But then every operator << you use has to be passed a typeprinter or be super slow. That also seems silly, because most people want to do

DEBUG(dbgs() << "This instruction: " << *I << " did a thing\n");

and trying to keep and pass a type printer into all of these, ...
alternatively, rewriting them also sucks.

This is in fact, why nobody does it. Because they just want the above to work and not worry about the fact that deep down, we are wasting tremendous amounts of time.







Mikulin, Dmitry via llvm-dev

Apr 17, 2017, 4:06:12 PM
to Craig Topper, Davide Italiano, llvm-dev

This change recovered the time lost processing @llvm.dbg intrinsics:

diff --git a/lib/Transforms/InstCombine/InstCombineInternal.h b/lib/Transforms/InstCombine/InstCombineInternal.h
index 7100006..e70613c 100644
--- a/lib/Transforms/InstCombine/InstCombineInternal.h
+++ b/lib/Transforms/InstCombine/InstCombineInternal.h
@@ -290,6 +290,7 @@ public:
   Instruction *visitSelectInst(SelectInst &SI);
   Instruction *visitCallInst(CallInst &CI);
   Instruction *visitInvokeInst(InvokeInst &II);
+  Instruction *visitDbgInfoIntrinsic(DbgInfoIntrinsic &CI) { return nullptr; }
 
   Instruction *SliceUpIllegalIntegerPHI(PHINode &PN);
   Instruction *visitPHINode(PHINode &PN);

The -O3 compile time speedup on my benchmark is ~3%. No llvm test failures.

Safe to commit?
