It'd be nice to double-check that any runtime performance loss at -O2 is negligible. But this sounds like a great idea!
vedant
> Performance Improvements - Compile Time (seconds)
> Benchmark                      Δ         Previous   Current    σ
> CTMark/sqlite3/sqlite3         -1.55%     6.8155     6.7102    0.0081
> CTMark/mafft/pairlocalalign    -1.05%     8.0407     7.9559    0.0193
> CTMark/ClamAV/clamscan         -1.02%    11.3893    11.2734    0.0081
> CTMark/lencod/lencod           -1.01%    12.8763    12.7461    0.0244
> CTMark/SPASS/SPASS             -1.01%    12.5048    12.3791    0.0340
>
> Performance Improvements - Compile Time (seconds)
> Benchmark                                            Δ         Previous   Current    σ
> External/SPEC/CINT2006/403.gcc/403.gcc               -1.64%    54.0801    53.1930    -
> External/SPEC/CINT2006/400.perlbench/400.perlbench   -1.25%    19.1481    18.9091    -
> External/SPEC/CINT2006/445.gobmk/445.gobmk           -1.01%    15.2819    15.1274    -
>
>
> Do such changes make sense? The patch doesn't change O3, but it does change Os and could potentially affect performance there (though I didn't see any changes in my tests).
>
> The patch is attached for reference; if we decide to go for it, I'll upload it to Phabricator:
>
> <0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch>
>
>
> Thanks,
> Michael
>
> [1]: http://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html
>
On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via llvm-dev <llvm...@lists.llvm.org> wrote:
Hi,
One of the most time-consuming passes in the LLVM middle-end is InstCombine (see e.g. [1]). It is a very powerful pass, capable of doing all sorts of crazy stuff, and new patterns are constantly being introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. That sounds OK for O3, where we try to squeeze out as much performance as possible, but it is excessive for other opt-levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.
Trying to find out which patterns are important and which are rare, I profiled clang using CTMark and got the following coverage report:
<InstCombine_covreport.html>
(beware, the file is ~6MB).
Guided by this profile, I moved some patterns under the "if (ExpensiveCombines)" check, which, as expected, turned out to be neutral for runtime performance but improved compile time. The testing results are below (measured for Os).
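The general shape of such a change is sketched below. This is schematic and not the attached patch; foldICmpUsingKnownBits is just one example of the kind of known-bits-based call that could be gated.

// Schematic sketch (not the attached patch) of gating a pattern: a
// rarely-firing, known-bits-based fold moved under InstCombine's existing
// ExpensiveCombines flag inside one of its visitor routines.
Instruction *InstCombiner::visitICmpInst(ICmpInst &I) {
  // ...cheap, frequently-hit patterns stay unconditional...

  if (ExpensiveCombines) {
    // Folds that the coverage profile shows to fire rarely, or that have to
    // walk known bits, move under this guard.
    if (Instruction *Res = foldICmpUsingKnownBits(I))
      return Res;
  }

  return nullptr; // no combine applied
}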
Yes, "ExpensiveCombines" was added recently (4.0? 3.9?), but I believe it has always been intended to be extended the way you're doing it. So I support this effort :)
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
On Mar 17, 2017, at 6:12 PM, David Majnemer <david.m...@gmail.com> wrote:
Honestly, I'm not a huge fan of this change as-is. The set of transforms that were moved behind ExpensiveCombines seems awfully strange, and many would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).
The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses.
InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).
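A minimal, made-up illustration of such dueling, outside of LLVM: two rewrite rules that each undo the other, so a worklist rewriter applying both never reaches a fixed point.

// Toy, self-contained example (not InstCombine code): rule A canonicalizes
// "x - y" to "x + (-y)", rule B rewrites it back. With both enabled, the
// rewriter never terminates; a real combiner avoids this only by agreeing
// globally on one canonical direction.
#include <cstdio>
#include <string>

int main() {
  std::string Expr = "x - y";
  for (int Iter = 0; Iter < 6; ++Iter) { // cap the loop so the demo ends
    if (Expr == "x - y")
      Expr = "x + (-y)";                 // rule A fires
    else
      Expr = "x - y";                    // rule B undoes rule A
    std::printf("iteration %d: %s\n", Iter, Expr.c_str());
  }
  return 0;
}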
Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.
I'd be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn't change but is not intelligently reused. I imagine that the performance win might be quite comparable. Such a patch would have the benefit of keeping the set of available transforms constant throughout the pipeline while bringing execution time down; I wouldn't be at all surprised if caching the ValueTracking functions resulted in a bigger time savings.
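A rough, self-contained sketch of that caching idea (the class and names here are hypothetical, not an actual LLVM patch): memoize the per-value analysis result and drop entries whenever a value is rewritten, so repeated known-bits-style queries within one run become cheap.

// Hypothetical memoization of an expensive per-value analysis (a stand-in
// for computeKnownBits): repeated queries on unchanged values hit the cache,
// and entries are invalidated when a value is rewritten so stale bits never
// leak into later combines.
#include <cstdint>
#include <unordered_map>

struct KnownBitsResult {               // stand-in for llvm::KnownBits
  uint64_t Zero = 0, One = 0;
};

class KnownBitsCache {
  std::unordered_map<const void *, KnownBitsResult> Cache; // keyed by Value*

public:
  template <typename ComputeFn>
  KnownBitsResult get(const void *V, ComputeFn Compute) {
    auto It = Cache.find(V);
    if (It != Cache.end())
      return It->second;               // reuse the earlier answer
    KnownBitsResult R = Compute(V);    // fall back to the real computation
    Cache.emplace(V, R);
    return R;
  }

  // Call whenever a value (or anything feeding it) changes.
  void invalidate(const void *V) { Cache.erase(V); }
};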
As somebody who brought up this problem at least once in the mailing
lists, I'm in agreement with David Majnemer here.
I think we should consider a caching strategy before going this route.
FWIW, I'm not a big fan of `ExpensiveCombines` at all. I can see the
reason why it was introduced, but in my experience the "expensive"
bits of InstCombine come from the implementation of the bitwise domain,
i.e. known bits & friends, so evaluating caching is at least something
I would try earlier.
Something else that could be tried (even if it doesn't improve compile
time, it's still a nice cleanup) is moving combines that don't create
new instructions from InstCombine to InstSimplify. Many passes use
InstSimplify, so that might shrink the amount of code InstCombine has
to process and/or improve code quality. Just speculation, but a decent
experiment if somebody has time to take a look.
--
Davide
"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
On Mar 17, 2017, at 6:12 PM, David Majnemer via llvm-dev <llvm...@lists.llvm.org> wrote:
> Honestly, I'm not a huge fan of this change as-is. The set of transforms that were moved behind ExpensiveCombines seems awfully strange and many would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).
> The purpose of many of InstCombine's xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

As we get further along with compile-time improvements, one question we need to ask ourselves more frequently is about the effectiveness of optimizations/passes. For example - in this case - how can we make an educated assessment that running the combiner N times is a good cost/benefit investment of compute resources? The questions below are meant to figure out what technologies/instrumentation/etc. could help towards a more data-driven decision process when it comes to the effectiveness of optimizations. InstCombine might just be an inspirational use case to see what is possible in that direction.

The combiner is invoked in full multiple times. But is it really necessary to run all of it for that purpose? After InstCombine is run once, is there a mapping from transformation -> combines? I suspect most transformations could invoke a subset of combines to re-canonicalize. Or, if there were a (cheap) verifier for canonical IR, it could invoke a specific canonicalization routine. Instrumenting the combiner and checking which patterns actually kick in (for different invocations) might give insight into how it could be structured so that only a subset of patterns needs to be checked.

> InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Can you elaborate on this "duel endlessly" with specific examples? This is out of curiosity. There must be verifiers that check that this cannot happen, or an implementation strategy that guarantees it. GlobalISel will run into the same/similar question when it gets far enough to replace SelectionDAG.

> Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.
> I'd be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn't change but is not intelligently reused. I imagine that the performance win might be quite comparable.

Can you back this up with measurements? Caching schemes are tricky. Is there a way to evaluate whether the results of ComputeKnownBits etc. are actually effective, meaning the result is used and gives faster instructions? E.g. it might well be that only the first instance of InstCombine benefits from computing the bits.
> To (hopefully) make it easier to answer this question, I've posted my (work-in-progress) patch which adds a known-bits cache to InstCombine.
> I rebased it yesterday, so it should be fairly easy to apply: https://reviews.llvm.org/D31239 - Seeing what this does to the performance of the benchmarks mentioned in this thread (among others) would certainly be interesting.

Thanks! I only have the one rough data point based on PR32037, but caching does good things for compile-time on that example.
Trunk r298514 compiled release on macOS running on Haswell 4GHz:
$ time ./opt -O2 row_common.bc -S -o /dev/null
user 0m0.302s
user 0m0.300s
user 0m0.296s
user 0m0.299s
user 0m0.296s
With your patch applied:
user 0m0.264s
user 0m0.269s
user 0m0.269s
user 0m0.268s
user 0m0.268s
So the time for all of -O2 has dropped to 89.6% of the baseline (improvement of 11.5%).
A profile shows time spent in InstCombine->computeKnownBits dropped from 58 ms to 15 ms (lines up with the overall time drop), so we're about 4x faster in ValueTracking with the caching.
Have you profiled a single InstCombine run to see where we actually
spend our cycles (as Sanjay did for his reduced testcase)?
> I realize that InstCombine doesn't usually do any harm, if we don't care about compile time, but that's only the case for O3 (to some extent), not for other optimization levels.
Independently of the optimization level, I think compile time is
important. Note, for example, that we run a (kinda) similar
pipeline at O3 and LTO (full, that is), where the impact on compile
time is much more evident. Also, while people are not generally bitten
by O3 compilation time, you may end up with terrible performance for
large TUs (and I unfortunately learned this the hard way).
--
Davide
It's because we run instcombine as we inline (see
addFunctionSimplificationPasses()) IIRC. We don't quite do this at LTO
time (FullLTO) because it's too expensive compile-time wise. ThinLTO
runs it.
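Schematically (a paraphrase, not the exact PassManagerBuilder code, and the function name below is made up), that per-function simplification pipeline is where InstCombine keeps getting re-added, and the legacy pass factory is where the ExpensiveCombines knob is already exposed:

// Paraphrase of the idea, not the real PassManagerBuilder code: the
// per-function simplification pipeline - re-run by the inliner's CGSCC
// manager as functions get inlined - schedules InstCombine as a clean-up
// pass, and the legacy factory already takes the ExpensiveCombines flag.
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/InstCombine/InstCombine.h"

static void addSimplificationPassesSketch(llvm::legacy::PassManagerBase &PM,
                                          bool ExpensiveCombines) {
  // ...other scalar clean-up passes elided...
  PM.add(llvm::createInstructionCombiningPass(ExpensiveCombines));
  // ...InstCombine reappears several more times further down the pipeline...
}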
> ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine use of KnownBits. This may explain limited gain or even slight degradation if KnownBits are not re-computed as often as we thought.
>
> Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.
>
> ICmpInst 6.1%
> LoadInst 5.5%
> CallInst 2.1%
> GetElementPtrInst 2.1%
> StoreInst 1.6%
>
> Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.
>
So walking the matchers seems to be expensive, at least from your
preliminary analysis - is that what you're saying?
Is this a run with debug info? I.e., are you passing -g to the per-TU
pipeline? I'm inclined to think this is mostly an additive effect:
matchers added here and there don't really hurt small testcases,
but we pay the debt over time (in particular for LTO). Side note: I
noticed (and others did as well) that instcombine is way slower with
`-g` on (one of the reasons could be that we walk much longer use lists,
due to the dbg uses). Do you have numbers for instcombine run on IR with
and without debug info?
> I wanted to see what transformations InstCombine actually performs. Using -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine” on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large scale problem? If not, it’s possibly another item to add to the to-do list.
>
You may consider adding statistics (those should be much more
scalable), although they're more coarse-grained.
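A hedged sketch of what such a statistic could look like (the counter name below is hypothetical; InstCombine already has coarse counters such as NumCombined):

// LLVM's STATISTIC counters are cheap, printed once per run with -stats,
// and scale to large inputs where -debug does not. A finer-grained,
// per-pattern counter would look roughly like this.
#include "llvm/ADT/Statistic.h"
#define DEBUG_TYPE "instcombine"

STATISTIC(NumICmpKnownBitsFolds,
          "Number of icmps folded using known bits");

// ...then, inside the transform in question, bump it when the pattern fires:
//   ++NumICmpKnownBitsFolds;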
Thanks!
--
Davide
"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~ 30M. Here are my observations so far.
Interestingly, caching produced a slight but measurable performance degradation of -O3 compile time.
InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.
ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine use of KnownBits. This may explain limited gain or even slight degradation if KnownBits are not re-computed as often as we thought.
Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.
ICmpInst 6.1%
LoadInst 5.5%
CallInst 2.1%
GetElementPtrInst 2.1%
StoreInst 1.6%
Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.
I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine" on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large-scale problems? If not, it's possibly another item to add to the to-do list.
I took a look and saw no immediate problems; I also discussed this with
David Majnemer on IRC, who thinks we should just bail out early.
> Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.
> I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine" on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem?

Yes. The problem is that *every* value print call builds an assembly writer, which calls the module typefinder to be able to print the types. This walks everything in the module to find the types. For large IR with a lot of types, this is *ridiculously* slow - it's basically processing a large part of the module for *every operand* it prints. You can, for something like what you are doing, just hack it up to build the writer once or not at all. I've never understood why this doesn't annoy people more :) As a hack, you can comment out AsmWriter.cpp:2137.

-Chris
This change recovered the time lost processing @llvm.dbg intrinsics:
diff --git a/lib/Transforms/InstCombine/InstCombineInternal.h b/lib/Transforms/InstCombine/InstCombineInternal.h
index 7100006..e70613c 100644
--- a/lib/Transforms/InstCombine/InstCombineInternal.h
+++ b/lib/Transforms/InstCombine/InstCombineInternal.h
@@ -290,6 +290,7 @@ public:
   Instruction *visitSelectInst(SelectInst &SI);
   Instruction *visitCallInst(CallInst &CI);
   Instruction *visitInvokeInst(InvokeInst &II);
+  Instruction *visitDbgInfoIntrinsic(DbgInfoIntrinsic &CI) { return nullptr; }
   Instruction *SliceUpIllegalIntegerPHI(PHINode &PN);
   Instruction *visitPHINode(PHINode &PN);
The -O3 compile-time speedup on my benchmark is ~3%. No LLVM test failures.
Safe to commit?