I’m River, and I’m a compiler engineer at PlayStation. Recently I’ve been working on an interprocedural outlining (code folding) pass for code size improvement at the IR level. We hit a couple of use cases that the current code size solutions didn’t handle well enough, and outlining is one of the avenues that seemed potentially beneficial.
-- Algorithmic Approach --
The general implementation can be explained in stages:
* Candidate Selection:
Each instruction in the module is mapped to a congruence class based upon relaxed equivalency constraints. For most instructions this is simply the type, opcode, and operand types. The constraints are tightened for instructions with special state that require more exact equivalence (e.g. ShuffleVector requires a constant mask vector). Candidates are then found by constructing a suffix array and an LCP (longest common prefix) array. Walking the LCP array gives us congruent chains of instructions within the module.
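The discovery step can be sketched as follows, assuming instructions have already been mapped to integer congruence-class IDs (function and variable names here are illustrative, not the pass's actual API):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Naive suffix array over congruence-class IDs. O(n^2 log n) comparison
// sort, kept simple for illustration; the RFC notes faster constructions.
std::vector<int> buildSuffixArray(const std::vector<int> &IDs) {
  std::vector<int> SA(IDs.size());
  std::iota(SA.begin(), SA.end(), 0);
  std::sort(SA.begin(), SA.end(), [&](int A, int B) {
    return std::lexicographical_compare(IDs.begin() + A, IDs.end(),
                                        IDs.begin() + B, IDs.end());
  });
  return SA;
}

// LCP[i] = longest common prefix of the suffixes at SA[i-1] and SA[i].
// Any LCP[i] >= some minimum length marks a congruent chain of
// instructions that occurs at least twice in the module.
std::vector<int> buildLCP(const std::vector<int> &IDs,
                          const std::vector<int> &SA) {
  std::vector<int> LCP(SA.size(), 0);
  for (std::size_t I = 1; I < SA.size(); ++I) {
    std::size_t A = SA[I - 1], B = SA[I], Len = 0;
    while (A + Len < IDs.size() && B + Len < IDs.size() &&
           IDs[A + Len] == IDs[B + Len])
      ++Len;
    LCP[I] = static_cast<int>(Len);
  }
  return LCP;
}
```

For example, on the ID sequence {1,2,3,9,1,2,3,8} the maximum LCP value is 3, recovering the repeated chain 1,2,3 at positions 0 and 4.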
* Candidate Analysis:
A verification step splits candidates that have different internal input sequences or incompatible parent-function attributes between occurrences. An example of incompatible internal input sequences is:
X = W + 6;      X = W + 6;
Y = X + 4;  vs  Y = W + 4;
The two occurrences above would need special control flow to coexist within the same outlined function, because the second operand of the second add differs (X in one occurrence, W in the other).
During analysis, candidates have their inputs and outputs computed along with an estimated benefit from extraction. During input calculation, a constant-folding step removes all inputs that are the same amongst all occurrences.
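That constant-folding step can be sketched as below (a hypothetical helper, not the pass's code): given the input operands observed at each occurrence, any slot whose value is identical everywhere can be materialized inside the outlined body instead of being passed in.

```cpp
#include <cassert>
#include <vector>

// Occurrences is a matrix: one row per occurrence of the candidate, one
// column per external input slot. A slot whose value matches across every
// occurrence can be folded into the outlined body as a constant; only the
// remaining slots become parameters of the outlined function.
std::vector<std::size_t>
nonConstantInputs(const std::vector<std::vector<int>> &Occurrences) {
  std::vector<std::size_t> Params;
  for (std::size_t Slot = 0; Slot < Occurrences[0].size(); ++Slot) {
    bool SameEverywhere = true;
    for (const auto &Occ : Occurrences)
      if (Occ[Slot] != Occurrences[0][Slot]) {
        SameEverywhere = false;
        break;
      }
    if (!SameEverywhere)
      Params.push_back(Slot); // must remain an input parameter
  }
  return Params;
}
```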
* Candidate Pruning:
Overlapping candidates are pruned with a generic greedy algorithm that picks occurrences starting from the most beneficial candidates.
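A sketch of that greedy pruning, with illustrative types (the real pass works over instruction ranges in the module):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Candidate {
  int Benefit; // estimated size saving if all kept occurrences are outlined
  std::vector<std::pair<int, int>> Occurrences; // [Begin, End) instr ranges
};

static bool overlaps(std::pair<int, int> A, std::pair<int, int> B) {
  return A.first < B.second && B.first < A.second;
}

// Visit candidates from most to least beneficial; keep each occurrence only
// if it does not overlap a range already claimed by a better candidate.
std::vector<std::pair<int, int>> pruneOverlaps(std::vector<Candidate> Cands) {
  std::sort(Cands.begin(), Cands.end(),
            [](const Candidate &A, const Candidate &B) {
              return A.Benefit > B.Benefit;
            });
  std::vector<std::pair<int, int>> Claimed;
  for (const Candidate &C : Cands)
    for (auto Occ : C.Occurrences) {
      bool Clash = false;
      for (auto Taken : Claimed)
        if (overlaps(Occ, Taken)) {
          Clash = true;
          break;
        }
      if (!Clash)
        Claimed.push_back(Occ);
    }
  return Claimed;
}
```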
* Candidate Outlining:
Non-pruned candidates are then outlined. Outputs from a candidate are returned via an output parameter, except for one output, which is promoted to a return value. During outlining, the inputs into the candidate are condensed by computing the equivalencies between the arguments at each occurrence. An example of this is:
outlinedFn(1,6,1); -> outlinedFn(1,6);
outlinedFn(2,4,2); -> outlinedFn(2,4);
In the above, parameters 1 and 3 were found to be equivalent for all occurrences, so the number of inputs was reduced to two.
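That condensing step can be sketched over the call-site argument lists like this (hypothetical helper names):

```cpp
#include <cassert>
#include <vector>

// Calls is one row per occurrence, holding the arguments that occurrence
// would pass. Two parameter positions can share a single parameter if they
// hold equal values at *every* call site; each class keeps one
// representative position as a real parameter of the outlined function.
std::vector<std::vector<std::size_t>>
condenseParameters(const std::vector<std::vector<int>> &Calls) {
  std::size_t NumParams = Calls[0].size();
  std::vector<std::vector<std::size_t>> Classes;
  std::vector<bool> Assigned(NumParams, false);
  for (std::size_t I = 0; I < NumParams; ++I) {
    if (Assigned[I])
      continue;
    Classes.push_back({I});
    Assigned[I] = true;
    for (std::size_t J = I + 1; J < NumParams; ++J) {
      if (Assigned[J])
        continue;
      bool EqualEverywhere = true;
      for (const auto &Call : Calls)
        if (Call[I] != Call[J]) {
          EqualEverywhere = false;
          break;
        }
      if (EqualEverywhere) {
        Classes.back().push_back(J); // position J can reuse I's parameter
        Assigned[J] = true;
      }
    }
  }
  return Classes;
}
```

For the call sites above, (1,6,1) and (2,4,2), positions 0 and 2 agree everywhere, giving classes {0,2} and {1}: two parameters remain.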
* Debug Info:
Debug information is preserved for the calls to functions which have been outlined, but all debug info from the original outlined portions is removed, making the outlined bodies themselves harder to debug.
* Profile Info:
If the pass is running at Os, the outliner will only consider cold blocks, whereas at Oz it considers all blocks that are not marked as hot.
* Location in Pipeline:
The pass is currently configured to run very late in the optimization pipeline. It is intended to run at Oz but will also run at Os if there is profile data available. The pass can optionally be run twice: once before function simplification and then again at the default location. This second run is optional because it trades the potential benefits of early redundancy elimination against the potential benefits of function simplification. This can lead to large benefits or regressions depending on the benchmark (though the benefits tend to outnumber the regressions). The performance section contains data for both configurations on a variety of benchmarks.
-- Why at the IR level --
The decision to place the outliner at the IR level comes from a couple of major factors:
- Desire to be target independent
- Most opportunities for congruency
The downside to having this type of transformation at the IR level is reduced accuracy in the cost model: we can somewhat accurately model the cost per instruction, but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, so to help alleviate this there are several tunable parameters for the cost model.
-- Performance --
More results including clang, llvm-tblgen, and more specific numbers about benefits/regressions can be found in the notes section below.
* Size Reduction:
- Test Suite(X86_64):
- Early+Late outlining provides a geomean of 10.5% reduction over clang Oz, with a largest improvement of ~67% and largest regression of ~7.5%.
- Late outlining provides a geomean of 4.65% reduction, with a largest improvement of ~51% and largest regression of ~6.4%.
- Spec 2006(X86_64)
- Early+Late outlining provides a geomean reduction of 2.08%.
- Late outlining provides 2.09%.
- CSiBE(AArch64)
- Early+Late outlining shows geomean reduction of around 3.5%.
- Late outlining shows 3.1%.
* Compile Time:
Compile time was tested under test-suite with a multisample of 5.
- Early+Late outlining
- Many improvements with > 40% compile time reduction.
- Few regressions.
- Late outlining
- Greatest improvement is ~7.8%
- Greatest regression is ~4% with a difference of <0.02s
Our explanation for the compile time reduction with early outlining is that removing so much redundancy early in the optimization pipeline reduces the amount of instruction processing during the rest of the compilation.
* Execution Time:
Ran with a multisample of 5.
- Test Suite:
- Early+Late outlining has many regressions up to 97%. The greatest improvement was around 7.5%.
- Late outlining also has several regressions up to 44% and a greatest improvement of around 5.3%.
- Spec:
- Early+Late has a geomean regression of 3.5%.
- Late outlining has a geomean regression of 1.25%.
The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining.
-- Tested Improvements --
* MergeFunctions:
- No noticeable benefit.
* LTO:
- LTO doesn’t have a code size pipeline, but the percentage reductions over LTO are comparable to non-LTO.
* Input/Output Partitioning:
- This identifies inputs/outputs that may be folded by splitting a candidate. The benefit is minimal for the computations required.
* Similar Candidate Merging:
- The benefit to be gained is currently not worth the large complexity required to catch the desired cases.
-- Potential Improvements --
* Suffix & LCP array construction: The pass currently uses a very basic implementation that could be improved. There are definitely faster algorithms, and some construct both arrays simultaneously. We will investigate this as a potential compile time improvement in the future.
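As one such improvement: once the suffix array exists, Kasai's algorithm computes the LCP array in O(n) rather than by naive pairwise comparison. A sketch over integer congruence IDs:

```cpp
#include <cassert>
#include <vector>

// Kasai's O(n) LCP construction. LCP[i] is the longest common prefix of
// the suffixes at SA[i-1] and SA[i]; LCP[0] is defined as 0. Key invariant:
// moving from suffix I to suffix I+1 drops at most one matched element, so
// the running match length H decreases by at most one per step and its
// total growth is O(n).
std::vector<int> kasaiLCP(const std::vector<int> &S,
                          const std::vector<int> &SA) {
  int N = static_cast<int>(S.size());
  std::vector<int> Rank(N), LCP(N, 0);
  for (int I = 0; I < N; ++I)
    Rank[SA[I]] = I;
  int H = 0; // current match length, carried across iterations
  for (int I = 0; I < N; ++I) {
    if (Rank[I] > 0) {
      int J = SA[Rank[I] - 1]; // suffix preceding suffix I in sorted order
      while (I + H < N && J + H < N && S[I + H] == S[J + H])
        ++H;
      LCP[Rank[I]] = H;
      if (H > 0)
        --H;
    } else {
      H = 0;
    }
  }
  return LCP;
}
```

For example, for the classic string "banana" (suffix array {5,3,1,0,4,2}) the LCP array is {0,1,3,0,0,2}.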
* Os performance tuning: Under -Os the pass currently only runs on cold blocks. Ideally we could expand this to be a little more lax on less frequently executed blocks that aren’t cold.
* Candidate Selection: The algorithm currently focuses on the longest common sequences. More checks could be added to see if shortening the sequence length produces a larger benefit (e.g. fewer inputs/outputs). This would likely increase compile time but should remain below the baseline.
* Non Exact Functions: The outliner currently ignores functions that do not have an exact definition.
-- Notes --
* CSiBE(Code Size Benchmark):
* More detailed performance data:
* Implementation:
https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp
Hi River,
Do you know LLVM has the MachineOutliner pass (lib/CodeGen/MachineOutliner.cpp)?
It would be interesting to compare your approach with it.
Thanks,
Evgeny Astigeevich
* Per-benchmark size reductions (detail):

enc-3des: 66.31%
StatementReordering-dbl: 51.45%
Symbolics-dbl: 51.42%
Recurrences-dbl: 51.38%
Packing-dbl: 51.33%

enc-3des: 50.7%
ecbdes: 46.27%
security-rjindael: 45.13%
ControlFlow-flt: 25.79%
ControlFlow-dbl: 25.74%

ecbdes: 28.22%
Expansion-flt: 22.56%
Recurrences-flt: 22.19%
StatementReordering-flt: 22.15%
Searching-flt: 21.96%

bzip2: 9.15%
gcc: 4.03%
sphinx3: 3.8%
H264ref: 3.24%
Perlbench: 3%

bzip2: 7.27%
sphinx3: 3.65%
Namd: 3.08%
Gcc: 3.06%
H264ref: 3.05%

Namd: 7.8%
bzip2: 7.27%
libquantum: 2.99%
h264ref: 2%
Thanks for the reply!
> On July 23, 2017, at 1:05, River Riddle <riddl...@gmail.com> wrote:
>
> - The explanation as to why the improvements can vary between the IR and MIR outliner mainly boil down to the level of abstraction that each are working at. The MIR level has very accurate heuristics and is effectively the last post ISel target independent code gen pass. The IR outliner on the other hand has more estimation in the cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results.
To clarify, I'm surprised not with % differences (this is understandable), but with differences in what benchmarks got improved. It seems odd that MO, working on lower abstraction level, managed to find redundancies (say, in libquantum) that EO+LO missed. But indeed -- perhaps EO+LO cost model just considered these cases to be non-profitable. It would be interesting to know precisely.
Yours,
Andrey
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Hi River,
Thank you for information. Do you have your code in Phabricator?
As I understand, your pass always runs after the Inliner. At the moment the Inliner is too aggressive at Os/Oz. I am trying to tune the Inliner heuristics for Oz. It would be interesting to see if there are any code size improvements when the Inliner is disabled. This would demonstrate whether the Outliner can improve code size when duplicated code is not due to the Inliner. Another problem I have is that C++ programs get code size regressions due to my changes in the inline heuristics. Have you seen this problem?
Thanks,
Evgeny Astigeevich
On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi Jessica,

The comparison to the inliner is an interesting one, but we think it's important to note the difference in the use of heuristics. The inliner is juggling many different tasks at the same time (execution speed, code size, etc.), which can cause the parameters to be very sensitive depending on the benchmark/platform/etc. The outliner's heuristics are focused solely on the potential code size savings from outlining, and are thus only sensitive to the current platform. This only creates a problem when we are overestimating the potential cost of a set of instructions for a particular target. The cost model parameters are only minimums: instruction sequence length, estimated benefit, occurrence amount. The heuristics themselves are conservative and based upon all of the target information available at the IR level; the parameters just set a lower bound to weed out any outliers. You are correct that being at the machine level, before or after RA, will give the most accurate heuristics, but we feel there's an advantage to being at the IR level. At the IR level we can do many more things that are either too difficult or complex for the machine level (e.g. parameterization/outputs/etc.). Not only can we do these things, but they are available on all targets immediately, without the need for target hooks. The caution on the use of heuristics is understandable, but there comes a point when trade-offs need to be made. We made the trade-off of less exact cost modeling to gain flexibility, coverage, and potential for further features. This trade-off is the same made for quite a few IR level optimizations, including inlining. As for the worry about code size regressions, so far the results seem to support our hypothesis.
Thanks,
River Riddle

On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaq...@apple.com> wrote:

Hi River,

I’m working on the MachineOutliner pass at the MIR level. Working at the IR level sounds interesting! It also seems like our algorithms are similar. I was thinking of taking the suffix array route with the MachineOutliner in the future.

Anyway, I’d like to ask about this:

On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <llvm...@lists.llvm.org> wrote:
> The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model - we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model.

The inliner is threshold-based and it can be rather unpredictable how it will impact the code size of a program. Do you have any idea as to how heuristics/thresholds/parameters could be tuned to prevent this? In my experience, making good code size decisions with these sorts of passes requires a lot of knowledge about what instructions you’re dealing with exactly. I’ve seen the inliner cause some pretty serious code size regressions in projects due to small changes to the cost model/parameters which cause improvements in other projects. I’m a little worried that an IR-level outliner for code size would be doomed to a similar fate.

Perhaps it would be interesting to run this sort of pass pre-register allocation? This would help pull you away from having to use heuristics, but give you some more opportunities for finding repeated instruction sequences. I’ve thought of doing something like this in the future with the MachineOutliner and seeing how it goes.

- Jessica
> The inliner is juggling many different tasks at the same time, execution speed, code size, etc. which can cause the parameters to be very sensitive depending on the benchmark/platform/etc.
> The outliners heuristics are focused solely on the potential code size savings from outlining, and is thus only sensitive to the current platform.
“code size” vs. “code reduction” seems like a really nice way to think about this. I don’t know a lot about the areas in discussion, but one thing that occurred to me while reading this was: a code reduction couldn’t produce a worse code size than without the reduction, right?
— Meador
On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,

I appreciate the feedback. When I reference the cost of target hooks, it's mainly the maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.
I think there is a disconnect on heuristics. The only user-tunable parameters are the lower-bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information.
When you say "Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO." I am slightly confused as to what you mean. I do not intend to try and implement this algorithm at the MIR level given that it exists in Machine Outliner.
There are several comparison benchmarks given in the "More detailed performance data" section of the original RFC. It includes comparisons to the Machine Outliner when possible (I can't build clang on Linux with the Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.
On Jul 24, 2017, at 4:09 PM, Andrey Bokhanko <andreyb...@gmail.com> wrote:

Hi Quentin,

I really don't have a dog in this race, but... we can theorize on strengths and weaknesses of different approaches ad infinitum -- or just accept that "the proof is in the pudding", as they say.
And the proof is here! -- look at the numbers.
If anything, I believe it makes sense to at least accept Riddle's pass into the trunk and let the two approaches evolve in parallel.
It is probably silly to have both enabled at the same time, so the one that currently provides a greater benefit should work at -Os/-Oz. With time, evolution will decide which one should survive -- as it always does.
On Jul 24, 2017, at 4:56 PM, River Riddle <riddl...@gmail.com> wrote:

Hey Quentin,

Sorry I missed the last question. Currently it doesn't do continual outlining, but it's definitely a possibility.
Thanks,
River Riddle

On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,

I understand your points and I believe that some meaning is being lost via email. For performance it's true that that cost isn't necessarily modeled; there is currently only support for using profile data to mitigate that. I was working under the assumption, possibly incorrectly, that at Oz we favor small code over anything else, including runtime performance. This is why we only run at Os if we have profile data, and even then we are very conservative about what we outline from.

I also understand that some target hooks may be required for certain things; it happens for other IR level passes as well. I just want to minimize that behavior as much as I can, though that's a personal choice.

As for a motivating reason to push for this, it actually solves some of the problems that you brought up in the post for the original Machine Outliner RFC. If I can quote you:

"So basically, better cost model. That's one part of the story.
The other part is at the LLVM IR level or before register allocation identifying similar code sequence is much harder, at least with a suffix tree like algorithm. Basically the problem is how do we name our instructions such that we can match them.
Let me take an example.
foo() {
  /* bunch of code */
  a = add b, c;
  d = add e, f;
}
bar() {
  d = add e, g;
  f = add c, w;
}
With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details but it gets tricky quickly. We were thinking of reusing GVN to have such identifier if we wanted to do the outlining at IR level but solving this problem is hard."

This outliner will catch your case; it is actually one of the easiest cases for it to catch. The outliner can recognize semantically equivalent instructions and can be expanded to catch even more.
As for the cost model, it is quite simple:
- We identify all of the external inputs into the sequence. For estimating the benefit, we constant fold and condense the inputs so that we can get the set of *unique* inputs into the sequence.
- We also identify outputs from the sequence: instructions that have external references. We add the cost of storing/loading/passing an output parameter to the outlined function.
- We identify one output to promote to a return value for the function. This can end up reducing an output parameter that is passed to our outlined function.
- We estimate the cost of the sequence of instructions by mostly using TTI.
- With the above we can estimate the cost per occurrence as well as the cost of the new function. Of course these are mostly estimates; we lean towards the conservative side to be safe.
- Finally we can compute an estimated benefit for the sequence taking into account benefit per occurrence and the estimated cost of the new function.
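Put together, the arithmetic is roughly the following sketch (the struct and field names are illustrative, not the pass's actual cost model; the real pass derives these values mostly from TTI):

```cpp
#include <cassert>

// Illustrative cost inputs for one outlining candidate.
struct OutlineCost {
  int SequenceCost;   // estimated size of the candidate instruction sequence
  int NumOccurrences; // how many congruent occurrences exist
  int CallOverhead;   // call instruction plus argument setup per call site
  int NumOutputs;     // outputs; one is promoted to the return value
  int OutputOverhead; // store/load/pass cost per non-returned output
  int FrameOverhead;  // prologue/epilogue etc. of the new function
};

// Each occurrence is replaced by a call, saving the sequence cost minus the
// call and output-passing overhead; the new function's body is paid once.
int estimatedBenefit(const OutlineCost &C) {
  int NonReturnedOutputs = C.NumOutputs > 0 ? C.NumOutputs - 1 : 0;
  int SavingPerOccurrence =
      C.SequenceCost - C.CallOverhead - NonReturnedOutputs * C.OutputOverhead;
  int NewFunctionCost = C.SequenceCost + C.FrameOverhead;
  return C.NumOccurrences * SavingPerOccurrence - NewFunctionCost;
}
```

For instance, a 10-unit sequence occurring 5 times, with call overhead 2, one output (promoted to the return), and frame overhead 3, nets 5*(10-2) - 13 = 27 units.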
There is definitely room for improvement in the cost model. I do not believe it's the best, but it has shown to be somewhat reliable for most cases. It has benefits and it has regressions, as does the machine outliner.
The reasoning for adding the new framework, is because I believe that this transformation should exist at the IR level. Not only because it is the simplest place to put it but also because it provides the greatest opportunities for improvement with the largest coverage.
We can expand the framework to catch Danny's cases from the machine outliner RFC, we can add region outlining, etc. We can do all of this and make it available to every target automatically. The reason why I am for this being at the IR level is not because I want to split up the technical effort, but combine it.
Hey Quentin,

To clarify some of your concerns/questions:
- Currently parameter/output passing is based around the simple cost=1, though this is something that will need to change a bit, as you have pointed out. As I said, the cost model isn't perfect.
- The output that is turned into a return is either one that will remove an output parameter from the function call, or the last output in the instruction sequence.
- As for being conservative with the cost, we prefer to overestimate the cost of outlining (new function cost) and underestimate the cost of the instruction sequence when possible. This means that we tend to underestimate the benefit, given that we are working with approximate costs.
- When I speak of regressions I am referring to the size of the text section.
- The gist of identifying semantic equivalence is the very first sentence of Candidate Selection in the initial RFC. We identify equivalence generally via type, opcode, operand types, and any special state. This is just the current definition; we can relax it even further if we want to.
- Working at the IR level has more target independent advantages than just reducing the amount of new hooks. This can be seen in the fact that we don't require mno-red-zone or any other ABI flag/check, we can easily parameterize sequences, generate outputs from sequences, etc.
- By combining work I mean that adding advancements becomes much easier. If someone wants to add functionality for outlining regions, they don't have to worry about any target specific patchups. It just works, for every target.
Hey Sean,

The bit about the attributes is good to know. When LTO is enabled, the early run will still run per-TU, but the late run will be shifted to work on the full LTO bitcode. Also, I don't have specific numbers on how often parameterization is utilized, but I can assure you that it's a majority of the time.
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:
> Hi Quentin,
> I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size, I agree it makes sense, but I believe, even at Oz, we probably don’t want to slow the code by a big factor for a couple of bytes. That’s what I wanted to say, and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).

> I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information.

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, the fourth and onward parameters need to be pushed on the stack, and so on). Those change the code size benefit.

> When you say "Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO." I am slightly confused as to what you mean. I do not intend to try and implement this algorithm at the MIR level given that it exists in Machine Outliner.

Of course, I don’t expect you to do that :). What I meant is that the claim that IR offers the better trade-off is not based on hard evidence. I actually don’t buy it. My point was to make sure I understand what you are trying to solve and, given you have mentioned the MachineOutliner, why you are not working on improving it instead of suggesting a new framework. Don’t take me wrong, maybe creating a new framework at the IR level is the right thing to do, but I still didn’t get that from your comments.

> There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible (I can't build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward, and I’d like to avoid that if at all possible.
I don't think we have to go that far. A heuristic based on the immediate
argument should be enough.
Joerg
On Jul 24, 2017, at 8:31 PM, River Riddle <riddl...@gmail.com> wrote:
Hey Quentin,
The two passes are pretty different in their approaches to congruency finding, so I don't think it helps to group them as though they were interchangeable "outliner technology". The two passes might be totally orthogonal.
1. if you run the LLVM IR level code size outliner, then the MachineOutliner fails to find any significant redundancies.
2. if you run the LLVM IR level code size outliner, then the MachineOutliner finds just as many redundancies.
As a simple example, many commonly used x86 instructions encode as over 5 bytes, which is the size of a call instruction. So an outlined function that consists of a single instruction can be profitable. IIRC there was about a 5% code size saving just from outlining single instructions (that encode at >5 bytes) at the machine code level on the test case I looked at (I mentioned it in a comment on one of Jessica's patches, IIRC).
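For a rough sense of the arithmetic (a sketch with illustrative numbers: a direct near call on x86-64 encodes as E8 plus a rel32, i.e. 5 bytes, and ret as 1 byte; alignment and relaxation are ignored):

```cpp
// Back-of-the-envelope size accounting for outlining a single x86
// instruction. Each call site shrinks by (instruction size - call size),
// and we pay once for the new function body plus its ret.
int netSavings(int InstrBytes, int Occurrences) {
  const int CallBytes = 5, RetBytes = 1;
  int SavedPerSite = InstrBytes - CallBytes; // per call site
  int NewFuncBytes = InstrBytes + RetBytes;  // the outlined function itself
  return SavedPerSite * Occurrences - NewFuncBytes;
}
```

An 8-byte instruction appearing 10 times nets a saving; a 5-byte instruction can never win, since each site breaks even and the new function is pure overhead.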
Do we have a way to get an instruction's exact encoded length for the MIR outliner?
-- Sean Silva
On Jul 25, 2017, at 9:24 AM, Jessica Paquette <jpaq...@apple.com> wrote:

> The two passes are pretty different in their approaches to congruency finding, so I don't think it helps to group them as though they were interchangeable "outliner technology". The two passes might be totally orthogonal.

I think that based off how River described his algorithm, the actual underlying method should be pretty similar. If it is, then we could probably compare the performance of the suffix tree vs. suffix array method and then create a general outliner that can be run at any level of representation. By that I mean that the actual suffix tree/array candidate search algorithm would be separate from the actual implementation of *outlining things*. The actual implementation at each level of representation would define:
- How to outline a sequence of instructions
- An equivalence/congruence scheme for two instructions
- A cost model
I don’t think it’d be too difficult to split it up in that sort of way. I’d also like to experiment with pre-regalloc outlining in general, so it’d make it easier to explore that route as well.
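As a toy sketch of what the shared candidate-search half could look like (illustrative code, not LLVM's implementation): map each instruction to a congruence-class id, build a suffix array over the id sequence, and scan the LCP array for repeated runs of congruent instructions.

```cpp
#include <algorithm>
#include <vector>

// Toy suffix-array + LCP candidate search over congruence-class ids.
// O(n^2 log n) comparison-based construction; fine for a demo, not for
// a real pass.
using Seq = std::vector<unsigned>;

std::vector<int> suffixArray(const Seq &S) {
  std::vector<int> SA(S.size());
  for (size_t i = 0; i < S.size(); ++i)
    SA[i] = i;
  // Sort suffix start positions by lexicographic order of the suffixes.
  std::sort(SA.begin(), SA.end(), [&](int a, int b) {
    return std::lexicographical_compare(S.begin() + a, S.end(),
                                        S.begin() + b, S.end());
  });
  return SA;
}

// LCP[i] = longest common prefix of suffixes SA[i] and SA[i+1]. Any entry
// >= the minimum candidate length marks a repeated run of congruent
// instructions, i.e. an outlining candidate.
std::vector<int> lcpArray(const Seq &S, const std::vector<int> &SA) {
  std::vector<int> LCP;
  for (size_t i = 0; i + 1 < SA.size(); ++i) {
    int a = SA[i], b = SA[i + 1], l = 0;
    while (a + l < (int)S.size() && b + l < (int)S.size() &&
           S[a + l] == S[b + l])
      ++l;
    LCP.push_back(l);
  }
  return LCP;
}
```

For the id sequence {1, 2, 3, 9, 1, 2, 3}, the maximum LCP entry is 3, recovering the repeated three-instruction chain at positions 0 and 4. The per-level backends would then plug in their own equivalence scheme, cost model, and extraction.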
It might also be cool to see how various levels of outlining + the inliner would interact. For example, say we did something like this:
1. IR outline
2. Inline
3. MIR outline
The inliner should be capable of undoing bad outlining decisions (like outlining from, say, hot loops). The MIR outliner should be able to undo bad inlining decisions. Since each cost model is different (and they’re all working on different code), no pass should ever undo all decisions made by another unless they all ended up being *bad* decisions. It might be possible to have each pass play together in a way that you end up with a nice balance between performance and size.
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:
> Hi Quentin,
> I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say, and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).
> I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information.

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack, and so on). Those change the code size benefit.
> There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible (I can't build clang on Linux with the Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.
On Jul 25, 2017, at 10:36 PM, Mehdi AMINI <joke...@gmail.com> wrote:

2017-07-24 16:14 GMT-07:00 Quentin Colombet via llvm-dev <llvm...@lists.llvm.org>:
> Hi River,
> On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:
>> Hi Quentin,
>> I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.
> If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say, and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).

That's why we have threshold though, don't we?
Also the IR makes it easy to connect to PGO, which allows focusing the outlining on "cold" regions and preserving good performance.

River: did you consider this already? Having a good integration with PGO could make this part of the default optimization pipeline (i.e. having a mode where we outline only the knowingly "cold" code).

>> I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information.
> I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack, and so on). Those change the code size benefit.

How is the inliner doing? How are we handling Oz there? If we are fine living with approximation for the inliner, why wouldn't the same work for an outliner?
>> There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible (I can't build clang on Linux with the Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.
> My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.

It isn't clear to me that implementing it at the Machine level was the right trade-off in the first place.
I'm not sure a full comparative study was performed and discussed upstream at the time where the MachineIR outliner was implemented? If so it wouldn't be fair to ask this to River now.
--Mehdi
On Jul 25, 2017, at 10:36 PM, Mehdi AMINI <joke...@gmail.com> wrote:
> 2017-07-24 16:14 GMT-07:00 Quentin Colombet via llvm-dev <llvm...@lists.llvm.org>:
>> Hi River,
>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:
>>> Hi Quentin,
>>> I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.
>> If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say, and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).
> That's why we have threshold though, don't we?

When I see threshold, I think magic number and I don’t like that.
> Also the IR makes it easy to connect to PGO, which allows focusing the outlining on "cold" regions and preserving good performance. River: did you consider this already? Having a good integration with PGO could make this part of the default optimization pipeline (i.e. having a mode where we outline only the knowingly "cold" code).
>>> I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information.
>> I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack, and so on). Those change the code size benefit.
> How is the inliner doing? How are we handling Oz there? If we are fine living with approximation for the inliner, why wouldn't the same work for an outliner?

Unlike inlining, outlining does not expose optimization opportunities.
>>> There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible (I can't build clang on Linux with the Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.
>> My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.
> It isn't clear to me that implementing it at the Machine level was the right trade-off in the first place.

Fair enough. It has the advantage of not relying on heuristics for its cost model, though.

> I'm not sure a full comparative study was performed and discussed upstream at the time where the MachineIR outliner was implemented? If so it wouldn't be fair to ask this of River now.

I am not asking that :).
On Jul 26, 2017, at 9:36 AM, Mehdi AMINI <joke...@gmail.com> wrote:

2017-07-26 9:31 GMT-07:00 Quentin Colombet <qcol...@apple.com>:
> On Jul 25, 2017, at 10:36 PM, Mehdi AMINI <joke...@gmail.com> wrote:
>> 2017-07-24 16:14 GMT-07:00 Quentin Colombet via llvm-dev <llvm...@lists.llvm.org>:
>>> Hi River,
>>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:
>>>> Hi Quentin,
>>>> I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.
>>> If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say, and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).
>> That's why we have threshold though, don't we?
> When I see threshold, I think magic number and I don’t like that.

Fair, but a heuristic is the best we have when we don't want to optimize for a single metric or we can't have a perfect model of the world.

>> Also the IR makes it easy to connect to PGO, which allows focusing the outlining on "cold" regions and preserving good performance. River: did you consider this already? Having a good integration with PGO could make this part of the default optimization pipeline (i.e. having a mode where we outline only the knowingly "cold" code).
>>>> I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information.
>>> I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack, and so on). Those change the code size benefit.
>> How is the inliner doing? How are we handling Oz there? If we are fine living with approximation for the inliner, why wouldn't the same work for an outliner?
> Unlike inlining, outlining does not expose optimization opportunities.

I would expect that getting the cold code out of the way would help with locality / caching of the hot path. I remember Amaury even working on getting cold *basic blocks* into a different section without outlining them into a function.

But I guess what you mean is that as long as we're focusing solely on getting the smallest possible binary ever, you may be closer to "perfect modeling" very late in the pipeline.
On Wed, Jul 26, 2017 at 10:10 AM, Quentin Colombet via llvm-dev
<llvm...@lists.llvm.org> wrote:
> No, I mean in terms of enabling other optimizations in the pipeline like
> vectorizer. Outliner does not expose any of that.
I have not made a lot of effort to understand the full discussion here (so what
I say below may be off-base), but I think there are some cases where outlining
(especially working with function-attrs) can make optimization easier.
It can help transforms that duplicate code (like loop unrolling and inlining) be
more profitable -- I'm thinking of cases where unrolling/inlining would have to
duplicate a lot of code, but after outlining would require duplicating only a
few call instructions.
It can help EarlyCSE do things that require GVN today:
void foo() {
  ... complex computation that computes func()
  ... complex computation that computes func()
}

outlining =>

int func() { ... }
void foo() {
  int x = func();
  int y = func();
}

functionattrs =>

int func() readonly { ... }
void foo() {
  int x = func();
  int y = func();
}

earlycse =>

int func() readonly { ... }
void foo() {
  int x = func();
  int y = x;
}
GVN will catch this, but EarlyCSE is (at least supposed to be!) cheaper.
Once we have an analysis that can prove that certain functions can't trap,
outlining can allow LICM etc. to speculate entire outlined regions out of loops.
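A hypothetical source-level illustration of that (names are made up; in LLVM terms the inferred attributes would be along the lines of readnone plus nounwind):

```cpp
// Hypothetical sketch: once the extracted region is known to be readnone
// and unable to trap, the whole region becomes a single loop-invariant
// call that LICM can hoist to the preheader. LICM won't hoist the
// multi-instruction region itself, but it will hoist one call.
__attribute__((const)) static int outlinedRegion(int a, int b) {
  return (a * 37 + b) ^ (a >> 2); // stands in for the extracted region
}

int loop(int a, int b, int n) {
  int total = 0;
  for (int i = 0; i < n; ++i)
    total += outlinedRegion(a, b); // invariant: hoistable as a single call
  return total;
}
```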
Generally, I think outlining exposes information that certain regions of the
program are doing identical things. We should expect to get some mileage out of
this information.
-- Sanjoy
On Wed, Jul 26, 2017 at 12:54 PM, Sean Silva <chiso...@gmail.com> wrote:
> The way I interpret Quentin's statement is something like:
>
> - Inlining turns an interprocedural problem into an intraprocedural problem
> - Outlining turns an intraprocedural problem into an interprocedural problem
>
> Insofar as our intraprocedural analyses and transformations are strictly
> more powerful than interprocedural, then there is a precise sense in which
> inlining exposes optimization opportunities while outlining does not.
While I think our intra-proc optimizations are *generally* more
powerful, I don't think they are *always* more powerful. For
instance, LICM (today) won't hoist full regions but it can hoist
single function calls. If we can extract out a region into a
readnone+nounwind function call then LICM will hoist it to the
preheader if the safety checks pass.
> Actually, for his internship last summer River wrote a profile-guided
> outliner / partial inliner (it didn't try to do deduplication; so it was
> more like PartialInliner.cpp). IIRC he found that LLVM's interprocedural
> analyses were so bad that there were pretty adverse effects from many of the
> outlining decisions. E.g. if you outline from the left side of a diamond,
> that side basically becomes a black box to most LLVM analyses and forces
> downstream dataflow meet points to give an overly conservative result, even
> though our standard intraprocedural analyses would have happily dug through
> the left side of the diamond if the code had not been outlined.
>
> Also, River's patch (the one in this thread) does parameterized outlining.
> For example, two sequences containing stores can be outlined even if the
> corresponding stores have different pointers. The pointer to be stored
> through is passed as a parameter to the outlined function. In that sense, the
> outlined function's behavior becomes a conservative approximation of both
> which in principle loses precision.
Can we outline only once we've already done all of these optimizations
that outlining would block?
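As a source-level sketch of the parameterization Sean describes (made-up names; the actual pass works on IR, not C):

```cpp
// Two store sequences differing only in the pointer they write through
// are merged into one outlined function taking the pointer (and the
// differing constant) as parameters.
static void outlinedStores(int *p, int v) {
  *p = v + 1;    // originally two separate but congruent sequences
  *p = *p * 2;
}

int demo() {
  int x = 0, y = 0;
  outlinedStores(&x, 3);  // occurrence 1 originally wrote through &x
  outlinedStores(&y, 10); // occurrence 2 originally wrote through &y
  return x + y;           // (3+1)*2 + (10+1)*2
}
```

The cost, as noted above, is that the outlined body is now a conservative approximation of both occurrences from an analysis point of view.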
> I like your EarlyCSE example and it is interesting that combined with
> functionattrs it can make a "cheap" pass get a transformation that an
> "expensive" pass would otherwise be needed. Are there any cases where we
> only have the "cheap" pass and thus the outlining would be essential for our
> optimization pipeline to get the optimization right?
>
> The case that comes to mind for me is cases where we have some cutoff of
> search depth. Reducing a sequence to a single call (+ functionattr
> inference) can essentially summarize the sequence and effectively increase
> search depth, which might give more results. That seems like a bit of a weak
> example though.
I don't know if River's patch outlines entire control flow regions at
a time, but if it does then we could use cheap basic block scanning
analyses for things that would normally require CFG-level analysis.
-- Sanjoy
Hi,
> I like your EarlyCSE example and it is interesting that combined with
> functionattrs it can make a "cheap" pass get a transformation that an
> "expensive" pass would otherwise be needed. Are there any cases where we
> only have the "cheap" pass and thus the outlining would be essential for our
> optimization pipeline to get the optimization right?
>
> The case that comes to mind for me is cases where we have some cutoff of
> search depth. Reducing a sequence to a single call (+ functionattr
> inference) can essentially summarize the sequence and effectively increase
> search depth, which might give more results. That seems like a bit of a weak
> example though.
I don't know if River's patch outlines entire control flow regions at
a time, but if it does then we could use cheap basic block scanning
analyses for things that would normally require CFG-level analysis.
That’s a fair point.
Using outlining is one possible way of doing that, indeed.
We could get that out of some analysis or improve the different passes, because generally speaking I don’t think you want to add call indirection in performance-sensitive areas.
Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?
Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
- IR outline
- Inline
- MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.
I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different?
Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog. When someone comes to the community with something that has no existing in-tree analog it isn't fair to necessarily ask them to implement it multiple different ways to prove their solution is the best. However, as a community, we do still exercise the right to reject contributions we disagree with, and we frequently request changes to the implementation (as is shown every time someone tries to add SPIR-V support).
In the LLVM community we have a long history of approaching large contributions (especially ones from new contributors) with scrutiny and discussion. It would be a disservice to the project to forget that.

River, as a last note. I see that you've started uploading patches to Phabricator, and I know you're relatively new to the community. When uploading patches it helps to include appropriate reviewers so that the right people see the patches as they come in. To that end can you please include Jessica as a reviewer? Given her relevant domain experience I think her feedback on the patches will be very valuable.
Thank you,
-Chris
Hi Chris,
> One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
> 1. IR outline
> 2. Inline
> 3. MIR outline
IMHO, there is no need to restrict the place of the Outliner in the pipeline at the moment. I hope people representing different architectures will try different configurations and the best will be chosen. I’d like to try this pipeline configuration:
1. Inline
2. IR optimizations
3. IR outline
4. MIR optimizations
5. MIR outline
I think this configuration allows us to apply as many IR optimizations as possible, especially those which reduce code size, and then extract commonly used code into functions. I am also interested in some kind of Oz LTO with the IR Outliner enabled.
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM
> Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.
>
> One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
> - IR outline
> - Inline
> - MIR outline
> In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.

I dismissed the idea of an outliner at the machine level being able to catch bad inlining decisions. Given the loss of information between the two, I felt it was a little optimistic to rely on a very late pass being able to reverse those decisions, especially coupled with the fact that the current machine outliner requires exact equivalence. I don't disagree with the proposal of an example outline-inline-outline pipeline, but the idea of being able to catch bad inlining decisions given the circumstances seemed optimistic to me. From there I went ahead and implemented a generic interface for outlining that can be shared between the IR/Machine level so that such a pipeline could be more feasible.
> I also want to visit a few procedural notes. Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different?

The real question comes from what exactly you want to define as a "comparative analysis". When posting the patch I included additional performance data (found here: goo.gl/5k6wsP) that includes benchmarking and comparisons between the outliner that I am proposing and the machine outliner on a wide variety of benchmarks. The proposed outliner performs quite favorably in comparison. As for feature comparison, the proposed outliner has many features currently missing from the machine outliner:
- parameterization
- outputs
- relaxed equivalence (the machine outliner requires exact equivalence)
- usage of profile data
- support for opt remarks
The machine outliner currently only supports X86 and AArch64; the IR outliner can/should support all targets immediately without the requirement of ABI restrictions (-mno-red-zone is required for the machine outliner). At the IR level we have much more opportunity to find congruent instructions than at the machine level, given the possible variation at that level: RA, instruction selection, instruction scheduling, etc.
> In the LLVM community we have a long history of approaching large contributions (especially ones from new contributors) with scrutiny and discussion. It would be a disservice to the project to forget that. River, as a last note. I see that you've started uploading patches to Phabricator, and I know you're relatively new to the community. When uploading patches it helps to include appropriate reviewers so that the right people see the patches as they come in. To that end can you please include Jessica as a reviewer? Given her relevant domain experience I think her feedback on the patches will be very valuable.

I accidentally posted without any reviewers at first; I've been going back through and adding people I missed.
Hi Eric,
Thank you for the feedback. I must apologise if I have caused any concern or offence.
-Evgeny
2017-07-28 21:58 GMT-07:00 Chris Bieneman via llvm-dev <llvm...@lists.llvm.org>:

> Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.
>
> River,
>
> I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?
>
> Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.
>
> One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
> - IR outline
> - Inline
> - MIR outline
>
> In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.

It isn't clear to me how much the *exact* pipeline and ordering of passes is relevant to considering whether "having an outliner at the IR level" is a good idea.

> I also want to visit a few procedural notes. Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different?

It seems quite different to me because there is no outliner at the IR level, and so they don't provide the same functionality. The "Why at the IR level" section of the original email combined with the performance numbers seems largely enough to me to explain why it isn't redundant with the Machine-level outliner. I'd consider this work for inclusion upstream purely on its technical merit at this point. Discussing inclusion as part of any default pipeline is a different story.

Similarly last year, the IR-level PGO was also implemented even though we already had a PGO implementation, because 1) it provided a generic solution for other frontends (just like here it could be said that it provides a generic solution for targets) and 2) it supported cases that FE-PGO didn't (especially around better counter context using pre-inlining and such).

> Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog.

We still usually discuss design extensively. Skipping the IR-level option didn't seem obvious to me, to say the least. And it wasn't really much discussed/considered extensively upstream.
On Aug 1, 2017 1:18 AM, "Andrey Bokhanko via llvm-dev" <llvm...@lists.llvm.org> wrote:

...and adding $0.02 to the "IR outline + inline + MIR outline" idea, my gut feeling (yes, only a "feeling" -- and one coming from my gut, not head!) is that inlining correcting wrong IR outlining decisions with MIR outlining correcting wrong inlining decisions is absolutely unrealistic and a heuristics nightmare at best. Inliner's heuristics are already complex enough and not 100% bulletproof; if we add IR outliner heuristics to the mix -- and then just a little bit of MIR outliner heuristics (which are more precise but, as demonstrated above, not 100% precise as well)

The post-RA MachineOutliner can in principle have a perfect cost model (and it's already pretty close, I think). However, the post-RA MachineOutliner would nonetheless have a hard time undoing an inlining decision, because it would have to get lucky with the register assignment matching at different inlining sites. (The same applies to scheduling decisions.) I can't think of any plausible way to avoid that sort of problem. A pre-RA MIR outliner could avoid needing to get lucky on the register assignment, at least, but any string-based algorithm will inherently be thwarted by different scheduling.
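The scheduling point above is easy to see with a toy example. The following sketch is hypothetical (not from the thread): the two "sites" compute the same data-flow DAG, but a string-based matcher, which only sees the linear instruction sequence, cannot match them.

```python
# Hypothetical sketch (not from the thread). Each instruction is
# (dest, opcode, lhs, rhs); SSA names are unique within a site.

def as_string(seq):
    """What a string/suffix-array matcher sees: the linear opcode sequence."""
    return [op for _, op, _, _ in seq]

def fold(seq):
    """Schedule-independent view: resolve temporaries so the final value is a
    nested tuple describing the data-flow DAG."""
    env = {}
    for dest, op, lhs, rhs in seq:
        env[dest] = (op, env.get(lhs, lhs), env.get(rhs, rhs))
    return env[dest]  # value defined by the last instruction

# Both sites compute (a + b) * (c - d); the two independent sub-expressions
# are scheduled in opposite orders.
site1 = [("t0", "add", "a", "b"), ("t1", "sub", "c", "d"),
         ("t2", "mul", "t0", "t1")]
site2 = [("s0", "sub", "c", "d"), ("s1", "add", "a", "b"),
         ("s2", "mul", "s1", "s0")]

print(as_string(site1) == as_string(site2))  # False: sequences differ
print(fold(site1) == fold(site2))            # True: same data-flow DAG
```

A sequence matcher only finds this candidate if both sites happen to pick the same schedule, which is exactly the "getting lucky" described above.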
On Jul 31, 2017, at 10:38 PM, Mehdi AMINI <joke...@gmail.com> wrote:

2017-07-28 21:58 GMT-07:00 Chris Bieneman via llvm-dev <llvm...@lists.llvm.org>:

Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"? Given that there are a number of disagreements and opinions floating around, I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
- IR outline
- Inline
- MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide-reaching expertise), I think dismissing it may have been a little premature.

It isn't clear to me how much the *exact* pipeline and ordering of passes is relevant to consider if "having an outliner at the IR level" is a good idea.
I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree, it would be appropriate to ask for a comparative analysis against the existing pass. How is this different?

It seems quite different to me because there is no outliner at the IR level, and so they don't provide the same functionality. The "Why at the IR level" section of the original email combined with the performance numbers seems largely enough to me to explain why it isn't redundant to the Machine-level outliner. I'd consider this work for inclusion upstream purely on its technical merit at this point.
Discussing inclusion in any of the default pipelines is a different story.
Similarly, last year the IR-level PGO was also implemented even though we already had a PGO implementation, because 1) it provided a generic solution for other frontends (just like here it could be said that it provides a generic solution for targets) and 2) it supported cases that FE-PGO didn't (especially around better counter context using pre-inlining and such).

Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog.

We still usually discuss design extensively. Skipping the IR-level option didn't seem obvious to me, to say the least. And it wasn't really discussed/considered extensively upstream.
If the idea is that implementing a concept at the machine level may preclude a future implementation at the IR level, it means we should be *a lot* more picky before accepting such contributions.
In this case, if I had anticipated any push-back on an IR-level implementation based only on the fact that we now have a Machine-level one, I'd likely have pushed back on the machine-level one.
When someone comes to the community with something that has no existing in-tree analog, it isn't fair to necessarily ask them to implement it multiple different ways to prove their solution is the best.

It may or may not be fair, but there is a tradeoff in how much effort we would require of them to convince the community that this is *the* right way to go, depending on what it implies for future approaches.
On Aug 1, 2017, at 1:07 AM, Andrey Bokhanko via llvm-dev <llvm...@lists.llvm.org> wrote:

All,
+1 to what Mehdi said. It's a fair concern to question whether we need yet another outlining pass. I believe this concern has been cleared by River -- both with theoretical arguments and practical data (benchmark numbers).

I want to point out that River has not provided full raw benchmark data; he has provided summarized data. Also, as a community of engineers, I think theoretical arguments are dangerous to accept. The community has also not been presented with sufficient information to understand some of the differences between the numbers for the IR outliner and MIR outliner. Since the two are using very similar algorithms for detecting outlining candidates, I would expect the numbers to be very similar; however, in many cases they are not. I think understanding those differences and comparing them will help inform whether or not the IR outliner needs changes before being integrated into LLVM. I'm not advocating that we shouldn't take the IR outliner (in fact I think IR outlining is an interesting optimization). I'm advocating that we should understand and explore how this pass will fit into LLVM today and in the future.
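For context, the detection scheme the two outliners are described as sharing (per the RFC: map each instruction to a congruence-class id, then find repeated chains via a suffix array and an LCP array) can be sketched roughly as below. This is a hypothetical simplification; the naive O(n² log n) suffix-array construction is not what either pass actually uses.

```python
# Toy sketch of suffix-array + LCP candidate detection over an instruction
# stream that has already been mapped to congruence-class ids.

def suffix_array(s):
    # Naive construction, fine for a sketch; real passes use faster algorithms.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[k] = length of the common prefix of the suffixes at sa[k-1], sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        i, j = sa[k - 1], sa[k]
        while (i + lcp[k] < len(s) and j + lcp[k] < len(s)
               and s[i + lcp[k]] == s[j + lcp[k]]):
            lcp[k] += 1
    return lcp

# The repeated run [1, 2, 3] is an outlining candidate with two occurrences.
ids = [1, 2, 3, 9, 1, 2, 3, 7]
sa = suffix_array(ids)
lcp = lcp_array(ids, sa)
print(max(lcp))  # 3: longest congruent chain repeated in the stream
```

Walking the LCP array for adjacent suffixes with a nonzero common prefix yields the congruent chains and their occurrence positions.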
Jessica's pipeline proposal is completely orthogonal. It's not fair to request River to implement / fit into what she suggested. Sure, it's a valid topic to discuss -- but a completely orthogonal one. If anything, accepting River's implementation would enable us to do experiments / developments like pipeline changes of this ilk!

I completely disagree. When considering accepting a new pass to LLVM, it is completely reasonable, appropriate, and not orthogonal to consider how that pass would interact with other existing passes. Also, given that in the past we've requested much more of contributors trying to make large contributions (see the many SPIR-V discussions), I think it is totally fair to request that River consider Jessica's suggestion, and if the community felt that was the right approach, it would be fair to request that he implement it.

Please keep in mind that every new piece of functionality added to LLVM adds a maintenance burden on the community. If we are going to accept the burden of maintaining a new pass, it is reasonable for us, as a community, to request that the pass be engineered in a way that will make it worth that maintenance.

Further, I am incredibly frustrated by the fact that people in this thread seem intent on squashing relevant technical discussions or turning them into combative arguments. There is nothing wrong with asking technical questions on an RFC, and we should be encouraging cooperative discussions of present and future implications of all proposals.
-Chris
On Aug 1, 2017, at 10:28 AM, Daniel Berlin via llvm-dev <llvm...@lists.llvm.org> wrote:

Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time. I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR, e.g. in LLVM IR or pre-RA MIR
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm, and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;) All that's been shown is that you can reduce it to a hard problem. You can also reduce it to 3-SAT, but that doesn't mean anything unless you can reduce 3-SAT to it.

5. The actual problem the outliner is trying to solve is actually more like finding subgraphs that are isomorphic, making the problem even harder (something like "given dags A and B, does there exist a subgraph of A that is isomorphic to a subgraph of B")

This assumes, strongly, that this reduction is the way to do it, and also that SSA/etc. imparts no structure in the reduction that enables you to solve it faster (i.e., restricts the types of graphs, etc.). FWIW, the exact argument above holds for value numbering, and since the days of Kildall, it was believed not solvable in a complete fashion in less than exponential time due to the need to do graph isomorphism on large value graphs at join points. Except, much like RA and other things, it turns out this is not right for SSA, and in the past few years it was proved that you could do complete value numbering in polynomial time on SSA. So with all due respect to Quentin, I don't buy it yet.

Without more, I'd bet that techniques similar to those used to solve value numbering in polynomial time on SSA could be applied here. This is because the complete polynomial value numbering techniques are, in fact, value graph isomorphism... So I'd assume the opposite (it can be done in polynomial time) without more.
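The value-numbering idea alluded to here hashes each SSA instruction by its opcode and the value numbers of its operands, so congruent computations collapse to the same number in polynomial time. A minimal hypothetical sketch (not LLVM's GVN; assumes straight-line SSA with globally unique destination names and matching leaf/argument names):

```python
# Toy hash-based value numbering across two functions' bodies.

def value_number(funcs):
    """Hash each instruction by (opcode, value numbers of its operands);
    congruent computations receive the same value number."""
    vn = {}      # SSA name -> value number
    table = {}   # hash key -> value number
    for insts in funcs:
        for dest, op, *args in insts:
            key = (op,) + tuple(vn.get(a, ("leaf", a)) for a in args)
            vn[dest] = table.setdefault(key, len(table))
    return vn

# f(a, b): x = a + b; y = x * x      g(a, b): p = a + b; q = p * p
f = [("x", "add", "a", "b"), ("y", "mul", "x", "x")]
g = [("p", "add", "a", "b"), ("q", "mul", "p", "p")]
vn = value_number([f, g])
print(vn["y"] == vn["q"])  # True: the two bodies are congruent
```

Whether this style of congruence-finding extends to the full subgraph-matching problem the outliner faces is precisely the open question being debated here.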
FWIW, I didn't make any claim on actual complexity (I think? :)). We just didn't try it and didn't have time to fit that into the schedule of an outliner for an internship.
On Jul 31, 2017, at 10:38 PM, Mehdi AMINI <joke...@gmail.com> wrote:

2017-07-28 21:58 GMT-07:00 Chris Bieneman via llvm-dev <llvm...@lists.llvm.org>:

Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"? Given that there are a number of disagreements and opinions floating around, I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
- IR outline
- Inline
- MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide-reaching expertise), I think dismissing it may have been a little premature.

It isn't clear to me how much the *exact* pipeline and ordering of passes is relevant to consider if "having an outliner at the IR level" is a good idea.

I think it is particularly relevant, because based on the limited performance numbers we've seen, it looks like the MIR and IR outliners have different benefits. Figuring out a pipeline where one doesn't prevent the other from performing good optimizations seems like a reasonable precondition to accepting these patches.
I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree, it would be appropriate to ask for a comparative analysis against the existing pass. How is this different?

It seems quite different to me because there is no outliner at the IR level, and so they don't provide the same functionality. The "Why at the IR level" section of the original email combined with the performance numbers seems largely enough to me to explain why it isn't redundant to the Machine-level outliner. I'd consider this work for inclusion upstream purely on its technical merit at this point.

I believe the technical merit has not been shown clearly enough. The only data we've seen has been cherry-picked, and there are outstanding technical questions about the approach.
Discussing inclusion in any of the default pipelines is a different story.

The patches that were sent out *do* include it in default pass pipelines.
Similarly, last year the IR-level PGO was also implemented even though we already had a PGO implementation, because 1) it provided a generic solution for other frontends (just like here it could be said that it provides a generic solution for targets) and 2) it supported cases that FE-PGO didn't (especially around better counter context using pre-inlining and such).

Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog.

We still usually discuss design extensively. Skipping the IR-level option didn't seem obvious to me, to say the least. And it wasn't really discussed/considered extensively upstream.

The reasoning for this was covered in the discussions and in Jessica's LLVM dev meeting talk. It may not have been widely discussed because it was widely agreed on.
If the idea is that implementing a concept at the machine level may preclude a future implementation at the IR level, it means we should be *a lot* more picky before accepting such contributions.

Nobody is precluding an IR implementation. We are merely holding the IR implementation to the same high standards of justification that we held the MIR one to. You may not recall this, but the MIR one took *months* to go from RFC to landing in-tree.
In this case, if I had anticipated any push-back on an IR-level implementation based only on the fact that we now have a Machine-level one, I'd likely have pushed back on the machine-level one.

There is no pushback based solely on the presence of the MIR outliner.
On Tue, Aug 1, 2017 at 10:28 AM, Daniel Berlin <dbe...@dberlin.org> wrote:

Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time. I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR, e.g. in LLVM IR or pre-RA MIR
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm, and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;) All that's been shown is that you can reduce it to a hard problem. You can also reduce it to 3-SAT, but that doesn't mean anything unless you can reduce 3-SAT to it.

Only one direction is needed to prove it is GI-hard.
On Tue, Aug 1, 2017 at 12:28 PM, Sean Silva <chiso...@gmail.com> wrote:

Only one direction is needed to prove it is GI-hard.

Yes sorry, but you still haven't done it. So if you really want to do it, let's start with a definition: "The precise definition here is that a problem X is NP-hard, if there is an NP-complete problem Y, such that Y is reducible to X in polynomial time." For this to be GI-hard you must be able to reduce *every* GI problem to one about SSA-based outlining, and you must be able to do so in *polynomial time*. The above does not do this.

Things that would need proof:

1. "it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR, e.g. in LLVM IR or pre-RA MIR"
A routine or proof to do this in polynomial time ;) Please make sure you do an arbitrary dag, and not just rooted DAGs, as isomorphism on rooted DAGs can be solved in linear time ;)

3. "An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them."
Proof that code outlining necessarily requires discovering full isomorphism between the graphs (i.e., that whatever process code outlining uses must end up doing a thing that would prove isomorphism).
Keep in mind that the way we ended up with linear-time dataflow was precisely to prove that it did not actually require doing things that reduced to boolean matrix multiplication. But let's take this offline, as it's going to go into a very theoretical route.
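For what it's worth, the emission half of the disputed step 1 is straightforwardly polynomial: one pseudo-SSA instruction per DAG node, in topological order. A hypothetical sketch follows; it addresses only the mapping itself, not the harder obligation that outlining must then decide isomorphism.

```python
# Hypothetical sketch: turn an arbitrary DAG into straight-line pseudo-SSA
# whose data-flow graph is isomorphic to it. Linear in nodes + edges. Node
# labels are deliberately uniform, since GI instances are unlabeled.

def dag_to_ssa(nodes, preds):
    """nodes: iterable of node ids; preds: dict node -> list of predecessors.
    Returns pseudo-SSA instructions, defs always preceding uses."""
    emitted, out = set(), []
    def emit(n):
        if n in emitted:
            return
        emitted.add(n)
        for p in preds.get(n, []):   # emit operands first (post-order DFS)
            emit(p)
        ops = ", ".join(f"%{p}" for p in preds.get(n, [])) or "%undef"
        out.append(f"%{n} = op({ops})")
    for n in nodes:
        emit(n)
    return out

print("\n".join(dag_to_ssa(["a", "b", "c", "d"],
                           {"c": ["a", "b"], "d": ["a", "c"]})))
```

Whether any practical outliner must then fully *decide* isomorphism between such graphs is exactly the point left unresolved in the exchange above.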