[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.


River Riddle via llvm-dev

Jul 20, 2017, 6:48:07 PM
to llvm...@lists.llvm.org

I’m River, a compiler engineer at PlayStation. Recently I’ve been working on an interprocedural outlining (code folding) pass for code size improvement at the IR level. We hit a couple of use cases that the current code size solutions don’t handle well enough, and outlining is one of the avenues that seemed potentially beneficial.


-- Algorithmic Approach --

The general implementation can be explained in stages:


* Candidate Selection:

Each instruction in the module is mapped to a congruence class based upon relaxed equivalence constraints. For most instructions this is simply the type, opcode, and operand types. The constraints are tightened for instructions with special state that require more exact equivalence (e.g. ShuffleVector requires a constant mask vector). Candidates are then found by constructing a suffix array and an LCP (longest common prefix) array. Walking the LCP array gives us congruent chains of instructions within the module.
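The selection stage can be sketched in standalone C++ with plain integers standing in for congruence-class IDs (an illustrative sketch only; the real pass maps LLVM instructions to classes and uses a more involved construction):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Naive O(n^2 log n) suffix-array construction over the sequence of
// congruence-class IDs (one ID per instruction in module order).
std::vector<int> buildSuffixArray(const std::vector<int> &IDs) {
  std::vector<int> SA(IDs.size());
  std::iota(SA.begin(), SA.end(), 0);
  std::sort(SA.begin(), SA.end(), [&](int A, int B) {
    return std::lexicographical_compare(IDs.begin() + A, IDs.end(),
                                        IDs.begin() + B, IDs.end());
  });
  return SA;
}

// LCP[i] = length of the longest common prefix of the suffixes at SA[i-1]
// and SA[i]. Any entry >= some minimum length identifies a congruent chain
// of instructions that occurs at least twice in the module.
std::vector<int> buildLCPArray(const std::vector<int> &IDs,
                               const std::vector<int> &SA) {
  std::vector<int> LCP(SA.size(), 0);
  for (std::size_t I = 1; I < SA.size(); ++I) {
    std::size_t A = SA[I - 1], B = SA[I], Len = 0;
    while (A + Len < IDs.size() && B + Len < IDs.size() &&
           IDs[A + Len] == IDs[B + Len])
      ++Len;
    LCP[I] = static_cast<int>(Len);
  }
  return LCP;
}
```

For example, over the ID sequence {1, 2, 3, 9, 1, 2, 3} the two suffixes beginning with 1 sort adjacently and share an LCP of 3, recovering the repeated three-instruction chain.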


* Candidate Analysis:

A verification step splits candidates that have different internal input sequences or incompatible parent-function attributes between occurrences. An example of incompatible internal input sequences:

X = W + 6;   vs    X = W + 6;

Y = X + 4;            Y = W + 4;

The above two occurrences would need special control flow to exist within the same outlined function.


During analysis, each candidate has its inputs and outputs computed along with an estimated benefit from extraction. During input calculation a constant-folding step removes all inputs that are the same amongst all occurrences.
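That constant-folding step can be illustrated as follows (a hypothetical sketch with integers standing in for input values; `foldableInputSlots` is an invented name, not the pass's API):

```cpp
#include <cassert>
#include <vector>

// Given the input values each occurrence would pass to the outlined
// function, return the slots whose value is identical across all
// occurrences: those inputs can be materialized as constants inside the
// outlined function and dropped from the parameter list.
std::vector<std::size_t>
foldableInputSlots(const std::vector<std::vector<int>> &OccurrenceInputs) {
  std::vector<std::size_t> Foldable;
  if (OccurrenceInputs.empty())
    return Foldable;
  for (std::size_t Slot = 0; Slot < OccurrenceInputs.front().size(); ++Slot) {
    bool AllSame = true;
    for (const auto &Inputs : OccurrenceInputs)
      AllSame &= Inputs[Slot] == OccurrenceInputs.front()[Slot];
    if (AllSame)
      Foldable.push_back(Slot);
  }
  return Foldable;
}
```

With two occurrences passing {7, 1, 5} and {7, 2, 5}, slots 0 and 2 fold away and only the middle value remains a parameter.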


* Candidate Pruning:

Overlapping candidates are pruned with a generic greedy algorithm that picks occurrences starting from the most beneficial candidates.
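A minimal sketch of that greedy pruning, with occurrences modeled as half-open instruction-index intervals (names and representation are illustrative only, not the pass's actual data structures):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Candidate {
  int Benefit;                                  // estimated size savings
  std::vector<std::pair<int, int>> Occurrences; // half-open [begin, end)
};

// Visit candidates from most to least beneficial; claim each occurrence
// whose instructions are still unclaimed, and drop occurrences that overlap
// an already-claimed range.
std::vector<Candidate> pruneOverlaps(std::vector<Candidate> Candidates,
                                     int NumInstructions) {
  std::sort(Candidates.begin(), Candidates.end(),
            [](const Candidate &A, const Candidate &B) {
              return A.Benefit > B.Benefit;
            });
  std::vector<bool> Claimed(NumInstructions, false);
  std::vector<Candidate> Kept;
  for (Candidate &C : Candidates) {
    std::vector<std::pair<int, int>> Viable;
    for (auto [Begin, End] : C.Occurrences) {
      bool Overlaps = false;
      for (int I = Begin; I < End; ++I)
        Overlaps |= Claimed[I];
      if (Overlaps)
        continue;
      std::fill(Claimed.begin() + Begin, Claimed.begin() + End, true);
      Viable.push_back({Begin, End});
    }
    // Outlining only pays off with at least two surviving occurrences.
    if (Viable.size() >= 2)
      Kept.push_back({C.Benefit, std::move(Viable)});
  }
  return Kept;
}
```

A candidate that loses all but one occurrence to overlap is dropped entirely, since a single occurrence plus a call is never a size win.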


* Candidate Outlining:

Non-pruned candidates are then outlined. Outputs from a candidate are returned via output parameters, except for one output, which is promoted to a return value. During outlining, the inputs to the candidate are condensed by computing the equivalences between the arguments at each occurrence. An example of this is:


outlinedFn(1,6,1);  ->  outlinedFn(1,6);

outlinedFn(2,4,2);  ->  outlinedFn(2,4);


In the above, parameters 1 and 3 were found to be equivalent for all occurrences, so the number of inputs was reduced to 2.
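The condensing above amounts to merging parameter columns that agree at every call site (an illustrative sketch; `condenseArguments` is an invented name):

```cpp
#include <cassert>
#include <vector>

// For each parameter slot, if some earlier slot passes an identical value
// at every occurrence, the later slot reuses the earlier parameter.
// Returns, for each original slot, its condensed slot index.
std::vector<std::size_t>
condenseArguments(const std::vector<std::vector<int>> &CallArgs) {
  std::size_t NumSlots = CallArgs.front().size();
  std::vector<std::size_t> Remap(NumSlots);
  std::size_t NextSlot = 0;
  for (std::size_t S = 0; S < NumSlots; ++S) {
    bool Merged = false;
    for (std::size_t Prev = 0; Prev < S && !Merged; ++Prev) {
      bool SameEverywhere = true;
      for (const auto &Args : CallArgs)
        SameEverywhere &= Args[S] == Args[Prev];
      if (SameEverywhere) {
        Remap[S] = Remap[Prev]; // reuse the earlier parameter
        Merged = true;
      }
    }
    if (!Merged)
      Remap[S] = NextSlot++;
  }
  return Remap;
}
```

For the calls outlinedFn(1,6,1) and outlinedFn(2,4,2) from the example, slots 0 and 2 agree at every call, so the remap is {0, 1, 0} and the condensed arity is 2.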


* Debug Info:

Debug information is preserved for calls to the outlined functions, but all debug info from the original outlined portions is removed, making them harder to debug.


* Profile Info:

If the pass is run at Os the outliner will only consider cold blocks, whereas at Oz it considers all blocks that are not marked as hot.


* Location in Pipeline:

The pass is currently configured to run very late in the optimization pipeline. It is intended to run at Oz, but will also run at Os if profile data is available. The pass can optionally be run twice: once before function simplification and then again at the default location. The early run is optional because it trades the potential benefits of redundancy elimination against the potential benefits of function simplification. This can lead to large benefits or regressions depending on the benchmark (though the benefits tend to outnumber the regressions). The performance section contains data for both configurations on a variety of benchmarks.


-- Why at the IR level --

The decision to place the outliner at the IR level comes from a couple of major factors:

- Desire to be target independent

- Most opportunities for congruency


The downside to having this transformation at the IR level is less accuracy in the cost model: we can somewhat accurately model the cost per instruction, but we can't know how a window of instructions will lower. This can cause regressions depending on the platform/codebase, so to help alleviate this the cost model exposes several tunable parameters.


-- Performance --

More results including clang, llvm-tblgen, and more specific numbers about benefits/regressions can be found in the notes section below.

* Size Reduction:

- Test Suite(X86_64):

   - Early+Late outlining provides a geomean of 10.5% reduction over clang Oz, with a largest improvement of ~67% and largest regression of ~7.5%.

   - Late outlining provides a geomean of 4.65% reduction, with a largest improvement of ~51% and largest regression of ~6.4%.

- Spec 2006(X86_64)

   - Early+Late outlining provides a geomean reduction of 2.08%.

   - Late outlining provides 2.09%.

- CSiBE(AArch64)

   - Early+Late outlining shows geomean reduction of around 3.5%.

   - Late outlining shows 3.1%.


* Compile Time:

Compile time was tested under test-suite with a multisample of 5.

- Early+Late outlining

   - Many improvements with > 40% compile time reduction.

   - Few regressions.

- Late outlining

   - Greatest improvement is ~7.8%

   - Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction with early outlining is that removing redundancy early in the optimization pipeline reduces the amount of instruction processing during the rest of the compilation.


* Execution Time:

Ran with a multisample of 5.

- Test Suite:

   - Early+Late outlining has many regressions up to 97%. The greatest improvement was around 7.5%.

   - Late outlining also has several regressions up to 44% and a greatest improvement of around 5.3%.

- Spec:

   - Early+Late has a geomean regression of 3.5%.

   - Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining.


-- Tested Improvements --

* MergeFunctions:

   - No noticeable benefit.

* LTO:

   - LTO doesn’t have a code size pipeline, but % reductions over LTO are comparable to non-LTO.

* Input/Output Partitioning:

   - This identifies inputs/outputs that may be folded by splitting a candidate. The benefit is minimal for the computation required.

* Similar Candidate Merging:

   - The benefit to be gained is currently not worth the large complexity required to catch the desired cases.


-- Potential Improvements --

* Suffix & LCP array construction: The pass currently uses a very basic implementation that could be improved. There are definitely faster algorithms, and some can construct both arrays simultaneously. We will investigate this as a potential compile time improvement in the future.
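One known option in this direction is Kasai et al.'s linear-time algorithm, which derives the LCP array from an already-built suffix array instead of recomputing prefixes from scratch (a sketch under the same integer-ID representation; the pass would run this over congruence-class IDs):

```cpp
#include <cassert>
#include <vector>

// Kasai's algorithm: given a sequence and its suffix array, compute
// LCP[i] = lcp(suffix SA[i-1], suffix SA[i]) in O(n) total time, reusing
// the previous suffix's match length (which can shrink by at most one).
std::vector<int> kasaiLCP(const std::vector<int> &Seq,
                          const std::vector<int> &SA) {
  int N = static_cast<int>(Seq.size());
  std::vector<int> Rank(N), LCP(N, 0);
  for (int I = 0; I < N; ++I)
    Rank[SA[I]] = I;
  int H = 0; // current match length, carried between iterations
  for (int I = 0; I < N; ++I) {
    if (Rank[I] == 0) {
      H = 0;
      continue;
    }
    int J = SA[Rank[I] - 1]; // suffix preceding suffix I in sorted order
    while (I + H < N && J + H < N && Seq[I + H] == Seq[J + H])
      ++H;
    LCP[Rank[I]] = H;
    if (H > 0)
      --H;
  }
  return LCP;
}
```

This replaces the O(n^2) pairwise prefix comparison with a single amortized pass, which matters once the module's instruction sequence gets large.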


* Os performance tuning: Under -Os the pass currently only runs on cold blocks. Ideally we could expand this to blocks that are less frequently executed but not strictly cold.


* Candidate Selection: The algorithm currently focuses on the longest common sequences. More checks could be added to see if shortening the sequence length produces a larger benefit (e.g. fewer inputs/outputs). This would likely increase compile time, but should remain below the baseline.


* Non Exact Functions: The outliner currently ignores functions that do not have an exact definition.


-- Notes --

* CSiBE(Code Size Benchmark):

www.csibe.org


* More detailed performance data:

goo.gl/5k6wsP


* Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp


Evgeny Astigeevich via llvm-dev

Jul 21, 2017, 5:24:27 AM
to River Riddle, llvm...@lists.llvm.org, nd

Hi River,


Do you know LLVM has the MachineOutliner pass (lib/CodeGen/MachineOutliner.cpp)? 

It would be interesting to compare your approach with it.


Thanks,

Evgeny Astigeevich


River Riddle via llvm-dev

Jul 21, 2017, 1:02:04 PM
to Evgeny Astigeevich, llvm...@lists.llvm.org, nd
Hi Evgeny,
 I know of the current MachineOutliner in LLVM. The "More detailed performance data" link in the notes section includes performance comparisons to the MachineOutliner.
 As for the algorithmic approach, they are kind of similar.
Machine Outliner:
  - Builds a suffix tree based on identical equivalence between machine instrs. 
  - Uses target specific cost model for benefit estimation.
  - Uses a greedy algorithm for candidate pruning.
  - Currently supports X86, AArch64 targets.
  - Requires NoRedZone on supported targets.

IR Outliner:
  - Builds a suffix array based upon relaxed equivalence between IR instructions.
  - Allows for inputs/outputs from an outlined function.
  - Complex verification of outlining candidates due to the above two points (relaxed equivalence, inputs/outputs).
  - Tunable generic target independent cost model (estimates constant folded inputs, output condensing, etc.).
       - Note it uses the target info available (TargetTransformInfo) 
  - Uses a greedy interval based algorithm for candidate pruning.
  - Supports profile data.
  - Supports emitting opt remarks.
  - Target independent, thus no required ABI constraints or hooks.
Though some parts of the approach are similar, the implementation is very different. Also of note: the IR outliner can be expanded to outline from multiple basic blocks. I've already prototyped the infrastructure for outlining regions, so it's possible.
  
Thanks,
  River Riddle

Andrey Bokhanko via llvm-dev

Jul 22, 2017, 6:23:31 PM
to River Riddle, llvm-dev
Hi River,

Very impressive! -- thanks for working on this.

A few questions, if you don't mind.

First, on results (from goo.gl/5k6wsP). Some of them are quite surprising. In theory, "top improvements" should be quite similar in all three approaches ("Early&Late Outlining", "Late Outlining" and "Machine Outliner"), with E&LO capturing most of the cases. Yet, they are very different:

Test Suite, top improvements:
E&LO:
  • enc-3des: 66.31%

  • StatementReordering-dbl: 51.45%

  • Symbolics-dbl: 51.42%

  • Recurrences-dbl: 51.38%

  • Packing-dbl: 51.33%

LO:
  • enc-3des: 50.7%

  • ecbdes: 46.27%

  • security-rjindael:45.13%

  • ControlFlow-flt: 25.79%

  • ControlFlow-dbl: 25.74%

MO:
  • ecbdes: 28.22%

  • Expansion-flt: 22.56%

  • Recurrences-flt: 22.19%

  • StatementReordering-flt: 22.15%

  • Searching-flt: 21.96%


SPEC, top improvements:
E&LO:
  • bzip2: 9.15%

  • gcc: 4.03%

  • sphinx3: 3.8%

  • H264ref: 3.24%

  • Perlbench: 3%

LO:
  • bzip2: 7.27%

  • sphinx3: 3.65%

  • Namd: 3.08%

  • Gcc: 3.06%

  • H264ref: 3.05%

MO:
  • Namd: 7.8%

  • bzip2: 7.27%

  • libquantum: 2.99%

  • h264ref: 2%


Do you understand why so?

I'm especially interested in cases where MO managed to find redundancies while E&LO didn't. For example, 2.99% on libquantum (or is it simply below the "top 5 results" for E&LO?) -- did you investigate this?

Also, it would be nice to specify full options list for SPEC (I assume SPEC CPU2006?), similar to how results are reported on spec.org.

And a few questions on the RFC:

On Fri, Jul 21, 2017 at 12:47 AM, River Riddle via llvm-dev <llvm...@lists.llvm.org> wrote:

* Debug Info:

Debug information is preserved for the calls to functions which have been outlined but all debug info from the original outlined portions is removed, making them harder to debug.


Just to check I understand it correctly: you remove *all* debug info in outlined functions, essentially making them undebuggable -- correct? Did you consider copying debug info from one of the outlined fragments instead -- at least line numbers?

The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining.


Some of regressions are quite severe. It would be interesting to implement what you stated above and measure -- both code size reductions and performance degradations -- again.
 

* LTO:

   - LTO doesn’t have a code size pipeline, but %reductions over LTO are comparable to non LTO.


LTO is known to affect code size significantly (for example, by removing redundant functions), so I'm frankly quite surprised that the results are the same...

Yours,
Andrey
===
Compiler Architect
NXP

River Riddle via llvm-dev

Jul 22, 2017, 7:05:52 PM
to Andrey Bokhanko, llvm-dev
Hi Andrey,
 Questions and feedback are very much welcome. 
- The explanation as to why the improvements vary between the IR and MIR outliners mainly boils down to the level of abstraction each works at. The MIR level has very accurate heuristics and is effectively the last post-ISel target-independent codegen pass. The IR outliner, on the other hand, has more estimation in the cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results. It's important to remember the differences that being at a higher abstraction level can cause.
- As for the SPEC (it is 2006, sorry I missed that) command-line options: as with any other benchmark, I only added "-Oz -mno-red-zone (to keep comparisons fair) -(whatever enables each transform)" to the default flags needed for compilation. I'll try to get the exact command-line options used and add them.
- Debug information (line numbers) is currently only kept if it is the same across all occurrences. This was simply a design choice and can be changed if keeping one instance is the desired behavior.
- The behavior described with the profile data is already implemented; I will update the numbers to include the results after including PGO data.
- The LTO results are very experimental given that there isn't a size pipeline for LTO yet (there should be). The % improvements can be similar to non-LTO, but because the LTO binary is generally much smaller, the actual decrease in size is also much smaller. I'll add more detailed LTO numbers as well.
Thanks,
River Riddle

via llvm-dev

Jul 23, 2017, 3:43:16 AM
to River Riddle, llvm-dev
River,

Thanks for the reply!

> On July 23, 2017, at 1:05, River Riddle <riddl...@gmail.com> wrote:


>
> - The explanation as to why the improvements can vary between the IR and MIR outliner mainly boil down to the level of abstraction that each are working at. The MIR level has very accurate heuristics and is effectively the last post ISel target independent code gen pass. The IR outliner on the other hand has more estimation in the cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results.

To clarify, I'm surprised not by the % differences (those are understandable), but by differences in which benchmarks got improved. It seems odd that MO, working at a lower abstraction level, managed to find redundancies (say, in libquantum) that E&LO missed. But indeed -- perhaps the E&LO cost model just considered these cases non-profitable. It would be interesting to know precisely.

Yours,
Andrey

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

via llvm-dev

Jul 23, 2017, 2:32:12 PM
to River Riddle, llvm-dev
River,

...and one more thing that just occurred to me.

Will Outlining put the following two expressions:

int x = (int)x2 + 1;
int *p = (int *)p2 + 1;

into the same congruence class? -- if sizeof(int) == sizeof(int *) on the target hardware?

Even more interesting and relevant are pointers of different types.

I believe MO has no problems capturing both of these.

Yours,
Andrey


River Riddle via llvm-dev

Jul 23, 2017, 5:57:28 PM
to Andrey Bokhanko, llvm-dev
Andrey,
  Currently it will not catch this case, because congruence is determined by types/opcode/etc., but cases like this can be caught by redefining what it means to be congruent. We can redefine congruence for add/sub/icmp/gep/etc. to no longer care about types, or even opcodes, in certain cases, but this may require extra control flow.
 
As for your example, the machine outliner can handle the case when the addition amount and register are the same:
int x = (int) x2 + 4;
int *p = (int*)p2 + 1;

If we do relax the congruence restrictions, being at the IR level allows us to identify that we could outline more than just the simple case:
assuming: sizeof(int) == sizeof(int *) == sizeof(long long *) == sizeof(char *)

int x = (int)x2 + 1;
int *p = (int *)p2 + (int)ipidx;
long long *lp = (long long *)lp2 + 3;
char *cp = (char *)cp2 + (int)cpidx;

-- could outline to -- 
int x = outlinedFn(1, x2, 1);
int *p = (int *)outlinedFn(4, (int)p2, ipidx);
long long *lp = (long long *)outlinedFn(8, (int)lp2, 3);
char *cp = (char *)outlinedFn(1, (int)cp2, cpidx);

int outlinedFn(int SizePatchup, int Var, int Amount) {
  return Var + (SizePatchup * Amount);
}

In the above we need some extra patch-up to account for the different sizes of the pointee types.

This is just one opportunity that can be caught when we start to redefine equivalence, something that is really powerful at the IR level. We wanted the initial submission to have a compact and efficient implementation. Extensions like these can easily be added, but they are not part of the initial design.
Thanks,
 River Riddle

Evgeny Astigeevich via llvm-dev

Jul 24, 2017, 8:38:29 AM
to River Riddle, llvm...@lists.llvm.org, nd

Hi River,


Thank you for information. Do you have your code in Phabricator?


As I understand it, your pass always runs after the inliner. At the moment the inliner is too aggressive at Os/Oz. I am trying to tune inliner heuristics for Oz. It would be interesting to see if there are any code size improvements when the inliner is disabled. This would demonstrate whether the outliner can improve code size when the duplicated code is not due to the inliner. Another problem I have is that C++ programs get code size regressions due to my changes in the inline heuristics. Have you seen this problem?


Thanks,

Evgeny Astigeevich


River Riddle via llvm-dev

Jul 24, 2017, 1:53:26 PM
to Evgeny Astigeevich, llvm...@lists.llvm.org, nd
Hi Evgeny,
 The code hasn't been posted to Phabricator yet; we wanted to get feedback on the current design and approach first. The code is currently forked on GitHub, and a link was included in the original email. As for the inliner having an effect on the potential benefits, I just rebuilt llvm-tblgen with "-fno-inline -fno-inline-functions" to see if there was any change in results:
 - The inlining-enabled Oz binary was ~20% smaller than the no-inline binary.
 - LO still gave ~2% benefit (down from 2.5%).
 - E+LO gave ~2.5% benefit (down from 3.82%).
During testing I have noticed that E+LO functions often have other functions inlined into them, which possibly accounts for some of the benefit observed. I haven't really played around with the inlining heuristics myself, so I likely won't be able to help much further in that regard.
 Thanks,
River Riddle

Jessica Paquette via llvm-dev

Jul 24, 2017, 2:12:32 PM
to River Riddle, llvm...@lists.llvm.org

Hi River,

I’m working on the MachineOutliner pass at the MIR level. Working at the IR level sounds interesting! It also seems like our algorithms are similar. I was thinking of taking the suffix array route with the MachineOutliner in the future. 

Anyway, I’d like to ask about this:

On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <llvm...@lists.llvm.org> wrote:

The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model -  we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model.

The inliner is threshold-based and it can be rather unpredictable how it will impact the code size of a program. Do you have any idea as to how heuristics/thresholds/parameters could be tuned to prevent this? In my experience, making good code size decisions with these sorts of passes requires a lot of knowledge about what instructions you’re dealing with exactly. I’ve seen the inliner cause some pretty serious code size regressions in projects due to small changes to the cost model/parameters which cause improvements in other projects. I’m a little worried that an IR-level outliner for code size would be doomed to a similar fate.

Perhaps it would be interesting to run this sort of pass pre-register allocation? This would help pull you away from having to use heuristics, but give you some more opportunities for finding repeated instruction sequences. I’ve thought of doing something like this in the future with the MachineOutliner and seeing how it goes.

- Jessica

River Riddle via llvm-dev

Jul 24, 2017, 2:55:19 PM
to Jessica Paquette, llvm-dev
Hi Jessica,
 The comparison to the inliner is an interesting one, but we think it's important to note the difference in the use of heuristics. The inliner is juggling many different tasks at the same time - execution speed, code size, etc. - which can make its parameters very sensitive depending on the benchmark/platform/etc. The outliner's heuristics are focused solely on the potential code size savings from outlining, and are thus only sensitive to the current platform. This only creates a problem when we are overestimating the potential cost of a set of instructions for a particular target.

The cost model parameters are only minimums: instruction sequence length, estimated benefit, occurrence count. The heuristics themselves are conservative and based upon all of the target information available at the IR level; the parameters just set a lower bound to weed out outliers.

You are correct that being at the machine level, before or after RA, will give the most accurate heuristics, but we feel there's an advantage to being at the IR level. At the IR level we can do many more things that are either too difficult or too complex for the machine level (e.g. parameterization/outputs/etc.). Not only can we do these things, but they are available on all targets immediately, without the need for target hooks.

The caution on the use of heuristics is understandable, but there comes a point when trade-offs need to be made. We traded a loss in exact cost modeling for flexibility, coverage, and potential for further features. This is the same trade-off made for quite a few IR-level optimizations, including inlining. As for the worry about code size regressions, so far the results seem to support our hypothesis.
 Thanks,
River Riddle

Quentin Colombet via llvm-dev

Jul 24, 2017, 4:42:57 PM
to River Riddle, llvm-dev
Hi River,

On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi Jessica,
 The comparison to the inliner is an interesting one but we think it's important to note the difference in the use of heuristics. The inliner is juggling many different tasks at the same time, execution speed, code size, etc. which can cause the parameters to be very sensitive depending on the benchmark/platform/etc. The outliners heuristics are focused solely on the potential code size savings from outlining, and is thus only sensitive to the current platform. This only creates a problem when we are over estimating the potential cost of a set of instructions for a particular target. The cost model parameters are only minimums: instruction sequence length, estimated benefit, occurrence amount. The heuristics themselves are conservative and based upon all of the target information available at the IR level, the parameters are just setting a lower bound to weed out any outliers. You are correct in that being at the machine level, before or after RA, will give the most accurate heuristics but we feel there's an advantage to being at the IR level. At the IR level we can do so many more things that are either too difficult/complex for the machine level(e.g parameterization/outputs/etc). Not only can we do these things but they are available on all targets immediately, without the need for target hooks. The caution on the use of heuristics is understandable, but there comes a point when trade offs need to be made. We made the trade off for a loss in exact cost modeling to gain flexibility, coverage, and potential for further features. This trade off is the same made for quite a few IR level optimizations, including inlining. As for the worry about code size regressions, so far the results seem to support our hypothesis.

Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO. Target hooks are cheap and choosing an implementation because it is simpler might not be the right long term solution.
At the very least, to know what trade-off we are making, having prototypes with the different approaches sounds sensible.
In particular, all the heuristics about cost for parameter passing (haven’t checked how you did it) sounds already complex enough and would require target hooks. Therefore, I am not seeing a clear win with an IR approach here.

Finally, having heuristics solely focused on code size does not seem realistic. Indeed, I am guessing you have some thresholds to avoid outlining pieces of code so small that they would end up adding a whole lot of indirection, and I don't like magic numbers in general :).

To summarize, I wanted to point out that an IR approach is not as clear a win as you describe and would thus deserve more discussion.

Cheers,
-Quentin


Jessica Paquette via llvm-dev

Jul 24, 2017, 4:59:58 PM
to River Riddle, llvm-dev
Hi River,

I think outlining at the IR level certainly has a lot of potential. I also think that working at a more generic representation such as IR opens lots of doors for interesting code size improvements; however, I’m still iffy on heuristics in general.

The inliner is juggling many different tasks at the same time, execution speed, code size, etc. which can cause the parameters to be very sensitive depending on the benchmark/platform/etc. 
Strongly agree here; it’s unclear *what* a heuristic should look like for an inliner.

The outliners heuristics are focused solely on the potential code size savings from outlining, and is thus only sensitive to the current platform. 
This is where I get a little concerned.

Firstly, as previously mentioned, we don’t know how an instruction will be lowered. So, if we say that each IR instruction has a cost of “1 somethings”, we haven’t really solved a *code size problem*, but more of a *code reduction problem*. That is, we’ve performed a reduction to the overall structure of the program, but we don’t know that that overall structural reduction will produce an equivalent code size reduction. I personally would be inclined to figure out how far off such a cost model could be, given a known architecture.

Secondly, suppose we have a cost model like in the inliner, where we have some heuristic constants which are employed by some cost model. If we change the model, then how do we know our heuristic constants are still good? Even if we tune these constants for some specific target to try and make good decisions, if we change the outliner *at all*, then we have to retune the heuristics. Changes in later passes could also introduce the need for retuning.

Finally, the big issue here is that with code size problems, or at least with outlining, we’re trying to solve something which is inherently tied to the architecture we’re working with. A good outlining decision for x86 will look different from a good outlining decision for ARM64. The issue isn’t purely structural; the actual instructions we’re dealing with matter.

Once again, I think this stuff is really interesting, and I think your results are looking great so far. It’s really awesome to see this method being successful at a high level. I’m also making a lot of assumptions about what your cost model might look like. I’ll probably have to look at the code to make a sound judgement. :)

- Jessica

Meador Inge via llvm-dev

Jul 24, 2017, 5:17:07 PM
to Jessica Paquette, llvm-dev
> Firstly, as previously mentioned, we don’t know how an instruction will be lowered. So, if we say that each IR instruction has a cost of “1 somethings”, we haven’t really solved a *code size problem*, but more of a *code reduction problem*. That is, we’ve performed a reduction to the overall structure of the program, but we don’t know that that overall structural reduction will produce an equivalent code size reduction. I personally would be inclined to figure out how far off such a cost model could be, given a known architecture.

“code size” vs. “code reduction” seems like a really nice way to think about this. I don’t know a lot about the areas in discussion, but one thing that occurred to me while reading this was: a code reduction couldn’t produce a worse code size than without the reduction, right?

— Meador

River Riddle via llvm-dev

unread,
Jul 24, 2017, 5:36:12 PM7/24/17
to Quentin Colombet, llvm-dev
Hi Quentin,
 I appreciate the feedback. When I reference the cost of target hooks it's mainly about maintainability and the cost to a target author. We want to keep the intrusion into target information minimized. The heuristics used by the outliner are the same used by any other IR-level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement. I think there is a disconnect on heuristics. The only user-tunable parameters are the lower-bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information. 
 When you say "Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO." I am slightly confused as to what you mean. I do not intend to implement this algorithm at the MIR level given that it already exists in the Machine Outliner. There are several comparison benchmarks in the "More detailed performance data" section of the original RFC, including comparisons to the Machine Outliner where possible (I can't build clang on Linux with the Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.
 Thanks,
River Riddle

River Riddle via llvm-dev

unread,
Jul 24, 2017, 5:46:00 PM7/24/17
to Jessica Paquette, llvm-dev
Hi Jessica,
 When we model the cost of an instruction it is largely based upon an estimation from TTI. It is still an estimation, but the results prove to be fairly close to the exact costs. If you are interested in the results for some known platforms, check the "More detailed results" section of the original RFC. It provides comparisons between the two configurations of outlining at the IR level as well as the Machine Outliner, and includes results for X86_64 (Linux and Mac) as well as AArch64 (the only two platforms that the Machine Outliner supports). The results tend to be quite favorable with regard to the cost model.
 As for the tunable heuristic constants, these are very similar to those the Machine Outliner might have if it were tunable: minimum benefit, minimum occurrences, minimum instruction length. Those are the only constants that really exist within the cost model.
 I appreciate all of the feedback and discussion!
Thanks,
River Riddle

Jessica Paquette via llvm-dev

unread,
Jul 24, 2017, 6:28:58 PM7/24/17
to Meador Inge, llvm-dev
(Cc llvm-dev, whoops!)

Intuitively, no, but suppose you have a sequence of instructions that looks like this (in some weird pseudocode I just made up):

a
b
c
x
a
b
c

Now, this could be reduced to something that looks like this by an outliner:

call my_function
x
call my_function

my_function:
a
b
c
return

Structurally-speaking, this *is* a reduction.

Now, suppose we’re using an architecture like AArch64 where we care about something like a link register. Then to be correct we may have to actually emit something like this:

save special_register
call my_function
restore special_register
x
save special_register
call my_function
restore special_register

my_function:
a
b
c
return

In this case, we’ve actually produced more instructions than before. Although we’ve simplified the structure of the code, we’ve still produced more of it. So, if a high-level outliner isn’t aware of this sort of thing, you could actually end up making binaries *larger* by removing similarity from the code.
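To make the arithmetic concrete, here is a toy tally of the example above, assuming every pseudo-op, call, save, restore, and return counts as one instruction (a sketch, not a real cost model):

```python
# Toy instruction tally for the outlining example; every op counts as 1.

original = ["a", "b", "c", "x", "a", "b", "c"]           # 7 instructions

outlined_function = ["a", "b", "c", "return"]            # 4 instructions

# Naive call sites: call, x, call -> break-even with the original.
naive_sites = ["call", "x", "call"]
naive_total = len(naive_sites) + len(outlined_function)  # 7

# With link-register save/restore around each call site:
lr_sites = ["save", "call", "restore", "x",
            "save", "call", "restore"]
lr_total = len(lr_sites) + len(outlined_function)        # 11 > 7

print(len(original), naive_total, lr_total)
```

Even the naive version only breaks even here; once the save/restore pairs are needed, the "reduction" costs four extra instructions.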

- Jessica

Andrey Bokhanko via llvm-dev

unread,
Jul 24, 2017, 7:09:40 PM7/24/17
to Quentin Colombet, llvm-dev
Hi Quentin,

I really don't have a dog in this race, but... we can theorize about the strengths and weaknesses of different approaches ad infinitum -- or just accept that "the proof is in the pudding", as they say.

And the proof is here! -- look at the numbers.

If anything, I believe it makes sense to at least accept River's pass into the trunk and let the two approaches evolve in parallel. It is probably silly to have both enabled at the same time, so whichever currently provides the greater benefit should run at -Os/-Oz. With time, evolution will decide which one should survive -- as it always does.

Yours,
Andrey
===
Compiler Architect
NXP

Quentin Colombet via llvm-dev

unread,
Jul 24, 2017, 7:14:18 PM7/24/17
to River Riddle, llvm-dev
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,
 I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size, I agree it makes sense, but I believe that, even at Oz, we probably don’t want to slow the code down by a big factor for a couple of bytes. That’s what I wanted to say: you need to have some kind of model for performance to avoid those worst cases. Unless we don’t care :).

I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information. 

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit.

 When you say "Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO." I am slightly confused as to what you mean. I do not intend to try and implement this algorithm at the MIR level given that it exists in Machine Outliner.

Of course, I don’t expect you to do that :). What I meant is that the claim that IR offers the better trade-off is not based on hard evidence. I actually don’t buy it.

My point was to make sure I understand what you are trying to solve and given you have mentioned the MachineOutliner, why you are not working on improving it instead of suggesting a new framework.
Don’t get me wrong, maybe creating a new framework at the IR level is the right thing to do, but I still didn’t get that from your comments.

There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible(I can't build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.

Now, to be more concrete on your proposal, could you describe the cost model for deciding what to outline? (Really the cost model, not the suffix algo.)
Are outlined functions pushed into the list candidates for further outlining?

River Riddle via llvm-dev

unread,
Jul 24, 2017, 7:37:09 PM7/24/17
to Quentin Colombet, llvm-dev
Hi Quentin,
  I understand your points and I believe some meaning is being lost via email. It's true that performance cost isn't necessarily modeled; there is currently only support for using profile data to mitigate that. I was working under the assumption, possibly incorrectly, that at Oz we favor small code over everything else, including runtime performance. This is why we only run at Os if we have profile data, and even then we are very conservative about what we outline from. 
  I also understand that some target hooks may be required for certain things; it happens for other IR-level passes as well. I just want to minimize that behavior as much as I can, though that's a personal choice.
As for a motivating reason to push for this, it actually solves some of the problems that you brought up in the post for the original Machine Outliner RFC. If I can quote you:

"So basically, better cost model. That's one part of the story.

The other part is at the LLVM IR level or before register allocation identifying similar code sequence is much harder, at least with a suffix tree like algorithm. Basically the problem is how do we name our instructions such that we can match them.
Let me take an example.
foo() {
/* bunch of code */
a = add b, c;
d = add e, f; 
}

bar() {
d = add e, g;
f = add c, w;
}

With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details but it gets tricky quickly. We were thinking of reusing GVN to have such identifier if we wanted to do the outlining at IR level but solving this problem is hard."

This outliner will catch your case; it is actually one of the easiest cases for it to catch. The outliner can recognize semantically equivalent instructions and can be expanded to catch even more.
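To illustrate how a relaxed equivalence like the one described in the RFC can match Quentin's two add sequences, here is a hypothetical sketch; the tuple encoding and `congruence_key` helper are my invention, not the pass's actual representation:

```python
# Map each instruction to a congruence key built from opcode, result
# type, and operand *types* only -- operand identities are ignored, so
# renamed values still land in the same class.

def congruence_key(inst):
    opcode, result_type, operands = inst
    return (opcode, result_type, tuple(ty for ty, _name in operands))

# Quentin's example: the adds in foo() and bar() use different names.
foo = [("add", "i32", [("i32", "b"), ("i32", "c")]),
       ("add", "i32", [("i32", "e"), ("i32", "f")])]
bar = [("add", "i32", [("i32", "e"), ("i32", "g")]),
       ("add", "i32", [("i32", "c"), ("i32", "w")])]

classes = {congruence_key(inst) for inst in foo + bar}
print(len(classes))  # all four adds fall into one congruence class
```

A suffix array built over these class IDs then finds the repeated two-instruction run across both functions, and the differing operands become parameters of the outlined function.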

As for the cost model, it is quite simple:
 - We identify all of the external inputs into the sequence. For estimating the benefit we constant fold and condense the inputs so that we get the set of *unique* inputs into the sequence. 
 - We also identify outputs from the sequence, i.e. instructions that have external references. We add the cost of storing/loading/passing each output parameter to the outlined function.
 - We identify one output to promote to a return value for the function. This can remove an output parameter that would otherwise be passed to the outlined function.
 - We estimate the cost of the sequence of instructions, mostly using TTI.
 - With the above we can estimate the cost per occurrence as well as the cost of the new function. These are mostly estimates, of course; we lean towards the conservative side to be safe. 
 - Finally we compute an estimated benefit for the sequence, taking into account the benefit per occurrence and the estimated cost of the new function.
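As a hedged sketch of how those steps could combine into a single benefit estimate; every cost constant and parameter name below is an illustrative placeholder, not the pass's real TTI-derived value:

```python
# Illustrative benefit estimate for an outlining candidate.

def estimated_benefit(seq_cost, occurrences, unique_inputs, outputs,
                      call_cost=1, param_cost=1, output_param_cost=2,
                      function_overhead=1):
    # One output is promoted to a return value, so it needs no
    # output parameter at the call sites.
    output_params = max(outputs - 1, 0)
    # Cost of each call site that replaces the original sequence.
    per_occurrence = (call_cost
                      + param_cost * unique_inputs
                      + output_param_cost * output_params)
    # Cost of the new outlined function (body plus prologue/return).
    new_function_cost = seq_cost + function_overhead
    savings = occurrences * (seq_cost - per_occurrence)
    return savings - new_function_cost

# A 10-unit sequence occurring 4 times with 2 unique inputs, 1 output:
print(estimated_benefit(10, 4, 2, 1))
```

Negative results mark candidates that should be pruned: a short sequence with few occurrences cannot pay for its own function overhead.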

There is definitely room for improvement in the cost model. I do not believe it's the best, but it's shown to be somewhat reliable for most cases. It has benefits and it has regressions, as does the machine outliner. The reason for adding the new framework is that I believe this transformation should exist at the IR level: not only because it is the simplest place to put it, but also because it provides the greatest opportunities for improvement with the largest coverage. We can expand the framework to catch Danny's cases from the machine outliner RFC, we can add region outlining, etc. We can do all of this and make it available to every target automatically. The reason I am for this being at the IR level is not because I want to split up the technical effort, but to combine it.

Thanks,
  River Riddle

River Riddle via llvm-dev

unread,
Jul 24, 2017, 7:57:07 PM7/24/17
to Quentin Colombet, llvm-dev
Hey Quentin,
 Sorry I missed the last question. Currently it doesn't do continual outlining, but it's definitely a possibility.
Thanks,
River Riddle

Quentin Colombet via llvm-dev

unread,
Jul 24, 2017, 8:57:05 PM7/24/17
to Andrey Bokhanko, llvm-dev
On Jul 24, 2017, at 4:09 PM, Andrey Bokhanko <andreyb...@gmail.com> wrote:

Hi Quentin,

I really don't have a dog in this race, but... we can theorize about the strengths and weaknesses of different approaches ad infinitum -- or just accept that "the proof is in the pudding", as they say.

I agree.


And the proof is here! -- look at the numbers.

With that I somewhat agree; IMO you can tell a lot of different things with numbers. But that’s a different story.


If anything, I believe it makes sense to at least accept Riddle's phase to the trunk and let two approaches evolve in parallel.

Well, if that’s the way we want to move with this, fine. I tend to prefer a more focused effort. As long as we are making this a conscious decision, that works for me.

It is probably silly to have both enabled at the same time, so whichever currently provides the greater benefit should run at -Os/-Oz. With time, evolution will decide which one should survive -- as it always does.

Sure.

Quentin Colombet via llvm-dev

unread,
Jul 24, 2017, 9:24:49 PM7/24/17
to River Riddle, llvm-dev
Hi River,

Thanks for the detailed explanation.

If people are okay for you to move forward, like I said to Andrey, I won’t oppose. I feel sad we have to split our effort on outlining technology, but I certainly don’t pretend to know what is best!
The bottom line is if people are happy with that going in, the conversation on the details can continue in parallel.

On Jul 24, 2017, at 4:56 PM, River Riddle <riddl...@gmail.com> wrote:

Hey Quentin,
 Sorry I missed the last question. Currently it doesn't do continual outlining, but it's definitely a possibility.

Ok, that means we probably won’t have the very bad runtime problems I had in mind with adding a lot of indirections. 

Thanks,
River Riddle

On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddl...@gmail.com> wrote:
Hi Quentin,
  I understand your points and I believe some meaning is being lost via email. It's true that performance cost isn't necessarily modeled; there is currently only support for using profile data to mitigate that. I was working under the assumption, possibly incorrectly, that at Oz we favor small code over everything else, including runtime performance. This is why we only run at Os if we have profile data, and even then we are very conservative about what we outline from. 
  I also understand that some target hooks may be required for certain things, it happens for other IR level passes as well. I just want to minimize that behavior as much as I can, though thats a personal choice.
As for a motivating reason to push for this, it actually solves some of the problems that you brought up in the post for the original Machine Outliner RFC. If I can quote you:

"So basically, better cost model. That's one part of the story.

The other part is at the LLVM IR level or before register allocation identifying similar code sequence is much harder, at least with a suffix tree like algorithm. Basically the problem is how do we name our instructions such that we can match them.
Let me take an example.
foo() {
/* bunch of code */
a = add b, c;
d = add e, f; 
}

bar() {
d = add e, g;
f = add c, w;
}

With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details but it gets tricky quickly. We were thinking of reusing GVN to have such identifier if we wanted to do the outlining at IR level but solving this problem is hard."

This outliner will catch your case, it is actually one of the easiest cases for it to catch. The outliner can recognize semantically equivalent instructions and can be expanded to catch even more.

Interesting, could you explain how you do that?
I didn’t see it in the original post.


As for the cost model it is quite simple:
 - We identify all of the external inputs into the sequence. For estimating the benefit we constant fold and condense the inputs so that we can get the set of *unique* inputs into the sequence. 

Ok, those are your parameters. How do you account for the cost of setting up those parameters?

 - We also identify outputs from the sequence, instructions that have external references. We add the cost of storing/loading/passing output parameter to the outlined function.

Ok, those are your return values (or your output parameters). I see the cost computation you do on those, but it still misses the general parameter setup cost.

 - We identify one output to promote to a return value for the function. This can end up reducing an output parameter that is passed to our outlined function.

How do you choose that one?

 - We estimate the cost of the sequence of instructions by mostly using TTI.

Sounds sensible. (Although I am not a fan of the whole TTI thing :)).

 - With the above we can estimate the cost per occurrence as well as the cost for the new function. Of course these are mostly estimates, we lean towards the conservative side to be safe. 

Conservative in what sense? Put differently how do you know your cost is conservative?

 - Finally we can compute an estimated benefit for the sequence taking into account benefit per occurrence and the estimated cost of the new function.

Make sense.


There is definitely room for improvement in the cost model. I do not believe it's the best, but it's shown to be somewhat reliable for most cases. It has benefits and it has regressions, as does the machine outliner.

Regressions in what sense?
Do you actually have functions that are bigger?

To clarify, AFAIR, the machine outliner does not have such regressions per se. The outliner does exactly what the cost model predicted: functions are individually smaller. Code size growth may happen because of padding in the object file (and by getting in the way of some linker optimizations).

The reasoning for adding the new framework, is because I believe that this transformation should exist at the IR level. Not only because it is the simplest place to put it but also because it provides the greatest opportunities for improvement with the largest coverage.

Again I don’t get those arguments. Machine passes can be target independent. Sure, they may require targets to adopt new hooks, but that’s about it, and I don’t see how this is different from having to adopt hooks from TTI. I am guessing you can mostly achieve your goals with the existing TTI hooks, except for the argument-passing ones, which is the part I missed in your explanation :).

We can expand the framework to catch Danny's cases from the machine outliner RFC, we can add region outlining, etc. We can do all of this and make it available to every target automatically. The reason why I am for this being at the IR level is not because I want to split up the technical effort, but combine it.

I fail to see the combining part here :).

To give you a bit of context, I am afraid that, indeed, the implementation is going to be simpler (partly because a lot of people are more familiar with LLVM IR than MachineInstr) and for the most part TTI is going to be good enough. However, when we want to go the extra mile and do crazy stuff, we will, like with the vectorizer IMHO, hit that wall of TTI abstractions that we then work around in various ways. I’d like, if at all possible, to avoid repeating that history here.

River Riddle via llvm-dev

unread,
Jul 24, 2017, 11:31:36 PM7/24/17
to Quentin Colombet, llvm-dev
Hey Quentin,
 To clarify some of your concerns/questions:
- Currently parameter/output passing is modeled with a simple cost of 1, though this is something that will need to change a bit, as you have pointed out. As I said, the cost model isn't perfect.
- The output that is turned into a return is either one that will remove an output parameter from the function call, or the last output in the instruction sequence.
- As for being conservative with the cost, we prefer to overestimate the cost of outlining (the new function cost) and underestimate the cost of the instruction sequence when possible. This means that we tend to underestimate the benefit, given that we are working with approximate costs.
- When I speak of regressions I am referring to the size of the text section.
- The gist of identifying semantic equivalence is in the very first sentence of Candidate Selection in the initial RFC. We identify equivalence generally via type, opcode, operand types, and any special state. This is just the current definition; we can relax it even further if we want to.
- Working at the IR level has more target-independent advantages than just reducing the number of new hooks. This can be seen in the fact that we don't require -mno-red-zone or any other ABI flag/check, we can easily parameterize sequences, generate outputs from sequences, etc. 
- By combining work I mean that adding advancements is much easier. If someone wants to add functionality for outlining regions they don't have to worry about any target-specific patchups. It just works, for every target. 
Both outliners require hooks to get better performance, but with the IR outliner these hooks are limited to the cost model and not the outlining process itself.
Even though we disagree, I do appreciate the discussion and feedback!
 Thanks,
River Riddle

Sean Silva via llvm-dev

unread,
Jul 25, 2017, 12:03:02 AM7/25/17
to Quentin Colombet, llvm-dev
The current MIR outliner doesn't take instruction encoding length into account either. Considering that x86 instructions commonly range from 1 to 10+ bytes, the variability from not modeling that is probably comparable to the inaccuracy of estimating a fixed cost for an LLVM IR instruction w.r.t. its lowering to machine code.

As a simple example, many commonly used x86 instructions encode as over 5 bytes, which is the size of a call instruction. So an outlined function that consists of even a single instruction can be profitable. IIRC there was about a 5% code size saving just from outlining single instructions (that encode at >5 bytes) at the machine code level on the test case I looked at (I mentioned it in a comment on one of Jessica's patches IIRC).
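The break-even arithmetic here is easy to sketch; the byte counts below are illustrative assumptions (a near call is 5 bytes, ret is 1), not measured encodings:

```python
# Bytes saved by outlining a single instruction that occurs N times,
# assuming each occurrence is replaced by a 5-byte near call and the
# outlined body is the instruction plus a 1-byte ret.

CALL_BYTES = 5  # E8 rel32 near call
RET_BYTES = 1   # C3

def bytes_saved(inst_bytes, occurrences):
    before = inst_bytes * occurrences
    after = CALL_BYTES * occurrences + inst_bytes + RET_BYTES
    return before - after

print(bytes_saved(10, 5))  # a 10-byte instruction, 5 occurrences
print(bytes_saved(4, 5))   # a 4-byte instruction never pays off
```

Anything encoding at 5 bytes or less can only lose, since the call replacing it is at least as large; past 5 bytes, each extra occurrence adds to the savings.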

Do we have a way to get an instruction's exact encoded length for the MIR outliner?

-- Sean Silva

Sean Silva via llvm-dev

unread,
Jul 25, 2017, 12:05:39 AM7/25/17
to River Riddle, llvm-dev
On Mon, Jul 24, 2017 at 8:31 PM, River Riddle via llvm-dev <llvm...@lists.llvm.org> wrote:
Hey Quentin,
 To clarify some of your concerns/questions:
- Currently parameter/output passing are based around the simple cost=1, though this is something that will need to change a bit, as you have pointed out. As I said, the cost model isn't perfect.
- The output that is turned into a return is either: one that will remove an output parameter from the function call, or the last output in the instruction sequence.
- As for being conservative with the cost, we prefer to overestimate the cost of outlining(New function cost) and under estimate the cost of the instruction sequence when possible. This means that we tend to under estimate the benefit, given that we are working with approximate costs.
- When I speak of regressions I am referring to the size of the text section.
- The gist of identifying semantic equivalence is the very first sentence of Candidate Selection in the initial RFC. We identify equivalence generally via Type,Opcode, operand types, any special state. This is just the current definition, we can relax this even further if we want to.
- Working at the IR level has more target independent advantages than just reducing the amount of new hooks. This can be seen in the fact that: we don't require mno-red-zone or any other abi flag/check, we can easily parameterize sequences, generate outputs from sequences, etc. 
- By combining work I mean that the ability for adding advancements are much easier. if someone wants to add functionality for outlining regions they don't have to worry about any target specific patchups. It just works, for every target. 

I also think it would be substantially more difficult to implement the parameterized outlining at the machine level. Do you have any numbers on how much of the benefit comes from parameterization?

-- Sean Silva

Sean Silva via llvm-dev

unread,
Jul 25, 2017, 12:17:53 AM7/25/17
to River Riddle, llvm-dev

The downside to having this type of transformation at the IR level is less accuracy in the cost model: we can somewhat accurately model the cost per instruction, but we can't get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, so to help alleviate this there are several tunable parameters for the cost model.


-- Performance --

More results including clang, llvm-tblgen, and more specific numbers about benefits/regressions can be found in the notes section below.

* Size Reduction:

- Test Suite (X86_64):

   - Early+Late outlining provides a geomean of 10.5% reduction over clang Oz, with a largest improvement of ~67% and largest regression of ~7.5%.

   - Late outlining provides a geomean of 4.65% reduction, with a largest improvement of ~51% and largest regression of ~6.4%.

- Spec 2006 (X86_64):

   - Early+Late outlining provides a geomean reduction of 2.08%.

   - Late outlining provides 2.09%.

- CSiBE (AArch64):

   - Early+Late outlining shows geomean reduction of around 3.5%.

   - Late outlining shows 3.1%.



It would be good to visualize these results. Maybe a bar chart like https://goo.gl/qN2HqA from http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html for SPEC?
 

* Compile Time:

Compile time was tested under test-suite with a multisample of 5.

- Early+Late outlining

   - Many improvements with > 40% compile time reduction.

   - Few regressions.

- Late outlining

   - Greatest improvement is ~7.8%

   - Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction with early outlining is that removing redundancy early in the optimization pipeline leaves fewer instructions for the rest of the compilation to process.


* Execution Time:

Ran with a multisample of 5.

- Test Suite:

   - Early+Late outlining has many regressions up to 97%. The greatest improvement was around 7.5%.

   - Late outlining also has several regressions up to 44% and a greatest improvement of around 5.3%.

- Spec:

   - Early+Late has a geomean regression of 3.5%.

   - Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining.


-- Tested Improvements --

* MergeFunctions:

   - No noticeable benefit.

* LTO:

   - LTO doesn’t have a code size pipeline, but % reductions over LTO are comparable to non-LTO.


-Os/-Oz are communicated through the optsize and minsize attributes. There isn't a specific code size pipeline per se (I think this is true for per-TU compilation as well, though I would have to check).

Also, can you clarify what you mean by "LTO"? I assume this means that the outliner did not run during per-TU compilation but did run on the combined FullLTO bitcode, but want to check to be sure.

-- Sean Silva
 

* Input/Output Partitioning:

   - This identifies inputs/outputs that may be folded by splitting a candidate. The benefit is minimal relative to the computation required.

* Similar Candidate Merging:

   - The benefit to be gained is currently not worth the large complexity required to catch the desired cases.


-- Potential Improvements --

* Suffix & LCP array construction: The pass currently uses a very basic implementation that could be improved. There are definitely faster algorithms, and some can construct both arrays simultaneously. We will investigate this as a potential compile time improvement in the future.
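For reference, the construction being described can be sketched as a naive suffix array (O(n² log n) with Python slicing) plus Kasai's linear-time LCP pass; faster algorithms such as SA-IS build the suffix array in O(n):

```python
# Naive suffix array plus Kasai's LCP algorithm over a sequence of
# congruence-class IDs (a plain string works the same way here).

def suffix_array(s):
    # Sort suffix start positions by the suffix contents.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # Kasai: lcp[i] = common-prefix length of suffixes sa[i-1], sa[i].
    n = len(s)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1  # prefix shrinks by at most one per step
        else:
            h = 0
    return lcp

s = "banana"
sa = suffix_array(s)
print(sa, lcp_array(s, sa))
```

Walking the resulting LCP array for maximal runs of non-zero entries yields the repeated (congruent) instruction chains that become outlining candidates.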


* Os performance tuning: Under -Os the pass currently only runs on cold blocks. Ideally we could expand this to be a little more lax on less frequently executed blocks that aren’t cold.


* Candidate Selection: The algorithm currently focuses on the longest common sequences. More checks could be added to see whether shortening the sequence produces a larger benefit (e.g. fewer inputs/outputs). This would likely increase compile time but should remain below the baseline.


* Non Exact Functions: The outliner currently ignores functions that do not have an exact definition.


-- --

* CSiBE(Code Size Benchmark):

www.csibe.org


* More detailed performance data:

goo.gl/5k6wsP


* Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp



River Riddle via llvm-dev

unread,
Jul 25, 2017, 12:25:57 AM7/25/17
to Sean Silva, llvm-dev
Hey Sean, 
  The bit about the attributes is good to know. When LTO is enabled the early run will still happen per-TU, but the late run will shift to work on the full LTO bitcode. I don't have specific numbers on how often parameterization is utilized, but I can assure you that it's the majority of the time. I'll look into transforming the data into a visual format, since it's in Google Docs anyway.
Thanks,
 River Riddle

Sean Silva via llvm-dev

unread,
Jul 25, 2017, 12:43:36 AM7/25/17
to River Riddle, llvm-dev
On Mon, Jul 24, 2017 at 9:25 PM, River Riddle <riddl...@gmail.com> wrote:
Hey Sean, 
  The bit about the attributes is good to know. When LTO is enabled the early run will still happen per-TU, but the late run will shift to work on the full LTO bitcode. I don't have specific numbers on how often parameterization is utilized, but I can assure you that it's the majority of the time.

Interesting. It would be good to have some specific data on this (is "majority" actually 65% or 99%? How does it vary across benchmarks?), because this is something that can't be done in the current post-RA machine outliner (even in principle).

A pre-RA MIR-level outliner would have similar issues to your LLVM IR outliner w.r.t. estimating instruction sizes.
For example, pre-RA you might see a virtual register but it might actually end up spilled so there will be spill/fill code size that is unaccounted for (in fact, I wouldn't be surprised if this inaccuracy of the cost model was actually greater than the inaccuracy from assuming a fixed cost per instruction either at pre-RA MIR level or LLVM IR level; I don't have data supporting one way or the other though).
Also, outlining pre-RA inherently constrains the register allocator, so there will be indirect effects on code size. It's not clear that doing things at MIR level pre-RA will really allow avoiding this any more than marking LLVM IR level outlined functions using an appropriate calling convention (and maybe a target hook or two).

-- Sean Silva 

Sean Silva via llvm-dev

unread,
Jul 25, 2017, 12:54:46 AM7/25/17
to Andrey Bokhanko, llvm-dev
On Sun, Jul 23, 2017 at 12:43 AM, via llvm-dev <llvm...@lists.llvm.org> wrote:
River,

Thanks for the reply!

> On Jul 23, 2017, at 1:05, River Riddle <riddl...@gmail.com> wrote:
>
> - The explanation as to why the improvements can vary between the IR and MIR outliner mainly boil down to the level of abstraction that each are working at. The MIR level has very accurate heuristics and is effectively the last post ISel target independent code gen pass. The IR outliner on the other hand has more estimation in the cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results.
To clarify, I'm surprised not by the % differences (those are understandable), but by differences in which benchmarks got improved. It seems odd that the MO, working at a lower abstraction level, managed to find redundancies (say, in libquantum) that EO+LO missed.

When I looked at this, most of the code size saving from the machine outliner is from outlining 2-3 instruction sequences (often the same sequence is outlined for each of many permutations of register assignments). Think of sequences like "TEST; SETCC" (and many different register assignments thereof).

-- Sean Silva 

Sean Silva via llvm-dev

unread,
Jul 25, 2017, 1:26:09 AM7/25/17
to Quentin Colombet, llvm-dev
On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,
 I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).

I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information. 

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit.

 When you say "Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO." I am slightly confused as to what you mean. I do not intend to try and implement this algorithm at the MIR level given that it exists in Machine Outliner.

Of course, I don’t expect you to do that :). What I meant is that the claim that IR offers the better trade off is not based on hard evidences. I actually don’t buy it.

My point was to make sure I understand what you are trying to solve and given you have mentioned the MachineOutliner, why you are not working on improving it instead of suggesting a new framework.
Don’t take me wrong, maybe creating a new framework at the IR level is the right thing to do, but I still didn’t get that from your comments.

There are several comparison benchmarks given in the "More detailed performance data" section of the original RFC. It includes comparisons to the Machine Outliner when possible (I can't build clang on Linux with the Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.

The two passes are pretty different in their approaches to congruency finding, so I don't think it helps to group them as though they were interchangeable "outliner technology". The two passes might be totally orthogonal.

I can imagine two extreme outcomes (reality is probably somewhere in between):

1. if you run the LLVM IR level code size outliner, then the MachineOutliner fails to find any significant redundancies.

2. if you run the LLVM IR level code size outliner, then the MachineOutliner finds just as many redundancies.

It would be good if River could get some data about this.

-- Sean Silva

Joerg Sonnenberger via llvm-dev

Jul 25, 2017, 8:17:02 AM
to llvm...@lists.llvm.org
On Mon, Jul 24, 2017 at 09:02:55PM -0700, Sean Silva via llvm-dev wrote:
> Do we have a way to get an instruction's exact encoded length for the MIR
> outliner?

I don't think we have to go that far. A heuristic based on the immediate
argument should be enough.
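A minimal sketch of the kind of immediate-based length heuristic Joerg suggests. The constants model common x86 ALU encodings (most have a sign-extended imm8 form falling back to imm32); the function names and byte counts are illustrative simplifications, not an exact encoder:

```python
def x86_imm_size(imm):
    """Rough size (bytes) of the immediate field for a typical x86 ALU
    instruction (add/sub/cmp r/m32, imm): a sign-extended imm8 form
    covers [-128, 127], anything else needs a full imm32."""
    return 1 if -128 <= imm <= 127 else 4

def estimate_length(opcode_bytes, modrm_bytes, imm):
    # Simplified: opcode + ModRM + immediate. Ignores prefixes, SIB,
    # displacements, and special shorter encodings (e.g. for EAX).
    return opcode_bytes + modrm_bytes + x86_imm_size(imm)
```

Even this coarse model distinguishes a 3-byte `add reg, 5` from a 6-byte `add reg, 300`, which is the distinction that matters for outlining profitability.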

Joerg

Quentin Colombet via llvm-dev

Jul 25, 2017, 12:21:33 PM
to River Riddle, llvm-dev
Thanks River for the further explanations!

On Jul 24, 2017, at 8:31 PM, River Riddle <riddl...@gmail.com> wrote:

Hey Quentin,
 To clarify some of your concerns/questions:
- Currently parameter/output passing is based on a simple cost of 1, though this is something that will need to change a bit, as you have pointed out. As I said, the cost model isn't perfect.
- The output that is turned into a return is either: one that will remove an output parameter from the function call, or the last output in the instruction sequence.
- As for being conservative with the cost, we prefer to overestimate the cost of outlining (new function cost) and underestimate the cost of the instruction sequence when possible. This means that we tend to underestimate the benefit, given that we are working with approximate costs.
- When I speak of regressions I am referring to the size of the text section.
- The gist of identifying semantic equivalence is the very first sentence of Candidate Selection in the initial RFC. We identify equivalence generally via type, opcode, operand types, and any special state. This is just the current definition; we can relax it even further if we want to.
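The relaxed-equivalence mapping described above can be sketched roughly as follows. The instruction representation (plain tuples of strings) and the class names are invented for illustration; the point is only that operand *values* are deliberately left out of the key:

```python
from itertools import count

class CongruenceMapper:
    """Map instructions to small integer congruence-class IDs based on
    relaxed equivalence: opcode, result type, and operand types only.
    Operand values are intentionally ignored; differing values are
    reconciled later (parameterized or rejected) during analysis."""
    def __init__(self):
        self._ids = {}
        self._next = count()

    def class_of(self, opcode, result_ty, operand_tys):
        key = (opcode, result_ty, tuple(operand_tys))
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]

m = CongruenceMapper()
# 'add i32 x, x' and 'add i32 x, y' land in the same class...
a = m.class_of("add", "i32", ["i32", "i32"])
b = m.class_of("add", "i32", ["i32", "i32"])
# ...while a float add does not.
c = m.class_of("fadd", "float", ["float", "float"])
```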

The thing is, with that definition I fail to see how we can differentiate, say, add x,x from add x,y. How does that work?

Jessica Paquette via llvm-dev

Jul 25, 2017, 12:24:53 PM
to Sean Silva, llvm-dev
The two passes are pretty different in their approaches to congruency finding, so I don't think it helps to group them as though they were interchangeable "outliner technology". The two passes might be totally orthogonal.

I think that, based on how River described his algorithm, the actual underlying method should be pretty similar. If it is, then we could probably compare the performance of the suffix tree vs. suffix array method and then create a general outliner that can be run at any level of representation. By that I mean that the actual suffix tree/array candidate search algorithm would be separate from the actual implementation of *outlining things*. The implementation at each level of representation would define

- How to outline a sequence of instructions
- An equivalence/congruence scheme for two instructions
- A cost model

I don’t think it’d be too difficult to split it up in that sort of way. I’d also like to experiment with pre-regalloc outlining in general, so it’d make it easier to explore that route as well.
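That split might look something like the sketch below. Every name here is hypothetical (this is not an existing LLVM API): the shared candidate search would be written against the interface, while IR and MIR each supply an implementation:

```python
from abc import ABC, abstractmethod

class OutlinerTarget(ABC):
    """Hypothetical per-representation plug-in for a generic outliner.
    The representation-agnostic suffix tree/array search and overlap
    pruning would call through this interface."""

    @abstractmethod
    def congruence_class(self, inst):
        """Map an instruction to an integer congruence-class ID
        (the equivalence/congruence scheme)."""

    @abstractmethod
    def cost(self, inst):
        """Estimated size of 'inst' in some common unit (the cost model)."""

    @abstractmethod
    def outline(self, occurrences):
        """Rewrite the given occurrences as calls to a new function
        (how to outline a sequence of instructions)."""

class TrivialTarget(OutlinerTarget):
    """Toy implementation just to show the shape of a plug-in."""
    def congruence_class(self, inst):
        return hash(inst) % 97
    def cost(self, inst):
        return 1  # fixed cost per instruction
    def outline(self, occurrences):
        return len(occurrences)  # stand-in for real rewriting
```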

1. if you run the LLVM IR level code size outliner, then the MachineOutliner fails to find any significant redundancies.

2. if you run the LLVM IR level code size outliner, then the MachineOutliner finds just as many redundancies.

This would be interesting. The MachineOutliner deems two instructions equivalent iff their opcodes and operands are identical. This isn’t super flexible (pre-register allocation outlining would be much better). However, an IR-level outliner has a lot more wiggle-room. It would be interesting to see how certain equivalence schemes perform. Once again, making assumptions about River’s algorithm, but the equivalence/congruence schemes are probably where the most significant differences in the way each technique handles outlining lie.

- Jessica

Quentin Colombet via llvm-dev

Jul 25, 2017, 12:27:17 PM
to Sean Silva, llvm-dev
Hi Sean,

Right, I forgot about x86, AArch64 being the primary target when we did that :).
Frankly, there is nothing hard about doing the estimate on the actual encoding at this stage of the IR. The reasons why we didn’t do it were:
1. AArch64 instructions do not differ in encoded size (fixed 4-byte encodings)
2. It takes some (compile) time to compute that information and given we didn’t need it for AArch64 we didn’t do it.


As a simple example, many commonly used x86 instructions encode as over 5 bytes, which is the size of a call instruction. So an outlined function that consists of a single instruction can be profitable. IIRC there was about 5% code size saving just from outlining single instructions (that encode at >5 bytes) at machine code level on the test case I looked at (I mentined it in a comment on one of Jessica's patches IIRC).
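The arithmetic behind that observation can be sketched as follows. This is a hedged model, not the MachineOutliner's actual cost code: 5 bytes for a direct x86 call, and a guessed 1 byte of overhead for the outlined body's ret:

```python
def outlining_benefit(seq_bytes, occurrences, call_bytes=5, overhead_bytes=1):
    """Rough code-size benefit (bytes) of outlining a sequence: each
    occurrence shrinks from seq_bytes to a call, and we pay once for
    the outlined body plus its return overhead."""
    saved = occurrences * (seq_bytes - call_bytes)
    added = seq_bytes + overhead_bytes
    return saved - added
```

Under this model a single 7-byte instruction outlined from 10 call sites saves 10*(7-5) - (7+1) = 12 bytes, while any sequence shorter than a call can never be profitable.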

Good point!


Do we have a way to get an instruction's exact encoded length for the MIR outliner?

Not right now, IIRC, but I would say that’s half a day effort at most. To be fair, I haven’t followed the development of the code base since Jessica’s internship. Jessica would know the details for sure.

Cheers,
-Quentin


-- Sean Silva
 

Quentin Colombet via llvm-dev

Jul 25, 2017, 12:31:26 PM
to Jessica Paquette, llvm-dev
On Jul 25, 2017, at 9:24 AM, Jessica Paquette <jpaq...@apple.com> wrote:

The two passes are pretty different in their approaches to congruency finding, so I don't think it helps to group them as though they were interchangeable "outliner technology". The two passes might be totally orthogonal.

I think that based off how River described his algorithm, the actual underlying method should be pretty similar. If it is, then we could probably compare the performance of the suffix tree vs. suffix array method and then create a general outliner that can be run at any level of representation.

+1, I believe that would be the ideal outcome.
I can see a split like this:
- Outliner algorithm IR-agnostic (with possibly different mode: suffix tree, array.)
- Pluggable cost model
- Rewriter

Jessica Paquette via llvm-dev

Jul 25, 2017, 12:59:42 PM
to Quentin Colombet, llvm-dev
It might also be cool to see how various levels of outlining + the inliner would interact.

For example, say we did something like this

1. IR outline
2. Inline
3. MIR outline

The inliner should be capable of undoing bad outlining decisions (like outlining from, say, hot loops). The MIR outliner should be able to undo bad inlining decisions. Since each cost model is different (and they’re all working on different code), no pass should ever undo all decisions made by another unless they all ended up being *bad* decisions. It might be possible to have each pass play together in a way that you end up with a nice balance between performance and size.

- Jessica

Sean Silva via llvm-dev

Jul 25, 2017, 6:32:36 PM
to Jessica Paquette, llvm-dev
On Tue, Jul 25, 2017 at 9:24 AM, Jessica Paquette <jpaq...@apple.com> wrote:
The two passes are pretty different in their approaches to congruency finding, so I don't think it helps to group them as though they were interchangeable "outliner technology". The two passes might be totally orthogonal.

I think that based off how River described his algorithm, the actual underlying method should be pretty similar. If it is, then we could probably compare the performance of the suffix tree vs. suffix array method and then create a general outliner that can be run at any level of representation. By that I mean that the actual suffix tree/array candidate search algorithm would be separate from the actual implementation of *outlining things*. The actual implementation at each level of representation would define

- How to outline a sequence of instructions
- An equivalence/congruence scheme for two instructions
- A cost model

I don’t think it’d be too difficult to split it up in that sort of way. I’d also like to experiment with pre-regalloc outlining in general, so it’d make it easier to explore that route as well.

There is some commonality, but the suffix tree/array candidate selection is only the first step in the IR-level outliner in River's patch, and the most interesting things it does happen after that. The result of the suffix tree/array candidate selection in River's patch is a set of sequences that are not guaranteed to be identical because relaxed congruence is used, and the code has to determine if they are in fact identical and parameterize any parameterizable differences.
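The shared first step — finding repeated runs over congruence-class IDs — can be sketched with a naive suffix array plus pairwise LCP. This is illustrative only: a production implementation would use a suffix tree or a linear-time suffix array construction rather than this O(n² log n) sort:

```python
def repeated_runs(seq, min_len=2):
    """Find repeated runs in 'seq' (the module flattened to integer
    congruence-class IDs) via a naive suffix array. Adjacent suffixes
    in sorted order with a long common prefix correspond to repeated
    instruction sequences, i.e. outlining candidates."""
    n = len(seq)
    sa = sorted(range(n), key=lambda i: seq[i:])  # naive suffix array
    runs = set()
    for a, b in zip(sa, sa[1:]):
        # longest common prefix of two adjacent suffixes
        lcp = 0
        while a + lcp < n and b + lcp < n and seq[a + lcp] == seq[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            runs.add(tuple(seq[a:a + lcp]))
    return runs
```

For example, in the ID stream [1, 2, 3, 9, 1, 2, 3] this reports the repeated runs (1, 2, 3) and (2, 3); everything after this step (verifying true equivalence, parameterizing differences) is where the IR outliner diverges from the MachineOutliner.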

This RFC/review should probably acknowledge the similarities with the MachineOutliner but actually focus more on the differences. I think we all agree that just running the current MachineOutliner's algorithm at IR level is not interesting. And indeed, River's patch isn't that.


There is definitely commonality and potentially reusability for the "repeated substring" aspect of candidate selection. It probably wouldn't be too difficult to factor out the MachineOutliner's suffix tree that operates on a simple std::vector<unsigned> and reuse that in River's patch, avoiding the need for the suffix array implementation (or vice versa, though it's probably best to move code around in tree at first than introduce new code). The entire function findOutliningOccurrences in River's patch smells a lot like MachineOutliner::buildCandidateList+pruneOverlaps. It would probably make sense to common at least the suffix tree/array in a first patch, and possibly the overlap pruning too. That would decrease the line count of CodeSizeOutliner.cpp in the patch and emphasize the differences from the MachineOutliner. 
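The greedy overlap pruning being compared here can be sketched as below. This is modeled loosely on the descriptions in the thread, not on either pass's actual code, and the candidate data layout is made up for illustration:

```python
def prune_overlaps(candidates):
    """Greedily keep the most beneficial candidates first. Each
    candidate is (benefit, [(start, end), ...]) over a flat
    instruction numbering; intervals are half-open. An occurrence is
    dropped if it overlaps one already claimed by a better candidate,
    and a candidate needs at least two surviving occurrences for the
    new function to pay for itself."""
    claimed, kept = [], []
    for benefit, occs in sorted(candidates, key=lambda c: -c[0]):
        survivors = [(s, e) for (s, e) in occs
                     if all(e <= cs or s >= ce for (cs, ce) in claimed)]
        if len(survivors) >= 2:
            claimed.extend(survivors)
            kept.append((benefit, survivors))
    return kept
```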

-- Sean Silva

Eric Christopher via llvm-dev

Jul 25, 2017, 9:08:27 PM
to Jessica Paquette, Quentin Colombet, llvm-dev
On Tue, Jul 25, 2017 at 9:59 AM Jessica Paquette via llvm-dev <llvm...@lists.llvm.org> wrote:
It might also be cool to see how various levels of outlining + the inliner would interact.

For example, say we did something like this

1. IR outline
2. Inline
3. MIR outline

The inliner should be capable of undoing bad outlining decisions (like outlining from, say, hot loops). The MIR outliner should be able to undo bad inlining decisions. Since each cost model is different (and they’re all working on different code), no pass should ever undo all decisions made by another unless they all ended up being *bad* decisions. It might be possible to have each pass play together in a way that you end up with a nice balance between performance and size.


This sounds pretty interesting honestly. This might be an good push/pull strategy for inlining and outlining where we can discover commonalities and performance enhancements over time as we proceed through compilation.

What do you think the best path forward here is Jessica?

-eric

River Riddle via llvm-dev

Jul 25, 2017, 9:26:29 PM
to Quentin Colombet, llvm-dev
Hey Quentin,
 I went ahead and implemented an initial generic interface for outlining and rewrote both passes to use it.

As for the equivalency:
 "I fail to see how we can differentiate say add x,x from add x,y."
We don't care about what the values are in the computation. If we have an example like that where its:
    add i32 x,x and add i32 x,y

We see that there are two inputs to this potential sequence. If they are external inputs then it's easy; if they come from within the region that is going to be extracted, then we verify that they come from the same internal location.
ex: 
add i32 5, 6   and   add i32 3, 7
We just parameterize the different operands:

outlinedFn(5, 6)    and  outlinedFn(3, 7)
outlinedFn(i32 a, i32 b)
  add i32 a, b

  Thanks,
River Riddle

River Riddle via llvm-dev

Jul 25, 2017, 10:56:48 PM
to Jessica Paquette, llvm-dev
Hey Jessica,
 I'm going to have to disagree with you on this. I don't want to place trust in the inliner to fix broken outlining decisions because it won't undo the entire transformation. The function call that gets inlined could be the one that actually made the transformation profitable in the first place. The inliner doesn't know of how inlining that call messes with the profitability of all of the other calls to that outlined function.

Conversely, I feel that a majority of the inlining similarities, due to simplification and lowering, will be lost by the time it reaches the MIR level. Combining that with issues of parameterization/multi basic block outlining/ etc, I feel that putting hope on the MIR outliner to catch bad inlining decisions is a bit too optimistic.

Although, I can think of two easy ways that the outliner could play nicely with the inliner:
 - We set it up similarly to the partial inliner. If we have profile guided outlining we can improve size by removing cold sections, as well as potentially make the function more profitable for inlining in general.
 - We could hit the case where we outline all calls to a particular function, leading to a last-call-to-static scenario.

Thanks,
 River Riddle

Quentin Colombet via llvm-dev

Jul 25, 2017, 11:31:30 PM
to River Riddle, llvm-dev
Got it, thanks!

Mehdi AMINI via llvm-dev

Jul 26, 2017, 1:36:21 AM
to Quentin Colombet, llvm-dev
2017-07-24 16:14 GMT-07:00 Quentin Colombet via llvm-dev <llvm...@lists.llvm.org>:
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,
 I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).

That's why we have thresholds though, don't we? 
Also the IR makes it easy to connect to PGO, which allows to focus the outlining on "cold" regions and preserve good performance.
River: did you consider this already? Having a good integration with PGO could make this part of the default optimization pipeline (i.e. having a mode where we outline only the knowingly "cold" code).


I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information. 

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit.

How is the inliner doing? How are we handling Oz there?
If we are fine living with approximation for the inliner, why wouldn't the same work for an outliner?

 

There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible(I can't build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.

It isn't clear to me that implementing it at the Machine level was the right trade-off in the first place.
I'm not sure a full comparative study was performed and discussed upstream at the time where the MachineIR outliner was implemented? If so it wouldn't be fair to ask this to River now.

-- 
Mehdi

Mehdi AMINI via llvm-dev

Jul 26, 2017, 1:39:05 AM
to Quentin Colombet, llvm-dev
Forgot to mention another aspect: doing this early may improve compile time, because we will reduce the amount of IR to codegen. 

-- 
Mehdi

River Riddle via llvm-dev

Jul 26, 2017, 1:43:27 AM
to Mehdi AMINI, llvm-dev
Hey Mehdi,
 I already have PGO support implemented. If we favor speed (Os) it only considers "cold" blocks. If we favor size (Oz) it considers any block that isn't "hot".
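That gating could look roughly like the sketch below. The threshold names and values are invented for illustration; the real pass would query LLVM's profile summary rather than take raw frequencies:

```python
def eligible_for_outlining(block_freq, hot_threshold, cold_threshold,
                           optimize_for_size):
    """Hypothetical PGO gate matching the description above: under Oz,
    outline anything that is not hot; under Os, only provably cold
    blocks. Frequencies and thresholds are relative block weights."""
    if optimize_for_size:                 # Oz: skip only hot blocks
        return block_freq < hot_threshold
    return block_freq <= cold_threshold   # Os: cold blocks only
```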

I've also noticed the improved compile time on quite a few of the tests when running with early outlining enabled.

Thanks,
 River Riddle

Quentin Colombet via llvm-dev

Jul 26, 2017, 12:31:22 PM
to Mehdi AMINI, llvm-dev
On Jul 25, 2017, at 10:36 PM, Mehdi AMINI <joke...@gmail.com> wrote:



2017-07-24 16:14 GMT-07:00 Quentin Colombet via llvm-dev <llvm...@lists.llvm.org>:
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,
 I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).

That's why we have thresholds though, don't we? 

When I see a threshold, I think magic number, and I don’t like that.

Also the IR makes it easy to connect to PGO, which allows to focus the outlining on "cold" regions and preserve good performance.
River: did you consider this already? Having a good integration with PGO could make this part of the default optimization pipeline (i.e. having a mode where we outline only the knowingly "cold" code).


I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information. 

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit.

How is the inliner doing? How are we handling Oz there?
If we are fine living with approximation for the inliner, why wouldn't the same work for an outliner?

Unlike inlining, outlining does not expose optimization opportunities.


 

There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible(I can't build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.

It isn't clear to me that implementing it at the MachineLevel was the right trade-off in the first place.

Fair enough. It has the advantage of not relying on heuristics for its cost model, though.

I'm not sure a full comparative study was performed and discussed upstream at the time where the MachineIR outliner was implemented? If so it wouldn't be fair to ask this to River now.

I am not asking that :).


-- 
Mehdi


Mehdi AMINI via llvm-dev

Jul 26, 2017, 12:36:30 PM
to Quentin Colombet, llvm-dev
2017-07-26 9:31 GMT-07:00 Quentin Colombet <qcol...@apple.com>:

On Jul 25, 2017, at 10:36 PM, Mehdi AMINI <joke...@gmail.com> wrote:



2017-07-24 16:14 GMT-07:00 Quentin Colombet via llvm-dev <llvm...@lists.llvm.org>:
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,
 I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).

That's why we have thresholds though, don't we? 

When I see a threshold, I think magic number, and I don’t like that.

Fair, but heuristics are the best we have when we don't want to optimize for a single metric or can't have a perfect model of the world.
 

Also the IR makes it easy to connect to PGO, which allows to focus the outlining on "cold" regions and preserve good performance.
River: did you consider this already? Having a good integration with PGO could make this part of the default optimization pipeline (i.e. having a mode where we outline only the knowingly "cold" code).


I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information. 

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit.

How is the inliner doing? How are we handling Oz there?
If we are fine living with approximation for the inliner, why wouldn't the same work for an outliner?

Unlike inlining, outlining does not expose optimization opportunities.

I would expect that getting the cold code out of the way would help with locality / caching of the hot path. I remember Amaury even working on putting cold *basic blocks* in a different section without outlining them into a function.

But I guess what you mean is that as long as we're focusing solely on getting the smallest possible binary ever, you may be closer to "perfect modeling" very late in the pipeline.


There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible(I can't build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.

It isn't clear to me that implementing it at the MachineLevel was the right trade-off in the first place.

Fair enough. It has the advantage of not relying on heuristics for its cost model, though.

I'm not sure a full comparative study was performed and discussed upstream at the time where the MachineIR outliner was implemented? If so it wouldn't be fair to ask this to River now.

I am not asking that :).

OK great :)


-- 
Mehdi
 

Quentin Colombet via llvm-dev

Jul 26, 2017, 1:11:17 PM
to Mehdi AMINI, llvm-dev
On Jul 26, 2017, at 9:36 AM, Mehdi AMINI <joke...@gmail.com> wrote:



2017-07-26 9:31 GMT-07:00 Quentin Colombet <qcol...@apple.com>:

On Jul 25, 2017, at 10:36 PM, Mehdi AMINI <joke...@gmail.com> wrote:



2017-07-24 16:14 GMT-07:00 Quentin Colombet via llvm-dev <llvm...@lists.llvm.org>:
Hi River,

On Jul 24, 2017, at 2:36 PM, River Riddle <riddl...@gmail.com> wrote:

Hi Quentin,
 I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.

If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).

That's why we have thresholds though, don't we? 

When I see a threshold, I think magic number, and I don’t like that.

Fair, but heuristics are the best we have when we don't want to optimize for a single metric or can't have a perfect model of the world.
 

Also the IR makes it easy to connect to PGO, which allows to focus the outlining on "cold" regions and preserve good performance.
River: did you consider this already? Having a good integration with PGO could make this part of the default optimization pipeline (i.e. having a mode where we outline only the knowingly "cold" code).


I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information. 

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit.

How is the inliner doing? How are we handling Oz there?
If we are fine living with approximation for the inliner, why wouldn't the same work for an outliner?

Unlike inlining, outlining does not expose optimization opportunities.

I would expect that getting the cold code out of the way would help with locality / caching of the hot path. I remember Amaury even working on putting cold *basic blocks* in a different section without outlining them into a function.

But I guess what you mean is that as long as we're focusing solely on getting the smallest possible binary ever, you may be closer to "perfect modeling" very late in the pipeline.

No, I mean in terms of enabling other optimizations in the pipeline, like the vectorizer. The outliner does not expose any of that.

Sanjoy Das via llvm-dev

Jul 26, 2017, 3:07:24 PM
to Quentin Colombet, llvm-dev
Hi,

On Wed, Jul 26, 2017 at 10:10 AM, Quentin Colombet via llvm-dev
<llvm...@lists.llvm.org> wrote:
> No, I mean in terms of enabling other optimizations in the pipeline like
> vectorizer. Outliner does not expose any of that.

I have not made a lot of effort to understand the full discussion here (so what
I say below may be off-base), but I think there are some cases where outlining
(especially working with function-attrs) can make optimization easier.

It can help transforms that duplicate code (like loop unrolling and inlining) be
more profitable -- I'm thinking of cases where unrolling/inlining would have to
duplicate a lot of code, but after outlining would require duplicating only a
few call instructions.


It can help EarlyCSE do things that require GVN today:

void foo() {
  ... complex computation ...
  ... the same complex computation ...
}

outlining =>

int func() { ... }

void foo() {
  int x = func();
  int y = func();
}

functionattrs =>

int func() readnone { ... }

void foo() {
  int x = func();
  int y = func();
}

earlycse =>

void foo() {
  int x = func();
  int y = x;
}

GVN will catch this, but EarlyCSE is (at least supposed to be!) cheaper.


Once we have an analysis that can prove that certain functions can't trap,
outlining can allow LICM etc. to speculate entire outlined regions out of loops.


Generally, I think outlining exposes information that certain regions of the
program are doing identical things. We should expect to get some mileage out of
this information.

-- Sanjoy

Sean Silva via llvm-dev

Jul 26, 2017, 3:54:16 PM
to Sanjoy Das, llvm-dev
The way I interpret Quentin's statement is something like:

- Inlining turns an interprocedural problem into an intraprocedural problem
- Outlining turns an intraprocedural problem into an interprocedural problem

Insofar as our intraprocedural analyses and transformations are strictly more powerful than interprocedural, then there is a precise sense in which inlining exposes optimization opportunities while outlining does not.

Actually, for his internship last summer River wrote a profile-guided outliner / partial inliner (it didn't try to do deduplication; so it was more like PartialInliner.cpp). IIRC he found that LLVM's interprocedural analyses were so bad that there were pretty adverse effects from many of the outlining decisions. E.g. if you outline from the left side of a diamond, that side basically becomes a black box to most LLVM analyses and forces downstream dataflow meet points to give an overly conservative result, even though our standard intraprocedural analyses would have happily dug through the left side of the diamond if the code had not been outlined.

Also, River's patch (the one in this thread) does parameterized outlining. For example, two sequences containing stores can be outlined even if the corresponding stores have different pointers. The pointer to be stored to is passed as a parameter to the outlined function. In that sense, the outlined function's behavior becomes a conservative approximation of both, which in principle loses precision.
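The parameterization idea can be sketched on a toy three-address IR. This is only an illustration under assumed names (not code from River's patch): two matching sequences whose instructions agree except in some operands are merged into one outlined body, and each differing operand is promoted to a parameter.

```python
# Sketch of parameterized outlining: merge two congruent instruction
# sequences, turning every operand where they differ into a parameter.
def outline_pair(seq_a, seq_b):
    """Each seq is a list of (op, operands). Returns (body, args_a, args_b),
    where body uses 'param<i>' placeholders for the differing operands and
    args_a/args_b are the arguments each call site must pass."""
    assert len(seq_a) == len(seq_b)
    body, args_a, args_b = [], [], []
    for (op_a, ops_a), (op_b, ops_b) in zip(seq_a, seq_b):
        assert op_a == op_b and len(ops_a) == len(ops_b)
        new_ops = []
        for x, y in zip(ops_a, ops_b):
            if x == y:
                new_ops.append(x)
            else:
                # Differing operand: promote it to a parameter.
                param = f"param{len(args_a)}"
                args_a.append(x)
                args_b.append(y)
                new_ops.append(param)
        body.append((op_a, new_ops))
    return body, args_a, args_b

# Two stores that differ only in the pointer stored to:
a = [("store", ["%val", "%p"])]
b = [("store", ["%val", "%q"])]
body, args_a, args_b = outline_pair(a, b)
print(body)             # one shared body, pointer abstracted to param0
print(args_a, args_b)   # each call site passes its own pointer
```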


I like your EarlyCSE example and it is interesting that combined with functionattrs it can make a "cheap" pass get a transformation that an "expensive" pass would otherwise be needed. Are there any cases where we only have the "cheap" pass and thus the outlining would be essential for our optimization pipeline to get the optimization right?

The case that comes to mind for me is cases where we have some cutoff of search depth. Reducing a sequence to a single call (+ functionattr inference) can essentially summarize the sequence and effectively increase search depth, which might give more results. That seems like a bit of a weak example though. 

-- Sean Silva

Sanjoy Das via llvm-dev

unread,
Jul 26, 2017, 4:41:24 PM7/26/17
to Sean Silva, llvm-dev
Hi,

On Wed, Jul 26, 2017 at 12:54 PM, Sean Silva <chiso...@gmail.com> wrote:
> The way I interpret Quentin's statement is something like:
>
> - Inlining turns an interprocedural problem into an intraprocedural problem
> - Outlining turns an intraprocedural problem into an interprocedural problem
>
> Insofar as our intraprocedural analyses and transformations are strictly
> more powerful than our interprocedural ones, there is a precise sense in
> which inlining exposes optimization opportunities while outlining does not.

While I think our intra-proc optimizations are *generally* more
powerful, I don't think they are *always* more powerful. For
instance, LICM (today) won't hoist full regions but it can hoist
single function calls. If we can extract out a region into a
readnone+nounwind function call then LICM will hoist it to the
preheader if the safety checks pass.

> Actually, for his internship last summer River wrote a profile-guided
> outliner / partial inliner (it didn't try to do deduplication; so it was
> more like PartialInliner.cpp). IIRC he found that LLVM's interprocedural
> analyses were so bad that there were pretty adverse effects from many of the
> outlining decisions. E.g. if you outline from the left side of a diamond,
> that side basically becomes a black box to most LLVM analyses and forces
> downstream dataflow meet points to give an overly conservative result, even
> though our standard intraprocedural analyses would have happily dug through
> the left side of the diamond if the code had not been outlined.
>
> Also, River's patch (the one in this thread) does parameterized outlining.
> For example, two sequences containing stores can be outlined even if the
> corresponding stores have different pointers. The pointer to be stored to
> is passed as a parameter to the outlined function. In that sense, the
> outlined function's behavior becomes a conservative approximation of both
> which in principle loses precision.

Can we outline only once we've already done all of these optimizations
that outlining would block?

> I like your EarlyCSE example and it is interesting that combined with
> functionattrs it can make a "cheap" pass get a transformation that an
> "expensive" pass would otherwise be needed. Are there any cases where we
> only have the "cheap" pass and thus the outlining would be essential for our
> optimization pipeline to get the optimization right?
>
> The case that comes to mind for me is cases where we have some cutoff of
> search depth. Reducing a sequence to a single call (+ functionattr
> inference) can essentially summarize the sequence and effectively increase
> search depth, which might give more results. That seems like a bit of a weak
> example though.

I don't know if River's patch outlines entire control flow regions at
a time, but if it does then we could use cheap basic block scanning
analyses for things that would normally require CFG-level analysis.

-- Sanjoy

River Riddle via llvm-dev

unread,
Jul 26, 2017, 4:52:37 PM7/26/17
to Sanjoy Das, llvm-dev
Hey Sanjoy,
  
On Wed, Jul 26, 2017 at 1:41 PM, Sanjoy Das via llvm-dev <llvm...@lists.llvm.org> wrote:
Hi,

 
  The outliner is able to run at any point in the interprocedural pipeline. There are currently two locations: early outlining (pre-inliner) and late outlining (practically the last pass to run). It is configured to run either early+late or just late.


> I like your EarlyCSE example and it is interesting that combined with
> functionattrs it can make a "cheap" pass get a transformation that an
> "expensive" pass would otherwise be needed. Are there any cases where we
> only have the "cheap" pass and thus the outlining would be essential for our
> optimization pipeline to get the optimization right?
>
> The case that comes to mind for me is cases where we have some cutoff of
> search depth. Reducing a sequence to a single call (+ functionattr
> inference) can essentially summarize the sequence and effectively increase
> search depth, which might give more results. That seems like a bit of a weak
> example though.

I don't know if River's patch outlines entire control flow regions at
a time, but if it does then we could use cheap basic block scanning
analyses for things that would normally require CFG-level analysis.

  The current patch just supports outlining from within a single block. Although I had a working prototype for region-based outlining, I kept it out of this patch for simplicity. So it's entirely possible to add that kind of functionality; I've already tried it.
Thanks,
  River Riddle

Quentin Colombet via llvm-dev

unread,
Jul 27, 2017, 12:59:34 AM7/27/17
to Sanjoy Das, llvm-dev

That’s a fair point.
Using outlining is one possible way of doing that, indeed.
We could get that out of some analysis or improve the different passes, because generally speaking I don't think you want to add call indirections in performance-sensitive areas.

Chris Bieneman via llvm-dev

unread,
Jul 29, 2017, 12:58:36 AM7/29/17
to River Riddle, llvm-dev
Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?

Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
  1. IR outline
  2. Inline
  3. MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.

I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different? 

Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog. When someone comes to the community with something that has no existing in-tree analog it isn't fair to necessarily ask them to implement it multiple different ways to prove their solution is the best. However, as a community, we do still exercise the right to reject contributions we disagree with, and we frequently request changes to the implementation (as is shown every time someone tries to add SPIR-V support).

In the LLVM community we have a long history of approaching large contributions (especially ones from new contributors) with scrutiny and discussion. It would be a disservice to the project to forget that.

River, as a last note. I see that you've started uploading patches to Phabricator, and I know you're relatively new to the community. When uploading patches it helps to include appropriate reviewers so that the right people see the patches as they come in. To that end can you please include Jessica as a reviewer? Given her relevant domain experience I think her feedback on the patches will be very valuable.

Thank you,
-Chris

River Riddle via llvm-dev

unread,
Jul 29, 2017, 1:33:24 AM7/29/17
to Chris Bieneman, llvm-dev
Hi Chris,

It's okay to put me on the spot; posting the patches was meant to help further the discussion that had stalled previously.

On Fri, Jul 28, 2017 at 9:58 PM, Chris Bieneman <be...@apple.com> wrote:
Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?

 Regarding the words "we" and "our": I am referring to myself. My writing style tends to shift between those usages. I'll avoid any kind of confusion in the future.


Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
  1. IR outline
  2. Inline
  3. MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.
 
I dismissed the idea of an outliner at the machine level being able to catch bad inlining decisions. Given the loss of information between the two levels, I felt it was a little optimistic to rely on a very late pass being able to reverse those decisions, especially coupled with the fact that the current machine outliner requires exact equivalence. I don't disagree with the proposed outline, inline, outline pipeline, but the idea of catching bad inlining decisions under those circumstances seemed optimistic to me. From there I went ahead and implemented a generic interface for outlining that can be shared between the IR/Machine levels so that such a pipeline could be more feasible.
 

I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different? 
 
The real question comes from what exactly you want to define as a "comparative analysis". When posting the patch I included additional performance data (found here: goo.gl/5k6wsP) that includes benchmarking and comparisons between the outliner that I am proposing and the machine outliner on a wide variety of benchmarks. The proposed outliner performs quite favorably in comparison. As for feature comparison, the proposed outliner has many features currently missing from the machine outliner:
 - parameterization
 - outputs
 - relaxed equivalence (the machine outliner requires exact)
 - usage of profile data
 - support for opt remarks

 The machine outliner currently only supports X86 and AArch64; the IR outliner can/should support all targets immediately without the requirement of ABI restrictions (-mno-red-zone is required for the machine outliner).
 At the IR level we have much more opportunity to find congruent instructions than at the machine level given the possible variation at that level: RA, instruction selection, instruction scheduling, etc.


I am more than willing to do a comparative analysis but I'm not quite sure what the expectation for one is.


Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog. When someone comes to the community with something that has no existing in-tree analog it isn't fair to necessarily ask them to implement it multiple different ways to prove their solution is the best. However, as a community, we do still exercise the right to reject contributions we disagree with, and we frequently request changes to the implementation (as is shown every time someone tries to add SPIR-V support).
 
I perfectly agree :) 


In the LLVM community we have a long history of approaching large contributions (especially ones from new contributors) with scrutiny and discussion. It would be a disservice to the project to forget that.

River, as a last note. I see that you've started uploading patches to Phabricator, and I know you're relatively new to the community. When uploading patches it helps to include appropriate reviewers so that the right people see the patches as they come in. To that end can you please include Jessica as a reviewer? Given her relevant domain experience I think her feedback on the patches will be very valuable.
 
I accidentally posted without any reviewers at first; I've been going back through and adding people I missed.


Thank you,
-Chris
I appreciate the feedback and welcome all critical discussion about the right way to move forward.
Thanks,
 River Riddle

Evgeny Astigeevich via llvm-dev

unread,
Jul 31, 2017, 11:47:13 AM7/31/17
to River Riddle, Chris Bieneman, llvm-dev, nd

Hi Chris,

 

> One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:

> 1. IR outline

> 2. Inline

> 3. MIR outline

 

IMHO, there is no need to restrict the place of the outliner in the pipeline at the moment. I hope people representing different architectures will try different configurations and the best will be chosen. I'd like to try the pipeline configuration:

 

1. Inline
2. IR optimizations
3. IR outline
4. MIR optimizations
5. MIR outline

 

I think this configuration allows applying as many IR optimizations as possible, especially those which reduce code size, and then extracting commonly used code into functions. I am also interested in some kind of -Oz LTO with the IR outliner enabled.

 

Evgeny Astigeevich

Senior Compiler Engineer

Compilation Tools

ARM

Eric Christopher via llvm-dev

unread,
Jul 31, 2017, 12:55:52 PM7/31/17
to River Riddle, Chris Bieneman, llvm-dev
Hi River,


Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
  1. IR outline
  2. Inline
  3. MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.
 
I dismissed the idea of an outliner at the machine level being able to catch bad inlining decisions. Given the loss of information between the two levels, I felt it was a little optimistic to rely on a very late pass being able to reverse those decisions, especially coupled with the fact that the current machine outliner requires exact equivalence. I don't disagree with the proposed outline, inline, outline pipeline, but the idea of catching bad inlining decisions under those circumstances seemed optimistic to me. From there I went ahead and implemented a generic interface for outlining that can be shared between the IR/Machine levels so that such a pipeline could be more feasible.

Honestly, given that the owner of the outlining code was suggesting this path, I don't think you should unilaterally make this decision without a concrete reason.
 
 

I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair. If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different? 
 
The real question comes from what exactly you want to define as a "comparative analysis". When posting the patch I included additional performance data (found here: goo.gl/5k6wsP) that includes benchmarking and comparisons between the outliner that I am proposing and the machine outliner on a wide variety of benchmarks. The proposed outliner performs quite favorably in comparison. As for feature comparison, the proposed outliner has many features currently missing from the machine outliner:
 - parameterization
 - outputs
 - relaxed equivalence (the machine outliner requires exact)
 - usage of profile data
 - support for opt remarks

 The machine outliner currently only supports X86 and AArch64; the IR outliner can/should support all targets immediately without the requirement of ABI restrictions (-mno-red-zone is required for the machine outliner).
 At the IR level we have much more opportunity to find congruent instructions than at the machine level given the possible variation at that level: RA, instruction selection, instruction scheduling, etc.

These are all theoretical advantages and quite compelling; however, numbers are important and I think we should see them.
 
In the LLVM community we have a long history of approaching large contributions (especially ones from new contributors) with scrutiny and discussion. It would be a disservice to the project to forget that.

River, as a last note. I see that you've started uploading patches to Phabricator, and I know you're relatively new to the community. When uploading patches it helps to include appropriate reviewers so that the right people see the patches as they come in. To that end can you please include Jessica as a reviewer? Given her relevant domain experience I think her feedback on the patches will be very valuable.
 
I accidentally posted without any reviewers at first; I've been going back through and adding people I missed.


Last I checked you had still not added Jessica here. I think for design and future decisions here she should be added and be considered one of the prime reviewers of this effort.

-eric

Eric Christopher via llvm-dev

unread,
Jul 31, 2017, 12:57:24 PM7/31/17
to Evgeny Astigeevich, River Riddle, Chris Bieneman, llvm-dev, nd
Hi Evgeny,

On Mon, Jul 31, 2017 at 8:47 AM Evgeny Astigeevich via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi Chris,

 

> One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:

> 1. IR outline

> 2. Inline

> 3. MIR outline

 

IMHO, there is no need to restrict a place of the Outliner in the pipeline at the moment. I hope people representing different architectures will try different configurations and the best will be chosen. I’d like to try the pipeline configuration:

 


This is largely irrelevant to the discussion at hand. The original thoughts were about one or the other and Jessica has rightly pointed out (which you seem to agree with) that there's room for both in the pipeline.

Thanks.

-eric

Evgeny Astigeevich via llvm-dev

unread,
Jul 31, 2017, 1:12:45 PM7/31/17
to Eric Christopher, llvm-dev, nd

Hi Eric,


Thank you for feedback. I must apologise if I have caused any concern or offence.


-Evgeny



Eric Christopher via llvm-dev

unread,
Jul 31, 2017, 1:14:24 PM7/31/17
to Evgeny Astigeevich, llvm-dev, nd
Hi Evgeny,

Absolutely not at all. I think it's exciting that everyone wants to use the outlining support :)

-eric

Mehdi AMINI via llvm-dev

unread,
Aug 1, 2017, 1:38:40 AM8/1/17
to Chris Bieneman, llvm-dev
2017-07-28 21:58 GMT-07:00 Chris Bieneman via llvm-dev <llvm...@lists.llvm.org>:
Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?

Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
  1. IR outline
  2. Inline
  3. MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.

It isn't clear to me how much the *exact* pipeline and ordering of passes is relevant when considering whether "having an outliner at the IR level" is a good idea.

 
I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair.
If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different? 

It seems quite different to me because there is no outliner at the IR level, and so they don't provide the same functionality. The "Why at the IR level" section of the original email, combined with the performance numbers, seems largely sufficient to me to explain why it isn't redundant to the Machine-level outliner.
I'd consider this work for inclusion upstream purely on its technical merit at this point.
Discussing inclusion as part of any of the default pipeline is a different story.

Similarly, last year the IR-level PGO was also implemented even though we already had a PGO implementation, because 1) it provided a generic solution for other frontends (just like here it could be said that it provides a generic solution for targets) and 2) it supported cases that FE-PGO didn't (especially around better counter context using pre-inlining and such).

 

Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog.

We still usually discuss design extensively. Skipping the IR-level option didn't seem obvious to me, to say the least. And it wasn't really much discussed/considered extensively upstream.
If the idea is that implementing a concept at the machine level may preclude a future implementation at the IR level, it means we should be *a lot* more picky before accepting such contribution. 
In this case, if I had anticipated any push-back on an IR-level implementation only based on the fact that we have now a Machine-level one, I'd likely have pushed back on the machine-level one.

 
When someone comes to the community with something that has no existing in-tree analog it isn't fair to necessarily ask them to implement it multiple different ways to prove their solution is the best.

It may or may not be fair, but there is a tradeoff in how much effort we would require them to convince the community that this is *the* right way to go, depending on what it implies for future approaches.

-- 
Mehdi

Sean Silva via llvm-dev

unread,
Aug 1, 2017, 4:03:03 AM8/1/17
to Mehdi AMINI, llvm-dev


On Jul 31, 2017 10:38 PM, "Mehdi AMINI via llvm-dev" <llvm...@lists.llvm.org> wrote:


2017-07-28 21:58 GMT-07:00 Chris Bieneman via llvm-dev <llvm...@lists.llvm.org>:
Apologies for delayed joining of this discussion, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?

Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
  1. IR outline
  2. Inline
  3. MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.

It isn't clear to me how much the *exact* pipeline and ordering of passes is relevant when considering whether "having an outliner at the IR level" is a good idea.

 
I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair.
If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different? 

It seems quite different to me because there is no outliner at the IR level, and so they don't provide the same functionality. The "Why at the IR level" section of the original email, combined with the performance numbers, seems largely sufficient to me to explain why it isn't redundant to the Machine-level outliner.
I'd consider this work for inclusion upstream purely on its technical merit at this point.
Discussing inclusion as part of any of the default pipeline is a different story.

Similarly, last year the IR-level PGO was also implemented even though we already had a PGO implementation, because 1) it provided a generic solution for other frontends (just like here it could be said that it provides a generic solution for targets) and 2) it supported cases that FE-PGO didn't (especially around better counter context using pre-inlining and such).

 

Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog.

We still usually discuss design extensively. Skipping the IR-level option didn't seem obvious to me, to say the least. And it wasn't really much discussed/considered extensively upstream.


I think Quentin described it pretty well in a reply to the original RFC:


"
The other part is that at the LLVM IR level, or before register allocation, identifying similar code sequences is much harder, at least with a suffix-tree-like algorithm. Basically the problem is how we name our instructions such that we can match them.
Let me take an example.
foo() {
/* bunch of code */
a = add b, c;
d = add e, f; 
}

bar() {
d = add e, g;
f = add c, w;
}

With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details but it gets tricky quickly. We were thinking of reusing GVN to have such identifier if we wanted to do the outlining at IR level but solving this problem is hard.
"

The pass in this RFC solves this problem to allow using a suffix tree/array type algorithm (string algorithm) on a dataflow graph like IR or pre-RA MIR. It doesn't do it by producing value numbers based on an exact congruence relation to feed into the string algorithms though (and I think that is provably impossible except post-RA; I can elaborate if anyone is interested). Instead it uses a relaxed congruence relation for the suffix tree/array to find potential repeated sequences (that may not in fact be exactly congruent). Then further steps perform exact congruence checks on the found sequences along with parameterizing parameterizable differences.
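The candidate-finding step described above can be sketched very roughly in a few lines. This is a toy, assuming the simplest possible relaxed fingerprint (just the opcode) and a naive suffix-array construction; the real pass uses richer congruence classes (opcode, type, operand types) and an efficient suffix array plus LCP array, and then follows up with exact congruence checks.

```python
# Toy candidate finder: map instructions to relaxed fingerprints, sort the
# suffixes of the fingerprint sequence, and read repeated runs off the LCP
# values between adjacent suffixes. Naive O(n^2 log n); real suffix-array
# construction is O(n log n) or better.
def find_repeats(fingerprints, min_len=2):
    n = len(fingerprints)
    sa = sorted(range(n), key=lambda i: fingerprints[i:])  # suffix array
    repeats = set()
    for a, b in zip(sa, sa[1:]):  # adjacent suffixes share the longest prefixes
        lcp = 0
        while (a + lcp < n and b + lcp < n
               and fingerprints[a + lcp] == fingerprints[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            repeats.add(tuple(fingerprints[a:a + lcp]))
    return repeats

# Two basic blocks flattened into one sequence; "#0" is a unique separator
# so that matches cannot cross block boundaries.
fps = ["add", "mul", "store", "#0", "add", "mul", "store"]
print(find_repeats(fps))  # the repeated add/mul/store run is a candidate
```

The relaxed fingerprint is what makes the string reduction possible at the IR level: sequences that hash to the same fingerprint run are only *potentially* outlinable, and the later analysis/pruning stages reject or parameterize the ones that are not exactly congruent.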

Admittedly, I don't think this has come across well in River's posts. I've been working offline with him to help him rework his approach to this RFC and how to work with the community more idiomatically. I'm hoping he'll be able to successfully reboot this RFC as I think this is a very neat algorithm.


Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time. I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.
5. The actual problem the outliner is trying to solve is actually more like finding subgraphs that are isomorphic, making the problem even harder (something like "given dags A and B does there exist a subgraph of A that is isomorphic to a subgraph of B")
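Step 1 of the argument can be made concrete with a small hypothetical sketch (the DAG encoding and function names are invented): each DAG node becomes one SSA-style instruction, emitted in topological order, which yields an isomorphic data flow graph:

```python
# Hypothetical sketch: map an arbitrary operation DAG to an isomorphic data
# flow graph in an SSA-like IR, modeled as (result, opcode, operands) tuples
# emitted in topological order (each value is defined before any use).
def dag_to_ssa(ops, uses):
    """ops: {node: opcode}; uses: {node: [operand nodes]} describing a DAG."""
    emitted, program = set(), []
    def visit(node):
        if node in emitted:
            return
        for operand in uses.get(node, []):
            visit(operand)          # define operands first
        emitted.add(node)
        program.append((node, ops[node], tuple(uses.get(node, ()))))
    for node in ops:
        visit(node)
    return program

ops = {"a": "input", "b": "input", "c": "add", "d": "mul"}
uses = {"c": ["a", "b"], "d": ["c", "a"]}
for instr in dag_to_ssa(ops, uses):
    print(instr)   # one SSA instruction per DAG node
```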

So some sort of compromise is needed.

The reduction of the problem from a graph problem to a string problem is a way to work around this. We sacrifice some code folding opportunities due to the particular order in which the instructions were linearized into a string. Or, to put it another way, commuting instructions could reveal better code folding opportunities to such string algorithms, but finding the optimal order in which to commute them to reveal such opportunities is GI-hard. (And I think it is interesting future work to see whether heuristically reordering instructions can expose more opportunities to string-based code folding outliners. For example, one can imagine a pass that tries to canonicalize prologue or call-setup sequences to promote code folding by our post-RA MachineOutliner.)
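To illustrate the ordering sensitivity (a hypothetical example; the opcode sequences are invented), commuting two independent instructions in one block shortens the longest matching run a string-based matcher can find, even though the computations are equivalent:

```python
# Hypothetical illustration: two blocks compute the same values, but two
# independent instructions were linearized in a different order, so the
# opcode strings share a shorter common run and a string-based outliner
# sees a smaller (or no) candidate.
def longest_common_run(a, b):
    """Length of the longest contiguous run shared by sequences a and b."""
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best

seq_a = ["load", "add", "mul", "store"]  # block 1 linearization
seq_b = ["add", "load", "mul", "store"]  # block 2: load/add commuted

print(longest_common_run(seq_a, seq_a))  # 4: identical order matches fully
print(longest_common_run(seq_a, seq_b))  # 2: the commute halves the match
```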




-- Sean Silva

Andrey Bokhanko via llvm-dev

unread,
Aug 1, 2017, 4:07:54 AM8/1/17
to Mehdi AMINI, llvm-dev
All,

+1 to what Mehdi said.

It's a fair concern to question whether we need yet another outlining pass. I believe this concern has been addressed by River -- both with theoretical arguments and practical data (benchmark numbers).

Jessica's pipeline proposal is completely orthogonal. It's not fair to request River to implement / fit into what she suggested. Sure, it's a valid topic to discuss -- but a completely orthogonal one. If anything, accepting River's implementation would enable us to do experiments / developments like pipeline changes of this ilk!

Yours,
Andrey
===
Compiler Architect
NXP

Andrey Bokhanko via llvm-dev

unread,
Aug 1, 2017, 4:18:02 AM8/1/17
to Mehdi AMINI, llvm-dev
...and adding $0.02 to the "IR outline + inline + MIR outline" idea, my gut feeling (yes, only a "feeling" -- and one coming from my gut, not head!) is that inlining correcting wrong IR outlining decisions with MIR outlining correcting wrong inlining decisions is absolutely unrealistic and a heuristics nightmare at best.

Inliner's heuristics are already complex enough and not 100% bulletproof; if we add IR outliner heuristics to the mix -- and then just a little bit of MIR outliner heuristics (which are more precise but, as demonstrated above, not 100% precise as well) on top... you can imagine.

Yours,
Andrey

Sean Silva via llvm-dev

unread,
Aug 1, 2017, 5:53:47 AM8/1/17
to Andrey Bokhanko, llvm-dev


On Aug 1, 2017 1:18 AM, "Andrey Bokhanko via llvm-dev" <llvm...@lists.llvm.org> wrote:
...and adding $0.02 to the "IR outline + inline + MIR outline" idea, my gut feeling (yes, only a "feeling" -- and one coming from my gut, not head!) is that inlining correcting wrong IR outlining decisions with MIR outlining correcting wrong inlining decisions is absolutely unrealistic and a heuristics nightmare at best.

Inliner's heuristics are already complex enough and not 100% bulletproof; if we add IR outliner heuristics to the mix -- and then just a little bit of MIR outliner heuristics (which are more precise but, as demonstrated above, not 100% precise as well)

The post-RA MachineOutliner can in principle have a perfect cost model (and it's already pretty close I think).

However, the post-RA MachineOutliner would nonetheless have a hard time undoing an inlining decision because it would have to get lucky with the register assignment matching at different inlining sites. (same applies to scheduling decisions)

I can't think of any plausible way to avoid that sort of problem. A pre-RA MIR outliner could avoid needing to get lucky on the register assignment at least, but any string-based algorithm will inherently be thwarted by different scheduling.

-- Sean Silva

Sean Silva via llvm-dev

unread,
Aug 1, 2017, 6:16:50 AM8/1/17
to Andrey Bokhanko, llvm-dev


On Aug 1, 2017 2:53 AM, "Sean Silva" <chiso...@gmail.com> wrote:


On Aug 1, 2017 1:18 AM, "Andrey Bokhanko via llvm-dev" <llvm...@lists.llvm.org> wrote:
...and adding $0.02 to the "IR outline + inline + MIR outline" idea, my gut feeling (yes, only a "feeling" -- and one coming from my gut, not head!) is that inlining correcting wrong IR outlining decisions with MIR outlining correcting wrong inlining decisions is absolutely unrealistic and a heuristics nightmare at best.

Inliner's heuristics are already complex enough and not 100% bulletproof; if we add IR outliner heuristics to the mix -- and then just a little bit of MIR outliner heuristics (which are more precise but, as demonstrated above, not 100% precise as well)

The post-RA MachineOutliner can in principle have a perfect cost model (and it's already pretty close I think).

However, the post-RA MachineOutliner would nonetheless have a hard time undoing an inlining decision because it would have to get lucky with the register assignment matching at different inlining sites. (same applies to scheduling decisions)

I can't think of any plausible way to avoid that sort of problem. A pre-RA MIR outliner could avoid needing to get lucky on the register assignment at least, but any string-based algorithm will inherently be thwarted by different scheduling.

And I think that at least with SelectionDAG we don't really have a notion of "pre-sched" MIR. I'm not super familiar with backend stuff though. I assume GlobalISel must have pre-sched MIR available.

-- Sean Silva

Jessica Paquette via llvm-dev

unread,
Aug 1, 2017, 12:15:58 PM8/1/17
to Andrey Bokhanko, llvm-dev
I think I agree with this sentiment (and the rather heated response to this suggestion says that I ought to as well!)

The later pass would have to have a strong guarantee to “undo” bad decisions, which, given that we have to use heuristics to make code size decisions, is very difficult. I don’t know if I’d say *impossible*, but it would definitely force us to work in a more restrictive space. I was mostly just thinking of how the two separate technologies might work together in practice.

For example, in “Function outlining and partial inlining” by Peng Zhao and J. N. Amaral, they found that you could get partial inlining for free by enabling inlining and outlining at the IR-level. What I think would be cool to explore is whether or not there would be a similar relationship between IR-level outlining and MIR-level outlining. Of course, this is just a suggestion!

Daniel Berlin via llvm-dev

unread,
Aug 1, 2017, 1:28:33 PM8/1/17
to Sean Silva, llvm-dev


Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time.
 
I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR
 
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;)

All that's been shown is that you can reduce it to a hard problem.
You can also reduce it to 3-sat., but that doesn't mean anything unless you can reduce 3-sat to it.

5. The actual problem the outliner is trying to solve is actually more like finding subgraphs that are isomorphic, making the problem even harder (something like "given dags A and B does there exist a subgraph of A that is isomorphic to a subgraph of B")


This assumes, strongly, that this reduction is the way to do it, and also that SSA/etc imparts no structure in the reduction that enables you to solve it faster (IE restricts the types of graphs, etc)
 
FWIW, the exact argument above holds for value numbering, and since the days of Kildall, it was believed not to be solvable in a complete fashion in less than exponential time due to the need to do graph isomorphism on large value graphs at join points.

Except, much like RA and other things, it turns out this is not right for SSA, and in the past few years, it was proved you could do complete value numbering in polynomial time on SSA.

So with all due respect to Quentin, I don't buy it yet.

Without more, I'd bet that techniques similar to those used to solve value numbering in polynomial time on SSA could be applied here.

This is because the complete polynomial value numbering techniques are, in fact, value graph isomorphism...

So I'd assume the opposite (it can be done in polynomial time) without more.


River Riddle via llvm-dev

unread,
Aug 1, 2017, 1:37:11 PM8/1/17
to Jessica Paquette, llvm-dev
Hey Jessica,
  I was interested in the same relationship you are. Last summer I implemented many variations of the algorithm outlined in the paper you've provided, even including hot switch splitting.

When I implemented this last year there were several roadblocks: the inliner not always playing nicely, the inliner not preserving PGO data at that point, AA not being very forgiving about the loss of information, etc. Now that some of these problems have been improved/fixed, it would be interesting to review what can be gained from this type of transformation.

   It's also important to note that the two outlining transformations would use completely different algorithms, as they are satisfying different benefit functions.
 The algorithm used for enabling partial inlining is PGO driven, making it very context, or occurrence, sensitive. You are looking at subsections of the graph to identify cold sections that could increase performance if outlined. There isn't a concept of more than one occurrence because the benefit function is very sensitive to the PGO data of that one instance of the subgraph. Given that you are only considering one instance, you don't have to worry about trying to solve graph isomorphism, which substantially increases the number of candidates you can consider. This type of outlining also interacts much better with the inliner because all of the cost modeling for both is contained within the context of a single call.

So effectively we have two different types of outliners, which I'm assuming is part of the reason why Dan B brought up the semantics of what to call these transformations in the original MachineOutliner RFC. Having implemented both, I can attest to there not really being a lot of shared code. Providing PGO data to the IR outliner will somewhat emulate this behavior, but the benefit to performance is not likely to be as high. I'm sensing that this will hold even if/when we extend IR outlining to handle regions as well.
 
We kind of have an outliner of this type in the form of the partial inliner in LLVM. It currently only cares about enabling more inlining but that could definitely be changed or extended depending on the performance benefits that we could get. 

Thanks,
  River Riddle

River Riddle via llvm-dev

unread,
Aug 1, 2017, 1:44:57 PM8/1/17
to Daniel Berlin, llvm-dev
Hey Dan,

If you are willing to point me in the right direction, I am more than willing to implement/adapt to this type of solution for candidate selection.
Thanks,
 River Riddle

Daniel Berlin via llvm-dev

unread,
Aug 1, 2017, 2:03:18 PM8/1/17
to Sean Silva, llvm-dev
On Tue, Aug 1, 2017 at 10:28 AM, Daniel Berlin <dbe...@dberlin.org> wrote:


Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time.
 
I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR
 
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;)

All that's been shown is that you can reduce it to a hard problem.
In particular, Quentin has not actually shown that every possible DFG can be generated by SSA IR (without affecting the time/space class), or that every possible DFG can be converted to SSA IR (ditto).
While I believe the latter is possible (though SSA is not a linear-space transform, even though it is a linear-time one; I believe it is at worst an N^2 transform, and not an exponential-space one), I'm significantly less sure about the former, and it seems quite wrong to me without thinking about it too hard.
(It seems it would imply SSA is a lot more powerful than it is.)

I'm also pretty sure I've seen papers that prove the former is false.



Chris Bieneman via llvm-dev

unread,
Aug 1, 2017, 2:04:11 PM8/1/17
to Mehdi AMINI, llvm-dev
On Jul 31, 2017, at 10:38 PM, Mehdi AMINI <joke...@gmail.com> wrote:



2017-07-28 21:58 GMT-07:00 Chris Bieneman via llvm-dev <llvm...@lists.llvm.org>:
Apologies for joining this discussion late, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?

Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
  1. IR outline
  2. Inline
  3. MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.

It isn't clear to me how much the *exact* pipeline and ordering of passes is relevant to consider if "having an outliner at the IR level" is a good idea.

I think it is particularly relevant because based on the limited performance numbers we've seen it looks like the MIR and IR outliners have different benefits. Figuring out a pipeline where one doesn't prevent the other from performing good optimizations seems like a reasonable precondition to accepting these patches. 


 
I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair.
If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different? 

It seems quite different to me because there is no outliner at the IR level and so they don't provide the same functionality. The "Why at the IR level" section of the original email combined with the performance numbers seems largely enough to me to explain why it isn't redundant to the Machine-level outliner.
I'd consider this work for inclusion upstream purely on its technical merit at this point.

I believe the technical merit has not been shown clearly enough. The only data we've seen has been cherry-picked and there are outstanding technical questions about the approach.

Discussing inclusion as part of any of the default pipeline is a different story.

The patches that were sent out *do* include it in default pass pipelines.


Similarly last year, the IR-level PGO was also implemented even though we already had a PGO implementation, because 1) it provided a generic solutions for other frontend (just like here it could be said that it provides a generic solution for targets) and 2) it supported cases that FE-PGO didn't (especially around better counter-context using pre-inlining and such).

 

Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog.

We still usually discuss design extensively. Skipping the IR-level option didn't seem obvious to me, to say the least. And it wasn't really much discussed/considered extensively upstream.

The reasoning for this was covered in the discussions and in Jessica's LLVM dev meeting talk. It may not have been widely discussed because it was widely agreed on.

If the idea is that implementing a concept at the machine level may preclude a future implementation at the IR level, it means we should be *a lot* more picky before accepting such a contribution.

Nobody is precluding an IR implementation. We are merely holding the IR implementation to the same high standards of justification that we held the MIR one to. You may not recall this, but the MIR one took *months* to go from RFC to landing in-tree.

In this case, if I had anticipated any push-back on an IR-level implementation only based on the fact that we have now a Machine-level one, I'd likely have pushed back on the machine-level one.

There is no pushback based solely on the presence of the MIR outliner. One source of inquiry about the merits of the IR outliner is its comparison to the MIR outliner, and whether or not the two can play well together. This seems like a reasonable line of inquiry to me.


 
When someone comes to the community with something that has no existing in-tree analog it isn't fair to necessarily ask them to implement it multiple different ways to prove their solution is the best.

It may or may not be fair, but there is a tradeoff in how much effort we would require them to convince the community that this is *the* right way to go, depending on what it implies for future approaches.

Sure, and several of us are trying to have a conversation with River about how the IR outliner will best fit into LLVM and what technical considerations have to be made. You arguing that we should just accept the patches as they are is counterproductive to us being able to ensure that the IR outliner is at an appropriate quality and has sufficient technical merit.

-Chris

Daniel Berlin via llvm-dev

unread,
Aug 1, 2017, 2:08:06 PM8/1/17
to River Riddle, llvm-dev
Though I'm happy to have this tangent discussion because I think it's interesting, realistically I'm not sure it's worth it ATM when you've got stuff that works pretty well. So I'm going to drop off and let you concentrate on *this* patch, because I think that's what we should start with, realistically.

I just don't want us to declare it "impossible" without actual proof. Ken Zadeck told me that when he first tried to show that he could solve "thought-to-be-N^3" dataflow problems in linear time, the response he got was "that's impossible and I can prove it", and it turns out they were wrong :)

Chris Bieneman via llvm-dev

unread,
Aug 1, 2017, 2:20:22 PM8/1/17
to Andrey Bokhanko, llvm-dev
On Aug 1, 2017, at 1:07 AM, Andrey Bokhanko via llvm-dev <llvm...@lists.llvm.org> wrote:

All,

+1 to what Mehdi said.

It's a fair concern to question whether we need yet another outlining pass. I believe this concern has been addressed by River -- both with theoretical arguments and practical data (benchmark numbers).

I want to point out that River has not provided full raw benchmark data, he has provided summarized data. Also, as a community of engineers I think theoretical arguments are dangerous to accept. The community has also not been presented with sufficient information to understand some of the differences between the numbers for the IR outliner and MIR outliner.

Since the two are using very similar algorithms for detecting outlining candidates I would expect the numbers to be very similar, however in many cases they are not. I think understanding those differences and comparing them will help inform whether or not the IR outliner needs changes before being integrated to LLVM. I'm not advocating that we shouldn't take the IR outliner (in fact I think IR outlining is an interesting optimization). I'm advocating that we should understand and explore how this pass will fit into LLVM today and in the future.


Jessica's pipeline proposal is completely orthogonal. It's not fair to request River to implement / fit into what she suggested. Sure, it's a valid topic to discuss -- but a completely orthogonal one. If anything, accepting River's implementation would enable us to do experiments / developments like pipeline changes of this ilk!

I completely disagree. When considering accepting a new pass to LLVM it is completely reasonable, appropriate, and not orthogonal to consider how that pass would interact with other existing passes. Also, given that in the past we've requested much more of contributors trying to make large contributions (see the many SPIR-V discussions) I think it is totally fair to request River consider Jessica's suggestion, and if the community felt that was the right approach it would be fair to request him to implement it.

Please keep in mind, every new piece of functionality added to LLVM adds a maintenance burden on the community. If we are going to accept the burden of maintaining a new pass, it is reasonable for us, as a community, to request that the pass be engineered in a way that will make it worth that maintenance.

Further, I am incredibly frustrated by the fact that people in this thread seem intent on squashing relevant technical discussions or turning them into combative arguments. There is nothing wrong with asking technical questions on an RFC, and we should be encouraging cooperative discussions of present and future implications of all proposals.

-Chris

Daniel Berlin via llvm-dev

unread,
Aug 1, 2017, 2:34:51 PM8/1/17
to Chris Bieneman, llvm-dev

Hey Chris,

On Tue, Aug 1, 2017 at 11:20 AM, Chris Bieneman via llvm-dev <llvm...@lists.llvm.org> wrote:

On Aug 1, 2017, at 1:07 AM, Andrey Bokhanko via llvm-dev <llvm...@lists.llvm.org> wrote:

All,

+1 to what Mehdi said.

It's a fair concern to question whether we need yet another outlining pass. I believe this concern has been addressed by River -- both with theoretical arguments and practical data (benchmark numbers).

I want to point out that River has not provided full raw benchmark data, he has provided summarized data. Also, as a community of engineers I think theoretical arguments are dangerous to accept. The community has also not been presented with sufficient information to understand some of the differences between the numbers for the IR outliner and MIR outliner. 

Since the two are using very similar algorithms for detecting outlining candidates I would expect the numbers to be very similar, however in many cases they are not. I think understanding those differences and comparing them will help inform whether or not the IR outliner needs changes before being integrated to LLVM. I'm not advocating that we shouldn't take the IR outliner (in fact I think IR outlining is an interesting optimization). I'm advocating that we should understand and explore how this pass will fit into LLVM today and in the future.

You *sound* very frustrated by the process here, but as a person with no dog in this fight, I'm kind of struggling to understand why.
It seems a very similar process to what we've done in other situations.

If I may suggest: I think, regardless of the outcome of this thread, it may be worth you taking a step back, abstracting the set of issues you are seeing a bit, and starting a thread about it (with concrete examples, but hopefully without blame!).

Because if you (and others!) are seeing a broken process here, it definitely should be looked at.
But in some ways, I don't think we're going to be able to intertwine "process fixing" with this particular patch. IME, that rarely works very well.
(But I'll butt out if you really want to try :P)





Jessica's pipeline proposal is completely orthogonal. It's not fair to request River to implement / fit into what she suggested. Sure, it's a valid topic to discuss -- but a completely orthogonal one. If anything, accepting River's implementation would enable us to do experiments / developments like pipeline changes of this ilk!

I completely disagree. When considering accepting a new pass to LLVM it is completely reasonable, appropriate, and not orthogonal to consider how that pass would interact with other existing passes. Also, given that in the past we've requested much more of contributors trying to make large contributions (see the many SPIR-V discussions) I think it is totally fair to request River consider Jessica's suggestion, and if the community felt that was the right approach it would be fair to request him to implement it.

I think this is very much a matter of degree: i.e., I think there is a continuum of stuff people would consider reasonable and not.
 

Please keep in mind, every new piece of functionality added to LLVM adds a maintenance burden on the community. If we are going to accept the burden of maintaining a new pass, it is reasonable for us, as a community, to request that the pass be engineered in a way that will make it worth that maintenance.

Further, I am incredibly frustrated by the fact that people in this thread seem intent on squashing relevant technical discussions or turning them into combative arguments. There is nothing wrong with asking technical questions on an RFC, and we should be encouraging cooperative discussions of present and future implications of all proposals.

This sounds right, as long as we all agree that all processes must reach finality at some point. I.e., at some point a decision has to be made, and it can't be "after we've exhaustively attempted everything".
-Chris

Quentin Colombet via llvm-dev

unread,
Aug 1, 2017, 2:35:30 PM8/1/17
to Daniel Berlin, llvm-dev
On Aug 1, 2017, at 10:28 AM, Daniel Berlin via llvm-dev <llvm...@lists.llvm.org> wrote:



Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time.
 
I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR
 
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;)

All that's been shown is that you can reduce it to a hard problem.
You can also reduce it to 3-sat., but that doesn't mean anything unless you can reduce 3-sat to it.

5. The actual problem the outliner is trying to solve is actually more like finding subgraphs that are isomorphic, making the problem even harder (something like "given dags A and B does there exist a subgraph of A that is isomorphic to a subgraph of B")


This assumes, strongly, that this reduction is the way to do it, and also that SSA/etc imparts no structure in the reduction that enables you to solve it faster (IE restricts the types of graphs, etc)
 
FWIW, the exact argument above holds for value numbering, and since the days of Kildall, it was believed not to be solvable in a complete fashion in less than exponential time due to the need to do graph isomorphism on large value graphs at join points.

Except, much like RA and other things, it turns out this is not right for SSA, and in the past few years, it was proved you could do complete value numbering in polynomial time on SSA.

So with all due respect to Quentin, I don't buy it yet.

FWIW, I didn’t make any claim on actual complexity (I think? :)). We just didn't try it and didn’t have time to fit that into the schedule of an outliner for an internship.

What I am saying is that I am not responsible of how people quote me :).


Without more, I'd bet that techniques similar to those used to solve value numbering in polynomial time on SSA could be applied here.

This is because the complete polynomial value numbering techniques are, in fact, value graph isomorphism...

So I'd assume the opposite (it can be done in polynomial time) without more.

I would believe so FWIW.

Daniel Berlin via llvm-dev

unread,
Aug 1, 2017, 3:02:09 PM8/1/17
to Quentin Colombet, llvm-dev

FWIW, I didn’t make any claim on actual complexity (I think? :)). We just didn't try it and didn’t have time to fit that into the schedule of an outliner for an internship.


I'm disappointed you didn't have time to try to solve major open computer science problems in an internship.
What are you even doing over there?

Mehdi AMINI via llvm-dev

unread,
Aug 1, 2017, 3:05:53 PM8/1/17
to Chris Bieneman, llvm-dev
2017-08-01 11:03 GMT-07:00 Chris Bieneman <be...@apple.com>:

On Jul 31, 2017, at 10:38 PM, Mehdi AMINI <joke...@gmail.com> wrote:



2017-07-28 21:58 GMT-07:00 Chris Bieneman via llvm-dev <llvm...@lists.llvm.org>:
Apologies for joining this discussion late, but I had a few notes from this thread that I really wanted to chime in about.

River,

I don't mean to put you on the spot, but I do want to start on a semantic issue. In several places in the thread you used the words "we" and "our" to imply that you're not alone in writing this (which is totally fine), but your initial thread presented this as entirely your own work. So, when you said things like "we feel there's an advantage to being at the IR level", can you please clarify who is "we"?

Given that there are a number of disagreements and opinions floating around I think it benefits us all to speak clearly about who is taking what stances.

One particular disagreement that I think very much needs to be revisited in this thread was Jessica's proposal of a pipeline of:
  1. IR outline
  2. Inline
  3. MIR outline
In your response to that proposal you dismissed it out of hand with "feelings" but not data. Given that the proposal came from Jessica (a community member with significant relevant experience in outlining), and it was also recognized as interesting by Eric Christopher (a long-time member of the community with wide reaching expertise), I think dismissing it may have been a little premature.

It isn't clear to me how much the *exact* pipeline and ordering of passes is relevant to considering whether "having an outliner at the IR level" is a good idea.

I think it is particularly relevant because based on the limited performance numbers we've seen it looks like the MIR and IR outliners have different benefits. Figuring out a pipeline where one doesn't prevent the other from performing good optimizations seems like a reasonable precondition to accepting these patches. 


 
I also want to visit a few procedural notes.

Mehdi commented on the thread that it wouldn't be fair to ask for a comparative study because the MIR outliner didn't have one. While I don't think anyone is asking for a comparative study, I want to point out that I think it is completely fair.
If a new contributor approached the community with a new SROA pass and wanted to land it in-tree it would be appropriate to ask for a comparative analysis against the existing pass. How is this different? 

It seems quite different to me because there is no outliner at the IR level, so they don't provide the same functionality. The "Why at the IR level" section of the original email, combined with the performance numbers, seems like more than enough to me to explain why it isn't redundant with the Machine-level outliner.
I'd consider this work for inclusion upstream purely on its technical merit at this point.

I believe the technical merit has not been shown clearly enough. The only data we've seen has been cherry-picked and there are outstanding technical questions about the approach.

Discussing inclusion as part of any of the default pipeline is a different story.

The patches that were sent out *do* include it in default pass pipelines.

Saying we shouldn't include this in any default pipeline for now seems totally reasonable to me. My understanding was that the patches were sent so that folks can reproduce the results; I think these were "WIP", not patches ready to commit. But again, discussing the implementation and other details is perfectly appropriate. That isn't what I was trying to point at.


 


Similarly last year, the IR-level PGO was also implemented even though we already had a PGO implementation, because 1) it provided a generic solutions for other frontend (just like here it could be said that it provides a generic solution for targets) and 2) it supported cases that FE-PGO didn't (especially around better counter-context using pre-inlining and such).

 

Adding a new IR outliner is a different situation from when the MIR one was added. When the MIR outliner was introduced there was no in-tree analog.

We still usually discuss design extensively. Skipping the IR-level option didn't seem obvious to me, to say the least. And it wasn't really discussed or considered extensively upstream.

The reasoning for this was covered in the discussions and in Jessica's LLVM dev meeting talk. It may not have been widely discussed because it was widely agreed on.

I disagree with your assessment, and I'm puzzled that you can claim that when your previous email was along the lines of "the community should discuss and decide on a new proposal".
It was widely agreed on by the people who started this project; that's far from a reason not to discuss the pros/cons upstream.
Again, *shrug* on my side as long as it does not preclude other approaches at the IR level.

 

If the idea is that implementing a concept at the machine level may preclude a future implementation at the IR level, it means we should be *a lot* more picky before accepting such a contribution.

Nobody is precluding an IR implementation. We are merely holding the IR implementation to the same high standards of justification that we held the MIR one to. You may not recall this, but the MIR one took *months* to go from RFC to landing in-tree.

I'm opposed to holding up the RFC itself solely based on the existence of the MIR outliner. I don't disagree with carefully reviewing the IR outliner implementation, requiring incremental individual patches, etc.; on the contrary, that should happen.


 

In this case, if I had anticipated any push-back on an IR-level implementation based only on the fact that we now have a Machine-level one, I'd likely have pushed back on the Machine-level one.

There is no pushback based solely on the presence of the MIR outliner.

Then it seems we're in perfect agreement, but that wasn't my first impression.

-- 
Mehdi

Mehdi AMINI via llvm-dev

unread,
Aug 1, 2017, 3:14:31 PM8/1/17
to Chris Bieneman, llvm-dev
2017-08-01 11:20 GMT-07:00 Chris Bieneman <chris.b...@me.com>:

On Aug 1, 2017, at 1:07 AM, Andrey Bokhanko via llvm-dev <llvm...@lists.llvm.org> wrote:

All,

+1 to what Mehdi said.

It's a fair concern to question whether we need yet another outlining pass. I believe this concern has been addressed by River -- both with theoretical arguments and practical data (benchmark numbers).

I want to point out that River has not provided full raw benchmark data, he has provided summarized data. Also, as a community of engineers I think theoretical arguments are dangerous to accept. The community has also not been presented with sufficient information to understand some of the differences between the numbers for the IR outliner and MIR outliner.

Since the two are using very similar algorithms for detecting outlining candidates I would expect the numbers to be very similar; however, in many cases they are not. I think understanding those differences and comparing them will help inform whether or not the IR outliner needs changes before being integrated into LLVM. I'm not advocating that we shouldn't take the IR outliner (in fact I think IR outlining is an interesting optimization). I'm advocating that we should understand and explore how this pass will fit into LLVM today and in the future.

Being able to experiment with the pass in tree for the next two years or more, without having it in a default pipeline, is what will allow us to "understand and explore how this pass will fit into LLVM today and in the future".
  
Jessica's pipeline proposal is completely orthogonal. It's not fair to ask River to implement / fit into what she suggested. Sure, it's a valid topic to discuss -- but a completely orthogonal one. If anything, accepting River's implementation would enable us to do experiments / developments like pipeline changes of this ilk!

I completely disagree. When considering accepting a new pass to LLVM it is completely reasonable, appropriate, and not orthogonal to consider how that pass would interact with other existing passes.

Considering how this would interact is fair; asking to implement all possible combinations and perform all possible experiments isn't. AFAIK we promote incremental development, and we're quite agile.
Discussing RFCs and high-level direction is important, and finding an incremental way forward is important, but requesting a full, complete implementation with a full comparison on day 1 isn't fair and hasn't been a prerequisite in the past. Having an implementation available is what will allow us to try many combinations, across various scenarios, and get data.

I'd even argue the opposite: this favors out-of-tree development and lowers community involvement. I'm glad we had a different story with ThinLTO, for instance, and didn't require anything similar to what you're asking here, or it likely wouldn't have happened, or wouldn't have been as successful as it has been. Many aspects of it came out of the incremental implementation, and deviated significantly from the prototype and the original RFC discussions.

 
Also, given that in the past we've requested much more of contributors trying to make large contributions (see the many SPIR-V discussions)

I don't relate to the SPIR-V discussion (and I was involved); it had a lot of other kinds of issues.

-- 
Mehdi

Quentin Colombet via llvm-dev

unread,
Aug 1, 2017, 3:20:14 PM8/1/17
to Daniel Berlin, llvm-dev
Heh, I am also wondering :P

Sean Silva via llvm-dev

unread,
Aug 1, 2017, 3:28:17 PM8/1/17
to Daniel Berlin, llvm-dev
On Tue, Aug 1, 2017 at 10:28 AM, Daniel Berlin <dbe...@dberlin.org> wrote:


Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time.
 
I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR
 
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;)

All that's been shown is that you can reduce it to a hard problem.
You can also reduce it to 3-SAT, but that doesn't mean anything unless you can reduce 3-SAT to it.

Only one direction is needed to prove it is GI-hard. I think you are thinking about proving it GI-complete. Solving a GI-hard problem in polynomial time proves that GI is in P which would be a major breakthrough in CS.
 

5. The actual problem the outliner is trying to solve is actually more like finding subgraphs that are isomorphic, making the problem even harder (something like "given dags A and B does there exist a subgraph of A that is isomorphic to a subgraph of B")


This assumes, strongly, that this reduction is the way to do it, and also that SSA/etc imparts no structure in the reduction that enables you to solve it faster (IE restricts the types of graphs, etc)
 
FWIW, the exact argument above holds for value numbering, and since the days of Kildall it was believed not solvable in a complete fashion in less than exponential time, due to the need to do graph isomorphism on large value graphs at join points.

Except, much like RA and other things, it turns out this is not right for SSA, and in the past few years it was proved that you could do complete value numbering in polynomial time on SSA.

Can you explain what structure SSA imbues besides just modeling a dag? The description I gave above is phrased solely in terms of a data flow graph (dag of say `add` instructions) in a single BB; there are no phi nodes or anything.

I'm thinking something like:

For each vertex v (with label %foo) in the dag in topological order:
  append to the BB an instruction `add %foo <- %bar, %baz, %qux, ...`, where `%bar, %baz, %qux, ...` are the labels of all vertices u such that an edge u->v exists.

The BB generated this way is guaranteed to be SSA.

(for simplicity I've assumed an N-ary add instruction, but it is obvious you can isomorphically desugar that into two-operand instructions)
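That generation procedure can be sketched concretely in Python. This is a minimal illustration of the construction described above, not code from the thread; the function name `emit_bb` and the edge representation (a dict mapping each vertex to its set of operand vertices) are invented for the example:

```python
from graphlib import TopologicalSorter

def emit_bb(dag):
    """dag: {vertex: set of predecessor vertices}, where an edge u->v
    means u is an operand of v. Emit one instruction per vertex in
    topological order; the result is guaranteed to be in SSA form,
    since every operand is defined before its use."""
    lines = []
    # TopologicalSorter takes {node: predecessors} and yields each node
    # only after all of its predecessors -- exactly the order we need.
    for v in TopologicalSorter(dag).static_order():
        operands = ", ".join(f"%{u}" for u in sorted(dag.get(v, ())))
        if operands:
            lines.append(f"add %{v} <- {operands}")
        else:
            lines.append(f"%{v} = ...")  # live-in / leaf value
    return lines

bb = emit_bb({"0": set(), "1": {"0"}, "2": {"0", "1"}})
```

The mapping is clearly polynomial (linear in vertices plus edges), which is the easy direction of the reduction; the open questions in the thread are about the other steps.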


  


So with all due respect to quentin, i don't buy it yet.

Without more, i'd bet using similar techniques to solve value numbering in polynomial time to SSA could be applied here.

This is because the complete polynomial value numbering techniques are, in fact, value graph isomorphism ...

My understanding is that standard complete GVN algorithms are looking for:

%0 = ...
add %1 <- %0, %0
add %2 <- %1, %1
add %3 <- %2, %2
...
add %1 <- %0, %0
add %2 <- %1, %1
add %3 <- %2, %2
...

and identifying the repeated
```
add %1 <- %0, %0
add %2 <- %1, %1
add %3 <- %2, %2
```


However, a code folding outliner like the one discussed here wants to be able to identify the two sequences as equivalent even if the second run of adds doesn't have %0 as an input.

%0 = ...
add %1 <- %0, %0
add %2 <- %1, %1
add %3 <- %2, %2
...
%foo0 = ...
add %foo1 <- %foo0, %foo0
add %foo2 <- %foo1, %foo1
add %foo3 <- %foo2, %foo2
...

A code folding outliner wants to find:

outlinedFunc(%arg) {
add %1 <- %arg, %arg
add %2 <- %1, %1
add %3 <- %2, %2
}

In other words, I think the algorithms you are thinking of prove equivalent *values* whereas what the code folding outliner is looking for is equivalent *computations* (which may have different inputs at the different sites they occur and so do not necessarily have identical values at run time). For example, %3 and %foo3 have different values at runtime.
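One way to make "equivalent computations" concrete is to canonicalize each straight-line sequence by renaming values in order of first appearance: sequences that differ only in their concrete SSA names (including their live-ins) then compare equal. A minimal sketch, not from the thread; `canonicalize` and the `(opcode, dest, operands)` tuple encoding are invented for illustration:

```python
def canonicalize(seq):
    """Rename every value in order of first appearance, so sequences
    that differ only in concrete SSA names (including live-ins such as
    %0 vs %foo0) produce identical canonical forms."""
    names = {}
    def rn(v):
        # First time we see a value, assign it the next canonical name.
        return names.setdefault(v, f"v{len(names)}")
    return [(op, rn(dst), tuple(rn(s) for s in srcs)) for op, dst, srcs in seq]

# The two runs of adds from the example: same computation, different inputs.
a = [("add", "%1", ("%0", "%0")),
     ("add", "%2", ("%1", "%1")),
     ("add", "%3", ("%2", "%2"))]
b = [("add", "%foo1", ("%foo0", "%foo0")),
     ("add", "%foo2", ("%foo1", "%foo1")),
     ("add", "%foo3", ("%foo2", "%foo2"))]
same = canonicalize(a) == canonicalize(b)
```

Note this only captures positional equivalence of a whole linear sequence (it is essentially what the RFC's candidate analysis checks when it splits candidates with different internal input sequences); it does not attempt commutativity or any graph-level matching.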

As another example

%0 = ...
add %1 <- %0, %0
mul %2 <- %1, %1
add %3 <- %2, %2

%foo0 = ...
add %foo1 <- %foo0, %foo0
mul %foo2 <- %foo1, %foo1

%bar1 = ...
mul %bar2 <- %bar1, %bar1
add %bar3 <- %bar2, %bar2

The algorithm we want for the code folding outliner should identify that
```
add %1 <- %0, %0
mul %2 <- %1, %1
```
is the same computation as the `foo` computation
and
```
mul %2 <- %1, %1
add %3 <- %2, %2
```
is the same computation as the `bar` computation.
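Finding such overlapping matches is essentially a repeated-substring search over congruence classes, which the original RFC attacks with a suffix array plus LCP array. As a toy stand-in (hashing every fixed-length window of opcode classes rather than building a suffix array; the function name and data layout are invented for the example):

```python
from collections import defaultdict

def repeated_windows(seqs, k):
    """Group all length-k windows across the sequences by their opcode
    pattern; return only patterns occurring in more than one place.
    Overlapping windows within one sequence are allowed, matching the
    overlapping foo/bar example above."""
    occ = defaultdict(list)
    for name, ops in seqs.items():
        for i in range(len(ops) - k + 1):
            occ[tuple(ops[i:i + k])].append((name, i))
    return {pat: locs for pat, locs in occ.items() if len(locs) > 1}

# The example above, reduced to opcode classes.
seqs = {"main": ["add", "mul", "add"],
        "foo":  ["add", "mul"],
        "bar":  ["mul", "add"]}
cands = repeated_windows(seqs, 2)
```

Here the two candidate windows in `main` overlap at the middle `mul`, which is exactly why a later pruning step must pick at most one of them.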



Or to put it yet another way, if you have:

foo(a,b,c) {
  a single BB with data flow graph G
}

bar(d,e,f) {
  a single BB with a graph isomorphic to G subject to d=b e=a c=f
}

This is easy to do with a standard VN algorithm if you knew a priori that d=b, e=a, c=f and just initialized the initial congruence classes right. The hard part manifests itself, I think, in terms of discovering that correspondence. (And in general the number of such live-in values to the subgraphs being compared is O(the size of the graph), so the algorithm must be polynomial in the number of such live-in values.)

How does the algorithm you are thinking of infer the equivalence of those live-in values?



Sorry for really digging in here, but I've actually thought about this a lot since we talked about this at the social and I can't seem to find a problem with my reasoning for the GI-hard proof and am genuinely curious.

-- Sean Silva

Daniel Berlin via llvm-dev

unread,
Aug 1, 2017, 4:03:41 PM8/1/17
to Sean Silva, llvm-dev
On Tue, Aug 1, 2017 at 12:28 PM, Sean Silva <chiso...@gmail.com> wrote:


On Tue, Aug 1, 2017 at 10:28 AM, Daniel Berlin <dbe...@dberlin.org> wrote:


Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time.
 
I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR
 
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;)

All that's been shown is that you can reduce it to a hard problem.
You can also reduce it to 3-sat., but that doesn't mean anything unless you can reduce 3-sat to it.

Only one direction is needed to prove it is GI-hard.

Yes sorry, but you still haven't done it. 
So if you really want to do it, let's start with a definition:

"The precise definition here is that a problem X is NP-hard, if there is an NP-complete problem Y, such that Y is reducible to X in polynomial time."

For this to be GI-hard you must be able to reduce *every* GI problem to one about SSA-based outlining, and you must be able to do so in *polynomial time*.
 
The above does not do this.

Things that would need proof:

1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR

A routine or proof to do this in polynomial time ;)  Please make sure you do an arbitrary dag, and not just rooted DAGs, as isomorphism on rooted DAGs can be solved in linear time ;)


3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.

Proof that  code outlining necessarily requires discovering full isomorphism between the graphs.

(IE that whatever process code outlining uses must end up doing a thing that would prove isomorphism).

Keep in mind that the way we ended up with linear time dataflow was precisely to prove that it did not actually require doing things that reduced to boolean matrix multiplication


But let's take this offline, as it's going to go into a very theoretical route.

I would suggest reading some of the very early papers. Early value numbering, and complete formulations, assume uninterpreted operators and in some cases, operands.  There is also the distinction between whether you are trying to discover the set of equivalent values that already exist in the program vs things that are equivalent to those values vs ....

It's 100% possible to come up with formulations of VN that are still exponential time depending precisely on the details you are giving.

It's also interesting to note that Babai still has his paper on graph isomorphism in quasi-polynomial time, and nobody has found holes in it yet (and since GI is in NP but not known to be NP-complete, it's a little more believable than the standard P = NP fare).

Sean Silva via llvm-dev

unread,
Aug 1, 2017, 4:25:09 PM8/1/17
to Daniel Berlin, llvm-dev


On Aug 1, 2017 1:03 PM, "Daniel Berlin" <dbe...@dberlin.org> wrote:


On Tue, Aug 1, 2017 at 12:28 PM, Sean Silva <chiso...@gmail.com> wrote:


On Tue, Aug 1, 2017 at 10:28 AM, Daniel Berlin <dbe...@dberlin.org> wrote:


Also as a side note, I think in the original MachineOutliner RFC thread there was some confusion as to whether it was possible to solve the code folding outlining problem exactly as a graph problem on SSA using standard value numbering algorithms in polynomial time.
 
I can elaborate further, but
1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR
 
2. Given two dags, you can create their respective isomorphic data flow graphs (say, put them into two separate functions)
3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.
4. Thus, graph isomorphism on dags can be solved with such an algorithm and thus the outlining problem is GI-hard and a polynomial time solution would be a big breakthrough in CS.

First, you'd have to reduce them in both directions to prove that ;)

All that's been shown is that you can reduce it to a hard problem.
You can also reduce it to 3-sat., but that doesn't mean anything unless you can reduce 3-sat to it.

Only one direction is needed to prove it is GI-hard.

Yes sorry, but you still haven't done it.

I agree I certainly haven't provided more than a proof sketch (and a bad one at that).

 
So if you really want to do it, let's start with a definition:

"The precise definition here is that a problem X is NP-hard, if there is an NP-complete problem Y, such that Y is reducible to X in polynomial time."

For this to be GI-hard you must be able to reduce *every* GI problem to one about SSA-based outlining, and you must be able to do so in *polynomial time*.
 
The above does not do this.

Things that would need proof:

1. it is easy to see that you can map an arbitrary dag to an isomorphic data flow graph in an SSA IR e.g. in LLVM IR or pre-RA MIR

A routine or proof to do this in polynomial time ;)  Please make sure you do an arbitrary dag, and not just rooted DAGs, as isomorphism on rooted DAGs can be solved in linear time ;)


3. An exact graph based code folding outliner would be able to discover if the two dataflow graphs are isomorphic (that is basically what I mean by exact) and outline them.

Proof that  code outlining necessarily requires discovering full isomorphism between the graphs.

(IE that whatever process code outlining uses must end up doing a thing that would prove isomorphism).

Keep in mind that the way we ended up with linear time dataflow was precisely to prove that it did not actually require doing things that reduced to boolean matrix multiplication


But let's take this offline, as it's going to go into a very theoretical route.

Sure. Thanks for the pointers. I'll keep thinking about it and maybe start putting some thoughts more precisely in LaTeX which is probably a better format at this point.

-- Sean Silva

Jessica Paquette via llvm-dev

unread,
Aug 1, 2017, 4:29:19 PM8/1/17
to Quentin Colombet, llvm-dev
Just to chime in… I do think there is a relationship with string isomorphism (although I haven’t hashed out a proof of it yet!).

String isomorphism is defined as

“Given two equal-length strings S1, S2 over a finite alphabet and a permutation group G, is there an element of G that will transform S1 into S2?”

It’s known to be quasipolynomial-time. It sounds similar in spirit to the equivalence problem between strings in outlining, given you have 0 parameters to pass. What we want is a structure-preserving transformation between two strings, given a higher-order structure/state that is defined by the entire program. For example, our state might be the values in memory at any given timestep. If we *always* have such a transformation between two strings composed of identical instructions, but different/unknown registers, then those sequences must be safe to outline.

Of course, this is just an observation, and not a claim of equivalence or complexity. Attempting a reduction might be a fun evening project though!

- Jessica
