=====
Goal:
=====
Extending Loop Vectorizer (LV) such that it can handle outer loops, via VPlan infrastructure enhancements.
Understand the trade-offs of making concurrent progress on outer loop vectorization while moving the remaining
inner loop vectorization functionality to the VPlan infrastructure.
===========
Background:
===========
This is related to the VPlan infrastructure project we started a while back, a project to extend the (inner loop vectorization focused) Loop Vectorizer to support outer loop vectorization.
VPlan is the vectorization planner that records the decisions and candidate directions to pursue in order to drive cost modeling and vector code generation. When it is fully integrated into LV (i.e., at the end of this big project), VPlan will use a Hierarchical-CFG (HCFG) and transform it starting from the abstraction of the input IR to reflect current vectorization decisions being made. The HCFG eventually becomes the abstraction of the output IR, and the vector code generation is driven by this abstract representation.
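To make the HCFG idea more concrete, the following is a purely conceptual sketch of a hierarchical CFG, using hypothetical names rather than the classes introduced by the patches mentioned below: blocks form a CFG, and a region is itself a block that wraps a nested sub-CFG (e.g., an inner loop), so that transformations can treat a whole loop as a single node.

#include <memory>
#include <string>
#include <vector>

// Hypothetical names for illustration only; the real VPlan classes differ.
struct HBlock {                                // node of the hierarchical CFG
  std::string Name;
  std::vector<HBlock *> Successors;            // CFG edges at this nesting level
  virtual ~HBlock() = default;
};

struct HBasicBlock : HBlock {                  // leaf: holds recipes/instructions
  std::vector<std::string> Recipes;            // e.g. "widen add", "widen phi i"
};

struct HRegion : HBlock {                      // non-leaf: wraps a nested sub-CFG
  std::vector<std::unique_ptr<HBlock>> Blocks; // blocks owned by the region
  HBlock *Entry = nullptr;                     // e.g. an inner loop header
  HBlock *Exit = nullptr;                      // single exit of the region
};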
This is a follow up of the previous RFCs and LLVM Developer Conference presentations:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105057.html (RFC: Extending LV to vectorize outerloops)
http://lists.llvm.org/pipermail/llvm-dev/2017-February/110159.html (RFC: Introducing VPlan to model the vectorized code and drive its transformation)
https://www.youtube.com/watch?v=XXAvdUwO7kQ (Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop Auto-Vectorization)
https://www.youtube.com/watch?v=IqzJRs6tb7Y (Introducing VPlan to the LoopVectorizer)
https://www.youtube.com/watch?v=BjBSJFzYDVk (Vectorizing Loops with VPlan - Current State and Next Steps)
Please also refer to the separately sent out status update http://lists.llvm.org/pipermail/llvm-dev/2017-December/119522.html.
So far, two big patches related to VPlan implementation have been committed.
https://reviews.llvm.org/D28975 by Gil Rapaport. (Introducing VPlan to model the vectorized code and drive its transformation)
It has been broken down into a series of smaller patches and committed. The last (re)commit of the series is
https://reviews.llvm.org/rL311849
https://reviews.llvm.org/D38676 by Gil Rapaport. (Modeling masking in VPlan, introducing VPInstructions)
This was also broken down into a series of smaller patches to facilitate review.
Committed as https://reviews.llvm.org/rL318645
With the first patch, we introduced the concept of VPlan to LV and started explicitly recording decisions like interleave memory access optimization and serialization.
With the second patch, we also introduced the concept of VPInstruction (to model newly introduced use-def) and started explicitly modeling masking and mask manipulations.
So far, work in this area has been mostly restricted to refactoring the Loop Vectorizer's existing functionality; several areas of LV still need to be refactored into the new VPlan based representation. The following non-exhaustive list of such areas gives an idea of the amount of remaining work.
* Predication
* Cost model
* Remainder Loop
* Runtime Guards
* External Users
* Reduction Epilog
* Interleave Grouping
* Sink Scalar Operands
While it is important to keep refactoring the above mentioned items so that the entire LV becomes fully VPlan based (which completes the transition to the VPlan infrastructure), we are aware that there is also a need to make progress on outer loop vectorization. Function vectorization via the VecClone pass (https://reviews.llvm.org/D22792) also depends on the progress of outer loop vectorization.
==============
Proposed Plan:
==============
Instead of waiting for most of the above mentioned refactoring to finish before working on outer loop vectorization, we believe it serves the best interest of the community if we start making concurrent progress on continued refactoring effort and outer loop vectorization effort.
To implement outer loop vectorization in incremental steps, we propose adding new code paths that would be exercised initially for only outer loop vectorization cases. This would ensure that the changes are NFC for innermost loop vectorization cases until we (slowly) transition them to also go through these new code paths.
Given that LV has never had the ability to vectorize outer level loops, letting outer loop vectorization take slightly different code paths from innermost loop vectorization won't cause any issues for innermost loop vectorization. The new paths are what inner loop vectorization will eventually go through (i.e., post-refactoring), once the kinks are worked out, so that the transition does not regress any existing functionality or performance. Exposing the new code paths early would encourage more community participation in maturing them, make it clearer what still needs to be moved over to the VPlan infrastructure, and hopefully also encourage more community participation in the remaining refactoring effort.
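As one possible way to keep the new code paths strictly NFC for existing users while they mature, they could be gated behind an off-by-default flag, roughly as sketched below (the option name here is hypothetical, not an existing flag):

#include "llvm/Support/CommandLine.h"

// Hypothetical gating flag; shown only to illustrate how the new path could
// stay off by default while under development.
static llvm::cl::opt<bool> EnableOuterLoopVPlanPath(
    "enable-outer-loop-vplan-path", llvm::cl::init(false), llvm::cl::Hidden,
    llvm::cl::desc("Exercise the new VPlan-based code path for outer loops "
                   "(off by default; innermost loop vectorization unaffected)"));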
======================
Proposed Patch Series:
======================
There will be a lot of new development before we can declare this "feature complete". As such, we propose to break it into five or more series of patches, starting from vectorizing trivial outer loops. Along the way, we expect to include NFC patches to make LV more modular, in order to keep the LoopVectorize.cpp file size from growing too much.
Patch Series #1
---------------
Trivial outer loop vectorization
* Inner loops are trivially lock-step among all vector elements
* No other branches inside the outer loop (to avoid touching divergence analysis in this series)
This should require
1) inner loop uniformity check (SCEV based invariance will do); the first small patch https://reviews.llvm.org/D40874 is a subset of this, and a rough sketch of such a check is shown after the example loop below.
2) VPlan construction
3) Vector code generation (w/ unmodified uniform inner control flow) from VPlan
for (i = 0; i < N; i++) { // vectorize here
  // no branches here
  for (j = 0; j < loop_invariant_value; j++) {
    // no branches here
  }
  // no branches here
}
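As mentioned in requirement 1) above, here is a minimal sketch of what such a SCEV based uniformity check could look like; this is an assumed helper written for illustration, not the code in D40874:

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
using namespace llvm;

// Returns true if the inner loop's trip count does not vary with the outer
// loop, i.e. all vector lanes of the outer loop would execute the inner loop
// in lock-step.
static bool isInnerTripCountUniform(const Loop *OuterLp, const Loop *InnerLp,
                                    ScalarEvolution &SE) {
  const SCEV *BTC = SE.getBackedgeTakenCount(InnerLp);
  if (isa<SCEVCouldNotCompute>(BTC))
    return false;                         // unknown trip count: give up
  return SE.isLoopInvariant(BTC, OuterLp);
}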
Patch Series #2
---------------
Simple outer loop vectorization
* Inner loops are still trivially lock-step among all vector elements
* Non-loop branches are blindly assumed to be divergent
This should require
1) Patch Series #1
2) Predication analysis and masking on VPlan
3) Vector code generation support for masking
for (i = 0; i < N; i++) { // vectorize here
  // branches may be here
  for (j = 0; j < loop_invariant_value; j++) { // unconditional w.r.t. i-loop
    // branches may be here
  }
  // branches may be here
}
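To illustrate what items 2) and 3) above have to produce for a divergent branch in the outer loop body, here is a hedged, scalar-emulated sketch for a hypothetical if statement, with VF fixed to 4; actual code generation would emit vector instructions and masked operations instead:

enum { VF = 4 };

// Emulates one vector iteration of:
//   if (a[i] > 0) b[i] = a[i] * 2;      // divergent w.r.t. the i-loop
void predicatedBody(int i, const int *a, int *b) {
  int Mask[VF], Tmp[VF];
  for (int L = 0; L < VF; ++L)           // lane-wise predicate (the mask)
    Mask[L] = (a[i + L] > 0);
  for (int L = 0; L < VF; ++L)           // then-side computed for every lane
    Tmp[L] = a[i + L] * 2;
  for (int L = 0; L < VF; ++L)           // masked store: write active lanes only
    if (Mask[L])
      b[i + L] = Tmp[L];
}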
Patch Series #3
---------------
Better simple outer loop vectorization
* Inner loops are still trivially lock-step among all vector elements
* Non-loop branches may be uniform/non-uniform
This should require
1) Patch Series #1, #2
2) We first use SCEV invariance for uniformity analysis.
3) Vector code generation support for uniform forward branches
4) We are working with Simon Moll (U-Saarland) to improve uniformity analysis (https://www.youtube.com/watch?v=svMEphbFukw&feature=youtu.be).
Not a hard dependency, but improved analysis is good for performance.
Same kind of loop as Patch Series #2, but uniform branches are optimized (i.e., less masking).
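For example (hypothetical code), in the loop below the Flag branch is uniform with respect to the vectorized i-loop and can remain an ordinary branch around the vector body, while the a[i] branch is divergent and still needs the masking from Patch Series #2:

void foo(int N, int Flag, const int *a, int *b) {
  for (int i = 0; i < N; i++) {   // vectorize here
    if (Flag)                     // uniform: same outcome for all lanes
      b[i] += 1;
    if (a[i] > 0)                 // divergent: outcome differs per lane
      b[i] *= 2;
  }
}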
Patch Series #4
---------------
Less simple outer loop vectorization
* Inner loops may not be lock-step among all vector elements, but must be single-entry/single-exit
This should require
1) Patch Series #1, #2, #3
2) VPlan to VPlan transformation to make the inner loop lock-step. After the transform, the VPlan should look like the ones #3 can handle.
We'll look into porting code from the U-Saarland RV project, which performs this transformation in an IR to IR manner.
for (i = 0; i < N; i++) { // vectorize here
  // branches may be here
  if (cond(i)) {
    // branches may be here
    while (cond1(i)) {
      // branches may be here
    }
    // branches may be here
  }
  // branches may be here
}
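A hedged, scalar-emulated sketch of the lock-step transform from item 2) above, for an inner loop whose trip count varies with i (hypothetical code, VF fixed to 4): the transformed inner loop keeps iterating until every lane is done, with a per-lane active mask guarding the body.

enum { VF = 4 };

// Emulates one vector iteration of:
//   for (j = 0; j < trip[i]; j++) b[i] += j;   // trip count varies with i
void lockstepInner(int i, const int *trip, int *b) {
  int Active[VF], J[VF] = {0};
  bool Any = false;
  for (int L = 0; L < VF; ++L) {
    Active[L] = (trip[i + L] > 0);      // lanes whose inner loop runs at all
    Any = Any || Active[L];
  }
  while (Any) {                         // iterate until every lane has exited
    Any = false;
    for (int L = 0; L < VF; ++L) {
      if (Active[L]) {                  // masked inner-loop body
        b[i + L] += J[L];
        if (++J[L] >= trip[i + L])      // lane-local latch condition
          Active[L] = 0;
      }
      Any = Any || (Active[L] != 0);
    }
  }
}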
Patch Series #5
---------------
Outer loop vectorization "Feature Complete"
* Inner loops must be reducible; multiple exits are allowed.
This should require
1) Patch Series #1, #2, #3, #4
2) VPlan to VPlan transformation to make inner loop lock-step.
3) Vector code generation support for inner loop with multiple exits
for (i = 0; i < N; i++) { // vectorize here
  // branches may be here
  if (cond(i)) {
    // branches may be here
    while (cond1(i)) {
      ...
      if (cond2) break;
      ...
      if (cond3) break;
      ...
    }
    // branches may be here
  }
  // branches may be here
}
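Extending the Patch Series #4 sketch, here is a hedged, scalar-emulated illustration of item 3) above for one early exit (hypothetical code, VF fixed to 4): a taken break simply clears the lane's active bit, exactly like the normal latch exit, so the transformed inner loop still runs until every lane has left through some exit.

enum { VF = 4 };

// Emulates one vector iteration of:
//   for (j = 0; j < limit[i]; j++) { if (b[i] == stop[i]) break; b[i] += j; }
void multiExitInner(int i, const int *limit, const int *stop, int *b) {
  int Active[VF], J[VF] = {0};
  for (int L = 0; L < VF; ++L)
    Active[L] = (limit[i + L] > 0);
  bool Any = true;
  while (Any) {                          // iterate until every lane has exited
    Any = false;
    for (int L = 0; L < VF; ++L) {
      if (Active[L]) {
        if (b[i + L] == stop[i + L]) {   // 'if (cond2) break;' clears the lane
          Active[L] = 0;
        } else {
          b[i + L] += J[L];
          if (++J[L] >= limit[i + L])    // ordinary latch exit
            Active[L] = 0;
        }
      }
      Any = Any || (Active[L] != 0);
    }
  }
}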
============================
Future Work beyond this RFC:
============================
* Outer loop auto-vectorization legality analysis
* Outer loop auto-vectorization cost model
* Cost model to compute VF/UF for explicit outer loop vectorization (when not specified)
* (lower in priority) Irreducible inner loop handling (VPlan to VPlan transform to make inner loops reducible; after the transform, the VPlan should look like the ones #5 can handle)
===========
Trade-Offs:
===========
The new code path also has the ability to perform innermost loop vectorization (as a trivial subset of outer loop vectorization), if we want to exercise it for that purpose. It is good to be able to compare, side by side, what the existing code path does and what the new path does, so that we can incrementally build up confidence in and experience with the new code path. For the time being, however, this can be seen as an extra maintenance burden. Instead of trying to do this on trunk, we can do the same thing on the side, using github, until the outer loop vectorization feature becomes mature enough. The amount of code review required to get into trunk would most likely be comparable.
========
Summary:
========
An implementation plan for the outer loop vectorization feature has been presented. We think concurrent progress has a good chance of expediting the overall transition of LV to the new infrastructure,
while delivering new functionality sooner to those who need it and encouraging wider community participation in the overall development.
Thanks,
Hideki Saito
I'd like to hear what plan is in place to ensure we don't destabilize
the vectorizer while working on this.
One thing we could consider is leveraging the new IR fuzzer to help find
assertion failures either before submission or shortly thereafter.
Another might be to introduce changes under feature flags to ease the
revert/reintroduce/revert cycle.
Philip
>>That sounds like an excellent idea! Any concrete ideas/plans how people could get involved, besides doing reviews?
>
>Let's talk about this in the RFC context. http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html.
>Divergence Analysis work mentioned there is a good example.
One of the big things we are hoping to see people outside of our Intel team contribute is dependence analysis for outer
loop auto-vectorization (most of which can also be used for auto-parallelization of outer level loops). The plan
outlined below aims for explicit vectorization, but we'd certainly want to hook this up for outer loop auto-vectorization.
Having sent out the second cleanup patch (https://reviews.llvm.org/D41045), I'm currently looking at the LoopVectorizationLegality class
and thinking about how to 1) make it more modular, 2) take it out of LoopVectorize.cpp, and 3) move it to the Analysis directory (but not
necessarily as a separate Analysis pass, yet). Part of the Legality is the dependence checking, which needs to be upgraded to deal
with outer loops. Any interest in working in this area? It's not really dependent on the VPlan progress and can be started right away.
Thanks,
Hideki
https://bugs.llvm.org/show_bug.cgi?id=35282
https://bugs.llvm.org/show_bug.cgi?id=35687
https://bugs.llvm.org/show_bug.cgi?id=35734
https://bugs.llvm.org/show_bug.cgi?id=35773
https://reviews.llvm.org/D41939
I also see another 10 or so correctness related ones in a simple search:
https://bugs.llvm.org/buglist.cgi?quicksearch=LoopVectorize&list_id=131629
The loop vectorizer is currently a serious pain point in our regression
testing. As the rest of the optimizer changes, we are finding that
existing bugs in the vectorizer are being exposed with disturbing
frequency and that some new regressions are landing as well.
Philip
I certainly understand what you're saying, but, as you point out, many
of these are existing bugs that are being exposed by other changes (and
they're seemingly all over the map). My general feeling is that the more
limited the applicability of a particular transform the buggier it will
tend to be. The work here to allow vectorization of an ever-wider set
of inputs will really help to expose, and thus help us eliminate, bugs.
As such, one of the largest benefits of adding the
function-vectorization work (https://reviews.llvm.org/D22792), and
outer-loop vectorization capabilities, will be making it easier to throw
essentially-arbitrary inputs at the vectorizer (and have it do
something), and thus, hit it more effectively with automated testing.
Maybe we can do a better job, even with the current capabilities, of
automatically generating data-parallel loops with reductions of various
kinds? I'm thinking about automated testing because, AFAIK, the
vectorizer is already run through almost all of the relevant benchmarks
and test suites, and even if we add a few more, we probably need to
increase the test coverage by a lot more than that.
Do you have ideas about how we can have better testing in this area
otherwise?
-Hal
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
If you’re finding bugs that have been extant for a while, is there anything special you’re doing that’s uncovering them? If so, is there a way we could replicate that kind of IR input to the vectorizer to better test the corner cases?
Amara
Absolutely agreed.
We haven't stopped working on the vectoriser but for the past few
years it feels as if we're always trading performance numbers all over
the place and not progressing.
We need better, more consistent, analysis. We need a generic approach.
We need more powerful approaches. We need a single pipeline that can
decide based on clear costs, not luck.
The work Hideki/Ayal/Gil are doing covers most of those topics. It implements:
1. VPlan: which will help us understand more accurate costs and pick
the best choice, not the first profitable one
2. Outer loop: which will allow us to look at the loop as a complete
set, not a bunch of inner loops
The work Tobi & the Polly guys are doing covers:
3. Fantastic analysis and powerful transformations
4. Exposing code for other passes to profit
Linaro is looking at HPC workloads (mainly core loops [1]) and we
found that loop distribution would be very profitable to ease register
allocation in big loops, but that needs whole-loop analysis.
But, as was said before, we're lacking in understanding and
organisation. To be able to profit from all of those advances, we need
to understand the loop better, and for that, powerful analysis needs
to happen.
Polly has some powerful ones, but we haven't plugged that in properly.
The work to plug Polly into LLVM and make it an integral part of the
pipeline is important so that we can use parts of its analysis tools
to benefit other passes.
But we also need more alias analysis, inter-procedural access pattern
analysis etc.
> As such, one of the largest benefits of adding the
> function-vectorization work (https://reviews.llvm.org/D22792), and
> outer-loop vectorization capabilities, will be making it easier to throw
> essentially-arbitrary inputs at the vectorizer (and have it do
> something), and thus, hit it more effectively with automated testing.
Function vectorisation is important, but whole-loop analysis
(including outer-loop, distribution, fusion) can open more doors to
new patterns.
Now, the real problem here is below...
> Maybe we can do a better job, even with the current capabilities, of
> automatically generating data-parallel loops with reductions of various
> kinds? I'm thinking about automated testing because, AFAIK, the
> vectorizer is already run through almost all of the relevant benchmarks
> and test suites, and even if we add a few more, we probably need to
> increase the test coverage by a lot more than that.
Without a wider understanding of what's missing and how to improve,
it's hard to know what path to take.
For a while I was running simple things (like Livermore Loops) and
digging into specific details, and at every step I realised that I needed
better analysis, but ended up settling for a simplified version just
to get that one case vectorised.
After I stopped working on this, I continued reviewing performance
patches and what ended up happening is that we're always accepting
changes that give positive geomean, but which could also push some of
the past gains down considerably.
So much so that my current benchmarks show LLVM 5 performing worse in
almost all cases compared to LLVM 4. This is worrying.
None of those benchmarks were done in a standard way, so I'm not sure
how to replicate them, which means they're worthless; in summary, I have
wasted a lot of time.
That is why our current focus is to make sure we're all benchmarking
the things that make sense in a way that makes sense.
Our HCQC tool [1] is one way of doing that, but we need more and better
analysis of benchmark results, as well as agreement on which benchmarks we
care about and how we run them.
Every arch / sub-arch / vendor / board tuple has special rules, and we
need to compare the best on each, not the standard on all. But once we
get the numbers, we need to compare them apples-to-apples, and that's
sometimes not possible.
> Do you have ideas about how we can have better testing in this area
> otherwise?
I'd like to gather information about which benchmarks we really care about,
how we run them, and what analysis we should all do on them. I can't
share my raw numbers with you, but if we agree on a method, we can
share gains and know that they're as close as possible to be
meaningful.
For this thread, we can focus on loop benchmarks. Our team is focusing
on HPC workloads, so we will worry more about heavy loops than
"Hacker's delight" transformations, but we have to make sure we don't
break other people's stuff, so we need *all* in a bundle.
I think the test-suite in benchmark mode has a lot of potential to
become the package that we run for validation (before commit), but
that needs a lot of love before we can trust its results. We need more
relevant benchmarks, better suited results and analysis so that it can
work with the current fantastic visualisation LNT gives us.
This, IMO, together with whole-loop analysis, should be our priority
for LLVM 7 (because 6 is gone... :)
cheers,
--renato
[1] https://github.com/Linaro/hcqc