I've recently mentioned in a few places that I'm interested in enhancing the loop vectorizer to handle multiple exit loops, and have been asked to share plans. This email is intended to a) share my current thinking and b) help spark discussion among interested parties. I do need to warn that my near term plans for this have been delayed; I got pulled into an internal project instead.
Background
At the moment, our loop vectorizer does not handle any loop with
more than one exit, or where that sole exit is not from the latch
block. Interestingly, it does handle internal predication within
a single loop iteration. This results in a slightly odd behavior
where a loop which can be written with either a continue or a
break can exhibit wildly different performance depending on that
choice. It does hint at a possible direction for implementation
though, and implies that most of the cost modeling pieces are
already in place.
The use cases I'm looking at basically fall into two buckets:
for (int i = 0; i < N; i++) {
if (expr(a[i])) break;
... other vectorizable stuff ...
}
for (int i = 0; i < N; i++) {
if (expr(i)) break;
... other vectorizable stuff ...
}
The former are the actually "interesting" examples. The later are
cases where we missed eliminating some check we could have, but
not-vectorizing creates an unfortunate performance cliff.
Framing
We have three important sub-cases to handle.
First, there are all the cases where we could have handled the multiple exit loop, but chose not to for implementation reasons. A great example is:
for (int i = 0; i < N; i++) {
if (i > M) break;
a[i] = i;
}
In this case, SCEV can happily compute the actual exit bound of the loop, and we could use the existing vector-body/scalar slow-path structure. The only change needed would be to exit the vector body earlier. (There are some details here, but it's almost as easy as I'm describing if my POC patch isn't missing something major.)
There's a couple other cases like this. I suspect we can get
decent mileage out of just generalizing the existing code.
Second, there are the cases where we actually have to handle
iterations within the vector-body with predication (or
speculation). The good news is that there's already support in
code for predication, we just need to add another source of
potential predication. The bad news is that's a fairly major
change.
Our challenge will be finding a runtime check for
dereferenceability. Consider the following example:
for (int i = 0; i < N; i++)
if (a[i] == 0) return false;
return true;
If 'a' is an alloca, or statically
sized global, we can form a runtime check which ensures 'i' is
in bounds and a speculative load will not fault.
Here's a nice example we could handle with this approach.
// Assume a and b are both statically sized.
for (int i
= 0; i < N; i++) {
t = b[i];
if (t > M) throw();
sum += a[t];
}
(This is a classic a[b[i]] pattern, but with range checks added.)
This is broadly applicable enough to
be useful, and in practice covers my use cases, but I'm hoping
others have ideas for extensions here.
The other major challenge here is profitability. Consider a simple loop such as:
// assume a[0:N] is known
dereferenceable
for (int i = 0; i < N; i++)
if (a[i] == 0) return false;
return true;
If N is large, and the array is
non-zero, then this is profitable to vectorize. If a[0] == 0,
then it isn't, regardless of the value of N.
Figuring out when to vectorize vs not
for cases like this will require some thought. I don't really
have a good answer for this yet, other than when the profile on
the early exit tells us it's rarely taken.
Third, for both of the former cases,
we need to be able to compute exit values along the early exit
edges. We can get a lot of mileage out of simply skipping loops
with exit values (i.e. lcssa phis), as our loop exit value
rewriting will tend to eliminate them. However, we will
eventually want to handle this case, which will require
generating some interesting complicated code down the exit path
to figure out which iteration actually exit.
I see two general options here:
1) Use the vector-body/scalar-body
idiom of today, and have the vector body exit with IV = I when
any iteration in [I, I+VF-1] would exit. (i.e. roll back)
2) Insert dedicated exit blocks which
recompute exit conditions to determine exact exit value, and
then let the vector body run all iterations in VF which contain
the exiting iteration. (Assumes predicated stores, and that the
exit blocks skip the scalar loop)
I currently don't have a reason to
strongly prefer one over the other. (2) is conceptually cleaner
and the one which keeps coming up in conversation, but (1) may
be simpler to actually implement.
Current Plans
At the current moment, I'm reasonable sure that I'm going to get
the resources to at least tackle some of the cases where we bail
out unnecessarily. This will be a huge practical improvement in
vectorizing robustness, at (I hope) relatively little
implementation cost.
I'm also going to continue playing around with enhancements to
our current dereferenceability logic. I see that as being a key
building block to make any predication based approach practical.
I'm not sure I'm going to get to the predication support. I'd like to, but am not sure my resource constraints allow it. I'll also mention that I'm not at all sure how all of this might fit in with the VPLAN work. I'd really welcome feedback on that; is what I'm proposing at all consistent with others plans?
Philip
Apologies for leaving this thread linger so long. It was in my
back-burner but Alex's weekly remind me to reply (thanks again,
Alex!). Starting from the end...
On Mon, 9 Sep 2019 at 18:53, Philip Reames via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Current Plans
>
> At the current moment, I'm reasonable sure that I'm going to get the resources to at least tackle some of the cases where we bail out unnecessarily. This will be a huge practical improvement in vectorizing robustness, at (I hope) relatively little implementation cost.
This makes a lot of sense to me. It doesn't sound it will need a huge
refactoring to the current code and it can be done in a piece-wise way
so that progress is consistent and non-regressable.
It would be good to know if any of the test-suite benchmarks will be
affected, or if we could insert a new one there, just to keep track of
things.
> I'm also going to continue playing around with enhancements to our current dereferenceability logic. I see that as being a key building block to make any predication based approach practical.
This is *always* good work. There has been some work in the past to
common up LV with VPLan analysis stages (separating classes, moving
code), and I think if we keep working on the common infrastructure,
we'll benefit more than just LV or VPlan.
> I'm not sure I'm going to get to the predication support. I'd like to, but am not sure my resource constraints allow it. I'll also mention that I'm not at all sure how all of this might fit in with the VPLAN work. I'd really welcome feedback on that; is what I'm proposing at all consistent with others plans?
As I said, the analysis work can play well with VPlan, and if it
doesn't, we can make it work by following the same cleanups that has
been done in the past. These are complementary parts of the work and
it's great you can spend some time on some of it.
> The use cases I'm looking at basically fall into two buckets:
>
> for (int i = 0; i < N; i++) {
> if (expr(a[i])) break;
> ... other vectorizable stuff ...
> }
>
> for (int i = 0; i < N; i++) {
> if (expr(i)) break;
> ... other vectorizable stuff ...
> }
The safety analysis is different, but the vectorizer would have to
behave similarly on both cases. Specilative loads, boundary
conditions, run time checks (unless the bounds are compile time
constants?).
Doing the latter first would make the second almost *only* about
reference analysis.
> First, there are all the cases where we could have handled the multiple exit loop, but chose not to for implementation reasons. A great example is:
>
> for (int i = 0; i < N; i++) {
> if (i > M) break;
> a[i] = i;
> }
>
> In this case, SCEV can happily compute the actual exit bound of the loop, and we could use the existing vector-body/scalar slow-path structure. The only change needed would be to exit the vector body earlier. (There are some details here, but it's almost as easy as I'm describing if my POC patch isn't missing something major.)
If M < N, then this could potentially be converted to iterate over M
and make the loop unconditional, no? I know this is a silly case, but
if a more complex loop simplifies to this one by other optimisations,
this would be a very easy change to make.
We could also look at it from a loop-splitting point of view, for
example the common pattern:
for (i: M) {
if (a == 0)
startup(i);
else
remainder(i);
}
Or more complicated examples, like matrix operations that are "easy"
in the bulk of it, but complicated (non-vectorizeable) at the edges.
This would probably be something for the VPlan side of things, but
it's good to keep in mind that a simple example can get very
interesting. :)
> // Assume a and b are both statically sized.
> for (int i = 0; i < N; i++) {
> t = b[i];
> if (t > M) throw();
> sum += a[t];
> }
>
> (This is a classic a[b[i]] pattern, but with range checks added.)
This looks like a clear pattern for predicated semantics. :)
In the absence of that, if the condition is simple, we can try
splitting or tail loops, but only predication would give us an extra
mile on preventing speculative loads to crash.
> // assume a[0:N] is known dereferenceable
> for (int i = 0; i < N; i++)
> if (a[i] == 0) return false;
> return true;
>
> If N is large, and the array is non-zero, then this is profitable to vectorize. If a[0] == 0, then it isn't, regardless of the value of N.
>
> Figuring out when to vectorize vs not for cases like this will require some thought. I don't really have a good answer for this yet, other than when the profile on the early exit tells us it's rarely taken.
We can already bail on small constant N, and generally, code that
assumes it's small, usually use it as a constant (like RGB or xyz
handling). In those cases, it may also be profitable to do strided
vectorisation (which we do).
> Third, for both of the former cases, we need to be able to compute exit values along the early exit edges. We can get a lot of mileage out of simply skipping loops with exit values (i.e. lcssa phis), as our loop exit value rewriting will tend to eliminate them. However, we will eventually want to handle this case, which will require generating some interesting complicated code down the exit path to figure out which iteration actually exit.
I think getting a real case in the test-suite and make that work would
make a lot of people happy. :)
> 1) Use the vector-body/scalar-body idiom of today, and have the vector body exit with IV = I when any iteration in [I, I+VF-1] would exit. (i.e. roll back)
>
> 2) Insert dedicated exit blocks which recompute exit conditions to determine exact exit value, and then let the vector body run all iterations in VF which contain the exiting iteration. (Assumes predicated stores, and that the exit blocks skip the scalar loop)
The second option sounds easier to reason with, but it could also
generate more clutter for other passes to get lost in. I don't have a
strong opinion either, but I would like to limit the damage of the
current vectoriser (to other passes, or itself), if we are to move to
VPlan any time soon.
Thanks for working on this, I think a lot of good stuff will come out of it! :)
cheers,
--renato
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Yup.
> I don't know of any plans to incorporate iteration space splitting
> w/VPLan. I don't have any plans to go that far; if I had such test
> cases - I don't - I'd want to start with separately factored transforms
> if we could. Doing everything within one transform is undesirable. :)
AFAIK this isn't strictly in the immediate plans, but loop splitting
was one of the aims for doing outer-loop vectorisation.
My point was that if we want to do that, vplan would be a good place,
because we could add this as its own plan, which would expose other
vectorisation opportunities (including fusion with other loops).
> > I think getting a real case in the test-suite and make that work would
> > make a lot of people happy. :)
> I wish. Unfortunately, my "real test case" is a java benchmark. I
> doubt I'll be able to get it into the test suite. :(
Hm, not likely. Getting those loops as IR in the lit tests would
probably be enough, if these transformations don't affect anything in
the test-suite.
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
Not specifically, but I guess I was "specific", sorry about that. :)
I meant that loop splitting is one particular case that can expose
outer-loop vectorisation opportunities in VPlan, as you describe
below.
> VPlan is designed to support the tentative decisions and alternatives when vectorizing. E.g., when different types of instructions are to be generated for different VF's. Loop index splitting, and turning a multiple exit loop into a (countable) single exit loop, seem like preparatory transformations that can enable vectorization and/or interleaving for any VF/UF, similar to loop distribution.
Right, for probing the search space in more than one path, and picking
the best overall, not the first profitable. With some conservative
heuristics we can restrict the combinations and reach a reasonable
solution without considerably more compile time.
> VPlan currently models the instructions inside the vector loop. If a "while" or multiple exit loop has stores above the "break point", it may be possible to sink them in order to avoid issuing speculative stores. LV applies similar "SinkAfter" code motion when needed to facilitate 1st order recurrence. Loads may indeed be handled speculatively by dereferencability, which may in turn require peeling the first iterations to reach an aligned address. Such peeling may be folded into the vector loop by masking, if desired, analogous to LV's foldTailByMasking.
There was also another work, for platforms without masking, to fold
tail loops into smaller VFs, but I think that didn't go in because it
increased complexity a lot and performance by almost nothing. With
most large vector platforms now supporting masks, I think peeling with
masks may be a generic and sufficient approach on both ends, and leave
head/tail scalar loops for older platforms.
Responding to this old thread to let interested parties know
there's been some progress on this (finally).
The first sub-item described below - multiple exit loops with computable trip counts - is in progress, and will likely be wrapped up in the not too distant future. The first major change (4b33b2387) landed two weeks ago, two smaller changes are on review (D93725, and D93865), and there's likely only one major patch needed after that.
To my
knowledge, there's been no progress on the second item and I'm
not anticipating any in the near future.
Philip
Small progress update.
Work on this largely stalled not long after I sent my last email
due to a difficult to debug issue seen on PPC builders. That was
only recently resolved in e49d65f3. As of now, the last patch for
the analyzeable exit subset is now on review
(https://reviews.llvm.org/D105817).
Once that goes in, I don't plan to take this any further at this time. This was a hobby project for me, and has taken much longer than I anticipated. Unless I find someone to sponsor this work, I'll probably turn my personal hobby efforts towards easier and more immediately rewarding efforts.
Philip
This project was finally completed exactly one month ago with
change 95346ba87. Support for multiple exit loop vectorization
has now been in tree without reported problems long enough to be
considered complete.
If anyone is interested, I wrote up a summary of the work and some thoughts on open problems in this space. I'm not currently planning on continuing work here, and this writeup is me dumping my mental state for potential latter reconstruction if needed more than anything else, but others might find it interesting as well.
Philip