typedef short int16_t;
typedef unsigned short uint16_t;

int q(int16_t d[], uint16_t m[], uint16_t b[]) {
  int n = 0;
  for (int i = 0; i < 16; i++) {
    /* Both arms read b[i], d[i] and m[i] -- the three identical loads
       discussed below. */
    if (d[i] > 0)
      d[i] = (b[i] + d[i]) * m[i] >> 16;
    else
      d[i] = -((b[i] - d[i]) * m[i] >> 16);
    n |= d[i]; /* record whether any element ended up nonzero */
  }
  return n;
}
As it turns out, LLVM adds runtime alias checks for this function and then
considers it legal to vectorise with if-conversion. However, the vectorisation
cost model effectively bans that particular if-conversion by assigning a
prohibitively high cost to emulated masked loads and stores.
Originally, each branch of the if statement in the loop body contained three
identical loads. Manually hoisting these allows the loop to be vectorised at
`-O2` (at `-O3` the loop is fully unrolled first, which breaks vectorisation).
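For illustration, here is one manually hoisted form of the loop body (a sketch,
not necessarily the exact code measured): the three loads become unconditional
and only the arithmetic stays under the condition.

int q_hoisted(int16_t d[], uint16_t m[], uint16_t b[]) {
  int n = 0;
  for (int i = 0; i < 16; i++) {
    int16_t di = d[i];              /* the three loads, now unconditional */
    uint16_t mi = m[i];
    uint16_t bi = b[i];
    if (di > 0)
      d[i] = (bi + di) * mi >> 16;
    else
      d[i] = -((bi - di) * mi >> 16);
    n |= d[i];
  }
  return n;
}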
There's a subroutine in `SimplifyCFG` that does rudimentary hoisting of
instructions from the two successors of a block; it does hoist two of the
three loads, but gives up before hoisting the third one.
We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist`
can do that, but that pass is disabled by default and has been disabled for a
long, long time.
As an alternative, I was thinking of a simpler hoisting transformation that
just handles moving instructions from two single-predecessor blocks to their
common predecessor. That could be made reasonably fast by pairing instructions
by their GVN value number. Limiting hoisting to the predecessor block would
also, for the most part, avoid excessive increases in value lifetimes and
would simplify correctness checks.
I've written such a transformation as a subroutine of `GVN`; it seemed like a
good place for it and is in a similar spirit to the various PREs that GVN does.
The Phabricator review is at https://reviews.llvm.org/D109760.
Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):
500.perlbench_r 1.13%
502.gcc_r 0.00%
505.mcf_r -1.89%
520.omnetpp_r 0.00%
523.xalancbmk_r 0.00%
525.x264_r 7.67%
531.deepsjeng_r 0.60%
541.leela_r 0.24%
548.exchange2_r 0.00%
557.xz_r 0.75%
Comments?
~chill
This initial response focuses on different potential ways of solving this problem. I'll reply separately on your proposed GVN patch.
I see a number of ways to tackle this:

We could extend SimplifyCFG. You mention this, but don't get into *why* we
don't handle the last load. We really should in this case. (Though, after
CSE, there should only be two conditional loads in your inner loop? Maybe
I'm missing something?)

Your in-passing mention that -O3 and unrolling breaks
vectorization also concerns me. It really shouldn't. That sounds
like a probable issue in the SLP vectorizer, and maybe a pass
ordering issue.
All of the above is simply me brainstorming. :)
Philip
Ok, this response is focused on the proposed transform. See
inline.
I haven't looked at this pass recently, but I would strongly
recommend distrusting the current implementation. The software
engineering process that went into writing it doesn't give me any
confidence. I remember a long string of enable-revert-bugfix
cycles, without obvious justification as to why the bugs being found
had to be found with such an expensive approach. If you do decide to
pursue this, expect the need for a) targeted review and b) a bunch
of dedicated fuzzing.
Having said that, *conceptually* this pass takes the proper
approach by using MemorySSA. Your current implementation has a
bunch of complexity to basically do a local renumbering of memory
instructions. Some of that can be improved (see below), but if
you were doing this with MemorySSA, you wouldn't need to do
anything fancy as GVN would have numbered the loads correctly.
NewGVN is a MemorySSA version of GVN which I'd nominally say is
the right place for this, but it has never been productionized. A
huge blocker was MemorySSA itself which was recently
productionized, but I don't know what's left for NewGVN itself.
(Alina might be able to add context here?)
> As an alternative, I was thinking of a simpler hoisting transformation that
> just handles moving instructions from two single-predecessor blocks to their
> common predecessor. That could be made reasonably fast by pairing
> instructions by their GVN value number. Limiting hoisting to the predecessor
> block would also, for the most part, avoid excessive increases in value
> lifetimes and would simplify correctness checks. I've written such a
> transformation as a subroutine of `GVN`; it seemed like a good place for it
> and is in a similar spirit to the various PREs that GVN does. The
> Phabricator review is at https://reviews.llvm.org/D109760.
The current structure of the algorithm you proposed can probably be simplified. Here's what I'd suggest:
Start by handling all of the non-memory instructions. This
should be as simple as constructing a hash map from value number
to instruction for one block, and then scanning the other block.
If you find the same value number in both blocks, the (by
assumption scalar) instruction must exist in both blocks and can
be hoisted. This can be reviewed, tested, submitted, etc..
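A minimal sketch of that first step, under some assumptions: `lookupVN` stands
in for GVN's internal ValueTable numbering, and the checks that the hoisted
instruction's operands are available in the predecessor and that reordering is
safe (see the speculation discussion below) are omitted.

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Pair the scalar instructions of the two successors by value number
// and hoist each matched pair into the common predecessor.
static void hoistScalarPairs(BasicBlock *Pred, BasicBlock *Left,
                             BasicBlock *Right,
                             function_ref<uint32_t(Instruction *)> lookupVN) {
  DenseMap<uint32_t, Instruction *> LeftByVN;
  for (Instruction &I : *Left)
    if (!I.mayReadOrWriteMemory() && !I.isTerminator() && !isa<PHINode>(I))
      LeftByVN.try_emplace(lookupVN(&I), &I);

  for (Instruction &I : make_early_inc_range(*Right)) {
    if (I.mayReadOrWriteMemory() || I.isTerminator() || isa<PHINode>(I))
      continue;
    auto It = LeftByVN.find(lookupVN(&I));
    if (It == LeftByVN.end())
      continue;
    // Same value number in both blocks: hoist one copy into the
    // predecessor and fold the duplicate into it.
    Instruction *Twin = It->second;
    Twin->moveBefore(Pred->getTerminator());
    I.replaceAllUsesWith(Twin);
    I.eraseFromParent();
  }
}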
Then add a single pre-pass which "renumbers" all of the memory
instructions. You can either use a trivial "no clobbers seen"
check like EarlyCSE does, or only handle loads where
MD->getDependency() is non-local. (We already do this query in
processLoad, so it should be cached.) The same algorithm can be
used as above; all that changes is where the value number comes
from. (The "pre-pass" can probably be implemented as a helper
function which renumbers memory instructions on demand.)
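For the trivial variant, the on-demand helper could start as simple as this
sketch (a block-local simplification of what EarlyCSE's generation numbering
achieves; the name `noClobberBefore` is illustrative):

#include "llvm/IR/Instructions.h"
using namespace llvm;

// A load may participate in cross-block value numbering only if no
// instruction that may write to memory precedes it in its block.
static bool noClobberBefore(LoadInst *LI) {
  for (Instruction &I : *LI->getParent()) {
    if (&I == LI)
      return true;   // reached the load without seeing a clobber
    if (I.mayWriteToMemory())
      return false;  // a store or call might clobber the loaded value
  }
  return true;       // not reached when LI is in this block
}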
I would specifically *not* add an extra layer of iteration here.
If you find you need iteration to handle the cases you care about,
I suspect we should actually be adjusting the value re-numbering
scheme for loads described just above.
A couple of assorted notes:

MemDep should be responsible for handling all aliasing barriers. If it
returns a non-local dependence, reordering that instruction to the beginning
of the block should be legal w.r.t. memory. Your use of isHoistBarrier is
confusing to me. You may be trying to handle speculation safety (i.e. avoid
introducing faults), but if so, the special casing around volatile and
atomics seems odd? (Also, why allow hoisting of the barrier instruction?) I
suspect there's a particular case you're trying to handle. I would strongly
advise dropping that from the initial patch, then add back the complexity in
a separate change which is well motivated with tests, etc..

In terms of the actual merge logic, I'd suggest implementing it as
hoist-one, then CSE. That would allow you to reuse the existing patching
logic and avoid reimplementing a bunch of complexity.
> Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):
> 500.perlbench_r 1.13%
> 502.gcc_r 0.00%
> 505.mcf_r -1.89%
> 520.omnetpp_r 0.00%
> 523.xalancbmk_r 0.00%
> 525.x264_r 7.67%
> 531.deepsjeng_r 0.60%
> 541.leela_r 0.24%
> 548.exchange2_r 0.00%
> 557.xz_r 0.75%
Realized that the detailed nature of my response may have missed
the high order bit. I generally think that doing this
transformation in GVN is entirely reasonable, and would be a
useful enhancement. That's why I spent the time to give the
detailed feedback on implementation.
Philip
p.s. It's worth noting that what we're implementing here is a
restricted form of "very busy expressions", which is basically the
dual of PRE. There's a bunch of existing literature on the
general problem.
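For reference, the classic formulation: very busy (anticipable) expressions
are computed by a backward must-dataflow analysis, and hoisting into the
common predecessor is the local special case where an expression is generated
along every successor path:

\[
\mathrm{VB}_{\mathrm{out}}(b) = \bigcap_{s \in \mathrm{succ}(b)} \mathrm{VB}_{\mathrm{in}}(s),
\qquad
\mathrm{VB}_{\mathrm{in}}(b) = \mathrm{gen}(b) \cup \bigl(\mathrm{VB}_{\mathrm{out}}(b) \setminus \mathrm{kill}(b)\bigr),
\qquad
\mathrm{VB}_{\mathrm{out}}(\mathrm{exit}) = \varnothing
\]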
p.p.s. Once we're through with the hoisting variant, there's a
restricted sinking variant too. If we structure the hoisting
code right, the sinking should be straightforward.
The target architecture is AArch64, and the cost model is indeed entirely
unreasonable; in fact, it deliberately sets a high cost for emulated masked
loads/stores so as to disable vectorisation.

For the case I have in hand, though, I set out to find a way to not
have to deal with loads/stores in the first place, instead of handling
them better (by fixing the cost model or otherwise).
> We could extend SimplifyCFG. You mention this, but don't get into *why* we don't handle the last load. We really should in this case. (Though, after CSE, there should only be two conditional loads in your inner loop? Maybe I'm missing something?)
Two of the loads are hoisted by
`SimplifyCFGOpt::HoistThenElseCodeToIf`, which is intentionally
limited to hoisting identical instructions in identical order. In this
example, the scan of the two blocks stops at the first pair of
differing instructions, which happens to be before the third load.
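Roughly, the shape of that scan is (an illustrative sketch, not the actual
SimplifyCFG code):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Walk the two successors in lockstep; the first non-identical pair of
// instructions ends all hoisting, even if identical pairs follow.
static void hoistLockstep(BasicBlock *Then, BasicBlock *Else,
                          Instruction *Branch) {
  auto TI = Then->begin(), EI = Else->begin();
  while (!TI->isTerminator() && !EI->isTerminator()) {
    Instruction *I1 = &*TI++, *I2 = &*EI++;
    if (!I1->isIdenticalToWhenDefined(I2))
      break;                    // in q() this fires before the third load
    I1->moveBefore(Branch);     // hoist one copy above the branch...
    I2->replaceAllUsesWith(I1); // ...and fold the duplicate into it
    I2->eraseFromParent();
  }
}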
> Your in-passing mention that -O3 and unrolling breaks vectorization also concerns me. It really shouldn't. That sounds like a probable issue in the SLP vectorizer, and maybe a pass ordering issue.
I would (maybe naively) think that the loop vectoriser should be given a
chance before loop unrolling and the SLP vectoriser.
Yes, GVNHoist will do what we need
[hit send too soon]
So, GVNHoist will do the hoisting we want. However, while technically
possible, there are a few drawbacks to taking the existing GVNHoist
approach:
* GVNHoist is large and algorithmically complex
* Even if enabled, it won't solve this specific issue: after hoisting
of the loads, the loop size falls below some unrolling threshold and
the loop is fully unrolled, whereas instead the loop should be
loop-vectorised, IMHO. So we get a pass ordering issue.
* After unrolling, the SLP vectoriser does not vectorise the loop, so
we would have to fix the SLP vectoriser too.
While I think all of these are worthwhile, on the one hand we have a
rather simple transformation (the mini-GVNHoist), and on the other the
actual GVNHoist plus the pass manager plus the SLP vectoriser, which
could easily turn into a time sink with no end in sight.
It likely needs more than that - manually hoisting all the loads and
adding `restrict` still does not vectorise at `-O3`:
https://gcc.godbolt.org/z/z3KjaYv9d
> Philip seems happy with this being part of GVN, and I don't have strong opinions, but GVNHoist seems like the natural place for this. An alternative strategy could thus be to integrate this into GVNHoist, enable it by default but only your simple gvn hoist algorithm (and disable the rest). That would perhaps then be a first step to address Philip's unhappiness with GVNHoist and restructure and improve it step by step.
I'm afraid I have to disagree: GVNHoist already does all the hoisting
we need; in that sense the mini-GVNHoist is not something to
integrate, but a simpler alternative.
If we put the issue of dealing with masked loads and stores in the
vectorisers aside, I see these approaches:
* improve the mini-GVNHoist as part of GVN (the current proposal)
* bring GVNHoist up to snuff
* create a NewGVNHoist on top of NewGVN
Do you mean that NewGVN will number loads and stores correctly? The old
GVN assigns unique numbers to loads and stores.
> The Phabricator review is at https://reviews.llvm.org/D109760.
>
> The current structure of the algorithm you proposed can probably be simplified. Here's what I'd suggest:
>
> Start by handling all of the non-memory instructions. This should be as simple as constructing a hash map from value number to instruction for one block, and then scanning the other block. If you find the same value number in both blocks, the (by assumption scalar) instruction must exist in both blocks and can be hoisted. This can be reviewed, tested, submitted, etc..
>
> Then add a single pre-pass which "renumbers" all of the memory instructions. You can either use a trivial "no clobbers seen" check like EarlyCSE does, or only handle loads where MD->getDependency() is non-local. (We already do this query in processLoad, so it should be cached.) The same algorithm can be used as above; all that changes is where the value number comes from. (The "pre-pass" can probably be implemented as a helper function which renumbers memory instructions on demand.)
The issue here is that the uniqueness of load numbers propagates to the
users of those loads, so just after we've hoisted and merged a pair of
loads into one instruction, we can potentially get the same VN among
that load's users.

I was handling that with a follow-up iteration, but indeed we could
recursively traverse the users, starting from the load, and recompute
their VNs.
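Something along these lines, where `recomputeVN` is a hypothetical hook into
GVN's ValueTable rather than an existing API:

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// After merging two loads, recompute the value numbers of the
// surviving load's transitive users, so newly identical users can be
// paired for hoisting as well.
static void renumberUsers(Instruction *MergedLoad,
                          function_ref<void(Instruction *)> recomputeVN) {
  SmallVector<Instruction *, 8> Worklist{MergedLoad};
  SmallPtrSet<Instruction *, 8> Visited;
  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    if (!Visited.insert(I).second)
      continue;                 // already renumbered
    recomputeVN(I);
    for (User *U : I->users())
      if (auto *UI = dyn_cast<Instruction>(U))
        Worklist.push_back(UI);
  }
}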
> I would specifically *not* add an extra layer of iteration here. If you find you need iteration to handle the cases you care about, I suspect we should actually be adjusting the value re-numbering scheme for loads described just above.
>
> A couple of assorted notes:
>
> MemDep should be responsible for handling all aliasing barriers. If it returns a non-local dependence, reordering that instruction to the beginning of the block should be legal w.r.t. memory. Your use of isHoistBarrier is confusing to me. You may be trying to handle speculation safety (i.e. avoid introducing faults), but if so, the special casing around volatile and atomics seems odd? (Also, why allow hoisting of the barrier instruction?) I suspect there's a particular case you're trying to handle. I would strongly advise dropping that from the initial patch, then add back the complexity in a separate change which is well motivated with tests, etc..
With regard to volatile and the atomics, I guess I was being too
conservative in trying to not reorder *any* instructions above them. I
can remove those conditions. The other condition is there to avoid
speculatively executing instructions by moving them across an
instruction that may throw, e.g. with
VN 9: br i1 %tobool, label %if.then, label %if.else
if.then:
VN 11: call void @h()
VN 7: %0 = sdiv i32 %a, %b
...
if.else:
VN 7: %1 = sdiv i32 %a, %b
...
we don't want to end up with:
VN 7: %0 = sdiv i32 %a, %b
br i1 %tobool, label %if.then, label %if.else
if.then:
VN 11: call void @h()
...
if.else:
VN 7: %1 = sdiv i32 %a, %b
...
But given the variant below, where the call appears in both arms (`h` is `readnone`):
VN 9: br i1 %tobool, label %if.then, label %if.else
if.then:
VN 11: call void @h()
VN 7: %0 = sdiv i32 %a, %b
...
if.else:
VN 11: call void @h()
VN 7: %1 = sdiv i32 %a, %b
...
we could move both the call and the division while preserving their
order (which also answers "Also, why allow hoisting of the barrier
instruction?").
So, the subsequent iterations of the algorithm were designed to:
a) handle renumbering of loads and their users (which could possibly
be dealt with as described above);
b) prevent all reordering across volatiles and atomics (I'll drop
that and prevent reordering only when MemDep says there is a
dependency);
c) detect when hoisting an instruction turns its dependency from a
local one into a non-local one, thus potentially allowing more
hoisting; I'll see about explicitly checking whether two matching
instructions have a local dependency on a couple of instructions that
are themselves paired for hoisting;
d) prevent speculation across throwing instructions; I don't have a
nice non-iterative solution for that, e.g. in the example above there
is no dependency between the call and the division.
> In terms of the actual merge logic, I'd suggest implementing it as hoist-one, then CSE. That would allow you to reuse the existing patching logic and avoid reimplementing a bunch of complexity.
I'm not sure I understand this. Currently I move one instruction, then
replace all uses of the other with the first one. Do you mean that or
something else?
Thanks for the comments and the suggestions; I now have a bunch of new
ideas to try!
Thanks and best regards,
On Sep 15, 2021, at 18:04, Momchil Velikov <momchil...@gmail.com> wrote:
> On Wed, Sep 15, 2021 at 5:17 PM Sjoerd Meijer <Sjoerd...@arm.com> wrote:
>> The SLP vectoriser will fail to vectorise this because it wasn't taught to emit runtime alias analysis checks.
> It likely needs more than that - manually hoisting all the loads and
> adding `restrict` still does not vectorise at `-O3`:
> https://gcc.godbolt.org/z/z3KjaYv9d
Your argument here actually makes me less convinced. :)
In general, we should not be duplicating transform functionality just
because of pass ordering issues, or deficiencies in other passes. We
should be fixing those issues, since they are likely common to other
workloads as well.
If my opinion of GVNHoist were anything other than terrible, this
would probably have convinced me that we should pursue the GVNHoist option.
Philip
> The SLP vectoriser will fail to vectorise this because it wasn't taught to
> emit runtime alias analysis checks. This is being addressed by Florian in
> https://reviews.llvm.org/D102834. We had many issues with full unrolling
> removing opportunities for the loop vectoriser, and the SLP missing some
> tricks. But with D102834 fixed, I expect this loop to be SLP vectorised and
> most of these pass ordering problems to disappear.
>
> Philip seems happy with this being part of GVN, and I don't have strong
> opinions, but GVNHoist seems like the natural place for this. An alternative
> strategy could thus be to integrate this into GVNHoist, enable it by default
> but only your simple gvn hoist algorithm (and disable the rest). That would
> perhaps then be a first step to address Philip's unhappiness with GVNHoist
> and restructure and improve it step by step.
+1 to this point. This would be a perfectly reasonable way to implement the MemorySSA based approach in a shippable way. I'm not opposed to the non-memoryssa variant in (old) GVN, but this would be fine too.
Philip
You're missing my point. The renumbering scheme would explicitly consider
the incoming memory state of the load as an input to the hash. It would
not uniquely number the loads. As a result, two loads (in either
different blocks or even the same one) would be assigned the same value
number if all operands (including the memory state!) were the same.
The numbering scheme should handle this.
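In other words (a sketch; `PtrVN` and `MemStateVN` are assumed inputs, with
the memory state coming e.g. from MemorySSA, and `hash_combine` is LLVM's
helper from llvm/ADT/Hashing.h):

#include "llvm/ADT/Hashing.h"
using namespace llvm;

// Two loads get the same number exactly when both the value number of
// their pointer operand and their incoming memory state coincide,
// regardless of which block they live in.
static uint32_t numberLoad(uint32_t PtrVN, uint32_t MemStateVN) {
  return static_cast<uint32_t>(hash_combine(PtrVN, MemStateVN));
}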
> d) prevent speculation across throwing instructions; I don't have a
> nice non-iterative solution for that, e.g. in the example above there
> is no dependency between the call and the division.
As you walk forward, if you encounter an instruction not guaranteed to
transfer execution to its successor, only hash instructions which are
safe to speculate into the header. You can no longer rely on
guaranteed-to-execute fault safety past that point.
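As a sketch of that walk, using the real predicates from
llvm/Analysis/ValueTracking.h (`hashForHoisting` is a hypothetical callback
that records the VN-to-instruction pairing):

#include "llvm/ADT/STLExtras.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/BasicBlock.h"
using namespace llvm;

// Once an instruction that may not transfer execution (e.g. a call
// that may throw or not return) has been seen, later instructions are
// hoisting candidates only if they are safe to speculate.
static void hashBlock(BasicBlock *BB,
                      function_ref<void(Instruction *)> hashForHoisting) {
  bool MayNotTransfer = false;
  for (Instruction &I : *BB) {
    if (!MayNotTransfer || isSafeToSpeculativelyExecute(&I))
      hashForHoisting(&I);
    if (!isGuaranteedToTransferExecutionToSuccessor(&I))
      MayNotTransfer = true;
  }
}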
>
>> In terms of the actual merge logic, I'd suggest implementing it as hoist-one, then CSE. That would allow you to reuse the existing patching logic and avoid reimplementing a bunch of complexity.
> I'm not sure I understand this. Currently I move one instruction, then
> replace all uses of the other with the first one. Do you mean that or
> something else?
I mean specifically that you should be able to reuse the patchReplacement
logic rather than reimplementing all of the flag merging logic.
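Concretely, that might look like this sketch, reusing
`patchReplacementInstruction` from llvm/Transforms/Utils/Local.h:

#include "llvm/IR/Instruction.h"
#include "llvm/Transforms/Utils/Local.h"
using namespace llvm;

// Hoist one copy, then CSE the duplicate into it, letting the existing
// helper reconcile flags (nsw/nuw, exact) and metadata between the two
// instructions instead of reimplementing that merging by hand.
static void hoistThenCSE(Instruction *Keep, Instruction *Dup,
                         Instruction *InsertPt) {
  Keep->moveBefore(InsertPt);             // hoist one instruction
  patchReplacementInstruction(Dup, Keep); // weaken Keep's flags as needed
  Dup->replaceAllUsesWith(Keep);          // plain CSE of the duplicate
  Dup->eraseFromParent();
}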
> The renumbering scheme would explicitly consider
> the incoming memory state.
Oh, I see, cheers!
From: Alina Sbirlea <asbi...@google.com>
> I added a note about this in https://reviews.llvm.org/D110822.
> So, stepping back, I'd like to raise the question of how this change is going to be used if implemented.
> In the current default pipeline MemorySSA is not available where GVN is added. This is forcing the
> computation of MemorySSA before GVN.
> Having MemorySSA computed before GVN may be good in the LTO pipeline (GVN already preserves the
> analysis), as the following passes require it (MemCpyOpt and DSE), but in the function simplification pipeline
> we'd end up with GVN computing, using and preserving two analyses (MemorySSA and MemDepAnalysis) with
> overlapping goals.
> The status for switching to NewGVN is uncertain at this point, but porting GVN to switch over from
> MemDepAnalysis to MemorySSA may be an option, at least in the interim. This will lift the issue of having two
> analyses and also provide the availability of MemorySSA here.
> Is this something you'd be interested in exploring?
It is possible. We (the Arm team working on these things) would like to have a reasonable idea[1]
of a couple of things.
First, will this get us anywhere? Can anyone think of any other objections
(provided, of course, the patch is of sufficient quality) against merging this and enabling it by default, including
computing MemorySSA?
AFAIK, GVNHoist was disabled due to some benchmark regressions; unfortunately, I wasn't able to find a trace
of that decision. Does anyone know anything more about it? Would anyone interested be able to apply
https://reviews.llvm.org/D111418 (5 patches) and check whether there are unacceptable regressions
that we can try to resolve?
Second, what are we getting into, if we decide to migrate GVN.cpp to MemorySSA? Would that be
a couple of weeks or a multi-year project? Has anyone already thought about approaches to doing that?
I spent maybe a total of three days looking at GVN/MemorySSA/MemoryDependenceAnalysis
and the transition does not look straightforward or obvious, more like re-implementing (parts of)
MemoryDependenceAnalysis. I mean, it looks doable and not *that* much work, but more likely than not
I'm overlooking something.
For example, looking at the dependencies of loads (if I'm not mistaken, that's the only kind relevant to GVN),
MemoryDependenceAnalysis would return other loads as "dependencies", depending on aliasing,
and also `alloca`s. This does not have a direct equivalent in MemorySSA. I'm thinking of something along
the lines of: get the clobbering def for a load, get all its uses, and from this list remove the uses
post-dominated by other uses (as long as the post-dominating one is not a MayAlias), and somehow do this
in an efficient manner. That would hopefully give the same set of dependencies (local or not) as the ones
obtained from MemoryDependenceAnalysis, and we continue from there as before.
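A possible starting point for that, sketched with the real MemorySSA API (the
post-dominance pruning and the efficiency concerns are left out):

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Collect candidate "dependencies" of a load: the other memory
// accesses hanging off the load's clobbering definition.
static void collectCandidateDeps(MemorySSA &MSSA, LoadInst *LI,
                                 SmallVectorImpl<Instruction *> &Deps) {
  MemoryAccess *Clobber =
      MSSA.getWalker()->getClobberingMemoryAccess(LI);
  for (User *U : Clobber->users())
    if (auto *UseOrDef = dyn_cast<MemoryUseOrDef>(U))
      if (Instruction *MI = UseOrDef->getMemoryInst())
        if (MI != LI)
          Deps.push_back(MI);
}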
Any other ideas?
~chill
[1] by no means a guarantee, we're just looking to assess the risks of going one way or another