Hi Sanjay,
The feature was originally developed for ARM's VLDn/VSTn instructions
and then extended to AArch64 and PPC, but not x86/64 yet.
I believe Elena was working on that, but needed to get the
scatter/gather intrinsics working first. I just copied her in case I'm
wrong. :)
cheers,
--renato
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Hi Michael,
Some time back I did some experiments with the interleave vectorizer and did not find any degradation,
but my tests/benchmarks are probably not extensive enough to cover much.
Elena is the right person to comment on it, as she has already seen cases where it hinders performance.
For the interleave vectorizer on X86 we do not have any specific costing; it falls back to BasicTTI, where the costing is not appropriate for X86.
It counts the cost of extracts and inserts for extracting elements from a wide vector, which is really expensive.
E.g., in your test case the cost of the load associated with “in[i * 2]” is 10 (for VF4).
The interleave vectorizer will generate the following instructions for it:
%wide.vec = load <8 x i32>, <8 x i32>* %14, align 4, !tbaa !1, !alias.scope !5
%strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
The wide load gets a cost of 2 (as it has to generate 2 loads), but extracting the elements (the shuffle operation) gets a cost of 8 (4 for extracts + 4 for inserts).
The cost should be 3 here: 2 for the loads and 1 for the shuffle.
To enable the interleave vectorizer on X86 we should implement proper cost estimation.
The test you mentioned is indeed a candidate for strided memory vectorization.
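To make the arithmetic above concrete, here is a rough sketch of the two cost calculations. It is a simplified model written for illustration only, not LLVM's actual TTI code; the struct and function names are hypothetical, and it assumes 128-bit registers holding four i32 elements.

```cpp
#include <cassert>

// Hypothetical, simplified model of an interleaved load; not LLVM's TTI API.
struct InterleavedLoad {
  unsigned VF;         // elements in the strided result (4 in the example)
  unsigned Factor;     // interleave factor, i.e. the stride (2 for in[i * 2])
  unsigned EltsPerReg; // elements per legal vector register (assumed: 4 x i32)
};

// Number of legal-width loads needed to cover the wide vector.
unsigned wideLoadCost(const InterleavedLoad &L) {
  unsigned WideElts = L.VF * L.Factor;
  return (WideElts + L.EltsPerReg - 1) / L.EltsPerReg; // round up
}

// Generic costing as described above: one extract and one insert per element.
unsigned extractInsertCost(const InterleavedLoad &L) {
  return wideLoadCost(L) + L.VF /* extracts */ + L.VF /* inserts */;
}

// Costing that assumes the whole extraction folds into a single shuffle.
unsigned singleShuffleCost(const InterleavedLoad &L) {
  return wideLoadCost(L) + 1;
}
```

With VF = 4, Factor = 2 and four elements per register, the first model gives 2 + 4 + 4 = 10 and the second gives 2 + 1 = 3, matching the numbers quoted above.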
Regards,
Ashutosh
Isn't our current interleaved access vectorization just a special case of the more general strided access proposal? If so, from a development perspective, it might make sense to begin incorporating some of that work into the existing framework (with appropriate target hooks and costs). This could probably be done piecemeal rather than all at once.
Also, keep in mind that ARM/AArch64 run an additional IR pass (InterleavedAccessPass) that matches the load/store plus shuffle sequences that the vectorizer generates to target-specific intrinsics.
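For readers less familiar with the pattern, the load-plus-shuffle sequence amounts to a wide contiguous load followed by a strided selection of lanes. A minimal scalar emulation in C++ might look like the following; the function name is made up for illustration and this is not how the pass itself is written.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Emulates a shufflevector whose mask is <Offset, Offset+Factor, Offset+2*Factor, ...>
// applied to a wide vector of VF * Factor elements. Hypothetical helper for illustration.
template <std::size_t VF, std::size_t Factor>
std::array<int, VF> stridedShuffle(const std::array<int, VF * Factor> &Wide,
                                   std::size_t Offset) {
  std::array<int, VF> Out{};
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    Out[Lane] = Wide[Offset + Lane * Factor]; // pick every Factor-th element
  return Out;
}
```

Calling stridedShuffle<4, 2>(Wide, 0) corresponds to the mask <0, 2, 4, 6>, and an offset of 1 would select the odd lanes of the same wide load.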
-- Matt
Right now we have concluded that interleaved access is not always beneficial for X86, or at least that it requires additional target-specific optimizations.
First of all, we need a more precise cost model that can estimate the real number of shuffles in each case. We also need to understand what to do with edge elements in the vector if they do not need to be loaded. We should probably issue a masked load in this case.
As far as I remember (maybe I’m wrong), the vectorizer does not generate shuffles for interleaved access. It generates a bunch of extracts and inserts that ought to be combined into shuffles afterwards.
This adds uncertainty to the cost modeling.
- Elena
That's my understanding as well.
Whatever strategy we take, it will be a mix of telling the cost model
to avoid some pathological cases as well as improving the detection of
the patterns in the x86 back-end.
The work to benchmark this properly looks harder than enabling the
right flags and patterns. :)
cheers,
--renato
On 6 August 2016 at 00:18, Michael Kuperstein <mku...@google.com> wrote:
> I agree that we can get *more* improvement with better cost modeling, but
> I'd expect to be able to get *some* improvement the way things are right
> now.
Elena said she saw "some" improvements. :)
> That's why I'm curious about where we saw regressions - I'm wondering
> whether there's really a significant cost modeling issue I'm missing, or
> it's something that's easy to fix so that we can make forward progress,
> while Ashutosh is working on the longer-term solution.
Sounds like a task to try a few patterns and fiddle with the cost model.
Arnold did a lot of those during the first months of the vectorizer,
so it might be just a matter of finding the right heuristics, at least
for the low hanging fruits.
Of course, that'd also involve benchmarking everything else, to make
sure the new heuristics don't introduce regressions on
non-interleaved vectorisation.
Two things have emerged from this whole discussion:
1. The current costing does not account for the folding of chains of “extracts” and “inserts” by InstCombine, so improving it will move us in a positive direction.
2. We need to know what goes wrong with interleaved access enabled. That will help us understand the performance behavior of the vectorizer, and it will also help improve the vectorizer infrastructure by properly improving and including Ashutosh’s patch.
Since the vectorizer is a complex component of the compiler, making small incremental improvements, starting with the low-hanging fruit, is a good approach considering the testing and performance analysis involved.
Regards,
Shahid
From: llvm-dev [mailto:llvm-dev...@lists.llvm.org]
On Behalf Of Michael Kuperstein via llvm-dev
Sent: Saturday, August 06, 2016 5:26 AM
To: Renato Golin
Cc: Matthew Simpson; llvm-dev
Subject: Re: [llvm-dev] enabling interleaved access loop vectorization
On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato...@linaro.org> wrote:
From: Michael Kuperstein [mailto:mku...@google.com]
Sent: Saturday, August 06, 2016 02:56
To: Renato Golin <renato...@linaro.org>
Cc: Demikhovsky, Elena <elena.de...@intel.com>; Matthew Simpson <mssi...@codeaurora.org>; Nema, Ashutosh <Ashuto...@amd.com>; Sanjay Patel <spa...@rotateright.com>; llvm-dev <llvm...@lists.llvm.org>; Zaks, Ayal <ayal...@intel.com>
Subject: Re: [llvm-dev] enabling interleaved access loop vectorization
> We also need to understand what to do with edge elements in the vector if their loading is not required. We, probably, should issue a masked load in this case.
The existing code handles such edge cases, where the last element of an InterleaveGroup is absent, by making sure the last iteration (and up to the last VF iterations) is peeled and executed as scalar code; see requiresScalarEpilogue.
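As an illustration of why the peeling matters, here is a hand-written sketch (not vectorizer output) of a stride-2 loop that reads only the even elements. The function name, VF = 4 and the element type are assumptions for the example. The wide load in the "vector" body touches VF * 2 contiguous elements, so the body must stop early enough that the unused trailing element is never read past the end of the array.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums In[0], In[2], In[4], ... for N iterations. In needs only 2 * N - 1
// elements, because the odd element after the last even one is never used.
long sumEven(const std::vector<int> &In, std::size_t N) {
  constexpr std::size_t VF = 4, Factor = 2;
  long Sum = 0;
  std::size_t I = 0;
  // "Vector" body: runs only while at least one iteration is left over for the
  // scalar epilogue, so the wide load below always stays in bounds.
  for (; I + VF < N; I += VF) {
    std::array<int, VF * Factor> Wide;
    std::copy_n(In.begin() + I * Factor, VF * Factor, Wide.begin()); // wide load
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      Sum += Wide[Lane * Factor]; // strided extraction
  }
  // Scalar epilogue: the peeled last iterations touch only the elements they need.
  for (; I < N; ++I)
    Sum += In[I * Factor];
  return Sum;
}
```

If the body ran while I + VF <= N instead, the final wide load would read index 2 * N - 1, one element past an array of 2 * N - 1 elements; a masked load would be the alternative way to avoid that.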
One place to find them, using the default BaseT::getInterleavedMemoryOpCost(), is DENBench’s RGB conversions.
Ayal.
Hi Michael,
Don’t quite have a full reproducer for you yet. You’re welcome to try and see what’s happening in 32 bit mode when enabling interleaving for the following, based on “https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ”:
void rgb2yik (char * in, char * out, int N)
{
  int j;
  for (j = 0; j < N; ++j) {
    unsigned char r = *in++;
    unsigned char g = *in++;
    unsigned char b = *in++;
    unsigned char y = 0.299*r + 0.587*g + 0.114*b;
    signed char i = 0.596*r + -0.274*g + -0.321*b;
    signed char q = 0.211*r + -0.523*g + 0.312*b;
    *out++ = y;
    *out++ = (unsigned char)i;
    *out++ = (unsigned char)q;
  }
}
but you’d currently need to force it to vectorize to overcome its expected cost.
Ayal.
So turns out it is a full reproducer after all (choosing to vectorize on AVX), good.
> The details are in PR29025.
Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? ;-)
> But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:
Indeed such padding is a known (programmer) optimization to effectively have power-of-2 strides and/or alignment.
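For reference, the padded variant being discussed might look roughly like this. It is a sketch based on the rgb2yik loop from earlier in the thread; the function name and the switch to unsigned char pointers are my own choices for the example, and the caller is assumed to provide 4 * N output bytes.

```cpp
#include <cassert>

// Same conversion as rgb2yik, but each output pixel occupies 4 bytes
// (y, i, q, 0), so the store side has a power-of-2 stride.
void rgb2yik_padded(const unsigned char *in, unsigned char *out, int N) {
  for (int j = 0; j < N; ++j) {
    unsigned char r = *in++;
    unsigned char g = *in++;
    unsigned char b = *in++;
    unsigned char y = 0.299*r + 0.587*g + 0.114*b;
    signed char i = 0.596*r + -0.274*g + -0.321*b;
    signed char q = 0.211*r + -0.523*g + 0.312*b;
    *out++ = y;
    *out++ = (unsigned char)i;
    *out++ = (unsigned char)q;
    *out++ = 0; // padding byte: makes the output stride 4 instead of 3
  }
}
```

The input side still has a stride of 3 here; only the stores are padded, which is the part that produced the <12 x i8> vector mentioned above.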
> So, unfortunately, it turns out I don't have access to DENBench.
If you like we could test your patch to see how it (mis)behaves.
> Seriously though - you probably saw that I just committed a fix for PR29025 (r280418).
> Care to give it a spin internally?
Sure; spinning with r280423 and the patch below (*) indeed takes care of the slowdowns observed in 32 bit mode for AVX :-)
> If you still see regressions, more reproducers will be appreciated. :-)
> If there are no more regressions, let me know, and I'll post a patch to enable interleaved access for x86.
Unfortunately, we’re still observing severe slowdowns in 32 bit mode for SSE with -march=slm for the same rgb conversion workloads. Seems like we’ll need a different reproducer for that, as rgb2yik.c below is left unvectorized when compiled to slm.
Ayal.
(*) used the following in anticipation of your patch(?), effectively equivalent to -enable-interleaved-mem-accesses:
Index: lib/Target/X86/X86TargetTransformInfo.cpp
===================================================================
--- lib/Target/X86/X86TargetTransformInfo.cpp (revision 280423)
+++ lib/Target/X86/X86TargetTransformInfo.cpp (working copy)
@@ -41,6 +41,10 @@
   return ST->hasPOPCNT() ? TTI::PSK_FastHardware : TTI::PSK_Software;
 }
 
+bool X86TTIImpl::enableInterleavedAccessVectorization() {
+  return true;
+}
+
 unsigned X86TTIImpl::getNumberOfRegisters(bool Vector) {
   if (Vector && !ST->hasSSE1())
     return 0;
Index: lib/Target/X86/X86TargetTransformInfo.h
===================================================================
--- lib/Target/X86/X86TargetTransformInfo.h (revision 280423)
+++ lib/Target/X86/X86TargetTransformInfo.h (working copy)
@@ -59,6 +59,7 @@
   /// \name Vector TTI Implementations
   /// @{
 
+  bool enableInterleavedAccessVectorization();
   unsigned getNumberOfRegisters(bool Vector);
   unsigned getRegisterBitWidth(bool Vector);
   unsigned getMaxInterleaveFactor(unsigned VF);