[llvm-dev] enabling interleaved access loop vectorization

Sanjay Patel via llvm-dev

May 26, 2016, 2:12:09 PM
to llvm-dev
Is there a compile-time and/or potential runtime cost that makes enableInterleavedAccessVectorization() default to 'false'?

I notice that this is set to true for ARM, AArch64, and PPC.

In particular, I'm wondering if there's a reason it's not enabled for x86 in relation to PR27881:
https://llvm.org/bugs/show_bug.cgi?id=27881

Renato Golin via llvm-dev

May 26, 2016, 2:25:16 PM
to Sanjay Patel, Demikhovsky, Elena, llvm-dev
Hi Sanjay,

The feature was originally developed for ARM's VLDn/VSTn instructions
and then extended to AArch64 and PPC, but not x86/64 yet.

I believe Elena was working on that, but needed to get the
scatter/gather intrinsics working first. I just copied her in case I'm
wrong. :)

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Demikhovsky, Elena via llvm-dev

May 26, 2016, 3:35:24 PM
to Renato Golin, Sanjay Patel, llvm-dev
Interleaved access is not enabled on X86 yet.
We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We need to provide an accurate cost, which depends on the number of shuffles, and the number of shuffles in turn depends on the permutation (the shuffle mask). Even if we can estimate the number of shuffles, they are not generated in place: the vectorizer produces a long chain of "extracts" and "inserts" that will hopefully be combined into shuffles by a later instcombine pass.
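
To make the two forms concrete, here is a minimal sketch in plain C (an illustration only, not vectorizer output). It models a factor-2 interleaved load at VF 4 both as a single shuffle with a mask and as the equivalent chain of per-lane extracts and inserts. The names `shuffle`, `extract_insert_chain` and `deinterleave2` are made up for this example.

```c
#include <assert.h>

enum { VF = 4, FACTOR = 2, WIDE = VF * FACTOR };

/* One "shuffle": out[k] = wide[mask[k]] for each of the VF result lanes. */
void shuffle(const int wide[WIDE], const int mask[VF], int out[VF]) {
    for (int k = 0; k < VF; ++k) {
        assert(mask[k] >= 0 && mask[k] < WIDE);
        out[k] = wide[mask[k]];
    }
}

/* The same result written as VF extracts followed by VF inserts. */
void extract_insert_chain(const int wide[WIDE], int member, int out[VF]) {
    assert(member >= 0 && member < FACTOR);
    for (int k = 0; k < VF; ++k) {
        int scalar = wide[k * FACTOR + member]; /* models an extract */
        out[k] = scalar;                        /* models an insert  */
    }
}

/* De-interleave VF (even, odd) pairs: one wide load plus two shuffles. */
void deinterleave2(const int *in, int even[VF], int odd[VF]) {
    static const int even_mask[VF] = {0, 2, 4, 6};
    static const int odd_mask[VF]  = {1, 3, 5, 7};
    int wide[WIDE];
    for (int k = 0; k < WIDE; ++k)  /* models the wide load */
        wide[k] = in[k];
    shuffle(wide, even_mask, even);
    shuffle(wide, odd_mask, odd);
}
```

Both forms produce the same lanes; the open question in this thread is only how many real shuffle instructions the backend ends up needing for a given mask.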

- Elena


---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Michael Kuperstein via llvm-dev

Aug 4, 2016, 7:23:39 PM
to Demikhovsky, Elena, llvm-dev, Matthew Simpson
Hi Elena,

Circling back to this, do you know of any concrete cases where enabling interleaved access on x86 is unprofitable?
Right now, there are some cases where we lose significantly, because (a) we consider gathers (on architectures that don't have them) extremely expensive, so we won't vectorize them at all without interleaved access, and (b) we have interleaved access turned off.

Consider something like this:

void foo(int *in, int *out) {
  int i = 0;
  for (i = 0; i < 256; ++i) {
    out[i] = in[i] + in[i + 1] + in[i + 2] + in[i * 2];
  }
}

We don't vectorize this loop at all, because we calculate the cost of the in[i * 2] gather to be 14 cycles per lane (!).
This is an overestimate we need to fix, since the vectorized code is actually fairly decent - e.g. forcing vectorization, with SSE4.2, we get:

.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu (%rdi,%rax,4), %xmm3
movd %xmm0, %rcx
movdqu 4(%rdi,%rcx,4), %xmm4
paddd %xmm3, %xmm4
movdqu 8(%rdi,%rcx,4), %xmm3
paddd %xmm4, %xmm3
movdqa %xmm1, %xmm4
paddq %xmm4, %xmm4
movdqa %xmm0, %xmm5
paddq %xmm5, %xmm5
movd %xmm5, %rcx
pextrq $1, %xmm5, %rdx
movd %xmm4, %r8
pextrq $1, %xmm4, %r9
movd (%rdi,%rcx,4), %xmm4    # xmm4 = mem[0],zero,zero,zero
pinsrd $1, (%rdi,%rdx,4), %xmm4
pinsrd $2, (%rdi,%r8,4), %xmm4
pinsrd $3, (%rdi,%r9,4), %xmm4
paddd %xmm3, %xmm4
movdqu %xmm4, (%rsi,%rax,4)
addq $4, %rax
paddq %xmm2, %xmm0
paddq %xmm2, %xmm1
cmpq $256, %rax              # imm = 0x100
jne .LBB0_3

But the real point is that with interleaved access enabled, we vectorize, and get:

.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu (%rdi,%rcx), %xmm0
movdqu 4(%rdi,%rcx), %xmm1
movdqu 8(%rdi,%rcx), %xmm2
paddd %xmm0, %xmm1
paddd %xmm2, %xmm1
movdqu (%rdi,%rcx,2), %xmm0
movdqu 16(%rdi,%rcx,2), %xmm2
pshufd $132, %xmm2, %xmm2      # xmm2 = xmm2[0,1,0,2]
pshufd $232, %xmm0, %xmm0      # xmm0 = xmm0[0,2,2,3]
pblendw $240, %xmm2, %xmm0      # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7]
paddd %xmm1, %xmm0
movdqu %xmm0, (%rsi,%rcx)
cmpq $992, %rcx              # imm = 0x3E0
jne .LBB0_7

The performance I see out of the 3 versions (with a 500K-iteration outer loop):

Scalar: 0m10.320s
Vector (Non-interleaved): 0m8.054s
Vector (Interleaved): 0m3.541s
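
For reference, the speedups implied by these timings can be worked out as below. This is just arithmetic on the three numbers above; `speedup` and `print_speedups` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of a variant over a baseline, both given in wall-clock seconds. */
double speedup(double baseline_seconds, double variant_seconds) {
    assert(baseline_seconds > 0.0 && variant_seconds > 0.0);
    return baseline_seconds / variant_seconds;
}

/* Prints the ratios for the three timings quoted in this message. */
void print_speedups(void) {
    const double scalar = 10.320, vector = 8.054, interleaved = 3.541;
    printf("non-interleaved vs scalar:      %.2fx\n", speedup(scalar, vector));
    printf("interleaved vs scalar:          %.2fx\n", speedup(scalar, interleaved));
    printf("interleaved vs non-interleaved: %.2fx\n", speedup(vector, interleaved));
}
```

That works out to roughly 1.28x for the non-interleaved vector loop and 2.91x for the interleaved one, so the interleaved loop is about 2.27x faster than the non-interleaved vector loop.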

This is far from being the perfect use case for interleaved access:
1) There's no real interleaving, just one strided gather, so this would be better served by Ashutosh's full "strided access" proposal.
2) It looks like the actual move + shuffle sequence is not better, and even probably worse, than just inserting directly from memory - but it's still worthwhile because of how much we save on the index computations.
Regardless of all that, the fact of the matter is that we get much better code by treating it as interleaved, and I think this may be a good enough motivation to enable it, unless we significantly regress in other cases.

I was going to look at benchmarks to see if we get any regressions, but if you already have examples you're aware of, that would be great.

Thanks,
  Michael

Nema, Ashutosh via llvm-dev

Aug 5, 2016, 7:20:58 AM
to Michael Kuperstein, Demikhovsky, Elena, llvm-dev, Matthew Simpson

Hi Michael,

Some time back I did some experiments with the interleave vectorizer and did not find any degradation, but my tests/benchmarks are probably not extensive enough to cover much.

Elena is the right person to comment on this, as she has already seen cases where it hurts performance.

For the interleave vectorizer on X86 we do not have any target-specific costing; it falls back to BasicTTI, where the costing is not appropriate for X86. BasicTTI counts the cost of the extracts and inserts needed to pull elements out of a wide vector, which is really expensive.

For example, in your test case the cost of the load associated with “in[i * 2]” is 10 (for VF 4). The interleave vectorizer will generate the following instructions for it:

  %wide.vec = load <8 x i32>, <8 x i32>* %14, align 4, !tbaa !1, !alias.scope !5
  %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

The wide load gets a cost of 2 (it has to generate 2 loads), but extracting the elements (the shuffle operation) gets a cost of 8 (4 for the extracts + 4 for the inserts).

The cost should be 3 here: 2 for the loads and 1 for the shuffle.
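
The arithmetic above can be sketched as a tiny cost calculator. This is a simplified model of the numbers in this message, not the actual BasicTTI code; the function names and the 128-bit register width (SSE) are assumptions made for the example.

```c
#include <assert.h>

/* Number of register-sized loads needed to cover the wide vector. */
unsigned wide_load_cost(unsigned num_elts, unsigned elt_bits, unsigned reg_bits) {
    assert(reg_bits > 0);
    unsigned total_bits = num_elts * elt_bits;
    return (total_bits + reg_bits - 1) / reg_bits; /* round up */
}

/* Conservative estimate: one extract plus one insert per result lane. */
unsigned conservative_interleaved_load_cost(unsigned vf, unsigned factor,
                                            unsigned elt_bits, unsigned reg_bits) {
    return wide_load_cost(vf * factor, elt_bits, reg_bits) + 2 * vf;
}

/* Estimate when the lanes are known to be produced by num_shuffles shuffles. */
unsigned shuffle_interleaved_load_cost(unsigned vf, unsigned factor,
                                       unsigned elt_bits, unsigned reg_bits,
                                       unsigned num_shuffles) {
    return wide_load_cost(vf * factor, elt_bits, reg_bits) + num_shuffles;
}
```

With VF 4, factor 2, i32 elements and 128-bit registers, the conservative estimate is 2 + 8 = 10 and the shuffle-based one is 2 + 1 = 3. For comparison, the gather cost Michael quoted (14 per lane) comes to 56 for the same four lanes.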

 

To enable the interleave vectorizer on X86 we should implement a proper cost estimate.

The test you mentioned is indeed a candidate for strided memory vectorization.

Regards,
Ashutosh

Matthew Simpson via llvm-dev

Aug 5, 2016, 11:38:14 AM
to Nema, Ashutosh, Michael Kuperstein, Demikhovsky, Elena, llvm-dev

Isn't our current interleaved access vectorization just a special case of the more general strided access proposal? If so, from a development perspective, it might make sense to begin incorporating some of that work into the existing framework (with appropriate target hooks and costs). This could probably be done piecemeal rather than all at once.

Also, keep in mind that ARM/AArch64 run an additional IR pass (InterleavedAccessPass) that matches the load/store plus shuffle sequences that the vectorizer generates to target-specific intrinsics.

-- Matt

Michael Kuperstein via llvm-dev

Aug 5, 2016, 12:57:42 PM
to Nema, Ashutosh, Matthew Simpson, llvm-dev
I agree the BasicTTI cost for interleaving is fairly conservative, but I don't think that's "inappropriate" for x86.

The cost we have for gathers right now is very conservative (as I wrote in the original email, 14 per lane). So, enabling interleaving, even with the BasicTTI cost, will only reduce the total estimated cost for the vectorized versions - which should be a good thing (since the cost is *still* conservative).

Michael Kuperstein via llvm-dev

Aug 5, 2016, 1:05:52 PM
to Matthew Simpson, llvm-dev
Regarding InterleavedAccessPass - sure, but proper strided/interleaved access optimization ought to have a positive impact even without target support.
Case in point - Hal enabled it on PPC last September. An important difference vs. x86 seems to be that arbitrary shuffles are cheap on PPC, but, as I said below, I hope we can enable it on x86 with a conservative cost function, and still get improvement.

Demikhovsky, Elena via llvm-dev

Aug 5, 2016, 4:01:07 PM
to Michael Kuperstein, Matthew Simpson, llvm-dev

Ayal tried to enable interleaved access just by switching “false” to “true” and measured performance. He got significant performance degradation on several benchmarks. Some benchmarks looked better, of course.

For now we have concluded that interleaved access is not always beneficial on X86, or at least that it requires additional target-specific optimizations.

First of all, we need a more precise cost model that can estimate the real number of shuffles in each case. We also need to understand what to do with edge elements in the vector if loading them is not required. We should probably issue a masked load in this case.

As far as I remember (maybe I’m wrong), the vectorizer does not generate shuffles for interleaved access. It generates a bunch of extracts and inserts that ought to be combined into shuffles afterwards. This adds uncertainty to the cost modeling.

- Elena

Renato Golin via llvm-dev

Aug 5, 2016, 5:03:12 PM
to Demikhovsky, Elena, Matthew Simpson, llvm-dev
On 5 August 2016 at 21:00, Demikhovsky, Elena <elena.de...@intel.com> wrote:
> As far as I remember, may be I’m wrong, vectorizer does not generate
> shuffles for interleave access. It generates a bunch of extracts and inserts
> that ought to be coupled into shuffles after wise.

That's my understanding as well.

Whatever strategy we take, it will be a mix of telling the cost model
to avoid some pathological cases as well as improving the detection of
the patterns in the x86 back-end.

The work to benchmark this properly looks harder than enabling the
right flags and patterns. :)

cheers,
--renato

Michael Kuperstein via llvm-dev

Aug 5, 2016, 7:19:23 PM
to Renato Golin, Matthew Simpson, llvm-dev
As Ashutosh wrote, the BasicTTI cost model evaluates this as the cost of using extracts and inserts.
So even if we end up generating inserts and extracts (and I believe we actually manage to get the right shuffles, more or less, courtesy of InstCombine and the shuffle lowering code), we should be seeing improvements with the current cost model.
I agree that we can get *more* improvement with better cost modeling, but I'd expect to be able to get *some* improvement the way things are right now.

That's why I'm curious about where we saw regressions - I'm wondering whether there's really a significant cost modeling issue I'm missing, or it's something that's easy to fix so that we can make forward progress, while Ashutosh is working on the longer-term solution.

Renato Golin via llvm-dev

Aug 5, 2016, 7:37:32 PM
to Michael Kuperstein, Matthew Simpson, llvm-dev
On 6 August 2016 at 00:18, Michael Kuperstein <mku...@google.com> wrote:
> I agree that we can get *more* improvement with better cost modeling, but
> I'd expect to be able to get *some* improvement the way things are right
> now.

Elena said she saw "some" improvements. :)


> That's why I'm curious about where we saw regressions - I'm wondering
> whether there's really a significant cost modeling issue I'm missing, or
> it's something that's easy to fix so that we can make forward progress,
> while Ashutosh is working on the longer-term solution.

Sounds like a task to try a few patterns and fiddle with the cost model.

Arnold did a lot of those during the first months of the vectorizer,
so it might be just a matter of finding the right heuristics, at least
for the low hanging fruits.

Of course, that'd also involve benchmarking everything else, to make
sure the new heuristics doesn't introduce regressions on
non-interleaved vectorisation.

Michael Kuperstein via llvm-dev

Aug 5, 2016, 7:56:15 PM
to Renato Golin, Matthew Simpson, llvm-dev
On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato...@linaro.org> wrote:
> On 6 August 2016 at 00:18, Michael Kuperstein <mku...@google.com> wrote:
>> I agree that we can get *more* improvement with better cost modeling, but
>> I'd expect to be able to get *some* improvement the way things are right
>> now.
>
> Elena said she saw "some" improvements. :)

I didn't mean "some improvements, some regressions", I meant "some of the improvement we'd expect from the full solution". :-)

>> That's why I'm curious about where we saw regressions - I'm wondering
>> whether there's really a significant cost modeling issue I'm missing, or
>> it's something that's easy to fix so that we can make forward progress,
>> while Ashutosh is working on the longer-term solution.
>
> Sounds like a task to try a few patterns and fiddle with the cost model.
>
> Arnold did a lot of those during the first months of the vectorizer,
> so it might be just a matter of finding the right heuristics, at least
> for the low hanging fruits.
>
> Of course, that'd also involve benchmarking everything else, to make
> sure the new heuristics doesn't introduce regressions on
> non-interleaved vectorisation.


I don't disagree with you.

All I'm saying is that before fiddling with the heuristics, it'd be good to understand what exactly breaks if we simply flip the flag. If the answer happens to be "nothing" - well, problem solved. Unfortunately, according to Elena, that's not the answer. 
I'm going to play with it with our internal benchmarks, but it's my understanding that Elena/Ayal already have some idea of what the problems are.

Shahid, Asghar-ahmad via llvm-dev

Aug 6, 2016, 5:13:08 AM
to Michael Kuperstein, Renato Golin, llvm-dev, Matthew Simpson

Two things have emerged from this discussion:

1. The current costing does not account for InstCombine folding the chain of “extracts” and “inserts”, so improving it will move us in a positive direction.

2. We need to know what the issues are with interleaved access enabled. That will help us understand the performance behavior of the vectorizer, and it will also help improve the vectorizer infrastructure by properly improving and including Ashutosh’s patch.

Since the vectorizer is a complex component of the compiler, making small, incremental improvements on the low-hanging fruit is a good approach, considering the testing and performance analysis involved.

Regards,
Shahid

 


Demikhovsky, Elena via llvm-dev

Aug 7, 2016, 5:09:07 PM
to Michael Kuperstein, Renato Golin, llvm-dev, Matthew Simpson


Zaks, Ayal via llvm-dev

Aug 8, 2016, 6:21:32 PM
to Demikhovsky, Elena, Michael Kuperstein, Renato Golin, llvm-dev, Matthew Simpson

> We also need to understand what to do with edge elements in the vector if their loading is not required. We, probably, should issue a masked load in this case.

The existing code solves such edge cases, where the last element of an InterleaveGroup is absent, by making sure the last iteration (and up to the last VF iterations) is peeled and executed as scalar code; see requiresScalarEpilogue.
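
A minimal sketch of why the scalar epilogue is needed, in plain C rather than vector code. It assumes a factor-2 group where only the first member is used (as in `out[i] = in[2 * i]`), so each wide load also reads one trailing element that the scalar loop never touches. The function name `strided_copy` and its return value are invented for illustration; this is not how the vectorizer itself is written.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4, FACTOR = 2 };

/* out[i] = in[2*i] for i in [0, n). `in` is assumed to hold exactly
   2*n - 1 elements, so in[2*n - 1] must never be read.
   Returns how many iterations were handled by the vector-style loop. */
size_t strided_copy(const int *in, int *out, size_t n) {
    size_t i = 0;
    /* Strict '<' (not '<=') always leaves at least one iteration for the
       scalar epilogue, so the last wide load stays inside the buffer. */
    for (; i + VF < n; i += VF) {
        int wide[VF * FACTOR];
        for (int k = 0; k < VF * FACTOR; ++k) /* wide load: in[2i .. 2i+7] */
            wide[k] = in[FACTOR * i + k];
        for (int k = 0; k < VF; ++k)          /* shuffle mask <0,2,4,6> */
            out[i + k] = wide[FACTOR * k];
    }
    size_t vector_iters = i;
    for (; i < n; ++i)                        /* scalar epilogue */
        out[i] = in[FACTOR * i];
    return vector_iters;
}
```

With `i + VF <= n` instead, the final block would start at `i = n - 4` and its wide load would touch `in[2*n - 1]`, one element past the end of the data the scalar loop reads.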

 

 

> All regressions that we see are in 32-bit mode.

One place to find them, using the default BaseT::getInterleavedMemoryOpCost(), is DENBench’s RGB conversions.

Ayal.

Michael Kuperstein via llvm-dev

Aug 9, 2016, 2:25:54 PM
to Zaks, Ayal, Matthew Simpson, llvm-dev
Thanks Ayal!

I'll take a look at DENBench.

As another data point - I tried enabling this on our internal benchmarks. I'm seeing one regression, and it seems to be a regression of the "good" kind - without interleaving we don't vectorize the innermost loop, and with interleaving we do. The vectorized loop is actually significantly faster when benchmarked in isolation, but in this specific instance, the static loop count is unknown, and the dynamic loop count happens to almost always be 1 - and this lives inside a hot outer loop.
That's something we ought to be handling through PGO (or, conceivably, outer loop vectorization :-) ).

Michael

Michael Kuperstein via llvm-dev

Aug 10, 2016, 7:32:51 PM
to Zaks, Ayal, Matthew Simpson, llvm-dev
So, unfortunately, it turns out I don't have access to DENBench.

Do you happen to have a reduced example that gets pessimized by this?

Michael Kuperstein via llvm-dev

Aug 16, 2016, 5:51:41 PM
to Zaks, Ayal, Demikhovsky, Elena, llvm-dev, Matthew Simpson
Hi Ayal, Elena,

I'd really like to enable this by default.

As I wrote above, I didn't see any regressions in internal benchmarks, and there doesn't seem to be anything in SPEC2006 either. I do see a performance improvement in an internal benchmark (that is, a real workload). 

Would you be able to provide an example that gets pessimized? I have no doubt you've seen regressions related to this, but the fact they exist doesn't help me analyze them as long as I can't see them. :-) I'd really rather look at regressions before making the change - and either try to make the necessary improvements to the cost model, or abandon this as unfeasible for now (pending Ashutosh's work). 

If you can't, an alternative is to turn this on, and then, if regressions show up on anyone's radar (where we can actually get a reproducer), turn it off again and go back to analysis. But I'd strongly prefer to "prefetch" the problem.

Thanks,
  Michael



Zaks, Ayal via llvm-dev

Aug 17, 2016, 5:15:14 PM
to Michael Kuperstein, Demikhovsky, Elena, llvm-dev, Matthew Simpson

Hi Michael,

Don’t quite have a full reproducer for you yet. You’re welcome to try and see what’s happening in 32-bit mode when enabling interleaving for the following, based on https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ:

void rgb2yik (char * in, char * out, int N)
{
  int j;
  for (j = 0; j < N; ++j) {
    unsigned char r = *in++;
    unsigned char g = *in++;
    unsigned char b = *in++;
    unsigned char y = 0.299*r + 0.587*g + 0.114*b;
    signed char i = 0.596*r + -0.274*g + -0.321*b;
    signed char q = 0.211*r + -0.523*g + 0.312*b;
    *out++ = y;
    *out++ = (unsigned char)i;
    *out++ = (unsigned char)q;
  }
}

but you’d currently need to force it to vectorize to overcome its expected cost.

Ayal.

Michael Kuperstein via llvm-dev

Aug 17, 2016, 5:56:58 PM
to Zaks, Ayal, Matthew Simpson, llvm-dev
Thanks Ayal!

Michael Kuperstein via llvm-dev

Aug 17, 2016, 8:57:54 PM
to Zaks, Ayal, Matthew Simpson, llvm-dev
So, at least for this example, it looks like we actually want to vectorize with -enable-interleaved-mem-accesses, we just need the backend to generate good code for the vector types that produces, specifically, in this case, <12 x i8>. The details are in PR29025.
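
As a rough illustration of where the awkward type comes from, assume the r, g, b accesses form an interleave group with factor 3 and the loop is vectorized at VF 4 (an inference from the <12 x i8> type, not something stated in the PR). The sketch below computes the wide vector length and the per-member shuffle masks; all names are hypothetical.

```c
#include <assert.h>

/* Number of elements in the wide vector of an interleave group. */
unsigned wide_vector_elts(unsigned vf, unsigned factor) {
    return vf * factor;
}

/* Fill mask[0..vf) with the wide-vector indices of one group member:
   member, member + factor, member + 2*factor, ... */
void strided_mask(unsigned vf, unsigned factor, unsigned member, unsigned *mask) {
    assert(member < factor);
    for (unsigned k = 0; k < vf; ++k)
        mask[k] = k * factor + member;
}

/* Nonzero if x is a power of two. */
int is_power_of_two(unsigned x) {
    return x != 0 && (x & (x - 1)) == 0;
}
```

Under those assumptions, factor 3 at VF 4 gives 12 elements with member masks <0,3,6,9>, <1,4,7,10> and <2,5,8,11>, while a factor of 4 would give a 16-element, power-of-2 vector.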

The upshot of this is that for the original program (with an outer loop around it):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.229s
user 0m2.224s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m2.590s
user 0m2.584s

This indicates that we do have a slight cost modeling issue - the cost model is not quite conservative enough in case we really do use inserts and extracts. One thing we're probably not accounting for is a bunch of GPR spills  - although I'm not sure *why* we end up spilling so much. So perhaps this should also be fixed in regalloc.

But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.257s
user 0m2.256s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe 
real 0m0.958s
user 0m0.956s
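
For clarity, the modified loop described above would look roughly like this. It is Ayal's function with the one extra store; the name `rgb2yik_padded` is made up here, and `out` is assumed to have room for 4 * N bytes instead of 3 * N.

```c
#include <assert.h>

/* Same conversion as rgb2yik, but each pixel writes 4 bytes (y, i, q, 0)
   so the output stride is a power of two. `in` holds 3 * N bytes and
   `out` must have room for 4 * N bytes. */
void rgb2yik_padded(char *in, char *out, int N)
{
  int j;
  assert(N >= 0);
  for (j = 0; j < N; ++j) {
    unsigned char r = *in++;
    unsigned char g = *in++;
    unsigned char b = *in++;
    unsigned char y = 0.299*r + 0.587*g + 0.114*b;
    signed char i = 0.596*r + -0.274*g + -0.321*b;
    signed char q = 0.211*r + -0.523*g + 0.312*b;
    *out++ = y;
    *out++ = (unsigned char)i;
    *out++ = (unsigned char)q;
    *out++ = 0;  /* padding store: the only change from the original */
  }
}
```

Only the stores are padded in this variant; the loads from `in` still have stride 3.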

Zaks, Ayal via llvm-dev

Sep 1, 2016, 7:26:35 PM
to Michael Kuperstein, Matthew Simpson, llvm-dev

So it turns out it is a full reproducer after all (choosing to vectorize on AVX), good.

> The details are in PR29025.

Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? ;-)

> But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:

Indeed, such padding is a known (programmer) optimization to effectively get power-of-2 strides and/or alignment.

> So, unfortunately, it turns out I don't have access to DENBench.

If you like, we could test your patch to see how it (mis)behaves.

Michael Kuperstein via llvm-dev

Sep 1, 2016, 7:47:56 PM
to Zaks, Ayal, Matthew Simpson, llvm-dev
Yes, carefully inserting branches is the way to go!

Seriously though - you probably saw that I just committed a fix for PR29025 (r280418).
For the reproducer you provided, we now have (without forcing vectorization, and without "padding" to have power-of-2 stride):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.290s
user 0m2.289s
sys 0m0.003s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m1.095s
user 0m1.095s
sys 0m0.002s

Care to give it a spin internally?

Note that this is not a full solution - we still won't vectorize PR27619, and force-vectorizing it is still a bad idea. Getting that right will require more lowering improvements as well as cost model adjustments. But hopefully post-r280418 things should be good enough to avoid regressions for the cases we will vectorize. 
If you still see regressions, more reproducers will be appreciated. :-)
If there are no more regressions, let me know, and I'll post a patch to enable interleaved access for x86.

Thanks,
 Michael

Zaks, Ayal via llvm-dev

Sep 4, 2016, 5:10:12 PM
to Michael Kuperstein, Matthew Simpson, llvm-dev

> Seriously though - you probably saw that I just committed a fix for PR29025 (r280418).
> Care to give it a spin internally?

Sure; spinning with r280423 and the patch below (*) indeed takes care of the slowdowns observed in 32-bit mode for AVX :-).

> If you still see regressions, more reproducers will be appreciated. :-)
> If there are no more regressions, let me know, and I'll post a patch to enable interleaved access for x86.

Unfortunately, we’re still observing severe slowdowns in 32-bit mode for SSE with -march=slm for the same RGB conversion workloads. Seems like we’ll need a different reproducer for that, as rgb2yik.c below is left unvectorized when compiled for slm.

Ayal.

 

 

(*) used the following in anticipation of your patch(?), effectively equivalent to -enable-interleaved-mem-accesses:

 

Index: lib/Target/X86/X86TargetTransformInfo.cpp
===================================================================
--- lib/Target/X86/X86TargetTransformInfo.cpp   (revision 280423)
+++ lib/Target/X86/X86TargetTransformInfo.cpp   (working copy)
@@ -41,6 +41,10 @@
   return ST->hasPOPCNT() ? TTI::PSK_FastHardware : TTI::PSK_Software;
 }
 
+bool X86TTIImpl::enableInterleavedAccessVectorization() {
+  return true;
+}
+
 unsigned X86TTIImpl::getNumberOfRegisters(bool Vector) {
   if (Vector && !ST->hasSSE1())
     return 0;
Index: lib/Target/X86/X86TargetTransformInfo.h
===================================================================
--- lib/Target/X86/X86TargetTransformInfo.h     (revision 280423)
+++ lib/Target/X86/X86TargetTransformInfo.h     (working copy)
@@ -59,6 +59,7 @@
   /// \name Vector TTI Implementations
   /// @{
 
+  bool enableInterleavedAccessVectorization();
   unsigned getNumberOfRegisters(bool Vector);
   unsigned getRegisterBitWidth(bool Vector);
   unsigned getMaxInterleaveFactor(unsigned VF);

Chandler Carruth via llvm-dev

Sep 4, 2016, 6:52:57 PM
to Zaks, Ayal, Michael Kuperstein, llvm-dev, Matthew Simpson
Ayal, we're going on a month now waiting to enable a feature because of regressions you reported without a reproducer. Please prioritize getting a reproduction, even if it isn't reduced. I don't think it is reasonable for upstream to continually delay enabling this feature when all of the test cases and reproductions we have access to are good...
