[llvm-dev] how to force llvm generate gather intrinsic


zhi chen via llvm-dev

Jan 22, 2016, 7:00:36 PM
to LLVM Developers Mailing List
Hi,

I used clang -O3 -c -emit-llvm on the following code to generate a bitcode, say a.bc. I read the .ll file and didn't see any gather intrinsic. I also used opt -O3 -mcpu=core-avx2/-mcpu=skx, but there is still no gather intrinsic generated.

int foo(int A[800], int B[800], int C[800]) {
    for (int i = 0; i < 800; i++) {
        A[B[i]] = i + 5;
    }

    for (int i = 0; i < 800; i++) {
        A[B[i]]++;
    }

    for (int i = 0; i < 800; i++) {
        A[i] = B[C[i]];
    }
    return 0;
}

Could someone give me an example that will generate a gather intrinsic for AVX2? I tried to use the masked_gather intrinsic provided in the language ref, but it seemed that it only generates a gather intrinsic for AVX-512, not for AVX2. I found that there are 16 gather intrinsic versions for AVX2, depending on the data types. Do I have to check the data type before calling them specifically, or is there a generic way to use the AVX2 gather intrinsics?

Best,
Zhi 

Sanjay Patel via llvm-dev

Jan 22, 2016, 7:54:19 PM
to zhi chen, Demikhovsky, Elena, LLVM Developers Mailing List
I was just looking at the related masked load/store operations, and I think there are at least 2 bugs:

1. X86TTIImpl::isLegalMaskedLoad/Store() should be legal for FP types with AVX1 (not just AVX2).
2. X86TTIImpl::isLegalMaskedGather/Scatter() should be legal for 128/256 bit vectors with AVX2 (not just AVX512).

I looked at this for the first time today, so I may be missing something...

So for the moment, the answer to your question is 'no'; there's no generic way to produce these instructions. You should be able to use the _mm_* intrinsics in C though.
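For example, something like this (an untested sketch; the function name and loop are just for illustration, compile with -mavx2) should give you vpgatherdd through the AVX2 gather intrinsic, and in the .ll output it should show up as a call to one of the target-specific @llvm.x86.avx2.gather.* intrinsics:

#include <immintrin.h>

/* Gather 8 ints from A at the indices in B[i..i+7].
   _mm256_i32gather_epi32 takes (base, 32-bit index vector, scale);
   the scale is in bytes and must be a compile-time constant. */
void gather_example(const int *A, const int *B, int *out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256i idx = _mm256_loadu_si256((const __m256i *)&B[i]);
        __m256i val = _mm256_i32gather_epi32(A, idx, 4);
        _mm256_storeu_si256((__m256i *)&out[i], val);
    }
}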






zhi chen via llvm-dev

Jan 22, 2016, 7:58:21 PM
to Sanjay Patel, LLVM Developers Mailing List
Thanks for your response, Sanjay. I know there are intrinsics available in C/C++. But the problem is that I want to instrument my code at the IR level and generate those instructions. I don't want to touch the source code. 

Best,
Zhi

Demikhovsky, Elena via llvm-dev

Jan 23, 2016, 3:01:48 AM
to zhi chen, Sanjay Patel, LLVM Developers Mailing List

1) I did not switch on masked load/store for AVX1; I can do this.

2) I did not switch on masked gather for AVX2 because the instruction is slow. There is no scatter on AVX2.

3) Currently, gather/scatter does not work on SKX because the patch is still under review: http://reviews.llvm.org/D15690. I'd be happy if you agree to review it.

 

- Elena


zhi chen via llvm-dev

Jan 23, 2016, 3:21:13 AM
to Demikhovsky, Elena, LLVM Developers Mailing List
I don't need scatter, only gather, on AVX2, and performance is not the biggest concern. Could you please suggest how to switch masked gather on?

Best,
Zhi

Demikhovsky, Elena via llvm-dev

Jan 23, 2016, 7:56:28 AM
to zhi chen, LLVM Developers Mailing List

It is not supported at the codegen level; you can't just switch it on.

But you can use compiler intrinsics instead. See the examples in test/CodeGen/X86/avx2-intrinsics-x86.ll.

Nema, Ashutosh via llvm-dev

Jan 23, 2016, 8:45:44 AM
to Demikhovsky, Elena, zhi chen, Sanjay Patel, llvm-dev

Thanks Sanjay for highlighting this. A few days back I also faced a similar problem while generating a masked store in AVX1 mode; I found it is only supported under AVX2, else we scalarize it.

> 1) I did not switch on masked load/store for AVX1; I can do this.

Yes Elena, this should be supported for FP types in AVX1 mode (for INT types, I doubt X86 has a masked load/store instruction in AVX1 mode).

 

Thanks,

Ashutosh

Sanjay Patel via llvm-dev

Jan 23, 2016, 11:42:00 AM
to Nema, Ashutosh, llvm-dev, zhi chen
Thanks everyone for the answers. My immediate motivation is to improve the masked load/store ops for an AVX target. If we can fix scatter/gather similarly, that would be great.

Can we legalize the same set of masked load/store operations for AVX1 as AVX2? If I'm understanding them correctly, the AVX1 FP instructions (vmaskmovps/pd) can be used in place of the AVX2 int instructions (vpmaskmovd/q), just with domain crossing penalties thrown in. I think we do this for other missing integer ops for an AVX1 target either in x86 lowering or in the tablegen patterns.
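At the source level that difference is just the FP vs. integer flavor of the intrinsic; roughly something like this (untested sketch; the function names are made up, and the casts are only there to model the domain crossing for integer data on an AVX1-only target):

#include <immintrin.h>

/* AVX1 has no vpmaskmovd, but vmaskmovps moves the same 32-bit lanes;
   the mask uses the sign bit of each 32-bit element. */
__m256i masked_load_epi32_avx1(const int *p, __m256i mask) {
    __m256 v = _mm256_maskload_ps((const float *)p, mask);
    return _mm256_castps_si256(v);
}

void masked_store_epi32_avx1(int *p, __m256i mask, __m256i v) {
    _mm256_maskstore_ps((float *)p, mask, _mm256_castsi256_ps(v));
}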

 Elena - I'm not too familiar with the vectorizers or scatter/gather, but I'll certainly take a look at D15690. Thanks for pointing out the patch!


zhi chen via llvm-dev

Feb 24, 2016, 6:20:33 PM
to Demikhovsky, Elena, llvm-dev
Hi Elena,

Are the masked load and gather intrinsics working now?

Best,
Zhi

Demikhovsky, Elena via llvm-dev

Feb 25, 2016, 1:39:09 AM
to zhi chen, llvm-dev

Yes, masked load/store/gather/scatter are completed.

 

- Elena

 


Sanjay Patel via llvm-dev

Feb 25, 2016, 11:28:06 AM
to Demikhovsky, Elena, llvm-dev, zhi chen
I don't think gather has been enabled for AVX2 as of r261875.
Masked load/store were enabled for AVX with:
http://reviews.llvm.org/D16528 / http://reviews.llvm.org/rL258675

zhi chen via llvm-dev

Feb 25, 2016, 1:48:08 PM
to Sanjay Patel, llvm-dev
It seems that http://reviews.llvm.org/D15690 only implemented gather/scatter for AVX-512, but not for AVX/AVX2. Is there any plan to enable gather for AVX2? Thanks.

Best,
Zhi

Demikhovsky, Elena via llvm-dev

Feb 26, 2016, 2:24:02 PM
to zhi chen, Sanjay Patel, llvm-dev

No. The gather operation is slow on AVX2 processors.

 

- Elena

 


Sanjay Patel via llvm-dev

Feb 26, 2016, 3:49:30 PM
to Demikhovsky, Elena, llvm-dev, zhi chen
If I'm understanding correctly, you're saying that vgather* is slow on all of Excavator, Haswell, Broadwell, and Skylake (client). Therefore, we will not generate it for any of those machines.

Even if that's true, we should not define "gatherIsSlow()" as "hasAVX2() && !hasAVX512()". It could break for some hypothetical future processor that manages to implement it properly. The AVX2 spec includes gather; whether it's slow or fast is an implementation detail. We need a feature bit / cost model entry somewhere to signify this, so we're not overloading the meaning of the architectural features with that implementation detail.
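I'm thinking of something along these lines (the names are made up for illustration, not actual LLVM code), where the decision keys off a tuning bit rather than off hasAVX2()/hasAVX512():

#include <stdbool.h>

/* Illustrative only, not real LLVM code: gather must be architecturally
   available, but whether to emit it is a per-CPU tuning decision. */
struct SubtargetFeatures {
    bool HasAVX2;
    bool HasAVX512;
    bool FastGather;    /* implementation detail of a particular CPU */
};

bool isProfitableToUseGather(const struct SubtargetFeatures *ST) {
    if (!ST->HasAVX2 && !ST->HasAVX512)
        return false;    /* no gather instruction at all */
    return ST->FastGather;
}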

zhi chen via llvm-dev

Feb 26, 2016, 4:46:37 PM
to Sanjay Patel, llvm-dev
That makes sense. It would be great if we had a profitability model to decide when gathers are worth using. It would also be good to have a compiler option that lets users force LLVM to generate gather instructions regardless of whether they are fast or slow.

Best,
Zhi

Demikhovsky, Elena via llvm-dev

Feb 28, 2016, 3:29:37 AM
to zhi chen, Sanjay Patel, llvm-dev