---------------------------------------------------------------------
Intel Israel (74) Limited
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
I wrote a loop with a conditional load and store and measured its performance on AVX2, where masking support is very basic compared with AVX-512.
I got a 2x speedup with vpmaskmovd.
The maskmov instruction is slower than a single vector load or store, but much faster than 8 scalar memory operations and 8 branches.
Using masked instructions on AVX-512 should gain much more: on that target a masked memory operation has no extra latency compared with a regular vector memop.
- Elena
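For readers following the thread, here is a minimal portable sketch of the kind of loop being measured. The loop body, the function names and the 8-lane block size are assumptions chosen for illustration, not the benchmark Elena ran. The second function models in plain C++ what a masked load and masked store do per 8-lane block (inactive lanes read as zero, as vpmaskmovd does on a load); it is not a performance demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;  // 8 x 32-bit lanes, i.e. one 256-bit vector

// Scalar form: one branch, one conditional load and one conditional store
// per element.
void cond_update_scalar(const std::int32_t* trigger, const std::int32_t* in,
                        std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (trigger[i] > 0) {
            out[i] = in[i] + 1;
        }
    }
}

// Blocked form: each inner loop stands in for ONE vector operation.
// Hypothetical model only; a real backend would emit a compare, a masked
// load, an add and a masked store for each block.
void cond_update_masked(const std::int32_t* trigger, const std::int32_t* in,
                        std::int32_t* out, std::size_t n) {
    assert(n % kLanes == 0 && "sketch handles whole blocks only");
    for (std::size_t base = 0; base < n; base += kLanes) {
        bool mask[kLanes];
        std::int32_t val[kLanes];
        // Vector compare produces the mask (one boolean per lane).
        for (std::size_t l = 0; l < kLanes; ++l) mask[l] = trigger[base + l] > 0;
        // Masked load: inactive lanes are not read and yield 0.
        for (std::size_t l = 0; l < kLanes; ++l) val[l] = mask[l] ? in[base + l] : 0;
        // Plain vector add on all lanes.
        for (std::size_t l = 0; l < kLanes; ++l) val[l] += 1;
        // Masked store: inactive lanes leave memory untouched.
        for (std::size_t l = 0; l < kLanes; ++l) {
            if (mask[l]) out[base + l] = val[l];
        }
    }
}
```

Both functions leave `out` in the same state; the difference is that the blocked form has no per-element branch.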
Hi,
We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics.
The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
On Oct 24, 2014, at 10:57 AM, Adam Nemet <ane...@apple.com> wrote:
On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.de...@intel.com> wrote:
> Hi,
> We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
> The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed. However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics?
My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.). I am particularly worried whether we really want to generate these for targets that don’t have vector predication support.
Adam, yes, there is more we need to consider, e.g. masked gather/scatter, masked arithmetic ops, etc. This proposal is the first step, and an important one, as a direction check with the community.
Xinmin Tian
From: llvmdev...@cs.uiuc.edu [mailto:llvmdev...@cs.uiuc.edu] On Behalf Of Adam Nemet
Sent: Friday, October 24, 2014 10:58 AM
To: Demikhovsky, Elena
Cc: d...@cray.com; llv...@cs.uiuc.edu
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
On Oct 24, 2014, at 10:57 AM, Adam Nemet <ane...@apple.com> wrote:
> On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.de...@intel.com> wrote:
>> Hi,
>> We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
I am happy to hear that you are working on this because it means that in the future we would be able to teach the SLP Vectorizer to vectorize types of <3 x float>.
>> The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
+1. I think that this is an important requirement.
> I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed. However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics.
I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.
This is, hopefully, a bit better now than it was in the past. Nevertheless, improving our handling of these things in general would be no bad thing. It has alignment, and alias metadata should just work (except perhaps for TBAA, but that should be easy to fix).
-Hal
> _______________________________________________
> LLVM Developers mailing list
> LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
> I am particularly worried whether we really want to generate these for
> targets that don’t have vector predication support.
We almost certainly don't want to do that. Clang or whatever is
generating LLVM IR will need to be aware of target vector capabilities.
Still, legalization needs to be available to handle this situation if it
arises.
> There is also the related question of vector predicating any other
> instruction beyond just loads and stores which AVX512 supports. This
> is probably a smaller gain but should probably be part of the plan as
> well.
It's not a small gain, it is a *critical* thing to do. We have
customers that always run with traps enabled, and without masking that
severely limits what code can be vectorized.
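To make the trapping concern concrete, here is a small sketch of a masked arithmetic operation. It assumes a source loop of the form `if (b[i] != 0) c[i] = a[i] / b[i];` and uses integer division as a stand-in, because it faults on a zero divisor without any FP-exception setup; the thread itself is about FP operations with traps enabled. The names `masked_div` and `guarded_div` and the 4-lane width are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;
using Vec = std::array<std::int32_t, kLanes>;
using Mask = std::array<bool, kLanes>;

// Masked division: the operation is performed only on active lanes.
// Inactive lanes keep the pass-through value, so a zero divisor in an
// inactive lane is never divided by.
Vec masked_div(const Vec& a, const Vec& b, const Mask& mask, const Vec& passthru) {
    Vec r = passthru;
    for (std::size_t l = 0; l < kLanes; ++l) {
        if (mask[l]) {
            assert(b[l] != 0 && "active lanes must have a non-zero divisor");
            r[l] = a[l] / b[l];
        }
    }
    return r;
}

// Vectorized form of the guarded loop body: the loop condition becomes the
// mask, and the old contents of c are the pass-through value.
Vec guarded_div(const Vec& a, const Vec& b, const Vec& c_old) {
    Mask m{};
    for (std::size_t l = 0; l < kLanes; ++l) m[l] = (b[l] != 0);
    return masked_div(a, b, m, c_old);
}
```

An unmasked vector divide over the same inputs would execute the division in every lane, including those the original loop skipped, which is exactly the case that traps.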
On Oct 24, 2014, at 11:38 AM, Nadav Rotem <nro...@apple.com> wrote:
I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.
I think you're confusing the IR types with the backend types. At the IR level, the masks are <n x i1> (one boolean per vector lane). The backend may represent this with a different type, but that's true of comparison results generally (they're often represented with different types in the backend), and we already deal with that. The pointer type is irrelevant: we'll just cast to it from whatever the real pointer type is.
-Hal
>
>
> Also, given that the types of the vectors matter, it seems like we’re
> going to need TTI anyway whenever we want to generate one of these,
> or else we’ll end up generating an illegal version which has to be
> scalarised in the backend.
>
>
> Thanks,
> Pete
>
_______________________________________________
On Oct 24, 2014, at 1:49 PM, Hal Finkel <hfi...@anl.gov> wrote:
----- Original Message -----
From: "Pete Cooper" <peter_...@apple.com>
To: "Nadav Rotem" <nro...@apple.com>
Cc: d...@cray.com, llv...@cs.uiuc.edu
Sent: Friday, October 24, 2014 3:40:10 PM
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
On Oct 24, 2014, at 11:38 AM, Nadav Rotem < nro...@apple.com > wrote:
> I agree with the approach of adding target-independent masked memory
> intrinsics. One reason is that I would like to keep the vectorizers
> target independent (and use the target transform info to query the
> backends). I oppose adding new first-level instructions because we
> would need to teach all of the existing optimizations about the new
> instructions, and considering the limited usefulness of masked
> operations it is not worth the effort.
I agree with this. They should be target independent.
However, what types should be supported here? I haven’t looked in
detail, but from memory I believe AVX-512 masks 32-bit values, and
not bytes. Are we going to have an intrinsic which can handle any
vector type, or just <n x 32-bit> vectors, even at first?
I think you're confusing the IR types with the backend types. At the IR level, the masks are <n x i1> (one boolean per vector lane). The backend may represent this with a different type, but that's true of comparison results generally (they're often represented with different types in the backend), and we already deal with that. The pointer type is irrelevant: we'll just cast to it from whatever the real pointer type is.
You're correct, we'd use TTI to avoid creating illegal cases. If we do end up with an illegal case, it will need to be scalarized (which should always be possible). Syntactically, all basic vector types should be allowed.
-Hal
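The decision described above (ask the target first, fall back to scalarization for illegal cases) might be sketched as follows. `TargetCaps`, `Lowering` and `choose_lowering` are hypothetical names invented for this note; they are not LLVM's TargetTransformInfo interface, and the capability values in the usage note are illustrative only.

```cpp
#include <cassert>

// Hypothetical description of what a target can do with masked memory ops.
struct TargetCaps {
    bool has_masked_mem;     // does the target have masked vector loads/stores?
    unsigned min_elem_bits;  // narrowest element width those instructions handle
};

enum class Lowering { MaskedVectorOp, Scalarize };

// Every element width is accepted syntactically; the target query only
// decides whether the masked operation is emitted as a vector instruction
// or has to be expanded into per-lane scalar code.
Lowering choose_lowering(const TargetCaps& target, unsigned elem_bits) {
    assert(elem_bits > 0 && "element width must be non-zero");
    const bool legal = target.has_masked_mem && elem_bits >= target.min_elem_bits;
    return legal ? Lowering::MaskedVectorOp : Lowering::Scalarize;
}
```

For example, with a target described as `TargetCaps{true, 32}`, a masked access on 32-bit elements maps to `MaskedVectorOp`, while the same access on 8-bit elements maps to `Scalarize`. In both cases the IR-level mask is still one boolean per lane.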
> However, what types should be supported here? I haven’t looked in
> detail, but from memory I believe AVX-512 masks 32-bit values, and not
> bytes. Are we going to have an intrinsic which can handle any vector
> type, or just <n x 32-bit> vectors, even at first?
Eventually we should support at least f/i 8, 16, 32 and 64. We can
start with f/i 32, 64 for now I think.
> Also, given that the types of the vectors matter, it seems like we’re
> going to need TTI anyway whenever we want to generate one of these, or
> else we’ll end up generating an illegal version which has to be
> scalarised in the backend.
Yep.
We may get less optimal code on other targets as a result. The user may want to optimize the sequence of scalar instructions in cases where vectorization did not succeed.
- Elena
From: shahid shahid [mailto:shah...@yahoo.com]
Sent: Saturday, October 25, 2014 17:53
To: Demikhovsky, Elena; llv...@cs.uiuc.edu
Cc: d...@cray.com
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
Hi Elena,
We are just following the common recommendation to start with intrinsics:
http://llvm.org/docs/ExtendingLLVM.html
- Elena
From: Owen Anderson [mailto:resi...@mac.com]
Sent: Sunday, October 26, 2014 23:57
To: Demikhovsky, Elena
Cc: llv...@cs.uiuc.edu; d...@cray.com
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
What is the motivation for using intrinsics versus adding new instructions?
Hi Elena,
Can you please elaborate on the list. I don’t see how 20 intrinsics would cover “All FP”. But do you really have to do all FP or only instructions that can trap with LLVM (e.g. division by zero)?
I do agree that we want to go step by step, but we also need to see the end goal to make sure the design will scale.
Thanks,
Adam
> Can you please elaborate on the list. I don’t see how 20 intrinsics
> would cover “All FP”. But do you really have to do all FP or only
> instructions that can trap with LLVM (e.g. division by zero)?
We need intrinsics for all the FP operations. Any operand that is a
signaling NaN or even a quiet NaN for some operations will trap. LLVM
needs masking to protect itself from that when vectorizing certain kinds
of loops.
-David
- Elena
-----Original Message-----
From: d...@cray.com [mailto:d...@cray.com]
Sent: Monday, October 27, 2014 19:39
To: Adam Nemet
Cc: Demikhovsky, Elena; Hal Finkel; llv...@cs.uiuc.edu
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
Adam Nemet <ane...@apple.com> writes:
> Can you please elaborate on the list. I don’t see how 20 intrinsics
> would cover “All FP”. But do you really have to do all FP or only
> instructions that can trap with LLVM (e.g. division by zero)?
We need intrinsics for all the FP operations. Any operand that is a signaling NaN or even a quiet NaN for some operations will trap. LLVM needs masking to protect itself from that when vectorizing certain kinds of loops.
-David
_______________________________________________
Many overloaded intrinsics may be replaced with instructions - fabs or fma or sqrt.
Chandler will probably explain the criteria. What is the difference between fma and fadd? Or between fptrunc and fabs?
A new instruction like
%a = loadm <4 x i32>* %addr, <4 x i32> %passthru, i32 4, <4 x i1> %mask
is possible, but may not be very useful for most targets.
So we are starting with intrinsics.
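As a reading aid, the per-lane behaviour such an operation would have can be written out in C++. This assumes the operands are, in order, the address, a pass-through value, the alignment and the mask, and that inactive lanes take the pass-through value; the thread does not spell this out, so treat it as one plausible interpretation. The function names are hypothetical, and the alignment operand is omitted because it does not affect the result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Masked load: active lanes read memory, inactive lanes take the
// pass-through value and their memory is never touched.
template <typename T, std::size_t N>
std::array<T, N> masked_load(const T* addr, const std::array<T, N>& passthru,
                             const std::array<bool, N>& mask) {
    assert(addr != nullptr);
    std::array<T, N> result = passthru;
    for (std::size_t lane = 0; lane < N; ++lane) {
        if (mask[lane]) result[lane] = addr[lane];
    }
    return result;
}

// Masked store: only active lanes are written; the rest of memory is
// left unchanged.
template <typename T, std::size_t N>
void masked_store(T* addr, const std::array<T, N>& value,
                  const std::array<bool, N>& mask) {
    assert(addr != nullptr);
    for (std::size_t lane = 0; lane < N; ++lane) {
        if (mask[lane]) addr[lane] = value[lane];
    }
}
```

Written this way, the loops are also what a scalarized expansion would compute on a target without masked memory instructions.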
I think that starting with the intrinsics, for now, will be the right path while we figure out exactly what the design space is and use cases are. For the moment if these are primarily generated by the loop vectorizer, it should not be a big problem. Obviously when adding new instructions, there are a lot of switch statements to update ;)
>
>
> There’s a big qualitative difference between fabs and these masked
> operations, mostly because of the degree of type overloading you
> intend to support. I am very concerned that the IR that will contain
> these constructs will be dramatically harder to read because of it.
I think this ties back to the other thread on intrinsics name mangling (and the lack of the need for it). I think that, at least, Elena, Philip and I agree that, generally speaking, we'd like to clean this up, but that we should do this as a separate change independent of this. The memcpy.whatever are not easy to read either ;) -- and I agree that this could make things worse in that regard.
-Hal