[LLVMdev] Adding masked vector load and store intrinsics


Demikhovsky, Elena

Oct 24, 2014, 7:27:36 AM
to llv...@cs.uiuc.edu, d...@cray.com
Hi,
 
We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
 
The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed.
 
  call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask)
 
  %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask)
 
where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef).
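
As a rough illustration of the semantics described above, here is a small Python model. It is a sketch, not LLVM code: the function names `masked_load` and `masked_store` merely mirror the proposed intrinsics, lanes are addressed by element index rather than by byte, the alignment operand is omitted, and a missing dictionary key stands in for an unmapped address. It is also roughly what a scalarized expansion would have to do on a target without native support.

```python
# Illustrative model only: lanes are addressed by element index, and a
# missing dictionary key stands in for an unmapped address.

def masked_load(memory, addr, passthru, mask):
    """Read enabled lanes from memory; take disabled lanes from passthru."""
    result = []
    for lane, enabled in enumerate(mask):
        if enabled:
            result.append(memory[addr + lane])  # KeyError models a fault
        else:
            result.append(passthru[lane])       # memory is not accessed
    return result


def masked_store(memory, addr, data, mask):
    """Write only the enabled lanes of data to memory."""
    for lane, enabled in enumerate(mask):
        if enabled:
            memory[addr + lane] = data[lane]


memory = {100: 10, 101: 11, 102: 12}  # element 103 is deliberately unmapped
loaded = masked_load(memory, 100, [0, 0, 0, 0], [True, True, True, False])
masked_store(memory, 100, [7, 8, 9, 99], [False, True, False, False])
all_off = masked_load({}, 0, [1, 2], [False, False])  # touches no memory at all
```

In this model the load never looks at element 103 because its lane is masked off, and a fully masked-off load succeeds even against an empty memory.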
 
Comments so far, before we dive into more details?
 
Thank you.
 
- Elena and Ayal
 
 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Hal Finkel

Oct 24, 2014, 8:52:17 AM
to Elena Demikhovsky, d...@cray.com, llv...@cs.uiuc.edu
For the stores, I think this is a reasonable idea. The alternative is to represent them in scalar form with a lot of control flow, and I think that expecting the backend to properly pattern match that after isel is not realistic.

For the loads, I'm much less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? It is true that the load might get separated from the select so that isel might not see it (because isel is basic-block local), but we can add some code in CodeGenPrep to fix that for targets on which it is useful to do so (which is a more-general solution than the intrinsic anyhow). What do you think?

Thanks again,
Hal


--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Demikhovsky, Elena

Oct 24, 2014, 9:11:51 AM
to Hal Finkel, d...@cray.com, llv...@cs.uiuc.edu
> For the loads, I'm much less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? It is true that the load might get separated from the select so that isel might not see it (because isel is basic-block local), but we can add some code in CodeGenPrep to fix that for targets on which it is useful to do so (which is a more-general solution than the intrinsic anyhow). What do you think?

We generate the masked vector intrinsic in an IR-to-IR pass, which is far from instruction selection. With the select form we would need to guarantee that no subsequent IR-to-IR pass breaks the sequence, and that only for one or two specific targets. We would then have to preserve the logic in the type legalizer, which may split or extend operations, and take care of it again in DAG combine.
In my opinion, this is just unsafe.

- Elena

Das, Dibyendu

Oct 24, 2014, 9:23:03 AM
to elena.de...@intel.com, llv...@cs.uiuc.edu, d...@cray.com
This looks to be a reasonable proposal. However, might native instructions that support such masked ld/st have a high latency? Also, it would be good to state some workloads where this will have a positive impact.

-dibyendu

 
 

Demikhovsky, Elena

Oct 24, 2014, 9:40:47 AM
to Das, Dibyendu, llv...@cs.uiuc.edu, d...@cray.com

I wrote a loop with conditional load and store and measured performance on AVX2, where masking support is very basic relative to AVX-512.

I got 2x speedup with vpmaskmovd.

 

The maskmov instruction is slower than one vector load or store, but much faster than 8 scalar memory operations and 8 branches.
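
To make the comparison concrete, here is a hypothetical loop of that general shape, modeled in Python. The thread does not show the measured loop, so the arrays `trigger`, `a` and `out` and the loop body are assumptions for illustration only. The scalar form takes one branch and up to two memory operations per element; the masked form handles 8 lanes per iteration with one masked load and one masked store.

```python
VF = 8  # lanes per vector, e.g. <8 x i32> on AVX2


def scalar_loop(trigger, a, out):
    """One branch per element guarding a conditional load and store."""
    for i in range(len(out)):
        if trigger[i] > 0:
            out[i] = a[i] + 1


def masked_vector_loop(trigger, a, out):
    """Same computation, VF lanes at a time, with the branch turned into a mask."""
    # Assumes len(out) is a multiple of VF to keep the sketch short.
    for base in range(0, len(out), VF):
        mask = [trigger[base + lane] > 0 for lane in range(VF)]
        # Masked load with a zero passthru: disabled lanes never read a[].
        loaded = [a[base + lane] if mask[lane] else 0 for lane in range(VF)]
        result = [x + 1 for x in loaded]
        # Masked store: disabled lanes leave out[] untouched.
        for lane in range(VF):
            if mask[lane]:
                out[base + lane] = result[lane]


trigger = [1, 0, -3, 4, 0, 0, 7, 1, 0, 2, 0, 0, 5, -1, 1, 0]
a = list(range(16))
expected = [-1] * 16
actual = [-1] * 16
scalar_loop(trigger, a, expected)
masked_vector_loop(trigger, a, actual)
```

Both versions leave the same contents in their output arrays; the model says nothing about speed, which depends on the target.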

 

Using masked instructions on AVX-512 will give much more. On that target there is no added latency compared with a regular vector memop.

 

- Elena

Hal Finkel

Oct 24, 2014, 9:47:12 AM
to Elena Demikhovsky, d...@cray.com, llv...@cs.uiuc.edu
----- Original Message -----
> From: "Elena Demikhovsky" <elena.de...@intel.com>
> To: "Hal Finkel" <hfi...@anl.gov>
> Cc: d...@cray.com, llv...@cs.uiuc.edu, "Ayal Zaks" <ayal...@intel.com>
> Sent: Friday, October 24, 2014 8:07:18 AM
> Subject: RE: [LLVMdev] Adding masked vector load and store intrinsics
>
> > For the loads, I'm much less sure. Why can't we represent the loads
> > as select(mask, load(addr), passthru)? It is true, that the load
> > might get separated from the select so that isel might not see it
> > (because isel is basic-block local), but we can add some code in
> > CodeGenPrep to fix that for targets on which it is useful to do so
> > (which is a more-general solution than the intrinsic anyhow). What
> > do you think?
>
> We generate the vector-masked-intrinsic on IR-to-IR pass. It is too
> far from instruction selection. We'll need to guarantee that all
> subsequent IR-to-IR passes will not break the sequence.

I'm fully aware of this issue. This needs to be weighed against the cost of updating all other optimizations that operate on loads to also understand this intrinsic.

> And only for
> one or two specific targets.

Regardless, they're certainly targets many users care about ;)

> Then we'll keep the logic in type
> legalizer, which may split or extend operations. Then we are taking
> care in DAG-combine.
> In my opinion, this is just unsafe.

If this were really a question of safety, I'd agree. And if we were talking about gather loads, I'd agree. For regular vector loads, I don't see this as a safety issue. We should outline what the downside of emitting a regular load would actually be should some optimization be done to the select. Can you please elaborate on this?

Thanks again,
Hal

Hal Finkel

Oct 24, 2014, 10:31:39 AM
to Elena Demikhovsky, d...@cray.com, llv...@cs.uiuc.edu
Nevermind ;) -- I changed my mind, the safety issue is with non-aligned loads that might cross page boundaries. Is that right? If so, I think this proposal is good (although obviously the docs need to make clear what the faulting behavior of these intrinsics is).

Thanks again,
Hal

Zaks, Ayal

Oct 24, 2014, 10:49:09 AM
to Hal Finkel, Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu
> Why can't we represent the loads as select(mask, load(addr), passthru)?

This suggests masked-off lanes are free to speculatively load from memory, whereas the proposed semantics is that:

> The addressed memory will not be touched for masked-off lanes. In
> particular, if all lanes are masked off no address will be accessed.

Ayal.


Hal Finkel

Oct 24, 2014, 11:32:24 AM
to Ayal Zaks, d...@cray.com, llv...@cs.uiuc.edu
----- Original Message -----
> From: "Ayal Zaks" <ayal...@intel.com>
> To: "Hal Finkel" <hfi...@anl.gov>, "Elena Demikhovsky" <elena.de...@intel.com>
> Cc: d...@cray.com, llv...@cs.uiuc.edu
> Sent: Friday, October 24, 2014 9:46:01 AM
> Subject: RE: [LLVMdev] Adding masked vector load and store intrinsics
>
> > Why can't we represent the loads as select(mask, load(addr),
> > passthru)?
>
> This suggests masked-off lanes are free to speculatively load from
> memory. Whereas proposed semantics is that:
>
> > The addressed memory will not be touched for masked-off lanes. In
> > particular, if all lanes are masked off no address will be
> > accessed.

Agreed -- as I said in an e-mail that you probably did not see before you wrote this ;) -- but we should make sure to explicitly state this in the rationale. "touched" is not really the right term here. The underlying issue is that it allows us to deal with unaligned loads that cross page boundaries - i.e. that a masked-off load is safe to speculate.

On a related note, I presume that the 'i32 4' in the provided example is the alignment. Is that correct?

Thanks again,
Hal

Demikhovsky, Elena

Oct 24, 2014, 11:41:23 AM
to Hal Finkel, Zaks, Ayal, d...@cray.com, llv...@cs.uiuc.edu
> On a related note, I presume that the 'i32 4' in the provided example is the alignment. Is that correct?
yes.

- Elena



d...@cray.com

Oct 24, 2014, 12:56:25 PM
to Hal Finkel, llv...@cs.uiuc.edu
Hal Finkel <hfi...@anl.gov> writes:

> For the loads, I'm much less sure. Why can't we represent the loads as
> select(mask, load(addr), passthru)?

Because that does not specify the correct semantics. This formulation
expects the load to happen before the mask is applied. The load could
trap. The operation needs to be presented as an atomic unit.
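
A small Python model may help show the difference being described. It is only a sketch under stated assumptions: `select_of_load` and `masked_load` are illustrative names, lanes are addressed by element index, and a missing dictionary key (raising `KeyError`) stands in for a load that traps.

```python
def select_of_load(memory, addr, passthru, mask):
    """select(mask, load(addr), passthru): the full-width load runs first."""
    loaded = [memory[addr + lane] for lane in range(len(mask))]
    return [value if enabled else fill
            for value, enabled, fill in zip(loaded, mask, passthru)]


def masked_load(memory, addr, passthru, mask):
    """Masked load: the mask is consulted before each lane is read."""
    return [memory[addr + lane] if enabled else passthru[lane]
            for lane, enabled in enumerate(mask)]


def run(load_fn):
    memory = {0: 5, 1: 6}  # elements 2 and 3 are unmapped
    try:
        return load_fn(memory, 0, [-1] * 4, [True, True, False, False])
    except KeyError:
        return "fault"


select_result = run(select_of_load)
masked_result = run(masked_load)
```

With the same mask and memory, the select formulation faults on the unmapped elements before the mask is ever applied, while the masked form returns the two valid elements plus the fill values.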

The same problem exists with any potentially trapping instruction
(e.g. all floating point computations). The need for intrinsics goes
way beyond loads and stores.

-David

d...@cray.com

Oct 24, 2014, 12:58:21 PM
to Hal Finkel, Elena Demikhovsky, llv...@cs.uiuc.edu, Ayal Zaks
Hal Finkel <hfi...@anl.gov> writes:

> I'm fully aware of this issue. This needs to be weighed against the
> cost of updating all other optimizations that operate on loads to also
> understand this intrinsic.

In my experience, LLVM's behavior of treating unknown intrinsics
conservatively works just fine.

> If this were really a question of safety, I'd agree. And if we were
> talking about gather loads, I'd agree. For regular vector loads, I
> don't see this as a safety issue.

It absolutely is a safety issue. Not only could loop control flow cause
some vector elements to be skipped that would otherwise trap if loaded,
there are some vector optimizations that assume masking behavior will
handle overindexing and other such problems.

Masking is an extremely powerful concept and the sooner LLVM understands
it, the better.

-David

d...@cray.com

Oct 24, 2014, 1:05:46 PM
to Hal Finkel, llv...@cs.uiuc.edu
Hal Finkel <hfi...@anl.gov> writes:

>> If this were really a question of safety, I'd agree. And if we were
>> talking about gather loads, I'd agree. For regular vector loads, I
>> don't see this as a safety issue. We should outline what the
>> downside of emitting a regular load would actually be should some
>> optimization be done to the select. Can you please elaborate on
>> this?
>
> Nevermind ;) -- I changed my mind, the safety issue is with
> non-aligned loads that might cross page boundaries. Is that right?

That's just one safety issue. There are others.

> If so, I think this proposal is good (although obviously the docs need
> to make clear what the faulting behavior of these intrinsics is).

The behavior should be not to ever fault on an element whose mask bit is
false, and behave as a regular load (wrt trapping) for any element whose
mask bit is true.

-David

Hal Finkel

Oct 24, 2014, 1:07:13 PM
to d...@cray.com, llv...@cs.uiuc.edu
----- Original Message -----
> From: d...@cray.com
> To: "Hal Finkel" <hfi...@anl.gov>
> Cc: "Elena Demikhovsky" <elena.de...@intel.com>, llv...@cs.uiuc.edu
> Sent: Friday, October 24, 2014 11:56:14 AM
> Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
>
> Hal Finkel <hfi...@anl.gov> writes:
>
> >> If this were really a question of safety, I'd agree. And if we
> >> were talking about gather loads, I'd agree. For regular vector
> >> loads, I don't see this as a safety issue. We should outline what
> >> the downside of emitting a regular load would actually be should
> >> some optimization be done to the select. Can you please elaborate
> >> on this?
> >
> > Nevermind ;) -- I changed my mind, the safety issue is with
> > non-aligned loads that might cross page boundaries. Is that right?
>
> That's just one safety issue. There are others.

Can you be more specific? You mentioned overindexing in your other e-mail, exactly what do you mean by that?

Thanks again,
Hal


Stephen Canon

Oct 24, 2014, 1:34:58 PM
to Hal Finkel, David Greene, llv...@cs.uiuc.edu
One can at least imagine using a masked load to access device memory which might have access granularity smaller than the vector size (this seems like a *terrible* idea to me, but at least I can conceive of cases where the semantics would matter beyond just page-crossing loads).

That said, page-crossing loads are a good-enough reason to support this on their own.
– Steve

d...@cray.com

Oct 24, 2014, 1:36:09 PM
to Demikhovsky, Elena, llv...@cs.uiuc.edu
"Demikhovsky, Elena" <elena.de...@intel.com> writes:

> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32>
> %passthru, i32 4, <8 x i1> %mask)
> where %passthru is used to fill the elements of %data that are
> masked-off (if any; can be zeroinitializer or undef).

So %passthrough can *only* be undef or zeroinitializer? If that's the
case it might make more sense to have two intrinsics, one that fills
with undef and one that fills with zero. Using a general vector operand
with a restriction on valid values seems odd and potentially misleading.

Another option is to always fill with undef and require a select on top
of the load to fill with zero. The load + select would be easily
matchable to a target instruction.

I'm trying to think beyond just AVX-512 to what other future
architectures might want. It's not a given that future architectures
will fill with zero *or* undef though those are the two most likely fill
values.

-David

d...@cray.com

Oct 24, 2014, 1:37:20 PM
to Das, Dibyendu, llv...@cs.uiuc.edu
"Das, Dibyendu" <Dibyen...@amd.com> writes:

> This looks to be a reasonable proposal. However native instructions
> that support such masked ld/st may have a high latency ? Also, it
> would be good to state some workloads where this will have a positive
> impact.

Any significant vector workload will see a giant gain from this.

The masked operations really shouldn't have any more latency. The time
of the memory operation itself dominates.

d...@cray.com

Oct 24, 2014, 1:58:33 PM
to Hal Finkel, llv...@cs.uiuc.edu
Hal Finkel <hfi...@anl.gov> writes:

>> > Nevermind ;) -- I changed my mind, the safety issue is with
>> > non-aligned loads that might cross page boundaries. Is that right?
>>
>> That's just one safety issue. There are others.
>
> Can you be more specific? You mentioned overindexing in your other
> e-mail, exactly what do you mean by that?

Accessing past the end of an array. Some vector optimizations do that
and assume the masking will prevent traps. Aggressive vectorizers can
do all kinds of "unsafe" transformations that are safe in the presence
of masks.

Any time there is control flow in the loop protecting a dereference of a
NULL pointer, a mask is needed and it needs to be applied at the time of
the load, not at the time of the write to the loaded-to register.
That's why select doesn't work. This same issue extends to any trap
situation like a divide-by-zero or use of a NaN. It's not only the
write to the register that needs protection, it's the operation itself.
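
One common instance of the overindexing case is the last, partial vector iteration of a loop. The sketch below is a hypothetical Python model (the function `add_one` and the width `VF = 4` are assumptions for illustration): the final iteration covers lanes past the end of the array, and only the mask keeps those lanes from being read or written. Python lists raise `IndexError` on an out-of-range read, which stands in for a trap here.

```python
VF = 4  # lanes per vector in this model


def add_one(a):
    """Add 1 to every element, VF lanes at a time, with a masked tail."""
    out = [0] * len(a)
    for base in range(0, len(a), VF):
        # Lanes at or beyond len(a) are masked off in the last iteration.
        mask = [base + lane < len(a) for lane in range(VF)]
        # Masked load: a[] is only indexed for enabled lanes.
        loaded = [a[base + lane] if mask[lane] else 0 for lane in range(VF)]
        # Masked store: only enabled lanes are written.
        for lane in range(VF):
            if mask[lane]:
                out[base + lane] = loaded[lane] + 1
    return out


result = add_one([1, 2, 3, 4, 5, 6])  # second iteration has 2 live lanes
empty = add_one([])
```

If the mask were applied after a full-width load instead, the second iteration would index elements 6 and 7 of a 6-element array.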

-David

Tian, Xinmin

Oct 24, 2014, 2:15:19 PM
to d...@cray.com, Hal Finkel, llv...@cs.uiuc.edu
> select(mask, load(addr), passthru)?

David is right: "select(mask, load(addr), passthru)" is like a vector load plus blending, which involves memory access speculation and is not safe in some cases, so it does not have the same semantics as masking lanes off.

Xinmin

Adam Nemet

Oct 24, 2014, 2:17:42 PM
to Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu
On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.de...@intel.com> wrote:

> Hi,
>
> We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
>
> The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.

I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed.  However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics.

My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.).  I am particularly worried whether we really want to generate these for targets that don’t have vector predication support.

There is also the related question of vector predicating any other instruction beyond just loads and stores which AVX512 supports.  This is probably a smaller gain but should probably be part of the plan as well.

Adam

Smith, Kevin B

Oct 24, 2014, 2:19:05 PM
to d...@cray.com, Demikhovsky, Elena, llv...@cs.uiuc.edu
> So %passthrough can *only* be undef or zeroinitializer?

No, that wasn't the intent. %passthrough can be any other definition that is needed. Zero and undef were simply two possible values that illustrated some interesting behavior.
Mapping the %passthrough to the actual semantics of the many vector instruction sets whose masked instructions leave the masked-off elements of the destination unchanged is done in much the same way as three-address instructions are turned into two-address instructions: by placing a copy as necessary so that dest and passthrough are in the same register.

Kevin B. Smith


Smith, Kevin B

Oct 24, 2014, 2:40:56 PM
to d...@cray.com, Hal Finkel, llv...@cs.uiuc.edu
I strongly agree with all these reasons, and it is for all those reasons that the proposal is written this way.

Kevin B. Smith


Nadav Rotem

Oct 24, 2014, 3:08:07 PM
to Adam Nemet, d...@cray.com, llv...@cs.uiuc.edu
On Oct 24, 2014, at 10:57 AM, Adam Nemet <ane...@apple.com> wrote:

> On Oct 24, 2014, at 4:24 AM, Demikhovsky, Elena <elena.de...@intel.com> wrote:
>
> > Hi,
> >
> > We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
 

I am happy to hear that you are working on this because it means that in the future we would be able to teach the SLP Vectorizer to vectorize types of <3 x float>.  

> > The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.


+1. I think that this is an important requirement. 

> I do agree that we would like to have one IR node to capture these so that they survive until ISel and that their specific semantics can be expressed.  However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics.


I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort. 

> My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.).  I am particularly worried whether we really want to generate these for targets that don’t have vector predication support.

Probably not, but this is a cost-benefit decision that the vectorizers would need to make. 

Tian, Xinmin

Oct 24, 2014, 3:16:02 PM
to Smith, Kevin B, d...@cray.com, Hal Finkel, llv...@cs.uiuc.edu
Ditto

Xinmin Tian

Das, Dibyendu

Oct 24, 2014, 3:17:45 PM
to d...@cray.com, llv...@cs.uiuc.edu
Is there an example of such a workload (let's say from the SPEC CPU 2006 harness or similar) that you have in mind, and the amount of gain expected?
- dibyendu

Tian, Xinmin

Oct 24, 2014, 3:21:16 PM
to Adam Nemet, Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu

Adam, yes, there is more we need to consider, e.g. masked gather/scatter, masked arithmetic ops, etc. This proposal serves as an important first step and as a direction check with the community.

 

Xinmin Tian

 


Adam Nemet

Oct 24, 2014, 3:27:50 PM
to Nadav Rotem, d...@cray.com, llv...@cs.uiuc.edu
On Oct 24, 2014, at 11:38 AM, Nadav Rotem <nro...@apple.com> wrote:


> I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.

Thanks, Nadav, that makes sense.  Do you foresee any potential issues due to the limitation of what information can be attached to an intrinsic call vs. a store, e.g. alignment or alias info.  I do remember from trying to optimize from-memory-broadcast intrinsics that the optimizers were pretty limited dealing with intrinsics accessing memory.

Adam

d...@cray.com

Oct 24, 2014, 3:34:11 PM
to Smith, Kevin B, llv...@cs.uiuc.edu
"Smith, Kevin B" <kevin....@intel.com> writes:

>> So %passthrough can *only* be undef or zeroinitializer?
>
> No, that wasn't the intent. %passthrough can be any other definition
> that is needed. Zero and undef were simply two possible values that
> illustrated some interesting behavior.

> Mapping of the %passthrough to the actual semantics of many vector
> instruction sets where the masked instructions leave the masked-off
> elements of the destination unchanged is done in a similar manner as
> three-address instructions are turned into two address instructions,
> by placing a copy as necessary so that dest and passthrough are in the
> same register.

How would one express such semantics in LLVM IR with this intrinsic? By
definition, %data and %passthrough are different IR virtual registers
and there are no copy instructions in LLVM IR.

In the more general case:

%b = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %a, i32 4, <8 x i1> %mask)

where %a and %b have no relation to each other, I presume the backend
would be responsible for doing a select/merge after the load if the ISA
didn't directly support the merge as part of the load operation. Right?
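
The equivalence behind that question can be sketched in Python. This is a model under stated assumptions, not a description of any backend: `masked_load` and `select` are illustrative helpers, memory is a plain list indexed by element, and the zero-filling load stands in for whatever form of masked load a given ISA provides natively.

```python
def masked_load(memory, addr, passthru, mask):
    """Enabled lanes come from memory, disabled lanes from passthru."""
    return [memory[addr + lane] if enabled else passthru[lane]
            for lane, enabled in enumerate(mask)]


def select(mask, on_true, on_false):
    """Lane-wise select between two vectors already held in registers."""
    return [t if enabled else f
            for enabled, t, f in zip(mask, on_true, on_false)]


memory = [10, 20, 30, 40]
a = [1, 2, 3, 4]                       # arbitrary, unrelated passthru value
mask = [True, False, True, False]

direct = masked_load(memory, 0, a, mask)
zero_filled = masked_load(memory, 0, [0, 0, 0, 0], mask)
merged = select(mask, zero_filled, a)  # merge done after the load
```

The select here is harmless because it only combines register values; the memory access itself has already been masked.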

Smith, Kevin B

Oct 24, 2014, 3:43:23 PM
to d...@cray.com, llv...@cs.uiuc.edu
> How would one express such semantics in LLVM IR with this intrinsic? By definition, %data and %passthrough are different IR virtual registers and there are no copy instructions in LLVM IR.

You never need to express this semantic in LLVM IR, because in SSA form the result of an operation is always a different SSA def from its inputs. Somewhere late in codegen needs to handle this, in exactly the same way it already has to when mapping to regular X86 two-address code.

For example, this LLVM IR

%add = add nsw i32 %b, %a

gets converted into

# *** IR Dump After Expand ISel Pseudo-instructions ***:
# Machine code for function foo: SSA
Function Live Ins: %EDI in %vreg0, %ESI in %vreg1

BB#0: derived from LLVM BB %entry
Live Ins: %EDI %ESI
%vreg1<def> = COPY %ESI; GR32:%vreg1
%vreg0<def> = COPY %EDI; GR32:%vreg0
%vreg2<def,tied1> = ADD32rr %vreg1<tied0>, %vreg0, %EFLAGS<imp-def,dead>; GR32:%vreg2,%vreg1,%vreg0

in ISel. So the necessary instruction semantics needn't be represented in LLVM IR. They are created once you have to map to "real" machine instructions using virtual registers, where copies, and the ability to mark a destination and a source as "tied" together, are representable.

Kevin


Hal Finkel

Oct 24, 2014, 3:44:42 PM
to Adam Nemet, d...@cray.com, llv...@cs.uiuc.edu
----- Original Message -----
> From: "Adam Nemet" <ane...@apple.com>
> To: "Nadav Rotem" <nro...@apple.com>
> Cc: d...@cray.com, llv...@cs.uiuc.edu
> Sent: Friday, October 24, 2014 2:03:24 PM
> Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
>

This is, hopefully, a bit better now than it was in the past. Nevertheless, improving our handling of these things would be worthwhile in general. The intrinsic carries the alignment, and alias metadata should just work (except perhaps for TBAA, but that should be easy to fix).

-Hal


d...@cray.com

Oct 24, 2014, 3:54:14 PM
to Das, Dibyendu, llv...@cs.uiuc.edu
"Das, Dibyendu" <Dibyen...@amd.com> writes:

> Is there an example of such a workload ( lets say from the spec cpu
> 2006 harness or similar ) that you have in mind and the amount of gain
> expected ?

Literally nearly every code that has significant vector work in it.
Even if there is no control flow in the loop, masking allows the
compiler to more aggressively vectorize and rely on the masks to prevent
unsafe execution.

The amount of gain is highly code-dependent but my guess is that Elena's
example of 2x speedup is typical, maybe even on the lower end.

The capability of the vectorizer is the biggest factor. Without masks,
the vectorizer cannot be as aggressive. With masks, the vectorizer
still has to be written to be aggressive. Ph.D. dissertations have been
written on the topic. It's non-trivial work.

Masking is an enabling technology, not an end goal.

d...@cray.com

Oct 24, 2014, 4:00:43 PM
to Smith, Kevin B, llv...@cs.uiuc.edu
"Smith, Kevin B" <kevin....@intel.com> writes:

> I strongly agree with all these reasons, and it is for all those
> reasons that the proposal is written this way.

Once general load and store intrinsics are added we really do need to
address more general masking. Nearly every floating-point operation can
trap, so those need masking available. Integer operations like divide
will also need masking. Either we will need a generalized masked
intrinsic for each such operation (e.g. llvm.masked.fadd) or we need a
more general way to represent masks in LLVM IR. We had such a
discussion some years ago but it didn't really go anywhere.

Adding general mask intrinsics for operations that can trap seems like
the easiest way to make forward progress.
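
The same argument can be modeled for a trapping arithmetic operation. The Python sketch below uses integer division; `masked_div` and `select_of_div` are hypothetical names that do not correspond to any existing or proposed intrinsic, and Python's `ZeroDivisionError` stands in for a hardware trap.

```python
def select_of_div(num, den, passthru, mask):
    """Divide every lane first, then select: disabled lanes can still trap."""
    quotients = [n // d for n, d in zip(num, den)]
    return [q if enabled else fill
            for q, enabled, fill in zip(quotients, mask, passthru)]


def masked_div(num, den, passthru, mask):
    """Divide only the enabled lanes; disabled lanes take passthru."""
    return [n // d if enabled else fill
            for n, d, enabled, fill in zip(num, den, mask, passthru)]


num = [8, 9, 10, 11]
den = [2, 0, 5, 0]
mask = [d != 0 for d in den]  # the loop's own guard against dividing by zero
passthru = [-1, -1, -1, -1]

masked_result = masked_div(num, den, passthru, mask)
try:
    select_result = select_of_div(num, den, passthru, mask)
except ZeroDivisionError:
    select_result = "trap"
```

The masked form completes with fill values in the guarded lanes, while the divide-then-select form traps on a lane the source program would never have executed.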

d...@cray.com

Oct 24, 2014, 4:01:57 PM
to Adam Nemet, llv...@cs.uiuc.edu
Adam Nemet <ane...@apple.com> writes:

> I am particularly worried whether we really want to generate these for
> targets that don’t have vector predication support.

We almost certainly don't want to do that. Clang or whatever is
generating LLVM IR will need to be aware of target vector capabilities.
Still, legalization needs to be available to handle this situation if it
arises.

> There is also the related question of vector predicating any other
> instruction beyond just loads and stores which AVX512 supports. This
> is probably a smaller gain but should probably be part of the plan as
> well.

It's not a small gain, it is a *critical* thing to do. We have
customers that always run with traps enabled, and without masking that
severely limits what code can be vectorized.

d...@cray.com

Oct 24, 2014, 4:07:31 PM
to Nadav Rotem, llv...@cs.uiuc.edu
Nadav Rotem <nro...@apple.com> writes:

> I oppose adding new first-level instructions because we would need to
> teach all of the existing optimizations about the new instructions,
> and considering the limited usefulness of masked operations it is not
> worth the effort.

Limited usefulness? It is quite the opposite. If we were starting from
scratch on an IR, we'd want to have first-class mask support, with masks
as an additional operand to nearly every IR instruction. Given where
we are, target-independent intrinsics seem like a good compromise
because as you said it would be a huge task to teach all of the existing
LLVM code about a new instruction operand. With intrinsics, passes are
conservative when they see an intrinsic they don't understand. We can
teach passes about specific intrinsics as we find benefit in doing so.

Pete Cooper

Oct 24, 2014, 4:44:06 PM
to Nadav Rotem, d...@cray.com, llv...@cs.uiuc.edu
On Oct 24, 2014, at 11:38 AM, Nadav Rotem <nro...@apple.com> wrote:

> I agree with the approach of adding target-independent masked memory intrinsics. One reason is that I would like to keep the vectorizers target independent (and use the target transform info to query the backends). I oppose adding new first-level instructions because we would need to teach all of the existing optimizations about the new instructions, and considering the limited usefulness of masked operations it is not worth the effort.

I agree with this.  They should be target independent.

However, what types should be supported here?  I haven’t looked in detail, but from memory I believe AVX-512 masks 32-bit values, and not bytes.  Are we going to have an intrinsic which can handle any vector type, or just <n x 32-bit> vectors, even at first?

Also, given that the types of the vectors matter, it seems like we’re going to need TTI anyway whenever we want to generate one of these, or else we’ll end up generating an illegal version which has to be scalarised in the backend.

Thanks,
Pete

Hal Finkel

Oct 24, 2014, 4:51:15 PM
to Pete Cooper, d...@cray.com, llv...@cs.uiuc.edu
----- Original Message -----
> From: "Pete Cooper" <peter_...@apple.com>
> To: "Nadav Rotem" <nro...@apple.com>
> Cc: d...@cray.com, llv...@cs.uiuc.edu
> Sent: Friday, October 24, 2014 3:40:10 PM
> Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
>
> On Oct 24, 2014, at 11:38 AM, Nadav Rotem < nro...@apple.com > wrote:
>
> > I agree with the approach of adding target-independent masked memory
> > intrinsics. One reason is that I would like to keep the vectorizers
> > target independent (and use the target transform info to query the
> > backends). I oppose adding new first-level instructions because we
> > would need to teach all of the existing optimizations about the new
> > instructions, and considering the limited usefulness of masked
> > operations it is not worth the effort.
>
> I agree with this. They should be target independent.
>
>
> However, what types should be supported here? I haven’t looked in
> detail, but from memory I believe AVX-512 masks 32-bit values, and
> not bytes. Are we going to have an intrinsic which can handle any
> vector type, or just <n x 32-bit> vectors, even at first?

I think you're confusing the IR types with the backend types. At the IR level, the masks are <n x i1> (one boolean per vector lane). The backend may represent this with a different type, but that's true of comparison results generally (they're often represented with different types in the backend), and we already deal with that. The pointer type is irrelevant; we'll just cast to it from whatever the real pointer type is.

-Hal
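
As an illustration of the IR-level mask versus a backend representation, here is a small Python sketch. It assumes, purely for the example, that a backend keeps the mask as a packed integer with one bit per lane (roughly how a dedicated mask register could hold it); `pack_mask` and `unpack_mask` are invented names, not LLVM functions.

```python
def pack_mask(mask):
    """Pack per-lane booleans (the IR-level <n x i1>) into an integer,
    with lane 0 in the least significant bit."""
    bits = 0
    for lane, enabled in enumerate(mask):
        if enabled:
            bits |= 1 << lane
    return bits


def unpack_mask(bits, num_lanes):
    """Recover the per-lane booleans from the packed integer form."""
    return [bool((bits >> lane) & 1) for lane in range(num_lanes)]


mask = [True, False, True, True, False, False, False, True]
packed = pack_mask(mask)
print(bin(packed))
print(unpack_mask(packed, len(mask)) == mask)
```

The two forms carry the same information, which is why the IR can stay with one boolean per lane regardless of how wide the data elements are.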

>
>
> Also, given that the types of the vectors matter, it seems like we’re
> going to need TTI anyway whenever we want to generate one of these,
> or else we’ll end up generating an illegal version which has to be
> scalarised in the backend.
>
>
> Thanks,
> Pete
>


Pete Cooper

Oct 24, 2014, 4:53:12 PM
to Hal Finkel, d...@cray.com, llv...@cs.uiuc.edu
On Oct 24, 2014, at 1:49 PM, Hal Finkel <hfi...@anl.gov> wrote:

> I think you're confusing the IR types with the backend types. At the IR level, the masks are <n x i1> (one boolean per vector lane), the backend may represent this with a different type, but that's true of comparison results generally (they're often represented with different types in the backend), we already deal with that. Regarding the pointer type, it is irrelevant, we'll just cast to it from whatever the real pointer type is.

Sorry, I should have been clearer.  I mean what types can the lanes be?  And assuming it's not all types down to i8, how should we handle the illegal cases, or avoid creating them in the first place?

Thanks,
Pete

Hal Finkel

Oct 24, 2014, 4:56:47 PM
to Pete Cooper, d...@cray.com, llv...@cs.uiuc.edu

You're correct, we'd use TTI to avoid creating illegal cases. If we do end up with an illegal case, it will need to be scalarized (which should always be possible). Syntactically, all basic vector types should be allowed.

-Hal
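
A rough picture of what "scalarized" means here, written as a Python model rather than actual legalizer code. Memory is a dictionary keyed by address so that an address that is never accessed can simply be absent; the function names are invented for the sketch.

```python
def scalarized_masked_store(memory, addr, data, mask):
    """Lane-by-lane expansion of a masked store: one guarded scalar store per lane."""
    for lane, value in enumerate(data):
        if mask[lane]:  # a branch per lane replaces the vector mask
            memory[addr + lane] = value


def scalarized_masked_load(memory, addr, passthru, mask):
    """Lane-by-lane expansion of a masked load; masked-off lanes keep passthru."""
    result = list(passthru)
    for lane in range(len(result)):
        if mask[lane]:
            result[lane] = memory[addr + lane]
    return result


# Address 102 is deliberately unmapped; lane 2 is masked off, so it is never touched.
memory = {100: 1, 101: 2, 103: 4}
loaded = scalarized_masked_load(memory, 100, [-1] * 4, [True, True, False, True])
print(loaded)

scalarized_masked_store(memory, 100, [9, 9, 9, 9], [False, True, False, False])
print(memory)
```

This is the "lot of control flow" form: correct on any target, but with one test and one scalar access per lane.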

Pete Cooper

Oct 24, 2014, 4:59:40 PM
to Hal Finkel, d...@cray.com, llv...@cs.uiuc.edu
Cool. Sounds good.

Pete

Hal Finkel

Oct 24, 2014, 5:06:55 PM
to Elena Demikhovsky, d...@cray.com, llv...@cs.uiuc.edu
Elena,

As far as I can tell, consensus is strongly in favor. Please submit a patch :-)

Thanks again,
Hal


d...@cray.com

Oct 24, 2014, 6:18:48 PM
to Smith, Kevin B, d...@cray.com, llv...@cs.uiuc.edu
"Smith, Kevin B" <kevin....@intel.com> writes:

>> How would one express such semantics in LLVM IR with this intrinsic?
>> By definition, %data and %passthrough are different IR virtual
>> registers and there are no copy instructions in LLVM IR.
>
> You never need to express this semantic in LLVM IR, because in SSA
> form they are always different SSA defs for the result of the
> operation versus the inputs to the operation. Someplace late in the
> CG needs to handle this, in exactly an analogous fashion as it already
> has to handle this for mapping to regular X86 two address code.

Ok, I think that works. I was concerned there may be some reason to
express this at the IR level for, say, AVX-512 because of masks but I
think you're right, the normal two-operand handling scheme can take care
of it.

d...@cray.com

Oct 24, 2014, 6:23:43 PM
to Smith, Kevin B, d...@cray.com, llv...@cs.uiuc.edu
"Smith, Kevin B" <kevin....@intel.com> writes:

>> How would one express such semantics in LLVM IR with this intrinsic?
>> By definition, %data and %passthrough are different IR virtual
>> registers and there are no copy instructions in LLVM IR.
>
> You never need to express this semantic in LLVM IR, because in SSA
> form they are always different SSA defs for the result of the
> operation versus the inputs to the operation. Someplace late in the
> CG needs to handle this, in exactly an analogous fashion as it already
> has to handle this for mapping to regular X86 two address code.

Following up, doing it this way will require that additional intrinsics
(for example, all FP operations) each have an additional passthrough
register operand:

%result = llvm.masked.fadd(%a, %b, %mask, %passthrough)

Otherwise we would need some implicit specification that either %a or %b
is the passthrough which seems very wrong for a general intrinsic.

Is this how you see this going?
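
For illustration, a Python model of a masked two-operand operation with an explicit passthrough operand. `masked_binop` is an invented name, and integer division stands in for "an operation that can trap" because it raises an exception in Python; this is a sketch of the semantics, not a proposed API.

```python
import operator


def masked_binop(op, a, b, mask, passthru):
    """Apply op lane-wise where the mask is set and copy passthru elsewhere.

    The operation is never evaluated for masked-off lanes, so it cannot
    trap on their operands.
    """
    return [op(x, y) if m else p for x, y, m, p in zip(a, b, mask, passthru)]


a = [8, 9, 10, 12]
b = [2, 0, 5, 0]
mask = [divisor != 0 for divisor in b]  # disable the lanes that would divide by zero
passthru = [0, 0, 0, 0]

result = masked_binop(operator.floordiv, a, b, mask, passthru)
print(result)
```

Because the passthrough is a separate operand, neither `a` nor `b` has to double as the merge source.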

d...@cray.com

Oct 24, 2014, 6:35:04 PM
to Pete Cooper, d...@cray.com, llv...@cs.uiuc.edu
Pete Cooper <peter_...@apple.com> writes:

> However, what types should be supported here? I haven’t looked in
> detail, but from memory I believe AVX-512 masks 32-bit values, and not
> bytes. Are we going to have an intrinsic which can handle any vector
> type, or just <n x 32-bit> vectors, even at first?

Eventually we should support at least f/i 8, 16, 32 and 64. We can
start with f/i 32, 64 for now I think.

> Also, given that the types of the vectors matter, it seems like we’re
> going to need TTI anyway whenever we want to generate one of these, or
> else we’ll end up generating an illegal version which has to be
> scalarised in the backend.

Yep.

Smith, Kevin B

Oct 24, 2014, 6:37:34 PM
to d...@cray.com, llv...@cs.uiuc.edu
Yes, IMO that has to be the direction in order for SSA form to work properly for masked operations.

Kevin B. Smith


Smith, Kevin B

Oct 24, 2014, 7:58:46 PM
to Smith, Kevin B, d...@cray.com, llv...@cs.uiuc.edu
Also, FWIW, this is the direction that was taken in the Intel Compiler to support masking in the IR as well.

Demikhovsky, Elena

Oct 25, 2014, 7:25:29 AM
to d...@cray.com, llv...@cs.uiuc.edu
> So %passthrough can *only* be undef or zeroinitializer?
No, it can be any value including undef and zeroinitializer.

While designing, we considered zero and merge semantics and decided that merge semantics is better because it also covers zero semantics if you use zeroinitializer as the %passthru.

- Elena
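
A small Python model of why merge semantics subsumes zero semantics. `masked_load` here is an illustrative stand-in for the proposed intrinsic, with memory modelled as a plain list.

```python
def masked_load(memory, addr, passthru, mask):
    """Model of a merging masked load: enabled lanes read memory,
    masked-off lanes take the corresponding passthru element."""
    return [
        memory[addr + lane] if enabled else fill
        for lane, (enabled, fill) in enumerate(zip(mask, passthru))
    ]


memory = [10, 20, 30, 40]
mask = [True, False, True, False]

merged = masked_load(memory, 0, [-1, -2, -3, -4], mask)  # arbitrary passthru
zeroed = masked_load(memory, 0, [0, 0, 0, 0], mask)      # zeroinitializer passthru

print(merged)
print(zeroed)
```

Zeroing is just the special case where every passthru element is zero, so a single merging form is enough.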


-----Original Message-----
From: d...@cray.com [mailto:d...@cray.com]
Sent: Friday, October 24, 2014 20:21
To: Demikhovsky, Elena
Cc: llv...@cs.uiuc.edu; Zaks, Ayal; Nadav Rotem <nro...@apple.com> (nro...@apple.com); Chandler Carruth (chan...@google.com); Adam Nemet (ane...@apple.com)
Subject: Re: Adding masked vector load and store intrinsics

"Demikhovsky, Elena" <elena.de...@intel.com> writes:

> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32>
> %passthru, i32 4, <8 x i1> %mask) where %passthru is used to fill the
> elements of %data that are masked-off (if any; can be zeroinitializer
> or undef).

So %passthrough can *only* be undef or zeroinitializer? If that's the case it might make more sense to have two intrinsics, one that fills with undef and one that fills with zero. Using a general vector operand with a restriction on valid values seems odd and potentially misleading.

Another option is to always fill with undef and require a select on top of the load to fill with zero. The load + select would be easily matchable to a target instruction.

I'm trying to think beyond just AVX-512 to what other future architectures might want. It's not a given that future architectures will fill with zero *or* undef though those are the two most likely fill values.

-David


Demikhovsky, Elena

Oct 25, 2014, 7:32:36 AM
to d...@cray.com, Hal Finkel, llv...@cs.uiuc.edu
> The same problem exists with any potentially trapping instruction (e.g. all floating point computations). The need for intrinsics goes way beyond loads and stores.

We are definitely looking at them, but decided to start with load and store. All FP + gather/scatter are in our long-term plan. It will be about 20 intrinsics.
But step-by-step.

- Elena


-----Original Message-----
From: d...@cray.com [mailto:d...@cray.com]
Sent: Friday, October 24, 2014 19:49
To: Hal Finkel
Cc: Demikhovsky, Elena; llv...@cs.uiuc.edu
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics

Hal Finkel <hfi...@anl.gov> writes:

> For the loads, I'm much less sure. Why can't we represent the loads as
> select(mask, load(addr), passthru)?

Because that does not specify the correct semantics. This formulation expects the load to happen before the mask is applied. The load could trap. The operation needs to be presented as an atomic unit.

The same problem exists with any potentially trapping instruction (e.g. all floating point computations). The need for intrinsics goes way beyond loads and stores.
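
The difference between the two formulations can be modelled in Python as follows. This is a sketch under stated assumptions: `Memory`, `Trap`, `select_of_load` and `masked_load` are invented for the example, and an exception stands in for a hardware fault on an unmapped address.

```python
class Trap(Exception):
    """Stands in for a hardware fault on an unmapped address."""


class Memory:
    def __init__(self, mapped):
        self.mapped = dict(mapped)

    def load(self, addr):
        if addr not in self.mapped:
            raise Trap(f"unmapped address {addr}")
        return self.mapped[addr]


def select_of_load(mem, addr, passthru, mask):
    """select(mask, load(addr), passthru): every lane is loaded first."""
    loaded = [mem.load(addr + lane) for lane in range(len(mask))]
    return [l if m else p for l, m, p in zip(loaded, mask, passthru)]


def masked_load(mem, addr, passthru, mask):
    """Masked load as one unit: masked-off lanes never access memory."""
    return [
        mem.load(addr + lane) if m else p
        for lane, (m, p) in enumerate(zip(mask, passthru))
    ]


mem = Memory({0: 1, 1: 2})  # only addresses 0 and 1 are mapped
mask = [True, True, False, False]
passthru = [0, 0, 0, 0]

print(masked_load(mem, 0, passthru, mask))
try:
    select_of_load(mem, 0, passthru, mask)
except Trap as fault:
    print("trapped:", fault)
```

Both forms would produce the same value if the full load succeeded; the difference is only that the select form performs accesses the mask was meant to suppress.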

Demikhovsky, Elena

Oct 25, 2014, 7:42:41 AM
to Hal Finkel, d...@cray.com, llv...@cs.uiuc.edu
Thank you Hal,

Meanwhile, I implemented something quick to be sure that it works and to estimate which pieces of LLVM code will need to be touched.
I'll prepare a patch soon.

- Elena



shahid shahid

Oct 25, 2014, 10:56:28 AM
to Demikhovsky, Elena, llv...@cs.uiuc.edu, d...@cray.com
Hi Elena,

Nice to see that your thinking is quite similar to mine.

Do you plan to generate this intrinsic in the Loop Vectorizer based on a subtarget feature?

If so, it would be better to generate it there in a target-independent manner. Later on,
during lowering, based on the availability of target support for masked ops, you can decide
either to scalarize or to generate the target's masked instruction.

Shahid

Demikhovsky, Elena

Oct 26, 2014, 3:11:00 AM
to shahid shahid, llv...@cs.uiuc.edu, d...@cray.com

We may get less optimal code on other targets as a result. The user may want to optimize the sequence of scalar instructions when vectorization did not happen.

 

- Elena

 


Owen Anderson

Oct 26, 2014, 5:59:49 PM
to Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu
What is the motivation for using intrinsics versus adding new instructions?

—Owen

Demikhovsky, Elena

Oct 27, 2014, 3:05:45 AM
to Owen Anderson, d...@cray.com, llv...@cs.uiuc.edu

We are just following the common recommendation to start with intrinsics:

http://llvm.org/docs/ExtendingLLVM.html

 

 

- Elena

 


Chandler Carruth

Oct 27, 2014, 3:11:15 AM
to Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu
It's not clear that these would need to be totally new instructions as opposed to a mask operand to the existing store instruction? I'm curious what Owen is actually imagining here...

Adam Nemet

Oct 27, 2014, 12:42:37 PM
to Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu

> On Oct 25, 2014, at 4:30 AM, Demikhovsky, Elena <elena.de...@intel.com> wrote:
>
>> The same problem exists with any potentially trapping instruction (e.g. all floating point computations). The need for intrinsics goes way beyond loads and stores.
>
> We are definitely looking at them, but decided to start with load and store. All FP + gather/scatter are in our long-term plan. It will be about 20 intrinsics.
> But step-by-step.

Hi Elena,

Can you please elaborate on the list. I don’t see how 20 intrinsics would cover “All FP”. But do you really have to do all FP or only instructions that can trap with LLVM (e.g. division by zero)?

I do agree that we want to go step by step but we also need to see the end goal to make sure the design will scale.

Thanks,
Adam

Owen Anderson

Oct 27, 2014, 12:59:20 PM
to Chandler Carruth, d...@cray.com, llv...@cs.uiuc.edu
Adding a mask operand to the existing store instructions seems risky, as lots of existing code would not necessarily preserve/respect them.

—Owen

Owen Anderson

Oct 27, 2014, 1:03:23 PM
to Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu
Since this is something that you expect to be supported on all targets, and which requires extensive type overloading, it seems like a perfect candidate for being an Instruction rather than an intrinsic.

—Owen

d...@cray.com

Oct 27, 2014, 1:24:40 PM
to Demikhovsky, Elena, llv...@cs.uiuc.edu
"Demikhovsky, Elena" <elena.de...@intel.com> writes:

>> The same problem exists with any potentially trapping instruction
>> (e.g. all floating point computations). The need for intrinsics
>> goes way beyond loads and stores.
>
> We are definitely looking at them, but decided to start with load and
> store. All FP + gather/scatter are in our long-term plan. It will be
> about 20 intrinsics.
> But step-by-step.

Makes total sense. Glad to hear the rest is coming!

-David

d...@cray.com

Oct 27, 2014, 1:47:54 PM
to Adam Nemet, llv...@cs.uiuc.edu
Adam Nemet <ane...@apple.com> writes:

> Can you please elaborate on the list. I don’t see how 20 intrinsics
> would cover “All FP”. But do you really have to do all FP or only
> instructions that can trap with LLVM (e.g. division by zero)?

We need intrinsics for all the FP operations. Any operand that is a
signaling NaN or even a quiet NaN for some operations will trap. LLVM
needs masking to protect itself from that when vectorizing certain kinds
of loops.

-David

d...@cray.com

Oct 27, 2014, 1:59:43 PM
to Demikhovsky, Elena, llv...@cs.uiuc.edu
"Demikhovsky, Elena" <elena.de...@intel.com> writes:

>> So %passthrough can *only* be undef or zeroinitializer?
> No, it can be any value including undef and zeroinitializer.
>
> We considered, while designing, zero and merge semantics and decided
> that merge semantics is better because it covers zero semantics if you
> use zeroinitializer in the %passthru.

But passthrough has to have some value, right, to know what to merge?
It must appear as an operand, correct? I think this is good, I just
want to clarify. :)

-David

Demikhovsky, Elena

Oct 28, 2014, 4:46:01 AM
to d...@cray.com, Adam Nemet, llv...@cs.uiuc.edu
Yes, David is right. We should cover all instructions that can trap on NaN.
I just counted all FP instructions, including conversions: fadd, fsub, ..., fptrunc, fpext, ..., sitofp, fcmp, fma (~13) + gather/scatter (2) + load/store (2).
I'm not sure about integer divide and remainder, because we don't have a solution in the Intel Architecture today. On the other hand, a library may support these operations in masked vector form and do it faster than a scalar sequence.

- Elena



Demikhovsky, Elena

Oct 28, 2014, 5:16:22 AM
to d...@cray.com, llv...@cs.uiuc.edu

>>> So %passthrough can *only* be undef or zeroinitializer?
>> No, it can be any value including undef and zeroinitializer.
>>
>> We considered, while designing, zero and merge semantics and decided
>> that merge semantics is better because it covers zero semantics if you
>> use zeroinitializer in the %passthru.
>
> But passthrough has to have some value, right, to know what to merge?

Not necessarily. It may be "undef" at the IR level, when we don't care about the values in masked-off lanes.

> It must appear as an operand, correct?

Yes.

> I think this is good, I just want to clarify. :)
>
> -David


Demikhovsky, Elena

Oct 28, 2014, 8:30:29 AM
to Owen Anderson, d...@cray.com, llv...@cs.uiuc.edu

Many overloaded intrinsics could be replaced with instructions: fabs, fma, or sqrt.

Chandler will probably explain the criteria. What is the difference between fma and fadd? Or between fptrunc and fabs?

A new instruction like

%a = loadm <4 x i32>* %addr, <4 x i32> %passthru, i32 4, <4 x i1> %mask

is possible, but may not be very useful for most targets.

So we start with intrinsics.

Owen Anderson

Oct 28, 2014, 12:36:44 PM
to Demikhovsky, Elena, d...@cray.com, llv...@cs.uiuc.edu
I would have no issue promoting some of the fundamental floating point operations that are currently intrinsics to instructions, though I don’t think there’s a strong impetus to do so at this time.

The only “deep” reasons for the guidance to start with intrinsics are (1) it’s more work to add an instruction, and (2) in theory the instruction opcode space is bounded, though this has never been a practical problem.  The advantages of an instruction include better compile-time (not having to string-match function names), more compact bitcode representation, and cleaner IR syntax particularly vis-a-vis type overloading.

There’s a big qualitative difference between fabs and these masked operations, mostly because of the degree of type overloading you intend to support.  I am very concerned that the IR that will contain these constructs will be dramatically harder to read because of it.

—Owen

d...@cray.com

Oct 28, 2014, 12:41:07 PM
to Demikhovsky, Elena, llv...@cs.uiuc.edu
"Demikhovsky, Elena" <elena.de...@intel.com> writes:

>> But passthrough has to have some value, right, to know what to merge?
> [Demikhovsky, Elena] Not necessarily. It may be "undef" on IR
> level. When we don't care about value in masked-off lanes.

I was including "undef" as a "value." :)

>> It must appear as an operand, correct?
> [Demikhovsky, Elena] Yes.

Great, this sounds very good.

-David

Hal Finkel

Oct 28, 2014, 1:26:12 PM
to Owen Anderson, d...@cray.com, llv...@cs.uiuc.edu
----- Original Message -----
> From: "Owen Anderson" <resi...@mac.com>
> To: "Elena Demikhovsky" <elena.de...@intel.com>
> Cc: d...@cray.com, llv...@cs.uiuc.edu
> Sent: Tuesday, October 28, 2014 11:30:15 AM
> Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics
>
>
> I would have no issue promoting some of the fundamental floating
> point operations that are currently intrinsics to instructions,
> though I don’t think there’s a strong impetus to do so at this time.
>
>
> The only “deep” reasons for the guidance to start with intrinsics is
> (1) it’s more work to add an instruction, and (2) in theory the
> instruction opcode space is bounded, though this has never been a
> practical problem. The advantages include better compile-time (not
> having to string-match function names), more compact bitcode
> representation, and cleaner IR syntax particularly vis-a-vis type
> overloading.

I think that starting with the intrinsics, for now, will be the right path while we figure out exactly what the design space is and use cases are. For the moment if these are primarily generated by the loop vectorizer, it should not be a big problem. Obviously when adding new instructions, there are a lot of switch statements to update ;)

>
>
> There’s a big qualitative difference between fabs and these masked
> operations, mostly because of the degree of type overloading you
> intend to support. I am very concerned that the IR that will contain
> these constructs will be dramatically harder to read because of it.

I think this ties back to the other thread on intrinsics name mangling (and the lack of the need for it). I think that, at least, Elena, Philip and I agree that, generally speaking, we'd like to clean this up, but that we should do this as a separate change independent of this. The memcpy.whatever are not easy to read either ;) -- and I agree that this could make things worse in that regard.

-Hal
