[LLVMdev] Predication on SIMD architectures and LLVM

Marcello Maggioni

unread,

Oct 19, 2012, 11:38:29 AM10/19/12

to llv...@cs.uiuc.edu

Hello,
I'm working on a compiler based on LLVM for a SIMD architecture that
supports instruction predication. We would like to implement branching
on this architecture using predication.
As you know the LLVM-IR doesn't support instruction predication, so I'm
not exactly sure on what is the best way to implement it.
We came up with some ways to do it in LLVM:

- Do not add any predication in the IR (except for load and stores
through intrinsics), linearize the branches and substitute PHI nodes
with selects for merging values . In the backend then we would custom
lower the select instruction to produce a predicated mov to choose the
right version of the value. I think this option doesn't make use of the
possible benefits of the architecture we are targeting at all.

- Another way could be adding intrinsics for all instructions in the
target to make them support predication, still linearize all the
branches, but use instruction predication instead of generating cmovs .
The backend then would custom lower almost any instruction into
predicated custom nodes that are matched through tablegen patterns. We
could generate these intrinsics in the same IR pass that linearizes
branches.

- Make a custom backend that actually directly outputs predicated
instructions (we really mainly only need one type of predicate , so
every instruction could use that kind of predicate ...) but I think this
is a nasty solution ...

Did someone already tried to do this in LLVM and if yes what solution/s
did you use to solve the problem?

Regards,
Marcello

--
Marcello Maggioni
Codeplay Software Ltd
45 York Place, Edinburgh, EH1 3HP
Tel: 0131 466 0503
Fax: 0131 557 6600
Website: http://www.codeplay.com
Twitter: https://twitter.com/codeplaysoft

This email and any attachments may contain confidential and /or privileged information and is for use by the addressee only. If you are not the intended recipient, please notify Codeplay Software Ltd immediately and delete the message from your computer. You may not copy or forward it,or use or disclose its contents to any other person. Any views or other information in this message which do not relate to our business are not authorized by Codeplay software Ltd, nor does this message form part of any contract unless so stated.
As internet communications are capable of data corruption Codeplay Software Ltd does not accept any responsibility for any changes made to this message after it was sent. Please note that Codeplay Software Ltd does not accept any liability or responsibility for viruses and it is your responsibility to scan any attachments.
Company registered in England and Wales, number: 04567874
Registered office: 81 Linkfield Street, Redhill RH1 6BY

_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

陳韋任 (Wei-Ren Chen)

unread,

Oct 19, 2012, 12:12:49 PM10/19/12

to Marcello Maggioni, llv...@cs.uiuc.edu

Hi Marcello,

On Fri, Oct 19, 2012 at 04:38:29PM +0100, Marcello Maggioni wrote:
> Hello,
> I'm working on a compiler based on LLVM for a SIMD architecture that
> supports instruction predication. We would like to implement branching
> on this architecture using predication.
> As you know the LLVM-IR doesn't support instruction predication, so I'm
> not exactly sure on what is the best way to implement it.
> We came up with some ways to do it in LLVM:

I recall Ocelot [1], a binary translator which translates PTX into LLVM
also faces the same problem. You might want to take a look on what
Ocelot does.

HTH,
chenwj

[1] http://www.gdiamos.net/papers/ocelot-pact.pdf

--
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

Ralf Karrenberg

unread,

Oct 19, 2012, 12:24:14 PM10/19/12

to marc...@codeplay.com, llv...@cs.uiuc.edu

Hi Marcello,

I am sure I've seen some postings on the list concerning architectures
that support predicated execution and how to map that to LLVM IR, I'm
just not sure anymore when and who was involved :).

I have implemented your first suggestion for targets that do not have
predicated instructions (where control flow to data flow conversion with
explicit maintaining of masks and blend operations is the only option
you have for SIMD vectorization). However, I agree with you that this is
not a good solution for you since you would basically ignore one of the
strengths of your platform. Should you still need some code or help with
this, feel free to drop me a message.

Cheers,
Ralf

Sergei Larin

unread,

Oct 19, 2012, 12:29:20 PM10/19/12

to Marcello Maggioni, llv...@cs.uiuc.edu

We are currently doing something similar to your third option in Hexagon
backend. But it is a VLIW so predication is not the only reason for that.

Sergei

---
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by
The Linux Foundation

Tom Stellard

unread,

Oct 19, 2012, 12:37:59 PM10/19/12

to Marcello Maggioni, llv...@cs.uiuc.edu

On Fri, Oct 19, 2012 at 04:38:29PM +0100, Marcello Maggioni wrote:

> Hello,
> I'm working on a compiler based on LLVM for a SIMD architecture that
> supports instruction predication. We would like to implement
> branching on this architecture using predication.
> As you know the LLVM-IR doesn't support instruction predication, so
> I'm not exactly sure on what is the best way to implement it.
> We came up with some ways to do it in LLVM:
>
> - Do not add any predication in the IR (except for load and stores
> through intrinsics), linearize the branches and substitute PHI nodes
> with selects for merging values . In the backend then we would
> custom lower the select instruction to produce a predicated mov to
> choose the right version of the value. I think this option doesn't
> make use of the possible benefits of the architecture we are
> targeting at all.
>

You may want to look at the IfConversion pass
(lib/CodeGen/IfConversion.cpp). This converts branches to predicated
instructions, and you may be able to use it for all branching if you
teach it to maintain a predicate stack. I actually looked into doing
this for the newest generation of GPUs (Southern Islands) supported by
the R600[1] backend which use predication for all branching, but opted
to go with a target specific pass until the backend is more stable.

-Tom

[1] http://cgit.freedesktop.org/~tstellar/llvm/tree/lib/Target/AMDGPU

Marcello Maggioni

unread,

Oct 19, 2012, 1:09:40 PM10/19/12

to llv...@cs.uiuc.edu

On 10/19/2012 5:12 PM, 陳韋任 (Wei-Ren Chen) wrote:
> Hi Marcello,
>
> On Fri, Oct 19, 2012 at 04:38:29PM +0100, Marcello Maggioni wrote:
>> Hello,
>> I'm working on a compiler based on LLVM for a SIMD architecture that
>> supports instruction predication. We would like to implement branching
>> on this architecture using predication.
>> As you know the LLVM-IR doesn't support instruction predication, so I'm
>> not exactly sure on what is the best way to implement it.
>> We came up with some ways to do it in LLVM:
> I recall Ocelot [1], a binary translator which translates PTX into LLVM
> also faces the same problem. You might want to take a look on what
> Ocelot does.
>
> HTH,
> chenwj
>
> [1] http://www.gdiamos.net/papers/ocelot-pact.pdf
>

Hi Wen-Ren,

thank you for your link, seems like an interesting read on the argument!

Marcello

--
Marcello Maggioni
Codeplay Software Ltd
45 York Place, Edinburgh, EH1 3HP
Tel: 0131 466 0503
Fax: 0131 557 6600
Website: http://www.codeplay.com
Twitter: https://twitter.com/codeplaysoft

This email and any attachments may contain confidential and /or privileged information and is for use by the addressee only. If you are not the intended recipient, please notify Codeplay Software Ltd immediately and delete the message from your computer. You may not copy or forward it,or use or disclose its contents to any other person. Any views or other information in this message which do not relate to our business are not authorized by Codeplay software Ltd, nor does this message form part of any contract unless so stated.
As internet communications are capable of data corruption Codeplay Software Ltd does not accept any responsibility for any changes made to this message after it was sent. Please note that Codeplay Software Ltd does not accept any liability or responsibility for viruses and it is your responsibility to scan any attachments.
Company registered in England and Wales, number: 04567874
Registered office: 81 Linkfield Street, Redhill RH1 6BY

_______________________________________________

Marcello Maggioni

unread,

Oct 19, 2012, 1:45:52 PM10/19/12

to Ralf Karrenberg, llv...@cs.uiuc.edu

Hi Ralf,

yeah, I've checked if on the list there were any kind of reference to
that , but I only found scattered and incomplete information (maybe some
good stuff is there, but very well hidden, I will try to check again by
the way).

About your work on predication, I know of your IR-level if-conversion
and vectorization and a lot of my knowledge on the matter actually comes
from your talk in the last Euro-LLVM conference in London, so thanks for
that! Also thank you for your availability on discussing the matter!

Marcello

--
Marcello Maggioni
Codeplay Software Ltd
45 York Place, Edinburgh, EH1 3HP
Tel: 0131 466 0503
Fax: 0131 557 6600
Website: http://www.codeplay.com
Twitter: https://twitter.com/codeplaysoft

This email and any attachments may contain confidential and /or privileged information and is for use by the addressee only. If you are not the intended recipient, please notify Codeplay Software Ltd immediately and delete the message from your computer. You may not copy or forward it,or use or disclose its contents to any other person. Any views or other information in this message which do not relate to our business are not authorized by Codeplay software Ltd, nor does this message form part of any contract unless so stated.
As internet communications are capable of data corruption Codeplay Software Ltd does not accept any responsibility for any changes made to this message after it was sent. Please note that Codeplay Software Ltd does not accept any liability or responsibility for viruses and it is your responsibility to scan any attachments.
Company registered in England and Wales, number: 04567874
Registered office: 81 Linkfield Street, Redhill RH1 6BY

Marcello Maggioni

unread,

Oct 19, 2012, 1:51:36 PM10/19/12

to Sergei Larin, llv...@cs.uiuc.edu

Hello Sergei,

I actually don't know the hexagon platform very well, I only just
briefly checked it's backend for reference on some things . Our target
also has some VLIW features, but I hope we will not have to endup with
the third option, I would like to keep it as a last resort.

Marcello

Marcello Maggioni

unread,

Oct 19, 2012, 2:06:23 PM10/19/12

to Tom Stellard, llv...@cs.uiuc.edu

Hello Tom,

so basically what you are doing is in your AMDGPU backend is generating
machine code like if it was a normal target (with diverging branches and
stuff) and then through a custom post-ISel machine pass you do the if
conversion linearizing and predicating the branches. Am I right? Seems
like a much easier approach to apply than doing it at the IR level
(because you don't have to add intrinsics to predicate your instructions) .

Marcello

Dan Gohman

unread,

Oct 19, 2012, 3:59:20 PM10/19/12

to Marcello Maggioni, llv...@cs.uiuc.edu

Hi,

I've done work on predicated SIMD representations for LLVM.

If you search through the archives, you may find my "applymask" proposal, which is an attempt at representing predication in a very comprehensive way. I've since stopped pushing the proposal in part because Larrabee's changing fortunes led to a decline of interest at the time, in part because the proposal doesn't look intuitive to people who don't have experience in SIMD programming, and in part because there were some technical details with my actual proposal (although I believe solutions could be found).

And, in part because a popular trend seems to be to have SIMD units which don't trap or raise exception flags on arithmetic and which don't go faster when predicated, such that there's no reason to predicate anything except stores and occasionally loads. On these architectures, simply having intrinsics for stores, and perhaps loads, is basically sufficient, and less invasive.

And, in part because predication is another wrinkle for SIMD performance portability. As people start caring more about SIMD performance, there will be more pressure to tune SIMD code in target-specific ways, and it erodes the benefit of a target-independent representation. This is a complex topic though, and there are multiple considerations, and not everyone agrees with me here.

One thing that's initially counter-intuitive is that SIMD predication cannot be done in the same way as scalar or VLIW predication, where the majority of the compiler works as if it's on a "normal" scalar machine and predication happens during codegen, where the optimizer doesn't have to think about it. SIMD predication must be applied by whatever code is producing SIMD instructions, and in LLVM, that's typically in the optimizer or earlier.

Dan

d...@cray.com

unread,

Oct 22, 2012, 1:10:20 PM10/22/12

to Ralf Karrenberg, llv...@cs.uiuc.edu

Ralf Karrenberg <Cha...@gmx.de> writes:

> I am sure I've seen some postings on the list concerning architectures
> that support predicated execution and how to map that to LLVM IR, I'm
> just not sure anymore when and who was involved :).

I was one of them. I suggested adding general predication to the LLVM
IR but that doesn't look like it's going to happen. Dan Gohman had
another idea on how to represent predicate masks but that also didn't
relaly go anywhere.

None of your proposed solutions is ideal. We really should have
first-class predication in the IR. It's only going to get more
important.

-David

d...@cray.com

unread,

Oct 22, 2012, 1:15:49 PM10/22/12

to Dan Gohman, llv...@cs.uiuc.edu

Dan Gohman <dan4...@gmail.com> writes:

> And, in part because a popular trend seems to be to have SIMD units
> which don't trap or raise exception flags on arithmetic and which don't
> go faster when predicated, such that there's no reason to predicate
> anything except stores and occasionally loads. On these architectures,
> simply having intrinsics for stores, and perhaps loads, is basically
> sufficient, and less invasive.

This is going to change. Intel recently released the ISA for Knights
Corner, a machine with general predication for SIMD.

http://software.intel.com/en-us/forums/topic/278102

> And, in part because predication is another wrinkle for SIMD
> performance portability. As people start caring more about SIMD
> performance, there will be more pressure to tune SIMD code in
> target-specific ways, and it erodes the benefit of a
> target-independent representation. This is a complex topic though, and
> there are multiple considerations, and not everyone agrees with me
> here.

It's true that a target-independent predicated IR isn't going to
translate well to a target that doesn't have predication. However, for
targets that do it's a godsend.

> One thing that's initially counter-intuitive is that SIMD predication
> cannot be done in the same way as scalar or VLIW predication, where
> the majority of the compiler works as if it's on a "normal" scalar
> machine and predication happens during codegen, where the optimizer
> doesn't have to think about it. SIMD predication must be applied by
> whatever code is producing SIMD instructions, and in LLVM, that's
> typically in the optimizer or earlier.

Yep. This is why I think IR support is essential.

-David

David Chisnall

unread,

Oct 23, 2012, 4:31:35 AM10/23/12

to d...@cray.com, llv...@cs.uiuc.edu

On 22 Oct 2012, at 18:10, <d...@cray.com> wrote:

> None of your proposed solutions is ideal. We really should have
> first-class predication in the IR. It's only going to get more
> important.

Perhaps I am missing something, but isn't a predicated instruction effectively an single-instruction version of an arithmetic operation followed by a select? As we can already represent this in the IR, and already match other predicated instructions (e.g. on ARM) to this pattern, what is gained by adding predication directly to the IR?

David Chisnall

unread,

Oct 23, 2012, 12:43:34 PM10/23/12

to d...@cray.com, llv...@cs.uiuc.edu

I am talking about the LLVM select instruction, not a vector select:

http://llvm.org/docs/LangRef.html#i_select

In any non-trapping case, an arithmetic operation (or sequence of operations) followed by a select is semantically equivalent to the predicated version. This is exactly how predicated instructions on ARM are handled. For example, the following IR:

%cmp = icmp sgt i32 %c, %b
%add = add nsw i32 %b, 1
%add1 = add nsw i32 %c, 2
%retval.0 = select i1 %cmp, i32 %add, i32 %add1

Becomes this ARM assembly:

add r2, r1, #2
cmp r1, r0
addgt r2, r0, #1
mov r0, r2

An equally valid form would be:

cmp r1, r0
addle r2, r1, #2
addgt r2, r0, #1
mov r0, r2

Separating the select, which embodies the predication, from the operations allows more choice in terms of the final representation. Unless the load or store is volatile, the compiler is free to elide it if its result is not used, and is most definitely free to fold it into a predicated load. The same is obviously true of any side-effect-free operations, such as divides and square roots: folding them into predicated instructions is no less invalid than conditionally executing them in branches or removing them entirely via dead code elimination.

Just because the generated machine code must contain predicated instructions most definitely does mean that the LLVM IR must contain it, or even that we would gain anything in terms of expressive power by permitting it.

David

On 23 Oct 2012, at 17:25, <d...@cray.com> wrote:

> David Chisnall <David.C...@cl.cam.ac.uk> writes:
>
>> Perhaps I am missing something, but isn't a predicated instruction
>> effectively an single-instruction version of an arithmetic operation
>> followed by a select?
>

> No, it is not. Among other things, predication is used to avoid traps.
> A vector select is an entirely different operation.

>
>> As we can already represent this in the IR, and already match other
>> predicated instructions (e.g. on ARM) to this pattern, what is gained
>> by adding predication directly to the IR?
>

> Predicated loads, stores, divides, sqrts, etc. are essential for
> correctly vectorizing loops with conditionals due to safety concerns.
> If the loop body has no dangerous operations, then yes, a vector select
> can be used without problems but it is often slower than predication.
> Usually the hardware can optimize instructions with certain values of
> predicates.
>
> -David

d...@cray.com

unread,

Oct 23, 2012, 12:25:26 PM10/23/12

to David Chisnall, llv...@cs.uiuc.edu

David Chisnall <David.C...@cl.cam.ac.uk> writes:

> Perhaps I am missing something, but isn't a predicated instruction
> effectively an single-instruction version of an arithmetic operation
> followed by a select?

No, it is not. Among other things, predication is used to avoid traps.
A vector select is an entirely different operation.

> As we can already represent this in the IR, and already match other
> predicated instructions (e.g. on ARM) to this pattern, what is gained
> by adding predication directly to the IR?

Predicated loads, stores, divides, sqrts, etc. are essential for
correctly vectorizing loops with conditionals due to safety concerns.
If the loop body has no dangerous operations, then yes, a vector select
can be used without problems but it is often slower than predication.
Usually the hardware can optimize instructions with certain values of
predicates.

-David

Nadav Rotem

unread,

Oct 24, 2012, 12:29:24 AM10/24/12

to d...@cray.com, llv...@cs.uiuc.edu

On Oct 22, 2012, at 10:15 AM, d...@cray.com wrote:

It's true that a target-independent predicated IR isn't going to
translate well to a target that doesn't have predication. However, for
targets that do it's a godsend.

Even for MIC (Xeon Phi), the predicated IR is not necessary. The instructions that really benefit from predication are loads and stores. MIC masks are write masks, but even if they were to help the performance of predicated instructions, there are other ways to do this. One way would be to implement masked load and mask store intrinsics, and to place 'select' instructions in strategic locations: before instructions that may fault, before phi-nodes, etc. A pre-register allocation pass can propagate the masks to all of the instructions that need them. But this is theoretical since only load/store really benefit from predication.

Yep. This is why I think IR support is essential.

I don't think that we need to change the IR, even for a predicated architecture such as MIC .

d...@cray.com

unread,

Oct 24, 2012, 1:12:19 PM10/24/12

to Nadav Rotem, llv...@cs.uiuc.edu

Nadav Rotem <nro...@apple.com> writes:

> One way would be to implement masked load and mask store
> intrinsics, and to place 'select' instructions in strategic locations:
> before instructions that may fault, before phi-nodes, etc. A
> pre-register allocation pass can propagate the masks to all of the
> instructions that need them.

How does this work if the load is not conditional but trapping
operations that use the loaded values are conditional?

Yes, such propagation can probably be done but it's painful and every
predicated target would have to implement it. It's much easier to just
select the right operation in isel, I think, and that seems to require
IR support.

d...@cray.com

unread,

Oct 24, 2012, 1:24:13 PM10/24/12

to David Chisnall, llv...@cs.uiuc.edu

David Chisnall <David.C...@cl.cam.ac.uk> writes:

> I am talking about the LLVM select instruction, not a vector select:
>
> http://llvm.org/docs/LangRef.html#i_select

That is what I mean by a vector select.

> In any non-trapping case, an arithmetic operation (or sequence of
> operations) followed by a select is semantically equivalent to the
> predicated version.

Yes.

> Separating the select, which embodies the predication, from the
> operations allows more choice in terms of the final representation.

Sure.

> Just because the generated machine code must contain predicated
> instructions most definitely does mean that the LLVM IR must contain
> it, or even that we would gain anything in terms of expressive power
> by permitting it.

Certainly such transformations *can* be done, but is it the most
efficient/best way to do things? I wonder how many different passes of
"select to predication" we will end up having, one per target.

Renato Golin

unread,

Oct 24, 2012, 1:37:32 PM10/24/12

to d...@cray.com, llv...@cs.uiuc.edu

On 24 October 2012 18:24, <d...@cray.com> wrote:
> Certainly such transformations *can* be done, but is it the most
> efficient/best way to do things? I wonder how many different passes of
> "select to predication" we will end up having, one per target.

I don't understand what's the problem with this. Different targets
have different predication rules, so they should have different
selection from IR to produce code.

If this is an IR pass, or a DAG selection step, I don't know. I'd
think it'd be the latter, though, as it's where the target-specific
code is, but there might be other IR passes that need this information
before DAG sel.

--
cheers,
--renato

http://systemcall.org/

Reply all

Reply to author

Forward