[LLVMdev] [RFC] "noclone" function attribute

James Molloy

unread,

Dec 1, 2012, 11:02:30 AM12/1/12

to llv...@cs.uiuc.edu

Hi,

OpenCL has a "barrier" function with very specific semantics, and there is currently no analogue to model this in LLVM.

This has been touched on by the SPIR folks but I don't believe they put forward a proposal.

The barrier function is a special function that ensures that all workitems executing a kernel have executed up to that point before execution on any workitem can continue.

The CL spec is specific about how user kernels can use barriers - the sequence of barriers that are hit by all workitems in a workgroup must be identical. An issue occurs when defining what "the same barrier" actually means, however. GPU Hardware, and CPU implementations such as Ralf Karrenberg's (http://llvm.org/devmtg/2012-04-12/Slides/Ralf_Karrenberg.pdf) key off the PC, so barrier call A and barrier call B are the same if and only if the PC value at A and B is the same, for some definition of PC.

Last time this was mentioned, Eli suggested that keying off the PC was a bit silly - it is my understanding that the next CL spec has "named barriers" proposed, which give the key to the barrier function explicitly as a parameter. However even if this is ratified, we (CL vendors) still need to support the old behaviour of keying off the PC.

This (keying off the PC) has advantages in terms of implementation for the CPU. For an example, and an example of how this can go wrong, see the end of this message.

This can go wrong if a barrier call is cloned. This can happen in loop unrolling, loop unswitching and jump threading, currently. I believe multiple CL vendors have hacked ad-hoc checking in these three areas currently - it'd be nice to standardise this and reduce downstream hacks.

I'm proposing a new function attribute, "noclone", with the semantics that "calls to functions marked "noclone" cannot be cloned or duplicated into the same function.". That is, it is illegal to call J = I->clone() then attach J to the same basic block as I if I is marked "noclone".

This means that cloning whole functions (CloneFunction and CloneFunctionInto) will still work fine, but CloneBasicBlock with a new parent set equal to the old parent (i.e. cloning a block in the same function) will assert.

I have a proof of concept patch for this but it's slightly out of date, so I'll need to update it.

I'm envisaging a large group of people with torches and pitchforks walking menacingly towards me right now, so without further ado I'll hand over to them to tell me where I've gone wrong and why the idea is utterly braindead...

Cheers,

James

EXAMPLE
=======

Ralf Karrenberg proposed an algorithm which for a kernel like this:

kernel void k() {
if (x())
y();
barrier();
if (x())
z();
else
w();
}

split it up into sub-functions and would produce a state machine and a loop similar to this:

while (1) {
switch (state) {
case STATE_START:
for (x...) for (y...) for (z...)
state = kernel_START(x, y, z);
break;

case STATE_BARRIER1:
for (x...) for (y...) for (z...)
state = kernel_BARRIER1(x, y, z);
break;

case STATE_END:
return;
}
}

where every kernel sub-function (kernel_START and kernel_BARRIER1 in this example) return a new state.

Notice this relies upon all calls to either kernel_START or kernel_BARRIER1 returning the *same* next state. This is guaranteed by the OpenCL spec.

Let's apply jump threading to that kernel:

kernel void k() {
if (x()) {
y();
barrier();
z();
} else {
barrier();
w();
}
}

Oh dear. Now, we'd end up creating a state machine with four states - START, BARRIER1, BARRIER2 and END. It is no longer guaranteed that all workitems will hit the same barrier, because we've broken an invariant the user guaranteed. Our optimisation has been broken.

_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Krzysztof Parzyszek

unread,

Dec 1, 2012, 11:22:29 AM12/1/12

to llv...@cs.uiuc.edu

On 12/1/2012 10:02 AM, James Molloy wrote:
>
> This means that cloning whole functions (CloneFunction and CloneFunctionInto) will still work [...].

Unfortunately, it won't work.

Assume all threads call foo:

foo() {
...
bar(i)
...
}

bar(int i) {
...
barrier();
...
}

Now, suppose that we have discovered that bar(0) can be greatly
optimized and generate a call to the specialized version, bar_0:

foo() {
...
if (i == 0) bar_0();
else bar(i);
...
}

And now we have multiple threads that no longer have a common barrier.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

James Molloy

unread,

Dec 1, 2012, 11:36:13 AM12/1/12

to Krzysztof Parzyszek, llv...@cs.uiuc.edu

Hi Krzysztof,

Yes, however this can be solved in one of two ways:

1) Fully inline the call graph for all leaf functions that call the barrier intrinsic. This is done on several implementations as standard already, and "no call stack" is a requirement for Karrenberg's algorithm at least.

2) Apply the "noclone" attribute transitively such that if a function may transitively call the barrier intrinsic, it is marked "noclone".

Either of these methods allow the user to stop LLVM "breaking their IR. I'm aware that the general case with no user help (such as force-inlining, or otherwise controlling function cloning) is a very difficult problem. My intention is that there are no corner cases *with user assistance*. Currently there is no way to stop stuff breaking *even with* user assistance! :)

Cheers,

James
________________________________________
From: llvmdev...@cs.uiuc.edu [llvmdev...@cs.uiuc.edu] On Behalf Of Krzysztof Parzyszek [kpar...@codeaurora.org]
Sent: 01 December 2012 16:22
To: llv...@cs.uiuc.edu
Subject: Re: [LLVMdev] [RFC] "noclone" function attribute

Krzysztof Parzyszek

unread,

Dec 1, 2012, 11:51:55 AM12/1/12

to James Molloy, llv...@cs.uiuc.edu

On 12/1/2012 10:36 AM, James Molloy wrote:
>
> Either of these methods allow the user to stop LLVM "breaking their IR. I'm aware that the general case with no user help (such as force-inlining, or otherwise controlling function cloning) is a very difficult problem. My intention is that there are no corner cases *with user assistance*. Currently there is no way to stop stuff breaking *even with* user assistance! :)

I'm not against the idea, I was just pointing out that cloning of
functions can create problems.

Here's another thing. Imagine this code:

if (x > 0) {
barrier();
...
} else {
barrier();
...
}

Even though "barrier" may have side-effects, normally, it would be
possible to pull such a call out of the if-the-else statements. This
cannot happen with barriers, so the attribute would also mean
"don't-collapse" (as in "don't collapse multiple calls into one).

Here's another idea: internally translate calls to "barrier" without
arguments into calls to "barrier" with an argument that uniquely
identifies the call. The users wouldn't see it in their sources, and
the compiler/runtime could handle it in its own way.

Krzysztof Parzyszek

unread,

Dec 1, 2012, 11:58:06 AM12/1/12

to llv...@cs.uiuc.edu

On 12/1/2012 10:51 AM, Krzysztof Parzyszek wrote:
>
> Here's another thing. Imagine this code:
>
> if (x > 0) {
> barrier();
> ...
> } else {
> barrier();
> ...
> }
>
> Even though "barrier" may have side-effects, normally, it would be
> possible to pull such a call out of the if-the-else statements. This
> cannot happen with barriers, so the attribute would also mean
> "don't-collapse" (as in "don't collapse multiple calls into one).

On the second thought, this "commoning out" of barriers will most likely
be ok.

James Molloy

unread,

Dec 1, 2012, 11:54:35 AM12/1/12

to Krzysztof Parzyszek, llv...@cs.uiuc.edu

Hi,

> Here's another thing. Imagine this code:
>
> if (x > 0) {
> barrier();
> ...
> } else {
> barrier();
> ...
> }

That is actually fine. The spec is broken when you *split* a barrier call into two, not when you fuse two together. The spec says that all workitems must hit the same sequence of barriers - you can break that if you increase the number of barriers, but never if you fuse two together.

> Here's another idea: internally translate calls to "barrier" without
> arguments into calls to "barrier" with an argument that uniquely
> identifies the call. The users wouldn't see it in their sources, and
> the compiler/runtime could handle it in its own way.

Yep, I considered this. Problem is, "in its own way" for a CL GPU or CPU implementation following Karrenberg's model would be undoing the optimization that cloned the barrier. Either that or adding a call to another, "fused barrier" function which would be horribly messy and impossible on several GPU architectures (and, again, karrenberg's CPU model).

Undoing these optimizations is a near-impossible problem. Inhibiting them to start with is easier.

Thanks for finding these issues/corner cases - I'm certain there may be one or two that I've missed and can't explain away...

Cheers,

James
________________________________________
From: Krzysztof Parzyszek [kpar...@codeaurora.org]
Sent: 01 December 2012 16:51
To: James Molloy
Cc: llv...@cs.uiuc.edu

Subject: Re: [LLVMdev] [RFC] "noclone" function attribute

Krzysztof Parzyszek

unread,

Dec 1, 2012, 12:19:24 PM12/1/12

to llv...@cs.uiuc.edu

On 12/1/2012 10:02 AM, James Molloy wrote:
>

> I'm proposing a new function attribute, "noclone", with the semantics that "calls to functions marked "noclone" cannot be cloned or duplicated into the same function.". That is, it is illegal to call J = I->clone() then attach J to the same basic block as I if I is marked "noclone".

The class Loop has something similar in it:

/// isSafeToClone - Return true if the loop body is safe to clone in
practice.
/// Routines that reform the loop CFG and split edges often fail on
indirectbr.
bool Loop::isSafeToClone() const {
// Return false if any loop blocks contain indirectbrs.
for (Loop::block_iterator I = block_begin(), E = block_end(); I != E;
++I) {
if (isa<IndirectBrInst>((*I)->getTerminator()))
return false;
}
return true;
}

Maybe a similar interface could be added to Instruction, and an
instruction would declare itself unsafe to clone if it was a call to a
function with the attribute that you are proposing.

I could imagine that this may not the only example of an instruction
that should not be cloned.

I'd only suggest that the attribute is called something like
"noclonecalls", or (preferably) something shorter that clarifies that
it's the calls that shouldn't be cloned. :)

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

James Molloy

unread,

Dec 1, 2012, 12:25:41 PM12/1/12

to Krzysztof Parzyszek, llv...@cs.uiuc.edu

> Maybe a similar interface could be added to Instruction, and an
> instruction would declare itself unsafe to clone if it was a call to a
> function with the attribute that you are proposing.

I experimented with something similar to this, where Instruction::clone ensured it wasn't "noclone" - if it was, it asserted. But that broke the use-case of cloning whole functions.

My patch extends Loop::isSafeToClone to check if a callinst is contained which is "noclone".

I agree about the naming, but have yet to think of something more snappy :)

Cheers,

James
______________________________________
From: llvmdev...@cs.uiuc.edu [llvmdev...@cs.uiuc.edu] On Behalf Of Krzysztof Parzyszek [kpar...@codeaurora.org]
Sent: 01 December 2012 17:19
To: llv...@cs.uiuc.edu

Subject: Re: [LLVMdev] [RFC] "noclone" function attribute

Krzysztof Parzyszek

unread,

Dec 1, 2012, 12:40:55 PM12/1/12

to James Molloy, llv...@cs.uiuc.edu

On 12/1/2012 11:25 AM, James Molloy wrote:
>> Maybe a similar interface could be added to Instruction, and an
>> instruction would declare itself unsafe to clone if it was a call to a
>> function with the attribute that you are proposing.
>
> I experimented with something similar to this, where Instruction::clone ensured it wasn't "noclone" - if it was, it asserted. But that broke the use-case of cloning whole functions.

I guess the problem would be when the compiler wants to clone the
instruction with the plan of deleting the original later. In such a
scenario there could be temporarily (and legitimately) multiple calls to
the barrier. An example could be inlining of the only call to a static
foo, which contains a call to "barrier".

Checking of issues with split barriers could be done by the module
verifier. We would need to make all the calls to barrier unique at the
beginning (for example, by passing a fake, compiler-generated identifier
as an argument). The module verifier could check that in a given module
there are no multiple calls to "barrier" that have the same identifier.

James Molloy

unread,

Dec 1, 2012, 12:50:17 PM12/1/12

to Krzysztof Parzyszek, llv...@cs.uiuc.edu

> Checking of issues with split barriers could be done by the module
> verifier. We would need to make all the calls to barrier unique at the
> beginning (for example, by passing a fake, compiler-generated identifier
> as an argument). The module verifier could check that in a given module
> there are no multiple calls to "barrier" that have the same identifier.

I like that idea, as long as we can find a way to implement it that isn't a hack in any way. The IR doesn't know about "barriers" (hence the concept of a noclone attribute), and we can't make the module verifier be stateful (knowing what noclone insts were around last time, comparing it with this time).

So I'm not sure how we'd do this without it being a hack in some way...
________________________________________
From: Krzysztof Parzyszek [kpar...@codeaurora.org]
Sent: 01 December 2012 17:40
To: James Molloy
Cc: llv...@cs.uiuc.edu

Subject: Re: [LLVMdev] [RFC] "noclone" function attribute

Kuperstein, Michael M

unread,

Dec 2, 2012, 2:49:42 AM12/2/12

to James Molloy, llv...@cs.uiuc.edu

I definitely support this.

In fact we were about to send a very similar proposal. The main difference I can see between this proposal and ours was that we named the attribute "noduplicate".
I graciously defer to James on the bikeshade color issue.

Michael

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Chris Lattner

unread,

Dec 3, 2012, 1:11:59 AM12/3/12

to Kuperstein, Michael M, James Molloy, llv...@cs.uiuc.edu

On Dec 1, 2012, at 11:49 PM, "Kuperstein, Michael M" <michael.m....@intel.com> wrote:

> I definitely support this.
>
> In fact we were about to send a very similar proposal. The main difference I can see between this proposal and ours was that we named the attribute "noduplicate".
> I graciously defer to James on the bikeshade color issue.

Yes, this sort of functionality is useful. A few requests though:
1) please name it "noduplicate". "cloning" has other naming implications in llvm related to function bodies, but calls to a noduplicate function should not be duplicated in any way (e.g. tail duplication, loop unrolling, etc).
2) please have the llvm/Analysis/CodeMetrics.h code consider them to be unduplicatable (generalizing the containsIndirectBr bit).
3) Please change random parts of the compiler to use CodeMetrics, instead of scattering random checks for this attribute throughout the code. Anything duplicating code and not using CodeMetrics is just plain incorrect.

-Chris

Nadav Rotem

unread,

Dec 3, 2012, 1:27:48 AM12/3/12

to Chris Lattner, llv...@cs.uiuc.edu

On Dec 2, 2012, at 10:11 PM, Chris Lattner <clat...@apple.com> wrote:

3) Please change random parts of the compiler to use CodeMetrics, instead of scattering random checks for this attribute throughout the code. Anything duplicating code and not using CodeMetrics is just plain incorrect.

One problem that we may run into when using CodeMetrics is compile time. In many cases we are looking for one particular trait and can exit as soon as we find it without having to scan the entire basic block or function. For example, the jump threading pass stops scanning additional instructions as soon as it passes the cost threshold.

Chandler Carruth

unread,

Dec 3, 2012, 2:17:58 AM12/3/12

to Nadav Rotem, LLVM Developers Mailing List

The inline cost analysis has similar problems, but compounded: we need to share the walk of instructions between cost computation and checking for incompatible patterns for inlining.

A long standing todo on my list has been to factor code metrics into an API that inline cost analysis could actually use without interfering with the instruction visit pattern of that cost analysis. I haven't had time to really make progress, in part because I don't (yet) have any particularly good ideas....

Chris Lattner

unread,

Dec 3, 2012, 11:08:28 AM12/3/12

to Chandler Carruth, LLVM Developers Mailing List

Good points, it seems like the right API would be to have a CodeMetrics objects that instructions can be "pushed" into during an exterior walk, and then the cost could be queried after each instruction is added.

-Chris

James Molloy

unread,

Dec 3, 2012, 12:48:02 PM12/3/12

to Nadav Rotem, llv...@cs.uiuc.edu

Hi,

Thanks for the pointers. My patch now calls the attribute "noduplicate",
and updates CodeMetrics to have another field:

bool notDuplicatable;

Which semantically is "containsIndirectBr || containsNoDuplicateInst". I
didn't repurpose containsIndirectBr because I felt what I'm looking for
is sufficiently different (indirectbr inhibits inlining, whereas
noduplicate does not, if there is one call site).

I still need to ensure InlineCost is correct; patch will be incoming
tomorrow morning.

All uses use CodeMetrics, except for JumpThreading because it has an
early-exit mechanism as Nadav has explained.

Cheers,

James

Chris Lattner

unread,

Dec 3, 2012, 6:46:46 PM12/3/12

to James Molloy, llv...@cs.uiuc.edu

On Dec 3, 2012, at 9:48 AM, James Molloy <James....@arm.com> wrote:

> Hi,
>
> Thanks for the pointers. My patch now calls the attribute "noduplicate",
> and updates CodeMetrics to have another field:
>
> bool notDuplicatable;
>
> Which semantically is "containsIndirectBr || containsNoDuplicateInst". I
> didn't repurpose containsIndirectBr because I felt what I'm looking for
> is sufficiently different (indirectbr inhibits inlining, whereas
> noduplicate does not, if there is one call site).

I'm pretty sure that it's fine to inline indirectbr if there is a single call site and the inlinee is to be deleted.

-Chris

James Molloy

unread,

Dec 4, 2012, 12:23:27 PM12/4/12

to Chris Lattner, llvm-commits, llv...@cs.uiuc.edu

Hi all + llvm-commits,

After the discussion below, please find attached my patch to add a new
"noduplicate" function attribute.

I've modified CodeMetrics and LoopInfo, which covers most cases, but
JumpThreading and InlineCost don't use CodeMetrics yet, so they required
changing manually.

Cheers,

James

On Mon, 2012-12-03 at 23:46 +0000, Chris Lattner wrote:
> On Dec 3, 2012, at 9:48 AM, James Molloy <James....@arm.com> wrote:
>
> > Hi,
> >
> > Thanks for the pointers. My patch now calls the attribute "noduplicate",
> > and updates CodeMetrics to have another field:
> >
> > bool notDuplicatable;
> >
> > Which semantically is "containsIndirectBr || containsNoDuplicateInst". I
> > didn't repurpose containsIndirectBr because I felt what I'm looking for
> > is sufficiently different (indirectbr inhibits inlining, whereas
> > noduplicate does not, if there is one call site).
>
> I'm pretty sure that it's fine to inline indirectbr if there is a single call site and the inlinee is to be deleted.
>
> -Chris
>
>

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

noduplicate.diff

Kuperstein, Michael M

unread,

Dec 6, 2012, 2:47:45 AM12/6/12

to James Molloy, Chris Lattner, llvm-commits, llv...@cs.uiuc.edu

I'm not sure I agree with the semantics this patch creates.

The way I see it, there are two options:
1) Make "noduplicate" non-transitive and forbid inlining when there are multiple callsites.
2) Allow inlining, but make the attribute transitive.

Consider the following code, where barrier() is marked noduplicate.

kernel void k() {
if (x())
y();

b();

if (x())
z();
else
w();
}

void b() {
barrier();
}

Both inlining and threading b() is clearly illegal. The question is - which of the two should be forbidden. The current patch, assuming I understand it correctly, will:
i) allow inlining but forbid threading after inlining. This is clearly fine.
ii) allow threading but forbid inlining after threading.
I am not at all sure option (ii) is desirable.

On the other hand, I believe we'd like to allow inlining in the following case:

void f(int foo) {
if (foo)
b();
else
b();
}

void b() {
barrier();
}

Basically, the issue here is a bit philosophical - is an instruction reached through code paths that go through different callsites considered the "same" instruction or not.
If it is the same, then we cannot inline multiple call-sites, but don't need to make noduplicate transitive, since if you have a() -> b() -> barrier(), it doesn't matter what you do with calls to a(), it's still the same barrier().
If it isn't, inlining is always allowed (I think - James, do you have an example that breaks?), but the attribute must be transitive.

I feel the right thing(TM) is to consider these instructions different, which leads to "transitive, inlining allowed", as opposed to the "non-transitive, no inlining" approach that the patch takes.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

James Molloy

unread,

Dec 6, 2012, 6:04:32 AM12/6/12

to Kuperstein, Michael M, llvm-commits, llv...@cs.uiuc.edu

Hi Michael,

After some head-scratching and discussion with our tame Khronos member,
I agree with you.

It comes down to the interpretation of the ambiguous spec. It refers to
"the barrier", implying there is some sort of equivalence relation over
barriers. The question is, what is that equivalent relation? In your
example code:

> void f(int foo) {
> if (foo)
> b();
> else
> b();
> }
>
> void b() {
> barrier();
> }
>

After talking with others, I now see that the spirit of the spec is that
the barrier() as reached through the True condition is not the same as
the barrier as reached through the False condition, even though it is
the same statement as written in CL-C.

So inlining should not be restricted, as you say.

With regards transitivity, I'd say that it is the user's responsibility
to ensure that the noduplicate attribute is transitively applied
correctly before handing it to the optimizer. Otherwise, we'd need
costly checks every time a user asks the question "is this function
noduplicate?"

So my proposed change is to remove the restrictions on inlining and
update the LangRef to say that the attribute should be transitive but it
is the IR producer's responsibility to ensure that.

Does that sound about right?

Cheers,

James

Kuperstein, Michael M

unread,

Dec 7, 2012, 12:40:52 PM12/7/12

to James Molloy, llvm-commits, llv...@cs.uiuc.edu

Sounds good to me.
I'm not sure the solution for transitivity is optimal, but it's a good compromise.

James Molloy

unread,

Dec 10, 2012, 12:05:42 PM12/10/12

to Kuperstein, Michael M, llvm-commits, llv...@cs.uiuc.edu

Hi,

Changes made:
* The inliner is no longer affected whatsoever.
* The verifier will now ensure that the transitive closure of the
callers of a noduplicate function all are noduplicate too.
* LangRef updated.
* Verifier test added.

Please review!

Cheers,

James

noduplicate.diff

Reply all

Reply to author

Forward