[llvm-dev] Memory barrier problem

213 views
Skip to first unread message

Kaylor, Andrew via llvm-dev

unread,
Jan 27, 2021, 12:50:20 PM1/27/21
to llvm...@lists.llvm.org

Hi everyone,

 

I have a problem with multi-threaded memory synchronization that I’d like to get some input on.

 

Consider the following IR:

 

------------

 

define void @bar() convergent {

  fence acq_rel

  ret void

}

 

define i32 @foo(i32* noalias %p, i32 %flag) {

entry:

  store i32 0, i32* %p

  call void @bar()

  %cmp = icmp eq i32 %flag, 0

  br i1 %cmp, label %if.then, label %if.end

 

if.then:

  store i32 1, i32* %p

  br label %if.end

 

if.end:

  call void @bar()

  %x = load i32, i32* %p

  ret i32 %x

}

 

------------

 

I have an argument (%p) which is marked with the ‘noalias’ attribute. The memory pointed to by this argument is read, written, and read again within the function. Between these accesses, I am calling a function that contains a fence instruction. If that call with the fence is not inlined, GVN will eliminate the second load.

 

------------

 

define i32 @foo(i32* noalias %p, i32 %flag) {

entry:

  store i32 0, i32* %p, align 4

  call void @bar()

  %cmp = icmp eq i32 %flag, 0

  br i1 %cmp, label %if.then, label %if.end

 

if.then:

  store i32 1, i32* %p, align 4

  br label %if.end

 

if.end:

  %x = phi i32 [ 1, %if.then ], [ 0, %entry ]   ; <============== Incorrect

  call void @bar()

  ret i32 %x

}

 

------------

 

https://godbolt.org/z/14o8oY

 

This is a reduction of a scenario I’ve come across in a SYCL program. The bar() function corresponds to a work group barrier that is meant to have the memory synchronizing effect described by the fence instruction in my example. I’m trying to figure out how to construct LLVM IR that will represent the semantics I need.

 

If I remove the ‘noalias’ attribute from the argument, GVN won’t make this optimization because it conservatively assumes that the memory might be modified within the called function. That’s fine, but I think it fixes the problem for the wrong reason. In fact, the memory location is not modified in the called function and as I understand it the ‘noalias’ attribute only guarantees that the memory won’t be accessed *in the current thread* using pointers that aren’t based on the ‘noalias’ pointer. So, the fact that it might be modified by another thread shouldn’t invalidate the ‘noalias’ attribute. Is that correct?

 

I can also block the GVN optimization by putting the fence instruction directly in the foo() function, such as by inlining the call to bar(). But, of course, the semantics of the IR should not depend on whether or not I’ve inlined functions. In this case the inlining is trivial, but the problem potentially exists for a called function that uses a barrier in a way that is not so immediately visible.

 

I put the ‘convergent’ attribute on my bar() function mostly to demonstrate that this doesn’t solve the problem. As I understand it, the ‘convergent’ attribute describes control flow constraints and says nothing about memory access synchronization. Is that correct?

 

 

Is there a way to handle this case? I have some ideas, but I’d like to start by just posing the question to see if there are better avenues available than I’ve considered.

 

Thanks,

Andy

 

Johannes Doerfert via llvm-dev

unread,
Jan 27, 2021, 1:54:46 PM1/27/21
to Kaylor, Andrew, llvm...@lists.llvm.org

So far, I don't think we have a proper way to handle this. The argument
was, the user code is wrong because multiple threads wrote the variable
which violated restrict. As we added deduction of noalias we run into
this again. What we proposed, but haven't tried to upstream yet, is to
provide explicit uses of restrict pointers. See Chapter 3 in
https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf

I'd be very interested in discussing this further, a little short on
time right now though.

~ Johannes

P.S. There was a bug report with ample of related discussion, I to look
for it again, maybe Eli remembers.


>
> Thanks,
> Andy
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Kaylor, Andrew via llvm-dev

unread,
Jan 27, 2021, 3:35:06 PM1/27/21
to Johannes Doerfert, llvm...@lists.llvm.org
Thanks, Johannes!

It looks like the bug you were referring to is https://bugs.llvm.org/show_bug.cgi?id=41781.

I see there that Eli is asserting that 'restrict' (and therefore 'noalias') applies to memory accesses in any thread. I was assuming otherwise. If I remove the 'noalias' attribute there are no problems with my example, but this would also mean the potential loss of local optimizations that would otherwise be possible in more complicated cases.

In the case I was starting from, the 'noalias' attribute was something our compiler derived based on its knowledge of the SYCL rules. Within the kernel, we know the pointer appearing as an argument here won't alias with any other pointers in the kernel, but nothing prevents other instances of the kernel from modifying the same memory. Hence, the barriers to synchronize the accesses.

I'll have to read the relevant section of your paper a few more times to fully grasp what you are saying there. It's clear that you are addressing the same general problem I'm looking at here. I wasn't clear on first reading whether you are saying that the restrict/noalias attribute could be employed for this optimization (possibly with additional constructs to manage the synchronization) or whether you meant that something entirely different than restrict/noalias was needed.

-Andy

Johannes Doerfert via llvm-dev

unread,
Jan 27, 2021, 5:26:28 PM1/27/21
to Kaylor, Andrew, llvm...@lists.llvm.org
Long story short, we have 3 general options forward:

1) Introduce/Leverage an attribute and make it "stronger" than the
`noalias` guarantee we have right now. This is not a great option
because we would need to scan for intermediate instructions with this
property in order to utilize `noalias`.
2) We go with explicit uses of the `noalias` pointer as proposed in [0].
The downside is that we need to add them when we inline and such. We
also need to add them to synchronizing instructions or declare the
behavior of GVN right now as somewhat sane. So we say that synchronizing
instructions in the current function are stronger than `noalias` but
that brings us back to the problem with 1) which I really dislike. I
guess, operand bundles for instructions would solve this in a reasonably
nice way.
3) Introduce a weaker version of `noalias` with the drawbacks of 1) but
at least __restrict__ pointers don't inherit the problem (as it is a
user error to synchronize via restrict pointers).
[4) Don't deduce `noalias` if there might be synchronization in the
scope/function. This is what the Attributor does right now:
https://github.com/llvm/llvm-project/blob/fb12df4a8e33d759938057718273dfb434b2d9c4/llvm/lib/Transforms/IPO/AttributorAttributes.cpp#L2458]

I'm interested in other thoughts on this.

~ Johannes


[0] https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf

Saito, Hideki via llvm-dev

unread,
Jan 29, 2021, 1:56:41 PM1/29/21
to Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org

Disclaimer: I've been focusing on vectorization (single thread problem) for the last 10+yrs. Please read this with grain of rust. 😊

The scheme proposed in the Page 4 Fig 1(d) of [0] may not work in general problems, as is. Some code may not have bar(). Then, one may certainly argue that why bother splitting this into two critical sections. There are values in keeping critical section sizes as small as possible, and it's not too hard to imagine two such critical sections touching the same variable along with some other variables.

if (omp_critical_start(tid)) {
*p += 1;
omp_critical_end(tid);
}
bar()[p]; // May "use" pointer p.
if (omp_critical_start(tid)) {
*p *= 2;
omp_critical_end(tid);
}

In the absence of user-code bar(), if we are to take the spirit of the proposal, we would inject a pseudo function call (appropriately name it as __memory_fence() 😊) and it would look like

if (omp_critical_start(tid)) {
*p += 1;
omp_critical_end(tid);
}
__memory_fence()[p]; // May "use" pointer p.
if (omp_critical_start(tid)) {
*p *= 2;
omp_critical_end(tid);
}

We would then proceed to argue with the following improvement of it

if (omp_critical_start(tid)) {
*p += 1;
omp_critical_end(tid);
}
__memory_fence_release()[p]; // May "use" pointer p.
;
__memory_fence_acquire()[p]; // May "use" pointer p.
if (omp_critical_start(tid)) {
*p *= 2;
omp_critical_end(tid);
}

We then might as well talk about improving LLVM IR fence instruction definition.

My preference is to interpret "noalias/restrict to be applicable to all threads, but only between one fence to the next (per dynamic control flow sense)". However, if most developers think " noalias/restrict to be effective to all threads for the entire scope of the variable/parameter" && we can make such fence based dependency explicit like in the spirit of the proposal in [0] (and as I outlined above), it would be good to discuss the long term pros/cons of the both approaches. For the optimizers, the former can't really be based on dynamic control flow based. As such, it would end up in some conservative static scoping based enforcement. The latter, if implementable, may have a better reflection of the mem-fence requirement and control flow, which may lead to better optimization.

Thanks,
Hideki

Johannes Doerfert via llvm-dev

unread,
Jan 29, 2021, 2:03:56 PM1/29/21
to Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org

On 1/29/21 12:56 PM, Saito, Hideki wrote:
> Disclaimer: I've been focusing on vectorization (single thread problem) for the last 10+yrs. Please read this with grain of rust. 😊
>
> The scheme proposed in the Page 4 Fig 1(d) of [0] may not work in general problems, as is. Some code may not have bar(). Then, one may certainly argue that why bother splitting this into two critical sections. There are values in keeping critical section sizes as small as possible, and it's not too hard to imagine two such critical sections touching the same variable along with some other variables.
>
> if (omp_critical_start(tid)) {
> *p += 1;
> omp_critical_end(tid);
> }
> bar()[p]; // May "use" pointer p.
> if (omp_critical_start(tid)) {
> *p *= 2;
> omp_critical_end(tid);
> }
>
> In the absence of user-code bar(), if we are to take the spirit of the proposal, we would inject a pseudo function call (appropriately name it as __memory_fence() 😊) and it would look like

Could you explain why we would need a fence (or anything for that
matter) if there is no user code bar()?
The way I see it, if there is no user code that might be a
synchronization event, then there is no problem.
The noalias can just apply to the pointer and we know no other thread
will interfere with the memory in the
scope because it would be a race. Does that make sense?

~ Johannes

Saito, Hideki via llvm-dev

unread,
Jan 29, 2021, 2:33:14 PM1/29/21
to Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org

>Could you explain why we would need a fence (or anything for that
>matter) if there is no user code bar()?
>The way I see it, if there is no user code that might be a synchronization event, then there is no problem.
>The noalias can just apply to the pointer and we know no other thread will interfere with the memory in the scope because it would be a race. Does that make sense?

As soon as the critical section ends, some other thread can modify the value of *p (under its critical section), prior to the beginning of the next critical section. If OS decides to context switch between omp_critical_end() and omp_critical_start(), that would be eternity from application SW perspective. Threads aren't executing in lock-step.

Thanks,
Hideki

-----Original Message-----
From: Johannes Doerfert <johannes...@gmail.com>
Sent: Friday, January 29, 2021 11:04 AM
To: Saito, Hideki <hideki...@intel.com>; Kaylor, Andrew <andrew...@intel.com>
Cc: llvm...@lists.llvm.org; Eli Friedman <efri...@quicinc.com>
Subject: Re: [llvm-dev] Memory barrier problem


Johannes Doerfert via llvm-dev

unread,
Jan 30, 2021, 6:02:37 PM1/30/21
to Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org
Hi Saito,

thanks for pointing this out. I agree now, the reasoning was not
"complete" because I focused on the example too much.


On 1/29/21 1:33 PM, Saito, Hideki wrote:
>> Could you explain why we would need a fence (or anything for that
>> matter) if there is no user code bar()?
>> The way I see it, if there is no user code that might be a synchronization event, then there is no problem.
>> The noalias can just apply to the pointer and we know no other thread will interfere with the memory in the scope because it would be a race. Does that make sense?
> As soon as the critical section ends, some other thread can modify the value of *p (under its critical section), prior to the beginning of the next critical section. If OS decides to context switch between omp_critical_end() and omp_critical_start(), that would be eternity from application SW perspective. Threads aren't executing in lock-step.

Disclaimer: I thought there is no problem, and I still believe there is
none for the example given in the paper and discussed here before.
I explain why I think there is no problem in this example below.
Afterwards I make the example slightly more complicated and argue
why it is now broken if we do not also add "operand bundle uses" to the
`om_critical_` calls which can, after all, also synchronize and be
observed by the user.

This is only relevant for the example and only serves to explain why I
thought there was no problem:
Even if a second thread updates *p between the first `omp_critical_end`
and the second `omp_critical_start`, that even is not synchronized with
the thread running this method.
Let's call that thread T0 and the other one T1 and assume T1 will write
5, we then have 3 critical sections to argue about:

T0: A: critical{*p = *p + 1}
    B: critical{*p = *p * 2}
T1: critical{*p = 5}

Now we can have 3 different interleavings from a observer perspective:

1: T0A T0B T1
2: T0A T1 T0B
3: T1 T0A T0B

But since there is no synchronization between T0 and T1, the user cannot
argue any of them is less correct than the others.
So if we utilize the `noalias` guarantee to replace the load in T0B with
the value stored in T0A, we basically made interleaving
2 impossible and it would instead "become" interleaving 3. However, if
there is a barrier-like effect between T0A and T0B, the
user can determine that T0A should have happened already, basically
distinguish interleaving 2 from 3. That brings me to the example
where this logic falls apart:

if (omp_critical_start(tid)) {
*p += 1;

*flag = 1
omp_critical_end(tid);


}
if (omp_critical_start(tid)) {
*p *= 2;
omp_critical_end(tid);
}

Since the user can now observe if T0A was already performed, by checking
`flag`, they can distinguish interleaving 2 and 3 which makes
the forwarding illegal. The problem is that not only "bar" but also
`omp_critical_{start/end}` synchronize threads. We need to add the
"fake uses" to those calls as well. This basically will keep the pointer
update in the critical section, which is conservatively correct.

I hope this made some sense.

~ Johannes

Saito, Hideki via llvm-dev

unread,
Feb 1, 2021, 2:43:15 AM2/1/21
to Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org

>The problem is that not only "bar" but also `omp_critical_{start/end}` synchronize threads. We need to add the "fake uses" to those calls as well.

Yes, indeed, the following two operations need to be "atomic".

>> omp_critical_end(tid);
>> }
>> __memory_fence_release()[p]; // May "use" pointer p.

and so are

>> __memory_fence_acquire()[p]; // May "use" pointer p.
>> if (omp_critical_start(tid)) {

Now, going back to Andy's example. We either need to make bar() act like a fence, or sandwich it with fences (with "fake uses" if your proposal is adopted) --- from the caller perspective. Letting bar have a fence inside isn't enough, from the caller side optimization purpose. (Fence inside bar() would work from the execution on HW perspective, if compiler hasn't already messed it up.)

Johannes Doerfert via llvm-dev

unread,
Feb 1, 2021, 9:57:19 AM2/1/21
to Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org

On 2/1/21 1:43 AM, Saito, Hideki wrote:
>> The problem is that not only "bar" but also `omp_critical_{start/end}` synchronize threads. We need to add the "fake uses" to those calls as well.
> Yes, indeed, the following two operations need to be "atomic".
>
>>> omp_critical_end(tid);
>>> }
>>> __memory_fence_release()[p]; // May "use" pointer p.
> and so are
>
>>> __memory_fence_acquire()[p]; // May "use" pointer p.
>>> if (omp_critical_start(tid)) {
> Now, going back to Andy's example. We either need to make bar() act like a fence, or sandwich it with fences (with "fake uses" if your proposal is adopted) --- from the caller perspective. Letting bar have a fence inside isn't enough, from the caller side optimization purpose. (Fence inside bar() would work from the execution on HW perspective, if compiler hasn't already messed it up.)

The thing is, I still don't understand why an explicit fence causes LLVM
to avoid the optimization based on `noalias` but an outlined fence
doesn't. That is not consistent *at all*. So let's not try to build
something on top of that. I prefer fake uses but a weakened `noalias`
can be discussed too.

Saito, Hideki via llvm-dev

unread,
Feb 1, 2021, 12:34:50 PM2/1/21
to Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org

>I still don't understand why an explicit fence causes LLVM to avoid the optimization based on `noalias` but an outlined fence doesn't. That is not consistent *at all*.

Me too.

>So let's not try to build something on top of that. I prefer fake uses but a weakened `noalias` can be discussed too.

Maybe, but if we can fix this problem

>>>>> define void @bar() convergent {
>>>>> fence acq_rel
>>>>> ret void
>>>>> }

as below, by an ability to mark any call a potential fence, by a function attribute, that's what I prefer. It could be that there was an underlying assumption (or misconception) that the convergent attribute would suffice, but it apparently does not. Revisiting there seems worthwhile.

define void @bar() convergent fence.acq_rel {
fence acq_rel
ret void

Johannes Doerfert via llvm-dev

unread,
Feb 1, 2021, 1:08:00 PM2/1/21
to Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org

On 2/1/21 11:34 AM, Saito, Hideki wrote:
>> I still don't understand why an explicit fence causes LLVM to avoid the optimization based on `noalias` but an outlined fence doesn't. That is not consistent *at all*.
> Me too.
>
>> So let's not try to build something on top of that. I prefer fake uses but a weakened `noalias` can be discussed too.
> Maybe, but if we can fix this problem
>
>>>>>> define void @bar() convergent {
>>>>>> fence acq_rel
>>>>>> ret void
>>>>>> }
> as below, by an ability to mark any call a potential fence, by a function attribute, that's what I prefer. It could be that there was an underlying assumption (or misconception) that the convergent attribute would suffice, but it apparently does not. Revisiting there seems worthwhile.
>
> define void @bar() convergent fence.acq_rel {
> fence acq_rel
> ret void
> }

If that is what you want, make `nosync` "required" for `noalias` "over
calls". I mean, everything needs to start as "may have fence" by default
and it's not only "fence", but any sync instruction. So, `noalias`
pointers are impacted by potentially synchronizing calls.
I am not sure this is what we want, in parts because it degrades
optimization potential for all restrict pointers, even in the non threaded
environment. I was thinking we keep user `__restrict__` pointer as they
are, if you break synchronization it's a user fault.
That said, if we derive `noalias` we need to ensure we don't break
synchronization. "Fake uses" is one way but probably not the only one.

Saito, Hideki via llvm-dev

unread,
Feb 2, 2021, 1:30:35 PM2/2/21
to Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org

>If that is what you want, make `nosync` "required" for `noalias` "over calls". I mean, everything needs to start as "may have fence" by default and it's not only "fence", but any sync instruction. So, `noalias` >pointers are impacted by potentially synchronizing calls.
>I am not sure this is what we want, in parts because it degrades optimization potential for all restrict pointers, even in the non threaded environment. I was thinking we keep user `__restrict__` pointer as >they are, if you break synchronization it's a user fault.
>That said, if we derive `noalias` we need to ensure we don't break synchronization. "Fake uses" is one way but probably not the only one.

I think (and others may disagree) it makes sense to let the programmer decide whether they want a thread safe program or thread unsafe (or rather single thread optimized) program (e.g., a flag, pragma, per-function attribute, etc.). I do not believe compiler can resolve that problem on its own. If we go for that direction, there will be a question of what the default should be and my preference on the default is go conservative (i.e., go safer versus more optimizable). That might mean existing code may get compiled into slower executing code until a new flag ("single thread optimized") is added. [This is from "threading is already a norm" thinking.]

My preference on the use of fence comes from the desire to let compiler implementation match with the theory that you learn at school (Educators, please chime in) and also how the HW works. If for some reason, fence approach doesn't work well with LLVM framework today, we should have a good documentation on how that conclusion was drawn, adopt plan-B, and once in a while revisit the viability. I'm not totally opposed on the alternatives since we have a problem that needs a resolution quickly enough. I like theoretical clarity, but I can't fight for it forever.

W.r.t. restrict, I'd like to hear more from the language lawyers on their original intent when the language construct was born and the current interpretation of it in the presence of threading.

Johannes Doerfert via llvm-dev

unread,
Feb 2, 2021, 2:05:22 PM2/2/21
to Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org

On 2/2/21 12:30 PM, Saito, Hideki wrote:
>> If that is what you want, make `nosync` "required" for `noalias` "over calls". I mean, everything needs to start as "may have fence" by default and it's not only "fence", but any sync instruction. So, `noalias` >pointers are impacted by potentially synchronizing calls.
>> I am not sure this is what we want, in parts because it degrades optimization potential for all restrict pointers, even in the non threaded environment. I was thinking we keep user `__restrict__` pointer as >they are, if you break synchronization it's a user fault.
>> That said, if we derive `noalias` we need to ensure we don't break synchronization. "Fake uses" is one way but probably not the only one.
> I think (and others may disagree) it makes sense to let the programmer decide whether they want a thread safe program or thread unsafe (or rather single thread optimized) program (e.g., a flag, pragma, per-function attribute, etc.). I do not believe compiler can resolve that problem on its own. If we go for that direction, there will be a question of what the default should be and my preference on the default is go conservative (i.e., go safer versus more optimizable). That might mean existing code may get compiled into slower executing code until a new flag ("single thread optimized") is added. [This is from "threading is already a norm" thinking.]

Too many nobs is not helpful either. As said before, we should keep
`__restrict__` as it is and deal with "deduced" `noalias` properly. How
we do that is another question though.


>
> My preference on the use of fence comes from the desire to let compiler implementation match with the theory that you learn at school (Educators, please chime in) and also how the HW works. If for some reason, fence approach doesn't work well with LLVM framework today, we should have a good documentation on how that conclusion was drawn, adopt plan-B, and once in a while revisit the viability. I'm not totally opposed on the alternatives since we have a problem that needs a resolution quickly enough. I like theoretical clarity, but I can't fight for it forever.

I don't really follow this argument, I think I simply don't understand
it. What do you mean by "the use of fence"? Adding explicit fence
instructions somewhere?


>
> W.r.t. restrict, I'd like to hear more from the language lawyers on their original intent when the language construct was born and the current interpretation of it in the presence of threading.

I would have assumed `__restrict` predates "common" multi-processing in
C. Since the language of restrict is to this day implying other threads
cannot access those pointers, I would not dare to argue we should weaken
it in order to deduce `noalias`.

Saito, Hideki via llvm-dev

unread,
Feb 2, 2021, 3:01:46 PM2/2/21
to Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org

>I don't really follow this argument, I think I simply don't understand it. What do you mean by "the use of fence"? Adding explicit fence instructions somewhere?

In the context of OpenMP example below, add fence attribute to omp_critical_start/end().
For the SPMD programming example below, add fence attribute to bar().
If we add explicit fence instructions separate from those calls, I don't think they can be made "quasi atomic" with those calls.

What I'm advocating is that they are cross-thread synchronization points and thus we should mark them just as such (i.e., fence!). Compiler should understand how to deal with cross thread synchronization points. For me, doing something else is a workaround (sorry if it sounds harsh) ---- even if it's a practical workaround that may be the only deployable solution today.

>Since the language of restrict is to this day implying other threads cannot access those pointers

If language lawyers are standing firm here (which might require changes in the definition wording, to better accommodate SPMD programing model), I think we should lobby for a similar but multi-thread friendly "restrict"-ness to the C/C++ language standard. Asking programmers to drop "restrictness" will lead to missed optimization opportunities.

I think I've stated enough of my points. I'd like to hear more from others.

Johannes Doerfert via llvm-dev

unread,
Feb 2, 2021, 3:11:05 PM2/2/21
to Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org

On 2/2/21 2:00 PM, Saito, Hideki wrote:
>> I don't really follow this argument, I think I simply don't understand it. What do you mean by "the use of fence"? Adding explicit fence instructions somewhere?
> In the context of OpenMP example below, add fence attribute to omp_critical_start/end().
> For the SPMD programming example below, add fence attribute to bar().
> If we add explicit fence instructions separate from those calls, I don't think they can be made "quasi atomic" with those calls.
>
> What I'm advocating is that they are cross-thread synchronization points and thus we should mark them just as such (i.e., fence!). Compiler should understand how to deal with cross thread synchronization points. For me, doing something else is a workaround (sorry if it sounds harsh) ---- even if it's a practical workaround that may be the only deployable solution today.

I guess I just don't get why you use "fence" equivalent to
"synchronization point". Maybe it's just a language mismatch on my end.

Again, everything that is not `nosync` is potentially a cross thread
synchronization point. There is no need to "opt-in" to such a thing,
it's the default.

~ Johannes

Saito, Hideki via llvm-dev

unread,
Feb 2, 2021, 4:20:43 PM2/2/21
to Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org

>I guess I just don't get why you use "fence" equivalent to "synchronization point". Maybe it's just a language mismatch on my end.

I agree there is a difference. Synchronization doesn't necessarily mean memory based synchronization.

>Again, everything that is not `nosync` is potentially a cross thread synchronization point. There is no need to "opt-in" to such a thing, it's the default.

If "lack of nosync" doesn't imply fence (not clear in LangRef), we need explicit fence semantics (or some other mechanism) to force write-back/re-read. Hopefully, we do not have to
conservatively generate a HW fence instruction (i.e., somewhere in the callee code path does it where it is really needed). "Fake use" might force write-back/re-read, but it won't necessarily imply "communicating with other threads" (which fence does). That's why I call it a workaround. If "lack of nosync" implies fence and it is ignored by optimizers due to the presence of noalias && if we think noalias definition won't be made multithread friendly, we need to come up with a new flavor of noalias that is multithread friendly so that we won't lose optimization opportunities within the code between two "fences".

If you are coming from multithread unfriendly "noalias", what is the logical explanation to add "fake use" of such noalias pointers to the call?

Anyway, long story short:
* At language level: Restrict definition is unlikely to change. We need something new that is thread friendly.
* At LLVM IR level: We need two flavors of noalias. Thread friendly and thread unfriendly.
* We need to clarify what "lack of nosync" means. Does it imply fence? If not, how do we force write-back/re-read?

Jeroen Dobbelaere via llvm-dev

unread,
Feb 3, 2021, 1:44:58 PM2/3/21
to Saito, Hideki, Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org
> >> W.r.t. restrict, I'd like to hear more from the language lawyers on their
> original intent when the language construct was born and the current
> interpretation of it in the presence of threading.
> > I would have assumed `__restrict` predates "common" multi-processing in C.
> Since the language of restrict is to this day implying other threads cannot
> access those pointers, I would not dare to argue we should weaken it in order
> to deduce `noalias`.
> >
> > ~ Johannes
> >

Having interacted recently with wg14 to get a better understanding about some of the
corner cases around restrict, I can add the following:

One way to look at a restrict pointer[1], is as if you get a local array.
That means that following code:

void foo_a(int *restrict rpDest, int *restrict rpSrc, int n) {
for (int i=0; i<n; ++i)
rpDest[i] = rpSrc[i]+1;
}

is allowed to behave as if it was written as follows:
void foo_b(int *pDest, int *pSrc, int n) {
int localDest[n];
int localSrc[n];
memcpy(&localDest[0], pDest, n*sizeof(int));
memcpy(&localSrct[0], pSrc, n*sizeof(int));
for (int i=0; i<n; ++i)
localDest[i] = localSrc[i]+1;
memcpy(pDest, &localDest[0], n*sizeof(int));
}

Calling foo_a and foo_b with overlapping arrays can show different results, depending
on how the loop was optimized. That is an indication that this usage of 'foo_a' is
triggering undefined behavior and should not be done.

Wrt to threading: as long as the restrict pointer (rpDest, rpSrc; localDest, localSrc) is not escaping,
a different thread should not be able to access the memory, as there is no way it can get a pointer 'based on'
the restrict pointer.

Note [1]: things get more interesting when having a 'pointer to a restrict pointer' (aka int *restrict *prp).

Greetings,

Jeroen Dobbelaere

Johannes Doerfert via llvm-dev

unread,
Feb 3, 2021, 1:52:11 PM2/3/21
to Jeroen Dobbelaere, Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org

The way I interpret this is consistent with Eli's opinion and what we
basically do so far,
restrict is stronger than synchronization since the local arrays are not
synchronized across
threads. If two threads access the same memory (even well synchronized)
it breaks the restrict
requirement and is therefor UB.

So a weaker `noalias` or a way to mark uses seems therefore required for
`noalias` deduction.

~ Johannes

Saito, Hideki via llvm-dev

unread,
Feb 3, 2021, 2:04:08 PM2/3/21
to Johannes Doerfert, Jeroen Dobbelaere, Kaylor, Andrew, llvm...@lists.llvm.org

>So a weaker `noalias` or a way to mark uses seems therefore required for `noalias` deduction.

Appears to be that way. Can we do that w/o having a weaker restrict in the language spec?

-----Original Message-----
From: Johannes Doerfert <johannes...@gmail.com>
Sent: Wednesday, February 3, 2021 10:52 AM
To: Jeroen Dobbelaere <Jeroen.D...@synopsys.com>; Saito, Hideki <hideki...@intel.com>; Kaylor, Andrew <andrew...@intel.com>
Cc: llvm...@lists.llvm.org
Subject: Re: [llvm-dev] Memory barrier problem


Jeroen Dobbelaere via llvm-dev

unread,
Feb 4, 2021, 3:04:41 AM2/4/21
to Saito, Hideki, Johannes Doerfert, Kaylor, Andrew, llvm...@lists.llvm.org
> >So a weaker `noalias` or a way to mark uses seems therefore required for
> >`noalias` deduction.
>
> Appears to be that way. Can we do that w/o having a weaker restrict in the
> language spec?

The full restrict[0] implementation does not depend on the 'noalias' attribute on
function arguments. The attribute is even too strong for just mapping a
'C99 restrict pointer argument' to a 'LLVM-IR noalias pointer argument'.
For backwards compatibility, I kept the default mapping of restrict pointer arguments
onto 'noalias' and provided the '-fno-noalias-arguments' option to disable this mapping
For some code, this can result in a wrong 'based on' deduction.[1]

Given that, IMHO, it still makes sense to have a strong and a weaker version of the
'noalias argument attribute'. At least, the stronger (current 'noalias') version can be
converted to the noalias scope/intrinsics mapping during inlining, keeping the strong guarantees.
Converting a weaker version likely will need some more tweaking.

The noalias attribute is also used for a struct pointer argument when a function returns a struct.

Greetings,

Jeroen Dobbelaere
[0] Full Restrict patches: https://reviews.llvm.org/D68484
[1] 'clang/test/CodeGen/restrict/arg_reuse.c' testcase in: https://reviews.llvm.org/D68521

Johannes Doerfert via llvm-dev

unread,
Feb 12, 2021, 2:41:41 PM2/12/21
to Jeroen Dobbelaere, Saito, Hideki, Kaylor, Andrew, llvm...@lists.llvm.org

On 2/4/21 2:04 AM, Jeroen Dobbelaere wrote:
>>> So a weaker `noalias` or a way to mark uses seems therefore required for
>>> `noalias` deduction.
>> Appears to be that way. Can we do that w/o having a weaker restrict in the
>> language spec?
> The full restrict[0] implementation does not depend on the 'noalias' attribute on
> function arguments. The attribute is even too strong for just mapping a
> 'C99 restrict pointer argument' to a 'LLVM-IR noalias pointer argument'.
> For backwards compatibility, I kept the default mapping of restrict pointer arguments
> onto 'noalias' and provided the '-fno-noalias-arguments' option to disable this mapping
> For some code, this can result in a wrong 'based on' deduction.[1]
>
> Given that, IMHO, it still makes sense to have a strong and a weaker version of the
> 'noalias argument attribute'. At least, the stronger (current 'noalias') version can be
> converted to the noalias scope/intrinsics mapping during inlining, keeping the strong guarantees.
> Converting a weaker version likely will need some more tweaking.
>
> The noalias attribute is also used for a struct pointer argument when a function returns a struct.

Interesting. I guess we would not keep it for restrict if it is too
strong but there
are other uses where the guarantees are useful I believe.

Given that full restrict will make the `__restrict` problem go away,
let's look at the
deduction one.

What if we make `nosync` a value/pointer attribute as well and then have:

  `noalias`
  Does not alias other pointers in scope but synchronizing events might
still change
  the value because other threads might have the same "expression".
That is, we declare
  the deductions as correct by weakening `noalias`.

  `nosync`
  The value is not modified in this scope by another thread.

  `noalias` + `nosync`
  Matches the `__restrict` guarantee that nothing not based of the
pointer can modify it,
  so this is "stronger" than synchronization events and you can forward
over fences/barriers.

WDYT?

~ Johannes

Saito, Hideki via llvm-dev

unread,
Feb 12, 2021, 4:11:47 PM2/12/21
to Johannes Doerfert, Jeroen Dobbelaere, Kaylor, Andrew, llvm...@lists.llvm.org

>What if we make `nosync` a value/pointer attribute as well and then have:
>
> `noalias`
> Does not alias other pointers in scope but synchronizing events might still change
> the value because other threads might have the same "expression".
>That is, we declare
> the deductions as correct by weakening `noalias`.
>
> `nosync`
> The value is not modified in this scope by another thread.
>
> `noalias` + `nosync`
> Matches the `__restrict` guarantee that nothing not based of the pointer can modify it,
> so this is "stronger" than synchronization events and you can forward over fences/barriers.
>
>WDYT?

Sounds like a good new direction to have a debate for the best solution. I think this or a similar variation of this would work to address the original issue Andy brought up,
within the compiler (but probably not between the programmer and the compiler ---- a separate discussion on this?).

Hideki
Reply all
Reply to author
Forward
0 new messages