[llvm-dev] Memory scope proposal

Ke Bai via llvm-dev

Jan 28, 2016, 3:34:32 PM

to llvm...@lists.llvm.org

Hi all,

Currently, the LLVM IR uses a binary value (SingleThread/CrossThread) to represent synchronization scope on atomic instructions. We would like to enhance the representation of memory scopes in LLVM IR to allow more values than just the current two. The intention of this email is to invite comments on our proposal. There was some earlier discussion, which can be found here:

https://groups.google.com/forum/#!searchin/llvm-dev/hsail/llvm-dev/46eEpS5h0E4/i3T9xw-DNVYJ

Here is our new proposal:

=================================================================

We still let the bitcode store memory scopes as unsigned integers, since that is the easiest way to maintain compatibility. The values 0 and 1 are special. All other values are meaningful only within that bc file. In addition, a global metadata in the file will provide a map from unsigned integers to string symbols which should be used to interpret all the non-standard integers. If the global metadata is empty or non-existent, then all non-zero values will be mapped to "system", which is the current behavior.
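The lookup described above can be sketched as follows. This is only an illustration in Python, not LLVM code: the function name `resolve_scope`, the dictionary standing in for the global metadata, and the choice to raise an error for a value missing from a non-empty map are all assumptions, since the proposal does not say what happens in that last case.

```python
# Hypothetical sketch of the proposed lookup; not part of LLVM.
SINGLE_THREAD = 0  # special bitcode value, per the proposal
SYSTEM = 1         # special bitcode value, per the proposal


def resolve_scope(value, scope_map=None):
    """Return the symbolic name for a bitcode scope integer.

    `scope_map` stands in for the module's global metadata: a dict from
    unsigned integers to scope names, or None/empty if it is absent.
    """
    if value == SINGLE_THREAD:
        return "singlethread"
    if value == SYSTEM or not scope_map:
        # With no map, every non-zero value means "system" (current behavior).
        return "system"
    if value not in scope_map:
        # Assumption: the proposal does not define this case.
        raise KeyError(f"scope {value} is not described by the module's map")
    return scope_map[value]
```

For example, `resolve_scope(2, {2: "WorkGroup"})` yields the module-local name, while the same integer with no map falls back to "system".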

The proposed syntax for synchronization scope is as follows:

* Synchronization scopes are of arbitrary width, but implemented as unsigned in the bitcode, just like address spaces.
* Cross-thread is default.
* Keyword "singlethread" is unchanged
* New syntax "synchscope(n)" for other target-specific scopes.
* There is no keyword for cross-thread, but it can be specified as "synchscope(0)".

The proposed integer encoding of the expanded synchronization scopes is as follows:

Format	Single Thread	System (renamed)	Intermediate
Bitcode	zero	one	unsigned n
Assembly	singlethread, synchscope(~0U)	empty (default), synchscope(0)	synchscope(n-1)
In-memory	~0U	zero	unsigned n-1
SelectionDAG	~0U	zero	unsigned n-1

The choice of “~0U” for singlethread makes it easy to maintain backward compatibility in the bitcode. The values 0 and 1 remain unchanged in the bitcode, and the reader simply decrements them by one to compute the correct value in the in-memory data-structure.
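The decrement can be sketched as follows, assuming the scope is held in a 32-bit unsigned value so that "~0U" is 0xFFFFFFFF. The function names are hypothetical and the code is Python for illustration only, not the actual bitcode reader.

```python
# Assumption: scopes are stored in a 32-bit unsigned, so ~0U == 0xFFFFFFFF.
MASK32 = 0xFFFFFFFF


def bitcode_to_memory(bitcode_value):
    """Reader side: subtract one with unsigned wraparound (0 becomes ~0U)."""
    return (bitcode_value - 1) & MASK32


def memory_to_bitcode(memory_value):
    """Writer side: add one with unsigned wraparound (~0U becomes 0)."""
    return (memory_value + 1) & MASK32
```

Under this mapping the existing bitcode values keep their meaning: 0 (single thread) becomes ~0U in memory, 1 (system) becomes 0, and any other n becomes n-1.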

Name Mapping

Now we come to the name mapping from integers to strings. If a Clang front end wants to map a language that has memory scopes (e.g. OpenCL) to LLVM IR, how does it determine which synchronization scopes to use? Without any rules, each target can define its own meaning for the scopes, can give them any name, and can map them to the LLVM IR integer values in any way. In that case, I think each target has to provide a mapping function that maps a specific language’s name for a scope to that target’s name for a scope that has conservatively the same semantics. That is, supporting a new language that has memory scopes requires updating every target that is to support that language.

Therefore, to let front-end writers share memory scope definitions where they match, and to avoid updating every target for each language, it is better to define standard memory scope names. A target is free to implement them or not, but if a target does implement them, they must have the defined relational semantics (e.g., hierarchical nesting). A target that implements them will be able to support any language that uses them, including languages not yet invented. A new memory scope name can be added if the existing ones are insufficient.

As a first step, we can define the standard scopes based on what a common language with memory scopes needs; e.g., OpenCL uses system, device, workgroup, and workitem. This is the same approach LLVM has taken for debug information: there are standard debug entities (those a common language such as C needs), and each new language uses those standard entities where there is a match and defines only the delta.
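As an illustration of "hierarchical nesting", here is a minimal Python sketch. The nesting order (workitem inside workgroup inside device inside system) and the helper names `includes` and `widest` are assumptions made for the example; the proposal only names the four scopes.

```python
# Assumed nesting order, narrowest first; scope names come from the
# OpenCL example in the text, the ordering is an assumption.
STANDARD_SCOPES = ["workitem", "workgroup", "device", "system"]


def includes(outer, inner):
    """True if scope `outer` is at least as wide as scope `inner`."""
    return STANDARD_SCOPES.index(outer) >= STANDARD_SCOPES.index(inner)


def widest(a, b):
    """Return whichever of the two scopes contains the other."""
    return a if includes(a, b) else b
```

A target that implements the standard names would be expected to honor relations like `includes("system", "workgroup")`, whatever integers it assigns to them internally.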

A bitcode example with the proposal

define void @test(i32* %addr) {
; forward compatibility
cmpxchg i32* %addr, i32 42, i32 0 singlethread monotonic monotonic

; new synchscope that will be defined by each backend
cmpxchg i32* %addr, i32 42, i32 0 synchscope(2) monotonic monotonic, 2
cmpxchg i32* %addr, i32 42, i32 0 synchscope(3) monotonic monotonic, 3

ret void
}

!synchscope = metadata !{{i32 0, !"SingleThread"}, {i32 2, !"WorkGroup"}, ...}

=================================================================

Thank you!

---

Best regards,

Ke

Ke Bai via llvm-dev

Feb 10, 2016, 1:16:08 PM

to llvm...@lists.llvm.org

Ping. Do we have any advice on this proposal? Thanks!

---
Best Regards,
Ke Bai

Ke Bai via llvm-dev

Feb 22, 2016, 2:57:32 PM

to llvm...@lists.llvm.org

Ping! We need to close on whether we can use integers and global metadata for interpreting all the non-standard integers, to represent memory scopes.

Ke Bai via llvm-dev

Mar 22, 2016, 4:43:17 PM

to llvm...@lists.llvm.org

Dear all,

Here is the plain text version of the proposal:

Currently, the LLVM IR uses a binary value (SingleThread/CrossThread) to represent synchronization scope on atomic instructions. We would like to enhance the representation of memory scopes in LLVM IR to allow more values than just the current two. The intention of this email is to invite comments on our proposal. There was some earlier discussion, which can be found here:
https://groups.google.com/forum/#!searchin/llvm-dev/hsail/llvm-dev/46eEpS5h0E4/i3T9xw-DNVYJ

Here is our new proposal:

=================================================================

We still let the bitcode store memory scopes as "unsigned integers", since that is the easiest way to maintain compatibility. The values 0 and 1 are special. All other values are meaningful only within that bc file. In addition, "a global metadata in the file" will provide a map from unsigned integers to string symbols which should be used to interpret all the non-standard integers. If the global metadata is empty or non-existent, then all non-zero values will be mapped to "system", which is the current behavior.

The proposed syntax for synchronization scope is as follows:

* Synchronization scopes are of arbitrary width, but implemented as unsigned in the bitcode, just like address spaces.
* Cross-thread is default.
* Keyword "singlethread" is unchanged
* New syntax "synchscope(n)" for other target-specific scopes.
* There is no keyword for cross-thread, but it can be specified as "synchscope(0)".

The proposed integer encoding of the expanded synchronization scopes is as follows:

****************************************************************

Format        Single Thread                  System (renamed)                Intermediate
Bitcode       zero                           one                             unsigned n
Assembly      singlethread, synchscope(~0U)  empty (default), synchscope(0)  synchscope(n-1)
In-memory     ~0U                            zero                            unsigned n-1
SelectionDAG  ~0U                            zero                            unsigned n-1

****************************************************************

The choice of “~0U” for singlethread makes it easy to maintain backward compatibility in the bitcode. The values 0 and 1 remain unchanged in the bitcode, and the reader simply decrements them by one to compute the correct value in the in-memory data-structure.

Name Mapping

Now we come to the name mapping from integers to strings. If a Clang front end wants to map a language that has memory scopes (e.g. OpenCL) to LLVM IR, how does it determine which synchronization scopes to use? Without any rules, each target can define its own meaning for the scopes, can give them any name, and can map them to the LLVM IR integer values in any way. In that case, I think each target has to provide a mapping function that maps a specific language’s name for a scope to that target’s name for a scope that has conservatively the same semantics. That is, supporting a new language that has memory scopes requires updating every target that is to support that language.

Therefore, to let front-end writers share memory scope definitions where they match, and to avoid updating every target for each language, it is better to define standard memory scope names. A target is free to implement them or not, but if a target does implement them, they must have the defined relational semantics (e.g., hierarchical nesting). A target that implements them will be able to support any language that uses them, including languages not yet invented. A new memory scope name can be added if the existing ones are insufficient.

As a first step, we can define the standard scopes based on what a common language with memory scopes needs; e.g., OpenCL uses system, device, workgroup, and workitem. This is the same approach LLVM has taken for debug information: there are standard debug entities (those a common language such as C needs), and each new language uses those standard entities where there is a match and defines only the delta.

A bitcode example with the proposal

*****************************************************************

define void @test(i32* %addr) {
; forward compatibility
cmpxchg i32* %addr, i32 42, i32 0 singlethread monotonic monotonic

; new synchscope that will be defined by each backend
cmpxchg i32* %addr, i32 42, i32 0 synchscope(2) monotonic monotonic, 2
cmpxchg i32* %addr, i32 42, i32 0 synchscope(3) monotonic monotonic, 3

ret void
}

!synchscope = metadata !{{i32 0, !"SingleThread"}, {i32 2, !"WorkGroup"}, ...}

*****************************************************************

=================================================================

--

Best Regard,
Ke Bai, Ph.D.

Philip Reames via llvm-dev

Mar 28, 2016, 10:17:50 PM

to Ke Bai, llvm...@lists.llvm.org

Ke,

I'll be the bearer of bad news here. The radio silence this proposal has gotten probably means there is not enough interest in the community in this proposal to see it land.

One concern I have with the current proposal is that the optimization value of these scopes is not clear to me. Is it only the backend which is expected to support optimizations over these scopes? Or are you expecting the middle end optimizer to understand them? If so, I'd suspect we'd need a refined definition which allows us to discuss relative strengths of memory scopes.

More fundamentally, it's not clear to me that "scope" is even the right model for this. I could see a case where we'd want something along the lines of "acquire semantics on memory space 1, release semantics on memory space 2, seq_cst semantics on address space 3".

Also, unless I'm misreading on my skim of your proposal, the current definition of scope is slightly off from what you've specified. A "seq_cst singlethread" fence is a much weaker fence than a "seq_cst crossthread". It's probably easiest to reason about the current scheme as having the cross product of {singlethread, crossthread} x {orderings...} distinct orderings rather than a set of orderings with two overlapping scopes.
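To make the cross-product reading concrete, here is a small Python sketch that enumerates the combinations. The ordering names are the six atomic orderings of LLVM IR; the function name is made up for the example, and "crossthread" is used only as a label for the default scope, which has no keyword.

```python
from itertools import product

# The two existing scopes and the six LLVM IR atomic orderings.
SCOPES = ["singlethread", "crossthread"]
ORDERINGS = ["unordered", "monotonic", "acquire", "release", "acq_rel", "seq_cst"]


def scoped_orderings():
    """Treat every (scope, ordering) pair as its own distinct ordering."""
    return [f"{ordering} {scope}" for scope, ordering in product(SCOPES, ORDERINGS)]
```

Viewed this way there are twelve distinct points rather than six orderings plus a two-valued flag, and "seq_cst singlethread" is simply a different, weaker point than "seq_cst crossthread".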

Philip

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Owen Anderson via llvm-dev

Mar 29, 2016, 1:03:43 PM

to Philip Reames, llvm...@lists.llvm.org, Ke Bai

On Mar 28, 2016, at 7:17 PM, Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:

> Ke,
>
> I'll be the bearer of bad news here. The radio silence this proposal has gotten probably means there is not enough interest in the community in this proposal to see it land.

FWIW, I’m very interested in seeing it go in, but haven’t had a lot of time to write a response.

> One concern I have with the current proposal is that the optimization value of these scopes is not clear to me. Is it only the backend which is expected to support optimizations over these scopes? Or are you expecting the middle end optimizer to understand them? If so, I'd suspect we'd need a refined definition which allows us to discuss relative strengths of memory scopes.

I don’t know about Ke’s use cases, but I at least am not very concerned with having any portion of LLVM optimize them. Right now LLVM has no way to represent the information encoded here at all.

> More fundamentally, it's not clear to me that "scope" is even the right model for this. I could see a case where we'd want something along the lines of "acquire semantics on memory space 1, release semantics on memory space 2, seq_cst semantics on address space 3".

Scopes are orthogonal to ordering constraints. Scopes are about memory operation visibility, primarily in the context of a machine with non-coherent caches. Imagine an accelerator with:

- Per HW thread load/store buffers

- Per core L1

- Accelerator-wide L2

- Whole-system DRAM

… and at any level of the hierarchy, the caching for one thread/core/accelerator may not be coherent with caches for other threads/cores/accelerators.

Scopes allow the program author to express the requisite visibility for a memory operation; an operation that needs to be visible to other cores within the accelerator may need to bypass or flush the per-core L1. Communication to the host CPU or other accelerators may similarly need to bypass the L2.
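A rough way to picture this is the Python sketch below. It models the example hierarchy above and returns the levels that are private to something narrower than the requested visibility, i.e. the levels an operation would have to bypass or flush. The scope names ("thread", "core", "accelerator", "system") and the function name are invented for the illustration and do not correspond to any real target.

```python
# Hypothetical hierarchy, innermost first, following the example list.
# Each entry pairs a buffer/cache level with the scope that shares it.
HIERARCHY = [
    ("load/store buffers", "thread"),
    ("L1", "core"),
    ("L2", "accelerator"),
    ("DRAM", "system"),
]


def levels_to_bypass(visibility):
    """Levels that are private to something narrower than `visibility`."""
    scopes = [scope for _, scope in HIERARCHY]
    cut = scopes.index(visibility)  # raises ValueError for an unknown scope
    return [level for level, _ in HIERARCHY[:cut]]
```

In this model an operation that must be visible accelerator-wide has to get past the per-thread buffers and the per-core L1, and one that must be visible system-wide also has to get past the L2.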

—Owen

Tom Stellard via llvm-dev

Apr 14, 2016, 1:41:35 PM

to Ke Bai, llvm...@lists.llvm.org

This part of the proposal is formatted strangely and is a little confusing.
Was this supposed to be a table? Can you re-format so it is more clear what
is being proposed.

Thanks,
Tom

> > We still let the bitcode store memory scopes as *unsigned integers*,

> > since that is the easiest way to maintain compatibility. The values 0 and 1
> > are special. All other values are meaningful only within that bc file. In

> > addition, *a global metadata in the **file will provide a map* from

> > unsigned integers to string symbols which should be used to interpret all
> > the non-standard integers. If the global metadata is empty or non-existent,
> > then all non-zero values will be mapped to "system", which is the current
> > behavior.
> > The proposed syntax for synchronization scope is as follows:
> >

> > - Synchronization scopes are of arbitrary width, but implemented as

> > unsigned in the bitcode, just like address spaces.

> > - Cross-thread is default.
> > - Keyword "singlethread" is unchanged
> > - New syntax "synchscope(n)" for other target-specific scopes.
> > - There is no keyword for cross-thread, but it can be specified as

> > "synchscope(0)".
> >
> > The proposed new integer implementation expanded synchronization scopes
> > are as follows:

> > *Format*
> > *Single Thread*
> > *System (renamed)*
> > *Intermediate*
> > *Bitcode*
> > zero
> > one
> > unsigned n
> > *Assembly*

> > singlethread,
> > synchscope(~0U)
> > empty (default),
> > synchscope(0)
> > synchscope(n-1)

> > *In-memory*
> > ~0U
> > zero
> > unsigned n-1
> > *SelectionDAG*

> > *A **bitcode example with the proposal*

> > define void <at> test(i32* %addr) {
> > ; forward compatibility
> > cmpxchg i32* %addr, i32 42, i32 0 singlethread monotonic monotonic
> >
> > ; new synchscope that will be defined by each backend
> > cmpxchg i32* %addr, i32 42, i32 0 synchscope(2) monotonic monotonic, 2
> > cmpxchg i32* %addr, i32 42, i32 0 synchscope(3) monotonic monotonic, 3
> >
> > ret void
> > }
> >
> > !synchscope = metadata !{{i32 0, !"SingleThread"}, {i32 2, !"WorkGroup"},
> > ...}
> > =================================================================
> >
> > Thank you!
> >
> > ---
> > Best regards,
> > Ke
> >
>
>
>
> --
> Best Regard,
> Ke Bai, Ph.D.

> _______________________________________________

Tom Stellard via llvm-dev

Apr 18, 2016, 12:12:30 PM

to Ke Bai, chan...@google.com, resi...@mac.com, llvm...@lists.llvm.org

Here is the initial proposal with some formatting fixed:

***********************************************************************

----------------------------------------------------------------------|

***********************************************************************

ret void
}

=================================================================

Mehdi Amini via llvm-dev

Apr 18, 2016, 12:46:02 PM

to Tom Stellard, Ke Bai, llvm...@lists.llvm.org

Why not go with a metadata attachment directly and kill the "singlethread" keyword? Something like:

cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}
cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}

...

!42 = !{"singlethread"}
!43 = !{"L2"}

It also avoids manipulating/pruning the global map, and targets won't depend on integers to keep bitcode compatibility.

--
Mehdi

Philip Reames via llvm-dev

Apr 18, 2016, 7:40:45 PM

to Mehdi Amini, Tom Stellard, llvm...@lists.llvm.org, Ke Bai

+1. I like this idea. Haven't given it serious thought, but on the
surface this sounds nice. We'd need to establish a naming scheme so that
targets could add meta tags without preventing upstream from adding
generic ones going forward.

Philip

Tom Stellard via llvm-dev

May 2, 2016, 10:46:38 AM

to Mehdi Amini, Yaxu...@amd.com, Ke Bai, Stanislav....@amd.com, llvm...@lists.llvm.org, tony...@amd.com

This seems like it will work, assuming that promoting something like "L2" to the
most conservative memory scope is legal (this is what would have to happen
if the metadata were dropped). Are there any targets where this type of
promotion would be illegal?

-Tom

Pekka Jääskeläinen

May 18, 2016, 4:17:48 AM

to Tom Stellard, Mehdi Amini, Stanislav....@amd.com, Yaxu...@amd.com, tony...@amd.com, Ke Bai, llvm...@lists.llvm.org

Hi all,

On 02.05.2016 17:46, Tom Stellard via llvm-dev wrote:
>> Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:
>> >Something like:
>> >
>> > cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}
>> > cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}
>> >
>> >...
>> >
>> >!42 = !{"singlethread"}
>> >!43 = !{"L2"}
>> >
>> >
>> >I also avoids manipulating/pruning the global map, and target won't depend on integer to keep bitcode compatibility.
>> >
>> >
> This seems like it will work assuming that promoting something like "L2" to the
> most conservative memory scope is legal (this is what would have to happen
> if the metadata was dropped). Are there any targets where this type of'
> promotion would be illegal?

+1

Sorry to enter this discussion so late, but in my opinion this
is a very good non-intrusive starting point solution.

Implementing it as a MD with the assumption of there being a safe
conservative memory scope to fall back to (in case the MD gets
stripped off for some reason) converts it to a mere
optimization hint without the need to touch LLVM IR instruction semantics.

Also, as it's only a MD, if we encounter a need to extend it later
towards a more integrated solution (or a mechanism to support
targets where this scheme is not feasible), we can more easily do so.

BR,
--
Pekka

Tom Stellard via llvm-dev

Jun 22, 2016, 4:50:55 PM

to Pekka Jääskeläinen, Ke Bai, Yaxu...@amd.com, Stanislav....@amd.com, Brian....@amd.com, llvm...@lists.llvm.org, Konstantin...@amd.com, tony...@amd.com

+ Brian and Konstantin

Zhuravlyov, Konstantin via llvm-dev

Jun 25, 2016, 10:33:52 PM

to Tom Stellard, Pekka Jääskeläinen, Ke Bai, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Tye, Tony

We believe it would be best to add this to LLVM IR as a field on the atomic memory instructions rather than as metadata.

The reasoning is that this information is similar to other information that is represented as instruction fields: for example, the indication that a memory operation is atomic rather than non-atomic, the memory ordering of atomics, and whether the scope is per-thread or system. In all these cases the information has a semantic meaning for the instruction that optimizations can exploit. Representing it as metadata would mean the information could be dropped, making those optimizations impossible at a very significant performance cost.

For example, if memory operations were not marked as being atomic, all memory operations would have to be generated as sequentially consistent atomics at system scope. Although this "default" behavior is correct, it would not be very performant. Similarly, the memory ordering could use a "default" of sequentially consistent, which again is much less efficient than the weaker orderings. By analogy, the memory scope could also have a "default" of system scope, but that is also not performant when the scope is narrower.

In all these cases this information changes the semantics of the instructions. It affects whether a program has undefined behavior. Using a "default" value leads to those same programs being treated as having defined behavior (for example by eliminating data races).

Currently the atomic memory instructions have the own-thread/system choice recorded on the instruction, which is a limited form of memory scope. The proposal is to replace this with a more general field that can have more than 2 values. Languages that do not use memory scopes can simply use the value corresponding to system scope.

We understand that it is good to avoid adding information to LLVM instructions that is not primary, but in this case it seems that the atomicity, memory ordering and memory scope are all equally primary information that characterize the semantics of memory instructions.

We have posted reviews that implement the proposal and invite everyone to discuss it:
http://reviews.llvm.org/D21723
http://reviews.llvm.org/D21724

Thank you,
Konstantin

Mehdi Amini via llvm-dev

Jun 25, 2016, 10:38:23 PM

to Zhuravlyov, Konstantin, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Ke Bai

On Jun 25, 2016, at 11:05 AM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:

> We believe that it would be best that this is added to the LLVM IR atomic memory instruction as fields on atomic instructions rather than using meta data.
>
> The reasoning is that this information is similar to other information that is represented as instruction fields. For example, the indication that memory operations are atomic rather than non-atomic, the memory ordering of atomics, and whether per-thread or system scope. In all these cases this information has a semantic meaning for the instructions that can be exploited by optimizations. Representing it as meta data would mean this information could be dropped making the optimizations impossible with very significant performance penalty.
>
> For example, if memory operations were not marked as being atomic, all memory operations would have to be generated as sequential consistent atomics at system scope. Although this "default" behavior is correct, it would not be very performant. Similarly, the memory ordering could use a "default" of sequentially consistent, which again is much less efficient than the weaker orderings. By analogy, the memory scope could also have a "default" of system scope, but that is also not performant when the scope is narrower.
>
> In all these cases this information changes the semantics of the instructions. It affects whether a program is undefined behavior. Using a "default" value leads to those same programs being treated as having defined behavior (for example by eliminating data races).

It is not clear to me whether there are any correctness issues with dropping the metadata.

> Currently the atomic memory instructions have the own-thread/system recorded on the instruction which is a limited form of memory scope. The proposal is to replace this with a more general field that can have more than 2 values. Languages that do not use memory scopes can simply use the value corresponding to system scope.
>
> We understand that it is good to avoid adding information to LLVM instructions that is not primary, but in this case it seems that the atomicity, memory ordering and memory scope are all equally primary information that characterize the semantics of memory instructions.
>
> We have posted reviews that implement the proposal and invite everyone to discuss it:
> http://reviews.llvm.org/D21723
> http://reviews.llvm.org/D21724

It seems you’re going back to integer, which I don’t really like for reasons mentioned earlier in this thread, and that I don’t feel you addressed here.

—

Mehdi

Philip Reames via llvm-dev

Jul 3, 2016, 7:40:00 PM

to Mehdi Amini, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

I will comment - as one of the few people actually working on llvm's atomic implementation with any regularity - that I am opposed to extending the instructions without a strong motivating case. I don't care anywhere near as much about metadata based schemes, but extending the instruction semantics imposes a much larger burden on the rest of the community. That burden has to be well justified and supported.

Philip

Sameer Sahasrabuddhe via llvm-dev

Jul 10, 2016, 4:06:13 AM

to Philip Reames, Tye, Tony, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

On Mon, Jul 4, 2016 at 5:09 AM, Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:

> I will comment - as one of the few people actually working on llvm's atomic implementation with any regularity - that I am opposed to extending the instructions without a strong motivating case. I don't care anywhere near as much about metadata based schemes, but extending the instruction semantics imposes a much larger burden on the rest of the community. That burden has to be well justified and supported.

In OpenCL 2.x, two atomic operations on the same atomic object need to have the same scope to prevent a data race. This derives from the definition of "inclusive scope" in OpenCL 2.x. Encoding OpenCL 2.x scope as metadata in LLVM IR would be a problem because there cannot be a "safe default value" to be used when the metadata is dropped. If the "largest" scope is used as the default, then the optimizer must guarantee that the metadata is dropped from every atomic operation in the whole program, or not dropped at all.

Hence the original attempt to extend LLVM atomic instructions with a broader scope field.
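The problem can be illustrated with a deliberately simplified Python model. It assumes the rule exactly as stated above (operations on the same atomic object must use the same scope), represents a dropped metadata attachment as `None`, and falls back to "system" as the default; the function names are invented for the example, and the full OpenCL 2.x definition of inclusive scopes is more detailed than this.

```python
DEFAULT_SCOPE = "system"  # assumed fallback when the metadata is dropped


def scopes_by_object(ops):
    """Collect the effective scopes used on each atomic object.

    `ops` is an iterable of (object name, scope) pairs, where scope is
    None if the metadata was dropped from that operation.
    """
    seen = {}
    for obj, scope in ops:
        effective = scope if scope is not None else DEFAULT_SCOPE
        seen.setdefault(obj, set()).add(effective)
    return seen


def has_scope_mismatch(ops):
    """True if any atomic object is accessed with more than one scope."""
    return any(len(scopes) > 1 for scopes in scopes_by_object(ops).values())
```

In this model, two "workgroup" operations on the same object are consistent, and so are two operations that both lost their metadata; dropping the metadata from only one of them is what introduces the mismatch.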

Sameer.

Zhuravlyov, Konstantin via llvm-dev

Aug 17, 2016, 4:39:43 PM

to Sameer Sahasrabuddhe, Philip Reames, Tye, Tony, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Ke Bai

Hi,

I have updated the review here:

https://reviews.llvm.org/D21723

As Sameer pointed out, the motivation is:

In OpenCL 2.x, two atomic operations on the same atomic object need to have the same scope to prevent a data race. This derives from the definition of "inclusive scope" in OpenCL 2.x. Encoding OpenCL 2.x scope as metadata in LLVM IR would be a problem because there cannot be a "safe default value" to be used when the metadata is dropped. If the "largest" scope is used as the default, then the optimizer must guarantee that the metadata is dropped from every atomic operation in the whole program, or not dropped at all.

Thanks,

Konstantin


Zhuravlyov, Konstantin via llvm-dev

Aug 17, 2016, 5:23:51 PM

to Sameer Sahasrabuddhe, Philip Reames, Tye, Tony, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Ke Bai

>Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:

>Something like:

> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}

> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}

>...

>!42 = !{"singlethread"}

>!43 = !{"L2"}

>It is not clear to me if there is any correctness issues to dropping metadata?

Yes. We cannot use the metadata approach, because the metadata can be dropped while processing one module but kept while processing a second module. That can leave synchronizing operations with inconsistent scopes, which leads to data races and therefore to correctness issues.

Thanks,

Konstantin

Mehdi Amini via llvm-dev

Aug 17, 2016, 6:05:14 PM

to Zhuravlyov, Konstantin, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Ke Bai

On Aug 17, 2016, at 2:08 PM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:

>Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:
>Something like:
> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}
> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}

>...

>!42 = !{"singlethread"}
>!43 = !{"L2"}

>It is not clear to me if there is any correctness issues to dropping metadata?

Yes, we cannot use the metadata approach since this metadata can be dropped during the processing of one module but not dropped in the processing of a second module, potentially resulting in inconsistent scopes for synchronizing operations leading to data races and subsequently leading to correctness issues.

Right, I saw Sameer's explanation for that earlier, and we shouldn’t move forward (without Philip’s opinion on the topic as he expressed concerns).

But you stripped out the second part of my email where I wrote "It seems you’re going back to integer, which I don’t really like for reasons mentioned earlier in this thread, and that I don’t feel you addressed here". Why can’t `synchscope` take a string literal?

—

Mehdi

Justin Lebar via llvm-dev

Aug 17, 2016, 7:01:02 PM

to Mehdi Amini, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

I'm coming at this from a CUDA perspective, so apologies if this
doesn't make a lot of sense:

In CUDA we have a similar problem as OpenCL. CUDA solves it by having
a bunch of atomic builtins for each of the memory scopes. These map
to various llvm target-specific intrinsics.

It's not great, because the intrinsics are mostly opaque to the
optimizer. But atomic ops are already pretty slow on the GPU, so I've
been operating under the assumption that this isn't hurting us too
much.

Am I wrong about that?

Philip Reames via llvm-dev

Aug 21, 2016, 2:15:07 PM

to Mehdi Amini, Zhuravlyov, Konstantin, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Ke Bai

On 08/17/2016 03:05 PM, Mehdi Amini wrote:

On Aug 17, 2016, at 2:08 PM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:

>Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:

>Something like:

> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}

> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}

>...

>!42 = !{"singlethread"}

>!43 = !{"L2"}

>It is not clear to me if there is any correctness issues to dropping metadata?

Yes, we cannot use the metadata approach since this metadata can be dropped during the processing of one module but not dropped in the processing of a second module, potentially resulting in inconsistent scopes for synchronizing operations leading to data races and subsequently leading to correctness issues.

Right, I saw Sameer's explanation for that earlier, and we shouldn’t move forward (without Philip’s opinion on the topic as he expressed concerns).

Given my current time commitments, having me on the critical path for any proposal is not a good idea. I'm willing to step aside here as long as the proposal is well reviewed by someone who's familiar with the memory model. Hal, Sanjoy, JF, Chandler, and Danny would all be reasonable alternates. Mehdi, if things get to the point where you think they're good to go and no one else has chimed in, ping me. I'm not going to be following until then, but I'll make the time for a final pass through if no one else has first.

Mehdi Amini via llvm-dev

Aug 21, 2016, 2:20:05 PM

to Philip Reames, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

On Aug 21, 2016, at 11:14 AM, Philip Reames <list...@philipreames.com> wrote:

On 08/17/2016 03:05 PM, Mehdi Amini wrote:

On Aug 17, 2016, at 2:08 PM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:

>Why not go with a metadata attachment directly and kill the "singlethread" keyword? Something like:

> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}

> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}

>...

>!42 = !{"singlethread"}

>!43 = !{"L2"}

>It is not clear to me if there is any correctness issues to dropping metadata?

Yes, we cannot use the metadata approach since this metadata can be dropped during the processing of one module but not dropped in the processing of a second module, potentially resulting in inconsistent scopes for synchronizing operations leading to data races and subsequently leading to correctness issues.

Right, I saw Sameer's explanation for that earlier, and we shouldn’t move forward (without Philip’s opinion on the topic as he expressed concerns).

Given my current time commitments, having me on the critical path for any proposal is not a good idea. I'm willing to step aside here as long as the proposal is well reviewed by someone who's familiar with the memory model. Hal, Sanjoy, JF, Chandler, and Danny would all be reasonable alternates.

OK, good to know. I put you on the path because you wrote:

"I am opposed to extending the instructions without a strong motivating case. I don't care anywhere near as much about metadata based schemes, but extending the instruction semantics imposes a much larger burden on the rest of the community. That burden has to be well justified and supported."

It is not clear to me right now whether the "use case" makes it "well justified" or not (an alternative being to use intrinsics for OpenCL, as Justin Lebar mentioned). I don’t feel I can answer this, so I am adding Chandler and Sanjoy on CC to begin with.

—

Mehdi

Sameer Sahasrabuddhe via llvm-dev

Aug 21, 2016, 11:51:48 PM

to Mehdi Amini, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

On Sun, Aug 21, 2016 at 11:49 PM, Mehdi Amini <mehdi...@apple.com> wrote:

On Aug 21, 2016, at 11:14 AM, Philip Reames <list...@philipreames.com> wrote:

Given my current time commitments, having me on the critical path for any proposal is not a good idea. I'm willing to step aside here as long as the proposal is well reviewed by someone who's familiar with the memory model. Hal, Sanjoy, JF, Chandler, and Danny would all be reasonable alternates.

OK, good to know. I put you on the path because you wrote:

"I am opposed to extending the instructions without a strong motivating case. I don't care anywhere near as much about metadata based schemes, but extending the instruction semantics imposes a much larger burden on the rest of the community. That burden has to be well justified and supported."

It is not clear to me right now if the "use case" makes it "well justified" or not (an alternative being using intrinsic for OpenCL as Justin Lebar mentioned). I don’t feel I can answer this, so adding CC Chandler and Sanjoy to begin with.

One yardstick to determine if this is "well justified" is to ask whether any LLVM backend has undefined behaviour similar to OpenCL's if the scope metadata is dropped. For every LLVM target X that can potentially be targeted from OpenCL, if the metadata is dropped (in one module but not another), does the memory model for X still produce behaviour that is compatible with the behaviour intended by the original program? If that question cannot be answered satisfactorily for all cases, then metadata is not a reliable way to move forward.

To put it differently, when viewed as an "OpenCL implementation", can every LLVM backend guarantee that when OpenCL scopes are lowered to metadata, the synchronisation specified in the original program is preserved?

Sameer.

Mehdi Amini via llvm-dev

Aug 21, 2016, 11:54:36 PM

to Sameer Sahasrabuddhe, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

This does not address why you can’t use intrinsics (as the CUDA implementation apparently does).

—

Mehdi

Mehdi Amini via llvm-dev

Aug 22, 2016, 12:00:26 AM

to Mehdi Amini, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

Let me rephrase: why is it preferable to add first class instruction support for opaque scope rather than using intrinsics?

—

Mehdi

Tye, Tony via llvm-dev

Aug 23, 2016, 5:05:13 PM

to mehdi...@apple.com, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

> Let me rephrase: why is it preferable to add first class instruction support for opaque scope rather than using intrinsics?

Given that LLVM core now supports atomic instructions, it would seem desirable to use them for all languages supported by LLVM. This would allow optimizations to be aware of the memory semantics. With intrinsics, this information is no longer available in a unified way. Intrinsics also require the target code generator to support both the LLVM atomics and all the additional intrinsics. In general it seems preferable to use a single approach rather than multiple approaches to express information.

Currently LLVM supports atomics by specifying both the memory ordering and the memory scope as enumerated values. However, the values for memory scope currently only include those needed for languages that target CPU-style hardware. Languages such as CUDA and OpenCL introduce additional memory scopes into the memory model. This patch extends the existing enumeration for memory scope to allow the target to define additional memory scopes that can be used by such languages. The current bitcode already represents the memory scope as a 32-bit unsigned value, so this change introduces no backward-compatibility issues.
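As a rough illustration of what "extending the enumeration" could mean, here is a minimal sketch. The names SingleThread and CrossThread and their values 0 and 1 come from the proposal at the top of this thread; everything else (FirstTargetScope, WorkGroupScope, DeviceScope, isTargetScope) is a hypothetical name invented for the example and is not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// The two scopes LLVM IR already distinguishes, stored as 0 and 1 in bitcode.
enum SynchronizationScope : uint32_t {
  SingleThread = 0,
  CrossThread = 1,
  // Hypothetical: first value a target may claim for its own scopes.
  FirstTargetScope = 2
};

// Hypothetical target-defined scopes for a GPU-like target.
constexpr uint32_t WorkGroupScope = FirstTargetScope + 0;
constexpr uint32_t DeviceScope = FirstTargetScope + 1;

// True if the value lies outside the two target-independent scopes.
constexpr bool isTargetScope(uint32_t Scope) {
  return Scope >= FirstTargetScope;
}
```

Since the scope already occupies a full unsigned 32-bit field, values from 2 upward fit without any change to the encoding.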

Thanks,

-Tony

Mehdi Amini via llvm-dev

Aug 23, 2016, 5:14:26 PM

to Tye, Tony, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

On Aug 23, 2016, at 2:00 PM, Tye, Tony <Tony...@amd.com> wrote:

> Let me rephrase: why is it preferable to add first class instruction support for opaque scope rather than using intrinsics?

Given that LLVM core now supports atomic instructions, it would seem desirable to use them for all languages supported by LLVM.

This would allow optimizations to be aware of the memory semantics.

Since the scope is “opaque” and target specific, can you elaborate on what kind of generic optimizations can be performed?

—

Mehdi

Tye, Tony via llvm-dev

Aug 23, 2016, 5:29:19 PM

to mehdi...@apple.com, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

> Since the scope is “opaque” and target specific, can you elaborate what kind of generic optimization can be performed?

Some optimizations that are related to a single thread could be done without needing to know the actual memory scope. For example, an atomic acquire prevents memory operations that follow it from being reordered before it, but allows memory operations that precede it (except another atomic acquire) to be reordered after it, regardless of the memory scope.
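The acquire example can be written as a small legality check that never looks at the scope. This is a simplified sketch and not LLVM's actual reordering logic: MemOp, canSinkBelowAcquire and canHoistAboveAcquire are made-up names, and only the acquire case from the paragraph above is modelled.

```cpp
#include <cassert>
#include <cstdint>

enum class Ordering { NotAtomic, Monotonic, Acquire };

struct MemOp {
  Ordering Order;
  uint32_t Scope; // opaque, target-defined; deliberately unused below
};

// An operation that precedes an acquire may be moved below it,
// unless that operation is itself an acquire.
bool canSinkBelowAcquire(const MemOp &Earlier) {
  return Earlier.Order != Ordering::Acquire;
}

// An operation that follows an acquire may not be moved above it.
bool canHoistAboveAcquire(const MemOp &) { return false; }
```

Neither function reads the Scope field, which is the point being made: the answer is the same for every scope value.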

Thanks,

-Tony

Sanjoy Das via llvm-dev

Aug 30, 2016, 8:53:53 PM

to Tye, Tony, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

Hi,

[Sorry for chiming in so late.]

I understand why a straightforward metadata scheme won't work here,
but have you considered an alternate scheme that works in the
following way:

- We add a MD node called !nosynch that lists a set of "domains" a
certain memory operation does *not* synchronize with.

- Memory operations with !nosynch synchronize with memory operations
without any !nosynch metadata (so dropping !nosynch is safe).

This will only work if your frontend knows, ahead of time, what the
possible set of synch-domains are, but it presumably already knows
that (otherwise how do you map domain names to integers)?

The other disadvantage with the scheme above is that memory operations
on the "normal CPU heap" (pardon my GPU n00b-ness here :) ) will synch
with the memory operations with !nosynch metadata. However, we can
solve that by modeling the "normal CPU heap" as "!nosynch
!{!special_domain_a, !special_domain_b, ... all domains except
!cpu_heap_domain}".

Thanks,
-- Sanjoy

Sanjoy Das via llvm-dev

Aug 30, 2016, 9:56:15 PM

to Tye, Tony, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

Hi,

Sanjoy Das wrote:
> I understand why a straightforward metadata scheme won't work here,
> but have you considered an alternate scheme that works in the
> following way:
>
> - We add a MD node called !nosynch that lists a set of "domains" a
> certain memory operation does *not* synchronize with.
>
> - Memory operations with !nosynch synchronize with memory operations
> without any !nosynch metadata (so dropping !nosynch is safe).

I missed a spot here ^, !nosynch metadata will also have to have a
sub-node for the kind of synch-domain *it* is in. The synchs-with
relation is then:

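    bool SynchsWith(MemOp A, MemOp B) {
      auto MD_A = A.getMD(MD_nosynch);
      auto MD_B = B.getMD(MD_nosynch);
      // Missing metadata on either side: conservatively assume they synchronize.
      if (!MD_A || !MD_B)
        return true;
      // Otherwise they synchronize unless one lists the other's domain.
      return !(MD_B.nosync_list.contains(MD_A.id) ||
               MD_A.nosync_list.contains(MD_B.id));
    }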

I'm still not 100% convinced that the above works, but I think there
are advantages to expressing synch scopes as metadata. For instance,
the optimizer already "knows" what to do with the metadata on loads it
speculates.
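A self-contained version of the relation might look like the following. It is only a sketch: NoSynchMD and its fields are stand-ins for whatever the metadata would really contain, a null pointer stands for "no !nosynch metadata attached", and it assumes that listing another operation's domain means the two do *not* synchronize, so the membership test is negated.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a !nosynch metadata node.
struct NoSynchMD {
  unsigned Id;                       // domain this operation belongs to
  std::vector<unsigned> NoSynchList; // domains it does not synchronize with

  bool excludes(unsigned Domain) const {
    return std::find(NoSynchList.begin(), NoSynchList.end(), Domain) !=
           NoSynchList.end();
  }
};

// A null pointer models a memory operation without !nosynch metadata.
bool synchsWith(const NoSynchMD *A, const NoSynchMD *B) {
  // Dropped or absent metadata is the conservative case.
  if (!A || !B)
    return true;
  // Assumption: the pair is unordered if either side excludes the other.
  return !(B->excludes(A->Id) || A->excludes(B->Id));
}
```

Dropping the metadata from either operation can only turn a "does not synchronize" answer into "synchronizes", which is what makes dropping safe in this scheme.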

Mehdi Amini via llvm-dev

Aug 31, 2016, 11:23:42 AM

to Sanjoy Das, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

> On Aug 30, 2016, at 5:53 PM, Sanjoy Das <san...@playingwithpointers.com> wrote:
>
> Hi,
>
> [Sorry for chiming in so late.]
>
> I understand why a straightforward metadata scheme won't work here,
> but have you considered an alternate scheme that works in the
> following way:
>
> - We add a MD node called !nosynch that lists a set of "domains" a
> certain memory operation does *not* synchronize with.
>
> - Memory operations with !nosynch synchronize with memory operations
> without any !nosynch metadata (so dropping !nosynch is safe).

I’m not sure, but isn’t the synchscope id (or domain, as you seem to call it) intended to change which instructions actually get generated?
In that case I’m not sure dropping it is ever a good idea: even when it does not affect correctness, it would dramatically affect performance.

—
Mehdi

Sanjoy Das via llvm-dev

Aug 31, 2016, 3:15:02 PM

to Mehdi Amini, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

Hi Mehdi,

Mehdi Amini wrote:
> I’m not sure, but isn’t the synchscope id (or domains as you seem to
> call it) intended to change which instruction would be actually

FYI, I don't know what the right term for this is. :)

> codegen?
>
> In which case I’m not sure dropping it is ever a good idea, even
> when it does not affect correctness it would dramatically affect
> performance.

Sure, that is a good reason to avoid a metadata based approach. I was
just unconvinced by the current argument (at least in the commit
message) that "it can't be done", because I think a scheme modeled after
the !tbaa style metadata scheme has a chance of working. If the
fundamental reason why we want a non-MD scheme is something other than
"it can't be done", I'm fine with that.

Mehdi Amini via llvm-dev

Aug 31, 2016, 3:20:06 PM

to Sanjoy Das, Hal Finkel, Chandler Carruth, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

> On Aug 31, 2016, at 12:14 PM, Sanjoy Das <san...@playingwithpointers.com> wrote:
>
> Hi Mehdi,
>
> Mehdi Amini wrote:
> > I’m not sure, but isn’t the synchscope id (or domains as you seem to
> > call it) intended to change which instruction would be actually
>
> FYI, I don't know what the right term for this is. :)
>
> > codegen?
> >
> > In which case I’m not sure dropping it is ever a good idea, even
> > when it does not affect correctness it would dramatically affect
> > performance.
>
> Sure, that is a good reason to avoid a metadata based approach. I was
> just unconvinced by the current argument of (at least on the commit
> message) "it can't be done", because I think a scheme modeled after
> the !tbaa style metadata scheme has a chance of working. If the
> fundamental reason why we want a non-MD scheme is something other than
> "it can't be done", I'm fine with that.

My understanding is the following (I'll let the AMD folks (Tony?) correct this if I missed something):

1) We want to preserve the synchscope information because it is important for codegen. This is usually done with intrinsics in other backends.
2) Not doing it with intrinsics would allow the optimizer to reason about atomic operations within a single scope, though it couldn’t assume anything across scopes.

I don’t know enough to assert whether 2 is compelling enough to have first-class support in the IR for something that is so “opaque”.

—
Mehdi

Justin Lebar via llvm-dev

Aug 31, 2016, 3:23:59 PM

to Tye, Tony, Liu, Yaxun (Sam), Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

> Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.

Right, it's clear to me that there exist optimizations that you cannot
do if we model these ops as target-specific intrinsics.

But what I think Mehdi and I were trying to get at is: How much of a
problem is this in practice? Are there real-world programs that
suffer because we miss these optimizations? If so, how much?

The reason I'm asking this question is, there's a real cost to adding
complexity in LLVM. Everyone in the project is going to pay that
cost, forever (or at least, until we remove the feature :). So I want
to try to evaluate whether paying that cost is actually worthwhile,
as compared to the simple alternative (i.e., intrinsics). Given the
tepid response to this proposal, I'm sort of thinking that now may not
be the time to start paying this cost. (We can always revisit this in
the future.) But I remain open to being convinced.

As a point of comparison, we have a rule of thumb that we'll add an
optimization that increases compilation time by x% if we have a
benchmark that is sped up by at least x%. Similarly here, I'd want to
weigh the added complexity against the improvements to user code.

-Justin

Tom Stellard via llvm-dev

Sep 1, 2016, 11:52:21 AM

to Justin Lebar, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
> > Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
>
> Right, it's clear to me that there exist optimizations that you cannot
> do if we model these ops as target-specific intrinsics.
>
> But what I think Mehdi and I were trying to get at is: How much of a
> problem is this in practice? Are there real-world programs that
> suffer because we miss these optimizations? If so, how much?
>
> The reason I'm asking this question is, there's a real cost to adding
> complexity in LLVM. Everyone in the project is going to pay that
> cost, forever (or at least, until we remove the feature :). So I want
> to try to evaluate whether paying that cost is actually worth while,
> as compared to the simple alternative (i.e., intrinsics). Given the
> tepid response to this proposal, I'm sort of thinking that now may not
> be the time to start paying this cost. (We can always revisit this in
> the future.) But I remain open to being convinced.
>

I think the cost of adding this information to the IR is really low.
There is already a synchronization scope field present for LLVM atomic
instructions, and it is already being encoded as 32 bits, so it is
possible to represent the additional scopes using the existing bitcode
format. Optimization passes are already aware of this synchronization
scope field, so they know how to preserve it when transforming the IR.

The primary goal here is to pass synchronization scope information from
the frontend to the backend. We already have a mechanism for doing this,
so why not use it? That seems like the lowest cost option to me.

-Tom

Philip Reames via llvm-dev

Sep 2, 2016, 8:52:32 PM

to Tom Stellard, Justin Lebar, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Ke Bai

On 09/01/2016 08:52 AM, Tom Stellard via llvm-dev wrote:
> On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
>>> Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
>> Right, it's clear to me that there exist optimizations that you cannot
>> do if we model these ops as target-specific intrinsics.
>>
>> But what I think Mehdi and I were trying to get at is: How much of a
>> problem is this in practice? Are there real-world programs that
>> suffer because we miss these optimizations? If so, how much?
>>
>> The reason I'm asking this question is, there's a real cost to adding
>> complexity in LLVM. Everyone in the project is going to pay that
>> cost, forever (or at least, until we remove the feature :). So I want
>> to try to evaluate whether paying that cost is actually worth while,
>> as compared to the simple alternative (i.e., intrinsics). Given the
>> tepid response to this proposal, I'm sort of thinking that now may not
>> be the time to start paying this cost. (We can always revisit this in
>> the future.) But I remain open to being convinced.
>>
> I think the cost of adding this information to the IR is really low.
> There is already a sychronization scope field present for LLVM atomic
> instructions, and it is already being encoded as 32-bits, so it is
> possible to represent the additional scopes using the existing bitcode
> format. Optimization passes are already aware of this synchronization
> scope field, so they know how to preserve it when transforming the IR.

I disagree with this assessment. Atomics are an area where additional
complexity has a *substantial* conceptual cost. I also question whether
the single_thread scope is actually respected throughout the optimizer
in practice.

I view the request of changing the IR as a fairly big ask. In
particular, I'm really nervous about what the exact optimization
semantics of such scopes would be. Depending on how that was defined,
this could be anything from fairly straightforward to outright messy.
In particular, if there are optimizations which are legal for only some
subset of scopes (or subset of pairs of scopes?), I'd really like to see
a clear definition given for how those are defined.

(p.s. Is there a current patch with an updated LangRef for the proposal
being discussed? I've lost track of it.)

Let me give an example proposal just to illustrate my point. This isn't
really a counter proposal per se, just me thinking out loud.

Say we added 32 distinct concurrent domains. One of them is used for
"single_thread". One is used for "everything else". The remaining 30
are defined in a target specific manner w/the exception that they can't
overlap with each other or with the two predefined ones. The effect of
a given atomic operation with respect to each concurrency domain could
be defined in terms of a 32 bit mask. If a bit was set, the operation
is ordered (according to the separately stated ordering) with that
domain. If not, it is explicitly unordered w.r.t. that domain. A
memory operation would be tagged with the memory domains with which it
might interact.

The key bit here is that I can describe transformations in terms of
these abstract domains without knowing anything about how the frontend
might be using such a domain or how the backend might lower it. In
particular, if I have the sequence:
  %v = load i64, %p atomic scope {domain3 only}
  fence seq_cst scope={domain1 only}
  %v2 = load i64, %p atomic scope {domain3 only}

I can tell that the two loads aren't ordered with respect to the fence and
that I can do load forwarding here.
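To make the mask idea concrete, here is a small sketch of how such a check might be phrased. It is an illustration of the example proposal only: DomainMask, domainBit, fenceOrders and canForwardAcrossFence are invented names, and the forwarding rule is deliberately conservative (it only answers "yes" when the fence shares no domain with either load).

```cpp
#include <cassert>
#include <cstdint>

// One bit per concurrency domain; Domain must be in the range 0..31.
using DomainMask = uint32_t;

constexpr DomainMask domainBit(unsigned Domain) {
  return DomainMask(1) << Domain;
}

// A fence orders a memory operation only if they share at least one domain.
constexpr bool fenceOrders(DomainMask Fence, DomainMask Op) {
  return (Fence & Op) != 0;
}

// Forward the first load's value to the second only when the intervening
// fence is unordered with respect to both loads.
constexpr bool canForwardAcrossFence(DomainMask Load1, DomainMask Fence,
                                     DomainMask Load2) {
  return !fenceOrders(Fence, Load1) && !fenceOrders(Fence, Load2);
}
```

With the sequence above, canForwardAcrossFence(domainBit(3), domainBit(1), domainBit(3)) is true, and the check needs no knowledge of what domain 1 or domain 3 mean to the frontend or the backend.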

In general, an IR extension needs to be well defined, general enough to
be used by multiple distinct users, and fairly battle tested design
wise. We're not completely afraid of having to remove bad ideas from
the IR, but we really try to avoid adding things until they're fairly
proven.

Mehdi Amini via llvm-dev

Sep 2, 2016, 11:13:43 PM

to Philip Reames, Liu, Yaxun (Sam), Tye, Tony, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Ke Bai, Zhuravlyov, Konstantin

Here is the patch: https://reviews.llvm.org/D21723

> Let me give an example proposal just to illustrate my point. This isn't really a counter proposal per se, just me thinking out loud.
>
> Say we added 32 distinct concurrent domains. One of them is used for "single_thread". One is used for "everything else". The remaining 30 are defined in a target specific manner w/the exception that they can't overlap with each other or with the two predefined ones. The effect of a given atomic operation with respect to each concurrency domain could be defined in terms of a 32 bit mask. If a bit was set, the operation is ordered (according to the separately stated ordering) with that domain. If not, it is explicitly unordered w.r.t. that domain. A memory operation would be tagged with the memory domains with which it might interact.
>
> The key bit here is that I can describe transformations in terms of these abstract domains without knowing anything about how the frontend might be using such a domain or how the backend might lower it. In particular, if I have the sequence:
>   %v = load i64, %p atomic scope {domain3 only}
>   fence seq_cst scope={domain1 only}
>   %v2 = load i64, %p atomic scope {domain3 only}
>
> I can tell that the two loads aren't ordered with respect to the fence and that I can do load forwarding here.

I see the current proposal as a stripped-down version of what you describe: the optimizer can reason about operations inside a single scope, but can’t assume anything across scopes (they may or may not interact with each other).

What you describe seems like always having non-overlapping domains (from the optimizer's point of view) and requiring the frontend to express the overlap by attaching a “list" of domains that an atomic operation interacts with.

I hope I make sense :)

Best,

—

Mehdi

Tye, Tony via llvm-dev

Sep 9, 2016, 3:34:22 PM

to Philip Reames, mehdi...@apple.com, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin

Currently the Synchronization Scope (aka memory scope) information appears not to be used in any atomic-related optimizations. It would seem any such optimizations should consider memory scope, and an approach such as the one suggested by Philip seems reasonable. Could that change be tackled as a separate patch? Initially, any atomic optimizations could be restricted to cases where the memory scopes are exactly equal, which should be conservatively correct.

This patch would be a first step towards adding support for atomics with memory scopes used by languages such as OpenCL. Doing this simplifies how memory scope information is passed from Clang to the code generator, as mentioned by Tom. There seem to be several companies interested in doing this, as it will simplify the code, allow atomics to be handled in a consistent way for all languages, and allow atomic optimizations to benefit these languages.

Thanks,

-Tony

Sameer Sahasrabuddhe via llvm-dev

Oct 7, 2016, 4:41:04 AM

to Mehdi Amini, Liu, Yaxun (Sam), Ke Bai, Mekhanoshin, Stanislav, Sumner, Brian, llvm...@lists.llvm.org, Zhuravlyov, Konstantin, Tye, Tony

On Sat, Sep 3, 2016 at 8:43 AM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:

> > The key bit here is that I can describe transformations in terms of these abstract domains without knowing anything about how the frontend might be using such a domain or how the backend might lower it. In particular, if I have the sequence:
> >   %v = load i64, %p atomic scope {domain3 only}
> >   fence seq_cst scope={domain1 only}
> >   %v2 = load i64, %p atomic scope {domain3 only}
> > I can tell that the two loads aren't ordered with respect to the fence and that I can do load forwarding here.
>
> I see the current proposal as a stripped-down version of what you describe: the optimizer can reason about operations inside a single scope, but can’t assume anything across scopes (they may or may not interact with each other).
>
> What you describe seems like always having non-overlapping domains (from the optimizer's point of view) and requiring the frontend to express the overlap by attaching a “list" of domains that an atomic operation interacts with.

There is another way to tackle this, and Chandler had hinted at it in an old thread:
http://lists.llvm.org/pipermail/llvm-dev/2015-January/080236.html

Quoting from Chandler's email:
"Essentially, I think target-independent optimizations are still attractive, but we might want to just force them to go through an actual target-implemented API to interpret the scopes rather than making the interpretation work from first principles. I just worry that the targets are going to be too different and we may fail to accurately predict future targets' needs."

Note that in Philip's example above, the optimization is not really asking whether the two loads are ordered. It is asking whether the second load can be reordered to occur before the fence. Whatever the question, it can be implemented as a query to the target as a simple predicate. For example, "isOrdered(inst1, inst2)" or "canEliminate(store1, store2)". The latter query is when the optimizer wants to eliminate a store if it is followed by another store to the same location. The target can interpret the scope in whatever way and return true/false.

The advantage here is that now the optimizer does not need to know anything at all about the scopes. For example, in memory models like OpenCL, the scopes are nested, and it should be sufficient to specify just one bit in the mask and it could "automatically include" lower bits. The optimizer does not need to know that. In fact the implementation need not even be a bitmask. It can just be a set of opaque "sigils" like in the original design.
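A target-implemented query of this kind could be sketched as below. None of this is an existing LLVM interface: AtomicOp, TargetScopeInfo and NestedScopeTarget are hypothetical names, and the nested-scope rule (scopes numbered from narrowest to widest, with a wider fence covering every narrower scope) is just one possible interpretation a target might choose.

```cpp
#include <cassert>
#include <cstdint>

// The optimizer sees the scope only as an opaque number.
struct AtomicOp {
  uint32_t Scope;
};

// Hypothetical hook the optimizer would query instead of decoding scopes.
class TargetScopeInfo {
public:
  virtual ~TargetScopeInfo() {}
  // Conservative default: assume every pair is ordered.
  virtual bool isOrdered(const AtomicOp &, const AtomicOp &) const {
    return true;
  }
};

// Hypothetical target with nested scopes, numbered narrowest to widest.
class NestedScopeTarget : public TargetScopeInfo {
public:
  // A fence orders an operation if its scope is at least as wide.
  bool isOrdered(const AtomicOp &Fence, const AtomicOp &Op) const override {
    return Fence.Scope >= Op.Scope;
  }
};
```

The optimizer only ever calls isOrdered through the base class, so a target with non-nested or mask-based scopes could supply a different rule without any change on the optimizer side.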

In practice, I am wondering how often scopes will really affect optimizations. At least on targets that have memory models similar to OpenCL 2.x, it's likely that most queries have answers independent of scopes.

Sameer.
