Currently, the LLVM IR uses a binary value (SingleThread/CrossThread) to represent synchronization scope on atomic instructions. We would like to enhance the representation of memory scopes in LLVM IR to allow more values than just the current two. The intention of this email is to invite comments on our proposal. There are some discussion before and it can be found here:
https://groups.google.com/forum/#!searchin/llvm-dev/hsail/llvm-dev/46eEpS5h0E4/i3T9xw-DNVYJ
Here is our new proposal:
Format | Single Thread | System (renamed) | Intermediate |
Bitcode | zero | one | unsigned n |
Assembly | singlethread, synchscope(~0U) | empty (default), synchscope(0) | synchscope(n-1) |
In-memory | ~0U | zero | unsigned n-1 |
SelectionDAG | ~0U | zero | unsigned n-1 |
Name Mapping
Now we comes to name mapping from integers to strings. If a CLANG front end wants to map a language that has memory scopes (e.g. OpenCL) to LLVM IR, how does it determine what syncscopes to use? Without any rules, each target can define its own meaning for the scopes, can give them any name, and can map them to the LLVM-IR unit values in any way. In this case, I think each target have to provide a mapping function that maps a specific language’s name for a scope into that targets name for a scope that has conservatively the same semantics. Namely, the act of supporting a new language that has memory scopes requires every target to support that language to be updated accordingly.
Therefore, in order to allow front end writers to share memory scope definitions when they match to avoid the effort of updating all targets for each language,it's better to define standard memory scope names. A target is free to implement them or not, but if a target does implement them they must have the defined relational semantics (e.g., hierarchical nesting). If a target does implement them then it will be able to support any language that uses them, including languages not yet invented. A new memory scope name can be added if the existing ones are insufficient.
With the first try, we can define the standard scopes with what a common language that has memory scopes needs, e.g., OpenCL uses system, device, workgroup, workitem. It uses the same approach as LLVM has done for debug information. There are standard debug entities (that a common language (C) needs), and each new language uses those standard entities where there is a match, and subsequently defines only the delta.
Ping. Do we have any advice on this proposal? Thanks!
---
Best Regards,
Ke Bai
_______________________________________________ LLVM Developers mailing list llvm...@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
On Mar 28, 2016, at 7:17 PM, Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:Ke,
I'll be the bearer of bad news here. The radio silence this proposal has gotten probably means there is not enough interest in the community in this proposal to see it land.
One concern I have with the current proposal is that the optimization value of these scopes is not clear to me. Is it only the backend which is expected to support optimizations over these scopes? Or are you expecting the middle end optimizer to understand them? If so, I'd suspect we'd need a refined definition which allows us to discuss relative strengths of memory scopes.
More fundamentally, it's not clear to me that "scope" is even the right model for this. I could see a case where we'd want something along the lines of "acquire semantics on memory space 1, release semantics on memory space 2, cst_seq semantics on address space 3”.
This part of the proposal is formatted strangely and is a little confusing.
Was this supposed to be a table? Can you re-format so it is more clear what
is being proposed.
Thanks,
Tom
> > We still let the bitcode store memory scopes as *unsigned integers*,
> > since that is the easiest way to maintain compatibility. The values 0 and 1
> > are special. All other values are meaningful only within that bc file. In
> > addition, *a global metadata in the **file will provide a map* from
> > unsigned integers to string symbols which should be used to interpret all
> > the non-standard integers. If the global metadata is empty or non-existent,
> > then all non-zero values will be mapped to "system", which is the current
> > behavior.
> > The proposed syntax for synchronization scope is as follows:
> >
> > - Synchronization scopes are of arbitrary width, but implemented as
> > unsigned in the bitcode, just like address spaces.
> > - Cross-thread is default.
> > - Keyword "singlethread" is unchanged
> > - New syntax "synchscope(n)" for other target-specific scopes.
> > - There is no keyword for cross-thread, but it can be specified as
> > "synchscope(0)".
> >
> > The proposed new integer implementation expanded synchronization scopes
> > are as follows:
> > *Format*
> > *Single Thread*
> > *System (renamed)*
> > *Intermediate*
> > *Bitcode*
> > zero
> > one
> > unsigned n
> > *Assembly*
> > singlethread,
> > synchscope(~0U)
> > empty (default),
> > synchscope(0)
> > synchscope(n-1)
> > *In-memory*
> > ~0U
> > zero
> > unsigned n-1
> > *SelectionDAG*
> > *A **bitcode example with the proposal*
> > define void <at> test(i32* %addr) {
> > ; forward compatibility
> > cmpxchg i32* %addr, i32 42, i32 0 singlethread monotonic monotonic
> >
> > ; new synchscope that will be defined by each backend
> > cmpxchg i32* %addr, i32 42, i32 0 synchscope(2) monotonic monotonic, 2
> > cmpxchg i32* %addr, i32 42, i32 0 synchscope(3) monotonic monotonic, 3
> >
> > ret void
> > }
> >
> > !synchscope = metadata !{{i32 0, !"SingleThread"}, {i32 2, !"WorkGroup"},
> > ...}
> > =================================================================
> >
> > Thank you!
> >
> > ---
> > Best regards,
> > Ke
> >
>
>
>
> --
> Best Regard,
> Ke Bai, Ph.D.
> _______________________________________________
***********************************************************************
| Format | Single Thread | System (renamed) | Intermediate |
----------------------------------------------------------------------|
| Bitcode | zero | one | unsigned n |
| Assembly | singlethread, | empty (default), | synchscope(n-1) |
| synchscope(~0U) synchscope(0) |
| In-memory | ~0U | zero | unsigned n-1 |
| SelectionDAG | ~0U | zero | unsigned n-1 |
***********************************************************************
ret void
}
=================================================================
Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:
Something like:
cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}
cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}
...
!42 = !{"singlethread"}
!43 = !{"L2"}
I also avoids manipulating/pruning the global map, and target won't depend on integer to keep bitcode compatibility.
--
Mehdi
+1. I like this idea. Haven't given it serious thought, but on the
surface this sounds nice. We'd need to establish a naming scheme so that
targets could add meta tags without preventing upstream from adding
generic ones going forward.
Philip
This seems like it will work assuming that promoting something like "L2" to the
most conservative memory scope is legal (this is what would have to happen
if the metadata was dropped). Are there any targets where this type of'
promotion would be illegal?
-Tom
On 02.05.2016 17:46, Tom Stellard via llvm-dev wrote:
>> Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:
>> >Something like:
>> >
>> > cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}
>> > cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}
>> >
>> >...
>> >
>> >!42 = !{"singlethread"}
>> >!43 = !{"L2"}
>> >
>> >
>> >I also avoids manipulating/pruning the global map, and target won't depend on integer to keep bitcode compatibility.
>> >
>> >
> This seems like it will work assuming that promoting something like "L2" to the
> most conservative memory scope is legal (this is what would have to happen
> if the metadata was dropped). Are there any targets where this type of'
> promotion would be illegal?
+1
Sorry to enter this discussion so late, but in my opinion this
is a very good non-intrusive starting point solution.
Implementing it as a MD with the assumption of there being a safe
conservative memory scope to fall back to (in case the MD gets
stripped off for some reason) converts it to a mere
optimization hint without the need to touch LLVM IR instruction semantics.
Also, as it's only a MD, if we encounter a need to extend it later
towards a more integrated solution (or a mechanism to support
targets where this scheme is not feasible), we can more easily do so.
BR,
--
Pekka
The reasoning is that this information is similar to other information that is represented as instruction fields. For example, the indication that memory operations are atomic rather than non-atomic, the memory ordering of atomics, and whether per-thread or system scope. In all these cases this information has a semantic meaning for the instructions that can be exploited by optimizations. Representing it as meta data would mean this information could be dropped making the optimizations impossible with very significant performance penalty.
For example, if memory operations were not marked as being atomic, all memory operations would have to be generated as sequential consistent atomics at system scope. Although this "default" behavior is correct, it would not be very performant. Similarly, the memory ordering could use a "default" of sequentially consistent, which again is much less efficient than the weaker orderings. By analogy, the memory scope could also have a "default" of system scope, but that is also not performant when the scope is narrower.
In all these cases this information changes the semantics of the instructions. It affects whether a program is undefined behavior. Using a "default" value leads to those same programs being treated as having defined behavior (for example by eliminating data races).
Currently the atomic memory instructions have the own-thread/system recorded on the instruction which is a limited form of memory scope. The proposal is to replace this with a more general field that can have more than 2 values. Languages that do not use memory scopes can simply use the value corresponding to system scope.
We understand that it is good to avoid adding information to LLVM instructions that is not primary, but in this case it seems that the atomicity, memory ordering and memory scope are all equally primary information that characterize the semantics of memory instructions.
We have posted reviews that implement the proposal and invite everyone to discuss it:
http://reviews.llvm.org/D21723
http://reviews.llvm.org/D21724
Thank you,
Konstantin
On Jun 25, 2016, at 11:05 AM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:We believe that it would be best that this is added to the LLVM IR atomic memory instruction as fields on atomic instructions rather than using meta data.
The reasoning is that this information is similar to other information that is represented as instruction fields. For example, the indication that memory operations are atomic rather than non-atomic, the memory ordering of atomics, and whether per-thread or system scope. In all these cases this information has a semantic meaning for the instructions that can be exploited by optimizations. Representing it as meta data would mean this information could be dropped making the optimizations impossible with very significant performance penalty.
For example, if memory operations were not marked as being atomic, all memory operations would have to be generated as sequential consistent atomics at system scope. Although this "default" behavior is correct, it would not be very performant. Similarly, the memory ordering could use a "default" of sequentially consistent, which again is much less efficient than the weaker orderings. By analogy, the memory scope could also have a "default" of system scope, but that is also not performant when the scope is narrower.
In all these cases this information changes the semantics of the instructions. It affects whether a program is undefined behavior. Using a "default" value leads to those same programs being treated as having defined behavior (for example by eliminating data races).
Currently the atomic memory instructions have the own-thread/system recorded on the instruction which is a limited form of memory scope. The proposal is to replace this with a more general field that can have more than 2 values. Languages that do not use memory scopes can simply use the value corresponding to system scope.
We understand that it is good to avoid adding information to LLVM instructions that is not primary, but in this case it seems that the atomicity, memory ordering and memory scope are all equally primary information that characterize the semantics of memory instructions.
We have posted reviews that implement the proposal and invite everyone to discuss it:
http://reviews.llvm.org/D21723
http://reviews.llvm.org/D21724
I will comment - as one of the few people actually working on llvm's atomic implementation with any regularity - that I am opposed to extending the instructions without a strong motivating case. I don't care anywhere near as much about metadata based schemes, but extending the instruction semantics imposes a much larger burden on the rest of the community. That burden has to be well justified and supported.
Hi,
I have updated the review here:
https://reviews.llvm.org/D21723
As Sameer pointed out, the motivation is:
In OpenCL 2.x, two atomic operations on the same atomic object need to have the same scope to prevent a data race. This derives from the definition of "inclusive scope" in OpenCL 2.x. Encoding OpenCL 2.x scope as metadata in LLVM IR would be a problem because there cannot be a "safe default value" to be used when the metadata is dropped. If the "largest" scope is used as the default, then the optimizer must guarantee that the metadata is dropped from every atomic operation in the whole program, or not dropped at all.
Thanks,
Konstantin
From: Sameer Sahasrabuddhe [mailto:sam...@sbuddhe.net]
Sent: Sunday, July 10, 2016 4:06 AM
To: Philip Reames <list...@philipreames.com>
Cc: Mehdi Amini <mehdi...@apple.com>; Liu, Yaxun (Sam) <Yaxu...@amd.com>; Ke Bai <keba...@gmail.com>; Mekhanoshin, Stanislav <Stanislav....@amd.com>; Sumner, Brian <Brian....@amd.com>; llvm...@lists.llvm.org; Zhuravlyov, Konstantin
<Konstantin...@amd.com>; Tye, Tony <Tony...@amd.com>
Subject: Re: [llvm-dev] Memory scope proposal
On Mon, Jul 4, 2016 at 5:09 AM, Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:
>Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:
>Something like:
> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}
> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}
>...
>!42 = !{"singlethread"}
>!43 = !{"L2"}
>It is not clear to me if there is any correctness issues to dropping metadata?
Yes, we cannot use the metadata approach since this metadata can be dropped during the processing of one module but not dropped in the processing of a second module, potentially resulting in inconsistent scopes for synchronizing operations leading to data races and subsequently leading to correctness issues.
Thanks,
Konstantin
On Aug 17, 2016, at 2:08 PM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:>Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:>Something like:> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}>...>!42 = !{"singlethread"}>!43 = !{"L2"}>It is not clear to me if there is any correctness issues to dropping metadata?Yes, we cannot use the metadata approach since this metadata can be dropped during the processing of one module but not dropped in the processing of a second module, potentially resulting in inconsistent scopes for synchronizing operations leading to data races and subsequently leading to correctness issues.
In CUDA we have a similar problem as OpenCL. CUDA solves it by having
a bunch of atomic builtins for each of the memory scopes. These map
to various llvm target-specific intrinsics.
It's not great, because the intrinsics are mostly opaque to the
optimizer. But atomic ops are already pretty slow on the GPU, so I've
been operating under the assumption that this isn't hurting us too
much.
Am I wrong about that?
On Aug 17, 2016, at 2:08 PM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:
>Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:>Something like:> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}>...>!42 = !{"singlethread"}>!43 = !{"L2"}>It is not clear to me if there is any correctness issues to dropping metadata?Yes, we cannot use the metadata approach since this metadata can be dropped during the processing of one module but not dropped in the processing of a second module, potentially resulting in inconsistent scopes for synchronizing operations leading to data races and subsequently leading to correctness issues.
Right, I saw Sameer's explanation for that earlier, and we shouldn’t move forward (without Philip’s opinion on the topic as he expressed concerns).
On Aug 21, 2016, at 11:14 AM, Philip Reames <list...@philipreames.com> wrote:On 08/17/2016 03:05 PM, Mehdi Amini wrote:
Given my current time commitments, having me on the critical path for any proposal is not a good idea. I'm willing to step aside here as long as the proposal is well reviewed by someone who's familiar with the memory model. Hal, Sanjoy, JF, Chandler, and Danny would all be reasonable alternates.
On Aug 17, 2016, at 2:08 PM, Zhuravlyov, Konstantin <Konstantin...@amd.com> wrote:
>Why not going with a metadata attachment directly and kill the "singlethread" keyword? Something like:>Something like:> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!42}> cmpxchg i32* %addr, i32 42, i32 0 monotonic monotonic, 3, !memory.scope{!43}>...>!42 = !{"singlethread"}>!43 = !{"L2"}>It is not clear to me if there is any correctness issues to dropping metadata?Yes, we cannot use the metadata approach since this metadata can be dropped during the processing of one module but not dropped in the processing of a second module, potentially resulting in inconsistent scopes for synchronizing operations leading to data races and subsequently leading to correctness issues.
Right, I saw Sameer's explanation for that earlier, and we shouldn’t move forward (without Philip’s opinion on the topic as he expressed concerns).
On Aug 21, 2016, at 11:14 AM, Philip Reames <list...@philipreames.com> wrote:Given my current time commitments, having me on the critical path for any proposal is not a good idea. I'm willing to step aside here as long as the proposal is well reviewed by someone who's familiar with the memory model. Hal, Sanjoy, JF, Chandler, and Danny would all be reasonable alternates.OK, good to know. I put you on the path because you wrote:"I am opposed to extending the instructions without a strong motivating case. I don't care anywhere near as much about metadata based schemes, but extending the instruction semantics imposes a much larger burden on the rest of the community. That burden has to be well justified and supported."It is not clear to me right now if the "use case" makes it "well justified" or not (an alternative being using intrinsic for OpenCL as Justin Lebar mentioned). I don’t feel I can answer this, so adding CC Chandler and Sanjoy to begin with.
> Let me rephrase: why is it preferable to add first class instruction support for opaque scope rather than using intrinsics?
Given that LLVM core now supports atomic instructions, it would seem desirable to use them for all languages supported by LLVM. This would allow optimizations to be aware of the memory semantics. By using intrinsics this information no longer becomes available in a unified way. It also requires the target code generator to support both the LLVM atomics as well as all the additional intrinsics. In general it seems preferable to use a single approach rather than multiple approaches to express information.
Currently LLVM has support for atomics by specifying both the memory ordering and memory scope as enumerated values specified by an enum type. However, the values for memory scope currently only include those needed for CPU-style targeted languages. Languages such as CUDA and OpenCL introduce additional memory scopes into the memory model. This patch extends the existing enumeration for memory scope to allow the target to define additional memory scopes that can be used by such languages. The current bit code already represents the memory scope as a 32 bit unsigned value, so this change introduces no backward compatible issues.
Thanks,
-Tony
On Aug 23, 2016, at 2:00 PM, Tye, Tony <Tony...@amd.com> wrote:> Let me rephrase: why is it preferable to add first class instruction support for opaque scope rather than using intrinsics?Given that LLVM core now supports atomic instructions, it would seem desirable to use them for all languages supported by LLVM.
This would allow optimizations to be aware of the memory semantics.
> Since the scope is “opaque” and target specific, can you elaborate what kind of generic optimization can be performed?
Some optimizations that are related to a single thread could be done without needing to know the actual memory scope. For example, an atomic acquire can restrict reordering memory operations after it, but allow reordering of memory operations (except another atomic acquire) before it, regardless of the memory scope.
Thanks,
-Tony
[Sorry for chiming in so late.]
I understand why a straightforward metadata scheme won't work here,
but have you considered an alternate scheme that works in the
following way:
- We add a MD node called !nosynch that lists a set of "domains" a
certain memory operation does *not* synchronize with.
- Memory operations with !nosynch synchronize with memory operations
without any !nosynch metadata (so dropping !nosynch is safe).
This will only work if your frontend knows, ahead of time, what the
possible set of synch-domains are, but it presumably already knows
that (otherwise how do you map domain names to integers)?
The other disadvantage with the scheme above is that memory operations
on the "normal CPU heap" (pardon my GPU n00b-ness here :) ) will synch
with the memory operations with !nosynch metadata. However, we can
solve that by modeling the "normal CPU heap" as "!nosynch
!{!special_domain_a, !special_domain_b, ... all domains except
!cpu_heap_domain}".
Thanks,
-- Sanjoy
Sanjoy Das wrote:
> I understand why a straightforward metadata scheme won't work here,
> but have you considered an alternate scheme that works in the
> following way:
>
> - We add a MD node called !nosynch that lists a set of "domains" a
> certain memory operation does *not* synchronize with.
>
> - Memory operations with !nosynch synchronize with memory operations
> without any !nosynch metadata (so dropping !nosynch is safe).
I missed a spot here ^, !nosynch metadata will also have to have a
sub-node for the kind of synch-domain *it* is in. The synchs-with
relation is then:
bool SynchsWith(MemOp A, MemOp B) {
MD_A = A.getMD(MD_nosynch);
MD_B = B.getMD(MD_nosynch);
if (!MD_A || !MD_B)
return true;
return MD_B.nosync_list.contains(MD_A.id) ||
MD_A.nosync_list.contains(MD_B.id);
}
I'm still not a 100% convinced that the above works, but I think there
are advantages to expressing synch scopes as metadata. For instance,
the optimizer already "knows" what to do with the metadata on loads it
speculates.
I’m not sure, but isn’t the synchscope id (or domains as you seem to call it) intended to change which instruction would be actually codegen?
In which case I’m not sure dropping it is ever a good idea, even when it does not affect correctness it would dramatically affect performance.
—
Mehdi
Mehdi Amini wrote:
> I’m not sure, but isn’t the synchscope id (or domains as you seem to
> call it) intended to change which instruction would be actually
FYI, I don't know what the right term for this is. :)
> codegen?
>
> In which case I’m not sure dropping it is ever a good idea, even
> when it does not affect correctness it would dramatically affect
> performance.
Sure, that is a good reason to avoid a metadata based approach. I was
just unconvinced by the current argument of (at least on the commit
message) "it can't be done", because I think a scheme modeled after
the !tbaa style metadata scheme has a chance of working. If the
fundamental reason why we want a non-MD scheme is something other than
"it can't be done", I'm fine with that.
My understanding is that (I leave the AMD folks (Tony?) correct this if I missed something):
1) we want to preserve the synchscope information because it is important for codegen. This is usually done with intrinsics in other backends
2) not doing with intrinsic would allow the optimizer to reason about the atomic operation within a single scope, but couldn’t assume anything across scope
I don’t know enough to assert if 2 is compelling enough to have first class support in the IR for something that is so “opaque”.
—
Mehdi
Right, it's clear to me that there exist optimizations that you cannot
do if we model these ops as target-specific intrinsics.
But what I think Mehdi and I were trying to get at is: How much of a
problem is this in practice? Are there real-world programs that
suffer because we miss these optimizations? If so, how much?
The reason I'm asking this question is, there's a real cost to adding
complexity in LLVM. Everyone in the project is going to pay that
cost, forever (or at least, until we remove the feature :). So I want
to try to evaluate whether paying that cost is actually worth while,
as compared to the simple alternative (i.e., intrinsics). Given the
tepid response to this proposal, I'm sort of thinking that now may not
be the time to start paying this cost. (We can always revisit this in
the future.) But I remain open to being convinced.
As a point of comparison, we have a rule of thumb that we'll add an
optimization that increases compilation time by x% if we have a
benchmark that is sped up by at least x%. Similarly here, I'd want to
weigh the added complexity against the improvements to user code.
-Justin
I think the cost of adding this information to the IR is really low.
There is already a sychronization scope field present for LLVM atomic
instructions, and it is already being encoded as 32-bits, so it is
possible to represent the additional scopes using the existing bitcode
format. Optimization passes are already aware of this synchronization
scope field, so they know how to preserve it when transforming the IR.
The primary goal here is to pass synchronization scope information from
the fronted to the backend. We already have a mechanism for doing this,
so why not use it? That seems like the lowest cost option to me.
-Tom
On 09/01/2016 08:52 AM, Tom Stellard via llvm-dev wrote:
> On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
>>> Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
>> Right, it's clear to me that there exist optimizations that you cannot
>> do if we model these ops as target-specific intrinsics.
>>
>> But what I think Mehdi and I were trying to get at is: How much of a
>> problem is this in practice? Are there real-world programs that
>> suffer because we miss these optimizations? If so, how much?
>>
>> The reason I'm asking this question is, there's a real cost to adding
>> complexity in LLVM. Everyone in the project is going to pay that
>> cost, forever (or at least, until we remove the feature :). So I want
>> to try to evaluate whether paying that cost is actually worth while,
>> as compared to the simple alternative (i.e., intrinsics). Given the
>> tepid response to this proposal, I'm sort of thinking that now may not
>> be the time to start paying this cost. (We can always revisit this in
>> the future.) But I remain open to being convinced.
>>
> I think the cost of adding this information to the IR is really low.
> There is already a sychronization scope field present for LLVM atomic
> instructions, and it is already being encoded as 32-bits, so it is
> possible to represent the additional scopes using the existing bitcode
> format. Optimization passes are already aware of this synchronization
> scope field, so they know how to preserve it when transforming the IR.
I disagree with this assessment. Atomics are an area where additional
complexity has a *substantial* conceptual cost. I also question whether
the single_thread scope is actually respected throughout the optimizer
in practice.
I view the request of changing the IR as a fairly big ask. In
particular, I'm really nervous about what the exact optimization
semantics of such scopes would be. Depending on how that was defined,
this could be anything from fairly straight forward to outright messy.
In particular, if there are optimizations which are legal for only some
subset of scopes (or subset of pairs of scopes?), I'd really like to see
a clear definition given for how those are defined.
(p.s. Is there a current patch with an updated LangRef for the proposal
being discussed? I've lost track of it.)
Let me give an example proposal just to illustrate my point. This isn't
really a counter proposal per se, just me thinking out loud.
Say we added 32 distinct concurrent domains. One of them is used for
"single_thread". One is used for "everything else". The remaining 30
are defined in a target specific manner w/the exception that they can't
overlap with each other or with the two predefined ones. The effect of
a given atomic operation with respect to each concurrency domain could
be defined in terms of a 32 bit mask. If a bit was set, the operation
is ordered (according to the separately stated ordering) with that
domain. If not, it is explicitly unordered w.r.t. that domain. A
memory operation would be tagged with the memory domains which which it
might interact.
The key bit here is that I can describe transformations in terms of
these abstract domains without knowing anything about how the frontend
might be using such a domain or how the backend might lower it. In
particular, if I have the sequence:
%v = load i64, %p atomic scope {domain3 only}
fence seq_cst scope={domain1 only}
%v2 = load i64, %p atomic scope {domain3 only}
I can tell that the two loads aren't order with respect to the fence and
that I can do load forwarding here.
In general, an IR extension needs to be well defined, general enough to
be used by multiple distinct users, and fairly battle tested design
wise. We're not completely afraid of having to remove bad ideas from
the IR, but we really try to avoid adding things until they're fairly
proven.
Let me give an example proposal just to illustrate my point. This isn't really a counter proposal per se, just me thinking out loud.
Say we added 32 distinct concurrent domains. One of them is used for "single_thread". One is used for "everything else". The remaining 30 are defined in a target specific manner w/the exception that they can't overlap with each other or with the two predefined ones. The effect of a given atomic operation with respect to each concurrency domain could be defined in terms of a 32 bit mask. If a bit was set, the operation is ordered (according to the separately stated ordering) with that domain. If not, it is explicitly unordered w.r.t. that domain. A memory operation would be tagged with the memory domains which which it might interact.
The key bit here is that I can describe transformations in terms of these abstract domains without knowing anything about how the frontend might be using such a domain or how the backend might lower it. In particular, if I have the sequence:
%v = load i64, %p atomic scope {domain3 only}
fence seq_cst scope={domain1 only}
%v2 = load i64, %p atomic scope {domain3 only}
I can tell that the two loads aren't order with respect to the fence and that I can do load forwarding here.
Currently the Synchronization Scope (aka memory scope) information appears not to be used in any atomic related optimizations. It would seem any such optimizations should consider memory scope and an approach such as suggested by Philip seems reasonable. Could that change be tackled as a separate patch? Initially any atomic optimizations could be restricted to only be allowed when the memory scopes are exactly equal which should be conservatively correct.
This patch would be a first step towards adding support for atomics with memory scopes used by languages such as OpenCL. Doing this simplifies how memory scope information is passed from CLANG to the code generator as mentioned by Tom. There seems to be several companies interested in doing this as it will simplify the code and allow atomics to be handled in a consistent way for all languages, and allow atomic optimizations to benefit these languages.
Thanks,
-Tony
The key bit here is that I can describe transformations in terms of these abstract domains without knowing anything about how the frontend might be using such a domain or how the backend might lower it. In particular, if I have the sequence:%v = load i64, %p atomic scope {domain3 only}fence seq_cst scope={domain1 only}%v2 = load i64, %p atomic scope {domain3 only}I can tell that the two loads aren't order with respect to the fence and that I can do load forwarding here.I see the current proposal as a strip-down version what you describe: the optimizer can reason about operations inside a single scope, but can’t assume anything cross-scope (they may or may not interact with each other).What you describes seems like having always non-overlapping domains (from the optimizer point of view), and require the frontend to express the overlapping by attaching a “list" of domains that an atomic operation interacts with.