atomic memory_order with command or with fence

itaj sherman

May 24, 2012, 2:38:12 PM

When specifying atomic memory_order_release/acquire, what is the
difference between specifying it on a particular operation and using a
separate atomic_thread_fence?

Aren't the following options equivalent w.r.t. memory ordering of all
operations:

//option1
x.store( r, memory_order_release );

//option2
atomic_thread_fence( memory_order_release );
x.store( r, memory_order_relaxed );

and same goes for loads:

//option1
r = x.load( memory_order_acquire );

//option2
r = x.load( memory_order_relaxed );
atomic_thread_fence( memory_order_acquire );

Reading sections 1.10 and 29.8 of the standard, it seems that these are
equivalent, but the standard doesn't make that as obvious as I feel it
should be.
In essence, wouldn't the compiler be allowed to produce the same
output?
I only refer to fences used as in this example, a single non-conditional
fence, not to other uses within loops and conditions (where
there's a straightforward performance advantage in avoiding
barriers).
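
For contrast, the loop case set aside above might look like the following. This is only a sketch, and the names wait_acquire_each_time and wait_then_fence are made up for illustration. The first version requests acquire ordering on every poll, which may cost a barrier per iteration on some hardware; the second polls with relaxed loads and issues a single acquire fence once the flag has been seen.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: every poll is an acquire load.
inline void wait_acquire_each_time(std::atomic<bool>& ready) {
    while (!ready.load(std::memory_order_acquire)) {
        // spin
    }
}

// Hypothetical helper: relaxed polls, then one acquire fence after the loop.
inline void wait_then_fence(std::atomic<bool>& ready) {
    while (!ready.load(std::memory_order_relaxed)) {
        // spin
    }
    std::atomic_thread_fence(std::memory_order_acquire);
}
```

Either helper would be called by the reading thread before it touches the data that the writer published ahead of setting `ready`.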

thanks
itaj


itaj sherman

May 27, 2012, 5:08:33 PM

Maybe I should clarify my question.
I am referring to sections 1.10 and 29.8 of the standard, regarding the
effects of atomic_thread_fence (with release and acquire orders) on
memory ordering, although the standard doesn't give this example
explicitly.

On May 24, 9:38 pm, itaj sherman <itajsher...@gmail.com> wrote:
> When specifying atomic memory_order_relase/acquire, what is the
> difference between specifying on a certain command to a separate
> atomic_thread_fence?
>
> Aren't the following options equivalent w.r.t. memory ordering of all
> operations:
>
> //option1
> x.store( r, memory_order_release );
>
> //option2
> atomic_thread_fence( memory_order_release );
> x.store( r, memory_order_relaxed );
>

I'll put this code in functions, to clarify the context of x and r:

#include <atomic>

template< typename T >
void my_store_release_1( std::atomic<T>& x, T r )
{
    // release ordering attached to the store itself
    x.store( r, std::memory_order_release );
}

template< typename T >
void my_store_release_2( std::atomic<T>& x, T r )
{
    // standalone release fence, then a relaxed store
    std::atomic_thread_fence( std::memory_order_release );
    x.store( r, std::memory_order_relaxed );
}

It seems to me that these two functions are equivalent, in that they
have the same effect on memory ordering.
Is that so?
Is it true that a call to one of these can always be replaced by a call
to the other?
(Of course I mean in a conforming implementation, in a program without
data races.)

The same I would ask about load and acquire:

> and same goes for loads:
>
> //option1
> r = x.load( memory_order_acquire );
>
> //option2
> r = x.load( memory_order_relaxed );
> atomic_thread_fence( memory_order_acquire );
>

template< typename T >
T my_load_acquire_1( std::atomic<T>& x )
{
    // relaxed load, then a standalone acquire fence
    T const r = x.load( std::memory_order_relaxed );
    std::atomic_thread_fence( std::memory_order_acquire );
    return r;
}

template< typename T >
T my_load_acquire_2( std::atomic<T>& x )
{
    // acquire ordering attached to the load itself
    T const r = x.load( std::memory_order_acquire );
    return r;
}

> Reading the standard 1.10, 29.8 it seems that these are equivalent,
> but it doesn't make it too obvious, as I feel it's supposed to be.
> In essence, wouldn't the compiler be allowed to produce the same
> output?
> I only refer to the fences as in this example, a single non-
> conditional fence, not other uses within loops and conditions (where
> there's a straight forward performance advantage in avoiding
> barriers).

{ quoted signature removed -mod }

Is it also the same for read-modify-write operations? (not the ones
with conditional writes, just the ones that always write). Again it
seems so to me, but the standard doesn't say that explicitly.
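
To make the read-modify-write question concrete, here is a sketch of the two spellings for an unconditional RMW, using fetch_add. The function names are hypothetical, and whether the two are interchangeable is exactly the open question; the code only shows the two forms side by side.

```cpp
#include <atomic>
#include <cassert>

// Form 1: the release ordering is attached to the read-modify-write itself.
inline int fetch_add_release_1(std::atomic<int>& x, int n) {
    return x.fetch_add(n, std::memory_order_release);
}

// Form 2: a standalone release fence followed by a relaxed read-modify-write.
inline int fetch_add_release_2(std::atomic<int>& x, int n) {
    std::atomic_thread_fence(std::memory_order_release);
    return x.fetch_add(n, std::memory_order_relaxed);
}
```

Both return the value held before the addition, so in a single thread they are indistinguishable; any difference could only show up in what other threads are guaranteed to observe.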

Zoltan Juhasz

May 29, 2012, 3:54:38 PM

On Sunday, 27 May 2012 17:08:33 UTC-4, itaj sherman wrote:
> I'll put this code in functions, to clarify the context of x and r:
>
> template< typename T >
> void my_store_release_1( std::atomic<T>& x, T r )
> {
> x.store( r, memory_order_release );
> }
>
> template< typename T >
> void my_store_release_2( std::atomic<T>& x, T r )
> {
> std::atomic_thread_fence( memory_order_release );
> x.store( r, memory_order_relaxed );
> }

Disclaimer: I am most certainly not an expert in this area, but based
on my current understanding of the topic, I believe these are not
the same. Hopefully someone with more experience will clarify.

A fence or atomic store operation marked with
'memory_order_release' introduces an inter-thread happens-before
relationship for store operations that appear before the
'memory_order_release' fence or atomic store operation, provided
it is paired with an acquire counterpart.

Conversely, it introduces no happens-before relationship for
operations that appear after the store / fence marked with
'memory_order_release', with regard to their visibility in another
thread.

In this case the fence, marked with 'memory_order_release',
introduces no happens-before relationship for the store to x,
as far as the visibility of that store in another thread is
concerned, since the store appears after the fence.

I believe if you write:

template< typename T >
void my_store_release_3( std::atomic<T>& x, T r )
{
    x.store( r, std::memory_order_relaxed );
    std::atomic_thread_fence( std::memory_order_release );
}

Then 1 and 3 are equivalent, as far as the introduced inter-thread
happens-before relationship is concerned.

> The same I would ask about load and acquire:

> template< typename T >
> T my_load_acquire_1( std::atomic<T>& x )
> {
> T const r = x.load( memory_order_relaxed );
> std::atomic_thread_fence( memory_order_acquire );
> return r;
> }
>
> template< typename T >
> T my_load_acquire_2( std::atomic<T>& x )
> {
> T const r = x.load( memory_order_acquire );
> return r;
> }

The situation is similar here: the 'memory_order_acquire' fence does
not impose a happens-before relationship on the load of x, since the
load appears before the fence.

The correct way is:

template< typename T >
T my_load_acquire_3( std::atomic<T>& x )
{
    std::atomic_thread_fence( std::memory_order_acquire );
    T const r = x.load( std::memory_order_relaxed );
    return r;
}

Of course, by itself the load / store part is meaningless; the
acquire has to be paired with a release to see the complete picture.

In your example the order of operations vs. fence might have been
just an oversight, but I wanted to clarify, because it is very
important. As far as release-acquire semantics are concerned, I
believe that 'my_store_release_1' (or store_3) paired with
'my_load_acquire_2' (or load_3) all have the same effect, as far as
the happens-before relationship is concerned.

I would venture to say that the only semantic difference is what
introduces the happens-before relationship: the fence(s), or the
atomic store / load operation(s).

PS: I am not sure if one is allowed to refer to / advertise books,
but a thorough discussion of the C++11 memory model can be found
in the book C++ Concurrency in Action (Chapter 5) by Anthony
Williams, where he asks a very similar question with a very
similar example, and reaches the same conclusion. I can highly
recommend that book.

-- Zoltan

itaj sherman

May 29, 2012, 8:39:27 PM

I will give this one example, where I'm pretty sure I can demonstrate
what I mean.
However, my question remains whether it's always so.

On May 29, 10:54 pm, Zoltan Juhasz <zoltan.juh...@gmail.com> wrote:
> On Sunday, 27 May 2012 17:08:33 UTC-4, itaj sherman wrote:
> > I'll put this code in functions, to clarify the context of x and r:
>
> > template< typename T >
> > void my_store_release_1( std::atomic<T>& x, T r )
> > {
> > x.store( r, memory_order_release );
> > }
>
> > template< typename T >
> > void my_store_release_2( std::atomic<T>& x, T r )
> > {
> > std::atomic_thread_fence( memory_order_release );
> > x.store( r, memory_order_relaxed );
> > }
>
> Disclaimer: I am most certainly not an expert on this area, but based
> on my current understanding on the topic, I believe these are not
> the same. Hopefully someone, who has more experience, will clarify.
>
> A fence or atomic store operation that is marked with
> 'memory_order_release' introduces inter-thread, happens-before
> relationship on store operations that appear before the
> 'memory_order_release' fence or atomic store operation - given
> it is paired with an acquire counterpart.

But in order to synchronize a release fence with an acquire fence,
you need an atomic variable and a store to it sequenced after the
release fence, whose value is read by a load that is sequenced before
the acquire fence.

standard 29.8-p2:
A release fence A synchronizes with an acquire fence B if there exist
atomic operations X and Y, both operating on some atomic object M,
such that A is sequenced before X, X modifies M, Y is sequenced before
B, and Y reads the value written by X or a value written by any side
effect in the hypothetical release sequence X would head if it were a
release operation.

Operations X and Y in my code were meant to be x.store and x.load.
That is why I deliberately placed them before or after the fence
inside the functions as I did.
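
The roles of A, X, Y and B in 29.8-p2 can be sketched as a small two-thread handshake. This is a minimal illustration under my own assumptions, not code from the thread: the function name run_fence_handshake and the variables payload, m and seen are invented, and the reader spins so that Y is certain to read the value written by X.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs one writer and one reader and returns what the reader saw in `payload`.
// All names are illustrative only.
inline int run_fence_handshake() {
    int payload = 0;          // plain, non-atomic data
    std::atomic<int> m(0);    // the atomic object M of 29.8-p2
    int seen = -1;

    std::thread writer([&] {
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // fence A
        m.store(1, std::memory_order_relaxed);                // operation X
    });
    std::thread reader([&] {
        while (m.load(std::memory_order_relaxed) != 1) {      // operation Y
            // spin until Y reads the value written by X
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // fence B
        seen = payload;  // A synchronizes with B, so this read is race-free
    });

    writer.join();
    reader.join();
    return seen;
}
```

Because the reader leaves its loop only after Y has read the value written by X, fence A synchronizes with fence B and the function returns 42.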

>
> Conversely, it introduces no happens-before relationship on
> operations that appear after the store / fence marked with
> 'memory_order_release', in regards their visibility in another
> thread.
>
> In this case the fence, marked with 'memory_order_release',
> introduces no happens-before relationship on the store of x in
> regards of the visibility of the store on x in another thread,
> since the store appears after the fence.
>

Right, it doesn't order x, and I didn't mean for it to. The point was
for x to cause a synchronization (an optional one) between the fences,
so that stores that were sequenced before the release fence are
certainly visible to loads that happen after the acquire fence.

So I can show the following use example, in which I think 1 and 2 are
equivalent.
But I'm looking for an answer whether it is always true.

std::atomic<int> atomic_data( 0 );
std::atomic<int> atomic_flag( 0 ); //change flag to 1 when data can be read.

//thread#1
int data;
std::cin >> data;
atomic_data.store( data, std::memory_order_relaxed );
my_store_release_XXX( atomic_flag, 1 ); //XXX is one of the above versions

//thread#2
int const current_flag = my_load_acquire_XXX( atomic_flag );
int const current_data = atomic_data.load( std::memory_order_relaxed );
if( current_flag == 1 ) {
    //the atomic_flag store_release synchronizes with load_acquire,
    //therefore the atomic_data store happens before the load.
    std::cout << "data arrived " << current_data; //must be what came in from std::cin
} else {
    //no certain synchronization
    std::cout << "no flag for data arrived "; //data may be 0, or may already have changed.
}

So, I expect we can agree without further explanation that when using
my_store_release_1/my_load_acquire_1 this example works as expected
(per standard 1.10).

Now regarding 29.8-p2, I assert that with my versions
my_store_release_2/my_load_acquire_2
this works just the same, at least in this example, because the code
would convert to:

//inlining the functions of versions 2:

//thread#1
int data;
std::cin >> data;
atomic_data.store( data, std::memory_order_relaxed );
std::atomic_thread_fence( std::memory_order_release ); // <-- fence A
atomic_flag.store( 1, std::memory_order_relaxed ); // <-- store operation X

//thread#2
int const current_flag = atomic_flag.load( std::memory_order_relaxed ); // <-- load operation Y
std::atomic_thread_fence( std::memory_order_acquire ); // <-- fence B
int const current_data = atomic_data.load( std::memory_order_relaxed );
if( current_flag == 1 ) {
    //in this case, the value of flag implies that fence A synchronized with fence B per 29.8-p2
    std::cout << "data arrived " << current_data; //must be what came in from std::cin
} else {
    //no certain synchronization
    std::cout << "no flag for data arrived "; //data may be 0, or may already have changed.
}

I will also assert that it still works (in this example) when changing
just one of the function versions, thus mixing
my_store_release_1/my_load_acquire_2 or
my_store_release_2/my_load_acquire_1.
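
A sketch of the two mixed pairings, which 29.8 paragraphs 3 and 4 appear to cover (release fence with an acquire operation, and release operation with an acquire fence). The function run_mixed_pairing and its parameter fence_on_store are invented for illustration, and the reader spins so the flag value is certain to be observed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative only. `fence_on_store` picks which side uses the standalone
// fence; the other side uses an ordered operation on the flag itself.
inline int run_mixed_pairing(bool fence_on_store) {
    int payload = 0;
    std::atomic<int> flag(0);
    int seen = -1;

    std::thread writer([&] {
        payload = 7;
        if (fence_on_store) {
            std::atomic_thread_fence(std::memory_order_release);
            flag.store(1, std::memory_order_relaxed);
        } else {
            flag.store(1, std::memory_order_release);
        }
    });
    std::thread reader([&] {
        if (fence_on_store) {
            // release fence paired with an acquire load
            while (flag.load(std::memory_order_acquire) != 1) {}
        } else {
            // release store paired with a relaxed load plus an acquire fence
            while (flag.load(std::memory_order_relaxed) != 1) {}
            std::atomic_thread_fence(std::memory_order_acquire);
        }
        seen = payload;  // ordered after the writer's assignment in both cases
    });

    writer.join();
    reader.join();
    return seen;
}
```

Calling it with true and then with false exercises both mixtures; in each case the reader should see the payload written before the flag.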

But this example is just one case, I want to know whether they are
always equivalent.

On the other hand, I don't see that it would work with your version 3.
It actually seems like a counter example.

//inlining the functions of versions 3:

//thread#1
int data;
std::cin >> data;
atomic_data.store( data, std::memory_order_relaxed );
atomic_flag.store( 1, std::memory_order_relaxed ); // <-- store operation X
std::atomic_thread_fence( std::memory_order_release ); // <-- fence A

//thread#2
std::atomic_thread_fence( std::memory_order_acquire ); // <-- fence B
int const current_flag = atomic_flag.load( std::memory_order_relaxed ); // <-- load operation Y
int const current_data = atomic_data.load( std::memory_order_relaxed );
if( current_flag == 1 ) {
    //it might be possible to load the value of store operation X even when fence A did not occur yet.
    //in such a case, it is uncertain what value of atomic_data is loaded.
    std::cout << "data arrived " << current_data;
} else {
    std::cout << "no flag for data arrived ";
}

itaj

Pete Becker

May 29, 2012, 10:21:30 PM

On 2012-05-30 00:39:27 +0000, itaj sherman said:

>
> Right, it doesn't order x, I didn't mean for it to. The point was for
> x to
> cause a synchronization (an optional one) on the fences. So that
> stores that
> were sequenced before the release fence, be certainly visible to loads
> that
> happen after the acquire fence.

That's not quite right. The fence causes the synchronization. But the
only way for the second thread to know that the synchronization has
occurred is to see the value that the first thread wrote into x. So
from a coding perspective, once you read the correct value, you know
that all the stuff that happened before the fence is visible in your
thread. If you haven't read the correct value, it could simply be
because the other thread hasn't gotten there yet.
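
The coding pattern described here, checking the value once and trusting the data only when the expected value was read, might be sketched like this. The names g_payload, g_x, publish and try_consume are made up for the illustration.

```cpp
#include <atomic>
#include <cassert>

// Illustrative globals and names; not from the thread.
int g_payload = 0;
std::atomic<int> g_x(0);

// Writer side: data first, then the release fence, then the relaxed store.
void publish(int value) {
    g_payload = value;
    std::atomic_thread_fence(std::memory_order_release);
    g_x.store(1, std::memory_order_relaxed);
}

// Polls once. Returns true and fills `out` only if the writer's value of g_x
// was observed; returning false says nothing about how far the writer got.
bool try_consume(int& out) {
    if (g_x.load(std::memory_order_relaxed) != 1) {
        return false;
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    out = g_payload;
    return true;
}
```

A reader that gets false simply tries again later; it never touches g_payload on that path, so there is no data race.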

--
Pete

Zoltan Juhasz

May 29, 2012, 10:21:47 PM

On Tuesday, 29 May 2012 20:39:27 UTC-4, itaj sherman wrote:
> //thread#1
> int data;
> std::cin >> data;
> atomic_data.store( data, memory_order_relaxed );
> atomic_flag.store( 1, memory_order_relaxed ); // <-- store operation X
> std::atomic_thread_fence( memory_order_release ); // <-- fence A
>
> //thread#2
> std::atomic_thread_fence( memory_order_acquire ); // <-- fence B
> int const current_flag = atomic_flag.load( memory_order_relaxed ); //
> <-- load operation Y
> int const current_data = atomic_data.load( memory_order_relaxed );
> if( flag == 1 ) {
> //it might be possible to load the value of store operation X even
> when fence A did not occur yet.

Ah, I see, this is the source of the confusion. Atomic operations,
and the memory orderings applied to atomic operations or
fences (e.g. sequentially consistent, acquire-release, relaxed), are
pretty much independent of thread synchronization. They provide
no synchronization guarantee between threads; they provide a
happens-before relationship between atomic operations.

Let me give you a short example
(credit goes to C++ Concurrency in Action):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic< bool > x,y,z;

void write_x_then_y()
{
    x.store( true, std::memory_order_relaxed );
    y.store( true, std::memory_order_release );
}

void read_y_then_x()
{
    // y is used as synchronization point
    // by spin waiting for y to be set to true
    while( !y.load( std::memory_order_acquire ) );

    if( x.load( std::memory_order_relaxed ) )
        z.store( true, std::memory_order_relaxed );
}

int main()
{
    x = false;
    y = false;
    z = false;

    std::thread a( write_x_then_y );
    std::thread b( read_y_then_x );

    // synchronization point with thread a...
    a.join();

    // ... and b
    b.join();

    // z cannot be false
    assert( z.load() );
}

As you can see, there is an explicit synchronization point,
between thread 'a' and 'b' (see the spin-wait on y).

The acquire-release semantic is used to make sure that once
y is set to be true, the store to x also becomes visible in
thread b.

So the acquire-release semantics introduce a happens-before
relationship between the store operations on x and y; this has
absolutely nothing to do with inter-thread synchronization,
only with the ordering of atomic operations relative to each other.

Fences are similar, just they are not tied to any atomic operation.

If you remove the spin waiting on y:

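void bad_read_y_then_x()
{
    // single load, no spin-wait: y may still be false here
    y.load( std::memory_order_acquire );
    if( x.load( std::memory_order_relaxed ) )
        z.store( true, std::memory_order_relaxed ); // atomic<bool> has no operator++
}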

then of course anything could happen: y might be true or false.
If it is true, then x is guaranteed to be true; otherwise x
might be true or false, depending on the exact scheduling of the
threads, so this version of the function introduces a race
condition.

Again, enforcing a certain memory ordering on atomic operations or
fences does not introduce a synchronization point between
threads; on the other hand, they can be used to create a
synchronization point (e.g. a spin-wait).

-- Zoltan

itaj sherman

May 31, 2012, 12:18:35 AM

On May 30, 5:21 am, Pete Becker <p...@versatilecoding.com> wrote:
> On 2012-05-30 00:39:27 +0000, itaj sherman said:
>
> > Right, it doesn't order x, I didn't mean for it to. The point was
> > for x to cause a synchronization (an optional one) on the
> > fences. So that stores that were sequenced before the release
> > fence, be certainly visible to loads that happen after the acquire
> > fence.
>
> That's not quite right. The fence causes the synchronization. But the
> only way for the second thread to know that the synchronization has
> occurred is to see the value that the first thread wrote into x. So
> from a coding perspective, once you read the correct value, you know
> that all the stuff that happened before the fence is visible in your
> thread. If you haven't read the correct value it could simply because
> the other thread hasn't gotten there yet.
>

Oh, I get what you mean. I only meant "x could imply synchronization",
not "cause": the pair of store and load (where the load reads the
stored value) implies that the fences synchronize, per 29.8-p2.

itaj

--

itaj sherman

May 31, 2012, 3:06:31 PM

On May 30, 5:21 am, Pete Becker <p...@versatilecoding.com> wrote:

> On 2012-05-30 00:39:27 +0000, itaj sherman said:
> > Right, it doesn't order x, I didn't mean for it to. The point was for
> > x to
> > cause a synchronization (an optional one) on the fences. So that
> > stores that
> > were sequenced before the release fence, be certainly visible to loads
> > that
> > happen after the acquire fence.
>
> That's not quite right. The fence causes the synchronization. But the
> only way for the second thread to know that the synchronization has
> occurred is to see the value that the first thread wrote into x. So
> from a coding perspective, once you read the correct value, you know
> that all the stuff that happened before the fence is visible in your
> thread. If you haven't read the correct value it could simply because
> the other thread hasn't gotten there yet.
>

Yeah, this is what I meant by "x cause synchronization (optional) on
the fences": if the load in my_load_acquire_2 reads the value that the
store in my_store_release_2 wrote into x, then it implies
that their fences are synchronized.
I'm pretty sure I can rely on that in the simple example below,
where the variable atomic_flag takes the place of x, but this doesn't
prove that they are always equivalent.

itaj

--

itaj sherman

May 31, 2012, 3:07:21 PM

On May 30, 3:39 am, itaj sherman <itajsher...@gmail.com> wrote:
> I will give this one example, where I'm pretty sure I can demostrate
> what I mean.
> However my question remains whether it's always so.
>
> On May 29, 10:54 pm, Zoltan Juhasz <zoltan.juh...@gmail.com> wrote:
>
> So I can show the following use example, in which I think 1 and 2 are
> equivalent.
> But I'm looking for an answer whether it is always true.
>
> std::atomic<int> atomic_data( 0 );
> std::atomic<int> atomic_flag( 0 ); //change flag to 1 when data can be
> read.
>
> //thread#1
> int data;
> std::cin >> data;
> atomic_data.store( data, memory_order_relaxed );
> my_store_release_XXX( atomic_flag, 1 ); //XXX is one of the above
> versions
>
> //thread#2
> int const current_flag = my_load_acquire_XXX( atomic_flag );
> int const current_data = atomic_data.load( memory_order_relaxed );
> if( flag == 1 ) {

I meant (current_flag==1). flag is not defined.

> //the atomic_flag store_release synchronizes with load_acquire
> //therefor the atomic_data store happens before the load.
> std::cout << "data arrived " << current_data; //must be what came in std::cin
> } else {
> //no certain synchronization
> std::cout << "no flag for data arrived "; //data maybe 0, maybe already changed.
> }
>

Pete Becker

May 31, 2012, 7:05:42 PM

On 2012-05-31 04:18:35 +0000, itaj sherman said:

> On May 30, 5:21 am, Pete Becker <p...@versatilecoding.com> wrote:
>> On 2012-05-30 00:39:27 +0000, itaj sherman said:
>>
>>> Right, it doesn't order x, I didn't mean for it to. The point was
>>> for x to cause a synchronization (an optional one) on the
>>> fences. So that stores that were sequenced before the release
>>> fence, be certainly visible to loads that happen after the acquire
>>> fence.
>>
>> That's not quite right. The fence causes the synchronization. But the
>> only way for the second thread to know that the synchronization has
>> occurred is to see the value that the first thread wrote into x. So
>> from a coding perspective, once you read the correct value, you know
>> that all the stuff that happened before the fence is visible in your
>> thread. If you haven't read the correct value it could simply because
>> the other thread hasn't gotten there yet.
>>
>
> Oh, I get what you mean, I only meant "x could imply synchronization"
> not "cause", by that the pair of store and load (which read the stored
> value) imply that the fences synchronize, per 29.8-p2
>

Sorry, I was a bit sloppy. The fence enforces visibility, which is what
I referred to as "synchronization". You used "synchronization" in a
higher-level sense, as in, ensuring that you've got visibility before
proceeding.

--
Pete

itaj sherman

Jun 3, 2012, 3:44:20 PM

On May 24, 9:38 pm, itaj sherman <itajsher...@gmail.com> wrote:

I went looking for more info on the net, and found a discussion on
StackOverflow. Although it doesn't directly ask what I'm asking here,
the first reply, by none other than Anthony Williams,
begins by answering just this issue.

http://stackoverflow.com/questions/7461484/memory-model-ordering-and-visibility

I hope that by "equivalent" he also means just with respect to the
standard's definitions, without further assumptions based on any
specific hardware.
He also mentions that the compiler might produce better code when the
memory order is given as an operation argument rather than as a
separate fence. I suppose that just goes to say that although the
optimizer is generally allowed to produce the same code, it doesn't
always succeed in doing so.
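
One case worth testing against the "always equivalent" question, going by the wording of 29.8 quoted earlier in the thread: the fence rules only require that the release fence be sequenced before the store X, so a single release fence can cover relaxed stores to more than one atomic. The sketch below uses made-up names (run_one_fence_two_flags, flag_a, flag_b). As far as I can tell, putting memory_order_release on the first store alone would not give a reader of the second flag the same guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative only: one release fence, two relaxed stores after it.
inline int run_one_fence_two_flags() {
    int payload = 0;
    std::atomic<int> flag_a(0);
    std::atomic<int> flag_b(0);
    int seen = -1;

    std::thread writer([&] {
        payload = 9;
        std::atomic_thread_fence(std::memory_order_release);
        flag_a.store(1, std::memory_order_relaxed);
        flag_b.store(1, std::memory_order_relaxed);  // also sequenced after the fence
    });
    std::thread reader([&] {
        // Only flag_b is examined here.
        while (flag_b.load(std::memory_order_acquire) != 1) {}
        seen = payload;  // the fence synchronizes with the acquire load above
    });

    writer.join();
    reader.join();
    return seen;
}
```

The reader never looks at flag_a, yet it should still see the payload, because the fence is sequenced before the store to flag_b as well.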