One reason why I like atomic_thread_fence...


Chris M. Thomasson

unread,
Mar 31, 2018, 4:41:35 PM3/31/18
to Scalable Synchronization Algorithms

Notice how there is an acquire barrier inside of the CAS loop within the enqueue and dequeue functions of:

http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue

Well, fwiw, we can get rid of them by using standalone fences:


Btw, sorry for the membar abstraction. I can quickly get sick and tired of having to type out (std::memory_order_relaxed) all the damn time. ;^)


Anyway, here is the simple code using Relacy:

___________________

/* Membar Abstraction
___________________________________________________*/
#define mbrlx std::memory_order_relaxed
#define mbacq std::memory_order_acquire
#define mbrel std::memory_order_release
#define mb(mb_membar) std::atomic_thread_fence(mb_membar)


template<typename T, unsigned int T_N>
struct fifo
{
    struct cell
    {
        VAR_T(T) m_data;
        std::atomic<unsigned int> m_ver;
    };

    std::atomic<unsigned int> m_head;
    std::atomic<unsigned int> m_tail;
    cell m_cells[T_N];

    fifo() : m_head(0), m_tail(0)
    {
        for (unsigned int i = 0; i < T_N; ++i)
        {
            m_cells[i].m_ver.store(i, mbrlx);
        }
    }

    bool push_try(T const& data)
    {
        cell* c = nullptr;
        unsigned int ver = m_head.load(mbrlx);

        do
        {
            c = &m_cells[ver & (T_N - 1)];
            if (c->m_ver.load(mbrlx) != ver) return false;
        } while (! m_head.compare_exchange_weak(ver, ver + 1, mbrlx));

        mb(mbacq);
        VAR(c->m_data) = data;
        mb(mbrel);

        c->m_ver.store(ver + 1, mbrlx);
        return true;
    }

    bool pop_try(T& data)
    {
        cell* c = nullptr;
        unsigned int ver = m_tail.load(mbrlx);

        do
        {
            c = &m_cells[ver & (T_N - 1)];
            if (c->m_ver.load(mbrlx) != ver + 1) return false;
        } while (! m_tail.compare_exchange_weak(ver, ver + 1, mbrlx));

        mb(mbacq);
        data = VAR(c->m_data);
        mb(mbrel);

        c->m_ver.store(ver + T_N, mbrlx);
        return true;
    }
};

___________________



There is no need for any membars within the loops. The version count keeps everything in order.


:^)
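For anyone who wants to poke at this outside of Relacy, here is a rough plain-std::atomic rendering of the same queue (my own translation: the VAR_T/VAR instrumentation drops away and the membar macros are spelled out; otherwise it is essentially line-for-line):

```cpp
#include <atomic>

// Portable sketch of the Relacy code above; N must be a power of two.
template<typename T, unsigned N>
struct fifo
{
    struct cell
    {
        T m_data;
        std::atomic<unsigned> m_ver;
    };

    std::atomic<unsigned> m_head{0};
    std::atomic<unsigned> m_tail{0};
    cell m_cells[N];

    fifo()
    {
        for (unsigned i = 0; i < N; ++i)
            m_cells[i].m_ver.store(i, std::memory_order_relaxed);
    }

    bool push_try(T const& data)
    {
        cell* c;
        unsigned ver = m_head.load(std::memory_order_relaxed);
        do {
            c = &m_cells[ver & (N - 1)];
            if (c->m_ver.load(std::memory_order_relaxed) != ver) return false;
        } while (!m_head.compare_exchange_weak(ver, ver + 1, std::memory_order_relaxed));

        // one acquire fence after we own the slot: syncs with the consumer
        // that released this cell's version on a previous lap
        std::atomic_thread_fence(std::memory_order_acquire);
        c->m_data = data;
        // publish the data before bumping the version
        std::atomic_thread_fence(std::memory_order_release);
        c->m_ver.store(ver + 1, std::memory_order_relaxed);
        return true;
    }

    bool pop_try(T& data)
    {
        cell* c;
        unsigned ver = m_tail.load(std::memory_order_relaxed);
        do {
            c = &m_cells[ver & (N - 1)];
            if (c->m_ver.load(std::memory_order_relaxed) != ver + 1) return false;
        } while (!m_tail.compare_exchange_weak(ver, ver + 1, std::memory_order_relaxed));

        // see the producer's write to m_data
        std::atomic_thread_fence(std::memory_order_acquire);
        data = c->m_data;
        // retire the cell before recycling its version for the next lap
        std::atomic_thread_fence(std::memory_order_release);
        c->m_ver.store(ver + N, std::memory_order_relaxed);
        return true;
    }
};
```

A quick single-threaded smoke test (push until full, pop in FIFO order) exercises the version arithmetic; the fences only matter once real threads are involved.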

Dmitry Vyukov

unread,
Apr 3, 2018, 8:44:38 AM4/3/18
to lock...@googlegroups.com
compare_exchange in C/C++ has optional failure memory ordering, so the
following looks equivalent to me ;)

m_tail.compare_exchange_weak(ver, ver + 1, acquire, relaxed)

Chris M. Thomasson

unread,
Apr 5, 2018, 4:03:06 PM4/5/18
to Scalable Synchronization Algorithms
On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov wrote:
[...]

compare_exchange in C/C++ has optional failure memory ordering, so the
following looks equivalent to me ;)

m_tail.compare_exchange_weak(ver, ver + 1, acquire, relaxed)

Yes, you are correct about that. However, I personally like to keep everything relaxed and only add memory barriers when they are absolutely needed; a standalone fence can do exactly that. Ahh, perhaps I am thinking too much of SPARC with its awesome membar instruction. For instance, the acquire barrier you have in the CAS loop in your MPMC queue code is the one a standalone fence can get rid of. Think of the following line, which is within the loop:

 size_t seq =
        cell->sequence_.load(std::memory_order_acquire);

Executing an acquire barrier on every iteration of the CAS loop is not necessary. The actual version count keeps everything in order. 

However, you do need a single acquire fence _after_ the CAS loop succeeds in order to get a clear view of the element.
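The two shapes under discussion can be sketched side by side; `pos` here is a stand-in I made up for the queue's head/tail counter, not the actual field name in Dmitry's code:

```cpp
#include <atomic>

std::atomic<unsigned> pos{0};

// Variant 1: ordering attached to the CAS itself, with a relaxed
// failure ordering so retries stay cheap (Dmitry's suggestion).
bool claim_with_cas_ordering(unsigned& ver)
{
    return pos.compare_exchange_weak(ver, ver + 1,
                                     std::memory_order_acquire,   // on success
                                     std::memory_order_relaxed);  // on failure
}

// Variant 2: fully relaxed CAS; the acquire is a single standalone
// fence that runs only once, after the loop finally succeeds.
bool claim_with_fence(unsigned& ver)
{
    if (!pos.compare_exchange_weak(ver, ver + 1, std::memory_order_relaxed))
        return false; // ver was reloaded for us; the caller retries
    std::atomic_thread_fence(std::memory_order_acquire);
    return true;
}
```

On weakly ordered hardware the fence variant pays for the acquire once per successful claim instead of once per loop iteration; on x86 the acquire comes for free either way, so the two should compile to essentially the same code.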

Dmitry Vyukov

unread,
Apr 7, 2018, 4:46:20 AM4/7/18
to lock...@googlegroups.com
This is true.

I don't like standalone fences because they are a plague for
verification. Consider: a single release fence turns _all_ subsequent
relaxed atomic stores ever executed by the thread into release
operations (they release the memory state up to the fence point), and
handling of acquire/release operations is an O(N) operation (and
generally done under a mutex). The same goes for acquire fences: a single
acquire fence turns _all_ loads ever executed by the thread into
acquire operations on the corresponding memory locations, which means
that you need to handle all relaxed loads as "shadow" acquire loads
for the case they will be materialized by a subsequent acquire fence.

The same is actually true for human reasoning. Say I am reading your
code. We have 3 load operations in the loop and an acquire fence
after the loop. Now the question is: which of the loads did we want to
turn into an acquire by adding the fence? Or is it maybe 2 of them?
Which? Or maybe 1 in the loop and 1 somewhere before the loop, in a
different function?
One can, of course, comment that, but Relacy won't check comments, so
I won't trust them ;)

Chris M. Thomasson

unread,
Apr 8, 2018, 9:38:09 PM4/8/18
to Scalable Synchronization Algorithms
On Saturday, April 7, 2018 at 1:46:20 AM UTC-7, Dmitry Vyukov wrote:
[...]

> However, you do need a single acquire fence _after_ the CAS loop succeeds in
> order to get a clear view of the element.

This is true.

Agreed. I personally like the ability to see the membars separated out and standing alone; it is a habit of mine from SPARC. Now, setting tagged standalone membars aside for a moment... perhaps ones that can include the memory locations they are interested in? ;^)

I don't like standalone fences because they are a plague for
verification. Consider: a single release fence turns _all_ subsequent
relaxed atomic stores ever executed by the thread into release
operations (they release the memory state up to the fence point), and
handling of acquire/release operations is an O(N) operation (and
generally done under a mutex).

A release operation should make sure all _prior_ operations are visible to another thread _before_ the release itself is. It has no effect on subsequent relaxed operations. For instance:


// producer
A = 1
B = 2
RELEASE
C = 3
D = 4

// consumer
while (D != 4) backoff;
ACQUIRE
assert(A == 1 && B == 2);


Well, A and B are going to be in sync with an acquire such that the assert will never
fail; however, C can be hoisted up and not be in sync at all! C is incoherent wrt the
consumer because it was not covered by the standalone release barrier.


The same goes for acquire fences: a single
acquire fence turns _all_ loads ever executed by the thread into
acquire operations on the corresponding memory locations, which means
that you need to handle all relaxed loads as "shadow" acquire loads
for the case they will be materialized by a subsequent acquire fence.

An acquire operation should make sure all operations wrt the release are visible _before_ any subsequent operations can be performed after that fact is accomplished.


Well, fwiw, the membars that can be embedded into the CAS wrt acquire and release do affect prior and subsequent activity anyway, standalone or not. A release will dump prior stores such that an acquire barrier will see them all. Now, when we are dealing with a consume barrier, that is targeting the release dependency chain wrt the pointer. A consume barrier is more precisely targeted when compared to the wider spectrum of an acquire. Btw, iirc consume is emulated in Relacy as acquire, right?


Also, think of popping all nodes at once from an atomic LIFO:


https://groups.google.com/d/topic/comp.lang.c++/V0s__czQwa0/discussion


Well, how can we accomplish the following without using standalone fences?


      // try to flush all of our nodes 
      node* flush() 
      { 
          node* n = m_head.exchange(nullptr, mb_relaxed); 

          if (n) 
          { 
              mb_fence(mb_acquire); 
          } 

          return n; 
      }   
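For completeness, the producer side of that LIFO would pair a relaxed publication CAS with a standalone release fence. Here is a sketch in plain std::atomic; the `node`/`stack` shapes are my guess at the linked discussion's code, not its exact text:

```cpp
#include <atomic>

struct node { node* next; int value; };

struct stack
{
    std::atomic<node*> m_head{nullptr};

    // push with a standalone release fence: the fence orders the write
    // to n->next (and everything before it) ahead of the relaxed CAS
    // that publishes the node.
    void push(node* n)
    {
        node* head = m_head.load(std::memory_order_relaxed);
        do {
            n->next = head;
            std::atomic_thread_fence(std::memory_order_release);
        } while (!m_head.compare_exchange_weak(head, n, std::memory_order_relaxed));
    }

    // grab the whole list at once; one acquire fence, and only when we
    // actually took something, mirroring the flush() above.
    node* flush()
    {
        node* n = m_head.exchange(nullptr, std::memory_order_relaxed);
        if (n) std::atomic_thread_fence(std::memory_order_acquire);
        return n;
    }
};
```

Note the release fence sits inside the retry loop: `n->next` is rewritten on every attempt, so each rewrite has to be ordered before its own publication CAS.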


 

The same is actually true for human reasoning. Say I am reading your
code. We have 3 load operations in the loop and an acquire fence
after the loop. Now the question is: which of the loads did we want to
turn into an acquire by adding the fence? Or is it maybe 2 of them?
Which? Or maybe 1 in the loop and 1 somewhere before the loop, in a
different function?
One can, of course, comment that, but Relacy won't check comments, so
I won't trust them ;)
 

Interesting. Still makes me think of tagged membars. I will get back to you with a more detailed response.

Dmitry Vyukov

unread,
Apr 14, 2018, 2:45:51 AM4/14/18
to lock...@googlegroups.com
In this case the RELEASE turned the store to D into a release-store (a
subsequent store), and the ACQUIRE turned the load of D into an
acquire-load (a preceding load).

At least this is how this is defined in the C/C++ standards.
ACQUIRE/RELEASE fences do not establish any happens-before relations
themselves. You still need a load in one thread to observe a value
stored in another thread. And only that "materializes" standalone
fence synchronization. So a store that materializes RELEASE fence will
always be a subsequent store.
I can't disagree. They are definitely more flexible.


> Interesting. Still makes me think of tagged membars. I will get back to you
> with a more detailed response.


You mean something like:

// try to flush all of our nodes
node* flush()
{
    node* n = m_head.exchange(nullptr, mb_relaxed);

    if (n)
    {
        mb_fence(mb_acquire, m_head); // <---- HERE
    }

    return n;
}

? Interesting.

Chris M. Thomasson

unread,
Apr 15, 2018, 6:32:22 PM4/15/18
to Scalable Synchronization Algorithms
D should be a pure relaxed store, and C should not be covered by the RELEASE. Iirc, it works this way in SPARC RMO mode. However, on x86, C will be covered because each store has implied release characteristics, write-back (WB) memory aside for a moment.

 

At least this is how this is defined in the C/C++ standards.
ACQUIRE/RELEASE fences do not establish any happens-before relations
themselves. You still need a load in one thread to observe a value
stored in another thread. And only that "materializes" standalone
fence synchronization. So a store that materializes RELEASE fence will
always be a subsequent store.

Humm... That is too strict, and has to be there whether we use standalone fences or not. The store to D = 4 makes A and B wrt the RELEASE visible to the consumer threads that look for D = 4 and execute the ACQUIRE barrier after that fact has been observed. Afaict, C should NOT be covered.

 


>> The same for acquire fences: a single
>> acquire fences turns _all_ loads ever executed by the thread into
>> acquire operations ton he corresponding memory locations, which means
>> that you need to handle all relaxed loads as a "shadow" acquire loads
>> for the case they will be materialized by a subsequent acquire fence.

That sounds too coarse.
Agreed.
 


> Interesting. Still makes me think of tagged membars. I will get back to you
> with a more detailed response.


You mean something like:

       // try to flush all of our nodes
       node* flush()
       {
           node* n = m_head.exchange(nullptr, mb_relaxed);

           if (n)
           {
               mb_fence(mb_acquire, m_head);  // <---- HERE
           }

           return n;
       }

? Interesting.

Yes! The standalone fence can say: we want to perform an acquire barrier wrt m_head. Something like that should be able to create more fine-grained setups. Perhaps even something like the following pseudo-code:
______________________
// setup
int a = 0;
int b = 0;
int c = 0;
signal = false;

// producer
a = 1;
b = 2;
RELEASE(&signal, &a, &b);
c = 3;
STORE_RELAXED(&signal, true);

// consumers
while (LOAD_RELAXED(&signal) != true) backoff;
ACQUIRE(&signal, &a, &b);
assert(a == 1 && b == 2);
______________________

The consumers would always see a and b as 1 and 2, however, c was not covered, so it is an incoherent state wrt said consumers.

The acquire would only target a and b, as would the release.

Hummm... Just thinking out loud here. :^)

Dmitry Vyukov

unread,
Apr 20, 2018, 5:06:22 AM4/20/18
to lock...@googlegroups.com
On Mon, Apr 16, 2018 at 12:32 AM, Chris M. Thomasson wrote:
[...]
C and D are completely symmetric wrt the RELEASE. Later you can
discover that there is also a thread that does:

// consumer 2
while (C != 3) backoff;
ACQUIRE
assert(A == 1 && B == 2);

And now suddenly C is release operation exactly the same way D is.


>> At lease this is how this is defined in C/C++ standards.
>> ACQUIRE/RELEASE fences do not establish any happens-before relations
>> themselves. You still need a load in one thread to observe a value
>> stored in another thread. And only that "materializes" standalone
>> fence synchronization. So a store that materializes RELEASE fence will
>> always be a subsequent store.
>
>
> Humm... That is too strict, and has to be there whether we use standalone
> fences or not.

No, in C/C++ memory ordering constraints tied to a memory operation act
only on that memory operation.

Consider:

DATA = 1;
C.store(1, memory_order_release);
D.store(1, memory_order_relaxed);

vs:

DATA = 1;
atomic_thread_fence(memory_order_release);
C.store(1, memory_order_relaxed);
D.store(1, memory_order_relaxed);


And 2 consumers:

// consumer 1
while (C.load(memory_order_acquire) == 0) backoff();
assert(DATA == 1);

// consumer 2
while (D.load(memory_order_acquire) == 0) backoff();
assert(DATA == 1);

Both consumers are correct wrt the atomic_thread_fence version of
producer. But only the first one is correct wrt the store(1,
memory_order_release) version of producer.

And this can actually break on x86 because:

DATA = 1;
C.store(1, memory_order_release);
D.store(1, memory_order_relaxed);

can be compiled to machine code as:

D.store(1, memory_order_relaxed);
DATA = 1;
C.store(1, memory_order_release);

But:

DATA = 1;
atomic_thread_fence(memory_order_release);
C.store(1, memory_order_relaxed);
D.store(1, memory_order_relaxed);

cannot be compiled to machine code as (store to D cannot hoist above
the release fence):

D.store(1, memory_order_relaxed);
DATA = 1;
atomic_thread_fence(memory_order_release);
C.store(1, memory_order_relaxed);


> The store to D = 4 makes A and B wrt the RELEASE visible to
> the consumer threads that look for D = 4 and execute the ACQUIRE barrier
> after that fact has been observed. Afaict, C should NOT be covered.
>
>
>>
>>
>>
>> >> The same for acquire fences: a single
>> >> acquire fences turns _all_ loads ever executed by the thread into
>> >> acquire operations ton he corresponding memory locations, which means
>> >> that you need to handle all relaxed loads as a "shadow" acquire loads
>> >> for the case they will be materialized by a subsequent acquire fence.
>
>
> That sounds too coarse.

That's the semantics of a stand-alone fence. Consider:

foo = 1;

... gazillion lines of code and hours later ...

atomic_thread_fence(memory_order_release);

... gazillion lines of code and hours later ...

x.store(1, relaxed);


and then in another thread:

if (x.load(acquire)) assert(foo == 1);

and now suddenly these 3 lines of code, separated by gazillions of lines
of code and hours of execution, fold into a connected synchronization
pattern.
With release-store we at least have the release and the store on the
same line _and_ we know that the release act only on this store and
not on unknown set of other stores.
This would help verification tools tremendously... but unfortunately
it's not the reality we are living in :)

Chris M. Thomasson

unread,
Apr 23, 2018, 5:19:28 PM4/23/18
to Scalable Synchronization Algorithms
On Friday, April 20, 2018 at 2:06:22 AM UTC-7, Dmitry Vyukov wrote:
[...]

> D should be a pure relaxed store, and C should not be covered by the
> RELEASE. Iirc, it works this way on SPARC RMO mode. However, on x86, C will
> be covered because each store has implied release characteristics, wb memory
> aside for a moment.

C and D are completely symmetric wrt the RELEASE. Later you can
discover that there is also a thread that does:

// consumer 2
while (C != 3) backoff;
ACQUIRE
assert(A == 1 && B == 2);

And now suddenly C is release operation exactly the same way D is.

Argh! D is the "control" signal. C should not be used as a signal to acquire. Damn. C should not have to be involved because the RELEASE is _before_ C was assigned.

Argh.
Iirc, x86 has implied release semantics on stores, and acquire on loads. If we load C and observe 1, we will see DATA as 1. However, D is not in sync at all. If we load D and see it as being 1, then C = 1 and DATA = 1.

Consumer 1 will see C = 1 and DATA = 1: D will be out of sync
Consumer 2 will see D = 1, C = 1 and DATA = 1: They will all be in sync


 [...]
Shi% happens! However, it would allow one to combine the flexibility of standalone fences and targeted memory addresses all in one.

Need to think some more on this.

Thank you! :^D