Notice
how there is an acquire barrier inside of the CAS loop within the
enqueue and dequeue functions of:
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
?
Well, fwiw, we can get rid of them by using standalone fences:
Btw, sorry for the membar abstraction. I can quickly get sick and tired of having to type out (std::memory_order_relaxed) all the damn time. ;^)
Anyway, here is the simple code using Relacy:
___________________
/* Membar Abstraction
___________________________________________________*/
#define mbrlx std::memory_order_relaxed
#define mbacq std::memory_order_acquire
#define mbrel std::memory_order_release
#define mb(mb_membar) std::atomic_thread_fence(mb_membar)

// note: T_N must be a power of two for the (ver & (T_N - 1)) indexing
template<typename T, unsigned int T_N>
struct fifo
{
    struct cell
    {
        VAR_T(T) m_data;
        std::atomic<unsigned int> m_ver;
    };

    std::atomic<unsigned int> m_head;
    std::atomic<unsigned int> m_tail;
    cell m_cells[T_N];

    fifo() : m_head(0), m_tail(0)
    {
        for (unsigned int i = 0; i < T_N; ++i)
        {
            m_cells[i].m_ver.store(i, mbrlx);
        }
    }

    bool push_try(T const& data)
    {
        cell* c = nullptr;
        unsigned int ver = m_head.load(mbrlx);

        do
        {
            c = &m_cells[ver & (T_N - 1)];
            if (c->m_ver.load(mbrlx) != ver) return false;
        } while (! m_head.compare_exchange_weak(ver, ver + 1, mbrlx));

        mb(mbacq); // single standalone acquire fence, after the CAS loop
        VAR(c->m_data) = data;
        mb(mbrel); // publish the element before bumping the version
        c->m_ver.store(ver + 1, mbrlx);

        return true;
    }

    bool pop_try(T& data)
    {
        cell* c = nullptr;
        unsigned int ver = m_tail.load(mbrlx);

        do
        {
            c = &m_cells[ver & (T_N - 1)];
            if (c->m_ver.load(mbrlx) != ver + 1) return false;
        } while (! m_tail.compare_exchange_weak(ver, ver + 1, mbrlx));

        mb(mbacq); // single standalone acquire fence, after the CAS loop
        data = VAR(c->m_data);
        mb(mbrel); // release the cell before making it reusable
        c->m_ver.store(ver + T_N, mbrlx);

        return true;
    }
};
___________________
There is no need for any membars within the loops. The version count keeps everything in order.
:^)
On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson
<cri...@charter.net> wrote:
> Notice how there is an acquire barrier inside of the CAS loop within the
> enqueue and dequeue functions of:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
[...]
> There is no need for any membars within the loops. The version count keeps
> everything in order.
compare_exchange in C/C++ takes an optional failure memory ordering, so the
following looks equivalent to me ;)

m_tail.compare_exchange_weak(ver, ver + 1, acquire, relaxed)

versus the acquire load inside the loop in the original code:

cell->sequence_.load(std::memory_order_acquire);
Executing an acquire barrier on every iteration of the CAS loop is not necessary. The actual version count keeps everything in order.
However, you do need a single acquire fence _after_ the CAS loop succeeds in order to get a clear view of the element.
On Thu, Apr 5, 2018 at 10:03 PM, Chris M. Thomasson <cri...@charter.net> wrote:
> On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov wrote:
>>
>> On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson
>> <cri...@charter.net> wrote:
>> > Notice how there is an acquire barrier inside of the CAS loop within the
>> > enqueue and dequeue functions of:
>> >
>> > http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
[...]
> Executing an acquire barrier on every iteration of the CAS loop is not
> necessary. The actual version count keeps everything in order.
>
> However, you do need a single acquire fence _after_ the CAS loop succeeds in
> order to get a clear view of the element.
This is true.
Agreed. I personally like the ability to see the membars separated out and
standing alone. It is a habit of mine from SPARC. Now, setting that aside for a
moment, perhaps tagged standalone membars, ones that can name the memory
locations they are interested in... ;^)
I don't like standalone fences because they are a plague for
verification. Consider: a single release fence turns _all_ subsequent
relaxed atomic stores ever executed by the thread into release
operations (they release memory state up to the fence point), and
handling of acquire/release operations is an O(N) operation (and
generally done under a mutex).
A release operation should make sure all _prior_ operations are visible to
another thread _before_ anything after the release is. It should have no effect
on subsequent relaxed operations. For instance:
// producer
A = 1
B = 2
RELEASE
C = 3
D = 4
// consumer
while (D != 4) backoff;
ACQUIRE
assert(A == 1 && B == 2);
The same goes for acquire fences: a single
acquire fence turns _all_ loads ever executed by the thread into
acquire operations on the corresponding memory locations, which means
that you need to handle all relaxed loads as "shadow" acquire loads
for the case where they will be materialized by a subsequent acquire fence.
An acquire operation should make sure everything written before the matching
release is visible _before_ any subsequent operations are performed _after_
that fact is accomplished.
Well, fwiw, the membars that can be embedded into the CAS wrt acquire and
release do affect prior and subsequent activity anyway, standalone or not. A release will
dump prior stores such that an acquire barrier will see them all. Now, when we
are dealing with a consume barrier, well, that is targeting the release dependency
chain wrt the pointer. A consume barrier is more precisely targeted when compared
to the wider spectrum of an acquire. Btw, iirc consume is emulated in Relacy as
acquire, right?
Also, think of popping all nodes at once from an atomic LIFO:
https://groups.google.com/d/topic/comp.lang.c++/V0s__czQwa0/discussion
Well, how can we accomplish the following without using standalone fences?:

// try to flush all of our nodes
node* flush()
{
    node* n = m_head.exchange(nullptr, mb_relaxed);

    if (n)
    {
        mb_fence(mb_acquire);
    }

    return n;
}
The same is actually true for human reasoning. Say I am reading your
code. We have 3 load operations in the loop and an acquire fence
after the loop. Now the question is: which of the loads did we want to
turn into an acquire by adding the fence? Or is it maybe 2 of them?
Which? Or maybe 1 in the loop and 1 somewhere before the loop, in a
different function?
One can, of course, comment that, but Relacy won't check comments, so
I won't trust them ;)
Interesting. Still makes me think of tagged membars. I will get back to you with
a more detailed response.
At least this is how it is defined in the C/C++ standards.
ACQUIRE/RELEASE fences do not establish any happens-before relations
themselves. You still need a load in one thread to observe a value
stored in another thread, and only that "materializes" standalone
fence synchronization. So a store that materializes a RELEASE fence will
always be a store subsequent to the fence.
[...]
> Interesting. Still makes me think of tagged membars. I will get back to you
> with a more detailed response.
You mean something like:
// try to flush all of our nodes
node* flush()
{
    node* n = m_head.exchange(nullptr, mb_relaxed);

    if (n)
    {
        mb_fence(mb_acquire, m_head); // <---- HERE
    }

    return n;
}
? Interesting.
On Mon, Apr 16, 2018 at 12:32 AM, Chris M. Thomasson
<cri...@charter.net> wrote:
>
>
> On Friday, April 13, 2018 at 11:45:51 PM UTC-7, Dmitry Vyukov wrote:
>>
>> On Mon, Apr 9, 2018 at 3:38 AM, Chris M. Thomasson <cri...@charter.net>
>> wrote:
>> > On Saturday, April 7, 2018 at 1:46:20 AM UTC-7, Dmitry Vyukov wrote:
>> >>
>> >> On Thu, Apr 5, 2018 at 10:03 PM, Chris M. Thomasson
>> >> <cri...@charter.net>
>> >> wrote:
>> >> > On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov wrote:
>> >> >>
>> >> >> On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson
>> >> >> <cri...@charter.net> wrote:
[...]
> D should be a pure relaxed store, and C should not be covered by the
> RELEASE. Iirc, it works this way on SPARC RMO mode. However, on x86, C will
> be covered because each store has implied release characteristics, wb memory
> aside for a moment.
C and D are completely symmetric wrt the RELEASE. Later you can
discover that there is also a thread that does:
// consumer 2
while (C != 3) backoff;
ACQUIRE
assert(A == 1 && B == 2);
And now suddenly C is a release operation exactly the same way D is.