
DCAS-atomic


Bonita Montero

Jun 15, 2020, 11:44:25 AM
I've developed a little DCAS-specialization of std::atomic. It uses
pair<uintptr_t, uintptr_t> as its atomic datatype. I haven't implemented
the volatile methods or the compare-exchange methods that take only one
memory-order parameter.
Here it is:

#pragma once
#include <cstdint>
#include <utility>
#include <atomic>
#if defined(_MSC_VER)
#include <intrin.h>
#endif
#include <type_traits>

template<>
struct std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>
{
    using pair_t = std::pair<std::uintptr_t, std::uintptr_t>;
    using mo_t = std::memory_order;
    atomic() = default;
    atomic( pair_t desired );
    atomic( const atomic & ) = delete;
    atomic &operator =( pair_t desired );
    atomic &operator =( pair_t const &desired ) = delete;
    bool is_lock_free() const;
    void store( pair_t desired, mo_t mo = std::memory_order_seq_cst );
    pair_t load( mo_t mo = std::memory_order_seq_cst );
    operator pair_t();
    pair_t exchange( pair_t desired, mo_t mo = std::memory_order_seq_cst );
    bool compare_exchange_weak( pair_t &expected, pair_t desired, mo_t success, mo_t failure );
    bool compare_exchange_strong( pair_t &expected, pair_t desired, mo_t success, mo_t failure );
private:
    alignas(2 * sizeof(std::uintptr_t))
    pair_t m_pair;
    static
    bool cmpxchgPair( pair_t &dst, pair_t &expected, pair_t const &desired );
};

inline
std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::atomic( pair_t desired )
{
    m_pair = desired;
}

inline
std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>
&std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::operator =( pair_t desired )
{
    store( desired );
    return *this;
}

inline
bool std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::is_lock_free() const
{
    return true;
}

inline
void std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::store( pair_t desired, mo_t mo )
{
    pair_t cmp = m_pair;
    while( !cmpxchgPair( m_pair, cmp, desired ) );
    atomic_thread_fence( mo );
}

inline
std::pair<std::uintptr_t, std::uintptr_t>
std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::load( mo_t mo )
{
    pair_t cmp = m_pair;
    cmpxchgPair( m_pair, cmp, cmp );
    atomic_thread_fence( mo );
    return cmp;
}

inline
std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::operator pair_t()
{
    return load();
}

inline
std::pair<std::uintptr_t, std::uintptr_t>
std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::exchange( pair_t desired, mo_t mo )
{
    pair_t cmp = m_pair;
    while( !cmpxchgPair( m_pair, cmp, desired ) );
    atomic_thread_fence( mo );
    return cmp;
}

inline
bool std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::compare_exchange_weak( pair_t &expected, pair_t desired, mo_t success, mo_t failure )
{
    bool succ = cmpxchgPair( m_pair, expected, desired );
    atomic_thread_fence( succ ? success : failure );
    return succ;
}

inline
bool std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::compare_exchange_strong( pair_t &expected, pair_t desired, mo_t success, mo_t failure )
{
    bool succ = cmpxchgPair( m_pair, expected, desired );
    atomic_thread_fence( succ ? success : failure );
    return succ;
}

inline
bool std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::cmpxchgPair( pair_t &dst, pair_t &expected, pair_t const &desired )
{
#if defined(_MSC_VER)
    return _InterlockedCompareExchange128( (__int64 *)&dst, desired.second, desired.first, (__int64 *)&expected );
#elif defined(__GNUC__)
    __int128 exp = (__int128 &)expected;
    bool succ = __sync_bool_compare_and_swap( (__int128 *)&dst, exp, (__int128 &)desired );
    if( !succ )
        // mirror the MSVC intrinsic: report the current value back to the caller
        (__int128 &)expected = __sync_val_compare_and_swap( (__int128 *)&dst, exp, exp );
    return succ;
#endif
}

Unfortunately __sync_bool_compare_and_swap maps to a symbol called
__sync_bool_compare_and_swap_16 and my linker can't find it:
"undefined reference to `__sync_bool_compare_and_swap_16'"
Any ideas?

Bonita Montero

Jun 15, 2020, 3:20:13 PM
> Unfortunately __sync_bool_compare_and_swap maps to a symbol called
> __sync_bool_compare_and_swap_16 and my linker can't find it:
> "undefined reference to `__sync_bool_compare_and_swap_16'"
> Any ideas?

I got it. I have to compile it with -march=x86-64.

Chris M. Thomasson

Jun 15, 2020, 4:59:48 PM
On 6/15/2020 8:44 AM, Bonita Montero wrote:
> I've developed a little DCAS-specialization of std::atomic. It uses
> pair<uintptr_t, uintptr_t> as its atomic datatype. I haven't implemented
> the volatile methods or the compare-exchange methods that take only one
> memory-order parameter.
> Here it is:
[...]

Actually, if the underlying system supports lock free DWCAS then C++
"should" support it on said system.

https://groups.google.com/d/msg/lock-free/X3fuuXknQF0/Ho0H1iJgmrQJ

If not, then you need to go rogue. Take careful note of the
memory_order_acq_rel membar.

For an atomic CAS with acquire semantics, the membar goes _after_ the CAS.

For an atomic CAS with release semantics, the membar goes _before_ the CAS.
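In portable C++ those two placements can be sketched with relaxed CASes plus standalone fences. A minimal sketch; the helper names are invented for illustration:

```cpp
#include <atomic>

// sketch: emulating acquire / release CAS with relaxed RMWs plus
// standalone fences (helper names are invented for illustration)
std::atomic<int> g{0};

bool cas_acquire( int expected, int desired )
{
    // relaxed CAS first ...
    bool ok = g.compare_exchange_strong( expected, desired,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed );
    // ... then the acquire membar _after_ the CAS
    std::atomic_thread_fence( std::memory_order_acquire );
    return ok;
}

bool cas_release( int expected, int desired )
{
    // release membar _before_ the CAS ...
    std::atomic_thread_fence( std::memory_order_release );
    // ... then the relaxed CAS
    return g.compare_exchange_strong( expected, desired,
                                      std::memory_order_relaxed,
                                      std::memory_order_relaxed );
}
```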



Chris M. Thomasson

Jun 15, 2020, 5:16:31 PM
Fwiw, think about the standalone (atomic_thread_fence) membars required
for a general purpose lock:
________________________________
Atomic RMW to take the lock

Acquire Membar

[critical section]

Release Membar

Atomic RMW to release the lock
________________________________


An acquire/release would look like:
________________________________
Release Membar

Atomic RMW

Acquire Membar
________________________________


See how the membars line up with the mutex case?
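The lock layout above can be sketched in C++ with relaxed RMWs plus standalone fences; lock_word and the helper names are invented for illustration:

```cpp
#include <atomic>

// sketch of the membar layout above (names invented for illustration)
std::atomic<int> lock_word{0};
int shared_data = 0;

void lock()
{
    // atomic RMW to take the lock ...
    while( lock_word.exchange( 1, std::memory_order_relaxed ) )
        ;
    // ... then the acquire membar
    std::atomic_thread_fence( std::memory_order_acquire );
}

void unlock()
{
    // release membar first ...
    std::atomic_thread_fence( std::memory_order_release );
    // ... then drop the lock
    lock_word.store( 0, std::memory_order_relaxed );
}
```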

Bonita Montero

Jun 16, 2020, 12:53:32 AM
> If not, then you need to go rogue. Take careful note of the
> memory_order_acq_rel membar.

I inserted the appropriate fences.

Chris M. Thomasson

Jun 16, 2020, 3:43:16 PM
Just make double sure to put them in the correct places within the code.

Bonita Montero

Jun 16, 2020, 3:54:05 PM
>> I inserted the appropriate fences.

> Just make double sure to put them in the correct places within the code.

They are in the correct places. I've changed the code a bit: the
explicit fences aren't necessary because the intrinsics have their own
fences. So I have a private fence( mo_t mo ) function that selects
per #ifdef; if we have neither gcc nor MSVC, i.e. an unknown compiler,
it is assumed to do no fencing on CAS until the code is adapted.
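The helper itself wasn't posted; a minimal sketch of what such a fence( mo_t ) selector might look like, under the assumptions just described:

```cpp
#include <atomic>

// sketch of the described fence() helper: the MSVC and gcc CAS
// intrinsics already imply a full barrier, so a real fence is only
// emitted for unknown compilers (assumed shape; original not posted)
using mo_t = std::memory_order;

inline void fence( mo_t mo )
{
#if defined(_MSC_VER) || defined(__GNUC__)
    (void)mo; // intrinsic CAS is already a full membar
#else
    std::atomic_thread_fence( mo ); // conservative fallback
#endif
}
```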

Chris M. Thomasson

Jun 16, 2020, 4:23:32 PM
Well, perhaps create a model that can work as if the intrinsic RMW's are
naked, or relaxed if you will. In other words, treating
InterlockedExchangeAdd as if it has no implied memory order. Call it the
working base model for systems with relaxed memory ordering. Akin to
SPARC RMO.

It seems like you are always putting the membar after the RMW:

example:
_________________________
inline
std::pair<std::uintptr_t, std::uintptr_t>
std::atomic<std::pair<std::uintptr_t, std::uintptr_t>>::exchange( pair_t desired, mo_t mo )
{
    pair_t cmp = m_pair;
    while( !cmpxchgPair( m_pair, cmp, desired ) );
    atomic_thread_fence( mo );
    return cmp;
}
_________________________

This is fine for an acquire barrier, but not for release. The release
membar should be _before_ the atomic RMW.

Bonita Montero

Jun 17, 2020, 1:11:33 AM
The cmpxchg has a membar itself.

Chris M. Thomasson

Jun 17, 2020, 4:13:07 PM
On 6/16/2020 10:11 PM, Bonita Montero wrote:
> The cmpxchg has a membar itself.

You mean LOCK cmpxchg on x86; yes, you are correct.

Have you ever programmed for the SPARC in RMO mode or PPC? What about a
DEC Alpha?

Please read all of this:

https://en.wikipedia.org/wiki/Memory_ordering

Can you name some algorithms that require explicit membars on an x86
when using atomic loads and stores? I know of several.

Chris M. Thomasson

Jun 17, 2020, 4:34:16 PM
On x86 wrt atomic loads and stores...

Atomic Store to A release
Atomic Load from B acquire

Well.... The load from B can be hoisted up above the store to A on an
x86. This can occur because A and B are different locations.

The MFENCE instruction can take care of this, to ensure that the load
from B is performed _after_ the store to A.

Store to A release
MFENCE
Load from B acquire

Okay, now, it works. Btw, there are several algorithms that depend on
this. Using a LOCK RMW on x86 can do this as well. It has the right membar.
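A Dekker-style sketch of the pattern in portable C++; the seq_cst fence is what becomes MFENCE (or a LOCK'ed RMW) on x86. With the fences in place, two threads running these two functions can never both observe 0:

```cpp
#include <atomic>

// Dekker-style sketch: the load from the other location must not be
// hoisted above the preceding store
std::atomic<int> A{0}, B{0};

int store_a_then_load_b()
{
    A.store( 1, std::memory_order_release );
    std::atomic_thread_fence( std::memory_order_seq_cst ); // MFENCE on x86
    return B.load( std::memory_order_acquire );
}

int store_b_then_load_a()
{
    B.store( 1, std::memory_order_release );
    std::atomic_thread_fence( std::memory_order_seq_cst );
    return A.load( std::memory_order_acquire );
}
```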

Iirc, there was a problem with a bugged processor. It was a long time
ago. Iirc it was part of the Plan9 problem. Iirc, it had something to do
with using an atomic store to release a spinlock. I first learned about
it way back on comp.programming.threads.

Bonita Montero

Jun 17, 2020, 6:02:36 PM
> you mean LOCK cmpxchg on x86, yes you are correct.

I'm talking about the MSVC and gcc intrinsics that I used.
They have full barriers on all CPUs.

> Have you ever programmed for the SPARC in RMO mode or PPC?
> What about a DEC Alpha?

Those are all dead CPUs.

Bonita Montero

Jun 17, 2020, 6:04:00 PM
Sorry, but you're talking stupid stuff. I'm using the
MSVC / gcc intrinsics, and both have full barriers.

Look at this code:

long xchg;

void f( int &i, int &j, int &k )
{
    k = i + j;
    __sync_bool_compare_and_swap( &xchg, 0, 1 );
    k = i + j;
}

This compiles to this with g++:

    movl    (%rsi), %eax
    addl    (%rdi), %eax
    movl    $1, %ecx
    movl    %eax, (%rdx)
    xorl    %eax, %eax
    lock cmpxchgq    %rcx, xchg(%rip)
    movl    (%rsi), %eax
    addl    (%rdi), %eax
    movl    %eax, (%rdx)
    ret

So the compiler doesn't optimize away the duplicate calculation,
because __sync_bool_compare_and_swap has acquire as well as release
behaviour: logically, from the view of the instruction stream, as well
as with respect to the internal ordering of the CPU's loads and stores.

Chris M. Thomasson

Jun 17, 2020, 6:23:11 PM
On 6/17/2020 3:03 PM, Bonita Montero wrote:
> Sorry, but you're talking stupid stuff. I'm using the
> MSVC / gcc intrinsics, and both have full barriers.

They do for x86. Well, except the acquire and release variants. See,
Windows runs on different machines. ;^)



>
> Look at this code:
>
> long xchg;
>
> void f( int &i, int &j, int &k )
> {
>     k = i + j;
>     __sync_bool_compare_and_swap( &xchg, 0, 1 );
>     k = i + j;
> }
>
> This compiles to this with g++:
>
>     movl    (%rsi), %eax
>     addl    (%rdi), %eax
>     movl    $1, %ecx
>     movl    %eax, (%rdx)
>     xorl    %eax, %eax
>     lock cmpxchgq    %rcx, xchg(%rip)
>     movl    (%rsi), %eax
>     addl    (%rdi), %eax
>     movl    %eax, (%rdx)
>     ret
>
> So the compiler doesn't optimize away the duplicate calculation
> because __sync_bool_compare_and_swap has acquire- as well as
> release- behaviour - logically from the view of the instruction
> -stream as well as related to the internal order of the loads
> and stores of the CPU.

Where did I argue against that? I was just saying to think about
programming for an arch that does not have implicit membars in its
atomic RMW's.

Chris M. Thomasson

Jun 17, 2020, 6:23:24 PM
PPC is dead?

Bonita Montero

Jun 17, 2020, 6:30:52 PM
> They do for x86. Well, except the acquire and release variants.

They have acquire and release semantics for x86 and ARM.

> Where did I argue against that? I was just saying to think about
> programming for an arch that does not have implicit membars in its
> atomic RMW's.

My code runs with MSVC and gcc. And both have full membars with their
CMPXCHG intrinsics. So I don't know why you're complaining here.

Bonita Montero

Jun 17, 2020, 6:31:30 PM
>>> Have you ever programmed for the SPARC in RMO mode or PPC?
>>> What about a DEC Alpha?

>> That are all dead CPUs.

> PPC is dead?

Almost. It has mostly been replaced in embedded hardware by ARM chips.

Chris M. Thomasson

Jun 17, 2020, 7:51:47 PM
On 6/17/2020 3:30 PM, Bonita Montero wrote:
>> They do for x86. Well, except the acquire and release variants.
>
> They have acquire and release semantics for x86 and ARM.

Windows can run on PPC, and they have acquire release semantics in their
Interlocked functions. Did you ever think about why they have the
acquire release variants for the atomic RMW's, even though they are
useless on x86? Forget about the intrinsic for a moment:

https://docs.microsoft.com/en-us/previous-versions/windows/desktop/legacy/ms683594(v=vs.85)

On x86 this makes no sense because a LOCK'ed atomic RMW is a full
membar. It can be used for sequential consistency; it's very expensive,
but it can work.

The Xenon in the Xbox 360 is from IBM and based on PPC.

They created the acquire release variants to get around having to use
full membars on every damn atomic RMW.

>
>> Where did I argue against that? I was just saying to think about
>> programming for an arch that does not have implicit membars in its
>> atomic RMW's.
>
> My code runs with MSVC and gcc. And both have full membars with their
> CMPXCHG-intrinsics. So I don't know why you complain something here.
Only on x86. It's a bad line of thinking to assume atomic RMW's always
have membars.

Chris M. Thomasson

Jun 17, 2020, 7:53:08 PM
ARM's can be weakly ordered as well.

https://developer.arm.com/docs/100941/0100/the-memory-model

It's good to learn how to program for them.

Bonita Montero

Jun 18, 2020, 2:13:20 AM
> Did you ever think about why they have the
> acquire release variants for the atomic RMW's, even though they are
> useless on x86? forget about the intrinsic for a moment:

That's with the additional intrinsics for ARM.

> On x86 this makes no sense ...

It would make sense, since acquire behaviour isn't only the physical
behaviour of the CPU but also the logical behaviour of the compiler,
i.e. where it places the instructions before or after the physical
barrier.

>> My code runs with MSVC and gcc. And both have full membars with their
>> CMPXCHG-intrinsics. So I don't know why you complain something here.

> only on x86.

No, on all CPUs.

Bonita Montero

Jun 18, 2020, 2:14:20 AM
>> Almost. It is mostly replaced in embedded-hardware by ARM-chips.

> ARM's can be weakly ordered as well.

I know, but the intrinsics in my code have a full fence on all CPUs.

Chris M. Thomasson

Jun 18, 2020, 3:46:24 PM
On 6/17/2020 11:13 PM, Bonita Montero wrote:
>>                           Did you ever think about why they have the
>> acquire release variants for the atomic RMW's, even though they are
>> useless on x86? forget about the intrinsic for a moment:
>
> That's with the additional intrinsics for ARM.

Indeed.


>> On x86 this makes no sense ...
>
> It would make sense since acquire-behaviour isn't only the physical
> behaviour of the CPU but also the logical behaviour of the compiler,
> i.e. where it places the istructions before or after the physical
> barrier.

On x86, acquire release is implied wrt atomic loads and stores, and a
full membar is implied for LOCK'ed atomic RMW's.

Funny thing on x86: a LOCK'ed atomic RMW on memory that straddles a
cache line will implement a full bus lock. Bad, but it can be useful
for certain exotic algorithms.


>>> My code runs with MSVC and gcc. And both have full membars with their
>>> CMPXCHG-intrinsics. So I don't know why you complain something here.
>
>> only on x86.
>
> No, on all CPUs.

GCC has these:

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Which can work well with weak memory models. Wrt pure C++, try to think
about coding it up where the atomic RMW's are all relaxed. Then use
atomic_thread_fence in the right places. Remember, the acquire membar
goes _after_ the atomic RMW for acquire semantics. The release membar
goes _before_ the atomic RMW for release semantics.

For simple atomic loads and stores:

load acquire, does the load, then the membar.

store release, does the membar, then performs the store.
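For example, with the gcc __atomic builtins linked above, the memory order is explicit per call (gcc/clang only; the function names here are invented):

```cpp
// sketch using the gcc/clang __atomic builtins with explicit memory
// orders instead of the full-barrier __sync ones
static long counter = 0;
static int flag = 0;

long fetch_add_relaxed( long n )
{
    // no ordering at all, just atomicity
    return __atomic_fetch_add( &counter, n, __ATOMIC_RELAXED );
}

void store_release( int v )
{
    // membar before the store
    __atomic_store_n( &flag, v, __ATOMIC_RELEASE );
}

int load_acquire()
{
    // membar after the load
    return __atomic_load_n( &flag, __ATOMIC_ACQUIRE );
}
```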

Chris M. Thomasson

Jun 18, 2020, 3:48:07 PM
That's fine, but it's not ideal. It's like creating an algorithm where
everything is seq_cst. Yes it works, but it's not ideal wrt performance.

A full fence is expensive.

Bonita Montero

Jun 19, 2020, 1:22:05 AM
> Which can work well with weak memory models. Wrt pure C++, try to think
> about coding it up where the atomic RMW's are all relaxed. Then use
> atomic_thread_fence in the right places. Remember, the acquire membar
> goes _after_ the atomic RMW for acquire semantics. The release membar
> goes _before_ the atomic RMW for release semantics.

That's not necessary with my code.

Bonita Montero

Jun 19, 2020, 1:22:59 AM
> That's fine, but its not ideal. Its like creating an algorithm where
> everything is seq_cst. Yes it works, but its not ideal wrt performance.

You're talking stupid stuff.

Chris M. Thomasson

Jun 19, 2020, 5:53:07 PM
Well, its still nice to learn.

Chris M. Thomasson

Jun 19, 2020, 5:55:49 PM
Wrong! For instance, if RCU was forced to execute a full memory barrier
for every operation, it would basically be useless.

Chris M. Thomasson

Jun 19, 2020, 10:10:10 PM
Algorithm A uses membars all over the place, vs. Algorithm B, which
uses them in the right strategic places and can even eliminate them
altogether on the read side (DEC Alpha aside for a moment). Just
enough to make it correct. Hmm...