DCAS-supplement ?

26 views
Skip to first unread message

Bonita Montero

unread,
Sep 19, 2021, 12:14:02 PMSep 19
to
I just wanted to test if my compiler recognizes ...
atomic<pair<uintptr_t, uintptr_t>
... and uses DCASes on it or if it supplements the pair with a usual
lock. So I first wrote a minimal program with a single noinline-function
which does a compare and swap on a atomic pair. The compiler generated
a lot of code for this which I didn't understand, and I didn't saw any
LOCK-prefixed instructions in that. I was curisous about whether the
implementation places a lock within the atomic and printed a sizeof
of the above atomic pair: 24 on a 64-bit-platform, so obviously without
a lock.
At last I wrote a program which increments both portions of a single
atomic pair by all the threads my system has (Ryzen Threadripper 64
core, Win10, SMT off) a predefined number of times. Then I calculated
the time for each increment in nanoseconds. The time is rather high,
about 20.000ns for each successful increment, so it first looked to
me if there was a lock I overlooked; so this couldn't be true with
a sizeof of this atomic of 24 bytes. And when I saw at the Processs
Viewer I saw that all 64 cores were nearly at 100% _user_ CPU time
all the time - so there couldn't be any kernel-interventions.
So is there anyone here smarter than me and can identify what this
DCAS-substitute does from the assembly-dump ?

Here's my test-code:

#include <iostream>
#include <atomic>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>

using namespace std;
using namespace chrono;

using uintptr_pair = pair<uintptr_t, uintptr_t>;
using atomic_pair = atomic<uintptr_pair>;

int main()
{
cout << "atomic<pair<uintptr_t, uintptr_t>>: " << sizeof(atomic_pair)
<< endl;
atomic_pair ap( uintptr_pair( 0, 0 ) );
mutex mtx;
unsigned nThreads = thread::hardware_concurrency();
unsigned ready = nThreads;
condition_variable cvReady;
bool run = false;
condition_variable cvRun;
atomic_int64_t sumDur = 0;
auto theThread = [&]( size_t n )
{
unique_lock<mutex> lock( mtx );
if( !--ready )
cvReady.notify_one();
cvRun.wait( lock, [&]() -> bool { return run; } );
lock.unlock();
auto start = high_resolution_clock::now();
uintptr_pair cmp = ap.load( memory_order_relaxed );
for( ; n--; )
while( !ap.compare_exchange_weak( cmp, uintptr_pair( cmp.first + 1,
cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
sumDur.fetch_add( duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count(), memory_order_relaxed );
lock.lock();
};
vector<jthread> threads;
threads.reserve( nThreads );
static size_t const ROUNDS = 100'000;
for( unsigned t = nThreads; t--; )
threads.emplace_back( theThread, ROUNDS );
unique_lock<mutex> lock( mtx );
cvReady.wait( lock, [&]() -> bool { return !ready; } );
run = true;
cvRun.notify_all();
lock.unlock();
for( jthread &thr : threads )
thr.join();
cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
uintptr_pair p = ap.load( memory_order_relaxed );
cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}

Chris M. Thomasson

unread,
Sep 19, 2021, 6:30:01 PMSep 19
to
On 9/19/2021 9:13 AM, Bonita Montero wrote:
> I just wanted to test if my compiler recognizes ...
>     atomic<pair<uintptr_t, uintptr_t>
> ... and uses DCASes on it or if it supplements the pair with a usual
> lock.

[...]

Can you get this to compile on your MSVC?
_________________________
#include <iostream>
#include <atomic>
#include <xmmintrin.h>

int main()
{
std::atomic<__int64> foo_64;
std::atomic<__m128> foo_128;

std::cout << sizeof(__int64) << "\n";
std::cout << sizeof(__m128) << "\n";

std::cout << foo_64.is_lock_free() << "\n";
std::cout << foo_128.is_lock_free() << "\n";

return 0;
}
_________________________


In 64-bit mode I am getting:

8
16
1
0

This means that this compiler is not supporting lock-free operations on
128-bit numbers in 64-bit mode, humm. What do you get? Also, you might
try to see if your system supports the CMPXCHG16B instruction. If so,
code DWCAS manually in assembly language. Now, in 32 bit mode, I also get:

8
16
1
0

This means that DWCAS _is_ supported in a lock-free manner in 32 bit
mode through CMPXCHG8B. So, try to use CMPXCHG16B directly if you want
DWCAS in 64-bit mode.

I have not tried this experiment in GCC. It might work with __int128,
and end up using:

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

__atomic_compare_exchange_16, you might have to use -mcx16.

Hope that helps a bit!

Bonita Montero

unread,
Sep 20, 2021, 1:04:00 AMSep 20
to
Am 20.09.2021 um 00:29 schrieb Chris M. Thomasson:
> On 9/19/2021 9:13 AM, Bonita Montero wrote:
>> I just wanted to test if my compiler recognizes ...
>>      atomic<pair<uintptr_t, uintptr_t>
>> ... and uses DCASes on it or if it supplements the pair with a usual
>> lock.
>
> [...]
>
> Can you get this to compile on your MSVC?
> _________________________
> #include <iostream>
> #include <atomic>
> #include <xmmintrin.h>
>
> int main()
> {
>     std::atomic<__int64> foo_64;
>     std::atomic<__m128> foo_128;
>
>     std::cout << sizeof(__int64) << "\n";
>     std::cout << sizeof(__m128) << "\n";
>
>     std::cout << foo_64.is_lock_free() << "\n";
>     std::cout << foo_128.is_lock_free() << "\n";
>
>     return 0;
> }
> _________________________
>

That's not what I wanted to test.
Read this thread:

https://stackoverflow.com/questions/69245183/dcas-alternative-with-no-help-of-the-kernel

Chris M. Thomasson

unread,
Sep 20, 2021, 1:20:10 AMSep 20
to
A quote:

"I just wanted to test if my compiler recognizes ...and uses DCASes on
it or if it supplements the pair with a usual lock"

Well, is_lock_free() should help?


Bonita Montero

unread,
Sep 20, 2021, 1:26:19 AMSep 20
to
I put the compare_exchangE_weak-code into a noinline function:

struct uip_pair
{
uip_pair() = default;
uip_pair( uintptr_t first, uintptr_t second ) :
first( first ),
second( second )
{
}
uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute((noinline))
#endif

NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed,
memory_order_relaxed );
}

?cmpXchg@@YA_NAEAU?$atomic@Uuip_pair@@@std@@AEAUuip_pair@@U3@@Z PROC
cmpXchg, COMDAT
mov eax, 1
mov r10, rcx
mov r9d, eax
xchg DWORD PTR [rcx], eax
test eax, eax
je SHORT $LN14@cmpXchg
$LL13@cmpXchg:
mov eax, DWORD PTR [rcx]
test eax, eax
je SHORT $LN16@cmpXchg
$LL15@cmpXchg:
mov eax, r9d
test r9d, r9d
je SHORT $LN38@cmpXchg
npad 1
$LL19@cmpXchg:
pause
sub eax, 1
jne SHORT $LL19@cmpXchg
cmp r9d, 64
jl SHORT $LN38@cmpXchg
lea r9d, QWORD PTR [rax+64]
jmp SHORT $LN22@cmpXchg
$LN38@cmpXchg:
add r9d, r9d
$LN22@cmpXchg:
mov eax, DWORD PTR [rcx]
test eax, eax
jne SHORT $LL15@cmpXchg
$LN16@cmpXchg:
mov eax, 1
xchg DWORD PTR [rcx], eax
test eax, eax
jne SHORT $LL13@cmpXchg
$LN14@cmpXchg:
mov rax, QWORD PTR [rcx+8]
sub rax, QWORD PTR [rdx]
jne SHORT $LN39@cmpXchg
mov rax, QWORD PTR [rcx+16]
sub rax, QWORD PTR [rdx+8]
$LN39@cmpXchg:
test rax, rax
sete al
test al, al
je SHORT $LN6@cmpXchg
movups xmm0, XMMWORD PTR [r8]
movups XMMWORD PTR [rcx+8], xmm0
xor ecx, ecx
xchg DWORD PTR [r10], ecx
ret 0
$LN6@cmpXchg:
movups xmm0, XMMWORD PTR [rcx+8]
xor ecx, ecx
movups XMMWORD PTR [rdx], xmm0
xchg DWORD PTR [r10], ecx
ret 0
?cmpXchg@@YA_NAEAU?$atomic@Uuip_pair@@@std@@AEAUuip_pair@@U3@@Z ENDP

Maybe someone here understands what the above assembly-code does.

Chris M. Thomasson

unread,
Sep 20, 2021, 1:41:15 AMSep 20
to
Need to examine it, however there is a way to make a lock without using
LOCK'ed anything in x86, x64. BTW, did you know that xchg has an implied
LOCK prefix? So, it looks locked to me. What does is_lock_free say?

Bonita Montero

unread,
Sep 20, 2021, 1:53:58 AMSep 20
to
> I put the compare_exchangE_weak-code into a noinline function:
>
> struct uip_pair
> {
>     uip_pair() = default;
>     uip_pair( uintptr_t first, uintptr_t second ) :
>         first( first ),
>         second( second )
>     {
>     }
>     uintptr_t first, second;
If I add ...
uintptr_t i, j, k;
... here the structure is still only 8 bytes larger than it
should be. Maybe MSVC does something like STM on this structure.
> };

Chris M. Thomasson

unread,
Sep 20, 2021, 7:02:50 PMSep 20
to
I don't know why they just don't use CMPXCHG16B directly. Look at the
disassembly of:

https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange128

on your system.

Bonita Montero

unread,
Sep 21, 2021, 3:31:14 AMSep 21
to
Can you please respond according to what I wrote ??? It's
about the assembly of the code I posted and not this intrinsic.

Chris M. Thomasson

unread,
Sep 21, 2021, 7:09:48 PMSep 21
to
You do it. Its a pain to converse with you.
Reply all
Reply to author
Forward
0 new messages