
Little program to test concurrency of .fetch_add and .compare_exchange_weak


Bonita Montero

Dec 5, 2021, 12:13:25 PM
I've written a little program that tests the throughput of fetch_add
with an increasing number of threads on your system and, if you choose
it, prints the throughput of compare_exchange_weak instead. On my Ryzen
Threadripper 3990X (Zen2) / Windows 10 (SMT disabled) the fetch_add
timings are linear with an increasing number of threads, and the
compare_exchange_weak timings are linear in the beginning but become
exponential at the end.

I'd like to see your results:

#include <iostream>
#include <cstring>
#include <atomic>
#include <charconv>
#include <thread>
#include <vector>
#include <semaphore>
#include <chrono>
#include <algorithm>
#include <functional>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    if( argc < 2 )
        return EXIT_FAILURE;
    bool xchg = strcmp( argv[1], "xchg" ) == 0;
    if( argc - xchg < 2 )
        return EXIT_FAILURE;
    auto parseValue = []( char const *str ) -> unsigned
    {
        unsigned value;
        from_chars_result fcr = from_chars( str, str + strlen( str ), value );
        if( fcr.ec != errc() || *fcr.ptr )
            return -1;
        return value;
    };
    unsigned fromThreads, toThreads;
    if( argc - xchg == 2 )
        if( (fromThreads = toThreads = parseValue( argv[1 + xchg] )) == -1 )
            return EXIT_FAILURE;
        else;
    else
        if( (fromThreads = parseValue( argv[1 + xchg] )) == -1 || (toThreads =
            parseValue( argv[2 + xchg] )) == -1 )
            return EXIT_FAILURE;
    unsigned hc = thread::hardware_concurrency();
    hc = hc ? hc : toThreads;
    toThreads = toThreads <= hc ? toThreads : hc;
    fromThreads = fromThreads <= hc ? fromThreads : hc;
    if( fromThreads > toThreads )
        swap( fromThreads, toThreads );
    for( unsigned nThreads = fromThreads; nThreads <= toThreads; ++nThreads )
    {
        atomic_uint readyCountDown( nThreads );
        binary_semaphore semReady( 0 );
        counting_semaphore semRun( 0 );
        atomic_uint synch( nThreads );
        atomic_uint64_t aui64;
        atomic_uint64_t nsSum( 0 );
        auto theThread = [&]( function<void()> &addFn, size_t n )
        {
            if( readyCountDown.fetch_sub( 1, memory_order_relaxed ) == 1 )
                semReady.release();
            semRun.acquire();
            if( synch.fetch_sub( 1, memory_order_relaxed ) != 1 )
                while( synch.load( memory_order_relaxed ) );
            auto start = high_resolution_clock::now();
            for( ; n; --n )
                addFn();
            nsSum.fetch_add( (int64_t)duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        };
        vector<jthread> threads;
        threads.reserve( nThreads );
        static size_t const TURNS = 10'000'000;
        auto fetchAddFn = [&]() { aui64.fetch_add( 1, memory_order_relaxed ); };
        auto cmpXchgFn = [&]()
        {
            uint64_t ref = aui64.load( memory_order_relaxed );
            while( !aui64.compare_exchange_weak( ref, ref + 1,
                memory_order_relaxed ) );
        };
        function<void()> xchgFn;
        if( !xchg )
            xchgFn = bind( fetchAddFn );
        else
            xchgFn = bind( cmpXchgFn );
        for( unsigned t = 0; t != nThreads; ++t )
            threads.emplace_back( theThread, xchgFn ), TURNS );
        semReady.acquire();
        semRun.release( nThreads );
        for( jthread &thr : threads )
            thr.join();
        double ns = (double)(int64_t)nsSum.load( memory_order_relaxed );
        ns = ns / ((double)TURNS * (int)nThreads);
        cout << ns << endl;
    }
}

The timings are important for every kind of synchronization on your PC.
The program can be called like this:

./a.out <n-threads>                      - tests fetch_add with n-threads
./a.out <from-threads> <to-threads>      - tests fetch_add ranging from
                                           from-threads to to-threads
./a.out xchg <n-threads>                 - tests compare_exchange_weak with
                                           n-threads
./a.out xchg <from-threads> <to-threads> - tests compare_exchange_weak
                                           ranging from from-threads to
                                           to-threads
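
For example, to build the benchmark and sweep a range of thread counts
(the compiler invocation and file name are just one possibility; any
C++20 compiler should do):

g++ -std=c++20 -O2 -pthread -o atomic_bench atomic_bench.cpp
./atomic_bench 1 64          # fetch_add, 1 through 64 threads
./atomic_bench xchg 1 64     # compare_exchange_weak, same range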

Ben Bacarisse

Dec 5, 2021, 2:56:21 PM
Bonita Montero <Bonita....@gmail.com> writes:


> for( unsigned t = 0; t != nThreads; ++t )
> threads.emplace_back( theThread, xchgFn ), TURNS );

?

--
Ben.

Chris M. Thomasson

Dec 5, 2021, 6:22:56 PM
It seems way too complicated. To test fetch_add vs compare_exchange, just:

spawn T_N threads.

each thread performs N fetch_add operations on a global counter.

join the threads.

Record the time.

vs.


spawn T_N threads.

each thread performs N compare_exchange operations that increments a
global counter. Basically using CAS to build a fetch_add.

join the threads.

Record the time.



In my experience fetch_add always beats a CAS-loop on x86.
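
Something along these lines (a minimal sketch of the above; the thread
count and per-thread iteration count are arbitrary):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    using namespace std;
    using namespace chrono;
    constexpr unsigned T_N = 4;        // threads to spawn
    constexpr uint64_t N = 10'000'000; // increments per thread
    atomic<uint64_t> counter( 0 );
    // Run one pass; useCas selects the CAS-built fetch_add.
    auto run = [&]( bool useCas )
    {
        counter = 0;
        auto start = steady_clock::now();
        {
            vector<jthread> threads;
            for( unsigned t = 0; t != T_N; ++t )
                threads.emplace_back( [&]()
                {
                    for( uint64_t i = 0; i != N; ++i )
                        if( !useCas )
                            counter.fetch_add( 1, memory_order_relaxed );
                        else
                        {
                            uint64_t ref = counter.load( memory_order_relaxed );
                            while( !counter.compare_exchange_weak( ref, ref + 1,
                                memory_order_relaxed ) );
                        }
                } );
        } // jthreads join on scope exit
        return duration_cast<milliseconds>( steady_clock::now() - start ).count();
    };
    cout << "fetch_add: " << run( false ) << " ms" << endl;
    cout << "CAS loop:  " << run( true ) << " ms" << endl;
}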

Ben Bacarisse

Dec 5, 2021, 6:44:04 PM
"Chris M. Thomasson" <chris.m.t...@gmail.com> writes:

> On 12/5/2021 11:56 AM, Ben Bacarisse wrote:
>> Bonita Montero <Bonita....@gmail.com> writes:
>>
>>> for( unsigned t = 0; t != nThreads; ++t )
>>> threads.emplace_back( theThread, xchgFn ), TURNS );
>> ?
>
> It seems way too complicated.

In case it was not clear, my comment was about the syntax error.

--
Ben.

Chris M. Thomasson

Dec 5, 2021, 10:31:29 PM
Oh ouch! I did not even notice it. You must be referencing this:

threads.emplace_back( theThread, xchgFn ), TURNS );

The parentheses don't balance.

Bonita Montero

Dec 6, 2021, 12:25:16 AM
Remove one ).

Bonita Montero

Dec 6, 2021, 12:27:05 AM
Am 06.12.2021 um 00:22 schrieb Chris M. Thomasson:

> It seems way too complicated. To test fetch_add vs compare_exchange, just:

There's nothing complicated about my test.

> In my experience fetch_add always beats a CAS-loop on x86.

Of course, because it never fails.

Bonita Montero

Dec 6, 2021, 12:28:08 AM
So, here's the corrected version (C++20):
            for( ; n; addFn(), --n );
            nsSum.fetch_add( (uint64_t)duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        };
        vector<jthread> threads;
        threads.reserve( nThreads );
        static size_t const TURNS = 10'000'000;
        auto fetchAddFn = [&]() { aui64.fetch_add( 1, memory_order_relaxed ); };
        auto cmpXchgFn = [&]()
        {
            uint64_t ref = aui64.load( memory_order_relaxed );
            while( !aui64.compare_exchange_weak( ref, ref + 1,
                memory_order_relaxed ) );
        };
        function<void()> xchgFn;
        if( !xchg )
            xchgFn = bind( fetchAddFn );
        else
            xchgFn = bind( cmpXchgFn );
        for( unsigned t = 0; t != nThreads; ++t )
            threads.emplace_back( theThread, ref( xchgFn ), TURNS );
        semReady.acquire();
        semRun.release( nThreads );
        for( jthread &thr : threads )
            thr.join();
        double ns = (double)(int64_t)nsSum.load( memory_order_relaxed );
        ns = ns / ((double)TURNS * (int)nThreads);
        cout << nThreads << "\t" << ns << endl;
    }
}
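
Besides the stray parenthesis, note the ref() wrapper: theThread takes
its functor as function<void()>&, and jthread decay-copies its arguments
and forwards the copies as rvalues, which cannot bind to a non-const
lvalue reference. Wrapping the argument in std::ref is what lets this
line compile:

        for( unsigned t = 0; t != nThreads; ++t )
            threads.emplace_back( theThread, ref( xchgFn ), TURNS );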

Ben Bacarisse

Dec 6, 2021, 5:45:17 AM
The code does not compile with any of the three )s removed.

--
Ben.

Bonita Montero

Dec 6, 2021, 7:09:46 AM
Am 06.12.2021 um 11:45 schrieb Ben Bacarisse:
> Bonita Montero <Bonita....@gmail.com> writes:
>
>> Am 05.12.2021 um 20:56 schrieb Ben Bacarisse:
>>> Bonita Montero <Bonita....@gmail.com> writes:
>>>
>>>> for( unsigned t = 0; t != nThreads; ++t )
>>>> threads.emplace_back( theThread, xchgFn ), TURNS );
>>> ?
>>
>> Remove one ).
>
> The code does not compile with any of the three )s removed.

Take the latest code I've posted.

Scott Lurndal

Dec 6, 2021, 1:03:14 PM
Bonita Montero <Bonita....@gmail.com> writes:
>I've written a little program that tests the throughput of fetch_add
>with an increasing number of threads on your system and, if you choose

This is _highly_ microarchitecture-dependent.

Some processors will acquire the cacheline into the nearest
cache to ensure exclusive access for the add, while
others will pass the entire operation to the last-level cache
where it is executed atomically.

In the former case, scaling will be very bad on large core counts.

In the latter case, scaling will be quite good with large core counts,
at least on a single-socket system. On a multi-socket system, this
breaks down somewhat as the LLC on each socket will compete for the
cache line.

In any case, a couple-dozen-line assembler program would be a
far better test than your overly complicated C++.
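
Something on this order per thread would be enough (a rough sketch, not
a finished test; x86-64 with GCC/Clang inline assembly assumed, and a
real harness would pin threads and keep the counter alone on its cache
line):

#include <cstdint>
#include <x86intrin.h> // __rdtsc

uint64_t counter; // shared between the test threads

// Times `turns` lock-xadd increments of `counter` in TSC ticks.
uint64_t bench_lock_xadd( uint64_t turns )
{
    uint64_t start = __rdtsc();
    for( uint64_t i = 0; i != turns; ++i )
    {
        uint64_t one = 1;
        asm volatile( "lock xaddq %0, %1"
                      : "+r"(one), "+m"(counter)
                      :
                      : "memory" );
    }
    return __rdtsc() - start;
}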

Bonita Montero

Dec 7, 2021, 12:28:54 AM
Am 06.12.2021 um 19:02 schrieb Scott Lurndal:

> Some processors will acquire the cacheline into the nearest
> cache to ensure exclusive access for the add, while
> others will pass the entire operation to the last-level cache
> where it is executed atomically.

There's for sure no architecture that does atomic operations in
the last level cache because this would be silly.

> In any case, a couple dozen line assembler program would be a
> far better test than your overly complicated C++.

No, it wouldn't give better results, and the code would be
orders of magnitude longer if it did the same.

Scott Lurndal

Dec 7, 2021, 1:07:16 PM
Bonita Montero <Bonita....@gmail.com> writes:
>Am 06.12.2021 um 19:02 schrieb Scott Lurndal:
>
>> Some processors will acquire the cacheline into the nearest
>> cache to ensure exclusive access for the add, while
>> others will pass the entire operation to the last-level cache
>> where it is executed atomically.
>
>There's for sure no architecture that does atomic operations in
>the last level cache because this would be silly.

Well, are you sure? Why do you think it would be silly?

https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf

Given that at least three high-end processor chips have taped out just
this year with the capability of executing "far" atomic operations in
the LLC (or to a PCI Express Root complex host bridge), I think you really don't have
a clue what you are talking about.

Bonita Montero

Dec 7, 2021, 1:15:11 PM
And which CPUs currently support this Gen-Z interconnect?
And which CPUs currently use these far atomics for thread
synchronization? None.
Did you really read the paper and note what Gen-Z is?
No.

Scott Lurndal

Dec 7, 2021, 1:49:28 PM
Bonita Montero <Bonita....@gmail.com> writes:
>Am 07.12.2021 um 19:06 schrieb Scott Lurndal:
>> Bonita Montero <Bonita....@gmail.com> writes:
>>> Am 06.12.2021 um 19:02 schrieb Scott Lurndal:
>>>
>>>> Some processors will acquire the cacheline into the nearest
>>>> cache to ensure exclusive access for the add, while
>>>> others will pass the entire operation to the last-level cache
>>>> where it is executed atomically.
>>>
>>> There's for sure no architecture that does atomic operations in
>>> the last level cache because this would be silly.
>>
>> Well, are you sure? Why do you think it would be silly?
>>
>> https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf
>>
>> Given that at least three high-end processor chips have taped out just
>> this year with the capability of executing "far" atomic operations in
>> the LLC (or to a PCI Express Root complex host bridge), I think you
>> really don't have a clue what you are talking about.
>
>And which CPUs currently support this Gen-Z interconnect?

I'd tell you, but various NDA's forbid.

>And which CPUs currently use these far atomics for thread
>synchronization? None.

How do you know? I'm aware of three. Two sampling to
customers, with core counts from 8 to 64. A handful of others
are in development by several processor vendors as I
write this.

>Did you really read the paper and note what Gen-Z is?

I know exactly what it is, and I know what CXL is as well,
both being part of my day job. And if you don't think Intel
is designing all of their server CPUs to be CXL [*] compatible,
you're not thinking.

[*] "In November 2021 the CXL Consortium and the GenZ Consortium
signed a letter of intent for Gen-Z to transfer its specifications
and assets to CXL, leaving CXL as the sole industry standard moving
forward"

Bonita Montero

Dec 7, 2021, 1:55:16 PM
Am 07.12.2021 um 19:49 schrieb Scott Lurndal:
> Bonita Montero <Bonita....@gmail.com> writes:
>> Am 07.12.2021 um 19:06 schrieb Scott Lurndal:
>>> Bonita Montero <Bonita....@gmail.com> writes:
>>>> Am 06.12.2021 um 19:02 schrieb Scott Lurndal:
>>>>
>>>>> Some processors will acquire the cacheline into the nearest
>>>>> cache to ensure exclusive access for the add, while
>>>>> others will pass the entire operation to the last-level cache
>>>>> where it is executed atomically.
>>>>
>>>> There's for sure no architecture that does atomic operations in
>>>> the last level cache because this would be silly.
>>>
>>> Well, are you sure? Why do you think it would be silly?
>>>
>>> https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf
>>>
>>> Given that at least three high-end processor chips have taped out just
>>> this year with the capability of executing "far" atomic operations in
>>> the LLC (or to a PCI Express Root complex host bridge), I think you
>>> really don't have a clue what you are talking about.
>>
>> And which CPUs currently support this Gen-Z interconnect?
>
> I'd tell you, but various NDA's forbid.

LOOOOOOOL.

>> And which CPUs currently use these far atomics for thread
>> synchronization? None.

> How do you know?

Because this would be slower since the locked modifications
wouldn't be done in the L1 caches but in far memory. That's
just a silly idea.

> I'm aware of three. Two sampling to customers, with core
> counts from 8 to 64.

And you can't tell because of NDAs. Hrhr.

Scott Lurndal

Dec 7, 2021, 2:14:59 PM
Bonita Montero <Bonita....@gmail.com> writes:
>Am 07.12.2021 um 19:49 schrieb Scott Lurndal:

>
>> How do you know?
>
>Because this would be slower since the locked modifications
>wouldn't be done in the L1 caches but in far memory. That's
>just a silly idea.

Hello, it's a cache-coherent multiprocessor. You need to
fetch the line exclusively into the L1 first, so instead of sending
the fetch (or an invalidate if converting a shared line to owned),
you send the atomic op itself and it gets handled atomically at
the far end (e.g. LLC, PCI Express device, SoC coprocessor),
saving interconnect (mesh, ring, whatever) bandwidth and the
round-trip time between L1 and LLC, and reducing contention
for the line.

If it's already in the L1 cache, then the processor will
automatically treat it as a near atomic; this is expected
to be a rare case with correctly designed atomic usage.

Scott Lurndal

Dec 7, 2021, 2:25:46 PM

David Brown

Dec 8, 2021, 3:08:18 AM
This is such an obvious improvement that I am constantly amazed how long
it has taken to be implemented. Using ordinary memory for atomic
operations, locks, etc., is massively inefficient compared to a
dedicated hardware solution.

I've used a multi-core embedded microcontroller with a semaphore block,
consisting of a number (16, IIRC) of individual semaphores. Each of
these was made of two 16-bit parts - the lock tag and the value. You
can only change the value if you have the lock, and you get the lock by
writing a non-zero tag when the tag is currently 0 (unlocked). You
release it by writing your tag with the high bit set. It is all very
simple, and extremely fast - no need to go through caches, snooping, or
any of that nonsense because it is dedicated and connected close to the
cpu's core buses.
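
In C++ terms the claim/release protocol looks roughly like this (purely
illustrative; the register layout and the read-back check are my
reconstruction from memory, not the actual part's datasheet):

#include <cstdint>

// Illustrative model of one memory-mapped hardware semaphore.
struct HwSemaphore
{
    volatile uint16_t tag;   // 0 = unlocked, otherwise the owner's tag
    volatile uint16_t value; // only meaningful to the current owner

    // The hardware accepts the tag write only while tag == 0, so a
    // plain "write, then read back" is a race-free claim attempt.
    bool try_lock( uint16_t myTag )
    {
        tag = myTag;
        return tag == myTag; // we own the lock iff our tag stuck
    }

    void unlock( uint16_t myTag )
    {
        tag = uint16_t( myTag | 0x8000 ); // high bit set releases
    }
};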

Obviously in a "big" system you need to handle more than two cores (and,
with Gen-Z and CXL systems, other bus masters), support larger
numbers of locks, and security is a rather different matter! But the
principle of having dedicated hardware, memory mapped but not passing
through caches and slow external memory, is the same.

Atomic operations carried out by the core on memory in the L1 caches
will be fast as long as there are no conflicts, but you wouldn't bother
with atomics unless there /were/ a risk of conflict. And then they get
slow. With a "far atomics" solution, you should be able to get much
more consistent timings and efficient results.

(At least, that is my understanding of it, without having actually used
them!)

Bonita Montero

Dec 8, 2021, 3:50:46 AM
That's not a processor implementing this Gen-Z interconnect and
its atomic facilities. This is just an optimization for a special
kind of processor architecture.

Chris M. Thomasson

Dec 8, 2021, 3:59:51 AM
You have encountered the rabbit hole of Bonita! I have proved her/it
wrong several times. No good, goes nowhere.

Bonita Montero

Dec 8, 2021, 9:42:10 AM
What he links isn't proof of what he says.
The above CPU doesn't implement the mentioned interconnect. It's
just a minor improvement for this special kind of CPU architecture
to speed up lock-flipping with concurrent cores.

Scott Lurndal

Dec 8, 2021, 10:49:12 AM
Bonita Montero <Bonita....@gmail.com> writes:
>Am 08.12.2021 um 09:59 schrieb Chris M. Thomasson:

>> You have encountered the rabbit hole of Bonita! I have proved her/it
>> wrong several times. No good, goes nowhere.
>
>What he links isn't proof of what he says.

As you note Chris, Christof/Bonita cannot admit he
was wrong.

>The above CPU doesn't implement the mentioned interconnect.

Of course not, ARM doesn't make CPUs. They provide the IP
used to make real CPUs; for example the Amazon AWS Graviton 2 and 3.

Yet, ARM does provide interconnect IP which fully supports
near and far atomics.

Some of the current Neoverse N2 licensees are listed here:

https://www.design-reuse.com/news/49872/arm-neoverse-n2-v1-platform.html

Bonita Montero

Dec 8, 2021, 11:36:16 AM
Am 08.12.2021 um 16:48 schrieb Scott Lurndal:
> Bonita Montero <Bonita....@gmail.com> writes:
>> Am 08.12.2021 um 09:59 schrieb Chris M. Thomasson:
>
>>> You have encountered the rabbit hole of Bonita! I have proved her/it
>>> wrong several times. No good, goes nowhere.
>>
>> What he links isn't proof of what he says.
>
> As you note Chris, Christof/Bonita cannot admit he
> was wrong.
>
>> The above CPU doesn't implement the mentioned interconnect.
>
> Of course not, ARM doesn't make CPUs. They provide the IP
> used to make real CPUs; for example the Amazon AWS Graviton 2 and 3.
>
> Yet, ARM does provide interconnect IP which fully supports
> near and far atomics.

They're not far in the sense of the mentioned interconnect.

Öö Tiib

Dec 8, 2021, 1:58:07 PM
But that is what comp.lang.c++ is. Whenever you come here, BM is
present, wrong (or not even wrong), desperately trying to obscure
it by snipping out of context, removing attributions, misrepresenting
what others wrote, moving goalposts, etc. Keeping hearth and home
warm. ;-D

Manfred

Dec 8, 2021, 6:28:16 PM
On 12/8/2021 7:57 PM, Öö Tiib wrote:
> removing attributions

At least this part appears to have improved recently. For all the
rest, there's still a long way to go, but never lose hope ;)

Juha Nieminen

Dec 9, 2021, 1:30:34 AM
Öö Tiib <oot...@hot.ee> wrote:
>> You have encountered the rabbit hole of Bonita! I have proved her/it
>> wrong several times. No good, goes nowhere.
>
> But that is what comp.lang.c++ is. Whenever you come here, BM is
> present, wrong (or not even wrong), desperately trying to obscure
> it by snipping out of context, removing attributions, misrepresenting
> what others wrote, moving goalposts, etc. Keeping hearth and home
> warm. ;-D

The same pattern repeats again and again and again. I can't decide if
it's amusing or tiresome.

At least she doesn't use as many insults or as derogatory a tone
anymore towards people who are just trying to help, so I suppose
that's an improvement.