
atomic<>-costs


Bonita Montero

unread,
Jan 6, 2021, 9:09:56 AM1/6/21
to
I wrote a little benchmark that measures the cost of atomic operations
when a number of hardware threads are constantly hammering the same cacheline.
Here it is:

#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main()
{
    unsigned hc = thread::hardware_concurrency();
    if( !hc )
        return EXIT_FAILURE;
    uint64_t const ROUNDS = 1'000'000;
    auto benchmark = [&]( bool xadd )
    {
        mutex              mtx;
        condition_variable cvReady,
                           cvStart;
        unsigned           ready;
        bool               start;
        uint64_t           count;
        atomic<uint64_t>   atomicValue;
        uint64_t           totalTicks;
        uint64_t           totalFails;
        auto atomicThread = [&]()
        {
            using hrc_tp = time_point<high_resolution_clock>;
            unique_lock<mutex> lock( mtx );
            ++ready;
            cvReady.notify_one();
            for( ; !start; cvStart.wait( lock ) );
            uint64_t n = count;
            lock.unlock();
            uint64_t ref = atomicValue;
            hrc_tp start = high_resolution_clock::now();
            uint64_t fails = 0;
            if( !xadd )
                for( ; n; --n )
                    for( ; !atomicValue.compare_exchange_weak( ref, ref + 1,
                        memory_order_relaxed ); ++fails );
            else
                for( ; n; --n )
                    atomicValue.fetch_add( 1, memory_order_relaxed );
            uint64_t ns = duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count();
            lock.lock();
            totalTicks += ns;
            totalFails += fails;
        };
        for( unsigned t = 1; t <= hc; ++t )
        {
            vector<thread> vt;
            vt.reserve( t );
            ready      = 0;
            start      = false;
            count      = ROUNDS;
            totalTicks = 0;
            totalFails = 0;
            for( unsigned i = 1; i <= t; ++i )
                vt.emplace_back( atomicThread );
            unique_lock<mutex> lock( mtx );
            for( ; ready != t; cvReady.wait( lock ) );
            start = true;
            cvStart.notify_all();
            lock.unlock();
            for( thread &thr : vt )
                thr.join();
            double ns           = (double)(int64_t)totalTicks / (int)t / ROUNDS;
            double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
            cout << t << " threads: " << ns << "ns";
            if( !xadd )
                cout << ", avg fails: " << failsPerSucc;
            cout << endl;
        }
    };
    cout << "xchg:" << endl;
    benchmark( false );
    cout << "xadd:" << endl;
    benchmark( true );
}

I've got a 2.9 GHz 64-core Threadripper 3990X, and when all 128 threads
are XCHGing on the same cacheline, each successful operation takes
about 1,500 ns, i.e. about 4,350 clock cycles until there's a successful
XCHG. There are about 4 swap failures until a fifth attempt succeeds.
And a successful XADD takes about 240 ns, i.e. about 700 clock cycles.
So show me your data!
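
(Sanity check on the cycle figures, using the 2.9 GHz clock: 1,500 ns × 2.9
cycles/ns = 4,350 cycles, and 240 ns × 2.9 cycles/ns ≈ 700 cycles.)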

Bonita Montero

unread,
Jan 6, 2021, 11:55:35 AM1/6/21
to
Sorry, under Windows this test doesn't work correctly if you
have more than 64 threads, since the scheduler can only handle
up to 64 threads per processor group.

So I did a dirty workaround:

#if defined(_MSC_VER)
#include <Windows.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
if( argc < 2 )
return EXIT_FAILURE;
unsigned hc = atoi( argv[1] );
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;

#endif
for( unsigned i = 1; i <= t; ++i )
vt.emplace_back( atomicThread ),
#if !defined(_MSC_VER)
0;
#else
hThread = vt.back().native_handle(),
GetThreadGroupAffinity( hThread, &ga ),
ga.Group = (WORD)(i / 64),
SetThreadGroupAffinity( hThread, &ga, nullptr );
#endif
unique_lock<mutex> lock( mtx );
for( ; ready != t; cvReady.wait( lock ) );
start = true;
cvStart.notify_all();
lock.unlock();
for( thread &thr : vt )
thr.join();
double ns = (double)(int64_t)totalTicks / (int)t / ROUNDS;
double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
cout << t << " threads: " << ns << "ns";
if( !xadd )
cout << ", avg fails: " << failsPerSucc;
cout << endl;
}
};
cout << "xchg:" << endl;
benchmark( false );
cout << "xadd:" << endl;
benchmark( true );
}

This workaround only works if there are 64 threads in a processor group.
Otherwise I would have to do a complex processor topology query.

Bonita Montero

unread,
Jan 6, 2021, 1:24:36 PM1/6/21
to
Now it fully supports processor groups. There could actually be
asymmetrically populated sockets with different numbers of cores:

#if defined(_MSC_VER)
#include <Windows.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main()
{
#if !defined(_MSC_VER)
unsigned processors = thread::hardware_concurrency();
#else
DWORD dwLength = 0;
if( !GetLogicalProcessorInformationEx( RelationGroup, nullptr,
&dwLength ) &&
GetLastError() != ERROR_INSUFFICIENT_BUFFER )
return EXIT_FAILURE;
vector<char> buf( dwLength );
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX pSlpie =
(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf.data();
if( !GetLogicalProcessorInformationEx( RelationGroup, pSlpie, &dwLength ) )
return EXIT_FAILURE;
unsigned processors = 0;
for( unsigned i = 0; i != pSlpie->Group.ActiveGroupCount; ++i )
processors += pSlpie->Group.GroupInfo[i].ActiveProcessorCount;
auto getProcessorGroup = [&]( unsigned processor ) -> unsigned
{
processor %= processors;
unsigned group = 0;
for( ; group != pSlpie->Group.ActiveGroupCount
&& processor >=
pSlpie->Group.GroupInfo[group].ActiveProcessorCount;
++group );
return group;
};
#endif
for( unsigned t = 1; t <= processors; ++t )
{
vector<thread> vt;
vt.reserve( t );
ready = 0;
start = false;
count = ROUNDS;
totalTicks = 0;
totalFails = 0;
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;

#endif
for( unsigned i = 1; i <= t; ++i )
vt.emplace_back( atomicThread ),
#if !defined(_MSC_VER)
0;
#else
hThread = vt.back().native_handle(),
GetThreadGroupAffinity( hThread, &ga ),
ga.Group = getProcessorGroup( i ),

Chris M. Thomasson

unread,
Jan 6, 2021, 3:31:45 PM1/6/21
to
On 1/6/2021 6:09 AM, Bonita Montero wrote:
> I wrote a little benchmark that measures the cost of atomic-operations
> when a number of hw-threads are constantly hammering the same cacheline.
> Here it is:
[...]

> So show me your data !

I get:

xchg:
1 threads: 36.4439ns, avg fails: 0.999999
2 threads: 131.794ns, avg fails: 1.93928
3 threads: 169.609ns, avg fails: 2.3317
4 threads: 573.478ns, avg fails: 3.73222
xadd:
1 threads: 15.1847ns
2 threads: 92.2467ns
3 threads: 187.475ns
4 threads: 360.952ns


One little nitpick: XCHG should be renamed to CMPXCHG. :^)

XADD usually beats a CMPXCHG loop because it cannot fail.
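
A minimal sketch of the two update styles being contrasted (the function
names are made up; this is not the benchmark code itself):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{ 0 };

// XADD path: one hardware fetch-and-add; it always succeeds.
void increment_xadd()
{
    counter.fetch_add( 1, std::memory_order_relaxed );
}

// CMPXCHG path: read, compute, try to swap, retry on contention.
// On failure compare_exchange_weak reloads 'expected' with the
// current value, so the loop simply tries again.
void increment_cas()
{
    uint64_t expected = counter.load( std::memory_order_relaxed );
    while( !counter.compare_exchange_weak( expected, expected + 1,
                                           std::memory_order_relaxed ) )
        ;
}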

Chris M. Thomasson

unread,
Jan 6, 2021, 3:56:24 PM1/6/21
to
On 1/6/2021 6:09 AM, Bonita Montero wrote:
> I wrote a little benchmark that measures the cost of atomic-operations
> when a number of hw-threads are constantly hammering the same cacheline.
> Here it is:
[...]
> I've got a 2.9 GHz 64-core Threadripper 3990X, and when all 128 threads
> are XCHGing on the same cacheline, each successful operation takes
> about 1,500 ns, i.e. about 4,350 clock cycles until there's a successful
> XCHG. There are about 4 swap failures until a fifth attempt succeeds.
> And a successful XADD takes about 240 ns, i.e. about 700 clock cycles.
> So show me your data!

One other nitpick... Make sure that your:

atomic<uint64_t> atomicValue;

Is isolated: aligned and padded to a cache-line boundary. Afaict, you
did not do that! It's going to alter the test.
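
A minimal sketch of the isolation I mean, assuming a 64-byte line (the
struct name is made up):

#include <atomic>
#include <cstdint>

// alignas(64) puts the object on a 64-byte boundary and pads its size
// up to 64 bytes, so no neighbouring variable can share the cache line.
struct alignas(64) IsolatedCounter
{
    std::atomic<uint64_t> value;
};

static_assert( sizeof(IsolatedCounter) == 64, "expected one full cache line" );
static_assert( alignof(IsolatedCounter) == 64, "expected cache-line alignment" );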

Bonita Montero

unread,
Jan 7, 2021, 2:17:25 AM1/7/21
to
> One other nitpick... Make sure that your:
> atomic<uint64_t>   atomicValue;
> Is isolated: Aligned and padded on a cache line boundary.

That's automatically ensured by the compiler's natural 8-byte
alignment.

Bonita Montero

unread,
Jan 7, 2021, 3:04:33 AM1/7/21
to
The XCHG-loop was incorrect:

#if defined(_MSC_VER)
#include <Windows.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main()
{
#if !defined(_MSC_VER)
unsigned processors = thread::hardware_concurrency();
#else
DWORD dwLength = 0;
if( !GetLogicalProcessorInformationEx( RelationGroup, nullptr,
&dwLength ) &&
GetLastError() != ERROR_INSUFFICIENT_BUFFER )
return EXIT_FAILURE;
vector<char> buf( dwLength );
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX pSlpie =
(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf.data();
if( !GetLogicalProcessorInformationEx( RelationGroup, pSlpie, &dwLength ) )
return EXIT_FAILURE;
unsigned processors = 0;
for( unsigned i = 0; i != pSlpie->Group.ActiveGroupCount; ++i )
processors += pSlpie->Group.GroupInfo[i].ActiveProcessorCount;
auto getProcessorGroup = [pSlpie, processors]( unsigned processor ) ->
unsigned
{
processor %= processors;
BYTE pCount;
unsigned group = 0;
for( ; group != pSlpie->Group.ActiveGroupCount
&& processor >= (pCount =
pSlpie->Group.GroupInfo[group].ActiveProcessorCount);
++group, processor -= pCount );
return group;
};
#endif
uint64_t const ROUNDS = 1'000'000;
auto benchmark = [&]( bool xadd )
{
mutex mtx;
condition_variable cvReady,
cvStart;
unsigned ready;
bool start;
uint64_t count;
atomic<uint64_t> atomicValue;
uint64_t totalTicks;
uint64_t totalFails;
auto atomicThread = [&]()
{
using hrc_tp = time_point<high_resolution_clock>;
unique_lock<mutex> lock( mtx );
++ready;
cvReady.notify_one();
for( ; !start; cvStart.wait( lock ) );
uint64_t n = count;
lock.unlock();
uint64_t ref = atomicValue;
hrc_tp start = high_resolution_clock::now();
uint64_t fails = 0;
if( !xadd )
for( ; n; --n )
for( ref = atomicValue; !atomicValue.compare_exchange_weak( ref,
ref + 1, memory_order_relaxed ); ++fails );
else
for( ; n; --n )
atomicValue.fetch_add( 1, memory_order_relaxed );
uint64_t ns = duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count();
lock.lock();
totalTicks += ns;
totalFails += fails;
};
for( unsigned t = 1; t <= processors; ++t )
{
vector<thread> vt;
vt.reserve( t );
ready = 0;
start = false;
count = ROUNDS;
totalTicks = 0;
totalFails = 0;
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;
#endif
for( unsigned i = 1; i <= t; ++i )
{
vt.emplace_back( atomicThread );
#if defined(_MSC_VER)
hThread = vt.back().native_handle();
GetThreadGroupAffinity( hThread, &ga );
ga.Group = getProcessorGroup( i );
SetThreadGroupAffinity( hThread, &ga, nullptr );
#endif
}
unique_lock<mutex> lock( mtx );
for( ; ready != t; cvReady.wait( lock ) );
start = true;
cvStart.notify_all();
lock.unlock();
for( thread &thr : vt )
thr.join();
double ns = (double)(int64_t)totalTicks / (int)t / ROUNDS;
double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
cout << t << " threads: " << ns << "ns";
if( !xadd )
cout << ", avg fails: " << failsPerSucc;
cout << endl;
}
};
cout << "xchg:" << endl;
benchmark( false );
cout << "xadd:" << endl;
benchmark( true );
}

The old loop usually counted one extra failure per successful increment.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:03:22 AM1/7/21
to
Well, I mean padded up to, say, a 64-byte L2 cache line (some use 128),
and aligned. Think for a moment. I want atomicValue to be 100% isolated.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:05:06 AM1/7/21
to
Basically, remove ANY possibility of false sharing...

Chris M. Thomasson

unread,
Jan 7, 2021, 4:05:55 AM1/7/21
to
On 1/7/2021 12:04 AM, Bonita Montero wrote:
> The XCHG-loop was incorrect:
>
[...]
CMPXCHG... Well, I will run your new variant.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:08:22 AM1/7/21
to
On 1/7/2021 12:04 AM, Bonita Montero wrote:
> The XCHG-loop was incorrect:
>
[...]
> It usually counted one more fail.

I get:

xchg:
1 threads: 31.7102ns, avg fails: 0
2 threads: 151.655ns, avg fails: 0.665942
3 threads: 327.787ns, avg fails: 0.851925
4 threads: 509.759ns, avg fails: 1.19408
xadd:
1 threads: 19.4809ns
2 threads: 80.3329ns
3 threads: 214.025ns
4 threads: 337.996ns

Chris M. Thomasson

unread,
Jan 7, 2021, 4:20:15 AM1/7/21
to
On 1/6/2021 6:09 AM, Bonita Montero wrote:
> I wrote a little benchmark that measures the cost of atomic-operations
> when a number of hw-threads are constantly hammering the same cacheline.
> Here it is:
[...]

God... This makes me wish I never sold my SunFire T2000. DAMN. Although,
the damn thing sounded like a vacuum cleaner whenever I fired it up.

Bonita Montero

unread,
Jan 7, 2021, 7:01:12 AM1/7/21
to
> Well, I mean padded up to a, say, L2 64 byte cache line, some use 128,
> and aligned. Think for a moment. I want atomicValue to be 100% isolated.

It doesn't need to be aligned on a cacheline boundary.
It just must not cross a cacheline boundary, and that's
ensured by its natural 8-byte alignment.
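
(The arithmetic behind that claim: an 8-byte object whose address is a
multiple of 8 starts at an offset within its 64-byte line that is also a
multiple of 8, i.e. at most 56, so it ends at offset 64 at the latest and
never straddles two lines.)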

Scott Lurndal

unread,
Jan 7, 2021, 11:56:39 AM1/7/21
to
DAGS "False Sharing", then reconsider your last statement.

Bonita Montero

unread,
Jan 7, 2021, 12:23:09 PM1/7/21
to
>> It doesn't need to be aligned on a cacheline boundary.
>> It just must not cross a cacheline boundary, and that's
>> ensured by its natural 8-byte alignment.

> DAGS "False Sharing", then reconsider your last statement.

Where do I have a false-sharing problem?

Chris M. Thomasson

unread,
Jan 7, 2021, 2:49:29 PM1/7/21
to
In atomicValue.

Bonita Montero

unread,
Jan 7, 2021, 2:52:15 PM1/7/21
to
>>> DAGS "False Sharing", then reconsider your last statement.

>> Where do I have a false sharing problem ?

> In atomicValue.

No, there's no false sharing. While the threads are hammering on
atomicValue, no other thread is doing anything on the cacheline
holding atomicValue. And the values around atomicValue aren't
touched in the meantime.

Chris M. Thomasson

unread,
Jan 7, 2021, 2:55:14 PM1/7/21
to
atomicValue needs to be completely isolated. Padded up to a cache line
and aligned on a boundary.

Bonita Montero

unread,
Jan 7, 2021, 3:30:48 PM1/7/21
to
>> No, there's no false sharing. While the threads are hammering on
>> atomicValue, no other thread is doing anything on the cacheline
>> holding atomicValue. And the values around atomicValue aren't
>> touched in the meantime.

> atomicValue needs to be completely isolated. Padded up to a cache line
> and aligned on a boundary.

No, only if the data surrounding it in the cacheline is
concurrently accessed by other threads; but it isn't.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:39:53 PM1/7/21
to
It's an odd way to design things. You know, all threads might not start up
at the same time... ready, start and count, etc... intrude on
atomicValue. Why not just isolate it and be done with it? Keep in mind,
aligned and padded...

Bonita Montero

unread,
Jan 8, 2021, 3:04:44 AM1/8/21
to
> It's an odd way to design things. You know, all threads might
> not start up at the same time... ready, start and count, etc... ...

There's no false sharing.

Juha Nieminen

unread,
Jan 8, 2021, 3:32:46 AM1/8/21
to
Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
> atomicValue needs to be completely isolated. Padded up to a cache line
> and aligned on a boundary.

Is there a standard way of doing that?

Bonita Montero

unread,
Jan 8, 2021, 4:28:56 AM1/8/21
to
>> atomicValue needs to be completely isolated. Padded up to a cache line
>> and aligned on a boundary.

> Is there a standard way of doing that?

It's not necessary here.
But if you need it, you can do it this way:

#include <iostream>
#include <atomic>
#include <thread>
#include <cstdint>
#include <new>

using namespace std;

struct alignas(hardware_constructive_interference_size)
{
    atomic<uint64_t> value;
} a;

int main()
{
    cout << sizeof ::a << endl;
    bool fits = ((uintptr_t)&::a & 64 - 1) == 0;
    cout << (fits ? "fits" : "doesn't fit") << endl;
}

Bonita Montero

unread,
Jan 8, 2021, 4:48:40 AM1/8/21
to
Better this way.

>     bool fits = ((uintptr_t)&::a & 64 - 1) == 0;
bool fits = !(alignof(decltype(::a)) % 64);
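
Since alignof is a compile-time constant, this check can even be a
static_assert; a sketch, reusing ::a and the 64-byte figure from above:

static_assert( alignof(decltype(::a)) % 64 == 0,
               "type is not aligned/padded to a 64-byte cache line" );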


Chris M. Thomasson

unread,
Jan 8, 2021, 7:43:17 PM1/8/21
to
Nit pick, there can be.

totalTicks += ns;
totalFails += fails;

can interfere with atomicValue. Think about it.

Chris M. Thomasson

unread,
Jan 8, 2021, 7:48:21 PM1/8/21
to
That should do it. :^)

Bonita Montero

unread,
Jan 9, 2021, 12:37:42 AM1/9/21
to
>> There's no false sharing.

> Nit pick, there can be.
> totalTicks += ns;
> totalFails += fails;
> can interfere with atomicValue. Think about it.

There's nothing which interferes.
Make atomicValue aligned, and you won't measure
different results.

But here are the results from a

* 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
* 2x AMD EPYC 7551 32-Core Processor
* 2x AMD EPYC 7401 24-Core Processor

https://easyupload.io/4jaazg

Chris M. Thomasson

unread,
Jan 9, 2021, 12:45:07 AM1/9/21
to
On 1/8/2021 9:37 PM, Bonita Montero wrote:
>>> There's no false sharing.
>
>> Nit pick, there can be.
>> totalTicks += ns;
>> totalFails += fails;
>> can interfere with atomicValue. Think about it.
>
> There's nothing which interferes.

Is atomicValue aligned on a cache-line boundary and on a different
cache line than anything else, e.g. totalTicks and totalFails?



> Make atomicValue aligned, and you won't measure
> different results.

Do it anyway.

Bonita Montero

unread,
Jan 9, 2021, 1:43:53 AM1/9/21
to
> Is atomicValue aligned on a cache-line boundary and on a different
> cache line than anything else, e.g. totalTicks and totalFails?

No, it isn't, but it doesn't need to be. While the threads are running
and incrementing atomicValue, nothing else happens to that cacheline.
Align it and see that you won't get any different results.

Chris M. Thomasson

unread,
Jan 9, 2021, 1:59:04 AM1/9/21
to
On 1/8/2021 10:43 PM, Bonita Montero wrote:
>> Is atomicValue aligned on a cache-line boundary and on a different
>> cache line than anything else, e.g. totalTicks and totalFails?
>
> No, it isn't, but it doesn't need to be. While the threads are running
> and incrementing atomicValue, nothing else happens to that cacheline.

Even when some threads finish iteration and mess with totalTicks and
totalFails while others are still iterating?

Bonita Montero

unread,
Jan 9, 2021, 3:23:14 AM1/9/21
to
>> No, it isn't, but it doesn't need to be. While the threads are running
>> and incrementing atomicValue, nothing else happens to that cacheline.

> Even when some threads finish iteration and mess with totalTicks and
> totalFails while others are still iterating?

That's not relevant since the loops overlap by 99% or more.

Scott Lurndal

unread,
Jan 9, 2021, 11:55:51 AM1/9/21
to
You are wrong. But since you insist on removing context and attributions
from your posts, I'll not try to explain to you why.

Bonita Montero

unread,
Jan 9, 2021, 12:25:37 PM1/9/21
to
>> No, it isn't, but it doesn't need to be. While the threads are running
>> and incrementing atomicValue, nothing else happens to that cacheline.
>> Align it and see that you won't get any different results.

> You are wrong. But since you insist on removing context and attributions
> from your posts, I'll not try to explain to you why.

No, I'm not wrong. For at least 99% of the time that the threads are
pounding on atomicValue, they're not contending with other threads doing
different things in the same cacheline. The small skew between the threads
after they are started, when they do the unlock(), isn't relevant.

Chris M. Thomasson

unread,
Jan 9, 2021, 3:25:35 PM1/9/21
to
I see what you mean, but there can be false sharing: keep in mind that I
am being rather pedantic here, Bonita. Okay. The main point is that using
XADD when you can is generally "better" than a CMPXCHG loop. Also, you
are using compare_exchange_weak. This means that on platforms other than
x86, perhaps using LL/SC, it can spuriously fail even if the CAS should
have succeeded. So, on a PPC, it would be fun to test
compare_exchange_weak vs compare_exchange_strong.

I remember a scenario a long time ago where a compare_exchange_strong
was needed to get a state machine to work properly. It was well before
C++11, but these problems are nothing new. A failed CAS actually
meant it really failed on its comparand, not that it spuriously failed when
it should have succeeded. The state machine depended on this difference!

Basically, isolation via proper aligning and padding is just a good
habit to get into. Especially on those other platforms. False sharing
can invade the reservation granule and cause actual livelock-like
scenarios. On x86 it won't cause a CMPXCHG to fail, but it causes
unnecessary cache thrashing, destroying performance. On a PPC, false
sharing on the reservation can be rather deadly... It can livelock when
the system is under heavy load.
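
A minimal sketch of the kind of one-shot transition where the strong form
matters (the states and names here are hypothetical, not the benchmark's
code): the CAS is attempted exactly once, so a spurious failure would be
indistinguishable from losing the race.

#include <atomic>

enum state_t { IDLE = 0, CLAIMED = 1 };
std::atomic<int> state{ IDLE };

// Returns false only if someone else really claimed the slot first.
// With compare_exchange_weak an LL/SC machine could fail spuriously
// and misreport a lost race, so compare_exchange_strong is used.
bool try_claim()
{
    int expected = IDLE;
    return state.compare_exchange_strong( expected, CLAIMED,
                                          std::memory_order_acq_rel );
}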

Chris M. Thomasson

unread,
Jan 9, 2021, 3:53:02 PM1/9/21
to
On 1/8/2021 1:28 AM, Bonita Montero wrote:
Works fine. Now I am thinking about dynamic memory ala new. Probably
have to use aligned_alloc and placement new? My old alignment code way
back in the day still works, but it's definitely not standard!

https://pastebin.com/raw/f37a23918

Back in 2009. Damn, I am getting older!

Bonita Montero

unread,
Jan 9, 2021, 3:56:35 PM1/9/21
to
> I see what you mean, but there can be false sharing:
> Keep in mind that I am being rather pedantic here Bonita.

The threads don't get released by cvStart synchronously. cvStart is
signalled for all threads at the same time, but the threads also need to
get hold of the mutex that is associated with this CV, and that can only
happen one thread after another, as each thread releases the mutex. So
the first thread releases the second thread, and so forth. But this
results only in a small skew, which is insignificant relative to the
time the threads spend doing the atomic operations.
What would be more appropriate here would be a semaphore which is
released by the number of waiting threads. But there are no semaphores
until C++20.
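
A sketch of such a start barrier with C++20's std::counting_semaphore
(assuming a C++20 compiler, which isn't an option here; the names mirror
the benchmark's semReady/semStart):

#include <semaphore>

std::counting_semaphore<> semReady{ 0 };
std::counting_semaphore<> semStart{ 0 };

void worker()
{
    semReady.release();   // tell the main thread we are in place
    semStart.acquire();   // block until the whole group is released
    // ... timed loop ...
}

void start_all( unsigned t )   // main thread, t workers
{
    for( unsigned i = 0; i != t; ++i )
        semReady.acquire();    // wait until all t workers are in place
    semStart.release( t );     // release the whole group with one call
}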


> The main point is that using
> XADD when you can, is generally "better" than a CMPXCHG loop.

My test doesn't say which one is better when you use either operation
for synchronization purposes. It simply measures their cost and
nothing more.

> XADD when you can, is generally "better" than a CMPXCHG loop. Also, you
> are using compare_exchange_weak. This means that on platforms other than
> x86, perhaps using LL/SC, it can spuriously fail even if the CAS should
> of succeeded.

That's intended, because if you're using LL/SC for synchronization
you can't do anything against spurious failures other than repeating
the operation anyway.

> Basically, isolation via proper aligning and padding is just a good
> habit to get into. ...

It's not necessary for this benchmark. It won't get less accurate
beyond the inaccuracy you already have from other measurement-error
effects. So the whole discussion is nonsense.

Bonita Montero

unread,
Jan 9, 2021, 3:57:29 PM1/9/21
to
> Works fine. Now I am thinking about dynamic memory ala new. Probably
> have to use aligned_alloc and placement new? My old alignment code way
> back in the day still works, but its definitely not standard!

The newed memory would be aligned also.

Chris M. Thomasson

unread,
Jan 9, 2021, 4:01:53 PM1/9/21
to
Really? When you find some free time, can you code me up a quick
example? I am still using my old code:

https://pastebin.com/raw/f37a23918

It would be really fun to spice it up with C++11! I think you can help
me out, since I am not totally familiar with some exotic C++ features. I
was almost always in C. I still have code in x86 asm that implemented my
atomic lib.

So, is there a way to convert my simple ralloc region allocator into
something that is 100% standard compliant?

Chris M. Thomasson

unread,
Jan 9, 2021, 4:38:51 PM1/9/21
to
True. However, the retry logic can be more "efficient" using
compare_exchange_strong rather than trying to emulate the same thing
with compare_exchange_weak.

Chris M. Thomasson

unread,
Jan 9, 2021, 6:32:04 PM1/9/21
to
However, this is not necessary with your benchmark. Using
compare_exchange_strong really shines when one wants to create a state
machine wrt the failure of a compare_exchange. In these cases, one does
not want spurious failures.

Bonita Montero

unread,
Jan 10, 2021, 1:01:12 AM1/10/21
to
>> That's intended, because if you're using LL/SC for synchronization
>> you can't do anything against spurious failures other than repeating
>> the operation anyway.

> True. However, the retry logic can be more "efficient" using
> compare_exchange_strong rather than trying to emulate the same thing
> with compare_exchange_weak.

compare_exchange_strong would have its own internal retry loop, and I
wouldn't see those additional retries, so I couldn't count them. So it
wouldn't be appropriate for me.

Bonita Montero

unread,
Jan 10, 2021, 1:02:07 AM1/10/21
to
>>> Works fine. Now I am thinking about dynamic memory ala new. Probably
>>> have to use aligned_alloc and placement new? My old alignment code
>>> way back in the day still works, but its definitely not standard!

>> The newed memory would be aligned also.

> Really? ...

Yes.

Chris M. Thomasson

unread,
Jan 10, 2021, 1:32:16 AM1/10/21
to
Agreed. Fwiw, in times where using compare_exchange_strong is
appropriate, it can be better than using a compare_exchange_weak loop.

Chris M. Thomasson

unread,
Jan 10, 2021, 1:33:14 AM1/10/21
to
Show me.

Chris M. Thomasson

unread,
Jan 10, 2021, 1:34:28 AM1/10/21
to
On 1/9/2021 10:01 PM, Bonita Montero wrote:
You cut off my point about using new vs aligned_alloc and placement new.

Bonita Montero

unread,
Jan 10, 2021, 3:31:57 AM1/10/21
to
>>>> The newed memory would be aligned also.

>>> Really? ...

>> Yes.

> Show me.

#include <iostream>
#include <atomic>
#include <cstdint>
#include <memory>
#include <vector>
#include <new>

using namespace std;

struct alignas(hardware_constructive_interference_size) A
{
    atomic<uint64_t> value;
} a;

int main()
{
    vector<unique_ptr<A>> vua( 1'000'000 );
    for( unique_ptr<A> &ua : vua )
        ua = make_unique<A>();
    bool fits = true;
    for( unique_ptr<A> &ua : vua )
        if( (uintptr_t)ua.get() % 64 )
            fits = false;
    cout << (fits ? "fits" : "doesn't fit") << endl;
}

Bonita Montero

unread,
Jan 10, 2021, 4:01:11 AM1/10/21
to
Now here's a version with my own semaphore class:

#if defined(_MSC_VER)
#include <Windows.h>
#elif defined(__unix__)
#include <unistd.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/types.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>
#include "semaphore.h"

using namespace std;
using namespace chrono;

#if defined(_MSC_VER)
int wmain( int argc, wchar_t **argv )
#else
int main( int argc, char **argv )
#endif
{
#if !defined(_MSC_VER)
unsigned processors = thread::hardware_concurrency();
if( !processors )
return EXIT_FAILURE;
#else
DWORD dwLength = 0;
if( !GetLogicalProcessorInformationEx( RelationGroup, nullptr,
&dwLength ) &&
GetLastError() != ERROR_INSUFFICIENT_BUFFER )
return EXIT_FAILURE;
vector<char> buf( dwLength );
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX pSlpie =
(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf.data();
if( !GetLogicalProcessorInformationEx( RelationGroup, pSlpie, &dwLength ) )
return EXIT_FAILURE;
unsigned processors = 0;
for( unsigned i = 0; i != pSlpie->Group.ActiveGroupCount; ++i )
processors += pSlpie->Group.GroupInfo[i].ActiveProcessorCount;
auto getProcessorGroup = [pSlpie, processors]( unsigned processor ) ->
unsigned
{
processor %= processors;
BYTE pCount;
unsigned group = 0;
for( ; group != pSlpie->Group.ActiveGroupCount
&& processor >= (pCount =
pSlpie->Group.GroupInfo[group].ActiveProcessorCount);
++group, processor -= pCount );
return group;
};
#endif
#if defined(_MSC_VER)
// log on as a different user to gain higher scheduling priorities
HANDLE hToken;
if( argc >= 3 && LogonUserW( argv[1], nullptr, argv[2],
LOGON32_LOGON_INTERACTIVE, LOGON32_PROVIDER_DEFAULT, &hToken ) )
(void)ImpersonateLoggedOnUser( hToken );
HANDLE hCurrentProcess = GetCurrentProcess();
SetPriorityClass( hCurrentProcess, HIGH_PRIORITY_CLASS );
SetPriorityClass( hCurrentProcess, REALTIME_PRIORITY_CLASS );
#elif defined(__unix__)
for( int nice = 0; nice != -21 && setpriority( PRIO_PROCESS, getpid(),
nice ) == 0; --nice );
#endif
uint64_t const ROUNDS = 1'000'000;
auto benchmark = [&]( bool xadd )
{
semaphore semReady,
semStart;
uint64_t count;
atomic<uint64_t> atomicValue;
mutex mtx;
uint64_t totalTicks;
uint64_t totalFails;
auto atomicThread = [&]()
{
using hrc_tp = time_point<high_resolution_clock>;
semReady.forced_release();
semStart.forced_wait();
uint64_t n = count;
uint64_t ref;
hrc_tp start = high_resolution_clock::now();
uint64_t fails = 0;
if( !xadd )
for( ; n; --n )
for( ref = atomicValue; !atomicValue.compare_exchange_weak( ref,
ref + 1, memory_order_relaxed ); ++fails );
else
for( ; n; --n )
atomicValue.fetch_add( 1, memory_order_relaxed );
uint64_t ns = duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count();
lock_guard<mutex> lock( mtx );
totalTicks += ns;
totalFails += fails;
};
for( unsigned t = 1; t <= processors; ++t )
{
vector<thread> vt;
vt.reserve( t );
count = ROUNDS;
totalTicks = 0;
totalFails = 0;
for( unsigned i = 0; i != t; ++i )
{
vt.emplace_back( atomicThread );
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;
hThread = vt.back().native_handle();
SetThreadAffinityMask( hThread, (DWORD_PTR)1 << i );
GetThreadGroupAffinity( hThread, &ga );
ga.Group = getProcessorGroup( i );
SetThreadGroupAffinity( hThread, &ga, nullptr );
SetThreadPriority( hThread, THREAD_PRIORITY_HIGHEST );
SetThreadPriority( hThread, THREAD_PRIORITY_TIME_CRITICAL );
#elif defined(__unix__)
cpu_set_t cpuSet;
CPU_ZERO( &cpuSet );
CPU_SET( i, &cpuSet );
pthread_setaffinity_np( vt.back().native_handle(), sizeof cpuSet,
&cpuSet );
#endif
}
for( unsigned i = 0; i != t; ++i )
semReady.forced_wait();
semStart.release( t );
for( thread &thr : vt )
thr.join();
double ns = (double)(int64_t)totalTicks / (int)t / ROUNDS;
double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
cout << t << " threads: " << ns << "ns";
if( !xadd )
cout << ", avg fails: " << failsPerSucc;
cout << endl;
}
};
cout << "xchg:" << endl;
benchmark( false );
cout << "xadd:" << endl;
benchmark( true );
}

// semaphore.h:

#pragma once
#if defined(_MSC_VER)
#include <Windows.h>
#include <intrin.h>
#elif defined(__unix__)
#include <semaphore.h>
#include <pthread.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>
#include <climits>
#endif
#include <new>
#include <cstdint>
#include <cassert>
#include <system_error>

class semaphore
{
public:
semaphore();
semaphore( semaphore const & ) = delete;
void operator =( semaphore const & ) = delete;
~semaphore();
bool wait();
void forced_wait();
unsigned release( unsigned count = 1 );
void forced_release( unsigned count = 1 );

private:
#if defined(_MSC_VER)
HANDLE m_hSem;
#elif defined(__unix__)
#if defined(SYSV_SEMAPHORE)
union semun
{
int val;
semid_ds *buf;
unsigned short *array;
seminfo *__buf;
};
int m_semid;
#else
sem_t m_sem;
#endif
#else
#error need platform-specific semaphore!
#endif
};

#if defined(_MSC_VER)

// throws system_error if creating semaphore failed

inline
semaphore::semaphore()
{
using namespace std;
if( (m_hSem = CreateSemaphore( nullptr, 0, 0x7FFFFFFF, nullptr )) == NULL )
throw system_error( error_code( (int)GetLastError(), system_category()
), "creating semaphore failed" );
}

inline
semaphore::~semaphore()
{
BOOL success = CloseHandle( m_hSem );
assert(success);
}

inline
bool semaphore::wait()
{
bool success = WaitForSingleObject( m_hSem, INFINITE ) == WAIT_OBJECT_0;
assert(success);
return success;
}

inline
unsigned semaphore::release( unsigned count )
{
BOOL success = ReleaseSemaphore( m_hSem, (LONG)count, nullptr );
assert(success);
return success ? 0 : count;
}

#elif defined(__unix__)

// throws system_error if creating semaphore failed

inline
semaphore::semaphore()
{
using namespace std;
#if defined(SYSV_SEMAPHORE)
if( (m_semid = semget( IPC_PRIVATE, 1, IPC_CREAT | S_IRWXU )) == -1 )
throw system_error( error_code( errno, system_category() ),
"creating semaphore failed" );
semun su;
su.val = 0;
if( semctl( m_semid, 0, SETVAL, su ) == -1 )
{
int errNo = errno;
int ret = semctl( m_semid, 0, IPC_RMID );
assert(ret != -1);
throw system_error( error_code( errNo, system_category() ),
"setting semaphore to zero after creation failed" );
}
#else
if( sem_init( &m_sem, 0, 0 ) != 0 )
throw system_error( error_code( errno, system_category() ),
"creating semaphore failed" );
#endif
}

inline
semaphore::~semaphore()
{
#if defined(SYSV_SEMAPHORE)
int ret = semctl( m_semid, 0, IPC_RMID );
assert(ret != -1);
#else
int ret = sem_destroy( &m_sem );
assert(ret == 0);
#endif
}

inline
bool semaphore::wait()
{
#if defined(SYSV_SEMAPHORE)
int ret;
do
{
sembuf sb;
sb.sem_op = -1;
sb.sem_num = 0;
sb.sem_flg = 0;
ret = semop( m_semid, &sb, 1 );
} while( ret == EINTR );
assert(ret == 0);
return ret == 0;
#else
int ret;
do
ret = sem_wait( &m_sem );
while( ret == EINTR );
assert(ret == 0);
return ret == 0;
#endif
}

inline
unsigned semaphore::release( unsigned count )
{
#if defined(SYSV_SEMAPHORE)
for( unsigned short release; (release = (unsigned short)(count %
SHRT_MAX)) != 0; count -= release )
{
int ret;
do
{
sembuf sb;
sb.sem_op = (short)release;
sb.sem_num = 0;
sb.sem_flg = 0;
ret = semop( m_semid, &sb, 1 );
} while( ret == EINTR );
if( ret != 0 )
{
assert(false);
return count;
}
}
return 0;
#else
for( ; count; --count )
if( sem_post( &m_sem ) != 0 )
{
assert(false);
return count;
}
return 0;
#endif
}

#else
#error need platform-specific semaphore
#endif

inline
void semaphore::forced_wait()
{
while( !wait() );
}

inline
void semaphore::forced_release( unsigned count )
{
for( ; count; count = release( count ) );
}

Chris M. Thomasson

unread,
Jan 10, 2021, 4:21:33 PM1/10/21
to
On 1/10/2021 12:31 AM, Bonita Montero wrote:
> #include <iostream>
> #include <atomic>
> #include <cstdint>
> #include <memory>
> #include <vector>
>
> using namespace std;
>
> struct alignas(hardware_constructive_interference_size) A
> {
>     atomic<uint64_t> value;
> } a;
>
> int main()
> {
>     vector<unique_ptr<A>> vua( 1'000'000 );
>     for( unique_ptr<A> &ua : vua )
>         ua = make_unique<A>();
>     bool fits = true;
>     for( unique_ptr<A> &ua : vua )
>         if( (uintptr_t)ua.get() % 64 )
>             fits = false;
>     cout << (fits ? "fits" : "doesn't fit") << endl;
> }

I get:

doesn't fit

___

Using hardware_constructive_interference_size of 64.

Humm... I don't think that using new works either. Now, I think it has
to, or should, work with aligned_alloc and placement new. Afaict, the
vector is allocating its internal storage on an unaligned boundary wrt
hardware_constructive_interference_size.
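
A sketch of the aligned_alloc plus placement-new route I have in mind
(std::aligned_alloc is C++17; on MSVC, _aligned_malloc/_aligned_free stand
in for it, as it turns out later in the thread):

#include <atomic>
#include <cstdint>
#include <cstdlib>   // std::aligned_alloc, std::free
#include <new>       // placement new

struct alignas(64) A
{
    std::atomic<uint64_t> value;
};

int main()
{
    // The size passed to aligned_alloc must be a multiple of the
    // alignment; sizeof(A) already is, because alignas(64) pads it.
    void *raw = std::aligned_alloc( alignof(A), sizeof(A) );
    if( !raw )
        return 1;
    A *a = new( raw ) A{};   // construct in the aligned storage
    a->value.store( 42, std::memory_order_relaxed );
    a->~A();                 // destroy before freeing the raw storage
    std::free( raw );
}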

Chris M. Thomasson

unread,
Jan 10, 2021, 6:35:23 PM1/10/21
to
On 1/9/2021 10:01 PM, Bonita Montero wrote:
Well, check this crap out, on MSVC 2017... It seems to work fine. Can
you compile it?
_______________________________
#include <iostream>
#include <atomic>
#include <cstdint>
#include <new>      // placement new
#include <malloc.h> // humm... _aligned_malloc / _aligned_free


#define CT_PAGE_ALIGNMENT 8192
#define CT_L2_ALIGNMENT 128

using namespace std;


struct alignas(CT_L2_ALIGNMENT) block
{
atomic<uint64_t> value;
};


struct test
{
int a;
char b;
block c;
char d;
long e;
};


int main()
{
void* raw_mem = _aligned_malloc(sizeof(test), CT_PAGE_ALIGNMENT);
if (!raw_mem) { std::cout << "failed alloc!\n"; return 0; }

std::cout << "raw_mem = " << raw_mem << "\n";

// test raw_mem
if ((std::uintptr_t)raw_mem % CT_PAGE_ALIGNMENT)
{
std::cout << "yikes! raw_mem does not fit in CT_PAGE_ALIGNMENT\n";
}

else
{
std::cout << "So far so good... raw_mem fits CT_PAGE_ALIGNMENT\n";
}

test* t = new (raw_mem) test;

// test &t->c
if ((std::uintptr_t)&t->c % CT_L2_ALIGNMENT)
{
std::cout << "yikes! &t->c does not fit in CT_L2_ALIGNMENT\n";
}

else
{
std::cout << "So far so good... &t->c fits CT_L2_ALIGNMENT\n";
}

_aligned_free(raw_mem);

return 0;
}
_______________________________


Can you get it to work on your end? For some reason my msvc 2017 does
not have std::aligned_alloc. I am getting the following output:

raw_mem = 01080000
So far so good... raw_mem fits CT_PAGE_ALIGNMENT
So far so good... &t->c fits CT_L2_ALIGNMENT


Chris M. Thomasson

unread,
Jan 12, 2021, 3:50:38 PM1/12/21
to
On 1/10/2021 3:35 PM, Chris M. Thomasson wrote:
> On 1/9/2021 10:01 PM, Bonita Montero wrote:
>>>>> Works fine. Now I am thinking about dynamic memory ala new.
>>>>> Probably have to use aligned_alloc and placement new? My old
>>>>> alignment code way back in the day still works, but its definitely
>>>>> not standard!
>>
>>>> The newed memory would be aligned also.
>>
>>> Really? ...
>>
>> Yes.
>>
>
> Well, check this crap out, on MSVC 2017... It seems to work fine. Can
> you compile it?
> _______________________________
[code snip]
> _______________________________
>
>
> Can you get it to work on your end? For some reason my msvc 2017 does
> not have std::aligned_alloc. I am getting the following output:
>
> raw_mem = 01080000
> So far so good... raw_mem fits CT_PAGE_ALIGNMENT
> So far so good... &t->c fits CT_L2_ALIGNMENT
>
>

Notice the CT_PAGE_ALIGNMENT. Now, this allows one to do a lot of fun
and interesting things. Like stealing bits from the actual pointer
value. We can stuff meta data in a pointer, basically bit stealing. This
data can be used to create exotic lock-free algorithms. Even wait-free!
One can use it for a state machine. Or use it to round down to
CT_PAGE_ALIGNMENT, works well with memory allocators. Many different
things! Of course we would want to make sure that struct test is padded
to CT_PAGE_ALIGNMENT. I did not do that here. But it's rather trivial
to modify my example code.
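
A small sketch of the bit stealing with an 8192-byte-aligned block (the
helper names are made up for illustration):

#include <cstdint>
#include <cassert>

constexpr std::uintptr_t CT_PAGE_ALIGNMENT = 8192;
constexpr std::uintptr_t TAG_MASK = CT_PAGE_ALIGNMENT - 1;   // low 13 bits are free

// Pack a small tag into the low bits of a page-aligned pointer.
std::uintptr_t pack( void *page_aligned, unsigned tag )
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>( page_aligned );
    assert( (p & TAG_MASK) == 0 && tag <= TAG_MASK );
    return p | tag;
}

void *unpack_ptr( std::uintptr_t packed )
{
    return reinterpret_cast<void *>( packed & ~TAG_MASK );   // round back down
}

unsigned unpack_tag( std::uintptr_t packed )
{
    return static_cast<unsigned>( packed & TAG_MASK );
}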

Bonita Montero

unread,
Jan 17, 2021, 5:27:12 AM1/17/21
to
According to this
https://www.bfilipek.com/2019/08/newnew-align.html
C++17 honors alignas.
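
I.e. with C++17 a plain new on an over-aligned type is routed to the
aligned operator new; a minimal sketch:

#include <atomic>
#include <cstdint>
#include <cassert>

struct alignas(64) A
{
    std::atomic<std::uint64_t> value;
};

int main()
{
    A *p = new A;   // C++17: routed to operator new(std::size_t, std::align_val_t)
    assert( (std::uintptr_t)p % 64 == 0 );
    delete p;
}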

Chris M. Thomasson

unread,
Jan 18, 2021, 12:43:21 PM1/18/21
to
Ohhh. Nice! I need to give it a go. Thanks.

Bonita Montero

unread,
Jan 18, 2021, 2:20:35 PM1/18/21
to
>> According to this
>>      https://www.bfilipek.com/2019/08/newnew-align.html
>> C++17 honors alignas.

> Ohhh. Nice! I need to give it a go. Thanks.

Which compiler did you use to test the code I did give ?

Chris M. Thomasson

unread,
Jan 18, 2021, 3:13:32 PM1/18/21
to
MSVC 2017

Chris M. Thomasson

unread,
Jan 18, 2021, 3:15:24 PM1/18/21
to
I have been wanting to upgrade to the newer versions. Damn.

Chris M. Thomasson

unread,
Jan 19, 2021, 9:26:43 PM1/19/21
to
On 1/18/2021 11:20 AM, Bonita Montero wrote:
What compiler did you use for it?

Bonita Montero

unread,
Jan 20, 2021, 1:37:22 AM1/20/21
to
>>>> According to this
>>>>      https://www.bfilipek.com/2019/08/newnew-align.html
>>>> C++17 honors alignas.

>>> Ohhh. Nice! I need to give it a go. Thanks.

>> Which compiler did you use to test the code I did give ?

> What compiler did you use for it?

MSVC2019 under Win10 and gcc under WSL/Win10.

Chris M. Thomasson

unread,
Jan 20, 2021, 4:25:30 PM1/20/21
to
I have been meaning to install MSVC 2019 for a while now. Never got
around to it. So, I am assuming your code works under it right? Where
the output is _not_:

doesn't fit

right?

Also, can you pretty please try to quote properly? It's hard to follow
the chain in the thread.

Bonita Montero

unread,
Jan 21, 2021, 12:15:04 AM1/21/21
to
> Also, can you pretty please try to quote properly? Its hard to follow
> the chain in the thread.

For me the result is "fits".

Chris M. Thomasson

unread,
Jan 21, 2021, 12:15:25 AM1/21/21
to
Good. in MSVC 2019 right?

Chris M. Thomasson

unread,
Jan 21, 2021, 12:23:57 AM1/21/21
to
On 1/18/2021 11:20 AM, Bonita Montero wrote:
I am installing MSVC 2019 right now.

Bonita Montero

unread,
Jan 21, 2021, 12:28:55 AM1/21/21
to
>>> Also, can you pretty please try to quote properly? Its hard to follow
>>> the chain in the thread.

>> For me the result is "fits".

> Good. in MSVC 2019 right?

I already said that I'm using MSVC 2019.

Chris M. Thomasson

unread,
Jan 21, 2021, 12:32:13 AM1/21/21
to
You did. Shit. Missed that. Sorry.

Chris M. Thomasson

unread,
Jan 21, 2021, 5:23:25 PM1/21/21
to
On 1/18/2021 11:20 AM, Bonita Montero wrote:
I installed MSVC 2019, and your code still does not work. I am getting:

doesn't fit

What am I doing wrong here?

Chris M. Thomasson

unread,
Jan 21, 2021, 5:25:45 PM1/21/21
to
DOH! I was not compiling in C++17 mode. So, now it works. Cool.

Thanks.