
atomic<>-costs


Bonita Montero

unread,
Jan 6, 2021, 9:09:56 AM1/6/21
to
I wrote a little benchmark that measures the cost of atomic operations
when a number of hardware threads are constantly hammering the same cacheline.
Here it is:

#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main()
{
    unsigned hc = thread::hardware_concurrency();
    if( !hc )
        return EXIT_FAILURE;
    uint64_t const ROUNDS = 1'000'000;
    auto benchmark = [&]( bool xadd )
    {
        mutex              mtx;
        condition_variable cvReady,
                           cvStart;
        unsigned           ready;
        bool               start;
        uint64_t           count;
        atomic<uint64_t>   atomicValue;
        uint64_t           totalTicks;
        uint64_t           totalFails;
        auto atomicThread = [&]()
        {
            using hrc_tp = time_point<high_resolution_clock>;
            unique_lock<mutex> lock( mtx );
            ++ready;
            cvReady.notify_one();
            for( ; !start; cvStart.wait( lock ) );
            uint64_t n = count;
            lock.unlock();
            uint64_t ref = atomicValue;
            hrc_tp start = high_resolution_clock::now();
            uint64_t fails = 0;
            if( !xadd )
                for( ; n; --n )
                    for( ; !atomicValue.compare_exchange_weak( ref, ref + 1,
                        memory_order_relaxed ); ++fails );
            else
                for( ; n; --n )
                    atomicValue.fetch_add( 1, memory_order_relaxed );
            uint64_t ns = duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count();
            lock.lock();
            totalTicks += ns;
            totalFails += fails;
        };
        for( unsigned t = 1; t <= hc; ++t )
        {
            vector<thread> vt;
            vt.reserve( t );
            ready      = 0;
            start      = false;
            count      = ROUNDS;
            totalTicks = 0;
            totalFails = 0;
            for( unsigned i = 1; i <= t; ++i )
                vt.emplace_back( atomicThread );
            unique_lock<mutex> lock( mtx );
            for( ; ready != t; cvReady.wait( lock ) );
            start = true;
            cvStart.notify_all();
            lock.unlock();
            for( thread &thr : vt )
                thr.join();
            double ns           = (double)(int64_t)totalTicks / (int)t / ROUNDS;
            double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
            cout << t << " threads: " << ns << "ns";
            if( !xadd )
                cout << ", avg fails: " << failsPerSucc;
            cout << endl;
        }
    };
    cout << "xchg:" << endl;
    benchmark( false );
    cout << "xadd:" << endl;
    benchmark( true );
}

I've got a 2.9 GHz 64-core Threadripper 3990X, and when all 128 threads
are XCHGing on the same cacheline, each successful operation takes
about 1,500 ns, i.e. about 4,350 clock cycles until there's a successful
XCHG. There are about 4 swap failures until a fifth attempt succeeds.
And a successful XADD takes about 240 ns, i.e. about 700 clock cycles.
So show me your data!
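
(Sanity check on the cycle figures, using the 2.9 GHz clock: 1,500 ns × 2.9
cycles/ns = 4,350 cycles, and 240 ns × 2.9 cycles/ns ≈ 700 cycles.)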

Bonita Montero

unread,
Jan 6, 2021, 11:55:35 AM1/6/21
to
Sorry, under Windows this test doesn't work correctly if you
have more than 64 threads, since the scheduler can only handle
up to 64 threads per processor group.

So I did a dirty workaround:

#if defined(_MSC_VER)
#include <Windows.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
if( argc < 2 )
return EXIT_FAILURE;
unsigned hc = atoi( argv[1] );
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;

#endif
for( unsigned i = 1; i <= t; ++i )
vt.emplace_back( atomicThread ),
#if !defined(_MSC_VER)
0;
#else
hThread = vt.back().native_handle(),
GetThreadGroupAffinity( hThread, &ga ),
ga.Group = (WORD)(i / 64),
SetThreadGroupAffinity( hThread, &ga, nullptr );
#endif
unique_lock<mutex> lock( mtx );
for( ; ready != t; cvReady.wait( lock ) );
start = true;
cvStart.notify_all();
lock.unlock();
for( thread &thr : vt )
thr.join();
double ns = (double)(int64_t)totalTicks / (int)t / ROUNDS;
double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
cout << t << " threads: " << ns << "ns";
if( !xadd )
cout << ", avg fails: " << failsPerSucc;
cout << endl;
}
};
cout << "xchg:" << endl;
benchmark( false );
cout << "xadd:" << endl;
benchmark( true );
}

This workaround only works if there are 64 threads in a processor group.
Otherwise I would have to do a complex processor topology query.

Bonita Montero

unread,
Jan 6, 2021, 1:24:36 PM1/6/21
to
Now it fully supports processor groups. There could actually be
asymmetrically populated sockets with different numbers of cores:

#if defined(_MSC_VER)
#include <Windows.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main()
{
#if !defined(_MSC_VER)
unsigned processors = thread::hardware_concurrency();
#else
DWORD dwLength = 0;
if( !GetLogicalProcessorInformationEx( RelationGroup, nullptr,
&dwLength ) &&
GetLastError() != ERROR_INSUFFICIENT_BUFFER )
return EXIT_FAILURE;
vector<char> buf( dwLength );
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX pSlpie =
(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf.data();
if( !GetLogicalProcessorInformationEx( RelationGroup, pSlpie, &dwLength ) )
return EXIT_FAILURE;
unsigned processors = 0;
for( unsigned i = 0; i != pSlpie->Group.ActiveGroupCount; ++i )
processors += pSlpie->Group.GroupInfo[i].ActiveProcessorCount;
auto getProcessorGroup = [&]( unsigned processor ) -> unsigned
{
processor %= processors;
unsigned group = 0;
for( ; group != pSlpie->Group.ActiveGroupCount
&& processor >=
pSlpie->Group.GroupInfo[group].ActiveProcessorCount;
++group );
return group;
};
#endif
for( unsigned t = 1; t <= processors; ++t )
{
vector<thread> vt;
vt.reserve( t );
ready = 0;
start = false;
count = ROUNDS;
totalTicks = 0;
totalFails = 0;
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;

#endif
for( unsigned i = 1; i <= t; ++i )
vt.emplace_back( atomicThread ),
#if !defined(_MSC_VER)
0;
#else
hThread = vt.back().native_handle(),
GetThreadGroupAffinity( hThread, &ga ),
ga.Group = getProcessorGroup( i ),

Chris M. Thomasson

unread,
Jan 6, 2021, 3:31:45 PM1/6/21
to
On 1/6/2021 6:09 AM, Bonita Montero wrote:
> I wrote a little benchmark that measures the cost of atomic-operations
> when a number of hw-threads are constantly hammering the same cacheline.
> Here it is:
[...]

> So show me your data !

I get:

xchg:
1 threads: 36.4439ns, avg fails: 0.999999
2 threads: 131.794ns, avg fails: 1.93928
3 threads: 169.609ns, avg fails: 2.3317
4 threads: 573.478ns, avg fails: 3.73222
xadd:
1 threads: 15.1847ns
2 threads: 92.2467ns
3 threads: 187.475ns
4 threads: 360.952ns


One little nitpick: XCHG should be renamed to CMPXCHG. :^)

XADD usually beats a CMPXCHG loop because it cannot fail.
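
A minimal sketch of the two update styles being contrasted (the function
names are made up; this is not the benchmark code itself):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{ 0 };

// XADD path: one hardware fetch-and-add; it always succeeds.
void increment_xadd()
{
    counter.fetch_add( 1, std::memory_order_relaxed );
}

// CMPXCHG path: read, compute, try to swap, retry on contention.
// On failure compare_exchange_weak reloads 'expected' with the
// current value, so the loop simply tries again.
void increment_cas()
{
    uint64_t expected = counter.load( std::memory_order_relaxed );
    while( !counter.compare_exchange_weak( expected, expected + 1,
                                           std::memory_order_relaxed ) )
        ;
}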

Chris M. Thomasson

unread,
Jan 6, 2021, 3:56:24 PM1/6/21
to
On 1/6/2021 6:09 AM, Bonita Montero wrote:
> I wrote a little benchmark that measures the cost of atomic-operations
> when a number of hw-threads are constantly hammering the same cacheline.
> Here it is:
[...]
> I've got a 2.9 GHz 64-core Threadripper 3990X, and when all 128 threads
> are XCHGing on the same cacheline, each successful operation takes
> about 1,500 ns, i.e. about 4,350 clock cycles until there's a successful
> XCHG. There are about 4 swap failures until a fifth attempt succeeds.
> And a successful XADD takes about 240 ns, i.e. about 700 clock cycles.
> So show me your data!

One other nitpick... Make sure that your:

atomic<uint64_t> atomicValue;

Is isolated: aligned and padded to a cache-line boundary. Afaict, you
did not do that! It's going to alter the test.
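
A minimal sketch of the isolation I mean, assuming a 64-byte line (the
struct name is made up):

#include <atomic>
#include <cstdint>

// alignas(64) puts the object on a 64-byte boundary and pads its size
// up to 64 bytes, so no neighbouring variable can share the cache line.
struct alignas(64) IsolatedCounter
{
    std::atomic<uint64_t> value;
};

static_assert( sizeof(IsolatedCounter) == 64, "expected one full cache line" );
static_assert( alignof(IsolatedCounter) == 64, "expected cache-line alignment" );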

Bonita Montero

unread,
Jan 7, 2021, 2:17:25 AM1/7/21
to
> One other nitpick... Make sure that your:
> atomic<uint64_t>   atomicValue;
> Is isolated: Aligned and padded on a cache line boundary.

That's automatically ensured by the compiler's natural 8-byte
alignment.

Bonita Montero

unread,
Jan 7, 2021, 3:04:33 AM1/7/21
to
The XCHG-loop was incorrect:

#if defined(_MSC_VER)
#include <Windows.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>

using namespace std;
using namespace chrono;

int main()
{
#if !defined(_MSC_VER)
unsigned processors = thread::hardware_concurrency();
#else
DWORD dwLength = 0;
if( !GetLogicalProcessorInformationEx( RelationGroup, nullptr,
&dwLength ) &&
GetLastError() != ERROR_INSUFFICIENT_BUFFER )
return EXIT_FAILURE;
vector<char> buf( dwLength );
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX pSlpie =
(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf.data();
if( !GetLogicalProcessorInformationEx( RelationGroup, pSlpie, &dwLength ) )
return EXIT_FAILURE;
unsigned processors = 0;
for( unsigned i = 0; i != pSlpie->Group.ActiveGroupCount; ++i )
processors += pSlpie->Group.GroupInfo[i].ActiveProcessorCount;
auto getProcessorGroup = [pSlpie, processors]( unsigned processor ) ->
unsigned
{
processor %= processors;
BYTE pCount;
unsigned group = 0;
for( ; group != pSlpie->Group.ActiveGroupCount
&& processor >= (pCount =
pSlpie->Group.GroupInfo[group].ActiveProcessorCount);
++group, processor -= pCount );
return group;
};
#endif
uint64_t const ROUNDS = 1'000'000;
auto benchmark = [&]( bool xadd )
{
mutex mtx;
condition_variable cvReady,
cvStart;
unsigned ready;
bool start;
uint64_t count;
atomic<uint64_t> atomicValue;
uint64_t totalTicks;
uint64_t totalFails;
auto atomicThread = [&]()
{
using hrc_tp = time_point<high_resolution_clock>;
unique_lock<mutex> lock( mtx );
++ready;
cvReady.notify_one();
for( ; !start; cvStart.wait( lock ) );
uint64_t n = count;
lock.unlock();
uint64_t ref = atomicValue;
hrc_tp start = high_resolution_clock::now();
uint64_t fails = 0;
if( !xadd )
for( ; n; --n )
for( ref = atomicValue; !atomicValue.compare_exchange_weak( ref,
ref + 1, memory_order_relaxed ); ++fails );
else
for( ; n; --n )
atomicValue.fetch_add( 1, memory_order_relaxed );
uint64_t ns = duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count();
lock.lock();
totalTicks += ns;
totalFails += fails;
};
for( unsigned t = 1; t <= processors; ++t )
{
vector<thread> vt;
vt.reserve( t );
ready = 0;
start = false;
count = ROUNDS;
totalTicks = 0;
totalFails = 0;
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;
#endif
for( unsigned i = 1; i <= t; ++i )
{
vt.emplace_back( atomicThread );
#if defined(_MSC_VER)
hThread = vt.back().native_handle();
GetThreadGroupAffinity( hThread, &ga );
ga.Group = getProcessorGroup( i );
SetThreadGroupAffinity( hThread, &ga, nullptr );
#endif
}
unique_lock<mutex> lock( mtx );
for( ; ready != t; cvReady.wait( lock ) );
start = true;
cvStart.notify_all();
lock.unlock();
for( thread &thr : vt )
thr.join();
double ns = (double)(int64_t)totalTicks / (int)t / ROUNDS;
double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
cout << t << " threads: " << ns << "ns";
if( !xadd )
cout << ", avg fails: " << failsPerSucc;
cout << endl;
}
};
cout << "xchg:" << endl;
benchmark( false );
cout << "xadd:" << endl;
benchmark( true );
}

The old loop usually counted one extra failure per successful increment.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:03:22 AM1/7/21
to
Well, I mean padded up to, say, a 64-byte L2 cache line (some use 128),
and aligned. Think for a moment. I want atomicValue to be 100% isolated.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:05:06 AM1/7/21
to
Basically, remove ANY possibility of false sharing...

Chris M. Thomasson

unread,
Jan 7, 2021, 4:05:55 AM1/7/21
to
On 1/7/2021 12:04 AM, Bonita Montero wrote:
> The XCHG-loop was incorrect:
>
[...]
CMPXCHG... Well, I will run your new variant.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:08:22 AM1/7/21
to
On 1/7/2021 12:04 AM, Bonita Montero wrote:
> The XCHG-loop was incorrect:
>
[...]
> It usually counted one more fail.

I get:

xchg:
1 threads: 31.7102ns, avg fails: 0
2 threads: 151.655ns, avg fails: 0.665942
3 threads: 327.787ns, avg fails: 0.851925
4 threads: 509.759ns, avg fails: 1.19408
xadd:
1 threads: 19.4809ns
2 threads: 80.3329ns
3 threads: 214.025ns
4 threads: 337.996ns

Chris M. Thomasson

unread,
Jan 7, 2021, 4:20:15 AM1/7/21
to
On 1/6/2021 6:09 AM, Bonita Montero wrote:
> I wrote a little benchmark that measures the cost of atomic-operations
> when a number of hw-threads are constantly hammering the same cacheline.
> Here it is:
[...]

God... This makes me wish I never sold my SunFire T2000. DAMN. Although,
the damn thing sounded like a vacuum cleaner whenever I fired it up.

Bonita Montero

unread,
Jan 7, 2021, 7:01:12 AM1/7/21
to
> Well, I mean padded up to a, say, L2 64 byte cache line, some use 128,
> and aligned. Think for a moment. I want atomicValue to be 100% isolated.

It doesn't need to be aligned on a cacheline boundary.
It just must not cross a cacheline boundary, and that's
ensured by its natural 8-byte alignment.
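
(The arithmetic behind that claim: an 8-byte object whose address is a
multiple of 8 starts at an offset within its 64-byte line that is also a
multiple of 8, i.e. at most 56, so it ends at offset 64 at the latest and
never straddles two lines.)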

Scott Lurndal

unread,
Jan 7, 2021, 11:56:39 AM1/7/21
to
DAGS "False Sharing", then reconsider your last statement.

Bonita Montero

unread,
Jan 7, 2021, 12:23:09 PM1/7/21
to
>> It doesn't need to be aligned on a cacheline boundary.
>> It just must not cross a cacheline boundary, and that's
>> ensured by its natural 8-byte alignment.

> DAGS "False Sharing", then reconsider your last statement.

Where do I have a false-sharing problem?

Chris M. Thomasson

unread,
Jan 7, 2021, 2:49:29 PM1/7/21
to
In atomicValue.

Bonita Montero

unread,
Jan 7, 2021, 2:52:15 PM1/7/21
to
>>> DAGS "False Sharing", then reconsider your last statement.

>> Where do I have a false sharing problem ?

> In atomicValue.

No, there's no false sharing. While the threads are hammering on
atomicValue, no other thread is doing anything on the cacheline
holding atomicValue. And the values around atomicValue aren't
touched in the meantime.

Chris M. Thomasson

unread,
Jan 7, 2021, 2:55:14 PM1/7/21
to
atomicValue needs to be completely isolated. Padded up to a cache line
and aligned on a boundary.

Bonita Montero

unread,
Jan 7, 2021, 3:30:48 PM1/7/21
to
>> No, there's no false sharing. While the threads are hammering on
>> atomicValue, no other thread is doing anything on the cacheline
>> holding atomicValue. And the values around atomicValue aren't
>> touched in the meantime.

> atomicValue needs to be completely isolated. Padded up to a cache line
> and aligned on a boundary.

No, only if the data surrounding it in the cacheline is
concurrently accessed by other threads; but it isn't.

Chris M. Thomasson

unread,
Jan 7, 2021, 4:39:53 PM1/7/21
to
It's an odd way to design things. You know, all threads might not start up
at the same time... ready, start and count, etc... intrude on
atomicValue. Why not just isolate it and be done with it? Keep in mind,
aligned and padded...

Bonita Montero

unread,
Jan 8, 2021, 3:04:44 AM1/8/21
to
> It's an odd way to design things. You know, all threads might
> not start up at the same time... ready, start and count, etc... ...

There's no false sharing.

Juha Nieminen

unread,
Jan 8, 2021, 3:32:46 AM1/8/21
to
Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
> atomicValue needs to be completely isolated. Padded up to a cache line
> and aligned on a boundary.

Is there a standard way of doing that?

Bonita Montero

unread,
Jan 8, 2021, 4:28:56 AM1/8/21
to
>> atomicValue needs to be completely isolated. Padded up to a cache line
>> and aligned on a boundary.

> Is there a standard way of doing that?

It's not necessary here.
But if you need it, you can do it this way:

#include <iostream>
#include <atomic>
#include <thread>
#include <cstdint>
#include <new>

using namespace std;

struct alignas(hardware_constructive_interference_size)
{
    atomic<uint64_t> value;
} a;

int main()
{
    cout << sizeof ::a << endl;
    bool fits = ((uintptr_t)&::a & 64 - 1) == 0;
    cout << (fits ? "fits" : "doesn't fit") << endl;
}

Bonita Montero

unread,
Jan 8, 2021, 4:48:40 AM1/8/21
to
Better this way.

>     bool fits = ((uintptr_t)&::a & 64 - 1) == 0;
bool fits = !(alignof(decltype(::a)) % 64);
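
Since alignof is a compile-time constant, this check can even be a
static_assert; a sketch, reusing ::a and the 64-byte figure from above:

static_assert( alignof(decltype(::a)) % 64 == 0,
               "type is not aligned/padded to a 64-byte cache line" );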


Chris M. Thomasson

unread,
Jan 8, 2021, 7:43:17 PM1/8/21
to
Nit pick, there can be.

totalTicks += ns;
totalFails += fails;

can interfere with atomicValue. Think about it.

Chris M. Thomasson

unread,
Jan 8, 2021, 7:48:21 PM1/8/21
to
That should do it. :^)

Bonita Montero

unread,
Jan 9, 2021, 12:37:42 AM1/9/21
to
>> There's no false sharing.

> Nit pick, there can be.
> totalTicks += ns;
> totalFails += fails;
> can interfere with atomicValue. Think about it.

There's nothing which interferes.
Make atomicValue aligned, and you won't measure
different results.

But here are the results from a

* 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
* 2x AMD EPYC 7551 32-Core Processor
* 2x AMD EPYC 7401 24-Core Processor

https://easyupload.io/4jaazg

Chris M. Thomasson

unread,
Jan 9, 2021, 12:45:07 AM1/9/21
to
On 1/8/2021 9:37 PM, Bonita Montero wrote:
>>> There's no false sharing.
>
>> Nit pick, there can be.
>> totalTicks += ns;
>> totalFails += fails;
>> can interfere with atomicValue. Think about it.
>
> There's nothing which interferes.

Is atomicValue aligned on a cache-line boundary and on a different
cache line than anything else, e.g. totalTicks and totalFails?



> Make atomicValue aligned, and you won't measure
> different results.

Do it anyway.

Bonita Montero

unread,
Jan 9, 2021, 1:43:53 AM1/9/21
to
> Is atomicValue aligned on a cache-line boundary and on a different
> cache line than anything else, e.g. totalTicks and totalFails?

No, it isn't, but it doesn't need to be. While the threads are running
and incrementing atomicValue, nothing else happens to that cacheline.
Align it and see that you won't get any different results.

Chris M. Thomasson

unread,
Jan 9, 2021, 1:59:04 AM1/9/21
to
On 1/8/2021 10:43 PM, Bonita Montero wrote:
>> Is atomicValue aligned on a cache-line boundary and on a different
>> cache line than anything else, e.g. totalTicks and totalFails?
>
> No, it isn't, but it doesn't need to be. While the threads are running
> and incrementing atomicValue, nothing else happens to that cacheline.

Even when some threads finish iteration and mess with totalTicks and
totalFails while others are still iterating?

Bonita Montero

unread,
Jan 9, 2021, 3:23:14 AM1/9/21
to
>> No, it isn't, but it doesn't need to be. While the threads are running
>> and incrementing atomicValue, nothing else happens to that cacheline.

> Even when some threads finish iteration and mess with totalTicks and
> totalFails while others are still iterating?

That's not relevant since the loops overlap by 99% or more.

Scott Lurndal

unread,
Jan 9, 2021, 11:55:51 AM1/9/21
to
You are wrong. But since you insist on removing context and attributions
from your posts, I'll not try to explain to you why.

Bonita Montero

unread,
Jan 9, 2021, 12:25:37 PM1/9/21
to
>> No, it isn't, but it doesn't need to be. While the threads are running
>> and incrementing atomicValue, nothing else happens to that cacheline.
>> Align it and see that you won't get any different results.

> You are wrong. But since you insist on removing context and attributions
> from your posts, I'll not try to explain to you why.

No, I'm not wrong. For at least 99% of the time that the threads are
pounding on atomicValue, they're not contending with other threads doing
different things in the same cacheline. The small skew between the threads
after they are started, when they do the unlock(), isn't relevant.

Chris M. Thomasson

unread,
Jan 9, 2021, 3:25:35 PM1/9/21
to
I see what you mean, but there can be false sharing: keep in mind that I
am being rather pedantic here, Bonita. Okay. The main point is that using
XADD when you can is generally "better" than a CMPXCHG loop. Also, you
are using compare_exchange_weak. This means that on platforms other than
x86, perhaps using LL/SC, it can spuriously fail even if the CAS should
have succeeded. So, on a PPC, it would be fun to test
compare_exchange_weak vs compare_exchange_strong.

I remember a scenario a long time ago where a compare_exchange_strong
was needed to get a state machine to work properly. It was well before
C++11, but these problems are nothing new. A failed CAS actually
meant it really failed on its comparand, not that it spuriously failed when
it should have succeeded. The state machine depended on this difference!

Basically, isolation via proper aligning and padding is just a good
habit to get into. Especially on those other platforms. False sharing
can invade the reservation granule and cause actual livelock-like
scenarios. On x86 it won't cause a CMPXCHG to fail, but it causes
unnecessary cache thrashing, destroying performance. On a PPC, false
sharing on the reservation can be rather deadly... It can livelock when
the system is under heavy load.
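
A minimal sketch of the kind of one-shot transition where the strong form
matters (the states and names here are hypothetical, not the benchmark's
code): the CAS is attempted exactly once, so a spurious failure would be
indistinguishable from losing the race.

#include <atomic>

enum state_t { IDLE = 0, CLAIMED = 1 };
std::atomic<int> state{ IDLE };

// Returns false only if someone else really claimed the slot first.
// With compare_exchange_weak an LL/SC machine could fail spuriously
// and misreport a lost race, so compare_exchange_strong is used.
bool try_claim()
{
    int expected = IDLE;
    return state.compare_exchange_strong( expected, CLAIMED,
                                          std::memory_order_acq_rel );
}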

Chris M. Thomasson

unread,
Jan 9, 2021, 3:53:02 PM1/9/21
to
On 1/8/2021 1:28 AM, Bonita Montero wrote:
Works fine. Now I am thinking about dynamic memory ala new. Probably
have to use aligned_alloc and placement new? My old alignment code way
back in the day still works, but it's definitely not standard!

https://pastebin.com/raw/f37a23918

Back in 2009. Damn, I am getting older!

Bonita Montero

unread,
Jan 9, 2021, 3:56:35 PM1/9/21
to
> I see what you mean, but there can be false sharing:
> Keep in mind that I am being rather pedantic here Bonita.

The threads don't get released by cvStart synchronously. cvStart is
signalled for all threads at the same time, but the threads also need to
get hold of the mutex that is associated with this CV, and that can only
happen one thread after another, as each thread releases the mutex. So
the first thread releases the second thread, and so forth. But this
results only in a small skew, which is insignificant relative to the
time the threads spend doing the atomic operations.
What would be more appropriate here would be a semaphore which is
released by the number of waiting threads. But there are no semaphores
until C++20.
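
A sketch of such a start barrier with C++20's std::counting_semaphore
(assuming a C++20 compiler, which isn't an option here; the names mirror
the benchmark's semReady/semStart):

#include <semaphore>

std::counting_semaphore<> semReady{ 0 };
std::counting_semaphore<> semStart{ 0 };

void worker()
{
    semReady.release();   // tell the main thread we are in place
    semStart.acquire();   // block until the whole group is released
    // ... timed loop ...
}

void start_all( unsigned t )   // main thread, t workers
{
    for( unsigned i = 0; i != t; ++i )
        semReady.acquire();    // wait until all t workers are in place
    semStart.release( t );     // release the whole group with one call
}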


> The main point is that using
> XADD when you can, is generally "better" than a CMPXCHG loop.

My test doesn't say which one is better when you use either operation
for synchronization purposes. It simply measures their cost and
nothing more.

> XADD when you can, is generally "better" than a CMPXCHG loop. Also, you
> are using compare_exchange_weak. This means that on platforms other than
> x86, perhaps using LL/SC, it can spuriously fail even if the CAS should
> of succeeded.

That's intended, because if you're using LL/SC for synchronization
you can't do anything against spurious failures other than repeating
the operation anyway.

> Basically, isolation via proper aligning and padding is just a good
> habit to get into. ...

It's not necessary for this benchmark. It won't get less accurate
beyond the inaccuracy you already have from other measurement-error
effects. So the whole discussion is nonsense.

Bonita Montero

unread,
Jan 9, 2021, 3:57:29 PM1/9/21
to
> Works fine. Now I am thinking about dynamic memory ala new. Probably
> have to use aligned_alloc and placement new? My old alignment code way
> back in the day still works, but its definitely not standard!

The newed memory would be aligned also.

Chris M. Thomasson

unread,
Jan 9, 2021, 4:01:53 PM1/9/21
to
Really? When you find some free time, can you code me up a quick
example? I am still using my old code:

https://pastebin.com/raw/f37a23918

It would be really fun to spice it up with C++11! I think you can help
me out, since I am not totally familiar with some exotic C++ features. I
was almost always in C. I still have code in x86 asm that implemented my
atomic lib.

So, is there a way to convert my simple ralloc region allocator into
something that is 100% standard compliant?

Chris M. Thomasson

unread,
Jan 9, 2021, 4:38:51 PM1/9/21
to
True. However, the retry logic can be more "efficient" using
compare_exchange_strong rather than trying to emulate the same thing
with compare_exchange_weak.

Chris M. Thomasson

unread,
Jan 9, 2021, 6:32:04 PM1/9/21
to
However, this is not necessary with your benchmark. Using
compare_exchange_strong really shines when one wants to create a state
machine wrt the failure of a compare_exchange. In these cases, one does
not want spurious failures.

Bonita Montero

unread,
Jan 10, 2021, 1:01:12 AM1/10/21
to
>> That's intended, because if you're using LL/SC for synchronization
>> you can't do anything against spurious failures other than repeating
>> the operation anyway.

> True. However, the retry logic can be more "efficient" using
> compare_exchange_strong rather than trying to emulate the same thing
> with compare_exchange_weak.

compare_exchange_strong would have its own internal retry loop, and I
wouldn't see those additional retries, so I couldn't count them. So it
wouldn't be appropriate for me.

Bonita Montero

unread,
Jan 10, 2021, 1:02:07 AM1/10/21
to
>>> Works fine. Now I am thinking about dynamic memory ala new. Probably
>>> have to use aligned_alloc and placement new? My old alignment code
>>> way back in the day still works, but its definitely not standard!

>> The newed memory would be aligned also.

> Really? ...

Yes.

Chris M. Thomasson

unread,
Jan 10, 2021, 1:32:16 AM1/10/21
to
Agreed. Fwiw, in times where using compare_exchange_strong is
appropriate, it can be better than using a compare_exchange_weak loop.

Chris M. Thomasson

unread,
Jan 10, 2021, 1:33:14 AM1/10/21
to
Show me.

Chris M. Thomasson

unread,
Jan 10, 2021, 1:34:28 AM1/10/21
to
On 1/9/2021 10:01 PM, Bonita Montero wrote:
You cut off my point about using new vs aligned_alloc and placement new.

Bonita Montero

unread,
Jan 10, 2021, 3:31:57 AM1/10/21
to
>>>> The newed memory would be aligned also.

>>> Really? ...

>> Yes.

> Show me.

#include <iostream>
#include <atomic>
#include <cstdint>
#include <memory>
#include <vector>
#include <new>

using namespace std;

struct alignas(hardware_constructive_interference_size) A
{
    atomic<uint64_t> value;
} a;

int main()
{
    vector<unique_ptr<A>> vua( 1'000'000 );
    for( unique_ptr<A> &ua : vua )
        ua = make_unique<A>();
    bool fits = true;
    for( unique_ptr<A> &ua : vua )
        if( (uintptr_t)ua.get() % 64 )
            fits = false;
    cout << (fits ? "fits" : "doesn't fit") << endl;
}

Bonita Montero

unread,
Jan 10, 2021, 4:01:11 AM1/10/21
to
Now here's a version with my own semaphore class:

#if defined(_MSC_VER)
#include <Windows.h>
#elif defined(__unix__)
#include <unistd.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/types.h>
#endif
#include <iostream>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <thread>
#include "semaphore.h"

using namespace std;
using namespace chrono;

#if defined(_MSC_VER)
int wmain( int argc, wchar_t **argv )
#else
int main( int argc, char **argv )
#endif
{
#if !defined(_MSC_VER)
unsigned processors = thread::hardware_concurrency();
if( !processors )
return EXIT_FAILURE;
#else
DWORD dwLength = 0;
if( !GetLogicalProcessorInformationEx( RelationGroup, nullptr,
&dwLength ) &&
GetLastError() != ERROR_INSUFFICIENT_BUFFER )
return EXIT_FAILURE;
vector<char> buf( dwLength );
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX pSlpie =
(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf.data();
if( !GetLogicalProcessorInformationEx( RelationGroup, pSlpie, &dwLength ) )
return EXIT_FAILURE;
unsigned processors = 0;
for( unsigned i = 0; i != pSlpie->Group.ActiveGroupCount; ++i )
processors += pSlpie->Group.GroupInfo[i].ActiveProcessorCount;
auto getProcessorGroup = [pSlpie, processors]( unsigned processor ) ->
unsigned
{
processor %= processors;
BYTE pCount;
unsigned group = 0;
for( ; group != pSlpie->Group.ActiveGroupCount
&& processor >= (pCount =
pSlpie->Group.GroupInfo[group].ActiveProcessorCount);
++group, processor -= pCount );
return group;
};
#endif
#if defined(_MSC_VER)
// log on as a different user to gain higher scheduling priorities
HANDLE hToken;
if( argc >= 3 && LogonUserW( argv[1], nullptr, argv[2],
LOGON32_LOGON_INTERACTIVE, LOGON32_PROVIDER_DEFAULT, &hToken ) )
(void)ImpersonateLoggedOnUser( hToken );
HANDLE hCurrentProcess = GetCurrentProcess();
SetPriorityClass( hCurrentProcess, HIGH_PRIORITY_CLASS );
SetPriorityClass( hCurrentProcess, REALTIME_PRIORITY_CLASS );
#elif defined(__unix__)
for( int nice = 0; nice != -21 && setpriority( PRIO_PROCESS, getpid(),
nice ) == 0; --nice );
#endif
uint64_t const ROUNDS = 1'000'000;
auto benchmark = [&]( bool xadd )
{
semaphore semReady,
semStart;
uint64_t count;
atomic<uint64_t> atomicValue;
mutex mtx;
uint64_t totalTicks;
uint64_t totalFails;
auto atomicThread = [&]()
{
using hrc_tp = time_point<high_resolution_clock>;
semReady.forced_release();
semStart.forced_wait();
uint64_t n = count;
uint64_t ref;
hrc_tp start = high_resolution_clock::now();
uint64_t fails = 0;
if( !xadd )
for( ; n; --n )
for( ref = atomicValue; !atomicValue.compare_exchange_weak( ref,
ref + 1, memory_order_relaxed ); ++fails );
else
for( ; n; --n )
atomicValue.fetch_add( 1, memory_order_relaxed );
uint64_t ns = duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count();
lock_guard<mutex> lock( mtx );
totalTicks += ns;
totalFails += fails;
};
for( unsigned t = 1; t <= processors; ++t )
{
vector<thread> vt;
vt.reserve( t );
count = ROUNDS;
totalTicks = 0;
totalFails = 0;
for( unsigned i = 0; i != t; ++i )
{
vt.emplace_back( atomicThread );
#if defined(_MSC_VER)
GROUP_AFFINITY ga;
HANDLE hThread;
hThread = vt.back().native_handle();
SetThreadAffinityMask( hThread, (DWORD_PTR)1 << i );
GetThreadGroupAffinity( hThread, &ga );
ga.Group = getProcessorGroup( i );
SetThreadGroupAffinity( hThread, &ga, nullptr );
SetThreadPriority( hThread, THREAD_PRIORITY_HIGHEST );
SetThreadPriority( hThread, THREAD_PRIORITY_TIME_CRITICAL );
#elif defined(__unix__)
cpu_set_t cpuSet;
CPU_ZERO( &cpuSet );
CPU_SET( i, &cpuSet );
pthread_setaffinity_np( vt.back().native_handle(), sizeof cpuSet,
&cpuSet );
#endif
}
for( unsigned i = 0; i != t; ++i )
semReady.forced_wait();
semStart.release( t );
for( thread &thr : vt )
thr.join();
double ns = (double)(int64_t)totalTicks / (int)t / ROUNDS;
double failsPerSucc = (double)(int64_t)totalFails / (int)t / ROUNDS;
cout << t << " threads: " << ns << "ns";
if( !xadd )
cout << ", avg fails: " << failsPerSucc;
cout << endl;
}
};
cout << "xchg:" << endl;
benchmark( false );
cout << "xadd:" << endl;
benchmark( true );
}

// semaphore.h:

#pragma once
#if defined(_MSC_VER)
#include <Windows.h>
#include <intrin.h>
#elif defined(__unix__)
#include <semaphore.h>
#include <pthread.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>
#include <climits>
#endif
#include <new>
#include <cstdint>
#include <cassert>
#include <system_error>

class semaphore
{
public:
semaphore();
semaphore( semaphore const & ) = delete;
void operator =( semaphore const & ) = delete;
~semaphore();
bool wait();
void forced_wait();
unsigned release( unsigned count = 1 );
void forced_release( unsigned count = 1 );

private:
#if defined(_MSC_VER)
HANDLE m_hSem;
#elif defined(__unix__)
#if defined(SYSV_SEMAPHORE)
union semun
{
int val;
semid_ds *buf;
unsigned short *array;
seminfo *__buf;
};
int m_semid;
#else
sem_t m_sem;
#endif
#else
#error need platform-specific semaphore!
#endif
};

#if defined(_MSC_VER)

// throws system_error if creating semaphore failed

inline
semaphore::semaphore()
{
using namespace std;
if( (m_hSem = CreateSemaphore( nullptr, 0, 0x7FFFFFFF, nullptr )) == NULL )
throw system_error( error_code( (int)GetLastError(), system_category()
), "creating semaphore failed" );
}

inline
semaphore::~semaphore()
{
BOOL success = CloseHandle( m_hSem );
assert(success);
}

inline
bool semaphore::wait()
{
bool success = WaitForSingleObject( m_hSem, INFINITE ) == WAIT_OBJECT_0;
assert(success);
return success;
}

inline
unsigned semaphore::release( unsigned count )
{
BOOL success = ReleaseSemaphore( m_hSem, (LONG)count, nullptr );
assert(success);
return success ? 0 : count;
}

#elif defined(__unix__)

// throws system_error if creating semaphore failed

inline
semaphore::semaphore()
{
using namespace std;
#if defined(SYSV_SEMAPHORE)
if( (m_semid = semget( IPC_PRIVATE, 1, IPC_CREAT | S_IRWXU )) == -1 )
throw system_error( error_code( errno, system_category() ),
"creating semaphore failed" );
semun su;
su.val = 0;
if( semctl( m_semid, 0, SETVAL, su ) == -1 )
{
int errNo = errno;
int ret = semctl( m_semid, 0, IPC_RMID );
assert(ret != -1);
throw system_error( error_code( errNo, system_category() ),
"setting semaphore to zero after creation failed" );
}
#else
if( sem_init( &m_sem, 0, 0 ) != 0 )
throw system_error( error_code( errno, system_category() ),
"creating semaphore failed" );
#endif
}

inline
semaphore::~semaphore()
{
#if defined(SYSV_SEMAPHORE)
int ret = semctl( m_semid, 0, IPC_RMID );
assert(ret != -1);
#else
int ret = sem_destroy( &m_sem );
assert(ret == 0);
#endif
}

inline
bool semaphore::wait()
{
#if defined(SYSV_SEMAPHORE)
int ret;
do
{
sembuf sb;
sb.sem_op = -1;
sb.sem_num = 0;
sb.sem_flg = 0;
ret = semop( m_semid, &sb, 1 );
} while( ret == EINTR );
assert(ret == 0);
return ret == 0;
#else
int ret;
do
ret = sem_wait( &m_sem );
while( ret == EINTR );
assert(ret == 0);
return ret == 0;
#endif
}

inline
unsigned semaphore::release( unsigned count )
{
#if defined(SYSV_SEMAPHORE)
for( unsigned short release; (release = (unsigned short)(count %
SHRT_MAX)) != 0; count -= release )
{
int ret;
do
{
sembuf sb;
sb.sem_op = (short)release;
sb.sem_num = 0;
sb.sem_flg = 0;
ret = semop( m_semid, &sb, 1 );
} while( ret == EINTR );
if( ret != 0 )
{
assert(false);
return count;
}
}
return 0;
#else
for( ; count; --count )
if( sem_post( &m_sem ) != 0 )
{
assert(false);
return count;
}
return 0;
#endif
}

#else
#error need platform-specific semaphore
#endif

inline
void semaphore::forced_wait()
{
while( !wait() );
}

inline
void semaphore::forced_release( unsigned count )
{
for( ; count; count = release( count ) );
}

Chris M. Thomasson

unread,
Jan 10, 2021, 4:21:33 PM1/10/21
to
On 1/10/2021 12:31 AM, Bonita Montero wrote:
> #include <iostream>
> #include <atomic>
> #include <cstdint>
> #include <memory>
> #include <vector>
>
> using namespace std;
>
> struct alignas(hardware_constructive_interference_size) A
> {
>     atomic<uint64_t> value;
> } a;
>
> int main()
> {
>     vector<unique_ptr<A>> vua( 1'000'000 );
>     for( unique_ptr<A> &ua : vua )
>         ua = make_unique<A>();
>     bool fits = true;
>     for( unique_ptr<A> &ua : vua )
>         if( (uintptr_t)ua.get() % 64 )
>             fits = false;
>     cout << (fits ? "fits" : "doesn't fit") << endl;
> }

I get:

doesn't fit

___

Using hardware_constructive_interference_size of 64.

Humm... I don't think that using new works either. Now, I think it has
to, or should, work with aligned_alloc and placement new. Afaict, the
vector is allocating its internal storage on an unaligned boundary wrt
hardware_constructive_interference_size.
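
A sketch of the aligned_alloc plus placement-new route I have in mind
(std::aligned_alloc is C++17; on MSVC, _aligned_malloc/_aligned_free stand
in for it, as it turns out later in the thread):

#include <atomic>
#include <cstdint>
#include <cstdlib>   // std::aligned_alloc, std::free
#include <new>       // placement new

struct alignas(64) A
{
    std::atomic<uint64_t> value;
};

int main()
{
    // The size passed to aligned_alloc must be a multiple of the
    // alignment; sizeof(A) already is, because alignas(64) pads it.
    void *raw = std::aligned_alloc( alignof(A), sizeof(A) );
    if( !raw )
        return 1;
    A *a = new( raw ) A{};   // construct in the aligned storage
    a->value.store( 42, std::memory_order_relaxed );
    a->~A();                 // destroy before freeing the raw storage
    std::free( raw );
}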

Chris M. Thomasson

unread,
Jan 10, 2021, 6:35:23 PM1/10/21
to
On 1/9/2021 10:01 PM, Bonita Montero wrote:
Well, check this crap out, on MSVC 2017... It seems to work fine. Can
you compile it?
_______________________________
#include <iostream>
#include <atomic>
#include <cstdint>
#include <new>      // placement new
#include <malloc.h> // humm... _aligned_malloc / _aligned_free


#define CT_PAGE_ALIGNMENT 8192
#define CT_L2_ALIGNMENT 128

using namespace std;


struct alignas(CT_L2_ALIGNMENT) block
{
atomic<uint64_t> value;
};


struct test
{
int a;
char b;
block c;
char d;
long e;
};


int main()
{
void* raw_mem = _aligned_malloc(sizeof(test), CT_PAGE_ALIGNMENT);
if (!raw_mem) { std::cout << "failed alloc!\n"; return 0; }

std::cout << "raw_mem = " << raw_mem << "\n";

// test raw_mem
if ((std::uintptr_t)raw_mem % CT_PAGE_ALIGNMENT)
{
std::cout << "yikes! raw_mem does not fit in CT_PAGE_ALIGNMENT\n";
}

else
{
std::cout << "So far so good... raw_mem fits CT_PAGE_ALIGNMENT\n";
}

test* t = new (raw_mem) test;

// test &t->c
if ((std::uintptr_t)&t->c % CT_L2_ALIGNMENT)
{
std::cout << "yikes! &t->c does not fit in CT_L2_ALIGNMENT\n";
}

else
{
std::cout << "So far so good... &t->c fits CT_L2_ALIGNMENT\n";
}

_aligned_free(raw_mem);

return 0;
}
_______________________________


Can you get it to work on your end? For some reason my msvc 2017 does
not have std::aligned_alloc. I am getting the following output:

raw_mem = 01080000
So far so good... raw_mem fits CT_PAGE_ALIGNMENT
So far so good... &t->c fits CT_L2_ALIGNMENT


Chris M. Thomasson

unread,
Jan 12, 2021, 3:50:38 PM1/12/21
to
On 1/10/2021 3:35 PM, Chris M. Thomasson wrote:
> On 1/9/2021 10:01 PM, Bonita Montero wrote:
>>>>> Works fine. Now I am thinking about dynamic memory ala new.
>>>>> Probably have to use aligned_alloc and placement new? My old
>>>>> alignment code way back in the day still works, but its definitely
>>>>> not standard!
>>
>>>> The newed memory would be aligned also.
>>
>>> Really? ...
>>
>> Yes.
>>
>
> Well, check this crap out, on MSVC 2017... It seems to work fine. Can
> you compile it?
> _______________________________
[code snip]
> _______________________________
>
>
> Can you get it to work on your end? For some reason my msvc 2017 does
> not have std::aligned_alloc. I am getting the following output:
>
> raw_mem = 01080000
> So far so good... raw_mem fits CT_PAGE_ALIGNMENT
> So far so good... &t->c fits CT_L2_ALIGNMENT
>
>

Notice the CT_PAGE_ALIGNMENT. Now, this allows one to do a lot of fun
and interesting things. Like stealing bits from the actual pointer
value. We can stuff meta data in a pointer, basically bit stealing. This
data can be used to create exotic lock-free algorithms. Even wait-free!
One can use it for a state machine. Or use it to round down to
CT_PAGE_ALIGNMENT, works well with memory allocators. Many different
things! Of course we would want to make sure that struct test is padded
to CT_PAGE_ALIGNMENT. I did not do that here. But it's rather trivial
to modify my example code.
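
A small sketch of the bit stealing with an 8192-byte-aligned block (the
helper names are made up for illustration):

#include <cstdint>
#include <cassert>

constexpr std::uintptr_t CT_PAGE_ALIGNMENT = 8192;
constexpr std::uintptr_t TAG_MASK = CT_PAGE_ALIGNMENT - 1;   // low 13 bits are free

// Pack a small tag into the low bits of a page-aligned pointer.
std::uintptr_t pack( void *page_aligned, unsigned tag )
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>( page_aligned );
    assert( (p & TAG_MASK) == 0 && tag <= TAG_MASK );
    return p | tag;
}

void *unpack_ptr( std::uintptr_t packed )
{
    return reinterpret_cast<void *>( packed & ~TAG_MASK );   // round back down
}

unsigned unpack_tag( std::uintptr_t packed )
{
    return static_cast<unsigned>( packed & TAG_MASK );
}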

Bonita Montero

unread,
Jan 17, 2021, 5:27:12 AM1/17/21
to
According to this
https://www.bfilipek.com/2019/08/newnew-align.html
C++17 honors alignas.
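
I.e. with C++17 a plain new on an over-aligned type is routed to the
aligned operator new; a minimal sketch:

#include <atomic>
#include <cstdint>
#include <cassert>

struct alignas(64) A
{
    std::atomic<std::uint64_t> value;
};

int main()
{
    A *p = new A;   // C++17: routed to operator new(std::size_t, std::align_val_t)
    assert( (std::uintptr_t)p % 64 == 0 );
    delete p;
}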

Chris M. Thomasson

unread,
Jan 18, 2021, 12:43:21 PM1/18/21
to
Ohhh. Nice! I need to give it a go. Thanks.

Bonita Montero

unread,
Jan 18, 2021, 2:20:35 PM1/18/21
to
>> According to this
>>      https://www.bfilipek.com/2019/08/newnew-align.html
>> C++17 honors alignas.

> Ohhh. Nice! I need to give it a go. Thanks.

Which compiler did you use to test the code I did give ?

Chris M. Thomasson

unread,
Jan 18, 2021, 3:13:32 PM1/18/21
to
MSVC 2017

Chris M. Thomasson

unread,
Jan 18, 2021, 3:15:24 PM1/18/21
to
I have been wanting to upgrade to the newer versions. Damn.

Chris M. Thomasson

unread,
Jan 19, 2021, 9:26:43 PM1/19/21
to
On 1/18/2021 11:20 AM, Bonita Montero wrote:
What compiler did you use for it?

Bonita Montero

unread,
Jan 20, 2021, 1:37:22 AM1/20/21
to
>>>> According to this
>>>>      https://www.bfilipek.com/2019/08/newnew-align.html
>>>> C++17 honors alignas.

>>> Ohhh. Nice! I need to give it a go. Thanks.

>> Which compiler did you use to test the code I did give ?

> What compiler did you use for it?

MSVC2019 under Win10 and gcc under WSL/Win10.

Chris M. Thomasson

unread,
Jan 20, 2021, 4:25:30 PM1/20/21
to
I have been meaning to install MSVC 2019 for a while now. Never got
around to it. So, I am assuming your code works under it right? Where
the output is _not_:

doesn't fit

right?

Also, can you pretty please try to quote properly? It's hard to follow
the chain in the thread.

Bonita Montero

unread,
Jan 21, 2021, 12:15:04 AM1/21/21
to
> Also, can you pretty please try to quote properly? Its hard to follow
> the chain in the thread.

For me the result is "fits".

Chris M. Thomasson

unread,
Jan 21, 2021, 12:15:25 AM1/21/21
to
Good. in MSVC 2019 right?

Chris M. Thomasson

unread,
Jan 21, 2021, 12:23:57 AM1/21/21
to
On 1/18/2021 11:20 AM, Bonita Montero wrote:
I am installing MSVC 2019 right now.

Bonita Montero

unread,
Jan 21, 2021, 12:28:55 AM1/21/21
to
>>> Also, can you pretty please try to quote properly? Its hard to follow
>>> the chain in the thread.

>> For me the result is "fits".

> Good. in MSVC 2019 right?

I already said that I'm using MSVC 2019.

Chris M. Thomasson

unread,
Jan 21, 2021, 12:32:13 AM1/21/21
to
You did. Shit. Missed that. Sorry.

Chris M. Thomasson

unread,
Jan 21, 2021, 5:23:25 PM1/21/21
to
On 1/18/2021 11:20 AM, Bonita Montero wrote:
I installed MSVC 2019, and your code still does not work. I am getting:

doesn't fit

What am I doing wrong here?

Chris M. Thomasson

unread,
Jan 21, 2021, 5:25:45 PM1/21/21
to
DOH! I was not compiling in C++17 mode. So, now it works. Cool.

Thanks.