
Semaphore thread flip: 20,000 clock cycles


Bonita Montero

Jun 17, 2023, 12:22:51 PM
I wanted to test how much time it takes for a thread to signal
a semaphore to another thread and to wait to be signalled back.
That's essential for mutexes when they're contended. I tested
this under Windows 11 on a Ryzen 9 7950X system.
I tested different combinations of logical cores: the first
thread is always pinned to the first core while the other
thread varies. I print the x2APIC ID along with the result.
The fastest result I get is about 20,000 clock cycles for one
switch to the other thread. I think that's enormous.
A similar benchmark written for Linux using POSIX semaphores
gives about 8,000 clock cycles per switch on a 3990X system.
That's a huge difference, since that CPU is a Zen 2 part with
a much lower clock rate than the 7950X Zen 4 system.

#include <Windows.h>
#include <iostream>
#include <thread>
#include <system_error>
#include <chrono>
#include <latch>
#include <atomic>
#include <cstdint>
#include <charconv>
#include <intrin.h>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    // print an error and bail out if a precondition fails
    static auto errTerm = []( bool succ, char const *what )
    {
        if( succ )
            return;
        cerr << what << endl;
        terminate();
    };
    int regs[4];
    __cpuid( regs, 0 );
    errTerm( (unsigned)regs[0] >= 0xB, "max CPUID below 0xB" );
    bool fPrio = SetPriorityClass( GetCurrentProcess(), REALTIME_PRIORITY_CLASS )
        || SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS );
    errTerm( fPrio, "can't set process priority class" );
    unsigned nCPUs = jthread::hardware_concurrency();
    for( unsigned cpuB = 1; cpuB != nCPUs; ++cpuB )
    {
        auto init = []( HANDLE &hSem, bool set )
        {
            hSem = CreateSemaphoreA( nullptr, set, 1, nullptr );
            errTerm( hSem, "can't create semaphore" );
        };
        HANDLE hSemA, hSemB;
        init( hSemA, false );   // thread A waits first
        init( hSemB, true );    // thread B may run immediately
        atomic_int64_t tSum( 0 );
        latch latRun( 3 );      // both threads plus main
        auto flipThread = [&]( HANDLE hSemMe, HANDLE hSemYou, size_t n,
            uint32_t *pX2ApicId )
        {
            latRun.arrive_and_wait();
            auto start = high_resolution_clock::now();
            for( ; n--; )
            {
                errTerm( WaitForSingleObject( hSemMe, INFINITE ) == WAIT_OBJECT_0,
                    "can't wait for semaphore" );
                errTerm( ReleaseSemaphore( hSemYou, 1, nullptr ),
                    "can't post semaphore" );
            }
            tSum.fetch_add( duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count(),
                memory_order::relaxed );
            if( !pX2ApicId )
                return;
            int regs[4];
            __cpuidex( regs, 0xB, 0 );
            *pX2ApicId = regs[3];   // EDX of leaf 0xB: the x2APIC ID
        };
        constexpr size_t ROUNDS = 10'000;
        uint32_t x2ApicId;
        jthread
            thrA( flipThread, hSemA, hSemB, ROUNDS, nullptr ),
            thrB( flipThread, hSemB, hSemA, ROUNDS, &x2ApicId );
        errTerm( SetThreadAffinityMask( thrA.native_handle(), 1 ),
            "can't set CPU affinity" );
        errTerm( SetThreadAffinityMask( thrB.native_handle(),
            (DWORD_PTR)1 << cpuB ), "can't set CPU affinity" );
        latRun.arrive_and_wait();
        thrA.join();
        thrB.join();
        // nanoseconds per single switch, averaged over both threads
        cout << x2ApicId << ": " << (double)tSum.load( memory_order::relaxed )
            / (2.0 * ROUNDS) << endl;
        CloseHandle( hSemA );
        CloseHandle( hSemB );
    }
}
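
The Linux version isn't shown here; a rough sketch of what such a
POSIX-semaphore flip test could look like (not the exact code behind
the 8,000-cycle figure, and without the per-core pinning):

// Rough sketch of a POSIX-semaphore flip test for Linux; not the
// exact code that produced the 8,000-cycle figure above.
#include <semaphore.h>
#include <thread>
#include <chrono>
#include <iostream>

using namespace std;
using namespace chrono;

int main()
{
    constexpr size_t ROUNDS = 10'000;
    static sem_t semA, semB;
    sem_init( &semA, 0, 0 );            // the second thread waits first
    sem_init( &semB, 0, 1 );            // main may run immediately
    auto flip = []( sem_t *me, sem_t *you )
    {
        for( size_t n = ROUNDS; n--; )
        {
            sem_wait( me );             // wait to be signalled ...
            sem_post( you );            // ... and signal the peer back
        }
    };
    auto start = high_resolution_clock::now();
    thread thr( flip, &semA, &semB );
    flip( &semB, &semA );
    thr.join();
    double ns = (double)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / (2.0 * ROUNDS);
    cout << ns << " ns per switch" << endl;
}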

Bo Persson

Jun 17, 2023, 12:38:17 PM
On 2023-06-17 at 18:22, Bonita Montero wrote:
> I wanted to test how much time it takes for a thread to signal
> a semaphore to another thread and to wait to be signalled back.
> That's essential for mutexes when they're contended. I tested
> this under Windows 11 on a Ryzen 9 7950X system.
> I tested different combinations of logical cores: the first
> thread is always pinned to the first core while the other
> thread varies. I print the x2APIC ID along with the result.
> The fastest result I get is about 20,000 clock cycles for one
> switch to the other thread. I think that's enormous.
> A similar benchmark written for Linux using POSIX semaphores
> gives about 8,000 clock cycles per switch on a 3990X system.
> That's a huge difference, since that CPU is a Zen 2 part with
> a much lower clock rate than the 7950X Zen 4 system.


I have Windows 10 with a Core i9 9900K, and get results between 7,500
and 8,000.

So is it the Windows version or the CPU model that is most important?

Bonita Montero

Jun 17, 2023, 12:58:32 PM
On 17.06.2023 at 18:37, Bo Persson wrote:

> I have Windows 10 with a Core i9 9900K, and get results between 7500 and
> 8000.

That's the number of nanoseconds for each switch. You have to
check the current clock rate to convert that into cycles.

Bo Persson

Jun 17, 2023, 1:38:09 PM
OK, so if it's running at 5 GHz, you multiply by 5?

Then it goes from good to really bad! :-)
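At 5 GHz one nanosecond is five clock cycles, so 7,500 to 8,000 ns
per switch works out to roughly 37,500 to 40,000 cycles.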

Bonita Montero

Jun 17, 2023, 1:51:20 PM
I used CoreTemp to watch how the core clocks
develop while running that code.

Pavel

Jun 17, 2023, 2:28:26 PM
Bonita Montero wrote:
> I wanted to test how much time it takes for a thread to signal
> a semaphore to another thread and to wait to be signalled back.
> That's essential for mutexes when they're contended. I tested
> this under Windows 11 on a Ryzen 9 7950X system.
> I tested different combinations of logical cores: the first
> thread is always pinned to the first core while the other
> thread varies. I print the x2APIC ID along with the result.
> The fastest result I get is about 20,000 clock cycles for one
> switch to the other thread. I think that's enormous.
> A similar benchmark written for Linux using POSIX semaphores
> gives about 8,000 clock cycles per switch on a 3990X system.
> That's a huge difference, since that CPU is a Zen 2 part with
> a much lower clock rate than the 7950X Zen 4 system.

Comparing the performance of POSIX mutexes and Windows semaphores for
thread signalling is apples-to-oranges. A Windows critical section is
the closest analogue to the POSIX inter-thread recursive non-robust
mutex. I don't think there is a close analogue to a non-recursive
non-robust mutex -- which is potentially the fastest.

Bonita Montero

Jun 18, 2023, 12:49:01 AM
On 17.06.2023 at 20:28, Pavel wrote:

> Comparing the performance of POSIX mutices and Windows semaphores for
> thread signalling is apple-to-orange. A Windows critical section is the
> closest analogue to the POSIX inter-thread recursive non-robust mutex.

I'm comparing semaphores vs. semaphores, not mutexes vs. mutexes.
That's relevant when a mutex is contended and you don't spin.

Chris M. Thomasson

Jun 18, 2023, 3:46:05 PM
Have you heard of an adaptive mutex, which spins on the contended
state of a mutex a finite number of times before it resorts to
blocking? There are "fast" semaphores as well. Take a look at the
benaphore:
https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html#Engineering1-26

The benaphore helps things out by skipping calls into the kernel.
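
At its core it's just an atomic count gating a semaphore. A minimal
sketch, with C++20 std::binary_semaphore standing in for Haiku's
sem_id:

// Benaphore sketch: an atomic counter gates a semaphore so that the
// uncontended path never enters the kernel. Illustration only.
#include <atomic>
#include <semaphore>

class benaphore
{
    std::atomic<int> count{ 0 };     // number of threads wanting the lock
    std::binary_semaphore sem{ 0 };  // only touched under contention
public:
    void lock()
    {
        // the first locker sees 0 and proceeds without a kernel call;
        // everyone else blocks on the semaphore
        if( count.fetch_add( 1, std::memory_order_acquire ) > 0 )
            sem.acquire();
    }
    void unlock()
    {
        // if someone queued behind us, wake exactly one waiter
        if( count.fetch_sub( 1, std::memory_order_release ) > 1 )
            sem.release();
    }
};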


Bonita Montero

Jun 18, 2023, 10:55:55 PM
On 18.06.2023 at 21:45, Chris M. Thomasson wrote:

> Have you heard of an adaptive mutex that spins on a contended state of a
> mutex for a finite number of times before it resorts to blocking? There
> are "fast" semaphores as well. Take a look at the benaphore:

I've written one of those myself, using the spin count algorithm from
glibc (max = before * 2 + 10). Spinning only makes sense if the mutex
is held for a very short time and there's a high degree of contention,
which isn't very often the case.
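
Roughly like this (a sketch only; the real slow path blocks on a
kernel object instead of yielding, and glibc additionally caps the
spin count):

// Sketch of an adaptive mutex in the spirit of glibc's adaptive
// pthread mutex: spin up to (average * 2 + 10) times, then adapt the
// running average; yield() stands in for the futex-wait slow path.
#include <atomic>
#include <thread>

class adaptive_mutex
{
    std::atomic<bool> locked{ false };
    std::atomic<int>  spins{ 0 };   // running average of useful spins
    bool try_lock()
    {
        return !locked.exchange( true, std::memory_order_acquire );
    }
public:
    void lock()
    {
        if( try_lock() )            // uncontended fast path
            return;
        int maxSpins = spins.load( std::memory_order_relaxed ) * 2 + 10;
        for( int cnt = 1; cnt <= maxSpins; ++cnt )
            if( try_lock() )
            {
                // move the average 1/8th toward what we actually needed
                spins.fetch_add(
                    (cnt - spins.load( std::memory_order_relaxed )) / 8,
                    std::memory_order_relaxed );
                return;
            }
        while( !try_lock() )        // spinning failed: "block"
            std::this_thread::yield();
    }
    void unlock()
    {
        locked.store( false, std::memory_order_release );
    }
};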

Chris M. Thomasson

Jun 19, 2023, 8:37:33 PM
I have "seen things":

https://youtu.be/QefqJ7YhbWQ?t=117

;^D Seriously... Nasty deadlocks, and horrible things! ;^o

Mainly back when I used to work on this kind of stuff quite a bit,
well over a decade ago. I have seen code where some simple logic
changes ended up making some mutexes go quite "hot", contended, and
become actual bottlenecks... Luckily, a decent number of those had
critical sections that could be made lock-free.

One can also spin in the semaphore logic: _if_ the semaphore is going
to have to block in the kernel, we can detect that and spin a couple
of times on a CAS instead of issuing XADDs for the counting logic.
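
Something along these lines (a sketch; std::counting_semaphore stands
in for the kernel semaphore):

// "Fast" semaphore sketch: an atomic count in front of a kernel
// semaphore. wait() first spins briefly with CAS, only falling back
// to fetch_sub (XADD) plus a kernel block when the count stays
// non-positive.
#include <atomic>
#include <semaphore>

class fast_semaphore
{
    std::atomic<int> count;
    std::counting_semaphore<> sem{ 0 };  // kernel-side waiters park here
public:
    explicit fast_semaphore( int init ) : count( init ) {}
    void wait()
    {
        // spin with CAS while the count is positive: on success we
        // never touch the kernel object and never issue a wasted XADD
        for( int spin = 0; spin != 100; ++spin )
        {
            int c = count.load( std::memory_order_relaxed );
            if( c > 0 && count.compare_exchange_weak( c, c - 1,
                    std::memory_order_acquire, std::memory_order_relaxed ) )
                return;
        }
        // slow path: publish ourselves as a waiter and block
        if( count.fetch_sub( 1, std::memory_order_acquire ) <= 0 )
            sem.acquire();
    }
    void post()
    {
        if( count.fetch_add( 1, std::memory_order_release ) < 0 )
            sem.release();               // wake one kernel waiter
    }
};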

Benaphores are pretty damn nice! :^)