Some help needed

Bonita Montero

unread,

Sep 25, 2021, 10:41:22 AM9/25/21

to

I've developed a monitor-object like that of Java for C++ with some
improvements. The major improvement is that there's not only a spin
-loop for locking and unlocking but also for waiting on an event. In
this case you don't have to lock the mutex but supply a predicate on
a wait_poll-function and the code repeatedly tries to lock the mutex
polling and if it can lock the mutex it calls the predicate which
returns (or moves) a pair of a bool and the result-type.
Waiting to for a semaphore and or a event-object (Win32) in the kernel
can easily take from 1.000 to 10.000 clock-cylces even when the call
immediately returns because the semaphore or event has been set before.
So there has to be a spin count with a reasonable relationship to this
waiting-inteval, f.e. spinning one tenth of the minimum interval being
spent in the kernel.
With my monitor-object I've taken the spincount recalculation-algorithm
from the glibc. And I'm also using the PAUSE-instruction. But I think
that the glibc induces to heavy cacheline-flipping by re-loading the
mutex-flags immediately after a single PAUSE-instruction. So I decided
to loop PAUSE several times and to take less spinning iterations there-
fore.
To get a reasonable number of PAUSE-spinnings I need the time PAUSE
takes on different processors. On my CPU PAUSE halts the pipe only
for about for 0,78 nanoseconds, which is about 3,25 clock-cycles in
average. I've written a short progam tha repeatedly PAUSEs and takes
the aveage time. I want to encourage you to compile the code on your
machine and give me the PAUSE-timing it outputs here.

This is the code:

#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
auto start = high_resolution_clock::now();
for( uint64_t i = PAUSE_ROUNDS; i; --i )
_mm_pause();
double ns = (int64_t)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
cout << ns << endl;
}

Paavo Helde

unread,

Sep 25, 2021, 12:52:02 PM9/25/21

to

On my computer this outputs:

34.9194

(Intel Xeon E-2286M CPU @ 2.40 GHz)

Bonita Montero

unread,

Sep 25, 2021, 1:01:52 PM9/25/21

to

That seems much more reasonable to me than the < 4 clock cycles on my
PC (Ryzen Threadripper 3990X).

Bo Persson

unread,

Sep 25, 2021, 1:06:04 PM9/25/21

to

I get

30.7635

(Core i9 9900K 5GHz)

Bonita Montero

unread,

Sep 25, 2021, 1:14:34 PM9/25/21

to

Am 25.09.2021 um 19:05 schrieb Bo Persson:

>> This is the code:
>>
>> #include <iostream>
>> #include <chrono>
>> #include <cstddef>
>> #include <cstdint>
>> #include <immintrin.h>
>>
>> using namespace std;
>> using namespace chrono;
>>
>> int main( int argc, char **argv )
>> {
>>      static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
>>      auto start = high_resolution_clock::now();
>>      for( uint64_t i = PAUSE_ROUNDS; i; --i )
>>          _mm_pause();
>>      double ns = (int64_t)duration_cast<nanoseconds>(
>> high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
>>      cout << ns << endl;
>> }
>
> I get
>
> 30.7635
>
> (Core i9 9900K 5GHz)

Why did AMD decide for such an idiotic timing ? That is neither suitable
for spin-loops with only a single PAUSE-instructions, nor does it save
power while spinning.

Branimir Maksimovic

unread,

Sep 25, 2021, 1:15:19 PM9/25/21

to

On 2021-09-25, Bonita Montero <Bonita....@gmail.com> wrote:
> I've developed a monitor-object like that of Java for C++ with some
> improvements. The major improvement is that there's not only a spin

> for( uint64_t i = PAUSE_ROUNDS; i; --i )
> _mm_pause();

What's in _mm_pause ?

--

7-77-777
Evil Sinner!

Bonita Montero

unread,

Sep 25, 2021, 1:16:30 PM9/25/21

to

Read again what I wrote or google for _mm_pause.

Branimir Maksimovic

unread,

Sep 25, 2021, 1:20:55 PM9/25/21

to

> On my computer this outputs:
>
> 34.9194
>
> (Intel Xeon E-2286M CPU @ 2.40 GHz)
>

On mine does not compile becaise it is x86 specific...

--

7-77-777
Evil Sinner!

Branimir Maksimovic

unread,

Sep 25, 2021, 1:30:09 PM9/25/21

to

On 2021-09-25, Bonita Montero <Bonita....@gmail.com> wrote:
> Am 25.09.2021 um 19:15 schrieb Branimir Maksimovic:
>> On 2021-09-25, Bonita Montero <Bonita....@gmail.com> wrote:
>>> I've developed a monitor-object like that of Java for C++ with some
>>> improvements. The major improvement is that there's not only a spin
>>> for( uint64_t i = PAUSE_ROUNDS; i; --i )
>>> _mm_pause();
>> What's in _mm_pause ?
>
> Read again what I wrote or google for _mm_pause.

On mine system it is:

#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
auto start = high_resolution_clock::now();

for( uint64_t i = PAUSE_ROUNDS; i; --i )

__asm__ __volatile__("isb\n");
// _mm_pause();

double ns = (int64_t)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
cout << ns << endl;
}

--

7-77-777
Evil Sinner!

Branimir Maksimovic

unread,

Sep 25, 2021, 1:33:48 PM9/25/21

to

On 2021-09-25, Bo Persson <b...@bo-persson.se> wrote:
>
> I get
>
> 30.7635
>
> (Core i9 9900K 5GHz

bmaxa@Branimirs-Air News % ./a.out
9.14544

M1 processor. (with code modification to compile)

--

7-77-777
Evil Sinner!

Bonita Montero

unread,

Sep 26, 2021, 1:26:45 AM9/26/21

to

So as the timings of the PAUSE-instructions are so different I
decided to write a singleton containing the fastest timing of
the PAUSE-instruction on the machine. This is the code:

cpu_pause.h:

#pragma once
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#else
#error "need platform-specific pause-instruction"
#endif

inline
void cpu_pause()
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86)) ||
defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
_mm_pause();
#endif
}

inline
void cpu_pause_n( unsigned iterations )
{
for( ; iterations--; cpu_pause() );
}

struct pause_singleton
{
static
double getNsPerPause();
private:
static
struct singleton_t
{
singleton_t();
double m_nsPerPause;
} singleton;
};

inline
double pause_singleton::getNsPerPause()
{
return singleton.m_nsPerPause;
}

cpu_pasue.h:

#include <chrono>
#include <limits>
#include "cpu_pause.h"

using namespace std;
using namespace chrono;

#if defined(_MSC_VER)
#pragma warning(disable: 26495) // member not initialized
#pragma warning(disable: 26498) // consider constexpr
#endif

pause_singleton::singleton_t::singleton_t()
{
int64_t leastTicks = numeric_limits<int64_t>::max();;
for( size_t i = 1000; i; --i )
{

auto start = high_resolution_clock::now();

for( size_t j = 1'000; j; --j )
cpu_pause();
int64_t ticks = (int64_t)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count();
if( ticks < leastTicks )
leastTicks = ticks;
}
m_nsPerPause = leastTicks / 1'000.0;
}

pause_singleton::singleton_t pause_singleton::singleton;

So I can adjust the spinning-loop according
to pause_singleton::getNsPerPause().

Branimir Maksimovic

unread,

Sep 26, 2021, 2:56:44 AM9/26/21

to

On 2021-09-26, Bonita Montero <Bonita....@gmail.com> wrote:
> So as the timings of the PAUSE-instructions are so different I
> decided to write a singleton containing the fastest timing of
> the PAUSE-instruction on the machine. This is the code:
>
> cpu_pause.h:

> for( size_t j = 1'000; j; --j )
> cpu_pause();

Please add for aarch64 i gave to you.

--

7-77-777
Evil Sinner!

Bonita Montero

unread,

Sep 27, 2021, 11:36:12 AM9/27/21

to

Am 26.09.2021 um 07:26 schrieb Bonita Montero:

> So I can adjust the spinning-loop according
> to pause_singleton::getNsPerPause().

I dropped it ! I simply made a spinning-loop according to the TSC
if the CPU has a TSC and it is invariant (these are also invariant
across sockets !). Reading the TSC can be done at roughly every 10
nanoseconds my PC (TR3990X, Zen3, Win10, SMT off). It's not accu-
rate since it might overlap with instruction before or afterwards,
but accuracy isn't relevant when you spin hundreds of clock-cycles.
And I changed a single pause per spin loop instead of a row of
PAUSEs which sum up to 30ns (which is roughly the most common
value on newer Intel -CPUs). This more eager spinnging may gain
locking earlier, although it may generate more interconnect-traffic.

But as I'm using RDTSC: I'm asking myself how fast RDTSC is on
different CPUs. So I modified my test-program to measure different
routines to test a loop of 10 RDTSCs per loop. Here it is:

#include <iostream>
#include <chrono>
#include <limits>
#include <functional>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{

using bench_fn = function<void(size_t)>;
auto bench = []( bench_fn const &fn, size_t nTests, size_t nIterations
) -> double
{
int64_t nsShortest = numeric_limits<int64_t>::max();
for( size_t p = nTests; p; --p )

{
auto start = high_resolution_clock::now();

fn( nIterations );
int64_t ns = (int64_t)duration_cast<nanoseconds>(

high_resolution_clock::now() - start ).count();

nsShortest = ns < nsShortest ? ns : nsShortest;
}
return (double)nsShortest / (ptrdiff_t)nIterations;
};
auto rdtscLoop = []( size_t nIterations )
{
uint64_t TSCs[10];
for( ; nIterations; --nIterations )
// unfortunately there's no #directive vor REP'ing
#if defined(_MSC_VER)
TSCs[0] += __rdtsc(),
TSCs[1] += __rdtsc(),
TSCs[2] += __rdtsc(),
TSCs[3] += __rdtsc(),
TSCs[4] += __rdtsc(),
TSCs[5] += __rdtsc(),
TSCs[6] += __rdtsc(),
TSCs[7] += __rdtsc(),
TSCs[8] += __rdtsc(),
TSCs[9] += __rdtsc();
#elif defined(__GNUC__)
TSCs[0] += __builtin_ia32_rdtsc(),
TSCs[1] += __builtin_ia32_rdtsc(),
TSCs[2] += __builtin_ia32_rdtsc(),
TSCs[3] += __builtin_ia32_rdtsc(),
TSCs[4] += __builtin_ia32_rdtsc(),
TSCs[5] += __builtin_ia32_rdtsc(),
TSCs[6] += __builtin_ia32_rdtsc(),
TSCs[7] += __builtin_ia32_rdtsc(),
TSCs[8] += __builtin_ia32_rdtsc(),
TSCs[9] += __builtin_ia32_rdtsc();
#endif
uint64_t sum = 0; // prevent optimization
for( uint64_t TSC : TSCs )
sum += TSC;
uint64_t volatile vsum = sum;
};
static size_t const
N_TESTS = 100, // number of tests to get the shortest timing
N_ITERATIONS = 500, // iterations of the test-loop
N_REPEATS = 10; // REPetitions inside the test-loop
double nsPerREP = bench( bench_fn( bind( rdtscLoop, placeholders::_1 )
), N_TESTS, N_ITERATIONS ) / N_REPEATS;
cout << "ns per RDTSC: " << nsPerREP << endl;
}

It would be nice if you could compile this on your machine and
give me the number of the RDTSC-timing here. This would give me
a hint if what I try is feasible.

Bo Persson

unread,

Sep 27, 2021, 1:29:25 PM9/27/21

to

I get

ns per RDTSC: 5.42

on my Core i9-9900K 5GHz