On 26.09.2021 at 07:26, Bonita Montero wrote:
> So I can adjust the spinning-loop according
> to pause_singleton::getNsPerPause().
I dropped that! I now simply spin against the TSC if the CPU has
a TSC and it is invariant (these are also invariant across
sockets!). The TSC can be read roughly every 10 nanoseconds on my
PC (TR 3990X, Zen 2, Win10, SMT off). The reading isn't accurate,
since RDTSC may overlap with the instructions before or after it,
but accuracy isn't relevant when you spin for hundreds of clock
cycles.
I also switched to a single PAUSE per spin iteration instead of a
run of PAUSEs which sum up to 30ns (which is roughly the most common
value on newer Intel CPUs). This more eager spinning may acquire
the lock earlier, although it may generate more interconnect traffic.
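To make the idea concrete, here is a minimal sketch of such a
TSC-bounded spin with one PAUSE per iteration. It is not my actual
lock code: hasInvariantTsc() and spinTsc() are hypothetical names
made up for this post, and the invariant-TSC check assumes the usual
CPUID leaf 0x80000007, EDX bit 8.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
    #include <intrin.h>
#else
    #include <x86intrin.h> // __rdtsc, _mm_pause on GCC / Clang
    #include <cpuid.h>     // __get_cpuid
#endif

// hypothetical helper: true if CPUID reports an invariant TSC
// (leaf 0x80000007, EDX bit 8)
inline bool hasInvariantTsc()
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuid( regs, (int)0x80000000 );
    if( (unsigned)regs[0] < 0x80000007u )
        return false;
    __cpuid( regs, (int)0x80000007 );
    return ((unsigned)regs[3] >> 8) & 1;
#else
    unsigned eax, ebx, ecx, edx;
    if( !__get_cpuid( 0x80000007u, &eax, &ebx, &ecx, &edx ) )
        return false;
    return (edx >> 8) & 1;
#endif
}

// hypothetical helper: polls pred with a single PAUSE per iteration
// until pred() is true or roughly maxTicks TSC ticks have elapsed;
// returns whether pred() became true
template<typename Pred>
bool spinTsc( Pred pred, uint64_t maxTicks )
{
    uint64_t start = __rdtsc();
    do
    {
        if( pred() )
            return true;
        _mm_pause();
    } while( __rdtsc() - start < maxTicks );
    return false;
}
```

A lock would call something like spinTsc( tryAcquire, ticks ) and
fall back to sleeping in the kernel when it returns false. How many
ticks to spin is a tuning question this sketch leaves open.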
But since I'm using RDTSC, I'm asking myself how fast RDTSC is on
different CPUs. So I modified my test program to time a loop that
executes 10 RDTSCs per iteration. Here it is:
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <chrono>
#include <limits>
#include <functional>
#if defined(_MSC_VER)
    #include <intrin.h>
#endif

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    using bench_fn = function<void(size_t)>;
    // runs fn nTests times, returns the shortest run in ns per iteration
    auto bench = []( bench_fn const &fn, size_t nTests, size_t nIterations ) -> double
    {
        int64_t nsShortest = numeric_limits<int64_t>::max();
        for( size_t p = nTests; p; --p )
        {
            auto start = high_resolution_clock::now();
            fn( nIterations );
            int64_t ns = (int64_t)duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count();
            nsShortest = ns < nsShortest ? ns : nsShortest;
        }
        return (double)nsShortest / (ptrdiff_t)nIterations;
    };
    auto rdtscLoop = []( size_t nIterations )
    {
        uint64_t TSCs[10] = {}; // zero-initialized, the loop accumulates with +=
        for( ; nIterations; --nIterations )
            // unfortunately there's no #directive for REP'ing
#if defined(_MSC_VER)
            TSCs[0] += __rdtsc(),
            TSCs[1] += __rdtsc(),
            TSCs[2] += __rdtsc(),
            TSCs[3] += __rdtsc(),
            TSCs[4] += __rdtsc(),
            TSCs[5] += __rdtsc(),
            TSCs[6] += __rdtsc(),
            TSCs[7] += __rdtsc(),
            TSCs[8] += __rdtsc(),
            TSCs[9] += __rdtsc();
#elif defined(__GNUC__)
            TSCs[0] += __builtin_ia32_rdtsc(),
            TSCs[1] += __builtin_ia32_rdtsc(),
            TSCs[2] += __builtin_ia32_rdtsc(),
            TSCs[3] += __builtin_ia32_rdtsc(),
            TSCs[4] += __builtin_ia32_rdtsc(),
            TSCs[5] += __builtin_ia32_rdtsc(),
            TSCs[6] += __builtin_ia32_rdtsc(),
            TSCs[7] += __builtin_ia32_rdtsc(),
            TSCs[8] += __builtin_ia32_rdtsc(),
            TSCs[9] += __builtin_ia32_rdtsc();
#else
    #error "unsupported compiler"
#endif
        uint64_t sum = 0; // prevent optimization
        for( uint64_t TSC : TSCs )
            sum += TSC;
        uint64_t volatile vsum = sum;
        (void)vsum; // silence the unused-variable warning
    };
    static size_t const
        N_TESTS = 100,      // number of tests to get the shortest timing
        N_ITERATIONS = 500, // iterations of the test-loop
        N_REPEATS = 10;     // REPetitions inside the test-loop
    double nsPerREP = bench( bench_fn( bind( rdtscLoop, placeholders::_1 ) ),
        N_TESTS, N_ITERATIONS ) / N_REPEATS;
    cout << "ns per RDTSC: " << nsPerREP << endl;
}
It would be nice if you could compile this on your machine and
post the RDTSC timing it prints here. That would give me a hint
whether what I'm trying is feasible.
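Regarding the comment in the code about the missing directive for
REP'ing: a small macro can stand in for it. The following is only a
sketch. REP10 and rdtscSum() are hypothetical names, and unlike the
program above it accumulates into a single sum instead of ten
separate slots, which I assume doesn't matter because an add is
much cheaper than an RDTSC.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER)
    #include <intrin.h>
#else
    #include <x86intrin.h> // __rdtsc on GCC / Clang
#endif

// hypothetical macro: expands its argument ten times
#define REP10( x ) x x x x x x x x x x

// executes 10 * nIterations RDTSCs and returns the sum of the
// values read, so the compiler can't drop the reads
inline uint64_t rdtscSum( size_t nIterations )
{
    uint64_t sum = 0;
    for( ; nIterations; --nIterations )
    {
        REP10( sum += __rdtsc(); )
    }
    return sum;
}
```

The result still has to be stored somewhere observable (e.g. in a
volatile) by the caller, as the program above does with vsum.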