
new benchmark for my read/write algorithm...


Chris M. Thomasson

Feb 20, 2019, 1:49:23 AM
Fwiw, I wrote a crude new benchmark that measures how many reads and
writes can be performed in a given amount of time: my algorithm vs.
std::shared_mutex. So, we are _primarily_ looking at how many reads can
be performed in this 60-second test. The number of threads is variable
and is determined by std::thread::hardware_concurrency() * THREADS, with
THREADS set to 8 in the test. So on my system the setup is:
___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________

This is going to be different for each system. Readers iterate a shared
linked list with 1000000 nodes in this test. Writers pop and push items,
and never use new or delete. So far, the timings I am getting on my end
using MSVC 2017 are:


Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 18046464
Raw Writes: 347498
reads_per_tick = 300693
writes_per_tick = 5790
Ticks = 60.0161
___________________________________



Testing: std::shared_mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 650006
Raw Writes: 168741
reads_per_tick = 10829
writes_per_tick = 2811
Ticks = 60.0193
___________________________________



As you can see, my algorithm is performing many more reads than
std::shared_mutex. Anyway, here is my code:

When you get some time, can you please give it a go? I want to see
timings on other systems, the good, bad and ugly. ;^) Thanks.

https://pastebin.com/raw/1QtPCGhV
___________________________________
/* Crude Read/Write Mutex Benchmark

This tests my algorithm ct_rwmutex vs. std::shared_mutex.

It shows how many reads and writes can be performed within
a fixed amount of time.

by: Chris M. Thomasson
__________________________________________*/



#include <thread>
#include <atomic>
#include <shared_mutex>
#include <condition_variable>
#include <iostream>
#include <functional>
#include <chrono>
#include <cassert>
#include <cstdlib>
#include <ctime>
#include <climits>
#include <cstdint>
#include <vector>


// undefine to test std::shared_mutex
#define CT_TEST_FAST_MUTEX 1


#define THREADS 8 // multiplied by std::hardware_concurrency
#define NODE_PRIME 1000000 // number of nodes
#define CRUDE_CACHE_PAD 256 // crude cache pad
#define TEST_DURATION_SEC 60 // number of seconds for the test


// bare bones mutex/condvar based semaphore
struct ct_slow_semaphore
{
    unsigned long m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_slow_semaphore(unsigned long state) : m_state(state) {}

    void inc()
    {
        m_mutex.lock();
        ++m_state;
        m_mutex.unlock();
        m_cond.notify_one();
    }

    void add(unsigned long addend)
    {
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_state += addend;
        }

        m_cond.notify_all();
    }

    void dec()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == 0) m_cond.wait(lock);
        --m_state;
    }
};




// bin-sema
struct ct_auto_reset_event
{
    bool m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_auto_reset_event() : m_state(false) {}

    void signal()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_state = true;
        m_cond.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == false) m_cond.wait(lock);
        m_state = false; // auto-reset
    }
};




// just a layer over an auto-reset event
struct ct_fast_mutex
{
    std::atomic<unsigned int> m_state;
    ct_auto_reset_event m_waitset;

    ct_fast_mutex() : m_state(0) {}

    void lock()
    {
        if (m_state.exchange(1, std::memory_order_acquire))
        {
            while (m_state.exchange(2, std::memory_order_acquire))
            {
                m_waitset.wait();
            }
        }
    }

    void unlock()
    {
        if (m_state.exchange(0, std::memory_order_release) == 2)
        {
            m_waitset.signal();
        }
    }
};



// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...
// Take a look at the lock_shared and unlock_shared functions

// The number of readers _must_ never exceed LONG_MAX!

#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
    unsigned char m_crude_cache_pad_0[CRUDE_CACHE_PAD];

    // shared state
    std::atomic<long> m_count;
    unsigned char m_crude_cache_pad_1[CRUDE_CACHE_PAD];

    std::atomic<long> m_rdwake;
    unsigned char m_crude_cache_pad_2[CRUDE_CACHE_PAD];

    ct_slow_semaphore m_rdwset;
    unsigned char m_crude_cache_pad_3[CRUDE_CACHE_PAD];

    ct_slow_semaphore m_wrwset;
    unsigned char m_crude_cache_pad_4[CRUDE_CACHE_PAD];

    ct_fast_mutex m_wrlock;
    unsigned char m_crude_cache_pad_5[CRUDE_CACHE_PAD];


    ct_rwmutex() :
        m_count(RWMUTEX_COUNT_MAX),
        m_rdwake(0),
        m_rdwset(0),
        m_wrwset(0) {
    }


    // READ, pretty slim...
    void lock_shared()
    {
        if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
        {
            // We need to wait for a writer.
            m_rdwset.dec();
        }
    }

    void unlock_shared()
    {
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
        {
            // There is a writer
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
            {
                // We need to wake the writer up
                m_wrwset.inc();
            }
        }
    }


    // WRITE, more hefty
    void lock()
    {
        // Acquire exclusive access
        m_wrlock.lock();

        // we are the only thread in here now.

        // Gain write access wrt m_count
        long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX,
            std::memory_order_acquire);

        // count can never be negative.
        if (count < RWMUTEX_COUNT_MAX)
        {
            // We detected readers.
            long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count,
                std::memory_order_acquire);

            if (rdwake + RWMUTEX_COUNT_MAX - count)
            {
                // Okay, we need to wait for all of the readers
                // The number of readers is actually
                // RWMUTEX_COUNT_MAX - count
                m_wrwset.dec();
            }
        }
    }

    // write unlock
    void unlock()
    {
        // Release write access wrt m_count
        long count = m_count.fetch_add(RWMUTEX_COUNT_MAX,
            std::memory_order_release);

        if (count < 0)
        {
            // We need to wake -count readers.
            m_rdwset.add(-count);
        }

        // Release exclusive access
        m_wrlock.unlock();
    }
};


struct ct_simple_stack
{
    struct node
    {
        node* m_next;
        unsigned int m_tid;

        node(unsigned int tid) : m_tid(tid) {}
    };

    node* m_head;

    ct_simple_stack() : m_head(nullptr) {}

    ~ct_simple_stack()
    {
        ct_simple_stack::node* n = flush();

        while (n)
        {
            ct_simple_stack::node* next = n->m_next;
            delete n;
            n = next;
        }
    }

    void push(node* n)
    {
        n->m_next = m_head;
        m_head = n;
    }

    node* pop()
    {
        node* n = m_head;

        if (n)
        {
            m_head = n->m_next;
        }

        return n;
    }


    node* flush()
    {
        node* n = m_head;
        m_head = nullptr;
        return n;
    }
};


struct ct_shared
{
    std::atomic<bool> m_run;

    // protected by m_std_rwmutex
    std::uint64_t m_reads;
    std::uint64_t m_writes;

    ct_simple_stack m_stack;
    ct_simple_stack m_stack_dtor;

#if defined (CT_TEST_FAST_MUTEX)
    ct_rwmutex m_std_rwmutex;
#else
    std::shared_mutex m_std_rwmutex;
#endif

    ct_shared() : m_run(true), m_reads(0), m_writes(0)
    {
        // prime m_stack
        for (unsigned int i = 0; i < NODE_PRIME; ++i)
        {
            m_stack.push(new ct_simple_stack::node(i));
        }
    }
};


void ct_thread_reader(ct_shared& shared, std::size_t tidx)
{
    std::uint64_t reads = 0;

    while (shared.m_run.load(std::memory_order_relaxed))
    {
        shared.m_std_rwmutex.lock_shared();

        ct_simple_stack::node* n = shared.m_stack.m_head;

        while (n)
        {
            ct_simple_stack::node* next = n->m_next;

            n = next;
        }

        std::this_thread::yield();

        shared.m_std_rwmutex.unlock_shared();

        ++reads;
    }

    shared.m_std_rwmutex.lock();
    shared.m_reads += reads;
    shared.m_std_rwmutex.unlock();
}


void ct_thread_writer(ct_shared& shared, std::size_t tidx)
{
    std::uint64_t writes = 0;

    while (shared.m_run.load(std::memory_order_relaxed))
    {
        shared.m_std_rwmutex.lock();
        ct_simple_stack::node* n = shared.m_stack.pop();
        shared.m_std_rwmutex.unlock();

        std::this_thread::yield();

        shared.m_std_rwmutex.lock();

        if (n)
        {
            shared.m_stack.push(n);
        }

        shared.m_std_rwmutex.unlock();

        std::this_thread::yield();

        ++writes;
    }

    shared.m_std_rwmutex.lock();
    shared.m_writes += writes;
    shared.m_std_rwmutex.unlock();
}


int main()
{
    ct_shared shared;

    std::cout << "Testing: ";

#if defined (CT_TEST_FAST_MUTEX)
    std::cout << "Chris M. Thomasson's Experimental Read/Write Mutex\n\n";
#else
    std::cout << "std::shared_mutex\n\n";
#endif

    std::cout.flush();

    {
        // Setup our threads
        std::size_t cpu_threads_n = std::thread::hardware_concurrency();
        std::size_t threads_n = cpu_threads_n * THREADS;
        std::vector<std::thread> threads(threads_n);

        std::size_t writers = threads_n / 2;
        std::size_t readers = threads_n - writers;

        std::cout << "___________________________________\n";
        std::cout << "cpu_threads_n = " << cpu_threads_n << "\n";
        std::cout << "threads_n = " << threads_n << "\n";
        std::cout << "writers = " << writers << "\n";
        std::cout << "readers = " << readers << "\n";
        std::cout << "test duration = " << TEST_DURATION_SEC << " seconds\n";
        std::cout << "___________________________________\n\n\n";

        auto time_start = std::chrono::high_resolution_clock::now();

        // Create threads
        for (std::size_t i = 0; i < threads_n; ++i)
        {
            if (i < writers)
            {
                threads[i] = std::thread(ct_thread_writer,
                    std::ref(shared), i);
            }
            else
            {
                threads[i] = std::thread(ct_thread_reader,
                    std::ref(shared), i);
            }
        }

        // Give the test some time to run...
        std::chrono::seconds time_duration(TEST_DURATION_SEC);
        std::this_thread::sleep_for(time_duration);
        shared.m_run.store(false, std::memory_order_relaxed);

        // Join threads
        for (std::size_t i = 0; i < threads_n; ++i)
        {
            threads[i].join();
        }

        // Grab our timing.
        auto time_stop = std::chrono::high_resolution_clock::now();

        std::chrono::duration<double> time_diff = time_stop - time_start;

        std::uint64_t raw_reads = shared.m_reads;
        std::uint64_t raw_writes = shared.m_writes;

        std::cout << "___________________________________\n";
        std::cout << "Raw Reads: " << raw_reads << "\n";
        std::cout << "Raw Writes: " << raw_writes << "\n";

        std::uint64_t reads_per_tick = raw_reads / time_diff.count();
        std::uint64_t writes_per_tick = raw_writes / time_diff.count();

        std::cout << "reads_per_tick = " << reads_per_tick << "\n";
        std::cout << "writes_per_tick = " << writes_per_tick << "\n";

        std::cout << "Ticks = " << time_diff.count() << "\n";
        std::cout << "___________________________________\n\n";
    }

    std::cout << "\n\nFin!\n";

    return 0;
}
___________________________________


:^)

Melzzzzz

Feb 20, 2019, 2:16:12 AM
___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 174300
Raw Writes: 16346
reads_per_tick = 2902
writes_per_tick = 272
Ticks = 60.043
___________________________________


~/.../bmaxa_data/examples >>> time ./bench1
Testing: std::shared_mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 120862
Raw Writes: 25142
reads_per_tick = 2009
writes_per_tick = 417
Ticks = 60.155

--
press any key to continue or any other to quit...

Chris M. Thomasson

Feb 20, 2019, 2:47:10 AM
Perfect. Thanks a lot.
Interesting! My algorithm is getting 893 more reads per second when
compared to std::shared_mutex. However, it lost wrt writes. The test is
primarily about how many reads are accomplished, humm... So it did
fairly "decent". Not great, but it still gives you more reads per second.
This is a _lot_ different from my last benchmark, where my algorithm
seemed to just crush std::shared_mutex. Humm...

Ian Collins

Feb 20, 2019, 2:59:32 AM
On 20/02/2019 19:49, Chris M. Thomasson wrote:
> Fwiw, I wrote a crude new benchmark that measures how many reads and
> writes can be performed in a given amount of time. My algorithm vs
> std::shared_mutex. So, we are _primarily_ looking for how many reads can
> be performed in this test at 60 seconds. The number of threads is
> variable and is determined by std::hardware_concurrency * THREADS, set
> at 8 in the test. So on my system the setup is:

<snip>

> #define THREADS 8 // multiplied by std::hardware_concurrency
> #define NODE_PRIME 1000000 // number of nodes
> #define CRUDE_CACHE_PAD 256 // crude cache pad
> #define TEST_DURATION_SEC 60 // number of seconds for the test

General whinge: these ugly macros are bad form in C++!

Some results, Ubuntu 18.04, Ryzen 2700X:

$ g++ -Ofast x.cc -lpthread -std=c++17
$ ./a.out
Testing: std::shared_mutex

___________________________________
cpu_threads_n = 16
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 220762
Raw Writes: 526
reads_per_tick = 3679
writes_per_tick = 8
Ticks = 60.0055
___________________________________



Fin!

$ g++ -Ofast x.cc -lpthread -std=c++17
$ ./a.out
Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 16
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 214921
Raw Writes: 3718
reads_per_tick = 3581
writes_per_tick = 61
Ticks = 60.0135
___________________________________



Fin!

I reduced THREADS to 4, threads_n = 128 sucked the life out of my machine...

--
Ian.

Chris M. Thomasson

Feb 20, 2019, 3:11:27 AM
On 2/19/2019 11:59 PM, Ian Collins wrote:
> On 20/02/2019 19:49, Chris M. Thomasson wrote:
>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>> writes can be performed in a given amount of time. My algorithm vs
>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>> be performed in this test at 60 seconds. The number of threads is
>> variable and is determined by std::hardware_concurrency * THREADS, set
>> at 8 in the test. So on my system the setup is:
>
> <snip>
>
>> #define THREADS 8 // multiplied by std::hardware_concurrency
>> #define NODE_PRIME 1000000 // number of nodes
>> #define CRUDE_CACHE_PAD 256 // crude cache pad
>> #define TEST_DURATION_SEC 60 // number of seconds for the test
>
> General whinge:  these ugly macros are bad form in C++!

Yeah... Habit from C.


Wrt, THREADS being 8, well, yeah. That can get gnarly. Sorry about that.


>
> Some results, Ubuntu 18.04, Ryzen 2700X:
>
> $ g++ -Ofast x.cc -lpthread -std=c++17
> $ ./a.out
> Testing: std::shared_mutex
[...]
> ___________________________________
> Raw Reads: 220762
> Raw Writes: 526
> reads_per_tick = 3679
> writes_per_tick = 8
> Ticks = 60.0055
> ___________________________________
[...]
> $ g++ -Ofast x.cc -lpthread -std=c++17
> $ ./a.out
> Testing: Chris M. Thomasson's Experimental Read/Write Mutex
[...]
> ___________________________________
> Raw Reads: 214921
> Raw Writes: 3718
> reads_per_tick = 3581
> writes_per_tick = 61
> Ticks = 60.0135
> ___________________________________

Ahh, mine loses here by 98 reads per second. Yet it wins wrt writes by
53 writes per second. Humm... This is definitely different from my other
benchmark. Need to think about this...

Thanks for giving it a go Ian. :^)

Paavo Helde

Feb 20, 2019, 3:20:41 AM
Ran the thing on my laptop, results below. What can I say? It looks like
std::shared_mutex on my laptop is from some other universe where things
run 30x faster. I cannot explain this. Do you have any asserts in your
test ensuring that it works correctly?

The debugger shows the code is calling e.g.

> ntdll.dll!RtlAcquireSRWLockShared ()

ntdll.dll is shown in Modules pane as:

ntdll.dll ntdll.dll C:\Windows\System32\ntdll.dll N/A No Symbols
loaded.
C:\Users\heldepn\AppData\Local\Temp\SymbolCache\ntdll.pdb\d1e4e60e3ccf4933875c04d1a21456d12\ntdll.pdb
2 6.1.7601.23677 (win7sp1_ldr.170209-0600) 9.02.2017 18:33
0000000077A20000-0000000077BCA000 [8236] ConsoleTestVS2017.exe

Timing results:

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 15713431
Raw Writes: 136097
reads_per_tick = 261887
writes_per_tick = 2268
Ticks = 60.0007
___________________________________



Fin!

Testing: std::shared_mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 185521674
Raw Writes: 4739901
reads_per_tick = 3087224
writes_per_tick = 78875
Ticks = 60.0934
___________________________________



Fin!

Chris M. Thomasson

Feb 20, 2019, 3:26:13 AM
On 2/20/2019 12:11 AM, Chris M. Thomasson wrote:
> On 2/19/2019 11:59 PM, Ian Collins wrote:
>> On 20/02/2019 19:49, Chris M. Thomasson wrote:
>>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>>> writes can be performed in a given amount of time. My algorithm vs
>>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>>> be performed in this test at 60 seconds. The number of threads is
>>> variable and is determined by std::hardware_concurrency * THREADS, set
>>> at 8 in the test. So on my system the setup is:
[...]

I just ran it again under MSVC 2017 and am a little perplexed as to why
I am getting so many more reads on my little setup:
___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________



Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
Raw Reads: 18170011
Raw Writes: 347308
reads_per_tick = 302762
writes_per_tick = 5787
Ticks = 60.0141
___________________________________




Testing: std::shared_mutex

___________________________________
Raw Reads: 665021
Raw Writes: 175965
reads_per_tick = 11079
writes_per_tick = 2931
Ticks = 60.0203
___________________________________



Humm... I guess that std::shared_mutex and my ct_rwmutex do not really
like all of those cores on your setup wrt THREADS equaling 4:
___________________________________
cpu_threads_n = 16
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


So far, afaict, it goes to show how a single centralized general purpose
read/write lock can start to suffer under all of the hardcore contention
that the test generates. Humm...

Chris M. Thomasson

Feb 20, 2019, 3:48:09 AM
On 2/20/2019 12:20 AM, Paavo Helde wrote:
> On 20.02.2019 8:49, Chris M. Thomasson wrote:
>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>> writes can be performed in a given amount of time. My algorithm vs
>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>> be performed in this test at 60 seconds. The number of threads is
>> variable and is determined by std::hardware_concurrency * THREADS, set
>> at 8 in the test. So on my system the setup is:
[...]
>> When you get some time, can you please give it a go? I want to see
>> timings on other systems, the good, bad and ugly. ;^) Thanks.
>
> Ran the thing on my laptop, results below. What can I say? It looks like
> std::shared_mutex on my laptop is from some other universe where things
> run 30x faster. I cannot explain this. Do you have any asserts in your
> test ensuring that it works correctly?

So far, I don't see anything wrong. However, I am basically getting the
opposite of what you are reporting on my end. STRANGE! Big time.


> The debugger shows the code is calling e.g.
>
> >    ntdll.dll!RtlAcquireSRWLockShared ()
>
> ntdll.dll is shown in Modules pane as:
>
> ntdll.dll    ntdll.dll    C:\Windows\System32\ntdll.dll    N/A    No
> Symbols loaded.
> C:\Users\heldepn\AppData\Local\Temp\SymbolCache\ntdll.pdb\d1e4e60e3ccf4933875c04d1a21456d12\ntdll.pdb
> 2    6.1.7601.23677 (win7sp1_ldr.170209-0600)    9.02.2017 18:33
> 0000000077A20000-0000000077BCA000    [8236] ConsoleTestVS2017.exe

Need to investigate, because when CT_TEST_FAST_MUTEX is defined,
std::shared_mutex is not even used in the code at all:
__________________________
struct ct_shared
{
    std::atomic<bool> m_run;

    // protected by m_std_rwmutex
    std::uint64_t m_reads;
    std::uint64_t m_writes;

    ct_simple_stack m_stack;
    ct_simple_stack m_stack_dtor;

#if defined (CT_TEST_FAST_MUTEX)
    ct_rwmutex m_std_rwmutex;
#else
    std::shared_mutex m_std_rwmutex;
#endif

    ct_shared() : m_run(true), m_reads(0), m_writes(0)
    {
        // prime m_stack
        for (unsigned int i = 0; i < NODE_PRIME; ++i)
        {
            m_stack.push(new ct_simple_stack::node(i));
        }
    }
};
__________________________


The only thing I can think of is that the std::mutex's in
ct_slow_semaphore and ct_auto_reset_event are using SRW's on Windows.
But that should not matter at all here. I am focusing on
std::shared_mutex vs ct_rwmutex.

I am assuming that std::shared_mutex is using SRW's on my system under
MSVC 2017 Community. I need to double check. Am wondering if they are
using SRW's on MSVC 2017 professional edition. Will find out tomorrow.
Thanks.
WEIRD! I am getting the opposite on my end. Okay, I am running the tests
right now, again. I swear to GOD that my results are, just now:

****************************************************

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 17073438
Raw Writes: 337110
reads_per_tick = 284481
writes_per_tick = 5617
Ticks = 60.016
___________________________________



Fin!



Testing: std::shared_mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 660283
Raw Writes: 174796
reads_per_tick = 11000
writes_per_tick = 2912
Ticks = 60.0227
___________________________________



Fin!
****************************************************


WTF is going on MAN! HUMM... ;^o

I need to dig into std::shared_mutex and see what it is using. This
should be fun! ;^)

Chris M. Thomasson

Feb 20, 2019, 4:21:41 AM
On 2/19/2019 10:49 PM, Chris M. Thomasson wrote:
[..]
> ___________________________________
[...]
> struct ct_simple_stack
> {
>     struct node
>     {
>         node* m_next;
>         unsigned int m_tid;
>
>         node(unsigned int tid) : m_tid(tid) {}
>     };
[...]


Humm... The ct_simple_stack::node::m_next should probably be a volatile
pointer...


> void ct_thread_reader(ct_shared& shared, std::size_t tidx)
> {
>     std::uint64_t reads = 0;
>
>     while (shared.m_run.load(std::memory_order_relaxed))
>     {
>         shared.m_std_rwmutex.lock_shared();
>
>         ct_simple_stack::node* n = shared.m_stack.m_head;
>


///////////////////////////
>         while (n)
>         {
>             ct_simple_stack::node* next = n->m_next;
>
>             n = next;
>         }
///////////////////////////

You know... I am wondering if the optimizer can omit the while loop
above. Shi%.



>
>         std::this_thread::yield();
>
>         shared.m_std_rwmutex.unlock_shared();
>
>         ++reads;
>     }
>
>     shared.m_std_rwmutex.lock();
>     shared.m_reads += reads;
>     shared.m_std_rwmutex.unlock();
> }
[...]

Let me change it to a volatile pointer and run the tests again...

I edited the code at: https://pastebin.com/raw/1QtPCGhV

Just reload the page and you should see version 0.1 pop up. The new code
will always have Testing Version 0.1 in front of it. Here are my much
different timings:


Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
Raw Reads: 54195
Raw Writes: 3232
reads_per_tick = 902
writes_per_tick = 53
Ticks = 60.0432
___________________________________



Testing Version 0.1: std::shared_mutex

___________________________________
Raw Reads: 23452
Raw Writes: 1873
reads_per_tick = 390
writes_per_tick = 31
Ticks = 60.0513
___________________________________

Very different. Mine is still beating std::shared_mutex on my end. I still
have to check whether std::shared_mutex is using SRW on my end, to address
Paavo's question.

Sorry about that. But, can you try running Version 0.1?


Chris M. Thomasson

Feb 20, 2019, 4:57:30 AM
On 2/20/2019 12:20 AM, Paavo Helde wrote:
> On 20.02.2019 8:49, Chris M. Thomasson wrote:
[...]
> >    ntdll.dll!RtlAcquireSRWLockShared ()
>
> ntdll.dll is shown in Modules pane as:
>
> ntdll.dll    ntdll.dll    C:\Windows\System32\ntdll.dll    N/A    No
> Symbols loaded.
> C:\Users\heldepn\AppData\Local\Temp\SymbolCache\ntdll.pdb\d1e4e60e3ccf4933875c04d1a21456d12\ntdll.pdb
> 2    6.1.7601.23677 (win7sp1_ldr.170209-0600)    9.02.2017 18:33
> 0000000077A20000-0000000077BCA000    [8236] ConsoleTestVS2017.exe


on my end std::shared_mutex calls into SRW:

--- f:\dd\vctools\crt\crtw32\stdcpp\thr\sharedmutex.cpp
------------------------

void __cdecl _Smtx_lock_shared(_Smtx_t * smtx)
{ /* lock shared mutex non-exclusively */
00D7C8F0 push ebp
00D7C8F1 mov ebp,esp
AcquireSRWLockShared(reinterpret_cast<PSRWLOCK>(smtx));
00D7C8F3 mov eax,dword ptr [smtx]
AcquireSRWLockShared(reinterpret_cast<PSRWLOCK>(smtx));
00D7C8F6 push eax
00D7C8F7 call dword ptr [__imp__AcquireSRWLockShared@4 (0D8B00Ch)]
}

std::mutex does something weird. I don't think it's using a
CRITICAL_SECTION or SRW. Not sure. Humm...

mvor...@gmail.com

Feb 20, 2019, 6:35:32 AM
Tested on my 2018 i5 2.3GHz MacBook Pro with 3 compilers at max optimizations:



*************************************
* GCC -Ofast -march=native -lstdc++ *
*************************************

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60seconds
___________________________________


___________________________________
Raw Reads: 101998
Raw Writes: 3245
reads_per_tick = 1699
writes_per_tick = 54
Ticks = 60.0093
___________________________________



Fin!

******************************************
* Apple CLANG -Ofast -march=native -lc++ *
******************************************

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60seconds
___________________________________


___________________________________
Raw Reads: 108546
Raw Writes: 3480
reads_per_tick = 1808
writes_per_tick = 57
Ticks = 60.0066
___________________________________



Fin!

*****************************************
* LLVM CLANG -Ofast -march=native -lc++ *
*****************************************

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60seconds
___________________________________


___________________________________
Raw Reads: 111532
Raw Writes: 3518
reads_per_tick = 1858
writes_per_tick = 58
Ticks = 60.0082
___________________________________



Fin!

Melzzzzz

Feb 20, 2019, 6:40:25 AM
On 2019-02-20, Ian Collins <ian-...@hotmail.com> wrote:
> On 20/02/2019 19:49, Chris M. Thomasson wrote:
>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>> writes can be performed in a given amount of time. My algorithm vs
>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>> be performed in this test at 60 seconds. The number of threads is
>> variable and is determined by std::hardware_concurrency * THREADS, set
>> at 8 in the test. So on my system the setup is:
>
><snip>
>
>> #define THREADS 8 // multiplied by std::hardware_concurrency
>> #define NODE_PRIME 1000000 // number of nodes
>> #define CRUDE_CACHE_PAD 256 // crude cache pad
>> #define TEST_DURATION_SEC 60 // number of seconds for the test
>
> General whinge: these ugly macros are bad form in C++!
>
> Some results, Ubuntu 18.04, Ryzen 2700X:
>
> $ g++ -Ofast x.cc -lpthread -std=c++17

Hm, I have the same CPU and use Linux, but with a custom kernel using the
MuQSS scheduler...

gcc version 8.2.1 20181127 (GCC)
~/.../bmaxa_data/examples >>> g++ -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench1
~/.../bmaxa_data/examples >>> time ./bench1
Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 173341
Raw Writes: 17845
reads_per_tick = 2886
writes_per_tick = 297
Ticks = 60.0494
___________________________________



Fin!
~/.../bmaxa_data/examples >>> clang -stdlib=libc++ -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench2 -lc++
~/.../bmaxa_data/examples >>> time ./bench2
Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 171178
Raw Writes: 19213
reads_per_tick = 2850
writes_per_tick = 319
Ticks = 60.0465
___________________________________



Fin!
and finally intel...
~/.../bmaxa_data/examples >>> icpc -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench3 -lstdc++ [139]
~/.../bmaxa_data/examples >>> time ./bench3
Testing: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 292056246
Raw Writes: 49667
reads_per_tick = 4864753
writes_per_tick = 827
Ticks = 60.0352
___________________________________



Fin!
./bench3 526.83s user 463.91s system 1649% cpu 1:00.07 total
Hm, Intel results are strange ;p
~/.../bmaxa_data/examples >>> icpc -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench3 -lstdc++
~/.../bmaxa_data/examples >>> time ./bench3
Testing: std::shared_mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 312272024
Raw Writes: 44788
reads_per_tick = 5192839
writes_per_tick = 744
Ticks = 60.1351
___________________________________



Fin!

Melzzzzz

Feb 20, 2019, 7:00:49 AM
On 2019-02-20, Chris M. Thomasson <invalid_chris_t...@invalid.com> wrote:
The Intel compiler shows some ridiculous results ;)

>
>
>
>>
>>         std::this_thread::yield();
>>
>>         shared.m_std_rwmutex.unlock_shared();
>>
>>         ++reads;
>>     }
>>
>>     shared.m_std_rwmutex.lock();
>>     shared.m_reads += reads;
>>     shared.m_std_rwmutex.unlock();
>> }
> [...]
>
> Let me change it to a volatile pointer and run the tests again...
>
> I edited the code at: https://pastebin.com/raw/1QtPCGhV
This time the results between GCC and Intel are wildly different:
~/.../bmaxa_data/examples >>> icpc -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench3 -lstdc++ [130]
~/.../bmaxa_data/examples >>> time ./bench3
Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 174527
Raw Writes: 501643
reads_per_tick = 2880
writes_per_tick = 8279
Ticks = 60.5865
___________________________________



Fin!
./bench3 993.07s user 4.89s system 1646% cpu 1:00.63 total
~/.../bmaxa_data/examples >>> vim bench1.cpp
~/.../bmaxa_data/examples >>> icpc -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench3 -lstdc++
~/.../bmaxa_data/examples >>> time ./bench3
Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 119727
Raw Writes: 28599
reads_per_tick = 1990
writes_per_tick = 475
Ticks = 60.1585
___________________________________



Fin!
./bench3 1014.76s user 2.97s system 1690% cpu 1:00.20 total
~/.../bmaxa_data/examples >>> vim bench1.cpp
~/.../bmaxa_data/examples >>> g++ -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench1
~/.../bmaxa_data/examples >>> time ./bench1
Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 170409
Raw Writes: 8955
reads_per_tick = 2838
writes_per_tick = 149
Ticks = 60.0264
___________________________________



Fin!
./bench1 976.84s user 4.84s system 1634% cpu 1:00.07 total
~/.../bmaxa_data/examples >>> vim bench1.cpp
~/.../bmaxa_data/examples >>> g++ -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench1
~/.../bmaxa_data/examples >>> time ./bench1
Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 16
threads_n = 128
writers = 64
readers = 64
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 120385
Raw Writes: 17958
reads_per_tick = 1999
writes_per_tick = 298
Ticks = 60.1946
___________________________________



Fin!

Paavo Helde

Feb 20, 2019, 7:51:02 AM
On 20.02.2019 11:57, Chris M. Thomasson wrote:

> 00D7C8F7 call dword ptr [__imp__AcquireSRWLockShared@4 (0D8B00Ch)]

What I see from here is that apparently you are running in 32-bit mode.
Who would do this in the year 2019? Recompile in 64-bit and measure again!

Bonita Montero

Feb 20, 2019, 1:25:19 PM
This is similar to std::shared_mutex, but for Win32 only.
My shared mutex always gives shared readers priority over
writers (I don't know what std::shared_mutex does instead).

#include <Windows.h>
#include <intrin.h>
#include <cassert>

class SharedMutex
{
public:
SharedMutex();
~SharedMutex();
void Lock();
void Unlock();
void LockShared();
void UnlockShared();

private:
volatile LONG m_lRWCounters;   // lower 16 bits: writers, including the one that owns the lock
                               // bits 16-30: count of threads having read-access or waiting for read-access
                               // sign-bit: readers active
volatile DWORD m_dwOwningThreadId;
volatile DWORD m_dwRecursionCount;
HANDLE m_hEvtReleaseWriter;
HANDLE m_hSemReleaseReaders;
};


SharedMutex::SharedMutex()
{
m_lRWCounters = 0;
m_dwOwningThreadId = 0;
m_dwRecursionCount = 0;
m_hEvtReleaseWriter = CreateEvent( NULL, FALSE, FALSE, NULL );
m_hSemReleaseReaders = CreateSemaphore( NULL, 0, 0x7FFFFFFF, NULL );
}

SharedMutex::~SharedMutex()
{
assert(m_dwOwningThreadId == 0);
assert(m_dwRecursionCount == 0);

CloseHandle( m_hEvtReleaseWriter );
CloseHandle( m_hSemReleaseReaders );
}

void SharedMutex::Lock()
{
DWORD dwCurrentThreadId;

if( m_dwOwningThreadId == (dwCurrentThreadId = GetCurrentThreadId()) )
{
m_dwRecursionCount++;
return;
}

LONG lRWCounters,
lOldRWCounters;

for( lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters )
{
assert((lRWCounters & 0x0FFFF) != 0x0FFFF);

        if( (lOldRWCounters = _InterlockedCompareExchange( &m_lRWCounters, lRWCounters + 1, lRWCounters )) == lRWCounters )
if( lOldRWCounters == 0 )
{
m_dwOwningThreadId = dwCurrentThreadId;
return;
}
else
break;

}

WaitForSingleObject( m_hEvtReleaseWriter, INFINITE );
m_dwOwningThreadId = dwCurrentThreadId;
}

void SharedMutex::Unlock()
{
DWORD dwRecursionCount;

assert(m_dwOwningThreadId == GetCurrentThreadId());

if( (dwRecursionCount = m_dwRecursionCount) != 0 )
{
m_dwRecursionCount = dwRecursionCount - 1;
return;
}

m_dwOwningThreadId = 0;

LONG lRWCounters,
lOldRWCounters;

// sharers have priority over writers

for( lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters )
{
assert(lRWCounters > 0);

if( (lRWCounters & 0x7FFF0000) )
{
            if( (lOldRWCounters = _InterlockedCompareExchange( &m_lRWCounters, (lRWCounters - 1) | (LONG)0x80000000u, lRWCounters )) == lRWCounters )
            {
                ReleaseSemaphore( m_hSemReleaseReaders, (lRWCounters & 0x7FFF0000) >> 16, NULL );
return;
}
}
else
        if( (lOldRWCounters = _InterlockedCompareExchange( &m_lRWCounters, lRWCounters - 1, lRWCounters )) == lRWCounters )
{
if( (lOldRWCounters & 0x0FFFF) > 1 )
SetEvent( m_hEvtReleaseWriter );
return;
}
}
}

void SharedMutex::LockShared()
{
LONG lRWCounters,
lOldRWCounters;

for( lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters )
{
assert((lRWCounters & 0x7FFF0000) != 0x7FFF0000);

        // even shared access can be recursive, but in this case the sharer-count will increment

if( lRWCounters <= 0 )
{
            if( (lOldRWCounters = _InterlockedCompareExchange( &m_lRWCounters, (lRWCounters + 0x10000) | (LONG)0x80000000u, lRWCounters )) == lRWCounters )
return;
}
else
        if( (lOldRWCounters = _InterlockedCompareExchange( &m_lRWCounters, (lRWCounters + 0x10000), lRWCounters )) == lRWCounters )
{
WaitForSingleObject( m_hSemReleaseReaders, INFINITE );
return;
}
}
}

void SharedMutex::UnlockShared()
{
LONG lRWCounters,
lOldRWCounters;

for( lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters )
{
assert(lRWCounters < 0 && (lRWCounters & 0x7FFF0000) != 0);

if( (lRWCounters & 0x7FFF0000) == 0x10000 )
{
            if( (lOldRWCounters = _InterlockedCompareExchange( &m_lRWCounters, (lRWCounters - 0x10000) & (LONG)0x7FFFFFFF, lRWCounters )) == lRWCounters )
{
if( (lRWCounters & 0x0FFFF) != 0 )
SetEvent( m_hEvtReleaseWriter );
return;
}
}
else
        if( (lOldRWCounters = _InterlockedCompareExchange( &m_lRWCounters, lRWCounters - 0x10000, lRWCounters )) == lRWCounters )
return;
}
}


I'm pretty sure you won't understand the code.
But the performance should be unbeatable. ;-)

Bonita Montero

Feb 20, 2019, 1:29:47 PM
And just ignore that there is no error-handling for CreateEvent,
CreateSemaphore, SetEvent and ReleaseSemaphore. This is just to
show the basic concept.
And no, there are no additional barriers necessary!

Paavo Helde

Feb 20, 2019, 2:22:27 PM
On 20.02.2019 20:25, Bonita Montero wrote:
> This is similar shared_mutex, but just for Win32 only.
> My shared mutex gives shared readers always priority over
> writers (don't know what std::shared_mutex does instead).
>
> #include <Windows.h>
> #include <intrin.h>
> #include <cassert>
>
> class SharedMutex
> {
> public:
> SharedMutex();
> ~SharedMutex();
> void Lock();
> void Unlock();
> void LockShared();
> void UnlockShared();
>
> private:
> volatile LONG m_lRWCounters;   // lower 16 bits: writers, including the one that owns the lock
>                                // bits 16-30: count of threads having read-access or waiting for read-access
>                                // sign-bit: readers active
> volatile DWORD m_dwOwningThreadId;
> volatile DWORD m_dwRecursionCount;

'volatile' is neither needed nor sufficient for multithreading code.


Bonita Montero

Feb 20, 2019, 2:24:20 PM
>>    volatile DWORD  m_dwOwningThreadId;
>>    volatile DWORD  m_dwRecursionCount;

> 'volatile' is neither needed nor sufficient for multithreading code.

It is just to document that it may change asynchronously.

Paavo Helde

Feb 20, 2019, 3:01:25 PM
If so, I see they are accessed without proper locking all over the
place. For example, there is no memory barrier after assigning to
m_dwOwningThreadId in Lock() and before potentially reading it in
another Lock() in another thread, so there is no guarantee the other
thread ever sees the changed value.

However, now I looked at the MSDN site and saw that by Microsoft rules
indeed volatile causes additional memory barriers when compiled with
/volatile:ms (which is the default, except for ARM). So my original
comment was wrong, volatile is indeed needed for this code to work, but
the code is restricted to not just to Windows, but also requires MSVC
compiler and its /volatile:ms regime.


Bonita Montero

Feb 20, 2019, 3:34:15 PM
> If so, I see they are accessed without proper locking all over the
> place.

The flags for the reader- and writer-count and writer-flag (sign bit)
are the lock themselves. And the other two members are only modified
by the writer that owns the mutex.

> For example, there is no memory barrier after assigning to
> m_dwOwningThreadId in Lock() ....

That's not explicitly necessary because the Xchg intrinsic before has
acquire and release behaviour.

> ... and before potential reading it in in another Lock()
> in another thread, so there is no guarantee the other

I think you consider this code in the beginning of Lock():

if( m_dwOwningThreadId == (dwCurrentThreadId = GetCurrentThreadId()) )
{
m_dwRecursionCount++;
return;
}

There is absolutely no barrier necessary here! If the thread that does
this check is the same one that owns the mutex, everything is ok because
the barrier is the Xchg in the outer Lock(). If the thread doesn't
own the mutex, the m_dwOwningThreadId may change at any time and may
have any value different than the current thread id.

Chris M. Thomasson

Feb 20, 2019, 3:46:56 PM
OUCH! ARGH.... Will do. Thanks. ;^o

Chris M. Thomasson

Feb 20, 2019, 3:53:55 PM
Why not? I need to give it a good read, and then test it out in Relacy
Race Detector:

http://www.1024cores.net/home/relacy-race-detector

> But the performance should be unbeatable. ;-)
>

Nice. Okay, I will try to get it up and running in Relacy. It can
emulate Windows sync primitives, pretty nice. Once I get it running in
there, to make sure it passes tests, then I can add it to my benchmark
code. It will be fun. Thanks for the algorithm Bonita. :^)

There is one main difference between your algorithm and mine: notice that I
avoid the use of CAS and loops. I am looking forward to giving your
algorithm a go. Thanks again. :^)

Chris M. Thomasson

Feb 20, 2019, 4:02:58 PM
On 2/20/2019 3:35 AM, mvor...@gmail.com wrote:
> On Wednesday, February 20, 2019 at 1:49:23 AM UTC-5, Chris M. Thomasson wrote:
>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>> writes can be performed in a given amount of time. My algorithm vs
>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>> be performed in this test at 60 seconds. The number of threads is
>> variable and is determined by std::hardware_concurrency * THREADS, set
>> at 8 in the test. So on my system the setup is:
[snip code]

> Tested on my 2018 i5 2.3GHz MacBook Pro and 3 compilers with max optimizations:
>
>
>
> *************************************
> * GCC -Ofast -march=native -lstdc++ *
> *************************************
>
> Testing: Chris M. Thomasson's Experimental Read/Write Mutex
[...]

Thank you! Sorry to ask, but can you try running version 0.1?

Try reloading the page at: https://pastebin.com/raw/1QtPCGhV

The newest version 0.1 should have:


Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

Testing Version 0.1: std::shared_mutex




Fwiw, here is the code:
____________________________________
/* Crude Read/Write Mutex Benchmark

This tests my algorithm ct_rwmutex vs. std::shared_mutex.

It shows how many reads and writes can be performed within
a fixed amount of time.

by: Chris M. Thomasson

version 0.1
node* volatile m_next; // to suppress optimization
std::cout << "Testing Version 0.1: ";
____________________________________

Chris M. Thomasson

Feb 20, 2019, 4:26:38 PM
It was indeed optimizing the actual work of the readers away, on my end
as well. However, wrt version 0.1, adding in volatile to the
ct_simple_stack::node::m_next pointer did the "trick", so to speak. My
numbers look a lot more reasonable now. :^)


>>>         std::this_thread::yield();
>>>
>>>         shared.m_std_rwmutex.unlock_shared();
>>>
>>>         ++reads;
>>>     }
>>>
>>>     shared.m_std_rwmutex.lock();
>>>     shared.m_reads += reads;
>>>     shared.m_std_rwmutex.unlock();
>>> }
>> [...]
>>
>> Let me change it to a volatile pointer and run the tests again...
>>
>> I edited the code at: https://pastebin.com/raw/1QtPCGhV

> This time results between GCC and Intel are wildly different:

Wrt the difference between version 0.0 and 0.1, it is working nicely. Perfect!
Nice! In this test, my algorithm is allowing your system to get 890 more
reads per-second, over the 60 seconds, than std::shared_mutex. Good. This
is not all that bad. Thanks again, Melzzzzz. :^)

Fwiw, in this test my work gets 839 more reads per-second. I am mainly
focused on the reads. The writes, well, okay, fine. If it can beat
std::shared_mutex on both reads and writes per-second, okay... I am
mainly focused on reads per-second. Humm...

Chris M. Thomasson

Feb 20, 2019, 5:06:41 PM
On 2/20/2019 3:40 AM, Melzzzzz wrote:
> On 2019-02-20, Ian Collins <ian-...@hotmail.com> wrote:
>> On 20/02/2019 19:49, Chris M. Thomasson wrote:
>>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>>> writes can be performed in a given amount of time. My algorithm vs
>>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>>> be performed in this test at 60 seconds. The number of threads is
>>> variable and is determined by std::hardware_concurrency * THREADS, set
>>> at 8 in the test. So on my system the setup is:
>>
>> <snip>
>>
>>> #define THREADS 8 // multiplied by std::hardware_concurrency
>>> #define NODE_PRIME 1000000 // number of nodes
>>> #define CRUDE_CACHE_PAD 256 // crude cache pad
>>> #define TEST_DURATION_SEC 60 // number of seconds for the test
>>
>> General whinge: these ugly macros are bad form in C++!
>>
>> Some results, Ubuntu 18.04, Ryzen 2700X:
>>
>> $ g++ -Ofast x.cc -lpthread -std=c++17
>
> Hm I have same CPU and use Linux? But I use custom kernel with muqss
> scheduler...
[...]
> ./bench3 526.83s user 463.91s system 1649% cpu 1:00.07 total
> Hm, Intel resulsts are strange ;p
> ~/.../bmaxa_data/examples >>> icpc -std=c++17 -Wall -O3 -march=native -pthread bench1.cpp -o bench3 -lstdc++
> ~/.../bmaxa_data/examples >>> time ./bench3
> Testing: std::shared_mutex
[...]
>> Testing: Chris M. Thomasson's Experimental Read/Write Mutex
[...]
Luckily, version 0.1 is here:

https://pastebin.com/raw/1QtPCGhV
(hit refresh until you see version 0.1)

Also, I made sure to put the version number in the actual output of the
program:

Testing Version 0.1: [...]

So, if you do not see Version 0.1 in the output of the program, then you
are using the older version that allowed a compiler to completely remove
read side work. It optimized away the entire read side iteration through
all of the nodes contained within the ct_simple_stack in
ct_shared::m_stack. Had to add volatile to the pointer
ct_simple_stack::node::m_next wrt version 0.1:

https://groups.google.com/forum/#!msg/comp.lang.c++/DBIG55vCBSA/JZOkaBlYAAAJ
(read all...)

Can you kindly test again, when you get some free time to kill?

Sorry about this non-sense. ;^o

Chris M. Thomasson

Feb 20, 2019, 5:08:36 PM
On 2/20/2019 2:06 PM, Chris M. Thomasson wrote:
> On 2/20/2019 3:40 AM, Melzzzzz wrote:
>> On 2019-02-20, Ian Collins <ian-...@hotmail.com> wrote:
>>> On 20/02/2019 19:49, Chris M. Thomasson wrote:
[...]
> Sorry about this non-sense. ;^o

Argh. This was meant as a response to Ian Collins. Sorry.

;^o

Ian should try out version 0.1, if interested. You already did Melzzzzz.
Thanks again.

mvor...@gmail.com

Feb 20, 2019, 5:17:55 PM
On Wednesday, February 20, 2019 at 1:49:23 AM UTC-5, Chris M. Thomasson wrote:
> Fwiw, I wrote a crude new benchmark that measures how many reads and
> writes can be performed in a given amount of time. My algorithm vs
> std::shared_mutex. So, we are _primarily_ looking for how many reads can
> be performed in this test at 60 seconds. The number of threads is
> variable and is determined by std::hardware_concurrency * THREADS, set
> at 8 in the test. So on my system the setup is:
> ___________________________________
> cpu_threads_n = 4
> threads_n = 32
> writers = 16
> readers = 16
> test duration = 60 seconds
> ___________________________________
>
> This is going to be different for each system. Readers iterate a shared
> linked list with 1000000 nodes in this test. Writers pop and push items,
> and never use new or delete. Well, so far, the timings I am getting on
> my end using MSVC 2017 is:
>
>
> Testing: Chris M. Thomasson's Experimental Read/Write Mutex
>
> ___________________________________
> cpu_threads_n = 4
> threads_n = 32
> writers = 16
> readers = 16
> test duration = 60 seconds
> ___________________________________
>
>
> ___________________________________
> Raw Reads: 18046464
> Raw Writes: 347498
> reads_per_tick = 300693
> writes_per_tick = 5790
> Ticks = 60.0161
> ___________________________________
>
>
>
> Testing: std::shared_mutex
>
> ___________________________________
> cpu_threads_n = 4
> threads_n = 32
> writers = 16
> readers = 16
> test duration = 60 seconds
> ___________________________________
>
>
> ___________________________________
> Raw Reads: 650006
> Raw Writes: 168741
> reads_per_tick = 10829
> writes_per_tick = 2811
> Ticks = 60.0193
> ___________________________________
>
>
>
> As you can see, my algorithm is performing many more reads than
> std::shared_mutex. Anyway, here is my code:
>
> When you get some time, can you please give it a go? I want to see
> timings on other systems, the good, bad and ugly. ;^) Thanks.
>
> https://pastebin.com/raw/1QtPCGhV
> ___________________________________
> /* Crude Read/Write Mutex Benchmark
>
> This tests my algorithm ct_rwmutex vs. std::shared_mutex.
>
> It shows how many reads and writes can be performed within
> a fixed amount of time.
>
> by: Chris M. Thomasson
> __________________________________________*/
>
>
>
> #include <thread>
> #include <atomic>
> #include <shared_mutex>
> #include <condition_variable>
> #include <iostream>
> #include <functional>
> #include <chrono>
> #include <cassert>
> #include <cstdlib>
> #include <ctime>
> #include <climits>
> #include <cstdint>
> #include <vector>
>
>
> // undefine to test std::shared_mutex
> #define CT_TEST_FAST_MUTEX 1
>
>
> #define THREADS 8 // multiplied by std::hardware_concurrency
> #define NODE_PRIME 1000000 // number of nodes
> #define CRUDE_CACHE_PAD 256 // crude cache pad
> #define TEST_DURATION_SEC 60 // number of seconds for the test
>
>
> struct ct_simple_stack
> {
> struct node
> {
> node* m_next;
> unsigned int m_tid;
>
> node(unsigned int tid) : m_tid(tid) {}
> };
>
> node* m_head;
>
> ct_simple_stack() : m_head(nullptr) {}
>
> ~ct_simple_stack()
> {
> ct_simple_stack::node* n = flush();
>
> while (n)
> {
> ct_simple_stack::node* next = n->m_next;
> void ct_thread_reader(ct_shared& shared, std::size_t tidx)
> {
> std::uint64_t reads = 0;
>
> while (shared.m_run.load(std::memory_order_relaxed))
> {
> shared.m_std_rwmutex.lock_shared();
>
> ct_simple_stack::node* n = shared.m_stack.m_head;
>
> while (n)
> {
> ct_simple_stack::node* next = n->m_next;
>
> n = next;
> }
>
> std::this_thread::yield();
>
> shared.m_std_rwmutex.unlock_shared();
>
> ++reads;
> }
>
> shared.m_std_rwmutex.lock();
> shared.m_reads += reads;
> shared.m_std_rwmutex.unlock();
> }
>
>
> void ct_thread_writer(ct_shared& shared, std::size_t tidx)
> {
> std::uint64_t writes = 0;
>
> while (shared.m_run.load(std::memory_order_relaxed))
> {
> shared.m_std_rwmutex.lock();
> ct_simple_stack::node* n = shared.m_stack.pop();
> shared.m_std_rwmutex.unlock();
>
> std::this_thread::yield();
>
> shared.m_std_rwmutex.lock();
>
> if (n)
> {
> shared.m_stack.push(n);
> }
>
> shared.m_std_rwmutex.unlock();
>
> std::this_thread::yield();
>
> ++writes;
> }
>
> shared.m_std_rwmutex.lock();
> shared.m_writes += writes;
> shared.m_std_rwmutex.unlock();
> }
>
>
> int main()
> {
> ct_shared shared;
>
> std::cout << "Testing: ";
> :^)

ran it on my 2012 i7 MacBook Pro today.




*************************************
* GCC -Ofast -march=native -lstdc++ *
*************************************

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 236261
Raw Writes: 3946
reads_per_tick = 3937
writes_per_tick = 65
Ticks = 60.0075
___________________________________



Fin!

******************************************
* Apple CLANG -Ofast -march=native -lc++ *
******************************************

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 232857
Raw Writes: 3981
reads_per_tick = 3880
writes_per_tick = 66
Ticks = 60.0092
___________________________________



Fin!

*****************************************
* LLVM CLANG -Ofast -march=native -lc++ *
*****************************************

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 236715
Raw Writes: 3927
reads_per_tick = 3944
writes_per_tick = 65
Ticks = 60.0111
___________________________________



Fin!

Chris M. Thomasson

Feb 20, 2019, 5:27:08 PM
On 2/20/2019 4:50 AM, Paavo Helde wrote:
Just tested with a x64 build, and here are the results wrt version 0.1:


Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 19658
Raw Writes: 1516
reads_per_tick = 327
writes_per_tick = 25
Ticks = 60.0547
___________________________________





Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 8241
Raw Writes: 1114
reads_per_tick = 137
writes_per_tick = 18
Ticks = 60.1291
___________________________________



Nice! Humm... Iirc, unsigned long on Windows is only 32-bit? My
algorithm is beating 64-bit SRW on my end using only 32-bit atomics. ;^)

My algorithm is getting 190 more reads per-second on a linked shared
data-structure that has NODE_PRIME nodes, or 1,000,000 nodes...

So each read actually iterated NODE_PRIME nodes.

NICE!

Chris M. Thomasson

Feb 20, 2019, 5:29:28 PM
On 2/20/2019 2:17 PM, mvor...@gmail.com wrote:
> On Wednesday, February 20, 2019 at 1:49:23 AM UTC-5, Chris M. Thomasson wrote:
>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>> writes can be performed in a given amount of time. My algorithm vs
>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>> be performed in this test at 60 seconds. The number of threads is
>> variable and is determined by std::hardware_concurrency * THREADS, set
>> at 8 in the test. So on my system the setup is:
>> ___________________________________
>> cpu_threads_n = 4
>> threads_n = 32
>> writers = 16
>> readers = 16
>> test duration = 60 seconds
>> ___________________________________
>>
[...]
> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
[...]

Thanks. Now, when you get some more free time to kill, can you run it
again with CT_TEST_FAST_MUTEX undefined so that you can show the
difference between my work and std::shared_mutex?

;^)

Melzzzzz

Feb 20, 2019, 5:33:50 PM
On 2019-02-20, Chris M. Thomasson <invalid_chris_t...@invalid.com> wrote:
Yep. And long double is 64 bits even if the hardware provides 80 bits...

mvor...@gmail.com

Feb 20, 2019, 5:37:51 PM
*************************************
* GCC -Ofast -march=native -lstdc++ *
*************************************

Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 54066
Raw Writes: 27089
reads_per_tick = 900
writes_per_tick = 451
Ticks = 60.0368
___________________________________



Fin!

******************************************
* Apple CLANG -Ofast -march=native -lc++ *
******************************************

Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 58693
Raw Writes: 27387
reads_per_tick = 977
writes_per_tick = 456
Ticks = 60.0413
___________________________________



Fin!

*****************************************
* LLVM CLANG -Ofast -march=native -lc++ *
*****************************************

Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 61948
Raw Writes: 28017
reads_per_tick = 1031
writes_per_tick = 466
Ticks = 60.0474
___________________________________



Fin!

Chris M. Thomasson

Feb 20, 2019, 6:28:56 PM
On 2/20/2019 2:17 PM, mvor...@gmail.com wrote:
> On Wednesday, February 20, 2019 at 1:49:23 AM UTC-5, Chris M. Thomasson wrote:
>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>> writes can be performed in a given amount of time. My algorithm vs
>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>> be performed in this test at 60 seconds. The number of threads is
>> variable and is determined by std::hardware_concurrency * THREADS, set
>> at 8 in the test. So on my system the setup is:
[...]

Wrt Version 0.1, remember my first ct_rwmutex that did not use
ct_fast_mutex, but instead ct_fast_semaphore with an initial count of 1?
Now, it would be beneficial to test against this implementation as well.
Also, another poster, Bonita showed another algorithm that uses CAS,
kind of like Windows SRW. It has a lot of explicit looping; my algorithm
has no loops in ct_rwmutex. Fwiw, it seems like my benchmark code can be
used to test many read/write algorithms.

Humm...

Chris M. Thomasson

Feb 20, 2019, 7:15:20 PM
On 2/20/2019 10:25 AM, Bonita Montero wrote:
> This is similar shared_mutex, but just for Win32 only.
> My shared mutex gives shared readers always priority over
> writers (don't know what std::shared_mutex does instead).
[...]

Need to examine it and implement it for myself; however, your read
priority is _radically_ different from mine. ct_rwmutex actually fairly
alternates between batches of reads, and a write, like a bakery
algorithm, and does _not_ use CAS. Writes can never starve reads, and
reads can never starve writes.
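
[Editor's aside: to make the CAS-free point concrete, a reader can enter such a lock with one unconditional fetch_add rather than a compare-exchange retry loop. A rough sketch of just that idea; this is NOT ct_rwmutex (see the pastebin link for the real thing), and the constant and names are made up for illustration:]

```cpp
// Wait-free reader entry: a single unconditional fetch_add, no CAS
// retry loop. Illustrative only -- not Chris's actual ct_rwmutex.
#include <atomic>

constexpr int READER_UNIT = 0x10000;  // readers counted in the high bits
static std::atomic<int> g_count{0};   // low bits left for writer state

int reader_enter()
{
    // Always succeeds in one atomic RMW step; returns the prior value.
    return g_count.fetch_add(READER_UNIT, std::memory_order_acquire);
}

int reader_exit()
{
    return g_count.fetch_sub(READER_UNIT, std::memory_order_release);
}
```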

Chris M. Thomasson

Feb 21, 2019, 1:03:56 AM
On 2/20/2019 10:25 AM, Bonita Montero wrote:
> This is similar shared_mutex, but just for Win32 only.
> My shared mutex gives shared readers always priority over
> writers (don't know what std::shared_mutex does instead).
>
> #include <Windows.h>
> #include <intrin.h>
> #include <cassert>

[...]

I implemented your algorithm, minus the recursion part, in Relacy
Race Detector. I am using all seq_cst memory ordering here because you
do not provide any memory barrier documentation. It is working fine with
seq_cst. Still need to test the recursion because I have seen many
broken recursion schemes. I forget whether volatile on MSVC will always
use that memory order with some obscure compiler flag; anyway, here is
the code:
_________________________________
// Bonita Montero Read/Write Mutex
// Relacy Test By Chris M. Thomasson
//_______________________________________________


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC


#include <relacy/relacy_std.hpp>
#include <iostream>


#define MB_SEQ_CST std::memory_order_seq_cst


#define READERS 3
#define WRITERS 3
#define THREADS (READERS + WRITERS)


LONG ct_InterlockedCompareExchange(
std::atomic<long>* Destination,
long Exchange,
long Comparand
) {
LONG cmp = Comparand;

if (Destination->compare_exchange_strong(cmp, Exchange, MB_SEQ_CST))
{
RL_ASSERT(cmp == Comparand);
}

return cmp;
}



class SharedMutex
{
public:
SharedMutex();
~SharedMutex();
void Lock();
void Unlock();
void LockShared();
void UnlockShared();

private:
    std::atomic<LONG> m_lRWCounters;  // lower 16 bits: writers, including the one that owns the lock
                                      // bits 16-30: count of threads having read-access or waiting for read-access
                                      // sign-bit: readers active

HANDLE m_hEvtReleaseWriter;
HANDLE m_hSemReleaseReaders;
};


SharedMutex::SharedMutex()
{
m_lRWCounters.store(0, MB_SEQ_CST);
m_hEvtReleaseWriter = CreateEvent(NULL, FALSE, FALSE, NULL);
m_hSemReleaseReaders = CreateSemaphore(NULL, 0, 0x7FFFFFFF, NULL);
}

SharedMutex::~SharedMutex()
{
CloseHandle(m_hEvtReleaseWriter);
CloseHandle(m_hSemReleaseReaders);
}

void SharedMutex::Lock()
{
LONG lRWCounters,
lOldRWCounters;

    for (lRWCounters = m_lRWCounters.load(MB_SEQ_CST); ; lRWCounters = lOldRWCounters)
{
RL_ASSERT((lRWCounters & 0x0FFFF) != 0x0FFFF);

        if ((lOldRWCounters = ct_InterlockedCompareExchange(&m_lRWCounters, lRWCounters + 1, lRWCounters)) == lRWCounters)
if (lOldRWCounters == 0)
{

return;
}
else
break;

}

WaitForSingleObject(m_hEvtReleaseWriter, INFINITE);
}

void SharedMutex::Unlock()
{
LONG lRWCounters,
lOldRWCounters;

// sharers have priority over writers

    for (lRWCounters = m_lRWCounters.load(MB_SEQ_CST); ; lRWCounters = lOldRWCounters)
{
RL_ASSERT(lRWCounters > 0);

if ((lRWCounters & 0x7FFF0000))
{
            if ((lOldRWCounters = ct_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters - 1) | (LONG)0x80000000u, lRWCounters)) == lRWCounters)
{
                ReleaseSemaphore(m_hSemReleaseReaders, (lRWCounters & 0x7FFF0000) >> 16, NULL);
return;
}
}
else
        if ((lOldRWCounters = ct_InterlockedCompareExchange(&m_lRWCounters, lRWCounters - 1, lRWCounters)) == lRWCounters)
{
if ((lOldRWCounters & 0x0FFFF) > 1)
SetEvent(m_hEvtReleaseWriter);
return;
}
}
}

void SharedMutex::LockShared()
{
LONG lRWCounters,
lOldRWCounters;

    for (lRWCounters = m_lRWCounters.load(MB_SEQ_CST); ; lRWCounters = lOldRWCounters)
{
RL_ASSERT((lRWCounters & 0x7FFF0000) != 0x7FFF0000);

        // even shared access can be recursive, but in this case the sharer-count will increment

if (lRWCounters <= 0)
{
            if ((lOldRWCounters = ct_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters + 0x10000) | (LONG)0x80000000u, lRWCounters)) == lRWCounters)
return;
}
else
        if ((lOldRWCounters = ct_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters + 0x10000), lRWCounters)) == lRWCounters)
{
WaitForSingleObject(m_hSemReleaseReaders, INFINITE);
return;
}
}
}

void SharedMutex::UnlockShared()
{
LONG lRWCounters,
lOldRWCounters;

for (lRWCounters = m_lRWCounters.load(MB_SEQ_CST); ; lRWCounters =
lOldRWCounters)
{
RL_ASSERT(lRWCounters < 0 && (lRWCounters & 0x7FFF0000) != 0);

if ((lRWCounters & 0x7FFF0000) == 0x10000)
{
if ((lOldRWCounters =
ct_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters - 0x10000) &
(LONG)0x7FFFFFFF, lRWCounters)) == lRWCounters)
{
if ((lRWCounters & 0x0FFFF) != 0)
SetEvent(m_hEvtReleaseWriter);
return;
}
}
else
if ((lOldRWCounters =
ct_InterlockedCompareExchange(&m_lRWCounters, lRWCounters - 0x10000,
lRWCounters)) == lRWCounters)
return;
}
}



// Relacy Stack Test...
struct ct_fast_mutex_test
: rl::test_suite<ct_fast_mutex_test, THREADS>
{
VAR_T(long) g_shared_state;
SharedMutex g_shared_mutex;

void before()
{
VAR(g_shared_state) = 0;
}

void after()
{

}


// reader
void reader(unsigned int tidx)
{
g_shared_mutex.LockShared();
long shared_state = VAR(g_shared_state);
g_shared_mutex.UnlockShared();

RL_ASSERT(shared_state > -1);
}


// writer
void writer(unsigned int tidx)
{
g_shared_mutex.Lock();
++VAR(g_shared_state);
g_shared_mutex.Unlock();
}


void thread(unsigned int tidx)
{
if (tidx < WRITERS)
{
writer(tidx);
}

else
{
reader(tidx);
}
}
};



// Test away... Or fly? Humm...
int main()
{
{
rl::test_params p;

p.iteration_count = 10000000;
//p.execution_depth_limit = 33333;
//p.search_type = rl::sched_bound;
//p.search_type = rl::fair_full_search_scheduler_type;
//p.search_type = rl::fair_context_bound_scheduler_type;

rl::simulate<ct_fast_mutex_test>(p);
}

return 0;
}
_________________________________


Okay. Getting closer. :^)

Christian Gollwitzer

unread,
Feb 21, 2019, 1:09:21 AM2/21/19
to
Am 20.02.19 um 21:01 schrieb Paavo Helde:
> On 20.02.2019 21:24, Bonita Montero wrote:
>>>>    volatile DWORD  m_dwOwningThreadId;
>>>>    volatile DWORD  m_dwRecursionCount;
>>
>>> 'volatile' is neither needed nor sufficient for multithreading code.
>>
>> It is just to document that it may change asyncronously.
>
> However, now I looked at the MSDN site and saw that by Microsoft rules
> indeed volatile causes additional memory barriers when compiled with
> /volatile:ms (which is the default, except for ARM). So my original
> comment was wrong, volatile is indeed needed for this code to work, but
> the code is restricted to not just to Windows, but also requires MSVC
> compiler and its /volatile:ms regime.

Wouldn't that mean the code can become standard C++ by replacing the
volatile stuff with std::atomic<>?

Christian

Bonita Montero

unread,
Feb 21, 2019, 1:20:33 AM2/21/19
to
>     for (lRWCounters = m_lRWCounters.load(MB_SEQ_CST); ; lRWCounters =
> lOldRWCounters)

Your additional barrier is not necessary here and in the other places!
And m_lRWCounters doesn't need to be an atomic variable.

And you disabled recursive locking. I think that if shared locking can
implicitly be done recursively, it should also be possible to do
exclusive locking recursively.

Bonita Montero

unread,
Feb 21, 2019, 1:36:21 AM2/21/19
to
> Wouldn't that mean the code can become standard C++ by
> replacing the volatile stuff with std::atomic<> ?

volatile is necessary in one case: with m_dwOwningThreadId.
volatile should just keep the compiler from caching this value in a
register. But the value in memory can change asynchronously at any
time when the thread trying to get a write-lock isn't the current
write-owner of the mutex. And on all Win32 platforms, loads and stores
of aligned DWORDs are atomic by nature.
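Fwiw, here is a minimal sketch of what Christian's suggestion could look
like: the volatile DWORD owner-id expressed with std::atomic<>. The names
here are illustrative, not taken from the posted code. A relaxed load
suffices for the "am I already the owner?" check, since a thread can only
observe its own id there if it previously stored it itself:

```cpp
#include <atomic>

// Hypothetical sketch: replacing `volatile DWORD m_dwOwningThreadId`
// with std::atomic<>. Relaxed ordering is enough for the recursion
// check; the surrounding lock/unlock protocol provides the fences.
struct OwnerId
{
    std::atomic<unsigned long> m_id{0};

    // True when the calling thread (curId) already owns the lock.
    bool is_owned_by(unsigned long curId) const
    {
        return m_id.load(std::memory_order_relaxed) == curId;
    }

    void set_owner(unsigned long curId)
    {
        m_id.store(curId, std::memory_order_relaxed);
    }

    void clear_owner()
    {
        m_id.store(0, std::memory_order_relaxed);
    }
};
```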

Chris M. Thomasson

unread,
Feb 21, 2019, 3:26:51 PM2/21/19
to
On 2/20/2019 10:20 PM, Bonita Montero wrote:
>>      for (lRWCounters = m_lRWCounters.load(MB_SEQ_CST); ; lRWCounters
>> = lOldRWCounters)
>
> Your additional barrier is not necessary here and in the other places!
> And m_lRWCounters doesn't need to be an atomic variable.

Of course you are correct on this point. :^)

Fwiw, I just quickly mocked it up using seq_cst. Now, I can add
fine-grain membars. I can basically already see what is needed and
where in your algorithm. I need to add them to your Interlocked*'s as
well. The next step will be your algorithm using the weakest memory
model that I can possibly get while still passing Relacy testing. Then
it will be ready for testing in my read/write benchmark. Nice. :^)

Does this sound okay to you Bonita?


> And you disabled recursive locking. I think that if shared locking
> can implicitly be done recursively, it should also be possible to do
> exclusive locking recursively.
>

I disabled recursion simply because my test does not need any, and it
would just add a little extra overhead to your algorithm. So, why have
it in there? Do you want me to model your recursion with Relacy? I
personally don't see a real need right now, but will do it. Well,
perhaps I am "biased" against recursive locking in the first place. Some
think it is a bit "evil"? ;^)

Chris M. Thomasson

unread,
Feb 21, 2019, 3:41:37 PM2/21/19
to
On 2/20/2019 10:20 PM, Bonita Montero wrote:
After I get the weakest memory model, it is going to be fun to test your
work against Windows SRW. It should beat it because of the way you are
waiting on the semaphore and event. In other words, they are fire and
forget, and not hooked up to any looping. Nice. :^)

Chris M. Thomasson

unread,
Feb 21, 2019, 4:01:51 PM2/21/19
to
[...]

Notice how I am using compare_exchange_strong? I think, and this
thought is before testing, that your algorithm should work fine with
compare_exchange_weak. The weak version does not impose the rather
hefty no-spurious-failure requirement on the CAS. It is more "friendly"
on systems that use LL/SC to emulate CAS.
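For illustration, a minimal sketch of the compare_exchange_weak idiom
being described; the function name and counter are hypothetical, not
from the posted SharedMutex:

```cpp
#include <atomic>

// In a retry loop, compare_exchange_weak is the natural fit: a
// spurious failure (possible on LL/SC machines) just costs one more
// pass, and on any failure the expected value is reloaded for us.
inline long cas_increment(std::atomic<long>& counters)
{
    long cur = counters.load(std::memory_order_relaxed);

    while (!counters.compare_exchange_weak(cur, cur + 1,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed))
    {
        // `cur` now holds the freshly observed value; just retry.
    }

    return cur; // the previous value
}
```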

Chris M. Thomasson

unread,
Feb 21, 2019, 4:04:17 PM2/21/19
to
[...]

Thank you! I need to try to organize all of the various results into a
single database.

Chris M. Thomasson

unread,
Feb 21, 2019, 4:28:21 PM2/21/19
to
On 2/20/2019 2:33 PM, Melzzzzz wrote:
> On 2019-02-20, Chris M. Thomasson <invalid_chris_t...@invalid.com> wrote:
>> On 2/20/2019 4:50 AM, Paavo Helde wrote:
>>> On 20.02.2019 11:57, Chris M. Thomasson wrote:
>>>
>>>> 00D7C8F7  call        dword ptr [__imp__AcquireSRWLockShared@4
>>>> (0D8B00Ch)]
>>>
>>> What I see from here is that apparently you are running in 32-bit mode.
>>> Who would do this in year 2019? Recompile in 64-bit and measure again!
>>>
>>
>> Just tested with a x64 build, and here are the results wrt version 0.1:
>>
>>
>> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
[...]
>> Nice! Humm... Iirc, unsigned long on Windows is only 32-bit?
>
> Yep. And long double is 64 bits even if hardware provides 80 bits...
>

Fwiw, on Intel, I want fetch_add to use LOCK XADD; this is very
important. Let's check out a little test on GodBolt:

https://godbolt.org/z/yruhAL // using long

https://godbolt.org/z/NKYDeI // using short

Wrt my ct_rwmutex, it better use LOCK XADD no matter what damn it.

Chris M. Thomasson

unread,
Feb 21, 2019, 5:51:04 PM2/21/19
to
Ahh, my waits on these "slow-paths" are fire and forget as well. In that
sense, they are "similar". This is a good property to have, Bonita.

Chris M. Thomasson

unread,
Feb 21, 2019, 6:43:02 PM2/21/19
to
On 2/20/2019 10:03 PM, Chris M. Thomasson wrote:
> On 2/20/2019 10:25 AM, Bonita Montero wrote:
>> This is similar shared_mutex, but just for Win32 only.
>> My shared mutex gives shared readers always priority over
>> writers (don't know what std::shared_mutex does instead).

Our algorithms are getting similar reads, but mine and std::shared_mutex
get many more writes. This is due to your explicit writer starvation,
aka reader priority over all else. Also, your looping on CAS can give my
loopless algorithm an advantage wrt more intensive reads over longer
periods of time. The code at the end of this post uses your raw
algorithm posted as-is; I included windows.h and intrin.h.

It tests your work against std::shared_mutex, and should compile with
MSVC 2017. I have not tested it on GCC yet. I can port your algorithm to
pure C++11, but then it would have to use an event and semaphore created
with mutex/condvar, and not Windows primitives directly.

Keep in mind that my work is using 100% pure C++11, and yours is using
windows specific primitives.

Afaict, your work is faster than std::shared_mutex on my end, but
consistently suffers from writer starvation. This can be an issue. Well,
fwiw, here is code for a test of your algorithm vs std::shared_mutex; it
should compile for you wrt Windows:

_________________________________________________

/* Crude Read/Write Mutex Benchmark

This tests Bonita's algorithm SharedMutex vs. std::shared_mutex.

It shows how many reads and writes can be performed within
a fixed amount of time.

by: Chris M. Thomasson

version 0.1
__________________________________________*/



#include <thread>
#include <atomic>
#include <shared_mutex>
#include <condition_variable>
#include <iostream>
#include <functional>
#include <chrono>
#include <cassert>
#include <cstdlib>
#include <ctime>
#include <climits>
#include <cstdint>
#include <vector>


// undefine to test std::shared_mutex
#define CT_TEST_FAST_MUTEX 1


#define THREADS 8 // multiplied by std::hardware_concurrency
#define NODE_PRIME 1000000 // number of nodes
#define CRUDE_CACHE_PAD 256 // crude cache pad
#define TEST_DURATION_SEC 60 // number of seconds for the test





// Bonita work start
#define WIN32_LEAN_AND_MEAN


// Bonita Montero Read/Write Mutex
// Interface modified for std::shared_mutex compatibility.
#include <Windows.h>
#include <intrin.h>
#include <cassert>

class SharedMutex
{
public:
SharedMutex();
~SharedMutex();
void lock();
void unlock();
void lock_shared();
void unlock_shared();

private:
volatile LONG m_lRWCounters;  // lower 16 bits: writers, including the one that owns the lock
                              // bits 16-30: count of threads having read-access or waiting for read-access
                              // sign-bit: readers active
volatile DWORD m_dwOwningThreadId;
volatile DWORD m_dwRecursionCount;
HANDLE m_hEvtReleaseWriter;
HANDLE m_hSemReleaseReaders;
};


SharedMutex::SharedMutex()
{
m_lRWCounters = 0;
m_dwOwningThreadId = 0;
m_dwRecursionCount = 0;
m_hEvtReleaseWriter = CreateEvent(NULL, FALSE, FALSE, NULL);
m_hSemReleaseReaders = CreateSemaphore(NULL, 0, 0x7FFFFFFF, NULL);
}

SharedMutex::~SharedMutex()
{
assert(m_dwOwningThreadId == 0);
assert(m_dwRecursionCount == 0);

CloseHandle(m_hEvtReleaseWriter);
CloseHandle(m_hSemReleaseReaders);
}

void SharedMutex::lock()
{
DWORD dwCurrentThreadId;

if (m_dwOwningThreadId == (dwCurrentThreadId = GetCurrentThreadId()))
{
m_dwRecursionCount++;
return;
}

LONG lRWCounters,
lOldRWCounters;

for (lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters)
{
assert((lRWCounters & 0x0FFFF) != 0x0FFFF);

if ((lOldRWCounters =
_InterlockedCompareExchange(&m_lRWCounters, lRWCounters + 1,
lRWCounters)) == lRWCounters)
if (lOldRWCounters == 0)
{
m_dwOwningThreadId = dwCurrentThreadId;
return;
}
else
break;

}

WaitForSingleObject(m_hEvtReleaseWriter, INFINITE);
m_dwOwningThreadId = dwCurrentThreadId;
}

void SharedMutex::unlock()
{
DWORD dwRecursionCount;

assert(m_dwOwningThreadId == GetCurrentThreadId());

if ((dwRecursionCount = m_dwRecursionCount) != 0)
{
m_dwRecursionCount = dwRecursionCount - 1;
return;
}

m_dwOwningThreadId = 0;

LONG lRWCounters,
lOldRWCounters;

// sharers have priority over writers

for (lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters)
{
assert(lRWCounters > 0);

if ((lRWCounters & 0x7FFF0000))
{
if ((lOldRWCounters =
_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters - 1) |
(LONG)0x80000000u, lRWCounters)) == lRWCounters)
{
ReleaseSemaphore(m_hSemReleaseReaders, (lRWCounters &
0x7FFF0000) >> 16, NULL);
return;
}
}
else
if ((lOldRWCounters =
_InterlockedCompareExchange(&m_lRWCounters, lRWCounters - 1,
lRWCounters)) == lRWCounters)
{
if ((lOldRWCounters & 0x0FFFF) > 1)
SetEvent(m_hEvtReleaseWriter);
return;
}
}
}

void SharedMutex::lock_shared()
{
LONG lRWCounters,
lOldRWCounters;

for (lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters)
{
assert((lRWCounters & 0x7FFF0000) != 0x7FFF0000);

// even shared access can be taken recursively; in that case the sharer-count just increments

if (lRWCounters <= 0)
{
if ((lOldRWCounters =
_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters + 0x10000) |
(LONG)0x80000000u, lRWCounters)) == lRWCounters)
return;
}
else
if ((lOldRWCounters =
_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters + 0x10000),
lRWCounters)) == lRWCounters)
{
WaitForSingleObject(m_hSemReleaseReaders, INFINITE);
return;
}
}
}

void SharedMutex::unlock_shared()
{
LONG lRWCounters,
lOldRWCounters;

for (lRWCounters = m_lRWCounters; ; lRWCounters = lOldRWCounters)
{
assert(lRWCounters < 0 && (lRWCounters & 0x7FFF0000) != 0);

if ((lRWCounters & 0x7FFF0000) == 0x10000)
{
if ((lOldRWCounters =
_InterlockedCompareExchange(&m_lRWCounters, (lRWCounters - 0x10000) &
(LONG)0x7FFFFFFF, lRWCounters)) == lRWCounters)
{
if ((lRWCounters & 0x0FFFF) != 0)
SetEvent(m_hEvtReleaseWriter);
return;
}
}
else
if ((lOldRWCounters =
_InterlockedCompareExchange(&m_lRWCounters, lRWCounters - 0x10000,
lRWCounters)) == lRWCounters)
return;
}
}
// Bonita work end





struct ct_simple_stack
{
    struct node
    {
        node* volatile m_next; // to suppress optimization
        //ct_rwmutex m_std_rwmutex;
        SharedMutex m_std_rwmutex;

[...]

    std::cout << "Testing Version 0.1: ";

#if defined (CT_TEST_FAST_MUTEX)
    //std::cout << "Chris M. Thomasson's Experimental Read/Write Mutex\n\n";
    std::cout << "Bonita's Read/Write Mutex\n\n";

[...]
_________________________________________________
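Fwiw, as a reading aid, here is a sketch (helper names are mine, not
from the code above) that decodes the m_lRWCounters layout described in
the comments: low 16 bits = writer count including the owner, bits
16-30 = reader count, sign bit set while readers are active:

```cpp
#include <cstdint>

// Illustrative decode helpers for the packed reader/writer word.
inline std::int32_t writer_count(std::int32_t c)
{
    return c & 0x0FFFF; // writers, including the current owner
}

inline std::int32_t reader_count(std::int32_t c)
{
    return (c & 0x7FFF0000) >> 16; // readers holding or waiting
}

inline bool readers_active(std::int32_t c)
{
    return c < 0; // sign bit doubles as the "readers active" flag
}
```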

Chris M. Thomasson

unread,
Feb 21, 2019, 11:49:00 PM2/21/19
to
On 2/20/2019 10:25 AM, Bonita Montero wrote:
> This is similar shared_mutex, but just for Win32 only.
> My shared mutex gives shared readers always priority over
> writers (don't know what std::shared_mutex does instead).
[...]
> I'm pretty sure you won't understand the code.
> But the performance should be unbeatable. ;-)
>

I actually like your algorithm. It does indeed remind me of mine in
certain areas; fire-and-forget event and sema waits: nice. It's almost
like a nice amalgam wrt pure CAS loops. Under heavy read contention, the
CAS loop on your end can break down. However, I still like your algo.

Bonita Montero

unread,
Feb 22, 2019, 3:56:16 AM2/22/19
to
> Under heavy read contention, the CAS loop on your end can break down.

No, that's unlikely. I wrote a little Win32 program that spawns as
many threads as there are cores. And each thread continuously CMPXCHGes
a 64-bit value in memory with its own thread id in 32 bits and a counter
in the other 32 bits. The idea behind that is to count how often a thread
can do this before the cache management relinquishes the line with the
64-bit data-structure and gives it to another core. I found that in a
tight loop that spends almost the same time doing this in each thread
as my mutex or other similar locking algorithms (the time of a round
in this loop is almost entirely determined by the slow LOCK CMPXCHG
instruction), there can be on average about 2.5 exchanges before a core
loses the cacheline (on my Ryzen 7 1800X). This means the M(O)ESI
signalling between the cores is so slow that the first exchange in my
mutex should succeed in almost any attempt.
But if we had contention on these flags of my mutex and the loop behaved
differently, i.e. the CAS loop needed more attempts, it would be very
likely that we have kernel waits and signalling. And if that happened,
the timing of the CAS loop wouldn't matter here, because the kernel
calls would be slower by magnitudes.
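For reference, a rough reconstruction of the experiment described
above; this is a sketch under stated assumptions, not the actual
program from the post. Each thread CASes its own id (high 32 bits) plus
a counter (low 32 bits) into one shared 64-bit word, and we tally
successful exchanges per thread; comparing successes to attempts hints
at how long a core keeps the cacheline:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Pack a thread id and a per-thread counter into one 64-bit word.
inline std::uint64_t pack(std::uint32_t tid, std::uint32_t n)
{
    return (std::uint64_t(tid) << 32) | n;
}

// Run `threads_n` threads, each attempting `rounds_per_thread` CASes
// on a single shared word. Returns per-thread success counts.
std::vector<std::uint64_t> cas_ping(unsigned threads_n,
                                    std::uint64_t rounds_per_thread)
{
    std::atomic<std::uint64_t> cell{0};
    std::vector<std::uint64_t> successes(threads_n, 0);
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < threads_n; ++t)
    {
        threads.emplace_back([&, t] {
            std::uint32_t n = 0;
            std::uint64_t cur = cell.load(std::memory_order_relaxed);
            for (std::uint64_t i = 0; i < rounds_per_thread; ++i)
            {
                if (cell.compare_exchange_weak(cur, pack(t + 1, ++n),
                        std::memory_order_acq_rel,
                        std::memory_order_relaxed))
                {
                    ++successes[t];       // we kept the cacheline
                    cur = pack(t + 1, n); // expect our own value next
                }
                // on failure, `cur` was reloaded by the CAS itself
            }
        });
    }

    for (auto& th : threads) th.join();
    return successes;
}
```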

Geoff

unread,
Feb 22, 2019, 3:03:50 PM2/22/19
to
On Tue, 19 Feb 2019 22:49:13 -0800, "Chris M. Thomasson"
<invalid_chris_t...@invalid.com> wrote:

>Fwiw, I wrote a crude new benchmark that measures how many reads and
>writes can be performed in a given amount of time. My algorithm vs
>std::shared_mutex. So, we are _primarily_ looking for how many reads can
>be performed in this test at 60 seconds. The number of threads is
>variable and is determined by std::hardware_concurrency * THREADS, set
>at 8 in the test.

On Win32 binary release build from VS2017 Community:
Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 16488
Raw Writes: 2672
reads_per_tick = 274
writes_per_tick = 44
Ticks = 60.1008
___________________________________



Fin!

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write
Mutex

___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 41335
Raw Writes: 2814
reads_per_tick = 688
writes_per_tick = 46
Ticks = 60.0609
___________________________________



Fin!

On x86_64 Release Build VS2017:

Testing Version 0.1: std::shared_mutex

___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 5983
Raw Writes: 3715
reads_per_tick = 99
writes_per_tick = 61
Ticks = 60.2067
___________________________________



Fin!

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write
Mutex

___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________


___________________________________
Raw Reads: 18364
Raw Writes: 2123
reads_per_tick = 305
writes_per_tick = 35
Ticks = 60.1187
___________________________________



Fin!

There seems to be quite a performance hit on x64 code vs x86.
Both tests were run from Windows Power Shell.

Chris M. Thomasson

unread,
Feb 22, 2019, 4:27:12 PM2/22/19
to
On 2/22/2019 12:56 AM, Bonita Montero wrote:
>> Under heavy read contention, theCAS loop on your end can break down.
>
> No, that's unlikely.

Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate" XADD
with CMPXCHG. The loopless XADD performs much better than the CMPXCHG
emulation.


> I wrote a little Win32 program that spawns as
> many threads as there are cores. And each thread continuously CMPXCHGes
> a 64-bit value in memory with its own thread id in 32 bits and a counter
> in the other 32 bits.

Try to emulate XADD with CMPXCHG, then test this emulation against pure
XADD.


> The idea behind that is to count how often a thread
> can do this before the cache management relinquishes the line with the
> 64-bit data-structure and gives it to another core. I found that in a
> tight loop that spends almost the same time doing this in each thread
> as my mutex or other similar locking algorithms (the time of a round
> in this loop is almost entirely determined by the slow LOCK CMPXCHG
> instruction), there can be on average about 2.5 exchanges before a core
> loses the cacheline (on my Ryzen 7 1800X). This means the M(O)ESI
> signalling between the cores is so slow that the first exchange in my
> mutex should succeed in almost any attempt.


> But if we had contention on these flags of my mutex and the loop
> behaved differently, i.e. the CAS loop needed more attempts, it would
> be very likely that we have kernel waits and signalling.

When your CAS loop spins due to contention on your counter, it is in
user space. However, there is no bound on how many times your CAS loop
can spin under high contention scenarios. With XADD, there is no looping
at all. XADD is basically a wait-free, loopless operation on Intel.
There is a big difference when we emulate XADD with a loop around CMPXCHG...


> And if this
> would happen, the timing of the CAS loop wouldn't matter here because
> the kernel calls would be slower by magnitudes.

I have tested CAS loops vs wait-free in the past, and the loopless
algorithm always seems to win.
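To make the comparison concrete, here is a minimal sketch of the two
variants being discussed: a native fetch_add (a single LOCK XADD on
x86) versus a fetch_add "emulated" with a compare-exchange loop, which
can retry without bound under contention. Function names are mine:

```cpp
#include <atomic>

// Native: one wait-free instruction on x86 (LOCK XADD).
inline long xadd_native(std::atomic<long>& v, long delta)
{
    return v.fetch_add(delta, std::memory_order_acq_rel);
}

// Emulated: an unbounded retry loop; any interleaving update by
// another thread (or a spurious failure) forces another pass.
inline long xadd_emulated(std::atomic<long>& v, long delta)
{
    long cur = v.load(std::memory_order_relaxed);
    while (!v.compare_exchange_weak(cur, cur + delta,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed))
    {
        // `cur` was reloaded by the failed CAS; retry.
    }
    return cur; // previous value, matching fetch_add's return
}
```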

Chris M. Thomasson

unread,
Feb 22, 2019, 5:57:24 PM2/22/19
to
Perhaps create a blog like you did. Humm, perhaps we can link to each
other's work.

Chris M. Thomasson

unread,
Feb 22, 2019, 8:11:26 PM2/22/19
to
Yeah. I get similar results on my end as well, and I'm still not sure
why. Humm...

Bonita Montero

unread,
Feb 23, 2019, 1:27:56 AM2/23/19
to
> Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate" XADD
> with CMPXCHG. The loopless XADD performs much better than the CMPXCHG
> emulation.

You didn't understand what I said: in my CAS benchmark, with as many
threads doing tight CAS loops as there are cores, I measured an average
of about 2.5 successful CASes until a thread relinquishes the cacheline
holding the flags. That's because once the cacheline has been fetched
from another core, it remains for a while in the fetching core, since
the M(O)ESI signalling between the cores is slow.
This means that with the CAS loop in my mutex, the CAS should succeed
on the first attempt in almost every case. So there should be no need
for XADD here.
And even more, XADD could only be used for getting ownership as a reader
and not as a writer, because in the latter case there is not only an add;
I also set the sign-flag.
And with XADD I couldn't do the overflow-assert when NDEBUG is not
defined.

>> But if we had contention on this flags of my mutex and the loop would
>> behave different, i.e. the CAS-loop would need more attempts, it would
>> be very likely that we have kernel-waits and -signalling.

> When your CAS loop spins due to contention on your counter, it is in
> user space.

Boy, you're so slow on the uptake! When there's contention on this
flag, it's very likely that there will be kernel-waits and signalling
as well; and if this happens, the performance of the atomic operations
doesn't count.

> However, there is no bound on how many times your CAS loop can spin
> under high contention scenarios. ...

As I already said, and according to what I measured with my benchmark,
the CMPXCHG should succeed on the first attempt almost every time.

Chris M. Thomasson

unread,
Feb 23, 2019, 3:31:30 PM2/23/19
to
On 2/22/2019 10:27 PM, Bonita Montero wrote:
>> Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate"
>> XADD with CMPXCHG. The loopless XADD performs much better than the
>> CMPXCHG emulation.
[...]
> As I already said, and according to what I measured with my benchmark,
> the CMPXCHG should succeed on the first attempt almost every time.
>

Should? Humm... Think of the following simple scenario:

Writer A reads your count.

Reader B gains read access.

Writer A spins on CAS because it failed!

Reader B releases read access.

Writer A spins on CAS because it failed!

Reader C gains read access.

Writer A spins on CAS because it failed!

Reader C releases read access.

Writer A spins on CAS because it failed!

[...]


Sustained read access can starve your CAS loop before any kernel waits
even occur, and make it loop many, many times in userspace without doing
any real work at all. This scenario can never happen in my algorithm
because it has no loops. What is so hard to understand? This can happen
in your work, not in mine. You cannot prove that this scenario will
never happen in your algorithm. Period.

Notice how the writer is starting to livelock? I don't think you
understand how a CAS can livelock in these scenarios. This is why I
avoided CAS loops when I designed my algorithm.

You cannot give me a fixed bound on your CAS loops, no matter how hard
you try. These are general purpose read write locks. So, you cannot
predict how they are going to be used.

You need to understand that a CAS loop can break down. This is nothing new.

https://groups.google.com/d/msg/lock-free/zzZX4fvtG04/ebVWl0BCBAAJ

CAS loops degrade under high contention. End of story.

Btw, do you want me to tell you about compare_exchange_weak?


Also, you really should test the difference between XADD and its CMPXCHG
emulation. That is the key.

Chris M. Thomasson

unread,
Feb 23, 2019, 3:44:03 PM2/23/19
to
On 2/22/2019 10:27 PM, Bonita Montero wrote:

[...]

Also, Bonita: Can you please properly quote context? Why do you seem to
always snip all authorship?

Chris M. Thomasson

unread,
Feb 23, 2019, 6:51:03 PM2/23/19
to
On 2/23/2019 12:31 PM, Chris M. Thomasson wrote:
> On 2/22/2019 10:27 PM, Bonita Montero wrote:
>>> Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate"
>>> XADD with CMPXCHG. The loopless XADD performs much better than the
>>> CMPXCHG emulation.
> [...]
>> As I already said and according to what I measured with my benchmark
>> the CMPXCHG should succeed in almost any time at the first attempt.
>>
>
> Should? Humm... Think of the following simple scenario:
>
> Writer A reads your count.
>
> Reader B gains read access.
>
> Writer A spins on CAS because it failed!
>
> Reader B releases read access.
>
> Writer A spins on CAS because it failed!
>
> Reader C gains read access.
>
> Writer A spins on CAS because it failed!
>
> Reader C releases read access.
>
> Writer A spins on CAS because it failed!

A Reads your count.

B does a CAS.

A spins on CAS. failed.

C does a CAS.

A spins on CAS. failed.

B does a CAS.

A spins on CAS. failed.

Your CAS will spin on basically any interference like this. This can be
a real issue. Think of the reason why there are compare_exchange_weak
and compare_exchange_strong in the first place. Think of spurious
failures that have nothing to do with another CAS... Study up on
compare_exchange_weak.

Bonita Montero

unread,
Feb 24, 2019, 9:06:29 AM2/24/19
to
1. You omitted what I determined from my benchmark. According to my
benchmark, an exchange succeeds on average about 2.5 times until the
cacheline is moved to a different core (Ryzen 7 1800X, but I think
that's not very different with Intel CPUs).
2. When multiple writers are contending on the writer-flag, it's
very likely that there will also be kernel-waiting and signalling.
That's magnitudes slower than the CAS-loop. So the performance
of the latter doesn't count here any more.
3. The only case where XADD could make sense is when there is an
active writer and a reader is registering to gain shared ownership.

Bonita Montero

unread,
Feb 24, 2019, 11:06:59 AM2/24/19
to
> 3. The only case where XADD could make sense is when there is an
> active writer and a reader is registering to gain shared ownership.

... and there are active writers. But if there are active writers,
there will be kernel transitions, which are magnitudes slower than
the CAS-loop. So XADD doesn't really help here.

Chris M. Thomasson

unread,
Feb 24, 2019, 5:24:02 PM2/24/19
to
On 2/22/2019 5:11 PM, Chris M. Thomasson wrote:
> On 2/22/2019 12:03 PM, Geoff wrote:
>> On Tue, 19 Feb 2019 22:49:13 -0800, "Chris M. Thomasson"
>> <invalid_chris_t...@invalid.com> wrote:
>>
>>> Fwiw, I wrote a crude new benchmark that measures how many reads and
>>> writes can be performed in a given amount of time. My algorithm vs
>>> std::shared_mutex. So, we are _primarily_ looking for how many reads can
>>> be performed in this test at 60 seconds. The number of threads is
>>> variable and is determined by std::hardware_concurrency * THREADS, set
>>> at 8 in the test.
>>
>> On Win32 binary release build from VS2017 Community:
>> Testing Version 0.1: std::shared_mutex
>>
>> ___________________________________
>> cpu_threads_n = 6
>> threads_n = 48
>> writers = 24
>> readers = 24
>> test duration = 60 seconds
>> ___________________________________
[...]
>> There seems to be quite a performance hit on x64 code vs x86.
>> Both tests were run from Windows Power Shell.
>>
>
> Yeah. I get similar results on my end as well, and still not sure why.
> Humm...

Perhaps I should try out using int64_t in my atomics. Right now, on
Windows, long is 32 bits even on 64-bit builds.

Chris M. Thomasson

unread,
Feb 25, 2019, 4:27:50 PM2/25/19
to
On 2/24/2019 6:06 AM, Bonita Montero wrote:

[try to properly quote... Pretty please?]
CAS will _always_ fail if the comparand is different from the
destination at the point of the atomic operation. This is basically the
"strong" version. However, the weak version can fail at any time, even
if the comparand is the same. Think of a system where something in the
reservation granularity gets mutated wrt a LL/SC. A CAS loop can be
highly inefficient when compared to a single loopless operation like
XADD. I always try to avoid loops when I design sync algorithms.

Chris M. Thomasson

unread,
Feb 25, 2019, 4:33:35 PM2/25/19
to
When can a writer hit a kernel waitset when it can livelock under heavy
reader contention? Exactly how many spins do your CMPXCHG loops
accumulate before they can get any real work done, under heavy use? You
cannot give me an answer. I can wrt XADD.

You cannot work your way out of the following:
__________________________
Writer A reads your count.

Reader B gains read access.

Writer A spins on CAS because it failed!

Reader B releases read access.

Writer A spins on CAS because it failed!

Reader C gains read access.

Writer A spins on CAS because it failed!

Reader C releases read access.

Writer A spins on CAS because it failed!
__________________________

There are many other examples of CAS failure patterns.

Bonita Montero

unread,
Feb 26, 2019, 1:23:38 AM2/26/19
to
>> ... and there are active writers. But if there are active writers
>> there will be kernel-transisions which are magnitudes slower than
>> the CAS-loop. So XADD rather don't help here.

> When can a writer hit a kernel waitset when it can livelock in
> heavy reader contention? Exactly how many spins do your CMPXCHG
> loops accumulate before it can get any real work done, under
> heavy use? You cannot give me an answer. I can wrt XADD.
>
> You cannot work your way out of the following:
> __________________________
> Writer A reads your count.
> Reader B gains read access.
> Writer A spins on CAS because it failed!
> Reader B releases read access.

It's impossible that reader B holds read-access for such a short time.
That's only a few clock-cycles, in which no meaningful work could
be done.
If there was really meaningful work in reader B, the locking
interval would be at least long enough that writer A goes to sleep.

Bonita Montero

unread,
Feb 26, 2019, 1:25:06 AM2/26/19
to
> CAS will _always_ fail if the comparand is different than
> the destination at the point of the atomic operation.

Thank you for this insight! I never would have known this otherwise!

Chris M. Thomasson

unread,
Feb 26, 2019, 11:51:19 PM2/26/19
to
Sorry for that. Well, I really do like your algorithm. It is a very nice
piece of work. Thank you for sharing it.

Chris M. Thomasson

unread,
Feb 26, 2019, 11:52:11 PM2/26/19
to
On 2/25/2019 10:23 PM, Bonita Montero wrote:
>>> ... and there are active writers. But if there are active writers
>>> there will be kernel-transisions which are magnitudes slower than
>>> the CAS-loop. So XADD rather don't help here.
>
>> When can a writer hit a kernel waitset when it can livelock in
>> heavy  reader contention? Exactly how many spins do your CMPXCHG
>> loops  accumulate before it can get any real work done, under
>> heavy use? You  cannot give me an answer. I can wrt XADD.
>>
>> You cannot work your way out of the following:
>> __________________________
>> Writer A reads your count.
>> Reader B gains read access.
>> Writer A spins on CAS because it failed!
>> Reader B releases read access.
>
> It's impossible that reader B holds read-access for such a short time.

Tell that to a general purpose read/writer lock... ;^)


> That's only a few clock-cycles, in which no meaningful work could
> be done.
> If there was really meaningful work in reader B, the locking
> interval would be at least long enough that writer A goes to sleep.

Regardless, I like your algorithm. It has an elegance about it.

Chris M. Thomasson

unread,
Feb 26, 2019, 11:57:20 PM2/26/19
to
Just wondering if you are interested in a version of your algorithm that
is 100% pure C++11, and using the weakest memory order? The semaphore
and event would be emulated in C++11 using mutex/condvar. With that in
mind, it is going to "reduce" performance wrt raw Windows primitives.

Chris M. Thomasson

unread,
Feb 26, 2019, 11:58:34 PM2/26/19
to
The performance is nice. Not bad at all.

Bonita Montero

unread,
Feb 27, 2019, 2:58:49 AM2/27/19
to
>> Regardless, I like your algorithm. It has an elegance about it.

> Just wondering if you are interested in a version of your algorithm
> that is 100% pure c++11, ...

How should that be possible? C++-whatever has no kernel-synchronization
facilities like events or semaphores.

> The semaphore and event would be emulated in c++11 using mutex/condvar.

That would be inefficient.

It would be better to rewrite the code for POSIX with POSIX semaphores.

Bonita Montero

unread,
Feb 27, 2019, 5:25:36 AM2/27/19
to
>> The semaphore and event would be emulated in c++11 using mutex/condvar.

> That would be inefficient.

I reconsidered that: it wouldn't be that inefficient, as the mutex and
the condvar would be used only in contention cases, and when that happens
there would be kernel calls which are a lot slower than the additional
overhead over raw semaphores (and eventually events).
But I think it's more elegant, because it is more minimalistic, than
realizing it with the system facilities and having more than one
implementation of the rw-mutex.

Paavo Helde

unread,
Feb 27, 2019, 5:59:40 AM2/27/19
to
On 27.02.2019 9:58, Bonita Montero wrote:
>>> Regardless, I like your algorithm. It has an elegance about it.
>
>> Just wondering if you are interested in a version of your algorithm
>> that is 100% pure c++11, ...
>
> How should that be possible? C++-whatever has no kernel-synchronization
> facilities like events or semaphores.

By this argument a C++ program would also not be able to allocate memory
or open files as these are also kernel facilities.

Bonita Montero

unread,
Feb 27, 2019, 7:06:49 AM2/27/19
to
>> How should that be possible? C++-whatever has no kernel-synchronization
>> facilities like events or semaphores.

> By this argument a C++ program would also not be able to allocate memory
> or open files as these are also kernel facilities.

You are unable to read. I only said that C++ has "no kernel-synchronization
facilities like events or semaphores".

Or please tell me what the classes for semaphores and events are.

Paavo Helde

unread,
Feb 27, 2019, 9:29:19 AM2/27/19
to
C++11 contains everything needed for efficient multithreading
programming. I trust the committee in that if semaphores and events are
not included then they are not needed for that purpose.





Bonita Montero

unread,
Feb 27, 2019, 9:35:25 AM2/27/19
to
> C++11 contains everything needed for efficient multithreading
> programming.

I didn't tell the opposite; what C++11 supplies is sufficient in
most cases. But if you try to implement a rw-mutex, you can't do this
as efficiently as with native events and semaphores.

Paavo Helde

unread,
Feb 27, 2019, 10:19:32 AM2/27/19
to
OK, fair enough. I guess that's why there is std::shared_mutex added in
C++17.

Scott Lurndal

unread,
Feb 27, 2019, 10:25:46 AM2/27/19
to
Over the last three decades, I've found that mutexes and condition
variables have been sufficient for everything from mainframe
operating systems (which had, in 1985, mutex and condition variables
built right into the instruction set) to modern pthread apps.

http://vseries.lurndal.org/doku.php?id=instructions:lok

Chris M. Thomasson

unread,
Mar 1, 2019, 12:13:18 AM3/1/19
to
On 2/27/2019 2:25 AM, Bonita Montero wrote:
>>> The semaphore and event would be emulated in c++11 using mutex/condvar.
>
>> That would be inefficient.
>
> I reconsidered that: it wouldn't be that inefficient, as the mutex and
> the condvar would be used only in contention cases, and when that happens
> there would be kernel calls which are a lot slower than the additional
> overhead over raw semaphores (and eventually events).

Agreed.


> But I think it's more elegant, because it is more minimalistic, than
> realizing it with the system facilities and having more than one
> implementation of the rw-mutex.

Using system-specific primitives is going to, or really should be, "more"
efficient than emulating them with mutex/condvar. However, a 100% pure
c++11 implementation would allow the algorithm to be exposed to
different architectures and operating systems: A fairly decent trade-off?

Your algorithm reminds me of mine, but in a different sort of way. I am
wondering about what would happen to your algorithm if it were to
alternate reader and writer preference at random in the Unlock function.
It should work, and create different interesting testing results wrt
reads and writes performed in a fixed amount of time.

Actually, std::shared_mutex should beat our algorithms, but on my end,
ours both beat it out. Oh well... ;^)
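The random alternation could be sketched like this (a hypothetical
helper; the real Unlock would feed the result into its reader-vs-writer
release path):

```cpp
#include <random>

// Coin flip deciding whether queued readers or a queued writer are
// preferred on the next unlock. Illustrative only.
inline bool pick_reader_preference()
{
    // thread_local engine: no locking needed around the RNG itself.
    thread_local std::mt19937 rng{std::random_device{}()};
    return std::bernoulli_distribution(0.5)(rng);
}
```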

Chris M. Thomasson

unread,
Mar 1, 2019, 1:15:35 AM3/1/19
to
On 2/27/2019 2:25 AM, Bonita Montero wrote:
Actually, you can remove the loop in Lock and use fetch_add, or
InterlockedExchangeAdd. Recursive locking aside, your Lock function CAS
loop can be reduced down to:
___________________________
void SharedMutex::Lock()
{
    if (InterlockedExchangeAdd(&m_lRWCounters, 1) != 0)
    {
        WaitForSingleObject(m_hEvtReleaseWriter, INFINITE);
    }
}
___________________________

Now, this really reminds me of my algorithm. :^)
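In portable c++11 atomics the same shape looks roughly like this (a toy
model: the event wait is replaced by a counter so the snippet stays
self-contained; writer_gate and its fields are made-up names):

```cpp
#include <atomic>

// Writer-side lock sketch using one unconditional fetch_add instead of
// a CAS loop. The counter counts writers; a nonzero previous value
// means another writer owns or is queued for the lock.
struct writer_gate
{
    std::atomic<long> m_rwCounters{0};
    int blocked = 0; // stands in for WaitForSingleObject(m_hEvtReleaseWriter, INFINITE)

    void lock()
    {
        // fetch_add returns the value *before* the increment, so a
        // single atomic op both registers this writer and tells us
        // whether the lock was free.
        if (m_rwCounters.fetch_add(1, std::memory_order_acquire) != 0)
            ++blocked; // real code would block here until handed the lock
    }

    void unlock()
    {
        // Real code would signal the writer event here if the counter
        // shows other writers queued behind us.
        m_rwCounters.fetch_sub(1, std::memory_order_release);
    }
};
```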

Chris M. Thomasson

unread,
Mar 1, 2019, 1:21:55 AM3/1/19
to
Fwiw, this type of discussion brings me back to the good ol' days in
comp.programming.threads. Before c++ officially recognized these types
of things in the standard...

Chris M. Thomasson

unread,
Mar 2, 2019, 11:57:06 PM3/2/19
to
Afaict, std::shared_mutex should use native primitives that can beat my
loopless algorithm in 100% pure C++. Except, sometimes it doesn't. My
contrived test, version 0.1, blasts the shi% out of the read/write
algorithm with massive read/write contention. Designed to see how many
reads and writes occur when a centralized read/write mutex gets hammered
with contention for a fixed amount of time.

There is another class of algorithm that should be thrown into the mix:
a special algorithm in which readers are not dependent on writer
activity. Well, the test might have to be altered to represent a
read-mostly access pattern, instead of blasting with heavy writes and
reads. Humm...

Chris M. Thomasson

unread,
Mar 3, 2019, 12:08:22 AM3/3/19
to
On 2/19/2019 10:49 PM, Chris M. Thomasson wrote:
> Fwiw, I wrote a crude new benchmark that measures how many reads and
> writes can be performed in a given amount of time.

[...]

I need to create another type of test, one that does not hammer things
so much wrt creating an equal amount of reader and writer threads. I
need to basically model a "read-mostly" work environment. Instead of a
hyper-hardcore read and write focused contention massacre...

Chris M. Thomasson

unread,
Mar 3, 2019, 4:43:38 PM3/3/19
to
On 2/27/2019 2:25 AM, Bonita Montero wrote:
Just a little side note:
__________________________
Wrt porting your algorithm to 100% pure C++, I am a little bit worried
about performing shifts on a signed integer. Especially when you mutate
the "assumed" sign bit. Testing the value against zero: greater than,
less than, equal to, etc... After that: Well, it seems like UB, or
implementation defined behavior can occur...
__________________________

Port your algorithm to unsigned long. It is fairly straightforward, and
clears everything up.
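A sketch of what the unsigned port looks like, assuming the bit layout
used elsewhere in this thread (low 16 bits writers, bits 16-30 readers,
top bit "readers active"); the names are illustrative:

```cpp
#include <cstdint>

// On std::uint32_t every shift and mask below is fully defined; there
// is no "assumed" sign bit to mutate and no signed-overflow UB.
using counters = std::uint32_t;

constexpr counters WRITER_MASK    = 0x0000FFFFu; // low 16 bits: writers
constexpr counters READER_MASK    = 0x7FFF0000u; // bits 16-30: readers
constexpr counters READERS_ACTIVE = 0x80000000u; // top bit: readers active

inline unsigned reader_count(counters c)
{
    return (c & READER_MASK) >> 16; // well-defined on an unsigned type
}

inline counters add_reader(counters c)
{
    return c + 0x10000u; // unsigned arithmetic, no UB on overflow
}

inline bool readers_active(counters c)
{
    return (c & READERS_ACTIVE) != 0; // plain mask test, no sign games
}
```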

Bonita Montero

unread,
Mar 6, 2019, 7:31:43 AM3/6/19
to
> Actually, you can remove the loop in Lock and use fetch_add, or
> InterlockedExchangeAdd. Recursive locking aside, your Lock function CAS
> loop can be reduced down to:
> ___________________________
> void SharedMutex::Lock()
> {
>     if (InterlockedExchangeAdd(&m_lRWCounters, 1) != 0)
>     {
>         WaitForSingleObject(m_hEvtReleaseWriter, INFINITE);
>     }
> }
> ___________________________
>
> Now, this really reminds me of my algorithm. :^)

My implementation is just experimental. If you would write a clean
implementation, you would throw an exception if either counter saturates.
With this, LOCK XADD wouldn't be possible.

Chris M. Thomasson

unread,
Mar 6, 2019, 3:03:35 PM3/6/19
to
On 3/6/2019 4:31 AM, Bonita Montero wrote:
>> Actually, you can remove the loop in Lock and use fetch_add, or
>> InterlockedExchangeAdd. Recursive locking aside, your Lock function
>> CAS loop can be reduced down to:
>> ___________________________
>> void SharedMutex::Lock()
>> {
>>      if (InterlockedExchangeAdd(&m_lRWCounters, 1) != 0)
>>      {
>>          WaitForSingleObject(m_hEvtReleaseWriter, INFINITE);
>>      }
>> }
>> ___________________________
>>
>> Now, this really reminds me of my algorithm. :^)
>
> My implementation is just experimental.

It is still nice. My work is experimental as well, but then I found out
that my algorithm has been used in the Go Language for the past 7 years.
Go figure. ;^)


> If you would write a clean
> implementation you would throw an exception if either counter saturates.
> With this LOCK XADD wouldn't be possible.

Humm... Well, I guess you could create a simple rule: if more than
0xFFFF writers hit your algorithm at the same time, it is undefined
behavior, end of story. Mine has that wrt _more_ than LONG_MAX readers
hitting it at the same time: Well, another one bites the dust.

Also, you can check to see if the return value of XADD, say count, is
(count & 0x0000FFFF) == 0xFFFF. That means the writers just intruded
into the reader logic, and an exception should be thrown. Humm...
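That after-the-fact check could be sketched like this (illustrative
names; the exception type and the policy are arbitrary):

```cpp
#include <atomic>
#include <stdexcept>

// Unconditional XADD first, then inspect the *previous* value. If the
// 16-bit writer field was already saturated, our increment has just
// spilled into the reader bits, so all we can do is throw after the fact.
inline long writer_lock_or_throw(std::atomic<long>& rwCounters)
{
    long prev = rwCounters.fetch_add(1, std::memory_order_acquire);
    if ((prev & 0x0000FFFF) == 0x0000FFFF)
        throw std::runtime_error("writer count saturated"); // mutation already happened
    return prev;
}
```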

Chris M. Thomasson

unread,
Mar 6, 2019, 3:53:38 PM3/6/19
to
That would be after the fact of the mutation. Shi% has already occurred,
then you get an exception. Different than CAS loop. However, a simple
rule that no more than 0xFFFF writers can hit it at the same time should
be sufficient. Perhaps have two different modes. One that throws on
anything that is not Kosher, and another one that relies on the user to
not break the rules.

For some reason choosing CAS just to throw when the user breaks the
rules, seems like putting "corks on the forks" to prevent people from
hurting themselves.

Bonita Montero

unread,
Mar 8, 2019, 3:04:19 PM3/8/19
to
> Also, you can check to see if the return value of XADD, say count, is
> (count & 0x0000FFFF) == 0xFFFF. That means the writers just intruded
> into the reader logic, and an exception should be thrown. Humm...

Then you would have to end the process. Rolling back the XADD
wouldn't be possible because another thread would rely on the
state asynchronously.

Chris M. Thomasson

unread,
Mar 9, 2019, 9:21:28 PM3/9/19
to
Just tell the users to not break the rules. Simple. Like, never unlock a
mutex, unless it has been locked already. Or, always be sure to unlock a
mutex that was previously locked. This was never meant to try and
emulate a robust mutex... Why put corks on the forks?

Bonita Montero

unread,
Mar 10, 2019, 5:17:49 AM3/10/19
to
> Just tell the users to not break the rules. Simple.
> Like, never unlock a mutex, unless it has been locked already.

That's not the issue I'm talking about. The problem here is that
more than (2 ^ 16) - 1 threads might have locked the mutex exclusively
or more than (2 ^ 15) - 1 threads might have locked the mutex shared
or shared recursively.

Chris M. Thomasson

unread,
Mar 10, 2019, 11:37:26 PM3/10/19
to
I know what your issue is, and how CAS can solve the problem _before_ it
can occur. However, you should make an explicit _rule_ that no more than
0xFFFF writers can hold your work for exclusive access at any
one time. If your CAS loop detects an error, it is sort of acting like
the proverbial cork on the fork. Btw, there are ways to fix XADD
accounting errors, but they are sometimes not pretty. Wrt robust mutex,
I am writing about what happens if a process dies while waiting on a
kernel primitive, (semaphore, auto-reset event)...

Bonita Montero

unread,
Mar 11, 2019, 4:13:51 AM3/11/19
to
> However, you should make an explicit _rule_ that no more than
> 0xFFFF, wrt writers, can access your work for exclusive access
> at any one time.

The best way is to trap the threads in the way I described: simply check
the counter before incrementing. Sometimes you simply can't predict
how many threads will lock the mutex, especially with recursive
read-locking.

Chris M. Thomasson

unread,
Mar 11, 2019, 4:23:48 PM3/11/19
to
Fair enough. 0xFFFF readers may be a little bit too small? My algorithm
allows for LONG_MAX readers. So, the chances of breaking my work's rules
are much lower than with 0xFFFF...

Bonita Montero

unread,
Apr 4, 2019, 5:56:45 AM4/4/19
to
I've re-implemented the class more usable and POSIX-compilable:

#if defined(_MSC_VER)
#include <Windows.h>
#include <intrin.h>
#elif defined(__unix__)
#include <semaphore.h>
#include <pthread.h>
#endif
#include <exception>
#include <cassert>
#include <cstdint>

class resource_exception : public std::exception
{
public:
resource_exception() :
exception()
{
}

resource_exception(resource_exception const& other) noexcept :
exception(other)
{
}
};

class semaphore
{
public:
semaphore();
~semaphore();
bool wait();
void forced_wait();
unsigned release( unsigned count );
void forced_release( unsigned count );

private:
#if defined(_MSC_VER)
HANDLE m_hSem;
#elif defined(__unix__)
sem_t m_sem;
#else
#error need platform-specific semaphore!
#endif
};

class thread_id
{
public:
thread_id();
thread_id &set_to_current();
bool is_current();
void set_unowned();
bool is_owned();
thread_id &operator =( thread_id const &other );
bool operator ==( thread_id const &other );

private:
#if defined(_MSC_VER)
DWORD m_dwThreadId;
#elif defined(__unix__)
bool m_owned;
pthread_t m_threadId;
#else
#error need platform-specific thread-id!
#endif
};

#if defined(_MSC_VER)
inline
semaphore::semaphore()
{
if( (m_hSem = CreateSemaphore( nullptr, 0, 0x7FFFFFFF, nullptr )) == NULL )
throw resource_exception();
}

inline
semaphore::~semaphore()
{
CloseHandle( m_hSem );
}

inline
bool semaphore::wait()
{
return WaitForSingleObject( m_hSem, INFINITE ) == WAIT_OBJECT_0;
}

inline
unsigned semaphore::release( unsigned count )
{
return ReleaseSemaphore( m_hSem, (LONG)count, nullptr ) ? 0 : count;
}

inline std::int32_t CAS( std::int32_t volatile *pWhere, std::int32_t xchg, std::int32_t cmp )
{
return _InterlockedCompareExchange( (LONG volatile *)pWhere, (LONG)xchg, (LONG)cmp );
}

inline
thread_id::thread_id()
{
m_dwThreadId = 0;
}

inline
thread_id &thread_id::set_to_current()
{
m_dwThreadId = GetCurrentThreadId();
return *this;
}

inline
bool thread_id::is_current()
{
return m_dwThreadId == GetCurrentThreadId();
}

inline
void thread_id::set_unowned()
{
m_dwThreadId = 0;
}

inline
bool thread_id::is_owned()
{
return m_dwThreadId != 0;
}

inline
thread_id &thread_id::operator =( thread_id const &other )
{
m_dwThreadId = other.m_dwThreadId;
return *this;
}

inline
bool thread_id::operator ==( thread_id const &other )
{
return m_dwThreadId == other.m_dwThreadId;
}

#elif defined(__unix__)
inline
semaphore::semaphore()
{
if( sem_init( &m_sem, 0, 0 ) != 0 )
throw resource_exception();
}

inline
semaphore::~semaphore()
{
sem_destroy( &m_sem );
}

inline
bool semaphore::wait()
{
return sem_wait( &m_sem ) == 0;
}

inline
unsigned semaphore::release( unsigned count )
{
for( ; count; --count )
if( sem_post( &m_sem ) != 0 )
return count;
return 0; // to disable g++-warning
}

inline
thread_id::thread_id()
{
m_owned = false;
}

inline
thread_id &thread_id::set_to_current()
{
m_owned = true;
m_threadId = pthread_self();
return *this;
}

inline
bool thread_id::is_current()
{
return m_owned && pthread_equal( m_threadId, pthread_self() ) != 0;
}

inline
void thread_id::set_unowned()
{
m_owned = false;
}

inline
bool thread_id::is_owned()
{
return m_owned;
}

inline
thread_id &thread_id::operator =( thread_id const &other )
{
m_owned = other.m_owned;
m_threadId = other.m_threadId;
return *this;
}

inline
bool thread_id::operator ==( thread_id const &other )
{
return m_owned && other.m_owned && m_threadId == other.m_threadId;
}
#if defined(__GNUC__) || defined(__clang__)
inline std::int32_t CAS( std::int32_t volatile *pWhere, std::int32_t xchg, std::int32_t cmp )
{
return __sync_val_compare_and_swap( pWhere, cmp, xchg );
}
#else
#error need CAS-implementation!
#endif
#else
#error need platform-specific cas-adaption!
#endif

inline
void semaphore::forced_wait()
{
while( !wait() );
}

inline
void semaphore::forced_release( unsigned count )
{
while( count )
count = release( count );
}

class shared_mutex
{
public:
shared_mutex();
~shared_mutex();
void lock();
bool try_lock();
void unlock( bool preferSharers = true );
void lock_shared();
bool try_lock_shared();
void unlock_shared();

private:
volatile int32_t m_rwCounters; // lower 16 bits: writers, including the one that owns the lock
                               // bits 16-30: count of threads having read-access or waiting for read-access
                               // sign-bit: readers active
thread_id m_owningThreadId;
volatile uint32_t m_recursionCount;
semaphore m_releaseWriterSem;
semaphore m_releaseReadersSem;
};

inline
shared_mutex::shared_mutex()
{
m_rwCounters = 0;
m_recursionCount = 0;
}

inline
shared_mutex::~shared_mutex()
{
assert(!m_owningThreadId.is_owned());
assert(m_rwCounters == 0);
assert(m_recursionCount == 0);
}

void shared_mutex::lock()
{
thread_id myThreadId;
if( m_owningThreadId == myThreadId.set_to_current() )
{
++m_recursionCount;
return;
}

int32_t rwCounters,
oldRwCounters;
for( rwCounters = m_rwCounters; ; rwCounters = oldRwCounters )
{
assert((rwCounters & 0x0FFFF) != 0x0FFFF);
if( (oldRwCounters = CAS( &m_rwCounters, rwCounters + 1, rwCounters )) == rwCounters )
if( oldRwCounters == 0 )
{
m_owningThreadId = myThreadId;
return;
}
else
{
m_releaseWriterSem.forced_wait();
m_owningThreadId = myThreadId;
return;
}
}
}

bool shared_mutex::try_lock()
{
if( m_owningThreadId == thread_id().set_to_current() )
{
++m_recursionCount;
return true;
}

return m_rwCounters == 0 && CAS( &m_rwCounters, 1, 0 ) == 0;
}

void shared_mutex::unlock( bool preferSharers )
{
uint32_t recursionCount;
assert(m_owningThreadId.is_current());
if( (recursionCount = m_recursionCount) != 0 )
{
m_recursionCount = recursionCount - 1;
return;
}
m_owningThreadId.set_unowned();

int32_t rwCounters,
oldRwCounters;
for( rwCounters = m_rwCounters; ; rwCounters = oldRwCounters )
{
assert(rwCounters > 0);
if( (rwCounters & 0x7FFF0000) && (preferSharers || (rwCounters & 0x0FFFF) == 0) )
{
if( (oldRwCounters = CAS( &m_rwCounters, (rwCounters - 1) | (int32_t)0x80000000u, rwCounters )) == rwCounters )
{
m_releaseReadersSem.forced_release( (unsigned)((rwCounters & 0x7FFF0000) >> 16) );
return;
}
}
else
if( (oldRwCounters = CAS( &m_rwCounters, rwCounters - 1, rwCounters )) == rwCounters )
{
if( (oldRwCounters & 0x0FFFF) > 1 )
m_releaseWriterSem.release( 1 );
return;
}
}
}

void shared_mutex::lock_shared()
{
int32_t rwCounters,
oldRwCounters;
for( rwCounters = m_rwCounters; ; rwCounters = oldRwCounters )
{
assert((rwCounters & 0x7FFF0000) != 0x7FFF0000);
// even shared access can be recursive, but in this case the sharer-count will increment
if( rwCounters <= 0 )
{
if( (oldRwCounters = CAS( &m_rwCounters, (rwCounters + 0x10000) | (int32_t)0x80000000, rwCounters )) == rwCounters )
return;
}
else
if( (oldRwCounters = CAS( &m_rwCounters, (rwCounters + 0x10000), rwCounters )) == rwCounters )
{
m_releaseReadersSem.forced_wait();
return;
}
}
}

bool shared_mutex::try_lock_shared()
{
int32_t rwCounters,
oldRwCounters;
for( rwCounters = m_rwCounters; ; rwCounters = oldRwCounters )
{
assert((rwCounters & 0x7FFF0000) != 0x7FFF0000);
// even shared access can be recursive, but in this case the sharer-count will increment
if( rwCounters > 0 )
return false;
if( (oldRwCounters = CAS( &m_rwCounters, (rwCounters + 0x10000) | (int32_t)0x80000000, rwCounters )) == rwCounters )
return true;
}
}

void shared_mutex::unlock_shared()
{
int32_t rwCounters,
oldRwCounters;
for( rwCounters = m_rwCounters; ; rwCounters = oldRwCounters )
{
assert(rwCounters < 0 && (rwCounters & 0x7FFF0000) != 0);
if( (rwCounters & 0x7FFF0000) == 0x10000 )
{
if( (oldRwCounters = CAS( &m_rwCounters, (rwCounters - 0x10000) & (int32_t)0x7FFFFFFF, rwCounters )) == rwCounters )
{
if( (rwCounters & 0x0FFFF) != 0 )
m_releaseWriterSem.forced_release( 1 );
return;
}
}
else
if( (oldRwCounters = CAS( &m_rwCounters, rwCounters - 0x10000, rwCounters )) == rwCounters )
return;
}
}

Bonita Montero

unread,
Apr 4, 2019, 8:00:05 AM4/4/19
to
Hm, maybe I should add fences to help g++ not to reorder memory
operations. MSVC doesn't need this as _InterlockedCompareExchange
has implicit acquire and release behaviour.

Bonita Montero

unread,
Apr 4, 2019, 11:49:14 AM4/4/19
to
> Hm, maybe I should add fences to help g++ not to reorder memory
> operations. MSVC doesn't need this as _InterlockedCompareExchange
> has implicit acquire and release behaviour.

Is there any gcc-"intrinsic" for _logical_ barriers? I.e. an intrinsic
that keeps the compiler away from reordering loads and stores, but not
by _physically_ having a barrier like "xchg [mem], reg". The physical
barrier is already done by the __sync_val_compare_and_swap intrinsic.
I know that I could do a "__asm__ __volatile__ ("" ::: "memory");",
but that's a full barrier, and I rather need separate acquire and
release barriers.

David Brown

unread,
Apr 4, 2019, 3:14:47 PM4/4/19
to
Have you considered using a C++11 fence?

In gcc, asm volatile ("" ::: "memory") is a compiler barrier. It tells
the compiler to make sure anything that is due to be written out to
memory from registers, gets written out, and it forgets anything about
data read in from memory to registers. If you think of the cpu
registers as a "level -1 cache", it is a level -1 cache flush operation.

This is /not/ the same as fences, which affect the order and
synchronisation of data between threads.

The gcc compiler barrier is usually a cheap barrier - it emits no
hardware synchronisation instructions, and mostly just stores data that
would have been stored sooner or later anyway.

If you need a barrier that only affects certain variables, that is also
possible.
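In C++11 terms the split could be sketched as follows:
std::atomic_signal_fence constrains only the compiler (roughly the
portable analogue of the asm barrier above, with separate acquire and
release flavours), while std::atomic_thread_fence additionally orders
the operations with respect to other threads:

```cpp
#include <atomic>

// Compiler-only ordering: no fence instruction is emitted, the
// optimizer is just forbidden to move loads/stores across the call.
void compiler_barrier_acquire() { std::atomic_signal_fence(std::memory_order_acquire); }
void compiler_barrier_release() { std::atomic_signal_fence(std::memory_order_release); }

// Inter-thread ordering: may emit a hardware fence where the target
// architecture needs one.
void full_fence() { std::atomic_thread_fence(std::memory_order_acq_rel); }
```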

Chris M. Thomasson

unread,
Apr 4, 2019, 4:16:29 PM4/4/19
to
On 4/4/2019 2:56 AM, Bonita Montero wrote:
> I've re-implemented the class more usable and POSIX-compilable:
[...]

Need to actually compile this, and take a closer look. However, one
little point. Beware of the fact that sem_wait can return EINTR, and
needs to be waited on again in a loop.

https://pubs.opengroup.org/onlinepubs/7908799/xsh/sem_wait.html


> inline
> bool semaphore::wait()
> {
>   return sem_wait( &m_sem ) == 0;
> }
[...]

Chris M. Thomasson

unread,
Apr 4, 2019, 4:18:19 PM4/4/19
to
On 4/4/2019 1:16 PM, Chris M. Thomasson wrote:
> On 4/4/2019 2:56 AM, Bonita Montero wrote:
>> I've re-implemented the class more usable and POSIX-compilable:
> [...]
>
> Need to actually compile this, and take a closer look. However, one
> little point. Beware of the fact that sem_wait can return EINTR, and
> needs to be waited on again in a loop.

More precisely sem_wait can fail and errno can be equal to EINTR. This
means we have to try the sem_wait again.
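Such a retry loop could be sketched like this (hypothetical wrapper
name):

```cpp
#include <semaphore.h>
#include <cerrno>

// sem_wait can fail with errno == EINTR when a signal interrupts it;
// that is not a real failure, so simply restart the wait.
inline bool sem_wait_noeintr(sem_t* sem)
{
    int rc;
    while ((rc = sem_wait(sem)) == -1 && errno == EINTR)
        ; // interrupted: try again
    return rc == 0;
}
```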

Bonita Montero

unread,
Apr 5, 2019, 7:01:15 AM4/5/19
to
> Have you considered using a C++11 fence?

The C++11-fence results in a logical as well as a physical fence.

> In gcc, asm volatile ("" ::: "memory") is a compiler barrier.

Yes, it is a logical barrier, but it has acquire as well as release
behaviour.

> This is /not/ the same as fences, which affect the order and
> synchronisation of data between threads.

No, ''("" ::: "memory")'' is a _logical_ barrier.

Bonita Montero

unread,
Apr 5, 2019, 7:02:32 AM4/5/19
to
>> Need to actually compile this, and take a closer look. However,
>> one little point. Beware of the fact that sem_wait can return
>> EINTR, and needs to be waited on again in a loop.

I'm waiting in a loop in the forced_wait anyway.

David Brown

unread,
Apr 5, 2019, 8:22:31 AM4/5/19
to
Bonita, if you want to reply to me, then please learn to quote properly,
and cite properly, using standard Usenet conventions. You'll find it a
pain with Google's appallingly bad "groups" interface, so I recommend
getting a real news client and a real news server (Thunderbird and
news.eternal-september.org are free and popular). Proper tools make the
job extremely simple.

If you wish to continue with your disrespectful, non-standard and
unhelpful posting habits, then you can do so without me - and please do
not quote what I write without giving proper citations.

On 05/04/2019 13:01, Bonita Montero wrote:
>> Have you considered using a C++11 fence?
>
> The C++11-fence results in a logical as well as a physical fence.
>

That makes no sense - we are talking programming, not gardening. There
is no such concept as a "physical fence".

>> In gcc, asm volatile ("" ::: "memory") is a compiler barrier.
>
> Yes, it is a logical barrier, but it has acquire as well as release
> behaviour.
>

You asked for an "acquire and release barrier".

>> This is /not/ the same as fences, which affect the order and
>> synchronisation of data between threads.
>
> No, ''("" ::: "memory")'' is a _logical_ barrier.

I just said it is a compiler barrier, but not a fence. Why did you say
"no" to that?

Bonita Montero

unread,
Apr 5, 2019, 8:38:35 AM4/5/19
to
> That makes no sense - we are talking programming, not gardening.
> There is no such concept as a "physical fence".

There is!
A logical fence keeps the compiler away from reordering the load-/store
instructions. A physical fence keeps the processor away from doing this
in the instruction-stream. The fences you have with C++11 do both.


>>> In gcc, asm volatile ("" ::: "memory") is a compiler barrier.

>> Yes, it is a logical barrier, but it has acquire as well as release
>> behaviour.

> You asked for an "acquire and release barrier".

Separate!

David Brown

unread,
Apr 5, 2019, 10:28:20 AM4/5/19
to
On 05/04/2019 14:38, Bonita Montero wrote:
<snip>

What part of proper Usenet posting conventions seems to be troubling you?

When you can communicate politely in the same way as others in the
group, you will get far better answers.

Bonita Montero

unread,
Apr 5, 2019, 2:49:21 PM4/5/19
to
Don't you recognize your own posting parts?
I'm so sorry for you!