Experimental Read/Write Mutex


Chris M. Thomasson

Feb 16, 2019, 6:34:36 PM
to Scalable Synchronization Algorithms
Fwiw, here is a simple benchmark for an older read/write algorithm I invented:

https://pastebin.com/raw/xCBHY9qd
It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the code will test my algorithm. If not defined then it will test against std::shared_mutex. I am wondering if anybody can compile and run the code? Test it out on as many operating systems as you can. Can you please show the output?
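For example, on GCC something like the following should work (assuming the code below is saved as ct_rwmutex_test.cpp; the file name is arbitrary):

g++ -std=c++17 -O2 -pthread ct_rwmutex_test.cpp -o ct_rwmutex_test
./ct_rwmutex_test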


Thanks everybody. :^)

Here is the code:

/* Simple, crude read/write mutex test
   by: Chris M. Thomasson
__________________________________________*/



#include <thread>
#include <atomic>
#include <shared_mutex>
#include <condition_variable>
#include <iostream>
#include <functional>
#include <cassert>
#include <cstdlib>
#include <ctime>


#define THREADS 16UL
#define ITERS 10000000UL
#define COUNT (THREADS * ITERS)


// undefine to test std::shared_mutex
#define CT_TEST_FAST_MUTEX 1


// bare bones mutex/condvar based semaphore
struct ct_slow_semaphore
{
    unsigned long m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_slow_semaphore(unsigned long state) : m_state(state) {}

    void inc()
    {
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            ++m_state;
        }

        // notify outside of the critical section, so the woken
        // thread does not immediately block on m_mutex
        m_cond.notify_one();
    }

    void add(unsigned long addend)
    {
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_state += addend;
        }

        m_cond.notify_all();
    }

    void dec()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == 0) m_cond.wait(lock);
        --m_state;
    }
};




// bin-sema
struct ct_auto_reset_event
{
    bool m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_auto_reset_event() : m_state(false) {}

    void signal()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_state = true;
        m_cond.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == false) m_cond.wait(lock);
        m_state = false; // auto-reset
    }
};


// just a layer over an auto-reset event
struct ct_fast_mutex
{
    std::atomic<unsigned int> m_state;
    ct_auto_reset_event m_waitset;

    ct_fast_mutex() : m_state(0) {}

    void lock()
    {
        if (m_state.exchange(1, std::memory_order_acquire))
        {
            while (m_state.exchange(2, std::memory_order_acquire))
            {
                m_waitset.wait();
            }
        }
    }

    void unlock()
    {
        if (m_state.exchange(0, std::memory_order_release) == 2)
        {
            m_waitset.signal();
        }
    }
};
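// Fwiw, this three-state scheme (0 = unlocked, 1 = locked, no waiters,
// 2 = locked, possible waiters) is essentially the classic
// contended/uncontended mutex shape (compare Ulrich Drepper's
// "Futexes Are Tricky"), with a portable condvar-based event standing
// in for the futex.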



// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...
// Take a look at the lock_shared and unlock_shared functions

#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
    // shared state
    std::atomic<long> m_wrstate; // note: unused in this version
    std::atomic<long> m_count;
    std::atomic<long> m_rdwake;

    ct_slow_semaphore m_rdwset;
    ct_slow_semaphore m_wrwset;
    ct_fast_mutex m_wrlock;


    ct_rwmutex() :
        m_wrstate(1),
        m_count(RWMUTEX_COUNT_MAX),
        m_rdwake(0),
        m_rdwset(0),
        m_wrwset(0) {
    }


    // READ, pretty slim...
    void lock_shared()
    {
        if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
        {
            // We need to wait for a writer.
            m_rdwset.dec();
        }
    }

    void unlock_shared()
    {
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
        {
            // There is a writer...
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
            {
                // ...and we are the last reader it is waiting on.
                m_wrwset.inc();
            }
        }
    }


    // WRITE, more hefty
    void lock()
    {
        // Acquire exclusive access among writers.
        m_wrlock.lock();

        // Gain write access wrt m_count.
        long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, std::memory_order_acquire);

        if (count < RWMUTEX_COUNT_MAX)
        {
            // We detected readers; RWMUTEX_COUNT_MAX - count of them.
            long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, std::memory_order_acquire);

            if (rdwake + RWMUTEX_COUNT_MAX - count)
            {
                // Wait for the remaining readers to leave.
                m_wrwset.dec();
            }
        }
    }

    // write unlock
    void unlock()
    {
        // Release write access wrt m_count.
        long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, std::memory_order_release);

        if (count < 0)
        {
            // We need to wake -count readers.
            m_rdwset.add(-count);
        }

        // Release exclusive access.
        m_wrlock.unlock();
    }
};
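// A worked example of the counter arithmetic (illustrative numbers):
// suppose 3 readers hold the lock, so m_count == RWMUTEX_COUNT_MAX - 3.
// A writer's fetch_add(-RWMUTEX_COUNT_MAX) returns RWMUTEX_COUNT_MAX - 3
// and leaves m_count at -3. Since the return value is below
// RWMUTEX_COUNT_MAX, the writer knows RWMUTEX_COUNT_MAX - count == 3
// readers are still inside, adds 3 to m_rdwake, and blocks on m_wrwset
// unless those readers already left and drove rdwake + 3 to zero.
// Each reader that unlocks while m_count is negative decrements
// m_rdwake, and the one that takes it from 1 to 0 wakes the writer.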


struct ct_shared
{
    std::atomic<unsigned long> m_state;

#if defined (CT_TEST_FAST_MUTEX)
    ct_rwmutex m_std_rwmutex;
#else
    std::shared_mutex m_std_rwmutex;
#endif

    ct_shared() : m_state(0) {}
};


void ct_thread(ct_shared& shared, std::size_t index)
{
    for (unsigned int i = 0; i < ITERS; ++i)
    {

        shared.m_std_rwmutex.lock();
        if (i % 256 == 0) std::this_thread::yield();
        shared.m_state += 1;
        shared.m_std_rwmutex.unlock();


        shared.m_std_rwmutex.lock_shared();
        if (i % 512 == 0) std::this_thread::yield();
        //shared.m_state += 1;
        shared.m_std_rwmutex.unlock_shared();

    }
}


int main()
{
    ct_shared shared;

    std::cout << "Testing: ";

#if defined (CT_TEST_FAST_MUTEX)
    std::cout << "Chris M. Thomasson's Experimental Read/Write Mutex\n\n";
#else
    std::cout << "std::shared_mutex\n\n";
#endif

    {
        std::thread threads[THREADS];

        std::clock_t start = std::clock();

        for (std::size_t i = 0; i < THREADS; ++i)
        {
            threads[i] = std::thread(ct_thread, std::ref(shared), i);
        }

        for (std::size_t i = 0; i < THREADS; ++i)
        {
            threads[i].join();
        }

        // note: std::clock() measures CPU time on most non-Windows
        // platforms, not wall time, so these numbers grow with the
        // number of threads
        std::clock_t diff = std::clock() - start;

        unsigned long msec = diff * 1000 / CLOCKS_PER_SEC;

        std::cout << "msec = " << msec << "\n";
    }

    std::cout << "shared.m_state = " << shared.m_state << "\n";
    std::cout << "\n\nFin!\n";

    assert(shared.m_state == COUNT);

    return 0;
}

Dmitry Vyukov

Feb 17, 2019, 2:24:42 AM
to lock...@googlegroups.com
Hey Chris!

Here are my results on i7-8650U (4 HT cores) on Debian 4.19.16-1 gcc 7.3.0

std::shared_mutex
msec = 56118
msec = 49149
msec = 69416
msec = 68737
msec = 59174
msec = 65915

ct_rwmutex
msec = 62772
msec = 71139
msec = 68110
msec = 66038
msec = 60436
msec = 74238

I can also test on 2 CPU system with 88 hardware threads in total, but
the bench hardcodes 16 threads (there is std::hardware_concurrency()
or something like that).
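E.g. a minimal sketch of the dynamic sizing (assuming the fixed thread array becomes a vector):

    std::size_t threads_n = std::thread::hardware_concurrency();
    std::vector<std::thread> threads(threads_n);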

I've found that it's critical to model realistic/target ratio between
reads/writes and amount of local work and work inside of critical
section in such benchmarks. Otherwise in 100% synthetic hammering
usually the worst mutex shows the best results. Synthetic benchmarks
produce extreme contention, so a mutex that blocks threads as much as
possible and allows as few threads as possible to work, shows the best
result because contention is minimized. The best would be to run all
threads sequentially one-by-one. But that's counterproductive for
real life because blocked threads don't do useful work either.

Also this benchmark has fixed amount of work per thread, so I suspect
it may be subject to load imbalance when an unlucky outlier delays
total result. A better option would be to run all threads for X
seconds and then set a global var for all of them to stop and then
join all of them and measure number of iterations.
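A minimal sketch of that scheme (names are illustrative, not from the benchmark):

    std::atomic<bool> g_run(true);

    void worker(std::atomic<std::uint64_t>& total)
    {
        std::uint64_t iters = 0;
        while (g_run.load(std::memory_order_relaxed)) { /* one op */ ++iters; }
        total.fetch_add(iters, std::memory_order_relaxed);
    }

    // main: start the workers, sleep for X seconds, then
    // g_run.store(false, std::memory_order_relaxed), join, report total.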

Also for me it lacked #include <climits>.

Chris M. Thomasson

Feb 17, 2019, 5:51:04 PM
to Scalable Synchronization Algorithms
On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson <cri...@charter.net> wrote:
>
> Fwiw, here is a simple benchmark for an older read/write algorithm I invented:
>
>
> https://pastebin.com/raw/xCBHY9qd
>
>
> It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the code will test my algorithm. If not defined then it will test against std::shared_mutex. I am wondering if anybody can compile and run the code? Test it out on as many operating systems as you can. Can you please show the output?
>
>
> Thanks everybody. :^)
>
> Here is the code:
>
> /* Simple, crude read/write mutex test
>    by: Chris M. Thomasson
> __________________________________________*/
[...]
 

Hey Chris!

Here are my results on i7-8650U (4 HT cores) on Debian 4.19.16-1 gcc 7.3.0

std::shared_mutex
msec = 56118
msec = 49149
msec = 69416
msec = 68737
msec = 59174
msec = 65915

ct_rwmutex
msec = 62772
msec = 71139
msec = 68110
msec = 66038
msec = 60436
msec = 74238


Thank you so much! Okay, not great, but all that "radically" hardcore terrible when compared to a std::shared_mutex on your end. Also, my work can benefit from many sessions of padding and alignment therapy wrt data-structure layout.

 

I can also test on 2 CPU system with 88 hardware threads in total, but
the bench hardcodes 16 threads (there is std::hardware_concurrency()
or something like that).

Indeed! Will correct this issue. Make it dynamic. :^)

Although, it can be beneficial to create more threads just to see how the algorithm handles these cases of "oversubscribed" contention.

 

I've found that it's critical to model realistic/target ratio between
reads/writes and amount of local work and work inside of critical
section in such benchmarks. Otherwise in 100% synthetic hammering
usually the worst mutex shows the best results. Synthetic benchmarks
produce extreme contention, so a mutex that blocks threads as much as
possible and allows as few threads as possible to work, shows the best
result because contention is minimized. The best would be to run all
threads sequentially one-by-one. But that's counter productive for
real life because blocked threads don't do useful work too.

Agreed. Just wondering why it performs so much better in some high contention scenarios. Load imbalance is a factor for sure.

https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ

https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ
 

Also this benchmark has fixed amount of work per thread, so I suspect
it may be subject to load imbalance when an unlucky outlier delays
total result. A better option would be to run all threads for X
seconds and then set a global var for all of them to stop and then
join all of them and measure number of iterations.

Indeed. A benchmark that just models a list that readers iterate over, and writers push and pop from. Then we can show how many operations, reads or writes, they performed per second. So it would be the number of reads and writes per second, per thread.

This reminds me of some of Joe Seigh's tests back on comp.programming.threads.

 

Also for me it lacked #include <climits>.

Fixed that. Thanks again:

https://pastebin.com/raw/xCBHY9qd
(reload the page...)


Fwiw, I am wondering what you think about the algorithm itself? Is it crap, or decent? :^)

Chris M. Thomasson

Feb 17, 2019, 6:32:08 PM
to Scalable Synchronization Algorithms


On Sunday, February 17, 2019 at 2:51:04 PM UTC-8, Chris M. Thomasson wrote:
On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson <cri...@charter.net> wrote:
 
[...]
Thank you so much! Okay, not great, but all that "radically" hardcore terrible when compared to a std::shared_mutex on your end. Also, my work can benefit from many sessions of padding and alignment therapy wrt data-structure layout.
[...]

I meant to say it is _not_ all that "radically" hardcore terrible when compared to a std::shared_mutex on your end.

Well, let's wait and see for the next, more realistic benchmark. lol. ;^)

However, I do like the way my algorithm acquires and releases read access:

_____
    // READ, pretty slim...
    void lock_shared()
    {
        if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
        {
            m_rdwset.dec();
        }
    }

    void unlock_shared()
    {
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
        {
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
            {
                m_wrwset.inc();
            }
        }
    }
______

Not all "that" bad? ;^)

Dmitry Vyukov

Feb 18, 2019, 3:13:21 PM
to lock...@googlegroups.com
Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(),
4*std::hardware_concurrency().


>> I've found that it's critical to model realistic/target ratio between
>> reads/writes and amount of local work and work inside of critical
>> section in such benchmarks. Otherwise in 100% synthetic hammering
>> usually the worst mutex shows the best results. Synthetic benchmarks
>> produce extreme contention, so a mutex that blocks threads as much as
>> possible and allows as few threads as possible to work, shows the best
>> result because contention is minimized. The best would be to run all
>> threads sequentially one-by-one. But that's counterproductive for
>> real life because blocked threads don't do useful work either.
>
>
> Agreed. Just wondering why it performs so much better in some high contention scenarios. Load imbalance is a factor for sure.
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ

I don't need to tell you that performance of sync primitives can be
very non-linear and counter-intuitive :)

Perhaps the other mutexes were not super dumb, so they did not perform
the best under very high contention.


>> Also this benchmark has fixed amount of work per thread, so I suspect
>> it may be subject to load imbalance when an unlucky outlier delays
>> total result. A better option would be to run all threads for X
>> seconds and then set a global var for all of them to stop and then
>> join all of them and measure number of iterations.
>
>
> Indeed. A benchmark that just models a list that readers iterate over, and writers push and pop from. Then we can show how many operations, reads or writes, they performed per second. So it would be the number of reads and writes per second, per thread.
>
> This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
>
>
>>
>>
>> Also for me it lacked #include <climits>.
>
>
> Fixed that. Thanks again:
>
> https://pastebin.com/raw/xCBHY9qd
> (reload the page...)
>
>
> Fwiw, I am wondering what you think about the algorithm itself? Is it crap, or decent? :^)


I think it's one of the best algorithms out there.
The complete fairness for both readers and writers is notable.
And the wait-free fast-path for readers. It still suffers from high
cache line contention even on read-only workload, but it's as good as
one can get with a centralized design (i.e. not per-thread/cpu
distributed stuff which has own problems). I would expect a
CAS-loop-based read lock to degrade significantly under high read
load.
Btw your algorithm is used as the standard Go RWMutex for the past 7+ years:
https://github.com/golang/go/commit/daaf29cf932
(I should have mentioned your authorship somewhere!)

As for potential improvements I can only think of optimizing
uncontented writer lock/unlock to be 1 RMW each, i.e. not offloading
writer competition resolution to a separate mutex, but rather
incorporate it into the same atomic variable readers and writers use
for synchronization with each other. Do you think it's possible? With
CAS-loop? Or maybe with some smart atomic_fetch_or? Wait,
atomic_fetch_or is compiled to a CAS loop on x86 anyway. These new
C/C++ atomic interfaces sometimes make me forget that.
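E.g. something like this could claim write access in one CAS when the mutex is completely idle (just a sketch, not integrated with the rest of the algorithm; making it coexist with the reader slow-path is exactly the hard part):

    // m_count == RWMUTEX_COUNT_MAX means no readers and no writer.
    bool try_wrlock_fast(std::atomic<long>& m_count)
    {
        long expected = RWMUTEX_COUNT_MAX;
        return m_count.compare_exchange_strong(
            expected, 0, std::memory_order_acquire);
    }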

Chris M. Thomasson

Feb 18, 2019, 11:50:21 PM
to Scalable Synchronization Algorithms
Fwiw, here is a more realistic benchmark, in the Relacy stage for verification purposes:


Looking good, and passing tests so far. Here is the code, using Relacy:
___________________________
// Experimental Read-Write Mutex Test
// More "realistic" test, in Relacy...
// By: Chris M. Thomasson
//_______________________________________________


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC


#include <relacy/relacy_std.hpp>
#include <iostream>


// Simple macro based redirection of the verbose std membars.
#define CT_MB_ACQ std::memory_order_acquire
#define CT_MB_REL std::memory_order_release
#define CT_MB_RLX std::memory_order_relaxed
#define CT_MB_ACQ_REL std::memory_order_acq_rel
#define CT_MB_SEQ_CST std::memory_order_seq_cst
#define CT_MB(mp_0) std::atomic_thread_fence(mp_0)


#define READERS 4
#define WRITERS 2
#define THREADS (READERS + WRITERS)
#define ITERS 3




////////////////


// bare bones mutex/condvar based semaphore
struct ct_slow_semaphore
{
    VAR_T(unsigned long) m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_slow_semaphore(unsigned long state) : m_state(state) {}

    void inc()
    {
        m_mutex.lock($);
        ++VAR(m_state);
        m_mutex.unlock($);
        m_cond.notify_one($);
    }

    void add(unsigned long addend)
    {
        m_mutex.lock($);
        VAR(m_state) += addend;
        m_mutex.unlock($);
        m_cond.notify_all($);
    }

    void dec()
    {
        m_mutex.lock($);
        while (VAR(m_state) == 0) m_cond.wait(m_mutex, $);
        --VAR(m_state);
        m_mutex.unlock($);
    }
};




// bin-sema
struct ct_auto_reset_event
{
    bool m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_auto_reset_event() : m_state(false) {}

    void signal()
    {
        m_mutex.lock($);
        m_state = true;
        m_cond.notify_one($);
        m_mutex.unlock($);
    }

    void wait()
    {
        m_mutex.lock($);
        while (m_state == false) m_cond.wait(m_mutex, $);
        m_state = false; // auto-reset
        m_mutex.unlock($);
    }
};




// just a layer over an auto-reset event
struct ct_fast_mutex
{
    std::atomic<unsigned int> m_state;
    ct_auto_reset_event m_waitset;

    ct_fast_mutex() : m_state(0) {}

    void lock()
    {
        if (m_state.exchange(1, std::memory_order_acquire))
        {
            while (m_state.exchange(2, std::memory_order_acquire))
            {
                m_waitset.wait();
            }
        }
    }

    void unlock()
    {
        if (m_state.exchange(0, std::memory_order_release) == 2)
        {
            m_waitset.signal();
        }
    }
};



// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...
// Take a look at the lock_shared and unlock_shared functions

#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
    // shared state
    std::atomic<long> m_wrstate;
    std::atomic<long> m_count;
    std::atomic<long> m_rdwake;

    ct_slow_semaphore m_rdwset;
    ct_slow_semaphore m_wrwset;
    ct_fast_mutex m_wrlock;


    ct_rwmutex() :
        m_wrstate(1),
        m_count(RWMUTEX_COUNT_MAX),
        m_rdwake(0),
        m_rdwset(0),
        m_wrwset(0) {
    }


    // READ, pretty slim...
    void lock_shared()
    {
        if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
        {
            m_rdwset.dec();
        }
    }

    void unlock_shared()
    {
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
        {
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
            {
                m_wrwset.inc();
            }
        }
    }


    // WRITE, more hefty
    void lock()
    {
        m_wrlock.lock();

        long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, std::memory_order_acquire);

        if (count < RWMUTEX_COUNT_MAX)
        {
            long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, std::memory_order_acquire);

            if (rdwake + RWMUTEX_COUNT_MAX - count)
            {
                m_wrwset.dec();
            }
        }
    }

    // write unlock
    void unlock()
    {
        long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, std::memory_order_release);

        if (count < 0)
        {
            m_rdwset.add(-count);
        }

        m_wrlock.unlock();
    }
};



////////////////


struct ct_simple_stack
{
    struct node
    {
        VAR_T(node*) m_next;
        VAR_T(unsigned int) m_tid;

        node(unsigned int tid) : m_tid(tid) {}
    };

    VAR_T(node*) m_head;

    ct_simple_stack() : m_head(nullptr) {}

    void push(node* n)
    {
        VAR(n->m_next) = VAR(m_head);
        VAR(m_head) = n;
    }

    node* pop()
    {
        node* n = VAR(m_head);

        if (n)
        {
            VAR(m_head) = VAR(n->m_next);
        }

        return n;
    }


    node* flush()
    {
        node* n = VAR(m_head);
        VAR(m_head) = nullptr;
        return n;
    }

};


void ct_destroy_nodes(ct_simple_stack& s)
{
    ct_simple_stack::node* n = s.flush();

    while (n)
    {
        ct_simple_stack::node* next = VAR(n->m_next);
        delete n;
        n = next;
    }
}



// Relacy Stack Test...
struct ct_fast_mutex_test
    : rl::test_suite<ct_fast_mutex_test, THREADS>
{
    ct_rwmutex g_rwmutex;
    ct_simple_stack g_stack;

    void before()
    {

    }

    void after()
    {
        ct_destroy_nodes(g_stack);
    }

    // reader
    void reader(unsigned int tidx)
    {
        g_rwmutex.lock_shared();

        for (unsigned int i = 0; i < ITERS; ++i)
        {
            ct_simple_stack::node* h = VAR(g_stack.m_head);

            while (h)
            {
                ct_simple_stack::node* next = VAR(h->m_next);

                if (i % 512 == 0)
                {
                    rl::yield(1, $);
                }

                h = next;
            }
        }

        g_rwmutex.unlock_shared();
    }


    // writer
    void writer(unsigned int tidx)
    {
        g_rwmutex.lock();
        g_stack.push(new ct_simple_stack::node(tidx));
        g_stack.push(new ct_simple_stack::node(tidx));
        g_rwmutex.unlock();

        g_rwmutex.lock();
        ct_simple_stack::node* n = g_stack.pop();
        g_rwmutex.unlock();

        if (n)
        {
            // destroy
            delete n;
        }
    }


    void thread(unsigned int tidx)
    {
        if (tidx < READERS)
        {
            reader(tidx);
        }

        else
        {
            writer(tidx);
        }
    }
};



// Test away... Or fly? Humm...
int main()
{
    {
        rl::test_params p;

        p.iteration_count = 10000000;
        //p.execution_depth_limit = 33333;
        //p.search_type = rl::sched_bound;
        //p.search_type = rl::fair_full_search_scheduler_type;
        //p.search_type = rl::fair_context_bound_scheduler_type;

        rl::simulate<ct_fast_mutex_test>(p);
    }

    return 0;
}
___________________________

Well, readers iterate, and writers push and pop. What do you think about this test, Dmitry? Good "enough"?

This is the basic logic of the test I am going to write up in 100% pure standard C++11. Well, C++17 when testing against std::shared_mutex... ;^)

Chris M. Thomasson

Feb 19, 2019, 7:31:37 PM
to Scalable Synchronization Algorithms
On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson <cri...@charter.net> wrote:
>
> On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson <cri...@charter.net> wrote:
[...]

>> I can also test on 2 CPU system with 88 hardware threads in total, but
>> the bench hardcodes 16 threads (there is std::hardware_concurrency()
>> or something like that).
>
>
> Indeed! Will correct this issue. Make it dynamic. :^)
>
> Although, it can be beneficial to create more threads just to see how the algorithm handles these cases of "oversubscribed" contention.

Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(),
4*std::hardware_concurrency().

Indeed.

 
>> I've found that it's critical to model realistic/target ratio between
>> reads/writes and amount of local work and work inside of critical
>> section in such benchmarks. Otherwise in 100% synthetic hammering
>> usually the worst mutex shows the best results. Synthetic benchmarks
>> produce extreme contention, so a mutex that blocks threads as much as
>> possible and allows as few threads as possible to work, shows the best
>> result because contention is minimized. The best would be to run all
>> threads sequentially one-by-one. But that's counterproductive for
>> real life because blocked threads don't do useful work either.
>
>
> Agreed. Just wondering why it performs so much better in some high contention scenarios. Load imbalance is a factor for sure.
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ

I don't need to tell you that performance of sync primitives can be
very non-linear and counter-intuitive :)

Perhaps the other mutexes were not super dumb, so they did not perform
the best under very high contention.

:^D


>> Also this benchmark has fixed amount of work per thread, so I suspect
>> it may be subject to load imbalance when an unlucky outlier delays
>> total result. A better option would be to run all threads for X
>> seconds and then set a global var for all of them to stop and then
>> join all of them and measure number of iterations.
>
>
> Indeed. A benchmark that just models a list that readers iterate over, and writers push and pop from. Then we can show how many operations, reads or writes, they performed per second. So it would be the number of reads and writes per second, per thread.
>
> This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
>
>
>>
>>
>> Also for me it lacked #include <climits>.
>
>
> Fixed that. Thanks again:
>
> https://pastebin.com/raw/xCBHY9qd
> (reload the page...)
>
>
> Fwiw, I am wondering what you think about the algorithm itself? Is it crap, or decent? :^)


I think it's one of the best algorithms out there.

Thanks Dmitry. So far, wrt a general purpose centralized rwmutex, it is not all that bad. :^)

 
The complete fairness for both readers and writers is notable.

Indeed. My algorithm has a strict bakery style about it. In this case, the readers get the "main benefit" wrt the wait-free fast-path, assuming fetch_add is a single operation in hardware like x86, not a CAS or LL/SC loop in software. Well, that is great because who uses a rwmutex for writer performance anyway? ;^)

 
And the wait-free fast-path for readers.

Big time. Imvvho, this is the most important aspect. Almost ready with a much better benchmark that should show this off quite well.

 
It still suffers from high
cache line contention even on read-only workload, but it's as good as
one can get with a centralized design (i.e. not per-thread/cpu
distributed stuff which has own problems).
 
Agreed.

 
I would expect a
CAS-loop-based read lock to degrade significantly under high read
load.

Agreed. It would not be able to really compete wrt the explicit loop vs a loop-free fetch_add under periods of heavy, sustained read activity. Think of a shi% load of read-only database searches hitting the system all at once.

 
Btw your algorithm is used as the standard Go RWMutex for the past 7+ years:
https://github.com/golang/go/commit/daaf29cf932
(I should have mentioned your authorship somewhere!)

Oh wow. I am glad that you used it. Wrt proper attribution, well, you just did that here. Thanks. Wrt the code, well, perhaps you can stick my name in there somewhere, if at all possible?


As for potential improvements I can only think of optimizing
uncontented writer lock/unlock to be 1 RMW each, i.e. not offloading
writer competition resolution to a separate mutex, but rather
incorporate it into the same atomic variable readers and writers use
for synchronization with each other. Do you think it's possible? With
CAS-loop? Or maybe with some smart atomic_fetch_or? Wait,
atomic_fetch_or is compiled to a CAS loop on x86 anyway. These new
C/C++ atomic interfaces sometimes make me forget that.


Humm... I can artificially limit the range of m_count and gain some extra state. Right now, it uses LONG_MAX. Okay, well, what about using LONG_MAX - sizeof(metadata), or something. Then, the writes can use CAS, or even fetch_add in a special way to gain write access using ct_rwmutex::m_count directly. Need to think on this. Perhaps, if I come up with something, it can be added to the source code, with my name in there, somewhere? Thanks again.

Chris M. Thomasson

Feb 20, 2019, 12:58:23 AM
to Scalable Synchronization Algorithms
On Monday, February 18, 2019 at 8:50:21 PM UTC-8, Chris M. Thomasson wrote:
Fwiw, here is a more realistic benchmark, in the Relacy stage for verification purposes:




Fwiw, I have almost completed my new benchmark in pure C++11. The timings that I am getting measure a different thing: how many reads and writes can be completed within a given time frame. Well, here are some initial results:

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

Raw Reads: 3086724
Raw Writes: 59651
reads_per_tick = 308309
writes_per_tick = 5958
Ticks = 10.0118


Testing: std::shared_mutex

Raw Reads: 110922
Raw Writes: 29502
reads_per_tick = 11071
writes_per_tick = 2944
Ticks = 10.019


For a simple test of 10 seconds, my algorithm allows for 3086724 reads and 59651 writes to be completed, while std::shared_mutex allows for only 110922 reads and 29502 writes. Damn. Look at the difference in the read rate per tick! Wow. I can't wait for you, and others, to test it and tell me the results, good or bad. Heck, std::shared_mutex should beat my work, wouldn't ya think? Humm...

Chris M. Thomasson

Feb 20, 2019, 1:51:23 AM
to Scalable Synchronization Algorithms
Fwiw, I wrote a crude new benchmark that measures how many reads and writes can be performed in a given amount of time. My algorithm vs std::shared_mutex. So, we are _primarily_ looking for how many reads can be performed in this test over 60 seconds. The number of threads is variable and is determined by std::thread::hardware_concurrency() * THREADS, with THREADS set at 8 in the test. So on my system the setup is:
___________________________________ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___________________________________ 

This is going to be different for each system. Readers iterate a shared linked list with 1000000 nodes in this test. Writers pop and push items, and never use new or delete. Well, so far, the timings I am getting on my end using MSVC 2017 are:


Testing: Chris M. Thomasson's Experimental Read/Write Mutex 

___________________________________ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___________________________________ 


___________________________________ 
Raw Reads: 18046464 
Raw Writes: 347498 
reads_per_tick = 300693 
writes_per_tick = 5790 
Ticks = 60.0161 
___________________________________ 



Testing: std::shared_mutex 

___________________________________ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___________________________________ 


___________________________________ 
Raw Reads: 650006 
Raw Writes: 168741 
reads_per_tick = 10829 
writes_per_tick = 2811 
Ticks = 60.0193 
___________________________________ 



As you can see, my algorithm is performing many more reads than std::shared_mutex. Anyway, here is my code: 

When you get some time, can you please give it a go? I want to see timings on other systems, the good, bad and ugly. ;^) Thanks. 

___________________________________ 
/* Crude Read/Write Mutex Benchmark 

   This tests my algorithm ct_rwmutex vs. std::shared_mutex. 

   It shows how many reads and writes can be performed within 
   a fixed amount of time. 

   by: Chris M. Thomasson 
__________________________________________*/ 



#include <thread> 
#include <atomic> 
#include <shared_mutex> 
#include <condition_variable> 
#include <iostream> 
#include <functional> 
#include <chrono> 
#include <cassert> 
#include <cstdlib> 
#include <ctime> 
#include <climits> 
#include <cstdint> 
#include <vector> 


// undefine to test std::shared_mutex 
#define CT_TEST_FAST_MUTEX 1 


#define THREADS 8 // multiplied by std::hardware_concurrency 
#define NODE_PRIME 1000000 // number of nodes 
#define CRUDE_CACHE_PAD 256 // crude cache pad 
#define TEST_DURATION_SEC 60 // number of seconds for the test 


// bare bones mutex/condvar based semaphore 
struct ct_slow_semaphore
{
    unsigned long m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_slow_semaphore(unsigned long state) : m_state(state) {}

    void inc()
    {
        m_mutex.lock();
        ++m_state;
        m_mutex.unlock();
        m_cond.notify_one();
    }

    void add(unsigned long addend)
    {
        m_mutex.lock();
        m_state += addend;
        m_mutex.unlock();
        m_cond.notify_all();
    }

    void dec()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == 0) m_cond.wait(lock);
        --m_state;
    }
};


// bin-sema
struct ct_auto_reset_event
{
    bool m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_auto_reset_event() : m_state(false) {}

    void signal()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_state = true;
        m_cond.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == false) m_cond.wait(lock);
        m_state = false; // auto-reset
    }
};


// just a layer over an auto-reset event
struct ct_fast_mutex
{
    std::atomic<unsigned int> m_state;
    ct_auto_reset_event m_waitset;

    ct_fast_mutex() : m_state(0) {}

    void lock()
    {
        if (m_state.exchange(1, std::memory_order_acquire))
        {
            while (m_state.exchange(2, std::memory_order_acquire))
            {
                m_waitset.wait();
            }
        }
    }

    void unlock()
    {
        if (m_state.exchange(0, std::memory_order_release) == 2)
        {
            m_waitset.signal();
        }
    }
};


// The number of readers _must_ never exceed LONG_MAX!

#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
    unsigned char m_crude_cache_pad_0[CRUDE_CACHE_PAD];

    // shared state 
    std::atomic<long> m_count; 
    unsigned char m_crude_cache_pad_1[CRUDE_CACHE_PAD]; 

    std::atomic<long> m_rdwake; 
    unsigned char m_crude_cache_pad_2[CRUDE_CACHE_PAD]; 

    ct_slow_semaphore m_rdwset; 
    unsigned char m_crude_cache_pad_3[CRUDE_CACHE_PAD]; 

    ct_slow_semaphore m_wrwset; 
    unsigned char m_crude_cache_pad_4[CRUDE_CACHE_PAD]; 

    ct_fast_mutex m_wrlock; 
    unsigned char m_crude_cache_pad_5[CRUDE_CACHE_PAD]; 


    ct_rwmutex() : 
        m_count(RWMUTEX_COUNT_MAX), 
        m_rdwake(0), 
        m_rdwset(0), 
        m_wrwset(0) { 
    } 


    // READ, pretty slim... 
    void lock_shared() 
    { 
        if (m_count.fetch_add(-1, std::memory_order_acquire) < 1) 
        { 
            // We need to wait for a writer. 
            m_rdwset.dec(); 
        } 
    } 

    void unlock_shared() 
    { 
        if (m_count.fetch_add(1, std::memory_order_release) < 0) 
        { 
            // There is a writer 
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1) 
            { 
                // We need to wake the writer up 
                m_wrwset.inc(); 
            } 
        } 
    } 


    // WRITE, more hefty 
    void lock() 
    { 
        // Acquire exclusive access 
        m_wrlock.lock(); 

        // we are the only thread in here now. 

        // Gain write access wrt m_count 
        long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, std::memory_order_acquire); 

        // count can never be negative. 
        if (count < RWMUTEX_COUNT_MAX) 
        { 
            // We detected readers. 
            long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, std::memory_order_acquire); 

            if (rdwake + RWMUTEX_COUNT_MAX - count) 
            { 
                // Okay, we need to wait for all of the readers 
                // The number of readers is actually 
                // RWMUTEX_COUNT_MAX - count 
                m_wrwset.dec(); 
            } 
        } 
    } 

    // write unlock 
    void unlock() 
    { 
        // Release write access wrt m_count 
        long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, std::memory_order_release); 

        if (count < 0) 
        { 
            // We need to wake -count readers. 
            m_rdwset.add(-count); 
        } 

        // Release exclusive access 
        m_wrlock.unlock(); 
    } 
}; 


struct ct_simple_stack
{
    struct node 
    { 
        node* m_next; 
        unsigned int m_tid; 

        node(unsigned int tid) : m_tid(tid) {} 
    }; 

    node* m_head; 

    ct_simple_stack() : m_head(nullptr) {} 

    ~ct_simple_stack() 
    { 
        ct_simple_stack::node* n = flush(); 

        while (n) 
        { 
            ct_simple_stack::node* next = n->m_next; 
            delete n; 
            n = next; 
        } 
    } 

    void push(node* n) 
    { 
        n->m_next = m_head; 
        m_head = n; 
    } 

    node* pop() 
    { 
        node* n = m_head; 

        if (n) 
        { 
            m_head = n->m_next; 
        } 

        return n; 
    } 


    node* flush() 
    { 
        node* n = m_head; 
        m_head = nullptr; 
        return n; 
    } 

}; 


struct ct_shared
{
    std::atomic<bool> m_run; 

    // protected by m_std_rwmutex 
    std::uint64_t m_reads; 
    std::uint64_t m_writes; 

    ct_simple_stack m_stack; 
    ct_simple_stack m_stack_dtor; 

#if defined (CT_TEST_FAST_MUTEX) 
    ct_rwmutex m_std_rwmutex; 
#else 
    std::shared_mutex m_std_rwmutex; 
#endif 

    ct_shared() : m_run(true), m_reads(0), m_writes(0) 
    { 
        // prime m_stack 
        for (unsigned int i = 0; i < NODE_PRIME; ++i) 
        { 
            m_stack.push(new ct_simple_stack::node(i)); 
        } 
    } 
}; 


void ct_thread_reader(ct_shared& shared, std::size_t tidx)
{
    std::uint64_t reads = 0; 

    while (shared.m_run.load(std::memory_order_relaxed)) 
    { 
        shared.m_std_rwmutex.lock_shared(); 

        ct_simple_stack::node* n = shared.m_stack.m_head; 

        while (n) 
        { 
            ct_simple_stack::node* next = n->m_next; 

            n = next; 
        } 

        std::this_thread::yield(); 

        shared.m_std_rwmutex.unlock_shared(); 

        ++reads; 
    } 

    shared.m_std_rwmutex.lock(); 
    shared.m_reads += reads; 
    shared.m_std_rwmutex.unlock();
}


void ct_thread_writer(ct_shared& shared, std::size_t tidx)
{
    std::uint64_t writes = 0; 

    while (shared.m_run.load(std::memory_order_relaxed)) 
    { 
        shared.m_std_rwmutex.lock(); 
        ct_simple_stack::node* n = shared.m_stack.pop(); 
        shared.m_std_rwmutex.unlock(); 

        std::this_thread::yield(); 

        shared.m_std_rwmutex.lock(); 

        if (n) 
        { 
            shared.m_stack.push(n); 
        } 

        shared.m_std_rwmutex.unlock(); 

        std::this_thread::yield(); 

        ++writes; 
    } 

    shared.m_std_rwmutex.lock(); 
    shared.m_writes += writes; 
    shared.m_std_rwmutex.unlock();
}


int main()
{
    ct_shared shared; 

    std::cout << "Testing: "; 

#if defined (CT_TEST_FAST_MUTEX) 
    std::cout << "Chris M. Thomasson's Experimental Read/Write Mutex\n\n"; 
#else 
    std::cout << "std::shared_mutex\n\n"; 
#endif 

    std::cout.flush(); 

    { 
        // Setup our threads 
        std::size_t cpu_threads_n = std::thread::hardware_concurrency(); 
        std::size_t threads_n = cpu_threads_n * THREADS; 
        std::vector<std::thread> threads(threads_n); 

        std::size_t writers = threads_n / 2; 
        std::size_t readers = threads_n - writers; 

        std::cout << "___________________________________\n"; 
        std::cout << "cpu_threads_n = " << cpu_threads_n << "\n"; 
        std::cout << "threads_n = " << threads_n << "\n"; 
        std::cout << "writers = " << writers << "\n"; 
        std::cout << "readers = " << readers << "\n"; 
        std::cout << "test duration = " << TEST_DURATION_SEC << " seconds\n"; 
        std::cout << "___________________________________\n\n\n"; 

        auto time_start = std::chrono::high_resolution_clock::now(); 

        // Create threads 
        for (std::size_t i = 0; i < threads_n; ++i) 
        { 
            if (i < writers) 
            { 
                threads[i] = std::thread(ct_thread_writer, std::ref(shared), i); 
            } 

            else 
            { 
                threads[i] = std::thread(ct_thread_reader, std::ref(shared), i); 
            } 
        } 

        // Give the test some time to run... 
        std::chrono::seconds time_duration(TEST_DURATION_SEC); 
        std::this_thread::sleep_for(time_duration); 
        shared.m_run.store(false, std::memory_order_relaxed); 

        // Join threads 
        for (std::size_t i = 0; i < threads_n; ++i) 
        { 
            threads[i].join(); 
        } 

        // Grab our timing. 
        auto time_stop = std::chrono::high_resolution_clock::now(); 

        std::chrono::duration<double> time_diff = time_stop - time_start; 

        std::uint64_t raw_reads = shared.m_reads; 
        std::uint64_t raw_writes = shared.m_writes; 

        std::cout << "___________________________________\n"; 
        std::cout << "Raw Reads: " << raw_reads << "\n"; 
        std::cout << "Raw Writes: " << raw_writes << "\n"; 

        std::uint64_t reads_per_tick = raw_reads / time_diff.count(); 
        std::uint64_t writes_per_tick = raw_writes / time_diff.count(); 

        std::cout << "reads_per_tick = " << reads_per_tick << "\n"; 
        std::cout << "writes_per_tick = " << writes_per_tick << "\n"; 

        std::cout << "Ticks = " << time_diff.count() << "\n"; 
        std::cout << "___________________________________\n\n"; 
    } 

    std::cout << "\n\nFin!\n"; 

    return 0;
}
___________________________________ 


:^) 

Dmitry Vyukov

Feb 26, 2019, 3:10:02 AM
to lock...@googlegroups.com
Benchmarked this on 2 systems (3 alternating runs for each mutex):

1. i7-8650U

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 100855
Raw Writes: 2158
reads_per_tick = 1680
writes_per_tick = 35
Ticks = 60.019
___________________________________
___________________________________
Raw Reads: 108197
Raw Writes: 2126
reads_per_tick = 1802
writes_per_tick = 35
Ticks = 60.0109
___________________________________
___________________________________
Raw Reads: 118111
Raw Writes: 2397
reads_per_tick = 1967
writes_per_tick = 39
Ticks = 60.0193
___________________________________





Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 8
threads_n = 64
writers = 32
readers = 32
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 97667
Raw Writes: 766
reads_per_tick = 1627
writes_per_tick = 12
Ticks = 60.0066
___________________________________
___________________________________
Raw Reads: 96409
Raw Writes: 1162
reads_per_tick = 1606
writes_per_tick = 19
Ticks = 60.0094
___________________________________
___________________________________
Raw Reads: 95314
Raw Writes: 792
reads_per_tick = 1588
writes_per_tick = 13
Ticks = 60.0055
___________________________________



2. 2 x Xeon 6154

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 72
threads_n = 576
writers = 288
readers = 288
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 181439
Raw Writes: 3116
reads_per_tick = 3014
writes_per_tick = 51
Ticks = 60.1803
___________________________________
___________________________________
Raw Reads: 179390
Raw Writes: 3343
reads_per_tick = 2980
writes_per_tick = 55
Ticks = 60.1921
___________________________________
___________________________________
Raw Reads: 179058
Raw Writes: 3262
reads_per_tick = 2975
writes_per_tick = 54
Ticks = 60.1714
___________________________________


Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 72
threads_n = 576
writers = 288
readers = 288
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 1061634
Raw Writes: 1385
reads_per_tick = 17657
writes_per_tick = 23
Ticks = 60.1246
___________________________________
___________________________________
Raw Reads: 1056793
Raw Writes: 1368
reads_per_tick = 17601
writes_per_tick = 22
Ticks = 60.0405
___________________________________
___________________________________
Raw Reads: 1068157
Raw Writes: 1312
reads_per_tick = 17770
writes_per_tick = 21
Ticks = 60.1086
___________________________________


Now it is your problem to interpret this :)

Another interesting data point is time output.
On the large system for your mutex:
848.76user 10.83system 1:00.26elapsed 1426%CPU

for std mutex:
4238.27user 26.79system 1:00.18elapsed 7086%CPU

So whatever your mutex did, it used 5 times less CPU time for that.

Dmitry Vyukov

Feb 26, 2019, 3:16:24 AM
to lock...@googlegroups.com
We can try with some new change :)

I've never seen any names in Go sources before. And everything is usually
very consistent; it's not that everybody can do whatever they want in
a new file just because they wrote it and nobody else will care.
People actually care a lot about consistency and common style for
everything.
But now grepping the source I've found an existing precedent:
https://github.com/golang/go/blob/master/src/go/printer/nodes.go#L662
So we may try too :)

Chris M. Thomasson

Feb 26, 2019, 5:47:09 PM
to Scalable Synchronization Algorithms
On Tuesday, February 26, 2019 at 12:10:02 AM UTC-8, Dmitry Vyukov wrote:
On Wed, Feb 20, 2019 at 7:51 AM Chris M. Thomasson <cri...@charter.net> wrote:
>
> Fwiw, I wrote a crude new benchmark that measures how many reads and writes can be performed in a given amount of time. My algorithm vs std::shared_mutex. So, we are _primarily_ looking for how many reads can be performed in this test at 60 seconds. The number of threads is variable and is determined by std::hardware_concurrency * THREADS, set at 8 in the test. So on my system the setup is:
> ___________________________________
> cpu_threads_n = 4
> threads_n = 32
> writers = 16
> readers = 16
> test duration = 60 seconds
> ___________________________________
[...]


Benchmarked this on 2 systems (3 alternating runs for each mutex):

[...]

Thank you! :^)
 

Now it is your problem to interpret this :)

std::shared_mutex might have a reader priority? Wondering if it is using a distributed algorithm on the larger system?

The strict bakery style interleaving in my algorithm must be doing something "interesting" on the larger system. Mine seems to allow for some more writes in the data; too fair? The mutex aspect of my algorithm might be kicking in here. It is using Alexander Terekhov's algorithm from pthread_mutex_t in pthreads-win32; actually, it can use any mutual exclusion algorithm for writer access:


Integrating writer access wrt ct_rwmutex::m_count should be beneficial...


 

Another interesting data point is time output.
On the large system for your mutex:
848.76user 10.83system 1:00.26elapsed 1426%CPU

for std mutex:
4238.27user 26.79system 1:00.18elapsed 7086%CPU

So whatever your mutex did, it used 5 times less CPU time for that.

Bakery style, and the slow paths boiling down to condvar/mutex? I wonder what std::shared_mutex is using on your end when it has to wait? futex?

Dmitry Vyukov

Feb 27, 2019, 4:33:29 AM
to lock...@googlegroups.com
It's just a wrapper around pthread_rwlock and I have glibc 2.24.

Manuel Pöter

Feb 27, 2019, 9:34:41 AM
to Scalable Synchronization Algorithms
Benchmarked this on 2 systems with `THREADS` set to 1, 2 and 4.

Result 1:

8x Intel(R) Xeon(R) CPU E7- 8850  @ 2.00GHz (80 cores, 2x SMT)

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 160
threads_n = 160
writers = 80
readers = 80

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 186803
Raw Writes: 1922
reads_per_tick = 3111
writes_per_tick = 32
Ticks = 60.0334
___________________________________

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 160
threads_n = 320
writers = 160
readers = 160

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 336265
Raw Writes: 2559
reads_per_tick = 5596
writes_per_tick = 42
Ticks = 60.0848
___________________________________

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 160
threads_n = 640
writers = 320
readers = 320

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 449283
Raw Writes: 3718
reads_per_tick = 7471
writes_per_tick = 61
Ticks = 60.1302
___________________________________

Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 160
threads_n = 160
writers = 80
readers = 80

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 191840
Raw Writes: 784
reads_per_tick = 3194
writes_per_tick = 13
Ticks = 60.0533
___________________________________

Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 160
threads_n = 320
writers = 160
readers = 160

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 350020
Raw Writes: 1738
reads_per_tick = 5826
writes_per_tick = 28
Ticks = 60.0688
___________________________________

Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 160
threads_n = 640
writers = 320
readers = 320

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 706867
Raw Writes: 1830
reads_per_tick = 11752
writes_per_tick = 30
Ticks = 60.1452
___________________________________

Result 2:

4x SPARC-T5-4 (64 cores, 8x SMT)

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 512
threads_n = 512
writers = 256
readers = 256

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 640255
Raw Writes: 7999
reads_per_tick = 10650
writes_per_tick = 133
Ticks = 60.1149
___________________________________

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 512
threads_n = 1024
writers = 512
readers = 512

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 948097
Raw Writes: 12602
reads_per_tick = 15746
writes_per_tick = 209
Ticks = 60.2094
___________________________________

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 512
threads_n = 2048
writers = 1024
readers = 1024

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 1718250
Raw Writes: 23019
reads_per_tick = 28402
writes_per_tick = 380
Ticks = 60.4974
___________________________________


Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 512
threads_n = 512
writers = 256
readers = 256

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 4482
Raw Writes: 2166488
reads_per_tick = 74
writes_per_tick = 36045
Ticks = 60.1037
___________________________________

Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 512
threads_n = 1024
writers = 512
readers = 512

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 1536
Raw Writes: 2093636
reads_per_tick = 25
writes_per_tick = 34767
Ticks = 60.2185
___________________________________

Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 512
threads_n = 2048
writers = 1024
readers = 1024

test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 4096
Raw Writes: 2001034
reads_per_tick = 67
writes_per_tick = 33130
Ticks = 60.3978
___________________________________

Chris M. Thomasson

Mar 1, 2019, 1:37:42 AM
to Scalable Synchronization Algorithms
Okay. Thank you. So, my work is losing wrt reads, but performing more writes. This has to be an aspect of my algorithm's strict bakery-style fairness. Writers can never starve readers, and vice versa. It has no reader or writer preference.
Ummm.... This is wild! It is sort of hard to believe that std::shared_mutex is "tanking" on the read throughput so much wrt the SPARC. Well, it must have a heavy writer preference. I will have some more time in a day or two to work on this, wrt trying to make sense of these results.

Imvvho, std::shared_mutex should always be able to beat out a rwmutex implemented with 100% pure C++11. My algorithm is creating some very interesting results on these larger systems.

Thanks again Manuel, and Dmitry. :^)

Chris M. Thomasson

Mar 1, 2019, 1:51:16 AM
to Scalable Synchronization Algorithms
Need to examine the implementation. If it is based solely on mutex/condvar, I am going to be really surprised! I would think that pthread_rwlock should be using atomics and futex on Linux.

Manuel Pöter

Mar 1, 2019, 10:11:20 AM
to Scalable Synchronization Algorithms
I could get my hands on an AMD EPYC 7351P system and a Xeon Phi (Knights Corner) if you are interested in more results...

Chris M. Thomasson

Mar 1, 2019, 5:29:41 PM
to Scalable Synchronization Algorithms
On Friday, March 1, 2019 at 7:11:20 AM UTC-8, Manuel Pöter wrote:
I could get my hands on an AMD EPYC 7351P system and a Xeon Phi (Knights Corner) if you are interested in more results...

I would be grateful if you can find the time to do this, Manuel. Thanks. :^)

Really like the "difference" between the Intel and SPARC results. Afaict, std::shared_mutex, on Solaris? Needs to be fixed? I do not like the way it seems to "tank" the reads and obtain a shi% load of writes. Humm...

Manuel Pöter

Mar 2, 2019, 10:28:28 AM
to Scalable Synchronization Algorithms


On Friday, 1 March 2019 23:29:41 UTC+1, Chris M. Thomasson wrote:
On Friday, March 1, 2019 at 7:11:20 AM UTC-8, Manuel Pöter wrote:
I could get my hands on an AMD EPYC 7351P system and a Xeon Phi (Knights Corner) if you are interested in more results...

I would be grateful if you can find the time to do this, Manuel. Thanks. :^)
 
Will do! :)

Really like the "difference" between the Intel and SPARC results. Afaict, std::shared_mutex, on Solaris? Needs to be fixed? I do not like the way it seems to "tank" the reads and obtain a shi% load of writes. Humm...

Yes, the SPARC system is running Solaris 11.3; the binaries were built using a self-compiled GCC 6.3. I have to admit that I did not really look at the results before posting them here, but now that you pointed it out I agree - the shared_mutex results on Solaris are definitely not what I would have expected!

Chris M. Thomasson

Mar 3, 2019, 12:03:44 AM
to Scalable Synchronization Algorithms
I need to create another type of test, one that does not hammer things so much wrt creating an equal amount of reader and writer threads. I need to basically model a "read-mostly" work environment, instead of a hyper-hardcore, read-and-write-focused contention massacre...

Chris M. Thomasson

Mar 3, 2019, 4:28:03 AM
to Scalable Synchronization Algorithms
On Tuesday, February 26, 2019 at 12:16:24 AM UTC-8, Dmitry Vyukov wrote:
On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson <cri...@charter.net> wrote:
[...]

We can try with some new change :)

Just as a first start, I changed m_count to unsigned long; it is now "integrated" with lock and unlock, and still has no loops. Just a first exploratory experiment. Here is the code:

It still only uses fetch_add. Humm, not sure what to make of it. Also, I am pissed that fetch_or goes to a CMPXCHG loop on x86.

_________________________________
// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...

// 32-bit layout
#define READ        0x00000001UL
#define READ_MASK   0x0000FFFFUL
#define WRITE_BIT   0x00010000UL
#define WRITE       0x00020000UL
#define WRITE_MASK  0xFFFF0000UL
#define NEUTRAL     0x00000000UL


struct ct_rwmutex
{
    // shared state
    std::atomic<unsigned long> m_count;
    std::atomic<long> m_rdwake;

    ct_slow_semaphore m_rdwset;
    ct_slow_semaphore m_wrwset;
    ct_slow_semaphore m_wrmtx;


    ct_rwmutex() :
        m_count(NEUTRAL),
        m_rdwake(0),
        m_rdwset(0),
        m_wrwset(0),
        m_wrmtx(0) {
    }


    ~ct_rwmutex()
    {
        RL_ASSERT(m_count.load(std::memory_order_relaxed) == 0);
    }


    // READ, pretty slim...
    void rdlock()
    {
        if (m_count.fetch_add(READ, std::memory_order_acquire) & WRITE_BIT)
        {
            m_rdwset.dec();
        }
    }

    void rdunlock()
    {
        if (m_count.fetch_sub(READ, std::memory_order_release) & WRITE_BIT)
        {
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
            {
                m_wrwset.inc();
            }
        }
    }


    // WRITE, more hefty
    void wrlock()
    {
        // Take a writer ticket; if another writer is already in, wait.
        unsigned long wrcount = m_count.fetch_add(WRITE, std::memory_order_acquire);

        if ((wrcount & WRITE_MASK) > 0)
        {
            m_wrmtx.dec();
        }

        // Set the writer bit and count the readers still inside.
        unsigned long count = m_count.fetch_add(WRITE_BIT, std::memory_order_acquire);
        unsigned long readers = count & READ_MASK;

        if (readers)
        {
            long rdwake = m_rdwake.fetch_add(readers, std::memory_order_acquire);

            if (rdwake + readers)
            {
                // Wait for the remaining readers to leave.
                m_wrwset.dec();
            }
        }
    }

    // write unlock
    void wrunlock()
    {
        // Clear the writer bit and wake any readers that queued up.
        unsigned long count = m_count.fetch_sub(WRITE_BIT, std::memory_order_release);
        unsigned long readers = count & READ_MASK;

        if (readers)
        {
            m_rdwset.add(readers);
        }

        // Drop the writer ticket and hand off to the next writer, if any.
        unsigned long wrcount = m_count.fetch_sub(WRITE, std::memory_order_relaxed);

        if ((wrcount & WRITE_MASK) > WRITE)
        {
            m_wrmtx.inc();
        }
    }
};
_________________________________
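One constraint worth noting with this packed 32-bit layout: the reader field is only 16 bits wide, so it would overflow into WRITE_BIT past 65535 concurrent readers, and the writer count above it is 15 bits. Something like the following could guard the layout at compile time (just a sketch):

    static_assert((READ_MASK & WRITE_MASK) == 0, "fields must not overlap");
    static_assert(WRITE_BIT == READ_MASK + 1, "writer flag sits just above the reader field");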


Humm... Any thoughts? ;^)

 

I've never seen any names in Go sources before. And everything is usually
very consistent; it's not that everybody can do whatever they want in
a new file just because they wrote it and nobody else will care.
People actually care a lot about consistency and common style for
everything.
But now grepping the source I've found an existing precedent:
https://github.com/golang/go/blob/master/src/go/printer/nodes.go#L662
So we may try too :)

Very cool. Thanks Dmitry. 

Chris M. Thomasson

Mar 5, 2019, 12:01:53 AM
to Scalable Synchronization Algorithms


On Sunday, March 3, 2019 at 1:28:03 AM UTC-8, Chris M. Thomasson wrote:
On Tuesday, February 26, 2019 at 12:16:24 AM UTC-8, Dmitry Vyukov wrote:
On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson <cri...@charter.net> wrote:
[...]
We can try with some new change :)

Just as a first start, I changed m_count to unsigned long; it is now "integrated" with lock and unlock, and still has no loops. Just a first exploratory experiment. Here is the code:

It still only uses fetch_add. Humm, not sure what to make of it. Also, I am pissed that fetch_or goes to a CMPXCHG loop on x86.

_________________________________
// Chris M. Thomasson's Experimental Read/Write Mutex
[...]

    // WRITE, more hefty
    void wrlock()
    {
        unsigned long wrcount = m_count.fetch_add(WRITE, std::memory_order_acquire);

        if ((wrcount & WRITE_MASK) > 0)
        {
            m_wrmtx.dec();
        }
^^^^^^^^^^^^^^^^^^

Still thinking about getting rid of the above mutex, and directly integrating the waiting into a single semaphore without using loops. Humm... I would have to change the way readers can unlock writers and vice versa. It is a bit challenging to do without loops.

 

        unsigned long count = m_count.fetch_add(WRITE_BIT, std::memory_order_acquire);

[...]

Chris M. Thomasson

Mar 5, 2019, 12:10:19 AM
to Scalable Synchronization Algorithms


On Saturday, March 2, 2019 at 7:28:28 AM UTC-8, Manuel Pöter wrote:


On Friday, 1 March 2019 23:29:41 UTC+1, Chris M. Thomasson wrote:
On Friday, March 1, 2019 at 7:11:20 AM UTC-8, Manuel Pöter wrote:
I could get my hands on an AMD EPYC 7351P system and a Xeon Phi (Knights Corner) if you are interested in more results...

I would be grateful if you can find the time to do this, Manuel. Thanks. :^)
 
Will do! :)

Thanks. I am interested in all the results you can get.
 

Really like the "difference" between the Intel and SPARC results. Afaict, std::shared_mutex, on Solaris? Needs to be fixed? I do not like the way it seems to "tank" the reads and obtain a shi% load of writes. Humm...

Yes, the SPARC system is running Solaris 11.3; the binaries were built using a self-compiled GCC 6.3. I have to admit that I did not really look at the results before posting them here, but now that you pointed it out I agree - the shared_mutex results on Solaris are definitely not what I would have expected!

Agreed. Very strange, massive writer preference. Humm...

Chris M. Thomasson

Mar 15, 2019, 4:45:51 PM
to Scalable Synchronization Algorithms
Sorry for not working on this like I should be doing. Fwiw, I got distracted because several of my fractals are now on a very nice site:


My work:





The multi julia is very interesting, and creates some strange looking 3d renderings, even though the algorithm is 100% 2d in nature. Interested? 

To clarify, Paul used my special IFS algorithm to create the renderings. He used his own coloring algorithms.