Based on my previous ideas:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1348064e4838e871
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b7d3b8c08f9ca3c6
and using this epoch auto-detection algorithm:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f8f78ff07df6df5f
the asymmetric_rw_mutex is now free of caveats such as possible
deadlocks, and it does not use any OS/hardware-specific features.
Implementation:
#include <windows.h>
#include <intrin.h>   // _ReadWriteBarrier

// epoch-detection primitive (see the threads linked above)
void sys_synch();

int const max_reader_count = 1024;

// unique thread index (0 .. (max_reader_count - 1))
__declspec(thread) int current_thread_index;

class arw_mutex_t
{
private:
    CRITICAL_SECTION cs;
    // private flag for every reader
    bool volatile reader_inside [max_reader_count];
    bool volatile writer_pending;

public:
    arw_mutex_t()
    {
        InitializeCriticalSection(&cs);
        writer_pending = false;
        for (int i = 0; i != max_reader_count; ++i)
            reader_inside[i] = false;
    }

    void reader_lock()
    {
        reader_inside[current_thread_index] = true;
        // no explicit #StoreLoad here,
        // but it will be implicitly executed
        // by sys_synch() in writer_lock()
        if (writer_pending)
        {
            // if a writer is pending,
            // then signal that we see it
            // and wait for the writer to complete
            reader_inside[current_thread_index] = false;
            EnterCriticalSection(&cs);
            reader_inside[current_thread_index] = true;
            LeaveCriticalSection(&cs);
        }
        // need to order user code inside the critical section:
        // acquire compiler barrier
        _ReadWriteBarrier();
    }

    void reader_unlock()
    {
        // need to order user code inside the critical section:
        // release compiler barrier
        _ReadWriteBarrier();
        reader_inside[current_thread_index] = false;
    }

    void writer_lock()
    {
        EnterCriticalSection(&cs);
        // announce that we are waiting
        writer_pending = true;
        // all the magic is here:
        // implicitly execute a full memory barrier
        // on all other processors
        sys_synch();
        // at this point we are sure that either
        // (1) the writer will see (reader_inside[i] == true),
        // or (2) the reader will see (writer_pending == true),
        // so there are no race conditions
        for (int i = 0; i != max_reader_count; ++i)
        {
            // wait for all readers to complete
            while (reader_inside[i])
                SwitchToThread();
        }
    }

    void writer_unlock()
    {
        writer_pending = false;
        LeaveCriticalSection(&cs);
    }
};
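For context, a minimal usage sketch (illustrative; register_this_thread() and thread_counter are assumptions, not part of the algorithm). Each thread must claim a unique current_thread_index before its first lock operation:

static LONG thread_counter = -1;

void register_this_thread()
{
    // claim a unique index in (0 .. max_reader_count - 1); illustrative only
    current_thread_index = (int)InterlockedIncrement(&thread_counter);
}

arw_mutex_t g_mutex;
int g_shared_data;

void reader_work()
{
    g_mutex.reader_lock();        // costless fast path: two plain accesses
    int snapshot = g_shared_data; // read-side critical section
    g_mutex.reader_unlock();
    (void)snapshot;
}

void writer_work()
{
    g_mutex.writer_lock();        // expensive: CS + system-wide barrier
    g_shared_data += 1;           // modify in place, free memory right away
    g_mutex.writer_unlock();
}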
Dmitriy V'jukov
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is the same sys_synch() as the one defined here:
http://groups.google.com/group/comp.programming.threads/msg/ee9e4f173b7c1a7a
correct?
[...]
BTW, I have implemented your algorithm using the vZOOM epoch detection on
Windows/Linux/Solaris; it works.
Yes
> BTW, I have implemented your algorithm using the vZOOM epoch detection on
> Windows/Linux/Solaris; it works.
Yes, right, one can use any epoch detection logic with this algorithm
- sys_synch(), FlushProcessWriteBuffers(), rcu_synchronize(), a signal
to every thread, polling of thread/processor activity, etc. The main
point is that the epoch detection logic must be completely automatic
and must not require any explicit participation from user threads;
otherwise it can cause deadlocks.
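For example, on Windows Vista and later, sys_synch() could simply wrap FlushProcessWriteBuffers(), which forces the equivalent of a memory barrier on every processor running a thread of the calling process (a minimal sketch, assuming a Vista+ target):

#include <windows.h>

// one possible sys_synch(): delegate to the OS-provided
// process-wide write-buffer flush (Windows Vista and later)
void sys_synch()
{
    FlushProcessWriteBuffers();
}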
Any impressions? ;)
I'm going to test this mutex (and compare it to other implementations)
on a multicore machine under different workloads - different reader/
writer ratios and different amounts of local payload.
Dmitriy V'jukov
It slaughters traditional rw-locks!
> rw_mutex with costless read_lock()/read_unlock()
Am I right that one can't implement such an algorithm in pure Java/C#,
because a store to a volatile variable followed by a load of a
volatile variable will implicitly issue a #StoreLoad memory barrier?
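To illustrate the point in C++0x atomics terms (a sketch of the reader fast path only; the memory-order names assume the C++0x <atomic> draft):

#include <atomic>

std::atomic<bool> reader_inside_me(false);
std::atomic<bool> writer_pending(false);

void reader_lock_fast_path()
{
    // relaxed store followed by relaxed load: no #StoreLoad fence between
    // them; the writer's sys_synch() supplies that ordering system-wide
    reader_inside_me.store(true, std::memory_order_relaxed);
    if (writer_pending.load(std::memory_order_relaxed))
    {
        // slow path: fall back to the critical section
    }
    // compiler-only barrier to keep user code inside the critical section
    std::atomic_signal_fence(std::memory_order_acquire);
}

// With Java/C# volatiles, both accesses would be sequentially consistent,
// so the compiler/JIT must emit a full #StoreLoad barrier between them,
// which destroys the "costless" property of the read side.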
Dmitriy V'jukov
> >> BTW, I have implemented your algorithm using the vZOOM epoch detection on
> >> Windows/Linux/Solaris; it works.
>
> > Any impressions? ;)
>
> It slaughters traditional rw-locks!
I've played with it on an Intel quad-core. Yeah, bloody mess! :)
Provided a low write rate (for example, about one write per
millisecond), it's a serious competitor to all the other "effective
read" techniques.
It has costless and scalable read access - like RCU, SMR+RCU, and
vZOOM, and unlike SMR, proxy-collector, mutex, and rw_mutex.
It is "zero garbage" - like proxy-collector, mutex, and rw_mutex, and
unlike RCU, SMR+RCU, vZOOM, and SMR.
The user doesn't have to write "atomic lock-free" code for
modifications - like with mutex and rw_mutex, and unlike RCU, SMR+RCU,
vZOOM, SMR, and proxy-collector. The requirement to write "atomic
lock-free" code can be a showstopper, because it's not always possible
to express a data structure's algorithm in that style - for example, a
doubly-linked list.
The user can modify the data structure in place and free unneeded
memory straight away - like with mutex and rw_mutex, and unlike RCU,
SMR+RCU, vZOOM, SMR, and proxy-collector. That's a big win sometimes.
The only disadvantage I can see is that while the writer lock is held,
all readers are blocked. That can be a serious drawback sometimes.
Dmitriy V'jukov
> rw_mutex with costless read_lock()/read_unlock()
> All overhead is moved to writer side (therefore called 'asymmetric')
It's reminiscent of David Dice's 'Asymmetric Dekker Synchronization':
http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronizat...
IMHO, a reader-writer mutex is somewhat more widely applicable than a
'two-thread' Dekker mutex...
Dmitriy V'jukov
Uhm... You're assuming here that the array of bool can be atomically
(and correctly) accessed by multiple threads without synchronization.
See the usual ongoing discussions about hidden word-tearing and the
definition of "memory location".
That assumption may, in fact, be true on platforms where Windows
runs, but it's not generally the case, and depends on a fair number
of variables.
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.5" / 37N 20' 15.3"
Internet: steve @ Watt.COM Whois: SW32-ARIN
Free time? There's no such thing. It just comes in varying prices...
> Uhm... You're assuming here that the array of bool can be atomically
> (and correctly) accessed by multiple threads without synchronization.
> See the usual ongoing discussions about hidden word-tearing and the
> definition of "memory location".
>
> That assumption may, in fact, be true on platforms where Windows
> runs, but it's not generally the case, and depends on a fair number
> of variables.
Yes, I've read this:
http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf
MSVC and gcc always generate correct code. If your compiler generates
incorrect code for this, then you can use an array of ints.
Until C++0x, all such code will unconditionally be compiler/
architecture-dependent...
Anyway, in real life there should be something like:

struct reader_t
{
    int volatile flag;
    char cacheline_pad [arch::cacheline_size - sizeof(int)];
};

reader_t reader_inside [max_reader_count];
Or some form of dynamic registration of readers.
In any case, that's absolutely NOT the main point here.
Dmitriy V'jukov
Right. You could do this instead of using a calculation to determine
the pad size:

union reader_t
{
    int volatile flag;
    char cacheline_pad [arch::cacheline_size];
};
It looks simpler.
>
> Or some form of dynamic registration of readers.
IMHO, that would be the best way to implement this thing.
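A rough sketch of what dynamic reader registration might look like (all names here are illustrative, and a 64-byte cache line is assumed): each thread claims a free padded slot with a CAS on first use and releases it on exit.

#include <windows.h>

struct reader_slot_t
{
    LONG volatile owner;    // 0 = free, 1 = claimed by some thread
    int volatile flag;      // plays the role of reader_inside[i]
    char pad [64 - sizeof(LONG) - sizeof(int)]; // assumed 64-byte line
};

reader_slot_t g_reader_slots [max_reader_count];

int acquire_reader_slot()
{
    for (int i = 0; i != max_reader_count; ++i)
    {
        // cheap pre-check, then CAS to actually claim the slot
        if (g_reader_slots[i].owner == 0
            && InterlockedCompareExchange(&g_reader_slots[i].owner, 1, 0) == 0)
        {
            return i; // this thread now owns slot i
        }
    }
    return -1; // no free slot; caller must handle this
}

void release_reader_slot(int index)
{
    g_reader_slots[index].flag = 0;
    InterlockedExchange(&g_reader_slots[index].owner, 0);
}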