Based on my previous ideas:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1348064e4838e871
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b7d3b8c08f9ca3c6
and using this epoch auto-detection algorithm:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f8f78ff07df6df5f
the asymmetric_rw_mutex is now free of caveats such as possible
deadlocks, and it does not use any OS/hardware-specific features.
Implementation:
#include <windows.h>
#include <intrin.h>   // _ReadWriteBarrier

// epoch-detection primitive (see the threads linked above)
void sys_synch();

int const max_reader_count = 1024;

// unique thread index (0 .. (max_reader_count - 1))
__declspec(thread) int current_thread_index;

class arw_mutex_t
{
private:
    CRITICAL_SECTION cs;
    // private flag for every reader
    bool volatile reader_inside [max_reader_count];
    bool volatile writer_pending;

public:
    arw_mutex_t()
    {
        InitializeCriticalSection(&cs);
        writer_pending = false;
        for (int i = 0; i != max_reader_count; ++i)
            reader_inside[i] = false;
    }

    void reader_lock()
    {
        reader_inside[current_thread_index] = true;
        // no explicit #StoreLoad here,
        // but it will be implicitly executed
        // by sys_synch() in writer_lock()
        if (writer_pending)
        {
            // if a writer is pending,
            // then signal that we see it
            // and wait for the writer to complete
            reader_inside[current_thread_index] = false;
            EnterCriticalSection(&cs);
            reader_inside[current_thread_index] = true;
            LeaveCriticalSection(&cs);
        }
        // need to order user code inside the critical section:
        // acquire compiler barrier
        _ReadWriteBarrier();
    }

    void reader_unlock()
    {
        // need to order user code inside the critical section:
        // release compiler barrier
        _ReadWriteBarrier();
        reader_inside[current_thread_index] = false;
    }

    void writer_lock()
    {
        EnterCriticalSection(&cs);
        // announce that we are waiting
        writer_pending = true;
        // all the magic is here:
        // implicitly execute a full memory barrier
        // on all other processors
        sys_synch();
        // at this point we are sure that either
        // (1) the writer will see (reader_inside[i] == true),
        // or (2) the reader will see (writer_pending == true),
        // so there are no race conditions
        for (int i = 0; i != max_reader_count; ++i)
        {
            // wait for all readers to complete
            while (reader_inside[i])
                SwitchToThread();
        }
    }

    void writer_unlock()
    {
        writer_pending = false;
        LeaveCriticalSection(&cs);
    }
};
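For context, a minimal usage sketch (illustrative; register_this_thread() and thread_counter are assumptions, not part of the algorithm). Each thread must claim a unique current_thread_index before its first lock operation:

static LONG thread_counter = -1;

void register_this_thread()
{
    // claim a unique index in (0 .. max_reader_count - 1); illustrative only
    current_thread_index = (int)InterlockedIncrement(&thread_counter);
}

arw_mutex_t g_mutex;
int g_shared_data;

void reader_work()
{
    g_mutex.reader_lock();        // costless fast path: two plain accesses
    int snapshot = g_shared_data; // read-side critical section
    g_mutex.reader_unlock();
    (void)snapshot;
}

void writer_work()
{
    g_mutex.writer_lock();        // expensive: CS + system-wide barrier
    g_shared_data += 1;           // modify in place, free memory right away
    g_mutex.writer_unlock();
}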
Dmitriy V'jukov
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is the same sys_synch() as the one defined here:
http://groups.google.com/group/comp.programming.threads/msg/ee9e4f173b7c1a7a
correct?
[...]
BTW, I have implemented your algorithm using the vZOOM epoch detection on
Windows/Linux/Solaris; it works.
Yes
> BTW, I have implemented your algorithm using the vZOOM epoch detection on
> Windows/Linux/Solaris; it works.
Yes, right, one can use any epoch detection logic with this algorithm
- sys_synch(), FlushProcessWriteBuffers(), rcu_synchronize(), a signal
to every thread, polling of thread/processor activity, etc. The main
point is that the epoch detection logic must be completely automatic
and must not require any explicit participation from user threads;
otherwise it can cause deadlocks.
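For example, on Windows Vista and later, sys_synch() could simply wrap FlushProcessWriteBuffers(), which forces the equivalent of a memory barrier on every processor running a thread of the calling process (a minimal sketch, assuming a Vista+ target):

#include <windows.h>

// one possible sys_synch(): delegate to the OS-provided
// process-wide write-buffer flush (Windows Vista and later)
void sys_synch()
{
    FlushProcessWriteBuffers();
}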
Any impressions? ;)
I'm going to test this mutex (and compare it to other implementations)
on a multicore machine under different workloads - different reader/
writer ratios and different amounts of local payload.
Dmitriy V'jukov
It slaughters traditional rw-locks!
> rw_mutex with costless read_lock()/read_unlock()
Am I right that one can't implement such an algorithm in pure Java/C#,
because a store to a volatile variable followed by a load of a
volatile variable will implicitly issue a #StoreLoad memory barrier?
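To illustrate the point in C++0x atomics terms (a sketch of the reader fast path only; the memory-order names assume the C++0x <atomic> draft):

#include <atomic>

std::atomic<bool> reader_inside_me(false);
std::atomic<bool> writer_pending(false);

void reader_lock_fast_path()
{
    // relaxed store followed by relaxed load: no #StoreLoad fence between
    // them; the writer's sys_synch() supplies that ordering system-wide
    reader_inside_me.store(true, std::memory_order_relaxed);
    if (writer_pending.load(std::memory_order_relaxed))
    {
        // slow path: fall back to the critical section
    }
    // compiler-only barrier to keep user code inside the critical section
    std::atomic_signal_fence(std::memory_order_acquire);
}

// With Java/C# volatiles, both accesses would be sequentially consistent,
// so the compiler/JIT must emit a full #StoreLoad barrier between them,
// which destroys the "costless" property of the read side.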
Dmitriy V'jukov
> >> BTW, I have implemented your algorithm using the vZOOM epoch detection on
> >> Windows/Linux/Solaris; it works.
>
> > Any impressions? ;)
>
> It slaughters traditional rw-locks!
I've played with it on an Intel quad-core. Yeah, bloody mess! :)
Provided a low write rate (for example, about one write per
millisecond), it's a serious competitor to all the other "effective
read" techniques.
It has costless and scalable read access - like RCU, SMR+RCU, and
vZOOM, and unlike SMR, proxy-collector, mutex, and rw_mutex.
It is "zero garbage" - like proxy-collector, mutex, and rw_mutex, and
unlike RCU, SMR+RCU, vZOOM, and SMR.
The user doesn't have to write "atomic lock-free" code for
modifications - like with mutex and rw_mutex, and unlike RCU, SMR+RCU,
vZOOM, SMR, and proxy-collector. The requirement to write "atomic
lock-free" code can be a showstopper, because it's not always possible
to express a data structure's algorithm in that style - for example, a
doubly-linked list.
The user can modify the data structure in place and free unneeded
memory straight away - like with mutex and rw_mutex, and unlike RCU,
SMR+RCU, vZOOM, SMR, and proxy-collector. That's a big win sometimes.
The only disadvantage I can see is that while the writer lock is held,
all readers are blocked. That can be a serious drawback sometimes.
Dmitriy V'jukov
> rw_mutex with costless read_lock()/read_unlock()
> All overhead is moved to writer side (therefore called 'asymmetric')
It's reminiscent of David Dice's 'Asymmetric Dekker Synchronization':
http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronizat...
IMHO, a reader-writer mutex is somewhat more widely applicable than a
'two-thread' Dekker mutex...
Dmitriy V'jukov
Uhm... You're assuming here that the array of bool can be atomically
(and correctly) accessed by multiple threads without synchronization.
See the usual ongoing discussions about hidden word-tearing and the
definition of "memory location".
That assumption may, in fact, be true on platforms where Windows
runs, but it's not generally the case, and depends on a fair number
of variables.
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.5" / 37N 20' 15.3"
Internet: steve @ Watt.COM Whois: SW32-ARIN
Free time? There's no such thing. It just comes in varying prices...
> Uhm... You're assuming here that the array of bool can be atomically
> (and correctly) accessed by multiple threads without synchronization.
> See the usual ongoing discussions about hidden word-tearing and the
> definition of "memory location".
>
> That assumption may, in fact, be true on platforms where Windows
> runs, but it's not generally the case, and depends on a fair number
> of variables.
Yes, I've read this:
http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf
MSVC and gcc always generate correct code. If your compiler generates
incorrect code for this, then you can use an array of ints.
Until C++0x, all such code will unconditionally be compiler/
architecture-dependent...
Anyway, in real life there should be something like:

struct reader_t
{
    int volatile flag;
    char cacheline_pad [arch::cacheline_size - sizeof(int)];
};

reader_t reader_inside [max_reader_count];
Or some form of dynamic registration of readers.
In any case, that's absolutely NOT the main point here.
Dmitriy V'jukov
Right. You could do this instead of using a calculation to determine
the pad size:

union reader_t
{
    int volatile flag;
    char cacheline_pad [arch::cacheline_size];
};
It looks simpler.
>
> Or some form of dynamic registration of readers.
IMHO, that would be the best way to implement this thing.
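A rough sketch of what dynamic reader registration might look like (all names here are illustrative, and a 64-byte cache line is assumed): each thread claims a free padded slot with a CAS on first use and releases it on exit.

#include <windows.h>

struct reader_slot_t
{
    LONG volatile owner;    // 0 = free, 1 = claimed by some thread
    int volatile flag;      // plays the role of reader_inside[i]
    char pad [64 - sizeof(LONG) - sizeof(int)]; // assumed 64-byte line
};

reader_slot_t g_reader_slots [max_reader_count];

int acquire_reader_slot()
{
    for (int i = 0; i != max_reader_count; ++i)
    {
        // cheap pre-check, then CAS to actually claim the slot
        if (g_reader_slots[i].owner == 0
            && InterlockedCompareExchange(&g_reader_slots[i].owner, 1, 0) == 0)
        {
            return i; // this thread now owns slot i
        }
    }
    return -1; // no free slot; caller must handle this
}

void release_reader_slot(int index)
{
    g_reader_slots[index].flag = 0;
    InterlockedExchange(&g_reader_slots[index].owner, 0);
}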