__________________________________________________________
#define PER_THREAD_DEPTH() 100
#define PER_THREAD_HASH_DEPTH() (PER_THREAD_DEPTH() - 1)
#define PER_THREAD_HASH(mp_ptr) \
  (((size_t)(mp_ptr)) % PER_THREAD_HASH_DEPTH())
struct per_thread {
  petersons_mutex_state rwlck[PER_THREAD_DEPTH()];
};
static pthread_mutex_t global_writer = PTHREAD_MUTEX_INITIALIZER;
/* called by one thread only; _this thread to be exact... */
void read_lock(
  per_thread* const _this,
  void const* ptr
) {
  petersons_mutex_process1_lock(
    &_this->rwlck[PER_THREAD_HASH(ptr)]
  );
}
void read_unlock(
  per_thread* const _this,
  void const* ptr
) {
  petersons_mutex_process1_unlock(
    &_this->rwlck[PER_THREAD_HASH(ptr)]
  );
}
/* also called by one thread only; global_writer ensures that... */
void write_lock(
  void const* ptr
) {
  pthread_mutex_lock(&global_writer);
  for each per_thread as _this { /* pseudocode: every registered thread */
    petersons_mutex_process2_lock(
      &_this->rwlck[PER_THREAD_HASH(ptr)]
    );
  }
  pthread_mutex_unlock(&global_writer);
}
void write_unlock(
  void const* ptr
) {
  pthread_mutex_lock(&global_writer);
  for each per_thread as _this { /* pseudocode: every registered thread */
    petersons_mutex_process2_unlock(
      &_this->rwlck[PER_THREAD_HASH(ptr)]
    );
  }
  pthread_mutex_unlock(&global_writer);
}
__________________________________________________________
I am also thinking of ways to get rid of global_writer....
struct per_thread {
petersons_mutex_state rwlck[PER_THREAD_DEPTH()];
};
[...]
BTW, would anybody be interested in looking at an implementation of the
Peterson-based rw-lock, or this one:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a23aeb712e8dbdf9
?
> BTW, would anybody be interested in looking at an implementation of the
> Peterson-based rw-lock
What is the point of using the Peterson algorithm instead of a "traditional"
mutex? To replace the InterlockedXXX operation with mfence (StoreLoad)?
Dmitriy V'jukov
Yup. Your 100% Asymmetric Blocking Algorithm would work out alright; however,
the blocking et al., hmm... Well, the dependencies hamper things somewhat...
Well, eliminating the StoreLoad is key... I would rather use proxy GC, but
your algorithm works only under rather tight circumstances wrt blocking in a
"critical" region... The Natural Asymmetric Nature of Things, Eh?
;^)
To drill down on my concern:
Asymmetric == Dependencies Which Could Allow For Deadlock Under Certain
Circumstances...
Any Thoughts???
The dependencies can be eliminated if you build the virtual machine
yourself.
You can manage some of the dependencies by creating a shi%tload of i/o or
GUI wrapper functions; wrap anything that can potentially block a thread.
> > What is the point of using the Peterson algorithm instead of a "traditional"
> > mutex? To replace the InterlockedXXX operation with mfence (StoreLoad)?
>
> Yup
Do you think that mfence is better in any way than lock prefix?
I've tried to ask this question here:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30242850.aspx
And here:
http://groups.google.com/group/comp.arch/browse_frm/thread/516f0dcfae17de2c
But it seems that nobody in the whole world knows...
Dmitriy V'jukov
> >> Any Thoughts???
>
> > The dependencies can be eliminated if you build the virtual machine
> > yourself.
>
> You can manage some of the dependencies by creating a shi%tload of i/o or
> GUI wrapper functions; wrap anything that can potentially block a thread.
Yes, this is the approach used by user-space cooperative threading
libraries.
Win32 has thread-blocking detection logic in its IOCP subsystem. I'm
thinking about how this can be useful for us. See the
CreateIoCompletionPort() function, NumberOfConcurrentThreads parameter.
One can create one IOCP port with NumberOfConcurrentThreads = 1, and one
helper thread, for every worker thread. First, the worker thread must
dequeue a work item from the IOCP port. Then the helper thread must try to
dequeue a work item from the same port (the helper will actually block,
because NumberOfConcurrentThreads = 1). So the OS will start monitoring the
worker thread for blocking. When the worker thread blocks *anyhow*, the OS
will wake up the helper thread. Now the helper thread can say: "hey, I've
started running, which means the worker thread is blocked, so it has
definitely executed a full membar, and will see that the mutex is already
locked in exclusive mode".
This can break the deadly circle of dependencies.
But I am not sure that this solution will work, or what performance it
would have. And it is definitely not portable...
Dmitriy V'jukov
> >> Any Thoughts???
>
> > The dependencies can be eliminated if you build the virtual machine
> > yourself.
>
> You can manage some of the dependencies by creating a shi%tload of i/o or
> GUI wrapper functions; wrap anything that can potentially block a thread.
If all threads in the system are bound to cores, then one can create a
helper thread with *idle* priority for every core. When such a helper
thread starts running, it can say: "hey, I'm running! So all threads bound
to this core have executed a full membar and will see that the mutex is
already locked in exclusive mode".
Imho this solution is more promising than IOCP ports, and it is portable.
But it is useful only for systems with strictly bound threads.
There is one little problem, though - the helper thread must ensure that
worker threads are not preempted *inside* a critical section...
Dmitriy V'jukov
Or maybe try it from the other side: create a helper thread with
*real-time* priority for every core. Normally those threads are blocked on
an event or waiting for an APC. When a writer needs to acquire the mutex in
exclusive mode, it signals all those helper threads. The helper threads
then somehow acknowledge that they have received the signal and that a
membar has been executed on their core.
Hmmm... What do you think?
Dmitriy V'jukov
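The writer-signals-helpers handshake just described could be sketched roughly with pthreads like this. Purely hypothetical: priorities, core binding, and the real signalling mechanism are left out, all names are invented, and a condvar itself already executes full barriers, so this shows only the control flow, not a barrier-free reader fast path:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NHELPERS 4  /* assume one helper per core */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int  request_seq;    /* bumped by the writer for each sync round */
static int  acks;           /* helpers that answered the current round */
static bool shutting_down;

/* Helper thread: normally blocked; when woken by the writer it
   acknowledges, which in the real scheme would mean "a full membar
   has been executed on my core". */
static void *helper(void *arg) {
    int seen = 0;
    (void)arg;
    pthread_mutex_lock(&mtx);
    for (;;) {
        while (request_seq == seen && !shutting_down)
            pthread_cond_wait(&cv, &mtx);
        if (shutting_down)
            break;
        seen = request_seq;
        acks++;                       /* "I ran; membar done on my core" */
        pthread_cond_broadcast(&cv);  /* tell the writer */
    }
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Writer side: before taking the mutex in exclusive mode, force every
   helper (hence every core) to acknowledge once. */
static void writer_sync_all(void) {
    pthread_mutex_lock(&mtx);
    request_seq++;
    acks = 0;
    pthread_cond_broadcast(&cv);      /* wake all helpers */
    while (acks < NHELPERS)
        pthread_cond_wait(&cv, &mtx); /* wait for every acknowledgement */
    pthread_mutex_unlock(&mtx);
}
```

writer_sync_all() returns only after every helper has run once since the call began, which is the property the exclusive-mode acquire needs.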
> Or maybe try it from the other side: create a helper thread with
> *real-time* priority for every core. Normally those threads are blocked on
> an event or waiting for an APC. When a writer needs to acquire the mutex in
> exclusive mode, it signals all those helper threads. The helper threads
> then somehow acknowledge that they have received the signal and that a
> membar has been executed on their core.
> Hmmm... What do you think?
On Win32, APCs can be used to determine when a thread is going to block.
But the thread must use only wait functions that wait in an alertable state
(for example WaitForSingleObjectEx(), SleepEx() and so on; this will not
work with plain WaitForSingleObject() and Sleep()).
Dmitriy V'jukov
I used mfence in my implementation sketch of Peterson's algorithm:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317
simply because there was no need to interlock anything; here is what I
think of the LOCK prefix:
http://groups.google.com/group/comp.programming.threads/msg/aeb7e857ef440520
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7c2556f454cbd268
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7c4d0711f348ae26
> > Do you think that mfence is better in any way than lock prefix?
>
> I used mfence in my implementation sketch of Peterson's algorithm:
> simply because there was no need to interlock anything;
Yes, here there is no need to interlock anything. But you can pick another
algorithm where there is such a need. And the question is: which algorithm
is better?
> here is what I think of the LOCK prefix:
But what do you think about MFENCE? :)
Dmitriy V'jukov
I think MFENCE should be used when there is no need to expose yourself to
the overhead of the LOCK prefix _and_ atomic interlocking of shared
variables. Here is something:
http://groups.google.com/group/comp.programming.threads/msg/e33fd49fa067ffd0
A very, very stupid simple test that proves nothing! Well, refer here:
http://groups.google.com/group/comp.lang.c++/msg/789c546c030d0d0c
It's good for an architecture to have completely naked interlocked
operations, and to provide a robust membar instruction to handle memory
visibility concerns.
Humm...
> >> here is what I think of the LOCK prefix:
>
> > But what do you think about MFENCE? :)
>
> I think MFENCE should be used when there is no need to expose yourself to
> the overhead of the LOCK prefix _and_ atomic interlocking of shared
> variables.
But what if MFENCE is equal in all senses to LOCK? Or even more
expensive?
> Here is something:
I don't understand what you are trying to say...
I have measured, with simple tests, the latency of LOCK and MFENCE on
Intel and AMD single-core processors. They are equal...
Dmitriy V'jukov
> >> here is what I think of the LOCK prefix:
>
> > But what do you think about MFENCE? :)
>
> I think MFENCE should be used when there is no need to expose yourself to
> the overhead of the LOCK prefix _and_ atomic interlocking of shared
> variables.
IMVHO, I don't want to use an interlocked instruction just to get a memory
barrier.
If the algorithm needs to execute a membar, and does not need to atomically
mutate a shared variable, why use interlocking?
> But what if MFENCE is equal in all senses to LOCK? Or even more
> expensive?
I don't quite see how it could be more expensive because the LOCK prefix is
used
in conjunction with a RMW operation. MFENCE does not need to RMW anything.
Does that make any sense?
> Here is something:
> I don't understand what you are trying to say...
> > I have measured, with simple tests, the latency of LOCK and MFENCE on
> > Intel and AMD single-core processors. They are equal...
Are you using a thread-local dummy variable for the interlocked instruction
to operate on? For instance:
_____________________________________________
void membar_storeload() {
register atomic_t dummy;
XCHG &dummy, 0; /* LOCK prefix implied with XCHG */
}
_____________________________________________
Why not just use:
_____________________________________________
void membar_storeload() {
MFENCE;
}
_____________________________________________
?
Yup. The threads that are bound to the same core will never run
concurrently, so when one thread runs, all the others should have executed
a membar because they're queued up in the OS scheduler.
> Imho this solution is more promising than IOCP ports, and it is portable.
> But it is useful only for systems with strictly bound threads.
Yeah. It would only be useful if there are multiple threads bound to
the same core. Humm, there could be some performance degradation because
you're not letting the OS scheduler do its thread-migration thing...
> There is one little problem, though - the helper thread must ensure that
> worker threads are not preempted *inside* a critical section...
The shit would hit the fan then!
:^0
I am thinking of how this is going to scale to systems with a boatload of
processors. This might be putting a bit too much overhead on the writer
access. That is what I pointed out here:
http://groups.google.com/group/comp.programming.threads/msg/878a80ea30b2e849
http://groups.google.com/group/comp.programming.threads/msg/cb9c51bcef19f655
(last paragraph...)
However, you can just say that writers are always going to be few and far
between...
;^)
Something like Cray's Red Storm Supercomputer with 10,000+ CPU's...
Ok. But it can be equal.
> Does that make any sense?
It means that *PROBABLY* mfence *CAN* be faster...
> > Here is something:
> > I don't understand what you are trying to say...
> > I have measured, with simple tests, the latency of LOCK and MFENCE on
> > Intel and AMD single-core processors. They are equal...
>
> Are you using a thread-local dummy variable for the interlocked instruction
> to operate on? For instance:
> _____________________________________________
> void membar_storeload() {
> register atomic_t dummy;
> XCHG &dummy, 0; /* LOCK prefix implied with XCHG */
> }
>
> _____________________________________________
>
> Why not just use:
> _____________________________________________
> void membar_storeload() {
> MFENCE;
> }
>
> _____________________________________________
>
> ?
I measure:
LOCK ADD dword ptr [esp], 0
against:
ADD dword ptr [esp], 0
MFENCE
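For what it's worth, the two sequences being compared can be expressed portably in C11 and re-measured. This is a hypothetical sketch: atomic_fetch_add on a dummy location is assumed to compile to a LOCK'd RMW, and atomic_thread_fence(memory_order_seq_cst) to MFENCE on x86, though the actual instruction selection is up to the compiler:

```c
#include <stdatomic.h>
#include <time.h>

enum { ITERS = 1000000 };

/* StoreLoad barrier obtained as a side effect of a LOCK'd RMW on a dummy
   location (the "LOCK ADD dword ptr [esp], 0" idiom). */
static void membar_via_lock(atomic_int *dummy) {
    atomic_fetch_add_explicit(dummy, 0, memory_order_seq_cst);
}

/* StoreLoad barrier via a standalone fence (MFENCE on x86). */
static void membar_via_fence(atomic_int *dummy) {
    (void)dummy;
    atomic_thread_fence(memory_order_seq_cst);
}

/* Time ITERS calls of one barrier flavour, in CPU seconds. */
static double time_barrier(void (*barrier)(atomic_int *), atomic_int *dummy) {
    clock_t t0 = clock();
    for (int i = 0; i < ITERS; i++)
        barrier(dummy);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing time_barrier(membar_via_lock, &d) against time_barrier(membar_via_fence, &d) on a given machine reproduces the measurement; as noted above, on many single-core x86 parts the two come out about equal.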
Reasons to use a traditional mutex instead of Peterson:
1. Clearer implementation
2. Already implemented
3. Can be used with more than 2 threads
Reasons NOT to use a traditional mutex instead of Peterson:
1. *PROBABLY* *CAN* be faster
Hummm
To use Peterson's algo, at the least I have to see that it is SOMETIMES A
BIT faster... I don't see this yet...
Dmitriy V'jukov