__________________________________________________________
#define PER_THREAD_DEPTH() 100
#define PER_THREAD_HASH_DEPTH() (PER_THREAD_DEPTH() - 1)
#define PER_THREAD_HASH(mp_ptr) \
  (((size_t)(mp_ptr)) % PER_THREAD_HASH_DEPTH())
struct per_thread {
  petersons_mutex_state rwlck[PER_THREAD_DEPTH()];
};
static pthread_mutex_t global_writer = PTHREAD_MUTEX_INITIALIZER;
/* called by one thread only; _this thread to be exact... */
void read_lock(
  per_thread* const _this,
  void const* ptr
) {
  petersons_mutex_process1_lock(
    &_this->rwlck[PER_THREAD_HASH(ptr)]
  );
}
void read_unlock(
  per_thread* const _this,
  void const* ptr
) {
  petersons_mutex_process1_unlock(
    &_this->rwlck[PER_THREAD_HASH(ptr)]
  );
}
/* also called by one thread only; global_writer ensures that... */
void write_lock(
  void const* ptr
) {
  pthread_mutex_lock(&global_writer);
  for each per_thread as _this { /* pseudocode: every registered thread */
    petersons_mutex_process2_lock(
      &_this->rwlck[PER_THREAD_HASH(ptr)]
    );
  }
  pthread_mutex_unlock(&global_writer);
}
void write_unlock(
  void const* ptr
) {
  pthread_mutex_lock(&global_writer);
  for each per_thread as _this { /* pseudocode: every registered thread */
    petersons_mutex_process2_unlock(
      &_this->rwlck[PER_THREAD_HASH(ptr)]
    );
  }
  pthread_mutex_unlock(&global_writer);
}
__________________________________________________________
I am also thinking of ways to get rid of global_writer....
struct per_thread {
petersons_mutex_state rwlck[PER_THREAD_DEPTH()];
};
[...]
BTW, would anybody be interested in looking at an implementation of the
Peterson-based rw-lock, or this one:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a23aeb712e8dbdf9
?
> BTW, would anybody be interested in looking at an implementation of the
> Peterson-based rw-lock
What is the point of using the Peterson algorithm instead of a "traditional"
mutex? To replace the InterlockedXXX operation with mfence (StoreLoad)?
Dmitriy V'jukov
Yup. Your 100% Asymmetric Blocking Algorithm would work out alright; however,
the blocking et al., hmm... Well, the dependencies hamper things somewhat...
Well, eliminating the StoreLoad is key... I would rather use proxy GC, but
your algorithm works only under rather tight circumstances wrt blocking in a
"critical" region... The Natural Asymmetric Nature of Things, Eh?
;^)
To drill down on my concern:
Asymmetric == Dependencies Which Could Allow For Deadlock Under Certain
Circumstances...
Any Thoughts???
The dependencies can be eliminated if you build the virtual machine
yourself.
You can manage some of the dependencies by creating a shi%tload of i/o or
GUI wrapper functions; wrap anything that can potentially block a thread.
> > What is the point of using the Peterson algorithm instead of a "traditional"
> > mutex? To replace the InterlockedXXX operation with mfence (StoreLoad)?
>
> Yup
Do you think that mfence is better in any way than lock prefix?
I've tried to ask this question here:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30242850.aspx
And here:
http://groups.google.com/group/comp.arch/browse_frm/thread/516f0dcfae17de2c
But it seems that nobody in the whole world knows...
Dmitriy V'jukov
> >> Any Thoughts???
>
> > The dependencies can be eliminated if you build the virtual machine
> > yourself.
>
> You can manage some of the dependencies by creating a shi%tload of i/o or
> GUI wrapper functions; wrap anything that can potentially block a thread.
Yes, this is the approach used by user-space cooperative threading
libraries.
Win32 has thread-blocking detection logic in its IOCP subsystem. I'm
thinking about how this can be useful for us. See the
CreateIoCompletionPort() function, NumberOfConcurrentThreads parameter.
One can create one IOCP port with NumberOfConcurrentThreads = 1, and one
helper thread, for every worker thread. First, the worker thread must
dequeue a work item from the IOCP port. Then the helper thread must try to
dequeue a work item from the same port (the helper will actually block,
because NumberOfConcurrentThreads = 1). So the OS will start monitoring the
worker thread for blocking. When the worker thread blocks *anyhow*, the OS
will wake up the helper thread. Now the helper thread can say: "hey, I've
started running, which means the worker thread is blocked, so it has
definitely executed a full membar, and will see that the mutex is already
locked in exclusive mode".
This can break the deadly circle of dependencies.
But I am not sure that this solution will work, or what performance it
would have. And it is definitely not portable...
Dmitriy V'jukov
> >> Any Thoughts???
>
> > The dependencies can be eliminated if you build the virtual machine
> > yourself.
>
> You can manage some of the dependencies by creating a shi%tload of i/o or
> GUI wrapper functions; wrap anything that can potentially block a thread.
If all threads in the system are bound to cores, then one can create a
helper thread with *idle* priority for every core. When such a helper
thread starts running, it can say: "hey, I'm running! So all threads bound
to this core have executed a full membar and will see that the mutex is
already locked in exclusive mode".
Imho this solution is more promising than IOCP ports, and it is portable.
But it is useful only for systems with strictly bound threads.
There is one little problem, though - the helper thread must ensure that
worker threads are not preempted *inside* a critical section...
Dmitriy V'jukov
Or maybe try it from the other side: create a helper thread with
*real-time* priority for every core. Normally those threads are blocked on
an event or waiting for an APC. When a writer needs to acquire the mutex in
exclusive mode, it signals all those helper threads. The helper threads
then somehow acknowledge that they have received the signal and that a
membar has been executed on their core.
Hmmm... What do you think?
Dmitriy V'jukov
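The writer-signals-helpers handshake just described could be sketched roughly with pthreads like this. Purely hypothetical: priorities, core binding, and the real signalling mechanism are left out, all names are invented, and a condvar itself already executes full barriers, so this shows only the control flow, not a barrier-free reader fast path:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NHELPERS 4  /* assume one helper per core */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int  request_seq;    /* bumped by the writer for each sync round */
static int  acks;           /* helpers that answered the current round */
static bool shutting_down;

/* Helper thread: normally blocked; when woken by the writer it
   acknowledges, which in the real scheme would mean "a full membar
   has been executed on my core". */
static void *helper(void *arg) {
    int seen = 0;
    (void)arg;
    pthread_mutex_lock(&mtx);
    for (;;) {
        while (request_seq == seen && !shutting_down)
            pthread_cond_wait(&cv, &mtx);
        if (shutting_down)
            break;
        seen = request_seq;
        acks++;                       /* "I ran; membar done on my core" */
        pthread_cond_broadcast(&cv);  /* tell the writer */
    }
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Writer side: before taking the mutex in exclusive mode, force every
   helper (hence every core) to acknowledge once. */
static void writer_sync_all(void) {
    pthread_mutex_lock(&mtx);
    request_seq++;
    acks = 0;
    pthread_cond_broadcast(&cv);      /* wake all helpers */
    while (acks < NHELPERS)
        pthread_cond_wait(&cv, &mtx); /* wait for every acknowledgement */
    pthread_mutex_unlock(&mtx);
}
```

writer_sync_all() returns only after every helper has run once since the call began, which is the property the exclusive-mode acquire needs.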
> Or maybe try it from the other side: create a helper thread with
> *real-time* priority for every core. Normally those threads are blocked on
> an event or waiting for an APC. When a writer needs to acquire the mutex in
> exclusive mode, it signals all those helper threads. The helper threads
> then somehow acknowledge that they have received the signal and that a
> membar has been executed on their core.
> Hmmm... What do you think?
On Win32, APCs can be used to determine when a thread is going to block.
But the thread must use only wait functions that wait in an alertable state
(for example WaitForSingleObjectEx(), SleepEx() and so on; this will not
work with plain WaitForSingleObject() and Sleep()).
Dmitriy V'jukov
I used mfence in my implementation sketch of Peterson's algorithm:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317
simply because there was no need to interlock anything; here is what I
think of the LOCK prefix:
http://groups.google.com/group/comp.programming.threads/msg/aeb7e857ef440520
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7c2556f454cbd268
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7c4d0711f348ae26
> > Do you think that mfence is better in any way than lock prefix?
>
> I used mfence in my implementation sketch of Peterson's algorithm:
> simply because there was no need to interlock anything;
Yes, here there is no need to interlock anything. But you can pick another
algorithm where there is such a need. And the question is: which algorithm
is better?
> here is what I think of the LOCK prefix:
But what do you think about MFENCE? :)
Dmitriy V'jukov
I think MFENCE should be used when there is no need to expose yourself to
the overhead of the LOCK prefix _and_ atomic interlocking of shared
variables. Here is something:
http://groups.google.com/group/comp.programming.threads/msg/e33fd49fa067ffd0
A very, very stupid simple test that proves nothing! Well, refer here:
http://groups.google.com/group/comp.lang.c++/msg/789c546c030d0d0c
It's good for an architecture to have completely naked interlocked
operations, and to provide a robust membar instruction to handle memory
visibility concerns.
Humm...
> >> here is what I think of the LOCK prefix:
>
> > But what do you think about MFENCE? :)
>
> I think MFENCE should be used when there is no need to expose yourself to
> the overhead of the LOCK prefix _and_ atomic interlocking of shared
> variables.
But what if MFENCE is equal in all senses to LOCK? Or even more
expensive?
> Here is something:
I don't understand what you are trying to say...
I have measured, with simple tests, the latency of LOCK and MFENCE on
Intel and AMD single-core processors. They are equal...
Dmitriy V'jukov
> >> here is what I think of the LOCK prefix:
>
> > But what do you think about MFENCE? :)
>
> I think MFENCE should be used when there is no need to expose yourself to
> the overhead of the LOCK prefix _and_ atomic interlocking of shared
> variables.
IMVHO, I don't want to use an interlocked instruction just to get a memory
barrier.
If the algorithm needs to execute a membar, and does not need to atomically
mutate a shared variable, why use interlocking?
> But what if MFENCE is equal in all senses to LOCK? Or even more
> expensive?
I don't quite see how it could be more expensive because the LOCK prefix is
used
in conjunction with a RMW operation. MFENCE does not need to RMW anything.
Does that make any sense?
> Here is something:
> I don't understand what you are trying to say...
> > I have measured, with simple tests, the latency of LOCK and MFENCE on
> > Intel and AMD single-core processors. They are equal...
Are you using a thread-local dummy variable for the interlocked instruction
to operate on? For instance:
_____________________________________________
void membar_storeload() {
register atomic_t dummy;
XCHG &dummy, 0; /* LOCK prefix implied with XCHG */
}
_____________________________________________
Why not just use:
_____________________________________________
void membar_storeload() {
MFENCE;
}
_____________________________________________
?
Yup. The threads that are bound to the same core will never run
concurrently, so when one thread runs, all the others should have executed
a membar because they're queued up in the OS scheduler.
> Imho this solution is more promising than IOCP ports, and it is portable.
> But it is useful only for systems with strictly bound threads.
Yeah. It would only be useful if there are multiple threads bound to
the same core. Humm, there could be some performance degradation because
you're not letting the OS scheduler do its thread-migration thing...
> There is one little problem, though - the helper thread must ensure that
> worker threads are not preempted *inside* a critical section...
The shit would hit the fan then!
:^0
I am thinking of how this is going to scale to systems with a boatload of
processors. This might be putting a bit too much overhead on the writer
access. That is what I pointed out here:
http://groups.google.com/group/comp.programming.threads/msg/878a80ea30b2e849
http://groups.google.com/group/comp.programming.threads/msg/cb9c51bcef19f655
(last paragraph...)
However, you can just say that writers are always going to be few and far
between...
;^)
Something like Cray's Red Storm Supercomputer with 10,000+ CPU's...
Ok. But it can be equal.
> Does that make any sense?
It means that *PROBABLY* mfence *CAN* be faster...
> > Here is something:
> > I don't understand what you are trying to say...
> > I have measured, with simple tests, the latency of LOCK and MFENCE on
> > Intel and AMD single-core processors. They are equal...
>
> Are you using a thread-local dummy variable for the interlocked instruction
> to operate on? For instance:
> _____________________________________________
> void membar_storeload() {
> register atomic_t dummy;
> XCHG &dummy, 0; /* LOCK prefix implied with XCHG */
> }
>
> _____________________________________________
>
> Why not just use:
> _____________________________________________
> void membar_storeload() {
> MFENCE;
> }
>
> _____________________________________________
>
> ?
I measure:
LOCK ADD dword ptr [esp], 0
against:
ADD dword ptr [esp], 0
MFENCE
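For what it's worth, the two sequences being compared can be expressed portably in C11 and re-measured. This is a hypothetical sketch: atomic_fetch_add on a dummy location is assumed to compile to a LOCK'd RMW, and atomic_thread_fence(memory_order_seq_cst) to MFENCE on x86, though the actual instruction selection is up to the compiler:

```c
#include <stdatomic.h>
#include <time.h>

enum { ITERS = 1000000 };

/* StoreLoad barrier obtained as a side effect of a LOCK'd RMW on a dummy
   location (the "LOCK ADD dword ptr [esp], 0" idiom). */
static void membar_via_lock(atomic_int *dummy) {
    atomic_fetch_add_explicit(dummy, 0, memory_order_seq_cst);
}

/* StoreLoad barrier via a standalone fence (MFENCE on x86). */
static void membar_via_fence(atomic_int *dummy) {
    (void)dummy;
    atomic_thread_fence(memory_order_seq_cst);
}

/* Time ITERS calls of one barrier flavour, in CPU seconds. */
static double time_barrier(void (*barrier)(atomic_int *), atomic_int *dummy) {
    clock_t t0 = clock();
    for (int i = 0; i < ITERS; i++)
        barrier(dummy);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing time_barrier(membar_via_lock, &d) against time_barrier(membar_via_fence, &d) on a given machine reproduces the measurement; as noted above, on many single-core x86 parts the two come out about equal.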
Reasons to use a traditional mutex instead of Peterson:
1. Clearer implementation
2. Already implemented
3. Can be used with more than 2 threads
Reasons NOT to use a traditional mutex instead of Peterson:
1. *PROBABLY* *CAN* be faster
Hummm
To use Peterson's algo, at the least I have to see that it is SOMETIMES A
BIT faster... I don't see this yet...
Dmitriy V'jukov