I'm developing a SCADA application which processes a lot of data that has to
be served immediately.
At the moment I use SWMRG - the single-writer/multiple-readers guard that
Microsoft proposes - to restrict access in the manner 'one writer, multiple
readers'.
It uses mutexes and a semaphore. But I have begun to notice that SWMRG is too
costly, because it eats a lot of processor time.
I do not need to share the lock with other processes - it is used only
inside one instance of the server.
Does anybody know how to implement such a thing in the most
time-efficient way? A timeout is a must ...
Thanks,
Eugene
--
John Mills, <joh...@slmd.com>
Sr. Software Engineer
Core Systems Group
Spacelabs Medical
Eugene Kormiltsev <eug...@iface.ru> wrote in message
news:38296747...@iface.ru...
John Mills wrote:
> Well... if you only need it inside 1 process, you can change the mutexes to
> critical sections. CS's act the same way, but they cannot be shared between
> processes. Aside from that, they are a bit faster in execution.
>
> --
>
Thanks John,
CS's are fine, they are more than 'a bit' faster.
But unfortunately, there are no timeouts for critical sections.
Maybe somebody has some original thoughts on such a lock?
Eugene
I found one R/W lock implementation in an old project. Unfortunately, it's
badly tied to the rest of the project so it's not easy to extract it (yes, I
know it was stupid, but I had to add it to the project in no time... so I
did not have much time to care about the style). If you can wait one or two
days, I'll reshape it into something more respectful when I have a few spare
hours.
> Eugene
--
Slava
Please send any replies to this newsgroup.
microsoft.public.win32.programmer.kernel
"Slava M. Usov" wrote:
> I found one R/W lock implementation in an old project. Unfortunately, it's
> badly tied to the rest of the project so it's not easy to extract it (yes, I
> know it was stupid, but I had to add it to the project in no time... so I
> did not have much time to care about the style). If you can wait one or two
> days, I'll reshape it into something more respectful when I have a few spare
> hours.
>
> Slava
>
Thank you Slava,
I'll be very grateful if you post it.
And for now I append my solution.
It looks a bit strange, but keep in mind that I've replaced the original SWMRG calls
in the project, so some arguments became unused.
My object allows re-entering the lock with the same access type.
Eugene
//////////////////////// swmrg.h
typedef struct _SWMRG{
    DWORD            writer;                 /* thread id of the current writer, 0 if none */
    LONG             enterers_count;         /* threads currently inside the lock */
    DWORD            writers_waiting_count;  /* to maintain writer's priority */
    CRITICAL_SECTION c_s;
}SWMRG,*PSWMRG;
BOOL SWMRGInitialize(LPSECURITY_ATTRIBUTES lpsa,PSWMRG pSWMRG,LPCTSTR lpszName);
VOID SWMRGDelete(PSWMRG pSWMRG);
DWORD SWMRGEnter(PSWMRG pSWMRG,BOOL fWrite,DWORD Timeout);
VOID SWMRGLeave(PSWMRG pSWMRG);
//////////////////////// swmrg.c
#include <windows.h>
#include "swmrg.h"

#define TICK_THRESHOLD 10

BOOL SWMRGInitialize(
    LPSECURITY_ATTRIBUTES lpsa,
    PSWMRG pSWMRG,
    LPCTSTR lpszName
    )
{
    pSWMRG->writer=0;
    pSWMRG->enterers_count=0;
    pSWMRG->writers_waiting_count=0;
    InitializeCriticalSection(&pSWMRG->c_s);
    lpsa=lpsa;          /* unused, kept for source compatibility with the old SWMRG */
    lpszName=lpszName;  /* unused, the lock is process-local */
    return TRUE;
}

VOID SWMRGDelete(PSWMRG pSWMRG)
{
    DeleteCriticalSection(&pSWMRG->c_s);
}

DWORD SWMRGEnter(PSWMRG pSWMRG,BOOL fWrite,DWORD Timeout)
{
    BOOL fUseTiming=FALSE;
    DWORD InitTc,TickDif,thid=GetCurrentThreadId();
    DWORD cycles=0;
    DWORD writers_waiting_to_compare=0;

    if(fWrite){
        /* announce the writer so that new readers are held back (writer priority) */
        EnterCriticalSection(&pSWMRG->c_s);
        pSWMRG->writers_waiting_count++;
        writers_waiting_to_compare=0xffffffff;  /* writers ignore the waiting-writers test */
        LeaveCriticalSection(&pSWMRG->c_s);
    }
    while(1){
        EnterCriticalSection(&pSWMRG->c_s);
        if(pSWMRG->writers_waiting_count<=writers_waiting_to_compare){
            if(!pSWMRG->writer){
                if(!fWrite)break;                     /* reader: no writer inside - go */
                if(pSWMRG->enterers_count==0)break;   /* writer: lock is empty - go */
            }
        }
        LeaveCriticalSection(&pSWMRG->c_s);
        if(Timeout<TICK_THRESHOLD){
            if(fWrite){
                EnterCriticalSection(&pSWMRG->c_s);
                pSWMRG->writers_waiting_count--;
                LeaveCriticalSection(&pSWMRG->c_s);
            }
            return WAIT_TIMEOUT;
        }
        /* be polite: yield at first, back off to short sleeps after a while */
        if(cycles>=10)Sleep(TICK_THRESHOLD);
        else Sleep(0);
        cycles++;
        if(Timeout!=INFINITE){
            if(!fUseTiming){
                InitTc=GetTickCount();
                fUseTiming=TRUE;
            }
            else {
                TickDif=GetTickCount()-InitTc;
                if(TickDif>=Timeout){
                    if(fWrite){
                        EnterCriticalSection(&pSWMRG->c_s);
                        pSWMRG->writers_waiting_count--;
                        LeaveCriticalSection(&pSWMRG->c_s);
                    }
                    return WAIT_TIMEOUT;
                }
            }
        }
    }
    /* the break above leaves the loop with the critical section still held */
    if(fWrite)pSWMRG->writer=thid;
    pSWMRG->enterers_count++;
    if(fWrite){
        pSWMRG->writers_waiting_count--;
    }
    LeaveCriticalSection(&pSWMRG->c_s);
    return WAIT_OBJECT_0;
}

VOID SWMRGLeave(PSWMRG pSWMRG)
{
    DWORD thid=GetCurrentThreadId();
    EnterCriticalSection(&pSWMRG->c_s);
    if(!pSWMRG->writer){
        if(pSWMRG->enterers_count)pSWMRG->enterers_count--;
    }
    else {
        if(pSWMRG->writer==thid){
            if(pSWMRG->enterers_count)pSWMRG->enterers_count--;
            if(pSWMRG->enterers_count==0)pSWMRG->writer=0;
        }
    }
    LeaveCriticalSection(&pSWMRG->c_s);
}
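For clarity, here is a minimal usage sketch (the lock variable, the timeout
values and the "shared data" are just placeholders, not part of the real
project code):

//////////////////////// usage sketch
SWMRG g_lock;   /* SWMRGInitialize(NULL,&g_lock,NULL) at startup,
                   SWMRGDelete(&g_lock) at shutdown */

void reader(void)
{
    if(SWMRGEnter(&g_lock,FALSE,100)==WAIT_OBJECT_0){   /* read access, 100 ms timeout */
        /* ... read the shared data ... */
        SWMRGLeave(&g_lock);
    }
    /* else: WAIT_TIMEOUT - could not become a reader in time */
}

void writer(void)
{
    if(SWMRGEnter(&g_lock,TRUE,INFINITE)==WAIT_OBJECT_0){   /* write access, wait forever */
        /* ... modify the shared data ... */
        SWMRGLeave(&g_lock);
    }
}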
...
> I found one R/W lock implementation in an old project. Unfortunately, it's
> badly tied to the rest of the project so it's not easy to extract it (yes,
I
> know it was stupid, but I had to add it to the project in no time... so I
> did not have much time to care about the style). If you can wait one or
two
> days, I'll reshape it into something more respectful when I have a few
spare
> hours.
Sorry,
I got some flu on the weekend, so there is a small delay. I hope I will have
everything fixed today and post it later today or tomorrow.
> Sorry,
>
> I got some flu on the weekend, so there is a small delay. I hope I will have
> everything fixed today and post it later today or tomorrow.
>
> --
>
> Slava
>
Don't hurry, Slava, it's OK
Eugene
Thanks, Slava.
The idea is superior. The code should execute very fast.
Only two points are slight weaknesses:
1. The lock does not allow nested calls
2. The use of InterlockedCompareExchange -
in some circumstances my DLL can be loaded in a W95 environment (the lock code
won't be executed, but it will still be loaded - use GetProcAddress()?).
I am not sure that
asm { lock cmpxchg } // the core of the function
is supported on every Pentium, 586 etc., and it is definitely not supported on the 486.
For embedded systems the 80486 is not too obsolete.
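Something like this is what I have in mind - the names are only illustrative,
and if the export is missing the DLL still loads, the lock code is simply
never called:

//////////////////////// run-time binding sketch
#include <windows.h>

/* NT 4.0-era prototype of the kernel32 export */
typedef PVOID (WINAPI *PFN_ICE)(PVOID*, PVOID, PVOID);

static PFN_ICE pfnICE;   /* stays NULL on systems without the export */

BOOL BindInterlockedCompareExchange(void)
{
    HMODULE hKernel = GetModuleHandle(TEXT("kernel32.dll"));
    pfnICE = (PFN_ICE)GetProcAddress(hKernel, "InterlockedCompareExchange");
    return pfnICE != NULL;
}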
Please reply what you think of it
E
Eugene Kormiltsev wrote:
> I am not sure that
> asm { lock cmpxchg } // the core of the function
> is supported on every Pentium, 586 etc., and it is definitely not supported on the 486.
Sorry,
forget it - it's supported
E
Thanks :-) Actually, it's been a while since I did that... now I have an idea
that could make it even faster... in essence, it may be worth removing the
SetEvent() and ResetEvent() calls and replacing WaitForSingleObject() with
Sleep(0) or SwitchToThread()... When I have a few more free hours :-), I'll
try.
> Only two points are slight weaknesses:
>
> 1. The lock does not allow nested calls
As I said, it does allow nested rw_reader_in()/rw_reader_out() calls. Not so
for writers, true. It can be done (of course) by stuffing the current
writer's thread ID in the lock, but then rw_writer_in() will be more complex
and costly performance-wise. Well, if you think it's worth the effort, I'll
try and see.
GetTickCount() is MUCH better than QueryPerformanceCounter on non-SMP
machines when you do not *need* the resolution. QPC turns off interrupts and
takes about 1400 clocks to execute (on my single Pentium II 266 MHz NT4SP5
machine)
GetTickCount() executes completely in user-mode and takes only 11 clocks!
On SMP machines QPC uses RDTSC directly which of course is blindingly fast.
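[A rough sketch of how per-call cycle counts like these can be taken - it is
RDTSC-based, so Pentium or later only, and on SMP the thread should be pinned
with SetThreadAffinityMask() first; the actual harness used for the numbers
in this thread may differ:]

//////////////////////// cycle timing sketch
#include <stdio.h>
#include <windows.h>

static unsigned __int64 read_tsc(void)
{
    unsigned __int64 t;
    __asm {
        rdtsc
        mov dword ptr t, eax
        mov dword ptr t + 4, edx
    }
    return t;
}

void time_gettickcount(void)
{
    const int N = 100000;
    int i;
    unsigned __int64 t0 = read_tsc();
    for (i = 0; i < N; i++) GetTickCount();
    printf("GetTickCount = %I64u clocks\n", (read_tsc() - t0) / N);
}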
>
> 3. The spin counts... there are quite a few words on them in the source, so
> there is not really much to add here. Do ensure you have read the comments. And
> share your ideas, please.
>
> 5. And don't take the demo very seriously!
>
> OK, I'll be interested to hear what you think on it. Thanks.
>
> --
>
> Slava
You want to know what I think ?
You want it - you got it :-)
Implementing a Read/Write lock is an interesting problem that is much more
demanding than most people think. The following is my list of requirements
for the perfect solution:
#1 Safety - A writer is never allowed to hold the lock while any other
writer or reader is holding the lock.
#2 Deadlock free - There must not be any scenario that leads to deadlock.
#3 Lockout free
#3a It must not be possible for readers to permanently lock out writers.
#3b It must not be possible for writers to permanently lock out readers.
#4 Performance - we want maximum performance in situations that do not lead
to blocking
#4a A reader entering or leaving an open lock
#4b A writer entering or leaving an open lock
#4c A reader entering or leaving a lock that is already held by other
readers
#4d A number of readers that are rapidly entering and leaving the lock
should not force unnecessary thread-switching or other overhead
#5 Verifiability - it should be possible to inspect the code and manually
understand how it works.
#6 Deterministic behavior - it should be possible to put an upper bound on
the time a thread must wait to acquire the lock when there is heavy
contention.
If I look at your solution from those requirements, I find the following:
#1 Not easy to be sure, but it looks OK
#2 OK
#3a Sorry - Writers are only let in when the number of readers drops to 0.
That means that if you have multiple readers that use the lock heavily, they
can lock out all writers indefinitely. This is the reason why the test
program times out so often - writers are unable to enter within a
reasonable time.
#3b OK
#4 Your solution tries to avoid costly kernel-mode calls as much as
possible. It does a good job but it could be better.
#4a This situation requires a call to SetEvent() when entering and
ResetEvent when leaving. This costs about 1600 cycles on my machine. -
fairly expensive.
#4b ResetEvent on entry and SetEvent on exit - 1600 cycles.
#4c This is VERY fast - 145 cycles.
#4d This scenario is quite non-deterministic. I get measured values ranging
from 700 - 8000 cycles. The reason it is non-deterministic is the RW_S_TRAN
state. When a reader finds this state it will enter a busy-loop waiting for
the value to change. This means (on a single CPU) that it will burn up the
remainder of its timeslice for no reason - since there is only one CPU, the
value will never change while the thread is looping.
This can easily be improved by adding a Sleep(0) in the case RW_S_TRAN in
rw_reader_in.
After doing that, I measure consistently below 1000 clocks in this scenario.
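The general shape of that fix, sketched outside of the real rw_lock source
(the state word and the constant below only stand in for whatever the lock
actually uses):

//////////////////////// spin-then-yield sketch
#include <windows.h>

#define TRAN_SPIN 100          // spin budget; on a single CPU even 1 is enough

// Wait for a transient state to pass without burning the whole timeslice:
// spin a little for the SMP case, then yield so the owner can finish.
void wait_not_transient(volatile LONG *state, LONG transient_value)
{
    int spins = 0;
    while (*state == transient_value) {
        if (++spins >= TRAN_SPIN) {
            Sleep(0);          // give the rest of the timeslice away
            spins = 0;
        }
    }
}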
NOTE! My performance measurements were made on a single-CPU box with your
spincounts manually changed to 1. The performance on a multi-CPU box will be
different.
#5 Well, the code is not impossible to understand, but it is not easy to
modify. For example, it is not at all clear how it should be changed to
correct #3a above.
#6 The RW_S_TRAN loop and the writer starvation make it non-deterministic.
I have an alternate solution to this problem with the following
characteristics:
#1 OK
#2 OK
#3a OK
#3b OK
#4a 125 clocks
#4b 135 clocks
#4c 125 clocks
#4d This is my main problem - I use a critical section in StartRead() - this
means that if I have a large number of heavily competing readers, each call to
StartRead() will cause a thread switch. This gives a performance of about
2000 - 3000 clocks. :-(
#5 At least *I* think the code is quite easy to understand :-)
#6 As long as other threads finish their work in the lock quickly the
behavior is deterministic.
My code is not finished yet, but I will post it here when it is ready.
If anyone has a solution that is good on all of the points above, I would
love to see it.
/SG
Stefan,
first off, thank you for the in-depth analysis and even timing! of the code.
I should have said from the very beginning that the RW lock was developed on
and *for* SMP machines, and for some other requirements, so it does have
some peculiarities that make it only an *approximation* to a general-purpose
RW lock, and it should only be considered a fully functional sample, nothing
more. Now that I have said that...
[#3a It must not be possible for readers to permanently lock out writers.]
> #3a Sorry - Writers are only let in when the number of readers drops to 0.
> That means that if you have multiple readers that use the lock heavily, they
> can lock out all writers indefinitely. This is the reason why the test
> program times out so often - writers are unable to enter within a
> reasonable time.
Yes, I'm aware, and I even tried to introduce another sub-state meaning
"reader in and a writer is about to enter", with the intention to block
further readers from entering. But I then figured out that this would
conflict with another, *necessary* requirement: the possibility of calling
rw_reader_in() after the thread has already entered the lock as a reader. There is
no easy way to check if a thread is already a reader (without using TLS, and
this is not very clean and not even compact, and will restrict the number of
locks to the number of TLS slots available), so a situation
rw_reader_in() -> RW_S_READ -> [another thread] -> rw_writer_in() ->
RW_S_WRIT_PENDING -> [original thread] -> rw_reader_in() would result in a
deadlock.
But, this problem can be easily fixed without modifying the lock in any way
simply by increasing writer's thread priority.
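(For the record, the TLS variant would look roughly like this - one TlsAlloc()
index per lock, holding the calling thread's read depth, which is exactly why
the number of locks becomes bounded by the number of available TLS indices;
the names are only illustrative:)

//////////////////////// TLS read-depth sketch
#include <windows.h>

static DWORD tls_read_depth;   /* tls_read_depth = TlsAlloc(); in the lock's init */

BOOL this_thread_is_reader(void)
{
    return (DWORD)TlsGetValue(tls_read_depth) > 0;
}

void note_reader_enter(void)
{
    TlsSetValue(tls_read_depth, (LPVOID)((DWORD)TlsGetValue(tls_read_depth) + 1));
}

void note_reader_leave(void)
{
    TlsSetValue(tls_read_depth, (LPVOID)((DWORD)TlsGetValue(tls_read_depth) - 1));
}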
[#4a A reader entering or leaving an open lock]
[#4b A writer entering or leaving an open lock]
> #4a This situation requires a call to SetEvent() when entering and
> ResetEvent when leaving. This costs about 1600 cycles on my machine. -
> fairly expensive.
>
> #4b ResetEvent on entry and SetEvent on exit - 1600 cycles.
Yes, this is my biggest worry and currently I don't know how to overcome it
in a clean way. One way to do it would be to remove SetEvent() and
ResetEvent() calls completely and replace WaitForSingleObject() by
SwitchToThread().
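(A possible shape of such a wait, sketched with made-up names - the caller
would still have to claim the lock state atomically after the loop, and
SwitchToThread() needs NT, so Sleep(0) would be the fallback elsewhere:)

//////////////////////// yield-wait sketch
#include <windows.h>

DWORD yield_wait(volatile LONG *state, LONG open_value, DWORD timeout)
{
    DWORD start = GetTickCount();
    while (*state != open_value) {
        if (!SwitchToThread())
            Sleep(1);                  /* nothing else to run: really sleep */
        if (timeout != INFINITE && GetTickCount() - start >= timeout)
            return WAIT_TIMEOUT;
    }
    return WAIT_OBJECT_0;              /* state looked open; now claim it atomically */
}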
[#4d A number of readers that are rapidly entering and leaving the lock
should not force unnecessary thread-switching or other overhead]
> #4d This scenario is quite non-deterministic. I get measured values ranging
> from 700 - 8000 cycles. The reason it is non-deterministic is the RW_S_TRAN
> state. When a reader finds this state it will enter a busy-loop waiting for
> the value to change. This means (on a single CPU) that it will burn up the
> remainder of its timeslice for no reason - since there is only one CPU, the
> value will never change while the thread is looping.
>
> This can easily be improved by adding a Sleep(0) in the case RW_S_TRAN in
> rw_reader_in.
>
> After doing that, I measure consistently below 1000 clocks in this scenario.
>
> NOTE! My performance measurements were made on a single-CPU box with your
> spincounts manually changed to 1. The performance on a multi-CPU box will be
> different.
True, this is where SMP and non-SMP are clashing. On SMP, it's *not*
desirable to stop busy-waiting on RW_S_TRAN because this state is meant to
be transient and transient quickly. On the other hand, probably this still
has to have an upper limit even on SMP...
[#5 Verifiability - it should be possible to inspect the code and manually
understand how it works.]
> #5 Well, the code is not impossible to understand, but it is not easy to
> modify. For example, it is not at all clear how it should be changed to
> correct #3a above.
As I have said, I doubt there is a solution satisfying #3a and nested
rw_reader_in() requirement and not requiring TLS or a list of readers. I'll
be happy to learn that there *is* a way, though.
[#6 Deterministic behavior - it should be possible to put an upper bound on
the time a thread must wait to acquire the lock when there is heavy
contention.]
> #6 The RW_S_TRAN loop and the writer starvation make it non-deterministic.
I will (sometime) redo it with the SwitchToThread() approach... this should
eliminate most of long operations in RW_S_TRAN state and greatly reduce the
overall amount of spinning. Then combined with an upper limit on spinning in
RW_S_TRAN...
> I have an alternate solution to this problem with the following
> characteristics:
>
> #1 OK
> #2 OK
> #3a OK
> #3b OK
> #4a 125 clocks
> #4b 135 clocks
> #4c 125 clocks
> #4d This is my main problem - I use a critical section in StartRead() - this
> means that if I have a large number of heavily competing readers, each call to
> StartRead() will cause a thread switch. This gives a performance of about
> 2000 - 3000 clocks. :-(
I'll be definitely interested in seeing your code.
Stefan Gustafsson wrote:
>
> My code is not finished yet, but I will post it here when it is ready.
>
Thanks, Stefan,
Eagerly waiting for it..
But what do you think about
Enter/LeaveCriticalSection() timings in different situations?
Eugene
It was a requirement of the project that I was working on. As I said, my RW
lock is only a "particular" case implementation that may be worthless in
another situation. All I wanted was to give the folks, Eugene in particular,
an impression of how a light-weight RW lock could be implemented. And I hoped
that this would create a discussion, and generate other proposals, for
everybody's benefit. Now that we have the discussion and *two* different
implementations, I feel happy :-)
> Oh! I did not understand that nested readers were a necessary requirement for
> you. My solution does not allow nested readers for exactly this reason. I
> suppose you have to choose: Writer starvation or nested readers.
Yes, it seems so to me. That is, it is so unless we want to tag each thread
with the lock it owns, but this will require TLS and careful control of
threading, and this will nuke the relative compactness of the resource
usage, and the ease of use. Besides, there was some internal knowledge so I
*knew* writer starvation would not be a problem.
> > But, this problem can be easily fixed without modifying the lock in any way
> > simply by increasing writer's thread priority.
>
> I do not think so:
>
> It does not matter how high the priority of a writer is because as soon as
> it blocks, it will never be awakened again until all readers have exited. If
> you have multiple readers using the lock heavily, the high-priority writer
> might be blocked forever.
True. Writers do starve.
> The biggest problem is that you call ResetEvent while in RW_S_TRAN, this
> means that if you are using the lock in a tight loop, you might easily spend
> most of the time in RW_S_TRAN.
This is why I want to remove the event altogether.
> > I'll be definitely interested in seeing your code.
>
> You've got it!
This will have to wait till the end of the day...
...
> This will have to wait till the end of the day...
Actually, I found a few minutes to build and run it a few times. I thought
you'd be interested in seeing the output [dual Pentium II]. The results
below are *without* Sleep(0) in rw_reader_in(). Uncommenting Sleep(0) does
have a stabilizing effect, yielding in/out time at about 2470 clocks, but
this is well below the top performance with 1757 clocks. So the general
advice [on SMP] seems to be *against* calling Sleep(0) even in very tight
in/out scenarios.
Hmm, it looks like you serialize in/out scenarios a bit more than necessary
:-)
BTW, these are the results with g_MultiCPU set to TRUE. When I first built
it without touching the variable, your lock would routinely show performance
of more than 8200 clocks per in/out...
OK, this is just a benchmark result and I don't normally trust benchmarks
until I know what is being benchmarked. Time to dig in.
QueryPerformanceCounter = 556 clocks
GetTickCount = 9 clocks
Semaphore wait/signal() = 114 clocks
SetEvent() = 917 clocks
ResetEvent() = 704 clocks
Set/ResetEvent() = 1752 clocks
Sleep(0) = 793 clocks
Enter/LeaveCriticalSection = 81 clocks
Enter/LeaveMutex = 2101 clocks
RWLock:Start/EndRead = 285 clocks
rw_lock:Start/EndRead = 1996 clocks
RWLock:Start/EndWrite = 411 clocks
rw_lock:Start/EndWrite = 1981 clocks
RWLock:Start/EndRead = 269 clocks
rw_lock:second Start/EndRead = 286 clocks
rw_lock:multiple readers Start/EndRead = 1757 clocks Counter=711000
RWLock: multiple readers Start/EndRead = 46665 clocks Counter=26000
QueryPerformanceCounter = 556 clocks
GetTickCount = 9 clocks
Semaphore wait/signal() = 114 clocks
SetEvent() = 1051 clocks
ResetEvent() = 722 clocks
Set/ResetEvent() = 1736 clocks
Sleep(0) = 792 clocks
Enter/LeaveCriticalSection = 81 clocks
Enter/LeaveMutex = 2020 clocks
RWLock:Start/EndRead = 286 clocks
rw_lock:Start/EndRead = 2179 clocks
RWLock:Start/EndWrite = 269 clocks
rw_lock:Start/EndWrite = 1891 clocks
RWLock:Start/EndRead = 269 clocks
rw_lock:second Start/EndRead = 286 clocks
rw_lock:multiple readers Start/EndRead = 1900 clocks Counter=658000
RWLock: multiple readers Start/EndRead = 42983 clocks Counter=28000
QueryPerformanceCounter = 584 clocks
GetTickCount = 9 clocks
Semaphore wait/signal() = 114 clocks
SetEvent() = 919 clocks
ResetEvent() = 723 clocks
Set/ResetEvent() = 1663 clocks
Sleep(0) = 792 clocks
Enter/LeaveCriticalSection = 81 clocks
Enter/LeaveMutex = 1973 clocks
RWLock:Start/EndRead = 286 clocks
rw_lock:Start/EndRead = 2043 clocks
RWLock:Start/EndWrite = 269 clocks
rw_lock:Start/EndWrite = 1859 clocks
RWLock:Start/EndRead = 269 clocks
rw_lock:second Start/EndRead = 286 clocks
rw_lock:multiple readers Start/EndRead = 1962 clocks Counter=624000
RWLock: multiple readers Start/EndRead = 46934 clocks Counter=26000
QueryPerformanceCounter = 556 clocks
GetTickCount = 9 clocks
Semaphore wait/signal() = 114 clocks
SetEvent() = 918 clocks
ResetEvent() = 727 clocks
Set/ResetEvent() = 1630 clocks
Sleep(0) = 793 clocks
Enter/LeaveCriticalSection = 81 clocks
Enter/LeaveMutex = 1977 clocks
RWLock:Start/EndRead = 285 clocks
rw_lock:Start/EndRead = 1987 clocks
RWLock:Start/EndWrite = 269 clocks
rw_lock:Start/EndWrite = 1810 clocks
RWLock:Start/EndRead = 310 clocks
rw_lock:second Start/EndRead = 286 clocks
rw_lock:multiple readers Start/EndRead = 2308 clocks Counter=542000
RWLock: multiple readers Start/EndRead = 44304 clocks Counter=27000
After a bit more of fiddling with the constants, I found this tendency:
reducing both INNER_SPIN and OUTER_SPIN makes performance better, with the
peak performance of about 8200 clocks per in/out when both are zero [the
same as g_MultiCPU is zero]. Reducing INNER_SPIN produces bigger impact,
which is demonstrated by setting the INNER_SPIN to zero and leaving the
OUTER_SPIN at 100. This gives a varying picture of 9000 to 11000 clocks.
But, honestly, this is not illustrative because this benchmark is
pathological:
1. Your code imposes very stringent serialization on threads when they enter
and leave the lock even as readers;
2. The threads are doing *nothing* other than entering or leaving the lock
[ergo]
3. We have a few threads basically competing for a mutually exclusive
critical section, all in tight loops.
So the problem is the serialization. This is best magnified by setting the
MAX_THREAD to 1, when the performance of your lock is as high as 341 [the
performance of mine is 2020], while setting MAX_THREAD to 2 immediately
brings your lock down to 7600. Interestingly, 2 threads give peak
performance to my lock, up to 1000 clocks. This is, of course, due to the
dual-processor machine. Setting MAX_THREAD to 3 then sends your lock to 8700
where it stays no matter how many threads are running [I have tried up to 64;
higher values would give weird results -- like 1 clock -- and occasional
access violations somewhere, and I was too lazy to see where]. About the
same can be said about my lock, with more than 2 threads its performance is
about 2020, with occasional dropdowns [to more than 40000], which can be
eliminated by uncommenting Sleep(0).
By the way, I have modified rw_reader_in() and the overall performance is a
bit better [about 300 clocks less on average]. This is the modification:
    case RW_S_FREE:
        IIN(&rw->readers);
        IEX(&rw->state, RW_S_READ);
        if( !ResetEvent(rw->event) )
        {
            u err = GetLastError();
            // restore the lock
            while( IEX(&rw->state, RW_S_TRAN) != RW_S_READ );
            if( IDE(&rw->readers) == 0 )
                IEX(&rw->state, RW_S_FREE);
            else
                IEX(&rw->state, RW_S_READ);
            return err;
        }
        return ERROR_SUCCESS;
Basically, now it performs ResetEvent() outside of RW_S_TRAN loop. This
improves the performance in this *pathological* case, but in case of writers
competing with readers it may be actually worse. Note that in this
pathological case my lock performs better if both spin counts are 1, which
is a clear indication of the pathology.
So I tried to measure more realistic "reader crowding" when readers do
something while they're readers. Basically, I simply added Sleep(0); between
entering and leaving of the locks in *both* threads, and run it with 64
threads. The other settings were:
for my lock, RW_SPIN_READER set to 100
for your lock, both INNER_COUNT and OUTER_COUNT set to 100, g_MultiCPU to 1
a typical output is:
QueryPerformanceCounter = 556 clocks
GetTickCount = 9 clocks
Semaphore wait/signal() = 114 clocks
SetEvent() = 930 clocks
ResetEvent() = 705 clocks
Set/ResetEvent() = 1651 clocks
Sleep(0) = 792 clocks
Enter/LeaveCriticalSection = 81 clocks
Enter/LeaveMutex = 1973 clocks
RWLock:Start/EndRead = 286 clocks
rw_lock:Start/EndRead = 2032 clocks
RWLock:Start/EndWrite = 267 clocks
rw_lock:Start/EndWrite = 1858 clocks
RWLock:Start/EndRead = 270 clocks
rw_lock:second Start/EndRead = 292 clocks
rw_lock:multiple readers Start/EndRead = 1210 clocks Counter=1024000
RWLock: multiple readers Start/EndRead = 1274 clocks Counter=960000
where my lock (typically) performs [very] slightly better. Setting
g_MultiCPU to zero [which used to give the top performance to your lock in
the pathological case] immediately sent your lock back to ca 8000 clocks per
in/out. With greater sleep values, the in/out time of both locks would get
higher [because it's measured *with* sleeping], and the difference between
the locks would disappear completely [for instance, the same settings, with
Sleep(10) would give a very stable value of 56773 clocks, and g_MultiCPU
would have *no* effect on it at all!].
So the conclusion is, in case of reader crowding my lock is marginally
better, in case of very tight [pathological] crowding it's significantly
better but an additional stabilization [Sleep(0) in RW_S_TRAN busy-waiting]
may be necessary.
As for a real reader/writer situation, the writer in/out and readers in/out
sequence is inherently slow because waiting on an event or a semaphore is
unavoidable with either lock. The advantage of your lock is that it prevents
writer starvation [by starving readers]. To the defense of my lock :-),
since it supports timeouts, the excessive writer starvation can be made
visible to a monitor or a human administrator, and corrective actions can be
applied [in fact, this is how it was done in my original project, where writing
was only possible by a direct administrative intervention, so it was up to
the administrator to prevent writer starvation, and there were some means
for it].
I still want to remove the SetEvent()/ResetEvent() drain...
Yes, calling Sleep(0) was never meant to be used on SMP - it is just
supposed to avoid burning up a full timeslice on non-SMP.
>
>Hmm, it looks like you serialize in/out scenarios a bit more than necessary
>:-)
I know - each entry to the monitor causes a thread-switch when there are
multiple readers competing for entry.
>
>BTW, these are the results with g_MultiCPU set to TRUE. When I first built
>it without touching the variable, your lock would routinely show performance
>of more than 8200 clocks per in/out...
Are you saying that the last benchmark test (multiple readers) took 8200
clocks with g_MultiCPU=FALSE and 46665 with g_MultiCPU=TRUE ?
I have not tested the code on an SMP-machine. It seems that the spinloop is
not working the way it is supposed to :-(
>
>OK, this is just a benchmark result and I don't normally trust benchmarks
>until I know what is being benchmarked. Time to dig in.
>
>QueryPerformanceCounter = 556 clocks
>GetTickCount = 9 clocks
>Semaphore wait/signal() = 114 clocks
>SetEvent() = 917 clocks
>ResetEvent() = 704 clocks
>Set/ResetEvent() = 1752 clocks
>Sleep(0) = 793 clocks
>Enter/LeaveCriticalSection = 81 clocks
>Enter/LeaveMutex = 2101 clocks
>RWLock:Start/EndRead = 285 clocks
>rw_lock:Start/EndRead = 1996 clocks
>RWLock:Start/EndWrite = 411 clocks
>rw_lock:Start/EndWrite = 1981 clocks
>RWLock:Start/EndRead = 269 clocks
>rw_lock:second Start/EndRead = 286 clocks
>rw_lock:multiple readers Start/EndRead = 1757 clocks Counter=711000
>RWLock: multiple readers Start/EndRead = 46665 clocks Counter=26000
I know this - as a matter of fact, I do not think that the testcase is
pathological at all:
The lock was actually designed with the assumption that threads would spend
significantly longer time outside the lock than inside. If that is the case,
the probability for a threadswitch while a thread holds the critical section
is very low, and my lock works very well.
My original testcase actually had a Sleep(0) between entering and leaving
the lock. As your testing shows, this shows excellent performance. (The
reason it is so good, is that no thread is ever preempted while holding a
critical section)
Then I started to think about the real world. What is the most common
(and interesting performance-wise) use-case for a rw-lock ?
Well, there are basically two alternatives:
#1 You do very little while holding the lock (like scanning a short FIFO)
the time spent in the lock is on the order of 100-1000 clocks
#2 You do heavy work while holding the lock - serious processing or even
blocking waiting for IO. Time is on the order of 10-1000 ms
Case #2 is what you test when you have a Sleep(0) while holding the lock.
Since you are spending so much time doing other things, the actual
performance of the lock does not matter at all. As a matter of fact, a more
realistic test would be to spin in an empty loop simulating actual work.
Sleep(0) causes an immediate threadswitch which is not very realistic.
Case #1 is where the performance of the lock itself is really important.
Since you might only spend 100 clocks holding the lock, an overhead of
2000-8000 clocks just to maintain the lock is very bad.
My "pathological" testcase is designed to test #1, just to expose the
fundamental problem of my lock implementation. (The critical section)
What I found was that my lock behaved exactly as predicted: It caused a
thread-switch for each entry.
The interesting thing is that your lock also exhibits bad performance in
this case. The reason of course is that your RW_S_TRAN state actually behaves
as a critsec with busy wait. (improved on single CPU by adding the Sleep(0))
The best lock would avoid any kind of blocking or spinning when readers were
just competing with other readers. I do not really know if this is possible,
but it is something to think about. What I am aiming for is a consistent
performance of less than 300 clocks for all cases that do not involve
blocking.
I believe the solution would look more like your implementation than mine,
because the whole monitor concept relies on mutual exclusion.
> where it stays no matter how many threads are running [I have tried up to 64;
> higher values would give weird results -- like 1 clock -- and occasional
> access violations somewhere, and I was too lazy to see where]. About the
WaitForMultipleObjects can not wait for more than 64 objects. ;-)
>
> As for a real reader/writer situation, the writer in/out and readers in/out
> sequence is inherently slow because waiting on an event or a semaphore is
> unavoidable with either lock. The advantage of your lock is that it prevents
> writer starvation [by starving readers]. To the defense of my lock :-),
> since it supports timeouts, the excessive writer starvation can be made
> visible to a monitor or a human administrator, and corrective actions can be
> applied [in fact, this is how it was done in my original project, where writing
> was only possible by a direct administrative intervention, so it was up to
> the administrator to prevent writer starvation, and there were some means
> for it].
I do not think that my lock starves readers. When a writer has finished
writing it will always wake up a reader if there is a waiting reader. When
the last reader finishes it wakes up a writer.
This means that if both readers and writers are competing to enter the lock
they will take turns.
/SG
BOOL TimedOutCriticalSection( CRITICAL_SECTION *pcs, int timeout, int step ) {
    int total_time = 0;
    /* TryEnterCriticalSection() needs NT 4.0 or later */
    while( TryEnterCriticalSection( pcs ) == FALSE ) {
        Sleep( step );
        total_time += step;
        if( total_time >= timeout )
            return FALSE;
    }
    return TRUE;
}
which will time out after "timeout" milliseconds, returning FALSE.
"step" makes the race fine grained (more CPU burned) or coarse grained
(bigger delays while sleeping), and can even give some threads a statistical
advantage, if for them it is smaller than for the others.
Of course the big flaw is that no thread enters _as soon as_ the owner of the
CS exits... But if the threads' load is counted in 100s of milliseconds, and
step = 5 msec, then the loss can be tolerated, I suppose...
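A possible call site, just to show the intended shape (the data and the
numbers are of course placeholders; note also that TryEnterCriticalSection()
itself needs NT 4.0 - it is not exported on Win9x):

CRITICAL_SECTION g_cs;   /* InitializeCriticalSection(&g_cs) somewhere at startup */

void touch_shared_state(void)
{
    if( TimedOutCriticalSection( &g_cs, 200, 5 ) ) {   /* try for 200 ms, poll every 5 ms */
        /* ... work on the shared data ... */
        LeaveCriticalSection( &g_cs );
    }
    else {
        /* timed out - retry, log, whatever fits */
    }
}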
Dimitris
Eugene Kormiltsev <eug...@iface.ru> wrote in message
news:382ABFC7...@iface.ru...
> CS's are fine, they are more than 'a bit' faster.
> But unfortunately, there are no timeouts for critical sections.
> Maybe some original thoughts for such a lock?
>
> Eugene
>
>
>
If the assumption was "the threads spend just a few clocks in the lock",
then, sorry, it does not appear to work very well. See below.
> My original testcase actually had a Sleep(0) between entering and leaving
> the lock. As your testing shows, this shows excellent performance. (The
> reason it is so good, is that no thread is ever preempted while holding a
> critical section)
>
> Then I started to think about the real world. What is the most common
> (and interesting performance-wise) use-case for a rw-lock ?
>
> Well, there are basically two alternatives:
>
> #1 You do very little while holding the lock (like scanning a short FIFO)
> the time spent in the lock is on the order of 100-1000 clocks
>
> #2 You do heavy work while holding the lock - serious processing or even
> blocking waiting for IO. Time is on the order of 10-1000 ms
>
> Case #2 is what you test when you have a Sleep(0) while holding the lock.
> Since you are spending so much time doing other things, the actual
> performance of the lock does not matter at all. As a matter of fact, a more
> realistic test would be to spin in an empty loop simulating actual work.
> Sleep(0) causes an immediate threadswitch which is not very realistic.
Real work, meaning IO, is pretty much realistically modelled by Sleep() with
zero or non-zero delays. And as I wrote, both zero and non-zero delays
demonstrated excellent *and equal* performance of both locks. So I do believe
that both locks [and, actually, any locks, even those built with mutexes]
will work wonderfully when there is real work being done. Even replacing
Sleep(0) or Sleep(10) with
for( int j = 0; j < 20000; j++ ) NOP();
would confirm it, by showing very stable *and equal* figures.
> Case #1 is where the performance of the lock itself is really important.
>
> Since you might only spend 100 clocks holding the lock, an overhead of
> 2000-8000 clocks just to maintain the lock is very bad.
>
> My "pathological" testcase is designed to test #1, just to expose the
> fundamental problem of my lock implementation. (The critical section)
>
> What I found was that my lock behaved exactly as predicted: It caused a
> thread-switch for each entry.
>
> The interesting thing is that your lock also exhibits bad performance in
> this case. The reason of course is that your RW_S_TRAN state actually behaves
> as a critsec with busy wait. (improved on single CPU by adding the Sleep(0))
I still think that your benchmark is pathological, so I modified it to model
just the case #1: readers spend 100-1000 clocks while holding the lock. It
was simply done by putting
for( int j = 0; j < 100; j++ ) NOP();
instead of Sleep(0). This would result in 4 machine instructions
$NOP: ret
$call: call $NOP
dec esi
jne $call
being executed for 100 times, which was timed on my machine as 613 clocks.
Again, this case required your lock to have g_MultiCPU set to zero to
achieve the top performance of about 8700 clocks, which is *very* close to
the performance with Sleep(0). So the serialization is still present and the
lock does not "work very well". BTW, your lock performs at about the same
level with any number of threads greater than 2. When there is only one
thread, it's 877 (as opposed to 200 with no load), and with two threads, it
does show an amazing level of performance at 560. This still runs with more
than double overhead as compared to the single-threaded case, because 560 *
2 - 877 = 240, and the single-threaded overhead is 200 clocks.
But the real surprise came when I started fiddling with my lock. Initially,
with Sleep(0) in rw_reader_in() commented out, with the yesterday's update,
and with another update [I replaced IIN() and IDE() in rw_reader_in() and
rw_reader_out() because I noticed there is already serialization via
RW_S_TRAN where they're called, and this reduced second
rw_reader_in()/rw_reader_out() from 292 to 212 clocks], running with 64
threads would give *very* varying results from 400 to 4000 clocks and
occasionally much worse. The range, however, would be practically
exponentially dependent on the number of threads, falling off very rapidly
and reaching 500 clocks at -- you guess it! -- two threads. I thought that
could have been a result of a thread being preempted holding the lock at
RW_S_TRAN, making (N - 1) / 2 [where N is the number of threads] timeslices
wasted by other threads in tight spinning. So I did follow your advice and
uncommented Sleep(0) in spinning on RW_S_TRAN, and this sent the performance
of *any* number of threads to 500 clocks!
In case of one thread, however, the lock performs at 2800 [613 clocks load
included, mind you] :-(. Anyway, given that the second
rw_reader_in()/rw_reader_out() without load is 212, singly threaded "second"
performance would be 212 + 613 = 825, so we still have 500*2 - 825 = 175
"additional" overhead spent, presumably, in Sleep(0) and spinning. Thus the
overall overhead is about 390 clocks [slightly less than double singly
threaded second one... hmm, do these words make any sense :-)], which is a
bit too much. Still it's good that it's [statistically] constant with the
number of threads varying. Also, it can be significantly reduced by
implementing IEX() and ICE() inline, because each one takes 45 clocks to
execute, and there are four invocations of IEX() during in/out sequence,
summing to 180 clocks, plus I believe there is additional overhead involved
in calling them which was optimized away in measuring loops, but is present
in real code, probably stealing 200 or so clocks. So it's possible to
replace 200 clocks overhead by probably 50 clocks via careful coding in asm,
giving a total of 240 overhead, which I believe is very modest. That's for
real asm lovers, of course :-)
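(For the asm lovers, the inline versions could look something like this -
assuming IEX()/ICE() are just wrappers around InterlockedExchange() and
InterlockedCompareExchange(); XCHG with a memory operand is implicitly locked,
so only CMPXCHG needs the LOCK prefix:)

//////////////////////// inline exchange sketch (x86, MSVC inline asm)
__inline LONG iex_inline(volatile LONG *target, LONG value)
{
    LONG old;
    __asm {
        mov  edx, target
        mov  eax, value
        xchg [edx], eax          // atomic swap, old value ends up in EAX
        mov  old, eax
    }
    return old;
}

__inline LONG ice_inline(volatile LONG *target, LONG exchange, LONG comperand)
{
    LONG old;
    __asm {
        mov  edx, target
        mov  ecx, exchange
        mov  eax, comperand
        lock cmpxchg [edx], ecx  // if [edx]==EAX then [edx]=ECX; EAX gets the old [edx]
        mov  old, eax
    }
    return old;
}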
So, now, I do think that very tight reader in/out scenarios with the number
of threads more than 2 times greater than the number of CPUs do indeed
perform better with Sleep(0) in the RW_S_TRAN loop, and anything between 1 and 2
times should be very carefully considered. Note that in most server
applications that use IOCP the number of running threads is usually exactly
the number of CPUs. Also note that with Sleep(0) in the RW_S_TRAN loop *and*
Sleep(0) between rw_reader_in() and rw_reader_out() the performance is only
1000, and this is why it should be *carefully* considered.
> The best lock would avoid any kind of blocking or spinning when readers were
> just competing with other readers. I do not really know if this is possible,
> but it is something to think about. What I am aiming for is a consistent
> performance of less than 300 clocks for all cases that do not involve
> blocking.
>
> I believe the solution would look more like your implementation than mine,
> because the whole monitor concept relies on mutual exclusion.
Incidentally, I think so too :-) Well, seriously, I can accept the current
340 clocks overhead because it's *portable* overhead, but I really am not
happy with 2000 overhead of singly threaded "first" scenario. This is what
I'd like to consider next week [because currently, unfortunately, I have to
do what I'm paid for :-)]. If anybody has an idea or [better] a working
implementation, I will be fascinated.
> > where it stays no matter how many threads are running [I have tried up to 64;
> > higher values would give weird results -- like 1 clock -- and occasional
> > access violations somewhere, and I was too lazy to see where]. About the
>
> WaitForMultipleObjects can not wait for more than 64 objects. ;-)
Oh, yeah, what was I thinking!
> > As for a real reader/writer situation, the writer in/out and readers in/out
> > sequence is inherently slow because waiting on an event or a semaphore is
> > unavoidable with either lock. The advantage of your lock is that it prevents
> > writer starvation [by starving readers]. To the defense of my lock :-),
> > since it supports timeouts, the excessive writer starvation can be made
> > visible to a monitor or a human administrator, and corrective actions can be
> > applied [in fact, this is how it was done in my original project, where
> > writing was only possible by a direct administrative intervention, so it was
> > up to the administrator to prevent writer starvation, and there were some
> > means for it].
>
> I do not think that my lock starves readers. When a writer has finished
> writing it will always wake up a reader if there is a waiting reader. When
> the last reader finishes it wakes up a writer.
>
> This means that if both readers and writers are competing to enter the lock
> they will take turns.
Well, really, this is wholly dependent on what you think is fair behavior.
If you think [have defined it so] that a fair lock is a lock that gives 1/2
probability of entering the lock to a reader or a writer, without regard to the
distribution of readers and writers in the application, then, yes, your lock
is fair and nobody is starving [tell this to 10 readers who always have to
wait for one writer].
But, if a fair lock is [has been defined to be] a lock that maintains the
"natural" readers/writers distribution, we have a different picture. The
stringent definition is: if the probability of a thread *wanting* to become
reader is p [and (1 - p) to become writer] equals the probability p' of a
thread actually becoming a reader [by entering the lock], then the lock is
fair. If p < p', then it's writer starvation. If p > p' then it's reader
starvation. Your lock tries to maintain p' == 1/2, so if p < 1/2 [there are
less readers than writers, an atypical situation], we have writer
starvation; when p > 1/2 [less writers than readers], we have reader
starvation. Your lock is only fair when p = 1/2 but in this case it exhibits
highly serialized performance like ... W - R - W - R ..., and is best
replaced by a critical section altogether. :-)
We all love abstract stuff, don't we?
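(Restating the definition above in symbols, nothing new:)

\[
p = \Pr[\text{a thread arriving at the lock wants to read}], \qquad
p' = \Pr[\text{a thread that actually enters the lock is a reader}]
\]
\[
p = p' : \text{fair} \qquad
p < p' : \text{writer starvation} \qquad
p > p' : \text{reader starvation}
\]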
Unfortunately, Sleep() is only as accurate as the length of a timeslice is.
So the net result of this will be a highly serialized *and*
non-deterministic RW lock. To better understand what it means, keep in mind
that my lock is *not* highly serialized and non-deterministic [which is its
problem, although not as much as the "first" overhead is]; Stefan's lock is
serialized [which is its problem] and deterministic [which is its
advantage]. So what you suggest combines all of the disadvantages of both
locks.
--
Slava
Please send any replies to this newsgroup.
microsoft.public.win32.programmer.kernel
>
I have to read your text more carefully later, but I just want to clarify
one misunderstanding:
> > I know this - as a matter of fact, I do not think that the testcase is
> > pathological at all:
> >
> > The lock was actually designed with the assumption that threads would spend
> > significantly longer time outside the lock than inside. If that is the case,
> > the probability for a threadswitch while a thread holds the critical section
> > is very low, and my lock works very well.
>
> If the assumption was "the threads spend just a few clocks in the lock",
> then, sorry, it does not appear to work very well. See below.
No - this is not what I meant.
What I meant was that when I originally designed the lock, my assumption was
that threads did significant work while holding the lock. (Equivalent to
Sleep(0) or an empty loop)
This use-case still "works very well".
The problem is the other case (very little work when holding the lock). In
this case the speed of the lock itself is most important, it is also this
case that gives my lock an unnecessary thread-switch for each entering
reader.
This is my whole point: My lock is not good enough when threads do little
work holding the lock. That is the only reason I am not satisfied with it.
As I have said before: The important performance criteria for a RW-lock is
that it has as low overhead as possible for situations that do not require
blocking. (including multiple readers doing high-speed in-out) As it stands
at the moment, my lock is no better than a critsec for high-speed scenarios.
/SG
All points taken, all agreed.
After completely rethinking the RWLock problem, I came up with the following
super-fast and super-simple implementation:
class RWLock2 {
public:
    RWLock2()
    {
        readers = writers = 0;
        InitializeCriticalSection(&cs);
    }
    ~RWLock2()
    {
        DeleteCriticalSection(&cs);   // the critsec must eventually be deleted
    }
    void StartRead(bool avoidStarvingWriters = true)   // parameter currently unused
    {
        InterlockedIncrement(&readers);
        // Wait until all writers have left
        while (writers > 0)
            Sleep(0);
    }
    void EndRead()
    {
        InterlockedDecrement(&readers);
    }
    void StartWrite()
    {
        // Use a critsec for mutual exclusion between writers
        EnterCriticalSection(&cs);
        // We must now wait until all readers have left the lock
        InterlockedIncrement(&writers); // Assume there are no readers
        while (readers > 0) {
            // Readers in the lock.
            // Decrement writers again to avoid deadlocking with readers that
            // are interrupted immediately after incrementing readers
            InterlockedDecrement(&writers);
            Sleep(0);
            InterlockedIncrement(&writers); // Try again
        }
    }
    void EndWrite()
    {
        InterlockedDecrement(&writers);
        LeaveCriticalSection(&cs);
    }
protected:
    long readers; // The number of readers in the lock
    long writers; // The number of writers in the lock
    CRITICAL_SECTION cs;
};
This lock has the following properties:
* Spectacular performance: non-blocking in-out scenarios take less than 100
clocks.
* No blocking or looping at all while readers compete with readers.
* It is not starvation-free in theory but practical experimentation does not
show any starvation.
* Nested reading is allowed
* Nested writing is not allowed - it might cause deadlock with readers
trying to get in.
It is always the simplest solutions that are the best :-)
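A possible call pattern, mostly to stress the usage rule (names are
placeholders):

RWLock2 g_lock;

void Reader()
{
    g_lock.StartRead();
    // ... read the shared structure; nested StartRead()/EndRead() is OK here ...
    g_lock.EndRead();
}

void Writer()
{
    g_lock.StartWrite();
    // ... modify the shared structure; do NOT call StartWrite() again or
    //     StartRead() from in here - either can deadlock ...
    g_lock.EndWrite();
}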
/SG
Agree, but I have also [quite incidentally :-)] been thinking on using
Sleep(0) instead of any synchronization object, and there appears to be an
argument that makes me vote *against* such schemes: thread priorities. Say
you have a writer with a high thread priority, and the writer wakes up very
infrequently, while there are normally lots of [less prioritized] readers
hanging around the lock. So there are good chances that anytime the writer
wakes there is at least one reader in the lock... do I need to expand on
this? It's trivial to see that on uni-processor computers this will result
in the writer spinning forever without any chance for the readers to get
outta the lock. Well, actually NT will give even the lowest priority threads
some time, but the overall execution will be bad. By the way, the same will
happen if we have an occasional high-priority reader...
I have not done any field testing on this yet...
Hmm
I thought that Sleep(0) always put the current thread at the end of the
queue of runnable threads regardless of priority.
But you are right, the documentation explicitly states that it only works as
I intended for threads with the same priority.
I suppose I could just give up and say that the lock is only meant for
threads of equal priority.
Hmm - I wonder how Sleep(1) or SwitchToThread is handled...
I will have to run some tests.
/SG