
More efficient TCriticalSection for multi core machines


Bart van der Werf

Dec 18, 2006, 6:49:17 AM
This replacement for TCriticalSection performs up to ten times better in cases where there is a lot of contention between threads on a lock that is held often and briefly, on computers with multiple cores/CPUs/hyperthreading.
It does this by making the critical section 'unfair': a thread that holds the lock may leave and re-enter it as long as no other thread has successfully entered it in between. Having other threads waiting on the mutex will not prevent you from re-entering it if none of them has yet successfully entered it.
This strongly reduces the amount of context switching on a highly contested lock.
It does introduce a minor risk of starvation.

http://blogs.msdn.com/larryosterman/archive/2004/03/29/101329.aspx

http://msdn.microsoft.com/msdnmag/issues/05/10/ConcurrentAffairs/

http://msdn.microsoft.com/msdnmag/issues/06/03/ConcurrentAffairs/

http://msdn.microsoft.com/msdnmag/issues/06/06/ConcurrentAffairs/

http://msdn.microsoft.com/msdnmag/issues/06/09/ConcurrentAffairs/

http://msdn.microsoft.com/msdnmag/issues/06/11/ConcurrentAffairs/


unit BWOpertunisticMutex;

interface

uses
  SyncObjs, Windows;

type
  TBWOpertunisticMutex = class(TSynchroObject)
  private
    FNotifyRelease: Integer;
    FSection: TRTLCriticalSection;
    FHandle: THandle;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Acquire; override;
    procedure Release; override;
    function TryAcquire: Boolean;
  end;

implementation

procedure TBWOpertunisticMutex.Acquire;
var
  Timeout: Boolean;
begin
  if not TryEnterCriticalSection(FSection) then
  begin
    FNotifyRelease := 1;
    Timeout := False;
    while not (Timeout or TryEnterCriticalSection(FSection)) do
    begin
      Timeout := WaitForSingleObject(FHandle, 1000) <> WAIT_OBJECT_0;
      FNotifyRelease := 1; // always set this again so that the event stays preheated
    end;
    if Timeout then
      EnterCriticalSection(FSection); // starvation? let's go 'fair'
  end;
end;

constructor TBWOpertunisticMutex.Create;
begin
  inherited Create;
  InitializeCriticalSection(FSection);
  FHandle := CreateEvent(nil, False, False, '');
end;

destructor TBWOpertunisticMutex.Destroy;
begin
  CloseHandle(FHandle);
  DeleteCriticalSection(FSection);
  inherited;
end;

procedure TBWOpertunisticMutex.Release;
begin
  if FSection.RecursionCount = 1 then
  begin
    LeaveCriticalSection(FSection);
    if FNotifyRelease = 1 then
    begin
      FNotifyRelease := 0;
      SetEvent(FHandle);
    end;
  end
  else
    LeaveCriticalSection(FSection);
end;

function TBWOpertunisticMutex.TryAcquire: Boolean;
begin
  Result := TryEnterCriticalSection(FSection);
end;

end.


Bart van der Werf

Dec 18, 2006, 10:09:09 AM
Using the following might help some more; this basically adds a spinlock besides working around the fair scheduling.

const
  cSpinLimit = 10000;

procedure TBWOpertunisticMutex.Acquire;
var
  Timeout: Boolean;
  Spin: Integer;
begin
  if not TryEnterCriticalSection(FSection) then
  begin
    Spin := cSpinLimit;
    while (Spin > 0) and not TryEnterCriticalSection(FSection) do
      Dec(Spin);
    if Spin = 0 then
    begin
      FNotifyRelease := 1;
      Timeout := False;
      while not (Timeout or TryEnterCriticalSection(FSection)) do
      begin
        Timeout := WaitForSingleObject(FHandle, 1000) <> WAIT_OBJECT_0;
        FNotifyRelease := 1; // always set this again so that the event stays preheated
      end;
      if Timeout then
        EnterCriticalSection(FSection); // starvation? let's go 'fair'
    end;
  end;
end;

constructor TBWOpertunisticMutex.Create;
begin
  inherited Create;
  InitializeCriticalSectionAndSpinCount(FSection, cSpinLimit);
  FHandle := CreateEvent(nil, False, False, '');
end;

Robert Houdart

Dec 18, 2006, 10:16:31 AM
Hello Bart,

Looks interesting.
How did you measure the performance improvement?
What exactly do you mean by "starvation" and how "minor" is the risk?

Robert


Bart van der Werf

Dec 18, 2006, 10:23:51 AM

"Robert Houdart" <rob...@no.com> wrote in message
news:4586b08e$1...@newsgroups.borland.com...

> Hello Bart,
>
> Looks interesting.
> How did you measure the performance improvement?

1, 2, or 8 threads running the inner loop at the same time.

This was tested on a Core Duo machine; the single-core tests are run by setting the CPU affinity to a single core.
TCriticalSection:

single core:
1: 150-167
2: 683-696 (340)
8: 6863-7094 (850)

dual core:
1: 146-148
2: 5951-6885 (3000)
8: 28862 (3600)

TBWOpertunisticMutex:

single core:
1: 154-159
2: 1005-1032 (500) (this case is weaker than the default)
8: 4365-4416 (545)

dual core:
1: 152-159
2: 494-582 (250)
8: 2884-2944 (360)


The inner loop:

for I := 1 to 1000000 do
begin
  FMutex.Acquire;
  if (I mod 10) = 0 then
    Sleep(0);
  FMutex.Release;
end;


> What exactly do you mean by "starvation" and how "minor" is the risk?

By starvation I mean that one thread might always get the lock, because it is never context-switched out while holding the lock, so no other thread ever gets it (well, at least not until the timeout case in the code kicks in after at least 1 second).

However, this should hardly ever be the case; you have bigger problems in your code if it is.

>
> Robert
>
>


Robert Houdart

Dec 18, 2006, 10:58:28 AM
Bart,

Thanks. That's indeed a high contention scenario!

It's not unlikely that in this case the performance bottleneck is the
Critical Section itself (without even considering the context switches).
This was exactly the situation of the Memory Manager, in which we replaced
the Critical Section with a spinlock to get a huge performance boost.
You might consider testing an alternative approach: getting rid of the
critical section altogether, and replacing it with a spinlock.

In the MM typically the following kind of spinlock was used (FSpinLock is an
integer):

[code]
procedure EnterSpinLock;
begin
  if InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 then
  begin
    Sleep(0);
    while InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 do
      Sleep(1);
  end;
end;

procedure LeaveSpinLock;
begin
  FSpinLock := 0;
end;
[/code]

This might or might not work in your test scenario, hard to tell without
trying...

Robert


Bart van der Werf

Dec 18, 2006, 11:10:26 AM

"Robert Houdart" <rob...@no.com> wrote in message
news:4586...@newsgroups.borland.com...

> Bart,
>
> Thanks. That's indeed a high contention scenario!
>
> It's not unlikely that in this case the performance bottleneck is the
> Critical Section itself (without even considering the context switches).
> This was exactly the situation of the Memory Manager, in which we replaced
> the Critical Section with a spinlock to get a huge performance boost.
> You might consider testing an alternative approach: getting rid of the
> critical section altogether, and replacing it with a spinlock.

FastMM4? Great thing :)

I've updated the mutex to include a spin lock, with TryEnterCriticalSection instead of InterlockedCompareExchange.
I have yet to time this; one of the issues is that I do not want any busy waiting after the spin count is exhausted, because in the application where I want to use these there are a lot more threads than CPUs.

>
> In the MM typically the following kind of spinlock was used (FSpinLock is
> an integer):
>
> [code]
> procedure EnterSpinLock;
> begin
> if InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 then begin
> Sleep(0);
> while InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 do
> Sleep(1);
> end;
> end;

I would suggest

if InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 then
begin
  i := cLimit;
  while (i > 0) and (InterLockedCompareExchange(FSpinLock, 1, 0) <> 0) do
    Dec(i);
  if i = 0 then
    while InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 do
      SwitchToThread; // explicitly switch to another thread
end;

Robert Houdart

Dec 18, 2006, 11:42:16 AM
Bart,

> FastMM4 ? great thing :)

I worked on BucketMM, the runner-up of the MM Challenge, of which quite some
ideas made it into FastMM4 (and vice versa).

> I've updated the mutex to include a spin lock, with
> TryEnterCriticalSection instead of InterlockedCompareExchange

The Critical Section *really* was a performance killer in the MM.
If you can, get rid of it...

> I would suggest
>
> if InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 then
> begin
> i := cLimit;
> while (i > 0) and (InterLockedCompareExchange(FSpinLock, 1, 0) <> 0) do
> Dec(i);
> if i = 0 then
> while InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 do
> SwitchToThread; // explicitly switch to another thread
> end;

Even a single InterLockedCompareExchange instruction can be relatively
expensive compared to the rest of the code, especially in a high contention
scenario.
So I'm not convinced that your more complex code will perform better than
the very simple code used both in FastMM4 and BucketMM.

Only the benchmark can tell...

Robert


Bart van der Werf

Dec 18, 2006, 12:09:03 PM

Robert Houdart,

>
>> FastMM4 ? great thing :)
>
> I worked on BucketMM, the runner-up of the MM Challenge, of which quite
> some ideas made it into FastMM4 (and vice versa).

Always a good thing.

>> I've updated the mutex to include a spin lock, with
>> TryEnterCriticalSection instead of InterlockedCompareExchange
>
> The Critical Section *really* was a performance killer in the MM.
> If you can, get rid of it...

I really want to avoid the 'busy' waiting because I have many more threads than CPUs (a bit of a scalability issue, but not one I want to fix today :)
So I need at least the WaitFor*Object calls.

Also, keeping track of a recursion count means I need to get the current thread ID and more such stuff, so it would require some work to rewrite,
but I will definitely look into it.

>
>> I would suggest
>>
>> if InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 then
>> begin
>> i := cLimit;
>> while (i > 0) and (InterLockedCompareExchange(FSpinLock, 1, 0) <> 0)
>> do
>> Dec(i);
>> if i = 0 then
>> while InterLockedCompareExchange(FSpinLock, 1, 0) <> 0 do
>> SwitchToThread; // explicitly switch to another thread
>> end;
>
> Even a single InterLockedCompareExchange instruction can be relatively
> expensive compared to the rest of the code, especially in a high
> contention scenario.

Surely it is less expensive than a Sleep(0)?

If cLimit is low enough that the first loop is less expensive than a context switch, then it is probably worth it.

For a single-CPU machine cLimit needs to be 0, so there it would be useless.

> So I'm not convinced that your more complex code will perform better than
> the very simple code used both in FastMM4 and BucketMM.
> Only the benchmark can tell...

I'll try my own benchmarks for a bit first.
I'm willing to benchmark this properly as well; how do I do that?

Greets, Bart


John O'Harrow

Dec 18, 2006, 2:54:55 PM
Listed below is the SpinLock unit I now use everywhere instead of critical sections. It is essentially a BASM version of Robert's code.

unit SpinLock;

interface

{$IFDEF VER170}
  {$DEFINE Inline}
  {$WARN UNSAFE_CODE OFF}
{$ENDIF}

{$IFDEF CONDITIONALEXPRESSIONS}
  {$IF CompilerVersion >= 18.0}
    {$DEFINE Inline}
    {$WARN UNSAFE_CODE OFF}
  {$IFEND}
{$ENDIF}

procedure EnterSpinLock(var SpinLock: LongBool);
function TryEnterSpinLock(var SpinLock: LongBool): Boolean;
procedure LeaveSpinLock(var SpinLock: LongBool); {$IFDEF Inline} inline; {$ENDIF}

implementation

uses
  Windows;

procedure EnterSpinLock(var SpinLock: LongBool);
asm
  mov ecx, eax
  xor eax, eax
  mov edx, 1
  lock cmpxchg [ecx], edx
  jz @@Exit                 {Exit on Successful Exchange}
  xor eax, eax
  push ecx
@@Loop:
  push eax
  call Sleep                {Sleep(0) for First Loop then Sleep(1) for Remainder}
  mov ecx, [esp]
  xor eax, eax
  mov edx, 1
  lock cmpxchg [ecx], edx
  jnz @@Loop
  pop ecx
@@Exit:
end; {EnterSpinLock}

function TryEnterSpinLock(var SpinLock: LongBool): Boolean;
asm
  mov ecx, eax
  xor eax, eax
  mov edx, 1
  lock cmpxchg [ecx], edx
  setz al
end; {TryEnterSpinLock}

procedure LeaveSpinLock(var SpinLock: LongBool);
begin
  SpinLock := False;
end; {LeaveSpinLock}

end.

--
regards,
John

The Fastcode Project:
http://www.fastcodeproject.org/

> Robert
>
>


Charalabos Michael

Dec 18, 2006, 4:38:02 PM
Hello John,

> unit SpinLock;

Do you have a Pascal version too?

Thank you

--
Charalabos Michael - [Creation Power] - http://www.creationpower.com -
http://www.creationpower.gr

Bart van der Werf

Dec 18, 2006, 4:13:58 PM
John,

> Listed below is my Spinlock unit I use everywhre now instead of using
> critical sections. It is essentially a BASM version of Roberts code

Great, I timed it against mine and it does indeed outperform mine; it uses a bit less CPU too.

I'll wrap it in something that uses fs:[0x24] and has a recursion count, and then it will be usable as a TCriticalSection replacement.

The magic is in the Sleep(0) and Sleep(1), I guess.

It does the same thing as what I try to do with the event: avoid the FIFO nature of the standard critical section, so that you don't get lock convoys and lockstep context switching, except this is obviously cheaper.

Sleep(0) to force a context switch, Sleep(1) to put the thread at the back of the queue for a while. I'm still wondering if adding a pure spin with no sleep for some count based on the number of cores could add something.

Is the Sleep(1) treated specially? Because I would expect that the WaitForSingleObject autoreset event would have less latency, and also be less expensive if you have a lot of threads.
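The combined scheme being discussed here (a bounded busy spin first, then fall back to yield-and-sleep) can be sketched in C as follows. This is a hedged illustration only, not code from the thread; SPIN_LIMIT and the function names are made up, with C11 atomics and POSIX calls standing in for the Win32 ones:

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define SPIN_LIMIT 4000   /* roughly the value Bart settles on later for a dual core */

static bool try_lock(atomic_int *lock) {
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

static void acquire(atomic_int *lock) {
    if (try_lock(lock))
        return;
    /* Phase 1: busy spin, hoping the holder leaves within a few microseconds. */
    for (int spin = SPIN_LIMIT; spin > 0; --spin)
        if (try_lock(lock))
            return;
    /* Phase 2: the holder is apparently parked; stop burning CPU. */
    sched_yield();                           /* ~Sleep(0) */
    while (!try_lock(lock)) {
        struct timespec ts = {0, 1000000};
        nanosleep(&ts, NULL);                /* ~Sleep(1) */
    }
}

static void release(atomic_int *lock) {
    atomic_store(lock, 0);
}
```

The design tradeoff is exactly the one debated in this thread: phase 1 wins when critical sections are shorter than a context switch, phase 2 keeps the lock from hogging a CPU when they are not.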


greet, Bart


Bart van der Werf

Dec 18, 2006, 4:40:07 PM
This is fun :)

changing my code to:

procedure TBWOpertunisticMutex.Acquire;
var
  Spin: Integer;
begin
  if not TryEnterCriticalSection(FSection) then
  begin
    Spin := cSpinLimit;
    while (Spin > 0) and not TryEnterCriticalSection(FSection) do
      Dec(Spin);
    if Spin = 0 then
    begin
      Sleep(0);
      while not TryEnterCriticalSection(FSection) do
        Sleep(1);
    end;
  end;
end;

on a Yonah Core Duo T2300 I get:

Mine:
2 threads: 2122
8 threads: 7135

John's:
2 threads: 3134
8 threads: 12615

So it definitely is worth adding a pure spin in there if you have multiple cores, at least for my testcase; I'll try some more patterns.

After some fiddling I found 4k to be a good number for cSpinLimit, which seems to make sense, as that is also what the Windows heap allocator seems to use.
A higher value might make sense for 4+ CPU machines.

Simply not calling Sleep, as FastMM does for 4+ CPUs, might be asking for deadlocks if more than 4 threads are active.

Bart van der Werf

Dec 18, 2006, 4:48:36 PM
A lot seems to depend on what pattern the locking takes.

for I := 1 to 1000000 do
begin
  for J := 1 to 100 do;
  FMutex.Acquire;
  for J := 1 to 100 do;
  FMutex.Release;
end;

is what I use, a 50/50 scenario, and I'm assuming that the empty loop is a good representation of a typical short-locked piece of code.

Your lock performs better if the inner loop takes longer than the spin, but is short enough that spinning would be a significant waste of CPU.


Bart van der Werf

Dec 19, 2006, 6:13:44 AM
Under Win2k3 Server the difference between a normal critical section and this style is luckily much smaller.


Bart van der Werf

Dec 19, 2006, 8:36:50 AM
After testing on some more machines I decided on the following.
It is much simpler, has no handle, performs better than the default lock, and fixes the performance problems that exist on XP and earlier with lock convoys.

Interestingly, all this is much better on Win2003 and later.

unit BWOpertunisticMutex;

interface

uses
  SyncObjs, Windows;

type
  TBWOpertunisticMutex = class(TSynchroObject)
  private
    FSection: TRTLCriticalSection;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Acquire; override;
    procedure Release; override;
    function TryAcquire: Boolean;
  end;

implementation

procedure TBWOpertunisticMutex.Acquire;
begin
  if not TryEnterCriticalSection(FSection) then
  begin
    Sleep(0);
    while not TryEnterCriticalSection(FSection) do
      Sleep(1);
  end;
end;

constructor TBWOpertunisticMutex.Create;
begin
  inherited Create;
  InitializeCriticalSection(FSection);
end;

destructor TBWOpertunisticMutex.Destroy;
begin
  DeleteCriticalSection(FSection);
  inherited;
end;

procedure TBWOpertunisticMutex.Release;
begin
  LeaveCriticalSection(FSection);
end;

function TBWOpertunisticMutex.TryAcquire: Boolean;
begin
  Result := TryEnterCriticalSection(FSection);
end;

end.
Robert Houdart

Dec 19, 2006, 9:13:44 AM
Bart,

Elegant and interesting!
Note that the Sleep(0) / Sleep(1) combination used in BucketMM is not the
only possibility.
In FastMM4 Pierre Le Riche uses a different Sleep(0) / Sleep(10)
combination; it would be interesting to know why he's chosen these values.

Can you publish some benchmark results?
How much is your OpportunisticMutex faster than the standard Critical
Section?
How are the results compared to the full spinlock approach?

Maybe you should continue this discussion in the Fastcode group.

Robert


John O'Harrow

Dec 19, 2006, 10:07:38 AM
"Robert Houdart" <rob...@no.com> wrote in message
news:4587...@newsgroups.borland.com...

>
> Maybe you should continue this discussion in the Fastcode group.


Fastcode group? You mean BASM group ;-)

Bart van der Werf

Dec 19, 2006, 10:34:46 AM

"Robert Houdart" <rob...@no.com> wrote in message
news:4587...@newsgroups.borland.com...
> Bart,
>
> Elegant and interesting!
> Note that the Sleep(0) / Sleep(1) combination used in BucketMM is not the
> only possibility.
> In FastMM4 Pierre Le Riche uses a different Sleep(0) / Sleep(10)
> combination; it would be interesting to know why he's chosen these values.

The sleep seems to be a great way to make it work, far more successful than my attempt with the event.
Adding a spin with no wait for a limit N does significantly improve performance for some special cases on my dual core, but in the general case the code John provided is a lot better.
Interestingly enough, on a quad core the no-wait spin actually reduced performance somewhat, something I don't yet fully grasp; I expected the inverse.
For optimal performance you would need powers of prediction, to know whether a busy spin will get you the lock earlier than the sleep and you're not just wasting time.
Possibly adding some "pause" (rep nop) instructions in the loop might reduce the memory-bus load of the interlocked cmpxchg, so that it doesn't disturb the other CPUs as much.
The Sleep(1) is especially interesting, because basically you want to yield but also give low-priority stuff time and lower the CPU load, yet still be back as soon as the lock is free.

> Can you publish some benchmark results?

I guess we need to define a sensible benchmark first?

> How much is your OpportunisticMutex faster than the standard Critical
> Section?

Even in my own benchmark it depends: from about the same in the case where no contention exists, to a lot faster (5-10x) with a lot of contention on XP/Core Duo and earlier machines; on 2k3/quad Xeon and later it is at most 2x.

> How are the results compared to the full spinlock approach?

John's is a little faster in all cases (about a constant 5%), except a special case where my lock plus busy spinning can be up to 50% faster on my Core Duo.
However, John's lock does not support nesting at the moment, something I need.

A simple full busy-spin lock is something I haven't tried; I would expect pretty bad performance on the quad core, because three CPUs would be busy spamming the memory bus with their interlocked cmpxchg, and starvation and even livelocks could result pretty easily.


Krisztian Pinter

Dec 19, 2006, 10:58:06 AM
On Tue, 19 Dec 2006 16:34:46 +0100, Bart van der Werf <blue...@xs4all.nl>
wrote:


> The sleep seems to be a great way to make it working, far more
> successfull
> then my attempt with the event.

It raises a question though: what if the situation is a ping-pong pattern, that is, two threads have to alternately do a very small piece of work? Introducing a Sleep(1) can limit the performance to 1000 switches per second.

Not that I want to imply it is a usual situation, or that it would be a wise way to do things, but it is theoretically possible.

Robert Houdart

Dec 19, 2006, 11:17:18 AM
Krisztian,

> it raises a question though. what if the situation is a ping-pong pattern,
> that is,
> the two threads have to alternately do a very small piece of work.
> introducung a
> sleep 1 can limit the performance to 1000 switches per second.

It would be very bad software design to do that kind of ping-pong pattern using separate threads.
Threads should usually run quite independently; that's the whole point of
asynchronous operation.

Robert


Robert Houdart

Dec 19, 2006, 11:11:44 AM
Bart,

Thanks for the results.

> the Sleep(1) is especially interresting, because basicly you want to do a
> yield but also give low priority stuff time, and lower cpu load, but still
> be back as soon as the lock is free.

Indeed.

> Even in my own benchmark it depends, about the same in the case that no
> contention exsists, to alot fast (x5-10) with alot of contention on
> XP/coreduo and earlier machines, on 2k3/quadxeon and later it is (x2) at
> most.

Excellent!
The nice thing is that you have developed a very simple and generic
replacement for TCriticalSection with an interesting performance boost.

Robert


Bart van der Werf

Dec 19, 2006, 11:17:57 AM

"Krisztian Pinter,

>> The sleep seems to be a great way to make it working, far more
>> successfull
>> then my attempt with the event.
>
> it raises a question though. what if the situation is a ping-pong pattern,
> that is,
> the two threads have to alternately do a very small piece of work.
> introducung a
> sleep 1 can limit the performance to 1000 switches per second.
>
> not that i want to imply it is a usual situation, nor it would be a wise
> way to do
> things, but theoretically possible.

Yes, a busy spin wait could solve this somewhat for multi-CPU machines: keep the thread around and actively retry, to see if the lock comes back very fast.
However, doing this right is something I'm still trying to figure out.
The spin limit should be high enough to keep the thread around to pick up the lock when the other thread leaves it again, but short enough that it is not a big useless CPU hog.
Also, simply busy-looping on the interlocked cmpxchg might have a negative effect on the other cores, because you are using the memory bus a lot.
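The "pause" idea mentioned earlier is usually paired with a test-and-test-and-set loop: spin on a plain read (which stays in the local cache and generates no bus-locking traffic) and attempt the interlocked exchange only once the lock looks free. A hedged C sketch follows; the intrinsic name is compiler-specific (_mm_pause is the x86 one) and the function names are made up for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>              /* _mm_pause, the "pause / rep nop" hint */
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)
#endif

static bool try_lock(atomic_int *lock) {
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

/* Bounded test-and-test-and-set spin; returns true if the lock was taken. */
static bool spin_acquire(atomic_int *lock, int limit) {
    while (limit-- > 0) {
        /* Read-only test first: no interlocked traffic while the lock is held. */
        if (atomic_load_explicit(lock, memory_order_relaxed) == 0 && try_lock(lock))
            return true;
        cpu_relax();   /* tell the CPU this is a spin-wait loop */
    }
    return false;
}
```

On failure the caller would fall back to parking the thread (the Sleep(0)/Sleep(1) or event-wait schemes from this thread), which keeps the bounded spin from turning into a CPU hog.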


Lord Crc

Dec 19, 2006, 2:32:15 PM
On Tue, 19 Dec 2006 14:36:50 +0100, "Bart van der Werf"
<blue...@xs4all.nl> wrote:

>After testing on some more machines i decided on the following
>it is much simpler, has no handle, it performs better then the default lock,
>and fixes the performance problems that exsist in xp and earlier with lock
>convoys.

Hmm, I tried using it in my ray-tracer, which has a shared data structure that it queries very often and occasionally writes to (writing takes a lot of time).

Here's the rendering times:

Threads: 2 8
CS: 52 sec 50 sec
OM: 89 sec 71 sec

I guess the spinlock is nice if there's very little work to be done
inside the locked code.

- Asbjørn

Bart van der Werf

Dec 19, 2006, 2:46:18 PM

"Lord Crc"

> Hmm, I tried using it in my ray-tracer, which has a shared data
> structure which it queries very often, and occationally writes to
> (writing takes a lot of time).

The reads are short and locked with the critical section?

> Here's the rendering times:
>
> Threads: 2 8
> CS: 52 sec 50 sec
> OM: 89 sec 71 sec

That is a lot worse :(
What is your usage scenario? This is the first one I've seen where the original lock is better; I want to add this to the testcases :)
You were using the normal critical section before?
What kind of CPU do you have?
Are you on Windows XP?

> I guess the spinlock is nice if there's very little work to be done
> inside the locked code.

Yes, I still want to include a busy spinlock if I can find a way to reduce its negative effects for some usecases.
The parent post does not include a busy spinlock, btw, because I can't get it to produce positive results for all testcases/machines.


Lord Crc

Dec 19, 2006, 3:22:11 PM
On Tue, 19 Dec 2006 20:46:18 +0100, "Bart van der Werf"
<blue...@xs4all.nl> wrote:

>> Here's the rendering times:
>>
>> Threads: 2 8
>> CS: 52 sec 50 sec
>> OM: 89 sec 71 sec
>
>That is alot worse :(
>what is your usage scenario, this is the first one that i've seen where the
>original lock is better, i want to add this to the testcases :)

OK, the data structure is a cache. The worker checks the cache for an existing value; if the value already exists, it is returned and the worker uses it. This involves walking down a tree, locating the appropriate node, and returning it. The tree is locked during this.

If an appropriate value is not found, the worker computes a new value
and inserts it into the tree. The tree is locked during the insertion.

So the worker code goes something like this:

data := Cache.locate(x);
if not assigned(data) then
begin
  // this is an expensive operation
  data := TData.Create(x);

  Cache.insert(data);
end;

And the cache code goes something like this:

function TCache.locate(x): TData;
begin
  Lock.Acquire;
  try
    Result := FRoot.locate(x);
  finally
    Lock.Release;
  end;
end;

procedure TCache.insert(data);
begin
  Lock.Acquire;
  try
    FRoot.insert(data);
  finally
    Lock.Release;
  end;
end;

Looking at the task manager, I found that when using critical sections the program uses ~100% CPU, but only 60% when using the OM. Using the (excellent) sampling profiler from Eric gave a hint: the CS version spends 30% of the time in the kernel, the OM over 60%.

>You we're using the normal criticalsection before ?

Yes, TCriticalSection (which made it very smooth to switch to your
code, yay for OOP :)

>what kind of cpu do you have ?

AMD X2 3800+

>are you on windows xp ?

Yes, XP (32bit) SP2

>> I guess the spinlock is nice if there's very little work to be done
>> inside the locked code.
>
>Yes i still want to include a busy spinlock if i can find a way to reduce
>its negative effects for some usecases.
>The parent post does not include a busy spinlock btw, because i can't get it
>to produce postive results for all tescases/machines.

If it helps, a friend of mine had made a spinlock some time ago, with
the same results for me :)

I'll dig some more...

- Asbjørn

Bart van der Werf

Dec 20, 2006, 5:48:47 AM

"Robert Houdart" <rob...@no.com> wrote in message
news:4587...@newsgroups.borland.com...
> Bart,
>
> Elegant and interesting!
> Note that the Sleep(0) / Sleep(1) combination used in BucketMM is not the
> only possibility.
> In FastMM4 Pierre Le Riche uses a different Sleep(0) / Sleep(10)
> combination; it would be interesting to know why he's chosen these values.
>
> Can you publish some benchmark results?

This is with the OM with a busy spinlock added, with a limit of 1000 loops.

With the 100/100 scenario, 8 threads:

P4 hyperthreaded, WinXP:

CS: 9976
OM: 2605
John: 1527

Quad Xeon, Win2k3:

CS: 3189
OM: 3641
John: 1169


(Note: on Win2k3 and later the biggest problem with the critical section is fixed.)

The 8-thread worker:

var
  I, J: Integer;
begin
  for I := 1 to 1000000 do
  begin
    for J := 1 to 1000 do;
    FMutex.Acquire;
    for J := 1 to 100 do;
    FMutex.Release;
  end;
end;

1000/100 scenario:

P4 HT, WinXP:

CS: 11072
OM: 5841
John: 5819

Quad Xeon, Win2k3 (five runs each):

CS: 6514, 5802, 4500, 5357, 6076
OM: 3528, 2900, 3603, 3817, 3609
John: 6084, 6427, 6748, 5310, 6514

1000/1 scenario:

P4 HT, XP:

CS: 6333
OM: 3894
John: 5070

Quad Xeon, Win2k3 (two runs each):

CS: 4400, 4641
OM: 4166, 3390
John: 4695, 5048


So basically it totally depends on the hardware, OS, and usecase which lock is the best.

With high contention and low concurrency John's is clearly the best, which is logical because the threads are parked efficiently.
With medium contention and high concurrency I think the OM is best, because the spinlock doesn't park the thread immediately, so the contended area doesn't impact the big concurrent area as much.


Bart van der Werf

Dec 20, 2006, 6:00:48 AM
Could you also try:

unit BWOpertunisticMutex;

interface

uses
  SyncObjs, Windows;

type
  TBWOpertunisticMutex = class(TSynchroObject)
  private
    FSection: TRTLCriticalSection;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Acquire; override;
    procedure Release; override;
    function TryAcquire: Boolean;
  end;

implementation

const
  cSpinLimit = 1000;

procedure TBWOpertunisticMutex.Acquire;
var
  Spin: Integer;
begin
  if not TryEnterCriticalSection(FSection) then
  begin
    Spin := cSpinLimit;
    while (Spin > 0) and not TryEnterCriticalSection(FSection) do
      Dec(Spin);
    if Spin = 0 then
    begin
      Sleep(0);
      while not TryEnterCriticalSection(FSection) do
        Sleep(1);
    end;
  end;
end;
Robert Houdart

Dec 20, 2006, 12:54:58 PM
Thanks for sharing the results.

Robert


Lord Crc

Dec 20, 2006, 3:12:13 PM
On Wed, 20 Dec 2006 12:00:48 +0100, "Bart van der Werf"
<blue...@xs4all.nl> wrote:

>Could you also try:

Nice, now it's almost identical to the regular CS:

       2 threads   4 threads
CS     52          50
OM1    83          83
OM2    52          50

However, during one of the runs with the new version I noticed some glitches in the output, which suggests that the cache was altered while another thread was performing a lookup. It was sporadic, so I haven't been able to determine if it's some changes I made to the renderer or your locking code. I'll investigate...

- Asbjørn

Lord Crc

Dec 20, 2006, 3:53:25 PM
On Wed, 20 Dec 2006 21:12:13 +0100, Lord Crc <lor...@hotmail.com>
wrote:

>It was sporadic, so I
>haven't been able to determine if it's some changes I made to the
>renderer or your locking code. I'll investigate...

Well it occurs with regular CS so it's my fault, whatever it is :D

- Asbjørn

Tommi Prami

Dec 21, 2006, 5:16:48 AM
Hello,

I was thinking: could this be designed along these lines?

type
  TBWOpertunisticMutex = class(TSynchroObject)
  private
    FSection: TRTLCriticalSection;
  public
    constructor Create;
    destructor Destroy; override;

    property Acquire: TMutexMethod read FAcquire;
    property Release: TMutexMethod read FRelease;
    property TryAcquire: TMutexFunction read FTryAcquire;
  end;

And the constructor could take some user hints like omMultiWriteSingleReader, omMultiReaderSingleWriter, etc...

This way it could morph itself depending on the use case and the number of cores on the system.

Any ideas on this?


-TP-

Bart van der Werf

Dec 21, 2006, 6:10:57 AM

"Tommi Prami" <tommi.nos...@poista.ecomond.com> wrote in message
news:458a5f54$1...@newsgroups.borland.com...
> Hello,

> And Constructor could have some user hints Like omMultiWriteSingleReader,
> omMoltioReaderSingleWriter etc...
>
> This way it could Morph it self depending on case of use and Number of
> Cores on system.
>
> Any idea on This???

Using a factory to produce the right TSynchroObject class is something I already use in my own code.
Making one class do everything is often a bad idea, but having several derived implementations of a base type works pretty well.


Lord Crc

Dec 21, 2006, 7:00:12 AM
On Wed, 20 Dec 2006 21:53:25 +0100, Lord Crc <lor...@hotmail.com>
wrote:

>Well it occurs with regular CS so it's my fault, whatever it is :D

Yep, found it and fixed it :)

- Asbjørn
